Flash-MoE Inference Engine

Skill by

ara.so

— Daily 2026 Skills collection.

Flash-MoE is a pure C/Objective-C/Metal inference engine that runs

Qwen3.5-397B-A17B

(397B parameter Mixture-of-Experts) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. It streams 209GB of expert weights from NVMe SSD on demand — no Python, no ML frameworks, just C, Objective-C, and hand-tuned Metal shaders.

Requirements

Hardware

Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free

OS

macOS 26+ (Darwin 25+)

Tools

Xcode Command Line Tools, Python 3.x (for weight extraction only)
Model: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace) Installation & Build

Clone the repo

git clone https://github.com/danveloper/flash-moe cd flash-moe/metal_infer

Build everything

make

Verify build artifacts

ls infer chat main The Makefile compiles infer.m , chat.m , main.m with Metal shader compilation for shaders.metal . Weight Preparation Step 1: Extract non-expert weights

From the metal_infer/ directory

Point to your downloaded Qwen3.5-397B safetensors directory

python3 extract_weights.py /path/to/Qwen3.5-397B-A17B-Instruct/

Produces:

model_weights.bin (~5.5GB, mmap'd at runtime)

model_weights.json (tensor manifest)

vocab.bin (vocabulary)

tokenizer.bin (BPE tokenizer data)

Step 2: Pack expert weights (4-bit, production)

From repo root

python3 repack_experts.py /path/to/Qwen3.5-397B-A17B-Instruct/ metal_infer/packed_experts/

Produces packed_experts/ directory (~209GB)

Each expert is a separate file: layer_XX_expert_YYYY.bin

Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling)

Convert 4-bit experts to 2-bit (saves ~89GB, 120GB total)

python3 metal_infer/repack_experts_2bit.py \ metal_infer/packed_experts/ \ metal_infer/packed_experts_2bit/ Key Commands Basic inference cd metal_infer

4-bit inference (production quality, tool calling works)

./infer --prompt "Explain quantum computing" --tokens 100

2-bit inference (faster, breaks JSON/tool calling)

./infer --prompt "Explain quantum computing" --tokens 100 --2bit

Per-layer timing breakdown

./infer --prompt "Hello" --tokens 20 --timing Interactive chat with tool calling ./chat

Opens TUI with full tool calling support

Uses 4-bit experts by default

MoE-only benchmark (measures expert throughput) ./main

Runs pure expert forward-pass benchmark

Reports tokens/sec without attention overhead

Project Structure
flash-moe/
├── paper/
│ └── flash_moe.pdf # Full technical paper
├── metal_infer/
│ ├── infer.m # Complete inference engine (~7000 lines)
│ ├── shaders.metal # Metal compute kernels (~1200 lines)
│ ├── chat.m # Interactive chat TUI
│ ├── tokenizer.h # Single-header C BPE tokenizer (449 lines)
│ ├── main.m # MoE-only benchmark
│ ├── Makefile
│ ├── extract_weights.py # Safetensors → model_weights.bin
│ ├── repack_experts_2bit.py # 4-bit → 2-bit requantization
│ ├── train_predictor.py # Expert routing prediction analysis
│ ├── model_weights.bin # Non-expert weights (mmap'd)
│ ├── model_weights.json # Tensor manifest
│ ├── vocab.bin
│ ├── tokenizer.bin
│ ├── packed_experts/ # 4-bit expert files (209GB)
│ └── packed_experts_2bit/ # 2-bit expert files (120GB, optional)
├── repack_experts.py # 4-bit expert packing from safetensors
├── progress.py # Results visualization
└── results.tsv # Experiment log
Architecture Overview
The model has
60 transformer layers
:
45 GatedDeltaNet (linear attention) layers
15 standard full attention layers
Each layer: 512 experts, K=4 activated per token + 1 shared expert
Hidden dimension: 4096
Per-layer pipeline (4.28ms average at 4-bit)
CMD3(prev) → CMD1: attention projections + delta-net [1.22ms GPU]
→ CPU: flush results [0.01ms CPU]
→ CMD2: o_proj + norm + routing + shared [0.55ms GPU]
→ CPU: softmax + topK routing [0.003ms]
→ I/O: parallel pread K=4 experts [2.41ms SSD]
→ CMD3: expert forward + combine + norm [0.04ms encode, DEFERRED]
Metal Shader Kernels
The
shaders.metal
file contains hand-written kernels. Key kernels:
// 4-bit dequantized matrix-vector multiply (FMA-optimized)
// Key insight: fma(nibble, scalex, biasx) instead of (nibblescale + bias)x
// Pre-compute scalex and biasx to fuse dequant+multiply in one FMA instruction
kernel void matvec_4bit_fma(
device const uint8_t* weights [[buffer(0)]],
device const float* scales [[buffer(1)]],
device const float* biases [[buffer(2)]],
device const float* x [[buffer(3)]],
device float* out [[buffer(4)]],
uint tid [[thread_position_in_threadgroup]],
uint gid [[threadgroup_position_in_grid]])
{
// ... tiled SIMD-reduced FMA kernel
// 12% faster than naive (nibble * scale + bias) * x
}
// Fused SwiGLU activation
kernel void swiglu(device float* gate [[buffer(0)]],
device const float* up [[buffer(1)]],
uint gid [[thread_position_in_grid]])
{
float g = gate[gid];
gate[gid] = (g / (1.0f + exp(-g))) * up[gid];
}
// RMS normalization (two-pass)
kernel void rms_norm_pass1(...) // sum of squares reduction
kernel void rms_norm_pass2(...) // apply normalization
// GPU RoPE (fused with Q deinterleave and K normalization)
kernel void rope_qk(...)
// MoE combine + residual + sigmoid gate (fused)
kernel void moe_combine_residual(...)
SSD Expert Streaming Pattern
The core innovation — loading only K=4 active experts per layer from SSD:
// Parallel expert loading using GCD dispatch groups
// From infer.m (conceptual pattern)
dispatch_group_t group
=
dispatch_group_create
(
)
;
dispatch_queue_t ioQueue
=
dispatch_get_global_queue
(
QOS_CLASS_USER_INITIATED
,
0
)
;
for
(
int
k
=
0
;
k
<
K_EXPERTS
;
k
++
)
{
int
expert_id
=
top_k_indices
[
k
]
;
dispatch_group_async
(
group
,
ioQueue
,
^
{
// Each expert: ~6.75MB at 4-bit
char
path
[
256
]
;
snprintf
(
path
,
sizeof
(
path
)
,
"packed_experts/layer_%02d_expert_%04d.bin"
,
layer
,
expert_id
)
;
int
fd
=
open
(
path
,
O_RDONLY
)
;
// pread() — non-blocking, OS page cache handles LRU
pread
(
fd
,
expert_buffer
[
k
]
,
expert_size
,
0
)
;
close
(
fd
)
;
}
)
;
}
dispatch_group_wait
(
group
,
DISPATCH_TIME_FOREVER
)
;
// GPU compute follows — serial pipeline is hardware-optimal on Apple Silicon
Why
pread()
not
mmap(): mmap incurs per-page fault overhead on cold data (~5x slower). Direct pread() with OS page cache achieves ~71% hit rate naturally. GatedDeltaNet Linear Attention (BLAS) The recurrence update uses Accelerate BLAS — 64% faster than scalar: // GatedDeltaNet state update per head (conceptual pattern) // state: 128×128 float matrix, 64 heads // From infer.m

import

<

Accelerate

/

Accelerate

.

h

>

for

(

int

h

=

0

;

h

<

64

;

h

++

)

{

float

*

S

=

state

+

h

*

128

*

128

;

// 128×128 state matrix

float

*

q

=

Q

+

h

*

128

;

float

*

k

=

K

+

h

*

128

;

float

*

v

=

V

+

h

*

128

;

// β·(k⊗v) outer product update

// cblas_sger: S += beta * (k ⊗ v)

cblas_sger

(

CblasRowMajor

,

128

,

128

,

beta

[

h

]

,

k

,

1

,

v

,

1

,

S

,

128

)

;

// Decay: S = alpha * S

cblas_sscal

(

128

*

128

,

alpha

[

h

]

,

S

,

1

)

;

// Output: o = S @ q

cblas_sgemv

(

CblasRowMajor

,

CblasNoTrans

,

128

,

128

,

1.0f

,

S

,

128

,

q

,

1

,

0.0f

,

output

+

h

*

128

,

1

)

;

}

Performance Configuration

4-bit (production default)

Quality

Excellent — full tool calling, correct JSON

Speed

4.36 tok/s

Disk

209GB

2-bit (speed testing only)

Quality

Good — but breaks JSON/tool calling (

\name\

instead of

"name"

)

Speed

5.74 tok/s (7.05 peak single token with warm cache)

Disk

120GB
Uses
F_NOCACHE
flag to avoid page cache thrashing
What NOT to Try (Learned from 58 Experiments)
Approach
Why it fails
mmap()
expert files
Per-page fault overhead: 5x slower than
pread()
dispatch_io
dispatch_data
management overhead: -70%
F_RDADVISE
prefetch
SSD DMA + GPU share memory controller — concurrent access: -73% GPU speed
Custom Metal LRU cache
GPU memory pressure: -38% vs OS page cache
LZ4 expert compression
Decompress overhead > warm cache savings: -13%
Temporal expert prediction
25% hit rate, wastes SSD bandwidth: -18%
Speculative early routing
Cache pollution: -38%
MTP speculative decoding
MoE I/O scales per-token (unlike dense models): break-even
Spin-poll GPU wait
CPU thermal throttle competes with GPU: -23%
Parallel SSD + GPU overlap
Unified memory controller arbitration: net negative
Key principle: On Apple Silicon, GPU DMA and SSD DMA share the same memory controller. The serial pipeline (GPU → SSD → GPU) is hardware-optimal. Troubleshooting Build fails

Ensure Xcode CLI tools are installed

xcode-select --install

Check Metal compiler is available

xcrun -sdk macosx metal --version Out of memory The engine is designed to use ~6GB active: 5.5GB: model_weights.bin (mmap'd, read-only) ~200MB: Metal scratch buffers Remaining ~42GB: OS page cache for expert data If you see OOM, check for other processes consuming unified memory: sudo memory_pressure vm_stat Slow performance

Check SSD speed — needs ~17GB/s for target performance

Run with timing to identify bottleneck

./infer --prompt "Hello" --tokens 5 --timing

Verify packed_experts/ is on internal SSD, not external drive

diskutil info / Wrong expert directory

Default paths expected by infer.m:

metal_infer/packed_experts/ (4-bit)

metal_infer/packed_experts_2bit/ (2-bit)

Ensure you're running from metal_infer/ directory

cd metal_infer ./infer --prompt "test" Tool calling broken Use 4-bit, not 2-bit. The 2-bit quantization corrupts quote characters in JSON output, making tool calling unreliable. Always use the default 4-bit configuration for agentic workloads. Memory Safety The engine explicitly manages all allocations: No unbounded caches Expert data never accumulates in GPU memory model_weights.bin is mmap'd read-only — kernel manages pages Expert files are opened/read/closed per inference step

flash-moe-inference

安装

Clone the repo

Build everything

Verify build artifacts

From the metal_infer/ directory

Point to your downloaded Qwen3.5-397B safetensors directory

Produces:

model_weights.bin (~5.5GB, mmap'd at runtime)

model_weights.json (tensor manifest)

vocab.bin (vocabulary)

tokenizer.bin (BPE tokenizer data)

From repo root

Produces packed_experts/ directory (~209GB)

Each expert is a separate file: layer_XX_expert_YYYY.bin

Convert 4-bit experts to 2-bit (saves ~89GB, 120GB total)

4-bit inference (production quality, tool calling works)

2-bit inference (faster, breaks JSON/tool calling)

Per-layer timing breakdown

Opens TUI with full tool calling support

Uses 4-bit experts by default

Runs pure expert forward-pass benchmark

Reports tokens/sec without attention overhead

Ensure Xcode CLI tools are installed

Check Metal compiler is available

Check SSD speed — needs ~17GB/s for target performance

Run with timing to identify bottleneck

Verify packed_experts/ is on internal SSD, not external drive

Default paths expected by infer.m:

metal_infer/packed_experts/ (4-bit)

metal_infer/packed_experts_2bit/ (2-bit)

Ensure you're running from metal_infer/ directory