What is MoE?
Mixture of Experts — a model architecture where only a fraction of parameters activate per token. Enables very large parameter counts at lower inference cost (e.g., DeepSeek-V3, Mixtral).
Full Explanation
Mixture of Experts (MoE) splits a model's feed-forward layers into multiple "expert" sub-networks, routing each token through only a small, fixed number of them (2 of 8 in Mixtral; 8 of 256 in DeepSeek-V3) instead of the full set. This is how a 671B-parameter MoE model like DeepSeek-V3 activates only ~37B parameters per token, giving quality approaching its full 671B scale at roughly the per-token compute cost of a 37B dense model. However, all 671B parameters must still reside in memory, requiring massive VRAM or unified memory for full GPU acceleration.
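To make the routing concrete, here is a minimal top-k routing sketch in NumPy. The names (`moe_layer`, `gate_w`) are illustrative, not taken from any particular framework:

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Route one token through its top_k experts (illustrative sketch).

    x      : (d_model,) token hidden state
    gate_w : (d_model, n_experts) router weights
    experts: list of callables, each a small feed-forward network
    """
    logits = x @ gate_w                    # one router score per expert
    top = np.argsort(logits)[-top_k:]      # indices of the k best experts
    w = np.exp(logits[top])
    w /= w.sum()                           # softmax over the chosen experts only
    # Only top_k expert networks execute; the rest of the parameters
    # sit idle in memory, which is the whole MoE trade-off.
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# Toy usage: 8 experts, 2 active per token (Mixtral-style top-2 routing).
rng = np.random.default_rng(0)
d, n = 64, 8
gate_w = rng.normal(size=(d, n))
experts = [lambda v, W=rng.normal(size=(d, d)) * 0.1: np.tanh(v @ W)
           for _ in range(n)]
y = moe_layer(rng.normal(size=d), gate_w, experts, top_k=2)
```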
Why It Matters for Local AI
MoE models are memory-hungry but compute-efficient. A 46B-parameter MoE model like Mixtral 8x7B needs ~48 GB of memory at 8-bit quantization for full acceleration, making the Mac Mini M4 Pro with 48 GB unified memory one of the few sub-$2,000 systems that can hold it. Smaller models (e.g., the DeepSeek-R1 distills, which are dense rather than MoE) are far more accessible.
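A quick back-of-the-envelope check on that ~48 GB figure. This is a sketch using the simple weights-only formula params × bits / 8; the helper name is mine, and real runtimes add KV cache and activation overhead on top:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights-only footprint: parameters x bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8

# Mixtral 8x7B has ~46.7B total parameters.
for label, bits in [("Q4", 4), ("Q8", 8), ("FP16", 16)]:
    print(f"Mixtral 8x7B @ {label}: ~{weight_memory_gb(46.7, bits):.0f} GB")
# Q8 gives ~47 GB of weights, which lands at the ~48 GB figure above
# once a little cache and runtime overhead is added.
```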
Hardware Relevant to MoE
mini-pc · 24 GB Unified · 273 GB/s
gpu · 16 GB VRAM · 960 GB/s
Related Terms
Quantization
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits = less VRAM required, with a slight quality reduction.
VRAM
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Unified Memory
Apple Silicon uses a single pool of fast RAM shared between CPU and GPU. Larger unified memory = larger models run entirely at full bandwidth — no PCIe bottleneck.
Max LLM Size
The largest language model this hardware can run with full GPU/unified-memory acceleration, at the specified quantization. Larger models require more memory.
KV Cache
Key-Value Cache — stores intermediate attention computations so the model doesn't re-process earlier context on each new token. Larger context = larger KV cache = more VRAM needed.
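A rough sizing sketch for the KV Cache entry above, using the standard formula 2 × layers × KV heads × head dim × context length × bytes per element. The model shape below is an assumption in the style of a 7B Llama, not a spec sheet:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size: 2 (K and V) x layers x KV heads x head dim
    x context length x bytes per element (2 for FP16)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128.
for ctx in (4_096, 32_768):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gb(32, 32, 128, ctx):.1f} GB")
# Output: ~2.1 GB at 4K context vs ~17.2 GB at 32K -- the cache grows
# linearly with context, which is why long contexts demand so much VRAM.
```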