What is Max LLM Size?
The largest language model this hardware can run with full GPU/unified-memory acceleration, at the specified quantization. Larger models need more memory, so this figure scales directly with VRAM or unified-memory capacity.
Full Explanation
Max LLM size is the largest model parameter count a given hardware configuration can run entirely in GPU VRAM or unified memory at Q4 quantization. Running "within VRAM" means every model layer is GPU-accelerated; exceeding this threshold forces CPU offloading with a significant speed penalty. The rule of thumb is approximate: at Q4 each parameter takes about half a byte, so model params (billions) × 0.5 GB ≈ memory needed for the weights, plus a little headroom for context (KV cache). A 16 GB card fits ~30B models; 12 GB fits ~22B; 8 GB fits ~13B.
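A minimal sketch of that rule of thumb in Python; the 1.5 GB headroom allowance for KV cache and runtime buffers is an illustrative assumption, not a measured value:

```python
def vram_needed_gb(params_billion: float, bits_per_weight: float = 4.0,
                   overhead_gb: float = 1.5) -> float:
    """Rough memory estimate: weight size at the given quantization,
    plus a flat allowance for KV cache and runtime buffers."""
    bytes_per_weight = bits_per_weight / 8          # Q4 -> 0.5 bytes per parameter
    weights_gb = params_billion * bytes_per_weight  # billions of params -> GB
    return weights_gb + overhead_gb

def max_llm_size_b(memory_gb: float, bits_per_weight: float = 4.0,
                   overhead_gb: float = 1.5) -> float:
    """Invert the estimate: largest parameter count that still fits."""
    return (memory_gb - overhead_gb) / (bits_per_weight / 8)

for memory_gb in (8, 12, 16, 24):
    print(f"{memory_gb} GB -> ~{max_llm_size_b(memory_gb):.0f}B params at Q4")
```

With that headroom the estimates land close to the figures above: ~13B at 8 GB, ~21B at 12 GB, ~29B at 16 GB.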
Why It Matters for Local AI
Max LLM size is a practical ceiling, not a hard limit. You can run larger models with layer offloading — they'll just be slower. For interactive use, staying within the max LLM size for your hardware is the difference between 30+ t/s and 3 t/s.
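The penalty is steep because token generation is largely memory-bandwidth-bound: every layer left in system RAM is read at system-RAM speed on each token. A back-of-the-envelope sketch under that assumption; the 672 GB/s GPU and 60 GB/s system-RAM bandwidths are illustrative, and PCIe transfer and compute costs are ignored:

```python
def decode_tps(model_gb: float, gpu_fraction: float,
               gpu_bw_gbps: float = 672.0, ram_bw_gbps: float = 60.0) -> float:
    """Very rough decode speed for a model split between GPU and CPU.

    Assumes decoding is memory-bandwidth-bound: each token reads all
    weights once, so time per token = GPU-resident bytes / GPU bandwidth
    + CPU-resident bytes / system-RAM bandwidth.
    """
    seconds_per_token = (model_gb * gpu_fraction / gpu_bw_gbps
                         + model_gb * (1 - gpu_fraction) / ram_bw_gbps)
    return 1 / seconds_per_token

model_gb = 20  # roughly a 40B model at Q4
print(f"fully on GPU:          {decode_tps(model_gb, 1.0):.1f} t/s")
print(f"half offloaded to CPU: {decode_tps(model_gb, 0.5):.1f} t/s")
```

In this toy model a 20 GB model drops from roughly 34 t/s fully on the GPU to about 5 t/s with half its weights in system RAM, which is the cliff described above.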
Hardware Relevant to Max LLM Size
GPU · 12 GB VRAM · 672 GB/s
Mini PC · 24 GB unified memory · 273 GB/s
GPU · 16 GB VRAM · 288 GB/s
Related Terms
VRAM→
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Unified Memory→
Apple Silicon uses a single pool of fast RAM shared between CPU and GPU. Larger unified memory = larger models run entirely at full bandwidth — no PCIe bottleneck.
Quantization→
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits = less VRAM required, with a slight quality reduction; a rough size comparison follows this list.
Tokens/s→
Tokens per second — the standard speed metric for LLMs. One token ≈ 0.75 words. Above 10 t/s feels interactive; below 5 t/s feels like watching paint dry.
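A rough illustration of how quantization trades precision for memory, using nominal bits per weight; real quantized formats store per-block scales, so actual files run slightly larger:

```python
BITS_PER_WEIGHT = {"Q4": 4, "Q8": 8, "FP16": 16}  # nominal, ignoring format overhead

def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB: params x bits per weight / 8."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"13B model at {quant:>4}: ~{weights_gb(13, quant):.1f} GB")
```

The same 13B model needs ~6.5 GB at Q4, ~13 GB at Q8, and ~26 GB at FP16, which is why Q4 is the baseline quantization for the max LLM size figures on this page.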