Performance & Benchmarks

What is Max LLM Size?

The largest language model this hardware can run with full GPU/unified-memory acceleration, at the specified quantization. Larger models require more memory.

Full Explanation

Max LLM size indicates the largest model parameter count a given hardware configuration can run entirely in GPU VRAM or unified memory at Q4 quantization. Running "within VRAM" means every model layer is GPU-accelerated; exceeding this threshold forces some layers onto the CPU, with a significant speed penalty. The rule of thumb is approximate: model parameters (in billions) × 0.5 GB ≈ VRAM needed at Q4, since Q4 stores roughly 4 bits (half a byte) per weight. This covers the weights only, so leave some headroom for the KV cache and context. By that rule, a 16 GB card fits ~30B models, 12 GB fits ~22B, and 8 GB fits ~13B.
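As a rough illustration of that rule of thumb, here is a short Python sketch. The 0.5 GB-per-billion-parameters factor and the card/model pairings come straight from the formula above; the function names and the extra 70B-on-24-GB case are illustrative choices, not measurements.

    # Rule of thumb from above: Q4 stores ~4 bits (0.5 bytes) per weight,
    # so weights need roughly 0.5 GB per billion parameters.
    # KV cache and context overhead are NOT included in this estimate.
    GB_PER_BILLION_PARAMS_Q4 = 0.5

    def q4_vram_gb(params_billions: float) -> float:
        """Approximate VRAM needed just to hold the weights at Q4."""
        return params_billions * GB_PER_BILLION_PARAMS_Q4

    def fits_in_vram(params_billions: float, vram_gb: float) -> bool:
        """True if the whole model should fit without CPU offloading."""
        return q4_vram_gb(params_billions) <= vram_gb

    # The pairings from the text (16 GB ~ 30B, 12 GB ~ 22B, 8 GB ~ 13B),
    # plus a 70B model on 24 GB to show a case that does not fit.
    for params, vram in [(30, 16), (22, 12), (13, 8), (70, 24)]:
        need = q4_vram_gb(params)
        print(f"{params}B on {vram} GB: needs ~{need:.1f} GB, fits: {fits_in_vram(params, vram)}")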

Why It Matters for Local AI

Max LLM size is a practical ceiling, not a hard limit. You can run larger models by offloading some layers to the CPU; they'll just be slower. For interactive use, staying within the max LLM size for your hardware is the difference between 30+ tokens per second (t/s) and 3 t/s.
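As one concrete way to see that trade-off, the sketch below uses the llama-cpp-python bindings, where the n_gpu_layers parameter controls how many layers go to the GPU. This is a minimal sketch under stated assumptions: the GGUF file paths are placeholders, the layer count of 20 is an arbitrary example, and your installed version of the library may expose additional options.

    # Minimal sketch using the llama-cpp-python bindings
    # (pip install llama-cpp-python). Model paths are placeholders.
    from llama_cpp import Llama

    # Within the max LLM size: n_gpu_layers=-1 puts every layer on the GPU.
    llm = Llama(model_path="model-q4_k_m.gguf", n_gpu_layers=-1)

    # Over the limit: lower n_gpu_layers so only part of the model sits in
    # VRAM and the rest runs on the CPU. It works, but expect the kind of
    # slowdown described above (tens of t/s down to a few t/s).
    # llm = Llama(model_path="larger-model-q4_k_m.gguf", n_gpu_layers=20)

    print(llm("Local AI is", max_tokens=16)["choices"][0]["text"])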

Hardware Relevant to Max LLM Size

GIGABYTE GeForce RTX 5070 WINDFORCE OC 12G

GPU · 12 GB VRAM · 672 GB/s

Apple Mac Mini (M4 Pro, 2024)

Mini PC · 24 GB unified memory · 273 GB/s

GIGABYTE Radeon RX 9060 XT GAMING OC 16G

GPU · 16 GB VRAM · 288 GB/s

