What is Max LLM Size?
The largest language model this hardware can run with full GPU/unified-memory acceleration, at the specified quantization. Larger models need more memory, so this figure scales directly with VRAM or unified-memory capacity.
Full Explanation
Max LLM size is the largest model parameter count a given hardware configuration can run entirely in GPU VRAM or unified memory at Q4 quantization. Running "within VRAM" means every model layer is GPU-accelerated; exceeding this threshold forces CPU offloading with a significant speed penalty. The rule of thumb is approximate: at Q4 each parameter takes about half a byte, so model params (billions) × 0.5 GB ≈ memory needed for the weights, plus a little headroom for context (KV cache). A 16 GB card fits ~30B models; 12 GB fits ~22B; 8 GB fits ~13B.
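A minimal sketch of that rule of thumb in Python; the 1.5 GB headroom allowance for KV cache and runtime buffers is an illustrative assumption, not a measured value:

```python
def vram_needed_gb(params_billion: float, bits_per_weight: float = 4.0,
                   overhead_gb: float = 1.5) -> float:
    """Rough memory estimate: weight size at the given quantization,
    plus a flat allowance for KV cache and runtime buffers."""
    bytes_per_weight = bits_per_weight / 8          # Q4 -> 0.5 bytes per parameter
    weights_gb = params_billion * bytes_per_weight  # billions of params -> GB
    return weights_gb + overhead_gb

def max_llm_size_b(memory_gb: float, bits_per_weight: float = 4.0,
                   overhead_gb: float = 1.5) -> float:
    """Invert the estimate: largest parameter count that still fits."""
    return (memory_gb - overhead_gb) / (bits_per_weight / 8)

for memory_gb in (8, 12, 16, 24):
    print(f"{memory_gb} GB -> ~{max_llm_size_b(memory_gb):.0f}B params at Q4")
```

With that headroom the estimates land close to the figures above: ~13B at 8 GB, ~21B at 12 GB, ~29B at 16 GB.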
Why It Matters for Local AI
Max LLM size is a practical ceiling, not a hard limit. You can run larger models with layer offloading — they'll just be slower. For interactive use, staying within the max LLM size for your hardware is the difference between 30+ t/s and 3 t/s.
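The penalty is steep because token generation is largely memory-bandwidth-bound: every layer left in system RAM is read at system-RAM speed on each token. A back-of-the-envelope sketch under that assumption; the 672 GB/s GPU and 60 GB/s system-RAM bandwidths are illustrative, and PCIe transfer and compute costs are ignored:

```python
def decode_tps(model_gb: float, gpu_fraction: float,
               gpu_bw_gbps: float = 672.0, ram_bw_gbps: float = 60.0) -> float:
    """Very rough decode speed for a model split between GPU and CPU.

    Assumes decoding is memory-bandwidth-bound: each token reads all
    weights once, so time per token = GPU-resident bytes / GPU bandwidth
    + CPU-resident bytes / system-RAM bandwidth.
    """
    seconds_per_token = (model_gb * gpu_fraction / gpu_bw_gbps
                         + model_gb * (1 - gpu_fraction) / ram_bw_gbps)
    return 1 / seconds_per_token

model_gb = 20  # roughly a 40B model at Q4
print(f"fully on GPU:          {decode_tps(model_gb, 1.0):.1f} t/s")
print(f"half offloaded to CPU: {decode_tps(model_gb, 0.5):.1f} t/s")
```

In this toy model a 20 GB model drops from roughly 34 t/s fully on the GPU to about 5 t/s with half its weights in system RAM, which is the cliff described above.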
Hardware Relevant to Max LLM Size
GPU · 12 GB VRAM · 672 GB/s
Mini PC · 24 GB unified memory · 273 GB/s
GPU · 16 GB VRAM · 288 GB/s
Related Terms
VRAM→
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Unified Memory→
Apple Silicon uses a single pool of fast RAM shared between CPU and GPU. Larger unified memory = larger models run entirely at full bandwidth — no PCIe bottleneck.
Quantization→
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits = less VRAM required, with a slight quality reduction; a rough size comparison follows this list.
Tokens/s→
Tokens per second — the standard speed metric for LLMs. One token ≈ 0.75 words. Above 10 t/s feels interactive; below 5 t/s feels like watching paint dry.
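A rough illustration of how quantization trades precision for memory, using nominal bits per weight; real quantized formats store per-block scales, so actual files run slightly larger:

```python
BITS_PER_WEIGHT = {"Q4": 4, "Q8": 8, "FP16": 16}  # nominal, ignoring format overhead

def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB: params x bits per weight / 8."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"13B model at {quant:>4}: ~{weights_gb(13, quant):.1f} GB")
```

The same 13B model needs ~6.5 GB at Q4, ~13 GB at Q8, and ~26 GB at FP16, which is why Q4 is the baseline quantization for the max LLM size figures on this page.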