Performance & Benchmarks

What is Quantization?

Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits = less VRAM required, with a slight quality reduction.

Full Explanation

Quantization converts a model's weights from 32-bit or 16-bit floating-point values to lower-precision integers. A 70B model in FP16 requires ~140 GB of VRAM, far more than any single consumer GPU offers. The same model at Q4 (4-bit integers) compresses to ~40 GB, fitting in a Mac Mini M4 Pro with 48 GB of unified memory. The quality loss from Q4 versus FP16 is typically imperceptible in benchmarks for models above 13B, but it is more noticeable on smaller 7B models, where every bit of precision matters.
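As a rough sanity check, that sizing arithmetic can be sketched in a few lines of Python. This is a minimal sketch: the bits-per-weight figures are approximate values for common GGUF quant formats (an assumption, not exact specifications), and real usage adds KV-cache and activation overhead on top of the printed numbers.

```python
# Rough weight-memory estimate at different quantization levels.
# Bits-per-weight values are approximations for common GGUF formats;
# actual memory use is higher once the KV cache and activations count.

QUANT_BITS = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8}  # approx. bits per weight

def weight_gb(params_billions: float, quant: str) -> float:
    """Approximate GB needed just for the model weights."""
    return params_billions * QUANT_BITS[quant] / 8

for quant in QUANT_BITS:
    print(f"70B @ {quant:<6} ≈ {weight_gb(70, quant):5.1f} GB")

# 70B @ FP16   ≈ 140.0 GB
# 70B @ Q8_0   ≈  74.4 GB
# 70B @ Q4_K_M ≈  42.0 GB
```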

Why It Matters for Local AI

Q4_K_M is the current community standard, offering the best quality-per-gigabyte trade-off. If your hardware has 16 GB of memory, Q4 lets you run a 13B model; Q8 limits you to 7B. When downloading models from Hugging Face or Ollama, always check the quantization level in the filename (e.g., "Q4_K_M", "Q8_0").
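A small sketch of that filename check, assuming the common community naming convention; this is not any official Hugging Face or Ollama API, and the filenames below are hypothetical examples:

```python
import re

# Pull the quantization tag out of a GGUF-style filename, if present.
# Pattern covers the common community tags: Q4_K_M / Q8_0 / F16 / FP16.
QUANT_PATTERN = re.compile(r"(Q\d_K_[SML]|Q\d_\d|F(?:P)?16)", re.IGNORECASE)

def quant_of(filename: str) -> str | None:
    """Return the quantization tag embedded in a model filename, if any."""
    match = QUANT_PATTERN.search(filename)
    return match.group(1).upper() if match else None

# Hypothetical filenames for illustration only.
print(quant_of("llama-3-13b-instruct.Q4_K_M.gguf"))  # Q4_K_M
print(quant_of("mistral-7b-v0.3.Q8_0.gguf"))         # Q8_0
```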

Hardware Relevant to Quantization

Apple Mac Mini (M4 Pro, 2024)

mini PC · 24 GB unified memory · 273 GB/s

GIGABYTE GeForce RTX 5070 WINDFORCE OC 12G

GPU · 12 GB VRAM · 672 GB/s

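To tie the listed hardware back to the sizing guidance above, a hypothetical fit check might look like the sketch below. The 25% headroom figure is an assumption to cover the OS, context window, and KV cache, not a vendor specification.

```python
# Which model size and quantization fit in each device's memory,
# leaving ~25% headroom (assumed) for the OS, context, and KV cache.

DEVICES = {"Mac Mini M4 Pro (24 GB unified)": 24, "RTX 5070 (12 GB VRAM)": 12}
QUANT_BITS = {"Q8_0": 8.5, "Q4_K_M": 4.8}  # approx. bits per weight

def fits(params_billions: float, quant: str, memory_gb: int, headroom: float = 0.25) -> bool:
    weights_gb = params_billions * QUANT_BITS[quant] / 8
    return weights_gb <= memory_gb * (1 - headroom)

for device, mem in DEVICES.items():
    for size in (7, 13):
        for quant in QUANT_BITS:
            verdict = "fits" if fits(size, quant, mem) else "too big"
            print(f"{device}: {size}B {quant} -> {verdict}")
```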
