What is Quantization?
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits = less VRAM required, with a slight quality reduction.
Full Explanation
Quantization converts a model's weights from 32-bit or 16-bit floating-point values to lower-precision integers. A 70B model in FP16 requires ~140 GB of VRAM, far beyond any single consumer device. The same model at Q4 (4-bit integers) compresses to ~40 GB, small enough to fit in a Mac Mini M4 Pro with 48 GB of unified memory. The quality loss of Q4 relative to FP16 is typically imperceptible in benchmarks for models above 13B, but it is more noticeable on smaller 7B models, where every bit of precision matters.
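The memory arithmetic above is easy to sketch: a model's weight footprint is roughly parameter count × bits per weight ÷ 8 bytes. The function below is a simplified estimate (it ignores the KV cache and runtime overhead) that reproduces the ~140 GB and ~40 GB figures; the 4.8 bits/weight value is an approximation of what Q4_K_M averages in practice:

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in decimal GB: params * bits / 8 bytes.

    Ignores KV cache and runtime overhead, so real usage is higher.
    """
    return n_params * bits_per_weight / 8 / 1e9

# 70B model in FP16 (16 bits per weight)
print(round(weight_size_gb(70e9, 16)))   # 140

# 70B at Q4_K_M, which averages roughly 4.8 bits per weight
print(round(weight_size_gb(70e9, 4.8)))  # 42
```

The same formula explains why Q8 roughly doubles the footprint of Q4: the bits-per-weight term doubles while everything else stays fixed.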
Why It Matters for Local AI
Q4_K_M is the current community standard for the best quality-per-GB ratio. If your hardware has 16 GB of memory, Q4 lets you run a 13B model; Q8 limits you to 7B. When downloading models from Hugging Face or Ollama, always check the quantization level in the filename (e.g., "Q4_K_M", "Q8_0").
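A rough fit check makes the 16 GB examples above concrete. This is a sketch, not a precise rule: the 2.5 GB headroom for KV cache and runtime overhead is an illustrative assumption, and real headroom depends on context length.

```python
def fits(n_params: float, mem_gb: float, bits_per_weight: float,
         headroom_gb: float = 2.5) -> bool:
    """Rough check: quantized weights plus a fixed headroom must fit.

    headroom_gb is an illustrative allowance for KV cache and runtime
    overhead, not a measured value.
    """
    weights_gb = n_params * bits_per_weight / 8 / 1e9
    return weights_gb + headroom_gb <= mem_gb

print(fits(13e9, 16, 4.5))  # True: 13B at Q4 fits in 16 GB
print(fits(13e9, 16, 8.5))  # False: 13B at Q8_0 (~8.5 bits/weight) does not
print(fits(7e9, 16, 8.5))   # True: 16 GB at Q8 limits you to ~7B
```

Under these assumptions, 16 GB runs a 13B model at Q4 but only a 7B model at Q8, matching the rule of thumb above.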
Hardware Relevant to Quantization
Mini PC · 24 GB unified memory · 273 GB/s
GPU · 12 GB VRAM · 672 GB/s
Related Terms
VRAM→
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Unified Memory→
Apple Silicon uses a single pool of fast RAM shared between CPU and GPU. Larger unified memory = larger models run entirely at full bandwidth — no PCIe bottleneck.
GGUF→
The standard file format for quantized LLMs used by llama.cpp and Ollama. Replaces the older GGML format. Stores model weights and metadata in a single portable file.
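A GGUF file begins with the 4-byte ASCII magic `GGUF` followed by a little-endian uint32 format version, so you can sanity-check a download in a few lines. This is a minimal sketch: the full header also carries tensor and metadata-key counts, which it skips.

```python
import struct

def gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise if the magic doesn't match."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        return version
```

Called on a model you pulled from Hugging Face (the filename here is hypothetical), `gguf_version("model-Q4_K_M.gguf")` returns a small integer such as 3 for current files.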
Max LLM Size→
The largest language model this hardware can run with full GPU/unified-memory acceleration, at the specified quantization. Larger models require more memory.
Ollama→
Free open-source tool for running LLMs locally on macOS, Linux, and Windows. Download a model with a single command. No cloud account required. Supports Llama, Mistral, Qwen, Phi, and more.