What is Quantization?
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits = less VRAM required, with a slight quality reduction.
Full Explanation
Quantization converts a model's weights from 32-bit or 16-bit floating-point values to lower-precision integers. A 70B model in FP16 requires ~140 GB of VRAM, far beyond any single consumer device. The same model at Q4 (4-bit integers) compresses to ~40 GB, small enough to fit in a Mac Mini M4 Pro with 48 GB of unified memory. The quality loss of Q4 relative to FP16 is typically imperceptible in benchmarks for models above 13B, but it is more noticeable on smaller 7B models, where every bit of precision matters.
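The memory arithmetic above is easy to sketch: a model's weight footprint is roughly parameter count × bits per weight ÷ 8 bytes. The function below is a simplified estimate (it ignores the KV cache and runtime overhead) that reproduces the ~140 GB and ~40 GB figures; the 4.8 bits/weight value is an approximation of what Q4_K_M averages in practice:

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in decimal GB: params * bits / 8 bytes.

    Ignores KV cache and runtime overhead, so real usage is higher.
    """
    return n_params * bits_per_weight / 8 / 1e9

# 70B model in FP16 (16 bits per weight)
print(round(weight_size_gb(70e9, 16)))   # 140

# 70B at Q4_K_M, which averages roughly 4.8 bits per weight
print(round(weight_size_gb(70e9, 4.8)))  # 42
```

The same formula explains why Q8 roughly doubles the footprint of Q4: the bits-per-weight term doubles while everything else stays fixed.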
Why It Matters for Local AI
Q4_K_M is the current community standard for the best quality-per-GB ratio. If your hardware has 16 GB of memory, Q4 lets you run a 13B model; Q8 limits you to 7B. When downloading models from Hugging Face or Ollama, always check the quantization level in the filename (e.g., "Q4_K_M", "Q8_0").
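A rough fit check makes the 16 GB examples above concrete. This is a sketch, not a precise rule: the 2.5 GB headroom for KV cache and runtime overhead is an illustrative assumption, and real headroom depends on context length.

```python
def fits(n_params: float, mem_gb: float, bits_per_weight: float,
         headroom_gb: float = 2.5) -> bool:
    """Rough check: quantized weights plus a fixed headroom must fit.

    headroom_gb is an illustrative allowance for KV cache and runtime
    overhead, not a measured value.
    """
    weights_gb = n_params * bits_per_weight / 8 / 1e9
    return weights_gb + headroom_gb <= mem_gb

print(fits(13e9, 16, 4.5))  # True: 13B at Q4 fits in 16 GB
print(fits(13e9, 16, 8.5))  # False: 13B at Q8_0 (~8.5 bits/weight) does not
print(fits(7e9, 16, 8.5))   # True: 16 GB at Q8 limits you to ~7B
```

Under these assumptions, 16 GB runs a 13B model at Q4 but only a 7B model at Q8, matching the rule of thumb above.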
Hardware Relevant to Quantization
Mini PC · 24 GB unified memory · 273 GB/s
GPU · 12 GB VRAM · 672 GB/s
Related Terms
VRAM→
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Unified Memory→
Apple Silicon uses a single pool of fast RAM shared between CPU and GPU. Larger unified memory = larger models run entirely at full bandwidth — no PCIe bottleneck.
GGUF→
The standard file format for quantized LLMs used by llama.cpp and Ollama. Replaces the older GGML format. Stores model weights and metadata in a single portable file.
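A GGUF file begins with the 4-byte ASCII magic `GGUF` followed by a little-endian uint32 format version, so you can sanity-check a download in a few lines. This is a minimal sketch: the full header also carries tensor and metadata-key counts, which it skips.

```python
import struct

def gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise if the magic doesn't match."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        return version
```

Called on a model you pulled from Hugging Face (the filename here is hypothetical), `gguf_version("model-Q4_K_M.gguf")` returns a small integer such as 3 for current files.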
Max LLM Size→
The largest language model this hardware can run with full GPU/unified-memory acceleration, at the specified quantization. Larger models require more memory.
Ollama→
Free open-source tool for running LLMs locally on macOS, Linux, and Windows. Download a model with a single command. No cloud account required. Supports Llama, Mistral, Qwen, Phi, and more.