What is INT4?
4-bit integer quantization — the most common precision level for running large models on consumer hardware. Reduces model size by ~75% vs FP16 with acceptable quality loss for most tasks.
Full Explanation
INT4 (4-bit integer) quantization represents each model weight as a 4-bit integer instead of a 16-bit float, reducing memory footprint by roughly 75%. A 7B model that requires ~14 GB at FP16 precision fits in ~4 GB at INT4. Quality degrades measurably compared to FP16 — especially on math, code, and precise factual recall — but remains acceptable for conversational and summarization tasks. INT4 is the default quantization level in most consumer inference setups, corresponding to Q4_K_M in GGUF or 4-bit in AWQ/EXL2.
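To make the idea concrete, below is a minimal NumPy sketch of group-wise symmetric 4-bit quantization. It is not how GGUF's Q4_K_M format works internally (real formats pack two 4-bit values per byte and store extra per-block scales and minimums); the function names here are purely illustrative.

# Illustrative sketch of symmetric 4-bit weight quantization with one
# scale per group of weights. Values are kept in int8 for clarity; real
# implementations pack two 4-bit values into each byte.
import numpy as np

def quantize_int4(weights, group_size=32):
    w = weights.reshape(-1, group_size)
    # Map each group onto the signed 4-bit range; use -7..7 symmetrically.
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 7.0, 1e-8)
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

# 1,024 FP16 weights take ~2 KB; the same weights as packed INT4 take ~0.5 KB plus scales.
w = np.random.randn(1024).astype(np.float16)
q, scale = quantize_int4(w.astype(np.float32))
w_restored = dequantize_int4(q, scale)
print("max round-trip error:", np.abs(w.astype(np.float32) - w_restored).max())

The round-trip error printed at the end is the "quality loss" the glossary entry refers to: each weight is recovered only approximately, which is why INT4 models score slightly worse on precision-sensitive tasks.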
Why It Matters for Local AI
INT4 is the reason you can run a 13B model on a 12 GB GPU or a 70B model on 48 GB of unified memory. It's the practical enabler of local AI on consumer hardware. For most chat and productivity use cases, the quality difference from FP16 is rarely noticeable. For code generation or precise math, consider INT8 or FP16 if your hardware can fit the larger model.
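Those memory figures follow directly from bits per weight. Here is a rough, weights-only estimator (a hypothetical helper, not part of any library); the KV cache, activations, and runtime buffers add overhead on top, and real Q4 formats store per-group scales that add roughly half a bit per weight:

# Back-of-the-envelope weight memory at different precisions (decimal GB).
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(params_billions, precision):
    bits = params_billions * 1e9 * BITS_PER_WEIGHT[precision]
    return bits / 8 / 1e9

for size in (7, 13, 70):
    print(f"{size}B:", {p: f"{weight_memory_gb(size, p):.1f} GB" for p in BITS_PER_WEIGHT})

This reproduces the numbers above: a 7B model is ~14 GB at FP16 and ~3.5 GB at INT4 (closer to ~4 GB once scales and metadata are included), a 13B model at INT4 leaves headroom on a 12 GB GPU, and a 70B model at INT4 comes in around 35 GB, fitting 48 GB of unified memory.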
Hardware Relevant to INT4
GPU · 12 GB VRAM · 672 GB/s memory bandwidth
Mini-PC · 24 GB unified memory · 273 GB/s memory bandwidth
Related Terms
Quantization→
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits mean less VRAM is required, at the cost of a slight quality reduction.
GGUF→
The standard file format for quantized LLMs used by llama.cpp and Ollama. Replaces the older GGML format. Stores model weights and metadata in a single portable file.
VRAM→
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Max LLM Size→
The largest language model this hardware can run with full GPU/unified-memory acceleration, at the specified quantization. Larger models require more memory.
AWQ→
Activation-Aware Weight Quantization — a 4-bit quantization method that outperforms GGUF Q4 in quality by identifying and preserving the most important weights. Primarily used with vLLM and Hugging Face Transformers.