What is INT4?

4-bit integer quantization — the most common precision level for running large models on consumer hardware. Reduces model size by ~75% vs FP16 with acceptable quality loss for most tasks.

Full Explanation

INT4 (4-bit integer) quantization represents each model weight as a 4-bit integer instead of a 16-bit float, reducing memory footprint by roughly 75%. A 7B model that requires ~14 GB at FP16 precision fits in ~4 GB at INT4. Quality degrades measurably compared to FP16 — especially on math, code, and precise factual recall — but remains acceptable for conversational and summarization tasks. INT4 is the default quantization level in most consumer inference setups, corresponding to Q4_K_M in GGUF or 4-bit in AWQ/EXL2.
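The core idea can be sketched in a few lines. This is a minimal illustration of symmetric group-wise 4-bit quantization, not the actual Q4_K_M packing used by llama.cpp (which adds super-blocks and separate min values); the group size of 32 and the helper names are assumptions for the sketch.

```python
import numpy as np

def quantize_int4(weights, group_size=32):
    """Toy symmetric per-group INT4 quantization: each group of
    weights shares one FP16 scale, and values map to integers
    in [-8, 7] (the signed 4-bit range)."""
    w = weights.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int4(q, scale):
    """Recover approximate FP32 weights from INT4 codes + scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
print(np.max(np.abs(w - w_hat)))  # small per-weight reconstruction error
```

The per-group scale is why real formats cost slightly more than 4.0 bits per weight: each group of quantized integers carries a small amount of FP16 metadata alongside the packed 4-bit codes.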

Why It Matters for Local AI

INT4 is the reason you can run a 13B model on a 12 GB GPU or a 70B model on 48 GB unified memory. It's the practical enabler of local AI on consumer hardware. For most chat and productivity use cases, the quality difference from FP16 is undetectable. For code generation or precise math, consider INT8 or FP16 on hardware that can fit it.
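The sizing claims above are simple arithmetic, sketched below. The ~4.8 effective bits per weight for Q4_K_M and the 10% runtime overhead factor are assumptions for illustration; real memory use also depends on context length and KV cache.

```python
def model_memory_gb(params_b, bits_per_weight, overhead=1.1):
    """Rough weight-memory estimate in GB: parameters x bits/8,
    times an assumed ~10% overhead for scales and runtime buffers."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9 * overhead

for params, label in [(7, "7B"), (13, "13B"), (70, "70B")]:
    fp16 = model_memory_gb(params, 16, overhead=1.0)
    int4 = model_memory_gb(params, 4.8)  # assumed Q4_K_M effective bpw
    print(f"{label}: FP16 ~ {fp16:.1f} GB, INT4 ~ {int4:.1f} GB")
```

Running this reproduces the figures in the text: a 7B model needs ~14 GB at FP16 but fits in roughly 4-5 GB at INT4, and a 70B model lands in the ~46 GB range, which is why 48 GB of unified memory is the commonly cited floor.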

Hardware Relevant to INT4

GIGABYTE GeForce RTX 5070 WINDFORCE OC 12G

GPU · 12 GB VRAM · 672 GB/s memory bandwidth

Apple Mac Mini (M4 Pro, 2024)

Mini-PC · 24 GB unified memory · 273 GB/s memory bandwidth


Related Terms