What is KV Cache?
Key-Value Cache — stores intermediate attention computations so the model doesn't re-process earlier context on each new token. Larger context = larger KV cache = more VRAM needed.
Full Explanation
During inference, transformer models compute "keys" and "values" for every token in the context. The KV cache stores these tensors so they aren't recomputed for every new output token. Without it, generating token 500 would mean re-running attention computations over all 499 prior tokens, so generation cost would grow quadratically with sequence length instead of linearly. The cache is stored in VRAM and grows linearly with context length: at 128K context, the KV cache alone can consume 8–16 GB depending on the model architecture.
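As a rough sketch of where those numbers come from, the cache size can be estimated from the architecture. The figures below are illustrative (Llama-3-8B-style: 32 layers, 8 KV heads, head dimension 128), not values taken from this article:

```python
# Rough KV cache size estimate. Architecture numbers are illustrative
# (Llama-3-8B-style), not measurements from any specific tool.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # Keys + values (the factor of 2) for every layer, KV head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

for ctx in (4_096, 32_768, 131_072):
    gb = kv_cache_bytes(32, 8, 128, ctx) / 1024**3
    print(f"{ctx:>7} tokens -> {gb:.1f} GB of KV cache at FP16")
```

For these example numbers the cache costs roughly 0.5 GB at 4K context and about 16 GB at 128K; models with fewer KV heads (more aggressive grouped-query attention) need proportionally less, and quantizing the cache to 8-bit halves the figure.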
Why It Matters for Local AI
KV cache is the hidden VRAM consumer that surprises people. A model that "fits" in 12 GB of VRAM at short context may run out of memory (OOM) when you feed it a long document. Tools like Ollama report how much memory each loaded model is using; watch that figure climb as conversation history lengthens and the KV cache fills.
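To get a feel for where that OOM point lands on a 12 GB card, here is a back-of-envelope sketch. The ~4 GB weight figure for a 7B model at Q4, the 1 GB runtime overhead, and the KV layout are assumptions for illustration, not measurements:

```python
# Back-of-envelope check: do weights + KV cache fit in a 12 GB card?
# Assumed numbers for illustration: ~7B model at Q4 ≈ 4 GB of weights,
# ~1 GB for runtime buffers, Llama-3-8B-style KV layout (32 layers,
# 8 KV heads, head_dim 128, FP16 cache).
WEIGHTS_GB = 4.0
OVERHEAD_GB = 1.0
VRAM_GB = 12.0

for ctx in (4_096, 16_384, 65_536, 131_072):
    cache_gb = 2 * 32 * 8 * 128 * ctx * 2 / 1024**3  # keys + values, FP16
    total = WEIGHTS_GB + OVERHEAD_GB + cache_gb
    verdict = "fits" if total <= VRAM_GB else "OOM risk"
    print(f"{ctx:>7} tokens: {total:4.1f} GB -> {verdict}")
```

Under these assumptions the same model that sits comfortably at 4K context blows past the 12 GB budget somewhere between 16K and 64K tokens, which is exactly the "it worked until I pasted a long document" failure mode.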
Hardware Relevant to KV Cache
GPU · 12 GB VRAM · 672 GB/s memory bandwidth
Mini PC · 24 GB unified memory · 273 GB/s memory bandwidth
Related Terms
Context Window
The maximum amount of text (in tokens) a model can "see" at once. Larger context = more document history, longer conversations, bigger code files — but requires more VRAM.
VRAM
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Quantization
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits = less VRAM required, with a slight quality reduction.