Performance & Benchmarks

What is KV Cache?

Key-Value Cache — stores intermediate attention computations so the model doesn't re-process earlier context on each new token. Larger context = larger KV cache = more VRAM needed.

Full Explanation

During inference, transformer models compute "keys" and "values" for every token in the context. The KV cache stores these tensors so they aren't recalculated on every new output token. Without it, generating token 500 would require recomputing attention over all 499 prior tokens, so the total cost of generation would grow quadratically with sequence length. The cache is stored in VRAM and grows linearly with context length — at 128K context, the KV cache alone can consume 8–16 GB depending on the model architecture.
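The linear growth is easy to estimate: per layer, the cache holds one key tensor and one value tensor for every token. A minimal sketch of the arithmetic, using illustrative numbers that resemble an 8B-parameter model with grouped-query attention (the layer count, head count, and head dimension below are assumptions, not values from this article):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The factor of 2 accounts for storing both keys and values at each layer.
    dtype_bytes=2 assumes fp16/bf16 cache entries.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 8B-class model: 32 layers, 8 KV heads, head_dim 128
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=131072)
print(f"{size / 2**30:.0f} GiB")  # 16 GiB at 128K context in fp16
```

With these assumed dimensions the cache works out to 128 KiB per token, which is why long-document prompts can dwarf the memory cost of short chats.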

Why It Matters for Local AI

The KV cache is the hidden VRAM consumer that surprises people. A model that "fits" in 12 GB of VRAM at short context may OOM (run out of memory) when you feed it a long document. Tools like Ollama show VRAM usage in real time — watch the KV cache grow as conversation history lengthens.

Hardware Relevant to KV Cache

GIGABYTE GeForce RTX 5070 WINDFORCE OC 12G

GPU · 12 GB VRAM · 672 GB/s

Apple Mac Mini (M4 Pro, 2024)

Mini-PC · 24 GB Unified · 273 GB/s


Related Terms