What is KV Cache?
Key-Value Cache — stores intermediate attention computations so the model doesn't re-process earlier context on each new token. Larger context = larger KV cache = more VRAM needed.
Full Explanation
During inference, transformer models compute "keys" and "values" for every token in the context. The KV cache stores these tensors so they aren't recomputed for every new output token. Without it, generating token 500 would mean re-running attention computations over all 499 prior tokens, so generation cost would grow quadratically with sequence length instead of linearly. The cache is stored in VRAM and grows linearly with context length: at 128K context, the KV cache alone can consume 8–16 GB depending on the model architecture.
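As a rough sketch of where those numbers come from, the cache size can be estimated from the architecture. The figures below are illustrative (Llama-3-8B-style: 32 layers, 8 KV heads, head dimension 128), not values taken from this article:

```python
# Rough KV cache size estimate. Architecture numbers are illustrative
# (Llama-3-8B-style), not measurements from any specific tool.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # Keys + values (the factor of 2) for every layer, KV head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

for ctx in (4_096, 32_768, 131_072):
    gb = kv_cache_bytes(32, 8, 128, ctx) / 1024**3
    print(f"{ctx:>7} tokens -> {gb:.1f} GB of KV cache at FP16")
```

For these example numbers the cache costs roughly 0.5 GB at 4K context and about 16 GB at 128K; models with fewer KV heads (more aggressive grouped-query attention) need proportionally less, and quantizing the cache to 8-bit halves the figure.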
Why It Matters for Local AI
KV cache is the hidden VRAM consumer that surprises people. A model that "fits" in 12 GB of VRAM at short context may run out of memory (OOM) when you feed it a long document. Tools like Ollama report how much memory each loaded model is using; watch that figure climb as conversation history lengthens and the KV cache fills.
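To get a feel for where that OOM point lands on a 12 GB card, here is a back-of-envelope sketch. The ~4 GB weight figure for a 7B model at Q4, the 1 GB runtime overhead, and the KV layout are assumptions for illustration, not measurements:

```python
# Back-of-envelope check: do weights + KV cache fit in a 12 GB card?
# Assumed numbers for illustration: ~7B model at Q4 ≈ 4 GB of weights,
# ~1 GB for runtime buffers, Llama-3-8B-style KV layout (32 layers,
# 8 KV heads, head_dim 128, FP16 cache).
WEIGHTS_GB = 4.0
OVERHEAD_GB = 1.0
VRAM_GB = 12.0

for ctx in (4_096, 16_384, 65_536, 131_072):
    cache_gb = 2 * 32 * 8 * 128 * ctx * 2 / 1024**3  # keys + values, FP16
    total = WEIGHTS_GB + OVERHEAD_GB + cache_gb
    verdict = "fits" if total <= VRAM_GB else "OOM risk"
    print(f"{ctx:>7} tokens: {total:4.1f} GB -> {verdict}")
```

Under these assumptions the same model that sits comfortably at 4K context blows past the 12 GB budget somewhere between 16K and 64K tokens, which is exactly the "it worked until I pasted a long document" failure mode.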
Hardware Relevant to KV Cache
GPU · 12 GB VRAM · 672 GB/s memory bandwidth
Mini PC · 24 GB unified memory · 273 GB/s memory bandwidth
Related Terms
Context Window
The maximum amount of text (in tokens) a model can "see" at once. Larger context = more document history, longer conversations, bigger code files — but requires more VRAM.
VRAM
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Quantization
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits = less VRAM required, with a slight quality reduction.