What is Speculative Decoding?
A speed optimization where a small draft model generates candidate tokens that a larger target model then verifies in parallel — producing multiple tokens per forward pass.
Full Explanation
Speculative decoding pairs a small, fast "draft" model with a larger "target" model. The draft model proposes a short run of candidate tokens; the target model verifies them all in a single parallel forward pass and accepts the longest prefix it agrees with. Verification is cheap because autoregressive decoding is memory-bandwidth-bound, so scoring several drafted tokens in one pass costs little more than generating a single token. When the per-token acceptance rate is high (roughly 70–80% for closely matched model pairs), you get several tokens for the cost of about one target-model pass. Crucially, the accept/reject rule preserves the target model's output distribution exactly, so the speedup comes with no quality loss. llama.cpp supports speculative decoding natively.
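To make the draft-then-verify loop concrete, here is a minimal Python sketch of the greedy variant (real implementations use a rejection-sampling rule so that sampled output still matches the target distribution). The `draft_next` and `target_next` callables are hypothetical stand-ins for model calls, not a real library API.

```python
from typing import Callable, List

Token = int

def speculative_step(
    context: List[Token],
    draft_next: Callable[[List[Token]], Token],        # hypothetical: draft model's greedy next token
    target_next: Callable[[List[Token]], List[Token]], # hypothetical: ONE target pass over a length-n
                                                       # sequence returns the greedy token after every
                                                       # prefix, i.e. preds[i] follows seq[: i + 1]
    k: int = 5,                                        # number of tokens to draft per step
) -> List[Token]:
    """One draft-then-verify step; returns the tokens accepted this step."""
    # 1. Draft k candidate tokens autoregressively with the cheap model.
    drafted: List[Token] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. Verify with a single parallel target pass over context + draft.
    preds = target_next(context + drafted)
    base = len(context)

    # 3. Accept the longest agreeing prefix; on the first mismatch, keep the
    #    target's own token instead, so every step yields at least one token.
    accepted: List[Token] = []
    for i, tok in enumerate(drafted):
        if tok == preds[base + i - 1]:    # target's choice after the same prefix
            accepted.append(tok)
        else:
            accepted.append(preds[base + i - 1])
            return accepted
    accepted.append(preds[base + k - 1])  # all drafts accepted: free bonus token
    return accepted
```

Each call to `speculative_step` extends the context by between 1 and k + 1 tokens while spending only one target pass plus k cheap draft passes.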
Why It Matters for Local AI
On hardware with headroom — like an RTX 5080 that comfortably fits a 13B target model — speculative decoding can increase effective throughput by 2–3×. It's most useful for interactive chat where latency matters more than batch efficiency.
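The 2–3× figure falls out of simple expected-value arithmetic. If each drafted token is accepted independently with probability α (an idealizing assumption) and k tokens are drafted per step, the expected yield per target pass is (1 − α^(k+1)) / (1 − α), counting the bonus token the target contributes itself. A quick check against the acceptance rates quoted above:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens gained per target forward pass, assuming each drafted
    token is accepted independently with probability alpha (a simplification;
    real acceptance is position- and content-dependent)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.70, 0.80):
    print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, k=5):.2f} tokens/pass")
# alpha=0.7 -> 2.94, alpha=0.8 -> 3.69: in line with a 2-3x speedup once the
# draft model's own (small) cost is subtracted.
```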
Hardware Relevant to Speculative Decoding
GPU · 16 GB VRAM · 960 GB/s memory bandwidth
GPU · 12 GB VRAM · 672 GB/s memory bandwidth
Related Terms
Tokens/s
Tokens per second — the standard speed metric for LLMs. One token ≈ 0.75 words. Above 10 t/s feels interactive; below 5 t/s feels like watching paint dry.
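Grounding those thresholds with the 0.75 words-per-token rule of thumb (a rough average for English text, not an exact figure):

```python
# Rough conversion: 1 token ~= 0.75 English words.
for tps in (5, 10, 30):
    print(f"{tps} tok/s ~= {tps * 0.75 * 60:.0f} words/min")
# 5 tok/s  ~= 225 words/min  (near reading speed: feels sluggish)
# 10 tok/s ~= 450 words/min  (outpaces most readers: feels interactive)
# 30 tok/s ~= 1350 words/min
```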
llama.cpp
The foundational C++ inference engine for running quantized LLMs locally. Powers Ollama, LM Studio, and most local AI tools under the hood. Supports CPU, CUDA, ROCm, and Metal.
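As a usage sketch, llama.cpp ships a dedicated speculative-decoding example binary. Flag names have shifted between versions, so treat `-md` (draft model) and `--draft` (tokens drafted per step) below as illustrative assumptions and confirm with `--help` on your build; the model paths are placeholders.

```sh
# Hypothetical invocation; verify flags with ./llama-speculative --help.
./llama-speculative -m models/target-13b-q4_k_m.gguf \
    -md models/draft-1b-q4_k_m.gguf \
    --draft 8 -n 256 -p "Explain speculative decoding:"
```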
Ollama
Free open-source tool for running LLMs locally on macOS, Linux, and Windows. Download a model with a single command. No cloud account required. Supports Llama, Mistral, Qwen, Phi, and more.
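For example, a single command pulls the model on first use and drops into an interactive chat (the model tag is one of many in the Ollama library):

```sh
ollama run llama3.2
```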
KV Cache
Key-Value Cache — stores intermediate attention computations so the model doesn't re-process earlier context on each new token. Larger context = larger KV cache = more VRAM needed.
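A back-of-the-envelope estimate makes the VRAM relationship concrete. The formula below is the standard transformer KV-cache size; the layer/head numbers are assumptions typical of a 13B-class model without grouped-query attention, not specs taken from this page:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 tensors (K and V) per layer, each of shape
    [ctx_len, n_kv_heads, head_dim], stored here as fp16 (2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Assumed 13B-class shape: 40 layers, 40 KV heads, head_dim 128.
size = kv_cache_bytes(n_layers=40, n_kv_heads=40, head_dim=128, ctx_len=4096)
print(f"{size / 2**30:.2f} GiB")  # ~3.12 GiB at 4k context; doubles at 8k
```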