What is Tokens/s?
Tokens per second — the standard speed metric for LLMs. One token ≈ 0.75 words. Above 10 t/s feels interactive; below 5 t/s feels like watching paint dry.
Full Explanation
Tokens per second (t/s) measures how quickly a model generates output. One token is roughly 0.75 words or 4 characters in English. A conversational exchange typically generates 100–300 tokens of response. At 10 t/s that's 10–30 seconds per reply — tolerable. At 50 t/s it feels nearly instant. At 5 t/s or below, you're watching individual words appear, which breaks the sense of a fluid conversation. The metric is hardware-dependent: the same 7B model at Q4 produces 8 t/s on a budget mini PC and 118 t/s on an RTX 5070.
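The reply-time arithmetic above is easy to sketch. This is a back-of-envelope estimate using the article's rough averages (100–300 tokens per reply); the function name is just illustrative:

```python
def reply_seconds(tokens: int, tokens_per_second: float) -> float:
    """Seconds to stream a reply of `tokens` length at a given speed."""
    return tokens / tokens_per_second

# How long a typical 100-300 token reply takes at common speeds
for speed in (5, 10, 50):
    low = reply_seconds(100, speed)
    high = reply_seconds(300, speed)
    print(f"{speed:>3} t/s: {low:.0f}-{high:.0f} s per reply")
```

At 10 t/s this gives the 10–30 second range mentioned above; at 50 t/s a typical reply streams in 2–6 seconds.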
Why It Matters for Local AI
For interactive chat, target at least 15–20 t/s. For coding assistance where you read output as it streams, 30+ t/s is ideal. For background batch processing, even 5 t/s is fine. Match your hardware to your actual use case — overpaying for speed you don't need is wasteful.
Hardware Relevant to Tokens/s
GPU · 12 GB VRAM · 672 GB/s
Mini PC · 24 GB unified memory · 273 GB/s
Mini PC · 16 GB unified memory · 120 GB/s
Related Terms
Memory Bandwidth
How fast data moves between memory and the processor, measured in GB/s. Tokens per second scales nearly linearly with bandwidth — this is the single most important GPU spec for LLM speed.
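The near-linear relationship can be made concrete with a common back-of-envelope model: during generation, each token requires reading roughly the entire set of weights from memory, so bandwidth divided by model size gives a theoretical ceiling on t/s. A minimal sketch, assuming a 7B model at Q4 occupies roughly 4 GB (the exact figure varies by quantization format):

```python
def max_tps(bandwidth_gbps: float, model_gb: float) -> float:
    """Theoretical tokens/s ceiling for memory-bound generation:
    each token reads ~the whole model, so t/s <= bandwidth / size."""
    return bandwidth_gbps / model_gb

print(f"{max_tps(672, 4.0):.0f} t/s ceiling at 672 GB/s")  # 168 t/s
print(f"{max_tps(120, 4.0):.0f} t/s ceiling at 120 GB/s")  # 30 t/s
```

Real throughput lands below this ceiling due to compute and overhead, but the linear scaling with bandwidth holds.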
Quantization
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits mean less VRAM required, at a slight cost in quality.
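The memory saving is easy to estimate: weight size is roughly parameter count times bits per weight, divided by 8 to get bytes. A minimal sketch (real model files add some overhead for embeddings and quantization scales, so treat these as floors):

```python
def weight_gb(params_billions: float, bits: int) -> float:
    """Approximate weight size in GB: params x (bits / 8) bytes each."""
    return params_billions * bits / 8

for bits, name in ((4, "Q4"), (8, "Q8"), (16, "FP16")):
    print(f"7B at {name}: ~{weight_gb(7, bits):.1f} GB")
```

This is why a 7B model at Q4 (~3.5 GB of weights) fits comfortably in 8 GB of VRAM, while the same model at FP16 (~14 GB) does not.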
VRAM
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Unified Memory
Apple Silicon uses a single pool of fast RAM shared between the CPU and GPU. The more unified memory, the larger the model that can run entirely at full bandwidth, with no PCIe bottleneck.