Performance & Benchmarks

What is Tokens/s?

Tokens per second — the standard speed metric for LLMs. One token ≈ 0.75 words. Above 10 t/s feels interactive; below 5 t/s feels like watching paint dry.

Full Explanation

Tokens per second (t/s) measures how quickly a model generates output. One token is roughly 0.75 words or 4 characters in English. A conversational exchange typically generates 100–300 tokens of response. At 10 t/s that's 10–30 seconds per reply — tolerable. At 50 t/s it feels nearly instant. At 5 t/s or below, you're watching individual words appear, which breaks the sense of a fluid conversation. The metric is hardware-dependent: the same 7B model at Q4 produces 8 t/s on a budget mini PC and 118 t/s on an RTX 5070.
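The arithmetic above is easy to sketch. This is a minimal helper, using the rough 0.75 words-per-token ratio for English stated here; actual token counts vary by tokenizer and language.

```python
# Back-of-envelope conversion between tokens/s, reply latency, and word count.
WORDS_PER_TOKEN = 0.75  # rough English average; tokenizer-dependent

def reply_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Seconds to stream a reply of `tokens` tokens at a given rate."""
    return tokens / tokens_per_sec

def approx_words(tokens: int) -> float:
    """Approximate English word count for a token count."""
    return tokens * WORDS_PER_TOKEN

# A typical 300-token reply (~225 words) at different speeds:
for rate in (5, 10, 50):
    print(f"{rate:>2} t/s -> {reply_seconds(300, rate):.0f} s")
# 5 t/s -> 60 s, 10 t/s -> 30 s, 50 t/s -> 6 s
```

At 50 t/s the same reply lands in 6 seconds, which is why that speed reads as near-instant in practice.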

Why It Matters for Local AI

For interactive chat, target at least 15–20 t/s. For coding assistance where you read output as it streams, 30+ t/s is ideal. For background batch processing, even 5 t/s is fine. Match your hardware to your actual use case — overpaying for speed you don't need is wasteful.
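When matching hardware to a use case, a quick ceiling estimate helps: single-stream decoding is usually memory-bandwidth-bound, since generating each token streams all model weights through memory once, so t/s can't exceed bandwidth divided by model size. This sketch applies that rule to the bandwidth figures listed below; the 4 GB weight size for a 7B model at Q4 is an assumed round figure, and real throughput lands below the ceiling (the RTX 5070's measured 118 t/s vs. a ~168 t/s ceiling).

```python
# Rough upper bound on decode speed for a memory-bandwidth-bound model:
# t/s ceiling = memory bandwidth / bytes of weights read per token.

def tps_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical max tokens/s; real results are lower due to overhead."""
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 4.0  # assumed size of a 7B model at Q4 quantization

for name, bw in [("RTX 5070", 672), ("M4 Pro", 273), ("M4", 120)]:
    print(f"{name}: ~{tps_ceiling(bw, MODEL_GB):.0f} t/s ceiling")
# RTX 5070: ~168, M4 Pro: ~68, M4: ~30
```

Even this crude estimate correctly predicts which machines clear the 15–20 t/s interactive-chat bar for a given model size.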

Hardware Relevant to Tokens/s

GIGABYTE GeForce RTX 5070 WINDFORCE OC 12G

gpu · 12 GB VRAM · 672 GB/s

Apple Mac Mini (M4 Pro, 2024)

mini-pc · 24 GB Unified · 273 GB/s

Apple Mac Mini (M4, 2024)

mini-pc · 16 GB Unified · 120 GB/s

Related Terms