What is Tokens/s?
Tokens per second — the standard speed metric for LLMs. One token ≈ 0.75 words. Above 10 t/s feels interactive; below 5 t/s feels like watching paint dry.
Full Explanation
Tokens per second (t/s) measures how quickly a model generates output. One token is roughly 0.75 words or 4 characters in English. A conversational exchange typically generates 100–300 tokens of response. At 10 t/s that's 10–30 seconds per reply — tolerable. At 50 t/s it feels nearly instant. At 5 t/s or below, you're watching individual words appear, which breaks the sense of a fluid conversation. The metric is hardware-dependent: the same 7B model at Q4 produces 8 t/s on a budget mini PC and 118 t/s on an RTX 5070.
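The reply-time arithmetic above is easy to sketch. This is a back-of-envelope estimate using the article's rough averages (100–300 tokens per reply); the function name is just illustrative:

```python
def reply_seconds(tokens: int, tokens_per_second: float) -> float:
    """Seconds to stream a reply of `tokens` length at a given speed."""
    return tokens / tokens_per_second

# How long a typical 100-300 token reply takes at common speeds
for speed in (5, 10, 50):
    low = reply_seconds(100, speed)
    high = reply_seconds(300, speed)
    print(f"{speed:>3} t/s: {low:.0f}-{high:.0f} s per reply")
```

At 10 t/s this gives the 10–30 second range mentioned above; at 50 t/s a typical reply streams in 2–6 seconds.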
Why It Matters for Local AI
For interactive chat, target at least 15–20 t/s. For coding assistance where you read output as it streams, 30+ t/s is ideal. For background batch processing, even 5 t/s is fine. Match your hardware to your actual use case — overpaying for speed you don't need is wasteful.
Hardware Relevant to Tokens/s
GPU · 12 GB VRAM · 672 GB/s
Mini PC · 24 GB unified memory · 273 GB/s
Mini PC · 16 GB unified memory · 120 GB/s
Related Terms
Memory Bandwidth
How fast data moves between memory and the processor, measured in GB/s. Tokens per second scales nearly linearly with bandwidth — this is the single most important GPU spec for LLM speed.
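The near-linear relationship can be made concrete with a common back-of-envelope model: during generation, each token requires reading roughly the entire set of weights from memory, so bandwidth divided by model size gives a theoretical ceiling on t/s. A minimal sketch, assuming a 7B model at Q4 occupies roughly 4 GB (the exact figure varies by quantization format):

```python
def max_tps(bandwidth_gbps: float, model_gb: float) -> float:
    """Theoretical tokens/s ceiling for memory-bound generation:
    each token reads ~the whole model, so t/s <= bandwidth / size."""
    return bandwidth_gbps / model_gb

print(f"{max_tps(672, 4.0):.0f} t/s ceiling at 672 GB/s")  # 168 t/s
print(f"{max_tps(120, 4.0):.0f} t/s ceiling at 120 GB/s")  # 30 t/s
```

Real throughput lands below this ceiling due to compute and overhead, but the linear scaling with bandwidth holds.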
Quantization
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits mean less VRAM required, at a slight cost in quality.
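The memory saving is easy to estimate: weight size is roughly parameter count times bits per weight, divided by 8 to get bytes. A minimal sketch (real model files add some overhead for embeddings and quantization scales, so treat these as floors):

```python
def weight_gb(params_billions: float, bits: int) -> float:
    """Approximate weight size in GB: params x (bits / 8) bytes each."""
    return params_billions * bits / 8

for bits, name in ((4, "Q4"), (8, "Q8"), (16, "FP16")):
    print(f"7B at {name}: ~{weight_gb(7, bits):.1f} GB")
```

This is why a 7B model at Q4 (~3.5 GB of weights) fits comfortably in 8 GB of VRAM, while the same model at FP16 (~14 GB) does not.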
VRAM
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Unified Memory
Apple Silicon uses a single pool of fast RAM shared between the CPU and GPU. The more unified memory, the larger the model that can run entirely at full bandwidth, with no PCIe bottleneck.