What is Memory Bandwidth?
How fast data moves between memory and the processor, measured in GB/s. Tokens per second scales nearly linearly with bandwidth — this is the single most important GPU spec for LLM speed.
Full Explanation
Memory bandwidth measures how many gigabytes of data can be moved between memory and compute cores per second. For LLM inference, every generated token requires reading the entire model's weights from memory at least once. A 7B model at Q4 weighs roughly 4 GB; generating one token therefore requires 4 GB of memory reads. At 672 GB/s (RTX 5070), that supports up to ~168 theoretical token-read cycles per second — which is why real-world benchmarks land around 118 t/s after overhead.
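The back-of-envelope estimate above can be sketched as a one-liner. This is a minimal illustration, not a benchmark: the function name is made up here, and it gives only the bandwidth ceiling, ignoring KV-cache reads, kernel launch overhead, and compute time.

```python
def theoretical_tps(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Bandwidth ceiling on decode speed: each generated token must
    stream the full weight set from memory at least once."""
    return bandwidth_gb_s / model_size_gb

# ~4 GB Q4 7B model on a 672 GB/s GPU (figures from the text)
print(theoretical_tps(4.0, 672.0))  # → 168.0
```

Real-world throughput lands below this ceiling (the text's ~118 t/s figure is roughly 70% of the theoretical 168).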
Why It Matters for Local AI
Bandwidth is a better predictor of LLM speed than compute FLOPS. A mini PC with 68 GB/s LPDDR5 will always be 8–10× slower than an RTX 5070 at 672 GB/s for the same model, regardless of CPU capability. When comparing hardware, check bandwidth first.
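To make the "check bandwidth first" comparison concrete, here is a sketch that applies the same ceiling formula to the two devices named above. The spec values come from the text; the dictionary layout is just for illustration.

```python
model_gb = 4.0  # ~7B model at Q4, per the text

specs_gb_s = {
    "RTX 5070 (GDDR7)": 672.0,  # GB/s, from the text
    "mini PC (LPDDR5)": 68.0,   # GB/s, from the text
}

for name, bw in specs_gb_s.items():
    print(f"{name}: ceiling ≈ {bw / model_gb:.0f} t/s")

# The ratio 672 / 68 ≈ 9.9 matches the article's 8-10× claim.
print(round(672.0 / 68.0, 1))  # → 9.9
```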
Hardware Relevant to Memory Bandwidth
gpu · 12 GB VRAM · 672 GB/s
mini-pc · 24 GB Unified · 273 GB/s
Related Terms
VRAM
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
GDDR7
The latest generation of GPU memory (2024+). Significantly higher bandwidth than GDDR6X at the same capacity tier. Used in NVIDIA Blackwell cards (the RTX 50 series).
GDDR6
Previous-generation GPU memory. Lower bandwidth than GDDR7, but when paired with larger capacities (e.g., the 16 GB RX 9060 XT) it can offer better model headroom despite lower token speed.
LPDDR4
Low-Power DDR4 — often soldered memory in mini PCs. Lower bandwidth than desktop DDR4 or DDR5. Limits tokens-per-second compared to high-end alternatives.
Tokens/s
Tokens per second — the standard speed metric for LLMs. One token ≈ 0.75 words. Above 10 t/s feels interactive; below 5 t/s feels like watching paint dry.
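The token-to-word conversion above is simple enough to sketch; the function name is invented for illustration, and 0.75 words/token is the rough average quoted in the text, not an exact constant.

```python
def words_per_second(tps: float, words_per_token: float = 0.75) -> float:
    """Convert a tokens/s figure to an approximate reading speed."""
    return tps * words_per_token

# The 10 t/s "feels interactive" threshold works out to ~7.5 words/s
print(words_per_second(10))  # → 7.5
```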