Memory & Storage

What is Memory Bandwidth?

How fast data moves between memory and the processor, measured in GB/s. Tokens per second scales nearly linearly with bandwidth — this is the single most important GPU spec for LLM speed.

Full Explanation

Memory bandwidth measures how many gigabytes of data can be moved between memory and compute cores per second. For LLM inference, every generated token requires reading the entire model's weights from memory at least once. A 7B model at Q4 weighs roughly 4 GB, so generating one token requires roughly 4 GB of memory reads. At 672 GB/s (RTX 5070), that allows at most 672 / 4 ≈ 168 full weight reads per second — a hard ceiling on token rate, which is why real-world benchmarks land around 118 t/s after overhead.
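The arithmetic above can be sketched as a one-line estimator. This is a back-of-envelope model using only the figures quoted in this section; the function name is illustrative, and real throughput lands below the ceiling due to overhead (KV cache reads, kernel launch latency, etc.).

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical ceiling for a memory-bound LLM: one full
    weight read per generated token, limited by bandwidth."""
    return bandwidth_gb_s / model_size_gb

# RTX 5070 (672 GB/s) running a 7B model quantized to Q4 (~4 GB):
print(max_tokens_per_sec(672, 4))  # 168.0 -- real benchmarks land near 118 t/s
```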

Why It Matters for Local AI

Bandwidth is a better predictor of LLM speed than compute FLOPS. A mini PC with 68 GB/s LPDDR5 will run the same model roughly 8–10× slower than an RTX 5070 at 672 GB/s, regardless of CPU capability, because token generation is memory-bound. When comparing hardware, check bandwidth first.
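To see where the 8–10× figure comes from, you can compare the bandwidth figures directly. This is a minimal sketch using the numbers quoted in this article; the device labels are just for illustration.

```python
# Bandwidth figures (GB/s) quoted in this article.
devices = {
    "Mini PC (LPDDR5)": 68,
    "Mac Mini M4 Pro (Unified)": 273,
    "RTX 5070 (GDDR7)": 672,
}

baseline = devices["Mini PC (LPDDR5)"]
for name, bw in devices.items():
    # For a memory-bound workload, the speedup ratio tracks the bandwidth ratio.
    print(f"{name}: {bw} GB/s ({bw / baseline:.1f}x the LPDDR5 mini PC)")
```

The RTX 5070 comes out at about 9.9× the mini PC's bandwidth, which matches the 8–10× speed gap once overhead is factored in.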

Hardware Relevant to Memory Bandwidth

GIGABYTE GeForce RTX 5070 WINDFORCE OC 12G

gpu · Check Price on Amazon · 12 GB VRAM · 672 GB/s

Apple Mac Mini (M4 Pro, 2024)

mini-pc · Check Price on Amazon · 24 GB Unified · 273 GB/s


Related Terms