What is Memory Bandwidth?
How fast data moves between memory and the processor, measured in GB/s. Tokens per second scales nearly linearly with bandwidth — this is the single most important GPU spec for LLM speed.
Full Explanation
Memory bandwidth measures how many gigabytes of data can be moved between memory and compute cores per second. For LLM inference, every generated token requires reading the entire model's weights from memory at least once. A 7B model at Q4 weighs roughly 4 GB; generating one token therefore requires 4 GB of memory reads. At 672 GB/s (RTX 5070), that supports up to ~168 theoretical token-read cycles per second — which is why real-world benchmarks land around 118 t/s after overhead.
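The back-of-envelope estimate above can be sketched as a one-liner. This is a minimal illustration, not a benchmark: the function name is made up here, and it gives only the bandwidth ceiling, ignoring KV-cache reads, kernel launch overhead, and compute time.

```python
def theoretical_tps(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Bandwidth ceiling on decode speed: each generated token must
    stream the full weight set from memory at least once."""
    return bandwidth_gb_s / model_size_gb

# ~4 GB Q4 7B model on a 672 GB/s GPU (figures from the text)
print(theoretical_tps(4.0, 672.0))  # → 168.0
```

Real-world throughput lands below this ceiling (the text's ~118 t/s figure is roughly 70% of the theoretical 168).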
Why It Matters for Local AI
Bandwidth is a better predictor of LLM speed than compute FLOPS. A mini PC with 68 GB/s LPDDR5 will always be 8–10× slower than an RTX 5070 at 672 GB/s for the same model, regardless of CPU capability. When comparing hardware, check bandwidth first.
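To make the "check bandwidth first" comparison concrete, here is a sketch that applies the same ceiling formula to the two devices named above. The spec values come from the text; the dictionary layout is just for illustration.

```python
model_gb = 4.0  # ~7B model at Q4, per the text

specs_gb_s = {
    "RTX 5070 (GDDR7)": 672.0,  # GB/s, from the text
    "mini PC (LPDDR5)": 68.0,   # GB/s, from the text
}

for name, bw in specs_gb_s.items():
    print(f"{name}: ceiling ≈ {bw / model_gb:.0f} t/s")

# The ratio 672 / 68 ≈ 9.9 matches the article's 8-10× claim.
print(round(672.0 / 68.0, 1))  # → 9.9
```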
Hardware Relevant to Memory Bandwidth
gpu · 12 GB VRAM · 672 GB/s
mini-pc · 24 GB Unified · 273 GB/s
Related Terms
VRAM
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
GDDR7
The latest generation of GPU memory (2024+). Significantly higher bandwidth than GDDR6X at the same capacity tier. Used in NVIDIA Blackwell cards (the RTX 50 series).
GDDR6
Previous-generation GPU memory. Lower bandwidth than GDDR7, but when paired with larger capacities (e.g., the 16 GB RX 9060 XT) it can offer better model headroom despite lower token speed.
LPDDR4
Low-Power DDR4 — often soldered memory in mini PCs. Lower bandwidth than desktop DDR4 or DDR5. Limits tokens-per-second compared to high-end alternatives.
Tokens/s
Tokens per second — the standard speed metric for LLMs. One token ≈ 0.75 words. Above 10 t/s feels interactive; below 5 t/s feels like watching paint dry.
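The token-to-word conversion above is simple enough to sketch; the function name is invented for illustration, and 0.75 words/token is the rough average quoted in the text, not an exact constant.

```python
def words_per_second(tps: float, words_per_token: float = 0.75) -> float:
    """Convert a tokens/s figure to an approximate reading speed."""
    return tps * words_per_token

# The 10 t/s "feels interactive" threshold works out to ~7.5 words/s
print(words_per_second(10))  # → 7.5
```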