What is VRAM?
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Full Explanation
VRAM (Video RAM) is high-speed memory mounted on the graphics card right next to the GPU, operating at bandwidths of up to 672 GB/s on the RTX 5070. Unlike system RAM, it sits on a wide, direct path to the GPU's compute cores, so the chip can read model weights at full speed. When a language model's weights exceed available VRAM, Ollama and llama.cpp automatically offload the excess layers to system RAM, but those layers then travel over the PCIe bus at roughly 32 GB/s (PCIe 4.0 x16) or 64 GB/s (PCIe 5.0 x16) instead of hundreds, typically cutting tokens per second by 50–90%.
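To see why that bandwidth gap matters, here is a rough back-of-envelope estimate for a memory-bound workload (a sketch, not a benchmark; the tokens_per_second helper and the assumption that every weight byte is read once per generated token are simplifications):

```python
# Illustrative, memory-bound estimate of decode speed: assumes every weight
# byte must be read once per generated token (ignores compute, KV cache,
# and partial offload).

def tokens_per_second(model_bytes: float, bandwidth_gb_s: float) -> float:
    """Upper-bound tokens/s when generation is limited purely by memory bandwidth."""
    return (bandwidth_gb_s * 1e9) / model_bytes

MODEL_13B_Q4 = 8e9  # ~8 GB of weights for a 13B model at Q4 (approximate)

print(f"All in VRAM  (672 GB/s): ~{tokens_per_second(MODEL_13B_Q4, 672):.0f} tok/s")
print(f"Over PCIe 4.0 (32 GB/s): ~{tokens_per_second(MODEL_13B_Q4, 32):.0f} tok/s")
```

The two printed numbers differ by the same factor as the bandwidths, which is the intuition behind the 50–90% slowdown quoted above.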
Why It Matters for Local AI
A 7B model at Q4 quantization requires ~4 GB of VRAM. A 13B model needs ~8 GB, and a 70B model needs ~40 GB. If you want to run a 13B model with full GPU acceleration, 12 GB of VRAM is a practical minimum once the KV cache and runtime overhead are added on top of the weights; the RTX 5070 sits right at that threshold.
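Those figures can be sanity-checked with a weights-only estimate (a sketch; the ~4.5 effective bits per weight for Q4 and the weight_vram_gb helper are assumptions, and the KV cache plus runtime buffers add roughly another 1–2 GB on top):

```python
# Weights-only VRAM estimate: parameters x effective bits per weight.
# Q4 quants land near ~4.5 bits/weight once scales and metadata are counted
# (an approximation); KV cache and runtime buffers are NOT included here.

EFFECTIVE_BITS = {"Q4": 4.5, "Q8": 8.5, "FP16": 16.0}

def weight_vram_gb(params_billion: float, quant: str) -> float:
    # params in billions * bits / 8 gives gigabytes of weight data
    return params_billion * EFFECTIVE_BITS[quant] / 8

for size in (7, 13, 70):
    print(f"{size}B @ Q4 ≈ {weight_vram_gb(size, 'Q4'):.1f} GB of weights")
```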
Hardware Relevant to VRAM
GPU · 12 GB VRAM · 672 GB/s
GPU · 16 GB VRAM · 288 GB/s
Related Terms
Unified Memory→
Apple Silicon uses a single pool of fast RAM shared between CPU and GPU. Larger unified memory = larger models run entirely at full bandwidth — no PCIe bottleneck.
Quantization→
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits = less VRAM required, at a slight cost in quality.
Memory Bandwidth→
How fast data moves between memory and the processor, measured in GB/s. Tokens per second scales nearly linearly with bandwidth — this is the single most important GPU spec for LLM speed.
GDDR7→
The latest generation of GPU memory (2024+). Significantly higher bandwidth than GDDR6X at the same capacity tier. Used in NVIDIA Blackwell cards (RTX 50 series, including the RTX 5070).
PCIe→
Peripheral Component Interconnect Express, the bus connecting a discrete GPU to the motherboard. PCIe 4.0 or 5.0 is needed for fast offloading when a model exceeds VRAM.