What is VRAM?
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Full Explanation
VRAM (Video RAM) is high-speed memory mounted on the graphics card right next to the GPU, operating at bandwidths of up to 672 GB/s on the RTX 5070. Unlike system RAM, it sits on a wide, direct path to the GPU's compute cores, so the chip can read model weights at full speed. When a language model's weights exceed available VRAM, Ollama and llama.cpp automatically offload the excess layers to system RAM, but those layers then travel over the PCIe bus at roughly 32 GB/s (PCIe 4.0 x16) or 64 GB/s (PCIe 5.0 x16) instead of hundreds, typically cutting tokens per second by 50–90%.
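To see why that bandwidth gap matters, here is a rough back-of-envelope estimate for a memory-bound workload (a sketch, not a benchmark; the tokens_per_second helper and the assumption that every weight byte is read once per generated token are simplifications):

```python
# Illustrative, memory-bound estimate of decode speed: assumes every weight
# byte must be read once per generated token (ignores compute, KV cache,
# and partial offload).

def tokens_per_second(model_bytes: float, bandwidth_gb_s: float) -> float:
    """Upper-bound tokens/s when generation is limited purely by memory bandwidth."""
    return (bandwidth_gb_s * 1e9) / model_bytes

MODEL_13B_Q4 = 8e9  # ~8 GB of weights for a 13B model at Q4 (approximate)

print(f"All in VRAM  (672 GB/s): ~{tokens_per_second(MODEL_13B_Q4, 672):.0f} tok/s")
print(f"Over PCIe 4.0 (32 GB/s): ~{tokens_per_second(MODEL_13B_Q4, 32):.0f} tok/s")
```

The two printed numbers differ by the same factor as the bandwidths, which is the intuition behind the 50–90% slowdown quoted above.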
Why It Matters for Local AI
A 7B model at Q4 quantization requires ~4 GB of VRAM. A 13B model needs ~8 GB, and a 70B model needs ~40 GB. If you want to run a 13B model with full GPU acceleration, 12 GB of VRAM is a practical minimum once the KV cache and runtime overhead are added on top of the weights; the RTX 5070 sits right at that threshold.
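Those figures can be sanity-checked with a weights-only estimate (a sketch; the ~4.5 effective bits per weight for Q4 and the weight_vram_gb helper are assumptions, and the KV cache plus runtime buffers add roughly another 1–2 GB on top):

```python
# Weights-only VRAM estimate: parameters x effective bits per weight.
# Q4 quants land near ~4.5 bits/weight once scales and metadata are counted
# (an approximation); KV cache and runtime buffers are NOT included here.

EFFECTIVE_BITS = {"Q4": 4.5, "Q8": 8.5, "FP16": 16.0}

def weight_vram_gb(params_billion: float, quant: str) -> float:
    # params in billions * bits / 8 gives gigabytes of weight data
    return params_billion * EFFECTIVE_BITS[quant] / 8

for size in (7, 13, 70):
    print(f"{size}B @ Q4 ≈ {weight_vram_gb(size, 'Q4'):.1f} GB of weights")
```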
Hardware Relevant to VRAM
GPU · 12 GB VRAM · 672 GB/s
GPU · 16 GB VRAM · 288 GB/s
Related Terms
Unified Memory→
Apple Silicon uses a single pool of fast RAM shared between CPU and GPU. Larger unified memory = larger models run entirely at full bandwidth — no PCIe bottleneck.
Quantization→
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits = less VRAM required, at a slight cost in quality.
Memory Bandwidth→
How fast data moves between memory and the processor, measured in GB/s. Tokens per second scales nearly linearly with bandwidth — this is the single most important GPU spec for LLM speed.
GDDR7→
The latest generation of GPU memory (2024+). Significantly higher bandwidth than GDDR6X at the same capacity tier. Used in NVIDIA Blackwell cards (RTX 50 series, including the RTX 5070).
PCIe→
Peripheral Component Interconnect Express, the bus connecting a discrete GPU to the motherboard. PCIe 4.0 or 5.0 is needed for fast offloading when a model exceeds VRAM.