How Much VRAM Do You Need to Run AI Locally?
VRAM is the most confusing part of buying AI hardware. Vendors advertise GPU compute specs. YouTube benchmarks focus on gaming. And the actual AI community guidance is scattered across Reddit threads and Discord servers. This guide gives you the definitive 2026 answer: exactly how much VRAM you need, for which models, and what happens when you don't have enough.
The Simple Rule: Model Size ÷ 2 = Minimum VRAM (GB)
LLMs are stored as weights — floating-point numbers. At Q4 quantization (the most common format for local use), each billion parameters uses approximately 0.5 GB of VRAM. So a 7B model needs ~3.5 GB, a 13B model needs ~7 GB, and a 70B model needs ~35 GB.
But you also need headroom for the KV cache (attention buffers that grow with context length) and the runtime itself. Add 20–30% to the model weight size for a safe estimate.
| Model Size | Q4 Weights | + KV Cache | Recommended VRAM | Fits In |
|---|---|---|---|---|
| 1B–3B | 0.5–1.5 GB | +1 GB | 4 GB | Almost any GPU |
| 7B | 3.5 GB | +1.5 GB | 6–8 GB | 8 GB GPU (tight) |
| 13B | 6.5 GB | +2 GB | 10–12 GB | 12 GB GPU |
| 34B | 17 GB | +3 GB | 24 GB | 24 GB GPU or dual 12 GB |
| 70B | 35 GB | +5 GB | 40–48 GB | Mac Mini M4 Pro 48 GB or multi-GPU |
| 70B MoE | ~22 GB active | +5 GB | 32 GB | 32 GB+ unified-memory Mac or 48 GB GPU |
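The rule of thumb above reduces to a one-line formula. A minimal sketch (the 0.5 GB-per-billion figure for Q4 and the 20–30% headroom are the article's own numbers; 25% is used here as the midpoint):

```python
def estimate_vram_gb(params_billions: float, bits: int = 4, overhead: float = 0.25) -> float:
    """Rough VRAM estimate for running an LLM locally.

    Weights take params * (bits / 8) GB; add ~20-30% on top for the
    KV cache and runtime buffers (0.25 = midpoint used here).
    """
    weights_gb = params_billions * bits / 8
    return weights_gb * (1 + overhead)

# Q4 (4-bit) estimates, consistent with the table above:
print(estimate_vram_gb(7))    # ~4.4 GB -> fits an 8 GB GPU, with context headroom
print(estimate_vram_gb(13))   # ~8.1 GB -> wants a 12 GB GPU
print(estimate_vram_gb(70))   # ~43.8 GB -> 48 GB unified memory or multi-GPU
```

Longer context windows grow the KV cache beyond this flat overhead, so treat the result as a floor, not a ceiling.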
What Happens When You Don't Have Enough VRAM?
Modern LLM runtimes (llama.cpp, Ollama) don't crash — they offload layers to system RAM. Each layer that runs on CPU instead of GPU drops throughput significantly:
- Full GPU inference: 40–120+ tokens/second
- 50% GPU / 50% CPU offload: 8–15 tokens/second
- Full CPU inference: 2–6 tokens/second (on a fast CPU)
CPU offloading is usable for slow queries — overnight summaries, batch processing — but not comfortable chat. If you plan to chat with a model, fit it entirely in VRAM.
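The slowdown from partial offload can be approximated by adding per-token time on each device, since every layer runs in sequence. A sketch using illustrative round numbers from the ranges above (80 t/s full GPU, 5 t/s full CPU are assumptions, not measurements):

```python
def offload_throughput(gpu_fraction: float, gpu_tps: float = 80.0, cpu_tps: float = 5.0) -> float:
    """Estimated tokens/second when gpu_fraction of layers run on the GPU
    and the rest on the CPU. Per-token time is additive across devices:
    t = f_gpu / gpu_tps + f_cpu / cpu_tps."""
    cpu_fraction = 1.0 - gpu_fraction
    return 1.0 / (gpu_fraction / gpu_tps + cpu_fraction / cpu_tps)

print(offload_throughput(1.0))   # 80.0 t/s, fully on GPU
print(offload_throughput(0.5))   # ~9.4 t/s, half offloaded (inside the 8-15 range above)
print(offload_throughput(0.0))   # 5.0 t/s, fully on CPU
```

Note how the slow device dominates: even a 50/50 split lands near CPU speed, which is why fitting the whole model in VRAM matters so much.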
8 GB VRAM: Entry Level
8 GB is the minimum you'll see recommended in 2026. You can run 7B models at Q4 quantization with enough headroom for a 4K context window. Gemma 2 9B runs at Q4 in 8 GB too.
What you can't do comfortably at 8 GB: 13B models (they require offloading), SDXL at high batch sizes, or FLUX.1 Dev (needs ~12 GB). 8 GB was the sweet spot in 2023–2024; in 2026, it's a constraint.
12 GB VRAM: The 2026 Sweet Spot
12 GB fits 13B models at Q4 fully in VRAM — a significant quality jump over 7B. At this tier you also get comfortable FLUX.1 Schnell image generation and can run two smaller models simultaneously if needed.
The RTX 5070 (12 GB GDDR7, 672 GB/s) is the best 12 GB GPU available. Its bandwidth advantage means you generate tokens roughly 33% faster than a 12 GB RTX 4070 Super despite identical VRAM capacity.
16 GB VRAM: Room to Breathe
16 GB lets you run 13B models at Q8 (near-lossless quality) or push toward 34B at Q4. The RX 9060 XT ships with 16 GB GDDR6 — giving it a VRAM edge over the RTX 5070 despite lower bandwidth.
For Stable Diffusion, 16 GB handles FLUX.1 Dev, SD 3.5 Large, and SDXL at high resolutions with ease. At 12 GB you sometimes hit walls with newer architectures.
24 GB+ VRAM: Serious Workloads
24 GB is where you can run 34B models fully in VRAM and begin to approach 70B at aggressive quantization. No discrete GPU ships with 24 GB at mainstream price points: your options are flagship consumer cards (RTX 4090), workstation cards (RTX 6000 Ada), or Apple Silicon unified memory.
The Mac Mini M4 Pro with 24 GB unified memory is the most accessible 24 GB AI machine in 2026. It runs 34B Q4 models fully in memory; a 70B at Q4 (~40 GB with KV cache) does not fit. Upgrade to 48 GB and a 70B Q4 fits entirely, generating roughly 6–8 tokens/second (bandwidth-bound: 273 GB/s divided by ~35 GB of weights). Not fast, but functional.
Discrete VRAM vs Apple Unified Memory: Are They the Same?
Not exactly. Discrete GPU VRAM is dedicated and has much higher bandwidth than system RAM. Apple's unified memory is shared between CPU and GPU, which means the GPU can address nearly the full memory pool, but the bandwidth is lower than equivalent GDDR7.
| Memory Type | Example | Bandwidth | LLM Advantage |
|---|---|---|---|
| GDDR7 (discrete) | RTX 5070 12 GB | 672 GB/s | Fastest token gen at model size |
| GDDR6 (discrete) | RX 9060 XT 16 GB | ~322 GB/s | More capacity, slower per-token |
| Unified (Apple M4) | Mac Mini M4 Pro 24 GB | 273 GB/s | Full pool available to GPU, runs 70B |
| DDR5 system RAM | Any mini PC 32 GB | 51–68 GB/s | CPU offload only, very slow |
For models that fit in your VRAM, discrete is faster. For models that don't fit, Apple unified memory is your only affordable option short of enterprise hardware.
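The bandwidth column translates directly into a speed ceiling: generating one token requires reading every active weight once, so single-stream throughput cannot exceed bandwidth divided by model size. A sketch using the figures from the tables above (real-world numbers land below this ceiling due to compute and KV-cache reads):

```python
def bandwidth_ceiling_tps(bandwidth_gbs: float, model_gb: float) -> float:
    """Theoretical upper bound on tokens/second for single-stream decoding:
    each generated token streams all model weights through memory once."""
    return bandwidth_gbs / model_gb

# Model sizes from the Q4 table above (7B ~= 3.5 GB, 70B ~= 35 GB):
print(bandwidth_ceiling_tps(672, 3.5))   # RTX 5070 + 7B Q4: 192 t/s ceiling
print(bandwidth_ceiling_tps(273, 35))    # M4 Pro + 70B Q4: ~7.8 t/s ceiling
print(bandwidth_ceiling_tps(64, 35))     # DDR5 system RAM + 70B Q4: ~1.8 t/s ceiling
```

This is why the DDR5 row is "CPU offload only, very slow": the memory simply cannot feed the weights fast enough, regardless of how fast the CPU is.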
VRAM Requirements for Image Generation
| Model | Min VRAM | Recommended | Notes |
|---|---|---|---|
| Stable Diffusion 1.5 | 4 GB | 6 GB | Legacy, still useful for LoRAs |
| SDXL | 8 GB | 10 GB | Standard quality benchmark |
| FLUX.1 Schnell | 10 GB | 12 GB | Fast, high quality |
| FLUX.1 Dev | 12 GB | 16 GB | Best open-source image model |
| SD 3.5 Large | 14 GB | 16 GB | Latest Stability AI flagship |
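Image-model requirements follow the same weights-times-precision arithmetic as LLMs. A sketch for FLUX.1, whose transformer has roughly 12B parameters (the flat per-precision math here is an approximation; text encoders, VAE, and activations add several GB on top, which is why the table's minimums sit above the raw weight size):

```python
def flux_weights_gb(params_billions: float = 12.0, bits: int = 16) -> float:
    """Weight footprint of FLUX.1's ~12B-parameter transformer at a given
    precision. Excludes text encoders, VAE, and activation memory."""
    return params_billions * bits / 8

print(flux_weights_gb(bits=16))  # 24.0 GB at bfloat16
print(flux_weights_gb(bits=8))   # 12.0 GB at Q8
print(flux_weights_gb(bits=4))   # 6.0 GB at NF4/Q4 (8 GB practical minimum)
```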
Frequently Asked Questions
Q1: Is 8 GB VRAM enough for AI in 2026?
Barely. 8 GB runs 7B models at Q4 quantization but struggles with 13B models, newer architectures, and high-resolution image generation. It was the sweet spot in 2023 but models have grown. If you're buying new hardware in 2026, target 12 GB minimum.
Q2: Does system RAM help when you don't have enough VRAM?
Yes — llama.cpp and Ollama offload layers to system RAM automatically. But speed drops dramatically: you might go from 80 tokens/second (full GPU) to 8 tokens/second (half offloaded). More system RAM lets you run larger models at slow speeds. Fast DDR5 RAM helps slightly.
Q3: Why does the Mac Mini M4 Pro feel fast despite lower bandwidth than the RTX 5070?
Two reasons. First, the M4 Pro chip is highly optimized for matrix operations at low power. Second, unified memory means models don't need to be split between VRAM and system RAM — the GPU sees all 24 GB at 273 GB/s, versus a GPU that might only see 12 GB at 672 GB/s before offloading starts.
Q4: What is Q4 quantization and does it hurt quality?
Q4 quantization stores model weights at 4-bit precision instead of 16-bit, cutting memory use by 75%. Quality loss is minimal for most tasks — perplexity scores typically increase by 1–5%. For casual use, Q4 is indistinguishable from fp16. For precise technical tasks, Q8 is preferred if your VRAM allows it.
Q5: Can I use system RAM instead of VRAM to run LLMs?
Yes, via CPU offloading. llama.cpp and Ollama can split a model between GPU VRAM and system RAM. The catch: any layer that runs on CPU is roughly 10× slower than GPU inference. A 70B model split across 12 GB VRAM and 32 GB RAM might run at 3–5 t/s, versus ~6–8 t/s with the model fully in unified memory on a 48 GB Mac Mini M4 Pro. On Apple Silicon this penalty is smaller because CPU and GPU share the same unified memory pool.
Q6: How much VRAM do I need for FLUX.1 image generation?
FLUX.1 Dev at full bfloat16 precision requires 24 GB VRAM. With 8-bit quantization (Q8): 12 GB. With 4-bit quantization (NF4/Q4): 8 GB, though quality degrades slightly. SDXL 1.0: 8 GB minimum, 10–12 GB comfortable. Stable Diffusion 1.5: 4 GB minimum. The RTX 5070 (12 GB) hits the FLUX.1 Q8 sweet spot.
Q7: Does quantization affect the quality of LLM outputs?
At Q4 (4-bit), perplexity increases slightly vs full float16 — most users cannot detect the difference in conversational use. Q8 (8-bit) is near-identical to float16 for all practical purposes. Q2 starts showing noticeable quality degradation. The rule of thumb: use Q4 if VRAM is the constraint; use Q8 if you have headroom; use float16 only on high-VRAM hardware for critical tasks.
Q8: What happens when a model doesn't fully fit in VRAM?
Modern runtimes don't crash — they split the model. llama.cpp's --n-gpu-layers parameter lets you specify how many layers to offload to GPU. Layers not offloaded run on CPU. The result is a speed penalty proportional to the CPU-offloaded fraction. A 13B model with 4 layers on CPU might run at 60–70% of full GPU speed. A 70B model with 50% on CPU might run at 10–20% of theoretical GPU speed.