Analysis · 7 min read · April 22, 2026 · By Alex Voss

How Much VRAM Do You Need to Run AI Locally?

VRAM is the most confusing part of buying AI hardware. Vendors advertise GPU compute specs. YouTube benchmarks focus on gaming. And the actual AI community guidance is scattered across Reddit threads and Discord servers. This guide gives you the definitive 2026 answer: exactly how much VRAM you need, for which models, and what happens when you don't have enough.

TL;DR: 7B models need ~5 GB VRAM (Q4 quant), 13B needs ~8 GB, 70B needs ~40 GB. Minimum for useful local AI: 16 GB unified memory on macOS or 12 GB VRAM on Windows/Linux. Anything under 8 GB is borderline for 2026 models.

The Simple Rule: Model Size ÷ 2 = Minimum VRAM (GB)

LLMs are stored as weights — floating-point numbers. At Q4 quantization (the most common format for local use), each billion parameters uses approximately 0.5 GB of VRAM. So a 7B model needs ~3.5 GB, a 13B model needs ~7 GB, and a 70B model needs ~35 GB.

But you also need headroom for the KV cache (attention buffers that grow with context length) and the runtime itself. Add 20–30% to the model weight size for a safe estimate.
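
As a quick sanity check, the rule plus headroom can be written as a few lines of code. This is a rough floor, not a recommendation; the 25% headroom is simply the midpoint of the 20–30% range above:

```python
def estimate_vram_gb(params_billions: float, gb_per_billion: float = 0.5,
                     headroom: float = 0.25) -> float:
    """Rough minimum-VRAM estimate for a Q4-quantized LLM.

    gb_per_billion = 0.5 matches the ~4-bit rule above; headroom covers
    the KV cache and runtime overhead (20-30% per the text, 0.25 is a
    midpoint assumption). Treat the result as a floor, not a recommendation.
    """
    weights_gb = params_billions * gb_per_billion
    return weights_gb * (1 + headroom)

for size in (7, 13, 34, 70):
    print(f"{size}B at Q4: >= {estimate_vram_gb(size):.1f} GB")
```

The recommended figures in the table below round up further to leave room for longer contexts.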

Model Size | Q4 Weights | + KV Cache | Recommended VRAM | Fits In
1B–3B | 0.5–1.5 GB | +1 GB | 4 GB | Almost any GPU
7B | 3.5 GB | +1.5 GB | 6–8 GB | 8 GB GPU (tight)
13B | 6.5 GB | +2 GB | 10–12 GB | 12 GB GPU
34B | 17 GB | +3 GB | 24 GB | 24 GB GPU or dual 12 GB
70B | 35 GB | +5 GB | 40–48 GB | Mac Mini M4 Pro 48 GB or multi-GPU
70B MoE | ~22 GB active | +5 GB | 32 GB | Mac Pro M4 or 48 GB GPU

What Happens When You Don't Have Enough VRAM?

Modern LLM runtimes (llama.cpp, Ollama) don't crash — they offload layers to system RAM. Each layer that runs on CPU instead of GPU drops throughput significantly:

  • Full GPU inference: 40–120+ tokens/second
  • 50% GPU / 50% CPU offload: 8–15 tokens/second
  • Full CPU inference: 2–6 tokens/second (on a fast CPU)

CPU offloading is usable for slow queries — overnight summaries, batch processing — but not comfortable chat. If you plan to chat with a model, fit it entirely in VRAM.
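
If you drive llama.cpp through the llama-cpp-python bindings, the GPU/CPU split is controlled by the n_gpu_layers argument. A minimal sketch, assuming a local GGUF file (the path below is a placeholder):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# n_gpu_layers controls how many transformer layers live in VRAM.
# -1 offloads everything (fastest, if the model fits); a smaller number
# keeps the remaining layers on CPU/system RAM at a steep speed penalty.
llm = Llama(
    model_path="./models/llama-13b-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,   # try -1 first; reduce if you hit out-of-memory
    n_ctx=4096,        # context length -- larger contexts grow the KV cache
)

out = llm("Explain why the KV cache grows with context length.", max_tokens=128)
print(out["choices"][0]["text"])
```

Ollama makes the same decision automatically based on detected VRAM, so you only see the effect as slower generation when a model doesn't fully fit.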

8 GB VRAM: Entry Level

8 GB is the minimum you'll see recommended in 2026. You can run 7B models at Q4 quantization with enough headroom for a 4K context window. Gemma 2 9B runs at Q4 in 8 GB too.

What you can't do comfortably at 8 GB: 13B models (they require offloading), SDXL at high batch sizes, or FLUX.1 Dev (needs ~12 GB). 8 GB was the sweet spot in 2023–2024; in 2026, it's a constraint.

Avoid 4 GB and 6 GB GPUs entirely for LLMs in 2026. Even quantized 7B models run poorly and the context window is severely limited.

12 GB VRAM: The 2026 Sweet Spot

12 GB fits 13B models at Q4 fully in VRAM — a significant quality jump over 7B. At this tier you also get comfortable FLUX.1 Schnell image generation and can run two smaller models simultaneously if needed.

The RTX 5070 (12 GB GDDR7, 672 GB/s) is the best 12 GB GPU available. Its bandwidth advantage means you generate tokens roughly 33% faster than a 12 GB RTX 4070 Super despite identical VRAM capacity.

16 GB VRAM: Room to Breathe

16 GB lets you run 13B models at Q8 (near-lossless quality) or push toward 34B at Q4. The RX 9060 XT ships with 16 GB GDDR6 — giving it a VRAM edge over the RTX 5070 despite lower bandwidth.

For Stable Diffusion, 16 GB handles FLUX.1 Dev, SD 3.5 Large, and SDXL at high resolutions with ease. At 12 GB you sometimes hit walls with newer architectures.

24 GB+ VRAM: Serious Workloads

24 GB is where you can run 34B models fully in VRAM and begin to approach 70B at aggressive quantization. No discrete GPU currently ships with 24 GB at a mainstream price point: your options are the RTX 4090 (a consumer card, but priced well above mainstream), workstation cards like the RTX 6000 Ada, or Apple Silicon unified memory.

The Mac Mini M4 Pro with 24 GB unified memory is the most accessible 24 GB AI machine in 2026; it handles 34B models at Q4 with room to spare. A 70B model at Q4 (~40 GB of weights) only fits once you step up to the 48 GB configuration, where the 273 GB/s of memory bandwidth caps generation at single-digit speeds, on the order of 5 tokens/second: not fast, but functional.

Discrete VRAM vs Apple Unified Memory: Are They the Same?

Not exactly. Discrete GPU VRAM is dedicated and has higher bandwidth than system RAM. Apple's unified memory is shared between CPU and GPU — which means the GPU sees the full memory pool without a penalty, but the bandwidth is lower than equivalent GDDR7.

Memory Type | Example | Bandwidth | LLM Advantage
GDDR7 (discrete) | RTX 5070 12 GB | 672 GB/s | Fastest token generation at a given model size
GDDR6 (discrete) | RX 9060 XT 16 GB | ~384 GB/s | More capacity, slower per token
Unified (Apple M4) | Mac Mini M4 Pro 24–48 GB | 273 GB/s | Full pool visible to the GPU; 48 GB config fits 70B Q4
DDR5 system RAM | Any mini PC 32 GB | 51–68 GB/s | CPU offload only, very slow

For models that fit in your VRAM, discrete is faster. For models that don't fit, Apple unified memory is your only affordable option short of enterprise hardware.
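
For models that do fit, single-stream token generation is largely memory-bandwidth-bound: each new token requires reading roughly the full set of weights once, so bandwidth divided by model size gives a crude theoretical ceiling. The sketch below uses the bandwidth figures from the table above and ignores compute, KV cache traffic, and software overhead, so real throughput lands well below these numbers:

```python
# Rough upper bound: each generated token reads (approximately) every
# weight once, so tokens/s <= bandwidth / model size in bytes.
configs = {
    "RTX 5070 (672 GB/s)":        672,
    "RX 9060 XT (~384 GB/s)":     384,
    "Mac Mini M4 Pro (273 GB/s)": 273,
}
models_gb = {"7B Q4": 3.5, "13B Q4": 6.5, "70B Q4": 35}

for name, bw in configs.items():
    for model, size_gb in models_gb.items():
        print(f"{name} / {model}: <= {bw / size_gb:.0f} tok/s ceiling")
```

Note the 70B row: at 273 GB/s the ceiling is under 8 tokens/second, which is why 70B-class models on Apple Silicon are functional rather than fast.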

VRAM Requirements for Image Generation

Model | Min VRAM | Recommended | Notes
Stable Diffusion 1.5 | 4 GB | 6 GB | Legacy, still useful for LoRAs
SDXL | 8 GB | 10 GB | Standard quality benchmark
FLUX.1 Schnell | 10 GB | 12 GB | Fast, high quality
FLUX.1 Dev | 12 GB | 16 GB | Best open-source image model
SD 3.5 Large | 14 GB | 16 GB | Latest Stability AI flagship

Frequently Asked Questions

Q1: Is 8 GB VRAM enough for AI in 2026?

Barely. 8 GB runs 7B models at Q4 quantization but struggles with 13B models, newer architectures, and high-resolution image generation. It was the sweet spot in 2023 but models have grown. If you're buying new hardware in 2026, target 12 GB minimum.

Q2: Does system RAM help when you don't have enough VRAM?

Yes — llama.cpp and Ollama offload layers to system RAM automatically. But speed drops dramatically: you might go from 80 tokens/second (full GPU) to 8 tokens/second (half offloaded). More system RAM lets you run larger models at slow speeds. Fast DDR5 RAM helps slightly.

Q3: Why does the Mac Mini M4 Pro feel fast despite lower bandwidth than the RTX 5070?

Two reasons. First, the M4 Pro chip is highly optimized for matrix operations at low power. Second, unified memory means models don't need to be split between VRAM and system RAM — the GPU sees all 24 GB at 273 GB/s, versus a GPU that might only see 12 GB at 672 GB/s before offloading starts.

Q4: What is Q4 quantization and does it hurt quality?

Q4 quantization stores model weights at 4-bit precision instead of 16-bit, cutting memory use by 75%. Quality loss is minimal for most tasks — perplexity scores typically increase by 1–5%. For casual use, Q4 is indistinguishable from fp16. For precise technical tasks, Q8 is preferred if your VRAM allows it.

Q5: Can I use system RAM instead of VRAM to run LLMs?

Yes, via CPU offloading. llama.cpp and Ollama can split a model between GPU VRAM and system RAM. The catch: any layer that runs on CPU is roughly 10× slower than GPU inference. A 70B model split across 12 GB of VRAM and 32 GB of system RAM might manage only a couple of tokens per second, versus roughly 5 t/s fully resident in unified memory on a 48 GB Mac Mini M4 Pro. On Apple Silicon the penalty is smaller because CPU and GPU share the same unified memory pool.

Q6: How much VRAM do I need for FLUX.1 image generation?

FLUX.1 Dev at full bfloat16 precision requires 24 GB VRAM. With 8-bit quantization (Q8): 12 GB. With 4-bit quantization (NF4/Q4): 8 GB, though quality degrades slightly. SDXL 1.0: 8 GB minimum, 10–12 GB comfortable. Stable Diffusion 1.5: 4 GB minimum. The RTX 5070 (12 GB) hits the FLUX.1 Q8 sweet spot.
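
For reference, a minimal diffusers sketch for FLUX.1 Schnell (the freely downloadable variant) looks like the following. The model ID and step/guidance settings match the published Schnell defaults, but exact VRAM use depends on resolution and hardware, and enable_model_cpu_offload() is one way to squeeze it onto 12–16 GB cards at some speed cost:

```python
import torch
from diffusers import FluxPipeline  # pip install diffusers accelerate

# bfloat16 weights; on a 12-16 GB card, CPU offload keeps peak VRAM manageable
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # moves idle submodules to system RAM

image = pipe(
    "a workstation GPU on a desk, product photo",
    num_inference_steps=4,   # Schnell is distilled for ~4 steps
    guidance_scale=0.0,      # Schnell ignores guidance
    height=1024, width=1024,
).images[0]
image.save("flux_test.png")
```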

Q7: Does quantization affect the quality of LLM outputs?

At Q4 (4-bit), perplexity increases slightly vs full float16 — most users cannot detect the difference in conversational use. Q8 (8-bit) is near-identical to float16 for all practical purposes. Q2 starts showing noticeable quality degradation. The rule of thumb: use Q4 if VRAM is the constraint; use Q8 if you have headroom; use float16 only on high-VRAM hardware for critical tasks.

Q8: What happens when a model doesn't fully fit in VRAM?

Modern runtimes don't crash — they split the model. llama.cpp's --n-gpu-layers parameter lets you specify how many layers to offload to GPU. Layers not offloaded run on CPU. The result is a speed penalty proportional to the CPU-offloaded fraction. A 13B model with 4 layers on CPU might run at 60–70% of full GPU speed. A 70B model with 50% on CPU might run at 10–20% of theoretical GPU speed.
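
If you want a starting value for --n-gpu-layers rather than trial and error, a crude heuristic is to divide free VRAM by the average per-layer size. This is an assumption-laden sketch (layers aren't all the same size, and the reserve figure for the KV cache is a guess), not llama.cpp's own logic:

```python
def guess_gpu_layers(model_size_gb: float, n_layers: int,
                     vram_free_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM.

    reserve_gb is a guessed allowance for the KV cache and runtime buffers;
    tune it for your context length. The result is a starting value for
    --n-gpu-layers (or n_gpu_layers in llama-cpp-python).
    """
    per_layer_gb = model_size_gb / n_layers
    fit = int((vram_free_gb - reserve_gb) / per_layer_gb)
    return max(0, min(fit, n_layers))

# Example: a 13B Q4 model (~6.5 GB, 40 layers) on an 8 GB card
print(guess_gpu_layers(6.5, 40, 8.0))
```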
