Run Llama 3.1 70B on RTX 5070
How to run Llama 3.1 70B (Q4) on an RTX 5070 12 GB using Ollama — includes VRAM limits, layer offload settings, and expected speed.
- Speed: 12–18 tok/s (with CPU offload)
- Min VRAM: 12 GB (plus ≥ 64 GB system RAM for the offloaded layers)
- Software: Ollama, CUDA 12.4, NVIDIA Driver 565+
Hardware Used in This Guide
NVIDIA GeForce RTX 5070 (12 GB VRAM)
Step-by-Step Setup
1. Install Ollama for Windows/Linux
Download the Ollama installer for your OS. On Linux, the one-liner script handles driver detection automatically.
# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Verify
ollama --version
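Before pulling a 40 GB model, it's worth confirming the GPU is visible and the driver meets the 565+ requirement listed above. A quick check with nvidia-smi (standard query flags, nothing Ollama-specific) reports the driver version and total VRAM:

nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv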
2. Pull Llama 3.1 70B
The Q4_K_M quantized model is ~40 GB. Only ~12 GB fits on the GPU; the rest offloads to CPU RAM, so you need ≥ 64 GB of system RAM to hold the offloaded layers.
ollama pull llama3.1:70b
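To confirm the download completed and that you have the RAM headroom the offload needs, two quick checks (the second is Linux-only):

ollama list    # llama3.1:70b should appear at roughly 40 GB
free -h        # verify total and available system RAM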
3. Set GPU layer count
With 12 GB VRAM, you can fit roughly 25–30 of the 80 transformer layers on the GPU; the remaining layers run on the CPU. Ollama picks a split automatically, but you can tune it with the num_gpu model parameter from inside an interactive session.

ollama run llama3.1:70b
# at the >>> prompt, force 30 layers onto the GPU:
/set parameter num_gpu 30
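The same knob is exposed through Ollama's native REST API as the num_gpu field of options, which is handy if you want to script a benchmark over different splits:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b",
  "prompt": "Test prompt",
  "options": { "num_gpu": 30 }
}'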
4. Run via REST API
Ollama exposes an OpenAI-compatible endpoint, so downstream apps and client libraries that speak the OpenAI API work by pointing their base URL at localhost:11434.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:70b","messages":[{"role":"user","content":"Hello"}]}'
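If your app renders tokens as they arrive, the endpoint also honors the standard OpenAI stream flag and returns the response as server-sent event chunks:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:70b","stream":true,"messages":[{"role":"user","content":"Hello"}]}'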
Optimization Tips
- 12 GB VRAM + 64 GB system RAM gives 12–18 tok/s, 4–6× faster than a CPU-only setup (see the benchmark example after this list).
- For 70B at full GPU speed, pair two RTX 5070s or upgrade to an RTX 5090 (32 GB).
- Llama 3.1 8B fits entirely in 12 GB VRAM and runs at 55–70 tok/s; use it for latency-sensitive tasks.
- Use `ollama ps` to see active models and their VRAM allocation.
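To measure throughput on your own machine rather than trusting the numbers above, ollama run accepts a --verbose flag that prints timing statistics, including the eval rate in tokens per second, after each response:

ollama run llama3.1:70b --verbose "Summarize the plot of Hamlet in two sentences."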