
Run Llama 3.1 70B on RTX 5070

How to run Llama 3.1 70B (Q4) on an RTX 5070 12 GB using Ollama — includes VRAM limits, layer offload settings, and expected speed.

Speed

12–18 tok/s (with CPU offload)

Min Memory

12 GB VRAM, plus ≥ 64 GB system RAM for CPU offload

Software

Ollama, CUDA 12.8+, NVIDIA Driver 570+

Hardware Used in This Guide

GIGABYTE GeForce RTX 5070 WINDFORCE OC 12G

Step-by-Step Setup

  1. Install Ollama for Windows/Linux

    Download the Ollama installer for your OS. On Linux, the one-liner script handles driver detection automatically.

    # Linux
    curl -fsSL https://ollama.com/install.sh | sh
    
    # Verify
    ollama --version
  2. Pull Llama 3.1 70B

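    The Q4_K_M quantized model is ~40 GB, so it cannot fit in 12 GB of VRAM. At most ~12 GB of weights load onto the GPU, and the remaining ~28 GB or more stays in system RAM and runs on the CPU. Plan for ≥ 64 GB of system RAM so the offloaded layers, the KV cache, and the OS all fit without swapping.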

    ollama pull llama3.1:70b
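The memory split described in this step can be sketched as simple arithmetic. The function name `host_ram_share_gb` is hypothetical, and the 40 GB and 12 GB inputs are the approximate figures from this guide. The result is a lower bound, since some VRAM also goes to the KV cache and runtime buffers.

```shell
# Rough memory split for a partially offloaded model (whole-GB arithmetic).
# model_gb and vram_gb are the approximate figures quoted in this guide.
host_ram_share_gb() {
  local model_gb=$1 vram_gb=$2
  echo $(( model_gb - vram_gb ))
}

echo "Weights left for system RAM: ~$(host_ram_share_gb 40 12) GB"

# Linux only: report installed RAM from /proc/meminfo (MemTotal is in kB).
if [ -r /proc/meminfo ]; then
  awk '/^MemTotal:/ { printf "Installed RAM: %d GB\n", $2 / 1024 / 1024 }' /proc/meminfo
fi
```

If the installed RAM is close to the weight share printed on the first line, expect swapping once the OS and other applications are counted.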
  3. Set GPU layer count

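    At ~40 GB across 80 transformer layers, each layer takes about 0.5 GB, so 12 GB of VRAM holds roughly 20 layers once space is set aside for the KV cache and CUDA buffers. The remaining layers run on the CPU. Ollama picks the split automatically, but you can override it with the num_gpu parameter, which is set per request or in a Modelfile rather than as a command-line flag.

    curl http://localhost:11434/api/generate \
      -d '{"model":"llama3.1:70b","prompt":"Test prompt","stream":false,"options":{"num_gpu":20}}'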
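A starting value for the GPU layer count can be approximated from figures already in this guide (~40 GB model, 80 layers, 12 GB VRAM). The sketch below assumes all layers are the same size. The function name `estimate_gpu_layers` and the 2 GB reserve for the KV cache and runtime buffers are assumptions for illustration, not measured values.

```shell
# Estimate how many transformer layers fit in VRAM, assuming equal-sized layers.
# Args: model size (GB), layer count, VRAM (GB), VRAM reserved for KV cache etc. (GB).
# The 2 GB reserve used below is an assumed placeholder, not a measured value.
estimate_gpu_layers() {
  local model_gb=$1 layers=$2 vram_gb=$3 reserve_gb=$4
  local per_layer_mib=$(( model_gb * 1024 / layers ))
  local usable_mib=$(( (vram_gb - reserve_gb) * 1024 ))
  echo $(( usable_mib / per_layer_mib ))
}

estimate_gpu_layers 40 80 12 2   # with a 2 GB reserve
estimate_gpu_layers 40 80 12 0   # upper bound with no reserve
```

Treat the result as a first guess: lower the value if the model fails to load or VRAM runs out at longer context lengths, and raise it while there is headroom.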
  4. Run via REST API

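    Ollama serves an OpenAI-compatible endpoint on port 11434, so most apps written for the OpenAI chat API can use it by pointing their base URL at http://localhost:11434/v1.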

    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"llama3.1:70b","messages":[{"role":"user","content":"Hello"}]}'
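To check the speed figure on your own machine, Ollama's native `/api/generate` response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The helper below is a minimal sketch that converts the two into tokens per second. The function name `tokens_per_sec` and the sample numbers are illustrative, not measurements.

```shell
# Convert Ollama's /api/generate timing fields into tokens per second.
# $1 = eval_count (generated tokens), $2 = eval_duration (nanoseconds).
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Illustrative input: 150 tokens generated in 10 seconds.
tokens_per_sec 150 10000000000

# To get real values, send a non-streaming request and read the two fields
# from the JSON reply:
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model":"llama3.1:70b","prompt":"Hello","stream":false}'
```

Running `ollama run llama3.1:70b --verbose` prints similar timing statistics after each reply without any scripting.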

Optimization Tips

  • 12 GB VRAM + 64 GB system RAM gives 12–18 tok/s — faster than a CPU-only setup by 4–6×.

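  • Two RTX 5070s (24 GB combined) or an RTX 5090 (32 GB) put more layers on the GPU, but the ~40 GB Q4 model still exceeds both, so some CPU offload remains. Running 70B entirely on GPU needs more than 40 GB of total VRAM.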

  • Llama 3.1 8B fits entirely in 12 GB VRAM and runs at 55–70 tok/s — use it for latency-sensitive tasks.

  • Use `ollama ps` to see active models and their VRAM allocation.
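The same allocation data that `ollama ps` displays is available from the `GET /api/ps` endpoint, which reports `size` and `size_vram` in bytes for each loaded model. As a small sketch, the hypothetical helper `gpu_percent` turns those two numbers into the share of the model held in VRAM. The byte counts in the example are illustrative, not measured.

```shell
# Percentage of a loaded model resident in VRAM, from the size and size_vram
# byte counts that Ollama's GET /api/ps endpoint reports for each model.
# $1 = size (total bytes), $2 = size_vram (bytes held in VRAM).
gpu_percent() {
  awk -v total="$1" -v vram="$2" 'BEGIN { printf "%d\n", vram * 100 / total }'
}

# Illustrative byte counts: a 40 GiB model with 10 GiB placed in VRAM.
gpu_percent 42949672960 10737418240

# Real values come from:
#   curl -s http://localhost:11434/api/ps
```

A low percentage means most layers are running on the CPU, which is expected for a 70B model on a 12 GB card.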

Other Hardware for Llama 3.1 70B (Q4)

ASUS Prime GeForce RTX 5070 SFF-Ready 12GB
