Run Llama 3.3 70B on Mac Mini M4 Pro
Complete guide to running Llama 3.3 70B (Q4) locally on the Mac Mini M4 Pro with 24 GB unified memory using Ollama.
Speed
8–12 tok/s
Min Memory
24 GB
Software
Ollama, macOS 14+
Step-by-Step Setup
- 01
Install Ollama
Download and install Ollama for macOS. The installer creates a menu-bar icon and background inference server.
ollama --version
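If you prefer the command line over the .dmg installer, Ollama is also available through Homebrew (assuming Homebrew is already set up). Installed this way there is no menu-bar app, so the server is started manually:

```shell
# Alternative install via Homebrew
brew install ollama

# Start the inference server in the foreground (listens on port 11434);
# the .dmg install runs this for you in the background instead
ollama serve
```

Either route gives you the same `ollama` CLI and the same local API.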
- 02
Pull Llama 3.3 70B
The Q4_K_M quantized model is about 40 GB, which is larger than the 24 GB of unified memory. On Apple Silicon there is no separate system RAM: CPU and GPU share one pool, so weights that don't fit stay memory-mapped on the SSD and are paged in as needed. Throughput drops versus a machine that holds the whole model in memory, but it is still excellent compared to a pure CPU box.
ollama pull llama3.3:70b
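Once the pull finishes, you can confirm the download and check how much disk it uses. The path below is Ollama's default model store; adjust it if you have set `OLLAMA_MODELS`:

```shell
# List pulled models and their on-disk sizes
ollama list

# Total disk usage of Ollama's default model store
du -sh ~/.ollama/models
```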
- 03
Verify GPU layers
Check that Ollama is loading layers onto the Metal GPU rather than falling back to pure CPU. The layer-offload message goes to the server log, not to `ollama run` output, so the quickest check is the PROCESSOR column of `ollama ps` while the model is loaded.
ollama run llama3.3:70b "hi" --verbose
ollama ps
- 04
Run your first prompt
Start inference. Expect roughly 8–12 tokens/s, in the same ballpark as a single RTX 4090, which also has 24 GB of memory and must offload part of a 40 GB model.
ollama run llama3.3:70b "Write a haiku about silicon"
- 05
Point any OpenAI-compatible app at Ollama
Use base URL http://localhost:11434/v1 and any model name. No API key required.
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
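You can exercise the endpoint without any SDK by POSTing to the chat completions route directly with curl. The prompt below is illustrative, and the API key value is ignored by Ollama (some clients simply require the header to be present):

```shell
# JSON request body; the model name matches the one pulled above
BODY='{"model": "llama3.3:70b", "messages": [{"role": "user", "content": "Say hello in five words."}]}'

# POST to Ollama's OpenAI-compatible chat completions endpoint
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d "$BODY"
```

The response follows the standard OpenAI chat completion JSON shape, so existing client code needs no changes beyond the base URL.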
Optimization Tips
- The M4 Pro's 273 GB/s memory bandwidth is why 70B is viable at all: LLM token generation is bound by memory bandwidth more than by raw FLOPS.
- If you hit memory pressure, close Safari and other heavy apps; macOS will reclaim RAM for Ollama automatically.
- Llama 3.3 70B approaches Llama 3.1 405B on many benchmarks, so the 70B sweet spot is real.
- Set `OLLAMA_NUM_PARALLEL=1` to dedicate all memory to one request for maximum single-session speed.
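An `export` in your shell only reaches processes started from that shell, not the menu-bar app. For GUI installs, Ollama's FAQ recommends setting variables with `launchctl` (shown here for `OLLAMA_NUM_PARALLEL`, on the assumption it is read the same way as other Ollama variables):

```shell
# Make the setting visible to the macOS GUI app, not just terminal sessions
launchctl setenv OLLAMA_NUM_PARALLEL 1

# Then quit and reopen Ollama from the menu bar so it picks up the change
```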
Related Guides
Run Llama 3.1 8B on Mac Mini M4
Step-by-step guide to running Llama 3.1 8B locally on the Apple Mac Mini M4 using Ollama — no GPU required.
Run Stable Diffusion on Mac Mini M4
How to run SDXL and FLUX on the Mac Mini M4 using Diffusers or ComfyUI — with expected generation times and optimization tips.