Run Llama 3.1 8B on Mac Mini M4
Step-by-step guide to running Llama 3.1 8B locally on the Apple Mac Mini M4 using Ollama — no GPU required.
Speed
28–35 tok/s
Min Memory
8 GB
Software
Ollama, macOS 14+
Hardware Used in This Guide
Apple Mac Mini M4 (16 GB unified memory)
Step-by-Step Setup
01. Install Ollama
Download Ollama from the official site and run the macOS installer. It installs a background service that handles model downloads and inference.
# Verify install
ollama --version
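If you prefer a package manager, Ollama is also available through Homebrew; a minimal sketch, assuming Homebrew is already set up (the formula ships the CLI only, without the menu-bar app):
# Alternative install via Homebrew (CLI only)
brew install ollama
# Run the server as a managed background service
brew services start ollama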
02. Pull Llama 3.1 8B
The 8B model fits comfortably in the M4's 16 GB unified memory. The download is about 4.7 GB.
ollama pull llama3.1:8b
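Once the pull finishes, you can confirm the model is installed and check its quantization and parameter count:
# List installed models and their on-disk sizes
ollama list
# Show model details (architecture, parameters, quantization)
ollama show llama3.1:8b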
03. Run a test prompt
Run a one-shot prompt to test the model; omit the quoted prompt to drop into an interactive session instead. You should see the first token within 1–2 seconds.
ollama run llama3.1:8b "Explain VRAM in one paragraph"
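To check whether your machine lands in the 28–35 tok/s range quoted above, add the --verbose flag, which prints timing statistics (including eval rate in tokens per second) after the response:
# Benchmark: prints load time, prompt eval rate, and eval rate
ollama run llama3.1:8b --verbose "Explain VRAM in one paragraph"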
04. Serve via REST API (optional)
Ollama serves a REST API on port 11434, including an OpenAI-compatible endpoint at /v1, so any app that supports custom endpoints (Open WebUI, Chatbox, etc.) works out of the box.
# Server is already running after install; test it with curl
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b","prompt":"Hello"}'
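Clients that speak the OpenAI API can use the same server through the /v1 endpoint; a minimal sketch using the chat completions route:
# OpenAI-compatible chat completion against the local server
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","messages":[{"role":"user","content":"Hello"}]}'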
05. Install Open WebUI for a chat interface
For a browser-based ChatGPT-style interface, run Open WebUI via Docker. It auto-discovers your local Ollama instance.
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main
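Once the container is up, the UI is served on the host port mapped above (-p 3000:8080); Open WebUI prompts you to create a local admin account on first launch.
# Open the chat interface in your default browser
open http://localhost:3000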
Optimization Tips
- The M4's 16 GB unified memory can hold Llama 3.1 8B and the OS simultaneously, so you rarely need to evict the model (see the keep-alive sketch after this list).
- For faster responses, close memory-heavy apps (browsers, creative tools) before a long inference session.
- Q4_K_M quantization (Ollama's default for Llama 3.1 8B) gives near-full quality at roughly a quarter of the memory FP16 would need.
- Don't count on the M4's Neural Engine: Ollama runs inference on the GPU via Metal, which offers much better compatibility for LLM workloads, so the Neural Engine sits idle during generation.
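Since the model fits alongside the OS, it is worth keeping it resident between sessions instead of letting Ollama unload it after the default idle timeout. A minimal sketch using the API's keep_alive parameter (-1 keeps the model loaded indefinitely; the OLLAMA_KEEP_ALIVE environment variable sets the same default server-wide):
# Preload the model and keep it in memory indefinitely
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b","keep_alive":-1}'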
Other Hardware for Llama 3.1 8B
Apple Mac Mini M4 Pro · 24 GB Unified Memory
Related Guides
Run Llama 3.3 70B on Mac Mini M4 Pro→
Complete guide to running Llama 3.3 70B (Q4) locally on the Mac Mini M4 Pro with 24 GB unified memory using Ollama.
Run Stable Diffusion on Mac Mini M4→
How to run SDXL and FLUX on the Mac Mini M4 using Diffusers or ComfyUI — with expected generation times and optimization tips.