
Run Llama 3.1 8B on Mac Mini M4

Step-by-step guide to running Llama 3.1 8B locally on the Apple Mac Mini M4 using Ollama — no GPU required.

Speed

28–35 tok/s

Min Memory

8 GB

Software

Ollama, macOS 14+

Hardware Used in This Guide

Apple Mac Mini (M4, 2024)



Step-by-Step Setup

  1. Install Ollama

    Download Ollama from the official site and run the macOS installer. It installs a background service that handles model downloads and inference.

    # Verify install
    ollama --version
  2. Pull Llama 3.1 8B

    The 8B model fits comfortably in the M4's 16 GB unified memory. The default quantized download is about 4.7 GB.

    ollama pull llama3.1:8b
  3. Run a test prompt

    Start an interactive session and test the model. You should see the first token within 1–2 seconds.

    ollama run llama3.1:8b "Explain VRAM in one paragraph"
  4. Serve via REST API (optional)

    Ollama exposes an HTTP API on port 11434, including an OpenAI-compatible endpoint under /v1, so any app that supports custom endpoints (Open WebUI, Chatbox, etc.) works out of the box.

    # The server is already running after install; query it directly
    curl http://localhost:11434/api/generate \
      -d '{"model":"llama3.1:8b","prompt":"Hello"}'
  5. Install Open WebUI for a chat interface

    For a browser-based ChatGPT-style interface, run Open WebUI via Docker. It auto-discovers your local Ollama instance.

    docker run -d \
      -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
      ghcr.io/open-webui/open-webui:main
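The curl call in step 4 returns a stream of newline-delimited JSON objects rather than a single response. A minimal Python sketch of a client-side helper that reassembles the full text, assuming Ollama's documented `response` and `done` fields:

```python
import json

def collect_stream(ndjson_lines):
    """Assemble the full completion from Ollama's streaming /api/generate
    output: each line is a JSON object carrying a 'response' chunk, and
    the final object has 'done': true."""
    text = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        text.append(obj.get("response", ""))
        if obj.get("done"):
            break
    return "".join(text)

# Canned chunks shaped like Ollama's stream, for illustration:
chunks = [
    '{"model":"llama3.1:8b","response":"VRAM is ","done":false}',
    '{"model":"llama3.1:8b","response":"GPU memory.","done":true}',
]
print(collect_stream(chunks))  # VRAM is GPU memory.
```

In a real client you would iterate over the HTTP response body line by line instead of a canned list; passing `"stream": false` in the request payload avoids the chunking entirely.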

Optimization Tips

  • The M4's 16 GB unified memory can hold Llama 3.1 8B and the OS simultaneously — you rarely need to evict the model.

  • For faster responses, close memory-heavy apps (browsers, creative tools) before a long inference session.

  • Llama 3.1 8B with Q4_K_M quantization (Ollama's default) gives near-full quality at roughly 3–4× less memory than FP16.

  • Despite the M4's Neural Engine, Ollama runs inference on the GPU via Metal rather than the Neural Engine, for broader model compatibility.
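The back-of-envelope arithmetic behind the quantization tip: nominal 4-bit vs 16-bit weights would be a 4× saving, but Q4_K_M stores some tensors at higher precision, so the effective rate is closer to ~4.7 bits per weight (an approximation, inferred from the 4.7 GB download for 8.03B parameters):

```python
# Weight-memory estimate for Llama 3.1 8B (8.03B parameters).
# The ~4.7 bits/weight figure for Q4_K_M is an approximation.
params = 8.03e9

fp16_gb = params * 16 / 8 / 1e9   # 2 bytes per weight
q4km_gb = params * 4.7 / 8 / 1e9  # ~4.7 effective bits per weight

print(f"FP16:   {fp16_gb:.1f} GB")   # ~16.1 GB, too big for a 16 GB machine
print(f"Q4_K_M: {q4km_gb:.1f} GB")   # ~4.7 GB, matching the download size
print(f"ratio:  {fp16_gb / q4km_gb:.1f}x")
```

This is why the quantized model runs on the base 16 GB Mac Mini while the FP16 weights alone would not fit.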

Other Hardware for Llama 3.1 8B

Apple Mac Mini (M4 Pro, 2024)

24 GB unified memory

