
Run Llama 3.1 8B on Mac Mini M4

Step-by-step guide to running Llama 3.1 8B locally on the Apple Mac Mini M4 using Ollama — no GPU required.

Speed

28–35 tok/s

Min Memory

8 GB

Software

Ollama, macOS 14+

Hardware Used in This Guide

Apple Mac Mini (M4, 2024)



Step-by-Step Setup

  1. Install Ollama

    Download Ollama from the official site and run the macOS installer. It installs a background service that handles model downloads and inference.

    # Verify install
    ollama --version
  2. Pull Llama 3.1 8B

    The 8B model fits comfortably in the M4's 16 GB unified memory. The default quantized download is about 4.7 GB.

    ollama pull llama3.1:8b
  3. Run a test prompt

    Start an interactive session and test the model. You should see the first token within 1–2 seconds.

    ollama run llama3.1:8b "Explain VRAM in one paragraph"
  4. Serve via REST API (optional)

    Ollama exposes an HTTP API on port 11434, including an OpenAI-compatible endpoint under /v1, so any app that supports custom endpoints (Open WebUI, Chatbox, etc.) works out of the box.

    # The server is already running after install; query it directly
    curl http://localhost:11434/api/generate \
      -d '{"model":"llama3.1:8b","prompt":"Hello"}'
  5. Install Open WebUI for a chat interface

    For a browser-based ChatGPT-style interface, run Open WebUI via Docker. It auto-discovers your local Ollama instance.

    docker run -d \
      -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
      ghcr.io/open-webui/open-webui:main
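The curl call in step 4 returns a stream of newline-delimited JSON objects rather than a single response. A minimal Python sketch of a client-side helper that reassembles the full text, assuming Ollama's documented `response` and `done` fields:

```python
import json

def collect_stream(ndjson_lines):
    """Assemble the full completion from Ollama's streaming /api/generate
    output: each line is a JSON object carrying a 'response' chunk, and
    the final object has 'done': true."""
    text = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        text.append(obj.get("response", ""))
        if obj.get("done"):
            break
    return "".join(text)

# Canned chunks shaped like Ollama's stream, for illustration:
chunks = [
    '{"model":"llama3.1:8b","response":"VRAM is ","done":false}',
    '{"model":"llama3.1:8b","response":"GPU memory.","done":true}',
]
print(collect_stream(chunks))  # VRAM is GPU memory.
```

In a real client you would iterate over the HTTP response body line by line instead of a canned list; passing `"stream": false` in the request payload avoids the chunking entirely.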

Optimization Tips

  • The M4's 16 GB unified memory can hold Llama 3.1 8B and the OS simultaneously — you rarely need to evict the model.

  • For faster responses, close memory-heavy apps (browsers, creative tools) before a long inference session.

  • Llama 3.1 8B with Q4_K_M quantization (Ollama's default) gives near-full quality at roughly 3–4× less memory than FP16.

  • Despite the M4's Neural Engine, Ollama runs inference on the GPU via Metal rather than the Neural Engine, for broader model compatibility.
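The back-of-envelope arithmetic behind the quantization tip: nominal 4-bit vs 16-bit weights would be a 4× saving, but Q4_K_M stores some tensors at higher precision, so the effective rate is closer to ~4.7 bits per weight (an approximation, inferred from the 4.7 GB download for 8.03B parameters):

```python
# Weight-memory estimate for Llama 3.1 8B (8.03B parameters).
# The ~4.7 bits/weight figure for Q4_K_M is an approximation.
params = 8.03e9

fp16_gb = params * 16 / 8 / 1e9   # 2 bytes per weight
q4km_gb = params * 4.7 / 8 / 1e9  # ~4.7 effective bits per weight

print(f"FP16:   {fp16_gb:.1f} GB")   # ~16.1 GB, too big for a 16 GB machine
print(f"Q4_K_M: {q4km_gb:.1f} GB")   # ~4.7 GB, matching the download size
print(f"ratio:  {fp16_gb / q4km_gb:.1f}x")
```

This is why the quantized model runs on the base 16 GB Mac Mini while the FP16 weights alone would not fit.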

Other Hardware for Llama 3.1 8B

Apple Mac Mini (M4 Pro, 2024)

24 GB unified memory

