How to Install Ollama and Run Local LLMs — Full Setup Guide (2026)
Quick answer: ollama pull llama3.1:8b, then ollama run llama3.1:8b. GPU acceleration is automatic, with no configuration needed on Mac (Metal), Windows (CUDA/ROCm), or Linux (CUDA/ROCm).
What is Ollama?
Ollama is a local LLM runtime that makes running large language models on your own hardware as simple as pulling a Docker image. It handles model downloading, GPU acceleration, quantization, and serving — including an OpenAI-compatible REST API on port 11434. It supports Apple Silicon Metal, NVIDIA CUDA, and AMD ROCm automatically.
Install Ollama on Mac
Download the Ollama macOS app from ollama.com. Open the .dmg, drag to Applications, and launch. Ollama runs as a menu bar app and automatically uses Apple Silicon Metal acceleration — no configuration required. Open Terminal and verify:
ollama --version
# Should output: ollama version 0.x.x
Install Ollama on Windows
Download OllamaSetup.exe from ollama.com and run the installer. Ollama installs as a Windows service and automatically detects NVIDIA CUDA or AMD ROCm. Verify in PowerShell:
ollama --version
ollama list # Shows installed models
Install Ollama on Linux
curl -fsSL https://ollama.com/install.sh | sh
# Installs Ollama as a systemd service
systemctl status ollama # Verify the service is running
For NVIDIA: install the CUDA drivers first. For AMD: install ROCm 6.x before running the install script; Ollama will detect it automatically.
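If you want to confirm the drivers are actually visible before pulling a model, the vendor tools below are a quick sanity check (nvidia-smi ships with the NVIDIA driver, rocminfo with ROCm; the exact wording of Ollama's startup log varies by version):
nvidia-smi # NVIDIA: should list your GPU and driver version
rocminfo # AMD: should list your GPU agents under ROCm
journalctl -u ollama -n 100 --no-pager # Ollama's startup log notes which GPU backend it detected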
Download and Run Your First Model
# Download Llama 3.1 8B (5GB, Q4_K_M quantization)
ollama pull llama3.1:8b
# Start interactive chat
ollama run llama3.1:8b
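# Tip: you can also pass a one-shot prompt instead of opening an interactive chat
# (illustrative example; notes.txt is a hypothetical file)
ollama run llama3.1:8b "Summarize this file: $(cat notes.txt)"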
# Other popular models:
ollama pull deepseek-r1:8b # DeepSeek R1 reasoning model
ollama pull mistral:7b # Mistral 7B
ollama pull phi3:mini # Microsoft Phi-3 Mini (3.8B)
ollama pull qwen2.5:14b # Qwen 2.5 14B
Check GPU Acceleration is Active
# While a model is running, check GPU usage:
ollama ps
# Example output:
# NAME ID SIZE PROCESSOR UNTIL
# llama3.1:8b ... 5.0GB 100% GPU ... ← GPU acceleration active
# llama3.1:8b ... 5.0GB 100% CPU ... ← CPU only (no GPU detected)
Use Ollama as an API
Ollama exposes a REST API on port 11434 by default, including an OpenAI-compatible endpoint under /v1. Any app that supports OpenAI's API can connect to it by setting the base URL to http://localhost:11434/v1.
# Simple API call
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "prompt": "What is VRAM?", "stream": false}'
# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hello"}]}'Connect Open WebUI for a ChatGPT-Like Interface
Open WebUI provides a browser-based chat UI that connects to your local Ollama instance. Install with Docker:
docker run -d -p 3000:8080 \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
# Open http://localhost:3000 in your browser
Recommended Models by Hardware
| Hardware | RAM/VRAM | Recommended Model | Speed |
|---|---|---|---|
| Mac Mini M4 | 16 GB unified | llama3.1:8b or phi3:mini | 42 t/s |
| Mac Mini M4 Pro (24 GB) | 24 GB unified | llama3.1:8b or qwen2.5:14b | 65 t/s (8B) |
| Mac Mini M4 Pro (64 GB) | 64 GB unified | llama3.1:70b | 10 t/s (70B) |
| RTX 5070 Windforce | 12 GB VRAM | llama3.1:8b or qwen2.5:14b | 118 t/s (8B) |
| GEEKOM A6 (32 GB DDR5) | 32 GB system RAM | llama3.1:8b (CPU) | 16 t/s |
Frequently Asked Questions
Q1: Is Ollama free to use?
Yes. Ollama is free and open-source (MIT license). The models it runs are also free to download — Llama 3.1, Mistral, DeepSeek R1, Phi-3, and Qwen are all available at no cost. There are no usage limits, no API keys, and no subscription required.
Q2: Does Ollama use GPU automatically?
Yes. On Mac, Ollama uses Metal automatically. On Windows with an NVIDIA GPU, it uses CUDA automatically — no configuration needed. On Linux, it uses CUDA (NVIDIA) or ROCm (AMD) if the respective drivers are installed. Run `ollama ps` while a model is running to confirm: it shows '100% GPU' when GPU acceleration is active.
Q3: What models can Ollama run?
Ollama supports all major open models: Llama 3.1 (8B, 70B), Mistral (7B), DeepSeek R1 (1.5B to 671B), Phi-3 Mini/Medium, Qwen 2.5 (0.5B to 72B), Gemma 2 (2B, 9B, 27B), Command R, and hundreds more. Browse all available models at ollama.com/library.
Q4: How much storage does Ollama use?
Each model takes 2–40 GB of storage depending on size and quantization. Llama 3.1 8B (Q4_K_M) is ~5 GB. Llama 3.1 70B (Q4_K_M) is ~40 GB. Models are stored in ~/.ollama/models on Mac/Linux and C:\Users\username\.ollama\models on Windows. An NVMe SSD is strongly recommended — loading a 40 GB model from a slow drive takes 60+ seconds.
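To check how much space your local model store is actually using (paths as above; this is a plain disk check, not a special Ollama feature):
du -sh ~/.ollama/models # Total size of the model store on Mac/Linux
ollama list # Per-model sizes as Ollama reports them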
Q5: Can I run Ollama on a CPU without a GPU?
Yes. Ollama falls back to CPU inference if no GPU is detected. On a modern 8-core CPU (Intel 12th Gen, AMD Ryzen 5000+), expect 5–15 tokens/second for 7B models — functional but noticeably slower than GPU inference. The GEEKOM A6 with 32GB DDR5 RAM is a good CPU-only option, running 7B at ~16 t/s and 14B at ~8 t/s.
Q6: How do I update models in Ollama?
Run `ollama pull modelname` again — it checks for updates and downloads only changed layers. To see all installed models: `ollama list`. To remove a model: `ollama rm modelname`. Ollama stores models efficiently using layer deduplication, so multiple model variants share common layers.
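Put together, the day-to-day management loop looks like this (llama3.1:8b is just an example name):
ollama list # See what is installed and how large each model is
ollama pull llama3.1:8b # Re-pull to fetch any updated layers
ollama rm llama3.1:8b # Remove a model you no longer need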
Q7: Can multiple apps use Ollama at the same time?
Yes. Ollama runs as a background service and can handle multiple concurrent API requests. However, GPU VRAM is a shared resource — if two apps request different models simultaneously, Ollama loads one and queues the other (or unloads/reloads if VRAM is insufficient). For single-model workflows, concurrent requests work well.
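If you want to tune this behavior, newer Ollama builds expose environment variables for parallelism and for how many models stay resident; treat the names and values below as a sketch and check `ollama serve --help` for what your version supports:
# Keep up to 2 models loaded and serve 4 requests per model in parallel
# (illustrative values; both models must fit in VRAM at the same time)
OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_NUM_PARALLEL=4 ollama serve
On a systemd or launchd install, set these in the service environment (the same mechanism as OLLAMA_HOST in Q8) rather than running ollama serve by hand.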
Q8: How do I expose Ollama to other devices on my network?
By default, Ollama only listens on localhost. To allow network access, set the environment variable OLLAMA_HOST=0.0.0.0:11434 before starting Ollama. On Mac: set it in the launchd plist. On Linux: set it in the systemd service file. Then access from other devices using your machine's local IP: http://192.168.x.x:11434. Use Tailscale for secure remote access.
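A minimal sketch of both approaches, assuming the default service names created by the installers (on macOS, launchctl setenv is one way to pass the variable to the menu bar app):
# Linux (systemd): add the variable to the ollama unit, then restart
sudo systemctl edit ollama
# In the override file that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama
# macOS (menu bar app): set the variable for launchd, then quit and reopen Ollama
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"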