How to Install Ollama and Run Local LLMs — Full Setup Guide (2026)
Quick answer: ollama pull llama3.1:8b, then ollama run llama3.1:8b. GPU acceleration is automatic, with no configuration needed on Mac (Metal), Windows (CUDA/ROCm), or Linux (CUDA/ROCm).
What is Ollama?
Ollama is a local LLM runtime that makes running large language models on your own hardware as simple as pulling a Docker image. It handles model downloading, GPU acceleration, quantization, and serving — including an OpenAI-compatible REST API on port 11434. It supports Apple Silicon Metal, NVIDIA CUDA, and AMD ROCm automatically.
Install Ollama on Mac
Download the Ollama macOS app from ollama.com. Open the .dmg, drag to Applications, and launch. Ollama runs as a menu bar app and automatically uses Apple Silicon Metal acceleration — no configuration required. Open Terminal and verify:
ollama --version
# Should output: ollama version 0.x.x
Install Ollama on Windows
Download OllamaSetup.exe from ollama.com and run the installer. Ollama installs as a Windows service and automatically detects NVIDIA CUDA or AMD ROCm. Verify in PowerShell:
ollama --version
ollama list # Shows installed models
Install Ollama on Linux
curl -fsSL https://ollama.com/install.sh | sh
# Installs Ollama as a systemd service
systemctl status ollama # Verify the service is running
For NVIDIA: install the CUDA drivers first. For AMD: install ROCm 6.x before running the install script; Ollama will detect it automatically.
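If you want to confirm the drivers are actually visible before pulling a model, the vendor tools below are a quick sanity check (nvidia-smi ships with the NVIDIA driver, rocminfo with ROCm; the exact wording of Ollama's startup log varies by version):
nvidia-smi # NVIDIA: should list your GPU and driver version
rocminfo # AMD: should list your GPU agents under ROCm
journalctl -u ollama -n 100 --no-pager # Ollama's startup log notes which GPU backend it detected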
Download and Run Your First Model
# Download Llama 3.1 8B (5GB, Q4_K_M quantization)
ollama pull llama3.1:8b
# Start interactive chat
ollama run llama3.1:8b
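# Tip: you can also pass a one-shot prompt instead of opening an interactive chat
# (illustrative example; notes.txt is a hypothetical file)
ollama run llama3.1:8b "Summarize this file: $(cat notes.txt)"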
# Other popular models:
ollama pull deepseek-r1:8b # DeepSeek R1 reasoning model
ollama pull mistral:7b # Mistral 7B
ollama pull phi3:mini # Microsoft Phi-3 Mini (3.8B)
ollama pull qwen2.5:14b # Qwen 2.5 14B
Check GPU Acceleration is Active
# While a model is running, check GPU usage:
ollama ps
# Example output:
# NAME ID SIZE PROCESSOR UNTIL
# llama3.1:8b ... 5.0GB 100% GPU ... ← GPU acceleration active
# llama3.1:8b ... 5.0GB 100% CPU ... ← CPU only (no GPU detected)
Use Ollama as an API
Ollama exposes a REST API on port 11434 by default, including an OpenAI-compatible endpoint under /v1. Any app that supports OpenAI's API can connect to it by setting the base URL to http://localhost:11434/v1.
# Simple API call
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "prompt": "What is VRAM?", "stream": false}'
# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hello"}]}'Connect Open WebUI for a ChatGPT-Like Interface
Open WebUI provides a browser-based chat UI that connects to your local Ollama instance. Install with Docker:
docker run -d -p 3000:8080 \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
# Open http://localhost:3000 in your browser
Recommended Models by Hardware
| Hardware | RAM/VRAM | Recommended Model | Speed |
|---|---|---|---|
| Mac Mini M4 | 16 GB unified | llama3.1:8b or phi3:mini | 42 t/s |
| Mac Mini M4 Pro (24 GB) | 24 GB unified | llama3.1:8b or qwen2.5:14b | 65 t/s (8B) |
| Mac Mini M4 Pro (64 GB) | 64 GB unified | llama3.1:70b | 10 t/s (70B) |
| RTX 5070 Windforce | 12 GB VRAM | llama3.1:8b or qwen2.5:14b | 118 t/s (8B) |
| GEEKOM A6 (32 GB DDR5) | 32 GB system RAM | llama3.1:8b (CPU) | 16 t/s |
Frequently Asked Questions
Q1: Is Ollama free to use?
Yes. Ollama is free and open-source (MIT license). The models it runs are also free to download — Llama 3.1, Mistral, DeepSeek R1, Phi-3, and Qwen are all available at no cost. There are no usage limits, no API keys, and no subscription required.
Q2: Does Ollama use GPU automatically?
Yes. On Mac, Ollama uses Metal automatically. On Windows with an NVIDIA GPU, it uses CUDA automatically — no configuration needed. On Linux, it uses CUDA (NVIDIA) or ROCm (AMD) if the respective drivers are installed. Run `ollama ps` while a model is running to confirm: it shows '100% GPU' when GPU acceleration is active.
Q3: What models can Ollama run?
Ollama supports all major open models: Llama 3.1 (8B, 70B), Mistral (7B), DeepSeek R1 (1.5B to 671B), Phi-3 Mini/Medium, Qwen 2.5 (0.5B to 72B), Gemma 2 (2B, 9B, 27B), Command R, and hundreds more. Browse all available models at ollama.com/library.
Q4: How much storage does Ollama use?
Each model takes 2–40 GB of storage depending on size and quantization. Llama 3.1 8B (Q4_K_M) is ~5 GB. Llama 3.1 70B (Q4_K_M) is ~40 GB. Models are stored in ~/.ollama/models on Mac/Linux and C:\Users\username\.ollama\models on Windows. An NVMe SSD is strongly recommended — loading a 40 GB model from a slow drive takes 60+ seconds.
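To check how much space your local model store is actually using (paths as above; this is a plain disk check, not a special Ollama feature):
du -sh ~/.ollama/models # Total size of the model store on Mac/Linux
ollama list # Per-model sizes as Ollama reports them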
Q5: Can I run Ollama on a CPU without a GPU?
Yes. Ollama falls back to CPU inference if no GPU is detected. On a modern 8-core CPU (Intel 12th Gen, AMD Ryzen 5000+), expect 5–15 tokens/second for 7B models — functional but noticeably slower than GPU inference. The GEEKOM A6 with 32GB DDR5 RAM is a good CPU-only option, running 7B at ~16 t/s and 14B at ~8 t/s.
Q6: How do I update models in Ollama?
Run `ollama pull modelname` again — it checks for updates and downloads only changed layers. To see all installed models: `ollama list`. To remove a model: `ollama rm modelname`. Ollama stores models efficiently using layer deduplication, so multiple model variants share common layers.
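Put together, the day-to-day management loop looks like this (llama3.1:8b is just an example name):
ollama list # See what is installed and how large each model is
ollama pull llama3.1:8b # Re-pull to fetch any updated layers
ollama rm llama3.1:8b # Remove a model you no longer need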
Q7: Can multiple apps use Ollama at the same time?
Yes. Ollama runs as a background service and can handle multiple concurrent API requests. However, GPU VRAM is a shared resource — if two apps request different models simultaneously, Ollama loads one and queues the other (or unloads/reloads if VRAM is insufficient). For single-model workflows, concurrent requests work well.
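If you want to tune this behavior, newer Ollama builds expose environment variables for parallelism and for how many models stay resident; treat the names and values below as a sketch and check `ollama serve --help` for what your version supports:
# Keep up to 2 models loaded and serve 4 requests per model in parallel
# (illustrative values; both models must fit in VRAM at the same time)
OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_NUM_PARALLEL=4 ollama serve
On a systemd or launchd install, set these in the service environment (the same mechanism as OLLAMA_HOST in Q8) rather than running ollama serve by hand.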
Q8: How do I expose Ollama to other devices on my network?
By default, Ollama only listens on localhost. To allow network access, set the environment variable OLLAMA_HOST=0.0.0.0:11434 before starting Ollama. On Mac: set it in the launchd plist. On Linux: set it in the systemd service file. Then access from other devices using your machine's local IP: http://192.168.x.x:11434. Use Tailscale for secure remote access.
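A minimal sketch of both approaches, assuming the default service names created by the installers (on macOS, launchctl setenv is one way to pass the variable to the menu bar app):
# Linux (systemd): add the variable to the ollama unit, then restart
sudo systemctl edit ollama
# In the override file that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama
# macOS (menu bar app): set the variable for launchd, then quit and reopen Ollama
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"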