What is Ollama?
Free open-source tool for running LLMs locally on macOS, Linux, and Windows. Download a model with a single command. No cloud account required. Supports Llama, Mistral, Qwen, Phi, and more.
Full Explanation
Ollama is a command-line tool and API server that bundles model downloading, quantization selection, hardware detection, and inference into a single unified workflow. Running "ollama run llama3.1" downloads a pre-quantized build of the model (typically Q4_K_M by default), detects whether you have a CUDA GPU, Apple Silicon, or a CPU-only machine, and starts a local chat session. It also exposes an OpenAI-compatible REST API at localhost:11434, so tools built for the OpenAI API (Continue, Open WebUI, Cursor) can use Ollama as a drop-in backend.
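For example, here is a minimal sketch of pointing the official OpenAI Python client at a local Ollama server. It assumes Ollama is already running on the default port and that the llama3.1 model has been pulled; the model name and prompt are placeholders.

```python
# Minimal sketch: use the OpenAI Python client against Ollama's
# OpenAI-compatible endpoint. Assumes an Ollama server on the default
# port with the llama3.1 model already pulled; adjust names as needed.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
)
print(response.choices[0].message.content)
```

The same base_url swap is what lets most OpenAI-client-based tools treat Ollama as a local drop-in backend.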
Why It Matters for Local AI
Ollama is the fastest path from zero to a running local LLM. On Apple Silicon it uses the Metal backend for GPU acceleration automatically; on NVIDIA GPUs it uses CUDA; on AMD GPUs under Linux it uses ROCm. The community model library (ollama.com/library) hosts hundreds of pre-quantized models, so no manual GGUF downloading is required.
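As an illustrative sketch, models from the library can also be pulled and listed programmatically through Ollama's native REST API (endpoint paths follow Ollama's published API; the model name here is just an example):

```python
# Sketch: pull a model from the Ollama library via the native REST API,
# then list the locally installed models. Assumes an Ollama server is
# running on the default port; "llama3.1" is an example model name.
import json
import requests

# Stream pull progress; the server returns one JSON status object per line.
with requests.post(
    "http://localhost:11434/api/pull",
    json={"model": "llama3.1"},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:
            print(json.loads(line).get("status", ""))

# List the models now available locally.
tags = requests.get("http://localhost:11434/api/tags").json()
print([m["name"] for m in tags.get("models", [])])
```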
Hardware Relevant to Ollama
Mini PC · 16 GB unified memory · 120 GB/s memory bandwidth
Mini PC · 24 GB unified memory · 273 GB/s memory bandwidth
Mini PC · 16 GB unified memory · 51 GB/s memory bandwidth
Mini PC · 16 GB unified memory · 51 GB/s memory bandwidth
Related Terms
GGUF→
The standard file format for quantized LLMs used by llama.cpp and Ollama. Replaces the older GGML format. Stores model weights and metadata in a single portable file.
CUDA→
NVIDIA's proprietary parallel computing platform and the industry standard for AI/ML. Nearly every AI framework (PyTorch, Ollama, ComfyUI) supports CUDA natively, usually before any other GPU backend.
ROCm→
AMD's open-source GPU compute platform — AMD's answer to NVIDIA CUDA. Required for GPU-accelerated AI on AMD cards. Mature on Linux; less reliable on Windows.
MLX→
Apple's open-source machine learning framework optimized for Apple Silicon. Enables fast LLM inference on M-series chips using the unified memory architecture natively.
LM Studio→
A desktop GUI application for downloading and running local LLMs. Cross-platform (Mac, Windows, Linux). Wraps llama.cpp with a ChatGPT-like interface and built-in model browser.
Quantization→
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits mean less VRAM is required, at the cost of a slight quality reduction.
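As a rough illustration of that trade-off, the weight footprint scales with bits per parameter. The sketch below is a back-of-the-envelope estimate only; real memory use also includes the KV cache, activations, and quantization metadata, and the 8B parameter count is just an example.

```python
# Back-of-the-envelope weight size: parameters x bits per parameter / 8 bytes.
# Ignores KV cache, activations, and format overhead; illustrative only.
def weight_size_gb(params_billions: float, bits_per_param: float) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for label, bits in [("Q4", 4), ("Q8", 8), ("FP16", 16)]:
    print(f"8B model at {label}: ~{weight_size_gb(8, bits):.0f} GB")
# Prints roughly: Q4 ~4 GB, Q8 ~8 GB, FP16 ~16 GB
```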