What is llama.cpp?
The foundational C/C++ inference engine for running quantized LLMs locally. Powers Ollama, LM Studio, and most local AI tools under the hood. Supports CPU, CUDA, ROCm, Metal, and Vulkan.
Full Explanation
llama.cpp is a pure C/C++ inference engine created by Georgi Gerganov in early 2023, starting as a weekend project to run Llama on a MacBook. It grew into the foundation of the entire local AI ecosystem. Ollama, LM Studio, and most local AI wrappers use llama.cpp as their inference backend. It supports every major hardware backend — CUDA, ROCm, Metal (Apple), Vulkan — and introduced the GGUF file format. Running llama.cpp directly via command line gives you the most control over context size, batch size, thread count, and layer offloading.
Why It Matters for Local AI
Understanding llama.cpp matters when you need to troubleshoot Ollama performance or configure advanced settings. The "-ngl" flag (short for --n-gpu-layers) controls how many model layers are offloaded to the GPU: set it to 999 to push everything into VRAM, or to a lower number to split the model between GPU and CPU when VRAM is limited.
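A minimal invocation might look like this (the model path and values are placeholders; the flags themselves are real llama.cpp CLI options):

```shell
# Run a quantized model directly with llama.cpp's CLI binary.
# -m    path to a GGUF model file (placeholder path here)
# -ngl  number of layers to offload to the GPU (999 = offload everything that fits)
# -c    context size in tokens
# -t    CPU thread count for the layers that stay on the CPU
# -p    prompt to complete
llama-cli -m ./models/llama-3-8b-q4_k_m.gguf -ngl 999 -c 4096 -t 8 -p "Hello"
```

If the model does not fit in VRAM at -ngl 999, lowering the value (say, -ngl 20) keeps the remaining layers on the CPU at the cost of speed.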
Related Terms
GGUF→
The standard file format for quantized LLMs used by llama.cpp and Ollama. Replaces the older GGML format. Stores model weights and metadata in a single portable file.
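To illustrate the single-file layout: a GGUF file opens with a small fixed header (the magic bytes "GGUF", a version number, a tensor count, and a metadata key/value count). A minimal parsing sketch, run against a synthetic header rather than a real model file:

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    # GGUF header layout (little-endian): 4-byte magic "GGUF",
    # uint32 version, uint64 tensor count, uint64 metadata kv count.
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header for illustration only (values are made up).
fake = struct.pack("<4sIQQ", b"GGUF", 3, 291, 19)
print(parse_gguf_header(fake))  # {'version': 3, 'tensors': 291, 'metadata_kv': 19}
```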
Ollama→
Free open-source tool for running LLMs locally on macOS, Linux, and Windows. Download a model with a single command. No cloud account required. Supports Llama, Mistral, Qwen, Phi, and more.
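The one-command workflow looks like this (the model name is illustrative; any model from the Ollama library works the same way):

```shell
# Downloads the model on first run, then opens an interactive chat session.
ollama run llama3.2
```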
Quantization→
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits = less VRAM required, at a slight cost in quality.
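As a back-of-the-envelope sketch of what those bit widths mean for memory (weights only, ignoring KV cache and runtime overhead), the rule of thumb is parameters × bits ÷ 8:

```python
def weight_size_gb(params_billion: float, bits: float) -> float:
    # parameters * bits-per-weight / 8 bits-per-byte, expressed in GB
    return params_billion * 1e9 * bits / 8 / 1e9

# A 7B-parameter model at the three precisions above (weights only):
print(weight_size_gb(7, 4))   # 3.5  -> Q4
print(weight_size_gb(7, 8))   # 7.0  -> Q8
print(weight_size_gb(7, 16))  # 14.0 -> FP16
```

Real quantized files come out slightly larger, since some tensors (e.g. embeddings) are typically kept at higher precision.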
CUDA→
NVIDIA's proprietary parallel computing platform and the industry standard for AI/ML. Nearly every AI framework (PyTorch, Ollama, ComfyUI) supports CUDA natively, and new features typically land there first.
ROCm→
AMD's open-source GPU compute platform — AMD's answer to NVIDIA CUDA. Required for GPU-accelerated AI on AMD cards. Mature on Linux; less reliable on Windows.