Software & Frameworks

What is llama.cpp?

The foundational C/C++ inference engine for running quantized LLMs locally. Powers Ollama, LM Studio, and most other local AI tools under the hood. Supports CPU, CUDA, ROCm, Metal, and Vulkan.

Full Explanation

llama.cpp is a pure C/C++ inference engine created by Georgi Gerganov in early 2023. It started as a weekend project to run LLaMA on a MacBook and grew into the foundation of the entire local AI ecosystem: Ollama, LM Studio, and most other local AI wrappers use it as their inference backend. It supports every major hardware backend (CUDA, ROCm, Metal on Apple, and Vulkan) and introduced the GGUF file format now standard for quantized local models. Running llama.cpp directly from the command line gives you the most control over context size, batch size, thread count, and layer offloading.
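
Running it from the shell looks something like the sketch below, assuming a recent build where the main example binary is named llama-cli (older releases called it main); the model filename is a placeholder:

  # -m: path to a GGUF model file (placeholder name)
  # -c: context size in tokens, -t: CPU threads, -b: prompt batch size
  # -ngl: number of model layers to offload to the GPU
  ./llama-cli -m models/llama-3-8b-instruct.Q4_K_M.gguf \
    -c 4096 -t 8 -b 512 -ngl 32 \
    -p "Explain the GGUF format in one sentence."

Every knob mentioned above (context size, batch size, thread count, layer offloading) maps to one of these flags.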

Why It Matters for Local AI

Understanding llama.cpp matters when you need to troubleshoot Ollama performance or configure advanced settings. The -ngl flag (long form --n-gpu-layers) controls how many model layers are offloaded to the GPU: set it to 999 to push everything into VRAM, or to a lower number to split the model between GPU and CPU when VRAM is limited.
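
As a rough illustration, here are the two ends of that spectrum with a placeholder model file; the right layer count for a split depends on how much VRAM you have:

  # Full offload: 999 exceeds any real layer count, so all layers land in VRAM
  ./llama-cli -m models/model.Q4_K_M.gguf -ngl 999 -p "Hello"

  # Partial offload: keep 20 layers on the GPU, run the remainder on the CPU
  ./llama-cli -m models/model.Q4_K_M.gguf -ngl 20 -p "Hello"

The usual tuning approach is to raise -ngl until you get close to your VRAM limit, since every offloaded layer is one less layer the slower CPU path has to compute.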

Hardware Relevant to llama.cpp

GIGABYTE GeForce RTX 5070 WINDFORCE OC 12G

gpu · 12 GB VRAM · 672 GB/s

Apple Mac Mini (M4, 2024)

mini-pc · 16 GB Unified · 120 GB/s


Related Terms