Software & Frameworks

What is llama.cpp?

The foundational C/C++ inference engine for running quantized LLMs locally. Powers Ollama, LM Studio, and most other local AI tools under the hood. Supports CPU, CUDA, ROCm, Metal, and Vulkan.

Full Explanation

llama.cpp is a pure C/C++ inference engine created by Georgi Gerganov in early 2023. It started as a weekend project to run LLaMA on a MacBook and grew into the foundation of the entire local AI ecosystem: Ollama, LM Studio, and most other local AI wrappers use it as their inference backend. It supports every major hardware backend (CUDA, ROCm, Metal on Apple, and Vulkan) and introduced the GGUF file format now standard for quantized local models. Running llama.cpp directly from the command line gives you the most control over context size, batch size, thread count, and layer offloading.
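
Running it from the shell looks something like the sketch below, assuming a recent build where the main example binary is named llama-cli (older releases called it main); the model filename is a placeholder:

  # -m: path to a GGUF model file (placeholder name)
  # -c: context size in tokens, -t: CPU threads, -b: prompt batch size
  # -ngl: number of model layers to offload to the GPU
  ./llama-cli -m models/llama-3-8b-instruct.Q4_K_M.gguf \
    -c 4096 -t 8 -b 512 -ngl 32 \
    -p "Explain the GGUF format in one sentence."

Every knob mentioned above (context size, batch size, thread count, layer offloading) maps to one of these flags.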

Why It Matters for Local AI

Understanding llama.cpp matters when you need to troubleshoot Ollama performance or configure advanced settings. The -ngl flag (long form --n-gpu-layers) controls how many model layers are offloaded to the GPU: set it to 999 to push everything into VRAM, or to a lower number to split the model between GPU and CPU when VRAM is limited.
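
As a rough illustration, here are the two ends of that spectrum with a placeholder model file; the right layer count for a split depends on how much VRAM you have:

  # Full offload: 999 exceeds any real layer count, so all layers land in VRAM
  ./llama-cli -m models/model.Q4_K_M.gguf -ngl 999 -p "Hello"

  # Partial offload: keep 20 layers on the GPU, run the remainder on the CPU
  ./llama-cli -m models/model.Q4_K_M.gguf -ngl 20 -p "Hello"

The usual tuning approach is to raise -ngl until you get close to your VRAM limit, since every offloaded layer is one less layer the slower CPU path has to compute.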

Hardware Relevant to llama.cpp

GIGABYTE GeForce RTX 5070 WINDFORCE OC 12G

gpu · 12 GB VRAM · 672 GB/s

Apple Mac Mini (M4, 2024)

mini-pc · 16 GB Unified · 120 GB/s


Related Terms