What is GGUF?
The standard file format for quantized LLMs used by llama.cpp and Ollama. Replaces the older GGML format. Stores model weights and metadata in a single portable file.
Full Explanation
GGUF (GPT-Generated Unified Format) is the model file format introduced by llama.cpp in 2023, now the de facto standard container for quantized open-source LLMs. A GGUF file encodes the model architecture, vocabulary, tokenizer, quantization level, and all weight tensors in a single binary file. Ollama downloads and manages GGUF files behind the scenes; if you download models manually from Hugging Face, you'll typically choose between variants like "Llama-3.1-8B-Q4_K_M.gguf" (smaller, faster) and "Llama-3.1-8B-Q8_0.gguf" (larger, higher quality).
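To show how self-contained the format is, here is a minimal sketch that reads the fixed GGUF header (magic bytes, format version, tensor count, metadata count) in pure Python. It assumes a version 2+ file per the published GGUF spec (version 1 used 32-bit counts), and the filename is just an example:

```python
import struct

def read_gguf_header(path):
    """Read the fixed GGUF header: magic, version, tensor count,
    and metadata key/value count (GGUF spec, version 2+)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # All header integers are little-endian.
        version, = struct.unpack("<I", f.read(4))
        tensor_count, metadata_kv_count = struct.unpack("<QQ", f.read(16))
    return version, tensor_count, metadata_kv_count

# Example filename, as in the variants mentioned above:
version, tensors, kv = read_gguf_header("Llama-3.1-8B-Q4_K_M.gguf")
print(f"GGUF v{version}: {tensors} tensors, {kv} metadata entries")
```

Everything after this header (quantization type, tokenizer, architecture) is stored as metadata key/value pairs in the same file, which is why a single .gguf needs no sidecar config files.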
Why It Matters for Local AI
GGUF files are self-contained and hardware-agnostic — the same file runs on Apple Silicon, NVIDIA, AMD, and CPU. The filename encodes the quantization level: Q4_K_M is the community sweet spot for balancing size and quality, while Q8_0 needs roughly double the VRAM for modestly better output.
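To make the VRAM trade-off concrete, here is a rough sizing sketch. The bits-per-weight figures are approximate averages reported by llama.cpp, not exact values, and real files add a small metadata overhead:

```python
# Approximate bits per weight for common GGUF quantization levels.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "FP16": 16.0}

def approx_size_gb(params_billion: float, quant: str) -> float:
    """Estimate file size / weight memory from parameter count and quant level."""
    total_bytes = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return total_bytes / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"8B model at {quant}: ~{approx_size_gb(8, quant):.1f} GB")
# -> roughly 4.9 GB (Q4_K_M), 8.5 GB (Q8_0), 16 GB (FP16)
```

These estimates line up with real downloads: Llama-3.1-8B at Q4_K_M is about 4.9 GB on Hugging Face. Add a margin for the KV cache and runtime overhead when deciding whether a model fits your VRAM.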
Related Terms
Quantization
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (near-lossless), FP16 = full precision. Fewer bits mean less VRAM required, at a slight cost in output quality.
Ollama
Free open-source tool for running LLMs locally on macOS, Linux, and Windows. Download a model with a single command. No cloud account required. Supports Llama, Mistral, Qwen, Phi, and more.
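Beyond the command line, Ollama exposes a local API. A minimal sketch using its official Python client (pip install ollama); the model name is an example and assumes you have already pulled it and the Ollama server is running:

```python
import ollama  # official Python client for a locally running Ollama server

# Chat with a locally pulled model (pull it first with: ollama pull llama3.1)
response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
)
print(response["message"]["content"])
```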
llama.cpp
The foundational C++ inference engine for running quantized LLMs locally. Powers Ollama, LM Studio, and most local AI tools under the hood. Supports CPU, CUDA, ROCm, and Metal.
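If you want to load a GGUF file directly rather than through Ollama, a minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python); the file path is a placeholder:

```python
from llama_cpp import Llama  # Python bindings to llama.cpp

# Load a GGUF file directly from disk; path is an example.
llm = Llama(model_path="Llama-3.1-8B-Q4_K_M.gguf", n_ctx=2048)
out = llm("Q: What does GGUF stand for?\nA:", max_tokens=64, stop=["\n"])
print(out["choices"][0]["text"])
```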