What is CPU Inference?
Running LLMs on the CPU rather than a GPU. Works on any hardware, no special drivers needed. Limited to ~8–12 t/s on 7B models — fine for background tasks, slow for interactive use.
Full Explanation
CPU inference runs LLM computations on the main processor rather than dedicated GPU hardware. Modern CPUs can run quantized GGUF models via llama.cpp using AVX2/AVX-512 SIMD instructions, but are bottlenecked by system RAM bandwidth (typically 50–100 GB/s) and the lack of thousands of parallel compute units. A high-end Ryzen 9 or Intel Core Ultra achieves 8–15 t/s on 7B Q4 models. This is sufficient for asynchronous tasks (batch summarization, code generation that runs while you take a break) but too slow for fluid conversational AI.
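To see where those headline numbers come from, a back-of-the-envelope model treats each generated token as one full pass over the model weights in RAM, so decode speed is capped at roughly memory bandwidth divided by model size. A minimal sketch in Python; the bandwidth and model-size figures are illustrative assumptions, not benchmarks:

```python
# Rough estimate of bandwidth-bound decode speed for CPU inference.
# Assumption: generating one token streams roughly the whole quantized
# model through RAM, so tokens/s <= bandwidth / model size.

def decode_ceiling_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s when memory bandwidth is the bottleneck."""
    return bandwidth_gb_s / model_size_gb

# Illustrative figure: a 7B model at Q4 occupies roughly 4 GB in RAM.
MODEL_Q4_GB = 4.0

for label, bandwidth in [
    ("Mini PC, LPDDR4 (34 GB/s)", 34.0),
    ("Desktop, dual-channel DDR5 (~80 GB/s)", 80.0),
]:
    ceiling = decode_ceiling_tps(bandwidth, MODEL_Q4_GB)
    print(f"{label}: ceiling ~{ceiling:.0f} t/s")
```

Measured throughput lands below this ceiling once compute time, cache misses, and the growing KV cache are factored in, which is why real-world numbers sit a few tokens per second under the bandwidth limit.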
Why It Matters for Local AI
Budget mini PCs running CPU inference are best deployed as always-on AI servers — think a private Ollama endpoint that family or team members can query, or an automation server processing documents overnight. Don't expect conversational fluency from a $230 mini PC.
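As a sketch of what an always-on endpoint looks like from a client machine, the snippet below posts a prompt to Ollama's HTTP API, which listens on port 11434 by default; the LAN address and model name are placeholder assumptions:

```python
# Query a private Ollama endpoint from another machine on the LAN.
# Assumptions: the server address 192.168.1.50 and the model name "llama3"
# are placeholders; adjust them to match your own server.
import json
import urllib.request

OLLAMA_URL = "http://192.168.1.50:11434/api/generate"

payload = {
    "model": "llama3",
    "prompt": "Summarize the attached meeting notes in five bullet points.",
    "stream": False,  # wait for the full completion rather than streaming tokens
}

request = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    result = json.load(response)

print(result["response"])
```

Because the request blocks until the full completion is ready, this pattern suits the overnight and background workloads described above better than interactive chat.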
Hardware Relevant to CPU Inference
Budget mini PCs in this class typically ship with 16 GB of unified memory and 34–51 GB/s of memory bandwidth.
Related Terms
Tokens/s→
Tokens per second — the standard speed metric for LLMs. One token ≈ 0.75 words. Above 10 t/s feels interactive; below 5 t/s feels like watching paint dry.
LPDDR4→
Low-Power DDR4 — often soldered memory in mini PCs. Lower bandwidth than desktop DDR4 or DDR5. Limits tokens-per-second compared to high-end alternatives.
Ollama→
Free open-source tool for running LLMs locally on macOS, Linux, and Windows. Download a model with a single command. No cloud account required. Supports Llama, Mistral, Qwen, Phi, and more.
Quantization→
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits = less VRAM required, at a slight cost in quality.
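To make that trade-off concrete, here is a rough size calculation for a 7B-parameter model at each precision; it is a sketch that ignores the metadata and higher-precision tensors a real GGUF file carries:

```python
# Approximate on-disk/in-RAM size of a 7B-parameter model at each precision.
# Assumption: size ~= parameters * bits_per_weight / 8; real GGUF files run a
# little larger because some tensors are kept at higher precision.

PARAMS_7B = 7_000_000_000

for name, bits in [("Q4", 4), ("Q8", 8), ("FP16", 16)]:
    size_gb = PARAMS_7B * bits / 8 / 1e9
    print(f"{name}: ~{size_gb:.1f} GB")

# Q4 ~3.5 GB, Q8 ~7.0 GB, FP16 ~14.0 GB: only the 4-bit and 8-bit variants
# fit comfortably in a 16 GB mini PC alongside the OS and KV cache.
```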