What is Thermal Throttling?
When a CPU or GPU automatically reduces clock speed to prevent overheating. In LLM inference, sustained throttling cuts tokens per second mid-generation — especially in small mini PC enclosures.
Full Explanation
Thermal throttling occurs when a processor's temperature exceeds a safe threshold, causing it to reduce operating frequency to lower heat output. LLM inference is a sustained all-core workload — unlike gaming, which has variable load — meaning mini PCs and laptops are more likely to throttle during inference than during typical use. A mini PC that benchmarks at 12 t/s for a 30-second test may sustain only 8 t/s during a 10-minute document summarization task once thermals saturate.
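You can observe this on your own machine by logging clock speed and temperature while inference runs. Below is a minimal Python sketch using the psutil library; the "coretemp" sensor name is a Linux assumption (sensors_temperatures() returns an empty dict on most other platforms), and the 85%-of-baseline threshold is an arbitrary illustration, so adjust both for your hardware.

```python
import time
import psutil

def log_thermals(duration_s: float = 600, interval_s: float = 5) -> None:
    """Poll CPU clock and temperature while an inference job runs elsewhere."""
    baseline_mhz = psutil.cpu_freq().current        # may be None on some platforms
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        freq = psutil.cpu_freq().current
        temps = psutil.sensors_temperatures()       # Linux/FreeBSD only; {} elsewhere
        cores = temps.get("coretemp", [])           # sensor name is a Linux assumption
        hottest = max((t.current for t in cores), default=float("nan"))
        # A sustained drop well below the starting clock suggests throttling.
        if freq < 0.85 * baseline_mhz:
            print(f"possible throttling: {freq:.0f} MHz "
                  f"(baseline {baseline_mhz:.0f} MHz), hottest core {hottest:.0f} C")
        time.sleep(interval_s)

if __name__ == "__main__":
    log_thermals()
```

If clock speed falls while temperature sits pinned at its ceiling, the chassis cooling, not the model or runtime, is the bottleneck.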
Why It Matters for Local AI
Always check sustained-load thermal reviews, not just burst benchmarks, when evaluating mini PCs for local AI. Models with larger cooling solutions (the Geekom AI A7 Max's dual-fan design, for example) sustain performance better than compact single-fan designs. On a desktop build, adding a quality CPU cooler like the Noctua NH-D15 typically eliminates throttling.
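To run your own sustained test rather than rely on a 30-second benchmark, fire back-to-back generations for several minutes and watch whether tokens/s degrades as the chassis heats up. The sketch below assumes a local llama.cpp server exposing an OpenAI-compatible endpoint at localhost:8080; the URL, prompt, and max_tokens value are placeholders for illustration.

```python
import time
import requests

URL = "http://localhost:8080/v1/completions"   # assumed local llama.cpp server address
PROMPT = "Summarize the causes of the Industrial Revolution in detail."

def sustained_tps(minutes: float = 10) -> None:
    """Run back-to-back generations and print tokens/s for each run."""
    deadline = time.monotonic() + minutes * 60
    run = 0
    while time.monotonic() < deadline:
        t0 = time.monotonic()
        resp = requests.post(URL, json={"prompt": PROMPT, "max_tokens": 256},
                             timeout=600).json()
        elapsed = time.monotonic() - t0
        tokens = resp["usage"]["completion_tokens"]  # OpenAI-compatible usage field
        run += 1
        # Early runs reflect burst speed; later runs show the thermally saturated speed.
        print(f"run {run}: {tokens / elapsed:.1f} t/s")

if __name__ == "__main__":
    sustained_tps()
```

A machine that holds its first-run number across the full window is thermally sound; one that fades run over run is throttling.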
Related Terms
TDP (Power Draw)
Thermal Design Power in watts — the maximum sustained power draw. Higher TDP generally means more performance but more heat and electricity cost. Important for 24/7 always-on setups.
Tokens/s
Tokens per second — the standard speed metric for LLMs. One token ≈ 0.75 words. Above 10 t/s feels interactive; below 5 t/s feels like watching paint dry.
CPU Inference
Running LLMs on the CPU rather than a GPU. Works on any hardware, no special drivers needed. Limited to ~8–12 t/s on 7B models — fine for background tasks, slow for interactive use.
NPU
Neural Processing Unit — a dedicated AI accelerator chip. Found in modern Ryzen AI CPUs and Apple Silicon. Offloads specific AI tasks from CPU/GPU but too limited for full LLM inference.