Methodology

How We Test AI Hardware

Every benchmark on The AI Desk follows a fixed protocol. We test on real consumer hardware using the same software and settings as our readers — no special drivers, no manufacturer samples, no cherry-picked runs.

Testing Principles

  • Stock configuration

    No overclocking, no custom BIOS settings, no power limit unlocks. We test what you get out of the box.

  • Real workloads

    We use the same tools our readers use: Ollama, llama.cpp, ComfyUI. Not synthetic GPU benchmarks.

  • Consistent environment

    Tests run with background apps closed. System idle for 10 minutes before benchmarking. 3 runs, median reported.

  • Cross-referenced results

    Our numbers are validated against community benchmarks from r/LocalLLaMA and llama.cpp issue threads. Outliers are investigated, not published.

  • Version transparency

    Every result is tagged with the model, quantization format, tool version, and test date.
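The "3 runs, median reported" protocol above can be sketched as a small helper. `run_benchmark` here is a hypothetical stand-in for any single timed run (one Ollama generation, one SDXL render):

```python
import statistics

def report_median(run_benchmark, runs=3):
    """Run a benchmark callable `runs` times and report the median.

    `run_benchmark` is a hypothetical stand-in for one timed run of
    any tool in our suite; it returns a single number (t/s or seconds).
    """
    results = [run_benchmark() for _ in range(runs)]
    return statistics.median(results)
```

The median (rather than the mean) is what makes a single thermally throttled or cache-cold run unable to skew the published number.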

Benchmark Definitions

Llama 3.1 8B (Q4_K_M)

tokens/second (t/s)

Tool: Ollama 0.3+

Run `ollama run llama3.1:8b --verbose` with a 500-token prompt and record the eval rate (generation tokens/second) from the timing stats printed after the response. 3 runs, median reported.
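Ollama's `--verbose` output ends with a block of timing statistics; the generation speed is the `eval rate` line. A sketch of pulling that number out (the exact label text is Ollama's and may change between versions):

```python
import re

def parse_eval_rate(verbose_output: str):
    """Extract generation speed from `ollama run --verbose` timing stats.

    Assumes a line like 'eval rate:  34.12 tokens/s' somewhere in the
    output; returns the rate as a float, or None if the line is absent.
    """
    m = re.search(r"^eval rate:\s*([\d.]+)\s*tokens/s",
                  verbose_output, re.MULTILINE)
    return float(m.group(1)) if m else None
```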

Llama 3.1 70B (Q4_K_M)

tokens/second (t/s)

Tool: Ollama 0.3+

Same method as 8B. Only reported for hardware with sufficient VRAM/RAM to load the model fully without CPU offload.

Stable Diffusion XL (SDXL)

seconds per image (1024×1024)

Tool: ComfyUI

20-step DPM++ 2M Karras, 1024×1024, no ControlNet. 3 runs, median reported. GPU-accelerated path only.
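For reproducibility, these are the sampler settings as they would appear in a ComfyUI KSampler node (field names follow ComfyUI's API-format workflow JSON; this is a fragment, not a complete workflow — model, prompt, and VAE inputs are omitted, and the CFG value is an assumed default, not part of the stated protocol):

```python
# KSampler node inputs for the SDXL benchmark (sketch).
ksampler_settings = {
    "steps": 20,
    "sampler_name": "dpmpp_2m",  # DPM++ 2M
    "scheduler": "karras",
    "cfg": 7.0,                  # assumption: typical default, not in protocol
    "denoise": 1.0,
}

# EmptyLatentImage node inputs: output resolution.
latent_size = {"width": 1024, "height": 1024, "batch_size": 1}
```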

FLUX.1-dev

seconds per image (1024×1024)

Tool: ComfyUI

20 steps, 1024×1024, fp8 checkpoint. Reported only where VRAM ≥ 12GB. 3 runs, median.

What We Don't Test

  • Training performance — all results are inference only. Training requires different hardware priorities.
  • Multi-GPU setups — all benchmarks are single-GPU or single-system. No NVLink, no tensor parallelism.
  • CPU-only inference for GPU products — if a GPU is reviewed, we only report GPU-accelerated inference.
  • API-based models — no cloud inference times. Local hardware only.

Data Sources

Some products are benchmarked by us directly. Others use verified community data from:

  • r/LocalLLaMA: Community benchmark megathreads for new hardware launches
  • llama.cpp GitHub Issues: Official performance tracking and regression tests
  • Simon Willison's Weblog: Independent Apple Silicon AI benchmarks
  • Manufacturer spec sheets: Memory bandwidth, TDP, and core count — taken as given, not independently verified

Update Policy

Benchmark results are updated when a new driver or software version causes a meaningful performance change (≥10%). Each product page shows a "Last Updated" date. If you notice a benchmark that no longer matches your real-world results, the most likely explanation is a driver update — check the date and compare to your software version.
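The ≥10% threshold is a simple relative-change check against the published number (a sketch; `published` and `rerun` are the old and newly measured results):

```python
def needs_update(published: float, rerun: float, threshold: float = 0.10) -> bool:
    """True when a re-run differs from the published result by >= 10%.

    Regressions and improvements both count; a 5% driver-to-driver
    wobble does not trigger a page update.
    """
    return abs(rerun - published) / published >= threshold
```

For example, a card published at 100 t/s that re-tests at 111 t/s gets its page updated; 105 t/s does not.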