
AI Hardware Glossary

Every term you'll encounter when running LLMs and Stable Diffusion locally — explained without jargon, with real hardware context.

Memory & Storage

Performance & Benchmarks

Quantization

performance

Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = 16-bit (the unquantized baseline). Fewer bits = less VRAM required, with a slight quality reduction.

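Back-of-envelope math makes the VRAM impact concrete. A rough Python sketch counting weights only (KV cache and runtime overhead not included; real GGUF files come out a bit larger):

    # Approximate size of the weights alone at a given precision.
    def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

    for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
        print(f"7B model at {label}: ~{weight_size_gb(7, bits):.1f} GB")
    # FP16 ≈ 13.0 GB, Q8 ≈ 6.5 GB, Q4 ≈ 3.3 GB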

Tokens/s

performance

Tokens per second — the standard speed metric for LLMs. One token ≈ 0.75 words. Above 10 t/s feels interactive; below 5 t/s feels like watching paint dry.

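To see why 10 t/s feels interactive, convert to words per minute with the 0.75 words-per-token rule of thumb (rough figures for English text):

    # 1 token ≈ 0.75 English words
    for tps in (5, 10, 30):
        wpm = tps * 0.75 * 60
        print(f"{tps} t/s ≈ {wpm:.0f} words/minute")
    # 5 t/s ≈ 225 wpm (roughly silent-reading speed), 10 t/s ≈ 450 wpm, 30 t/s ≈ 1350 wpm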

Context Window

performance

The maximum amount of text (in tokens) a model can "see" at once. Larger context = more document history, longer conversations, bigger code files — but requires more VRAM.


KV Cache

performance

Key-Value Cache — stores intermediate attention computations so the model doesn't re-process earlier context on each new token. Larger context = larger KV cache = more VRAM needed.

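A rough sizing formula, sketched in Python. The defaults assume a Llama-3-8B-style configuration (32 layers, 8 KV heads of dimension 128, FP16 cache); other models differ:

    # 2 (K and V) x layers x KV heads x head dim x context length x bytes per element
    def kv_cache_gb(layers=32, kv_heads=8, head_dim=128, ctx=8192, bytes_per_elem=2):
        return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

    print(f"{kv_cache_gb(ctx=8192):.1f} GB")    # ~1 GB at 8K context
    print(f"{kv_cache_gb(ctx=131072):.1f} GB")  # ~16 GB at 128K context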

Max LLM Size

performance

The largest language model this hardware can run with full GPU/unified-memory acceleration, at the specified quantization. As a rough guide, a Q4 model needs about half its parameter count in gigabytes (a 7B model ≈ 4–5 GB, a 70B model ≈ 40+ GB), plus headroom for context.


MoE

performance

Mixture of Experts — a model architecture where only a fraction of parameters activate per token. Enables very large parameter counts at lower inference cost (e.g., DeepSeek-V3, Mixtral).

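Worked numbers, using approximate Mixtral 8x7B figures (~47B parameters in total, 2 of 8 experts ≈ 13B active per token):

    total_b, active_b = 47, 13  # approximate Mixtral 8x7B parameter counts
    print(f"Compute per token: ~{active_b / total_b:.0%} of a dense 47B model")
    # All ~47B weights must still fit in memory; MoE saves compute, not VRAM.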

Speculative Decoding

performance

A speed optimization where a small draft model generates candidate tokens that a larger target model then verifies in parallel — producing multiple tokens per forward pass.

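A toy sketch of the greedy variant. The two "models" below are trivial stand-in functions rather than real LLMs; the point is the propose/verify/accept control flow:

    def draft_next(context):       # fast, sloppy draft model (stand-in)
        return context[-1] + 1 if context[-1] < 5 else 0

    def target_next(context):      # slow, authoritative target model (stand-in)
        return context[-1] + 1

    def speculative_step(context, k=4):
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(context)
        for _ in range(k):
            ctx.append(draft_next(ctx))
            proposal.append(ctx[-1])
        # 2. Target model verifies the proposals (one batched pass in practice).
        accepted = []
        for t in proposal:
            expected = target_next(context + accepted)
            if t == expected:
                accepted.append(t)          # draft guessed right: a free token
            else:
                accepted.append(expected)   # first mismatch: keep target's token, stop
                break
        return context + accepted

    print(speculative_step([1, 2, 3]))  # emits several tokens per target pass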

Flash Attention

performance

A memory-efficient attention algorithm that rewrites the attention computation to minimize GPU memory reads/writes. Reduces VRAM usage and increases throughput, especially at long context lengths.

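You rarely call it by hand; inference engines enable it for you. A minimal PyTorch 2.x sketch (assumes an NVIDIA GPU and half-precision tensors) where scaled_dot_product_attention can dispatch to a fused FlashAttention-style kernel instead of materializing the full attention matrix:

    import torch
    import torch.nn.functional as F

    # batch=1, heads=8, sequence length=4096, head dim=64
    q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)

    # A fused kernel avoids writing the 4096x4096 score matrix to GPU memory.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)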

INT4

performance

4-bit integer quantization — the most common precision level for running large models on consumer hardware. Reduces model size by ~75% vs FP16 with acceptable quality loss for most tasks.


Software & Frameworks

Ollama

software

Free open-source tool for running LLMs locally on macOS, Linux, and Windows. Download a model with a single command. No cloud account required. Supports Llama, Mistral, Qwen, Phi, and more.

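A minimal sketch against Ollama's local HTTP API on its default port; the model tag "llama3.2" is just an example, any model you have pulled works:

    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2", "prompt": "Explain VRAM in one sentence.", "stream": False},
        timeout=120,
    )
    print(resp.json()["response"])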

CUDA

software

NVIDIA's proprietary parallel computing platform. Industry standard for AI/ML. Nearly every AI framework (PyTorch, Ollama, ComfyUI) supports CUDA natively, and usually supports it before any alternative.


ROCm

software

AMD's open-source GPU compute platform — AMD's answer to NVIDIA CUDA. Required for GPU-accelerated AI on AMD cards. Mature on Linux; less reliable on Windows.


GGUF

software

The standard file format for quantized LLMs used by llama.cpp and Ollama. Replaces the older GGML format. Stores model weights and metadata in a single portable file.


MLX

software

Apple's open-source machine learning framework optimized for Apple Silicon. Enables fast LLM inference on M-series chips using the unified memory architecture natively.

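A rough sketch assuming the companion mlx-lm Python package on an Apple Silicon Mac; the model repository name is illustrative, any MLX-converted model works:

    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
    print(generate(model, tokenizer, prompt="Why does unified memory help local LLMs?", max_tokens=100))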

llama.cpp

software

The foundational C++ inference engine for running quantized LLMs locally. Powers Ollama, LM Studio, and most local AI tools under the hood. Supports CPU, CUDA, ROCm, and Metal.

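Its Python bindings (llama-cpp-python) show the typical flow; the model path below is a placeholder for whatever GGUF file you have downloaded:

    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/mistral-7b-instruct-q4_k_m.gguf",  # any GGUF file
        n_ctx=4096,        # context window
        n_gpu_layers=-1,   # offload all layers to GPU (CUDA/ROCm/Metal); omit for CPU-only
    )
    out = llm("Q: What is a context window?\nA:", max_tokens=64, stop=["\n"])
    print(out["choices"][0]["text"])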

LM Studio

software

A desktop GUI application for downloading and running local LLMs. Cross-platform (Mac, Windows, Linux). Wraps llama.cpp with a ChatGPT-like interface and built-in model browser.


ComfyUI

software

The node-based GUI for Stable Diffusion and Flux image generation. Industry standard for advanced AI image workflows. Runs fastest on an NVIDIA CUDA GPU; AMD cards also work via ROCm on Linux.


CPU Inference

software

Running LLMs on the CPU rather than a GPU. Works on any hardware, no special drivers needed. Limited to ~8–12 t/s on 7B models — fine for background tasks, slow for interactive use.


LoRA

software

Low-Rank Adaptation — a fine-tuning technique that trains a tiny set of adapter weights instead of the full model. Runs on consumer GPUs with as little as 8 GB VRAM.

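A minimal sketch with Hugging Face PEFT; the base model and hyperparameters are illustrative, not a tuned recipe:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    config = LoraConfig(
        r=16,                                 # rank of the adapter matrices
        lora_alpha=32,                        # scaling factor
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # typically well under 1% of all weights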

RAG

software

Retrieval-Augmented Generation — a technique that lets an LLM answer questions using external documents by fetching relevant chunks at query time instead of relying on training data alone.

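A toy end-to-end sketch. The word-overlap "embedding" is deliberately crude so the example runs with nothing but NumPy; a real pipeline would use a proper embedding model (see the next entry):

    import numpy as np

    def embed(text, vocab):  # toy stand-in for a real embedding model
        words = text.lower().split()
        return np.array([words.count(w) for w in vocab], dtype=float)

    docs = [
        "Ollama downloads and runs LLMs locally with one command.",
        "SDXL needs roughly 8 GB of VRAM for image generation.",
        "GGUF is the file format used by llama.cpp and Ollama.",
    ]
    question = "how much VRAM does SDXL need"

    vocab = sorted(set(" ".join(docs + [question]).lower().split()))
    doc_vecs = [embed(d, vocab) for d in docs]
    q_vec = embed(question, vocab)

    # Retrieve the most relevant chunk by cosine similarity...
    scores = [v @ q_vec / (np.linalg.norm(v) * np.linalg.norm(q_vec) + 1e-9) for v in doc_vecs]
    context = docs[int(np.argmax(scores))]

    # ...then hand it to a local LLM inside the prompt.
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    print(prompt)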

Embedding Model

software

A model that converts text into numerical vectors for similarity search. Required for RAG pipelines. Much smaller and faster than chat LLMs — runs comfortably on CPU.

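A quick sketch using the sentence-transformers library; all-MiniLM-L6-v2 is named here as one common small example:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # ~80 MB, runs fine on CPU

    sentences = [
        "How much VRAM does a 13B model need?",
        "Memory requirements for a thirteen-billion-parameter LLM",
        "Best pizza toppings",
    ]
    vectors = model.encode(sentences)

    # The two VRAM questions score high; the unrelated sentence scores low.
    print(util.cos_sim(vectors[0], vectors[1]))
    print(util.cos_sim(vectors[0], vectors[2]))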

AWQ

software

Activation-Aware Weight Quantization — a 4-bit quantization method that outperforms GGUF Q4 in quality by identifying and preserving the most important weights. Primarily used with vLLM and HuggingFace.


EXL2

software

ExLlamaV2's quantization format — offers mixed-precision quantization (e.g., 3.0 to 6.0 bits per weight) and is often the highest-quality option for a given VRAM budget on NVIDIA GPUs.


Open WebUI

software

A self-hosted ChatGPT-like web interface for Ollama and OpenAI-compatible APIs. The most popular local AI frontend — runs as a Docker container or alongside Ollama.


Multimodal

software

Models that process both text and images (and sometimes audio or video). Examples: LLaVA, Qwen-VL, Gemma 3. Require additional VRAM for the vision encoder on top of the language model.


Hardware & Architecture

Tensor Cores

hardware

Specialized hardware units on NVIDIA GPUs designed for matrix multiplication — the core math operation in neural networks. 5th-gen Tensor Cores (Blackwell) are significantly faster than 4th-gen (Ada Lovelace) for AI inference.


TDP (Power Draw)

hardware

Thermal Design Power in watts — the maximum sustained power draw. Higher TDP generally means more performance but more heat and electricity cost. Important for 24/7 always-on setups.

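Rough running-cost math, assuming a 300 W average draw and $0.15/kWh (both placeholder numbers, substitute your own):

    watts, price_per_kwh = 300, 0.15  # assumptions, not measurements
    kwh_per_year = watts / 1000 * 24 * 365
    print(f"~{kwh_per_year:.0f} kWh/year, ~${kwh_per_year * price_per_kwh:.0f}/year")
    # ~2628 kWh/year, ~$394/year at these assumptions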

Blackwell

hardware

NVIDIA's 2024–2025 GPU architecture generation. Features 5th-generation Tensor Cores, GDDR7 memory, and significant AI inference performance improvements over Ada Lovelace (RTX 40 series).


RDNA 4

hardware

AMD's 2025 GPU architecture. Brings a notable IPC improvement over RDNA 3 and better AI inference throughput, paired with GDDR6 in the RX 9060 XT series.


NPU

hardware

Neural Processing Unit — a dedicated AI accelerator chip. Found in modern Ryzen AI CPUs and Apple Silicon. Offloads specific AI tasks from the CPU/GPU, but is too limited for full LLM inference.


PCIe

hardware

Peripheral Component Interconnect Express — the bus connecting a discrete GPU to the motherboard. A fast PCIe 4.0 or 5.0 link matters most when a model doesn't fit in VRAM and layers must be shuttled to and from system RAM.


SDXL

hardware

Stable Diffusion XL — the standard 1024×1024 resolution image generation model. Requires 8+ GB VRAM for practical GPU-accelerated generation. The usual benchmark is generation time per image, in seconds.


eGPU

hardware

External GPU — a discrete GPU connected via Thunderbolt to a laptop or mini PC. Enables GPU-accelerated LLM inference on machines that have no internal slot for a graphics card.


NVMe SSD

hardware

High-speed solid-state storage using the PCIe bus. Affects how quickly models load into memory at startup — a PCIe 4.0 NVMe loads a 7B model in ~2 seconds vs ~15 seconds on SATA SSD.


Thermal Throttling

hardware

When a CPU or GPU automatically reduces clock speed to prevent overheating. In LLM inference, sustained throttling cuts tokens per second mid-generation — especially in small mini PC enclosures.


Connectivity & Interfaces

Ready to buy?

See the Hardware That Uses These Specs

Every product page shows benchmarks and specs with inline definitions — hover any term to see what it means.