AI Hardware Glossary
Every term you'll encounter when running LLMs and Stable Diffusion locally — explained without jargon, with real hardware context.
Memory & Storage
VRAM
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Unified Memory
Apple Silicon uses a single pool of fast RAM shared between CPU and GPU. Larger unified memory = larger models run entirely at full bandwidth — no PCIe bottleneck.
Memory Bandwidth
How fast data moves between memory and the processor, measured in GB/s. Tokens per second scales nearly linearly with bandwidth — this is the single most important GPU spec for LLM speed.
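A back-of-the-envelope sketch of why this holds: each generated token must stream every active weight through the processor once, so bandwidth divided by model size gives a rough tokens-per-second ceiling (all figures below are illustrative):

    # Rough upper bound: tokens/s ≈ memory bandwidth / model size in memory
    model_gb = 4.1   # e.g., a 7B model at Q4
    for name, bw_gbs in [("LPDDR5 mini PC", 80), ("Apple M4 Pro", 273), ("RTX 4090", 1008)]:
        print(f"{name}: ~{bw_gbs / model_gb:.0f} tokens/s ceiling")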
GDDR7
The latest generation of GPU memory (2024+). Significantly higher bandwidth than GDDR6X at the same capacity tier. Used in NVIDIA Blackwell cards (RTX 50 series).
GDDR6
Previous-generation GPU memory. Lower bandwidth than GDDR7, but paired with larger capacities (e.g., the 16GB RX 9060 XT) it can offer better model headroom despite lower token speed.
LPDDR4
Low-Power DDR4 — often soldered memory in mini PCs. Lower bandwidth than desktop DDR4 or DDR5. Limits tokens-per-second compared to high-end alternatives.
LPDDR5
Low-Power DDR5 — faster than LPDDR4, common in mid-range mini PCs (2023–2025). Provides 68–85 GB/s bandwidth, enabling noticeably better CPU inference speeds than LPDDR4 systems.
Performance & Benchmarks
Quantization
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits = less VRAM required, with a slight quality reduction.
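The arithmetic is simple enough to sketch: weight storage is roughly parameter count times bits per weight. Real GGUF files add some overhead for metadata and mixed-precision layers:

    # Approximate weight size for a 7B-parameter model at each precision
    params = 7e9
    for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
        print(f"{name}: ~{params * bits / 8 / 1e9:.1f} GB")
    # FP16 ~14.0 GB, Q8 ~7.0 GB, Q4 ~3.5 GB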
Tokens/s
Tokens per second — the standard speed metric for LLMs. One token ≈ 0.75 words. Above 10 t/s feels interactive; below 5 t/s feels like watching paint dry.
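To put those thresholds in reading terms, using the 0.75 words-per-token rule of thumb:

    # 10 t/s ≈ 450 words/minute, faster than most people read (~250 wpm)
    for tps in (5, 10, 30):
        print(f"{tps} t/s ≈ {tps * 0.75 * 60:.0f} words/minute")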
Context Window
The maximum amount of text (in tokens) a model can "see" at once. Larger context = more document history, longer conversations, bigger code files — but requires more VRAM.
KV Cache
Key-Value Cache — stores intermediate attention computations so the model doesn't re-process earlier context on each new token. Larger context = larger KV cache = more VRAM needed.
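A rough size estimate, sketched for a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128, fp16 cache; figures are illustrative):

    # KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim x context x bytes/value
    layers, kv_heads, head_dim = 32, 32, 128
    context, bytes_per_value = 4096, 2
    kv = 2 * layers * kv_heads * head_dim * context * bytes_per_value
    print(f"{kv / 1e9:.1f} GB")   # ~2.1 GB at 4k context; doubles at 8k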
Max LLM Size
The largest language model this hardware can run with full GPU/unified-memory acceleration, at the specified quantization. Larger models require more memory.
MoE
Mixture of Experts — a model architecture where only a fraction of parameters activate per token. Enables very large parameter counts at lower inference cost (e.g., DeepSeek-V3, Mixtral).
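A minimal sketch of the top-k routing idea in pure Python; the expert functions and router scores below are toy stand-ins, not a real model:

    import math

    def softmax(xs):
        m = max(xs)
        es = [math.exp(x - m) for x in xs]
        return [e / sum(es) for e in es]

    def moe_forward(x, experts, router_scores, k=2):
        # Route the token to the k highest-scoring experts only
        top = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)[:k]
        weights = softmax([router_scores[i] for i in top])
        # The remaining experts never run, which is why an 8x7B MoE costs
        # roughly 2x7B of compute per token, not 8x7B
        return sum(w * experts[i](x) for w, i in zip(weights, top))

    experts = [lambda x, s=s: s * x for s in (0.5, 1.0, 1.5, 2.0)]
    print(moe_forward(3.0, experts, router_scores=[0.1, 2.0, 0.3, 1.5]))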
Speculative Decoding
A speed optimization where a small draft model generates candidate tokens that a larger target model then verifies in parallel — producing multiple tokens per forward pass.
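The control flow, sketched with toy stand-in models (this is the greedy variant; production systems accept or reject draft tokens probabilistically):

    def draft_next(ctx):        # stand-in for the small, fast draft model
        return (sum(ctx) * 31 + 7) % 100

    def target_next(ctx):       # stand-in for the large, slow target model
        s = sum(ctx)
        return (s * 31 + 7) % 100 if s % 3 else (s + 1) % 100

    def speculative_step(tokens, k=4):
        # 1. Draft model proposes k tokens autoregressively (cheap)
        ctx, proposals = list(tokens), []
        for _ in range(k):
            t = draft_next(ctx)
            proposals.append(t)
            ctx.append(t)
        # 2. Target model checks all k in one parallel pass (simulated
        #    sequentially here); keep the longest agreeing prefix
        ctx, out = list(tokens), []
        for t in proposals:
            if target_next(ctx) != t:
                break
            out.append(t)
            ctx.append(t)
        # 3. Target supplies one token itself, so every expensive pass
        #    yields at least one token, and up to k+1
        out.append(target_next(ctx))
        return tokens + out

    print(speculative_step([1, 2, 4]))   # emits 3 tokens for one verify pass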
Flash Attention
A memory-efficient attention algorithm that rewrites the attention computation to minimize GPU memory reads/writes. Reduces VRAM usage and increases throughput, especially at long context lengths.
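The memory it avoids is easy to quantify: standard attention materializes a context × context score matrix per head during prompt processing, while Flash Attention computes the same result in tiles without ever storing it. Shapes below are illustrative:

    # Score-matrix memory for one layer of standard attention, fp16
    context, heads, bytes_per = 8192, 32, 2
    print(f"{context**2 * heads * bytes_per / 1e9:.1f} GB")   # ~4.3 GB avoided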
INT4
4-bit integer quantization — the most common precision level for running large models on consumer hardware. Reduces model size by ~75% vs FP16 with acceptable quality loss for most tasks.
Software & Frameworks
Ollama
Free open-source tool for running LLMs locally on macOS, Linux, and Windows. Download a model with a single command. No cloud account required. Supports Llama, Mistral, Qwen, Phi, and more.
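Ollama also serves a local HTTP API (port 11434 by default), so scripts can query it with nothing but the Python standard library. A minimal sketch; the model name is an illustrative choice and assumes it has already been downloaded:

    import json, urllib.request

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": "llama3.2", "prompt": "Why is the sky blue?",
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    print(json.loads(urllib.request.urlopen(req).read())["response"])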
CUDA
NVIDIA's proprietary parallel computing platform. Industry standard for AI/ML. Nearly every AI framework (PyTorch, Ollama, ComfyUI) supports CUDA natively, and usually before any other backend.
ROCm
AMD's open-source GPU compute platform — AMD's answer to NVIDIA CUDA. Required for GPU-accelerated AI on AMD cards. Mature on Linux; less reliable on Windows.
GGUF
The standard file format for quantized LLMs used by llama.cpp and Ollama. Replaces the older GGML format. Stores model weights and metadata in a single portable file.
MLX
Apple's open-source machine learning framework optimized for Apple Silicon. Enables fast LLM inference on M-series chips using the unified memory architecture natively.
llama.cpp
The foundational C++ inference engine for running quantized LLMs locally. Powers Ollama, LM Studio, and most local AI tools under the hood. Supports CPU, CUDA, ROCm, and Metal.
LM Studio
A desktop GUI application for downloading and running local LLMs. Cross-platform (Mac, Windows, Linux). Wraps llama.cpp with a ChatGPT-like interface and built-in model browser.
ComfyUI
The node-based GUI for Stable Diffusion and Flux image generation. Industry standard for advanced AI image workflows. NVIDIA CUDA GPUs offer the most practical speeds; AMD cards work via ROCm on Linux.
CPU Inference
Running LLMs on the CPU rather than a GPU. Works on any hardware, no special drivers needed. Limited to ~8–12 t/s on 7B models — fine for background tasks, slow for interactive use.
LoRA
Low-Rank Adaptation — a fine-tuning technique that trains a tiny set of adapter weights instead of the full model. Runs on consumer GPUs with as little as 8 GB VRAM.
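The trick in miniature: the adapted weight is W' = W + (alpha/r)·B·A, and only the small A and B matrices are trained. A pure-Python sketch with tiny illustrative shapes:

    import random

    d_in, d_out, r, alpha = 4, 4, 2, 8
    W = [[float(i == j) for j in range(d_in)] for i in range(d_out)]   # frozen base weight
    A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]   # zero-initialized: adapter starts as a no-op

    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]

    delta = matmul(B, A)   # d_out x d_in, but rank <= r
    W_adapted = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(d_in)]
                 for i in range(d_out)]
    # Trainable parameters: r * (d_in + d_out) instead of d_in * d_out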
RAG
Retrieval-Augmented Generation — a technique that lets an LLM answer questions using external documents by fetching relevant chunks at query time instead of relying on training data alone.
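The whole pipeline fits in a few lines: embed the documents, find the chunk nearest the query, and prepend it to the prompt. The embed() below is a toy stand-in for a real embedding model:

    import math

    def embed(text):   # toy stand-in; real pipelines use an embedding model
        v = [0.0] * 8
        for i, ch in enumerate(text.lower()):
            v[i % 8] += ord(ch)
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    def cosine(a, b):   # vectors are unit-length, so a dot product suffices
        return sum(x * y for x, y in zip(a, b))

    docs = ["The RTX 4090 has 24GB of VRAM.", "Ollama runs models locally."]
    index = [(d, embed(d)) for d in docs]

    query = "How much VRAM does the 4090 have?"
    qv = embed(query)
    best = max(index, key=lambda pair: cosine(qv, pair[1]))[0]
    print(f"Context: {best}\n\nQuestion: {query}")   # this becomes the LLM prompt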
Embedding Model
A model that converts text into numerical vectors for similarity search. Required for RAG pipelines. Much smaller and faster than chat LLMs — runs comfortably on CPU.
AWQ
Activation-Aware Weight Quantization — a 4-bit quantization method that outperforms GGUF Q4 in quality by identifying and preserving the most important weights. Primarily used with vLLM and Hugging Face Transformers.
EXL2
ExLlamaV2's quantization format — offers mixed-precision quantization (e.g., 3.0 to 6.0 bits per weight) and is often the highest-quality option for a given VRAM budget on NVIDIA GPUs.
Open WebUI
A self-hosted ChatGPT-like web interface for Ollama and OpenAI-compatible APIs. The most popular local AI frontend — runs as a Docker container or alongside Ollama.
Multimodal
Models that process both text and images (and sometimes audio or video). Examples: LLaVA, Qwen-VL, Gemma 3. Require additional VRAM for the vision encoder on top of the language model.
Hardware & Architecture
Tensor Cores
Specialized hardware units on NVIDIA GPUs designed for matrix multiplication — the core math operation in neural networks. 5th-gen Tensor Cores (Blackwell) are significantly faster than 4th-gen (Ada Lovelace) for AI inference.
TDP (Power Draw)
Thermal Design Power in watts — the maximum sustained power draw. Higher TDP generally means more performance but more heat and electricity cost. Important for 24/7 always-on setups.
Blackwell
NVIDIA's 2024–2025 GPU architecture generation. Features 5th-generation Tensor Cores, GDDR7 memory, and significant AI inference performance improvements over Ada Lovelace (RTX 40 series).
RDNA 4
AMD's 2025 GPU architecture. Notable IPC improvement over RDNA 3, improved AI inference throughput, paired with GDDR6 in the RX 9060 XT series.
NPU
Neural Processing Unit — a dedicated AI accelerator chip. Found in modern Ryzen AI CPUs and Apple Silicon. Offloads specific AI tasks from the CPU/GPU but is too limited for full LLM inference.
PCIe
Peripheral Component Interconnect Express — the bus connecting a discrete GPU to the motherboard. PCIe 4.0 or 5.0 is needed for fast model offloading when VRAM is exceeded.
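The gap this bus creates is why VRAM spill hurts. A quick comparison using nominal per-direction x16 bandwidths and an illustrative GDDR7 card:

    # Nominal x16 bandwidth per direction vs. on-card memory (illustrative figures)
    pcie4, pcie5, gddr7_card = 32, 64, 672   # GB/s
    print(f"PCIe 4.0 x16: {pcie4} GB/s, ~{gddr7_card // pcie4}x slower than on-card GDDR7")
    print(f"PCIe 5.0 x16: {pcie5} GB/s, ~{gddr7_card // pcie5}x slower")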
SDXL
Stable Diffusion XL — the standard 1024×1024 resolution image generation model. Requires 8+ GB VRAM for practical GPU-accelerated generation. The standard benchmark is generation time in seconds.
eGPU
External GPU — a discrete GPU connected via Thunderbolt to a laptop or mini PC. Enables GPU-accelerated LLM inference on machines without a built-in GPU slot.
NVMe SSD
High-speed solid-state storage using the PCIe bus. Affects how quickly models load into memory at startup — a PCIe 4.0 NVMe loads a 7B model in ~2 seconds vs ~15 seconds on SATA SSD.
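A lower-bound estimate is just file size divided by sequential read speed; real loads add parsing and allocation overhead on top. Drive speeds below are typical figures, not measurements:

    size_gb = 4.1   # 7B model at Q4
    for name, gbps in [("SATA SSD", 0.55), ("PCIe 3.0 NVMe", 3.5), ("PCIe 4.0 NVMe", 7.0)]:
        print(f"{name}: at least {size_gb / gbps:.1f} s to read the file")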
Thermal Throttling
When a CPU or GPU automatically reduces clock speed to prevent overheating. In LLM inference, sustained throttling cuts tokens per second mid-generation — especially in small mini PC enclosures.
Ready to buy?
See the Hardware That Uses These Specs
Every product page shows benchmarks and specs with inline definitions — hover any term to see what it means.