
AI Hardware Glossary

Every term you'll encounter when running LLMs and Stable Diffusion locally — explained without jargon, with real hardware context.

Memory & Storage

Performance & Benchmarks

Quantization

performance

Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = 16-bit (the unquantized baseline). Fewer bits = less VRAM required, with a slight quality reduction.

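Back-of-envelope math makes the VRAM impact concrete. A rough Python sketch counting weights only (KV cache and runtime overhead not included; real GGUF files come out a bit larger):

    # Approximate size of the weights alone at a given precision.
    def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

    for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
        print(f"7B model at {label}: ~{weight_size_gb(7, bits):.1f} GB")
    # FP16 ≈ 13.0 GB, Q8 ≈ 6.5 GB, Q4 ≈ 3.3 GB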

Tokens/s

performance

Tokens per second — the standard speed metric for LLMs. One token ≈ 0.75 words. Above 10 t/s feels interactive; below 5 t/s feels like watching paint dry.

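To see why 10 t/s feels interactive, convert to words per minute with the 0.75 words-per-token rule of thumb (rough figures for English text):

    # 1 token ≈ 0.75 English words
    for tps in (5, 10, 30):
        wpm = tps * 0.75 * 60
        print(f"{tps} t/s ≈ {wpm:.0f} words/minute")
    # 5 t/s ≈ 225 wpm (roughly silent-reading speed), 10 t/s ≈ 450 wpm, 30 t/s ≈ 1350 wpm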

Context Window

performance

The maximum amount of text (in tokens) a model can "see" at once. Larger context = more document history, longer conversations, bigger code files — but requires more VRAM.


KV Cache

performance

Key-Value Cache — stores intermediate attention computations so the model doesn't re-process earlier context on each new token. Larger context = larger KV cache = more VRAM needed.

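A rough sizing formula, sketched in Python. The defaults assume a Llama-3-8B-style configuration (32 layers, 8 KV heads of dimension 128, FP16 cache); other models differ:

    # 2 (K and V) x layers x KV heads x head dim x context length x bytes per element
    def kv_cache_gb(layers=32, kv_heads=8, head_dim=128, ctx=8192, bytes_per_elem=2):
        return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

    print(f"{kv_cache_gb(ctx=8192):.1f} GB")    # ~1 GB at 8K context
    print(f"{kv_cache_gb(ctx=131072):.1f} GB")  # ~16 GB at 128K context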

Max LLM Size

performance

The largest language model this hardware can run with full GPU/unified-memory acceleration, at the specified quantization. As a rough guide, a Q4 model needs about half its parameter count in gigabytes (a 7B model ≈ 4–5 GB, a 70B model ≈ 40+ GB), plus headroom for context.


MoE

performance

Mixture of Experts — a model architecture where only a fraction of parameters activate per token. Enables very large parameter counts at lower inference cost (e.g., DeepSeek-V3, Mixtral).

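Worked numbers, using approximate Mixtral 8x7B figures (~47B parameters in total, 2 of 8 experts ≈ 13B active per token):

    total_b, active_b = 47, 13  # approximate Mixtral 8x7B parameter counts
    print(f"Compute per token: ~{active_b / total_b:.0%} of a dense 47B model")
    # All ~47B weights must still fit in memory; MoE saves compute, not VRAM.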

Speculative Decoding

performance

A speed optimization where a small draft model generates candidate tokens that a larger target model then verifies in parallel — producing multiple tokens per forward pass.

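A toy sketch of the greedy variant. The two "models" below are trivial stand-in functions rather than real LLMs; the point is the propose/verify/accept control flow:

    def draft_next(context):       # fast, sloppy draft model (stand-in)
        return context[-1] + 1 if context[-1] < 5 else 0

    def target_next(context):      # slow, authoritative target model (stand-in)
        return context[-1] + 1

    def speculative_step(context, k=4):
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(context)
        for _ in range(k):
            ctx.append(draft_next(ctx))
            proposal.append(ctx[-1])
        # 2. Target model verifies the proposals (one batched pass in practice).
        accepted = []
        for t in proposal:
            expected = target_next(context + accepted)
            if t == expected:
                accepted.append(t)          # draft guessed right: a free token
            else:
                accepted.append(expected)   # first mismatch: keep target's token, stop
                break
        return context + accepted

    print(speculative_step([1, 2, 3]))  # emits several tokens per target pass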

Flash Attention

performance

A memory-efficient attention algorithm that rewrites the attention computation to minimize GPU memory reads/writes. Reduces VRAM usage and increases throughput, especially at long context lengths.

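You rarely call it by hand; inference engines enable it for you. A minimal PyTorch 2.x sketch (assumes an NVIDIA GPU and half-precision tensors) where scaled_dot_product_attention can dispatch to a fused FlashAttention-style kernel instead of materializing the full attention matrix:

    import torch
    import torch.nn.functional as F

    # batch=1, heads=8, sequence length=4096, head dim=64
    q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)

    # A fused kernel avoids writing the 4096x4096 score matrix to GPU memory.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)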

INT4

performance

4-bit integer quantization — the most common precision level for running large models on consumer hardware. Reduces model size by ~75% vs FP16 with acceptable quality loss for most tasks.


Software & Frameworks

Ollama

software

Free open-source tool for running LLMs locally on macOS, Linux, and Windows. Download a model with a single command. No cloud account required. Supports Llama, Mistral, Qwen, Phi, and more.

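A minimal sketch against Ollama's local HTTP API on its default port; the model tag "llama3.2" is just an example, any model you have pulled works:

    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2", "prompt": "Explain VRAM in one sentence.", "stream": False},
        timeout=120,
    )
    print(resp.json()["response"])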

CUDA

software

NVIDIA's proprietary parallel computing platform. Industry standard for AI/ML. Nearly every AI framework (PyTorch, Ollama, ComfyUI) supports CUDA natively, and usually supports it before any alternative.


ROCm

software

AMD's open-source GPU compute platform — AMD's answer to NVIDIA CUDA. Required for GPU-accelerated AI on AMD cards. Mature on Linux; less reliable on Windows.


GGUF

software

The standard file format for quantized LLMs used by llama.cpp and Ollama. Replaces the older GGML format. Stores model weights and metadata in a single portable file.


MLX

software

Apple's open-source machine learning framework optimized for Apple Silicon. Enables fast LLM inference on M-series chips using the unified memory architecture natively.

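A rough sketch assuming the companion mlx-lm Python package on an Apple Silicon Mac; the model repository name is illustrative, any MLX-converted model works:

    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
    print(generate(model, tokenizer, prompt="Why does unified memory help local LLMs?", max_tokens=100))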

llama.cpp

software

The foundational C++ inference engine for running quantized LLMs locally. Powers Ollama, LM Studio, and most local AI tools under the hood. Supports CPU, CUDA, ROCm, and Metal.

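Its Python bindings (llama-cpp-python) show the typical flow; the model path below is a placeholder for whatever GGUF file you have downloaded:

    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/mistral-7b-instruct-q4_k_m.gguf",  # any GGUF file
        n_ctx=4096,        # context window
        n_gpu_layers=-1,   # offload all layers to GPU (CUDA/ROCm/Metal); omit for CPU-only
    )
    out = llm("Q: What is a context window?\nA:", max_tokens=64, stop=["\n"])
    print(out["choices"][0]["text"])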

LM Studio

software

A desktop GUI application for downloading and running local LLMs. Cross-platform (Mac, Windows, Linux). Wraps llama.cpp with a ChatGPT-like interface and built-in model browser.


ComfyUI

software

The node-based GUI for Stable Diffusion and Flux image generation. Industry standard for advanced AI image workflows. Runs fastest on an NVIDIA CUDA GPU; AMD cards also work via ROCm on Linux.


CPU Inference

software

Running LLMs on the CPU rather than a GPU. Works on any hardware, no special drivers needed. Limited to ~8–12 t/s on 7B models — fine for background tasks, slow for interactive use.


LoRA

software

Low-Rank Adaptation — a fine-tuning technique that trains a tiny set of adapter weights instead of the full model. Runs on consumer GPUs with as little as 8 GB VRAM.

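A minimal sketch with Hugging Face PEFT; the base model and hyperparameters are illustrative, not a tuned recipe:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    config = LoraConfig(
        r=16,                                 # rank of the adapter matrices
        lora_alpha=32,                        # scaling factor
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # typically well under 1% of all weights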

RAG

software

Retrieval-Augmented Generation — a technique that lets an LLM answer questions using external documents by fetching relevant chunks at query time instead of relying on training data alone.

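A toy end-to-end sketch. The word-overlap "embedding" is deliberately crude so the example runs with nothing but NumPy; a real pipeline would use a proper embedding model (see the next entry):

    import numpy as np

    def embed(text, vocab):  # toy stand-in for a real embedding model
        words = text.lower().split()
        return np.array([words.count(w) for w in vocab], dtype=float)

    docs = [
        "Ollama downloads and runs LLMs locally with one command.",
        "SDXL needs roughly 8 GB of VRAM for image generation.",
        "GGUF is the file format used by llama.cpp and Ollama.",
    ]
    question = "how much VRAM does SDXL need"

    vocab = sorted(set(" ".join(docs + [question]).lower().split()))
    doc_vecs = [embed(d, vocab) for d in docs]
    q_vec = embed(question, vocab)

    # Retrieve the most relevant chunk by cosine similarity...
    scores = [v @ q_vec / (np.linalg.norm(v) * np.linalg.norm(q_vec) + 1e-9) for v in doc_vecs]
    context = docs[int(np.argmax(scores))]

    # ...then hand it to a local LLM inside the prompt.
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    print(prompt)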

Embedding Model

software

A model that converts text into numerical vectors for similarity search. Required for RAG pipelines. Much smaller and faster than chat LLMs — runs comfortably on CPU.

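A quick sketch using the sentence-transformers library; all-MiniLM-L6-v2 is named here as one common small example:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # ~80 MB, runs fine on CPU

    sentences = [
        "How much VRAM does a 13B model need?",
        "Memory requirements for a thirteen-billion-parameter LLM",
        "Best pizza toppings",
    ]
    vectors = model.encode(sentences)

    # The two VRAM questions score high; the unrelated sentence scores low.
    print(util.cos_sim(vectors[0], vectors[1]))
    print(util.cos_sim(vectors[0], vectors[2]))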

AWQ

software

Activation-Aware Weight Quantization — a 4-bit quantization method that outperforms GGUF Q4 in quality by identifying and preserving the most important weights. Primarily used with vLLM and HuggingFace.


EXL2

software

ExLlamaV2's quantization format — offers mixed-precision quantization (e.g., 3.0 to 6.0 bits per weight) and is often the highest-quality option for a given VRAM budget on NVIDIA GPUs.


Open WebUI

software

A self-hosted ChatGPT-like web interface for Ollama and OpenAI-compatible APIs. The most popular local AI frontend — runs as a Docker container or alongside Ollama.


Multimodal

software

Models that process both text and images (and sometimes audio or video). Examples: LLaVA, Qwen-VL, Gemma 3. Require additional VRAM for the vision encoder on top of the language model.


Hardware & Architecture

Tensor Cores

hardware

Specialized hardware units on NVIDIA GPUs designed for matrix multiplication — the core math operation in neural networks. 5th-gen Tensor Cores (Blackwell) are significantly faster than 4th-gen (Ada Lovelace) for AI inference.


TDP (Power Draw)

hardware

Thermal Design Power in watts — the maximum sustained power draw. Higher TDP generally means more performance but more heat and electricity cost. Important for 24/7 always-on setups.

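Rough running-cost math, assuming a 300 W average draw and $0.15/kWh (both placeholder numbers, substitute your own):

    watts, price_per_kwh = 300, 0.15  # assumptions, not measurements
    kwh_per_year = watts / 1000 * 24 * 365
    print(f"~{kwh_per_year:.0f} kWh/year, ~${kwh_per_year * price_per_kwh:.0f}/year")
    # ~2628 kWh/year, ~$394/year at these assumptions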

Blackwell

hardware

NVIDIA's 2024–2025 GPU architecture generation. Features 5th-generation Tensor Cores, GDDR7 memory, and significant AI inference performance improvements over Ada Lovelace (RTX 40 series).


RDNA 4

hardware

AMD's 2025 GPU architecture. Brings a notable IPC improvement over RDNA 3 and better AI inference throughput, paired with GDDR6 in the RX 9060 XT series.


NPU

hardware

Neural Processing Unit — a dedicated AI accelerator chip. Found in modern Ryzen AI CPUs and Apple Silicon. Offloads specific AI tasks from the CPU/GPU, but is too limited for full LLM inference.


PCIe

hardware

Peripheral Component Interconnect Express — the bus connecting a discrete GPU to the motherboard. A fast PCIe 4.0 or 5.0 link matters most when a model doesn't fit in VRAM and layers must be shuttled to and from system RAM.


SDXL

hardware

Stable Diffusion XL — the standard 1024×1024 resolution image generation model. Requires 8+ GB VRAM for practical GPU-accelerated generation. The usual benchmark is generation time per image, in seconds.


eGPU

hardware

External GPU — a discrete GPU connected via Thunderbolt to a laptop or mini PC. Enables GPU-accelerated LLM inference on machines that have no internal slot for a graphics card.


NVMe SSD

hardware

High-speed solid-state storage using the PCIe bus. Affects how quickly models load into memory at startup — a PCIe 4.0 NVMe loads a 7B model in ~2 seconds vs ~15 seconds on SATA SSD.


Thermal Throttling

hardware

When a CPU or GPU automatically reduces clock speed to prevent overheating. In LLM inference, sustained throttling cuts tokens per second mid-generation — especially in small mini PC enclosures.


Connectivity & Interfaces

Ready to buy?

See the Hardware That Uses These Specs

Every product page shows benchmarks and specs with inline definitions — hover any term to see what it means.