What is Unified Memory?
Apple Silicon uses a single pool of fast RAM shared between the CPU and GPU. More unified memory means larger models can run entirely at full bandwidth, with no PCIe bottleneck.
Full Explanation
Apple's unified memory architecture places the CPU, GPU, and Neural Engine on the same die with a shared high-bandwidth memory pool. On the M4 Pro, this pool runs at 273 GB/s: slower than an RTX 5070's GDDR7, but dramatically faster than the PCIe path a discrete GPU falls back on when a model overflows its VRAM. The critical advantage is capacity: a Mac Mini M4 Pro with 48 GB of unified memory can fully accelerate a 70B-parameter model at Q4, something no consumer GPU under $1,000 can do.
Why It Matters for Local AI
For running 70B models, unified memory Macs are currently the only sub-$2,000 option. A 16 GB M4 Mac Mini tops out at 13B models. The 24 GB M4 Pro comfortably runs 13B models and barely fits some 32B at Q4. The 48 GB M4 Pro config is the practical ceiling for consumer local AI.
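The size claims above can be sanity-checked with a rough sketch. It assumes Q4 weights take about 0.5 GB per billion parameters, plus roughly 15% overhead for KV cache and activations, and reserves a few GB for macOS itself; all three figures are assumptions for illustration, not Apple specs.

```python
# Hypothetical fit check: which Mac configs can hold which Q4 models?
# Assumptions: Q4 weights ~= 0.5 GB per billion params, ~15% runtime
# overhead, and ~4 GB reserved for the OS (all rough estimates).
CONFIGS_GB = {"M4 16GB": 16, "M4 Pro 24GB": 24, "M4 Pro 48GB": 48}
MODELS_B = {"13B": 13, "32B": 32, "70B": 70}
OS_RESERVE_GB = 4

for config, mem in CONFIGS_GB.items():
    fits = [name for name, params in MODELS_B.items()
            if params * 0.5 * 1.15 <= mem - OS_RESERVE_GB]
    print(f"{config}: {', '.join(fits) or 'none'}")
```

Under these assumptions the 16 GB config stops at 13B, the 24 GB config just squeezes in 32B (about 18 GB of weights plus overhead against about 20 GB of usable memory), and only the 48 GB config holds 70B at Q4.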
Hardware Relevant to Unified Memory
Mac Mini M4 (mini-PC) · 16 GB unified memory · 120 GB/s
Mac Mini M4 Pro (mini-PC) · 24 GB unified memory · 273 GB/s
Related Terms
VRAM
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Memory Bandwidth
How fast data moves between memory and the processor, measured in GB/s. Tokens per second scales nearly linearly with bandwidth, which makes this the single most important GPU spec for LLM generation speed.
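The near-linear scaling follows from decoding being memory-bound: each generated token reads every weight once, so the speed ceiling is roughly bandwidth divided by model size. A minimal sketch, where the 0.7 utilization factor is an assumption (real efficiency varies by framework and chip):

```python
def est_tokens_per_sec(bandwidth_gbs: float, model_gb: float,
                       efficiency: float = 0.7) -> float:
    """Rough decode-speed ceiling for a memory-bound LLM.

    Each token requires reading all weights once, so the ceiling is
    bandwidth / model size. The 0.7 efficiency factor is an assumed
    utilization figure, not a measured constant.
    """
    return bandwidth_gbs / model_gb * efficiency

# M4 Pro (273 GB/s) on a ~19 GB 32B Q4 model: roughly 10 tok/s
print(round(est_tokens_per_sec(273, 19), 1))
```

Doubling bandwidth at the same model size roughly doubles the estimate, which is why bandwidth, not compute, dominates generation speed.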
Quantization
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits mean less VRAM is required, at a slight cost in quality.
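The VRAM savings fall straight out of the bit widths: weight size is parameters × bits / 8 bytes. A quick illustration for a hypothetical 7B model (weights only, ignoring KV cache and runtime overhead):

```python
# Weight-only footprint of a hypothetical 7B model at each precision.
# size_gb = params_in_billions * bits / 8 (1B params at 8 bits ~= 1 GB)
PARAMS_B = 7  # billions of parameters

for name, bits in [("Q4", 4), ("Q8", 8), ("FP16", 16)]:
    gb = PARAMS_B * bits / 8
    print(f"{name}: {gb:.1f} GB")
```

Q4 cuts the footprint to a quarter of FP16 (3.5 GB vs 14 GB here), which is what lets mid-size models fit on consumer hardware.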
MLX
Apple's open-source machine learning framework optimized for Apple Silicon. Enables fast LLM inference on M-series chips using the unified memory architecture natively.