What is Flash Attention?
A memory-efficient attention algorithm that restructures the computation to minimize GPU memory reads and writes. It reduces VRAM usage and increases throughput, especially at long context lengths.
Full Explanation
Flash Attention (and its successors Flash Attention 2 and 3) is a hardware-aware implementation of scaled dot-product attention. It fuses operations and tiles the computation so that intermediate results stay in fast on-chip GPU SRAM instead of being repeatedly written to and read back from comparatively slow VRAM. At short context lengths (under 4K), the speedup is modest. At 32K+ context, Flash Attention can reduce attention VRAM usage by 5–10× and increase throughput by 2–4×, making long-context inference practical on consumer hardware. It's enabled by default in llama.cpp and most modern inference frameworks.
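The core trick that makes the tiling possible is an online ("streaming") softmax: keys and values are processed in tiles while a running row maximum and normalizer are maintained, so the full N×N score matrix is never materialized. Below is a minimal NumPy sketch of that idea, for intuition only: single head, no masking or batching, and not the fused CUDA kernel that real Flash Attention implementations use.

```python
import numpy as np

def naive_attention(q, k, v):
    """Standard attention: materializes the full (n, n) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n, n) lives in memory
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, tile=128):
    """Flash-Attention-style streaming softmax over K/V tiles.
    Only an (n, tile) slab of scores exists at any moment, never (n, n)."""
    n, d = q.shape
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)        # running max per query row
    row_sum = np.zeros(n)                # running softmax normalizer

    for start in range(0, k.shape[0], tile):
        k_t, v_t = k[start:start + tile], v[start:start + tile]
        s = q @ k_t.T / np.sqrt(d)                    # (n, tile) score tile
        new_max = np.maximum(row_max, s.max(axis=-1))
        scale = np.exp(row_max - new_max)             # rescale old accumulators
        p = np.exp(s - new_max[:, None])              # tile softmax numerator
        out = out * scale[:, None] + p @ v_t
        row_sum = row_sum * scale + p.sum(axis=-1)
        row_max = new_max

    return out / row_sum[:, None]

# Both paths agree to numerical precision:
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v), atol=1e-6)
```

Because only one tile of scores exists at a time, the real kernels can keep the entire working set in SRAM, which is where the memory and throughput wins come from.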
Why It Matters for Local AI
Flash Attention is what makes running 32K–128K context windows practical on a 12–16 GB GPU. Without it, processing a 32K prompt would materialize multi-gigabyte intermediate attention matrices on top of the KV cache, leaving insufficient room for the model weights themselves. (Note that Flash Attention does not shrink the KV cache, which still grows linearly with context; what it eliminates is the quadratic intermediate.)
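Some back-of-envelope numbers make this concrete. The model shape below (32 layers, 32 KV heads of dimension 128, FP16 cache) is an illustrative assumption, roughly a Llama-2-7B-class model:

```python
# Back-of-envelope memory math at 32K context (illustrative assumptions:
# 32 layers, 32 KV heads, head_dim 128, FP16 = 2 bytes, batch 1).
ctx, layers, kv_heads, head_dim, bytes_per = 32_768, 32, 32, 128, 2

# KV cache grows linearly with context (Flash Attention does NOT shrink this):
kv_cache = 2 * layers * kv_heads * head_dim * ctx * bytes_per   # K and V
print(f"KV cache:              {kv_cache / 2**30:.1f} GiB")     # 16.0 GiB

# Naive attention also materializes an n x n score matrix per head;
# Flash Attention never allocates this quadratic intermediate:
scores_per_head = ctx * ctx * bytes_per
print(f"Score matrix (1 head): {scores_per_head / 2**30:.1f} GiB")  # 2.0 GiB
```

The quadratic term is what Flash Attention removes; the linear KV cache term is why long-context setups additionally rely on grouped-query attention and KV cache quantization.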
Related Terms
KV Cache→
Key-Value Cache — stores the attention keys and values for all earlier tokens so the model doesn't re-process the entire context for each new token. Larger context = larger KV cache = more VRAM needed.
Context Window→
The maximum amount of text (in tokens) a model can "see" at once. Larger context = more document history, longer conversations, bigger code files — but requires more VRAM.
VRAM→
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Memory Bandwidth→
How fast data moves between memory and the processor, measured in GB/s. Tokens per second scales nearly linearly with bandwidth — this is the single most important GPU spec for LLM speed.
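As a rough illustration of that scaling (with assumed, simplified numbers): each generated token has to stream approximately all of the model's weights from VRAM once, so bandwidth divided by model size gives a ceiling on tokens per second.

```python
# Back-of-envelope decode speed: each generated token streams roughly the
# whole model's weights through the GPU once, so bandwidth sets the ceiling.
# Illustrative assumptions: ~4 GB of quantized weights, a 672 GB/s card.
bandwidth_gb_s = 672
weights_gb = 4.0
print(f"theoretical ceiling: ~{bandwidth_gb_s / weights_gb:.0f} tokens/s")  # ~168
# Real-world throughput is lower (KV cache reads, kernel overhead, etc.).
```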