Performance & Benchmarks

What is Flash Attention?

A memory-efficient attention algorithm that restructures the attention computation to minimize GPU memory reads and writes. It reduces VRAM usage and increases throughput, especially at long context lengths.

Full Explanation

Flash Attention (and its successors Flash Attention 2 and 3) is a hardware-aware implementation of scaled dot-product attention that fuses operations and tiles the computation so intermediate results stay in fast on-chip GPU SRAM rather than being repeatedly written to and read from the much slower VRAM. At short context lengths (under 4K), the speedup is modest. At 32K+ context, Flash Attention can reduce attention VRAM usage by 5–10× and increase throughput by 2–4×, making long-context inference practical on consumer hardware. It is enabled by default in recent llama.cpp builds (and can be toggled with the --flash-attn / -fa flag) and in most modern inference frameworks.
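
The core trick is an online (streaming) softmax: keys and values are processed one tile at a time, and a running row-maximum and normalizer are rescaled as each tile arrives, so the full N×N score matrix never has to exist. Below is a minimal NumPy sketch of that idea; the function name and tile size are illustrative, and the real kernels additionally fuse these steps inside GPU SRAM.

```python
import numpy as np

def flash_attention_tiled(q, k, v, tile=128):
    """Attention computed one K/V tile at a time with an online softmax,
    the core idea behind Flash Attention. The full (N x N) score matrix
    is never materialized. Pure NumPy sketch for clarity only."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)                 # running (unnormalized) output
    row_max = np.full(n, -np.inf)          # running max per query row
    row_sum = np.zeros(n)                  # running softmax denominator
    for start in range(0, k.shape[0], tile):
        kb = k[start:start + tile]         # (t, d) key tile
        vb = v[start:start + tile]         # (t, d) value tile
        s = (q @ kb.T) * scale             # (n, t) scores for this tile only
        new_max = np.maximum(row_max, s.max(axis=1))
        rescale = np.exp(row_max - new_max)   # correct stats from older tiles
        p = np.exp(s - new_max[:, None])      # tile-local softmax weights
        row_sum = row_sum * rescale + p.sum(axis=1)
        out = out * rescale[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against naive attention, which builds the full score matrix.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
s = (q @ k.T) / np.sqrt(64)
p = np.exp(s - s.max(axis=1, keepdims=True))
naive = (p / p.sum(axis=1, keepdims=True)) @ v
assert np.allclose(flash_attention_tiled(q, k, v), naive, atol=1e-6)
```

In practice you rarely write this yourself: frameworks such as PyTorch route scaled_dot_product_attention through a Flash Attention kernel automatically when the hardware supports it.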

Why It Matters for Local AI

Flash Attention is what makes running 32K–128K context windows practical on a 12–16 GB GPU. Without it, the naive attention score matrix, which grows quadratically with context length, would consume so much VRAM at 32K context that it would leave insufficient room for the model weights and KV cache themselves.
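
A quick back-of-the-envelope calculation shows why the quadratic score matrix is the problem. The numbers below are illustrative assumptions (fp16 scores, 32 attention heads), not any specific model:

```python
# Naive attention materializes an (N x N) score matrix per head.
# Illustrative assumptions: 32K context, 32 heads, fp16 (2 bytes/element).
seq_len, n_heads, bytes_per_elem = 32_768, 32, 2
score_bytes = seq_len ** 2 * n_heads * bytes_per_elem
print(f"{score_bytes / 2**30:.0f} GiB")  # 64 GiB of scores for ONE layer
# Flash Attention streams this matrix through on-chip SRAM in small tiles,
# so it never occupies VRAM at all.
```

Even at 8K context the same matrix would be 4 GiB per layer under these assumptions, which is why the savings grow so quickly with context length.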

Hardware Relevant to Flash Attention

GIGABYTE GeForce RTX 5070 WINDFORCE OC 12G

gpu · 12 GB VRAM · 672 GB/s

MSI GeForce RTX 5080 16G Gaming Trio OC

gpu · 16 GB VRAM · 960 GB/s
