What is Flash Attention?
A memory-efficient attention algorithm that restructures the computation to minimize GPU memory reads and writes. It reduces VRAM usage and increases throughput, especially at long context lengths.
Full Explanation
Flash Attention (and its successors Flash Attention 2 and 3) is a hardware-aware implementation of scaled dot-product attention. It fuses operations and tiles the computation so that intermediate results stay in fast on-chip GPU SRAM instead of being repeatedly written to and read back from comparatively slow VRAM. At short context lengths (under 4K), the speedup is modest. At 32K+ context, Flash Attention can reduce attention VRAM usage by 5–10× and increase throughput by 2–4×, making long-context inference practical on consumer hardware. It's enabled by default in llama.cpp and most modern inference frameworks.
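The core trick that makes the tiling possible is an online ("streaming") softmax: keys and values are processed in tiles while a running row maximum and normalizer are maintained, so the full N×N score matrix is never materialized. Below is a minimal NumPy sketch of that idea, for intuition only: single head, no masking or batching, and not the fused CUDA kernel that real Flash Attention implementations use.

```python
import numpy as np

def naive_attention(q, k, v):
    """Standard attention: materializes the full (n, n) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n, n) lives in memory
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, tile=128):
    """Flash-Attention-style streaming softmax over K/V tiles.
    Only an (n, tile) slab of scores exists at any moment, never (n, n)."""
    n, d = q.shape
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)        # running max per query row
    row_sum = np.zeros(n)                # running softmax normalizer

    for start in range(0, k.shape[0], tile):
        k_t, v_t = k[start:start + tile], v[start:start + tile]
        s = q @ k_t.T / np.sqrt(d)                    # (n, tile) score tile
        new_max = np.maximum(row_max, s.max(axis=-1))
        scale = np.exp(row_max - new_max)             # rescale old accumulators
        p = np.exp(s - new_max[:, None])              # tile softmax numerator
        out = out * scale[:, None] + p @ v_t
        row_sum = row_sum * scale + p.sum(axis=-1)
        row_max = new_max

    return out / row_sum[:, None]

# Both paths agree to numerical precision:
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v), atol=1e-6)
```

Because only one tile of scores exists at a time, the real kernels can keep the entire working set in SRAM, which is where the memory and throughput wins come from.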
Why It Matters for Local AI
Flash Attention is what makes running 32K–128K context windows practical on a 12–16 GB GPU. Without it, processing a 32K prompt would materialize multi-gigabyte intermediate attention matrices on top of the KV cache, leaving insufficient room for the model weights themselves. (Note that Flash Attention does not shrink the KV cache, which still grows linearly with context; what it eliminates is the quadratic intermediate.)
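Some back-of-envelope numbers make this concrete. The model shape below (32 layers, 32 KV heads of dimension 128, FP16 cache) is an illustrative assumption, roughly a Llama-2-7B-class model:

```python
# Back-of-envelope memory math at 32K context (illustrative assumptions:
# 32 layers, 32 KV heads, head_dim 128, FP16 = 2 bytes, batch 1).
ctx, layers, kv_heads, head_dim, bytes_per = 32_768, 32, 32, 128, 2

# KV cache grows linearly with context (Flash Attention does NOT shrink this):
kv_cache = 2 * layers * kv_heads * head_dim * ctx * bytes_per   # K and V
print(f"KV cache:              {kv_cache / 2**30:.1f} GiB")     # 16.0 GiB

# Naive attention also materializes an n x n score matrix per head;
# Flash Attention never allocates this quadratic intermediate:
scores_per_head = ctx * ctx * bytes_per
print(f"Score matrix (1 head): {scores_per_head / 2**30:.1f} GiB")  # 2.0 GiB
```

The quadratic term is what Flash Attention removes; the linear KV cache term is why long-context setups additionally rely on grouped-query attention and KV cache quantization.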
Related Terms
KV Cache→
Key-Value Cache — stores the attention keys and values for all earlier tokens so the model doesn't re-process the entire context for each new token. Larger context = larger KV cache = more VRAM needed.
Context Window→
The maximum amount of text (in tokens) a model can "see" at once. Larger context = more document history, longer conversations, bigger code files — but requires more VRAM.
VRAM→
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Memory Bandwidth→
How fast data moves between memory and the processor, measured in GB/s. Tokens per second scales nearly linearly with bandwidth — this is the single most important GPU spec for LLM speed.
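As a rough illustration of that scaling (with assumed, simplified numbers): each generated token has to stream approximately all of the model's weights from VRAM once, so bandwidth divided by model size gives a ceiling on tokens per second.

```python
# Back-of-envelope decode speed: each generated token streams roughly the
# whole model's weights through the GPU once, so bandwidth sets the ceiling.
# Illustrative assumptions: ~4 GB of quantized weights, a 672 GB/s card.
bandwidth_gb_s = 672
weights_gb = 4.0
print(f"theoretical ceiling: ~{bandwidth_gb_s / weights_gb:.0f} tokens/s")  # ~168
# Real-world throughput is lower (KV cache reads, kernel overhead, etc.).
```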