FlashAttention is an IO-aware attention algorithm that computes exact attention with reduced GPU memory reads/writes through tiling and kernel fusion, enabling faster training and inference for long sequences. It produces mathematically identical results to standard attention while being 2-4x faster and using 5-20x less memory.
Standard attention implementations materialize the full N x N attention matrix in GPU high-bandwidth memory (HBM), where N is the sequence length. For a 100K token sequence, this matrix alone requires 40GB of memory in FP32 per attention head. FlashAttention never materializes this matrix. Instead, it tiles the computation into blocks that fit in the GPU's fast SRAM (on-chip memory), computes partial attention results within each tile, and accumulates the final output through a numerically stable online softmax algorithm.
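To make the tiling concrete, here is a minimal PyTorch sketch of the online softmax accumulation. It assumes a single head with (N, d) inputs and tiles only over the key/value dimension; a real FlashAttention kernel also tiles the query dimension and keeps each tile in SRAM rather than allocating these tensors in HBM.

```python
import torch

def tiled_attention_sketch(q, k, v, block_size=128):
    """Tiled attention with online softmax over key/value blocks.

    q, k, v: (N, d) tensors for a single head. Illustrative only; the real
    kernel runs these loops on-chip and never writes the score matrix to HBM.
    """
    N, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((N, 1), float("-inf"), dtype=q.dtype, device=q.device)
    row_sum = torch.zeros((N, 1), dtype=q.dtype, device=q.device)

    for start in range(0, N, block_size):
        k_blk = k[start:start + block_size]          # one K tile
        v_blk = v[start:start + block_size]          # one V tile

        scores = (q @ k_blk.T) * scale               # (N, block) partial scores
        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)

        # Rescale everything accumulated so far to the new running max,
        # which keeps the exponentials numerically stable.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)

        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum                             # exact softmax(QK^T/sqrt(d)) V
```

The final division by the running denominator recovers exactly softmax(QK^T / sqrt(d)) V, which is why the results match standard attention bit-for-bit up to floating-point accumulation order.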
The key insight is that GPU computation is fast but memory access is slow. Standard attention is memory-bound — it spends most of its time moving data between HBM and compute units, not actually computing. By fusing multiple operations (matrix multiply, softmax, masking, dropout) into a single kernel that operates on data already in SRAM, FlashAttention eliminates redundant memory transfers and achieves near-peak GPU utilization.
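For contrast with the tiled sketch above, the standard implementation below materializes the full N x N score matrix; each step is a full round trip through HBM, and that traffic is exactly what the fused kernel eliminates.

```python
import torch

def naive_attention(q, k, v):
    """Standard attention on (N, d) inputs: each step below reads and writes
    a full N x N tensor in HBM, which is why the kernel is memory-bound."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.T) * scale              # N x N score matrix written to HBM
    probs = torch.softmax(scores, dim=-1)   # read back, softmaxed, written again
    return probs @ v                        # read back once more for the matmul
```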
FlashAttention made long-context models practical. Before its introduction, training on sequences longer than 8K tokens was prohibitively expensive. FlashAttention's linear memory scaling (O(N) instead of O(N^2)) enabled the jump from 4K to 200K+ context windows that modern models offer, while simultaneously reducing training time by 2-3x.
Every major model training run since 2023 uses FlashAttention or its successors (FlashAttention-2, FlashAttention-3). PyTorch ships FlashAttention as one of the backends behind scaled_dot_product_attention(), selecting it automatically when the inputs and hardware support it. Inference engines like vLLM use FlashAttention for the prefill phase, where full attention is computed over the input prompt.
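As a usage sketch, assuming a recent PyTorch (roughly 2.3 or newer, where torch.nn.attention exposes backend selection), you can request the FlashAttention backend explicitly; without the context manager, PyTorch simply picks the fastest available backend on its own.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Hypothetical shapes: batch 2, 8 heads, 4096 tokens, head dim 64.
q = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict dispatch to the FlashAttention backend for this region.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```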
Aaron is an engineering leader, software architect, and founder with 18 years building distributed systems and cloud infrastructure. Now focused on LLM-powered platforms, agent orchestration, and production AI. He shares hands-on technical guides and framework comparisons at fp8.co.