FlashAttention is an IO-aware attention algorithm that computes exact attention with reduced GPU memory reads/writes through tiling and kernel fusion, enabling faster training and inference for long sequences. It produces mathematically identical results to standard attention while being 2-4x faster and using 5-20x less memory.
Standard attention implementations materialize the full N x N attention matrix in GPU high-bandwidth memory (HBM), where N is the sequence length. For a 100K token sequence, this matrix alone requires 40GB of memory in FP32 per attention head. FlashAttention never materializes this matrix. Instead, it tiles the computation into blocks that fit in the GPU's fast SRAM (on-chip memory), computes partial attention results within each tile, and accumulates the final output through a numerically stable online softmax algorithm.
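To make the tiling concrete, here is a minimal PyTorch sketch of the online softmax accumulation. It assumes a single head with (N, d) inputs and tiles only over the key/value dimension; a real FlashAttention kernel also tiles the query dimension and keeps each tile in SRAM rather than allocating these tensors in HBM.

```python
import torch

def tiled_attention_sketch(q, k, v, block_size=128):
    """Tiled attention with online softmax over key/value blocks.

    q, k, v: (N, d) tensors for a single head. Illustrative only; the real
    kernel runs these loops on-chip and never writes the score matrix to HBM.
    """
    N, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((N, 1), float("-inf"), dtype=q.dtype, device=q.device)
    row_sum = torch.zeros((N, 1), dtype=q.dtype, device=q.device)

    for start in range(0, N, block_size):
        k_blk = k[start:start + block_size]          # one K tile
        v_blk = v[start:start + block_size]          # one V tile

        scores = (q @ k_blk.T) * scale               # (N, block) partial scores
        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)

        # Rescale everything accumulated so far to the new running max,
        # which keeps the exponentials numerically stable.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)

        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum                             # exact softmax(QK^T/sqrt(d)) V
```

The final division by the running denominator recovers exactly softmax(QK^T / sqrt(d)) V, which is why the results match standard attention bit-for-bit up to floating-point accumulation order.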
The key insight is that GPU computation is fast but memory access is slow. Standard attention is memory-bound — it spends most of its time moving data between HBM and compute units, not actually computing. By fusing multiple operations (matrix multiply, softmax, masking, dropout) into a single kernel that operates on data already in SRAM, FlashAttention eliminates redundant memory transfers and achieves near-peak GPU utilization.
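For contrast with the tiled sketch above, the standard implementation below materializes the full N x N score matrix; each step is a full round trip through HBM, and that traffic is exactly what the fused kernel eliminates.

```python
import torch

def naive_attention(q, k, v):
    """Standard attention on (N, d) inputs: each step below reads and writes
    a full N x N tensor in HBM, which is why the kernel is memory-bound."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.T) * scale              # N x N score matrix written to HBM
    probs = torch.softmax(scores, dim=-1)   # read back, softmaxed, written again
    return probs @ v                        # read back once more for the matmul
```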
FlashAttention made long-context models practical. Before its introduction, training on sequences longer than 8K tokens was prohibitively expensive. FlashAttention's linear memory scaling (O(N) instead of O(N^2)) enabled the jump from 4K to 200K+ context windows that modern models offer, while simultaneously reducing training time by 2-3x.
Every major model training run since 2023 uses FlashAttention or its successors (FlashAttention-2, FlashAttention-3). PyTorch ships FlashAttention as one of the backends behind scaled_dot_product_attention(), selecting it automatically when the inputs and hardware support it. Inference engines like vLLM use FlashAttention for the prefill phase, where full attention is computed over the input prompt.
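As a usage sketch, assuming a recent PyTorch (roughly 2.3 or newer, where torch.nn.attention exposes backend selection), you can request the FlashAttention backend explicitly; without the context manager, PyTorch simply picks the fastest available backend on its own.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Hypothetical shapes: batch 2, 8 heads, 4096 tokens, head dim 64.
q = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict dispatch to the FlashAttention backend for this region.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```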
Aaron is an engineering leader, software architect, and founder with 18 years building distributed systems and cloud infrastructure. Now focused on LLM-powered platforms, agent orchestration, and production AI. He shares hands-on technical guides and framework comparisons at fp8.co.