FlashAttention

FlashAttention is an exact attention algorithm introduced by Tri Dao et al. in a 2022 paper that dramatically accelerates self-attention by tiling the computation so that intermediate results stay in fast on-chip GPU SRAM instead of being repeatedly written to slower high-bandwidth memory (HBM). Because the full attention matrix is never materialized, the memory footprint of attention drops from quadratic to linear in sequence length while the output remains mathematically equivalent to standard attention. FlashAttention-2 (2023) improved parallelism across thread blocks and warps and reduced non-matmul FLOPs for roughly another 2x speedup, and FlashAttention-3 (2024) added Hopper-architecture-specific optimizations, including FP8 support and warp specialization.

FlashAttention is now built into PyTorch as a backend of torch.nn.functional.scaled_dot_product_attention, supported natively by vLLM, TensorRT-LLM, and Hugging Face Transformers, and used in essentially every modern LLM training and inference pipeline. The technique enabled the long-context era: without it, training and serving 128K-context or 1M-context models would be prohibitively expensive. AI governance teams encounter FlashAttention as transparent infrastructure; it changes how attention is computed, not the quality of the model's outputs.
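To make the tiling idea concrete, below is a minimal PyTorch sketch of the online-softmax trick that FlashAttention builds on, written against ordinary tensors rather than as a fused GPU kernel. The function name tiled_attention, the single-head (seq_len, head_dim) layout, and the block size of 128 are illustrative choices, not part of any FlashAttention API. Keys and values are processed one block at a time, so the full seq_len-by-seq_len score matrix is never materialized, yet the result matches naive attention exactly (up to floating-point rounding).

    import torch

    def tiled_attention(q, k, v, block_size=128):
        """Single-head attention computed one key/value block at a time.

        Peak memory for the scores is O(seq_len * block_size) instead of
        O(seq_len^2), while the output equals standard softmax attention.
        """
        scale = q.shape[-1] ** -0.5
        out = torch.zeros_like(q)
        # Running row-wise max and softmax normalizer, updated online as blocks stream in.
        row_max = torch.full((q.shape[0], 1), float("-inf"), dtype=q.dtype, device=q.device)
        row_sum = torch.zeros(q.shape[0], 1, dtype=q.dtype, device=q.device)
        for start in range(0, k.shape[0], block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = (q @ k_blk.T) * scale                  # (seq_len, block_size)
            blk_max = scores.max(dim=-1, keepdim=True).values
            new_max = torch.maximum(row_max, blk_max)
            # Rescale the running output and normalizer to the new max, then fold in this block.
            correction = torch.exp(row_max - new_max)
            p = torch.exp(scores - new_max)
            out = out * correction + p @ v_blk
            row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
            row_max = new_max
        return out / row_sum

    # Sanity check against naive quadratic-memory attention.
    q, k, v = (torch.randn(1024, 64) for _ in range(3))
    naive = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
    assert torch.allclose(tiled_attention(q, k, v), naive, atol=1e-4)

In practice most teams never write this themselves: on a supported GPU, PyTorch's torch.nn.functional.scaled_dot_product_attention can dispatch to a fused FlashAttention kernel automatically (which backend is chosen depends on the PyTorch version, hardware, dtype, and input shapes), so enabling it is usually just a matter of calling the standard API, for example:

    import torch
    import torch.nn.functional as F

    # (batch, heads, seq_len, head_dim) layout; a CUDA GPU with fp16/bf16 inputs is
    # assumed here, since the flash backend targets those configurations.
    q, k, v = (torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16) for _ in range(3))
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)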

FlashAttention-accelerated inference with Centralpoint: Centralpoint sits above whatever inference stack uses FlashAttention (today, virtually every modern LLM serving stack) and adds consistent metering and audit logging. The model-agnostic platform routes to any LLM, keeps prompts local, and deploys chatbots through one line of JavaScript on any portal.

