Attention, as the core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention (and FlashAttention-2) pioneered an approach to accelerate attention on GPUs by minimizing memory reads/writes, and it is now used by most libraries to speed up Transformer training and inference. This has contributed to a dramatic increase in LLM context lengths over the past two years, from 2-4K (GPT-3, OPT) to 128K (GPT-4) and even 1M (Llama 3). However, despite its success, FlashAttention has not fully leveraged the new capabilities of modern hardware: FlashAttention-2 achieves only 35% of the theoretical peak FLOP utilization on H100 GPUs. In this blog post, we describe three key techniques for accelerating attention on Hopper GPUs: (1) leveraging the asynchronicity of the Tensor Cores and TMA to overlap overall computation and data movement through warp-specialization, (2) interleaving block-wise matmul and softmax operations, and (3) taking advantage of hardware support for low-precision FP8 together with incoherent processing.
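To make the "minimizing memory reads/writes" idea concrete, here is a minimal NumPy sketch of blocked attention with an online softmax, which is the algorithmic core that FlashAttention implements as a fused GPU kernel. This is an illustration only, not the CUDA implementation described in this post: the function names (`blocked_attention`, `naive_attention`), the block size, and the absence of causal masking are all assumptions made for the example.

```python
import numpy as np

def blocked_attention(Q, K, V, block_size=128):
    # Stream K/V in blocks, keeping only running row statistics (max m and
    # softmax denominator l), so the full N x N score matrix is never stored.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)                      # output accumulator
    m = np.full(N, -np.inf)                   # running row-wise score max
    l = np.zeros(N)                           # running softmax denominator
    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])        # block-local unnormalized probs
        correction = np.exp(m - m_new)        # rescale previous partial results
        l = l * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb
        m = m_new
    return O / l[:, None]

def naive_attention(Q, K, V):
    # Reference: materializes the full score matrix, softmax(QK^T / sqrt(d)) V.
    S = (Q @ K.T) / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(blocked_attention(Q, K, V), naive_attention(Q, K, V), atol=1e-6)
```

On a GPU, the same blocking is what lets the kernel keep the working set in on-chip SRAM/registers instead of round-tripping the score matrix through HBM; the techniques in this post are about how to schedule that blocked computation efficiently on Hopper.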