Attention, as the core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention (and FlashAttention-2) pioneered an approach to speed up attention on GPUs by minimizing memory reads/writes, and it is now used by most libraries to accelerate Transformer training and inference. This has contributed to a dramatic increase in LLM context length over the past two years, from 2-4K (GPT-3, OPT) to 128K (GPT-4), and even 1M (Llama 3). However, despite its success, FlashAttention has not yet taken full advantage of the capabilities of modern hardware: FlashAttention-2 achieves only 35% of the theoretical peak FLOP utilization on H100 GPUs. In this blog post, we describe three key techniques for speeding up attention on Hopper GPUs: (1) exploiting the asynchrony of the Tensor Cores and TMA to overlap overall computation and data movement via warp-specialization, (2) interleaving block-wise matmul and softmax operations, and (3) incoherent processing that leverages hardware support for FP8 low precision.
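
To ground the techniques above, it helps to recall the blocked, online-softmax computation that FlashAttention is built on and that technique (2) schedules onto the hardware. The sketch below is a minimal single-head PyTorch reference, not the fused CUDA kernel; the `blocked_attention` name and the block size are illustrative choices for this post.

```python
import torch

def blocked_attention(q, k, v, block=64):
    # Process K/V in blocks and maintain a running row max and row sum,
    # so the full seq_len x seq_len score matrix is never materialized.
    seq_len, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seq_len, 1), float("-inf"))
    row_sum = torch.zeros(seq_len, 1)
    for start in range(0, seq_len, block):
        kb = k[start:start + block]                    # (block, d)
        vb = v[start:start + block]
        s = (q @ kb.T) * scale                         # scores for this block
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        # Rescale previously accumulated output and normalizer to the new max.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(s - new_max)
        out = out * correction + p @ vb
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max
    return out / row_sum

# Check against a straightforward (non-blocked) reference.
torch.manual_seed(0)
q, k, v = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(blocked_attention(q, k, v), ref, atol=1e-5)
```

In exact arithmetic this matches the plain softmax(QKᵀ/√d)V computation; the techniques described in this post concern how this loop is mapped onto Hopper hardware, via warp-specialized overlap of compute and data movement, interleaved matmul/softmax, and FP8 with incoherent processing.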
