Discussion about this post

Michael Lopez Chiesa

MSA (#1) and Lookahead Sparse Attention (#7) are quietly the same paper: both use a lightweight learned indexer to choose which KV blocks to keep, and in both the real difficulty is training that indexer without corrupting the backbone. MSA detaches the index gradient; LSA trains the indexer backbone-free. Same seam, cut two ways. The headline numbers are the least interesting part, since they fall out of the selection budget. The transferable content is the training recipe, and the shared bet worth scrutinizing is that relevant context is sparse and selectable, an assumption that weakens on diffuse-dependency tasks. Is the decoupled-indexer pattern the real trend this week, more than the FLOP counts?
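
To make the "numbers fall out of the selection budget" point concrete, here is a rough sketch of block selection with a fixed budget. It is not the indexer from either paper: the mean-key summary, the dot-product score, and every function name are assumptions chosen for illustration, and it covers only selection, not the training recipe.

```python
# Hypothetical sketch: none of these names or scoring choices come from MSA or LSA.

def mean_key(block):
    """Average the key vectors in one block into a single summary vector."""
    dim = len(block[0])
    return [sum(key[d] for key in block) / len(block) for d in range(dim)]

def select_blocks(query, key_blocks, budget):
    """Return indices of the `budget` blocks whose summary best matches the query."""
    scores = []
    for idx, block in enumerate(key_blocks):
        summary = mean_key(block)
        scores.append((sum(q * s for q, s in zip(query, summary)), idx))
    # Highest score first; ties broken by lower block index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(idx for _, idx in scores[:budget])

def kept_fraction(num_blocks, budget):
    """Fraction of KV blocks attended to; depends on the budget, not the scores."""
    return min(budget, num_blocks) / num_blocks

key_blocks = [
    [[1.0, 0.0], [0.9, 0.1]],    # block 0: aligned with the query
    [[0.0, 1.0], [0.1, 0.9]],    # block 1: orthogonal to the query
    [[-1.0, 0.0], [-0.8, 0.0]],  # block 2: opposed to the query
    [[0.7, 0.7], [0.6, 0.8]],    # block 3: partly aligned
]
query = [1.0, 0.0]

selected = select_blocks(query, key_blocks, budget=2)
fraction = kept_fraction(len(key_blocks), budget=2)
```

Whatever the scorer does, `kept_fraction` is set by `budget` alone, which is why the compute savings say little about the method. The quality of `select_blocks` is where the training recipe would matter.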
