This paper makes long-context attention cheaper and faster by letting each token use only the query heads it needs.

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-27

A new attention mechanism called Grouped Query Experts routes each token to only a subset of query heads, achieving 1.7–1.8x faster prefill on long contexts while matching baseline accuracy on 250M-parameter models trained on 30B tokens.


Appears in

LLM Inference Efficiency: Phase, Layer, and Time Splitting Strategies Driving Cost Compression

Extraction

Topics: transformer-attention, mixture-of-experts, long-context-modeling, efficient-inference, grouped-query-attention

Claims

Grouped Query Experts achieves 1.7–1.8x faster prefill for long-context inputs compared to standard grouped-query attention.
The method matches baseline average accuracy (56.04 vs 55.86) while using only 9 of 16 query-attention computations.
Attention can be made sparse within grouped-query attention without quality loss when the router receives a strong learning signal and one shared head remains always active.
Grouped Query Experts sits on top of existing GQA infrastructure, making it compatible with long-context models that already use GQA for key-value cache efficiency.
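
To make the claims above concrete, here is a minimal NumPy sketch of per-token query-head routing on top of grouped-query attention. It is an illustration under stated assumptions, not the paper's implementation: the function names `route_query_heads` and `sparse_gqa`, the top-k selection rule, the choice of head 0 as the always-on shared head, the 16-query-head / 4-KV-head layout, and the zero output for skipped heads are all hypothetical. Only the "9 of 16" budget and the idea of one shared head that stays active come from the post.

```python
import numpy as np

def route_query_heads(router_logits, num_active, shared_head=0):
    """Boolean mask [tokens, heads]: which query heads each token uses.

    Assumption: the shared head is always on and the remaining
    (num_active - 1) heads are picked per token by top-k router score.
    """
    num_tokens, num_heads = router_logits.shape
    mask = np.zeros((num_tokens, num_heads), dtype=bool)
    mask[:, shared_head] = True
    k = num_active - 1
    if k <= 0:
        return mask
    logits = router_logits.astype(float)  # astype returns a copy
    logits[:, shared_head] = -np.inf      # shared head is not routed
    top = np.argpartition(-logits, k - 1, axis=1)[:, :k]
    np.put_along_axis(mask, top, True, axis=1)
    return mask

def sparse_gqa(q, k, v, mask):
    """Causal grouped-query attention computed only for active (token, head) pairs.

    q: [T, Hq, D]; k, v: [T, Hkv, D]; mask: [T, Hq].
    Assumption: skipped heads contribute zeros to the output.
    """
    T, Hq, D = q.shape
    group = Hq // k.shape[1]              # query heads per shared KV head
    causal = np.tril(np.ones((T, T), dtype=bool))
    out = np.zeros_like(q)
    for h in range(Hq):
        rows = np.flatnonzero(mask[:, h])  # tokens that selected head h
        if rows.size == 0:
            continue
        kv = h // group
        scores = q[rows, h] @ k[:, kv].T / np.sqrt(D)
        scores = np.where(causal[rows], scores, -np.inf)
        scores -= scores.max(axis=1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        out[rows, h] = weights @ v[:, kv]
    return out

# Toy sizes; random logits stand in for a learned router.
rng = np.random.default_rng(0)
T, Hq, Hkv, D = 12, 16, 4, 8
q = rng.standard_normal((T, Hq, D))
k = rng.standard_normal((T, Hkv, D))
v = rng.standard_normal((T, Hkv, D))
router_logits = rng.standard_normal((T, Hq))

mask = route_query_heads(router_logits, num_active=9)
out = sparse_gqa(q, k, v, mask)
active_fraction = mask.mean()
print(f"active query-head fraction: {active_fraction:.4f}")
```

In this sketch every token pays for 9 of 16 query-head computations, so the active fraction is 9/16 = 0.5625, while the key and value tensors keep the usual GQA shape and the KV cache is unchanged. As a back-of-envelope check rather than a result from the post: if prefill time were dominated entirely by query-attention computation, 16/9 ≈ 1.78x would be the upper bound on speedup, which is in the same range as the reported 1.7–1.8x.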

Key quotes

This paper makes long-context attention cheaper and faster by letting each token use only the query heads it needs.

shows that attention can be made sparse inside grouped-query attention without hurting quality, but only when the router gets a strong learning signal and one shared head stays always on.

This is like giving the model many possible attention patterns, while making each token pay for only the small set that seems useful.