Quite incredible, MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and 7…

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-15

MiniMax Sparse Attention achieves 28.4x reduction in attention compute and up to 14.2x faster prefill at 1M token context length on H800 GPUs while nearly matching full-attention benchmark performance.


Appears in

LLM Efficiency Breakthroughs: Small Models and Sparse Architectures Challenge Scale Assumptions

Extraction

Topics: sparse-attention, long-context, inference-efficiency, transformer-architecture

Claims

MiniMax Sparse Attention reduces attention compute by 28.4x at 1M token context length.
Prefill speed increases 14.2x and decoding speed increases 7.6x on H800 GPUs compared to full attention.
The sparse attention approach mostly matches full-attention benchmark performance despite the efficiency gains.
The efficiency gains stem from not treating every token as equally relevant to every other token.
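
To make the last claim concrete, here is a minimal sketch of one generic way attention can skip less relevant tokens: keep only the top-k scoring keys for a query. This is an illustration under stated assumptions, not MiniMax's method, which the post does not describe. The function names and the top-k selection rule are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(query, keys, values):
    # Baseline: every key contributes to the output.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, key) * scale for key in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def topk_sparse_attention(query, keys, values, k):
    # Hypothetical selection rule: keep the k highest-scoring keys.
    # Note: this toy still scores every key to pick the top k, so it only
    # saves the softmax and value aggregation work. A practical system
    # would need a cheaper way to choose which keys to keep.
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, key) * scale for key in keys]
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in keep])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, keep)) for d in range(dim)]

if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[4.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    vals = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print("full:  ", full_attention(q, keys, vals))
    print("top-1: ", topk_sparse_attention(q, keys, vals, 1))
```

With k equal to the number of keys, the sparse version reduces to full attention. With a smaller k, the output is built from fewer value vectors, which is the general trade the claims above describe: less work per query in exchange for a result that only approximates the full one.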

Key quotes

MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and 7.6X faster decoding on H800 GPUs.

This can happen when attention stops treating every token as [equally important].