Interesting, this paper shows that Transformers may not need separate key and value projections to work well.
Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-09
A new research paper demonstrates that eliminating separate key and value projections in Transformers can cut the KV cache by 50% with only a 3.1% increase in perplexity, sharply reducing inference memory requirements.
Extraction
Topics: transformer-architecture, kv-cache, inference-efficiency, language-modeling
Claims
- Transformers may not require separate key and value projection matrices to perform well.
- Removing separate KV projections reduces the KV cache size by 50%.
- The cost of this simplification is 3.1% higher perplexity on language modeling.
- Inference memory requirements fall sharply, because the KV cache is half the size.
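
The 50% figure in the claims above can be checked with simple cache arithmetic. The sketch below is illustrative only: the post does not describe the paper's exact mechanism, so it assumes the shared design stores one tensor per token per layer in place of separate K and V tensors. The function name `kv_cache_bytes`, its parameters, and the model shape are hypothetical and not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2, tensors_per_token=2):
    """Bytes needed to cache attention state for one batch.

    tensors_per_token=2 models the usual separate K and V caches.
    tensors_per_token=1 models the assumed shared key/value cache.
    bytes_per_elem=2 corresponds to 16-bit storage (fp16 or bf16).
    """
    elems_per_token = num_layers * num_kv_heads * head_dim * tensors_per_token
    return elems_per_token * seq_len * batch_size * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

separate = kv_cache_bytes(**config, tensors_per_token=2)
shared = kv_cache_bytes(**config, tensors_per_token=1)
reduction = 1 - shared / separate

print(f"separate K and V: {separate / 2**30:.2f} GiB")  # 2.00 GiB
print(f"shared K/V:       {shared / 2**30:.2f} GiB")    # 1.00 GiB
print(f"reduction:        {reduction:.0%}")             # 50%
```

Under this assumption the saving is independent of model shape, sequence length, and precision, since every other factor in the product cancels and only the number of cached tensors per token changes.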
Key quotes
cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning inference memory fell sharply while prediction quality stayed