Interesting, this paper shows that Transformers may not need separate key and value projections to work well.

Rohan Paul (@rohanpaul_ai) · Twitter · 2026-06-09

A new research paper demonstrates that eliminating separate key and value projections in Transformers can cut the KV cache by 50% with only a 3.1% increase in perplexity, sharply reducing inference memory requirements.

Appears in

LLM Efficiency Breakthroughs: Small Models and Sparse Architectures Challenge Scale Assumptions

Extraction

Topics: transformer-architecture, kv-cache, inference-efficiency, language-modeling

Claims

Transformers may not require separate key and value projection matrices to perform well.
Removing separate KV projections reduces the KV cache size by 50%.
The perplexity penalty for this architectural simplification is only 3.1%.
Inference memory requirements fall significantly under this design.
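To see where a 50% figure can come from, here is a minimal NumPy sketch. It assumes the simplest reading of the claims above: one projection matrix produces a single tensor that serves as both keys and values, so the cache holds one array per token instead of two. The summary does not describe the paper's actual method, so the function names, shapes, and the single-head, unmasked setup are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(x, w_q, w_k, w_v):
    # Conventional single-head attention: keys and values are projected
    # separately, so both tensors must be kept in the cache.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v, {"k": k, "v": v}

def shared_kv_attention(x, w_q, w_kv):
    # Assumed variant: one projection is reused as both keys and values,
    # so only one tensor per token is cached.
    q, kv = x @ w_q, x @ w_kv
    scores = q @ kv.T / np.sqrt(kv.shape[-1])
    return softmax(scores) @ kv, {"kv": kv}

def cache_bytes(cache):
    # Total memory held by the cached arrays.
    return sum(a.nbytes for a in cache.values())

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 16, 32, 8
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))

out_std, cache_std = standard_attention(x, w_q, w_k, w_v)
out_shared, cache_shared = shared_kv_attention(x, w_q, w_k)

ratio = cache_bytes(cache_shared) / cache_bytes(cache_std)
print(f"shared cache / standard cache = {ratio}")
```

The ratio is exactly one half here because the shared variant caches one `(seq_len, d_head)` array where the standard variant caches two of the same shape. The sketch says nothing about model quality; that is what the reported perplexity figure measures.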

Key quotes

cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning inference memory fell sharply while prediction quality stayed
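
For a sense of scale, the perplexity figure can be converted into a per-token loss difference. Perplexity is the exponential of the average cross-entropy, so a relative perplexity increase corresponds to an additive loss increase equal to the log of the ratio. The sketch below assumes "3.1% higher" means a relative increase (a ratio of 1.031); the baseline perplexity is a hypothetical number, not a result from the paper.

```python
import math

def loss_increase_from_ppl_ratio(ratio):
    # perplexity = exp(cross_entropy), so multiplying perplexity by `ratio`
    # adds log(ratio) nats per token to the cross-entropy.
    return math.log(ratio)

ppl_ratio = 1.031  # assumed reading of "3.1% higher perplexity"
delta_nats = loss_increase_from_ppl_ratio(ppl_ratio)

baseline_ppl = 20.0  # hypothetical baseline, for illustration only
new_ppl = baseline_ppl * ppl_ratio

print(f"extra loss per token: {delta_nats:.4f} nats")
print(f"perplexity {baseline_ppl} -> {new_ppl:.2f}")
```

Under that reading the added loss is roughly 0.03 nats per token, whatever the baseline perplexity is.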