The Information Machine

Interesting, this paper shows that Transformers may not need separate key and value projections to work well.

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-09

A new research paper demonstrates that eliminating separate key and value projections in Transformers can cut the KV cache by 50% with only a 3.1% increase in perplexity, sharply reducing inference memory requirements.

Extraction

Topics: transformer-architecture, kv-cache, inference-efficiency, language-modeling

Claims

  • Transformers may not require separate key and value projection matrices to perform well.
  • Removing separate KV projections reduces the KV cache size by 50%.
  • The perplexity penalty for this architectural simplification is only 3.1%.
  • Inference memory requirements fall significantly under this design.
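
The 50% figure in the claims above follows from simple counting: a standard cache keeps two tensors per token (keys and values), so keeping one shared tensor halves it. The sketch below only illustrates that arithmetic. The summary does not describe how the paper merges the projections, and the function name `kv_cache_bytes` and all model dimensions are hypothetical values chosen for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element=2, tensors_per_token=2):
    """Approximate KV cache size in bytes for a single sequence.

    tensors_per_token=2 models a standard cache (keys and values stored
    separately); tensors_per_token=1 models one shared key/value tensor.
    bytes_per_element=2 assumes 16-bit storage.
    """
    return (tensors_per_token * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_element)


# Hypothetical model dimensions, not taken from the paper.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

separate = kv_cache_bytes(**config, tensors_per_token=2)
shared = kv_cache_bytes(**config, tensors_per_token=1)
saving = 1 - shared / separate

print(f"separate K and V: {separate / 2**30:.2f} GiB")  # 2.00 GiB
print(f"shared K/V:       {shared / 2**30:.2f} GiB")    # 1.00 GiB
print(f"reduction:        {saving:.0%}")                # 50%
```

The reduction is independent of the chosen dimensions, since every other factor cancels in the ratio. The 3.1% perplexity cost is an empirical result reported in the summary and cannot be derived from this calculation.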

Key quotes

cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning inference memory fell sharply while prediction quality stayed