New Microsoft paper argues that transformers generalize better when they learn compact internal states, not just next to…

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-24

A new Microsoft research paper argues that transformers achieve better generalization when they are forced to learn compact internal state representations of past context, rather than attending directly to every prior token.

Extraction

Topics: transformer-architecture, machine-learning-research, generalization, internal-representations

Claims

Transformers generalize better when they compress past context into compact internal states rather than relying on full token-level attention lookback.
Standard transformers have no architectural pressure to form clean context summaries because they can attend directly to all prior tokens.
Constraining models to maintain compact internal states improves their generalization capability.
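
To make the contrast in these claims concrete, here is a minimal toy sketch. It is not the paper's method, and every name in it (`full_lookback_contexts`, `compact_state_contexts`, the parity task) is a hypothetical choice made for illustration. It compares the context a predictor has at each step when every earlier token stays visible against the context it has when the past must be folded into one fixed-size state.

```python
from typing import List


def full_lookback_contexts(tokens: List[int]) -> List[List[int]]:
    """Context at each step when every earlier token stays visible."""
    return [tokens[: t + 1] for t in range(len(tokens))]


def compact_state_contexts(tokens: List[int]) -> List[int]:
    """Context at each step when the past is folded into one fixed-size state.

    Assumption for this toy: the state is a single running-parity bit.
    """
    state = 0
    states = []
    for tok in tokens:
        # The update sees only the previous state and the new token.
        state ^= tok
        states.append(state)
    return states


tokens = [1, 0, 1, 1, 0, 1]
full = full_lookback_contexts(tokens)
compact = compact_state_contexts(tokens)

# Both views carry enough information for the same target (running parity).
parity_from_full = [sum(ctx) % 2 for ctx in full]
assert parity_from_full == compact

# The full-lookback context grows with position; the compact state does not.
print([len(ctx) for ctx in full])
print(compact)
```

In this toy, the full-lookback predictor never has to summarize anything, because the raw history is always available. The compact-state version only works if the update rule captures exactly what matters about the past, which is the kind of pressure the claims above describe.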

Key quotes

transformers generalize better when they learn compact internal states, not just next tokens

normal transformers can look back at every earlier token, so they do not have to squeeze the past into a clean summary