The Information Machine

In contrast to the slow decline of the Transformers movie series in 2017, the Transformer architecture in NLP showed imm…

SemiAnalysis Twitter · SemiAnalysis (@SemiAnalysis_) · 2026-06-29

SemiAnalysis credits the 2017 Transformer paper by Vaswani, Shazeer, Jones, and Gomez for introducing Multi-Head Attention and delivering a step-change in NLP perplexity scores.


Extraction

Topics: transformer-architecture, multi-head-attention, nlp-history

Claims

  • The 2017 Transformer architecture introduced Multi-Head Attention (MHA) to NLP.
  • MHA dramatically improved perplexity scores compared to prior sequence modeling approaches.
  • The foundational Transformer paper was authored by Ashish Vaswani, Noam Shazeer, Llion Jones, and Aidan Gomez, among others.
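
For context on the perplexity claim above: perplexity is the exponential of the average negative log-likelihood a model assigns to each token, so lower values are better. The sketch below is a minimal illustration only. The function name and the per-token probabilities are hypothetical, invented for this example, and are not figures from the tweet or the paper.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probability a model assigned to each actual token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiating gives the effective number of equally likely choices.
    return math.exp(avg_nll)


# Hypothetical probabilities for the same four tokens under two models.
weaker_model = [0.10, 0.05, 0.20, 0.10]
stronger_model = [0.40, 0.25, 0.50, 0.30]

print(perplexity(weaker_model))
print(perplexity(stronger_model))
```

A model that assigns higher probability to the tokens that actually occur gets a lower perplexity, which is the sense in which an architecture can be said to "improve" the score.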

Key quotes

In contrast to the slow decline of the Transformers movie series in 2017, the Transformer architecture in NLP showed immense potential. It introduced Multi-Head Attention (MHA) and dramatically improved perplexity scores.