The Information Machine

GPUs are leaving performance on the table.

SemiAnalysis Twitter · SemiAnalysis (@SemiAnalysis_) · 2026-05-27

SemiAnalysis highlights research by Mohamed Abdelfattah at Makora showing that auto-generated CUDA kernels outperform hand-tuned ones at scale, and argues that GPUs are leaving significant performance on the table.



Extraction

Topics: gpu-performance, cuda-kernels, kernel-optimization, auto-tuning

Claims

  • GPUs are not achieving their theoretical peak performance in real-world production workloads.
  • Auto-generated CUDA kernels are outperforming hand-written ones at scale.
  • Closing the gap between theoretical peak throughput and real-world performance is nearly impossible with manual kernel tuning.
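
To make the "gap" in these claims concrete, here is a minimal sketch of how achieved throughput is commonly compared against a theoretical peak for a single matrix multiplication. Everything in it is an assumption for illustration: the function names, the matrix sizes, the 2.0 ms timing, and the 1000 TFLOP/s peak are hypothetical and do not come from the post or the research it cites.

```python
def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of one (m x k) @ (k x n) matmul in TFLOP/s.

    Uses the conventional count of 2*m*n*k floating-point operations
    (one multiply and one add per inner-product term).
    """
    flops = 2 * m * n * k
    return flops / seconds / 1e12


def utilization(achieved: float, peak: float) -> float:
    """Fraction of the theoretical peak that was actually reached."""
    return achieved / peak


# Hypothetical inputs, chosen only to illustrate the arithmetic.
PEAK_TFLOPS = 1000.0   # assumed datasheet peak, not a real device figure
KERNEL_SECONDS = 2.0e-3  # assumed measured kernel time

achieved = achieved_tflops(8192, 8192, 8192, KERNEL_SECONDS)
util = utilization(achieved, PEAK_TFLOPS)
gap = 1.0 - util  # share of peak throughput left unused

print(f"achieved: {achieved:.1f} TFLOP/s")
print(f"utilization: {util:.1%}, gap: {gap:.1%}")
```

Under these assumed numbers the kernel reaches a little over half of the stated peak. A kernel generator or a manual tuning effort would be judged by how far it shrinks the `gap` value across the kernels and shapes a workload actually uses.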

Key quotes

Closing the gap between theoretical peak and real-world throughput is nearly impossible when hand-tuning CUDA kernels at scale.