The Information Machine

A quick map of the three cuts.

SemiAnalysis Twitter · SemiAnalysis (@SemiAnalysis_) · 2026-07-01

SemiAnalysis explains three hardware-level techniques for optimizing inference: splitting by phase (prefill vs. decode), by layer (attention on HBM GPUs vs. FFN on SRAM silicon), and by time (interleaving workloads). Each recovers idle compute to reduce cost per token.

Extraction

Topics: inference-optimization, llm-hardware, compute-efficiency

Claims

  • Inference can be split by phase, routing prefill and decode to separate chips that each match the workload's hardware demands.
  • Attention layers are memory-hungry and benefit from HBM-rich GPUs, while feed-forward layers are compute-hungry and suit SRAM-based silicon.
  • Time-based interleaving is a newer technique that alternates workload slices across a shared chip set to eliminate idle periods.
  • All three optimization approaches share a common principle: recovering idle compute to lower cost per token.
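
The link between idle compute and cost per token can be made concrete with a toy model. The sketch below is illustrative only and is not taken from the original post: the function names, the slot-based timeline, the hourly chip cost, and the throughput figure are all hypothetical assumptions. It treats a chip's schedule as a list of busy (1) and idle (0) slots, fills idle slots with slices from a second workload in the spirit of time-based interleaving, and compares cost per token before and after.

```python
from typing import List


def utilization(timeline: List[int]) -> float:
    """Fraction of time slots in which the chip is busy (1 = busy, 0 = idle)."""
    return sum(timeline) / len(timeline)


def interleave(primary: List[int], filler_slices: int) -> List[int]:
    """Fill idle slots of the primary workload with slices from a second workload.

    Assumption: every filler slice fits in exactly one idle slot and switching
    between workloads is free. Real systems pay a switching cost.
    """
    remaining = filler_slices
    filled = []
    for busy in primary:
        if busy:
            filled.append(1)
        elif remaining > 0:
            filled.append(1)
            remaining -= 1
        else:
            filled.append(0)
    return filled


def cost_per_token(chip_cost_per_hour: float,
                   peak_tokens_per_hour: float,
                   util: float) -> float:
    """Cost per token when only a fraction `util` of peak throughput is realized."""
    if not 0 < util <= 1:
        raise ValueError("util must be in (0, 1]")
    return chip_cost_per_hour / (peak_tokens_per_hour * util)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
primary = [1, 0, 1, 0, 1, 0, 0, 1]               # chip is idle half the time
filled = interleave(primary, filler_slices=3)    # second workload fills 3 idle slots

before = cost_per_token(2.0, 1000.0, utilization(primary))
after = cost_per_token(2.0, 1000.0, utilization(filled))
print(f"utilization: {utilization(primary):.3f} -> {utilization(filled):.3f}")
print(f"cost per token: {before:.5f} -> {after:.5f}")
```

Under these assumptions the chip costs the same per hour in both cases, so every idle slot that gets filled spreads that fixed cost over more tokens. Phase and layer splitting pursue the same effect by matching each part of the workload to hardware that it keeps busy.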

Key quotes

The pattern under all three: find idle compute and fill it. That's what drives down the cost of intelligence.