A quick map of the three cuts.
SemiAnalysis Twitter · SemiAnalysis (@SemiAnalysis_) · 2026-07-01
SemiAnalysis explains three hardware-level inference optimization techniques: splitting by phase (prefill vs. decode), by layer (attention on HBM GPUs vs. FFN on SRAM silicon), and by time (interleaving workloads). Each aims to recover idle compute and so reduce cost per token.
Extraction
Topics: inference-optimization, llm-hardware, compute-efficiency
Claims
- Inference can be split by phase, routing prefill and decode to separate chips that each match the workload's hardware demands.
- Attention layers are memory-hungry and benefit from HBM-rich GPUs, while feed-forward layers are compute-hungry and suit SRAM-based silicon.
- Time-based interleaving is a newer technique that alternates workload slices across a shared set of chips to eliminate idle periods.
- All three optimization approaches share a common principle: recovering idle compute to raise utilization and lower cost per token.
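To make the utilization argument in the claims above concrete, here is a minimal Python sketch. It is a toy model, not anything from the SemiAnalysis post: the function names (`utilization`, `interleave`, `cost_per_token`), the slot-based timeline, and every number are illustrative assumptions. It treats a chip's schedule as a list of busy/idle time slots, fills one workload's idle slots with another's, and shows how cost per token falls as utilization rises.

```python
from typing import List


def utilization(busy: List[bool]) -> float:
    """Fraction of time slots in which the chip is doing work."""
    return sum(busy) / len(busy)


def interleave(a: List[bool], b: List[bool]) -> List[bool]:
    """Merge two workload timelines onto one shared chip.

    Assumption: a slot can hold only one workload, so overlapping
    busy slots are rejected rather than silently merged.
    """
    if len(a) != len(b):
        raise ValueError("timelines must have the same length")
    if any(x and y for x, y in zip(a, b)):
        raise ValueError("workloads overlap in at least one slot")
    return [x or y for x, y in zip(a, b)]


def cost_per_token(chip_cost_per_hour: float,
                   peak_tokens_per_hour: float,
                   util: float) -> float:
    """Hourly chip cost divided by tokens actually produced per hour."""
    if not 0.0 < util <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return chip_cost_per_hour / (peak_tokens_per_hour * util)


if __name__ == "__main__":
    # Hypothetical timelines: workload A is busy on even slots, B on odd slots.
    workload_a = [True, False, True, False]
    workload_b = [False, True, False, True]
    shared = interleave(workload_a, workload_b)

    # Hypothetical prices and throughput, chosen only for round numbers.
    for name, timeline in [("A alone", workload_a), ("A + B interleaved", shared)]:
        u = utilization(timeline)
        print(name, u, cost_per_token(2.0, 1000.0, u))
```

In this toy setup, filling the idle slots doubles utilization and halves the cost per token. Real schedulers also have to account for switching overhead, memory capacity, and latency targets, none of which are modeled here.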
Key quotes
The pattern under all three: find idle compute and fill it. That's what drives down the cost of intelligence.