The Information Machine

LLM Inference Efficiency: Phase, Layer, and Time Splitting Strategies Driving Cost Compression

open · v1 · 2026-07-02 · 79 items

What

LLM inference is being carved up through three successive hardware-level splits: phase disaggregation (routing prefill and decode to separate GPU pools), layer disaggregation (routing attention and feed-forward computation to hardware matched to each operation's profile), and time-based interleaving (alternating workload slices across a shared chip set to fill idle periods). [1][2] SemiAnalysis framed this three-cut framework as the central story of MLSys 2026, arguing that each split recovers previously wasted compute utilization and lowers cost per token. [1] Production results support the research: Anyscale reports 67% cost savings with prefill-decode (PD) disaggregation using Ray and vLLM on AMD MI325X hardware, [5] and the Kubernetes-native llm-d stack claims up to 70% higher throughput through the same technique. [7] Separately, UCSD's Hao AI Lab released JetSpec, a speculative decoding system using causal parallel tree drafting that achieves 9.64x speedup on MATH-500 and approximately 1,000 tokens per second on a single B200 GPU. [12]

Why it matters

Each successive optimization reduces the effective cost per token at the hardware level rather than through model compression or quality trade-offs. Industry observers including SemiAnalysis argue that cheaper tokens historically expand total inference demand rather than erode revenue, [1] and early evidence tentatively supports this: AI quarterly revenue has surpassed hardware depreciation costs for the first time. [18] If the pattern holds, the binding constraint on AI deployment shifts from raw compute cost to serving software and latency.
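The relationship between a throughput gain and a cost-per-token saving is easy to misread, so a small worked example may help. The sketch below assumes a fixed hourly hardware price and that savings come only from serving more tokens per second on the same hardware. The function names and the hourly price are hypothetical, and nothing here describes how the cited vendors measured their figures.

```python
def cost_per_million_tokens(gpu_hour_cost: float, tokens_per_second: float) -> float:
    """Cost of one million tokens on hardware billed by the hour."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_cost / tokens_per_hour * 1_000_000


def savings_from_throughput_gain(gain: float) -> float:
    """Fractional cost reduction when throughput rises by `gain` (0.70 = +70%)."""
    return 1 - 1 / (1 + gain)


def throughput_gain_for_savings(savings: float) -> float:
    """Throughput gain needed to cut cost per token by `savings` (0.67 = 67%)."""
    return 1 / (1 - savings) - 1


# Hypothetical baseline: a $2.00/hour accelerator serving 1,000 tokens/s.
baseline = cost_per_million_tokens(gpu_hour_cost=2.00, tokens_per_second=1000)
faster = cost_per_million_tokens(gpu_hour_cost=2.00, tokens_per_second=1700)

print(f"baseline: ${baseline:.3f} per million tokens")
print(f"+70% throughput: ${faster:.3f} per million tokens")
print(f"saving from +70% throughput: {savings_from_throughput_gain(0.70):.1%}")
print(f"gain needed for a 67% saving: +{throughput_gain_for_savings(0.67):.0%}")
```

Under these assumptions, a 70% throughput gain corresponds to roughly a 41% cost reduction, while a 67% cost reduction would require roughly tripling throughput. The two headline figures are therefore not directly comparable.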

Open questions

  • Will prefill-decode disaggregation converge toward a unified aggregation/disaggregation architecture, as a recent arXiv paper examines, or remain a binary architectural choice? [19]

  • Can JetSpec's 9.64x benchmark speedup translate at similar ratios to production workloads once integrated into vLLM and SGLang? [12][14]

  • Does AMD's Helios, proposed as a native GPU disaggregation solution, materially change inference hardware economics relative to NVIDIA? [15]

  • Will the demand-expansion thesis hold as per-token costs keep falling, or will enterprise inference spending eventually plateau regardless of unit economics?

Narrative

LLM inference has been progressively subdivided by hardware function, with each subdivision designed to eliminate the idle compute that results from mixing workloads with incompatible resource profiles. SemiAnalysis laid out a three-part framework as the organizing theme of MLSys 2026: a phase split routes prefill (compute-bound) and decode (memory-bandwidth-bound) to separate hardware pools suited to each; a layer split routes attention layers (memory-hungry, favoring HBM-rich GPUs) and feed-forward layers (compute-hungry, favoring SRAM-based silicon) to different chip types; and time-based interleaving alternates workload slices across a shared chip set to fill gaps that would otherwise sit idle. [1][2] The common logic across all three, per SemiAnalysis, is finding idle compute and filling it — each step lowering cost per token and, the argument goes, growing total demand rather than compressing margins.
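The compute-bound versus memory-bandwidth-bound distinction behind the phase split can be illustrated with a rough arithmetic-intensity estimate for a single linear layer. The sketch below is a simplification, not a model of any vendor's hardware: the layer width, the 16-bit values, and the peak-compute and bandwidth figures are all hypothetical values chosen for illustration.

```python
def arithmetic_intensity(num_tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte moved for one d_model x d_model linear layer."""
    flops = 2 * num_tokens * d_model * d_model            # one multiply-add per weight per token
    weight_bytes = bytes_per_value * d_model * d_model    # weights are read once
    activation_bytes = bytes_per_value * 2 * num_tokens * d_model  # read inputs, write outputs
    return flops / (weight_bytes + activation_bytes)


def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Compare intensity against the hardware ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    return "compute" if intensity >= ridge else "memory bandwidth"


# Hypothetical accelerator: 1 PFLOP/s peak compute, 4 TB/s memory bandwidth.
PEAK_FLOPS = 1.0e15
MEM_BANDWIDTH = 4.0e12
D_MODEL = 8192  # hypothetical hidden size

prefill = arithmetic_intensity(num_tokens=2048, d_model=D_MODEL)  # whole prompt at once
decode = arithmetic_intensity(num_tokens=1, d_model=D_MODEL)      # one new token, batch of 1

print(f"prefill: {prefill:.0f} FLOPs/byte, bound by {bound_by(prefill, PEAK_FLOPS, MEM_BANDWIDTH)}")
print(f"decode:  {decode:.2f} FLOPs/byte, bound by {bound_by(decode, PEAK_FLOPS, MEM_BANDWIDTH)}")
```

With these numbers the prefill pass lands well above the ridge point and the single-token decode step far below it. Batching many decode requests raises the decode intensity, which is one reason real deployments are more nuanced than this estimate.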

Prefill-decode disaggregation is the most production-mature of the three techniques. The core problem it addresses is that prefill and decode compete for the same GPU when served together: prefill requires high compute throughput while decode is bottlenecked by memory bandwidth, so a unified deployment forces a sub-optimal compromise for both. [3][4] Routing each to a dedicated hardware pool eliminates that compromise at the cost of transferring the KV cache between pools — a real overhead that practitioners treat as a manageable trade-off. [4] Anyscale validated this at scale, reporting 67% cost savings with Ray and vLLM in a disaggregated configuration on AMD MI325X. [5] The Kubernetes-native llm-d stack reports up to 70% higher tokens per second through the same approach, adding intelligent routing and autoscaling on top of standard serving infrastructure. [6][7] UCSD's Hao AI Lab, whose DistServe system pioneered much of the theoretical work on maximizing goodput (tokens delivered within latency SLA, rather than raw throughput), has tracked adoption from initial proposal in 2024 through substantial production uptake by late 2025. [8][9] Red Hat has published a multi-part technical series targeting enterprise deployments, covering disaggregation alongside KV cache tiering and EAGLE speculative decoding. [10][11]
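The KV cache transfer overhead mentioned above can be estimated from model shape and link speed. The following is a back-of-the-envelope sketch only: the layer count, KV head count, head dimension, prompt length, and link bandwidth are hypothetical, and real systems may compress, quantize, or overlap the transfer with compute.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one request: keys and values for every layer and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def transfer_seconds(num_bytes: int, link_bytes_per_second: float) -> float:
    """Idealized transfer time, ignoring latency and protocol overhead."""
    return num_bytes / link_bytes_per_second


# Hypothetical model shape and a hypothetical 50 GB/s link between pools.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
seconds = transfer_seconds(size, link_bytes_per_second=50e9)

print(f"KV cache for a 4096-token prompt: {size / 2**20:.0f} MiB")
print(f"idealized transfer time: {seconds * 1000:.1f} ms")
```

Whether a delay of this size is acceptable depends on the time-to-first-token budget, which is why the trade-off has to be weighed per deployment rather than assumed away.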

JetSpec, also from Hao AI Lab, represents a distinct optimization line: applying causal parallel tree drafting to speculative decoding. Speculative decoding uses a smaller draft model to propose candidate token sequences that a larger verifier then accepts or rejects in parallel — recovering throughput without changing output quality. JetSpec's contribution is co-optimizing draft cost and draft quality by exploring multiple candidate paths simultaneously rather than a single draft chain. Benchmarks show 9.64x end-to-end speedup on MATH-500 (a reasoning benchmark) and 4.58x on open-ended chat; with CUDA graph and kernel optimizations, the system reaches approximately 1,000 tokens per second on a single B200 GPU. [12][13] The team designed JetSpec for integration with vLLM and SGLang, though production integration is still forthcoming. [14] SemiAnalysis explicitly flagged this result alongside the three-cut macro narrative, treating speculative decoding as a fourth dimension of inference efficiency rather than a separate research track.
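To make the draft-and-verify idea concrete, here is a toy sketch of tree-based speculative decoding with greedy verification. It is not JetSpec's algorithm or code: both "models" are hypothetical stand-in functions, the draft tree is a plain nested dictionary, and verification is simulated with sequential calls, whereas a real system would score all tree nodes in one batched forward pass of the verifier.

```python
from typing import Dict, List, Tuple

Tree = Dict[int, "Tree"]  # token -> subtree of possible continuations
VOCAB = 11


def target_next(prefix: List[int]) -> int:
    """Toy stand-in for the large verifier: deterministic greedy next token."""
    return (prefix[-1] * 7 + 3) % VOCAB


def draft_candidates(prefix: List[int]) -> List[int]:
    """Toy stand-in for the draft model: two guesses per step, not always right."""
    truth = (prefix[-1] * 7 + 3) % VOCAB
    if prefix[-1] % 5 == 0:
        return [(truth + 1) % VOCAB, (truth + 2) % VOCAB]  # both guesses wrong
    if prefix[-1] % 2 == 0:
        return [(truth + 1) % VOCAB, truth]                # right answer is the second guess
    return [truth, (truth + 2) % VOCAB]                    # right answer is the first guess


def build_draft_tree(prefix: List[int], depth: int) -> Tree:
    """Expand every candidate at every level, giving several paths instead of one chain."""
    if depth == 0:
        return {}
    return {tok: build_draft_tree(prefix + [tok], depth - 1)
            for tok in draft_candidates(prefix)}


def verify(prefix: List[int], tree: Tree) -> List[int]:
    """Follow the verifier's greedy choices through the tree.

    Returns every accepted draft token plus one extra verifier token,
    so each round yields at least one token.
    """
    accepted: List[int] = []
    node = tree
    while True:
        tok = target_next(prefix + accepted)
        accepted.append(tok)
        if tok not in node:  # draft missed (or tree exhausted): stop after the verifier's token
            return accepted
        node = node[tok]


def speculative_generate(prompt: List[int], n_tokens: int, depth: int = 3) -> Tuple[List[int], int]:
    seq, rounds = list(prompt), 0
    while len(seq) - len(prompt) < n_tokens:
        seq.extend(verify(seq, build_draft_tree(seq, depth)))
        rounds += 1
    return seq[:len(prompt) + n_tokens], rounds


def greedy_generate(prompt: List[int], n_tokens: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(target_next(seq))
    return seq


tokens, rounds = speculative_generate([1], n_tokens=20)
print("same output as plain greedy decoding:", tokens == greedy_generate([1], 20))
print("verification rounds for 20 tokens:", rounds)
```

Because every emitted token is the verifier's own greedy choice, the output matches plain decoding exactly; the saving comes from needing fewer verification rounds than tokens. The second branch of `draft_candidates` shows where a tree helps: a single draft chain would have followed the wrong first guess.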

At the hardware level, a secondary debate concerns which silicon benefits most from these algorithmic advances. Anyscale's 67% savings figure was measured on AMD MI325X rather than NVIDIA hardware, and community discussion mentions AMD's Helios as a potential native GPU disaggregation solution. [5][15] Analysts have re-rated AMD to a $1T–$1.14T market cap range, partly on inference hardware positioning. [16] Meanwhile, a model identified in the source only as "5.6" is quoted at 750 tokens per second on Cerebras hardware, a figure reflecting a different approach (wafer-scale SRAM) that sidesteps the memory-bandwidth bottleneck that makes disaggregation necessary in the first place. [17] As algorithmic splitting strategies meet a diversifying hardware market, inference architecture decisions are becoming hardware-specific rather than portable.

Timeline

  • 2024-01-01: UCSD Hao AI Lab proposes DistServe, establishing prefill-decode disaggregation's theoretical basis around maximizing goodput rather than raw throughput. [8][9]
  • 2025-12-01: By late 2025, prefill-decode disaggregation has moved from an early research proposal (2024) to substantial production adoption. [9]
  • 2026-06-16: Anyscale publishes benchmark showing 67% cost savings with Ray+vLLM PD disaggregation on AMD MI325X. [5]
  • 2026-06-25: Industry commentary notes AI inference costs falling again; [26] a separate observer attributes the developer shift toward price-competitive open-weight and Chinese models to inference economics. [25]
  • 2026-06-26: Red Hat publishes Part 1 of its distributed LLM inference series covering PD disaggregation, KV cache tiering, and EAGLE speculative decoding. [10][22]
  • 2026-06-27: DistServe and PD disaggregation discussed in the Japanese-language ML community; Attention-FFN disaggregation noted as an emerging further split beyond phase-level PD. [9][29]
  • 2026-06-28: Report of a Chinese AI lab achieving an inference milestone the industry expected to take years longer. [30]
  • 2026-06-29: llm-d Kubernetes-native inference stack highlighted for up to 70% higher throughput via PD disaggregation with intelligent routing and autoscaling. [6][7]
  • 2026-06-29: Practitioner thread (Prajjwal, nanoserve) explains the compute/bandwidth conflict in unified serving and the KV-cache transfer cost of disaggregation as a real but manageable trade-off. [3][4]
  • 2026-06-30: Red Hat publishes Part 2 of its distributed inference series on advanced disaggregation topics. [11]
  • 2026-06-30: SemiAnalysis amplifies JetSpec results: 9.64x speedup on MATH-500, 4.58x on open-ended chat, ~1,000 TPS on a single B200 GPU. [12]
  • 2026-07-01: SemiAnalysis publishes three-cut framework (phase, layer, time splitting) as the central inference-efficiency narrative from MLSys 2026, arguing cheaper tokens grow total demand. [1][2]
  • 2026-07-01: AMD Helios proposed as a native GPU disaggregation solution; AMD receives analyst upgrades to $1T–$1.14T market cap range on inference hardware positioning. [15][16]

Perspectives

SemiAnalysis

Frames phase, layer, and time splitting as a unified progression recovering wasted hardware utilization, with each step lowering cost per token; argues cheaper tokens grow total AI demand rather than compress revenue, and identifies this as the defining theme of MLSys 2026.

Evolution: Consistent analytical and demand-bullish position across both the JetSpec amplification and the three-cut macro thread.

UCSD Hao AI Lab

Originators of both DistServe (PD disaggregation, focused on maximizing goodput over raw throughput) and JetSpec (causal parallel tree speculative decoding); positions the two as complementary inference optimization strategies.

Evolution: Consistent research framing; JetSpec represents a newer line of work beyond disaggregation.

Anyscale (Robert Nishihara)

Production validation: 67% cost savings with Ray+vLLM PD disaggregation on AMD MI325X demonstrates that disaggregation is deployable at scale on non-NVIDIA hardware.

Evolution: No prior stance; first entry.

Red Hat

Enterprise infrastructure framing: multi-part technical series treats disaggregation, KV cache tiering, and speculative decoding as production-ready techniques for enterprise customers rather than research-stage work.

Evolution: No prior stance; first entry.

Prajjwal (nanoserve)

Practitioner perspective: disaggregation solves a real compute/bandwidth conflict in unified serving but introduces KV cache transfer overhead that must be weighed explicitly; not a free optimization.

Evolution: No prior stance; first entry.

Open-source ecosystem (llm-d, vLLM, Ray)

Kubernetes-native stacks like llm-d are productizing disaggregation with intelligent routing and autoscaling; vLLM and Ray document disaggregated prefill as a supported configuration.

Evolution: Rapid movement from experimental to documented production feature.

Developer/market observers

Developers follow price and latency over model loyalty; the shift toward open-weight and Chinese models is driven by inference economics, accelerating efficiency investment from Western labs.

Evolution: No prior stance; first entry.

Tensions

  • Disaggregation vs. unified architecture: a recent arXiv paper argues prefill-decode aggregation and disaggregation should be unified rather than treated as a binary choice, while production deployments treat disaggregation as the clear efficiency winner. [19][5][7]
  • Cost compression vs. demand expansion: SemiAnalysis argues falling per-token costs grow total AI demand; some market observers treat the same trend as deflationary pressure on AI revenue. [1][18][27]
  • NVIDIA vs. AMD for disaggregated inference: Anyscale's 67% savings result is on AMD MI325X and AMD Helios is proposed as a native disaggregation solution, directly contesting NVIDIA's assumed inference hardware dominance. [5][15][16]
  • Throughput vs. goodput as the right optimization target: DistServe's framing argues maximizing raw throughput ignores latency SLAs; the correct metric is goodput (tokens delivered within SLA), which changes how disaggregation trade-offs should be evaluated. [8][28]
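
The throughput-versus-goodput distinction in the last item can be made concrete with a small calculation. This sketch follows the definition used in this briefing (tokens delivered within SLA). The request records, the field names, and the two latency thresholds (time to first token and time per output token) are hypothetical values for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    ttft_s: float        # time to first token, seconds
    tpot_s: float        # average time per output token, seconds
    output_tokens: int


def throughput(requests: List[Request], window_s: float) -> float:
    """All tokens produced per second, regardless of latency."""
    return sum(r.output_tokens for r in requests) / window_s


def goodput(requests: List[Request], window_s: float,
            max_ttft_s: float, max_tpot_s: float) -> float:
    """Tokens per second counting only requests that met both latency targets."""
    within_sla = [r for r in requests
                  if r.ttft_s <= max_ttft_s and r.tpot_s <= max_tpot_s]
    return sum(r.output_tokens for r in within_sla) / window_s


# Hypothetical 10-second window with three finished requests.
window = [
    Request(ttft_s=0.2, tpot_s=0.03, output_tokens=200),  # meets both targets
    Request(ttft_s=1.5, tpot_s=0.03, output_tokens=300),  # first token too slow
    Request(ttft_s=0.3, tpot_s=0.12, output_tokens=500),  # streaming too slow
]

print("throughput:", throughput(window, window_s=10.0), "tokens/s")
print("goodput:   ", goodput(window, window_s=10.0, max_ttft_s=0.5, max_tpot_s=0.05), "tokens/s")
```

In this made-up window the system looks five times better on raw throughput than on goodput, which is the gap the goodput framing is meant to expose.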

Status: active and growing

Sources

  1. [1] Inference keeps getting carved up, and every cut makes intelligence cheaper. — SemiAnalysis Twitter (2026-07-01)
  2. [2] A quick map of the three cuts. — SemiAnalysis Twitter (2026-07-01)
  3. [3] Prefill and decode fight for the same GPU. Prefill is compute bound, decode is memory bandwidth bound. Run them in the s... — reactive:inference-cost-optimization (2026-06-29)
  4. [4] Disaggregation goes further: prefill on one GPU pool, decode on another, ship the KV cache between them. You pay a trans... — reactive:inference-cost-optimization (2026-06-29)
  5. [5] 67% Cost Savings with PD Disaggregation Using Ray and vLLM on AMD MI325X — reactive:inference-cost-optimization (2026-06-16)
  6. [6] llm-d is a Kubernetes-native inference serving stack that adds intelligent routing, KV-cache management, and autoscaling... — reactive:inference-cost-optimization (2026-06-29)
  7. [7] @GithubProjects Prefill/decode disaggregation yielding up to 70% higher tokens/sec is massive for scaling large models w... — reactive:inference-cost-optimization (2026-06-29)
  8. [8] Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation | Hao AI Lab @ UCSD — reactive:inference-cost-optimization
  9. [9] The UC San Diego Hao AI Lab blog explains the usefulness of DistServe's Prefill Decode Disaggregation very clearly and, in addition, from 2024 when DistServe was proposed through the end of 2025... — reactive:inference-cost-optimization (2026-06-27)
  10. [10] Red Hat posted a technical article on distributed LLM inference: prefill/decode disaggregation, KV cache tiering, EAGLE ... — reactive:inference-cost-optimization (2026-06-27)
  11. [11] Part 2 of our Distributed AI Inference series is now live on Red Hat Developer: Optimizing Distributed AI Inference: Adv... — reactive:inference-cost-optimization (2026-06-30)
  12. [12] Parallel draft tree, tree-causal verification — SemiAnalysis Twitter (2026-06-30)
  13. [13] JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting — reactive:inference-cost-optimization
  14. [14] GitHub - hao-ai-lab/JetSpec: JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Causal Parallel Tree Drafting · GitHub — reactive:inference-cost-optimization
  15. [15] @MilkRoadAI $AMD will solve this issue with Helios by providing for native GPU disaggregation. — reactive:inference-cost-optimization (2026-07-01)
  16. [16] $AMD is now re-rated to $1T - $1.14T market cap from different analysts/researched firms🧵 — reactive:inference-cost-optimization (2026-06-29)
  17. [17] 5.6 on cerebras is quoted at 750 tokens a second this week, and the speed is worth understanding before you budget aroun... — reactive:inference-cost-optimization (2026-06-29)
  18. [18] AI quarterly revenue surpasses depreciation costs for the first time—has the trillion-dollar bet on computing power ente... — reactive:inference-cost-optimization (2026-06-26)
  19. [19] Prefill-Decode Aggregation or Disaggregation? Unifying Both ... - arXiv — reactive:inference-cost-optimization
  20. [20] Paper page - JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting — reactive:inference-cost-optimization
  21. [21] JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with ... — reactive:inference-cost-optimization
  22. [22] Scale your distributed #AI inference from a single vLLM instance to a multi-tenant grid. Explore prefill-decode disaggre... — reactive:inference-cost-optimization (2026-06-26)
  23. [23] Disaggregated Prefilling (experimental) - vLLM — reactive:mlsys-2026-inference-systems
  24. [24] Prefill-decode disaggregation — Ray 2.55.1 — reactive:mlsys-2026-inference-systems
  25. [25] The shift toward Chinese/open-weight models was already happening because developers follow price, latency, availability... — reactive:gpt-56-launch-government-access (2026-06-26)
  26. [26] The cost of running an AI model dropped again this week, and the trend is the whole story. — reactive:inference-cost-optimization (2026-06-25)
  27. [27] OpenAI and Anthropic love to compare AI to electricity. 'AI will be like a utility,' they say. 'Infrastructure for the n... — reactive:inference-cost-optimization (2026-06-30)
  28. [28] Compute-bound and memory-bandwidth-bound are not the same problem. Treating them like one is costing you throughput. — reactive:inference-cost-optimization (2026-06-30)
  29. [29] I didn't even know that Attention-FFN Disaggregation was being seriously considered, so this is really educational. — reactive:inference-cost-optimization (2026-06-27)
  30. [30] A Chinese AI lab just pulled off something the AI industry thought would take years. — reactive:inference-cost-optimization (2026-06-28)