MLSys 2026: Inference Systems Research Preview
Version 1
2026-05-22 18:27 UTC · 80 items
What
MLSys 2026, the ninth annual Conference on Machine Learning and Systems, is underway in Bellevue, Washington (week of May 18–22, 2026)[1][2]. SemiAnalysis's pre-conference preview frames the inference research around four converging themes: sparse attention reaching production deployment, prefill-decode disaggregation maturing into an industry standard, MoE expert balancing emerging as an underexplored production challenge, and long-tail rollout distributions degrading RL training throughput[4][7][15][18]. NVIDIA's BLASST paper won the MLSys 2026 Best Paper award[3], and @haoailab, credited with originating prefill-decode disaggregation, is presenting a new disaggregation technique that extends that architecture[7].
Why it matters
The systems research at MLSys 2026 marks the industrialization of inference techniques that have been in academic papers for one to two years. Decisions made now — which sparse attention kernels survive production hardening, how disaggregated architectures are composed, how MoE scheduling is solved at scale — will shape AI serving infrastructure for the next hardware generation. The investor community's uptake of PD disaggregation as a GPU-longevity argument[13][14] shows these engineering choices carry real capital implications.
Open questions
What is the new disaggregation technique beyond prefill-decode that @haoailab is presenting — attention-FFN disaggregation, or something architecturally distinct? [7]
Will Distribution-Aware Speculative Decoding (targeting long-tail RL rollout lengths) achieve broad adoption, or will it remain a narrow optimization for specific training regimes? [18]
Given that MoE expert balancing at serving scale is acknowledged as underexplored in open source[15], which groups are closest to publishing production-grade solutions?
Can BLASST's dynamic blocked sparse attention generalize across model families beyond the configurations that earned it Best Paper? [3]
Narrative
MLSys 2026, held in Bellevue, Washington, brings together researchers and engineers from major AI labs and infrastructure companies to address production systems challenges[1][2]. This year's program is distinguished by a cluster of inference-focused research reflecting a broader shift: the frontier of AI systems work has moved from model architecture to the infrastructure layer — how to serve large models faster, cheaper, and at greater scale.
The conference's most prominent individual recognition went to NVIDIA's BLASST paper (Dynamic Blocked Attention Sparsity via Softmax Thresholding), which won the MLSys 2026 Best Paper award[3]. BLASST targets the core inefficiency of dense attention at long context lengths: at runtime it identifies low-salience attention blocks with a softmax threshold gate and skips them. The award reflects a wider trend that SemiAnalysis documented in a pre-conference thread: sparse attention has crossed from academic benchmarks into live production deployments, with DeepSeek's Sparse Attention and NousResearch's Lighthouse Attention both cited as examples[4]. Reference papers and guides on sparse attention patterns have appeared alongside these deployments[5][6].
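The idea of gating attention blocks on a softmax threshold can be illustrated with a small sketch. This is not BLASST's kernel or its published algorithm. It is a minimal single-query illustration that assumes one plausible gate: inside an online-softmax loop, a key block is skipped when its largest score sits so far below the running maximum that its softmax weight would fall under `threshold`. The function names, `block_size`, and `threshold` are all hypothetical.

```python
import math

def dense_attention(q, keys, values):
    """Reference: ordinary softmax attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def thresholded_block_attention(q, keys, values, block_size=2, threshold=1e-3):
    """Online-softmax attention that skips low-salience key blocks.

    Assumed gate (illustrative only): skip a block when
    exp(block_max - running_max) < threshold. Assumes at least one key.
    Returns (output_vector, number_of_skipped_blocks).
    """
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    log_threshold = math.log(threshold)
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf          # largest score seen so far
    denom = 0.0                      # running softmax denominator
    acc = [0.0] * len(values[0])     # running weighted sum of values
    skipped = 0

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        block_max = max(scores)

        # Gate: the scores are already computed, so skipping saves the
        # exponentials and the value accumulation for this block.
        if block_max - running_max < log_threshold:
            skipped += 1
            continue

        # Standard online-softmax rescaling when the running max grows.
        new_max = max(running_max, block_max)
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max

    return [a / denom for a in acc], skipped
```

Comparing the result against `dense_attention` on the same inputs shows the trade-off: a larger `threshold` skips more blocks at the cost of a larger approximation error, while a very small one reproduces the dense result. Because the gate compares against the running maximum, the set of skipped blocks depends on block order in this sketch.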
Prefill-decode (PD) disaggregation, which routes the compute-bound prefill phase and the memory-bandwidth-bound decode phase to dedicated hardware pools, is now widely characterized as an industry standard. SemiAnalysis credits @haoailab (Hao Zhang's DistServe group) as the originator and notes that the group is presenting a further disaggregation technique at MLSys 2026, while StepFun AI demonstrates attention-FFN disaggregation as an additional axis of workload specialization[7]. Production deployment activity is visible across the ecosystem: LMCache Lab released an async PDBackend that makes disaggregated serving more efficient[8]; Theta EdgeCloud tested PD disaggregation for large-scale LLM serving[9][10]; and the Atlas system was integrated into a PD disaggregation setup using heterogeneous accelerators[11]. Practitioner commentary frames co-located prefill and decode at scale as actively wasteful[12], while financial commentators picked up the architecture as evidence that GPU useful lives extend well beyond skeptics' estimates: in their framing, older GPUs can be repurposed into disaggregated decode pools[13][14].
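The compute-bound versus memory-bandwidth-bound split behind PD disaggregation can be made concrete with a back-of-envelope roofline estimate. The sketch below is a simplification under stated assumptions: it counts only one pass over the model weights (about 2 FLOPs per parameter per token, each parameter read once), ignores KV-cache and activation traffic, and uses made-up hardware numbers that do not describe any specific GPU. The class, function, and pool names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Accelerator:
    """Hypothetical accelerator described only by its two roofline numbers."""
    peak_flops: float      # peak dense throughput, FLOP/s
    mem_bandwidth: float   # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        """FLOPs per byte above which a kernel is compute-bound."""
        return self.peak_flops / self.mem_bandwidth

def weight_pass_intensity(tokens_in_flight: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte for one pass over the weights.

    Assumption: 2 FLOPs per parameter per token, and each parameter is
    read from memory once per pass, so the parameter count cancels out.
    """
    return 2.0 * tokens_in_flight / bytes_per_param

def limiting_resource(tokens_in_flight: int, hw: Accelerator,
                      bytes_per_param: float = 2.0) -> str:
    """Classify a step as 'compute' or 'memory-bandwidth' bound."""
    intensity = weight_pass_intensity(tokens_in_flight, bytes_per_param)
    return "compute" if intensity >= hw.ridge_point else "memory-bandwidth"

# Hypothetical pool names for a disaggregated deployment.
POOL_BY_RESOURCE = {"compute": "prefill-pool", "memory-bandwidth": "decode-pool"}

def pool_for_step(tokens_in_flight: int, hw: Accelerator) -> str:
    """Pick the pool whose hardware profile matches the step's bottleneck."""
    return POOL_BY_RESOURCE[limiting_resource(tokens_in_flight, hw)]

# Illustrative numbers only: 1e15 FLOP/s and 3e12 bytes/s.
EXAMPLE_HW = Accelerator(peak_flops=1.0e15, mem_bandwidth=3.0e12)
```

Under these assumptions, a prefill step over a 4096-token prompt in 16-bit weights lands well above the ridge point of `EXAMPLE_HW`, while a decode step that emits one token for each of 32 sequences lands well below it. That gap is the usual argument for giving each phase its own pool.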
Two further research threads are active. For mixture-of-experts (MoE) models, now dominant at the frontier, expert load balancing at serving time is a significant engineering challenge that SemiAnalysis observes receives disproportionately little open-source attention because it primarily surfaces in closed production systems[15]. Load-balancing losses applied during MoE training encourage expert diversity, which complicates runtime scheduling decisions[16]; an ICML 2026 paper on Cross-Layer Load Balancing in Distributed MoE Inference addresses the problem from a different angle[17]. On the training side, RL post-training workloads are exposing a structural inefficiency: the long-tail distribution of rollout lengths creates GPU bubbles in training pipelines, idle time in which finished workers wait for the longest rollouts. A new Distribution-Aware Speculative Decoding paper specifically targets this problem, building on draft-model techniques including Eagle, MTP, and DFlash[18]. SnorkelAI presented related work on RLVR in low-data, low-compute settings at the conference[19]. Industry presence is broad: vLLM, LMSYS, Inferact (co-hosting a luncheon with a16z and Mirendil), and Delta Institute are among the organizations attending[20][21][22][23].
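Both problems in the paragraph above can be quantified with simple arithmetic. The sketch below does not implement any of the cited papers. It only measures the symptoms, assuming a synchronous step in which every worker slot decodes one rollout and waits for the longest, and assuming an MoE layer's step time is set by its busiest expert. The function names and sample numbers are hypothetical.

```python
def rollout_idle_fraction(lengths):
    """Fraction of slot-time spent idle in one synchronous rollout step.

    Assumptions: one rollout per slot, decode time proportional to the
    number of generated tokens, and a barrier at the longest rollout.
    """
    if not lengths:
        raise ValueError("need at least one rollout length")
    longest = max(lengths)
    if longest <= 0:
        raise ValueError("rollout lengths must be positive")
    busy = sum(lengths)               # token-steps of useful work
    capacity = len(lengths) * longest # token-steps available before the barrier
    return 1.0 - busy / capacity

def expert_imbalance(tokens_per_expert):
    """Ratio of the busiest expert's load to the mean load (1.0 is balanced)."""
    if not tokens_per_expert:
        raise ValueError("need at least one expert")
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    if mean_load <= 0:
        raise ValueError("total load must be positive")
    return max(tokens_per_expert) / mean_load
```

For example, four rollouts of 200 to 300 tokens batched with a single 4000-token rollout give `rollout_idle_fraction([200, 250, 300, 220, 4000])` of about 0.75, meaning roughly three quarters of the slot-time is a bubble. Likewise, `expert_imbalance([100, 100, 100, 500])` is 2.5, so under the stated assumption the layer runs 2.5 times slower than a perfectly balanced assignment of the same tokens.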
Timeline
- 2026-04-25: LMSYS publishes blog post on DeepSeek-V4 fast inference and verified RL with SGLang and Miles [29]
- 2026-05-17: SemiAnalysis publishes pre-conference MLSys 2026 research preview thread covering sparse attention, disaggregation, MoE balancing, and RL training efficiency [1][4][7][15][18]
- 2026-05-17: NVIDIA researcher Huizi Mao confirms BLASST wins MLSys 2026 Best Paper for dynamic blocked sparse attention [3]
- 2026-05-18: MLSys 2026 opens in Bellevue; vLLM, LMSYS, Inferact, Delta Institute among attendees [2][20][30][23]
- 2026-05-18 to 2026-05-19: Work on scaling RL training across heterogeneous accelerators receives Oral recognition (05-18) [33]; MLSys 2026 Best Paper congratulations circulate (05-19) [31][32]
- 2026-05-20: Theta EdgeCloud PD disaggregation test coverage published; a paper from UT Austin and collaborators on LLM cluster tuning circulates [9][10][34]
- 2026-05-20: Financial and investment commentary picks up PD disaggregation as argument for extended GPU useful life [25][13][35]
- 2026-05-21: LMCache Lab releases async PDBackend for more efficient disaggregated prefill-decode serving [8]
- 2026-05-21: Inferact's recap of the MLSys luncheon it co-hosted with a16z and Mirendil circulates [22]
- 2026-05-22: Conference final day: GDN kernel work on B200 GPUs presented; SnorkelAI's recap of its RLVR talk from earlier in the week circulates [36][19]
Perspectives
SemiAnalysis
Curatorial and analytical: frames MLSys 2026 as the venue for the most important production AI systems problems, highlighting sparse attention productionization, PD disaggregation extensions, MoE expert balancing gaps, and RL training inefficiency as the four key threads
Evolution: Consistent — SemiAnalysis functions as a research-curating voice for practitioners; no stance shift observable
NVIDIA / Huizi Mao (BLASST team)
BLASST's dynamic blocked sparse attention approach has been validated by the research community as Best Paper, positioning NVIDIA's kernel research at the center of production sparse attention adoption
Evolution: Consistent with NVIDIA's broader role in inference kernel research; elevated by Best Paper recognition
@haoailab (Hao Zhang / DistServe group)
Having originated the now-standard PD disaggregation, the group is extending the paradigm with a further disaggregation technique; attention-FFN disaggregation, demonstrated by StepFun AI, is cited alongside it as another axis of workload specialization
Evolution: Progressing from originating PD disaggregation to pushing a new architectural frontier beyond it
LMCache Lab
Actively productionizing PD disaggregation with a new async backend, framing the architecture as ready for deployment at scale
Evolution: Consistent with their KV cache and disaggregation-focused engineering track
Financial / investor commentators (TheValueist, podcast_alpha_x, tropicalvalue)
PD disaggregation is evidence that GPU useful lives are substantially longer than AI skeptics claim — older GPUs can be repurposed in disaggregated decode roles, extending hardware ROI
Evolution: New voice entering the thread; this framing did not originate from ML systems researchers but from the investment community observing the conference
Tensions
- Open-source community vs. production practitioners on MoE expert balancing: SemiAnalysis observes that expert balancing in MoE serving is substantively underexplored in open source because the challenge only surfaces at production scale, creating a knowledge gap between closed systems operators and the broader community[15]. No open-source voice has publicly disputed this characterization. [15][16]
- Disaggregation maturity framing: Practitioner commentary frames PD disaggregation as 'finally stable' for production[27], while vLLM's own documentation still labels disaggregated prefilling as 'experimental'[28]. The gap between conference-floor confidence and official documentation maturity is unresolved. [27][12][28]
Sources
- [1] MLSys 2026 is next week! — SemiAnalysis Twitter (2026-05-17)
- [2] MLSys 2026 is starting today. And I'm excited to be here. This conference focuses on the most important research problem... — reactive:mlsys-2026-inference-systems (2026-05-18)
- [3] Glad to be featured by SemiAnalysis. Our work BLASST was also selected as MLSys 2026 Best Paper: https://t.co/OlkQ7x75BN... — reactive:mlsys-2026-inference-systems (2026-05-17)
- [4] Sparse attention mechanisms are finally moving beyond academic benchmarks into production systems, including DeepSeek Sp… — SemiAnalysis Twitter (2026-05-17)
- [5] The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs — reactive:mlsys-2026-inference-systems
- [6] Sparse Attention Patterns: Local, Strided & Block-Sparse Approaches - Interactive | Michael Brenndoerfer — reactive:mlsys-2026-inference-systems
- [7] @NousResearch Assigning dedicated resources to different types of workloads is an increasingly popular system optimizati… — SemiAnalysis Twitter (2026-05-17)
- [8] PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-21)
- [9] Read “Theta EdgeCloud Tests Prefill/Decode Disaggregation for Large-Scale LLM Serving“ by Theta Labs on Medium: https://... — reactive:mlsys-2026-inference-systems (2026-05-20)
- [10] Theta EdgeCloud Tests Prefill/Decode Disaggregation for Large-Scale LLM Serving #machinelearning #ml #artificialintellig... — reactive:mlsys-2026-inference-systems (2026-05-20)
- [11] Coming back to the heterogeneous nature of inference, we integrate Atlas into a PD disaggregation setup, where we use GP... — reactive:mlsys-2026-inference-systems (2026-05-18)
- [12] @lmsysorg @AMD @dstackai If you're still running prefill and decode on the same GPUs at scale, you're basically burning ... — reactive:mlsys-2026-inference-systems (2026-05-21)
- [13] I took one key insight from this convo: inference disaggregation between prefill and decode enable GPU lifespan to be ex... — reactive:mlsys-2026-inference-systems (2026-05-20)
- [14] @GavinSBaker : AI skeptics have been wrong to claim GPU useful lives are only 1-2 yrs. The disaggregation of prefill (me... — reactive:mlsys-2026-inference-systems (2026-05-21)
- [15] @NousResearch @StepFun_ai @haoailab Large scale production system challenges, such as expert balancing in serving MoE mo… — SemiAnalysis Twitter (2026-05-17)
- [16] The key observation: load-balancing losses used during MoE training encourage expert diversity. — reactive:mlsys-2026-inference-systems (2026-05-19)
- [17] Cross-Layer Load Balancing in Distributed MoE Inference - ICML 2026 — reactive:mlsys-2026-inference-systems
- [18] The long-tail distribution of rollout lengths causes one of the most critical inefficiencies in RL training. — SemiAnalysis Twitter (2026-05-17)
- [19] RT @SnorkelAI: Live from MLSys 2026! Thanks to everyone who joined @pham_derek's talk yesterday on RLVR in low-data, low... — reactive:mlsys-2026-inference-systems (2026-05-22)
- [20] vLLM crew is out in full force at MLSys 2026 🔥 — reactive:mlsys-2026-inference-systems (2026-05-18)
- [21] MLSys 2026 Happy Hour - by LMSYS & Ai2 https://t.co/etdvTEJltB — reactive:mlsys-2026-inference-systems (2026-05-18)
- [22] RT @inferact: Great cohosting this luncheon with @a16z and Mirendil at MLSys 2026 yesterday! 🙌 — reactive:mlsys-2026-inference-systems (2026-05-21)
- [23] Headed to MLSys 2026? — reactive:mlsys-2026-inference-systems (2026-05-17)
- [24] RT @lmcache: PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-22)
- [25] GPU USEFUL LIFE, INFERENCE DISAGGREGATION, AND PRIVATE CREDIT — reactive:mlsys-2026-inference-systems (2026-05-20)
- [26] For anyone listening, the relevant section starts around 44m: prefill vs. decode disaggregation. — reactive:mlsys-2026-inference-systems (2026-05-21)
- [27] @lmsysorg @CloudflareDev The real story here isn't the bug fix, it's that prefill-decode disaggregation is finally stabl... — reactive:mlsys-2026-inference-systems (2026-05-21)
- [28] Disaggregated Prefilling (experimental) - vLLM — reactive:mlsys-2026-inference-systems
- [29] DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles — reactive:mlsys-2026-inference-systems (2026-04-25)
- [30] We’re at MLSys 2026 in Bellevue this week! ⛴️ — reactive:mlsys-2026-inference-systems (2026-05-18)
- [31] Congratulations on the MLSys 2026 best paper! Looking forward to your presentation today! https://t.co/UrVcKgUJj0 — reactive:mlsys-2026-inference-systems (2026-05-19)
- [32] Congrats on winning an MLSys 2026 best research paper! https://t.co/grBdNYK4Fg — reactive:mlsys-2026-inference-systems (2026-05-19)
- [33] @Yong_jun_He Congratulations on the MLSys 2026 Oral! Scaling RL training across heterogeneous accelerators is a genuinel... — reactive:mlsys-2026-inference-systems (2026-05-18)
- [34] Tuning an LLM cluster runs the design space up to ~10^6 GPU-hours. A new paper from UT Austin + @preminstrel + colleague... — reactive:mlsys-2026-inference-systems (2026-05-20)
- [35] @MiamiMarkets @InvestLikeBest @pmarca @GavinSBaker Yes, the core idea of disaggregation (separating prefill/compute-heav... — reactive:mlsys-2026-inference-systems (2026-05-20)
- [36] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-22)