MLSys 2026: Inference Systems Research Preview

Version 3

2026-05-24 04:52 UTC · 128 items

What

MLSys 2026 (May 18–22, Bellevue, WA) concluded with three inference research threads already moving from conference papers into engineering artifacts. NVIDIA's BLASST (arXiv 2512.12087), a dynamic blocked sparse attention method, won Best Paper[3][36]; a feature request asks for its integration into FlashInfer[5], and FlashInfer's tracker separately shows sparse-MLA kernel work for new SM120 hardware[6]. Attention-FFN disaggregation has produced competing implementations: StepFun AI's StepMesh is openly hosted on GitHub[8] and documented in the Step-3 system paper[9], while vLLM is pursuing the design through an RFC[10]. Two independent groups have targeted long-tail RL rollout inefficiency: Together.ai/WukLab with DASD (arXiv 2511.13841)[22], presented at MLSys, and MIT HAN Lab with Adaptive Drafter (arXiv 2511.16665, ASPLOS'26)[23][24]. At least one secondary source attributes a 2x training speedup to the MIT approach[25].

Why it matters

MLSys 2026 marks a point at which sparse attention, disaggregated serving, and speculative decoding for RL training efficiency are all crossing from research papers into production engineering artifacts at the same time. That two independent groups, Together.ai/WukLab and MIT HAN Lab, converged on the same long-tail RL rollout inefficiency suggests the problem is recognized as a high-value systems target worth parallel investment. Post-conference activity in vLLM RFCs and FlashInfer issue trackers is likely to shape AI serving infrastructure for the next hardware generation.

Open questions

Will FlashInfer act on the feature request to integrate BLASST's dynamic blocked sparse attention[5], and will the parallel sparse-MLA kernel work for SM120 hardware[6] accelerate that path?
StepMesh[8][34] and the vLLM RFC[10] address attention-FFN disaggregation from different integration points — which will establish the interoperability standard, or can they converge around a shared approach?
MIT's Adaptive Drafter reportedly doubles LLM training speed[25][24], while Together.ai claims up to 50% RL rollout acceleration for DASD[21]. How do these two independent approaches compare across model scales and RL algorithms, and are they composable?
MoE expert balancing at serving scale remains underexplored in open source[27] — does MegaScale-Infer's claimed MoE serving cost reduction[15] address expert balancing specifically, or does the open-source knowledge gap persist?

Narrative

MLSys 2026, held May 18–22 in Bellevue, Washington, brought together researchers and engineers from across the AI industry to address production inference and training systems challenges[1][2]. The conference's most prominent recognition went to NVIDIA's BLASST paper (arXiv 2512.12087), which introduces Dynamic Blocked Attention Sparsity via Softmax Thresholding[3]. BLASST dynamically identifies and skips low-salience attention blocks at runtime using a softmax threshold gate, attacking the core inefficiency of dense attention at long context lengths. Its Best Paper award reflects a broader productionization wave documented in a pre-conference SemiAnalysis thread: sparse attention has migrated from academic benchmarks into live deployments, with DeepSeek's Sparse Attention and NousResearch's Lighthouse Attention cited as examples[4]. The inference community has opened a feature request to integrate BLASST into FlashInfer[5]; separately, FlashInfer's issue tracker shows active work on sparse-MLA paged attention for the new SM120 GPU architecture[6], indicating that sparse attention kernel development is continuing across multiple hardware targets in the weeks following the conference.
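The threshold gate described above can be illustrated with a toy single-query example. This is a minimal sketch of one plausible reading of "skip a block when its softmax weights would fall below a threshold", written in plain Python around an online-softmax loop. It is not the BLASST kernel; the function name `blocked_attention`, its parameters, and the default behavior are assumptions made for illustration only.

```python
import math

def blocked_attention(q, keys, values, block_size, threshold=None):
    """Single-query attention over key/value blocks using online softmax.

    Illustrative assumption: when `threshold` is set, a block is skipped if
    its largest score is so far below the running maximum that every weight
    in the block would be smaller than `threshold` relative to the largest
    weight seen so far.
    """
    scale = 1.0 / math.sqrt(len(q))
    log_thr = math.log(threshold) if threshold else None
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    skipped = 0
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        blk_max = max(scores)
        # Gate: skip the exponentials and the value accumulation entirely.
        if log_thr is not None and blk_max - running_max < log_thr:
            skipped += 1
            continue
        new_max = max(running_max, blk_max)
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc], skipped

# Toy data: the first block aligns with the query, the second opposes it.
q = [4.0, 0.0]
keys = [[4.0, 0.0], [3.0, 0.0], [-4.0, 0.0], [-3.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [7.0, 7.0]]
dense_out, _ = blocked_attention(q, keys, values, block_size=2)
sparse_out, n_skipped = blocked_attention(q, keys, values, block_size=2, threshold=1e-3)
print(n_skipped, max(abs(a - b) for a, b in zip(dense_out, sparse_out)))
```

In this sketch the decision depends on the running maximum seen so far, so block order affects which blocks are skipped; a production kernel would also have to handle batching, multiple heads, and hardware tiling, none of which is modeled here.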

Prefill-decode (PD) disaggregation, which routes the compute-bound prefill phase and the memory-bandwidth-bound decode phase to dedicated hardware pools, is now widely accepted as an industry best practice[7]. Attention-FFN disaggregation has emerged as the next architectural axis, and the speed of its industrialization illustrates how quickly the ML systems cycle now runs. StepFun AI's StepMesh communication library for attention-FFN disaggregated MoE inference is openly published on GitHub[8], and the accompanying Step-3 paper (arXiv 2507.19427) documents model-system co-design for cost-effective decoding, providing production context for StepMesh's role[9]. The vLLM project posted an RFC proposing ATTN-FFN disaggregation for MoE models[10][11]. Formal research has followed: one paper (arXiv 2601.21351) treats analytical provisioning under stochastic workloads[13], and a dedicated challenges paper (arXiv 2602.09721) examines the hardware and systems challenges the approach surfaces for modern MoE architectures[12][14]. MegaScale-Infer, described as slashing LLM serving costs for MoE models through disaggregation[15], represents another production system in the same architectural direction. Financial commentators picked up PD disaggregation as evidence for extended GPU hardware ROI, arguing that older GPUs can be repurposed into disaggregated decode pools[16][17]. Practitioner commentary holds that co-locating prefill and decode at scale is actively wasteful[18][19], though vLLM's own documentation still labels disaggregated prefilling as experimental[20].
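The compute-bound versus memory-bandwidth-bound distinction behind PD disaggregation can be checked with simple roofline arithmetic. The sketch below considers a single dense layer only and ignores attention and KV-cache traffic. The layer size, the 2-byte element width, and the hardware figures (`peak_flops`, `mem_bandwidth`) are hypothetical values chosen for illustration, not measurements of any specific GPU.

```python
def arithmetic_intensity(tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for one dense layer applied to `tokens` tokens."""
    flops = 2 * tokens * d_in * d_out                      # multiply-accumulate
    weight_bytes = bytes_per_elem * d_in * d_out           # weights read once
    activation_bytes = bytes_per_elem * tokens * (d_in + d_out)
    return flops / (weight_bytes + activation_bytes)

def bound(tokens, d_in, d_out, peak_flops, mem_bandwidth):
    """Classify the layer against the roofline ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    if arithmetic_intensity(tokens, d_in, d_out) >= ridge:
        return "compute-bound"
    return "memory-bound"

# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 bytes/s memory bandwidth.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

# Prefill processes a whole prompt at once; decode processes one new token
# per sequence per step (batch size 1 here).
print("prefill:", bound(4096, 8192, 8192, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode: ", bound(1, 8192, 8192, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions the same layer sits on opposite sides of the ridge point depending on how many tokens it processes per pass, which is the usual argument for giving each phase its own hardware pool. Larger decode batches raise the intensity, so the boundary shifts with batching policy.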

Two independent research efforts have targeted the same structural inefficiency in reinforcement learning post-training: the long-tail distribution of rollout lengths, which creates GPU bubbles and degrades training throughput. Distribution-Aware Speculative Decoding (DASD), from Together.ai and WukLab (arXiv 2511.13841), adapts draft model behavior to the actual rollout length distribution and was presented at MLSys 2026, with Together.ai claiming up to 50% acceleration of RL rollouts[21][22]. Separately, MIT HAN Lab's Adaptive Drafter paper (arXiv 2511.16665) appears at ASPLOS'26, with open-source code in the mit-han-lab/fastrl repository[23]. MIT News coverage frames it as capable of substantially increasing LLM training efficiency[24], and at least one secondary source characterizes the speedup as 2x[25]. A FAISys 2025 paper also tackled dynamic speculative decoding for RL[26]. That multiple independent approaches from different institutions and conference venues target the same long-tail RL inefficiency suggests a recognized, high-value problem area in systems research, though direct comparisons between the approaches have not yet appeared.
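The "GPU bubble" caused by long-tail rollout lengths can be made concrete with a toy model. The sketch assumes a synchronous batch in which every slot decodes one token per time unit and the step ends only when the longest rollout finishes. The second function assumes, purely for illustration, that tokens past a cutoff decode faster by a fixed factor; that stands in for any tail acceleration and is not a description of how DASD or Adaptive Drafter work. All names and numbers are hypothetical.

```python
def idle_fraction(rollout_lengths):
    """Fraction of slot-time spent idle while waiting for the longest rollout."""
    longest = max(rollout_lengths)
    busy = sum(rollout_lengths)
    return 1.0 - busy / (len(rollout_lengths) * longest)

def step_time(rollout_lengths, tail_start, tail_speedup):
    """Time units for the batch if tokens past `tail_start` decode
    `tail_speedup` times faster (toy assumption, see lead-in)."""
    longest = max(rollout_lengths)
    if longest <= tail_start:
        return float(longest)
    return tail_start + (longest - tail_start) / tail_speedup

# Three short rollouts and one long-tail rollout (token counts are made up).
lengths = [200, 250, 300, 4000]
print("idle fraction:", idle_fraction(lengths))
print("baseline step time:", step_time(lengths, tail_start=300, tail_speedup=1.0))
print("with 2x tail decode:", step_time(lengths, tail_start=300, tail_speedup=2.0))
```

Because the step time is set by the single longest rollout, speeding up only the tail shortens the whole step in this model, which is why the tail is an attractive target.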

For mixture-of-experts (MoE) models — now dominant at the frontier — expert load balancing at serving time remains a significant underexplored challenge in open source. SemiAnalysis noted that this problem surfaces primarily in closed production systems and receives disproportionately little open-source attention[27]. A cross-layer load balancing paper for distributed MoE inference appeared at ICML 2026[28], and load-balancing losses applied during MoE training complicate runtime scheduling decisions[29]. MegaScale-Infer's claimed cost reductions for MoE serving via disaggregation[15] are notable in this context, though whether the approach specifically addresses expert balancing has not been publicly detailed. Industry presence at MLSys 2026 was broad: vLLM, LMSYS, Inferact (co-hosting a luncheon with a16z), and Delta Institute were among the organizations attending[30][31][32][33].
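To make "expert load balancing at serving time" concrete, the sketch below counts how many tokens a router sends to each expert and reports a simple imbalance factor (hottest expert load divided by mean load). The metric and the function names are illustrative assumptions; they are not drawn from SemiAnalysis, MegaScale-Infer, or the ICML paper cited above.

```python
from collections import Counter

def expert_loads(assignments, num_experts):
    """Tokens routed to each expert, including experts that received none."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]

def imbalance_factor(loads):
    """Max load over mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 0.0

# Hypothetical top-1 routing decisions for eight tokens across four experts.
assignments = [0, 0, 0, 0, 1, 2, 0, 3]
loads = expert_loads(assignments, num_experts=4)
print(loads, imbalance_factor(loads))
```

When experts are placed on different devices, the device hosting the hottest expert tends to set the step latency, so a factor well above 1.0 indicates capacity left idle on the others.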

Timeline

2025-07: StepFun AI submits Step-3 paper (arXiv 2507.19427) describing model-system co-design for cost-effective decoding, providing production context for StepMesh [9]
2025-11: Distribution-Aware Speculative Decoding paper submitted to arXiv (2511.13841), targeting long-tail RL rollout GPU bubbles [22][44]
2025-11: MIT HAN Lab submits Adaptive Drafter paper (arXiv 2511.16665) on efficient reasoning RL training; GitHub project fastrl confirms ASPLOS'26 venue [38][39][23]
2025-12: BLASST paper submitted to arXiv (2512.12087), introducing dynamic blocked sparse attention via softmax thresholding [3][45]
2026-01: Analytical Provisioning for Attention-FFN Disaggregated LLM Serving paper submitted to arXiv (2601.21351) [13]
2026-02: Challenges of Attention-FFN Disaggregation paper submitted to arXiv (2602.09721) [12]
2026-02-26: MIT News publishes article on Adaptive Drafter method for increasing LLM training efficiency [24]
2026-04-25: LMSYS publishes blog post on DeepSeek-V4 fast inference and verified RL with SGLang and Miles [46]
2026-05-17: SemiAnalysis publishes pre-conference MLSys 2026 research preview covering sparse attention, disaggregation, MoE balancing, and RL training efficiency [1][4][7][27][35]
2026-05-17: NVIDIA researcher Huizi Mao confirms BLASST wins MLSys 2026 Best Paper for dynamic blocked sparse attention [36]
2026-05-18: MLSys 2026 opens in Bellevue; vLLM, LMSYS, Inferact, Delta Institute among attendees [2][30][47][33]
2026-05-18: Scaling RL training across heterogeneous accelerators receives Oral recognition [50]
2026-05-19: MLSys 2026 Best Paper congratulations circulate [48][49]
2026-05-20: Theta EdgeCloud PD disaggregation test results published; financial commentary picks up disaggregation as argument for extended GPU useful life [51][52][16][17]
2026-05-21: LMCache Lab releases async PDBackend; StepFun AI publishes StepMesh on GitHub; vLLM RFC for ATTN-FFN disaggregation posted [40][8][10][11]
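2026-05-21: Inferact post about co-hosting an MLSys 2026 luncheon with a16z and Mirendil circulates; the post describes the luncheon as held the previous day [32]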
2026-05-22: Conference final day: GDN kernel work on B200 GPUs presented; SnorkelAI recaps the RLVR talk given earlier in the conference; FlashInfer BLASST integration request filed [53][54][5]

Perspectives

SemiAnalysis

Curatorial and analytical: frames MLSys 2026 as the venue for the most important production AI systems problems, highlighting sparse attention productionization, PD disaggregation extensions, MoE expert balancing gaps, and RL training inefficiency as the four key threads

Evolution: Consistent — SemiAnalysis functions as a research-curating voice for practitioners; no stance shift observable

[1][4][7][27][35]

NVIDIA / Huizi Mao (BLASST team)

BLASST's dynamic blocked sparse attention has been recognized with the MLSys 2026 Best Paper award, and the community is now pushing for integration into inference kernel libraries such as FlashInfer, beyond NVIDIA's own stack

Evolution: Elevated by Best Paper recognition; FlashInfer integration request and ongoing sparse kernel work for new SM120 hardware represent broadening external adoption momentum

[36][3][5][6]

@haoailab (Hao Zhang / DistServe group)

Having originated now-standard PD disaggregation, the group is extending the paradigm to attention-FFN disaggregation presented at MLSys 2026

Evolution: Progressing from originating PD disaggregation to pushing the next architectural frontier — workload specialization by transformer component, not just by compute phase

[7]

StepFun AI / StepMesh

Attention-FFN disaggregation for MoE inference requires dedicated communication infrastructure; StepMesh instantiates this as a production-oriented open-source library backed by model-system co-design documented in the Step-3 paper

Evolution: Expanded: StepMesh is now openly hosted on GitHub and the Step-3 paper provides production evidence for the co-design approach

[8][9][34]

vLLM project

Attention-FFN disaggregation for MoE models is being actively pursued via RFC, but disaggregated prefilling remains labeled experimental in official documentation

Evolution: Consistent — the RFC shows forward momentum, but the gap between conference-floor confidence and official documentation maturity persists

[10][11][20]

Together.ai / WukLab (DASD)

Distribution-Aware Speculative Decoding delivers up to 50% acceleration of RL rollouts by adapting draft model behavior to the actual rollout length distribution rather than assuming fixed lengths

Evolution: Consistent; now one of two confirmed independent groups attacking the same long-tail RL rollout problem

[21][22][37]

MIT HAN Lab (Adaptive Drafter / fastrl)

Adaptive Drafter addresses long-tail RL training inefficiency at ASPLOS'26 with an adaptive speculative decoding approach; MIT News frames it as substantially increasing LLM training efficiency, and at least one secondary source characterizes the gain as 2x

Evolution: Previously noted only as 'MIT published on adaptive drafter'; now confirmed as ASPLOS'26 with open-source code (mit-han-lab/fastrl), MIT News coverage, and a sharpened 2x performance claim from a secondary source

[38][39][23][24][25]

LMCache Lab

Actively productionizing PD disaggregation with a new async backend, framing the architecture as ready for deployment at scale

Evolution: Consistent with their KV cache and disaggregation-focused engineering track

[40][41]

Financial / investor commentators (TheValueist, podcast_alpha_x, tropicalvalue)

PD disaggregation is evidence that GPU useful lives are substantially longer than AI skeptics claim — older GPUs can be repurposed in disaggregated decode roles, extending hardware ROI

Evolution: Consistent; this framing entered the thread from the investment community observing the conference, not from ML systems researchers

[42][16][17][43]

Sakura Yuki (inference practitioner)

PD disaggregation is finally stable for production; running co-located prefill and decode at scale is now wasteful rather than a reasonable default

Evolution: Consistent with the broad practitioner consensus emerging at the conference

[19][18]

Tensions

Open-source community vs. production practitioners on MoE expert balancing: SemiAnalysis observes that expert balancing in MoE serving is substantively underexplored in open source because the challenge only surfaces at production scale[27]. MegaScale-Infer claims MoE serving cost reductions via disaggregation[15], but whether this specifically addresses the expert balancing problem — or whether any open-source production-grade solution exists — remains unresolved. [27][15][29]
Disaggregation maturity framing: Practitioner commentary frames PD disaggregation as 'finally stable' for production[19], while vLLM's own documentation still labels disaggregated prefilling as 'experimental'[20]. The gap between conference-floor confidence and official toolchain documentation persists. [19][18][20]
Competing approaches to attention-FFN disaggregation standardization: StepFun AI shipped StepMesh as a standalone open-source communication library[8][34], while vLLM posted an RFC for ATTN-FFN disaggregation built into the vLLM framework[10]. The two efforts address the same problem from different integration points; it is unclear whether they will converge or which will define the interoperability standard. [8][34][10][11]
Competing speculative decoding approaches for RL rollout efficiency: MIT HAN Lab's Adaptive Drafter (ASPLOS'26, reportedly 2x speedup)[23][25] and Together.ai/WukLab's DASD (MLSys 2026, claimed up to 50% acceleration)[22][21] independently target the long-tail RL rollout GPU bubble problem with different strategies. Direct comparisons between the approaches have not appeared, and whether they are composable or serve different regimes is unresolved. [23][25][22][21][24]

Sources

[1] MLSys 2026 is next week! — SemiAnalysis Twitter (2026-05-17)
[2] MLSys 2026 is starting today. And I'm excited to be here. This conference focuses on the most important research problem... — reactive:mlsys-2026-inference-systems (2026-05-18)
[3] [2512.12087] BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding — reactive:mlsys-2026-inference-systems
[4] Sparse attention mechanisms are finally moving beyond academic benchmarks into production systems, including DeepSeek Sp… — SemiAnalysis Twitter (2026-05-17)
[5] [Feature Request] BLASST: Dynamic BLocked Attention Sparsity via ... — reactive:mlsys-2026-inference-systems
[6] Add sparse-MLA paged attention for SM120 (RTX PRO ... - GitHub — reactive:mlsys-2026-inference-systems
[7] @NousResearch Assigning dedicated resources to different types of workloads is an increasingly popular system optimizati… — SemiAnalysis Twitter (2026-05-17)
[8] stepfun-ai/StepMesh — reactive:mlsys-2026-inference-systems
[9] Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — reactive:mlsys-2026-inference-systems
[10] [RFC]: ATTN-FFN Disaggregation for MoE Models #22799 - GitHub — reactive:mlsys-2026-inference-systems
[11] This amazing Attention-FFN disaggregation implementation from ... — reactive:mlsys-2026-inference-systems
[12] Revealing the Challenges of Attention-FFN Disaggregation ... — reactive:mlsys-2026-inference-systems
[13] [2601.21351] Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads — reactive:mlsys-2026-inference-systems
[14] Revealing the Challenges of Attention-FFN Disaggregation for ... — reactive:mlsys-2026-inference-systems
[15] Unlocking MoE Efficiency: How MegaScale-Infer Slashes LLM ... — reactive:mlsys-2026-inference-systems
[16] I took one key insight from this convo: inference disaggregation between prefill and decode enable GPU lifespan to be ex... — reactive:mlsys-2026-inference-systems (2026-05-20)
[17] @GavinSBaker : AI skeptics have been wrong to claim GPU useful lives are only 1-2 yrs. The disaggregation of prefill (me... — reactive:mlsys-2026-inference-systems (2026-05-21)
[18] @lmsysorg @AMD @dstackai If you're still running prefill and decode on the same GPUs at scale, you're basically burning ... — reactive:mlsys-2026-inference-systems (2026-05-21)
[19] @lmsysorg @CloudflareDev The real story here isn't the bug fix, it's that prefill-decode disaggregation is finally stabl... — reactive:mlsys-2026-inference-systems (2026-05-21)
[20] Disaggregated Prefilling (experimental) - vLLM — reactive:mlsys-2026-inference-systems
[21] Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding — reactive:mlsys-2026-inference-systems
[22] Beat the long tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[23] GitHub - mit-han-lab/fastrl: [ASPLOS'26] Taming the Long-Tail — reactive:mlsys-2026-inference-systems
[24] New method could increase LLM training efficiency | MIT News | Massachusetts Institute of Technology — reactive:mlsys-2026-inference-systems
[25] Aiyu | Taming the Long Tail: Adaptive Speculative Decoding Doubles LLM Training Speed — reactive:mlsys-2026-inference-systems
[26] [PDF] Efficient RL for LLMs with Dynamic and Online Speculative Decoding — reactive:mlsys-2026-inference-systems
[27] @NousResearch @StepFun_ai @haoailab Large scale production system challenges, such as expert balancing in serving MoE mo… — SemiAnalysis Twitter (2026-05-17)
[28] Cross-Layer Load Balancing in Distributed MoE Inference - ICML 2026 — reactive:mlsys-2026-inference-systems
[29] The key observation: load-balancing losses used during MoE training encourage expert diversity. — reactive:mlsys-2026-inference-systems (2026-05-19)
[30] vLLM crew is out in full force at MLSys 2026 🔥 — reactive:mlsys-2026-inference-systems (2026-05-18)
[31] MLSys 2026 Happy Hour - by LMSYS & Ai2 https://t.co/etdvTEJltB — reactive:mlsys-2026-inference-systems (2026-05-18)
[32] RT @inferact: Great cohosting this luncheon with @a16z and Mirendil at MLSys 2026 yesterday! 🙌 — reactive:mlsys-2026-inference-systems (2026-05-21)
[33] Headed to MLSys 2026? — reactive:mlsys-2026-inference-systems (2026-05-17)
[34] StepMesh: A Communication Library for Attention-FFN Disaggregation | StepFun — reactive:mlsys-2026-inference-systems
[35] The long-tail distribution of rollout lengths causes one of the most critical inefficiencies in RL training. — SemiAnalysis Twitter (2026-05-17)
[36] Glad to be featured by SemiAnalysis. Our work BLASST was also selected as MLSys 2026 Best Paper: https://t.co/OlkQ7x75BN... — reactive:mlsys-2026-inference-systems (2026-05-17)
[37] Beat the Long Tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[38] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter — reactive:mlsys-2026-inference-systems
[39] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter — reactive:mlsys-2026-inference-systems
[40] PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-21)
[41] RT @lmcache: PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-22)
[42] GPU USEFUL LIFE, INFERENCE DISAGGREGATION, AND PRIVATE CREDIT — reactive:mlsys-2026-inference-systems (2026-05-20)
[43] For anyone listening, the relevant section starts around 44m: prefill vs. decode disaggregation. — reactive:mlsys-2026-inference-systems (2026-05-21)
[44] Beat the long tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[45] [PDF] Dynamic BLocked Attention Sparsity via Softmax Thresholding - arXiv — reactive:mlsys-2026-inference-systems
[46] DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles — reactive:mlsys-2026-inference-systems (2026-04-25)
[47] We’re at MLSys 2026 in Bellevue this week! ⛴️ — reactive:mlsys-2026-inference-systems (2026-05-18)
[48] Congratulations on the MLSys 2026 best paper! Looking forward to your presentation today! https://t.co/UrVcKgUJj0 — reactive:mlsys-2026-inference-systems (2026-05-19)
[49] Congrats on winning an MLSys 2026 best research paper! https://t.co/grBdNYK4Fg — reactive:mlsys-2026-inference-systems (2026-05-19)
[50] @Yong_jun_He Congratulations on the MLSys 2026 Oral! Scaling RL training across heterogeneous accelerators is a genuinel... — reactive:mlsys-2026-inference-systems (2026-05-18)
[51] Read “Theta EdgeCloud Tests Prefill/Decode Disaggregation for Large-Scale LLM Serving“ by Theta Labs on Medium: https://... — reactive:mlsys-2026-inference-systems (2026-05-20)
[52] Theta EdgeCloud Tests Prefill/Decode Disaggregation for Large-Scale LLM Serving #machinelearning #ml #artificialintellig... — reactive:mlsys-2026-inference-systems (2026-05-20)
[53] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-22)
[54] RT @SnorkelAI: Live from MLSys 2026! Thanks to everyone who joined @pham_derek's talk yesterday on RLVR in low-data, low... — reactive:mlsys-2026-inference-systems (2026-05-22)