MLSys 2026: Inference Systems Research Preview

Version 2

2026-05-23 05:02 UTC · 107 items

What

MLSys 2026 (May 18–22, Bellevue, WA) has concluded with three of its four inference research threads now substantially documented. NVIDIA's BLASST (arXiv 2512.12087) won Best Paper for dynamic blocked sparse attention[33][3], and the community has already requested its integration into FlashInfer[5]. Attention-FFN disaggregation has become a concrete engineering push: StepFun AI published StepMesh, a dedicated communication library[7], vLLM has an active RFC[8], and formal papers address both provisioning methodology[10] and hardware challenges[11]. Distribution-Aware Speculative Decoding, which targets long-tail RL rollout inefficiency, is confirmed at arXiv 2511.13841, with Together.ai claiming up to 50% acceleration of RL rollouts[20][19]. The fourth thread, MoE expert balancing at serving time, remains underexplored in open source[25].

Why it matters

MLSys 2026 marks a moment when three inference techniques (dynamic sparse attention, disaggregated serving, and speculative decoding for RL training efficiency) are simultaneously crossing from research paper to production engineering artifact. Attention-FFN disaggregation went from a single mention to a dedicated library and a formal RFC within the same conference week, which illustrates how quickly the industrialization cycle now runs. Decisions made in vLLM RFCs and FlashInfer issue trackers this week are likely to shape AI serving infrastructure for the next hardware generation.

Open questions

Will FlashInfer integrate BLASST, and how quickly will production inference engines adopt dynamic blocked sparse attention beyond NVIDIA's own stack? [5]
StepMesh[7] and the vLLM RFC[8] take different implementation approaches to attention-FFN disaggregation — which will establish the interoperability standard, and can they converge?
Together.ai claims 50% RL rollout acceleration from distribution-aware speculative decoding[19], but how broadly does this hold across different model sizes, draft model families, and RL algorithms?
MoE expert balancing at serving scale is acknowledged as underexplored in open source[25] — has any paper or talk at MLSys 2026 offered a production-grade solution, or does the knowledge gap persist?

Narrative

MLSys 2026, held May 18–22 in Bellevue, Washington, brought together researchers and engineers from across the AI industry to address production inference and training systems challenges[1][2]. The conference's most prominent recognition, the Best Paper award[33], went to NVIDIA's BLASST paper (arXiv 2512.12087), which introduces Dynamic BLocked Attention Sparsity via Softmax Thresholding[3]. BLASST dynamically identifies and skips low-salience attention blocks at runtime using a softmax threshold gate, directly attacking the core inefficiency of dense attention at long context lengths. The award reflects a broader productionization wave that SemiAnalysis documented in a pre-conference thread: sparse attention has migrated from academic benchmarks into live deployments, with DeepSeek's Sparse Attention and NousResearch's Lighthouse Attention cited as examples[4]. The inference community has already opened a feature request to integrate BLASST into FlashInfer[5], the widely used GPU kernel library, which would extend its reach beyond NVIDIA's own toolchain.
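The threshold gate described above can be sketched in plain Python. This is an illustrative single-query version under stated assumptions, not NVIDIA's kernel: the function name `blocked_attention`, the block size, and the exact skip rule (compare a block's largest score against the running online-softmax maximum) are hypothetical choices made for illustration.

```python
import math

def blocked_attention(q, keys, values, block_size=2, threshold=1e-3):
    """Single-query attention over key/value blocks with online softmax.

    Assumed skip rule (illustrative, not taken from the paper's code):
    a block is skipped when its largest score is so far below the running
    maximum that every softmax weight in it would be smaller than
    `threshold` times the largest weight seen so far.
    """
    scale = 1.0 / math.sqrt(len(q))
    log_thr = math.log(threshold)
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    skipped = 0

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        blk_max = max(scores)

        # Threshold gate: skip the exponentials and the value accumulation.
        if blk_max - running_max < log_thr:
            skipped += 1
            continue

        # Standard online-softmax update: rescale old state to the new max.
        new_max = max(running_max, blk_max)
        rescale = math.exp(running_max - new_max)  # 0.0 for the first block
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max

    return [a / denom for a in acc], skipped
```

Because the gate compares against the maximum seen so far, a low-scoring block that arrives before any high-scoring block is still computed in this sketch; only later blocks benefit from the skip. A real kernel would also avoid loading the skipped value block from memory, which a Python list slice cannot show.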

Prefill-decode (PD) disaggregation, which routes the compute-bound prefill phase and the memory-bandwidth-bound decode phase to dedicated hardware pools, is now widely accepted as an industry best practice[6]. Attention-FFN disaggregation emerged as the next architectural axis during the conference. StepFun AI published StepMesh, a communication library purpose-built for attention-FFN disaggregation in MoE inference[7]. The vLLM project posted an RFC proposing ATTN-FFN disaggregation for MoE models[8][9], and formal research papers address both analytical provisioning for attention-FFN disaggregated serving under stochastic workloads[10] and the hardware systems challenges the approach surfaces for modern MoE architectures[11]. @haoailab (Hao Zhang's DistServe group), credited with originating PD disaggregation, presented a further disaggregation technique at the conference[6]. Alongside the research, production deployment activity is broad: LMCache Lab released an async PDBackend[12], Theta EdgeCloud published PD disaggregation test results[13][14], and financial commentators picked up the architecture as evidence for extended GPU hardware ROI, arguing that older GPUs can be repurposed into disaggregated decode pools[15][16]. Practitioners increasingly describe co-locating prefill and decode at scale as actively wasteful[17], though a visible tension remains: vLLM's own documentation still labels disaggregated prefilling as experimental[18].
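A rough arithmetic-intensity estimate shows why prefill tends to be compute-bound and decode memory-bandwidth-bound. The sketch below counts only weight traffic for a single square linear layer and ignores activations and the KV cache. The function names and the accelerator figures are hypothetical, chosen only to make the comparison concrete.

```python
def arithmetic_intensity(tokens_per_pass, d_model, bytes_per_weight=2):
    """FLOPs per byte of weight traffic for one d_model x d_model linear layer.

    Simplifying assumption: every weight is read once per forward pass and
    activation / KV-cache traffic is ignored.
    """
    flops = 2 * tokens_per_pass * d_model * d_model  # multiply-add = 2 FLOPs
    weight_bytes = bytes_per_weight * d_model * d_model
    return flops / weight_bytes

def bound(intensity, peak_flops, peak_bytes_per_s):
    """Classify a workload against the roofline ridge point."""
    ridge = peak_flops / peak_bytes_per_s
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Hypothetical accelerator: 1e15 FLOP/s and 3e12 B/s (ridge near 333 FLOP/B).
PEAK_FLOPS = 1e15
PEAK_BW = 3e12

prefill = arithmetic_intensity(tokens_per_pass=4096, d_model=8192)  # long prompt
decode = arithmetic_intensity(tokens_per_pass=32, d_model=8192)     # 32 sequences, 1 token each

print("prefill:", prefill, bound(prefill, PEAK_FLOPS, PEAK_BW))
print("decode: ", decode, bound(decode, PEAK_FLOPS, PEAK_BW))
```

With 16-bit weights the intensity reduces to the number of tokens processed per pass, so a long prompt sits far above the ridge point while a small decode batch sits far below it. That gap is the usual argument for giving each phase its own hardware pool.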

Distribution-Aware Speculative Decoding (DASD) addresses a structural inefficiency in reinforcement learning post-training: the long-tail distribution of rollout lengths creates GPU bubbles that degrade training throughput. Together.ai quantifies the benefit at up to 50% acceleration of RL rollouts[19]. The paper (arXiv 2511.13841), from the WukLab group and presented at MLSys 2026, targets the mismatch between speculative decoding draft models — which assume fixed output length — and the highly variable rollout lengths characteristic of RL training[20][21]. A broader ecosystem of related work is emerging in parallel: MIT published on adaptive drafter-based RL training efficiency[22], and a FAISys 2025 paper tackled dynamic and online speculative decoding for RL[23]. SnorkelAI presented on reinforcement learning from verifiable rewards (RLVR) in low-data, low-compute settings at the conference[24].
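The GPU-bubble effect of long-tail rollout lengths can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the paper's method: the batch is synchronous, each rollout emits one token per decode step unless it is in the tail, and `tail_speedup` and `tail_cutoff` are hypothetical parameters standing in for whatever extra tokens per step speculation yields on long rollouts.

```python
import math

def bubble_fraction(rollout_lengths):
    """Fraction of slot-steps left idle while a synchronous batch waits
    for its longest rollout to finish."""
    longest = max(rollout_lengths)
    return 1.0 - sum(rollout_lengths) / (longest * len(rollout_lengths))

def batch_steps(rollout_lengths, tail_speedup=1.0, tail_cutoff=None):
    """Decode steps until every rollout in the batch has finished.

    Assumption: rollouts longer than `tail_cutoff` emit `tail_speedup`
    tokens per step on average; all others emit one token per step.
    """
    steps = []
    for length in rollout_lengths:
        if tail_cutoff is not None and length > tail_cutoff:
            steps.append(math.ceil(length / tail_speedup))
        else:
            steps.append(length)
    return max(steps)

lengths = [100, 100, 100, 1000]  # one long-tail rollout (made-up numbers)
print(bubble_fraction(lengths))
print(batch_steps(lengths))
print(batch_steps(lengths, tail_speedup=2.0, tail_cutoff=500))
```

In this toy setting the batch finishes only when the single long rollout does, so accelerating just the tail shortens the whole step even though the short rollouts are untouched.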

For mixture-of-experts (MoE) models, now dominant at the frontier, expert load balancing at serving time remains a significant underexplored challenge. SemiAnalysis noted that this problem surfaces primarily in closed production systems and receives disproportionately little open-source attention[25]. A cross-layer load balancing paper for distributed MoE inference is listed at ICML 2026[26], and one observer noted that the load-balancing losses applied during MoE training encourage expert diversity, which can complicate runtime scheduling decisions[27], but no production-grade open-source solution has been publicly announced. Industry presence at MLSys 2026 was broad: vLLM, LMSYS, Inferact (co-hosting a luncheon with a16z and Mirendil), and Delta Institute were among the organizations attending[28][29][30][31].
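One simple way to quantify serving-time expert imbalance is the ratio of the busiest expert's token count to the mean. The sketch below assumes top-1 routing and one expert per device; the function names and the routing trace are hypothetical and do not come from any of the cited systems.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Tokens routed to each expert, given one expert id per token
    (top-1 routing assumed for simplicity)."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]

def imbalance_factor(loads):
    """Busiest expert's load divided by the mean load.

    1.0 means perfectly balanced. With one expert per device, a step
    takes roughly this many times longer than the balanced case.
    """
    mean = sum(loads) / len(loads)
    return max(loads) / mean

# Made-up routing trace: expert 0 is "hot".
trace = [0, 0, 0, 0, 1, 2, 3, 0]
loads = expert_load(trace, num_experts=4)
print(loads, imbalance_factor(loads))
```

A serving-time balancer would try to drive this factor toward 1.0, for example by replicating or relocating hot experts. Which strategy works at production scale is the open question the paragraph above describes.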

Timeline

2025-11: Distribution-Aware Speculative Decoding paper submitted to arXiv (2511.13841), targeting long-tail RL rollout GPU bubbles [20][38]
2025-12: BLASST paper submitted to arXiv (2512.12087), introducing dynamic blocked sparse attention via softmax thresholding [3][39]
2026-01: Analytical Provisioning for Attention-FFN Disaggregated LLM Serving paper submitted to arXiv (2601.21351) [10]
2026-04-25: LMSYS publishes blog post on DeepSeek-V4 fast inference and verified RL with SGLang and Miles [40]
2026-05-17: SemiAnalysis publishes pre-conference MLSys 2026 research preview thread covering sparse attention, disaggregation, MoE balancing, and RL training efficiency [1][4][6][25][32]
2026-05-17: NVIDIA researcher Huizi Mao confirms BLASST wins MLSys 2026 Best Paper for dynamic blocked sparse attention [33]
2026-05-18: MLSys 2026 opens in Bellevue; vLLM, LMSYS, Inferact, Delta Institute among attendees [2][28][41][31]
2026-05-18: Scaling RL training across heterogeneous accelerators receives Oral recognition [44]
2026-05-19: MLSys 2026 Best Paper congratulations circulate [42][43]
2026-05-20: Theta EdgeCloud PD disaggregation test coverage published; UT Austin + collaborators LLM cluster tuning paper circulates [13][14][45]
2026-05-20: Financial and investment commentary picks up PD disaggregation as argument for extended GPU useful life [35][15][46]
2026-05-21: LMCache Lab releases async PDBackend for more efficient disaggregated prefill-decode serving [12]
2026-05-21: Inferact recaps the MLSys luncheon it co-hosted with a16z and Mirendil earlier in the week [30]
2026-05-21: StepFun AI publishes StepMesh, a communication library for attention-FFN disaggregation in MoE inference; vLLM RFC for ATTN-FFN disaggregation posted [7][8][9]
2026-05-22: Conference final day: GDN kernel work on B200 GPUs presented; SnorkelAI recaps its earlier RLVR talk; FlashInfer integration request for BLASST filed [47][24][5]

Perspectives

SemiAnalysis

Curatorial and analytical: frames MLSys 2026 as the venue for the most important production AI systems problems, highlighting sparse attention productionization, PD disaggregation extensions, MoE expert balancing gaps, and RL training inefficiency as the four key threads

Evolution: Consistent — SemiAnalysis functions as a research-curating voice for practitioners; no stance shift observable

[1][4][6][25][32]

NVIDIA / Huizi Mao (BLASST team)

BLASST's dynamic blocked sparse attention has been validated by the research community as Best Paper, and the community is now pushing for integration into inference kernels like FlashInfer beyond NVIDIA's own stack

Evolution: Elevated by Best Paper recognition; the FlashInfer integration request represents external validation beyond NVIDIA's own endorsement

[33][3][5]

@haoailab (Hao Zhang / DistServe group)

Having originated now-standard PD disaggregation, the group is extending the paradigm to a further disaggregation technique presented at MLSys 2026

Evolution: Progressing from originating PD disaggregation to pushing a new architectural frontier, a further layer of workload specialization beyond the prefill/decode split

[6]

StepFun AI / StepMesh

Attention-FFN disaggregation for MoE inference requires dedicated communication infrastructure; StepMesh instantiates this as a production-oriented library

Evolution: New voice entering with a concrete artifact — moves attention-FFN disaggregation from a research concept to a shipped library

[7]

vLLM project

Attention-FFN disaggregation for MoE models is being actively pursued via RFC, but disaggregated prefilling remains labeled experimental in official documentation

Evolution: The RFC for ATTN-FFN disaggregation shows forward momentum, but the gap between conference-floor confidence and official documentation maturity persists

[8][9][18]

Together.ai / WukLab (DASD)

Distribution-Aware Speculative Decoding is reported to deliver up to 50% acceleration of RL rollouts by adapting draft model behavior to the actual rollout length distribution rather than assuming fixed lengths

Evolution: New voice with a specific quantified claim; the "up to 50%" figure is the most concrete performance number attached to any paper covered in this preview

[19][20][21]

LMCache Lab

Actively productionizing PD disaggregation with a new async backend, framing the architecture as ready for deployment at scale

Evolution: Consistent with their KV cache and disaggregation-focused engineering track

[12][34]

Financial / investor commentators (TheValueist, podcast_alpha_x, tropicalvalue)

PD disaggregation is evidence that GPU useful lives are substantially longer than AI skeptics claim — older GPUs can be repurposed in disaggregated decode roles, extending hardware ROI

Evolution: Consistent with the prior version; this framing entered the thread from the investment community observing the conference, not from ML systems researchers

[35][15][16][36]

Sakura Yuki (inference practitioner)

PD disaggregation is finally stable for production; running co-located prefill and decode at scale is now wasteful rather than a reasonable default

Evolution: Consistent with the broad practitioner consensus emerging at the conference

[37][17]

Tensions

Open-source community vs. production practitioners on MoE expert balancing: SemiAnalysis observes that expert balancing in MoE serving is substantively underexplored in open source because the challenge only surfaces at production scale, creating a knowledge gap between closed systems operators and the broader community[25]. No open-source voice has publicly disputed or resolved this characterization at the conference. [25][27]
Disaggregation maturity framing: Practitioner commentary frames PD disaggregation as 'finally stable' for production[37], while vLLM's own documentation still labels disaggregated prefilling as 'experimental'[18]. The gap between conference-floor confidence and official toolchain maturity is unresolved. [37][17][18]
Competing approaches to attention-FFN disaggregation standardization: StepFun AI shipped StepMesh as a standalone communication library[7], while vLLM posted an RFC for ATTN-FFN disaggregation built into the vLLM framework[8]. The two efforts address the same problem from different integration points; it is unclear whether they will converge or whether a standard will emerge from one direction. [7][8][9]

Sources

[1] MLSys 2026 is next week! — SemiAnalysis Twitter (2026-05-17)
[2] MLSys 2026 is starting today. And I'm excited to be here. This conference focuses on the most important research problem... — reactive:mlsys-2026-inference-systems (2026-05-18)
[3] [2512.12087] BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding — reactive:mlsys-2026-inference-systems
[4] Sparse attention mechanisms are finally moving beyond academic benchmarks into production systems, including DeepSeek Sp… — SemiAnalysis Twitter (2026-05-17)
[5] [Feature Request] BLASST: Dynamic BLocked Attention Sparsity via ... — reactive:mlsys-2026-inference-systems
[6] @NousResearch Assigning dedicated resources to different types of workloads is an increasingly popular system optimizati… — SemiAnalysis Twitter (2026-05-17)
[7] StepMesh: A Communication Library for Attention-FFN Disaggregation | StepFun — reactive:mlsys-2026-inference-systems
[8] [RFC]: ATTN-FFN Disaggregation for MoE Models #22799 - GitHub — reactive:mlsys-2026-inference-systems
[9] This amazing Attention-FFN disaggregation implementation from ... — reactive:mlsys-2026-inference-systems
[10] [2601.21351] Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads — reactive:mlsys-2026-inference-systems
[11] Revealing the Challenges of Attention-FFN Disaggregation for ... — reactive:mlsys-2026-inference-systems
[12] PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-21)
[13] Read “Theta EdgeCloud Tests Prefill/Decode Disaggregation for Large-Scale LLM Serving” by Theta Labs on Medium: https://... — reactive:mlsys-2026-inference-systems (2026-05-20)
[14] Theta EdgeCloud Tests Prefill/Decode Disaggregation for Large-Scale LLM Serving #machinelearning #ml #artificialintellig... — reactive:mlsys-2026-inference-systems (2026-05-20)
[15] I took one key insight from this convo: inference disaggregation between prefill and decode enable GPU lifespan to be ex... — reactive:mlsys-2026-inference-systems (2026-05-20)
[16] @GavinSBaker : AI skeptics have been wrong to claim GPU useful lives are only 1-2 yrs. The disaggregation of prefill (me... — reactive:mlsys-2026-inference-systems (2026-05-21)
[17] @lmsysorg @AMD @dstackai If you're still running prefill and decode on the same GPUs at scale, you're basically burning ... — reactive:mlsys-2026-inference-systems (2026-05-21)
[18] Disaggregated Prefilling (experimental) - vLLM — reactive:mlsys-2026-inference-systems
[19] Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding — reactive:mlsys-2026-inference-systems
[20] Beat the long tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[21] Beat the Long Tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[22] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive ... — reactive:mlsys-2026-inference-systems
[23] [PDF] Efficient RL for LLMs with Dynamic and Online Speculative Decoding — reactive:mlsys-2026-inference-systems
[24] RT @SnorkelAI: Live from MLSys 2026! Thanks to everyone who joined @pham_derek's talk yesterday on RLVR in low-data, low... — reactive:mlsys-2026-inference-systems (2026-05-22)
[25] @NousResearch @StepFun_ai @haoailab Large scale production system challenges, such as expert balancing in serving MoE mo… — SemiAnalysis Twitter (2026-05-17)
[26] Cross-Layer Load Balancing in Distributed MoE Inference - ICML 2026 — reactive:mlsys-2026-inference-systems
[27] The key observation: load-balancing losses used during MoE training encourage expert diversity. — reactive:mlsys-2026-inference-systems (2026-05-19)
[28] vLLM crew is out in full force at MLSys 2026 🔥 — reactive:mlsys-2026-inference-systems (2026-05-18)
[29] MLSys 2026 Happy Hour - by LMSYS & Ai2 https://t.co/etdvTEJltB — reactive:mlsys-2026-inference-systems (2026-05-18)
[30] RT @inferact: Great cohosting this luncheon with @a16z and Mirendil at MLSys 2026 yesterday! 🙌 — reactive:mlsys-2026-inference-systems (2026-05-21)
[31] Headed to MLSys 2026? — reactive:mlsys-2026-inference-systems (2026-05-17)
[32] The long-tail distribution of rollout lengths causes one of the most critical inefficiencies in RL training. — SemiAnalysis Twitter (2026-05-17)
[33] Glad to be featured by SemiAnalysis. Our work BLASST was also selected as MLSys 2026 Best Paper: https://t.co/OlkQ7x75BN... — reactive:mlsys-2026-inference-systems (2026-05-17)
[34] RT @lmcache: PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-22)
[35] GPU USEFUL LIFE, INFERENCE DISAGGREGATION, AND PRIVATE CREDIT — reactive:mlsys-2026-inference-systems (2026-05-20)
[36] For anyone listening, the relevant section starts around 44m: prefill vs. decode disaggregation. — reactive:mlsys-2026-inference-systems (2026-05-21)
[37] @lmsysorg @CloudflareDev The real story here isn't the bug fix, it's that prefill-decode disaggregation is finally stabl... — reactive:mlsys-2026-inference-systems (2026-05-21)
[38] Beat the long tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[39] [PDF] Dynamic BLocked Attention Sparsity via Softmax Thresholding - arXiv — reactive:mlsys-2026-inference-systems
[40] DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles — reactive:mlsys-2026-inference-systems (2026-04-25)
[41] We’re at MLSys 2026 in Bellevue this week! ⛴️ — reactive:mlsys-2026-inference-systems (2026-05-18)
[42] Congratulations on the MLSys 2026 best paper! Looking forward to your presentation today! https://t.co/UrVcKgUJj0 — reactive:mlsys-2026-inference-systems (2026-05-19)
[43] Congrats on winning an MLSys 2026 best research paper! https://t.co/grBdNYK4Fg — reactive:mlsys-2026-inference-systems (2026-05-19)
[44] @Yong_jun_He Congratulations on the MLSys 2026 Oral! Scaling RL training across heterogeneous accelerators is a genuinel... — reactive:mlsys-2026-inference-systems (2026-05-18)
[45] Tuning an LLM cluster runs the design space up to ~10^6 GPU-hours. A new paper from UT Austin + @preminstrel + colleague... — reactive:mlsys-2026-inference-systems (2026-05-20)
[46] @MiamiMarkets @InvestLikeBest @pmarca @GavinSBaker Yes, the core idea of disaggregation (separating prefill/compute-heav... — reactive:mlsys-2026-inference-systems (2026-05-20)
[47] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-22)