MLSys 2026: Inference Systems Research Preview

Version 8

2026-05-25 19:01 UTC · 221 items

What

MLSys 2026 (May 18–22, Bellevue, WA) surfaced four interlocking threads in production inference research. First, NVIDIA's BLASST won Best Paper for dynamic blocked sparse attention[3]. Second, the PD disaggregation toolchain matured through NVIDIA Dynamo's Kubernetes documentation[11] and third-party deployment guides from Vultr[12] and Spheron[13], while vLLM's own disaggregated prefilling kept its 'experimental' label through v0.10.2[16], with Dynamo positioned as the recommended external production path[14]. Third, attention-FFN disaggregation advanced through StepMesh[19], a vLLM RFC[21], and MegaScale-Infer[24]. Fourth, two independent speculative decoding systems, DASD[25] and MIT's Adaptive Drafter[27], target long-tail inefficiency in RL rollouts. The FlashInfer AI Kernel Generation Contest produced multiple winners, including UW SyFi[8] and @dogacel0[9], formalizing competitive kernel development as a conference activity.

Why it matters

Production inference has fractured into a stack of specialized workload-routing problems—by compute phase, transformer component, sparsity pattern, and training distribution—each with its own emerging toolchain. The persistence of vLLM's 'experimental' disaggregation label through v0.10.2[16] while Dynamo gains third-party commercial documentation[12][13] signals that disaggregated inference may bifurcate into two ecosystems rather than converge, with downstream implications for which deployment patterns become the interoperability standard.

Open questions

vLLM v0.10.2 still labels disaggregated prefilling 'experimental'[16] and the vLLM-Dynamo integration page[14] routes practitioners to Dynamo—will vLLM's native disaggregated prefilling ever exit experimental status, or has Dynamo permanently become the canonical production path?
Does the vLLM Q1 2026 roadmap[17] commit to a timeline for promoting disaggregated prefilling out of experimental status, or does it formalize Dynamo as the long-term recommended external layer?
Do StepFun AI's StepMesh[19] and the vLLM ATTN-FFN disaggregation RFC[21] converge on a shared interoperability standard, or diverge into competing attention-FFN disaggregation ecosystems?
MegaScale-Infer claims cost reductions via disaggregated expert parallelism[24]—does the full paper address whether this specifically solves the MoE expert load balancing problem SemiAnalysis identified as underexplored in open source[30]?

Narrative

MLSys 2026, held May 18–22 in Bellevue, Washington, brought together the production AI systems community around four inference challenges[1][2]. The conference's top recognition went to NVIDIA's BLASST paper (arXiv 2512.12087), which won Best Paper for Dynamic Blocked Attention Sparsity via Softmax Thresholding[3]: a runtime technique that identifies and skips low-salience attention blocks using a softmax threshold gate, targeting the core inefficiency of dense attention at long context lengths. The sparse attention debate has grown more layered than a simple binary: a practitioner confirmed that DeepSeek's MoE routing and sparse attention perform well in production[4]; a dissenting voice argues sparse attention is a transitional stopgap before hardware-native linear attention displaces it by 2027[5]; and the Native Sparse Attention paper (arXiv 2502.11089, ACL 2025) occupies a third position, designing sparse attention to be hardware-aligned and natively trainable from the ground up rather than imposed at runtime[6][7]. A FlashInfer AI Kernel Generation Contest ran at the conference, with the UW SyFi team winning multiple prizes[8] and @dogacel0 placing first[9], formalizing competitive kernel development around FlashInfer as a new conference activity.
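The softmax threshold gate described above can be illustrated with a minimal pure-Python sketch. It assumes the gate compares each key block's largest score against the running row maximum kept by online softmax, and skips the block when every weight in it would fall below a fixed threshold relative to the current peak. The function name blocked_attention_row, the threshold default, and the single-query setup are illustrative assumptions, not taken from the BLASST paper or its kernels.

```python
import math


def blocked_attention_row(q, keys, values, block_size=4, threshold=1e-3):
    """Single-query attention using online softmax over key blocks.

    Illustrative gate (an assumption, not the published kernel): a block is
    skipped when exp(block_max - running_max) < threshold, i.e. when even its
    largest softmax weight would be negligible next to the current peak.
    Returns (output_vector, number_of_skipped_blocks).
    """
    scale = 1.0 / math.sqrt(len(q))
    log_threshold = math.log(threshold)
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    skipped = 0

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        blk_max = max(scores)

        # Gate: skip the exponentials and the weighted sum for this block.
        if blk_max - running_max < log_threshold:
            skipped += 1
            continue

        # Standard online-softmax rescaling to the new running maximum.
        new_max = max(running_max, blk_max)
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max

    return [a / denom for a in acc], skipped


# Hypothetical toy inputs: the second key block scores far below the first.
q = [1.0, 0.0]
keys = [[10.0, 0.0], [9.0, 0.0], [-10.0, 0.0], [-9.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 5.0]]
out, skipped = blocked_attention_row(q, keys, values, block_size=2)
print(out, skipped)
```

In this sketch the first block is never skipped because the running maximum starts at negative infinity. A block that arrives before the true row maximum has been seen is kept even if it later turns out to be negligible, so the gate is conservative early in the row.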

Prefill-decode (PD) disaggregation, which routes compute-bound prefill and memory-bandwidth-bound decode to dedicated hardware pools, has accumulated enough third-party documentation to mark a transition from research configuration to mainstream deployment pattern. AWS Neuron carries an official BETA developer guide[10]; NVIDIA Dynamo published Kubernetes disaggregated communication documentation[11]; and Vultr[12] and Spheron[13] each published step-by-step deployment guides. A dedicated vLLM-Dynamo integration page[14] formalizes Dynamo as the recommended external disaggregation layer for vLLM users. Yet vLLM's own disaggregated prefilling documentation retains the 'experimental' label in both v0.8.5[15] and v0.10.2[16], and whether the vLLM Q1 2026 roadmap[17] commits to changing that status remains an open question. A documented RDMA KV cache transfer failure in Kubernetes[18] flagged a concrete networking obstacle whose resolution status in current Dynamo documentation remains unclear.
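The compute-bound versus memory-bandwidth-bound split that motivates PD disaggregation can be illustrated with a roofline-style estimate. The sketch below considers a single square linear layer and counts only weight traffic. The hardware figures (PEAK_FLOPS, MEM_BANDWIDTH) and model width are hypothetical round numbers chosen for illustration, not measurements of any real accelerator or of any system cited here.

```python
def arithmetic_intensity(tokens: int, d_model: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weight traffic for one d_model x d_model linear layer.

    Simplification: weights are read once per step and activation traffic
    is ignored, so intensity grows linearly with the tokens processed.
    """
    flops = 2 * tokens * d_model * d_model            # multiply + add per weight per token
    weight_bytes = bytes_per_param * d_model * d_model
    return flops / weight_bytes


def classify(tokens: int, d_model: int, peak_flops: float, mem_bandwidth: float) -> str:
    """Compare intensity with the hardware ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    if arithmetic_intensity(tokens, d_model) >= ridge:
        return "compute-bound"
    return "memory-bandwidth-bound"


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 byte/s memory bandwidth.
PEAK_FLOPS = 1.0e15
MEM_BANDWIDTH = 3.0e12
D_MODEL = 4096

# Prefill processes the whole prompt at once; decode emits one token per sequence per step.
print("prefill, 2048-token prompt:", classify(2048, D_MODEL, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode, 1 token per step:  ", classify(1, D_MODEL, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions the two phases sit on opposite sides of the ridge point, which is the usual argument for giving each phase its own hardware pool. Batching many sequences raises decode intensity, so the gap narrows in practice.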

Attention-FFN disaggregation—separating attention and expert/FFN compute across different hardware—has emerged as the next architectural axis beyond PD disaggregation. StepFun AI's StepMesh communication library[19], backed by the Step-3 paper (arXiv 2507.19427)[20], provides one open-source implementation. The vLLM project posted an RFC proposing ATTN-FFN disaggregation built into the framework[21][22]. ByteDance's MegaScale-Infer (arXiv 2504.02263), confirmed at both SIGCOMM 2025[23] and MLSys 2026[24], uses disaggregated expert parallelism for MoE serving at scale with claimed cost reductions. StepMesh as a standalone library and the vLLM RFC as a framework-integrated approach address the same problem from different integration points, with no convergence on a shared standard yet apparent.
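Why this split leans on a dedicated communication layer can be seen from a back-of-envelope estimate of the activations that cross the link. The sketch assumes one hidden-state vector per token travels from the attention side to the FFN side and one travels back, in every layer. All shapes are hypothetical, MoE top-k dispatch is ignored, and nothing here reflects the actual wire formats of StepMesh, the vLLM RFC, or MegaScale-Infer.

```python
def cross_link_bytes_per_step(tokens: int, d_model: int, num_layers: int,
                              bytes_per_elem: int = 2) -> int:
    """Bytes crossing the attention <-> FFN link for one forward step.

    Assumption: per layer, each token's hidden state is sent once in each
    direction. Expert replication under MoE routing would add to this.
    """
    one_way = tokens * d_model * bytes_per_elem
    return 2 * one_way * num_layers


# Hypothetical decode step: 256 tokens in flight, width 4096, 32 layers, 16-bit activations.
total = cross_link_bytes_per_step(tokens=256, d_model=4096, num_layers=32)
print(total, "bytes per step =", total / 2**20, "MiB")
```

Because this traffic recurs on every decoding step, its latency sits directly on the critical path of token generation.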

Two independent research groups targeted the structural inefficiency of long-tail rollout length distributions in reinforcement learning post-training, which creates GPU bubbles and degrades throughput. Distribution-Aware Speculative Decoding (DASD, arXiv 2511.13841), from Together.ai and WukLab, was presented at MLSys 2026 and claims up to 50% acceleration of RL rollouts[25][26]. MIT HAN Lab's Adaptive Drafter (arXiv 2511.16665, ASPLOS'26), open-sourced at mit-han-lab/fastrl[27], has been framed by MIT News and TechXplore as roughly doubling LLM training efficiency[28][29]. The simultaneous emergence of independent approaches from different institutions and conference venues—without direct comparison between them—marks the long-tail RL rollout problem as high-value and actively contested.
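The GPU bubble caused by a long-tail length distribution can be quantified with a small worked example. The sketch assumes a synchronous batch in which each rollout occupies one slot, a slot sits idle once its rollout finishes, and the batch ends only when the longest rollout completes. The lengths are made-up numbers, and the function name bubble_fraction is illustrative rather than drawn from DASD or Adaptive Drafter.

```python
def bubble_fraction(rollout_lengths):
    """Share of slot-steps spent idle while a batch waits for its longest rollout."""
    longest = max(rollout_lengths)
    busy_steps = sum(rollout_lengths)
    total_steps = longest * len(rollout_lengths)
    return 1.0 - busy_steps / total_steps


# Hypothetical batch: seven short rollouts and one long straggler.
lengths = [100] * 7 + [800]
print(f"idle fraction: {bubble_fraction(lengths):.3f}")
```

Here the busy work is 1,500 slot-steps out of 6,400 available, so roughly three quarters of the batch's capacity is idle while the straggler finishes. That idle capacity is the slack the two systems above aim to reclaim.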

Timeline

2025-02-01: Native Sparse Attention (NSA) paper submitted to arXiv (2502.11089), introducing hardware-aligned and natively trainable sparse attention; published at ACL 2025 [6][7][42][43]
2025-04-01: MegaScale-Infer paper (arXiv 2504.02263) submitted, describing disaggregated expert parallelism for MoE serving at scale; confirmed at SIGCOMM 2025 [46][24][23]
2025-07-01: StepFun AI submits Step-3 paper (arXiv 2507.19427) providing production model-system co-design context for StepMesh [20][50][51]
2025-11-01: DASD paper (arXiv 2511.13841) and Adaptive Drafter paper (arXiv 2511.16665) independently submitted, both targeting long-tail RL rollout GPU bubble inefficiency [26][52][53][54][27][48]
2025-12-01: BLASST paper submitted to arXiv (2512.12087), introducing dynamic blocked sparse attention via softmax thresholding [3][55][56][57]
2026-02-26: MIT News and TechXplore publish coverage of Adaptive Drafter, framing it as roughly doubling LLM training efficiency [28][29]
2026-05-17: SemiAnalysis publishes MLSys 2026 research preview covering sparse attention, disaggregation, MoE balancing, and RL efficiency; NVIDIA's Huizi Mao confirms BLASST wins Best Paper [1][39][40][30][41][31]
2026-05-18: MLSys 2026 opens in Bellevue; vLLM, LMSYS, Inferact, Delta Institute among attendees [2][58][59][60]
2026-05-21: LMCache Lab releases async PDBackend; StepFun AI publishes StepMesh on GitHub; vLLM posts RFC for ATTN-FFN disaggregation [61][19][21][22]
2026-05-22: FlashInfer BLASST integration request filed; dynamic block sparse forward GitHub issue opened; FlashInfer AI Kernel Contest results: UW SyFi wins multiple prizes, @dogacel0 places first [32][33][9][34][8][35][36]
2026-05-23: @superaiwatcher argues sparse attention is a transitional stopgap before hardware-native linear attention displaces it by 2027 [5]
2026-05-24: JasonLiu confirms DeepSeek MoE routing and sparse attention performing well in production; NVIDIA Dynamo publishes Kubernetes disaggregated communication documentation; Vultr and Spheron publish Dynamo deployment guides; vLLM publishes dedicated Dynamo integration page [4][11][12][13][14]
2026-05-25: vLLM disaggregated prefilling confirmed to still carry 'experimental' label in both v0.8.5 and v0.10.2; vLLM Q1 2026 roadmap issue published [15][16][17]

Perspectives

NVIDIA / BLASST team / FlashInfer / Dynamo

BLASST's dynamic blocked sparse attention won Best Paper; the FlashInfer Kernel Contest attracted multi-team competition (UW SyFi winning multiple prizes, @dogacel0 first), formalizing FlashInfer as a competitive kernel development platform; Dynamo Kubernetes documentation and third-party adoption by Vultr and Spheron signal disaggregated inference reaching mainstream deployment maturity.

Evolution: Expanded: UW SyFi multi-prize win adds breadth to contest results beyond the previously noted @dogacel0 first-place finish, and third-party cloud provider documentation extends Dynamo's maturity signal further.

[31][3][32][33][11][12][13][14][9][34][8][35][36]

vLLM project

ATTN-FFN disaggregation for MoE is under active RFC; disaggregated prefilling retains the 'experimental' label in both v0.8.5 and v0.10.2, while the vLLM-Dynamo integration page now routes practitioners to Dynamo as the recommended external production path.

Evolution: Clarified: both v0.8.5 and v0.10.2 documentation confirm the 'experimental' label has not been lifted, resolving the prior open question about whether vLLM's own path has matured; the Q1 2026 roadmap is now the reference for whether that status will change.

[21][22][37][14][15][16][17][38]

SemiAnalysis

Frames MLSys 2026 as the venue for the most critical production inference problems; identifies MoE expert load balancing as the most underexplored gap in open source, only surfacing at production scale.

Evolution: Consistent — functions as a research-curating voice for practitioners with no stance shift.

[1][39][40][30][41]

@superaiwatcher (dissenting)

Sparse attention is a transitional stopgap; hardware-native linear attention will displace manual sparsity implementations by 2027, making the current productionization wave temporary.

Evolution: Consistent — represents the explicit counter-narrative to the conference's sparse attention consensus; the NSA paper is a direct counter-weight to this position.

[5]

NSA paper authors (Yuan et al., ACL 2025)

Hardware-aligned and natively trainable sparse attention (arXiv 2502.11089) offers a third architectural path between BLASST-style runtime sparsity and full linear attention replacement, designed to close the hardware-alignment gap from training time rather than inference patching.

Evolution: Consistent — multiple pointers confirm ongoing community access but introduce no new claims.

[6][7][42][43][44][45]

StepFun AI + ByteDance / attention-FFN disaggregation implementers

Attention-FFN disaggregation requires dedicated communication infrastructure at production scale; StepMesh provides an open-source library instantiation, while MegaScale-Infer demonstrates disaggregated expert parallelism delivering claimed cost reductions for MoE serving.

Evolution: Consistent — both represent production-oriented engineering approaches to the same ATTN-FFN disaggregation problem from different integration points.

[19][20][24][23][46]

Together.ai/WukLab (DASD) + MIT HAN Lab (Adaptive Drafter)

Long-tail RL rollout GPU bubbles are a high-value systems problem; DASD (MLSys 2026, claimed 50% RL rollout acceleration) and Adaptive Drafter (ASPLOS'26, framed as ~2x training efficiency) independently address it with no direct comparison published.

Evolution: Consistent — simultaneous emergence from separate institutions and venues without cross-comparison remains the defining feature of this thread.

[25][26][27][28][47][48][29]

Production practitioners + cloud providers

PD disaggregation is stable for production; AWS Neuron has an official BETA developer guide, Vultr and Spheron have published deployment guides, and DeepSeek's sparse attention and MoE routing are confirmed performing well in production.

Evolution: Consistent cloud-provider and practitioner validation of disaggregated inference as a production-ready configuration.

[49][4][10][12][13]

Tensions

Sparse attention as durable vs. transitional architecture: BLASST Best Paper and DeepSeek production confirmation[4] signal a durable shift, while @superaiwatcher argues it is a stopgap before linear attention displaces it by 2027[5]; NSA paper (arXiv 2502.11089)[6] adds a third path via hardware-aligned trainable sparse attention that could close the efficiency gap motivating the linear attention argument. [3][4][5][6][7][42]
Disaggregation maturity framing: practitioners call PD disaggregation 'finally stable'[49], Dynamo earns third-party commercial deployment guides[12][13], yet vLLM's own disaggregated prefilling retains the 'experimental' label through v0.10.2[16], with Dynamo now positioned as the de facto production bypass rather than a complement. [49][12][13][16][14][11]
ATTN-FFN disaggregation standardization: StepFun AI's StepMesh[19] is a standalone open-source communication library, while the vLLM RFC[21] embeds disaggregation into the framework itself—neither has defined the interoperability standard and the two have not publicly coordinated. [19][21][22]
Competing RL rollout efficiency approaches: MIT HAN Lab's Adaptive Drafter (ASPLOS'26, ~2x speedup framing[28]) and Together.ai/WukLab's DASD (MLSys 2026, 50% acceleration claim[26]) independently target the same long-tail rollout GPU bubble problem with no direct comparison published. [28][26][27][48][29]
MoE expert balancing gap: SemiAnalysis identifies expert load balancing as underexplored in open source because it only surfaces at production scale[30], while MegaScale-Infer claims disaggregated expert parallelism delivers cost reductions[24] without public detail on whether the specific balancing problem is addressed. [30][24][23]

Sources

[1] MLSys 2026 is next week! — SemiAnalysis Twitter (2026-05-17)
[2] MLSys 2026 is starting today. And I'm excited to be here. This conference focuses on the most important research problem... — reactive:mlsys-2026-inference-systems (2026-05-18)
[3] [2512.12087] BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding — reactive:mlsys-2026-inference-systems
[4] @rohanpaul_ai Having built with DeepSeek models in production — the architecture point is real. Their MoE routing and sp... — reactive:mlsys-2026-inference-systems (2026-05-24)
[5] @rasbt Sparse attention is a stopgap. By 2027, hardware-native linear attention will render these manual sparsity implem... — reactive:mlsys-2026-inference-systems (2026-05-23)
[6] Paper page - Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
[7] [PDF] Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
[8] UW SyFi Team Wins Multiple Prizes At FlashInfer AI Kernel Contest — reactive:mlsys-2026-inference-systems
[9] RT @dogacel0: Excited to share I placed #1 (twice!) at the MLSys 2026 × NVIDIA FlashInfer AI Kernel Generation Contest, ... — reactive:mlsys-2026-inference-systems (2026-05-24)
[10] Disaggregated Inference [BETA] — AWS Neuron Documentation — reactive:mlsys-2026-inference-systems
[11] Disagg Communication | NVIDIA Dynamo Documentation — reactive:mlsys-2026-inference-systems
[12] How to Build Disaggregated Inference with NVIDIA Dynamo | Vultr Docs — reactive:mlsys-2026-inference-systems
[13] NVIDIA Dynamo 1.0: Disaggregated LLM Inference Deployment Guide (2026) | Spheron Blog — reactive:mlsys-2026-inference-systems
[14] NVIDIA Dynamo - vLLM — reactive:mlsys-2026-inference-systems
[15] Disaggregated Prefilling (experimental) — vLLM — reactive:mlsys-2026-inference-systems
[16] Disaggregated Prefilling (experimental) - vLLM — reactive:mlsys-2026-inference-systems
[17] [Roadmap] vLLM Roadmap Q1 2026 · Issue #32455 - GitHub — reactive:mlsys-2026-inference-systems
[18] Why RDMA KV Cache Transfer Broke in Kubernetes - Medium — reactive:mlsys-2026-inference-systems
[19] stepfun-ai/StepMesh — reactive:mlsys-2026-inference-systems
[20] Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — reactive:mlsys-2026-inference-systems
[21] [RFC]: ATTN-FFN Disaggregation for MoE Models #22799 - GitHub — reactive:mlsys-2026-inference-systems
[22] This amazing Attention-FFN disaggregation implementation from ... — reactive:mlsys-2026-inference-systems
[23] [PDF] MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with ... — reactive:mlsys-2026-inference-systems
[24] [2504.02263] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism — reactive:mlsys-2026-inference-systems
[25] Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding — reactive:mlsys-2026-inference-systems
[26] Beat the long tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[27] GitHub - mit-han-lab/fastrl: [ASPLOS'26] Taming the Long-Tail — reactive:mlsys-2026-inference-systems
[28] New method could increase LLM training efficiency | MIT News | Massachusetts Institute of Technology — reactive:mlsys-2026-inference-systems
[29] Adaptive drafter model uses downtime to double LLM training speed — reactive:mlsys-2026-inference-systems
[30] @NousResearch @StepFun_ai @haoailab Large scale production system challenges, such as expert balancing in serving MoE mo… — SemiAnalysis Twitter (2026-05-17)
[31] Glad to be featured by SemiAnalysis. Our work BLASST was also selected as MLSys 2026 Best Paper: https://t.co/OlkQ7x75BN... — reactive:mlsys-2026-inference-systems (2026-05-17)
[32] [Feature Request] BLASST: Dynamic BLocked Attention Sparsity via ... — reactive:mlsys-2026-inference-systems
[33] dynamic_block_sparse_fwd_flas... — reactive:mlsys-2026-inference-systems
[34] RT @dogacel0: Excited to share I placed #1 (twice!) at the MLSys 2026 × NVIDIA FlashInfer AI Kernel Generation Contest, ... — reactive:mlsys-2026-inference-systems (2026-05-24)
[35] FlashInfer AI Kernel Generation Contest | Eric Dahan — reactive:mlsys-2026-inference-systems
[36] Zihao Ye - FlashInfer AI Kernel Generation Contest - LinkedIn — reactive:mlsys-2026-inference-systems
[37] Disaggregated Prefilling (experimental) - vLLM — reactive:mlsys-2026-inference-systems
[38] Help testing and implementing sm120 flashmla sparse attention in vllm — reactive:mlsys-2026-inference-systems
[39] Sparse attention mechanisms are finally moving beyond academic benchmarks into production systems, including DeepSeek Sp… — SemiAnalysis Twitter (2026-05-17)
[40] @NousResearch Assigning dedicated resources to different types of workloads is an increasingly popular system optimizati… — SemiAnalysis Twitter (2026-05-17)
[41] The long-tail distribution of rollout lengths causes one of the most critical inefficiencies in RL training. — SemiAnalysis Twitter (2026-05-17)
[42] [PDF] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention | Semantic Scholar — reactive:mlsys-2026-inference-systems
[43] Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
[44] [Literature Review] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
[45] Native Sparse Attention: Hardware-Aligned and Natively Trainable ... — reactive:mlsys-2026-inference-systems
[46] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism — reactive:mlsys-2026-inference-systems
[47] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter, [ASPLOS 2026] — reactive:mlsys-2026-inference-systems
[48] Paper: Beat the long tail: Distribution-Aware Speculative Decoding ... — reactive:mlsys-2026-inference-systems
[49] @lmsysorg @CloudflareDev The real story here isn't the bug fix, it's that prefill-decode disaggregation is finally stabl... — reactive:mlsys-2026-inference-systems (2026-05-21)
[50] Step3: Cost-Effective Multimodal Intelligence - StepFun — reactive:mlsys-2026-inference-systems
[51] Paper page - Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — reactive:mlsys-2026-inference-systems
[52] Beat the long tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[53] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter — reactive:mlsys-2026-inference-systems
[54] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter — reactive:mlsys-2026-inference-systems
[55] [PDF] Dynamic BLocked Attention Sparsity via Softmax Thresholding - arXiv — reactive:mlsys-2026-inference-systems
[56] Dynamic BLocked Attention Sparsity via Softmax Thresholding — reactive:mlsys-2026-inference-systems
[57] BLASST: Dynamic BLocked Attention Sparsity via Softmax ... — reactive:mlsys-2026-inference-systems
[58] vLLM crew is out in full force at MLSys 2026 🔥 — reactive:mlsys-2026-inference-systems (2026-05-18)
[59] We’re at MLSys 2026 in Bellevue this week! ⛴️ — reactive:mlsys-2026-inference-systems (2026-05-18)
[60] Headed to MLSys 2026? — reactive:mlsys-2026-inference-systems (2026-05-17)
[61] PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-21)