MLSys 2026: Inference Systems Research Preview · history

Version 7

2026-05-25 10:13 UTC · 213 items

Changes since v6

Third-party deployment guides from Vultr[21] and Spheron[22], plus a dedicated vLLM Dynamo integration page[23], extend the disaggregation toolchain maturity signal beyond NVIDIA's own Kubernetes documentation from the prior pass, with Vultr and Spheron appearing as new voices that document Dynamo as a standard deployment pattern. An MLSys 2026 × NVIDIA FlashInfer AI Kernel Generation Contest surfaced, with @dogacel0 placing first[9][10], adding a new conference artifact to the FlashInfer/sparse attention kernel integration thread. Otherwise there are no new fault lines; the remaining new items are conference administrative pages, explainer articles, and social amplifications that deepen existing themes without introducing substantive new claims.

What

MLSys 2026 (May 18–22, Bellevue, WA) produced four interlocking production inference research threads: NVIDIA's BLASST (arXiv 2512.12087) won Best Paper for dynamic blocked sparse attention[3], with community integration efforts active in FlashInfer[7] and vLLM's SM120/Blackwell pathway[8]. Attention-FFN disaggregation is advancing through StepFun AI's StepMesh[28], a vLLM RFC[30], and MegaScale-Infer (arXiv 2504.02263)[34][32]. The NVIDIA Dynamo disaggregated inference toolchain has expanded to third-party deployment guides from Vultr[21] and Spheron[22] and a dedicated vLLM integration page[23], extending the maturity signal beyond NVIDIA's own Kubernetes documentation[20]. Two independent speculative decoding approaches — MIT HAN Lab's Adaptive Drafter (ASPLOS'26)[39][42] and Together.ai/WukLab's DASD (MLSys 2026)[36] — target the same long-tail RL rollout inefficiency without yet being directly compared.

Why it matters

The four threads collectively signal that production inference is no longer a single bottleneck but a stack of increasingly specialized workload-routing problems — by compute phase (PD disaggregation), by transformer component (attention-FFN disaggregation), by attention sparsity pattern (BLASST, NSA), and by training rollout distribution (DASD, Adaptive Drafter). The spread of NVIDIA Dynamo disaggregated inference documentation to Vultr, Spheron, and vLLM's own integration page suggests the toolchain is reaching a level of third-party adoption that moves disaggregated inference from research configuration to mainstream cloud deployment pattern.

Open questions

Does the vLLM Dynamo integration page[23] reflect a change in vLLM's 'experimental' label for disaggregated prefilling[27], or does it document Dynamo as an external framework that handles the disaggregation layer independently of vLLM's own experimental status?
Does NVIDIA's Dynamo Kubernetes disaggregated communication documentation[20] resolve the RDMA KV cache transfer failures documented in production[25], or does it address a different deployment topology — and how does this affect the disaggregation maturity gap between practitioner claims[26] and vLLM's 'experimental' label[27]?
Will the Reddit community effort to implement SM120 flashmla sparse attention in vLLM[8] and the open FlashInfer dynamic block sparse forward GitHub issue[7] converge on a shared Blackwell-compatible sparse attention kernel, and does the MLSys 2026 × NVIDIA FlashInfer AI Kernel Generation Contest[9][10] accelerate this timeline?
MegaScale-Infer appeared on arXiv[34] and at SIGCOMM 2025[32]. Does the full paper detail whether disaggregated expert parallelism specifically addresses the expert load balancing problem SemiAnalysis identified as underexplored in open source[43]?

Narrative

MLSys 2026, held May 18–22 in Bellevue, Washington, brought together researchers and engineers from across the AI industry to address production inference and training systems challenges[1][2]. The conference's most prominent recognition went to NVIDIA's BLASST paper (arXiv 2512.12087), which introduces Dynamic Blocked Attention Sparsity via Softmax Thresholding[3]: a runtime technique that dynamically identifies and skips low-salience attention blocks using a softmax threshold gate, targeting the core inefficiency of dense attention at long context lengths. BLASST's Best Paper award reflects a broader productionization wave: sparse attention has migrated from academic benchmarks into live deployments, with DeepSeek's Sparse Attention and NousResearch's Lighthouse Attention cited as examples in a pre-conference SemiAnalysis thread[4], and a practitioner reporting post-conference that DeepSeek's MoE routing and sparse attention perform well in production[5]. A community request to integrate BLASST into FlashInfer was filed during the conference[6], a GitHub issue tracking dynamic block sparse forward implementation in FlashInfer is now open[7], and a separate Reddit thread documents community effort to implement SM120 flashmla sparse attention in vLLM[8]. An MLSys 2026 × NVIDIA FlashInfer AI Kernel Generation Contest also ran at the conference, with @dogacel0 placing first[9][10], signaling that competitive kernel development around FlashInfer is now a formal conference activity.
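To make the idea of a softmax threshold gate concrete, here is a minimal single-query sketch in pure Python. It is not the paper's kernel. It assumes a FlashAttention-style online softmax over key/value blocks and skips a block when its largest score sits so far below the running maximum that its largest softmax weight would fall under `threshold`. The function name, the `threshold` parameter, and the exact gating rule are illustrative assumptions; BLASST's actual criterion and GPU implementation may differ.

```python
import math

def thresholded_block_attention(q, keys, values, block_size, threshold):
    """Single-query attention over K/V blocks with an online softmax.

    Hypothetical gate: skip a block when exp(block_max - running_max)
    is below `threshold`. A threshold of 0.0 disables skipping.
    Returns (output_vector, number_of_skipped_blocks).
    """
    scale = 1.0 / math.sqrt(len(q))
    # Compare in log space so exp() never overflows.
    log_threshold = math.log(threshold) if threshold > 0 else -math.inf
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    skipped = 0
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        block_max = max(scores)
        # Gate: the first block is never skipped (running_max is -inf).
        if block_max - running_max < log_threshold:
            skipped += 1
            continue
        # Standard online-softmax update: rescale old state to the new max.
        new_max = max(running_max, block_max)
        rescale = 0.0 if running_max == -math.inf else math.exp(running_max - new_max)
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc], skipped

# Toy data: the second block scores far below the first, so it is gated out.
q = [1.0, 0.0]
keys = [[10.0, 0.0], [9.0, 0.0], [-10.0, 0.0], [-9.0, 0.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
sparse_out, n_skipped = thresholded_block_attention(q, keys, values, 2, 1e-3)
dense_out, _ = thresholded_block_attention(q, keys, values, 2, 0.0)
```

In this sketch the gate depends on the running maximum at the moment a block is visited, so a low-scoring block seen before any high-scoring block would still be processed. A real kernel would make this decision per tile on the GPU, which the sketch does not attempt to model.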

The debate over whether sparse attention is a durable or transitional architecture has grown more textured than a simple binary. A dissenting voice, @superaiwatcher, argued post-conference that sparse attention is a transitional stopgap before hardware-native linear attention displaces manual sparsity implementations by 2027[11]. The Native Sparse Attention (NSA) paper (arXiv 2502.11089), published at ACL 2025, offers a direct counter-position: rather than imposing sparsity patterns at runtime on top of dense attention kernels, NSA designs sparse attention to be hardware-aligned and natively trainable from the ground up[12][13][14]. This represents a third architectural path — between BLASST-style dynamic runtime sparsity and full linear attention replacement — suggesting the sparse attention research community is actively working to close the hardware-alignment gap that motivates the linear attention argument without abandoning the sparse paradigm[15][16]. A survey of hardware-efficient attention mechanisms covering sparse, compact, and linear variants contextualizes the full design space[17].

Prefill-decode (PD) disaggregation, which routes the compute-bound prefill phase and the memory-bandwidth-bound decode phase to dedicated hardware pools, is now widely accepted as an industry best practice[18]. AWS Neuron's BETA disaggregated inference developer guide[19] signals cloud-provider toolchain formalization, and NVIDIA Dynamo has published official documentation for disaggregated communication in Kubernetes deployments[20]. Third-party deployment guides from Vultr[21] and Spheron[22], and a dedicated vLLM-Dynamo integration page[23], extend this maturity signal further, indicating that disaggregated inference configuration is spreading to cloud hosting platforms and is being formalized within the vLLM ecosystem itself. LMCache Lab released an async PD backend during the conference[24]. A documented RDMA KV cache transfer failure in Kubernetes[25] had flagged a concrete networking obstacle; NVIDIA's Dynamo documentation and the third-party guides suggest the deployment path is becoming increasingly well-specified, though it remains unclear whether they address the specific failure mode described in that write-up. The live tension is the gap between the practitioner framing of disaggregation as 'finally stable'[26] and vLLM's own 'experimental' label for disaggregated prefilling[27].

Attention-FFN disaggregation has emerged as the next architectural axis: StepFun AI's StepMesh communication library for attention-FFN disaggregated MoE inference is openly published on GitHub[28], backed by the Step-3 paper (arXiv 2507.19427)[29]; the vLLM project posted an RFC proposing ATTN-FFN disaggregation for MoE models[30][31]; and MegaScale-Infer (arXiv 2504.02263), a ByteDance system confirmed at SIGCOMM 2025[32], uses disaggregated expert parallelism for serving MoE at scale with claimed cost reductions[33][34][35].
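The compute-bound versus memory-bandwidth-bound distinction behind prefill-decode disaggregation can be illustrated with a back-of-the-envelope arithmetic-intensity calculation. The sketch below considers only a single `d_model × d_model` linear layer with 16-bit weights and activations, and ignores attention and KV cache traffic. The function names and the ridge point of 200 FLOPs per byte are hypothetical placeholders, not figures from any cited source or specific accelerator.

```python
def linear_layer_intensity(n_tokens, d_model, bytes_per_elem=2):
    """FLOPs per byte moved for one d_model x d_model linear layer.

    n_tokens is the number of tokens processed in one pass: the prompt
    length during prefill, or the number of sequences being decoded
    (one new token each) during decode.
    """
    flops = 2 * n_tokens * d_model * d_model              # multiply + add
    weight_bytes = d_model * d_model * bytes_per_elem     # weights read once
    act_bytes = 2 * n_tokens * d_model * bytes_per_elem   # input + output
    return flops / (weight_bytes + act_bytes)

def bound_type(n_tokens, d_model, ridge_flops_per_byte):
    """Classify a pass against a (hypothetical) hardware ridge point."""
    intensity = linear_layer_intensity(n_tokens, d_model)
    if intensity >= ridge_flops_per_byte:
        return "compute-bound"
    return "memory-bandwidth-bound"

RIDGE = 200.0  # hypothetical FLOPs/byte where compute and bandwidth balance

prefill = bound_type(2048, 4096, RIDGE)  # 2048-token prompt
decode = bound_type(1, 4096, RIDGE)      # one new token for one sequence
```

Under these assumptions a long prompt reuses each weight across thousands of tokens, while a decode step reads the same weights to produce a single token, which is the asymmetry that motivates giving each phase its own hardware pool.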

Two independent research efforts targeted the same structural inefficiency in reinforcement learning post-training: the long-tail distribution of rollout lengths that creates GPU bubbles and degrades training throughput. Distribution-Aware Speculative Decoding (DASD), from Together.ai and WukLab (arXiv 2511.13841), was presented at MLSys 2026 and claims up to 50% acceleration of RL rollouts[36][37][38]. MIT HAN Lab's Adaptive Drafter (arXiv 2511.16665, ASPLOS'26, open-source at mit-han-lab/fastrl[39]) has a publicly available conference talk video[40], with MIT News and TechXplore both framing the method as roughly doubling LLM training efficiency[41][42]. The simultaneous emergence of independent approaches to the same long-tail RL inefficiency, from different institutions and conference venues, suggests the problem is widely recognized as high-value; direct comparisons between the approaches have not yet appeared.
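The GPU bubble created by long-tail rollout lengths can be quantified with a toy model. The sketch below assumes a synchronous batch in which every rollout occupies one slot, each slot generates one token per step, and the batch finishes only when the longest rollout does. The function name and the example lengths are illustrative assumptions, not measurements from either paper.

```python
def bubble_fraction(rollout_lengths):
    """Fraction of slot-steps left idle while waiting for the longest rollout.

    Toy model: useful work is the total number of generated tokens, and
    capacity is (number of slots) x (length of the longest rollout).
    """
    longest = max(rollout_lengths)
    capacity = len(rollout_lengths) * longest
    return 1.0 - sum(rollout_lengths) / capacity

# Seven short rollouts and one long straggler.
lengths = [100] * 7 + [800]
idle = bubble_fraction(lengths)  # 0.765625
```

In this example a single straggler leaves roughly three quarters of the batch capacity idle. Both approaches discussed above target this tail; the toy model does not capture how either one actually accelerates it.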

Timeline

2025-02-01: Native Sparse Attention (NSA) paper submitted to arXiv (2502.11089), introducing hardware-aligned and natively trainable sparse attention; subsequently published at ACL 2025 [12][13][14][53][49][50][51]
2025-04-01: MegaScale-Infer paper (arXiv 2504.02263, also SIGCOMM 2025) submitted, describing disaggregated expert parallelism for MoE serving at scale [33][59][34][32]
2025-07-01: StepFun AI submits Step-3 paper (arXiv 2507.19427) describing model-system co-design for cost-effective decoding, providing production context for StepMesh [29][56][57]
2025-11-01: Distribution-Aware Speculative Decoding paper submitted to arXiv (2511.13841), targeting long-tail RL rollout GPU bubbles [37][74][38]
2025-11-01: MIT HAN Lab submits Adaptive Drafter paper (arXiv 2511.16665) on efficient reasoning RL training; GitHub project fastrl confirms ASPLOS'26 venue [62][63][39]
2025-12-01: BLASST paper submitted to arXiv (2512.12087), introducing dynamic blocked sparse attention via softmax thresholding [3][75][47][48]
2026-01-01: Analytical Provisioning for Attention-FFN Disaggregated LLM Serving paper submitted to arXiv (2601.21351) [76]
2026-02-01: Challenges of Attention-FFN Disaggregation paper submitted to arXiv (2602.09721) [77]
2026-02-26: MIT News and TechXplore publish coverage of Adaptive Drafter, framing the method as doubling LLM training speed [41][42]
2026-04-25: LMSYS publishes blog post on DeepSeek-V4 fast inference and verified RL with SGLang and Miles [78]
2026-05-17: SemiAnalysis publishes pre-conference MLSys 2026 research preview covering sparse attention, disaggregation, MoE balancing, and RL training efficiency [1][4][18][43][44]
2026-05-17: NVIDIA researcher Huizi Mao confirms BLASST wins MLSys 2026 Best Paper for dynamic blocked sparse attention [45]
2026-05-18: MLSys 2026 opens in Bellevue; vLLM, LMSYS, Inferact, Delta Institute among attendees [2][79][80][81]
2026-05-18: MLSys 2026 Best Paper congratulations circulate; scaling RL across heterogeneous accelerators receives Oral recognition [82][83][84]
2026-05-20: Theta EdgeCloud PD disaggregation test results published; financial commentary picks up disaggregation as argument for extended GPU useful life [85][86][67][68]
2026-05-21: LMCache Lab releases async PDBackend; StepFun AI publishes StepMesh on GitHub; vLLM RFC for ATTN-FFN disaggregation posted; vLLM community gathers IRL at conference [24][28][30][31][54]
2026-05-22: Conference final day: GDN kernel work on B200 GPUs presented; FlashInfer BLASST integration request filed; FlashInfer dynamic block sparse forward GitHub issue opened; MLSys 2026 × NVIDIA FlashInfer AI Kernel Generation Contest results announced with @dogacel0 placing first [87][88][6][89][90][7][9][10]
2026-05-23: @superaiwatcher posts that sparse attention is a stopgap before hardware-native linear attention; KV cache cross-node traffic framing circulates [89][91][92][93][94][95][96][11][97]
2026-05-24: JasonLiu confirms DeepSeek MoE routing and sparse attention performing well in production; NVIDIA Dynamo publishes Kubernetes disaggregated communication documentation; third-party Dynamo deployment guides published by Vultr and Spheron; vLLM publishes dedicated Dynamo integration page [5][70][20][21][22][23]

Perspectives

SemiAnalysis

Curatorial and analytical: frames MLSys 2026 as the venue for the most important production AI systems problems, highlighting sparse attention productionization, PD disaggregation extensions, MoE expert balancing gaps, and RL training inefficiency as the four key threads

Evolution: Consistent — SemiAnalysis functions as a research-curating voice for practitioners; no stance shift observable

[1][4][18][43][44]

NVIDIA / Huizi Mao (BLASST team) / NVIDIA Dynamo

BLASST's dynamic blocked sparse attention has been validated by the research community as Best Paper; community is pushing for integration into inference kernels like FlashInfer beyond NVIDIA's own stack; NVIDIA Dynamo provides official Kubernetes documentation for disaggregated communication and has been adopted by third-party cloud providers Vultr and Spheron as a deployment reference; an MLSys 2026 × NVIDIA FlashInfer AI Kernel Generation Contest further embedded FlashInfer as a competitive kernel development platform at the conference

Evolution: Expanded: third-party deployment guides from Vultr[21] and Spheron[22] and the vLLM Dynamo integration page[23] extend NVIDIA Dynamo's disaggregated inference documentation story beyond NVIDIA-controlled channels, and the FlashInfer kernel contest[9][10] adds a new community engagement dimension

[45][3][6][46][47][48][7][20][21][22][23][9][10]

NSA paper authors (Yuan et al., ACL 2025)

Hardware-aligned and natively trainable sparse attention (arXiv 2502.11089) demonstrates that sparse attention can be designed from training time to close the hardware-alignment gap, offering a third path between BLASST-style runtime sparsity and full linear attention replacement

Evolution: Consistent — multiple additional pointers to the paper[49][50][51][52] confirm ongoing community access and education but introduce no new claims

[12][13][14][53][15][16][49][50][51][52]

@superaiwatcher (dissenting practitioner)

Sparse attention is a stopgap; by 2027, hardware-native linear attention will render manual sparsity implementations obsolete, making the current productionization wave a transitional rather than permanent architecture shift

Evolution: Consistent — represents the explicit counter-narrative to the conference's sparse attention productionization consensus; the NSA paper is a direct counter-weight to this position, though @superaiwatcher has not responded to it in the thread

[11]

JasonLiu (@jsyqrt) (production practitioner)

DeepSeek's MoE routing and sparse attention architecture works well in production, confirming the practical viability of the techniques highlighted at the conference

Evolution: Consistent — adds post-conference practitioner validation to the research consensus

[5]

vLLM project / SM120 community

Attention-FFN disaggregation for MoE models is being actively pursued via RFC; disaggregated prefilling remains labeled experimental in official vLLM documentation; community members are simultaneously working to implement SM120 flashmla sparse attention in vLLM alongside the FlashInfer track; a dedicated vLLM-Dynamo integration page now exists, documenting Dynamo as the external disaggregation layer

Evolution: Broadened: the vLLM Dynamo integration page[23] adds an official vLLM documentation artifact pointing to Dynamo for disaggregated inference, which partially reconciles the 'experimental' label tension by routing disaggregated deployments through Dynamo rather than vLLM's own experimental pathway

[30][31][27][54][8][23]

StepFun AI / StepMesh

Attention-FFN disaggregation for MoE inference requires dedicated communication infrastructure; StepMesh instantiates this as a production-oriented open-source library backed by model-system co-design documented in the Step-3 paper

Evolution: Consistent; StepMesh is openly hosted on GitHub and the Step-3 paper provides production evidence for the co-design approach

[28][29][55][56][57]

ByteDance / MegaScale-Infer

Disaggregated expert parallelism — separating attention and FFN/expert compute across different hardware — enables cost-effective MoE serving at scale, as documented in arXiv 2504.02263 and presented at SIGCOMM 2025

Evolution: Consistent — confirmed at SIGCOMM 2025 in addition to arXiv, adding a peer-reviewed systems venue to the claimed cost reductions

[33][58][59][60][34][32][35]

AWS Neuron

Disaggregated inference is production-ready enough to warrant official BETA documentation in the AWS Neuron developer guide

Evolution: Consistent — cloud-provider toolchain formalization of disaggregated inference

[19]

Together.ai / WukLab (DASD)

Distribution-Aware Speculative Decoding delivers up to 50% acceleration of RL rollouts by adapting draft model behavior to the actual rollout length distribution rather than assuming fixed lengths

Evolution: Consistent; one of two confirmed independent groups attacking the same long-tail RL rollout problem from different institutional and conference contexts

[36][37][61][38]

MIT HAN Lab (Adaptive Drafter / fastrl)

Adaptive Drafter (ASPLOS'26) targets long-tail RL training inefficiency with an adaptive speculative decoding approach; MIT News and TechXplore both frame it as roughly doubling LLM training efficiency, and the conference talk video is publicly available

Evolution: Consistent; TechXplore coverage[42] added alongside MIT News[41] in prior pass, broadening press pickup beyond the initial institutional article

[62][63][39][41][64][40][42]

LMCache Lab

Actively productionizing PD disaggregation with a new async backend, framing the architecture as ready for deployment at scale

Evolution: Consistent with their KV cache and disaggregation-focused engineering track

[24][65]

Vultr / Spheron (third-party cloud providers)

NVIDIA Dynamo disaggregated inference is mature enough to document as a standard deployment pattern, with both providers publishing their own step-by-step guides for building disaggregated inference on their infrastructure

Evolution: New voice this pass: third-party cloud provider adoption of Dynamo deployment documentation[21][22] adds an independent commercial toolchain maturity signal distinct from NVIDIA's own documentation

[21][22]

Financial / investor commentators (TheValueist, podcast_alpha_x, tropicalvalue)

PD disaggregation is evidence that GPU useful lives are substantially longer than AI skeptics claim — older GPUs can be repurposed in disaggregated decode roles, extending hardware ROI

Evolution: Consistent; this framing has continued circulating post-conference

[66][67][68][69][70]

Sakura Yuki (inference practitioner)

PD disaggregation is finally stable for production; running co-located prefill and decode at scale is now wasteful rather than a reasonable default

Evolution: Consistent with the broad practitioner consensus emerging at the conference

[26][71]

Tensions

Sparse attention as milestone vs. stopgap: The conference and practitioner community framed sparse attention productionization as a durable architectural shift (BLASST Best Paper[3], DeepSeek production confirmation[5]), while @superaiwatcher argues it is a transitional stopgap before hardware-native linear attention displaces manual sparsity by 2027[11]. The NSA paper (arXiv 2502.11089)[12][13] adds a third position: that hardware-aligned sparse attention can be designed natively from training time, potentially closing the efficiency gap that motivates the linear attention argument without abandoning sparsity at all. [3][5][11][4][12][13][14][49][50][51]
Open-source community vs. production practitioners on MoE expert balancing: SemiAnalysis observes that expert balancing in MoE serving is underexplored in open source because the challenge only surfaces at production scale[43]. MegaScale-Infer claims MoE serving cost reductions via disaggregated expert parallelism[33][32], but it remains unresolved whether this specifically addresses the expert balancing problem, or whether any open-source production-grade solution exists. [43][60][33][32][72]
Disaggregation maturity framing: Practitioner commentary frames PD disaggregation as 'finally stable' for production[26], AWS Neuron carries an official BETA developer guide[19], NVIDIA Dynamo has Kubernetes disaggregated communication documentation[20], and third-party deployment guides from Vultr[21] and Spheron[22] along with a vLLM Dynamo integration page[23] extend this maturity signal further; yet vLLM's own documentation still labels disaggregated prefilling as 'experimental'[27]. A documented RDMA KV cache transfer failure in Kubernetes[25] raised a concrete networking obstacle, which NVIDIA Dynamo's documentation may address — but whether it resolves that specific failure mode remains unclear. [26][71][19][20][21][22][23][27][25]
Competing approaches to attention-FFN disaggregation standardization: StepFun AI shipped StepMesh as a standalone open-source communication library[28][55], while vLLM posted an RFC for ATTN-FFN disaggregation built into the vLLM framework[30]. The two efforts address the same problem from different integration points; it is unclear whether they will converge or which will define the interoperability standard. [28][55][30][31][73]
Competing speculative decoding approaches for RL rollout efficiency: MIT HAN Lab's Adaptive Drafter (ASPLOS'26, reportedly ~2x speedup, talk video available[40], covered by MIT News[41] and TechXplore[42]) and Together.ai/WukLab's DASD (MLSys 2026, claimed 50% acceleration[37][36]) independently target the long-tail RL rollout GPU bubble problem with different strategies. Direct comparisons between the approaches have not appeared, and whether they are composable or serve different regimes is unresolved. [39][64][37][36][41][40][38][42]

Sources

[1] MLSys 2026 is next week! — SemiAnalysis Twitter (2026-05-17)
[2] MLSys 2026 is starting today. And I'm excited to be here. This conference focuses on the most important research problem... — reactive:mlsys-2026-inference-systems (2026-05-18)
[3] [2512.12087] BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding — reactive:mlsys-2026-inference-systems
[4] Sparse attention mechanisms are finally moving beyond academic benchmarks into production systems, including DeepSeek Sp… — SemiAnalysis Twitter (2026-05-17)
[5] @rohanpaul_ai Having built with DeepSeek models in production — the architecture point is real. Their MoE routing and sp... — reactive:mlsys-2026-inference-systems (2026-05-24)
[6] [Feature Request] BLASST: Dynamic BLocked Attention Sparsity via ... — reactive:mlsys-2026-inference-systems
[7] dynamic_block_sparse_fwd_flas... — reactive:mlsys-2026-inference-systems
[8] Help testing and implementing sm120 flashmla sparse attention in vllm — reactive:mlsys-2026-inference-systems
[9] RT @dogacel0: Excited to share I placed #1 (twice!) at the MLSys 2026 × NVIDIA FlashInfer AI Kernel Generation Contest, ... — reactive:mlsys-2026-inference-systems (2026-05-24)
[10] RT @dogacel0: Excited to share I placed #1 (twice!) at the MLSys 2026 × NVIDIA FlashInfer AI Kernel Generation Contest, ... — reactive:mlsys-2026-inference-systems (2026-05-24)
[11] @rasbt Sparse attention is a stopgap. By 2027, hardware-native linear attention will render these manual sparsity implem... — reactive:mlsys-2026-inference-systems (2026-05-23)
[12] Paper page - Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
[13] [PDF] Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
[14] [PDF] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention | Semantic Scholar — reactive:mlsys-2026-inference-systems
[15] [Literature Review] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
[16] Native Sparse Attention: Hardware-Aligned and Natively Trainable ... — reactive:mlsys-2026-inference-systems
[17] [PDF] Hardware-efficient, Sparse, Compact, and Linear Attention — reactive:mlsys-2026-inference-systems
[18] @NousResearch Assigning dedicated resources to different types of workloads is an increasingly popular system optimizati… — SemiAnalysis Twitter (2026-05-17)
[19] Disaggregated Inference [BETA] — AWS Neuron Documentation — reactive:mlsys-2026-inference-systems
[20] Disagg Communication | NVIDIA Dynamo Documentation — reactive:mlsys-2026-inference-systems
[21] How to Build Disaggregated Inference with NVIDIA Dynamo | Vultr Docs — reactive:mlsys-2026-inference-systems
[22] NVIDIA Dynamo 1.0: Disaggregated LLM Inference Deployment Guide (2026) | Spheron Blog — reactive:mlsys-2026-inference-systems
[23] NVIDIA Dynamo - vLLM — reactive:mlsys-2026-inference-systems
[24] PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-21)
[25] Why RDMA KV Cache Transfer Broke in Kubernetes - Medium — reactive:mlsys-2026-inference-systems
[26] @lmsysorg @CloudflareDev The real story here isn't the bug fix, it's that prefill-decode disaggregation is finally stabl... — reactive:mlsys-2026-inference-systems (2026-05-21)
[27] Disaggregated Prefilling (experimental) - vLLM — reactive:mlsys-2026-inference-systems
[28] stepfun-ai/StepMesh — reactive:mlsys-2026-inference-systems
[29] Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — reactive:mlsys-2026-inference-systems
[30] [RFC]: ATTN-FFN Disaggregation for MoE Models #22799 - GitHub — reactive:mlsys-2026-inference-systems
[31] This amazing Attention-FFN disaggregation implementation from ... — reactive:mlsys-2026-inference-systems
[32] [PDF] MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with ... — reactive:mlsys-2026-inference-systems
[33] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism — reactive:mlsys-2026-inference-systems
[34] [2504.02263] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism — reactive:mlsys-2026-inference-systems
[35] Disaggregated Expert Parallelism — reactive:mlsys-2026-inference-systems
[36] Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding — reactive:mlsys-2026-inference-systems
[37] Beat the long tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[38] Paper: Beat the long tail: Distribution-Aware Speculative Decoding ... — reactive:mlsys-2026-inference-systems
[39] GitHub - mit-han-lab/fastrl: [ASPLOS'26] Taming the Long-Tail — reactive:mlsys-2026-inference-systems
[40] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter, [ASPLOS 2026] — reactive:mlsys-2026-inference-systems
[41] New method could increase LLM training efficiency | MIT News | Massachusetts Institute of Technology — reactive:mlsys-2026-inference-systems
[42] Adaptive drafter model uses downtime to double LLM training speed — reactive:mlsys-2026-inference-systems
[43] @NousResearch @StepFun_ai @haoailab Large scale production system challenges, such as expert balancing in serving MoE mo… — SemiAnalysis Twitter (2026-05-17)
[44] The long-tail distribution of rollout lengths causes one of the most critical inefficiencies in RL training. — SemiAnalysis Twitter (2026-05-17)
[45] Glad to be featured by SemiAnalysis. Our work BLASST was also selected as MLSys 2026 Best Paper: https://t.co/OlkQ7x75BN... — reactive:mlsys-2026-inference-systems (2026-05-17)
[46] Add sparse-MLA paged attention for SM120 (RTX PRO ... - GitHub — reactive:mlsys-2026-inference-systems
[47] Dynamic BLocked Attention Sparsity via Softmax Thresholding — reactive:mlsys-2026-inference-systems
[48] BLASST: Dynamic BLocked Attention Sparsity via Softmax ... — reactive:mlsys-2026-inference-systems
[49] [PDF] Hardware-Aligned and Natively Trainable Sparse Attention - arXiv — reactive:mlsys-2026-inference-systems
[50] [2502.11089] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
[51] Hardware-Aligned and Natively Trainable Sparse Attention - arXiv — reactive:mlsys-2026-inference-systems
[52] Native Sparse Attention for dummies — The Next Leap in Efficient ... — reactive:mlsys-2026-inference-systems
[53] Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
[54] Always great to see the #vLLM community connecting IRL at MLSys 2026! Thanks to the teams keeping these conversations go... — reactive:mlsys-2026-inference-systems (2026-05-21)
[55] StepMesh: A Communication Library for Attention-FFN Disaggregation | StepFun — reactive:mlsys-2026-inference-systems
[56] Step3: Cost-Effective Multimodal Intelligence - StepFun — reactive:mlsys-2026-inference-systems
[57] Paper page - Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — reactive:mlsys-2026-inference-systems
[58] [Literature Review] MegaScale-Infer: Serving Mixture-of-Experts at ... — reactive:mlsys-2026-inference-systems
[59] MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with ... — reactive:mlsys-2026-inference-systems
[60] Unlocking MoE Efficiency: How MegaScale-Infer Slashes LLM ... — reactive:mlsys-2026-inference-systems
[61] Beat the Long Tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[62] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter — reactive:mlsys-2026-inference-systems
[63] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter — reactive:mlsys-2026-inference-systems
[64] Aiyu | Taming the Long Tail: Adaptive Speculative Decoding Doubles LLM Training Speed — reactive:mlsys-2026-inference-systems
[65] RT @lmcache: PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-22)
[66] GPU USEFUL LIFE, INFERENCE DISAGGREGATION, AND PRIVATE CREDIT — reactive:mlsys-2026-inference-systems (2026-05-20)
[67] I took one key insight from this convo: inference disaggregation between prefill and decode enable GPU lifespan to be ex... — reactive:mlsys-2026-inference-systems (2026-05-20)
[68] @GavinSBaker : AI skeptics have been wrong to claim GPU useful lives are only 1-2 yrs. The disaggregation of prefill (me... — reactive:mlsys-2026-inference-systems (2026-05-21)
[69] For anyone listening, the relevant section starts around 44m: prefill vs. decode disaggregation. — reactive:mlsys-2026-inference-systems (2026-05-21)
[70] RT @tropicalvalue: I took one key insight from this convo: inference disaggregation between prefill and decode enable GP... — reactive:mlsys-2026-inference-systems (2026-05-23)
[71] @lmsysorg @AMD @dstackai If you're still running prefill and decode on the same GPUs at scale, you're basically burning ... — reactive:mlsys-2026-inference-systems (2026-05-21)
[72] The key observation: load-balancing losses used during MoE training encourage expert diversity. — reactive:mlsys-2026-inference-systems (2026-05-19)
[73] AFD: Decoupling Attention and FFN — reactive:mlsys-2026-inference-systems
[74] Beat the long tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[75] [PDF] Dynamic BLocked Attention Sparsity via Softmax Thresholding - arXiv — reactive:mlsys-2026-inference-systems
[76] [2601.21351] Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads — reactive:mlsys-2026-inference-systems
[77] Revealing the Challenges of Attention-FFN Disaggregation ... — reactive:mlsys-2026-inference-systems
[78] DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles — reactive:mlsys-2026-inference-systems (2026-04-25)
[79] vLLM crew is out in full force at MLSys 2026 🔥 — reactive:mlsys-2026-inference-systems (2026-05-18)
[80] We’re at MLSys 2026 in Bellevue this week! ⛴️ — reactive:mlsys-2026-inference-systems (2026-05-18)
[81] Headed to MLSys 2026? — reactive:mlsys-2026-inference-systems (2026-05-17)
[82] Congratulations on the MLSys 2026 best paper! Looking forward to your presentation today! https://t.co/UrVcKgUJj0 — reactive:mlsys-2026-inference-systems (2026-05-19)
[83] Congrats on winning an MLSys 2026 best research paper! https://t.co/grBdNYK4Fg — reactive:mlsys-2026-inference-systems (2026-05-19)
[84] @Yong_jun_He Congratulations on the MLSys 2026 Oral! Scaling RL training across heterogeneous accelerators is a genuinel... — reactive:mlsys-2026-inference-systems (2026-05-18)
[85] Read “Theta EdgeCloud Tests Prefill/Decode Disaggregation for Large-Scale LLM Serving“ by Theta Labs on Medium: https://... — reactive:mlsys-2026-inference-systems (2026-05-20)
[86] Theta EdgeCloud Tests Prefill/Decode Disaggregation for Large-Scale LLM Serving #machinelearning #ml #artificialintellig... — reactive:mlsys-2026-inference-systems (2026-05-20)
[87] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-22)
[88] RT @SnorkelAI: Live from MLSys 2026! Thanks to everyone who joined @pham_derek's talk yesterday on RLVR in low-data, low... — reactive:mlsys-2026-inference-systems (2026-05-22)
[89] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[90] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-22)
[91] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[92] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[93] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[94] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[95] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[96] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[97] Long-context inference and Prefill-Decode disaggregation turn KV Cache into cross-node traffic. — reactive:mlsys-2026-inference-systems (2026-05-23)