MLSys 2026: Inference Systems Research Preview

Version 4

2026-05-24 11:13 UTC · 164 items

Changes since v3

Three substantive additions this pass: (1) @superaiwatcher introduces the first explicit counter-narrative in the thread, framing sparse attention as a transitional stopgap before hardware-native linear attention, which adds a new tension against the conference's productionization consensus. (2) AWS Neuron's publication of a BETA disaggregated inference developer guide is the first cloud-provider toolchain formalization visible in the thread, showing disaggregation maturing beyond startup and research contexts. (3) MegaScale-Infer is now confirmed as arXiv 2504.02263 with 'disaggregated expert parallelism' as its specific technical framing, and the Adaptive Drafter ASPLOS'26 talk video is now publicly available on YouTube. The remaining new items are background reference material, RTs of the GDN B200 kernel work, and financial commentary that continues to circulate.

What

MLSys 2026 (May 18–22, Bellevue, WA) concluded with four inference research threads that are now moving from conference papers into engineering artifacts. NVIDIA's BLASST (arXiv 2512.12087) won Best Paper for dynamic blocked sparse attention[3], and a dissenting practitioner voice has since framed sparse attention as a stopgap before hardware-native linear attention arrives[8]. Attention-FFN disaggregation has produced two competing efforts, StepFun AI's StepMesh library on GitHub[12] and a vLLM RFC[14], while AWS Neuron now carries a BETA disaggregated inference guide[10], signaling cloud-provider adoption. Two independent groups have simultaneously targeted long-tail RL rollout inefficiency: Together.ai/WukLab with DASD (arXiv 2511.13841)[24] and MIT HAN Lab with Adaptive Drafter (arXiv 2511.16665, ASPLOS'26)[27], whose conference talk video is now publicly available. MegaScale-Infer (arXiv 2504.02263) is confirmed as a ByteDance system that uses disaggregated expert parallelism for MoE serving[17][18].

Why it matters

MLSys 2026 marks a moment when sparse attention, disaggregated serving, and RL training efficiency via speculative decoding are all crossing from research papers into production infrastructure at the same time. Cloud-provider adoption of disaggregated inference (AWS Neuron[10]) and practitioner confirmation that DeepSeek's MoE routing and sparse attention work well in production[7] indicate that these techniques are leaving the research-to-engineering transition zone. That makes the question of what replaces them (hardware-native linear attention[8]) a near-term planning concern rather than a purely theoretical one.

Open questions

Will FlashInfer integrate BLASST's dynamic blocked sparse attention, and will the parallel sparse-MLA kernel work for SM120 hardware[6] accelerate that path?[5]
StepMesh[12] and the vLLM RFC[14] address attention-FFN disaggregation from different integration points — which will establish the interoperability standard, or can they converge?
MIT's Adaptive Drafter reportedly doubles LLM training speed[29][27] while Together.ai claims 50% RL rollout acceleration for DASD[23] — how do these approaches compare across model scales and RL algorithms, and are they composable?
Is sparse attention genuinely a production milestone or a transitional stopgap before hardware-native linear attention renders manual sparsity implementations obsolete by 2027?[8] A practitioner reports DeepSeek's sparse attention working well in production today[7], but the architectural bet is unresolved.
MegaScale-Infer's disaggregated expert parallelism claims MoE serving cost reductions[17][19] — does this approach specifically address expert load balancing, the problem SemiAnalysis identified as underexplored in open source?[30]

Narrative

MLSys 2026, held May 18–22 in Bellevue, Washington, brought together researchers and engineers from across the AI industry to address production inference and training systems challenges[1][2]. The conference's most prominent recognition went to NVIDIA's BLASST paper (arXiv 2512.12087), which introduces Dynamic Blocked Attention Sparsity via Softmax Thresholding[3]. BLASST dynamically identifies and skips low-salience attention blocks at runtime using a softmax threshold gate, attacking the core inefficiency of dense attention at long context lengths. Its Best Paper award reflects a broader productionization wave documented in a pre-conference SemiAnalysis thread: sparse attention has migrated from academic benchmarks into live deployments, with DeepSeek's Sparse Attention and NousResearch's Lighthouse Attention cited as examples[4]. The inference community has opened a feature request to integrate BLASST into FlashInfer[5]; separately, FlashInfer's issue tracker shows active work on sparse-MLA paged attention for the new SM120 GPU architecture[6]. A practitioner confirmed after the conference that DeepSeek's MoE routing and sparse attention perform well in production[7]. A dissenting voice has pushed back, however: @superaiwatcher characterized sparse attention as a stopgap, arguing that hardware-native linear attention will render manual sparsity implementations obsolete by 2027[8]. That claim runs against the conference's productionization consensus but reflects a real architectural uncertainty.
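
To make the idea of a softmax threshold gate concrete, here is a toy single-query sketch in plain Python. It is not NVIDIA's kernel and is not taken from the BLASST paper: the function name blocked_sparse_attention, the block size, the threshold value, and the specific skip rule (drop a block when its best score trails the running maximum by more than log(threshold)) are illustrative assumptions layered on a standard online-softmax loop.

```python
import math

def blocked_sparse_attention(q, keys, values, block_size=2, threshold=1e-3):
    """Toy single-query attention over key/value blocks with a threshold gate.

    Assumed gate (hypothetical, for illustration only): skip a block when
    exp(block_max - running_max) < threshold, i.e. every softmax weight in
    the block would be negligible next to the best score seen so far.
    threshold must be > 0. Returns (output_vector, number_of_skipped_blocks).
    """
    scale = 1.0 / math.sqrt(len(q))
    log_thr = math.log(threshold)
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    skipped = 0
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        # Scores are still computed; the gate saves the exponentials
        # and the weighted accumulation over the value block.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        blk_max = max(scores)
        if blk_max - running_max < log_thr:
            skipped += 1
            continue
        # Standard online-softmax update: rescale old state to the new max.
        new_max = max(running_max, blk_max)
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc], skipped

if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[10.0, 0.0], [9.0, 0.0], [-10.0, 0.0], [-9.0, 0.0]]
    values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    out, skipped = blocked_sparse_attention(q, keys, values)
    print(out, "blocks skipped:", skipped)
```

With a threshold close to zero the gate never fires and the loop reduces to ordinary online-softmax attention, which gives a simple way to check how much a given threshold perturbs the output.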

Prefill-decode (PD) disaggregation, which routes the compute-bound prefill phase and the memory-bandwidth-bound decode phase to dedicated hardware pools, is now widely accepted as an industry best practice[9]. AWS Neuron's publication of a BETA disaggregated inference developer guide[10] signals that cloud-provider toolchains are formalizing the architecture, a step beyond the research-and-startup adoption visible at the conference itself. Practitioners note that long-context inference and PD disaggregation turn KV cache into cross-node traffic[11], which surfaces the network as a new bottleneck. Attention-FFN disaggregation has emerged as the next architectural axis. StepFun AI's StepMesh communication library for attention-FFN disaggregated MoE inference is openly published on GitHub[12], with the accompanying Step-3 paper (arXiv 2507.19427) documenting model-system co-design for cost-effective decoding[13]. The vLLM project posted an RFC proposing ATTN-FFN disaggregation for MoE models[14][15], and a dedicated challenges paper appeared at arXiv 2602.09721[16]. MegaScale-Infer (arXiv 2504.02263), a ByteDance system, is confirmed to use disaggregated expert parallelism for serving MoE at scale, with claimed serving cost reductions[17][18][19]. Financial commentators picked up PD disaggregation as evidence for extended GPU hardware ROI, arguing that older GPUs can be repurposed into disaggregated decode pools[20][21]; the financial commentary has continued circulating post-conference[22].
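
The scale of the cross-node KV cache traffic mentioned above can be estimated with simple arithmetic. The sketch below assumes a standard multi-head or grouped-query attention layout, where each token stores one key and one value vector per KV head per layer. The model dimensions and the 50 GB/s link are hypothetical placeholders, not figures from any cited source, and compressed layouts such as MLA would need a different formula.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes of KV cache for one sequence (factor 2 covers keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

def transfer_seconds(num_bytes, link_gbytes_per_s):
    """Idealized transfer time over a link, ignoring latency and protocol overhead."""
    return num_bytes / (link_gbytes_per_s * 1e9)

if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)
    print(f"KV cache per 32k-token request: {size / 2**30:.1f} GiB")
    print(f"Prefill-to-decode handoff at 50 GB/s: {transfer_seconds(size, 50) * 1e3:.0f} ms")
```

Under these assumed numbers a single 32k-token request hands about 4 GiB from the prefill pool to the decode pool, so the handoff cost scales linearly with context length and quickly competes with decode latency on slower interconnects.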

Two independent research efforts targeted the same structural inefficiency in reinforcement learning post-training: the long-tail distribution of rollout lengths that creates GPU bubbles and degrades training throughput. Distribution-Aware Speculative Decoding (DASD), from Together.ai and WukLab (arXiv 2511.13841), adapts draft model behavior to the actual rollout length distribution and was presented at MLSys 2026, with Together.ai claiming up to 50% acceleration of RL rollouts[23][24][25]. Separately, MIT HAN Lab's Adaptive Drafter paper (arXiv 2511.16665), appearing at ASPLOS'26 with an open-source repository at mit-han-lab/fastrl[26], has an official conference talk video now publicly available[27], with MIT News framing it as substantially increasing LLM training efficiency[28] and secondary sources characterizing the speedup as 2x[29]. The simultaneous emergence of independent approaches to the same long-tail RL inefficiency, from different institutions and different conference venues, signals that this is a high-value, recognized systems problem, though direct comparisons between the approaches have not yet appeared.
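
The GPU bubble problem can be quantified with a back-of-the-envelope model. The sketch below is neither DASD nor Adaptive Drafter; it only illustrates the inefficiency both target, assuming a synchronous rollout step in which every sequence slot waits for the longest rollout. The second helper uses the standard expected-tokens-per-verification formula for speculative decoding under an assumed independent per-token acceptance rate. All function names and sample numbers are illustrative.

```python
def bubble_fraction(rollout_lengths):
    """Fraction of slot-steps left idle when all slots wait for the longest rollout."""
    longest = max(rollout_lengths)
    useful = sum(rollout_lengths)
    total = longest * len(rollout_lengths)
    return 1.0 - useful / total

def expected_tokens_per_step(accept_rate, draft_len):
    """Expected tokens emitted per target-model verification pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate; the result is (1 - a^(k+1)) / (1 - a) for draft length k.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)

if __name__ == "__main__":
    # Hypothetical batch: seven short rollouts and one long-tail straggler.
    lengths = [100] * 7 + [800]
    print(f"idle fraction: {bubble_fraction(lengths):.3f}")
    print(f"tokens per verify step: {expected_tokens_per_step(0.8, 4):.2f}")
```

In this hypothetical batch roughly three quarters of the slot-steps are idle while the single straggler finishes, which is why speeding up only the tail sequences can shorten the whole step.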

For mixture-of-experts (MoE) models — now dominant at the frontier — expert load balancing at serving time remains a significant underexplored challenge in open source. SemiAnalysis noted that this problem surfaces primarily in closed production systems[30]. MegaScale-Infer's disaggregated expert parallelism approach[17] is notable in this context, though whether it specifically addresses expert balancing or achieves cost reductions through other mechanisms has not been publicly detailed. A cross-layer load balancing paper for distributed MoE inference appeared at ICML 2026[31]. Industry presence at MLSys 2026 was broad: vLLM, LMSYS, Inferact, and Delta Institute were among the organizations attending[32][33][34][35], with the vLLM community holding an in-person meetup at the conference[36].
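
One way to see why serving-time expert balance matters is to measure how unevenly a router spreads tokens. The sketch below is a generic diagnostic, not the method of MegaScale-Infer or of the ICML paper. It assumes top-1 routing and one expert per device, so the busiest expert sets the step time. The function name and the sample routing trace are hypothetical.

```python
from collections import Counter

def expert_imbalance(assignments, num_experts):
    """Return per-expert token counts and the max/mean load ratio.

    assignments: one expert id per token (top-1 routing assumed).
    A ratio of 1.0 means perfectly even load; a ratio of r means the
    busiest expert handles r times the average share.
    """
    if not assignments:
        raise ValueError("assignments must be non-empty")
    counts = Counter(assignments)
    loads = [counts.get(e, 0) for e in range(num_experts)]
    mean_load = len(assignments) / num_experts
    return loads, max(loads) / mean_load

if __name__ == "__main__":
    # Hypothetical routing trace for 8 tokens across 4 experts.
    loads, ratio = expert_imbalance([0, 0, 0, 0, 0, 1, 2, 3], num_experts=4)
    print("tokens per expert:", loads)
    print("max/mean load ratio:", ratio)
```

Under the one-expert-per-device assumption, a ratio well above 1.0 means most devices sit idle while the hottest expert finishes, which is the kind of effect that tends to appear only under production traffic.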

Timeline

2025-04-01: MegaScale-Infer paper (arXiv 2504.02263) submitted, describing disaggregated expert parallelism for MoE serving at scale [17][18]
2025-07-01: StepFun AI submits Step-3 paper (arXiv 2507.19427) describing model-system co-design for cost-effective decoding, providing production context for StepMesh [13][42][43]
2025-11-01: Distribution-Aware Speculative Decoding paper submitted to arXiv (2511.13841), targeting long-tail RL rollout GPU bubbles [24][57][25]
2025-11-01: MIT HAN Lab submits Adaptive Drafter paper (arXiv 2511.16665) on efficient reasoning RL training; GitHub project fastrl confirms ASPLOS'26 venue [47][48][26]
2025-12-01: BLASST paper submitted to arXiv (2512.12087), introducing dynamic blocked sparse attention via softmax thresholding [3][58][39][40]
2026-01-01: Analytical Provisioning for Attention-FFN Disaggregated LLM Serving paper submitted to arXiv (2601.21351) [59]
2026-02-01: Challenges of Attention-FFN Disaggregation paper submitted to arXiv (2602.09721) [16]
2026-02-26: MIT News publishes article on Adaptive Drafter method for increasing LLM training efficiency [28]
2026-04-25: LMSYS publishes blog post on DeepSeek-V4 fast inference and verified RL with SGLang and Miles [60]
2026-05-17: SemiAnalysis publishes pre-conference MLSys 2026 research preview covering sparse attention, disaggregation, MoE balancing, and RL training efficiency [1][4][9][30][37]
2026-05-17: NVIDIA researcher Huizi Mao confirms BLASST wins MLSys 2026 Best Paper for dynamic blocked sparse attention [38]
2026-05-18: MLSys 2026 opens in Bellevue; vLLM, LMSYS, Inferact, Delta Institute among attendees [2][32][61][35]
2026-05-18: MLSys 2026 Best Paper congratulations circulate; scaling RL across heterogeneous accelerators receives Oral recognition [62][63][64]
2026-05-20: Theta EdgeCloud PD disaggregation test results published; financial commentary picks up disaggregation as argument for extended GPU useful life [65][66][20][21]
2026-05-21: LMCache Lab releases async PDBackend; StepFun AI publishes StepMesh on GitHub; vLLM RFC for ATTN-FFN disaggregation posted; vLLM community gathers IRL at conference [49][12][14][15][36]
2026-05-21: Inferact co-hosts MLSys luncheon with a16z [34]
2026-05-22: Conference final day: GDN kernel work on B200 GPUs presented; SnorkelAI RLVR talk concludes; FlashInfer BLASST integration request filed [67][68][5][69][70]
2026-05-23: GDN kernel B200 GPU presentation RTs spread widely post-conference; @superaiwatcher posts that sparse attention is a stopgap before hardware-native linear attention; KV cache cross-node traffic framing circulates [69][71][72][73][74][75][76][8][11]
2026-05-24: JasonLiu confirms DeepSeek MoE routing and sparse attention performing well in production; financial disaggregation commentary continues circulating [7][22]

Perspectives

SemiAnalysis

Curatorial and analytical: frames MLSys 2026 as the venue for the most important production AI systems problems, highlighting sparse attention productionization, PD disaggregation extensions, MoE expert balancing gaps, and RL training inefficiency as the four key threads

Evolution: Consistent — SemiAnalysis functions as a research-curating voice for practitioners; no stance shift observable

[1][4][9][30][37]

NVIDIA / Huizi Mao (BLASST team)

BLASST's dynamic blocked sparse attention has been validated by the research community as Best Paper, and the community is now pushing for integration into inference kernels like FlashInfer beyond NVIDIA's own stack

Evolution: Elevated by Best Paper recognition; FlashInfer integration request and ongoing sparse kernel work for new SM120 hardware represent broadening external adoption momentum

[38][3][5][6][39][40]

@superaiwatcher (dissenting practitioner)

Sparse attention is a stopgap; by 2027, hardware-native linear attention will render manual sparsity implementations obsolete, making the current productionization wave a transitional rather than permanent architecture shift

Evolution: New voice in this thread — represents the first explicit counter-narrative to the conference's sparse attention productionization consensus

[8]

JasonLiu (@jsyqrt) (production practitioner)

DeepSeek's MoE routing and sparse attention architecture works well in production, confirming the practical viability of the techniques highlighted at the conference

Evolution: New voice in this thread — adds post-conference practitioner validation to the research consensus

[7]

@haoailab (Hao Zhang / DistServe group)

Having originated the now-standard PD disaggregation approach, the group is extending the paradigm to attention-FFN disaggregation in work presented at MLSys 2026

Evolution: Progressing from originating PD disaggregation to pushing the next architectural frontier — workload specialization by transformer component, not just by compute phase

[9]

StepFun AI / StepMesh

Attention-FFN disaggregation for MoE inference requires dedicated communication infrastructure; StepMesh instantiates this as a production-oriented open-source library backed by model-system co-design documented in the Step-3 paper

Evolution: Consistent; StepMesh is openly hosted on GitHub and the Step-3 paper and research page provide production evidence for the co-design approach

[12][13][41][42][43]

vLLM project

Attention-FFN disaggregation for MoE models is being actively pursued via RFC, but disaggregated prefilling remains labeled experimental in official documentation; the community gathered IRL at MLSys 2026 to continue these conversations

Evolution: Consistent — the RFC shows forward momentum, and the in-person community meetup reinforces engagement, but the gap between conference-floor confidence and official documentation maturity persists

[14][15][44][36]

ByteDance / MegaScale-Infer

Disaggregated expert parallelism — separating attention and FFN/expert compute across different hardware — enables cost-effective MoE serving at scale, as documented in arXiv 2504.02263

Evolution: Now identified with a specific arXiv paper (2504.02263) and concrete technical framing around 'disaggregated expert parallelism'; previously cited only as a named system with claimed cost reductions

[17][45][18][19]

AWS Neuron

Disaggregated inference is mature enough to warrant official BETA documentation in the AWS Neuron developer guide

Evolution: New institutional voice in this thread — cloud-provider toolchain formalization of disaggregated inference is a step beyond the startup and research adoption visible at the conference

[10]

Together.ai / WukLab (DASD)

Distribution-Aware Speculative Decoding delivers up to 50% acceleration of RL rollouts by adapting draft model behavior to the actual rollout length distribution rather than assuming fixed lengths

Evolution: Consistent; now one of two confirmed independent groups attacking the same long-tail RL rollout problem; Threads post shows ongoing community engagement with the paper

[23][24][46][25]

MIT HAN Lab (Adaptive Drafter / fastrl)

Adaptive Drafter addresses long-tail RL training inefficiency at ASPLOS'26 with an adaptive speculative decoding approach; MIT News frames it as substantially increasing LLM training efficiency, secondary sources characterize the gain as 2x, and the conference talk video is now publicly available

Evolution: Further consolidated: the ASPLOS'26 talk video is now publicly available on YouTube, adding a citable artifact beyond the paper and GitHub repository

[47][48][26][28][29][27]

LMCache Lab

Actively productionizing PD disaggregation with a new async backend, framing the architecture as ready for deployment at scale

Evolution: Consistent with their KV cache and disaggregation-focused engineering track

[49][50]

Financial / investor commentators (TheValueist, podcast_alpha_x, tropicalvalue)

PD disaggregation is evidence that GPU useful lives are substantially longer than AI skeptics claim — older GPUs can be repurposed in disaggregated decode roles, extending hardware ROI

Evolution: Consistent; this framing has continued circulating post-conference via additional RTs

[51][20][21][52][22]

Sakura Yuki (inference practitioner)

PD disaggregation is finally stable for production; running co-located prefill and decode at scale is now wasteful rather than a reasonable default

Evolution: Consistent with the broad practitioner consensus emerging at the conference

[53][54]

Tensions

Sparse attention as milestone vs. stopgap: The conference and practitioner community framed sparse attention productionization as a durable architectural shift (BLASST Best Paper[3], DeepSeek production confirmation[7]), while @superaiwatcher argues it is a transitional stopgap that hardware-native linear attention will displace by 2027[8]. The bet hinges on whether hardware roadmaps will deliver native linear attention fast enough to make the current wave of sparse attention engineering wasted effort. [3][7][8][4]
Open-source community vs. production practitioners on MoE expert balancing: SemiAnalysis observes that expert balancing in MoE serving is significantly underexplored in open source because the challenge only surfaces at production scale[30]. MegaScale-Infer claims MoE serving cost reductions via disaggregated expert parallelism[17][19], but it remains unresolved whether this specifically addresses the expert balancing problem, or whether any open-source production-grade solution exists. [30][19][17][55]
Disaggregation maturity framing: Practitioner commentary frames PD disaggregation as 'finally stable' for production[53] and AWS Neuron now carries an official BETA developer guide[10], yet vLLM's own documentation still labels disaggregated prefilling as 'experimental'[44]. The gap between cloud-provider toolchain formalization and framework-level official support persists. [53][54][10][44]
Competing approaches to attention-FFN disaggregation standardization: StepFun AI shipped StepMesh as a standalone open-source communication library[12][41], while vLLM posted an RFC for ATTN-FFN disaggregation built into the vLLM framework[14]. The two efforts address the same problem from different integration points; it is unclear whether they will converge or which will define the interoperability standard. [12][41][14][15][56]
Competing speculative decoding approaches for RL rollout efficiency: MIT HAN Lab's Adaptive Drafter (ASPLOS'26, reportedly 2x speedup, talk video available[27]) and Together.ai/WukLab's DASD (MLSys 2026, claimed 50% acceleration[24][23]) independently target the long-tail RL rollout GPU bubble problem with different strategies. Direct comparisons between the approaches have not appeared, and whether they are composable or serve different regimes is unresolved. [26][29][24][23][28][27][25]

Sources

[1] MLSys 2026 is next week! — SemiAnalysis Twitter (2026-05-17)
[2] MLSys 2026 is starting today. And I'm excited to be here. This conference focuses on the most important research problem... — reactive:mlsys-2026-inference-systems (2026-05-18)
[3] [2512.12087] BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding — reactive:mlsys-2026-inference-systems
[4] Sparse attention mechanisms are finally moving beyond academic benchmarks into production systems, including DeepSeek Sp… — SemiAnalysis Twitter (2026-05-17)
[5] [Feature Request] BLASST: Dynamic BLocked Attention Sparsity via ... — reactive:mlsys-2026-inference-systems
[6] Add sparse-MLA paged attention for SM120 (RTX PRO ... - GitHub — reactive:mlsys-2026-inference-systems
[7] @rohanpaul_ai Having built with DeepSeek models in production — the architecture point is real. Their MoE routing and sp... — reactive:mlsys-2026-inference-systems (2026-05-24)
[8] @rasbt Sparse attention is a stopgap. By 2027, hardware-native linear attention will render these manual sparsity implem... — reactive:mlsys-2026-inference-systems (2026-05-23)
[9] @NousResearch Assigning dedicated resources to different types of workloads is an increasingly popular system optimizati… — SemiAnalysis Twitter (2026-05-17)
[10] Disaggregated Inference [BETA] — AWS Neuron Documentation — reactive:mlsys-2026-inference-systems
[11] Long-context inference and Prefill-Decode disaggregation turn KV Cache into cross-node traffic. — reactive:mlsys-2026-inference-systems (2026-05-23)
[12] stepfun-ai/StepMesh — reactive:mlsys-2026-inference-systems
[13] Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — reactive:mlsys-2026-inference-systems
[14] [RFC]: ATTN-FFN Disaggregation for MoE Models #22799 - GitHub — reactive:mlsys-2026-inference-systems
[15] This amazing Attention-FFN disaggregation implementation from ... — reactive:mlsys-2026-inference-systems
[16] Revealing the Challenges of Attention-FFN Disaggregation ... — reactive:mlsys-2026-inference-systems
[17] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism — reactive:mlsys-2026-inference-systems
[18] MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with ... — reactive:mlsys-2026-inference-systems
[19] Unlocking MoE Efficiency: How MegaScale-Infer Slashes LLM ... — reactive:mlsys-2026-inference-systems
[20] I took one key insight from this convo: inference disaggregation between prefill and decode enable GPU lifespan to be ex... — reactive:mlsys-2026-inference-systems (2026-05-20)
[21] @GavinSBaker : AI skeptics have been wrong to claim GPU useful lives are only 1-2 yrs. The disaggregation of prefill (me... — reactive:mlsys-2026-inference-systems (2026-05-21)
[22] RT @tropicalvalue: I took one key insight from this convo: inference disaggregation between prefill and decode enable GP... — reactive:mlsys-2026-inference-systems (2026-05-23)
[23] Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding — reactive:mlsys-2026-inference-systems
[24] Beat the long tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[25] Paper: Beat the long tail: Distribution-Aware Speculative Decoding ... — reactive:mlsys-2026-inference-systems
[26] GitHub - mit-han-lab/fastrl: [ASPLOS'26] Taming the Long-Tail — reactive:mlsys-2026-inference-systems
[27] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter, [ASPLOS 2026] — reactive:mlsys-2026-inference-systems
[28] New method could increase LLM training efficiency | MIT News | Massachusetts Institute of Technology — reactive:mlsys-2026-inference-systems
[29] Aiyu | Taming the Long Tail: Adaptive Speculative Decoding Doubles LLM Training Speed — reactive:mlsys-2026-inference-systems
[30] @NousResearch @StepFun_ai @haoailab Large scale production system challenges, such as expert balancing in serving MoE mo… — SemiAnalysis Twitter (2026-05-17)
[31] Cross-Layer Load Balancing in Distributed MoE Inference - ICML 2026 — reactive:mlsys-2026-inference-systems
[32] vLLM crew is out in full force at MLSys 2026 🔥 — reactive:mlsys-2026-inference-systems (2026-05-18)
[33] MLSys 2026 Happy Hour - by LMSYS & Ai2 https://t.co/etdvTEJltB — reactive:mlsys-2026-inference-systems (2026-05-18)
[34] RT @inferact: Great cohosting this luncheon with @a16z and Mirendil at MLSys 2026 yesterday! 🙌 — reactive:mlsys-2026-inference-systems (2026-05-21)
[35] Headed to MLSys 2026? — reactive:mlsys-2026-inference-systems (2026-05-17)
[36] Always great to see the #vLLM community connecting IRL at MLSys 2026! Thanks to the teams keeping these conversations go... — reactive:mlsys-2026-inference-systems (2026-05-21)
[37] The long-tail distribution of rollout lengths causes one of the most critical inefficiencies in RL training. — SemiAnalysis Twitter (2026-05-17)
[38] Glad to be featured by SemiAnalysis. Our work BLASST was also selected as MLSys 2026 Best Paper: https://t.co/OlkQ7x75BN... — reactive:mlsys-2026-inference-systems (2026-05-17)
[39] Dynamic BLocked Attention Sparsity via Softmax Thresholding — reactive:mlsys-2026-inference-systems
[40] BLASST: Dynamic BLocked Attention Sparsity via Softmax ... — reactive:mlsys-2026-inference-systems
[41] StepMesh: A Communication Library for Attention-FFN Disaggregation | StepFun — reactive:mlsys-2026-inference-systems
[42] Step3: Cost-Effective Multimodal Intelligence - StepFun — reactive:mlsys-2026-inference-systems
[43] Paper page - Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — reactive:mlsys-2026-inference-systems
[44] Disaggregated Prefilling (experimental) - vLLM — reactive:mlsys-2026-inference-systems
[45] [Literature Review] MegaScale-Infer: Serving Mixture-of-Experts at ... — reactive:mlsys-2026-inference-systems
[46] Beat the Long Tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[47] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter — reactive:mlsys-2026-inference-systems
[48] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter — reactive:mlsys-2026-inference-systems
[49] PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-21)
[50] RT @lmcache: PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-22)
[51] GPU USEFUL LIFE, INFERENCE DISAGGREGATION, AND PRIVATE CREDIT — reactive:mlsys-2026-inference-systems (2026-05-20)
[52] For anyone listening, the relevant section starts around 44m: prefill vs. decode disaggregation. — reactive:mlsys-2026-inference-systems (2026-05-21)
[53] @lmsysorg @CloudflareDev The real story here isn't the bug fix, it's that prefill-decode disaggregation is finally stabl... — reactive:mlsys-2026-inference-systems (2026-05-21)
[54] @lmsysorg @AMD @dstackai If you're still running prefill and decode on the same GPUs at scale, you're basically burning ... — reactive:mlsys-2026-inference-systems (2026-05-21)
[55] The key observation: load-balancing losses used during MoE training encourage expert diversity. — reactive:mlsys-2026-inference-systems (2026-05-19)
[56] AFD: Decoupling Attention and FFN — reactive:mlsys-2026-inference-systems
[57] Beat the long tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[58] [PDF] Dynamic BLocked Attention Sparsity via Softmax Thresholding - arXiv — reactive:mlsys-2026-inference-systems
[59] [2601.21351] Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads — reactive:mlsys-2026-inference-systems
[60] DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles — reactive:mlsys-2026-inference-systems (2026-04-25)
[61] We’re at MLSys 2026 in Bellevue this week! ⛴️ — reactive:mlsys-2026-inference-systems (2026-05-18)
[62] Congratulations on the MLSys 2026 best paper! Looking forward to your presentation today! https://t.co/UrVcKgUJj0 — reactive:mlsys-2026-inference-systems (2026-05-19)
[63] Congrats on winning an MLSys 2026 best research paper! https://t.co/grBdNYK4Fg — reactive:mlsys-2026-inference-systems (2026-05-19)
[64] @Yong_jun_He Congratulations on the MLSys 2026 Oral! Scaling RL training across heterogeneous accelerators is a genuinel... — reactive:mlsys-2026-inference-systems (2026-05-18)
[65] Read “Theta EdgeCloud Tests Prefill/Decode Disaggregation for Large-Scale LLM Serving“ by Theta Labs on Medium: https://... — reactive:mlsys-2026-inference-systems (2026-05-20)
[66] Theta EdgeCloud Tests Prefill/Decode Disaggregation for Large-Scale LLM Serving #machinelearning #ml #artificialintellig... — reactive:mlsys-2026-inference-systems (2026-05-20)
[67] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-22)
[68] RT @SnorkelAI: Live from MLSys 2026! Thanks to everyone who joined @pham_derek's talk yesterday on RLVR in low-data, low... — reactive:mlsys-2026-inference-systems (2026-05-22)
[69] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[70] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-22)
[71] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[72] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[73] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[74] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[75] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[76] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)