MLSys 2026: Inference Systems Research Preview

Version 5

2026-05-24 18:56 UTC · 184 items

Changes since v4

The Native Sparse Attention paper (arXiv 2502.11089, ACL 2025) enters the thread as a substantive counterweight to the sparse-attention-as-stopgap argument: NSA designs hardware-aligned sparse attention from training time rather than imposing it at runtime. That introduces a third architectural path between BLASST-style dynamic sparsity and full linear attention replacement, which complicates the binary framing of the existing tension. MegaScale-Infer is newly confirmed as a SIGCOMM 2025 paper (item 17758), adding a peer-reviewed systems venue to its existing arXiv presence. A Medium post on RDMA KV cache transfer failures in Kubernetes (item 17750) and a new FlashInfer GitHub issue on dynamic block sparse forward (item 18567) add concrete engineering artifacts to the disaggregation and sparse kernel threads. The remaining new items are background reference material on KV cache bottlenecks and additional pointers to existing papers.

What

MLSys 2026 (May 18–22, Bellevue, WA) consolidated four production inference research threads crossing from papers into engineering artifacts. NVIDIA's BLASST (arXiv 2512.12087) won Best Paper for dynamic blocked sparse attention[3], and a FlashInfer GitHub issue tracking dynamic block sparse forward implementation is now open[7]. The sparse-attention debate has grown more nuanced with the Native Sparse Attention (NSA) paper (arXiv 2502.11089, ACL 2025)[10][11], which designs hardware-aligned sparse attention from training time rather than imposing it at runtime — a middle path between BLASST-style dynamic sparsity and @superaiwatcher's predicted linear attention replacement[9]. Attention-FFN disaggregation advances on multiple fronts (StepMesh[20], vLLM RFC[22], AWS Neuron BETA[17]), and MegaScale-Infer (arXiv 2504.02263, also SIGCOMM 2025[25]) is confirmed as ByteDance's production MoE disaggregated expert parallelism system[27].

Why it matters

The NSA paper complicates the binary between runtime sparse attention and full linear attention replacement by showing sparse attention can be redesigned from training time to close the hardware-alignment gap — whether this neutralizes or merely delays the architectural pressure toward linear attention is the new unresolved question. Meanwhile, a documented RDMA KV cache transfer failure in Kubernetes[19] adds a concrete networking obstacle to the disaggregated inference maturity picture that was previously framed only in theoretical terms.

Open questions

Does Native Sparse Attention's hardware-aligned, training-time approach[10][11] substantively answer @superaiwatcher's 'stopgap' critique[9], or does NSA still fall short of the efficiency ceiling that hardware-native linear attention could reach?
MegaScale-Infer is available both as a SIGCOMM 2025 paper[25] and on arXiv[27]. Does the full paper detail whether disaggregated expert parallelism specifically addresses the expert load balancing problem SemiAnalysis identified as underexplored in open source[29]?
Will the open FlashInfer dynamic block sparse forward issue[7] produce a BLASST-compatible kernel for SM120 hardware, and on what timeline?
Do RDMA KV cache transfer failures in Kubernetes[19] represent a blocking systems gap for cloud-scale PD disaggregation, or an already-solved engineering problem that practitioners have resolved outside of public documentation?

Narrative

MLSys 2026, held May 18–22 in Bellevue, Washington, brought together researchers and engineers from across the AI industry to address production inference and training systems challenges[1][2]. The conference's most prominent recognition went to NVIDIA's BLASST paper (arXiv 2512.12087), which introduces Dynamic Blocked Attention Sparsity via Softmax Thresholding[3]: a runtime technique that dynamically identifies and skips low-salience attention blocks using a softmax threshold gate, targeting the core inefficiency of dense attention at long context lengths. BLASST's Best Paper award reflects a broader productionization wave: sparse attention has migrated from academic benchmarks into live deployments, with DeepSeek's Sparse Attention and NousResearch's Lighthouse Attention cited as examples in a pre-conference SemiAnalysis thread[4], and a practitioner confirming post-conference that DeepSeek's MoE routing and sparse attention perform well in production[5]. A community request to integrate BLASST into FlashInfer was filed during the conference[6], and a GitHub issue tracking dynamic block sparse forward implementation in FlashInfer is now open[7]. Separate FlashInfer work is underway on sparse-MLA paged attention for the new SM120 GPU architecture[8].
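
To make the "softmax threshold gate" idea concrete, here is a minimal NumPy sketch of blocked attention for a single query. It is an illustration under stated assumptions, not the BLASST kernel: the gating rule (skip a key block when its largest score falls more than log(threshold) below the running maximum of an online softmax) is an assumed reading of the description above, and the function names, `block_size`, and `threshold` are hypothetical.

```python
import numpy as np

def dense_attention(q, k, v):
    """Reference: ordinary softmax attention for one query vector."""
    scores = (k @ q) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    return (weights / weights.sum()) @ v

def blocked_attention_with_threshold(q, k, v, block_size=4, threshold=0.0):
    """Online-softmax attention over key blocks with an assumed skip gate.

    A block is skipped when exp(block_max - running_max) < threshold, i.e.
    when even its largest score would add less than `threshold` relative to
    the largest score seen so far. threshold=0.0 disables skipping.
    Note: this sketch still computes the block's scores; only the
    exponentials and the product with the value block are skipped.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    running_max = -np.inf          # largest score seen so far
    denom = 0.0                    # running softmax denominator
    acc = np.zeros(v.shape[-1])    # running weighted sum of values
    skipped = 0
    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (k_blk @ q) * scale
        block_max = scores.max()
        if threshold > 0.0 and block_max - running_max < np.log(threshold):
            skipped += 1
            continue
        new_max = max(running_max, block_max)
        correction = np.exp(running_max - new_max)  # rescale earlier partial sums
        weights = np.exp(scores - new_max)
        denom = denom * correction + weights.sum()
        acc = acc * correction + weights @ v_blk
        running_max = new_max
    return acc / denom, skipped
```

With `threshold=0.0` the function reproduces dense attention; raising the threshold trades a small approximation error for skipped blocks. Because the gate compares against the running maximum, the set of skipped blocks depends on the order in which blocks are visited.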

The debate over whether sparse attention is a durable or transitional architecture has grown more textured than a simple binary. A dissenting voice, @superaiwatcher, argued post-conference that sparse attention is a transitional stopgap before hardware-native linear attention displaces manual sparsity implementations by 2027[9]. The Native Sparse Attention (NSA) paper (arXiv 2502.11089), published at ACL 2025, offers a direct counter-position: rather than imposing sparsity patterns at runtime on top of dense attention kernels, NSA designs sparse attention to be hardware-aligned and natively trainable from the ground up[10][11][12]. This represents a third architectural path, between BLASST-style dynamic runtime sparsity and full linear attention replacement, and suggests that the sparse attention research community is working to close the hardware-alignment gap that motivates the linear attention argument without abandoning the sparse paradigm[13][14]. A survey of hardware-efficient attention mechanisms covering sparse, compact, and linear variants contextualizes the full design space[15]. The practitioner position, represented by JasonLiu, is that current sparse MoE routing and attention work well in production today[5], which makes the 2027 displacement timeline speculative against currently observable evidence.

Prefill-decode (PD) disaggregation, which routes the compute-bound prefill phase and the memory-bandwidth-bound decode phase to dedicated hardware pools, is now widely accepted as an industry best practice[16]. AWS Neuron's BETA disaggregated inference developer guide[17] signals cloud-provider toolchain formalization, a step beyond the research and startup adoption visible at the conference itself. Practitioners note that long-context inference and PD disaggregation turn KV cache into cross-node traffic[18]; a technical account of RDMA KV cache transfer failures in Kubernetes documents a concrete networking obstacle for disaggregated inference at production scale[19]. Attention-FFN disaggregation has emerged as the next architectural axis: StepFun AI's StepMesh communication library for attention-FFN disaggregated MoE inference is openly published on GitHub[20], backed by the Step-3 paper (arXiv 2507.19427)[21]; the vLLM project posted an RFC proposing ATTN-FFN disaggregation for MoE models[22][23]; and a literature review of the paper on the challenges of attention-FFN disaggregation (arXiv 2602.09721) adds analytical documentation of the problem[24]. MegaScale-Infer (arXiv 2504.02263), a ByteDance system confirmed to have appeared at SIGCOMM 2025[25], uses disaggregated expert parallelism for serving MoE at scale with claimed cost reductions[26][27][28]. Whether its approach specifically addresses expert load balancing, which SemiAnalysis flagged as underexplored in open source[29], has not been publicly detailed.
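
A back-of-the-envelope calculation shows why PD disaggregation turns the KV cache into meaningful cross-node traffic. The sketch below assumes a conventional per-layer K and V cache (multi-head or grouped-query attention); the model shape and link speed are hypothetical illustration values, not figures from any system cited here, and protocol overhead is ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of one request's KV cache. The factor 2 counts one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

def transfer_seconds(num_bytes, link_gbit_per_s):
    """Ideal time to move num_bytes over a link, ignoring latency and protocol overhead."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)

# Hypothetical model: 32 layers, 8 KV heads of dimension 128, fp16 cache, 32K-token prompt.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)
print(f"KV cache per request: {size / 1024**3:.1f} GiB")
print(f"Ideal transfer time at 100 Gbit/s: {transfer_seconds(size, 100):.3f} s")
```

Under these assumptions a single long prompt produces gigabytes of state that must cross from the prefill pool to the decode pool before the first decode step, which is why transfer failures of the kind described in [19] matter for end-to-end latency.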

Two independent research efforts targeted the same structural inefficiency in reinforcement learning post-training: the long-tail distribution of rollout lengths that creates GPU bubbles and degrades training throughput. Distribution-Aware Speculative Decoding (DASD), from Together.ai and WukLab (arXiv 2511.13841), was presented at MLSys 2026 and claims up to 50% acceleration of RL rollouts[30][31][32]. MIT HAN Lab's Adaptive Drafter (arXiv 2511.16665, ASPLOS'26, open-source at mit-han-lab/fastrl[33]) has a publicly available conference talk video[34], with MIT News framing it as substantially increasing LLM training efficiency[35] and secondary sources characterizing the gain as 2x[36]. That independent groups at different institutions and conference venues converged on the same long-tail RL inefficiency suggests the problem is widely recognized as high-value; direct comparisons between the approaches have not yet appeared.
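
The "GPU bubble" created by long-tail rollout lengths can be quantified with a deliberately simplified model. The sketch assumes a synchronous batch in which every sequence occupies one decode slot per step until the longest rollout finishes, with no refilling of finished slots; the function names and the example lengths are hypothetical and are not taken from either paper.

```python
def batch_utilization(rollout_lengths):
    """Fraction of decode slots doing useful work in a synchronous batch.

    Assumes the batch runs for max(rollout_lengths) steps and that a finished
    sequence leaves its slot idle until the whole batch completes.
    """
    longest = max(rollout_lengths)
    return sum(rollout_lengths) / (len(rollout_lengths) * longest)

def bubble_fraction(rollout_lengths):
    """Fraction of decode slots left idle under the same assumptions."""
    return 1.0 - batch_utilization(rollout_lengths)

# Hypothetical batch: seven short rollouts and one long straggler.
lengths = [100] * 7 + [800]
print(f"utilization: {batch_utilization(lengths):.3f}")
print(f"bubble:      {bubble_fraction(lengths):.3f}")
```

In this toy batch a single straggler leaves most slots idle, which is the regime where shortening the tail, for example with speculative decoding on the long rollouts, has the largest effect on throughput.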

Timeline

2025-02-01: Native Sparse Attention (NSA) paper submitted to arXiv (2502.11089), introducing hardware-aligned and natively trainable sparse attention; subsequently published at ACL 2025 [10][11][12][41]
2025-04-01: MegaScale-Infer paper (arXiv 2504.02263, also SIGCOMM 2025) submitted, describing disaggregated expert parallelism for MoE serving at scale [26][48][27][25]
2025-07-01: StepFun AI submits Step-3 paper (arXiv 2507.19427) describing model-system co-design for cost-effective decoding, providing production context for StepMesh [21][43][44]
2025-11-01: Distribution-Aware Speculative Decoding paper submitted to arXiv (2511.13841), targeting long-tail RL rollout GPU bubbles [31][64][32]
2025-11-01: MIT HAN Lab submits Adaptive Drafter paper (arXiv 2511.16665) on efficient reasoning RL training; GitHub project fastrl confirms ASPLOS'26 venue [51][52][33]
2025-12-01: BLASST paper submitted to arXiv (2512.12087), introducing dynamic blocked sparse attention via softmax thresholding [3][65][39][40]
2026-01-01: Analytical Provisioning for Attention-FFN Disaggregated LLM Serving paper submitted to arXiv (2601.21351) [66]
2026-02-01: Challenges of Attention-FFN Disaggregation paper submitted to arXiv (2602.09721) [67]
2026-02-26: MIT News publishes article on Adaptive Drafter method for increasing LLM training efficiency [35]
2026-04-25: LMSYS publishes blog post on DeepSeek-V4 fast inference and verified RL with SGLang and Miles [68]
2026-05-17: SemiAnalysis publishes pre-conference MLSys 2026 research preview covering sparse attention, disaggregation, MoE balancing, and RL training efficiency [1][4][16][29][37]
2026-05-17: NVIDIA researcher Huizi Mao confirms BLASST wins MLSys 2026 Best Paper for dynamic blocked sparse attention [38]
2026-05-18: MLSys 2026 opens in Bellevue; vLLM, LMSYS, Inferact, Delta Institute among attendees [2][69][70][71]
2026-05-18: MLSys 2026 Best Paper congratulations circulate; scaling RL across heterogeneous accelerators receives Oral recognition [72][73][74]
2026-05-20: Theta EdgeCloud PD disaggregation test results published; financial commentary picks up disaggregation as argument for extended GPU useful life [75][76][56][57]
2026-05-21: LMCache Lab releases async PDBackend; StepFun AI publishes StepMesh on GitHub; vLLM RFC for ATTN-FFN disaggregation posted; vLLM community gathers IRL at conference [53][20][22][23][46]
2026-05-21: Inferact co-hosts MLSys luncheon with a16z [77]
2026-05-22: Conference final day: GDN kernel work on B200 GPUs presented; SnorkelAI RLVR talk concludes; FlashInfer BLASST integration request filed; FlashInfer dynamic block sparse forward GitHub issue opened [78][79][6][80][81][7]
2026-05-23: Retweets of the GDN kernel B200 GPU presentation announcement spread widely post-conference; @superaiwatcher posts that sparse attention is a stopgap before hardware-native linear attention; KV cache cross-node traffic framing circulates [80][82][83][84][85][86][87][9][18]
2026-05-24: JasonLiu confirms DeepSeek MoE routing and sparse attention performing well in production; financial disaggregation commentary continues circulating [5][59]

Perspectives

SemiAnalysis

Curatorial and analytical: frames MLSys 2026 as the venue for the most important production AI systems problems, highlighting sparse attention productionization, PD disaggregation extensions, MoE expert balancing gaps, and RL training inefficiency as the four key threads

Evolution: Consistent — SemiAnalysis functions as a research-curating voice for practitioners; no stance shift observable

[1][4][16][29][37]

NVIDIA / Huizi Mao (BLASST team)

BLASST's dynamic blocked sparse attention has been validated by the research community as Best Paper, and the community is pushing for integration into inference kernels like FlashInfer beyond NVIDIA's own stack

Evolution: Elevated by Best Paper recognition; a new FlashInfer GitHub issue tracking dynamic block sparse forward implementation represents broadening external adoption momentum beyond the initial integration request

[38][3][6][8][39][40][7]

NSA paper authors (Yuan et al., ACL 2025)

Hardware-aligned and natively trainable sparse attention (arXiv 2502.11089) demonstrates that sparse attention can be designed from training time to close the hardware-alignment gap, offering a third path between BLASST-style runtime sparsity and full linear attention replacement

Evolution: New voice in this thread — introduces a research counterpoint to the sparse-attention-as-stopgap argument that was absent from the conference's binary debate

[10][11][12][41][13][14]

@superaiwatcher (dissenting practitioner)

Sparse attention is a stopgap; by 2027, hardware-native linear attention will render manual sparsity implementations obsolete, making the current productionization wave a transitional rather than permanent architecture shift

Evolution: Consistent — represents the explicit counter-narrative to the conference's sparse attention productionization consensus; the NSA paper is a direct counter-weight to this position, though @superaiwatcher has not responded to it in the thread

[9]

JasonLiu (@jsyqrt) (production practitioner)

DeepSeek's MoE routing and sparse attention architecture works well in production, confirming the practical viability of the techniques highlighted at the conference

Evolution: Consistent — adds post-conference practitioner validation to the research consensus

[5]

@haoailab (Hao Zhang / DistServe group)

Having originated the now-standard PD disaggregation approach, the group is extending the paradigm to attention-FFN disaggregation, as presented at MLSys 2026

Evolution: Consistent — progressing from originating PD disaggregation to pushing the next architectural frontier: workload specialization by transformer component rather than just by compute phase

[16]

StepFun AI / StepMesh

Attention-FFN disaggregation for MoE inference requires dedicated communication infrastructure; StepMesh instantiates this as a production-oriented open-source library backed by model-system co-design documented in the Step-3 paper

Evolution: Consistent; StepMesh is openly hosted on GitHub and the Step-3 paper provides production evidence for the co-design approach

[20][21][42][43][44]

vLLM project

Attention-FFN disaggregation for MoE models is being actively pursued via RFC, but disaggregated prefilling remains labeled experimental in official documentation; the community gathered IRL at MLSys 2026 to continue these conversations

Evolution: Consistent — the RFC shows forward momentum, but the gap between conference-floor confidence and official documentation maturity persists

[22][23][45][46]

ByteDance / MegaScale-Infer

Disaggregated expert parallelism — separating attention and FFN/expert compute across different hardware — enables cost-effective MoE serving at scale, as documented in arXiv 2504.02263 and presented at SIGCOMM 2025

Evolution: Further solidified: now confirmed at SIGCOMM 2025 in addition to arXiv, adding a peer-reviewed systems venue to the claimed cost reductions

[26][47][48][49][27][25][28]

AWS Neuron

Disaggregated inference is production-ready enough to warrant official BETA documentation in the AWS Neuron developer guide

Evolution: Consistent — cloud-provider toolchain formalization of disaggregated inference, a step beyond the startup and research adoption visible at the conference

[17]

Together.ai / WukLab (DASD)

Distribution-Aware Speculative Decoding delivers up to 50% acceleration of RL rollouts by adapting draft model behavior to the actual rollout length distribution rather than assuming fixed lengths

Evolution: Consistent; one of two confirmed independent groups attacking the same long-tail RL rollout problem from different institutional and conference contexts

[30][31][50][32]

MIT HAN Lab (Adaptive Drafter / fastrl)

Adaptive Drafter (ASPLOS'26) addresses long-tail RL training inefficiency with an adaptive speculative decoding approach; MIT News frames it as substantially increasing LLM training efficiency, secondary sources characterize the gain as 2x, and the conference talk video is publicly available

Evolution: Consistent; the ASPLOS'26 talk video is now publicly available on YouTube, consolidating the citable artifact base beyond the paper and GitHub repository

[51][52][33][35][36][34]

LMCache Lab

Actively productionizing PD disaggregation with a new async backend, framing the architecture as ready for deployment at scale

Evolution: Consistent with their KV cache and disaggregation-focused engineering track

[53][54]

Financial / investor commentators (TheValueist, podcast_alpha_x, tropicalvalue)

PD disaggregation is evidence that GPU useful lives are substantially longer than AI skeptics claim — older GPUs can be repurposed in disaggregated decode roles, extending hardware ROI

Evolution: Consistent; this framing has continued circulating post-conference

[55][56][57][58][59]

Sakura Yuki (inference practitioner)

PD disaggregation is finally stable for production; running co-located prefill and decode at scale is now wasteful rather than a reasonable default

Evolution: Consistent with the broad practitioner consensus emerging at the conference

[60][61]

Tensions

Sparse attention as milestone vs. stopgap: The conference and practitioner community framed sparse attention productionization as a durable architectural shift (BLASST Best Paper[3], DeepSeek production confirmation[5]), while @superaiwatcher argues it is a transitional stopgap before hardware-native linear attention displaces manual sparsity by 2027[9]. The NSA paper (arXiv 2502.11089)[10][11] adds a third position: that hardware-aligned sparse attention can be designed natively from training time, potentially closing the efficiency gap that motivates the linear attention argument without abandoning sparsity at all. [3][5][9][4][10][11][12]
Open-source community vs. production practitioners on MoE expert balancing: SemiAnalysis observes that expert balancing in MoE serving is substantively underexplored in open source because the challenge only surfaces at production scale[29]. MegaScale-Infer claims MoE serving cost reductions via disaggregated expert parallelism[26][25], but whether this specifically addresses the expert balancing problem — or whether any open-source production-grade solution exists — remains unresolved. [29][49][26][25][62]
Disaggregation maturity framing: Practitioner commentary frames PD disaggregation as 'finally stable' for production[60], AWS Neuron carries an official BETA developer guide[17], yet vLLM's own documentation still labels disaggregated prefilling as 'experimental'[45]. A documented RDMA KV cache transfer failure in Kubernetes[19] adds a concrete networking obstacle that sits uneasily alongside the maturity claims. [60][61][17][45][19]
Competing approaches to attention-FFN disaggregation standardization: StepFun AI shipped StepMesh as a standalone open-source communication library[20][42], while vLLM posted an RFC for ATTN-FFN disaggregation built into the vLLM framework[22]. The two efforts address the same problem from different integration points; it is unclear whether they will converge or which will define the interoperability standard. [20][42][22][23][63]
Competing speculative decoding approaches for RL rollout efficiency: MIT HAN Lab's Adaptive Drafter (ASPLOS'26, reportedly 2x speedup, talk video available[34]) and Together.ai/WukLab's DASD (MLSys 2026, claimed acceleration of up to 50%[31][30]) independently target the long-tail RL rollout GPU bubble problem with different strategies. Direct comparisons between the approaches have not appeared, and whether they are composable or serve different regimes is unresolved. [33][36][31][30][35][34][32]

Sources

[1] MLSys 2026 is next week! — SemiAnalysis Twitter (2026-05-17)
[2] MLSys 2026 is starting today. And I'm excited to be here. This conference focuses on the most important research problem... — reactive:mlsys-2026-inference-systems (2026-05-18)
[3] [2512.12087] BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding — reactive:mlsys-2026-inference-systems
[4] Sparse attention mechanisms are finally moving beyond academic benchmarks into production systems, including DeepSeek Sp… — SemiAnalysis Twitter (2026-05-17)
[5] @rohanpaul_ai Having built with DeepSeek models in production — the architecture point is real. Their MoE routing and sp... — reactive:mlsys-2026-inference-systems (2026-05-24)
[6] [Feature Request] BLASST: Dynamic BLocked Attention Sparsity via ... — reactive:mlsys-2026-inference-systems
[7] dynamic_block_sparse_fwd_flas... — reactive:mlsys-2026-inference-systems
[8] Add sparse-MLA paged attention for SM120 (RTX PRO ... - GitHub — reactive:mlsys-2026-inference-systems
[9] @rasbt Sparse attention is a stopgap. By 2027, hardware-native linear attention will render these manual sparsity implem... — reactive:mlsys-2026-inference-systems (2026-05-23)
[10] Paper page - Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
[11] [PDF] Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
[12] [PDF] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention | Semantic Scholar — reactive:mlsys-2026-inference-systems
[13] [Literature Review] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
[14] Native Sparse Attention: Hardware-Aligned and Natively Trainable ... — reactive:mlsys-2026-inference-systems
[15] [PDF] Hardware-efficient, Sparse, Compact, and Linear Attention — reactive:mlsys-2026-inference-systems
[16] @NousResearch Assigning dedicated resources to different types of workloads is an increasingly popular system optimizati… — SemiAnalysis Twitter (2026-05-17)
[17] Disaggregated Inference [BETA] — AWS Neuron Documentation — reactive:mlsys-2026-inference-systems
[18] Long-context inference and Prefill-Decode disaggregation turn KV Cache into cross-node traffic. — reactive:mlsys-2026-inference-systems (2026-05-23)
[19] Why RDMA KV Cache Transfer Broke in Kubernetes - Medium — reactive:mlsys-2026-inference-systems
[20] stepfun-ai/StepMesh — reactive:mlsys-2026-inference-systems
[21] Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — reactive:mlsys-2026-inference-systems
[22] [RFC]: ATTN-FFN Disaggregation for MoE Models #22799 - GitHub — reactive:mlsys-2026-inference-systems
[23] This amazing Attention-FFN disaggregation implementation from ... — reactive:mlsys-2026-inference-systems
[24] [Literature Review] Revealing the Challenges of Attention-FFN ... — reactive:mlsys-2026-inference-systems
[25] [PDF] MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with ... — reactive:mlsys-2026-inference-systems
[26] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism — reactive:mlsys-2026-inference-systems
[27] [2504.02263] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism — reactive:mlsys-2026-inference-systems
[28] Disaggregated Expert Parallelism — reactive:mlsys-2026-inference-systems
[29] @NousResearch @StepFun_ai @haoailab Large scale production system challenges, such as expert balancing in serving MoE mo… — SemiAnalysis Twitter (2026-05-17)
[30] Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding — reactive:mlsys-2026-inference-systems
[31] Beat the long tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[32] Paper: Beat the long tail: Distribution-Aware Speculative Decoding ... — reactive:mlsys-2026-inference-systems
[33] GitHub - mit-han-lab/fastrl: [ASPLOS'26] Taming the Long-Tail — reactive:mlsys-2026-inference-systems
[34] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter, [ASPLOS 2026] — reactive:mlsys-2026-inference-systems
[35] New method could increase LLM training efficiency | MIT News | Massachusetts Institute of Technology — reactive:mlsys-2026-inference-systems
[36] Aiyu | Taming the Long Tail: Adaptive Speculative Decoding Doubles LLM Training Speed — reactive:mlsys-2026-inference-systems
[37] The long-tail distribution of rollout lengths causes one of the most critical inefficiencies in RL training. — SemiAnalysis Twitter (2026-05-17)
[38] Glad to be featured by SemiAnalysis. Our work BLASST was also selected as MLSys 2026 Best Paper: https://t.co/OlkQ7x75BN... — reactive:mlsys-2026-inference-systems (2026-05-17)
[39] Dynamic BLocked Attention Sparsity via Softmax Thresholding — reactive:mlsys-2026-inference-systems
[40] BLASST: Dynamic BLocked Attention Sparsity via Softmax ... — reactive:mlsys-2026-inference-systems
[41] Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
[42] StepMesh: A Communication Library for Attention-FFN Disaggregation | StepFun — reactive:mlsys-2026-inference-systems
[43] Step3: Cost-Effective Multimodal Intelligence - StepFun — reactive:mlsys-2026-inference-systems
[44] Paper page - Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — reactive:mlsys-2026-inference-systems
[45] Disaggregated Prefilling (experimental) - vLLM — reactive:mlsys-2026-inference-systems
[46] Always great to see the #vLLM community connecting IRL at MLSys 2026! Thanks to the teams keeping these conversations go... — reactive:mlsys-2026-inference-systems (2026-05-21)
[47] [Literature Review] MegaScale-Infer: Serving Mixture-of-Experts at ... — reactive:mlsys-2026-inference-systems
[48] MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with ... — reactive:mlsys-2026-inference-systems
[49] Unlocking MoE Efficiency: How MegaScale-Infer Slashes LLM ... — reactive:mlsys-2026-inference-systems
[50] Beat the Long Tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[51] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter — reactive:mlsys-2026-inference-systems
[52] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter — reactive:mlsys-2026-inference-systems
[53] PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-21)
[54] RT @lmcache: PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-22)
[55] GPU USEFUL LIFE, INFERENCE DISAGGREGATION, AND PRIVATE CREDIT — reactive:mlsys-2026-inference-systems (2026-05-20)
[56] I took one key insight from this convo: inference disaggregation between prefill and decode enable GPU lifespan to be ex... — reactive:mlsys-2026-inference-systems (2026-05-20)
[57] @GavinSBaker : AI skeptics have been wrong to claim GPU useful lives are only 1-2 yrs. The disaggregation of prefill (me... — reactive:mlsys-2026-inference-systems (2026-05-21)
[58] For anyone listening, the relevant section starts around 44m: prefill vs. decode disaggregation. — reactive:mlsys-2026-inference-systems (2026-05-21)
[59] RT @tropicalvalue: I took one key insight from this convo: inference disaggregation between prefill and decode enable GP... — reactive:mlsys-2026-inference-systems (2026-05-23)
[60] @lmsysorg @CloudflareDev The real story here isn't the bug fix, it's that prefill-decode disaggregation is finally stabl... — reactive:mlsys-2026-inference-systems (2026-05-21)
[61] @lmsysorg @AMD @dstackai If you're still running prefill and decode on the same GPUs at scale, you're basically burning ... — reactive:mlsys-2026-inference-systems (2026-05-21)
[62] The key observation: load-balancing losses used during MoE training encourage expert diversity. — reactive:mlsys-2026-inference-systems (2026-05-19)
[63] AFD: Decoupling Attention and FFN — reactive:mlsys-2026-inference-systems
[64] Beat the long tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
[65] [PDF] Dynamic BLocked Attention Sparsity via Softmax Thresholding - arXiv — reactive:mlsys-2026-inference-systems
[66] [2601.21351] Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads — reactive:mlsys-2026-inference-systems
[67] Revealing the Challenges of Attention-FFN Disaggregation ... — reactive:mlsys-2026-inference-systems
[68] DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles — reactive:mlsys-2026-inference-systems (2026-04-25)
[69] vLLM crew is out in full force at MLSys 2026 🔥 — reactive:mlsys-2026-inference-systems (2026-05-18)
[70] We’re at MLSys 2026 in Bellevue this week! ⛴️ — reactive:mlsys-2026-inference-systems (2026-05-18)
[71] Headed to MLSys 2026? — reactive:mlsys-2026-inference-systems (2026-05-17)
[72] Congratulations on the MLSys 2026 best paper! Looking forward to your presentation today! https://t.co/UrVcKgUJj0 — reactive:mlsys-2026-inference-systems (2026-05-19)
[73] Congrats on winning an MLSys 2026 best research paper! https://t.co/grBdNYK4Fg — reactive:mlsys-2026-inference-systems (2026-05-19)
[74] @Yong_jun_He Congratulations on the MLSys 2026 Oral! Scaling RL training across heterogeneous accelerators is a genuinel... — reactive:mlsys-2026-inference-systems (2026-05-18)
[75] Read “Theta EdgeCloud Tests Prefill/Decode Disaggregation for Large-Scale LLM Serving“ by Theta Labs on Medium: https://... — reactive:mlsys-2026-inference-systems (2026-05-20)
[76] Theta EdgeCloud Tests Prefill/Decode Disaggregation for Large-Scale LLM Serving #machinelearning #ml #artificialintellig... — reactive:mlsys-2026-inference-systems (2026-05-20)
[77] RT @inferact: Great cohosting this luncheon with @a16z and Mirendil at MLSys 2026 yesterday! 🙌 — reactive:mlsys-2026-inference-systems (2026-05-21)
[78] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-22)
[79] RT @SnorkelAI: Live from MLSys 2026! Thanks to everyone who joined @pham_derek's talk yesterday on RLVR in low-data, low... — reactive:mlsys-2026-inference-systems (2026-05-22)
[80] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[81] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-22)
[82] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[83] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[84] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[85] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[86] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
[87] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)