The Information Machine

MLSys 2026: Inference Systems Research Preview

Version 5

2026-05-24 18:56 UTC · 184 items

What

MLSys 2026 (May 18–22, Bellevue, WA) consolidated four production inference research threads crossing from papers into engineering artifacts. NVIDIA's BLASST (arXiv 2512.12087) won Best Paper for dynamic blocked sparse attention[3], and a FlashInfer GitHub issue tracking dynamic block sparse forward implementation is now open[7]. The sparse-attention debate has grown more nuanced with the Native Sparse Attention (NSA) paper (arXiv 2502.11089, ACL 2025)[10][11], which designs hardware-aligned sparse attention from training time rather than imposing it at runtime — a middle path between BLASST-style dynamic sparsity and @superaiwatcher's predicted linear attention replacement[9]. Attention-FFN disaggregation advances on multiple fronts (StepMesh[20], vLLM RFC[22], AWS Neuron BETA[17]), and MegaScale-Infer (arXiv 2504.02263, also SIGCOMM 2025[25]) is confirmed as ByteDance's production MoE disaggregated expert parallelism system[27].

Why it matters

The NSA paper complicates the binary between runtime sparse attention and full linear attention replacement by showing sparse attention can be redesigned from training time to close the hardware-alignment gap — whether this neutralizes or merely delays the architectural pressure toward linear attention is the new unresolved question. Meanwhile, a documented RDMA KV cache transfer failure in Kubernetes[19] adds a concrete networking obstacle to the disaggregated inference maturity picture that was previously framed only in theoretical terms.

Open questions

  • Does Native Sparse Attention's hardware-aligned, training-time approach[10][11] substantively answer @superaiwatcher's 'stopgap' critique[9], or does NSA still fall short of the efficiency ceiling that hardware-native linear attention could reach?

  • MegaScale-Infer is available on arXiv[27] and appeared at SIGCOMM 2025[25]. Does the full paper detail whether disaggregated expert parallelism specifically addresses the expert load balancing problem SemiAnalysis identified as underexplored in open source[29]?

  • Will the open FlashInfer dynamic block sparse forward issue[7] produce a BLASST-compatible kernel for SM120 hardware, and on what timeline?

  • Do RDMA KV cache transfer failures in Kubernetes[19] represent a blocking systems gap for cloud-scale PD disaggregation, or an already-solved engineering problem that practitioners have resolved outside of public documentation?

Narrative

MLSys 2026, held May 18–22 in Bellevue, Washington, brought together researchers and engineers from across the AI industry to address production inference and training systems challenges[1][2]. The conference's most prominent recognition went to NVIDIA's BLASST paper (arXiv 2512.12087), which introduces Dynamic Blocked Attention Sparsity via Softmax Thresholding[3]: a runtime technique that dynamically identifies and skips low-salience attention blocks using a softmax threshold gate, targeting the core inefficiency of dense attention at long context lengths. BLASST's Best Paper award reflects a broader productionization wave: sparse attention has migrated from academic benchmarks into live deployments, with DeepSeek's Sparse Attention and NousResearch's Lighthouse Attention cited as examples in a pre-conference SemiAnalysis thread[4], and a practitioner confirming post-conference that DeepSeek's MoE routing and sparse attention perform well in production[5]. A community request to integrate BLASST into FlashInfer was filed during the conference[6], and a GitHub issue tracking dynamic block sparse forward implementation in FlashInfer is now open[7]. Separate FlashInfer work is underway on sparse-MLA paged attention for the new SM120 GPU architecture[8].
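The block-skipping idea described above can be illustrated with a small, pure-Python sketch. It is not NVIDIA's kernel and is not taken from the BLASST paper; it assumes one plausible reading of a "softmax threshold gate", namely that a key block is skipped when its best score sits so far below the running maximum of an online softmax that every weight in the block would fall under `threshold`. The function name `blocked_attention` and all parameters are hypothetical.

```python
import math

def blocked_attention(query, keys, values, block_size, threshold=None):
    """Single-query attention over key/value blocks using an online softmax.

    Assumption (not from the paper): a block is skipped when
    exp(block_max - running_max) < threshold, i.e. every softmax weight in
    the block is negligible relative to what has already been accumulated.
    """
    scale = 1.0 / math.sqrt(len(query))
    running_max = -math.inf          # largest score seen so far
    denom = 0.0                      # running softmax denominator
    acc = [0.0] * len(values[0])     # running weighted sum of values
    skipped = 0

    for start in range(0, len(keys), block_size):
        block_keys = keys[start:start + block_size]
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in block_keys]
        block_max = max(scores)

        # Threshold gate: skip the exponentials and the value reads entirely.
        if threshold is not None and block_max - running_max < math.log(threshold):
            skipped += 1
            continue

        block_values = values[start:start + block_size]
        new_max = max(running_max, block_max)
        rescale = math.exp(running_max - new_max)  # 0.0 on the first kept block
        denom *= rescale
        acc = [a * rescale for a in acc]
        for score, value in zip(scores, block_values):
            weight = math.exp(score - new_max)
            denom += weight
            acc = [a + weight * v for a, v in zip(acc, value)]
        running_max = new_max

    return [a / denom for a in acc], skipped

# Toy data: the first block dominates, the second is negligible.
query = [1.0, 0.0]
keys = [[10.0, 0.0], [9.0, 0.0], [-10.0, 0.0], [-9.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]

dense, _ = blocked_attention(query, keys, values, block_size=2)
sparse, skipped = blocked_attention(query, keys, values, block_size=2, threshold=1e-3)
print(skipped, abs(dense[0] - sparse[0]))
```

Because the gate compares against the running maximum, the result depends on block order in this sketch: an early block is always processed even if a later block ends up dominating it. A production kernel would also operate on tiles of queries rather than a single query vector.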

The debate over whether sparse attention is a durable or transitional architecture has grown more textured than a simple binary. A dissenting voice, @superaiwatcher, argued post-conference that sparse attention is a transitional stopgap before hardware-native linear attention displaces manual sparsity implementations by 2027[9]. The Native Sparse Attention (NSA) paper (arXiv 2502.11089), published at ACL 2025, offers a direct counter-position: rather than imposing sparsity patterns at runtime on top of dense attention kernels, NSA designs sparse attention to be hardware-aligned and natively trainable from the ground up[10][11][12]. This represents a third architectural path, between BLASST-style dynamic runtime sparsity and full linear attention replacement. It suggests the sparse attention research community is actively working to close the hardware-alignment gap that motivates the linear attention argument without abandoning the sparse paradigm[13][14]. A survey of hardware-efficient attention mechanisms covering sparse, compact, and linear variants contextualizes the full design space[15]. The practitioner position, represented by JasonLiu, is that current sparse MoE routing and attention work well in production today[5], which makes the 2027 displacement timeline speculative against currently observable evidence.

Prefill-decode (PD) disaggregation, which routes the compute-bound prefill phase and the memory-bandwidth-bound decode phase to dedicated hardware pools, is now widely accepted as an industry best practice[16]. AWS Neuron's BETA disaggregated inference developer guide[17] signals cloud-provider toolchain formalization, a step beyond the research and startup adoption visible at the conference itself. Practitioners note that long-context inference and PD disaggregation turn KV cache into cross-node traffic[18]; a technical account of RDMA KV cache transfer failures in Kubernetes documents a concrete networking obstacle for disaggregated inference at production scale[19]. Attention-FFN disaggregation has emerged as the next architectural axis: StepFun AI's StepMesh communication library for attention-FFN disaggregated MoE inference is openly published on GitHub[20], backed by the Step-3 paper (arXiv 2507.19427)[21]; the vLLM project posted an RFC proposing ATTN-FFN disaggregation for MoE models[22][23]; and a literature review of the Challenges of Attention-FFN Disaggregation paper (arXiv 2602.09721) adds analytical documentation of the problem[24]. MegaScale-Infer (arXiv 2504.02263), a ByteDance system confirmed to have appeared at SIGCOMM 2025[25], uses disaggregated expert parallelism for serving MoE at scale with claimed cost reductions[26][27][28]. Whether its approach specifically addresses expert load balancing, flagged by SemiAnalysis as underexplored in open source[29], has not been publicly detailed.
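A back-of-envelope calculation shows why KV cache becomes significant cross-node traffic once prefill and decode run on different machines. The size formula below is the standard one for a decoder that stores one key and one value vector per layer, per KV head, per token. The model shape, prompt length, and link bandwidth are hypothetical values chosen for illustration and do not come from any of the cited sources. The sketch also ignores compressed-KV schemes such as MLA, paging overhead, and protocol overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """KV cache size for one request: keys plus values (the leading factor of 2)."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value

def transfer_seconds(num_bytes, link_bytes_per_second):
    """Idealized transfer time over a single link, ignoring latency and overhead."""
    return num_bytes / link_bytes_per_second

# Hypothetical model shape and prompt length (assumptions for illustration only).
HYPOTHETICAL_MODEL = dict(num_layers=32, num_kv_heads=8, head_dim=128)
PROMPT_TOKENS = 32_768
LINK_BYTES_PER_SECOND = 50e9  # assumed 400 Gb/s link, i.e. 50 GB/s

size = kv_cache_bytes(num_tokens=PROMPT_TOKENS, **HYPOTHETICAL_MODEL)
seconds = transfer_seconds(size, LINK_BYTES_PER_SECOND)
print(f"{size / 2**30:.1f} GiB per request, {seconds * 1e3:.0f} ms on the assumed link")
```

Under these assumptions a single 32K-token request hands off 4 GiB of 16-bit KV data between the prefill and decode pools, which is why the reliability of the transfer path matters as much as its bandwidth.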

Two independent research efforts targeted the same structural inefficiency in reinforcement learning post-training: the long-tail distribution of rollout lengths that creates GPU bubbles and degrades training throughput. Distribution-Aware Speculative Decoding (DASD), from Together.ai and WukLab (arXiv 2511.13841), was presented at MLSys 2026 and claims up to 50% acceleration of RL rollouts[30][31][32]. MIT HAN Lab's Adaptive Drafter (arXiv 2511.16665, ASPLOS'26, open-source at mit-han-lab/fastrl[33]) has a publicly available conference talk video[34], with MIT News framing it as substantially increasing LLM training efficiency[35] and secondary sources characterizing the gain as 2x[36]. The simultaneous emergence of independent approaches targeting the same long-tail RL inefficiency from different institutions and conference venues signals a high-value recognized systems problem; direct comparisons between the approaches have not yet appeared.
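The "GPU bubble" created by long-tail rollout lengths can be quantified with a deliberately simplified model. The sketch assumes a synchronous batch in which each slot runs one rollout, finished slots are not refilled, and every generated token costs the same. The function name `bubble_fraction` and the example lengths are made up for illustration; this is not the method of either DASD or Adaptive Drafter, only a picture of the inefficiency both target.

```python
def bubble_fraction(rollout_lengths):
    """Fraction of reserved slot-steps left idle while waiting for the longest rollout."""
    if not rollout_lengths:
        raise ValueError("need at least one rollout length")
    busy = sum(rollout_lengths)                               # steps that generate a token
    reserved = max(rollout_lengths) * len(rollout_lengths)    # steps the batch occupies
    return 1.0 - busy / reserved

# Hypothetical batch: four short rollouts and one long-tail straggler.
lengths = [100, 120, 150, 200, 2000]
print(f"idle fraction: {bubble_fraction(lengths):.3f}")
```

In this toy batch a single 2,000-token straggler leaves roughly three quarters of the reserved slot-steps idle, which is the gap that speeding up the long tail aims to shrink.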

Timeline

  • 2025-02-01: Native Sparse Attention (NSA) paper submitted to arXiv (2502.11089), introducing hardware-aligned and natively trainable sparse attention; subsequently published at ACL 2025 [10][11][12][41]
  • 2025-04-01: MegaScale-Infer paper (arXiv 2504.02263, also SIGCOMM 2025) submitted, describing disaggregated expert parallelism for MoE serving at scale [26][48][27][25]
  • 2025-07-01: StepFun AI submits Step-3 paper (arXiv 2507.19427) describing model-system co-design for cost-effective decoding, providing production context for StepMesh [21][43][44]
  • 2025-11-01: Distribution-Aware Speculative Decoding paper submitted to arXiv (2511.13841), targeting long-tail RL rollout GPU bubbles [31][64][32]
  • 2025-11-01: MIT HAN Lab submits Adaptive Drafter paper (arXiv 2511.16665) on efficient reasoning RL training; GitHub project fastrl confirms ASPLOS'26 venue [51][52][33]
  • 2025-12-01: BLASST paper submitted to arXiv (2512.12087), introducing dynamic blocked sparse attention via softmax thresholding [3][65][39][40]
  • 2026-01-01: Analytical Provisioning for Attention-FFN Disaggregated LLM Serving paper submitted to arXiv (2601.21351) [66]
  • 2026-02-01: Challenges of Attention-FFN Disaggregation paper submitted to arXiv (2602.09721) [67]
  • 2026-02-26: MIT News publishes article on Adaptive Drafter method for increasing LLM training efficiency [35]
  • 2026-04-25: LMSYS publishes blog post on DeepSeek-V4 fast inference and verified RL with SGLang and Miles [68]
  • 2026-05-17: SemiAnalysis publishes pre-conference MLSys 2026 research preview covering sparse attention, disaggregation, MoE balancing, and RL training efficiency [1][4][16][29][37]
  • 2026-05-17: NVIDIA researcher Huizi Mao confirms BLASST wins MLSys 2026 Best Paper for dynamic blocked sparse attention [38]
  • 2026-05-18: MLSys 2026 opens in Bellevue; vLLM, LMSYS, Inferact, Delta Institute among attendees [2][69][70][71]
  • 2026-05-18: MLSys 2026 Best Paper congratulations circulate; scaling RL across heterogeneous accelerators receives Oral recognition [72][73][74]
  • 2026-05-20: Theta EdgeCloud PD disaggregation test results published; financial commentary picks up disaggregation as argument for extended GPU useful life [75][76][56][57]
  • 2026-05-21: LMCache Lab releases async PDBackend; StepFun AI publishes StepMesh on GitHub; vLLM RFC for ATTN-FFN disaggregation posted; vLLM community gathers IRL at conference [53][20][22][23][46]
  • 2026-05-21: Inferact co-hosts MLSys luncheon with a16z [77]
  • 2026-05-22: Conference final day: GDN kernel work on B200 GPUs presented; SnorkelAI RLVR talk concludes; FlashInfer BLASST integration request filed; FlashInfer dynamic block sparse forward GitHub issue opened [78][79][6][80][81][7]
  • 2026-05-23: Retweets of the GDN kernel B200 GPU presentation announcement spread widely post-conference; @superaiwatcher posts that sparse attention is a stopgap before hardware-native linear attention; KV cache cross-node traffic framing circulates [80][82][83][84][85][86][87][9][18]
  • 2026-05-24: JasonLiu confirms DeepSeek MoE routing and sparse attention performing well in production; financial disaggregation commentary continues circulating [5][59]

Perspectives

SemiAnalysis

Curatorial and analytical: frames MLSys 2026 as the venue for the most important production AI systems problems, highlighting sparse attention productionization, PD disaggregation extensions, MoE expert balancing gaps, and RL training inefficiency as the four key threads

Evolution: Consistent — SemiAnalysis functions as a research-curating voice for practitioners; no stance shift observable

NVIDIA / Huizi Mao (BLASST team)

BLASST's dynamic blocked sparse attention has been validated by the research community as Best Paper, and the community is pushing for integration into inference kernels like FlashInfer beyond NVIDIA's own stack

Evolution: Elevated by Best Paper recognition; a new FlashInfer GitHub issue tracking dynamic block sparse forward implementation represents broadening external adoption momentum beyond the initial integration request

NSA paper authors (Yuan et al., ACL 2025)

Hardware-aligned and natively trainable sparse attention (arXiv 2502.11089) demonstrates that sparse attention can be designed from training time to close the hardware-alignment gap, offering a third path between BLASST-style runtime sparsity and full linear attention replacement

Evolution: New voice in this thread — introduces a research counterpoint to the sparse-attention-as-stopgap argument that was absent from the conference's binary debate

@superaiwatcher (dissenting practitioner)

Sparse attention is a stopgap; by 2027, hardware-native linear attention will render manual sparsity implementations obsolete, making the current productionization wave a transitional rather than permanent architecture shift

Evolution: Consistent — represents the explicit counter-narrative to the conference's sparse attention productionization consensus; the NSA paper is a direct counter-weight to this position, though @superaiwatcher has not responded to it in the thread

JasonLiu (@jsyqrt) (production practitioner)

DeepSeek's MoE routing and sparse attention architecture works well in production, confirming the practical viability of the techniques highlighted at the conference

Evolution: Consistent — adds post-conference practitioner validation to the research consensus

@haoailab (Hao Zhang / DistServe group)

Having originated now-standard PD disaggregation, the group is extending the paradigm to attention-FFN disaggregation presented at MLSys 2026

Evolution: Consistent — progressing from originating PD disaggregation to pushing the next architectural frontier: workload specialization by transformer component rather than just by compute phase

StepFun AI / StepMesh

Attention-FFN disaggregation for MoE inference requires dedicated communication infrastructure; StepMesh instantiates this as a production-oriented open-source library backed by model-system co-design documented in the Step-3 paper

Evolution: Consistent; StepMesh is openly hosted on GitHub and the Step-3 paper provides production evidence for the co-design approach

vLLM project

Attention-FFN disaggregation for MoE models is being actively pursued via RFC, but disaggregated prefilling remains labeled experimental in official documentation; the community gathered IRL at MLSys 2026 to continue these conversations

Evolution: Consistent — the RFC shows forward momentum, but the gap between conference-floor confidence and official documentation maturity persists

ByteDance / MegaScale-Infer

Disaggregated expert parallelism — separating attention and FFN/expert compute across different hardware — enables cost-effective MoE serving at scale, as documented in arXiv 2504.02263 and presented at SIGCOMM 2025

Evolution: Further solidified: now confirmed at SIGCOMM 2025 in addition to arXiv, adding a peer-reviewed systems venue to the claimed cost reductions

AWS Neuron

Disaggregated inference is production-ready enough to warrant official BETA documentation in the AWS Neuron developer guide

Evolution: Consistent — cloud-provider toolchain formalization of disaggregated inference, a step beyond the startup and research adoption visible at the conference

Together.ai / WukLab (DASD)

Distribution-Aware Speculative Decoding delivers up to 50% acceleration of RL rollouts by adapting draft model behavior to the actual rollout length distribution rather than assuming fixed lengths

Evolution: Consistent; one of two confirmed independent groups attacking the same long-tail RL rollout problem from different institutional and conference contexts

MIT HAN Lab (Adaptive Drafter / fastrl)

Adaptive Drafter addresses long-tail RL training inefficiency at ASPLOS'26 with an adaptive speculative decoding approach; MIT News frames it as substantially increasing LLM training efficiency, secondary sources characterize the gain as 2x, and the conference talk video is publicly available

Evolution: Consistent; the ASPLOS'26 talk video is now publicly available on YouTube, consolidating the citable artifact base beyond the paper and GitHub repository

LMCache Lab

Actively productionizing PD disaggregation with a new async backend, framing the architecture as ready for deployment at scale

Evolution: Consistent with their KV cache and disaggregation-focused engineering track

Financial / investor commentators (TheValueist, podcast_alpha_x, tropicalvalue)

PD disaggregation is evidence that GPU useful lives are substantially longer than AI skeptics claim — older GPUs can be repurposed in disaggregated decode roles, extending hardware ROI

Evolution: Consistent; this framing has continued circulating post-conference

Sakura Yuki (inference practitioner)

PD disaggregation is finally stable for production; running co-located prefill and decode at scale is now wasteful rather than a reasonable default

Evolution: Consistent with the broad practitioner consensus emerging at the conference

Tensions

  • Sparse attention as milestone vs. stopgap: The conference and practitioner community framed sparse attention productionization as a durable architectural shift (BLASST Best Paper[3], DeepSeek production confirmation[5]), while @superaiwatcher argues it is a transitional stopgap before hardware-native linear attention displaces manual sparsity by 2027[9]. The NSA paper (arXiv 2502.11089)[10][11] adds a third position: that hardware-aligned sparse attention can be designed natively from training time, potentially closing the efficiency gap that motivates the linear attention argument without abandoning sparsity at all. [3][5][9][4][10][11][12]
  • Open-source community vs. production practitioners on MoE expert balancing: SemiAnalysis observes that expert balancing in MoE serving is substantively underexplored in open source because the challenge only surfaces at production scale[29]. MegaScale-Infer claims MoE serving cost reductions via disaggregated expert parallelism[26][25], but whether this specifically addresses the expert balancing problem — or whether any open-source production-grade solution exists — remains unresolved. [29][49][26][25][62]
  • Disaggregation maturity framing: Practitioner commentary frames PD disaggregation as 'finally stable' for production[60], AWS Neuron carries an official BETA developer guide[17], yet vLLM's own documentation still labels disaggregated prefilling as 'experimental'[45]. A documented RDMA KV cache transfer failure in Kubernetes[19] adds a concrete networking obstacle that sits uneasily alongside the maturity claims. [60][61][17][45][19]
  • Competing approaches to attention-FFN disaggregation standardization: StepFun AI shipped StepMesh as a standalone open-source communication library[20][42], while vLLM posted an RFC for ATTN-FFN disaggregation built into the vLLM framework[22]. The two efforts address the same problem from different integration points; it is unclear whether they will converge or which will define the interoperability standard. [20][42][22][23][63]
  • Competing speculative decoding approaches for RL rollout efficiency: MIT HAN Lab's Adaptive Drafter (ASPLOS'26, reportedly 2x speedup, talk video available[34]) and Together.ai/WukLab's DASD (MLSys 2026, claimed up to 50% acceleration[31][30]) independently target the long-tail RL rollout GPU bubble problem with different strategies. Direct comparisons between the approaches have not appeared, and whether they are composable or serve different regimes is unresolved. [33][36][31][30][35][34][32]

Sources

  1. [1] MLSys 2026 is next week! — SemiAnalysis Twitter (2026-05-17)
  2. [2] MLSys 2026 is starting today. And I'm excited to be here. This conference focuses on the most important research problem... — reactive:mlsys-2026-inference-systems (2026-05-18)
  3. [3] [2512.12087] BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding — reactive:mlsys-2026-inference-systems
  4. [4] Sparse attention mechanisms are finally moving beyond academic benchmarks into production systems, including DeepSeek Sp… — SemiAnalysis Twitter (2026-05-17)
  5. [5] @rohanpaul_ai Having built with DeepSeek models in production — the architecture point is real. Their MoE routing and sp... — reactive:mlsys-2026-inference-systems (2026-05-24)
  6. [6] [Feature Request] BLASST: Dynamic BLocked Attention Sparsity via ... — reactive:mlsys-2026-inference-systems
  7. [7] dynamic_block_sparse_fwd_flas... — reactive:mlsys-2026-inference-systems
  8. [8] Add sparse-MLA paged attention for SM120 (RTX PRO ... - GitHub — reactive:mlsys-2026-inference-systems
  9. [9] @rasbt Sparse attention is a stopgap. By 2027, hardware-native linear attention will render these manual sparsity implem... — reactive:mlsys-2026-inference-systems (2026-05-23)
  10. [10] Paper page - Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
  11. [11] [PDF] Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
  12. [12] [PDF] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention | Semantic Scholar — reactive:mlsys-2026-inference-systems
  13. [13] [Literature Review] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
  14. [14] Native Sparse Attention: Hardware-Aligned and Natively Trainable ... — reactive:mlsys-2026-inference-systems
  15. [15] [PDF] Hardware-efficient, Sparse, Compact, and Linear Attention — reactive:mlsys-2026-inference-systems
  16. [16] @NousResearch Assigning dedicated resources to different types of workloads is an increasingly popular system optimizati… — SemiAnalysis Twitter (2026-05-17)
  17. [17] Disaggregated Inference [BETA] — AWS Neuron Documentation — reactive:mlsys-2026-inference-systems
  18. [18] Long-context inference and Prefill-Decode disaggregation turn KV Cache into cross-node traffic. — reactive:mlsys-2026-inference-systems (2026-05-23)
  19. [19] Why RDMA KV Cache Transfer Broke in Kubernetes - Medium — reactive:mlsys-2026-inference-systems
  20. [20] stepfun-ai/StepMesh — reactive:mlsys-2026-inference-systems
  21. [21] Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — reactive:mlsys-2026-inference-systems
  22. [22] [RFC]: ATTN-FFN Disaggregation for MoE Models #22799 - GitHub — reactive:mlsys-2026-inference-systems
  23. [23] This amazing Attention-FFN disaggregation implementation from ... — reactive:mlsys-2026-inference-systems
  24. [24] [Literature Review] Revealing the Challenges of Attention-FFN ... — reactive:mlsys-2026-inference-systems
  25. [25] [PDF] MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with ... — reactive:mlsys-2026-inference-systems
  26. [26] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism — reactive:mlsys-2026-inference-systems
  27. [27] [2504.02263] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism — reactive:mlsys-2026-inference-systems
  28. [28] Disaggregated Expert Parallelism — reactive:mlsys-2026-inference-systems
  29. [29] @NousResearch @StepFun_ai @haoailab Large scale production system challenges, such as expert balancing in serving MoE mo… — SemiAnalysis Twitter (2026-05-17)
  30. [30] Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding — reactive:mlsys-2026-inference-systems
  31. [31] Beat the long tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
  32. [32] Paper: Beat the long tail: Distribution-Aware Speculative Decoding ... — reactive:mlsys-2026-inference-systems
  33. [33] GitHub - mit-han-lab/fastrl: [ASPLOS'26] Taming the Long-Tail — reactive:mlsys-2026-inference-systems
  34. [34] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter, [ASPLOS 2026] — reactive:mlsys-2026-inference-systems
  35. [35] New method could increase LLM training efficiency | MIT News | Massachusetts Institute of Technology — reactive:mlsys-2026-inference-systems
  36. [36] Aiyu | Taming the Long Tail: Adaptive Speculative Decoding Doubles LLM Training Speed — reactive:mlsys-2026-inference-systems
  37. [37] The long-tail distribution of rollout lengths causes one of the most critical inefficiencies in RL training. — SemiAnalysis Twitter (2026-05-17)
  38. [38] Glad to be featured by SemiAnalysis. Our work BLASST was also selected as MLSys 2026 Best Paper: https://t.co/OlkQ7x75BN... — reactive:mlsys-2026-inference-systems (2026-05-17)
  39. [39] Dynamic BLocked Attention Sparsity via Softmax Thresholding — reactive:mlsys-2026-inference-systems
  40. [40] BLASST: Dynamic BLocked Attention Sparsity via Softmax ... — reactive:mlsys-2026-inference-systems
  41. [41] Hardware-Aligned and Natively Trainable Sparse Attention — reactive:mlsys-2026-inference-systems
  42. [42] StepMesh: A Communication Library for Attention-FFN Disaggregation | StepFun — reactive:mlsys-2026-inference-systems
  43. [43] Step3: Cost-Effective Multimodal Intelligence - StepFun — reactive:mlsys-2026-inference-systems
  44. [44] Paper page - Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — reactive:mlsys-2026-inference-systems
  45. [45] Disaggregated Prefilling (experimental) - vLLM — reactive:mlsys-2026-inference-systems
  46. [46] Always great to see the #vLLM community connecting IRL at MLSys 2026! Thanks to the teams keeping these conversations go... — reactive:mlsys-2026-inference-systems (2026-05-21)
  47. [47] [Literature Review] MegaScale-Infer: Serving Mixture-of-Experts at ... — reactive:mlsys-2026-inference-systems
  48. [48] MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with ... — reactive:mlsys-2026-inference-systems
  49. [49] Unlocking MoE Efficiency: How MegaScale-Infer Slashes LLM ... — reactive:mlsys-2026-inference-systems
  50. [50] Beat the Long Tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
  51. [51] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter — reactive:mlsys-2026-inference-systems
  52. [52] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter — reactive:mlsys-2026-inference-systems
  53. [53] PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-21)
  54. [54] RT @lmcache: PD Disaggregation unleashed! The new async PDBackend is now much more efficient in LMCache. — reactive:mlsys-2026-inference-systems (2026-05-22)
  55. [55] GPU USEFUL LIFE, INFERENCE DISAGGREGATION, AND PRIVATE CREDIT — reactive:mlsys-2026-inference-systems (2026-05-20)
  56. [56] I took one key insight from this convo: inference disaggregation between prefill and decode enable GPU lifespan to be ex... — reactive:mlsys-2026-inference-systems (2026-05-20)
  57. [57] @GavinSBaker : AI skeptics have been wrong to claim GPU useful lives are only 1-2 yrs. The disaggregation of prefill (me... — reactive:mlsys-2026-inference-systems (2026-05-21)
  58. [58] For anyone listening, the relevant section starts around 44m: prefill vs. decode disaggregation. — reactive:mlsys-2026-inference-systems (2026-05-21)
  59. [59] RT @tropicalvalue: I took one key insight from this convo: inference disaggregation between prefill and decode enable GP... — reactive:mlsys-2026-inference-systems (2026-05-23)
  60. [60] @lmsysorg @CloudflareDev The real story here isn't the bug fix, it's that prefill-decode disaggregation is finally stabl... — reactive:mlsys-2026-inference-systems (2026-05-21)
  61. [61] @lmsysorg @AMD @dstackai If you're still running prefill and decode on the same GPUs at scale, you're basically burning ... — reactive:mlsys-2026-inference-systems (2026-05-21)
  62. [62] The key observation: load-balancing losses used during MoE training encourage expert diversity. — reactive:mlsys-2026-inference-systems (2026-05-19)
  63. [63] AFD: Decoupling Attention and FFN — reactive:mlsys-2026-inference-systems
  64. [64] Beat the long tail: Distribution-Aware Speculative Decoding for RL ... — reactive:mlsys-2026-inference-systems
  65. [65] [PDF] Dynamic BLocked Attention Sparsity via Softmax Thresholding - arXiv — reactive:mlsys-2026-inference-systems
  66. [66] [2601.21351] Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads — reactive:mlsys-2026-inference-systems
  67. [67] Revealing the Challenges of Attention-FFN Disaggregation ... — reactive:mlsys-2026-inference-systems
  68. [68] DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles — reactive:mlsys-2026-inference-systems (2026-04-25)
  69. [69] vLLM crew is out in full force at MLSys 2026 🔥 — reactive:mlsys-2026-inference-systems (2026-05-18)
  70. [70] We’re at MLSys 2026 in Bellevue this week! ⛴️ — reactive:mlsys-2026-inference-systems (2026-05-18)
  71. [71] Headed to MLSys 2026? — reactive:mlsys-2026-inference-systems (2026-05-17)
  72. [72] Congratulations on the MLSys 2026 best paper! Looking forward to your presentation today! https://t.co/UrVcKgUJj0 — reactive:mlsys-2026-inference-systems (2026-05-19)
  73. [73] Congrats on winning an MLSys 2026 best research paper! https://t.co/grBdNYK4Fg — reactive:mlsys-2026-inference-systems (2026-05-19)
  74. [74] @Yong_jun_He Congratulations on the MLSys 2026 Oral! Scaling RL training across heterogeneous accelerators is a genuinel... — reactive:mlsys-2026-inference-systems (2026-05-18)
  75. [75] Read “Theta EdgeCloud Tests Prefill/Decode Disaggregation for Large-Scale LLM Serving“ by Theta Labs on Medium: https://... — reactive:mlsys-2026-inference-systems (2026-05-20)
  76. [76] Theta EdgeCloud Tests Prefill/Decode Disaggregation for Large-Scale LLM Serving #machinelearning #ml #artificialintellig... — reactive:mlsys-2026-inference-systems (2026-05-20)
  77. [77] RT @inferact: Great cohosting this luncheon with @a16z and Mirendil at MLSys 2026 yesterday! 🙌 — reactive:mlsys-2026-inference-systems (2026-05-21)
  78. [78] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-22)
  79. [79] RT @SnorkelAI: Live from MLSys 2026! Thanks to everyone who joined @pham_derek's talk yesterday on RLVR in low-data, low... — reactive:mlsys-2026-inference-systems (2026-05-22)
  80. [80] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
  81. [81] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-22)
  82. [82] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
  83. [83] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
  84. [84] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
  85. [85] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
  86. [86] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)
  87. [87] RT @romitjain_: I will be presenting our recent work on writing efficient GDN kernels on B200 GPUs at MLSys 2026 today (... — reactive:mlsys-2026-inference-systems (2026-05-23)