A large MoE model may be wasting half its expert compute on tokens that barely need expert help.

Rohan Paul (@rohanpaul_ai) · Twitter · 2026-05-24

A new paper finds that 50% of expert computation in large Mixture-of-Experts models like Qwen3 and GLM can be eliminated post-training with negligible accuracy loss by skipping experts for tokens that don't need them.

Extraction

Topics: mixture-of-experts, model-efficiency, inference-optimization, post-training-optimization

Claims

Large MoE models spend roughly half their expert compute on tokens that derive little benefit from expert routing.
Removing 50% of expert computation from already-trained MoE models causes almost no accuracy degradation.
This optimization can be applied to existing models like Qwen3 and GLM without retraining.
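
The post does not say how the paper decides which tokens "barely need" an expert, so the following is only a minimal sketch of what inference-time expert skipping could look like, not the paper's method. It assumes a standard top-k softmax router and a hypothetical rule: always run the top-1 expert, and skip any other selected expert whose renormalized gate weight falls below a threshold. The names `route_with_skipping`, `fraction_skipped`, and `skip_threshold` are illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def route_with_skipping(router_logits, top_k=2, skip_threshold=0.3):
    """For each token, return the (expert_index, weight) pairs that would run.

    Hypothetical skipping rule (an assumption, not taken from the paper):
    the top-1 expert always runs; each remaining top-k expert runs only if
    its weight, renormalized over the top-k set, is at least skip_threshold.
    """
    kept = []
    for logits in router_logits:
        probs = softmax(logits)
        # Indices of the top_k experts, highest probability first.
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
        total = sum(probs[i] for i in top)
        selected = [(i, probs[i] / total) for i in top]
        survivors = [selected[0]] + [
            (i, w) for i, w in selected[1:] if w >= skip_threshold
        ]
        kept.append(survivors)
    return kept


def fraction_skipped(kept, top_k):
    """Share of the original top_k expert calls that were skipped."""
    total_calls = len(kept) * top_k
    executed = sum(len(k) for k in kept)
    return 1.0 - executed / total_calls


if __name__ == "__main__":
    # Two toy tokens over four experts: one confidently routed, one split evenly.
    logits = [[5.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
    kept = route_with_skipping(logits, top_k=2, skip_threshold=0.3)
    print(kept)
    print(fraction_skipped(kept, top_k=2))
```

In this toy setup the confidently routed token drops its second expert while the evenly split token keeps both. Because the rule only changes which experts run at inference time, it touches no weights, which is consistent with the claim that the optimization needs no retraining. Whether a given threshold preserves accuracy on a real model would have to be measured.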

Key quotes

A large MoE model may be wasting half its expert compute on tokens that barely need expert help.

50% of expert computation removed, with almost no loss in accuracy.