Wide Expert Parallelism increases the total memory bandwidth available per MoE deployment. This means the model distribu…

SemiAnalysis Twitter · SemiAnalysis (@SemiAnalysis_) · 2026-06-17

Wide Expert Parallelism is a technique for mixture-of-experts models that distributes expert weights across multiple GPUs to increase aggregate memory bandwidth and achieve higher inference throughput per GPU.
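
To make the idea of distributing expert weights concrete, here is a minimal sketch of one possible placement scheme. The round-robin rule and the names `place_experts` and `weight_fraction_per_gpu` are illustrative assumptions, not something the original post specifies; real serving stacks may place experts differently.

```python
def place_experts(num_experts: int, num_gpus: int) -> dict[int, list[int]]:
    """Assign expert indices to GPUs round-robin (an assumed, illustrative policy)."""
    if num_experts <= 0 or num_gpus <= 0:
        raise ValueError("num_experts and num_gpus must be positive")
    placement: dict[int, list[int]] = {gpu: [] for gpu in range(num_gpus)}
    for expert in range(num_experts):
        placement[expert % num_gpus].append(expert)
    return placement


def weight_fraction_per_gpu(placement: dict[int, list[int]]) -> dict[int, float]:
    """Fraction of all expert weights held by each GPU, assuming equal-sized experts."""
    total = sum(len(experts) for experts in placement.values())
    return {gpu: len(experts) / total for gpu, experts in placement.items()}


# Hypothetical sizes chosen only for illustration.
placement = place_experts(num_experts=8, num_gpus=4)
fractions = weight_fraction_per_gpu(placement)
```

With these hypothetical sizes each GPU holds two of the eight experts, a quarter of the expert weights; widening the GPU group shrinks that fraction further.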

Appears in

LLM Efficiency Breakthroughs: Small Models and Sparse Architectures Challenge Scale Assumptions

Extraction

Topics: mixture-of-experts, expert-parallelism, llm-inference, gpu-architecture

Claims

Wide Expert Parallelism increases total memory bandwidth available for MoE model deployments by spreading weights across GPUs.
Each GPU in a Wide Expert Parallelism setup loads only a small fraction of the total MoE expert weights.
This weight distribution yields higher inference throughput per GPU than configurations that do not spread expert weights across GPUs.
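
The arithmetic behind these claims can be sketched as follows. This is a back-of-the-envelope model under stated assumptions: every figure is hypothetical, experts are split evenly, each GPU reads all of its local expert weights once per step, and inter-GPU communication cost is ignored. The `Deployment` class and its fields are names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    num_gpus: int                # GPUs sharing the expert weights
    hbm_bandwidth_gb_s: float    # per-GPU memory bandwidth in GB/s (hypothetical)
    expert_weights_gb: float     # total size of all expert weights in GB (hypothetical)

    @property
    def aggregate_bandwidth_gb_s(self) -> float:
        # Total memory bandwidth grows linearly with the number of GPUs.
        return self.num_gpus * self.hbm_bandwidth_gb_s

    @property
    def weights_per_gpu_gb(self) -> float:
        # Each GPU holds only its share of the expert weights.
        return self.expert_weights_gb / self.num_gpus

    @property
    def weight_read_time_ms(self) -> float:
        # Time for one GPU to stream its local expert weights from memory once.
        return self.weights_per_gpu_gb / self.hbm_bandwidth_gb_s * 1000.0


# Hypothetical numbers, used only to compare a narrow and a wide layout.
narrow = Deployment(num_gpus=8, hbm_bandwidth_gb_s=1000.0, expert_weights_gb=400.0)
wide = Deployment(num_gpus=64, hbm_bandwidth_gb_s=1000.0, expert_weights_gb=400.0)

for name, d in (("narrow", narrow), ("wide", wide)):
    print(f"{name}: {d.aggregate_bandwidth_gb_s:.0f} GB/s aggregate, "
          f"{d.weights_per_gpu_gb:.2f} GB per GPU, "
          f"{d.weight_read_time_ms:.2f} ms per weight pass")
```

In this simplified model, going from 8 to 64 GPUs multiplies aggregate bandwidth by eight and cuts each GPU's weight-read time by the same factor, which is the mechanism the claims point to. Actual throughput gains depend on routing balance and interconnect overhead, which the sketch leaves out.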

Key quotes

Wide Expert Parallelism increases the total memory bandwidth available per MoE deployment.

The model distributes the MoE expert weights across multiple GPUs, so each GPU only needs to load a tiny fraction of the weights. This translates to higher throughput per GPU.