Wide Expert Parallelism increases the total memory bandwidth available per MoE deployment. This means the model distribu…
SemiAnalysis Twitter · SemiAnalysis (@SemiAnalysis_) · 2026-06-17
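Wide Expert Parallelism is a technique for mixture-of-experts (MoE) models that distributes the expert weights across many GPUs. Each GPU then holds and reads only a small fraction of the weights, which raises the aggregate memory bandwidth available to the deployment and yields higher inference throughput per GPU.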
Extraction
Topics: mixture-of-experts, expert-parallelism, llm-inference, gpu-architecture
Claims
- Wide Expert Parallelism increases total memory bandwidth available for MoE model deployments by spreading weights across GPUs.
- Each GPU in a Wide Expert Parallelism setup only loads a small fraction of the total MoE expert weights.
- This weight distribution approach results in higher inference throughput per GPU compared to non-distributed configurations.
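
The arithmetic behind these claims can be sketched in a few lines. This is a minimal back-of-the-envelope model, not anything taken from the source: the function names, the 512 GB expert-weight footprint, and the 4 TB/s per-GPU bandwidth are all hypothetical, chosen only for illustration. It assumes experts are split evenly across GPUs and that each GPU streams its whole shard once per step, and it ignores inter-GPU communication, load imbalance, and non-expert weights.

```python
def per_gpu_expert_bytes(total_expert_bytes: float, ep_degree: int) -> float:
    """Expert-weight bytes resident on each GPU, assuming an even split."""
    if ep_degree < 1:
        raise ValueError("ep_degree must be at least 1")
    return total_expert_bytes / ep_degree


def aggregate_bandwidth(per_gpu_bandwidth: float, ep_degree: int) -> float:
    """Combined memory bandwidth (bytes/s) of all GPUs holding experts."""
    return per_gpu_bandwidth * ep_degree


def shard_read_time(total_expert_bytes: float,
                    per_gpu_bandwidth: float,
                    ep_degree: int) -> float:
    """Idealized seconds for one GPU to read its own shard once from memory."""
    shard = per_gpu_expert_bytes(total_expert_bytes, ep_degree)
    return shard / per_gpu_bandwidth


if __name__ == "__main__":
    TOTAL_EXPERT_BYTES = 512e9   # hypothetical: 512 GB of expert weights
    PER_GPU_BANDWIDTH = 4e12     # hypothetical: 4 TB/s of memory bandwidth per GPU

    for ep_degree in (8, 64):
        shard = per_gpu_expert_bytes(TOTAL_EXPERT_BYTES, ep_degree)
        total_bw = aggregate_bandwidth(PER_GPU_BANDWIDTH, ep_degree)
        t = shard_read_time(TOTAL_EXPERT_BYTES, PER_GPU_BANDWIDTH, ep_degree)
        print(f"EP={ep_degree}: {shard / 1e9:.0f} GB per GPU, "
              f"{total_bw / 1e12:.0f} TB/s aggregate, "
              f"{t * 1e3:.1f} ms per shard read")
```

Under these assumptions, widening the expert-parallel degree shrinks each GPU's shard and its read time proportionally. A real deployment would also have to account for the communication needed to route tokens to the GPUs that hold their experts, which this sketch leaves out.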
Key quotes
Wide Expert Parallelism increases the total memory bandwidth available per MoE deployment.
The model distributes the MoE expert weights across multiple GPUs, so each GPU only needs to load a tiny fraction of the weights. This translates to higher throughput per GPU.