Expert parallelism

Expert parallelism is a parallelism strategy for mixture-of-experts (MoE) models. Each expert’s feedforward layer lives on a different hardware accelerator. A router dispatches tokens to the appropriate experts and gathers the results. This approach scales models to far larger parameter counts without increasing computation cost because each token activates only a few experts.
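To make the routing idea concrete, the toy sketch below (not the Transformers implementation; the sizes and the router and experts modules are illustrative) dispatches each token to its top-2 experts and sums the weighted expert outputs. Under expert parallelism, each device would hold only a slice of the experts list and compute just the tokens routed to it.

import torch
import torch.nn as nn

# Toy MoE layer: 8 experts, each token routed to its top-2 experts.
num_experts, top_k, hidden = 8, 2, 64
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
    for _ in range(num_experts)
)
router = nn.Linear(hidden, num_experts)

tokens = torch.randn(16, hidden)              # (num_tokens, hidden)
scores = router(tokens).softmax(dim=-1)       # routing probabilities per expert
weights, chosen = scores.topk(top_k, dim=-1)  # top-k experts for every token

output = torch.zeros_like(tokens)
for e in range(num_experts):                  # with expert parallelism, loop only over local experts
    token_idx, slot = (chosen == e).nonzero(as_tuple=True)
    if token_idx.numel():
        output[token_idx] += weights[token_idx, slot, None] * experts[e](tokens[token_idx])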

DistributedConfig

The DistributedConfig API is experimental and its usage may change in the future.

Enable expert parallelism with the DistributedConfig class and the enable_expert_parallel argument.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed.configuration_utils import DistributedConfig

distributed_config = DistributedConfig(enable_expert_parallel=True)

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",
    dtype="auto",
    distributed_config=distributed_config,
)
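To exercise the sharded model, a minimal generation sketch might look like the following. The prompt is illustrative, and the rank check assumes the script is launched with torchrun so the process group is initialized.

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")
inputs = tokenizer("Expert parallelism works by", return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=32)

# Every rank runs the same forward pass; print from one rank to avoid duplicated output.
if torch.distributed.get_rank() == 0:
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))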

Expert parallelism automatically enables tensor parallelism for attention layers.

Setting this argument switches the model to the ep_plan (expert parallel plan) defined in each MoE model's config file. The GroupedGemmParallel class splits the expert weights so each device only loads its local experts, while the ep_router dispatches tokens to those experts and an all-reduce operation combines their outputs.
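The combine step can be pictured with the sketch below. This illustrates the idea rather than the Transformers internals: each rank produces a partial MoE output covering only its local experts, and a sum all-reduce leaves every rank with the full result.

import torch
import torch.distributed as dist

def combine_local_expert_outputs(partial_output: torch.Tensor) -> torch.Tensor:
    # partial_output: (num_tokens, hidden), the weighted sum over this rank's experts only;
    # tokens routed exclusively to remote experts contribute zeros here.
    dist.all_reduce(partial_output, op=dist.ReduceOp.SUM)
    return partial_output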

Launch your inference script with torchrun and specify how many devices to use. The number of devices must evenly divide the total number of experts.

torchrun --nproc-per-node 8 your_script.py
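A quick sanity check of that constraint might look like the sketch below; the num_local_experts attribute name is an assumption and can differ between model configs.

import torch.distributed as dist
from transformers import AutoConfig

config = AutoConfig.from_pretrained("openai/gpt-oss-120b")
world_size = dist.get_world_size() if dist.is_initialized() else 8  # matches --nproc-per-node
num_experts = getattr(config, "num_local_experts", None)            # attribute name is an assumption

if num_experts is not None and num_experts % world_size != 0:
    raise ValueError(f"{num_experts} experts cannot be split evenly across {world_size} devices")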