ByteDance's Doubao large model team has recently open-sourced COMET, a communication optimization system that addresses the heavy communication overhead Mixture-of-Experts (MoE) models incur during distributed training. While the MoE architecture's sparse activation mechanism sidesteps the computational bottlenecks of traditional dense models, cross-device communication in distributed training has become a major constraint on training efficiency and cost.
In MoE models, expert networks are distributed across multiple GPUs, so every computation step requires dispatching tokens to the right experts and aggregating the results, leaving GPU resources idle for substantial stretches. Existing solutions focus on overlapping computation with communication, but they suffer from high memory consumption, intrusive modifications to training frameworks, and inefficient resource utilization.
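To make the pattern concrete, here is a minimal single-process PyTorch sketch of MoE dispatch and combine; in expert-parallel training the gather/scatter below becomes an all-to-all exchange across GPUs, which is exactly the communication COMET targets. The sizes and the top-1 router are illustrative, not taken from COMET.

```python
import torch

num_tokens, hidden, num_experts = 16, 32, 4
tokens = torch.randn(num_tokens, hidden)
experts = [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]

# Router assigns each token to one expert (top-1 gating for simplicity).
logits = torch.randn(num_tokens, num_experts)
assignment = logits.argmax(dim=-1)

output = torch.empty_like(tokens)
for e in range(num_experts):
    # Dispatch: group this expert's tokens (an all-to-all send when
    # experts live on different GPUs).
    idx = (assignment == e).nonzero(as_tuple=True)[0]
    if idx.numel() > 0:
        # Combine: write results back in token order (an all-to-all receive).
        output[idx] = experts[e](tokens[idx])
```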
To tackle these challenges, COMET introduces two core mechanisms. The first is shared tensor dependency parsing, which bridges the granularity mismatch between communication and computation by decomposing and rescheduling shared tensors. The system splits the shared tensors passed between MoE layers along the token dimension or the hidden dimension, aligning the smallest units of communication and computation. By dynamically reordering data-block computation, it processes local data blocks first while asynchronously fetching remote tokens, eliminating wait times.
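The following is a minimal sketch of that scheduling idea, not COMET's implementation: the shared tensor is split into chunks along the token dimension, chunks already resident locally are computed first, and fetches of remote chunks (stubbed out here) are overlapped with that work. The chunk count, the `local` set, and the `fetch_remote` stub are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import torch

def fetch_remote(chunk):
    # Placeholder for pulling remote tokens; in a real system this
    # would be NVLink / RDMA traffic running off the compute path.
    return chunk.clone()

def compute(chunk, weight):
    return chunk @ weight  # stands in for the expert GEMM

tokens = torch.randn(8, 64, 128)   # 8 chunks along the token dimension
weight = torch.randn(128, 128)
local = {0, 1, 2, 3}               # chunks already resident on this GPU

with ThreadPoolExecutor() as pool:
    # Kick off remote fetches first so they overlap with local compute.
    pending = {i: pool.submit(fetch_remote, tokens[i])
               for i in range(len(tokens)) if i not in local}
    results = {i: compute(tokens[i], weight) for i in local}
    for i, fut in pending.items():
        results[i] = compute(fut.result(), weight)
```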
The second mechanism is adaptive load balancing, which dynamically allocates GPU thread-block resources to balance communication and computation loads precisely. By encapsulating communication and computation tasks in separate thread blocks, it prevents remote I/O from blocking compute cores. The system adjusts the thread-block allocation in real time based on input size and parallelism strategy, switching operators dynamically at runtime with zero overhead so that it consistently delivers low-latency operators.
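A hedged Python sketch of how such zero-overhead switching can work: operator variants with different communication/computation thread-block splits are built ahead of time, and at runtime the system merely looks up the closest variant for the current input, so no compilation sits on the critical path. The heuristic, block counts, and variant names below are invented for illustration and are not COMET's actual policy.

```python
# Pre-built fused-operator variants keyed by their thread-block split
# (stubbed as strings here; in practice these are compiled kernels).
PRECOMPILED = {
    (8, 124):  "kernel_comm8_comp124",
    (16, 116): "kernel_comm16_comp116",
    (32, 100): "kernel_comm32_comp100",
}

def pick_operator(num_tokens: int, ep_size: int, total_blocks: int = 132):
    # Illustrative heuristic: the more tokens that cross devices,
    # the more thread blocks are devoted to communication.
    remote_fraction = 1.0 - 1.0 / ep_size
    comm_blocks = max(8, int(total_blocks * remote_fraction * num_tokens / 65536))
    # Snap to the nearest pre-built variant: a table lookup instead of
    # runtime compilation is what makes the switch "zero overhead".
    key = min(PRECOMPILED, key=lambda k: abs(k[0] - comm_blocks))
    return PRECOMPILED[key]

print(pick_operator(num_tokens=4096, ep_size=8))
```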
To validate COMET's performance, the team conducted end-to-end evaluations on several large-scale MoE models. On an 8-GPU H800 cluster, COMET reduced MoE forward latency by 31.8%-44.4% compared with other baseline systems. On individual MoE layers, COMET executed notably faster than baseline solutions, with speedups ranging from 1.28x to 2.37x.
COMET has already been deployed in production clusters with tens of thousands of GPUs, supporting efficient MoE model training and cumulatively saving millions of GPU hours. The system runs stably across different parallelism strategies, input sizes, and hardware environments, demonstrating strong robustness and generalization.
Comprising approximately 12,000 lines of C++ and CUDA code and 2,000 lines of Python, COMET provides developers with a user-friendly Python API. It establishes a fine-grained pipelined programming paradigm for MoE, fusing communication operations and GEMM computations within a single operator. It can be integrated directly into existing MoE training frameworks, supports multiple parallel modes, and offers flexible plug-and-play deployment. The core code has been open-sourced, with planned compatibility with compilation ecosystems such as Triton.
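As a rough illustration of what such plug-and-play integration looks like, here is a hypothetical usage sketch; the `comet` module and the `replace_moe_layers` function are invented names, and the real API should be taken from the open-sourced repository.

```python
import torch

# Toy model: the Linear layers stand in for attention and MoE blocks.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.Linear(1024, 1024),
)

# import comet                                          # hypothetical
# model = comet.replace_moe_layers(model, ep_group=ep_group)
#
# After the swap, each MoE layer would run as one fused operator in
# which the token all-to-all and the expert GEMMs share the GPU,
# instead of separate communication and computation kernels running
# back to back.

out = model(torch.randn(2, 1024))
```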