ByteDance's Doubao large model team has recently open-sourced COMET, a communication optimization system that addresses the heavy communication overhead Mixture-of-Experts (MoE) models incur during distributed training. While the MoE architecture's sparse activation mechanism sidesteps the computational bottlenecks of traditional dense models, cross-device communication in distributed training has become a major constraint on training efficiency and cost.
In MoE models, expert networks are distributed across multiple GPUs, so every computation step requires dispatching tokens to the right experts and aggregating the results, leaving GPU resources idle for substantial stretches. Existing solutions focus on overlapping computation with communication, but they suffer from high memory consumption, intrusive modifications to training frameworks, and inefficient resource utilization.
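To make the pattern concrete, here is a minimal single-process PyTorch sketch of MoE dispatch and combine; in expert-parallel training the gather/scatter below becomes an all-to-all exchange across GPUs, which is exactly the communication COMET targets. The sizes and the top-1 router are illustrative, not taken from COMET.

```python
import torch

num_tokens, hidden, num_experts = 16, 32, 4
tokens = torch.randn(num_tokens, hidden)
experts = [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]

# Router assigns each token to one expert (top-1 gating for simplicity).
logits = torch.randn(num_tokens, num_experts)
assignment = logits.argmax(dim=-1)

output = torch.empty_like(tokens)
for e in range(num_experts):
    # Dispatch: group this expert's tokens (an all-to-all send when
    # experts live on different GPUs).
    idx = (assignment == e).nonzero(as_tuple=True)[0]
    if idx.numel() > 0:
        # Combine: write results back in token order (an all-to-all receive).
        output[idx] = experts[e](tokens[idx])
```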
To tackle these challenges, COMET introduces two core mechanisms. The first is shared tensor dependency parsing, which bridges the granularity mismatch between communication and computation by decomposing and rescheduling shared tensors. The system splits the shared tensors passed between MoE layers along the token dimension or the hidden dimension, aligning the smallest units of communication and computation. By dynamically reordering data-block computation, it processes local data blocks first while asynchronously fetching remote tokens, eliminating wait times.
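The following is a minimal sketch of that scheduling idea, not COMET's implementation: the shared tensor is split into chunks along the token dimension, chunks already resident locally are computed first, and fetches of remote chunks (stubbed out here) are overlapped with that work. The chunk count, the `local` set, and the `fetch_remote` stub are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import torch

def fetch_remote(chunk):
    # Placeholder for pulling remote tokens; in a real system this
    # would be NVLink / RDMA traffic running off the compute path.
    return chunk.clone()

def compute(chunk, weight):
    return chunk @ weight  # stands in for the expert GEMM

tokens = torch.randn(8, 64, 128)   # 8 chunks along the token dimension
weight = torch.randn(128, 128)
local = {0, 1, 2, 3}               # chunks already resident on this GPU

with ThreadPoolExecutor() as pool:
    # Kick off remote fetches first so they overlap with local compute.
    pending = {i: pool.submit(fetch_remote, tokens[i])
               for i in range(len(tokens)) if i not in local}
    results = {i: compute(tokens[i], weight) for i in local}
    for i, fut in pending.items():
        results[i] = compute(fut.result(), weight)
```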
The second mechanism is adaptive load balancing, which dynamically allocates GPU thread-block resources to balance communication and computation loads precisely. By encapsulating communication and computation tasks in separate thread blocks, it prevents remote I/O from blocking compute cores. The system adjusts the thread-block allocation in real time based on input size and parallelism strategy, switching operators dynamically at runtime with zero overhead so that it consistently delivers low-latency operators.
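A hedged Python sketch of how such zero-overhead switching can work: operator variants with different communication/computation thread-block splits are built ahead of time, and at runtime the system merely looks up the closest variant for the current input, so no compilation sits on the critical path. The heuristic, block counts, and variant names below are invented for illustration and are not COMET's actual policy.

```python
# Pre-built fused-operator variants keyed by their thread-block split
# (stubbed as strings here; in practice these are compiled kernels).
PRECOMPILED = {
    (8, 124):  "kernel_comm8_comp124",
    (16, 116): "kernel_comm16_comp116",
    (32, 100): "kernel_comm32_comp100",
}

def pick_operator(num_tokens: int, ep_size: int, total_blocks: int = 132):
    # Illustrative heuristic: the more tokens that cross devices,
    # the more thread blocks are devoted to communication.
    remote_fraction = 1.0 - 1.0 / ep_size
    comm_blocks = max(8, int(total_blocks * remote_fraction * num_tokens / 65536))
    # Snap to the nearest pre-built variant: a table lookup instead of
    # runtime compilation is what makes the switch "zero overhead".
    key = min(PRECOMPILED, key=lambda k: abs(k[0] - comm_blocks))
    return PRECOMPILED[key]

print(pick_operator(num_tokens=4096, ep_size=8))
```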
To validate COMET's performance, the team conducted end-to-end evaluations on several large-scale MoE models. On an 8-GPU H800 cluster, COMET reduced MoE forward latency by 31.8%-44.4% compared with other baseline systems. On individual MoE layers, COMET executed notably faster than baseline solutions, with speedups ranging from 1.28x to 2.37x.
COMET has already been deployed in production clusters with tens of thousands of GPUs, supporting efficient MoE model training and cumulatively saving millions of GPU hours. The system runs stably across different parallelism strategies, input sizes, and hardware environments, demonstrating strong robustness and generalization.
Comprising approximately 12,000 lines of C++ and CUDA code and 2,000 lines of Python, COMET provides developers with a user-friendly Python API. It establishes a fine-grained pipelined programming paradigm for MoE, fusing communication operations and GEMM computations within a single operator. It can be integrated directly into existing MoE training frameworks, supports multiple parallel modes, and offers flexible plug-and-play deployment. The core code has been open-sourced, with planned compatibility with compilation ecosystems such as Triton.
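As a rough illustration of what such plug-and-play integration looks like, here is a hypothetical usage sketch; the `comet` module and the `replace_moe_layers` function are invented names, and the real API should be taken from the open-sourced repository.

```python
import torch

# Toy model: the Linear layers stand in for attention and MoE blocks.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.Linear(1024, 1024),
)

# import comet                                          # hypothetical
# model = comet.replace_moe_layers(model, ep_group=ep_group)
#
# After the swap, each MoE layer would run as one fused operator in
# which the token all-to-all and the expert GEMMs share the GPU,
# instead of separate communication and computation kernels running
# back to back.

out = model(torch.randn(2, 1024))
```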