DeepSeek Open-Source Week Day 3: Introducing DeepGEMM, an Open-Source Matrix Multiplication Library

2025-02-26

On the third day of Open Source Week, the DeepSeek team unveiled its latest matrix multiplication library, DeepGEMM. Designed specifically for the NVIDIA Hopper architecture, DeepGEMM is optimized for FP8 (8-bit floating-point) General Matrix Multiplication (GEMM). The library provides efficient kernels for both dense GEMMs and the grouped GEMMs used in Mixture-of-Experts (MoE) models.

DeepGEMM's key strength lies in its lightweight design and high performance. Written in CUDA, it requires no compilation at install time: a lightweight Just-In-Time (JIT) module compiles all kernels at runtime instead. This approach simplifies deployment while keeping the code flexible and adaptable.
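The idea behind runtime kernel generation can be sketched in a few lines: kernels are specialized for a concrete problem shape and cached, so each shape compiles once on first use. The sketch below is purely conceptual and does not use DeepGEMM's actual API; a Python closure stands in for the CUDA source that DeepGEMM would emit and compile.

```python
# Conceptual sketch (NOT DeepGEMM's real API): JIT-style kernel
# specialization keyed by problem shape, with a compile cache.
_kernel_cache = {}

def get_kernel(m, n, k):
    """Return a matmul kernel specialized for (m, n, k), compiling on first use."""
    key = (m, n, k)
    if key not in _kernel_cache:
        # DeepGEMM emits and compiles CUDA source here; we stand in with
        # generated Python source that bakes the shape into the kernel.
        src = (
            "def kernel(a, b):\n"
            f"    # specialized for M={m}, N={n}, K={k}\n"
            f"    return [[sum(a[i][p] * b[p][j] for p in range({k}))\n"
            f"             for j in range({n})] for i in range({m})]\n"
        )
        ns = {}
        exec(src, ns)  # stands in for invoking the CUDA compiler at runtime
        _kernel_cache[key] = ns["kernel"]
    return _kernel_cache[key]
```

Calling `get_kernel` twice with the same shape returns the same cached kernel, so compilation cost is paid only once per shape.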

Performance-wise, DeepGEMM demonstrates exceptional throughput on NVIDIA H800 GPUs. For a dense GEMM of shape M=64, N=2112, K=7168, it reaches 206 TFLOPS, a 2.7x speedup over an optimized implementation based on CUTLASS 3.6. For MoE grouped GEMMs, DeepGEMM delivers consistent speedups of 1.1x to 1.2x.
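The headline figure is easy to put in perspective: a GEMM of shape M×N×K performs 2·M·N·K floating-point operations (one multiply and one add per term), so the stated shape and throughput imply a per-GEMM runtime of only a few microseconds.

```python
# Back-of-the-envelope check of the reported number: a GEMM does
# 2*M*N*K floating-point operations (one multiply + one add per term).
M, N, K = 64, 2112, 7168          # shape quoted in the article
flops = 2 * M * N * K             # ~1.94e9 FLOPs for this shape
tflops = 206e12                   # reported throughput on H800
runtime_us = flops / tflops * 1e6
print(f"{flops / 1e9:.2f} GFLOP -> {runtime_us:.1f} microseconds per GEMM")
```

At roughly 9 microseconds per call, kernel launch and data movement overheads matter as much as raw math throughput, which is part of why the TMA-based overlapping described below pays off.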

Technologically, DeepGEMM introduces innovative features such as a two-level accumulation mechanism that promotes partial sums to CUDA cores, effectively addressing precision issues in FP8 computation. It also supports unaligned block sizes such as 112, further improving Streaming Multiprocessor (SM) utilization. Additionally, DeepGEMM deeply integrates the Hopper architecture's Tensor Memory Accelerator (TMA), overlapping asynchronous data transfers with computation to boost overall efficiency.
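Why two-level accumulation helps can be demonstrated without a GPU. When many small terms are summed in a low-precision type, the running sum eventually grows so large that each new addend rounds away to nothing. The sketch below uses float16 as a stand-in, since NumPy has neither an FP8 type nor access to tensor-core accumulators; the principle (keep partial sums short, promote them to a wider type) is the same.

```python
import numpy as np

# Illustration of why promoting partial sums matters: accumulating many
# small terms in a low-precision type stalls once the running sum's ULP
# exceeds the addend. float16 stands in for the low-precision accumulator.
vals = np.full(65536, 0.01, dtype=np.float16)   # true sum ~ 655.36

# Naive: keep the accumulator in low precision the whole time.
naive = np.float16(0.0)
for v in vals:
    naive = np.float16(naive + v)               # stalls far below the true sum

# Two-level: short low-precision partial sums, then promote to float32.
partials = vals.reshape(-1, 128).sum(axis=1)    # each chunk sum stays small
promoted = partials.astype(np.float32).sum()    # wide-precision accumulation

print(f"naive float16 sum: {float(naive):.2f}")
print(f"two-level sum:     {float(promoted):.2f}")
```

The naive sum stalls at a small fraction of the true value, while the two-level result lands within a fraction of a percent of it; DeepGEMM applies the same idea to FP8 tensor-core output.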

DeepGEMM is tailored for large models such as DeepSeek-V3/R1, supporting dense GEMMs and MoE grouped GEMMs for both inference and training. Developers can deploy the library quickly in environments with Python 3.8+ and CUDA 12.8+.
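Getting started is correspondingly simple; since kernels are JIT-compiled at first use, there is no lengthy build step. The commands below are illustrative (the repository path under the deepseek-ai organization is assumed from the announcement); consult the project README for the exact installation steps.

```shell
# Illustrative setup -- see the repository README for exact instructions.
# Requires Python 3.8+ and CUDA 12.8+ as noted above.
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
# Install per the README; kernels themselves are JIT-compiled at first use.
```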

DeepGEMM is open-sourced under the MIT license and hosted on GitHub. This initiative not only provides a template for Hopper architecture optimization for AI researchers but also opens up opportunities for community contributors to further enhance matrix computation techniques.

While DeepGEMM may not outperform expert-tuned libraries on every shape, its simple design, strong performance, and innovative optimizations make it an invaluable resource for learning Hopper FP8 matrix multiplication and optimization techniques. The DeepSeek-AI team looks forward to more developers joining the effort to advance matrix computation technology.

Related Information:

DeepSeek Launches Open Source Week, Unveiling Its First Open Source Project – FlashMLA

Day Two of DeepSeek's Open Source Week: Release of MoE Model Communication Library DeepEP