DeepSeek Launches Open-Source Week with Its First Project: FlashMLA

2025-02-24

DeepSeek has officially launched its "Open Source Week" initiative with the release of its first open-source project, FlashMLA. FlashMLA is an efficient Multi-head Latent Attention (MLA) decoding kernel optimized for NVIDIA Hopper-architecture GPUs and designed specifically for serving variable-length sequences.

Inspired by the FlashAttention 2 and 3 and NVIDIA CUTLASS projects, FlashMLA aims to improve memory and computational efficiency through kernel-level optimization. It requires CUDA 12.3 or later and PyTorch 2.0 or later, giving developers a solid technical foundation to build on.
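As a quick sanity check before installing, the following minimal sketch verifies that the local PyTorch build meets these version requirements and that the GPU is Hopper-class. It uses only standard PyTorch introspection and is not part of FlashMLA itself.

```python
# Minimal environment check for FlashMLA's stated requirements:
# PyTorch >= 2.0 and CUDA >= 12.3 on an NVIDIA Hopper GPU.
import torch

torch_ok = tuple(int(x) for x in torch.__version__.split("+")[0].split(".")[:2]) >= (2, 0)
cuda_ver = torch.version.cuda  # e.g. "12.4"; None if this build has no CUDA support
cuda_ok = cuda_ver is not None and tuple(int(x) for x in cuda_ver.split(".")[:2]) >= (12, 3)

print(f"PyTorch {torch.__version__}: {'OK' if torch_ok else 'needs >= 2.0'}")
print(f"CUDA {cuda_ver}: {'OK' if cuda_ok else 'needs >= 12.3'}")

if torch.cuda.is_available():
    # Hopper GPUs (e.g. H100/H800) report compute capability 9.x.
    major, minor = torch.cuda.get_device_capability()
    print(f"GPU compute capability {major}.{minor}: "
          f"{'Hopper-class' if major == 9 else 'not Hopper (sm90) class'}")
```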

Technically, FlashMLA uses the BF16 data format to balance accuracy and performance. It also adopts a paged key-value (KV) cache with a block size of 64, which lets memory be managed at a fine granularity and markedly improves memory utilization when serving large-scale workloads.
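To make the paged KV cache idea concrete, here is a small illustrative sketch of how 64-token blocks and a per-request block table can map logical token positions to physical cache blocks. The pool size, head dimension, and helper names are assumptions made for this illustration, not FlashMLA's internal data structures.

```python
# Illustrative paged KV cache: tokens are stored in fixed-size blocks of 64,
# and a per-request block table maps logical block indices to physical blocks.
import torch

BLOCK_SIZE = 64    # block size used by FlashMLA's paged KV cache
NUM_BLOCKS = 1024  # total physical blocks in the pool (illustrative)
HEAD_DIM = 576     # illustrative per-token KV width

# Physical pool of KV blocks: [num_blocks, block_size, head_dim], BF16 as in FlashMLA.
kv_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM, dtype=torch.bfloat16)

free_blocks = list(range(NUM_BLOCKS))

def allocate_block_table(seq_len: int) -> list[int]:
    """Reserve just enough 64-token blocks for a sequence of seq_len tokens."""
    num_needed = (seq_len + BLOCK_SIZE - 1) // BLOCK_SIZE
    return [free_blocks.pop() for _ in range(num_needed)]

def kv_slot(block_table: list[int], token_pos: int) -> tuple[int, int]:
    """Map a logical token position to (physical block, offset within block)."""
    return block_table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

# A request with 150 cached tokens needs ceil(150 / 64) = 3 blocks.
table = allocate_block_table(150)
block, offset = kv_slot(table, 149)  # last cached token
print(len(table), block, offset)     # -> 3, <some block id>, 21
```

Because each sequence only reserves whole 64-token blocks rather than one contiguous region sized for the worst case, many requests with very different lengths can share the same physical pool with little waste.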

In terms of hardware performance, FlashMLA performs impressively on the NVIDIA H800 SXM5 GPU: in memory-bound configurations it reaches up to 3000 GB/s of memory bandwidth, and in compute-bound configurations it delivers up to 580 TFLOPS. These figures underline FlashMLA's strength on demanding inference workloads.

FlashMLA's core techniques are block scheduling and parallel computation, combined with optimized memory access patterns. By splitting the attention computation into smaller blocks that are processed in parallel, it makes full use of the GPU's parallel compute; by optimizing how memory is accessed, it reduces memory traffic and further improves performance on large-scale data.
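For intuition, the following reference sketch shows block-wise attention with an online softmax, the kind of decomposition described above, written in plain PyTorch. FlashMLA fuses this pattern into optimized Hopper kernels, so this code is purely illustrative.

```python
# Reference (non-fused) tiled attention: process the KV sequence in blocks and
# combine partial results with an online softmax, so the full attention matrix
# is never materialized.
import math
import torch

def tiled_attention(q, k, v, block_size: int = 64):
    # q: [heads, d], a single decode-step query; k, v: [seq, heads, d]
    heads, d = q.shape
    scale = 1.0 / math.sqrt(d)

    acc = torch.zeros(heads, d, dtype=torch.float32)  # weighted-value accumulator
    row_max = torch.full((heads,), float("-inf"))     # running max of logits
    row_sum = torch.zeros(heads)                      # running softmax denominator

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size].float()   # [blk, heads, d]
        v_blk = v[start:start + block_size].float()
        logits = torch.einsum("hd,bhd->hb", q.float(), k_blk) * scale  # [heads, blk]

        new_max = torch.maximum(row_max, logits.max(dim=1).values)
        correction = torch.exp(row_max - new_max)      # rescale earlier partial sums
        p = torch.exp(logits - new_max[:, None])       # unnormalized block probabilities

        row_sum = row_sum * correction + p.sum(dim=1)
        acc = acc * correction[:, None] + torch.einsum("hb,bhd->hd", p, v_blk)
        row_max = new_max

    return (acc / row_sum[:, None]).to(q.dtype)

# Tiny check against a naive reference implementation.
q, k, v = torch.randn(16, 64), torch.randn(300, 16, 64), torch.randn(300, 16, 64)
scores = torch.einsum("hd,shd->hs", q, k) / math.sqrt(64)
ref = torch.einsum("hs,shd->hd", torch.softmax(scores, dim=-1), v)
print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4))  # -> True
```

Because each block only carries a running maximum and running sum forward, the kernel never needs to hold the full attention matrix in memory, which is what keeps memory traffic low.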

For developers, installation and deployment are straightforward: running python setup.py install builds and installs FlashMLA, and the bundled benchmark script, python tests/test_flash_mla.py, can be used to verify correctness and measure performance.
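Once installed, the kernel is typically driven from PyTorch. The sketch below follows the usage pattern documented in the project's README (the functions get_mla_metadata and flash_mla_with_kvcache); the tensor shapes and sizes are illustrative assumptions for a single decoding step and should be checked against the repository before use.

```python
# Hedged sketch of calling FlashMLA from PyTorch after installation.
# Function names come from the project's README; the shapes below are
# illustrative assumptions, not a specification.
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

batch, s_q, h_q, h_kv = 4, 1, 128, 1   # one decode-step query per request (assumed shapes)
d, dv, block_size = 576, 512, 64       # assumed head dims; 64 matches the paged-KV block size

cache_seqlens = torch.full((batch,), 1024, dtype=torch.int32, device="cuda")
q = torch.randn(batch, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")

max_blocks = (int(cache_seqlens.max()) + block_size - 1) // block_size
block_table = torch.arange(batch * max_blocks, dtype=torch.int32,
                           device="cuda").view(batch, max_blocks)
kv_cache = torch.randn(batch * max_blocks, block_size, h_kv, d,
                       dtype=torch.bfloat16, device="cuda")

# Scheduling metadata is computed once per decoding step and reused across layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)

out, lse = flash_mla_with_kvcache(
    q, kv_cache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True,
)
print(out.shape)  # expected: [batch, s_q, h_q, dv]
```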