DeepSeek recently unveiled its latest large language model, DeepSeek-V3. According to its benchmark results, it is now the most capable open-source large language model available. Impressively, despite a training cost of only about $5.6 million, far below what major tech companies typically spend, its performance rivals that of leading closed-source models.
DeepSeek-V3 was trained in roughly 2.79 million H800 GPU hours at a cost of approximately $5.6 million, far less than its competitors spend. Across a range of benchmarks, its performance is on par with GPT-4o and Claude 3.5 Sonnet, and it is particularly strong on mathematical and programming tasks. This efficiency is attributed to innovative architecture and training techniques, including an auxiliary-loss-free load-balancing strategy.
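To make that idea concrete, here is a minimal sketch of bias-based routing in the spirit of DeepSeek's auxiliary-loss-free strategy: a per-expert bias is added to the routing scores only when selecting experts, and after each batch the bias is nudged so that overloaded experts are chosen less often. The function names, the sign-based update, and the step size are illustrative assumptions rather than DeepSeek's actual implementation.

```python
import numpy as np

def route_tokens(scores, bias, k):
    """Pick top-k experts per token using bias-adjusted scores.

    The bias steers selection only; the original scores would still supply the
    gating weights, so the bias does not distort the model's output.
    """
    adjusted = scores + bias                     # (num_tokens, num_experts)
    return np.argsort(-adjusted, axis=1)[:, :k]  # indices of the chosen experts

def update_bias(bias, chosen, num_experts, step=0.01):
    """Nudge each expert's bias toward a balanced load.

    Overloaded experts have their bias lowered (picked less often next step),
    underloaded experts have it raised; no auxiliary loss term is needed.
    """
    load = np.bincount(chosen.ravel(), minlength=num_experts)
    target = chosen.size / num_experts           # ideal tokens per expert
    return bias - step * np.sign(load - target)

# Toy usage: the router systematically favors the last experts, and the bias
# learns to compensate over a few hundred steps.
rng = np.random.default_rng(0)
num_tokens, num_experts, k = 256, 8, 2
bias = np.zeros(num_experts)
for _ in range(200):
    scores = rng.normal(size=(num_tokens, num_experts)) + np.linspace(0, 1, num_experts)
    chosen = route_tokens(scores, bias, k)
    bias = update_bias(bias, chosen, num_experts)
print("per-expert load on the last batch:", np.bincount(chosen.ravel(), minlength=num_experts))
```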
Notably, DeepSeek, a relatively small startup, achieved this milestone on a limited budget. Andrej Karpathy, a founding member of OpenAI, commented on social media that DeepSeek had trained a state-of-the-art large language model on a modest budget and open-sourced its weights, making it look easy. DeepSeek is reportedly funded entirely by its parent hedge fund and has not sought external investment.
The technical core of DeepSeek-V3 is its Mixture-of-Experts (MoE) architecture, which has 671 billion total parameters but activates only 37 billion per token. This selective activation, combined with innovative training techniques, lets the model achieve high performance while keeping compute costs down. In mathematical reasoning and programming tasks in particular, DeepSeek-V3 sometimes outperforms models from industry leaders such as OpenAI and Anthropic.
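The efficiency argument rests on the fact that per-token compute scales with the activated parameters (37 billion), not the total (671 billion). The toy layer below, a generic top-k MoE sketch rather than DeepSeek's actual design (which adds shared experts, sigmoid gating, and fine-grained expert segmentation), shows the mechanism: a router scores every expert, but each token is processed by only its top-k experts.

```python
import numpy as np

class ToyMoELayer:
    """Toy mixture-of-experts layer: each token runs through only its top-k experts."""

    def __init__(self, d_model, d_ff, num_experts, k, rng):
        self.k = k
        self.router = rng.normal(0, 0.02, (d_model, num_experts))
        # Each expert is a tiny two-layer feed-forward network.
        self.w_in = rng.normal(0, 0.02, (num_experts, d_model, d_ff))
        self.w_out = rng.normal(0, 0.02, (num_experts, d_ff, d_model))

    def forward(self, x):                        # x: (num_tokens, d_model)
        scores = x @ self.router                 # (num_tokens, num_experts)
        topk = np.argsort(-scores, axis=1)[:, :self.k]
        out = np.zeros_like(x)
        for t, experts in enumerate(topk):
            gates = np.exp(scores[t, experts])
            gates /= gates.sum()                 # normalize over the chosen experts only
            for g, e in zip(gates, experts):
                hidden = np.maximum(x[t] @ self.w_in[e], 0.0)  # ReLU expert FFN
                out[t] += g * (hidden @ self.w_out[e])         # the other experts never run
        return out

rng = np.random.default_rng(0)
layer = ToyMoELayer(d_model=32, d_ff=64, num_experts=8, k=2, rng=rng)
print(layer.forward(rng.normal(size=(4, 32))).shape)  # (4, 32): only 2 of 8 experts ran per token
```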
DeepSeek says it used FP8 mixed-precision training and an efficient pipeline-parallelism scheme, significantly reducing compute requirements. By comparison, Meta's Llama 3.1 405B reportedly required about 30.8 million GPU hours to train, meaning DeepSeek-V3 used roughly 11 times fewer GPU hours.
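Taking the reported numbers at face value, a quick back-of-the-envelope check of that ratio and of the headline cost (the roughly $2-per-GPU-hour rental rate is the assumption DeepSeek's own report uses; it is not a measured figure):

```python
# Rough cross-check of the efficiency and cost figures quoted above.
deepseek_gpu_hours = 2.788e6   # H800 GPU hours reported for DeepSeek-V3
llama_gpu_hours = 30.8e6       # GPU hours reported for Llama 3.1 405B

print(f"training-hour ratio: {llama_gpu_hours / deepseek_gpu_hours:.1f}x")  # ~11x

# Assumed rental rate of ~$2 per H800 GPU hour, as in DeepSeek's own estimate.
print(f"estimated cost: ${deepseek_gpu_hours * 2 / 1e6:.2f}M")              # ~$5.58M
```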
Considering that some of the largest AI training clusters use around 100,000 GPUs and can cost billions of dollars, DeepSeek-V3's achievement is even more remarkable. The model was trained using 2,048 H800 GPUs over about two months, demonstrating that efficient architecture and training methods can significantly reduce the resources needed for cutting-edge AI development.
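Those two figures are mutually consistent: dividing the reported GPU hours by the cluster size gives just under two months of wall-clock time, as the small check below (using the same assumed totals) shows.

```python
# Sanity check: reported GPU hours divided by cluster size gives wall-clock time.
gpus = 2048
gpu_hours = 2.788e6
days = gpu_hours / gpus / 24
print(f"about {days:.0f} days on {gpus} GPUs")   # ~57 days, i.e. just under two months
```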
However, DeepSeek-V3's success has also sparked some controversy. There are concerns about whether its training data includes outputs from proprietary models such as GPT-4 or Claude 3.5. If true, training on that data would violate those providers' terms of service, a practice some commentators have dubbed "ToS laundering."
Despite these concerns, the open release of DeepSeek-V3's weights on the Hugging Face platform fits the broader trend of democratizing AI capabilities. Its auxiliary-loss-free load-balancing strategy and multi-token prediction (MTP) objective set a new bar for training efficiency and inference speed.
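To illustrate the multi-token-prediction idea, the sketch below computes a training loss over two prediction depths: one head predicts the next token and a second head predicts the token after that, so every position supplies more than one learning signal. This is a generic MTP sketch under simplifying assumptions (random embeddings standing in for transformer hidden states, plain linear heads); DeepSeek-V3's actual MTP modules chain lightweight transformer blocks and share the embedding and output head.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, seq_len = 100, 32, 16

# Toy stand-ins: random embeddings play the role of the transformer trunk.
embed = rng.normal(0, 0.02, (vocab, d_model))
head_next = rng.normal(0, 0.02, (d_model, vocab))     # predicts the token at t+1
head_second = rng.normal(0, 0.02, (d_model, vocab))   # extra head: predicts the token at t+2

def mtp_loss(tokens):
    """Average cross-entropy over two prediction depths (t+1 and t+2)."""
    h = embed[tokens]                                  # (seq_len, d_model)
    total, count = 0.0, 0
    for depth, head in ((1, head_next), (2, head_second)):
        logits = h[:-depth] @ head                     # predict the token `depth` steps ahead
        targets = tokens[depth:]
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total += -logp[np.arange(len(targets)), targets].sum()
        count += len(targets)
    return total / count

tokens = rng.integers(0, vocab, size=seq_len)
print("toy multi-token-prediction loss:", mtp_loss(tokens))
```

At inference time the extra prediction head can also be reused to draft tokens for speculative decoding, which is how multi-token prediction translates into faster generation.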
For the AI industry, DeepSeek-V3 may signal a paradigm shift in how large language models are developed. It suggests that, with clever engineering and efficient training methods, cutting-edge AI capabilities can be reached without the massive computational resources previously assumed to be necessary.
As the industry digests these developments, DeepSeek-V3's success could prompt a reevaluation of current approaches to AI model development. With the gap between open-source and closed-source models narrowing, companies may need to reassess their strategies and value propositions in an increasingly competitive landscape.