Recently, Chinese AI company DeepSeek launched a series of large language models designed specifically for reasoning tasks: the R1 series. The models have been made publicly available on the Hugging Face platform.
The R1 series consists primarily of two models: R1 and R1-Zero. According to DeepSeek, R1 has outperformed OpenAI's o1 model on multiple reasoning benchmarks. Although R1-Zero is somewhat less capable, it holds significant potential for machine learning research.
Both large language models employ a Mixture of Experts (MoE) architecture with 671 billion parameters. An MoE model comprises several expert neural networks, each optimized for a different set of tasks. When processing a prompt, a routing mechanism directs the query to the most suitable expert.
The key advantage of the MoE architecture is reduced inference cost. In an MoE model, a user's input activates only the specific expert networks needed to generate the response, rather than the entire model. As a result, R1 and R1-Zero activate less than one-tenth of their total parameters when responding to a prompt.
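As a rough illustration of how such sparse routing reduces compute, here is a minimal PyTorch sketch of a generic top-k MoE layer. The names (SparseMoELayer, num_experts, top_k, d_model) are illustrative placeholders, and the structure is a textbook simplification, not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative sparse Mixture-of-Experts layer: for each token, a router
    scores all experts, but only the top-k experts are actually evaluated."""

    def __init__(self, d_model: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # routing mechanism
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x)                               # (num_tokens, num_experts)
        weights, indices = torch.topk(scores, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                   # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out
```

Because each token runs through only top_k of the experts, the compute per forward pass scales with the active experts rather than with the full parameter count, which is the effect described above.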
When training R1-Zero, DeepSeek took an unconventional approach. Large language models optimized for reasoning are typically trained with both reinforcement learning and supervised fine-tuning: reinforcement learning teaches the model to perform tasks through trial and error, while supervised fine-tuning improves output quality by showing the model examples of how tasks should be carried out.
In contrast, DeepSeek skipped the supervised fine-tuning phase for R1-Zero. Despite this omission, the model still acquired reasoning skills such as breaking complex tasks down into simpler subtasks. According to DeepSeek, this is the first open research to validate that large language models can develop reasoning abilities through reinforcement learning alone, without supervised fine-tuning.
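To make the distinction concrete, the sketch below contrasts a supervised fine-tuning step (imitate a worked example) with a simple reward-weighted reinforcement-learning step (sample an answer, score it, reinforce it in proportion to the reward). It assumes a Hugging Face-style causal language model exposing .logits and .generate; the functions sft_step, rl_step, and reward_fn are hypothetical names, and the RL update is a simplified policy-gradient step, not DeepSeek's actual training recipe.

```python
import torch
import torch.nn.functional as F

def sft_step(model, optimizer, prompt_ids, target_ids):
    """Supervised fine-tuning: maximize the likelihood of a demonstrated answer."""
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    logits = model(input_ids).logits
    # Logits at position i predict token i+1, so these positions predict the target tokens.
    pred = logits[:, prompt_ids.size(1) - 1:-1, :]
    loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

def rl_step(model, optimizer, prompt_ids, reward_fn, max_new_tokens=256):
    """Trial and error: sample an answer, score it (e.g. 1.0 if the final answer
    is correct), and reinforce the sampled tokens in proportion to the reward."""
    with torch.no_grad():
        sample = model.generate(prompt_ids, do_sample=True, max_new_tokens=max_new_tokens)
    completion = sample[:, prompt_ids.size(1):]
    reward = reward_fn(completion)                      # scalar score for this sample
    logits = model(sample).logits[:, prompt_ids.size(1) - 1:-1, :]
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, completion.unsqueeze(-1)).squeeze(-1)
    loss = -(reward * token_logp.sum())                 # push up probability of rewarded samples
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The essential difference is the training signal: supervised fine-tuning needs curated example answers, whereas the reinforcement-learning step only needs a way to score the model's own attempts.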
Although R1-Zero has advanced reasoning capabilities, its output quality is limited, with issues such as "infinite repetition, poor readability, and language mixing." To address these limitations, DeepSeek developed R1, an enhanced version of R1-Zero whose modified training process includes the previously skipped supervised fine-tuning stage, significantly improving output quality.
DeepSeek ran roughly twenty benchmark tests comparing R1 against four popular large language models. R1 surpassed OpenAI's reasoning-optimized o1 model on several of them, and even on the benchmarks where o1 scored higher, R1 trailed by less than 5%.
Notably, R1 outperformed o1 on LiveCodeBench, a programming benchmark that is regularly updated with new exercises, which reduces the likelihood that AI models can find ready-made answers on the public internet.
Furthermore, DeepSeek released a range of smaller but more hardware-efficient models distilled from R1. These models are based on the open-source Llama and Qwen model families and range from 1.5 billion to 70 billion parameters. Among them, R1-Distill-Qwen-32B outperformed OpenAI's scaled-down o1-mini model on several benchmarks.
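Distillation, in general, trains a small "student" model to reproduce the behavior of a larger "teacher." The sketch below shows one common generic formulation (matching the teacher's per-token output distribution with a KL-divergence loss, softened by a temperature); it is an illustration of the technique only, not DeepSeek's published distillation procedure, and distillation_step, student, and teacher are placeholder names assuming Hugging Face-style models.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, optimizer, input_ids, temperature=2.0):
    """One generic knowledge-distillation step: train the student to match the
    teacher's (temperature-softened) per-token output distribution."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits      # teacher is frozen
    student_logits = student(input_ids).logits

    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between teacher and student distributions, scaled by T^2
    # to keep gradient magnitudes comparable across temperatures.
    loss = F.kl_div(student_logp, teacher_probs, reduction="batchmean") * temperature ** 2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The practical appeal is that the student keeps much of the teacher's task behavior at a fraction of the parameter count, which is why the distilled 1.5B–70B models can run on far more modest hardware than the full 671B-parameter R1.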