Alibaba Open Sources Qwen2-Math: World's Top Mathematical Reasoning Model, Surpassing GPT-4o and Claude-3.5

2024-08-09

If you haven't heard of "Qwen2" yet, that's not surprising, but it may be about to change: a model family that has made significant breakthroughs in software development, engineering, and the STEM (science, technology, engineering, and mathematics) fields, particularly mathematics, has officially debuted.


What is Qwen2?

In the current flood of AI models, even technology enthusiasts find it hard to keep up. Alibaba Cloud, the cloud-computing arm of Chinese e-commerce giant Alibaba Group, has launched Qwen2, an open-source large language model (LLM) family comparable in influence to OpenAI's GPT series, Meta's Llama series, and Anthropic's Claude series.

Since August 2023, Alibaba Cloud has released multiple LLMs under the "Qwen" brand (known in Chinese as "Tongyi Qianwen"), including Qwen-1.8B, Qwen-7B, and Qwen-72B, with parameter counts ranging from 1.8 billion to 72 billion; parameter count correlates directly with a model's complexity and capability. Alibaba later introduced multimodal versions such as Qwen-Audio for audio input and Qwen-VL for visual input. In early June 2024, the flagship of the series, Qwen2, was officially launched in five sizes: 0.5B, 1.5B, 7B, 57B-A14B (a mixture-of-experts variant), and 72B. So far, Alibaba has released over a hundred Qwen-series AI models in various configurations. The market response has been positive: reportedly more than 90,000 Chinese companies adopted them in daily operations within the first year of launch.

Qwen2-Math: The Focus in the Field of Mathematics

Alibaba Cloud's Qwen team has officially introduced the Qwen2-Math series: English-language large language models specialized for mathematics. The flagship, Qwen2-Math-72B-Instruct, stands out among mathematical LLMs, scoring 84% on the MATH benchmark, a set of 12,500 competition-level problems whose intricate textual statements are particularly challenging for LLMs. The model also scored 96.7% on GSM8K, a grade-school math benchmark, and 47.8% on a college-level math benchmark, surpassing comparable models.


It is worth noting that although Microsoft's Orca-Math model achieved an accuracy similar to Qwen2-Math-7B-Instruct's on GSM8K, Alibaba's official comparison does not mention that result. Even the smallest version of Qwen2-Math (1.5 billion parameters) performs impressively, scoring 84.2% on GSM8K and 44.2% on the college-level math benchmark, nearly matching the 7B model, which is more than four times its size.

Prospects for the Application of Mathematical AI Models

While large language models (LLMs) initially found use in areas such as chatbots, enterprise question answering, document drafting, and information processing, math-focused LLMs give professionals who regularly solve mathematical problems and work with numbers a far more powerful tool. Although mathematics underpins much of programming, general-purpose LLMs have historically been unreliable at solving math problems, sometimes falling short of earlier AI and machine-learning systems.

The Alibaba Qwen2-Math team has expressed hope that the model can play a positive role in solving complex mathematical problems. For both enterprise and individual users, although Qwen2-Math is not released under a fully open-source license, its flexible licensing terms permit a wide range of commercial applications: as long as a product has no more than 100 million monthly active users, Qwen2-Math can be used commercially free of charge. That generous ceiling covers the needs of most startups, small and medium-sized enterprises, and even some large corporations.