Xiaomi's official technology Twitter account has shared exciting news: its large model team has made a significant breakthrough in audio reasoning. Inspired by DeepSeek-R1, the team applied a reinforcement learning algorithm to multimodal audio understanding tasks and, within just one week, reached 64.5% accuracy, taking first place on the internationally recognized MMAU (Massive Multi-Task Audio Understanding and Reasoning) benchmark. The team has also open-sourced the related technology.
The MMAU benchmark is a key measure of audio reasoning capability, covering 10,000 test cases drawn from speech, environmental sounds, and music. It is designed to comprehensively evaluate models across a wide range of audio understanding skills. Human experts achieve 82.23% accuracy on this benchmark, while the previous top-performing models were OpenAI's GPT-4o (57.3% accuracy) and Google DeepMind's Gemini 2.0 Flash (55.6% accuracy).
The research began by fine-tuning on the AVQA dataset released by Tsinghua University, which initially yielded an accuracy of 51.8%. The major leap came after applying the group relative policy optimization (GRPO) algorithm from DeepSeek-R1 to the Qwen2-Audio-7B model: with only 38,000 AVQA training samples, the team raised accuracy to 64.5%, surpassing existing commercial models.
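The article does not detail the training pipeline, but the core idea of GRPO is to sample a group of candidate answers for each question, score them, and normalize each answer's reward against its own group rather than training a separate value (critic) model. The sketch below is a minimal, hypothetical illustration of that group-relative advantage computation for a multiple-choice audio QA setting; the binary correctness reward and group size are assumptions for illustration, not Xiaomi's published configuration.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages in the spirit of GRPO.

    rewards: [num_questions, group_size] -- one row per question,
    one column per sampled answer. Each answer's advantage is its
    reward normalized by the mean/std of its own group, so no
    learned value model is required.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Hypothetical example: 2 questions, 4 sampled answers each,
# with a simple binary reward (1.0 if the chosen option is correct).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```

The resulting advantages then weight the policy-gradient update, so answers that beat their group's average are reinforced and the rest are suppressed.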
During the research, the team found that forcing the model to output explicit reasoning steps during training actually reduced accuracy to 61.1%. This suggests that explicit chain-of-thought output is not always conducive to training, and that the real-time feedback of reinforcement learning is more effective at helping the model lock onto high-quality answer distributions. Despite these notable results, the current accuracy still lags behind the 82.23% achieved by human experts.
This experimental result from Xiaomi's large model team not only highlights the distinctive advantages of reinforcement learning in audio reasoning but also opens new paths for future research. To promote further academic and industry exchange and collaboration, the team has open-sourced the training code, model parameters, and technical report.
Related resource links are as follows:
Training code: https://github.com/xiaomi-research/r1-aqa
Model parameters: https://huggingface.co/mispeech/r1-aqa
Technical report: https://arxiv.org/abs/2503.11197
Interactive demo: https://120.48.108.147:7860/
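For readers who want to try the open-sourced checkpoint, here is a minimal, unofficial loading sketch using Hugging Face transformers. It assumes the mispeech/r1-aqa checkpoint is compatible with the Qwen2-Audio processor/model classes and prompt format (true of its base model, but the model card is authoritative); the audio file name and question are placeholders.

```python
# Minimal sketch (assumptions: Qwen2-Audio-compatible classes and prompt format,
# placeholder audio file and question; see the model card for exact usage).
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "mispeech/r1-aqa"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Load an audio clip at the sampling rate the feature extractor expects.
audio, sr = librosa.load("example.wav", sr=processor.feature_extractor.sampling_rate)

# Qwen2-Audio-style prompt with an audio placeholder token.
prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Which instrument is playing in this clip?"
inputs = processor(text=prompt, audios=[audio], sampling_rate=sr, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```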