Alibaba Launches Qwen-Audio Series to Build a Universal Audio Understanding and Interaction Platform

2023-12-15


Researchers at Alibaba Group have developed Qwen-Audio, a large-scale audio-language model that covers a wide range of audio tasks. To handle the interference that arises when many tasks with differing text labels are trained jointly, the model adopts a multi-task framework based on hierarchical tags. Qwen-Audio performs well across diverse audio types and tasks without any task-specific fine-tuning. Building on Qwen-Audio, Qwen-Audio-Chat supports multi-turn conversations and a variety of audio-centric scenarios, demonstrating general audio understanding and interaction capabilities.
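To make the idea of hierarchical tag conditioning more concrete, here is a minimal sketch of how a tag prefix for a shared decoder could be assembled. The tag names, ordering, and function below are illustrative placeholders, not the exact tokens or format defined in the Qwen-Audio paper.

```python
# Minimal sketch: building a hierarchical tag prefix for a shared decoder.
# The tag vocabulary below is illustrative only; the actual tokens used by
# Qwen-Audio are defined in the paper's multi-task training format.

def build_task_prefix(audio_kind: str, task: str, text_lang: str,
                      with_timestamps: bool) -> str:
    """Compose a coarse-to-fine tag sequence that tells the decoder what kind
    of audio it is looking at and what output is expected."""
    tags = [
        f"<|{audio_kind}|>",                      # e.g. speech / sound / music
        f"<|{task}|>",                            # e.g. transcribe / caption / qa
        f"<|{text_lang}|>",                       # output language
        "<|timestamps|>" if with_timestamps else "<|notimestamps|>",
    ]
    return "".join(tags)

# Datasets with different label granularity share one decoder but receive
# different prefixes, which is what lets them coexist in joint training.
print(build_task_prefix("speech", "transcribe", "en", with_timestamps=True))
print(build_task_prefix("sound", "caption", "en", with_timestamps=False))
```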

Qwen-Audio goes beyond previous audio-language models by handling not only speech but also natural sounds, music, and songs, and by supporting joint training on datasets whose annotations differ in granularity. The model excels in speech perception and recognition tasks without task-specific modifications. Qwen-Audio-Chat further extends these capabilities, aligning with human intent and supporting multi-language, multi-turn conversations with audio and text inputs, showcasing its powerful and comprehensive audio understanding.
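For readers who want to try the conversational model, the sketch below shows what a multi-turn exchange mixing audio and text might look like. It assumes the released Qwen/Qwen-Audio-Chat checkpoint exposes the same chat interface as other Qwen chat releases (a from_list_format helper and a chat method loaded via trust_remote_code); the exact identifiers and file paths should be checked against the official Qwen-Audio repository.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID and chat helpers are assumed to follow the official Qwen-Audio
# release; verify against the Qwen-Audio README before use.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True
).eval()

# First turn: an audio clip plus a text question.
query = tokenizer.from_list_format([
    {"audio": "example.wav"},          # placeholder path to a local audio file
    {"text": "What is the person saying?"},
])
response, history = model.chat(tokenizer, query=query, history=None)

# Second turn: a follow-up question that relies on the conversation history.
response, history = model.chat(
    tokenizer, query="What emotion does the speaker convey?", history=history
)
print(response)
```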

By scaling pre-training to cover more than 30 tasks and a wide range of audio types, Qwen-Audio addresses the limited audio understanding of LLMs. Its multi-task framework facilitates knowledge sharing across tasks while mitigating interference between them. Qwen-Audio performs exceptionally well on benchmark tests without task-specific fine-tuning. As an extension, Qwen-Audio-Chat supports multi-turn conversations and various audio-centric scenarios, demonstrating the comprehensive audio interaction capabilities that LLMs can attain.

Qwen-Audio and Qwen-Audio-Chat are models for general audio understanding and flexible human-machine interaction. Training proceeds in two stages: during multi-task pre-training, the audio encoder is optimized while the weights of the language model are frozen; during supervised fine-tuning for Qwen-Audio-Chat, the language model is optimized while the weights of the audio encoder are fixed. The resulting Qwen-Audio-Chat enables diverse human-machine interaction, supporting multi-language, multi-turn conversations with audio and text inputs and showcasing its adaptability and comprehensive audio understanding.
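As a rough illustration of this two-stage schedule, the PyTorch-style sketch below freezes one component while updating the other. The module names and shapes are placeholders standing in for the audio encoder and the language model, not the actual Qwen-Audio code.

```python
import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    """Enable or disable gradient updates for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Hypothetical stand-ins: the real system pairs an audio encoder with the
# Qwen language model, but these tiny modules are placeholders.
audio_encoder = torch.nn.Linear(128, 4096)   # stand-in for the audio encoder
llm = torch.nn.Linear(4096, 4096)            # stand-in for the language model

# Stage 1: multi-task pre-training -- train the audio encoder, freeze the LLM.
set_trainable(audio_encoder, True)
set_trainable(llm, False)
pretrain_params = [p for p in audio_encoder.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(pretrain_params, lr=1e-4)

# Stage 2: supervised fine-tuning -- freeze the audio encoder, train the LLM.
set_trainable(audio_encoder, False)
set_trainable(llm, True)
sft_params = [p for p in llm.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(sft_params, lr=2e-5)
```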

Qwen-Audio achieves state-of-the-art results on a wide range of benchmark tasks, significantly outperforming competing models without task-specific fine-tuning. It consistently leads benchmarks for automatic audio captioning (AAC), speech recognition with word-level timestamps (SRWT), acoustic scene classification (ASC), speech emotion recognition (SER), audio question answering (AQA), vocal sound classification (VSC), and music note analysis (MNA). The model sets new records on CochlScene, ClothoAQA, and VocalSound, demonstrating its powerful audio understanding capabilities. This consistent performance across analyses confirms its effectiveness on challenging audio tasks.

The Qwen-Audio series introduces large-scale audio-language models with general understanding capabilities that span diverse audio types and tasks. Developed with a multi-task training framework, the models enable knowledge sharing and overcome the interference caused by differing text labels across datasets. Qwen-Audio achieves impressive performance on benchmark tests without task-specific fine-tuning, surpassing previous work. Qwen-Audio-Chat extends these capabilities, supporting multi-turn conversations and a variety of audio scenarios while showing strong alignment with human intent and enabling multi-language interaction.

Future work on Qwen-Audio includes expanding coverage to additional audio types, languages, and specialized tasks. Refining the multi-task framework or exploring alternative knowledge-sharing methods could further reduce interference during joint training, and task-specific fine-tuning could lift performance on individual tasks. Goals will be continuously updated based on new benchmarks, datasets, and user feedback to improve general audio understanding. Qwen-Audio-Chat will continue to be refined so that it remains aligned with human intent, supports multi-language interaction, and handles dynamic multi-turn conversations.