Alibaba Launches CosyVoice 2: Enhanced Streaming Speech Synthesis Model

2024-12-19

While significant advancements have been made in speech synthesis technology, achieving real-time and natural voice output remains a challenge. In high-demand scenarios such as streaming applications, issues like latency, pronunciation accuracy, and speaker consistency are particularly critical. Additionally, existing models often struggle with complex language inputs, such as tongue twisters or homophones. To address these challenges, researchers at Alibaba have introduced an enhanced streaming TTS model called CosyVoice 2.

Introduction to CosyVoice 2

CosyVoice 2 is a comprehensive upgrade of the original CosyVoice, advancing both the quality of its speech synthesis and the flexibility of its deployment. The model is designed for streaming and offline applications alike, and suits scenarios ranging from text-to-speech narration to interactive voice systems.

Key Improvements of CosyVoice 2:

  1. Unified Streaming and Non-Streaming Modes: A single model adapts seamlessly to both real-time and offline applications without compromising performance (see the usage sketch after this list).
  2. Significantly Improved Pronunciation Accuracy: The error rate has been reduced by 30% to 50%, providing clearer outputs in complex language environments.
  3. Enhanced Speaker Consistency: Ensures stable voice output in zero-shot learning and cross-lingual synthesis tasks.
  4. Advanced Instruction Control: Precise control over tone, style, and accent through natural language instructions.
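
To ground item 1, here is a short usage sketch modeled on the pattern published in the project's open-source repository (FunAudioLLM/CosyVoice). The model path, prompt files, and method names follow that repository's README at the time of writing and should be treated as assumptions that may shift between releases:

```python
# Sketch modeled on the FunAudioLLM/CosyVoice README; paths, files, and exact
# signatures are assumptions and may differ between releases.
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
prompt = load_wav('zero_shot_prompt.wav', 16000)   # reference voice to clone

# The same call serves both modes: stream=False returns the whole utterance,
# stream=True yields audio chunk by chunk for low-latency playback.
for i, out in enumerate(cosyvoice.inference_zero_shot(
        'Text to synthesize.', 'Transcript of the prompt audio.',
        prompt, stream=True)):
    torchaudio.save(f'chunk_{i}.wav', out['tts_speech'], cosyvoice.sample_rate)
```

The same README also documents an instruction-oriented entry point (inference_instruct2) that accepts a natural-language description of tone or accent, corresponding to item 4 above.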

Innovation and Advantages

CosyVoice 2 integrates several technological innovations to enhance performance and usability:

  1. Finite Scalar Quantization (FSQ): Replaces traditional vector quantization for speech tokens, making fuller use of the codebook and improving semantic representation and synthesis quality (a minimal sketch follows this list).
  2. Simplified Text-to-Speech Architecture: Leveraging pre-trained large language models (LLMs), it eliminates the need for additional text encoders, simplifying the model structure and enhancing cross-lingual performance.
  3. Block-Aware Causal Flow Matching: Aligns semantic and acoustic features with minimal latency, making it suitable for real-time speech generation (a training-objective sketch also follows this list).
  4. Rich Instruction Dataset: Based on over 1,500 hours of training data, it provides fine-grained control over accents, emotions, and voice styles, enabling diverse and expressive voice outputs.
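
The FSQ idea in item 1 is compact enough to show directly: instead of looking up nearest neighbors in a learned codebook, each latent dimension is squashed into a bounded range and rounded to a few integer levels, so the codebook is implicit and fully utilized. The following is a minimal, generic sketch of the technique; the dimension and level counts are illustrative, not CosyVoice 2's actual configuration:

```python
import torch

def fsq(z: torch.Tensor, levels: list[int]) -> torch.Tensor:
    """Finite Scalar Quantization: bound each latent dimension, then round it
    to a small fixed set of integer levels.

    z:      (..., len(levels)) latent vectors
    levels: odd level counts per dimension (even counts need a half-step
            offset, omitted here for brevity)
    """
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2
    bounded = torch.tanh(z) * half       # each dim now lies in (-half, half)
    quantized = torch.round(bounded)     # exactly levels[d] values per dim
    # Straight-through estimator: the forward pass emits rounded values, the
    # backward pass treats rounding as identity so gradients reach the encoder.
    return bounded + (quantized - bounded).detach()

# The implicit codebook size is the product of per-dimension level counts:
# [5, 5, 5, 5] -> 625 codes, every one reachable by construction, which is
# why FSQ avoids the unused-codebook-entry problem of classic VQ.
tokens = fsq(torch.randn(2, 50, 4), levels=[5, 5, 5, 5])
```

Item 3's flow matching component can likewise be reduced to its training objective: sample a point on the straight path between noise and the target mel frames, then regress the network onto that path's constant velocity. The sketch below is a generic conditional flow matching step under assumed tensor shapes; CosyVoice 2's chunk masking and actual decoder architecture are not reproduced here:

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy stand-in for the acoustic decoder: predicts velocity v(x_t, t | cond)."""
    def __init__(self, mel_dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, mel_dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    """x1: target mel frames (B, T, D); cond: upsampled token features (B, T, C)."""
    x0 = torch.randn_like(x1)             # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, 1)      # one random time per utterance
    x_t = (1 - t) * x0 + t * x1           # straight-line interpolant
    v_target = x1 - x0                    # its constant velocity
    t_feat = t.expand(-1, x1.size(1), 1)  # broadcast time over frames
    return ((model(x_t, t_feat, cond) - v_target) ** 2).mean()

# loss = flow_matching_loss(VelocityNet(mel_dim=80, cond_dim=512),
#                           mels, token_features)
```

At inference, the learned velocity field is integrated from noise at t=0 to audio features at t=1; a causal, block-wise mask on the decoder lets early frames be finalized before later tokens arrive, which is what makes the approach usable for streaming.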

Performance Highlights

Extensive evaluations of CosyVoice 2 have demonstrated its significant advantages:

  1. Low Latency and Efficiency: First-packet response times as low as 150 milliseconds, making it ideal for real-time applications like voice chat.
  2. Improved Pronunciation Quality: Excellent performance in handling rare and complex language structures.
  3. High Speaker Similarity: Similarity scores confirm its ability to maintain a natural, consistent voice across utterances.
  4. Multilingual Capabilities: Strong results on Japanese and Korean benchmarks, though there is room for improvement where character sets overlap, as with Japanese kanji shared with Chinese.
  5. Robustness in Complex Scenarios: Exceptional performance in challenging scenarios like tongue twisters, with superior accuracy and clarity compared to previous models.

Conclusion

CosyVoice 2 builds on its predecessor with targeted improvements, addressing key challenges such as latency, pronunciation accuracy, and speaker consistency. The integration of advanced features like FSQ and block-aware causal flow matching strikes a balance between performance and usability. While there is still room to expand language support and harden the model in complex scenarios, CosyVoice 2 lays a solid foundation for the future of speech synthesis. Its unified offline and streaming modes deliver high-quality, real-time audio generation across a wide range of applications.