ByteDance introduces Seed-TTS, a text-to-speech model capable of generating human-like speech.

2024-06-06

ByteDance has announced Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating natural speech that is nearly indistinguishable from a human voice. Seed-TTS stands out not only for its highly realistic voice quality but also for its deep grasp of speech context and its balance of speaker similarity and naturalness.

According to the team, Seed-TTS serves as a foundation model for speech generation and performs strongly across a range of evaluations. In both objective and subjective tests, the model produces speech closely matching real human recordings. With fine-tuning, Seed-TTS achieves even higher subjective ratings for speaker similarity and naturalness.

Beyond raw generation quality, Seed-TTS offers fine-grained control over speech attributes such as emotion. This allows it to generate highly expressive and varied speech for real-world speakers, opening broader applications for speech synthesis.

To further improve the model, the ByteDance team proposes a self-distillation method and a reinforcement learning method. Self-distillation is used for speech factorization, enabling high-quality timbre disentanglement without changing the model architecture or loss function. Reinforcement learning strengthens the model's robustness, speaker similarity, and controllability, making Seed-TTS more stable and reliable in complex speech scenarios.

In addition, ByteDance has introduced a non-autoregressive (NAR) variant of Seed-TTS called Seed-TTS_DiT.
This variant adopts a fully diffusion-based architecture that predicts the latent representation of the output speech directly, without relying on pre-estimated phoneme durations. This design gives Seed-TTS_DiT distinct advantages in speech-editing tasks while performing on par with the language-model-based variant.

For evaluation, Seed-TTS was tested comprehensively on tasks including zero-shot in-context learning, speaker fine-tuning, and emotion control. The results both demonstrate the model's strong performance and provide valuable reference data for future benchmarks.

Despite these strengths, Seed-TTS still faces challenges and limitations. Ensuring the safety, reliability, and ethical use of the technology was a central consideration for the ByteDance team during development. And as the technology evolves, reducing computational cost and improving generation speed while preserving speech quality remains an important direction for future research.
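To make the diffusion-based design more concrete, here is a minimal sketch of the inference loop it implies: a latent sequence for the whole utterance is denoised directly under text conditioning, with no per-phoneme duration predictor in the pipeline. Everything here is hypothetical and not from the Seed-TTS paper: the `denoiser` is a toy stand-in for the learned diffusion transformer, and all dimensions and step counts are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustration only, not from the paper)
T_LATENT = 50   # latent speech frames for the whole utterance
D_LATENT = 8    # latent dimensionality per frame
N_STEPS = 20    # number of denoising steps

def denoiser(x_t, t, text_cond):
    """Toy stand-in for a learned diffusion transformer: predicts the
    noise in x_t given the timestep t and text conditioning. A simple
    linear function so the sketch runs end to end."""
    return 0.9 * x_t + 0.1 * text_cond.mean() * np.ones_like(x_t)

def generate_latents(text_cond):
    """Fully diffusion-based generation: denoise a latent sequence of
    fixed total length directly. There is no duration model mapping
    phonemes to frame counts -- alignment is left implicit."""
    x = rng.standard_normal((T_LATENT, D_LATENT))  # start from pure noise
    for step in range(N_STEPS, 0, -1):
        t = step / N_STEPS
        eps = denoiser(x, t, text_cond)
        x = x - (1.0 / N_STEPS) * eps  # simple Euler-style update
    return x

text_cond = rng.standard_normal((12, D_LATENT))  # toy text-encoder output
latents = generate_latents(text_cond)
print(latents.shape)  # (50, 8)
```

Because the model operates on the whole latent sequence at once rather than frame by frame, replacing a span of latents and re-denoising is natural, which is consistent with the speech-editing advantage described above.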