Amazon Unveils Largest Text-to-Speech Model to Date
A team of artificial intelligence researchers at Amazon AGI has announced what they describe as the largest text-to-speech model built to date. By "largest," they mean the model has the most parameters and was trained on the largest dataset. They have published a paper describing how the model was developed and trained on the arXiv preprint server.
Large language models (LLMs) like ChatGPT have gained attention for their ability to answer questions and produce sophisticated documents in a human-like manner. However, AI is still making its way into other mainstream applications. In this new research, the team set out to improve text-to-speech applications by increasing the number of parameters and expanding the training corpus.
The new model, called BASE TTS (short for "Big Adaptive Streamable TTS with Emergent abilities"), has 980 million parameters and was trained on 100,000 hours of recorded speech taken from public websites, most of it in English. The team also fed it examples of spoken words and phrases in other languages so that the model can correctly pronounce well-known phrases such as "au contraire" or "adios, amigo" when it encounters them.
The Amazon team also tested the model on smaller datasets to learn at what point it develops what the AI field calls emergent abilities, where an application, whether an LLM or a text-to-speech system, suddenly seems to break through to a higher level of capability. They found that for their application, the leap occurred with a medium-sized dataset.
They also noted that this leap involved a range of language attributes, such as the ability to handle compound nouns, express emotions, use foreign words, apply paralinguistics and punctuation, and place the emphasis on the right word when asking a question.
The team stated that BASE TTS will not be released publicly because they are concerned it could be used unethically. Instead, they plan to use it as a learning tool and apply what they have learned so far to improve the overall voice quality of text-to-speech applications.