OpenAI has added new transcription and speech generation AI models to its API, which the company says improve on its earlier releases.
These models fit into OpenAI's broader "agent" vision: building automated systems that can independently complete tasks on behalf of users. According to Olivier Godement, OpenAI's product lead, the coming months will bring many more such agents, and the company's focus is on helping customers and developers build agents that are useful, accessible, and accurate.
The new text-to-speech model, named “gpt-4o-mini-tts,” not only generates more refined and realistic voices but also offers greater "controllability." Developers can use natural language to instruct the model on how to pronounce words, such as requesting it to speak "in the tone of a mad scientist" or "with the calm voice of a meditation teacher."
Jeff Harris, a member of OpenAI’s product team, said the goal is to let developers tailor both the voice "experience" and its "context." A flat, unchanging voice isn't right for every scenario: in a customer support setting, for example, the voice can convey emotion, sounding apologetic when an apology is called for.
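As a concrete illustration, here is a minimal sketch of such a request using the official `openai` Python SDK; the voice name, input text, and output filename are illustrative placeholders, and the `instructions` field carries the natural-language style direction described above.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Stream synthesized speech to an MP3 file. The `instructions` field is the
# "controllability" hook: a plain-language description of how the model
# should sound. Voice, text, and filename here are placeholders.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="I'm so sorry about the mix-up with your order. Let me fix that right away.",
    instructions="Speak in a calm, apologetic customer-support tone.",
) as response:
    response.stream_to_file("apology.mp3")
```

Swapping the `instructions` string for something like "in the tone of a mad scientist" is all it takes to change the delivery; the input text itself stays the same.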
As for the new speech-to-text models, “gpt-4o-transcribe” and “gpt-4o-mini-transcribe,” they are set to replace OpenAI’s long-standing Whisper transcription model. OpenAI claims these new models were trained on a "diverse, high-quality audio dataset," allowing them to better capture accented and varied speech, even performing well in noisy environments.
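For developers who currently call Whisper through the API, adopting the new models should largely amount to swapping the model name. A minimal sketch with the official `openai` Python SDK (the audio filename is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# Transcribe a local recording; "meeting.mp3" is a placeholder filename.
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" for lower cost
        file=audio_file,
    )

print(transcript.text)
```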
Harris added that the new models also cut down on "hallucinations." Whisper sometimes fabricates words, or even whole passages of conversation, introducing inaccuracies into its transcriptions. The new models, he said, are markedly better at capturing exactly what was spoken without adding details that were never said.
However, transcription accuracy may vary across languages. Based on OpenAI’s internal benchmarks, the more accurate transcription model, “gpt-4o-transcribe,” exhibits a word error rate of nearly 30% for Hindi and Dravidian languages like Tamil, Telugu, Malayalam, and Kannada.
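For reference (this definition is standard background, not part of OpenAI's announcement): word error rate is computed as WER = (S + D + I) / N, where S, D, and I count the words substituted, deleted, and inserted relative to a human reference transcript of N words. A WER near 30% therefore means roughly three out of every ten transcribed words differ from what a careful human transcriber would have written.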
This marks a break with past practice: OpenAI does not plan to make its new transcription models openly available. The company has historically released new versions of Whisper under the MIT license for commercial use. Harris noted that the new models are much "larger" than Whisper and, unlike Whisper, can't run locally on a laptop, which makes them poor candidates for an open release. He added that OpenAI wants to be more deliberate about future open-source releases, reserving them for models honed to a specific need.