Camb AI, a startup focusing on AI-driven content localization technology, recently announced the launch of Mars5, a powerful AI voice cloning model.
Although several products that create digital voice replicas are already on the market, such as those from ElevenLabs, Camb AI claims that Mars5 has a significant advantage in realism.
According to early samples provided by the company, Mars5 not only imitates the original voice accurately but also reproduces complex prosodic parameters, including rhythm, emotion, and intonation. Camb AI presents this level of detail as an important milestone in the field of voice cloning.
Camb AI supports more than 140 languages, nearly three times as many as ElevenLabs. However, the open-source version released on GitHub currently supports only English; the version with broader language support is available on the company's paid Studio platform.
"Mars5's ability to capture rhythm and realism, even with just a few seconds of input, is unprecedented. This undoubtedly represents a milestone in the field of voice," said Akshat Prakash, co-founder and CTO of Camb AI, in a statement.
Mars5 combines voice cloning and text-to-speech in a single platform. Users only need to upload a reference audio file (ranging from a few seconds to a minute) and provide the text to be spoken. The model then extracts the relevant details from the speaker's voice in that file, including the voice itself, speaking style, emotion, pronunciation, and meaning, and converts the provided text into realistic speech in that voice.
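The two inputs described above, a short reference clip and the text to speak, can be pictured as a small request object. The sketch below is purely illustrative: `CloneRequest`, `validate_request`, and the 2-second and 60-second bounds are hypothetical names and values chosen to mirror "a few seconds to a minute", not part of Camb AI's actual API or published limits.

```python
from dataclasses import dataclass

# Assumed bounds standing in for "a few seconds to a minute";
# these are illustrative values, not limits published by Camb AI.
MIN_REFERENCE_SECONDS = 2.0
MAX_REFERENCE_SECONDS = 60.0


@dataclass(frozen=True)
class CloneRequest:
    """Hypothetical container for the two inputs the article describes."""
    reference_audio_path: str   # the uploaded audio file
    reference_seconds: float    # duration of that file, in seconds
    text: str                   # the text to be spoken in the cloned voice


def validate_request(req: CloneRequest) -> list[str]:
    """Return a list of problems; an empty list means the request looks usable."""
    problems = []
    if not req.reference_audio_path:
        problems.append("reference audio path is missing")
    if req.reference_seconds < MIN_REFERENCE_SECONDS:
        problems.append("reference clip is too short")
    elif req.reference_seconds > MAX_REFERENCE_SECONDS:
        problems.append("reference clip is too long")
    if not req.text.strip():
        problems.append("text to synthesize is empty")
    return problems


if __name__ == "__main__":
    request = CloneRequest("speaker.wav", 6.0, "Welcome back to the second half.")
    print(validate_request(request))
```

Checking the inputs before any synthesis step keeps the failure modes (a clip that is too short, empty text) separate from whatever the model itself does with a valid request.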
Camb AI claims that Mars5 can capture a wide range of emotional tones and intonations, making it suitable for demanding speech scenarios such as sports commentary, movies, and animation. To achieve this level of prosodic imitation, Mars5 combines an autoregressive model (approximately 750 million parameters) with a non-autoregressive multinomial diffusion model (approximately 450 million parameters).
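For a sense of scale, the two approximate parameter counts quoted above can be added up and converted into a rough weight-storage figure. This is a back-of-envelope sketch only: it assumes 4 bytes per parameter for 32-bit floats and 2 bytes for 16-bit floats, counts weights alone, and ignores activations and any additional components the system may use. The helper name `weight_gigabytes` is made up for illustration.

```python
# Approximate parameter counts as quoted in the article.
AR_PARAMS = 750_000_000    # autoregressive model
NAR_PARAMS = 450_000_000   # non-autoregressive diffusion model


def weight_gigabytes(param_count: int, bytes_per_param: int) -> float:
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return param_count * bytes_per_param / 1e9


total_params = AR_PARAMS + NAR_PARAMS
fp32_gb = weight_gigabytes(total_params, 4)   # assumed 32-bit floats
fp16_gb = weight_gigabytes(total_params, 2)   # assumed 16-bit floats

if __name__ == "__main__":
    print(f"total parameters: {total_params:,}")
    print(f"weights at 32-bit: {fp32_gb:.1f} GB")
    print(f"weights at 16-bit: {fp16_gb:.1f} GB")
```

Taken together the two models come to roughly 1.2 billion parameters, which is on the order of a few gigabytes of weights under these assumptions.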
Although specific benchmark figures have not been disclosed, early samples and tests suggest that Mars5 outperforms popular open-source and closed-source speech synthesis models in most cases, including models from Metavoice and ElevenLabs. Competing products can also synthesize speech, but in these samples their output does not sound as close to the original voice as Mars5's.
While it continues to improve Mars5's voice cloning and text-to-speech performance, Camb AI also plans to open-source another model called Boli. Boli focuses on context-aware translation with correct grammar and appropriate colloquial expressions.
Currently, both Mars5 and Boli run on Camb AI's proprietary platform, Camb Studio, which supports over 140 languages. The company also offers these capabilities as APIs to enterprises, small and medium-sized businesses, and developers. Although Prakash did not disclose the number of customers, he said that Camb AI is working with Major League Soccer, Tennis Australia, leading film and music production companies, and government agencies.
In one of these projects, Camb AI provided live commentary in four languages for a Major League Soccer match, running for more than two hours without interruption. It also translated post-match press conferences at the Australian Open into multiple languages and translated the Arabic psychological thriller "Three" into Mandarin. The company points to these projects as evidence of its strength in voice cloning and localization technology.