MyShell Launches OpenVoice: An Open-Source AI Real-Time Voice Cloning Model

2024-01-03

Startups, including the increasingly well-known ElevenLabs, have raised millions of dollars to develop proprietary algorithms and artificial intelligence software for creating voice clones: audio models that mimic a user's voice.

However, a new solution called OpenVoice has emerged, developed by members of the Massachusetts Institute of Technology (MIT), Tsinghua University in Beijing, China, and Canadian artificial intelligence startup MyShell. It offers open-source voice cloning that is almost instantaneous and provides fine-grained control features not found on other voice cloning platforms.

MyShell wrote in a post on its official company account on X: "With highly accurate voice cloning, we can achieve granular control over tone, from emotion to accent, rhythm, pauses, and intonation, using just a short audio clip."

The company also included a link to a preprint research paper in the post, describing how they developed OpenVoice and several places where users can access and try it, including MyShell's web application interface (requires user account access) and HuggingFace (accessible publicly without an account).

Qin Zengyi, one of MyShell's chief researchers, said, "MyShell aims to benefit the entire research community. OpenVoice is just the beginning. In the future, we will even provide funding, datasets, and computing power to support the open-source research community. MyShell's core belief is 'AI for everyone'."

Regarding why MyShell chose to develop an open-source voice cloning AI model in the first place, Qin Zengyi wrote, "Language, vision, and sound are the three main modalities of future artificial general intelligence (AGI). In the research field, while there are some good open-source models for language and vision, a good model for sound is still lacking, especially an instant and powerful voice cloning model that allows everyone to customize the generated voice. So we decided to build it."

Using OpenVoice

In a non-scientific test of the new voice cloning model on HuggingFace, it took only a few seconds to generate a relatively convincing, if somewhat robotic, clone of one's own voice from completely improvised speech.

Unlike other voice cloning applications, OpenVoice does not require the user to read out a specific block of text. Simply speak improvisationally for a few seconds, and the model generates a voice clone that can be played back almost immediately.

It is also possible to adjust the "style" through a dropdown menu, choosing among several presets (e.g., happy, sad, friendly, angry), and hear noticeable changes in intonation with each emotion.

How OpenVoice is Made

OpenVoice consists of two different artificial intelligence models: a text-to-speech (TTS) model and a "tone color converter".

The first model controls "style parameters and language" and was trained on 30,000 sentences of emotion-annotated audio samples spanning two English accents (American and British), one Chinese accent, and one Japanese accent. It also learns intonation, rhythm, and pauses from these clips.

Meanwhile, the tone color converter was trained on more than 300,000 audio samples from over 20,000 different speakers.

In both cases, recorded human speech is converted into phonemes, the specific sounds that distinguish one word from another, and represented as vector embeddings.
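The phoneme-to-embedding step can be illustrated with a toy sketch. This is not OpenVoice's code: the phoneme inventory, embedding dimension, and random vectors below are stand-ins for what a trained model would learn.

```python
# Toy illustration of representing phonemes as vector embeddings,
# as described above. All values here are made up for demonstration;
# a real model learns these vectors during training.
import numpy as np

rng = np.random.default_rng(0)

# A tiny phoneme inventory; real systems use sets of roughly 40-100 symbols.
phonemes = ["HH", "AH", "L", "OW"]  # "hello", roughly, in ARPAbet-style notation
embedding_dim = 8

# In a trained model these vectors are learned; here they are random stand-ins.
embedding_table = {p: rng.normal(size=embedding_dim) for p in phonemes}

def embed(phoneme_seq):
    """Map a phoneme sequence to a (sequence_length, embedding_dim) matrix."""
    return np.stack([embedding_table[p] for p in phoneme_seq])

utterance = embed(["HH", "AH", "L", "OW"])
print(utterance.shape)  # (4, 8): one 8-dimensional vector per phoneme
```

Downstream models then operate on these continuous vectors rather than on raw audio or discrete symbols.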

By using a "base speaker" TTS model and combining its output with the tone color extracted from the user's recorded audio, the two models together can reproduce the user's voice, change its "tone color", and express the emotional content of the text. A diagram included in the OpenVoice team's paper illustrates how the two models work together.
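The division of labor between the two models can be sketched conceptually. This is toy code, not the OpenVoice implementation: every function and data structure below is an illustrative stand-in for the decomposed pipeline the article describes.

```python
# Conceptual sketch of the two-stage pipeline: a base-speaker TTS handles
# text, style, and language; a tone color converter then swaps in the
# target speaker's voice characteristics. Names and shapes are invented.
import numpy as np

rng = np.random.default_rng(42)

def base_speaker_tts(text, style="default"):
    """Stand-in for the base TTS: returns fake 'audio' features carrying
    the base speaker's tone color plus the requested style."""
    features = rng.normal(size=16)        # pretend acoustic features
    base_tone_color = np.ones(4)          # base speaker's identity vector
    return {"features": features, "tone_color": base_tone_color, "style": style}

def extract_tone_color(reference_audio):
    """Stand-in for a speaker encoder: derives a tone color embedding
    from a few seconds of reference audio."""
    return reference_audio.mean() * np.ones(4)

def convert_tone_color(audio, target_tone_color):
    """Replace the base speaker's tone color while leaving style intact."""
    converted = dict(audio)
    converted["tone_color"] = target_tone_color
    return converted

# A few seconds of the user's (fake) reference audio.
user_clip = rng.normal(loc=0.5, size=48000)

styled = base_speaker_tts("Hello world", style="happy")
cloned = convert_tone_color(styled, extract_tone_color(user_clip))

print(cloned["style"])  # style survives conversion: "happy"
```

The key design point, as the article notes, is that neither stage has to solve the whole problem: style and language live in the base TTS, while speaker identity is handled separately by the converter.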

The team notes that their approach is conceptually simple. Nevertheless, it is highly effective and requires far fewer computational resources than other methods, including Voicebox, a competing AI voice cloning model developed by Meta.

"We wanted to develop the most flexible instant voice cloning model to date," Qin said in an email. "Flexibility here means flexible control over style/emotion/accent, etc., and being able to adapt to any language. No one has been able to do this before because it's too difficult. I led a group of experienced AI scientists who spent months finding a solution. We found a very elegant way to decompose this daunting task into some feasible subtasks to achieve something that seemed too difficult as a whole. This decomposed pipeline turned out to be very effective but also very simple."

Who Supports OpenVoice?

Founded in 2023 and based in Calgary, Alberta, Canada, MyShell has raised $5.6 million in seed funding led by INCE Capital, with additional investments from Folius Ventures, Hashkey Capital, SevenX Ventures, TSVC, and OP Crypto. According to The Saas News, it reportedly has more than 400,000 users.

The startup describes itself as a "decentralized and comprehensive platform for discovering, creating, and staking AI-native applications."

In addition to offering OpenVoice, the company's web application includes various text-based AI characters and bots with distinct "personalities", similar to Character.AI. It also features an animated GIF creator and user-generated text-based RPGs, some of which feature licensed characters like Harry Potter and Marvel.

If MyShell makes OpenVoice open-source, how does it plan to monetize? The company charges a monthly fee to users of its web application and to third-party bot developers who want to promote their products in the app. It also charges for AI training data.