ElevenLabs Launches AI Tool for Converting Video to Audio Effects

2024-06-19

AI voice startup ElevenLabs recently released Sound Effects, its text-to-sound-effects product. Shortly afterward, the company launched an open-source tool to showcase what the technology can do. The application parses an imported video clip and generates several sound effect options for video creators in "about 15 seconds."


Developers can access the application's source code on GitHub, and ElevenLabs also hosts a website where the public can easily try out the Sound Effects API.


When you upload a video, this "video-to-sound effect" application extracts four frames at one-second intervals on the client side. These frames and a prompt message are then sent to OpenAI's GPT-4 model, which writes a customized text-to-sound-effects prompt. ElevenLabs' Sound Effects API processes that prompt to generate the corresponding sound effect. Finally, the video and audio are merged on the client side into a single file of up to 22 seconds, which users can download and use.
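
The flow described above can be sketched in a few lines of Python. This is only an illustrative outline, not ElevenLabs' actual code: the function names are hypothetical, the model and API calls are passed in as plain callables rather than real SDK calls, and it assumes that sampling starts at 0 seconds and that the 22-second limit applies to the generated audio.

```python
from typing import Any, Callable, List

FRAME_COUNT = 4                # frames sampled per video, per the article
FRAME_INTERVAL_SECONDS = 1.0   # spacing between sampled frames
MAX_EFFECT_SECONDS = 22.0      # assumed cap on the generated audio length


def keyframe_timestamps(video_seconds: float) -> List[float]:
    """Timestamps of up to four frames, one second apart, inside the clip.

    Starting at 0 s is an assumption; the article only gives the spacing.
    """
    stamps = [i * FRAME_INTERVAL_SECONDS for i in range(FRAME_COUNT)]
    return [t for t in stamps if t < video_seconds]


def effect_duration(video_seconds: float) -> float:
    """Length of audio to request: the clip length, capped at 22 seconds."""
    return min(video_seconds, MAX_EFFECT_SECONDS)


def video_to_sound_effect(
    video_seconds: float,
    grab_frame: Callable[[float], Any],           # hypothetical: frame at time t
    write_prompt: Callable[[List[Any]], str],     # hypothetical: vision model step
    generate_effect: Callable[[str, float], Any], # hypothetical: sound effect step
) -> Any:
    """Run the three stages: sample frames, write a prompt, generate audio."""
    frames = [grab_frame(t) for t in keyframe_timestamps(video_seconds)]
    prompt = write_prompt(frames)
    return generate_effect(prompt, effect_duration(video_seconds))
```

Passing the model and API steps in as callables keeps the sketch independent of any particular SDK; in a real application they would wrap the vision model and the Sound Effects API.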


Ammaar Reshi, the design director of ElevenLabs, said in an interview, "We believe this is a strong validation of our SFX API capabilities. AI video creators often look for the perfect sound effects, and we believe that by understanding the frames in their videos and proposing the best sound output based on them, we can intelligently accelerate their workflow." He also expressed excitement about the various innovative experiences that this API may bring, specifically mentioning immersive video games where sound can be generated in real-time based on player interaction.

This API allows developers to build fully customized AI sound effects from short text descriptions. ElevenLabs bills usage in characters: 100 characters per generation, or 25 characters per second of audio when a duration is set.
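
To make the pricing concrete, the short Python sketch below computes the characters charged for one generation. It assumes the flat 100-character fee applies when no duration is specified and that fractional results round up; neither detail is confirmed by the article, and `characters_charged` is a hypothetical helper, not part of any ElevenLabs SDK.

```python
import math
from typing import Optional

FLAT_FEE_CHARACTERS = 100     # per generation, assumed to apply when no duration is set
PER_SECOND_CHARACTERS = 25    # per second of audio when a duration is set


def characters_charged(duration_seconds: Optional[float] = None) -> int:
    """Characters billed for a single sound effect generation.

    Rounding up fractional totals is an assumption made for this sketch.
    """
    if duration_seconds is None:
        return FLAT_FEE_CHARACTERS
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return math.ceil(PER_SECOND_CHARACTERS * duration_seconds)
```

Under these assumptions, a set duration of 4 seconds costs the same as the flat fee, and a maximum-length 22-second effect costs 550 characters.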


In a brief test, the video-to-sound effect application proved convenient. After a silent movie clip of a car driving in an off-road environment was imported, ElevenLabs' AI generated four options, each sounding like a car driving on a gravel road. While applying sound effects to clips is interesting, the true potential may lie in integrating this capability into larger systems, where it could have a greater impact.

As the popularity of AI video generation continues to rise, ElevenLabs may continue to explore new audio solutions to meet the growing needs of developers, filmmakers, and creators, maintaining its leading position in the industry.