Effort Towards Open-source Modular GPT-4o and Hugging Face Speech-to-Speech

2025-01-07

In the evolution of AI technology, numerous impressive proprietary models remain confined within companies, accessible only to those involved in internal projects.

Conversely, the community endeavors to match these proprietary models by developing and refining open-source alternatives. One such initiative worth exploring is Hugging Face's speech-to-speech project.

What exactly is Hugging Face's speech-to-speech project, and why should you care about it?

Let's delve into this topic.

Hugging Face's Speech-to-Speech Project

The Hugging Face speech-to-speech project is a modular framework that integrates various open-source models using the Transformers library to facilitate speech-to-speech tasks.

This project aims to achieve capabilities comparable to GPT-4o using open-source models, with a modular design that makes it easy to swap components and adapt the pipeline to different needs.

The workflow comprises multiple model functionalities arranged in a cascading manner, including:

  1. Voice Activity Detection (VAD)
    • Silero VAD v5
  2. Speech-to-Text (STT)
    • Any Whisper model
    • Lightning Whisper MLX
    • Paraformer - FunASR
  3. Language Model (LM)
    • Any instruction model from Hugging Face Hub
    • mlx-lm
    • OpenAI API
  4. Text-to-Speech (TTS)
    • Parler-TTS
    • MeloTTS
    • ChatTTS

Note that while not all available models need to be used, the workflow requires at least one model from each of the four categories to function correctly.
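Conceptually, the cascade passes each stage's output to the next: detected speech is transcribed, the transcript is fed to the language model, and the response is synthesized back into audio. The following Python sketch uses illustrative stubs (not the project's actual API) to show the shape of that flow:

```python
# Illustrative cascade: each stub stands in for a real model stage.

def vad(audio_chunks):
    # Keep only chunks flagged as speech (real pipeline: Silero VAD).
    return [c for c in audio_chunks if c["is_speech"]]

def stt(chunks):
    # Transcribe speech chunks to text (real pipeline: a Whisper model).
    return " ".join(c["text"] for c in chunks)

def lm(prompt):
    # Generate a response to the transcript (real pipeline: an instruct model).
    return f"Echo: {prompt}"

def tts(text):
    # Synthesize audio from text (real pipeline: Parler-TTS, MeloTTS, etc.).
    return {"text": text, "samples": len(text)}

def speech_to_speech(audio_chunks):
    # The four stages composed in cascade order.
    return tts(lm(stt(vad(audio_chunks))))
```

Swapping any stage for a different model only requires that it accept the previous stage's output, which is what makes the cascading design modular.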

The primary goal of this workflow is to transform any input speech into another form, such as different languages or tones.

Let's set up the project in your environment to test the workflow.

Project Setup

First, clone the GitHub repository into your environment using the following commands:

git clone https://github.com/huggingface/speech-to-speech.git
cd speech-to-speech

Next, install the dependencies. The project recommends uv (uv pip install -r requirements.txt), but plain pip also works:

pip install -r requirements.txt

If you are using a Mac, use the following command:

pip install -r requirements_mac.txt

Ensure that the installation is complete before proceeding. It is also advisable to use a virtual environment to avoid conflicts with your main environment.

Project Usage

There are several recommended methods for implementing the workflow. One approach is the server/client method.

To run the workflow on your server, use the following command:

python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0

Then, locally run the following command to receive microphone input and generate audio output:

python listen_and_play.py --host <your server's IP address>
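In this setup, the client streams microphone audio to the server and plays back the processed audio it receives. The exchange pattern can be sketched with plain sockets; this is a simplified illustration (upper-casing bytes stands in for the pipeline), not the project's actual protocol:

```python
import socket
import threading

def serve_once(host="127.0.0.1"):
    # Bind to an OS-assigned port and handle a single client in a thread.
    srv = socket.socket()
    srv.bind((host, 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def handler():
        conn, _ = srv.accept()
        data = conn.recv(1024)
        conn.sendall(data.upper())  # stand-in for the speech-to-speech pipeline
        conn.close()
        srv.close()

    threading.Thread(target=handler, daemon=True).start()
    return port

def send_and_receive(port, payload):
    # Client side: send "audio" bytes, receive the processed reply.
    cli = socket.socket()
    cli.connect(("127.0.0.1", port))
    cli.sendall(payload)
    reply = cli.recv(1024)
    cli.close()
    return reply
```

The real pipeline streams continuously in both directions, but the request/reply structure is the same idea.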

If you are using a Mac, you can use the following parameters for local usage:

python s2s_pipeline.py --local_mac_optimal_settings

If you prefer using Docker, you will need the NVIDIA container toolkit. Once your environment is ready, simply run:

docker compose up

These are the ways to execute the workflow. Now let's look at some parameters you can explore in the Hugging Face speech-to-speech pipeline.

Additional Parameters

Each of the STT (speech-to-text), LM (language model), and TTS (text-to-speech) stages has parameters prefixed with stt_, lm_, or tts_, respectively.

For instance, here’s how to run the workflow using CUDA:

python s2s_pipeline.py --lm_model_name microsoft/Phi-3-mini-4k-instruct --stt_compile_mode reduce-overhead --tts_compile_mode default --recv_host 0.0.0.0 --send_host 0.0.0.0
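To see how such prefixed flags can be routed to the right stage, here is a hypothetical sketch; the split_args helper is purely illustrative and not part of the project:

```python
def split_args(args):
    """Group flags by their stt_/lm_/tts_ prefix, stripping the prefix."""
    groups = {"stt": {}, "lm": {}, "tts": {}}
    for key, value in args.items():
        prefix, _, rest = key.partition("_")
        if prefix in groups and rest:
            # stt_compile_mode -> groups["stt"]["compile_mode"]
            groups[prefix][rest] = value
    return groups
```

With this scheme, each stage only ever sees the options meant for it, so adding a new flag to one model never collides with the others.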