Effort Towards Open-source Modular GPT-4o and Hugging Face Speech-to-Speech

2025-01-07

In the evolution of AI technology, numerous impressive proprietary models remain confined within companies, accessible only to those involved in internal projects.

Conversely, the community endeavors to match these proprietary models by developing and refining open-source alternatives. One such initiative worth exploring is Hugging Face's speech-to-speech project.

What exactly is Hugging Face's speech-to-speech project, and why should you care about it?

Let's delve into this topic.

Hugging Face's Speech-to-Speech Project

The Hugging Face speech-to-speech project is a modular framework that integrates various open-source models using the Transformers library to facilitate speech-to-speech tasks.

This project aims to achieve capabilities comparable to GPT-4o using open-source models, with a modular design that makes it easy to swap components and adapt the pipeline to different needs.

The workflow comprises multiple model functionalities arranged in a cascading manner, including:

  1. Voice Activity Detection (VAD)
    • Silero VAD v5
  2. Speech-to-Text (STT)
    • Any Whisper model
    • Lightning Whisper MLX
    • Paraformer - FunASR
  3. Language Model (LM)
    • Any instruction model from Hugging Face Hub
    • mlx-lm
    • OpenAI API
  4. Text-to-Speech (TTS)
    • Parler-TTS
    • MeloTTS
    • ChatTTS

Note that while not all available models need to be used, the workflow requires at least one model from each of the four categories to function correctly.
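Conceptually, the cascade passes each stage's output to the next: detected speech is transcribed, the transcript is fed to the language model, and the response is synthesized back into audio. The following Python sketch uses illustrative stubs (not the project's actual API) to show the shape of that flow:

```python
# Illustrative cascade: each stub stands in for a real model stage.

def vad(audio_chunks):
    # Keep only chunks flagged as speech (real pipeline: Silero VAD).
    return [c for c in audio_chunks if c["is_speech"]]

def stt(chunks):
    # Transcribe speech chunks to text (real pipeline: a Whisper model).
    return " ".join(c["text"] for c in chunks)

def lm(prompt):
    # Generate a response to the transcript (real pipeline: an instruct model).
    return f"Echo: {prompt}"

def tts(text):
    # Synthesize audio from text (real pipeline: Parler-TTS, MeloTTS, etc.).
    return {"text": text, "samples": len(text)}

def speech_to_speech(audio_chunks):
    # The four stages composed in cascade order.
    return tts(lm(stt(vad(audio_chunks))))
```

Swapping any stage for a different model only requires that it accept the previous stage's output, which is what makes the cascading design modular.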

The primary goal of this workflow is to transform any input speech into another form, such as different languages or tones.

Let's set up the project in your environment to test the workflow.

Project Setup

First, clone the GitHub repository into your environment using the following commands:

git clone https://github.com/huggingface/speech-to-speech.git
cd speech-to-speech

Next, install the dependencies. The project recommends uv (uv pip install -r requirements.txt), but plain pip also works:

pip install -r requirements.txt

If you are using a Mac, use the following command:

pip install -r requirements_mac.txt

Ensure that the installation is complete before proceeding. It is also advisable to use a virtual environment to avoid conflicts with your main environment.

Project Usage

There are several recommended methods for implementing the workflow. One approach is the server/client method.

To run the workflow on your server, use the following command:

python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0

Then, locally run the following command to receive microphone input and generate audio output:

python listen_and_play.py --host <your server's IP address>
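In this setup, the client streams microphone audio to the server and plays back the processed audio it receives. The exchange pattern can be sketched with plain sockets; this is a simplified illustration (upper-casing bytes stands in for the pipeline), not the project's actual protocol:

```python
import socket
import threading

def serve_once(host="127.0.0.1"):
    # Bind to an OS-assigned port and handle a single client in a thread.
    srv = socket.socket()
    srv.bind((host, 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def handler():
        conn, _ = srv.accept()
        data = conn.recv(1024)
        conn.sendall(data.upper())  # stand-in for the speech-to-speech pipeline
        conn.close()
        srv.close()

    threading.Thread(target=handler, daemon=True).start()
    return port

def send_and_receive(port, payload):
    # Client side: send "audio" bytes, receive the processed reply.
    cli = socket.socket()
    cli.connect(("127.0.0.1", port))
    cli.sendall(payload)
    reply = cli.recv(1024)
    cli.close()
    return reply
```

The real pipeline streams continuously in both directions, but the request/reply structure is the same idea.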

If you are using a Mac, you can use the following parameters for local usage:

python s2s_pipeline.py --local_mac_optimal_settings

If you prefer using Docker, you will need the NVIDIA container toolkit. Once your environment is ready, simply run:

docker compose up

These are the ways to execute the workflow. Now let's look at some parameters you can explore in the Hugging Face speech-to-speech pipeline.

Additional Parameters

Each of the STT (speech-to-text), LM (language model), and TTS (text-to-speech) stages has parameters prefixed with stt_, lm_, or tts_, respectively.

For instance, here’s how to run the workflow using CUDA:

python s2s_pipeline.py --lm_model_name microsoft/Phi-3-mini-4k-instruct --stt_compile_mode reduce-overhead --tts_compile_mode default --recv_host 0.0.0.0 --send_host 0.0.0.0
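To see how such prefixed flags can be routed to the right stage, here is a hypothetical sketch; the split_args helper is purely illustrative and not part of the project:

```python
def split_args(args):
    """Group flags by their stt_/lm_/tts_ prefix, stripping the prefix."""
    groups = {"stt": {}, "lm": {}, "tts": {}}
    for key, value in args.items():
        prefix, _, rest = key.partition("_")
        if prefix in groups and rest:
            # stt_compile_mode -> groups["stt"]["compile_mode"]
            groups[prefix][rest] = value
    return groups
```

With this scheme, each stage only ever sees the options meant for it, so adding a new flag to one model never collides with the others.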