Hugging Face's Speech-to-Speech Project
The Hugging Face speech-to-speech project is a modular framework that chains open-source models together through the Transformers library to perform speech-to-speech tasks.
The project aims to achieve capabilities comparable to GPT-4o using only open-source models, with a design that is easy to modify and extend to fit developers' needs.
The workflow arranges four model stages in a cascade:
- Voice Activity Detection (VAD)
  - Silero VAD v5
- Speech-to-Text (STT)
  - Any Whisper model
  - Lightning Whisper MLX
  - Paraformer - FunASR
- Language Model (LM)
  - Any instruction-tuned model from the Hugging Face Hub
  - mlx-lm
  - OpenAI API
- Text-to-Speech (TTS)
  - Parler-TTS
  - MeloTTS
  - ChatTTS
Note that not every listed model has to be used, but the workflow requires at least one model from each of the four categories to function correctly; an example command selecting one model per stage is shown below.
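As a minimal sketch, a run that picks one model per stage might look like the following. Only --lm_model_name appears in the project's documented examples; --stt_model_name and --tts_model_name are assumed here from the stt_/lm_/tts_ flag prefix convention covered under Additional Parameters below, so verify the exact flag names against the repository's help output:

python s2s_pipeline.py \
  --stt_model_name openai/whisper-large-v3 \
  --lm_model_name microsoft/Phi-3-mini-4k-instruct \
  --tts_model_name parler-tts/parler-tts-mini-v1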
The primary goal of this workflow is to transform input speech into another form, such as a different language or tone.
Let's set up the project in your environment to test the workflow.
Project Setup
First, clone the GitHub repository into your environment using the following code:
git clone https://github.com/huggingface/speech-to-speech.git
cd speech-to-speech
Next, install the required packages. The project recommends uv, but plain pip also works:
pip install -r requirements.txt
If you are using a Mac, use the following command:
pip install -r requirements_mac.txt
Ensure that the installation is complete before proceeding. It is also advisable to use a virtual environment to avoid conflicts with your main environment.
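For example, a minimal virtual-environment setup using Python's built-in venv module (one common option; uv can also create environments) might look like this:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt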
Project Usage
There are several recommended ways to run the workflow. One is the server/client approach.
To run the workflow on your server, use the following command:
python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0
Then, run the following command locally to stream microphone input to the server and play back the generated audio:
python listen_and_play.py --host <IP address of your server>
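For instance, if the server were reachable at 192.168.1.50 (an address invented purely for illustration), the call would be:

python listen_and_play.py --host 192.168.1.50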
If you are using a Mac, you can run the whole pipeline locally with the following optimized settings:
python s2s_pipeline.py --local_mac_optimal_settings
If you prefer using Docker, you will need the NVIDIA container toolkit. Once your environment is ready, simply run:
docker compose up
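As a rough sketch of that prerequisite on Ubuntu, assuming NVIDIA's apt repository is already configured, the toolkit can be installed and wired into Docker like this (package names and steps vary by distribution):

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker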
These are the ways to execute the workflow. Now let's look at some parameters you can explore in the Hugging Face speech-to-speech pipeline.
Additional Parameters
Each pipeline stage exposes its own parameters: the speech-to-text (STT), language model (LM), and text-to-speech (TTS) handlers take flags prefixed with stt_, lm_, and tts_ respectively.
For instance, here’s how to run the workflow using CUDA:
python s2s_pipeline.py \
  --lm_model_name microsoft/Phi-3-mini-4k-instruct \
  --stt_compile_mode reduce-overhead \
  --tts_compile_mode default \
  --recv_host 0.0.0.0 \
  --send_host 0.0.0.0
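To list every available flag along with its prefix, the script's command-line parser should respond to the standard help option (assuming the usual argparse behavior):

python s2s_pipeline.py -h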