New ETS Conversion Model Enhances Naturalness in Speech Synthesis

2024-05-28

Recent years have brought a wave of computing tools that markedly improve quality of life for people with disabilities or sensory impairments. Among them, electromyography-to-speech (ETS) conversion is particularly noteworthy: it converts the electrical signals generated by the muscles used in articulation into audible speech.

Recently, researchers at the University of Bremen and SUPSI introduced Diff-ETS, a new ETS conversion model that generates more natural-sounding synthetic speech. The model, detailed in a paper posted on the arXiv preprint server, could open new communication pathways for people who have lost the ability to speak, for example after a laryngectomy.

Traditional ETS conversion systems consist of two core components: an electromyography (EMG) encoder and a vocoder. The EMG encoder converts EMG signals into acoustic speech features, and the vocoder synthesizes the speech signal from those features.
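The two-stage structure can be sketched with stand-in components. The function names, shapes, and toy "networks" below are illustrative assumptions for this article, not code from the paper:

```python
import numpy as np

def emg_encoder(emg, n_mels=80):
    """Stand-in for a neural EMG encoder: maps EMG frames
    (frames x channels) to acoustic features (frames x n_mels)."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((emg.shape[1], n_mels)) * 0.01  # toy weights
    return np.tanh(emg @ W)  # bounded "log-Mel-like" features

def vocoder(features, hop=256):
    """Stand-in for a neural vocoder: expands each feature frame
    into `hop` waveform samples."""
    return np.repeat(features.mean(axis=1), hop)

emg = np.random.default_rng(1).standard_normal((100, 8))  # 100 frames, 8 EMG channels
speech = vocoder(emg_encoder(emg))
print(speech.shape)  # (25600,): 100 frames x 256 samples per frame
```

In a real system both stages are trained neural networks; the sketch only shows how the data flows from EMG frames to a waveform.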

"Due to the scarcity of available data and noise in the signals, the naturalness of the synthesized speech is often unsatisfactory," wrote Zhao Ren, Kevin Scheck, and their colleagues in the paper. "Our work proposes Diff-ETS, which uses a score-based diffusion probabilistic model to enhance the naturalness of the synthesized speech. The diffusion model is applied to improve the quality of the acoustic features predicted by an EMG encoder."

Unlike most ETS conversion models, which consist only of an encoder and a vocoder, Diff-ETS adds a third component: a diffusion probabilistic model. This addition is what is expected to make the synthesized speech more natural.
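To give an intuition for what the added diffusion component does, here is a toy forward-noise/denoise round trip in the standard DDPM parameterization. The learned score network is replaced by an oracle noise estimate, and none of the constants come from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
steps = 50
betas = np.linspace(1e-4, 0.05, steps)       # toy noise schedule
alpha_bar = np.cumprod(1.0 - betas)

x0 = np.sin(np.linspace(0, 4 * np.pi, 64))   # clean "feature" track
t = steps - 1
noise = rng.standard_normal(64)
# Forward process: q(x_t | x_0) mixes the clean signal with Gaussian noise.
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * noise

# Oracle noise estimate, standing in for the learned score network,
# followed by one deterministic jump back to the clean estimate.
eps_hat = (xt - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1 - alpha_bar[t])
x0_hat = (xt - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])

print(np.allclose(x0_hat, x0))  # True: the clean features are recovered
```

In Diff-ETS the role of the oracle is played by a trained network, which lets the model refine noisy encoder predictions rather than signals it has seen clean.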

Zhao Ren, Scheck, and their colleagues first trained the EMG encoder to predict log-Mel spectrograms (a time-frequency representation of audio) and phoneme targets from EMG signals. They then used the diffusion probabilistic model to enhance the log-Mel spectrogram and converted it into synthetic speech with a pre-trained vocoder.
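A log-Mel spectrogram of the kind the encoder predicts can be computed from a waveform as follows. The FFT size, hop length, and Mel-band count below are common defaults chosen for illustration, not the paper's settings:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(wave, sr=16000, n_fft=512, hop=128, n_mels=40):
    # Frame the signal, window each frame, take the magnitude FFT.
    frames = [wave[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(wave) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(frames, axis=1))      # (frames, n_fft//2 + 1)
    # Build a triangular Mel filterbank over the FFT bins.
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(mag @ fbank.T + 1e-6)            # (frames, n_mels)

wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
S = log_mel_spectrogram(wave)
print(S.shape)  # (122, 40)
```

In the actual pipeline the encoder predicts a matrix of this shape directly from EMG, and the vocoder inverts it back to audio; the point of the sketch is only to show what the intermediate representation is.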

The researchers then evaluated Diff-ETS and compared it against existing baseline ETS systems. The speech it generated proved more natural, sounding closer to a real human voice.

"In our experiments, we fine-tuned the predictions of a pre-trained EMG encoder and trained both models in an end-to-end fashion," Zhao Ren, Scheck, and their colleagues further explained in the paper. "Using objective metrics and a listening test, we compared Diff-ETS with a baseline ETS model without diffusion. The results showed that Diff-ETS significantly outperformed the baseline in speech naturalness."

In the future, the team's ETS conversion model could drive further advances in artificially generated speech. Such systems would give people who are unable to speak a voice, letting them communicate more easily with others.

"In future research, we will explore methods to reduce the number of model parameters, such as model compression and knowledge distillation, to enable real-time generation of speech samples," the researchers stated. "In addition, we could train the diffusion model, the encoder, and the vocoder jointly to further improve speech quality."
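Knowledge distillation, one of the compression routes the researchers mention, trains a small student model to match a large teacher's softened outputs. Below is a minimal sketch of the standard temperature-scaled distillation loss (Hinton et al.'s formulation); nothing in it comes from the Diff-ETS paper itself:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T gives softer distributions."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Mean KL(teacher || student) on temperature-softened outputs,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))) / len(p))

teacher = np.array([[2.0, 0.5, -1.0]])
student_off = np.array([[-1.0, 0.5, 2.0]])
print(distillation_loss(teacher.copy(), teacher))  # 0.0: identical outputs
print(distillation_loss(student_off, teacher) > 0)  # True: mismatch penalized
```

Minimizing this loss pushes the student toward the teacher's behavior, which is how a smaller, faster model could approach the quality of the full diffusion pipeline.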