Parallel text-to-speech (TTS) models are commonly used for real-time speech synthesis, as they offer stronger control and faster synthesis than traditional autoregressive models. However, parallel models, especially those based on the transformer architecture, struggle with incremental synthesis because of their fully parallel structure. With the growing popularity of real-time and streaming applications, there is a need for TTS systems that can generate speech incrementally, which is crucial for lowering response latency and improving user experience.
Researchers at NVIDIA have proposed Incremental FastPitch, a variant of FastPitch that incrementally generates high-quality Mel blocks for real-time speech synthesis with lower latency. The model improves the architecture by introducing block-based FFT modules in the decoder, training with receptive-field-constrained block attention masks, and using fixed-size past model states during inference, achieving speech quality comparable to parallel FastPitch while significantly reducing latency. Receptive-field-constrained training is explored with both static and dynamic block masks, which is important for aligning the model with the limited receptive field available during incremental inference.
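To make the receptive-field constraint concrete, here is a minimal sketch of a static block attention mask in PyTorch: each Mel frame may attend only to frames in its own block and in a limited number of past blocks. The function name, its arguments, and the example sizes are illustrative assumptions rather than the paper's exact implementation, which also explores dynamic block masks.

```python
import torch

def block_attention_mask(seq_len: int, block_size: int, past_blocks: int) -> torch.Tensor:
    """Boolean mask (True = may attend): each frame sees its own block and up
    to `past_blocks` preceding blocks, constraining the decoder's receptive
    field during training."""
    block_ids = torch.arange(seq_len) // block_size          # block index of every frame
    diff = block_ids.unsqueeze(1) - block_ids.unsqueeze(0)   # query block minus key block
    # Allow attention only to the current block and a fixed number of past blocks.
    return (diff >= 0) & (diff <= past_blocks)

# Example: 12 frames, blocks of 4 frames, one past block visible per query.
mask = block_attention_mask(seq_len=12, block_size=4, past_blocks=1)
```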
An end-to-end neural TTS system typically consists of two main components: an acoustic model and a vocoder. The process starts with converting text to Mel-spectrograms using acoustic models such as Tacotron 2, FastSpeech, FastPitch, and GlowTTS. The Mel features are then transformed into waveforms using vocoders such as WaveNet, WaveRNN, WaveGlow, and HiFi-GAN. The model is trained and evaluated on the Chinese Standard Mandarin Speech Corpus, which contains 10,000 audio clips from a single female Mandarin speaker. The model parameters follow the open-source FastPitch implementation, with the decoder modified to use causal convolutions in the positional feed-forward layer.
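As a rough illustration of that decoder change, the sketch below implements a causal 1-D convolution by left-padding the input so each output frame depends only on current and past frames; the class name, channel count, and kernel size are assumptions chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that never looks at future frames, as required for
    block-wise incremental decoding."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.left_pad = kernel_size - 1            # pad on the left only
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))           # zero-pad past context, no future padding
        return self.conv(x)

# Hypothetical drop-in for the symmetric convolution in a position-wise
# feed-forward layer; 384 channels and kernel size 3 are illustrative values.
layer = CausalConv1d(channels=384, kernel_size=3)
out = layer(torch.randn(2, 384, 100))
```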
During training, Incremental FastPitch uses receptive-field-constrained block attention masks so that the decoder adapts to the limited receptive field it will see in incremental inference, and during inference it keeps fixed-size past model states to maintain Mel continuity between consecutive blocks. Mel-spectrograms for training are generated by applying an FFT with a size of 1024, a hop length of 256, and a window length of 1024 to normalized waveforms.
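A minimal sketch of that feature extraction with librosa follows; only the FFT size, hop length, and window length come from the description above, while the sample rate, normalization scheme, and number of Mel bins are assumptions.

```python
import librosa
import numpy as np

# Assumed file path and sample rate; the corpus audio may be resampled differently.
wav, sr = librosa.load("sample.wav", sr=22050)
wav = wav / max(1e-8, float(np.abs(wav).max()))     # peak-normalize the waveform

mel = librosa.feature.melspectrogram(
    y=wav,
    sr=sr,
    n_fft=1024,        # FFT size given above
    hop_length=256,    # hop length given above
    win_length=1024,   # window length given above
    n_mels=80,         # assumed number of Mel bins
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # common log compression
```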
The experimental results show that Incremental FastPitch produces speech of quality comparable to parallel FastPitch with significantly lower latency, making it suitable for real-time speech applications. The combination of block-based FFT modules, receptive-field-constrained block attention masks, and fixed-size past model states for inference all contribute to this performance. A visual ablation study shows that the Mel-spectrograms generated by Incremental FastPitch have almost no observable differences from those of parallel FastPitch, underscoring the effectiveness of the approach.
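The sketch below illustrates the fixed-size past-state idea during incremental inference: cached decoder states are truncated to a constant length after every block, so memory use and attention cost stay bounded however long the utterance grows. The helper name, tensor shapes, and cache size are hypothetical and are not taken from the released implementation.

```python
from typing import Optional

import torch

def update_past_states(past: Optional[torch.Tensor], new: torch.Tensor, max_frames: int) -> torch.Tensor:
    """Append the newly computed decoder states and keep only the most recent
    `max_frames` frames, giving a fixed-size cache across blocks."""
    cache = new if past is None else torch.cat([past, new], dim=1)
    return cache[:, -max_frames:]

# Toy incremental loop: each step produces one block of decoder states, and the
# cache is truncated so inference matches the receptive field used in training.
past = None
for step in range(5):
    new_block = torch.randn(1, 30, 384)            # (batch, block_frames, hidden), illustrative sizes
    past = update_past_states(past, new_block, max_frames=60)
    # ...attention over `past` would then produce the Mel block for this step...
```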
In conclusion, Incremental FastPitch is a variant of FastPitch that enables low-latency incremental synthesis of high-quality Mel blocks for real-time speech applications. By combining block-based FFT modules, receptive-field-constrained block attention masks, and fixed-size past model states, it matches the speech quality of parallel FastPitch while significantly reducing latency. With a faster and more controllable synthesis process, Incremental FastPitch is a promising approach for real-time applications.