Artificial Intelligence Utilizes Human Perception to Filter Noisy Audio
Researchers have developed a new deep learning model that is expected to significantly improve audio quality in real-world scenarios by leveraging a previously underutilized tool: human perception.
The researchers found that they can improve objectively measured audio quality by combining a speech enhancement model with subjective evaluations of sound quality from individual listeners.
Compared to conventional methods, the new model is better at minimizing noisy audio, meaning unwanted sound that interferes with people's ability to hear what they want to hear. Importantly, the quality scores the model predicts are highly correlated with human judgments.
Traditional approaches to reducing background noise use artificial intelligence algorithms to separate noise from the desired signal. However, these objective methods do not always align with listeners' judgments of what makes speech intelligible, said Donald Williamson, co-author of the study and associate professor of computer science and engineering at Ohio State University.
"What sets this research apart from others is that we attempt to utilize perception to train the model to eliminate unwanted sound," said Williamson. "If certain qualities of the signal can be perceived by people, then our model can leverage this additional information to learn and better eliminate noise."
The study, published in the IEEE/ACM Transactions on Audio, Speech, and Language Processing, focuses on improving monaural speech enhancement, which refers to speech from a single audio channel, such as a microphone.
The researchers trained the new model using two datasets involving recordings of people's conversations from previous studies. In some cases, there were background noises that could potentially mask the conversation, such as television or music. Listeners rated the speech quality of each recording on a scale of 1 to 100.
The team's model performs well because it employs a joint learning approach, combining a dedicated speech enhancement module with a prediction model that estimates the mean opinion score human listeners would give a noisy signal.
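The article does not give the model's exact training objective, but the joint learning idea it describes can be sketched as a combined loss: one term rewards the enhancement module for reconstructing clean speech, and a second term rewards the predictor for matching human opinion scores. The function names, the weighting parameter `alpha`, and the normalization of 1-100 ratings to [0, 1] below are illustrative assumptions, not details from the study.

```python
import numpy as np

def joint_loss(enhanced, clean, predicted_mos, human_mos, alpha=0.5):
    """Hypothetical joint objective for perceptually guided speech enhancement.

    enhanced, clean : waveforms (or spectrogram frames) as numpy arrays
    predicted_mos, human_mos : opinion scores normalized from the article's
        1-100 listener scale to [0, 1]
    alpha : assumed weight balancing the perceptual term against the
        signal-reconstruction term
    """
    # Signal term: mean squared error between enhanced and clean speech
    enhancement_loss = np.mean((enhanced - clean) ** 2)
    # Perceptual term: squared error of the predicted human opinion score
    mos_loss = (predicted_mos - human_mos) ** 2
    return enhancement_loss + alpha * mos_loss

# Example: a slightly imperfect enhancement and a slightly off MOS prediction
loss = joint_loss(
    enhanced=np.array([0.1, 0.2]),
    clean=np.array([0.0, 0.2]),
    predicted_mos=0.8,   # model's estimate of the listener score
    human_mos=0.9,       # actual (normalized) listener rating
)
```

Training both terms together is what lets perceptual information shape the enhancement network: gradients from the opinion-score term flow back into the shared representation, steering it toward removing the noise that listeners actually notice.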
The results show that the new approach outperforms other models on objective measures of perceptual quality, intelligibility, and agreement with human ratings.
However, using human perception of sound quality also has its challenges, said Williamson.
"The evaluation of noisy audio is difficult because it is highly subjective. It depends on your hearing ability and auditory experience," he said. Factors such as whether a listener uses a hearing aid or cochlear implant also shape how they perceive their sound environment.
Because improving the quality of noisy speech is crucial for hearing aids, speech recognition systems, speaker verification applications, and hands-free communication systems, accounting for these perceptual differences will be essential to keeping noisy audio from inconveniencing users.