Deep-learning-based audio-visual speech enhancement in presence of Lombard effect
Introduction
Speech is perhaps the most common way people communicate with each other. This kind of communication is often hampered by disturbances of various kinds, such as competing speakers, loud music at a party, or the noise inside a car cabin. We refer to sounds other than the speech of interest as background noise.
Background noise is known to affect two attributes of speech: intelligibility and quality (Loizou, 2007). Both are important in a conversation, since poor intelligibility makes it hard to comprehend what a speaker is saying, and poor quality may affect speech naturalness and listening effort (Loizou, 2007). Humans tend to counteract the negative effects of background noise by instinctively changing their speaking style, a process known as Lombard effect (Lombard, 1911, Zollinger, Brumm, 2011). The changes that can be observed vary widely across individuals (Junqua, 1993, Marxer, Barker, Alghamdi, Maddock, 2018) and affect multiple dimensions: acoustically, the average fundamental frequency (F0) and the sound energy increase, the spectral tilt flattens due to an energy increment at high frequencies, and the centre frequencies of the first and second formants (F1 and F2) shift (Junqua, 1993, Lu, Cooke, 2008); visually, head and face motion become more pronounced and the movements of the lips and jaw are amplified (Vatikiotis-Bateson, Barbosa, Chow, Oberg, Tan, Yehia, 2007, Garnier, Henrich, Dubois, 2010, Garnier, Ménard, Richard, 2012); temporally, the speech rate changes due to an increase in vowel duration (Junqua, 1993, Cooke, King, Garnier, Aubanel, 2014).
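As a rough illustration of one of these acoustic cues, spectral tilt can be estimated by fitting a straight line to the log-magnitude spectrum over log-frequency. The following numpy sketch is purely illustrative: the function name and the simple FFT-based estimator are our own choices, not the measurement procedure used in the cited studies.

```python
import numpy as np

def spectral_tilt(signal, fs):
    """Slope (dB per octave) of a line fitted to the log-magnitude
    spectrum over log2-frequency; a flatter (less negative) tilt
    indicates relatively more high-frequency energy, as reported
    for Lombard speech."""
    spec = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    keep = freqs > 0                      # drop the DC bin before logs
    log_f = np.log2(freqs[keep])
    log_mag = 20.0 * np.log10(spec[keep] + 1e-12)
    slope, _ = np.polyfit(log_f, log_mag, 1)
    return slope

# Two synthetic spectra: ~1/f (steep tilt) vs. ~1/sqrt(f) (flatter tilt)
fs = 16000
freqs = np.fft.rfftfreq(fs, d=1.0 / fs)
steep = np.zeros_like(freqs); steep[1:] = 1.0 / freqs[1:]
flat = np.zeros_like(freqs); flat[1:] = 1.0 / np.sqrt(freqs[1:])
plain_like = np.fft.irfft(steep)          # tilt near -6 dB/octave
lombard_like = np.fft.irfft(flat)         # tilt near -3 dB/octave
```

On these synthetic signals the flatter spectrum yields a less negative slope, mirroring the flattening reported for Lombard speech.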
Although Lombard effect improves the intelligibility of speech in noise (Summers, Pisoni, Bernacki, Pedlow, Stokes, 1988, Pittman, Wiley, 2001), effective communication might still be challenged in some conditions, e.g. when the listener is hearing impaired. In these situations, speech enhancement (SE) algorithms may be applied to the noisy signal with the aim of improving speech quality and speech intelligibility. Several SE techniques have been proposed in the literature. Some approaches treat SE as a statistical estimation problem (Loizou, 2007) and include well-known methods such as Wiener filtering (Lim and Oppenheim, 1979) and the minimum mean square error estimator of the short-time magnitude spectrum (Ephraim and Malah, 1984). Many improved methods have been proposed, which primarily distinguish themselves by refined statistical speech models (Martin, 2005, Erkelens, Hendriks, Heusdens, Jensen, 2007, Gerkmann, Martin, 2009) or noise models (Martin, Breithaupt, 2003, Loizou, 2007). These techniques, which make statistical assumptions about the distributions of the signals, have been reported to be largely unable to provide speech intelligibility improvements (Hu, Loizou, 2007, Jensen, Hendriks, 2012). As an alternative, data-driven techniques, especially deep learning, make less strict assumptions about the distribution of the speech, of the noise, or of the way they are mixed: a learning algorithm is used to find a function that best maps features of degraded speech to features of clean speech.
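To make the statistical approach concrete, here is a minimal sketch of a frequency-domain Wiener gain with a simple maximum-likelihood a priori SNR estimate. The function name, the flooring constant, and the assumption that a noise power spectrum is available (e.g. from a noise-only segment) are illustrative, not the exact formulation of the cited works.

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, xi_floor=1e-3):
    """Per-bin Wiener gain xi / (1 + xi), where the a priori SNR xi
    is approximated from the noisy and noise power spectra."""
    xi = np.maximum(noisy_psd / (noise_psd + 1e-12) - 1.0, xi_floor)
    return xi / (1.0 + xi)

# Example: bins where speech dominates keep most of their energy,
# bins where the observation is essentially noise are attenuated.
noisy_psd = np.array([100.0, 2.0, 1.0])
noise_psd = np.array([1.0, 1.0, 1.0])
gain = wiener_gain(noisy_psd, noise_psd)
```

The gain always lies strictly between 0 and 1, attenuating each bin according to its estimated local SNR; this per-bin suppression is the behaviour the refined speech and noise models mentioned above try to improve.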
Over the years, the speech processing community has put a considerable effort into designing training targets and objective functions (Wang, Narayanan, Wang, 2014, Erdogan, Hershey, Watanabe, Le Roux, 2015, Williamson, Wang, Wang, 2016, Michelsanti, Tan, Sigurdsson, Jensen, 2019) for different neural network models, including deep neural networks (Xu, Du, Dai, Lee, 2014, Kolbæk, Tan, Jensen, 2017), denoising autoencoders (Lu et al., 2013), recurrent neural networks (Weninger et al., 2014), fully convolutional neural networks (Park and Lee, 2017), and generative adversarial networks (Michelsanti and Tan, 2017). These methods represent the current state of the art in the field (Wang and Chen, 2018), and since they use only audio signals, we refer to them as audio-only SE (AO-SE) systems.
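As an example of such a training target, several of the cited works train a network to predict a time-frequency mask from noisy features. The sketch below shows an ideal-ratio-mask-style target with a mean-squared-error objective; the exact targets and losses differ between the cited systems, and the function names are our own.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag):
    """IRM-style target: fraction of the energy in each time-frequency
    cell that belongs to the clean speech; values lie in [0, 1]."""
    clean_pow = clean_mag**2
    return clean_pow / (clean_pow + noise_mag**2 + 1e-12)

def mse_loss(predicted_mask, target_mask):
    """A typical mean-squared-error objective on the mask."""
    return np.mean((predicted_mask - target_mask)**2)
```

At inference time the predicted mask is multiplied element-wise with the noisy magnitude spectrogram to obtain the enhanced magnitude.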
Previous studies show that observing the speaker’s facial and lip movements contributes to speech perception (Sumby, Pollack, 1954, Erber, 1975, McGurk, MacDonald, 1976). This finding suggests that a SE system could tolerate higher levels of background noise, if visual cues could be used in the enhancement process. This intuition is confirmed by a pioneering study on audio-visual SE (AV-SE) by Girin et al. (2001), where simple geometric features extracted from the video of the speaker’s mouth are used. Later, more complex frameworks based on classical statistical approaches have been proposed (Almajai, Milner, 2011, Abel, Hussain, 2014, Abel, Hussain, Luo, 2014), and very recently deep learning methods have been used for AV-SE (Hou, Wang, Lai, Lin, Tsao, Chang, Wang, 2018, Gabbay, Shamir, Peleg, 2018, Ephrat, Mosseri, Lang, Dekel, Wilson, Hassidim, Freeman, Rubinstein, 2018, Afouras, Chung, Zisserman, 2018, Owens, Efros, 2018, Morrone, Pasa, Tikhanoff, Bergamaschi, Fadiga, Badino, 2019).
It is reasonable to think that visual features are most helpful for SE when the speech is so degraded that AO-SE systems achieve poor performance, i.e. when background noise heavily dominates the speech of interest. Since spoken communication is particularly hard in such acoustic environments, we can assume that the speakers are under the influence of Lombard effect; in other words, the input to SE systems in this situation is Lombard speech. Despite this consideration, state-of-the-art SE systems do not take Lombard effect into account, because collecting Lombard speech is usually expensive: the systems are usually trained and evaluated with speech recorded in quiet and afterwards degraded with additive noise. Previous works show that speaker (Hansen and Varadarajan, 2009) and speech recognition (Junqua, 1993) systems that ignore Lombard effect achieve sub-optimal performance, also in visual (Heracleous, Ishi, Sato, Ishiguro, Hagita, 2013, Marxer, Barker, Alghamdi, Maddock, 2018) and audio-visual settings (Heracleous et al., 2013). It is therefore of interest to conduct a similar study in a SE context.
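The standard setup just described, clean recordings degraded with additive noise at a chosen SNR, can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the mixture speech + noise has the
    requested global SNR in dB, then add it to the speech."""
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2)
    scale = np.sqrt(p_speech / (p_noise * 10.0**(snr_db / 10.0)))
    return speech + scale * noise
```

Note that the clean recordings lack the acoustic and visual modifications of Lombard speech, which is precisely the mismatch investigated in this study.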
With the objective of providing a more extensive analysis of the impact of Lombard effect on deep-learning-based SE systems, the present work extends a preliminary study (Michelsanti et al., 2019a) with the following novel contributions. First, new experiments are conducted in which deep-learning-based SE systems trained with Lombard or non-Lombard speech are evaluated on Lombard speech using a cross-validation setting, to prevent potential intra-speaker variability in the adopted dataset from biasing the conclusions. Then, the effect of inter-speaker variability on the systems is investigated, in relation to both acoustic and visual features. Next, as an example application, a system trained with both Lombard and non-Lombard data over a wide signal-to-noise-ratio (SNR) range is compared with a system trained only on non-Lombard speech, as is currently done for state-of-the-art models. Finally, since existing objective measures predict speech quality and intelligibility from the audio signals alone, listening tests using audio-visual stimuli have been performed. This test setup, which is generally not employed to evaluate SE systems, is closer to a real-world scenario, where a listener is usually able to look at the face of the talker.
Materials: Audio-visual speech corpus and noise data
The speech material used in this study is the Lombard GRID corpus (Alghamdi et al., 2018), an extension of the popular audio-visual GRID dataset (Cooke et al., 2006). It consists of 55 native speakers of British English (25 males and 30 females) between 18 and 30 years old. The sentences pronounced by the talkers adhere to the syntax of the GRID corpus: six-word sentences with the following structure: < command > < color* > < preposition > < letter* > < digit* > < adverb >
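A sentence following this syntax can be generated programmatically. The word lists below are the ones commonly reported for the GRID corpus and should be treated as illustrative; consult the corpus documentation for the authoritative vocabulary.

```python
import random

# Word lists as commonly reported for the GRID corpus (illustrative).
COMMANDS = ["bin", "lay", "place", "set"]
COLORS = ["blue", "green", "red", "white"]          # keyword slot
PREPOSITIONS = ["at", "by", "in", "with"]
LETTERS = list("abcdefghijklmnopqrstuvxyz")         # 'w' is excluded
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]  # keyword slot
ADVERBS = ["again", "now", "please", "soon"]

def grid_sentence(rng=random):
    """Sample one six-word sentence, e.g. 'place red at b five now'."""
    slots = [COMMANDS, COLORS, PREPOSITIONS, LETTERS, DIGITS, ADVERBS]
    return " ".join(rng.choice(words) for words in slots)
```

The fixed syntax makes the corpus convenient for controlled experiments, since every utterance has the same length and structure.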
Methodology
In this study, we train and evaluate systems that perform spectral SE using deep learning, as illustrated in Fig. 1. The processing pipeline is inspired by Gabbay et al. (2018) and is the same as the one used in Michelsanti et al. (2019a). To keep the exposition self-contained, we report its main details in this section. We did not explore the effect of changing the network topology, because we are interested in the performance gap between Lombard and non-Lombard systems, and, for this, it is
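A spectral processing pipeline of this kind (enhance the short-time magnitude spectrum, reuse the noisy phase, resynthesise by overlap-add) can be sketched in numpy as follows. The window length, hop size, and the placeholder for the learned model are illustrative choices, not the exact parameters of the system described here.

```python
import numpy as np

WIN, HOP = 512, 128
_window = np.hanning(WIN)

def stft(x):
    """Windowed short-time Fourier transform, one row per frame."""
    frames = [x[i:i + WIN] * _window
              for i in range(0, len(x) - WIN + 1, HOP)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec):
    """Normalised overlap-add resynthesis of the framed spectra."""
    frames = np.fft.irfft(spec, n=WIN, axis=1) * _window
    out = np.zeros(HOP * (len(frames) - 1) + WIN)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + WIN] += frame
        norm[i * HOP:i * HOP + WIN] += _window**2
    return out / np.maximum(norm, 1e-12)

def enhance(noisy, predict_mask):
    """Apply a (learned) mask to the noisy magnitude and resynthesise
    with the noisy phase, as in the pipeline described above."""
    spec = stft(noisy)
    mag, phase = np.abs(spec), np.angle(spec)
    return istft(predict_mask(mag) * mag * np.exp(1j * phase))
```

With an identity mask the pipeline reconstructs its input (up to the unwindowed endpoints), which is a useful sanity check before plugging in a trained network as `predict_mask`.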
Experiments
The experiments conducted in this study compare the performance of AO-SE, VO-SE (video-only SE), and AV-SE systems in terms of two widely adopted objective measures: perceptual evaluation of speech quality (PESQ) (Rix et al., 2001), specifically the wideband extension (ITU, 2005) as implemented by Loizou (2007), and extended short-time objective intelligibility (ESTOI) (Jensen and Taal, 2016). PESQ scores, used to estimate speech quality, lie between −0.5 and 4.5, where high values correspond to high speech
Listening tests
Although it has been shown that visual cues have an impact on speech perception (Sumby, Pollack, 1954, McGurk, MacDonald, 1976), the currently available objective measures used to estimate speech quality and speech intelligibility, e.g. PESQ and ESTOI, only take into account the audio signals. Even when listening tests are performed to evaluate the performance of a SE system, visual stimuli are usually ignored and not presented to the participants (Hussain et al., 2017), despite the fact that
Conclusion
In this paper, we presented an extensive analysis of the impact of Lombard effect on audio, visual and audio-visual speech enhancement systems based on deep learning. We conducted several experiments using a database consisting of 54 speakers and showed the general benefit of training a system with Lombard speech.
In more detail, we first trained systems with Lombard or non-Lombard speech and evaluated them on Lombard speech adopting a cross-validation setup. The results showed that systems
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported, in part, by the Oticon Foundation.
References (90)
- The listening talker: a review of human and algorithmic context-induced modifications of speech. Comput. Speech Lang. (2014)
- Speaking in noise: how does the Lombard effect improve acoustic contrasts between speech and ambient noise? Comput. Speech Lang. (2014)
- Analysis of the visual Lombard effect and automatic recognition experiments. Comput. Speech Lang. (2013)
- The impact of the Lombard effect on audio and visual speech recognition systems. Speech Commun. (2018)
- 300 faces in-the-wild challenge: database and results. Image Vision Comput. (2016)
- Examining visible articulatory features in clear and plain speech. Speech Commun. (2015)
- Novel two-stage audiovisual speech filtering in noisy environments. Cognit. Comput. (2014)
- Cognitively inspired speech processing for multimodal hearing technology. Proceedings of CICARE (2014)
- The conversation: deep audio-visual speech enhancement. Proceedings of Interspeech (2018)
- Visual Speech Enhancement and its Application in Speech Perception Training (2017)
- A corpus of audio-visual Lombard speech with frontal and profile views. J. Acoust. Soc. Am.
- Short term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process.
- Visually derived Wiener filters for speech enhancement. IEEE Trans. Audio Speech Lang. Process.
- Analysis of correlation between audio and visual speech features for clean audio feature prediction in noise. Proceedings of Interspeech/ICSLP
- Dominance statistics: ordinal analyses to answer ordinal questions. Psychol. Bull.
- An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am.
- Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process.
- Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Trans. Graph.
- Auditory-visual perception of speech. J. Speech Hear. Disord.
- Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. Proceedings of ICASSP
- Minimum mean-square error estimation of discrete Fourier coefficients with generalized Gamma priors. IEEE Trans. Audio Speech Lang. Process.
- Discovering Statistics Using IBM SPSS Statistics
- Visual speech enhancement. Proceedings of Interspeech
- An acoustic and articulatory study of Lombard speech: global effects on the utterance. Proceedings of Interspeech/ICSLP
- Influence of sound immersion and communicative interaction on the Lombard effect. J. Speech Lang. Hear. Res.
- Effect of being seen on the production of visible speech cues: a pilot study on Lombard speech. Proceedings of Interspeech/ICSLP
- On the statistics of spectral amplitudes after variance reduction by temporal cepstrum smoothing and cepstral nulling. IEEE Trans. Signal Process.
- Audio-visual enhancement of speech in noise. J. Acoust. Soc. Am.
- Understanding the difficulty of training deep feedforward neural networks. Proceedings of AISTATS
- Signal estimation from modified short-time Fourier transform. IEEE Trans. Acoust. Speech Signal Process.
- Analysis and compensation of Lombard speech across noise type and levels with application to in-set/out-of-set speaker recognition. IEEE Trans. Audio Speech Lang. Process.
- Computation of measures of effect size for neuroscience data sets. Eur. J. Neurosci.
- Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580
- Audio-visual speech enhancement based on multimodal deep convolutional neural network. IEEE Trans. Emerg. Top. Comput. Intell.
- A comparative intelligibility study of single-microphone noise reduction algorithms. J. Acoust. Soc. Am.
- Towards multi-modal hearing aid design and evaluation in realistic audio-visual settings: challenges and opportunities. Proceedings of CHAT
- Batch normalization: accelerating deep network training by reducing internal covariate shift. Proceedings of ICML
- Image-to-image translation with conditional adversarial networks. Proceedings of CVPR
- Spectral magnitude minimum mean-square error estimation using binary and continuous gain functions. IEEE Trans. Audio Speech Lang. Process.
- An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Trans. Audio Speech Lang. Process.
- The Lombard reflex and its role on human listeners and automatic speech recognizers. J. Acoust. Soc. Am.