Speech Communication

Volume 115, December 2019, Pages 38-50

Deep-learning-based audio-visual speech enhancement in presence of Lombard effect

https://doi.org/10.1016/j.specom.2019.10.006

Highlights

  • Impact of Lombard effect on audio, visual and audio-visual speech enhancement (SE).

  • Benefit of training SE systems with Lombard speech in terms of objective measures.

  • Investigation of the inter-speaker performance variability of SE systems.

  • Proposal of SE systems that work well for a wide SNR range.

  • Evaluation of SE systems with audio-visual listening tests.

Abstract

When speaking in the presence of background noise, humans reflexively change their way of speaking in order to improve the intelligibility of their speech. This reflex is known as the Lombard effect. Collecting speech in Lombard conditions is usually hard and costly. For this reason, speech enhancement systems are generally trained and evaluated on speech recorded in quiet to which noise is artificially added. Since these systems are often used in situations where Lombard speech occurs, in this work we analyse the impact that the Lombard effect has on audio, visual and audio-visual speech enhancement, focusing on deep-learning-based systems, since they represent the current state of the art in the field.

We conduct several experiments using an audio-visual Lombard speech corpus consisting of utterances spoken by 54 different talkers. The results show that training deep-learning-based models with Lombard speech is beneficial in terms of both estimated speech quality and estimated speech intelligibility at low signal-to-noise ratios, where the visual modality can play an important role in acoustically challenging situations. We also find that a performance difference between genders exists, due to the distinct Lombard speech exhibited by males and females, and we analyse it in relation to acoustic and visual features. Furthermore, listening tests conducted with audio-visual stimuli show that the speech quality of the signals processed with systems trained on Lombard speech is statistically significantly better than that obtained with systems trained on non-Lombard speech at a signal-to-noise ratio of 5 dB. Regarding speech intelligibility, we observe a general tendency towards a benefit from training the systems with Lombard speech.

Introduction

Speech is perhaps the most common way for people to communicate with each other. Often, this communication is harmed by sources of disturbance of different nature, such as competing speakers, loud music at a party, or the noise inside a car cabin. We refer to the sounds other than the speech of interest as background noise.

Background noise is known to affect two attributes of speech: intelligibility and quality (Loizou, 2007). Both of these aspects are important in a conversation, since poor intelligibility makes it hard to comprehend what a speaker is saying, and poor quality may affect speech naturalness and listening effort (Loizou, 2007). Humans tend to counteract the negative effects of background noise by instinctively changing the way they speak, their speaking style, in a process known as the Lombard effect (Lombard, 1911; Zollinger and Brumm, 2011). The changes that can be observed vary widely across individuals (Junqua, 1993; Marxer et al., 2018) and affect multiple dimensions: acoustically, the average fundamental frequency (F0) and the sound energy increase, the spectral tilt flattens due to an energy increment at high frequencies, and the centre frequencies of the first and second formants (F1 and F2) shift (Junqua, 1993; Lu and Cooke, 2008); visually, head and face motion becomes more pronounced and the movements of the lips and jaw are amplified (Vatikiotis-Bateson et al., 2007; Garnier et al., 2010; Garnier et al., 2012); temporally, the speech rate changes due to an increase in vowel duration (Junqua, 1993; Cooke et al., 2014).
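
As an illustration of how the acoustic correlates mentioned above can be quantified, the following sketch estimates the mean F0 and the spectral tilt of a recording with librosa. It is only a sketch: the file names, the pYIN search range and the 512-point STFT are our own illustrative assumptions, not choices taken from any cited study.

```python
# Illustrative sketch: two acoustic correlates of Lombard speech,
# mean F0 and spectral tilt, computed for a plain and a Lombard
# recording. The file names are hypothetical.
import numpy as np
import librosa

def lombard_features(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)

    # Mean fundamental frequency over voiced frames (pYIN tracker);
    # unvoiced frames are returned as NaN and ignored by nanmean.
    f0, _, _ = librosa.pyin(y, fmin=60.0, fmax=400.0, sr=sr)
    mean_f0 = np.nanmean(f0)

    # Spectral tilt: slope (dB/kHz) of a line fitted to the long-term
    # average log-magnitude spectrum. Lombard speech typically
    # flattens this slope (more energy at high frequencies).
    n_fft = 512
    mag = np.abs(librosa.stft(y, n_fft=n_fft))
    spec_db = librosa.amplitude_to_db(mag.mean(axis=1))
    freqs_khz = librosa.fft_frequencies(sr=sr, n_fft=n_fft) / 1000.0
    tilt, _ = np.polyfit(freqs_khz, spec_db, 1)

    return mean_f0, tilt

for style in ("plain", "lombard"):  # hypothetical file names
    f0, tilt = lombard_features(f"{style}.wav")
    print(f"{style}: mean F0 = {f0:.1f} Hz, tilt = {tilt:.1f} dB/kHz")
```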

Although the Lombard effect improves the intelligibility of speech in noise (Summers et al., 1988; Pittman and Wiley, 2001), effective communication might still be challenged by particular conditions, e.g. a hearing impairment of the listener. In these situations, speech enhancement (SE) algorithms may be applied to the noisy signal with the aim of improving speech quality and speech intelligibility. Several SE techniques have been proposed in the literature. Some approaches treat SE as a statistical estimation problem (Loizou, 2007) and include well-known methods such as Wiener filtering (Lim and Oppenheim, 1979) and the minimum mean square error estimator of the short-time magnitude spectrum (Ephraim and Malah, 1984). Many improved methods have been proposed, which primarily distinguish themselves by refined statistical speech models (Martin, 2005; Erkelens et al., 2007; Gerkmann and Martin, 2009) or noise models (Martin and Breithaupt, 2003; Loizou, 2007). These techniques, which make statistical assumptions about the distributions of the signals, have been reported to be largely unable to provide speech intelligibility improvements (Hu and Loizou, 2007; Jensen and Hendriks, 2012). As an alternative, data-driven techniques, especially deep learning, make weaker assumptions about the distributions of the speech and the noise, or about the way they are mixed: a learning algorithm is used to find a function that best maps features of degraded speech to features of clean speech. Over the years, the speech processing community has put considerable effort into designing training targets and objective functions (Wang et al., 2014; Erdogan et al., 2015; Williamson et al., 2016; Michelsanti et al., 2019) for different neural network models, including deep neural networks (Xu et al., 2014; Kolbæk et al., 2017), denoising autoencoders (Lu et al., 2013), recurrent neural networks (Weninger et al., 2014), fully convolutional neural networks (Park and Lee, 2017), and generative adversarial networks (Michelsanti and Tan, 2017). These methods represent the current state of the art in the field (Wang and Chen, 2018), and since they use only audio signals, we refer to them as audio-only SE (AO-SE) systems.
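
To make this feature-mapping view concrete, here is a minimal PyTorch sketch of one common instantiation of such a system; the architecture, training target and hyperparameters are illustrative assumptions, not those of any specific system cited above. A feed-forward network is trained with a mean square error objective to predict an ideal ratio mask from noisy log-magnitude frames.

```python
# Minimal sketch (illustrative, not any cited system): a feed-forward
# network maps noisy log-magnitude STFT frames to an ideal-ratio-mask
# target, trained with an MSE objective.
import torch
import torch.nn as nn

n_freq = 257  # frequency bins of a 512-point STFT

net = nn.Sequential(
    nn.Linear(n_freq, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, n_freq), nn.Sigmoid(),  # mask values in [0, 1]
)
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

def training_step(noisy_mag, clean_mag, noise_mag):
    """One gradient step on a batch of magnitude-spectrogram frames."""
    # Ideal ratio mask: the oracle target the network learns to predict.
    irm = clean_mag / (clean_mag + noise_mag + 1e-8)
    mask = net(torch.log(noisy_mag + 1e-8))
    loss = nn.functional.mse_loss(mask, irm)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Random tensors stand in for real STFT frames of speech and noise.
clean = torch.rand(32, n_freq)
noise = torch.rand(32, n_freq)
print(training_step(clean + noise, clean, noise))
```

At test time, the predicted mask would be applied to the noisy magnitude spectrum before resynthesising the waveform.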

Previous studies show that observing the speaker’s facial and lip movements contributes to speech perception (Sumby and Pollack, 1954; Erber, 1975; McGurk and MacDonald, 1976). This finding suggests that a SE system could tolerate higher levels of background noise if visual cues were used in the enhancement process. This intuition is confirmed by a pioneering study on audio-visual SE (AV-SE) by Girin et al. (2001), where simple geometric features extracted from the video of the speaker’s mouth are used. Later, more complex frameworks based on classical statistical approaches were proposed (Almajai and Milner, 2011; Abel and Hussain, 2014; Abel et al., 2014), and very recently deep learning methods have been used for AV-SE (Hou et al., 2018; Gabbay et al., 2018; Ephrat et al., 2018; Afouras et al., 2018; Owens and Efros, 2018; Morrone et al., 2019).

It is reasonable to think that visual features are most helpful for SE when the speech is so degraded that AO-SE systems achieve poor performance, i.e. when background noise heavily dominates the speech of interest. Since spoken communication is particularly hard in such acoustic environments, we can assume that the speakers are under the influence of the Lombard effect. In other words, the input to SE systems in this situation is Lombard speech. Despite this consideration, state-of-the-art SE systems do not take the Lombard effect into account, because collecting Lombard speech is usually expensive: training and evaluation are usually performed with speech recorded in quiet and afterwards degraded with additive noise. Previous works show that speaker (Hansen and Varadarajan, 2009) and speech recognition (Junqua, 1993) systems that ignore the Lombard effect achieve sub-optimal performance, also in visual (Heracleous et al., 2013; Marxer et al., 2018) and audio-visual settings (Heracleous et al., 2013). It is therefore of interest to conduct a similar study in a SE context.

With the objective of providing a more extensive analysis of the impact of the Lombard effect on deep-learning-based SE systems, the present work extends a preliminary study (Michelsanti et al., 2019a) with the following novel contributions. First, new experiments are conducted in which deep-learning-based SE systems trained with Lombard or non-Lombard speech are evaluated on Lombard speech using a cross-validation setting, to prevent potential intra-speaker variability in the adopted dataset from biasing the conclusions. Then, the effect of inter-speaker variability on the systems is investigated, in relation to both acoustic and visual features. Next, as an example application, a system trained on both Lombard and non-Lombard data covering a wide signal-to-noise ratio (SNR) range is compared with a system trained only on non-Lombard speech, as is currently done for state-of-the-art models. Finally, since existing objective measures can only predict speech quality and intelligibility from the audio signals in isolation, listening tests using audio-visual stimuli were performed. This test setup, which is generally not employed to evaluate SE systems, is closer to a real-world scenario, where a listener is usually able to look at the face of the talker.


Materials: Audio-visual speech corpus and noise data

The speech material used in this study is the Lombard GRID corpus (Alghamdi et al., 2018), which is an extension of the popular audio-visual GRID dataset (Cooke et al., 2006). It consists of 55 native speakers of British English (25 males and 30 females) who are between 18 and 30 years old. The sentences pronounced by the talkers adhere to the syntax of the GRID corpus: six-word sentences with the structure <command> <color*> <preposition> <letter*> <digit*> <adverb>

Methodology

In this study, we train and evaluate systems that perform spectral SE using deep learning, as illustrated in Fig. 1. The processing pipeline is inspired by Gabbay et al. (2018) and is the same as the one used in Michelsanti et al. (2019a). To keep the exposition self-contained, we report its main details in this section. We did not explore the effect of changing the network topology, because we are interested in the performance gap between Lombard and non-Lombard systems, and, for this, it is
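
As a rough illustration of such a spectral SE pipeline, the sketch below shows the generic signal flow only: the enhancement model is left abstract, and reusing the noisy phase for reconstruction is our simplifying assumption, not necessarily the choice made in the actual pipeline of Gabbay et al. (2018) and Michelsanti et al. (2019a).

```python
# Sketch of a generic spectral SE signal flow of the kind described
# here: compute the STFT of the noisy waveform, let a learned model
# enhance the magnitude, and invert with the noisy phase. `model` is
# an abstract stand-in for the AO-SE, VO-SE or AV-SE network.
import torch

def enhance(noisy_wave, model, video_feats=None, n_fft=512, hop=128):
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy_wave, n_fft, hop_length=hop,
                      window=window, return_complex=True)
    mag, phase = spec.abs(), spec.angle()

    # The model maps the noisy magnitude (and, for the audio-visual
    # variant, mouth-region video features) to an enhanced magnitude.
    with torch.no_grad():
        enhanced_mag = model(mag, video_feats)

    # Recombine with the noisy phase and invert the STFT.
    enhanced_spec = torch.polar(enhanced_mag, phase)
    return torch.istft(enhanced_spec, n_fft, hop_length=hop,
                       window=window, length=noisy_wave.shape[-1])

# Tiny smoke test with an identity "model" on a random waveform.
identity = lambda mag, video: mag
print(enhance(torch.randn(16000), identity).shape)
```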

Experiments

The experiments conducted in this study compare the performance of AO-SE, VO-SE, and AV-SE systems in terms of two widely adopted objective measures: perceptual evaluation of speech quality (PESQ) (Rix et al., 2001), specifically the wideband extension (ITU, 2005) as implemented by Loizou (2007), and extended short-time objective intelligibility (ESTOI) (Jensen and Taal, 2016). PESQ scores, used to estimate speech quality, lie between −0.5 and 4.5, where high values correspond to high speech
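
For readers who wish to compute such scores themselves, both measures are available as third-party Python packages (`pesq` and `pystoi` on PyPI). These reimplementations may differ slightly from the implementations used in the paper (e.g. Loizou's PESQ code), so the sketch below is only an approximation.

```python
# Sketch: computing wideband PESQ and ESTOI with the third-party
# `pesq` and `pystoi` packages (pip install pesq pystoi). Random
# signals stand in for real clean and enhanced recordings.
import numpy as np
from pesq import pesq
from pystoi import stoi

fs = 16000                                   # 16 kHz for wideband PESQ
clean = np.random.randn(3 * fs)              # stand-in for the clean signal
enhanced = clean + 0.1 * np.random.randn(3 * fs)  # stand-in for SE output

pesq_wb = pesq(fs, clean, enhanced, 'wb')          # wideband PESQ
estoi = stoi(clean, enhanced, fs, extended=True)   # ESTOI, higher is better
print(f"PESQ: {pesq_wb:.2f}  ESTOI: {estoi:.2f}")
```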

Listening tests

Although it has been shown that visual cues have an impact on speech perception (Sumby and Pollack, 1954; McGurk and MacDonald, 1976), the currently available objective measures used to estimate speech quality and speech intelligibility, e.g. PESQ and ESTOI, only take the audio signals into account. Even when listening tests are performed to evaluate the performance of a SE system, visual stimuli are usually ignored and not presented to the participants (Hussain et al., 2017), despite the fact that

Conclusion

In this paper, we presented an extensive analysis of the impact of the Lombard effect on audio, visual and audio-visual speech enhancement systems based on deep learning. We conducted several experiments using a database consisting of 54 speakers and showed the general benefit of training a system with Lombard speech.

In more detail, we first trained systems with Lombard or non-Lombard speech and evaluated them on Lombard speech adopting a cross-validation setup. The results showed that systems

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported, in part, by the Oticon Foundation.

References (90)

  • N. Alghamdi et al.

    A corpus of audio-visual Lombard speech with frontal and profile views

    J. Acoust. Soc. Am.

    (2018)
  • J. Allen

    Short term spectral analysis, synthesis, and modification by discrete Fourier transform

IEEE Trans. Acoust. Speech Signal Process.

    (1977)
  • I. Almajai et al.

    Visually derived Wiener filters for speech enhancement

IEEE Trans. Audio Speech Lang. Process.

    (2011)
  • I. Almajai et al.

    Analysis of correlation between audio and visual speech features for clean audio feature prediction in noise

    Proceedings of Interspeech/ICSLP

    (2006)
  • Boersma, P., Weenink, D., 2001. Praat: doing phonetics by computer. http://www.fon.hum.uva.nl/praat/ Accessed: March...
  • N. Cliff

    Dominance statistics: ordinal analyses to answer ordinal questions

    Psychol. Bull.

    (1993)
  • M. Cooke et al.

    An audio-visual corpus for speech perception and automatic speech recognition

    J. Acoust. Soc. Am.

    (2006)
  • EBU, 2014. EBU recommendation R128 - Loudness normalisation and permitted maximum level of audio...
  • Y. Ephraim et al.

    Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator

IEEE Trans. Acoust. Speech Signal Process.

    (1984)
  • A. Ephrat et al.

    Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation

ACM Trans. Graph.

    (2018)
  • N.P. Erber

    Auditory-visual perception of speech

J. Speech Hear. Disord.

    (1975)
  • H. Erdogan et al.

    Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks

    Proceedings of ICASSP

    (2015)
  • J.S. Erkelens et al.

    Minimum mean-square error estimation of discrete Fourier coefficients with generalized Gamma priors

IEEE Trans. Audio Speech Lang. Process.

    (2007)
  • A. Field

    Discovering statistics using IBM SPSS statistics

    (2013)
  • A. Gabbay et al.

    Visual speech enhancement

    Proceedings of Interspeech

    (2018)
  • M. Garnier et al.

    An acoustic and articulatory study of Lombard speech: global effects on the utterance

    Proceedings of Interspeech/ICSLP

    (2006)
  • M. Garnier et al.

    Influence of sound immersion and communicative interaction on the Lombard effect

    J. Speech Lang. Hear. Res.

    (2010)
  • M. Garnier et al.

Effect of being seen on the production of visible speech cues. A pilot study on Lombard speech

    Proceedings of Interspeech/ICSLP

    (2012)
  • T. Gerkmann et al.

    On the statistics of spectral amplitudes after variance reduction by temporal cepstrum smoothing and cepstral nulling

    IEEE Trans. Signal Process.

    (2009)
  • L. Girin et al.

    Audio-visual enhancement of speech in noise

    J. Acoust. Soc. Am.

    (2001)
  • X. Glorot et al.

    Understanding the difficulty of training deep feedforward neural networks

    Proceedings of AISTATS

    (2010)
  • D. Griffin et al.

    Signal estimation from modified short-time Fourier transform

IEEE Trans. Acoust. Speech Signal Process.

    (1984)
  • J.H. Hansen et al.

    Analysis and compensation of Lombard speech across noise type and levels with application to in-set/out-of-set speaker recognition

IEEE Trans. Audio Speech Lang. Process.

    (2009)
  • H. Hentschke et al.

    Computation of measures of effect size for neuroscience data sets

    Eur. J. Neurosci.

    (2011)
  • G.E. Hinton et al.

    Improving neural networks by preventing co-adaptation of feature detectors

    arXiv preprint arXiv:1207.0580

    (2012)
  • J.-C. Hou et al.

    Audio-visual speech enhancement based on multimodal deep convolutional neural network

IEEE Trans. Emerg. Top. Comput. Intell.

    (2018)
  • Y. Hu et al.

    A comparative intelligibility study of single-microphone noise reduction algorithms

    J. Acoust. Soc. Am.

    (2007)
  • A. Hussain et al.

    Towards multi-modal hearing aid design and evaluation in realistic audio-visual settings: Challenges and opportunities

    Proceedings of CHAT

    (2017)
  • S. Ioffe et al.

    Batch normalization: accelerating deep network training by reducing internal covariate shift

    Proceedings of ICML

    (2015)
  • P. Isola et al.

    Image-to-image translation with conditional adversarial networks

    Proceedings of CVPR

    (2017)
  • ITU, 2003. Recommendation ITU-R BS.1534-1: method for the subjective assessment of intermediate quality level of coding...
  • ITU, 2005. Recommendation P.862.2: Wideband extension to recommendation P.862 for the assessment of wideband telephone...
  • J. Jensen et al.

    Spectral magnitude minimum mean-square error estimation using binary and continuous gain functions

IEEE Trans. Audio Speech Lang. Process.

    (2012)
  • J. Jensen et al.

    An algorithm for predicting the intelligibility of speech masked by modulated noise maskers

IEEE/ACM Trans. Audio Speech Lang. Process.

    (2016)
  • J.-C. Junqua

    The Lombard reflex and its role on human listeners and automatic speech recognizers

    J. Acoust. Soc. Am.

    (1993)