Evaluating the intelligibility benefit of speech modifications in known noise conditions

https://doi.org/10.1016/j.specom.2013.01.001

Abstract

The use of live and recorded speech is widespread in applications where correct message reception is important. Furthermore, the deployment of synthetic speech in such applications is growing. Modifications to natural and synthetic speech have therefore been proposed which aim at improving intelligibility in noise. The current study compares the benefits of speech modification algorithms in a large-scale speech intelligibility evaluation and quantifies the equivalent intensity change, defined as the amount in decibels that unmodified speech would need to be adjusted by in order to achieve the same intelligibility as modified speech. Listeners identified keywords in phonetically-balanced sentences representing ten different types of speech: plain and Lombard speech, five types of modified speech, and three forms of synthetic speech. Sentences were masked by either a stationary or a competing speech masker. Modification methods varied in the manner and degree to which they exploited estimates of the masking noise. The best-performing modifications led to equivalent intensity changes of around 5 dB in moderate and high noise levels for the stationary masker, and 3–4 dB in the presence of competing speech. These gains exceed those produced by Lombard speech. Synthetic speech in noise was always less intelligible than plain natural speech, but modified synthetic speech reduced this deficit by a significant amount.

Highlights

► Intelligibility of 10 modified natural and synthetic speech types evaluated in two maskers.
► Gains obtained worth up to 5 dB of level increase for unmodified speech.
► Larger gains for stationary noise than single-talker masker.
► The best modification methods outperformed Lombard speech.
► Synthetic speech 4–8 dB less intelligible than natural but modifications led to gains.

Introduction

Speech output, whether spoken live, recorded, or generated synthetically from text, is used in a growing range of applications, including public address systems, vehicle navigation devices and mobile phones, and is likely to become more widespread in domestic situations for interaction with consumer devices and speech-based warning systems. Maintaining intelligibility in such settings without resorting to increases in output level is a challenge, particularly in the presence of additive and convolutional distortions. Unlike current speech output technology, human talkers appear to adapt to the immediate context by changing the acoustic, phonetic, and linguistic content of their speech (Lindblom, 1990; Picheny et al., 1985; Summers et al., 1988; Howell et al., 2006; Uther et al., 2007; Patel and Schell, 2008; Cooke and Lu, 2010). Recently, a number of speech modification algorithms designed to promote intelligibility have been proposed, some inspired by human speech production changes, and useful gains in intelligibility in noise have been reported. The purpose of the current article is to evaluate within a common framework the performance of a range of speech modification strategies, alongside a number of natural speech styles.

Most speech modification algorithms proposed to date are noise-independent. Methods include boosting the consonant-vowel power ratio (Niederjohn and Grotelueschen, 1976; Skowronski and Harris, 2006; Yoo et al., 2007), spectral tilt flattening and formant enhancement (McLoughlin and Chance, 1997; Raitio et al., 2011), manipulation of duration and prosody (Huang et al., 2010), and voice conversion (Langner and Black, 2005). Raitio et al. (2011) also proposed a method using adaptation techniques (Yamagishi et al., 2009b) for hidden Markov model (HMM) text-to-speech (TTS) that require recordings of Lombard speech from the speaker whose voice is to be synthesized. A different approach, described by Moore and Nicolao (2011), also makes use of adaptation to map between normal, hypo, and hyper-articulated speech.

Some work has been carried out using prior knowledge or estimates of the noise context. These approaches include modification of the local signal-to-noise ratio (SNR) (Sauert and Vary, 2006; Tang and Cooke, 2010), optimisation of the spectral audio power reallocation based on the Speech Intelligibility Index (Sauert and Vary, 2010; Sauert and Vary, 2011) or glimpse proportion (Tang and Cooke, 2012), cepstral extraction based on the glimpse proportion measure (Valentini-Botinhao et al., 2012a), and the insertion of small pauses (Tang and Cooke, 2011). Recently, Taal et al. (2012) presented an optimisation algorithm based on a spectro-temporal perceptual distortion measure.
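To make the notion of a glimpse proportion more concrete, the following is a rough sketch only. It counts the fraction of time-frequency cells in which the speech exceeds the noise by a local SNR threshold. The glimpsing model of Cooke (2006) operates on an auditory-style spectro-temporal representation; the plain short-time Fourier analysis used here is a simplification swapped in for brevity, and the function names, frame sizes, and the 3 dB default threshold are assumptions made for this illustration rather than details of any evaluated algorithm.

```python
import numpy as np

def stft_power_db(x, frame_len=256, hop=128):
    """Power spectrogram in dB from Hann-windowed frames (simplified analysis)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Small floor avoids log of zero in silent cells.
    return 10.0 * np.log10(power + 1e-12)

def glimpse_proportion(speech, noise, threshold_db=3.0):
    """Fraction of time-frequency cells where speech exceeds noise by threshold_db.

    Assumes speech and noise are equal-length, time-aligned signals.
    """
    speech_db = stft_power_db(speech)
    noise_db = stft_power_db(noise)
    return float(np.mean(speech_db > noise_db + threshold_db))

# Synthetic stand-ins for real recordings (hypothetical data).
rng = np.random.default_rng(1)
noise = rng.standard_normal(8000)
speech = rng.standard_normal(8000)
print(glimpse_proportion(speech, noise))
```

A noise-dependent modification of the kind described above could, in principle, search for spectral weights that raise this proportion while keeping the overall speech energy fixed.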

The evaluation described in this paper aims to quantify the effect on intelligibility of speech modifications under energy and duration constraints. Listeners identified words in phonetically-balanced sets of utterances presented in both stationary and fluctuating maskers. Ten different types of speech were evaluated. These were either natural or synthetic speech, presented with and without modification. The two unmodified natural types – ‘plain’ and ‘Lombard’ – were produced in quiet and noise respectively. Five algorithmic modifications of natural speech were also evaluated, alongside unmodified synthetic speech and two further modified synthetic types.
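The energy constraint mentioned above can be illustrated with a minimal sketch: after any modification, the signal is rescaled so that its root-mean-square (RMS) level matches that of the unmodified speech, leaving the overall SNR against a fixed masker unchanged. The helper names and the `tanh` stand-in for a modification are hypothetical and chosen only for illustration; the exact normalisation procedure used in the evaluation is not given in this excerpt.

```python
import numpy as np

def rms(x):
    """Root-mean-square level of a signal."""
    return float(np.sqrt(np.mean(np.square(x))))

def match_rms(modified, reference):
    """Rescale `modified` so that its RMS equals the RMS of `reference`."""
    return modified * (rms(reference) / rms(modified))

def snr_db(speech, noise):
    """Overall SNR in dB computed from RMS levels."""
    return 20.0 * np.log10(rms(speech) / rms(noise))

# Synthetic stand-ins for real recordings (hypothetical data).
rng = np.random.default_rng(0)
plain = rng.standard_normal(16000)
noise = 2.0 * rng.standard_normal(16000)

# A placeholder "modification" that changes both waveform shape and level.
boosted = 3.0 * np.tanh(plain)

# Enforce the energy constraint: same RMS as plain speech.
constrained = match_rms(boosted, plain)
print(snr_db(plain, noise), snr_db(constrained, noise))
```

Under this constraint, any intelligibility gain has to come from how energy is redistributed in time and frequency, not from a louder signal.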

The specific modification approaches selected for the evaluation were a subset of those developed in recent studies by the authors, chosen to exhibit a wide variety of potential modification techniques. Algorithms differed principally in their use of noise estimates, the parameters being modified, and the optimisation criterion employed. Alongside one noise-independent approach, others make use of information about the noise context during offline optimisation, while the rest employ online noise estimates. A number of the tested modification algorithms restrict themselves to changing spectral weights, either globally or locally in time; others additionally use time-domain amplitude range compression strategies. Some of the approaches were inspired by observed human speech production changes in intelligibility-enhancing types of speech, while others employed model-based optimisation of objective intelligibility.

The performance of each type is characterised in terms of the change in the percentage of keywords identified correctly by listeners. In addition, the concept of equivalent intensity change (EIC) is introduced, which describes the amount in decibels by which plain speech would need to be changed to acquire the same intelligibility as a given synthetic/modified type. A design goal for the evaluation was to be able to distinguish different speech types at a resolution of about 1 dB of EIC.
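As a rough illustration of how an EIC could be derived, the sketch below assumes that intelligibility of plain speech follows a logistic psychometric function of SNR. The score obtained with a modified type is mapped back through the inverse of that function to find the SNR at which plain speech would reach the same score; the EIC is the difference between that SNR and the SNR actually used. The logistic form, the function names, and the midpoint and slope values are all assumptions for this example, not the fitted functions from the study.

```python
import math

def psychometric(snr_db, midpoint_db, slope_per_db):
    """Assumed logistic mapping from SNR (dB) to proportion of keywords correct."""
    return 1.0 / (1.0 + math.exp(-slope_per_db * (snr_db - midpoint_db)))

def inverse_psychometric(score, midpoint_db, slope_per_db):
    """SNR (dB) at which plain speech reaches `score` (0 < score < 1)."""
    return midpoint_db + math.log(score / (1.0 - score)) / slope_per_db

def equivalent_intensity_change(score_modified, test_snr_db, midpoint_db, slope_per_db):
    """dB change plain speech would need to match the modified-speech score."""
    equivalent_snr = inverse_psychometric(score_modified, midpoint_db, slope_per_db)
    return equivalent_snr - test_snr_db

# Hypothetical psychometric parameters for plain speech in one masker.
MIDPOINT_DB = -5.0
SLOPE_PER_DB = 0.4
TEST_SNR_DB = -5.0

# Construct a modified-speech score equal to what plain speech achieves 3 dB higher.
score_modified = psychometric(TEST_SNR_DB + 3.0, MIDPOINT_DB, SLOPE_PER_DB)
eic = equivalent_intensity_change(score_modified, TEST_SNR_DB, MIDPOINT_DB, SLOPE_PER_DB)
print(round(eic, 3))
```

A negative EIC in this formulation would indicate a type that is less intelligible than plain speech at the same SNR, as reported for unmodified synthetic speech.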

Section 2 provides a brief introduction to each of the 10 speech types evaluated in the current study. Speech and noise corpora are described in Section 3, along with details of the estimation of psychometric functions for the noise maskers. The outcome of the evaluation is presented in Section 4.

Speech types

Table 1 lists the 10 speech types whose intelligibility in noise is reported here, and summarises the extent to which each method uses noise signals or estimates both offline and online.

While the focus of the current study was on measuring intelligibility rather than naturalness or quality, informal listening suggested that most of the non-synthetic types were highly natural and free from artefacts. The two methods which included a stage of dynamic range compression (SSDRC and TMDRC) were

Speech material

A speech dataset comprised of natural sentences was chosen over the use of isolated words or restricted-vocabulary sentences in order to obtain evaluation results for phonetically-balanced materials more representative of everyday speech. The existing list of Harvard sentence materials (Rothauser et al., 1969) fits these criteria. The Harvard sentence lists define 72 sets of 10 sentences each. Each 10-sentence set is phonetically-balanced. Sets 1–18 (180 sentences) were used in the current

Keyword scores

Figs. 2 and 3 show keyword scores for the competing speech and speech-shaped noise maskers relative to scores for the plain speech type at each of three SNR levels denoted High, Mid, and Low. Speech types are ranked by degree of gain.

A 3-factor (modification, SNR level, masker type) repeated-measures ANOVA on arcsine-transformed keyword scores confirmed visual

Key findings

Speech modification can lead to substantial increases in intelligibility for sentences presented in noise relative to an unmodified speech baseline. The most successful techniques evaluated here produced increases in keyword scores over plain speech which ranged from 7.6 to 36.5 percentage points for a stationary masker, with smaller increases (5.5 to 15.4 percentage points) in the presence of a competing talker. These quantities correspond to gains in the range 2.5–5.2 dB and 2.4–4.1 dB for the

Conclusions

This paper reports the results of the first large-scale evaluation of speech production modification strategies designed to increase intelligibility in noise without changing overall signal-to-noise ratio. Some modification approaches were inspired by studies of human speech modes known to be intelligible, while others sought modifications which optimised one of several objective intelligibility models. A number of modification algorithms led to useful gains, equivalent to increasing the level

Acknowledgements

We thank Vasilis Karaiskos for help in running the listening tests, Julian Villegas for contributions to the recording of speech material, and T-C. Zorilă, V. Kandia and D. Erro for useful discussions on developing SSDRC and TMDRC. The research leading to these results was partly funded from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 213850 (SCALE) and by the Future and Emerging Technologies (FET) programme under FET-Open grant number 256230

References (56)

  • Cooke, M., 2006. A glimpsing model of speech perception in noise. J. Acoust. Soc. Amer.
  • Cooke, M., et al., 2010. Spectral and temporal changes to speech produced in the presence of energetic and informational maskers. J. Acoust. Soc. Amer.
  • Dreher, J., et al., 1957. Effects of ambient noise on speaker intelligibility for words and phrases. J. Acoust. Soc. Amer.
  • Dreschler, W.A., et al., 2001. ICRA noises: Artificial noise signals with speech-like spectral and temporal properties for hearing aid assessment. Audiology.
  • Erro, D., Stylianou, Y., Navas, E., Hernaez, I., 2012. Implementation of simple spectral techniques to enhance the...
  • Festen, J., et al., 1990. Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing. J. Acoust. Soc. Amer.
  • Hazan, V., et al., 2011. Acoustic–phonetic characteristics of speech produced with communicative intent to counter adverse listening conditions. J. Acoust. Soc. Amer.
  • Hazan, V., et al., 1996. Cue-enhancement strategies for natural VCV and sentence materials presented in noise. Speech Hear. Lang.
  • Holland, J.H., 1975. Adaptation in Natural and Artificial Systems.
  • Howell, P., et al., 2006. Strength of British English accents in altered listening conditions. Percept. Psychophys.
  • Huang, D.Y., Rahardja, S., Ong, E.P., 2010. Lombard effect mimicking. In: Proc. SSW7, Kyoto, Japan, pp....
  • Kates, J.M. Signal processing for hearing aids.
  • Langner, B., Black, A.W., 2005. Improving the understandability of speech synthesis by modeling speech in noise. In:...
  • Lindblom, B. Explaining phonetic variation: A sketch of the H&H theory.
  • Lombard, E., 1911. Le signe d’élévation de la voix (the sign of the elevation of the voice). Annales des maladies de l’oreille et du larynx.
  • Lu, Y., et al., 2008. Speech production modifications produced by competing talkers, babble and stationary noise. J. Acoust. Soc. Amer.
  • McLoughlin, I.V., Chance, R.J., 1997. LSP-based speech modification for intelligibility enhancement. In: Proc. Digital...
  • Moore, R.K., Nicolao, M., 2011. Reactive speech synthesis: Actively managing phonetic contrast along an H&H continuum....