Evaluating the intelligibility benefit of speech modifications in known noise conditions
Highlights
► Intelligibility of 10 modified natural and synthetic speech types evaluated in two maskers.
► Gains obtained worth up to 5 dB of level increase for unmodified speech.
► Larger gains for stationary noise than single-talker masker.
► The best modification methods outperformed Lombard speech.
► Synthetic speech 4–8 dB less intelligible than natural but modifications led to gains.
Introduction
Speech output, whether spoken live, recorded, or generated synthetically from text, is used in a growing range of applications, including public address systems, vehicle navigation devices and mobile phones, and is likely to become more widespread in domestic situations for interaction with consumer devices and speech-based warning systems. Maintaining intelligibility in such settings without resorting to increases in output level is a challenge, particularly in the presence of additive and convolutional distortions. Unlike current speech output technology, human talkers appear to adapt to the immediate context by changing the acoustic, phonetic, and linguistic content of their speech (Lindblom, 1990, Picheny et al., 1985, Summers et al., 1988, Howell et al., 2006, Uther et al., 2007, Patel and Schell, 2008, Cooke and Lu, 2010). Recently, a number of speech modification algorithms designed to promote intelligibility have been proposed, some inspired by human speech production changes, and useful gains in intelligibility in noise have been reported. The purpose of the current article is to evaluate within a common framework the performance of a range of speech modification strategies, alongside a number of natural speech styles.
Most speech modification algorithms proposed to date are noise-independent. Methods include boosting the consonant-vowel power ratio (Niederjohn and Grotelueschen, 1976, Skowronski and Harris, 2006, Yoo et al., 2007), spectral tilt flattening and formant enhancement (McLoughlin and Chance, 1997, Raitio et al., 2011), manipulation of duration and prosody (Huang et al., 2010), and voice conversion (Langner and Black, 2005). Raitio et al. (2011) also proposed a method using adaptation techniques (Yamagishi et al., 2009b) for hidden Markov model (HMM) text-to-speech (TTS) that require recordings of Lombard speech from the speaker whose voice is to be synthesized. A different approach, described by Moore and Nicolao (2011), also makes use of adaptation to map between normal, hypo, and hyper-articulated speech.
Some work has been carried out using prior knowledge or estimates of the noise context. These approaches include modification of the local signal-to-noise ratio (SNR) (Sauert and Vary, 2006, Tang and Cooke, 2010), optimisation of the spectral audio power reallocation based on the Speech Intelligibility Index (Sauert and Vary, 2010, Sauert and Vary, 2011) or glimpse proportion (Tang and Cooke, 2012), cepstral extraction based on the glimpse proportion measure (Valentini-Botinhao et al., 2012a), and the insertion of small pauses (Tang and Cooke, 2011). Recently, Taal et al. (2012) presented an optimisation algorithm based on a spectro-temporal perceptual distortion measure.
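To make the glimpse proportion idea concrete, the following minimal sketch counts the fraction of spectro-temporal cells in which the speech exceeds the masker by a local SNR threshold. It is an illustration only, not the implementation used in the cited studies: it assumes that speech and noise levels (in dB) have already been computed separately on a common time-frequency grid, and the function name `glimpse_proportion`, the 3 dB default threshold, and the toy values are assumptions introduced here.

```python
def glimpse_proportion(speech_db, noise_db, threshold_db=3.0):
    """Fraction of time-frequency cells where speech exceeds noise by threshold_db.

    speech_db and noise_db are lists of rows (one per frequency channel),
    each row holding levels in dB for successive time frames.
    The 3 dB default is an assumed threshold, not a value taken from the text.
    """
    if len(speech_db) != len(noise_db):
        raise ValueError("speech and noise must have the same number of channels")
    total = 0
    glimpsed = 0
    for speech_row, noise_row in zip(speech_db, noise_db):
        if len(speech_row) != len(noise_row):
            raise ValueError("speech and noise must have the same number of frames")
        for s, n in zip(speech_row, noise_row):
            total += 1
            if s - n > threshold_db:
                glimpsed += 1
    if total == 0:
        raise ValueError("empty spectro-temporal representation")
    return glimpsed / total


# Hypothetical toy example: 2 channels x 3 frames, stationary 50 dB masker.
toy_speech_db = [[60.0, 50.0, 40.0], [55.0, 58.0, 30.0]]
toy_noise_db = [[50.0, 50.0, 50.0], [50.0, 50.0, 50.0]]
toy_gp = glimpse_proportion(toy_speech_db, toy_noise_db)
```

A modification algorithm that optimises this kind of measure would adjust the speech (for example, its spectral weights) so that more cells clear the threshold for the known masker.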
The evaluation described in this paper aims to quantify the effect on intelligibility of speech modifications under energy and duration constraints. Listeners identified words in phonetically-balanced sets of utterances presented in both stationary and fluctuating maskers. Ten different types of speech were evaluated. These were either natural or synthetic speech, presented with and without modification. The two unmodified natural types – ‘plain’ and ‘Lombard’ – were produced in quiet and noise respectively. Five algorithmic modifications of natural speech were also evaluated, alongside unmodified synthetic speech and two further modified synthetic types.
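One plausible reading of the energy constraint is that a modified utterance is rescaled to the same root-mean-square (RMS) level as its unmodified counterpart before being mixed with the masker at a fixed SNR. The sketch below illustrates that reading only; the helper names (`rms`, `match_rms`, `mix_at_snr`), the crude compressive nonlinearity standing in for a "modification", and the -4 dB SNR are all hypothetical and are not taken from the evaluation itself.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def match_rms(modified, reference):
    """Rescale `modified` so that its RMS equals that of `reference`."""
    gain = rms(reference) / rms(modified)
    return [gain * s for s in modified]


def mix_at_snr(speech, masker, snr_db):
    """Scale the masker so the speech-to-masker RMS ratio equals snr_db, then add.

    Returns the mixture and the scaled masker.
    """
    if len(masker) < len(speech):
        raise ValueError("masker must be at least as long as the speech")
    masker = masker[:len(speech)]
    gain = rms(speech) / (rms(masker) * 10.0 ** (snr_db / 20.0))
    scaled = [gain * m for m in masker]
    mixture = [s + m for s, m in zip(speech, scaled)]
    return mixture, scaled


# Hypothetical signals: a tone stands in for plain speech, Gaussian noise for the masker.
rng = random.Random(0)
plain = [math.sin(2.0 * math.pi * 220.0 * n / 16000.0) for n in range(16000)]
masker = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# A stand-in "modification" that changes the energy of the signal.
modified = [math.copysign(math.sqrt(abs(s)), s) for s in plain]

# Enforce the energy constraint, then mix at a hypothetical SNR of -4 dB.
constrained = match_rms(modified, plain)
mixture, scaled_masker = mix_at_snr(constrained, masker, snr_db=-4.0)
```

With this arrangement any intelligibility difference between the plain and modified mixtures cannot be attributed to a change in overall level, since both are presented at the same RMS and the same nominal SNR.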
The specific modification approaches selected for the evaluation were a subset of those developed in recent studies by the authors, chosen to exhibit a wide variety of potential modification techniques. Algorithms differed principally in their use of noise estimates, the parameters being modified, and the optimisation criterion employed. Alongside one noise-independent approach, others make use of information about the noise context during offline optimisation, while the rest employ online noise estimates. A number of the tested modification algorithms restrict themselves to changing spectral weights, either globally or locally in time; others additionally use time-domain amplitude range compression strategies. Some of the approaches were inspired by observed human speech production changes in intelligibility-enhancing types of speech, while others employed model-based optimisation of objective intelligibility.
The performance of each type is characterised in terms of the change in the percentage of keywords identified correctly by listeners. In addition, the concept of equivalent intensity change (EIC) is introduced, which describes the amount in decibels by which plain speech would need to be changed to acquire the same intelligibility as a given synthetic/modified type. A design goal for the evaluation was to be able to distinguish different speech types at a resolution of about 1 dB of EIC.
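The EIC computation can be sketched as follows, under the assumption that the psychometric function for plain speech in a given masker is a two-parameter logistic. The logistic form, the parameter names `midpoint_db` and `slope`, and the example values are assumptions made for illustration; the functions actually fitted in the evaluation may differ.

```python
import math


def psychometric(snr_db, midpoint_db, slope):
    """Assumed logistic psychometric function for plain speech.

    Returns the proportion of keywords correct at the given SNR.
    """
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def inverse_psychometric(proportion, midpoint_db, slope):
    """SNR at which plain speech would reach the given proportion correct."""
    if not 0.0 < proportion < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return midpoint_db + math.log(proportion / (1.0 - proportion)) / slope


def equivalent_intensity_change(score, snr_db, midpoint_db, slope):
    """Level change in dB that plain speech would need to match `score`.

    `score` is the proportion of keywords correct for the modified or
    synthetic type when presented at `snr_db`.
    """
    return inverse_psychometric(score, midpoint_db, slope) - snr_db


# Hypothetical example: plain speech reaches 50% correct at -4 dB SNR,
# and a modified type reaches 75% correct at that same SNR.
example_eic = equivalent_intensity_change(0.75, snr_db=-4.0, midpoint_db=-4.0, slope=0.5)
```

A positive EIC means the type is as intelligible as plain speech raised by that many decibels; a negative value, as might be expected for unmodified synthetic speech, means plain speech would have to be attenuated to match it.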
Section 2 provides a brief introduction to each of the 10 speech types evaluated in the current study. Speech and noise corpora are described in Section 3, along with details of the estimation of psychometric functions for the noise maskers. The outcome of the evaluation is presented in Section 4.
Speech types
Table 1 lists the 10 speech types whose intelligibility in noise is reported here, and summarises the extent to which each method uses noise signals or estimates both offline and online.
While the focus of the current study was on measuring intelligibility rather than naturalness or quality, informal listening suggested that most of the non-synthetic types were highly natural and free from artefacts. The two methods which included a stage of dynamic range compression (SSDRC and TMDRC) were
Speech material
A speech dataset comprised of natural sentences was chosen over the use of isolated words or restricted-vocabulary sentences in order to obtain evaluation results for phonetically-balanced materials more representative of everyday speech. The existing list of Harvard sentence materials (Rothauser et al., 1969) fits these criteria. The Harvard sentence lists define 72 sets of 10 sentences each. Each 10-sentence set is phonetically-balanced. Sets 1–18 (180 sentences) were used in the current
Keyword scores
Figs. 2 and 3 show keyword scores for the competing speech and speech-shaped noise maskers relative to scores for the plain speech type at each of three SNR levels denoted High, Mid, and Low. Speech types are ranked by degree of gain.
A 3-factor (modification, SNR level, masker type) repeated-measures ANOVA on arcsine-transformed keyword scores confirmed visual
Key findings
Speech modification can lead to substantial increases in intelligibility for sentences presented in noise relative to an unmodified speech baseline. The most successful techniques evaluated here produced increases in keyword scores over plain speech which ranged from 7.6 to 36.5 percentage points for a stationary masker, with smaller increases (5.5 to 15.4 percentage points) in the presence of a competing talker. These quantities correspond to gains in the range 2.5–5.2 dB and 2.4–4.1 dB for the
Conclusions
This paper reports the results of the first large-scale evaluation of speech production modification strategies designed to increase intelligibility in noise without changing overall signal-to-noise ratio. Some modification approaches were inspired by studies of human speech modes known to be intelligible, while others sought modifications which optimised one of several objective intelligibility models. A number of modification algorithms led to useful gains, equivalent to increasing the level
Acknowledgements
We thank Vasilis Karaiskos for help in running the listening tests, Julian Villegas for contributions to the recording of speech material, and T-C. Zorilă, V. Kandia and D. Erro for useful discussions on developing SSDRC and TMDRC. The research leading to these results was partly funded from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 213850 (SCALE) and by the Future and Emerging Technologies (FET) programme under FET-Open grant number 256230
References (56)
- et al. Prediction of speech intelligibility based on an auditory preprocessing model. Speech Comm. (2010)
- et al. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Comm. (1999)
- et al. Applied principles of clear and Lombard speech for automated intelligibility enhancement in noisy environments. Speech Comm. (2006)
- et al. Do you speak E-NG-L-I-SH? a comparison of foreigner- and infant-directed speech. Speech Comm. (2007)
- et al. Statistical parametric speech synthesis. Speech Comm. (2009)
- ANSI S3.5-1997, 1997. Methods for the calculation of the Speech Intelligibility...
- et al. Frequency-important functions for words in high- and low context sentences. J. Speech Hear. Res. (1992)
- Audio dynamic range compression for minimum perceived distortion. IEEE Trans. Audio Electroacoust. (1969)
- Praat, a system for doing phonetics by computer. Glot Internat. (2001)
- et al. Semantic and phonetic enhancements for speech-in-noise recognition by native and non-native listeners. J. Acoust. Soc. Amer. (2007)
- A glimpsing model of speech perception in noise. J. Acoust. Soc. Amer.
- Spectral and temporal changes to speech produced in the presence of energetic and informational maskers. J. Acoust. Soc. Amer.
- Effects of ambient noise on speaker intelligibility for words and phrases. J. Acoust. Soc. Amer.
- ICRA noises: Artificial noise signals with speech-like spectral and temporal properties for hearing aid assessment. Audiology
- Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing. J. Acoust. Soc. Amer.
- Acoustic–phonetic characteristics of speech produced with communicative intent to counter adverse listening conditions. J. Acoust. Soc. Amer.
- Cue-enhancement strategies for natural VCV and sentence materials presented in noise. Speech Hear. Lang.
- Adaptation in Natural and Artificial Systems
- Strength of British English accents in altered listening conditions. Percept. Psychophys.
- Signal processing for hearing aids
- Explaining phonetic variation: A sketch of the H&H theory
- Le signe d’élévation de la voix (the sign of the elevation of the voice). Annales des maladies de l’oreille et du larynx
- Speech production modifications produced by competing talkers, babble and stationary noise. J. Acoust. Soc. Amer.