Elsevier

Computer Speech & Language

Volume 35, January 2016, Pages 73-92

Evaluating the predictions of objective intelligibility metrics for modified and synthetic speech

https://doi.org/10.1016/j.csl.2015.06.002

Highlights

  • Algorithmically modified speech is used to assess objective intelligibility metrics.

  • Reduced predictive power of the metrics for the given speech is demonstrated.

  • Metrics show two opposite predictive patterns in fluctuating and stationary maskers.

  • The glimpse proportion metric is extended.

Abstract

Several modification algorithms that alter natural or synthetic speech with the goal of improving intelligibility in noise have been proposed recently. A key requirement of many modification techniques is the ability to predict intelligibility, both offline during algorithm development, and online, in order to determine the optimal modification for the current noise context. While existing objective intelligibility metrics (OIMs) have good predictive power for unmodified natural speech in stationary and fluctuating noise, little is known about their effectiveness for other forms of speech. The current study evaluated how well seven OIMs predict listener responses in three large datasets of modified and synthetic speech which together represent 396 combinations of speech modification, masker type and signal-to-noise ratio. The chief finding is a clear reduction in predictive power for most OIMs when faced with modified and synthetic speech. Modifications introducing durational changes are particularly harmful to intelligibility predictors. OIMs that measure masked audibility tend to over-estimate intelligibility in the presence of fluctuating maskers relative to stationary maskers, while OIMs that estimate the distortion caused by the masker to a clean speech prototype exhibit the reverse pattern.

Introduction

Spoken language applications using recorded natural or synthetic speech can be made more robust through algorithmic speech modification. Unlike traditional speech enhancement techniques (e.g., Hu and Loizou, 2004, Martin, 2005, Chen et al., 2006, Srinivasan et al., 2007) which focus on the noise-corrupted speech signal, the speech modification approach (e.g., Sauert and Vary, 2006, Bonardo and Zovato, 2007, Yoo et al., 2007, Brouckxon et al., 2008, Tang and Cooke, 2010) alters the clean speech signal prior to output or transmission. A recent evaluation (Cooke et al., 2013b) demonstrated that speech modification can result in intelligibility gains in noise equivalent to increases of more than 5 dB in output level.

A key ingredient in the design of effective modification strategies is the estimation of listener performance at frequent intervals during the development cycle. However, while subjective intelligibility scores remain the ultimate reference, continuous behavioural testing during algorithm design is usually infeasible. An alternative is to use objective intelligibility metrics (OIMs) to predict listener scores. OIMs not only avoid the need for extensive subjective testing, but can also be used at the core of the algorithm optimisation process. A number of speech modification algorithms (e.g., Sauert and Vary, 2010a, Tang and Cooke, 2011, Taal et al., 2013, Valentini-Botinhao et al., 2014) have been developed and optimised based on maximising intelligibility predictions made by OIMs such as the Speech Intelligibility Index (SII; ANSI, 1997) or the glimpse proportion metric (GP; Cooke, 2006).

OIMs have been motivated by two distinct approaches to account for the effect of noise on speech. In addition to the aforementioned SII and GP metrics, the Articulation Index (AI; French and Steinberg, 1947, Fletcher and Galt, 1950, Kryter, 1962a, Kryter, 1962b), and the extended Speech Intelligibility Index (ESII; Rhebergen and Versfeld, 2005) focus on quantifying the masked audibility of speech in the presence of noise. On the other hand, techniques such as the Normalised-Covariance Measure (NCM; Holube and Kollmeier, 1996, Ma et al., 2009), the Christiansen–Pedersen–Dau metric (henceforth referred to as CPD for brevity; Christiansen et al., 2010) and the Short-Time Objective Intelligibility metric (STOI; Taal et al., 2010) correlate representations of the clean reference speech and the speech-plus-noise signal in an attempt to measure the distortion caused by the masker. Another distortion-based approach is the Coherence Speech Intelligibility Index (CSII) proposed by Kates and Arehart (2005). The CSII measures the similarity between clean and noisy speech using magnitude-squared coherence (Carter et al., 1973, Kates, 1992), which quantifies the degree to which the output of a system is linearly related to its input.

Both audibility- and distortion-based approaches target spectro-temporal regions least affected by the noise, but differ in their assumptions. While techniques based on audibility require separate estimates of speech and noise in order to estimate masking, distortion-based OIMs assume that human listeners possess a template of the clean speech which is compared to the incoming noisy speech.

When an OIM is employed as the objective function to be maximised, the predictive accuracy of the OIM is critical in determining the validity and effectiveness of the optimisation process. Most of the OIMs mentioned above have been evaluated with recorded natural speech or speech processed by noise reduction techniques. Relatively few studies have investigated their predictive power for modified natural speech or synthetic speech in noise: most OIMs were originally proposed to predict the intelligibility of distorted natural speech, for distortions caused by additive noise together with artefacts introduced by suppression algorithms applied to the noisy speech signal.

Predicting the intelligibility impact of modification algorithms is likely to be challenging since the most successful methods (in terms of improving masked intelligibility) modify the signal in diverse domains – durational and spectral/formant – and possibly through non-linear operations. While the alterations benefit intelligibility, they may also introduce artefacts to the speech signal, leading to degraded speech quality. Nevertheless, the relation between speech intelligibility and quality is complex, and factors such as listening effort and loudness interact. Intelligibility and quality are not simply negatively or positively correlated, especially across listeners (Preminger and Tasell, 1995). For synthetic speech it might be expected that the OIMs’ task is even more challenging because the natural speech reference signal is not available, i.e., distortions introduced by the text-to-speech (TTS) system cannot be taken into account. Consequently, predicting the intelligibility of poor quality synthetic speech may be even more difficult.

In two initial studies, which concerned solely the ability of OIMs to predict the masked intelligibility of modified and synthetic speech regardless of the perceptual speech quality, we observed a large reduction in the predictive accuracy of several OIMs on modified and synthetic speech relative to unmodified speech (Tang and Cooke, 2011, Valentini-Botinhao et al., 2011). The current study extends these pilots to a larger range of objective intelligibility metrics and includes behavioural data from recent extensive evaluations of 30 forms of modified and synthetic speech (Cooke et al., 2013a, Cooke et al., 2013b). Specifically, we evaluate the performance of one standard (SII) and six recent objective intelligibility metrics (ESII, GP, NCM, CSII, CPD, STOI) in predicting subjective intelligibility scores for both modified and synthetic speech in additive noise. The evaluation makes use of three datasets which together contain 396 combinations of speech modification, masker type and signal-to-noise ratio (SNR). The seven metrics are introduced in Section 2 while Section 3 describes the evaluation datasets. The outcome of a comparison of model predictions against behavioural data from large-scale listening tests is presented in Section 4.

Section snippets

Speech Intelligibility Index (SII)

SII and AI share a common underlying idea: speech intelligibility is dependent on the audibility of the signal in each frequency band. The AI can be expressed as a function of the masking level represented by the SNR ($\mathrm{SNR}_f^{\mathrm{AI}}$) in each frequency channel as

$$\mathrm{AI} = \sum_{f=1}^{F} W_f \cdot \mathrm{SNR}_f^{\mathrm{AI}}, \qquad \sum_{f=1}^{F} W_f = 1$$

where $W_f$ denotes the band importance function (BIF) in channel $f$ and $\mathrm{SNR}_f^{\mathrm{AI}}$ is a value in the interval [0, 1] based on a piecewise-linear transformation of the actual SNR level $\mathrm{SNR}_f$ in band $f$

$$\mathrm{SNR}_f^{\mathrm{AI}} = \min(15, \max(-15, \mathrm{SNR}_f))$$
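The band-weighted sum above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The snippet shows only a clipping step, taken here to limit the band SNR to the range −15 to 15 dB; the linear rescaling of the clipped value onto [0, 1], (x + 15)/30, is an assumption chosen to satisfy the stated interval. The function names and the four-band weights are hypothetical.

```python
def transformed_band_snr(snr_db):
    """Clip a band SNR to [-15, 15] dB, then rescale linearly onto [0, 1].

    The rescaling (x + 15) / 30 is an assumption, not taken from the text.
    """
    clipped = min(15.0, max(-15.0, snr_db))
    return (clipped + 15.0) / 30.0


def articulation_index(band_snrs_db, band_importance):
    """Return the importance-weighted sum of transformed band SNRs."""
    if len(band_snrs_db) != len(band_importance):
        raise ValueError("need one importance weight per band")
    if abs(sum(band_importance) - 1.0) > 1e-9:
        raise ValueError("band importance weights must sum to 1")
    return sum(
        w * transformed_band_snr(snr)
        for w, snr in zip(band_importance, band_snrs_db)
    )


# Hypothetical four-band example; real band importance functions
# use many more bands and empirically derived weights.
weights = [0.2, 0.3, 0.3, 0.2]
snrs_db = [-20.0, 0.0, 15.0, 30.0]
ai = articulation_index(snrs_db, weights)
```

Under these assumptions, bands at or below −15 dB contribute nothing and bands at or above 15 dB contribute their full importance weight, so the result always lies in [0, 1].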

Datasets

The OIMs described above were evaluated based on listeners’ responses to speech from three datasets (Table 1). One – natural – consists of unmodified and modified natural speech. A second dataset, tts, contains speech generated by an HMM-based synthesiser. The third dataset, hurricane, is made up of both natural and synthetic speech. Further details of the listening tests are provided in the articles mentioned in Table 1.

Objective intelligibility predictions

All OIMs were evaluated by inspecting both the Pearson correlation coefficient $\rho$ between mean listener scores and the raw output of the metric, and the standard deviation of the error $\sigma_e$, computed as

$$\sigma_e = \sigma_d \cdot \sqrt{1 - \rho^2}$$

where $\sigma_d$ is the standard deviation of subjective intelligibility scores for a given experimental condition. Statistical comparisons among dependent correlations were conducted using a method described in Meng et al. (1992) based on Chi-squared tests on z-transformed scores.
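The two evaluation quantities can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the error measure is taken to be the listener-score standard deviation multiplied by the square root of one minus the squared correlation, the standard deviation is assumed to be the population form computed over the listener scores passed in, and the function names and example scores are hypothetical.

```python
import math
from statistics import pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def error_std(listener_scores, metric_outputs):
    """Standard deviation of the prediction error, sigma_d * sqrt(1 - rho^2)."""
    rho = pearson(listener_scores, metric_outputs)
    # Assumption: population standard deviation of the listener scores.
    sigma_d = pstdev(listener_scores)
    # max() guards against tiny negative values caused by rounding.
    return sigma_d * math.sqrt(max(0.0, 1.0 - rho ** 2))


# Hypothetical scores for four conditions.
listener = [1.0, 2.0, 3.0, 4.0]
metric = [1.0, 3.0, 2.0, 4.0]
rho = pearson(listener, metric)
sigma_e = error_std(listener, metric)
```

A metric that is perfectly linearly related to the listener scores gives a correlation of 1 and an error of 0; the error grows towards the listener-score standard deviation as the correlation falls to 0.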

Discussion

Compared to model-listener correlations reported in the literature for unmodified natural speech or speech processed by noise reduction techniques, the current study highlights a clear reduction in the performance of a representative range of OIMs for modified and synthetic speech. One contributing factor for most OIMs is their inability to predict intelligibility across different maskers, especially for stationary versus highly fluctuating maskers. Additionally, many OIMs were adversely

Conclusions

In the current study, state-of-the-art OIMs that provide good predictions for natural speech performed less well for modified and synthetic speech, especially for those modifications introducing temporal changes. While many OIMs produced reasonable estimates for modified speech in the presence of single masker types, across-noise predictions were generally poor. Methods motivated by masked audibility tended to over-estimate intelligibility for fluctuating maskers and under-estimate

Acknowledgements

This study was supported by the LISTA Project (http://listening-talker.org), funded by the Future and Emerging Technologies programme within the 7th Framework Programme for Research of the European Commission, FET-Open Grant Number 256230. We thank Yannis Stylianou for sharing a MATLAB implementation of ESII, and Cees Taal for making the MATLAB implementation of STOI available online for free access. The implementation of SII is available online at http://www.sii.to while MATLAB implementations

References (87)

  • D. Bonardo et al. Speech synthesis enhancement in noisy environments.

  • A.R. Bradlow et al. Speaking clearly for learning-impaired children: sentence perception in noise. J. Speech Hear. Res. (2003)

  • H. Brouckxon et al. An overview of the VUB entry for the 2012 Hurricane Challenge.

  • H. Brouckxon et al. Time and frequency dependent amplification for speech intelligibility enhancement in noisy environments.

  • G.C. Carter et al. Estimation of the magnitude-squared coherence function via overlapped fast Fourier transform processing. IEEE Trans. Audio Electroacoust. (1973)

  • J. Chen et al. New insights into the noise reduction Wiener filter. IEEE Trans. Audio Speech Lang. Process. (2006)

  • L.-H. Chen et al. DNN-based stochastic postfilter for HMM-based speech synthesis.

  • M. Cooke. Modelling Auditory Processing and Organisation. (1993)

  • M. Cooke. A glimpsing model of speech perception in noise. J. Acoust. Soc. Am. (2006)

  • M. Cooke et al. An audio–visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. (2006)

  • M. Cooke et al. Intelligibility-enhancing speech modifications: the Hurricane Challenge.

  • T. Dau et al. A quantitative model of the “effective” signal processing in the auditory system. I. Model structure. J. Acoust. Soc. Am. (1996)

  • D. Erro et al. Implementation of simple spectral techniques to enhance the intelligibility of speech using a harmonic model.

  • D. Erro et al. Statistical synthesizer with embedded prosodic and spectral modifications to generate highly intelligible speech in noise.

  • H. Fletcher et al. The perception of speech and its relation to telephony. J. Acoust. Soc. Am. (1950)

  • N.R. French et al. Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Am. (1947)

  • E. Godoy et al. Increasing speech intelligibility via spectral shaping with frequency warping and dynamic range compression plus transient enhancement.

  • N. Hodoshima et al. Improving syllable identification by a preprocessing method reducing overlap-masking in reverberant environments. J. Acoust. Soc. Am. (2006)

  • I. Holube et al. Speech intelligibility prediction in hearing-impaired listeners based on a psychoacoustically motivated perception model. J. Acoust. Soc. Am. (1996)

  • Y. Hu et al. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. (2008)

  • Y. Hu et al. Speech enhancement based on wavelet thresholding the multitaper spectrum.

  • Y. Hu et al. Evaluation of objective measures for speech enhancement.

  • ISO 389-7. Acoustics – Reference Zero for the Calibration of Audiometric Equipment – Part 7: Reference Threshold of Hearing Under Free-field and Diffuse-field Listening Conditions. (2006)

  • J. Kates et al. Coherence and the speech intelligibility index. J. Acoust. Soc. Am. (2005)

  • J.M. Kates. On using coherence to measure distortion in hearing aids. J. Acoust. Soc. Am. (1992)

  • S. King et al. The Blizzard Challenge 2010. (2010, September)

  • K. Kokkinakis et al. Evaluation of objective measures for quality assessment of reverberant speech.

  • J.C. Krause et al. Acoustic properties of naturally produced clear speech at normal speaking rates. J. Acoust. Soc. Am. (2004)

  • K.D. Kryter. Methods for the calculation and use of the Articulation Index. J. Acoust. Soc. Am. (1962)

  • K.D. Kryter. Validation of the articulation index. J. Acoust. Soc. Am. (1962)

  • R. Kubichek et al. Advances in objective voice quality assessment.

  • P.C. Loizou. Speech Enhancement: Theory and Practice. (2013)

  • J. Ma et al. Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions. J. Acoust. Soc. Am. (2009)

Cited by (22)

    • ASR-based speech intelligibility prediction: A review

      2022, Hearing Research
      Citation Excerpt :

      In contrast to STOI, CSTI computes the per-frequency-band correlations over the entire signal rather than over the short time segments used in STOI. Despite the fact that the STOI has become a common benchmark in the field of speech processing (Gao and Tew, 2015; Marcinek et al., 2021; Van Kuyk et al., 2018), many studies have shown it has a poor performance in conditions like fluctuating noise and reverberation (Relaño-Iborra et al., 2016), modified and synthesized speech (Tang et al., 2016), and additive noise with strong temporal modulation content (Jensen and Taal, 2016; Jørgensen et al., 2015). In addition to these works, there have also been many efforts to tackle STOI’s deficits from different angles and to improve on its SIP performance in different noise and acoustic conditions (Andersen et al., 2017; Jensen and Taal, 2016; Karbasi et al., 2016b; Taghia and Martin, 2014).

    • Nonintrusive objective measurement of speech intelligibility: A review of methodology

      2022, Biomedical Signal Processing and Control
      Citation Excerpt :

      Similar to NISI, NISA has not been evaluated relative to subjective data and thus it is unknown how well it correlates with subjective SI. This requires further investigation because STOI has been reported to have poor accuracy performance when deployed for algorithmically modified speech [99,100]. Karbasi et al. proposed a statistics-based approach that synthesized clean speech features using a statistical model trained with clean speech and incorporated features into an intrusive framework for nonintrusive SI measurement [47].

    • Glimpse-based estimation of speech intelligibility from speech-in-noise using artificial neural networks

      2021, Computer Speech and Language
      Citation Excerpt :

      The first 150 sentences were used to train the ANN for glimpse detection. While standard measures, such as the SII, and other methods have shown robust accuracy in temporally-stationary noise maskers, their performance tends to decline when handling noises whose intensity significantly varies over time (Rhebergen et al., 2006; Tang et al., 2016). In order to examine the capacity of the proposed method in challenging conditions, nine temporally-fluctuating noise maskers were generated and tested along with speech-shaped noise (SSN) – the only stationary noise masker.

    • Learning static spectral weightings for speech intelligibility enhancement in noise

      2018, Computer Speech and Language
      Citation Excerpt :

      Based on common features of the spectral weightings discovered via optimisation, Section 5 describes the results of a second intelligibility experiment using a number of generic, masker-independent spectral weightings. Tang et al. (2016) reported further significant improvements in the predictive power of the HEGP metric by removing inaudible (sub-threshold) glimpses, and by applying a quasi-logarithmic transformation to the GP value, based on the finding that subjective intelligibility scores reach ceiling for relatively low values of GP (Barker and Cooke, 2007). These extensions increased listener-model correlations from 0.79, 0.71 and 0.53 for the original GP metric to 0.92, 0.83 and 0.87 across three large-scale datasets.

    • A non-intrusive method for estimating binaural speech intelligibility from noise-corrupted signals captured by a pair of microphones

      2018, Speech Communication
      Citation Excerpt :

      This leaves the question of whether the high correlation with the objective scores can be translated to a good match with subjective intelligibility unanswered. There is some evidence (Tang and Cooke, 2012; Tang et al., 2016b) suggesting that STOI lacks predictive accuracy when making predictions for algorithmically-modified speech or across different types of maskers. Based on full-band clarity index C50 (Naylor and Gaubitch, 2010), a data-driven non-intrusive room acoustic estimation method for predicting ASR performance in reverberant conditions was introduced (Peso Parada et al., 2016).

    • Evaluating a distortion-weighted glimpsing metric for predicting binaural speech intelligibility in rooms

      2016, Speech Communication
      Citation Excerpt :

      The monaural DWGP metric incorporates a distortion weighting factor with the glimpse proportion metric (GP, Cooke, 2006; Tang, 2014). This weighting factor was initially introduced in Tang (2014) to increase the consistency of predictions by the GP metric across different noise maskers, especially between stationary (e.g. speech-shaped noise) and fluctuating (e.g. single-talker competing speech) maskers (Tang et al., 2016). The calculation of the distortion weighting factor was inspired by a STI-based metric, the normalise-covariance metric (Holube and Kollmeier, 1996), which uses the cross-correlation coefficient of the reference clean and noise-corrupted speech envelopes within each frequency band to determine the speech-to-distortion level.


    This paper has been recommended for acceptance by Roger K. Moore.
