Speech Communication

Volume 48, Issue 9, September 2006, Pages 1162-1181

Emotional speech recognition: Resources, features, and methods

https://doi.org/10.1016/j.specom.2006.04.003

Abstract

In this paper, we provide an overview of emotional speech recognition with three goals in mind. The first goal is to provide an up-to-date record of the available emotional speech data collections. The number of emotional states, the language, the number of speakers, and the kind of speech are briefly addressed. The second goal is to present the acoustic features most frequently used for emotional speech recognition and to assess how emotion affects them. Typical features are the pitch, the formants, the vocal tract cross-section areas, the mel-frequency cepstral coefficients, Teager energy operator-based features, the intensity of the speech signal, and the speech rate. The third goal is to review appropriate techniques for classifying speech into emotional states. We examine separately classification techniques that exploit timing information from those that ignore it. Classification techniques based on hidden Markov models, artificial neural networks, linear discriminant analysis, k-nearest neighbors, and support vector machines are reviewed.

Introduction

Emotional speech recognition aims at automatically identifying the emotional or physical state of a human being from his or her voice. The emotional and physical states of a speaker are known as emotional aspects of speech and belong to the so-called paralinguistic aspects. Although the emotional state does not alter the linguistic content, it is an important factor in human communication, because it provides feedback information in many applications, as outlined next.

Making a machine recognize emotions from speech is not a new idea. The first investigations were conducted around the mid-1980s using statistical properties of certain acoustic features (Van Bezooijen, 1984, Tolkmitt and Scherer, 1986). Ten years later, the evolution of computer architectures made the implementation of more complicated emotion recognition algorithms feasible. Market requirements for automatic services have motivated further research. In environments like aircraft cockpits, speech recognition systems were trained on stressed speech instead of neutral speech (Hansen and Cairns, 1995). The acoustic features were estimated more precisely by iterative algorithms. Advanced classifiers exploiting timing information were proposed (Cairns and Hansen, 1994, Womack and Hansen, 1996, Polzin and Waibel, 1998). Nowadays, research focuses on finding powerful combinations of classifiers that improve classification efficiency in real-life applications. The wide use of telecommunication services and multimedia devices also paves the way for new applications. For example, in the projects “Prosody for dialogue systems” and “SmartKom”, ticket reservation systems have been developed that employ automatic speech recognition, are able to recognize the annoyance or frustration of a user, and change their responses accordingly (Ang et al., 2002, Schiel et al., 2002). Similar scenarios have also been presented for call center applications (Petrushin, 1999, Lee and Narayanan, 2005). Emotional speech recognition can be employed by therapists as a diagnostic tool in medicine (France et al., 2000). In psychology, emotional speech recognition methods can cope with large volumes of speech data in real time, extracting in a systematic manner the speech characteristics that convey emotion and attitude (Mozziconacci and Hermes, 2000).

In the future, emotional speech research will benefit primarily from the ongoing availability of large-scale emotional speech data collections, and will focus on improving theoretical models of speech production (Flanagan, 1972) and models related to the vocal communication of emotion (Scherer, 2003). Indeed, on the one hand, large data collections that include a variety of speaker utterances under several emotional states are necessary in order to faithfully assess the performance of emotional speech recognition algorithms. The already available data collections consist of only a few utterances each, and it is therefore difficult to demonstrate reliable emotion recognition results. The data collections listed in Section 2 provide an incentive to set up more relaxed, close-to-real-life specifications for recording large-scale emotional speech data collections that complement the existing resources. On the other hand, theoretical models of speech production and of the vocal communication of emotion will provide the necessary background for a systematic study and will reveal more accurate emotional cues over time. In the following, the contributions of the paper are identified and its outline is given.

Several reviews on emotional speech analysis have already appeared (Van Bezooijen, 1984, Scherer et al., 1991, Cowie et al., 2001, Pantic and Rothkrantz, 2003, Scherer, 2003, Douglas-Cowie et al., 2003). However, as research towards understanding human emotions increasingly attracts the attention of the research community, the short list of 19 data collections that appeared in (Douglas-Cowie et al., 2003) does not adequately cover the topic. In this tutorial, 64 data collections are reviewed. Furthermore, an up-to-date literature survey is provided, complementing the previous studies in (Van Bezooijen, 1984, Scherer et al., 1991, Cowie et al., 2001). Finally, the paper focuses on describing the feature extraction methods and the emotion classification techniques, topics that have not been treated in (Scherer, 2003, Pantic and Rothkrantz, 2003).

In Section 2, a corpus of 64 data collections is reviewed, with emphasis on the data collection procedures, the kind of speech (natural, simulated, or elicited), the content, and other physiological signals that may accompany the emotional speech. In Section 3, short-term features (i.e. features that are extracted on a speech frame basis) that are related to the emotional content of speech are discussed. In addition to short-term features, their contours are of fundamental importance for emotional speech recognition. How the emotions affect contour characteristics, such as statistics and trends, is summarized in Section 4. Emotion classification techniques that exploit timing information and techniques that ignore it are surveyed in Section 5. Therefore, Sections 3 and 4 aim at describing the appropriate features to be used with the emotion classification techniques reviewed in Section 5. Finally, Section 6 concludes the tutorial by indicating future research directions.


Data collections

A record of emotional speech data collections is undoubtedly useful for researchers interested in emotional speech analysis. An overview of 64 emotional speech data collections is presented in Table 1. For each data collection, additional information is also described, such as the speech language, the number and the profession of the subjects, other physiological signals possibly recorded simultaneously with speech, the data collection purpose (emotional speech recognition, expressive synthesis),

Estimation of short-term acoustic features

Methods for estimating short-term acoustic features that are frequently used in emotion recognition are described hereafter. Short-term features are estimated on a frame basis: f_s(n; m) = s(n) w(m − n), where s(n) is the speech signal and w(m − n) is a window of length N_w ending at sample m (Deller et al., 2000). Most of the methods stem from the front-end signal processing employed in speech recognition and coding. However, the discussion is focused on acoustic features that are useful for emotion
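As an illustration of this framing operation, the following sketch (not part of the original paper) extracts two common short-term features, log-energy and zero-crossing rate, on a frame basis; the Hamming window, frame length, and hop size are illustrative assumptions.

# Minimal sketch (assumed parameters) of frame-based short-term feature
# extraction following f_s(n; m) = s(n) w(m - n) with a Hamming window.
import numpy as np

def short_term_features(s, frame_len=400, hop=160):
    """Return per-frame log-energy and zero-crossing rate of the signal s."""
    w = np.hamming(frame_len)                            # window w of length N_w
    n_frames = 1 + (len(s) - frame_len) // hop
    feats = np.empty((n_frames, 2))
    for i in range(n_frames):
        frame = s[i * hop : i * hop + frame_len] * w     # windowed frame f_s(n; m)
        log_energy = np.log(np.sum(frame ** 2) + 1e-12)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
        feats[i] = (log_energy, zcr)
    return feats

For a 16 kHz recording, for instance, frame_len = 400 and hop = 160 would correspond to 25 ms frames with a 10 ms shift, a common front-end choice.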

Cues to emotion

In this section, we review how the contour of selected short-term acoustic features is affected by the emotional states of anger, disgust, fear, joy, and sadness. A short-term feature contour is formed by assigning the feature value computed on a frame basis to all samples belonging to the frame. For example, the energy contour is given by e(n) = E_s(m), for n = m − N_w + 1, …, m. The contour trends (i.e. its plateaux and its rising or falling slopes) are a valuable feature for emotion recognition, because they
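A minimal sketch of this contour construction is given below (not from the original paper); the non-overlapping frames and the frame length are simplifying assumptions.

# Sketch of a sample-level energy contour, i.e. e(n) = E_s(m) for
# n = m - N_w + 1, ..., m, assuming non-overlapping frames for simplicity.
import numpy as np

def energy_contour(s, frame_len=400):
    """Assign each frame's short-term energy to all samples of that frame."""
    n_frames = len(s) // frame_len
    contour = np.zeros(n_frames * frame_len)
    for i in range(n_frames):
        frame = s[i * frame_len : (i + 1) * frame_len]
        contour[i * frame_len : (i + 1) * frame_len] = np.sum(frame ** 2)
    return contour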

Emotion classification techniques

The output of emotion classification techniques is a prediction value (label) about the emotional state of an utterance. An utterance u_ξ is a speech segment corresponding to a word or a phrase. Let u_ξ, ξ ∈ {1, 2, …, Ξ}, be an utterance of the data collection. In order to evaluate the performance of a classification technique, the cross-validation method is used. According to this method, the utterances of the whole data collection are divided into the design set D_s containing N_Ds utterances and the
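The following sketch (an illustration under stated assumptions, not the paper's exact protocol) shows one way to carry out such repeated design/test splits over utterance-level feature vectors, using a k-nearest-neighbor classifier of the kind reviewed in Section 5; the 90/10 split ratio, the number of repetitions, and the value of k are illustrative choices.

# Sketch of cross-validation over utterance-level feature vectors X (one row
# per utterance, as a numpy array) with emotion labels y. The split ratio,
# repetitions, and k-NN classifier are illustrative assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cross_validate(X, y, n_repeats=20, test_fraction=0.1, k=5, seed=0):
    """Average test accuracy over repeated random design/test splits."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        idx = rng.permutation(len(y))
        n_test = max(1, int(test_fraction * len(y)))
        test, design = idx[:n_test], idx[n_test:]   # design set D_s and held-out test set
        clf = KNeighborsClassifier(n_neighbors=k).fit(X[design], y[design])
        scores.append(clf.score(X[test], y[test]))
    return float(np.mean(scores))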

Concluding remarks

In this paper, several topics have been addressed. First, a list of data collections was provided, including all available information about the databases, such as the kinds of emotions, the language, etc. Nevertheless, there are still some copyright problems, since material from radio or TV is held under a limited agreement with broadcasters. Furthermore, there is a need for adopting protocols such as those in (Douglas-Cowie et al., 2003, Scherer, 2003, Schröder, 2005) that address issues

Acknowledgment

This work has been supported by the research project 01ED312 “Use of Virtual Reality for training pupils to deal with earthquakes” financed by the Greek Secretariat of Research and Technology.

References (121)

  • M. Mrayati et al.

    Distinctive regions and models: a new theory of speech production

    Speech Comm.

    (1988)
  • M. Nordstrand et al.

    Measurements of articulatory variation in expressive speech for a set of Swedish vowels

    Speech Comm.

    (2004)
  • T.L. Nwe et al.

    Speech emotion recognition using hidden Markov models

    Speech Comm.

    (2003)
  • K.R. Scherer

    Vocal communication of emotion: a review of research paradigms

    Speech Comm.

    (2003)
  • Abelin, A., Allwood, J., 2000. Cross linguistic interpretation of emotional prosody. In: Proc. ISCA Workshop on Speech...
  • H. Akaike

    A new look at the statistical model identification

    IEEE Trans. Automat. Contr.

    (1974)
  • Alter, K., Rank, E., Kotz, S.A., 2000. Accentuation and emotions – two different systems? In: Proc. ISCA Workshop on...
  • Ambrus, D.C., 2000. Collecting and recording of an emotional speech database. Tech. rep., Faculty of Electrical...
  • Amir, N., Ron, S., Laor, N., 2000. Analysis of an emotional speech corpus in Hebrew based on objective criteria. In:...
  • Ang, J., Dhillon, R., Krupski, A., Shriberg, E., Stolcke, A., 2002. Prosody-based automatic detection of annoyance and...
  • Atal, B., Schroeder, M., 1967. Predictive coding of speech signals. In: Proc. Conf. on Communications and Processing,...
  • R. Banse et al.

    Acoustic profiles in vocal emotion expression

    J. Pers. Soc. Psychol.

    (1996)
  • Batliner, A., Hacker, C., Steidl, S., Nöth, E., D’Arcy, S., Russell, M., Wong, M., 2004. “You stupid tin box” –...
  • S.E. Bou-Ghazale et al.

    HMM-based stressed speech modelling with application to improved synthesis and recognition of isolated speech under stress

    IEEE Trans. Speech Audio Processing

    (1998)
  • R. Buck

    The biological affects, a typology

    Psychol. Rev.

    (1999)
  • Bulut, M., Narayanan, S.S., Sydral, A.K., 2002. Expressive speech synthesis using a concatenative synthesizer. In:...
  • Burkhardt, F., Sendlmeier, W.F., 2000. Verification of acoustical correlates of emotional speech using...
  • D. Cairns et al.

    Nonlinear analysis and detection of speech under stressed conditions

    J. Acoust. Soc. Am.

    (1994)
  • Choukri, K., 2003. European Language Resources Association (ELRA). Available from:...
  • Chuang, Z.J., Wu, C.H., 2002. Emotion recognition from textual input using an emotional semantic network. In: Proc....
  • Clavel, C., Vasilescu, I., Devillers, L., Ehrette, T., 2004. Fiction database for emotion detection in abnormal...
  • Cole, R., 2005. The CU kids’ speech corpus. The Center for Spoken Language Research (CSLR). Available from:...
  • Cowie, R., Douglas-Cowie, E., 1996. Automatic statistical analysis of the signal and prosodic signs of emotion in...
  • R. Cowie et al.

    Emotion recognition in human–computer interaction

    IEEE Signal Processing Mag.

    (2001)
  • S.B. Davis et al.

    Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences

    IEEE Trans. Acoust. Speech Signal Processing

    (1980)
  • Dellaert, F., Polzin, T., Waibel, A., 1996. Recognizing emotion in speech. In: Proc. Internat. Conf. on Spoken Language...
  • J.R. Deller et al.

    Discrete-Time Processing of Speech Signals

    (2000)
  • A.P. Dempster et al.

    Maximum likelihood from incomplete data via the EM algorithm

    J. Roy. Statist. Soc. Ser. B

    (1977)
  • P. Ekman

    An argument for basic emotions

    Cognition Emotion

    (1992)
  • Edgington, M., 1997. Investigating the limitations of concatenative synthesis. In: Proc. European Conf. on Speech...
  • B. Efron et al.

    An Introduction to the Bootstrap

    (1993)
  • Engberg, I.S., Hansen, A.V., 1996. Documentation of the Danish Emotional Speech database (DES). Internal AAU report,...
  • Fischer, K., 1999. Annotating emotional language data. Tech. Rep. 236, Univ. of...
  • J.L. Flanagan

    Speech Analysis, Synthesis and Perception

    second ed.

    (1972)
  • D.J. France et al.

    Acoustical properties of speech as indicators of depression and suicidal risk

    IEEE Trans. Biomed. Eng.

    (2000)
  • K. Fukunaga

    Introduction to Statistical Pattern Recognition

    second ed.

    (1990)
  • Gonzalez, G.M., 1999. Bilingual computer-assisted psychological assessment: an innovative approach for screening...
  • Hansen, J.H.L., 1996. NATO IST-03 (formerly RSG. 10) speech under stress web page. Available from:...
  • H.M. Hanson et al.

    A system for finding speech formants and modulations via energy separation

    IEEE Trans. Speech Audio Processing

    (1994)