Emotional speech recognition: Resources, features, and methods
Introduction
Emotional speech recognition aims at automatically identifying the emotional or physical state of a human being from his or her voice. The emotional and physical states of a speaker are known as the emotional aspects of speech and belong to the so-called paralinguistic aspects. Although the emotional state does not alter the linguistic content, it is an important factor in human communication, because it provides feedback information in many applications, as outlined next.
Making a machine recognize emotions from speech is not a new idea. The first investigations were conducted around the mid-1980s using statistical properties of certain acoustic features (Van Bezooijen, 1984, Tolkmitt and Scherer, 1986). Ten years later, the evolution of computer architectures made the implementation of more complicated emotion recognition algorithms feasible, and market requirements for automatic services motivated further research. In environments like aircraft cockpits, speech recognition systems were trained on stressed speech instead of neutral speech (Hansen and Cairns, 1995). The acoustic features were estimated more precisely by iterative algorithms, and advanced classifiers exploiting timing information were proposed (Cairns and Hansen, 1994, Womack and Hansen, 1996, Polzin and Waibel, 1998). Nowadays, research focuses on finding powerful combinations of classifiers that advance the classification efficiency in real-life applications. The wide use of telecommunication services and multimedia devices also paves the way for new applications. For example, in the projects “Prosody for dialogue systems” and “SmartKom”, ticket reservation systems are developed that employ automatic speech recognition able to recognize the annoyance or frustration of a user and change their response accordingly (Ang et al., 2002, Schiel et al., 2002). Similar scenarios have also been presented for call center applications (Petrushin, 1999, Lee and Narayanan, 2005). Emotional speech recognition can be employed by therapists as a diagnostic tool in medicine (France et al., 2000). In psychology, emotional speech recognition methods can cope with large volumes of speech data in real time, extracting in a systematic manner the speech characteristics that convey emotion and attitude (Mozziconacci and Hermes, 2000).
In the future, emotional speech research will primarily benefit from the ongoing availability of large-scale emotional speech data collections, and will focus on improving theoretical models of speech production (Flanagan, 1972) or models of the vocal communication of emotion (Scherer, 2003). Indeed, on the one hand, large data collections that include a variety of speaker utterances under several emotional states are necessary in order to faithfully assess the performance of emotional speech recognition algorithms. The already available data collections consist of only a few utterances each, and it is therefore difficult to demonstrate reliable emotion recognition results. The data collections listed in Section 2 provide initiatives to set up more relaxed, close-to-real-life specifications for recording large-scale emotional speech data collections that complement the already existing resources. On the other hand, theoretical models of speech production and of the vocal communication of emotion will provide the necessary background for a systematic study and will reveal more accurate emotional cues over time. In the following, the contributions of the paper are identified and its outline is given.
Several reviews of emotional speech analysis have already appeared (Van Bezooijen, 1984, Scherer et al., 1991, Cowie et al., 2001, Pantic and Rothkrantz, 2003, Scherer, 2003, Douglas-Cowie et al., 2003). However, as research towards understanding human emotions increasingly attracts the attention of the research community, the short list of 19 data collections that appeared in (Douglas-Cowie et al., 2003) does not adequately cover the topic. In this tutorial, 64 data collections are reviewed. Furthermore, an up-to-date literature survey is provided, complementing the previous studies in (Van Bezooijen, 1984, Scherer et al., 1991, Cowie et al., 2001). Finally, the paper focuses on describing the feature extraction methods and the emotion classification techniques, topics that have not been treated in (Scherer, 2003, Pantic and Rothkrantz, 2003).
In Section 2, a corpus of 64 data collections is reviewed, with emphasis on the data collection procedures, the kind of speech (natural, simulated, or elicited), the content, and other physiological signals that may accompany the emotional speech. In Section 3, short-term features (i.e. features that are extracted on a speech-frame basis) related to the emotional content of speech are discussed. In addition to the short-term features themselves, their contours are of fundamental importance for emotional speech recognition; the emotions affect contour characteristics such as statistics and trends, as summarized in Section 4. Emotion classification techniques that exploit timing information, as well as techniques that ignore it, are surveyed in Section 5. Sections 3 (Estimation of short-term acoustic features) and 4 (Cues to emotion) thus describe the appropriate features to be used with the emotion classification techniques reviewed in Section 5. Finally, Section 6 concludes the tutorial by indicating future research directions.
Section snippets
Data collections
A record of emotional speech data collections is undoubtedly useful for researchers interested in emotional speech analysis. An overview of 64 emotional speech data collections is presented in Table 1. For each data collection, additional information is also described, such as the speech language, the number and profession of the subjects, other physiological signals possibly recorded simultaneously with speech, the data collection purpose (emotional speech recognition, expressive synthesis),
Estimation of short-term acoustic features
Methods for estimating short-term acoustic features that are frequently used in emotion recognition are described hereafter. Short-term features are estimated on a frame basis, i.e. from the windowed segment s(n)w(m − n), where s(n) is the speech signal and w(m − n) is a window of length Nw ending at sample m (Deller et al., 2000). Most of the methods stem from the front-end signal processing employed in speech recognition and coding. However, the discussion is focused on acoustic features that are useful for emotion
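The frame-based estimation just described can be sketched as follows for one common short-term feature, the short-time energy. This is a minimal illustration, not the paper's implementation: the 25 ms frame, 10 ms hop, and Hamming window are typical front-end choices assumed here, not values taken from the text.

```python
import numpy as np

def short_time_energy(s, frame_len=400, hop=160):
    """Short-time energy of signal s, computed on a frame basis.

    frame_len (the window length Nw) and hop are in samples; the
    defaults correspond to 25 ms / 10 ms at 16 kHz (an assumption,
    not a value from the paper).
    """
    w = np.hamming(frame_len)                  # analysis window w(m - n)
    n_frames = 1 + (len(s) - frame_len) // hop
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = s[i * hop : i * hop + frame_len] * w  # windowed segment s(n)w(m - n)
        energies[i] = np.sum(frame ** 2)              # energy of the frame
    return energies

# toy usage: 1 s of a 220 Hz sine at 16 kHz
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t)
e = short_time_energy(x)
```

Other short-term features (zero-crossing rate, formants, cepstral coefficients) follow the same pattern: the windowed frame is extracted first, and the feature is computed per frame.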
Cues to emotion
In this section, we review how the contour of selected short-term acoustic features is affected by the emotional states of anger, disgust, fear, joy, and sadness. A short-term feature contour is formed by assigning the feature value computed on a frame basis to all samples belonging to the frame; for example, the energy contour assigns each frame's short-time energy to every sample of that frame. The contour trends (i.e. its plateaux, its rising or falling slopes) are valuable features for emotion recognition, because they
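The contour construction described above, together with a few of the contour statistics commonly used as emotion cues, can be sketched as follows. The function names and the particular statistics chosen (mean, range, fraction of rising slope) are illustrative assumptions, not the paper's exact feature set.

```python
import numpy as np

def feature_contour(frame_values, hop, frame_len, n_samples):
    """Expand per-frame feature values into a sample-level contour by
    assigning each frame's value to all samples the frame covers
    (later frames overwrite overlapping regions)."""
    contour = np.zeros(n_samples)
    for i, v in enumerate(frame_values):
        contour[i * hop : min(i * hop + frame_len, n_samples)] = v
    return contour

def contour_stats(contour):
    """Simple contour statistics of the kind used as emotion cues:
    level (mean), dynamic range, and a crude rising-trend measure."""
    slope = np.diff(contour)
    return {
        "mean": float(np.mean(contour)),
        "range": float(np.ptp(contour)),
        "rising_fraction": float(np.mean(slope > 0)),
    }

# usage: three frame values, hop = frame length = 2 samples, 6-sample signal
c = feature_contour(np.array([1.0, 2.0, 3.0]), hop=2, frame_len=2, n_samples=6)
stats = contour_stats(c)
```

Trends such as plateaux and rising or falling slopes would be read off the sign pattern of `slope`; the statistics above are the simplest instances of that idea.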
Emotion classification techniques
The output of emotion classification techniques is a prediction value (label) about the emotional state of an utterance. An utterance uξ is a speech segment corresponding to a word or a phrase. Let uξ, ξ ∈ {1, 2, … , Ξ} be an utterance of the data collection. In order to evaluate the performance of a classification technique, the cross-validation method is used. According to this method, the utterances of the whole data collection are divided into the design set containing utterances and the
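The cross-validation protocol described above, repeatedly splitting the utterances into a design (training) set and a held-out test set and averaging the test accuracies, can be sketched as follows. The nearest-centroid classifier used here is a stand-in for the techniques surveyed in the paper, and the k-fold variant of cross-validation is an assumption; X holds one feature vector per utterance and y the emotion labels.

```python
import numpy as np

def kfold_accuracy(X, y, k=5, seed=0):
    """Estimate classification accuracy by k-fold cross-validation:
    each fold in turn serves as the test set while the remaining
    utterances form the design set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    accs = []
    for i in range(k):
        test = folds[i]
        design = np.concatenate([folds[j] for j in range(k) if j != i])
        # nearest-centroid classifier (illustrative stand-in only)
        classes = np.unique(y[design])
        centroids = np.stack(
            [X[design][y[design] == c].mean(axis=0) for c in classes]
        )
        d = np.linalg.norm(X[test, None, :] - centroids[None, :, :], axis=2)
        pred = classes[np.argmin(d, axis=1)]
        accs.append(np.mean(pred == y[test]))
    return float(np.mean(accs))

# usage on well-separated synthetic "utterance features"
X = np.vstack([np.zeros((20, 2)), np.full((20, 2), 10.0)])
y = np.array([0] * 20 + [1] * 20)
acc = kfold_accuracy(X, y)
```

The averaging over folds is what makes the estimate usable on the small data collections discussed in Section 2, where a single fixed train/test split would be unreliable.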
Concluding remarks
In this paper, several topics have been addressed. First, a list of data collections was provided including all available information about the databases such as the kinds of emotions, the language, etc. Nevertheless, there are still some copyright problems since the material from radio or TV is held under a limited agreement with broadcasters. Furthermore, there is a need for adopting protocols such as those in (Douglas-Cowie et al., 2003, Scherer, 2003, Schröder, 2005) that address issues
Acknowledgment
This work has been supported by the research project 01ED312 “Use of Virtual Reality for training pupils to deal with earthquakes” financed by the Greek Secretariat of Research and Technology.
References (121)
- Reflections of depression in acoustic measures of the patient's speech. J. Affect. Disord. (2001)
- The role of intonation in emotional expressions. Speech Comm. (2005)
- Modifications of phonetic labial targets in emotive speech: effects of the co-production of speech and emotions. Speech Comm. (2004)
- Describing the emotional states that are expressed in speech. Speech Comm. (2003)
- Emotional speech: towards a new generation of databases. Speech Comm. (2003)
- Modeling drivers' speech under stress. Speech Comm. (2003)
- ICARUS: source generator based real-time recognition of speech in noisy stressful and Lombard effect environments. Speech Comm. (1995)
- A corpus-based speech synthesis system with emotion. Speech Comm. (2003)
- Conveyance of emotional connotations by a single word in English. Speech Comm. (2005)
- Comprehension of prosody in Parkinson's disease. Cortex (1999)
- Distinctive regions and models: a new theory of speech production. Speech Comm.
- Measurements of articulatory variation in expressive speech for a set of Swedish vowels. Speech Comm.
- Speech emotion recognition using hidden Markov models. Speech Comm.
- Vocal communication of emotion: a review of research paradigms. Speech Comm.
- A new look at the statistical model identification. IEEE Trans. Automat. Contr.
- Acoustic profiles in vocal emotion expression. J. Pers. Soc. Psychol.
- HMM-based stressed speech modelling with application to improved synthesis and recognition of isolated speech under stress. IEEE Trans. Speech Audio Processing
- The biological affects, a typology. Psychol. Rev.
- Nonlinear analysis and detection of speech under stressed conditions. J. Acoust. Soc. Am.
- Emotion recognition in human–computer interaction. IEEE Signal Processing Mag.
- Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Processing
- Discrete-Time Processing of Speech Signals
- Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B
- An argument for basic emotions. Cognition Emotion
- An Introduction to the Bootstrap
- Speech Analysis, Synthesis and Perception, second ed.
- Acoustical properties of speech as indicators of depression and suicidal risk. IEEE Trans. Biomed. Eng.
- Introduction to Statistical Pattern Recognition, second ed.
- A system for finding speech formants and modulations via energy separation. IEEE Trans. Speech Audio Processing