Abstract
For many applications of emotion recognition, such as virtual agents, the system must select responses while the user is still speaking. This requires reliable on-line recognition of the user’s affect. However, most emotion recognition systems are based on turn-wise processing. We present a novel approach to on-line emotion recognition from speech using Long Short-Term Memory Recurrent Neural Networks. Emotion is recognised frame-wise in a two-dimensional valence-activation continuum. In contrast to current state-of-the-art approaches, recognition is performed on low-level signal frames, similar to those used for speech recognition, and no statistical functionals are applied to the low-level feature contours. Framing at a higher level is therefore unnecessary, and regression outputs can be produced in real time for every low-level input frame. We also investigate the benefits of including linguistic features, obtained by a keyword spotter, at the signal-frame level.
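The frame-wise idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-layer LSTM cell with input, forget and output gates, random untrained weights, a tanh read-out that keeps both outputs in [-1, 1], and a made-up 4-dimensional feature frame. The class name `StreamingLSTMRegressor` and its `step` method are hypothetical. The point is only that one (valence, activation) estimate is emitted per incoming low-level frame, with the recurrent state carrying the context that turn-level functionals would otherwise summarise.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class StreamingLSTMRegressor:
    """Hypothetical single-layer LSTM emitting (valence, activation) per frame."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            # Small random weights; a real system would learn these from data.
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Each gate sees [current frame, previous hidden state, bias term].
        width = n_in + n_hidden + 1
        self.w = {g: mat(n_hidden, width) for g in ("input", "forget", "output", "cell")}
        # Read-out to the two regression targets (assumed layout: valence, activation).
        self.w_out = mat(2, n_hidden + 1)
        self.h = [0.0] * n_hidden  # hidden state carried across frames
        self.c = [0.0] * n_hidden  # cell state carried across frames

    def step(self, frame):
        """Consume one low-level feature frame and return one estimate."""
        z = list(frame) + self.h + [1.0]

        def pre(gate, j):
            return sum(w * v for w, v in zip(self.w[gate][j], z))

        new_h, new_c = [], []
        for j in range(len(self.h)):
            i = sigmoid(pre("input", j))
            f = sigmoid(pre("forget", j))
            o = sigmoid(pre("output", j))
            g = math.tanh(pre("cell", j))
            c = f * self.c[j] + i * g  # gated update of the memory cell
            new_c.append(c)
            new_h.append(o * math.tanh(c))
        self.h, self.c = new_h, new_c

        hb = self.h + [1.0]
        valence, activation = (
            math.tanh(sum(w * v for w, v in zip(row, hb))) for row in self.w_out
        )
        return valence, activation


# Usage: feed synthetic frames one at a time, as they would arrive from a live signal.
model = StreamingLSTMRegressor(n_in=4, n_hidden=8)
frame_rng = random.Random(1)
frames = [[frame_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
outputs = [model.step(frame) for frame in frames]
```

Because `step` needs only the current frame and the stored state, an estimate is available as soon as each frame arrives, without waiting for the end of the speaker's turn.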
Cite this article
Eyben, F., Wöllmer, M., Graves, A. et al. On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues. J Multimodal User Interfaces 3, 7–19 (2010). https://doi.org/10.1007/s12193-009-0032-6
DOI: https://doi.org/10.1007/s12193-009-0032-6