On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues

Original Paper · Journal on Multimodal User Interfaces

Abstract

For many applications of emotion recognition, such as virtual agents, the system must select responses while the user is speaking. This requires reliable on-line recognition of the user’s affect. However, most emotion recognition systems are based on turn-wise processing. We present a novel approach to on-line emotion recognition from speech using Long Short-Term Memory Recurrent Neural Networks. Emotion is recognised frame-wise in a two-dimensional valence-activation continuum. In contrast to current state-of-the-art approaches, recognition is performed on low-level signal frames, similar to those used for speech recognition. No statistical functionals are applied to low-level feature contours; framing at a higher level is therefore unnecessary, and regression outputs can be produced in real time for every low-level input frame. We also investigate the benefit of including linguistic features, obtained by a keyword spotter, at the signal frame level.
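To make the frame-wise regression idea concrete, the sketch below shows how a recurrent network can map each low-level acoustic frame to a continuous (activation, valence) estimate as the frame arrives. This is a minimal illustration in PyTorch under assumed settings (39 input features, 128 hidden units, a unidirectional network so that output is causal); the authors' actual feature set, network configuration, and training procedure are those described in the full paper, not this code.

```python
# Minimal sketch of frame-wise affect regression with an LSTM.
# All dimensions and the feature layout are illustrative assumptions,
# not the configuration used in the paper.
import torch
import torch.nn as nn

class FramewiseAffectRegressor(nn.Module):
    """Maps a sequence of low-level acoustic feature frames to one
    (activation, valence) estimate per frame."""

    def __init__(self, n_features: int = 39, hidden_size: int = 128):
        super().__init__()
        # A unidirectional LSTM keeps the model causal, so an estimate can
        # be emitted as soon as each input frame arrives (on-line operation).
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 2)  # activation and valence

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, n_features), e.g. one frame every 10 ms
        hidden, _ = self.lstm(frames)
        # One regression output per input frame: (batch, time, 2)
        return self.head(hidden)

# Example: 100 frames, roughly 1 s of speech at a 10 ms frame shift
model = FramewiseAffectRegressor()
frames = torch.randn(1, 100, 39)     # placeholder low-level features
activation_valence = model(frames)   # shape: (1, 100, 2)
print(activation_valence.shape)
```

Because no statistical functionals are computed over feature contours, the network above needs no utterance- or turn-level segmentation: each new frame simply extends the sequence and yields the next output.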



Author information

Correspondence to Florian Eyben.


About this article

Cite this article

Eyben, F., Wöllmer, M., Graves, A. et al. On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues. J Multimodal User Interfaces 3, 7–19 (2010). https://doi.org/10.1007/s12193-009-0032-6
