
Emotion recognition from speech using global and local prosodic features

Published in: International Journal of Speech Technology

Abstract

In this paper, global and local prosodic features extracted at the sentence, word, and syllable levels are proposed for speech emotion (affect) recognition. Duration, pitch, and energy values are used to represent the prosodic information for recognizing emotions from speech. Global prosodic features represent gross statistics of the prosodic contours, such as mean, minimum, maximum, standard deviation, and slope; local prosodic features represent the temporal dynamics of the prosody. Global and local prosodic features are analyzed separately and in combination, at the different levels, for the recognition of emotions. Words and syllables at different positions (initial, middle, and final) are also explored separately to analyze their contribution to emotion recognition. All studies are carried out on the simulated Telugu emotional speech corpus IITKGP-SESC, and the results are compared with those on the internationally known Berlin emotional speech corpus (Emo-DB). Support vector machines are used to develop the emotion recognition models. The results indicate that recognition performance using local prosodic features is better than that using global prosodic features. Words in the final position of a sentence, and syllables in the final position of a word, carry more emotion-discriminative information than words and syllables in other positions.
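To make the global-feature idea concrete, the sketch below is a minimal illustration, not the authors' implementation. It assumes frame-level pitch and energy contours have already been extracted by some front end, computes the five gross statistics named in the abstract (mean, minimum, maximum, standard deviation, slope) per contour, and fits an SVM in the style the paper describes. All function names, the stand-in contours, and the labels are hypothetical; only numpy and scikit-learn are assumed.

```python
# A minimal sketch (not the authors' code) of global prosodic features + SVM.
import numpy as np
from sklearn.svm import SVC

def global_prosodic_features(contour):
    """Gross statistics of one prosodic contour (e.g., frame-level F0
    or energy): mean, minimum, maximum, standard deviation, and the
    slope of a least-squares line fit, as listed in the abstract."""
    contour = np.asarray(contour, dtype=float)
    frames = np.arange(len(contour))
    slope = np.polyfit(frames, contour, deg=1)[0]  # slope of fitted line
    return np.array([contour.mean(), contour.min(), contour.max(),
                     contour.std(), slope])

# Hypothetical usage with stand-in contours (real input would come from a
# pitch/energy extractor): one feature vector per utterance, built by
# concatenating the statistics of the pitch and energy contours.
rng = np.random.default_rng(0)
X = np.stack([
    np.concatenate([
        global_prosodic_features(rng.normal(200.0, 30.0, 120)),  # stand-in F0 contour (Hz)
        global_prosodic_features(rng.normal(0.5, 0.1, 120)),     # stand-in energy contour
    ])
    for _ in range(40)
])
y = rng.integers(0, 4, size=40)  # stand-in labels for four emotion classes

model = SVC(kernel="rbf").fit(X, y)  # SVMs are the emotion models in the paper
print(model.predict(X[:5]))
```

Local prosodic features, by contrast, would retain the contour's temporal shape (for example, the sequence of frame-level values itself, resampled to a fixed length) rather than collapsing it to a handful of statistics.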



References

  • Bänziger, T., & Scherer, K. R. (2005). The role of intonation in emotional expressions. Speech Communication, 46, 252–267.

  • Benesty, J., Sondhi, M. M., & Huang, Y. (Eds.) (2008). Springer handbook of speech processing. Berlin: Springer.

  • Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., & Weiss, B. (2005). A database of German emotional speech. In Interspeech.

  • Cahn, J. E. (1990). The generation of affect in synthesized speech. In JAVIOS (pp. 1–19), July 1990.

  • Cowie, R., & Cornelius, R. R. (2003). Describing the emotional states that are expressed in speech. Speech Communication, 40, 5–32.

  • Dellaert, F., Polzin, T., & Waibel, A. (1996). Recognizing emotion in speech. In 4th international conference on spoken language processing (pp. 1970–1973), Philadelphia, PA, USA, Oct. 1996.

  • Iida, A., Campbell, N., Higuchi, F., & Yasumura, M. (2003). A corpus-based speech synthesis system with emotion. Speech Communication, 40, 161–187.

  • Iliou, T., & Anagnostopoulos, C. N. (2009). Statistical evaluation of speech features for emotion recognition. In Fourth international conference on digital telecommunications (pp. 121–126), Colmar, France, July 2009.

  • Kao, Y.-H., & Lee, L.-S. (2006). Feature analysis for emotion recognition from Mandarin speech considering the special characteristics of Chinese language. In INTERSPEECH–ICSLP (pp. 1814–1817), Pittsburgh, Pennsylvania, Sept. 2006.

  • Koolagudi, S. G., & Rao, K. S. (2011). Two stage emotion recognition based on speaking rate. International Journal of Speech Technology, 14, 35–48.

  • Koolagudi, S. G., & Rao, K. S. (2012a). Emotion recognition from speech: a review. International Journal of Speech Technology, 15(2), 99–117.

  • Koolagudi, S. G., & Rao, K. S. (2012b). Emotion recognition from speech using source, system and prosodic features. International Journal of Speech Technology, 15(2), 265–289.

  • Koolagudi, S. G., & Rao, K. S. (2012c). Emotion recognition from speech using sub-syllabic and pitch synchronous spectral features. International Journal of Speech Technology. doi:10.1007/s10772-012-9150-8.

  • Koolagudi, S. G., Maity, S., Kumar, V. A., Chakrabarti, S., & Rao, K. S. (2009). IITKGP-SESC: speech database for emotion analysis. In Communications in computer and information science. Berlin: Springer, Aug. 2009.

  • Lee, C. M., & Narayanan, S. S. (2005). Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing, 13, 293–303.

  • Luengo, I., Navas, E., Hernáez, I., & Sánchez, J. (2005). Automatic emotion recognition using prosodic parameters. In INTERSPEECH (pp. 493–496), Lisbon, Portugal, Sept. 2005.

  • Lugger, M., & Yang, B. (2007). The relevance of voice quality features in speaker independent emotion recognition. In ICASSP (pp. IV17–IV20), Honolulu, Hawaii, USA, May 2007. New York: IEEE Press.

  • McGilloway, S., Cowie, R., Douglas-Cowie, E., Gielen, S., Westerdijk, M., & Stroeve, S. (2000). Approaching automatic recognition of emotion from voice: a rough benchmark. In ISCA workshop on speech and emotion, Belfast.

  • Murray, I. R., & Arnott, J. L. (1995). Implementation and testing of a system for producing emotion by rule in synthetic speech. Speech Communication, 16, 369–390.

  • Murray, I. R., Arnott, J. L., & Rohwer, E. A. (1996). Emotional stress in synthetic speech: progress and future directions. Speech Communication, 20, 85–91.

  • Murty, K. S. R., & Yegnanarayana, B. (2008). Epoch extraction from speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 16, 1602–1613.

  • Nwe, T. L., Foo, S. W., & Silva, L. C. D. (2003). Speech emotion recognition using hidden Markov models. Speech Communication, 41, 603–623.

  • Prasanna, S. R. M. (2004). Event-based analysis of speech. PhD thesis, Dept. of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, India, Mar. 2004.

  • Prasanna, S. R. M., & Zachariah, J. M. (2002). Detection of vowel onset point in speech. In Proc. IEEE int. conf. acoust., speech, signal processing, Orlando, Florida, USA, May 2002.

  • Prasanna, S. R. M., Reddy, B. V. S., & Krishnamoorthy, P. (2009). Vowel onset point detection using source, spectral peaks, and modulation spectrum energies. IEEE Transactions on Audio, Speech, and Language Processing, 17, 556–565.

  • Rao, K. S. (2005). Acquisition and incorporation of prosody knowledge for speech systems in Indian languages. PhD thesis, Dept. of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, India, May 2005.

  • Rao, K. S. (2011a). Application of prosody models for developing speech systems in Indian languages. International Journal of Speech Technology, 14, 19–33.

  • Rao, K. S. (2011b). Role of neural network models for developing speech systems. Sadhana, 36, 783–836.

  • Rao, K. S., & Koolagudi, S. G. (2011). Identification of Hindi dialects and emotions using spectral and prosodic features of speech. IJSCI: International Journal of Systemics, Cybernetics and Informatics, 9(4), 24–33.

  • Rao, K. S., & Yegnanarayana, B. (2006). Prosody modification using instants of significant excitation. IEEE Transactions on Speech and Audio Processing, 14, 972–980.

  • Rao, K. S., Prasanna, S. R. M., & Sagar, T. V. (2007). Emotion recognition using multilevel prosodic information. In Workshop on image and signal processing (WISP-2007), Guwahati, India, Dec. 2007. Guwahati: IIT Guwahati.

  • Rao, K. S., Reddy, R., Maity, S., & Koolagudi, S. G. (2010). Characterization of emotions using the dynamics of prosodic features. In International conference on speech prosody, Chicago, USA, May 2010.

  • Rao, K. S., Saroj, V. K., Maity, S., & Koolagudi, S. G. (2011). Recognition of emotions from video using neural network models. Expert Systems with Applications, 38, 13181–13185.

  • Scherer, K. R. (2003). Vocal communication of emotion: a review of research paradigms. Speech Communication, 40, 227–256.

  • Schröder, M. (2001). Emotional speech synthesis: a review. In Seventh European conference on speech communication and technology, Eurospeech, Aalborg, Denmark, Sept. 2001.

  • Schröder, M., & Cowie, R. (2006). Issues in emotion-oriented computing: toward a shared understanding. In Workshop on emotion and computing, HUMAINE.

  • Schuller, B. (2012). The computational paralinguistics challenge. IEEE Signal Processing Magazine, 29, 97–101.

  • Schuller, B., Batliner, A., Steidl, S., & Seppi, D. (2011). Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge. Speech Communication, 53, 1062–1087.

  • Ververidis, D., & Kotropoulos, C. (2006). A state of the art review on emotional speech databases. In Eleventh Australasian international conference on speech science and technology, Auckland, New Zealand, Dec. 2006.

  • Ververidis, D., Kotropoulos, C., & Pitas, I. (2004). Automatic emotional speech classification. In ICASSP (pp. I593–I596). New York: IEEE Press.

  • Vuppala, A. K., Yadav, J., Chakrabarti, S., & Rao, K. S. (2012). Vowel onset point detection for low bit rate coded speech. IEEE Transactions on Audio, Speech, and Language Processing, 20, 1894–1903.

  • Wang, Y., Du, S., & Zhan, Y. (2008). Adaptive and optimal classification of speech emotion recognition. In Fourth international conference on natural computation (pp. 407–411).

  • Werner, S., & Keller, E. (1994). Prosodic aspects of speech. In E. Keller (Ed.), Fundamentals of speech synthesis and speech recognition: basic concepts, state of the art, the future challenges (pp. 23–40). Chichester: Wiley.

  • Zhang, S. (2008). Emotion recognition in Chinese natural speech by combining prosody and voice quality features. In Sun et al. (Eds.), Advances in neural networks (pp. 457–464). Lecture notes in computer science. Berlin: Springer.

  • Zhu, A., & Luo, Q. (2007). Study on speech emotion recognition system in E-learning. In J. Jacko (Ed.), Human computer interaction, Part III, HCII (pp. 544–552). Lecture notes in computer science. Berlin: Springer.


Author information

Corresponding author

Correspondence to K. Sreenivasa Rao.



Cite this article

Rao, K.S., Koolagudi, S.G. & Vempada, R.R. Emotion recognition from speech using global and local prosodic features. Int J Speech Technol 16, 143–160 (2013). https://doi.org/10.1007/s10772-012-9172-2

