Abstract
In recent years, speech emotion recognition (SER) has attracted increasing attention in speech processing because of its potential in various speech-based intelligent systems. Deep learning algorithms typically require a large number of features to capture the discriminative characteristics of emotional audio samples, which increases the computational complexity of the network. This paper presents a three-layered sequential deep convolutional neural network (DCNN) based on the mel frequency log spectrogram (MFLS) for emotion recognition. The mel frequency log spectrogram confines the salient information of the emotional speech corpus, which is then classified by the two-dimensional DCNN. Experimental results on the Berlin Emo-DB dataset show that the proposed method achieves 95.68% and 96.07% accuracy for the speaker-dependent and speaker-independent approaches, respectively. The performance of the proposed method is compared with CNN and CNN-LSTM on the Berlin Emo-DB dataset and yields improved accuracy.
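The mel frequency log spectrogram front end described in the abstract can be sketched as follows. This is a minimal NumPy-only illustration of the standard MFLS computation (framing, windowing, power spectrum, triangular mel filterbank, log compression), not the authors' exact implementation; all frame sizes and filter counts below are assumed illustrative values, not parameters taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK-style mel scale conversion
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_log_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal and apply a Hamming window to each frame
    frames = [signal[s:s + n_fft] * np.hamming(n_fft)
              for s in range(0, len(signal) - n_fft + 1, hop)]
    frames = np.asarray(frames)
    # Power spectrum of each windowed frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank spanning 0 .. sr/2, spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Apply filterbank and log-compress the mel band energies
    mel_energy = power @ fbank.T
    return np.log(mel_energy + 1e-10)  # shape: (num_frames, n_mels)
```

The resulting (frames × mel bands) matrix is a 2-D time-frequency image, which is what makes a two-dimensional CNN a natural classifier for it.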
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Bhangale, K., Mohanaprasad, K. (2022). Speech Emotion Recognition Using Mel Frequency Log Spectrogram and Deep Convolutional Neural Network. In: Sivasubramanian, A., Shastry, P.N., Hong, P.C. (eds) Futuristic Communication and Network Technologies. VICFCNT 2020. Lecture Notes in Electrical Engineering, vol 792. Springer, Singapore. https://doi.org/10.1007/978-981-16-4625-6_24
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-4624-9
Online ISBN: 978-981-16-4625-6