Abstract
Traditional neural network-based speaker identification (SI) studies employ a combination of acoustic features extracted from sequential sounds to represent a speaker's voice biometrics: several sound segments before and after the current segment are stacked and fed to the network. Although this method is particularly important for speech recognition tasks, where words are constructed from sequential sound segments and successful recognition depends on the preceding phonetic sequence, SI systems should be able to operate on the distinctive speaker features available in an individual sound segment and identify the speaker regardless of the previously uttered sounds. This paper investigates this hypothesis by proposing a novel text-independent SI model trained at the sound level. The investigation proceeded by first studying the most distinguishable configuration of coefficients in a single acoustic segment, then identifying the best frame-length-to-overlap ratio, and finally measuring the reliability of conducting SI using only a single sound segment. Overall, more than one hundred SI systems were trained and evaluated. The results indicate that performing SI on a single acoustic frame reduces the complexity of SI and facilitates it, since the classifier needs to learn fewer acoustic features compared to traditional stacking-based approaches.
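To make the contrast concrete, the sketch below (not the paper's implementation; the 25 ms frame length, 50% overlap, and context size of 5 are illustrative assumptions) frames a signal into overlapping segments and then builds the traditional stacked input, showing how many more features per example the stacked approach forces the classifier to learn:

```python
import numpy as np

def frame_signal(signal, frame_len, overlap):
    """Split a 1-D signal into overlapping frames.
    frame_len: samples per frame; overlap: fraction in [0, 1)."""
    step = int(frame_len * (1.0 - overlap))
    n_frames = 1 + (len(signal) - frame_len) // step
    return np.stack([signal[i * step : i * step + frame_len]
                     for i in range(n_frames)])

def stack_context(frames, context):
    """Traditional stacked input: concatenate `context` frames before
    and after each frame, giving (2*context + 1) frames per example."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[i : i + 2 * context + 1].ravel()
                     for i in range(len(frames))])

# 1 second of audio at 16 kHz; 25 ms frames (400 samples), 50% overlap.
sig = np.random.default_rng(0).standard_normal(16000)
frames = frame_signal(sig, frame_len=400, overlap=0.5)

single = frames                               # single-frame input
stacked = stack_context(frames, context=5)    # stacked-context input

print(single.shape[1], stacked.shape[1])      # 400 vs. 4400 features per example
```

With a context of 5 frames on each side, every training example in the stacked approach is eleven times larger than a single frame, which is the feature-count reduction the abstract refers to.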
Shahamiri, S.R., Thabtah, F. An investigation towards speaker identification using a single-sound-frame. Multimed Tools Appl 79, 31265–31281 (2020). https://doi.org/10.1007/s11042-020-09580-4