Abstract
In this paper, we propose a new segmentation method for diarization applications. In the proposed method, segmentation is performed by a discriminatively trained support vector regression model, while a generative classifier helps it estimate the probable change points. Since no pre-labeled training samples are available in the segmentation task, the proposed model-based segmentation method is designed to bridge this gap. The initial samples are assumed to belong to the first speaker and are labeled in an unsupervised manner, while the subsequent training samples are chosen by applying the help-training approach. Samples are judged conducive to training when both the regression block and the classifier block rate them as advantageous positive or negative samples. These samples are refined in later steps, and the speakers’ models are updated iteratively. In addition, a new procedure, executed once segmentation is complete, is introduced to estimate deleted and inserted change points. Compared with similar approaches, experiments show an improvement of about 29% in diarization error rate.
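The agreement rule behind help-training sample selection can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_confident_samples`, the two margins, and the sign convention (positive meaning "current speaker") are all assumptions made for illustration, since the abstract does not state the actual selection criteria.

```python
from typing import List, Tuple


def select_confident_samples(
    reg_scores: List[float],
    gen_llrs: List[float],
    reg_margin: float = 0.5,   # hypothetical confidence margin for the regressor
    llr_margin: float = 1.0,   # hypothetical margin for the generative model
) -> List[Tuple[int, int]]:
    """Return (index, label) pairs for samples both models label confidently.

    reg_scores: regression outputs; positive is assumed to mean "current speaker".
    gen_llrs:   generative log-likelihood ratios with the same sign convention.
    The label is +1 or -1. Samples where the two models disagree, or where
    either score lies inside its margin, are skipped.
    """
    if len(reg_scores) != len(gen_llrs):
        raise ValueError("score lists must have equal length")
    selected = []
    for i, (r, g) in enumerate(zip(reg_scores, gen_llrs)):
        if r >= reg_margin and g >= llr_margin:
            selected.append((i, +1))   # both models vote positive
        elif r <= -reg_margin and g <= -llr_margin:
            selected.append((i, -1))   # both models vote negative
    return selected


# Toy scores for five samples (made-up values, not experimental data).
reg = [0.9, 0.8, -0.7, 0.1, -0.9]
llr = [2.0, -1.5, -3.0, 4.0, -0.2]
print(select_confident_samples(reg, llr))
```

In an iterative scheme of the kind the abstract outlines, the selected pairs would be added to the training set before both models are retrained. That loop is omitted here because its details are not given in the abstract.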
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Teimoori, F., Razzazi, F. Unsupervised help-trained LS-SVR-based segmentation in speaker diarization system. Multimed Tools Appl 78, 11743–11777 (2019). https://doi.org/10.1007/s11042-018-6621-1
DOI: https://doi.org/10.1007/s11042-018-6621-1