Unsupervised help-trained LS-SVR-based segmentation in speaker diarization system

Teimoori, Farshad; Razzazi, Farbod

doi:10.1007/s11042-018-6621-1

Published: 04 October 2018

Volume 78, pages 11743–11777 (2019)

Multimedia Tools and Applications

Abstract

In this paper, we propose a new segmentation method for diarization applications. In the proposed method, segmentation is performed using a discriminatively trained support vector regression, while a generative classifier helps it to estimate the probable change points. Since there are no pre-labeled training samples in the segmentation task, the proposed model-based segmentation method offers a way to bridge this gap. The initial samples are assumed to belong to the first speaker and are labeled accordingly in an unsupervised manner, while the subsequent training samples are chosen by applying the help-training approach. A candidate sample is considered useful when both the regression block and the classifier block judge it to be an advantageous positive or negative sample. The selected samples are purified in subsequent steps, and the speakers' models are updated iteratively. In addition, a new procedure, executed once segmentation is complete, is introduced to estimate deleted and inserted change points. In comparison to similar approaches, experiments show an improvement of about 29% in diarization error rate.
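
The selection rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only assumes the standard least-squares SVR formulation (training reduces to solving one linear system) and uses a diagonal Gaussian log-likelihood ratio as a stand-in for the generative classifier. All names (`LSSVR`, `gaussian_llr`, `help_train_step`), the labels +1/-1 for "current speaker" versus "other", and the values of `gamma`, `sigma` and `margin` are hypothetical choices made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))


class LSSVR:
    """Least-squares SVR: training reduces to one linear system."""

    def __init__(self, gamma=10.0, sigma=1.0):
        self.gamma = gamma  # regularization weight (assumed value)
        self.sigma = sigma  # RBF kernel width (assumed value)

    def fit(self, x, y):
        n = len(x)
        # Block system [[0, 1^T], [1, K + I/gamma]] [bias; alpha] = [0; y]
        system = np.zeros((n + 1, n + 1))
        system[0, 1:] = 1.0
        system[1:, 0] = 1.0
        system[1:, 1:] = rbf_kernel(x, x, self.sigma) + np.eye(n) / self.gamma
        solution = np.linalg.solve(system, np.concatenate(([0.0], y)))
        self.bias, self.alpha, self.support = solution[0], solution[1:], x
        return self

    def predict(self, x):
        return rbf_kernel(x, self.support, self.sigma) @ self.alpha + self.bias


def gaussian_llr(x, pos, neg):
    """Log-likelihood ratio of diagonal Gaussians fitted to pos and neg."""
    def loglik(data):
        mean, var = data.mean(axis=0), data.var(axis=0) + 1e-3
        return -0.5 * ((x - mean) ** 2 / var + np.log(2 * np.pi * var)).sum(axis=1)
    return loglik(pos) - loglik(neg)


def help_train_step(x_lab, y_lab, x_unl, margin=0.5):
    """One iteration: keep unlabeled samples on which both models agree."""
    scores = LSSVR().fit(x_lab, y_lab).predict(x_unl)
    llr = gaussian_llr(x_unl, x_lab[y_lab > 0], x_lab[y_lab < 0])
    # A sample is accepted only if the regressor is confident (assumed
    # margin) and the generative helper assigns it the same sign.
    picked = (np.sign(scores) == np.sign(llr)) & (np.abs(scores) > margin)
    new_x = np.vstack([x_lab, x_unl[picked]])
    new_y = np.concatenate([y_lab, np.sign(scores[picked])])
    return new_x, new_y, picked
```

Calling `help_train_step` repeatedly on the remaining unlabeled frames, and refitting on the enlarged set each time, gives an iterative update loosely in the spirit of the abstract. The purification step and the recovery of deleted and inserted change points are not covered by this sketch.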

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
Farshad Teimoori & Farbod Razzazi

Corresponding author

Correspondence to Farbod Razzazi.

About this article

Cite this article

Teimoori, F., Razzazi, F. Unsupervised help-trained LS-SVR-based segmentation in speaker diarization system. Multimed Tools Appl 78, 11743–11777 (2019). https://doi.org/10.1007/s11042-018-6621-1

Received: 25 November 2017
Revised: 08 May 2018
Accepted: 27 August 2018
Published: 04 October 2018
Issue Date: May 2019
DOI: https://doi.org/10.1007/s11042-018-6621-1
