Unsupervised help-trained LS-SVR-based segmentation in speaker diarization system

Multimedia Tools and Applications

Abstract

In this paper, we propose a new segmentation method for diarization applications. In the proposed method, segmentation is performed by a discriminatively trained support vector regression, while a generative classifier helps it estimate the probable change points. Since there are no pre-labeled training samples in the segmentation task, the proposed model-based segmentation method is designed to bridge this gap. The initial samples are assumed to belong to the first speaker and are labeled in an unsupervised manner, while subsequent training samples are chosen by the help-training approach. A sample is considered useful for training when both the regression and the classifier blocks agree on its positive or negative label. The selected samples are purified in later steps, and the speakers’ models are updated iteratively. In addition, a new procedure, executed once segmentation is complete, is introduced to estimate deleted and inserted change points. In comparison to similar approaches, experiments show an improvement of about 29% in diarization error rate.
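
The agreement rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the RBF kernel, the diagonal-Gaussian generative classifier, the `margin` threshold, the `gamma`, `sigma`, and `var_floor` values, and all function names are assumptions chosen for illustration. The regression part uses the standard least-squares SVM linear system, and an unlabeled sample is accepted only when the regression score and the generative log-likelihood ratio assign it the same sign.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # Pairwise RBF kernel between the rows of a and the rows of b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_ls_svr(x, y, gamma=10.0, sigma=1.0):
    # Solve the LS-SVM system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y].
    n = len(x)
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = 1.0
    system[1:, 0] = 1.0
    system[1:, 1:] = rbf_kernel(x, x, sigma) + np.eye(n) / gamma
    solution = np.linalg.solve(system, np.concatenate(([0.0], y)))
    return solution[0], solution[1:]  # bias, alpha

def predict_ls_svr(x_train, bias, alpha, x_new, sigma=1.0):
    # Regression output f(x) = sum_i alpha_i k(x, x_i) + b.
    return rbf_kernel(x_new, x_train, sigma) @ alpha + bias

def gaussian_log_lik(x, mean, var):
    # Log-likelihood under a diagonal-covariance Gaussian (assumed model).
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var).sum(axis=1)

def select_agreed_samples(x_lab, y_lab, x_unl, margin=0.5, var_floor=1e-3):
    # Discriminative view: sign of the LS-SVR score.
    bias, alpha = fit_ls_svr(x_lab, y_lab)
    score = predict_ls_svr(x_lab, bias, alpha, x_unl)
    reg_label = np.sign(score)
    # Generative view: sign of the log-likelihood ratio between two Gaussians.
    pos, neg = x_lab[y_lab > 0], x_lab[y_lab < 0]
    llr = (gaussian_log_lik(x_unl, pos.mean(axis=0), pos.var(axis=0) + var_floor)
           - gaussian_log_lik(x_unl, neg.mean(axis=0), neg.var(axis=0) + var_floor))
    gen_label = np.sign(llr)
    # Keep a sample only if both views agree and the regression is confident.
    keep = (reg_label == gen_label) & (np.abs(score) >= margin)
    return keep, reg_label
```

In an iterative scheme of this kind, the accepted samples would be appended to the labeled set and both models refit; the purification step and the post-processing of deleted and inserted change points mentioned in the abstract are not covered by this sketch.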

Author information

Corresponding author

Correspondence to Farbod Razzazi.

Cite this article

Teimoori, F., Razzazi, F. Unsupervised help-trained LS-SVR-based segmentation in speaker diarization system. Multimed Tools Appl 78, 11743–11777 (2019). https://doi.org/10.1007/s11042-018-6621-1

  • DOI: https://doi.org/10.1007/s11042-018-6621-1