No Need to Scream: Robust Sound-Based Speaker Localisation in Challenging Scenarios

  • Conference paper
  • Social Robotics (ICSR 2019)

Abstract

This paper addresses speaker verification and horizontal localisation in the presence of conspicuous noise. Specifically, we are interested in enabling a mobile robot to robustly and accurately detect the presence of a target speaker and estimate their position in challenging acoustic scenarios. While several solutions to both tasks have been proposed in the literature, little attention has been devoted to developing systems able to function in harsh noisy conditions. To address these shortcomings, in this work we follow a purely data-driven approach based on deep learning architectures which, by requiring no knowledge of either the nature of the masking noise or the structure and acoustics of the operating environment, can operate reliably in previously unexplored acoustic scenes. Our experimental evaluation, relying on data collected in real environments with a robotic platform, demonstrates that our framework achieves high performance on both the verification and localisation tasks, despite the presence of copious noise.
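The full paper describes the actual network design; purely as an illustration of the kind of data-driven pipeline the abstract outlines, the minimal TensorFlow/Keras sketch below builds a small convolutional network that maps a two-channel time-frequency representation (e.g. a binaural gammatone or log-mel spectrogram) of a short audio frame to both a target-speaker verification score and a discretised horizontal (azimuth) angle. All layer sizes, names, and the 36-bin azimuth discretisation are illustrative assumptions, not the authors' architecture.

```python
from tensorflow.keras import layers, Model

def build_model(n_freq=64, n_frames=128, n_azimuth_bins=36):
    # Two input channels: a stereo/binaural time-frequency pair, since
    # horizontal localisation relies on inter-channel cues.
    # (All shapes and sizes here are illustrative assumptions.)
    inp = layers.Input(shape=(n_freq, n_frames, 2))
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.5)(x)  # regularisation against co-adaptation

    # Head 1: probability that the target speaker is present.
    verif = layers.Dense(1, activation="sigmoid", name="verification")(x)
    # Head 2: posterior over discretised horizontal (azimuth) angles.
    azim = layers.Dense(n_azimuth_bins, activation="softmax", name="azimuth")(x)
    return Model(inp, [verif, azim])

model = build_model()
model.compile(
    optimizer="adam",
    loss={"verification": "binary_crossentropy",
          "azimuth": "sparse_categorical_crossentropy"},
)
model.summary()
```

Framing localisation as classification over azimuth bins, rather than regressing a continuous angle, is a common choice in DNN-based sound localisation because it lets the network express multimodal uncertainty over candidate directions.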



Author information

Correspondence to Letizia Marchegiani.



Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Tse, T.H.E., De Martini, D., Marchegiani, L. (2019). No Need to Scream: Robust Sound-Based Speaker Localisation in Challenging Scenarios. In: Salichs, M., et al. Social Robotics. ICSR 2019. Lecture Notes in Computer Science, vol 11876. Springer, Cham. https://doi.org/10.1007/978-3-030-35888-4_17

  • DOI: https://doi.org/10.1007/978-3-030-35888-4_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-35887-7

  • Online ISBN: 978-3-030-35888-4

  • eBook Packages: Computer Science, Computer Science (R0)
