Abstract
This paper addresses speaker verification and horizontal localisation in the presence of conspicuous noise. Specifically, we are interested in enabling a mobile robot to robustly and accurately detect the presence of a target speaker and to estimate their position in challenging acoustic scenarios. While several solutions to both tasks have been proposed in the literature, little attention has been devoted to systems able to function in harsh noisy conditions. To address this shortcoming, we follow a purely data-driven approach based on deep learning architectures which, by requiring no prior knowledge of either the nature of the masking noise or the structure and acoustics of the operating environment, can act reliably in previously unexplored acoustic scenes. Our experimental evaluation, relying on data collected in real environments with a robotic platform, demonstrates that our framework achieves high performance in both the verification and localisation tasks, despite the presence of copious noise.
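As a rough illustration of the kind of data-driven model the abstract describes, the sketch below builds a small two-headed convolutional network in TensorFlow (a framework the paper's tooling references), with one head scoring target-speaker presence and one classifying horizontal direction. This is a minimal sketch under our own assumptions: the input shape, layer sizes, and azimuth discretisation (N_FREQ, N_FRAMES, N_AZIMUTH_CLASSES, etc.) are illustrative placeholders, not details taken from the paper.

```python
# Illustrative sketch only: a shared convolutional trunk over a
# spectrogram-like, two-channel (stereo) time-frequency input, with
# two task heads mirroring the verification and localisation tasks.
import tensorflow as tf

N_FREQ, N_FRAMES, N_CHANNELS = 64, 100, 2   # assumed input dimensions
N_AZIMUTH_CLASSES = 37                      # assumed horizontal-plane discretisation

inputs = tf.keras.Input(shape=(N_FREQ, N_FRAMES, N_CHANNELS))
x = tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
x = tf.keras.layers.MaxPooling2D(2)(x)
x = tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.5)(x)

# Two heads share the convolutional trunk: a binary target-speaker
# verification score and a categorical azimuth estimate.
verification = tf.keras.layers.Dense(1, activation="sigmoid", name="speaker")(x)
azimuth = tf.keras.layers.Dense(N_AZIMUTH_CLASSES, activation="softmax", name="azimuth")(x)

model = tf.keras.Model(inputs, [verification, azimuth])
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss={"speaker": "binary_crossentropy", "azimuth": "categorical_crossentropy"},
)
model.summary()
```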
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Tse, T.H.E., De Martini, D., Marchegiani, L. (2019). No Need to Scream: Robust Sound-Based Speaker Localisation in Challenging Scenarios. In: Salichs, M., et al. (eds.) Social Robotics. ICSR 2019. Lecture Notes in Computer Science, vol 11876. Springer, Cham. https://doi.org/10.1007/978-3-030-35888-4_17
DOI: https://doi.org/10.1007/978-3-030-35888-4_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-35887-7
Online ISBN: 978-3-030-35888-4
eBook Packages: Computer Science, Computer Science (R0)