Abstract
This paper addresses speaker verification and horizontal localisation in the presence of conspicuous noise. Specifically, we are interested in enabling a mobile robot to robustly and accurately detect the presence of a target speaker and to estimate their position in challenging acoustic scenarios. While several solutions to both tasks have been proposed in the literature, little attention has been devoted to systems able to function in harsh noisy conditions. To address this shortcoming, we follow a purely data-driven approach based on deep learning architectures which, by requiring no prior knowledge of either the nature of the masking noise or the structure and acoustics of the operating environment, can act reliably in previously unexplored acoustic scenes. Our experimental evaluation, relying on data collected in real environments with a robotic platform, demonstrates that our framework achieves high performance in both the verification and localisation tasks, despite the presence of copious noise.
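As a rough illustration of the kind of data-driven model the abstract describes, the sketch below builds a small two-headed convolutional network in TensorFlow (a framework the paper's tooling references), with one head scoring target-speaker presence and one classifying horizontal direction. This is a minimal sketch under our own assumptions: the input shape, layer sizes, and azimuth discretisation (N_FREQ, N_FRAMES, N_AZIMUTH_CLASSES, etc.) are illustrative placeholders, not details taken from the paper.

```python
# Illustrative sketch only: a shared convolutional trunk over a
# spectrogram-like, two-channel (stereo) time-frequency input, with
# two task heads mirroring the verification and localisation tasks.
import tensorflow as tf

N_FREQ, N_FRAMES, N_CHANNELS = 64, 100, 2   # assumed input dimensions
N_AZIMUTH_CLASSES = 37                      # assumed horizontal-plane discretisation

inputs = tf.keras.Input(shape=(N_FREQ, N_FRAMES, N_CHANNELS))
x = tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
x = tf.keras.layers.MaxPooling2D(2)(x)
x = tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.5)(x)

# Two heads share the convolutional trunk: a binary target-speaker
# verification score and a categorical azimuth estimate.
verification = tf.keras.layers.Dense(1, activation="sigmoid", name="speaker")(x)
azimuth = tf.keras.layers.Dense(N_AZIMUTH_CLASSES, activation="softmax", name="azimuth")(x)

model = tf.keras.Model(inputs, [verification, azimuth])
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss={"speaker": "binary_crossentropy", "azimuth": "categorical_crossentropy"},
)
model.summary()
```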
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Tse, T.H.E., De Martini, D., Marchegiani, L. (2019). No Need to Scream: Robust Sound-Based Speaker Localisation in Challenging Scenarios. In: Salichs, M., et al. (eds.) Social Robotics. ICSR 2019. Lecture Notes in Computer Science, vol 11876. Springer, Cham. https://doi.org/10.1007/978-3-030-35888-4_17
DOI: https://doi.org/10.1007/978-3-030-35888-4_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-35887-7
Online ISBN: 978-3-030-35888-4
eBook Packages: Computer Science, Computer Science (R0)