Abstract
In sound source localization that uses both visual and aural information, it remains unclear how much each modality contributes to the result, i.e., do we actually need both images and sound for sound source localization? To address this question, we develop an unsupervised learning system that solves sound source localization by decomposing the task into two steps: (i) “potential sound source localization”, a step that localizes possible sound sources using only visual information; and (ii) “object selection”, a step that identifies which objects are actually producing sound using aural information. Our overall system achieves state-of-the-art performance in sound source localization, and, more importantly, we find that despite its constraint on available information, step (i) alone achieves comparable performance. From this observation and further experiments, we show that visual information is dominant in “sound” source localization when evaluated with the currently adopted benchmark dataset. Moreover, we show that the majority of sound-producing objects in this dataset can be identified using visual information alone, and thus that the dataset is inadequate for evaluating a system’s capability to leverage aural information. As an alternative, we present an evaluation protocol that enforces the use of both visual and aural information, and verify this property through several experiments.
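To make the two-step decomposition concrete, the sketch below shows one plausible way to wire it up. This is a minimal PyTorch illustration under our own assumptions, not the authors’ implementation: the module names (`PotentialSourceLocalizer`, `SoundingObjectSelector`), the toy backbone, and the similarity-based selection are hypothetical stand-ins for the unsupervised system described in the paper.

```python
# Hypothetical sketch of the two-step decomposition: a visual-only step
# that proposes candidate sound sources, and an audio-conditioned step
# that selects which candidates are actually sounding. Names and
# architecture choices are illustrative, not the paper's.
import torch
import torch.nn as nn


class PotentialSourceLocalizer(nn.Module):
    """Step (i): propose candidate sound sources from the image alone."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Any image backbone producing a spatial feature map would do here.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
        )
        self.heatmap_head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, image: torch.Tensor):
        feats = self.backbone(image)                        # (B, dim, H', W')
        heatmap = torch.sigmoid(self.heatmap_head(feats))   # candidate map
        return feats, heatmap


class SoundingObjectSelector(nn.Module):
    """Step (ii): use audio to decide which candidates actually sound."""

    def __init__(self, dim: int = 512, audio_dim: int = 128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, dim)

    def forward(self, feats, heatmap, audio_embedding):
        a = self.audio_proj(audio_embedding)                # (B, dim)
        # Cosine similarity between the audio embedding and each location.
        sim = torch.einsum("bchw,bc->bhw", feats, a)
        sim = sim / (feats.norm(dim=1) * a.norm(dim=1)[:, None, None] + 1e-8)
        # Keep only the candidates proposed by the visual-only step.
        return sim.unsqueeze(1) * heatmap


# Usage with dummy inputs:
img = torch.randn(2, 3, 224, 224)
aud = torch.randn(2, 128)
feats, hm = PotentialSourceLocalizer()(img)
sounding_map = SoundingObjectSelector()(feats, hm, aud)     # (2, 1, 56, 56)
```

Separating the steps this way is what allows step (i) to be evaluated in isolation, which is how the paper observes that the visual-only candidate map already performs comparably on the standard benchmark.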
Acknowledgements
This research was supported by JST ACCEL (JPMJAC1602), the JST-Mirai Program (JPMJMI19B2), and JSPS KAKENHI (JP17H06101, JP19H01129, and JP19H04137).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Oya, T., Iwase, S., Natsume, R., Itazuri, T., Yamaguchi, S., Morishima, S. (2021). Do We Need Sound for Sound Source Localization? In: Ishikawa, H., Liu, C.L., Pajdla, T., Shi, J. (eds) Computer Vision – ACCV 2020. ACCV 2020. Lecture Notes in Computer Science, vol 12627. Springer, Cham. https://doi.org/10.1007/978-3-030-69544-6_8
DOI: https://doi.org/10.1007/978-3-030-69544-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-69543-9
Online ISBN: 978-3-030-69544-6
eBook Packages: Computer Science, Computer Science (R0)