Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning

International Journal of Computer Vision

Abstract

The sound of crashing waves, the roar of fast-moving cars—sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. This paper extends an earlier conference paper, Owens et al. (in: European conference on computer vision, 2016b), with additional experiments and discussion.


Notes

  1. For conciseness, we sometimes call these “sound-making” objects, even if they are not literally the source of the sound.

  2. As a result, this model has a larger pool5 layer than the other methods: \(7 \times 7\) versus \(6 \times 6\). Likewise, the fc6 layer of Wang and Gupta (2015) is smaller (1024 vs. 4096 dims.).

References

  • Agrawal, P., Carreira, J., & Malik, J. (2015). Learning to see by moving. In IEEE international conference on computer vision.

  • Andrew, G., Arora, R., Bilmes, J. A., & Livescu, K. (2013). Deep canonical correlation analysis. In International conference on machine learning.

  • Arandjelović, R., & Zisserman, A. (2017). Look, listen and learn. In IEEE international conference on computer vision.

  • Aytar, Y., Vondrick, C., & Torralba, A. (2016). Soundnet: Learning sound representations from unlabeled video. In Advances in neural information processing systems.

  • Bau, D., Zhou, B., Khosla, A., Oliva, A., & Torralba, A. (2017). Network dissection: Quantifying interpretability of deep visual representations. In IEEE conference on computer vision and pattern recognition.

  • de Sa, V. R. (1994a). Learning classification with unlabeled data. In Advances in neural information processing systems.

  • de Sa, V. R. (1994b). Minimizing disagreement for self-supervised classification. In Proceedings of the 1993 Connectionist Models Summer School. Psychology Press.

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition.

  • Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. In IEEE international conference on computer vision.

  • Doersch, C., & Zisserman, A. (2017). Multi-task self-supervised visual learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2051–2060).

  • Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., & Brox, T. (2014). Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems.

  • Ellis, D. P., Zeng, X., & McDermott, J. H. (2011). Classifying soundtracks with audio texture features. In IEEE international conference on acoustics, speech, and signal processing.

  • Eronen, A. J., Peltonen, V. T., Tuomi, J. T., Klapuri, A. P., Fagerlund, S., Sorsa, T., et al. (2006). Audio-based context recognition. IEEE/ACM Transactions on Audio Speech and Language Processing, 14(1), 321–329.

  • Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Fisher III, J. W., Darrell, T., Freeman, W. T., & Viola, P. A. (2000). Learning joint statistical models for audio–visual fusion and segregation. In Advances in neural information processing systems.

  • Gaver, W. W. (1993). What in the world do we hear?: An ecological approach to auditory event perception. Ecological Psychology, 5(1), 1–29.

  • Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., & Ritter, M. (2017). Audio Set: An ontology and human-labeled dataset for audio events. In IEEE international conference on acoustics, speech, and signal processing.

  • Girshick, R. (2015). Fast R-CNN. In IEEE international conference on computer vision.

  • Goroshin, R., Bruna, J., Tompson, J., Eigen, D., & LeCun, Y. (2015). Unsupervised feature learning from temporal data. arXiv preprint arXiv:1504.02518.

  • Gupta, S., Hoffman, J., & Malik, J. (2016). Cross modal distillation for supervision transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Hershey, J. R., & Movellan, J. R. (1999). Audio vision: Using audio–visual synchrony to locate sounds. In Advances in neural information processing systems.

  • Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. In ACM symposium on theory of computing.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning.

  • Isola, P. (2015). The discovery of perceptual structure from visual co-occurrences in space and time. PhD thesis, Massachusetts Institute of Technology.

  • Isola, P., Zoran, D., Krishnan, D., & Adelson, E.H. (2016). Learning visual groups from co-occurrences in space and time. In International conference on learning representations, Workshop.

  • Jayaraman, D., & Grauman, K. (2015). Learning image representations tied to ego-motion. In IEEE international conference on computer vision.

  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM multimedia conference.

  • Kidron, E., Schechner, Y. Y., & Elad, M. (2005). Pixels that sound. In IEEE conference on computer vision and pattern recognition.

  • Krähenbühl, P., Doersch, C., Donahue, J., & Darrell, T. (2016). Data-dependent initializations of convolutional neural networks. In International conference on learning representations.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems.

  • Le, Q. V., Ranzato, M. A., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., & Ng, A. Y. (2012). Building high-level features using large scale unsupervised learning. In International conference on machine learning.

  • Lee, K., Ellis, D. P., & Loui, A. C. (2010). Detecting local semantic concepts in environmental sounds using markov model based clustering. In IEEE international conference on acoustics, speech, and signal processing.

  • Leung, T., & Malik, J. (2001). Representing and recognizing the visual appearance of materials using three-dimensional textons. International Journal of Computer Vision, 43(1), 29–44.

  • Lin, M., Chen, Q., & Yan, S. (2014). Network in network. In International conference on learning representations.

  • McDermott, J. H., & Simoncelli, E. P. (2011). Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis. Neuron, 71(5), 926–940.

  • Mishkin, D., & Matas, J. (2015). All you need is a good init. arXiv preprint arXiv:1511.06422.

  • Mobahi, H., Collobert, R., & Weston, J. (2009). Deep learning from temporal coherence in video. In International conference on machine learning.

  • Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. In International Conference on Machine Learning.

  • Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2015). Is object localization for free? Weakly-supervised learning with convolutional neural networks. In IEEE conference on computer vision and pattern recognition.

  • Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., & Freeman, W. T. (2016a). Visually indicated sounds. In IEEE conference on computer vision and pattern recognition.

  • Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (2016b). Ambient sound provides supervision for visual learning. In European conference on computer vision.

  • Pathak, D., Girshick, R., Dollár, P., Darrell, T., & Hariharan, B. (2017). Learning features by watching objects move. In IEEE conference on computer vision and pattern recognition.

  • Salakhutdinov, R., & Hinton, G. (2009). Semantic hashing. International Journal of Approximate Reasoning, 50(7), 969–978.

  • Slaney, M., & Covell, M. (2000). Facesync: A linear operator for measuring synchronization of video facial images and audio tracks. In Advances in neural information processing systems.

  • Smith, L., & Gasser, M. (2005). The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1–2), 13–29.

  • Srivastava, N., & Salakhutdinov, R. R. (2012). Multimodal learning with deep boltzmann machines. In Advances in neural information processing systems.

  • Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., & Li, L. J. (2015). The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817.

  • Wang, X., & Gupta, A. (2015). Unsupervised learning of visual representations using videos. In IEEE international conference on computer vision.

  • Weiss, Y., Torralba, A., & Fergus, R. (2009). Spectral hashing. In Advances in neural information processing systems.

  • Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. In IEEE conference on computer vision and pattern recognition.

  • Zhang, R., Isola, P., & Efros, A. A. (2016). Colorful image colorization. In European conference on computer vision (pp. 649–666). Springer.

  • Zhang, R., Isola, P., & Efros, A. A. (2017). Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In IEEE conference on computer vision and pattern recognition.

  • Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In Advances in neural information processing systems.

  • Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2015). Object detectors emerge in deep scene cnns. In International conference on learning representations.

  • Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In The IEEE conference on computer vision and pattern recognition (CVPR).

Acknowledgements

This work was supported by NSF Grant #1524817 to A.T.; NSF Grants #1447476 and #1212849 to W.F.; a McDonnell Scholar Award to J.H.M.; and a Microsoft Ph.D. Fellowship to A.O. It was also supported by Shell Research and by a donation of GPUs from NVIDIA. We thank Phillip Isola for helpful discussions, and Carl Vondrick for sharing the data that we used in our experiments. We also thank the anonymous reviewers for their comments, which significantly improved the paper (in particular, for suggesting the comparison with texton features in Sect. 5).

Author information

Corresponding author

Correspondence to Andrew Owens.

Additional information

Communicated by Edwin Hancock, Richard Wilson, Will Smith, Adrian Bors and Nick Pears.

Appendix A: Sound Textures

We now describe in more detail how we computed sound textures from audio clips. For this, we closely follow the work of McDermott and Simoncelli (2011).

Subband envelopes. To compute the cochleagram features \(\{c_i\}\), we filter the input waveform s with a bank of bandpass filters \(\{f_i\}\):

$$c_i(t) = \left| (s * f_i) + j\,H(s * f_i) \right|, \qquad (1)$$

where H is the Hilbert transform and \(*\) denotes convolution. We then resample the signal to 400 Hz and compress it by raising each sample to the 0.3 power (examples in Fig. 2).
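
As an illustration, the following is a minimal NumPy/SciPy sketch of this step, not the authors' released code. It assumes a precomputed bank of cochlear bandpass filter impulse responses (the filter design itself is omitted), and the function and variable names are ours.

import numpy as np
from scipy.signal import fftconvolve, hilbert, resample_poly

def subband_envelopes(waveform, filter_bank, sr, env_sr=400, compression=0.3):
    """Compressed cochleagram channels c_i, sampled at env_sr Hz (Eq. 1)."""
    envelopes = []
    for f in filter_bank:
        subband = fftconvolve(waveform, f, mode="same")   # s * f_i
        envelopes.append(np.abs(hilbert(subband)))        # envelope = |analytic signal|
    c = np.stack(envelopes)                               # shape: (channels, time)
    c = resample_poly(c, env_sr, sr, axis=1)              # resample envelopes to 400 Hz
    return np.clip(c, 0.0, None) ** compression           # 0.3-power compression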

Correlations. As described in Sect. 3, we compute the correlation between bands using a subset of the entries in the cochlear-channel correlation matrix. Specifically, we include the correlation between channels \(c_j\) and \(c_k\) if \(|j - k| \in \{1, 2, 3, 5\}\). The result is a vector \(\rho \) of correlation values.
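
A sketch of this step under the same assumptions (names are ours): take the full channel correlation matrix and keep the entries whose channel-index offset lies in {1, 2, 3, 5}.

import numpy as np

def band_correlations(c, offsets=(1, 2, 3, 5)):
    """Vector rho of correlations between channels c_j, c_k with |j - k| in offsets."""
    corr = np.corrcoef(c)                                 # channel-by-channel correlation matrix
    n = c.shape[0]
    rho = [corr[j, k]
           for j in range(n)
           for k in range(j + 1, n)
           if (k - j) in offsets]
    return np.asarray(rho)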

Modulation filters. We also include modulation filter responses. To get these, we compute each band’s response to a filter bank \(\{m_i\}\) of 10 bandpass filters whose center frequencies are spaced logarithmically from 0.5 to 200 Hz:

$$b_{ij} = \frac{1}{N}\,\Vert c_i * m_j \Vert^2, \qquad (2)$$

where N is the length of the signal.
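
A corresponding sketch of Eq. (2), assuming the 10-filter modulation bank is supplied as impulse responses at the 400 Hz envelope rate (again, the helper name is ours):

import numpy as np
from scipy.signal import fftconvolve

def modulation_power(c, mod_bank):
    """b[i, j] = ||c_i * m_j||^2 / N for cochlear channel i and modulation filter j."""
    n_samples = c.shape[1]                                # N, the length of the signal
    b = np.empty((c.shape[0], len(mod_bank)))
    for i, channel in enumerate(c):
        for j, m in enumerate(mod_bank):
            filtered = fftconvolve(channel, m, mode="same")
            b[i, j] = np.sum(filtered ** 2) / n_samples
    return b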

Marginal statistics. We estimate marginal moments of the cochleagram features, computing the mean \(\mu _i\) and standard deviation \(\sigma _i\) of each channel. We also estimate the loudness, l, of the sequence by taking the median of the energy at each timestep, i.e. \(l = \text{median}(||c(t)||)\).
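
These statistics reduce to a few lines of NumPy; the sketch below is ours, not part of the paper:

import numpy as np

def marginal_stats(c):
    """Per-channel mean and standard deviation, plus median-energy loudness l."""
    mu = c.mean(axis=1)
    sigma = c.std(axis=1)
    loudness = np.median(np.linalg.norm(c, axis=0))       # median over time of ||c(t)||
    return mu, sigma, loudness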

Normalization. To account for global differences in gain, we normalize the cochleagram features by dividing by the loudness, l. Following McDermott and Simoncelli (2011), we normalize the modulation filter responses by the variance of the cochlear channel, computing \(\tilde{b}_{ij} = \sqrt{b_{ij}/\sigma _i^2}\). Similarly, we normalize the standard deviation of each cochlear channel, computing \(\tilde{\sigma }_{i} = \sqrt{\sigma _{i}^2/\mu _i^2}\). From these normalized features, we construct a sound texture vector: \([\mu , \tilde{\sigma }, \rho , \tilde{b}, l]\).
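
Putting the pieces together, a sketch of the normalization and feature assembly that reuses the helpers above. The exact point at which the gain normalization is applied relative to the moment computation is our reading of the text, not a detail the paper spells out.

import numpy as np

def sound_texture(c, mod_bank):
    """Assemble the texture vector [mu, sigma~, rho, b~, l] from a cochleagram c."""
    _, _, loudness = marginal_stats(c)                    # l, from the raw cochleagram
    c = c / loudness                                      # gain normalization by l
    mu, sigma, _ = marginal_stats(c)                      # moments of the normalized channels
    b = modulation_power(c, mod_bank)
    b_norm = np.sqrt(b / sigma[:, None] ** 2)             # b~_ij = sqrt(b_ij / sigma_i^2)
    sigma_norm = np.sqrt(sigma ** 2 / mu ** 2)            # sigma~_i = sqrt(sigma_i^2 / mu_i^2)
    rho = band_correlations(c)
    return np.concatenate([mu, sigma_norm, rho, b_norm.ravel(), [loudness]])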

Cite this article

Owens, A., Wu, J., McDermott, J.H. et al. Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning. Int J Comput Vis 126, 1120–1137 (2018). https://doi.org/10.1007/s11263-018-1083-5
