Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning

International Journal of Computer Vision

Abstract

The sound of crashing waves, the roar of fast-moving cars—sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. This paper extends an earlier conference paper, Owens et al. (in: European conference on computer vision, 2016b), with additional experiments and discussion.


Notes

  1. For conciseness, we sometimes call these “sound-making” objects, even if they are not literally the source of the sound.

  2. As a result, this model has a larger pool5 layer than the other methods: \(7 \times 7\) versus \(6 \times 6\). Likewise, the fc6 layer of Wang and Gupta (2015) is smaller (1024 vs. 4096 dims.).

References

  • Agrawal, P., Carreira, J., & Malik, J. (2015). Learning to see by moving. In IEEE international conference on computer vision.

  • Andrew, G., Arora, R., Bilmes, J. A., & Livescu, K. (2013). Deep canonical correlation analysis. In International conference on machine learning.

  • Arandjelović, R., & Zisserman, A. (2017). Look, listen and learn. In IEEE international conference on computer vision.

  • Aytar, Y., Vondrick, C., & Torralba, A. (2016). Soundnet: Learning sound representations from unlabeled video. In Advances in neural information processing systems.

  • Bau, D., Zhou, B., Khosla, A., Oliva, A., & Torralba, A. (2017). Network dissection: Quantifying interpretability of deep visual representations. In IEEE conference on computer vision and pattern recognition.

  • de Sa, V. R. (1994a). Learning classification with unlabeled data. In Advances in neural information processing systems.

  • de Sa, V. R. (1994b). Minimizing disagreement for self-supervised classification. In Proceedings of the 1993 Connectionist Models Summer School. Psychology Press.

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition.

  • Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. In IEEE international conference on computer vision.

  • Doersch, C., & Zisserman, A. (2017). Multi-task self-supervised visual learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2051–2060).

  • Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., & Brox, T. (2014). Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems.

  • Ellis, D. P., Zeng, X., & McDermott, J. H. (2011). Classifying soundtracks with audio texture features. In IEEE international conference on acoustics, speech, and signal processing.

  • Eronen, A. J., Peltonen, V. T., Tuomi, J. T., Klapuri, A. P., Fagerlund, S., Sorsa, T., et al. (2006). Audio-based context recognition. IEEE/ACM Transactions on Audio Speech and Language Processing, 14(1), 321–329.

  • Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Fisher III, J. W., Darrell, T., Freeman, W. T., & Viola, P. A. (2000). Learning joint statistical models for audio–visual fusion and segregation. In Advances in neural information processing systems.

  • Gaver, W. W. (1993). What in the world do we hear?: An ecological approach to auditory event perception. Ecological Psychology, 5(1), 1–29.

  • Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., & Ritter, M. (2017). Audio Set: An ontology and human-labeled dataset for audio events. In IEEE international conference on acoustics, speech, and signal processing.

  • Girshick, R. (2015). Fast R-CNN. In IEEE international conference on computer vision.

  • Goroshin, R., Bruna, J., Tompson, J., Eigen, D., & LeCun, Y. (2015). Unsupervised feature learning from temporal data. arXiv preprint arXiv:1504.02518.

  • Gupta, S., Hoffman, J., & Malik, J. (2016). Cross modal distillation for supervision transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Hershey, J. R., & Movellan, J. R. (1999). Audio vision: Using audio–visual synchrony to locate sounds. In Advances in neural information processing systems.

  • Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. In ACM symposium on theory of computing.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning.

  • Isola, P. (2015). The discovery of perceptual structure from visual co-occurrences in space and time. PhD thesis, Massachusetts Institute of Technology.

  • Isola, P., Zoran, D., Krishnan, D., & Adelson, E.H. (2016). Learning visual groups from co-occurrences in space and time. In International conference on learning representations, Workshop.

  • Jayaraman, D., & Grauman, K. (2015). Learning image representations tied to ego-motion. In IEEE international conference on computer vision.

  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM multimedia conference.

  • Kidron, E., Schechner, Y. Y., & Elad, M. (2005). Pixels that sound. In IEEE conference on computer vision and pattern recognition.

  • Krähenbühl, P., Doersch, C., Donahue, J., & Darrell, T. (2016). Data-dependent initializations of convolutional neural networks. In International conference on learning representations.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems.

  • Le, Q. V., Ranzato, M. A., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., & Ng, A. Y. (2012). Building high-level features using large scale unsupervised learning. In International conference on machine learning.

  • Lee, K., Ellis, D. P., & Loui, A. C. (2010). Detecting local semantic concepts in environmental sounds using markov model based clustering. In IEEE international conference on acoustics, speech, and signal processing.

  • Leung, T., & Malik, J. (2001). Representing and recognizing the visual appearance of materials using three-dimensional textons. International Journal of Computer Vision, 43(1), 29–44.

  • Lin, M., Chen, Q., & Yan, S. (2014). Network in network. In International conference on learning representations.

  • McDermott, J. H., & Simoncelli, E. P. (2011). Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis. Neuron, 71(5), 926–940.

  • Mishkin, D., & Matas, J. (2015). All you need is a good init. arXiv preprint arXiv:1511.06422.

  • Mobahi, H., Collobert, R., & Weston, J. (2009). Deep learning from temporal coherence in video. In International conference on machine learning.

  • Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. In International Conference on Machine Learning.

  • Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2015). Is object localization for free? Weakly-supervised learning with convolutional neural networks. In IEEE conference on computer vision and pattern recognition.

  • Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., & Freeman, W. T. (2016a). Visually indicated sounds. In IEEE conference on computer vision and pattern recognition.

  • Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (2016b). Ambient sound provides supervision for visual learning. In European conference on computer vision.

  • Pathak, D., Girshick, R., Dollár, P., Darrell, T., & Hariharan, B. (2017). Learning features by watching objects move. In IEEE conference on computer vision and pattern recognition.

  • Salakhutdinov, R., & Hinton, G. (2009). Semantic hashing. International Journal of Approximate Reasoning, 50(7), 969–978.

  • Slaney, M., & Covell, M. (2000). Facesync: A linear operator for measuring synchronization of video facial images and audio tracks. In Advances in neural information processing systems.

  • Smith, L., & Gasser, M. (2005). The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1–2), 13–29.

  • Srivastava, N., & Salakhutdinov, R. R. (2012). Multimodal learning with deep boltzmann machines. In Advances in neural information processing systems.

  • Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., & Li, L. J. (2015). The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817.

  • Wang, X., & Gupta, A. (2015). Unsupervised learning of visual representations using videos. In IEEE international conference on computer vision.

  • Weiss, Y., Torralba, A., & Fergus, R. (2009). Spectral hashing. In Advances in neural information processing systems.

  • Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. In IEEE conference on computer vision and pattern recognition.

  • Zhang, R., Isola, P., & Efros, A. A. (2016). Colorful image colorization. In European conference on computer vision (pp. 649–666). Springer.

  • Zhang, R., Isola, P., & Efros, A. A. (2017). Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In IEEE conference on computer vision and pattern recognition.

  • Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In Advances in neural information processing systems.

  • Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2015). Object detectors emerge in deep scene cnns. In International conference on learning representations.

  • Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In The IEEE conference on computer vision and pattern recognition (CVPR).

Acknowledgements

This work was supported by NSF Grant #1524817 to A.T.; NSF Grants #1447476 and #1212849 to W.F.; a McDonnell Scholar Award to J.H.M.; and a Microsoft Ph.D. Fellowship to A.O. It was also supported by Shell Research and by a donation of GPUs from NVIDIA. We thank Phillip Isola for helpful discussions, and Carl Vondrick for sharing the data that we used in our experiments. We also thank the anonymous reviewers for their comments, which significantly improved the paper (in particular, for suggesting the comparison with texton features in Sect. 5).

Author information

Corresponding author

Correspondence to Andrew Owens.

Additional information

Communicated by Edwin Hancock, Richard Wilson, Will Smith, Adrian Bors and Nick Pears.

Appendix A: Sound Textures

We now describe in more detail how we computed sound textures from audio clips. For this, we closely follow the work of McDermott and Simoncelli (2011).

Subband envelopes. To compute the cochleagram features \(\{c_i\}\), we filter the input waveform s with a bank of bandpass filters \(\{f_i\}\):

$$c_i(t) = \left| (s * f_i) + j\,H(s * f_i) \right|, \qquad (1)$$

where H is the Hilbert transform and \(*\) denotes convolution. We then resample the signal to 400 Hz and compress it by raising each sample to the 0.3 power (examples in Fig. 2).
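
As an illustration, the following is a minimal NumPy/SciPy sketch of this step, not the authors' released code. It assumes a precomputed bank of cochlear bandpass filter impulse responses (the filter design itself is omitted), and the function and variable names are ours.

import numpy as np
from scipy.signal import fftconvolve, hilbert, resample_poly

def subband_envelopes(waveform, filter_bank, sr, env_sr=400, compression=0.3):
    """Compressed cochleagram channels c_i, sampled at env_sr Hz (Eq. 1)."""
    envelopes = []
    for f in filter_bank:
        subband = fftconvolve(waveform, f, mode="same")   # s * f_i
        envelopes.append(np.abs(hilbert(subband)))        # envelope = |analytic signal|
    c = np.stack(envelopes)                               # shape: (channels, time)
    c = resample_poly(c, env_sr, sr, axis=1)              # resample envelopes to 400 Hz
    return np.clip(c, 0.0, None) ** compression           # 0.3-power compression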

Correlations. As described in Sect. 3, we compute the correlation between bands using a subset of the entries in the cochlear-channel correlation matrix. Specifically, we include the correlation between channels \(c_j\) and \(c_k\) if \(|j - k| \in \{1, 2, 3, 5\}\). The result is a vector \(\rho \) of correlation values.
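
A sketch of this step under the same assumptions (names are ours): take the full channel correlation matrix and keep the entries whose channel-index offset lies in {1, 2, 3, 5}.

import numpy as np

def band_correlations(c, offsets=(1, 2, 3, 5)):
    """Vector rho of correlations between channels c_j, c_k with |j - k| in offsets."""
    corr = np.corrcoef(c)                                 # channel-by-channel correlation matrix
    n = c.shape[0]
    rho = [corr[j, k]
           for j in range(n)
           for k in range(j + 1, n)
           if (k - j) in offsets]
    return np.asarray(rho)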

Modulation filters. We also include modulation filter responses. To get these, we compute each band’s response to a filter bank \(\{m_i\}\) of 10 bandpass filters whose center frequencies are spaced logarithmically from 0.5 to 200 Hz:

$$b_{ij} = \frac{1}{N}\,\Vert c_i * m_j \Vert^2, \qquad (2)$$

where N is the length of the signal.
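
A corresponding sketch of Eq. (2), assuming the 10-filter modulation bank is supplied as impulse responses at the 400 Hz envelope rate (again, the helper name is ours):

import numpy as np
from scipy.signal import fftconvolve

def modulation_power(c, mod_bank):
    """b[i, j] = ||c_i * m_j||^2 / N for cochlear channel i and modulation filter j."""
    n_samples = c.shape[1]                                # N, the length of the signal
    b = np.empty((c.shape[0], len(mod_bank)))
    for i, channel in enumerate(c):
        for j, m in enumerate(mod_bank):
            filtered = fftconvolve(channel, m, mode="same")
            b[i, j] = np.sum(filtered ** 2) / n_samples
    return b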

Marginal statistics. We estimate marginal moments of the cochleagram features, computing the mean \(\mu _i\) and standard deviation \(\sigma _i\) of each channel. We also estimate the loudness, l, of the sequence by taking the median of the energy at each timestep, i.e. \(l = \text{median}(||c(t)||)\).
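
These statistics reduce to a few lines of NumPy; the sketch below is ours, not part of the paper:

import numpy as np

def marginal_stats(c):
    """Per-channel mean and standard deviation, plus median-energy loudness l."""
    mu = c.mean(axis=1)
    sigma = c.std(axis=1)
    loudness = np.median(np.linalg.norm(c, axis=0))       # median over time of ||c(t)||
    return mu, sigma, loudness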

Normalization. To account for global differences in gain, we normalize the cochleagram features by dividing by the loudness, l. Following McDermott and Simoncelli (2011), we normalize the modulation filter responses by the variance of the cochlear channel, computing \(\tilde{b}_{ij} = \sqrt{b_{ij}/\sigma _i^2}\). Similarly, we normalize the standard deviation of each cochlear channel, computing \(\tilde{\sigma }_{i} = \sqrt{\sigma _{i}^2/\mu _i^2}\). From these normalized features, we construct a sound texture vector: \([\mu , \tilde{\sigma }, \rho , \tilde{b}, l]\).
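
Putting the pieces together, a sketch of the normalization and feature assembly that reuses the helpers above. The exact point at which the gain normalization is applied relative to the moment computation is our reading of the text, not a detail the paper spells out.

import numpy as np

def sound_texture(c, mod_bank):
    """Assemble the texture vector [mu, sigma~, rho, b~, l] from a cochleagram c."""
    _, _, loudness = marginal_stats(c)                    # l, from the raw cochleagram
    c = c / loudness                                      # gain normalization by l
    mu, sigma, _ = marginal_stats(c)                      # moments of the normalized channels
    b = modulation_power(c, mod_bank)
    b_norm = np.sqrt(b / sigma[:, None] ** 2)             # b~_ij = sqrt(b_ij / sigma_i^2)
    sigma_norm = np.sqrt(sigma ** 2 / mu ** 2)            # sigma~_i = sqrt(sigma_i^2 / mu_i^2)
    rho = band_correlations(c)
    return np.concatenate([mu, sigma_norm, rho, b_norm.ravel(), [loudness]])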

Cite this article

Owens, A., Wu, J., McDermott, J.H. et al. Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning. Int J Comput Vis 126, 1120–1137 (2018). https://doi.org/10.1007/s11263-018-1083-5
