Deep Nets: What have They Ever Done for Vision?

International Journal of Computer Vision

Abstract

This is an opinion paper about the strengths and weaknesses of Deep Nets for vision. They are at the heart of the enormous recent progress in artificial intelligence and are of growing importance in cognitive science and neuroscience. They have had many successes, but they also have several limitations, and there is limited understanding of their inner workings. At present Deep Nets perform very well on specific visual tasks with benchmark datasets, but they are much less general purpose, flexible, and adaptive than the human visual system. We argue that Deep Nets in their current form are unlikely to overcome the fundamental problem of computer vision, namely how to deal with the combinatorial explosion caused by the enormous complexity of natural images, and how to obtain the rich understanding of visual scenes that the human visual system achieves. We argue that this combinatorial explosion takes us into a regime where “big data is not enough” and where we need to rethink our methods for benchmarking performance and evaluating vision algorithms. We stress that, as vision algorithms are increasingly used in real-world applications, performance evaluation is not merely an academic exercise but has important real-world consequences. It is impractical to review the entire Deep Net literature, so we restrict ourselves to a limited range of topics and references, which are intended as entry points into the literature. The views expressed in this paper are our own and do not necessarily represent those of anybody else in the computer vision community.

Notes

  1. The first author remembers that in the mid-1990s and early 2000s the term “neural network” in the title of a submission to a computer vision conference was sadly a good predictor of rejection, and recalls sympathizing with researchers who were pursuing such unfashionable ideas.

  2. In addition to visualization, training a small neural network (a so-called readout function) on top of deep features is another popular technique for assessing how much those features encode particular properties; it is now widely adopted in the self-supervised learning literature (Noroozi et al. 2016; Zhang et al. 2016). A minimal sketch of this idea appears in code after these notes.

  3. Admittedly, in ResNets (He et al. 2016) there is only one “decision layer”, and the analogy to “template matching” also weakens at higher layers due to the presence of residual connections (sketched in code after these notes).

  4. https://michaelbach.de/ot/.

  5. The issue we are describing relates more generally to how Deep Nets can take unintended “shortcut” solutions, for example the chromatic aberration noticed in Doersch et al. (2015), or the low-level statistics and edge continuity noticed in Noroozi et al. (2016). In this paper we highlight “over-sensitivity to context” as the representative example, both because it is familiar and to keep the discussion contained; a toy probe of this over-sensitivity is sketched after these notes.

  6. The first author remembers that, when studying text detection for the visually impaired, we were so concerned about dataset biases that we recruited blind subjects who walked the streets of San Francisco taking images automatically (but found that the main difference from regular images was a greater variety of camera angles).

  7. Available from Sowerby Research Centre, British Aerospace.

  8. https://www.caranddriver.com/features/a32266303/self-driving-cars-are-taking-longer-to-build-than-everyone-thought/.

  9. https://www.ncbi.nlm.nih.gov/books/NBK210143/.

  10. https://www.wired.com/story/done-right-ai-make-policing-fairer/.

  11. A quote from a public talk by a West Coast professor who, perhaps coincidentally, had a start-up company.
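
The following is a minimal sketch of the readout-function idea in Note 2, assuming PyTorch; the backbone, the 10 probe classes, and the random tensors standing in for a probe dataset are illustrative placeholders, not details from the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Frozen backbone: in practice this would carry pretrained (e.g. self-supervised) weights.
backbone = models.resnet18(weights=None)
backbone.fc = nn.Identity()            # expose the 512-d penultimate features
for p in backbone.parameters():
    p.requires_grad = False            # the deep features stay fixed

readout = nn.Linear(512, 10)           # the small "readout function": here a linear probe
optimizer = torch.optim.SGD(readout.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

def probe_step(images, labels):
    """One training step of the readout; gradients never reach the backbone."""
    with torch.no_grad():
        feats = backbone(images)       # (B, 512) frozen deep features
    loss = criterion(readout(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Random tensors stand in for a labeled probe dataset.
print(probe_step(torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,))))
```

High probe accuracy is then read as evidence that the frozen features encode the property being probed.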
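
Next, a minimal sketch of the residual connection mentioned in Note 3, again assuming PyTorch and simplified relative to He et al. (2016): because the block outputs F(x) + x, higher layers refine an identity-carried signal rather than re-matching templates from scratch.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block: two 3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)      # F(x) + x: the residual connection

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))  # output has the same shape as the input
```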
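
Finally, a toy probe of the over-sensitivity to context discussed in Note 5, in the spirit of the transplanted-object experiments of Rosenfeld et al. (2018); the model, image, and pasted patch here are placeholders, and in practice one would use a pretrained classifier or detector and real images.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()  # placeholder; use a pretrained classifier

def paste_patch(image, patch, top, left):
    """Return a copy of `image` with `patch` pasted at (top, left)."""
    edited = image.clone()
    _, h, w = patch.shape
    edited[:, top:top + h, left:left + w] = patch
    return edited

image = torch.randn(3, 224, 224)   # stand-in for a real test image
patch = torch.randn(3, 50, 50)     # stand-in for an out-of-context object
with torch.no_grad():
    before = model(image.unsqueeze(0)).argmax(1).item()
    after = model(paste_patch(image, patch, 10, 10).unsqueeze(0)).argmax(1).item()
# If predictions about content far from the pasted patch flip frequently,
# the net is over-sensitive to context rather than to the objects themselves.
print("prediction changed:", before != after)
```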

References

  • Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., & Süsstrunk, S. (2012). SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2274–2282.

  • Alcorn, MA., Li, Q., Gong, Z., Wang, C., Mai, L., Ku, W., & Nguyen, A. (2019). Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. In CVPR, Computer Vision Foundation/IEEE (pp. 4845–4854).

  • Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016). Neural module networks. In CVPR, IEEE Computer Society (pp. 39–48).

  • Arbib, M. A., & Bonaiuto, J. J. (2016). From neuron to cognition via computational neuroscience. Cambridge: MIT Press.

  • Arterberry, M. E., & Kellman, P. J. (2016). Development of perception in infancy: The cradle of knowledge revisited. Oxford: Oxford University Press.

  • Athalye, A., Carlini, N., & Wagner, DA. (2018). Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, JMLR.org, JMLR Workshop and Conference Proceedings (Vol. 80, pp. 274–283).

  • Barlow, H., & Tripathy, S. P. (1997). Correspondence noise and signal pooling in the detection of coherent visual motion. Journal of Neuroscience, 17(20), 7954–7966.

  • Bashford, A., & Levine, P. (2010). The Oxford handbook of the history of eugenics. Oxford University Press.

  • Battaglia, P. W., Hamrick, J. B., & Tenenbaum, J. B. (2013). Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45), 18327–18332.

  • Biederman, I. (1987). Recognition-by-components: a theory of human image understanding. Psychological Review, 94(2), 115.

  • Biggio, B., Corona, I., Maiorca, D., Nelson, B., Srndic, N., Laskov, P., Giacinto, G., & Roli, F. (2013). Evasion attacks against machine learning at test time. In ECML/PKDD (3), Springer, Lecture Notes in Computer Science (Vol. 8190, pp. 387–402).

  • Bowyer, KW., Kranenburg, C., & Dougherty, S. (1999). Edge detector evaluation using empirical ROC curves. In CVPR, IEEE Computer Society (pp. 1354–1359).

  • Boyden, E. S., Zhang, F., Bamberg, E., Nagel, G., & Deisseroth, K. (2005). Millisecond-timescale, genetically targeted optical control of neural activity. Nature Neuroscience, 8(9), 1263.

  • Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency (pp. 77–91).

  • Canny, J. F. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), 679–698.

  • Chang, AX., Funkhouser, TA., Guibas, LJ., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., & Yu, F. (2015). Shapenet: An information-rich 3d model repository. CoRR abs/1512.03012.

  • Changizi, M. (2010). The vision revolution: How the latest research overturns everything we thought we knew about human vision. BenBella Books.

  • Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.

  • Chen, X., & Yuille, AL. (2014). Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS (pp. 1736–1744).

  • Chen, X., & Yuille, AL. (2015). Parsing occluded people by flexible compositions. In CVPR, IEEE Computer Society (pp. 3945–3954).

  • Chen, Y., Zhu, L., Lin, C., Yuille, AL., & Zhang, H. (2007). Rapid inference on a novel AND/OR graph for object detection, segmentation and parsing. In NIPS, Curran Associates, Inc., (pp. 289–296).

  • Chomsky, N. (2014). Aspects of the theory of syntax. Cambridge: MIT Press.

  • Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. (2016). Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports, 6, 27755.

  • Clune, J., Mouret, J. B., & Lipson, H. (2013). The evolutionary origins of modularity. Proceedings of the Royal Society B: Biological Sciences, 280(1755), 20122863.

  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. MCSS, 2(4), 303–314.

  • Darwiche, A. (2018). Human-level intelligence or animal-like abilities? Communications of the ACM, 61(10), 56–67.

  • Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Li, F. (2009). Imagenet: A large-scale hierarchical image database. In CVPR, IEEE Computer Society (pp. 248–255).

  • Doersch, C., Gupta, A., & Efros, AA. (2015). Unsupervised visual representation learning by context prediction. In ICCV, IEEE Computer Society (pp. 1422–1430).

  • Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In NIPS (pp. 2366–2374).

  • Everingham, M., Gool, L. J. V., Williams, C. K. I., Winn, J. M., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D. A., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.

  • Firestone, C. (2020). Performance versus competence in human-machine comparisons. Proceedings of the National Academy of Sciences, in press.

  • Fukushima, K., & Miyake, S. (1982). Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets (pp. 267–285). Berlin: Springer.

  • Geisler, W. S. (2011). Contributions of ideal observer theory to vision research. Vision Research, 51(7), 771–781.

  • Geman, S. (2007). Compositionality in vision. In The grammar of vision: probabilistic grammar-based models for visual scene understanding and object categorization.

  • George, D., Lehrach, W., Kansky, K., Lázaro-Gredilla, M., Laan, C., Marthi, B., et al. (2017). A generative vision model that trains with high data efficiency and breaks text-based captchas. Science, 358(6368), eaag2612.

  • Gibson, J. J. (1986). The ecological approach to visual perception. Hove: Psychology Press.

  • Girshick, RB., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, IEEE Computer Society (pp. 580–587).

  • Goodfellow, IJ., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, AC., & Bengio, Y. (2014). Generative adversarial nets. In NIPS (pp. 2672–2680).

  • Goodfellow, IJ., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. In International Conference on Learning Representations.

  • Gopnik, A., Meltzoff, A. N., & Kuhl, P. K. (1999). The scientist in the crib: Minds, brains, and how children learn. New York: William Morrow and Co.

  • Gopnik, A., Glymour, C., Sobel, D. M., Schulz, L. E., Kushnir, T., & Danks, D. (2004). A theory of causal learning in children: Causal maps and bayes nets. Psychological Review, 111(1), 3.

  • Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New Jersey: John Wiley.

  • Gregoriou, G. G., Rossi, A. F., Ungerleider, L. G., & Desimone, R. (2014). Lesions of prefrontal cortex reduce attentional modulation of neuronal responses and synchrony in v4. Nature Neuroscience, 17(7), 1003–1011.

  • Gregory, R. L. (1973). Eye and brain: The psychology of seeing. New York: McGraw-Hill.

  • Grenander, U. (1993). General pattern theory: A mathematical study of regular structures. Oxford: Clarendon Press.

  • Guu, K., Pasupat, P., Liu, EZ., & Liang, P. (2017). From language to programs: Bridging reinforcement learning and maximum marginal likelihood. In ACL (1), Association for Computational Linguistics (pp. 1051–1062).

  • Guzmán, A. (1968). Decomposition of a visual scene into three-dimensional bodies. In Proceedings of the December 9–11, 1968, Fall Joint Computer Conference, Part I (pp. 291–304).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR, IEEE Computer Society (pp. 770–778).

  • He, K., Fan, H., Wu, Y., Xie, S., & Girshick, RB. (2019). Momentum contrast for unsupervised visual representation learning. CoRR abs/1911.05722.

  • Hoffman, J., Tzeng, E., Park, T., Zhu, J., Isola, P., Saenko, K., et al. (2018). Cycada: Cycle-consistent adversarial domain adaptation. ICML, PMLR, Proceedings of Machine Learning Research, 80, 1994–2003.

  • Hoiem, D., Chodpathumwan, Y., & Dai, Q. (2012). Diagnosing error in object detectors. In ECCV (3), Springer, Lecture Notes in Computer Science. (Vol. 7574, pp. 340–353).

  • Hornik, K., Stinchcombe, M. B., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, JMLR.org, JMLR Workshop and Conference Proceedings (Vol. 37, pp. 448–456).

  • Jabr, F. (2012). The connectome debate: Is mapping the mind of a worm worth it? Scientific American.

  • Jégou, S., Drozdzal, M., Vázquez, D., Romero, A., & Bengio, Y. (2017). The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In CVPR Workshops, IEEE Computer Society (pp. 1175–1183).

  • Julesz, B. (1971). Foundations of cyclopean perception. Chicago: University of Chicago Press.

  • Kaushik, D., Hovy, EH., & Lipton, ZC. (2020). Learning the difference that makes A difference with counterfactually-augmented data. In ICLR, OpenReview.net.

  • Kokkinos, I. (2017). Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, IEEE Computer Society (pp. 5454–5463).

  • Konishi, S., Yuille, AL., Coughlan, JM., & Zhu, SC. (1999). Fundamental bounds on edge detection: An information theoretic evaluation of different edge cues. In CVPR, IEEE Computer Society (pp. 1573–1579).

  • Konishi, S., Yuille, A. L., Coughlan, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(1), 57–74.

  • Kortylewski, A., He, J., Liu, Q., & Yuille, AL. (2020). Compositional convolutional neural networks: A deep architecture with innate robustness to partial occlusion. CoRR abs/2003.04490.

  • Kortylewski, A., Liu, Q., Wang, H., Zhang, Z., & Yuille, AL. (2020). Combining compositional models and deep networks for robust object classification under occlusion. In WACV, IEEE (pp. 1322–1330).

  • Krizhevsky, A., Sutskever, I., & Hinton, GE. (2012). Imagenet classification with deep convolutional neural networks. In NIPS (pp. 1106–1114).

  • LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.

  • Lee, T. S., & Mumford, D. (2003). Hierarchical bayesian inference in the visual cortex. JOSA A, 20(7), 1434–1448.

  • Lin, X., Wang, H., Li, Z., Zhang, Y., Yuille, AL., & Lee, TS. (2017). Transfer of view-manifold learning to similarity perception of novel objects. In International Conference on Learning Representations.

  • Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L., Fei-Fei, L., Yuille, AL., Huang, J., & Murphy, K. (2018). Progressive neural architecture search. In ECCV (1), Springer, Lecture Notes in Computer Science (Vol. 11205, pp. 19–35).

  • Liu, C., Dollár, P., He, K., Girshick, RB., Yuille, AL., & Xie, S. (2020). Are labels necessary for neural architecture search? CoRR abs/2003.12056.

  • Liu, R., Liu, C., Bai, Y., & Yuille, AL. (2019). Clevr-ref+: Diagnosing visual reasoning with referring expressions. In CVPR, Computer Vision Foundation/IEEE (pp. 4185–4194).

  • Liu, Z., Knill, D. C., & Kersten, D. (1995). Object classification for human and ideal observers. Vision Research, 35(4), 549–568.

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR, IEEE Computer Society (pp. 3431–3440).

  • Lu, H., & Yuille, AL. (2005). Ideal observers for detecting motion: Correspondence noise. In NIPS (pp. 827–834).

  • Lyu, J., Qiu, W., Wei, X., Zhang, Y., Yuille, AL., & Zha, Z. (2019). Identity preserve transform: Understand what activity classification models have learnt. CoRR abs/1912.06314.

  • Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2017). Towards deep learning models resistant to adversarial attacks. CoRR abs/1706.06083.

  • Mao, J., Wei, X., Yang, Y., Wang, J., Huang, Z., & Yuille, AL. (2015). Learning like a child: Fast novel visual concept learning from sentence descriptions of images. In ICCV, IEEE Computer Society (pp. 2533–2541).

  • Marcus, G. (2018). Deep learning: A critical appraisal. CoRR abs/1801.00631.

  • Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. New York: Henry Holt and Co. Inc.

  • Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., & Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, IEEE Computer Society (pp. 4040–4048).

  • McManus, J. N., Li, W., & Gilbert, C. D. (2011). Adaptive shape processing in primary visual cortex. Proceedings of the National Academy of Sciences, 108(24), 9739–9746.

  • Mengistu, H., Huizinga, J., Mouret, J., & Clune, J. (2016). The evolutionary origins of hierarchy. PLoS Computational Biology, 12(6), e1004829.

  • Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. CoRR abs/1411.1784.

  • Mu, J., Qiu, W., Hager, GD., & Yuille, AL. (2019). Learning from synthetic animals. CoRR abs/1912.08265.

  • Mumford, D. (1994). Pattern theory: a unifying perspective. In First European Congress of Mathematics, Springer (pp. 187–224).

  • Mumford, D., & Desolneux, A. (2010). Pattern theory: The stochastic analysis of real-world signals. Cambridge: CRC Press.

  • Murez, Z., Kolouri, S., Kriegman, DJ., Ramamoorthi, R., & Kim, K. (2018). Image to image translation for domain adaptation. In CVPR (pp. 4500–4509). https://doi.org/10.1109/CVPR.2018.00473.

  • Noroozi, M., & Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV (6), Springer, Lecture Notes in Computer Science (Vol. 9910, pp. 69–84).

  • Papandreou, G., Chen, L., Murphy, KP., & Yuille, AL. (2015). Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In ICCV, IEEE Computer Society (pp. 1742–1750).

  • Pearl, J. (1989). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann series in representation and reasoning. Morgan Kaufmann.

  • Pearl, J. (2009). Causality. Cambridge: Cambridge University Press.

  • Penn, D. C., Holyoak, K. J., & Povinelli, D. J. (2008). Darwin’s mistake: Explaining the discontinuity between human and nonhuman minds. Behavioral and Brain Sciences, 31(2), 109–130.

  • Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., & Dean, J. (2018). Efficient neural architecture search via parameter sharing. ICML, PMLR, Proceedings of Machine Learning Research, 80, 4092–4101.

  • Poirazi, P., & Mel, B. W. (2001). Impact of active dendrites and structural plasticity on the memory capacity of neural tissue. Neuron, 29(3), 779–796.

  • Qiao, S., Liu, C., Shen, W., & Yuille, AL. (2018). Few-shot image recognition by predicting parameters from activations. In CVPR, IEEE Computer Society (pp. 7229–7238).

  • Qiu, W., & Yuille, AL. (2016). Unrealcv: Connecting computer vision to unreal engine. In ECCV Workshops (3), Lecture Notes in Computer Science (Vol. 9915, pp. 909–916).

  • Ren, S., He, K., Girshick, RB., & Sun, J. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS (pp. 91–99).

  • Ren, Z., Yan, J., Ni, B., Liu, B., Yang, X., & Zha, H. (2017). Unsupervised deep learning for optical flow estimation. In AAAI, AAAI Press (pp. 1495–1501).

  • Rensink, R. A., O’Regan, J. K., & Clark, J. J. (1997). To see or not to see: The need for attention to perceive changes in scenes. Psychological Science, 8(5), 368–373.

  • Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019.

  • Rosenfeld, A., Zemel, RS., & Tsotsos, JK. (2018). The elephant in the room. CoRR abs/1808.03305.

  • Rother, C., Kolmogorov, V., & Blake, A. (2004). “Grabcut”: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3), 309–314.

  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.

  • Russell, S. J., & Norvig, P. (2010). Artificial intelligence: A modern approach (3rd international ed.). Pearson Education.

  • Sabour, S., Frosst, N., & Hinton, GE. (2017). Dynamic routing between capsules. In NIPS (pp. 3856–3866).

  • Salakhutdinov, R., Tenenbaum, JB., & Torralba, A. (2012). One-shot learning with a hierarchical nonparametric bayesian model. In ICML Unsupervised and Transfer Learning, JMLR.org, JMLR Proceedings (Vol. 27, pp. 195–206).

  • Santoro, A., Hill, F., Barrett, DGT., Morcos, AS., & Lillicrap, TP. (2018). Measuring abstract reasoning in neural networks. In ICML, JMLR.org, JMLR Workshop and Conference Proceedings (Vol. 80, pp. 4477–4486).

  • Seung, S. (2012). Connectome: How the brain’s wiring makes us who we are. Houghton Mifflin Harcourt.

  • Shen, W., Zhao, K., Jiang, Y., Wang, Y., Bai, X., & Yuille, A. L. (2017a). Deepskeleton: Learning multi-task scale-associated deep side outputs for object skeleton extraction in natural images. IEEE Transactions on Image Processing, 26(11), 5298–5311.

  • Shen, Z., Liu, Z., Li, J., Jiang, Y., Chen, Y., & Xue, X. (2017). DSOD: Learning deeply supervised object detectors from scratch. In ICCV, IEEE Computer Society (pp. 1937–1945).

  • Shu, M., Liu, C., Qiu, W., & Yuille, AL. (2020). Identifying model weakness with adversarial examiner. In AAAI, AAAI Press (pp. 11998–12006).

  • Simons, D. J., & Chabris, C. F. (1999). Gorillas in our midst: Sustained inattentional blindness for dynamic events. Perception, 28(9), 1059–1074.

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.

  • Smirnakis, SM., & Yuille, AL. (1995). Neural implementation of bayesian vision theories by unsupervised learning. In The Neurobiology of Computation, Springer, (pp. 427–432).

  • Smith, L., & Gasser, M. (2005). The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1–2), 13–29.

  • Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, IJ., & Fergus, R. (2014). Intriguing properties of neural networks. In International Conference on Learning Representations.

  • Tjan, B. S., Braje, W. L., Legge, G. E., & Kersten, D. (1995). Human efficiency for recognizing 3-d objects in luminance noise. Vision Research, 35(21), 3053–3069.

  • Torralba, A., & Efros, AA. (2011). Unbiased look at dataset bias. In CVPR, IEEE Computer Society (pp. 1521–1528).

  • Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., & Madry, A. (2019). Robustness may be at odds with accuracy. In ICLR (Poster), OpenReview.net.

  • Tu, Z., Chen, X., Yuille, AL., & Zhu, SC. (2003). Image parsing: Unifying segmentation, detection, and recognition. In ICCV, IEEE Computer Society (pp. 18–25).

  • Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. In CVPR (pp. 2962–2971). https://doi.org/10.1109/CVPR.2017.316.

  • Uesato, J., O’Donoghue, B., Kohli, P., & van den Oord, A. (2018). Adversarial risk and the dangers of evaluating against weak attacks. ICML, PMLR, Proceedings of Machine Learning Research, 80, 5032–5041.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, AN., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In NIPS (pp. 5998–6008).

  • Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching networks for one shot learning. In NIPS (pp. 3630–3638).

  • Wang, J., Zhang, Z., Premachandran, V., & Yuille, AL. (2015). Discovering internal representations from object-cnns using population encoding. CoRR abs/1511.06855.

  • Wang, J., Zhang, Z., Xie, C., Zhou, Y., Premachandran, V., Zhu, J., et al. (2018). Visual concepts and compositional voting. Annals of Mathematical Sciences and Applications, 2(3), 4.

  • Wang, P., & Yuille, AL. (2016). DOC: deep occlusion estimation from a single image. In ECCV (1), Springer, Lecture Notes in Computer Science (Vol. 9905, pp. 545–561).

  • Wang, T., Zhao, J., Yatskar, M., Chang, K., & Ordonez, V. (2019). Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In ICCV, IEEE (pp. 5309–5318).

  • Wang, X., & Gupta, A. (2015). Unsupervised learning of visual representations using videos. In ICCV, IEEE Computer Society (pp. 2794–2802).

  • Wen, H., Shi, J., Zhang, Y., Lu, K. H., Cao, J., & Liu, Z. (2017). Neural encoding and decoding with deep learning for dynamic natural vision. Cerebral Cortex, 28, 1–25.

  • Wu, Z., Xiong, Y., Yu, SX., & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. In CVPR, IEEE Computer Society (pp. 3733–3742).

  • Xia, F., Wang, P., Chen, L., & Yuille, AL. (2016). Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In ECCV (5), Springer, Lecture Notes in Computer Science (Vol. 9909, pp. 648–663).

  • Xia, Y., Zhang, Y., Liu, F., Shen, W., & Yuille, AL. (2020). Synthesize then compare: Detecting failures and anomalies for semantic segmentation. CoRR abs/2003.08440.

  • Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., & Yuille, AL. (2017). Adversarial examples for semantic segmentation and object detection. In ICCV, IEEE Computer Society (pp. 1378–1387).

  • Xie, C., Wang, J., Zhang, Z., Ren, Z., & Yuille, AL. (2018). Mitigating adversarial effects through randomization. In International Conference on Learning Representations.

  • Xie, L., & Yuille, AL. (2017). Genetic CNN. In ICCV, IEEE Computer Society (pp. 1388–1397).

  • Xie, S., & Tu, Z. (2015). Holistically-nested edge detection. In ICCV, IEEE Computer Society (pp. 1395–1403).

  • Xu, L., Krzyzak, A., & Yuille, A. L. (1994). On radial basis function nets and kernel regression: Statistical consistency, convergence rates, and receptive field size. Neural Networks, 7(4), 609–628.

  • Yamane, Y., Carlson, E. T., Bowman, K. C., Wang, Z., & Connor, C. E. (2008). A neural code for three-dimensional object shape in macaque inferotemporal cortex. Nature Neuroscience, 11(11), 1352–1360.

  • Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23), 8619–8624.

  • Yang, C., Kortylewski, A., Xie, C., Cao, Y., & Yuille, AL. (2020). Patchattack: A black-box texture-based attack with reinforcement learning. CoRR abs/2004.05682.

  • Yosinski, J., Clune, J., Nguyen, AM., Fuchs. TJ., & Lipson, H. (2015). Understanding neural networks through deep visualization. CoRR abs/1506.06579.

  • Yuille, A., & Kersten, D. (2006). Vision as bayesian inference: Analysis by synthesis? Trends in Cognitive Sciences, 10(7), 301–308.

  • Yuille, A. L., & Mottaghi, R. (2016). Complexity of representation and inference in compositional models with part sharing. Journal of Machine Learning Research, 17, 292–319.

  • Zbontar, J., & LeCun, Y. (2015). Computing the stereo matching cost with a convolutional neural network. In CVPR, IEEE Computer Society (pp. 1592–1599).

  • Zeiler, MD., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In ECCV (1), Springer, Lecture Notes in Computer Science (Vol. 8689, pp. 818–833).

  • Zendel, O., Murschitz, M., Humenberger, M., & Herzner, W. (2015). CV-HAZOP: introducing test data validation for computer vision. In ICCV, IEEE Computer Society (pp. 2066–2074).

  • Zhang, R., Isola, P., & Efros, AA. (2016). Colorful image colorization. In ECCV (3), Springer, Lecture Notes in Computer Science (Vol. 9907, pp. 649–666).

  • Zhang, Y., Qiu, W., Chen, Q., Hu, X., & Yuille, AL. (2018). Unrealstereo: Controlling hazardous factors to analyze stereo vision. In 3DV, IEEE Computer Society (pp. 228–237).

  • Zhang, Z., Shen, W., Qiao, S., Wang, Y., Wang, B., & Yuille, AL. (2020). Robust face detection via learning small faces on hard images. In WACV, IEEE (pp. 1350–1359).

  • Zhou, B., Lapedriza, À., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In NIPS (pp. 487–495).

  • Zhou, B., Khosla, A., Lapedriza, À., Oliva, A., & Torralba, A. (2015). Object detectors emerge in deep scene cnns. In International Conference on Learning Representations.

  • Zhou, T., Brown, M., Snavely, N., & Lowe, DG. (2017). Unsupervised learning of depth and ego-motion from video. In CVPR, IEEE Computer Society (pp. 6612–6619).

  • Zhou, Z., & Firestone, C. (2019). Humans can decipher adversarial images. Nature Communications, 10(1), 1–9.

  • Zhu, H., Tang, P., Yuille, AL., Park, S., & Park, J. (2019). Robustness of object recognition under extreme occlusion in humans and computational models. In CogSci, cognitivesciencesociety.org (pp. 3213–3219).

  • Zhu, L., Chen, Y., Torralba, A., Freeman, WT., & Yuille, AL. (2010). Part and appearance sharing: Recursive compositional models for multi-view. In CVPR, IEEE Computer Society (pp. 1919–1926).

  • Zhu, S., & Mumford, D. (2006). A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2(4), 259–362.

  • Zhu, Z., Xie, L., & Yuille, AL. (2017). Object recognition with and without objects. In IJCAI, ijcai.org (pp. 3609–3615).

  • Zitnick, C. L., Agrawal, A., Antol, S., Mitchell, M., Batra, D., & Parikh, D. (2016). Measuring machine intelligence through visual question answering. AI Magazine, 37(1), 63–72.

  • Zoph, B., & Le, QV. (2017). Neural architecture search with reinforcement learning. In ICLR, OpenReview.net.

Acknowledgements

This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC Award CCF-1231216 and ONR N00014-15-1-2356. We thank Kyle Rawlins, Tal Linzen, Wei Shen, and Adam Kortylewski for providing feedback and Weichao Qiu, Daniel Kersten, Ed Connor, Chaz Firestone, Vicente Ordonez, and Greg Hager for discussions on some of these topics. We thank the reviewers for some very helpful feedback which greatly improved the paper.

Author information

Correspondence to Chenxi Liu.

Additional information

Communicated by Ivan Laptev.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

For those readers unfamiliar with Monty Python see: https://youtu.be/Qc7HmhrgTuQ

About this article

Cite this article

Yuille, A. L., & Liu, C. Deep Nets: What have They Ever Done for Vision? International Journal of Computer Vision, 129, 781–802 (2021). https://doi.org/10.1007/s11263-020-01405-z
