
A survey of visual neural networks: current trends, challenges and opportunities

Regular Paper
Published in Multimedia Systems

Abstract

Research on visual neural networks (VNNs) is one of the most important topics in deep learning, and VNNs have received wide attention from industry and academia for their promising performance. Their applications range from image classification and object detection to scene segmentation, in fields as diverse as transportation, healthcare and finance. In general, VNNs can be divided into two types: convolutional neural networks (CNNs) and Transformer networks. Over the last decade, CNNs have dominated research on vision tasks. Recently, Transformer networks have been successfully applied to natural language processing and computer vision, achieving remarkable performance on many vision tasks. In this paper, the basic architectures and current trends of these two types of VNNs are first introduced. Then, three major challenges facing VNNs are identified: scalability, robustness and interpretability. Next, lightweight, robust and interpretable solutions are summarized and analyzed. Finally, future opportunities for VNNs are presented.
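To make the distinction between the two VNN families concrete, below is a minimal PyTorch sketch (an editorial illustration, not code from the surveyed works): a convolution block of the kind CNNs stack, and a multi-head self-attention block of the kind vision Transformers stack over patch tokens. The patch size (16x16), embedding width (768) and head count (8) are illustrative assumptions echoing ViT-style defaults.

```python
# Minimal sketch (illustration only, not from the paper): the basic building
# blocks of the two VNN families. All layer sizes are assumptions.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv -> BatchNorm -> ReLU: the elementary CNN unit."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (B, C, H, W) image grid
        return self.act(self.bn(self.conv(x)))

class AttentionBlock(nn.Module):
    """LayerNorm -> multi-head self-attention with a residual connection:
    the elementary Transformer unit, operating on a sequence of patch tokens."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, N_tokens, dim)
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

# A CNN consumes the image grid directly; a vision Transformer first cuts the
# image into 16x16 patches and flattens each patch into a token.
img = torch.randn(1, 3, 224, 224)
feat = ConvBlock(3, 64)(img)                       # -> (1, 64, 224, 224)

patches = img.unfold(2, 16, 16).unfold(3, 16, 16)  # -> (1, 3, 14, 14, 16, 16)
tokens = (patches.reshape(1, 3, 14 * 14, 16 * 16)
                 .permute(0, 2, 1, 3)
                 .reshape(1, 196, 768))            # 196 tokens of width 768
out = AttentionBlock(768)(tokens)                  # -> (1, 196, 768)
```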




Acknowledgements

This work is partially supported by the Guangxi Natural Science Foundation (2022GXNSFAA035506), the National Natural Science Foundation of China (62272111, 61962008), Guangxi “Bagui Scholar” Team for Innovation and Research, Guangxi Talent Highland Project of Big Data Intelligence and Application, Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, and the Innovation Project of Guangxi Graduate Education (YCBZ2022063, YCSW2022177). Many thanks to the reviewers for their helpful suggestions.

Author information

Authors and Affiliations

Authors

Contributions

PF did the main work and wrote the draft manuscript. ZT supervised the work and wrote the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zhenjun Tang.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Feng, P., Tang, Z. A survey of visual neural networks: current trends, challenges and opportunities. Multimedia Systems 29, 693–724 (2023). https://doi.org/10.1007/s00530-022-01003-8
