Elsevier

Neurocomputing

Volume 438, 28 May 2021, Pages 14-33

Deep learning for monocular depth estimation: A review

https://doi.org/10.1016/j.neucom.2020.12.089

Highlights

  • Recent development in deep learning for monocular depth estimation is reviewed.

  • Depth estimation is classified into supervised, unsupervised, and semi-supervised methods.

  • In terms of tasks, depth estimation is summarized as single-task and multi-task methods.

Abstract

Depth estimation is a classic task in computer vision and is of great significance for many applications such as augmented reality, target tracking and autonomous driving. Traditional monocular depth estimation methods predict depth from depth cues under strict requirements, e.g. shape-from-focus/defocus methods require a low depth of field in the scenes and images. Recently, a large body of deep learning methods has been proposed and has shown great promise in handling this traditionally ill-posed problem. This paper reviews the state-of-the-art development in deep learning-based monocular depth estimation. We give an overview of papers published between 2014 and 2020 in terms of training manners and task types. Firstly, we summarize the deep learning models for monocular depth estimation. Secondly, we categorize the various deep learning-based methods for monocular depth estimation. Thirdly, we introduce the publicly available datasets and the evaluation metrics. We also analyze the properties of these methods and compare their performance. Finally, we highlight the challenges in order to inform future research directions.

Introduction

Scene depth estimation plays an important role in computer vision: it enhances the perception and understanding of real three-dimensional scenes, leading to a wide range of applications such as robotic navigation, autonomous driving, and virtual reality [1], [53], [139], [145], [166]. Active depth estimation methods usually utilize lasers, structured light and other reflections on the object surface to obtain depth point clouds, complete surface modeling and estimate scene depth maps [61], [182]. However, obtaining dense and accurate depth maps in this way usually requires substantial manpower and computing resources [101], [178]. Therefore, image-based depth estimation has become the mainstream of research and can be applied in a wide range of applications [89], [135].

The evolution of image-based depth estimation is shown in Fig. 1. In the early period, researchers estimated depth maps from depth cues, such as vanishing points [142], focus and defocus [138], and shadow [181]. However, most of these methods were applied in constrained scenes [138], [142], [181]. With the development of computer vision, many handcrafted features and probabilistic graphical models have been proposed, such as scale-invariant feature transform (SIFT) [88], speeded up robust features (SURF) [7], pyramid histogram of oriented gradient (PHOG) [9], Conditional Random Field (CRF) [66], and Markov Random Field (MRF) [25], which were adopted to predict monocular depth maps with parametric and non-parametric learning in the machine learning process [25], [66], [81]. The advent of deep learning technologies has brought great advantages to image processing [47], [68], [148], [172], especially depth estimation.

Traditional image-based depth estimation methods are usually based on a binocular camera: they calculate the disparity between two 2D images (taken by the binocular camera) through stereo matching and triangulation to obtain a depth map [40], [82], [117], [170], [180]. However, the binocular depth estimation method requires at least two fixed cameras [185], and it is difficult to capture enough features in the image to match when the scene has little or no texture [84]. Therefore, researchers have turned their attention to monocular depth estimation. Monocular depth estimation uses only one camera to obtain an image or video sequence and does not require additional complicated equipment or professional techniques. Because only a single camera is available in most application scenarios, the demand for monocular depth estimation has been increasing in recent years. Since monocular images lack a reliable stereoscopic visual relationship, regressing depth in 3D space is essentially an ill-posed problem [102]. Therefore, researchers have proposed various methods for monocular depth estimation [8], [67].
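
As a minimal sketch of the triangulation step used by binocular methods, the snippet below converts a disparity value into metric depth for a rectified stereo pair using the standard relation Z = f * B / d. The function name, the focal length and the baseline are illustrative assumptions and are not taken from the paper or from any dataset.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Triangulate depth in metres for a rectified stereo pair: Z = f * B / d.

    disparity_px: horizontal pixel offset between the two views
    focal_length_px: focal length in pixels (assumed identical for both cameras)
    baseline_m: distance between the two camera centres in metres
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a failed match or a point at infinity.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical camera parameters, chosen only for illustration.
FOCAL_LENGTH_PX = 700.0
BASELINE_M = 0.5

for disparity in (70.0, 35.0, 7.0):
    depth = disparity_to_depth(disparity, FOCAL_LENGTH_PX, BASELINE_M)
    print(f"disparity {disparity:5.1f} px -> depth {depth:5.1f} m")
```

Depth is inversely proportional to disparity, so a small matching error on a distant (low-disparity) point produces a large depth error. This is one reason textureless regions, where matching is unreliable, are problematic for stereo methods.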

Monocular images use a two-dimensional form to reflect the three-dimensional world. However, one dimension of the scene, namely depth, is lost in the imaging process, which makes it impossible to judge the size and distance of an object or whether it is occluded by another object. Therefore, we need to recover the depth of the monocular image. Based on the depth map, we can judge the size and distance of objects to meet the needs of scene understanding. When the estimated depth map reflects the three-dimensional structure of the scene, we can consider the depth estimation method effective.
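
The loss of depth during imaging can be illustrated with an idealized pinhole camera, which is an assumption made here for illustration and not a model specified in the paper. In the sketch below, a point and a second point twice as far away and twice as far from the optical axis project to the same pixel, so a single image cannot distinguish a small near object from a large far one. The function name and all numeric values are hypothetical.

```python
def project(point_xyz, focal_length_px):
    """Project a 3D point (camera coordinates, z > 0) with a pinhole model."""
    x, y, z = point_xyz
    return (focal_length_px * x / z, focal_length_px * y / z)


FOCAL_LENGTH_PX = 500.0  # hypothetical focal length in pixels

near_point = (1.0, 0.5, 4.0)                      # 4 m from the camera
far_point = tuple(2.0 * c for c in near_point)    # same ray, 8 m from the camera

near_pixel = project(near_point, FOCAL_LENGTH_PX)
far_pixel = project(far_point, FOCAL_LENGTH_PX)

# Both points land on the same pixel: depth is not observable from one view.
print(near_pixel, far_pixel, near_pixel == far_pixel)
```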

This paper focuses on monocular depth estimation: it surveys deep learning-based methods from recent years, summarizes their key characteristics, and compares their performance. Furthermore, this paper describes the limitations of these existing methods and briefly introduces future trends. The remainder of this paper is organized as follows: Section 2 introduces some deep learning models for monocular depth estimation; Section 3 summarizes deep learning-based methods of monocular depth estimation in terms of training manners and task types; Section 4 introduces the common datasets and evaluation metrics of depth estimation, and then analyzes the properties of the methods and compares their performance; Section 5 discusses the challenges and trends of monocular depth estimation; conclusions are drawn in Section 6.

Section snippets

Deep Learning models for monocular depth estimation

This section mainly introduces common deep learning models for monocular depth estimation: Convolutional Neural Network (CNN) [63], Recurrent Neural Network (RNN) [122], and Generative Adversarial Network (GAN) [39].

Deep learning methods for monocular depth estimation

Deep neural networks have played an important role in various areas thanks to their powerful feature learning ability. Deep learning-based monocular depth estimation is the task of learning depth maps from a single 2D color image through a deep neural network, which was first proposed by Eigen et al. [29] in 2014. It was a coarse-to-fine framework, where the coarse network learned the global depth on the entire image to obtain a rough depth map and the fine network learned the local features to

Datasets and metrics

This section introduces the datasets and evaluation metrics of deep learning models for monocular depth estimation.
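
The snippet of this section shown here does not list the metrics themselves. As a hedged illustration, the sketch below computes three error measures that are commonly reported in the monocular depth estimation literature: absolute relative error, root mean squared error, and threshold accuracy (the fraction of pixels with max(pred/gt, gt/pred) below 1.25). Whether these are exactly the metrics tabulated in the paper is an assumption; the function name and the sample values are hypothetical.

```python
import math


def depth_metrics(pred, gt, threshold=1.25):
    """Compare predicted and ground-truth depths given as flat sequences.

    Pixels whose ground truth or prediction is not positive are skipped,
    mirroring the usual practice of masking invalid depth readings.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid depth pairs to evaluate")
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    delta = sum(1 for p, g in pairs if max(p / g, g / p) < threshold) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta": delta}


# Made-up depths in metres; the last ground-truth value is an invalid reading.
predicted = [2.0, 4.5, 12.0, 3.0]
ground_truth = [2.0, 5.0, 8.0, 0.0]

print(depth_metrics(predicted, ground_truth))
```

Lower is better for the two error measures, and higher is better for the threshold accuracy.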

Challenges and trends

Over the past several years, monocular depth estimation based on deep learning has been extensively researched and developed. However, there are still some limitations that need to be overcome.

  • 1) In order to improve the accuracy, researchers deepen the layers of the deep neural networks, which increases the memory usage and space complexity.

  • 2) In multi-task learning, deep learning methods for monocular depth estimation always apply multiple sub-networks or sub-modules to process different

Conclusion

Monocular depth estimation plays an important role in scene understanding and high-accuracy depth maps are beneficial to the realization of multiple applications. This paper introduces related deep learning models and summarizes deep learning-based monocular depth estimation algorithms, from training manners to task types. Furthermore, this paper also summarizes the properties and performance of these monocular depth estimation methods. Finally, this paper identifies the potential challenges

CRediT authorship contribution statement

Yue Ming: Investigation, Formal analysis, Software, Writing - review & editing. Xuyang Meng: Investigation, Formal analysis, Software, Writing - review & editing. Chunxiao Fan: Resources, Supervision, Writing - review & editing. Hui Yu: Methodology, Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (197)

  • Y. Almalioglu, M.R.U. Saputra, P.P. de Gusmao, A. Markham, N. Trigoni, Ganvo: unsupervised deep monocular visual...
  • L. Andraghetti, P. Myriokefalitakis, P.L. Dovesi, B. Luque, M. Poggi, A. Pieropan, S. Mattoccia, Enhancing...
  • M. Arjovsky, S. Chintala, L. Bottou, Wasserstein gan, 2017. arXiv preprint...
  • A. Atapour-Abarghouei et al., Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer
  • A. Atapour-Abarghouei et al., Veritatem dies aperit-temporally consistent depth prediction enabled by a multi-task geometric and semantic scene understanding approach
  • H. Bay, T. Tuytelaars, L. Van Gool, Surf: speeded up robust features, in: European Conference on Computer Vision,...
  • A. Bhoi, Monocular depth estimation: a survey, 2019. arXiv preprint...
  • A. Bosch, A. Zisserman, X. Munoz, Image classification using random forests and ferns, in: 2007 IEEE 11th International...
  • Y. Cao et al., Estimating depth from monocular images as classification using deep fully convolutional residual networks, IEEE Trans. Circuits Syst. Video Technol. (2017)
  • V. Casser et al., Depth prediction without the sensors: leveraging structure for unsupervised learning from monocular videos
  • A. Ceni et al., Interpreting recurrent neural networks behaviour via excitable network attractors, Cogn. Comput. (2020)
  • J.R. Chang et al., Pyramid stereo matching network
  • C. Chen et al., On the over-smoothing problem of cnn based disparity estimation
  • L. Chen, Z. Yang, J. Ma, Z. Luo, Driving scene perception network: real-time joint detection, depth estimation and...
  • P.Y. Chen et al., Towards scene understanding: unsupervised monocular depth estimation with semantic-aware representation
  • W. Chen et al., Single-image depth perception in the wild, Adv. Neural Inf. Process. Syst. (2016)
  • W. Chen et al., Learning single-image depth from videos using quality assessment networks
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase...
  • B.C. Choy et al., Universal correspondence network, Adv. Neural Inf. Process. Syst. (2016)
  • T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to algorithms, third edition thomas h. cormen, charles...
  • G.R. Cross et al., Markov random field texture models, IEEE Trans. Pattern Anal. Mach. Intell. (1983)
  • A. CS Kumar, S.M. Bhandarkar, M. Prasad, Monocular depth prediction using generative adversarial networks, in:...
  • N. Dos Santos Rosa, V. Guizilini, V. Grassi, Sparse-to-continuous: enhancing monocular depth estimation using occupancy...
  • D. Eigen et al., Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture
  • D. Eigen et al., Depth map prediction from a single image using a multi-scale deep network, Adv. Neural Inf. Process. Syst. (2014)
  • J.M. Facil, B. Ummenhofer, H. Zhou, L. Montesano, T. Brox, J. Civera, Cam-convs: Camera-aware multi-scale convolutions...
  • X. Fei et al., Geo-supervised visual depth prediction, IEEE Robot. Autom. Lett. (2019)
  • T. Feng et al., Sganvo: unsupervised deep visual odometry and depth estimation with stacked generative adversarial networks, IEEE Robot. Autom. Lett. (2019)
  • H. Fu et al., Deep ordinal regression network for monocular depth estimation
  • R. Garg, V.K. Bg, G. Carneiro, I. Reid, Unsupervised cnn for single view depth estimation: geometry to the rescue, in:...
  • A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The kitti vision benchmark suite, in: 2012 IEEE...
  • C. Godard et al., Unsupervised monocular depth estimation with left-right consistency
  • C. Godard et al., Digging into self-supervised monocular depth estimation
  • M. Goldman et al., Learn stereo, infer mono: siamese networks for self-supervised, monocular, depth estimation
  • I. Goodfellow et al., Generative adversarial nets, Adv. Neural Inf. Process. Syst. (2014)
  • A.N. Gorban et al., How deep should be the depth of convolutional neural networks: a backyard dog case study, Cogn. Comput. (2019)
  • K. Gregor, I. Danihelka, A. Graves, D.J. Rezende, D. Wierstra, Draw: a recurrent neural network for image generation,...
  • V. Guizilini et al., 3d packing for self-supervised monocular depth estimation
  • V. Guizilini, R. Hou, J. Li, R. Ambrus, A. Gaidon, Semantically-guided representation learning for self-supervised...
  • X. Guo et al., Learning monocular depth by distilling cross-domain stereo networks

    Yue Ming received the B.S. degree in Communication Engineering, the M.Sc. degree in Human–Computer Interaction Engineering, and the Ph.D. degree in Signal and Information Processing from Beijing Jiaotong University, China, in 2006, 2008, and 2013, respectively. She worked as a visiting scholar at Carnegie Mellon University, U.S., between 2010 and 2011. Since 2013, she has been working as a faculty member at Beijing University of Posts and Telecommunications. Her research interests are in the areas of biometrics, computer vision, computer graphics, information retrieval, pattern recognition, etc.

    Xuyang Meng is a PhD student at Beijing University of Posts and Telecommunications. She received her BS degree in engineering from Yanshan University in 2016, and her research interests are computer vision and 3-D reconstruction.

    Chunxiao Fan is currently a professor and the director of the Center for Information Electronic and Intelligence System. She has served as a member of ISO/IEC JTC1/SC6 WG9, ASN.1 (since 2006) and the Chinese Sensor Network Working Group. She has also been appointed an evaluation expert for the Beijing Scientific and Technical Academy Awards. Her research interests include heterogeneous media data analysis, Internet of Things, data mining, communication software and so on. In recent years, she has directed several National Science Foundation projects. She has published more than 30 papers in international journals and conferences, authored and edited three books, and has been granted several invention patents.

    Hui Yu is a Professor with the University of Portsmouth, UK. His research interests include methods and practical development in visual computing, machine learning and AI with the applications focusing on human–machine interaction, multimedia, virtual reality and robotics as well as 4D facial expression generation, perception and analysis. He serves as an Associate Editor for IEEE Transactions on Human–Machine Systems and Neurocomputing journal.

    The work presented in this paper was partly supported by the Natural Science Foundation of China (Grant No. 62076030), the Beijing Natural Science Foundation of China (Grant Nos. L201023 and L182033) and the Fundamental Research Funds for the Central Universities (2019PTB-001).

    1 Co-first author.