Deep learning for monocular depth estimation: A review☆
Introduction
Scene depth estimation plays an important role in computer vision: it enhances the perception and understanding of real three-dimensional scenes and supports a wide range of applications such as robotic navigation, autonomous driving, and virtual reality [1], [53], [139], [145], [166]. Active depth estimation methods usually use lasers, structured light, or other signals reflected from object surfaces to obtain depth point clouds, complete surface modeling, and estimate scene depth maps [61], [182]. However, obtaining dense and accurate depth maps in this way usually incurs high costs in manpower and computing resources [101], [178]. Therefore, image-based depth estimation has become the mainstream of research and can be applied in a wide range of applications [89], [135].
The evolution of image-based depth estimation is shown in Fig. 1. In the early period, researchers estimated depth maps from depth cues such as vanishing points [142], focus and defocus [138], and shadow [181]. However, most of these methods were applicable only to constrained scenes [138], [142], [181]. With the development of computer vision, many hand-crafted features and probabilistic graphical models were proposed, such as scale-invariant feature transform (SIFT) [88], speeded up robust features (SURF) [7], pyramid histogram of oriented gradient (PHOG) [9], Conditional Random Field (CRF) [66], and Markov Random Field (MRF) [25], which were adopted to predict monocular depth maps with parametric and non-parametric learning [25], [66], [81]. The advent of deep learning has brought great advantages to image processing [47], [68], [148], [172], especially to depth estimation.
Traditional image-based depth estimation methods are usually based on a binocular camera: they compute the disparity between the two 2D images through stereo matching and then obtain a depth map by triangulation [40], [82], [117], [170], [180]. However, binocular depth estimation requires at least two fixed cameras [185], and it is difficult to find enough features to match when the scene has little or no texture [84]. Therefore, researchers have turned their attention to monocular depth estimation. Monocular depth estimation uses only one camera to obtain an image or video sequence and requires no additional complicated equipment or professional techniques. Because only a single camera is available in most application scenarios, the demand for monocular depth estimation has been increasing in recent years. Since monocular images lack a reliable stereoscopic visual relationship, regressing depth in 3D space from them is essentially an ill-posed problem [102]. Researchers have therefore proposed various methods for monocular depth estimation [8], [67].
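The triangulation step of the binocular pipeline described above reduces, for a rectified stereo pair, to the standard relation Z = f·B/d (depth equals focal length times baseline divided by disparity). The following is a minimal sketch under that assumption; the function name `disparity_to_depth`, the `min_disparity` cutoff, and the sample numbers are illustrative and do not come from the paper.

```python
import numpy as np


def disparity_to_depth(disparity, focal_px, baseline_m, min_disparity=1e-6):
    """Convert a disparity map (pixels) to depth (metres) via Z = f * B / d.

    Assumes a rectified stereo pair. Pixels whose disparity is at or below
    min_disparity (no match, or a point at infinity) are returned as inf.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > min_disparity
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth


# Illustrative values only: 700 px focal length, 0.5 m baseline.
disparity_map = np.array([[70.0, 35.0], [7.0, 0.0]])
depth_map = disparity_to_depth(disparity_map, focal_px=700.0, baseline_m=0.5)
print(depth_map)
```

Because depth is inversely proportional to disparity, a small matching error at low disparity translates into a large depth error, which is one reason weakly textured regions are hard for stereo methods.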
Monocular images adopt a two-dimensional form to reflect the three-dimensional world. However, one dimension of the scene, namely depth, is lost in the imaging process, which makes it impossible to judge the size and distance of an object or whether it is occluded by another object. Therefore, we need to recover the depth of the monocular image. Based on the depth map, we can judge the size and distance of objects to meet the needs of scene understanding. When the estimated depth map reflects the three-dimensional structure of the scene, we can consider the depth estimation method effective.
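The loss of depth in the imaging process can be illustrated with the pinhole camera model, which is assumed here for simplicity (the paper does not prescribe a camera model). In this sketch, the helper names `project` and `back_project` and the intrinsics are hypothetical: two points of different size and distance land on the same pixel, and only a known depth value makes the 3D position recoverable.

```python
import numpy as np


def project(point_3d, focal_px, cx, cy):
    """Pinhole projection of a camera-frame point (X, Y, Z) to pixel (u, v)."""
    x, y, z = point_3d
    return np.array([focal_px * x / z + cx, focal_px * y / z + cy])


def back_project(pixel, depth, focal_px, cx, cy):
    """Recover the camera-frame point from a pixel once its depth Z is known."""
    u, v = pixel
    return np.array([(u - cx) * depth / focal_px,
                     (v - cy) * depth / focal_px,
                     depth])


# Hypothetical intrinsics, for illustration only.
F, CX, CY = 500.0, 320.0, 240.0

near_point = np.array([0.5, 0.25, 2.0])  # 2 m from the camera
far_point = 4.0 * near_point             # 4x farther and 4x larger offset

pixel_near = project(near_point, F, CX, CY)
pixel_far = project(far_point, F, CX, CY)

# Both points fall on the same pixel: the image alone cannot separate them.
same_pixel = np.allclose(pixel_near, pixel_far)

# With the depth value, the 3D position (and hence size) is recoverable.
recovered = back_project(pixel_near, depth=2.0, focal_px=F, cx=CX, cy=CY)
print(same_pixel, recovered)
```

Every point along the same viewing ray projects to the same pixel, so a depth map is exactly the missing piece of information needed to invert the projection.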
This paper focuses on monocular depth estimation: it surveys recent deep learning-based methods, details their characteristics, and compares their performance. Furthermore, this paper describes the limitations of existing methods and briefly introduces future trends. The remainder of this paper is organized as follows: Section 2 introduces some deep learning models for monocular depth estimation; Section 3 summarizes deep learning-based methods of monocular depth estimation in terms of training manners and task types; Section 4 introduces the common datasets and evaluation metrics of depth estimation, then analyzes their properties and compares their performance; Section 5 discusses the challenges and trends of monocular depth estimation; and conclusions are drawn in Section 6.
Section snippets
Deep learning models for monocular depth estimation
This section mainly introduces common deep learning models for monocular depth estimation: Convolutional Neural Network (CNN) [63], Recurrent Neural Network (RNN) [122], and Generative Adversarial Network (GAN) [39].
Deep learning methods for monocular depth estimation
Deep neural networks have played an important role in various areas thanks to their powerful feature learning ability. Deep learning-based monocular depth estimation is the task of learning depth maps from a single 2D color image through a deep neural network, and was first proposed by Eigen et al. [29] in 2014. Their method was a coarse-to-fine framework, where the coarse network learned the global depth over the entire image to obtain a rough depth map and the fine network learned the local features to
Datasets and metrics
This section introduces the datasets and evaluation metrics of deep learning models for monocular depth estimation.
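The snippet above does not list the metrics themselves. As an illustration, the sketch below computes three measures that are widely reported in the depth estimation literature: absolute relative error (AbsRel), root mean squared error (RMSE), and threshold accuracy (the fraction of pixels with max(pred/gt, gt/pred) below 1.25). Treating these as the metrics meant here is an assumption, and the function name `depth_metrics` and the sample values are hypothetical.

```python
import numpy as np


def depth_metrics(pred, gt, threshold=1.25):
    """Compute AbsRel, RMSE and threshold accuracy over valid pixels.

    Pixels with gt <= 0 are treated as missing ground truth and ignored.
    Predictions at valid pixels are assumed to be strictly positive.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]

    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    delta = np.mean(ratio < threshold)
    return {"abs_rel": abs_rel, "rmse": rmse, "delta": delta}


# Illustrative values only; the last pixel has no ground truth (gt = 0).
gt_depth = np.array([2.0, 4.0, 10.0, 0.0])
pred_depth = np.array([2.0, 4.4, 7.0, 3.0])
metrics = depth_metrics(pred_depth, gt_depth)
print(metrics)
```

Lower is better for AbsRel and RMSE, while higher is better for the threshold accuracy. Masking out pixels without ground truth matters in practice because sensor-based depth maps are often sparse.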
Challenges and trends
Over the past several years, monocular depth estimation based on deep learning has been extensively researched and developed. However, there are still some limitations that need to be overcome.
1) To improve accuracy, researchers add more layers to deep neural networks, which increases memory usage and space complexity.
2) In multi-task learning, deep learning methods for monocular depth estimation always apply multiple sub-networks or sub-modules to process different
Conclusion
Monocular depth estimation plays an important role in scene understanding and high-accuracy depth maps are beneficial to the realization of multiple applications. This paper introduces related deep learning models and summarizes deep learning-based monocular depth estimation algorithms, from training manners to task types. Furthermore, this paper also summarizes the properties and performance of these monocular depth estimation methods. Finally, this paper identifies the potential challenges
CRediT authorship contribution statement
Yue Ming: Investigation, Formal analysis, Software, Writing - review & editing. Xuyang Meng: Investigation, Formal analysis, Software, Writing - review & editing. Chunxiao Fan: Resources, Supervision, Writing - review & editing. Hui Yu: Methodology, Supervision, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (197)
- et al. Survey on deep neural networks in speech and vision systems. Neurocomputing (2020)
- et al. A deep domain adaption model with multi-task networks for planetary gearbox fault diagnosis. Neurocomputing (2020)
- et al. Self-supervised monocular image depth learning and confidence estimation. Neurocomputing (2020)
- et al. Adversarial-learning-based image-to-image transformation: a survey. Neurocomputing (2020)
- et al. A brief survey on semantic segmentation with deep learning. Neurocomputing (2020)
- et al. An improved deep convolutional neural network with multi-scale information for bearing fault diagnosis. Neurocomputing (2019)
- et al. An unsupervised image segmentation method combining graph clustering and high-level feature representation. Neurocomputing (2020)
- et al. StainCNNs: an efficient stain feature learning method. Neurocomputing (2020)
- et al. Single image super-resolution incorporating example-based gradient profile estimation and weighted adaptive p-norm. Neurocomputing (2019)
- et al. Effective image super resolution via hierarchical convolutional neural network. Neurocomputing (2020)
- Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer
- Veritatem dies aperit - temporally consistent depth prediction enabled by a multi-task geometric and semantic scene understanding approach
- Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Trans. Circuits Syst. Video Technol.
- Depth prediction without the sensors: leveraging structure for unsupervised learning from monocular videos
- Interpreting recurrent neural networks behaviour via excitable network attractors. Cogn. Comput.
- Pyramid stereo matching network
- On the over-smoothing problem of CNN based disparity estimation
- Towards scene understanding: unsupervised monocular depth estimation with semantic-aware representation
- Single-image depth perception in the wild. Adv. Neural Inf. Process. Syst.
- Learning single-image depth from videos using quality assessment networks
- Universal correspondence network. Adv. Neural Inf. Process. Syst.
- Markov random field texture models. IEEE Trans. Pattern Anal. Mach. Intell.
- Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture
- Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Process. Syst.
- Geo-supervised visual depth prediction. IEEE Robot. Autom. Lett.
- SGANVO: unsupervised deep visual odometry and depth estimation with stacked generative adversarial networks. IEEE Robot. Autom. Lett.
- Deep ordinal regression network for monocular depth estimation
- Unsupervised monocular depth estimation with left-right consistency
- Digging into self-supervised monocular depth estimation
- Learn stereo, infer mono: siamese networks for self-supervised, monocular, depth estimation
- Generative adversarial nets. Adv. Neural Inf. Process. Syst.
- How deep should be the depth of convolutional neural networks: a backyard dog case study. Cogn. Comput.
- 3D packing for self-supervised monocular depth estimation
- Learning monocular depth by distilling cross-domain stereo networks
Yue Ming received the B.S. degree in Communication Engineering, the M.Sc. degree in Human–Computer Interaction Engineering, and the Ph.D. degree in Signal and Information Processing from Beijing Jiaotong University, China, in 2006, 2008, and 2013, respectively. She worked as a visiting scholar at Carnegie Mellon University, U.S., between 2010 and 2011. Since 2013, she has been a faculty member at Beijing University of Posts and Telecommunications. Her research interests include biometrics, computer vision, computer graphics, information retrieval, and pattern recognition.
Xuyang Meng is a PhD student at Beijing University of Posts and Telecommunications. She received her BS degree in engineering from Yanshan University in 2016, and her research interests are computer vision and 3-D reconstruction.
Chunxiao Fan is currently a professor and the director of the Center for Information Electronic and Intelligence System. She has served as a member of ISO/IEC JTC1/SC6 WG9, ASN.1 (since 2006) and of the Chinese Sensor Network Working Group. She was also selected as an evaluation expert for the Beijing Scientific and Technical Academy Awards. Her research interests include heterogeneous media data analysis, Internet of Things, data mining, and communication software. In recent years, she has directed several National Science Foundation projects. She has published more than 30 papers in international journals and conferences, authored and edited three books, and been granted several invention patents.
Hui Yu is a Professor with the University of Portsmouth, UK. His research interests include methods and practical development in visual computing, machine learning and AI with the applications focusing on human–machine interaction, multimedia, virtual reality and robotics as well as 4D facial expression generation, perception and analysis. He serves as an Associate Editor for IEEE Transactions on Human–Machine Systems and Neurocomputing journal.
- ☆ The work presented in this paper was partly supported by the Natural Science Foundation of China (Grant No. 62076030), the Beijing Natural Science Foundation of China (Grant Nos. L201023 and L182033), and the Fundamental Research Funds for the Central Universities (2019PTB-001).
- 1 Co-first author.