1 Introduction

Visual odometry (VO) is a highly active field of research in computer vision with a plethora of applications in domains such as autonomous driving, robotics, and augmented reality. With a single camera, traditional geometric approaches to VO inherently suffer from the fact that camera trajectory and map can only be estimated up to an unknown scale, which also leads to scale drift. Moreover, sufficient motion parallax is required to estimate motion and structure from successive frames. To avoid these issues, typically more complex sensors such as active depth cameras or stereo rigs are employed. However, these sensors require greater calibration effort and increase the cost of the vision system.

Metric depth can also be recovered from a single image if a priori knowledge about the typical sizes or appearances of objects is used. Deep learning based approaches tackle this by training deep neural networks on large amounts of data. In this paper, we propose a novel approach to monocular visual odometry, Deep Virtual Stereo Odometry (DVSO), which incorporates deep depth predictions into a geometric monocular odometry pipeline. We use deep stereo disparity to form virtual direct image alignment constraints within a framework for windowed direct bundle adjustment (e.g. Direct Sparse Odometry [8]). DVSO achieves performance comparable to state-of-the-art stereo visual odometry systems on the KITTI odometry benchmark. It can even outperform the state-of-the-art geometric VO methods when tuning scale-dependent parameters such as the virtual stereo baseline.

As an additional contribution, we propose a novel stacked residual network architecture that refines disparity estimates in two stages and is trained in a semi-supervised way. Typical supervised learning approaches [6, 24, 25] require ground-truth depth acquired with active sensors such as RGB-D cameras and 3D laser scanners, which is costly to obtain. Requiring a large amount of such labeled data is an additional burden that limits generalization to new environments. Self-supervised [11, 14] and unsupervised learning approaches [49], on the other hand, overcome this limitation and do not require additional active sensors. Commonly, they train the networks on photometric consistency, for example in stereo imagery [11, 14], which reduces the effort of collecting training data. Still, the current self-supervised approaches are not as accurate as supervised methods [23]. We combine self-supervised and supervised training, but avoid the costly collection of LiDAR data. Instead, we make use of Stereo Direct Sparse Odometry (Stereo DSO [40]) to provide accurate sparse 3D reconstructions on the training set. Our deep depth prediction network outperforms the current state-of-the-art methods on KITTI.

A video demonstrating our methods as well as the results is available at https://youtu.be/sLZOeC9z_tw.

Fig. 1. DVSO achieves monocular visual odometry on KITTI on par with state-of-the-art stereo methods. It uses deep-learning based left-right disparity predictions (lower left) for initialization and virtual stereo constraints in an optimization-based direct visual odometry pipeline. This allows for recovering accurate metric estimates.

1.1 Related Work

Deep Learning for Monocular Depth Estimation. Deep learning based approaches have recently achieved great advances in monocular depth estimation. Employing deep neural networks avoids the hand-crafted features used in previous methods [19, 36]. Supervised deep learning [6, 24, 25] has recently shown great success for monocular depth estimation. Eigen et al. [5, 6] propose a two-scale CNN architecture which directly predicts the depth map from a single image. Laina et al. [24] propose a residual network [17] based fully convolutional encoder-decoder architecture [27] with a robust regression loss function. The aforementioned supervised learning approaches need large amounts of ground-truth depth data for training. Self-supervised approaches [11, 14, 44] overcome this limitation by exploiting photoconsistency and geometric constraints to define loss functions, for example, in a stereo camera setup. This way, only stereo images are needed for training, which are typically easier to obtain than accurate depth measurements from active sensors such as 3D lasers or RGB-D cameras. Godard et al. [14] achieve state-of-the-art depth estimation accuracy with a fully self-supervised approach. The semi-supervised scheme proposed by Kuznietsov et al. [23] combines the self-supervised loss with supervision from sparse LiDAR ground truth. They do not need multi-scale depth supervision or left-right consistency in their loss, and achieve better performance than the self-supervised approach of [14]. The limitation of this semi-supervised approach is its requirement for LiDAR data, which is costly to collect. In our approach we use Stereo Direct Sparse Odometry to obtain sparse ground-truth depth for semi-supervised training. Since the extracted depth maps are even sparser than LiDAR data, we additionally employ multi-scale self-supervised training and left-right consistency as in Godard et al. [14]. Inspired by [20, 34], we design a stacked network architecture leveraging the concept of residual learning [17].

Deep Learning for VO/SLAM. In recent years, large progress has been achieved in the development of monocular VO and SLAM methods [8, 9, 31, 32]. Due to projective geometry, metric scale cannot be observed with a single camera [37], which introduces scale drift. A popular approach is hence to use stereo cameras for VO [8, 10, 31], which avoids scale ambiguity and leverages stereo matching with a fixed baseline for estimating 3D structure. While stereo VO delivers more reliable depth estimation, it requires self-calibration for long-term operation [4, 46]. The integration of a second camera also introduces additional costs. Some recent monocular VO approaches have integrated monocular depth estimation [39, 46] to recover the metric scale by scale-matching. CNN-SLAM [39] extends LSD-SLAM [9] by predicting depth with a CNN and refining the depth maps using Bayesian filtering [7, 9]. Their method shows superior performance over monocular SLAM [9, 30, 35, 45] on indoor datasets [15, 38]. Yin et al. [46] propose to use convolutional neural fields and consecutive frames to improve the monocular depth estimation of a CNN. Camera motion is estimated using the refined depth. CodeSLAM [2] focuses on the challenge of dense 3D reconstruction. It jointly optimizes a learned compact representation of the dense geometry with camera poses. Our work tackles the problem of odometry with monocular cameras and integrates deep depth prediction with multi-view stereo to improve camera pose estimation. Another line of research trains networks to directly predict the ego-motion end-to-end using supervised [41] or unsupervised learning [26, 49]. However, the estimated ego-motion of these methods is still far inferior to geometric visual odometry approaches. In our approach, we phrase visual odometry as a geometric optimization problem but incorporate photoconsistency constraints with state-of-the-art deep monocular depth predictions into the optimization. This way, we obtain a highly accurate monocular visual odometry that is not prone to scale drift and achieves results comparable to traditional stereo VO methods.

2 Semi-Supervised Deep Monocular Depth Estimation

In this section, we will introduce our semi-supervised approach to deep monocular depth estimation. It builds on three key ingredients: self-supervised learning from photoconsistency in a stereo setup similar to [14], supervised learning based on accurate sparse depth reconstruction by Stereo DSO, and two-stage refinement of the network predictions in a stacked encoder-decoder architecture.

Fig. 2. Overview of the StackNet architecture.

2.1 Network Architecture

We coin our architecture StackNet since it stacks two sub-networks, SimpleNet and ResidualNet, as depicted in Fig. 2. Both sub-networks are fully convolutional deep neural networks adapted from DispNet [28] with an encoder-decoder scheme. ResidualNet has fewer layers and takes the outputs of SimpleNet as inputs. Its purpose is to refine the disparity maps predicted by SimpleNet by learning an additive residual signal. Similar residual learning architectures have been successfully applied to related deep learning tasks [20, 34]. The detailed network architecture is illustrated in the supplementary material.

SimpleNet. SimpleNet is an encoder-decoder architecture with a ResNet-50 based encoder and skip connections between corresponding encoder and decoder layers. The decoder upprojects the feature maps to the original resolution and generates 4 pairs of disparity maps \( disp _{ simple ,s}^{ left }\) and \( disp _{ simple ,s}^{ right }\) in different resolutions \(s \in [0,3]\). The upprojection is implemented by resize-convolution [33], i.e. a nearest-neighbor upsampling layer by a factor of two followed by a convolutional layer. The usage of skip connections enables the decoder to recover high-resolution results with fine-grained details.
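The resize-convolution upprojection can be sketched in TensorFlow (the framework used in our implementation, Sect. 4.1); the kernel size and channel count are illustrative assumptions, not prescribed by the architecture description:

```python
import tensorflow as tf

def upproject(features, out_channels):
    """Resize-convolution (sketch): nearest-neighbor upsampling by a factor of
    two followed by a convolution [33], avoiding the checkerboard artifacts of
    transposed convolutions. Kernel size 3 and the ELU activation (used for
    SimpleNet, cf. Sect. 4.1) are assumptions of this sketch."""
    x = tf.keras.layers.UpSampling2D(size=2, interpolation="nearest")(features)
    return tf.keras.layers.Conv2D(out_channels, kernel_size=3, padding="same",
                                  activation="elu")(x)
```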

ResidualNet. The purpose of ResidualNet is to further refine the disparity maps predicted by SimpleNet. ResidualNet learns the residual signals \( disp _{ res ,s}\) to the disparity maps \( disp _{ simple ,s}\) (both left and right and for all resolutions). Inspired by FlowNet 2.0 [20], the inputs to ResidualNet contain various information on the prediction and the errors made by SimpleNet: we input \(I^{ left }\), \( disp _{ simple ,0}^{ left }\), \(I^{ right }_{ recons }\), \(I^{ left }_{ recons }\) and \(e_l\), where

  • \(I^{ right }_{ recons }\) is the reconstructed right image by warping \(I^{ left }\) using \( disp _{ simple ,0}^{ right }\).

  • \(I^{ left }_{ recons }\) is the generated left image by back-warping \(I^{ right }_{ recons }\) using \( disp _{ simple ,0}^{ left }\).

  • \(e_l\) is the \(\ell _1\) reconstruction error between \(I^{ left }\) and \(I^{ left }_{ recons }\).

For the warping, rectified stereo images are required, while stereo camera intrinsics and extrinsics are not needed, as our network directly outputs disparities.

The final refined outputs \( disp _s\) are \( disp _s = disp _{ simple ,s} \oplus disp _{res,s}, s \in [0, 3]\), where \(\oplus \) is element-wise summation. The encoder of ResidualNet contains 12 residual blocks in total, and the network predicts residual disparity maps at the same 4 scales as SimpleNet. Adding more layers does not further improve performance in our experiments. Notably, only the left image is used as input to both SimpleNet and ResidualNet, while the right image is not required. Nevertheless, the network outputs refined disparity maps for both the left and the right stereo image. Both facts will be important for our monocular visual odometry approach.
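To make the ResidualNet input assembly above concrete, the following NumPy sketch reconstructs the two images and the error map at scale 0. It uses grayscale images and nearest-neighbor warping for brevity (the network uses the differentiable bilinear sampler [21]), and the sign convention follows our reading of Eqs. (4) and (7):

```python
import numpy as np

def warp_horizontal(img, disp, sign):
    """Warp a grayscale (H, W) image along x by a per-pixel disparity.
    Nearest-neighbor lookup for brevity; the network uses differentiable
    bilinear sampling [21]. sign=+1 samples at x + disp, sign=-1 at x - disp
    (rectified-stereo convention, cf. Eqs. (4) and (7))."""
    h, w = img.shape
    xs = np.tile(np.arange(w), (h, 1))
    src_x = np.clip(np.round(xs + sign * disp).astype(int), 0, w - 1)
    return np.take_along_axis(img, src_x, axis=1)

def residualnet_inputs(i_left, disp_left, disp_right):
    """Assemble the ResidualNet inputs listed above (all at scale 0)."""
    # Reconstruct the right image by warping the left image with disp_right.
    i_right_recons = warp_horizontal(i_left, disp_right, sign=+1)
    # Back-warp it with disp_left to reconstruct the left image.
    i_left_recons = warp_horizontal(i_right_recons, disp_left, sign=-1)
    # l1 reconstruction error between the left image and its reconstruction.
    e_l = np.abs(i_left - i_left_recons)
    return i_left, disp_left, i_right_recons, i_left_recons, e_l
```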

2.2 Loss Function

We define a loss \(\mathcal {L}_s\) at each output scale s, resulting in the total loss \(\mathcal {L} = \sum _{s=0}^{3}\mathcal {L}_s.\) The loss at each scale \(\mathcal {L}_s\) is a linear combination of five terms which are symmetric in left and right images,

$$\begin{aligned} \mathcal {L}_s ={} & \alpha _{U}\left( \mathcal {L}^{ left }_{U} + \mathcal {L}^{ right }_{U}\right) + \alpha _{S}\left( \mathcal {L}^{ left }_{S} + \mathcal {L}^{ right }_{S}\right) + \alpha _{lr}\left( \mathcal {L}^{ left }_{ lr } + \mathcal {L}^{ right }_{ lr }\right) \\ & + \alpha _{ smooth }\left( \mathcal {L}^{ left }_{ smooth } + \mathcal {L}^{ right }_{ smooth }\right) + \alpha _{ occ }\left( \mathcal {L}^{ left }_{ occ } + \mathcal {L}^{ right }_{ occ }\right) , \end{aligned}$$
(1)

where \(\mathcal {L}_{U}\) is a self-supervised loss, \(\mathcal {L}_{S}\) is a supervised loss, \(\mathcal {L}_{ lr }\) is a left-right consistency loss, \(\mathcal {L}_{ smooth }\) is a smoothness term encouraging the predicted disparities to be locally smooth and \(\mathcal {L}_{ occ }\) is an occlusion regularization term. In the following, we detail the left components \(\mathcal {L}^{ left }\) of the loss function at each scale. The right components \(\mathcal {L}^{ right }\) are defined symmetrically.

Self-supervised Loss. The self-supervised loss measures the quality of the reconstructed images. A reconstructed image is generated by warping the input image into the view of the other rectified stereo image. This procedure is fully (sub-)differentiable when using bilinear sampling [21]. Inspired by [14, 47], the quality of the reconstructed image is measured with a combination of the \(\ell _1\) loss and single-scale structural similarity (SSIM) [42]:

$$\begin{aligned} \mathcal {L}_U^{ left } = \frac{1}{N}\sum _{x,y} \left[ \alpha \frac{1-\text {SSIM}\left( I^{ left }(x,y), I^{ left }_{ recons }(x,y)\right) }{2} + (1-\alpha )\left\| I^{ left }(x,y)-I^{ left }_{ recons }(x,y)\right\| _1 \right] , \end{aligned}$$
(2)

with a \(3\times 3\) box filter for SSIM and \(\alpha \) set to 0.84.
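A TensorFlow sketch of Eq. (2); the SSIM stabilization constants c1, c2 are the commonly used values, which the text does not specify, and the small border mismatch caused by the VALID-padded box filter is ignored here:

```python
import tensorflow as tf

def ssim_3x3(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-scale SSIM with a 3x3 box filter (Eq. (2)).
    x, y: image batches of shape (B, H, W, C) with values in [0, 1].
    c1, c2 are assumed standard SSIM constants."""
    pool = lambda t: tf.nn.avg_pool2d(t, ksize=3, strides=1, padding="VALID")
    mu_x, mu_y = pool(x), pool(y)
    var_x = pool(x * x) - mu_x ** 2
    var_y = pool(y * y) - mu_y ** 2
    cov_xy = pool(x * y) - mu_x * mu_y
    num = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def self_supervised_loss(i_left, i_left_recons, alpha=0.84):
    """L_U^left: weighted mix of SSIM dissimilarity and l1 photometric error."""
    ssim_term = tf.reduce_mean((1.0 - ssim_3x3(i_left, i_left_recons)) / 2.0)
    l1_term = tf.reduce_mean(tf.abs(i_left - i_left_recons))
    return alpha * ssim_term + (1.0 - alpha) * l1_term
```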

Supervised Loss. The supervised loss measures the deviation of the predicted disparity map from the disparities estimated by Stereo DSO at a sparse set of pixels:

$$\begin{aligned} \mathcal {L}_S^{ left } = \frac{1}{N}\sum _{(x,y) \in \varOmega _{ DSO , left }}\beta _\epsilon \left( disp ^{ left }(x, y) - { disp }^{ left }_{ DSO }(x,y)\right) \end{aligned}$$
(3)

where \(\varOmega _{ DSO , left }\) is the set of pixels with disparities estimated by DSO and \(\beta _\epsilon (x)\) is the reverse Huber (berHu) norm introduced in [24] which lets the training focus more on larger residuals. The threshold \(\epsilon \) is adaptively set as a batch-dependent value \(\epsilon = 0.2 \max _{(x,y)\in \varOmega _{ DSO , left }} \left| disp ^{ left }(x,y) - disp _{ DSO }^{ left }(x,y)\right| \).
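A sketch of the berHu loss of Eq. (3) in TensorFlow. The mask encodes the sparse pixel set \(\varOmega _{ DSO , left }\); normalizing by the number of DSO pixels is our assumption, since Eq. (3) only writes 1/N:

```python
import tensorflow as tf

def berhu_supervised_loss(disp_pred, disp_dso, mask):
    """L_S^left (Eq. (3)): reverse Huber (berHu) penalty on the deviation from
    the sparse Stereo DSO disparities.
    disp_pred, disp_dso: (B, H, W) disparity maps.
    mask: boolean (B, H, W) map, True only on pixels in Omega_DSO."""
    valid = tf.cast(mask, tf.float32)
    residual = tf.abs(disp_pred - disp_dso) * valid  # zero outside Omega_DSO
    # Batch-adaptive threshold: 20% of the largest residual on DSO pixels.
    eps = 0.2 * tf.reduce_max(residual)
    # berHu [24]: l1 below the threshold, scaled quadratic above it.
    berhu = tf.where(residual <= eps,
                     residual,
                     (residual ** 2 + eps ** 2) / (2.0 * eps + 1e-12))
    return tf.reduce_sum(berhu) / (tf.reduce_sum(valid) + 1e-12)
```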

Left-Right Disparity Consistency Loss. Given only the left image as input, the network predicts the disparity map of the left as well as the right image as in [14]. As proposed in [14, 47], consistency between the left and right disparity image is improved by

$$\begin{aligned} \mathcal {L}_{ lr }^{ left } = \frac{1}{N}\sum _{x,y}\Bigl | disp ^{ left }(x, y) - disp ^{ right }(x - disp ^{ left }(x, y), y) \Bigr |. \end{aligned}$$
(4)
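A sketch of Eq. (4), again with nearest-neighbor lookup instead of the bilinear sampling used during training:

```python
import tensorflow as tf

def lr_consistency_loss(disp_left, disp_right):
    """L_lr^left (Eq. (4)): the left disparity at (x, y) should match the right
    disparity sampled at (x - disp_left(x, y), y).
    disp_left, disp_right: (B, H, W) tensors with static shapes."""
    b, h, w = disp_left.shape
    xs = tf.broadcast_to(tf.range(w, dtype=tf.float32), (b, h, w))
    src_x = tf.clip_by_value(
        tf.cast(tf.round(xs - disp_left), tf.int32), 0, w - 1)
    # Gather the right disparity at the warped x coordinate per pixel.
    disp_right_warped = tf.gather(disp_right, src_x, axis=2, batch_dims=2)
    return tf.reduce_mean(tf.abs(disp_left - disp_right_warped))
```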

Disparity Smoothness Regularization. Depth reconstruction based on stereo image matching is an ill-posed problem on its own: the depth of homogeneously textured areas and occluded areas cannot be determined. For these areas, we apply the regularization term

$$\begin{aligned} \mathcal {L}_{ smooth }^{ left } = \frac{1}{N} \sum _{x,y}\left| \nabla ^2_x disp ^{ left }(x, y)\right| e^{-\left\| \nabla ^2_x I^{ left }(x, y) \right\| } + \left| \nabla ^2_y disp ^{ left }(x, y)\right| e^{-\left\| \nabla ^2_y I^{ left }(x, y) \right\| } \end{aligned}$$
(5)

that assumes that the predicted disparity map should be locally smooth. We use a second-order smoothness prior [43] and downweight it when the image gradient is high [18].

Occlusion Regularization. \(\mathcal {L}_{ smooth }^{ left }\) by itself tends to generate a shadow area in which values gradually change from foreground to background due to stereo occlusion. To favor background depths and hard transitions at occlusions [48], we impose \(\mathcal {L}_{ occ }^{ left }\), which penalizes the total sum of absolute disparities. The combination of the smoothness and occlusion regularizers prefers to directly take the (smaller) nearby background disparity, which better corresponds to the assumption that the occluded region belongs to the uncovered background:

$$\begin{aligned} \mathcal {L}_{ occ }^{ left } = \frac{1}{N} \sum _{x, y}\Bigl | disp ^{ left }(x, y)\Bigr |. \end{aligned}$$
(6)
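The two regularizers of Eqs. (5) and (6) can be sketched as follows; taking the channel mean inside the exponential edge weight is our assumption for the norm of the image Laplacian:

```python
import tensorflow as tf

def second_diff_x(t):
    """Second-order finite difference along the width axis of (B, H, W, C)."""
    return t[:, :, 2:, :] - 2.0 * t[:, :, 1:-1, :] + t[:, :, :-2, :]

def second_diff_y(t):
    """Second-order finite difference along the height axis."""
    return t[:, 2:, :, :] - 2.0 * t[:, 1:-1, :, :] + t[:, :-2, :, :]

def smoothness_and_occlusion(disp, image):
    """Eqs. (5) and (6): edge-aware second-order smoothness and the occlusion
    regularizer on the total absolute disparity.
    disp: (B, H, W, 1) disparities, image: (B, H, W, C)."""
    # Down-weight the smoothness prior where the image Laplacian is strong;
    # the channel mean is an assumption of this sketch.
    w_x = tf.exp(-tf.reduce_mean(tf.abs(second_diff_x(image)), 3, keepdims=True))
    w_y = tf.exp(-tf.reduce_mean(tf.abs(second_diff_y(image)), 3, keepdims=True))
    l_smooth = (tf.reduce_mean(tf.abs(second_diff_x(disp)) * w_x)
                + tf.reduce_mean(tf.abs(second_diff_y(disp)) * w_y))
    l_occ = tf.reduce_mean(tf.abs(disp))  # Eq. (6)
    return l_smooth, l_occ
```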

3 Deep Virtual Stereo Odometry

Deep Virtual Stereo Odometry (DVSO) builds on the windowed sparse direct bundle adjustment formulation of monocular DSO. We use our disparity predictions in DSO in two key ways: firstly, we initialize the depth maps of new keyframes from the disparities. Beyond this rather straightforward approach, we also incorporate virtual direct image alignment constraints into the windowed direct bundle adjustment of DSO. We obtain these constraints by warping images with the depth estimated in bundle adjustment and the right disparities predicted by our network, assuming a virtual stereo setup. As shown in Fig. 3, DVSO integrates both the predicted left disparities and the predicted right disparities for the left image. The right image of the stereo setup is not used by our VO method at any stage, making it a monocular VO method.

In the following, we use \(D^{L}\) and \(D^{R}\) as shorthands to represent the predicted left (\(disp_0^{left}\)) and right disparity map (\(disp_0^{right}\)) at scale \(s = 0\), respectively. When using purely geometric cues, scale drift is one of the main sources of error of monocular VO due to scale unobservability [37]. In DVSO we use the left disparity map \(D^{L}\) predicted by StackNet for initialization instead of randomly initializing the depth like in monocular DSO [8]. The disparity value of an image point with coordinate \(\mathbf {p}\) is converted to the inverse depth \(d_\mathbf {p}\) using the rectified camera intrinsics and stereo baseline of the training set of StackNet [16], \(d_{\mathbf {p}} = \frac{D^{L}(\mathbf {p})}{f_xb}\). In this way, the initialization of DVSO becomes more stable than monocular DSO and the depths are initialized with a consistent metric scale.
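As a small worked example of this conversion (with roughly KITTI-like calibration values, used purely for illustration):

```python
def disparity_to_inverse_depth(disp_left, fx, baseline):
    """Convert a predicted left disparity (in pixels) into the inverse depth
    used to initialize DVSO: d_p = D^L(p) / (f_x * b), with f_x and b taken
    from the rectified camera of the StackNet training data."""
    return disp_left / (fx * baseline)

# Illustration with roughly KITTI-like values (fx ~ 718 px, b ~ 0.54 m):
# a disparity of 20 px maps to an inverse depth of about 0.05 m^-1,
# i.e. a depth of roughly 19 m.
```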

The point selection strategy of DVSO is similar to that of monocular DSO [8], but we additionally introduce a left-right consistency check (similar to Equation (4)) to filter out pixels which likely lie in occluded areas:

$$\begin{aligned} e_{lr} = \Bigl | D^{L}(\mathbf {p}) - D^{R}(\mathbf {p'}) \Bigr | \quad \text {with} \quad \mathbf {p'} = \mathbf {p} - \begin{bmatrix} D^L(\mathbf {p})&0\\ \end{bmatrix}^\top . \end{aligned}$$
(7)

The pixels with \(e_{lr} > 1\) are not selected.
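A sketch of this consistency check for a single candidate pixel, using nearest-neighbor lookup for brevity:

```python
import numpy as np

def passes_lr_check(disp_left, disp_right, x, y, threshold=1.0):
    """Eq. (7): keep a candidate pixel p = (x, y) only if its left disparity
    agrees with the right disparity at p' = p - [D^L(p), 0]^T."""
    d_l = disp_left[y, x]
    x_prime = int(np.clip(np.round(x - d_l), 0, disp_right.shape[1] - 1))
    e_lr = abs(d_l - disp_right[y, x_prime])
    return e_lr <= threshold  # pixels with e_lr > 1 are not selected
```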

Fig. 3. System overview of DVSO. Every new frame is used for visual odometry and fed into the proposed StackNet to predict left and right disparity. The predicted left and right disparities are used for depth initialization, while the right disparity is used to form the virtual stereo term in direct sparse bundle adjustment.

Every new frame is first tracked with respect to the reference keyframe using direct image alignment in a coarse-to-fine manner [8]. Afterwards, DVSO decides whether a new keyframe has to be created for the new frame, following the criteria proposed in [8]. When a new keyframe is created, the temporal multi-view energy function \(E_{photo} := \sum _{i\in \mathcal {F}}\sum _{\mathbf {p} \in \mathcal {P}_i}\sum _{j \in \text {obs}(\mathbf {p})}E_{ij}^{\mathbf {p}}\) needs to be optimized, where \(\mathcal {F}\) is a fixed-size window containing the active keyframes, \(\mathcal {P}_i\) is the set of points selected from its host keyframe with index i and \(j \in \text {obs}(\mathbf {p})\) is the index of a keyframe which observes \(\mathbf {p}\). \(E_{ij}^{\mathbf {p}}\) is the photometric error of the point \(\mathbf {p}\) when projected from the host keyframe \(I_i\) onto the other keyframe \(I_j\):

$$\begin{aligned} E_{ij}^{\mathbf {p}} := \omega _{\mathbf {p}} \left\| (I_j[\mathbf {\tilde{p}}] - b_j) - \frac{e^{a_j}}{e^{a_i}}(I_i[\mathbf {p}] - b_i) \right\| _\gamma , \end{aligned}$$
(8)

where \(\mathbf {\tilde{p}}\) is the projected image coordinate obtained using the relative rotation matrix \(\mathbf R \in SO(3)\) and translation vector \(\mathbf t \in \mathbb {R}^3\) [16], \(\mathbf {\tilde{p}} = \varPi _c\left( \mathbf {R}\varPi _c^{-1}\left( \mathbf {p}, d_{\mathbf {p}}\right) + \mathbf {t}\right) \), where \(\varPi _\mathbf {c}\) and \(\varPi _\mathbf {c}^{-1}\) are the camera projection and back-projection functions. The parameters \(a_i\), \(a_j\), \(b_i\) and \(b_j\) model an affine brightness transformation [8]. The weight \(\omega _{\mathbf {p}}\) down-weights points with high image gradient [8], with the intuition that the error originating from bilinear interpolation of the discrete image values is larger for such points. \(||\cdot ||_{\gamma }\) is the Huber norm with threshold \(\gamma \). For a detailed explanation of the energy function, please refer to [8].
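The core of Eq. (8) can be sketched as follows; the Huber norm, the gradient-dependent weight \(\omega _{\mathbf {p}}\) and DSO's residual pattern with bilinear interpolation are omitted, and a nearest-pixel lookup is used instead:

```python
import numpy as np

def backproject(p, inv_depth, K):
    """Pi_c^{-1}: pixel p = (x, y) with inverse depth d_p to a 3D point."""
    x, y = p
    z = 1.0 / inv_depth
    return np.array([(x - K[0, 2]) / K[0, 0] * z,
                     (y - K[1, 2]) / K[1, 1] * z,
                     z])

def project(X, K):
    """Pi_c: 3D point to pixel coordinates."""
    return np.array([K[0, 0] * X[0] / X[2] + K[0, 2],
                     K[1, 1] * X[1] / X[2] + K[1, 2]])

def photometric_residual(I_i, I_j, p, inv_depth, R, t, K, a_i, b_i, a_j, b_j):
    """Residual inside Eq. (8): project p from host keyframe I_i into keyframe
    I_j and compare affine-brightness-corrected intensities (sketch only)."""
    p_tilde = project(R @ backproject(p, inv_depth, K) + t, K)
    u, v = int(round(p_tilde[0])), int(round(p_tilde[1]))
    return (I_j[v, u] - b_j) \
        - (np.exp(a_j) / np.exp(a_i)) * (I_i[p[1], p[0]] - b_i)
```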

To further improve the accuracy of DVSO, inspired by Stereo DSO [40], which couples a static stereo term with the temporal multi-view energy function, we introduce a novel virtual stereo term \(E_i^{\dagger \mathbf {p}}\) for each point \(\mathbf {p}\):

$$\begin{aligned} E_{i}^{\dagger \mathbf {p}} = \omega _{\mathbf {p}} \left\| I_i^\dagger \left[ \mathbf {p}^\dagger \right] - I_i\left[ \mathbf {p}\right] \right\| _\gamma \quad \text {with} \quad I_i^\dagger \left[ \mathbf {p}^\dagger \right] = I_i\left[ \mathbf {p}^\dagger - \begin{bmatrix} D^{R}\left( \mathbf {p}^\dagger \right)&0 \\ \end{bmatrix}^\top \right] , \end{aligned}$$
(9)

where \(\mathbf {p}^\dagger = \varPi _\mathbf {c}(\varPi _\mathbf {c}^{-1}(\mathbf {p}, d_{\mathbf {p}}) + \mathbf {t}_b)\) is the virtual projected coordinate of \(\mathbf {p}\) using the vector \(\mathbf {t}_b\) denoting the virtual stereo baseline which is known during the training of StackNet. The intuition behind this term is to optimize the estimated depth of the visual odometry to become consistent with the disparity prediction of StackNet. Instead of imposing the consistency directly on the estimated and predicted disparities, we formulate the residuals in photoconsistency which better reflects the uncertainties of the prediction of StackNet and also keeps the unit of the residuals consistent with the temporal direct image alignment terms.
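A sketch of the residual inside Eq. (9), reusing the backproject and project helpers from the previous sketch; the concrete value and sign of the baseline vector \(\mathbf {t}_b\) are illustrative assumptions that depend on the chosen coordinate convention:

```python
def virtual_stereo_residual(I_i, disp_right, p, inv_depth, t_b, K):
    """Residual inside Eq. (9): project p into the virtual right view using
    the baseline vector t_b (e.g. roughly [-0.54, 0, 0] for a KITTI-like
    setup; sign depends on the convention), then sample the virtual right
    image I_i^dagger by reading the left image at p_dagger - [D^R(p_dagger), 0].
    Nearest-pixel lookups replace bilinear interpolation for brevity."""
    p_dagger = project(backproject(p, inv_depth, K) + t_b, K)
    u, v = int(round(p_dagger[0])), int(round(p_dagger[1]))
    u_src = int(round(p_dagger[0] - disp_right[v, u]))
    return float(I_i[v, u_src]) - float(I_i[p[1], p[0]])
```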

We then optimize the total energy

$$\begin{aligned} E_{photo} := \sum _{i\in \mathcal {F}}\sum _{\mathbf {p} \in \mathcal {P}_i} \left( \lambda E_{i}^{\dagger \mathbf {p}} + \sum _{j \in \text {obs}(\mathbf {p})}E_{ij}^{\mathbf {p}} \right) , \end{aligned}$$
(10)

where the coupling factor \(\lambda \) balances the temporal and the virtual stereo terms. All parameters of the total energy are jointly optimized using the Gauss-Newton method [8]. In order to keep a fixed size of the active window (\(N=7\) keyframes in our experiments), old keyframes are removed from the system by marginalization using the Schur complement [8]. Unlike pure sliding-window bundle adjustment, parameter estimates outside the optimization window, including camera poses and depths, are also incorporated into the optimization through a marginalization prior. In contrast to the MSCKF [29], the depths of pixels are explicitly maintained in the state and optimized for. In our optimization framework we trade off predicted depth and triangulated depth using robust norms.

4 Experiments

We quantitatively compare our StackNet with other state-of-the-art monocular depth prediction methods on the publicly available KITTI dataset [12]. In the supplementary material, we demonstrate results on the Cityscapes dataset [3] and the Make3D dataset [36] to show the generalization ability. For DVSO, we evaluate its tracking accuracy on the KITTI odometry benchmark against other state-of-the-art monocular as well as stereo visual odometry systems. In the supplementary material, we also demonstrate results on the Frankfurt sequence of the Cityscapes dataset to show the generalization of DVSO.

Fig. 4. Qualitative comparison with state-of-the-art methods. The ground truth is interpolated for better visualization. Our approach shows better prediction on thin structures than the self-supervised approach [14], and delivers more detailed disparity maps than the semi-supervised approach using LiDAR data [23].

4.1 Monocular Depth Estimation

Dataset. We train StackNet using the train/test split (K) of Eigen et al. [6]. The training set contains 23,488 images from 28 scenes belonging to the categories “city”, “residential” and “road”. We use 22,600 of these images for training and the remaining ones for validation. We further split K into 2 subsets \(\mathbf K _o\) and \(\mathbf K _r\). \(\mathbf K _o\) contains the images of the sequences which appear in the training set (but not the test set) of the KITTI odometry benchmark, on which we use Stereo DSO [40] to extract sparse ground-truth depth data. \(\mathbf K _r\) contains the remaining images in K. Specifically, \(\mathbf K _o\) contains the images of sequences 01, 02, 06, 08, 09 and 10 of the KITTI odometry benchmark.

Implementation Details. StackNet is implemented in TensorFlow [1] and trained from scratch on a single Titan X Pascal GPU. We resize the images to \(512 \times 256\) for training; inference takes less than 40 ms including the I/O overhead. The loss weights are set to \(\alpha _{U} = 1\), \(\alpha _{S} = 10\), \(\alpha _{lr} = 1\), \(\alpha _{ smooth } = 0.1/2^s\) and \(\alpha _{ occ } = 0.01\), where s is the output scale. As suggested by [14], we use exponential linear units (ELUs) for SimpleNet, while we use leaky rectified linear units (Leaky ReLUs) for ResidualNet. We first train SimpleNet on \(\mathbf K _o\) in the semi-supervised way for 80 epochs with a batch size of 8 using the Adam optimizer [22]. The learning rate is initially set to \(\lambda = 10^{-4}\) for the first 50 epochs and is halved every 15 epochs afterwards. Then we train SimpleNet with \(\lambda = 5 \times 10^{-5}\) on \(\mathbf K _r\) for 40 epochs in the self-supervised way, i.e. without \(\mathcal {L}_S\). In the end, we train again on \(\mathbf K _o\) without \(\mathcal {L}_U\) using \(\lambda = 10^{-5}\) for 5 epochs. We explain the dataset schedule as well as the parameter tuning in detail in the supplementary material.
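For concreteness, one possible reading of the SimpleNet learning-rate schedule on \(\mathbf K _o\) (the exact halving points beyond epoch 50 are our assumption):

```python
def simplenet_lr(epoch):
    """Assumed SimpleNet schedule on K_o (80 epochs, Adam): 1e-4 for the
    first 50 epochs, then halved every 15 epochs."""
    if epoch < 50:
        return 1e-4
    return 1e-4 * 0.5 ** ((epoch - 50) // 15 + 1)
```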

After training SimpleNet, we freeze its weights and train StackNet by cascading ResidualNet. StackNet is trained with \(\lambda = 5 \times 10^{-5}\) using the same dataset schedule but with fewer epochs, i.e. 30, 15 and 3 epochs, respectively. We apply random gamma, brightness and color augmentations [14]. We also employ the post-processing for the left disparities proposed by Godard et al. [14] to reduce the effect of stereo disocclusions. In the supplementary material we also provide an ablation study on the various loss terms.

KITTI. Table 1 shows the evaluation results using the error metrics of [6]. We apply the image crop used by Eigen et al. [6] in order to compare with [14, 23] within different depth ranges. The best performance of our network is achieved with the dataset schedule \(\mathbf K _o \rightarrow \mathbf K _r \rightarrow \mathbf K _o \) as described above. We outperform the state-of-the-art self-supervised approach of Godard et al. [14] by a large margin. Our method also outperforms the state-of-the-art semi-supervised method using LiDAR ground truth proposed by Kuznietsov et al. [23] on all metrics except for the less restrictive \(\delta < 1.25^2\) and \(\delta < 1.25^3\).

Figure 4 shows a qualitative comparison with other state-of-the-art methods. Compared to the semi-supervised approach, our results contain more details and deliver comparable predictions on thin structures such as poles. Although the results of Godard et al. [14] appear more detailed in some parts, they are not actually accurate, as the quantitative evaluation shows. In general, the predictions of Godard et al. [14] on thin objects are not as accurate as ours. In the supplementary material, we show the error maps for the predicted depth maps. Figure 5 further shows the advantages of our method compared to the state-of-the-art self-supervised and semi-supervised approaches. The results of Godard et al. [14] are predicted by their network trained on both the Cityscapes dataset and the KITTI dataset. On the wall of the far building in the left figure, our network predicts a more consistent depth on the surface, while the prediction of the self-supervised network shows strong checkerboard artifacts, which are clearly inaccurate. The semi-supervised approach also shows checkerboard artifacts, though much weaker ones. The right side of the figure shows shadow artifacts around the boundaries of the traffic sign for the approach of Godard et al. [14], while the result of Kuznietsov et al. [23] fails to predict the structure. Please refer to our supplementary material for further results. We also demonstrate how our trained depth prediction network generalizes to other datasets in the supplementary material.

Table 1. Evaluation results on the KITTI [13] Raw test split of Eigen et al. [6]. CS refers to the Cityscapes dataset [3]. Upper part: depth range 0–80 m, lower part: 1–50 m. All results are obtained using the crop from [6]. Our SimpleNet trained on \(\mathbf K _o\) outperforms [14] (self-supervised) trained on CS and K. StackNet also outperforms semi-supervision with LiDAR [23] on most metrics.
Fig. 5. Qualitative results on Eigen et al.'s KITTI Raw test split. The result of Godard et al. [14] shows a strong shadow effect around object contours, while our result does not. The result of Kuznietsov et al. [23] fails to predict the traffic sign. Both other methods [14, 23] predict checkerboard artifacts on the far building, while our approach exhibits far fewer such artifacts.

4.2 Monocular Visual Odometry

KITTI Odometry Benchmark. The KITTI odometry benchmark contains 11 (0–10) training sequences and 11 (11–21) test sequences. Ground-truth 6D poses are provided for the training sequences, whereas for the test sequences evaluation results are obtained by submitting to the KITTI website. We use the error metrics proposed in [12].

We first provide an ablation study for DVSO to show the effectiveness of the design choices in our approach. In Table 2 we give results for DVSO in different variants with the following components: initializing the depth with the left disparity prediction (in), using the right disparity for the virtual stereo term in windowed bundle adjustment (vs), checking left-right disparity consistency for point selection (lr), and tuning the virtual stereo baseline (tb). The intuition behind the virtual stereo baseline is that StackNet is trained over various camera parameters and hence provides a depth scale for an average baseline. For tb, we therefore tune the scale factors of sequences with different camera intrinsics to better align the estimated scale with the ground truth. Baselines are tuned for each of the 3 different camera parameter sets in the training set individually, using grid search on one training sequence. Specifically, we tuned the baselines on sequences 00, 03 and 05, which correspond to the 3 different camera parameter sets. The test set contains the same camera parameter sets as the training set and we map the virtual baselines for tb correspondingly. Monocular DSO (after Sim(3) alignment) is also shown as a baseline. The results show that our full approach achieves the best average performance. Our StackNet also adds significantly to the performance of DVSO compared with using depth predictions from [14].

Table 2. Ablation study for DVSO. \(^*\) and \(^\dagger \) indicate the sequences used and not used for training StackNet, respectively. \(t_{rel}\)(\(\%\)) and \(r_{rel}\)(\(^\circ \)) are translational- and rotational RMSE, respectively. Both \(t_{rel}\) and \(r_{rel}\) are averaged over 100 to 800 m intervals. in: \(D^{L}\) is used for depth initialization. vs: virtual stereo term is used with \(D^{R}\). lr: left-right disparity consistency is checked using predictions. tb: tuned virtual baseline is used. DVSO’([14]): full (invslrtb) with depth from [14]. DVSO: full with depth from StackNet. Best results are shown as bold, second best italic. DVSO clearly outperforms the other variants.
Table 3. Comparison with state-of-the-art stereo visual odometry. DVSO: our full approach (invslrtb). Global optimization and loop-closure are turned off for stereo ORB-SLAM2 and Stereo LSD-SLAM. DVSO (monocular) achieves comparable performance to these stereo methods.
Fig. 6. Results on KITTI odometry seq. 00. Top: comparisons with monocular methods (Sim(3)-aligned) and stereo methods. DVSO provides significantly more consistent trajectories than other monocular methods and compares well to stereo approaches. Bottom: DVSO with StackNet produces a more accurate trajectory and map than with [14].

Fig. 7. Evaluation results on the KITTI odometry test set. We show translational and rotational errors with respect to path length intervals. For translational errors, DVSO achieves performance comparable to Stereo LSD-SLAM, while for rotational errors, DVSO achieves results comparable to Stereo DSO and better than all other methods. Note that with virtual baseline tuning, DVSO achieves the best performance among all the methods evaluated.

Table 4. Comparison with deep learning approaches. Note that DeepVO [41] is trained on sequences 00, 02, 08 and 09 of the KITTI odometry benchmark. UnDeepVO [26] and SfMLearner [49] are trained unsupervised and end-to-end on sequences 00–08. Results of DeepVO and UnDeepVO are taken from [41] and [26], respectively, while for SfMLearner we ran their pre-trained model. Our DVSO clearly outperforms the state-of-the-art deep learning based VO methods.

We also compare DVSO with other state-of-the-art stereo visual odometry systems on sequences 00–10. The sequences marked with \(^*\) are used for training StackNet and the sequences marked with \(^\dagger \) are not used for training the network. In Table 3 and the following tables, DVSO denotes our full approach with baseline tuning (invslrtb). The average RMSE of DVSO without baseline tuning is better than that of Stereo LSD-VO, but not as good as Stereo DSO [40] or ORB-SLAM2 [31] (stereo, without global optimization and loop closure). Importantly, DVSO uses only monocular images. With baseline tuning, DVSO achieves even better average performance than all other stereo systems on both rotational and translational errors. Figure 6 shows the estimated trajectory on sequence 00. Both monocular ORB-SLAM2 and DSO suffer from strong scale drift, while DVSO largely eliminates the scale drift. We also show the estimated trajectory on sequence 00 obtained by running DVSO with the depth maps predicted by Godard et al. [14], using their model trained on the Cityscapes and the KITTI dataset. As Fig. 6 shows, our depth predictions lead to a more accurate trajectory. Figure 7 shows the evaluation results on sequences 11–21, obtained by submitting DVSO with and without baseline tuning to the KITTI odometry benchmark. Note that in Fig. 7, Stereo LSD-SLAM and ORB-SLAM2 are both full stereo SLAM approaches with global optimization and loop closure. For qualitative comparisons of further estimated trajectories, please refer to our supplementary material.

We also compare DVSO with DeepVO [41], UnDeepVO [26] and SfMLearner [49], which are deep learning based visual odometry systems trained end-to-end on KITTI. As shown in Table 4, DVSO achieves better performance than these end-to-end approaches on all available sequences. Table 4 also shows a comparison with the deep learning based scale recovery method for monocular VO proposed by Yin et al. [46]; DVSO outperforms their method as well. In the supplementary material, we also show the estimated trajectory on the Cityscapes Frankfurt sequence to demonstrate generalization capabilities.

5 Conclusion

We presented a novel monocular visual odometry system, DVSO, which recovers metric scale and reduces scale drift in geometric monocular VO. A deep learning approach predicts monocular depth maps for the input images, which are used to initialize the sparse depths in DSO at a consistent metric scale. The odometry is further improved by a novel virtual stereo term that couples the depth estimated in windowed bundle adjustment with the monocular depth predictions. For monocular depth prediction we have presented a semi-supervised deep learning approach which utilizes a self-supervised image reconstruction loss and sparse depth reconstructions from Stereo DSO as ground truth for supervision. A stacked network architecture predicts state-of-the-art refined disparity estimates.

Our evaluation conducted on the KITTI odometry benchmark demonstrates that DVSO outperforms the state-of-the-art monocular methods by a large margin and achieves comparable results to stereo VO methods. With virtual baseline tuning, DVSO can even outperform state-of-the-art stereo VO methods, i.e., Stereo LSD-VO, ORB-SLAM2 without global optimization and loop closure, and Stereo DSO, while using only monocular images.

The key practical benefit of the proposed method is that it allows us to recover accurate and scale-consistent odometry with only a single camera. Future work could comprise fine-tuning of the network inside the odometry pipeline end-to-end. This could enable the system to adapt online to new scenes and camera setups. Given that the deep net was trained on driving sequences, in future work we also plan to investigate how much the proposed approach can generalize to other camera trajectories and environments.