1 Introduction

Visual odometry (VO) is a highly active field of research in computer vision with a plethora of applications in domains such as autonomous driving, robotics, and augmented reality. With a single camera, traditional geometric approaches to VO inherently suffer from the fact that camera trajectory and map can only be estimated up to an unknown scale, which also leads to scale drift. Moreover, sufficient motion parallax is required to estimate motion and structure from successive frames. To avoid these issues, typically more complex sensors such as active depth cameras or stereo rigs are employed. However, these sensors require greater calibration effort and increase the cost of the vision system.

Metric depth can also be recovered from a single image if a priori knowledge about the typical sizes or appearances of objects is used. Deep learning based approaches tackle this by training deep neural networks on large amounts of data. In this paper, we propose a novel approach to monocular visual odometry, Deep Virtual Stereo Odometry (DVSO), which incorporates deep depth predictions into a geometric monocular odometry pipeline. We use deep stereo disparity to form virtual direct image alignment constraints within a framework for windowed direct bundle adjustment (e.g. Direct Sparse Odometry [8]). DVSO achieves performance comparable to state-of-the-art stereo visual odometry systems on the KITTI odometry benchmark. It can even outperform the state-of-the-art geometric VO methods when tuning scale-dependent parameters such as the virtual stereo baseline.

As an additional contribution, we propose a novel stacked residual network architecture that refines disparity estimates in two stages and is trained in a semi-supervised way. Typical supervised learning approaches [6, 24, 25] require ground-truth depth acquired with active sensors such as RGB-D cameras and 3D laser scanners, which is costly to obtain. Requiring a large amount of such labeled data is an additional burden that limits generalization to new environments. Self-supervised [11, 14] and unsupervised learning approaches [49], on the other hand, overcome this limitation and do not require additional active sensors. Commonly, they train the networks on photometric consistency, for example in stereo imagery [11, 14], which reduces the effort of collecting training data. Still, the current self-supervised approaches are not as accurate as supervised methods [23]. We combine self-supervised and supervised training, but avoid the costly collection of LiDAR data. Instead, we make use of Stereo Direct Sparse Odometry (Stereo DSO [40]) to provide accurate sparse 3D reconstructions on the training set. Our deep depth prediction network outperforms the current state-of-the-art methods on KITTI.

A video demonstrating our methods as well as the results is available at https://youtu.be/sLZOeC9z_tw.

Fig. 1. DVSO achieves monocular visual odometry on KITTI on par with state-of-the-art stereo methods. It uses deep-learning based left-right disparity predictions (lower left) for initialization and virtual stereo constraints in an optimization-based direct visual odometry pipeline. This allows for recovering accurate metric estimates.

1.1 Related Work

Deep Learning for Monocular Depth Estimation. Deep learning based approaches have recently achieved great advances in monocular depth estimation. Employing deep neural networks avoids the hand-crafted features used in previous methods [19, 36]. Supervised deep learning [6, 24, 25] has recently shown great success for monocular depth estimation. Eigen et al. [5, 6] propose a two-scale CNN architecture which directly predicts the depth map from a single image. Laina et al. [24] propose a residual network [17] based fully convolutional encoder-decoder architecture [27] with a robust regression loss function. The aforementioned supervised learning approaches need large amounts of ground-truth depth data for training. Self-supervised approaches [11, 14, 44] overcome this limitation by exploiting photoconsistency and geometric constraints to define loss functions, for example, in a stereo camera setup. This way, only stereo images are needed for training, which are typically easier to obtain than accurate depth measurements from active sensors such as 3D lasers or RGB-D cameras. Godard et al. [14] achieve state-of-the-art depth estimation accuracy with a fully self-supervised approach. The semi-supervised scheme proposed by Kuznietsov et al. [23] combines the self-supervised loss with supervision from sparse LiDAR ground truth. They do not need multi-scale depth supervision or left-right consistency in their loss, and achieve better performance than the self-supervised approach of [14]. The limitation of this semi-supervised approach is its requirement for LiDAR data, which is costly to collect. In our approach we use Stereo Direct Sparse Odometry to obtain sparse ground-truth depth for semi-supervised training. Since the extracted depth maps are even sparser than LiDAR data, we additionally employ multi-scale self-supervised training and left-right consistency as in Godard et al. [14]. Inspired by [20, 34], we design a stacked network architecture leveraging the concept of residual learning [17].

Deep Learning for VO/SLAM. In recent years, large progress has been achieved in the development of monocular VO and SLAM methods [8, 9, 31, 32]. Due to projective geometry, metric scale cannot be observed with a single camera [37], which introduces scale drift. A popular approach is hence to use stereo cameras for VO [8, 10, 31], which avoids scale ambiguity and leverages stereo matching with a fixed baseline for estimating 3D structure. While stereo VO delivers more reliable depth estimation, it requires self-calibration for long-term operation [4, 46]. The integration of a second camera also introduces additional costs. Some recent monocular VO approaches have integrated monocular depth estimation [39, 46] to recover the metric scale by scale-matching. CNN-SLAM [39] extends LSD-SLAM [9] by predicting depth with a CNN and refining the depth maps using Bayesian filtering [7, 9]. Their method shows superior performance over monocular SLAM [9, 30, 35, 45] on indoor datasets [15, 38]. Yin et al. [46] propose to use convolutional neural fields and consecutive frames to improve the monocular depth estimation of a CNN. Camera motion is estimated using the refined depth. CodeSLAM [2] focuses on the challenge of dense 3D reconstruction. It jointly optimizes a learned compact representation of the dense geometry with camera poses. Our work tackles the problem of odometry with monocular cameras and integrates deep depth prediction with multi-view stereo to improve camera pose estimation. Another line of research trains networks to directly predict the ego-motion end-to-end using supervised [41] or unsupervised learning [26, 49]. However, the estimated ego-motion of these methods is still far inferior to geometric visual odometry approaches. In our approach, we phrase visual odometry as a geometric optimization problem but incorporate photoconsistency constraints with state-of-the-art deep monocular depth predictions into the optimization. This way, we obtain a highly accurate monocular visual odometry that is not prone to scale drift and achieves results comparable to traditional stereo VO methods.

2 Semi-Supervised Deep Monocular Depth Estimation

In this section, we will introduce our semi-supervised approach to deep monocular depth estimation. It builds on three key ingredients: self-supervised learning from photoconsistency in a stereo setup similar to [14], supervised learning based on accurate sparse depth reconstruction by Stereo DSO, and two-stage refinement of the network predictions in a stacked encoder-decoder architecture.

Fig. 2. Overview of the StackNet architecture.

2.1 Network Architecture

We coin our architecture StackNet since it stacks two sub-networks, SimpleNet and ResidualNet, as depicted in Fig. 2. Both sub-networks are fully convolutional deep neural networks adapted from DispNet [28] with an encoder-decoder scheme. ResidualNet has fewer layers and takes the outputs of SimpleNet as inputs. Its purpose is to refine the disparity maps predicted by SimpleNet by learning an additive residual signal. Similar residual learning architectures have been successfully applied to related deep learning tasks [20, 34]. The detailed network architecture is illustrated in the supplementary material.

SimpleNet. SimpleNet is an encoder-decoder architecture with a ResNet-50 based encoder and skip connections between corresponding encoder and decoder layers. The decoder upprojects the feature maps to the original resolution and generates 4 pairs of disparity maps \( disp _{ simple ,s}^{ left }\) and \( disp _{ simple ,s}^{ right }\) in different resolutions \(s \in [0,3]\). The upprojection is implemented by resize-convolution [33], i.e. a nearest-neighbor upsampling layer by a factor of two followed by a convolutional layer. The usage of skip connections enables the decoder to recover high-resolution results with fine-grained details.
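The resize-convolution upprojection can be sketched in TensorFlow (the framework used in our implementation, Sect. 4.1); the kernel size and channel count are illustrative assumptions, not prescribed by the architecture description:

```python
import tensorflow as tf

def upproject(features, out_channels):
    """Resize-convolution (sketch): nearest-neighbor upsampling by a factor of
    two followed by a convolution [33], avoiding the checkerboard artifacts of
    transposed convolutions. Kernel size 3 and the ELU activation (used for
    SimpleNet, cf. Sect. 4.1) are assumptions of this sketch."""
    x = tf.keras.layers.UpSampling2D(size=2, interpolation="nearest")(features)
    return tf.keras.layers.Conv2D(out_channels, kernel_size=3, padding="same",
                                  activation="elu")(x)
```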

ResidualNet. The purpose of ResidualNet is to further refine the disparity maps predicted by SimpleNet. ResidualNet learns the residual signals \( disp _{ res ,s}\) to the disparity maps \( disp _{ simple ,s}\) (both left and right and for all resolutions). Inspired by FlowNet 2.0 [20], the inputs to ResidualNet contain various information on the prediction and the errors made by SimpleNet: we input \(I^{ left }\), \( disp _{ simple ,0}^{ left }\), \(I^{ right }_{ recons }\), \(I^{ left }_{ recons }\) and \(e_l\), where

  • \(I^{ right }_{ recons }\) is the reconstructed right image by warping \(I^{ left }\) using \( disp _{ simple ,0}^{ right }\).

  • \(I^{ left }_{ recons }\) is the generated left image by back-warping \(I^{ right }_{ recons }\) using \( disp _{ simple ,0}^{ left }\).

  • \(e_l\) is the \(\ell _1\) reconstruction error between \(I^{ left }\) and \(I^{ left }_{ recons }\).

For the warping, rectified stereo images are required, while stereo camera intrinsics and extrinsics are not needed, as our network directly outputs disparities.

The final refined outputs \( disp _s\) are \( disp _s = disp _{ simple ,s} \oplus disp _{res,s}, s \in [0, 3]\), where \(\oplus \) is element-wise summation. The encoder of ResidualNet contains 12 residual blocks in total, and the network predicts residual disparity maps at the same 4 scales as SimpleNet. Adding more layers does not further improve performance in our experiments. Notably, only the left image is used as input to both SimpleNet and ResidualNet, while the right image is not required. Nevertheless, the network outputs refined disparity maps for both the left and the right stereo image. Both facts will be important for our monocular visual odometry approach.
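To make the ResidualNet input assembly above concrete, the following NumPy sketch reconstructs the two images and the error map at scale 0. It uses grayscale images and nearest-neighbor warping for brevity (the network uses the differentiable bilinear sampler [21]), and the sign convention follows our reading of Eqs. (4) and (7):

```python
import numpy as np

def warp_horizontal(img, disp, sign):
    """Warp a grayscale (H, W) image along x by a per-pixel disparity.
    Nearest-neighbor lookup for brevity; the network uses differentiable
    bilinear sampling [21]. sign=+1 samples at x + disp, sign=-1 at x - disp
    (rectified-stereo convention, cf. Eqs. (4) and (7))."""
    h, w = img.shape
    xs = np.tile(np.arange(w), (h, 1))
    src_x = np.clip(np.round(xs + sign * disp).astype(int), 0, w - 1)
    return np.take_along_axis(img, src_x, axis=1)

def residualnet_inputs(i_left, disp_left, disp_right):
    """Assemble the ResidualNet inputs listed above (all at scale 0)."""
    # Reconstruct the right image by warping the left image with disp_right.
    i_right_recons = warp_horizontal(i_left, disp_right, sign=+1)
    # Back-warp it with disp_left to reconstruct the left image.
    i_left_recons = warp_horizontal(i_right_recons, disp_left, sign=-1)
    # l1 reconstruction error between the left image and its reconstruction.
    e_l = np.abs(i_left - i_left_recons)
    return i_left, disp_left, i_right_recons, i_left_recons, e_l
```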

2.2 Loss Function

We define a loss \(\mathcal {L}_s\) at each output scale s, resulting in the total loss \(\mathcal {L} = \sum _{s=0}^{3}\mathcal {L}_s.\) The loss at each scale \(\mathcal {L}_s\) is a linear combination of five terms which are symmetric in left and right images,

$$\begin{aligned} \mathcal {L}_s ={} & \alpha _{U}\left( \mathcal {L}^{ left }_{U} + \mathcal {L}^{ right }_{U}\right) + \alpha _{S}\left( \mathcal {L}^{ left }_{S} + \mathcal {L}^{ right }_{S}\right) + \alpha _{lr}\left( \mathcal {L}^{ left }_{ lr } + \mathcal {L}^{ right }_{ lr }\right) \\ & + \alpha _{ smooth }\left( \mathcal {L}^{ left }_{ smooth } + \mathcal {L}^{ right }_{ smooth }\right) + \alpha _{ occ }\left( \mathcal {L}^{ left }_{ occ } + \mathcal {L}^{ right }_{ occ }\right) , \end{aligned}$$
(1)

where \(\mathcal {L}_{U}\) is a self-supervised loss, \(\mathcal {L}_{S}\) is a supervised loss, \(\mathcal {L}_{ lr }\) is a left-right consistency loss, \(\mathcal {L}_{ smooth }\) is a smoothness term encouraging the predicted disparities to be locally smooth and \(\mathcal {L}_{ occ }\) is an occlusion regularization term. In the following, we detail the left components \(\mathcal {L}^{ left }\) of the loss function at each scale. The right components \(\mathcal {L}^{ right }\) are defined symmetrically.

Self-supervised Loss. The self-supervised loss measures the quality of the reconstructed images. A reconstructed image is generated by warping the input image into the view of the other rectified stereo image. This procedure is fully (sub-)differentiable when using bilinear sampling [21]. Inspired by [14, 47], the quality of the reconstructed image is measured with a combination of the \(\ell _1\) loss and single-scale structural similarity (SSIM) [42]:

$$\begin{aligned} \mathcal {L}_U^{ left } = \frac{1}{N}\sum _{x,y} \left[ \alpha \frac{1-\text {SSIM}\left( I^{ left }(x,y), I^{ left }_{ recons }(x,y)\right) }{2} + (1-\alpha )\left\| I^{ left }(x,y)-I^{ left }_{ recons }(x,y)\right\| _1 \right] , \end{aligned}$$
(2)

with a \(3\times 3\) box filter for SSIM and \(\alpha \) set to 0.84.
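A TensorFlow sketch of Eq. (2); the SSIM stabilization constants c1, c2 are the commonly used values, which the text does not specify, and the small border mismatch caused by the VALID-padded box filter is ignored here:

```python
import tensorflow as tf

def ssim_3x3(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-scale SSIM with a 3x3 box filter (Eq. (2)).
    x, y: image batches of shape (B, H, W, C) with values in [0, 1].
    c1, c2 are assumed standard SSIM constants."""
    pool = lambda t: tf.nn.avg_pool2d(t, ksize=3, strides=1, padding="VALID")
    mu_x, mu_y = pool(x), pool(y)
    var_x = pool(x * x) - mu_x ** 2
    var_y = pool(y * y) - mu_y ** 2
    cov_xy = pool(x * y) - mu_x * mu_y
    num = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def self_supervised_loss(i_left, i_left_recons, alpha=0.84):
    """L_U^left: weighted mix of SSIM dissimilarity and l1 photometric error."""
    ssim_term = tf.reduce_mean((1.0 - ssim_3x3(i_left, i_left_recons)) / 2.0)
    l1_term = tf.reduce_mean(tf.abs(i_left - i_left_recons))
    return alpha * ssim_term + (1.0 - alpha) * l1_term
```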

Supervised Loss. The supervised loss measures the deviation of the predicted disparity map from the disparities estimated by Stereo DSO at a sparse set of pixels:

$$\begin{aligned} \mathcal {L}_S^{ left } = \frac{1}{N}\sum _{(x,y) \in \varOmega _{ DSO , left }}\beta _\epsilon \left( disp ^{ left }(x, y) - { disp }^{ left }_{ DSO }(x,y)\right) \end{aligned}$$
(3)

where \(\varOmega _{ DSO , left }\) is the set of pixels with disparities estimated by DSO and \(\beta _\epsilon (x)\) is the reverse Huber (berHu) norm introduced in [24] which lets the training focus more on larger residuals. The threshold \(\epsilon \) is adaptively set as a batch-dependent value \(\epsilon = 0.2 \max _{(x,y)\in \varOmega _{ DSO , left }} \left| disp ^{ left }(x,y) - disp _{ DSO }^{ left }(x,y)\right| \).
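A sketch of the berHu loss of Eq. (3) in TensorFlow. The mask encodes the sparse pixel set \(\varOmega _{ DSO , left }\); normalizing by the number of DSO pixels is our assumption, since Eq. (3) only writes 1/N:

```python
import tensorflow as tf

def berhu_supervised_loss(disp_pred, disp_dso, mask):
    """L_S^left (Eq. (3)): reverse Huber (berHu) penalty on the deviation from
    the sparse Stereo DSO disparities.
    disp_pred, disp_dso: (B, H, W) disparity maps.
    mask: boolean (B, H, W) map, True only on pixels in Omega_DSO."""
    valid = tf.cast(mask, tf.float32)
    residual = tf.abs(disp_pred - disp_dso) * valid  # zero outside Omega_DSO
    # Batch-adaptive threshold: 20% of the largest residual on DSO pixels.
    eps = 0.2 * tf.reduce_max(residual)
    # berHu [24]: l1 below the threshold, scaled quadratic above it.
    berhu = tf.where(residual <= eps,
                     residual,
                     (residual ** 2 + eps ** 2) / (2.0 * eps + 1e-12))
    return tf.reduce_sum(berhu) / (tf.reduce_sum(valid) + 1e-12)
```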

Left-Right Disparity Consistency Loss. Given only the left image as input, the network predicts the disparity map of the left as well as the right image as in [14]. As proposed in [14, 47], consistency between the left and right disparity image is improved by

$$\begin{aligned} \mathcal {L}_{ lr }^{ left } = \frac{1}{N}\sum _{x,y}\Bigl | disp ^{ left }(x, y) - disp ^{ right }(x - disp ^{ left }(x, y), y) \Bigr |. \end{aligned}$$
(4)
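A sketch of Eq. (4), again with nearest-neighbor lookup instead of the bilinear sampling used during training:

```python
import tensorflow as tf

def lr_consistency_loss(disp_left, disp_right):
    """L_lr^left (Eq. (4)): the left disparity at (x, y) should match the right
    disparity sampled at (x - disp_left(x, y), y).
    disp_left, disp_right: (B, H, W) tensors with static shapes."""
    b, h, w = disp_left.shape
    xs = tf.broadcast_to(tf.range(w, dtype=tf.float32), (b, h, w))
    src_x = tf.clip_by_value(
        tf.cast(tf.round(xs - disp_left), tf.int32), 0, w - 1)
    # Gather the right disparity at the warped x coordinate per pixel.
    disp_right_warped = tf.gather(disp_right, src_x, axis=2, batch_dims=2)
    return tf.reduce_mean(tf.abs(disp_left - disp_right_warped))
```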

Disparity Smoothness Regularization. Depth reconstruction based on stereo image matching is an ill-posed problem on its own: the depth of homogeneously textured areas and occluded areas cannot be determined. For these areas, we apply the regularization term

$$\begin{aligned} \mathcal {L}_{ smooth }^{ left } = \frac{1}{N} \sum _{x,y}\left| \nabla ^2_x disp ^{ left }(x, y)\right| e^{-\left\| \nabla ^2_x I^{ left }(x, y) \right\| } + \left| \nabla ^2_y disp ^{ left }(x, y)\right| e^{-\left\| \nabla ^2_y I^{ left }(x, y) \right\| } \end{aligned}$$
(5)

that assumes that the predicted disparity map should be locally smooth. We use a second-order smoothness prior [43] and downweight it when the image gradient is high [18].

Occlusion Regularization. \(\mathcal {L}_{ smooth }^{ left }\) by itself tends to generate a shadow area in which values gradually change from foreground to background due to stereo occlusion. To favor background depths and hard transitions at occlusions [48], we impose \(\mathcal {L}_{ occ }^{ left }\), which penalizes the total sum of absolute disparities. The combination of the smoothness and occlusion regularizers prefers to directly take the (smaller) nearby background disparity, which better corresponds to the assumption that the occluded region belongs to the uncovered background:

$$\begin{aligned} \mathcal {L}_{ occ }^{ left } = \frac{1}{N} \sum _{x, y}\Bigl | disp ^{ left }(x, y)\Bigr |. \end{aligned}$$
(6)
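The two regularizers of Eqs. (5) and (6) can be sketched as follows; taking the channel mean inside the exponential edge weight is our assumption for the norm of the image Laplacian:

```python
import tensorflow as tf

def second_diff_x(t):
    """Second-order finite difference along the width axis of (B, H, W, C)."""
    return t[:, :, 2:, :] - 2.0 * t[:, :, 1:-1, :] + t[:, :, :-2, :]

def second_diff_y(t):
    """Second-order finite difference along the height axis."""
    return t[:, 2:, :, :] - 2.0 * t[:, 1:-1, :, :] + t[:, :-2, :, :]

def smoothness_and_occlusion(disp, image):
    """Eqs. (5) and (6): edge-aware second-order smoothness and the occlusion
    regularizer on the total absolute disparity.
    disp: (B, H, W, 1) disparities, image: (B, H, W, C)."""
    # Down-weight the smoothness prior where the image Laplacian is strong;
    # the channel mean is an assumption of this sketch.
    w_x = tf.exp(-tf.reduce_mean(tf.abs(second_diff_x(image)), 3, keepdims=True))
    w_y = tf.exp(-tf.reduce_mean(tf.abs(second_diff_y(image)), 3, keepdims=True))
    l_smooth = (tf.reduce_mean(tf.abs(second_diff_x(disp)) * w_x)
                + tf.reduce_mean(tf.abs(second_diff_y(disp)) * w_y))
    l_occ = tf.reduce_mean(tf.abs(disp))  # Eq. (6)
    return l_smooth, l_occ
```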

3 Deep Virtual Stereo Odometry

Deep Virtual Stereo Odometry (DVSO) builds on the windowed sparse direct bundle adjustment formulation of monocular DSO. We use our disparity predictions in DSO in two key ways: firstly, we initialize the depth maps of new keyframes from the disparities. Beyond this rather straightforward approach, we also incorporate virtual direct image alignment constraints into the windowed direct bundle adjustment of DSO. We obtain these constraints by warping images with the depth estimated in bundle adjustment and the right disparities predicted by our network, assuming a virtual stereo setup. As shown in Fig. 3, DVSO integrates both the predicted left disparities and the predicted right disparities for the left image. The right image of the stereo setup is not used by our VO method at any stage, making it a monocular VO method.

In the following, we use \(D^{L}\) and \(D^{R}\) as shorthands to represent the predicted left (\(disp_0^{left}\)) and right disparity map (\(disp_0^{right}\)) at scale \(s = 0\), respectively. When using purely geometric cues, scale drift is one of the main sources of error of monocular VO due to scale unobservability [37]. In DVSO we use the left disparity map \(D^{L}\) predicted by StackNet for initialization instead of randomly initializing the depth like in monocular DSO [8]. The disparity value of an image point with coordinate \(\mathbf {p}\) is converted to the inverse depth \(d_\mathbf {p}\) using the rectified camera intrinsics and stereo baseline of the training set of StackNet [16], \(d_{\mathbf {p}} = \frac{D^{L}(\mathbf {p})}{f_xb}\). In this way, the initialization of DVSO becomes more stable than monocular DSO and the depths are initialized with a consistent metric scale.
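As a small worked example of this conversion (with roughly KITTI-like calibration values, used purely for illustration):

```python
def disparity_to_inverse_depth(disp_left, fx, baseline):
    """Convert a predicted left disparity (in pixels) into the inverse depth
    used to initialize DVSO: d_p = D^L(p) / (f_x * b), with f_x and b taken
    from the rectified camera of the StackNet training data."""
    return disp_left / (fx * baseline)

# Illustration with roughly KITTI-like values (fx ~ 718 px, b ~ 0.54 m):
# a disparity of 20 px maps to an inverse depth of about 0.05 m^-1,
# i.e. a depth of roughly 19 m.
```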

The point selection strategy of DVSO is similar to that of monocular DSO [8], but we additionally introduce a left-right consistency check (similar to Equation (4)) to filter out pixels which likely lie in occluded areas:

$$\begin{aligned} e_{lr} = \Bigl | D^{L}(\mathbf {p}) - D^{R}(\mathbf {p'}) \Bigr | \quad \text {with} \quad \mathbf {p'} = \mathbf {p} - \begin{bmatrix} D^L(\mathbf {p})&0\\ \end{bmatrix}^\top . \end{aligned}$$
(7)

The pixels with \(e_{lr} > 1\) are not selected.
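A sketch of this consistency check for a single candidate pixel, using nearest-neighbor lookup for brevity:

```python
import numpy as np

def passes_lr_check(disp_left, disp_right, x, y, threshold=1.0):
    """Eq. (7): keep a candidate pixel p = (x, y) only if its left disparity
    agrees with the right disparity at p' = p - [D^L(p), 0]^T."""
    d_l = disp_left[y, x]
    x_prime = int(np.clip(np.round(x - d_l), 0, disp_right.shape[1] - 1))
    e_lr = abs(d_l - disp_right[y, x_prime])
    return e_lr <= threshold  # pixels with e_lr > 1 are not selected
```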

Fig. 3. System overview of DVSO. Every new frame is used for visual odometry and fed into the proposed StackNet to predict left and right disparity. The predicted left and right disparities are used for depth initialization, while the right disparity is used to form the virtual stereo term in direct sparse bundle adjustment.

Every new frame is first tracked with respect to the reference keyframe using direct image alignment in a coarse-to-fine manner [8]. Afterwards, DVSO decides whether a new keyframe has to be created for the new frame, following the criteria proposed in [8]. When a new keyframe is created, the temporal multi-view energy function \(E_{photo} := \sum _{i\in \mathcal {F}}\sum _{\mathbf {p} \in \mathcal {P}_i}\sum _{j \in \text {obs}(\mathbf {p})}E_{ij}^{\mathbf {p}}\) needs to be optimized, where \(\mathcal {F}\) is a fixed-size window containing the active keyframes, \(\mathcal {P}_i\) is the set of points selected from its host keyframe with index i and \(j \in \text {obs}(\mathbf {p})\) is the index of a keyframe which observes \(\mathbf {p}\). \(E_{ij}^{\mathbf {p}}\) is the photometric error of the point \(\mathbf {p}\) when projected from the host keyframe \(I_i\) onto the other keyframe \(I_j\):

$$\begin{aligned} E_{ij}^{\mathbf {p}} := \omega _{\mathbf {p}} \left\| (I_j[\mathbf {\tilde{p}}] - b_j) - \frac{e^{a_j}}{e^{a_i}}(I_i[\mathbf {p}] - b_i) \right\| _\gamma , \end{aligned}$$
(8)

where \(\mathbf {\tilde{p}}\) is the projected image coordinate obtained using the relative rotation matrix \(\mathbf R \in SO(3)\) and translation vector \(\mathbf t \in \mathbb {R}^3\) [16], \(\mathbf {\tilde{p}} = \varPi _c\left( \mathbf {R}\varPi _c^{-1}\left( \mathbf {p}, d_{\mathbf {p}}\right) + \mathbf {t}\right) \), where \(\varPi _\mathbf {c}\) and \(\varPi _\mathbf {c}^{-1}\) are the camera projection and back-projection functions. The parameters \(a_i\), \(a_j\), \(b_i\) and \(b_j\) model an affine brightness transformation [8]. The weight \(\omega _{\mathbf {p}}\) down-weights points with high image gradient [8], with the intuition that the error originating from bilinear interpolation of the discrete image values is larger for such points. \(||\cdot ||_{\gamma }\) is the Huber norm with threshold \(\gamma \). For a detailed explanation of the energy function, please refer to [8].
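The core of Eq. (8) can be sketched as follows; the Huber norm, the gradient-dependent weight \(\omega _{\mathbf {p}}\) and DSO's residual pattern with bilinear interpolation are omitted, and a nearest-pixel lookup is used instead:

```python
import numpy as np

def backproject(p, inv_depth, K):
    """Pi_c^{-1}: pixel p = (x, y) with inverse depth d_p to a 3D point."""
    x, y = p
    z = 1.0 / inv_depth
    return np.array([(x - K[0, 2]) / K[0, 0] * z,
                     (y - K[1, 2]) / K[1, 1] * z,
                     z])

def project(X, K):
    """Pi_c: 3D point to pixel coordinates."""
    return np.array([K[0, 0] * X[0] / X[2] + K[0, 2],
                     K[1, 1] * X[1] / X[2] + K[1, 2]])

def photometric_residual(I_i, I_j, p, inv_depth, R, t, K, a_i, b_i, a_j, b_j):
    """Residual inside Eq. (8): project p from host keyframe I_i into keyframe
    I_j and compare affine-brightness-corrected intensities (sketch only)."""
    p_tilde = project(R @ backproject(p, inv_depth, K) + t, K)
    u, v = int(round(p_tilde[0])), int(round(p_tilde[1]))
    return (I_j[v, u] - b_j) \
        - (np.exp(a_j) / np.exp(a_i)) * (I_i[p[1], p[0]] - b_i)
```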

To further improve the accuracy of DVSO, inspired by Stereo DSO [40], which couples a static stereo term with the temporal multi-view energy function, we introduce a novel virtual stereo term \(E_i^{\dagger \mathbf {p}}\) for each point \(\mathbf {p}\):

$$\begin{aligned} E_{i}^{\dagger \mathbf {p}} = \omega _{\mathbf {p}} \left\| I_i^\dagger \left[ \mathbf {p}^\dagger \right] - I_i\left[ \mathbf {p}\right] \right\| _\gamma \quad \text {with} \quad I_i^\dagger \left[ \mathbf {p}^\dagger \right] = I_i\left[ \mathbf {p}^\dagger - \begin{bmatrix} D^{R}\left( \mathbf {p}^\dagger \right)&0 \\ \end{bmatrix}^\top \right] , \end{aligned}$$
(9)

where \(\mathbf {p}^\dagger = \varPi _\mathbf {c}(\varPi _\mathbf {c}^{-1}(\mathbf {p}, d_{\mathbf {p}}) + \mathbf {t}_b)\) is the virtual projected coordinate of \(\mathbf {p}\) using the vector \(\mathbf {t}_b\) denoting the virtual stereo baseline which is known during the training of StackNet. The intuition behind this term is to optimize the estimated depth of the visual odometry to become consistent with the disparity prediction of StackNet. Instead of imposing the consistency directly on the estimated and predicted disparities, we formulate the residuals in photoconsistency which better reflects the uncertainties of the prediction of StackNet and also keeps the unit of the residuals consistent with the temporal direct image alignment terms.
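A sketch of the residual inside Eq. (9), reusing the backproject and project helpers from the previous sketch; the concrete value and sign of the baseline vector \(\mathbf {t}_b\) are illustrative assumptions that depend on the chosen coordinate convention:

```python
def virtual_stereo_residual(I_i, disp_right, p, inv_depth, t_b, K):
    """Residual inside Eq. (9): project p into the virtual right view using
    the baseline vector t_b (e.g. roughly [-0.54, 0, 0] for a KITTI-like
    setup; sign depends on the convention), then sample the virtual right
    image I_i^dagger by reading the left image at p_dagger - [D^R(p_dagger), 0].
    Nearest-pixel lookups replace bilinear interpolation for brevity."""
    p_dagger = project(backproject(p, inv_depth, K) + t_b, K)
    u, v = int(round(p_dagger[0])), int(round(p_dagger[1]))
    u_src = int(round(p_dagger[0] - disp_right[v, u]))
    return float(I_i[v, u_src]) - float(I_i[p[1], p[0]])
```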

We then optimize the total energy

$$\begin{aligned} E_{photo} := \sum _{i\in \mathcal {F}}\sum _{\mathbf {p} \in \mathcal {P}_i} \left( \lambda E_{i}^{\dagger \mathbf {p}} + \sum _{j \in \text {obs}(\mathbf {p})}E_{ij}^{\mathbf {p}} \right) , \end{aligned}$$
(10)

where the coupling factor \(\lambda \) balances the temporal and the virtual stereo terms. All parameters of the total energy are jointly optimized using the Gauss-Newton method [8]. In order to keep a fixed size of the active window (\(N=7\) keyframes in our experiments), old keyframes are removed from the system by marginalization using the Schur complement [8]. Unlike pure sliding-window bundle adjustment, parameter estimates outside the optimization window, including camera poses and depths, are also incorporated into the optimization through a marginalization prior. In contrast to the MSCKF [29], the depths of pixels are explicitly maintained in the state and optimized for. In our optimization framework we trade off predicted depth and triangulated depth using robust norms.

4 Experiments

We quantitatively compare our StackNet with other state-of-the-art monocular depth prediction methods on the publicly available KITTI dataset [12]. In the supplementary material, we demonstrate results on the Cityscapes dataset [3] and the Make3D dataset [36] to show the generalization ability. For DVSO, we evaluate its tracking accuracy on the KITTI odometry benchmark against other state-of-the-art monocular as well as stereo visual odometry systems. In the supplementary material, we also demonstrate results on the Frankfurt sequence of the Cityscapes dataset to show the generalization of DVSO.

Fig. 4. Qualitative comparison with state-of-the-art methods. The ground truth is interpolated for better visualization. Our approach shows better prediction on thin structures than the self-supervised approach [14], and delivers more detailed disparity maps than the semi-supervised approach using LiDAR data [23].

4.1 Monocular Depth Estimation

Dataset. We train StackNet using the train/test split (K) of Eigen et al. [6]. The training set contains 23,488 images from 28 scenes belonging to the categories “city”, “residential” and “road”. We use 22,600 of these images for training and the remaining ones for validation. We further split K into 2 subsets \(\mathbf K _o\) and \(\mathbf K _r\). \(\mathbf K _o\) contains the images of the sequences which appear in the training set (but not the test set) of the KITTI odometry benchmark, on which we use Stereo DSO [40] to extract sparse ground-truth depth data. \(\mathbf K _r\) contains the remaining images in K. Specifically, \(\mathbf K _o\) contains the images of sequences 01, 02, 06, 08, 09 and 10 of the KITTI odometry benchmark.

Implementation Details. StackNet is implemented in TensorFlow [1] and trained from scratch on a single Titan X Pascal GPU. We resize the images to \(512 \times 256\) for training; inference takes less than 40 ms including the I/O overhead. The loss weights are set to \(\alpha _{U} = 1\), \(\alpha _{S} = 10\), \(\alpha _{lr} = 1\), \(\alpha _{ smooth } = 0.1/2^s\) and \(\alpha _{ occ } = 0.01\), where s is the output scale. As suggested by [14], we use exponential linear units (ELUs) for SimpleNet, while we use leaky rectified linear units (Leaky ReLUs) for ResidualNet. We first train SimpleNet on \(\mathbf K _o\) in the semi-supervised way for 80 epochs with a batch size of 8 using the Adam optimizer [22]. The learning rate is initially set to \(\lambda = 10^{-4}\) for the first 50 epochs and is halved every 15 epochs afterwards. Then we train SimpleNet with \(\lambda = 5 \times 10^{-5}\) on \(\mathbf K _r\) for 40 epochs in the self-supervised way, i.e. without \(\mathcal {L}_S\). In the end, we train again on \(\mathbf K _o\) without \(\mathcal {L}_U\) using \(\lambda = 10^{-5}\) for 5 epochs. We explain the dataset schedule as well as the parameter tuning in detail in the supplementary material.
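For concreteness, one possible reading of the SimpleNet learning-rate schedule on \(\mathbf K _o\) (the exact halving points beyond epoch 50 are our assumption):

```python
def simplenet_lr(epoch):
    """Assumed SimpleNet schedule on K_o (80 epochs, Adam): 1e-4 for the
    first 50 epochs, then halved every 15 epochs."""
    if epoch < 50:
        return 1e-4
    return 1e-4 * 0.5 ** ((epoch - 50) // 15 + 1)
```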

After training SimpleNet, we freeze its weights and train StackNet by cascading ResidualNet. StackNet is trained with \(\lambda = 5 \times 10^{-5}\) using the same dataset schedule but with fewer epochs, i.e. 30, 15 and 3 epochs, respectively. We apply random gamma, brightness and color augmentations [14]. We also employ the post-processing for the left disparities proposed by Godard et al. [14] to reduce the effect of stereo disocclusions. In the supplementary material we also provide an ablation study on the various loss terms.

KITTI. Table 1 shows the evaluation results using the error metrics of [6]. We apply the image crop used by Eigen et al. [6] in order to compare with [14, 23] within different depth ranges. The best performance of our network is achieved with the dataset schedule \(\mathbf K _o \rightarrow \mathbf K _r \rightarrow \mathbf K _o \) as described above. We outperform the state-of-the-art self-supervised approach of Godard et al. [14] by a large margin. Our method also outperforms the state-of-the-art semi-supervised method using LiDAR ground truth proposed by Kuznietsov et al. [23] on all metrics except for the less restrictive \(\delta < 1.25^2\) and \(\delta < 1.25^3\).

Figure 4 shows a qualitative comparison with other state-of-the-art methods. Compared to the semi-supervised approach, our results contain more details and deliver comparable predictions on thin structures such as poles. Although the results of Godard et al. [14] appear more detailed in some parts, they are not actually accurate, as the quantitative evaluation shows. In general, the predictions of Godard et al. [14] on thin objects are not as accurate as ours. In the supplementary material, we show the error maps for the predicted depth maps. Figure 5 further shows the advantages of our method compared to the state-of-the-art self-supervised and semi-supervised approaches. The results of Godard et al. [14] are predicted by their network trained on both the Cityscapes dataset and the KITTI dataset. On the wall of the far building in the left figure, our network predicts a more consistent depth on the surface, while the prediction of the self-supervised network shows strong checkerboard artifacts, which are clearly inaccurate. The semi-supervised approach also shows checkerboard artifacts, though much weaker ones. The right side of the figure shows shadow artifacts around the boundaries of the traffic sign for the approach of Godard et al. [14], while the result of Kuznietsov et al. [23] fails to predict the structure. Please refer to our supplementary material for further results. We also demonstrate how our trained depth prediction network generalizes to other datasets in the supplementary material.

Table 1. Evaluation results on the KITTI [13] Raw test split of Eigen et al. [6]. CS refers to the Cityscapes dataset [3]. Upper part: depth range 0–80 m, lower part: 1–50 m. All results are obtained using the crop from [6]. Our SimpleNet trained on \(\mathbf K _o\) outperforms [14] (self-supervised) trained on CS and K. StackNet also outperforms semi-supervision with LiDAR [23] on most metrics.
Fig. 5. Qualitative results on Eigen et al.'s KITTI Raw test split. The result of Godard et al. [14] shows a strong shadow effect around object contours, while our result does not. The result of Kuznietsov et al. [23] fails to predict the traffic sign. Both other methods [14, 23] predict checkerboard artifacts on the far building, while our approach exhibits far fewer such artifacts.

4.2 Monocular Visual Odometry

KITTI Odometry Benchmark. The KITTI odometry benchmark contains 11 (0–10) training sequences and 11 (11–21) test sequences. Ground-truth 6D poses are provided for the training sequences, whereas for the test sequences evaluation results are obtained by submitting to the KITTI website. We use the error metrics proposed in [12].

We first provide an ablation study for DVSO to show the effectiveness of the design choices in our approach. In Table 2 we give results for DVSO in different variants with the following components: initializing the depth with the left disparity prediction (in), using the right disparity for the virtual stereo term in windowed bundle adjustment (vs), checking left-right disparity consistency for point selection (lr), and tuning the virtual stereo baseline (tb). The intuition behind the virtual stereo baseline is that StackNet is trained over various camera parameters and hence provides a depth scale for an average baseline. For tb, we therefore tune the scale factors of sequences with different camera intrinsics to better align the estimated scale with the ground truth. Baselines are tuned for each of the 3 different camera parameter sets in the training set individually, using grid search on one training sequence. Specifically, we tuned the baselines on sequences 00, 03 and 05, which correspond to the 3 different camera parameter sets. The test set contains the same camera parameter sets as the training set and we map the virtual baselines for tb correspondingly. Monocular DSO (after Sim(3) alignment) is also shown as a baseline. The results show that our full approach achieves the best average performance. Our StackNet also adds significantly to the performance of DVSO compared with using depth predictions from [14].

Table 2. Ablation study for DVSO. \(^*\) and \(^\dagger \) indicate the sequences used and not used for training StackNet, respectively. \(t_{rel}\)(\(\%\)) and \(r_{rel}\)(\(^\circ \)) are translational- and rotational RMSE, respectively. Both \(t_{rel}\) and \(r_{rel}\) are averaged over 100 to 800 m intervals. in: \(D^{L}\) is used for depth initialization. vs: virtual stereo term is used with \(D^{R}\). lr: left-right disparity consistency is checked using predictions. tb: tuned virtual baseline is used. DVSO’([14]): full (invslrtb) with depth from [14]. DVSO: full with depth from StackNet. Best results are shown as bold, second best italic. DVSO clearly outperforms the other variants.
Table 3. Comparison with state-of-the-art stereo visual odometry. DVSO: our full approach (invslrtb). Global optimization and loop-closure are turned off for stereo ORB-SLAM2 and Stereo LSD-SLAM. DVSO (monocular) achieves comparable performance to these stereo methods.
Fig. 6. Results on KITTI odometry seq. 00. Top: comparisons with monocular methods (Sim(3)-aligned) and stereo methods. DVSO provides significantly more consistent trajectories than other monocular methods and compares well to stereo approaches. Bottom: DVSO with StackNet produces a more accurate trajectory and map than with [14].

Fig. 7. Evaluation results on the KITTI odometry test set. We show translational and rotational errors with respect to path length intervals. For translational errors, DVSO achieves performance comparable to Stereo LSD-SLAM, while for rotational errors, DVSO achieves results comparable to Stereo DSO and better than all other methods. Note that with virtual baseline tuning, DVSO achieves the best performance among all the methods evaluated.

Table 4. Comparison with deep learning approaches. Note that DeepVO [41] is trained on sequences 00, 02, 08 and 09 of the KITTI odometry benchmark. UnDeepVO [26] and SfMLearner [49] are trained unsupervised and end-to-end on sequences 00–08. Results of DeepVO and UnDeepVO are taken from [41] and [26], respectively, while for SfMLearner we ran their pre-trained model. Our DVSO clearly outperforms the state-of-the-art deep learning based VO methods.

We also compare DVSO with other state-of-the-art stereo visual odometry systems on sequences 00–10. The sequences marked with \(^*\) are used for training StackNet and the sequences marked with \(^\dagger \) are not used for training the network. In Table 3 and the following tables, DVSO denotes our full approach with baseline tuning (invslrtb). The average RMSE of DVSO without baseline tuning is better than that of Stereo LSD-VO, but not as good as Stereo DSO [40] or ORB-SLAM2 [31] (stereo, without global optimization and loop closure). Importantly, DVSO uses only monocular images. With baseline tuning, DVSO achieves even better average performance than all other stereo systems on both rotational and translational errors. Figure 6 shows the estimated trajectory on sequence 00. Both monocular ORB-SLAM2 and DSO suffer from strong scale drift, while DVSO largely eliminates the scale drift. We also show the estimated trajectory on sequence 00 obtained by running DVSO with the depth maps predicted by Godard et al. [14], using their model trained on the Cityscapes and the KITTI dataset. As Fig. 6 shows, our depth predictions lead to a more accurate trajectory. Figure 7 shows the evaluation results on sequences 11–21, obtained by submitting DVSO with and without baseline tuning to the KITTI odometry benchmark. Note that in Fig. 7, Stereo LSD-SLAM and ORB-SLAM2 are both full stereo SLAM approaches with global optimization and loop closure. For qualitative comparisons of further estimated trajectories, please refer to our supplementary material.

We also compare DVSO with DeepVO [41], UnDeepVO [26] and SfMLearner [49], which are deep learning based visual odometry systems trained end-to-end on KITTI. As shown in Table 4, DVSO achieves better performance than these end-to-end approaches on all available sequences. Table 4 also shows a comparison with the deep learning based scale recovery method for monocular VO proposed by Yin et al. [46]; DVSO outperforms their method as well. In the supplementary material, we also show the estimated trajectory on the Cityscapes Frankfurt sequence to demonstrate generalization capabilities.

5 Conclusion

We presented a novel monocular visual odometry system, DVSO, which recovers metric scale and reduces scale drift in geometric monocular VO. A deep learning approach predicts monocular depth maps for the input images, which are used to initialize the sparse depths in DSO at a consistent metric scale. The odometry is further improved by a novel virtual stereo term that couples the depth estimated in windowed bundle adjustment with the monocular depth predictions. For monocular depth prediction we have presented a semi-supervised deep learning approach which utilizes a self-supervised image reconstruction loss and sparse depth reconstructions from Stereo DSO as ground truth for supervision. A stacked network architecture predicts state-of-the-art refined disparity estimates.

Our evaluation conducted on the KITTI odometry benchmark demonstrates that DVSO outperforms the state-of-the-art monocular methods by a large margin and achieves comparable results to stereo VO methods. With virtual baseline tuning, DVSO can even outperform state-of-the-art stereo VO methods, i.e., Stereo LSD-VO, ORB-SLAM2 without global optimization and loop closure, and Stereo DSO, while using only monocular images.

The key practical benefit of the proposed method is that it allows us to recover accurate and scale-consistent odometry with only a single camera. Future work could comprise fine-tuning of the network inside the odometry pipeline end-to-end. This could enable the system to adapt online to new scenes and camera setups. Given that the deep net was trained on driving sequences, in future work we also plan to investigate how much the proposed approach can generalize to other camera trajectories and environments.