1 Introduction

Odometry aims to predict six degrees of freedom (6-DOF) relative camera poses from motion sensors. It is a fundamental component of a wide variety of robotics and vision tasks, including simultaneous localization and mapping (SLAM), automatic navigation, and virtual reality (Durrant-Whyte & Bailey, 2006; Fuentes-Pacheco et al., 2015; Taketomi et al., 2017; Zhang & Tao, 2020). In particular, visual and visual-inertial odometry have attracted considerable attention over recent years due to the low cost and easy setup of cameras and inertial measurement unit (IMU) sensors. The relative camera pose is recovered using geometric cues and motion models. Classic geometric methods usually formulate odometry as an optimization problem by incorporating well-established geometric and motion constraints as the objective functions. Nevertheless, due to the complexity and diversity of real-world environments, the explicitly modeled constraints can hardly explain all aspects of the sensor data. Though successful in some real-world scenarios, geometric systems fail when the assumptions underlying the optimization objectives, such as static environments, discriminative visual features, noiseless observations and brightness constancy, are violated. Furthermore, since odometry is essentially a time-series prediction problem, properly handling time dependency and environment dynamics presents further challenges. Classic geometric methods use filtering or bundle adjustment to take the temporal information into account, but the error distributions they implicitly assume might not hold in practice.

Recently, end-to-end deep learning methods have provided an alternative solution to the odometry problem that relieves the above-mentioned intrinsic problems of geometric methods. Learning-based methods tackle the problem from another perspective: they do not explicitly model the constraints for optimization but learn the mapping from sensor data to camera pose implicitly from large-scale datasets (Wang et al., 2017; Clark et al., 2017; Xue et al., 2019). It has been shown that well-trained deep networks can effectively capture the inherent complexity and diversity of the training data and establish the mapping from visual/sequential inputs to desired targets in many computer vision tasks (He et al., 2016; Xu et al., 2021; Zhang et al., 2022a), thus holding promise for addressing the limitations of geometric approaches. In addition, learning-based frameworks can implicitly learn calibrated representations and require no explicit calibration procedures. For monocular visual odometry, the absolute scale can also be recovered from training data, whereas recovering it is a non-trivial challenge for geometric methods.

Although existing deep odometry learning methods have performed competitively against their geometric counterparts, they still fail to satisfy some basic requirements. First of all, due to the broad range of scenarios where odometry is required, odometry systems are expected to be easily compatible with various configurations and settings, such as multiple sensors and dynamic environments. In addition, the common existence of data degeneration, such as from hardware malfunctions and unexpected occlusions, requires a safe and robust system in which a proper uncertainty measure is desirable for self-awareness of the potential anomalies and system bias. Moreover, theoretical analyses of current black box deep odometry models, such as generalizability on unseen test data and extendibility to extra sensors, are still obscure but essential for understanding and assessing the model performance.

Here we devise a unified odometry learning framework from an information-theoretic perspective, which well addresses the above issues. Our work is motivated by the recent successes of deep variational inference and learning theory based on mutual information (MI). Specifically, we translate the odometry problem to optimizing an information bottleneck (IB) objective function where the latent representation is formulated as a bottleneck between the observations and relative camera poses. In doing so, we eliminate the pose-irrelevant information from the latent representation to achieve better generalizability. Modeling by MI constraints provides a flexible way to account for different aspects of the problem and quantify their effectiveness in information-theoretic language. This framework is also attractive in that the operations are performed on the probabilistic distribution of the latent representation, which naturally provides an uncertainty measure for interrogating the data quality and system bias.

More importantly, the information-theoretic formulation allows us to leverage information theory to investigate the theoretical properties of the proposed method. Our theoretical findings not only benefit the evaluation of the model performance but also provide insights for subsequent research. We obtain a theoretical guarantee of the proposed framework by deriving an upper bound of the expected generalization error w.r.t. the IB objective function under mild network and loss function conditions. We show that the latent space dimensionality also bounds the expected generalization error, providing a theoretical explanation for the complexity-overfitting trade-off in the latent representation space. When the test data is biased, our result shows that the growing rate of d should not exceed that of \(n/\log (n)\), where d is the latent space dimensionality, and n is the sample size. We further quantify the usefulness of a latent representation for relative camera pose prediction using the MI between the representation and poses. In doing so, we prove a lower bound for this MI given extra sensors, which reveals the conditions required for a sensor to theoretically guarantee a performance gain. It is noteworthy that our theoretical results hold not only for the odometry problem but also for a wider variety of problems that share the same Markov chain assumption and the IB objective function. A connection between our information-theoretic framework and geometric methods is further established for deeper insights.

The main contributions of this paper are:

  1. We propose information-theoretic odometry learning by leveraging the IB objective function to eliminate pose-irrelevant information from the latent representation;

  2. We develop the theoretical performance guarantee of the proposed framework by deriving upper bounds on the generalization error w.r.t. IB and the latent space dimensionality as well as a lower bound on the MI between the latent representation and poses;

  3. We empirically verify the effectiveness of our method on the well-known KITTI and EuRoC datasets and show how the intrinsic uncertainty benefits failure detection and inference refinement.

2 Related Work

Deep representation for odometry learning. Leveraging deep neural networks to learn compact feature representations from high-dimensional sensor data has been proven effective for odometry. Kendall et al. (2015) proposed PoseNet by using neural networks for camera relocalization, based upon which Wang et al. (2017) introduced a recurrent module to model the temporal correlation of features for visual odometry. Subsequently, Xue et al. (2019) further considered a memory and refinement module to address the prediction drift caused by error accumulation. Recently, deep learning-based odometry has also been extended to the multi-sensor configuration. Clark et al. (2017) extended the DeepVO framework to incorporate IMU data by leveraging an extra recurrent network for learning better feature representations. A recent study by Chen et al. (2019) investigated more effective and robust sensor fusion via soft and hard attention for visual-inertial odometry. Apart from end-to-end learning, there are also trends in unsupervised learning (Zhou et al., 2017; Yin & Shi, 2018; Ranjan et al., 2019; Bian et al., 2019) and the combination of learned features with geometric methods (Zhan et al., 2020; Yang et al., 2020; Zhang et al., 2022c, b). We refer readers to Chen et al. (2020) for a more detailed discussion of current methods. These deep odometry learning methods have achieved promising performance. However, two theoretical questions remain open: (1) how to learn a compact representation with theoretically guaranteed generalizability when test data is biased and (2) under what conditions extra sensors can benefit the pose prediction problem.

Fig. 1. (a) The classic learning-based odometry framework, where 6-DOF poses are directly predicted from deterministic latent representations. (b) The proposed information bottleneck (IB) framework for odometry learning. h and s are the deterministic and stochastic components, respectively. Superscripts o and p represent the observation- and pose-level transition models. Red solid arrows denote the pose regressor, and red dashed arrows denote the bottleneck constraints. Output arrows from a shaded stochastic representation represent samples from the learned latent distribution (Color figure online)

Information bottleneck. The information bottleneck (IB) provides an appealing tool for deep learning by learning an informative and compact latent representation (Tishby et al., 2000; Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017). To address the intractability of MI calculation, Alemi et al. (2017) proposed to optimize a variational bound of IB for deep learning, which was successfully applied to many tasks including dynamics learning (Hafner et al., 2020), task transfer (Goyal et al., 2019), and network compression (Dai et al., 2018). Partly inspired by these developments, we propose, for the first time, an IB-based framework for odometry learning and derive an optimizable variational bound for this sequential prediction problem. The derivation can be more delicate if we incorporate more constraints, potentially from geometric and kinematic insights. We further adopt the deterministic-stochastic separation as in Chung et al. (2015) and Hafner et al. (2019, 2020), while ours differs in that our derivation of the variational bound allows modeling two transition models separately, each with a deterministic component to improve model capacity. Moreover, though IB-based methods have been shown to be effective for learning a compact representation, the underpinning generalizability theory remains unclear. The generalization error bounds for general learning algorithms have been studied in Xu and Raginsky (2017) in information-theoretic language. This work was subsequently extended by Zhang et al. (2021) to explain the generalizability of deep neural networks. However, their results are not applicable to IB-based methods, a gap that is addressed in this paper.

Uncertainty modeling for odometry learning. Modeling uncertainty to deal with extreme cases like hardware malfunctions and unexpected occlusions is crucial for a reliable and robust odometry system. Uncertainty can be categorized into model-intrinsic epistemic uncertainty and data-dependent aleatoric uncertainty, both of which have been studied in the Bayesian deep learning literature (MacKay, 1992; Gal & Ghahramani, 2016; Kendall & Gal, 2017). For odometry, Wang et al. (2018) and Yang et al. (2020) captured the aleatoric uncertainty by imposing a probabilistic distribution on poses and used the second moment prediction as an uncertainty measure. Recently, Loquercio et al. (2020) showed that a combined epistemic-aleatoric uncertainty framework (Kendall & Gal, 2017) could improve the performance on several robotics tasks such as motion and steering angle predictions. In contrast to these works, our framework provides a built-in and efficient uncertainty measure that accounts for both uncertainty types. We empirically demonstrate how to use this uncertainty measure to evaluate data quality and system biases. Accordingly, we propose a refined inference procedure that discards highly uncertain results to improve pose prediction accuracy.

3 Information-Theoretic Odometry Learning

Odometry aims to predict the relative 6-DOF pose \(\xi _t\) between two consecutive observations \(\{o_{t-1:t}^{(m)}\}_{m=1}^{{\mathcal {M}}}\) from \({\mathcal {M}}\) sensors (e.g. camera, IMU and lidar), where t is the time index. This pose prediction problem can be formulated as \(\xi _t = g(\{o_{t-1:t}^{(m)}\}_{m=1}^{{\mathcal {M}}}, \Theta )\), where g is the mapping function of an odometry system and \(\Theta \) is the parameter set of g. Classic deep odometry learning methods model g by neural networks and learn \(\Theta \) from training data. Furthermore, they usually use a recurrent module to model the motion dynamics of the observation sequence. Figure 1a shows a typical procedure shared by representative deep odometry learning methods.

In many settings, observations are high-dimensional, such as images and lidar 3D points. Geometric methods use low-dimensional features to represent observations, while learning-based methods learn a representation from training data. However, both kinds of features may contain pose-irrelevant information that is specific to a certain sensor domain. Retaining such information encourages the model to overfit the training data and yields poor generalization performance. Since parsimony is preferred in machine learning, it is desirable to eliminate the pose-irrelevant information.

To this end, we explicitly introduce a constraint on the pose-irrelevant information. Specifically, we quantify the pose-irrelevance and the usefulness of a latent representation for pose prediction from an information-theoretic perspective. By assuming the latent representation \(s_t\) at time t is drawn from a Gaussian distribution, the MI \(I(\{o_{1:T}^{(m)}\}_{m=1}^{{\mathcal {M}}}||s_{1:T}|\xi _{1:T})\) and the MI \(I(\xi _{1:T} || s_{1:T})\) provide quantitative measures for these two aspects, respectively. Accordingly, given a sequence of observations \(\{o_{1:T}^{(m)}\}_{m=1}^{{\mathcal {M}}}\) and pose annotations \(\xi _{1:T}\) from time 1 to T, our information-theoretic odometry learning problem is:

$$\begin{aligned}&\max _{\Theta }\ {\mathcal {J}}(\Theta ) =I(\xi _{1:T} || s_{1:T}) - \gamma I_{\mathrm{bottleneck}},\end{aligned}$$
(1)
$$\begin{aligned}&I_{\mathrm{bottleneck}} = I(\{o_{1:T}^{(m)}\}_{m=1}^{{\mathcal {M}}}||s_{1:T}|\xi _{1:T}), \end{aligned}$$
(2)

where the IB weight \(\gamma \) controls the trade-off between the two MI terms. By Eqs. (1)–(2), the latent representation \(s_{1:T}\) essentially provides an information bottleneck between poses and observations, which eliminates pose-irrelevant information from the observations. Due to the high dimensionality of the observation space, it is non-trivial to calculate the two MI terms. Thus we optimize a variational lower bound instead:

$$\begin{aligned}&{\mathcal {J}}(\Theta )\ge {\mathcal {J}}'(\Theta )= E_{s_{1:T},\{o_{1:T}^{(m)}\}_{m=1}^{{\mathcal {M}}},\xi _{1:T}} [\sum \nolimits _{t=1}^T J'_t], \end{aligned}$$
(3)
$$\begin{aligned}&J'_t = {\mathcal {J}}_t^{pose} - \gamma {\mathcal {J}}_t^{\mathrm{bottleneck}}, \end{aligned}$$
(4)
$$\begin{aligned}&{\mathcal {J}}_t^{pose} = \log q_\theta (\xi _t|s_t), \end{aligned}$$
(5)
$$\begin{aligned}&{\mathcal {J}}_t^{\mathrm{bottleneck}} = D_{KL}(p_\phi || q_\varphi ), \end{aligned}$$
(6)
$$\begin{aligned}&p_\phi = p_\phi (s_t|\{o_{t-1:t}^{(m)}\}_{m=1}^{{\mathcal {M}}},s_{t-1}), \end{aligned}$$
(7)
$$\begin{aligned}&q_\varphi = q_\varphi (s_t|\xi _t, s_{t-1}). \end{aligned}$$
(8)

The detailed derivation is provided in the Supplementary Material. This lower bound consists of a variational pose regressor \(q_\theta (\xi _t|s_t)\), an observation-level transition model \(p_\phi (s_t|\{o_{t-1:t}^{(m)}\}_{m=1}^{{\mathcal {M}}},s_{t-1})\), and a pose-level transition model \(q_\varphi (s_t|\xi _t, s_{t-1})\), all of which are modeled by neural networks. For simplicity, we denote the representations from the observation-level and pose-level transition models by \(s^o_t\) and \(s^p_t\), respectively. In practice, \(s^o_t\) is used for the pose regressor. Intuitively, minimizing the KL divergence in Eqs. (6)–(8) forces the distribution of \(s_t^o\) to approximate that of \(s_t^p\), which does not encode the observation information at time t, thus penalizing \(s_t^o\) for retaining pose-irrelevant information.
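As an illustration of the per-step objective in Eqs. (4)–(8), the following minimal sketch evaluates \(J'_t\) in closed form. It assumes, beyond what is stated above, that both transition models and the pose regressor output diagonal Gaussians parameterized by a mean and a variance vector; the function and argument names are hypothetical and not taken from the original implementation.

```python
import math

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two diagonal Gaussians given as lists of means and variances."""
    kl = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        kl += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return kl

def gaussian_log_prob(x, mu, var):
    """Log-density of a diagonal Gaussian evaluated at x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))

def step_objective(pose, pose_mu, pose_var, obs_mu, obs_var, pri_mu, pri_var, gamma):
    """J'_t = log q_theta(xi_t | s_t) - gamma * KL(p_phi || q_varphi), cf. Eqs. (4)-(6).

    (obs_mu, obs_var): assumed Gaussian output of the observation-level model p_phi.
    (pri_mu, pri_var): assumed Gaussian output of the pose-level model q_varphi.
    """
    j_pose = gaussian_log_prob(pose, pose_mu, pose_var)
    j_bottleneck = gaussian_kl(obs_mu, obs_var, pri_mu, pri_var)
    return j_pose - gamma * j_bottleneck
```

With \(\gamma = 0\) the objective reduces to the pose log-likelihood alone; increasing \(\gamma \) penalizes any mismatch between the two latent distributions more strongly.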

Stochastic-only transition models, however, may compromise model performance due to uncertainty accumulation during the sampling process. To address this problem, we further introduce a deterministic component following Chung et al. (2015) and Hafner et al. (2019). In doing so, we reformulate the two transition models in the KL divergence in Eqs. (6)–(8) as:

$$\begin{aligned}&{{\textbf {observation-level}}}:\ p_\phi (s_t^o|h_t^o), \end{aligned}$$
(9)
$$\begin{aligned}&h_t^o = f^o(h_{t-1}^o,\{o_{t-1:t}^{(m)}\}_{m=1}^{{\mathcal {M}}}, s_{t-1}^o, s_{t-1}^p), \end{aligned}$$
(10)
$$\begin{aligned}&{{\textbf {pose-level}}}:\ q_\varphi (s_t^p|h_t^p), \end{aligned}$$
(11)
$$\begin{aligned}&h_t^p = f^p(h_{t-1}^p,\xi _t, s_{t-1}^o, s_{t-1}^p). \end{aligned}$$
(12)

We use two deterministic functions \(f^o\) and \(f^p\) for observation- and pose-level transitions, respectively, which are both modeled by recurrent neural networks. In addition, both \(s^o_{t-1}\) and \(s^p_{t-1}\) are used for the two deterministic transition functions to help to reduce the KL divergence between the distributions of \(s_t^o\) and \(s_t^p\). Ground-truth 6-DOF poses are fed into \(f^p\) during the training phase, while for testing, we use predicted poses to provide a runtime estimate of \(s_t^p\). Fig. 1b shows the overall framework of our method.

Remark I. Since we model the latent representation in the probabilistic space, the variance of the latent representation naturally provides an uncertainty measure. We empirically show how this intrinsic uncertainty reveals data quality and system bias in Sect. 5.3. Of note is that it is straightforward to extend the proposed information-theoretic framework to different problem settings. We can add arbitrary linear MI constraints into the proposed objective and derive similar variational bounds to satisfy different requirements such as dynamics-awareness in complex environments.
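The use of the latent variance as an uncertainty measure can be sketched as follows. This is only an illustrative reading of the remark: aggregating the per-dimension variances by their mean and comparing against a fixed threshold are assumptions made here, and the function names are hypothetical.

```python
def latent_uncertainty(variances):
    """Scalar uncertainty for one time step: mean of the per-dimension latent variances."""
    return sum(variances) / len(variances)

def flag_uncertain_steps(per_step_variances, threshold):
    """Return the indices of time steps whose latent uncertainty exceeds the threshold."""
    return [t for t, v in enumerate(per_step_variances)
            if latent_uncertainty(v) > threshold]

# Toy variances for three time steps of a two-dimensional latent representation.
variances = [[0.1, 0.2], [2.0, 3.0], [0.05, 0.05]]
suspicious = flag_uncertain_steps(variances, threshold=1.0)
```

In this toy input only the second time step is flagged; such steps could then be inspected or excluded from the final trajectory estimate.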

Remark II. All variational IB-based methods originate from Alemi et al. (2017). However, applying IB to a specific domain is non-trivial. The challenge lies in the derivation of proper variational bounds based on the specific properties of each problem. This derivation can be more delicate if we incorporate more constraints, potentially from geometric and kinematic insights. Besides, we differ from Dai et al. (2018) and Goyal et al. (2019) in that sequential observations are modeled. From this perspective, our development is related to Hafner et al. (2019) and Hafner et al. (2020), from which we further borrowed the motivation of the deterministic component, which is itself rooted in Chung et al. (2015) and Buesing et al. (2018). Ours differs in that we model the two transition models [Eqs. (6)–(8)] separately, each with a deterministic component to improve model capacity (Fig. 1b and Eqs. (9)–(10) and (11)–(12)). Moreover, we theoretically prove that constraining the IB objective essentially upper bounds the expected generalization error and establish the connection between IB and geometric methods in Sect. 4, which provides deeper insights into IB-based methods.

4 Theoretical Analysis

Formulating a problem in information-theoretic language enables us to analyze the proposed method by exploring elegant tools in information theory (Cover & Thomas, 1991) and related results in learning theory (Xu & Raginsky, 2017; Zhang et al., 2021). In this work, we show that the MI between the bottleneck and observations as well as the latent space dimensionality upper bound the expected generalization error, which provides not only insights into the generalizability of the method but also a performance guarantee. To our knowledge, this is the first time that such generalization bounds have been derived for IB by using a general loss function other than cross-entropy (Vera et al., 2018). By replacing the general loss function with the cross-entropy, our bound is tighter than that obtained by Vera et al. (2018) in terms of the sample size. We further derive a lower bound on the MI between the latent representation and poses given extra sensors, which suggests what features make a sensor useful for pose prediction in information-theoretic language. The connection between information bottleneck and geometric methods is also established to provide further insights.

4.1 Generalization Bound for Information Bottleneck

Xu and Raginsky (2017) and Zhang et al. (2021) obtained the generalization bound w.r.t. the MI between input data X and learning parameters \(\Theta \) for general learning algorithms and neural networks. However, what IB regularizes is the MI between X and the latent representation. To derive a generalization bound for the IB objective function, we first prove a relationship between these two kinds of MI in Lemma 1 under the Markov chain \(X\rightarrow S\rightarrow \xi \), an underlying assumption for IB.

Lemma 1

If \(X\rightarrow S\rightarrow \xi \) forms a Markov chain and \(\xi =g(X,\Theta )\) is a one-to-one function w.r.t. X and \(\Theta \), then we have

$$\begin{aligned} I(X,S)\ge & {} I(X,\xi ) = I(X, \Theta ) + E_{\theta }[H(X|\theta )] \end{aligned}$$
(13)
$$\begin{aligned}\ge & {} I(X,\Theta ). \end{aligned}$$
(14)

Lemma 1 enables us to extend the generalizability results for neural networks regarding \(I(X,\Theta )\) (Zhang et al., 2021) to the IB setting, leading to the following theoretical counterpart.

Theorem 1

Assuming \(X\rightarrow S\rightarrow \xi \) is a Markov chain, the loss function \(l(X,\Theta )\) is sub-\(\sigma \)-Gaussian distributed (see Footnote 1), and the prediction function \(\xi =g(X,\Theta )\) is a one-to-one function w.r.t. the input data and network parameters \(\Theta \), we have the following upper bound for the expected generalization error:

$$\begin{aligned} E[R(\Theta )-R_T(\Theta )]\le \exp \left( -\frac{L}{2}\log \frac{1}{\eta }\right) \sqrt{\frac{2\sigma ^2}{n}I(X,S)}, \end{aligned}$$
(15)

where L, \(\eta \), and n are the effective number of layers causing information loss, a constant smaller than 1, and the sample size, respectively. \(R(\Theta )=E_{X\sim D}[l(X,\Theta )]\) is the expected loss value given \(\Theta \) and \(R_T(\Theta )=\frac{1}{n}\sum _{i=1}^n l(X_i,\Theta )\) is a sample estimate of \(R(\Theta )\) from the training data.

The difference between our result and previous works is that we bound the generalization error by \(I(X,S)\), which is minimized in Eqs. (1)–(2), rather than by \(I(X,\Theta )\), which is hard to evaluate. By Theorem 1, we show that minimizing the MI between the bottleneck and observations tightens the upper bound on the expected generalization error and thus provides a theoretical performance guarantee. It is worth noting that our theoretical results apply not only to our odometry learning setting but also to a wider variety of tasks that use the IB method. This bound also implies that a larger sample size and a deeper network lead to better generalization performance, which is consistent with the results shown in Xu and Raginsky (2017) and Zhang et al. (2021). The detailed proofs of Lemma 1 and Theorem 1 can be found in the Supplementary Material.
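To make the dependence of the bound in Eq. (15) on its arguments concrete, the right-hand side can be evaluated directly. The sketch below is only a numerical illustration: the function name and the example values are hypothetical, and the MI is assumed to be given in nats.

```python
import math

def ib_generalization_bound(num_layers, eta, sigma, n, mi_xs):
    """Right-hand side of Eq. (15).

    num_layers: effective number L of layers causing information loss.
    eta:        constant smaller than 1.
    sigma:      sub-Gaussian parameter of the loss.
    n:          sample size.
    mi_xs:      I(X, S) in nats (assumed to be available, e.g. from a variational estimate).
    """
    contraction = math.exp(-0.5 * num_layers * math.log(1.0 / eta))  # equals eta ** (L / 2)
    return contraction * math.sqrt(2.0 * sigma ** 2 * mi_xs / n)

# Example values chosen only for illustration.
bound = ib_generalization_bound(num_layers=2, eta=0.5, sigma=1.0, n=100, mi_xs=2.0)
```

Since \(\exp (-\frac{L}{2}\log \frac{1}{\eta }) = \eta ^{L/2}\) with \(\eta < 1\), the bound decays geometrically in L, decreases as \(1/\sqrt{n}\), and grows with the square root of \(I(X,S)\).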

Remark I. The result of Zhang et al. (2021) is interesting in that it provides an explanation for why deeper networks lead to better performance. However, the expected generalization errors in Zhang et al. (2021) and Xu and Raginsky (2017) are both bounded by \(I(X,\Theta )\), which remains difficult to evaluate in practice. Though their results give many insights into the generalizability of algorithms in information-theoretic language, it is non-trivial to minimize \(I(X,\Theta )\) explicitly to control the generalization error bound. We move one step further by extending their results to \(I(X,S)\), the mutual information between input data and latent representations, which itself can be bounded by various well-established variational bounds (Poole et al., 2019) and optimized during training. Our result provides an explanation for the empirical generalization ability of the IB method, which explicitly minimizes \(I(X,S)\). By minimizing \(I(X,S)\), we are actually tightening the upper bound of the generalization error, thus leading to better generalization performance.

A related work by Vera et al. (2018) proved a similar result for IB: “Let \({\mathcal {F}}\) be a class of encoders. Then, for every \(P_{XY}\) and every \(\delta \in (0,1)\), with probability at least \(1-\delta \) over the choice of \({\mathcal {S}}_n\sim P_{XY}^n\) the following inequality holds \(\forall Q_{U|X}\in {\mathcal {F}}\):

$$\begin{aligned} \varepsilon _{gap}(Q_{U|X}, {\mathcal {S}}_n)\le & {} A_\delta \sqrt{I(\hat{P_X}||Q_{U|X})}\frac{\log (n)}{\sqrt{n}} \nonumber \\&+\,\frac{C_\delta }{\sqrt{n}}+{\mathcal {O}}\left( \frac{\log (n)}{n}\right) , \end{aligned}$$
(16)

where \((A_\delta , B_\delta , C_\delta )\) are quantities independent of the data set \({\mathcal {S}}_n\): \(A_\delta :=\frac{\sqrt{2}B_\delta }{P_X(x_{min})}(1+1/\sqrt{|X|})\), \(B_\delta :=2+\sqrt{\log \left( \frac{|Y|+3}{\delta }\right) }\) and \(C_\delta :=2|U|e^{-1}+B_\delta \sqrt{|Y|}\log \frac{|U|}{P_Y(y_{min})}\). \(\varepsilon _{gap}(Q_{U|X}, {\mathcal {S}}_n)\) is the generalization gap which is defined as \(|L_{emp}(Q_{U|X}, {\mathcal {S}}_n)-L(Q_{U|X})|\). \(L(Q_{U|X})\) and \(L_{emp}(Q_{U|X}, {\mathcal {S}}_n)\) are the true risk and the empirical risk, respectively.” We refer readers to Vera et al. (2018) for more details on their result.

Our result differs from that of Vera et al. (2018) in that: (1) Eq. (16) only applies to the cross-entropy loss function, while our result holds for a broader range of loss functions under the sub-\(\sigma \)-Gaussian assumption; (2) we provide a tighter generalization bound than that of Vera et al. (2018) w.r.t. the sample size (\(\frac{1}{\sqrt{n}}\) vs. \(\frac{\log (n)}{\sqrt{n}}\)); (3) for regression problems and for a large latent space, \(A_\delta \) and \(C_\delta \) in Eq. (16) could be large due to the positive dependency on |Y| and |U|. Besides, \(\frac{1}{P_X(x_{min})}\) and \(\frac{1}{P_Y(y_{min})}\) might also be large in practice, resulting in a loose bound for the generalization error.

Remark II. We now discuss the assumptions of Theorem 1 in more detail: (1) A Markov chain \(X\rightarrow S\rightarrow \xi \) is implied by neural networks with encoder-decoder structures since the decoder only takes the encoder output as its input and thus does not depend on X given S. In this case, we have \(P(\xi | S) = P(\xi | S,X)\). It is worth noting that in more general settings, where more flexible network structures that allow additional connections between X and \(\xi \) are used, this Markov chain assumption may not hold. However, since an IB model is essentially encoder-decoder structured by constraining the information flow between the encoder and the decoder, the Markov chain assumption on \(X\rightarrow S\rightarrow \xi \) holds for IB methods. (2) As discussed in Xu and Raginsky (2017), the sub-\(\sigma \)-Gaussian assumption covers a broad range of loss functions. For instance, as long as a loss function l is bounded, i.e., \(l(\cdot ,\cdot )\in [a,b]\), it is guaranteed to be sub-\(\sigma \)-Gaussian distributed with \(\sigma =\frac{b-a}{2}\) (Xu & Raginsky, 2017). The network loss landscape consists of multiple local minima, flat or sharp, and most deep learning methods assume a local Gaussian distribution by using the L2 loss (Chaudhari et al., 2017). The sub-\(\sigma \)-Gaussian assumption is more general and offers several advantages over the commonly used Gaussian assumption. Chaudhari et al. (2017) claimed that a flat local minimum is preferred for deep learning optimization algorithms due to its robustness towards parameter perturbations. Sub-\(\sigma \)-Gaussian distributions can well represent such flat local regions, e.g. the almost-flat bounded uniform distribution is sub-\(\sigma \)-Gaussian distributed. It is also worth noting that, considering the density of local minima (Chaudhari et al., 2017), \(\sigma \) is not necessarily large for local regions; a large \(\sigma \) would otherwise be a concern for the tightness of the generalization bound. Another appealing property is that the sum of sub-\(\sigma \)-Gaussian random variables is still sub-Gaussian, i.e. the assumption can fit a larger region with multiple local minima. (3) The one-to-one function assumption can be conservative due to the complexity of real-world data. For many applications, we may use pretrained models to extract high-level features and use these features as input data. For example, a pretrained FlowNet (Dosovitskiy et al., 2015; Ilg et al., 2017) is usually used in deep odometry learning methods. The input data part of this assumption could arguably hold under such circumstances. Regarding the prediction part of this assumption, the cardinality of the space of \(\xi \) could be sufficiently large for regression problems; for classification problems, the cardinality of the prediction space could also be large since we usually predict the probabilities of each category. Extending the results to a looser assumption on the network function remains an interesting direction for future research.
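The claim in (2) that a bounded loss is sub-\(\sigma \)-Gaussian with \(\sigma =\frac{b-a}{2}\) can be checked numerically for a simple case. The sketch below assumes, purely for illustration, a loss that is uniformly distributed on [a, b] and uses the standard moment-generating-function form of the sub-Gaussian condition, \(E[e^{\lambda (l-E[l])}]\le e^{\lambda ^2\sigma ^2/2}\); the function names are hypothetical.

```python
import math

def centered_uniform_mgf(lam, a, b):
    """E[exp(lam * (l - E[l]))] for l ~ Uniform[a, b], in closed form sinh(x) / x."""
    x = lam * 0.5 * (b - a)
    return 1.0 if x == 0.0 else math.sinh(x) / x

def sub_gaussian_envelope(lam, sigma):
    """Upper envelope exp(lam^2 * sigma^2 / 2) required of a sub-sigma-Gaussian variable."""
    return math.exp(0.5 * lam ** 2 * sigma ** 2)

a, b = 0.0, 4.0           # illustrative loss range
sigma = 0.5 * (b - a)     # sigma = (b - a) / 2, as stated in the text
holds = all(centered_uniform_mgf(lam, a, b) <= sub_gaussian_envelope(lam, sigma)
            for lam in [-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0])
```

The inequality holds at every tested \(\lambda \); this is only a spot check on a grid for one distribution, not a proof of the general statement.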

4.2 Generalization Bound for Latent Dimensionality

We further investigate the generalizability w.r.t. model complexity in terms of the cardinality and dimensionality of the latent representation space under the IB framework.

Corollary 1

Under the same assumptions as in Theorem 1, let |S| be the cardinality of the latent representation space. Then we have

$$\begin{aligned} E[R(\Theta )-R_T(\Theta )]\le \exp \left( -\frac{L}{2}\log \frac{1}{\eta }\right) \sqrt{\frac{2\sigma ^2}{n}\log |S|}. \end{aligned}$$
(17)

It is well recognized that a large model complexity can impair the generalizability of the model. We reveal this complexity-overfitting trade-off in Corollary 1, where the expected generalization error is upper bounded in terms of the cardinality of the latent representation space. In addition, regarding model design and sample collection, Corollary 1 indicates that the growth rate of \(\log |S|\) should not exceed that of n, so that the generalization error bound does not blow up.

Corollary 2

Under the same assumptions as in Theorem 1, assume that S lies in a d-dimensional subspace of the latent representation space, that \(\sup _{s_i\in S_i}\ ||s_i|| \le M,\forall i\in [1,d]\), and that S can be approximated by a densely quantized space. Then the following generalization bound holds:

$$\begin{aligned}&E[R(\Theta )-R_T(\Theta )]\le \exp \left( -\frac{L}{2}\log \frac{1}{\eta }\right) \sigma {\mathcal {C}}, \end{aligned}$$
(18)
$$\begin{aligned}&{\mathcal {C}} = \sqrt{\frac{d\log (d)}{n}+2\log (2M)\frac{d}{n}+\frac{d}{n/\log (n)}}. \end{aligned}$$
(19)

In practice, it is usually difficult to evaluate \(\log |S|\) in Corollary 1 numerically. Therefore, we leverage the quantization trick used in Xu and Raginsky (2017) to reduce the upper bound to a function of the dimensionality d of the latent representation space. The result is given in Corollary 2, which suggests that the growth rate of d should not exceed that of \(n/\log (n)\). It is worth noting that this result holds not only for IB but also for a broader range of encoder-decoder models under the Markov chain assumption on \(X\rightarrow S\rightarrow \xi \).
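The role of the growth rate of d can be seen by evaluating the factor \({\mathcal {C}}\) of Eq. (19) for different scalings of d with n. The following sketch is a numerical illustration only; the function name and the values of d, n and M are chosen arbitrarily.

```python
import math

def dimensionality_term(d, n, M):
    """The factor C in Eq. (19) for latent dimensionality d, sample size n and norm bound M."""
    return math.sqrt(d * math.log(d) / n
                     + 2.0 * math.log(2.0 * M) * d / n
                     + d * math.log(n) / n)   # d / (n / log n) = d * log(n) / n

M = 1.0  # illustrative bound on the latent norm
for n in (10 ** 4, 10 ** 6, 10 ** 8):
    fixed = dimensionality_term(64, n, M)                  # d fixed: C shrinks with n
    slow = dimensionality_term(n / math.log(n), n, M)      # d ~ n / log(n): C stays bounded
    fast = dimensionality_term(n, n, M)                    # d ~ n: C keeps growing
    print(n, round(fixed, 3), round(slow, 3), round(fast, 3))
```

With d fixed, \({\mathcal {C}}\) vanishes as n grows; with \(d = n/\log (n)\) the three terms sum to \(2+(2\log (2M)-\log \log n)/\log n\), which stays bounded; with \(d = n\) the sum behaves like \(2\log n\) and diverges.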

4.3 Predictability Bound for Extra Sensors

Odometry performance is highly dependent on the sensors deployed, yet it remains non-trivial to select informative sensors that guarantee a performance gain. In this section, we address this problem using information-theoretic language under our proposed framework.

Theorem 2

If \((\{o^{(m)}\}_{m=1}^{{\mathcal {M}}},\ o^{({\mathcal {M}}+1)})\rightarrow S\rightarrow \xi \) forms a Markov chain, then we have,

$$\begin{aligned}&I(\xi ||S) \ge I_{old} + I_{new} - I_{obs}, \end{aligned}$$
(20)
$$\begin{aligned}&I_{old} = I(\xi ||\{o^{(m)}\}_{m=1}^{{\mathcal {M}}}), \end{aligned}$$
(21)
$$\begin{aligned}&I_{new} = I(\xi || o^{({\mathcal {M}}+1)} | \{o^{(m)}\}_{m=1}^{{\mathcal {M}}}), \end{aligned}$$
(22)
$$\begin{aligned}&I_{obs} =I(o^{({\mathcal {M}}+1)}||\{o^{(m)}\}_{m=1}^{{\mathcal {M}}}|\xi ). \end{aligned}$$
(23)

Theorem 2 suggests that if a new sensor \(o^{({\mathcal {M}}+1)}\) is useful for pose prediction, the MI between \(o^{({\mathcal {M}}+1)}\) and poses given existing sensors should be large. Meanwhile, it is preferred to have a small MI between \(\{o^{(m)}\}_{m=1}^{{\mathcal {M}}}\) and \(o^{({\mathcal {M}}+1)}\) given pose information. In other words, a heterogeneous sensor that shares little pose-irrelevant information with existing sensors is desirable. In addition, we observe that the gap between \(I(\xi || o^{({\mathcal {M}}+1)} | \{o^{(m)}\}_{m=1}^{{\mathcal {M}}})\) and \(I(o^{({\mathcal {M}}+1)}||\{o^{(m)}\}_{m=1}^{{\mathcal {M}}}|\xi )\), i.e., \(I_{new} - I_{obs}\), provides a theoretical guarantee for the performance of the learned latent representation.
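As an illustration of the three terms in Eqs. (21)–(23), the sketch below evaluates them on a toy discrete joint distribution over a binary "pose" and two binary "sensors". The toy distribution, the flip probabilities, and all function names are assumptions made for this example and are not part of the proposed method; the MI values are computed exactly from entropies, in bits.

```python
import math
from itertools import product

XI, O_OLD, O_NEW = 0, 1, 2  # axis indices of (pose, existing sensor, new sensor)


def marginal(joint, axes):
    """Marginal pmf over the given axes of a joint pmf stored as {outcome: prob}."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in axes)
        out[key] = out.get(key, 0.0) + p
    return out


def entropy(joint, axes):
    """Shannon entropy (bits) of the marginal over the given axes."""
    return -sum(p * math.log2(p) for p in marginal(joint, axes).values() if p > 0)


def cond_mi(joint, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return (entropy(joint, a + c) + entropy(joint, b + c)
            - entropy(joint, a + b + c) - entropy(joint, c))


def noisy_sensor_joint(flip_old=0.25, flip_new=0.10):
    """Toy model: both sensors are independent noisy copies of a uniform binary pose."""
    joint = {}
    for xi, o_old, o_new in product((0, 1), repeat=3):
        p_old = flip_old if o_old != xi else 1.0 - flip_old
        p_new = flip_new if o_new != xi else 1.0 - flip_new
        joint[(xi, o_old, o_new)] = 0.5 * p_old * p_new
    return joint


def theorem2_terms(joint):
    """I_old, I_new, I_obs and the resulting lower bound of Eq. (20)."""
    i_old = cond_mi(joint, (XI,), (O_OLD,))
    i_new = cond_mi(joint, (XI,), (O_NEW,), (O_OLD,))
    i_obs = cond_mi(joint, (O_NEW,), (O_OLD,), (XI,))
    return {"I_old": i_old, "I_new": i_new, "I_obs": i_obs,
            "lower_bound": i_old + i_new - i_obs}


print(theorem2_terms(noisy_sensor_joint()))
```

In this toy model the two sensors are conditionally independent given the pose, so \(I_{obs}\) vanishes and the new sensor contributes its full conditional information \(I_{new}\) to the bound.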

4.4 Connection with Geometric Methods

More generally, an odometry system can be modeled as \(h(z_{k,j}, v_k, \check{x_k})\rightarrow (\hat{x_k},p_j)\), where \(z_{k,j}, v_k, \check{x_k}, \hat{x_k}\) and \(p_j\) are the observations, noise, prior pose, posterior pose, and latent state, respectively. At this level, the bottleneck MI \(I(z_{k,j},v_k||p_j|\hat{x_k}){=}H[h(z_{k,j},v_k,\check{x_k})|\hat{x_k}]{-}H[h(z_{k,j},v_k,\check{x_k})|\hat{x_k},z_{k,j},v_k]\) is the extra entropy (\(\Delta H\)) introduced by \((z_{k,j},v_k)\), which differs for different h. Factor graph based methods use optimization over L2 costs as h, where \(p_j\) is the inferred landmark and Gaussian noise is assumed. \(\Delta H\) in this case is implied by the noise variance, which corresponds to the pre-specified weight of each cost function. Learning-based methods learn h from data, where \(p_j\) is the latent feature. Minimizing \(\Delta H\) means reducing the uncertainty from noise and inexact learned function forms. The same analysis applies to the kinematic function for \(\check{x_k}\). In addition, filter-based methods can also be included by following the same logic. Take the kinematics part of the Kalman filter (a linear Gaussian system) as an example: \(\check{x_k}=A_k\hat{x_{k-1}}+u_k+w_k\), where the prior \(\check{x_k}\) is the latent state and the covariances of \(\hat{x_{k-1}}\) and \(w_k\) are \(\hat{\Sigma _{k-1}}\) and R, respectively. Then \(I(u_k,w_k||\check{x_k})=\frac{1}{2}\ln (|A_k\hat{\Sigma _{k-1}}A_k^T+R|/|A_k\hat{\Sigma _{k-1}}A_k^T|)\), suggesting that a smaller bottleneck MI corresponds to a relatively smaller noise variance.
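The Kalman filter expression above can be checked numerically. The following sketch evaluates \(\frac{1}{2}\ln (|A\Sigma A^T+R|/|A\Sigma A^T|)\) for small matrices given as nested lists; the helper names and the example matrices are hypothetical, and the determinant routine is a plain Gaussian elimination written only to keep the example dependency-free.

```python
import math


def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [[float(x) for x in row] for row in m]
    n, d = len(m), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if m[p][c] == 0.0:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d  # a row swap flips the sign
        d *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= f * m[c][k]
    return d


def kalman_bottleneck_mi(A, Sigma_prev, R):
    """0.5 * ln(|A Sigma A^T + R| / |A Sigma A^T|), in nats."""
    prop = matmul(matmul(A, Sigma_prev), transpose(A))  # propagated covariance
    total = [[prop[i][j] + R[i][j] for j in range(len(R))] for i in range(len(R))]
    return 0.5 * math.log(det(total) / det(prop))


identity = [[1.0, 0.0], [0.0, 1.0]]
for r in (0.01, 0.1, 1.0):  # hypothetical process-noise variances
    R = [[r, 0.0], [0.0, r]]
    print(r, kalman_bottleneck_mi(identity, identity, R))
```

For identity A and \(\Sigma \) in two dimensions and \(R=rI\), the expression reduces to \(\ln (1+r)\), so the MI shrinks towards zero as the process-noise variance r decreases.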

5 Experiments

We tested our method on the well-known KITTI (Geiger et al., 2013) and EuRoC (Burri et al., 2016) datasets. Since most existing supervised methods are not open source, we re-implemented the representative state-of-the-art methods as our baselines, including DeepVO (Wang et al., 2017), VINet (Clark et al., 2017), and the two attention-based visual-inertial methods proposed by Chen et al. (2019), namely SoftFusion and HardFusion. All models shared the same network architecture for a fair comparison. We further examined generalization to more challenging scenarios, such as extreme weather and lighting conditions, by testing DeepVO and InfoVO on vKITTI2 (Cabon et al., 2020). In addition, we empirically studied the pose-irrelevant information contained in DeepVO and InfoVO to examine the hypothesis underlying the problem that we target. We also conducted extensive ablation studies on the deterministic component, the weight of the IB objective, the sample size, extra sensors, the intrinsic uncertainty measure, and the growing rate relationship between the latent dimension and n/log(n).

5.1 Datasets and Experimental Settings

The KITTI odometry dataset consists of 11 real-world car driving videos with calibrated ground-truth 6-DOF pose annotations. The EuRoC dataset was instead collected from a MAV in two buildings, resulting in 11 sequences whose difficulty varies with manually adjusted obstacles. For visual-inertial experiments, we manually aligned the 100 Hz IMU records in the raw KITTI dataset to the 10 Hz image sequences using the corresponding timestamps. The image and IMU sequences in EuRoC were downsampled to 10 Hz and 100 Hz, respectively. We split the training and test datasets following the recent work by Chen et al. (2019). Our implementation was based on PyTorch (Steiner et al., 2019), and we will release the source code package and the trained models. We used GRU (Cho et al., 2014) to model the deterministic transitions and IMU records. A pretrained FlowNet was used to extract features from image data (Dosovitskiy et al., 2015; Ilg et al., 2017). More advanced optical flow estimation methods, such as RAFT (Teed & Deng, 2020) and GMFlow (Xu et al., 2022), could also be explored. The other parts were modeled by MLP layers.

5.1.1 Detailed Network Architecture

The overall network can be separated into four components. (1) Observation encoders: for image observations, we first extract the output of the \(out\_conv6\_1\) layer of a pretrained FlowNet2S (Ilg et al., 2017) model as an intermediate high-level feature, which is then flattened and fed into three MLP layers with feature size 1024 to obtain image features. Note that the last MLP layer does not use a non-linear activation. For IMU data, we use a two-layer GRU model with feature size 1024 to extract IMU features. (2) Deterministic transition models: for the observation-level transition, we first fuse the observation features and concatenate the fused feature with \(s^o_{t-1}\) and \(s^p_{t-1}\) from the last time step. The observation features are fused by concatenation in VINet and InfoVIO. For SoftFusion, SoftInfoVIO, HardFusion and HardInfoVIO, we use the same soft and hard fusion strategies proposed in Chen et al. (2019), where the Gumbel temperature decays linearly from 1 to 0.5 over the first 150 epochs of training and is fixed to 0.5 for testing. For the pose-level transition, we tile the 6-DOF poses eight times into a vector of length 48, which is then also concatenated with \(s^o_{t-1}\) and \(s^p_{t-1}\). Ground-truth 6-DOF poses are used during training, while the predicted poses are used during testing. The concatenated features are then fed into an MLP and a GRU layer to obtain \(h^o_t\) and \(h^p_t\), respectively. (3) Stochastic state estimators: the deterministic states are fed into two MLP layers to obtain the mean and standard error vectors of the stochastic representation, both of size 128. Note that the last MLP layer does not use a non-linear activation. To avoid a trivial solution, we set the minimum standard error to 0.1 and only predict the residue, where the softplus function is used to guarantee a positive residue. We further use the reparameterization trick proposed in Kingma and Welling (2014) to sample from the stochastic representation distributions, which enables gradient backpropagation through the stochastic representations. (4) Pose regressor: we feed the sampled observation-level representation \(s^o_t\) into three MLP layers to obtain the translation and rotation predictions. Translation and rotation share the first two MLP layers, followed by two separate MLP layers without non-linear activation for translation and rotation, respectively.
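The stochastic state estimator described above (a minimum standard error of 0.1, a softplus residue, and reparameterized sampling) can be sketched in a few lines. This is a dependency-free illustration on plain Python lists rather than the PyTorch implementation; the function names and the input values are hypothetical.

```python
import math
import random

MIN_STD = 0.1  # minimum standard error quoted in the text


def softplus(x):
    """Numerically stable log(1 + exp(x)); always non-negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def stochastic_state(mean, raw_residue, rng):
    """Return a reparameterized sample and the standard errors.

    mean and raw_residue stand for the two unactivated outputs of the
    estimator; the standard error is MIN_STD plus a softplus residue.
    """
    std = [MIN_STD + softplus(r) for r in raw_residue]
    eps = [rng.gauss(0.0, 1.0) for _ in mean]  # noise independent of the parameters
    sample = [m + s * e for m, s, e in zip(mean, std, eps)]
    return sample, std


def intrinsic_uncertainty(std):
    """Averaged variance of the stochastic latent representation."""
    return sum(s * s for s in std) / len(std)


rng = random.Random(0)
sample, std = stochastic_state([0.0, 1.0, -1.0], [-2.0, 0.0, 2.0], rng)
print(sample, std, intrinsic_uncertainty(std))
```

Because the randomness enters only through eps, the sample is a deterministic function of the predicted mean and residue, which is what allows gradients to flow through the sampling step in an automatic differentiation framework. A standard error floor of 0.1 corresponds to a variance floor of 0.01.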

All MLP layers with non-linear activation use the ReLU function and have feature sizes 256 and 512 for KITTI and EuRoC, respectively. The state size is set to 128 and 256 for KITTI and EuRoC, respectively. For all baseline models (DeepVO, VINet, SoftFusion, and HardFusion), we remove the pose-level transitions and stochastic state estimators and directly feed \(h^o_t\) into the pose regressor for prediction.

5.1.2 Training and Evaluation Strategies

We used the same training and test splits as Chen et al. (2019). For KITTI, we used sequences 00, 01, 02, 04, 06, 08, and 09 for training and the rest for testing. For EuRoC, we used the sequence MH_04_difficult for testing and the rest for training. The KITTI odometry dataset does not contain synchronized IMU data. Therefore, we manually aligned the 100 Hz IMU records in the raw KITTI data to the 10 Hz image sequences using the corresponding timestamps. EuRoC provides synchronized image and IMU data, collected at 20 Hz and 200 Hz, respectively. Following the practice of previous work (Chen et al., 2019; Clark et al., 2017), we downsampled the image and IMU data in EuRoC to 10 Hz and 100 Hz, respectively. By assuming a Gaussian distribution for \(q_\theta (\xi _t|s_t)\), we reduced the optimization of Eq. (3) to minimizing the L2-norm of the pose errors, resulting in the following loss function:

$$\begin{aligned} {\mathcal {L}}=\sum _{n=1}^N \alpha ||t-{\hat{t}}|| + \beta ||r-{\hat{r}}||, \end{aligned}$$
(24)

where t and \({\hat{t}}\) are the ground-truth and predicted translations, and r and \({\hat{r}}\) are the ground-truth and predicted rotations. We used Euler angles as the quantitative rotation measure. \(\alpha \) and \(\beta \) are the translation and rotation error weights, respectively, which were empirically set to 1 and 100 for KITTI and to 100 and 20 for EuRoC. We predicted the mean and variance of the stochastic representation \(s_t\) and set the minimum variance to 0.01 to avoid a trivial solution. We set \(\gamma \) in Eq. (12) to balance the bottleneck effect. All models were trained for 300 epochs using mini-batches of 16 clips containing five frames each. We set the initial learning rate to 1e-4, which was reduced to 1e-5 and 5e-6 at epochs 150 and 250, respectively, to stabilize the training process.
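A minimal sketch of the pose loss in Eq. (24) and of the step-wise learning rate schedule is given below. It operates on plain Python sequences and is not the released implementation; the function names are hypothetical, and treating epochs as zero-indexed is an assumption.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def pose_loss(t_true, t_pred, r_true, r_pred, alpha, beta):
    """Eq. (24): sum over frames of alpha*||t - t_hat|| + beta*||r - r_hat||.

    Each argument is a sequence of 3-vectors (translations, Euler angles).
    """
    total = 0.0
    for t, t_hat, r, r_hat in zip(t_true, t_pred, r_true, r_pred):
        t_err = l2_norm([a - b for a, b in zip(t, t_hat)])
        r_err = l2_norm([a - b for a, b in zip(r, r_hat)])
        total += alpha * t_err + beta * r_err
    return total


def learning_rate(epoch):
    """1e-4 initially, 1e-5 from epoch 150, 5e-6 from epoch 250 (zero-indexed)."""
    if epoch < 150:
        return 1e-4
    if epoch < 250:
        return 1e-5
    return 5e-6


# KITTI weights from the text: alpha = 1, beta = 100.
loss = pose_loss(
    t_true=[(1.0, 0.0, 0.0)], t_pred=[(0.9, 0.0, 0.1)],
    r_true=[(0.0, 0.02, 0.0)], r_pred=[(0.0, 0.01, 0.0)],
    alpha=1.0, beta=100.0,
)
print(loss, learning_rate(0), learning_rate(200), learning_rate(299))
```

The large rotation weight for KITTI compensates for the fact that per-frame Euler angle errors (in radians) are numerically much smaller than translation errors.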

We trained and evaluated the odometry model in a clip-wise manner. For evaluation, we used a sliding window strategy such that the evaluated clips overlap, which means that a frame-pair can appear at different positions in different clips. Based on our empirical observations, we designed a refinement strategy that discards the results from the first position and averages the rest, which will be discussed in Sect. 5.7.2. Following Sturm et al. (2012) and Chen et al. (2019), the averaged root mean squared errors (RMSEs) were used for evaluating both translation and rotation performance.

Remark I. In odometry learning, we usually use Euler angles or quaternions for rotation representation rather than SO(3) as implied in SE(3), due to the redundant parameters of the rotation matrix and the orthogonality constraint. In this work, we adopt Euler angles in our experiments and assume a Gaussian distribution in this vector space for simplicity and easier implementation. Though the 3D von Mises–Fisher distribution (Khatri & Mardia, 1977) and the 4D Bingham distribution (Gilitschenski et al., 2019) are arguably more appropriate for modeling Euler angles and quaternions, respectively, it is non-trivial to evaluate and use them for training in practice. The exploration of these more advanced representation and distribution choices remains potentially important future research work.

Remark II. For the choice of hyperparameters such as \(\alpha \), \(\beta \), and \(\gamma \), we basically followed the initial setup of prior works (Wang et al., 2017; Chen et al., 2019; Hafner et al., 2020) and performed a non-intensive, small-range grid search. More elegant methods, such as relying on covariance estimates (Peretroukhin & Kelly, 2017), can be considered in future studies and applications to new datasets.

Table 1 Test results on KITTI and EuRoC

5.2 Main Results

We implemented our visual-inertial framework using three fusion strategies proposed in Chen et al. (2019), namely InfoVIO, SoftInfoVIO, and HardInfoVIO. We also included two traditional visual-inertial odometry methods for comparison, i.e., OKVIS (Leutenegger et al., 2015) for EuRoC and MSCKF (Mourikis & Roumeliotis, 2007; Hu & Chen, 2014) for KITTI. OKVIS is not used for KITTI due to the lack of accurate time synchronization between images and IMU data. Following Sturm et al. (2012) and Chen et al. (2019), we report the averaged root mean squared errors (RMSEs) of translation and rotation. The results are given in Table 1. Our results support the effectiveness of IB w.r.t. the generalizability to test data. Specifically, our basic models (InfoVO/InfoVIO) outperformed all baselines w.r.t. both metrics on KITTI and the translation error on EuRoC. Visual odometry models performed well for translation prediction while incorporating IMU significantly improved the rotation results. Since the MAV trajectories are challenging w.r.t. rotation, the traditional method (OKVIS) still outperformed the other methods, although our result was competitive with the other learning-based baselines. Our re-implementation achieved a better result on KITTI compared with Chen et al. (2019) but the performance on EuRoC degraded. EuRoC by its nature is much more challenging than KITTI. The major difficulties include (1) the diverse scenarios including an industrial machine hall and an office room, compared with the similar-looking street views in KITTI, (2) the varying difficulty levels of different sequences by manually adjusted obstacles, and (3) the grey-scale images while the FlowNet encoder was pretrained using RGB images, which indicates a domain gap from RGB to grey images and thus degrades the results accordingly. Therefore, reducing the performance gap on EuRoC may require more carefully designed training strategies. 
Comparisons between the two datasets are summarized in the Supplementary Material.

5.2.1 Visualization of KITTI Trajectories

We further provide per-sequence results and trajectory visualizations for DeepVO, InfoVO, VINet and InfoVIO to illustrate the benefit of optimizing the IB objective.

Table 2 Per-sequence results on KITTI. We report the averaged translation RMSE drift \(t_{rel}\) (\(\%\)) on lengths of 100–800 m and the averaged rotation RMSE drift \(r_{rel}\) (\(^\circ /100\,\mathrm{m}\)) on lengths of 100–800 m

Fig. 2 Predicted trajectories of DeepVO, InfoVO, VINet, and InfoVIO on KITTI sequences 05, 07 and 10 (Color figure online)

Results of the test sequences 05, 07, and 10 are presented in Table 2 and Fig. 2. Though long-term accumulated drifts are observed for all end-to-end learning-based odometry methods, InfoVO and InfoVIO that optimize the IB objective still perform better than DeepVO and VINet, especially on sequence 05, which is longer and more challenging due to the increased number of turns.

5.3 Generalization to Challenging Scenarios

In addition to the results reported on the test splits of KITTI and EuRoC, we further examine the performance of InfoVO on vKITTI2 (Cabon et al., 2020), a simulated autonomous driving dataset that contains various scenarios. We illustrate the benefit of the IB objective by training DeepVO and InfoVO on the clean sequences in vKITTI2 and comparing their performance on the more challenging counterparts that have different weather conditions (rain and fog) and lighting conditions (morning, sunset, and overcast). We used Scenes 01, 02, and 06 as the training set and held out Scenes 18 and 20 as the test set. Note that only the clean sequences in the training set are used during training.

Table 3 Results on challenging sequences of vKITTI2. W and L denote sequences that contain different weather conditions (rain and fog) and lighting conditions (morning, sunset, and overcast), respectively

Results under different weather and lighting conditions are presented in Table 3. InfoVO achieves better generalization results than DeepVO in the challenging scenarios w.r.t. both translation and rotation predictions. In addition, our results suggest that extreme weather conditions are more challenging than different lighting conditions due to the noise and texture loss in the frames. Building a more robust odometry system for such scenarios remains an interesting research direction.

5.4 Compactness of the Latent Space

A key hypothesis underlying the motivation to develop our framework is that methods without specific consideration on the compactness of the latent space will implicitly encode pose-irrelevant information into the learnt features, which can be eliminated by the information bottleneck objective. We empirically demonstrated this phenomenon by comparing the reconstruction accuracies using the features learnt by DeepVO and InfoVO.

Since the optical flow features from the pretrained FlowNet2S (Ilg et al., 2017) are used as the network inputs for both DeepVO and InfoVO, we proposed to empirically measure the amount of pose-irrelevant information by the ability to reconstruct those optical flow features from the latent space of DeepVO and InfoVO, respectively. Specifically, we used three MLP layers as the reconstruction decoder, which takes the latent features from the DeepVO and InfoVO models trained on the KITTI dataset as input. We varied the hidden size d of the decoder to examine the performance under different reconstruction capacities. We adopted the same training/test split as in our main experiment and trained the decoder for 300 epochs.

Table 4 Results of the reconstruction of optical flow features on KITTI

The averaged MSE losses \({\bar{l}}\) for optical flow feature reconstruction under different hidden sizes are presented in Table 4. We also report the results obtained by taking white Gaussian noise as input. The input optical flow features contain both pose-relevant and pose-irrelevant information, such as occlusions and the motion of dynamic objects. InfoVO achieves a higher pose prediction accuracy than DeepVO, which indicates that it has extracted more pose-relevant information. Its lower accuracy in reconstructing the optical flow features therefore indicates that InfoVO has eliminated more pose-irrelevant information than DeepVO while maintaining the pose-relevant information needed for downstream pose prediction. It is worth noting that the reconstruction performance of InfoVO is close to that of random noise at hidden size 128, which means that although a certain degree of pose-irrelevant information may still exist in the feature space of InfoVO, the remaining amount is small and a relatively powerful decoder is required to extract it.

5.5 Growing Rate of the Latent Dimension

As suggested in Corollary 2, the growing rate of the latent dimension d should not exceed that of \(n/\log (n)\) to avoid overfitting and achieve a tighter generalization bound. To illustrate this effect, we use different sample size ratios of sequence 01 to train InfoVO, and test the trained models on sequences 09 and 10, whose motion patterns are quite different from those of sequence 01 (slower vehicle speed). We first choose the sample size ratio \(r_0=1/4\) as the starting point, and empirically determine its corresponding latent dimension \(d_0=384\), which leads to neither underfitting nor overfitting. Then we study the performance of InfoVO models using different latent dimensions under the sample size ratios \(r_1=1/2\) and \(r_2=1.0\), for which n/log(n) grows by factors of 1.780 and 3.208 relative to \(r_0\), respectively. The results are presented in Fig. 3.

Fig. 3 Results of varying latent dimensions (256, 512, 1024, 1536, 2048) under the sample size ratios 1/2 (red) and 1.0 (blue). The RMSE results of the combined 6-DOF translation and rotation vector are reported

We examine the results of latent dimensions 256, 512, 1024, 1536, and 2048. For \(r_1=1/2\) and \(r_2=1.0\), the latent dimensions that have the same growing rates as n/log(n) are \(384 \times 1.780\approx 684\) and \(384 \times 3.208\approx 1232\), respectively. Accordingly, our results showed that the latent dimensions 512 and 1024, the largest tested values below these thresholds, achieved the best test results before overfitting for \(r_1=1/2\) and \(r_2=1.0\), respectively. A small latent dimension led to an underfitted model, while overfitting was observed when the growing rate of the latent dimension exceeded that of n/log(n), which supports Corollary 2 empirically.
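The arithmetic behind these matched latent dimensions can be reproduced with a short script. The base sample size n0 = 275 used below is an assumption made for this sketch (roughly a quarter of a sequence with about 1100 frame-pairs); the exact count is not stated in this section, and the function names are hypothetical.

```python
import math


def n_over_log_n(n):
    return n / math.log(n)


def matched_latent_dim(d0, n0, n):
    """Growth factor of n/log(n) from n0 to n, and d0 scaled by that factor."""
    rate = n_over_log_n(n) / n_over_log_n(n0)
    return rate, d0 * rate


D0 = 384   # latent dimension chosen at ratio r0 = 1/4 (from the text)
N0 = 275   # assumed number of samples at r0 = 1/4 (hypothetical)

for ratio_name, n in (("r1 = 1/2", 2 * N0), ("r2 = 1.0", 4 * N0)):
    rate, dim = matched_latent_dim(D0, N0, n)
    print(ratio_name, round(rate, 3), round(dim))
```

Under this assumed n0, the growth factors come out close to the quoted 1.780 and 3.208, and the matched dimensions round to 684 and 1232.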

5.6 Ablation Studies

Extensive ablation studies were conducted to examine the effects of (1) the deterministic component, (2) the IB weight, (3) the sample size and (4) extra sensors. Key observations include: (1) without the deterministic component, both translation and rotation performance dropped significantly; (2) determining the IB weight \(\gamma \) presents a trade-off between the accuracy of translation and rotation prediction; (3) a larger sample size reduces both the uncertainty and prediction errors; and (4) IMU is more ‘useful’ than cameras for rotation prediction while cameras are more crucial than IMU for translation prediction, according to the discussions on Theorem 2.

5.6.1 Effect of the Deterministic Component

We conducted stochastic-only ablation experiments to examine the effects of the deterministic components in Eqs. (9, 10) and (11, 12) by removing the deterministic nodes in Fig. 1b. We implemented two versions depending on whether the observation- and pose-level latent representations (\(s^o\) and \(s^p\)) were both used as the recurrent network state (StochasticVO/VIO-d), or not (StochasticVO/VIO-s). Results are summarized in Table 5. Without the deterministic component, both translation and rotation performance dropped significantly, which supports the effectiveness of the proposed deterministic component.

Table 5 Results of the stochastic-only models on KITTI

Remark. For the stochastic-only models, we remove the stochastic state estimators and let the GRU layer in the deterministic transition models directly output the means and standard error residues of the stochastic representation. For state transitions, we then use the sampled states as the transitioned state context for the transition model at the next time step. More details of the two implementations are given below. StochasticVO/VIO-d is short for “stochastic VO/VIO with double transition states”, which uses \((s^o_{t-1},s^p_{t-1})\) as the transition state from the last time step for both observation- and pose-level transitions. StochasticVO/VIO-s is short for “stochastic VO/VIO with single transition states”, which uses \((s^o_{t-1},s^o_{t-1})\) and \((s^p_{t-1},s^p_{t-1})\) as the transition state from the last time step for observation- and pose-level transitions, respectively.

5.6.2 Effect of the IB Weight

We examined the effect of the IB weight, i.e. \(\gamma \) in Eqs. (12) and (4). As shown in Table 6, although \(\gamma =0.1\) is a good choice for training on the EuRoC dataset, the translation and rotation results did not change consistently with different IB weights on the KITTI dataset. While the translation accuracy degrades under a larger \(\gamma \), the rotation result improves instead. This finding indicates that the choice of the IB weight presents a trade-off between the accuracy of translation and rotation predictions and should be made according to the requirements of the specific task.

Table 6 Results of varying IB weights \(\gamma \) for InfoVIO

5.6.3 Effect of the Sample Size

We study the effect of the sample size by using different ratios \(r_n\) of training samples for training the model. Recall that we set the minimum variance to 0.01 to avoid a trivial solution, which sets an empirical lower bound on the uncertainty. Table 7 shows that a larger sample size reduces both the uncertainty and the prediction errors. An interesting observation is that although more training samples still benefit the prediction performance, the averaged variance, i.e., the uncertainty measure, does not decrease further once half of the dataset is used. We suspect that this is because KITTI sequences exhibit quite similar patterns (mostly road driving scenarios). Thus, half of the samples are sufficient for the model to be “familiar” with the dataset and reach the uncertainty margin. In contrast, when the training samples are insufficient, e.g., 1/4 of the total samples, the variance increases significantly.

Table 7 Results of varying sample sizes on KITTI

5.6.4 Effect of Extra Sensors

Motivated by Theorem 2 and our failure-awareness analysis, we study the performance gain of IMU given images and vice versa. The comparison between InfoVO and InfoVIO provides the performance gain of IMU given images. Similarly, to study the performance gain of images given IMU, we trained an IMU-only model, denoted as InfoIO, which is then compared with InfoVIO. The results are summarized in Table 8, which implies that IMU is more ‘useful’ than cameras for rotation prediction while cameras are more crucial than IMU for translation prediction. Moreover, IMU provides a larger performance gain in EuRoC than in KITTI, which is consistent with the fact that the synchronization between IMU and ground-truth poses is more accurate in EuRoC. We also observed that InfoIO performs poorly on KITTI. The large performance gain of images given IMU on KITTI w.r.t. both translation and rotation might also result from the inaccurate alignment of IMU records from the raw KITTI dataset to the image and ground-truth pose sequences.

Table 8 Performance gain of IMU given images and images given IMU

5.7 What Does the Intrinsic Uncertainty Mean?

We next used the averaged variance of the stochastic latent representation as an intrinsic uncertainty measure and empirically showed how this uncertainty reveals the system properties and data degradation. We found some interesting relationships between the uncertainty and poses, e.g., larger turning angles and smaller forward distances lead to higher uncertainty. Our analysis suggests a practical data collection guideline, i.e., augmenting the uncertain parts of the pose distribution.

5.7.1 Uncertainty on KITTI and EuRoC

We show the uncertainty results of InfoVIO on KITTI and EuRoC in Figs. 4 and 5, respectively. Since the translations along the x and y axes and the rotations around the x and z axes are relatively small in the KITTI dataset, their uncertainties do not exhibit a clear pattern. For the translation along the forward z-axis and the rotation around the upward y-axis (turning left/right), however, we observe a clear negative and a clear positive relationship between motion magnitude and uncertainty, respectively. A possible reason is that a large forward parallax provides more distinctive matching features for pose prediction, while a large turning angle dramatically reduces the shared visible areas and makes accurate predictions difficult. For the EuRoC dataset, we observed a consistent positive relationship for all three rotations, which makes sense in that the MAV rotations are more uniformly distributed along the three axes. The negative relationship in the translation results of EuRoC is less clear than that of KITTI, partly due to the relative difficulty of accurately predicting MAV translations, since EuRoC has a much smaller translation scale than KITTI.

Fig. 4 Uncertainty results of InfoVIO on KITTI. The top and bottom rows represent translation and rotation results. The first, second, and third columns represent x, y, and z, respectively. xyz are with respect to the coordinate system in KITTI. pos-i means the result is evaluated at the i-th position in a clip

Fig. 5 Uncertainty results of InfoVIO on EuRoC. The arrangement and notation are kept the same as Fig. 4

Table 9 Results on KITTI by evaluating at different positions in a clip

Remark. There is also a line of work that attempts to combine learning-based methods with geometric pipelines (Peretroukhin & Kelly, 2017; Yang et al., 2020), where uncertainty plays an important role by serving as a quality measure to properly weigh the learned results. The recent successful work by Yang et al. (2020) used learned aleatoric uncertainty to integrate learned results into the DVO pipeline and achieved SOTA performance in monocular odometry. Our work differs in that we do not explicitly learn the variance of the final prediction, but instead use the variance of the intrinsic latent state as the uncertainty measure, which we empirically show can also capture the epistemic uncertainty and holds the potential to provide better fusion guidance. It remains an interesting future research direction to see whether our uncertainty measure can benefit such hybrid pipelines that combine the merits of both learning and geometric methods.

Fig. 6 Uncertainty results of InfoVIO on both noisy and missing data of the KITTI dataset. The arrangement and notation are kept the same as Fig. 4. Blue, orange, and green circles denote results from normal data, noisy data, and missing data, respectively. Both images and IMU records were degraded (Color figure online)

Fig. 7 Uncertainty results of InfoVIO on noisy data of the KITTI dataset. The arrangement and notation are kept the same as Fig. 4. Blue, orange, green, and red circles denote results from normal data and degraded data with images, IMU, and both images and IMU being noisy, respectively (Color figure online)

Fig. 8 Uncertainty results of InfoVIO on missing data of the KITTI dataset. The arrangement and notation are kept the same as Fig. 4. Blue, orange, green, and red circles denote results from normal data and degraded data with images, IMU, and both images and IMU missing, respectively

5.7.2 Uncertainty w.r.t. the Evaluated Position in a Clip

We trained and evaluated the odometry model in a clip-wise manner. Surprisingly, the evaluated position of a frame-pair in consecutive clips also affected the intrinsic uncertainty, as shown in Figs. 4 and 5. This makes sense in that when evaluated at a later position of a clip, the prediction model can leverage more information accumulated from earlier observations, thus leading to more confident predictions. In Table 9, we show that, in general, a larger uncertainty results in a higher prediction error. The result also holds for the deterministic DeepVO and VINet baselines, implying that this is a structural problem of clip-wise recurrent models. Therefore, our findings support that InfoVO is able to capture this kind of epistemic uncertainty, which is caused by the model design rather than the input data. Based on this observation, we propose a simple refinement strategy that discards the results from the most uncertain position (pos-0) and averages the results from the remaining positions. We report the refined evaluation results for all models in our main results and ablation studies.
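The refinement strategy can be sketched as follows, assuming a sliding window with stride one so that the prediction at position i of clip c refers to frame-pair c + i. The function name, the data layout, the fallback for frame-pairs that only ever appear at pos-0, and the plain component-wise averaging of pose vectors are all assumptions made for this illustration.

```python
def refine_predictions(clip_preds):
    """Average per-frame-pair predictions over overlapping clips, skipping pos-0.

    clip_preds[c][i] is the predicted pose vector (list of floats) for
    frame-pair c + i, evaluated at position i of clip c.
    """
    buckets = {}
    for c, clip in enumerate(clip_preds):
        for i, pred in enumerate(clip):
            buckets.setdefault(c + i, []).append((i, pred))

    refined = []
    for k in sorted(buckets):
        # Discard the most uncertain position (pos-0) whenever possible.
        candidates = [p for i, p in buckets[k] if i > 0]
        if not candidates:
            # Assumed fallback: the frame-pair was only seen at pos-0.
            candidates = [p for _, p in buckets[k]]
        dim = len(candidates[0])
        refined.append(
            [sum(p[j] for p in candidates) / len(candidates) for j in range(dim)]
        )
    return refined


# Two overlapping clips with three positions each and 1-D "poses" for brevity.
clips = [
    [[0.0], [0.0], [2.0]],   # clip 0 covers frame-pairs 0, 1, 2
    [[9.0], [4.0], [0.0]],   # clip 1 covers frame-pairs 1, 2, 3
]
print(refine_predictions(clips))
```

In this toy input, frame-pair 2 is seen at pos-2 of the first clip and pos-1 of the second clip, so its refined value is the mean of those two predictions, while the pos-0 prediction of the second clip for frame-pair 1 is discarded.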

5.7.3 Failure-Awareness

We show that our intrinsic uncertainty measure is failure-aware, which is crucial for a robust odometry system. We considered two failure cases, namely, degradations with noisy data and missing data. We add Gaussian noise with mean 0 and standard error 0.1 to the observations in the test dataset to create noisy data. To generate missing data, we replace the observations with the Gaussian noise.
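The two degradation settings can be sketched as below for an observation flattened into a list of floats. The function names are hypothetical, and using the same zero-mean noise with standard error 0.1 for the missing-data case is our reading of the description above rather than a confirmed detail.

```python
import random

NOISE_STD = 0.1  # standard error of the Gaussian noise quoted in the text


def add_gaussian_noise(obs, rng, std=NOISE_STD):
    """Noisy data: add zero-mean Gaussian noise to every entry."""
    return [x + rng.gauss(0.0, std) for x in obs]


def replace_with_noise(obs, rng, std=NOISE_STD):
    """Missing data: discard the observation and keep only the noise."""
    return [rng.gauss(0.0, std) for _ in obs]


rng = random.Random(0)
observation = [0.5, -1.0, 2.0, 0.0]  # hypothetical flattened sensor reading
print(add_gaussian_noise(observation, rng))
print(replace_with_noise(observation, rng))
```

Applying either function to only the image features, only the IMU records, or both yields the single-sensor and joint corruption settings compared in this section.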

Table 10 Results of the proposed intrinsic uncertainties under different data degradation settings on KITTI and EuRoC

In Fig. 6, we report the visualization results of uncertainties versus different translations and rotations on KITTI by applying data corruption to both images and IMU. The results of single-sensor corruption under the noisy and missing data settings are also provided in Figs. 7 and 8, respectively. The visualization results on EuRoC are provided in the Supplementary Material. We summarize the intrinsic variances under different data degradation settings in Table 10. Our model becomes more uncertain as the data degrades. The uncertainty is highest when the data is missing, as expected. A more interesting observation is that the quality of IMU data dominates the uncertainty for both KITTI and EuRoC, implying that current image encoders are not trained well enough and that a better image encoder is desirable to fully utilize the visual information. Also, data degradation on IMU records leads to higher uncertainty in EuRoC than in KITTI. We suspect the reason is that the synchronization between the ground-truth poses and IMU records is less accurate in KITTI than in EuRoC, leading to noisy IMU data for training. Finally, the model trained on EuRoC exhibits the same performance on the noisy and the missing data, which implies that the EuRoC dataset may be more sensitive to noise. These observations support that the proposed intrinsic uncertainty measure provides a practical tool for diagnosing failures such as noise, sensor malfunctions, and even mis-synchronization between sensors.

6 Conclusion and Future Research

This paper targets odometry learning by proposing an information-theoretic framework that leverages an IB-based objective function to eliminate the pose-irrelevant information. A recurrent deterministic-stochastic transition model is introduced to facilitate the modeling of time dependency of the observation sequences. The proposed framework can be easily extended to different problem settings and provide not only an intrinsic uncertainty measure but also an elegant theoretical analysis tool for evaluating the system performance. We derive generalization error bounds for the IB-based method and a predictability lower bound for the latent representation given extra sensors. They provide theoretical performance guarantees for the proposed framework, and more generally, information-bottleneck based methods. Extensive experiments on KITTI and EuRoC support our discoveries.

The proposed method belongs to the family of end-to-end supervised learning methods. Obtaining the required ground-truth pose labels can be challenging for large-scale data collection and training. Two recent research trends provide promising solutions to mitigate this problem, i.e., embodied methods that utilize simulated environments and unsupervised learning methods that leverage geometric constraints and train the model jointly with auxiliary tasks such as depth prediction. The difficulty in bringing embodied methods into current state-of-the-art frameworks is the domain gap between simulation and the real world, for which proper domain adaptation techniques are desired. Integrating unsupervised and supervised methods can also be challenging and requires more dedicated training strategies and model design. It is worth noting that our proposed IB method improves on the representation level and can also be applied in these fields to obtain better latent representations. We foresee further developments by incorporating novel techniques into our IB framework.