Multi-Sensor Fusion Self-Supervised Deep Odometry and Depth Estimation

This paper presents a new deep visual-inertial odometry and depth estimation framework that improves the accuracy of depth and ego-motion estimation from image sequences and raw inertial measurement unit (IMU) data. The proposed framework predicts ego-motion and depth with absolute scale in a self-supervised manner. We first capture dense features and solve the pose with deep visual odometry (DVO), and then combine this pose estimation pipeline with deep inertial odometry (DIO) through an extended Kalman filter (EKF) to produce sparse depth and pose with absolute scale. We then couple deep visual-inertial odometry (DeepVIO) with depth estimation, using the sparse depth and pose from the DeepVIO pipeline to align the scale of the depth prediction with the triangulated point cloud and to reduce the image reconstruction error. Specifically, we exploit the strengths of learning-based visual-inertial odometry (VIO) and depth estimation to build an end-to-end self-supervised learning architecture. We evaluated the new framework on the KITTI datasets and compared it to previous techniques, showing that our approach improves ego-motion estimation and achieves comparable results for depth estimation, especially in detailed regions.


Introduction
Dense depth estimation from an RGB image is a fundamental problem in 3D scene reconstruction and is useful for computer vision applications such as autonomous driving [1], simultaneous localization and mapping (SLAM) [2], and 3D scene understanding [3]. With the rapid development of monocular depth estimation, many supervised and unsupervised learning methods have been proposed. Instead of traditional supervised methods that depend on expensively collected ground truth, unsupervised learning from stereo images or monocular videos is a more universal solution [4,5]. However, due to the lack of perfect ground truth and geometric constraints, unsupervised depth estimation methods suffer from inherent scale ambiguity and perform poorly in challenging scenarios, such as occlusions, textureless regions, dynamic objects, and indoor environments.
To overcome the lack of geometric constraints in unsupervised depth estimation training, recent works have used sparse LiDAR data [6][7][8] to guide depth estimation during image feature extraction and to improve the quality of the generated depth maps. These methods, however, depend on sparse LiDAR data, which are relatively expensive to acquire. A recent trend in depth estimation involves traditional SLAM [9], which can provide an accurate sparse point cloud, to learn monocular depth and odometry prediction in a self-supervised manner [10,11].
To integrate visual odometry (VO) or a SLAM system into depth estimation, the authors of [10,12,13] presented neural networks that correct classical VO estimators in a self-supervised manner and enhance geometric constraints. Self-supervised depth estimation, however, still lacks absolute scale. The main contributions of this work are as follows:

•	Based on the SuperPoint [23] dense feature point extraction method, we add the sparse depth and pose with absolute scale to the geometric constraints of depth estimation;
•	The DeepVIO pipeline joins keypoint-based DVO with DIO and uses an EKF module to update the relative pose;
•	We tested our framework on the KITTI dataset, showing that our approach produces more accurate absolute depth maps than contemporaneous methods. Our model also demonstrates stronger generalization and robustness across datasets.

Related Work
In this section, we provide an overview of current methods for self-supervised depth estimation and techniques for learning-based feature extraction and matching.

Self-Supervised Monocular Depth Prediction
Depth estimation from a monocular image is significant for scene understanding in computer vision. Supervised learning-based methods for depth prediction rely upon the availability of ground-truth depth [26][27][28], but the effort required to collect large amounts of labeled images is high. In self-supervised depth estimation methods, the photometric errors originate from static stereo warping with a rectified baseline or from temporal warping between two adjacent frames. Based on this theory, much research has been conducted to overcome the need for ground-truth data [29], using self-supervised learning methods [30] that minimize photometric reprojection errors and use binary masks to filter dynamic objects in videos [30,31]. However, these methods lack geometric constraints and suffer from scale ambiguity in the learning process. Recently, methods combining geometric constraints with depth estimation have been proposed [10,11,[32][33][34]. For example, the average depth varies greatly between adjacent frames when image pixel movement ranges are limited, the estimated poses are relative, and the reference frames between poses are inconsistent [15]. Since visual odometry based on monocular image sequences can only estimate relative poses, using it to constrain monocular depth estimation results in inconsistent depth map scales. We introduce an IMU, which carries absolute scale, to form a VIO that essentially eliminates the inconsistency of depth scales between adjacent frames. Thus, our work combines the advantages of visual odometry based on deep keypoints and raw IMU data, essentially disentangling the scale and enhancing the geometric constraints.

Learning-Based Feature Extraction and Matching
Traditional feature detectors and descriptors have long been used in classical SLAM systems. Classical handcrafted detectors and extractors, such as features from accelerated segment test (FAST) [35], oriented FAST and rotated BRIEF (ORB) [36], and the scale-invariant feature transform (SIFT) [37], are dedicated to dimensionality reduction and use various approaches to map high-dimensional descriptors to low-dimensional spaces. However, they lose a great amount of information from the raw image. With the "boom" in deep learning, researchers have attempted to use higher-level features obtained through deep learning models to build deep feature extractors. CNN-based descriptors, such as MatchNet [38], which consists of a feature network for extracting feature representations, significantly improve feature descriptor results.
However, most deep learning methods rely heavily on the data used for training and do not generalize well to unknown environments. Instead of using human supervision to define interest points in real images [39], SuperPoint [23] proposes a fully convolutional neural network architecture for interest point detection and description using a self-supervised pipeline. Our work adopts this deep feature detector and extractor, together with a VIO pipeline, as the foundation for improving the pose and depth estimation results.

Deep Visual-Inertial Odometry Learning Methods
Traditional VIO fusion relies on manually crafted image processing pipelines, which can be divided into loosely coupled and tightly coupled methods [40]. Recently, deep learning methods [41] have been applied to state estimation tasks, including VIO. Instead of using human supervision to define interest points in real images, as in FAST [35] and SIFT [37], DeTone et al. [42] designed SuperPoint, which operates on a full-sized image and produces interest point detections accompanied by fixed-length descriptors in a single forward pass.
Supervised learning VO methods infer the camera pose directly from real image data. For example, Flowdometry [43] casts VO as a regression problem, using FlowNet [44] to extract optical flow features and a fully connected layer to predict camera translation and rotation, while DVO [13] and ESP-VO [45] incorporate recurrent neural networks (RNNs) to implicitly model the sequential motion dynamics of image sequences. Han, L. et al. [13] presented a self-supervised deep learning network for monocular VIO; Shamwell et al. [46] presented an unsupervised deep neural network approach that fuses RGB-D imagery with inertial measurements for absolute trajectory estimation. Inspired by these works, we incorporate raw IMU data into a deep keypoint-based visual odometry with a fusion model to regularize the camera pose and align the depth map.

Materials and Methods
We propose a framework that predicts dense depth and odometry with absolute scale using only monocular images and raw IMU data. Figure 2 depicts an overview of our system; we use DeepVIO to replace the usual PoseNet. The SuperPoint [23] network has two sub-networks, KeypointNet and MatchNet, to estimate VO. The DVO and DIO fusion module then estimates the camera odometry and the sparse depth from the triangulated 3D points. DepthNet then combines the sparse depth and pose to output a depth map with absolute scale. Specifically, DeepVIO is trained in a self-supervised manner jointly with the depth estimation process.

Self-Supervised Depth Estimation
The depth module is an encoder-decoder network, DepthNet; it takes a target image and outputs a depth value D̂_T(p) for every pixel p in the image. The encoder of DepthNet uses ResNet to extract features of the input images at four scales, skip connections fuse the encoder features with the decoder's upsampling convolution layers, and the decoder finally outputs a depth map corresponding to the pixels of the input image. The pose module (PoseNet) uses ResNet [47] to extract image features, after which the decoder adopts convolutional layers to regress the six parameters of [R, t]. PoseNet takes as input the concatenation of the target image I_t and two neighboring (source) images I_S, S ∈ {t − 1, t + 1}, and outputs transformation matrices T̂_{T→S} representing the six-degrees-of-freedom (6DoF) relative poses between the images.

Self-supervised learning proceeds by image reconstruction using the inverse warping technique. A training sample consists of the target frame I_T at time t and the source frames I_S ∈ {I_{t−1}, I_{t+1}} at the nearby frames [31]. Self-supervised training uses the source images I_S to synthesize the target image I_T: given the depth together with the pose, a new (target) view can be synthesized from the source camera viewpoint by applying a projective warp. The sampling is done by projecting the homogeneous coordinates of the target pixel p_t onto the source view p_s [30]. Given the camera intrinsics K, the depth D̂_T(p_t) estimated by the encoder-decoder network DepthNet, and the transformation matrix T̂_{T→S} predicted by the pose module, the projection is

p_s ∼ K T̂_{T→S} D̂_T(p_t) K^{−1} p_t. (1)

For the photometric error, we adopt the popular combination of the least absolute deviation (L1) loss and the structural similarity index (SSIM) from [4]:

L_pe = Σ_p [ (α/2) (1 − SSIM(I_T(p), Î_S(p))) + (1 − α) ‖I_T(p) − Î_S(p)‖_1 ], (2)

where Î_S(p_t) is the intensity value of p_t in the reconstructed image Î_S, p denotes a pixel in the image, and S denotes the source image. We also use the edge-aware depth smoothness loss, which uses the image gradient to weigh the depth gradient [30]:

L_ds = Σ_p ( |∂_x D̂_T(p)| e^{−|∂_x I_T(p)|} + |∂_y D̂_T(p)| e^{−|∂_y I_T(p)|} ). (3)
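To make the warping concrete, the following is a minimal sketch of the projection in Equation (1), assuming PyTorch tensors; the function name and tensor layout are illustrative and not the paper's actual implementation.

```python
# Sketch of the view-synthesis projection in Eq. (1); names are hypothetical.
import torch

def project_to_source(depth_t, T_t2s, K, K_inv):
    """depth_t: [B, 1, H, W] predicted target depth; T_t2s: [B, 4, 4] relative
    pose target->source; K, K_inv: [B, 3, 3] intrinsics and their inverse.
    Returns sampling coordinates [B, H, W, 2] normalized to [-1, 1]."""
    b, _, h, w = depth_t.shape
    # Homogeneous pixel grid p_t = (u, v, 1)
    v, u = torch.meshgrid(torch.arange(h, device=depth_t.device),
                          torch.arange(w, device=depth_t.device), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float().view(1, 3, -1)
    # Back-project: X = D_t(p) * K^-1 p_t
    cam = K_inv @ pix.expand(b, -1, -1) * depth_t.view(b, 1, -1)
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w, device=depth_t.device)], dim=1)
    # Transform into the source frame and project: p_s ~ K (T X)
    cam_s = (T_t2s @ cam_h)[:, :3]
    pix_s = K @ cam_s
    pix_s = pix_s[:, :2] / pix_s[:, 2:3].clamp(min=1e-6)
    # Normalize to [-1, 1] for bilinear sampling
    x = 2 * pix_s[:, 0] / (w - 1) - 1
    y = 2 * pix_s[:, 1] / (h - 1) - 1
    return torch.stack([x, y], dim=-1).view(b, h, w, 2)
```

The returned grid can then be passed to torch.nn.functional.grid_sample(I_S, grid, align_corners=True) to bilinearly sample the source image and synthesize Î_S for the photometric loss.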

Deep Visual Odometry Based on Keypoint
We chose SuperPoint [23] as our DVO network backbone instead of traditional feature extractors, e.g., ORB [9,36] and SIFT [37]. SuperPoint is a learning-based feature extraction method with a shared encoder and two decoders; like the traditional SIFT extractor, it provides both feature point detection and description. The encoder is based on the VGG architecture and consists of convolutional layers, spatial downsampling via pooling, and rectified linear unit (ReLU) non-linear activations. After the encoder, the architecture splits into two decoder "heads", which learn task-specific weights for interest point detection and interest point description. Once the feature points of two adjacent frames are obtained from KeypointNet, we associate the feature points of the two frames through MatchNet. Taking advantage of the geometric constraints of the 3D structure across sequential frames, we jointly estimate depth and pose in a self-supervised manner using photometric consistency: we obtain correspondences from matched deep features using the deep detector and descriptor and recover the camera pose via traditional geometry methods. Correspondences located in occluded, out-of-bounds, or dynamic regions are masked out to improve the accuracy of the 2D-2D correspondences.
We refer to the image pair I_i and I_j as the input of feature extraction, and the transformation matrix from I_i to I_j as T_ij = [R, t], where R ∈ R^{3×3} is the rotation matrix and t ∈ R^{3×1} is the translation vector. The DVO network includes a shared encoder plus detector and descriptor heads. It extracts features from the input images I_i, I_j ∈ R^{H×W×1} and outputs a detector feature map H_det ∈ R^{H×W×1} and a descriptor feature map H_desc ∈ R^{H×W×D}. Non-maximum suppression is applied to H_det to filter out redundant candidates and obtain sparse keypoints, and the descriptors are sampled from H_desc at those locations using bilinear interpolation.
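As an illustration of this step, here is a small sketch of non-maximum suppression on H_det followed by bilinear descriptor sampling from H_desc, assuming PyTorch; the names, window radius, and top-k budget are assumptions, not the paper's settings.

```python
# Sketch: sparse keypoints via NMS, descriptors via bilinear grid_sample.
import torch
import torch.nn.functional as F

def extract_keypoints(h_det, h_desc, radius=4, top_k=500):
    """h_det: [1, 1, H, W] detector map; h_desc: [1, D, H, W] descriptor map."""
    # NMS: a pixel survives only if it is the maximum in its local window
    pooled = F.max_pool2d(h_det, kernel_size=2 * radius + 1, stride=1, padding=radius)
    keep = (h_det == pooled) & (h_det > 0.0)
    scores = torch.where(keep, h_det, torch.zeros_like(h_det)).flatten()
    _, idx = scores.topk(top_k)
    h, w = h_det.shape[-2:]
    ys, xs = idx // w, idx % w
    # Bilinear descriptor sampling at the kept keypoint locations
    grid = torch.stack([2 * xs.float() / (w - 1) - 1,
                        2 * ys.float() / (h - 1) - 1], dim=-1).view(1, 1, -1, 2)
    desc = F.grid_sample(h_desc, grid, align_corners=True)   # [1, D, 1, K]
    desc = F.normalize(desc.squeeze(2).squeeze(0), dim=0)    # [D, K], unit length
    return torch.stack([xs, ys], dim=-1), desc               # [K, 2], [D, K]
```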

Deep Pose Estimation Decode
Typically, traditional visual odometry pose [R, t] estimation methods are either epipolar geometry-based or PnP-based. Once the 2D-2D pixel correspondences (p_i, p_j) between the image pair are built, the epipolar geometry-based method can solve the fundamental matrix F via the normalized 8-point algorithm in a random sample consensus (RANSAC) loop [47]. Epipolar geometry solves F from the constraint

p_j^T F p_i = 0,  F = K^{−T} [t]_× R K^{−1}, (4)

where the correspondences (p_i, p_j) come from SuperPoint, F is the fundamental matrix, and K is the camera intrinsics. However, in some cases, solving for the fundamental matrix fails. Perspective-n-Point (PnP) is used to solve the camera pose from 3D-2D correspondences when the camera motion is a pure rotation or the camera translation is tiny. PnP minimizes the reprojection error

[R, t] = argmin_{R,t} Σ_i ‖p_j^i − π(K (R P_i + t))‖², (5)

where P_i is the 3D point corresponding to p_j^i and π(·) denotes perspective projection. The epipolar and PnP methods require constant judgment and switching between motion models during the motion process and struggle with complex motions, so they are neither robust nor accurate. Therefore, we use a network to replace the geometric solution and fuse the network prediction with IMU poses in the training strategy.
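For reference, the classical pipeline that the network replaces can be sketched with OpenCV as below; for pose recovery the sketch estimates the essential matrix E = K^T F K directly, which is equivalent to the fundamental-matrix route when the intrinsics are known. Array shapes and thresholds are assumptions.

```python
# Sketch of the classical epipolar / PnP baseline; pts_i, pts_j are [N, 2]
# float arrays of SuperPoint correspondences, K is the 3x3 intrinsics.
import cv2
import numpy as np

def pose_epipolar(pts_i, pts_j, K):
    # Essential matrix estimated in a RANSAC loop (Eq. 4, up to scale)
    E, inliers = cv2.findEssentialMat(pts_i, pts_j, K,
                                      method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_i, pts_j, K)  # t has unit norm
    return R, t

def pose_pnp(pts3d_i, pts_j, K):
    # 3D-2D fallback when epipolar geometry degenerates
    # (pure rotation or tiny baseline), minimizing Eq. (5) with RANSAC
    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts3d_i, pts_j, K, None)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```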
MatchNet outputs N matched feature points. We feed these points, a [6 × N] tensor in which the correspondences (p_i, p_j) come from SuperPoint detection and matching, into a one-dimensional CNN, then process them through long short-term memory (LSTM) layers with 128 and 256 cells and a fully connected (FC) layer. The output layer contains two linear layers that produce the rotation and translation prediction SE(3)_dvo.
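A minimal sketch of this pose decoder is shown below, assuming PyTorch; the 128- and 256-cell LSTM sizes follow the text, while the 1D-CNN kernel sizes, channel widths, and the so(3) rotation parameterization are assumptions.

```python
# Sketch of the DVO pose decoder: 1D CNN -> LSTM(128) -> LSTM(256) -> FC heads.
import torch
import torch.nn as nn

class PoseDecoder(nn.Module):
    def __init__(self, in_ch=6):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_ch, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm1 = nn.LSTM(128, 128, batch_first=True)
        self.lstm2 = nn.LSTM(128, 256, batch_first=True)
        self.fc = nn.Linear(256, 128)
        self.rot_head = nn.Linear(128, 3)    # so(3) rotation parameters
        self.trans_head = nn.Linear(128, 3)  # translation

    def forward(self, corr):                 # corr: [B, 6, N] matched point pairs
        f = self.cnn(corr).transpose(1, 2)   # [B, N, 128]
        f, _ = self.lstm1(f)
        f, _ = self.lstm2(f)
        f = torch.relu(self.fc(f[:, -1]))    # feature at the last step
        return self.rot_head(f), self.trans_head(f)
```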

DeepVIO Fusion Module
As mentioned above, fusing DVO with IMU data resolves the inherent scale ambiguity and compensates for the significant degradation of DVO in some scenarios. Different from previous learning-based methods, which directly feed the IMU data and images into a network to predict the pose or use the IMU as an L1 loss on the DVO output, we design a monocular DeepVIO that combines DVO with DIO, using an EKF to predict and update the pose state. We first define the IMU model at time τ, with measured accelerometer values a_m, gyroscope values w_m, and the robot state at time τ.
The accelerometer and gyroscope measurements are corrupted by random noise n_a, n_w and by biases modeled as random walks, ḃ_a = n_{b_a}, ḃ_w = n_{b_w}; all noise terms are assumed zero-mean Gaussian, e.g., n_a ∼ N(0, σ_a² I):

a_m = a + C_{v_τ r_k} g_{r_k} + b_a + n_a,  w_m = w + b_w + n_w, (6)

where C_{v_τ r_k} is the rotation from the reference frame r_k to the robot frame and g_{r_k} is the gravity vector. Moreover, the robot state is defined as

x_τ = [C_{v_τ r_k}, p_τ, v_τ, b_a, b_w]. (7)

Linearizing (6) and (7) yields the error-state system with system matrix F, linearized error matrix G, and noise vector n = [n_w^T n_{b_w}^T n_a^T n_{b_a}^T]^T. The error state δx_τ is used to propagate the error-state covariance:

δẋ_τ = F δx_τ + G n. (8)

We apply Euler's method to transform the continuous model (8) to discrete time. From time t_τ to t_{τ+1}, with δt = t_{τ+1} − t_τ, the state transition matrix Φ_{τ,τ+1} uses the first-order approximation

Φ_{τ,τ+1} ≈ I + F δt. (9)

The IMU measurements then propagate the state covariance to the next step:

P_{τ+1} = Φ_{τ,τ+1} P_τ Φ_{τ,τ+1}^T + G Q G^T δt, (10)

where Q is the continuous noise density.
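The discrete propagation step can be sketched as follows, assuming NumPy and a 15-dimensional error state; the matrix dimensions follow the usual VIO convention and are assumptions rather than the paper's exact definitions.

```python
# Sketch of the covariance propagation in Eqs. (8)-(10).
import numpy as np

def propagate_covariance(P, F_mat, G, Q_n, dt):
    """P:     [15, 15] current error-state covariance
    F_mat: [15, 15] continuous-time system matrix
    G:     [15, 12] noise Jacobian, n = [n_w, n_bw, n_a, n_ba]
    Q_n:   [12, 12] continuous noise density
    dt:    t_{tau+1} - t_tau"""
    # First-order state transition: Phi ~= I + F * dt  (Eq. 9)
    Phi = np.eye(P.shape[0]) + F_mat * dt
    # Discretized process noise and covariance update  (Eq. 10)
    Qd = G @ Q_n @ G.T * dt
    return Phi @ P @ Phi.T + Qd
```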

DIO-Net Measurement Model
In this paper, we propose DIO-Net, a deep inertial odometry network that replaces the classical inertial odometry data processing, preintegration, and pose prediction. The DIO-Net architecture is illustrated in Figure 3. To capture the spatial structure of the IMU data, we combine CNN and LSTM layers in the deep inertial odometry. The model comprises CNNs that first extract deep features, two LSTM layers that integrate the features over time, and two linear layers that produce the final odometry prediction. The IMU data enters 32 × 32, 64 × 64, and 128 × 128 CNNs for feature extraction; the features then enter the LSTM layers after ReLU, and finally the FC layer outputs the six pose parameters. In this process, the CNN transforms the input into a 128-channel feature, the LSTM processes the last-layer feature and outputs a 256-channel feature, and the FC layer regresses the 3D rotation and 3D translation, presented as SE(3)_dio.
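A sketch of DIO-Net consistent with this description is given below, assuming PyTorch; the CNN and LSTM widths follow the text, while the kernel sizes and the length of the IMU input window are assumptions.

```python
# Sketch of DIO-Net: stacked Conv1d feature blocks -> 2-layer LSTM -> FC heads.
import torch
import torch.nn as nn

class DIONet(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(6, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(128, 256, num_layers=2, batch_first=True)
        self.rot_head = nn.Linear(256, 3)
        self.trans_head = nn.Linear(256, 3)

    def forward(self, imu):                  # imu: [B, 6, T] accel + gyro window
        f = self.cnn(imu).transpose(1, 2)    # [B, T, 128]
        f, _ = self.lstm(f)
        f = f[:, -1]                         # last-step 256-channel feature
        return self.rot_head(f), self.trans_head(f)
```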

DIO and DVO EKF Fusion Model
Our EKF model is robot-centric; the EKF propagates its state based on kinematics, which maintains the differentiability of the system, and incorporates the vision and IMU relative measurements, together with their uncertainties, learned from the deep networks in its update step. With each new image and IMU input, the DIO prediction propagates the robot-centric state and the DVO-predicted pose serves as the EKF observation. The EKF fuses the DVO and DIO results, performs the update to obtain the fused system state, shifts the time stamp from k to k + 1, and repeats the above steps. We express the DVO-estimated pose SE(3)_dvo as the measurement in (11). The covariance R_k can be represented as a diagonal matrix. The measurement residual is

ε_{k+1} = [θ^T r^T]^T, (12)

where θ and r are the rotation and translation residual components and (·)^∧ denotes the skew-symmetric operator.
In the training process, to make the network output differentiable with respect to the measurement residual, we approximate the network output residual as θ = φ_{r_k v_{k+1}} − φ̂_{r_k v_{k+1}} using the Baker-Campbell-Hausdorff (BCH) formula. From the error states, the DVO Jacobian can be derived, with θ and r represented as in (13); the final DVO Jacobian H_{k+1} is shown in (14).
The EKF estimation update and the error δx_{k+1} follow from (12)-(14). The Kalman gain is calculated as

K_{k+1} = P_{k+1} H_{k+1}^T (H_{k+1} P_{k+1} H_{k+1}^T + R_k)^{−1}. (15)

The posterior state and covariance are

x̂_{k+1} = x̂_{k+1|k} + K_{k+1} ε_{k+1},  P_{k+1} = (I − K_{k+1} H_{k+1}) P_{k+1|k}. (16)

The error δx_{k+1} is

δx_{k+1} = K_{k+1} ε_{k+1}, (17)

where H_{k+1} is the measurement Jacobian, R_k is the corresponding covariance, and ε_{k+1} is the measurement residual. Finally, the reference frame for all states is shifted forward from frame S_{r_k} to frame S_{r_{k+1}}, and the robot pose is composed with the DVO pose before passing to the next EKF iteration; this composition gives the final VIO fusion output (18). Using the output of the EKF fusion module, the DVO update T̂_vo = [R, t] ∈ SE(3)_dvo regresses the network, with rotation component R ∈ SO(3) and translation component t ∈ R^3. The pose loss L_rt = |T_dvo − T_dio| is defined for all pairs of relative pose transformations; it contains the rotation loss L_rot and the translation loss L_trans:

L_rot = ‖R_dvo − R_dio‖_1,  L_trans = ‖t_dvo − t_dio‖_1. (19)

The total pose loss is defined as

L_rt = L_trans + β_rt L_rot. (20)
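The update step then reduces to the standard Kalman equations; a sketch in NumPy is given below, with variable names mirroring H_{k+1}, R_k, and ε_{k+1} from the text.

```python
# Sketch of the EKF update in Eqs. (15)-(17): the DVO pose is the measurement,
# DIO propagation provides the prior state and covariance.
import numpy as np

def ekf_update(x_prior, P_prior, eps, H, R):
    """x_prior/P_prior: propagated state and covariance; eps: measurement
    residual from the DVO pose; H: measurement Jacobian; R: measurement noise."""
    S = H @ P_prior @ H.T + R                      # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)           # Kalman gain        (Eq. 15)
    dx = K @ eps                                   # error-state update (Eq. 17)
    x_post = x_prior + dx                          # posterior state    (Eq. 16)
    P_post = (np.eye(P_prior.shape[0]) - K @ H) @ P_prior
    return x_post, P_post, dx
```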

Supervised with Sparse Depth from DeepVIO
To resolve the scale ambiguity problem and enhance the geometric constraints, we fuse the self-supervised depth estimation process with the output of DeepVIO based on the 3D geometric structure. Using the dense correspondences and the pose from DeepVIO, we directly recover a sparse depth map D_s with absolute scale through the two-view triangulation module [42]. The sparse depth map D_s then aligns the scale of the predicted depth D_p via the scale factor w_s = mean(D_s / D_p), and the refined depth D_f = w_s D_p is supervised by the sparse depth D_s to minimize the error. The depth loss L_sd is defined on the pixels where D_s is valid:

L_sd = (1 / |D_s|) Σ_{p ∈ D_s} |D_f(p) − D_s(p)|. (21)

The total training loss is given by

L_total = L_pe + λ_ds L_ds + λ_rt L_rt + λ_sd L_sd, (22)

where λ_ds, λ_rt, λ_sd are the weights of the edge-aware loss L_ds, the pose loss L_rt, and the sparse depth loss L_sd.
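A compact sketch of the scale alignment and sparse-depth supervision is shown below, assuming PyTorch; the validity mask and the clamping are assumptions added for numerical safety.

```python
# Sketch of scale alignment (w_s = mean(D_s / D_p)) and the loss in Eq. (21).
import torch

def sparse_depth_loss(d_pred, d_sparse, mask):
    """d_pred: [B, 1, H, W] DepthNet output D_p; d_sparse: triangulated sparse
    depth D_s; mask: 1 where a triangulated point exists, else 0."""
    valid = mask > 0
    # Scale factor over the valid sparse points
    w_s = (d_sparse[valid] / d_pred[valid].clamp(min=1e-6)).mean()
    d_refined = w_s * d_pred                       # D_f = w_s * D_p
    # Supervise the refined depth only where sparse depth is available
    loss = torch.abs(d_refined - d_sparse)[valid].mean()
    return loss, w_s
```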

Results
In this section, we conduct several experiments to present the evaluation results of depth and odometry estimation on the KITTI [48] and Oxford RobotCar [49] datasets. We support our analysis with visualizations to verify our design decisions.

Implementation Details
As shown in Figure 2, our framework includes three subnetworks, SuperPoint, DepthNet, and DeepVIO, implemented in PyTorch. The model has around 20 M trainable parameters, and training takes 40 h on an RTX 2080 Ti GPU. The input image resolution is set to 640 × 192 and the batch size to 4. The Adam optimizer, with β_1 = 0.9 and β_2 = 0.99, is used to minimize the loss function. The loss weights λ_rt and λ_sd are set to 0.55, and λ_ds is set to 0.1. For DeepVIO, we use the pre-trained SuperPoint network to extract and match the correspondences and then connect them to the DVO for pose estimation. We set c_r = 0.15, c_t = 0.75, and β_rt = 0.1. We first train only the DeepVIO network, supervised by DIO, for 25 epochs; then we use the trained DeepVIO to train depth estimation in an unsupervised manner via the image reconstruction loss for 25 epochs; finally, we jointly train both networks for 10 epochs.

Datasets
To train, validate, and test our system, we use the original KITTI dataset [50] and the KITTI Odometry dataset. The original KITTI dataset consists of 389 pairs of stereo images and depth maps and 39.2 km of visual ranging sequences, recorded with a Velodyne laser scanner and a GPS/IMU localization unit, sampled and synchronized at 10 Hz. The odometry benchmark consists of 22 stereo sequences saved in lossless PNG format: 11 sequences (00-10) with ground-truth trajectories for training and 11 sequences (11-21) without ground truth for evaluation. For the original KITTI dataset, following the split of Eigen et al., we use 23,488 images from 32 scenes for training and 697 images from 29 scenes for testing [50].
The Oxford RobotCar dataset [49] was collected with the Oxford RobotCar platform, an autonomous Nissan LEAF, which traversed a route through the Oxford city centre twice a week on average between May 2014 and December 2015. The dataset comprises over 1000 km of driving, with nearly 20 million images collected from 6 cameras mounted on the vehicle, along with LiDAR, GPS, and INS ground truth. We use the Oxford RobotCar dataset to test the robustness of our algorithm.

Depth Estimation
We evaluated on the KITTI Raw and KITTI Odometry datasets using four error metrics from previous works [4,6,50]: absolute relative error (Abs Rel), squared relative error (Sq Rel), root mean square error (RMSE), and root mean square error in log space (RMSE log). The accuracy metrics are the percentages of pixels where the ratio δ between the estimated depth and the ground-truth depth is smaller than 1.25, 1.25², and 1.25³. We compare our method with several state-of-the-art self-supervised depth estimation methods and summarize the results in Table 1; in addition, we illustrate their performance qualitatively in Figure 4. Our method outperforms the other competitors on most evaluation metrics and improves the baseline method by 8%, showing that the proposed DeepVIO architecture can strengthen the geometric constraints of monocular depth and improve the accuracy of monocular depth estimation. In Figure 4, we compare against the supervised method DORN [29], the unsupervised method Monodepth2 [30] with an end-to-end PoseNet, and the unsupervised method TrainFlow [24], whose PoseNet is based on optical flow. The results show that our proposed DeepVIO method improves the accuracy of depth estimation and enhances detail at object edges. A further study examines depth estimation performance on dynamic objects, such as people or cars, in the point cloud. We project RGB and depth into 3D point clouds with the camera intrinsics K and compare against the supervised method DORN [29] and the unsupervised methods TrainFlow and Monodepth2. As shown in Figure 5, compared with DORN and the optical flow-based supervised depth estimation method TrainFlow [24], fusing sparse point clouds and absolute scale into unsupervised depth estimation significantly improves monocular depth estimation results in dynamic environments.

Pose Estimation
We follow previous works on the KITTI Odometry criteria, evaluating all possible subsequences of length (100, 200, . . . , 800) meters and reporting the average translational error t_err (%) and rotational error r_err (°/100 m). These measure the difference between the ground truth and the predicted trajectory. Using timestamps to associate the ground-truth poses with the corresponding predicted poses, we compute the difference between each pair of poses and report the mean and standard deviation. Table 2 reports the evaluation results of the DeepVIO output poses and compares them to previous works, such as ORB-SLAM2 [36], Deep-VO-Feat [52], SfMLearner [51], and SC-SfMLearner [60]. Both extensions improve on the baseline, and the attention module performs well. When coupled with self-supervised depth estimation, DeepVIO, trained to produce consistent pose estimates, outperforms the classical SLAM system libviso2 and the learning-based techniques SC-SfMLearner and SfMLearner [5,13,62]. Figures 6 and 7 show the trajectories in the XY-plane. In Figure 6, our trajectory starts from the starting point and returns to the origin, forming a closed loop, which indicates that our pose estimation is relatively accurate. In Figures 6 and 7, especially in contrast to the ground truth (GT), our trajectories closely follow the GT, since our approach of introducing absolute scale preserves the true scale of the pose.

Ablation Study
We also performed ablation experiments to examine the effectiveness of our contributions. The first ablation study compared the depth error between the predicted depth and the ground truth: random snippets of images were taken from the KITTI dataset and run through the framework, and we selected the points with the largest error between the predicted and ground-truth values. The experimental results are shown in Figure 8. It can be observed that in weakly textured or distant regions, the depth predicted by our framework retains the absolute scale obtained from DeepVIO, improving the generalization ability. In addition, we used the Oxford RobotCar dataset, which includes video image sequences and IMU data, to test the adaptability of our method. In the RobotCar experiment, we compare against TrainFlow [24] and Monodepth2 [30]. The results in Figure 9 show that our method transfers to other datasets better than these methods, which demonstrates that adding DeepVIO to the depth estimation allows our method to adapt to different environments.

Discussion
The proposed new self-supervised depth and pose estimation framework combines DepthNet with DeepVIO so that they supervise each other; to our knowledge, it is the first such attempt in this domain. The proposed model shows good depth estimation and pose results compared to the reference methods, and the experiments also demonstrate that the EKF fusion is effective for pose estimation and absolute scale. In self-supervised depth estimation, we make full use of the pose and sparse depth produced by DeepVIO: the pose is used to synthesize the target image and minimize the reprojection error, and the sparse depth is used to correct the dense depth output by DepthNet. In the depth estimation evaluations, 3D point clouds synthesized from the estimated depths and camera parameters help assess the depth and pose accuracy. Depth estimation is particularly challenging in autonomous driving scenarios, where the camera on the car is moving and there are moving objects in the scene. As shown in Figure 10, in the point cloud restoration experiment, our method reconstructs the point cloud of detailed parts of the scene, such as cars and utility poles; compared with other methods, our method restores the geometry of the objects better.
Despite the overall promising results, our framework contains many sub-networks: DepthNet, SuperPoint, DIO, and DVO. During joint training, it is necessary to train some networks, freeze their parameters, and then train the others, a process that is prone to failure. In our experiments, we first pre-train DeepVIO using the network parameters provided by SuperPoint, train DIO and DVO, and finally train jointly with DepthNet. It is therefore worth considering how to simplify the network structure and reduce the number of sub-networks in future work. Furthermore, in self-supervised depth estimation, normalizing the dense depth estimated by DepthNet with the mean of the sparse depths over-relies on the number of sparse depth values: if few feature points are extracted and matched, the effect of the depth scale supervision degrades. How to improve the geometric constraints of self-supervised depth estimation has always been an important issue in the field. Recently, some researchers have used sparse LiDAR to complete the depth map estimated by the depth estimation network and thus increase the geometric constraints. Fusing sparse LiDAR points with raw IMU data can provide the pose directly, reducing the deep feature extraction and matching required; this not only reduces the number of sub-networks and computing resources but also strengthens the geometric constraints, yields poses with true scale, and avoids overly complex networks. In addition, depth estimation also needs to consider special scenarios, such as darkness, fog, rain, and snow. These are very challenging scenes, and some recent studies have focused on them, such as the work of Wang, K. et al. [63] on depth estimation in night environments. With the widespread application of depth estimation, new research needs to consider such special scenarios to make depth estimation more general.

Conclusions
We propose a new depth and odometry estimation framework that integrates DeepVIO with depth estimation in a self-supervised, learning-based method. We combine the strengths of learning-based VIO and depth estimation to build an end-to-end learning architecture. The deep keypoint-based visual odometry module captures dense correspondences using the SuperPoint feature detector and descriptor and solves the pose and sparse depth through two-view triangulation. The DVO is joined with the DIO by an EKF that predicts and updates the pose state. Finally, the sparse depth and pose are used to refine the predicted depth and enhance the geometric reconstruction. The experiments show that our model outperforms other state-of-the-art depth estimation methods on the KITTI dataset and shows excellent generalization ability on the Oxford RobotCar dataset.
Future work includes using depth completion to guide depth estimation with the sparse depth from DeepVIO, which could bring further improvements. Exploring the benefits of the improved depth prediction for 3D reconstruction is another interesting research direction [35,44].