UnVELO: Unsupervised Vision-Enhanced LiDAR Odometry with Online Correction

Due to the complementary characteristics of visual and LiDAR information, these two modalities have been fused to facilitate many vision tasks. However, current studies of learning-based odometries mainly focus on either the visual or LiDAR modality, leaving visual–LiDAR odometries (VLOs) under-explored. This work proposes a new method to implement an unsupervised VLO, which adopts a LiDAR-dominant scheme to fuse the two modalities. We, therefore, refer to it as unsupervised vision-enhanced LiDAR odometry (UnVELO). It converts 3D LiDAR points into a dense vertex map via spherical projection and generates a vertex color map by colorizing each vertex with visual information. Further, a point-to-plane distance-based geometric loss and a photometric-error-based visual loss are, respectively, placed on locally planar regions and cluttered regions. Last, but not least, we designed an online pose-correction module to refine the pose predicted by the trained UnVELO during test time. In contrast to the vision-dominant fusion scheme adopted in most previous VLOs, our LiDAR-dominant method adopts the dense representations for both modalities, which facilitates the visual–LiDAR fusion. Besides, our method uses the accurate LiDAR measurements instead of the predicted noisy dense depth maps, which significantly improves the robustness to illumination variations, as well as the efficiency of the online pose correction. The experiments on the KITTI and DSEC datasets showed that our method outperformed previous two-frame-based learning methods. It was also competitive with hybrid methods that integrate a global optimization on multiple or all frames.

Most previous learning-based visual-LiDAR odometries (VLOs) [27][28][29][30] commonly adopt a vision-dominant fusion scheme, which projects a LiDAR frame into a camera frame and leads to a sparse depth map. Therefore, how to deal with sparse depth maps or generate dense depth maps becomes a challenge to achieve an accurate VLO. Besides, self-supervised VLOs often employ a view synthesis loss [27][28][29] or an additional point-to-point distance loss [28,29] for learning. The former loss depends on the predicted dense depth maps, which are inevitably noisy, while the latter is sensitive to the sparsity of points.
In this work, we adopted a LiDAR-dominant fusion scheme to implement our unsupervised VLO. Specifically, the 3D point clouds of a LiDAR frame are converted into a dense vertex map via spherical projection as in LOs [17][18][19][20][21]. A vertex color map is then generated, which assigns each vertex a color retrieved from the aligned visual image. We further performed LiDAR-based point-to-plane matching within locally planar regions, while establishing pixel correspondences based on visual images for cluttered regions. A geometric consistency loss and a visual consistency loss are, respectively, defined for these two different types of regions. By this means, the complementary characteristics of the visual and LiDAR modalities are well exploited. Moreover, our LiDAR-dominant scheme does not need to predict dense depth maps, avoiding the construction of a complex depth-prediction network and preventing the noise introduced by the predicted depth. Considering that LiDAR plays the dominant role in the visual-LiDAR fusion, we named our method unsupervised vision-enhanced LiDAR odometry (UnVELO).
The losses used for UnVELO are unsupervised, requiring no ground truth labels for training. This implies that the losses can also be applied for optimization during test time. Recently, test time optimization has been explored in several unsupervised VOs [14][15][16] to either refine the weights of their networks or refine the predicted outputs further, referred to as online learning and online correction, respectively. Compared with online learning [14,16], the online correction scheme [15] significantly reduces the number of parameters to be optimized, leading to a higher computational efficiency. Thus, in this work, we adopted online correction to refine the pose predicted by the trained UnVELO network. In contrast to the optimization loss in VOs that tightly couples depth and pose prediction, our UnVELO predicts the pose only, making test time optimization more effective.
In summary, the proposed method distinguishes itself from previous self-supervised VLOs in the following aspects:
• We adopted a LiDAR-dominant fusion scheme to implement an unsupervised visual-LiDAR odometry. In contrast to previous vision-dominant VLOs [27][28][29], which predict both the pose and dense depth maps, our method only needs to predict the pose, avoiding the inclusion of the noise generated from the depth prediction.
• We placed a geometric consistency loss and a visual consistency loss, respectively, on locally planar regions and cluttered regions, by which the complementary characteristics of the visual and LiDAR modalities can be well exploited.
• We designed an online pose-correction module to refine the predicted pose during test time. Benefiting from the LiDAR-dominant scheme, our online pose correction is more effective than its vision-dominant counterparts.
• The proposed method outperformed previous two-frame-based learning methods. Besides, while introducing two-frame constraints only, our method achieved a performance comparable to the hybrid methods, which include a global optimization on multiple or all frames.

Related Work
Pose estimation is a key problem in simultaneous localization and mapping (SLAM), which plays an important role in various applications such as autonomous driving [31], 3D reconstruction [32], and augmented reality [33]. To date, most odometry methods use a visual camera or LiDAR for pose estimation. A visual camera provides dense color information of the scene but is sensitive to lighting conditions, while LiDAR obtains accurate but sparse distance measurements; the two sensors are thus complementary.
There are also some works that have attempted to exploit other on-board sensors for pose estimation. For example, radar odometry [34] adopts an extended Kalman filter to propagate the motion state from the IMU and corrects the drift by the measurements from radar and GPS. DeepLIO [19] uses two different networks to extract motion features from LiDAR and IMU data, respectively, and fuses the features by an attention-based soft fusion module for pose estimation. A discussion of the full literature of these methods is beyond the scope of this paper. In this section, we mainly focus on the methods using a visual camera and LiDAR.

Visual and LiDAR Odometry
Although state-of-the-art performance is maintained by conventional methods such as LOAM [32], V-LOAM [35], and SOFT2 [36], learning-based visual or LiDAR odometries have been attracting great research interest. In this work, we briefly review the learning-based methods.

Visual Odometry
A majority of end-to-end visual odometry works focus on self- or unsupervised monocular VOs [9][10][11][37][38][39][40]. They take a monocular image sequence as the input to jointly train the pose- and depth-prediction networks by minimizing a view synthesis loss. To overcome the scale ambiguity problem in monocular VOs, SC-SfMLearner [10] and Xiong et al. [11] proposed geometric consistency constraints to achieve globally scale-consistent predictions, while UnDeepVO [41] opted to learn from stereo sequences. Besides, additional predictions such as the motion mask [41] and optical flow [42] have also been integrated to address motion or occlusion problems. Recently, hybrid methods such as DVSO [43] and D3VO [44] have integrated end-to-end networks with traditional global optimization modules to boost the performance.

Visual-LiDAR Odometry
In contrast to the extensive studies on VOs and LOs, works on visual-LiDAR odometries are relatively scarce. Existing learning-based VLOs include Self-VLO [27], RGBD-VO [28], MVL-SLAM [29], and Tibebu et al. [52]. Most of them [27][28][29] project 3D points into a camera frame for depth representation. Then, the visual and depth images are either concatenated or kept separate before being fed into 2D CNNs for feature extraction and fusion. However, fusing features from sparse depth maps and dense visual images can be challenging; it is difficult to extract reliable multi-modal features for the areas that are not covered by the depth. Besides, their main supervision signals come from the view synthesis loss [37], which is sensitive to lighting condition changes. In contrast, Tibebu et al. [52] projected LiDAR data into a 1D vector or 2D range map within the LiDAR frame. They fed visual images and LiDAR data into two streams for feature extraction and constructed a fusion layer and an LSTM module for feature fusion. Due to the different resolutions of the visual and LiDAR data, they adopted two independent modules to extract visual and LiDAR features, respectively, and performed the feature fusion at the last layer only. In contrast to the mentioned VLOs, our LiDAR-dominant method projects the visual and LiDAR data into two dense images of the same size and obtains multi-modal features with a single feature extractor. Moreover, our method predicts only relative poses for training; thus, the optimization is also more efficient than that of the vision-dominant methods, which need to predict both the depth and pose.

Visual-LiDAR Fusion
Visual-LiDAR fusion has been widely investigated in various tasks including depth completion [5,6], scene flow estimation [7,8], and visual-LiDAR odometry [27][28][29]. According to which view plays the dominant role, we classify existing fusion strategies into vision-dominant, LiDAR-dominant, or vision-LiDAR-balanced ones. Most depth completion [5,6], scene flow estimation [7], and VLOs [27][28][29] adopt the vision-dominant strategy, which projects LiDAR frames into camera frames, leading to sparse depth maps, which are hard to deal with. Some scene flow estimation [8] and VLO [52] methods adopt the vision-LiDAR-balanced fusion strategy, which constructs two streams to extract features, respectively, from the LiDAR and camera views, often along with a complex module to fuse the features of the two modalities.
In order to avoid dealing with sparse depth maps or designing complex fusion modules, we adopted the LiDAR-dominant fusion scheme. It first projects 3D LiDAR points into a dense vertex map and then colorizes each vertex with the visual information. The LiDAR-dominant fusion scheme is also adopted by several 3D object detection methods such as PointAugmenting [53]. In contrast to these works that paint 3D points [53], we encode LiDAR data as 2D vertex maps for more efficient computation.

Test Time Optimization
Test time optimization is a strategy to refine the weights or outputs of a trained network during test time. Recently, this strategy has been applied to various unsupervised learning tasks [16,[54][55][56][57][58] since their losses require no ground truth, making test time optimization possible. In self-supervised VOs, Li et al. [14] proposed online meta-learning to continuously adapt their VO networks to new environments. Li et al. [58] optimized the predicted depth and flow via a Gauss-Newton layer and took the optimized results as pseudo labels to supervise the online learning of the depth and flow models. DOC [15] designed a deep online correction framework to efficiently optimize the pose predicted by a trained VO. GLNet [16] adopted both weight and output fine-tuning modules to boost its performance. All the above-mentioned self-supervised VOs [14][15][16] use the view synthesis loss for learning and test time optimization. This loss involves both the depth and pose, which are predicted by the trained networks; therefore, the quality of their pose refinement is affected by the noisy predicted depth maps. In this work, we applied online correction to refine the pose predicted by our UnVELO network during test time. In contrast to these VOs [14][15][16], our losses involve only the predicted pose, while the depth information is directly converted from accurate LiDAR measurements, which facilitates the pose refinement.

Materials and Methods
As shown in Figure 1, the proposed method consists of the data pre-processing, pose estimation, and online correction modules. Given two consecutive LiDAR scans S t , S t+1 and synchronized visual images I t , I t+1 , our method first generates the corresponding vertex maps, normal maps, and vertex color maps in the data pre-processing step. Then, the vertex maps and vertex color maps are concatenated and input into a network for pose estimation. During test time, the pose predicted from the network is further optimized via the online pose-correction module (OPC).

Vertex Map
As is common practice [18], we adopted a spherical projection π(·) : R^3 → R^2 to convert each 3D point in a LiDAR frame into a pixel on a vertex map. Specifically, a 3D point p = [p_x, p_y, p_z]^T within the field of view (FOV) is projected into a 2D pixel u via

u = [ (f_h/2 − arctan(p_y, p_x)) / δ_h , (f_vu − arcsin(p_z/||p||)) / δ_v ]^T,

in which f_h is the horizontal FOV and f_vu is the upper part of the vertical FOV f_v. Moreover, δ_h and δ_v denote the horizontal and vertical angular resolutions, respectively. For each point p projected to the pixel u, we define V(u) = p; pixels hit by no point are set to V(u) = 0 and appear black. Thus, along with the vertex map, we also obtain a binary mask Mv to distinguish the valid pixels from the black ones as follows:

Mv(u) = 1(||V(u)|| > 0),

where 1(·) is an indicator and ||·|| denotes the L2 norm.

Figure 1. Overview of the proposed method. Given two consecutive LiDAR scans S_t, S_t+1 and visual images I_t, I_t+1, the data pre-processing step produces vertex maps V_t, V_t+1, normal maps N_t, N_t+1, as well as vertex color maps Vc_t, Vc_t+1. The vertex maps and vertex color maps are concatenated and fed into a pose-estimation network to predict the pose P_t←t+1 from frame t + 1 to frame t. During test time, the predicted pose is further optimized via the online pose-correction module.
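As an illustration, the spherical projection described above can be sketched in a few lines of NumPy. The FOV constants below follow the KITTI settings reported in the implementation details; the angular resolutions are derived from the map size, and the function name is ours, not the paper's.

```python
import numpy as np

# Sketch of the spherical projection into a 64 x 448 vertex map
# (KITTI-style settings; constant and function names are ours).
H, W = 64, 448
F_H = np.deg2rad(80.0)    # horizontal FOV f_h
F_V = np.deg2rad(24.0)    # vertical FOV f_v
F_VU = np.deg2rad(3.0)    # upper part of the vertical FOV f_vu
DELTA_H = F_H / W         # horizontal angular resolution
DELTA_V = F_V / H         # vertical angular resolution

def build_vertex_map(points):
    """Project an (N, 3) point cloud into an (H, W, 3) vertex map V and
    return it together with the validity mask Mv."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    col = np.floor((F_H / 2 - np.arctan2(y, x)) / DELTA_H).astype(int)
    row = np.floor((F_VU - np.arcsin(z / r)) / DELTA_V).astype(int)
    vmap = np.zeros((H, W, 3), dtype=np.float64)
    valid = (col >= 0) & (col < W) & (row >= 0) & (row < H)
    vmap[row[valid], col[valid]] = points[valid]
    mask = np.linalg.norm(vmap, axis=2) > 0  # Mv(u) = 1(||V(u)|| > 0)
    return vmap, mask
```

A point far ahead of the sensor lands near the vertical center of the upper FOV; points outside the FOV are simply dropped, leaving black pixels in the map.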

Normal Map
Normal vectors are important for point cloud registration as they can characterize the local geometry around points [49,50]. In this work, we adopted singular-value decomposition (SVD) [50,59] to estimate the normals. For each pixel u and its associated point p = V(u), we compute the mean µ and covariance Σ within a neighboring set N_p as follows:

µ = (1/|N_p|) ∑_{p_i ∈ N_p} p_i ,   Σ = (1/|N_p|) ∑_{p_i ∈ N_p} (p_i − µ)(p_i − µ)^T,

where |·| denotes the cardinality of a set. Empirically, we set N_p = {p_i | ||p_i − p|| < 0.15||p|| ∧ p_i ∈ W} as the set of neighboring points near p, where W is a local window of size 5 × 7 centered at u on the vertex map. Then, we obtain the singular vector n corresponding to the minimum singular value of Σ by SVD and take it as the normal vector. The normal map N ∈ R^{H×W×3} is then defined by N(u) = n for valid pixels and N(u) = 0 otherwise.

We generate a confidence map C by computing the similarity of the normals with four neighbors [18]. That is,

C(u) = (1/|N_u|) ∑_{u_i ∈ N_u} ⟨N(u), N(u_i)⟩,

where ⟨·, ·⟩ is the inner product and N_u denotes the four connected neighboring pixels of u. The confidence is in [0, 1]. A high confidence indicates a planar surface, and a low confidence often corresponds to a cluttered region. Figure 2 shows that the regions on the ground and walls have high confidence and those on the object boundaries or plants have low confidence. We therefore generate a binary mask Mn to indicate locally planar regions by

Mn(u) = 1(C(u) > δ_c),

where δ_c is the threshold of the confidence.
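The SVD-based normal estimation can be sketched as follows. This is a simplified version operating on a single neighborhood N_p; the window search over the vertex map and the confidence computation are omitted, and the function name is ours.

```python
import numpy as np

def estimate_normal(neighbors):
    """Sketch of SVD-based normal estimation for one (N, 3) neighborhood:
    the normal is the singular vector associated with the smallest
    singular value of the local covariance matrix."""
    mu = neighbors.mean(axis=0)
    centered = neighbors - mu
    cov = centered.T @ centered / len(neighbors)
    # numpy orders singular values in descending order, so the last row
    # of vt corresponds to the minimum singular value.
    _, _, vt = np.linalg.svd(cov)
    return vt[-1]
```

For points sampled from a plane, the recovered normal is the plane normal up to sign, which is why registration losses typically use the absolute value of the projection onto it.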

Vertex Color Map
In order to fuse visual information, we generate a vertex color map Vc ∈ R^{H×W×3} based on the vertex map V and a synchronized visual image I. Specifically, for each vertex p = V(u) on the vertex map, we retrieve the color of its corresponding pixel u' in the visual image through the following camera projection:

ũ' = K [T_C←L p̃]_{1:3} ,   u' = [ũ'_1/ũ'_3, ũ'_2/ũ'_3]^T,

where p̃ is p in homogeneous coordinates, K ∈ R^{3×3} denotes the camera intrinsic matrix, and T_C←L ∈ R^{4×4} is the transformation matrix from the LiDAR to the camera coordinates. Since the coordinates of the projected u' are continuous, we obtain its color via the bilinear sampling scheme as in VOs [37]. That is, Ĩ(u') = ∑_{u_i ∈ N_u'} ω_i I(u_i), in which N_u' contains the four closest pixels of u', ω_i is linearly proportional to the spatial proximity between u' and u_i, and ∑ ω_i = 1. Then, we obtain a vertex color map Vc(u) = Ĩ(u'), along with a binary mask Mc:

Mc(u) = 1(u' falls inside the image boundaries).
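The bilinear color retrieval can be sketched as below. This is a simplified single-location version with the function name ours; boundary handling and the validity mask are omitted.

```python
import numpy as np

def bilinear_sample(image, xy):
    """Bilinearly sample an (H, W, C) image at a continuous pixel location
    xy = (x, y). The four weights are proportional to spatial proximity
    and sum to one, as in the vertex color map construction (a sketch)."""
    x, y = xy
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    weights = [(1 - dx) * (1 - dy), dx * (1 - dy),
               (1 - dx) * dy, dx * dy]
    pixels = [image[y0, x0], image[y0, x0 + 1],
              image[y0 + 1, x0], image[y0 + 1, x0 + 1]]
    return sum(w * p for w, p in zip(weights, pixels))
```

Sampling exactly between four pixels returns their average; sampling at an integer location returns that pixel unchanged.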

Network Architecture
As shown in Figure 3, we constructed a fully convolutional neural network composed of a feature encoder and a pose estimator to infer the relative pose between two consecutive frames. Two consecutive vertex maps V_t, V_t+1 and their corresponding vertex color maps Vc_t, Vc_t+1 are concatenated as the input, which has a size of H × W × 12. The feature encoder contains 13 convolutional layers, where the kernel size of the first layer is 5 × 5 and that of the rest is 3 × 3. The vertical and horizontal strides of Layers 2, 6, and 10 were set to (1, 2), those of Layers 4 and 8 to (2, 2), and the remaining layers use a stride of (1, 1). This implies that only 2 down-sampling operations are performed in the vertical direction, but 5 in the horizontal direction, since the input's width is much greater than its height. The pose estimator predicts a 3D translation vector [t_x, t_y, t_z]^T and a 3D Euler angle vector [r_x, r_y, r_z]^T through two separate branches. Finally, we obtain a 6D vector P_t←t+1 = [t_x, t_y, t_z, r_x, r_y, r_z]^T, from which a 4 × 4 transformation matrix T_t←t+1 is constructed.
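The down-sampling behavior of this stride schedule can be checked with a small sketch, assuming "same" padding so that each stride-2 axis halves the feature map (the helper name is ours):

```python
# Sketch of the encoder's stride schedule: layers 2, 6, 10 use stride (1, 2),
# layers 4 and 8 use stride (2, 2), and all other layers use stride (1, 1).
# With "same" padding, each stride-2 axis halves the feature map.
STRIDES = {2: (1, 2), 4: (2, 2), 6: (1, 2), 8: (2, 2), 10: (1, 2)}

def encoder_output_size(height, width, num_layers=13):
    """Trace the feature map size through the 13 convolutional layers."""
    for layer in range(1, num_layers + 1):
        sh, sw = STRIDES.get(layer, (1, 1))
        height, width = height // sh, width // sw
    return height, width
```

For the 64 × 448 vertex maps used on KITTI, this yields a 16 × 14 feature map: 2 vertical halvings and 5 horizontal halvings, matching the asymmetry noted above.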

Training Loss
We designed a loss L composed of a geometric loss L_geo and a visual loss L_vis to train the pose-estimation network. That is,

L = L_geo + λ L_vis, (9)

where λ is a scaling factor to balance the two terms. The details are introduced in the following.

The geometric loss L_geo places geometric constraints on locally planar regions where the normals have high confidence. We adopted the point-to-plane distance [18] to measure the registration error of points in two LiDAR frames. Formally, given an estimated pose P_t←t+1 and the corresponding transformation matrix T_t←t+1 from frame t + 1 to frame t, for each pixel u_t+1 in frame t + 1, we transform the corresponding 3D point into frame t by

p'_t = T_t←t+1 Ṽ_t+1(u_t+1), (10)

where Ṽ_t+1(u_t+1) is the homogeneous form of V_t+1(u_t+1), and obtain its registered correspondence p̂_t in frame t by using a line-of-sight criterion [50]. That is,

p̂_t = V_t(π(p'_t)), (11)

where π(·) is the spherical projection. Then, the confidence-weighted point-to-plane distance at pixel u_t+1 is defined as follows:

d(u_t+1) = C_t(û_t) · |⟨p'_t − p̂_t, N_t(û_t)⟩|, with û_t = π(p'_t).

The geometric loss is further defined by

L_geo = ∑ M_geo ⊙ d / ∑ M_geo.

Here, ⊙ denotes the elementwise multiplication. M_geo is a binary mask to select valid and highly confident pixels on locally planar regions.

The visual loss L_vis enforces visual consistency for pixels on object boundaries and in cluttered regions. This loss is complementary to the geometric loss: while the geometric loss focuses on planar regions that have reliable normals but often lack texture, the visual loss attends to regions where the normals are less confident but textures are richer. Specifically, we transform V_t+1 into frame t via Equation (10) and generate a new vertex color map Vc'_t together with its mask Mc'_t from the transformed vertices. We then adopted the photometric error to measure the difference of the corresponding pixels in the two vertex color maps, that is,

e(u) = |Vc'_t(u) − Vc_t(u)|,

and the visual loss is defined as follows:

L_vis = ∑ M_vis ⊙ e / ∑ M_vis.

Here, M_vis marks the valid pixels with low normal confidences, which often lie on object boundaries and in cluttered regions.
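The confidence-weighted point-to-plane term at the heart of the geometric loss can be sketched as follows. Correspondences, normals, and confidences are assumed precomputed, and all names are ours; the line-of-sight matching itself is omitted.

```python
import numpy as np

def point_to_plane_loss(pts_t1, T, matched_pts, matched_normals, conf, mask):
    """Sketch of the geometric loss: transform (N, 3) points from frame t+1
    into frame t with the 4x4 matrix T, project each residual onto the
    matched normal, weight by confidence, and average over the mask."""
    transformed = pts_t1 @ T[:3, :3].T + T[:3, 3]
    d = conf * np.abs(
        np.sum((transformed - matched_pts) * matched_normals, axis=1))
    return (mask * d).sum() / mask.sum()
```

Note that displacements tangential to the matched plane contribute nothing, which is exactly why this term is reliable only on locally planar regions and is complemented by the visual loss elsewhere.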

Formulation
In the training of the pose-estimation network parameterized by Θ, the model in essence performs the following optimization:

Θ* = arg min_Θ ∑_{(S_t, S_t+1, I_t, I_t+1) ∈ TS} L(P_t←t+1; Θ),

where L is the loss defined in Equation (9) and TS denotes the training set. Note that all inputs of the loss other than the pose are omitted here for the sake of brevity. Once the network is trained, its parameters Θ* are fixed for inference. During test time, when two consecutive frames are given, we further optimized the predicted pose P_t←t+1 while keeping Θ* fixed. This online correction benefits from the unsupervised loss, which requires no ground truth labels. The optimization is conducted by

P*_t←t+1 = arg min_{P_t←t+1} L(P_t←t+1),

which can be solved by gradient descent while taking the pose predicted by the network as the initial value. We adopted Adam [60] to minimize it for N iterations.
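The online correction step amounts to a few Adam iterations on the 6D pose vector. A self-contained sketch is given below, with a generic gradient callback standing in for backpropagation through the actual geometric and visual losses; the paper uses separate learning rates for translation and rotation, while this toy version uses a single one.

```python
import numpy as np

def online_correction(pose_init, loss_grad, lr=0.01, iters=300,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """Refine a 6D pose by Adam at test time (sketch). `loss_grad(pose)`
    returns the gradient of the unsupervised loss w.r.t. the pose; in the
    paper it would come from autograd through the losses, with the
    network's prediction as the initial value."""
    pose = np.asarray(pose_init, dtype=np.float64).copy()
    m = np.zeros_like(pose)
    v = np.zeros_like(pose)
    for t in range(1, iters + 1):
        g = loss_grad(pose)
        m = beta1 * m + (1 - beta1) * g           # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g       # second-moment estimate
        m_hat = m / (1 - beta1 ** t)              # bias correction
        v_hat = v / (1 - beta2 ** t)
        pose -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return pose
```

Because only 6 parameters are optimized, each iteration is far cheaper than fine-tuning network weights, which is the efficiency argument made for online correction over online learning.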

Hard Sample Mining
Hard sample mining (HSM) is widely used in deep metric learning [61] and person re-identification [62] to speed up convergence and improve the learned embeddings. In this paper, we took HSM as a plug-and-play component in the OPC to filter out the easy samples and outliers from the optimization so that it can focus on the challenging correspondences and converge faster. The sampling metric is defined within a neighborhood of each sample. More specifically, given a point p_t+1 = V_t+1(u_t+1), we take all neighboring points of its correspondence p̂_t obtained by Equation (11) as matching candidates and calculate their matching errors. Then, the relative standard deviation (RSD) of these matching errors is calculated. A point having a large RSD implies that either the mean of the matching errors within the neighborhood is small (i.e., an easy sample) or the standard deviation of the errors is large (i.e., an outlier). Therefore, we leave out both easy samples and outliers, while selecting the remaining points as hard samples. That is,

M_geo^hard = 1(RSD(u_t+1) < mean(RSD(u_t+1))). (20)

Then, we update the binary mask M_geo by

M_geo ← M_geo ⊙ M_geo^hard,

and all the other terms in the optimization loss are kept unchanged. Figure 4 presents two examples of our hard sample mining results. It shows that a portion of the points on locally planar regions and most points on trees are selected. When only the selected points are taken into consideration for the geometric loss, the online pose-correction procedure is facilitated.
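The RSD-based selection described above can be sketched as a vectorized toy version, with the per-candidate matching errors assumed precomputed (the function name is ours):

```python
import numpy as np

def hard_sample_mask(match_errors, eps=1e-8):
    """Sketch of RSD-based hard sample mining. `match_errors` is an (N, K)
    array holding K candidate matching errors per point. A point is kept
    as a hard sample when its relative standard deviation (std / mean)
    falls below the mean RSD, filtering out both easy samples (tiny mean
    error) and outliers (large error deviation)."""
    mean_err = match_errors.mean(axis=1)
    rsd = match_errors.std(axis=1) / (mean_err + eps)
    return rsd < rsd.mean()
```

A point with uniformly small errors or with one exploding candidate both end up with a large RSD and are discarded; consistently moderate errors mark the hard, informative correspondences.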

Dataset and Evaluation Metrics
We evaluated the proposed method on the KITTI odometry benchmark [63] and the DSEC dataset [64]. KITTI [63] contains 22 sequences of LiDAR scans captured by a Velodyne HDL-64E, together with synchronized visual images. The ground truth poses of sequences 00-10 are provided for training or evaluation. DSEC [64] contains images captured by a stereo RGB camera and LiDAR scans collected by a Velodyne VLP-16 LiDAR. However, since the LiDAR and camera were not synchronized, we took the provided ground truth disparity maps to obtain the 3D point clouds. Moreover, as no ground truth pose was available, we took the pose estimated by LOAM [32] as the pseudo ground truth for the evaluation. For the performance evaluation, following the official criteria provided in the KITTI benchmark, we adopted the average translational error (%) and rotational error (deg/100 m) on all possible sub-sequences of length 100, 200, ..., 800 m as the evaluation metrics.

Implementation Details
The proposed method was implemented in PyTorch [65]. For the data pre-processing on KITTI, we set f_h = 80° considering the camera's horizontal FOV, f_v = 24°, and f_vu = 3°. In addition, we set δ_h = 0.1786° and δ_v = 0.375° in order to generate vertex maps with a size of 64 × 448. Note that only 3D points within the camera's FOV were projected to ensure that most vertices can be associated with visual information. Besides, each visual image was resized to 192 × 624 for computational efficiency. For the data pre-processing on DSEC, we set f_h = 52°, f_v = 28°, and f_vu = 10°. δ_h and δ_v were set accordingly to generate vertex maps with a size of 64 × 320. We used images captured by the left RGB camera and cropped them to 840 × 1320 from the top-right corner to ensure that they were covered by the ground truth disparities. The cropped images were further resized to 384 × 608 for computational efficiency.
In the training stage, we used the Adam optimizer [60] with β1 = 0.9, β2 = 0.999 and a mini-batch size of four to train the model for 300 K iterations. On KITTI, the initial learning rate starts from 0.0001 and decreases by a factor of 0.8 every 30 K iterations. The scalar λ in the training loss was set to 1.0 empirically, and the confidence threshold δ_c = 0.9. For the online pose correction, we also adopted Adam with β1 = 0.9 and β2 = 0.999. Considering the different magnitudes of the translation and rotation values, we set the learning rate of the translation parameters to 0.025 and that of the rotation parameters to 0.0025. On DSEC, the initial learning rate was set to 0.0002, and the learning rates of the translation and rotation parameters in the online pose correction were set to 0.02 and 0.001, respectively. All other parameters were kept the same as those for KITTI.

Ablation Studies
We first conducted a series of experiments on KITTI to investigate the effectiveness of each component proposed in our pose-estimation network. To this end, we checked the following model variants: (1) UnVELO1: only vertex maps were input to the pose network, and only the geometric loss was used for training; (2) UnVELO2: both vertex maps and vertex color maps were input, but only the geometric loss was used; (3) UnVELO3: both vertex maps and vertex color maps were input, while both the geometric and visual losses were used for training. All these model variants were tested without the online pose-correction module. In order to illustrate the advantage of the LiDAR-dominant fusion (LDF) scheme, a vision-dominant-fusion (VDF)-based model VLO [27] was also included for comparison. (Note that this VLO model corresponds to the basic model "VLO1" in [27]. The full model "VLO4" in [27] was not taken since it adopts additional data augmentation and flip consistency constraints, while we wanted to keep the setting the same as for our UnVELO models.) Besides, as is common practice [15,18,24,37,66], we took sequences 00-08 for training and sequences 09-10 for testing. The results are presented in Table 1. We observed that inputting both vertex maps and vertex color maps slightly improved the performance, and using both the geometric and visual losses boosted the performance further. Moreover, UnVELO3 outperformed the VLO model by a considerable margin, indicating the advantage of the LiDAR-dominant scheme.

Table 1. The performance of our pose-estimation network and its variants. The Modal column denotes the modalities of the network inputs, where "L" stands for LiDAR and "V" stands for visual. The best results are bold. "L-2D" and "L-Dep" denote that the input LiDAR data are a 2D range map and a sparse depth map, respectively.

We further investigated the effectiveness of the online correction (OC) module. In this experiment, different numbers of iterations and the model without hard sample mining (w/o HSM) were tested, and their results are reported in Table 2. In addition, we applied the online pose-correction scheme to the vision-dominant model VLO and report its results for comparison. Since VLO predicts both the depth and pose, we also tested the model that refines both the pose and depth in the online correction, denoted as "VLO+OC-40 Opt-Dep". The experiments showed that the online correction improved the performance for both VLO and UnVELO. For VLO, the results of "VLO+OC-40 Opt-Dep" indicated that the additional depth refinement made the online correction harder, and the performance may degenerate. When comparing UnVELO with VLO, we saw that the online correction was much more effective for our UnVELO model and that the performance consistently improved as the iteration number increased.

Figure 5 plots how the performance varies with the iteration number of the online correction. It shows that both the translation and rotation errors converged at around 40 iterations and that the hard sample mining scheme led to a more stable convergence and better performance with a minor increase in the runtime (about 1 ms per iteration).

Runtime Analysis
We conducted all experiments on a single NVIDIA RTX 2080Ti GPU and provide the runtime of each model variant in Table 2. When comparing the UnVELO and VLO model variants with the same number of online correction iterations, we observed that the LiDAR-dominant UnVELO models performed more efficiently than the visual-dominant counterparts. Moreover, the UnVELO model was very efficient when online correction was not applied. However, as expected, the runtime of our model went up along with the increase of the iteration number. For compensation, we additionally conducted an experiment that performed the online correction on frame t and frame t + 2, which is denoted as the "UnVELO3+OC-40-Inter2" model. It achieved comparable results to the one conducting the online correction on two consecutive frames while taking half the time.

Comparison on KITTI
Finally, we compared our full model ("UnVELO3+OC-40" in Table 2, referred to as UnVELO here) with state-of-the-art learning-based methods on the KITTI odometry benchmark. Since some methods [15,18,20,27,66,67] are trained on sequences 00-08 and tested on sequences 09-10, while the others [17,21,[24][25][26]45,46,68] are trained on sequences 00-06 and evaluated on sequences 07-10, we conducted two experiments following these two splits. The results are presented in Tables 3 and 4, respectively. They show that our method outperformed most end-to-end unsupervised odometry methods in both experimental settings. It was also comparable to hybrid methods, which integrate global optimization modules.

Table 3. Comparison of the proposed method with the SoTA trained on Seq. 00-08 and tested on Seq. 09-10 of KITTI. The Modal column denotes the modalities of the network inputs, where "L" stands for LiDAR and "V" stands for visual. The best results are bold. "L-P" denotes that the input LiDAR data are raw point clouds, "L-2D" a 2D range map, and "L-vox" 3D voxels.

In addition, Figure 6 plots the trajectories obtained by our UnVELO and the code-available methods including SeVLO [27], DeepLO [18], and DeLORA [20] for a more intuitive comparison. The plots show that our method obtained trajectories closer to the ground truth than the others.

Comparison on DSEC
We also compared the proposed method with SeVLO [27], DeepLO [18], and DeLORA [20] on the DSEC dataset, which contains both day and night scenarios with large illumination variations. We should note that the generated vertex map is sparser under poor illumination conditions, since the 3D point clouds are obtained from the disparity maps, as shown in Figure 7. Besides, the LiDAR used in this dataset had only 16 lines, which is too sparse for the LOs. Thus, the compared LOs also took the 3D point clouds obtained from the disparity maps as the input.
For a more comprehensive comparison, we chose a day sequence "zurich_city_06_a" and a night sequence "zurich_city_03_a" with totally different illumination conditions for testing. The comparison results are presented in Table 5, and the trajectories are plotted in Figure 8. The results demonstrated the effectiveness of our method. Table 5. Evaluation results on the DSEC dataset. The Modal column denotes the modalities of network inputs, where "L" stands for LiDAR and "V" stands for visual. The best results are bold.

Conclusions
Vision-dominant VLOs need to predict both dense depth maps and relative poses for unsupervised training, but the noisy predicted depth maps limit the accuracy of the predicted pose. In this paper, we proposed an unsupervised vision-enhanced LiDAR odometry, which projects visual and LiDAR data into two dense images of the same resolution to facilitate the visual-LiDAR fusion. A geometric loss and a visual loss were proposed to exploit the complementary characteristics of the two modalities, leading to better robustness to lighting condition variations compared to vision-dominant VLOs trained with the view synthesis loss. Moreover, an online correction module was designed to refine the predicted pose during test time. The experiments on KITTI and DSEC showed that our method outperformed the other two-frame-based learning methods and was even competitive with hybrid methods. Besides, while the pose accuracy of vision-dominant VLOs is limited by the noisy predicted dense depth, our LiDAR-dominant method only needs to predict the pose, which not only achieved better performance, but also improved the optimization efficiency significantly.
Our method provides a novel promising design for VLO. In future work, it will be necessary to explore the long-term temporal constraints for pose correction to improve the robustness to the abrupt motion changes, dynamic objects, and other disturbances.