VA-LOAM: Visual Assist LiDAR Odometry and Mapping for Accurate Autonomous Navigation

In this study, we enhance odometry performance by integrating vision sensors with LiDAR sensors, which exhibit contrasting characteristics. Vision sensors provide rich environmental information but are limited in precise distance measurement, whereas LiDAR offers highly accurate distance measurements but lacks detailed environmental data. By utilizing data from vision sensors, this work compensates for the weak descriptors of LiDAR points, thereby improving LiDAR feature matching performance. Traditional fusion methods, which rely on extracting depth for image features, depend heavily on vision sensors and are vulnerable under challenging conditions such as rain, darkness, or light reflection; utilizing vision sensors as primary sensors under such conditions can lead to significant mapping errors and, in the worst cases, system divergence. Conversely, our approach uses LiDAR as the primary sensor, with vision sensors supporting LiDAR-based mapping, mitigating the shortcomings of previous methods. This preserves LiDAR odometry performance even in environments where vision sensors are compromised, while still benefiting from vision support under normal conditions. We adopted five prominent algorithms from recent open-source LiDAR SLAM projects and conducted experiments on the KITTI odometry dataset, then integrated our vision support module into the top three LiDAR SLAM methods, improving their performance. By making the source code of VA-LOAM publicly available, this work enhances the accessibility of the technology, fostering reproducibility and transparency within the research community.


Introduction
Simultaneous Localization and Mapping (SLAM) is a process in which autonomous systems like robots or cars determine their location and simultaneously create a map of an unknown environment. This technology is a critical component in navigation systems such as unmanned aerial vehicles (UAVs) and autonomous driving vehicles. Visual Odometry (VO) and LiDAR (Light Detection and Ranging) Odometry are extensively used in these systems for tracking location and constructing maps. Understanding the unique strengths of each technology is vital for designing effective SLAM systems.
Vision sensors, including monocular cameras, stereo cameras, and RGB-D cameras, collect high-frequency visual data (typically 30-60 Hz) and provide detailed spatial analysis. Monocular cameras alone face challenges in estimating depth, so they are fused with additional sensors to achieve more accurate Visual Odometry. For instance, studies [1][2][3][4] have enhanced the depth estimation accuracy of monocular cameras by integrating accelerometers and gyroscopes. Research [5,6] proposed new methods for performing Visual SLAM using stereo cameras. Stereo cameras can measure depth by aligning images from both lenses, allowing for more precise environmental mapping. To compensate for the limited depth range of stereo cameras, methods have expanded it using RGB-D cameras, as proposed in [7,8]. RGB-D cameras offer a more accurate and wider depth measurement range than stereo cameras but can face difficulties in outdoor environments due to their sensitivity to light.
LiDAR measures distances by detecting light reflected from objects, creating highly precise three-dimensional point clouds. Various studies [9][10][11][12][13][14] have proposed methods for odometry and mapping using LiDAR. However, as LiDAR operates at a low frequency (i.e., 10 Hz) and only provides 3D points and intensity data, spatial analysis can be challenging. While accurate mapping with LiDAR enables precise pose estimation, insufficient descriptors can cause significant mapping errors, which in turn lead to serious pose estimation inaccuracies.
Many studies propose methods that combine and complement the strengths of vision sensors and LiDAR sensors, which have contrasting characteristics. The approaches suggested in [15][16][17][18] involve extracting visual features from the vision sensor and measuring depth with the LiDAR sensor. Although these methods leverage the advantages of both sensors, the point cloud generated by the LiDAR sensor is less dense than the vision sensor's image, resulting in 3D-2D depth association errors. These depth association errors become more pronounced with objects that are further away, and they can degrade the precision of LiDAR odometry's pose estimation. Moreover, vision sensors are highly dependent on environmental conditions such as weather, changes in lighting, shadows, and light reflections. Methods that use the vision sensor as the primary sensor in Visual-LiDAR fusion are significantly affected by environmental changes, which can lead to substantial errors. In [19], deep learning is used to fuse the two sensors, while in [20], a method is employed to adjust the weights of each sensor's measurements based on environmental conditions. This paper proposes a new method that utilizes visual information from vision sensors to enhance the accuracy of LiDAR odometry. We suggest a technique to reduce 3D-2D depth association errors and enable more precise pose estimation in LiDAR odometry. By using only LiDAR features and assigning image descriptors to them, we enhance the uniqueness of the LiDAR points. Employing LiDAR as the primary sensor allows the system to maintain performance in LiDAR Odometry and Mapping even when vision sensors fail or environmental conditions change. This approach offers the advantage of maintaining the high precision of LiDAR sensors while minimizing the environmental limitations faced by vision sensors. To achieve this, we analyzed the performance of various open-source LiDAR odometry methods using the KITTI dataset [21] and developed the Visual Assist LiDAR Odometry and Mapping (VA-LOAM) method, which integrates visual information into the top three methods with the lowest positional root mean square error (RMSE).
To summarize, the main contributions of this work are fourfold: (1) Visual Information Integration: This study proposes a new method that utilizes visual information collected from vision sensors to enhance the precision of LiDAR odometry. This approach reduces 3D-2D depth association errors and enables accurate pose estimation. By integrating vision sensor data with LiDAR data, this method achieves better performance than traditional LiDAR odometry; the rich environmental information provided by vision sensors complements the limitations of LiDAR, maintaining high accuracy even in complex environments. (2) Enhanced LiDAR Odometry through Vision Sensor Support: This contribution focuses on using LiDAR as the primary sensor while utilizing vision sensors as a supplementary aid. Traditional methods that fuse LiDAR and vision sensors often rely on the vision sensor as the main sensor, which can fail in environments where vision sensors are weak (e.g., dark conditions and reflective surfaces). Our method ensures that in typical environments vision sensors assist in matching LiDAR feature points, improving accuracy; in challenging conditions for vision sensors, the system can operate using only LiDAR, maintaining the performance of traditional LiDAR-based odometry. This ensures stable and consistent performance across various environments by leveraging the strengths of LiDAR while mitigating the weaknesses of vision sensors. (3) Validation and Performance Improvement of VA-LOAM: This paper develops and validates the Visual Assist LiDAR Odometry and Mapping (VA-LOAM) method, which integrates visual information into existing LiDAR odometry techniques. The method was tested using the publicly available KITTI dataset, demonstrating improved performance over existing LiDAR odometry methods. (4) Open-Source Contribution: By making the source code of VA-LOAM publicly available, this work ensures the reproducibility and transparency of the research across the community, enhancing the accessibility of the technology and fostering collaboration and innovation in research and development.

Related Work
Vision sensors and LiDAR sensors are widely used for estimating the 6 degrees of freedom (6DOF) position and orientation. They are essential for accurately determining the position and orientation in UAV autopilot systems, robot SLAM, and autonomous vehicle navigation systems. There are three primary methods of SLAM that utilize these sensors: Visual SLAM, which uses only vision sensors; LiDAR SLAM, which uses only LiDAR sensors; and Visual-LiDAR SLAM, which integrates both vision and LiDAR sensors.
(1) Visual SLAM: Refs. [1][2][3][4][5][6][7][8] pertain to this method. LSD-SLAM [1] and SVO [2] match continuous images using photometric consistency without preprocessing the sensor data. This approach is particularly useful in environments lacking distinct image features. Methods [3][4][5][6][7][8] extract local image feature points (edges, corner points, and lines) and track them to estimate their positional changes. By analyzing the camera's motion and the feature points' positional shifts, the distance to these points and the camera's pose can be calculated. Additionally, image features are used to perform loop detection. ORB-SLAM [3] performs feature matching using FAST features and BRIEF descriptors. The pose is calculated based on matched feature points using the Perspective-n-Point (PnP) algorithm. VINS-Mono [4] is a tightly coupled sensor fusion method that uses visual sensors and IMUs. It performs visual-inertial odometry using tracked features from a monocular camera and pre-integrated IMU measurements. While accurate feature matching can be achieved through image descriptors, the process of estimating feature point depth involves significant errors.
Research has been conducted using stereo cameras and RGB-D cameras to reduce these errors in depth estimation. TOMONO [5] and ENGEL [6] proposed methods using stereo cameras. TOMONO [5] introduced an edge point-based SLAM method using stereo cameras, which is particularly effective in non-textured environments where it detects edges and performs edge-based SLAM. ENGEL [6] improved SLAM accuracy by estimating pixel depth using a fixed-baseline stereo camera and motion from a multiview stereo. KERL [7] and SCHOPS [8] proposed RGB-D SLAM using entropy-based keyframe selection and loop closure detection; (2) LiDAR SLAM: Refs. [9][10][11][12][13][14] apply to this method. This technique uses point clouds containing three-dimensional points and intensities, employing feature extraction and matching to estimate position. LiDAR provides accurate 3D points, enabling precise pose estimation. However, a lack of sufficient descriptors can lead to matching errors. Methods [10][11][12][13][14] have evolved from LOAM [9]. F-LOAM [14] offers faster processing speeds and lower memory usage, enabling real-time SLAM on lower-performance devices. A-LOAM [11] enhances accuracy by incorporating loop closure functionality and reducing mapping errors caused by obstacles. LeGO-LOAM [10] proposes a lightweight, terrain-optimized method for ground vehicles, classifying and processing the terrain accordingly. ISC-LOAM [13] addresses the issue of insufficient descriptors in LiDAR by proposing an intensity-based scan context, which improves performance in loop closure detection. LIO-SAM [12] tightly couples LiDAR and IMU measurements in a factor graph, enabling accurate, real-time trajectory estimation; (3) Visual-LiDAR SLAM: systems that fuse both sensors fall into two broad designs. Loosely coupled systems are relatively simple to implement and offer high accuracy, but they can be vulnerable to sensor errors and changes in dynamic environments. Tightly coupled systems are robust against uncertainty in sensor data and environmental changes, enabling precise position estimation; nevertheless, they require high-performance processing capabilities and sophisticated data integration techniques.

Coordinate Systems
As seen in Figure 1, we utilize three coordinate systems: (•)^w denotes the world coordinate system, (•)^c denotes the camera coordinate system, and (•)^l denotes the LiDAR coordinate system. The world coordinate system provides a fixed reference frame and serves as the reference point for all other coordinate systems. The LiDAR and camera coordinate systems are defined based on the sensor's position and orientation and can be transformed from the world coordinate system through rotations and translations. The origin of the world coordinate system coincides with the initial measurement position of the LiDAR coordinate system (odometry's initial position). At time t_i, the relationship between the camera coordinate system and the LiDAR coordinate system can be expressed with the transformation matrix T_{c_i}^{l_i} ∈ SE(3), which belongs to the special Euclidean group. T_c^l represents the extrinsic parameters between the two sensors, consisting of a rotation matrix and a translation vector. These extrinsic parameters, which indicate the relative position and direction between the two sensors, can be obtained through calibration [22].
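As a minimal sketch of this frame change (the rotation and translation below are hypothetical placeholders; the real extrinsics come from calibration [22]), a LiDAR-frame point can be re-expressed in the camera frame with a 4x4 SE(3) matrix:

```python
import numpy as np

def se3(R, t):
    """Assemble a 4x4 SE(3) transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def transform(T, pts):
    """Apply T to an (N, 3) array of points via homogeneous coordinates."""
    homo = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return (homo @ T.T)[:, :3]

# Hypothetical extrinsics: camera mounted 0.1 m above the LiDAR, axes aligned.
T_cl = se3(np.eye(3), np.array([0.0, -0.1, 0.0]))
X_l = np.array([[2.0, 0.0, 0.0]])   # a LiDAR point 2 m ahead
X_c = transform(T_cl, X_l)          # the same point in the camera frame
```

The same helper composes with a world pose T_w^l to move points into the world frame.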
Figure 1. This illustration visually represents the process by which each sensor's coordinate system is transformed into the world coordinate system. Data collected in each sensor coordinate system can be expressed in another sensor's coordinate system or integrated into the world coordinate system through transformation matrices.

Camera Projection Model
Data collected via LiDAR are measured as 3D points in the LiDAR coordinate system. Each coordinate, X^l, from the LiDAR point cloud is transformed into the camera coordinate system using a transformation matrix. Subsequently, these 3D coordinates are projected onto a 2D plane using the pinhole camera model matrix K. During this process, the depth information along the Z-axis is removed, resulting in 2D coordinates that correspond to the camera's image plane. In Equation (5), u and v represent the coordinates in the image plane, f denotes the focal length of the camera, and c refers to the principal point of the camera.
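A numeric sketch of this projection, assuming simple pinhole intrinsics (the fx, fy, cx, cy values are placeholders, not the KITTI calibration):

```python
import numpy as np

def project(X_c, fx, fy, cx, cy):
    """Project camera-frame 3D points (N, 3) to pixel coordinates (N, 2).
    Dividing by Z removes the depth, leaving image-plane coordinates (u, v)."""
    u = fx * X_c[:, 0] / X_c[:, 2] + cx
    v = fy * X_c[:, 1] / X_c[:, 2] + cy
    return np.stack([u, v], axis=1)

# A point on the optical axis lands exactly on the principal point (cx, cy).
uv = project(np.array([[0.0, 0.0, 5.0]]), fx=700.0, fy=700.0, cx=640.0, cy=360.0)
```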

System Overview
This study aims to enhance localization performance by integrating LiDAR Odometry and Mapping (LOAM) methods with a visual module, as shown in Figure 2. The overall system configuration is as follows. First, LiDAR data are used to detect LiDAR features (surf and edge). Next, the visual assist module uses images from the camera sensor to generate image descriptors for the LiDAR features. Then, to compute the motion displacement, the features detected in the previous frame are matched with those detected in the current frame. Edge features are matched using image descriptors, while surf features are matched using the conventional LiDAR odometry method. If no matching data for the image descriptors are available, indicating a sensor failure or an environment where the camera sensor cannot operate, the edge features are also matched using the conventional LiDAR odometry method, similar to the surf features. Since vision sensors are sensitive to environmental changes, LiDAR is utilized as the main sensor, with the vision sensor serving a supportive role. This configuration is implemented in a loosely coupled manner, allowing the visual module to enhance the performance of LiDAR odometry under normal conditions, while maintaining its effectiveness even in environments where vision sensors are vulnerable.
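The fallback logic above can be sketched as follows; the function and key names are illustrative, not taken from the released code:

```python
import numpy as np

def match_by_geometry(curr_pts, prev_pts):
    """Nearest-neighbor association in 3D (a stand-in for the kd-tree search)."""
    d = np.linalg.norm(curr_pts[:, None, :] - prev_pts[None, :, :], axis=2)
    return np.argmin(d, axis=1)

def match_by_descriptor(curr_desc, prev_desc):
    """Hamming-distance association of binary descriptors."""
    d = (curr_desc[:, None, :] != prev_desc[None, :, :]).sum(axis=2)
    return np.argmin(d, axis=1)

def match_features(curr, prev):
    """Loosely coupled matching: edge features use image descriptors when
    available; otherwise (camera failure, darkness) they fall back to
    geometric matching, exactly like the surf features."""
    matches = {"surf": match_by_geometry(curr["surf"], prev["surf"])}
    if curr.get("desc") is not None and prev.get("desc") is not None:
        matches["edge"] = match_by_descriptor(curr["desc"], prev["desc"])
    else:
        matches["edge"] = match_by_geometry(curr["edge"], prev["edge"])
    return matches

# With no descriptors (e.g., camera failure), both feature types fall back
# to the conventional geometric association.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
frame = {"surf": pts, "edge": pts, "desc": None}
m = match_features(frame, frame)
```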

Point Cloud Preprocessing
During the point cloud preprocessing stage, the detection of LiDAR features is a crucial task. The primary LiDAR features are categorized as edges and planars, based on curvature calculations among adjacent point clouds. If the curvature exceeds a predefined threshold, the point is classified as an edge feature, indicative of significant local changes. Conversely, if the curvature does not surpass the threshold, it is classified as a planar feature, representing a relatively flat surface. This curvature-based extraction method plays a key role in environmental mapping and obstacle detection within LiDAR odometry. The curvature c around a specific point X^l_(k,i) is determined from the differences between the point and its neighboring points S on the same scan line:

c = (1 / (|S| · ||X^l_(k,i)||)) · || Σ_{j∈S, j≠i} (X^l_(k,i) − X^l_(k,j)) ||

Higher curvature values indicate a more abrupt change in the geometric structure around the point.
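A minimal sketch of this curvature test (LOAM-style smoothness over one scan line; the window size and threshold here are assumptions):

```python
import numpy as np

def curvature(scan, i, half=5):
    """Smoothness c of point i within one scan line (scan: (N, 3) array):
    the norm of the summed difference vectors to its neighbors, normalized
    by the neighbor count and the point's range."""
    nbrs = list(range(i - half, i)) + list(range(i + 1, i + half + 1))
    diff = sum(scan[i] - scan[j] for j in nbrs)
    return np.linalg.norm(diff) / (len(nbrs) * np.linalg.norm(scan[i]))

def classify(c, threshold=0.05):
    """Edge if curvature exceeds the threshold, planar otherwise."""
    return "edge" if c > threshold else "planar"

# Points along a straight wall: symmetric differences cancel, so c ~ 0.
line = np.array([[x, 5.0, 0.0] for x in np.linspace(-1, 1, 11)])
c_flat = curvature(line, 5)

# A corner where the wall turns: differences no longer cancel, so c is large.
corner = np.array([[x, 5.0, 0.0] for x in np.linspace(-1, 0, 6)]
                  + [[0.0, y, 0.0] for y in np.linspace(4.8, 4.0, 5)])
c_corner = curvature(corner, 5)
```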

Visual Assist LiDAR Feature
Previous studies [15][16][17][18] have utilized LiDAR-assisted visual odometry based on image feature points. However, as illustrated in Figure 3, due to the density differences between vision sensors and LiDAR sensors, not all image feature points match with LiDAR 3D points. As illustrated in Figure 4, depth association errors frequently occur at image features such as edges and corners. Image features are detected where there is a significant contrast from surrounding areas, particularly at edges or corners. LiDAR is not as dense as vision sensors, which makes it challenging to generate point clouds at precise locations like edges and corners. This discrepancy leads to incorrect matches between image feature points and LiDAR 3D points, resulting in errors in pose estimation.
In this study, we propose a visual assist LiDAR feature matching method to resolve the depth association errors between image feature points and LiDAR 3D points. During the point cloud preprocessing phase, both edge and planar features are detected, and image descriptors corresponding to the edge features are extracted. These edge features, equipped with image descriptors, are defined as visual assist LiDAR features (VALFs). While planar features generally represent non-textured objects and do not provide significant differences in image feature vectors, edge features are suitable for image descriptor extraction due to their texture presence and curvature changes. Unlike image features, VALFs are detected based on three-dimensional terrain characteristics and can be matched through image descriptors.
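A sketch of VALF generation under these definitions. The binary patch descriptor below is a toy stand-in for ORB/BRISK, and the intrinsics and image size are assumed:

```python
import numpy as np

def toy_descriptor(img, u, v, size=8):
    """Stand-in for ORB/BRISK: binarize the patch around (u, v) against its mean."""
    patch = img[v - size // 2 : v + size // 2, u - size // 2 : u + size // 2]
    return (patch > patch.mean()).astype(np.uint8).ravel()

def make_valfs(edge_pts_cam, img, fx, fy, cx, cy):
    """Project edge features into the image and attach a descriptor to those
    that fall inside the camera's field of view. Points behind the camera or
    outside the image remain plain LiDAR edge features."""
    h, w = img.shape
    valfs = []
    for X in edge_pts_cam:
        if X[2] <= 0:                          # behind the camera
            continue
        u = int(fx * X[0] / X[2] + cx)
        v = int(fy * X[1] / X[2] + cy)
        if 8 <= u < w - 8 and 8 <= v < h - 8:  # inside the image, with margin
            valfs.append((X, toy_descriptor(img, u, v)))
    return valfs

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
pts = np.array([[0.0, 0.0, 5.0],    # in front of the camera -> becomes a VALF
                [0.0, 0.0, -5.0]])  # behind the camera -> skipped
valfs = make_valfs(pts, img, fx=50.0, fy=50.0, cx=32.0, cy=32.0)
```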
Table 1 summarizes the advantages and disadvantages of various image descriptors. In this study, considering the presence of 200-300 LiDAR feature points projected onto urban area camera images, ORB and BRISK descriptors were selected for their balance of matching accuracy and processing time. Using Equation (6), edge features are extracted and then projected onto the camera image through Equation (5). If these features are present within the image region, visual descriptors are extracted to generate VALFs. Unlike traditional LiDAR edge/planar features, VALFs utilize these descriptors to perform matching across consecutive frames, thereby enhancing the continuity and accuracy of the feature tracking process. We conducted experiments on 3D-2D depth association methods. Specifically, we tested the conventional method of estimating the depth of image features and the proposed method of applying image descriptors to LiDAR features. As seen in Figure 5, VALF generates a more diverse and larger number of 3D features compared to the conventional method of estimating the 3D depth of image features. The diversity and number of features are crucial factors in pose estimation. Figure 6 displays the results of matching the current-frame and previous-frame features obtained through the two 3D-2D depth association methods. As illustrated in Figure 6, the number of matched features using VALFs is significantly higher.
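Frame-to-frame VALF matching then reduces to nearest-neighbor search over binary descriptors; a brute-force Hamming matcher sketch (the distance gate is an assumption):

```python
import numpy as np

def hamming_match(desc_curr, desc_prev, max_dist=20):
    """Associate each current descriptor with its nearest previous descriptor
    by Hamming distance, rejecting matches above max_dist."""
    dist = (desc_curr[:, None, :] != desc_prev[None, :, :]).sum(axis=2)
    nn = dist.argmin(axis=1)
    best = dist.min(axis=1)
    return [(i, int(j)) for i, (j, d) in enumerate(zip(nn, best)) if d <= max_dist]

prev = np.array([[0, 1, 1, 0], [1, 1, 1, 1]], dtype=np.uint8)
curr = np.array([[1, 1, 1, 1], [0, 1, 1, 0]], dtype=np.uint8)
matches = hamming_match(curr, prev, max_dist=1)  # each row finds its twin
```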

Pose Estimation
Pose estimation utilizes VALFs (F^l_v), edge features (F^l_e), planar features (F^l_p), and a global feature map (M^g). The global feature map comprises an edge feature map (M^g_e), a planar feature map (M^g_p), and a VALF map (M^g_v). The edge and planar feature maps are stored in a 3D kd-tree structure, while the VALF map is stored as a vector containing descriptors and 3D points.
To estimate the optimal pose between the current frame and the Global map, the distance between matched Local and Global features is minimized.
Residual for VALFs: Global features are estimated by collecting points adjacent to the edge/planar feature points. Edge/planar features determine the position and direction of lines and planes by calculating their covariance matrices. The local smoothness of each feature point is used to compute weights that enhance accuracy by reflecting the importance of consistently extracted features during matching. VALFs, with their descriptors, facilitate finding matching local and global features. In environments where visual sensors are challenging to operate, optimization is performed using edge and planar features when VALFs do not match, and using VALFs and planar features when VALFs match:

When VALFs do not match: min_T { Σ d(F^l_e, M^g_e) + Σ d(F^l_p, M^g_p) }
When VALFs match: min_T { Σ d(F^l_v, M^g_v) + Σ d(F^l_p, M^g_p) }

where d(·,·) denotes the residual between a local feature and its matched global feature. Optimization does not simultaneously use all three types (VALFs, edge, and planar features) due to the significant difference in the number of VALFs and edge features. LiDAR features cover a full 360-degree range, but cameras only cover the front, leading to a narrower field of view. Many edge features projected from the LiDAR onto the image fall outside the camera's field of view (FOV), resulting in a discrepancy in the number of points between VALFs and edge features.
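For the planar residual, a typical formulation is the point-to-plane distance against a plane fitted to the matched map neighbors; a generic sketch (not the exact residual of the released code):

```python
import numpy as np

def point_to_plane_residual(p, neighbors):
    """Fit a plane to map neighbors (centroid plus the smallest singular
    vector as the normal) and return the absolute distance from p to it."""
    centroid = neighbors.mean(axis=0)
    _, _, vt = np.linalg.svd(neighbors - centroid)
    normal = vt[-1]                  # direction of least variance
    return abs(np.dot(p - centroid, normal))

# Neighbors lying on the plane z = 0; a query point 0.3 m off the plane.
nbrs = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
r = point_to_plane_residual(np.array([0.5, 0.5, 0.3]), nbrs)
```

The pose T is then the minimizer of the sum of such residuals over all matched features.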

Experimental Setup and Validation
We employed the KITTI dataset [21] to evaluate the performance of Visual Assist LiDAR Odometry and Mapping (VA-LOAM). The odometry benchmark from this dataset was used as the test set, initially experimenting with algorithms such as F-LOAM [14], A-LOAM [11], LeGO-LOAM [10], ISC-LOAM [13], and LIO-SAM [12]. From these, the top three performing algorithms were selected, and further experiments were conducted by integrating our proposed visual assist module. The accuracy of each experiment was quantitatively analyzed using the root mean square error (RMSE) as a measure of positional error.
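The positional RMSE reported here is the root mean square of per-frame translational errors against ground truth:

```python
import numpy as np

def position_rmse(est, gt):
    """RMSE over (N, 3) estimated vs. ground-truth positions."""
    return float(np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1))))

gt = np.zeros((4, 3))
est = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
err = position_rmse(est, gt)  # sqrt(mean of squared position errors)
```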

Evaluation on Public Datasets
We conducted a performance evaluation of both traditional LiDAR-based Odometry and Mapping methods and the enhanced Visual Assist LiDAR Odometry and Mapping (VA-LOAM) using the renowned KITTI dataset [21], which is notable for its road driving scenarios. This dataset comprises sensor data collected from vehicles equipped with stereo and mono cameras, a Velodyne HDL-64 LiDAR, and high-precision GPS/INS, capturing LiDAR point clouds and camera images at a frequency of 10 Hz. As shown in Table 2, the tests used the odometry benchmark from the KITTI training set, which contains ground truth position data. This evaluation aimed to compare the discrepancies between the positions estimated via each method and their ground truth positions. The dataset features a variety of environments, ranging from urban areas densely packed with buildings to rural areas lush with vegetation, and highways devoid of nearby objects. It also includes scenarios both with and without loop closures.
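Each line of a KITTI odometry ground-truth file stores one pose as 12 space-separated floats, the row-major 3x4 [R|t] matrix; a small parser:

```python
import numpy as np

def parse_kitti_pose(line):
    """Parse one KITTI ground-truth line (12 floats, row-major 3x4 [R|t])
    into a 4x4 homogeneous pose matrix."""
    vals = np.array(line.split(), dtype=np.float64)
    T = np.eye(4)
    T[:3, :4] = vals.reshape(3, 4)
    return T

# The first pose of each sequence is the identity (origin of the trajectory).
T0 = parse_kitti_pose("1 0 0 0 0 1 0 0 0 0 1 0")
```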

Performance of LiDAR Odometry and Mapping
We conducted a comparative analysis of existing LiDAR odometry algorithms on the KITTI dataset [21], including F-LOAM [14], A-LOAM [11], LeGO-LOAM [10], ISC-LOAM [13], and LIO-SAM [12]. We compared their performances both with and without the integration of loop closure detection. In Tables 3 and 4, the numbers highlighted in bold represent the algorithms that achieved the highest performance in each dataset sequence. Using only LiDAR for odometry, F-LOAM [14] and ISC-LOAM [13] showed equivalent performance. A-LOAM [11] did not exhibit the best performance in any of the datasets, while ISC-LOAM [13] performed best in 8 out of 13 datasets, LeGO-LOAM [10] in 2, and LIO-SAM [12] in 3. The RMSE of LiDAR odometry was evaluated in the absence and presence of loop detection in the KITTI odometry dataset.

Performance of Visual Assist LiDAR Odometry and Mapping
In our performance comparison experiment using the KITTI dataset [21], we enhanced the top-performing methods, LeGO-LOAM [10], ISC-LOAM [13], and LIO-SAM [12], by integrating a visual module, resulting in VA-LEGO-LOAM, VA-ISC-LOAM, and VA-LIO-SAM. Each modified method maintains its original point cloud processing capabilities while incorporating an additional visual assist LiDAR feature (VALF), which utilizes image descriptors to further refine the estimation of position and orientation.
Tables 5 and 6 detail the comparative performance of these enhanced methods against their original counterparts, showcasing the benefits of integrating the visual module.Tables 7 and 8 show the improvement rates in position accuracy for each method, demonstrating a reduction in RMSE.These figures indicate that position estimation accuracy has improved and highlight the effectiveness of the visual module in reducing position errors under various conditions.Specifically, VA-LEGO-LOAM showed an average reduction of 7.02%, VA-ISC-LOAM showed 12.43%, and VA-LIO-SAM showed 3.67%.Examining each data sequence, in most cases, the visual assist module provided missing descriptors in the existing LiDAR feature matching process, thereby enhancing LiDAR feature matching performance and consequently improving position estimation accuracy.However, in some instances, performance degradation occurred.This was due to incorrect matching during the VALF process.In the KITTI dataset, as shown in Figure 7a, significant changes between the previous frame image and the current frame image occur due to the synchronization of the vision sensor and the LiDAR sensor, resulting in VALF matching errors.For example, if the vision sensor's output frequency is 30Hz, like most vision sensors, and the LiDAR sensor's output frequency is 10Hz, as illustrated in Figure 7b, there can be two additional images between LiDAR data frames.This allows for improved VALF matching through image-based VALF tracking.Future research will aim to address this issue by applying this method.synchronization of the vision sensor and the LiDAR sensor, resulting in VALF matching errors.For example, if the vision sensor's output frequency is 30Hz, like most vision sensors, and the LiDAR sensor's output frequency is 10Hz, as illustrated in Figure 7b, there can be two additional images between LiDAR data frames.This allows for improved VALF matching through image-based VALF tracking.Future research will aim to address this 
issue by applying this method. Table 9 presents the results of experiments conducted under varying environmental conditions. The test environments include daytime, night-time, and camera sensor failure; the latter two are challenging conditions for the visual assist module. The night-time data were generated by reducing the brightness of the original daytime images and applying image processing, as illustrated in Figure 8. We tested VA-LeGO-LOAM, VA-ISC-LOAM, and VA-LIO-SAM, which integrate the sensor-fault visual assist module. Although performance in night-time conditions was lower than in daytime, the error did not increase significantly. This is attributed to the feature matching behaviour: while feature matching degrades between images separated by a large time gap, the impact on matching between images with a small time gap is minimal. Additionally, when the camera sensor failed, LiDAR Odometry performance remained stable. This is because the visual assist module cannot operate without image-based descriptors, but LiDAR-based Odometry can still proceed without them, preserving the original LiDAR Odometry performance. This result demonstrates the advantage of a loosely coupled sensor fusion approach: using LiDAR as the main sensor, rather than the conventional method of obtaining depth from visual features, provides robustness against camera sensor failures and environmental changes.
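The paper states that night-time data were produced by reducing the brightness of the daytime images and applying image processing, but does not specify the exact pipeline. A minimal sketch of one plausible darkening procedure is shown below; the `brightness_factor`, `gamma`, and `noise_std` parameters are assumptions for illustration, not the paper's actual settings.

```python
import numpy as np

def simulate_night(image, brightness_factor=0.25, gamma=2.2, noise_std=5.0):
    """Darken a daytime image to approximate night-time conditions.

    Hypothetical sketch: applies a gamma curve to crush mid-tones,
    a global brightness reduction, and additive noise to mimic a
    low-light camera sensor. All parameters are illustrative.
    """
    img = image.astype(np.float32)
    img = 255.0 * (img / 255.0) ** gamma          # gamma curve darkens mid-tones
    img *= brightness_factor                      # global brightness reduction
    img += np.random.normal(0.0, noise_std, img.shape)  # low-light sensor noise
    return np.clip(img, 0, 255).astype(np.uint8)
```

Feeding such darkened frames through the visual assist module reproduces the kind of degraded-descriptor conditions the night-time experiments evaluate.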

Conclusions
This paper makes several significant contributions. First, it proposes a novel method that reduces 3D-2D depth association errors and enables accurate pose estimation in LiDAR Odometry by utilizing only LiDAR features, instead of image key points, and enhancing their uniqueness with image descriptors. Furthermore, it demonstrates that performance can be maintained even under vision sensor failure or environmental changes by using LiDAR as the primary sensor. Through evaluations on the KITTI dataset [21], the three methods with the lowest position RMSE were selected to develop Visual Assist LiDAR Odometry and Mapping (VA-LOAM), realized as the variants VA-LeGO-LOAM, VA-ISC-LOAM, and VA-LIO-SAM. The RMSE reductions achieved by each variant clearly indicate the potential of visual assist modules to enhance LiDAR Odometry performance. This research lays a foundation for the advancement of precise mapping and localization techniques using LiDAR and visual sensor data, and broadens research and application possibilities by making these methods publicly available.

Figure 2. System flowchart for Visual Assist LiDAR Odometry and Mapping.

Figure 3. This figure demonstrates the depth association issues arising from the density difference between vision sensors and LiDAR sensors. The LiDAR point cloud is shown in green, image features in red, and corresponding features in blue.

Figure 4. Illustration of depth association challenges in LiDAR-assisted Visual SLAM. This figure highlights the potential for multiple LiDAR point clouds to be projected onto a single image feature point.

Figure 5. (a) The 3D image features obtained by estimating the depth of image features using LiDAR point clouds, which is the conventional method. The LiDAR point cloud is shown in green, image features in red, and corresponding features in blue. (b) The proposed VALF method presented in this paper.

Figure 6. Results of 3D feature matching between the current and previous frames. (a) Matching of 3D image features; (b) matching of VALF features.

Local features consist of edge feature points $p_e^l \in F_e^g$, planar feature points $p_p^l \in F_p^g$, and VALF points $p_v^l \in F_v^g$; global features are composed of global lines $p_e^g \in M_e^g$, global planars $p_p^g \in M_p^g$, and global VALFs $p_v^g \in M_v^g$. The Gauss-Newton method, a nonlinear optimization technique, is employed to minimize the distances between these matched points in order to accurately estimate the optimal pose. This process involves point-to-line, point-to-planar, and point-to-point matching. The mathematical expressions for each matching type are as follows. Residual for edge features: $f_e(p_e) =$
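The residual definitions are truncated in this excerpt. For reference, the standard LOAM-style distance residuals corresponding to point-to-line, point-to-planar, and point-to-point matching are sketched below; the paper's exact formulation may differ, and the map points $p_a$, $p_b$, $p_c$ denote the matched global features.

```latex
% Point-to-line residual: distance from an edge point p_e to the line
% through two matched global edge points p_a and p_b.
f_e(p_e) = \frac{\left| (p_e - p_a) \times (p_e - p_b) \right|}{\left| p_a - p_b \right|}

% Point-to-planar residual: distance from a planar point p_p to the plane
% spanned by three matched global planar points p_a, p_b, p_c.
f_p(p_p) = \left| (p_p - p_a) \cdot
  \frac{(p_a - p_b) \times (p_a - p_c)}
       {\left| (p_a - p_b) \times (p_a - p_c) \right|} \right|

% Point-to-point residual: distance between a VALF point p_v and its
% matched global VALF point p_v^g.
f_v(p_v) = \left| p_v - p_v^g \right|
```

Gauss-Newton iterations then minimize the sum of these residuals over the pose parameters.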

Figure 7. (a) The KITTI dataset, consisting of synchronized sensor data; (b) a typical asynchronous data system.

Figure 8. (a) The daytime image; (b) the night-time image.

Table 1. Advantages and disadvantages of image descriptors.

Table 2. The various environments and driving distances that comprise the KITTI dataset.

Table 3. The root mean square error (RMSE) of LiDAR Odometry evaluated without loop detection on the KITTI odometry dataset.
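The evaluation metric throughout Tables 3-6 is the position RMSE of the estimated trajectory. A minimal sketch of how such a translation RMSE is typically computed is shown below; it assumes the estimated and ground-truth trajectories are already time-associated and aligned, which the KITTI evaluation pipeline handles separately.

```python
import numpy as np

def translation_rmse(est, gt):
    """Root mean square translation error between an estimated and a
    ground-truth trajectory, given as N x 3 arrays of positions that
    are assumed to be already associated and aligned."""
    err = est - gt                                   # per-pose position error
    sq_norms = np.sum(err ** 2, axis=1)              # squared Euclidean norms
    return float(np.sqrt(np.mean(sq_norms)))         # RMSE in metres
```

Lower values indicate that the estimated poses track the ground-truth trajectory more closely.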

Table 4. The root mean square error (RMSE) of LiDAR Odometry evaluated with loop detection on the KITTI odometry dataset.

Table 5. Evaluation of RMSE for LiDAR Odometry with the proposed visual assist module (without loop detection). The module was incorporated into the three LiDAR Odometry methods that exhibited the best performance on the KITTI test set.

Table 6. Evaluation of RMSE for LiDAR Odometry with the proposed visual assist module (with loop detection). The module was incorporated into the three LiDAR Odometry methods that exhibited the best performance on the KITTI test set.

Table 7. The efficacy of the proposed visual assist module, evaluated against the traditional LiDAR Odometry method as the percentage improvement in accuracy without loop detection.
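Tables 7 and 8 report accuracy gains as percentage improvements over the baseline RMSE. A one-line sketch of that calculation is given below; the helper name is hypothetical.

```python
def improvement_pct(baseline_rmse, assisted_rmse):
    """Percentage accuracy improvement of the visual-assist variant
    relative to the baseline LiDAR Odometry RMSE (hypothetical helper)."""
    return 100.0 * (baseline_rmse - assisted_rmse) / baseline_rmse
```

For example, a baseline RMSE of 2.0 m reduced to 1.5 m corresponds to a 25% improvement.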

Table 8. The efficacy of the proposed visual assist module, evaluated against the traditional LiDAR Odometry method as the percentage improvement in accuracy with loop detection.

Table 9. Performance of Visual Assist LiDAR Odometry under sensor faults and environmental changes. Note: the numbers highlighted in bold indicate the algorithm that achieved the highest performance on each dataset sequence.