TQU-SLAM Benchmark Dataset for Comparative Study to Build Visual Odometry Based on Extracted Features from Feature Descriptors and Deep Learning

Abstract: The problem of data enrichment for training visual SLAM and VO construction models using deep learning (DL) is an urgent problem in computer vision today. DL requires a large amount of data to train a model, and more data covering many different contexts and conditions will yield a more accurate visual SLAM and VO construction model. In this paper, we introduce the TQU-SLAM benchmark dataset, which includes 160,631 RGB-D frame pairs. It was collected from the corridors of three interconnected buildings with a total length of about 230 m. The ground-truth data of the TQU-SLAM benchmark dataset were prepared manually, including 6-DOF camera poses, 3D point cloud data, intrinsic parameters, and the transformation matrix between the camera coordinate system and the real world. We also tested the TQU-SLAM benchmark dataset using the PySLAM framework with traditional features such as SHI_TOMASI, SIFT, SURF, ORB, ORB2, AKAZE, KAZE, and BRISK, and with features extracted by DL such as VGG, DPVO, and TartanVO. The camera pose estimation results are evaluated, and we show that the ORB2 features give the best results (Err_d = 5.74 mm), while the ratio of frames with detected keypoints is best for the SHI_TOMASI feature (r_d = 98.97%). We also present and analyze the challenges of the TQU-SLAM benchmark dataset for building visual SLAM and VO systems.


Introduction
Visual odometry (VO) is applied in many fields [1], such as autonomous vehicles, unmanned aerial vehicles, underwater robots, space exploration robots, agricultural robots, medical robots, and AR/VR. Therefore, VO has been of research interest for many years. Based on the research of Neyestani et al. [2], Agostinho et al. [3], and Wang et al. [1], the problem of estimating VO using a vision-based method with monocular camera data is solved using two classes of methods: knowledge-based and learning-based methods. Learning-based methods are based on traditional machine learning and DL. Knowledge-based methods include appearance-based, feature-based, and hybrid methods. Image-based VO is the process of determining the position and orientation of a robot or entity by analyzing a set of images obtained from its environment. As in the study of Neyestani et al. [2], the framework to build a VO system takes images as input and includes five steps (feature detection, feature tracking, motion estimation, triangulation, and trajectory estimation), as shown in Figure 1. The output of the VO framework is a 6-DOF pose: the 3D position of the camera in the scene and the 3D orientation of the camera's movement. In feature-based methods, the feature-detection step comprises the process of extracting features such as SIFT [4], SURF [5], ORB [6], and BRISK [7] features, etc.
Detected features are tracked across consecutive frames to find the corresponding points using techniques such as optical flow [8]. This is followed by a motion-estimation step that typically uses epipolar geometry constraints on feature-to-feature matches (2D to 2D). From there, motion parameters are calculated; projection from 3D to 2D is used to minimize the reprojection error of tracked 3D landmarks against the current image frame, and 3D-to-3D correspondences are incorporated to increase the accuracy of the estimated pose. To optimize the camera pose, research often uses algorithms such as bundle adjustment, the Kalman filter and EKF, and graph optimization [1]. For DL-based methods, the steps of feature detection, feature matching, and pose estimation are performed using DL [1].
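The pipeline described above can be roughly sketched as a loop over frames. The detector, tracker, and motion estimator below are toy placeholders assumed only to show the data flow; they are not the cited SIFT/SURF/ORB/BRISK or optical-flow implementations:

```python
import numpy as np

def detect_features(frame):
    # Toy detector: keep the 50 pixels with the largest gradient magnitude
    # (a stand-in for SIFT/SURF/ORB/BRISK keypoint detection).
    gy, gx = np.gradient(frame.astype(float))
    mag = np.hypot(gx, gy)
    ys, xs = np.unravel_index(np.argsort(mag, axis=None)[-50:], mag.shape)
    return np.stack([xs, ys], axis=1).astype(float)

def track_features(pts, flow):
    # Toy tracker: shift keypoints by a known optical flow (dx, dy).
    return pts + flow

def estimate_motion(pts_prev, pts_curr):
    # Toy 2D-2D motion estimate: mean displacement of tracked keypoints.
    dx, dy = (pts_curr - pts_prev).mean(axis=0)
    return np.array([dx, dy, 0.0])

def run_vo(frames, flow):
    # Accumulate the estimated per-frame motion into a trajectory.
    pose = np.zeros(3)
    trajectory = [pose.copy()]
    prev_pts = detect_features(frames[0])
    for frame in frames[1:]:
        curr_pts = track_features(prev_pts, flow)
        pose = pose + estimate_motion(prev_pts, curr_pts)
        trajectory.append(pose.copy())
        prev_pts = detect_features(frame)
    return np.array(trajectory)
```

A real system replaces each placeholder with a robust estimator (e.g., epipolar geometry with RANSAC), but the loop structure is the same.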
Currently, most VO estimation methods are evaluated on the KITTI dataset [9][10][11], the TUM RGB-D SLAM dataset [12], the NYU Depth dataset [13], and the ICL-NUIM dataset [14]. However, DL methods always need a very large amount of data to train a VO estimation model. With this approach, the more data the model is trained on, and the more contexts it is trained in, the more accurate the VO estimation results will be. In this paper, we propose a benchmark dataset called the TQU-SLAM benchmark dataset to evaluate VO estimation methods. Our dataset was collected using an Intel RealSense D435 in an environment comprising the second floor of three interconnected buildings: Building A, Building B, and Building C. The dataset includes 160,631 RGB-D frame pairs. The data were collected over a particular length (the FO-D is 230.63 m; the OP-D is 228.13 m), four times in each of two directions, for eight sequences in total.
Our paper has the following main contributions:
• We introduce the TQU-SLAM benchmark dataset for evaluating VO models, algorithms, and methods. The ground truth of the TQU-SLAM benchmark dataset is constructed by hand, including the camera coordinates in the real-world coordinate system, 3D point cloud data, intrinsic parameters, and the transformation matrix between the camera coordinate system and the real world.
• We experiment with the TQU-SLAM benchmark dataset for estimating VO based on PySLAM [15,22] with extracted features such as ORB [16], ORB2 [17], SIFT [4], SURF [5], BRISK [7], AKAZE [18], and KAZE [18], or features extracted from a DL model such as VGG [19], DPVO [20], and TartanVO [21].
• The results on the TQU-SLAM benchmark dataset are analyzed and compared across many types of traditional features and features extracted from DL models.
The structure of our paper is organized as follows: Related studies are presented in Section 2. Section 3 introduces the TQU-SLAM benchmark dataset and the process of building the ground-truth data of the camera's moving trajectory. Section 4 presents the testing of the TQU-SLAM benchmark dataset with traditional features and features extracted from DL. The experiments, results, and challenges of VO estimation on the TQU-SLAM benchmark dataset are shown in Section 5. Finally, conclusions and future research are presented in Section 6.

Related Works
The VO process was also presented in great detail in the survey studies of Agostinho et al. [3], Neyestani et al. [2], and He et al. [23]. In this section, we present two problem areas: VO methods, models, and algorithms; and datasets for evaluating VO models and techniques.
Regarding research on VO methods, Agostinho et al. [3] built a taxonomy that includes two approaches to building VO: knowledge-based methods and learning-based methods. Knowledge-based methods comprise appearance-based, feature-based, and hybrid techniques, as illustrated in Figure 2.
In research on knowledge-based methods, Davison et al. [24] proposed an algorithm called MonoSLAM to construct real-time 3D motion trajectories from data acquired from a camera. MonoSLAM can perform localization and mapping at 30 Hz on a computer with a standard configuration. In particular, the structure-from-motion algorithm is used so that MonoSLAM can work on longer frame sequences. Klein et al. [25] proposed the PTAM technique to estimate camera poses in an unexplored scene. PTAM performs two tasks in two parallel threads, tracking the camera's position and mapping the environment; the mapping task is performed on keyframes using the bundle adjustment technique, and new points are initialized using an epipolar search. Ganai et al. [26] proposed the DTAM system, which can perform real-time tracking and reconstruction of camera positions with high parallelization capabilities. This system does not use feature extraction but rather uses dense pixel intensities, identifying a patchwork surface from built-in keyframes by estimating depth. Izadi et al. [27] proposed the KinectFusion system, which builds 3D poses of Kinect/depth cameras, with many tasks performed in parallel on a GPU: 3D reconstruction using CAD models, the creation of geometrically precise models based on built point cloud data, and the creation of 3D models using a mesh representation. Kerl et al. [28] proposed the DVO and LSD-SLAM systems to build visual SLAM from dense RGB-D images. The authors extend the geometry method to include a geometric error term and perform frame-to-frame matching. A new keyframe is inserted into the pose graph when one is detected, and an entropy-based technique is used to detect loop closures for eliminating drift. Forster et al. [29] proposed the SVO system to build fast, accurate VO from a camera's data. This method is semi-direct and thus can reduce feature extraction and enhance matching techniques for motion estimation. To estimate camera pose and scene structure, SVO draws on two classes of methods: feature-based methods, which use epipolar geometry and invariant feature descriptors to estimate camera motion and scene structure from consecutive frames, and direct methods, where intensity values are used to directly estimate the structure and movement of the camera. Bloesch et al. [30,31] proposed ROVIO to track and estimate camera positions using pixel-intensity errors of image patches; the extended Kalman filter algorithm is then used to track multilevel patch features. Whelan et al. [32,33] proposed a visual SLAM system called ElasticFusion, which can perform real-time tracking and estimation of camera positions using dense pixels. Each moving part in each frame is registered to the motion model of the frame sequence. Engel et al. [34] proposed the DSO system to track and estimate the camera's position in the scene. DSO uses a direct and sparse approach to estimate VO from a monocular camera; the overall model is optimized over parameters such as camera poses, camera intrinsics, and geometry parameters. Mur et al. [16] proposed ORB-SLAM, Mur et al. [17] proposed ORB-SLAM2, and Campos et al. [35] proposed ORB-SLAM3; details of these techniques are presented below. Schneider et al. [36] proposed Maplab as an open framework to build visual-inertial SLAM. It includes two main components: an online part and an offline part. The online part, ROVIOLI, is a VIO and localization front end used to obtain data from the sensor; the offline part, the Maplab console, is used in an offline batch manner with various algorithms.
In research on learning-based methods, Huang et al. [37] proposed an online initialization method and a calibration method to optimize visual-inertial odometry from data obtained from a monocular camera. This method performs state initialization and correction using the space-time constraints of the bootstrapping system, and the aligned poses are used to initialize a set of spatial and temporal parameters. Zhou et al. [38] proposed the DPLVO method as an improvement of DSO to directly estimate VO using points and lines. The overall optimization problem is solved over 3D lines, points, and poses using a sliding window, and long-term association is performed on detected keyframes. Ban et al. [39] proposed a method to estimate VO using end-to-end deep learning based on depth and optical flow. The proposed method learns the ground-truth pose states and estimates the 6-DOF camera pose frame by frame. Lin et al. [40] proposed an unsupervised DL method to estimate 6-DOF VO from a monocular camera. The depth is estimated from DispNet, and a residual-based method is used to perform pose refinement; in particular, the rotation, translation, and scale of the pose are estimated separately. Gadipudi et al. [41] proposed WPO-Net to estimate 6-DOF VO from a monocular camera. WPO-Net is a supervised learning-based method built on a feature encoder and pose regressor with a convolutional neural network operating on stacks of consecutive grayscale images. Kim et al. [42] proposed SimVODIS to perform three main tasks: VO estimation, object detection, and instance segmentation. SimVODIS is a deep network that operates on a stream and exploits shared feature maps with two branches: semantic mapping and data-driven VO. Almalioglu et al. [43] proposed SelfVIO, an adversarial training and self-adaptive visual-inertial sensor fusion method to estimate VO from monocular RGB image sequences. The 6-DOF camera pose is estimated using GANs based on an objective loss function of warping view sequences.
We surveyed several datasets for evaluating VO methods, described below. KITTI dataset: The KITTI dataset [9][10][11] is the most popular dataset for evaluating visual SLAM and VO models and algorithms. This dataset includes two versions: the KITTI 2012 dataset [10] and the KITTI 2015 dataset [11]. The data from Geiger et al. [9] are used to evaluate VO, object-detection, and object-tracking models. The KITTI 2012 dataset was collected from two high-resolution camera systems (grayscale and color), a Velodyne HDL-64E laser scanner, and a state-of-the-art OXTS RT 3003 localization system (combining GPS, GLONASS, an IMU, and RTK correction signals). The KITTI 2012 dataset is also divided into subsets serving different problems. The subset used to evaluate optical flow estimation models includes 194 image pairs for training and 195 image pairs for testing; the images have a resolution of 1240 × 376 pixels, and the ground-truth data were built at about 50% density. The subset used to evaluate 3D visual odometry/SLAM models consists of 22 stereo sequences collected over 39.2 km of driving; it provides benchmarks and evaluation measures for VO and visual SLAM, such as motion trajectory and driving speed. The subset used to evaluate object detection and 3D orientation estimation methods provides ground-truth data including accurate 3D bounding boxes for object classes such as cars, vans, trucks, pedestrians, cyclists, and trams. The original 3D objects in the point cloud data were manually labeled to evaluate algorithms for 3D orientation estimation and 3D tracking. Geiger et al. [10] provided a raw dataset for evaluating stereo, optical flow, and object detection models. The data collection system was built based on the following sources:

Data Collection
We set up the experiment in the second-floor hallways of Building A, Building B, and Building C of Tan Trao University (TQU) in Vietnam, as illustrated in Figure 3; we call the resulting data the "TQU-SLAM benchmark dataset". Classrooms open onto these hallways. The data were collected from the environment using an Intel RealSense D435 camera (https://www.intelrealsense.com/depth-camera-d435/, accessed on 6 May 2024), illustrated in Figure 4. The camera was mounted on a vehicle, shown in Figure 5, with an angle of 45 degrees between the camera's view and the ground. The total distance traveled by the vehicle in one pass was 230.63 m for the FO-D and 228.13 m for the OP-D. The hallway width was 2 m, and every 0.5 m we placed a numbered marker with dimensions of 10 × 10 cm, with each corner marked; the total number of markers used was 332. The moving speed during data collection was 0.2 m/s, and we always drove the vehicle in the middle of the hallway. The data collected were color images and depth images with a resolution of 640 × 480 pixels, acquired at 15 fps. We performed data collection four times across two days, with acquisitions one hour apart: on the first day, we collected data twice (1st, 2nd), and on the second day, we collected the remaining two times (3rd, 4th), in the afternoon from 2:00 p.m. to 3:00 p.m. Each time, movement along the blue arrow was in the forward direction (FO-D), and movement along the red arrow was in the opposite direction (OP-D). All data of the TQU-SLAM benchmark dataset are available at the link (https://drive.google.com/drive/folders/16Dx_nORUvUHFg2BU9mm8aBYMvtAzE9m7, accessed on 6 May 2024). The data we collected are summarized in Table 1.

Preparing Ground-Truth Trajectory for VO
To prepare the ground-truth data for evaluating the results of VO estimation, we built the ground-truth data of the motion trajectory as follows. We predefined a coordinate system in real-world space as shown in Figure 3, where the X axis is red, the Y axis is green, and the Z axis is blue.
We used a self-developed tool written in Python to mark four points on the color image, as shown in Figure 6. We then took the four corresponding marker points on the depth image, because each frame was obtained as a pair of RGB and depth images. To convert the four marked points of the marker on the RGB image and the four corresponding points on the depth image to 3D point cloud data, we used the camera's intrinsic matrix, shown in Equation (1):

K = [ f_x 0 c_x ; 0 f_y c_y ; 0 0 1 ],   (1)

where f_x, f_y, c_x, and c_y are the intrinsic parameters of the camera. Each marker point with coordinates (x_d, y_d) and depth value d_a on the depth image is converted to a 3D point M with coordinates (x_m, y_m, z_m) using Equation (2):

x_m = (x_d − c_x) · d_a / f_x,  y_m = (y_d − c_y) · d_a / f_y,  z_m = d_a.   (2)

As shown in Figure 6, the four points of the marker point cloud are expressed in the camera coordinate system, whose origin is the center of the camera. Therefore, to obtain these four points in the real-world coordinate system, it is necessary to find the rotation and translation (transformation) matrix that transforms points from the camera coordinate system to the real-world coordinate system.
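Equations (1) and (2) amount to standard pinhole back-projection; a minimal sketch, assuming metric depth values:

```python
import numpy as np

def deproject(x_d, y_d, d_a, fx, fy, cx, cy):
    """Back-project a depth pixel (x_d, y_d) with depth d_a into a 3D point
    M(x_m, y_m, z_m) in the camera coordinate system (pinhole model, Eq. (2))."""
    x_m = (x_d - cx) * d_a / fx
    y_m = (y_d - cy) * d_a / fy
    z_m = d_a
    return np.array([x_m, y_m, z_m])
```

For example, the principal point (c_x, c_y) at any depth back-projects onto the optical axis (x_m = y_m = 0).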
For a point with coordinates M(x, y, z) in the camera coordinate system, M′(x′, y′, z′) are its coordinates in the real-world coordinate system, obtained by the transformation in Equation (3):

[x′; y′; z′] = Ro · [x; y; z] + Tr,   (3)

where Ro is the 3 × 3 rotation matrix with elements Ro_11, ..., Ro_33 and Tr is the translation vector. From the coordinates of the four points of point cloud data in the camera's coordinate system, we stack their homogeneous coordinates into a matrix A, as in Equation (5):

A = [ x_1 y_1 z_1 1 ; x_2 y_2 z_2 1 ; x_3 y_3 z_3 1 ; x_4 y_4 z_4 1 ].   (5)

The transformation parameters along the x, y, and z axes are collected in the vectors θ_1, θ_2, θ_3 in Equation (6). The results of the transformation are the vectors X′, Y′, Z′ in Equation (7):

X′ = A · θ_1,  Y′ = A · θ_2,  Z′ = A · θ_3,   (7)

where (x′_k, y′_k, z′_k), k = 1, ..., 4, are the coordinates of the four points of point cloud data in the real-world coordinate system. From this, we obtain a system of linear equations, presented as Equation (8).
To estimate θ_1, θ_2, θ_3, we use the least squares method [46], as in Equation (9):

θ_j = (AᵀA)⁻¹ Aᵀ b_j,  j = 1, 2, 3,   (9)

where b_1 = X′, b_2 = Y′, b_3 = Z′. Finally, the conversion matrix between the camera coordinate system and the real-world coordinate system is of the form (θ_1; θ_2; θ_3). The coordinates of the center of the marker (x_c, y_c, z_c) in the real-world coordinate system are calculated as the mean of its four corners, as in Equation (10):

(x_c, y_c, z_c) = (1/4) Σ_{k=1}^{4} (x′_k, y′_k, z′_k).   (10)

The ground-truth data of the motion trajectory in the real-world coordinate system are shown in Figure 7.
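The per-axis least-squares solve can be sketched as follows. This is an illustrative reconstruction (the homogeneous matrix layout is our assumption), not the authors' exact tool; with four or more correspondences, `np.linalg.lstsq` solves exactly the normal equations of Equation (9):

```python
import numpy as np

def fit_transform(P_cam, P_world):
    """Least-squares fit of a 3x4 transform [R | t] mapping camera-frame
    points to world-frame points: P_world ~ P_cam @ R.T + t.
    P_cam, P_world: (N, 3) arrays of corresponding points, N >= 4."""
    n = P_cam.shape[0]
    A = np.hstack([P_cam, np.ones((n, 1))])              # homogeneous points, (N, 4)
    theta, *_ = np.linalg.lstsq(A, P_world, rcond=None)  # per-axis solves, (4, 3)
    return theta.T                                       # (3, 4) = [R | t]

def apply_transform(T, pts):
    # Apply the fitted transform to (N, 3) camera-frame points.
    return pts @ T[:, :3].T + T[:, 3]
```

Given exact correspondences generated by a rigid motion, the fit recovers the rotation and translation exactly.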

Comparative Study of Building VO Based on Feature-Based Methods
As shown in Figure 1, the framework for constructing VO from the images collected by a monocular camera using knowledge-based methods includes five steps. Below, we present some recently published VO estimation techniques.

PySLAM
The PySLAM framework was proposed by Luigi Freda [15,22]. It is an open-source library that allows the embedding of many types of features, including both traditional features extracted by feature descriptors and features extracted by DL models. The source code of the PySLAM framework we used is available at the link (https://github.com/luigifreda/pyslam, accessed on 6 May 2024). With it, one can test and select good features for building models to estimate a camera's motion trajectory, helping to build pathfinding systems for robots and blind people. The PySLAM framework is developed in the C++ and Python programming languages. Just like the knowledge-based methods presented in Figure 1, the PySLAM framework performs 6-DOF VO estimation in several steps. The feature-extraction step, or keypoint detection, comprises the process of detecting features/keypoints between two consecutive frames. The features can be edges, corners, or blobs. Typically, the features are detected and extracted according to a feature descriptor such as SHI_TOMASI, SIFT, SURF, ORB, ORB2, AKAZE, KAZE [18], or BRISK. Feature extraction through DL networks is also supported, for features such as VGG [19] and D2-Net [47].
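As an illustration of the keypoint-detection step, the following is a toy, unoptimized sketch of the Shi-Tomasi response (the smaller eigenvalue of the local structure tensor; a pixel is a good feature when this value is large). PySLAM itself relies on optimized OpenCV detectors, so this is only for intuition:

```python
import numpy as np

def shi_tomasi_response(img, ksize=3):
    """Shi-Tomasi corner response: the smaller eigenvalue of the windowed
    structure tensor [[Sxx, Sxy], [Sxy, Syy]] at each pixel."""
    gy, gx = np.gradient(img.astype(float))
    Ixx, Iyy, Ixy = gx * gx, gy * gy, gx * gy

    def box(a):
        # Naive box filter over a ksize x ksize window (edge-padded).
        pad = ksize // 2
        ap = np.pad(a, pad, mode='edge')
        out = np.zeros_like(a)
        for i in range(a.shape[0]):
            for j in range(a.shape[1]):
                out[i, j] = ap[i:i + ksize, j:j + ksize].mean()
        return out

    Sxx, Syy, Sxy = box(Ixx), box(Iyy), box(Ixy)
    tr = Sxx + Syy
    det = Sxx * Syy - Sxy * Sxy
    # Smaller eigenvalue: tr/2 - sqrt(tr^2/4 - det).
    return tr / 2 - np.sqrt(np.maximum(tr ** 2 / 4 - det, 0.0))
```

On a synthetic image with a bright square, the response peaks at the square's corner (both gradient directions present) and vanishes along its straight edges and in flat regions.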
Figure 8 shows the matches between corresponding ORB features in two consecutive frames of the TQU-SLAM benchmark dataset. The set of matched ORB features in two consecutive frames is small; features can only be detected in the image regions of the markers and the railing along the route. This also shows that the TQU-SLAM benchmark dataset poses many challenges for feature descriptors. In the motion-estimation step, features are extracted from the tracked frame sequence, from which the camera motion is estimated through the transformation matrix in Equation (11):

T_{i,i−1} = [ Ro_{i,i−1}  t_{i,i−1} ; 0  1 ],   (11)

where Ro_{i,i−1} is the rotation matrix between two consecutive frames i and i − 1, and t_{i,i−1} is the translation vector between two consecutive frames i and i − 1.
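Equation (11) is a homogeneous rigid transform, and chaining such per-frame transforms yields the trajectory. A minimal sketch (note that composition and inverse conventions vary between implementations):

```python
import numpy as np

def make_T(R, t):
    """Build the 4x4 homogeneous transform of Eq. (11) from R (3x3) and t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def accumulate(relative_transforms):
    """Chain per-frame relative transforms into absolute camera poses."""
    pose = np.eye(4)
    poses = [pose.copy()]
    for T_rel in relative_transforms:
        pose = pose @ T_rel
        poses.append(pose.copy())
    return poses
```

For example, translating 1 m forward, turning 90 degrees, and translating 1 m again ends one unit along each of the first two axes.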
The spatial relation between frames i and i − 1 is computed based on Equation (12):

C_i = C_{i−1} · T_{i,i−1},   (12)

where t is the translation in the x, y, z directions; the motion trajectory in 3D space is accumulated from these poses. A scale factor α is the shift factor between two consecutive frames. The essential matrix is computed in the epipolar geometry using "epipolar constraints". To solve this problem, one can use the five-point algorithm, or 5-point RANSAC [22,48]. The RANSAC algorithm is used here to estimate the correspondence between two sets of keypoints. The points transformed from the input set that fall within a certain threshold are called inliers. The algorithm iterates about K times [49] and keeps the solution with the largest number of inliers. K is calculated according to Equation (13):

K = log(1 − p) / log(1 − w^s),   (13)

where p is the probability of finding a set of inlier keypoints, s = 2 is the minimum number of samples needed to estimate a line model, and w is the ratio of points that are inliers.
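Equation (13) can be computed directly; a small sketch:

```python
import math

def ransac_iterations(p, w, s=2):
    """K from Eq. (13): number of RANSAC iterations needed so that, with
    probability p, at least one sample of s points is all inliers,
    given inlier ratio w."""
    return math.ceil(math.log(1 - p) / math.log(1 - w ** s))
```

For p = 0.99 and an inlier ratio of w = 0.5 with s = 2, this gives 17 iterations; larger minimal samples (e.g., s = 5 for the five-point algorithm) require many more iterations at the same inlier ratio, which is why minimal solvers are preferred.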
In the local-optimization step, errors in camera pose estimation and the motion trajectory, accumulated from pose estimation and transformation, are reduced. Currently, two methods are commonly applied to optimize a camera's movement trajectory in an environment. In the first, the entire motion trajectory and camera poses are re-examined to minimize errors. The second is to use the Kalman filter or a particle filter to calibrate the map during data collection and to correct the estimated camera pose when new keypoints are detected.

DPVO (Deep Patch Visual Odometry)
DPVO was proposed by Teed et al. [20]. DPVO is a deep learning framework that combines a CNN and a recurrent neural network (RNN). The architecture of DPVO is illustrated in Figure 9. With RGB images as input, DPVO performs per-pixel feature extraction using ResNet and uses it to calculate similarities between images in the frame sequence. Next, two residual networks are used: the first network extracts matching features, and the second extracts contextual features. The first layer of each network performs convolution with a size of 7 × 7, the next two residual blocks have a size of 32 × 32, and the following two residual blocks have a size of 64 × 64. Finally, the output feature map is 1/4 the size of the input image. Next, a two-level feature pyramid is constructed at 1/4 and 1/8 of the resolution of the input image, based on a 4 × 4 average-pooling filter applied to the matching features to estimate the optical flow. Then, p × p patches are obtained from the feature map at random positions in the image (pixel) space using the bilinear sampling technique. The patch map is created by linking patches, and it has the same pixel size as the feature map. Furthermore, DPVO uses a patch graph to represent the relationship between patches and video frames; the projections of patches onto the frames are the patches' motion trajectories. DPVO also proposes an update operator, a recurrent network that iteratively fine-tunes the depth of each patch and the camera pose of each frame in the video. DPVO is rated as three times faster than DROID-SLAM [50]. In DPVO, the model is trained on several datasets, such as the TartanAir [51], TUM RGB-D SLAM [12], EuRoC [52], and ICL-NUIM [14] datasets.
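The bilinear patch sampling step can be illustrated in a few lines. This is a simplified sketch (single-channel feature map, no learned components), loosely mimicking the idea rather than DPVO's actual implementation:

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample a (H, W) feature map at continuous coords (x, y).
    Assumes 0 <= x < W-1 and 0 <= y < H-1."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * feat[y0, x0] +
            dx * (1 - dy) * feat[y0, x0 + 1] +
            (1 - dx) * dy * feat[y0 + 1, x0] +
            dx * dy * feat[y0 + 1, x0 + 1])

def extract_patch(feat, cx, cy, p=3):
    """Extract a p x p patch of bilinear samples centred at (cx, cy)."""
    half = p // 2
    return np.array([[bilinear_sample(feat, cx + j, cy + i)
                      for j in range(-half, half + 1)]
                     for i in range(-half, half + 1)])
```

In DPVO, such patches are drawn at random positions and then tracked across frames via the patch graph and update operator.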

TartanVO
TartanVO was proposed by Wang et al. [21]. This network has an architecture consisting of two stages: the first is a matching network that matches the features between two consecutive frames (i − 1, i), thereby estimating the optical flow; the second is a pose network that predicts the camera pose based on the estimated optical flow. The architecture of TartanVO is shown in Figure 10. In the TartanVO model, the authors built a pretrained model trained on large datasets (the EuRoC [52], KITTI [9][10][11], Cityscapes [45], and TartanAir [51] datasets), thereby enriching the contexts and environmental conditions in which the model was trained. To increase the model's generalization ability, TartanVO proposes two up-to-scale loss functions: the cosine similarity loss, which computes the cosine of the angle between the estimated translation and the translation label, and the normalized distance loss, which computes the distance between the normalized estimated translation vector and the normalized translation label.
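The two up-to-scale losses can be sketched as follows; this is an illustrative NumPy version of the idea, not TartanVO's exact training code:

```python
import numpy as np

def cosine_similarity_loss(t_pred, t_gt):
    """Up-to-scale loss on translation direction: 1 - cos(angle) between
    the estimated translation and the translation label."""
    cos = np.dot(t_pred, t_gt) / (np.linalg.norm(t_pred) * np.linalg.norm(t_gt))
    return 1.0 - cos

def normalized_distance_loss(t_pred, t_gt, eps=1e-8):
    """Distance between unit-normalized translation vectors, removing scale."""
    return np.linalg.norm(t_pred / (np.linalg.norm(t_pred) + eps)
                          - t_gt / (np.linalg.norm(t_gt) + eps))
```

Both losses are invariant to the magnitude of the estimated translation, which is exactly the scale ambiguity of monocular VO: a prediction that is a positive multiple of the label incurs (near-)zero loss.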

Evaluation Measures
To evaluate the results of the VO algorithms based on feature-based methods, we use the following evaluation metrics: • The absolute trajectory error (ATE) [12] is the distance error between the ground-truth trajectory ÂT_i and the estimated trajectory AT_i, aligned with an optimal SE(3) pose T. ATE is calculated according to Equation (14):

ATE = ( (1/N) Σ_{i=1}^{N} || T · AT_i − ÂT_i ||² )^{1/2}.   (14)

We also report t_rel, the average translational RMSE drift (%), and r_rel, the average rotational RMSE drift (°/100 m), both over lengths of 100-800 m.

• We calculate the trajectory error (Err_d) as the distance error between the ground-truth trajectory ÂT_i and the estimated trajectory AT_i. Err_d is calculated according to Equation (15):

Err_d = (1/N) Σ_{i=1}^{N} || ÂT_i − AT_i ||,   (15)

where N is the number of frames in the frame sequence used to estimate the camera's motion trajectory.

• In addition, we also calculate the ratio of the number of frames with detected keypoints, r_d (%).
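The three measures above can be sketched in NumPy. The SE(3) alignment below uses the standard closed-form Kabsch/Horn solution, which we assume matches the alignment intended in Equation (14):

```python
import numpy as np

def ate_rmse(gt, est):
    """ATE (Eq. (14)): RMSE between (N, 3) position trajectories after
    aligning the estimate to the ground truth with a best-fit SE(3) pose."""
    mu_g, mu_e = gt.mean(axis=0), est.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                 # rotation mapping est -> gt (Kabsch)
    t = mu_g - R @ mu_e
    aligned = est @ R.T + t
    return np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean())

def err_d(gt, est):
    """Err_d (Eq. (15)): mean distance between ground-truth and estimated
    camera positions over the N frames with an estimated pose."""
    return np.linalg.norm(gt - est, axis=1).mean()

def detection_ratio(frames_with_keypoints, frames_total):
    """r_d: percentage of frames in which keypoints were detected."""
    return 100.0 * frames_with_keypoints / frames_total
```

A rigid-body transform of the ground truth has zero ATE after alignment, whereas Err_d compares positions directly without alignment.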

Evaluation Parameters
In this paper, we use the PySLAM framework [22], with all the features integrated into this framework. PySLAM is developed in Python (v3.9) and was run on Ubuntu 20.04. For DPVO, we used the source code at the link (https://github.com/princeton-vl/DPVO, accessed on 6 May 2024). For TartanVO, we used the source code at the link (https://github.com/castacks/tartanvo, accessed on 6 May 2024). We also relied on several open-source libraries, such as NumPy (1.18.2), OpenCV (4.5.1), PyTorch (≥1.4.0), and Tensorflow-gpu (1.14.0). The PySLAM framework was run on a computer with the following configuration: an Intel Core i5-12400F CPU, 16 GB DDR4 RAM, and an RTX 3060 12 GB GPU. With the DPVO and TartanVO networks, we performed three experiments (L1, L2, L3) with each subset of the TQU-SLAM benchmark dataset for estimating the camera motion trajectory. The results are shown in the next section.

Results and Discussions
The results of estimating VO/camera pose/camera trajectory on the TQU-SLAM benchmark dataset when using SHI_TOMASI, SIFT, SURF, ORB, ORB2, AKAZE, KAZE [18], BRISK, and VGG to extract features are shown in Table 2. The average distance error of the ORB2 feature is the lowest (Err_d = 5.74 mm). The ratio of frames with detected and extracted features for 6-DOF camera pose estimation is shown in Table 3. The average ratio of detected frames is highest with the SHI_TOMASI feature (r_d = 98.97%).
Table 3. The ratio of frames with detected features/keypoints in the TQU-SLAM benchmark dataset when using the following extracted features: SHI_TOMASI, SIFT, SURF, ORB, ORB2, AKAZE, KAZE, BRISK, and VGG. Figure 11 shows the results of estimating the VO of the FO-D of the TQU-SLAM benchmark dataset when using the SHI_TOMASI features extracted from RGB images. Figure 11 also shows that SHI_TOMASI features are extracted better, and in greater numbers, on the first FO-D compared to the data from the other acquisition times. Figure 12 shows the results of estimating points on the moving trajectory of the 1st FO-D of the TQU-SLAM benchmark dataset when using the features extracted from VGG. The results in Figure 12 also show that the features extracted with VGG are limited; there are many frames whose features cannot be extracted, and therefore the camera's position cannot be estimated. Figure 13 illustrates the difficulty of feature extraction in frame pairs when using VGG. The results of the camera motion trajectory estimation error (ATE) based on the DPVO and TartanVO networks on the TQU-SLAM benchmark dataset are shown in Table 5.
Table 5. The results of the camera motion trajectory estimation error (ATE) based on the DPVO and TartanVO networks on the TQU-SLAM benchmark dataset. The ratio of frames with detected and extracted features used to estimate the VO of DPVO is shown in Table 6. This rate is low, only 49.68%, while the feature extraction rate of TartanVO is 100%. Table 6. The ratio of frames with detected features/keypoints on the TQU-SLAM benchmark dataset when using the DPVO model to extract the features.

Challenges
As shown in Figures 11-13 and Table 3, the TQU-SLAM benchmark dataset contains several challenges for VO construction. First, regarding lighting conditions, the obtained RGB images have very weak and uneven lighting. Some road sections have adequate lighting, but in others we had to use additional light from mobile phone lights, as shown in Figure 13. As a result, keypoints between two consecutive frames are often not detected. Second, the captured environment is highly homogeneous: the image data obtained by the Intel RealSense D435 are mainly of the floor, with a small part being wall and railing data.
Therefore, the data do not discriminate well between frames, and there are few objects in the scene. Although we placed markers on the floor, the number of keypoints detected across two frames is not high. This leads to failures in estimating the 6-DOF camera poses across multiple frames. Although the DPVO and TartanVO networks were pretrained on many large datasets, such as the EuRoC [52], KITTI [9][10][11], Cityscapes [45], and TartanAir [51] datasets, when performing feature extraction on the TQU-SLAM benchmark dataset, there is a large number of frames whose features cannot be detected and extracted (more than 50% for DPVO). The results are shown in Tables 4-6. Figures 14 and 15 illustrate that the estimation errors of DPVO and TartanVO are very large. This proves that the problem of data enrichment to train VO estimation models still contains many challenges that need to be addressed.

Conclusions and Future Works
VO systems are widely applied in pathfinding robots, autonomous vehicles operating in industry, etc. DL has had impressive results in building visual SLAM and VO systems; however, DL requires a large amount of data to train the model. In this paper, we introduced the TQU-SLAM benchmark dataset, collected with an RGB-D image sensor (Intel RealSense D435) moving through the corridors of three buildings over a particular length (FO-D is 230.63 m, OP-D is 228.13 m). The ground-truth data of the TQU-SLAM benchmark dataset include 6-DOF camera poses/camera trajectory and a 3D point cloud. The dataset was also tested for VO estimation using traditional features and features extracted from DL, such as VGG, based on the PySLAM framework. Among them, the ORB2 features have the best results (Err_d = 5.74 mm), and the ratio of frames with detected keypoints is best for the SHI_TOMASI feature (r_d = 98.97%). In the near future, we will re-standardize the TQU-SLAM benchmark dataset and prepare many types of ground-truth data for evaluating visual SLAM and VO models. We will also build a pretrained model on the TQU-SLAM benchmark dataset and perform comparative research on DL models for visual SLAM and VO systems.

Figure 1. VO framework to build the camera motion trajectory from an image sequence.
Knowledge-based methods include three groups of techniques: appearance-based techniques, feature-based techniques, and hybrid techniques. Learning-based methods include traditional machine learning techniques and deep learning techniques.

Figure 2. The tree of methods to build VO from images obtained from a camera.

Figure 3. Illustration of the hallway environment of Building A, Building B, and Building C of Tan Trao University in Vietnam for data collection. In the environment, we highlight 15 keypoints. Movement along the blue arrow is in the forward direction, and movement along the red arrow is in the opposite direction.

Figure 4. Illustration of an Intel RealSense D435 camera with an infrared projector, a color (RGB) camera, and two cameras to collect stereo depth (D) images.

Figure 5. Illustration of a vehicle equipped with a camera and computer when collecting data.

Figure 6. Illustration of marker application and the marker results collected on a color image.

Figure 7. Illustration of the real-world coordinate system we defined and the camera's motion trajectory. The ground truth of the camera's motion trajectory is shown as black points.

Figure 8. Illustration of ORB feature matching between two consecutive frames of the TQU-SLAM benchmark dataset.

Figure 11. Illustration of the results of estimating VO on the TQU-SLAM benchmark dataset when using the SHI_TOMASI features. The results are presented on FO-D data. The black points belong to the ground-truth trajectory, and the red points are the estimated camera positions.

Figure 12. Illustration of the results of estimating VO on the first FO-D of the TQU-SLAM benchmark dataset when using the features extracted by VGG. The black points belong to the ground-truth trajectory, and the red points are the estimated camera positions.

Figure 13. Illustration of the matching of features extracted by VGG for a frame pair of the 1st FO-D of the TQU-SLAM benchmark dataset. The lines connecting two features on two consecutive frames represent two corresponding locations detected by the feature descriptor.

The results of the camera motion trajectory estimation error (Err d) based on the DPVO and TartanVO networks on the TQU-SLAM benchmark dataset are shown in Table 4.

Table 4. The results of the camera motion trajectory estimation error (Err d) based on the DPVO and TartanVO networks on the TQU-SLAM benchmark dataset.

Figure 14 shows the results of estimating the camera's motion trajectory based on the DPVO model when experimenting with L1 of the TQU-SLAM benchmark dataset.

Figure 14. Illustration of the results of estimating the camera's motion trajectory on the TQU-SLAM benchmark dataset when using the DPVO model. The ground-truth motion trajectory is the black line, and the estimated trajectory of the DPVO model is the blue line.

Figure 15 shows the results of estimating the camera's motion trajectory based on the TartanVO model when experimenting with L1 of the TQU-SLAM benchmark dataset.

Figure 15. Illustration of the results of estimating the camera's motion trajectory on the TQU-SLAM benchmark dataset when using the TartanVO model. The ground-truth motion trajectory is the black line, and the estimated trajectory of the TartanVO model is the blue line.

Table 1. The number of frames of the four data acquisition times for the TQU-SLAM benchmark dataset.
The transformation of point M to point M′ is shown in Equation (4):

x′ = Ro 11 x + Ro 12 y + Ro 13 z + Tr 1
y′ = Ro 21 x + Ro 22 y + Ro 23 z + Tr 2
z′ = Ro 31 x + Ro 32 y + Ro 33 z + Tr 3    (4)

where Ro 11, Ro 12, Ro 13, Ro 21, Ro 22, Ro 23, Ro 31, Ro 32, and Ro 33 are the components of the rotation matrix from the camera coordinate system to the real-world coordinate system, and Tr 1, Tr 2, and Tr 3 are the components of the translation matrix from the camera coordinate system to the real-world coordinate system.
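Equation (4) is the component-wise form of the rigid-body transform p′ = Ro·p + Tr. A minimal sketch of applying it to a point (the function name is ours, not part of the dataset tooling):

```python
import numpy as np

def camera_to_world(Ro, Tr, p):
    """Map point p = (x, y, z) from camera coordinates to world
    coordinates, component-wise identical to Equation (4):
    p' = Ro @ p + Tr."""
    Ro = np.asarray(Ro, dtype=float)  # 3x3 rotation matrix
    Tr = np.asarray(Tr, dtype=float)  # 3-vector translation
    return Ro @ np.asarray(p, dtype=float) + Tr
```

With Ro equal to the identity matrix, the transform reduces to a pure translation, which is a quick sanity check when calibrating the camera-to-world matrix.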
As shown in Table 4, the average estimation error (Err d) of DPVO is 13.7 m, and that of TartanVO is 14.96 m. Table 5 also shows that the average absolute trajectory error (ATE) of DPVO is 15.96 m, and that of TartanVO is 17.46 m. These results show that DPVO is better than TartanVO in terms of estimation error, but the estimated realized frame rate of DPVO is only 49.68%.
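For reference, ATE values like those above can be computed as the root-mean-square of the per-frame position differences between the ground-truth and estimated trajectories. The sketch below assumes both trajectories are already expressed in the same coordinate frame (standard ATE pipelines first align them with a rigid SE(3) fit, e.g. Horn's method, which is omitted here); the function name is ours.

```python
import numpy as np

def absolute_trajectory_error(gt, est):
    """Root-mean-square Absolute Trajectory Error (ATE) between a
    ground-truth trajectory and an estimated one, both given as
    (N, 3) arrays of camera positions in the same coordinate frame."""
    diff = np.asarray(gt, dtype=float) - np.asarray(est, dtype=float)
    # Per-frame Euclidean position errors, then RMS over all frames.
    return float(np.sqrt(np.mean(np.sum(diff**2, axis=1))))
```

Note that frames for which DPVO or TartanVO fails to produce a pose cannot enter this average, which is why the realized frame rate must be reported alongside the ATE.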