DK-SLAM: Monocular Visual SLAM with Deep Keypoint Learning, Tracking, and Loop Closing

Hao Qu; Lilian Zhang; Jun Mao; Junbo Tie; Xiaofeng He; Xiaoping Hu; Yifei Shi; Changhao Chen

doi:10.3390/app15147838

¹ College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China

² College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(14), 7838; https://doi.org/10.3390/app15147838

This article belongs to the Section Robotics and Automation


Abstract

The performance of visual SLAM in complex, real-world scenarios is often compromised by unreliable feature extraction and matching when using handcrafted features. Although deep learning-based local features excel at capturing high-level information and perform well on matching benchmarks, they struggle with generalization in continuous motion scenes, adversely affecting loop detection accuracy. Our system employs a Model-Agnostic Meta-Learning (MAML) strategy to optimize the training of keypoint extraction networks, enhancing their adaptability to diverse environments. Additionally, we introduce a coarse-to-fine feature tracking mechanism for learned keypoints. It begins with a direct method to approximate the relative pose between consecutive frames, followed by a feature matching method for refined pose estimation. To mitigate cumulative positioning errors, DK-SLAM incorporates a novel online learning module that utilizes binary features for loop closure detection. This module dynamically identifies loop nodes within a sequence, ensuring accurate and efficient localization. Experimental evaluations on publicly available datasets demonstrate that DK-SLAM outperforms leading traditional and learning-based SLAM systems, such as ORB-SLAM3 and LIFT-SLAM. DK-SLAM achieves 17.7% better translation accuracy and 24.2% better rotation accuracy than ORB-SLAM3 on KITTI and 34.2% better translation accuracy on EuRoC. These results underscore the efficacy and robustness of our DK-SLAM in varied and challenging real-world environments.

Keywords:

monocular SLAM; deep learning; feature extraction and matching; loop closing

1. Introduction

Visual localization and mapping are fundamental components in autonomous systems for motion estimation and environmental perception. These capabilities have diverse applications, spanning from self-driving vehicles and unmanned aerial vehicles (UAVs) to mobile robots and immersive wearable technologies such as virtual reality (VR) and augmented reality (AR) devices. The quest for robust localization and mapping under varying environmental conditions has increasingly drawn the research community’s attention.

Visual Simultaneous Localization and Mapping (SLAM) systems generally comprise two main components: front-end perception and back-end optimization. They are typically divided into two categories: direct methods and feature-based methods. Direct methods, such as DSO [1], SVO [2], and H-SLAM [3], process the pixel intensities in an image to optimize a photometric loss at the pixel level, directly using image data for map construction and pose estimation. While these methods can be highly accurate in specific scenarios, they often demand substantial computational resources and can become unstable in dynamic environments.

On the other hand, feature-based methods, exemplified by the ORB-SLAM series [4,5], focus on identifying and utilizing key image features, i.e., keypoints. These methods extract keypoints with distinctive properties, compute descriptors, and perform feature matching, outlier removal, and relative motion tracking. Their back-end processes include loop closure detection and global optimization, which are crucial for refining and maintaining the system’s state estimates over time. Feature-based approaches typically excel in computational efficiency and robustness across diverse scenarios. However, they may struggle in environments with significant lighting changes or insufficient texture, where traditional handcrafted keypoints might not be detected or matched reliably.

In recent years, deep learning has demonstrated remarkable capabilities in computer vision tasks, such as image classification and object detection. Its integration into visual odometry (VO) and SLAM has garnered considerable interest. End-to-end methods use deep neural networks (DNNs) to directly infer pose estimates from images [6,7,8,9]. However, these methods often lack interpretability and may have difficulties generalizing across different scenarios due to discrepancies between training and testing data distributions. Learned-feature-based SLAM methods [10,11,12,13] replace traditional feature detection with deep learning-based visual feature extraction, which is then integrated into the SLAM back-end. These methods leverage both deep learning and geometric models to enhance SLAM performance.

Robust keypoint tracking and matching are vital for the reliability and effectiveness of visual SLAM systems in various applications. Although previous research has explored data-driven feature extraction in SLAM systems, there is still a need for more effective feature extraction networks tailored to the dynamic conditions encountered in SLAM applications. Current neural networks for feature extraction, such as SuperPoint [14], are typically trained on static object detection datasets, which may limit their performance in dynamic, real-world environments with varying lighting and movement. Additionally, achieving reliable matching for learned features and efficient loop closing remains a significant challenge. Unlike ORB features, which encapsulate explicit low-level image information, learned features often produce high-dimensional descriptor embeddings, complicating the process of establishing matches and filtering outliers. Moreover, enabling loop closure detection using learned features poses further difficulties. Learning-based Bag-of-Words (BoW) methods for loop closure detection involve complex training processes, often requiring extensive offline training and substantial memory storage, and they struggle with generalization to new environments.

To address the aforementioned challenges, we propose DK-SLAM, a novel deep keypoint-based SLAM system that incorporates deep keypoint extraction through meta-learning, coarse-to-fine feature tracking, and an efficient online learning-based loop closing. Our DK-SLAM employs an enhanced Model-Agnostic Meta-Learning (MAML) strategy to extract robust deep local features, improving the system’s ability to perform feature extraction in diverse and dynamic environments. This meta-learning strategy enables our DK-SLAM system to adapt to diverse conditions without the need for additional training during deployment, thereby enhancing its applicability in real-world scenarios. Additionally, we introduce a novel feature matching strategy tailored for deep learned keypoints. This strategy involves a two-stage process: initial estimation of relative poses using patch photometric loss optimization, followed by refinement through 3D–2D relationships, resulting in improved matching accuracy and robustness. Furthermore, to enable a more effective and efficient loop closing, we propose an online learning-based Bag-of-Words (BoW) model. Our model compresses learned features into binary descriptors and utilizes images from previous timesteps within an online learning framework. It constructs a dynamic tree structure for Bag-of-Words, facilitating loop closure detection in new environments.

Experimental evaluation conducted on public car-driving and drone datasets demonstrates the superior performance of our proposed DK-SLAM compared with representative traditional visual SLAM such as ORB-SLAM3 and LDSO, as well as learning-based systems such as LIFT-SLAM. Notably, compared with the representative monocular ORB-SLAM3, DK-SLAM achieves a translation accuracy improvement of approximately 17.7% and a rotation accuracy enhancement of 24.2% on the KITTI dataset [15], and it surpasses ORB-SLAM3 by approximately 34.2% in translation accuracy on the EuRoC dataset [16].

Our contributions can be summarized as follows:

We propose DK-SLAM, a novel monocular visual SLAM system with deep keypoint meta-learning. Our deep feature extractor, trained using a MAML strategy, enhances adaptability to diverse scenes.
We develop a coarse-to-fine two-stage keypoint matching strategy that estimates relative poses through patch photometric loss optimization and refines them based on the 3D–2D relationship, resulting in improved accuracy.
We introduce an online learning-based Bag-of-Words (BoW) model that effectively and efficiently utilizes binarized deep keypoints. This model dynamically fine-tunes itself to ensure accurate loop detection during long-term operation.

The remainder of this paper is organized as follows: Section 2 surveys related work; Section 3 introduces our proposed deep keypoint-based monocular SLAM framework; Section 4 evaluates DK-SLAM in both car-driving and drone scenarios and conducts extensive ablation studies; and Section 6 draws conclusions.

2. Related Works

In this section, we provide a brief overview of some related works in learning-based feature extraction, learning-based visual SLAM, and loop closure detection.

2.1. Learning-Based Feature Extractor

Mainstream handcrafted extractors, including ORB [17], SIFT [18], and Shi-Tomasi [19], are employed in visual SLAM systems such as ORB-SLAM2 [5] and VINS-Mono [20]. However, these methods rely on gradient information and therefore face challenges in dynamic lighting and low-texture environments. To address these issues, researchers have explored deep feature extractors. SuperPoint [14] employs a shared encoder for high-level information and distinct decoders for detectors and descriptors, using synthesized labels for self-supervised training. D2-Net [21] eliminates the need for a separate keypoint decoder, ensuring robustness. To enhance detection repeatability, Key.Net [22] combines multi-scale gradients with depth feature maps. It achieves state-of-the-art performance on HPatches [23], with 53.4% repeatability under viewpoint changes. Additionally, HF-net2 [24] is a multi-task teacher–student model for keypoint detection and description in unstructured environments. It reduces visual SLAM localization RMSE by 52% in low-light sandy terrains. Despite their success, these deep extractors still generalize imperfectly across diverse scenes, which can degrade performance.

2.2. Learning-Based Visual SLAM

Recent research explores novel SLAM methods that exploit the advantages of both deep learning and multi-view geometry [25,26]. GCNv2-SLAM [10] integrates Graph Convolutional Networks (GCNs) [27] into SLAM, culminating in a comprehensive system with offline Bag-of-Words (BoW) training. It reduces absolute trajectory error (ATE) by 52% compared with ORB features. In another study [28], knowledge distillation from HF-Net [29] enhances feature detection within a compact model. Additionally, ref. [30] enhances SuperPoint with edge and multi-scale information. In LIFT-SLAM [11], a LIFT [31] extractor serves as the SLAM front-end, implementing an adaptive feature matching strategy tailored to different datasets. LIFT-SLAM reduces absolute trajectory error by 45% compared with ORB-SLAM on KITTI sequence 03. In contrast, DK-SLAM adopts a meta-learning-based training strategy, obviating the need for separate hyperparameter designs for distinct datasets. Ref. [32] proposes a robust deep-learning visual SLAM system for weakly textured underwater environments, which integrates a feature generator, UWNet, into ORB-SLAM3. It achieves superior performance, with an average reduction of 44.45% in absolute pose error. Moreover, ref. [33] proposes an online adaptive keypoint extraction framework for visual odometry, which adaptively adjusts keypoint selection policies based on real-time feedback. To address the challenges posed by dynamic objects in visual SLAM, ref. [34] introduces a semantic segmentation network that identifies dynamic objects in the scene. The system then removes feature points associated with these objects during front-end processing, leading to state-of-the-art performance in environments with multiple moving objects. It achieves a 39.5% improvement in positioning accuracy over ORB-SLAM2 on KITTI sequences with multiple dynamic objects. Additionally, ref. [35] exploits geometric information from objects to enhance VSLAM performance, and ref. [36] applies learning-based semantic segmentation to correct the scale in monocular SLAM mapping and localization. Furthermore, ref. [37] incorporates learned depth labels into monocular visual SLAM, transforming it into a virtual stereo SLAM to overcome scale ambiguity. These depth labels are generated from unsupervised depth estimation networks, enabling the system to achieve more accurate mapping and localization.

2.3. Loop Closure Detection

Loop closure detection is a crucial component of visual SLAM, tasked with identifying loop nodes and refining initial pose estimates [38]. Traditional Bag-of-Words (BoW) models like DBoW2, DBoW3, and FBow [39] utilize binary descriptors and store them in a k-d tree structure based on Hamming distance. Despite their utility, these conventional BoW methods require extensive offline training data, which can limit their adaptability. To address this, iBow [40] was introduced, offering online BoW training directly from testing datasets. This approach provides robust transfer performance across various scenarios, enhancing the system’s adaptability. Another family of loop closure detection approaches, similar to NetVLAD [41] and used in [28,42], depends heavily on the quality and relevance of the training set. Significant differences between the training and testing datasets can lead to performance degradation. In contrast, ref. [43] utilizes deep learning to handle loop closure under varying viewpoints and lighting conditions. It achieves an average precision–recall AUC improvement of 26% over DBoW2 and 18% over AlexNet [44] across six public datasets. Instead of relying on the entire image, this approach focuses on landmarks to determine loop closure, enhancing its robustness and accuracy in diverse environments. Ref. [45] presents LCDNet, a learning-based approach for loop closure detection and point cloud registration in LiDAR SLAM. It effectively detects loop closures even in challenging reverse loop scenarios.

3. Deep Keypoint-Based Monocular SLAM

An overview of our proposed DK-SLAM system is illustrated in Figure 1. In the front-end, we utilize a neural network-based feature extractor to detect and describe keypoints across multi-scale image pyramids. To enhance generalization, the local feature extractor undergoes Model-Agnostic Meta-Learning (MAML) during training. To ensure a balanced distribution of keypoints across images of varying scales, we implement an averaging distribution strategy. Subsequently, to enhance the tracking and matching performance of the learned visual keypoints, we incorporate low-level information from the image. Initially, we construct local photometric constraints on the pixels surrounding keypoints to estimate the initial relative pose. Furthermore, we estimate the matching range of keypoints in adjacent frames based on this initial relative pose. Once the front-end completes keypoint tracking and pose estimation, keyframes are selected at a predefined time interval. These selected keyframes are imported into the keyframe database, with the most recent keyframe being integrated into the back-end loop closure module. Memory consumption is reduced by converting floating-point keyframe descriptors into binary hash codes. Lastly, we introduce an online Bag-of-Words (BoW) module to establish a loop closure mechanism, facilitating closed-loop detection and global map optimization.

Figure 1. An overview of our proposed DK-SLAM framework with deep keypoint meta-learning, two-stage coarse-to-fine keypoint tracking, and online learning-based binary BoW for loop closing.

3.1. Deep Keypoint Meta-Learning

3.1.1. Feature Extractor Network

Inspired by SuperPoint [14], our deep keypoint extractor also uses VGG16 [46] as the backbone architecture. Unlike SuperPoint, we introduce Batch Normalization layers after each convolutional layer to encourage training convergence. The backbone is followed by two decoders, a keypoint detector and a descriptor, each consisting of multiple convolutional layers. The first module, i.e., the detector decoder, produces a probability mask matching the input image size. We take keypoints to be the positions where the mask value surpasses a certain threshold. The second module, i.e., the descriptor decoder, predicts 256-dimensional feature descriptors for keypoints, which are then normalized using the L2 norm.

3.1.2. Self-Supervised Keypoint Learning

Our network undergoes self-supervised training in several steps. First, we obtain homography-warped images from the current training images. To train the keypoint detector, we use the pre-trained MagicPoint network [14] to acquire keypoint pseudo labels for both the original and warped images. In both images, keypoint predictions and pseudo labels form the detector losses $L_{p}$ and $L_{wp}$, with cross-entropy serving as the loss metric. $L_{p}$ represents the detector loss of the original image, while $L_{wp}$ represents the detector loss of the warped image. For descriptor training, we utilize the sparse descriptor loss $L_{d}$ proposed in [47]. This involves obtaining both the warped image and pixel-wise correspondences. Descriptor loss functions are then constructed solely for pseudo-labeled keypoints, employing a triplet loss. The sparse descriptor loss is composed of two components: the homography-matched descriptor loss $L_{dm}$ and the non-matched descriptor loss $L_{dn}$. The homography-matched descriptor loss is given by Equation (1), where the subscript $i$ ranges over $(1, \dots, N)$, $(u_{i}, \hat{u}_{i})$ denotes the N matched keypoint position pairs, and $d_{I}$ and $d_{wI}$ are the descriptors in the original and warped images, respectively.

$$L_{dm}(u_{i}) = \| d_{I}(u_{i}) - d_{wI}(\hat{u}_{i}) \|^{2} \tag{1}$$

As depicted in Equation (2), we use M pairs of descriptors from the vicinity of matched descriptors to formulate a non-matched descriptor loss. The subscript $j$ ranges from 1 to M, where $(u_{i}, u'_{j})$ represents unmatched keypoint position pairs.

$$L_{dn}(u_{i}) = \frac{1}{M} \sum_{j = 1}^{M} \| d_{I}(u_{i}) - d_{wI}(u'_{j}) \|^{2} \tag{2}$$

The descriptor loss $L_{d}$, with $z$ representing the boundary margin, is given in Equation (3):

$$L_{d} = \frac{1}{N} \sum_{i = 1}^{N} \max(0, z + L_{dm}(u_{i}) - L_{dn}(u_{i})) \tag{3}$$

This descriptor loss $L_{d}$ aims to minimize the feature distance between matching descriptors while maximizing the feature distance between non-matching descriptors. The overall training loss, given in Equation (4), is the sum of the aforementioned detector loss and descriptor loss. Here, $λ$ acts as a parameter to balance these different losses.

$$L_{all} = (L_{p} + L_{wp}) + λ L_{d} \tag{4}$$
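As a minimal sketch of how Equations (1) to (4) combine, the following plain-Python snippet evaluates the sparse descriptor loss for a handful of descriptors. The function names, the list-based descriptor representation, and the way non-matched descriptors are supplied are illustrative assumptions, not the authors' training code, which would operate on batched network outputs.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors (lists of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sparse_descriptor_loss(anchors, positives, negatives, margin):
    """Triplet-style descriptor loss following Equations (1)-(3).

    anchors[i]   : d_I(u_i), descriptor in the original image
    positives[i] : d_wI(u_hat_i), matched descriptor in the warped image
    negatives[i] : list of M non-matched descriptors d_wI(u'_j)
    margin       : boundary margin z
    """
    total = 0.0
    for a, p, negs in zip(anchors, positives, negatives):
        l_dm = sq_dist(a, p)                                # Equation (1)
        l_dn = sum(sq_dist(a, n) for n in negs) / len(negs)  # Equation (2)
        total += max(0.0, margin + l_dm - l_dn)             # hinge term of Equation (3)
    return total / len(anchors)


def total_loss(l_p, l_wp, l_d, lam):
    """Overall loss of Equation (4); lam is the balancing weight."""
    return (l_p + l_wp) + lam * l_d
```

The hinge in Equation (3) vanishes once the non-matched distance exceeds the matched distance by at least the margin, so well-separated triplets stop contributing to the gradient.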

3.1.3. MAML-Based Visual Keypoint Meta-Learning

Learning-based visual keypoint (local feature) extractors generalize poorly because visual features vary across scenes. A significant feature disparity between training and unseen datasets can lead to catastrophic forgetting. To address this, we leverage insights from Model-Agnostic Meta-Learning (MAML), adapting network parameters through meta-training, which enhances generalization to new scenarios without requiring additional training during deployment. The MAML-based meta-training consists of inner loop training and outer loop training. The training set is partitioned into a support set $D_{s}$ and a query set $D_{q}$, where $D_{s}$ is involved in inner loop training, and $D_{q}$ participates in outer loop training.

In the MAML training strategy, the original network parameter $θ_{a}$ is initially duplicated to $θ_{b}$. The two copies are then trained in the inner and outer loops, respectively. Within the inner loop, the support dataset $D_{s}$ is used for iterative updates to $θ_{a}$. A batch of the support dataset is divided into n distinct tasks $(D_{s}^{i}, i = 0, \dots, n - 1)$, and m parameter updates are performed on each task, resulting in the updated parameter $θ_{a}^{m}$. After completing one iteration of inner loop training, a minibatch of query set data $D_{q}^{i}$ is used to update $θ_{b}$. The detailed steps of this feature meta-training strategy using MAML are shown in Algorithm 1.

3.1.4. The Distribution Strategy of Deep Keypoints

To ensure a balanced dispersion of learned keypoints across all scales and corners within an image, we incorporate a keypoint distribution strategy during the feature extraction stage. We uniformly distribute the entire set of keypoints across a multi-scale image pyramid. Within each layer of the image pyramid, a predefined number of keypoints is distributed evenly within the grid of different regions on the image. Thus, this strategy can prevent the concentration of learned keypoints in specific areas.
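One way to realize such a grid-based distribution within a single pyramid level is sketched below. The source does not specify the grid size, the per-cell budget, or how keypoints are ranked inside a cell, so `rows`, `cols`, `per_cell`, and the ranking by detector score are all assumptions made for illustration.

```python
def distribute_keypoints(keypoints, width, height, rows, cols, per_cell):
    """Keep at most `per_cell` keypoints in each cell of a rows x cols grid.

    keypoints: list of (x, y, score) tuples for one pyramid level.
    Ranking by score inside a cell is an assumption of this sketch.
    """
    cells = {}
    for x, y, score in keypoints:
        # Map the pixel position to a grid cell, clamping the image border.
        c = min(int(x * cols / width), cols - 1)
        r = min(int(y * rows / height), rows - 1)
        cells.setdefault((r, c), []).append((x, y, score))

    kept = []
    for members in cells.values():
        members.sort(key=lambda k: k[2], reverse=True)  # strongest first
        kept.extend(members[:per_cell])
    return kept
```

Under the same assumptions, the function would be called once per pyramid level, with the total keypoint budget split evenly across levels.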

Algorithm 1 The Procedure of Deep Keypoint Meta-Learning in DK-SLAM.

Require: Feature training dataset $D_{f}$, inner loop learning rate $α_{1}$, outer loop learning rate $α_{2}$.

1: Copy the initial network parameters as $θ_{a}$ and $θ_{b}$.
2: while not done do
3:   Sample a batch of support set $D_{s}$ and a batch of query set $D_{q}$ from $D_{f}$, both with the same batch size.
4:   Sample minibatches $D_{s}^{i}$ from $D_{s}$ and $D_{q}^{i}$ from $D_{q}$, where $i = 0 \dots n - 1$ and n is the number of minibatches (tasks) in a batch.
5:   for $D_{s}^{i}$ and $D_{q}^{i}$ in $D_{s}$, $D_{q}$ do
6:     Calculate the gradient $∇_{θ_{a}} L_{all}$ on $θ_{a}$ using $D_{s}^{i}$ and Equation (4).
7:     Update $θ_{a}$ with $∇_{θ_{a}} L_{all}$: $θ_{a}^{'} = θ_{a} - α_{1} ∇_{θ_{a}} L_{all}$.
8:     Repeat steps 6 and 7 for m iterations to obtain $θ_{a}^{m}$. In our implementation, we set m to 4. We call these iterations the inner loop.
9:     Calculate the gradient $∇_{θ_{a}^{m}} L_{all}$ on $θ_{a}^{m}$ using $D_{q}^{i}$ and Equation (4).
10:    Update $θ_{b}$ with $∇_{θ_{a}^{m}} L_{all}$: $θ_{b}^{'} = θ_{b} - α_{2} ∇_{θ_{a}^{m}} L_{all}$. We call this update of $θ_{b}$ the outer loop.
11:  end for
12: end while
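The control flow of Algorithm 1 can be illustrated with a toy problem. The sketch below replaces the keypoint network with a single scalar parameter and Equation (4) with a squared-error loss, so that the gradient is available in closed form. It also assumes that $θ_{a}$ is re-initialized from $θ_{b}$ before each task, which Algorithm 1 does not state explicitly, and it reuses a fixed set of tasks instead of resampling them (steps 3 and 4). All names are hypothetical.

```python
def grad_loss(theta, batch):
    """Gradient of the toy loss mean((theta * x - y)^2) with respect to theta."""
    return sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)


def meta_train(theta_init, support_tasks, query_tasks, alpha1, alpha2, m=4, epochs=100):
    theta_b = theta_init                                   # step 1
    for _ in range(epochs):                                # step 2: "while not done"
        for d_s, d_q in zip(support_tasks, query_tasks):   # step 5
            theta_a = theta_b  # assumption: inner loop restarts from theta_b
            for _ in range(m):                             # steps 6-8: inner loop
                theta_a -= alpha1 * grad_loss(theta_a, d_s)
            # Steps 9-10: gradient evaluated at theta_a^m on the query task,
            # then applied to theta_b (outer loop).
            theta_b -= alpha2 * grad_loss(theta_a, d_q)
    return theta_b


# Toy data generated from y = 3x; each inner list is one task (minibatch).
support_tasks = [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5), (1.5, 4.5)]]
query_tasks = [[(1.5, 4.5), (2.0, 6.0)], [(1.0, 3.0), (0.5, 1.5)]]
theta = meta_train(0.0, support_tasks, query_tasks, alpha1=0.05, alpha2=0.05)
```

Because the outer update uses the gradient evaluated at $θ_{a}^{m}$ directly, without differentiating through the inner updates, the procedure as written in Algorithm 1 corresponds to a first-order form of MAML.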

3.2. Coarse-to-Fine Keypoint Tracking

Accurate keypoint matching is essential for visual SLAM performance. Traditional systems like ORB-SLAM assume uniform motion between adjacent frames, which makes it difficult to find correct matches and complicates optimization when this assumption fails. Unlike ORB features, which are straightforward low-level image descriptors, learned features generate complex high-dimensional embeddings, making matching and outlier filtering more difficult. To address these issues, we propose a two-stage deep keypoint tracking strategy to enhance matching robustness and accuracy. An overview of the proposed tracking method is shown in Figure 2.

Figure 2. Diagram of our proposed coarse-to-fine two-stage keypoint tracking strategy. This process begins with relative pose estimation through patch photometric loss optimization, followed by refinement using the 3D–2D keypoint relationship for enhanced accuracy.

3.2.1. Semi-Direct Coarse Keypoint Tracking

Taking inspiration from the semi-direct visual odometry method [2], we introduce a coarse tracking method based on photometric constraints. We assume that the light intensity and texture around matching keypoints are similar in adjacent image frames. Map points from the last frame can be projected onto the current frame using the relative pose. If the relative pose is accurate, the photometric loss of the projected keypoint’s patch will be minimal. As depicted in Equations (5) and (6), the pixels of the keypoint’s patch, the image points in the last frame’s coordinates $p_{i}^{l}$, and the relative pose of adjacent frames $ξ_{lc}^{coarse}$ collectively form the photometric loss $L_{p}$. Given the known map points and patch pixel values of the last frame, we adjust the relative pose $ξ_{lc}^{coarse}$ to minimize the photometric loss $L_{p}$.

$$ξ_{lc}^{coarse} = \arg\min_{ξ_{lc}^{coarse}} \frac{1}{2} \sum_{i \in \bar{χ}} \| L_{p}(ξ_{lc}^{coarse}, p_{i}^{l}) \|^{2} \tag{5}$$

$$L_{p}(ξ_{lc}^{coarse}, p_{i}^{l}) = I_{c}(π(ξ_{lc}^{coarse} \cdot p_{i}^{l})) - I_{l}(π(p_{i}^{l})) \tag{6}$$

Here, $π$ denotes the projection of the 3D map point from camera coordinates to pixel coordinates, and $\bar{χ}$ represents the set of keypoint indices. The photometric loss is constructed using the patches around the projected points.
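To make the cost in Equations (5) and (6) concrete, the sketch below evaluates it for a given candidate pose. Several simplifications are assumed here and are not taken from the source: the pose is passed as a rotation matrix `R` and translation `t` rather than the Lie-algebra element $ξ$, the camera is a pinhole model with intrinsics `K = (fx, fy, cx, cy)`, patches are sampled at the nearest pixel, and no image-boundary checks or optimizer are included.

```python
def project(p, fx, fy, cx, cy):
    """Pinhole projection (pi) of a 3D point in camera coordinates to pixels."""
    x, y, z = p
    return fx * x / z + cx, fy * y / z + cy


def transform(R, t, p):
    """Apply the rigid-body motion (R, t) to the 3D point p."""
    return tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))


def photometric_cost(R, t, points_last, img_last, img_cur, K, half=1):
    """Sum of 0.5 * residual^2 over (2*half+1)^2 patches, as in Equation (5)."""
    fx, fy, cx, cy = K
    cost = 0.0
    for p in points_last:
        ul, vl = project(p, fx, fy, cx, cy)                    # pi(p) in the last frame
        uc, vc = project(transform(R, t, p), fx, fy, cx, cy)   # pi(xi . p) in the current frame
        ul, vl, uc, vc = round(ul), round(vl), round(uc), round(vc)
        for dy in range(-half, half + 1):
            for dx in range(-half, half + 1):
                # Equation (6): intensity difference of corresponding patch pixels.
                r = img_cur[vc + dy][uc + dx] - img_last[vl + dy][ul + dx]
                cost += 0.5 * r * r
    return cost
```

In a full system this cost would be minimized over the pose with an iterative solver such as Gauss-Newton; the function above only scores one candidate pose.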

3.2.2. Coarse-to-Fine Keypoint Tracking

Having obtained the coarse relative pose in the first stage, we match map points between adjacent frames and refine the pose of the current frame $ξ_{w}^{c}$ through the 3D–2D projection relationship. Using the coarse relative pose $ξ_{lc}^{coarse}$, we project the 3D map points from the last frame’s coordinate system to the pixel coordinate system of the current frame. We then search for matching keypoints within a fixed radius centered on each projected location, using the Hamming distance between descriptors as the search criterion.
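The radius-constrained search can be sketched as follows, assuming binary descriptors packed into Python integers. The parameter names `radius` and `max_dist`, the exhaustive inner loop, and the acceptance threshold are illustrative choices rather than details given in the source.

```python
def hamming(a, b):
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def match_in_radius(projections, map_descs, keypoints, kp_descs, radius, max_dist):
    """Match each projected map point to the best keypoint within `radius` pixels.

    projections: list of (u, v) projected map-point locations in the current frame
    map_descs  : binary descriptors of those map points
    keypoints  : list of (x, y) keypoint locations detected in the current frame
    kp_descs   : binary descriptors of those keypoints
    Returns a dict {map_point_index: keypoint_index}.
    """
    matches = {}
    for i, ((u, v), d) in enumerate(zip(projections, map_descs)):
        best, best_dist = None, max_dist + 1
        for j, ((x, y), dk) in enumerate(zip(keypoints, kp_descs)):
            if (x - u) ** 2 + (y - v) ** 2 > radius ** 2:
                continue  # outside the search window around the projection
            dist = hamming(d, dk)
            if dist < best_dist:
                best, best_dist = j, dist
        if best is not None:
            matches[i] = best
    return matches
```

Restricting candidates to a window around the projection is what lets the coarse pose prune most incorrect matches before any descriptor comparison takes place.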

Upon establishing the 3D–2D matching relationships, we construct a pose graph. As expressed in Equation (7), we employ the coarse relative pose in conjunction with the last frame’s pose to derive the initial pose of the current frame.

$$\hat{ξ}_{w}^{c} = ξ_{lc}^{coarse} ξ_{w}^{l} \tag{7}$$

The current frame’s pose serves as the vertex in the graph, with the number of vertices being subject to optimization. Map points from the current frame and the reprojection loss of these map points constitute the edges. We exclusively optimize the pose within the tracking module. As detailed in Equation (8), we iteratively refine the current frame pose $\hat{ξ}_{w}^{c}$ to minimize the reprojection loss. In our implementation, the g2o [48] framework is employed for optimizing this pose graph.

$$ξ_{w}^{c} = \arg\min_{\hat{ξ}_{w}^{c}} \frac{1}{2} \sum_{i \in \bar{χ}} \| u_{i} - π(\hat{ξ}_{w}^{c} \cdot p_{i}^{w}) \|^{2} \tag{8}$$

3.3. Deep Keypoint-Based Loop Closing

Our DK-SLAM system features a deep-keypoint-based loop closure module that performs both loop closure detection and correction. To improve generalization in unfamiliar environments, we utilize an online learning-based Bag-of-Words (BoW) model for effective loop closure detection. Once loop closure is detected, the system identifies the closed-loop nodes and optimizes the global map using the relative poses from the detected loops.

3.3.1. Online Learning for Binary BoW

Unlike handcrafted descriptors, learned descriptors occupy a larger feature space, e.g., 128 dimensions in our case. Offline-trained BoW models may struggle to capture this space, making it difficult to distinguish deep feature indices in BoW leaf nodes. To tackle this, we introduce an online learning-based BoW model that uses features extracted exclusively from testing scene data. Constructing a tree-like structure with floating-point descriptors is challenging due to the large memory and time consumption, so, inspired by [49], we employ a binary hash transformation for processing deep features. As depicted in Equation (9), for descriptor vector values $d_{i}$ less than 0, the processed $\hat{d}_{i}$ is set to 0. For descriptor values greater than or equal to 0, the processed $\hat{d}_{i}$ is set to 1, where $i$ indexes the descriptor dimensions.

$$\hat{d}_{i} = \begin{cases} 1 & \text{if } d_{i} \geq 0 \\ 0 & \text{if } d_{i} < 0 \end{cases} \tag{9}$$
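A minimal sketch of the sign-based binarization in Equation (9) is given below, together with a Hamming distance on the packed result. Packing the bits into a Python integer is a storage choice made here for illustration and is not described in the source.

```python
def binarize(descriptor):
    """Equation (9): 1 where the component is >= 0, otherwise 0."""
    return [1 if d >= 0 else 0 for d in descriptor]


def pack_bits(bits):
    """Pack a list of bits into an int, first bit most significant."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def hamming(a, b):
    """Number of differing bits between two packed descriptors."""
    return bin(a ^ b).count("1")
```

Each descriptor component then costs one bit instead of one floating-point value, and descriptor comparison reduces to an XOR followed by a bit count.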

3.3.2. Loop Node Detection

As shown in Figure 3, the loop closure thread receives keyframes filtered from the local mapping thread, storing them in the inverted database to construct a Bag-of-Words (BoW) model. The BoW model utilizes the feature descriptors of the current frame $K_{c}$ to match descriptors stored in the database. Each keyframe is converted into a unique BoW vector for representation. We compute the similarity score between the current keyframe $K_{c}$ and the BoW vectors of the keyframes in the inverted database to identify the most similar candidate keyframe $K_{h}$. However, the selection of the candidate keyframe $K_{h}$ only considers differences in the numerical values of descriptors and neglects differences in descriptor positions, which may lead to mismatches.

Figure 3. An illustration of our proposed Online Learning-based Binary BoW. The BoW is constructed incrementally, with matched descriptors in the keyframes database stored within the same leaf node. In the presence of unmatched descriptors in the current keyframe, a new leaf node is created.

Thus, we first employ a brute-force matching method based on Grid-Based Motion Statistics (GMS) [50] to match deep keypoints between candidate closed-loop nodes. If the number of matches is below a threshold (i.e., 20 in our system), the candidate loop keyframe is classified as a mismatch. GMS operates on the principle that the number of keypoints near correctly matched feature points should surpass the number near incorrectly matched keypoints. Following a procedure similar to ORB-SLAM2 [5], we compute the similarity transformation matrix $T_{sim}$ between candidate closed-loop nodes through matched keypoints. Subsequently, $T_{sim}$ is applied to match map points between the current keyframe $K_{c}$ and the candidate keyframe $K_{h}$. Further optimization of the similarity transformation matrix $T_{sim}$ is conducted using the matched map points. If the number of inliers in the optimization function exceeds the threshold, the candidate keyframe $K_{h}$ is deemed a matched keyframe $K_{m}$.

3.3.3. Global Map Correction via Loop Closing

Optimizing the similarity transformation matrix $T_{sim}$ refines the pose of the current keyframe and its co-visible keyframes. Concurrently, we update the positions of co-observed map points across these keyframes. Next, we align the map points of the matched keyframe $K_{m}$ and its connected keyframes with those of the current keyframe $K_{c}$ and its connected keyframes. We then update the visibility graph between keyframes based on the matched map points. Utilizing $K_{c}$, $K_{m}$, global map points, and the visibility graph, we construct and optimize the essential graph. Finally, a global bundle adjustment refines all keyframes and map point positions within the global map.

4. Experiments

This section discusses the implementation details and evaluates the pose estimation performance of our proposed DK-SLAM on two widely adopted public datasets: the KITTI dataset, representing the car-driving scenario, and the EuRoC dataset, representing drone navigation. Moreover, we conducted an extensive ablation study to validate the effectiveness of our proposed key modules, including learned feature extraction, coarse-to-fine matching strategy, and online learning-based Bag-of-Words (BoW).

4.1. Training Details and Datasets

4.1.1. Training Details

In MAML-based training, we set the batch size to 8, with 4 batches dedicated to support sets and 4 batches to query sets. Each minibatch is treated as a task, and 4 tasks are trained in the inner loop. The trained model calculates gradients on the query set, updating raw model parameters. The training spans 200K iterations, using the MS-COCO dataset [51].

4.1.2. KITTI Odometry Dataset

The KITTI odometry dataset [15] serves as a benchmark for self-driving scenarios, offering stereo RGB and grayscale images, along with LiDAR and pose ground truth data. Grayscale images from sequences 00, 02, 03, 04, 05, 06, 07, 08, 09, and 10 are utilized for pose evaluation. Sequence 01 is excluded from the evaluation due to its inherent challenges for monocular SLAM systems.

4.1.3. EuRoC MAV Dataset

The EuRoC dataset [16] is an indoor MAV flight dataset, encompassing stereo grayscale images, IMU data, and pose ground truth from a Leica laser tracker and a Vicon motion capture system. Grayscale images are collected at a frequency of 20 Hz. Due to space limitations, we focus on sequences MH01–05 for SLAM performance evaluation.

4.2. Pose Evaluation in the Car-Driving Scenario

To evaluate the effectiveness of our SLAM system in the car-driving scenario, we conducted experiments on the KITTI dataset using the official evaluation metrics, computing the root-mean-square error (RMSE) of both translation and rotation. Errors are evaluated over subsequences of length 100 m to 800 m, providing an overall metric for pose accuracy. We assessed our SLAM system’s tracking and loop closure capabilities in complex environments on Sequences 00, 02, 03, 04, 05, 06, 07, 08, 09, and 10 of the KITTI dataset. We excluded Sequence 01 because it is a highway scene with few distinctive feature points, especially near the horizon, which leads to significant scale degradation in monocular SLAM performance.
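A simplified sketch of a KITTI-style segment evaluation is given below. It assumes ground-truth and estimated trajectories stored as arrays of 4×4 camera-to-world poses, and it averages the per-segment errors; the function names, the `step` parameter, and the toy trajectory are illustrative assumptions, and a root-mean-square aggregate can be substituted if that is the intended statistic.

```python
import numpy as np

def path_distances(poses):
    """Cumulative travelled distance (m) at each frame of an (N, 4, 4) pose array."""
    steps = np.linalg.norm(np.diff(poses[:, :3, 3], axis=0), axis=1)
    return np.concatenate([[0.0], np.cumsum(steps)])

def segment_errors(gt, est, lengths=(100, 200, 300, 400, 500, 600, 700, 800), step=10):
    """Return (translation error in %, rotation error in deg/m) averaged over segments."""
    dist = path_distances(gt)
    t_errs, r_errs = [], []
    for first in range(0, len(gt), step):
        for length in lengths:
            ahead = np.nonzero(dist >= dist[first] + length)[0]
            if ahead.size == 0:
                continue  # trajectory too short for this segment length
            last = ahead[0]
            rel_gt = np.linalg.inv(gt[first]) @ gt[last]
            rel_est = np.linalg.inv(est[first]) @ est[last]
            err = np.linalg.inv(rel_est) @ rel_gt  # residual relative motion
            cos_angle = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
            t_errs.append(np.linalg.norm(err[:3, 3]) / length)
            r_errs.append(np.arccos(cos_angle) / length)
    return 100.0 * np.mean(t_errs), np.degrees(np.mean(r_errs))

# Toy check: a straight 1 km path whose estimate is over-scaled by 2 %.
gt = np.tile(np.eye(4), (1001, 1, 1))
gt[:, 2, 3] = np.arange(1001.0)
est = gt.copy()
est[:, 2, 3] *= 1.02
t_err, r_err = segment_errors(gt, est)
```

Because errors are normalized by segment length, a constant scale drift of 2 % shows up as a 2 % translation error regardless of which segment lengths are available.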

We benchmarked our DK-SLAM against several leading SLAM systems: LDSO [52], ORB-SLAM3 [4], VISO-M [53], and LIFT-SLAM [11]. The comparison of trajectories generated by the proposed SLAM and benchmarks is shown in Figure 4. Among these, VISO-M exhibited the lowest accuracy, particularly struggling in low-light conditions where its corner detection algorithms were unstable. LDSO, which includes a loop closure module, managed to reduce accumulated errors effectively in challenging scenarios. ORB-SLAM3, a sophisticated keypoint-based SLAM system that uses FAST corner detection and BRIEF descriptors, performed exceptionally well across most sequences, as shown in Table 1. However, it showed a noticeable drop in performance on Sequence 10, primarily due to a scarcity of reliable corner points, which resulted in unstable tracking and adversely affected overall pose estimation accuracy. The reliance on BRIEF, which describes local photometric information around corners, made ORB-SLAM3 particularly sensitive to lighting changes, thereby impacting feature tracking performance. In environments with significant lighting variations, the inability to track reliable feature points at the front-end increased positioning errors. Although the back-end loop closure could correct some accumulated errors, it struggled to improve overall accuracy. In contrast, LIFT-SLAM, which depends on learning-based local features, generally underperformed across most sequences. This outcome suggests that while learning-based features have potential, they do not consistently outperform traditional feature-based methods. ORB-SLAM3’s use of keypoint scale and orientation, together with its mechanism for distributing keypoints evenly across the image, gave it a competitive advantage, helping to filter out mismatched features at the front-end.

Figure 4. The generated trajectories of our proposed DK-SLAM on Sequences 00, 02, 05, 07, 09, and 10 of the KITTI dataset, compared with LDSO and ORB-SLAM3.

Table 1. The pose evaluation on the KITTI dataset: our DK-SLAM system outperforms both representative traditional and learning-based monocular SLAM methods. Sequence 01 is excluded from the evaluation due to its inherent challenges for all monocular SLAM baselines, which prevent the generation of reasonable comparative results.

Our DK-SLAM system demonstrated exceptional performance across most KITTI sequences. Specifically, DK-SLAM shows a substantial improvement over traditional monocular ORB-SLAM3, achieving approximately 17.7% better translation accuracy and 24.2% higher rotation accuracy on the KITTI dataset. When compared with LIFT-SLAM, a leading SLAM system based on learned features, DK-SLAM surpasses it by nearly 2.7 times in translation accuracy and 9 times in rotation accuracy. This significant enhancement in performance is attributed to DK-SLAM’s use of meta-learning-based feature extraction and a coarse-to-fine matching strategy, which focuses on keypoint-surrounding patches and optimizes pose estimation through precise 3D–2D matching relationships. Additionally, our system’s online training capability builds a learning-based Bag-of-Words (BoW) model using previously acquired data, efficiently constraining the BoW’s feature description space and improving loop scene detection. This approach enables the loop closure module to accurately correct cumulative errors. Moreover, as illustrated in Figure 5, DK-SLAM excels in mapping performance, providing detailed geometric insights. The incorporation of MAML-based feature training enhances the generalization capability of the feature extractor, allowing it to capture more comprehensive scene details. This combination of strategies ensures that DK-SLAM offers robust and accurate performance in varied and complex environments.
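For readers reproducing such comparisons, the sketch below shows one common way to turn two error values into a percentage improvement and a ratio. The formulas are an assumption about how the figures are defined, and the numbers are hypothetical placeholders rather than values from Table 1.

```python
def relative_improvement(baseline_error, our_error):
    """Percentage reduction of the error relative to the baseline."""
    return 100.0 * (baseline_error - our_error) / baseline_error

def improvement_factor(baseline_error, our_error):
    """How many times smaller our error is than the baseline error."""
    return baseline_error / our_error

# Hypothetical errors, for illustration only.
percent = relative_improvement(baseline_error=10.0, our_error=8.0)
factor = improvement_factor(baseline_error=10.0, our_error=8.0)
```

Under this definition, a percentage figure describes the share of the baseline error that is removed, whereas an "N times" figure describes the ratio of the two errors.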

Figure 5. Mapping results generated by our proposed DK-SLAM system.

4.3. Pose Evaluation in the UAV Scenario

As shown in Table 2 and Figure 6, we further validate our proposed DK-SLAM on the EuRoC dataset. Following previous research, we resolve the scale ambiguity of monocular vision methods with Umeyama alignment [54]. Comparisons include ORB-SLAM3 [4], DSO [1], DSM [55], SVO [2], and LIFT-SLAM [11]. Results for ORB-SLAM3, DSO, DSM, and SVO are taken from [56], and those for LIFT-SLAM from [11]. We use the absolute translation error as the evaluation metric.
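The alignment step can be sketched as follows: a least-squares similarity transform in the sense of Umeyama [54] is estimated between the estimated and ground-truth positions, and the absolute translation error is then computed as an RMSE over the aligned positions. The function names and the synthetic trajectory are illustrative assumptions, not the evaluation code used for Table 2.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Least-squares (s, R, t) with dst ≈ s * R @ src + t for (N, 3) point sets."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # guard against a reflection
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

def ate_rmse(est, gt):
    """Absolute translation error (RMSE) after similarity alignment of est to gt."""
    s, R, t = umeyama_alignment(est, gt)
    aligned = s * est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))

# Synthetic check: the estimate is a scaled, rotated, shifted copy of the ground truth.
rng = np.random.default_rng(1)
gt_xyz = rng.normal(size=(50, 3))
c, s_ = np.cos(0.3), np.sin(0.3)
R0 = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
est_xyz = 0.5 * gt_xyz @ R0.T + np.array([1.0, -2.0, 0.5])
scale, _, _ = umeyama_alignment(est_xyz, gt_xyz)
error = ate_rmse(est_xyz, gt_xyz)
```

In the synthetic check the recovered scale is the inverse of the factor applied to the estimate, and the residual error is at numerical precision, since the two trajectories differ only by a similarity transform.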

Table 2. Comparison of absolute translation errors (in meters) between the proposed DK-SLAM and other baselines on the EuRoC dataset. Scaling with the ground truth is necessary for evaluation due to the absence of absolute scale in monocular visual SLAM methods, as noted in the table. “-” indicates performance failure.

Figure 6. The generated trajectories of our proposed DK-SLAM on Sequences MH01, MH02, MH03, MH04, and MH05 of the EuRoC dataset, compared with DSO and ORB-SLAM3.

The evaluation results reveal that the learned feature-based SLAM, i.e., LIFT-SLAM, often fails in drone scenarios, as it relies on LIFT as a binary feature extractor and uses the Hamming distance between keypoints in consecutive frames for matching. In addition, LIFT-SLAM does not consider the structural relationships between keypoints, leading to a significant number of mismatches and subsequent feature tracking failures. ORB-SLAM3 [4] employs an offline-trained Bag-of-Words (DBoW3) for loop closure. Due to the sensitivity of its handcrafted descriptors to lighting variations, ORB-SLAM3 suffers from lower loop detection accuracy in complex lighting conditions, which diminishes the overall performance of the system.

In contrast, our DK-SLAM demonstrates outstanding localization performance. It uses a two-stage tracking strategy that precisely locates feature points and minimizes incorrect matches. The learned features in DK-SLAM are robust, performing well even in low-light conditions and supporting stable feature tracking. Moreover, the online learning-based deep Bag-of-Words model excels in loop detection. This is particularly evident in the MH02 sequence, where DK-SLAM surpasses ORB-SLAM3. This superior performance is due to the use of learned features for constructing the Bag-of-Words model, which provides detailed scene differentiation, and the online training capability, which allows for rapid adaptation to changing environments.

4.4. Ablation Study

Table 3 summarizes the results of the ablation study into the key modules within the DK-SLAM system, assessing the impact of our MAML-based deep keypoint meta-learning and coarse-to-fine keypoint tracking. For a fair comparison, “Ours1”, “Ours2”, and “Ours3” all use a keypoint search radius of 7.

Table 3. Ablation study into the key modules in our DK-SLAM system.

(1) The analysis of keypoint meta-learning module: The module’s effectiveness is evident when comparing “Ours2” and “Ours3.” “Ours2,” which employs the original SuperPoint strategy without our keypoint meta-learning approach, performs slightly worse than “Ours3,” which integrates MAML-based keypoint meta-learning. In “Ours3,” MAML is applied to train the local feature extractor by iteratively enhancing generalization on support and query sets, resulting in a robust and adaptable detector. The batch size of “Ours2” and “Ours3” is set to 8. Both models were trained for 150K iterations. In “Ours3”, we split every training batch into 4 mini-batch support sets and 4 mini-batch query sets. As mentioned in Algorithm 1, the number of update iterations in the inner loop is 4.

Table 4 presents the average number of matching points for ORB-SLAM3, Ours2, and Ours3 on Sequences 00, 05, 09, and 10 of the KITTI dataset. With the total number of keypoints set to 3000 and the number of pyramid levels set to 3, MAML-based training significantly improves the feature detection performance of SuperPoint, yielding more robust features and enhancing SLAM accuracy.

Table 4. The average number of matching points for ORB-SLAM3 and DK-SLAM without (Ours2) or with feature meta-learning (Ours3).

The enhanced SLAM performance attributed to the proposed keypoint meta-learning can also be observed in feature matching results, as depicted in Figure 7. Using SuperPoint for matching increases the number of matches, addressing ORB-SLAM3’s limitations in texture-less areas. ORB-SLAM3’s descriptors, which describe local photometric information around corners, are susceptible to environmental variations. In contrast, learning-based local feature methods such as the original SuperPoint capture deep features that are robust to lighting changes, ensuring stable tracking across consecutive frames. Furthermore, compared with the original SuperPoint, the MAML-trained keypoint extractor is better at capturing features that match across frames.

Figure 7. Samples of keypoint detection and matching. From top to bottom: ORB-SLAM3 matching (ORB3) and DK-SLAM without (Ours2) and with keypoint meta-learning (Ours3).

(2) The analysis of coarse-to-fine tracking module: “Ours1” lacks the coarse-to-fine tracking strategy and relies on a constant velocity motion model for keypoint matching, which leads to inaccurate correspondences and tracking failure. Figure 8 shows a case in which “Ours1” fails to match features accurately because the constant velocity assumption does not hold. In contrast, “Ours3” uses a two-stage strategy for stable tracking, combining a semi-direct method for coarse pose estimation with refined feature matching.

Figure 8. Samples of keypoint detection and matching. Top: matching without two-stage strategy (Ours1). Bottom: matching with two-stage strategy (Ours3).

(3) The analysis of loop-closing module: We further perform ablation studies on the proposed loop closure method to assess its performance. Precision–recall metrics on KITTI 00, 05, and 06 sequences are illustrated in Figure 9, with additional evaluation at 100% precision. The quantitative results, presented in Table 5, reveal our deep BoW’s superior recall rate compared with traditional BoWs. Notably, in the KITTI 06 sequence, our BoW achieves a recall rate that is remarkably close to 100%. In contrast, the recall rate of the iBoW, belonging to the traditional BoW category, is significantly lower, suggesting that handcrafted descriptors might struggle to accurately identify loop nodes in a sequence. The learned local descriptor captures high-level information, enhancing robustness across diverse environments. This feature stability, unaffected by lighting changes, results in superior loop detection performance.

Figure 9. Precision–recall curves depicting the performance of the Bag-of-Words (BoW) approach in the proposed DK-SLAM on Sequences 00, 05, and 06 of the KITTI dataset.

Table 5. Performance comparison of loop closing with the maximum recalls across various methods, all achieving 100% precision.
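The "maximum recall at 100% precision" statistic can be computed from per-candidate similarity scores as in the sketch below. It assumes each loop candidate carries a score and a ground-truth label and that every true loop appears among the candidates; the function name and the example scores are hypothetical.

```python
import numpy as np

def max_recall_at_full_precision(scores, is_true_loop):
    """Largest recall reachable by a score threshold that admits no false positives."""
    scores = np.asarray(scores, dtype=float)
    is_true_loop = np.asarray(is_true_loop, dtype=bool)
    false_scores = scores[~is_true_loop]
    # Any threshold strictly above the best-scoring false candidate keeps precision at 100 %.
    threshold = false_scores.max() if false_scores.size else -np.inf
    accepted = scores > threshold
    return float(np.sum(accepted & is_true_loop) / np.sum(is_true_loop))

# Hypothetical candidates: the third one is a false loop with score 0.7.
recall = max_recall_at_full_precision(
    scores=[0.9, 0.8, 0.7, 0.6, 0.4],
    is_true_loop=[1, 1, 0, 1, 1],
)
```

In this example only the two true loops scoring above 0.7 can be accepted without admitting the false candidate, so the maximum recall at full precision is 2 out of 4.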

(4) The analysis of SLAM efficiency: We conduct an ablation study on the computational efficiency of DK-SLAM using KITTI data, measuring the median and mean feature tracking times as each module is added. In Table 6, MAML denotes the MAML-based local feature, Two-Stage denotes the coarse-to-fine feature tracking, and Online BoW denotes the online learning for binary BoW. The first row of results corresponds to the official ORB-SLAM2 pipeline without any proposed modules. The results indicate that adding MAML-based features significantly increases the feature tracking time. This is because the learned features are inferred on the GPU, while image processing, feature tracking, back-end optimization, and loop closure are all processed on the CPU. Transferring the inferred features from the GPU to subsequent CPU threads involves data transmission, leading to increased time cost. The coarse-to-fine feature tracking also increases the time consumption, but not significantly. This is because we employ local photometric constraints rather than global constraints, limiting the rise in computational cost. The online BoW adds further time as well. However, by binarizing the floating-point descriptors into compressed binary vectors, we reduce the computational cost, mitigating the impact of the online BoW on processing time.

Table 6. Comparison of time consumptions of different modules.

Additionally, we conduct a comparative analysis of feature tracking time consumption against other typical monocular VSLAMs. As illustrated in Figure 10, handcrafted feature-based ORB-SLAM3 and learned feature-based SuperPoint-SLAM [60] were selected for comparison. Notably, the proposed DK-SLAM demonstrates significantly higher efficiency than SuperPoint-SLAM. This performance advantage stems from fundamental architectural differences: SuperPoint-SLAM is developed using libtorch [61], which transfers images to the GPU individually for processing. In contrast, our method leverages TensorRT’s C++ implementation. TensorRT provides hardware acceleration capabilities that pre-cache images in memory and enable parallel processing, thereby achieving efficiency gains over the libtorch-based approach. Nevertheless, DK-SLAM still exhibits an efficiency gap compared with ORB-SLAM3. This discrepancy is primarily attributable to the data transfer between GPU and CPU. Additionally, the proposed DK-SLAM requires about 1 GB GPU memory to operate.

Figure 10. Comparison of tracking time of ORB-SLAM3, DK-SLAM, and SuperPoint-SLAM.

5. Discussion

DK-SLAM is designed to enhance monocular visual SLAM performance in real-world scenarios by leveraging learned local features. Real-world scenarios suffer from unpredictable illumination variations. Handcrafted features, reliant on low-level image information, are susceptible to complex lighting, leading to performance degradation. Existing works [10,11,28] integrate learned features into VSLAM directly without any refinement. However, these learned features are primarily developed for static tasks (e.g., image matching, structure from motion), whereas VSLAM operates in dynamic scenes with illumination changes and variable motion. To address the above challenges, we improve VSLAM performance along the following three dimensions:

First, we modify the training of learned features. Most learned features [10,14,21,22] are trained on static RGB datasets (e.g., COCO), whose domain differs from real-world SLAM scenarios. Deep models often degrade when tested on data from a different domain [62]. To address this, we employ Model-Agnostic Meta-Learning (MAML) to improve extractor generalization. MAML approximates inner and outer loop gradients as a second-order gradient, enabling faster convergence to a global minimum. The global minimum of error implies the model’s optimal generalization performance. We conduct ablation studies to verify MAML’s impact on local feature performance. The ablation result in Table 3 demonstrates that the proposed MAML-based training achieves a 12.76% reduction in translation error and a 10.71% reduction in rotation error for VSLAM systems. Table 4 demonstrates that MAML-based training also increases the number of feature matches.

Second, we present a novel feature tracking method with local photometric constraints. Monocular VSLAMs [4,5] use a constant velocity model to initialize the feature matching range and the initial pose. However, real-world motions rarely conform to a constant velocity model. The flawed motion model induces significant errors in the feature matching range across adjacent frames. Incorrect feature matching can lead to catastrophic failures in VSLAM systems. Meanwhile, traditional VSLAM uses low-level handcrafted cues (e.g., keypoint orientation/scale) to filter matches. Learned features lack such mechanisms, relying solely on descriptor Euclidean distance [14]. To resolve this, we propose a local photometric tracking method inspired by a semi-direct approach [2]. By incorporating pixel-intensity constraints around keypoints, our method generates a more accurate feature matching range than the constant velocity model. In Table 3, the ablation results indicate that the absence of local photometric constraints makes the proposed SLAM prone to localization failures. This may be attributed to sharp turns in the scene, which cause the constant velocity motion model to yield inaccurate feature correspondences. Such behavior is also visualized in Figure 8.
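To make the failure mode concrete, the sketch below implements a generic constant velocity prediction with 4×4 world-to-camera poses and measures how far the prediction is from a pose that includes an unexpected turn. The pose convention, helper names, and the 15° yaw are assumptions chosen for illustration, not details taken from the DK-SLAM implementation.

```python
import numpy as np

def predict_constant_velocity(T_prev2, T_prev1):
    """Predict the next world-to-camera pose by repeating the last inter-frame motion."""
    velocity = T_prev1 @ np.linalg.inv(T_prev2)  # motion from frame k-2 to frame k-1
    return velocity @ T_prev1

def rotation_gap_deg(T_a, T_b):
    """Angle in degrees between the rotation parts of two poses."""
    R = T_a[:3, :3].T @ T_b[:3, :3]
    return float(np.degrees(np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))))

def forward_pose(z):
    """World-to-camera pose of a camera at depth z with identity rotation."""
    T = np.eye(4)
    T[2, 3] = -z
    return T

# Straight motion: the prediction for frame k matches a camera at z = 2.
T_pred = predict_constant_velocity(forward_pose(0.0), forward_pose(1.0))

# Sharp turn: the actual pose additionally yaws by 15 degrees (hypothetical value).
yaw = np.radians(15.0)
T_actual = forward_pose(2.0)
T_actual[:3, :3] = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                             [0.0, 1.0, 0.0],
                             [-np.sin(yaw), 0.0, np.cos(yaw)]])
gap = rotation_gap_deg(T_pred, T_actual)
```

The prediction is exact while the motion stays uniform, but the whole turn angle appears as error in the initial pose, which shifts the projected search windows away from the true keypoint locations.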

Third, we design an online BoW for learned features. Existing offline BoW models are trained on preprocessed datasets. However, learned local features suffer from generalization gaps, and offline BoWs suffer from distribution shifts between training and test sets. To address this, we adopt an online BoW approach that continuously trains the vocabulary using real-time data, thereby mitigating the negative impact caused by data distribution discrepancies. Additionally, learned descriptors are floating-point vectors, which incur high computational costs during BoW construction. We therefore convert 256-dimensional floating-point vectors into binary vectors. This reduces the computational cost of BoW building.
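A minimal sketch of descriptor binarization is shown below. The thresholding rule (sign of each component) is an assumption made for illustration, since the exact binarization used by DK-SLAM is not restated in this section; the function names are likewise hypothetical.

```python
import numpy as np

def binarize(descriptors):
    """Pack float descriptors of shape (N, 256) into bit strings of shape (N, 32)."""
    # Assumed rule: a bit is 1 where the descriptor component is positive.
    return np.packbits(descriptors > 0, axis=1)

def hamming_distance(a, b):
    """Number of differing bits between two packed binary descriptors."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

rng = np.random.default_rng(2)
desc = rng.normal(size=(1, 256))
noisy = desc.copy()
noisy[0, :10] *= -1.0  # flip the sign of the first 10 components
bits = binarize(np.vstack([desc, noisy]))
distance = hamming_distance(bits[0], bits[1])
```

Each 256-dimensional descriptor shrinks to 32 bytes, and comparisons reduce to XOR and bit counting, which is the kind of saving that makes vocabulary construction cheaper than with floating-point distances.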

However, the proposed DK-SLAM still exhibits limitations. As shown in the runtime comparison in Figure 10, DK-SLAM operates less efficiently than ORB-SLAM3. This inefficiency stems from the learned features running on the GPU while data processing remains CPU-bound. Frequent data transfer between CPU and GPU degrades SLAM performance. Future work will move the entire SLAM pipeline to the GPU. In addition, DK-SLAM inherits the scale ambiguity of monocular systems, a limitation that arises from the underlying projective geometry. Future work will integrate IMU or stereo sensors to recover an absolute scale.

6. Conclusions

This work presents DK-SLAM, a monocular visual SLAM with deep keypoint meta-learning, a coarse-to-fine matching strategy, and an online binary learned feature BoW. To enhance generalization, we employ MAML for local feature training, yielding a feature extractor with robust capabilities in unseen scenarios. Coarse relative poses between frames are estimated via a semi-direct method for accurate feature point matching. An online binary learned feature BoW corrects accumulated SLAM errors. Our proposed DK-SLAM outperforms representative baselines, e.g., ORB-SLAM3, on the KITTI and EuRoC datasets. However, the GPU-based online front-end faces efficiency challenges during information transmission to the CPU-based back-end. Future work will explore knowledge distillation for compressing neural network parameters and further SLAM framework modifications to enhance efficiency.

Author Contributions

Conceptualization, H.Q. and C.C.; methodology, H.Q. and C.C.; software, H.Q. and C.C.; validation, H.Q. and C.C.; writing—original draft preparation, H.Q. and C.C.; writing—review and editing, H.Q., J.M., X.H. (Xiaofeng He), X.H. (Xiaoping Hu), C.C., and L.Z.; supervision, L.Z., Y.S. and J.T.; funding acquisition, L.Z. and C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC) (Nos. 62103427, 62073331, and 62103430), the Natural Science Foundation of Hunan Province, China (No. 2023JJ20051), a Major Project of the Natural Science Foundation of Hunan Province (No. 2021JC0004), and the Science and Technology Innovation Program of Hunan Province, China (No. 2023RC3011). Changhao Chen is funded by the Young Elite Scientist Sponsorship Program by CAST (No. YESS20220181).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in this study are openly available at https://www.cvlibs.net/datasets/kitti/eval_odometry.php and https://projects.asl.ethz.ch/datasets/.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Engel, J.J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 40, 611–625. [Google Scholar] [CrossRef]
Forster, C.; Zhang, Z.; Gassner, M.; Werlberger, M.; Scaramuzza, D. SVO: Semidirect visual odometry for monocular and multicamera systems. IEEE Trans. Robot. 2016, 33, 249–265. [Google Scholar] [CrossRef]
Younes, G.; Khalil, D.; Zelek, J.; Asmar, D. H-SLAM: Hybrid direct–indirect visual SLAM. Robot. Auton. Syst. 2024, 179, 104729. [Google Scholar] [CrossRef]
Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
Tu, Z.; Chen, C.; Pan, X.; Liu, R.; Cui, J.; Mao, J. Ema-vio: Deep visual–inertial odometry with external memory attention. IEEE Sens. J. 2022, 22, 20877–20885. [Google Scholar] [CrossRef]
Wang, Z.; Shen, M.; Chen, Q. Eliminating scale ambiguity of unsupervised monocular visual odometry. Neural Process. Lett. 2023, 55, 9743–9764. [Google Scholar] [CrossRef]
Song, R.; Liu, J.; Xiao, Z.; Yan, B. GraphAVO: Self-Supervised Visual Odometry Based on Graph-Assisted Geometric Consistency. IEEE Trans. Intell. Transp. Syst. 2024, 25, 20673–20682. [Google Scholar] [CrossRef]
Bian, J.W.; Zhan, H.; Wang, N.; Li, Z.; Zhang, L.; Shen, C.; Cheng, M.M.; Reid, I. Unsupervised scale-consistent depth learning from video. Int. J. Comput. Vis. 2021, 129, 2548–2564. [Google Scholar] [CrossRef]
Tang, J.; Ericson, L.; Folkesson, J.; Jensfelt, P. GCNv2: Efficient Correspondence Prediction for Real-Time SLAM. IEEE Robot. Autom. Lett. 2019, 4, 3505–3512. [Google Scholar] [CrossRef]
Bruno, H.M.S.; Colombini, E. LIFT-SLAM: A deep-learning feature-based monocular visual SLAM method. Neurocomputing 2020, 455, 97–110. [Google Scholar] [CrossRef]
Wu, X.; Liu, Z.; Tian, Y.; Liu, Z.; Chen, W. KN-SLAM: Keypoints and neural implicit encoding SLAM. IEEE Trans. Instrum. Meas. 2024, 73, 2512712. [Google Scholar] [CrossRef]
Liu, L.; Aitken, J.M. Hfnet-slam: An accurate and real-time monocular slam system with deep features. Sensors 2023, 23, 2113. [Google Scholar] [CrossRef]
DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-Supervised Interest Point Detection and Description. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 337–349. [Google Scholar]
Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M.; Siegwart, R.Y. The EuRoC micro aerial vehicle datasets. Int. J. Robot. Res. 2016, 35, 1157–1163. [Google Scholar] [CrossRef]
Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar] [CrossRef]
Lindeberg, T. Scale Invariant Feature Transform. Scholarpedia 2012, 7, 10491. [Google Scholar] [CrossRef]
Shi, J.; Tomasi, C. Good features to track. In Proceedings of the 1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 21–23 June 1994; pp. 593–600. [Google Scholar]
Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-Net: A Trainable CNN for Joint Description and Detection of Local Features. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8084–8093. [Google Scholar]
Barroso-Laguna, A.; Mikolajczyk, K. Key.Net: Keypoint Detection by Handcrafted and Learned CNN Filters Revisited. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 698–711. [Google Scholar] [CrossRef]
Balntas, V.; Lenc, K.; Vedaldi, A.; Mikolajczyk, K. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5173–5182. [Google Scholar]
Petrakis, G.; Partsinevelos, P. Keypoint detection and description through deep learning in unstructured environments. Robotics 2023, 12, 137. [Google Scholar] [CrossRef]
Favorskaya, M.N. Deep learning for visual SLAM: The state-of-the-art and future trends. Electronics 2023, 12, 2006. [Google Scholar] [CrossRef]
Kazerouni, I.A.; Fitzgerald, L.; Dooly, G.; Toal, D. A survey of state-of-the-art on visual SLAM. Expert Syst. Appl. 2022, 205, 117734. [Google Scholar] [CrossRef]
Tang, J.; Folkesson, J.; Jensfelt, P. Geometric Correspondence Network for Camera Motion Estimation. IEEE Robot. Autom. Lett. 2018, 3, 1010–1017. [Google Scholar] [CrossRef]
Li, D.; Shi, X.; Long, Q.; Liu, S.; Yang, W.; Wang, F.; Wei, Q.; Qiao, F. DXSLAM: A Robust and Efficient Visual SLAM System with Deep Features. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 4958–4965. [Google Scholar]
Sarlin, P.E.; Cadena, C.; Siegwart, R.; Dymczyk, M. From Coarse to Fine: Robust Hierarchical Localization at Large Scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
Lu, Y.; Lu, G. SuperThermal: Matching Thermal as Visible Through Thermal Feature Exploration. IEEE Robot. Autom. Lett. 2021, 6, 2690–2697. [Google Scholar] [CrossRef]
Yi, K.M.; Trulls, E.; Lepetit, V.; Fua, P.V. LIFT: Learned Invariant Feature Transform. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
Wang, Z.; Cheng, Q.; Mu, X. Ru-slam: A robust deep-learning visual simultaneous localization and mapping (slam) system for weakly textured underwater environments. Sensors 2024, 24, 1937. [Google Scholar] [CrossRef]
Zhang, R.; Wang, Y.; Li, Z.; Ding, F.; Wei, C.; Wu, M. Online Adaptive Keypoint Extraction for Visual Odometry Across Different Scenes. IEEE Robot. Autom. Lett. 2025, 10, 7539–7546. [Google Scholar] [CrossRef]
Wen, S.; Li, X.; Liu, X.; Li, J.; Tao, S.; Long, Y.; Qiu, T. Dynamic SLAM: A Visual SLAM in Outdoor Dynamic Scenes. IEEE Trans. Instrum. Meas. 2023, 72, 5028911. [Google Scholar] [CrossRef]
Dong, Y.; Wang, S.; Yue, J.; Chen, C.; He, S.; Wang, H.; He, B. A Novel Texture-Less Object Oriented Visual SLAM System. IEEE Trans. Intell. Transp. Syst. 2021, 22, 36–49. [Google Scholar] [CrossRef]
Lee, J.; Back, M.; Hwang, S.S.; Chun, I.Y. Improved Real-Time Monocular SLAM Using Semantic Segmentation on Selective Frames. IEEE Trans. Intell. Transp. Syst. 2023, 24, 2800–2813. [Google Scholar] [CrossRef]
Liu, F.; Huang, M.; Ge, H.; Tao, D.; Gao, R. Unsupervised Monocular Depth Estimation for Monocular Visual SLAM Systems. IEEE Trans. Instrum. Meas. 2024, 73, 2502613. [Google Scholar] [CrossRef]
Tsintotas, K.A.; Bampis, L.; Gasteratos, A. The Revisiting Problem in Simultaneous Localization and Mapping: A Survey on Visual Loop Closure Detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19929–19953. [Google Scholar] [CrossRef]
Galvez-López, D.; Tardos, J.D. Bags of Binary Words for Fast Place Recognition in Image Sequences. IEEE Trans. Robot. 2012, 28, 1188–1197. [Google Scholar] [CrossRef]
Garcia-Fidalgo, E.; Ortiz, A. iBoW-LCD: An Appearance-Based Loop-Closure Detection Approach Using Incremental Bags of Binary Words. IEEE Robot. Autom. Lett. 2018, 3, 3051–3057. [Google Scholar] [CrossRef]
Arandjelović, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1437–1451. [Google Scholar] [CrossRef]
Wan, K.; Luo, J. SPVL-vSLAM: Visual SLAM for Autonomous Driving Vehicles Based on Semantic Patch-NetVLAD Loop Closure Detection in Semi-Static Scenes. IEEE Trans. Intell. Transp. Syst. 2025, 26, 8975–8991. [Google Scholar] [CrossRef]
Memon, A.R.; Liu, Z.; Wang, H. Viewpoint-Invariant Loop Closure Detection Using Step-Wise Learning with Controlling Embeddings of Landmarks. IEEE Trans. Intell. Transp. Syst. 2022, 23, 20148–20159. [Google Scholar] [CrossRef]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
Cattaneo, D.; Vaghi, M.; Valada, A. LCDNet: Deep loop closure detection and point cloud registration for LiDAR SLAM. IEEE Trans. Robot. 2022, 38, 2074–2093.
Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
Jau, Y.Y.; Zhu, R.; Su, H.; Chandraker, M. Deep Keypoint-Based Camera Pose Estimation with Geometric Constraints. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 4950–4957.
Kümmerle, R.; Grisetti, G.; Strasdat, H.; Konolige, K.; Burgard, W. G2o: A general framework for graph optimization. In Proceedings of the IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 3607–3613.
Wang, Y.; Xu, B.; Fan, W.; Xiang, C. A Robust and Efficient Loop Closure Detection Approach for Hybrid Ground/Aerial Vehicles. Drones 2023, 7, 135.
Bian, J.; Lin, W.Y.; Matsushita, Y.; Yeung, S.K.; Nguyen, T.D.; Cheng, M.M. GMS: Grid-Based Motion Statistics for Fast, Ultra-Robust Feature Correspondence. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–25 July 2017; pp. 2828–2837.
Lin, T.Y.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014.
Gao, X.; Wang, R.; Demmel, N.; Cremers, D. LDSO: Direct Sparse Odometry with Loop Closure. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 2198–2204.
Geiger, A.; Ziegler, J.; Stiller, C. StereoScan: Dense 3D Reconstruction in Real-time. In Proceedings of the IEEE Intelligent Vehicles Symposium, Baden-Baden, Germany, 5–9 June 2011.
Umeyama, S. Least-Squares Estimation of Transformation Parameters Between Two Point Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 376–380.
Zubizarreta, J.; Aguinaga, I.; Montiel, J.M.M. Direct Sparse Mapping. IEEE Trans. Robot. 2020, 36, 1363–1370.
Teed, Z.; Deng, J. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. In Proceedings of the Neural Information Processing Systems, Virtual, 6–14 December 2021.
Cummins, M.; Newman, P. FAB-MAP: Probabilistic localization and mapping in the space of appearance. Int. J. Robot. Res. 2008, 27, 647–665.
Garcia-Fidalgo, E.; Ortiz, A. Hierarchical place recognition for topological mapping. IEEE Trans. Robot. 2017, 33, 1061–1074.
Milford, M.J.; Wyeth, G.F. SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation, St. Paul, MN, USA, 14–18 May 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 1643–1649.
Deng, C.; Qiu, K.; Xiong, R.; Zhou, C. Comparative Study of Deep Learning Based Features in SLAM. In Proceedings of the 2019 4th Asia-Pacific Conference on Intelligent Robot Systems (ACIRS), Nagoya, Japan, 13–15 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 250–254.
Ketkar, N.; Moolayil, J. Introduction to PyTorch. In Deep Learning with Python: Learn Best Practices of Deep Learning Models with PyTorch; Apress: New York, NY, USA, 2021; pp. 27–91.
Ko, K.; Kim, S.; Kwon, H. Selective Audio Perturbations for Targeting Specific Phrases in Speech Recognition Systems. Int. J. Comput. Intell. Syst. 2025, 18, 103.

Figure 1. An overview of our proposed DK-SLAM framework with deep keypoint meta-learning, two-stage coarse-to-fine keypoint tracking, and online learning-based binary BoW for loop closing.

Figure 2. Diagram of our proposed coarse-to-fine two-stage keypoint tracking strategy. This process begins with relative pose estimation through patch photometric loss optimization, followed by refinement using the 3D–2D keypoint relationship for enhanced accuracy.

Figure 3. An illustration of our proposed Online Learning-based Binary BoW. The BoW is constructed incrementally, with matched descriptors in the keyframes database stored within the same leaf node. In the presence of unmatched descriptors in the current keyframe, a new leaf node is created.
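The incremental construction described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary is kept flat rather than organized as a tree, binary descriptors are modelled as Python integers, and the class name `OnlineBinaryVocabulary` and the threshold `max_hamming` are hypothetical choices made only for illustration.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


class OnlineBinaryVocabulary:
    """Flat, incrementally built vocabulary (a simplification of a BoW tree)."""

    def __init__(self, max_hamming: int = 2):
        self.max_hamming = max_hamming  # assumed matching threshold
        self.leaves = []  # each leaf: representative descriptor + keyframe ids

    def add_keyframe(self, keyframe_id: int, descriptors: List[int]) -> List[int]:
        """Insert a keyframe and return the leaf index used for each descriptor."""
        word_ids = []
        for desc in descriptors:
            best_id, best_dist = None, self.max_hamming + 1
            for idx, leaf in enumerate(self.leaves):
                dist = hamming(desc, leaf["rep"])
                if dist < best_dist:
                    best_id, best_dist = idx, dist
            if best_id is None:
                # Unmatched descriptor: create a new leaf node.
                self.leaves.append({"rep": desc, "keyframes": set()})
                best_id = len(self.leaves) - 1
            # Matched (or newly created) leaf records the observing keyframe.
            self.leaves[best_id]["keyframes"].add(keyframe_id)
            word_ids.append(best_id)
        return word_ids


vocab = OnlineBinaryVocabulary(max_hamming=1)
words_kf0 = vocab.add_keyframe(0, [0b1111, 0b0000])
words_kf1 = vocab.add_keyframe(1, [0b1110, 0b1010])
```

In this toy run, the descriptor `0b1110` of keyframe 1 falls into the leaf created by `0b1111`, while `0b1010` matches nothing within the threshold and opens a new leaf. Keyframes that share many leaves would then be natural loop candidates.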

Figure 4. The generated trajectories of our proposed DK-SLAM on Sequences 00, 02, 05, 07, 09, and 10 of the KITTI dataset, compared with LDSO and ORB-SLAM3.

Figure 5. Mapping results generated by our proposed DK-SLAM system.

Figure 6. The generated trajectories of our proposed DK-SLAM on Sequences MH01, MH02, MH03, MH04, and MH05 of the EuRoC dataset, compared with DSO and ORB-SLAM3.

Figure 7. Samples of keypoint detection and matching. From top to bottom: ORB-SLAM3 (ORB3), DK-SLAM without keypoint meta-learning (Ours2), and DK-SLAM with keypoint meta-learning (Ours3).

Figure 8. Samples of keypoint detection and matching. Top: matching without the two-stage strategy (Ours1). Bottom: matching with the two-stage strategy (Ours3).

Figure 9. Precision–recall curves depicting the performance of the Bag-of-Words (BoW) approach in the proposed DK-SLAM on Sequences 00, 05, and 06 of the KITTI dataset.

Figure 10. Comparison of tracking time of ORB-SLAM3, DK-SLAM, and SuperPoint-SLAM.

Table 1. The pose evaluation on the KITTI dataset: our DK-SLAM system outperforms both representative traditional and learning-based monocular SLAM methods. Sequence 01 is excluded from the evaluation due to its inherent challenges for all monocular SLAM baselines, which prevent the generation of reasonable comparative results.

Method	Sensor	Metric	00	02	03	04	05	06	07	08	09	10	Avg
VISO-M	Mono	$t_{rel}$	36.95	21.98	16.14	2.61	17.20	7.91	20.00	39.78	29.01	28.52	22.01
VISO-M	Mono	$r_{rel}$	2.42	1.22	2.67	1.53	3.52	1.83	5.30	1.99	1.32	3.23	2.50
LDSO	Mono	$t_{rel}$	2.84	4.91	2.75	1.58	2.07	6.49	12.76	30.17	23.14	9.20	9.59
LDSO	Mono	$r_{rel}$	0.36	0.74	0.16	0.19	0.24	0.16	5.13	0.24	0.21	0.21	0.76
ORB-SLAM3	Mono	$t_{rel}$	2.80	6.04	2.71	2.22	3.41	7.07	2.14	10.35	3.13	10.78	5.07
ORB-SLAM3	Mono	$r_{rel}$	0.30	0.38	0.17	0.24	0.39	0.18	0.42	0.31	0.54	0.36	0.33
LIFT-SLAM	Mono	$t_{rel}$	3.18	8.73	1.46	2.22	6.09	12.24	2.42	47.10	19.91	9.72	11.31
LIFT-SLAM	Mono	$r_{rel}$	2.99	2.49	0.34	0.48	3.11	2.91	4.02	2.02	2.14	2.24	2.27
DK-SLAM (Ours)	Mono	$t_{rel}$	2.57	4.38	4.70	1.12	1.98	8.20	1.11	7.61	2.97	7.03	4.17
DK-SLAM (Ours)	Mono	$r_{rel}$	0.31	0.27	0.17	0.24	0.26	0.17	0.28	0.28	0.29	0.23	0.25
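As a quick arithmetic check, the relative gains over ORB-SLAM3 quoted in the abstract can be recomputed from the Avg column of Table 1. The helper name `relative_improvement` is introduced here only for illustration, and the inputs are the rounded averages printed in the table, so the last digit may differ slightly from the quoted figures.

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Percentage by which `ours` reduces the error of `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Avg column of Table 1 (rounded values as printed in the table).
orb_slam3 = {"t_rel": 5.07, "r_rel": 0.33}
dk_slam = {"t_rel": 4.17, "r_rel": 0.25}

gains = {m: relative_improvement(orb_slam3[m], dk_slam[m]) for m in orb_slam3}
for metric, gain in gains.items():
    print(f"{metric}: {gain:.2f}% lower error than ORB-SLAM3")
```

The results, roughly 17.75% for translation and 24.24% for rotation, agree with the 17.7% and 24.2% stated in the abstract up to rounding.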

Table 2. Comparison of absolute translation errors (in meters) between the proposed DK-SLAM and other baselines on the EuRoC dataset. Because monocular visual SLAM cannot recover absolute scale, the estimated trajectories are scaled to the ground truth before evaluation. “-” indicates that the method failed on the sequence.

Method	MH01	MH02	MH03	MH04	MH05	Avg
DSO	0.046	0.046	0.172	3.810	0.110	0.8368
SVO	0.100	0.120	0.410	0.430	0.300	0.2720
DSM	0.039	0.036	0.055	0.057	0.067	0.0508
ORB-SLAM3	0.016	0.027	0.028	0.138	0.072	0.0562
LIFT-SLAM	0.044	0.053	0.049	-	-	-
DK-SLAM	0.013	0.013	0.027	0.077	0.055	0.0370
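The scaling to ground truth mentioned in the Table 2 caption is commonly carried out with a least-squares similarity alignment in the style of Umeyama, which appears in the reference list. The sketch below assumes that protocol; it is not confirmed to be the exact evaluation script used for the table. It requires NumPy, and the function names `umeyama_alignment` and `ate_rmse` are hypothetical.

```python
import numpy as np


def umeyama_alignment(est, gt):
    """Least-squares similarity transform (s, R, t) with gt ≈ s * R @ est + t.

    est, gt: arrays of shape (N, 3) holding time-matched positions.
    """
    n = est.shape[0]
    mu_est, mu_gt = est.mean(axis=0), gt.mean(axis=0)
    est_c, gt_c = est - mu_est, gt - mu_gt
    cov = gt_c.T @ est_c / n                      # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # avoid returning a reflection
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_est = (est_c ** 2).sum() / n
    s = (D * np.diag(S)).sum() / var_est
    t = mu_gt - s * R @ mu_est
    return s, R, t


def ate_rmse(est, gt):
    """Root-mean-square translation error after similarity alignment."""
    s, R, t = umeyama_alignment(est, gt)
    aligned = s * (R @ est.T).T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))


# Synthetic check: a "monocular" estimate differing from ground truth
# only by an unknown scale, rotation, and offset.
rng = np.random.default_rng(0)
est = rng.normal(size=(50, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
gt = 2.5 * (R_true @ est.T).T + np.array([1.0, -2.0, 0.5])

scale, _, _ = umeyama_alignment(est, gt)
error = ate_rmse(est, gt)
```

On this noise-free example the recovered scale matches the simulated factor of 2.5 and the residual error is numerically zero; on real trajectories the residual is the quantity reported per sequence.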

Table 3. Ablation study of the key modules in our DK-SLAM system.

Method	MAML	Two-Stage	Metric	00	02	03	04	05	06	07	08	09	10	Avg
Ours1	✔		$t_{rel}$	2.88	-	4.29	1.24	2.80	7.64	1.76	-	2.70	-	-
Ours1	✔		$r_{rel}$	0.41	-	0.16	0.31	0.35	0.18	0.64	-	0.24	-	-
Ours2		✔	$t_{rel}$	2.92	4.42	4.54	2.01	2.48	7.97	2.11	8.85	4.03	8.48	4.78
Ours2		✔	$r_{rel}$	0.38	0.34	0.21	0.16	0.34	0.18	0.46	0.26	0.20	0.23	0.28
Ours3	✔	✔	$t_{rel}$	2.57	4.38	4.70	1.12	1.98	8.20	1.11	7.61	2.97	7.03	4.17
Ours3	✔	✔	$r_{rel}$	0.31	0.27	0.17	0.24	0.26	0.17	0.28	0.28	0.29	0.23	0.25

Table 4. The average number of matching points for ORB-SLAM3 and DK-SLAM without (Ours2) or with feature meta-learning (Ours3).

Seq	ORB-SLAM3	Ours2	Ours3
K00	228.3	352.6	376.7
K05	219.5	371.2	383.1
K09	150.4	258.0	265.3
K10	171.4	306.0	323.1

Table 5. Loop closing performance of various methods, reported as the maximum recall achieved at 100% precision.

Seq	FAB-MAP [57]	Emilio [58]	SeqSLAM [59]	iBoW [40]	Ours
K00	0.49	0.90	0.67	0.77	0.98
K05	0.32	0.76	0.41	0.26	0.87
K06	0.55	0.95	0.65	0.96	1.00
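The metric in Table 5, maximum recall at 100% precision, can be computed from scored loop candidates as sketched below. How each method produces its scores is not specified here, so the inputs are assumed: one similarity score per candidate and a ground-truth flag marking true loops. The function name `max_recall_at_full_precision` is hypothetical.

```python
from typing import Optional, Sequence


def max_recall_at_full_precision(scores: Sequence[float],
                                 is_true_loop: Sequence[bool],
                                 total_true: Optional[int] = None) -> float:
    """Highest recall reachable by a score threshold that admits no false positive."""
    if total_true is None:
        total_true = sum(1 for y in is_true_loop if y)
    if total_true == 0:
        raise ValueError("no ground-truth loops to recall")
    # The threshold must sit above the best-scoring false candidate.
    false_scores = [s for s, y in zip(scores, is_true_loop) if not y]
    cutoff = max(false_scores) if false_scores else float("-inf")
    true_positives = sum(1 for s, y in zip(scores, is_true_loop) if y and s > cutoff)
    return true_positives / total_true


# Toy example: the third candidate is a false loop with score 0.7, so only
# the two true loops scoring above it can be accepted at 100% precision.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [True, True, False, True, True]
recall = max_recall_at_full_precision(scores, labels)
```

Here two of the four true loops are recovered, giving a recall of 0.5. A value of 1 in the table would mean that every true loop scores above every false candidate.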

Table 6. Comparison of the tracking time with different modules enabled.

MAML	Two-Stage	Online BoW	Median Tracking Time (ms)	Mean Tracking Time (ms)
			38	42
✔			70	76
✔	✔		73	78
✔	✔	✔	75	78

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
