Article

KOM-SLAM: A GNN-Based Tightly Coupled SLAM and Multi-Object Tracking Framework

1 Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-0033, Japan
2 Institute of Industrial Science, The University of Tokyo, Tokyo 153-8505, Japan
3 Graduate School of Advanced Science and Engineering, Hiroshima University, Hiroshima 739-8527, Japan
* Author to whom correspondence should be addressed.
Sensors 2026, 26(1), 128; https://doi.org/10.3390/s26010128
Submission received: 19 November 2025 / Revised: 21 December 2025 / Accepted: 23 December 2025 / Published: 24 December 2025
(This article belongs to the Section Intelligent Sensors)

Abstract

Coupled simultaneous localization and mapping (SLAM) and multi-object tracking have attracted growing attention in recent years. Although existing approaches achieve promising results, they mostly associate keypoints and objects across frames separately, which limits their robustness in complex dynamic scenes. To overcome this limitation, we propose KOM-SLAM, a tightly coupled SLAM and multi-object tracking framework based on a Graph Neural Network (GNN), which jointly learns keypoint and object associations across frames while estimating ego-poses in a differentiable manner. The framework constructs a spatiotemporal graph over keypoints and object detections for association, and employs a multilayer perceptron (MLP) followed by a sigmoid activation that adaptively adjusts association thresholds based on ego-motion and spatial context. We apply a soft assignment on keypoints to ensure differentiable pose estimation, enabling the pose loss to supervise the association learning directly. Experiments on the KITTI Tracking dataset demonstrate that our method achieves improved performance in both localization and object tracking.

1. Introduction

Simultaneous localization and mapping (SLAM) and multi-object tracking are two essential tasks for autonomous systems. SLAM estimates the ego-pose (the current position and orientation of the sensor/robot) and builds a map of the surrounding environment [1,2], while multi-object tracking follows the motion of detected objects to support safe decision-making [3,4]. Although these tasks have traditionally been developed independently, they are naturally complementary: SLAM provides accurate ego-pose estimates for tracking, while object tracking helps identify dynamic objects, improving SLAM robustness in dynamic environments.
Tightly coupled SLAM and multi-object tracking systems usually build a factor graph of odometry and object states, and a joint optimization is applied to improve the accuracy and robustness of localization and object tracking using temporal and spatial constraints [5,6]. However, most existing approaches rely on manually designed heuristics for data association, such as descriptor similarity for keypoints and spatial proximity for objects. These heuristic associations can be brittle in cluttered or ambiguous environments, particularly when multiple dynamic objects occlude or overlap. Additionally, treating keypoint and object associations independently prevents full exploitation of the shared information between feature extraction and object detection.
Graph Neural Networks (GNNs) have emerged as powerful tools for learning associations in structured data by leveraging message passing to aggregate information from connected neighbors [7,8]. While GNNs have been applied to tasks like keypoint matching and object tracking independently [9,10], they have not been fully explored in a tightly coupled SLAM and multi-object tracking framework, where keypoint association and object association could be learned jointly. Moreover, most existing GNN-based approaches are not designed to support differentiable pose estimation, limiting their integration into end-to-end learning pipelines.
In response to these challenges, we propose KOM-SLAM, a tightly coupled SLAM and multi-object tracking framework, which jointly matches keypoints and objects across frames with a GNN. Our approach builds a graph that includes both keypoints and object detections across frames, enabling the GNN to simultaneously find the association between keypoints and between objects. Furthermore, to improve robustness under varying motion patterns, we apply a Multilayer Perceptron (MLP)–Sigmoid gating layer to restrict the association dynamically based on ego-motion and the spatial distance of each keypoint from the ego-pose. In addition, we adopt a soft assignment scheme over the keypoint similarity matrix, which enables differentiable pose estimation and allows the final odometry loss to directly supervise the GNN association learning.
The main contributions of the proposed GNN-based, tightly coupled SLAM and multi-object tracking system are the following:
1. We propose KOM-SLAM, a GNN-based, tightly coupled SLAM and multi-object tracking framework, where the GNN jointly learns associations between keypoints and objects. A soft assignment mechanism is applied to backpropagate the pose estimation loss through the GNN. To the best of our knowledge, this is the first learning-based system that tightly integrates SLAM and multi-object tracking.
2. We embed the ego-motion and the spatial distance between each keypoint and the ego-pose in the network, allowing dynamic adjustment of the keypoint matching range.
3. We validate the effectiveness of KOM-SLAM on the KITTI Tracking dataset, demonstrating improved performance in both pose estimation and object tracking.
The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 presents the proposed GNN-based, tightly coupled SLAM and multi-object tracking framework in detail. Experimental results and evaluations are demonstrated in Section 4. Finally, Section 5 concludes the paper and discusses directions of future work.

2. Related Work

2.1. Coupled SLAM and Multi-Object Tracking

Traditional SLAM systems rely on feature extraction and frame-to-frame correspondence to estimate ego-motion, using corner keypoints in visual inputs [11] or edge and planar points in Light Detection and Ranging (LiDAR) data [12]. However, including keypoints on dynamic objects in pose estimation introduces errors due to unmodeled motion. To address this, dynamic observation methods have been proposed to filter out potentially dynamic points—either by removing features associated with detected objects [13,14] or by segmenting dynamic regions using photometric or geometric cues [15]. The robustness of these methods is heavily dependent on the quality of the road scene understanding. Recent benchmarks, such as RSUD20K [16], have advanced the state-of-the-art in road scene understanding by providing diverse and high-resolution data for object detection. While effective in avoiding dynamic noise, these approaches often lack the ability to model or track object motion, limiting the SLAM system’s robustness in dynamic environments.
To improve robustness in dynamic environments, loosely coupled systems combine SLAM and multi-object tracking, where each component operates independently but shares information to enhance overall performance. In these frameworks, object tracking typically helps identify and remove dynamic points from sensor data, while SLAM uses the filtered static features to perform accurate ego-pose estimation. MaskFusion [17] and DOT [18], visual dynamic object tracking methods for SLAM, rely on reprojection or photometric errors to track object motion with object masks and to identify static keypoints. Instead of tracking image masks, ClusterSLAM [19] and ClusterVO [20] project keypoints into 3D, track 3D clusters of points, and estimate camera poses based on static features. LiDAR-based methods take advantage of accurate depth measurements and object detection, often using Kalman filtering and point cloud registration to track objects [21]. Hybrid LiDAR-visual systems, such as Dynam-LVIO [22], benefit from sensor redundancy and use both the iterative closest point (ICP) error and the reprojection error to update a Kalman filter that tracks objects. While these loosely coupled designs are modular and flexible, their ego-pose estimation relies only on static features, which may limit their effectiveness in highly dynamic scenes where static features are scarce.
In contrast to loosely coupled methods, tightly coupled SLAM and multi-object tracking frameworks typically jointly optimize ego-poses, object poses, and object motions with a factor graph in the backend. This joint optimization allows dynamic object observations and motion models to directly support ego-pose estimation, which is particularly beneficial when static features are sparse or occluded. Visual systems usually calculate reprojection errors as constraint functions in the joint optimization. DynaSLAM II [5] and CubeSLAM [23] detect 2D boxes and optimize object points with factor graphs, while segmentation-based approaches like TwistSLAM [24] apply optical flow to track the object points. Recent advances also explore richer representations, including quadrics [6] and articulated models for non-rigid bodies [25], improving robustness across object types. The LiDAR-based methods, such as DL-SLOT [26] and LIMOT [27], benefit from accurate 3D detections and pay more attention to the object trajectory prediction and association. Instead of applying constant motion constraints, LIO-LOT [28] applies a constant acceleration and angular velocity motion model for objects. IMM-SLAMMOT [29] includes multiple motion models in the factor graph to be more adaptive to the complex real-world dynamic scenarios. The switching-coupled systems combine the loosely coupled factors to avoid unreliable objects causing performance degradation [30,31]. Despite their effectiveness, most existing tightly coupled methods rely on heuristic-based association strategies, leaving room for learning-based techniques to better capture complex correspondence patterns.

2.2. GNN-Based Matching

Graph Neural Networks (GNNs) have emerged as a powerful tool for learning associations in structured data, such as object tracking and keypoint matching across frames. In multi-object tracking, Wang et al. use 2D appearance features as graph nodes, connect nodes across frames within pixel-distance thresholds, and apply GraphConv as their aggregation method [32]. Similarly, PTP [33] connects edges based on 3D distance and applies GraphConv; however, it utilizes Long Short-Term Memory (LSTM) and MLP modules to encode 3D detection results into positional features as GNN nodes. Instead of using GraphConv, GNN3DMOT [3] improves tracking performance by applying attention-based weights to aggregate feature differences between nodes and their neighbors. Brasó and Leal-Taixé build a graph over multiple frames and design a time-aware message-passing network to update the node features [34]. Following this work, SUSHI [10] and Bilgi et al. [35] perform detection association over different hierarchies of timespans to improve long-term association, applying a shared-weights GNN at each hierarchical level. Chen et al. apply a GNN to multi-object tracking in satellite videos, focusing on tiny object tracking [36]. Instead of using heuristic methods to define GNN edges, Gao et al. design NodeNet, constructed from an encoder and a decoder, to generate the edge connections for the GNN [37].
In keypoint matching, SuperGlue [9] represents keypoints as graph nodes with both descriptor and positioning features. It uses alternating self-attention layers, where nodes in the same frame are fully connected, and cross-attention layers, where nodes are fully connected across frames. Following SuperGlue, LightGlue [38] uses the positioning information to calculate the attention scores in self-attention layers and uses descriptor similarity in cross-attention layers. To improve the time efficiency of SuperGlue, SGMNet [39] applies a seeding module to generate a small set of matches as seeds. Subsequently, a Seeded GNN is applied to utilize seed matches to pass messages across frames and within frames through attentional aggregation. Similarly, ClusterGNN [40] reduces the number of edges by operating GNN on clusters of keypoints. They use the K-means algorithm to cluster the query and key vectors, and then apply a cluster-based sparse attention to aggregate the features. Furthermore, LoFTR [41] combines the GNN with a local feature detector to build an end-to-end trainable model. It enables the matching model to supervise the local feature convolutional neural network (CNN), which outperforms using a pre-built keypoint detector. These works collectively demonstrate GNNs’ strength in capturing contextual cues and finding correspondence in matching tasks. However, most existing works perform the object association and keypoint association independently, and the GNN approaches are not fully explored in the coupled object and keypoint matching.

3. Method

3.1. System Architecture

The overall structure of the proposed GNN-based coupled SLAM and multi-object tracking system is illustrated in Figure 1. The system takes the extracted keypoints and object detections from two frames of perception data at timestamps A and B as input. Its objective is to establish associations between keypoints and between objects across frames and to estimate the relative transformation $T_{AB} \in SE(3)$, which maps coordinates from frame A to frame B. In practice, we use frames A and B that are temporally adjacent (i.e., consecutive frames), as this ensures sufficient spatial overlap and stable geometric consistency for both keypoint and object associations. The framework does not strictly require consecutive frames, but larger temporal gaps generally increase motion magnitude and reduce association reliability. The impact of reduced frame density and increased time intervals is empirically evaluated in Section 4.4.3.
Each keypoint $i$ in frame $t \in \{A, B\}$ is represented as a tuple $(\mathbf{p}_i^t, \mathbf{d}_i^t)$, where $\mathbf{p}_i^t \in \mathbb{R}^3$ denotes the 3D position, and $\mathbf{d}_i^t \in \mathbb{R}^{F_k}$ is the corresponding descriptor. Similarly, each object detection $j$ in frame $t$ is defined as $(\mathbf{b}_j^t, \mathbf{f}_j^t)$, where $\mathbf{b}_j^t = (x_j, y_j, z_j, l_j, w_j, h_j, \theta_j)$ denotes the 3D bounding box parameters (position, size, and yaw angle) and $\mathbf{f}_j^t \in \mathbb{R}^{F_o}$ is the associated appearance feature.
In practice, the input data $(\mathbf{p}_i^t, \mathbf{d}_i^t)$ and $(\mathbf{b}_j^t, \mathbf{f}_j^t)$ are derived from stereo image inputs. Specifically, 2D keypoints and their descriptors are extracted from the left images using SuperPoint [11], and their 3D coordinates $\mathbf{p}_i^t$ are estimated by triangulation using depth inferred from the stereo pair and the camera intrinsic matrix. Object detections are generated by PointRCNN [42], which provides the 3D bounding box parameters $\mathbf{b}_j^t$. The associated appearance features $\mathbf{f}_j^t$ are extracted by applying a pretrained ResNet backbone to the cropped image region of each detected object.
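For readers unfamiliar with the lifting step, the following minimal NumPy sketch shows how a 2D keypoint with a known depth is back-projected to a 3D point under the pinhole model, $\mathbf{p} = z \cdot K^{-1}[u, v, 1]^\top$. The intrinsic values here are purely hypothetical; the paper uses the KITTI calibration and depth inferred from stereo, not these numbers.

```python
import numpy as np

def backproject(u, v, depth, K):
    """Lift a 2D keypoint (u, v) with known depth to a 3D point in the
    camera frame via the pinhole model: p = depth * K^{-1} [u, v, 1]^T."""
    uv1 = np.array([u, v, 1.0])
    return depth * np.linalg.inv(K) @ uv1

# Hypothetical intrinsics (fx = fy = 700, principal point at (320, 240)).
K = np.array([[700.0,   0.0, 320.0],
              [  0.0, 700.0, 240.0],
              [  0.0,   0.0,   1.0]])

# A keypoint at the principal point with 5 m depth maps to (0, 0, 5).
p = backproject(320.0, 240.0, 5.0, K)
```

A real pipeline would batch this over all keypoints and use the depth produced by stereo matching.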
The features and positioning information of keypoint and object are encoded as nodes in an attention-based GNN, including self-attention, keypoint–object, and cross-attention layers. In the GNN layers, the positioning attributes remain fixed and are used to guide the attention mechanism, while the feature vectors are the only components updated. The updated features are then used to compute matching score matrices across frames. For keypoints, an auxiliary matrix is additionally introduced to guide the similarity computation by incorporating the estimated relative transformation and spatial distances between the keypoints and the ego-poses.
Let $N_A$ and $N_B$ denote the number of keypoints extracted at timestamps A and B, respectively. Soft assignment is applied to the keypoint matching score matrix $P^{\mathrm{kp}} \in \mathbb{R}^{N_A \times N_B}$ to obtain the corresponding keypoints $\{\mathbf{p}_k^A, \mathbf{p}_k^B\}$. These correspondences are then used to estimate the relative transformation $T_{AB}$ via a differentiable optimization module. For object associations, a set of object correspondences $(i, j)$ is established through the object matching score matrix $P^{\mathrm{obj}} \in \mathbb{R}^{M_A \times M_B}$, where $M_A$ and $M_B$ are the numbers of detected objects in frames A and B, respectively. Unlike keypoints, soft assignment is not applied to objects due to their discrete and non-repeatable nature.
Finally, ego-pose optimization is performed by minimizing the alignment error of the matched keypoints. To further refine both pose and object trajectories, a factor graph is constructed over a sliding window to jointly optimize relative transformations, object poses, and object motions. The use of soft associations for keypoints, together with differentiable optimization, ensures that the entire framework is trainable.

3.2. Graph Neural Network

As illustrated in Figure 2, the proposed attention-based GNN jointly processes keypoints and object detections from two frames by propagating and refining node features through geometric and appearance-based relationships both within and across timestamps.
Each graph node is represented as a pair $n = (\mathbf{x}, \mathbf{g})$, where $\mathbf{x}$ denotes the learnable feature vector (keypoint descriptor $\mathbf{d}_i$ or object appearance feature $\mathbf{f}$), and $\mathbf{g}$ denotes the corresponding positional attribute (keypoint 3D position $\mathbf{p}$ or object bounding box parameters $\mathbf{b}$). While $\mathbf{g}$ remains fixed throughout message passing, it is used in the attention mechanism to construct queries and keys, thereby guiding how information flows between nodes. Only the feature $\mathbf{x}$ is updated layer by layer.
GNN edges are constructed differently across three layers to capture complementary contextual relationships. In the self-attention layer, keypoint nodes are connected to other keypoints within a predefined distance threshold in the same frame. Similarly, object nodes are connected to nearby object nodes in the same frame. This layer captures local spatial relationships. In the keypoint–object layer, the keypoint nodes and object nodes in the same frame are connected if a keypoint lies inside the 3D bounding box of the object, allowing keypoints to incorporate higher-level object context. Lastly, in the cross-attention layer, keypoints are connected to keypoints in the other frame within a spatial threshold. Likewise, object nodes are connected across frames based on spatial distance. This layer facilitates temporal correspondence reasoning across frames.
At each GNN layer, node features are updated through attention-based message passing from connected neighbors. Specifically, the updated feature $\mathbf{x}_i'$ for node $i$ is calculated by combining the original feature $\mathbf{x}_i$ with the aggregated message $\mathbf{m}_i$ via a 2-layer MLP, as defined in Equation (1):
$$\mathbf{x}_i' = \mathbf{x}_i + \mathrm{MLP}([\mathbf{x}_i \,\|\, \mathbf{m}_i]),$$
where $n_i' = (\mathbf{x}_i', \mathbf{g}_i)$ is the updated GNN node, and $\|$ denotes the concatenation operation. The 2-layer architecture is chosen to provide sufficient expressive power for feature refinement while maintaining computational efficiency and preventing over-fitting.
The aggregated message $\mathbf{m}_i$ is defined as a weighted sum over the neighbors $\mathcal{N}(i)$, calculated by Equations (2) and (3):
$$\mathbf{m}_i = \sum_{j \in \mathcal{N}(i)} \alpha_{ij} \mathbf{v}_j,$$
$$\alpha_{ij} = \mathrm{softmax}_k (\mathbf{q}_i^\top \mathbf{k}_k)_j,$$
where $\mathbf{v}$, $\mathbf{q}$, and $\mathbf{k}$ denote the value, query, and key vectors, respectively. These vectors are obtained by applying different linear layers to the node features. The specific choice of input features depends on the type of layer, as shown in Figure 2. In the self-attention layers, keypoints embed positioning information for queries and keys, while their descriptors are applied to generate value vectors. Object nodes follow the same pattern, using bounding boxes to generate queries and keys. In the keypoint–object layers, queries are derived from keypoint positions and keys from object bounding boxes, while the values are taken from either keypoint descriptors or object features, enabling information exchange between keypoints and objects. Finally, in the cross-attention layers, queries, keys, and values are all constructed from descriptors (for keypoints) or appearance features (for objects), so that message passing emphasizes similarity in feature space across frames.
Each GNN layer is applied sequentially, and multiple iterations of message passing can be performed to allow contextual information to propagate throughout the graph. The updated keypoint and object features are then used for computing similarity matrices and downstream estimation tasks.
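The update of Equations (1)–(3) can be sketched in a few lines of NumPy. This is an illustrative, single-head, loop-based toy (random weights, a toy two-layer MLP, dense neighborhoods), not the paper's trained network; it shows a self-attention-style layer where queries and keys come from the fixed positional attributes $\mathbf{g}$ and values from the learnable features $\mathbf{x}$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_update(x, g, Wq, Wk, Wv, mlp, edges):
    """One round of attention-based message passing (Eqs. (1)-(3)).
    x: (N, F) learnable node features; g: (N, G) fixed positional attributes;
    edges: dict mapping node i -> list of neighbour indices N(i)."""
    q = g @ Wq          # queries from positional attributes
    k = g @ Wk          # keys from positional attributes
    v = x @ Wv          # values from learnable features
    x_new = x.copy()
    for i, nbrs in edges.items():
        if not nbrs:
            continue
        scores = np.array([q[i] @ k[j] for j in nbrs])
        alpha = softmax(scores)                             # Eq. (3)
        m_i = (alpha[:, None] * v[nbrs]).sum(axis=0)        # Eq. (2)
        x_new[i] = x[i] + mlp(np.concatenate([x[i], m_i]))  # Eq. (1)
    return x_new

rng = np.random.default_rng(0)
F, G = 4, 3
Wq = rng.standard_normal((G, G))
Wk = rng.standard_normal((G, G))
Wv = rng.standard_normal((F, F))
W1 = rng.standard_normal((2 * F, 8))
W2 = rng.standard_normal((8, F))
mlp = lambda z: np.maximum(z @ W1, 0.0) @ W2  # toy 2-layer MLP with ReLU

x = rng.standard_normal((5, F))
g = rng.standard_normal((5, G))
edges = {i: [j for j in range(5) if j != i] for i in range(5)}
x_updated = attention_update(x, g, Wq, Wk, Wv, mlp, edges)
```

In the cross-attention layers the same update applies with queries, keys, and values all drawn from the descriptor features instead of $\mathbf{g}$.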

3.3. Matching Score Matrix Calculation

To determine correspondences between keypoints and between objects across two frames A and B, we compute two matching score matrices for keypoints and objects based on the updated node features obtained from the GNN. Since the processes are similar, we describe the calculation for keypoints, while the object case follows identically.
Let $D^A \in \mathbb{R}^{N_A \times F_k}$ and $D^B \in \mathbb{R}^{N_B \times F_k}$ denote the updated descriptor matrices for keypoints in frames A and B, respectively, where each row corresponds to $D_{i\cdot}^t := \tilde{\mathbf{d}}_i^t$, the updated feature of keypoint $i$ in frame $t$. We first calculate a similarity matrix $S \in \mathbb{R}^{N_A \times N_B}$ using the inner product of the feature matrices, as shown in Equation (4):
$$S = D^A (D^B)^\top.$$
We define the raw matching score matrix $P^{\mathrm{raw}} \in \mathbb{R}^{N_A \times N_B}$ as the sum of the row-wise and column-wise softmax normalizations of $S$ in Equation (5):
$$P_{ij}^{\mathrm{raw}} = \mathrm{softmax}_k (S_{ik})_j + \mathrm{softmax}_k (S_{kj})_i.$$
In addition to descriptor similarity, we introduce an auxiliary matrix $P^{\mathrm{aux}} \in [0, 1]^{N_A \times N_B}$ based on keypoint positions and the estimated ego-motion. As shown in Equation (6), we compute a learned threshold $\tau_{ij}$ using a lightweight network, which estimates, for each keypoint, the range in which its corresponding keypoint might lie:
$$\tau_{ij} = \mathrm{MLP}\big(\big[\, \|\mathbf{p}_i^A\|_2 \,\|\, \mathbf{v}_{\mathrm{ego}} \,\|\, \boldsymbol{\omega}_{\mathrm{ego}} \,\big]\big),$$
where $\|\mathbf{p}_i^A\|_2$ is the Euclidean distance from keypoint $i$ to the ego-pose in frame A, and $\mathbf{v}_{\mathrm{ego}}$ and $\boldsymbol{\omega}_{\mathrm{ego}}$ represent the estimated translational and rotational velocities of the ego-motion between frames A and B. The auxiliary matrix, which acts as a learned spatial prior, is then defined using the sigmoid function $\sigma$ in Equation (7):
$$P_{ij}^{\mathrm{aux}} = \sigma(\tau_{ij} - \delta_{ij}),$$
where $\delta_{ij} = \|\mathbf{p}_i^A - \mathbf{p}_j^B\|_2$ is the distance between the keypoint pair $(i, j)$.
Finally, the keypoint matching score matrix $P \in \mathbb{R}^{N_A \times N_B}$ is obtained by element-wise multiplication of the raw matching score matrix $P^{\mathrm{raw}}$ and the auxiliary matrix $P^{\mathrm{aux}}$, as shown in Equation (8):
$$P = P^{\mathrm{raw}} \odot P^{\mathrm{aux}}.$$
This fused matrix P is used for soft keypoint matching and subsequent relative pose estimation. Note that for objects, the raw matching score matrix is directly used as the matching score matrix.
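The score pipeline of Equations (4), (5), (7), and (8) is straightforward to prototype. The sketch below is a NumPy illustration with random descriptors; for brevity, a fixed scalar threshold stands in for the learned MLP of Equation (6), so the values are illustrative only.

```python
import numpy as np

def softmax_rows(S):
    """Numerically stable row-wise softmax."""
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def matching_scores(DA, DB, dist, tau):
    """Keypoint matching scores (Eqs. (4), (5), (7), (8)).
    DA: (NA, F), DB: (NB, F) updated descriptors;
    dist: (NA, NB) pairwise distances delta_ij;
    tau: (NA, NB) thresholds (learned by an MLP in the paper)."""
    S = DA @ DB.T                                   # Eq. (4): inner products
    P_raw = softmax_rows(S) + softmax_rows(S.T).T   # Eq. (5): row + column softmax
    P_aux = 1.0 / (1.0 + np.exp(-(tau - dist)))     # Eq. (7): sigmoid gate
    return P_raw * P_aux                            # Eq. (8): element-wise product

rng = np.random.default_rng(1)
DA = rng.standard_normal((4, 8))
DB = rng.standard_normal((6, 8))
dist = rng.uniform(0.0, 5.0, size=(4, 6))
tau = np.full((4, 6), 2.0)   # stand-in for the learned threshold of Eq. (6)
P = matching_scores(DA, DB, dist, tau)
```

Note how pairs with $\delta_{ij}$ far beyond the threshold are suppressed toward zero regardless of descriptor similarity, which is exactly the intended gating effect.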

3.4. Correspondence Association and Joint Optimization

Given the matching score matrices computed from the updated keypoint and object features, our framework identifies inter-frame correspondences and jointly optimizes the relative ego-motion and object states.
Object correspondences are first determined through an argmax operation combined with a mutual consistency check. Let $P^{\mathrm{obj}} \in \mathbb{R}^{M_A \times M_B}$ denote the object-level matching score matrix between the $M_A$ and $M_B$ detected objects in frames A and B, respectively. To ensure that both objects select each other as the most similar match, a candidate correspondence $(i, j)$ is selected only if both conditions in Equations (9) and (10) are met:
$$j = \operatorname*{argmax}_k P_{ik}^{\mathrm{obj}},$$
$$i = \operatorname*{argmax}_k P_{kj}^{\mathrm{obj}}.$$
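The mutual consistency check of Equations (9) and (10) amounts to keeping only pairs that are each other's row-wise and column-wise argmax. A minimal NumPy sketch with a hypothetical score matrix:

```python
import numpy as np

def mutual_matches(P):
    """Object correspondences via mutual argmax (Eqs. (9)-(10)):
    (i, j) is kept only if j is i's best match and i is j's best match."""
    best_B = P.argmax(axis=1)  # best frame-B candidate for each object i in A
    best_A = P.argmax(axis=0)  # best frame-A candidate for each object j in B
    return [(i, j) for i, j in enumerate(best_B) if best_A[j] == i]

# Toy score matrix: object 2 in frame A prefers column 1, but column 1
# prefers object 1, so (2, 1) is rejected by the mutual check.
P_obj = np.array([[0.9, 0.1, 0.2],
                  [0.3, 0.8, 0.7],
                  [0.2, 0.6, 0.1]])
matches = mutual_matches(P_obj)  # -> [(0, 0), (1, 1)]
```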
Keypoint correspondences are derived using soft assignment. Given the keypoint score matrix $P^{\mathrm{kp}} \in \mathbb{R}^{N_A \times N_B}$, the corresponding location in frame B for each keypoint $\mathbf{p}_i^A$ in frame A is estimated as a weighted average, as defined in Equation (11):
$$\hat{\mathbf{p}}_i^B = \sum_{j=1}^{N_B} \frac{\exp(P_{ij})}{\sum_{k=1}^{N_B} \exp(P_{ik})} \cdot \mathbf{p}_j^B.$$
Crucially, this soft assignment mechanism is designed to maintain differentiability, allowing the matching scores to be directly supervised by the downstream odometry loss.
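Equation (11) is a row-wise softmax over the scores followed by a weighted average of frame-B positions, which is why gradients flow from the estimated locations back into $P$. A minimal NumPy sketch with toy scores and positions:

```python
import numpy as np

def soft_assign(P, pB):
    """Soft keypoint assignment (Eq. (11)): each frame-A keypoint is matched to
    a softmax-weighted average of frame-B keypoint positions, keeping the
    operation differentiable with respect to the scores P."""
    W = np.exp(P - P.max(axis=1, keepdims=True))  # stable row-wise softmax
    W = W / W.sum(axis=1, keepdims=True)
    return W @ pB                                 # (NA, 3) estimated locations

# Toy example: sharply peaked scores make the soft assignment approach
# a hard one-to-one matching.
P = np.array([[10.0, 0.0],
              [0.0, 10.0]])
pB = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])
p_hat = soft_assign(P, pB)  # rows are close to pB[0] and pB[1], respectively
```

With flatter score rows, the estimate blends several candidate positions, which is what allows ambiguous matches to be supervised smoothly by the downstream pose loss.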
The resulting set of soft correspondences $(\mathbf{p}_i^A, \hat{\mathbf{p}}_i^B)$ is then used to estimate the relative transformation $T_{AB} \in SE(3)$ by minimizing the sum of squared alignment errors $E_{\mathrm{rel}}$ defined in Equation (12):
$$E_{\mathrm{rel}} = \sum_i \big\| \hat{\mathbf{p}}_i^B - T_{AB} \cdot \mathbf{p}_i^A \big\|_2^2.$$
This optimization is solved using the Levenberg–Marquardt (LM) algorithm [43], implemented in a differentiable form with Theseus [44]. This transformation aligns keypoints from frame A to frame B and provides an initial estimate of ego-motion. For the initial guess of T A B in the optimization, we use a constant motion model. For the first frame, T A B is initialized as an identity transformation.
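For intuition, the least-squares objective of Equation (12) also admits a closed-form solution via the Kabsch algorithm when correspondences are fixed. The sketch below is a reference implementation of that closed form on synthetic data, not the paper's method: KOM-SLAM uses differentiable Levenberg-Marquardt (Theseus) precisely so that gradients can flow through the soft correspondences.

```python
import numpy as np

def align_kabsch(pA, pB):
    """Closed-form least-squares rigid alignment (Kabsch): finds R, t minimising
    sum_i || pB_i - (R @ pA_i + t) ||^2, i.e. the objective of Eq. (12)."""
    cA, cB = pA.mean(axis=0), pB.mean(axis=0)
    H = (pA - cA).T @ (pB - cB)                 # cross-covariance of centred sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cB - R @ cA
    return R, t

# Synthetic check: rotate random points 30 degrees about z and translate.
theta = np.pi / 6
R_gt = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                 [np.sin(theta),  np.cos(theta), 0.0],
                 [0.0,            0.0,           1.0]])
t_gt = np.array([1.0, -2.0, 0.5])
pA = np.random.default_rng(2).standard_normal((20, 3))
pB = pA @ R_gt.T + t_gt
R, t = align_kabsch(pA, pB)   # recovers R_gt, t_gt exactly
```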
To further refine the ego-motion estimate and incorporate object dynamics, we employ a joint optimization framework with a factor graph following [27]. The graph consists of ego-pose nodes, object pose nodes, and object motion nodes spanning a fixed-length sliding window (set to 5 frames). Let $X$ denote the set of all variables in the graph. The following factors are defined: (i) odometry constraints $E_{\mathrm{odom}}$ between consecutive ego-poses; (ii) observation constraints $E_{\mathrm{obs}}$ linking ego-poses to observed object poses; (iii) motion constraints $E_{\mathrm{motion}}$ connecting consecutive object poses and motions; and (iv) constant-velocity constraints $E_{\mathrm{const}}$ between object motion nodes, as shown in Equation (13):
$$X^* = \operatorname*{argmin}_X \left( E_{\mathrm{odom}} + E_{\mathrm{obs}} + E_{\mathrm{motion}} + E_{\mathrm{const}} \right),$$
where $X^*$ denotes the optimized factor graph nodes. The optimization is again performed using the LM algorithm [43] with Theseus [44], maintaining full differentiability. This joint optimization allows for mutual refinement of ego-motion and object tracking, enhancing both consistency and accuracy. Crucially, both the soft assignment mechanism and the factor graph optimization are implemented in a differentiable manner, forming the foundation for our learning pipeline.
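To make the structure of Equation (13) concrete, here is a deliberately simplified 1-D linear toy: three ego poses, one tracked object over three frames, and one shared velocity variable (sharing the velocity across steps plays the role of $E_{\mathrm{const}}$). All measurement values are hypothetical, and linear least squares replaces the SE(3) Levenberg-Marquardt solve of the actual system.

```python
import numpy as np

# Variables theta = [x0, x1, x2, o0, o1, o2, v]: ego poses x, object positions o,
# and a single constant object velocity v. Each factor is one row of A @ theta = b.
u = [1.0, 1.0]          # odometry measurements (E_odom): x_{k+1} - x_k = u_k
z = [4.0, 3.5, 3.0]     # relative object observations (E_obs): o_k - x_k = z_k

rows, rhs = [], []
def add_factor(coeffs, value):
    row = np.zeros(7)
    for idx, c in coeffs:
        row[idx] = c
    rows.append(row)
    rhs.append(value)

add_factor([(0, 1.0)], 0.0)                            # prior anchoring x0 = 0
for k in range(2):                                     # E_odom factors
    add_factor([(k + 1, 1.0), (k, -1.0)], u[k])
for k in range(3):                                     # E_obs factors
    add_factor([(3 + k, 1.0), (k, -1.0)], z[k])
for k in range(2):                                     # E_motion: o_{k+1} - o_k - v = 0
    add_factor([(4 + k, 1.0), (3 + k, -1.0), (6, -1.0)], 0.0)

A, b = np.array(rows), np.array(rhs)
theta, *_ = np.linalg.lstsq(A, b, rcond=None)  # X* = argmin of the summed residuals
```

Because the toy measurements are self-consistent, the solve recovers ego poses $[0, 1, 2]$, object positions $[4, 4.5, 5]$, and velocity $0.5$ with zero residual; with noisy measurements the factors trade off against each other, which is the point of the joint optimization.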

3.5. Training Loss

To supervise the learning process of KOM-SLAM, we design a composite loss function covering object association, keypoint correspondence estimation, and ego-motion prediction. The overall loss is defined in Equation (14):
$$L = \lambda_{\mathrm{obj}} L_{\mathrm{obj}} + \lambda_{\mathrm{kp}} L_{\mathrm{kp}} + \lambda_{\mathrm{odom}} L_{\mathrm{odom}},$$
where $\lambda_{\mathrm{obj}}$, $\lambda_{\mathrm{kp}}$, and $\lambda_{\mathrm{odom}}$ are weighting coefficients.
Ground-truth tracking IDs are used to establish object correspondences between frames A and B. Let $\hat{P}^{\mathrm{obj}}$ denote the ground-truth object matching score matrix, where $\hat{P}_{ij}^{\mathrm{obj}} = 1$ for objects with the same tracking ID and $\hat{P}_{ij}^{\mathrm{obj}} = 0$ for objects with different tracking IDs. Following PTP [33], we apply the binary cross-entropy (BCE) loss to each entry of $P^{\mathrm{obj}}$ and the cross-entropy (CE) loss to each column of $P^{\mathrm{obj}}$, as shown in Equation (15):
$$L_{\mathrm{obj}} = L_{\mathrm{BCE}} + L_{\mathrm{CE}}.$$
Ground-truth matched keypoint pairs are identified using both spatial distance and descriptor similarity after aligning keypoints in frame A to the coordinate frame of B using the ground-truth relative transformation $T_{AB,\mathrm{gt}}$. A pair $(i, j)$ is considered positive if both the spatial distance (Equation (16)) and descriptor similarity (Equation (17)) criteria are met:
$$\big\| T_{AB,\mathrm{gt}} \cdot \mathbf{p}_i^A - \mathbf{p}_j^B \big\|_2 < \delta_{\mathrm{dist}},$$
$$\cos(\mathbf{d}_i^A, \mathbf{d}_j^B) > \delta_{\mathrm{sim}},$$
where $\delta_{\mathrm{dist}}$ and $\delta_{\mathrm{sim}}$ are the distance and descriptor similarity thresholds, respectively. Let $P_{\mathrm{gt}}^{\mathrm{kp}}$ denote the set of ground-truth matched keypoint pairs. As shown in Equation (18), following LightGlue [38], the keypoint matching loss is formulated as the mean negative log-likelihood over the ground-truth matched pairs:
$$L_{\mathrm{kp}} = -\frac{1}{|P_{\mathrm{gt}}^{\mathrm{kp}}|} \sum_{(i,j) \in P_{\mathrm{gt}}^{\mathrm{kp}}} \log P_{ij}^{\mathrm{kp}}.$$
To supervise ego-motion estimation, we penalize both translation and rotation errors between the predicted transformation $T_{AB}$ and the ground truth $T_{AB,\mathrm{gt}}$. Let $\mathbf{q}_{\mathrm{est}}, \mathbf{q}_{\mathrm{gt}} \in \mathbb{R}^4$ denote the estimated and ground-truth unit quaternions, and $\mathbf{t}_{\mathrm{est}}, \mathbf{t}_{\mathrm{gt}} \in \mathbb{R}^3$ denote the translation vectors. The transformation loss $L_{\mathrm{odom}}$ is defined as the sum of the rotation and translation errors, as shown in Equation (19):
$$L_{\mathrm{odom}} = \beta \cdot \| \mathbf{q}_{\mathrm{est}} - \mathbf{q}_{\mathrm{gt}} \|_2 + \| \mathbf{t}_{\mathrm{est}} - \mathbf{t}_{\mathrm{gt}} \|_2,$$
where $\beta$ is a weighting factor that balances the contributions of rotational and translational errors. The factor $\beta$ performs scale normalization, which is necessary because the rotational error (the distance between unit quaternions) and the translational error (in meters) typically have different magnitudes. This ensures that both components contribute equitably to the overall loss gradient during training.
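The loss terms of Equations (18) and (19) are simple enough to verify numerically. The following NumPy sketch uses toy inputs (identity-like quaternions, a 0.1 m translation error, a 2x2 score matrix) purely to illustrate the formulas:

```python
import numpy as np

def kp_matching_loss(P_kp, gt_pairs):
    """Keypoint loss (Eq. (18)): mean negative log-likelihood over the
    ground-truth matched pairs."""
    return -np.mean([np.log(P_kp[i, j]) for (i, j) in gt_pairs])

def odom_loss(q_est, q_gt, t_est, t_gt, beta=200.0):
    """Odometry loss (Eq. (19)): beta re-scales the (numerically smaller)
    quaternion error against the translation error in metres."""
    return beta * np.linalg.norm(q_est - q_gt) + np.linalg.norm(t_est - t_gt)

P_kp = np.array([[0.9, 0.2],
                 [0.1, 0.8]])
L_kp = kp_matching_loss(P_kp, [(0, 0), (1, 1)])

# Perfect rotation, 0.1 m translation error -> L_odom = 0.1.
L_od = odom_loss(np.array([1.0, 0.0, 0.0, 0.0]),
                 np.array([1.0, 0.0, 0.0, 0.0]),
                 np.array([0.1, 0.0, 0.0]),
                 np.array([0.0, 0.0, 0.0]))
```

Note how a small quaternion error would be multiplied by $\beta$ before being compared against metric translation error, matching the scale-normalization argument above.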

4. Experiments

We conduct experiments on the KITTI Tracking dataset [45] to evaluate the effectiveness of KOM-SLAM. Section 4.1 describes the implementation details and dataset setup. The experimental evaluation is organized into two main parts: Section 4.2 presents the results for odometry estimation, while Section 4.3 evaluates the multi-object tracking performance.

4.1. Experimental Details

For keypoint extraction, we apply SuperPoint [11], a self-supervised keypoint detector and descriptor, to generate keypoints from the images of the KITTI Tracking dataset. To obtain 3D object detections, we adopt PointRCNN [42], a two-stage 3D object detector that operates on LiDAR point clouds. For both models, we use the publicly available pretrained weights. To fully evaluate KOM-SLAM on all sequences of the KITTI Tracking dataset, we train two models: one on sequences 0000–0010, and the other on sequences 0011, 0013, 0014, 0018, 0019, and 0020. This separation is necessary because we avoid evaluating a sequence with a model that has been trained on it. By training two models on complementary subsets, we ensure that every sequence can be evaluated without overlapping with its training data. Following prior works [5,24], sequences 0012, 0015, 0016, and 0017 are excluded from the evaluation because they contain little or no camera motion. All models are trained using the Adam optimizer with a learning rate of $1 \times 10^{-4}$ for 80 epochs and a batch size of 1. The GNN module consists of two layers, each integrating self-attention, cross-attention, and keypoint–object interaction mechanisms, with a feature dimension of 256. The training loss is a weighted combination of object association loss, keypoint association loss, and odometry consistency loss, with weights set to $\lambda_{\mathrm{obj}} = 0.8$, $\lambda_{\mathrm{kp}} = 1.0$, and $\lambda_{\mathrm{odom}} = 0.6$, respectively. The overall balance factor is set to $\beta = 200.0$. These hyperparameters were determined through a grid search. We observed that the system maintains stable performance within a $\pm 10\%$ range of the specified $\lambda$ values. However, we found that reducing the factor $\beta$ increases the rotation error, because rotational errors are numerically smaller than translational errors.
The proposed framework runs at approximately 49 ms per frame, where the GNN module takes approximately 22 ms per frame, and the backend optimization costs approximately 24 ms per frame. All experiments are conducted on a desktop system equipped with an NVIDIA GeForce RTX 4070 GPU. Figure 3 illustrates an example of the joint keypoint and object association results generated by KOM-SLAM on the KITTI Tracking dataset, highlighting successful correspondence in a challenging dynamic scene.

4.2. Odometry Estimation

This subsection presents the evaluation of ego-pose estimation performance using KOM-SLAM on the KITTI Tracking dataset. Specifically, sequences 0000–0010 are evaluated using a model trained on sequences 0011, 0013, 0014, 0018, 0019, and 0020, while the remaining sequences are tested using the model trained on 0000–0010. This split ensures that the test sequences are not seen during training, enabling a fair evaluation of generalization performance. We use the Relative Pose Error (RPE) as the evaluation metric to assess the accuracy of the estimated ego motion. RPE measures the local accuracy of the transformation between two consecutive frames and is particularly suitable for evaluating methods that emphasize pairwise frame association, such as ours.
Table 1 compares our method with three established baselines: ORB-SLAM2 [1], DynaSLAM [46], and DynaSLAM2 [5]. KOM-SLAM outperforms the baselines in both translational (RPEt) and rotational (RPER) components on the majority of sequences. Notably, it achieves the lowest mean RPEt of 0.041 m/frame and the lowest mean RPER of 0.028°/frame across all test sequences. This demonstrates the effectiveness of incorporating both keypoint- and object-level associations for robust ego-motion estimation in dynamic environments. Additionally, KOM-SLAM achieves the lowest standard deviation in translation and a standard deviation in rotation comparable to the best-performing method (0.020 vs. 0.015 of DynaSLAM [46]), demonstrating stable and consistent pose estimates across different sequences.

4.3. Multi-Object Tracking

This subsection presents the evaluation of multi-object tracking performance using KOM-SLAM on the KITTI Tracking dataset. Following the evaluation protocol of DynaSLAM2 [5], we compare object pose estimation accuracy across the 12 longest object trajectories of the KITTI Tracking dataset. We adopt standard evaluation metrics to assess tracking performance: True Positives (TP) and Multi-Object Tracking Precision (MOTP), computed in 2D image space, Bird's-Eye View (BV), and full 3D space. TP reflects the percentage of correctly tracked frames with respect to ground-truth annotations, while MOTP measures the average spatial alignment of matched objects; higher TP and MOTP indicate better tracking continuity and localization accuracy. Table 2 summarizes the comparison between KOM-SLAM and DynaSLAM2. KOM-SLAM achieves higher TP and MOTP across all three evaluation spaces (2D, BV, and 3D), demonstrating substantial improvements in both tracking continuity and pose accuracy.
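The two metrics can be summarized as follows. This is a generic sketch: the exact matching criterion (e.g., an IoU or distance threshold in each evaluation space) is not restated here, and the function names are illustrative:

```python
def tp_rate(n_tracked, n_gt):
    """TP: percentage of ground-truth frames in which the object
    was correctly tracked."""
    return 100.0 * n_tracked / n_gt

def motp(match_scores):
    """MOTP: mean alignment score of matched object pairs, expressed
    as a percentage. `match_scores` holds one per-frame alignment
    score in [0, 1] (e.g., an overlap measure) for each correctly
    tracked frame; the paper's exact definition may differ."""
    if not match_scores:
        return 0.0
    return 100.0 * sum(match_scores) / len(match_scores)
```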

4.4. Ablation Study

To systematically analyze the contribution of each key component in KOM-SLAM, we conduct an ablation study focusing on the keypoint–object layer, attention-based GNN, the MLP–Sigmoid gating layer, the keypoint soft assignment mechanism, and the robustness under varying keypoint densities and temporal spacing.

4.4.1. Ego Pose Estimation Ablation

Table 3 presents the ego-pose estimation results for several ablated variants on the KITTI Tracking dataset. Specifically, we evaluate the following: (1) No Keypoint–Object Layer, where the keypoint–object interaction layer in the GNN is removed and only the self-attention and cross-attention layers are used; (2) No GNN, where the attention-based GNN is removed and associations are computed directly from raw keypoint features; (3) No Gating Layer, where the MLP–Sigmoid gating mechanism is disabled and association scores are taken directly from the raw matching score matrix P_raw; (4) No Soft Assignment, where hard assignment based on the matching score matrix is used instead of soft assignment; and (5) KOM-SLAM, the full KOM-SLAM framework.
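The difference between variants (4) and (5) can be illustrated as follows. This is a minimal sketch assuming a row-wise softmax relaxation; the actual system may use a different normalization:

```python
import numpy as np

def soft_assignment(scores, tau=1.0):
    """Differentiable row-wise softmax over the matching score matrix,
    which lets a downstream pose loss back-propagate into association.
    The temperature tau is an illustrative assumption."""
    z = scores / tau
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def hard_assignment(scores):
    """Non-differentiable argmax used by the 'No Soft Assignment'
    variant: gradients cannot flow through the selection."""
    out = np.zeros_like(scores)
    out[np.arange(scores.shape[0]), scores.argmax(axis=1)] = 1.0
    return out
```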
Across most sequences, the translational error (RPEt) remains relatively stable among different variants, indicating that coarse geometric alignment can still be obtained without learned association refinement. In contrast, the rotational error (RPER) exhibits a clear sensitivity to the association strategy. In particular, removing the keypoint–object layer leads to noticeably higher rotation errors and variance, highlighting its importance for stable and accurate pose estimation.
The full model consistently achieves the lowest mean rotation error and standard deviation, confirming that the combination of keypoint–object interaction, GNN-based feature refinement, MLP–Sigmoid gating, and differentiable soft assignment is essential for robust ego-motion estimation. These results validate the necessity of our core design choices rather than relying on any single component alone.

4.4.2. Object Tracking Association Ablation

To evaluate the effectiveness of the GNN-based association module for object tracking, we compare three variants using standard Multi-Object Tracking metrics: Multi-Object Tracking Accuracy (MOTA) and ID F1 Score (IDF1), as shown in Table 4. The compared methods include the following: (1) a Naive Approach, which associates objects solely based on the minimum Euclidean distance between the bounding box centers of the current detection and the predicted objects from the previous information; (2) No GNN, which uses raw appearance features without GNN-based refinement; (3) No Keypoint–Object Layer, which removes the keypoint–object interaction layer in GNN and only uses self-attention and cross-attention layers; and (4) KOM-SLAM, the full KOM-SLAM association pipeline.
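The Naive Approach in (1) can be sketched as greedy nearest-center matching between frames; the gating threshold `max_dist` is an illustrative assumption, not a reported parameter:

```python
import numpy as np

def naive_associate(prev_centers, curr_centers, max_dist=5.0):
    """Greedy nearest-center association sketch: each current detection
    is linked to the closest unmatched previous object within max_dist
    (in meters). Appearance is ignored, which is why identities break
    under occlusion and close interaction."""
    matches, used = {}, set()
    for j, c in enumerate(curr_centers):
        dists = [np.linalg.norm(c - p) for p in prev_centers]
        for i in np.argsort(dists):          # try nearest candidates first
            if i not in used and dists[i] <= max_dist:
                matches[j] = int(i)
                used.add(int(i))
                break
    return matches
```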
For this ablation, ground-truth object detections are used to isolate the influence of the association mechanism and to eliminate variance introduced by the upstream 3D detector. The naive distance-based approach performs poorly in sequences with frequent occlusions and interacting objects, leading to low MOTA and IDF1 scores. The No GNN and No Keypoint–Object Layer variants improve performance in most sequences by leveraging appearance cues, but they still struggle with identity consistency under complex interactions.
The full model achieves the highest MOTA and IDF1 scores in most sequences, demonstrating that the GNN architecture effectively captures relationships between objects. This results in more robust identity preservation and significantly improved tracking precision compared with heuristic distance-based association or raw appearance-feature matching.

4.4.3. Data Density and Time Gap Analysis

To assess the robustness of KOM-SLAM under practical sensing constraints, we further analyze the impact of reduced keypoint density and increased temporal spacing between frames, with results summarized in Table 5. For the Sparser Keypoints variant, 50% of the detected keypoints are randomly retained in each frame. For the Every 2 Frames variant, only every second frame in each sequence is used for pose estimation.
When keypoint density is reduced, KOM-SLAM maintains comparable mean and standard deviation of the relative pose error, demonstrating strong robustness to sparse visual features. This indicates that the learned association mechanism can effectively find correspondences even with limited observations.
In contrast, increasing the temporal gap between frames results in noticeable degradation, particularly in rotation error. This effect is most pronounced in sequences with sharp turns (e.g., 0000, 0004, 0007), where larger inter-frame motion violates the small-motion assumption implicitly used in pairwise matching. These results highlight the importance of temporal resolution for accurate motion estimation, while also showing that, in sequences without large inter-frame motion, KOM-SLAM degrades gracefully rather than failing catastrophically.

5. Conclusions and Future Work

In this work, we introduced KOM-SLAM, a GNN-based framework that tightly couples odometry estimation and multi-object tracking. Our method constructs a spatiotemporal graph over keypoints and objects across frames, enabling the GNN to jointly update their features for robust association. A differentiable soft assignment mechanism is integrated with odometry estimation, allowing pose loss to directly supervise the learning process. To further enhance the robustness of keypoint association, we incorporate ego-motion and spatial priors into the matching process through an MLP–Sigmoid gating layer. We evaluated KOM-SLAM on the KITTI Tracking dataset, where it consistently outperformed strong baselines. The results demonstrate notable gains in both odometry accuracy and multi-object tracking precision, particularly in challenging sequences with frequent object motion and occlusion.
While KOM-SLAM demonstrates improvements in ego-motion estimation and multi-object tracking, the current framework presents several limitations that motivate future work.
Firstly, the system relies on high-quality external perception modules (SuperPoint for keypoints and PointRCNN for 3D detection). This dependency means the system’s overall performance is bounded by the accuracy and completeness of these detectors. Our learning pipeline currently mainly supervises the association, leaving the feature extraction and detection unsupervised.
Additionally, the system relies on frame-independent object detection and feature extraction, which can suffer from temporal inconsistency in bounding box parameters. The inherent redundancy of keypoints makes ego-pose estimation robust to this inconsistency, but it severely challenges the factor graph optimization: the constant-velocity constraints applied to dynamic objects struggle to reconcile detection noise with the smooth motion model, risking degraded object trajectory accuracy and potentially destabilizing the coupled ego-pose estimate.
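To make this tension concrete, a constant-velocity factor penalizes deviation from uniform motion across consecutive object states. The following is a simplified positional sketch under the assumption of uniform frame timing; the actual factor also involves orientation:

```python
import numpy as np

def constant_velocity_residual(p_prev, p_curr, p_next, w=1.0):
    """Residual of a constant-velocity factor on three consecutive
    object positions: with uniform timing, (p_next - p_curr) should
    equal (p_curr - p_prev). Noisy, frame-independent detections
    inflate this residual, pulling the smooth motion model away from
    the measurements during optimization."""
    return w * ((p_next - p_curr) - (p_curr - p_prev))
```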
Furthermore, the system’s reliance on consecutive frame matching and a constant motion model for initialization makes the method sensitive to large motion changes. As demonstrated in our ablation study (matching every 2 frames), extremely high-speed scenarios or sequences with significant rotational velocity changes between frames can degrade performance, as the geometric prediction becomes unreliable.
Finally, while the framework demonstrates strong RPE performance, it currently lacks long-term robustness mechanisms such as global mapping consistency and loop closure. Integrating GNN-based re-localization will be necessary to prevent long-term drift in the ego-pose and object trajectories.
Building upon these limitations, future efforts will concentrate on three main directions: (1) integrating the front-end keypoint extraction and temporally consistent object detection modules into the differentiable pipeline for a truly end-to-end trainable system; (2) exploring extensions to global consistency and long-term tracking, potentially by introducing map nodes into the GNN and designing a global loop-closure factor for the optimization backend; and (3) expanding the evaluation and robustness analysis by applying KOM-SLAM to other diverse and challenging datasets, such as nuScenes [47] or Waymo [48], and comparing with additional state-of-the-art visual odometry methods, such as DROID-SLAM [49] and DeepV2D [50].

Author Contributions

Conceptualization, J.L. and Y.T.; methodology, J.L.; software, J.L.; validation, J.L.; formal analysis, J.L.; investigation, J.L.; resources, S.K.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, J.L., Y.T., Y.G. and S.K.; visualization, J.L.; supervision, Y.T., Y.G. and S.K.; project administration, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available at https://www.cvlibs.net/datasets/kitti/eval_tracking.php (accessed on 8 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  2. Shan, T.; Englot, B.; Meyers, D.; Wang, W.; Ratti, C.; Rus, D. LIO-SAM: Tightly-Coupled Lidar Inertial Odometry via Smoothing and Mapping. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 5135–5142. [Google Scholar]
  3. Weng, X.; Wang, Y.; Man, Y.; Kitani, K.M. GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with 2D-3D Multi-Feature Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6499–6508. [Google Scholar]
  4. Nagy, M.; Werghi, N.; Hassan, B.; Dias, J.; Khonji, M. RobMOT: 3D Multi-Object Tracking Enhancement Through Observational Noise and State Estimation Drift Mitigation in LiDAR Point Clouds. IEEE Trans. Intell. Transp. Syst. 2025, 26, 16047–16059. [Google Scholar] [CrossRef]
  5. Bescos, B.; Campos, C.; Tardós, J.D.; Neira, J. DynaSLAM II: Tightly-Coupled Multi-Object Tracking and SLAM. IEEE Robot. Autom. Lett. 2021, 6, 5191–5198. [Google Scholar] [CrossRef]
  6. Tian, R.; Zhang, Y.; Yang, L.; Zhang, J.; Coleman, S.; Kerr, D. DynaQuadric: Dynamic Quadric SLAM for Quadric Initialization, Mapping, and Tracking. IEEE Trans. Intell. Transp. Syst. 2024, 25, 17234–17246. [Google Scholar] [CrossRef]
  7. Shen, Y.; Li, H.; Yi, S.; Chen, D.; Wang, X. Person Re-Identification with Deep Similarity-Guided Graph Neural Network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 486–504. [Google Scholar]
  8. Li, Y.; Gu, C.; Dullien, T.; Vinyals, O.; Kohli, P. Graph Matching Networks for Learning the Similarity of Graph Structured Objects. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 3835–3845. [Google Scholar]
  9. Sarlin, P.-E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning Feature Matching with Graph Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4938–4947. [Google Scholar]
  10. Cetintas, O.; Brasó, G.; Leal-Taixé, L. Unifying Short and Long-Term Tracking with Graph Hierarchies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 22877–22887. [Google Scholar]
  11. DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-Supervised Interest Point Detection and Description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
  12. Zhang, J.; Singh, S. LOAM: Lidar Odometry and Mapping in Real-Time. In Proceedings of the Robotics: Science and Systems, Berkeley, CA, USA, 12–16 July 2014; pp. 1–9. [Google Scholar]
  13. Kannapiran, S.; Bendapudi, N.; Yu, M.-Y.; Parikh, D.; Berman, S.; Vora, A.; Pandey, G. Stereo Visual Odometry with Deep Learning-Based Point and Line Feature Matching Using an Attention Graph Neural Network. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 3491–3498. [Google Scholar]
  14. Cui, J.; Chen, J.; Li, L. SAGE-ICP: Semantic Information-Assisted ICP. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 8537–8543. [Google Scholar]
  15. Koledić, K.; Cvišić, I.; Marković, I.; Petrović, I. MOFT: Monocular Odometry Based on Deep Depth and Careful Feature Selection and Tracking. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 6175–6181. [Google Scholar]
  16. Zunair, H.; Khan, S.; Hamza, A.B. RSUD20K: A dataset for road scene understanding in autonomous driving. In Proceedings of the 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 708–714. [Google Scholar]
  17. Runz, M.; Buffier, M.; Agapito, L. MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Munich, Germany, 16–20 October 2018; pp. 10–20. [Google Scholar]
  18. Ballester, I.; Fontán, A.; Civera, J.; Strobl, K.H.; Triebel, R. DOT: Dynamic Object Tracking for Visual SLAM. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11705–11711. [Google Scholar]
  19. Huang, J.; Yang, S.; Zhao, Z.; Lai, Y.-K.; Hu, S.-M. ClusterSLAM: A SLAM Backend for Simultaneous Rigid Body Clustering and Motion Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5875–5884. [Google Scholar]
  20. Huang, J.; Yang, S.; Mu, T.-J.; Hu, S.-M. ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2168–2177. [Google Scholar]
  21. Moosmann, F.; Stiller, C. Joint Self-Localization and Tracking of Generic Objects in 3D Range Data. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany, 6–10 May 2013; pp. 1146–1152. [Google Scholar]
  22. Shi, J.; Wang, W.; Qi, M.; Li, X.; Yan, Y. DYNAM-LVIO: A Dynamic-Object-Aware Lidar Visual Inertial Odometry in Dynamic Urban Environments. IEEE Trans. Instrum. Meas. 2024, 73, 1–19. [Google Scholar] [CrossRef]
  23. Yang, S.; Scherer, S. CubeSLAM: Monocular 3-D Object SLAM. IEEE Trans. Robot. 2019, 35, 925–938. [Google Scholar] [CrossRef]
  24. Gonzalez, M.; Marchand, E.; Kacete, A.; Royan, J. TwistSLAM: Constrained SLAM in Dynamic Environment. IEEE Robot. Autom. Lett. 2022, 7, 6846–6853. [Google Scholar] [CrossRef]
  25. Qiu, Y.; Wang, C.; Wang, W.; Henein, M.; Scherer, S. AIRDOS: Dynamic SLAM Benefits from Articulated Objects. In Proceedings of the 2022 IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 8047–8053. [Google Scholar]
  26. Tian, X.; Zhu, Z.; Zhao, J.; Tian, G.; Ye, C. DL-SLOT: Tightly-Coupled Dynamic Lidar SLAM and 3D Object Tracking Based on Collaborative Graph Optimization. IEEE Trans. Intell. Veh. 2023, 9, 1017–1027. [Google Scholar] [CrossRef]
  27. Zhu, Z.; Zhao, J.; Huang, K.; Tian, X.; Lin, J.; Ye, C. LIMOT: A Tightly-Coupled System for Lidar-Inertial Odometry and Multi-Object Tracking. IEEE Robot. Autom. Lett. 2024, 9, 6600–6607. [Google Scholar] [CrossRef]
  28. Li, X.; Yan, Z.; Feng, S.; Xia, C.; Li, S.; Zhou, Y. LIO-LOT: Tightly-Coupled Multi-Object Tracking and Lidar-Inertial Odometry. IEEE Trans. Intell. Transp. Syst. 2024, 26, 742–756. [Google Scholar] [CrossRef]
  29. Ying, Z.; Li, H. IMM-SLAMMOT: Tightly-Coupled SLAM and IMM-Based Multi-Object Tracking. IEEE Trans. Intell. Veh. 2023, 9, 3964–3974. [Google Scholar] [CrossRef]
  30. Liu, Y.; Liu, J.; Hao, Y.; Deng, B.; Meng, Z. A Switching-Coupled Backend for Simultaneous Localization and Dynamic Object Tracking. IEEE Robot. Autom. Lett. 2021, 6, 1296–1303. [Google Scholar] [CrossRef]
  31. Lin, Y.-K.; Lin, W.-C.; Wang, C.-C. Asynchronous State Estimation of Simultaneous Ego-Motion Estimation and Multiple Object Tracking for Lidar-Inertial Odometry. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 10616–10622. [Google Scholar]
  32. Wang, Y.; Kitani, K.; Weng, X. Joint Object Detection and Multi-Object Tracking with Graph Neural Networks. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13708–13715. [Google Scholar]
  33. Weng, X.; Yuan, Y.; Kitani, K. PTP: Parallelized Tracking and Prediction with Graph Neural Networks and Diversity Sampling. IEEE Robot. Autom. Lett. 2021, 6, 4640–4647. [Google Scholar] [CrossRef]
  34. Brasó, G.; Leal-Taixé, L. Learning a Neural Solver for Multiple Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6247–6257. [Google Scholar]
  35. Bilgi, H.Ç.; Alatan, A.A. Bi-Directional Tracklet Embedding for Multi-Object Tracking. In Proceedings of the 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 4035–4041. [Google Scholar]
  36. Chen, H.; Li, N.; Li, D.; Lv, J.; Zhao, W.; Zhang, R.; Xu, J. Multiple Object Tracking in Satellite Video with Graph-Based Multi-Clue Fusion Tracker. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5639914. [Google Scholar]
  37. Gao, Y.; Xu, H.; Li, J.; Wang, N.; Gao, X. Multi-Scene Generalized Trajectory Global Graph Solver with Composite Nodes for Multiple Object Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 1842–1850. [Google Scholar]
  38. Lindenberger, P.; Sarlin, P.-E.; Pollefeys, M. LightGlue: Local Feature Matching at Light Speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023; pp. 17627–17638. [Google Scholar]
  39. Chen, H.; Luo, Z.; Zhang, J.; Zhou, L.; Bai, X.; Hu, Z.; Tai, C.-L.; Quan, L. Learning to Match Features with Seeded Graph Matching Network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 6301–6310. [Google Scholar]
  40. Shi, Y.; Cai, J.-X.; Shavit, Y.; Mu, T.-J.; Feng, W.; Zhang, K. ClusterGNN: Cluster-Based Coarse-to-Fine Graph Neural Network for Efficient Feature Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12517–12526. [Google Scholar]
  41. Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-Free Local Feature Matching with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8922–8931. [Google Scholar]
  42. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar]
  43. Moré, J.J. The Levenberg-Marquardt Algorithm: Implementation and Theory. In Numerical Analysis: Proceedings of the Biennial Conference Held at Dundee, Dundee, UK, 28 June–1 July 1977; Springer: Berlin/Heidelberg, Germany, 2006; pp. 105–116. [Google Scholar]
  44. Pineda, L.; Fan, T.; Monge, M.; Venkataraman, S.; Sodhi, P.; Chen, R.T.; Ortiz, J.; DeTone, D.; Wang, A.; Anderson, S.; et al. Theseus: A Library for Differentiable Nonlinear Optimization. Adv. Neural Inf. Process. Syst. 2022, 35, 3801–3818. [Google Scholar]
  45. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision Meets Robotics: The KITTI Dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  46. Bescos, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
  47. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
  48. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454. [Google Scholar]
  49. Teed, Z.; Deng, J. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. Adv. Neural Inf. Process. Syst. 2021, 34, 16558–16569. [Google Scholar]
  50. Teed, Z.; Deng, J. DeepV2D: Video to depth with differentiable structure from motion. arXiv 2018, arXiv:1812.04605. [Google Scholar]
Figure 1. Overall architecture of KOM-SLAM. The figure illustrates the tightly coupled Graph Neural Network (GNN) framework. It shows the flow from keypoints and object inputs through the GNN association module to the joint optimization backend for estimating ego-pose and object states.
Figure 2. (a) Overall architecture of the proposed GNN attention module. (b) Query, key, and value generation in the self-attention layer for intra-entity, intra-frame feature refinement. (c) Query, key, and value generation in the keypoint–object layer for cross-entity, intra-frame message passing between keypoints and objects. (d) Query, key, and value generation in the cross-attention layer for intra-entity, inter-frame feature exchange.
Figure 3. Keypoint and object correspondence example of our proposed framework on the KITTI Tracking dataset. Green dots and lines indicate matched keypoints and their correspondences; red rectangles and red lines represent detected objects and their associations.
Table 1. Ego-pose estimation results on the KITTI Tracking dataset (RPEt in m/frame, RPER in °/frame).

| seq | ORB-SLAM2 [1] RPEt | ORB-SLAM2 [1] RPER | DynaSLAM [46] RPEt | DynaSLAM [46] RPER | DynaSLAM2 [5] RPEt | DynaSLAM2 [5] RPER | KOM-SLAM RPEt | KOM-SLAM RPER |
|------|------|------|------|------|------|------|------|------|
| 0000 | 0.04 | 0.06 | 0.04 | 0.06 | 0.04 | 0.06 | 0.04 | 0.07 |
| 0001 | 0.05 | 0.04 | 0.05 | 0.04 | 0.05 | 0.04 | 0.04 | 0.03 |
| 0002 | 0.04 | 0.03 | 0.04 | 0.03 | 0.04 | 0.02 | 0.03 | 0.00 |
| 0003 | 0.07 | 0.04 | 0.07 | 0.04 | 0.06 | 0.04 | 0.05 | 0.02 |
| 0004 | 0.07 | 0.06 | 0.07 | 0.06 | 0.07 | 0.06 | 0.06 | 0.07 |
| 0005 | 0.06 | 0.03 | 0.06 | 0.03 | 0.06 | 0.03 | 0.05 | 0.01 |
| 0006 | 0.02 | 0.04 | 0.02 | 0.04 | 0.02 | 0.01 | 0.01 | 0.01 |
| 0007 | 0.05 | 0.07 | 0.05 | 0.07 | 0.05 | 0.07 | 0.04 | 0.03 |
| 0008 | 0.08 | 0.04 | 0.08 | 0.04 | 0.10 | 0.04 | 0.07 | 0.03 |
| 0009 | 0.06 | 0.05 | 0.06 | 0.05 | 0.06 | 0.06 | 0.04 | 0.02 |
| 0010 | 0.07 | 0.04 | 0.07 | 0.04 | 0.07 | 0.03 | 0.06 | 0.02 |
| 0011 | 0.04 | 0.03 | 0.04 | 0.03 | 0.04 | 0.03 | 0.03 | 0.04 |
| 0013 | 0.04 | 0.05 | 0.04 | 0.05 | 0.04 | 0.04 | 0.03 | 0.02 |
| 0014 | 0.03 | 0.08 | 0.03 | 0.08 | 0.03 | 0.08 | 0.03 | 0.05 |
| 0018 | 0.05 | 0.03 | 0.05 | 0.03 | 0.05 | 0.02 | 0.04 | 0.00 |
| 0019 | 0.05 | 0.03 | 0.05 | 0.03 | 0.05 | 0.02 | 0.03 | 0.02 |
| 0020 | 0.11 | 0.07 | 0.05 | 0.04 | 0.07 | 0.04 | 0.04 | 0.01 |
| mean | 0.055 | 0.046 | 0.051 | 0.045 | 0.053 | 0.041 | 0.041 | 0.028 |
| std | 0.021 | 0.016 | 0.015 | 0.015 | 0.018 | 0.019 | 0.014 | 0.020 |

Bold values indicate the best performance for each metric in the corresponding row.
Table 2. Object pose estimation comparison on the KITTI Tracking dataset.

| seq/obj.id/class | DynaSLAM2 [5] 2D TP | DynaSLAM2 [5] 2D MOTP | DynaSLAM2 [5] BV TP | DynaSLAM2 [5] BV MOTP | DynaSLAM2 [5] 3D TP | DynaSLAM2 [5] 3D MOTP | KOM-SLAM 2D TP | KOM-SLAM 2D MOTP | KOM-SLAM BV TP | KOM-SLAM BV MOTP | KOM-SLAM 3D TP | KOM-SLAM 3D MOTP |
|------|------|------|------|------|------|------|------|------|------|------|------|------|
| 03/1/car | 50.00 | 71.79 | 39.34 | 56.61 | 38.53 | 48.20 | 95.90 | 86.85 | 95.90 | 75.52 | 97.54 | 66.19 |
| 05/31/car | 28.96 | 60.30 | 14.48 | 46.84 | 11.45 | 34.20 | 82.09 | 83.63 | 81.42 | 76.97 | 83.78 | 66.15 |
| 10/0/car | 81.63 | 73.51 | 70.41 | 47.60 | 68.37 | 40.28 | 98.29 | 89.54 | 98.29 | 76.98 | 98.63 | 68.84 |
| 11/0/car | 72.65 | 74.78 | 61.66 | 50.74 | 52.58 | 47.35 | 97.85 | 87.71 | 97.85 | 84.42 | 98.12 | 74.04 |
| 11/35/car | 53.17 | 65.25 | 19.05 | 31.95 | 6.35 | 26.02 | 77.34 | 84.09 | 71.09 | 84.74 | 76.56 | 74.43 |
| 18/2/car | 86.36 | 74.81 | 67.05 | 45.47 | 62.12 | 34.80 | 93.18 | 89.84 | 93.18 | 80.92 | 93.18 | 74.46 |
| 18/3/car | 53.33 | 70.94 | 21.75 | 41.45 | 16.84 | 35.80 | 85.96 | 86.20 | 85.61 | 77.06 | 85.96 | 64.50 |
| 19/63/car | 35.26 | 63.50 | 29.48 | 45.69 | 26.48 | 33.89 | 54.91 | 90.89 | 53.76 | 84.65 | 55.49 | 77.93 |
| 19/72/car | 29.11 | 62.59 | 29.43 | 55.48 | 29.43 | 39.81 | 12.34 | 75.67 | 12.03 | 73.30 | 12.03 | 61.74 |
| 20/0/car | 63.68 | 78.54 | 43.78 | 45.00 | 31.84 | 46.15 | 85.00 | 90.33 | 85.50 | 74.98 | 86.00 | 65.06 |
| 20/12/car | 42.77 | 76.77 | 37.64 | 49.29 | 36.23 | 40.81 | 91.91 | 88.24 | 90.05 | 73.67 | 91.76 | 62.03 |
| 20/122/car | 34.90 | 78.76 | 34.51 | 48.05 | 29.02 | 44.43 | 90.98 | 83.21 | 77.25 | 72.15 | 91.37 | 60.27 |
| mean | 52.65 | 70.96 | 39.05 | 47.01 | 34.10 | 39.31 | 80.48 | 86.35 | 78.94 | 77.95 | 80.87 | 67.97 |
| std | 19.01 | 6.20 | 17.82 | 6.13 | 28.30 | 6.36 | 23.49 | 4.09 | 23.47 | 4.40 | 23.69 | 5.63 |

Bold values indicate the best performance for each metric in the corresponding row.
Table 3. Ablation study on ego pose estimation (RPEt in m/frame, RPER in °/frame).

| seq | No Keypoint–Object Layer RPEt | No Keypoint–Object Layer RPER | No GNN RPEt | No GNN RPER | No Gating Layer RPEt | No Gating Layer RPER | No Soft Assignment RPEt | No Soft Assignment RPER | KOM-SLAM RPEt | KOM-SLAM RPER |
|------|------|------|------|------|------|------|------|------|------|------|
| 0000 | 0.04 | 0.07 | 0.04 | 0.10 | 0.04 | 0.08 | 0.04 | 0.09 | 0.04 | 0.07 |
| 0001 | 0.04 | 0.15 | 0.04 | 0.01 | 0.04 | 0.09 | 0.04 | 0.02 | 0.04 | 0.03 |
| 0002 | 0.03 | 0.00 | 0.03 | 0.06 | 0.03 | 0.03 | 0.03 | 0.01 | 0.03 | 0.00 |
| 0003 | 0.06 | 0.02 | 0.06 | 0.04 | 0.06 | 0.06 | 0.06 | 0.03 | 0.05 | 0.02 |
| 0004 | 0.06 | 0.08 | 0.06 | 0.07 | 0.07 | 0.07 | 0.07 | 0.10 | 0.06 | 0.07 |
| 0005 | 0.05 | 0.10 | 0.05 | 0.01 | 0.05 | 0.03 | 0.05 | 0.14 | 0.05 | 0.01 |
| 0006 | 0.01 | 0.06 | 0.01 | 0.06 | 0.01 | 0.01 | 0.02 | 0.11 | 0.01 | 0.01 |
| 0007 | 0.04 | 0.10 | 0.04 | 0.18 | 0.04 | 0.29 | 0.05 | 0.09 | 0.04 | 0.03 |
| 0008 | 0.07 | 0.03 | 0.07 | 0.07 | 0.07 | 0.05 | 0.07 | 0.04 | 0.07 | 0.03 |
| 0009 | 0.04 | 0.03 | 0.04 | 0.05 | 0.04 | 0.03 | 0.04 | 0.05 | 0.04 | 0.02 |
| 0010 | 0.06 | 0.15 | 0.06 | 0.04 | 0.06 | 0.03 | 0.06 | 0.05 | 0.06 | 0.02 |
| 0011 | 0.03 | 0.15 | 0.03 | 0.12 | 0.03 | 0.03 | 0.04 | 0.20 | 0.03 | 0.04 |
| 0013 | 0.03 | 0.08 | 0.03 | 0.05 | 0.03 | 0.04 | 0.03 | 0.06 | 0.03 | 0.02 |
| 0014 | 0.03 | 0.09 | 0.03 | 0.09 | 0.03 | 0.07 | 0.03 | 0.10 | 0.03 | 0.05 |
| 0018 | 0.04 | 0.11 | 0.04 | 0.03 | 0.04 | 0.03 | 0.04 | 0.03 | 0.04 | 0.00 |
| 0019 | 0.03 | 0.17 | 0.03 | 0.11 | 0.03 | 0.11 | 0.03 | 0.05 | 0.03 | 0.02 |
| 0020 | 0.04 | 0.03 | 0.04 | 0.00 | 0.04 | 0.02 | 0.04 | 0.01 | 0.04 | 0.01 |
| mean | 0.041 | 0.084 | 0.041 | 0.064 | 0.042 | 0.062 | 0.043 | 0.069 | 0.041 | 0.028 |
| std | 0.015 | 0.050 | 0.015 | 0.044 | 0.015 | 0.062 | 0.014 | 0.049 | 0.014 | 0.020 |

Bold values indicate the best performance for each metric in the corresponding row.
Table 4. Ablation study on object tracking.

| seq | Naive Approach MOTA | Naive Approach IDF1 | No GNN MOTA | No GNN IDF1 | No Keypoint–Object Layer MOTA | No Keypoint–Object Layer IDF1 | KOM-SLAM MOTA | KOM-SLAM IDF1 |
|------|------|------|------|------|------|------|------|------|
| 0000 | 1.00 | 0.97 | 0.99 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 |
| 0001 | 0.94 | 0.91 | 0.96 | 0.91 | 0.99 | 0.94 | 0.99 | 0.96 |
| 0002 | 0.91 | 0.90 | 0.92 | 0.68 | 0.98 | 0.88 | 0.99 | 0.91 |
| 0003 | 0.59 | 0.59 | 0.97 | 0.89 | 0.97 | 0.95 | 1.00 | 1.00 |
| 0004 | 0.70 | 0.70 | 0.99 | 0.97 | 0.98 | 0.98 | 1.00 | 0.99 |
| 0005 | 0.77 | 0.74 | 0.97 | 0.92 | 0.97 | 0.95 | 1.00 | 0.99 |
| 0006 | 0.25 | 0.25 | 0.99 | 0.95 | 0.99 | 0.85 | 0.99 | 0.87 |
| 0007 | 0.95 | 0.95 | 0.99 | 0.96 | 1.00 | 0.99 | 1.00 | 1.00 |
| 0008 | 0.32 | 0.32 | 0.97 | 0.80 | 0.85 | 0.71 | 1.00 | 0.92 |
| 0009 | 0.83 | 0.80 | 0.97 | 0.87 | 0.98 | 0.93 | 0.99 | 0.94 |
| 0010 | 0.72 | 0.70 | 0.94 | 0.88 | 0.96 | 0.96 | 1.00 | 1.00 |
| 0011 | 0.87 | 0.87 | 0.96 | 0.83 | 0.98 | 0.93 | 0.99 | 0.95 |
| 0013 | 1.00 | 0.99 | 0.95 | 0.85 | 0.91 | 0.84 | 0.95 | 0.92 |
| 0014 | 0.89 | 0.88 | 0.89 | 0.82 | 0.90 | 0.85 | 0.92 | 0.88 |
| 0018 | 0.73 | 0.69 | 0.99 | 0.81 | 0.96 | 0.87 | 0.99 | 0.88 |
| 0019 | 0.99 | 0.99 | 0.95 | 0.75 | 0.96 | 0.75 | 0.97 | 0.82 |
| 0020 | 0.92 | 0.92 | 0.98 | 0.92 | 0.99 | 0.91 | 0.99 | 0.97 |
| mean | 0.79 | 0.77 | 0.96 | 0.88 | 0.96 | 0.90 | 0.99 | 0.94 |
| std | 0.22 | 0.21 | 0.027 | 0.077 | 0.039 | 0.079 | 0.021 | 0.054 |

Bold values indicate the best performance for each metric in the corresponding row.
Table 5. Ablation study on data density and time gap (RPEt in m/frame, RPER in °/frame).

| seq | Sparser Keypoints RPEt | Sparser Keypoints RPER | Every 2 Frames RPEt | Every 2 Frames RPER | KOM-SLAM RPEt | KOM-SLAM RPER |
|------|------|------|------|------|------|------|
| 0000 | 0.04 | 0.08 | 0.07 | 0.28 | 0.04 | 0.07 |
| 0001 | 0.04 | 0.02 | 0.08 | 0.14 | 0.04 | 0.03 |
| 0002 | 0.03 | 0.06 | 0.04 | 0.05 | 0.03 | 0.00 |
| 0003 | 0.06 | 0.13 | 0.09 | 0.12 | 0.05 | 0.02 |
| 0004 | 0.07 | 0.08 | 0.12 | 0.31 | 0.06 | 0.07 |
| 0005 | 0.05 | 0.01 | 0.06 | 0.01 | 0.05 | 0.01 |
| 0006 | 0.02 | 0.01 | 0.04 | 0.10 | 0.01 | 0.01 |
| 0007 | 0.05 | 0.08 | 0.14 | 0.33 | 0.04 | 0.03 |
| 0008 | 0.07 | 0.05 | 0.09 | 0.13 | 0.07 | 0.03 |
| 0009 | 0.05 | 0.05 | 0.10 | 0.25 | 0.04 | 0.02 |
| 0010 | 0.06 | 0.09 | 0.08 | 0.09 | 0.06 | 0.02 |
| 0011 | 0.03 | 0.01 | 0.05 | 0.13 | 0.03 | 0.04 |
| 0013 | 0.03 | 0.05 | 0.05 | 0.19 | 0.03 | 0.02 |
| 0014 | 0.03 | 0.09 | 0.06 | 0.14 | 0.03 | 0.05 |
| 0018 | 0.04 | 0.00 | 0.08 | 0.04 | 0.04 | 0.00 |
| 0019 | 0.03 | 0.07 | 0.06 | 0.07 | 0.03 | 0.02 |
| 0020 | 0.04 | 0.07 | 0.09 | 0.06 | 0.04 | 0.01 |
| mean | 0.043 | 0.055 | 0.076 | 0.14 | 0.041 | 0.028 |
| std | 0.015 | 0.034 | 0.026 | 0.094 | 0.014 | 0.020 |

Bold values indicate the best performance for each metric in the corresponding row.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, J.; Tian, Y.; Gu, Y.; Kamijo, S. KOM-SLAM: A GNN-Based Tightly Coupled SLAM and Multi-Object Tracking Framework. Sensors 2026, 26, 128. https://doi.org/10.3390/s26010128

