Abstract
To address the problem of degraded positioning accuracy in traditional visual–inertial navigation systems (VINS) due to interference from moving objects in dynamic scenarios, this paper proposes an improved algorithm based on the VINS-Fusion framework, which resolves this issue through a synergistic combination of multi-scale feature optimization and real-time dynamic feature elimination. First, at the feature extraction front-end, the SuperPoint encoder structure is reconstructed. By integrating dual-branch multi-scale feature fusion and 1 × 1 convolutional channel compression, it simultaneously captures shallow texture details and deep semantic information, enhances the discriminative ability of static background features, and reduces mis-elimination near dynamic–static boundaries. Second, in the dynamic processing module, the ASORT (Adaptive Simple Online and Realtime Tracking) algorithm is designed. This algorithm combines an object detection network, adaptive Kalman filter-based trajectory prediction, and a Hungarian algorithm-based matching mechanism to identify moving objects in images in real time, filter out their associated dynamic feature points from the optimized feature point set, and ensure that only reliable static features are input to the backend optimization, thereby minimizing pose estimation errors caused by dynamic interference. Experiments on the KITTI dataset demonstrate that, compared with the original VINS-Fusion algorithm, the proposed method achieves an average improvement of approximately 14.8% in absolute trajectory accuracy, with an average single-frame processing time of 23.9 milliseconds. This validates that the proposed approach provides an efficient and robust solution for visual–inertial navigation in highly dynamic environments.
1. Introduction
In recent years, computer vision technology has developed rapidly, but purely visual localization relies on a single data source and is prone to interference from lighting changes and object occlusion. To address these issues, VIO (Visual–Inertial Odometry) technology has emerged [1,2,3]. By fusing data from visual cameras and an IMU (Inertial Measurement Unit), VIO leverages the long-term robustness of vision [4] and the short-term high-precision characteristics of the IMU [5], effectively compensating for the shortcomings of single sensors and making it a mainstream localization solution [6].
Classic algorithms based on the filtering framework, such as MSCKF [7] and ROVIO [8], achieve state estimation through recursive approximation and have advantages in computational efficiency. Optimization-based algorithms such as ORB-SLAM3 [9], VINS-Mono [10], and VINS-Fusion [11] adopt a tightly coupled front-end and back-end architecture, using nonlinear optimization to improve global consistency, and have become the mainstream direction of current research. Among them, VINS-Fusion, a representative work in the VINS series, fuses stereo vision and IMU data to achieve centimeter-level localization accuracy in static or low-dynamic scenarios and is widely used in autonomous driving and robot navigation [12].
However, traditional VIO algorithms still face significant challenges in dynamic and complex scenarios. When the carrier is in traffic-dense urban environments, streets with heavy pedestrian flow, or indoor spaces with frequent object movements, visual sensors tend to capture interfering features from moving objects such as vehicles and pedestrians. These dynamic features are incorrectly used for pose estimation, leading to increased cumulative errors or even localization failure [13]. The core contradiction of this problem lies in the fact that traditional VIO assumes all feature points come from static backgrounds during feature processing, lacking an active mechanism to identify and eliminate dynamic interferences [14]. Therefore, enhancing adaptability to dynamic scenarios from the source of feature extraction while efficiently filtering dynamic interfering features has become the key to improving the performance of dynamic VIO.
Feature point extraction is a core part of the VIO front-end, and its performance directly affects the reliability of subsequent matching and optimization results [15]. Traditional feature extraction methods, such as SIFT [16] and SURF [17], rely on manually designed feature descriptors. Although they perform stably in static scenarios, they have high computational complexity and insufficient robustness to lighting and viewing angle changes [18]. Fast feature point algorithms represented by ORB [19] reduce computational complexity through binary descriptors but sacrifice the semantic expressiveness of features. This makes them prone to mismatching due to texture similarity in dynamic scenarios.
In recent years, deep learning-based feature extraction methods have automatically extracted features with high semantic value and invariance through self-supervised learning, significantly improving matching accuracy in complex scenarios [20]. Among these, SuperPoint [21], a representative work, adopts an encoder–decoder structure. It learns local key points and descriptors of images through self-supervised training, achieving better matching repeatability and localization accuracy than traditional methods in static scenarios [22]. However, the encoder design of SuperPoint still has limitations: the fusion of shallow and deep features is insufficient, making it difficult to effectively distinguish between static background and dynamic foreground feature points in dynamic scenarios. The surface texture of dynamic objects (such as moving vehicles) may contain rich local details, causing SuperPoint to misclassify them as reliable features, which are then incorrectly used for pose estimation.
To solve the interference problem of dynamic features, object tracking and multi-object motion estimation technologies have been introduced into the dynamic processing module of VIO. The SORT (Simple Online and Realtime Tracking) algorithm [23] realizes multi-object motion estimation through a “detection-tracking” framework. It first uses an object detection network to output bounding boxes of dynamic objects, then predicts object trajectories via Kalman filtering, and finally completes cross-frame matching based on the Hungarian algorithm. SORT has advantages in real-time performance and simplicity, but the Kalman filtering model with fixed noise covariance it adopts cannot adapt to the complexity of dynamic scenarios [24]. This leads to the accumulation of trajectory prediction errors, which in turn affects the elimination effect of dynamic features. Therefore, designing a more robust motion estimation model and combining feature-level and matching-level dynamic filtering mechanisms has become a key technology in the dynamic processing module of dynamic VIO.
To address the above challenges, this paper proposes an improved algorithm integrating multi-scale feature optimization and real-time dynamic point elimination based on the VINS-Fusion framework. In the front-end feature extraction stage, the encoder structure of SuperPoint is reconstructed to enhance its ability to distinguish static background features. In the dynamic processing stage, the ASORT (Adaptive Simple Online and Realtime Tracking) algorithm is designed; combining adaptive Kalman filtering and multi-source information matching, it achieves accurate tracking of dynamic objects and efficient elimination of interfering features. Experiments show that the method significantly improves localization accuracy in dynamic scenarios while maintaining real-time performance, providing a new solution for visual–inertial navigation in complex environments.
2. Methods
VINS-Fusion, as a classic stereo visual–inertial odometry algorithm, fuses data from stereo cameras and IMUs through a tight coupling approach. This framework mainly consists of three parts: front-end data processing, back-end nonlinear optimization, and loop closure detection, with its specific structure shown in Figure 1.
Figure 1.
Overall structure of the proposed algorithm.
The front-end is responsible for extracting and tracking feature points from the stereo camera, while pre-integrating IMU data. The back-end performs pose estimation using nonlinear optimization methods, combined with the data provided by the front-end. The loop closure detection module uses a bag-of-words model to detect whether the carrier has returned to a previous position, so as to reduce cumulative errors. However, in dynamically complex scenarios, the traditional feature point extraction method in the front-end is easily disturbed by moving objects, resulting in a decrease in localization accuracy. To solve this problem, this paper proposes to introduce an optimized SuperPoint feature point extractor and a dynamic feature point elimination algorithm based on the VINS-Fusion framework, so as to improve the robustness and accuracy of the system in dynamic scenarios.
2.1. Optimization of SuperPoint Encoder with Multi-Scale Feature Fusion and Channel Compression
SuperPoint is a feature point extraction and description network based on self-supervised learning, whose core is an encoder–decoder structure that performs feature point detection and description. The algorithm uses a convolutional neural network to automatically learn key point positions and descriptors without manually annotated data. Its self-supervised training strategy enables it to adapt to variations in illumination and viewing angle, exhibiting excellent matching accuracy and robustness in static scenarios. The algorithm structure is shown in Figure 2:
Figure 2.
Architecture of the SuperPoint algorithm.
The algorithm accepts single-channel grayscale images or three-channel RGB (Red, Green, Blue) images converted to grayscale by weighted averaging, followed by normalization to eliminate the interference of illumination fluctuations on feature extraction. The shared encoder adopts a VGG (Visual Geometry Group)-style architecture: it takes the preprocessed grayscale image as input, employs 8 convolutional layers, 3 max-pooling layers, and nonlinear activation layers, and outputs a feature map of size $\frac{H}{8} \times \frac{W}{8} \times 128$ for an $H \times W$ input. Its primary function is to extract multi-scale features from the input image; these features are then passed to two decoders dedicated to key point detection and descriptor generation, respectively. The 128-dimensional feature map output by the encoder is fed simultaneously into the key point decoding module and the descriptor decoding module, enabling parallel execution of key point detection and descriptor generation and ultimately yielding two types of core results: first, a set of key points, each with sub-pixel coordinates and a detection confidence, where the confidence value can be used to further filter high-reliability key points; second, 128-dimensional normalized descriptor vectors in one-to-one correspondence with the key points, providing quantitative support for cross-frame feature matching.
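For readers who prefer code, this front-end interface can be pictured as the following minimal PyTorch-style sketch; the submodule names (`encoder`, `detector_head`, `descriptor_head`) and the confidence threshold are illustrative assumptions, not the released SuperPoint implementation.

```python
# Minimal PyTorch-style sketch of the SuperPoint front-end interface.
# The submodule names (encoder, detector_head, descriptor_head) and the
# confidence threshold are illustrative assumptions, not the released code.
import cv2
import numpy as np
import torch


def preprocess(bgr_image: np.ndarray) -> torch.Tensor:
    """Weighted grayscale conversion followed by normalization to [0, 1]."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    tensor = torch.from_numpy(gray).float() / 255.0
    return tensor[None, None]  # shape (1, 1, H, W)


@torch.no_grad()
def extract(superpoint, image: torch.Tensor, conf_thresh: float = 0.015):
    """Run the shared encoder once and both decoder heads in parallel."""
    feats = superpoint.encoder(image)          # shared (1, 128, H/8, W/8) features
    heatmap = superpoint.detector_head(feats)  # key-point confidence map
    descriptors = superpoint.descriptor_head(feats)  # dense 128-D descriptors
    keypoints = (heatmap.squeeze() > conf_thresh).nonzero()  # confident locations
    return keypoints, descriptors
```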
The encoder of SuperPoint suffers from insufficient fusion of shallow texture features and deep semantic features, which leads to easy confusion between static background and dynamic foreground features in dynamic scenarios. To address this issue, this paper reconstructs the encoder architecture through three-fold optimizations: dual-layer cascaded feature extraction, channel concatenation fusion, and 1 × 1 convolution dimensionality reduction, as shown in Figure 3. The red part indicates the improved sections of the original encoder proposed in this paper.
Figure 3.
Encoder structure optimization diagram.
The original encoder's VGG-style serial convolutional architecture was designed to extract deep semantic information through layer-by-layer abstraction, but this structure has inherent drawbacks in dynamic scenes: the high-resolution texture details captured by shallow convolutions are gradually diluted as they propagate through the deeper layers. This imbalance between semantics and detail directly causes confusion of features near dynamic–static boundaries; when the algorithm attempts to eliminate dynamic feature points, static background features are prone to being misjudged as interference and wrongly removed, reducing the number of effective feature points and ultimately degrading the positioning performance of the VIO system in dynamic environments. Fundamentally, the single-path serial structure cannot meet the requirements of multi-scale feature extraction, which motivates the architectural reconstruction described below.
To simultaneously capture local details and global semantics of the image, the improved encoder is designed with two parallel feature extraction branches, which output the third-layer feature map $F_3$ and the fourth-layer feature map $F_4$, respectively, forming a multi-scale information foundation. The third-layer branch focuses on high-frequency detail such as local textures and edges. After the input image of size $H \times W$ is processed by the first two convolutional layers, the branch outputs the feature map
$$F_3 = \{\, F_3(i,j) \in \mathbb{R}^{128} \,\},$$
in which each spatial position corresponds to a 128-dimensional feature vector and $F_3(i,j)$ denotes the feature representation at position $(i,j)$.
The fourth-layer branch captures global semantic information of the image by enlarging the receptive field, providing a semantic basis for distinguishing dynamic foreground regions from static background regions. Its structure is consistent with that of the third-layer branch, and its output feature map is expressed as
$$F_4 = \{\, F_4(i,j) \in \mathbb{R}^{128} \,\},$$
where $F_4(i,j)$ denotes the feature representation at position $(i,j)$.
Either $F_3$ or $F_4$ alone is one-sided: using only $F_3$ tends to misjudge the local texture of dynamic objects as static features, while using only $F_4$ loses the fine positional information of static features. To address this, channel concatenation fuses the 128-dimensional vector of $F_3$ and the 128-dimensional vector of $F_4$ along the channel dimension to generate a 256-dimensional fused feature vector, achieving complementarity between detail and semantics:
$$F_{\mathrm{fused}}(i,j) = \mathrm{Concat}\big(F_3(i,j),\, F_4(i,j)\big) \in \mathbb{R}^{256}.$$
The overall fused feature map is expressed as
$$F_{\mathrm{fused}} = \{\, F_{\mathrm{fused}}(i,j) \,\}.$$
Although the fused $F_{\mathrm{fused}}$ has improved feature expression capability, its 256 channels would sharply increase the subsequent computational load, which is inconsistent with real-time navigation requirements. To address this, a 1 × 1 convolution is used for channel dimensionality reduction, outputting a feature vector at each position $(i,j)$:
$$F_{\mathrm{out}}(i,j) = \sigma\big(W * F_{\mathrm{fused}}(i,j) + b\big),$$
where $W$ represents the convolution kernel weight, $b$ is the bias, $\sigma$ denotes the ReLU activation function, and $*$ stands for convolution.
When expanded by components, the output of the $c$-th channel is
$$F_{\mathrm{out}}^{(c)}(i,j) = \sigma\Big(\textstyle\sum_{c'=1}^{256} W_{c',c}\, F_{\mathrm{fused}}^{(c')}(i,j) + b_c\Big),$$
where $W_{c',c}$ is the weight parameter of the convolution kernel from input channel $c'$ to output channel $c$. The compressed feature map is finally expressed as
$$F_{\mathrm{out}} = \{\, F_{\mathrm{out}}(i,j) \in \mathbb{R}^{128} \,\}.$$
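The fusion and compression steps above can be summarized in a short PyTorch sketch; the channel counts follow the text (128 per branch, 256 after concatenation, 128 after the 1 × 1 convolution), while the module name and the optional upsampling of the deeper branch are assumptions for illustration.

```python
# PyTorch sketch of the dual-branch fusion and 1x1 channel compression.
# Channel counts follow the text (128 + 128 -> 256 -> 128); the module name
# and the optional upsampling of the deeper branch are illustrative assumptions.
import torch
import torch.nn as nn


class FusedEncoderHead(nn.Module):
    def __init__(self, branch_channels: int = 128, out_channels: int = 128):
        super().__init__()
        self.reduce = nn.Conv2d(2 * branch_channels, out_channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f3: torch.Tensor, f4: torch.Tensor) -> torch.Tensor:
        # f3: shallow texture branch, f4: deeper semantic branch.
        if f3.shape[-2:] != f4.shape[-2:]:
            # Align spatial resolutions if the deeper branch is smaller.
            f4 = nn.functional.interpolate(
                f4, size=f3.shape[-2:], mode="bilinear", align_corners=False)
        fused = torch.cat([f3, f4], dim=1)    # channel concatenation -> 256 channels
        return self.relu(self.reduce(fused))  # 1x1 convolution -> 128 channels
```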
This method achieves collaborative optimization of multi-scale feature extraction and computational efficiency by reconstructing the SuperPoint encoder architecture. The improved encoder adopts a parallel branch design to simultaneously extract shallow texture features and deep structural features. This multi-scale feature fusion mechanism significantly enhances the algorithm’s ability to capture stable feature points in static backgrounds, while suppressing interference from dynamic objects through the semantic feature branch. The improvement effect of this method on localization accuracy will be further verified through subsequent ablation experiments in this paper.
In terms of computational efficiency optimization, the introduced 1 × 1 convolution channel compression technology reduces network parameters and computational load by reconstructing feature dimensions while maintaining the expressive capability of feature maps. This optimization strategy improves the feature extraction speed of the algorithm and reduces computational complexity, providing an efficient solution for real-time localization applications.
2.2. Dynamic Feature Point Elimination Based on Adaptive Kalman Filtering
The traditional SORT algorithm achieves multi-target tracking through object detection, Kalman filter prediction, and Hungarian matching, but it has two limitations in dynamic scenarios: first, it uses a standard Kalman filter, which can only handle linear systems and cannot adapt to the nonlinear motion of dynamic objects; second, it employs a fixed noise covariance model, making it difficult to cope with disturbances such as sudden changes in illumination and target occlusion, resulting in tracking drift. To address these issues, this paper proposes the ASORT algorithm, which handles nonlinear motion through an adaptive Kalman filter (AKF), enhances robustness via adaptive observation noise adjustment, efficiently completes target detection and tracking, and filters dynamic feature points from the global feature point set. The complete process of dynamic feature point elimination is shown in Figure 4.
Figure 4.
Dynamic feature point rejection flowchart.
As a front-end preprocessing module of the VIO system, ASORT identifies and eliminates dynamic feature points before they are input into the back-end optimization: it detects and tracks dynamic targets across frames and removes the feature points associated with them from the optimized feature point set.
2.2.1. ASORT Algorithm
This paper adopts the YOLOv8-n lightweight target detection network, which has a small number of parameters and fast inference speed, making it suitable for embedded devices and low-power scenarios. The network reduces computational load through depth-wise separable convolution while retaining the ability to detect small targets. Its output includes the observation vector of the target at time $k$, a 4-dimensional vector corresponding to the parameters of the target bounding box:
$$\mathbf{z}_k = [\, x_k^{z},\ y_k^{z},\ w_k^{z},\ h_k^{z} \,]^{T},$$
where $(x_k^{z}, y_k^{z})$, $w_k^{z}$, and $h_k^{z}$ correspond to the observed center coordinates, width, and height of the bounding box, respectively. In addition, the network outputs a detection confidence, which is used to measure the reliability of the detection result.
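A hedged sketch of how such observations might be obtained with the Ultralytics YOLOv8 Python API is shown below; the weight file name, the class filter, and the helper name are assumptions for illustration rather than the authors' exact pipeline.

```python
# Hedged sketch of obtaining z_k = [x, y, w, h]^T and its confidence via the
# Ultralytics YOLOv8 API; weight file, class filter, and helper name are
# illustrative assumptions, not necessarily the authors' exact pipeline.
import numpy as np
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")
DYNAMIC_CLASSES = {0, 1, 2, 3, 5, 7}  # person, bicycle, car, motorcycle, bus, truck


def detect_dynamic_objects(frame: np.ndarray):
    """Return (z_k, confidence) pairs for potentially dynamic objects."""
    result = detector(frame, verbose=False)[0]
    observations = []
    for box in result.boxes:
        if int(box.cls) in DYNAMIC_CLASSES:
            x, y, w, h = box.xywh[0].tolist()  # center x, center y, width, height
            observations.append((np.array([x, y, w, h]), float(box.conf)))
    return observations
```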
First, state space modeling is performed, and the state vector of the target at time $k$ is defined as
$$\mathbf{x}_k = [\, x_k,\ y_k,\ w_k,\ h_k,\ \dot{x}_k,\ \dot{y}_k,\ \dot{w}_k,\ \dot{h}_k,\ \ddot{x}_k,\ \ddot{y}_k \,]^{T},$$
where $x_k$ and $y_k$ represent the horizontal and vertical coordinates of the center of the target bounding box; $w_k$ and $h_k$ denote its width and height; $\dot{x}_k$ and $\dot{y}_k$ stand for the velocity components of the center point in the $x$ and $y$ directions; $\dot{w}_k$ and $\dot{h}_k$ indicate the rates of change of the width and height; and $\ddot{x}_k$ and $\ddot{y}_k$ represent the acceleration components of the center point in the $x$ and $y$ directions.
The dynamic behavior of the target can be expressed as
$$\mathbf{x}_k = f(\mathbf{x}_{k-1}) + \mathbf{w}_{k-1},$$
where $\mathbf{w}_{k-1}$ is the process noise, assumed to be zero-mean Gaussian white noise with covariance matrix $Q$. The process noise covariance matrix $Q$ is set as a constant diagonal matrix based on the constant-velocity model assumption, with its diagonal elements configured according to the maximum acceleration of the target.
The state transition function $f(\cdot)$ propagates the bounding-box position, size, velocity, and acceleration states over the inter-frame time interval $\Delta t$ according to standard kinematic relations.
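As an illustration, a constant-acceleration kinematic model consistent with the state vector defined above would propagate the state as follows; this is a hedged reconstruction, and the exact transition used by the authors may differ in detail.

```latex
% Illustrative constant-acceleration kinematics consistent with the state
% vector above; a hedged reconstruction, not necessarily the exact form used.
\begin{aligned}
x_k &= x_{k-1} + \dot{x}_{k-1}\,\Delta t + \tfrac{1}{2}\,\ddot{x}_{k-1}\,\Delta t^{2}, &
y_k &= y_{k-1} + \dot{y}_{k-1}\,\Delta t + \tfrac{1}{2}\,\ddot{y}_{k-1}\,\Delta t^{2}, \\
\dot{x}_k &= \dot{x}_{k-1} + \ddot{x}_{k-1}\,\Delta t, &
\dot{y}_k &= \dot{y}_{k-1} + \ddot{y}_{k-1}\,\Delta t, \\
w_k &= w_{k-1} + \dot{w}_{k-1}\,\Delta t, &
h_k &= h_{k-1} + \dot{h}_{k-1}\,\Delta t.
\end{aligned}
```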
The observation vector $\mathbf{z}_k$ of the system at time $k$ and the state vector $\mathbf{x}_k$ are related by the nonlinear observation function $h(\cdot)$:
$$\mathbf{z}_k = h(\mathbf{x}_k) + \mathbf{v}_k,$$
where $\mathbf{v}_k$ is the observation noise, assumed to be zero-mean Gaussian white noise with covariance matrix $R$. The initial observation noise covariance matrix $R$ is set to a fixed value, preset according to the characteristics of the dataset during algorithm initialization, consistent with the processing of the classic SORT algorithm.
Based on the state estimate $\hat{\mathbf{x}}_{k-1|k-1}$ at time $k-1$ and its estimation error covariance matrix $P_{k-1|k-1}$, the state prior $\hat{\mathbf{x}}_{k|k-1}$ and covariance prior $P_{k|k-1}$ at time $k$ are predicted:
$$\hat{\mathbf{x}}_{k|k-1} = f\big(\hat{\mathbf{x}}_{k-1|k-1}\big), \qquad P_{k|k-1} = F_k P_{k-1|k-1} F_k^{T} + Q,$$
where the state transition matrix $F_k$ is the Jacobian of $f(\cdot)$ evaluated at $\hat{\mathbf{x}}_{k-1|k-1}$.
After obtaining the observation $\mathbf{z}_k$ at time $k$, the state update is performed.
Calculate the Kalman gain:
$$K_k = P_{k|k-1} H_k^{T} \big( H_k P_{k|k-1} H_k^{T} + \tilde{R}_k \big)^{-1},$$
where $\tilde{R}_k$ is the effective observation noise covariance defined below and $H_k$ is the Jacobian matrix of the observation function $h(\cdot)$ evaluated at the predicted state $\hat{\mathbf{x}}_{k|k-1}$:
$$H_k = \left.\frac{\partial h}{\partial \mathbf{x}}\right|_{\hat{\mathbf{x}}_{k|k-1}}.$$
Update the state estimate:
$$\hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k|k-1} + K_k\big(\mathbf{z}_k - h(\hat{\mathbf{x}}_{k|k-1})\big).$$
Update the estimation error covariance matrix:
$$P_{k|k} = \big(I - K_k H_k\big) P_{k|k-1}.$$
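The prediction and update steps can be condensed into the following NumPy sketch; the model functions `f`, `h` and their Jacobians `F_jac`, `H_jac` are assumed to be supplied by the caller, and the code illustrates the standard EKF cycle rather than the authors' implementation.

```python
# NumPy sketch of the predict/update cycle above. The model functions f, h and
# their Jacobians F_jac, H_jac are supplied by the caller; this illustrates the
# standard EKF steps rather than reproducing the authors' implementation.
import numpy as np


def ekf_predict(x, P, f, F_jac, Q):
    x_pred = f(x)                       # state prior
    F = F_jac(x)
    P_pred = F @ P @ F.T + Q            # covariance prior
    return x_pred, P_pred


def ekf_update(x_pred, P_pred, z, h, H_jac, R_eff):
    H = H_jac(x_pred)
    S = H @ P_pred @ H.T + R_eff          # innovation covariance with effective noise
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))  # state update
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred  # covariance update
    return x_new, P_new
```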
This paper proposes a mechanism for adaptively adjusting the observation noise covariance, whose core idea is to dynamically adjust the effective observation noise level according to the confidence of the current observation:
$$\tilde{R}_k = \lambda_k R,$$
where $\tilde{R}_k$ is the adjusted effective observation noise covariance matrix at time $k$, and $\lambda_k$ is the adaptive factor for the current observation. The adaptive factor comprehensively considers the target motion state and the lighting conditions, combining the velocity components $\dot{x}_k$ and $\dot{y}_k$, the velocity weight factor $\beta$, the current illumination value $L$, the ideal illumination value $L_{\mathrm{ideal}}$, the illumination weight $\delta$, and the illumination sensitivity adjustment coefficient $l$. When the target moves violently or the illumination condition is poor, the detection result is usually unreliable: the adaptive factor $\lambda_k$ increases, the confidence decreases, and ASORT automatically increases the effective measurement noise covariance $\tilde{R}_k$. In other words, the weight of the current unreliable measurement is reduced during the state update, making the filtering result rely more on the predicted value and thereby improving the adaptability and robustness of the algorithm in complex dynamic environments.
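The sketch below illustrates one plausible multiplicative form of this adaptation; the exact combination rule and the ideal illumination value are assumptions, while the default values of `beta`, `delta`, and `l` follow the optima reported in Section 3.4.

```python
# One plausible multiplicative form of the adaptive noise scaling: the
# effective covariance grows when the target moves fast or the illumination
# deviates from an ideal value. The combination rule and lum_ideal are
# assumptions; beta, delta, l defaults follow the optima in Section 3.4.
import numpy as np


def adaptive_noise(R0: np.ndarray, vx: float, vy: float, lum: float,
                   beta: float = 0.5, delta: float = 0.3, l: float = 1.1,
                   lum_ideal: float = 128.0) -> np.ndarray:
    speed_term = beta * np.hypot(vx, vy)                          # penalize fast motion
    light_term = delta * (abs(lum - lum_ideal) / lum_ideal) ** l  # penalize poor lighting
    lam = 1.0 + speed_term + light_term                           # adaptive factor >= 1
    return lam * R0                                               # effective covariance
```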
In the ASORT dynamic target tracking framework, the Hungarian algorithm serves as the core engine for multi-target association. It achieves accurate association between detection boxes and predicted trajectories through a global optimal matching strategy, with its specific process as follows:
First, two sets are determined. The first is the detection box set
$$D = \{\, d_1, d_2, \ldots, d_m \,\},$$
where $d_i$ denotes the state vector of the $i$-th dynamic object bounding box in the current frame output by the target detection network.
In addition, the predicted trajectory set is determined:
$$T = \{\, t_1, t_2, \ldots, t_n \,\},$$
where $t_j$ denotes the state vector of the $j$-th predicted bounding box generated from the tracking result of the previous frame by the AKF (Adaptive Kalman Filter). Then, the IOU (Intersection over Union) is used to measure the spatial overlap between a detection box and a predicted box:
$$\mathrm{IoU}(d_i, t_j) = \frac{\big| B_{d_i} \cap B_{t_j} \big|}{\big| B_{d_i} \cup B_{t_j} \big|}.$$
The numerator is the intersection area of the two boxes, and the denominator is their union area. A larger value indicates a higher degree of spatial consistency.
To adapt to the goal of the Hungarian algorithm, which is to minimize the total cost, the matching cost function is defined as
$$c_{ij} = 1 - \mathrm{IoU}(d_i, t_j).$$
A cost matrix $C$ of dimension $m \times n$ is generated, with element $c_{ij}$. The optimal matching is then solved on the cost matrix $C$, and the set of optimal matching pairs is output.
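A compact sketch of the IoU cost construction and Hungarian assignment, using `scipy.optimize.linear_sum_assignment`, is given below; the minimum-IoU gate is a SORT convention assumed here rather than a value stated in the text.

```python
# Sketch of the IoU cost matrix and Hungarian assignment using SciPy.
# Boxes are (cx, cy, w, h) arrays; the minimum-IoU gate is a SORT convention
# assumed here, not a value stated in the text.
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def associate(detections, tracks, iou_gate=0.3):
    """Return matched (det_idx, trk_idx) pairs plus unmatched index sets."""
    cost = np.array([[1.0 - iou(d, t) for t in tracks] for d in detections])
    det_idx, trk_idx = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(det_idx, trk_idx) if 1.0 - cost[i, j] >= iou_gate]
    unmatched_det = set(range(len(detections))) - {i for i, _ in matches}
    unmatched_trk = set(range(len(tracks))) - {j for _, j in matches}
    return matches, unmatched_det, unmatched_trk
```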
The matching results are divided into three categories:
For successfully matched pairs $(d_i, t_j)$, the detection box $d_i$ is used as a new observation and input into the AKF corresponding to $t_j$ to update its state vector and covariance matrix.
If there are unmatched detection boxes $d_i$, they are regarded as new targets, and new trajectories are initialized with their initial states set from $d_i$.
If there are unmatched predicted trajectories $t_j$, a loss counter is activated; if the number of consecutive lost frames exceeds the set threshold, the trajectory is deleted, otherwise it is retained to await subsequent matching.
Finally, the set of all valid bounding boxes and motion states in the current frame are output.
2.2.2. Dynamic Feature Point Elimination
Based on the motion state of each bounding box, filtering is applied directly to the full set of feature points extracted by the optimized SuperPoint front-end, according to the following rules:
If the bounding box is stationary or no bounding box is detected, all feature points extracted by SuperPoint are retained. If a moving bounding box is detected, the set of feature points extracted by SuperPoint is traversed, and the points whose pixel coordinates are within the dynamic bounding box are eliminated.
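This rule can be expressed directly in code; the sketch below assumes key points are given as pixel coordinates and boxes as center–size tuples.

```python
# Direct sketch of the filtering rule: drop key points whose pixel coordinates
# fall inside any bounding box flagged as moving by ASORT.
import numpy as np


def filter_dynamic_points(keypoints: np.ndarray, moving_boxes) -> np.ndarray:
    """keypoints: (N, 2) array of (u, v); moving_boxes: iterable of (cx, cy, w, h)."""
    keep = np.ones(len(keypoints), dtype=bool)
    for cx, cy, w, h in moving_boxes:      # no moving boxes -> everything is kept
        inside = ((np.abs(keypoints[:, 0] - cx) <= w / 2) &
                  (np.abs(keypoints[:, 1] - cy) <= h / 2))
        keep &= ~inside                    # eliminate points inside the dynamic box
    return keypoints[keep]
```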
This filtering mechanism can efficiently remove dynamic interfering features. In the KITTI-08 sequence test, the dynamic feature point elimination rate reaches 92.3%, providing purely static background feature inputs for the back-end optimization of visual–inertial navigation and effectively reducing positioning error in dynamic scenarios. The effect is shown in Figure 5, where the green box on the right marks a detected dynamic object; after the feature points inside it are eliminated during feature extraction, only static feature points are retained.
Figure 5.
Comparative effect diagram of dynamic feature point rejection.
3. Experiments
To evaluate the algorithm performance, experiments are conducted in this paper on the public KITTI dataset. The experiments mainly focus on localization accuracy, algorithm robustness, real-time performance, and comparisons with existing mainstream algorithms.
3.1. Experimental Setup
The public KITTI dataset was selected for experimental validation. Focusing on outdoor dynamic traffic scenarios, KITTI provides data from an on-board stereo camera and an IMU. This paper tests 8 sequences (00, 02, 05, 06, 07, 08, 09, 10) that contain dynamic objects such as moving vehicles and pedestrians. The comparison algorithm is VINS-Fusion, an optimization-based VIO algorithm. The evaluation metrics are the ATE (Absolute Trajectory Error) and the processing time. The experimental hardware platform consists of an Intel Core i7-9700K CPU @ 3.60 GHz and an NVIDIA GeForce RTX 3060 Ti GPU, running Ubuntu 18.04. This setup covers the requirements for verifying the algorithm's accuracy and real-time capability in outdoor dynamic scenarios.
The ATE is a core metric for measuring the global absolute accuracy of localization algorithms, used to quantify the overall deviation between the estimated trajectory and the ground-truth trajectory. In the experiment, the estimated trajectory output by the algorithm and the ground-truth trajectory provided by the KITTI dataset were first obtained. A transformation $S$ from the estimated poses to the reference poses was then calculated by least squares to align the estimated positions with the reference positions. For the $i$-th timestamp in the navigation trajectory, let the 3D coordinates of the reference trajectory be $\mathbf{p}_i^{\mathrm{ref}}$ and the 3D coordinates of the trajectory to be evaluated be $\mathbf{p}_i^{\mathrm{est}}$. The Euclidean distance between the aligned estimated position and the reference position is calculated for each timestamp to obtain the APE (Absolute Position Error) at each moment:
$$\mathrm{APE}_i = \big\| \mathbf{p}_i^{\mathrm{ref}} - S\,\mathbf{p}_i^{\mathrm{est}} \big\|_2 .$$
Let there be a total of $N$ timestamps. The RMSE (Root Mean Square Error) of the APE across all timestamps is computed to obtain the ATE:
$$\mathrm{ATE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \mathrm{APE}_i^{2}} .$$
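Assuming the estimated trajectory has already been aligned to the ground truth by the least-squares transform described above, the metric reduces to a few lines of NumPy:

```python
# NumPy sketch of the ATE computation, assuming the estimated positions have
# already been aligned to the ground truth by the least-squares transform.
import numpy as np


def ate_rmse(p_est_aligned: np.ndarray, p_ref: np.ndarray) -> float:
    """Both inputs are (N, 3) position arrays with matched timestamps."""
    ape = np.linalg.norm(p_est_aligned - p_ref, axis=1)  # per-timestamp APE
    return float(np.sqrt(np.mean(ape ** 2)))             # RMSE over all timestamps
```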
Ultimately, the ATE reflects the cumulative error of the algorithm during long-term operation. It directly indicates the degree of consistency between the localization results and the real environment; even if local feature matching is disturbed by dynamic objects, the ATE can reveal long-term stability defects of the system through global trajectory comparison.
3.2. Analysis of Localization Accuracy in Dynamic Scenarios
The performance of the proposed algorithm and the baseline algorithm VINS-Fusion in terms of localization accuracy on the KITTI dataset is presented in Table 1.
Table 1.
ATE Comparison on the KITTI Dataset.
In the dynamic sequences of the KITTI dataset (00, 02, and 05 to 10), the ATE of the proposed algorithm is significantly lower than that of VINS-Fusion. The average ATE of VINS-Fusion in these sequences is approximately 12.44 m, while the average ATE of the proposed method is about 10.59 m, representing an average improvement of approximately 14.8% in localization accuracy. Among these sequences, the optimization effect is most prominent in the KITTI-08 sequence: the ATE decreases from 14.14 m to 10.55 m, a localization accuracy improvement of about 25.4%, as shown in Figure 6. For the KITTI-02 sequence, the ATE is reduced from 28.53 m to 22.15 m, an improvement of approximately 22.4%, as illustrated in Figure 7. These results verify the effectiveness of the dynamic point elimination mechanism and multi-scale feature optimization in suppressing dynamic interference and improving global localization accuracy. It should be noted that the degree of improvement varies across sequences, which is closely related to their dynamic characteristics; for example, some sequences have sparse dynamic objects with little mutual occlusion, while others feature dense dynamic objects and frequent interactions. Despite these differences, the proposed algorithm outperforms VINS-Fusion in all dynamic sequences, confirming its stable effectiveness in mitigating dynamic interference.
Figure 6.
(a) Trajectory comparison of the KITTI-08 sequence; (b) ATE comparison of the KITTI-08 sequence.
Figure 7.
(a) Trajectory comparison of the KITTI-02 sequence; (b) ATE comparison of the KITTI-02 sequence.
3.3. Validation of Module Effectiveness via Ablation Experiments
To explore the independent contributions and synergistic effects of the two core modules—the optimized SuperPoint feature extractor and the ASORT dynamic point elimination algorithm—ablation experiments were conducted on the representative sequences (00, 02, 08) of the KITTI dataset. Two experimental configurations were designed for comparison: the first one, denoted as SV, integrates the optimized SuperPoint while disabling the ASORT dynamic point elimination algorithm; the second one, labeled as YKV, incorporates the ASORT dynamic point elimination algorithm but disables the optimized SuperPoint, instead adopting the original feature point extraction algorithm. The experimental results are detailed in Table 2. By comparing the performances of the SV configuration, YKV configuration, and the proposed algorithm in this paper, the independent effectiveness of each core module and the synergistic gain generated by their combination were analyzed in a quantitative manner.
Table 2.
Ablation Study on ATE Comparison.
The experimental results demonstrate that SV, YKV, and the proposed algorithm (which integrates the two modules) all significantly reduce the ATE in dynamic scenarios, with an overall performance gradient of "each module effective alone, synergy more optimal". When SV operates independently, its ATE values on the three sequences are 16.52 m, 23.64 m, and 12.16 m, respectively, achieving an average improvement of approximately 13.8% in localization accuracy over the VINS-Fusion baseline. This validates that multi-scale feature optimization effectively enhances the robustness of static background features. When YKV operates independently, its ATE values are 16.24 m, 25.44 m, and 13.26 m, respectively, with an average improvement of about 9.6% over the baseline. This indicates that dynamic point elimination alone has a positive effect on localization accuracy; however, because of the limitations of the original feature extraction, its performance in complex scenarios is weaker than that of SV. The proposed algorithm, which combines the two modules, further amplifies these advantages: its ATE values are 15.57 m, 22.16 m, and 10.52 m, respectively, representing an average improvement of approximately 21.1% over the VINS-Fusion baseline, an additional 7.3% over SV alone, and an extra 11.5% over YKV alone.
3.4. Parameter Sensitivity Analysis
To verify the impact of the core parameters ($\beta$, $\delta$, $l$) of the adaptive noise adjustment mechanism in the ASORT algorithm on system performance, a parameter sensitivity experiment was conducted on three representative sequences (00, 02, 08) of the KITTI dataset. The ATE was adopted as the evaluation metric, and the control variable method was employed: the other parameters were fixed at their optimal values while the target parameter was adjusted individually. The parameter value ranges were determined by combining conventional intervals in the field with the characteristics of the dataset: the motion weight factor $\beta$ was set to [0.4, 0.8] (step size 0.1), which covers the statistical distribution of the motion speeds of dynamic targets in the KITTI dataset, enabling full verification of parameter adaptability under different motion intensities; the illumination weight factor $\delta$ was set to [0.2, 0.6] (step size 0.1), referring to the conventional weight allocation range for illumination interference evaluation in computer vision while matching the proportions of different illumination scenarios in the KITTI dataset; and the illumination sensitivity coefficient $l$ was set to [1.0, 1.4] (step size 0.1), determined from the variation of detection confidence when the illumination intensity deviates from the ideal value in the KITTI dataset, which effectively verifies the robustness of the illumination response sensitivity. The results of the parameter sensitivity analysis are presented in Table 3.
Table 3.
Results of Parameter Sensitivity Analysis.
The maximum performance fluctuation is calculated as (maximum ATE when the parameter deviates from the optimal value − optimal ATE) / optimal ATE × 100%. As can be seen from Table 3, when the motion weight $\beta$ is adjusted within [0.4, 0.8] with a step size of 0.1, the ATE values of sequences 00, 02, and 08 first decrease and then slightly increase. Specifically, KITTI-00 and KITTI-08 achieve their optimal ATE values of 16.24 m and 13.26 m, respectively, at $\beta$ = 0.5, while KITTI-02 reaches its minimum of 25.31 m at $\beta$ = 0.6, with a maximum performance fluctuation of only 2.9%. This indicates that $\beta$ primarily balances the observation weights of high-speed and low-speed targets without exerting a decisive impact on the core performance of the algorithm. For the illumination weight $\delta$ adjusted within [0.2, 0.6] (step size 0.1), the optimal values vary across sequences: KITTI-00 and KITTI-02 obtain their optimal ATE of 16.24 m and 25.44 m, respectively, at $\delta$ = 0.3, while KITTI-08 achieves its lowest ATE of 13.20 m at $\delta$ = 0.4, and the maximum fluctuation does not exceed 3.3%, suggesting that stable performance can be achieved by matching the distribution of illumination scenarios in the dataset. When the illumination sensitivity $l$ is tuned within [1.0, 1.4] (step size 0.1), the optimal values are likewise sequence-specific: KITTI-00 and KITTI-02 reach their optimal ATE of 16.24 m and 25.44 m, respectively, at $l$ = 1.1, while KITTI-08 achieves its minimum of 13.18 m at $l$ = 1.2, with a maximum fluctuation of less than 2.6%, demonstrating that the adaptive adjustment logic can effectively mitigate illumination interference even with slight parameter deviations. Overall, when the core parameters fluctuate within reasonable ranges at a step size of 0.1, the maximum ATE fluctuation of the algorithm is only 3.3%, far smaller than the 9.6% accuracy improvement achieved by the standalone ASORT configuration in the ablation experiment (Section 3.3). This confirms that the core advantage of the ASORT algorithm originates from its adaptive adjustment logic rather than the precise tuning of individual parameters, and that the algorithm is robust to parameter perturbations.
3.5. Real-Time Performance Analysis
Real-time performance is a critical requirement for the practical application of visual–inertial navigation systems. During the experiment, the average single-frame image processing time of the proposed method and the comparison algorithm VINS-Fusion when processing each dataset sequence was recorded, and the results are presented in Table 4.
Table 4.
Comparison of Average Single-Frame Computation Time.
It can be concluded from the table that the average single-frame processing time of the proposed algorithm in KITTI outdoor scenarios is approximately 23.9 ms, which meets the real-time performance requirements. Compared with the VINS-Fusion algorithm, the processing time of the proposed method increases by an average of about 9.3 ms. This increment mainly stems from two aspects: first, the improved SuperPoint feature extraction network is more computationally complex than the original front-end of VINS-Fusion; second, the ASORT dynamic point elimination module introduces additional computational load. In terms of the trade-off between efficiency and accuracy, although the computational time increases, the proposed method achieves a significant improvement in localization accuracy in dynamic scenarios, as shown in the previous accuracy analysis. Therefore, the increased time consumption can be regarded as a reasonable cost for obtaining higher accuracy and robustness, and future work will focus on optimizing the computational efficiency of these two modules.
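For reference, per-frame averages of this kind can be collected with a simple timing harness such as the sketch below, where `process_frame` stands in for the full per-frame pipeline and is a placeholder name.

```python
# Simple per-frame timing harness of the kind used to obtain averages such as
# those in Table 4; process_frame stands in for the full per-frame pipeline.
import time


def average_frame_time_ms(frames, process_frame) -> float:
    total = 0.0
    for frame in frames:
        start = time.perf_counter()
        process_frame(frame)  # front-end + dynamic elimination + back-end update
        total += time.perf_counter() - start
    return 1000.0 * total / len(frames)  # milliseconds per frame
```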
4. Conclusions
To address the issue of reduced localization accuracy in VIO algorithms caused by moving object interference in dynamic scenarios, this paper proposes an improved algorithm integrating multi-scale feature optimization and real-time dynamic point elimination. The conclusions are as follows:
First, to address the limitation of insufficient discrimination between static and dynamic features in the original SuperPoint, we optimized the SuperPoint encoder by designing a dual-branch multi-scale feature fusion structure and integrating 1 × 1 convolutional channel compression. This optimization not only improves the quality of static background feature extraction but also refines the discrimination of dynamic–static boundary features, reducing the misjudgment of boundary features that are easily confused in dynamic scenarios. Experimentally, this is reflected in a reduction of invalid feature points input to the back-end, laying a foundation for more accurate pose estimation. It is verified by comparing the SV configuration (integrating only the optimized SuperPoint while disabling the ASORT dynamic point elimination algorithm) with the VINS-Fusion baseline in the ablation experiments: in the KITTI-00, 02, and 08 sequences, the ATE values of the SV configuration are 16.52352 m, 23.63574 m, and 12.15743 m, respectively, an average improvement of approximately 13.8% over the VINS-Fusion baseline. This validates the role of the encoder optimization in enhancing the quality of static features and improving localization accuracy.
Second, the proposed ASORT algorithm addresses the tracking drift of the traditional SORT algorithm in dynamic scenarios. By adaptively adjusting the observation noise covariance (incorporating the target motion state and illumination conditions) and performing robust multi-target matching via the Hungarian algorithm, it achieves more reliable real-time detection and tracking of dynamic objects in each frame, thereby accurately filtering out feature points associated with dynamic objects. This module reduces the interference of dynamic features on pose estimation, as demonstrated by comparing the YKV configuration (integrating only the ASORT dynamic point elimination algorithm while disabling the optimized SuperPoint and adopting the original feature point extraction) with the VINS-Fusion baseline in the ablation experiments: in the KITTI-00, 02, and 08 sequences, the ATE values of the YKV configuration are 16.24376 m, 25.44375 m, and 13.25846 m, respectively, an average improvement of approximately 9.6% over the VINS-Fusion baseline. This proves the effectiveness of ASORT in suppressing dynamic interference.
Third, the integration of the two modules forms a "feature optimization-dynamic purification" collaborative mechanism: the optimized SuperPoint encoder provides high-quality feature candidates, while the ASORT algorithm further purifies these candidates by removing dynamic features, ensuring that only reliable static features participate in subsequent feature matching and back-end nonlinear optimization. This synergy translates directly into improved localization accuracy, as verified by the experimental results in Section 3.2: in the dynamic sequences (00, 02, 05–10) of the KITTI dataset, the ATE of the fused algorithm is significantly lower than that of VINS-Fusion. The average ATE decreases from approximately 12.44 m (baseline) to 10.59 m, an average improvement of about 14.8%. Specifically, in the KITTI-08 sequence the ATE decreases from 14.14232 m to 10.54754 m (an improvement of approximately 25.4%), and in the KITTI-02 sequence it decreases from 28.53246 m to 22.14632 m (an improvement of approximately 22.4%). Meanwhile, the ablation experiments show that the average improvement of the fused algorithm (21.1%) is significantly higher than the independent contributions of the SV configuration (13.8%) and the YKV configuration (9.6%), fully demonstrating the synergistic gain of the "feature optimization-dynamic purification" mechanism.
Although the “feature optimization-dynamic purification” collaborative framework proposed in this paper demonstrates effectiveness in dynamic VIO, it still has three core limitations: First, the computational complexity is relatively high. The multi-scale feature fusion in the improved SuperPoint encoder, along with the AKF and Hungarian matching mechanisms in the ASORT algorithm, introduce additional computational burdens, making real-time performance susceptible to constraints in resource-constrained scenarios. Second, there is strong module dependency. The algorithm’s performance relies on the accuracy of the target detection network, and the effect of dynamic feature elimination may be limited when deploying lightweight models on embedded devices. Third, the adaptability to dynamic scenes is limited. The robustness of ASORT against sudden extreme motions and complex occlusion scenarios still needs improvement.
To address these issues, future research will focus on three aspects: First, lightweight model design, which involves compressing the SuperPoint encoder and target detection network through neural architecture search, adaptive pruning, and dynamic inference mechanisms to balance accuracy and efficiency. Second, algorithm optimization, including constructing an end-to-end dynamic feature processing framework and developing cross-module knowledge distillation techniques to reduce dependence on a single module. Third, enhancing dynamic adaptability by introducing an attention-based motion prediction model, multimodal fusion strategies, and an adaptive computing framework to improve robustness in extreme scenarios and expand hardware adaptability. These directions are expected to overcome current limitations and promote the practical application of dynamic VIO technology in broader scenarios.
Author Contributions
Conceptualization, H.D. and J.Y.; Methodology, J.L.; Software, X.L. (Xin Li) and J.Y.; Validation, X.L. (Xueying Liu), J.L. and H.D.; Formal Analysis, J.L.; Investigation, J.Y.; Resources, H.D.; Data Curation, X.L. (Xueying Liu), J.L. and X.L. (Xin Li); Writing—Original Draft Preparation, H.D. and J.Y.; Writing—Review and Editing, J.L.; Visualization, X.L. (Xin Li); Supervision, H.D.; Project Administration, J.Y.; Funding Acquisition, H.D. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Natural Science Foundation of Shandong Province, grant number ZR2017MF036; the National Defense Science and Technology Project Fund, grant number F062102009; and the Youth Innovation Team Program of Shandong Provincial Colleges and Universities, grant number 2020KJN003. The APC was funded by the authors’ affiliated institutions.
Institutional Review Board Statement
Ethical review and approval were waived for this study due to the fact that the research only involves algorithm development and simulation tests of visual-inertial odometry, without involving human subjects or animal experiments.
Informed Consent Statement
Not applicable, as the study did not involve humans.
Data Availability Statement
Data are contained within the article.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| VIO | Visual–Inertial Odometry |
| IMU | Inertial Measurement Unit |
| SORT | Simple Online and Realtime Tracking |
| ASORT | Adaptive Simple Online and Realtime Tracking |
| RGB | Red, Green, Blue |
| VGG | Visual Geometry Group |
| IOU | Intersection over Union |
| ATE | Absolute Trajectory Error |
| APE | Absolute Position Error |
| RMSE | Root Mean Square Error |
References
- Placed, J.A.; Strader, J.; Carrillo, H.; Atanasov, N.; Indelman, V.; Carlone, L.; Castellanos, J.A. A survey on active simultaneous localization and mapping: State of the art and new frontiers. IEEE Trans. Robot. 2023, 39, 1686–1705. [Google Scholar] [CrossRef]
- Taketomi, T.; Uchiyama, H.; Ikeda, S. Visual SLAM algorithms: A survey from 2010 to 2016. IPSJ Trans. Comput. Vis. Appl. 2017, 9, 16. [Google Scholar] [CrossRef]
- Yuan, H.; Han, K.; Lou, B. Mix-VIO: A visual inertial odometry based on a hybrid tracking strategy. Sensors 2024, 24, 5218. [Google Scholar] [CrossRef] [PubMed]
- Kazerouni, I.A.; Fitzgerald, L.; Dooly, G.; Toal, D. A survey of state-of-the-art on visual SLAM. Expert Syst. Appl. 2022, 205, 117734. [Google Scholar] [CrossRef]
- Kochman, M.T.; Kielar, A.; Kasprzak, M.; Kasperek, W.; Dutko, M.; Vellender, A.; Przysada, G.; Drużbicki, M. A Reliability Study of Small, Portable, Easy-to-Use, and IMU-Based Sensors for Gait Assessment. Sensors 2025, 25, 6597. [Google Scholar] [CrossRef] [PubMed]
- Chen, W.; Shang, G.; Ji, A.; Zhou, C.; Wang, X.; Xu, C.; Li, Z.; Hu, K. An overview on visual slam: From tradition to semantic. Remote Sens. 2022, 14, 3010. [Google Scholar] [CrossRef]
- Ma, F.; Shi, J.; Yang, Y.; Li, J.; Dai, K. ACK-MSCKF: Tightly-coupled Ackermann multi-state constraint Kalman filter for autonomous vehicle localization. Sensors 2019, 19, 4816. [Google Scholar] [CrossRef] [PubMed]
- Bloesch, M.; Omari, S.; Hutter, M.; Siegwart, R. Robust visual inertial odometry using a direct EKF-based approach. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 298–304. [Google Scholar]
- Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
- Qin, T.; Li, P.; Shen, S. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
- Xu, B.; Dai, Q. Enhanced VINS-fusion-based SLAM algorithm for reliable camera pose estimation in GNSS-challenging environments. In Proceedings of the 2023 IEEE International Conference on Unmanned Systems (ICUS), Hefei, China, 13–15 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 234–239. [Google Scholar]
- Su, Y.; Wang, T.; Yao, C.; Shao, S.; Wang, Z. GR-SLAM: Vision-based sensor fusion SLAM for ground robots on complex terrain. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; IEEE: Piscataway, NJ, USA, 2020; pp. 5096–5103. [Google Scholar]
- Wang, S.; Hu, Q.; Zhang, X.; Li, W.; Wang, Y.; Zheng, E. LVID-SLAM: A Lightweight Visual-Inertial SLAM for Dynamic Scenes Based on Semantic Information. Sensors 2025, 25, 4117. [Google Scholar] [CrossRef] [PubMed]
- Mao, H.; Luo, J. PLY-SLAM: Semantic Visual SLAM Integrating Point–Line Features with YOLOv8-seg in Dynamic Scenes. Sensors 2025, 25, 3597. [Google Scholar] [CrossRef] [PubMed]
- Nam, D.V.; Gon-Woo, K. Robust stereo visual inertial navigation system based on multi-stage outlier removal in dynamic environments. Sensors 2020, 20, 2922. [Google Scholar] [CrossRef] [PubMed]
- Panchal, P.; Panchal, S.; Shah, S. A comparison of SIFT and SURF. Int. J. Innov. Res. Comput. Commun. Eng. 2013, 1, 323–327. [Google Scholar]
- Oyallon, E.; Rabin, J. An analysis of the SURF method. Image Process. Line 2015, 5, 176–218. [Google Scholar] [CrossRef]
- Julià, L.F.; Monasse, P. A critical review of the trifocal tensor estimation. In Pacific-Rim Symposium on Image and Video Technology; Springer: Cham, Switzerland, 2017; pp. 337–349. [Google Scholar]
- Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 2564–2571. [Google Scholar]
- Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
- DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
- Liu, T.; Wan, G.; Bai, H.; Kong, X.; Tang, B.; Wang, F. Real-time video stabilization algorithm based on superpoint. IEEE Trans. Instrum. Meas. 2023, 73, 1–13. [Google Scholar] [CrossRef]
- Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3464–3468. [Google Scholar]
- Nasseri, M.H.; Moradi, H.; Hosseini, R.; Babaee, M. Simple online and real-time tracking with occlusion handling. arXiv 2021, arXiv:2103.04147. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.