Accurate Location in Dynamic Traffic Environment Using Semantic Information and Probabilistic Data Association

Highly accurate, real-time localization is a fundamental and challenging task for autonomous driving in dynamic traffic environments. This paper presents a coordinated positioning strategy that combines semantic information with probabilistic data association to improve the accuracy of SLAM in dynamic traffic settings. First, the improved semantic segmentation network, building on Fast-SCNN, uses the Res2net module instead of the Bottleneck in the global feature extraction to further explore multi-scale granular features. It balances segmentation accuracy and inference speed, leading to consistent performance gains on the coordinated localization task of this paper. Second, a novel scene descriptor combining geometric, semantic, and distributional information is proposed. These descriptors are built from significant features and their surroundings, which may be unique to a traffic scene, and are used to improve data association quality. Finally, a probabilistic data association is created to find the best estimate using a maximum measurement expectation model. This approach assigns semantic labels to landmarks observed in the environment and is used to correct false negatives in data association. We evaluate our system against ORB-SLAM2 and DynaSLAM, two state-of-the-art algorithms, to demonstrate its advantages. On the KITTI dataset, the results show that our approach outperforms the other methods in dynamic traffic situations, especially in highly dynamic scenes, with sub-meter average accuracy.


Introduction
Real-time localization in dynamic traffic environments is one of the essential technologies for unmanned autonomous vehicles (UAVs). The environment is highly dynamic, with many participants and significant scene changes. Simultaneous localization and mapping (SLAM) is often used to solve the problem of autonomous localization in unknown environments. It determines the current location of the autonomous vehicle based on the surrounding environment data observed by the sensors. The ability to deal with dynamic situations and changes, according to [1], is a significant problem for autonomous driving localization. Traditional SLAM systems assume that all objects in the environment remain static. These SLAM systems use outlier filtering approaches [2] and robust implicit penalties [3] to deal with dynamic environments, while Kerr et al. [4] show that these methods are only robust in low-dynamic circumstances. The research of [5] also demonstrates that real-time localization in dynamic environments is still unsolved and that the existing technical level needs to be improved further.
In recent years, deep learning has achieved great success in visual perception, and its inference speed and perception accuracy have brought consistent performance improvements in autonomous driving applications. Visual SLAM can be combined with deep learning to jointly complete the real-time positioning task of autonomous vehicles. In this paper, the semantic information obtained by deep learning is added to the visual SLAM system, and a coordinated localization method of semantic information and probabilistic data association is proposed to meet the challenge of real-time localization in dynamic traffic environments. The improved semantic segmentation network extracts multi-scale granular features to better understand and describe scene semantic information. To ensure the quality of data association, semantic information is used to eliminate the interference of dynamic feature points. However, strictly removing such interference may discard valid matching pairs in the data association to some extent. As a result, the expected measurement likelihood model is used in this work, which can identify the best estimate for the optimization model when the data is incomplete or contains unobserved latent data.
The main contributions of this paper can be summarized as follows: (1) The improved semantic segmentation network, building on Fast-SCNN, uses the Res2net module instead of the Bottleneck in the global feature extraction to further explore the multi-scale granular features. It achieves the balance between segmentation accuracy and inference speed, leading to consistent performance gains on the coordinated localization task of this paper. (2) The robust scene descriptor fuses geometric, semantic, and distributional information to improve the quality of data association. (3) The probabilistic data association is created to find the best estimate using a maximum measurement expectation model. This approach assigns semantic labels to landmarks observed in the environment and is used to correct false negatives in data association.

Related Work
Astonishing progress has been made in SLAM technology, enabling large-scale applications and witnessing the development of autonomous positioning. SLAM technology can be divided into LIDAR SLAM and Visual SLAM according to different sensors. Since LIDAR is expensive, low-cost cameras are more suitable for commercial promotion, and visual SLAM has developed rapidly with computer vision in recent years. As early as 1999, P.M. Newman [6] studied vision and SLAM-related issues and confirmed that visual SLAM could learn from machine vision-related research results. People thought that only stereo cameras could achieve visual SLAM for a long time until A.J. Davison [7] used monocular cameras to complete SLAM, creating monocular visual SLAM. The PTAM framework, the basic framework of visual SLAM, was proposed by Klein G. and Murray D. [8], which comprises two threads of tracking and mapping. The Track thread uses FAST [9] to extract features and initially estimate the camera pose, and the Map thread uses the Bundle Adjustment [10] algorithm to correct the pose estimation deviation. Raúl Mur-Artal et al. proposed ORB-SLAM [11], which adds map initialization and closed-loop detection functions to the PTAM framework, optimizes keyframes selection and map construction and has good processing speed and map accuracy. The ORB-SLAM2 version [12] supports monocular, binocular, and RGB-D interfaces. Moreover, the ORB-SLAM3 version [13] adds IMU coupling and supports the fisheye camera, which can run stably in real-time in small and large indoor and outdoor scenes. These classic SLAM systems show outstanding performance in static or low-dynamic environments but cannot get rid of the interference of dynamic objects in high-dynamic environments.
Visual SLAM in dynamic environments has become a hot research topic. The approaches used to eliminate the effects of dynamic objects can usually be divided into three groups: direct, feature point, and deep learning methods. Alcantarilla P.F. et al. [14] used dense disparity maps and dense optical flow between consecutive frames to estimate dense 3D scene flow, which they paired with motion likelihood to detect moving objects. This method improves localization and mapping in dense and dynamic situations by omitting erroneous measurements from the estimation. Jiyu Chenga et al. [15] employed the optical flow of consecutive frames to differentiate dynamic feature points in an image. Dynamic feature points are added to the feature map, while static feature points are passed to the visual SLAM system to ensure the accuracy of the pose estimate. Forster C et al. [16] utilized a direct technique to track and triangulate high-gradient pixels and motion information, together with a robust probabilistic depth estimation algorithm, to achieve greater accuracy in low-texture dynamic scenarios.
None of the above methods goes beyond traditional geometric reconstruction to improve the system's understanding of the environment. With the rapid development of deep learning technology, more and more research attempts to introduce deep learning into SLAM, and some work has achieved good results. These works can be roughly divided into two categories. The first category uses deep learning methods to replace some modules in traditional visual SLAM [17][18][19][20]. Zbontar J. and LeCun Y. [17] proposed a method for extracting depth information from image pairs. This method uses a convolutional neural network to learn image similarity and a binary classification dataset for stereo matching to retrieve depth information. A lightweight point tracking system was devised by DeTone D. et al. [18]. In this system, a neural network extracts the image's important 2D points, and another network predicts the homography of these points and matches them, boosting the tracking system's real-time performance. Garg R. et al. [19] introduced an unsupervised convolutional neural network for single-view depth prediction, which removes the need for manually annotated data. The network is comparable to other state-of-the-art methods in terms of performance. Borna Besic and Abhinav Valada [20] suggested an end-to-end deep learning architecture for filtering dynamic objects from RGB-D sequences and inpainting the regions occluded by dynamic objects. Specifically, the generative adversarial network uses a gated recurrent feedback method to improve temporal consistency by training the model from coarse to fine. The model also adjusts the depth of the images, ensuring geometric consistency throughout the inpainting architecture's end-to-end training.
The second category adds semantic information to classic SLAM technology by combining visual SLAM with deep learning [21][22][23][24]. DynaSLAM is a system proposed by Berta Bescos et al. [21] that uses deep learning and multi-view geometry to recognize dynamic objects, restore background frames, build static scene maps that reduce dynamic interference, and improve localization performance in dynamic situations. DynaSLAM II, according to Berta Bescos et al. [22], combines instance semantic segmentation and ORB features with object data association to add dynamic objects to Bundle Adjustment and track them. As a result, the environment around dynamic objects is better understood, and pose prediction is improved. DS-SLAM [23] combines a semantic segmentation network with motion consistency checking to decrease the influence of dynamic objects and generate dense semantic octree maps. Yuxiang Sun et al. [24] used motion segmentation to optimize the loss function, resulting in more accurate results. Nikolay Atanasov et al. [25] employ object detection to extract semantic information from sensors, create maps with semantic labels, and solve the semantic localization problem using ensemble-based Bayesian filters in polynomial time.
Precision localization in autonomous driving scenarios has gotten a lot of interest from industry and research in recent years. Peiliang Li et al. [26] employed 2D boxes and viewpoint classification to construct lightweight 3D box inference systems. The rough initial position is immediately derived from the 2D frame in this work. The dynamic target tracking is completed utilizing the BA optimization method combined with semantic and features information. Wentao Cheng et al. [27] employed semantic information in the route to address the autonomous vehicle localization challenge. The CenterNet network is used to detect road semantic features, key points represent lane lines and road signs, and semantic associations are used to optimize the overall state. Tong Qin et al. [28] developed a lightweight autonomous driving positioning framework that included vehicle-side mapping, cloud-based maintenance, and user-side positioning. Learning-based semantic segmentation is used to extract significant landmarks. The semantic landmarks are then converted to 3D and registered on the local map. The cloud server will receive the local map. The data collected by different vehicles are combined by the cloud server, which compresses the global semantic map. Finally, for localization, the compact map is delivered to production vehicles.

System Overview
In this paper, we examine the strengths and shortcomings of previous work and present a joint localization solution for dynamic traffic conditions. The technique relies on semantic priors, probabilistic data association, and a maximum expectation measurement estimation algorithm to achieve good pose estimation accuracy in the presence of unobserved latent data in varied dynamic traffic scenarios. Figure 1 depicts a high-level overview of the system framework. To accomplish pixel-level real-time semantic segmentation without losing accuracy, the video streams pass through an enhanced Fast-SCNN network. Based on the semantic information produced by the segmentation network, the system quickly removes dynamic feature points so that they do not degrade the quality of subsequent data association. In dynamic traffic conditions, richer scene descriptors that include geometric, semantic, and distributional information are used to increase localization accuracy. A maximum expectation measurement approach is used to estimate the best camera pose and landmark locations by assigning semantic labels to observed landmarks in the background using probabilistic data association.
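The culling step described above can be illustrated with a minimal sketch. It assumes the segmentation output is available as a per-pixel label map and that the set of dynamic categories is known; the label names in `DYNAMIC_LABELS` and the function `cull_dynamic_keypoints` are hypothetical and are not taken from the paper's implementation.

```python
# Hypothetical set of dynamic categories; the paper's Table 1 lists the real labels.
DYNAMIC_LABELS = {"vehicle", "human"}


def cull_dynamic_keypoints(keypoints, label_map, dynamic_labels=DYNAMIC_LABELS):
    """Keep only keypoints whose pixel does not carry a dynamic label.

    keypoints: iterable of (x, y) integer pixel coordinates.
    label_map: 2D list indexed as label_map[y][x], holding category names.
    """
    static = []
    for x, y in keypoints:
        if label_map[y][x] not in dynamic_labels:
            static.append((x, y))
    return static


# Toy 3x2 label map: the right-hand side of the image is covered by a vehicle.
label_map = [
    ["construction", "construction", "vehicle"],
    ["flat", "vehicle", "vehicle"],
]
kept = cull_dynamic_keypoints([(0, 0), (2, 0), (1, 1), (0, 1)], label_map)
print(kept)
```

Only the keypoints lying on static categories survive, so later matching never sees points on moving objects.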

Dynamic Objects Segmentation and Culling
The fact that dynamic feature points participate in matching and contribute to localization failure is one of the most common flaws of visual SLAM systems. Fast-SCNN [29], a dual-branch encoder-decoder network, achieves pixel-wise segmentation in real time, allowing dynamic objects to be quickly distinguished without slowing down the SLAM system. Its deep branch captures global context at low resolution, while its shallow two-layer branch learns spatial features at full input resolution. The four modules of the Fast-SCNN network (shown in Figure 2) are all built using depthwise separable convolutions, which gives fewer network parameters and faster segmentation at the cost of some segmentation accuracy. To address the coarse global feature extraction and imperfect segmentation results of this network, this work replaces the Bottleneck with the Res2net module [30] for multi-scale feature representation. The difference between Res2net and the Bottleneck block is shown in Figure 3. The Res2net module creates hierarchical residual connections within the residual block and can be plugged into the Fast-SCNN backbone to achieve consistent performance improvements. The Res2net module separates the input feature maps into several groups, uses the previous group's output map as part of the input for the next group, and then uses a 1 × 1 filter to fuse the feature maps of all groups. This module enlarges the receptive field of the network at all levels by extracting multi-scale features at a granular level. It also reduces the complexity of learning the correlations between object categories and improves the accuracy of classification boundaries.
The modified Fast-SCNN network benefits from hierarchical residual connections, which improve segmentation accuracy without adding too many network parameters. The new network has 1.27 million parameters, only 0.16 million more than the original. This ensures the network's applicability in dynamic traffic conditions. The network's performance is confirmed in experiment A, with segmentation accuracy 2.11 percent higher than the original version and an inference speed of 216.3 fps, striking a solid balance between inference speed and segmentation accuracy. Table 1 displays the segmented semantic labels, which include the majority of the object categories seen in dynamic traffic scenes.
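The hierarchical data flow of the Res2net module can be sketched without any deep learning framework. In the toy example below, each group is a plain list of floats, `transform` stands in for the per-group 3 × 3 convolution, and plain concatenation stands in for the 1 × 1 fusion; all names are illustrative assumptions, not the authors' PyTorch code.

```python
def res2net_flow(groups, transform):
    """Illustrate the Res2Net-style hierarchical connections.

    groups: list of per-group feature vectors (lists of floats).
    transform: stand-in for the convolution applied to each group.
    The first group passes through unchanged; every later group is
    transformed after adding the previous group's output (if any).
    """
    outputs = [list(groups[0])]
    prev = None
    for g in groups[1:]:
        inp = list(g) if prev is None else [a + b for a, b in zip(g, prev)]
        prev = transform(inp)
        outputs.append(prev)
    # A real block would fuse the groups with a 1x1 convolution;
    # here they are simply concatenated.
    return [v for out in outputs for v in out]


def halve(values):
    """Toy stand-in for a learned convolution."""
    return [0.5 * v for v in values]


fused = res2net_flow([[1.0], [2.0], [4.0], [8.0]], halve)
print(fused)
```

Because each group receives the output of the group before it, later groups see features that have passed through more transforms, which is how the module mixes several receptive-field sizes inside one block.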

Given the real-time requirement of traffic scene localization, the ORB [31] descriptor is used: it is rotation invariant, has a low computational cost, and can quickly extract and match scene features. FAST corners are extracted on multiple scales of the Gaussian pyramid, and the number of feature points extracted at each level follows a per-layer allocation strategy. Equation (1) expresses the allocation strategy for each layer, where $N$ is the total number of extracted feature points, $s$ is the scaling factor of the image pyramid, and $m$ is the number of image pyramid layers. The selection of feature points follows the principle that the pixel gray value changes beyond the threshold and the semantic label at the corner belongs to a static object. The selection criterion is expressed as:
$$m(x, y) = \omega_{ij} \sqrt{\left(L(x+1, y) - L(x-1, y)\right)^2 + \left(L(x, y+1) - L(x, y-1)\right)^2} \tag{2}$$
where $\omega_{ij}$ is the semantic label weight of the candidate feature point. The weight of dynamic categories is set to 0, and the weight of static targets is increased: the weights of the categories construction and object are increased threefold, while the weights of the other categories remain the same.
The method is simple and effective for extracting static feature points, increasing the proportion of feature points located on the construction and object categories and improving the quality of subsequent data association and the final positioning accuracy. When the feature points are concentrated in a local part of the image, the descriptor performs poorly. To this end, the quadtree algorithm [32] is used to distribute the feature points uniformly. The final feature point extraction result is shown in Figure 4. The ORB descriptor determines the orientation of feature points through intensity centroids and uses binary strings to describe the pixel variation of feature points and their neighborhoods.
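The two steps above can be sketched as follows. The per-level allocation is written here as the geometric scheme commonly used in ORB-based systems, which is an assumption because Equation (1) did not survive in the text; the weighted response assumes the square-root gradient-magnitude form of Equation (2). The weight table, the total of 2000 features, and the function names are illustrative only.

```python
import math


def features_per_level(n_total, scale, n_levels):
    """Geometric allocation: level i receives a share proportional to scale**i.

    Assumed form (not confirmed by the source text):
        N_i = N * (1 - s) / (1 - s**m) * s**i
    """
    base = n_total * (1.0 - scale) / (1.0 - scale ** n_levels)
    return [base * scale ** i for i in range(n_levels)]


# Hypothetical weights: dynamic classes 0, construction/object 3, others 1.
CATEGORY_WEIGHT = {"vehicle": 0.0, "human": 0.0, "construction": 3.0, "object": 3.0}


def weighted_response(gray, x, y, category):
    """Semantic-weighted gradient magnitude at pixel (x, y); gray[y][x] is intensity."""
    dx = gray[y][x + 1] - gray[y][x - 1]
    dy = gray[y + 1][x] - gray[y - 1][x]
    weight = CATEGORY_WEIGHT.get(category, 1.0)
    return weight * math.sqrt(dx * dx + dy * dy)


# m = 8 levels and s = 1/1.2 follow the experimental settings; N = 2000 is assumed.
quota = features_per_level(2000, 1 / 1.2, 8)

gray = [
    [0, 0, 0],
    [0, 0, 3],
    [0, 4, 0],
]
on_building = weighted_response(gray, 1, 1, "construction")
on_vehicle = weighted_response(gray, 1, 1, "vehicle")
print(quota, on_building, on_vehicle)
```

With a zero weight, a corner on a dynamic object can never exceed the selection threshold, while corners on construction and object categories are favored.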
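A quadtree-style uniformization can be sketched as below. This is a simplified illustration under stated assumptions, not the algorithm of [32] as implemented by the authors: the most populated node is split into four until the number of nodes reaches a target, and only the strongest point in each node is kept. Points are assumed to lie strictly inside the half-open bounds, and nodes narrower than one pixel are not split.

```python
def split(node):
    """Split a node into up to four children, dropping empty quadrants."""
    (x0, y0, x1, y1), pts = node
    mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    quads = [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]
    children = []
    for q in quads:
        inside = [p for p in pts if q[0] <= p[0] < q[2] and q[1] <= p[1] < q[3]]
        if inside:
            children.append((q, inside))
    return children


def quadtree_select(points, bounds, target):
    """points: list of (x, y, response); bounds: (x0, y0, x1, y1)."""
    nodes = [(bounds, list(points))]
    while len(nodes) < target:
        # Only nodes with several points and a width above one pixel can split.
        splittable = [n for n in nodes if len(n[1]) > 1 and n[0][2] - n[0][0] > 1.0]
        if not splittable:
            break
        biggest = max(splittable, key=lambda n: len(n[1]))
        nodes.remove(biggest)
        nodes.extend(split(biggest))
    # Keep the strongest response in every remaining node.
    return [max(pts, key=lambda p: p[2]) for _, pts in nodes]


clustered = [(1, 1, 0.9), (2, 1, 0.5), (1, 2, 0.4), (14, 14, 0.3)]
selected = quadtree_select(clustered, (0, 0, 16, 16), target=2)
print(selected)
```

In the toy input, three strong corners crowd one image region; after selection that region contributes a single point and the isolated corner is retained.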

Semantic Feature Descriptor
Descriptors based on visual geometric features cannot accurately describe dynamic traffic scenes because of visual aliasing or changes in visual appearance. Incorporating semantic information and distribution information into descriptors alleviates these problems. Descriptors that fuse semantic and geometric features balance uniqueness and robustness. They are not affected by perspective transformation and also resolve the difficulty of matching multiple feature points of the same category between different frames.

The semantic segmentation network extracts high-level semantic information at different levels, which becomes the raw data for constructing the semantic descriptor. The improved Fast-SCNN above is competitive in pixel-wise segmentation, and the extracted semantic information is represented as:
$$S_k = \{S_k^c,\ S_k^l,\ S_k^s\}$$
where $S_k$ represents the semantic result of the $k$th segmentation, characterized by the category $S_k^c$, the position $S_k^l$, and the confidence $S_k^s$ of the pixel point. Static objects commonly found in traffic environments can provide robust descriptive features, so semantic context descriptors with inherently static features are generated in this work. The points with dramatic changes in semantics, that is, the feature points where the category of pixel points changes, are selected as the key points of the semantic description, and Equation (5) is used to express the selection of key points. The semantic descriptor is constructed by aggregating the features of the key points and the distribution features of their neighborhood and expressing them in matrix form. Figure 5 shows the construction process of the semantic descriptor.
$$S_D(P, P') = \sum \operatorname{sgn}\left(P_{(m,n)} - P'_{(m,n)}\right), \quad P, P' \in S^c_{k(m,n)} \tag{5}$$
where $\operatorname{sgn}(\cdot)$ is a sign function: when the pixel categories at the same position in the semantic descriptors of the key points are the same, it is recorded as 1; otherwise, it is 0.
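The comparison in Equation (5) can be read as counting the positions at which two label patches agree. The sketch below follows that reading; the helper names and the integer label encoding are assumptions, and the 3 × 3 patches stand in for the larger descriptor used in the experiments.

```python
def label_patch(label_map, cx, cy, radius):
    """Cut the (2*radius+1) x (2*radius+1) label neighborhood around (cx, cy)."""
    rows = label_map[cy - radius: cy + radius + 1]
    return [row[cx - radius: cx + radius + 1] for row in rows]


def descriptor_similarity(patch_a, patch_b):
    """Sum of per-position indicators: 1 where the categories match, else 0."""
    return sum(
        1
        for row_a, row_b in zip(patch_a, patch_b)
        for a, b in zip(row_a, row_b)
        if a == b
    )


# Two toy label maps (integers encode categories) from different frames.
frame_a = [[1, 1, 2], [1, 2, 2], [3, 3, 3]]
frame_b = [[1, 1, 2], [1, 1, 2], [3, 3, 2]]
patch_a = label_patch(frame_a, 1, 1, 1)
patch_b = label_patch(frame_b, 1, 1, 1)
score = descriptor_similarity(patch_a, patch_b)
print(score)
```

A higher score means the semantic layout around the two key points is more alike, which helps separate key points of the same category that sit in different surroundings.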

The semantic segmentation network extracts high-level semantic information at different levels and selects the feature points with the most significant changes in semantic information as key points. The semantic information distribution of its neighborhood is analyzed for key points, and the semantic descriptors are represented in matrix form.

Probabilistic Data Association
Data association aims to establish a mapping from the sensor observations $\{Z_t\}_{t=1}^{T}$ to the landmark positions $\{l_m\}_{m=1}^{M}$ and vehicle poses $\{X_t\}_{t=1}^{T}$. Traditional SLAM pose estimation is optimized in two steps: the data association is estimated first, and the result is then substituted into the pose and landmark estimation. As a result, data association errors strongly affect the accuracy of pose estimation. To this end, the probabilistic data association method adds semantic labels to observed landmarks in the environment, improving data association accuracy. Figure 6 is an illustrative overview of the probabilistic data association method. The maximum expected measurement likelihood model [33] treats the overall distribution of data associations and pose estimation as a single optimization problem. This method finds the maximum estimate for the optimization model when data associations are incomplete or when there are unobserved latent data. In the overall optimization model, $X_i$ and $L_i$ represent the initial sensor pose and landmark estimates, respectively, and $E_D$ represents the expectation over all data associations. The estimated value changes drastically with the camera pose, landmark position, and landmark category, so the possibilities of all data associations under $X_i$, $L_i$, and $Z$ are traversed until an optimal global value maximizes the overall weight. The expected and observed values are obtained from the data association and the observation that correspond to the overall expected maximum value. At this point, these values are the optimal solution combination for the system. Equation (7) can then be transformed so that $\omega_{kj}$ is the data association weight corresponding to the overall expected maximum value, and $x_{\alpha_k}$, $l_{\beta_k}$ are the equivalent sensor observation values at that moment.
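The idea of weighting every candidate association instead of committing to one can be illustrated with a toy one-dimensional expectation-maximization loop. This is a sketch of the general technique, not the authors' optimization model: the Gaussian measurement model, the label confusion probability `p_correct`, and all function names are assumptions introduced for illustration.

```python
import math


def association_weights(z_world, z_label, landmarks, sigma=1.0, p_correct=0.9):
    """E-step: normalized weight of each landmark for one measurement.

    landmarks: list of (position, label). The weight combines a Gaussian
    geometric likelihood with a semantic term that favors matching labels.
    """
    scores = []
    for pos, label in landmarks:
        geom = math.exp(-0.5 * ((z_world - pos) / sigma) ** 2)
        sem = p_correct if label == z_label else 1.0 - p_correct
        scores.append(geom * sem)
    total = sum(scores)
    return [s / total for s in scores]


def em_localize(x0, measurements, landmarks, iters=5):
    """Alternate soft association (E-step) and pose update (M-step).

    measurements: list of (relative_position, label) seen from the sensor.
    """
    x = x0
    for _ in range(iters):
        acc = 0.0
        for z_rel, z_label in measurements:
            w = association_weights(x + z_rel, z_label, landmarks)
            # Each landmark votes for the pose that would explain this measurement.
            acc += sum(wj * (pos - z_rel) for wj, (pos, _) in zip(w, landmarks))
        x = acc / len(measurements)
    return x


landmarks = [(0.0, "pole"), (5.0, "sign"), (10.0, "pole")]
# Relative measurements generated from a true pose of 2.0.
measurements = [(-2.0, "pole"), (3.0, "sign"), (8.0, "pole")]
x_est = em_localize(1.5, measurements, landmarks)
print(x_est)
```

Because the weights are soft, a measurement that is ambiguous between two landmarks of the same class still contributes to the pose update, rather than being discarded or forced onto a single match.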

Figure 6. Illustrative overview of our proposed probabilistic data association approach. Top: multiple semantically segmented objects with the same semantic label, which are occluded. Bottom: finding the optimal combination based on the current expected measurement likelihood model.

Evaluation of Positioning Accuracy under KITTI Dataset
In the experiments, the image pyramid is built with m = 8 levels and a scale factor of s = 1/1.2. In the feature point extraction stage, FAST-9 is used with a threshold set low enough to obtain more corner points, and a Harris corner filter then selects the appropriate corners as feature points. When generating the semantic descriptor, its size is set to 21 × 21 and its threshold to 55, the setting that gave the best performance.
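As a small worked example of the pyramid setting above, the sketch below lists the image size at each level, reading m as the number of levels and s as the per-level scale factor. The function name `pyramid_sizes` and the 1241 × 376 input size (a typical KITTI image resolution) are illustrative assumptions.

```python
def pyramid_sizes(width, height, levels=8, scale=1 / 1.2):
    """Return the (width, height) of each pyramid level.

    Level i is the input image scaled by scale**i, so level 0 is the
    original resolution. Sizes are rounded to whole pixels.
    """
    sizes = []
    for i in range(levels):
        factor = scale ** i
        sizes.append((round(width * factor), round(height * factor)))
    return sizes


# Assumed input resolution for illustration only.
sizes = pyramid_sizes(1241, 376)
for level, (w, h) in enumerate(sizes):
    print(f"level {level}: {w} x {h}")
```

Extracting FAST corners on every level lets the same fixed-size detector respond to features at different scales, which is why a low threshold combined with a later Harris filter is workable.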
The localization performance is comprehensively evaluated on the KITTI dataset to verify the effectiveness and superiority of combining scene descriptors with probabilistic data association [25]. The KITTI odometry benchmark consists of 22 stereo sequences containing real-world data collected in urban, rural, and highway scenes. According to the degree of scene dynamics, comparative experiments are performed on a static sequence (KITTI 00), low dynamic sequences (KITTI 04, 05), and high dynamic sequences (KITTI 01, 09).
Considering that errors may be caused by factors unrelated to the algorithm, the evaluation indicators include both relative pose error (RPE) and absolute trajectory error (ATE). RPE evaluates the system's resistance to drift, while ATE assesses its overall positioning capability. Table 3 compares the root mean square error (RMSE) of the systems. The smaller the RMSE, the more accurate the pose estimate and the better the system's overall performance. The best accuracy is indicated in black. Compared with the other systems, this scheme performs best on all dynamic sequences, and the improvement is largest on the high dynamic sequences. On static sequences, its accuracy is very close to that of the state-of-the-art ORB-SLAM2 system.
To express the comparison more intuitively, the tracked 3D trajectories are converted to 2D and plotted in the same graph as the ground truth. As the camera trajectory errors visualized in Figure 7 show, the three algorithms perform similarly on the static sequence (KITTI 00): all estimated trajectories stay close to the ground truth and are relatively precise.
The absolute trajectory error under a low dynamic sequence is shown in Figure 8. This method overcomes the interference of dynamic objects and follows the ground truth trajectory most closely. However, the advantage is not prominent in low dynamic scenarios, possibly because the RANSAC outlier detection used by ORB-SLAM2 tolerates a certain degree of interference.
As shown in Figure 9, in sequences with high dynamics and large scene changes, the outlier detection method used by ORB-SLAM2 is no longer adequate. It is affected by dynamic objects, so its estimated camera trajectory deviates substantially from the ground truth and contains serious errors in some places. At the same time, the performance of this system is far superior to that of DynaSLAM. Although the system's accuracy is slightly lower than DynaSLAM's at certain moments, the error is quickly corrected, and the RMSE decline is significantly smaller than DynaSLAM's. This could be due to the difficulty of matching features in low-texture regions and in overly similar scenes. In such cases of insufficient data association, the maximum expected measurement estimation model can still predict a good pose value.
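The two error measures used in this evaluation can be sketched as follows. This is a simplified, translation-only illustration: it assumes the estimated and ground-truth trajectories are already time-associated and expressed in the same frame, whereas full benchmark tooling also aligns the trajectories and compares relative poses including rotation. The function names and the toy trajectories are hypothetical.

```python
import math


def ate_rmse(est, gt):
    """RMSE of absolute position differences between matched poses."""
    if len(est) != len(gt) or not est:
        raise ValueError("trajectories must be non-empty and equally long")
    sq = [sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))


def rpe_rmse(est, gt, delta=1):
    """RMSE of the difference between relative displacements over `delta` steps."""
    sq = []
    for i in range(len(est) - delta):
        d_est = [b - a for a, b in zip(est[i], est[i + delta])]
        d_gt = [b - a for a, b in zip(gt[i], gt[i + delta])]
        sq.append(sum((x - y) ** 2 for x, y in zip(d_est, d_gt)))
    if not sq:
        raise ValueError("trajectory too short for the chosen delta")
    return math.sqrt(sum(sq) / len(sq))


# Toy 2D trajectories: the estimate drifts sideways by 0.1 per step.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
est = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.2), (3.0, 0.3)]
ate = ate_rmse(est, gt)
rpe = rpe_rmse(est, gt)
```

In the toy data the per-step error is constant, so the RPE stays at 0.1 while the ATE grows with the accumulated drift. This is why RPE is read as a measure of drift and ATE as a measure of global consistency.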

Conclusions
For changing traffic settings, this research, based on scene descriptors and probabilistic data association, gives a precise localization solution. The improved Fast-SCNN network recovers multi-scale features at a higher granularity level and extracts semantic information more quickly without sacrificing spatial information. Using semantic information and prior knowledge, the approach overcomes the negative impacts of dynamic targets on a broad scale. The richer scene descriptors aggregate geometric, semantic, and distribution information, which improves the accuracy of feature point matching. When potential data are unobserved, the probabilistic data association approach finds the best estimate for the optimization model to achieve accurate positioning in dynamic traffic circumstances.
Comparative experiments with other leading SLAM systems show that this method achieves the highest accuracy in both high and low dynamic traffic scenes. Although this research has made some progress in robustness and accuracy, there is still a long way to go. On the one hand, follow-up work will strengthen research on precise localization in dynamic traffic environments with significant visual changes, increasing the applicability of SLAM systems in more challenging scenarios. On the other hand, the technology will be tested and fine-tuned in real traffic environments to improve the system's ability to handle dynamic objects.
