Visual Simultaneous Localization and Mapping Optimization Method Based on Object Detection in Dynamic Scene

: SLAM (Simultaneous Localization and Mapping), as one of the basic functions of mobile robots, has become a hot topic in the field of robotics this year. The majority of SLAM systems in use today, however, disregard the impact of dynamic objects on the system by defining the external environment as static. A SLAM system suitable for dynamic scenes is proposed, aiming at the issue that dynamic objects in real scenes can affect the localization accuracy and map effect of traditional visual SLAM systems. Initially, the enhanced lightweight YOLOv5s target detection algorithm is employed to detect dynamic objects in each frame of the image. Simultaneously, an assessment is conducted on the feature points present on dynamic objects to determine their potential impact on system accuracy, subsequently guiding the decision to retain or exclude these feature points. The preserved static feature points are then utilized for pose estimation and map construction. Experiments on the publicly available TUM dataset and the KITTI dataset are conducted to compare the system in this paper with ORB-SLAM 3, DS-SLAM, and DynaSLAM, and the algorithm is verified to have better performance.


SLAM (Simultaneous Localization and Mapping
) is a robotic methodology employed when a robot operates in an unknown environment [1].Utilizing onboard sensors such as cameras, inertial sensors, and laser sensors, the robot captures information about the external surroundings [2,3].These acquired data are then used in real-time to estimate the robot's own pose and simultaneously construct a map of the surrounding environment.The camera possesses characteristics such as compactness, low-power consumption, and cost-effectiveness.Simultaneously, it has the capability to capture a more comprehensive set of environmental information.Consequently, the adoption of cameras as sensors for visual SLAM has emerged as a prominent and actively pursued research direction [4].
Research scholars have proposed several visual SLAM system frameworks [5], including the VINS-Mono [6] and ORB-SLAM series, which have shown promising results thus far [7].Among them, the ORB-SLAM series has drawn a lot of interest from academics studying this area because of its benefits, which include strong stability and real-time performance.Mur-Artal et al., released the open-source ORB-SLAM2 in 2017 that offers three camera modes for greater versatility [8].In 2021, they even unveiled ORB-SLAM3 with a multi-map system that incorporates a feature-based inertial visual odometry that greatly enhances the system's localization accuracy [9].However, there are still certain issues with ORB-SLAM3 in practical situations [10].While moving objects in actual realworld environments can cause errors in the correlation of visual odometry data, existing algorithms typically assume the external environment to be static [11].This affects the accuracy of the SLAM algorithm [12].As a result, studying SLAM algorithms in dynamic environments is very crucial [13,14].
To improve the robustness and accuracy of the SLAM system in dynamic environments, many researchers optimize the algorithm based on ORB-SLAM [15].DynaSLAM [16], as Appl.Sci.2024, 14, 1787 2 of 15 proposed by Bescos et al., is a hybrid semantic segmentation and multi-view geometry method that segments dynamic objects by instances using semantic segmentation networks.This allows for precise feature point culling on dynamic objects and background repair.On the basis of ORB-SLAM2, DS-SLAM is integrated with a semantic segmentation network that greatly enhances localization accuracy in dynamic scenes by minimizing the impact of dynamic targets on the system via the motion consistency detection method [17].DM-SLAM leverages both the distributed and local RANSAC (Random Sample Consensus) to extract static features from dynamic scenes while incorporating them into the system's initialization [18], based on the characteristics of both statics and object dynamics.In dynamic scenes, the system performs better than conventional SLAM systems.Zhu et al. proposed a visual SLAM using a deep feature matcher to extract features in each image frame by a deep feature extractor in a visual odometer and proposed a camera position fusion estimation method [19], which is able to operate at 0.08 s per frame with low error.Wei et al. optimized the visual SLAM algorithm through semantic information, based on the framework of ORB-SLAM2, and added the dynamic region detection module into the visual odometer to improve the accuracy of the system's position estimation in dynamic environments.Then, the image is semantically segmented by the BiSeNetV2 network, dynamic objects are removed using semantic information and dynamic regions, and finally a map containing semantic information is constructed [20].
The existing SLAM system is susceptible to the influence of dynamic objects in dynamic scenes, which leads to a decrease in the system's localization accuracy.In this research, we propose a SLAM system for dynamic situations that determines whether a feature point influences the system accuracy and whether it should be rejected or retained [21].The system detects a feature point on a dynamic object using an enhanced lightweight YOLOv5-based method [22].After that, position is estimated using the feature points that were kept, and a dense point cloud map is produced to satisfy navigation and path planning requirements [23].The experimental findings demonstrate the good real-time performance and ability of the algorithm presented in this research to increase the system accuracy in dynamic circumstances [24].
The rest of this paper is organized as follows.The general architecture of the SLAM system used in this work is described in Section 2, along with the guiding ideas and purposes of each module.The elements of the target detection network and the improvement techniques are covered in detail in Section 3. Section 4 provides a description and analysis of the comparative experiments that were conducted to evaluate the system's performance.In Section 5, the research of this paper is discussed.Finally, in Section 6, the work is summarized, along with an outlook for future work.

System Description
The system framework of this paper as shown in Figure 1 can be divided into five main components, i.e., tracking thread, localized mapping, loop closing, dynamic object detection thread, and dense map building.
The tracking thread is in charge of tracking the localization and orientation of the camera in real-time.This component matches features using ORB (Oriented FAST and Rotated BRIEF) feature points to estimate the camera pose.Pose estimation, descriptor matching, and feature extraction are steps in the feature matching process.
For the purpose of creating a local map, keyframe insertion and deletion are mainly handled by the local mapping thread.It also optimizes the data generated by the tracking thread to increase localization accuracy [25].
The job of the loop closing thread is to check if the camera has returned to a previously visited area on a regular basis, which would indicate a loop closure event.The loop closure detection thread initiates a loop closure operation in the event that a loop closure is found.
The dynamic object detection thread uses the YOLOv5s network as its main body, replacing YOLOv5's Backbone with MobileNetV3-Small to lighten the overall network structure [26].
To create the dense point cloud map, the thread responsible for building the d map must first convert the depth and image information into 3D point cloud data.N it must color the point cloud using the RGB image's color information and stitch the p cloud together using keyframes.FAST is a high-speed algorithm for detecting corner points in an image.The F algorithm first selects one pixel in the image to act as the center pixel before determi corner points.To determine whether this center pixel is a corner point, its intensity va are compared to those of the surrounding pixels.As shown in Figure 2, the center p point p has a gray scale of IP, and the FAST algorithm selects a fixed number of surrou ing pixels, usually 16 pixels.It then compares the gray values of these surrounding p with the gray value of the center pixel.If there are 12 consecutive points whose gray va satisfy Equation (1), then the point is a corner point.

|𝐼 − 𝐼 | ∈ 𝑇
where T is the threshold value and Ix represents the gray level of the surrounding 16 p points.To create the dense point cloud map, the thread responsible for building the dense map must first convert the depth and image information into 3D point cloud data.Next, it must color the point cloud using the RGB image's color information and stitch the point cloud together using keyframes.
The tracking thread extracts ORB feature points from the input image.ORB is an algorithm that identifies and describes important feature points in an image.The two basic components of ORB feature extraction are BRIEF (Binary Robust Independent Elementary Features) feature description and FAST (Features from Accelerated Segment Test) feature point detection.
FAST is a high-speed algorithm for detecting corner points in an image.The FAST algorithm first selects one pixel in the image to act as the center pixel before determining corner points.To determine whether this center pixel is a corner point, its intensity values are compared to those of the surrounding pixels.As shown in Figure 2, the center pixel point p has a gray scale of I P , and the FAST algorithm selects a fixed number of surrounding pixels, usually 16 pixels.It then compares the gray values of these surrounding pixels with the gray value of the center pixel.If there are 12 consecutive points whose gray values satisfy Equation (1), then the point is a corner point.
where T is the threshold value and I x represents the gray level of the surrounding 16 pixel points.The algorithm known as BRIEF is chosen by ORB to describe the feature.BRIEF's primary objective is to produce robust and efficient binary feature descriptors that may be used to match and identify important or distinctive points in a picture.BRIEF selects a feature point and determines a T × T window and then randomly selects n pairs of pixel points defined as where p(x) and p(y) are the gray values at point x and y, respectively.Then, the descriptor of feature point p is defined as One of the key components of the system is feature matching; the accuracy of the location estimation can only be ensured when the feature points are matched accurately and efficiently.The camera position problem must be solved when feature matching is finished, and the method in this work uses the ICP (Iterative Closest Point) algorithm to estimate the position of 3D feature points.The principle is as follows: assuming that the two known frames are F1 and F2, where P = {p1,⋯, p2}∈F1 and Q = {q1,⋯, q2}∈F2 are a set of well-matched three-dimensional points, and the camera's pose is calculated by SVD decomposition, the rotation matrix R and translation matrix t are made to satisfy Equation (4).

∀𝑖, 𝑝 = 𝑅𝑞 𝑡
The local mapping thread is responsible for optimizing keyframes through operations such as eliminating outliers, generating new map points, and removing redundant keyframes.Subsequently, the optimized keyframes are output to the loop closing thread.The loop closing thread, using the keyframe positioned at the front of the buffer queue, replaces it with the currently detected loop closure keyframe.The similarity between the current keyframe and the consensus keyframe is computed to determine the presence of a loop closure.
According to the depth information and image information provided by the camera, the two-dimensional coordinate points are converted into three-dimensional point cloud data.Xi = [x,y,z] represents the coordinates of the points, X = {x1, x2,⋯, xn} represents the set of spatial points, and the corresponding pixel coordinate of the known three-dimensional points under the camera coordinate system is expressed as Xi = [u, v, 1].According to the camera's pinhole model imaging principle, we can obtain the point cloud.The spatial position information of the point cloud can be obtained according to the pinhole model imaging principle of the camera: The algorithm known as BRIEF is chosen by ORB to describe the feature.BRIEF's primary objective is to produce robust and efficient binary feature descriptors that may be used to match and identify important or distinctive points in a picture.BRIEF selects a feature point and determines a T × T window and then randomly selects n pairs of pixel points defined as where p(x) and p(y) are the gray values at point x and y, respectively.Then, the descriptor of feature point p is defined as One of the key components of the system is feature matching; the accuracy of the location estimation can only be ensured when the feature points are matched accurately and efficiently.The camera position problem must be solved when feature matching is finished, and the method in this work uses the ICP (Iterative Closest Point) algorithm to estimate the position of 3D feature points.The principle is as follows: assuming that the two known frames are F 1 and F 2 , where P = {p 1 ,• • • , p 2 }∈F 1 and Q = {q 1 ,• • • , q 2 }∈F 2 are a set of well-matched three-dimensional points, and the camera's pose is calculated by SVD decomposition, the rotation matrix R and translation matrix t are made to satisfy Equation (4). ∀i, The local mapping thread is responsible for optimizing keyframes through operations such as eliminating outliers, generating new map points, and removing redundant keyframes.Subsequently, the optimized keyframes are output to the loop closing thread.The loop closing thread, using the keyframe positioned at the front of the buffer queue, replaces it with the currently detected loop closure keyframe.The similarity between the current keyframe and the consensus keyframe is computed to determine the presence of a loop closure.
According to the depth information and image information provided by the camera, the two-dimensional coordinate points are converted into three-dimensional point cloud data.X i = [x,y,z] represents the coordinates of the points, X = {x 1 , x 2 ,• • • , x n } represents the set of spatial points, and the corresponding pixel coordinate of the known three-dimensional points under the camera coordinate system is expressed as X i = [u, v, 1].According to the camera's pinhole model imaging principle, we can obtain the point cloud.The spatial position information of the point cloud can be obtained according to the pinhole model imaging principle of the camera: where K represents the camera internal reference; R represents the rotation matrix; t represents the translation vector; and z is the scale factor between the depth value and the actual spatial distance.
In this paper, the obtained point cloud data are processed by the PCL (Point Cloud Library).The p ix , p iy , and p iz of the point cloud p i can be calculated according to the following equation: For every keyframe, a corresponding point cloud is created.All of the point clouds are then stitched together using the system's keyframe position data, and the RGB image's color information is utilized to give the point clouds their color, creating a globally dense point cloud map.

YOLOv5
The YOLO (You Only Look Once) series is a superb example of a target detection algorithm that is easier to use, quicker, and more accurate than conventional algorithms [27,28].Compared to YOLOv4, YOLOv5 optimizes the network model structure and data augmentation, which increases the accuracy and speed of detection [29].
The overall structure of YOLOv5 is mostly made up of the input end, the Backbone, the Neck, the prediction, and the output end, as seen in Figure 3. Mosaic data enhancement is used on the input side to enhance the background and small targets of the detected objects by randomly scaling, cropping, and arranging the images.Backbone is enhanced with the addition of Focus and CSP (Cross-Stage Partial) structures.The convolutional neural network layer of feature extraction uses the Focus structure, while the CSP structure efficiently lowers the parameter and computation to increase feature extraction efficiency.Prediction is utilized as the output side and Neck is used for the feature fusion network.Neck uses both PAN (Path Aggregation Network) and SPP (Spatial Pyramid Pooling) structures [30].Based on the network's width and depth, YOLOv5 can be divided into five versions: YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x, and YOLOv5n.YOLOv5s is the base version utilized in this paper.
Appl.Sci.2024, 14, x FOR PEER REVIEW 5 of 15 where K represents the camera internal reference; R represents the rotation matrix; t represents the translation vector; and z is the scale factor between the depth value and the actual spatial distance.
In this paper, the obtained point cloud data are processed by the PCL (Point Cloud Library).The pix, piy, and piz of the point cloud pi can be calculated according to the following equation: For every keyframe, a corresponding point cloud is created.All of the point clouds are then stitched together using the system's keyframe position data, and the RGB image's color information is utilized to give the point clouds their color, creating a globally dense point cloud map.

YOLOv5
The YOLO (You Only Look Once) series is a superb example of a target detection algorithm that is easier to use, quicker, and more accurate than conventional algorithms [27,28].Compared to YOLOv4, YOLOv5 optimizes the network model structure and data augmentation, which increases the accuracy and speed of detection [29].
The overall structure of YOLOv5 is mostly made up of the input end, the Backbone, the Neck, the prediction, and the output end, as seen in Figure 3. Mosaic data enhancement is used on the input side to enhance the background and small targets of the detected objects by randomly scaling, cropping, and arranging the images.Backbone is enhanced with the addition of Focus and CSP (Cross-Stage Partial) structures.The convolutional neural network layer of feature extraction uses the Focus structure, while the CSP structure efficiently lowers the parameter and computation to increase feature extraction efficiency.Prediction is utilized as the output side and Neck is used for the feature fusion network.Neck uses both PAN (Path Aggregation Network) and SPP (Spatial Pyramid Pooling) structures [30].Based on the network's width and depth, YOLOv5 can be divided into five versions: YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x, and YOLOv5n.YOLOv5s is the base version utilized in this paper.

Lightweight Networking MobileNet-V3
The real-time performance of traditional convolutional neural networks on embedded systems and mobile devices is limited by their high memory requirements.Specifically designed for image identification and computer vision tasks requiring little processing resources on mobile devices and embedded systems, MobileNet is a lightweight deep learning neural network architecture.MobileNet's design objective is to minimize the size of the model and the network's computational load while preserving high recognition performance to meet the resource requirements of mobile devices with limited resources.Conventional convolution, as illustrated in Figure 4, uses a convolution kernel to carry out a convolution operation on each input channel.Four feature maps with the same size as the input layer are produced as the final result of a convolutional layer with four filters, assuming the input layer is a three-channel color image of 64 by 64 pixels.

Lightweight Networking MobileNet-V3
The real-time performance of traditional convolutional neural networks on embedded systems and mobile devices is limited by their high memory requirements.Specifically designed for image identification and computer vision tasks requiring little processing resources on mobile devices and embedded systems, MobileNet is a lightweight deep learning neural network architecture.MobileNet's design objective is to minimize the size of the model and the network's computational load while preserving high recognition performance to meet the resource requirements of mobile devices with limited resources.Conventional convolution, as illustrated in Figure 4, uses a convolution kernel to carry out a convolution operation on each input channel.Four feature maps with the same size as the input layer are produced as the final result of a convolutional layer with four filters, assuming the input layer is a three-channel color image of 64 by 64 pixels.As Figure 5 illustrates, MobileNet reduces computational complexity and parameter count while capturing feature information effectively by using depth separable convolution, which splits the standard convolution into two steps: depth convolution and pointwise convolution.In 2019, Google released MobileNet-V3, which introduces a number of key improvements and optimizations to MobileNet-V2 for better performance and efficiency.As Figure 6 illustrates, NAS (Neural Architecture Search) technology is used to optimize the MobileNet-V3 network to perform better at particular tasks.Figure 6 illustrates this process by finding the optimal network structure to increase performance metrics.Second, MobileNet-V3 uses a lightweight activation function dubbed Hard Swish, which increases the model's nonlinearity while occasionally outperforming traditional ReLU.In the meantime, MobileNet-V3 presents the SE (Squeeze-and-Excitation) module, which uses an attention mechanism to execute adaptive weight adjustment of the feature map during training.The image input is first processed through a 1 × 1 convolution to increase the number of channels; next, the feature map data are optimized using the SE attention mechanism; finally, the number of channels is decreased through a 1 × 1 convolution under high-dimensional space.The residuals are used to connect the input and output when the As Figure 5 illustrates, MobileNet reduces computational complexity and parameter count while capturing feature information effectively by using depth separable convolution, which splits the standard convolution into two steps: depth convolution and pointwise convolution.

Lightweight Networking MobileNet-V3
The real-time performance of traditional convolutional neural networks on embedded systems and mobile devices is limited by their high memory requirements.Specifically designed for image identification and computer vision tasks requiring little processing resources on mobile devices and embedded systems, MobileNet is a lightweight deep learning neural network architecture.MobileNet's design objective is to minimize the size of the model and the network's computational load while preserving high recognition performance to meet the resource requirements of mobile devices with limited resources.Conventional convolution, as illustrated in Figure 4, uses a convolution kernel to carry out a convolution operation on each input channel.Four feature maps with the same size as the input layer are produced as the final result of a convolutional layer with four filters, assuming the input layer is a three-channel color image of 64 by 64 pixels.As Figure 5 illustrates, MobileNet reduces computational complexity and parameter count while capturing feature information effectively by using depth separable convolution, which splits the standard convolution into two steps: depth convolution and pointwise convolution.In 2019, Google released MobileNet-V3, which introduces a number of key improvements and optimizations to MobileNet-V2 for better performance and efficiency.As Figure 6 illustrates, NAS (Neural Architecture Search) technology is used to optimize the MobileNet-V3 network to perform better at particular tasks.Figure 6 illustrates this process by finding the optimal network structure to increase performance metrics.Second, MobileNet-V3 uses a lightweight activation function dubbed Hard Swish, which increases the model's nonlinearity while occasionally outperforming traditional ReLU.In the meantime, MobileNet-V3 presents the SE (Squeeze-and-Excitation) module, which uses an attention mechanism to execute adaptive weight adjustment of the feature map during training.The image input is first processed through a 1 × 1 convolution to increase the number of channels; next, the feature map data are optimized using the SE attention mechanism; finally, the number of channels is decreased through a 1 × 1 convolution under high-dimensional space.The residuals are used to connect the input and output when the In 2019, Google released MobileNet-V3, which introduces a number of key improvements and optimizations to MobileNet-V2 for better performance and efficiency.As Figure 6 illustrates, NAS (Neural Architecture Search) technology is used to optimize the MobileNet-V3 network to perform better at particular tasks.Figure 6 illustrates this process by finding the optimal network structure to increase performance metrics.Second, MobileNet-V3 uses a lightweight activation function dubbed Hard Swish, which increases the model's nonlinearity while occasionally outperforming traditional ReLU.In the meantime, MobileNet-V3 presents the SE (Squeeze-and-Excitation) module, which uses an attention mechanism to execute adaptive weight adjustment of the feature map during training.The image input is first processed through a 1 × 1 convolution to increase the number of channels; next, the feature map data are optimized using the SE attention mechanism; finally, the number of channels is decreased through a 1 × 1 convolution under high-dimensional space.The residuals are used to connect the input and output when the step size is equal to 1 and the input and output feature maps are of the same shape; when the step size is equal to 2, the output is the downscaled feature map directly.
Appl.Sci.2024, 14, x FOR PEER REVIEW 7 of 15 step size is equal to 1 and the input and output feature maps are of the same shape; when the step size is equal to 2, the output is the downscaled feature map directly.

Improvement in YOLOv5s Target Detection Network
To achieve the lightweighting of the YOLOv5s network, the Backbone of YOLOv5s is replaced with MobileNetV3-Small, a specific configuration in the MobileNetV3 family designed to offer a smaller model size and a lighter computational burden for resourceconstrained mobile devices and embedded systems.It is a variant of the MobileNetV3 family and is particularly suitable for applications that require higher model size and computational efficiency.The MobileNetV3-Small network structure is shown in Table 1.
As shown in the table, MobileNetV3-Small consists of a combination of several convolutional layers.Input denotes the shape of the feature matrix of the input current layer; Operator denotes the convolutional operation (conv2d is the normal convolutional layer; bneck is the BN layer, a combination of the convolutional layer and the activation function; NBN denotes that the BN layer is not used); Out denotes the output channel size; SE denotes the attention module; NL denotes the type of activation function (HS for the h-swish activation function and RE for the ReLU activation function); S denotes the step size; and "-" denotes that the layer is not used and has no parameters.As shown in Figure 7, the Backbone of YOLOv5s is replaced with MobileNetV3-Small and adjusted to generate feature maps at different scales, and the three different scales of feature maps required for

Improvement in YOLOv5s Target Detection Network
To achieve the lightweighting of the YOLOv5s network, the Backbone of YOLOv5s is replaced with MobileNetV3-Small, a specific configuration in the MobileNetV3 family designed to offer a smaller model size and a lighter computational burden for resourceconstrained mobile devices and embedded systems.It is a variant of the MobileNetV3 family and is particularly suitable for applications that require higher model size and computational efficiency.The MobileNetV3-Small network structure is shown in Table 1.
As shown in the table, MobileNetV3-Small consists of a combination of several convolutional layers.Input denotes the shape of the feature matrix of the input current layer; Operator denotes the convolutional operation (conv2d is the normal convolutional layer; bneck is the BN layer, a combination of the convolutional layer and the activation function; NBN denotes that the BN layer is not used); Out denotes the output channel size; SE denotes the attention module; NL denotes the type of activation function (HS for the h-swish activation function and RE for the ReLU activation function); S denotes the step size; and "-" denotes that the layer is not used and has no parameters.As shown in Figure 7, the Backbone of YOLOv5s is replaced with MobileNetV3-Small and adjusted to generate feature maps at different scales, and the three different scales of feature maps required for the Neck of YOLOv5s are extracted from 3-P3, 8-P4, and 11-P5.The CBL (Convolutional Block Layer) mainly consists of a convolutional layer and a batch normalization layer, which is used for feature extraction and normalization.CSP splits the original input into two branches and performs convolutional operations separately to halve the number of channels, allowing the model to learn more features.SPP can convert feature maps of arbitrary size into feature vectors of fixed size.
the Neck of YOLOv5s are extracted from 3-P3, 8-P4, and 11-P5.The CBL (Convolutional Block Layer) mainly consists of a convolutional layer and a batch normalization layer, which is used for feature extraction and normalization.CSP splits the original input into two branches and performs convolutional operations separately to halve the number of channels, allowing the model to learn more features.SPP can convert feature maps of arbitrary size into feature vectors of fixed size.

Methods of Determining Information
When the feature points in the tracking thread of a classical SLAM system move, it can cause a build-up of errors in the camera position computation.This can ultimately impact the system's accuracy and potentially result in localization failure.Generally speaking, objects that are immobile, like people and animals in indoor settings, are considered dynamic objects; on the other hand, stationary objects, like tables and computers, are considered static objects.Table 2 displays the information judgment method developed in this paper.This will prevent the effective feature points in the environment from being mistakenly eliminated.Specifically, only feature points located in the dynamic detection frame and not in the static detection frame are considered to be dynamic feature points, which will impact the accuracy of the system to be eliminated.This paper is simultaneously located in or not located in the dynamic detection frame and static detection frame feature points are considered as effective feature points to be retained.

Experiment and Analysis
This paper uses the publicly available TUM dataset in addition to the KITTI dataset to validate the algorithm's performance.Image sequences of indoor scenes taken with a Kinect camera and referenced with actual camera poses are included in the TUM dataset.Experiments are also conducted using the highly dynamic sequences walking_xyz, walk-ing_static, and walking_halfsphere on this dataset to confirm the algorithm's accuracy and robustness in actual indoor environments.Sequences 01, 02, and 03 on the KITTI dataset are also utilized for experiments to confirm the accuracy and resilience of the algorithms in actual outdoor scenes.The dataset comprises real image data that were gathered from urban, rural, and highway scenes.For this paper, a laptop running Ubuntu 20.04 with 16 GB of RAM serves as the experimental platform.

Methods of Determining Information
When the feature points in the tracking thread of a classical SLAM system move, it can cause a build-up of errors in the camera position computation.This can ultimately impact the system's accuracy and potentially result in localization failure.Generally speaking, objects that are immobile, like people and animals in indoor settings, are considered dynamic objects; on the other hand, stationary objects, like tables and computers, are considered static objects.Table 2 displays the information judgment method developed in this paper.This will prevent the effective feature points in the environment from being mistakenly eliminated.Specifically, only feature points located in the dynamic detection frame and not in the static detection frame are considered to be dynamic feature points, which will impact the accuracy of the system to be eliminated.This paper is simultaneously located in or not located in the dynamic detection frame and static detection frame feature points are considered as effective feature points to be retained.

Experiment and Analysis
This paper uses the publicly available TUM dataset in addition to the KITTI dataset to validate the algorithm's performance.Image sequences of indoor scenes taken with a Kinect camera and referenced with actual camera poses are included in the TUM dataset.Experiments are also conducted using the highly dynamic sequences walking_xyz, walk-ing_static, and walking_halfsphere on this dataset to confirm the algorithm's accuracy and robustness in actual indoor environments.Sequences 01, 02, and 03 on the KITTI dataset are also utilized for experiments to confirm the accuracy and resilience of the algorithms in actual outdoor scenes.The dataset comprises real image data that were gathered from urban, rural, and highway scenes.For this paper, a laptop running Ubuntu 20.04 with 16 GB of RAM serves as the experimental platform.

Experiments on Object Detection Models
The object detection network is trained using the VOC-2007 dataset, which consists of 20 different objects including people, chairs, and cars.The dataset for this training consists of 7653 images, 1000 epochs are trained, batch_size is set to 128, Learning_rate is 0.0032, Momentum is 0.843, and Weight_decay is 0.00036.
The images in the walking_halfsphere dataset are detected and the Parameter quantity, Calculation quantity, and Detection time of different target detection networks are shown in Table 3.All the tests are performed five times and the final results are averaged.It can be observed that using MobileNetV3 instead of the YOLOv5s Backbone feature extraction network reduces the model Parameter quantity by 50.08%,Calculation quantity by 62.63%, and Detection time by 10.71%.

Indoor Posture Estimation Experiment
This paper develops a set of information determination methods to retain the static feature points in the dynamic detection frame.A feature point is considered dynamic only if it is in the dynamic detection frame and not in the static detection frame.Rejecting a feature point based solely on its location will affect the accuracy of the system.Figure 8a illustrates the effect of culling, while Figure 8b displays the image prior to dynamic feature point culling and Figure 8b displays the image subsequent to culling.The man's plaid shirt in the figure has a lot of feature points that, if left unculled, will significantly lower the system's accuracy.This paper's method successfully identifies the dynamic objects in the image and retains the feature points on the static objects, like computers, while rejecting the feature points on the dynamic objects.

Experiments on Object Detection Models
The object detection network is trained using the VOC-2007 dataset, which consists of 20 different objects including people, chairs, and cars.The dataset for this training consists of 7653 images, 1000 epochs are trained, batch_size is set to 128, Learning_rate is 0.0032, Momentum is 0.843, and Weight_decay is 0.00036.
The images in the walking_halfsphere dataset are detected and the Parameter quantity, Calculation quantity, and Detection time of different target detection networks are shown in Table 3.All the tests are performed five times and the final results are averaged.It can be observed that using MobileNetV3 instead of the YOLOv5s Backbone feature extraction network reduces the model Parameter quantity by 50.08%,Calculation quantity by 62.63%, and Detection time by 10.71%.

Indoor Posture Estimation Experiment
This paper develops a set of information determination methods to retain the static feature points in the dynamic detection frame.A feature point is considered dynamic only if it is in the dynamic detection frame and not in the static detection frame.Rejecting a feature point based solely on its location will affect the accuracy of the system.Figure 8a illustrates the effect of culling, while Figure 8b displays the image prior to dynamic feature point culling and Figure 8b displays the image subsequent to culling.The man's plaid shirt in the figure has a lot of feature points that, if left unculled, will significantly lower the system's accuracy.This paper's method successfully identifies the dynamic objects in the image and retains the feature points on the static objects, like computers, while rejecting the feature points on the dynamic objects.Robustness in SLAM refers to the system's capacity to withstand a range of errors and disturbances.The accuracy of the system will be affected by the presence of dynamic objects in the dataset, but the system's robustness will be demonstrated by its equally good performance in managing the dynamic dataset.The difference between the calculated and true values of the posture is represented by the Absolute Posture Error (APE), Robustness in SLAM refers to the system's capacity to withstand a range of errors and disturbances.The accuracy of the system will be affected by the presence of dynamic objects in the dataset, but the system's robustness will be demonstrated by its equally good performance in managing the dynamic dataset.The difference between the calculated and true values of the posture is represented by the Absolute Posture Error (APE), which provides a visual evaluation of the algorithm's correctness.The difference in the amount of position change at the same timestamp is represented by the Relative Posture Error (RPE), which is a useful tool for estimating system drift.After aligning the estimated and actual position values using the timestamps, the change in position is computed between the true and estimated values over time.The difference in the change in values represents the position error of the camera.Tables 4 and 5 present comparative test results between ORB-SLAM3 and the algorithm proposed in this paper across three different datasets.The evaluation metrics include root-mean-square error (RMSE), mean error (MEAN), and standard deviation (SD).The improvement ratio is calculated using the formula given in Equation ( 7): where m is the running result of ORB-SLAM3 and n is the running result of the algorithm in this paper.From Table 5, it can be seen that the algorithm in this paper has improved the position estimation accuracy compared to ORB-SLAM3 after the dynamic feature points are removed.In terms of absolute trajectory error, the root-mean-square error in the walking_xyz sequence is improved by 80.02%, the mean error is improved by 92.09%, and the standard deviation is improved by 87.92%.In terms of the relative displacement trajectory error, the root-mean-square error in the walking_xyz sequence is improved by 71.71%, the mean error is improved by 82.65%, the standard deviation is improved by 61.58%, and the other two dynamic sequences are also significantly improved.
Figure 9a shows the camera trajectory of ORB-SLAM3 in the walking_xyz sequence, and Figure 9b shows the camera trajectory of our SLAM in the walking_xyz sequence, where the solid line is the trajectory value computed by the system and the dashed line is the real trajectory value of the camera.Figure 9c shows the comparison between ORB-SLAM3 and our SLAM in xyz mode with the real trajectory, and Figure 9d shows the comparison between ORB-SLAM3 and our SLAM in rpy mode with the real trajectory.
From the figure, it can be seen that the trajectory calculated by ORB-SLAM3 has a large error, while the trajectory calculated by our SLAM is closer to the real value.
Figure 10a shows the results of the APE comparison between ORB-SLAM3 and our SLAM in the walking_xyz sequence, and Figure 10a shows the results of the RPE comparison between ORB-SLAM3 and our SLAM in the walking_xyz sequence.Our SLAM performs significantly better than ORB-SLAM3 in APE and better than ORB-SLAM3 in RPE.It can be seen that the algorithm in this paper has less error in dynamic environments compared to ORB-SLAM3 and has higher accuracy and robustness.
Table 6 displays the absolute trajectory errors of the algorithm used in this paper in comparison to the DS-SLAM and DynaSLAM algorithms.In the walking_xyz and walking_static sequences, Table 6 shows that the algorithm used in this study performs better than the other algorithms in terms of root-mean-square error, and that DS-SLAM and DynaSLAM are comparable in these two sequences.Meanwhile, in the walking_halfsphere sequence, DynaSLAM performs better than DS-SLAM with respect to standard deviation and root-mean-square error, while the technique presented in this research comes in second.Comparing the error of this work to similar methods over different dataset sequences, it performs reasonably well overall.Figure 10a shows the results of the APE comparison between ORB-SLAM3 and our SLAM in the walking_xyz sequence, and Figure 10a shows the results of the RPE comparison between ORB-SLAM3 and our SLAM in the walking_xyz sequence.Our SLAM performs significantly better than ORB-SLAM3 in APE and better than ORB-SLAM3 in RPE.It can be seen that the algorithm in this paper has less error in dynamic environments compared to ORB-SLAM3 and has higher accuracy and robustness.
(a) (b) Figure 10a shows the results of the APE comparison between ORB-SLAM3 and o SLAM in the walking_xyz sequence, and Figure 10a shows the results of the RPE comp ison between ORB-SLAM3 and our SLAM in the walking_xyz sequence.Our SLAM p forms significantly better than ORB-SLAM3 in APE and better than ORB-SLAM3 in RP It can be seen that the algorithm in this paper has less error in dynamic environme compared to ORB-SLAM3 and has higher accuracy and robustness.This paper compares the system with ORB-SLAM3, DS-SLAM, and DynaSLAM to validate the system's real-time performance.Table 7 shows that ORB-SLAM3 has the shortest tracking time per frame because it does not process the dynamic objects in the scene.However, the tracking time per frame outperforms DS-SLAM with DynaSLAM in this paper when comparing the systems in similar dynamic scenes.The comparison of the frame rate of this paper system with ORB-SLAM3, DS-SLAM, and DynaSLAM is presented in Table 8.The table shows that ORB-SLAM3 achieves the highest frame rate of 37 frames per second because it does not detect dynamic objects.When comparing DS-SLAM and DynaSLAM, the system described in this paper has the highest frame rate.This paper's system has the highest frame rate, able to sustain over 33 frames per second while the algorithm runs, demonstrating the improved real-time performance of the algorithm.

Outdoor Posture Estimation Experiment
Table 9 presents the absolute trajectory errors for the KITTI dataset comparison between the algorithm presented in this paper and DS-SLAM, DynaSLAM.Table 9 shows that the algorithm used in this paper performs better in terms of root-mean-square error than the other algorithms in the 01 and 02 sequences, and that DS-SLAM and DynaSLAM perform similarly in these two sequences.In the 03 sequence, DynaSLAM performs poorly while DS-SLAM performs best in terms of standard deviation and root-meansquare error.Overall, this paper's errors in various outdoor dataset sequences outperform comparable algorithms.

Dense Building Maps Experiment
In this paper, the walking_xyz sequence is selected as the image sequence for dense map construction, in which two men walk around a computer desk.If we do not eliminate the dynamic objects to build the map as shown in Figure 11a, we can observe that the dynamic objects in the scene have an impact on the map construction, resulting in a large number of characters in the map residual shadow, which seriously reduces the effect of dense map construction.The algorithm used in this study creates the map that is depicted in Figure 11b, removing the impact of dynamic objects and creating a good map that is suitable for use in robot navigation and path planning.

Dense Building Maps Experiment
In this paper, the walking_xyz sequence is selected as the image sequence for dense map construction, in which two men walk around a computer desk.If we do not eliminate the dynamic objects to build the map as shown in Figure 11a, we can observe that the dynamic objects in the scene have an impact on the map construction, resulting in a large number of characters in the map residual shadow, which seriously reduces the effect of dense map construction.The algorithm used in this study creates the map that is depicted in Figure 11b, removing the impact of dynamic objects and creating a good map that is suitable for use in robot navigation and path planning.

Discussion
This work aims to maximize the performance of SLAM systems in dynamic environments by concentrating on the identification and rejection of dynamic feature points.The approach presented in this research performs well in terms of localization errors in dynamic settings, according to experimental results.This benefit is explained by the system's capacity to recognize dynamic objects and lessen their interference.In the meantime, the system suggested in this paper shows a higher frame rate in the frame rate comparison experiments with similar algorithms.This is explained by the fact that MobileNetV3-Small replaces the Backbone of the target detection network YOLOv5s, making the entire network model lighter.Varying application scenarios may require varying levels of accuracy from SLAM algorithms.For example, robots working in indoor environments often have slightly lower accuracy requirements, with absolute trajectory errors needing to be limited within a few centimeters to ten centimeters.The system in this paper can still fulfill this requirement in dynamic scenarios, so the system in this paper has good practicality.While the system presented in this paper has significantly improved the map effect and localization accuracy in dynamic scenes, in real-world applications, tracking failure may occur

Discussion
This work aims to maximize the performance of SLAM systems in dynamic environments by concentrating on the identification and rejection of dynamic feature points.The approach presented in this research performs well in terms of localization errors in dynamic settings, according to experimental results.This benefit is explained by the system's capacity to recognize dynamic objects and lessen their interference.In the meantime, the system suggested in this paper shows a higher frame rate in the frame rate comparison experiments with similar algorithms.This is explained by the fact that MobileNetV3-Small replaces the Backbone of the target detection network YOLOv5s, making the entire network model lighter.Varying application scenarios may require varying levels of accuracy from SLAM algorithms.For example, robots working in indoor environments often have slightly lower accuracy requirements, with absolute trajectory errors needing to be limited within a few centimeters to ten centimeters.The system in this paper can still fulfill this requirement in dynamic scenarios, so the system in this paper has good practicality.While the system presented in this paper has significantly improved the map effect and localization accuracy in dynamic scenes, in real-world applications, tracking failure may occur when features in dynamic environments are absent or the camera moves too quickly.The localization accuracy of the system may be adversely affected in situations with low light or abrupt changes in illumination because the system developed in this paper is a visual SLAM and its localization depends on visual information.Future research can look into adding more sensors, like inertial measurement units (IMUs), and introducing a variety of feature types to improve the system.These upgrades allow the system to adjust to increasingly complicated settings.

Conclusions
In this paper, a target-detection-based SLAM system is designed with the goal of minimizing the impact of dynamic objects in the environment on the accuracy of the SLAM system.The target detection network's lightweight MobileNetV3 network replaces the YOLOv5s Backbone network in an effort to reduce the model parameters and computation of the target detection network.The improved YOLOv5s detects dynamic objects in the environment; in the meantime, this paper proposes a dynamic feature point determination method that can effectively identify the dynamic feature points on the dynamic objects that impact the system's accuracy while retaining the static feature points to avoid mistakenly excluding the effective feature points.Constructing a dense point cloud map concurrently requires the addition of a dense map building thread, and once the dynamic objects are removed, the map construction effect is satisfactory.The system in this paper improves the position estimation accuracy to over 90% when compared to the ORB-SLAM3 position estimation accuracy, and it also partially improves the position estimation accuracy when compared to dynamic SLAM systems like DS-SLAM and DynaSLAM, according to the experimental results on TUM and KITTI datasets.Consequently, there is a good chance that the SLAM system based on the enhanced YOLOv5s presented in this paper will find use.In the future, we will consider optimizing and improving the system by combining multiple sensors to further improve the system accuracy and at the same time adapt the system to more scenarios.

Figure 1 .
Figure 1.Framework overview of our proposed SLAM.

Figure 1 .
Figure 1.Framework overview of our proposed SLAM.

Figure 9 .
Figure 9.The results of APE in the walking_xyz sequence.(a) ORB-SLAM3 comparison results with true trajectories, (b) our SLAM comparison results with true trajectories, (c) comparison curves between the three trajectories in xyz mode, (d) comparison curves between the three trajectories in rpy mode.

Figure 9 .Figure 9 .
Figure 9.The results of APE in the walking_xyz sequence.(a) ORB-SLAM3 comparison results with true trajectories, (b) our SLAM comparison results with true trajectories, (c) comparison curves between the three trajectories in xyz mode, (d) comparison curves between the three trajectories in rpy mode.

Figure 11 .
Figure 11.Comparison of maps before and after processing.(a) Before culling, (b) after culling.

Figure 11 .
Figure 11.Comparison of maps before and after processing.(a) Before culling, (b) after culling.

Table 2 .
Feature point determination method.

Table 2 .
Feature point determination method.

Table 3 .
Comparison of experimental results of different models.

Table 3 .
Comparison of experimental results of different models.

Table 6 .
Comparison of absolute trajectory errors of different algorithms.

Table 7 .
Tracking time comparison.

Table 9 .
Comparison of absolute trajectory error.