PLY-SLAM: Semantic Visual SLAM Integrating Point–Line Features with YOLOv8-seg in Dynamic Scenes

Mao, Huan; Luo, Jingwen

doi:10.3390/s25123597

Open AccessArticle

PLY-SLAM: Semantic Visual SLAM Integrating Point–Line Features with YOLOv8-seg in Dynamic Scenes

by

Huan Mao

¹ and

Jingwen Luo

^1,2,*

¹

School of Information Science and Technology, Yunnan Normal University, Kunming 650500, China

²

Engineering Research Center of Computer Vision and Intelligent Control Technology, Department of Education of Yunnan Province, Kunming 650500, China

^*

Author to whom correspondence should be addressed.

Sensors 2025, 25(12), 3597; https://doi.org/10.3390/s25123597

Submission received: 28 April 2025 / Revised: 3 June 2025 / Accepted: 5 June 2025 / Published: 7 June 2025

(This article belongs to the Special Issue Advances in Vision-Based UAV Navigation: Innovations and Applications)

Download

Browse Figures

Review Reports Versions Notes

Abstract

In dynamic and low texture environments, traditional point-feature-based visual SLAM (vSLAM) often faces the challenges of poor robustness and low localization accuracy. To this end, this paper proposes a semantic vSLAM approach that fuses point-line features with YOLOv8-seg. First, we designed a high-performance 3D line-segment extraction method that determines the number of points to be sampled for each line-segment in terms of the length of the 2D line-segments extracted from the image, and back-projects these sampled points combined with the depth image to obtain the 3D point set of the line-segments. On this basis, accurate 3D line-segment fitting is realized in combination with the RANSAC algorithm. Subsequently, we introduce Delaunay triangulation to construct the geometric relationships between map points, detect dynamic feature points by matching changes in the topological structure of feature points in adjacent frames, and combine them with the instance labels provided by the YOLOv8-seg to accurately remove dynamic feature points. Finally, a loop-closure detection mechanism that fuses point–line features with instance-level matching is designed to calculate a normalized similarity score by combining the positional similarity of the instances, the scale similarity, and the spatial consistency of the static instances. A series of simulations and experiments demonstrate the superior performance of our method.

Keywords:

dynamic scene; semantic visual SLAM; point-line features; YOLOv8-seg; loop-closure detection

1. Introduction

With the rapid development of intelligent robots and autonomous driving technology, the robustness and accuracy of Simultaneous Localization and Mapping (SLAM) systems are facing severe challenges in complex dynamic scenes. Traditional point-feature-based visual SLAM (vSLAM) schemes (e.g., the ORB-SLAM family [1,2,3], VINS-Fusion [4]) perform well in static, highly textured environments, but in dynamic or weakly textured indoor scenes (e.g., hospital corridors, office environments), they tend to fail in pose tracking due to the accumulation of errors caused by the interference of dynamic objects as well as the sparsity of features. According to statistics, over 60% of indoor scenes contain large areas of low texture (such as white walls and glass curtain walls) [5], and feature point mismatches caused by dynamic objects (such as pedestrians and mobile devices) can further introduce pose estimation bias and even cause system crashes. Although existing methods mitigate this problem to some extent by fusing inertial measurement units (IMUs) [6] or laser radar (LiDAR) [7] based on visual sensors, it is still a great challenge to balance the real-time performance of multimodal data fusion with dynamic feature refinement rejection. It is worth mentioning that the traditional loop-closure detection mechanism has a high mismatch rate in dynamic scenes due to the lack of semantic information, which seriously limits the reliability of long-term localization in SLAM systems.

To address the above challenges, researchers have proposed a variety of improvement methods, including fusing line features to enhance structural information, introducing semantic segmentation models to identify and reject dynamic regions, and utilizing topology modeling to improve system robustness. Typical works include binocular PL-SLAM [8] and monocular PL-SLAM [9], which enhance the system’s localization and mapping capabilities by fusing point and line features; and DynaSLAM [10] and DS-SLAM [11], which utilize deep learning methods to detect and eliminate dynamic objects, enhancing the stability of the system in dynamic environments. Although there have been many research achievements, how to effectively integrate multiple-feature information and improve the robustness and accuracy of the system in dynamic and low texture environments is still an urgent issue to be solved. To this end, this paper proposes a semantic vSLAM algorithm that integrates point–line features with YOLOv8-seg. By designing an adaptive 3D line-segment construction method based on depth uncertainty and combining geometric topology analysis with dynamic feature collaborative filtering mechanism of semantic labels, the stability of the system in dynamic and weak texture environments is significantly improved. In addition, this paper innovatively constructs a multimodal loop-closure detection model that integrates the bag-of-words (BoW) similarity of point–line features with instance level spatial distribution consistency, effectively reducing the closed-loop mismatch rate. The main contributions of this paper are as follows:

A high-performance 3D line-segment extraction method is designed, which determines the number of points to be sampled for each line-segment based on the length of the extracted 2D line-segments, and then combines these with the depth image for back-projection to obtain the 3D point set of the line-segments, based on which, the RANSAC algorithm is utilized to achieve accurate 3D line-segment fitting to solve the issue of the error sensitivity of the endpoints in the back-projection of the traditional line features.
A “geometric + semantic” dynamic feature rejection strategy is constructed, which utilizes Delaunay triangulation to construct the geometric relationship between map points, and detects dynamic featNure points through the topological changes of matching feature points in adjacent frames, and then combines the instance labels provided by YOLOv8-seg to accurately reject dynamic feature points.
A loop-closure detection mechanism that fuses point–line features with instance-level matching is developed, which combines the positional similarity of instances, scale similarity, and spatial consistency of static instances to calculate normalized similarity scores, and improves the accuracy and robustness of closed-loop detection through a weighted fusion strategy.

The rest of the paper is organized as follows. The related works are briefly reviewed in Section 2. The overview of our methodology is described in Section 3, and its implementation scheme is detailed. The simulation studies under various datasets and performance evaluations are presented in Section 4, while Section 5 concludes the paper.

2. Related Work

2.1. Point–Line Features Based SLAM

To enhance the geometric representation of weakly textured scenes, many researchers have proposed integrating line features with point features for optimization. Early works such as PL-SLAM [9] used collinear constraints to jointly optimize point–line features, but its line parameterization suffers from redundancy of degrees of freedom. Zhou et al. [12] developed a vSLAM method based on building structural lines, which utilizes the global directional information of structural lines to constrain camera orientation and reduce localization drift. However, this method is prone to failure in unstructured scenes, and if LSD line features without structural constraints are used, it can also operate stably in unstructured scenes. In recent years, PL-VIO, which was proposed by Wang et al. [13], achieved visual–inertial tight coupling by modeling the re-projection error of line feature endpoints, but it did not consider the impact of depth sensor noise on line reconstruction. To enhance the robustness of localization in weakly textured scenes, Xu et al. [14] proposed a point–line-based visual inertial system IPLM-VINS by introducing line features into the VINS mono system. This method used a line-segment length suppression strategy to remove redundant short lines in the front-end, and added a line-segment re-projection error and Huber kernel function in the back-end to enhance the ability to resist outliers. Unfortunately, it may remove some short lines with obvious features and accurate matching, which will affect the quality of the line segments and be easily affected by occlusion in dynamic environments. Zeng et al. [15] proposed an efficient point–line-based visual inertial SLAM system EPL-VINS by combining the Lucas–Kanade (LK) algorithm with the region-growing (RG) algorithm of the line segment detector (LSD); however, combining LK optical flow with the line-segment detection algorithm greatly affects the real-time performance of the system and it cannot operate stably in dynamic environments. For vSLAM systems, the noise characteristics of depth sensors have a significant impact on the accuracy of feature reconstruction. ToF cameras are prone to depth drift due to multipath interference [16], while structured light cameras are prone to failure on transparent object surfaces [17]. In this vein, Shabanov et al. [18] introduced a low-quality depth image denoising and optimization method based on self-supervised learning, but did not consider the geometric characteristics of line features. In terms of the above analysis, for extracted line features, we propose an adaptive sampling strategy based on line-segment length, and combines it with a depth noise distribution model to fit the optimal 3D line-segment using an improved RANSAC algorithm, significantly reducing endpoint projection errors.

2.2. Dynamic SLAM

In dynamic scenes, SLAM systems based on static environment assumptions have significant limitations. These systems often misidentify dynamic components as part of a static environment, which introduces a large amount of error accumulation when optimizing camera poses, leading to localization failures and serious deviations in the constructed maps. Therefore, improving the robustness and reliability of vSLAM systems in dynamic scenes is a challenging research hotspot. To obtain a consistent and complete map of dynamic scenes, many methods [19,20,21,22] adopted Mask R-CNN for pre-detection of dynamic targets, and combined it with traditional geometric methods to detect and remove dynamic features, improving the stability and accuracy of the algorithm.

Currently, eliminating dynamic interference mainly relies on motion consistency testing or semantic priors. To address the performance degradation of RGB-D SLAM in dynamic environments, Sun et al. [23] proposed an online motion removal approach based on RGB-D data. Without requiring prior semantic or appearance information, their method incrementally built and updated a foreground model to filter out dynamic object data, thereby enhancing the robustness of RGB-D SLAM. A real-time depth edge-based RGB-D SLAM method was developed by Li et al. [24], which reduced the influence of dynamic objects through static weighting and significantly improved localization accuracy in dynamic environments. DynaSLAM [10] combined multi-view geometric and semantic segmentation, but it required real-time GPU inference with high-computational overhead. If a lightweight YOLOv8-seg network is used, it can effectively meet the requirements of real-time performance. The DPL-SLAM proposed by [25] introduced line features based on ORB-SLAM3 and combined YOLOv5 and the Lucas–Kanade (LK) optical flow method to eliminate potential dynamic features. Nevertheless, the line-segment extraction algorithm they used suffers from over-segmentation issues, resulting in poor structural quality of the constructed map. Additionally, the optical flow method is highly dependent on the environment, which limits its applicability in various scenarios. Instead, if a high-performance 3D line segment extraction method is constructed by merging and optimizing line features, it can not only effectively solve the issue of over segmentation, but also generate sparse maps with good structure, thereby enhancing robustness in the presence of partially occluded or incomplete line segments. It is worth mentioning that if lightweight YOLOv8-seg with Delaunay triangulation is further introduced, it can be well-suited for real-time applications in dynamic environments. Dong et al. [26] developed an adaptive method based on point–line–planar multi-feature fusion, which dynamically selects the feature type by calculating the information entropy of an image region (planar features are disabled when the information entropy exceeds a predefined threshold) and integrates a YOLOv5 detector to remove dynamic objects. However, the presence of dynamic objects may cause the information entropy to be higher than the set threshold, causing the system to extract only point features. Once the feature points on the dynamic objects have been eliminated, this can lead to an insufficient number of static feature points, which can affect the estimation of the mobile robot’s position or even the loss of tracking.Hence, this paper introduces Delaunay triangulation into dynamic detection, detecting dynamic feature points through changes in the topology of matching feature points in adjacent frames, and combining with the instance labels provided by the YOLOv8-seg to accurately remove dynamic feature points.

Additionally, the conventional BoW models (e.g., DBoW2 [27]) are susceptible to interference from dynamic objects and angle-of-view changes. SA-LOAM [28] embedded semantic labels into laser point cloud descriptors, but it was not applicable to vSLAM. Ji et al. [29] developed an object-level loop-closure detection method based on a 3D scene graph, which combined spatial layout and semantic consistency. The method improved loop-closure detection robustness and accuracy through object-level data association, graph matching, and object pose graph optimization, but did not integrate geometric features. Thus, this work proposes a loop-closure detection mechanism that combines point–line features with instance-level matching. By combining the positional similarity, scale similarity, and spatial consistency of static instances, normalized image similarity scores are calculated, and a weighted fusion strategy is used to enhance the discriminative ability of closed-loop detection.

3. Pipeline

Figure 1 provides an overview of the constructed 3D SLAM system in this paper. In the front-end, we extract point–line features on the input RGB image and perform instance segmentation using YOLOv8-seg. The dynamic features are then detected and rejected by combining the YOLOv8-seg instance segmentation results with Delaunay triangulation. Further, the camera’s pose is calculated based on the obtained high-confidence static features using minimized re-projection error, i.e.,

e_{p} = p_{I} (u, v, 1) - F (T_{c w}, K, z_{w}, P_{w})

(1)

where

p_{I} (u, v, 1)

is the homogeneous coordinate representation of pixel

(u, v)

in the current frame,

P_{w}

represents a spatial point

(x_{w}, y_{w}, z_{w})

, K represents the camera’s internal parameter,

F (\cdot)

represents the mapping from world coordinates to pixel coordinates, and

T_{c w}

represents the pose transformation matrix.

Additionally, let

(P_{w s}, P_{w e})

be the two endpoints of a line segment L in the world coordinate system, and

(p_{c s}, p_{c e})

be the two endpoints of the line segment l that matches L in the current frame. On the one hand, we project

(P_{w s}, P_{w e})

onto the current image frame and calculate the distance between their projection points and l to obtain the re-projection error; On the other hand, considering that the endpoints of line segments may be obscured or misaligned in different frames, for the extracted line segments in the current frame, we calculate the coefficients of the line by their endpoints, and then calculate the distance from the projected endpoints to the extracted line segments, i.e.,

l {(a, b, c)}^{T} = \frac{p_{c s}^{H} \times p_{c e}^{H}}{∥p_{c s}^{H} \times p_{c e}^{H}∥}

(2)

where

a, b, c

denote the coefficient of line-segment l,

p_{c s}^{H}

and

p_{c e}^{H}

denote the homogeneous coordinate of

p_{c s}

and

p_{c e}

, respectively. Thus, the error function of the line-segment is as follows:

e_{l} = (\begin{matrix} l {(a, b, c)}^{T} \cdot Π^{H} (T_{c w}, K, z_{w}^{s}, P_{w}^{s}) \\ l {(a, b, c)}^{T} \cdot Π^{H} (T_{c w}, K, z_{w}^{e}, P_{w}^{e}) \end{matrix})

(3)

where

Π^{H} (T_{c w}, K, z_{w}^{s}, P_{w}^{s})

indicates the homogeneous coordinate representation of

Π (T_{c w}, K, z_{w}^{s}, P_{w}^{s})

.

On this basis, assuming that the number of matched point features and line features is n and m, respectively. Then, the camera’s pose optimization is as follows:

ω^{*} = arg min [\sum_{j = 1}^{n} ρ (e_{p, j}^{T} A_{e_{p, j}}^{- 1} e_{p, j}) + \sum_{k = 1}^{m} ρ (F (n) e_{l, k}^{T} B_{e_{l, k}}^{- 1} e_{l, k})]

(4)

where

ρ (\cdot)

denotes the kernel function,

F (n) = 2^{- \frac{n}{50}}

is an adaptive factor for adjusting the weights of points and lines [30], and

A_{e_{p, j}}^{- 1}

and

B_{e_{l, k}}^{- 1}

denote the inverse of the covariance matrices of the point features and line features, respectively.

In the back-end, we design a loop-closure detection mechanism that fuses point–line features with instance-level matching to calculate a normalized image similarity score by combining the positional similarity, scale similarity, and spatial consistency of static instances. Moreover, to meet the application requirements of navigation and obstacle avoidance, we not merely constructed a pose graph based on high-confidence static 3D map points and 3D map lines obtained by removing dynamic objects from the front-end, as well as keyframes, but also introduced the PCL point cloud library and Octomap to construct a 3D semantic octree map of the environment.

3.1. Line Feature Extraction and Optimization

In this work, to describe the structural scale of the scene more clearly and improve the geometric expression ability of line features, we designed a high-performance 3D line-segment extraction method based on the LSD algorithm [31]. This method adaptively adjusts the number of points to be sampled for each line-segment according to the length of the extracted 2D line-segments in the image, and couples these sampling points with depth information for back projection to obtain the 3D point set of the line-segments. Then, combined with the RANSAC algorithm, accurate 3D line-segment fitting is achieved, enhancing the robustness in the case of incomplete line-segments caused by partial occlusion. The specific steps are as follows:

Assuming that the starting and ending points of the extracted $i^{t} h$ line feature $L_{\vec{P_{s i} P_{e i}}}$ are represented as $P_{s i} = (x_{s i}, y_{s i})$ and $P_{e i} = (x_{e i}, y_{e i})$ , respectively, the number of sampling points is adaptively adjusted by the length of the line-segment as follows:

$N_{sup} = min (L_{A v g}, {∥P_{s i} - P_{e i}∥}_{2})$

(5)

where $L_{A v g}$ denotes the result of rounding up the average length of the line-segments extracted in the current frame.
Generate uniform sampling points based on the number of adaptive sampling points, i.e.,

$P_{j} = P_{s i} (1 - \frac{j}{N_{sup}}) + P_{s i} (\frac{j}{N_{sup}}), j \in \{0, 1, \dots, N_{sup}\}$

(6)
Back-project the generated sample points into 3D points, i.e.,

$P_{j} = (\begin{matrix} X_{j} \\ Y_{j} \\ Z_{j} \end{matrix}) = (\begin{matrix} \frac{u_{j} - c_{x}}{f_{x}} d_{j} \\ \frac{v_{j} - c_{y}}{f_{y}} d_{j} \\ d_{j} \end{matrix})$

(7)

where $(u_{j}, v_{j})$ is the pixel coordinate of $p_{j}$ in the RGB image, $d_{j}$ is the depth value corresponding to pixel $(u_{j}, v_{j})$ in the depth image, and $c_{x}$ , $c_{y}$ , $f_{x}$ , and $f_{y}$ denote the internal parameters of the camera.
To construct a robust geometric model, further calculate the covariance matrix $\sum_{j}$ for each three-dimensional point $P_{j}$ , which not merely takes into account the uncertainty of image point projection, but also incorporates the depth error model to estimate the depth variance, i.e.,

$\sum_{j} = [\begin{matrix} \frac{Z_{j}}{f_{x}} 0 \frac{X_{j}}{Z_{j}} \\ 0 \frac{Z_{j}}{f_{y}} \frac{Y_{j}}{Z_{j}} \\ 0 0 1 \end{matrix}] \cdot [\begin{matrix} 1 0 0 \\ 0 1 0 \\ 0 0 σ_{Z_{j}^{2}} \end{matrix}] \cdot [\begin{matrix} \frac{Z_{j}}{f_{x}} 0 0 \\ 0 \frac{Z_{j}}{f_{y}} 0 \\ \frac{X_{j}}{Z_{j}} \frac{Y_{j}}{Z_{j}} 1 \end{matrix}] = J \cdot Λ \cdot J^{T}$

(8)

$σ_{Z_{j}^{2}} = α * Z_{j}^{2} + β * Z_{j} + γ$

(9)

where $α$ , $β$ and $γ$ are quadratic polynomial coefficients used to model the noise standard deviation of depth sensors, J denotes the Jacobian matrix for restoring image points to 3D points, and $Λ$ denotes the covariance of uncertainty between image points and depth values.
Construct robust 3D line-segment fitting algorithms based on Marginal Distance. Specifically, given a 3D point set $\{P_{j}\}$ and its covariance, fit a straight line L by randomly sampling two points from it via RANSAC and use the Mahalanobis distance to determine whether the other points are interior points. After completing the iterations, perform SVD fitting optimization on the maximum interior point set $I$ , i.e.,

$arg min_{v} \sum_{i \in I} {∥(P_{i} - \bar{P}) - v v^{T} (P_{i} - \bar{P})∥}^{2}$

(10)

where $\bar{P}$ is the centroid of the interior point, and v is the direction vector of the line.
Project the interior points along the optimized direction and take the extreme points as line endpoints A and B, i.e.,

$\{\begin{matrix} A = \bar{P} + min_{i} \frac{(P_{i} - \bar{P}) \cdot v}{{∥v s .∥}^{2}} \cdot v \\ B = \bar{P} + max_{i} \frac{(P_{i} - \bar{P}) \cdot v}{{∥v s .∥}^{2}} \cdot v \end{matrix}$

(11)
The reliability of the final constructed 3D line-segments is verified by means of the proportion of inliers and the length of the line-segment, i.e.,

$\{\begin{matrix} \frac{I}{l_{l e n}} > T h_{1} \\ ∥A - B∥ > T h_{2} \end{matrix}$

(12)

where $l_{l e n}$ is the length of the 2D line-segment, $T h_{1}$ and $T h_{2}$ are the inlier threshold and length threshold, respectively.

In this way, the optimized line feature set can be obtained by saving the 3D line-segments that satisfy the condition of Equation (12). This method not only improves the reconstruction rate of line features in low texture environments and enhances robustness to partially occluded or incomplete line-segments but also significantly improves reconstruction accuracy and robustness through adaptive sampling and geometric validation strategies.

3.2. Dynamic Feature Detection and Rejection

Typically, the impact of dynamic objects can cause changes in the geometric relationship between dynamic feature points and static feature points. Therefore, in this paper, dynamic feature points are detected and rejected by fusing Delaunay triangulation and YOLOv8-seg to ensure the stable operation of the system, as shown in Figure 2. First, we introduce Delaunay triangulation to construct a triangular network between map points to represent their geometric relationships, and detect dynamic feature points by detecting changes in the geometric relationships of map points corresponding to matching features of adjacent frames. Then, the instance segmentation model YOLOv8-seg is used to obtain the labels and detection boxes of each semantic target, and combined with depth images to obtain more accurate target states.

Traverse all feature points, and if the dynamic weight $w_{i}$ of the $i^{t h}$ feature point is greater than the set threshold $w_{t h}$ , add that feature point to the dynamic feature point array $D P$ . Moreover, read the semantic labels of each feature point and save them to the array of that semantic label, so that the total number of feature points $N_{a l l}$ included in each label can be calculated.
Traverse the set of dynamic feature points $D P$ , for the $i^{t h}$ feature point, read the semantic label of its location, and add 1 to the global dynamic weight $w_{l a b}$ of its label.
The dynamic degree $r_{d}$ of each label can be obtained through the above two steps, as follows:

$r_{d} = \frac{w_{l a b}}{N_{a l l}}$

(13)
According to its semantic information, if it is an active dynamic label (pedestrian, car, animal) and its global dynamic weight is greater than the minimum threshold $T h_{min}$ , all feature points on its label are eliminated and the label is labeled as dynamic, which facilitates the elimination of line-segments on dynamic objects; if it is a passive dynamic label (chair, backpack) and its global dynamic weight is greater than the maximum threshold $T h_{max}$ , all feature points on its label are eliminated and the label as dynamic to facilitate the culling of line-segments on dynamic objects. It should be noted that the dynamic weights obtained through triangulation may have certain errors. Hence, for static objects (chairs, bags) in a general sense that are judged as dynamic objects, the condition requirements should be stricter to avoid removing too many features that may prevent the system from completing pose estimation and mapping.
Traverse all line features, and if the starting or ending point of the line feature is within the dynamic label, remove it.

3.3. Loop-Closure Detection

To improve the accuracy and robustness of loop-closure detection, we construct a similarity calculation method that combines a point–line BoW model with instance-level matching. On the one hand, we calculate the similarity

S_{p} (v_{c u r}^{p}, v_{c a n d}^{p})

and

S_{l} (v_{c u r}^{l}, v_{c a n d}^{l})

of the point’s BoW vector

v^{p}

and line’s BoW vector

v^{l}

in the current frame

F_{c u r}

and candidate frame

F_{c a n d}

, respectively. Then, the similarity is weighted using the information entropy

H_{p}

of the point features and the information entropy

H_{l}

of the line features in

F_{c u r}

to reflect the degree of influence of the point-line features in the similarity computation of the closed-loop candidate frames. In this way, the similarity calculation of fused point–line features is constructed as follows:

S_{p l} (v_{c u r}, v_{c a n d}) = \frac{H_{p}}{H_{p} + H_{l}} S_{p} (v_{c u r}^{p}, v_{c a n d}^{p}) + \frac{H_{l}}{H_{p} + H_{l}} S_{l} (v_{c u r}^{l}, v_{c a n d}^{l})

(14)

On the other hand, we design an instance-based image similarity function that enhances the matching confidence of static objects by fusing the IoUs of entity detection boxes and introducing a spatial consistency check. Furthermore, a dynamic thresholding mechanism is adopted to accommodate different numbers of instance matches, making the final score more reliable. The specific steps are as follows:

To unify the position or distance of instances within the same scale range, we standardize the relative position of objects by calculating the diagonal length of the image, i.e.,

$d i a g = \sqrt{W^{2} + H^{2}}$

(15)

where $W$ and $H$ are the width and height of the input image, respectively.
To avoid the influence of dynamic objects on loop-closure detection, when calculating instance-matching similarity, we directly remove the potential dynamic object instances (people, cats, etc.), and classify other instances into the corresponding category list, and further construct a similarity matrix for each category. Specifically, assuming that a certain category contains $n_{1}$ and $n_{2}$ instances in $F_{c u r}$ and $F_{c a n d}$ , respectively, a similarity matrix $S_{i n s t} (c u r, c a n d)$ of size $n_{1} \times n_{2}$ can be constructed. In this way, the comprehensive similarity score between the $i^{t h}$ instance and the $j^{t h}$ instance is represented as $S_{i n s t} (c u r_{i}, c a n d_{j})$ , in which the position similarity $S_{P o s_{i j}}$ is obtained by calculating the normalized distance between the center points of instances. It mainly emphasizes whether the relative spatial positions of two instances on the image are close, i.e.,

$S_{P o s_{i j}} = 1 - \frac{D (c_{i}, c_{j})}{d i a g}$

(16)

where $D (c_{i}, c_{j})$ denotes the distance between the center points of two instances.

While the size similarity

S_{I o U (b o x_{i}, b o x_{j})}

in

S_{i n s t} (c u r_{i}, c a n d_{j})

is mainly calculated by the IoU of the detection box, emphasizing whether similar areas are occupied in the image. When the misalignment is severe or the size difference is large, the size similarity will be lower, i.e.,

S_{I o U (b o x_{i}, b o x_{j})} = \frac{A r e a (b o x_{i} \cap b o x_{j})}{A r e a (b o x_{i} \cup b o x_{j}) + ε} = \frac{A r e a (b o x_{i} \cap b o x_{j})}{A r e a (b o x_{i}) + A r e a (b o x_{j}) A r e a (b o x_{i} \cap b o x_{j}) + ε}

(17)

where

b o x_{i}

denotes the detection box of the

i^{t h}

instance,

A r e a (\cdot)

denotes calculating the area of the detection box. To prevent the denominator from being 0, we set

ε = 1 \times 10^{- 6}

in this paper.

By weighting and summing them as follows, it is not only possible to determine whether two instances are in similar positions in the image, but also whether they occupy similar areas, thus improving the robustness of the matching, i.e.,

S_{i n s t} (c u r_{i}, c a n d_{j}) = 0.6 * S_{P o s_{i j}} + 0.4 * S_{I o U (b o x_{i}, b o x_{j})}

(18)

For static objects, it is also necessary to check spatial consistency, which is essentially to determine whether two instances have similar spatial layout environments (i.e., whether their relative positional relationships in their respective scenes are consistent). For each instance

c u r_{i}

and

c a n d_{j}

, extract objects belonging to the same semantic category but different instances from

F_{c u r}

and

F_{c a n d}

, respectively, and save them as neighboring instances; Meanwhile, calculate the distance between the center position of each neighboring instance and the target instance, and normalize it (diagonal length) to save it as a neighborhood distance vector. Then, pairwise comparisons are made between the neighborhood distance vectors of the two target instances in their respective scenes, and the normalized distance difference between each pair is calculated. In this way, all similarities are normalized to obtain an average similarity

{S_{A}}_{v e r} (c u r_{i}, c a n d_{j})

. Further, by setting a threshold

T h_{A v e r}

and constructing the following reward and punishment mechanism based on the relationship between

{S_{A}}_{v e r} (c u r_{i}, c a n d_{j})

and

T h_{A v e r}

, i.e.,

S_{i n s t} (c u r_{i}, c a n d_{j}) = \{\begin{matrix} min (1, S_{i n s t} (c u r_{i}, c a n d_{j}) + 0.1), {S_{A}}_{v e r} (c u r_{i}, c a n d_{j}) > T h_{A v e r} \\ S_{i n s t} (c u r_{i}, c a n d_{j}) * 0.8, {S_{A}}_{v e r} (c u r_{i}, c a n d_{j}) < T h_{A v e r} \end{matrix}

(19)

After the above calculation, we can obtain the instance-matching similarity score between two frames. To improve the recall of matching similar instances between images, we introduce a two-step matching method, which first identifies the instances with positional similarity greater than 95% in the two frames as matches and saves their matching information, and then adds 1 to the number of matches, meanwhile, adding 1 to the total similarity scores of semantic instances in the two frames, i.e.,

S_{t o t l e} (c u r, c a n d) = S_{t o t l e} (c u r, c a n d) + 1

(20)

Considering that there may be some true matches that are not recognized due to factors such as viewpoint and occlusion after position matching, we introduce a dynamic threshold matching strategy in the second stage. That is, for the remaining unmatched instances, we introduce a dynamic threshold to control the tolerance of matching. The setting of the dynamic threshold is as follows:

T h_{d y n a} = max (0.3, min (0.6, 0.6 - 0.05 \times min (n_{1}, n_{2})))

(21)

Subsequently, the maximum similarity pair

(i^{*}, j^{*})

that has not yet been matched is identified from the similarity matrix

S_{i n s t} (c u r, c a n d)

. If

S_{i n s t} (c u r_{i}, c a n d_{j}) > T h_{d y n a}

, its matching information is saved and the number of matches

N_{m a c h}

is increased by 1; also, the total similarity scores of semantic instances of the two frames are calculated as follows:

S_{t o t l e} (c u r, c a n d) = S_{t o t l e} (c u r, c a n d) + S_{i n s t} (c u r_{i}, c a n d_{j})

(22)

This strategy can improve the robustness and recall of object matching in multiple instances while ensuring correct matching. Finally, after the matching is completed, an average similarity score is calculated based on the total similarity score of semantic instances in the two frames and the number of matches, which is the final instance similarity score for

F_{c u r}

and

F_{c a n d}

, i.e.,

S_{f i n}^{i n s t} (c u r, c a n d) = \frac{S_{t o t l e} (c u r, c a n d)}{N_{m a c h}}

(23)

Therefore, the final similarity of the loop-closure detection based on the point–line BoWs and semantic instances is calculated as follows:

S_{f i n} (c u r, c a n d) = 0.6 \times S_{p l} (v_{c u r}, v_{c a n d}) + 0.4 \times S_{f i n}^{i n s t} (c u r, c a n d)

(24)

4. Simulations and Experiments

To verify the performance of the proposed algorithm, a series of simulation studies and experimental validations were conducted. All of the experiments were performed on a laptop with an Intel (Intel Corporation, Santa Clara, CA, USA) i7-13700H CPU with 16 GB of DDR3 RAM, running under Ubuntu 18.04.

4.1. Simulation Studies Under Datasets

In this study, we evaluated the performance of our algorithm on the TUM RGB-D and ICL-NUIM datasets, and employed ATE (Absolute Trajectory Error) and RPE (Relative Pose Error) as indicators to measure the accuracy of pose estimation. To validate the pose estimation performance of the SLAM system that integrates point–line features in this paper, we first conducted comparative experiments on the TUM static dataset sequence using ORB-SLAM2 [2], PL-SLAM [9], RGB-D SLAM [32], PLP-SLAM [33], and Ours. All statistical data comes from papers on corresponding algorithms and real experiments of this system, where “-” indicates that the algorithm did not provide relevant experimental results, and “×” indicates that the accuracy of the algorithm implemented in this paper did not improve when comparing the two algorithms.

From Table 1, it can be seen that the proposed method exhibited superior accuracy performance in multiple typical scenes of the TUM dataset. Among them, in scenarios such as fr1_floor and fr3_long_office, the proposed method showed a 77% and 53% improvement compared to PL-SLAM, respectively, demonstrating its strong robustness in environments with unclear structures. Meanwhile, in relatively simple structural environments such as fr1_xyz and fr2_xyz, our method still demonstrated stable accuracy advantages compared to PL-SLAM and ORB-SLAM2, further verifying the adaptability and stability of our algorithm under different structural features. We also found that our method obtained the highest accuracy on five of the test sequences, and the gap between our algorithm and the algorithm with the highest accuracy on the rest of the sequences was smaller. From this, it can be concluded that the line features effectively improve the localization accuracy and stability of the system.

Table 2, Table 3 and Table 4 compare the localization accuracy of ORB-SLAM2, DS-SLAM [11], RDS-SLAM [34], DynaTm SLAM [35], and Ours on the walking and sitting_static dynamic sequences in the TUM dataset. It can be observed that in the high-dynamic scenes of the walking sequence, our method achieved an average improvement of 96.74% and 69.95% in ATE compared to ORB-SLAM2 and RDS-SLAM, respectively. Meanwhile, we find that there was no significant difference in pose estimation accuracy between our algorithm and DS-SLAM. However, DS-SLAM mainly relied on epipolar geometric constraints to distinguish between dynamic and static features. In the case of frequent camera rotation, its accuracy may be affected due to the difficulty in accurately finding epipolar lines.

Furthermore, although the deep learning-based method can determine a more complete dynamic region, it can lead to misjudging the state of the dynamic target during the change of camera’s angle-of-view. Instead, our method preliminarily identified dynamic points by judging the geometric relationships between adjacent frames, and combined YOLOv8-seg segmentation to determine dynamic regions, which can accurately detect dynamic features even in the case of angle-of-view changes. Thus, such a dual strategy effectively enhances the adaptability of the algorithm. It can be further seen from Table 2, Table 3 and Table 4 that the ATE of our algorithm was higher than that of DynaTm-TM on the fr3_walk_rpy sequence, which is due to the fact that its motion state was misjudged when the angle-of-view changes greatly, and that there was a period of time in the dataset in which only a single metal partition was captured, which results in the number of feature points being low. For this case, the line features in our method can effectively ensure the stability of the system. In addition, on the walking-static sequence, there was case where pedestrians occupy most of the area in the scene image, while our method possessed good adaptability due to its ability to extract enough line features, resulting in smaller trajectory errors.

Figure 3 demonstrates the ATE results of ORB-SLAM2 and our algorithm on fr3_walk_xyz, fr3_walk_static, fr3_walk_rpy, and fr3_walk_half sequences, respectively, where the larger the red area is, the larger the error between the estimated trajectory and the reference trajectory is. It can be found that the trajectories estimated by the proposed algorithm had a high degree of fit to the real trajectories, which is mainly due to the fact that we utilize the dynamic feature rejection technique based on YOLOv8-seg and Delaunay, which not merely maintains a lower error performance, but also enhances the stability of the system.

Additionally, we quantitatively evaluate the performance of the 3D line-segment extraction method in this paper by calculating the average reconstruction rate as follows:

\bar{L R R} = \frac{1}{N_{F}} \sum_{i = 1}^{N_{F}} (\frac{N_{3 D}^{i}}{N_{2 D}^{i}} * 100 %)

(25)

where

N_{F}

denotes the number of images in the dataset used in the experiment,

N_{3 D}^{i}

denotes the number of 3D line-segments constructed using the method of this paper in the

i^{t h}

frame, and

N_{2 D}^{i}

denotes the number of 2D line-segments used to construct the 3D line-segments in the

i^{t h}

frame. As a note, due to the excessive segmentation of LSD line-segment extractor, we performed optimization on the original LSD extracted line-segments, so the number of reconstructed line-segments used is not consistent with the original extracted line-segments.

Figure 4 shows the 3D line-segment reconstruction results of our method on different dataset sequences. As can be seen, our method can accurately reveal the structural features of the scene by reconstructing the local sparse map through 3D line-segments. From Table 5, it can be seen that the fitting reconstruction method proposed in this paper exhibited significant advantages compared to traditional line-segment endpoint triangulation methods. On the sequences of live_room_traj1_frei, traj0_frei_png, and freiburg3_long_office_househol, the reconstruction accuracy increased by 34.8%, 21.5%, and 39.7%, respectively, with an average increase of 32%. The improvement was particularly significant in complex office scenes. Moreover, the accuracy fluctuation range of our method in three scenes was only 3.5%, while the traditional method reached 14.7%. This indicates that the generalization ability of the algorithm for different scenes is effectively improved by the geometrically-constrained optimization and continuous line-segment fitting strategies.

Figure 5 illustrates the local mapping effects of different algorithms on dynamic fr3_walk_xyz sequences. It can be seen that when there were dynamic objects in the camera’s field-of-view, it not only affected the accuracy of pose estimation but also led to a lot of residual images in the constructed 3D dense map. Such dense maps cannot be used for navigation and obstacle avoidance after being converted into octree maps. However, our algorithm detected dynamic objects in the front-end and excluded dynamic features from the map construction process, effectively ensuring the consistency of the map.

Figure 6 demonstrates the matching effects based on semantic instance similarity in this paper. From Figure 6a, it can be seen that the two images possess high similarity, with two instances of chairs, two instances of displays, and two instances of keyboards. However, the proposed method utilized position threshold restrictions to prevent mismatches and thus improve the accuracy of similarity calculation between the two frames. In Figure 6d, although the static backgrounds of the two scenes had a high similarity, the occlusion of dynamic objects results in a low similarity score calculated by the point–line-based bag-of-words model. However, our method can effectively avoid the interference of dynamic objects. For the case of Figure 6f, where two images had significant differences but both had the same instance, our algorithm will not encounter the problem of high similarity due to the same instance. From Table 6, it can be further concluded that the instance-matching method proposed in this paper can adapt well to various scenes and calculate reasonable similarity scores, effectively improving the recall of loop-closure detection.

4.2. Experimental Testing in Real Scenes

In the experiment, we conducted experimental verification of our algorithm using a TurtleBot3 mobile robot (ROBOTIS, Seoul, Republic of Korea) equipped with an Intel D435i depth camera. Figure 7 and Table 7 show the experimental scene and camera parameters, respectively. It should be noted that there are randomly walking pedestrians in this scene, so it is a real dynamic scene.

Figure 8 illustrates the feature extraction and sparse mapping effects of a robot in real dynamic scenes. It can be observed from Figure 8a that the system extracts many point–line features from the pedestrian, resulting in poor consistency between the constructed sparse map and the real scene. However, in Figure 8b, most of the features on the pedestrian have been removed, and the sparse map has a high-consistency with the real scene, demonstrating the effectiveness of the dynamic feature removal method proposed in this paper.

To demonstrate the global mapping effect, we constructed a global semantic map in another office with more objects, as shown in Figure 9. It can be seen that since ORB-SLAM2 did not involve dynamic feature rejection, resulting in the presence of residual shadows of pedestrians walking back and forth in the generated global map, and the overall structure had a certain deviation from the real environment, which will affect the autonomous navigation and obstacle avoidance of the robot. In contrast, we combined YOLOv8-seg and Delaunay to remove feature points on dynamic objects, resulting in a global map that maintains high-consistency with the real environment and exhibits good robustness throughout the entire process.

To clearly show the effects of semantic mapping, we exploited different colors to render the entities according to the results of YOLOv8-seg instance segmentation, and construct the dense semantic map and semantic octree map of the real scene, respectively. From Figure 10c,d, we find that the semantic maps constructed by our method can accurately reveal the corresponding entities, and the whole map was not misaligned and had good consistency.

Meanwhile, we utilized the SLAM trajectory evaluation tool “EVO” to evaluate the global localization accuracy of our algorithm. Figure 11 shows the trajectories generated by our algorithm and ORB-SLAM2, and it can be found that the trajectory of ORB-SLAM2 exhibited serious errors compared to ours. This is mainly due to the fact that when the robot was located at the loop-closure point for the first time, there were dynamic objects in the field-of-view and part of the static background was obscured, while, when it passed through the loop-closure point for the second time, there were no dynamic objects, which causes some impact for ORB-SLAM2 using the BoW model, leading to unsuccessful detection of the closed-loop at this point, and failing to correct the positional drift in a timely manner; whereas, we combined the semantic information and the point–lines feature that can correctly detect the closed-loop.

Furthermore, we quantitatively revealed the effectiveness of closed-loop detection using Precision–Recall (P–R) curves, and balanced “P” and “R” by adjusting the normalized similarity coefficient

α \in [0 . 1, 0 . 8]

.

Figure 12 compares the P–R curves of closed-loop detection using different features in the indoor dynamic environment. It can be observed that the Precision (0.88–0.92) and Recall (0.55–0.65) of using a single feature (ORB or LBD) were limited, while the “ORB+LBD” method significantly improved Recall, but the accuracy was still constrained by dynamic impact. In contrast, “ORB+LBD+Semantic” improved the precision while maintaining a high Recall, and forming an obvious right-up shifted P–R curve, which indicates that the method of combining instance matching to calculate the similarity can provide richer semantic information, and effectively suppresses the influence of dynamic objects on loop-closure detection.

Table 8 compares the average tracking time and detection segmentation time of each frame between our algorithm and some mainstream algorithms. We find that the average time required for detecting and segmenting each frame in our method was reduced by 87% compared to Dyna-SLAM [10], 87% compared to YOLO-SLAM [36], and 70% compared to DO-SLAM [37]. Moreover, the average tracking time of our method was reduced by 76.6% compared to DO-SLAM. Hence, we could reasonably to conclude that our method has good real-time performance while ensuring accuracy.

To further verify the performance of our algorithm in large-scale scenes, we carried out a experiment in a 27 m × 20 m outdoor corridor environment. From the global mapping results in Figure 13, we find that our method can extract rich line features in outdoor corridors, which can better assist point features in localization and mapping. In Figure 13a, the addition of line features can better reveal the structural information of the scene, while the dense map constructed in Figure 13b had high consistency with the real scene, without distortion or deformation. We could reasonably to conclude that our algorithm still exhibits good accuracy and robustness in large-scale scenes.

5. Conclusions

This paper proposes a vSLAM system that integrates point-line features with semantic information provided by YOLOv8-seg to address the issues of poor robustness and low positioning accuracy of traditional point-feature-based vSLAM in dynamic and low texture environments. We designed an efficient 3D line-segment extraction and modeling method, which combines 2D line-segment sampling and depth back-projection, and achieves robust 3D line-segment fitting based on RANSAC. Meanwhile, Delaunay triangulation was introduced to analyze the geometric structure between point features, and combined with the semantic segmentation information, dynamic feature points were effectively eliminated to improve the localization accuracy and the consistency of mapping in dynamic environments. In addition, we designed a loop-closure detection mechanism that integrates instance-level semantic information with geometric consistency, further improving the global consistency and robustness of the system.

Simulation studies and experimental results show that our method achieves excellent performance on multiple public datasets, demonstrating good robustness and accuracy in weak-texture and real dynamic environments. However, there is still room for further optimization in the registration of 3D line-segments and map construction in the system. The line-segment matching method may still have errors in some cases, making it difficult to effectively associate the same line-segments in different frames, resulting in duplicate mapping in the map. In addition, Delaunay triangulation based on geometric relationships is also susceptible to point-matching errors. In future work, we will strive to improve the accuracy of line-segment matching and further enhance the robustness of geometric relationship modeling to achieve more accurate and efficient line feature vSLAM systems.

Author Contributions

Conceptualization, H.M. and J.L.; methodology, H.M. and J.L.; software, H.M.; validation, H.M. and J.L.; formal analysis, H.M. and J.L.; investigation, H.M. and J.L.; resources, H.M. and J.L.; data curation, H.M. and J.L.; writing—original draft preparation, H.M.; writing—review and editing, J.L.; visualization, H.M. and J.L.; supervision, J.L.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Nature Science Foundation of China under Grants 62463033 and 62063036, the ‘Xingdian Talent Support Program’ Youth Talent Special Project of Yunnan Province under Grant 01000208019916008, and the Basic Research Program of Yunnan Province—Key Project under Grant 202501AS070014.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The TUM dataset was obtained from https://cvg.cit.tum.de/data/datasets/rgbd-dataset/download, accessed on 4 June 2025. ICL-NUIM dataset was obtained from https://www.doc.ic.ac.uk/~ahanda/VaFRIC/iclnuim.html, accessed on 4 June 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Mur-Artal, R.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
Zhang, Z.; Rebecq, H.; Forster, C.; Scaramuzza, D. Benefit of Large Field-of-View Cameras for Visual Odometry. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 801–808. [Google Scholar]
Bloesch, M.; Omari, S.; Hutter, M.; Siegwart, R. Robust Visual Inertial Odometry Using a Direct EKF-Based Approach. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 298–304. [Google Scholar]
Shan, T.; Englot, B.; Meyers, D.; Wang, W.; Ratti, C.; Rus, D. LIO-SAM: Tightly-Coupled Lidar Inertial Odometry via Smoothing and Mapping. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October–24 January 2020; pp. 5135–5142. [Google Scholar]
Gómez-Ojeda, R.; Moreno, F.-A.; Zuniga-Noël, D.; Gonzalez-Jimenez, J. PL-SLAM: A Stereo SLAM System through the Combination of Points and Line Segments. IEEE Trans. Robot. 2019, 35, 734–746. [Google Scholar] [CrossRef]
Pumarola, A.; Vakhitov, A.; Agudo, A.; Sanfeliu, A.; Moreno-Noguer, F. PL-SLAM: Real-Time Monocular Visual SLAM with Points and Lines. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4503–4508. [Google Scholar]
Bescos, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
Yu, C.; Liu, Z.; Liu, X.; Xie, F.; Yang, Y. DS-SLAM: A Semantic Visual SLAM Towards Dynamic Environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1168–1174. [Google Scholar]
Zhou, H.; Zou, D.; Pei, L.; Ying, R.; Liu, P.; Yu, W. StructSLAM: Visual SLAM with building structure lines. IEEE Trans. Veh. Technol. 2015, 64, 1364–1375. [Google Scholar] [CrossRef]
Wang, R.; Schwörer, M.; Cremers, D. Stereo DSO: Large-Scale Direct Sparse Visual Odometry with Stereo Cameras. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3903–3911. [Google Scholar]
Xu, J.; Fang, Y.; Gao, W. Robot Visual Inertial SLAM Algorithm Based on Point-Line Features. In Proceedings of the 2024 7th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Hangzhou, China, 15–17 August 2024; IEEE: New York, NY, USA, 2024; pp. 944–949. [Google Scholar]
Zeng, D.; Liu, X.; Huang, K.; Liu, J. EPL-VINS: Efficient point-line fusion visual-inertial SLAM with LK-RG line tracking method and 2-DoF line optimization. IEEE Robot. Autom. Lett. 2024, 9, 5911–5918. [Google Scholar] [CrossRef]
Marco, J.; Hernandez, Q.; Muñoz, A.; Agudo, A.; Moreno-Noguer, F.; Sanfeliu, A. DeepToF: Off-the-Shelf Real-Time Correction of Multipath Interference in Time-of-Flight Imaging. ACM Trans. Graph. 2017, 36, 1–12. [Google Scholar] [CrossRef]
Zollhöfer, M.; Stotko, P.; Görlitz, A.; Theobalt, C.; Nießner, M.; Dai, A.; Innmann, M.; Stamminger, M.; Klein, R. State of the Art on 3D Reconstruction with RGB-D Cameras. Comput. Graph. Forum 2018, 37, 625–652. [Google Scholar] [CrossRef]
Shabanov, A.; Krotov, I.; Chinaev, N.; Poletaev, V.; Kozlukov, S.; Pasechnik, I.; Yakupov, B.; Sanakoyeu, A.; Lebedev, V.; Ulyanov, D. Self-supervised depth denoising using lower-and higher-quality RGB-d sensors. In Proceedings of the 2020 International Conference on 3D Vision (3DV), Fukuoka, Japan, 25–28 November 2020; IEEE: New York, NY, USA, 2020; pp. 743–752. [Google Scholar]
Xie, W.; Liu, P.X.; Zheng, M. Moving Object Segmentation and Detection for Robust RGBD-SLAM in Dynamic Environments. IEEE Trans. Instrum. Meas. 2020, 70, 1–8. [Google Scholar] [CrossRef]
Yuan, X.; Chen, S. Sad-SLAM: A Visual SLAM Based on Semantic and Depth Information. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October–24 January 2020; pp. 4930–4935. [Google Scholar]
Li, A.; Wang, J.; Xu, M.; Yang, Y. DP-SLAM: A Visual SLAM with Moving Probability towards Dynamic Environments. Inf. Sci. 2021, 556, 128–142. [Google Scholar] [CrossRef]
Wen, S.; Li, P.; Zhao, Y.; Liu, M.; Meng, M.Q.H. Semantic Visual SLAM in Dynamic Environment. Auton. Robots 2021, 45, 493–504. [Google Scholar] [CrossRef]
Sun, Y.; Liu, M.; Meng, M.Q.H. Motion Removal for Reliable RGB-D SLAM in Dynamic Environments. Auton. Robots 2018, 42, 1111–1130. [Google Scholar] [CrossRef]
Li, S.; Lee, D. RGB-D SLAM in Dynamic Environments Using Static Point Weighting. IEEE Robot. Autom. Lett. 2017, 2, 2263–2270. [Google Scholar] [CrossRef]
Lin, Z.; Zhang, Q.; Tian, Z.; Yu, P.; Lan, J. DPL-SLAM: Enhancing Dynamic Point-Line SLAM through Dense Semantic Methods. IEEE Sens. J. 2024, 24, 14596–14607. [Google Scholar] [CrossRef]
Dong, J.; Lu, M.; Xu, Y.; Deng, F.; Chen, J. PLPD-SLAM: Point-Line-Plane-Based RGB-D SLAM for Dynamic Environments. In Proceedings of the 2024 IEEE 18th International Conference on Control & Automation (ICCA), Reykjavík, Iceland, 18–21 June 2024; pp. 719–724. [Google Scholar]
Galvez-López, D.; Tardós, J.D. Bags of Binary Words for Fast Place Recognition in Image Sequences. IEEE Trans. Robot. 2012, 28, 1188–1197. [Google Scholar] [CrossRef]
Shan, Z.; Li, R.; Schwertfeger, S. SA-LOAM: Semantic-Aided LiDAR SLAM with Loop Closure. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2022; pp. 4627–4633. [Google Scholar]
Ji, X.; Liu, P.; Niu, H.; Chen, X.; Ying, R.; Wen, F. Loop closure detection based on object-level spatial layout and semantic consistency. arXiv 2023, arXiv:2304.05146. [Google Scholar]
Alamanos, I.; Tzafestas, C. ORB-LINE-SLAM: An Open-Source Stereo Visual SLAM System with Point and Line Features. TechRxiv 2023. [Google Scholar] [CrossRef]
Von Gioi, R.G.; Jakubowicz, J.; Morel, J.M.; Randall, G. LSD: A fast line segment detector with a false detection control. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 32, 722–732. [Google Scholar] [CrossRef]
Li, Y.; Yunus, R.; Brasch, N.; Schwertfeger, S. RGB-D SLAM with Structural Regularities. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11581–11587. [Google Scholar]
Shu, F.; Wang, J.; Pagani, A.; Stricker, D. Structure PLP-SLAM: Efficient Sparse Mapping and Localization Using Point, Line and Plane for Monocular, RGB-D and Stereo Cameras. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2105–2112. [Google Scholar]
Liu, Y.; Miura, J. RDS-SLAM: Real-Time Dynamic SLAM Using Semantic Segmentation Methods. IEEE Access 2021, 9, 23772–23785. [Google Scholar] [CrossRef]
Zhong, M.; Hong, C.; Jia, Z.; Zhang, X. DynaTM-SLAM: Fast Filtering of Dynamic Feature Points and Object-Based Localization in Dynamic Indoor Environments. Robot. Auton. Syst. 2024, 174, 104634. [Google Scholar] [CrossRef]
Wu, W.; Guo, L.; Gao, H.; Yang, H. YOLO-SLAM: A Semantic SLAM System towards Dynamic Environment with Geometric Constraint. Neural Comput. Appl. 2022, 34, 6011–6026. [Google Scholar] [CrossRef]
Wei, Y.; Zhou, B.; Duan, Y.; Han, Z. DO-SLAM: Research and Application of Semantic SLAM System towards Dynamic Environments Based on Object Detection. Appl. Intell. 2023, 53, 30009–30026. [Google Scholar] [CrossRef]

Figure 1. Overview of the proposed semantic visual SLAM integrating point–line features with YOLOv8-seg.

Figure 2. Scenarios for the dynamic feature rejection.

Figure 3. Comparison of the ATE for different algorithms under TUM dataset sequences. (a) fr3_walk_xyz, (b) fr3_walk_Static, (c) fr3_walk_rpy, (d) fr3_walk_half, (e) fr3_walk_xyz, (f) fr3_walk_Static, (g) fr3_walk_rpy, (h) fr3_walk_half.

Figure 4. 3D line-segment reconstruction results on different dataset sequences. The first row shows the ORB feature point and 2D line-segment extraction results, the middle row shows the results of triangulation of line-segment endpoints, and the bottom row shows the 3D line-segment reconstruction results.The red color indicates the extracted line segment, and the green dot indicates the extracted ORB feature point. (a) live_room_traj1_frei, (b) traj0_frei_png, (c) freiburg3_long_office_househol.

Figure 5. Local mapping effects of different algorithms on the dynamic fr3_walk_xyz sequence. (a) ORB-SLAM2, (b) PLY-SLAM. In (a,b), the left indicates the dense map, while the right indicates the corresponding semantic octree map.

Figure 6. Instance-based semantic similarity-matching effects.

Figure 7. Experimental scene.

Figure 8. Feature extraction and sparse mapping effects in real dynamic scene. (a) Without removing dynamic features, (b) After removing dynamic features. In (a,b), the left indicates the point–line feature extraction results, while the right indicates the pose graph at the corresponding time.The red represents the 3D line segments in the current frame, the black line segments represent the 3D line segments in the local map but not in the current frame, the blue line represents the keyframe, and the green line represents the trajectory of the keyframe.

Figure 9. Comparison of global mapping effects of different algorithms in real dynamic environments. (a) ORB-SLAM2, (b) PLY-SLAM. In (a,b), the left indicates the semantic dense map, while the right indicates the semantic octree map.

Figure 10. Local semantic mapping results in a real scene. (a) denotes the scene image, (b) denotes the semantic mask generated by YOLOv8, (c) denotes the dense semantic map, (d) denotes the semantic octree map, and (e) denotes the colors corresponding to different instances.

Figure 11. Comparison of trajectories of different algorithms in real dynamic scene.

Figure 12. Comparison of P–R curves for loop closure detection using different feature.

Figure 13. Global mapping effect in outdoor corridor environment. (a) is a sparse map and robot’s trajectory represented using keyframes, while (b) is the corresponding dense map.

Table 1. Comparison of RMSE for the ATE among different algorithms on TUM static dataset sequences (/m).

Sequences	PL-SLAM	RGB-DSLAM	PLP-SLAM	ORB-SLAM2	PLY-SLAM	↑PL-SLAM	↑ORB-SLAM2
fr1_xyz	0.0121	0.0116	0.0103	0.0127	0.0092	24%	27%
fr1_floor	0.0759	0.0622	0.0121	0.0390	0.0168	77%	56%
fr2_xyz	0.0043	0.0040	0.0145	0.0039	0.0035	19%	10%
fr2_large_loop	-	0.1020	-	0.1069	0.0850	-	20%
fr3_str_tex_far	0.0089	0.0092	0.0089	0.0158	0.0113	×	28%
fr3_long_office	0.0197	0.0186	0.0101	0.0126	0.0092	53%	27%
fr3_nstr_tex_n_wl	0.0206	-	0.0159	0.0212	0.0125	39%	41%
fr3_nstr_tex_far	-	0.0274	0.0352	0.0304	0.0283	-	7%

Table 2. Comparison of RMSE for the ATE among different algorithms on TUM dynamic dataset sequences (/m).

	Sequences	ORB-SLAM2	DS-SLAM	RDS-SLAM	DynaTm SLAM	PLY-SLAM	↑ ORB-SLAM2 (%)
Hight Dynamic	fr3_walk_xyz	0.7988	0.0274	0.0571	0.0150	0.0169	97.9
	fr3_walk_sta	0.3922	0.0081	0.0206	0.0068	0.0065	98.3
	fr3_walk_rpy	1.0799	0.4442	0.1604	0.0288	0.0364	96.6
	fr3_walk_half	0.5000	0.0303	0.0807	0.0291	0.0294	94.1
Low Dynamic	fr3_sitting_sta	0.0080	0.0065	0.0084	0.0064	0.0059	26.0

Table 3. Comparison of RMSE for the relative translation error (RTE) among different algorithms on TUM dataset sequences.

	Sequence	Relative Translation Error (RTE/m)
	Sequence	ORB-SLAM2	DS-SLAM	RDS-SLAM	DynaTm	PLY-SLAM	↑ORB-SLAM2 (%)
High Dynamic	fr3_walk_xyz	0.3816	0.0333	0.0426	0.0191	0.0224	94.1
	fr3_walk_sta	0.2320	0.0102	0.0221	0.0088	0.0079	96.6
	fr3_walk_rpy	0.3866	0.1503	0.1320	0.0356	0.0414	89.3
	fr3_walk_half	0.3264	0.0297	0.0482	0.0281	0.0295	91.1
Low Dynamic	fr3_sitting_sta	0.0086	0.0078	0.0123	0.0083	0.0076	11.6

Table 4. Comparison of RMSE for the relative rotation error (RRE) among different algorithms on TUM dataset sequences.

	Sequence	Relative Rotation Error (RRE/°)
	Sequence	ORB-SLAM2	DS-SLAM	RDS-SLAM	DynaTm	PLY-SLAM	↑ORB-SLAM2 (%)
High Dynamic	fr3_walk_xyz	7.3659	0.8266	0.9222	0.6006	0.6352	91.4
	fr3_walk_sta	4.0904	0.2690	0.4944	0.2510	0.2370	94.2
	fr3_walk_rpy	7.4997	3.0042	13.169	0.8228	0.8381	88.8
	fr3_walk_half	6.5744	0.8142	1.8828	0.7443	0.8045	87.8
Low Dynamic	fr3_sitting_sta	0.2798	0.2735	0.3338	0.2718	0.2642	5.6

Table 5. Average reconstruction rate of line-segments on different dataset sequences.

Dataset Sequences	Traditional Methods	PLY-SLAM
live_room_traj1_frei	51.6%	86.4%
traj0_frei_png	63.4%	84.9%
freiburg3_long_office_househol	48.7%	88.4%

Table 6. Calculation of similarity score.

Similarity	Figure 6a	Figure 6b	Figure 6c	Figure 6d	Figure 6e	Figure 6f
$S_{p}$	0.1087	0.0161	0.0156	0.0171	0.0139	0.0026
$S_{1}$	0.1744	0.0572	0.0356	0.0299	0.0099	0.0070
$S_{p l}$	0.1345	0.0322	0.0235	0.0221	0.0123	0.0043
$S_{i n s t}$	0.92	0.88	0.88	0.90	0.62	0.00
$S_{f i n}$	0.4487	0.3713	0.3661	0.3733	0.2554	0.0026

Table 7. Main parameters of the Intel D435i depth camera.

Parameter	Properties
dimensions (length × depth × height)	90 mm × 25 mm × 25 mm
effective measurement depth	0.2∼10 m
RGB frame rate and resolution	1920 × 1080 at 30 fps
RGB sensor FOV (horizontal × vertical)	69.4° × 42.5° (+/−3°)
depth frame rate and resolution	up to 1280 × 720 at 90 fps
depth FOV (horizontal × vertical for HD 16:9)	85.2° × 58°

Table 8. Comparison of time consumption of different algorithms.

Algorithm	Dyna-SLAM	DO-SLAM	YOLO-SLAM	PLY-SLAM
Hardware	NVIDIA (NVIDIA Corporation, Santa Clara, CA, USA) Tesla M40 GPU	Inter Core i5-4288U	Inter Core i5-4288U CPU	Nvidia GeForce RTX 3060
Network model	Mask R-NN	YOLOv5	YOLOv3	YOLOv8-seg
Segmentation time (ms)	195	81.44	696.03	24.21
Tracking time (ms)	>300	118.23	651.53	27.61

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Mao, H.; Luo, J. PLY-SLAM: Semantic Visual SLAM Integrating Point–Line Features with YOLOv8-seg in Dynamic Scenes. Sensors 2025, 25, 3597. https://doi.org/10.3390/s25123597

AMA Style

Mao H, Luo J. PLY-SLAM: Semantic Visual SLAM Integrating Point–Line Features with YOLOv8-seg in Dynamic Scenes. Sensors. 2025; 25(12):3597. https://doi.org/10.3390/s25123597

Chicago/Turabian Style

Mao, Huan, and Jingwen Luo. 2025. "PLY-SLAM: Semantic Visual SLAM Integrating Point–Line Features with YOLOv8-seg in Dynamic Scenes" Sensors 25, no. 12: 3597. https://doi.org/10.3390/s25123597

APA Style

Mao, H., & Luo, J. (2025). PLY-SLAM: Semantic Visual SLAM Integrating Point–Line Features with YOLOv8-seg in Dynamic Scenes. Sensors, 25(12), 3597. https://doi.org/10.3390/s25123597

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

PLY-SLAM: Semantic Visual SLAM Integrating Point–Line Features with YOLOv8-seg in Dynamic Scenes

Abstract

1. Introduction

2. Related Work

2.1. Point–Line Features Based SLAM

2.2. Dynamic SLAM

3. Pipeline

3.1. Line Feature Extraction and Optimization

3.2. Dynamic Feature Detection and Rejection

3.3. Loop-Closure Detection

4. Simulations and Experiments

4.1. Simulation Studies Under Datasets

4.2. Experimental Testing in Real Scenes

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI