Tightly-Coupled Visual-Inertial Odometry Using Point and Geometrically Optimized Line Features

Yuan, Yanxin; Cheng, Yi; Liu, Jiansong; Kuai, Zheng; Li, Baoquan

doi:10.3390/electronics15102061


by Yanxin Yuan, Yi Cheng *, Jiansong Liu, Zheng Kuai and Baoquan Li

Tianjin Key Laboratory of Intelligent Control of Electrical Equipment, School of Control Science and Engineering, Tiangong University, Tianjin 300387, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2061; https://doi.org/10.3390/electronics15102061

Submission received: 22 March 2026 / Revised: 3 May 2026 / Accepted: 6 May 2026 / Published: 12 May 2026


Abstract

Visual-Inertial Odometry (VIO) estimates system pose by fusing visual and inertial measurements. Although line features can enhance structural perception, existing approaches still face challenges such as redundant short segments and weak geometric constraints. To address these, in the front end, we propose a complete geometric optimization pipeline for line features. This pipeline adopts a length-threshold-based filtering strategy and integrates the proposed geometric-consistency-based merging mechanism, endpoint-distance-based verification mechanism, and epipolar-constraint-based triangulation method, transforming fragmented short segments into structurally complete 3D spatial lines. In the back end, reprojection residuals of the optimized line features are jointly optimized with point residuals, IMU pre-integration residuals, and marginalization priors in a sliding-window framework. Experiments on the EuRoC dataset show that compared to VINS-Mono, PL-VINS, and EPLF-VINS, the proposed method reduces the Absolute Pose Error (APE) by 17.57%, 9.88%, and 6.65%, respectively. Additionally, compared to PL-VINS, it reduces the line feature processing time by 4.16% and the average per-frame processing time by 2.36%, validating the effectiveness of the proposed method.

Keywords: simultaneous localization and mapping; line segment merging; geometric constraint; sliding window optimization

1. Introduction

Simultaneous Localization and Mapping (SLAM) is a crucial technology in fields such as autonomous robot navigation and augmented reality. Single-sensor SLAM systems are mainly categorized into LiDAR-based SLAM and vision-based SLAM. LiDAR SLAM performs stably in structured environments but is limited by the detection range of the LiDAR during operation [1]. Visual SLAM relies on stable texture features in the environment; under severe illumination changes or dynamic object occlusions, camera observations become limited, so its localization and mapping performance tends to degrade [2]. An inertial measurement unit (IMU) measures acceleration and angular velocity at a high frequency, offering high measurement accuracy and insensitivity to the surrounding environment, but its integrated estimates accumulate error over time [3]. Visual-inertial odometry (VIO) achieves stable and accurate state estimation in complex scenes by fusing visual and inertial measurements: it uses visual geometric constraints to suppress IMU integration drift, and leverages the IMU's high-frequency measurements to compensate for the uncertainty of inter-frame visual tracking and to provide scale information [4]. A classic method in this domain is VINS-Mono, a complete and efficient tightly coupled optimization framework that achieves high-precision, real-time monocular visual-inertial state estimation and has had a profound impact on subsequent research [5].

Based on the sensor data fusion strategy, VIO methods can be categorized as loosely coupled [6] or tightly coupled [7]. In the former, the camera and IMU modules perform pose estimation independently, and their results are fused afterwards. In the latter, raw measurements from both sensors are fused directly, and state estimation is performed via joint optimization, thus better exploiting the complementary properties of the sensors. Bescos et al. use stereo vision for multi-object tracking, jointly optimizing the trajectories of objects in both dynamic and static scenes within a sliding window based on a tightly coupled approach [8]. Li et al. combine data such as GNSS carrier phase with visual-inertial navigation using tight coupling, designing the high-precision P3-VINS method [9]. Although the tightly coupled method has higher computational complexity, it makes full use of the raw sensor measurements, resulting in higher estimation accuracy and robustness than the loosely coupled method.

From the perspective of visual information processing, VIO methods are primarily categorized into feature-based methods [10,11] and direct methods [12,13]. In direct methods, the camera pose is estimated by minimizing the photometric error between images; this approach requires neither feature point extraction nor descriptor matching, making it suitable for texture-less scenes. However, direct methods rely on the photometric consistency assumption and are easily disturbed in practice by factors such as illumination changes, camera exposure adjustments, and motion blur. In feature-based methods, representative pixels in the image are selected for feature matching, and the relative camera motion is then estimated using techniques such as epipolar geometry [14] or PnP [15]. Most feature-based methods rely on point feature tracking and matching. However, point features degrade easily in environments with low texture, repetitive structure, or drastically changing illumination, leading to tracking loss and pose estimation drift, which limits the system's robustness. Notably, man-made environments are rich in line features. These features can be stably extracted and tracked even in low-texture regions, and they can provide stronger geometric constraints, thus offering crucial support for enhancing the robustness and accuracy of VIO.

In the early stages of VIO research, line features were mainly applied in offline 3D reconstruction via structure from motion (SfM) and in small-scale visual odometry (VO), which preliminarily validated their potential for geometric recovery. However, limited by computational efficiency and matching robustness, line features struggled to meet real-time requirements. Pumarola et al. were the first to propose an initialization mechanism based on line features and to construct a tightly coupled point-line SLAM system, enabling cold start in low-texture scenes that completely lack point features [16]. However, full-process line feature processing and the corresponding Bundle Adjustment (BA) overhead reduce the system's real-time performance. Gomez-Ojeda et al. propose a stereo vision-based tightly coupled SLAM system that integrates points and lines throughout the whole pipeline and uses a joint point-line Bag-of-Words model for loop closure detection, improving robustness in low-texture environments [17]. He et al. propose a method that fuses point and line features in a tightly coupled VIO framework [18]. This work represents spatial lines using Plücker coordinates and an orthonormal parameterization, and jointly optimizes the resulting line feature residuals together with point features and IMU data in a sliding window. Fu et al. propose PL-VINS [19], which improves the LSD detector by tuning hidden parameters and applying a length rejection strategy, thereby increasing line extraction speed. They also establish a more precise point-to-line distance residual model based on Plücker coordinates and achieve efficient point-line-inertial tight coupling within a sliding-window optimization framework, leading to noticeable improvements in accuracy and real-time performance. However, PL-VINS still suffers from line fragmentation and structural discontinuity because it lacks geometric merging and verification mechanisms, which limits the stability and quality of its line feature constraints.
Xu et al. propose the EPLF-VINS system [20], which adopts a modified EDLines detector for line segment extraction and replaces descriptor-based matching with line optical flow tracking, thereby significantly improving the efficiency of line feature processing. In addition, it presents an endpoint-independent residual model that eliminates the dependence on unstable line segment endpoints. However, this method mainly focuses on improving the speed of detection and tracking, and does not directly optimize the geometric structural quality of line features. Therefore, the geometric integrity of its line features still has considerable room for further improvement.

These studies demonstrate that line features can provide stable geometric constraints and suppress drift in low-texture regions. Current line segment detection methods tend to generate a large number of redundant segments with similar orientations, spatial proximity, or local overlaps. Direct use of these segments increases the computational burden on the system and adversely affects the accuracy of back-end optimization. Therefore, how to effectively leverage single-frame geometric information at the front end to remove redundant segments, construct stable representations, and provide high-quality constraints for the back end becomes a key issue in improving the performance of point-line fused VIO.

Table 1 compares the differences in line feature processing pipelines among representative point-line VIO methods, comprehensively illustrating the line feature processing strategy of each approach.

To address the aforementioned issues, this work presents a VIO system built upon geometrically optimized line features, aiming to alleviate line segment fragmentation and enhance structural representation in complex scenes. The key contributions of this work are summarized as follows:

First, a complete front-end geometric optimization pipeline for line features is proposed. This pipeline adopts a line segment filtering strategy based on length threshold, and integrates the proposed merging mechanism based on geometric consistency constraints, verification mechanism based on endpoint distances, and triangulation method for optimized line features based on epipolar constraints. It removes redundant and unstable short segments, improves the structural integrity of line features, and thereby provides reliable geometric constraints for back-end optimization.
Second, the optimized line features, point features, IMU pre-integration, and marginalization prior information are integrated into a tightly coupled sliding window optimization framework. A point-line fused state estimation model is constructed to achieve high-precision and robust pose estimation.
Finally, comprehensive experimental validation is conducted on the public EuRoC dataset. The results show that the proposed method improves pose estimation accuracy, system robustness, and overall efficiency in complex environments.

2. Related Work

2.1. Filter-Based and Optimization-Based VIO

From the perspective of state estimation implementation, VIO can be primarily classified into filter-based methods [21,22,23] and optimization-based methods [24,25,26,27]. Filter-based methods follow the principle of recursive estimation: they use high-frequency IMU data to predict the system state, obtaining a prior probability distribution of the current state, and then combine this prior with visual observations to perform state updates. This incremental processing avoids batch optimization over historical data, thus achieving higher computational efficiency and real-time performance; however, it also leads to continuously accumulating errors that are difficult to suppress effectively [28]. The framework based on the Extended Kalman Filter (EKF) has played a key role in the development of VIO [29]. A typical representative is the Multi-state Constraint Kalman Filter (MSCKF) algorithm [30]. Its innovation is that it avoids augmenting the state vector with 3D feature coordinates. Instead, it projects the reprojection constraints of visual features onto the left null space of the feature Jacobian, thus constructing an observation model that depends only on multiple camera poses. This enables the MSCKF to significantly improve computational efficiency while maintaining the estimation accuracy characteristic of tight coupling.

Optimization-based methods adopt a sliding window strategy and perform maximum a posteriori (MAP) estimation by jointly optimizing system states and observations across multiple timestamps within the window. Although their computational complexity is higher than that of filter-based methods, they continuously refine the estimation errors of historical poses within the window through iterative linearization and optimization of the local trajectory, thereby achieving higher estimation accuracy [31]. The OKVIS system has pioneered the application of keyframe-based sliding window optimization in VIO [32]. It jointly estimates poses, velocities, sensor biases, and map point structures for all keyframes within the window via nonlinear optimization and uses marginalization to control computational cost, leading to more accurate and reliable pose estimation. The ORB-SLAM3 system introduces a multi-map mechanism [33]. It maintains robust operation in challenging scenarios by constructing and fusing submaps upon tracking failure. With an efficient tightly coupled optimization framework, it supports monocular, stereo, and RGB-D cameras and achieves high accuracy and long-term robustness. Given their clear advantages in estimation accuracy, this work adopts an optimization-based tightly coupled framework as the core state estimation module for the proposed VIO system.

2.2. Geometric Representation of Points and Lines

In VIO, the geometric representation of point and line features forms the foundation of state estimation. The representation of point features is intuitive and concise; their image coordinates are $(u, v)$, corresponding to 3D spatial coordinates $(x, y, z)$ with 3 degrees of freedom (DoF). In contrast, the parameterization of 3D spatial lines is more complex. Primary representation methods include Plücker coordinates and the orthonormal representation; this paper adopts a hybrid approach utilizing both to fully leverage their respective advantages at different processing stages.

In 3D space, a line can be represented by Plücker coordinates as $L = (n^{T}, d^{T})^{T} \in \mathbb{R}^{6}$, where $n \in \mathbb{R}^{3}$ is the normal vector of the plane formed by the line and the camera center, and $d \in \mathbb{R}^{3}$ is the direction vector of the line. This six-dimensional representation is over-parameterized: a 3D line has only four DoF, because it is invariant under translations along its direction and rotations about its own axis, and the two vectors must satisfy the constraint $n^{T} d = 0$. In this paper, Plücker coordinates are employed for initialization and coordinate transformation in the front end. For a line feature observed in frames $i$ and $j$, the geometric transformation between its Plücker coordinates $L^{i}$ and $L^{j}$ is given by:

$$L^{j} = \begin{bmatrix} n^{j} \\ d^{j} \end{bmatrix} = T_{ij} L^{i} = \begin{bmatrix} R_{ij} & [p_{ij}]_{\times} R_{ij} \\ 0 & R_{ij} \end{bmatrix} \begin{bmatrix} n^{i} \\ d^{i} \end{bmatrix}, \tag{1}$$

where $T_{ij}$ denotes the transformation matrix from frame $i$ to frame $j$, $R_{ij}$ and $p_{ij}$ represent the rotation matrix and translation vector between the two frames, and the operator $[\cdot]_{\times}$ indicates the skew-symmetric matrix of a 3D vector.
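As an illustration of Eq. (1), the following minimal NumPy sketch builds the Plücker coordinates of a line from two points and moves them between frames. The helper names (`skew`, `plucker_from_points`, `transform_plucker`) are hypothetical, and the sketch assumes that points transform as $x^{j} = R_{ij} x^{i} + p_{ij}$; it is not taken from the authors' implementation.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x, so that skew(v) @ w equals np.cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def plucker_from_points(p1, p2):
    """Plücker coordinates (n, d) of the line through two 3D points."""
    d = p2 - p1              # direction vector of the line
    n = np.cross(p1, d)      # normal of the plane through the line and the origin
    return np.concatenate([n, d])

def transform_plucker(L_i, R_ij, p_ij):
    """Eq. (1): map a Plücker line from frame i to frame j.

    Assumption: points transform as x_j = R_ij @ x_i + p_ij.
    """
    T = np.zeros((6, 6))
    T[:3, :3] = R_ij
    T[:3, 3:] = skew(p_ij) @ R_ij
    T[3:, 3:] = R_ij
    return T @ L_i

# Example: 90 degree rotation about z plus a translation (made-up values).
theta = np.pi / 2
R_ij = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                 [np.sin(theta),  np.cos(theta), 0.0],
                 [0.0, 0.0, 1.0]])
p_ij = np.array([1.0, 2.0, 3.0])
p1, p2 = np.array([1.0, 0.0, 2.0]), np.array([1.0, 1.0, 2.0])

L_i = plucker_from_points(p1, p2)
L_j = transform_plucker(L_i, R_ij, p_ij)
# Same line, obtained by transforming the two points first.
L_j_direct = plucker_from_points(R_ij @ p1 + p_ij, R_ij @ p2 + p_ij)
```

Transforming the line with the $6 \times 6$ matrix and transforming its two defining points first should give the same coordinates, and the constraint $n^{T} d = 0$ is preserved.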

To avoid over-parameterization and improve the efficiency of back-end nonlinear optimization, this paper adopts the orthonormal representation. The representation decomposes a line into orientation and positional components, including an orthonormal basis matrix $U$ for line direction and a matrix $W$ that describes positional information. Plücker coordinates can be converted to the orthonormal representation via QR decomposition [34]. We define the rotation matrix between the line coordinate frame and the camera frame as:

$$U = R(\varphi) = \begin{bmatrix} \dfrac{n}{\| n \|} & \dfrac{d}{\| d \|} & \dfrac{n \times d}{\| n \times d \|} \end{bmatrix}, \tag{2}$$

where $\varphi \in \mathbb{R}^{3}$ represents the rotation angles around the x-, y-, and z-axes of the camera frame. The Plücker coordinates can be expressed as follows:

$$\begin{bmatrix} n & d \end{bmatrix} = U \begin{bmatrix} \| n \| & 0 \\ 0 & \| d \| \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} \dfrac{n}{\| n \|} & \dfrac{d}{\| d \|} & \dfrac{n \times d}{\| n \times d \|} \end{bmatrix} \begin{bmatrix} \| n \| & 0 \\ 0 & \| d \| \\ 0 & 0 \end{bmatrix}. \tag{3}$$

To represent the positional component concisely, we define the matrix $W$ as:

$$W = \begin{bmatrix} w_{1} & -w_{2} \\ w_{2} & w_{1} \end{bmatrix} = \begin{bmatrix} \cos(\phi) & -\sin(\phi) \\ \sin(\phi) & \cos(\phi) \end{bmatrix} = \frac{1}{\sqrt{\| n \|^{2} + \| d \|^{2}}} \begin{bmatrix} \| n \| & -\| d \| \\ \| d \| & \| n \| \end{bmatrix}, \tag{4}$$

where $\phi$ denotes a rotation angle. The distance from the 3D line to the camera frame origin is $\| n \| / \| d \| = w_{1} / w_{2}$, so $W$ encodes this distance. We adopt $O = [\varphi^{T}, \phi]^{T}$ as the minimal parameterization of a 3D spatial line for optimization. After optimization, the corresponding Plücker coordinates of the spatial line $L$ can be recovered via:

$$L = \begin{bmatrix} w_{1} u_{1}^{T} & w_{2} u_{2}^{T} \end{bmatrix}^{T} = \frac{1}{\sqrt{\| n \|^{2} + \| d \|^{2}}} L^{'}, \tag{5}$$

where $u_{i}$ represents the $i$th column vector of matrix $U$.
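The conversion in Eqs. (2)–(5) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names are hypothetical, the line must not pass through the origin (so $n \neq 0$), and the sketch keeps the matrices $U$ and $W$ directly rather than extracting the four minimal parameters $O$ that an optimizer would update.

```python
import numpy as np

def plucker_to_orthonormal(L):
    """Eqs. (2) and (4): split a Plücker line into U (3x3) and W (2x2)."""
    n, d = L[:3], L[3:]
    n_norm, d_norm = np.linalg.norm(n), np.linalg.norm(d)
    c = np.cross(n, d)
    # Columns of U: unit normal, unit direction, and their unit cross product.
    U = np.column_stack([n / n_norm, d / d_norm, c / np.linalg.norm(c)])
    s = np.hypot(n_norm, d_norm)
    w1, w2 = n_norm / s, d_norm / s
    W = np.array([[w1, -w2],
                  [w2,  w1]])
    return U, W

def orthonormal_to_plucker(U, W):
    """Eq. (5): recover (scaled) Plücker coordinates from U and W."""
    w1, w2 = W[0, 0], W[1, 0]
    return np.concatenate([w1 * U[:, 0], w2 * U[:, 1]])

# Example: a line parallel to the x-axis at height z = 2 (distance 2 from the origin).
p = np.array([0.0, 0.0, 2.0])
d = np.array([1.0, 0.0, 0.0])
L = np.concatenate([np.cross(p, d), d])

U, W = plucker_to_orthonormal(L)
L_rec = orthonormal_to_plucker(U, W)
distance = W[0, 0] / W[1, 0]     # w1 / w2 = ||n|| / ||d||
```

In this example `L_rec` equals `L` divided by $\sqrt{\| n \|^{2} + \| d \|^{2}}$, which is harmless because Plücker coordinates are defined only up to scale.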

3. Preliminaries and System Overview

3.1. Notations

For clarity, this work presents the notations and coordinate frame conventions employed throughout the paper. The world frame, IMU frame, and camera frame are denoted by $(\cdot)^{w}$, $(\cdot)^{b}$ and $(\cdot)^{c}$, respectively. Gravity is assumed to be aligned with the $z$-axis of the world frame. Rotations are represented using either matrices $R \in SO(3)$ or quaternions $q \in S^{3}$, while translations are represented using 3D vectors $p \in \mathbb{R}^{3}$. Specifically, the coordinate transformation from the IMU frame to the camera frame is described by the extrinsic parameters, comprising a translation vector $p_{bc}$ and a rotation quaternion $q_{bc}$. A double-subscript notation is adopted here, where $(\cdot)_{cw}$ denotes a quantity associated with the transformation from the camera frame to the world frame.

3.2. System Overview

This work proposes a tightly coupled point-line VIO system designed to enhance pose estimation accuracy and system robustness in challenging scenarios. As depicted in Figure 1, the system consists of two modules: the front end and the back end. The front end is primarily responsible for preprocessing visual-inertial data, while the back end achieves high-precision state estimation based on a sliding window optimization framework.

During the data preprocessing stage, the system processes point features, line features, and IMU data in parallel. For point features, salient corners in the image are first detected using the Shi–Tomasi algorithm, and subsequently tracked across adjacent frames via the KLT optical flow method; afterwards, the essential matrix between two frames is estimated based on the RANSAC algorithm, and outliers are rejected by leveraging epipolar constraints, ultimately yielding reliable point track pairs. For line features, the system first uses an improved LSD algorithm to extract line segment structures from the image and merges them based on geometric constraints. It verifies the validity of merged line segments through endpoint distance constraints, eliminating segments that do not satisfy these constraints. Next, the processed line segments are characterized using LBD descriptors and matched via the K-Nearest Neighbors (KNN) algorithm. A Hamming distance threshold filters out candidate matches with low similarity to enhance matching quality. Subsequently, the successfully matched line segments are triangulated using epipolar geometry constraints to recover their 3D spatial line features. Simultaneously, the system pre-integrates IMU data between keyframes to provide necessary motion constraints for back-end state estimation.

During the state estimation stage, the back end constructs a nonlinear least-squares problem within a sliding window using visual feature observations and IMU pre-integration measurements from the front end. Its cost function incorporates visual reprojection errors, IMU pre-integration errors, and a prior residual from marginalization. The primary goal is to accurately estimate keyframe poses, while simultaneously optimizing state vectors including velocity, sensor biases, and the 3D coordinates of map features. To ensure real-time performance and computational efficiency, a marginalization strategy removes old frames during optimization, transforming their constraint information into a prior term acting on the remaining state variables. Ultimately, the system outputs a high-precision pose trajectory and the corresponding sparse point-line map.

4. Methodology

4.1. LSD-Based Line Segment Filtering Strategy

The stability and geometric quality of line features are crucial for the accuracy and robustness of VIO. Although the LSD algorithm can effectively detect line segments in images, its local gradient-based nature lacks the capacity for global geometric continuity judgment, often producing fragmented and redundant short segments. These short segments not only carry limited geometric information but also increase the computational burden of subsequent processing and can lead to mismatches under viewpoint changes, ultimately compromising system accuracy and robustness.

To enhance line feature quality, this paper configures key parameters of the LSD algorithm. First, the number of image pyramid levels is set to 2 and the scale factor to 0.5, which maintains multi-scale detection capability while controlling computational cost. Second, the minimum density threshold is set to 0.6 to filter out line segments with insufficient geometric significance. Building upon this, a length threshold is used to remove overly short line segments from the image. Specifically, this threshold $L_{s}$ is defined as a fraction of the image size:

$$L_{s} = \lceil \eta \cdot \min(W_{I}, H_{I}) \rceil, \tag{6}$$

where $W_{I}$ and $H_{I}$ denote the width and height of the image, respectively, and $\eta$ is the length scaling factor, empirically set to 0.125. Increasing $\eta$ retains fewer, but longer and more stable, line segments. The operator $\lceil \cdot \rceil$ represents the ceiling operation. This filtering method removes redundant short segments, providing a cleaner set of line features for subsequent processing and state estimation.
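As a worked example of Eq. (6), the sketch below computes the threshold for a $752 \times 480$ image (the EuRoC image size) and filters a small set of segments. The function names and the `[x1, y1, x2, y2]` segment layout are illustrative assumptions, as is the choice to keep segments whose length equals the threshold.

```python
import math
import numpy as np

def length_threshold(width, height, eta=0.125):
    """Eq. (6): L_s = ceil(eta * min(W_I, H_I)), in pixels."""
    return math.ceil(eta * min(width, height))

def filter_short_segments(segments, width, height, eta=0.125):
    """Keep segments at least L_s pixels long.

    segments: array-like of shape (N, 4) with rows [x1, y1, x2, y2]
    (an assumed layout, not prescribed by the paper).
    """
    segments = np.asarray(segments, dtype=float)
    lengths = np.hypot(segments[:, 2] - segments[:, 0],
                       segments[:, 3] - segments[:, 1])
    return segments[lengths >= length_threshold(width, height, eta)]

# 752 x 480 image: L_s = ceil(0.125 * 480) = 60 pixels.
L_s = length_threshold(752, 480)
segments = [[10, 10, 110, 10],    # length 100, kept
            [50, 50, 50, 80],     # length 30, removed
            [0, 0, 36, 48]]       # length 60, kept (equal to the threshold)
kept = filter_short_segments(segments, 752, 480)
```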

4.2. Geometric-Constrained Line Segment Merging

In VIO, long and continuous line features provide stronger and more stable geometric structural constraints than short segments. However, edge detection errors often cause originally continuous long lines to be incorrectly segmented into multiple adjacent short segments. This mis-segmentation not only increases the computational burden of the system but also disrupts the continuity of the original structure, severely weakening the constraining power that line features exert in back-end optimization. To address this issue, this paper proposes a line segment merging strategy grounded in geometric constraints.

To determine whether two line segments can be merged, this method adopts the following geometric criteria. Given two detected line segments $l_{1}$ and $l_{3}$ in the current frame, defined by their start points $s_{1}$ and $s_{3}$ and end points $e_{1}$ and $e_{3}$, respectively, their lengths are first calculated. Because longer segments provide more reliable direction estimates, the longer segment is used as the reference to mitigate potential noise introduced by the shorter segment. After identifying the reference segment, the direction vectors of both segments are computed. Direction consistency is then evaluated using the dot product and the resulting angle difference. A negative dot product indicates opposing directions; in this case, the endpoints of the shorter segment are swapped to align its direction. Subsequently, the angle between the two vectors is calculated. If this angle is less than a set threshold $\theta_{T} = 3^{\circ}$, the segments are deemed to satisfy the direction consistency condition. This threshold was determined by trial and error. Next, the spatial proximity between the segments is assessed using two geometric distance metrics: the perpendicular distance $Dis$, which is the distance from an endpoint to the infinite line through the other segment; and the projection distance $dis$, which is the shortest distance from an endpoint to the other segment itself. The thresholds for these distances are determined empirically as $T_{1} = 5$ pixels and $T_{2} = 3$ pixels, respectively. If any pair of endpoints satisfies both the distance thresholds and the angle difference condition, the two segments are considered mergeable into a complete segment $l_{good}$.

The merged segments can be categorized into three distinct extension types, as illustrated in Figure 2. During the merging process, the start and end points of the new segment are adjusted according to the specific merge type. The complete workflow of the merging process is detailed in Algorithm 1.

Algorithm 1: Line Segments Merging
    Input: Initial line segments $l_{1}(s_{1}, e_{1})$, $l_{3}(s_{3}, e_{3})$
    Output: Merged line segment $l_{good}$
    if $\mathrm{length}(l_{1}) < \mathrm{length}(l_{3})$ then
        $\mathrm{Swap}(l_{1}, l_{3})$
    end if
    if $\mathrm{dot}_{13} < 0$ then
        $\mathrm{ReverseDirection}(l_{3})$
    end if
    if $Dis(s_{3}, l_{1}) < T_{1}$ and $Dis(e_{3}, l_{1}) < T_{1}$ and $\theta_{13} < \theta_{T}$ then
        if $dis(s_{3}, l_{1}) < T_{2}$ and $dis(e_{3}, l_{1}) < T_{2}$ then
            $l_{good} = (s_{1}, e_{1})$
        else if $dis(s_{3}, l_{1}) < T_{2}$ and $dis(e_{3}, l_{1}) > T_{2}$ then
            $l_{good} = (s_{1}, e_{3})$
        else if $dis(s_{3}, l_{1}) > T_{2}$ and $dis(e_{3}, l_{1}) < T_{2}$ then
            $l_{good} = (s_{3}, e_{1})$
        end if
    end if
    return $l_{good}$
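Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading of the pseudocode, not the authors' code: the function names are hypothetical, segments are passed as pairs of 2D pixel points, and the boundary case where a distance equals $T_{2}$ exactly is folded into the "greater than" branches.

```python
import numpy as np

T1, T2 = 5.0, 3.0             # pixel thresholds for Dis and dis
THETA_T = np.deg2rad(3.0)     # direction consistency threshold

def dist_to_line(p, a, b):
    """Dis: perpendicular distance from p to the infinite line through a and b."""
    d = b - a
    return abs(d[0] * (p[1] - a[1]) - d[1] * (p[0] - a[0])) / np.linalg.norm(d)

def dist_to_segment(p, a, b):
    """dis: shortest distance from p to the finite segment from a to b."""
    d = b - a
    t = np.clip(np.dot(p - a, d) / np.dot(d, d), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * d))

def merge_segments(l1, l3):
    """Return the merged segment (start, end), or None if not mergeable."""
    s1, e1 = (np.asarray(p, dtype=float) for p in l1)
    s3, e3 = (np.asarray(p, dtype=float) for p in l3)
    # Use the longer segment as the reference l1.
    if np.linalg.norm(e1 - s1) < np.linalg.norm(e3 - s3):
        s1, e1, s3, e3 = s3, e3, s1, e1
    d1, d3 = e1 - s1, e3 - s3
    # Align the direction of the shorter segment with the reference.
    if np.dot(d1, d3) < 0:
        s3, e3, d3 = e3, s3, -d3
    cos_t = np.dot(d1, d3) / (np.linalg.norm(d1) * np.linalg.norm(d3))
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0))
    if not (dist_to_line(s3, s1, e1) < T1 and
            dist_to_line(e3, s1, e1) < T1 and theta < THETA_T):
        return None
    near_s = dist_to_segment(s3, s1, e1) < T2
    near_e = dist_to_segment(e3, s1, e1) < T2
    if near_s and near_e:
        return s1, e1          # l3 lies along l1: keep the reference
    if near_s:
        return s1, e3          # l3 extends l1 beyond its end point
    if near_e:
        return s3, e1          # l3 extends l1 before its start point
    return None

# A 100-pixel segment followed by a nearly collinear 49-pixel fragment.
merged = merge_segments(((0, 0), (100, 0)), ((101, 1), (150, 1)))
# The same fragment with reversed endpoints gives the same result.
merged_rev = merge_segments(((0, 0), (100, 0)), ((150, 1), (101, 1)))
# A parallel segment 50 pixels away is rejected.
rejected = merge_segments(((0, 0), (100, 0)), ((0, 50), (60, 50)))
```

In the first call the start point of the fragment lies about 1.4 pixels from the reference segment while its end point is far away, so the result extends the reference to $(s_{1}, e_{3})$, that is, from $(0, 0)$ to $(150, 1)$.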

4.3. Structural Consistency Verification of Merged Lines

To ensure the geometric accuracy of the merged segments from Section 4.2 and prevent incorrect mergers, this paper proposes a verification method based on endpoint distance constraints. This verification performs a geometric consistency assessment on the merged line results, ensuring the reliability of line features in the VIO system. Its core idea is to construct the line equation using the endpoints of the merged segment and use this as a benchmark to verify whether the geometric structure of the original segments is reasonably preserved after merging.

Specifically, the line model is first determined based on the two endpoints of the merged segment. Subsequently, the shortest distances from the two unused endpoints of the original segments to this line are calculated. If both distances fall within the set distance threshold $T_{2}$, the merge is considered geometrically consistent, and the segment $l_{opt}$ is retained for subsequent feature tracking and matching; otherwise, it is deemed an invalid merge and discarded.

This verification mechanism filters out geometrically inconsistent merging results, avoiding spurious observations from erroneous mergers. By providing more reliable geometric constraints, it enhances the pose estimation performance of the VIO system, as validated by the experimental results in Section 5. The complete verification procedure is shown in Algorithm 2.

Algorithm 2: Line Segments Verification and Refinement
    Input: Merged line segment $l_{good}$
    Output: Refined line segment $l_{opt}$
    if $Dis(s_{3}, l_{1}) < T_{1}$ and $Dis(e_{3}, l_{1}) < T_{1}$ and $\theta_{13} < \theta_{T}$ then
        if $l_{good} = (s_{1}, e_{1})$ then
            if $Dis(s_{3}, l_{good}) < T_{2}$ and $Dis(e_{3}, l_{good}) < T_{2}$ then
                $l_{opt} = l_{good}$
            else
                Delete $l_{good}$
            end if
        else if $l_{good} = (s_{1}, e_{3})$ then
            if $Dis(s_{3}, l_{good}) < T_{2}$ and $Dis(e_{1}, l_{good}) < T_{2}$ then
                $l_{opt} = l_{good}$
            else
                Delete $l_{good}$
            end if
        else if $l_{good} = (s_{3}, e_{1})$ then
            if $Dis(s_{1}, l_{good}) < T_{2}$ and $Dis(e_{3}, l_{good}) < T_{2}$ then
                $l_{opt} = l_{good}$
            else
                Delete $l_{good}$
            end if
        end if
    end if
    return $l_{opt}$
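The three cases of Algorithm 2 share one rule: the two original endpoints that did not become endpoints of $l_{good}$ must lie within $T_{2}$ of the line through $l_{good}$. The sketch below implements that rule directly. The function names are hypothetical, and identifying the unused endpoints by coordinate comparison is a simplification chosen for this illustration.

```python
import numpy as np

T2 = 3.0   # pixel threshold for the endpoint-to-line distance

def dist_to_line(p, a, b):
    """Dis: perpendicular distance from p to the infinite line through a and b."""
    d = b - a
    return abs(d[0] * (p[1] - a[1]) - d[1] * (p[0] - a[0])) / np.linalg.norm(d)

def verify_merge(l_good, l1, l3):
    """Return l_good as l_opt if the merge is geometrically consistent, else None."""
    a, b = (np.asarray(p, dtype=float) for p in l_good)
    endpoints = [np.asarray(p, dtype=float) for p in (*l1, *l3)]
    # Endpoints of the original segments that are not endpoints of l_good.
    unused = [p for p in endpoints
              if not (np.allclose(p, a) or np.allclose(p, b))]
    if all(dist_to_line(p, a, b) < T2 for p in unused):
        return a, b
    return None

# Consistent merge: both unused endpoints lie well within 3 pixels of the new line.
ok = verify_merge(((0, 0), (150, 1)),
                  ((0, 0), (100, 0)), ((101, 1), (150, 1)))

# Inconsistent merge: the fragment drifts away, so the old end point (300, 0)
# ends up about 3.9 pixels from the merged line and the merge is discarded.
bad = verify_merge(((0, 0), (380, 4.9)),
                   ((0, 0), (300, 0)), ((301, 2), (380, 4.9)))
```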

4.4. 3D Reconstruction of Line Features

The triangulation of line features aims to recover their 3D spatial parameters from matched 2D line segments across adjacent frames, utilizing epipolar constraints. However, directly using raw line segments generated by the LSD detector for triangulation poses fundamental challenges, as these segments are often fragmented and redundant. This not only increases the probability of mismatches but also makes the endpoints of short segments susceptible to noise, significantly undermining the reliability of geometric constraints. These factors collectively lead to increased errors in depth calculation during epipolar geometry-based triangulation, thereby compromising the accuracy of spatial structure reconstruction.

To address this, this section performs triangulation based on the geometrically enhanced line segments output by Sections 4.1 to 4.3. The triangulation schematic for merged segments is illustrated in Figure 3. Consider two adjacent keyframes with camera centers $c_{0}$ and $c_{1}$, and the corresponding relative camera pose $\{R, t\}$. The image planes of the two cameras are $\Pi_{0}$ and $\Pi_{1}$. Two short segments detected by LSD, denoted as $l_{1}(s_{1}, e_{1})$ and $l_{3}(s_{3}, e_{3})$, are projected onto the image planes. Their projections on $\Pi_{0}$ are $l'_{1}(s'_{1}, e'_{1})$ and $l'_{3}(s'_{3}, e'_{3})$, and on $\Pi_{1}$ are $l''_{1}(s''_{1}, e''_{1})$ and $l''_{3}(s''_{3}, e''_{3})$. Direct triangulation would recover two separate 3D line segments, as illustrated by the red and green spatial lines in Figure 3. In contrast, our method merges them according to the geometric constraints described in Section 4.2. For example, $l'_{1}$ and $l'_{3}$ on $\Pi_{0}$ are merged into a single segment $(s'_{1}, e'_{3})$; similarly, the merged segment $(s''_{1}, e''_{3})$ is obtained on $\Pi_{1}$. Subsequently, triangulation is performed on the matched merged segments using the camera geometry $\{R, t\}$, recovering a more complete 3D line feature $l(s_{1}, e_{3})$.

This strategy enhances computational efficiency by merging the two independent triangulations traditionally performed on fractured short line segments into a single triangulation of the complete long segment. Furthermore, the complete long segment resulting from geometric merging provides stronger geometric constraints for triangulation, contributing to coherent and structurally complete 3D maps. These geometrically optimized 3D line features form the foundation for the line reprojection residual model in the subsequent tightly coupled optimization framework.
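The text above does not spell out the triangulation formula, so the following sketch uses plane intersection, a standard way to triangulate a line from two views, as a stand-in: each observed segment and its camera center span a plane, and the 3D line is the intersection of the two planes. All names are hypothetical, endpoints are assumed to be in normalized image coordinates, and the pose convention $X_{ref} = R X_{cam} + t$ is an assumption of this sketch.

```python
import numpy as np

def backprojection_plane(s, e, R=np.eye(3), t=np.zeros(3)):
    """Plane a.X + a0 = 0 through a camera centre and an observed 2D segment.

    s, e: segment endpoints in normalized image coordinates (x, y).
    R, t: pose of this camera in the reference frame (X_ref = R @ X_cam + t).
    """
    m = np.cross([s[0], s[1], 1.0], [e[0], e[1], 1.0])  # normal in the camera frame
    a = R @ m                                           # normal in the reference frame
    return a, -np.dot(a, t)

def triangulate_line(plane0, plane1):
    """Intersect two planes and return Plücker coordinates (n, d) of the line."""
    (a, a0), (b, b0) = plane0, plane1
    d = np.cross(b, a)           # direction lies in both planes
    n = b0 * a - a0 * b          # moment, equal to X x d for any X on the line
    return np.concatenate([n, d])

def project(X, R=np.eye(3), t=np.zeros(3)):
    """Normalized image coordinates of a reference-frame point X."""
    Xc = R.T @ (X - t)
    return Xc[:2] / Xc[2]

# Synthetic check: a 3D segment PQ seen from camera 0 (reference) and camera 1.
P, Q = np.array([-1.0, 0.5, 4.0]), np.array([1.0, 0.2, 5.0])
c, s_ = np.cos(0.1), np.sin(0.1)
R = np.array([[c, 0.0, s_], [0.0, 1.0, 0.0], [-s_, 0.0, c]])
t = np.array([0.5, 0.0, 0.0])

plane0 = backprojection_plane(project(P), project(Q))
plane1 = backprojection_plane(project(P, R, t), project(Q, R, t), R, t)
L = triangulate_line(plane0, plane1)

# Ground-truth Plücker coordinates; L should match them up to scale.
L_true = np.concatenate([np.cross(P, Q - P), Q - P])
cosine = abs(L @ L_true) / (np.linalg.norm(L) * np.linalg.norm(L_true))
```

The intersection degenerates when the two planes coincide, for example when the 3D line lies in a plane containing both camera centers. Working with one merged segment per image means a single pair of planes, and therefore a single intersection, instead of one per fragment.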

4.5. Modeling of Line Reprojection Residuals

The construction of the line feature reprojection residual model is a critical step for incorporating geometric constraints into the nonlinear optimization framework, entailing three key components: spatial line parameterization, projection mapping modeling, and residual function design.

First, to achieve an efficient and stable representation of 3D lines, this paper parameterizes them with Plücker coordinates. Specifically, given a 3D line L^{w} in the world frame, it can be converted to the camera frame as follows:

L^{c} = [\begin{matrix} n^{c} \\ d^{c} \end{matrix}] = T_{w c} L^{w},

(7)

Next, the line feature L^{c} in the camera coordinate frame is projected onto the image plane using the following equation, yielding its corresponding 2D line expression:

l^{c} = K n^{c} = [\begin{matrix} f_{y} & 0 & 0 \\ 0 & f_{x} & 0 \\ - f_{y} c_{x} & - f_{x} c_{y} & f_{x} f_{y} \end{matrix}] n^{c} = [\begin{matrix} l_{1} \\ l_{2} \\ l_{3} \end{matrix}],

(8)

where n^{c} denotes the normal vector of the plane containing the spatial line and the camera optical center, and K represents the projection matrix of the line feature, incorporating the focal lengths f_{x}, f_{y} and the principal point coordinates c_{x}, c_{y}. In the normalized image plane, the projection matrix K reduces to the identity matrix.
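Equations (7) and (8) can be illustrated with a short numerical sketch. It assumes the common 6×6 Plücker transform built from a rotation R_cw and translation t_cw that map world points to camera points (X_c = R_cw X_w + t_cw); the pose convention, function names, and intrinsic values are assumptions for this example, not taken from the paper.

```python
import numpy as np

def skew(v):
    """Return the 3x3 cross-product matrix [v]x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def plucker_world_to_camera(n_w, d_w, R_cw, t_cw):
    """Transform Pluecker coordinates (n, d), assuming X_c = R_cw @ X_w + t_cw."""
    d_c = R_cw @ d_w
    n_c = R_cw @ n_w + skew(t_cw) @ d_c
    return n_c, d_c

def line_projection_matrix(fx, fy, cx, cy):
    """Line projection matrix K as written in Equation (8)."""
    return np.array([[fy, 0.0, 0.0],
                     [0.0, fx, 0.0],
                     [-fy * cx, -fx * cy, fx * fy]])

def project_point(X_c, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to homogeneous pixel coordinates."""
    return np.array([fx * X_c[0] / X_c[2] + cx, fy * X_c[1] / X_c[2] + cy, 1.0])

# Hypothetical example: a world line through P_w and Q_w and an arbitrary camera pose.
P_w, Q_w = np.array([0.0, 0.0, 5.0]), np.array([1.0, 0.5, 6.0])
d_w = Q_w - P_w                      # direction vector
n_w = np.cross(P_w, d_w)             # moment (normal) vector
yaw = 0.1
R_cw = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                 [np.sin(yaw), np.cos(yaw), 0.0],
                 [0.0, 0.0, 1.0]])
t_cw = np.array([0.2, -0.1, 0.3])
fx, fy, cx, cy = 458.0, 457.0, 367.0, 248.0   # illustrative intrinsics only

n_c, d_c = plucker_world_to_camera(n_w, d_w, R_cw, t_cw)
line_px = line_projection_matrix(fx, fy, cx, cy) @ n_c   # 2D line (l1, l2, l3)
```

Only the normal vector n^{c} enters the projection, which is why the direction d^{c} does not appear in Equation (8). Projecting P_w and Q_w as points and substituting them into the line equation gives a quick consistency check.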

Finally, based on this projection relationship, the reprojection residual for the 3D spatial line L_{l} observed in the ith frame is formulated as the perpendicular distances from the endpoints of the observed segment to the projected line l_{l}^{c_{i}}, i.e., the projection of L_{l} onto that image plane. As shown in Figure 4, the residual is computed as follows:

r_{l} (z_{L_{l}}^{c_{i}}, X) = [\begin{matrix} d (s_{l}, l_{l}^{c_{i}}) \\ d (e_{l}, l_{l}^{c_{i}}) \end{matrix}] = \frac{1}{\sqrt{l_{1}^{2} + l_{2}^{2}}} [\begin{matrix} s_{l}^{T} l_{l}^{c_{i}} \\ e_{l}^{T} l_{l}^{c_{i}} \end{matrix}],

(9)

where X denotes the full set of state variables within the sliding window, and z_{L_{l}}^{c_{i}} represents the observation of the line feature L_{l} in the ith camera frame. s_{l} = [s_{x}, s_{y}, 1]^{T} and e_{l} = [e_{x}, e_{y}, 1]^{T} are the homogeneous endpoint coordinates of the observed line segment l.
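A minimal numerical sketch of Equation (9) follows. The function name and the example values are hypothetical; the endpoints are passed as 2D coordinates and converted to homogeneous form inside the function.

```python
import numpy as np

def line_reprojection_residual(line, s, e):
    """Signed perpendicular distances from endpoints s and e to the 2D line
    (l1, l2, l3), following Equation (9)."""
    l = np.asarray(line, dtype=float)
    s_h = np.array([s[0], s[1], 1.0])   # homogeneous start point
    e_h = np.array([e[0], e[1], 1.0])   # homogeneous end point
    return np.array([s_h @ l, e_h @ l]) / np.hypot(l[0], l[1])

# Hypothetical example: projected line y = 1, observed endpoints above and below it.
r = line_reprojection_residual([0.0, 1.0, -1.0], (0.0, 1.5), (2.0, 0.5))
```

Dividing by the norm of (l_1, l_2) makes the residual independent of the scale of the homogeneous line, so K n^{c} does not need to be normalized beforehand.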

To minimize this residual within the sliding-window optimization framework, it is necessary to compute its Jacobian matrix J_{l} with respect to the relevant state variables. The chain-rule decomposition and its individual terms are as follows:

J_{l} = \frac{\partial r_{l}}{\partial l^{c_{i}}} \frac{\partial l^{c_{i}}}{\partial L^{c_{i}}} [\begin{matrix} \frac{\partial L^{c_{i}}}{\partial δ x^{i}} & \frac{\partial L^{c_{i}}}{\partial L^{w}} \frac{\partial L^{w}}{\partial δ O} \end{matrix}],

(10)

\frac{\partial r_{l}}{\partial l^{c_{i}}} = [\begin{matrix} \frac{\partial r_{1}}{\partial l_{1}^{c_{i}}} & \frac{\partial r_{1}}{\partial l_{2}^{c_{i}}} & \frac{\partial r_{1}}{\partial l_{3}^{c_{i}}} \\ \frac{\partial r_{2}}{\partial l_{1}^{c_{i}}} & \frac{\partial r_{2}}{\partial l_{2}^{c_{i}}} & \frac{\partial r_{2}}{\partial l_{3}^{c_{i}}} \end{matrix}] = \frac{1}{\sqrt{l_{1}^{2} + l_{2}^{2}}} [\begin{matrix} \frac{- l_{1} s_{l}^{T} l^{c_{i}}}{l_{1}^{2} + l_{2}^{2}} + s_{x} & \frac{- l_{2} s_{l}^{T} l^{c_{i}}}{l_{1}^{2} + l_{2}^{2}} + s_{y} & 1 \\ \frac{- l_{1} e_{l}^{T} l^{c_{i}}}{l_{1}^{2} + l_{2}^{2}} + e_{x} & \frac{- l_{2} e_{l}^{T} l^{c_{i}}}{l_{1}^{2} + l_{2}^{2}} + e_{y} & 1 \end{matrix}],

(11)

\frac{\partial l^{c_{i}}}{\partial L^{c_{i}}} = K [\begin{matrix} I_{3 \times 3} & 0_{3 \times 3} \end{matrix}],

(12)

\frac{\partial L^{c_{i}}}{\partial δ x^{i}} = T_{c b}^{- 1} [\begin{matrix} R_{b w}^{T} {[d^{w}]}_{\times} & {[R_{b w}^{T} (n^{w} + {[d^{w}]}_{\times} p_{b w})]}_{\times} & 0_{3 \times 9} \\ 0_{3 \times 3} & R_{b w}^{T} {[d^{w}]}_{\times} & 0_{3 \times 9} \end{matrix}],

(13)

\frac{\partial L^{c_{i}}}{\partial L^{w}} \frac{\partial L^{w}}{\partial δ O} = T_{c w}^{- 1} [\begin{matrix} 0_{3 \times 1} & - w_{1} u_{3} & w_{1} u_{2} & - w_{2} u_{1} \\ w_{2} u_{3} & 0_{3 \times 1} & - w_{2} u_{1} & w_{1} u_{2} \end{matrix}],

(14)

where O stands for the set of line features captured by at least two camera frames within the sliding window.
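The first factor of the chain, Equation (11), is easy to verify numerically. The sketch below compares the analytic expression with a central finite-difference approximation; the helper names and test values are assumptions for illustration, and the remaining factors in Equations (12) to (14) are not covered.

```python
import numpy as np

def residual(l, s_h, e_h):
    """Equation (9): signed endpoint-to-line distances."""
    return np.array([s_h @ l, e_h @ l]) / np.hypot(l[0], l[1])

def jacobian_wrt_line(l, s_h, e_h):
    """Equation (11): derivative of the residual w.r.t. l = (l1, l2, l3)."""
    norm_sq = l[0] ** 2 + l[1] ** 2
    rows = []
    for p in (s_h, e_h):
        dot = p @ l
        rows.append([p[0] - l[0] * dot / norm_sq,
                     p[1] - l[1] * dot / norm_sq,
                     1.0])
    return np.array(rows) / np.sqrt(norm_sq)

def numeric_jacobian(l, s_h, e_h, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    J = np.zeros((2, 3))
    for k in range(3):
        step = np.zeros(3)
        step[k] = eps
        J[:, k] = (residual(l + step, s_h, e_h) - residual(l - step, s_h, e_h)) / (2 * eps)
    return J

# Arbitrary test values (not from the paper).
l = np.array([0.3, -0.8, 0.1])
s_h = np.array([0.2, 0.4, 1.0])
e_h = np.array([-0.5, 0.1, 1.0])
J_analytic = jacobian_wrt_line(l, s_h, e_h)
J_numeric = numeric_jacobian(l, s_h, e_h)
```

A check of this kind is a common way to catch sign or index errors before an analytic Jacobian is handed to a nonlinear solver.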

4.6. Sliding Window-Based Tightly Coupled Optimization

Based on the line feature reprojection residual model established above, this paper constructs a tightly coupled nonlinear optimization pipeline for VIO that fuses visual and inertial measurements to estimate the system state. The state estimation problem is solved with the BA strategy, whose objective function is constructed by minimizing the reprojection residuals of both point and line features. To circumvent the cubic computational complexity inherent to global BA, a fixed-length sliding window is employed to bound the size of the optimization problem. This approach preserves estimation accuracy while keeping the computational cost under control.

The state variables to be optimized within the sliding window are given as follows:

X = [x_{0}, x_{1}, \cdots, x_{k}, λ_{0}, λ_{1}, \cdots, λ_{m}, O_{0}, O_{1}, \cdots, O_{l}],

(15)

x_{i} = [p_{b_{i}}^{w}, q_{b_{i}}^{w}, v_{b_{i}}^{w}, b_{a}, b_{g}], i \in [0, k],

(16)

where x_{i} represents the IMU state of the ith keyframe, comprising its position, orientation, and velocity in the world frame, along with the accelerometer and gyroscope biases in the body frame. k, m, and l represent the numbers of keyframes, point features, and line features within the sliding window, respectively; λ_{m} is the inverse depth parameter for the mth point feature; and O_{l} denotes the orthonormal representation parameters of the lth line feature in the world coordinate frame.
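For readers unfamiliar with the orthonormal representation, the following sketch converts Plücker coordinates to the pair (U, W) with U in SO(3) and W in SO(2), and back, following the standard construction of [34]. The function names are hypothetical, and the code assumes a valid Plücker line (n perpendicular to d) that does not pass through the origin.

```python
import numpy as np

def plucker_to_orthonormal(n, d):
    """Return (U, W): U in SO(3) holds the line's frame, W in SO(2) encodes the
    ratio between the norms of n and d. Assumes n.d = 0 and n != 0."""
    n = np.asarray(n, dtype=float)
    d = np.asarray(d, dtype=float)
    u1 = n / np.linalg.norm(n)
    u2 = d / np.linalg.norm(d)
    u3 = np.cross(u1, u2)
    U = np.column_stack([u1, u2, u3])
    scale = np.hypot(np.linalg.norm(n), np.linalg.norm(d))
    w1, w2 = np.linalg.norm(n) / scale, np.linalg.norm(d) / scale
    W = np.array([[w1, -w2], [w2, w1]])
    return U, W

def orthonormal_to_plucker(U, W):
    """Recover Pluecker coordinates up to a common positive scale."""
    w1, w2 = W[0, 0], W[1, 0]
    return w1 * U[:, 0], w2 * U[:, 1]

# Hypothetical example: the line through (0, 0, 5) with direction (1, 0, 0).
n0, d0 = np.array([0.0, 5.0, 0.0]), np.array([1.0, 0.0, 0.0])
U, W = plucker_to_orthonormal(n0, d0)
n_rec, d_rec = orthonormal_to_plucker(U, W)
```

U carries three degrees of freedom and W one, which gives the four-parameter minimal update that matches the four columns in Equation (14).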

Based on the above definitions, a tightly coupled nonlinear least-squares problem is formulated. The overall objective function F incorporates the reprojection residuals of point and line features, the IMU pre-integration residuals, and prior information. It is constructed to derive the maximum a posteriori estimate of the system state and is defined as follows:

\begin{matrix} F = \min_{X} \{ \sum_{i \in B} ρ (∥ r_{b} (z_{b_{i} b_{i + 1}}, X) ∥_{Σ_{b_{i} b_{i + 1}}}^{2}) + ∥ r_{p} - J_{p} X ∥^{2} \\ + \sum_{(i, j) \in P} ρ_{p} (∥ r_{p} (z_{p_{j}}^{c_{i}}, X) ∥_{Σ_{p_{j}}^{c_{i}}}^{2}) + \sum_{(i, l) \in L} ρ_{l} (∥ r_{l} (z_{L_{l}}^{c_{i}}, X) ∥_{Σ_{L_{l}}^{c_{i}}}^{2}) \}, \end{matrix}

(17)

where r_{b} (z_{b_{i} b_{i + 1}}, X) denotes the IMU pre-integration residual between consecutive frames i and i + 1; r_{p} (z_{p_{j}}^{c_{i}}, X) and r_{l} (z_{L_{l}}^{c_{i}}, X) represent the reprojection residuals of point and line features, respectively; B, P, and L are their corresponding sets of observations; Σ_{b_{i} b_{i + 1}}, Σ_{p_{j}}^{c_{i}}, and Σ_{L_{l}}^{c_{i}} denote the corresponding measurement covariance matrices. {r_{p}, J_{p}} denotes the prior information from marginalization within the sliding window.

To enhance the system’s robustness against mismatches and measurement noise, the Huber robust kernel function ρ(·) is applied to weight all residual terms. Finally, the Ceres Solver [35] library is adopted to iteratively solve this nonlinear least-squares problem, outputting the optimized system state.
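The effect of the robust kernel on a single covariance-weighted term can be sketched as follows. The kernel is written in the form that acts on the squared residual norm; the threshold `delta = 1.0` and all function names are assumptions for illustration and are not values reported in the paper.

```python
import numpy as np

def mahalanobis_sq(r, cov):
    """Squared covariance-weighted norm ||r||^2_Sigma = r^T Sigma^{-1} r."""
    r = np.asarray(r, dtype=float)
    return float(r @ np.linalg.solve(cov, r))

def huber(sq_norm, delta=1.0):
    """Huber kernel on a squared norm s: s if s <= delta^2,
    otherwise 2 * delta * sqrt(s) - delta^2 (linear growth for outliers)."""
    s = np.asarray(sq_norm, dtype=float)
    return np.where(s <= delta ** 2, s, 2.0 * delta * np.sqrt(s) - delta ** 2)

def robust_cost(residuals, covariances, delta=1.0):
    """Sum of Huber-weighted Mahalanobis terms, mirroring one sum in Equation (17)."""
    return float(sum(huber(mahalanobis_sq(r, c), delta)
                     for r, c in zip(residuals, covariances)))

# Hypothetical example: one inlier and one outlier line residual.
cov = np.diag([1.0, 4.0])
cost = robust_cost([[0.3, 0.4], [1.0, 2.0]], [cov, cov])
```

Inlier terms are left unchanged, while the cost of a large residual grows only linearly in its norm, which limits the influence of mismatched lines on the estimate.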

5. Experimental Validation

5.1. Experimental Setup and Dataset

To systematically assess the performance of the proposed method, experiments are conducted on a portable computing platform running the Ubuntu 20.04 operating system and the ROS Noetic framework. The hardware platform features an Intel® Core™ i7-14650HX processor, 16 GB DDR5 RAM, and an NVIDIA GeForce RTX 4060 Laptop GPU. This paper utilizes the public EuRoC micro aerial vehicle (MAV) dataset [36] for validation. Released by ETH Zurich, this dataset includes 20 Hz global-shutter stereo images and 200 Hz IMU measurements, and provides millimeter-accurate trajectory ground truth obtained by a laser tracker and a motion capture system. Its 11 sequences were recorded in two indoor environments: an industrial machine hall (MH sequences) and a room equipped with a motion capture system (V1 and V2 sequences). The dataset includes challenging conditions such as high-speed motion, illumination changes, weak textures, and sensor noise, allowing for evaluation of VIO systems in terms of accuracy, scale consistency, and front-end robustness.

To ensure a fair comparison, the loop-closure detection module is disabled for all evaluated algorithms. The camera intrinsics, camera-IMU extrinsics, and IMU noise parameters follow the calibration values provided with the dataset. All other parameters are set to the defaults in the respective open source implementations. Figure 5 displays experimental snapshots from nine EuRoC sequences, offering a qualitative perspective on the proposed method’s performance in challenging environments.

5.2. Accuracy Evaluation on the EuRoC MAV Dataset

This study utilizes left-camera images from the EuRoC MAV dataset as visual input and evaluates performance across all 11 sequences. To validate the accuracy and robustness under varying motion patterns and environmental conditions, VINS-Mono, PL-VINS, and EPLF-VINS are selected as representative baselines. The APE is employed as the primary accuracy metric. It measures the positional deviation between the estimated and ground-truth poses, and its root mean square error (RMSE) is reported to reflect the estimation accuracy and temporal stability. All APE results are computed using the open-source Evo package (https://github.com/MichaelGrupp/evo) (accessed on 21 March 2026) for consistency, with the millimeter-accurate ground truth serving as the reference.
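The translational APE RMSE can be written in a few lines. The sketch below assumes that the estimated and ground-truth positions have already been time-associated and aligned to a common frame; those steps are handled by the evaluation tool and are not reproduced here. The function name and sample data are hypothetical.

```python
import numpy as np

def ape_rmse(est_xyz, gt_xyz):
    """RMSE of the per-pose position error between two aligned trajectories.
    Both inputs are (N, 3) arrays with matching rows (assumed pre-associated)."""
    est = np.asarray(est_xyz, dtype=float)
    gt = np.asarray(gt_xyz, dtype=float)
    err = np.linalg.norm(est - gt, axis=1)      # position error per pose
    return float(np.sqrt(np.mean(err ** 2)))

# Hypothetical example: a constant 0.5 m offset yields an RMSE of 0.5 m.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
est = gt + np.array([0.3, 0.4, 0.0])
rmse = ape_rmse(est, gt)
```

Because the errors are squared before averaging, the RMSE penalizes occasional large deviations more strongly than a mean error would.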

As presented in Table 2, the proposed approach attains the lowest APE on 7 out of 11 sequences, with an average APE of 0.1642 m. Compared to VINS-Mono, PL-VINS, and EPLF-VINS, the APE is reduced by 17.57%, 9.88%, and 6.65%, respectively, demonstrating better overall consistency and localization accuracy. This improvement is primarily attributed to the proposed geometric constraint optimization strategy, which enhances front-end feature quality and mitigates line segment fragmentation prevalent in conventional methods. As shown in Figure 6, for the high-dynamic sequences V1_03_difficult and V2_02_medium, the line features extracted by our method exhibit improved structural integrity and spatial distribution stability compared to PL-VINS.
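The reported percentages follow directly from the average APE values in Table 2. The short sketch below recomputes them; the helper name is hypothetical and the numbers are copied from the table.

```python
def relative_reduction(baseline, ours):
    """Relative reduction of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline

# Average APE RMSE (m) from Table 2.
average_ape = {"VINS-Mono": 0.1992, "PL-VINS": 0.1822, "EPLF-VINS": 0.1759}
ours_average = 0.1642

reductions = {name: relative_reduction(value, ours_average)
              for name, value in average_ape.items()}

# Single-sequence comparison quoted for V2_03_difficult (EPLF-VINS vs. ours).
v2_03_reduction = relative_reduction(0.2605, 0.1923)

for name, value in reductions.items():
    print(f"{name}: {value:.2f}%")
```

These figures are ratios of the 11-sequence averages, so sequences with larger absolute errors, such as MH_04_difficult, contribute more to them than the easy sequences do.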

To further assess the algorithm’s performance in typical complex scenarios, Figure 7 illustrates the APE over time for the proposed method, PL-VINS, and EPLF-VINS on the MH_04_difficult and V2_03_difficult sequences. As shown, our method maintains lower error amplitudes and smaller fluctuation ranges for most of the duration, indicating better estimation stability and temporal consistency. This advantage is more evident in the V2_03_difficult sequence: in this weak-texture scenario, our method achieves an APE RMSE of 0.1923 m, which is 26.18% lower than that of EPLF-VINS (0.2605 m), demonstrating the favorable adaptability of our line segment merging and verification strategy in visually sparse environments. In the highly dynamic scenario MH_04_difficult, our method achieves an APE RMSE of 0.1990 m, compared with 0.3045 m for PL-VINS and 0.2449 m for EPLF-VINS. This suggests that the introduced line segment merging mechanism improves feature tracking continuity and alleviates temporary line feature loss caused by rapid rotation. Figure 8 and Figure 9 present the APE heatmaps for the V2_01_easy and V2_02_medium sequences, comparing the proposed method with EPLF-VINS. The color gradient from blue to red indicates the error magnitude. It can be observed that the proposed method tends to yield lower errors in most parts of the trajectories, including regions with noticeable drift. Overall, the proposed method shows clear advantages in structured, line-feature-rich environments, further supporting its effectiveness.

To evaluate the trajectory estimation performance of different methods from a global perspective, Figure 10 presents a 2D comparison of the estimated trajectories relative to the ground truth for the MH_05_difficult and V2_03_difficult sequences, comparing the proposed method with the baseline methods. Figure 11 further shows the temporal profiles of the position estimates along the x-, y-, and z-axes for the same sequences, offering a visual comparison of each algorithm’s accuracy across different dimensions. The results indicate that the trajectories estimated by the proposed method are consistent with the true trajectories in both overall trends and local details, with small error fluctuations across all coordinate directions. These results collectively support the improvement in trajectory estimation accuracy and consistency achieved by the proposed geometric enhancement strategy.

Based on the comprehensive analysis of the above results, the proposed method achieves higher pose estimation accuracy and robustness compared to VINS-Mono, PL-VINS, and EPLF-VINS. This advantage is primarily attributed to the proposed front-end geometric optimization pipeline.

(1) Short line segments extracted by conventional methods provide weak constraints and are easily affected by image noise and local occlusion. Our method integrates fragmented short segments into spatially continuous long segments through length filtering, geometric-consistency-based merging, and endpoint-distance verification, obtaining line features with more complete structural representation. As shown in Figure 6, in low-texture and highly dynamic scenarios, the line segments extracted by our method are more continuous and stable in spatial distribution, thereby providing reliable geometric constraints for back-end optimization and effectively suppressing error accumulation. (2) Different from directly triangulating raw short line segments, our method applies epipolar triangulation to the merged complete long segments, which facilitates obtaining more reliable 3D line parameters. This provides accurate reprojection residual constraints for the back-end nonlinear optimization, thus improving the overall pose estimation accuracy of the system. The combined effect of the above two aspects enables the proposed method to achieve lower localization errors and stronger robustness in challenging scenarios such as low-texture and highly dynamic environments.

5.3. Efficiency and Real-Time Performance Analysis

To quantitatively evaluate the front-end efficiency of the proposed algorithm, this section analyzes the computational time cost of the line feature processing module. Experiments are conducted on all 11 sequences of the EuRoC MAV dataset. The cumulative processing time from feature detection to tracking is measured and compared with that of the PL-VINS method. It should be noted that due to the differences in line segment extraction strategies between our method and EPLF-VINS, a direct runtime comparison would not be meaningful. Therefore, we only compare the efficiency with PL-VINS in this section. All reported data are averaged over ten independent runs. As shown in Table 3, the proposed method demonstrates improved efficiency across all sequences, with an average processing time of 12.87 ms, representing a 4.16% reduction compared to PL-VINS. The efficiency gain is more notable in the challenging MH_04_difficult sequence, where the processing time is reduced by 16.44%.

In addition, we evaluate the average per-frame processing time of the system, including front-end feature processing, IMU pre-integration, nonlinear optimization, and marginalization. Table 4 compares the per-frame processing time of our method and PL-VINS on all EuRoC sequences. Our method achieves an average per-frame processing time of 58.74 ms, outperforming the 60.17 ms of PL-VINS with an average improvement of 2.36%. These improvements mainly stem from the filtering and merging of redundant short line segments, which reduce the number of independent line features that require subsequent matching and triangulation. As a result, the overall system efficiency is enhanced.
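The average gains quoted for Tables 3 and 4 can be recomputed from the per-sequence times. The sketch below, with hypothetical helper names and values copied from the two tables, evaluates both the mean of the per-sequence gains and the gain between the two column averages, since the two aggregates generally differ.

```python
# Per-sequence times (ms) copied from Table 3 (line processing) and Table 4 (per frame).
pl_vins_line = [16.85, 16.28, 15.39, 14.42, 15.17, 13.73, 12.62, 10.81, 11.15, 11.47, 10.06]
ours_line = [16.48, 15.96, 15.07, 12.05, 14.24, 13.26, 12.04, 10.42, 11.01, 11.22, 9.87]
pl_vins_frame = [68.14, 67.88, 64.92, 68.32, 66.89, 56.74, 54.47, 53.05, 54.11, 53.97, 53.36]
ours_frame = [67.65, 66.20, 64.53, 64.24, 65.05, 56.18, 54.09, 49.52, 53.56, 53.41, 51.71]

def mean_of_gains(baseline, ours):
    """Average of the per-sequence relative reductions, in percent."""
    gains = [100.0 * (b - o) / b for b, o in zip(baseline, ours)]
    return sum(gains) / len(gains)

def gain_of_means(baseline, ours):
    """Relative reduction between the two column averages, in percent."""
    return 100.0 * (sum(baseline) - sum(ours)) / sum(baseline)

line_mean_gain = mean_of_gains(pl_vins_line, ours_line)
line_gain_of_means = gain_of_means(pl_vins_line, ours_line)
frame_mean_gain = mean_of_gains(pl_vins_frame, ours_frame)
frame_gain_of_means = gain_of_means(pl_vins_frame, ours_frame)
```

The mean of the per-sequence gains reproduces the reported 4.16% and 2.36%, whereas the reduction between the column averages is slightly higher (roughly 4.3% and 2.4%). The headline percentages are therefore best read as averages of the per-sequence gains.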

5.4. 3D Mapping and Reconstruction Accuracy

This section evaluates the spatial structure reconstruction capability of the proposed algorithm in complex scenes through 3D mapping experiments. The V1_03_difficult and MH_03_medium scenarios from the EuRoC dataset, which exhibit rich structural features, are selected. Figure 12 presents the mapping results of the method on these sequences. The visualization results demonstrate that the proposed method can effectively reconstruct the main structural outlines of indoor scenes. In areas with significant illumination changes or sparse textures, the proposed method can extract key structural features, generating coherent and structurally complete maps.

6. Conclusions

To address the insufficient utilization of geometric information from line features in VIO, this paper proposes a tightly coupled VIO system based on geometrically optimized line features. The main contributions include a complete front-end geometric optimization pipeline for line features. This pipeline first adopts a length-threshold-based filtering strategy to remove redundant and unstable short segments from images, and further integrates the proposed geometric-consistency-based merging mechanism, endpoint-distance-based verification mechanism, and epipolar-constraint-based triangulation method for optimized line features, thereby improving the structural integrity of line features. On this basis, the optimized line features are integrated with point features, IMU pre-integration, and marginalization priors into a sliding-window optimization framework, constructing a robust and accurate point-line fused state estimation system. Experimental results show that the proposed method reduces the APE by 17.57%, 9.88%, and 6.65% compared to VINS-Mono, PL-VINS, and EPLF-VINS, respectively. Additionally, compared to PL-VINS, it reduces the line feature processing time by 4.16% and the average per-frame processing time by 2.36%, validating its effectiveness in trajectory estimation accuracy, processing efficiency, and 3D scene reconstruction. Future work will systematically analyze matching errors and outlier sensitivity of line features, and concentrate on integrating the method with deep learning-based front ends to improve pose estimation performance in complex environments.

Author Contributions

Conceptualization, Y.Y. and Y.C.; methodology, Y.Y.; software, Y.Y. and Z.K.; validation, Y.Y. and Y.C.; formal analysis, Y.Y.; investigation, Y.Y., Z.K. and J.L.; resources, Y.Y. and B.L.; data curation, Y.Y. and Y.C.; writing—original draft preparation, Y.Y.; writing—review and editing, Y.Y., J.L. and Y.C.; visualization, Y.Y. and Y.C.; supervision, B.L.; project administration, Y.Y. and B.L.; funding acquisition, B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China under grants 61973234 and 62203326, and in part by the Tianjin Natural Science Foundation under grant 20JCYBJC00180.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code presented in this study is available upon request from the corresponding author.

Acknowledgments

The authors would like to thank Tiangong University for the technical support and all members of our team for their contributions to the visual-inertial odometry experiments. The authors acknowledge the anonymous reviewers for their helpful comments on the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhang, H.; Du, L.; Bao, S.; Yuan, J.; Ma, S. LVIO-Fusion: Tightly-coupled LiDAR-visual-inertial odometry and mapping in degenerate environments. IEEE Robot. Autom. Lett. 2024, 9, 3783–3790.
2. Kazerouni, I.A.; Fitzgerald, L.; Dooly, G.; Toal, D. A survey of state-of-the-art on visual SLAM. Expert Syst. Appl. 2022, 205, 117734.
3. Libero, Y.; Klein, I. Augmented virtual filter for multiple IMU navigation. IEEE Trans. Instrum. Meas. 2024, 73, 1–12.
4. Huang, G. Visual-Inertial Navigation: A concise review. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 9572–9582.
5. Qin, T.; Li, P.; Shen, S. VINS-Mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020.
6. Oishi, S.; Koide, K.; Yokozuka, M.; Banno, A. L-C*: Visual-inertial loose coupling for resilient and lightweight direct visual localization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 4033–4039.
7. Yang, Z.; Shen, S. Monocular visual-inertial state estimation with online initialization and camera-IMU extrinsic calibration. IEEE Trans. Autom. Sci. Eng. 2017, 14, 39–51.
8. Bescos, B.; Campos, C.; Tardós, J.D.; Neira, J. DynaSLAM II: Tightly-coupled multi-object tracking and SLAM. IEEE Robot. Autom. Lett. 2021, 6, 5191–5198.
9. Li, T.; Pei, L.; Xiang, Y.; Yu, W.; Truong, T.K. P^3-VINS: Tightly-coupled PPP/INS/visual SLAM based on optimization approach. IEEE Robot. Autom. Lett. 2022, 7, 7021–7027.
10. Lin, S.; Zhang, X.; Liu, Y.; Wang, H.; Zhang, X.; Zhuang, Y. FLM PL-VIO: A robust monocular point-line visual-inertial odometry based on fast line matching. IEEE Trans. Ind. Electron. 2024, 71, 16026–16036.
11. Zeng, D.; Liu, X.; Huang, K.; Liu, J. EPL-VINS: Efficient point-line fusion visual-inertial SLAM with LK-RG line tracking method and 2-DoF line optimization. IEEE Robot. Autom. Lett. 2024, 9, 5911–5918.
12. von Stumberg, L.; Cremers, D. DM-VIO: Delayed marginalization visual-inertial odometry. IEEE Robot. Autom. Lett. 2022, 7, 1408–1415.
13. Xu, B.; Li, X.; Wang, J.; Yuen, C.; Li, J. PVI-DSO: Leveraging planar regularities for direct sparse visual-inertial odometry. IEEE Sens. J. 2023, 23, 17415–17425.
14. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 2017, 33, 1255–1262.
15. Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 2009, 81, 155–166.
16. Pumarola, A.; Vakhitov, A.; Agudo, A.; Sanfeliu, A.; Moreno-Noguer, F. PL-SLAM: Real-time monocular visual SLAM with points and lines. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4503–4508.
17. Gomez-Ojeda, R.; Moreno, F.-A.; Zuñiga-Noël, D.; Scaramuzza, D.; Gonzalez-Jimenez, J. PL-SLAM: A stereo SLAM system through the combination of points and line segments. IEEE Trans. Robot. 2019, 35, 734–746.
18. He, Y.; Zhao, J.; Guo, Y.; He, W.; Yuan, K. PL-VIO: Tightly-coupled monocular visual-inertial odometry using point and line features. Sensors 2018, 18, 1159.
19. Fu, Q.; Wang, J.; Yu, H.; Ali, I.; Guo, F.; He, Y.; Zhang, H. PL-VINS: Real-time monocular visual-inertial SLAM with point and line features. arXiv 2020, arXiv:2009.07462.
20. Xu, L.; Yin, H.; Shi, T.; Jiang, D.; Huang, B. EPLF-VINS: Real-time monocular visual-inertial SLAM with efficient point-line flow features. IEEE Robot. Autom. Lett. 2022, 8, 752–759.
21. Heo, S.; Jung, J.H.; Park, C.G. Consistent EKF-based visual-inertial navigation using points and lines. IEEE Sens. J. 2018, 18, 7638–7649.
22. Chen, Z.; Miao, Z.; Liu, M.; Wu, C.; Wang, Y. A fast and accurate visual inertial odometry using hybrid point-line features. IEEE Robot. Autom. Lett. 2024, 9, 11345–11352.
23. Wei, H.; Tang, F.; Xu, Z.; Zhang, C.; Wu, Y. A point-line VIO system with novel feature hybrids and with novel line predicting-matching. IEEE Robot. Autom. Lett. 2021, 6, 8681–8688.
24. Li, W.; Cai, H.; Zhao, S.; Liu, Y.; Liu, C. A fast vision-inertial odometer based on line midpoint descriptor. Int. J. Autom. Comput. 2021, 18, 667–679.
25. Luo, D.; Zhuang, Y.; Wang, S. Hybrid sparse monocular visual odometry with online photometric calibration. Int. J. Robot. Res. 2022, 41, 993–1021.
26. Kuang, Z.; Wei, W.; Yan, Y.; Li, J.; Lu, G.; Peng, Y.; Li, J.; Shang, W. A real-time and robust monocular visual inertial SLAM system based on point and line features for mobile robots of smart cities toward 6G. IEEE Open J. Commun. Soc. 2022, 3, 1950–1962.
27. Liu, Z.; Shi, D.; Li, R.; Qin, W.; Zhang, Y.; Ren, X. PLC-VIO: Visual-inertial odometry based on point-line constraints. IEEE Trans. Autom. Sci. Eng. 2022, 19, 1880–1897.
28. Liu, Y.; Xiong, R.; Wang, Y.; Huang, H.; Xie, X.; Liu, X.; Zhang, G. Stereo visual-inertial odometry with multiple Kalman filters ensemble. IEEE Trans. Ind. Electron. 2016, 63, 6205–6216.
29. Seiskari, O.; Rantalankila, P.; Kannala, J.; Ylilammi, J.; Rahtu, E.; Solin, A. HybVIO: Pushing the limits of real time visual-inertial odometry. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2022; pp. 701–710.
30. Mourikis, A.I.; Roumeliotis, S.I. A multi-state constraint Kalman filter for vision-aided inertial navigation. In Proceedings of the 2007 IEEE International Conference on Robotics and Automation (ICRA), Rome, Italy, 10–14 April 2007; pp. 3565–3572.
31. Eckenhoff, K.; Geneva, P.; Huang, G. Closed-form preintegration methods for graph based visual-inertial navigation. Int. J. Robot. Res. 2019, 38, 563–586.
32. Leutenegger, S.; Lynen, S.; Bosse, M.; Siegwart, R.; Furgale, P. Keyframe-based visual-inertial odometry using nonlinear optimization. Int. J. Robot. Res. 2014, 34, 314–334.
33. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890.
34. Bartoli, A.; Sturm, P. Structure-from-motion using lines: Representation, triangulation, and bundle adjustment. Comput. Vis. Image Underst. 2005, 100, 416–441.
35. Agarwal, S.; Mierle, K. Ceres Solver. Available online: http://ceres-solver.org (accessed on 9 April 2018).
36. Burri, M.; Nikolic, J.; Siegwart, R. The EuRoC micro aerial vehicle datasets. Int. J. Robot. Res. 2016, 35, 1157–1163.

Figure 1. The overview of the proposed point-line VIO framework.

Figure 2. Illustration of the line segment merging process. (a) Short segment merges into long segment; (b) end-to-end merging of two segments; (c) endpoint of short segment embeds into long segment.

Figure 3. Schematic of the triangulation process for geometrically merged line segments.

Figure 4. Illustration of line reprojection residual.

Figure 5. Experimental snapshots of the proposed method on nine EuRoC sequences.

Figure 6. Comparison of line feature extraction between the proposed approach (left) and PL-VINS (right) in highly dynamic sequences.

Figure 7. Comparison of APE evolution over time for the proposed method (top), EPLF-VINS (middle) and PL-VINS (bottom) on (a) MH_04_difficult and (b) V2_03_difficult sequences.

Figure 8. APE heatmaps comparing the proposed method and EPLF-VINS on V2_01_easy sequence: (a) the proposed method; (b) EPLF-VINS.

Figure 9. APE heatmaps comparing the proposed method and EPLF-VINS on V2_02_medium sequence: (a) the proposed method; (b) EPLF-VINS.

Figure 10. Two-dimensional trajectory comparison on (a) MH_05_difficult and (b) V2_03_difficult sequences.

Figure 11. Temporal evolution of position estimation on the MH_05_difficult and V2_03_difficult sequences.

Figure 12. Three-dimensional reconstruction results on the V1_03_difficult and MH_03_medium sequences.

Table 1. Comparison of line feature processing pipelines in point-line VIO/SLAM systems.

Method	Line Feature Processing Strategy
PL-SLAM [16]	LSD + LBD + basic triangulation
PL-VINS [19]	Modified LSD + LBD + basic triangulation
EPLF-VINS [20]	Modified EDLines + Line optical flow tracking + endpoint-independent residual model
Ours	Modified LSD + length filtering + geometric merging + endpoint verification + LBD + optimized epipolar triangulation

Table 2. APE RMSE (m) results on the EuRoC Dataset.

Sequence	VINS-Mono	PL-VINS	EPLF-VINS	Ours
MH_01_easy	0.1920	0.2296	0.1910	0.2301
MH_02_easy	0.1676	0.1779	0.1670	0.1778
MH_03_medium	0.2181	0.2141	0.2147	0.2138
MH_04_difficult	0.3695	0.3045	0.2449	0.1990
MH_05_difficult	0.3366	0.2976	0.3317	0.2814
V1_01_easy	0.0788	0.0748	0.0715	0.0750
V1_02_medium	0.0974	0.0921	0.0916	0.0919
V1_03_difficult	0.1893	0.1451	0.1472	0.1445
V2_01_easy	0.0942	0.0924	0.0921	0.0852
V2_02_medium	0.1349	0.1494	0.1226	0.1155
V2_03_difficult	0.3129	0.2269	0.2605	0.1923
Average	0.1992	0.1822	0.1759	0.1642

In the original table, the lowest error for each sequence is highlighted in bold.

Table 3. Line feature processing time (ms) comparison on the EuRoC Dataset.

Sequence	PL-VINS	Ours	Gain
MH_01_easy	16.85	16.48	2.20%
MH_02_easy	16.28	15.96	1.97%
MH_03_medium	15.39	15.07	2.08%
MH_04_difficult	14.42	12.05	16.44%
MH_05_difficult	15.17	14.24	6.13%
V1_01_easy	13.73	13.26	3.42%
V1_02_medium	12.62	12.04	4.60%
V1_03_difficult	10.81	10.42	3.61%
V2_01_easy	11.15	11.01	1.26%
V2_02_medium	11.47	11.22	2.18%
V2_03_difficult	10.06	9.87	1.89%
Average	13.45	12.87	4.16%

Table 4. Average per-frame processing time (ms) of different methods on the EuRoC dataset.

Sequence	PL-VINS	Ours	Gain
MH_01_easy	68.14	67.65	0.72%
MH_02_easy	67.88	66.20	2.47%
MH_03_medium	64.92	64.53	0.60%
MH_04_difficult	68.32	64.24	5.97%
MH_05_difficult	66.89	65.05	2.75%
V1_01_easy	56.74	56.18	0.99%
V1_02_medium	54.47	54.09	0.70%
V1_03_difficult	53.05	49.52	6.65%
V2_01_easy	54.11	53.56	1.02%
V2_02_medium	53.97	53.41	1.04%
V2_03_difficult	53.36	51.71	3.09%
Average	60.17	58.74	2.36%

