GLP-VO: A Hybrid Visual Odometry Framework for Low-Altitude UAV Imaging in Complex Urban Environments

Xu, Yuxuan; Jiang, Bo; Huang, Longyang; Qu, Ruokun; Wang, Zhiyuan

doi:10.3390/drones10050329


by Yuxuan Xu *, Bo Jiang, Longyang Huang, Ruokun Qu and Zhiyuan Wang

College of Air Traffic Management, Civil Aviation Flight University of China, Guanghan 618307, China

* Author to whom correspondence should be addressed.

Drones 2026, 10(5), 329; https://doi.org/10.3390/drones10050329

Submission received: 21 December 2025 / Revised: 4 April 2026 / Accepted: 6 April 2026 / Published: 28 April 2026

(This article belongs to the Special Issue Autonomous Drone Navigation in GPS-Denied Environments)


Highlights

What are the main findings?

A hybrid visual odometry framework (GLP-VO) is proposed, integrating line features with point features to construct a robust geometric constraint model.
An adaptive weighting strategy is designed to dynamically adjust the contributions of point and line features based on real-time scene texture and structural complexity.

What are the implications of the main findings?

The proposed method effectively addresses UAV positioning drift in low-texture or dynamic urban environments, significantly outperforming traditional point-based approaches.
The method provides a low-drift, highly robust autonomous navigation solution for low-altitude urban flight that does not rely on pre-training or GPU-intensive learning modules.

Abstract

Accurate and robust UAV navigation in complex urban environments remains challenging due to dense buildings, dynamic obstacles, and unreliable GPS signals. To address this issue, this paper proposes GLP-VO, a hybrid visual odometry framework that combines geometric structure features with point features. An adaptive weighting strategy is introduced to balance the contributions of different feature types according to matching quality and scene complexity, while geometric constraints are incorporated into the optimization process to improve pose estimation accuracy and stability. Experiments on the TUM RGB-D dataset and real UAV flight sequences verify the effectiveness of the proposed method. GLP-VO achieves the best ATE results in five of the ten evaluated TUM sequences, including 0.91 cm on f1_xyz and 0.62 cm on f3_str_tex_far, and remains competitive on challenging sequences such as f2_360_kidnap with an ATE of 2.26 cm. In the ablation study, the full model reduces ATE and RPE by up to 44.9% and 43.1%, respectively. Moreover, the proposed system runs at approximately 35 FPS on the desktop platform and 11 FPS on the onboard platform, demonstrating a favorable balance between accuracy, robustness, and real-time performance.

Keywords:

UAV navigation; visual odometry; low-altitude environments; hybrid geometric–keypoint integration

1. Introduction

With the rapid development of the low-altitude economy, urban air mobility (UAM) [1] has become an important part of modern urban transportation systems. The growth of this sector has created numerous application opportunities for unmanned aerial vehicles (UAVs), including logistics, urban management, infrastructure monitoring, and agriculture [2]. However, fully utilizing UAVs in these complex low-altitude urban environments, characterized by dense buildings, imposes high demands on navigation accuracy and flight stability.

In urban low-altitude environments, UAVs encounter a unique set of challenges. The dense concentration of buildings, narrow streets [3,4,5], and reflective surfaces can significantly degrade the reliability and accuracy of GPS signals, leading to reduced positioning precision [1,6,7]. Additionally, the complexity of urban environments renders traditional Global Navigation Satellite Systems (GNSS) unsuitable for high-precision localization. Consequently, there is an increasing demand for alternative localization methods that can operate reliably under these challenging conditions.

Visual odometry (VO) is crucial for UAV autonomous navigation, as it enables trajectory estimation using onboard cameras. In urban environments characterized by dense architecture and dynamic elements, traditional VO methods that rely exclusively on point features (e.g., ORB or SIFT [8]) may struggle, because such features can become sparse or unreliable under varying illumination and occlusions.

Traditionally, most visual odometry (VO) systems have relied on point features, as they are easy to extract, track, and process for motion estimation. Recently, various visual-inertial odometry (VIO) [9,10,11] and visual simultaneous localization and mapping (vSLAM) [12,13,14] systems have begun integrating line features to enhance accuracy and robustness, particularly in challenging environments. Line segments provide more structural information about the environment compared with point features, making them particularly valuable in structured urban settings where points may be sparse. However, extracting and matching line features can be computationally expensive and time-consuming, potentially hindering real-time performance. Additionally, many existing methods often overlook the interrelationships and constraints between point and line features, limiting the potential for improved motion estimation accuracy. To address these challenges, hybrid systems combining both point and line features are being developed, balancing the strengths of each feature type while ensuring efficient processing.

This study introduces a novel visual odometry (VO) framework specifically designed for unmanned aerial vehicles (UAVs) operating in low-altitude urban environments. The primary objective is to enhance the accuracy and robustness of UAV localization by integrating geometric structure features extracted using the Line Edge Detector (LED) algorithm [15] with traditional point features. The key contributions of this research are outlined as follows:

(1) We introduce a novel mechanism that leverages both geometric structural cues and distinctive local features to provide a more comprehensive spatial representation of urban scenes. This integration balances the robustness of structural information with the efficiency of keypoint-based methods.

(2) We propose an enhanced fusion framework that unifies geometric and local features through descriptor relationships and geometric constraints. A self-adjusting weighting mechanism is incorporated to dynamically adjust each feature’s weight based on its reliability and the complexity of the surrounding environment.

(3) We develop a unified optimization scheme that accommodates both structural and local feature constraints. By minimizing the respective reprojection errors within a single backend solver, we enhance the accuracy and robustness of pose estimation.

The rest of this article is organized as follows. Section 2 reviews related work on point and geometric-structure-based VO and feature fusion. Section 3 details the methodology, including the system framework, feature extraction, matching, fusion, and optimization. Finally, Section 4 presents experimental results that validate the effectiveness of the hybrid VO system.

2. Related Work

2.1. Point-Based Visual Odometry Algorithm

Point-based visual odometry (VO) represents a foundational approach to motion estimation in robotic and autonomous systems. By tracking point features across consecutive frames, VO systems can determine a robot’s trajectory without external localization sources such as GPS. Feature extraction methods, including ORB (Oriented FAST and Rotated BRIEF), SIFT (Scale-Invariant Feature Transform), and SURF (Speeded-Up Robust Features), are widely used for their robustness to scale, rotation, and illumination changes. Among these, ORB-SLAM [8] has emerged as a well-established system, providing real-time monocular, stereo, and RGB-D solutions for large-scale environments [16,17]. Despite their popularity, point-based VO systems often struggle in low-texture or dynamic environments, such as urban canyons, open fields, or indoor settings, where feature matching is difficult due to insufficient keypoints. This results in tracking errors and drift, which limit the system’s overall accuracy [8,11].

To address these challenges, techniques such as RANSAC (Random Sample Consensus) [18] have been implemented to robustly handle outliers and improve feature matching accuracy, especially in dynamic or cluttered environments [19]. Additionally, bundle adjustment and loop closure detection methods have been integrated into point-based VO systems to refine pose estimates and mitigate drift over time. Bundle adjustment minimizes reprojection errors across keyframes, while loop closure helps correct accumulated drift by revisiting previously mapped areas [20,21]. These methods have significantly enhanced the robustness of point-based VO, making it suitable for large-scale applications, such as autonomous vehicles and UAVs.

However, point-based methods continue to encounter difficulties in environments with sparse texture. This limitation has motivated research into integrating point features with other visual information, such as line features, to develop more reliable and accurate VO systems.

2.2. Multi-Feature Fusion VO/vSLAM Algorithm

To overcome the limitations of point-based methods, multi-feature fusion approaches have become increasingly popular in VO and SLAM research. By integrating various feature types, such as point and line features and inertial measurements, these systems achieve greater robustness, accuracy, and scalability, particularly in complex environments such as urban areas. The fusion of point and line features represents a significant advancement, in which point features provide texture information and line features capture a scene’s geometric structure [14,22]. Line-based VO methods, such as LSD-SLAM [23], leverage line segments to improve localization accuracy, particularly in structured environments such as cityscapes, where walls, roads, and buildings provide abundant geometric features. The integration of point and line features has demonstrated significant improvements in drift mitigation and tracking performance in feature-sparse environments, such as tunnels or large open spaces [24].

Five common line representations are summarized in [25]: the closest point (CP) and direction representation, the double image line representation, the Denavit–Hartenberg representation, the two-point or two-plane representation, and the Plücker coordinate representation. However, some of these representations have disadvantages when used on their own. Consequently, in some point and line fusion vSLAM algorithms, such as those proposed in [26,27], an orthonormal representation that is more suitable for optimization is used in combination with Plücker coordinates, thereby improving algorithmic performance.

In addition, visual-inertial odometry (VIO) improves VO performance by combining visual data with measurements from an inertial measurement unit (IMU). VIO algorithms can be categorized into two modes: loosely coupled and tightly coupled [28]. In loosely coupled systems, such as those proposed in [29,30], visual and inertial data are processed separately, and their results are then fused using techniques such as the extended Kalman filter (EKF). In contrast, tightly coupled systems, such as those proposed in [31,32,33], directly integrate visual and inertial data in a unified optimization process. This approach has been demonstrated to improve positioning accuracy and robustness in dynamic environments. These tightly coupled VIO systems [28,34] are particularly well-suited for applications such as UAV navigation in GPS-denied regions, where precise localization is critical.

Beyond visual odometry and SLAM algorithms themselves, recent studies have examined UAV autonomy from a broader system perspective. Mishra and Palanisamy described an end-to-end autonomy framework for UAVs based on sensing, perception, planning, and controls, while also addressing multi-agent fleet operations and validation requirements [35]. Bakirci further considered real-world UAV deployment constraints, including communication range, onboard computing limits, routing efficiency, and energy-aware task division in a swarm UAV system [36]. These studies indicate that visual odometry should not be viewed as an isolated algorithmic component, but as part of a larger UAV autonomy stack operating under practical sensing, computation, and communication constraints.

3. Materials and Methods

Our proposed visual odometry (VO) system targets low-altitude urban environments and achieves robust, high-precision pose estimation by unifying local and structural cues. As shown in Figure 1, the system is divided into a frontend and a backend. On the frontend, we employ a dual-feature-extraction scheme that simultaneously detects corner keypoints and salient line-based geometric structural features. This hybrid feature set captures both subtle texture details and broader geometric structures in urban buildings. Next, a descriptor-based matching process combines appearance and geometric constraints to ensure reliable data association under varying lighting and occlusion conditions.

To balance the contributions of local and structural information, we employ a self-tuning weighting strategy that dynamically adjusts the impact of each feature based on contextual factors, such as scene complexity and motion dynamics. Subsequently, the backend progressively refines the camera pose by jointly minimizing the reprojection error of both feature types within a robust optimization framework. By leveraging multi-threading, GPU acceleration, and efficient data-flow design, the entire pipeline can run in real-time while ensuring the system’s adaptability to challenging urban scenes.

3.1. Feature Extraction and Description

3.1.1. Representation of Geometric Structural Features

Geometric structural features are primarily characterized by line segments and their positional relationships in the scene. In low-altitude urban environments, intersecting lines, parallel lines, and the polygons they form are key geometric features that reflect object edges, angles, and spatial relationships, providing crucial information for visual odometry, especially in dynamic scenes.

To represent these features effectively, we employ an enhanced Plücker coordinate method combined with an orthonormal representation for optimization. Plücker coordinates [1] are widely used for representing 3D lines. Although a 3D line has four degrees of freedom, its Plücker representation is expressed in six dimensions through two components: the normal vector $n$ and the direction vector $v$. Specifically, the Plücker coordinate representation of a line $L$ is given by

$$L = \begin{pmatrix} n \\ v \end{pmatrix} \in \mathbb{R}^{6}, \tag{1}$$

where $n \in \mathbb{R}^{3}$ is the normal vector, which defines the plane containing the line, and $v \in \mathbb{R}^{3}$ is the direction vector that describes the line’s orientation. To reduce redundancy, Plücker coordinates satisfy the constraint

$$n^{T} v = 0. \tag{2}$$

This constraint enforces the orthogonality of the normal and direction vectors, removing one redundant degree of freedom and keeping the coordinate representation consistent.
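As a minimal sketch of Equations (1) and (2), the snippet below builds Plücker coordinates for the line through two 3D points and checks the orthogonality constraint numerically. The helper name `plucker_from_points` and the sign convention $n = p_1 \times v$ are assumptions made for illustration; they are not taken from the GLP-VO implementation.

```python
import numpy as np

def plucker_from_points(p1, p2):
    """Return (n, v) for the 3D line through points p1 and p2.

    Assumed convention: v = p2 - p1 and n = p1 x v (the moment of the line).
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    v = p2 - p1              # direction vector of the line
    n = np.cross(p1, v)      # normal of the plane through the line and the origin
    return n, v

# Example line parallel to the y-axis
n, v = plucker_from_points([1.0, 0.0, 2.0], [1.0, 3.0, 2.0])
L = np.concatenate([n, v])   # six-dimensional vector of Equation (1)
residual = float(n @ v)      # Equation (2): vanishes for valid Plücker coordinates
```

Because $n$ is a cross product involving $v$, the constraint holds by construction; in an optimizer that updates the six entries independently it would have to be enforced explicitly, which motivates the orthonormal representation used later.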

To better capture the complex geometric relationships between intersecting lines and other structural features in the scene, we extended the traditional Plücker coordinate representation. As illustrated in Figure 2, in addition to the basic normal and direction vectors, the augmented representation incorporates angular information between intersecting line segments to encode their spatial relationships as geometric features. By fusing multiple line segments, we can represent polygons, resulting in higher-level scene shape descriptors. Furthermore, the normal vector of each polygon indicates its overall orientation, while the geometric constraints among these features ensure consistency throughout the environment and maintain structural coherence.

Although complete closed polygons may become partially invisible under occlusion or missing-edge conditions, this issue is alleviated in our dataset because the images are mainly captured from a bird’s-eye-view UAV perspective. Under this observation mode, building outlines, road boundaries, and scene contours remain relatively clear in most frames, making local polygon-like structures more likely to be observed than in ground-view imagery. In addition, the proposed method does not require strict polygon closure; when some boundaries are incomplete, it still relies on line-level and angle-based geometric constraints for matching and pose estimation.

The enhanced geometric representation of each line $L_{i}$ can be written as

$$L_{i} = \left( n_{i}^{T}, \; v_{i}^{T}, \; \theta_{i}, \; \hat{n}_{i}^{T}, \; A_{i} \right)^{T}, \tag{3}$$

where $n_{i}$ and $v_{i}$ are the normal and direction vectors from the original Plücker coordinate representation, $\theta_{i}$ is the angle between the line $L_{i}$ and its neighboring lines, $\hat{n}_{i}$ is the normal vector of the polygon formed by the lines, and $A_{i}$ is the area of the polygon formed by the intersecting lines.

By extending the Plücker coordinate representation, we capture richer geometric features, including their spatial relationships and higher-order structures such as polygons. However, Plücker coordinates include redundant degrees of freedom, making them suboptimal for optimization tasks. Consequently, we convert them into an orthonormal representation that eliminates redundancy while preserving the underlying geometric features. As conventional QR decomposition [2] does not fully account for the geometric constraints between intersecting lines, we introduce geometric constraint optimization to preserve the angles between the lines during the transformation.

This optimization ensures that the geometric constraints between lines are accurately maintained. The optimization objective can be expressed as

$$\begin{pmatrix} n \\ v \end{pmatrix} = U \begin{pmatrix} \omega_{1} \\ \omega_{2} \end{pmatrix}, \tag{4}$$

where $U \in SO(3)$ is the rotation matrix that represents the geometric relationships between the lines after optimization, and $\omega_{1}$ and $\omega_{2}$ are the parameters that describe the scale and geometric structure of the lines.

This method ensures that the transformed representation is not only free of redundancy but also adheres to geometric constraints, including inter-line angles and the spatial relationships between structures, providing a more robust and accurate representation of geometric features.
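For reference, the conventional orthonormal representation that this section builds on can be sketched as follows. This is the standard textbook construction ($U \in SO(3)$ together with a $2 \times 2$ rotation $W$ holding $\omega_{1}$ and $\omega_{2}$), not the angle-preserving variant proposed here; the function names are hypothetical, and the sketch assumes a valid Plücker pair with $n \neq 0$ and $n^{T} v = 0$.

```python
import numpy as np

def plucker_to_orthonormal(n, v):
    """Standard orthonormal representation (U in SO(3), W in SO(2)) of a Plücker line.

    Assumes n and v are nonzero and orthogonal.
    """
    n = np.asarray(n, dtype=float)
    v = np.asarray(v, dtype=float)
    norm_n, norm_v = np.linalg.norm(n), np.linalg.norm(v)
    c = np.cross(n, v)
    # Columns: unit normal, unit direction, and their unit cross product
    U = np.column_stack([n / norm_n, v / norm_v, c / np.linalg.norm(c)])
    s = np.hypot(norm_n, norm_v)
    # omega_1 = ||n|| / s and omega_2 = ||v|| / s encode the relative scale
    W = np.array([[norm_n, -norm_v], [norm_v, norm_n]]) / s
    return U, W

def orthonormal_to_plucker(U, W):
    """Recover (n, v) up to a common positive scale factor."""
    w1, w2 = W[0, 0], W[1, 0]
    return w1 * U[:, 0], w2 * U[:, 1]

n = np.array([-6.0, 0.0, 3.0])
v = np.array([0.0, 3.0, 0.0])
U, W = plucker_to_orthonormal(n, v)
n_rec, v_rec = orthonormal_to_plucker(U, W)
scale = np.hypot(np.linalg.norm(n), np.linalg.norm(v))
```

The pair $(U, W)$ has four degrees of freedom in total (three for $U$, one for $W$), which matches the four degrees of freedom of a 3D line and is why this form is convenient for iterative optimization.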

In the enhanced geometric representation, not all additional terms play the same role in practice. The angle between neighboring or intersecting lines provides the primary structural constraint, since it directly captures stable local geometric relationships that are useful for matching and pose estimation. The polygon normal is introduced as a secondary cue to describe the dominant orientation of grouped line structures, while the polygon area is used only as an auxiliary descriptor to improve the distinguishability of more complex geometric configurations. Therefore, the practical benefit of the enhanced representation mainly comes from angle-based local structure encoding, with the normal and area terms serving complementary roles.

3.1.2. Feature Descriptor Generation

The extraction of geometric constraints involves deriving relationships between geometric features directly from the orthonormal representation, thereby simplifying the representation of lines and their spatial relationships, as shown in Figure 3. Based on these extracted constraints, we generate geometric feature descriptors that provide a compact yet robust representation of the features necessary for geometric matching and optimization. The key geometric constraints and their corresponding descriptors are discussed below.

(1) Angle constraint: The angle $\theta_{ij}$ between two lines $L_{i}$ and $L_{j}$ is computed from the cosine of the angle between their direction vectors $v_{i}$ and $v_{j}$:

$$\cos(\theta_{ij}) = \frac{v_{i}^{T} v_{j}}{\| v_{i} \| \, \| v_{j} \|}. \tag{5}$$

The angle descriptor is then derived from this angle, capturing the relative orientation between the two lines.

(2) Distance constraint: The spatial distance $l_{ij}$ between two lines is given by the Euclidean distance

$$l_{ij} = \| P_{i} - P_{j} \|, \tag{6}$$

where $P_{i}$ and $P_{j}$ are points on the lines $L_{i}$ and $L_{j}$, respectively. This distance descriptor represents the spatial proximity between two features. By maintaining this distance constraint during optimization, we ensure that the geometric relationships between features are preserved across different viewpoints.

(3) Geometric shape constraint: For more complex geometric structures, such as polygons formed by multiple lines, we derive a geometric shape descriptor from the shape’s area or perimeter. For instance, for a planar polygon with ordered vertices $p_{1}, p_{2}, \dots, p_{m}$ (with $p_{m+1} = p_{1}$) and edges $v_{i} = p_{i+1} - p_{i}$, the area $A$ of the polygon can be calculated using

$$A = \frac{1}{2} \left\| \sum_{i = 1}^{m} p_{i} \times p_{i+1} \right\|, \tag{7}$$

whereas the perimeter is $\sum_{i = 1}^{m} \| v_{i} \|$. This descriptor is crucial for preserving the integrity of complex shapes, such as polygons, ensuring that their spatial properties remain consistent across different viewpoints.

(4) Normal vector descriptor: The normal vector descriptor is derived from the orthonormal representation and represents the orientation of a line or surface. The normal vector $n_{i}$ is calculated using

$$n_{i} = \frac{v_{i} \times P_{i}}{\| v_{i} \times P_{i} \|}, \tag{8}$$

where $v_{i}$ is the direction vector of the line and $P_{i}$ is a point on the line. This descriptor encodes the directional properties of the feature and is critical for comparing the orientations of different features.

Combining these extracted constraints yields the final geometric feature descriptor $D$, which is represented as

$$D = (n_{1}, n_{2}, \dots, n_{m}, \; \theta_{1}, \theta_{2}, \dots, \theta_{m}, \; l_{1}, l_{2}, \dots, l_{m}, \; A), \tag{9}$$

where $n_{i}$ is the normal vector of the lines or surfaces, $\theta_{i}$ represents the angles between lines, $l_{i}$ denotes the distances between corresponding lines, and $A$ represents the geometric shape descriptor (e.g., the area of a polygon). Therefore, for a single line ($m = 1$), the dimensionality of the descriptor is six: three entries for the normal vector and one each for the angle, distance, and shape terms.
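The constraints above can be assembled into a six-dimensional descriptor for a single line and one neighbor, as in the following minimal sketch. The function names are hypothetical; the angle, distance, and normal follow Equations (5), (6), and (8), while the polygon area is computed from ordered vertices with the vector shoelace formula, which is a choice made for this sketch.

```python
import numpy as np

def line_angle(v_i, v_j):
    """Angle between two direction vectors, following Equation (5)."""
    c = (v_i @ v_j) / (np.linalg.norm(v_i) * np.linalg.norm(v_j))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))  # clip guards against rounding

def line_distance(P_i, P_j):
    """Euclidean distance between points on the two lines, Equation (6)."""
    return float(np.linalg.norm(P_i - P_j))

def unit_normal(v_i, P_i):
    """Unit normal vector, following Equation (8)."""
    c = np.cross(v_i, P_i)
    return c / np.linalg.norm(c)

def polygon_area(vertices):
    """Area of a planar polygon from ordered 3D vertices (vector shoelace formula)."""
    V = np.asarray(vertices, dtype=float)
    s = np.cross(V, np.roll(V, -1, axis=0)).sum(axis=0)
    return 0.5 * float(np.linalg.norm(s))

def line_descriptor(v_i, P_i, v_j, P_j, vertices):
    """Descriptor (n, theta, l, A) for one line and one neighboring line."""
    scalars = [line_angle(v_i, v_j), line_distance(P_i, P_j), polygon_area(vertices)]
    return np.concatenate([unit_normal(v_i, P_i), scalars])

# Two perpendicular edges of a unit square lying in the plane z = 1
square = [[0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1]]
v_i, P_i = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
v_j, P_j = np.array([0.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])
D = line_descriptor(v_i, P_i, v_j, P_j, square)
```

For this example the descriptor is the normal $(0, -1, 0)$ followed by an angle of $\pi/2$, a distance of 1, and an area of 1.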

These descriptors provide a compact yet comprehensive representation of geometric features, encoding basic spatial relationships, orientation and shape information. Once generated, these geometric feature descriptors must undergo an optimization process to minimize reprojection errors between descriptors across different frames or viewpoints and to ensure that the spatial relationships between features remain consistent.

The objective of the optimization is to minimize the error between descriptors, typically achieved by minimizing the weighted sum of squared differences between the descriptors from different viewpoints [6]. The optimization process can be formalized as

$$\min \sum_{i, j} \omega_{ij} \, \| D_{i} - D_{j} \|^{2}, \tag{10}$$

where $\omega_{ij}$ is a weight factor, $D_{i}$ and $D_{j}$ are the descriptors from different viewpoints, and the summation is taken over all pairs of corresponding features.

This optimization process ensures that geometric features are well-aligned and consistent across different frames, directly improving the performance of visual odometry and SLAM systems.
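The cost in Equation (10) can be evaluated as in the short sketch below, assuming the corresponding descriptors have already been paired row by row. The function name `descriptor_cost` and the array layout are illustrative assumptions; the sketch only evaluates the objective and does not perform the minimization itself.

```python
import numpy as np

def descriptor_cost(D_a, D_b, weights):
    """Weighted sum of squared descriptor differences over matched pairs (Equation (10)).

    D_a, D_b: arrays of shape (num_pairs, dim); row k of D_a corresponds to row k of D_b.
    weights:  array of shape (num_pairs,) holding the weight factor of each pair.
    """
    diff = np.asarray(D_a, dtype=float) - np.asarray(D_b, dtype=float)
    per_pair = np.sum(diff ** 2, axis=1)          # squared Euclidean norm per pair
    return float(np.sum(np.asarray(weights, dtype=float) * per_pair))

cost = descriptor_cost([[0.0, 0.0], [1.0, 1.0]],
                       [[1.0, 0.0], [1.0, 3.0]],
                       [0.5, 0.25])
# 0.5 * 1 + 0.25 * 4 = 1.5
```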

3.2. Geometric Structure Feature Matching

Matching geometric structure features relies on the descriptors generated from the orthonormal representation. The purpose of geometric feature matching is to identify corresponding geometric structures across different viewpoints by evaluating spatial relationships, orientations, and geometric constraints. This process involves extracting geometric constraints from descriptors, computing similarities, and verifying these constraints to optimize the matching results and ensure robustness and accuracy.

3.2.1. Descriptor Preprocessing

Descriptor preprocessing is an essential step in preparing geometric feature descriptors for matching and optimization tasks. This stage involves normalizing descriptors to ensure consistency, reducing noise, and transforming them into a format suitable for subsequent matching.

First, descriptors are normalized to a consistent scale. For example, distance descriptors $l_{i}$, angle descriptors $\theta_{i}$, and shape descriptors $A_{i}$ are standardized by subtracting the mean and dividing by the standard deviation:

$$d_{i} = \frac{D_{i} - \mu_{D_{i}}}{\sigma_{D_{i}}}, \tag{11}$$

where $\mu_{D_{i}}$ and $\sigma_{D_{i}}$ represent the mean and standard deviation of the descriptor values. This step ensures that all descriptors are on the same scale and comparable across frames or viewpoints.

The descriptor is originally defined using the complete set of geometric terms, including normal-related orientation, angle, distance, and area-based information. However, in practical bird’s-eye-view UAV imagery, not every local feature group contains sufficiently complete higher-order structures such as polygon-like regions. As a result, high-order terms such as area may contribute limited discriminative information in part of the samples and may introduce redundancy.

For high-dimensional descriptors, Principal Component Analysis (PCA) [7] is employed to reduce the dimensionality while preserving the most significant components. PCA identifies the principal components of the descriptor matrix $X \in \mathbb{R}^{n \times m}$, which contains $n$ samples with $m$ dimensions each. PCA aims to identify the eigenvectors of the covariance matrix $\Sigma$, defined as

$$\Sigma = \frac{1}{n} X^{T} X, \tag{12}$$

where the columns of $X$ are zero-mean after the standardization in Equation (11). The eigenvectors corresponding to the largest eigenvalues form the principal components, which capture the most variance in the data. The reduced descriptor $D_{reduced}$ is subsequently obtained by projecting the original descriptors onto these principal components:

$$D_{reduced} = X V, \tag{13}$$

where $V$ denotes the matrix of principal eigenvectors, and $D_{reduced}$ denotes the matrix of descriptors with reduced dimensions. Although PCA helps remove redundancy and improve compactness, overly aggressive dimensionality reduction may suppress less frequent but still informative structural cues, especially those associated with higher-order geometric configurations. Therefore, PCA is used conservatively in the proposed framework, with the goal of reducing redundancy while preserving the dominant and potentially important geometric information required for robust matching.
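The preprocessing in Equations (11) to (13) can be sketched with NumPy as follows. The retained-variance threshold `var_keep` is a hypothetical parameter standing in for the "conservative" use of PCA described above; the text does not specify how many components are kept.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Column-wise z-score normalization, following Equation (11)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / np.maximum(sigma, eps)   # eps avoids division by zero

def pca_reduce(X, var_keep=0.95):
    """Project zero-mean descriptors onto the leading principal components."""
    n, m = X.shape
    cov = X.T @ X / n                          # Equation (12)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    r = min(int(np.searchsorted(ratio, var_keep)) + 1, m)
    V = eigvecs[:, :r]                         # matrix of principal eigenvectors
    return X @ V, V                            # Equation (13)

# Synthetic descriptors: six dimensions, but only three independent directions
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X_raw = np.hstack([base, base @ rng.normal(size=(3, 3))])
Z = standardize(X_raw)
D_reduced, V = pca_reduce(Z)
```

A high `var_keep` keeps more components and is therefore less likely to discard rare but informative structural cues, at the price of a larger descriptor.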

3.2.2. Similarity-Based Matching

To match geometric features, we first evaluate the similarity between descriptors $d_{i}$ and $d_{j}$ using the Euclidean distance and cosine similarity. Together, these measures capture the spatial and angular relationships between feature descriptors.

The Euclidean distance quantifies the spatial difference between two descriptors and is defined as

$$D_{Euclid}(i, j) = \| d_{i} - d_{j} \|, \tag{14}$$

where $d_{i}$ and $d_{j}$ represent the normalized descriptors for features $i$ and $j$.

The cosine similarity captures the directional alignment between two descriptors. It is particularly useful for comparing feature descriptors that represent directional properties, such as normal vectors or angles:

$$D_{cos}(i, j) = \frac{d_{i}^{T} d_{j}}{\| d_{i} \| \, \| d_{j} \|}, \tag{15}$$

where the denominator normalizes the descriptors so that only their orientation is compared.

To enhance matching robustness, we employ a k-Nearest Neighbor (k-NN) search algorithm [8]. For each feature descriptor $d_{i}$, we identify the $k$ most similar descriptors within the set of all candidate descriptors. This search is based on the similarity score computed earlier:

$$N_{i} = \{ d_{j} \mid D_{cos}(i, j) \text{ is among the } k \text{ largest values}, \; j = 1, \dots, M \}, \tag{16}$$

where $M$ is the number of candidate descriptors.

By selecting the closest descriptors, we ensure that the most similar features are matched, reducing the likelihood of incorrect matches and improving correspondence accuracy. This approach enables a more reliable matching process by focusing on the best candidates, minimizing the risk of outliers.
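A brute-force version of this similarity-based k-NN search can be sketched as follows. The function name `knn_cosine` is hypothetical, and the exhaustive comparison is chosen for clarity; a practical system could substitute an approximate or tree-based search.

```python
import numpy as np

def knn_cosine(query, candidates, k=2):
    """Return indices and scores of the k candidates most similar to each query.

    query:      array of shape (num_query, dim)
    candidates: array of shape (num_candidates, dim)
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sim = q @ c.T                              # cosine similarity, Equation (15)
    idx = np.argsort(-sim, axis=1)[:, :k]      # highest similarity first
    return idx, np.take_along_axis(sim, idx, axis=1)

query = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = np.array([[0.9, 0.1], [0.0, 2.0], [-1.0, 0.0]])
idx, scores = knn_cosine(query, candidates, k=2)
```

Here the first query is closest to candidate 0 and the second to candidate 1, while the opposite-pointing candidate 2 is never selected.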

3.2.3. Geometric Validation

To enhance the accuracy and robustness of line feature matching, geometric consistency constraints are enforced based on the geometric features of the matched lines, such as their angles and normal vectors. These constraints ensure that the matched lines preserve the expected geometric relationships, improving match reliability. Specifically, two primary geometric consistency conditions are enforced.

The first condition is the near-perfect alignment of the normal vectors $\hat{n}_{i}$ and $\hat{n}_{j}$ of two matched lines $L_{i}$ and $L_{j}$. This alignment is verified by evaluating the dot product between the unit normal vectors, ensuring they are nearly parallel or antiparallel:

$$\left| \hat{n}_{i}^{T} \hat{n}_{j} \right| \approx 1. \tag{17}$$

This condition ensures that the matched lines either lie in the same plane or have very similar orientations, which is essential for maintaining geometric coherence during matching.

The second condition is that the angular difference $\theta_{ij}$ between adjacent lines must also be consistent. Specifically, for the two lines $L_{i}$ and $L_{j}$, the difference between their angles with respect to their neighboring lines should remain within a reasonable threshold $\varepsilon$. This can be expressed as

$$| \theta_{i} - \theta_{j} | \leq \varepsilon, \tag{18}$$

where $\theta_{i}$ and $\theta_{j}$ represent the angles of the lines relative to their adjacent lines, and $\varepsilon$ represents a tolerance threshold for the angular difference. By enforcing this constraint, we preserve the geometric relationships between the lines, which is particularly important in complex scenes where significant geometric consistency is expected.

To enforce these geometric consistency constraints effectively, we integrate the RANSAC (Random Sample Consensus) algorithm into the matching process. RANSAC is used to estimate the geometric transformation that aligns the matched lines while accounting for the inherent presence of outliers. The RANSAC algorithm iteratively selects random subsets of candidate matches, estimates the transformation, and evaluates the consistency of the matches with the estimated model. This process ensures that only matches satisfying the geometric consistency constraints—such as the alignment of normal vectors and angular consistency—are retained. RANSAC filters out mismatches and outliers, retaining the inliers that conform to the geometric constraints. The results of this matching process for the geometric structure features are shown in Figure 4.
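The two consistency conditions in Equations (17) and (18) can be sketched as a simple per-match check, for example as an inlier test inside a RANSAC loop or as a pre-filter before it. The function name and both tolerance values are hypothetical; the text does not report the thresholds actually used, and the RANSAC model estimation itself is not shown.

```python
import numpy as np

def is_geometrically_consistent(n_i, n_j, theta_i, theta_j,
                                align_tol=0.05, angle_tol=np.deg2rad(5.0)):
    """Check normal alignment (Equation (17)) and angular consistency (Equation (18)).

    align_tol and angle_tol are illustrative tolerances, not values from the paper.
    """
    n_i = np.asarray(n_i, dtype=float)
    n_j = np.asarray(n_j, dtype=float)
    n_i = n_i / np.linalg.norm(n_i)
    n_j = n_j / np.linalg.norm(n_j)
    aligned = abs(float(n_i @ n_j)) >= 1.0 - align_tol   # parallel or antiparallel
    angle_ok = abs(theta_i - theta_j) <= angle_tol       # within the threshold
    return bool(aligned and angle_ok)

ok = is_geometrically_consistent([0, 0, 1], [0, 0, -1], 1.00, 1.02)
bad_normal = is_geometrically_consistent([0, 0, 1], [1, 0, 0], 1.00, 1.02)
bad_angle = is_geometrically_consistent([0, 0, 1], [0, 0, 1], 1.00, 1.20)
```

Using the absolute value of the dot product makes the check insensitive to the sign of the normal, which can flip between views.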

For comparison, Figure 5 presents a representative point-feature matching result between two consecutive UAV frames. Although point features can provide effective local correspondences around salient image regions such as roof corners and boundary intersections, their matching performance in bird’s-eye-view urban scenes still depends strongly on local appearance distinctiveness. In the presence of repetitive rooftop patterns, weak-texture areas, partial occlusions, or viewpoint disturbances, point-only matching may become ambiguous and cannot explicitly preserve higher-level geometric relationships such as parallelism, orthogonality, and polygonal structure. Therefore, while point features remain useful for short-term inter-frame association, they are insufficient on their own for robust visual odometry in complex urban UAV environments, which further motivates the introduction of geometric structure features and the subsequent feature fusion strategy.

3.3. Feature Fusion

The adaptive weighting strategy in the proposed framework is organized hierarchically. The primary component is the quality-based weighting, which directly reflects the reliability of point features and geometric structure features during matching. On this basis, a complexity-related adjustment is introduced only as a secondary heuristic to strengthen structural cues in scenes with richer geometric configurations. In addition, the online update mechanism is treated as an optional refinement for further adaptation, rather than a mandatory component of the core weighting framework.

3.3.1. Adaptive Weighting Mechanism for Feature Fusion

To effectively combine point and geometric structure features, we introduce an adaptive weighting mechanism that dynamically adjusts the contribution of each feature type based on its matching quality and contextual relevance. The details of the feature fusion process are presented in Algorithm 1. Let $F_{p}$ and $F_{g}$ represent the sets of point features and geometric structure features, respectively. These features are fused by assigning adaptive weights $w_{p}$ and $w_{g}$, where

$$F = w_{p} F_{p} + w_{g} F_{g}, \quad w_{p} + w_{g} = 1. \tag{19}$$

The weights $w_{p}$ and $w_{g}$ are determined based on the matching quality of each feature type, denoted as $q_{p}$ for point features and $q_{g}$ for geometric structure features. These qualities are evaluated using different metrics, such as descriptor distance for point features and geometric consistency for structure features. We define the weights as

$$w_{p} = \frac{q_{p}}{q_{p} + q_{g}}, \tag{20}$$

$$w_{g} = \frac{q_{g}}{q_{p} + q_{g}}. \tag{21}$$

A higher matching quality increases the weight of the corresponding feature type, ensuring that more reliable features contribute more to the fusion process.

In addition to the matching quality, we introduce a contextual reliability factor to account for the environmental conditions, denoted as $r_{p}$ for point features and $r_{g}$ for geometric structure features. This factor reflects the suitability of each feature type under the current conditions, such as texture-rich regions or areas with complex geometric structures. The contextual reliability factors adjust the feature weights as follows:

$$w_{p} = \frac{q_{p} \cdot r_{p}}{q_{p} \cdot r_{p} + q_{g} \cdot r_{g}}, \tag{22}$$

$$w_{g} = \frac{q_{g} \cdot r_{g}}{q_{p} \cdot r_{p} + q_{g} \cdot r_{g}}. \tag{23}$$

Thus, the fusion strategy adapts not only to the matching quality of features but also to the dynamic conditions of the environment. This ensures a more robust and accurate pose estimation.

Algorithm 1: Feature fusion with adaptive weighting.
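As a minimal illustration of Eqs. (20)–(23), and not a reproduction of Algorithm 1, the weight computation can be sketched in a few lines. The function name `fusion_weights`, the equal-weight fallback for a zero denominator, and all numeric values are assumptions introduced here.

```python
def fusion_weights(q_p, q_g, r_p=1.0, r_g=1.0):
    """Return (w_p, w_g) following Eqs. (22)-(23).

    With r_p = r_g = 1 this reduces to the quality-only form of Eqs. (20)-(21).
    """
    s_p = q_p * r_p
    s_g = q_g * r_g
    total = s_p + s_g
    if total <= 0.0:
        # Assumption: fall back to equal weights when neither feature type is usable.
        return 0.5, 0.5
    return s_p / total, s_g / total


# Quality only: point features match better than structure features.
w_p, w_g = fusion_weights(q_p=0.8, q_g=0.2)

# Same qualities, but the scene context favours structure (hypothetical r values).
w_p_ctx, w_g_ctx = fusion_weights(q_p=0.8, q_g=0.2, r_p=0.5, r_g=2.0)
```

In the second call the contextual factors offset the quality gap, so both feature types end up with equal weight; the weights sum to one in both cases, as required by Eq. (19).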

3.3.2. Complexity-Based Dynamic Weighting for Geometric Structure Features

While the adaptive weighting mechanism based on matching quality and contextual reliability is effective, we further enhance the strategy by incorporating the complexity of the geometric structure features. In urban environments, regions with intricate geometric structures, such as intersections or polygons, offer more reliable cues for localization compared to simpler regions with few geometric features. Therefore, we introduce a dynamic weighting mechanism for geometric structure features based on their complexity.

Let $\mathrm{Complexity}(G)$ denote the complexity of the geometric structure features. This complexity can be quantified by factors such as the number of intersecting lines, the angles between lines, and the sizes of the polygons formed by these lines. The weight $w_{g}$ is adjusted by multiplying the geometric term by $\mathrm{Complexity}(G)$, as follows:

$$w_{g} = \frac{q_{g} \cdot r_{g} \cdot \mathrm{Complexity}(G)}{q_{p} \cdot r_{p} + q_{g} \cdot r_{g} \cdot \mathrm{Complexity}(G)}, \tag{24}$$

with $w_{p} = 1 - w_{g}$, so that the normalization $w_{p} + w_{g} = 1$ of Eq. (19) is preserved.

This adjustment ensures that geometric structure features in high-complexity regions, where more geometric relationships are present, are assigned greater weight, thereby enhancing their contribution to the fusion process.
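The effect of Eq. (24) can be illustrated with a short sketch. The paper does not give a closed form for Complexity(G), so the linear score `toy_complexity` below, its `gain` parameter, and the input values are purely hypothetical stand-ins.

```python
def structure_weight(q_p, q_g, r_p, r_g, complexity):
    """Return (w_p, w_g) with w_g from Eq. (24) and w_p = 1 - w_g."""
    num = q_g * r_g * complexity
    den = q_p * r_p + num
    # Assumption: equal weights if both terms vanish.
    w_g = num / den if den > 0.0 else 0.5
    return 1.0 - w_g, w_g


def toy_complexity(n_intersections, gain=0.1):
    """Hypothetical score: 1 with no intersections, growing linearly with their count."""
    return 1.0 + gain * n_intersections


# Same qualities and reliabilities; only the structural complexity differs.
w_p_open, w_g_open = structure_weight(0.5, 0.5, 1.0, 1.0, toy_complexity(0))
w_p_dense, w_g_dense = structure_weight(0.5, 0.5, 1.0, 1.0, toy_complexity(10))
```

With a complexity score of 1 the result coincides with Eq. (23); doubling the score moves the structure weight from 1/2 to 2/3 in this example.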

3.3.3. Joint Descriptor Construction for Feature Fusion

To further refine the integration of point features and geometric structure features, we propose a joint descriptor that combines them into a unified representation. This joint descriptor captures the local texture information provided by point features as well as the spatial and structural relationships encoded by geometric structure features. Let $D_{p}$ and $D_{g}$ represent the descriptors for point features and geometric structure features, respectively. The joint descriptor is constructed through concatenation:

$$D_{f} = D_{p} \oplus D_{g}, \tag{25}$$

where $\oplus$ denotes the concatenation operation. The fused descriptor $D_{f}$ is a high-dimensional vector that simultaneously encodes the local texture and the geometric information, providing a more comprehensive feature representation for matching and pose estimation.

To reduce the dimensionality and improve the fusion efficiency, we apply Principal Component Analysis (PCA) to the joint descriptor. The reduced descriptor $D_{f_{r}}$ is then obtained by projecting the original fused descriptor onto the principal components:

$$D_{f_{r}} = V^{T} D_{f}, \tag{26}$$

where $V$ represents the matrix of principal eigenvectors. This dimensionality reduction ensures that the fused descriptor retains the most significant components, thereby optimizing computational efficiency without sacrificing feature descriptive power.
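For concreteness, Eqs. (25) and (26) can be unpacked with the textbook PCA construction. The symbols $d_{p}$, $d_{g}$, $k$, $n$, $\mu$, $\Sigma$, and $\Lambda$ are introduced here for illustration and do not appear in the original; mean-centering before projection is likewise an assumption, and the last expression reduces to Eq. (26) when $\mu = 0$.

```latex
\begin{aligned}
D_{f} &= \begin{bmatrix} D_{p} \\ D_{g} \end{bmatrix} \in \mathbb{R}^{d_{p} + d_{g}}, \\
\mu &= \frac{1}{n} \sum_{i=1}^{n} D_{f}^{(i)}, \qquad
\Sigma = \frac{1}{n-1} \sum_{i=1}^{n} \bigl(D_{f}^{(i)} - \mu\bigr)\bigl(D_{f}^{(i)} - \mu\bigr)^{T}, \\
\Sigma V &= V \Lambda, \qquad V \in \mathbb{R}^{(d_{p} + d_{g}) \times k}, \\
D_{f_{r}} &= V^{T} \bigl(D_{f} - \mu\bigr) \in \mathbb{R}^{k}.
\end{aligned}
```

Here the columns of $V$ are the $k$ eigenvectors of $\Sigma$ with the largest eigenvalues, so the reduced descriptor has length $k$ instead of $d_{p} + d_{g}$.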

3.3.4. Online Learning for Dynamic Weight Adjustment

To further enhance the adaptive fusion strategy, we introduce an online learning mechanism for dynamic weight adjustment. This method enables the system to continuously learn and adapt to varying environmental conditions by updating the weight parameters based on previous feature fusion results. The goal is to optimize the feature fusion process by learning from historical data, ensuring that the system automatically adjusts the fusion strategy to maximize pose estimation accuracy. We employ Reinforcement Learning (RL) to dynamically adjust the weights $w_{p}$ and $w_{g}$ during the operation of the visual odometry system. The RL agent interacts with its environment, learning to adjust the feature fusion weights based on the rewards received, which are related to localization accuracy. The weight update is performed as follows:

$$Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a'), \tag{27}$$

where $Q(s, a)$ represents the quality value of the action $a$ (weight adjustment) at state $s$, $R(s, a)$ denotes the reward associated with the action, $\gamma$ is the discount factor, and $s'$ and $a'$ represent the next state and action. This approach ensures that the fusion weights are continuously optimized based on real-time performance, making the system more robust to environmental variations.
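A tabular version of the update in Eq. (27) can be sketched as follows. The paper does not specify the state space, action set, reward, or learning rate, so the scene-type state labels, the three discrete adjustments in `ACTIONS`, the reward values, and the `alpha` parameter are all assumptions made for illustration.

```python
import random

ACTIONS = (-0.05, 0.0, 0.05)  # hypothetical adjustments applied to w_g


def q_update(q_table, state, action, reward, next_state, gamma=0.9, alpha=1.0):
    """One tabular update toward the target R + gamma * max_a' Q(s', a').

    With alpha = 1 the stored value equals the right-hand side of Eq. (27).
    """
    best_next = max(q_table.get((next_state, a), 0.0) for a in ACTIONS)
    target = reward + gamma * best_next
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (target - old)
    return q_table[(state, action)]


def choose_action(q_table, state, epsilon=0.1, rng=random):
    """Epsilon-greedy choice among the discrete weight adjustments."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))


q = {}
# Hypothetical state label and rewards standing in for accuracy gains.
first = q_update(q, "low_texture", 0.05, reward=1.0, next_state="low_texture")
second = q_update(q, "low_texture", 0.05, reward=1.0, next_state="low_texture")
greedy = choose_action(q, "low_texture", epsilon=0.0)
```

After two rewarded updates the action that increases $w_{g}$ has the highest value in this state, so the greedy policy selects it.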

3.4. Joint Pose Estimation and Optimization

3.4.1. Pose Estimation Framework

The pose of the camera or UAV is defined by a transformation matrix $T = [R \mid t]$, where $R$ is the rotation matrix and $t$ is the translation vector. The objective is to estimate $T$ so as to minimize the error between the projected features (point and geometric) and their observed counterparts in consecutive frames. To achieve this, we implement a factor graph optimization approach that accounts for both feature types, combining them into a joint pose estimation framework.

Let $F_{p} = \{f_{p}^{1}, f_{p}^{2}, \dots, f_{p}^{N}\}$ and $F_{g} = \{f_{g}^{1}, f_{g}^{2}, \dots, f_{g}^{M}\}$ represent the sets of point features and geometric structure features, respectively, extracted from two consecutive frames. The optimization goal is to minimize the reprojection error $e_{p}$ for point features and $e_{g}$ for geometric structure features while jointly estimating the pose $T$. The combined objective function can be expressed as

$$\min_{T} \sum_{i=1}^{N} \lambda_{p} \cdot e_{p}(f_{p}^{i}, T)^{2} + \sum_{j=1}^{M} \lambda_{g} \cdot e_{g}(f_{g}^{j}, T)^{2}, \tag{28}$$

where $e_{p}(f_{p}^{i}, T)$ represents the reprojection error for the $i$-th point feature, $e_{g}(f_{g}^{j}, T)$ represents the reprojection error for the $j$-th geometric structure feature, and $\lambda_{p}$ and $\lambda_{g}$ are the weight factors for point features and geometric structure features. These error terms can be formulated as

$$e_{p}(f_{p}^{i}, T) = \| \pi(T \cdot f_{p}^{i}) - f_{p}^{i'} \|_{2}, \tag{29}$$

$$e_{g}(f_{g}^{j}, T) = \| \pi(T \cdot f_{g}^{j}) - f_{g}^{j'} \|_{2}, \tag{30}$$

where $\pi(\cdot)$ represents the projection operator that projects the 3D features onto the 2D image plane, and $f_{p}^{i'}$ and $f_{g}^{j'}$ are the corresponding observed features in the second frame.
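The error terms and the weighted objective of Eq. (28) can be sketched as follows. This is an illustrative sketch under stated assumptions: a pinhole camera with hypothetical intrinsics `K`, geometric features reduced to a single sampled 3D point each (mirroring the point-like form of the geometric error term), and hand-picked weights. None of these values come from the paper.

```python
import math


def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection pi(.) of a 3D point in camera coordinates to pixels."""
    x, y, z = point_cam
    return (fx * x / z + cx, fy * y / z + cy)


def transform(R, t, p):
    """Apply T = [R | t] to a 3D point p (R is a 3x3 nested list)."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))


def reprojection_error(R, t, point_3d, observed_px, intrinsics):
    """Euclidean pixel distance between the projected and observed feature."""
    u, v = project(transform(R, t, point_3d), *intrinsics)
    return math.hypot(u - observed_px[0], v - observed_px[1])


def weighted_cost(R, t, points, lines, lam_p, lam_g, intrinsics):
    """Objective of Eq. (28); each item is (3D feature, observed pixel)."""
    cost_p = sum(lam_p * reprojection_error(R, t, X, x, intrinsics) ** 2
                 for X, x in points)
    cost_g = sum(lam_g * reprojection_error(R, t, X, x, intrinsics) ** 2
                 for X, x in lines)
    return cost_p + cost_g


K = (500.0, 500.0, 320.0, 240.0)  # hypothetical fx, fy, cx, cy
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [((0.0, 0.0, 2.0), (320.0, 240.0)), ((0.2, 0.0, 2.0), (373.0, 240.0))]
lines = [((0.0, 0.4, 2.0), (320.0, 344.0))]  # a line reduced to one sampled point
cost = weighted_cost(I3, (0.0, 0.0, 0.0), points, lines,
                     lam_p=0.6, lam_g=0.4, intrinsics=K)
```

With the identity pose, the second point is off by 3 pixels and the line sample by 4 pixels, so the cost is 0.6 · 9 + 0.4 · 16 = 11.8; a solver would adjust $R$ and $t$ to reduce this value.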

3.4.2. Joint Pose Optimization

To jointly optimize the pose $T$, we must account for the relative importance of point and geometric features in the optimization process. The adaptive weighting mechanism plays a crucial role here. As discussed earlier, the weights $\lambda_{p}$ and $\lambda_{g}$ are dynamically adjusted based on the matching quality and the contextual reliability of the features. These weights ensure that the more reliable features contribute more significantly to the optimization process.

In practice, the optimization process is carried out using a nonlinear least squares solver (Ceres Solver), which efficiently minimizes the total reprojection error. The optimization objective can be written as

$$\min_{T} \sum_{i=1}^{N} \lambda_{p} \cdot e_{p}(f_{p}^{i}, T)^{2} + \sum_{j=1}^{M} \lambda_{g} \cdot e_{g}(f_{g}^{j}, T)^{2} + \sum_{k=1}^{P} \lambda_{p} \cdot e_{p}(T_{k})^{2}, \tag{31}$$

where $e_{p}(T_{k})$ represents the prior term, which includes regularization factors or inertial measurements, such as data from an inertial measurement unit (IMU). In this last sum, the subscript $p$ denotes the prior rather than the point features, and the associated weight $\lambda_{p}$ balances the influence of prior information on the optimization.

3.4.3. Geometric Constraints in Pose Optimization

In addition to minimizing reprojection errors, we introduce geometric consistency constraints between points and geometric features during the pose optimization. These constraints help maintain the structural integrity of the scene and improve pose estimation accuracy, especially in dynamic environments. Geometric consistency is enforced by incorporating relative angles and distances between matching features.

Let $\theta_{ij}$ represent the angle between the $i$-th and $j$-th geometric structure features, and let $d_{ij}$ represent the distance between the $i$-th and $j$-th features. The geometric consistency error term is defined as

$$e_{g}(f_{p}^{i}, f_{g}^{j}, T) = \| \theta_{ij}(T \cdot f_{p}^{i}, T \cdot f_{g}^{j}) - \theta_{ij}^{*} \|_{2} + \| d_{ij}(T \cdot f_{p}^{i}, T \cdot f_{g}^{j}) - d_{ij}^{*} \|_{2}, \tag{32}$$

where $\theta_{ij}^{*}$ and $d_{ij}^{*}$ are the observed angle and distance values between the corresponding features in the scene.

These geometric constraints enforce structural consistency of the scene and ensure that the pose optimization process respects the geometric relationships between features, yielding more accurate and reliable pose estimates.
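The angle-plus-distance residual of Eq. (32) can be illustrated for a single feature pair. In this sketch the features are assumed to be already transformed by $T$ and are represented by a 2D direction vector and a midpoint, which is a simplification introduced here; a practical system would also need to balance the two terms, since one is in radians and the other in length units.

```python
import math


def angle_between(d1, d2):
    """Unsigned angle in [0, pi/2] between two undirected 2D line directions."""
    dot = abs(d1[0] * d2[0] + d1[1] * d2[1])
    norm = math.hypot(*d1) * math.hypot(*d2)
    return math.acos(min(1.0, dot / norm))


def consistency_error(dir_i, dir_j, mid_i, mid_j, theta_ref, dist_ref):
    """|theta_ij - theta*_ij| + |d_ij - d*_ij| for one feature pair, as in Eq. (32)."""
    theta = angle_between(dir_i, dir_j)
    dist = math.hypot(mid_i[0] - mid_j[0], mid_i[1] - mid_j[1])
    return abs(theta - theta_ref) + abs(dist - dist_ref)


# Two perpendicular building edges whose midpoints are 5 units apart.
err_ok = consistency_error((1, 0), (0, 1), (0, 0), (3, 4), math.pi / 2, 5.0)
# The same pair when a poor pose estimate skews one edge by 45 degrees.
err_bad = consistency_error((1, 0), (1, 1), (0, 0), (3, 4), math.pi / 2, 5.0)
```

The first pair reproduces its reference angle and distance, so the residual vanishes; the skewed pair leaves an angular residual of $\pi/4$, which the optimizer would penalize.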

3.4.4. Optimization Algorithm

The optimization problem is solved iteratively using the Gauss–Newton method [9], which updates the pose $T$ at each iteration according to

$$T^{k+1} = T^{k} - H^{-1} J^{T} r, \tag{33}$$

where $H$ denotes the Hessian matrix, representing the second-order information about the objective function (approximated by $J^{T} J$ in the Gauss–Newton method); $J$ denotes the Jacobian matrix, representing the partial derivatives of the residuals with respect to the pose $T$; and $r$ is the residual vector, representing the difference between the predicted and observed feature positions.

This iterative process continues until convergence, at which point the changes in the pose T fall below a predefined threshold.
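The iteration in Eq. (33) can be demonstrated on a deliberately small problem. The sketch below estimates a single 2D rotation angle from point correspondences, so that $H = J^{T} J$ is a scalar; the full system optimizes a six-degree-of-freedom pose with Ceres Solver, and the function name, tolerance, and test data here are assumptions made for illustration only.

```python
import math


def gauss_newton_rotation(src, dst, theta0=0.0, tol=1e-10, max_iters=20):
    """Estimate the 2D rotation angle that aligns src points to dst points.

    Residuals r stack (R(theta) * p - q) over all pairs and J = dr/dtheta.
    The update theta <- theta - (J^T J)^-1 J^T r is Eq. (33) with the
    Gauss-Newton approximation H = J^T J.
    """
    theta = theta0
    for _ in range(max_iters):
        c, s = math.cos(theta), math.sin(theta)
        jtj = 0.0
        jtr = 0.0
        for (px, py), (qx, qy) in zip(src, dst):
            rx = c * px - s * py - qx  # residual, x component
            ry = s * px + c * py - qy  # residual, y component
            jx = -s * px - c * py      # d(rx)/d(theta)
            jy = c * px - s * py       # d(ry)/d(theta)
            jtj += jx * jx + jy * jy
            jtr += jx * rx + jy * ry
        step = jtr / jtj
        theta -= step
        if abs(step) < tol:  # convergence: the change in the estimate is negligible
            break
    return theta


true_theta = math.radians(25.0)
src = [(1.0, 0.0), (0.0, 2.0), (-1.5, 0.5), (2.0, 2.0)]
dst = [(math.cos(true_theta) * x - math.sin(true_theta) * y,
        math.sin(true_theta) * x + math.cos(true_theta) * y) for x, y in src]
theta_hat = gauss_newton_rotation(src, dst)
```

Starting from zero, the estimate reaches the true angle within a handful of iterations on this noise-free data, and the loop stops once the step size falls below the threshold, matching the convergence criterion described above.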

4. Results

This study evaluated the proposed GLP-VO system in terms of accuracy, robustness, and runtime performance using both a public benchmark and real UAV flight experiments. The implementation was written in C++ on Ubuntu, with OpenCV, Sophus, Eigen, and Ceres Solver as the main dependencies. For quantitative evaluation on the TUM RGB-D dataset, the experiments were conducted on a desktop platform equipped with an Intel Core i9-13900H processor and 16 GB of RAM. Real-world flight experiments were performed on an NVIDIA Jetson TX2 onboard computer integrated with the UAV flight controller, enabling onboard low-altitude navigation tests. All evaluations in this study focus on visual odometry rather than full SLAM.

4.1. Quantitative Evaluation on the TUM RGB-D Dataset

To demonstrate the quantitative accuracy of our method, we conducted a series of experiments on the TUM RGB-D dataset. This dataset is a widely recognized benchmark for evaluating visual odometry and SLAM algorithms. It comprises a diverse collection of indoor sequences recorded with an RGB-D sensor, such as the Microsoft Kinect. The dataset provides synchronized color and depth images, along with accurate ground-truth trajectories captured by motion capture systems. This comprehensive resource enables researchers to rigorously assess algorithm performance under various challenging conditions, including dynamic scenes, varying illumination, and textureless environments.

To evaluate the localization accuracy, we compared GLP-VO with several representative methods, including ORB-SLAM [8], PL-SLAM [14], PTAM [37], LSD-SLAM [23], and RGBD-SLAM [9]. The metric used for the comparison is the Absolute Trajectory Error (ATE). The experimental results presented in Table 1 demonstrate the efficacy of our proposed GLP-VO method across a diverse set of indoor scenarios. Overall, GLP-VO consistently achieves lower RMSE values than PL-SLAM and LSD-SLAM, and its performance is competitive with ORB-SLAM and PTAM in several cases. For instance, in sequences characterized by moderate motion and sufficient texture, GLP-VO yields very low errors—often below 1 cm—indicating its ability to capture accurate spatial information even under rapid viewpoint changes. In more challenging sequences involving dynamic conditions or low-texture regions, GLP-VO maintains a stable error range, whereas the competing methods exhibit greater fluctuations. Notably, while PTAM occasionally achieves exceptionally low errors under stable conditions, its performance is less consistent when scene complexity increases. The results further reveal that methods that rely solely on point features, such as ORB-SLAM, are more sensitive to abrupt motion and occlusions, leading to higher errors in certain cases. In contrast, GLP-VO’s hybrid approach, which fuses point features with robust geometric cues derived from line information, mitigates drift and reduces cumulative errors over time. This fusion strategy is particularly beneficial in environments with both rich textures and prominent structural edges, as it provides complementary information that enhances overall localization accuracy. In summary, the detailed analysis confirms that GLP-VO offers a more robust and reliable performance in terms of the absolute trajectory error across various indoor laboratory sequences, making it a promising solution for challenging real-world SLAM applications.

Figure 6 shows the trajectory reconstruction results on three representative TUM RGB-D sequences. As can be observed, the compared methods all capture the main motion trend to some extent, but clear differences remain in trajectory consistency, local deviation, and overall path recovery. In the three sequences, the trajectory estimated by GLP-VO remains closer to the ground-truth path in most segments, with better preservation of the global shape and smoother local evolution.

For the sequence with relatively regular path changes, GLP-VO follows the reference trajectory more steadily, while some baseline methods exhibit visible drift accumulation and local oscillation. In the more challenging sequence with frequent directional changes, the differences become more apparent: several comparison methods show larger deviations in the crossing and turning regions, whereas GLP-VO maintains a more coherent trajectory structure. In the sequence containing longer curved motion and more complex local variations, the proposed method still preserves the overall path trend more effectively, while some other methods suffer from trajectory distortion or unstable recovery in partial segments.

It is also worth noting that the benchmark evaluation includes multiple classical methods, but their robustness is not fully consistent across all test sequences. In some cases, certain algorithms cannot maintain stable trajectory estimation throughout the entire sequence. By contrast, GLP-VO shows better continuity and stronger adaptability under different motion patterns and scene conditions. These results indicate that the fusion of local point features and geometric structural cues helps improve the stability of trajectory estimation and reduces the risk of local mismatch and drift accumulation.

4.2. Real-World Experiments

To validate the utility of our method under realistic conditions, we conducted a series of localization and mapping experiments in an urban low-altitude environment. For a performance comparison, we conducted initialization speed and accuracy experiments for our proposed GLP-VO method against ORB-SLAM (monocular configuration) and PL-SLAM, using GNSS-RTK ground truth to assess their relative accuracy. The experimental sequences highlight the robustness and accuracy of our method in dynamic urban environments, and the GNSS-RTK data can serve as a reference to quantify positioning errors and verify the overall system’s reliability.

4.2.1. Construction of the Experimental Dataset

To fully evaluate the performance of the proposed GLP-VO system under real-world conditions, we conducted multiple UAV flights in representative urban low-altitude scenarios. Each mission was carefully designed to include a variety of environmental features such as fields, dense building clusters and multi-lane intersections, as shown in Figure 7. We constructed three datasets for the experiments, ensuring they capture a variety of geometric structures, dynamic elements, and rapid illumination changes that are common in real-world urban operations. The image acquisition in this study mainly follows a bird’s-eye-view (near-nadir) UAV observation mode so that the camera can continuously capture road boundaries, building contours, and other structural cues that are important for the proposed point-structure fusion framework.

As shown in Figure 8, the UAV platform used for data collection was equipped with an Intel monocular camera, an NVIDIA Jetson TX2 onboard computer, a flight control system, and a GNSS-RTK module. During each flight, the camera captured monocular images at 640 × 800 pixels at 30 Hz. Although a depth sensor was available, our experiments used only monocular images to maintain consistency with the primary visual odometry pipeline. Meanwhile, the onboard unit processed these images in real time to perform feature detection, feature matching, and pose estimation. The flight control system was responsible for stabilizing the attitude and directing the UAV along a predetermined trajectory. GNSS-RTK readings were recorded at each time step to provide a ground-truth position reference for the subsequent accuracy assessment.

For reproducibility, the desktop-side runtime evaluation was conducted on a laptop equipped with an Intel Core i9-13900H processor and 16 GB of RAM, without GPU acceleration. The onboard runtime evaluation was conducted on an NVIDIA Jetson TX2 platform equipped with a hexa-core CPU (dual-core NVIDIA Denver 2 and quad-core ARM Cortex-A57), 8 GB LPDDR4 memory, and an integrated 256-core Pascal GPU.

By synchronizing and aligning the image data, flight logs, and GNSS-RTK measurements via timestamps, we constructed a dataset that provides a solid foundation for algorithm validation. The dataset not only exhibits high variability in geometric scenes, ranging from repetitive building facades to cluttered road environments, but also introduces challenging dynamic objects, such as moving vehicles and pedestrians. In addition, multiple flights were conducted under different lighting conditions, enabling a thorough examination of the system’s sensitivity to shaded areas and varying sunlight intensities. Overall, these design choices produced a real-world dataset that effectively tested GLP-VO’s ability to process both structured and unstructured urban elements and to perform continuous motion in areas where GNSS-RTK signal degradation may occur.
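One common way to perform such timestamp-based association is nearest-neighbor matching with a tolerance, sketched below. The paper does not describe its synchronization procedure in detail, so the function name `associate`, the 20 ms tolerance, and the 10 Hz reference rate used in the example are assumptions.

```python
import bisect


def associate(image_stamps, rtk_stamps, max_dt=0.02):
    """Pair each image timestamp with the nearest RTK timestamp within max_dt seconds.

    Both lists must be sorted in ascending order.
    Returns a list of (image_index, rtk_index) pairs.
    """
    pairs = []
    for i, t in enumerate(image_stamps):
        k = bisect.bisect_left(rtk_stamps, t)
        # The nearest neighbour is either just before or just after position k.
        candidates = [j for j in (k - 1, k) if 0 <= j < len(rtk_stamps)]
        if not candidates:
            continue
        j = min(candidates, key=lambda idx: abs(rtk_stamps[idx] - t))
        if abs(rtk_stamps[j] - t) <= max_dt:
            pairs.append((i, j))
    return pairs


# 30 Hz images against 10 Hz reference fixes (hypothetical rates and offsets).
images = [n / 30.0 for n in range(7)]
rtk = [0.001, 0.101, 0.201]
pairs = associate(images, rtk)
```

Only images that fall within the tolerance of a reference fix are kept for accuracy evaluation; in this example that is every third frame.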

To provide a more comprehensive evaluation of the real-world experiments, we further plotted the estimated trajectories of GLP-VO, ORB-SLAM, and PL-SLAM together with the GNSS-RTK reference trajectory for all eight field-test sequences, as shown in Figure 9.

Across the eight tests, GLP-VO follows the reference path more consistently in most cases, with better preservation of the overall trajectory trend and smaller visible local deviations. In addition, the proposed method generally maintains a more complete trajectory over longer flight segments, whereas the comparison methods exhibit more noticeable drift accumulation or reduced trajectory continuity in several sequences.

This behavior is closely related to the characteristics of real-world UAV scenes. During field flights, the image stream often contains viewpoint changes, weak-texture regions, partial occlusions, and cluttered backgrounds, which can reduce the reliability of single-feature motion estimation. By jointly exploiting point features and structural cues, GLP-VO is able to preserve more stable geometric constraints during motion estimation, leading to better path consistency and more reliable long-range tracking. In contrast, methods relying more heavily on a single type of feature are more likely to accumulate local errors or experience degraded tracking stability when scene conditions become less favorable. The corresponding quantitative evaluation of trajectory accuracy and stable operating distance is provided in the following subsection.

Table 2 summarizes the computational cost of each module in the proposed GLP-VO framework on both the desktop and onboard platforms. The overall execution time is about 28.4 ms per frame on the desktop platform and 89.5 ms per frame on the onboard platform, corresponding to approximately 35 FPS and 11 FPS, respectively. This result provides additional evidence for the practical runtime feasibility of the proposed method.
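The frame rates follow directly from the per-frame execution times:

```latex
\mathrm{FPS}_{\text{desktop}} = \frac{1000\ \text{ms/s}}{28.4\ \text{ms/frame}} \approx 35.2,
\qquad
\mathrm{FPS}_{\text{onboard}} = \frac{1000\ \text{ms/s}}{89.5\ \text{ms/frame}} \approx 11.2 .
```

For reference, a 30 Hz image stream leaves a budget of about 33.3 ms per frame, which the desktop figure meets and the onboard figure exceeds, so the onboard platform would presumably process only a subset of the captured frames.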

It can also be observed that the dominant computational cost mainly comes from dual-feature extraction, geometric matching, and joint pose optimization, while the adaptive feature fusion module contributes only a limited fraction of the total runtime. This indicates that the additional fusion mechanism does not impose excessive computational overhead. Even on the resource-limited onboard platform, the proposed method remains capable of sustained online execution, which supports its applicability in real-world UAV scenarios.

4.2.2. Initialization Speed Comparison in Real Scenarios

Rapid and reliable initialization is a critical first step for real-world visual odometry and SLAM systems, as it enables these methods to quickly lock onto stable features and begin performing robust pose estimation without unnecessary delays. In UAV-based applications, a sluggish or unstable initialization phase can hinder real-time performance and complicate mid-flight adjustments. With this in mind, we measured and compared the initialization times of GLP-VO, ORB-SLAM, and PL-SLAM across three custom datasets (Labs 1, 2, and 3) to assess the real-world impact of each algorithm’s startup process.

Table 3 details the initialization performance of the different algorithms, comparing the time required to initialize the three sets of sequences in an urban low-altitude scenario. Labs 1 and 2 contain dense urban landscapes, including buildings and other man-made structures. Under these visually complex conditions, GLP-VO achieved a consistently faster initialization speed than the other two methods. This is because the point–line fusion strategy effectively utilizes rich geometric cues (lines, corners, and edges). Unlike Labs 1 and 2, which are urban, Lab 3 is farmland located on the outskirts of a city with relatively few man-made structures. In this environment, there are fewer protruding edges and lines, which diminishes the potential benefits of GLP-VO. Consequently, the difference in average initialization time between the three algorithms is significantly reduced in Lab 3. Even so, GLP-VO remains competitive. First, the adaptive weighting mechanism enables the system to adjust the relative contribution of point features and structural features according to the actual feature reliability in the scene. As a result, when structural cues are weak, the method can rely more on stable point features instead of overemphasizing unreliable line information. Second, the joint optimization framework can still exploit the limited but meaningful geometric cues available in the scene, such as road boundaries, field edges, or sparse artificial contours, to provide complementary constraints for pose estimation. These two mechanisms allow GLP-VO to remain effective even when the scene contains fewer strong structures.

Initialization failure in GLP-VO typically occurs when the first frames do not provide sufficiently reliable point and structural constraints. Common causes include scarcity of clear line-like structures, motion blur, poor or unstable illumination, and scenes in which both texture and geometric structure are weak. Under these conditions, the quality of feature extraction and initial association degrades, which may prevent the system from establishing a stable initial pose estimate.

4.2.3. Comparison of Stability and Accuracy in Real Scenarios

The stability of navigation systems over long distances in urban scenarios also presents significant challenges. As feature distributions change in real time due to varying urban conditions, vision-dependent navigation systems may lose tracking and require re-initialization, which in turn degrades tracking accuracy. Therefore, the system’s long-range stability is of particular importance.

Figure 10 shows the stabilized operational distances for each algorithm (GLP-VO, ORB-SLAM2, and PL-SLAM) in eight different low-altitude urban route environments. While the specific urban or semi-urban characteristics of each sequence vary, the overall pattern remains consistent: in most cases, GLP-VO maintains stable tracking over longer paths, suggesting a significant robustness advantage for GLP-VO across a variety of structural layouts.

GLP-VO exhibits particularly high stability in the Lab 1–5 sequences, which contain dense buildings. In such environments, geometric features such as building contours and intersections are abundant and are typically complemented by a large number of point features. However, the gap between the three algorithms narrows when the geometry is sparser or less varied. The Lab 6–8 sequences follow routes located mainly on the outskirts of cities with large open areas, such as fields, and thus offer fewer geometric features to exploit, resulting in only minor differences in stability between GLP-VO and the algorithms based on point features alone or on point–line fusion.

The box plot of the translation error statistics is shown in Figure 11. In terms of translation error, GLP-VO outperforms both ORB-SLAM (based on point features only) and PL-SLAM (based on the fusion of points and lines). By fusing the point features with the geometric structure features, GLP-VO significantly reduces the median error, especially in dense urban scenarios; the advantage is most apparent in long-distance urban sequences such as Lab 1.

4.2.4. Ablation Study of GLP-VO

To further assess the contribution of geometric structure features to GLP-VO’s overall performance, we conducted an ablation study. We present trajectory errors, as well as errors along the x, y, and z axes, for two configurations: one based solely on point features and another that fuses point features with geometric structure features. These evaluations were performed on two experimental sequences that we collected. To isolate the impact of the geometric structure features, the loop-closure module was disabled in both setups.

Figure 12 and Figure 13 present the experimental results for two distinct sequences. In an urban area rich in geometric features (Lab 1), the inclusion of geometric structure features substantially enhanced localization accuracy and robustness, particularly along the z-axis. In semi-structured environments with fewer prominent geometric features, such as the rural areas surrounding the city (Lab 2), the fusion system still outperformed the configuration based solely on point features. In these scenes, geometric features derived from sporadic structures (e.g., road signs and fences) compensated for unreliable point matching.

To better understand the role of each component in the proposed GLP-VO framework, we performed an ablation study in which geometric-structure-based feature extraction and matching, adaptive feature fusion, and geometric-constraint-based optimization were introduced step by step, as shown in Table 4. To provide a clearer and more comprehensive evaluation, two complementary metrics were adopted, namely the Absolute Trajectory Error (ATE) and the Relative Pose Error (RPE). In this context, ATE reflects the overall trajectory deviation from the ground truth, whereas RPE characterizes the local motion consistency between consecutive poses.
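The difference between the two metrics can be made concrete with a simplified, translation-only sketch. It assumes that the estimated and ground-truth trajectories are already time-matched and expressed in the same frame, so the rigid alignment step normally applied before computing ATE is omitted; the function names and toy trajectories are hypothetical.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of position differences between two time-matched trajectories."""
    sq = [sum((e - g) ** 2 for e, g in zip(pe, pg))
          for pe, pg in zip(estimated, ground_truth)]
    return math.sqrt(sum(sq) / len(sq))


def rpe_rmse(estimated, ground_truth, delta=1):
    """RMSE of the translational relative error over steps of `delta` poses."""
    sq = []
    for k in range(len(estimated) - delta):
        de = [b - a for a, b in zip(estimated[k], estimated[k + delta])]
        dg = [b - a for a, b in zip(ground_truth[k], ground_truth[k + delta])]
        sq.append(sum((x - y) ** 2 for x, y in zip(de, dg)))
    return math.sqrt(sum(sq) / len(sq))


gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
# Constant 0.1 offset: a global error that leaves frame-to-frame motion intact.
est_offset = [(x, y + 0.1, z) for x, y, z in gt]
# Drift growing by 0.1 per step: visible in both metrics.
est_drift = [(x, y + 0.1 * k, z) for k, (x, y, z) in enumerate(gt)]

ate_offset, rpe_offset = ate_rmse(est_offset, gt), rpe_rmse(est_offset, gt)
ate_drift, rpe_drift = ate_rmse(est_drift, gt), rpe_rmse(est_drift, gt)
```

The offset trajectory has a nonzero ATE but zero RPE, whereas the drifting trajectory is penalized by both, which is why reporting the two metrics together separates global deviation from local motion inconsistency.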

As shown in Table 4, the performance improves progressively from the base configuration to the full GLP-VO model on both indoor benchmark sequences and the real-world flight sequence. More importantly, the reductions are consistently reflected in both ATE and RPE, indicating that the proposed components improve not only the global trajectory accuracy but also the local stability of pose estimation. This trend confirms that the overall performance gain does not arise from a single stage but from the coordinated contribution of the frontend representation, feature fusion, and backend optimization modules.

The introduction of geometric-structure-based feature extraction and matching leads to a clear improvement over the baseline, which suggests that higher-level structural cues provide more reliable geometric constraints than conventional point-only representations, especially when texture information is weak or unstable. After that, adaptive feature fusion further enhances the performance, demonstrating that point features and geometric structure features are complementary in practical motion estimation. By dynamically balancing these heterogeneous cues according to the scene condition, the system becomes more robust to local feature degradation and matching ambiguity.

When geometric-constraint-based optimization is further incorporated, the full GLP-VO model achieves the best overall performance. This result indicates that the optimization stage plays an important role in suppressing error accumulation and enforcing trajectory consistency under multi-feature constraints. In addition, the improvement is more pronounced on the more challenging sequence f2_360_kidnap and the real-world Lab 1 sequence than on the relatively regular f1_xyz sequence. This observation suggests that the proposed framework is particularly beneficial in scenarios involving abrupt motion changes, degraded tracking conditions, and more complex real-flight disturbances.

5. Conclusions

This study proposes GLP-VO, a hybrid visual odometry framework that fuses geometric structural features with point features to address the limited navigation accuracy and robustness of UAVs in low-altitude urban environments. Experimental results on both the TUM RGB-D dataset and real UAV flight sequences demonstrate the effectiveness of the proposed method in terms of accuracy, robustness, and efficiency. On the TUM RGB-D dataset, GLP-VO achieves the best ATE results in five out of ten evaluated sequences, including 0.91 cm on f1_xyz and 0.62 cm on f3_str_tex_far, and remains competitive on challenging sequences such as f2_360_kidnap with an ATE of 2.26 cm. The ablation study further shows that, compared with the base configuration, the full GLP-VO model reduces the ATE and RPE by up to 44.9% and 43.1%, respectively, confirming the joint contribution of geometric-structure-based feature extraction and matching, adaptive feature fusion, and geometric-constraint-based optimization. In addition, the proposed system achieves an average runtime of 28.4 ms per frame on the desktop platform and 89.5 ms per frame on the onboard platform, corresponding to approximately 35 FPS and 11 FPS, respectively. The proposed method therefore maintains practical real-time performance while improving localization accuracy and long-range stability in complex urban UAV scenarios.

These improvements are particularly evident in the faster initialization, reduced cumulative drift, and enhanced long-range stability. Although GLP-VO generalizes well and shows an accuracy advantage across various environments, it still has limitations. First, in highly dynamic scenes, such as those with vehicles and pedestrians moving back and forth and frequent field-of-view occlusion, the system relies primarily on robust feature matching and RANSAC to reject outliers; the matching process and backend optimization may therefore still be adversely affected when the proportion of dynamic targets is very high. Second, in extremely simple environments or those lacking obvious structural line features (e.g., large solid-colored walls or deserts), the contribution of structural features is limited, and the performance gain is reduced.

In summary, GLP-VO enables accurate pose estimation for UAVs in complex low-altitude urban scenarios without compromising real-time performance. In future work, we will add loop-closure detection and global optimization to improve global consistency and map construction accuracy in large-scale urban environments, and we will introduce explicit dynamic-target detection and semantic segmentation to filter or reconstruct dynamic targets, with the aim of achieving more stable and autonomous navigation in congested and dynamic urban environments.

Author Contributions

Conceptualization, Y.X., B.J., L.H. and R.Q.; methodology, Y.X., B.J., L.H., R.Q. and Z.W.; software, Y.X. and Z.W.; validation, Y.X. and Z.W.; formal analysis, Y.X.; investigation, Y.X.; resources, B.J., L.H. and R.Q.; data curation, Y.X.; writing—original draft preparation, Y.X.; writing—review and editing, Y.X., B.J., L.H. and R.Q.; visualization, Y.X.; supervision, B.J., L.H. and R.Q.; project administration, B.J., L.H. and R.Q.; funding acquisition, B.J., L.H. and R.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://vision.in.tum.de/data/datasets/rgbd-dataset (accessed on 5 April 2026). The custom dataset generated during this study is not publicly available due to privacy restrictions associated with the captured urban environments.

Acknowledgments

The authors would like to thank Hangzhou Antwork Network Technology Co., Ltd. for providing the UAV platform and supporting the data collection for the experiments in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. GLP-VO framework. The images captured by the monocular camera are first processed by the frontend, and the processed data are then sent to the backend for position estimation.

Figure 2. Enhanced Plücker coordinates representation for geometric features.

Figure 3. Geometric constraint extraction. (a) Angle constraint. (b) Distance constraint. (c) Geometric shape constraint. (d) Normal vector descriptor.

Figure 4. Results of geometric structure feature matching using our proposed method.

Figure 5. Results of point feature matching using the conventional point-based method.

Figure 6. Comparison of trajectory reconstruction results on three representative TUM RGB-D sequences: (a) f1_xyz, (b) f2_360_kidnap, and (c) f3_str_tex_far. In each subplot, the ground-truth trajectory is shown together with the estimated trajectories of the compared methods.

Figure 7. Typical scenes from the dataset. (a) Field environment with sparse features. (b) Dense building clusters with rich geometric structures. (c) Road environment with dynamic elements.

Figure 8. Physical demonstration of the drone used for the experiment.

Figure 9. Trajectory comparison results on eight real-world UAV field-test sequences, including the RTK-GNSS ground-truth trajectory and the estimated trajectories of ORB-SLAM, PL-SLAM, and GLP-VO.

Figure 10. Stability comparison of different algorithms in urban scenarios. The maximum range of several algorithms for stable operation is computed, where the path length is the total length of the route.

Figure 11. Box plots of the absolute translation error statistics for the three algorithms in a real scenario. Each box spans the first to third quartiles, and the whiskers mark the upper and lower limits.

Figure 12. Trajectories and triaxial errors for Experiment Sequence 1. The blue line shows the results from GLP-VO using only points, and the green line shows the results from PLC-VIO using points with geometric constraints.

Figure 13. Trajectories and triaxial errors for Experiment Sequence 2. The blue line shows the results from GLP-VO using only points, and the green line shows the results from PLC-VIO using points with geometric constraints.

Table 1. Accuracy performance of different VO systems on the TUM RGB-D dataset. The best result is highlighted in bold.

Sequences	GLP-VO	PL-SLAM	ORB-SLAM	LSD-SLAM	PTAM
f1_xyz	0.91	1.32	1.16	8.34	1.03
f2_xyz	0.32	1.16	0.48	1.83	0.18
f1_floor	6.58	8.27	7.12	36.29	/
f2_360_kidnap	2.26	56.43	3.31	/	2.23
f3_long_office	1.68	4.99	3.16	34.28	/
f3_nstr_tex_far	/	35.64	/	16.42	31.57
f3_nstr_tex_near	1.84	1.31	2.37	6.39	2.53
f3_str_tex_far	0.62	1.07	0.83	6.57	0.73
f3_str_tex_near	1.64	5.74	1.34	/	0.92
f2_desk_person	1.88	6.22	4.96	28.58	/

The ATE RMSE results are in cm; lower is better. “/” indicates tracking failure.

Table 2. Computational cost of each module in the proposed GLP-VO framework.

Algorithm Module	Desktop (ms)	Onboard (ms)
Dual-Feature Extraction	∼10.2	∼31.5
Descriptor Generation	∼4.5	∼12.8
Geometric Matching	∼5.8	∼18.2
Adaptive Feature Fusion	∼1.4	∼4.6
Joint Pose Optimization	∼6.5	∼22.4
Total Execution Time	∼28.4 ms	∼89.5 ms
Real-time Performance	∼35 FPS	∼11 FPS
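
The totals and frame rates in Table 2 can be cross-checked from the per-module timings. The sketch below assumes the modules run strictly one after another, so that the per-frame time is the plain sum of the module times; the function name `total_and_fps` is illustrative and not part of the GLP-VO code.

```python
# Approximate per-module runtimes in ms, copied from Table 2 in row order:
# extraction, descriptors, matching, fusion, pose optimization.
desktop_ms = [10.2, 4.5, 5.8, 1.4, 6.5]
onboard_ms = [31.5, 12.8, 18.2, 4.6, 22.4]


def total_and_fps(module_ms):
    """Return (total ms per frame, frames per second), assuming sequential modules."""
    total = sum(module_ms)
    return round(total, 1), round(1000.0 / total, 1)


for name, timings in (("Desktop", desktop_ms), ("Onboard", onboard_ms)):
    total, fps = total_and_fps(timings)
    print(f"{name}: {total} ms per frame, about {fps} FPS")
```

Under this assumption the sums reproduce the reported 28.4 ms and 89.5 ms totals, and the corresponding rates round to the approximately 35 FPS and 11 FPS listed in the table.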

Table 3. Initialization speed evaluation of three VO systems in different scenarios.

Sequences	Scene Description	GLP-VO	ORB-SLAM2	PL-SLAM
Lab 1	Dense urban environment (buildings)	1.23	2.34	1.45
Lab 2	Complex city streets (varied structures)	0.11	1.02	0.34
Lab 3	Semi-rural outskirts (farmland)	4.46	4.78	5.21

The time results are in seconds (s).

Table 4. Ablation study of the main components in the proposed GLP-VO pipeline using two evaluation metrics, namely Absolute Trajectory Error (ATE) and Relative Pose Error (RPE).

Configuration	Feature Extraction & Matching	Adaptive Feature Fusion	Optimization	f1_xyz ATE (cm)	f1_xyz RPE (cm)	f2_360_kidnap ATE (cm)	f2_360_kidnap RPE (cm)	Lab 1 ATE (m)	Lab 1 RPE (m)
Base	–	–	–	1.16	0.62	3.31	1.78	9.25	3.60
Variant 1	✓	–	–	1.05	0.56	2.85	1.52	7.80	3.02
Variant 2	✓	✓	–	0.96	0.50	2.48	1.34	6.20	2.42
Full GLP-VO	✓	✓	✓	0.91	0.46	2.26	1.21	5.10	2.05

ATE denotes the Absolute Trajectory Error, and RPE denotes the Relative Pose Error.
