Article

A Depth-Guided Local Outlier Rejection Methodology for Robust Feature Matching in Urban UAV Images

1 Program in Smart City Engineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon 22212, Republic of Korea
2 Department of Future and Smart Construction Research, Korea Institute of Civil Engineering and Building Technology, Goyang-si 10223, Republic of Korea
3 Department of Geoinformatic Engineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon 22212, Republic of Korea
* Author to whom correspondence should be addressed.
Drones 2025, 9(12), 869; https://doi.org/10.3390/drones9120869
Submission received: 4 November 2025 / Revised: 5 December 2025 / Accepted: 15 December 2025 / Published: 16 December 2025

Highlights

What are the main findings?
  • The proposed depth-guided local outlier rejection methodology integrates monocular depth estimation, DBSCAN clustering, and localized model estimation to improve feature matching reliability in complex urban UAV imagery.
  • Higher Recall and F1-score were achieved than with conventional 2D-based outlier rejection methods, while comparable Precision was maintained, demonstrating robust inlier preservation under depth and viewpoint variations.
What are the implications of the main findings?
  • Incorporating single-image depth information enhances geometric consistency and registration stability in depth-varying urban environments.
  • The methodology effectively corrects depth- and viewpoint-related mismatches, enhancing UAV image registration reliability.

Abstract

Urban UAV imagery presents challenges for reliable feature matching owing to complex 3D structures and depth discontinuities. Conventional 2D-based outlier rejection methods often fail to maintain geometric consistency under significant altitude variations or viewpoint differences, resulting in the rejection of valid correspondences. To overcome these limitations, a depth-guided local outlier rejection methodology is proposed that integrates monocular depth estimation, DBSCAN-based clustering, and local geometric model estimation. Depth information estimated from single UAV images is combined with feature correspondences to form pseudo-3D coordinates, enabling spatially localized registration. The proposed method was quantitatively evaluated in terms of Precision, Recall, F1-score, and Number of Matches, and was applied as a depth-guided front-end to three representative 2D-based outlier rejection schemes (RANSAC, LMedS, and MAGSAC++). Across all image sets, the depth-guided variants consistently achieved higher Recall and F1-score than their conventional 2D counterparts, while maintaining comparable Precision and keeping mismatches low. These results indicate that introducing depth-guided pseudo-3D constraints into the outlier rejection stage enhances geometric stability and correspondence reliability in complex urban UAV imagery. Accordingly, the proposed methodology provides a practical and scalable solution for accurate registration in depth-varying urban environments.

1. Introduction

Recent advances in unmanned aerial vehicle (UAV) technology have led to their widespread use in diverse fields such as urban planning, traffic monitoring, construction management, and geospatial data acquisition [1]. These aerial platforms can capture imagery from multiple viewing angles and varying flight altitudes, enabling efficient data acquisition in areas that are hazardous or inaccessible to personnel and providing imagery with relatively high spatial resolution [2,3]. Owing to these advantages, UAV imagery has become a key data source not only for applications such as real-time disaster monitoring, environmental management, and change detection, but also for high-precision urban tasks including urban change analysis, three-dimensional building modeling, and urban spatial structure assessment [4,5]. In particular, urban UAV images contain complex man-made structures and diverse viewpoints, and are therefore widely used in applications that require accurate spatial information, such as 3D reconstruction, Visual Simultaneous Localization and Mapping (Visual SLAM), and object recognition. These applications commonly rely on feature matching to identify and link corresponding points across multiple images as a fundamental component of the processing pipeline [6,7]. Feature matching plays a central role in accurately connecting overlapping regions between consecutive images, thereby enabling the generation of high-precision maps and robust object tracking. Consequently, the performance of feature matching algorithms for UAV imagery is a critical factor that directly influences the overall efficiency and accuracy of the processing workflow.

1.1. Related Work

Improving feature matching performance in UAV imagery requires a solid understanding of the algorithmic foundations established in general image registration and computer vision. To this end, previous studies have proposed various approaches to improve matching accuracy and registration reliability. Representative categories include global geometric model-based methods, multi-plane registration approaches, photogrammetric techniques, and depth-assisted approaches. These methods share the common goal of accurately estimating inter-image correspondences and eliminating false matches to achieve geometrically consistent registration.

1.1.1. Global Geometric Model-Based Approaches

Global geometric model-based approaches assume a single geometric transformation for the entire image and aim to enhance registration robustness by removing outliers through probabilistic sampling techniques. The Least Median of Squares (LMedS) estimator [8] enables stable model estimation by minimizing the median of squared residuals, and RANSAC [9] introduced a framework that iteratively estimates model parameters and identifies outliers. Subsequent studies have sought to improve RANSAC’s performance and efficiency: MLESAC [10] enhanced model fitting by introducing maximum likelihood estimation, and PROSAC [11] increased computational efficiency by utilizing matching confidence scores in the sampling process. Furthermore, Chum et al. [12] proposed LO-RANSAC, which improved model accuracy by incorporating a local optimization step after initial RANSAC estimation. USAC [13] further automated the registration procedure and enhanced robustness by integrating multiple complementary modules. More recently, Graph-Cut RANSAC [14] introduced a graph-cut optimization scheme for refined inlier set selection. Subsequently, Progressive-X [15] was proposed to address RANSAC’s unstable model estimation, achieving both computational efficiency and estimation stability by iteratively updating model quality metrics. MAGSAC++ [16] presented a method for estimating a more robust model against outliers by progressively readjusting the loss function. These global geometric model-based approaches have become standard techniques in various applications, including 3D reconstruction and image-based localization, effectively removing outliers and ensuring stable registration.

1.1.2. Multi-Plane Fitting-Based Approaches

Multi-plane fitting-based approaches aim to overcome the limitations of a single global geometric model by estimating multiple planar structures simultaneously from 2D image coordinates. This enables improved registration accuracy in environments with depth discontinuities. Zuliani et al. [17] introduced MultiRANSAC, a multi-structure fitting approach that estimates multiple planar models simultaneously by running parallel RANSAC instances. Isack and Boykov [18] proposed a global optimization method through the energy minimization-based PEARL algorithm to assign each data point to its optimal model. Toldo and Fusiello [19] developed the J-Linkage algorithm, which automatically separates and estimates multiple structures by clustering based on data consistency sets. Taking a different approach, Vincent and Laganière [20] proposed Sequential RANSAC, a simple yet effective method that iteratively identifies the most dominant model, removes its inliers, and repeats the process on the remaining data. Meanwhile, Magri and Fusiello [21] proposed T-Linkage, which addresses the limitations of J-Linkage by clustering models based on continuous similarities between data points, enabling more robust multi-structure estimation. These multi-model approaches have evolved to reduce the constraints imposed by the single-plane assumption.

1.1.3. Photogrammetry-Based Approaches

Photogrammetry-based approaches primarily aim to interpret geometric relationships between images and to precisely estimate camera exterior orientation parameters and 3D coordinates. In doing so, they have also improved feature matching accuracy and registration reliability. These approaches extend 2D inter-image correspondences into 3D spatial structures and eliminate geometrically inconsistent correspondences by estimating the positions of corresponding object points through triangulation. Triggs et al. [22] enhanced geometric consistency and improved matching reliability across multi-view images through the Bundle Adjustment (BA) algorithm. Furthermore, Snavely et al. [23] introduced the Structure-from-Motion (SfM) concept, which refines matching results by iteratively verifying triangulation outcomes for corresponding feature points across multiple views and rejecting inconsistent matches. Meanwhile, Schönberger et al. [24] developed COLMAP, an automated pipeline that integrates feature matching, triangulation, and BA, ensuring consistent registration quality and high matching accuracy even for large-scale image sets. Such studies have presented methods for extending 2D image correspondences to 3D spatial constraints and reducing inconsistent matches through geometric optimization.

1.1.4. Depth-Assisted Approaches

Recently, several approaches have been proposed that leverage predicted depth information to address the limitations of traditional 2D feature matching. Toft et al. [25] proposed a method that improves the matching performance of traditional feature descriptors by rectifying image patches using depth estimated from a single image. Wang et al. [26] enhanced correspondence quality by integrating local 3D curvature computed from depth into the matching similarity evaluation. Liu et al. [27] proposed a method for learning 3D geometry-aware descriptors using depth-based pseudo surface normals, while DXFeat [28] introduced a lightweight matching technique that improves matching robustness by employing depth as an auxiliary signal for keypoint detection and feature refinement. Additionally, Li et al. [29] improved place recognition matching stability by performing graph reranking based on RGB–Depth fused descriptors. These studies focus on improving the local quality of correspondences by utilizing depth as supplementary information to enhance matching performance.

1.2. Limitations of Existing Approaches

Despite these improvements in feature matching accuracy, existing approaches still exhibit the following structural limitations. First, multi-plane fitting methods segment models based on 2D distributions rather than actual depth information. This can erroneously group physically separate structures, such as buildings and the ground, as coplanar. Because these segmentation criteria ignore 3D spatial structure, structural errors frequently occur in urban environments [30,31]. Second, global geometric models oversimplify complex 3D structures by assuming a single plane. Consequently, valid inliers on structures with different depths or altitudes may be misclassified as outliers and incorrectly rejected. In complex scenes, the single-model assumption fails to adequately represent the actual spatial structure [32,33], and may lead to incorrect plane selection when multiple surfaces exist [34]. Third, photogrammetry-based approaches fundamentally rely on 2D image correspondences, limiting their ability to fully reflect the actual spatial structure in imagery with significant depth changes or large viewpoint differences. Furthermore, few-view imagery or narrow baseline configurations provide insufficient parallax for triangulation, limiting the ability to correct depth-related matching errors [22,35,36]. These limitations are particularly pronounced in urban UAV imagery environments characterized by significant altitude variations and depth discontinuities, where 2D-based registration alone is insufficient to consistently represent complex structures. Fourth, existing depth-assisted approaches share a common characteristic in that they utilize depth as a local (feature-level) auxiliary signal. That is, depth is used to enhance the similarity of individual correspondences, but it does not provide the capability to distinguish the geometric structure of the entire correspondence set (e.g., multiple planes, objects, or parallax layers) or to estimate separate geometric models for each structure. Furthermore, monocular depth estimates exhibit substantial absolute errors, making it difficult to reliably remove structural outliers in complex scenes using simple depth differences or feature-level auxiliary information alone [37,38]. For these reasons, existing methods typically assume a single global model and have inherent limitations in performing structure-wise geometric verification required in multi-structure environments.

1.3. Research Objectives

The objective of this study is to enhance the performance of feature point mismatch removal in urban UAV imagery by accounting for its complex 3D structures.
To this end, this study proposes a 3D registration-based depth-guided local outlier rejection methodology for improving feature point mismatch removal performance in urban UAV imagery, based on the following approaches. First, it applies a clustering technique to feature points that include 3D information to distinguish between structures at different depths or altitudes. Second, it introduces a monocular depth estimation algorithm to incorporate depth information. Third, it applies depth-guided local outlier rejection to images with significant depth or viewpoint variations to effectively correct matching errors.
The proposed methodology is analyzed using representative evaluation metrics, and its effectiveness is validated. Through this process, the method aims to minimize mismatches caused by depth discontinuities and viewpoint differences, thereby enhancing the registration stability and matching reliability of urban UAV imagery.

2. Methods

2.1. Dataset Configuration

In this study, eight UAV image datasets acquired in various urban environments are used to evaluate the performance of the proposed depth-guided local outlier rejection methodology, as shown in Figure 1. The datasets are classified into three categories according to the spatial characteristics of the acquisition environment and the composition of objects within each scene. The first category represents urban areas where roads and buildings coexist (Figure 1a–c). These regions exhibit significant height variations, along with distinct façade reflections and shadows. Because of the coexistence of complex structures and multilayered objects, this category is suitable for assessing the capability of the proposed method in removing mismatches arising from depth variations. The second category corresponds to urban areas containing vegetation (Figure 1d–f). These regions include trees, roads, and buildings distributed together, producing irregular textures and repetitive patterns. This type provides a challenging environment for evaluating the algorithm’s robustness to variations caused by both natural elements and artificial structures. The third category consists of urban image sets captured from different viewing angles (Figure 1g,h). This category is designed to evaluate matching accuracy under viewpoint differences. Through the datasets classified into three categories, the proposed method is comprehensively analyzed and evaluated in terms of its performance on UAV-based urban imagery.

2.2. Proposed Methodology

This study proposes a depth-guided local outlier rejection methodology to enhance feature point mismatch (outlier) removal performance by incorporating the complex 3D structures of urban UAV imagery. The proposed method combines depth information with the outlier rejection algorithm (e.g., RANSAC, LMedS, MAGSAC++) to achieve both accurate classification of outliers and reliable preservation of true inliers in urban environments. The overall workflow consists of four main stages, as illustrated in Figure 2.
First is the Monocular Depth Estimation stage: a single-image depth estimation network is applied to each UAV image to generate a pixel-wise depth map. Second is the Feature Matching stage: feature correspondences are extracted from the input image pairs using the Scale-Invariant Feature Transform (SIFT) algorithm, and each 2D feature coordinate is then augmented with the corresponding depth value, transforming the features into pseudo-3D coordinates. Third is the Pseudo-3D Feature Clustering stage: clustering is performed based on the spatial proximity of the depth-augmented feature points to group structures with similar depth distributions. Fourth is the Local Outlier Rejection-Based Registration and Inlier Extraction stage: RANSAC-based geometric model estimation is performed independently within each cluster, allowing the algorithm to derive locally optimized geometric models and extract inliers with improved robustness against depth variation.
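To make the four stages concrete, the following Python skeleton sketches how they could be chained. All helper functions (estimate_depth, match_sift, to_pseudo3d, cluster_pseudo3d, local_inliers) are illustrative placeholders elaborated in the subsections below; this is a minimal sketch, not the authors' released implementation.

```python
import cv2

def depth_guided_outlier_rejection(path1, path2):
    """Illustrative skeleton of the proposed four-stage pipeline."""
    img1 = cv2.imread(path1, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(path2, cv2.IMREAD_GRAYSCALE)
    depth1, depth2 = estimate_depth(path1), estimate_depth(path2)      # stage 1 (Section 2.2.1)
    pts1, pts2 = match_sift(img1, img2)                                # stage 2 (Section 2.2.2)
    xyz1, xyz2 = to_pseudo3d(pts1, depth1), to_pseudo3d(pts2, depth2)  # pseudo-3D coordinates
    labels1, labels2 = cluster_pseudo3d(xyz1), cluster_pseudo3d(xyz2)  # stage 3 (Section 2.2.3)
    # Stage 4 (Section 2.2.4): keep matches that are inliers under both cluster partitions.
    keep = local_inliers(pts1, pts2, labels1) & local_inliers(pts1, pts2, labels2)
    return pts1[keep], pts2[keep]
```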

2.2.1. Depth Estimation

In the depth estimation stage, a Monocular Depth Estimation technique is employed to estimate the depth value of each pixel from a single UAV image. This process addresses the limitation that feature points in UAV imagery are typically represented only by 2D coordinates, which often causes positional discrepancies in environments with significant depth or altitude variations. Furthermore, it provides an efficient means of indirectly incorporating 3D structural information without requiring multi-view imagery or additional sensor data. Representative algorithms for Monocular Depth Estimation include MiDaS [39], Depth Anything v2 [40], and Depth Pro [41]. Each algorithm differs in estimation accuracy and computational efficiency depending on the training dataset composition and network architecture. In this study, a preliminary experiment was conducted to compare the performance of multiple monocular depth estimation models in order to select a stable and consistent algorithm (Figure 3). The results showed that, compared with other monocular depth estimation models, Depth Pro more sharply preserved depth boundaries—such as vertical building surfaces, shadow edges, and transitions between different structures—without significant over- or underestimation. In particular, it maintained clear depth boundaries between buildings and trees, which is advantageous for preserving structural consistency. This, in turn, enables more accurate separation of structures in the subsequent pseudo-3D coordinate-based clustering stage. Accordingly, Depth Pro is adopted as the primary monocular depth estimation algorithm in this study, and depth values are generated for each image. However, it should be noted that the estimated depth values do not possess an absolute scale.
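As an illustration, a minimal invocation of Depth Pro following the usage shown in the public apple/ml-depth-pro repository is given below; the exact interface may differ between releases and should be treated as an assumption rather than the authors' code.

```python
import depth_pro  # assumes the apple/ml-depth-pro package and its downloaded weights

model, transform = depth_pro.create_model_and_transforms()
model.eval()

def estimate_depth(path):
    """Return a per-pixel depth map for one UAV image. The paper treats the
    estimated values as relative (no absolute scale) and normalizes them later."""
    image, _, f_px = depth_pro.load_rgb(path)  # f_px: focal length in pixels, if available
    prediction = model.infer(transform(image), f_px=f_px)
    return prediction["depth"].detach().cpu().numpy()
```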

2.2.2. Feature Matching

In the feature matching stage, the SIFT algorithm [42] is employed to extract and match feature points in each UAV image. Among various feature detection methods, SIFT is one of the most widely used algorithms, exhibiting high robustness and repeatability under diverse conditions such as illumination changes, rotation, and scale differences. These properties make it particularly suitable for urban UAV imagery, where façade reflections, shadows, and viewpoint changes frequently occur, while still enabling the extraction of consistent feature points (Figure 4).
In the pseudo-3D coordinate generation stage, the depth values obtained from the monocular depth estimation process are combined with the 2D image coordinates (x,y) of each matched point pair, converting them into pseudo-3D coordinates (Figure 5). The depth values are normalized to the range of 0–100. If the raw depth values are directly used in the pseudo-3D coordinates, the influence of the depth dimension becomes relatively small compared with the x and y coordinates during clustering. When the normalization range is set too small, a similar effect is observed, whereas an excessively large range causes the depth dimension to dominate the distance computation, leading to distorted clustering results. To avoid these extremes, a preliminary evaluation was conducted for various normalization ranges, and it was observed that ranges of 0–10, 0–50, and 0–1000 yielded only limited improvements in Recall and F1-score (Table 1). This behavior arises because the influence of depth is either insufficiently reflected or excessively amplified, resulting in distorted clustering. In contrast, when the depth values were normalized to 0–100 or 0–200, which provide a scale more balanced with the x and y axes, the performance remained stable.
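A compact OpenCV sketch of these two steps is shown below. The Lowe ratio threshold of 0.75 is a conventional default assumed here (the paper does not report its matching threshold), and d_max = 100 follows the normalization range selected above.

```python
import cv2
import numpy as np

def match_sift(img1, img2, ratio=0.75):
    """SIFT detection and brute-force matching with Lowe's ratio test."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    knn = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    good = [p[0] for p in knn if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    return pts1, pts2

def to_pseudo3d(pts, depth_map, d_max=100.0):
    """Append the depth sampled at each feature point, min-max normalized to [0, d_max]."""
    x = np.clip(pts[:, 0].astype(int), 0, depth_map.shape[1] - 1)
    y = np.clip(pts[:, 1].astype(int), 0, depth_map.shape[0] - 1)
    d = depth_map[y, x].astype(np.float64)
    d = (d - d.min()) / (d.max() - d.min() + 1e-12) * d_max
    return np.column_stack([pts, d])
```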

2.2.3. Clustering of Pseudo-3D Coordinates

In the clustering stage, the previously computed pseudo-3D coordinates are used to group feature points that share similar 3D spatial relationships. Because the depth values generated from monocular depth estimation are based on relative depth, their distributions may vary depending on viewpoint changes; thus, clustering is applied to aggregate feature points that are similar in both depth and spatial position into localized structural units. In the proposed method, clustering is performed using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [43]. DBSCAN does not require the number of clusters to be specified in advance and can automatically distinguish local structures based on differences in point density, which makes it suitable for mitigating depth uncertainty in pseudo-3D data and effectively grouping spatially similar feature points.
DBSCAN forms clusters based on local density using a distance threshold and a minimum number of neighboring points. In this study, to prevent the formation of excessively small clusters in various UAV image datasets and to stably reflect the local structure of the pseudo-3D coordinates, the distance threshold is set to 30 pixels and the minimum number of neighboring points is set to 10. The distance threshold is chosen to be approximately one-third of the maximum normalized depth value (100), which corresponds to a distance scale that divides the depth range into three levels—low, medium, and high. In the depth normalization preliminary experiments (Table 1), the distance threshold for each normalization range was likewise determined to be approximately one-third of the corresponding maximum depth value. Meanwhile, the minimum number of neighboring points in DBSCAN is recommended to be greater than the dimensionality of the space [43], and at least 8–10 inliers are generally advised to ensure stable density estimation and robust homography-based model estimation [9,44]. Accordingly, the minimum number of neighboring points is set to 10 in this study.
During the clustering stage using DBSCAN, any matching points that are not assigned to a cluster are grouped into an additional, separate cluster. This is because some isolated matches may still represent valid correspondences, even though they fail to be included in regular clusters owing to insufficient local point density. By separating these points into their own cluster, the proposed method preserves potentially useful matches while preventing unnecessary errors in the subsequent local outlier rejection stage. Figure 6 presents an example of 3D clustering results obtained from the pseudo-3D coordinates, where each cluster is distinguished using a different color and marker shape.
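With scikit-learn, the clustering stage with the stated parameters (eps = 30, min_samples = 10), including the extra cluster for unassigned matches, can be sketched as follows.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_pseudo3d(xyz, eps=30.0, min_samples=10):
    """DBSCAN over (N, 3) pseudo-3D points; noise points (label -1) are
    collected into one additional cluster rather than discarded."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(xyz)
    labels[labels == -1] = labels.max() + 1
    return labels
```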

2.2.4. Local Clustering-Based RANSAC

In the final stage, a local clustering-based outlier rejection is applied to perform localized registration within each cluster of the image pair and to extract reliable inliers (Figure 7). First, for each cluster, local outlier rejection is independently performed on the pseudo-3D coordinates and their corresponding feature matches obtained in the previous stage. This process removes outliers that do not conform to the geometric model within each cluster and yields a consistent set of inliers in each local region.
Subsequently, among the correspondences that are classified as inliers in each of the two images, only those matches that remain inliers simultaneously in both images are retained as the final valid correspondences. In Figure 7, the dots and triangles represent these final valid inlier matches, and the yellow dots and triangles illustrate examples of points that belong to different clusters but are still detected as inliers. In other words, the final valid inliers are not constrained to lie in identical clusters across the two images; instead, they are correspondences that are geometrically consistent within the local clusters of each image and lie in the intersection of inlier sets.
This procedure is designed to alleviate the error accumulation that arises when conventional outlier rejection assumes a single global model for the entire image, and to perform localized registration that reflects the geometric characteristics of each cluster. Through this process, the proposed method overcomes the limitations of traditional global, single-plane outlier rejection schemes and achieves robust and structurally consistent feature matching even in urban UAV imagery with significant depth variations and viewpoint differences.
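A minimal sketch of this per-cluster verification is given below, using RANSAC homography estimation as the local geometric model (consistent with the homography-based model estimation referenced in Section 2.2.3). The 3-pixel reprojection threshold and the skipping of very small clusters are illustrative assumptions.

```python
import cv2
import numpy as np

def local_inliers(pts_src, pts_dst, labels, min_pts=10, reproj_thresh=3.0):
    """Run homography RANSAC independently inside each cluster of one image
    and return a boolean inlier mask over all correspondences."""
    mask = np.zeros(len(pts_src), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        if len(idx) < min_pts:  # too few points for a stable local model
            continue
        H, m = cv2.findHomography(pts_src[idx], pts_dst[idx], cv2.RANSAC, reproj_thresh)
        if m is not None:
            mask[idx] = m.ravel().astype(bool)
    return mask

# Final valid correspondences: inliers under the cluster partitions of both images.
keep = local_inliers(pts1, pts2, labels1) & local_inliers(pts1, pts2, labels2)
```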

2.3. Implementation Details

During the preparation of this manuscript, the authors used ChatGPT (GPT-5.1, OpenAI) only for superficial text editing (grammar, spelling, and formatting). All experiments were conducted on a Windows 11 system equipped with an NVIDIA GeForce RTX 4070 Ti Super GPU and an AMD Ryzen 5 7500F CPU, using Python 3.12.

3. Results

In this study, the performance of the proposed depth-guided local outlier rejection methodology was validated by comparison with the RANSAC, LMedS, and MAGSAC++ algorithms. For each method, a performance comparison was conducted between two configurations: one where the proposed procedure was applied to each outlier rejection algorithm and another where the original algorithm alone was used, under identical input images and parameter settings. First, the average performance of each method and its behavior across different image types were analyzed to quantitatively evaluate the performance improvements achieved by the proposed methodology. Quantitative evaluation was conducted using Precision, Recall, F1-score, and Number of Matches, where the Number of Matches denotes the total number of correspondences that remained after applying each method. In addition, a significance test on the F1-score was performed to verify whether the performance improvements of the proposed methodology are statistically significant.
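For reference, the three quality metrics follow the standard confusion-matrix definitions, where TP denotes true correspondences retained, FP false correspondences retained, and FN true correspondences wrongly rejected; a minimal helper:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```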

3.1. Overall Performance Comparison

Overall, the proposed depth-guided local outlier rejection methodology exhibited superior performance over all three baseline algorithms in terms of Recall, F1-score, and Number of Matches. This improvement can be attributed to the introduction of pseudo-3D geometric constraints using depth information, which allows true correspondences that were previously discarded by conventional methods owing to structural inconsistencies or altitude differences to be effectively preserved. Although the pseudo-3D depth used in this study is based on relative depth, it provides sufficiently consistent ordering of terrain elevations. As a result, a larger number of matches are retained overall, while false correspondences are stably removed (Table 2).
For LMedS, despite its relatively low Recall and F1-score compared with the other algorithms in the baseline configuration, the application of the proposed methodology led to substantial performance gains, with Recall and F1-score increasing by approximately 25.64 percentage points and 0.32, respectively. RANSAC and MAGSAC++ also showed improvements, with Recall increased by approximately 14.99 and 10.56 percentage points and F1-score increased by 0.10 and 0.07, respectively, compared with the original methods. In other words, the proposed methodology enhances inlier preservation while maintaining Precision at a comparable level.
To assess whether these improvements are statistically significant, paired t-tests were conducted on the F1-score values. The resulting p-values for RANSAC, LMedS, and MAGSAC++ were approximately 0.0002, 0.0022, and 0.0094, respectively, indicating statistically significant performance gains over the corresponding baseline algorithms. This confirms that the improvement achieved by the proposed depth-guided pseudo-3D geometric constraints is not merely a numerical increase but a consistent and statistically supported enhancement.
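Such a paired test compares the per-image-set F1-scores of each baseline with its depth-guided variant; with SciPy this reduces to a one-liner (the variable names below are illustrative, not the paper's released code).

```python
from scipy.stats import ttest_rel

# f1_proposed, f1_baseline: per-image-set F1-scores for one algorithm (eight sets here)
t_stat, p_value = ttest_rel(f1_proposed, f1_baseline)
```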

3.2. Performance Analysis by Image Category

Overall, the proposed method produced improved results compared with the conventional approaches, although the degree and pattern of improvement varied depending on the characteristics of the environment. In Class 1, where complex high-rise buildings and roads coexist, the proposed method generally achieved superior performance in terms of match preservation rate, Recall, and F1-score, while maintaining Precision at a level comparable to that of the original methods. In particular, clear improvements were observed in structurally and radiometrically challenging regions, such as building roofs and building–road boundary areas. Figure 8 presents a visual comparison of the matching results obtained by the proposed method and the conventional approach for a representative image pair from the Class 1 set. The symbols in each image of Figure 8 are defined as follows: green circles (TP) indicate true correspondences correctly identified as inliers; red circles (FN) represent true correspondences that were incorrectly classified as outliers; orange crosses (FP) denote false correspondences that were incorrectly classified as inliers; and cyan crosses (TN) indicate false correspondences that were correctly rejected. In the conventional method, a considerable number of true correspondences were misclassified as outliers (FN), whereas under the proposed method the proportion of such points that were correctly retained as TPs (green circles) increased, leading to improved performance. The quantitative comparison results for the Class 1 sets are summarized in Table 3.
In image set 1-1, the proposed method improved Recall, F1-score, and the Number of Matches across all algorithms compared with the conventional approaches. In particular, for LMedS, although the baseline Recall and F1-score were relatively low compared with the other algorithms, the application of the proposed method resulted in substantial gains, with Recall and F1-score increasing by approximately 53.12 percentage points and 0.59, respectively.
In image set 1-2, the proposed method yielded higher Recall, F1-score, and Number of Matches than the original methods, while maintaining comparable Precision. In the case of RANSAC, the Precision of the proposed configuration was approximately 0.56 percentage points lower than that of the original RANSAC, because the larger number of matches classified as inliers also introduced a small number of additional false correspondences. Nevertheless, the preservation of a greater number of true correspondences led to an increase in Recall of about 25.65 percentage points and an improvement in F1-score of about 0.18. For LMedS, Precision remained at 100%, identical to the baseline, whereas Recall and F1-score increased by approximately 38.48 percentage points and 0.54, respectively, representing a larger improvement than for the other algorithms.
For image set 1-3, the proposed method again produced higher Recall and F1-score than the original methods. However, in the case of LMedS, the improvements in Recall and F1-score were relatively modest—approximately 4.64 percentage points and 0.07, respectively—compared with those observed for image sets 1-1 and 1-2.
Class 2, which contains a mixture of vegetation and man-made structures, exhibited improved performance with the proposed method in terms of Recall and match preservation, owing to the maintenance of a larger number of inliers compared with the conventional approaches. In particular, a substantial number of inliers was preserved along building–vegetation boundary regions. Figure 9 provides a visual comparison of the matching results obtained by the proposed method and the conventional methods for a representative image pair from the Class 2 set; the point symbols in each image are defined in the same manner as in Figure 8. The corresponding quantitative performance metrics are summarized in Table 4, where the proposed method generally maintains Precision at a level comparable to that of the baseline methods while improving Recall and F1-score.
In the case of image set 2-1, the application of the proposed method to RANSAC and LMedS led to substantial improvements, with Recall increasing by approximately 16.46 and 16.77 percentage points, and F1-score increasing by about 0.11 and 0.28, respectively. For MAGSAC++, both the proposed and conventional configurations exhibited similar Recall and F1-score values (approximately 64.24% and 0.75), while the Number of Matches decreased by two when the proposed method was applied. This reduction is attributed to the removal of false correspondences located near building–vegetation boundaries, which resulted in a slight improvement in Precision of about 0.80 percentage points.
For image set 2-2, the proposed method increased Recall, F1-score, and the Number of Matches for all algorithms compared with the original methods. In the case of LMedS, Precision decreased slightly by approximately 0.28 percentage points because the increased number of inliers also introduced a small number of additional mismatches. Nevertheless, Recall increased by 19.29 percentage points, F1-score improved by 0.15, and the Number of Matches increased by about 103, indicating a clear performance gain.
A similar trend was observed for image set 2-3, where the proposed method again yielded higher Recall and F1-score than the corresponding conventional methods. For RANSAC and MAGSAC++, Precision decreased slightly by approximately 0.09 and 0.20 percentage points, respectively, as more correspondences were classified as inliers and some additional false matches were included. However, Recall increased by approximately 15.46 and 15.14 percentage points, F1-score improved by about 0.09 for both algorithms, and the Number of Matches increased by 146 and 144, respectively.
Across the Class 2 sets, the proposed method was thus able to preserve a larger number of valid correspondences while maintaining a comparable level of outlier rejection. This behavior can be interpreted as the result of combining DBSCAN-based clustering with depth information, which mitigates visual ambiguity arising from repetitive textures and vegetated regions, thereby improving Recall while keeping Precision close to that of the conventional methods.
Class 3 represents a challenging environment in which large viewpoint differences make feature matching itself difficult, resulting in Recall, F1-score, and the Number of Matches being generally lower than those of the other classes. This degradation is mainly caused by a reduction in the visible overlap of corresponding points between images owing to changes in viewing direction and camera pose, combined with geometric distortions induced by spatial resolution differences and viewpoint inconsistency [45]. Under such conditions, matches are sparsely distributed across the image, and only a limited number of regions form meaningful clusters. Within clustered regions, the use of depth information for local geometric model estimation improves matching performance; however, in sparse areas where an insufficient number of points is available, most correspondences are grouped into a single cluster and outlier rejection is performed in a manner similar to the conventional methods, leading to comparable performance in those regions.
Even under these unfavorable conditions, the proposed methodology preserved a larger number of true correspondences overall than the conventional approaches, resulting in increased Recall and F1-score. In other words, although the overall matching success rate remained low, the proposed method demonstrated a more robust model estimation capability. Figure 10 provides a visual comparison of the matching results obtained by the proposed method and the conventional methods for the Class 3 sets, and the meanings of the symbols in Figure 10 are identical to those in the previous figures. The corresponding quantitative results for Class 3 are summarized in Table 5.
In image set 3-1, Precision was 100% for both the proposed and conventional configurations, while the proposed method yielded higher Recall, F1-score, and Number of Matches than the original methods. In particular, for LMedS, Recall and F1-score increased by 23.59 percentage points and 0.37, respectively.
In image set 3-2, the application of the proposed method to RANSAC and LMedS led to improved Recall and F1-score compared with the conventional configurations. For LMedS, all matches were classified as outliers in the original method, resulting in a Number of Matches of 0, whereas the proposed method preserved 18 valid correspondences and achieved Recall and F1-score values of 28.12% and 0.44, respectively. This behavior indicates that depth information helped alleviate projection distortions caused by viewpoint differences, enabling stable preservation of matching points even in conditions where the original method failed to estimate a valid model.
By contrast, for MAGSAC++, the proposed and conventional configurations produced identical performance, with Recall of approximately 90.62%, F1-score of about 0.95, and 58 matching points. Because MAGSAC++ already exhibits strong suppression of false correspondences, its high Recall and F1-score leave limited room for additional gains from the depth-guided local clustering. Nevertheless, across most algorithms, the proposed method preserved valid correspondences more effectively than the conventional approaches, providing stable performance improvements even in scenarios with large viewpoint differences.

3.3. Overall Results

In this section, the performance of the proposed depth-guided local outlier rejection methodology is evaluated through comparative analysis with conventional methods. As a result, the proposed method maintained Precision at a level comparable to that of the conventional methods across most environments, while exhibiting consistent improvements in Recall, F1-score, and the Number of Matches. However, slight decreases in Precision were observed for RANSAC in image set 1-2, LMedS in image set 2-2, and RANSAC and MAGSAC++ in image set 2-3, compared with their respective baseline configurations. In the conventional global modeling schemes, complex three-dimensional structures are often oversimplified as a single plane, causing correspondences located on elevated structures relative to the ground plane to be classified as outliers. Consequently, Precision tends to be relatively high, whereas the Number of Matches and Recall are substantially reduced. By contrast, the proposed methodology estimates local geometric models independently for each cluster separated into different depth ranges, thereby preserving correspondences on elevated structures together with those on the ground. In this process, a very small number of additional false correspondences are introduced, leading to minor reductions in Precision for a few datasets; however, a larger number of true correspondences are preserved, resulting in overall improvements in Recall, F1-score, and the Number of Matches. Figure 11 provides a visual comparison of the feature matching results between the conventional methods and the proposed method for image sets 1-2, 2-2, and 2-3, in which only slight decreases in Precision were observed.
The proposed method achieved higher Recall and F1-score than the conventional approaches not only in image sets containing complex mixtures of high-rise buildings and roads, but also in scenes where vegetation and artificial structures coexist or where illumination and viewpoint changes are pronounced. Meanwhile, in image sets with large viewpoint differences or low overlap, the overall Number of Matches decreased; nevertheless, the proposed method maintained a relatively higher number of inliers even under these unfavorable conditions. Consequently, the proposed depth-guided local outlier rejection demonstrated stable registration performance across a range of conditions, including environmental complexity, depth discontinuities, and viewpoint differences, while simultaneously minimizing mismatches and improving inlier preservation.

4. Conclusions

This study proposed a depth-guided local outlier rejection methodology to address the degradation in mismatch removal performance observed in conventional registration methods for urban UAV imagery with complex three-dimensional structures. Conventional 2D-based global outlier rejection methods, such as RANSAC, LMedS, and MAGSAC++, rely on a single-plane model assumption, which limits their effectiveness in scenes with significant altitude variations or large viewpoint differences and often leads to the misclassification or rejection of true correspondences (inliers). To overcome these limitations, the proposed approach integrates depth information estimated from a single image into feature correspondences to construct pseudo-3D coordinates. Subsequently, DBSCAN-based clustering and local outlier rejection are applied independently within each cluster to ensure geometric consistency and local registration stability within each spatial region. Quantitative evaluation showed that the proposed method maintained a level of Precision comparable to the original RANSAC, LMedS, and MAGSAC++ configurations, while consistently achieving higher Recall, F1-score, and Number of Matches across the datasets. This improvement can be attributed to the incorporation of pseudo-3D geometric constraints derived from depth information, which effectively mitigated structural inconsistencies and projection distortions, thereby preserving valid correspondences that conventional methods tended to discard. In summary, the proposed method enhanced inlier preservation without compromising mismatch suppression, establishing it as an efficient registration approach that strengthens overall geometric consistency and correspondence reliability in urban UAV imagery.
The key contributions of this study are as follows:
(1)
By integrating Monocular Depth Estimation results into feature coordinates to form a pseudo-3D space, the method enables geometrically consistent correspondence verification even in urban imagery containing depth discontinuities.
(2)
Through a cluster-based local outlier rejection structure, it performs independent model estimation for each local region, overcoming the limitations of global registration.
(3)
Despite the integration of depth information, the method maintains stable Precision across all outlier rejection algorithms considered, confirming the effectiveness of the pseudo-3D constraint in selectively incorporating depth cues into the registration process.
Therefore, the proposed depth-guided local outlier rejection methodology is evaluated as a practical and scalable registration method capable of improving geometric stability and correspondence reliability in urban UAV imagery characterized by complex structures and significant depth variations.
Future work will focus on enhancing computational efficiency by optimizing the inference speed of the Monocular Depth Estimation model, which currently requires approximately one second per frame even with GPU-based parallel processing. In addition, a Local Adaptive RANSAC structure will be investigated to dynamically adjust inlier decision thresholds according to spatial variations in depth uncertainty, with the aim of simultaneously improving registration stability and processing efficiency. Through these extensions, the proposed method is expected to evolve into a high-efficiency registration solution applicable to real-time UAV image processing and large-scale urban mapping.

Author Contributions

Conceptualization, G.L. and K.C.; methodology, G.L. and K.C.; software, G.L. and K.C.; validation, G.L., J.Y. and K.C.; formal analysis, G.L., J.Y. and K.C.; investigation, G.L. and K.C.; resources, G.L., J.Y. and K.C.; writing—original draft preparation, G.L.; writing—review and editing, K.C.; visualization, G.L.; supervision, K.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by a grant from the Korea Agency for Infrastructure Technology Advancement (KAIA) funded by the Ministry of Land, Infrastructure, and Transport [Grant number RS-2022-00143782 (Development of Fixed/Moving Platform Based Dynamic Thematic Map Generation Technology for Next-generation Digital Land Information Construction)].

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing multi-institutional R&D program and are currently being utilized for related research within the consortium. Requests to access the datasets should be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UAV: Unmanned Aerial Vehicle
Visual SLAM: Visual Simultaneous Localization and Mapping
RANSAC: Random Sample Consensus
LMedS: Least Median of Squares
BA: Bundle Adjustment
SfM: Structure-from-Motion
SIFT: Scale-Invariant Feature Transform
DBSCAN: Density-Based Spatial Clustering of Applications with Noise

References

  1. Zhou, G.; Ambrosia, V.; Gasiewski, A.J.; Bland, G. Foreword to the Special Issue on Unmanned Airborne Vehicle (UAV) Sensing Systems for Earth Observations. IEEE Trans. Geosci. Remote Sens. 2009, 47, 687–689.
  2. Colomina, I.; Molina, P. Unmanned Aerial Systems for Photogrammetry and Remote Sensing: A Review. ISPRS J. Photogramm. Remote Sens. 2014, 92, 79–97.
  3. Saikhom, V.; Kalita, M. UAV for Remote Sensing Applications: An Analytical Review. In Proceedings of the Emerging Global Trends in Engineering and Technology 2022, Guwahati, India, 21–22 April 2022; Springer Nature: Singapore, 2022; pp. 51–59.
  4. Zhang, Z.; Zhu, L. A Review on Unmanned Aerial Vehicle Remote Sensing: Platforms, Sensors, Data Processing Methods, and Applications. Drones 2023, 7, 398.
  5. Butilă, E.V.; Boboc, R.G. Urban Traffic Monitoring and Analysis Using Unmanned Aerial Vehicles (UAVs): A Systematic Literature Review. Remote Sens. 2022, 14, 620.
  6. Fang, Z.; Ma, H.; Zhu, X.; Guo, X.; Zhou, R. SEFM: A Sequential Feature Point Matching Algorithm for Object 3D Reconstruction. In Proceedings of the International Conference on Frontier Computing 2020, Singapore, 10–13 July 2020; Springer: Singapore, 2020; pp. 283–296.
  7. Azzam, R.; Taha, T.; Huang, S.; Zweiri, Y. Feature-Based Visual Simultaneous Localization and Mapping: A Survey. SN Appl. Sci. 2020, 2, 224.
  8. Rousseeuw, P.J. Least Median of Squares Regression. J. Am. Stat. Assoc. 1984, 79, 871–880.
  9. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun. ACM 1981, 24, 381–395.
  10. Torr, P.H.; Zisserman, A. MLESAC: A New Robust Estimator with Application to Estimating Image Geometry. Comput. Vis. Image Underst. 2000, 78, 138–156.
  11. Chum, O.; Matas, J. Matching with PROSAC—Progressive Sample Consensus. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; IEEE: Washington, DC, USA, 2005; Volume 1, pp. 220–226.
  12. Chum, O.; Matas, J.; Kittler, J. Locally Optimized RANSAC. In Proceedings of the DAGM 2003, Magdeburg, Germany, 10–12 September 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 236–243.
  13. Raguram, R.; Chum, O.; Pollefeys, M.; Matas, J.; Frahm, J.M. USAC: A Universal Framework for Random Sample Consensus. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2022–2038.
  14. Barath, D.; Matas, J. Graph-Cut RANSAC. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 6733–6741.
  15. Barath, D.; Matas, J. Progressive-X: Efficient, Anytime, Multi-Model Fitting Algorithm. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Seoul, Republic of Korea, 2019; pp. 3780–3788.
  16. Barath, D.; Noskova, J.; Ivashechkin, M.; Matas, J. MAGSAC++: A Fast, Reliable and Accurate Robust Estimator. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE: Seattle, WA, USA, 2020; pp. 1304–1312.
  17. Zuliani, M.; Kenney, C.S.; Manjunath, B.S. The MultiRANSAC Algorithm and Its Application to Detect Planar Homographies. In Proceedings of the IEEE International Conference on Image Processing 2005, Genova, Italy, 14 September 2005; IEEE: Genoa, Italy, 2005; Volume 3, p. III-153.
  18. Isack, H.; Boykov, Y. Energy-Based Geometric Multi-Model Fitting. Int. J. Comput. Vis. 2012, 97, 123–147.
  19. Toldo, R.; Fusiello, A. Robust Multiple Structures Estimation with J-Linkage. In Proceedings of the European Conference on Computer Vision 2008, Marseille, France, 12–18 October 2008; Springer: Berlin, Germany, 2008; pp. 537–547.
  20. Vincent, E.; Laganière, R. Detecting Planar Homographies in an Image Pair. In Proceedings of the 2nd International Symposium on Image and Signal Processing and Analysis (ISPA 2001), Pula, Croatia, 19–21 June 2001; IEEE: Pula, Croatia, 2001; pp. 182–187.
  21. Magri, L.; Fusiello, A. T-Linkage: A Continuous Relaxation of J-Linkage for Multi-Model Fitting. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; IEEE: Columbus, OH, USA, 2014; pp. 3954–3961.
  22. Triggs, B.; McLauchlan, P.F.; Hartley, R.I.; Fitzgibbon, A.W. Bundle Adjustment—A Modern Synthesis. In Vision Algorithms: Theory and Practice; Triggs, B., Zisserman, A., Szeliski, R., Eds.; Lecture Notes in Computer Science 1883; Springer: Berlin/Heidelberg, Germany, 2000; pp. 298–372.
  23. Snavely, N.; Seitz, S.M.; Szeliski, R. Photo Tourism: Exploring Photo Collections in 3D. In ACM SIGGRAPH 2006 Papers; ACM: Boston, MA, USA, 2006; pp. 835–846.
  24. Schönberger, J.L.; Frahm, J.M. Structure-from-Motion Revisited. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Las Vegas, NV, USA, 2016; pp. 4104–4113.
  25. Toft, C.; Turmukhambetov, D.; Sattler, T.; Kahl, F.; Brostow, G.J. Single-Image Depth Prediction Makes Feature Matching Easier. In Proceedings of the European Conference on Computer Vision 2020, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 473–492.
  26. Wang, S.; Kannala, J.; Pollefeys, M.; Barath, D. Guiding Local Feature Matching with Surface Curvature. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023, Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 17981–17991.
  27. Liu, Y.; Lai, W.; Zhao, Z.; Xiong, Y.; Zhu, J.; Cheng, J.; Xu, Y. LiftFeat: 3D Geometry-Aware Local Feature Matching. arXiv 2025, arXiv:2505.03422.
  28. DXFeat: Depth-Aware Features for Robust Image Matching. OpenReview (ICLR submission), 2025.
  29. Li, K.; Ou, Y.; Ning, J.; Kong, F.; Cai, H.; Li, H. Unified Depth-Guided Feature Fusion and Reranking for Hierarchical Place Recognition. Sensors 2025, 25, 4056.
  30. Saval-Calvo, M.; Azorin-Lopez, J.; Fuster-Guillo, A.; Garcia-Rodriguez, J. Three-Dimensional Planar Model Estimation Using Multi-Constraint Knowledge Based on k-Means and RANSAC. Appl. Soft Comput. 2015, 34, 572–586.
  31. He, H.; Xiong, W.; Zhou, F.; He, Z.; Zhang, T.; Sheng, Z. Topology-Aware Multi-View Street Scene Image Matching for Cross-Daylight Conditions Integrating Geometric Constraints and Semantic Consistency. ISPRS Int. J. Geo Inf. 2025, 14, 212.
  32. Martínez-Otzeta, J.M.; Rodríguez-Moreno, I.; Mendialdua, I.; Sierra, B. RANSAC for Robotic Applications: A Survey. Sensors 2023, 23, 327.
  33. Xu, X.; Cheong, L.F.; Li, Z. Learning for Multi-Model and Multi-Type Fitting. arXiv 2019, arXiv:1901.10254.
  34. Gallo, O.; Manduchi, R.; Rafii, A. CC-RANSAC: Fitting Planes in the Presence of Multiple Surfaces in Range Data. Pattern Recognit. Lett. 2011, 32, 403–410.
  35. Chang, R.; Yu, K.; Yang, Y. Self-Supervised Monocular Depth Estimation Using Global and Local Mixed Multi-Scale Feature Enhancement Network for Low-Altitude UAV Remote Sensing. Remote Sens. 2023, 15, 3275.
  36. Madhuanand, L.; Nex, F.; Yang, M.Y. Self-Supervised Monocular Depth Estimation from Oblique UAV Videos. ISPRS J. Photogramm. Remote Sens. 2021, 176, 1–14.
  37. Lahiri, S.; Ren, J.; Lin, X. Deep Learning-Based Stereopsis and Monocular Depth Estimation Techniques: A Review. Vehicles 2024, 6, 305–351.
  38. Zhang, J.; Wu, Y.; Jiang, H. Survey on Monocular Metric Depth Estimation. Computers 2025, 14, 502.
  39. Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; Koltun, V. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1623–1637.
  40. Yang, L.; Kang, B.; Huang, Z.; Zhao, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything V2. arXiv 2024, arXiv:2406.09414.
  41. Bochkovskii, A.; Delaunoy, A.; Germain, H.; Santos, M.; Zhou, Y.; Richter, S.R.; Koltun, V. Depth Pro: Sharp Monocular Metric Depth in Less than a Second. arXiv 2024, arXiv:2410.02073.
  42. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
  43. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; AAAI Press: Portland, OR, USA, 1996; pp. 226–231.
  44. Fragoso, V.; Sweeney, C.; Sen, P.; Turk, M. ANSAC: Adaptive Non-Minimal Sample and Consensus. arXiv 2017, arXiv:1709.09559.
  45. Chen, K.; Snavely, N.; Makadia, A. Wide-Baseline Relative Camera Pose Estimation with Directional Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 3258–3268.
Figure 1. UAV image datasets collected in various urban environments for evaluating the proposed depth-guided local outlier rejection method: (a–c) urban areas where roads and buildings coexist, characterized by significant height variations and distinct façade reflections; (d–f) urban areas containing vegetation, including trees, roads, and buildings with irregular textures and repetitive patterns; (g,h) urban image sets captured from different viewing angles.
Figure 2. Workflow of the proposed depth-guided local outlier rejection methodology.
Figure 3. Depth estimation results for each monocular depth model.
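For readers who wish to reproduce the depth-estimation step illustrated in Figure 3, the following is a minimal sketch using the publicly available MiDaS model via torch.hub. The model variant ("DPT_Large"), the image path, and the post-processing are illustrative assumptions, not the exact configuration used in this study.

import cv2
import torch

# Assumed setup: a public MiDaS variant, not necessarily the paper's exact model.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

img = cv2.cvtColor(cv2.imread("uav_image.jpg"), cv2.COLOR_BGR2RGB)  # hypothetical path
batch = transforms.dpt_transform(img)  # resize/normalize into a 1x3xHxW tensor

with torch.no_grad():
    pred = midas(batch)
    # Upsample the prediction back to the input resolution.
    depth = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze().cpu().numpy()  # relative (inverse) depth map, one value per pixel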
Figure 4. Example of SIFT feature detection and matching between reference and target UAV images in an urban environment: (a) reference image; (b) target image.
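As context for Figure 4, a minimal OpenCV sketch of the SIFT detection-and-matching step is shown below; the file names and the 0.75 ratio threshold are illustrative assumptions.

import cv2
import numpy as np

img1 = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical paths
img2 = cv2.imread("target.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Two-nearest-neighbor matching followed by Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])  # (N, 2) points in image 1
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])  # (N, 2) points in image 2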
Figure 5. Visualization of pseudo-3D coordinates generated by combining estimated depth values with matched feature points. The color scale represents normalized depth values from 0 to 100 (purple: low depth; yellow: high depth): (a) reference image; (b) target image.
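The pseudo-3D coordinates visualized in Figure 5 pair each matched pixel with a normalized depth value sampled from the estimated depth map. A minimal sketch is given below; the function name and the nearest-pixel sampling are illustrative, while the default 0–100 normalization range follows the best-performing setting in Table 1.

import numpy as np

def to_pseudo_3d(points_xy, depth_map, depth_range=100.0):
    # Normalize the relative depth map to [0, depth_range].
    d = depth_map.astype(np.float64)
    d = (d - d.min()) / (d.max() - d.min() + 1e-12) * depth_range
    # Sample the depth at the nearest pixel of each feature location.
    xs = np.clip(np.rint(points_xy[:, 0]).astype(int), 0, d.shape[1] - 1)
    ys = np.clip(np.rint(points_xy[:, 1]).astype(int), 0, d.shape[0] - 1)
    # Stack (x, y, z) into pseudo-3D coordinates.
    return np.column_stack([points_xy, d[ys, xs]])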
Figure 6. Example results of 3D clustering applied to pseudo-3D coordinates for each image within a dataset: (a) clustering result for Image 1; (b) clustering result for Image 2.
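The clusters in Figure 6 can be obtained by running DBSCAN directly on the pseudo-3D coordinates, for example with scikit-learn as sketched below; the eps and min_samples values are illustrative and would need tuning per dataset.

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_pseudo_3d(pts3d, eps=30.0, min_samples=10):
    # Density-based clustering in (x, y, depth) space; label -1 marks noise.
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts3d)

# Quick self-contained check on two synthetic point groups.
rng = np.random.default_rng(0)
demo = np.vstack([rng.normal(0, 5, (50, 3)), rng.normal(100, 5, (50, 3))])
print(np.unique(cluster_pseudo_3d(demo)))  # two cluster labels expected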
Figure 7. Example of local outlier rejection results for clustered pseudo-3D coordinates: dots and triangles indicate the final valid inliers; yellow symbols represent inliers belonging to different clusters but still detected as geometrically consistent correspondences.
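The local outlier rejection illustrated in Figure 7 fits a geometric model within each cluster rather than a single global model. The sketch below uses a per-cluster RANSAC homography as one plausible local model; the model choice, the reprojection threshold, and the labeling convention are assumptions made for illustration.

import numpy as np
import cv2

def local_outlier_rejection(pts1, pts2, labels, thresh=3.0):
    # pts1/pts2: (N, 2) matched coordinates; labels: cluster id per match.
    keep = np.zeros(len(pts1), dtype=bool)
    for lab in set(np.asarray(labels).tolist()) - {-1}:  # skip DBSCAN noise (-1)
        idx = np.flatnonzero(labels == lab)
        if len(idx) < 4:  # a homography needs at least 4 correspondences
            continue
        _, mask = cv2.findHomography(pts1[idx], pts2[idx], cv2.RANSAC, thresh)
        if mask is not None:  # keep this cluster's locally consistent inliers
            keep[idx[mask.ravel().astype(bool)]] = True
    return keep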
Figure 8. Feature matching comparison for a representative Class 1 image pair using various outlier rejection methods.
Figure 9. Feature matching comparison for a representative Class 2 image pair using various outlier rejection methods.
Figure 10. Feature matching comparison for a representative Class 3 image pair using various outlier rejection methods.
Figure 11. Comparison of baseline and proposed methods on precision-degraded image sets.
Table 1. Performance comparison across different depth normalization ranges.

Depth Normalization Range | Precision (%) | Recall (%) | F1-Score | Number of Matches
0–10 | 98.48 | 63.29 | 0.75 | 376
0–50 | 99.49 | 60.66 | 0.71 | 380
0–100 | 98.64 | 79.50 | 0.87 | 463
0–200 | 98.46 | 76.95 | 0.84 | 476
0–1000 | 98.49 | 63.43 | 0.75 | 375
Table 2. Average performance comparison between conventional methods and the proposed depth-guided local approach.

Method | Precision (%) | Recall (%) | F1-Score | Number of Matches | p-Value (F1-Score)
RANSAC | 98.59 | 66.63 | 0.78 | 424 | 0.0002
Proposed + RANSAC | 98.79 | 81.62 | 0.88 | 519 | –
LMedS | 87.31 | 17.81 | 0.24 | 143 | 0.0022
Proposed + LMedS | 99.83 | 43.45 | 0.56 | 328 | –
MAGSAC++ | 98.35 | 73.68 | 0.84 | 426 | 0.0094
Proposed + MAGSAC++ | 98.60 | 84.25 | 0.91 | 513 | –
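The paper does not state which statistical test produced the F1-score p-values in Table 2, but a paired t-test over the eight per-set F1-scores listed in Tables 3–5 reproduces the reported 0.0002 for RANSAC. The sketch below shows that calculation under this paired-t-test assumption.

import numpy as np
from scipy import stats

# Per-image-set F1-scores taken from Tables 3-5 (sets 1-1 through 3-2).
f1_ransac = np.array([0.81, 0.69, 0.92, 0.75, 0.92, 0.89, 0.56, 0.72])
f1_depth = np.array([0.93, 0.87, 0.98, 0.86, 0.98, 0.98, 0.68, 0.79])

# Paired t-test (an assumed choice of test) on the per-set differences.
t_stat, p_val = stats.ttest_rel(f1_depth, f1_ransac)
print(f"t = {t_stat:.2f}, p = {p_val:.4f}")  # ~0.0002, matching Table 2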
Table 3. Performance metrics for Class 1 datasets.

Class 1 | Method | Precision (%) | Recall (%) | F1-Score | Number of Matches
1-1 | RANSAC | 100 | 68.34 | 0.81 | 1196
1-1 | Proposed + RANSAC | 100 | 87.54 | 0.93 | 1532
1-1 | LMedS | 100 | 9.71 | 0.18 | 170
1-1 | Proposed + LMedS | 100 | 62.29 | 0.77 | 1090
1-1 | MAGSAC++ | 99.83 | 66.74 | 0.80 | 1170
1-1 | Proposed + MAGSAC++ | 100 | 86.69 | 0.93 | 1517
1-2 | RANSAC | 99.45 | 52.48 | 0.69 | 181
1-2 | Proposed + RANSAC | 98.89 | 78.13 | 0.87 | 271
1-2 | LMedS | 100 | 1.75 | 0.03 | 6
1-2 | Proposed + LMedS | 100 | 40.23 | 0.57 | 138
1-2 | MAGSAC++ | 98.98 | 56.85 | 0.72 | 197
1-2 | Proposed + MAGSAC++ | 99.63 | 77.55 | 0.87 | 267
1-3 | RANSAC | 99.40 | 85.40 | 0.92 | 500
1-3 | Proposed + RANSAC | 99.47 | 96.22 | 0.98 | 563
1-3 | LMedS | 98.44 | 10.82 | 0.20 | 64
1-3 | Proposed + LMedS | 98.90 | 15.46 | 0.27 | 91
1-3 | MAGSAC++ | 99.00 | 85.40 | 0.92 | 502
1-3 | Proposed + MAGSAC++ | 99.48 | 98.28 | 0.99 | 575
Table 4. Performance metrics for Class 2 datasets.

Class 2 | Method | Precision (%) | Recall (%) | F1-Score | Number of Matches
2-1 | RANSAC | 90.62 | 64.24 | 0.75 | 224
2-1 | Proposed + RANSAC | 92.73 | 80.70 | 0.86 | 275
2-1 | LMedS | 100 | 0.95 | 0.02 | 3
2-1 | Proposed + LMedS | 100 | 17.72 | 0.30 | 56
2-1 | MAGSAC++ | 89.82 | 64.24 | 0.75 | 226
2-1 | Proposed + MAGSAC++ | 90.62 | 64.24 | 0.75 | 224
2-2 | RANSAC | 99.34 | 86.01 | 0.92 | 458
2-2 | Proposed + RANSAC | 99.42 | 96.60 | 0.98 | 514
2-2 | LMedS | 100 | 48.20 | 0.65 | 255
2-2 | Proposed + LMedS | 99.72 | 67.49 | 0.80 | 358
2-2 | MAGSAC++ | 99.32 | 82.80 | 0.90 | 441
2-2 | Proposed + MAGSAC++ | 99.40 | 94.14 | 0.97 | 501
2-3 | RANSAC | 99.87 | 81.02 | 0.89 | 761
2-3 | Proposed + RANSAC | 99.78 | 96.48 | 0.98 | 907
2-3 | LMedS | 100 | 68.76 | 0.81 | 645
2-3 | Proposed + LMedS | 100 | 90.41 | 0.95 | 848
2-3 | MAGSAC++ | 99.87 | 81.02 | 0.89 | 761
2-3 | Proposed + MAGSAC++ | 99.67 | 96.16 | 0.98 | 905
Table 5. Performance metrics for Class 3 datasets.

Class 3 | Method | Precision (%) | Recall (%) | F1-Score | Number of Matches
3-1 | RANSAC | 100 | 39.33 | 0.56 | 35
3-1 | Proposed + RANSAC | 100 | 51.69 | 0.68 | 46
3-1 | LMedS | 100 | 2.25 | 0.04 | 2
3-1 | Proposed + LMedS | 100 | 25.84 | 0.41 | 23
3-1 | MAGSAC++ | 100 | 61.80 | 0.76 | 55
3-1 | Proposed + MAGSAC++ | 100 | 66.29 | 0.80 | 59
3-2 | RANSAC | 100 | 56.25 | 0.72 | 36
3-2 | Proposed + RANSAC | 100 | 65.62 | 0.79 | 42
3-2 | LMedS | – | – | – | 0
3-2 | Proposed + LMedS | 100 | 28.12 | 0.44 | 18
3-2 | MAGSAC++ | 100 | 90.62 | 0.95 | 58
3-2 | Proposed + MAGSAC++ | 100 | 90.62 | 0.95 | 58
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
