1. Introduction
Unmanned Aerial Vehicles (UAVs) are integral to modern applications, spanning surveillance, logistics, and infrastructure inspection, where their autonomy is paramount [1]. A fundamental prerequisite for such autonomy is a continuous and robust localization capability, which ensures safe and effective operation [2,3]. While Global Navigation Satellite Systems, like the Global Positioning System (GPS), are the conventional solution, their signals are notoriously unreliable in many critical scenarios. Environments such as dense urban canyons, indoor spaces, or contested areas subject to intentional signal jamming and spoofing render GPS-dependent systems vulnerable, severely limiting UAV operational reliability and versatility [4,5]. This limitation motivates the development of alternative localization approaches that enhance UAV resilience and safety by operating without external beacons [6].
Vision-Based Navigation (VBN) has emerged as a compelling and powerful alternative, leveraging onboard cameras to perceive the surrounding environment for self-positioning. This approach grants UAVs a high degree of autonomy, enabling them to navigate using intrinsic sensory data. Within VBN, landmark-based navigation is a particularly effective strategy, mirroring the innate cognitive processes humans and animals use to orient themselves by recognizing distinctive features [7]. By pre-mapping a set of unique visual references or discovering them in real-time, a UAV can effectively match its current view against this map to correct for positional drift accumulated by its Inertial Measurement Unit (IMU). This process allows the vehicle to maintain accurate localization over extended periods without external aid.
However, the efficacy of this strategy hinges on the quality of the landmarks. The automatic identification of what constitutes a “good” landmark is a non-trivial challenge [8]. An ideal landmark must satisfy two crucial criteria: it must be recognizable under varying conditions (e.g., different viewpoints, lighting, partial occlusions), and it must be sufficiently distinctive to avoid perceptual aliasing—the confusion of one location with another due to repetitive features. For example, an architecturally unique building or a solitary water tower can serve as an excellent, unambiguous landmark. In contrast, generic residential houses in a suburban grid or uniform rows of city buildings offer little discriminative value and can easily mislead a VBN system.
The advent of deep learning, particularly Convolutional Neural Networks (CNNs), has revolutionized the field of computer vision, providing sophisticated tools for feature extraction and object recognition that far surpass classical methods [9]. CNNs pre-trained on large-scale datasets learn hierarchical feature representations that can be transferred to diverse downstream tasks [10]. These features, often compressed into vectors known as embeddings, encapsulate rich semantic and structural information, making them ideal for representing visual objects for similarity comparison and re-identification [11]. Despite these technological advancements, the following critical research gap persists: the autonomous discovery and curation of a database of truly distinctive landmarks from raw Earth observation data, such as aerial survey imagery, remains an open problem. Existing systems often depend on pre-defined markers, extensive manual annotation, or generic object features that are not sufficiently discriminative for reliable, long-term navigation. This paper addresses this gap by proposing a framework to automatically survey an area, identify unique visual references, and build a high-quality landmark map for robust GPS-free navigation.
This paper makes the following specific contributions:
- A training-free method for constructing rich object embeddings by aggregating multi-layer features from a pre-trained segmentation network.
- An unsupervised landmark selection strategy that identifies distinctive objects as statistical outliers in the learned embedding space.
- A validation on a cross-view benchmark dataset demonstrating that the automatically selected landmarks yield superior retrieval accuracy compared to typical objects.
The proposed approach focuses on urban environments and utilizes buildings as the primary landmark candidates, leveraging their structural diversity and permanence. The remainder of this paper is structured as follows:
Section 2 reviews related work.
Section 3 details the materials and methods.
Section 4 presents the experimental results.
Section 5 discusses the findings, and
Section 6 concludes the paper.
2. Related Works
The challenge of UAV navigation in GPS-denied environments has spurred extensive research across several domains, including classical computer vision, deep learning-based visual recognition, and robotics. This section reviews the key advancements in these areas to contextualize our proposed approach for autonomous landmark discovery.
Early work in visual localization and landmark-based navigation predominantly relied on hand-crafted local feature descriptors. Methods such as the Scale-Invariant Feature Transform (SIFT) [14] and Speeded-Up Robust Features (SURF) became the standard for identifying and matching keypoints between images. These local descriptors were often aggregated into a global image signature using techniques like Bag-of-Visual-Words (BoVW) [15], which quantizes local features into a histogram, or the Vector of Locally Aggregated Descriptors (VLAD) [16], which aggregates descriptor residuals. While foundational, these methods often struggled with scalability in large environments, viewpoint and illumination invariance, and the lack of high-level semantic understanding, making it difficult to distinguish between structurally similar but distinct objects.
The deep learning revolution ushered in a new era of performance for visual recognition tasks. Seminal architectures like AlexNet [9] and ResNet [10] demonstrated the power of hierarchical feature learning. This led to the development of specialized networks for visual place recognition, such as NetVLAD [16], which introduced a learnable aggregation layer to create powerful global embeddings for entire scenes. Other approaches, like Deep Local Features [17], focused on learning discriminative local features and using attention mechanisms to identify the most salient regions for matching. These methods significantly improved robustness but often required extensive training on large-scale, domain-specific datasets. A prevailing modern paradigm is Detect-to-Retrieve (D2R) [18], where an object detector first localizes objects of interest before a dedicated network computes their embeddings. This reduces background clutter and improves accuracy, especially for small landmarks. Our work builds on the D2R concept but adds a crucial, novel layer of unsupervised distinctiveness analysis. The underlying techniques of exploring these learned feature spaces often rely on dimensionality reduction for analysis and visualization [19], with classical methods like Principal Component Analysis (PCA) [20] and modern non-linear techniques like t-distributed Stochastic Neighbor Embedding (t-SNE) [21] being essential tools.
Within the specific context of UAVs, research has focused on adapting these techniques to the unique challenges of aerial robotics, including severe viewpoint changes between aerial survey and UAV imagery and the need for computationally efficient onboard processing. Recent surveys highlight the breadth of deep learning-based visual localization techniques being explored [2,5]. Specific methods include lightweight visual localization algorithms designed for real-time performance [6], context-enhanced cross-view geo-localization that explicitly reasons about aerial survey-to-UAV matching [22], and fine-grained approaches that combine local and global features [23]. Object detection models from the YOLO family [24,25] are particularly popular for their favorable balance of speed and accuracy. This has led to the development of highly specialized, lightweight networks for efficient UAV object detection [3,26,27] and related image analysis tasks like satellite image classification [28]. Furthermore, as these systems become more autonomous, ensuring their reliability and transparency is critical. This has spurred research into explainable AI for UAV navigation to build trust and understand decision-making processes [1,29], aligning with broader goals of developing trustworthy AI systems [30].
Despite these advances, a critical gap remains. While many systems can detect objects or match scenes [31], they typically lack an explicit mechanism to autonomously evaluate the navigational quality or distinctiveness of potential landmarks from raw geospatial data. Methods often rely on manually selected key regions [8] or treat all detected objects of a class (e.g., ‘building’) as equally viable landmarks, which can fail in repetitive urban environments. The core challenge is to automate the curation of a sparse, reliable landmark map from a dense set of object candidates without requiring extensive retraining or manual supervision. Our work addresses this by introducing an unsupervised outlier detection framework based on the Isolation Forest algorithm [32]. This approach leverages the inherent structure of a learned embedding space to identify statistically rare, and thus visually distinctive, objects. This is philosophically related to detecting subtle data anomalies, a task primarily explored in domains like image resolution enhancement [33], but applied here for the novel purpose of landmark curation.
Therefore, the goal of this study is to improve UAV navigation by creating and validating a lightweight, automated framework that identifies visually distinctive urban objects in aerial survey imagery to serve as reliable landmarks. We achieve this by first constructing rich object embeddings from a pre-trained segmentation network, a process that requires no additional training. Next, an unsupervised landmark selection strategy identifies unique objects by treating them as statistical outliers within the embedding space. Finally, we validate this approach using a cross-view benchmark dataset, demonstrating that our framework for selecting landmarks results in superior retrieval accuracy.
3. Materials and Methods
This section outlines the methodology for autonomous landmark identification. We begin by defining the problem and stating our core hypothesis. Subsequently, we describe the proposed framework, which involves object segmentation, multi-layer feature extraction, embedding construction, and outlier-based landmark selection. We also detail the rationale for UAV localization using these landmarks and introduce proxy metrics for evaluating and tuning the embedding process.
3.1. Problem Statement
We aim to develop a system that automatically identifies visually distinctive objects, primarily buildings, from a collection of geo-referenced aerial survey images, denoted as $\mathcal{I} = \{I_1, I_2, \dots, I_N\}$. Each image covers a specific geographic area and may contain multiple instances of objects belonging to a predefined set of categories, $\mathcal{C}$. In this work, our primary category of interest is ’building’. These identified distinctive objects will serve as landmarks for UAV navigation.
The core of our approach relies on a CNN pre-trained for a relevant task, in this case, building segmentation. For each object $o_{i,j}$ detected in an image $I_i$, we derive a $d$-dimensional real-valued vector embedding $\mathbf{e}_{i,j} \in \mathbb{R}^d$. Objects that exhibit visually unique or distinctive features in the learned embedding space are designated as landmarks. Formally, let $\mathcal{O}$ be the set of all detected objects across the entire image dataset $\mathcal{I}$. Our primary objective is to identify a subset $\mathcal{L} \subset \mathcal{O}$ of landmark objects characterized by unique visual cues and robust appearance under minor transformations (e.g., slight viewpoint changes, illumination variations). This curated set $\mathcal{L}$, along with their geolocations, forms a landmark database for subsequent UAV navigation.
3.2. Core Hypothesis
Consider a UAV operating in a GPS-denied environment. Prior to deployment, a database of landmarks is constructed from aerial survey imagery of the operational area. This database contains embeddings and geolocations of distinctive objects. During its mission, the UAV’s onboard camera captures live video frames. Objects detected in these frames are processed to generate embeddings using the same methodology employed for the aerial survey imagery. These query embeddings are then matched against the landmark database. A successful match allows the UAV to estimate its current location based on the known coordinates of the identified landmark.
Our central hypothesis is twofold. First, concerning cross-domain distinctiveness, we posit that objects visually distinctive in aerial survey imagery will generally remain so when viewed from a UAV, despite differing perspectives and sensors. Consequently, in a common embedding space, landmark objects should form sparse, outlier regions, well-separated from clusters of typical objects. Second, regarding hierarchical feature utility, we hypothesize that an optimal object embedding should incorporate features from multiple CNN layers. While deeper layers provide semantic context, intermediate layers retain fine-grained details vital for distinguishing between similar-looking objects [10]. By focusing on outlier objects identified through this multi-layer embedding, we anticipate a significant improvement in localization reliability due to reduced matching ambiguity.
3.3. Proposed Framework: Landmark Identification Workflow
The proposed framework for autonomous landmark identification consists of four main steps, as illustrated in Figure 1.
3.3.1. Segmentation-Based Object Detection
Candidate objects are initially identified by applying a CNN-based segmentation model to each aerial survey image $I_i \in \mathcal{I}$. The model is pre-trained to segment objects belonging to categories in $\mathcal{C}$ (in our case, primarily buildings). For each detected object instance $o_{i,j}$, the segmentation model outputs a binary segmentation mask $M_{i,j}$ and a confidence score $s_{i,j}$. We retain only objects whose confidence score exceeds a threshold $\tau_s$, yielding an initial set of candidates $\mathcal{O}$. We prefer segmentation over bounding-box detection because precise masks capture actual object shapes and help isolate object features from background clutter, leading to purer embeddings. The segmentation backbone used is a YOLOv11n-seg model, chosen for its lightweight architecture and proven proficiency in segmenting buildings from aerial views.
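For concreteness, the following minimal Python sketch illustrates this detection and filtering step using the Ultralytics inference API. The checkpoint filename and the threshold value are illustrative assumptions, not values prescribed by the framework.

```python
from ultralytics import YOLO

# Hypothetical fine-tuned checkpoint (see Section 3.7.2 for training details).
model = YOLO("yolo11n-seg-buildings.pt")

CONF_THRESHOLD = 0.5  # tau_s: assumed minimum confidence to keep a candidate


def detect_candidates(image_path):
    """Run segmentation and keep (mask, confidence) pairs above tau_s."""
    results = model(image_path)[0]
    candidates = []
    if results.masks is None:  # no objects detected in this image
        return candidates
    for mask, conf in zip(results.masks.data, results.boxes.conf):
        if float(conf) >= CONF_THRESHOLD:
            # mask: binary tensor (H, W) at the network's input resolution
            candidates.append((mask.cpu().numpy(), float(conf)))
    return candidates
```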
3.3.2. Multi-Layer Feature Extraction
Let the backbone CNN consist of $K$ convolutional layers. We select a subset of $L$ layers, indexed by $\{l_1, \dots, l_L\}$, from which to extract features. The feature map tensor at a given layer $l$ is calculated as shown in Equation (1):

$$F^{(l)} = f^{(l)}(I_i) \in \mathbb{R}^{C_l \times H_l \times W_l}, \qquad (1)$$

where $F^{(l)}$ is the output tensor, $C_l$ is the number of channels, and $H_l \times W_l$ are the spatial dimensions.
During the forward pass for an image $I_i$, we hook into the selected layers to store their feature maps. Since deeper layers have reduced spatial dimensions, the original segmentation mask $M_{i,j}$ is downsampled to match the resolution of each feature map $F^{(l)}$, yielding an aligned mask $M_{i,j}^{(l)}$. This ensures that we extract features only from activations corresponding to object $o_{i,j}$. The rationale for using multiple layers, established by Zeiler and Fergus [34], is to create a rich embedding that combines low-level primitives (e.g., textures) from early layers with high-level semantic information (e.g., structure) from deeper layers, thereby enhancing discriminative power.
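A minimal PyTorch sketch of this hooking and mask-alignment step, continuing from the `model` object of the previous sketch, is given below. The layer indices and the `model.model.model[i]` access pattern into the Ultralytics backbone are assumptions for illustration; nearest-neighbor interpolation is one straightforward way to downsample a binary mask.

```python
import torch
import torch.nn.functional as F

feature_maps = {}  # layer index -> feature tensor captured during the forward pass


def make_hook(layer_idx):
    def hook(module, inputs, output):
        feature_maps[layer_idx] = output.detach()
    return hook


# Assumed access pattern: the Ultralytics wrapper exposes backbone blocks
# as model.model.model[i].
SELECTED_LAYERS = [6, 9, 10]  # example indices; the optimal set is found in Section 3.5
handles = [model.model.model[i].register_forward_hook(make_hook(i))
           for i in SELECTED_LAYERS]


def align_mask(mask, feat):
    """Downsample a binary object mask to one feature map's spatial resolution."""
    m = torch.as_tensor(mask)[None, None].float()      # (1, 1, H, W)
    m = F.interpolate(m, size=feat.shape[-2:], mode="nearest")
    return m[0, 0].bool()                              # (H_l, W_l)
```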
3.3.3. Embedding Construction via Aggregation
To represent each object $o_{i,j}$ as a fixed-dimensional embedding vector $\mathbf{e}_{i,j} \in \mathbb{R}^d$, we aggregate the feature activations within its aligned mask $M_{i,j}^{(l)}$ for each selected layer $l$. To obtain a single scalar per channel, we apply an aggregation operator $\Phi$ to the set of masked activation values. This process is formalized in Equation (2), where $F_c^{(l)}$ is the $c$-th channel of the feature map at layer $l$, and $(x, y)$ are the spatial pixel coordinates within that channel’s grid that fall inside the downsampled mask $M_{i,j}^{(l)}$:

$$v_c^{(l)} = \Phi\left(\left\{ F_c^{(l)}(x, y) \mid (x, y) \in M_{i,j}^{(l)} \right\}\right), \quad c = 1, \dots, C_l. \qquad (2)$$
Common choices for the aggregation operator $\Phi$ include max-pooling ($\Phi = \max$), which emphasizes the strongest feature activations; average-pooling ($\Phi = \operatorname{mean}$), which provides a summary of the overall feature response; and sum-pooling ($\Phi = \operatorname{sum}$), which reflects the total activation.
These scalar values are then concatenated across all channels for each selected layer. As described in Equation (3), the final embedding vector $\mathbf{e}_{i,j}$ is formed by concatenating these aggregated vectors across all selected layers:

$$\mathbf{e}_{i,j} = \left[\mathbf{v}^{(l_1)}; \mathbf{v}^{(l_2)}; \dots; \mathbf{v}^{(l_L)}\right]. \qquad (3)$$

The total dimension $d$ of the embedding is the sum of the channel counts in all selected layers: $d = \sum_{k=1}^{L} C_{l_k}$.
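The aggregation and concatenation of Equations (2) and (3) reduce to a few tensor operations. The sketch below, assuming the `feature_maps` and aligned masks captured as above, is one straightforward realization:

```python
import torch


def masked_aggregate(feat, mask, op="max"):
    """Aggregate masked activations per channel (Equation (2)).

    feat: (C, H_l, W_l) feature map from one selected layer.
    mask: (H_l, W_l) boolean mask aligned to that layer's resolution.
    """
    vals = feat[:, mask]                  # (C, n_pixels) activations inside the mask
    if vals.shape[1] == 0:                # mask vanished after downsampling
        return torch.zeros(feat.shape[0])
    if op == "max":
        return vals.max(dim=1).values
    if op == "mean":
        return vals.mean(dim=1)
    return vals.sum(dim=1)                # sum-pooling


def build_embedding(feature_maps, masks, op="max"):
    """Concatenate per-layer aggregated vectors into one embedding (Equation (3))."""
    parts = [masked_aggregate(feature_maps[l][0], masks[l], op)
             for l in sorted(feature_maps)]
    return torch.cat(parts)               # dimension d = sum of channel counts
```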
3.3.4. Outlier Detection for Landmark Identification
With embeddings generated for all objects, the final step is to distinguish distinctive landmarks from typical ones. Our core hypothesis suggests that landmark embeddings will be outliers in the embedding space, while typical objects will form denser clusters. We employ the Isolation Forest algorithm [32] for this task due to its computational efficiency, low memory requirements, and proven effectiveness on high-dimensional data without needing data normalization.
While other powerful anomaly detection algorithms exist, such as Local Outlier Factor or One-Class SVMs, Isolation Forest’s non-parametric nature and its ability to handle irrelevant attributes make it particularly well-suited for the potentially noisy feature space of our embeddings. A comparative study of these methods, however, remains an important direction for future investigation. The algorithm isolates observations by random feature splitting; shorter path lengths to isolate a point indicate an anomaly. An object $o_{i,j}$ is designated as a landmark and added to the set $\mathcal{L}$ if its anomaly score $a(\mathbf{e}_{i,j})$ from the algorithm exceeds a threshold $\tau_a$, as formalized in Equation (4):

$$o_{i,j} \in \mathcal{L} \iff a(\mathbf{e}_{i,j}) > \tau_a. \qquad (4)$$
This unsupervised approach is advantageous as it requires no manual labeling and leverages the inherent data distribution to define distinctiveness algorithmically. Due to overlapping views in consecutive images within the VPAIR dataset, a single physical landmark may be detected multiple times. If any one of these instances is selected as an outlier, its other views are typically also identified as outliers. This results in the landmark database containing several similar embeddings for the same landmark from different perspectives, which intrinsically enhances the robustness of the retrieval process.
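For illustration, the selection rule of Equation (4) maps directly onto scikit-learn’s IsolationForest. The tree count and contamination values below follow the configuration reported in Section 3.7.2; the random stand-in embeddings are included only to make the sketch runnable.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# embeddings: (n_objects, d) matrix from Section 3.3.3; random stand-in here.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 640))

# Configuration per Section 3.7.2: 500 trees, 1% assumed outlier share.
forest = IsolationForest(n_estimators=500, contamination=0.01, random_state=0)
labels = forest.fit_predict(embeddings)        # +1 inlier, -1 outlier

# Objects flagged as outliers form the landmark set L (Equation (4)).
landmark_indices = np.where(labels == -1)[0]
anomaly_scores = -forest.score_samples(embeddings)  # higher = more anomalous
```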
3.4. Rationale for UAV Localization Using Identified Landmarks
The landmark database $\mathcal{L}$, constructed from the aerial survey imagery, serves as the map for UAV localization. The onboard process, illustrated in Figure 2, mirrors the identification workflow.
First, the UAV’s camera captures a video frame, and the same segmentation CNN detects objects in the view. Second, an embedding is computed for each detected object using the identical multi-layer feature extraction and aggregation methods. Third, this query embedding is matched against the landmark database using a distance metric like L2 distance. Finally, if a query embedding closely matches a landmark embedding, the UAV infers its approximate geolocation from the known coordinates of the matched landmark. This estimate can be further refined using multiple matches or integration into a sensor fusion framework with IMU data. Crucially, the retrieved top-N candidates would then typically be passed to a geometric verification stage (e.g., using RANSAC with a PnP solver) for precise 6-DoF pose estimation and robust outlier rejection, a standard practice that strengthens the final localization solution.
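A minimal sketch of the matching step is given below. The optional distance-rejection threshold is an illustrative assumption, and in practice the returned candidates would feed the geometric verification stage described above.

```python
import numpy as np


def match_query(query_emb, db_embs, db_geolocations, top_n=5, max_dist=None):
    """Retrieve the top-N nearest landmarks by L2 distance.

    query_emb: (d,) embedding of an object detected in the UAV frame.
    db_embs:   (n_landmarks, d) landmark database embeddings.
    """
    dists = np.linalg.norm(db_embs - query_emb, axis=1)
    order = np.argsort(dists)[:top_n]
    if max_dist is not None:                  # optional rejection threshold
        order = order[dists[order] <= max_dist]
    # Candidate geolocations would next go to geometric verification (RANSAC/PnP).
    return [(int(i), float(dists[i]), db_geolocations[i]) for i in order]
```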
3.5. Proxy Metrics for Embedding Evaluation and Tuning
Optimizing embedding parameters (CNN layers and aggregation function) is crucial, but detailed ground-truth correspondences between aerial survey and UAV views are often unavailable. To enable quantitative tuning, we introduce two proxy retrieval metrics. The Aerial Survey-to-Aerial Survey (AS2AS) retrieval metric evaluates embedding consistency within the aerial survey domain by querying an object from one aerial survey image against a database from adjacent images. A successful retrieval occurs if the same physical object is found. The Aerial Survey-to-UAV (AS2UAV) retrieval metric more directly addresses the cross-domain gap by matching object embeddings from aerial survey images against a database from UAV images of the same area (as available in datasets like VPAIR [13]).
To systematically identify an optimal layer combination, we employ a greedy forward selection strategy (Algorithm 1), guided by these proxy metrics. Starting with an empty set, we iteratively add the layer that yields the largest improvement in the chosen metric (e.g., AS2UAV Recall@1) until performance no longer improves.
Algorithm 1 Greedy algorithm for optimal layer selection

1: Input: Set of all available CNN layers $\mathcal{A}$, chosen proxy metric $m$, aggregation function $\Phi$.
2: Output: Optimal set of selected layers $S^{*}$.
3: Initialize $S^{*} \leftarrow \emptyset$.
4: Initialize $m^{*} \leftarrow 0$.
5: loop
6:   $l_{\mathrm{best}} \leftarrow \emptyset$.
7:   $m_{\mathrm{best}} \leftarrow m^{*}$.
8:   for all $l \in \mathcal{A} \setminus S^{*}$ do
9:     $S' \leftarrow S^{*} \cup \{l\}$.
10:    Construct embeddings using layers in $S'$ and aggregation $\Phi$.
11:    Evaluate $m'$ on the validation set.
12:    if $m' > m_{\mathrm{best}}$ then
13:      $m_{\mathrm{best}} \leftarrow m'$.
14:      $l_{\mathrm{best}} \leftarrow l$.
15:    end if
16:  end for
17:  if $l_{\mathrm{best}} \neq \emptyset$ and $m_{\mathrm{best}} > m^{*}$ then
18:    $S^{*} \leftarrow S^{*} \cup \{l_{\mathrm{best}}\}$.
19:    $m^{*} \leftarrow m_{\mathrm{best}}$.
20:  else
21:    break
22:  end if
23: end loop
24: return $S^{*}$.
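Algorithm 1 translates directly into a short Python routine. In this sketch, `evaluate_metric` is an assumed callback that rebuilds embeddings for a candidate layer set and returns the chosen proxy metric (e.g., AS2UAV Recall@1) on the validation set.

```python
def greedy_layer_selection(all_layers, evaluate_metric, agg="max"):
    """Greedy forward selection of CNN layers (Algorithm 1).

    evaluate_metric(layers, agg) -> float is assumed to construct embeddings
    with the given layers and return the chosen proxy metric.
    """
    selected, best_metric = [], 0.0
    while True:
        best_layer, best_candidate = None, best_metric
        for layer in all_layers:
            if layer in selected:
                continue
            score = evaluate_metric(selected + [layer], agg)
            if score > best_candidate:       # strict improvement required
                best_candidate, best_layer = score, layer
        if best_layer is None:               # no layer improves the metric: stop
            return selected, best_metric
        selected.append(best_layer)
        best_metric = best_candidate
```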
3.6. Evaluation Metrics for the Final Workflow
Once the landmark database $\mathcal{L}$ is constructed, we evaluate the UAV localization task as a retrieval problem. The primary metric is Recall@N, which we use to measure the proportion of query images for which the correct landmark is found within the top $N$ retrieved results. The calculation is shown in Equation (5):

$$\mathrm{Recall@}N = \frac{Q_{\mathrm{success}}}{Q_{\mathrm{total}}}, \qquad (5)$$

where $Q_{\mathrm{total}}$ is the total number of queries, and $Q_{\mathrm{success}}$ is the count of successful queries where the correct match appears in the top $N$ results.
The metric in Equation (5) is crucial because Recall@1 measures exact match accuracy, while Recall@N (for N > 1) indicates the likelihood of the correct match being in a small candidate set, which is valuable for practical navigation systems that can employ further verification steps.
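Equation (5) can be computed in a few lines; the sketch below assumes each query yields a ranked list of retrieved landmark IDs.

```python
def recall_at_n(ranked_results, ground_truth, n):
    """Recall@N (Equation (5)): fraction of queries whose correct match
    appears among the top-N retrieved landmark IDs.

    ranked_results: list of ranked ID lists, one per query.
    ground_truth:   list of correct landmark IDs, one per query.
    """
    successes = sum(gt in ranked[:n]
                    for ranked, gt in zip(ranked_results, ground_truth))
    return successes / len(ground_truth)
```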
3.7. Experimental Setup
The experimental setup was designed to validate the proposed landmark discovery framework and quantify its impact on retrieval performance. This setup encompasses the dataset selection, implementation details of the models, and the evaluation protocol.
3.7.1. Dataset and Preprocessing
All experiments were conducted on the VPAIR dataset [13], a publicly available benchmark designed for cross-view visual localization. It features 2788 paired images from a flight over a German urban environment, comprising high-resolution aerial survey orthophotos and corresponding nadir images captured from a UAV (simulated by a light aircraft). This pairing is ideal for evaluating the robustness of features to significant changes in viewpoint and sensor modality. Representative examples of these image pairs are shown in Figure 3.
A key characteristic of the VPAIR dataset is the consistent Ground Sample Distance (GSD) across both aerial survey and UAV views, which were downsampled to a uniform 800 × 600 pixel resolution. This parity ensures that building structures, our primary landmark candidates, are resolved with sufficient detail for meaningful feature extraction. To confirm this, we analyzed the size of all 18,432 detected buildings in the dataset. As shown in Figure 4, the distribution of building bounding box sizes is concentrated well above minimal detection thresholds, confirming their suitability for the proposed framework.
3.7.2. Implementation Details
The entire workflow was implemented in Python v3.9 [35], with deep learning components built on PyTorch v1.12 [36]. All computations were performed on a workstation equipped with an NVIDIA RTX 3060 GPU.
For object segmentation, we employed the YOLOv11n-seg model from Ultralytics [37]. This model was chosen for its excellent trade-off between accuracy and computational cost, making it a strong candidate for eventual onboard UAV deployment. The model’s lightweight architecture is detailed in Table 1, highlighting its low parameter count (2.9 M) and high-speed inference capabilities on GPU hardware.
The base model was fine-tuned for 100 epochs on a custom, diverse dataset of satellite images to perform single-class building segmentation. A comprehensive analysis of the fine-tuning process and the resulting model’s performance is provided in Appendix A.
Unsupervised landmark selection was performed using the Isolation Forest algorithm, implemented with the scikit-learn v1.1 library [38]. The IsolationForest model was configured with n_estimators = 500. For our primary analysis, the contamination (outlier proportion) was set to a fixed value of 0.01 to identify the most statistically rare objects. We acknowledge that a static global fraction is a simplification and may not be optimal for all environments; an adaptive threshold that adjusts based on local feature density is a key area for future work.
3.7.3. Evaluation Protocol
To validate our core hypotheses, we designed a two-part evaluation protocol. First, for qualitative analysis of the embedding space, we used PCA and t-SNE to project the high-dimensional embedding vectors into 2D space for visualization, as described in our previous work [19].
Second, for the quantitative evaluation of retrieval performance, we created a benchmark query set by manually annotating 200 buildings from the VPAIR dataset. This set was carefully balanced into two classes based on explicit criteria detailed in Section 3.7.4: 100 “landmarks” (architecturally unique or visually salient buildings) and 100 “typical buildings” (generic, repetitive structures). The primary metric used to assess performance was Recall@N (defined in Equation (5)), which measures the proportion of queries for which the correct building is retrieved within the top N results.
3.7.4. Benchmark Annotation Criteria
To ensure the quantitative evaluation was robust and reproducible, the manual annotation of the 200-building benchmark set followed a strict set of criteria. The annotation was performed by one of the authors and subsequently reviewed by another to ensure consistency.
Landmark Buildings (100 instances): These were selected based on visual saliency and uniqueness within the local context. Key criteria included: (1) irregular or complex building footprints (e.g., L-shaped, T-shaped, or non-rectangular designs); (2) unique roof structures, colors, or textures that clearly distinguished them from neighbors; (3) significant size or height relative to surrounding buildings; or (4) isolation from other structures, making them unambiguous.
Typical Buildings (100 instances): These were selected to represent the visually repetitive background structures. Key criteria included: (1) simple, rectangular footprints; (2) location within dense, grid-like clusters of similar-looking houses; and (3) generic, common architectural features (e.g., standard gabled roofs, uniform color).
This careful curation was essential for creating a benchmark that could fairly test the hypothesis that visually distinctive objects, as identified by our system, are inherently more reliable for retrieval-based localization.
5. Discussion
The results presented herein affirm the viability of autonomously identifying distinctive urban landmarks from aerial survey imagery, a critical capability for UAV navigation in GPS-denied settings. Our integrated framework, combining building segmentation, multi-layer feature embedding, and unsupervised outlier detection, effectively curates a database of landmarks that demonstrate markedly superior retrieval accuracy over typical, non-selected buildings. This section will delve into the implications of these findings, juxtapose our approach with existing methodologies, and critically assess its strengths, weaknesses, and promising avenues for future research.
5.1. Comparative Analysis with Existing Approaches
To situate our work within the broader landscape of visual localization, we provide both a conceptual and an empirical comparison. Conceptually, our framework differs from established paradigms in its explicit focus on automated landmark curation. Table 4 highlights these differences against traditional, global embedding, and generic D2R approaches.
Empirically comparing our landmark-centric framework to global descriptor methods like NetVLAD [16] or AnyLoc [39] is challenging, as they solve different tasks. Our system is designed for sparse landmark-based navigation correction, not continuous global localization. However, a proxy experiment was conducted to provide a relative baseline. We adapted our framework to mimic a global retrieval workflow on the VPAIR dataset’s urban images, using our landmark embeddings to represent local image groups. In this setup, our framework achieved a Recall@1 of 0.197 and Recall@5 of 0.37 (at 100 m tolerance). In contrast, a heavyweight system like AnyLoc, which uses a massive ViT-G/14 transformer, achieves Recall@1 of 0.62 on the same task. While AnyLoc’s performance is superior in this global context, it comes at a tremendous computational cost. Our framework’s strength lies not in outperforming such systems at their own task, but in providing a highly efficient, alternative navigation paradigm.
The “lightweight” nature of our approach is a key advantage for onboard deployment. The fine-tuned YOLOv11n-seg model contains only 2.9 M parameters and requires approximately 10.4 GFLOPs per 640 px image, with a GPU latency under 2 ms on a TensorRT-optimized T4. The total model weights are under 15 MB. In contrast, the DINOv2 ViT-G/14 model used by heavyweight approaches like AnyLoc has ~1.1 B parameters (nearly 400× more) and produces large, high-dimensional descriptors (49,152-D vs. our 640-D). This allows our entire landmark identification and matching workflow to run on commodity UAV compute modules without requiring powerful edge GPUs or offloading to the cloud, a critical advantage for SWaP (Size, Weight, and Power)-constrained platforms.
5.2. Interpretation of Findings
The efficacy of our framework stems from its synergistic approach to constructing discriminative embeddings and an unsupervised mechanism for selecting distinctive objects. Qualitative analysis of single-layer embeddings (Figure A3 and Figure A4) revealed that max-pooling, particularly with features from deeper CNN layers like 9, 6, and 10 of the YOLOv11n-seg backbone (Table 2), excelled in cross-domain (AS2UAV) retrieval. This result aligns with established findings in computer vision. Max-pooling is known to be robust to small spatial translations and focuses on the most discriminative feature activation within a region, effectively capturing an object’s defining characteristics [40]. In contrast, average-pooling captures overall texture and statistics, which, while useful for within-domain tasks (AS2AS), proved less robust to the significant viewpoint and sensor changes inherent in the cross-domain (AS2UAV) task. This object-centric, masked aggregation strategy offers a nuanced feature representation, potentially diverging from optimal strategies in global image retrieval where techniques like Generalized Mean pooling [41] are common. The efficient use of intermediate layers from a pre-trained segmentation network for embedding construction, without requiring additional training, echoes early work on leveraging pre-trained CNNs for retrieval [11], but is specifically tailored here for the nuances of building structures.
A significant contribution is the unsupervised landmark selection via Isolation Forest. Unlike many navigation systems that rely on manual selection or pre-defined ‘distinctive’ classes [8], our framework defines distinctiveness as statistical rarity in the embedding space, enabling automated, data-driven database curation. This is particularly valuable for rapid deployment in unmapped areas. The pronounced improvement in retrieval accuracy for selected landmarks (0.53 Recall@1 vs. 0.31 for typical buildings, Table 3) validates this strategy. While sharing parallels with visual saliency research, its application here to curate landmarks from objects of the same semantic class is novel. The AS2AS and AS2UAV proxy metrics also proved effective for guiding layer selection (Figure 5) without extensive ground-truth, with AS2UAV being vital for optimizing cross-domain matching.
5.3. Strengths, Weaknesses, and Limitations
The proposed framework presents several key strengths for autonomous UAV navigation. Its high degree of automation in landmark discovery and database creation minimizes manual effort, facilitating rapid deployment. The lightweight embedding generation, by hooking features from a pre-trained segmentation network, avoids computationally expensive training of dedicated embedding networks, benefiting resource-constrained UAVs. The unsupervised landmark selection via Isolation Forest offers a principled, data-driven approach to define distinctiveness, adapting to environmental visual characteristics and leading to significantly improved retrieval accuracy (Recall@5 of 0.70 for landmarks). Furthermore, the modular framework design allows for future upgrades to individual components.
However, the framework has limitations. A primary limitation is the system’s dependency on the initial segmentation model. As detailed in Appendix A, the fine-tuned YOLOv11n-seg performs well, but segmentation errors inevitably propagate. False negatives (missed buildings) cause potential landmarks to be entirely omitted from consideration. False positives could pollute the database with non-landmark objects. More critically, imprecise segmentation masks, for instance, those that include shadows, adjacent trees, or parts of other buildings, directly corrupt the feature embeddings by including irrelevant information during the masked aggregation step (Equation (2)). While our qualitative results suggest some robustness to minor mask inconsistencies (Figure 9a,b), a quantitative sensitivity analysis of how retrieval accuracy degrades with mask quality (e.g., measured by IoU) was not performed and remains a crucial step for future work to fully characterize the system’s operational reliability.
Validation was conducted on the VPAIR dataset, which, while relevant, features generally favorable daytime conditions and well-matched image resolutions. The aerial survey orthophotos and UAV images have a similar GSD of approximately 10 to 20 cm per pixel, and all images were downsampled to a uniform 800 × 600 pixel resolution. This resolution parity ensures that detected objects are sufficiently large for meaningful feature extraction, as shown by the distribution of building bounding box sizes in Figure 4. However, performance under severe weather (e.g., heavy fog, rain), extreme illumination (e.g., night), or in scenarios with large GSD mismatches between aerial survey and UAV imagery remains untested. The current building-centric focus also limits applicability in non-urban areas where other landmark types are crucial.
Methodologically, the fixed contamination rate in Isolation Forest may not be universally optimal, and the hand-crafted aggregation functions, while standard, could potentially be outperformed by learnable alternatives, albeit with increased complexity. The proxy metrics also provide an approximate, not guaranteed optimal, parameter tuning.
5.4. Open Research Challenges and Future Work Directions
Addressing the identified limitations points to several impactful research directions. Expanding the landmark inventory beyond buildings to include features like road intersections, bridges, or natural formations is essential for broader applicability, requiring multi-class segmentation and potentially class-specific embedding strategies. Methodological enhancements, such as investigating learnable aggregation functions (e.g., attention-based pooling [17] or Perceiver-style resampling [42]) or adaptive outlier detection mechanisms that adjust distinctiveness thresholds based on scene complexity or navigational needs, could significantly improve performance and flexibility. Enhancing robustness to diverse environmental conditions through domain adaptation, advanced data augmentation (e.g., using generative models to create seasonal or lighting variations of training data), or multi-modal sensor fusion (e.g., incorporating thermal or LiDAR data) is critical for real-world deployment.
Further system-level improvements include integrating geometric verification (e.g., Random Sample Consensus with PnP) to filter false positives and enable precise 6-DoF pose estimation. As mentioned in Section 3, this involves matching 2D keypoints from the UAV image to a 3D model of the landmark, which can often be inferred from the same aerial survey data used for the database. Developing capabilities for online landmark discovery and dynamic map management would allow UAVs to adapt to changing environments and improve their knowledge base over time. The creation of large-scale, multi-season, multi-modal benchmark datasets with precise ground truth is vital for rigorous evaluation and continued progress. Finally, our work contributes to the development of more sophisticated autonomous navigation systems. The curated landmark database constitutes a structured ‘visual knowledge base,’ mapping visual embeddings to semantic labels (‘landmark building’) and geolocations. This provides a concrete external knowledge source that a future navigation agent could query. For example, an agent’s planning module could query for the ‘nearest landmark’ to inform its path or verify its position. This reframes our contribution as a key perceptual component for future agentic systems that leverage external knowledge bases for robust decision-making, aligning with concepts like Retrieval-Augmented Generation but grounded in the specific, functional requirements of navigation [43].
6. Conclusions
This research successfully developed and validated an automated framework for identifying distinctive urban landmarks from aerial survey imagery to support UAV navigation in GPS-denied environments. By leveraging features from intermediate layers of a pre-trained YOLOv11n-seg model and employing an unsupervised Isolation Forest outlier detection method, our system effectively curates a database of visually unique building landmarks without requiring additional training for embedding generation or manual annotation of distinctiveness. Key quantitative results on the VPAIR benchmark demonstrated that the optimally configured max-pooled embeddings (from layers 9, 6, 10) achieved a top-1 retrieval accuracy of 0.53 for these automatically selected landmarks, a significant improvement over the 0.31 accuracy for typical buildings, and a promising Recall@5 of 0.70 for landmarks. Visual analysis confirmed the robust separation of landmark embeddings and their stability. While the current system demonstrates strong potential, its primary limitations include a building-centric focus, dependence on the initial segmentation quality, and validation on a single dataset under favorable conditions.
Future work will aim to expand the landmark inventory to diverse object types, explore learnable aggregation functions for enhanced viewpoint invariance, integrate geometric verification for precise pose estimation, and rigorously test the system across varied environmental conditions using more comprehensive benchmarks. The development of adaptive outlier detection and dynamic map management will also be crucial.