GRiM-Net: A Two-Stage Cross-View Visual Localization Framework for UAVs

Hu, Yanting; Zeng, Qinyong

doi:10.3390/rs18101477


by Yanting Hu * and Qinyong Zeng

School of Aeronautics and Astronautics, University of Electronic Science and Technology of China, Chengdu 611731, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1477; https://doi.org/10.3390/rs18101477

Submission received: 27 March 2026 / Revised: 6 May 2026 / Accepted: 7 May 2026 / Published: 8 May 2026


Highlights

What are the main findings?

A two-stage cascaded visual localization network, GRiM-Net, is proposed, which balances computational efficiency and localization accuracy by combining global retrieval for region-of-interest selection with local fine matching for pixel-level coordinate mapping.
A domain-adaptive shared backbone with domain-specific batch normalization is designed, enabling robust feature extraction across cross-view domain discrepancies. Joint multi-task optimization of global retrieval and fine matching losses further enhances feature representations for both similarity measurement and geometric alignment.

What are the implications of the main findings?

The proposed framework provides a visual coordinate regression component for high-precision UAV localization in urban environments.
The domain-adaptive backbone and joint optimization strategy offer a generalizable approach for other remote sensing applications that require cross-domain image matching, such as map updating, change detection, and multi-source image registration.

Abstract

Autonomous flight of unmanned aerial vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments critically depends on accurate and robust visual localization. To tackle the challenges of cross-view domain discrepancies and real-time high-precision matching, we propose GRiM-Net, a two-stage joint optimization visual localization network. First, a global retrieval module aggregates features and selects the most similar satellite map candidate patches from a pre-built index, efficiently narrowing the search from the global map to a local region. Next, a fine matching module performs pixel-level keypoint detection and description on the query image and candidate patches. Bidirectional matching and weighted homography estimation are then used to map the UAV image center to satellite coordinates, yielding precise geographic positions. Both modules share a backbone with domain-adaptive batch normalization, and joint optimization of global retrieval triplet loss with fine matching keypoint, descriptor, and homography reprojection losses enables synergistic enhancement of feature representations. Ablation and comparison experiments conducted on public urban cross-view benchmarks demonstrate that GRiM-Net can achieve efficient and robust geographic coordinate regression for UAVs, providing a practical localization component for broader navigation systems.

Keywords:

UAV visual localization; cross-view image matching; two-stage localization network; domain adaptation; homography estimation

1. Introduction

With significant advantages such as high flexibility, rapid deployment, low operating cost, and strong adaptability to complex environments, UAVs have come to play an important role in many fields [1]. For example, they have been applied to precision agriculture [2], infrastructure inspection, emergency rescue, logistics transportation, environmental monitoring, and smart city management [3]. They have not only greatly improved efficiency and reduced labor costs, but have also become irreplaceable for tasks that are dangerous or take place in hard-to-reach locations.

The ability of UAVs to play a role in the aforementioned fields mainly relies on their integrated capabilities, including autonomous navigation and decision-making, environmental perception and obstacle avoidance, stable flight control, and efficient mission planning and management [4,5]. These capabilities enable them to adapt to complex and dynamic scenarios and complete predetermined tasks. Among them, accurate and reliable positioning technology is the foundation for UAVs to achieve the above functions. Positioning technology not only provides UAVs with real-time location information in space [6,7,8,9], but is also a prerequisite for achieving accurate tracking, hovering, cooperative formation flight, and safe interaction with other objects in the environment. The accuracy, robustness, and real-time performance of the positioning system determine the success or failure of UAV missions. Therefore, the research and optimization of UAV positioning technology is one of the core driving forces for its development [10].

The development of UAV positioning technology can be divided into several stages. The initial stage relied on GNSS alone, and the middle stage combined inertial navigation systems (INS) with GNSS [11]. Over the past decade, GNSS technologies (such as GPS, BeiDou, and Galileo) have been able to provide centimeter-level positioning accuracy in open scenarios, but their vulnerability has become increasingly prominent in complex field or urban operating environments. First, in “urban canyons”, dense forests, or deep mountain valleys, satellite signals are easily blocked by buildings or vegetation, resulting in severe multipath effects and non-line-of-sight (NLOS) transmission; positioning accuracy can then degrade from the meter level to the hundred-meter level, or the receiver can lose lock completely [12]. Second, in military confrontation or sensitive areas, GNSS signals are extremely susceptible to electromagnetic interference (jamming) or spoofing [13], exposing UAVs to the risk of losing control.

To address the challenge of localization in signal-denied environments, the field has recently moved toward autonomous localization centered on visual perception. Researchers initially introduced traditional visual localization techniques based on geometric constraints. Among these, Visual Odometry (VO) performs incremental pose estimation through feature correlation between consecutive frames [14]; it focuses on local real-time motion estimation and therefore suffers from error accumulation. Visual SLAM [15], on the other hand, introduces back-end optimization and loop closure detection to eliminate drift through a globally consistent sparse map. These methods perform well in feature-rich indoor or near-field environments, but often fail when UAVs operate at high altitudes because of repetitive surface textures, drastic scale changes, or motion blur caused by high-speed camera movement. Furthermore, SLAM essentially extracts keypoints, associates them across consecutive frames, and uses multi-view geometry to recover three-dimensional motion from two-dimensional images. It therefore provides coordinates relative to the starting point, in a frame that is established on the fly. Without prior information, it is difficult to obtain the absolute geographic location of the UAV in the Earth coordinate system, which has pushed research toward image matching and localization with global perception capabilities.

In recent years, visual localization methods based on satellite image matching have gradually emerged [16]. By matching images captured by UAVs with previously captured satellite orthophotos with known geographic coordinates, absolute localization without cumulative error can be achieved without relying on GNSS signals. Since 2011, visual localization methods based on satellite images have evolved from manual feature extraction to deep learning methods. Early studies mainly relied on manually extracted geometric features [17,18,19,20,21], especially building facade structures and line features, to achieve matching between ground images and satellite images, verifying the feasibility of cross-view localization. However, due to the limited feature representation capabilities, the localization accuracy was low and the applicability was narrow. With the development of deep learning, UAV visual localization based on deep learning methods has entered a rapid development stage [22,23].

Although deep learning has significantly improved positioning performance, its further development still faces a series of challenges. First, UAV aerial images and satellite images differ significantly in imaging mechanism, temporal phase (season, day and night), and perspective (orthographic vs. oblique). Second, there is an inherent trade-off between real-time performance and accuracy [24], which shows up in both of the current approaches. If fine point-to-point matching is performed directly between a large-scale orthorectified satellite image and the UAV image, high-precision dense matching requires an extremely large number of floating-point operations; recall improves, but the computational overhead often exceeds the capacity of the embedded platform and real-time performance suffers. The alternative is to segment the large-scale orthorectified satellite image into local areas and match these against the UAV image, but the coordinates obtained this way are not accurate enough.

To address the aforementioned issues, this paper proposes GRiM-Net, a novel two-stage UAV visual localization network designed to bridge the gap between global retrieval efficiency and local geometric precision. Unlike conventional hierarchical frameworks that treat retrieval and matching as isolated processes, GRiM-Net introduces a unified architecture centered on a domain-adaptive shared backbone. This design specifically addresses the cross-domain feature misalignment and computational redundancy prevalent in existing methods. By embedding Domain-Adaptive Batch Normalization (DA-BN), the network decouples domain-specific statistical variations while preserving shared structural semantics, a critical advancement for cross-view (UAV-to-satellite) consistency. Furthermore, instead of a simple linear cascade, the global retrieval module and the fine-matching module are intrinsically coupled through a joint multi-task loss mechanism. This ensures that the global descriptors are refined by local geometric constraints, while the local matching benefits from the coarse context provided by global retrieval, achieving a synergistic optimization that is often missing in modular-based localization pipelines. The inference process of this two-stage network is shown in Figure 1. In summary, the main contributions of this research include the following three points:

A two-stage UAV visual localization network, GRiM-Net, is proposed, which shares a single backbone encoder across global retrieval and fine matching, eliminating redundant feature computation while achieving real-time meter-level localization.
A Domain-Adaptive Batch Normalization mechanism is introduced, which maintains domain-specific normalization statistics while sharing convolutional weights, explicitly resolving the cross-domain distributional mismatch between UAV imagery and satellite photos.
A joint multi-task optimization framework is constructed, coupling retrieval and matching through a unified loss that includes homography reprojection supervision, eliminating feature semantic fragmentation from independent stage-wise training and suppressing geometrically inconsistent matches prevalent in remote sensing imagery.

2. Related Work

Cross-view visual localization methods can be broadly categorized into four technical paradigms: image-level retrieval, viewpoint transformation, orientation-aided localization, and pixel-level fine localization. We review each category, discuss its representative works and limitations, and clarify how the proposed GRiM-Net addresses the gaps that remain.

2.1. Image-Level Retrieval

Early cross-view localization methods frame the problem as image retrieval: given a query image, the goal is to retrieve the most similar reference image from a geotagged database, where geographic coordinates are inherited from the matched reference. Workman et al. [25] demonstrated that CNN features pre-trained on ImageNet could be directly applied to cross-view matching, establishing deep learning as a viable alternative to hand-crafted descriptors. Subsequent work by Vo and Hays [26] introduced dual-branch networks that learn separate embeddings for ground and aerial views, with metric learning objectives enforcing cross-view similarity. The introduction of triplet loss [27] further improved the discriminability of learned embeddings by explicitly pushing apart hard negative pairs. To aggregate local convolutional features into compact global descriptors, NetVLAD [28] has been widely adopted in retrieval-based localization pipelines, offering fixed-dimensional representations that are independent of input resolution. More recently, transformer-based architectures [29] have introduced self-attention and cross-attention mechanisms that capture long-range spatial dependencies, improving robustness under large appearance variations.

Despite these advances, image-level retrieval methods share a fundamental limitation: localization precision is bounded by the spatial coverage of the reference database tiles. When database tiles are large, the retrieved match provides only a coarse position estimate. Achieving meter-level accuracy requires either prohibitively dense tiling of the reference map or a subsequent fine localization stage. GRiM-Net addresses this by using retrieval exclusively as a coarse filtering stage, compressing the search space to a small candidate region, and delegating precise coordinate estimation to a learned fine matching module.

2.2. Viewpoint Transformation

A fundamental challenge in cross-view localization is the large appearance gap between ground-level or UAV oblique images and satellite orthophotos, which arises from differences in viewpoint, scale, and imaging geometry. One line of work addresses this through geometric transformation: polar coordinate transformation and projective warping have been applied to reduce the viewpoint gap between oblique imagery and satellite orthophotos by geometrically aligning their spatial layouts prior to feature extraction. These geometric approaches are computationally lightweight but rely on strong assumptions about camera geometry and scene planarity that are frequently violated in practice.

An alternative line of work employs generative adversarial networks (GANs) to synthesize cross-view image translations [30]. By training an image-to-image translation network to convert satellite images to ground-view appearance or vice versa, these methods reduce the domain gap in pixel space [31]. However, GAN-based approaches introduce additional training complexity, require paired cross-view data, and may hallucinate structural details that are inconsistent with the actual scene, introducing noise into the matching process.

In contrast to both geometric and generative transformation approaches, GRiM-Net addresses the cross-domain appearance gap through DA-BN in the feature space rather than the image space. By maintaining independent normalization statistics and affine parameters for the UAV and satellite domains while sharing convolutional weights, DA-BN aligns the underlying feature distributions without modifying the input images or requiring paired training data for image synthesis.

2.3. Orientation-Aided Localization

Several methods have incorporated orientation or heading information as an auxiliary cue to reduce the search space for cross-view matching. By estimating the compass direction of a ground camera from visual cues or magnetic sensors, these methods constrain the rotational degree of freedom in the matching problem, significantly reducing the number of candidate references that need to be evaluated. Some approaches further extend this idea to jointly estimate position and orientation [32], enabling simultaneous localization and pose estimation.

While orientation priors are effective at reducing computational cost, they introduce a dependency on reliable heading estimation, which may be unreliable in environments with strong electromagnetic interference or complex structural occlusion. GRiM-Net does not assume any prior knowledge of UAV orientation, relying solely on the visual content of the query image for both retrieval and fine matching, which makes it applicable under fully GNSS-denied conditions.

2.4. Pixel-Level Fine Localization

Recognition that image-level retrieval cannot achieve meter-level accuracy has motivated a shift toward pixel-level fine localization methods. Rather than inheriting geographic coordinates from the nearest retrieved tile, these approaches directly regress the precise position of the UAV within a satellite image.

A prominent paradigm in this direction is Finding Point with Image (FPI), which formulates UAV localization as an end-to-end heatmap regression task. FPI directly feeds the UAV query image and a satellite search map into a dual-stream network, computing cross-view feature similarity to generate a response heatmap over the satellite image; the peak of the heatmap indicates the predicted UAV position. This single-stage design eliminates the need for an explicit retrieval database and achieves direct meter-level positioning. Subsequent works have extended FPI along several directions: WAMF-FPI introduces weight-adaptive multi-scale feature fusion to improve spatial precision [33]; OS-FPI identifies that independent feature extraction in the two-stream architecture prevents early cross-view information interaction, and proposes a one-stream design that introduces cross-attention between UAV and satellite branches at the backbone stage to improve feature discriminability [34]. More recently, geometry-aware approaches have emerged that reconstruct a local 3D scene from multi-view UAV image sequences and render a bird’s-eye-view (BEV) representation to bridge the viewpoint gap, integrating retrieval and pose estimation within a unified pipeline.

Despite these advances, the FPI paradigm shares a fundamental scalability limitation: it requires the entire satellite search region to be processed jointly with the query image, making inference cost grow rapidly with the size of the operational area. This makes real-time onboard deployment challenging when the UAV operates over a large geographic range. Furthermore, the two-stream architecture’s independent feature extraction creates a representational gap between UAV and satellite branches that early-stage information interaction only partially addresses, leaving cross-domain feature alignment as an open problem.

GRiM-Net takes a complementary approach: rather than extending the FPI paradigm, it adopts a retrieval-and-matching architecture in which global NetVLAD-based retrieval compresses the search space to a small candidate region, resolving the scalability problem, while end-to-end joint optimization of retrieval and matching under a unified multi-task loss resolves the feature semantic fragmentation caused by independent optimization of the two stages. Domain-adaptive batch normalization further addresses cross-domain feature alignment at the normalization level, maintaining domain-specific statistics while preserving shared convolutional representations.

3. Methods

3.1. Problem Definition and Method Overview

Both the UAV query images and the satellite orthophoto reference map used in this work are three-channel RGB images, as detailed in the dataset descriptions in Section 4.1. The proposed architecture, training pipeline, and coordinate reconstruction formulas are designed and validated exclusively for RGB inputs. Adaptation to multispectral or hyperspectral imagery would require modifications beyond the scope of this work.

Let the UAV query image sequence be $Q = \{q_1, q_2, \dots, q_n\}$, where each frame $q_i \in \mathbb{R}^{3 \times H_q \times W_q}$ is a three-channel RGB aerial image captured by the UAV’s onboard camera (Zenmuse P1; DJI, Shenzhen, China), and $H_q$ and $W_q$ represent the pixel height and width of the query image, respectively. Given a large-area orthophoto map $M \in \mathbb{R}^{3 \times H_M \times W_M}$ (three-channel RGB) with geographic coordinate annotations processed via ArcGIS Pro (Esri, Redlands, CA, USA), its lower-left and upper-right latitude and longitude coordinates are denoted as $\mathrm{LL}_{\mathrm{coords}}$ and $\mathrm{UR}_{\mathrm{coords}}$, and the map pixel resolution is $M_w \times M_h$. Our task is to estimate the precise geographic coordinates $\hat{p}_i = (\hat{x}_i, \hat{y}_i)$ of each frame $q_i$, relying solely on the visual content of the image, without GNSS assistance, so as to minimize the geographic distance error between the predicted coordinates and the ground truth coordinates $p_i^{gt}$.

This paper proposes a two-stage visual localization network, GRiM-Net, whose overall architecture is shown in Figure 2. Unlike Figure 1, which illustrates the inference pipeline at a high level, Figure 2 details the internal structure of both modules, including the shared backbone encoder with DA-BN, the NetVLAD-based global retrieval module, and the keypoint decoder and descriptor decoder of the fine matching module, along with their respective loss functions. In the first stage, the global retrieval module divides the satellite map into square sub-images and uses the NetVLAD aggregation layer to extract global descriptors, retrieving several most similar candidate satellite blocks and compressing the search range from the entire map to a local region. In the second stage, the fine matching module performs pixel-level keypoint detection and description on the query image and each candidate block. Through bidirectional k-nearest neighbor (kNN) matching and Random Sample Consensus (RANSAC) homography matrix estimation, the query center is mapped to the satellite coordinate system, and the precise coordinates are output. The two modules share the same VGG-16 [35] backbone encoder and are trained end-to-end through joint multi-task loss.

Regarding resolution: before being fed into the network, the query image $q_i$ and the satellite candidate blocks are uniformly scaled to a processing resolution of $H \times W$ using bilinear interpolation ($H$ and $W$ are integer multiples of 32, so that the five stride-2 downsampling stages of VGG-16 divide the input evenly). The original query resolution $H_q \times W_q$ is used only for the final geographic coordinate reconstruction; all calculations within the network are performed in a unified $H \times W$ coordinate system, independent of the original resolution. In the experiments, $H = W = 320$.

3.2. Satellite Map Preprocessing

The satellite map M is divided into N overlapping image patches using a sliding window with a step size $\Delta s = 256$, resulting in a candidate patch set $R = \{r_1, \dots, r_N\}$:

$$R = \{\, r_{i,j} \mid r_{i,j} = M[\, i \cdot \Delta s : i \cdot \Delta s + S_p,\; j \cdot \Delta s : j \cdot \Delta s + S_p \,] \,\} \tag{1}$$

Here the row index is $i \in \{0, \dots, \lfloor (H_M - S_p) / \Delta s \rfloor\}$ and the column index is $j \in \{0, \dots, \lfloor (W_M - S_p) / \Delta s \rfloor\}$; the block size is $S_p = 512$, the overlap rate is 50%, and the total number of blocks is $N = |R|$. The geographic anchor point coordinates corresponding to each image block are calculated by linear interpolation, as shown in the following formulas:

$$\mathrm{lon}(r_{i,j}) = \mathrm{LL}_{\mathrm{lon}} + \frac{j \cdot \Delta s + S_p / 2}{M_w} \cdot (\mathrm{UR}_{\mathrm{lon}} - \mathrm{LL}_{\mathrm{lon}}) \tag{2a}$$

$$\mathrm{lat}(r_{i,j}) = \mathrm{UR}_{\mathrm{lat}} - \frac{i \cdot \Delta s + S_p / 2}{M_h} \cdot (\mathrm{UR}_{\mathrm{lat}} - \mathrm{LL}_{\mathrm{lat}}) \tag{2b}$$

where $\mathrm{LL}_{\mathrm{coords}} = (\mathrm{LL}_{\mathrm{lon}}, \mathrm{LL}_{\mathrm{lat}})$ denotes the lower-left corner (minimum longitude, minimum latitude) and $\mathrm{UR}_{\mathrm{coords}} = (\mathrm{UR}_{\mathrm{lon}}, \mathrm{UR}_{\mathrm{lat}})$ denotes the upper-right corner (maximum longitude, maximum latitude) of the satellite map. In Equation (2a), column index j increases rightward in image convention, consistent with the eastward increase of longitude; no sign inversion is required. In Equation (2b), row index i increases downward in image convention, whereas geographic latitude increases northward; accordingly, the latitude is computed by subtracting the proportional offset from $\mathrm{UR}_{\mathrm{lat}}$, ensuring that blocks near the top of the map (small i) receive higher latitudes and blocks near the bottom (large i) receive lower latitudes.
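The tiling of Equation (1) and the anchor interpolation of Equations (2a) and (2b) can be illustrated with a short sketch. The function names `patch_grid` and `patch_anchor` are hypothetical, and the sketch assumes that the map pixel resolution $M_w \times M_h$ equals the map width and height in pixels; it is an illustration of the formulas, not the authors' code.

```python
def patch_grid(map_h, map_w, patch=512, stride=256):
    """List the (i, j) indices of all sliding-window patches, following Eq. (1)."""
    rows = (map_h - patch) // stride + 1  # i runs from 0 to floor((H_M - S_p) / ds)
    cols = (map_w - patch) // stride + 1  # j runs from 0 to floor((W_M - S_p) / ds)
    return [(i, j) for i in range(rows) for j in range(cols)]


def patch_anchor(i, j, map_h, map_w, ll, ur, patch=512, stride=256):
    """Geographic centre (lon, lat) of patch r_{i,j}, following Eqs. (2a)-(2b).

    ll = (lon_min, lat_min) and ur = (lon_max, lat_max) are the map corners.
    """
    ll_lon, ll_lat = ll
    ur_lon, ur_lat = ur
    cx = j * stride + patch / 2  # patch centre, pixels from the left edge
    cy = i * stride + patch / 2  # patch centre, pixels from the top edge
    lon = ll_lon + cx / map_w * (ur_lon - ll_lon)  # longitude grows with j
    lat = ur_lat - cy / map_h * (ur_lat - ll_lat)  # latitude shrinks as i grows
    return lon, lat
```

For a 1024 × 1024 map, the sketch yields a 3 × 3 grid of patches, and the top-left patch is anchored a quarter of the way in from the left edge and a quarter of the way down from the top edge.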

Before being fed into the network, each candidate block is scaled to a processing resolution of $H \times W$ using bilinear interpolation. Coordinates are then recalculated using a scaling factor of $(S_p / W, S_p / H)$ to ensure that geographic coordinate calculations are unaffected by scaling. Image block segmentation is performed offline.
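As a small worked example of the scaling factor, the sketch below maps a pixel position at the processing resolution back to the original block grid. The helper name `to_block_pixels` is hypothetical, and applying the factor by multiplying processed-resolution pixel coordinates is an assumption about how the text is meant to be read.

```python
def to_block_pixels(x, y, proc_w=320, proc_h=320, block=512):
    """Map a pixel (x, y) at the processing resolution to original block pixels."""
    # Scaling factor (S_p / W, S_p / H) applied per axis.
    return x * block / proc_w, y * block / proc_h
```

With $H = W = 320$ and $S_p = 512$, the centre of the processed block, (160, 160), maps to (256, 256), the centre of the original 512 × 512 block.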

3.3. Shared Backbone Encoder

3.3.1. Network Structure

The global retrieval module and the fine matching module share the same backbone encoder, which uses the convolutional part of VGG-16 (conv1-conv5, with the fully connected layers removed), as shown in Figure 3, and is initialized with ImageNet pre-trained weights. For an input at the processing resolution $H \times W$, five max-pooling operations (stride 2) reduce the spatial size to $H/32 \times W/32$; the resulting 512-channel output is referred to as the fourth-level feature $f^{(4)}$. This feature serves as the input to both the NetVLAD module of the global retrieval module and the upsampling module of the fine matching module.

3.3.2. Domain Adaptive Batch Normalization

To alleviate the domain differences between UAV images and satellite images, DA-BN is introduced into the shared backbone encoder. The module maintains independent statistics and affine parameters for the UAV domain (d = UAV) and the satellite domain (d = SAT), with one value per channel (C is the number of channels in the layer); the formula is given in Equation (3). The statistics are computed independently because the data distributions of the two domains differ substantially: UAV images are taken at an oblique angle, exhibit perspective distortion, and have a warm tone, whereas satellite images are orthophotos taken from directly above and have a cool tone. Sharing batch normalization statistics would mix these feature distributions, so independent statistics are used to align the underlying distributions. The affine parameters are also learned independently because, after normalization, the features must be rescaled to a suitable value range; this scaling and shift reflects the brightness and contrast style of each domain, and learning it per domain lets the model adapt to each domain’s brightness distribution.

However, since images from the two domains share the same underlying geometric structure within the same geographical area, such as building outlines, road networks, and terrain topology, the network keeps the convolutional-layer parameters shared. This constrains the number of model parameters while promoting collaborative learning of global retrieval and local feature representation.

Surface texture may vary across seasons or phenological cycles, particularly in vegetated areas, and we acknowledge that the shared-structure claim has not been directly validated through dedicated multi-temporal experiments. The VTRN dataset, which spans a five-year temporal gap between UAV image acquisition (2017) and satellite reference imagery (2012), provides indirect evidence that the model can match across temporally misaligned image pairs that include vegetation and structural changes. However, systematic stress testing under controlled seasonal variation, such as comparing spring and autumn acquisitions over the same area, falls outside the scope of the current evaluation and constitutes an important direction for future investigation. The theoretical basis for DA-BN’s potential robustness to such variation lies in its domain-specific affine parameters, which capture domain-level brightness and contrast distributions independently; whether this mechanism generalizes to phenological cycles remains an open empirical question.

In summary, compared to ordinary Batch Normalization (BN), DA-BN avoids mutual interference between the statistics of the two domains while maintaining shared convolutional weights to leverage the synergistic advantages of joint training. During inference, the parameters of the corresponding domain are selected automatically based on the input source.

$$\mathrm{DA\text{-}BN}^{(d)}(z) = \gamma^{(d)} \odot \frac{z - \mu^{(d)}}{\sqrt{\sigma^{2(d)} + \varepsilon}} + \beta^{(d)}, \quad d \in \{\mathrm{UAV}, \mathrm{SAT}\}, \quad \varepsilon = 10^{-5} \tag{3}$$
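A minimal NumPy sketch of Equation (3) may help make the per-domain bookkeeping concrete. The class name `DomainAdaptiveBN` and the method `fit_stats` are hypothetical; `fit_stats` simply sets the statistics from one batch and stands in for the running averages a training framework would maintain. The sketch covers the normalization step only, not training.

```python
import numpy as np


class DomainAdaptiveBN:
    """Per-domain batch normalisation over (N, C, H, W) arrays, as in Eq. (3)."""

    def __init__(self, channels, domains=("UAV", "SAT"), eps=1e-5):
        self.eps = eps
        # Each domain keeps its own statistics (mu, var) and affine parameters.
        self.params = {
            d: {
                "mu": np.zeros(channels),
                "var": np.ones(channels),
                "gamma": np.ones(channels),
                "beta": np.zeros(channels),
            }
            for d in domains
        }

    def fit_stats(self, z, domain):
        """Set one domain's per-channel statistics from a batch (illustrative)."""
        p = self.params[domain]
        p["mu"] = z.mean(axis=(0, 2, 3))
        p["var"] = z.var(axis=(0, 2, 3))

    def __call__(self, z, domain):
        """Normalise z with the statistics and affine parameters of `domain`."""
        p = self.params[domain]
        shape = (1, -1, 1, 1)  # broadcast per-channel vectors over N, H, W
        z_hat = (z - p["mu"].reshape(shape)) / np.sqrt(p["var"].reshape(shape) + self.eps)
        return p["gamma"].reshape(shape) * z_hat + p["beta"].reshape(shape)
```

Normalizing a UAV batch with the UAV statistics yields approximately zero-mean, unit-variance features, whereas normalizing the same batch with the satellite statistics does not, which is the interference that separate statistics are meant to avoid.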

3.3.3. Shared Upsampling Module

The NetVLAD layer in the global retrieval module operates directly on the fourth-level features output by the backbone, whereas the two decoders in the fine matching module require higher-resolution features. This paper introduces a shared upsampling module consisting of two transposed convolutions (both with $4 \times 4$ kernels, a stride of 2, and a padding of 1) that upsamples the fourth-level features from $H/32 \times W/32$ to $H/8 \times W/8$, i.e., by a factor of 4. The two decoders share this output:

$$\tilde{f}^{(4)} = \mathrm{TConv}_2(\mathrm{TConv}_1(f^{(4)})) \in \mathbb{R}^{C_4 \times H/8 \times W/8}, \quad C_4 = 512 \tag{4}$$

In this structure, $\mathrm{TConv}_1$ and $\mathrm{TConv}_2$ are both transposed convolutions that keep the channel count at 512 and double the spatial size each time, for a total factor of four. Both decoders use $\tilde{f}^{(4)}$ as input. This structure reduces parameter redundancy and improves training and inference efficiency.
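The doubling at each stage follows from the standard output-size relation for a transposed convolution without dilation or output padding, out = (in - 1) · stride - 2 · padding + kernel. The short sketch below applies it to the stated hyperparameters; the helper name `tconv_out` is hypothetical.

```python
def tconv_out(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


H = 320                # processing resolution used in the experiments
s0 = H // 32           # backbone output size: 10
s1 = tconv_out(s0)     # after the first transposed convolution: 20
s2 = tconv_out(s1)     # after the second transposed convolution: 40, i.e. H / 8
```

With a 4 × 4 kernel, stride 2, and padding 1, the relation reduces to out = 2 · in, so two stages take the 10 × 10 map to 40 × 40.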

3.4. Global Retrieval Module

3.4.1. Global Feature Extraction

Taking $f^{(4)} \in \mathbb{R}^{C_4 \times H/32 \times W/32}$ as input, its spatial dimensions are flattened into $T = (H/32) \cdot (W/32)$ feature vectors $\{f_t\}_{t=1}^{T}$, each of dimension $C_4 = 512$. NetVLAD is used for global feature aggregation. T varies with the input resolution, whereas the output dimension of NetVLAD depends only on the number of clusters $K_v$ and the number of channels $C_4$ and is independent of the input resolution. Therefore, during inference, a fixed-dimensional descriptor is output for any input resolution. Let there be $K_v = 64$ learnable cluster centers $c_k \in \mathbb{R}^{C_4}$ with corresponding weights $w_k \in \mathbb{R}^{C_4}$ and biases $b_k$. The soft-assignment weights are:

$$\sigma_k(f_t) = \frac{\exp(w_k^{\top} f_t + b_k)}{\sum_{k'=1}^{K_v} \exp(w_{k'}^{\top} f_t + b_{k'})} \tag{5}$$

Residual vector aggregation:

$$V(k, d) = \sum_{t=1}^{T} \sigma_k(f_t) \cdot (f_t(d) - c_k(d)) \tag{6}$$

Intra-normalization is then applied to each cluster’s residual vector:

$$\tilde{V}(k, :) = \frac{V(k, :)}{\| V(k, :) \|_2} \tag{7}$$

The normalized cluster vectors are concatenated and flattened into $\tilde{V}_{flat}$, normalized as a whole, and then compressed to a fixed 512-dimensional descriptor using an offline PCA whitening matrix $W_{pca}$:

$$g = \ell_2\left(W_{pca} \cdot \ell_2(\tilde{V}_{flat})\right) \in \mathbb{R}^{D_g}, \quad D_g = 512 \tag{8}$$

where $\ell_2(\cdot)$ denotes L2 normalization.
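The aggregation in Equations (5) to (7), followed by the inner normalization of Equation (8), can be sketched in NumPy as follows. The function name `netvlad_aggregate` is hypothetical, the small constant added to the norms is a numerical safeguard that is not part of the equations, and the PCA whitening step is omitted because its matrix is computed offline from data.

```python
import numpy as np


def netvlad_aggregate(feats, centers, weights, biases):
    """Aggregate local features into one flattened VLAD vector.

    feats: (T, C) local features; centers, weights: (K, C); biases: (K,).
    Returns a vector of length K * C with unit L2 norm.
    """
    logits = feats @ weights.T + biases              # (T, K)
    logits -= logits.max(axis=1, keepdims=True)      # stabilise the softmax
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)      # Eq. (5): soft assignment
    # Eq. (6): V[k] = sum_t assign[t, k] * (f_t - c_k)
    V = assign.T @ feats - assign.sum(axis=0)[:, None] * centers
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12   # Eq. (7)
    flat = V.reshape(-1)
    return flat / (np.linalg.norm(flat) + 1e-12)     # inner l2 of Eq. (8)
```

Because the sum over t removes the dependence on T, inputs with different numbers of spatial positions produce descriptors of the same length, which is the resolution independence noted in the text.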

3.4.2. Offline Database Construction and Online Retrieval

In the offline phase, descriptors are extracted from all N candidate blocks, a matrix $G_{db} \in \mathbb{R}^{N \times D_g}$ is constructed, and a FAISS-IVF256 index with 64 product quantization (PQ) sub-quantizers is built [36]. FAISS is an efficient library for large-scale vector similarity search; the IVF256 configuration partitions the descriptor space into 256 inverted file index cells to accelerate approximate nearest-neighbor search, while the 64 PQ sub-quantizers compress each descriptor into a compact binary code to reduce memory consumption during online retrieval. In the online retrieval phase, the database descriptors are ranked by cosine similarity to the query descriptor, and the top K = 5 candidate blocks with the highest similarity are selected.
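The online ranking step can be illustrated with an exhaustive NumPy search. This is a deliberate simplification: it replaces the approximate FAISS IVF-PQ index with brute-force cosine similarity and does not use the FAISS API, so it shows what is being ranked rather than how the index accelerates it. The function name `top_k_by_cosine` is hypothetical.

```python
import numpy as np


def top_k_by_cosine(query, database, k=5):
    """Return indices and similarities of the k database rows closest to query."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to every database row
    order = np.argsort(-sims)[:k]      # highest similarity first
    return order, sims[order]
```

Since the descriptors from Equation (8) are already L2-normalized, the cosine similarity reduces to a plain inner product in that setting.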

3.4.3. Retrieval Training Loss

Triplet loss is used during training [37]:

$$L_{ret} = \max\{\, d(g_q, g_{pos}) - d(g_q, g_{neg}) + \delta,\ 0 \,\}, \quad d(a, b) = \| a - b \|_2, \quad \delta = 0.5 \tag{9}$$

This paper adopts Online Hard Example Mining (OHEM) [38]: in each iteration, a forward pass is performed on all negative candidates in the current batch, and the top-m negatives with the highest cosine similarity to the query descriptor (i.e., the most easily confused ones) are selected to participate in the loss calculation. Because the hard negatives are re-selected with the current model at every iteration, the notion of “hard” adapts as the model improves, which keeps the gradient signal informative throughout training.
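A minimal sketch of Equation (9) combined with top-m hard negative selection is given below. It evaluates the loss value only (no gradients); averaging over the m selected negatives and the function name are assumptions made for illustration, since the text does not specify how the selected negatives are combined.

```python
import numpy as np

def triplet_loss_hard_negatives(g_q, g_pos, g_negs, margin=0.5, top_m=2):
    """g_q, g_pos: (D,) descriptors; g_negs: (N, D) negative candidates."""
    # Select the negatives most similar to the query (cosine similarity).
    cos = (g_negs @ g_q) / (np.linalg.norm(g_negs, axis=1) * np.linalg.norm(g_q))
    hard = np.argsort(-cos)[:top_m]
    # Eq. (9) with Euclidean distances, averaged over the selected negatives
    # (the averaging is an assumption of this sketch).
    d_pos = np.linalg.norm(g_q - g_pos)
    d_neg = np.linalg.norm(g_q - g_negs[hard], axis=1)
    loss = np.mean(np.maximum(d_pos - d_neg + margin, 0.0))
    return float(loss), hard
```

Easy negatives that already satisfy the margin contribute zero loss, which is why restricting the sum to the hardest candidates concentrates the gradient on informative triplets.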

3.5. Fine Matching Module

3.5.1. Keypoint Decoder and Descriptor Decoder

Using

{\tilde{f}}^{(4)} \in R^{C_{4} \times H / 8 \times W / 8}

(obtained from Equation (4)) as input, the keypoint decoder and descriptor decoder architecture of SuperPoint [39] are adopted to output keypoint heatmaps and dense descriptors respectively. The input of the two decoders in this paper is unified as

{\tilde{f}}^{(4)} \in R^{C_{4} \times H / 8 \times W / 8}

, rather than their own independent upsampled features. The loss function adopts the original keypoint cross-entropy loss and descriptor hinge loss of SuperPoint. The self-supervised labels are generated by random homography transformation, and the strategy is the same as that of SuperPoint. The specific process is as follows.

The input

{\tilde{f}}^{(4)} \in R^{C_{4} \times H / 8 \times W / 8}

first passes through a 3 × 3 convolutional layer (stride 1, padding 1, output channels 65):

X = \mathrm{Conv}_{kp}({\tilde{f}}^{(4)}) \in R^{H / 8 \times W / 8 \times 65}

(10)

Of the 65 channels, 64 correspond to the pixel positions within each 8 × 8 cell, and 1 is a non-keypoint dustbin. Softmax normalization is applied to the last dimension:

{\bar{P}}_{u, v, c} = \frac{\exp (X_{u, v, c})}{\sum_{c^{'} = 1}^{65} \exp (X_{u, v, c^{'}})}, u \in [0, H / 8), v \in [0, W / 8)

(11)

Discarding the dustbin channel (c = 65) and rearranging the remaining 64 channels pixel by pixel restores a full-resolution keypoint heatmap:

P(8u + r, 8v + c) = \bar{P}_{u, v, 8r + c}, \quad r, c \in \{0, \dots, 7\}

(12)
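Equations (11) and (12) amount to a channel-wise softmax followed by a depth-to-space rearrangement. The NumPy sketch below is illustrative only; it uses zero-based channel indices, so the dustbin is channel index 64 rather than 65, and the function name is hypothetical.

```python
import numpy as np

def decode_heatmap(X):
    """X: (Hc, Wc, 65) logits with Hc = H/8, Wc = W/8.

    Returns the (8*Hc, 8*Wc) keypoint probability map P and the
    per-cell softmax output P_bar of shape (Hc, Wc, 65).
    """
    Hc, Wc, _ = X.shape
    e = np.exp(X - X.max(axis=-1, keepdims=True))
    P_bar = e / e.sum(axis=-1, keepdims=True)        # Eq. (11)
    # Drop the dustbin (index 64); channel 8r + c becomes offset (r, c).
    cells = P_bar[..., :64].reshape(Hc, Wc, 8, 8)
    P = cells.transpose(0, 2, 1, 3).reshape(Hc * 8, Wc * 8)   # Eq. (12)
    return P, P_bar
```

Thresholding P and extracting local maxima, as described in the text, would then operate on this full-resolution map.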

In the heatmap P, each location represents the probability that a keypoint exists within that pixel. During inference, a threshold

τ_{k p} = 0.015

is set, and local maxima are extracted to obtain the keypoint coordinate set

K_{p}

(maximum

N_{k p} = 1024

). The keypoint loss is the pixel-level cross-entropy:

L_{kp}(X, Y) = -\frac{1}{(H / 8) \cdot (W / 8)} \sum_{u, v} \log \bar{P}_{u, v, y_{u, v}}

(13)

where

y_{u, v} \in \{1, \dots, 65\}

is the ground-truth label of the

(u, v)

-th cell.

The descriptor decoder also takes

{\tilde{f}}^{(4)} \in R^{C_{4} \times H / 8 \times W / 8}

(obtained from Equation (4)) as input, and passes it through a 3 × 3 convolutional layer (stride 1, padding 1, output channels D = 256):

D_{coarse} = \mathrm{Conv}_{desc}({\tilde{f}}^{(4)}) \in R^{D \times H / 8 \times W / 8}, \quad D = 256

(14)

The coarse descriptors are bilinearly upsampled to full resolution and then L2-normalized at each position:

D_{full}(h, w) = l_{2}(\mathrm{BilinInterp}(D_{coarse}, h, w)) \in R^{D}

(15)

As a result, the spatial dimensions of the descriptor map match the processing resolution, and the descriptor of each keypoint is obtained by indexing the map at the keypoint's coordinates. The descriptor loss uses a hinge loss to constrain positive and negative pairs:

L_{desc} = \frac{1}{(H \cdot W)^{2}} \sum_{h, w} \sum_{h', w'} l_{d}(D_{full}(h, w), D'_{full}(h', w'); s_{hw, h'w'})

(16)

l_{d}(d, d'; s) = s \cdot \max(0, 1 - λ_{p} - d^{⊤} d') + (1 - s) \cdot \max(0, d^{⊤} d' - λ_{n})

(17)

where

s = 1[(h', w') = π(H_{warp} \cdot [h, w, 1]^{⊤})]

is the correspondence indicator, and

D'_{full}

is the dense descriptor map of the transformed image obtained from the same scene through the known homography transformation

H_{warp}

. The margins are set to

λ_{p} = 0.9, λ_{n} = 0.2

.

3.5.2. Keypoint Matching

Keypoints are matched by combining the Lowe ratio test [40] with bidirectional (mutual) nearest-neighbor matching:

\mathrm{Match}(q, r) = \{ (C_{l}, C_{l}') \mid \mathrm{nn}_{1}(C_{l}) = C_{l}', \ \mathrm{nn}_{1}(C_{l}') = C_{l}, \ \frac{\| d_{l} - d_{l}' \|}{\| d_{l} - d_{l}'' \|} < ρ \}, \quad ρ = 0.8

(18)

where

{d_{l}}^{'}

is the nearest neighbor and

{d_{l}}^{''}

is the second nearest neighbor. Bidirectional mutual verification ensures symmetrical consistency of the matching, and the ratio test uses

ρ = 0.8

to filter ambiguous matches, effectively reducing the false matching rate, which is especially important for scenes with sparse textures.
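The matching rule of Equation (18) can be sketched with a dense distance matrix as follows. This is an illustrative NumPy version with a hypothetical function name; as in Equation (18), the ratio test is applied in the query-to-reference direction only, and at least two reference descriptors are assumed.

```python
import numpy as np

def mutual_ratio_matches(desc_q, desc_r, ratio=0.8):
    """desc_q: (Nq, D) query descriptors; desc_r: (Nr, D) reference
    descriptors with Nr >= 2. Returns a list of (query_idx, ref_idx)."""
    dist = np.linalg.norm(desc_q[:, None, :] - desc_r[None, :, :], axis=2)
    nn_q = dist.argmin(axis=1)      # nearest reference for each query
    nn_r = dist.argmin(axis=0)      # nearest query for each reference
    matches = []
    for i, j in enumerate(nn_q):
        if nn_r[j] != i:            # mutual nearest-neighbor check
            continue
        best, second = np.partition(dist[i], 1)[:2]
        if best < ratio * second:   # Lowe ratio test
            matches.append((i, int(j)))
    return matches
```

A query descriptor that lies roughly equidistant from two reference descriptors fails the ratio test even if it is a mutual nearest neighbor, which is the ambiguity filter described above.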

3.5.3. Homography Matrix Estimation and Coordinate Mapping

Using the matching set

S_{i}^{k} = \{(C_{l}, C_{l}')\}

as input, the homography matrix

H_{k} \in R^{3 \times 3}

is estimated using RANSAC-DLT [41] in the processing resolution

H \times W

coordinate system, with an inlier threshold of

ε = 4 px

and a maximum of 1000 iterations. The RANSAC estimate is then refined over the inlier set, with each correspondence weighted by its keypoint response probability

P (C_{l})

:

H_{k}^{*} = \arg\min_{H} \sum_{l \in I_{k}} P(C_{l}) \cdot \| π(H \cdot \tilde{C}_{l}) - C_{l}' \|_{2}^{2}

(19)
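One simple way to obtain a weighted estimate in the spirit of Equation (19) is a weighted direct linear transform. The sketch below is an assumption-laden illustration rather than the authors' refinement: it minimizes the weighted algebraic error via SVD, which is a linear surrogate for the reprojection error in Equation (19), and it omits coordinate normalization and the preceding RANSAC stage. Function names are hypothetical.

```python
import numpy as np

def weighted_dlt_homography(src, dst, weights):
    """src, dst: (N, 2) inlier coordinates; weights: (N,) non-negative
    keypoint response probabilities; N >= 4. Assumes H[2, 2] != 0."""
    rows = []
    for (x, y), (u, v), w in zip(src, dst, weights):
        s = np.sqrt(w)              # sqrt so squared residuals are scaled by w
        rows.append(s * np.array([-x, -y, -1, 0, 0, 0, u * x, u * y, u]))
        rows.append(s * np.array([0, 0, 0, -x, -y, -1, v * x, v * y, v]))
    _, _, vt = np.linalg.svd(np.asarray(rows))
    H = vt[-1].reshape(3, 3)        # right singular vector of smallest singular value
    return H / H[2, 2]

def project(H, pts):
    """Apply a homography to (N, 2) points, i.e., the operator pi(H * C)."""
    ph = np.c_[pts, np.ones(len(pts))] @ H.T
    return ph[:, :2] / ph[:, 2:3]
```

A nonlinear refinement of the true reprojection error could start from this linear solution; which solver the original work uses is not stated in the text.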

After estimating a homography for each of the Top-K candidate blocks, the optimal candidate is selected based on the inlier ratio

\frac{| I_{k} |}{M_{k} + 1}

:

k^{*} = \arg \max_{k \in {1, \dots, K}} \frac{| I_{k} |}{M_{k} + 1}

(20)

Then, the center of the query image (W/2, H/2) is first mapped in the processing resolution coordinate system:

[x_{p}, y_{p}, 1]^{⊤} = H_{k^{*}} \cdot [W / 2, H / 2, 1]^{⊤}

(21)

Next, the processing-resolution pixel coordinates are rescaled to the original candidate block size

S_{p} \times S_{p}

:

x_{p,orig} = x_{p} \cdot \frac{S_{p}}{W}, \quad y_{p,orig} = y_{p} \cdot \frac{S_{p}}{H}

(22)

Finally, the georegistration information of the reference map is used to convert the pixel position into geographic coordinates:

\hat{p}_{i} = \mathrm{coords}(r_{k^{*}}) + \left( \frac{x_{p,orig}}{S_{p}} - 0.5, \ -\left( \frac{y_{p,orig}}{S_{p}} - 0.5 \right) \right) \cdot (UR_{coords} - LL_{coords}) \cdot \left( \frac{S_{p}}{M_{w}}, \frac{S_{p}}{M_{h}} \right)

(23)

In Equation (23), x_{p,orig} indexes the column (horizontal) direction within the candidate block, corresponding to the longitude component: since the column direction (rightward) is consistent with the eastward increase of longitude, the sign of this offset is positive and no axis inversion is required. The term (y_{p,orig}/S_{p} − 0.5) indexes the row (vertical) direction, corresponding to the latitude component: since image rows increase downward while geographic latitude increases northward, the negative sign is applied so that a predicted point near the top of the block maps to a higher latitude and one near the bottom maps to a lower latitude. The term (·/S_{p} − 0.5) in both components converts top-left-origin pixel coordinates into a center-origin displacement relative to coords(r_{k^{*}}). Equation (22) scales the x and y axes independently, so it remains valid for a non-square processing resolution (H ≠ W).
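The chain of Equations (21)–(23) can be traced with a short sketch. Several interpretations here are assumptions, since the text does not define them explicitly: coords(r_{k*}) is taken as the (longitude, latitude) of the block center, UR/LL as the corner coordinates of the full reference map, and M_w, M_h as the map width and height in pixels. The homogeneous division after applying the homography is also written out explicitly. All function and argument names are hypothetical.

```python
import numpy as np

def map_center_to_geo(H_best, W_proc, H_proc, S_p, block_center, ll, ur, M_w, M_h):
    """Map the query image center to (lon, lat) following Eqs. (21)-(23)."""
    c = H_best @ np.array([W_proc / 2.0, H_proc / 2.0, 1.0])
    x_p, y_p = c[0] / c[2], c[1] / c[2]              # Eq. (21), dehomogenized
    x_o = x_p * S_p / W_proc                          # Eq. (22)
    y_o = y_p * S_p / H_proc
    # Center-origin offset; the row axis is flipped so that "up" means north.
    offset = np.array([x_o / S_p - 0.5, -(y_o / S_p - 0.5)])
    # Geographic extent of one block: map extent scaled by block/map size.
    span = (np.asarray(ur) - np.asarray(ll)) * np.array([S_p / M_w, S_p / M_h])
    return np.asarray(block_center) + offset * span   # Eq. (23)
```

With an identity homography the query center lands on the block center, so the predicted position equals coords(r_{k*}); a shift toward the top-right of the block increases both longitude and latitude.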

3.5.4. Homography Supervision Loss

In addition to using

L_{k p}

and

L_{d e s c}

losses, this paper introduces the homography reprojection loss

L_{h o m o}

. The homography supervision loss is defined as the average reprojection error during training:

L_{homo} = \frac{1}{|S|} \sum_{l=1}^{|S|} \| π(H_{k} \cdot \tilde{C}_{l}) - π(H_{warp} \cdot \tilde{C}_{l}) \|_{2}

(24)

This loss not only directly constrains the geometric alignment accuracy of the final mapped coordinates, but also acts as a form of “geometric hard sample mining” during backpropagation. Remote sensing images often contain numerous repetitive structures with similar textures (such as similar building roofs or regular farmland textures), and descriptors relying solely on local appearance information are prone to producing high-confidence mismatches. Because the loss measures the distance between keypoints projected through the estimated homography and through the known warp, mismatches that have high visual similarity but lack geometric consistency are strongly penalized. This spatial constraint, in turn, encourages the descriptor decoder to separate repetitive textures in the feature space, improving the spatial discriminative power of the descriptors and their robustness to interference across the whole scene.
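For reference, the value of Equation (24) can be computed as in the sketch below. It evaluates the mean reprojection error only; in training this quantity would need to be expressed in an automatic differentiation framework, and how gradients pass through the homography estimate is not specified in the text. Function names are hypothetical.

```python
import numpy as np

def project(Hm, pts):
    """Apply a homography to (N, 2) points with homogeneous division."""
    ph = np.c_[pts, np.ones(len(pts))] @ Hm.T
    return ph[:, :2] / ph[:, 2:3]

def homography_reprojection_loss(H_est, H_warp, pts):
    """Eq. (24): mean distance between points mapped by the estimated
    homography and by the known ground-truth warp."""
    diff = project(H_est, pts) - project(H_warp, pts)
    return float(np.mean(np.linalg.norm(diff, axis=1)))
```

An estimate that differs from the true warp by a pure translation of (3, 4) pixels yields a loss of exactly 5 pixels, which gives a feel for the scale of the penalty.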

3.6. Joint Training Objective

This paper employs end-to-end training with a joint multi-task loss comprising four terms:

L_{total} = α L_{ret} + β_{1} L_{kp} + β_{2} L_{desc} + γ L_{homo}

(25)

The weights are set to

α = 1.0, β_{1} = 0.5, β_{2} = 0.5, γ = 0.3

. The details are shown in Table 1.

During training, after a single forward pass through all modules, gradients are backpropagated directly from

L_{t o t a l}

. The shared encoder simultaneously receives gradient signals from the retrieval path

L_{r e t}

and the matching path

L_{k p} + L_{d e s c} + L_{h o m o}

. To prevent excessively large retrieval gradient magnitudes from interfering with the local feature representation required for fine matching, the learning rate of the shared encoder is set to 1/10 of that of other modules.

It should be noted that the current implementation of GRiM-Net is designed for and validated on three-channel RGB imagery from both UAV and satellite sources. The system requires that both the UAV query images and the satellite reference orthophotos are provided as RGB (red, green, blue) three-channel inputs, as the shared VGG-16 backbone and the datasets (VTRN, RSSDIVCS, MSDI) are all RGB-based. Extension to alternative imaging modalities, such as multispectral UAV cameras (e.g., Micasense RedEdge; MicaSense, Seattle, WA, USA) or multispectral satellite imagery (e.g., Sentinel-2; European Space Agency, Paris, France with 13 spectral bands), would require architectural modifications to accommodate variable input channel numbers, as well as retraining on appropriately paired multispectral datasets, which are not currently publicly available for this task formulation. Investigating the applicability of the proposed framework to multispectral imagery represents an important direction for future work, particularly given the prevalence of multispectral sensors in professional remote sensing applications.

4. Experiments

4.1. Datasets

The core requirement of this paper is: given a frame of UAV aerial imagery, output its precise latitude and longitude coordinates. This requires the training or testing dataset to meet one of the following conditions: (a) containing real UAV images for training the UAV domain parameters of DA-BN; (b) containing image pairs with known homography transformations applied to remote sensing images for self-supervised training of the fine matching module; or (c) containing a large-scale satellite reference map and its pixel size and coverage information to support complete coordinate reconstruction inference. The following is an analysis of each dataset, and the main attributes of the datasets are shown below.

4.1.1. VTRN

The Visual Terrain Relative Navigation (VTRN) dataset was released as part of the ICRA 2022 General Place Recognition (GPR) Competition to evaluate the performance of visual terrain relative navigation and large-scale place recognition algorithms. The dataset originates from a real aerial flight mission, with a flight route extending from Ohio to Pittsburgh, Pennsylvania, USA, and a total distance of approximately 150 km. It contains a sequence of continuously acquired overhead aerial images, corresponding satellite reference images of the area, and synchronously recorded inertial measurement unit (IMU) data and high-precision GPS center point coordinates (GPS-INS). The aerial images are sampled at approximately 20 fps, and each image provides corresponding geographic coordinate information. The reference satellite imagery is sourced from Google Maps satellite tiles, georegistered with the flight path area at a ground sampling distance of approximately 1 m per pixel, and provided as three-channel RGB orthophotos. The aerial images were acquired in 2017, while the satellite imagery was collected in 2012, resulting in a temporal gap of approximately 5 years that introduces scene changes including building construction, road modifications, and vegetation variation, which places high demands on the robustness of the visual retrieval algorithm. The dataset is therefore widely used to study cross-temporal visual matching and large-scale visual navigation problems. Example images from this dataset are shown in Figure 4.

The limitation of this dataset is that it provides only the GPS center point coordinates of each aerial frame, whereas the coordinate reconstruction formula in this paper requires the precise bounding box coordinates (lower-left and upper-right corner latitude/longitude) of each satellite candidate block, i.e., the geographic area covered by the block within the large map. Because this satellite tile boundary metadata is not provided, VTRN cannot be used directly for complete inference. It is, however, suitable for training the large-scale retrieval module.

4.1.2. RSSDIVCS

The RSSDIVCS (Remote Sensing Scene Dataset for Image–Vision Classification and Semantics) dataset consists of high-resolution remote sensing images, containing 42,000 pairs of remote sensing top-view images. The images are 256 × 256 pixels in size and include various typical land scene categories, such as airport, residential, industrial, farmland, forest, river, parking lot, and stadium. The remote sensing images in the dataset mainly come from publicly available remote sensing platforms (such as Google Earth). The RSSDIVCS dataset has rich land structure information and diverse texture features, with significant structural differences between different scenes. Therefore, it is widely used in research on remote sensing visual understanding and remote sensing image classification. Because it lacks geographic coordinates and large-scale reference maps, it does not support complete localization inference. However, using this dataset to train the fine matching module helps improve the model’s ability to perceive local structures and discriminate features in remote sensing images, thereby improving the accuracy of cross-view image matching. Example images from this dataset are shown in Figure 5.

4.1.3. Manchester Surface Drone Imagery (MSDI)

MSDI is a geographically labeled UAV image dataset specifically designed for absolute visual positioning of UAVs. Compared to current mainstream cross-view visual positioning datasets, the MSDI dataset has several unique advantages in data construction and application scenarios. This dataset provides rigorously calibrated multi-view UAV imagery and includes accompanying camera intrinsic and extrinsic parameter information and coordinate transformation matrices. All images embed GPS coordinates, and the accompanying tool can automatically download the corresponding orthophoto reference map from Google Earth/Bing Map based on the GPS coordinates. The resolution is 1 m/pixel, and the coordinates of the lower left and upper right corners of the map can be directly used in the formulas presented in this paper, supporting coordinate reconstruction. This high-precision geometric calibration information is rare in existing public datasets, making MSDI not only suitable for image retrieval tasks but also capable of supporting geometric consistency verification, pose estimation, and fine matching algorithm evaluation, thus better aligning with the research objectives of the fine matching module in this paper. Furthermore, it can be used for complete inference testing. Figure 6 shows a partial example of the dataset.

Based on the above analysis, this paper specifies the following scheme: The large-scale retrieval module is trained using the VTRN dataset. Its large scale and diverse terrain make it a publicly available benchmark for evaluating the retrieval performance of large-scale retrieval modules. The training set contains 10,436 query images and 2853 reference satellite images; the validation set contains 1684 query images and 459 reference satellite images; and the test set contains 1532 query images and 1532 reference images. All images are uniformly set to 500 × 500 resolution. The fine matching module is trained using RSSDIVCS, which generates matching pairs by applying known planar homography transformations, compatible with the SuperPoint self-supervised training strategy. The training set contains 42,000 image pairs, and the test set contains 14,000 image pairs, all uniformly set to 256 × 256 resolution. In the inference testing phase, this paper uses five scenarios from MSDI for experiments, including real UAV-captured images and orthophoto reference maps with pixel size and coverage annotations, supporting the coordinate reconstruction presented in this paper. This section contains several typical urban scene areas, including the Metropolitan University (29 images), the Energy Center (71 images), the Business School (37 images), the Church of Our Lady (58 images), and the Museum (48 images), with an image resolution of approximately 4579 × 3427.

The MSDI dataset was collected using a Parrot Anafi drone (Parrot SA, Paris, France). The camera parameters below are based on the hardware specifications of the drone and the dataset description file. The camera uses a CMOS sensor (Sony Corporation, Tokyo, Japan) with a resolution of 21 megapixels and a sensor size of 1/2.4 inch, and is equipped with a low-dispersion aspherical (ASPH) lens with an f/2.4 aperture. The sensitivity (ISO) ranges from ISO 100 to 3200 [42], and the shutter speed ranges from 1 s to 1/10,000 s. The MSDI dataset is specifically designed for cross-view image registration and visual localization, and its acquisition strategy covers three camera poses obtained by adjusting the pan-tilt pitch angle: (1) downward facing (nadir), with a pitch angle of −90° (the largest proportion in the dataset, 446 images in total); (2) forward facing at 45°, with a pitch angle of −45° (89 images in total); and (3) forward facing at 0°, with a pitch angle of 0° (64 images in total). Finally, the field of view (FOV) depends on the imaging mode of the camera: the maximum diagonal FOV (DFOV) is 110°; in wide-angle photo mode, the horizontal FOV (HFOV) is 84°; in distortion-corrected (rectilinear) mode, the HFOV is 75.5°; and in video recording mode, the HFOV is 69°.

4.2. Evaluation Metrics

To objectively evaluate the positioning accuracy and computational efficiency of GRiM-Net, this paper selects Average Localization Error (ALE), Single-frame Localization Time [43], Recall@K and matching success rate as core evaluation metrics to comprehensively measure the model’s overall performance.

ALE: This metric directly quantifies the actual physical distance (usually in meters) between the predicted geographic coordinates output by the network and the true location coordinates. In visual positioning tasks without GNSS assistance, the estimation accuracy of absolute geographic coordinates is the most intuitive standard for evaluating the model’s cross-view spatial mapping capability, directly determining its reliability in practical applications such as UAV autonomous navigation and target reconnaissance.

Single-frame Localization Time (T): Considering the strict limitations on the computing resources and power consumption of the UAV onboard platform, and the real-time requirements for position estimation in actual flight, single-frame localization time is crucial for measuring the algorithm’s computational efficiency and engineering feasibility. A model with high accuracy but without efficient inference cannot meet the needs of UAV deployment.

In addition to ALE and inference time, two supplementary metrics are adopted to provide a more comprehensive evaluation of system reliability. First, Recall@K measures the retrieval stage’s ability to include the ground-truth satellite region within the Top-K candidate set, defined as the percentage of query frames for which at least one retrieved candidate block center lies within 64 pixels of the ground-truth location. This metric isolates the contribution of the retrieval module and quantifies the upper bound of localization reliability imposed by the retrieval stage. Second, the frame-level localization success rate is defined as the percentage of test frames achieving ALE below 20 m. This metric directly reflects the proportion of frames for which the localization output is practically usable under real UAV deployment conditions, providing an operationally meaningful complement to mean ALE.
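The metrics defined above can be computed as in the following sketch. Using the haversine great-circle distance to convert coordinate differences into meters is an assumption made here for illustration, since the text does not state how the physical distance is obtained; the function names are hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

def ale_and_success_rate(pred, truth, threshold_m=20.0):
    """pred, truth: lists of (lat, lon). Returns (ALE in m, success rate in %)."""
    errs = [haversine_m(p[0], p[1], t[0], t[1]) for p, t in zip(pred, truth)]
    ale = sum(errs) / len(errs)
    success = 100.0 * sum(e < threshold_m for e in errs) / len(errs)
    return ale, success

def recall_at_k(candidate_centers, gt_pixels, k=5, radius_px=64.0):
    """candidate_centers: per-frame ranked lists of (x, y) block centers;
    gt_pixels: per-frame ground-truth (x, y). Returns Recall@K in %."""
    hits = 0
    for cands, (gx, gy) in zip(candidate_centers, gt_pixels):
        if any(math.hypot(cx - gx, cy - gy) <= radius_px for cx, cy in cands[:k]):
            hits += 1
    return 100.0 * hits / len(gt_pixels)
```

Reporting the success rate alongside the mean ALE helps distinguish a uniformly moderate error from a mostly accurate system with a few large failures.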

By combining the above indicators, we can objectively and systematically verify whether the proposed GRiM-Net has achieved a good balance between positioning accuracy and computational efficiency while achieving high-precision cross-domain visual matching.

4.3. Implementation Details

All images were scaled to a processing resolution H = W = 320 using bilinear interpolation. This constraint applies to the network’s internal computation: after five max-pooling operations with a stride of 2, the VGG-16 feature map is downsampled to 1/32 of the input size, so a processing resolution that is an integer multiple of 32 is necessary to guarantee divisibility. The original image sizes of each dataset (500 × 500, 4579 × 3427, etc.) can all be rescaled to 320 × 320 using bilinear interpolation, without requiring the original size to satisfy any divisibility condition. During training, the fine matching module applies a random homography transformation to the RSSDIVCS images to generate self-supervised matching pairs.

Resolution Sensitivity Analysis. To assess whether the 320 × 320 processing resolution preserves sufficient geometric detail for accurate localization, we evaluate GRiM-Net at five processing resolutions: 160 × 160, 256 × 256, 320 × 320, 384 × 384, and 512 × 512, all divisible by 32 as required by the VGG-16 backbone architecture, with fixed model weights on the MSDI test set. As reported in Table 2, mean ALE increases substantially at 160 × 160 (21.53 m, +46.3% relative to the default), confirming that severely reduced resolution leads to meaningful loss of structural detail necessary for reliable keypoint detection and descriptor matching. At 256 × 256, ALE remains moderately elevated (16.48 m, +12.0%). Beyond 320 × 320, the performance improvement saturates rapidly: increasing to 384 × 384 reduces mean ALE by only 0.11 m (0.7%), and increasing to 512 × 512 reduces it by 0.65 m (4.4%), while incurring proportionally higher memory and inference time costs due to the quadratic growth of feature map sizes. These results confirm that 320 × 320 represents an effective operating point that balances localization accuracy with computational efficiency, and that the processing resolution does not constitute a critical performance bottleneck under the MSDI evaluation conditions.

All experiments were performed on an NVIDIA RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) using the PyTorch version 2.0.1 framework (Meta Platforms, Inc., Menlo Park, CA, USA). The optimizer was Adam. The two modules are jointly trained by iteratively optimizing them at the epoch level with their respective loss components of the joint multi-task objective

L_{t o t a l}

. Specifically, in each training epoch, mini-batches sampled from VTRN are used to compute the retrieval loss

L_{r e t}

and update the shared backbone encoder together with the NetVLAD aggregation layer; mini-batches sampled from RSSDIVCS are used to compute the matching losses

L_{kp} + L_{desc} + L_{homo}

and update the shared backbone encoder together with the keypoint and descriptor decoders. Because the shared backbone encoder participates in both update steps within each epoch, it receives gradient signals from both the global retrieval objective and the local matching objective throughout training, enabling it to develop feature representations that simultaneously serve global similarity measurement and local geometric precision. The MSDI dataset is used exclusively for inference testing. During inference, K in Top-K is set to 5, the RANSAC inlier threshold is

ε = 4 p x

, the maximum number of iterations is 1000, the Lowe ratio is

ρ = 0.8

, the keypoint detection threshold is

τ_{k p} = 0.015

, and a maximum of 1024 keypoints are extracted per frame. The FAISS-IVF256 index is built offline after training.

The joint loss weights were

α = 1.0, β_{1} = 0.5, β_{2} = 0.5, γ = 0.3 .

The joint loss weights

α

,

β_{1}

,

β_{2}

and

γ

are determined empirically through a two-step procedure. First, the magnitudes of the individual loss terms are monitored during initial training runs to characterize their natural scale differences. The weights are then set such that the gradient contributions from the retrieval path (L_ret) and the combined matching path (L_kp + L_desc + L_homo) are of comparable magnitude at the start of joint training, following common practice in multi-task learning to prevent any single task from dominating the shared encoder’s gradient signal.

To assess the sensitivity of localization performance to these weight settings, we evaluate GRiM-Net under eight alternative configurations in which each weight is individually varied by factors of 0.5× and 2.0× relative to the default, with all other weights held constant. As reported in Table 3, the mean ALE across the five MSDI scenarios varies by only 1.37 m to 2.79 m across all tested configurations, remaining well within the range of practical localization requirements. The largest degradation occurs when γ is halved, reflecting the importance of homography reprojection supervision in enforcing geometric consistency during joint training. These results confirm that the method is not critically sensitive to the precise choice of loss weights, and that the default configuration represents a robust operating point rather than an overfitted hyperparameter setting.

4.4. Main Experimental Results

4.4.1. Ablation Experiments

To verify the contribution of each module to the localization performance, this paper designed module combination ablation experiments on five MSDI scenarios. The results are shown in Table 4. In the table, G, M, and L represent the large-scale retrieval module, the fine matching module, and the end-to-end joint optimization training, respectively; S indicates that the proposed learned fine matching module is replaced by SIFT [44]-based feature matching, serving as a classical hand-crafted baseline to quantify the performance gain introduced by the learned keypoint detection and descriptor matching design over traditional gradient-based methods. The first to third rows are ablation experiments with different combinations, and the fourth row (G + M + L) is the complete GRiM-Net method in this paper.

First, the fine matching module in this paper is analyzed (combining the results of G + S, G + M, and G + M + L). As shown in Table 4, after replacing the fine matching module with the traditional SIFT algorithm (G + S), the average localization error in all five scenarios increased significantly. The fundamental reason is that the SIFT algorithm extracts features based on the gradient orientation histogram of the underlying image, which lacks sufficient robustness to the differences in illumination, tone shift, and perspective distortion between UAV aerial images (captured at oblique viewing angles due to UAV attitude variation during flight) and satellite orthophoto images, resulting in a high mismatch rate in cross-domain scenarios. Regarding inference time, the experiment demonstrates the significant advantage of the shared backbone architecture in this paper. The single-frame time of G + M is higher than that of G + S. This is because the large-scale retrieval and fine matching modules run independently without joint optimization, and the feature extraction process of the VGG-16 backbone network is repeatedly calculated. However, after introducing end-to-end joint training and inference (G + M + L), the average single-frame time of the model is lower than that of the independent module (G + M), and even lower than that of the combination using the SIFT algorithm (G + S). This experimental result strongly validates the superiority of the shared feature extraction architecture presented in this paper: under the joint architecture, the network only needs to perform one forward propagation to compute the shared features, which can then be simultaneously output to both the global retrieval (NetVLAD) and fine matching decoders. In contrast, SIFT cannot reuse deep features and must perform intensive low-level computations on the original image.

Then, the strategy of removing the joint optimization training is analyzed (comparison between G + M and G + M + L). Retaining the large-scale retrieval module and the fine matching module, but removing the end-to-end joint optimization, the average ALE for all five scenarios increases, and the average single-frame latency also rises. This phenomenon reveals the core mechanism of joint optimization: when the two modules are trained independently, their backpropagation gradients are not aware of each other, and the features extracted by the shared encoder cannot simultaneously serve both global similarity measurement and local keypoint discrimination, leading to feature semantic mismatch; in scenarios with highly similar building structures, once erroneous candidate blocks are recalled during the retrieval phase, the independently trained matching module cannot correct the error through feature-level collaboration, ultimately resulting in cascading errors. The significant increase in latency stems directly from the architectural overhead of independent inference: when the two modules execute serially, the features of the shared encoder need to be repeatedly calculated, making it impossible to reuse intermediate results. Joint optimization serves both modules simultaneously with a single forward propagation, reducing inference time to a level that meets the real-time positioning requirements of UAVs.

Finally, the effect of removing the large-scale retrieval module is analyzed (comparison between M + L and G + M + L). When the large-scale retrieval module is removed and replaced with block-by-block sliding-window matching over the entire satellite map, the average positioning error increases slightly in all five scenarios, and the average single-frame latency also increases. Considering both accuracy and efficiency, the large-scale retrieval module is a necessary component for maintaining high-precision positioning while ensuring real-time performance.

4.4.2. Robustness Analysis of Top-K Retrieval

The two-stage architecture of GRiM-Net relies on the global retrieval module to include the ground-truth satellite region within the Top-K candidates. To evaluate the practical reliability of this design, we conduct two complementary analyses on the MSDI test set: a retrieval recall analysis across candidate set sizes, and a sensitivity analysis of final localization accuracy with respect to K.

Retrieval Recall Analysis.

Table 5 reports Recall@K for K∈{1, 2, 3, 5, 8, 10}, where a retrieval is considered successful if the Top-K candidate set contains at least one satellite block whose center lies within 64 pixels of the ground-truth location. Recall@K increases from 82.3% at K = 1 to 92.2% at K = 5, and reaches 93.1% at K = 10. The marginal gain from K = 5 to K = 10 is only 0.9 percentage points, indicating that the recall curve has effectively saturated at the default operating point. At K = 5, the correct region is present in the candidate set for 92.2% of test frames, demonstrating that retrieval failure does not constitute a critical bottleneck under normal operating conditions.
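The success criterion behind Recall@K can be sketched in a few lines. This is a minimal illustration rather than the authors' evaluation code: the function name `recall_at_k`, the data layout (one ranked list of block-center pixel coordinates per query), the use of Euclidean distance, and the toy coordinates are all assumptions. Only the 64-pixel radius is taken from the text.

```python
import math


def recall_at_k(ranked_centers, gt_points, k, radius_px=64.0):
    """Percentage of queries whose Top-k candidates contain at least one
    block center within radius_px (Euclidean, assumed) of the ground truth."""
    if not gt_points:
        raise ValueError("at least one query is required")
    hits = 0
    for centers, gt in zip(ranked_centers, gt_points):
        # Only the first k retrieved candidates are considered.
        if any(math.dist(c, gt) <= radius_px for c in centers[:k]):
            hits += 1
    return 100.0 * hits / len(gt_points)


# Toy data (hypothetical): three queries, two ranked candidates each.
ranked = [
    [(100, 100), (400, 400)],  # correct block ranked first
    [(500, 500), (130, 90)],   # correct block ranked second
    [(900, 900), (800, 800)],  # correct block never retrieved
]
ground_truth = [(110, 120), (128, 96), (0, 0)]

recall_1 = recall_at_k(ranked, ground_truth, k=1)
recall_2 = recall_at_k(ranked, ground_truth, k=2)
```

In this toy case the second query only counts as a success once K reaches 2, which is the same effect that lifts the reported recall as K grows.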

Sensitivity to K.

Table 6 reports ALE and mean per-frame inference time across the five MSDI scenarios for K∈{1, 3, 5, 10} with fixed model weights, isolating the effect of candidate set size from all other factors.

Two consistent trends are observed. First, increasing K from 1 to 5 yields substantial ALE reductions across all five scenarios. This confirms that the multi-candidate selection mechanism, which identifies the best candidate via inlier-ratio scoring as described in Section 3.5.3, compensates for imperfect retrieval rankings by providing the fine matching module with a diverse set of candidates from which the geometrically most consistent one is selected. The reduction is particularly pronounced in structurally repetitive scenes: in the Holy Name Church scenario, where similar architectural patterns increase retrieval ambiguity, ALE decreases from 41.31 m at K = 1 to 29.62 m at K = 5, a reduction of 28.3%. Second, increasing K from 5 to 10 lowers the ALE by less than 0.2 m on average, while inference time increases from 0.41 s to 0.68 s per frame, a 65.9% increase attributable to the growth of fine matching computations with the number of candidates. These results indicate that K = 5 balances localization accuracy against the real-time constraints of UAV onboard platforms, and that localization accuracy, unlike inference time, is not sensitive to the precise choice of K in the range [5, 10].
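The multi-candidate selection step can be illustrated with a short sketch. The exact scoring rule is defined in Section 3.5.3 and is not reproduced here, so the following is an assumption-laden illustration: the `CandidateResult` record, the `select_best_candidate` helper, and the `min_matches` guard are hypothetical names, and the inlier ratio is taken to be the number of homography inliers divided by the number of putative matches.

```python
from dataclasses import dataclass


@dataclass
class CandidateResult:
    block_id: int      # index of the retrieved satellite block
    num_matches: int   # putative matches kept after bidirectional matching
    num_inliers: int   # matches consistent with the fitted homography


def inlier_ratio(result):
    """Share of putative matches that agree with the homography."""
    if result.num_matches == 0:
        return 0.0
    return result.num_inliers / result.num_matches


def select_best_candidate(results, min_matches=4):
    """Return the candidate with the highest inlier ratio, or None.

    A homography needs at least four correspondences, so candidates with
    fewer matches are skipped (this threshold is an assumption)."""
    valid = [r for r in results if r.num_matches >= min_matches]
    if not valid:
        return None
    return max(valid, key=inlier_ratio)


# Toy Top-3 result set (hypothetical numbers).
top_k = [
    CandidateResult(block_id=0, num_matches=120, num_inliers=30),
    CandidateResult(block_id=1, num_matches=80, num_inliers=56),
    CandidateResult(block_id=2, num_matches=3, num_inliers=3),
]
best = select_best_candidate(top_k)
```

Here the block ranked second by retrieval wins because its matches are the most geometrically consistent, which is how a larger K can recover from an imperfect retrieval ranking.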

4.4.3. Robustness Analysis Under Illumination Variation

Real-world UAV deployments commonly involve illumination conditions that deviate from those present during training, including overcast skies, strong direct sunlight, and low-contrast atmospheric haze. To evaluate GRiM-Net’s robustness to such variations without retraining, we apply three photometric perturbations to the MSDI test images at inference time: brightness reduction (×0.5, simulating overcast or twilight conditions), brightness increase (×1.5, simulating strong direct sunlight), and contrast reduction (×0.6, simulating haze or mild overexposure). Model weights remain unchanged throughout. The results are reported in Table 7.
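The three perturbations can be sketched on raw pixel values. The paper does not spell out the formulas, so this is one plausible reading under stated assumptions: brightness is scaled multiplicatively, contrast is scaled around the image mean, and 8-bit grayscale values are clipped to [0, 255]. The helper names and the 2 × 2 toy image are hypothetical.

```python
def clip8(value):
    """Round to the nearest integer and clamp to the 8-bit range."""
    return max(0, min(255, int(round(value))))


def adjust_brightness(pixels, factor):
    """Scale every pixel by `factor` (assumed multiplicative model)."""
    return [[clip8(p * factor) for p in row] for row in pixels]


def adjust_contrast(pixels, factor):
    """Scale deviations from the image mean by `factor` (assumed model)."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [[clip8(mean + factor * (p - mean)) for p in row] for row in pixels]


# Toy 2 x 2 grayscale image (hypothetical values).
image = [[0, 100], [200, 250]]

low_brightness = adjust_brightness(image, 0.5)
high_brightness = adjust_brightness(image, 1.5)
low_contrast = adjust_contrast(image, 0.6)
```

Note that the ×1.5 case saturates bright pixels at 255, so part of the information loss under strong brightness comes from clipping rather than from the scaling itself.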

As shown in Table 7, GRiM-Net maintains stable localization performance across all three perturbation conditions. The mean ALE increases by 17.0% under low brightness (from 14.72 m to 17.23 m), 9.4% under high brightness (from 14.72 m to 16.10 m), and 5.2% under low contrast (from 14.72 m to 15.48 m). The ordering of degradation (low brightness > high brightness > low contrast) is consistent with the expected behavior of the domain-adaptive batch normalization mechanism introduced in Section 3.3.2. DA-BN maintains independent normalization statistics and affine parameters per domain, which implicitly normalizes domain-specific contrast and brightness distributions; contrast reduction, being the perturbation most directly compensated by variance normalization, produces the smallest ALE increase. The larger degradation under low brightness is attributable to the suppression of low-intensity texture details, which reduces keypoint detector response and yields fewer high-confidence matches in the fine matching stage.

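Across all scenarios, the largest absolute single-scene ALE increase is 4.39 m (Holy Name Church under low brightness, from 29.62 m to 34.01 m, a 14.8% rise). In relative terms, the largest increases in Table 7 also occur under low brightness, at Metropolitan University (from 4.33 m to 5.25 m, 21.2%) and Business School (from 13.96 m to 16.82 m, 20.5%); every other scene and perturbation combination stays below 20%. These results indicate that GRiM-Net exhibits practical robustness to realistic illumination variations encountered in UAV operations.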

Regarding seasonal and temporal variation, the VTRN dataset used for retrieval module training provides indirect evidence of cross-temporal robustness: UAV images were acquired in 2017 while satellite reference images originate from 2012, a five-year gap encompassing changes in vegetation state, building structures, and road layouts. The retrieval performance reported on this dataset in Section 4.4.1 therefore reflects the model’s ability to match across temporally and seasonally misaligned image pairs.

4.4.4. Comparative Experiments

The selection of comparison baselines is governed by the task formulation of this work. Cross-view image retrieval methods, such as transformer-based approaches (e.g., TransGeo, SliceMatch) and attention-based methods (e.g., LPN), output a ranked list of satellite image candidates and are evaluated by Recall@K. GRiM-Net instead addresses pixel-level coordinate regression: it directly outputs absolute geographic coordinates and is evaluated by mean localization error in meters. Adapting retrieval-based methods to produce coordinate outputs would require attaching an independent coordinate reconstruction backend, which would create a composite pipeline that evaluates our framework with alternative matching modules rather than those methods under their original design. Diffusion-based and computationally intensive dense matching approaches (e.g., RoMa, DeDoDe) are additionally excluded on practical grounds: their per-frame inference time of several seconds is incompatible with the real-time constraints of the UAV onboard platforms that this work targets. Accordingly, SIFT is selected as a representative classical baseline for cross-view feature matching, and GLVL as the most complete existing system operating under the same task formulation of absolute coordinate regression from UAV imagery.

GRiM-Net is compared with these two baselines in the five MSDI scenarios, and the results are shown in Table 8. GRiM-Net achieves a lower ALE and a lower average frame time than both SIFT and GLVL in all five scenarios. In scenarios with similar building structures, training on the hard negative samples mined by OHEM effectively reduces the false detection rate. To visualize the distribution of localization errors over the five test scenarios, scatter plots are provided in Figure 7. Figure 7a shows the localization error of GRiM-Net for each frame of the test set; the errors show no significant chain effect, that is, a large error in one frame does not propagate to subsequent frames. Figure 7b compares GRiM-Net with the two baseline methods.

To provide a more operationally meaningful characterization of localization reliability, we report the frame-level localization success rate, defined as the percentage of test frames whose localization error is below 20 m. As reported in Table 9, GRiM-Net achieves an overall success rate of 89.3% across all 243 MSDI test frames, with per-scenario rates ranging from 83.8% (Business School) to 96.6% (Metropolitan University).
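The success-rate metric is a simple thresholded count, sketched below. The function name `success_rate` and the per-frame error values are hypothetical; only the 20 m threshold comes from the text, and the comparison is taken to be strict ("below 20 m").

```python
def success_rate(frame_errors_m, threshold_m=20.0):
    """Percentage of frames whose localization error is below threshold_m."""
    if not frame_errors_m:
        raise ValueError("at least one frame is required")
    successes = sum(1 for e in frame_errors_m if e < threshold_m)
    return 100.0 * successes / len(frame_errors_m)


def mean_error(frame_errors_m):
    """Arithmetic mean of the per-frame errors (the ALE of these frames)."""
    return sum(frame_errors_m) / len(frame_errors_m)


# Toy per-frame errors in meters (hypothetical values).
errors = [3.2, 18.9, 20.0, 64.5, 7.1]

rate = success_rate(errors)
ale = mean_error(errors)
```

A single large outlier pulls the mean above the threshold even though most frames succeed. This is one possible reading of how a scene-level ALE above 20 m can coexist with a success rate above 80% in Tables 8 and 9.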

Error Source Analysis: The primary source of error frames is retrieval-stage displacement. The Recall@K analysis in Table 5 shows that at K = 5, 7.8% of test frames do not satisfy the 64-pixel center-proximity criterion. Although the ground-truth position remains within the spatial coverage of the retrieved block, this peripheral displacement introduces an initial positional bias into the fine matching stage. Compounded with the inherent estimation error of homography fitting, the cumulative ALE for these frames falls predominantly in the observed 64–250 m range. The K-sensitivity results in Table 6 directly corroborate this mechanism: increasing K from 1 to 5 reduces mean ALE by 33.7%, as a larger candidate set increases the probability that at least one retrieved block is well-centered on the ground-truth location, reducing the initial positional bias propagated to the fine matching stage. The secondary source is the error of the fine matching module. These errors mainly occur in structurally repetitive scenarios, where geometrically inconsistent descriptor matches can occur.
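The aggregate figures quoted from Table 6 can be recomputed directly from the per-scene values. The sketch below copies the table rows in the scene order Metropolitan University, Energy Center, Business School, Holy Name Church, Museum; the helper names are illustrative, and no rounding convention beyond ordinary arithmetic is assumed.

```python
# Per-scene ALE in meters for each K, copied from Table 6.
ALE_BY_K = {
    1: [9.21, 18.43, 22.14, 41.31, 19.86],
    3: [5.88, 13.22, 15.47, 31.79, 15.15],
    5: [4.33, 11.84, 13.96, 29.62, 13.85],
    10: [4.21, 11.71, 13.74, 29.41, 13.81],
}


def mean(values):
    return sum(values) / len(values)


def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


mean_ale = {k: mean(v) for k, v in ALE_BY_K.items()}

# Mean ALE reduction when K grows from 1 to 5.
drop_1_to_5 = relative_reduction(mean_ale[1], mean_ale[5])

# Absolute mean ALE gain when K grows from 5 to 10.
gain_5_to_10 = mean_ale[5] - mean_ale[10]

# Holy Name Church (index 3) reduction from K = 1 to K = 5.
church_drop = relative_reduction(ALE_BY_K[1][3], ALE_BY_K[5][3])
```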

4.4.5. Analysis of DA-BN

To further validate the effectiveness of the DA-BN mechanism, we compare the full GRiM-Net against a variant in which DA-BN is replaced by standard batch normalization, with all other components held constant. As shown in Table 10, removing DA-BN leads to increased ALE across all five MSDI scenarios. This degradation is attributable to the fundamental distributional mismatch between UAV oblique imagery and satellite orthophotos: standard batch normalization computes shared normalization statistics across both domains, conflating their distinct brightness, contrast, and tone distributions and reducing the discriminability of the shared encoder’s feature representations for both retrieval and matching. DA-BN addresses this by maintaining independent normalization statistics and affine parameters for each domain while sharing convolutional weights, allowing the encoder to normalize domain-specific low-level appearance variation without sacrificing the domain-invariant geometric and structural representations learned through the shared weights. In absolute terms, the improvement is most pronounced in the Holy Name Church scenario (from 38.48 m to 29.62 m), where cross-domain appearance differences are compounded by structural repetitiveness, suggesting that DA-BN’s domain alignment is particularly beneficial in scenes where descriptor discriminability is already challenged by scene-level ambiguity.
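The idea of per-domain normalization can be illustrated with a toy, single-channel sketch in plain Python. This is not the network's implementation, which would use the batch normalization layers of a deep learning framework inside the shared encoder: the class name, the domain labels `"uav"` and `"satellite"`, the momentum and epsilon values, and the input numbers are all assumptions, and the affine parameters are left at their initial values rather than learned.

```python
class DomainAdaptiveBatchNorm:
    """Toy single-channel batch norm with separate state per domain."""

    def __init__(self, domains=("uav", "satellite"), momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        # Independent running statistics and affine parameters per domain.
        self.state = {
            d: {"mean": 0.0, "var": 1.0, "gamma": 1.0, "beta": 0.0}
            for d in domains
        }

    def __call__(self, values, domain, training=True):
        s = self.state[domain]
        if training:
            # Batch statistics come only from samples of this domain.
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            m = self.momentum
            s["mean"] = (1 - m) * s["mean"] + m * mean
            s["var"] = (1 - m) * s["var"] + m * var
        else:
            mean, var = s["mean"], s["var"]
        scale = s["gamma"] / (var + self.eps) ** 0.5
        return [scale * (v - mean) + s["beta"] for v in values]


bn = DomainAdaptiveBatchNorm()
# Same structure, different brightness and contrast in the two domains.
uav_out = bn([10.0, 20.0, 30.0], domain="uav")
sat_out = bn([100.0, 120.0, 140.0], domain="satellite")
```

Because each domain is normalized with its own statistics, the two toy batches end up with nearly identical normalized values despite very different raw intensity ranges. With a single shared set of statistics, the same inputs would remain far apart after normalization.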

5. Conclusions

To address the challenges of cross-view domain differences and high-precision real-time matching in UAV visual localization without GNSS assistance, this paper proposes a two-stage joint optimization visual localization network, GRiM-Net. The network integrates large-scale global retrieval and fine local matching into a unified architecture, and a shared backbone encoder reduces computational redundancy in feature extraction. To mitigate the domain shift between UAV-captured imagery and orthophoto satellite maps, a domain-adaptive batch normalization mechanism is introduced in the feature encoding stage. It aligns the low-level feature distributions of the two domains while maintaining parameter sharing and collaborative learning. Furthermore, a joint multi-task loss function is designed so that end-to-end training balances metric learning of global descriptors against spatial consistency constraints on local keypoints, mitigating the cascading error propagation caused by mismatched feature semantics in traditional step-by-step methods.

Experiments on public datasets support the effectiveness of GRiM-Net. Ablation experiments demonstrate the contribution of each module in the proposed network. Comparative analysis shows that, relative to the SIFT and GLVL baselines, the proposed method not only reduces the ALE in challenging urban scenarios but also achieves single-frame inference times of 0.19–0.61 s on an RTX 4090 GPU, satisfying real-time requirements under desktop-GPU validation conditions. We note that formal validation on UAV onboard embedded platforms, such as NVIDIA Jetson AGX Orin, including profiling of memory bandwidth, power consumption, and thermal constraints, remains an important direction for future work to fully substantiate the real-time deployment claim in practice. Additionally, extending robustness evaluation to texture-limited environments such as snow-covered terrain and desert regions, and validating the homography-based coordinate recovery under scenes with severe non-planar structures, represent further directions for investigation.

In conclusion, GRiM-Net provides a robust and efficient geographic coordinate regression framework for cross-view UAV visual localization in GNSS-denied urban environments, validated on public benchmarks under controlled conditions. As a localization component rather than a complete autonomous navigation system, GRiM-Net is designed to be integrated within broader GNSS-denied navigation pipelines. Extending evaluation to non-urban environments including natural terrain, agricultural areas, and residential zones, as well as formal validation on onboard embedded platforms, constitutes the primary direction for future work.

Author Contributions

Conceptualization, Y.H. and Q.Z.; Data curation, Y.H.; Formal analysis, Y.H.; Funding acquisition, Q.Z.; Investigation, Y.H.; Methodology, Y.H.; Project administration, Q.Z.; Resources, Q.Z.; Software, Q.Z.; Validation, Y.H.; Visualization, Y.H.; Writing – original draft, Y.H. and Q.Z.; Writing – review and editing, Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset utilized in this study is publicly available. The network proposed in this research is not suitable for public release; it has been adopted by the supervising institution for use in a civilian drone positioning project. VTRN: [https://www.dropbox.com/scl/fo/6gwa0swtzj7pg1itk89hn/AN3RPb-dCxIqbepUA8siFHw?rlkey=yyalxgwgw9pvgbomnwd7zaxbo&e=2&st=pmhhlvon&dl=0] (accessed on 6 May 2026). RSSDIVCS: The RSSDIVCS dataset used in this study is a publicly available benchmark for remote sensing scene classification. The data can be accessed through the official repository of the original authors at [https://github.com/wenjiaXu/RS_Scene_ZSL] (accessed on 6 May 2026). MSDI: [https://www.upf.edu/web/mtg/msdi] (accessed on 6 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chan, K.W.; Nirmal, U.; Cheaw, W.G. Progress on drone technology and their applications: A comprehensive review. In Proceedings of the AIP Conference Proceedings, Melaka, Malaysia, 2–3 November 2018; Volume 2030, p. 020308.
Dutta, G.; Goswami, P. Application of drone in agriculture: A review. Int. J. Chem. Stud. 2020, 8, 181–187.
Gallacher, D. Drone applications for environmental management in urban spaces: A review. Int. J. Sustain. Land Use Urban Plan. 2016, 3, 1–14.
Fan, B.; Li, Y.; Zhang, R. Review on the technological development and application of UAV systems. Chin. J. Electron. 2020, 29, 199–207.
Vergouw, B.; Nagel, H.; Bondt, G. Drone technology: Types, payloads, applications, frequency spectrum issues and future developments. In Information Technology and Law Series; TMC Asser Press: The Hague, The Netherlands, 2016; Volume 27, pp. 21–45.
Tong, P.; Yang, X.; Yang, Y. Multi-UAV collaborative absolute vision positioning and navigation: A survey and discussion. Drones 2023, 7, 261.
Moshe, B.B.; Shvalb, N.; Baadani, J. Indoor positioning and navigation for micro UAV drones—Work in progress. In Proceedings of the 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel, Eilat, Israel, 14–17 November 2012; pp. 1–5.
Conte, G.; Doherty, P. Vision-based unmanned aerial vehicle navigation using geo-referenced information. EURASIP J. Adv. Signal Process. 2009, 2009, 387308.
Kutsenko, O.V.; Ilnytska, S.I.; Kondratyuk, V.M. Unmanned aerial vehicle position determination in GNSS landing system. In Proceedings of the 2017 IEEE 4th International Conference Actual Problems of Unmanned Aerial Vehicles Developments, Kyiv, Ukraine, 17–19 October 2017; pp. 79–83.
Rieke, M.; Foerster, T.; Geipel, J. High-precision positioning and real-time data processing of UAV-systems. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2012, 38, 119–124.
Zhang, G.; Hsu, L.T. Intelligent GNSS/INS integrated navigation system for a commercial UAV flight control system. Aerosp. Sci. Technol. 2018, 80, 368–380.
Hussain, A.; Akhtar, F.; Khand, Z.H. Complexity and limitations of GNSS signal reception in highly obstructed enviroments. Eng. Technol. Appl. Sci. Res. 2012, 11, 6864–6868.
Khan, S.Z.; Mohsin, M.; Iqbal, W. On GPS spoofing of aerial platforms: A review of threats, challenges, methodologies, and future research directions. PeerJ Comput. Sci. 2021, 7, e507.
Nistér, D.; Naroditsky, O.; Bergen, J. Visual odometry. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, 27 June–2 July 2004; Volume 1, pp. 652–659.
Cadena, C.; Carlone, L.; Carrillo, H. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robot. 2016, 32, 1309–1332.
Gu, M.; Li, H.; Zhang, J.; Bai, X.; Zhen, J. A review of vision-based UAV localization and navigation methods. Acta Electron. Sin. 2025, 53, 651–685.
Van Dalen, G.J.; Magree, D.P.; Johnson, E.N. Absolute localization using image alignment and particle filtering. In Proceedings of the AIAA Guidance, Navigation, and Control Conference, San Diego, CA, USA, 4–8 January 2016; p. 0647.
Wan, X.; Liu, J.; Yan, H. Illumination-invariant image matching for autonomous UAV localisation based on optical sensing. ISPRS J. Photogramm. Remote Sens. 2016, 119, 198–213.
Chiu, H.P.; Das, A.; Miller, P. Precise vision-aided aerial navigation. In Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, 14–18 September 2014; pp. 688–695.
Masselli, A.; Hanten, R.; Zell, A. Localization of unmanned aerial vehicles using terrain classification from aerial images. In Proceedings of the Intelligent Autonomous Systems 13: Proceedings of the 13th International Conference IAS-13, Padua, Italy, 15–18 July 2015; pp. 831–842.
Couturier, A.; Akhloufi, M.A. Relative visual localization (RVL) for UAV navigation. In Proceedings of the SPIE 10642, Degraded Environments: Sensing, Processing, and Display 2018, Orlando, FL, USA, 14 May 2018; Volume 10642, pp. 213–226.
Wang, T.; Zheng, Z.; Yan, C. Each part matters: Local patterns facilitate cross-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 867–879.
Gurgu, M.M.; Queralta, J.P.; Westerlund, T. Vision-based gnss-free localization for uavs in the wild. In Proceedings of the 2022 7th International Conference on Mechanical Engineering and Robotics Research, Krakow, Poland, 9–11 December 2022; pp. 7–12.
Li, S.; Hu, M.; Xiao, X. Patch similarity self-knowledge distillation for cross-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 5091–5103.
Workman, S.; Souvenir, R.; Jacobs, N. Wide-area image geolocalization with aerial reference imagery. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3961–3969.
Vo, N.N.; Hays, J. Localizing and orienting street views using overhead imagery. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 494–509.
Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
Arandjelovic, R.; Gronat, P.; Torii, A. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5297–5307.
Zhu, S.; Shah, M.; Chen, C. Transgeo: Transformer is all you need for cross-view image geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1162–1171.
Zhai, M.; Bessinger, Z.; Workman, S.; Jacobs, N. Predicting ground-level scene layout from aerial imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 867–875.
Regmi, K.; Borji, A. Cross-view image synthesis using conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3501–3510.
Shi, Y.; Yu, X.; Campbell, D.; Li, H. Where am i looking at? joint location and orientation estimation by cross-view matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4064–4072.
Wang, G.; Chen, J.; Dai, M.; Zheng, E. Wamf-fpi: A weight-adaptive multi-feature fusion network for uav localization. Remote Sens. 2023, 15, 910.
Chen, J.; Zheng, E.; Dai, M.; Chen, Y.; Lu, Y. OS-FPI: A coarse-to-fine one-stream network for UAV geolocalization. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 7852–7866.
Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 2019, 7, 535–547.
Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737.
Shrivastava, A.; Gupta, A.; Girshick, R. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769.
DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236.
Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision, 2nd ed.; Cambridge University Press: Cambridge, UK, 2011; pp. 1–672.
ISO 12232:2019; Photography—Digital Still Cameras—Determination of Exposure Index, ISO Speed Ratings, Standard Output Sensitivity, and Recommended Exposure Index. ISO: Geneva, Switzerland, 2019.
Li, H.; Wang, J.; Wei, Z. Jointly Optimized Global-Local Visual Localization of UAVs. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16.
Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157.

Figure 1. The inference process of this two-stage network.

Figure 2. The overall structure of the visual localization network, which combines a global retrieval module and a fine matching module.

Figure 3. Schematic diagram of shared encoder.

Figure 4. Sample images from the VTRN dataset.

Figure 5. Sample images from the RSSDIVCS dataset.

Figure 6. Sample images from the MSDI dataset.

Figure 7. (a) Errors of GRiM-Net across all frames on the test set; (b) Comparison of GRiM-Net with two contrasting models.

Table 1. Summary of Joint Training Loss Functions.

Symbol	Loss Type	Weight	Function
$L_{ret}$	Triplet loss	$α = 1.0$	Supervises metric learning of the NetVLAD global descriptor
$L_{kp}$	Keypoint cross-entropy	$β_{1} = 0.5$	Supervises the spatial accuracy of the keypoint heatmap
$L_{desc}$	Descriptor hinge loss	$β_{2} = 0.5$	Supervises the discriminability of local descriptors
$L_{homo}$	Homography reprojection error	$γ = 0.3$	Supervises the geometric alignment accuracy of fine matching
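The weights in Table 1 can be read as coefficients of a weighted sum of the four loss terms. The sketch below assumes that combination rule (the exact objective is defined in Section 3.6); the dictionary keys, function names, and example loss values are hypothetical. The `scaled_weights` helper mirrors the 0.5× and 2.0× variations examined in Table 3.

```python
# Default weights from Table 1: alpha, beta_1, beta_2, gamma.
DEFAULT_WEIGHTS = {"ret": 1.0, "kp": 0.5, "desc": 0.5, "homo": 0.3}


def joint_loss(losses, weights=DEFAULT_WEIGHTS):
    """Weighted sum of the four loss terms (assumed combination rule)."""
    return sum(weights[name] * losses[name] for name in weights)


def scaled_weights(name, factor, weights=DEFAULT_WEIGHTS):
    """Copy of `weights` with one entry multiplied by `factor`."""
    scaled = dict(weights)
    scaled[name] = weights[name] * factor
    return scaled


# Example per-term loss values for one training step (hypothetical).
example = {"ret": 1.0, "kp": 2.0, "desc": 2.0, "homo": 1.0}

total_default = joint_loss(example)
total_gamma_x2 = joint_loss(example, scaled_weights("homo", 2.0))
```

Keeping the weights in one mapping makes a sensitivity sweep a matter of calling `scaled_weights` once per configuration while the default setting stays untouched.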

Table 2. Mean ALE (meters) across five MSDI scenarios under different processing resolutions.

Resolution	160 × 160	256 × 256	320 × 320	384 × 384	512 × 512
Average ALE (m)	21.53	16.48	14.72	14.61	14.07

Table 3. Sensitivity analysis of joint loss weights. Mean ALE (meters) across five MSDI scenarios under alternative weight configurations, with each weight individually varied by 0.5× and 2.0× relative to the default setting.

Configuration	Default	$α \times 0.5$	$β_{1} \times 0.5$	$β_{2} \times 0.5$	$γ \times 0.5$	$α \times 2.0$	$β_{1} \times 2.0$	$β_{2} \times 2.0$	$γ \times 2.0$
Average ALE (m)	14.72	16.58	16.09	17.25	17.51	16.17	16.46	16.54	16.41

Table 4. Module combination ablation experiments (ALE unit: meters; mean time unit: seconds/frame; the left column below each scene is ALE, and the right column is average frame time).

Module Combination	Metropolitan University		Energy Center		Business School		Holy Name Church		Museum	
	ALE (m)	Time (s)	ALE (m)	Time (s)	ALE (m)	Time (s)	ALE (m)	Time (s)	ALE (m)	Time (s)
G + S	70.56	0.93	98.33	0.87	69.23	0.67	43.97	0.89	67.19	0.84
G + M	38.15	3.54	33.45	3.46	99.12	2.01	87.53	3.47	81.67	3.46
M + L	6.37	0.43	12.91	0.49	14.53	1.93	40.08	0.82	33.78	0.49
G + M + L	4.33	0.37	11.84	0.19	13.96	0.45	29.62	0.61	13.85	0.41

Table 5. Recall@K for different values of K on the MSDI test set.

K	1	2	3	5	8	10
Recall@K (%)	82.3	86.9	89.7	92.2	92.8	93.1

Table 6. ALE (meters) and inference time (seconds/frame) under varying K on the MSDI test set.

K	Metropolitan University	Energy Center	Business School	Holy Name Church	Museum	Average Time (s)
1	9.21	18.43	22.14	41.31	19.86	0.18
3	5.88	13.22	15.47	31.79	15.15	0.28
5	4.33	11.84	13.96	29.62	13.85	0.41
10	4.21	11.71	13.74	29.41	13.81	0.68

Table 7. ALE (meters) under photometric perturbations on the MSDI test set.

Input Conditions	Metropolitan University	Energy Center	Business School	Holy Name Church	Museum	Average ALE (m)
Original condition	4.33	11.84	13.96	29.62	13.85	14.72
Low brightness (×0.5)	5.25	13.87	16.82	34.01	16.18	17.23
High brightness (×1.5)	4.86	12.98	15.37	32.45	14.83	16.10
Low contrast (×0.6)	4.71	12.62	14.84	30.80	14.42	15.48

Table 8. Comparative experiments on five MSDI scenarios (ALE in meters; Time in seconds; each reported value is the arithmetic mean over all frames within the respective scenario; the left column below each scene is ALE, and the right column is average frame time).

Method	Metropolitan University		Energy Center		Business School		Holy Name Church		Museum	
	ALE (m)	Time (s)	ALE (m)	Time (s)	ALE (m)	Time (s)	ALE (m)	Time (s)	ALE (m)	Time (s)
SIFT	70.89	1.47	67.77	1.64	42.68	6.13	50.01	2.50	89.51	1.45
GLVL	5.35	0.48	13.65	0.23	17.83	0.51	30.47	0.69	15.32	0.48
GRiM-Net	4.33	0.37	11.84	0.19	13.96	0.45	29.62	0.61	13.85	0.41

Table 9. Frame-level localization success rate of GRiM-Net on the MSDI test set (localization error below 20 m).

Scene	Metropolitan University	Energy Center	Business School	Holy Name Church	Museum	Overall (all frames)
Success Rate (%)	96.6	91.5	83.8	84.5	91.7	89.3

Table 10. Ablation study of DA-BN (ALE in meters).

Method	Metropolitan University	Energy Center	Business School	Holy Name Church	Museum
GRiM-Net (with DA-BN)	4.33	11.84	13.96	29.62	13.85
GRiM-Net with standard BN (w/o DA-BN)	6.59	16.75	19.21	38.48	19.23

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hu, Y.; Zeng, Q. GRiM-Net: A Two-Stage Cross-View Visual Localization Framework for UAVs. Remote Sens. 2026, 18, 1477. https://doi.org/10.3390/rs18101477

