Article

GenGeo: Robust Cross-View Geo-Localization via Foundation Model and Dynamic Feature Aggregation

1
State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
2
University of Chinese Academy of Sciences, Beijing 100049, China
3
School of Computer Science, Beijing Institute of Technology, Beijing 100101, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(8), 1116; https://doi.org/10.3390/rs18081116
Submission received: 30 January 2026 / Revised: 3 April 2026 / Accepted: 6 April 2026 / Published: 9 April 2026
(This article belongs to the Section AI Remote Sensing)

Highlights

What are the main findings?
  • This study presents GenGeo, a framework that integrates vision foundation model representations with a SALAD-based matching-aware aggregation mechanism. By leveraging a shared clustering strategy, the model projects cross-view features into a unified assignment space, enabling implicit semantic alignment across different viewpoints.
  • The study further demonstrates that the dustbin mechanism in SALAD helps filter unmatched and non-informative regions caused by partial correspondence, thereby improving the robustness of cross-view matching.
What are the implications of the main findings?
  • The proposed method achieves state-of-the-art performance in cross-dataset generalization without relying on explicit geometric priors, effectively mitigating domain gaps between ground-level and aerial imagery under severe viewpoint and spatial misalignment.
  • The results highlight that the synergy between semantically rich foundation model representations and SALAD-based aggregation is critical for robust cross-view alignment, suggesting a promising alternative to geometry-dependent approaches for generalizable geo-localization.

Abstract

Cross-view geo-localization (CVGL) aims to match ground-level images with geo-tagged aerial imagery for precise localization, but remains challenging due to severe viewpoint discrepancies, partial correspondence, and significant domain shifts across geographic regions. While existing methods achieve high accuracy within specific datasets, their generalization ability to unseen environments is limited. In this paper, we propose GenGeo, a unified framework that integrates vision foundation model representations with a matching-aware aggregation mechanism to address these challenges. Specifically, we leverage DINOv2 to extract semantically rich and transferable features, and revisit the SALAD aggregation module in the context of CVGL. By employing a shared clustering strategy, the proposed framework projects cross-view features into a unified assignment space, enabling implicit semantic alignment across views, while the dustbin mechanism effectively filters unmatched and non-informative regions arising from partial correspondence. Extensive experiments on three large-scale benchmarks (CVUSA, CVACT, and VIGOR) demonstrate that GenGeo achieves state-of-the-art performance in cross-dataset generalization and consistently improves robustness under severe domain shifts and spatial misalignment. Notably, our method outperforms the baseline by 14.65% in Top-1 Recall on the CVUSA-to-CVACT transfer task. These results highlight the effectiveness of combining foundation model representations with matching-aware aggregation, and suggest that enforcing semantic consistency in a shared assignment space is a promising direction for generalizable cross-view geo-localization.

1. Introduction

Image-based geo-localization aims to achieve precise position estimation for a query image by matching it against a geo-referenced image database. This technology has demonstrated significant potential across various domains, including autonomous driving [1], robotic navigation [2], and augmented reality (AR) [3]. Although Global Navigation Satellite Systems (GNSS) are widely utilized, their reliability often degrades in “urban canyons” with dense buildings or in heavily forested areas. In these environments, satellite signals frequently suffer from multipath effects or signal blockage, resulting in localization errors ranging from several meters to tens of meters. Under such circumstances, image-based localization methods—leveraging their capacity to perceive environmental features—serve as a crucial supplement to GNSS, providing continuous and high-precision spatial alignment even in signal-constrained environments.
Among the various branches of visual localization, cross-view geo-localization (CVGL) matches ground-level query images with aerial or satellite imagery. This approach overcomes the inherent limitations of traditional visual place recognition (VPR), such as slow database updates and restricted coverage of street-view repositories. The global availability and accessibility of remote sensing imagery enable localization systems to serve not only urban centers but also rural and wilderness areas where street-view coverage is unavailable. However, CVGL faces substantial technical challenges, including drastic viewpoint disparities (ground-level vs. overhead), heterogeneous imaging geometries, and seasonal appearance variations. More fundamentally, these factors lead to severe spatial misalignment and partial correspondence between cross-view images, where only a subset of regions can be reliably matched across views. Currently, existing methods [4,5,6,7] typically employ a dual-branch retrieval framework to map cross-view images into a unified feature space for similarity-based retrieval. To mitigate matching difficulties, some studies incorporate image preprocessing techniques, such as Polar Transform [8,9], to bridge the perspective gap at the geometric level.
Despite the significant breakthroughs in accuracy achieved by deep learning-based methods on specific benchmarks, their generalization capacity remains limited. This limitation is closely related to the intrinsic challenges of cross-view geo-localization. Due to severe spatial misalignment and partial correspondence, it is inherently challenging to determine which regions can be reliably matched across views. Existing methods typically address this issue either by introducing explicit geometric priors to enforce cross-view alignment or by relying on local feature patterns that may capture dataset-specific textures. However, geometric transformations often depend on dataset-specific assumptions and may fail under complex viewpoint changes, while local feature-based representations are prone to overfitting to dataset-specific textures. As a result, constructing robust and generalizable global representations remains challenging.
Recently, Vision Foundation Models (VFMs) have demonstrated immense potential in handling complex semantics and geometric alignment, as their extracted image features typically possess robust universal semantics and strong generalization capabilities. For instance, ref. [10] demonstrated that employing DINOv2 [11] as a feature extractor significantly enhances the consistency of cross-view features. However, as these models are pre-trained on general internet-scale datasets, our research reveals that the high-dimensional semantics extracted by foundation models still contain substantial environmental redundancies irrelevant to geo-localization, making direct matching with these universal features suboptimal. Moreover, cross-view geo-localization fundamentally differs from conventional visual recognition tasks due to the severe viewpoint discrepancies between ground and aerial images. The lack of spatial correspondence and the presence of geometric distortions make direct feature matching highly unreliable. In particular, the presence of non-discriminative or view-specific regions—such as expansive sky areas, transient shadows, and dynamic objects—not only introduces noise but also exacerbates the inconsistency between cross-view representations. Therefore, beyond strong semantic representations, an effective CVGL system also requires a mechanism that can bridge the representation gap across views rather than relying solely on direct feature similarity.
In this context, feature aggregation mechanisms that can selectively emphasize discriminative regions while suppressing non-informative or unmatched content become particularly important. Recently, the SALAD [12] aggregation framework, originally proposed for visual place recognition, has demonstrated strong performance by enabling adaptive feature assignment and filtering through its clustering-based aggregation and dustbin mechanism. Building upon this observation, we extend this idea to the cross-view geo-localization setting and reinterpret the SALAD mechanism under the presence of severe viewpoint discrepancies and partial correspondence. Specifically, we show that the aggregation process can be viewed as inducing a unified assignment space for cross-view features, thereby facilitating more consistent semantic representation across different viewpoints.
Based on this insight, we further propose GenGeo (Generalizable Geo-localization), which integrates the foundation model DINOv2 with the SALAD aggregation mechanism. The framework leverages DINOv2 to extract rich semantic representations and employs SALAD to produce discriminative global descriptors. By combining semantic-rich representations with matching-aware aggregation, the proposed approach significantly enhances robustness in heterogeneous environments.
The main contributions of this work are summarized as follows:
(1)
We revisit the SALAD aggregation framework with a dustbin mechanism in the context of cross-view geo-localization and analyze its suitability for addressing the intrinsic challenges of partial correspondence and information asymmetry. We show that the shared clustering process can be interpreted as inducing a unified assignment space for cross-view features, which promotes consistent semantic representation while filtering unmatched or noisy regions, thereby facilitating reliable cross-view alignment.
(2)
We present GenGeo, a framework for cross-view geo-localization that integrates vision foundation model representations with the SALAD-based matching-aware aggregation strategy. By combining the strong semantic generalization capability of foundation models with adaptive feature aggregation, the framework produces robust and transferable representations for cross-dataset localization.
(3)
Extensive experiments and ablation studies demonstrate that the proposed framework achieves state-of-the-art performance in cross-dataset generalization and consistently improves robustness under severe domain shifts and spatial misalignment, highlighting the importance of the synergy between foundation model representations and matching-aware aggregation for effective cross-view alignment.

2. Related Work

Existing cross-view geo-localization (CVGL) methods predominantly adopt the Siamese network architecture [13,14,15,16,17]. The core principle of this framework is to employ two backbone networks with identical structures to map ground images and aerial imagery into a unified embedding space. The spatial correlation is subsequently evaluated by computing the geometric distance or cosine similarity between the two image representations. Early studies primarily utilized Convolutional Neural Networks (CNNs) as feature extractors. One of the pioneering approaches [4] leveraged CNNs pre-trained on the ImageNet and Places [18] datasets to extract representations for ground images and satellite images, respectively, demonstrating that CNNs significantly outperform traditional handcrafted descriptors in CVGL tasks. To further enhance the discriminative power of features, contrastive learning has been widely integrated into model training [14,19]. For instance, Vo et al. [14] introduced a soft-margin triplet loss to constrain the relative distances between triplet samples, forcing the anchor and positive samples to cluster in the feature space. Building upon this, Hu et al. [15] proposed a weighted soft-margin triplet loss, which employs a hyperparameter α to dynamically adjust the gradient, effectively alleviating the slow convergence issue associated with the original triplet loss.
Despite the foundation laid by Siamese networks, the localization accuracy of early methods remained limited due to severe geometric distortions and drastic viewpoint disparities between ground images and satellite imagery. To address these challenges, the research community has developed two primary technical trajectories: feature learning-based methods and perspective transformation-based methods [20].
Feature learning-based methods aim to mine shared discriminative information between cross-view images by enhancing the model’s representative power. A prominent trend involves replacing original CNNs with advanced backbones such as Vision Transformers (ViT) [21] or ConvNeXt [22], which offer superior long-range dependency modeling. For instance, Zhu et al. [23] proposed TransGeo, which utilizes a dual-stage strategy with attention maps to focus on critical regions, while MCCG [24] leverages a multi-classifier to extract enriched features. Beyond backbones, research has also focused on feature aggregation and hard negative mining, such as the GPS-Sampling and Dynamic Similarity Sampling (DSS) introduced by Deuser et al. [6]. Recently, Sun et al. [25] further emphasized the metric feature consistency principle, showing that preserving strict dimension-wise correspondence between siamese branches is critical, and introduced frequency-domain features via DCT to mitigate cross-view discrepancies.
Perspective transformation-based methods reduce the difficulty of feature learning by employing geometric preprocessing to bridge the domain gap. A milestone in this area is the polar transform proposed in SAFA [8], which projects satellite images into polar coordinates to simulate the visual layout of panoramic ground images. This technique has been widely adopted on center-aligned datasets like CVUSA [4] and CVACT [17]. However, polar transformation possesses limitations: the geometric distortion inherent in projection introduces noise, and its performance is restricted on non-centered benchmarks such as VIGOR [26] and DReSS [27]. To overcome this, Wang et al. [28] utilized geometric transformations to project ground images into Bird’s-Eye View (BEV) representations. Additionally, Zhang et al. [29] proposed a feature recombination strategy as a robust alternative to traditional perspective transforms.
With the remarkable progress in CVGL techniques, performance on standard benchmarks has approached saturation. However, most existing methods focus predominantly on optimizing localization accuracy within a single domain, often overlooking the model’s generalization capacity. This limitation significantly hinders practical deployment, as models frequently suffer from severe performance degradation when encountering domain shifts caused by geographic heterogeneity. The emergence of Vision Foundation Models (VFMs) offers a promising trajectory to overcome these bottlenecks. For instance, DINOv2 [11], pre-trained via large-scale self-supervised learning, delivers exceptional feature representations with both rich semantics and geometric robustness. Similarly, multi-modal models like CLIP [30] learn universal discriminative features by aligning visual and textual spaces. Other self-supervised transformers such as MAE [31] and iBOT [32] provide rich patch-level semantic embeddings, offering additional avenues for transferability in CVGL tasks.
In the remote sensing domain, research has begun to harness the transferability of vision foundation models (VFMs). RemoteClip [33] demonstrated striking zero-shot capabilities, while GeoCLIP [34] leveraged global contextual priors to enhance localization. Building upon this trend, recent works have specifically integrated VFMs into cross-view geo-localization (CVGL) frameworks to improve robustness and generalization. For instance, Bi et al. [35] proposed a balanced bias enhanced multi-branch network (BEMN) that simultaneously leverages ConvNeXt for local feature extraction and DINOv2 for global context modeling, introducing a two-stage training strategy with joint feature alignment and frequency domain adjustment modules to mitigate unimodal bias and overfitting.
However, despite these advances, existing methods still face notable limitations. Approaches based on complex architectural designs often incur significant computational overhead, while those relying on hand-crafted geometric transformations or explicit metric alignment may remain sensitive to local distortions in unconstrained real-world environments. Distinct from these strategies, this paper explores enhancing the robustness of cross-view alignment by refining the internal feature aggregation mechanism of universal visual representations. We propose the GenGeo framework, which synergizes the discriminative power of DINOv2 with an optimal transport-based aggregation (SALAD). Unlike static pooling methods, our approach specifically addresses the viewpoint gap by utilizing a “dustbin” mechanism to filter out view-specific uninformative features (e.g., sky, transient objects), thereby distilling a cross-view invariant global descriptor. This allows GenGeo to achieve superior generalization across heterogeneous geographic environments without the need for explicit geometric priors or heavy multi-stage training.

3. Methods

3.1. Overview

3.1.1. Problem Definition

Let $I_g = \{g_1, g_2, \ldots, g_n\}$ denote a set of query ground-level images, and $I_a = \{a_1, a_2, \ldots, a_m\}$ denote a reference database of geo-tagged aerial images. For each query image $g_i \in I_g$, there exists at least one positive aerial image $a_i^{+} \in I_a$ that corresponds to the same geographic location.
Cross-view geo-localization aims to learn a unified representation function $f(\cdot)$ that maps images captured from drastically different viewpoints into a shared embedding space, such that images depicting the same geographic location are closely aligned. This task is particularly challenging due to severe viewpoint-induced appearance changes, large-scale spatial layouts, and heterogeneous land-use patterns commonly observed in aerial imagery. During inference, given a query ground image $g$, the model retrieves the most similar aerial image $a^{*}$ from the reference database:
$$a^{*} = \arg\max_{a_j \in I_a} \mathrm{sim}\left(f(g), f(a_j)\right),$$
where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity. The geographic coordinates associated with $a^{*}$ are then used as the estimated location of the query image.
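The retrieval rule above can be sketched in a few lines (a minimal NumPy illustration; the descriptors here are placeholder vectors, not outputs of the actual model):

```python
import numpy as np

def retrieve(query_desc, ref_descs):
    """Return the index of the reference descriptor most similar to the query.

    query_desc: (D,) ground-image descriptor f(g).
    ref_descs:  (m, D) aerial descriptors f(a_j).
    After L2 normalization, the dot product equals cosine similarity.
    """
    q = query_desc / np.linalg.norm(query_desc)
    r = ref_descs / np.linalg.norm(ref_descs, axis=1, keepdims=True)
    return int(np.argmax(r @ q))
```

In practice the reference descriptors are precomputed offline, so querying reduces to a single matrix-vector product followed by an argmax (or an approximate nearest-neighbor index at scale).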

3.1.2. Overall Architecture

Figure 1 illustrates the overall architecture of the proposed GenGeo (Generalizable Geo-localization) framework. GenGeo follows a Siamese contrastive learning paradigm for cross-view geo-localization. Rather than explicitly modeling geometric transformations, it leverages strong visual representations and a matching-aware aggregation mechanism to improve robustness under large viewpoint differences and domain shifts.
The framework consists of two main components: (1) a shared feature extraction backbone based on the vision foundation model DINOv2 [11], and (2) a shared optimal transport-based feature aggregation module built upon SALAD [12] with an explicit dustbin mechanism. The DINOv2 backbone extracts dense local descriptors with strong invariance to illumination, seasonal changes, and regional diversity, while the shared aggregation module performs matching-aware feature aggregation to produce discriminative global representations.
During training, GenGeo adopts the hard negative mining strategy from Sample4Geo [6], including GPS-based sampling and Dynamic Similarity Sampling (DSS), and is optimized using a symmetric InfoNCE loss. This design enables effective learning in large-scale retrieval scenarios, where the reference database may span large urban areas or geographically diverse regions.

3.2. Feature Extraction Module

Traditional cross-view geo-localization methods predominantly rely on convolutional neural networks (CNNs) for feature extraction [14,15,36,37]. While CNN-based architectures can model spatial structures effectively, their reliance on local convolutional operations leads to representations that emphasize region-specific visual patterns. Such locality-driven representations may be less transferable across geographically diverse regions, where urban morphology and land-use distributions vary significantly. Although recent works [5,6,7] have adopted more advanced convolutional backbones (e.g., ConvNeXt), their ability to generalize across heterogeneous geographic regions remains a significant challenge. Recent advances in vision foundation models (VFMs) provide an alternative paradigm for feature extraction. Pre-trained on large-scale and diverse datasets using self-supervised learning, models such as DINOv2 are capable of capturing high-level semantic representations with improved robustness and transferability. This property makes them particularly suitable for cross-view geo-localization, where the ability to generalize across regions and viewpoints is critical.
To overcome these limitations, GenGeo employs DINOv2 [11] as its feature extraction backbone. DINOv2 is a Vision Transformer (ViT) foundation model pre-trained via large-scale self-supervised learning on diverse visual data. Its self-attention mechanism enables effective modeling of long-range spatial dependencies and holistic scene structures, which is particularly beneficial for aerial imagery characterized by large spatial extents and structured layouts. Furthermore, the exceptional performance of DINOv2 in Visual Place Recognition (VPR) tasks [38,39,40] validates its potential in geo-localization contexts. This motivated us to integrate DINOv2 into the GenGeo framework to explore the generalization capacity of universal vision foundation models in addressing cross-view challenges.
DINOv2 adopts the ViT as its foundational architecture. Formally, given an input image $I \in \mathbb{R}^{H \times W \times C}$, it is first partitioned into non-overlapping patches of size $p \times p$ ($p = 14$ in this study). Each patch is linearly projected into a $D$-dimensional embedding, and a learnable [CLS] token is prepended to encode global contextual information. Positional embeddings are added to preserve spatial structure, and the resulting token sequence is processed by stacked Transformer blocks.
Following the Siamese paradigm, GenGeo employs parameter-shared DINOv2 backbones for both ground-level and aerial images, producing the following feature sequences:
$$F_g = [f_{g,\mathrm{cls}}, f_{g,1}, \ldots, f_{g,n}], \qquad F_a = [f_{a,\mathrm{cls}}, f_{a,1}, \ldots, f_{a,n}],$$
where $n = HW/p^2$ denotes the number of patches and each feature token lies in $\mathbb{R}^{D}$. These dense feature sequences are subsequently fed into the feature aggregation module for cross-view alignment and dimensionality reduction. We empirically compare different fine-tuning strategies for the DINOv2 backbone, including full freezing, full fine-tuning, and partial fine-tuning, and adopt a partial fine-tuning scheme where early layers are frozen to preserve general representations while later layers are adapted to the target task. This configuration strikes an optimal balance between leveraging pre-trained foundation knowledge and learning discriminative geographic representations.
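As a quick sanity check of the token count $n = HW/p^2$, the arithmetic can be written out (a trivial helper written for this article, not part of any released code):

```python
def num_patch_tokens(H, W, p=14):
    """Number of p-by-p patch tokens for an H-by-W image: n = H*W / p^2.

    H and W must be divisible by the patch size p (DINOv2 uses p = 14).
    """
    if H % p or W % p:
        raise ValueError("image sides must be divisible by the patch size")
    return (H // p) * (W // p)
```

For a 224 × 224 input this gives 16 × 16 = 256 patch tokens, plus the prepended [CLS] token.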

3.3. Feature Aggregation Module

Although DINOv2 provides strong universal visual representations, directly using the [CLS] token as a global descriptor is suboptimal for cross-view geo-localization. In remote sensing scenarios, dense patch tokens often contain substantial environmental redundancies, such as sky regions, shadows, or transient objects, which are irrelevant or even detrimental to localization. More importantly, due to the drastic viewpoint gap between ground and aerial images, many local regions exhibit partial correspondence, where certain features in one view may not have valid counterparts in the other view. This cross-view inconsistency makes it crucial to not only suppress redundant information but also selectively ignore unmatched or view-specific features. However, the [CLS] token is not explicitly optimized to disentangle geographically discriminative cues from such noise or handle such cross-view inconsistencies.
Motivated by these observations, GenGeo incorporates a feature aggregation module to perform secondary modeling on the dense patch tokens, thereby enhancing the discriminative power of the global descriptors. Rather than designing a task-specific aggregation module, we adopt SALAD [12] and reinterpret its mechanism in the context of cross-view geo-localization. Specifically, SALAD aggregates features using a shared set of learnable visual prototypes, assigning local features from both ground and aerial views into a unified assignment space, which encourages consistent semantic representation across views.
Furthermore, SALAD introduces a dustbin mechanism that allows unmatched or non-discriminative features to be filtered out during aggregation. This property is particularly beneficial for cross-view geo-localization, where partial correspondence is prevalent and many local regions may not have valid counterparts across views. By suppressing such unmatched features, the aggregation process produces more robust and view-consistent global representations.
In SALAD, a modern variant of VLAD [41], a global descriptor is generated through a two-step paradigm: first, a set of local features is assigned to a series of shared semantic clusters, then aggregated within each cluster. Figure 2 illustrates the overall process. The underlying assignment and aggregation processes are detailed below.

3.3.1. Assignment

The relationship between local features $F = \{f_i\}_{i=1}^{n}$ and $M$ semantic cluster centroids is first modeled via an affinity score matrix $S \in \mathbb{R}^{n \times M}$. Unlike traditional NetVLAD, which relies on k-means initialization, SALAD [12] employs a lightweight MLP to learn the affinity scores for each patch token:
$$s_i = \mathrm{MLP}(f_i), \quad i = 1, \ldots, n.$$
This data-driven approach avoids the inductive bias of static clustering and provides greater flexibility in heterogeneous geographic environments.
Importantly, the semantic clusters are shared across both ground-level and aerial views, meaning that features from different viewpoints are assigned to a common set of semantic categories. Each cluster is expected to capture consistent high-level geographic semantics (e.g., buildings, roads, vegetation) despite viewpoint variations. As a result, features corresponding to similar physical structures across views are more likely to be grouped within the same cluster, providing a structured basis for cross-view matching.
Another key refinement of SALAD is the introduction of a “dustbin” mechanism to handle non-informative or redundant features, which are common in CVGL tasks (e.g., sky, clouds, or transient vehicles). In particular, cross-view geo-localization inherently suffers from partial correspondence, where certain regions visible in one view may be entirely absent in the other due to drastic viewpoint differences. Such asymmetric and unmatched features cannot contribute to reliable cross-view matching and may even introduce noise during aggregation. The dustbin mechanism explicitly models this phenomenon by absorbing these unmatched or view-specific features, thereby preventing them from interfering with the aggregation of semantically consistent regions. To implement this, the score matrix $S$ is augmented with an additional column representing the dustbin dimension, resulting in an enhanced matrix $\bar{S} = [S, \bar{s}_{M+1}] \in \mathbb{R}^{n \times (M+1)}$. Following SuperGlue [42], the dustbin scores are defined by a learnable scalar $z$:
$$\bar{s}_{i,M+1} = z, \quad \forall i \in \{1, \ldots, n\},$$
where z serves as a learned rejection threshold, allowing features with low cluster affinity to be diverted to the dustbin during the subsequent iterations.
The mapping between features and cluster centroids is then formulated as an Optimal Transport (OT) problem, reflecting both the affinity of features for clusters and the selectivity of clusters for features. Each feature is assigned a unit mass, forming a source distribution $\mu = \mathbf{1}_n$, which is distributed across a target distribution $\kappa = [\mathbf{1}_M^{\top}, n - M]^{\top}$ corresponding to the $M$ clusters and the dustbin. This configuration ensures that both feature-level and cluster-level constraints are satisfied:
  • Row constraints (feature side): $\bar{P}\,\mathbf{1}_{M+1} = \mu$, ensuring that the mass of each feature is fully allocated.
  • Column constraints (cluster side): $\bar{P}^{\top}\mathbf{1}_{n} = \kappa$, limiting each cluster centroid to aggregate only a unit mass of features and enforcing competitive assignment.
The resulting constrained optimization problem is solved using the Sinkhorn–Knopp algorithm [43,44], as in SALAD [12] and SuperGlue [42], which iteratively normalizes the rows and columns of $\exp(\bar{S})$ to yield the optimal assignment matrix $\bar{P}$. The last column corresponding to the dustbin is then discarded to obtain the final assignment matrix $P \in \mathbb{R}^{n \times M}$ for feature aggregation.
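The assignment step can be illustrated with a toy NumPy implementation of the Sinkhorn–Knopp iteration under the marginals described above (the learned MLP affinities and the dustbin scalar $z$ are replaced by fixed inputs here; this is a sketch for intuition, not the authors' implementation):

```python
import numpy as np

def sinkhorn_assignment(S, z=1.0, iters=20):
    """Toy Sinkhorn assignment with a dustbin column.

    S: (n, M) affinity scores between n patch features and M clusters.
    z: dustbin score (learned in SALAD, fixed here for illustration).
    Returns P: (n, M) assignment matrix with the dustbin column dropped.
    """
    n, M = S.shape
    # Augment the score matrix with the dustbin column filled with z.
    S_bar = np.concatenate([S, np.full((n, 1), z)], axis=1)
    # Marginals: each feature carries unit mass; each cluster absorbs unit
    # mass, and the dustbin absorbs the remaining n - M.
    mu = np.ones(n)
    kappa = np.concatenate([np.ones(M), [n - M]])
    K = np.exp(S_bar)
    # Sinkhorn-Knopp: alternately rescale rows and columns to the marginals.
    for _ in range(iters):
        K = K * (mu / K.sum(axis=1))[:, None]
        K = K * (kappa / K.sum(axis=0))[None, :]
    return K[:, :M]  # discard the dustbin column
```

Each of the $M$ retained columns of $P$ sums to one (the unit cluster mass), while mass rejected by the clusters accumulates in the discarded dustbin column.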

3.3.2. Aggregation

Once the assignment matrix $P$ is obtained, the discrete local feature sequences are aggregated into a compact global descriptor. To maintain computational efficiency, the local features $f_i$ are first projected to a lower-dimensional space $h_i \in \mathbb{R}^{l}$ using a lightweight MLP. In contrast to conventional VLAD-style algorithms that aggregate residuals, a direct weighted summation of the local features based on $P$ is performed. This approach introduces no additional geometric priors, thereby enhancing the model’s generalization across diverse geographic environments. The aggregation for the $j$-th cluster centroid is formulated as:
$$V_j = \sum_{i=1}^{n} P_{i,j} \cdot h_i, \quad j = 1, \ldots, M.$$
The individual cluster descriptors are then flattened into a single vector $V \in \mathbb{R}^{Ml}$. To capture the holistic scene context not captured by local clusters, this aggregated local representation is concatenated with the transformed DINOv2 global token $f_{\mathrm{cls}}$. L2 normalization is then applied to ensure stability of the descriptor in the embedding space. Finally, a linear projection layer is appended to compress the global descriptor $d$ to 1024 dimensions, reducing computational and memory costs during large-scale retrieval and ensuring compatibility with standard CVGL benchmarks [6,23].
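Putting the aggregation steps together, a minimal shape-level sketch (the MLP projections and the final 1024-dimensional linear layer are omitted; inputs are placeholders):

```python
import numpy as np

def aggregate_global(P, H, g_cls):
    """Cluster-wise weighted sum, flatten, append the global token, L2-normalize.

    P: (n, M) assignment matrix; H: (n, l) projected patch features;
    g_cls: transformed [CLS] token. Returns the global descriptor.
    """
    V = P.T @ H                              # (M, l): V_j = sum_i P[i, j] * h_i
    d = np.concatenate([V.ravel(), g_cls])   # flattened clusters + global token
    return d / np.linalg.norm(d)             # L2 normalization
```

Because the weighted sum is linear in both $P$ and $H$, the whole aggregation reduces to a single matrix product, keeping the module lightweight relative to the backbone.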

3.4. Loss Function and Sampling Strategy

3.4.1. Symmetric InfoNCE Loss

By maximizing a lower bound on the mutual information between positive sample pairs, the symmetric InfoNCE loss [30,45] has demonstrated strong performance in several recent geo-localization studies [6,27]. Given a training batch containing $N$ image pairs, for the $i$-th ground image global descriptor $d_{g,i}$ and its corresponding aerial image global descriptor $d_{a,i}$, the loss term from the ground-to-aerial perspective $\mathcal{L}_{g \to a}$ is defined as:
$$\mathcal{L}_{g \to a} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(d_{g,i} \cdot d_{a,i} / \tau)}{\sum_{j=1}^{N} \exp(d_{g,i} \cdot d_{a,j} / \tau)},$$
where $\tau$ denotes the temperature parameter. Similarly, the aerial-to-ground loss term $\mathcal{L}_{a \to g}$ is defined by swapping the query and candidate sets. The final symmetric loss is the average of the two components: $\mathcal{L} = (\mathcal{L}_{g \to a} + \mathcal{L}_{a \to g}) / 2$. Compared to the traditional Triplet Loss, which considers only a single negative sample per anchor, the InfoNCE loss adopts an in-batch negative sampling strategy. Consequently, each positive pair is contrasted against $N - 1$ negative samples within a single iteration. This contrastive formulation substantially improves the utilization efficiency of negative samples and enhances the discriminative capability of the learned representations, which is particularly beneficial for large-scale geo-localization scenarios with extensive candidate spaces.
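The symmetric loss can be sketched in NumPy (a numerically stable log-softmax; the batch descriptors are assumed L2-normalized, and pair $i$ in one view matches pair $i$ in the other):

```python
import numpy as np

def _log_softmax(x, axis):
    """Stable log-softmax: subtract the max before exponentiating."""
    m = x.max(axis=axis, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def symmetric_infonce(Dg, Da, tau=0.1):
    """Symmetric InfoNCE over N paired descriptors (sketch).

    Dg, Da: (N, D) L2-normalized ground / aerial descriptors.
    Row-wise softmax gives the ground-to-aerial term; column-wise gives
    the aerial-to-ground term; the loss averages the two.
    """
    logits = (Dg @ Da.T) / tau                             # (N, N) similarities
    idx = np.arange(len(logits))
    l_ga = -_log_softmax(logits, axis=1)[idx, idx].mean()  # ground -> aerial
    l_ag = -_log_softmax(logits, axis=0)[idx, idx].mean()  # aerial -> ground
    return 0.5 * (l_ga + l_ag)
```

When the paired descriptors are identical and the negatives are orthogonal, the loss approaches zero; mismatched batches yield a larger value, which is the gradient signal that pulls positives together.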

3.4.2. Sampling Strategy

To further enhance the discriminative learning of the InfoNCE loss, we adopt the dual-stage sampling strategy from Sample4Geo [6], which consists of GPS-based sampling and Dynamic Similarity Sampling (DSS). In the early stages of training, the model has not yet converged and cannot effectively identify hard negatives in the feature space. We therefore leverage geographical coordinates (GPS or UTM) to compute spatial proximity, selecting geographically nearby samples as initial potential hard negatives. As training progresses, the strategy transitions to DSS. Specifically, every $e$ epochs, we conduct a full feature extraction pass to update the global retrieval index via cosine similarity. For batch construction, we select the $k/2$ most similar samples from a query's $K$ nearest neighbors; to maintain sampling diversity, an additional $k/2$ samples are randomly selected from the remaining candidates. Following the sampling strategy in Sample4Geo, the hyperparameters for negative sample mining are set to $k = 64$, $K = 128$, and $e = 4$.
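The DSS batch construction for a single query can be sketched as follows. The function name is hypothetical, and drawing the random half from the remainder of the $K$-nearest-neighbour pool is our reading of "the remaining candidates"; the actual Sample4Geo implementation may differ in detail.

```python
import numpy as np

def dss_candidates(sim_row, query_idx, k=64, K=128, rng=None):
    """Dynamic Similarity Sampling for one query (illustrative sketch).

    sim_row : (N,) cosine similarities of the query to all samples.
    Returns k candidate indices: the k/2 most similar samples among the
    K nearest neighbours, plus k/2 drawn at random from the rest of
    that pool for diversity.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(-sim_row)           # most similar first
    order = order[order != query_idx]      # never sample the query itself
    nearest = order[:K]                    # K-nearest-neighbour pool
    hard = nearest[: k // 2]               # hardest candidates
    rand = rng.choice(nearest[k // 2:], size=k // 2, replace=False)
    return np.concatenate([hard, rand])
```

With the paper's settings ($k = 64$, $K = 128$), each refresh of the retrieval index yields 32 hard and 32 randomly diversified candidates per query.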

4. Results

4.1. Datasets and Setting

4.1.1. Datasets

We evaluate GenGeo on three widely used cross-view geo-localization benchmarks: CVUSA [4], CVACT [17], and VIGOR [26], which collectively cover diverse geographic regions and present varying cross-view challenges (Figure 3).
CVUSA is one of the earliest large-scale cross-view datasets, containing a subset of 35,532 ground–aerial image pairs for training and 8884 pairs for testing. The ground-level images are captured as 360° panoramas, while the aerial images are obtained from standard satellite platforms. Camera extrinsic parameters are used to warp and align the panoramas with the corresponding satellite tiles, ensuring that geographical north is at the top of the aerial images. The task in CVUSA is a one-to-one mapping, where each satellite image has exactly one corresponding street-view image. The dataset primarily covers suburban and rural areas across the United States.
CVACT follows a similar scale and split as CVUSA for training and validation (35,532/8884 image pairs), and additionally provides a large city-scale test set with 92,802 pairs. The dataset is collected in Canberra, Australia, focusing on dense urban environments with complex road networks and building layouts. Satellite images are provided at higher resolution than CVUSA (e.g., 1200 × 1200) and ground panoramas at 832 × 1664. Like CVUSA, CVACT maintains one-to-one correspondence between ground and aerial images, with images aligned based on camera extrinsics.
VIGOR is a large-scale cross-view benchmark spanning four U.S. cities: New York, Seattle, San Francisco, and Chicago. The dataset contains 105,214 ground panoramas and 90,618 aerial tiles. Unlike CVUSA and CVACT, VIGOR does not enforce strict center alignment between ground queries and aerial images, and multiple semi-positive aerial neighbors are provided for each ground image. Two evaluation protocols are defined: same-area, where all cities contribute to training and validation, and cross-area, where training is conducted on a subset of cities (e.g., New York and Seattle) and testing on the remaining cities (e.g., San Francisco and Chicago) to assess generalization to unseen regions.
Table 1 summarizes the key characteristics of the three datasets, including the number of images, resolutions, region types, and mapping strategies.

4.1.2. Evaluation Metrics

To quantitatively evaluate the retrieval performance of GenGeo, we employ the following metrics: Top-k Recall (R@k) and Hit Rate. Top-k Recall is the primary metric used across all datasets (CVUSA, CVACT, and VIGOR). For each ground-level query, the model retrieves the top-k most similar aerial candidates from the reference database based on the cosine similarity of their global descriptors. A retrieval is considered successful if the ground-truth aerial image is among the k nearest neighbors. Following standard protocols, we report R@1, R@5, R@10, and R@1% (the top 1% of the entire database). Specifically for the VIGOR dataset, we report the Hit Rate as an additional measure of practical localization success. In the “beyond-center” scenario of VIGOR, a query may be partially covered by multiple neighboring aerial tiles. The Hit Rate defines a successful retrieval if the top-1 retrieved aerial image physically covers the ground-level query location, regardless of whether it is the designated “ground-truth” center tile.
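A minimal sketch of the R@k computation for the one-to-one protocols (CVUSA/CVACT); the function name and the pessimistic handling of tied scores are our choices, not specified by the benchmarks.

```python
import numpy as np

def recall_at_k(sim, k):
    """Top-k recall for a one-to-one retrieval protocol (sketch).

    sim : (Q, R) query-to-reference cosine similarities, where the
          ground truth of query i is reference i.
    A query counts as correct if its true reference ranks within the
    top k; ties are resolved pessimistically (worst rank).
    """
    ranks = (sim >= np.diag(sim)[:, None]).sum(axis=1)  # 1 = best rank
    return float((ranks <= k).mean())
```

R@1% is the same statistic with k set to 1% of the reference database size; the VIGOR hit rate additionally accepts any top-1 tile that physically covers the query location.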

4.1.3. Implementation Details

We employ the ViT-B/14 variant of DINOv2 as the feature extraction backbone. This choice ensures a fair comparison with existing state-of-the-art methods [6,27,46] that predominantly utilize ConvNeXt-B, as both architectures possess a comparable model capacity of approximately 86–89 million parameters. To strike a balance between training efficiency and representation performance, only the last four Transformer blocks of the encoder are fine-tuned. The aggregation module is configured with $M = 64$ semantic clusters, each with a feature dimension of $l = 128$, resulting in a final 1024-dimensional global descriptor via linear projection. Given that the patch size of DINOv2 is 14, we adaptively resize the input images for each dataset: for CVUSA and CVACT, the ground and aerial images are resized to 154 × 770 and 392 × 392 pixels, respectively; for VIGOR, they are adjusted to 294 × 588 and 294 × 294 pixels. The model is optimized using the AdamW optimizer [47] with an initial learning rate of $1 \times 10^{-4}$ and a cosine annealing schedule [48]. The entire framework is trained on a server equipped with 8 NVIDIA GeForce RTX 2080 Ti GPUs, using a batch size of 256 for a total of 80 epochs.
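The partial fine-tuning setup (freezing all but the last four of ViT-B's twelve Transformer blocks) can be sketched framework-agnostically as a filter over parameter names. The helper is hypothetical, and the `blocks.<idx>.` naming convention is an assumption borrowed from common ViT/DINOv2 implementations.

```python
def trainable_param_names(names, n_blocks=12, tune_last=4):
    """Select which backbone parameters to fine-tune (sketch).

    names : parameter names such as 'blocks.11.attn.qkv.weight'.
    Keeps only the last `tune_last` of `n_blocks` Transformer blocks;
    embedding-level parameters stay frozen, while parameters outside
    the backbone (e.g. the aggregation head) remain trainable.
    """
    cutoff = n_blocks - tune_last
    keep = []
    for name in names:
        if name.startswith("blocks."):
            if int(name.split(".")[1]) >= cutoff:
                keep.append(name)          # one of the last four blocks
        elif not name.startswith(("patch_embed.", "cls_token", "pos_embed")):
            keep.append(name)              # e.g. aggregation-module parameters
    return keep
```

In a PyTorch-style loop, one would then set `requires_grad = True` only for parameters whose names survive this filter.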

4.2. Comparison with State-of-the-Art Methods

We compared GenGeo with state-of-the-art (SOTA) models in cross-view image geo-localization from two perspectives: cross-view image retrieval and generalization capability. Sample4Geo [6] serves as the baseline, while Panorama-BEV Co-Retrieval Network (P-BEV) [46] represents the current top-performing method.

4.2.1. Cross-View Image Retrieval

Table 2 presents the quantitative comparison on the CVUSA and CVACT benchmarks. On CVUSA, GenGeo achieves the best performance across all evaluation metrics, consistently outperforming existing methods. This demonstrates the effectiveness of combining the strong feature representations of DINOv2 with the SALAD semantic aggregation module for cross-view geo-localization. On the CVACT dataset, GenGeo achieves performance comparable to the Sample4Geo baseline and slightly outperforms it on the Test split. P-BEV and AuxGeo [27] attain marginally higher scores on some metrics, likely due to their use of explicit geometric transformations that project ground-level panoramas into bird’s-eye-view (BEV) representations, partially mitigating geometric discrepancies between ground and aerial views. In contrast, GenGeo performs feature learning and matching directly in the original view space, without relying on geometric priors, reflecting a general and viewpoint-agnostic modeling paradigm.
On the VIGOR benchmark (Table 3), GenGeo achieves competitive results under both same-area and cross-area evaluation protocols. In the same-area setting, performance is comparable to Sample4Geo and AuxGeo, slightly below P-BEV on certain metrics. In the more challenging cross-area scenario, GenGeo surpasses Sample4Geo and AuxGeo on several key metrics, demonstrating strong cross-region generalization. The performance gap with P-BEV is consistent with observations on CVACT, where the benefits of explicit viewpoint synthesis are more pronounced in strictly aligned urban scenes. These results suggest that the primary challenge in cross-view geo-localization lies in the severe viewpoint gap between ground and aerial images, and that explicitly reducing this discrepancy through geometric transformation can effectively improve retrieval accuracy in well-aligned scenarios.

4.2.2. Generalization Capabilities

Given that the primary objective of GenGeo is to bolster the model's adaptability to diverse environments, the evaluation of generalization capability constitutes the core of our comparative analysis. Following [6,46,49], we evaluate generalization via cross-dataset training and testing between CVUSA and CVACT. Compared with VIGOR's cross-area evaluation, cross-dataset validation is more challenging due to significant domain shifts, including differences in geographic landscapes (suburban vs. urban) and acquisition conditions. This setup better reflects the model's transferability to unseen and complex environments. To further investigate robustness under more diverse conditions, we additionally conduct cross-dataset evaluation between CVUSA and VIGOR. Compared to CVUSA–CVACT, this setting introduces not only substantial scene variation (rural-to-urban) but also differences in spatial alignment assumptions (center-aligned vs. non-centered matching), posing additional challenges for cross-view correspondence. Because many existing methods do not provide publicly available models for this setting, we report comparisons with a representative baseline (Sample4Geo) alongside our approach.
As shown in Table 4, GenGeo exhibits strong generalization across different data distributions. In the CVUSA → CVACT transfer task, our model achieves a substantial 14.65% improvement in Top-1 recall over the baseline framework. More notably, GenGeo even surpasses P-BEV [46], which explicitly incorporates Bird's-Eye-View (BEV) transformations, by a margin of 3.48%. In the reverse CVACT → CVUSA scenario, GenGeo achieves an R@1 of 55.66%, representing a significant 10.71% gain over the Sample4Geo baseline. While P-BEV results were not reported for this specific transfer direction, our framework maintains a consistent performance advantage over other established methods. The double-digit improvement observed in both transfer directions further validates the robust geographic transferability of our approach. Geometry-aware models such as GeoDTR and P-BEV leverage polar or BEV projections optimized for specific datasets (e.g., the strictly aligned urban scenes in CVACT), but these approaches often underperform in cross-domain transfers from rural to urban environments. In contrast, GenGeo directly matches invariant semantic landmarks using DINOv2 features and SALAD aggregation, without dependence on specific geometric layouts.
In the more challenging CVUSA–VIGOR cross-dataset setting (Table 5), GenGeo demonstrates even more pronounced advantages. Specifically, in the VIGOR → CVUSA transfer, GenGeo achieves an R@1 of 38.28%, far outperforming Sample4Geo (3.85%). Similarly, in the CVUSA → VIGOR scenario, GenGeo attains an R@1 of 11.38%, whereas the baseline model nearly collapses (0.002%). These results indicate that under severe domain shifts involving both scene diversity and spatial misalignment, conventional retrieval pipelines struggle to establish reliable correspondences, while the proposed method maintains robust and transferable representations. Notably, the near-zero performance of the baseline in the CVUSA → VIGOR setting can be attributed to the compounded challenges of this cross-dataset scenario. Unlike CVUSA and CVACT, VIGOR does not assume strict center alignment between ground and aerial views, and introduces significant viewpoint offsets and spatial ambiguity. As a result, methods that rely on implicit spatial correspondence or direct feature matching, such as Sample4Geo, are particularly vulnerable under this setting.
In addition, the baseline model constructs global descriptors via a convolutional backbone (ConvNeXt) followed by global average pooling, which may not be sufficiently robust to handle extreme cross-view discrepancies. The reliance on globally pooled features can limit the model’s ability to selectively focus on geographically discriminative regions, especially when large portions of the image are dominated by non-informative or view-specific content.

4.3. Ablation Study

To investigate the impact of key components and design choices in GenGeo, we conduct a series of systematic ablation studies. Given our primary focus on robust geographic transferability, we evaluate in-distribution effectiveness on CVUSA and assess cross-dataset generalization via the CVUSA → CVACT transfer task. All training and evaluation protocols are kept identical, except for the specific components under ablation, to ensure a fair and rigorous comparison.

4.3.1. Backbone Architecture Evaluation

As shown in Table 6, we conduct a systematic comparison across different backbone networks. We first evaluate a fine-tuned DINOv2 model alone as the global feature extractor. Notably, compared with ConvNeXt, DINOv2 exhibits weaker generalization in the CVUSA → CVACT cross-dataset evaluation. We hypothesize that this behavior is related to the design choice of using the CLS token as the global representation in DINOv2. While the CLS token encodes rich semantic information, it may lack sufficient modeling of the fine-grained discriminative cues that are crucial for retrieval tasks, particularly under significant domain shifts.
We further introduce the SALAD module on top of different backbones to evaluate its compatibility with varying feature representations. The experimental results show that the ConvNeXt + SALAD configuration not only fails to improve cross-dataset performance but leads to a noticeable degradation in generalization capability, indicating that the effectiveness of SALAD is not universal across backbone architectures. We attribute this phenomenon to the fact that ConvNeXt inherently emphasizes local texture and low-level visual patterns, and aggregating such features may amplify dataset-specific biases, thereby reducing robustness under domain shifts. In contrast, when combined with DINOv2, SALAD yields substantial performance improvements in both in-domain and cross-dataset evaluations. Overall, a significant synergistic effect emerges only when the strong semantic representation capability of DINOv2 is combined with the fine-grained local feature aggregation mechanism of SALAD; the effectiveness of SALAD thus critically depends on the semantic quality of the underlying representations. This design preserves global semantic consistency while incorporating discriminative local structural information, leading to substantially improved generalization performance in cross-domain scenarios.

4.3.2. Aggregation Strategy Evaluation

To evaluate the effectiveness of different components in the aggregation module, we conduct a series of ablation studies focusing on the dustbin mechanism and the role of shared clustering (Table 7).
We first remove the dustbin component to assess its contribution. The results show that on the CVUSA dataset, the performance remains nearly unchanged (R@1: 98.94% vs. 98.94%), which can be attributed to the relatively homogeneous rural scenes with limited distractors. However, in the cross-dataset setting (CVUSA → CVACT), the performance drops noticeably (R@1: 71.27% → 67.14%), indicating that the dustbin mechanism plays a more critical role under significant domain shifts. This suggests that filtering unmatched or non-informative regions is particularly important when transferring between heterogeneous environments (e.g., rural to urban).
To investigate the role of shared clustering in cross-view alignment, we further conduct ablations using (i) a non-shared SALAD variant and (ii) a shared NetVLAD aggregation module. The non-shared SALAD model exhibits a substantial performance drop in both in-domain and cross-dataset evaluations, demonstrating that enforcing a shared clustering space across views is crucial for learning consistent semantic representations and achieving reliable cross-view alignment. While DINOv2 + NetVLAD improves over using DINOv2 alone in cross-dataset evaluation, it still underperforms compared to the proposed DINOv2 + SALAD model. This observation suggests that projecting features from different views into a shared clustering space already benefits generalization. However, SALAD further enhances this process through its optimal transport assignment mechanism and dustbin strategy, which explicitly handle partial correspondence and improve cross-view alignment.
Interestingly, although NetVLAD does not outperform certain ConvNeXt baselines (Table 6) in in-domain evaluation, it demonstrates improved performance in cross-dataset scenarios. This further supports our observation that combining semantically rich foundation model features with shared clustering-based aggregation is beneficial for generalization, even when it does not always yield the highest in-domain accuracy.

4.3.3. Effect of Fine-Tuning Strategies

To justify the use of the partial fine-tuning strategy, we compare it with two alternative settings: full fine-tuning and fully frozen DINOv2 backbone. As shown in Table 8, the fully frozen setting yields the worst performance, which can be attributed to the fact that the pre-trained DINOv2 model is not specifically optimized for geo-localization, leading to suboptimal semantic representations for this task. Full fine-tuning significantly improves performance over the frozen setting, and achieves results on CVUSA comparable to partial fine-tuning. However, its performance on the cross-dataset evaluation (CVACT) is substantially lower than that of partial fine-tuning. This suggests that full fine-tuning tends to overfit to the source dataset, weakening the general semantic representations learned by the foundation model and thus degrading cross-domain generalization. In contrast, partial fine-tuning preserves the general visual knowledge in early layers while enabling task-specific adaptation in higher layers, leading to a better balance between discrimination and generalization.

4.3.4. Effect of SALAD Cluster Number

We conduct an ablation study on the number of clusters to analyze its impact on model performance (Table 9). As shown in the results, increasing the number of clusters from 32 to 64 consistently improves performance across all evaluation metrics. This suggests that a moderate increase in cluster granularity enables the model to capture more fine-grained structural and semantic patterns, leading to more discriminative global representations. However, when the number of clusters is further increased to 80, performance degrades on both evaluation protocols. We attribute this drop to over-fragmentation of local features, which may introduce noisy or redundant clusters and hinder effective feature aggregation. Overall, these results indicate that an appropriate number of clusters is crucial for balancing representation capacity and robustness, with 64 clusters achieving the best trade-off in our experiments.

5. Discussion

5.1. Mechanistic Analysis of Visualization

To explore the internal mechanisms of GenGeo and enhance model interpretability, we conduct a detailed visualization analysis of DINOv2 features and the dustbin mechanism. Following the standard protocol of DINOv2 [11], we utilize the feature norm of the last Transformer layer output to characterize the model's initial attention on different patches. For the dustbin mechanism, we extract the weight $w$ assigned to the dustbin class for each patch and visualize a selection map using $1 - w$ to intuitively illustrate the discriminative features retained by the model.
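The selection map described above can be sketched as follows; the function name is hypothetical, and we assume the last column of the full assignment matrix holds the dustbin weight, as in the SALAD formulation.

```python
import numpy as np

def dustbin_selection_map(P_full, grid_hw):
    """Per-patch selection map 1 - w for visualization (sketch).

    P_full  : (n, M+1) assignment matrix whose last column is assumed
              to be the dustbin weight w of each patch.
    grid_hw : (H, W) patch grid with H * W == n.
    """
    w = P_full[:, -1]                    # dustbin weight per patch
    return (1.0 - w).reshape(grid_hw)    # high value = patch retained
```

Rendering this map as a heatmap over the input image reproduces the kind of visualization shown in the figures, with retained discriminative structures appearing as salient regions.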
As observed in Figure 4, DINOv2 natively captures rich semantic information, providing a foundation for precise geolocalization. However, the raw attention maps contain numerous features that are either positionally irrelevant or difficult to correlate across views, such as sky regions in ground-level panoramas and fine-grained background vegetation in satellite imagery. This likely explains why utilizing DINOv2 as a standalone feature extractor, without an adaptive aggregation layer, yields suboptimal results, as the global representation becomes inevitably contaminated by these non-discriminative environmental redundancies. In contrast, the dustbin mechanism demonstrates superior feature refinement capabilities; it adaptively suppresses these visual noises and focuses specifically on discriminative structures like buildings and roads (as evidenced by the salient red regions on the left side of the ground-level images in the 'Dustbin' row of Figure 4). An extreme example in Figure 5 further highlights this: while DINOv2 allocates significant attention to grass on the periphery of satellite images and sky in ground-level views, the dustbin module nearly completely filters these redundancies, preserving only the core road regions. This clearly demonstrates the efficacy of the dustbin mechanism in complex cross-view geo-localization scenarios.
Furthermore, we evaluate the model’s generalization capability by testing the model trained on CVUSA on the CVACT dataset, as shown in Figure 6. Remarkably, the model consistently focuses on road regions even in unseen urban areas. This observation aligns with the findings in Sample4Geo [6], where the authors analyzed cross-dataset localization failures and found that most errors stemmed from the incorrect retrieval of satellite images with road features highly similar to the positive samples. Such a correlation underscores the fact that road-related features exhibit exceptional robustness in cross-view localization due to their high topological consistency across different perspectives and minimal sensitivity to variations in lighting, seasons, and occlusions. This inherent stability partially accounts for the strong generalization performance observed in the GenGeo framework. On the other hand, the fact that GenGeo exhibits slightly lower performance on urban datasets like CVACT and VIGOR (Same) compared to CVUSA may also be attributed to this strong focus on road features. In dense urban environments, the discriminative power of road layouts often diminishes due to their structural similarity, whereas roadside buildings and complex vegetation provide more unique landmark cues. This semantic bias suggests a trade-off between cross-dataset robustness and intra-dataset precision in urban settings.
Interestingly, we also observe a “semantic preference asymmetry” in the dustbin mechanism: the satellite branch tends to preserve road layouts, whereas the ground branch focuses more on roadside vegetation textures. We hypothesize that this asymmetry contributes to the performance degradation observed during cross-dataset evaluation. Although the model successfully captures position-related information, the divergent feature weight distributions across views weaken the descriptor consistency during the matching stage.

5.2. Cross-View Semantic Consistency Analysis

To further validate that the proposed model achieves cross-view semantic alignment through the shared SALAD clustering mechanism, we perform a qualitative and quantitative analysis on the distribution of cluster assignments across different views. Specifically, as illustrated in Figure 7, we select a pair of corresponding regions (highlighted by red boxes) from a ground-level image and its matched aerial image, both depicting the same bridge structure. Due to the inherent viewpoint discrepancy, the two regions exhibit notable visual differences: the aerial image captures the top surface of the bridge, while the ground-level image primarily observes its underside. This discrepancy reflects the fundamental challenge of cross-view geo-localization, where semantically corresponding regions may have significantly different visual appearances.
To analyze how the model handles such discrepancies, we examine the cluster assignment distributions of patch tokens within the selected regions. For each region, we collect the assignment probabilities of all patches over the shared set of semantic clusters, and compute the average distribution by aggregating and normalizing these assignments. Based on these statistics, we plot the corresponding histograms shown on the right side of Figure 7.
From the results, we observe that despite the substantial appearance variations caused by viewpoint differences, the cluster distributions of the two regions exhibit a high degree of similarity. This indicates that the model consistently assigns semantically corresponding regions from different views to similar clusters in the shared assignment space. Such behavior suggests that the shared SALAD aggregation mechanism implicitly enforces cross-view semantic consistency. As a result, view-specific variations (e.g., texture, illumination, and perspective differences) are largely suppressed, while high-level semantic cues corresponding to stable geographic structures are preserved.
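The region-level statistic plotted in the histograms can be sketched as follows; the helper names and the use of cosine similarity as a consistency score are our additions for illustration.

```python
import numpy as np

def region_cluster_histogram(P, patch_idx):
    """Average, renormalized cluster-assignment distribution of the
    patches inside a region (sketch of the statistic behind the
    histograms). P is the (n, M) assignment matrix over the shared
    clusters; patch_idx indexes the patches in the selected region."""
    hist = P[patch_idx].mean(axis=0)   # average assignment per cluster
    return hist / hist.sum()           # normalize to a distribution

def histogram_similarity(p, q):
    """Cosine similarity between two region histograms, a simple
    quantitative proxy for cross-view semantic consistency."""
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))
```

Comparing the histograms of the two boxed bridge regions in this way would quantify the high distributional similarity observed qualitatively in Figure 7.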
This observation provides direct evidence that the proposed shared clustering strategy facilitates cross-view alignment at the feature aggregation level, enabling the model to learn view-invariant representations without relying on explicit geometric priors.

5.3. Future Work

Despite its robust generalization, GenGeo’s in-domain accuracy in dense urban environments suggests potential for refinement. Future research will prioritize adaptive multi-semantic mining to move beyond the current emphasis on road layouts, incorporating more discriminative cues from architectural morphologies and roadside vegetation. Furthermore, we plan to integrate Bird’s-Eye-View (BEV) transformations to further mitigate perspective distortions. Investigating the synergy between geometric projections and our existing feature selection mechanism will be crucial for enhancing absolute localization precision while preserving the framework’s characteristic cross-dataset robustness.

6. Conclusions

In this paper, we revisit and reinterpret the SALAD aggregation mechanism in the context of cross-view geo-localization, showing that its clustering-based aggregation process can be viewed as inducing a unified assignment space for features from different viewpoints. This perspective provides a new understanding of how adaptive feature aggregation can facilitate cross-view semantic consistency without relying on explicit geometric priors. Building upon this insight, we further propose GenGeo, a unified framework that integrates vision foundation model representations with a SALAD-based aggregation mechanism. By leveraging the strong semantic representations of foundation models together with adaptive feature aggregation, the proposed framework produces robust and transferable global descriptors for cross-view matching. In addition, the dustbin mechanism helps filter out unmatched and non-informative regions, addressing the partial correspondence and information asymmetry inherent in cross-view scenarios. Extensive experiments demonstrate that GenGeo achieves state-of-the-art performance in cross-dataset generalization and exhibits strong robustness under severe domain shifts and spatial misalignment. Both ablation studies and visualization analyses further confirm the effectiveness of the aggregation-based representation in facilitating cross-view alignment.
Overall, this work highlights that reinterpreting and adapting existing aggregation mechanisms, rather than designing entirely new architectures, can provide an effective and practical direction for improving generalizable cross-view geo-localization.

Author Contributions

Conceptualization, R.W.; Methodology, R.W.; Software, R.W. and W.Y. (Wen Yuan); Validation, R.W., W.Y. (Wen Yuan), T.L. and Y.Z.; Formal analysis, R.W.; Investigation, W.Y. (Wu Yuan) and X.X.; Resources, W.Y. (Wen Yuan); Data curation, R.W.; Writing—original draft, R.W.; Writing—review & editing, X.X., T.L. and Y.Z.; Supervision, W.Y. (Wu Yuan); Project administration, W.Y. (Wen Yuan). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All studies in this paper are based on publicly available datasets. Code is available at https://github.com/rw-afk-pixel/GenGeo (accessed on 1 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kim, D.K.; Walter, M.R. Satellite image-based localization via learned embeddings. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore; IEEE: New York, NY, USA, 2017; pp. 2073–2080. [Google Scholar]
  2. Liu, J.; Qin, R.; Arundel, S.T. Assessing the utility of uncrewed aerial system photogrammetrically derived point clouds for land cover classification in the Alaska North Slope. Photogramm. Eng. Remote Sens. 2024, 90, 405–414. [Google Scholar] [CrossRef]
  3. Chiu, H.P.; Murali, V.; Villamil, R.; Sikka, K.; Kumar, R. Augmented reality driving using semantic geo-registration. In Proceedings of the IEEE Conference on Virtual Reality and 3D User Interfaces (VR), Reutlingen, Germany; IEEE: New York, NY, USA, 2018; pp. 423–430. [Google Scholar]
  4. Workman, S.; Souvenir, R.; Jacobs, N. Wide-Area Image Geolocalization with Aerial Reference Imagery. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile; IEEE: New York, NY, USA, 2015; pp. 3961–3969. [Google Scholar]
  5. Chen, Z.; Yang, Z.X.; Rong, H.J. Multi-level Embedding and Alignment Network with Consistency and Invariance Learning for Cross-View Geo-Localization. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5104416. [Google Scholar]
  6. Deuser, F.; Habel, K.; Oswald, N. Sample4Geo: Hard Negative Sampling for Cross-View Geo-Localisation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France; IEEE: New York, NY, USA, 2023; pp. 16847–16856. [Google Scholar]
  7. Shi, Y.; Yu, X.; Liu, L.; Li, H. Optimal Feature Transport for Cross-View Image Geo-Localization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA; AAAI Press: Palo Alto, CA, USA, 2020; Volume 34, pp. 11990–11997. [Google Scholar]
  8. Shi, Y.; Liu, L.; Yu, X.; Li, H. Spatial-Aware Feature Aggregation for Image Based Cross-View Geo-Localization. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, Canada; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32, pp. 10090–10100. [Google Scholar]
  9. Toker, A.; Zhou, Q.; Maximov, M.; Leal-Taixé, L. Coming Down to Earth: Satellite-to-Street View Synthesis for Geo-Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA; IEEE: New York, NY, USA, 2021; pp. 6488–6497. [Google Scholar]
  10. Ding, X.; Zhang, X.; Song, S.; Li, B.; Hui, L.; Dai, Y. Cross-View Geo-Localization via 3D Gaussian Splatting-Based Novel View Synthesis. Remote Sens. 2025, 17, 3673. [Google Scholar] [CrossRef]
  11. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
  12. Izquierdo, S.; Civera, J. Optimal Transport Aggregation for Visual Place Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada; IEEE: New York, NY, USA, 2024; pp. 17658–17668. [Google Scholar]
  13. Lin, T.Y.; Cui, Y.; Belongie, S.; Hays, J. Learning Deep Representations for Ground-to-Aerial Geolocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA; IEEE: New York, NY, USA, 2015; pp. 5007–5015. [Google Scholar]
  14. Vo, N.N.; Hays, J. Localizing and Orienting Street Views Using Overhead Imagery. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands; Springer: Cham, Switzerland, 2016; pp. 494–509. [Google Scholar]
  15. Hu, S.; Feng, M.; Nguyen, R.M.H.; Lee, G.H. CVM-Net: Cross-View Matching Network for Image-Based Ground-to-Aerial Geolocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA; IEEE: New York, NY, USA, 2018; pp. 7258–7267. [Google Scholar]
  16. Cai, S.; Guo, Y.; Khan, S.; Hu, J.; Wen, G. Ground-to-Aerial Image Geo-Localization with a Hard Exemplar Reweighted Triplet Loss. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea; IEEE: New York, NY, USA, 2019; pp. 8391–8400. [Google Scholar]
  17. Liu, L.; Li, H. Lending Orientation to Neural Networks for Cross-View Geo-Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA; IEEE: New York, NY, USA, 2019; pp. 5624–5633. [Google Scholar]
  18. Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; Torralba, A. Places: A 10 Million Image Database for Scene Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1452–1464. [Google Scholar] [CrossRef] [PubMed]
  19. Ye, Q.; Luo, J.; Lin, Y. A coarse-to-fine visual geo-localization method for GNSS-denied UAV with oblique-view imagery. ISPRS J. Photogramm. Remote Sens. 2024, 212, 306–322. [Google Scholar] [CrossRef]
  20. Rao, Z.; Lu, J.; Li, C.; Wang, F.; Chen, Y. A Cross-View Image Matching Method with Feature Enhancement. Remote Sens. 2023, 15, 2083. [Google Scholar] [CrossRef]
  21. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual; OpenReview.net: Amherst, MA, USA, 2020. [Google Scholar]
  22. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA; IEEE: New York, NY, USA, 2022; pp. 11976–11986. [Google Scholar]
  23. Zhu, S.; Shah, M.; Chen, C. TransGeo: Transformer Is All You Need for Cross-View Image Geo-Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA; IEEE: New York, NY, USA, 2022; pp. 1162–1171. [Google Scholar]
  24. Shen, T.; Wei, Y.; Kang, L.; Wan, S.; Yang, Y.H. MCCG: A ConvNeXt-based Multiple-Classifier Method for Cross-View Geolocalization. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 1456–1468. [Google Scholar]
  25. Sun, B.; Liu, G.; Yuan, Y. Dimensionally Unified Metric Model for Multisource and Multiview Scene Matching. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5601111. [Google Scholar] [CrossRef]
  26. Zhu, S.; Yang, T.; Chen, C. VIGOR: Cross-View Image Geo-Localization beyond One-to-One Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA; IEEE: New York, NY, USA, 2021; pp. 3640–3649. [Google Scholar]
  27. Xia, P.; Yu, L.; Wan, Y.; Zhang, M. Cross-view geo-localization with panoramic street-view and VHR satellite imagery in decentrality settings. ISPRS J. Photogramm. Remote Sens. 2025, 227, 1–11. [Google Scholar] [CrossRef]
  28. Wang, X.; Xu, R.; Cui, Z.; Wan, Z.; Zhang, Y. Fine-Grained Cross-View Geo-Localization Using a Correlation-Aware Homography Estimator. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 5301–5319. [Google Scholar]
  29. Zhang, Q.; Zhu, Y. Aligning Geometric Spatial Layout in Cross-View Geolocalization via Feature Recombination. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, USA; AAAI Press: Palo Alto, CA, USA, 2024; Volume 38, pp. 7251–7259. [Google Scholar]
  30. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the International Conference on Machine Learning (ICML); PMLR: Brookline, MA, USA, 2021; pp. 8748–8763. [Google Scholar]
  31. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA; IEEE: New York, NY, USA, 2022; pp. 16000–16009. [Google Scholar]
  32. Zhou, J.; Wei, C.; Wang, H.; Shen, W.; Xie, C.; Yuille, A.; Kong, T. iBOT: Image BERT Pre-Training with Online Tokenizer. arXiv 2021, arXiv:2111.07832. [Google Scholar]
  33. Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. RemoteCLIP: A Vision Language Foundation Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5616516. [Google Scholar] [CrossRef]
  34. Vivanco Cepeda, V.; Nayak, G.K.; Shah, M. GeoCLIP: CLIP-inspired Alignment between Locations and Images for Effective Worldwide Geo-localization. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 8690–8701. [Google Scholar]
  35. Bi, C.; Sun, B.; Wang, J.; Yuan, Y.; Liu, G. BEMN: Balanced bias enhanced multi-branch network for cross-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 1–14. [Google Scholar] [CrossRef]
  36. Tian, Y.; Chen, C.; Shah, M. Cross-view image matching for geo-localization in urban environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA; IEEE: New York, NY, USA, 2017; pp. 3608–3616. [Google Scholar]
  37. Lin, T.Y.; Belongie, S.; Hays, J. Cross-view image geolocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA; IEEE: New York, NY, USA, 2013; pp. 891–898. [Google Scholar]
  38. Keetha, N.; Mishra, A.; Karhade, J.; Jatavallabhula, K.M.; Scherer, S.; Krishna, M.; Garg, S. AnyLoc: Towards Universal Visual Place Recognition. IEEE Robot. Autom. Lett. 2023, 9, 1286–1293. [Google Scholar] [CrossRef]
  39. Lu, F.; Lan, X.; Zhang, L.; Jiang, D.; Wang, Y.; Yuan, C. CricaVPR: Cross-Image Correlation-Aware Representation Learning for Visual Place Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada; IEEE: New York, NY, USA, 2024; pp. 16772–16782. [Google Scholar]
  40. Izquierdo, S.; Civera, J. Close, But Not There: Boosting Geographic Distance Sensitivity in Visual Place Recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy; Springer: Cham, Switzerland, 2024; pp. 240–257. [Google Scholar]
  41. Jégou, H.; Douze, M.; Schmid, C.; Péronnin, F. Aggregating local descriptors into a compact image representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA; IEEE: New York, NY, USA, 2010; pp. 3304–3311. [Google Scholar]
  42. Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning Feature Matching with Graph Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA; IEEE: New York, NY, USA, 2020; pp. 4938–4947. [Google Scholar]
  43. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Lake Tahoe, NV, USA; Curran Associates, Inc.: Red Hook, NY, USA, 2013; Volume 26. [Google Scholar]
  44. Sinkhorn, R.; Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pac. J. Math. 1967, 21, 343–348. [Google Scholar] [CrossRef]
  45. van den Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  46. Ye, J.; Lv, Z.; Li, W.; Huang, Z.; Zhang, Y.; Lu, H.; Cheng, J. Cross-View Image Geo-Localization with Panorama-BEV Co-Retrieval Network. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy; Springer: Cham, Switzerland, 2024; pp. 74–90. [Google Scholar]
  47. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  48. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  49. Zhang, X.; Li, X.; Sultani, W.; Zhou, Y.; Wshah, S. Cross-View Geo-Localization via Learning Disentangled Geometric Layout Correspondence. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, USA; AAAI Press: Palo Alto, CA, USA, 2023; Volume 37, pp. 3480–3488. [Google Scholar]
Figure 1. Overall architecture of the proposed GenGeo framework. The model employs a Siamese DINOv2 backbone for feature extraction, followed by a SALAD aggregation module to aggregate salient information from local patches into a global descriptor, while a dustbin mechanism is integrated to suppress non-discriminative environmental redundancies.
Figure 2. Overview of the SALAD aggregation module. Given a set of patch tokens, feature-to-cluster affinities are first computed via a projection layer to form an affinity matrix. An additional dustbin column is introduced to capture non-informative features. The affinity matrix is then normalized using the Sinkhorn algorithm to obtain an optimal assignment matrix with doubly-stochastic constraints. Based on the assignment matrix, aggregation is performed to compute cluster-wise descriptors, which are concatenated to form a VLAD-like representation. Finally, the global CLS token is projected and concatenated with the aggregated descriptor to produce the final global representation. This figure is adapted from SALAD [12]. Different colors in the diagram represent distinct feature sets to enhance visual clarity.
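The aggregation pipeline described in the Figure 2 caption can be sketched in a few lines. The following is an illustrative NumPy sketch, not the authors' implementation: `W_score` and `W_cls` stand in for the learned projection layers, the Sinkhorn loop uses plain uniform row/column normalization rather than SALAD's exact log-domain formulation with dustbin marginals, and the dimensionality-reduction steps of the real module are omitted.

```python
import numpy as np

def sinkhorn(log_a, n_iters=3):
    # Alternately normalize rows and columns in log space so the
    # assignment matrix approaches doubly-stochastic constraints.
    for _ in range(n_iters):
        log_a = log_a - np.logaddexp.reduce(log_a, axis=1, keepdims=True)  # rows
        log_a = log_a - np.logaddexp.reduce(log_a, axis=0, keepdims=True)  # cols
    return np.exp(log_a)

def salad_aggregate(patch_feats, cls_feat, W_score, W_cls, n_clusters=64):
    # patch_feats: (N, D) patch tokens; W_score: (D, K+1) projection with
    # an extra "dustbin" column that absorbs non-informative patches.
    scores = patch_feats @ W_score               # (N, K+1) affinity matrix
    assign = sinkhorn(scores)                    # soft optimal assignment
    assign = assign[:, :n_clusters]              # drop the dustbin column
    clusters = assign.T @ patch_feats            # (K, D) cluster descriptors
    desc = clusters.flatten()                    # VLAD-like concatenation
    global_part = cls_feat @ W_cls               # projected CLS token
    out = np.concatenate([global_part, desc])
    return out / (np.linalg.norm(out) + 1e-12)   # L2-normalized descriptor
```

Because both views share the same projection weights, corresponding semantics in ground and aerial images are pushed toward the same cluster slots of the concatenated descriptor, which is what makes plain cosine similarity meaningful across views.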
Figure 3. Example ground–aerial image pairs from three benchmark datasets. (a) CVUSA: suburban/rural scenes with 1-to-1 center alignment. (b) CVACT: urban scenes in Canberra with 1-to-1 center alignment. (c) VIGOR: urban scenes in US cities featuring non-centered queries and semi-positive aerial neighbors. For each dataset, the left image shows the ground-level panorama and the right image shows the corresponding aerial tile.
Figure 4. Visualization of feature selection behavior in GenGeo. The first row displays original ground-view and aerial images. The second row visualizes patch-wise feature norms from the DINOv2 backbone. The third row illustrates the heatmaps for the dustbin selection mechanism. Color coding represents weight/attention levels, ranging from blue (low) to red (high). Note that both views focus on the road and the roadside cabin.
Figure 5. Visualization of an extreme case using the same layout as Figure 4. The example contains strong view-specific background regions in both ground-view and aerial images. Color coding represents weight/attention levels, ranging from blue (low) to red (high).
Figure 6. Cross-dataset visualization using the same protocol as Figure 4. The model is trained on CVUSA and evaluated on CVACT. Color coding represents weight/attention levels, ranging from blue (low) to red (high).
Figure 7. Visualization of cross-view semantic consistency induced by shared SALAD clustering. The left panel shows a pair of corresponding regions (highlighted by red boxes) from an aerial image (top) and a ground-level image (bottom). Despite significant viewpoint differences—where the aerial view captures the top surface of the structure while the ground view observes its underside—the selected regions correspond to the same semantic object (bridge). The right panel presents the averaged cluster assignment distributions of patch tokens within the selected regions. The horizontal axis denotes the cluster index (64 semantic clusters), and the vertical axis represents the normalized assignment weights. Notably, the two distributions exhibit highly similar patterns, indicating that semantically corresponding regions from different views are consistently mapped to similar clusters.
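The distribution comparison in Figure 7 amounts to averaging each region's soft cluster assignments and comparing the resulting histograms. A hypothetical sketch, assuming the (N × K) assignment matrix and a boolean region mask are available (the function names are ours, not from the paper):

```python
import numpy as np

def region_distribution(assign, mask):
    # assign: (N, K) nonnegative soft cluster assignments for all patch
    # tokens; mask: boolean (N,) selecting tokens inside the region box.
    d = assign[mask].mean(axis=0)
    return d / d.sum()                    # normalized assignment weights

def distribution_similarity(p, q):
    # Cosine similarity between two cluster histograms; values near 1
    # indicate that both regions activate the same semantic clusters.
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))
```

Highly similar histograms for the aerial and ground bridge regions, as in Figure 7, are evidence that the shared clustering maps semantically corresponding content to the same slots despite the viewpoint gap.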
Table 1. Summary of datasets used for cross-view geo-localization experiments.

| Attribute      | CVUSA         | CVACT              | VIGOR       |
|----------------|---------------|--------------------|-------------|
| Train/Val/Test | 35,532/-/8884 | 35,532/8884/92,802 | 105,214/-/- |
| Resolution (G) | 224 × 1232    | 832 × 1664         | 1024 × 2048 |
| Resolution (A) | 750 × 750     | 1200 × 1200        | 640 × 640   |
| Region         | Sub/Rural     | Urban              | Urban       |
| Mapping        | 1-to-1        | 1-to-1             | Non-aligned |
Table 2. Comparison with state-of-the-art methods on CVUSA, CVACT Val, and CVACT Test datasets. All competitive methods utilize ConvNeXt-B as the backbone except TransGeo (DeiT-S) and Ours (DINOv2-B). The best results are highlighted in bold.

| Approach       | CVUSA                   | CVACT Val               | CVACT Test              |
|                | R@1   R@5   R@10  R@1%  | R@1   R@5   R@10  R@1%  | R@1   R@5   R@10  R@1%  |
| TransGeo [23]  | 94.08 98.36 99.04 99.77 | 84.95 94.14 95.78 98.37 | -     -     -     -     |
| Sample4Geo [6] | 98.68 99.68 99.78 99.87 | 90.81 96.74 97.48 98.77 | 71.51 92.42 94.45 98.70 |
| P-BEV [46]     | 98.71 99.70 99.78 99.86 | 91.90 97.23 97.84 98.84 | 73.68 93.53 95.11 98.80 |
| AuxGeo [27]    | 98.80 99.71 99.75 99.85 | 91.86 97.23 97.79 98.93 | 73.65 93.54 95.23 98.75 |
| Ours           | 98.94 99.73 99.82 99.91 | 90.63 96.52 97.33 98.55 | 71.92 92.53 94.40 98.67 |
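The R@K and hit-rate figures reported in Tables 2–9 follow the standard retrieval protocol: a query counts as correct at K if its ground-truth aerial reference appears among its K most similar reference descriptors. A minimal sketch of that computation (the function name is ours; descriptors are assumed L2-normalized, so the dot product equals cosine similarity):

```python
import numpy as np

def recall_at_k(query_desc, ref_desc, gt_idx, ks=(1, 5, 10)):
    # query_desc: (Q, D) ground-view descriptors; ref_desc: (R, D) aerial
    # descriptors; gt_idx[i] is the matching reference index for query i.
    sims = query_desc @ ref_desc.T                     # (Q, R) similarities
    order = np.argsort(-sims, axis=1)                  # best match first
    # Rank of the ground-truth reference for each query (0 = top-1 hit).
    ranks = np.argmax(order == np.asarray(gt_idx)[:, None], axis=1)
    return {k: float(np.mean(ranks < k)) for k in ks}
```

R@1% is the same statistic with K set to one percent of the reference-set size, which is why it saturates near 100 on the larger benchmarks.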
Table 3. Quantitative comparison on the VIGOR dataset under “SAME Area” and “CROSS Area” settings. The best results are highlighted in bold.

| Approach       | VIGOR SAME                    | VIGOR CROSS                   |
|                | R@1   R@5   R@10  R@1%  Hit   | R@1   R@5   R@10  R@1%  Hit   |
| TransGeo [23]  | 61.48 87.54 91.88 99.56 73.09 | 18.99 38.24 46.91 88.94 21.21 |
| Sample4Geo [6] | 77.86 95.66 97.21 99.61 89.82 | 61.70 83.50 88.00 98.17 69.87 |
| P-BEV [46]     | 82.18 97.10 98.17 99.70 -     | 72.19 88.68 91.68 98.56 -     |
| AuxGeo [27]    | 80.34 96.25 97.57 99.67 93.78 | 63.94 84.98 88.98 98.02 76.25 |
| Ours           | 77.91 96.37 97.79 99.75 91.15 | 63.39 86.23 90.21 98.48 73.22 |
Table 4. Cross-dataset generalization results between CVUSA and CVACT. † denotes the method that utilizes explicit polar transformation. The best results are highlighted in bold.

| Approach       | CVUSA → CVACT           | CVACT → CVUSA           |
|                | R@1   R@5   R@10  R@1%  | R@1   R@5   R@10  R@1%  |
| GeoDTR [49]    | 47.79 70.52 -     92.20 | 29.13 47.86 -     81.09 |
| GeoDTR † [49]  | 53.16 75.62 -     93.80 | 44.07 64.66 -     90.09 |
| Sample4Geo [6] | 56.62 77.79 87.02 94.69 | 44.95 64.36 72.10 90.65 |
| P-BEV [46]     | 67.79 84.06 87.96 95.05 | -     -     -     -     |
| Ours           | 71.27 87.34 90.79 96.69 | 55.66 74.04 80.26 94.78 |
Table 5. Cross-dataset generalization results between CVUSA and VIGOR. Models are trained on the source dataset and evaluated on the target dataset. The best results are highlighted in bold.

| Approach       | CVUSA → VIGOR                 | VIGOR → CVUSA           |
|                | R@1   R@5   R@10  R@1%  Hit   | R@1   R@5   R@10  R@1%  |
| Sample4Geo [6] | 0.00  0.01  0.02  3.12  0.00  | 3.85  10.18 14.52 37.51 |
| Ours           | 11.38 18.61 22.15 56.95 11.57 | 38.28 60.14 68.22 91.38 |
Table 6. Ablation study of different backbones on CVUSA and CVUSA → CVACT. All models are trained on CVUSA. Performance on CVUSA reflects in-domain retrieval accuracy, while CVUSA → CVACT evaluates cross-dataset generalization. The best results are highlighted in bold.

| Architecture     | CVUSA                   | CVUSA → CVACT           |
|                  | R@1   R@5   R@10  R@1%  | R@1   R@5   R@10  R@1%  |
| ConvNeXt         | 98.68 99.68 99.78 99.87 | 56.62 77.79 87.02 94.69 |
| DINOv2           | 98.23 99.66 99.75 99.90 | 53.64 75.77 82.34 94.65 |
| ConvNeXt + SALAD | 98.46 99.66 99.79 99.86 | 48.81 72.91 79.67 93.51 |
| Ours             | 98.94 99.73 99.82 99.91 | 71.27 87.34 90.79 96.69 |
Table 7. Ablation study of different aggregation strategies on CVUSA and CVUSA → CVACT. All models are built on DINOv2 and trained on CVUSA. The best results are highlighted in bold.

| Aggregation Strategy       | CVUSA                   | CVUSA → CVACT           |
|                            | R@1   R@5   R@10  R@1%  | R@1   R@5   R@10  R@1%  |
| NetVLAD (shared)           | 98.65 99.71 99.80 99.90 | 60.68 81.11 86.33 94.76 |
| SALAD (non-shared)         | 97.74 99.53 99.72 99.84 | 48.57 72.07 79.20 93.02 |
| SALAD (w/o dustbin)        | 98.94 99.72 99.82 99.88 | 67.14 84.43 88.70 96.17 |
| SALAD (shared, w/ dustbin) | 98.94 99.73 99.82 99.91 | 71.27 87.34 90.79 96.69 |
Table 8. Ablation study of different fine-tuning strategies on CVUSA and CVUSA → CVACT. All models are trained on CVUSA. Performance on CVUSA reflects in-domain retrieval accuracy, while CVUSA → CVACT evaluates cross-dataset generalization. The best results are highlighted in bold.

| Fine-Tuning Strategy       | CVUSA                   | CVUSA → CVACT           |
|                            | R@1   R@5   R@10  R@1%  | R@1   R@5   R@10  R@1%  |
| Frozen Backbone            | 94.65 98.80 99.34 99.86 | 41.56 65.45 73.76 90.33 |
| Full Fine-tuning           | 98.45 99.59 99.74 99.86 | 52.85 74.10 80.75 92.91 |
| Partial Fine-tuning (Ours) | 98.94 99.73 99.82 99.91 | 71.27 87.34 90.79 96.69 |
Table 9. Ablation study of the number of clusters (K) on CVUSA and its generalization performance on CVACT. The best results are highlighted in bold.

| K (Clusters) | CVUSA                   | CVUSA → CVACT           |
|              | R@1   R@5   R@10  R@1%  | R@1   R@5   R@10  R@1%  |
| 32           | 98.77 99.72 99.82 99.89 | 67.41 84.52 88.83 96.14 |
| 48           | 98.89 99.71 99.80 99.90 | 68.54 85.93 89.72 96.68 |
| 64           | 98.94 99.73 99.82 99.91 | 71.27 87.34 90.79 96.69 |
| 80           | 98.78 99.68 99.79 99.89 | 65.87 84.41 88.45 96.05 |

Share and Cite

MDPI and ACS Style

Wang, R.; Yuan, W.; Yuan, W.; Liu, T.; Xi, X.; Zhu, Y. GenGeo: Robust Cross-View Geo-Localization via Foundation Model and Dynamic Feature Aggregation. Remote Sens. 2026, 18, 1116. https://doi.org/10.3390/rs18081116

