1. Introduction
Image-based geo-localization aims to achieve precise position estimation for a query image by matching it against a geo-referenced image database. This technology has demonstrated significant potential across various domains, including autonomous driving [1], robotic navigation [2], and augmented reality (AR) [3]. Although Global Navigation Satellite Systems (GNSS) are widely utilized, their reliability often degrades in “urban canyons” with dense buildings or in heavily forested areas. In these environments, satellite signals frequently suffer from multipath effects or signal blockage, resulting in localization errors ranging from several meters to tens of meters. Under such circumstances, image-based localization methods—leveraging their capacity to perceive environmental features—serve as a crucial supplement to GNSS, providing continuous and high-precision spatial alignment even in signal-constrained environments.
Among the various branches of visual localization, cross-view geo-localization (CVGL) matches ground-level query images with aerial or satellite imagery. This approach overcomes the inherent limitations of traditional visual place recognition (VPR), such as slow database updates and restricted coverage of street-view repositories. The global availability and accessibility of remote sensing imagery enable localization systems to serve not only urban centers but also rural and wilderness areas where street-view coverage is unavailable. However, CVGL faces substantial technical challenges, including drastic viewpoint disparities (ground-level vs. overhead), heterogeneous imaging geometries, and seasonal appearance variations. More fundamentally, these factors lead to severe spatial misalignment and partial correspondence between cross-view images, where only a subset of regions can be reliably matched across views. Existing methods [4,5,6,7] typically employ a dual-branch retrieval framework to map cross-view images into a unified feature space for similarity-based retrieval. To mitigate matching difficulties, some studies incorporate image preprocessing techniques, such as the polar transform [8,9], to bridge the perspective gap at the geometric level.
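To make the geometric preprocessing concrete, the sketch below resamples a square aerial image around its center into a panorama-like strip via a polar mapping. The output resolution, orientation convention, and nearest-neighbor sampling are simplifying assumptions for illustration, not the exact formulation used by any cited method.

```python
import numpy as np

def polar_transform(aerial, out_h=64, out_w=256):
    """Resample a square aerial image (S x S [x C]) into a panorama-like
    strip via a polar mapping around the image center.

    Output rows index radial distance (bottom row = image center) and
    columns index azimuth angle. Nearest-neighbor sampling keeps the
    sketch short; practical pipelines use bilinear interpolation.
    """
    S = aerial.shape[0]
    i = np.arange(out_h).reshape(-1, 1)            # radial index
    j = np.arange(out_w).reshape(1, -1)            # angular index
    radius = (S / 2.0) * (out_h - 1 - i) / out_h   # 0 at the bottom row
    theta = 2.0 * np.pi * j / out_w
    # Source coordinates in the aerial image, clipped to stay in bounds.
    x = np.clip(np.round(S / 2.0 + radius * np.sin(theta)), 0, S - 1).astype(int)
    y = np.clip(np.round(S / 2.0 - radius * np.cos(theta)), 0, S - 1).astype(int)
    return aerial[y, x]
```

Applied to a satellite patch centered on the query location, each output column then corresponds to a viewing direction of the ground panorama, which is what allows the transform to approximate the street-view layout.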
Despite the significant breakthroughs in accuracy achieved by deep learning-based methods on specific benchmarks, their generalization capacity remains limited. This limitation is closely related to the intrinsic challenges of cross-view geo-localization. Due to severe spatial misalignment and partial correspondence, it is inherently challenging to determine which regions can be reliably matched across views. Existing methods typically address this issue either by introducing explicit geometric priors to enforce cross-view alignment or by relying on local feature patterns that may capture dataset-specific textures. However, geometric transformations often depend on dataset-specific assumptions and may fail under complex viewpoint changes, while local feature-based representations are prone to overfitting to dataset-specific textures. As a result, constructing robust and generalizable global representations remains challenging.
Recently, Vision Foundation Models (VFMs) have demonstrated immense potential in handling complex semantics and geometric alignment, as their extracted image features typically possess robust universal semantics and strong generalization capabilities. For instance, ref. [10] demonstrated that employing DINOv2 [11] as a feature extractor significantly enhances the consistency of cross-view features. However, as these models are pre-trained on general internet-scale datasets, our research reveals that the high-dimensional semantics extracted by foundation models still contain substantial environmental redundancies irrelevant to geo-localization, making direct matching with these universal features suboptimal. Moreover, cross-view geo-localization fundamentally differs from conventional visual recognition tasks due to the severe viewpoint discrepancies between ground and aerial images. The lack of spatial correspondence and the presence of geometric distortions make direct feature matching highly unreliable. In particular, the presence of non-discriminative or view-specific regions—such as expansive sky areas, transient shadows, and dynamic objects—not only introduces noise but also exacerbates the inconsistency between cross-view representations. Therefore, beyond strong semantic representations, an effective CVGL system also requires a mechanism that can bridge the representation gap across views rather than relying solely on direct feature similarity.
In this context, feature aggregation mechanisms that can selectively emphasize discriminative regions while suppressing non-informative or unmatched content become particularly important. Recently, the SALAD [12] aggregation framework, originally proposed for visual place recognition, has demonstrated strong performance by enabling adaptive feature assignment and filtering through its clustering-based aggregation and dustbin mechanism. Building upon this observation, we extend this idea to the cross-view geo-localization setting and reinterpret the SALAD mechanism under the presence of severe viewpoint discrepancies and partial correspondence. Specifically, we show that the aggregation process can be viewed as inducing a unified assignment space for cross-view features, thereby facilitating more consistent semantic representation across different viewpoints.
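The clustering-and-dustbin idea can be sketched in a few lines. Below, patch features are softly assigned over a set of cluster centers plus one extra “dustbin” column whose mass is simply dropped before pooling, so patches that match no cluster contribute little to the descriptor. The plain softmax assignment and the fixed dustbin score are simplifications; SALAD itself derives the assignment via optimal transport (the Sinkhorn algorithm) with learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_with_dustbin(patches, centers, dustbin_score=0.0):
    """Aggregate N patch features (N x D) against K cluster centers (K x D).

    Each patch is softly assigned over K clusters plus one dustbin column;
    mass routed to the dustbin is discarded, so uninformative patches are
    effectively filtered out of the pooled global descriptor.
    """
    scores = patches @ centers.T                      # (N, K) similarity scores
    scores = np.concatenate(
        [scores, np.full((len(patches), 1), dustbin_score)], axis=1
    )                                                 # (N, K+1) with dustbin
    assign = softmax(scores, axis=1)[:, :-1]          # drop the dustbin column
    desc = (assign.T @ patches).reshape(-1)           # (K*D,) pooled descriptor
    return desc / (np.linalg.norm(desc) + 1e-12)      # L2-normalize for retrieval
```

Because both branches share the same cluster centers, ground and aerial patches are pooled into the same K slots, which is the “unified assignment space” interpretation used in this paper.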
Based on this insight, we further propose GenGeo (Generalized Geo-localization), which integrates the foundation model DINOv2 with the SALAD aggregation mechanism. The framework leverages DINOv2 to extract rich semantic representations and employs SALAD to produce discriminative global descriptors. By combining semantic-rich representations with matching-aware aggregation, the proposed approach significantly enhances robustness in heterogeneous environments.
The main contributions of this work are summarized as follows:
- (1) We revisit the SALAD aggregation framework with a dustbin mechanism in the context of cross-view geo-localization and analyze its suitability for addressing the intrinsic challenges of partial correspondence and information asymmetry. We show that the shared clustering process can be interpreted as inducing a unified assignment space for cross-view features, which promotes consistent semantic representation while filtering unmatched or noisy regions, thereby facilitating reliable cross-view alignment.
- (2) We present GenGeo, a framework for cross-view geo-localization that integrates vision foundation model representations with the SALAD-based matching-aware aggregation strategy. By combining the strong semantic generalization capability of foundation models with adaptive feature aggregation, the framework produces robust and transferable representations for cross-dataset localization.
- (3) Extensive experiments and ablation studies demonstrate that the proposed framework achieves state-of-the-art performance in cross-dataset generalization and consistently improves robustness under severe domain shifts and spatial misalignment, highlighting the importance of the synergy between foundation model representations and matching-aware aggregation for effective cross-view alignment.
2. Related Work
Existing cross-view geo-localization (CVGL) methods predominantly adopt the Siamese network architecture [13,14,15,16,17]. The core principle of this framework is to employ two backbone networks with identical structures to map ground images and aerial imagery into a unified embedding space. The spatial correlation is subsequently evaluated by computing the geometric distance or cosine similarity between the two image representations. Early studies primarily utilized Convolutional Neural Networks (CNNs) as feature extractors. One of the pioneering approaches [4] leveraged CNNs pre-trained on the ImageNet and Places [18] datasets to extract representations for ground images and satellite images, respectively, demonstrating that CNNs significantly outperform traditional handcrafted descriptors in CVGL tasks. To further enhance the discriminative power of features, contrastive learning has been widely integrated into model training [14,19]. For instance, Vo et al. [14] introduced a soft-margin triplet loss to constrain the relative distances between triplet samples, forcing the anchor and positive samples to cluster in the feature space. Building upon this, Hu et al. [15] proposed a weighted soft-margin triplet loss, which employs a scaling hyperparameter to dynamically adjust the gradient, effectively alleviating the slow convergence issue associated with the original triplet loss.
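Both losses admit a compact sketch. The soft-margin triplet loss can be written as log(1 + exp(d_pos − d_neg)), and the weighted variant multiplies the distance gap by a scaling factor before the exponential; the symbol `alpha` and its default value below follow common usage and are not taken from the original papers.

```python
import numpy as np

def soft_margin_triplet(anchor, pos, neg):
    """Soft-margin triplet loss: log(1 + exp(d(a, p) - d(a, n)))."""
    d_pos = np.linalg.norm(anchor - pos)
    d_neg = np.linalg.norm(anchor - neg)
    return np.log1p(np.exp(d_pos - d_neg))

def weighted_soft_margin_triplet(anchor, pos, neg, alpha=10.0):
    """Weighted variant: scaling the distance gap by alpha steepens the
    loss surface, which accelerates convergence in practice."""
    d_pos = np.linalg.norm(anchor - pos)
    d_neg = np.linalg.norm(anchor - neg)
    return np.log1p(np.exp(alpha * (d_pos - d_neg)))
```

When the positive is already closer than the negative, both losses decay smoothly toward zero instead of clipping at a hard margin; `alpha` controls how sharply that decay happens.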
Despite the foundation laid by Siamese networks, the localization accuracy of early methods remained limited due to severe geometric distortions and drastic viewpoint disparities between ground images and satellite imagery. To address these challenges, the research community has developed two primary technical trajectories: feature learning-based methods and perspective transformation-based methods [20].
Feature learning-based methods aim to mine shared discriminative information between cross-view images by enhancing the model’s representative power. A prominent trend involves replacing original CNNs with advanced backbones such as Vision Transformers (ViT) [21] or ConvNeXt [22], which offer superior long-range dependency modeling. For instance, Zhu et al. [23] proposed TransGeo, which utilizes a dual-stage strategy with attention maps to focus on critical regions, while MCCG [24] leverages a multi-classifier to extract enriched features. Beyond backbones, research has also focused on feature aggregation and hard negative mining, such as the GPS-Sampling and Dynamic Similarity Sampling (DSS) introduced by Deuser et al. [6]. Recently, Sun et al. [25] further emphasized the metric feature consistency principle, showing that preserving strict dimension-wise correspondence between Siamese branches is critical, and introduced frequency-domain features via the DCT to mitigate cross-view discrepancies.
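As an illustration of the frequency-domain idea, the snippet below applies an orthonormal DCT-II to a feature vector and keeps only the lowest coefficients, discarding high-frequency components that tend to carry view-specific texture detail. The 1-D transform and the `keep` cutoff are illustrative assumptions, not the design of the cited method.

```python
import numpy as np

def dct_ii(x):
    """Orthonormal DCT-II of a 1-D vector, built from its cosine basis
    (no SciPy dependency)."""
    n = len(x)
    k = np.arange(n).reshape(-1, 1)
    basis = np.cos(np.pi * (2 * np.arange(n) + 1) * k / (2 * n))
    basis[0] *= 1 / np.sqrt(2)          # DC row scaling for orthonormality
    return np.sqrt(2 / n) * basis @ x

def frequency_descriptor(features, keep=16):
    """Keep the `keep` lowest-frequency DCT coefficients of a feature
    vector; high-frequency residue is dropped."""
    return dct_ii(features)[:keep]
```

Because the transform is orthonormal, it preserves the vector norm, so truncation acts as a low-pass filter in the descriptor space rather than an arbitrary rescaling.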
Perspective transformation-based methods reduce the difficulty of feature learning by employing geometric preprocessing to bridge the domain gap. A milestone in this area is the polar transform proposed in SAFA [8], which projects satellite images into polar coordinates to simulate the visual layout of panoramic ground images. This technique has been widely adopted on center-aligned datasets like CVUSA [4] and CVACT [17]. However, the polar transform possesses limitations: the geometric distortion inherent in the projection introduces noise, and its performance is restricted on non-centered benchmarks such as VIGOR [26] and DReSS [27]. To overcome this, Wang et al. [28] utilized geometric transformations to project ground images into Bird’s-Eye View (BEV) representations. Additionally, Zhang et al. [29] proposed a feature recombination strategy as a robust alternative to traditional perspective transforms.
With the remarkable progress in CVGL techniques, performance on standard benchmarks has approached saturation. However, most existing methods focus predominantly on optimizing localization accuracy within a single domain, often overlooking the model’s generalization capacity. This limitation significantly hinders practical deployment, as models frequently suffer from severe performance degradation when encountering domain shifts caused by geographic heterogeneity. The emergence of Vision Foundation Models (VFMs) offers a promising trajectory to overcome these bottlenecks. For instance, DINOv2 [11], pre-trained via large-scale self-supervised learning, delivers exceptional feature representations with both rich semantics and geometric robustness. Similarly, multi-modal models like CLIP [30] learn universal discriminative features by aligning visual and textual spaces. Other self-supervised transformers such as MAE [31] and iBOT [32] provide rich patch-level semantic embeddings, offering additional avenues for transferability in CVGL tasks.
In the remote sensing domain, research has begun to harness the transferability of vision foundation models (VFMs). RemoteCLIP [33] demonstrated striking zero-shot capabilities, while GeoCLIP [34] leveraged global contextual priors to enhance localization. Building upon this trend, recent works have specifically integrated VFMs into cross-view geo-localization (CVGL) frameworks to improve robustness and generalization. For instance, Bi et al. [35] proposed a balanced bias-enhanced multi-branch network (BEMN) that simultaneously leverages ConvNeXt for local feature extraction and DINOv2 for global context modeling, introducing a two-stage training strategy with joint feature alignment and frequency-domain adjustment modules to mitigate unimodal bias and overfitting.
However, despite these advances, existing methods still face notable limitations. Approaches based on complex architectural designs often incur significant computational overhead, while those relying on hand-crafted geometric transformations or explicit metric alignment may remain sensitive to local distortions in unconstrained real-world environments. Distinct from these strategies, this paper explores enhancing the robustness of cross-view alignment by refining the internal feature aggregation mechanism of universal visual representations. We propose the GenGeo framework, which synergizes the discriminative power of DINOv2 with an optimal transport-based aggregation (SALAD). Unlike static pooling methods, our approach specifically addresses the viewpoint gap by utilizing a “dustbin” mechanism to filter out view-specific uninformative features (e.g., sky, transient objects), thereby distilling a cross-view invariant global descriptor. This allows GenGeo to achieve superior generalization across heterogeneous geographic environments without the need for explicit geometric priors or heavy multi-stage training.
6. Conclusions
In this paper, we revisit and reinterpret the SALAD aggregation mechanism in the context of cross-view geo-localization, showing that its clustering-based aggregation process can be viewed as inducing a unified assignment space for features from different viewpoints. This perspective provides a new understanding of how adaptive feature aggregation can facilitate cross-view semantic consistency without relying on explicit geometric priors. Building upon this insight, we further propose GenGeo, a unified framework that integrates vision foundation model representations with a SALAD-based aggregation mechanism. By leveraging the strong semantic representations of foundation models together with adaptive feature aggregation, the proposed framework produces robust and transferable global descriptors for cross-view matching. In addition, the dustbin mechanism helps filter out unmatched and non-informative regions, addressing the partial correspondence and information asymmetry inherent in cross-view scenarios. Extensive experiments demonstrate that GenGeo achieves state-of-the-art performance in cross-dataset generalization and exhibits strong robustness under severe domain shifts and spatial misalignment. Both ablation studies and visualization analyses further confirm the effectiveness of the aggregation-based representation in facilitating cross-view alignment.
Overall, this work highlights that reinterpreting and adapting existing aggregation mechanisms, rather than designing entirely new architectures, can provide an effective and practical direction for improving generalizable cross-view geo-localization.