Article

ICT-Net: A Framework for Multi-Domain Cross-View Geo-Localization with Multi-Source Remote Sensing Fusion

by Min Wu 1,2,†, Sirui Xu 2,†, Ziwei Wang 1,2,*, Jin Dong 3, Gong Cheng 1,2, Xinlong Yu 4 and Yang Liu 1,2
1 Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University, Beijing 100191, China
2 Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
3 Beijing Academy of Blockchain and Edge Computing, Beijing 100191, China
4 Jianghuai Advance Technology Center, Hefei 230088, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Remote Sens. 2025, 17(12), 1988; https://doi.org/10.3390/rs17121988
Submission received: 28 April 2025 / Revised: 29 May 2025 / Accepted: 6 June 2025 / Published: 9 June 2025

Abstract:
Traditional single neural network-based geo-localization methods for cross-view imagery primarily rely on polar coordinate transformations while suffering from limited global correlation modeling capabilities. To address these fundamental challenges of weak feature correlation and poor scene adaptation, we present a novel framework termed ICT-Net (Integrated CNN-Transformer Network) that synergistically combines convolutional neural networks with Transformer architectures. Our approach harnesses the complementary strengths of CNNs in capturing local geometric details and Transformers in establishing long-range dependencies, enabling comprehensive joint perception of both local and global visual patterns. Furthermore, capitalizing on the Transformer’s flexible input processing mechanism, we develop an attention-guided non-uniform cropping strategy that dynamically eliminates redundant image patches with minimal impact on localization accuracy, thereby achieving enhanced computational efficiency. To facilitate practical deployment, we propose a deep embedding clustering algorithm optimized for rapid parsing of geo-localization information. Extensive experiments demonstrate that ICT-Net establishes new state-of-the-art localization accuracy on the CVUSA benchmark, achieving a top-1 recall rate improvement of 8.6% over previous methods. Additional validation on a challenging real-world dataset collected at Beihang University (BUAA) further confirms the framework’s effectiveness and practical applicability in complex urban environments, particularly showing 23% higher robustness to vegetation variations.

1. Introduction

The widespread deployment of spatial infrastructure systems, including the BeiDou Navigation Satellite System (BDS) and Global Positioning System (GPS), has established Positioning, Navigation, and Timing (PNT) technologies as essential spatiotemporal foundations for modern society. However, in complex environments, such as indoor spaces, urban canyons, and underground facilities, satellite signals face significant challenges, including obstructions, interference-induced attenuation, and amplified multipath effects, leading to exponential degradation in positioning reliability [1]. These limitations are particularly critical in mission-sensitive applications like military reconnaissance and emergency response operations, where autonomous positioning systems must maintain robustness under conditions of incomplete geospatial information and dynamic environmental variations while eliminating dependencies on prior maps and susceptibility to signal interference [2].
Cross-view geo-localization has consequently garnered significant research attention through its integration of deep neural networks (DNNs) and metric learning to construct discriminative spatial feature representations. Recent advancements in artificial intelligence (AI), particularly through attention mechanisms and directional encoding, have substantially improved matching accuracy between unmanned aerial vehicle (UAV) and satellite perspectives by enhancing feature compactness for matched pairs while maximizing inter-class discrepancies for non-matching samples [3,4,5]. Nevertheless, early methodologies remained constrained by inherent cross-view geometric distortions and scale variations, impeding robust feature alignment. Pioneering work by Zhu et al. addressed this limitation through innovative incorporation of spatial neighborhood correlation features, significantly boosting geo-localization discriminative power [6].
Practical implementation of cross-view geo-localization necessitates universal solutions for multi-domain heterogeneous remote sensing data integration [7]. Multi-source remote sensing imagery encompasses multidimensional acquisitions from ground-based sensors, satellite platforms, and cross-view observations with varying imaging geometries, including nadir-pointing and oblique-angle photography. Even single-platform acquisitions exhibit substantial intra-class variability due to seasonal changes, meteorological conditions, and illumination variations, creating fundamental challenges of feature space heterogeneity and spatiotemporal inconsistency. In typical ground-to-satellite matching tasks [8], view distortions, occlusions, and imaging condition disparities induce feature asymmetry that conventional algorithms struggle to resolve, resulting in incomplete feature characterization and weak spatial correlations [9,10]. To solve these problems, we developed a lightweight learning framework integrating spatial and temporal awareness based on mature datasets, which can efficiently complete the matching task of multi-domain remote sensing images. The primary contributions of this paper are as follows.
  • A high-precision multi-domain benchmark dataset (BHUniv) is constructed. We construct the BHUniv dataset with sub-meter spatio-temporal alignment, addressing real-world challenges in semi-enclosed environments (e.g., courtyards, indoor–outdoor transitions) and dynamic vegetation variations. Unlike existing datasets, BHUniv integrates adaptive density sampling (20–50 points/km2) and geometric validation via phase correlation analysis, enabling robust evaluation under GPS-degraded scenarios.
  • A novel CNN-Transformer synergy framework (ICT-Net) with three key innovations is investigated. These include an attention-guided non-uniform cropping mechanism, a deeply embedded clustering algorithm for fast geographic parsing, and two-stage hierarchical training. By combining global feature alignment with local refinement, it achieves 12.3% better cross-view discrimination than static architectures.
  • State-of-the-art performance with practical efficiency. ICT-Net achieves 98.2% R@1 accuracy on CVUSA and 93.8% on BHUniv, surpassing recent methods (e.g., TransGeo, SALAD) while reducing GPU memory usage by 51%. The framework demonstrates 23% higher robustness to vegetation variations and 34% faster inference than Transformer-only baselines.
The remainder of this paper is organized as follows. Section 2 critically reviews existing cross-domain localization methods and identifies current research gaps. Section 3 details the architecture and distinctive characteristics of the BHUniv dataset. Section 4 presents the ICT-Net framework, elucidating its CNN-Transformer hybrid architecture for cross-domain descriptor learning and deep embedded clustering mechanism for geospatial information parsing. Section 5 provides comprehensive experimental validation through benchmark comparisons, ablation studies, and real-world deployment analysis. The conclusions and future research directions are discussed in Section 6.

2. Related Works

This section systematically reviews recent advancements in cross-view geolocalization, focusing on three critical aspects: benchmark datasets, cross-domain feature representation learning, and hierarchical feature learning mechanisms in neural networks.

2.1. Cross-View Geolocalization Approaches and Datasets

Cross-view geolocalization has emerged as a specialized subfield of image retrieval research, characterized by the unique challenges of reconciling viewpoint differences on heterogeneous imaging platforms. Early foundational work established ground-based landmark datasets, such as Oxford5k [11] comprising 5062 images covering a total of 11 Oxford landmarks and Paris6k [12] comprising 6412 images covering a total of 12 Parisian attractions, which were mainly derived from the Flickr image site. To address the limitation of spatial coverage, subsequent studies introduced aerial images as geospatial anchors. The authors in [13] pioneered the strategy of combining aerial and ground view hybrid datasets to achieve cross-localization, but the occlusion effect of the two views remains unresolved.
Existing cross-view geo-localization methods face critical limitations in feature alignment due to their reliance on explicit geometric transformations (e.g., polar coordinate alignment) and insufficient global modeling capabilities. For instance, the authors in [14] proposed an end-to-end CNN with weakly supervised ranking loss to enhance view-invariant feature learning, which fails in spatially misaligned scenarios (e.g., semi-enclosed environments) due to rigid geometric assumptions. CNN-based architectures like CVM-Net [15] struggle to model long-range dependencies, resulting in weak correlations between ground and satellite features, especially under occlusions or seasonal variations [16]. The third phase involves multi-modal integration strategies such as [17] utilizing Google Street View data for city-scale building-aware network architecture localization. Due to the easy accessibility of aerial imagery provided by the Google Maps API, a series of studies focus on cross-view geolocalization in which collected satellite-aerial imagery is used as a reference image for rural [18] and urban [19] areas. They typically use metric learning loss to train dual-stream CNN frameworks. However, this cross-view retrieval system has a large domain gap between a street view and bird’s eye view because CNNs do not explicitly encode the location information of each view [20].
As the demand for datasets has further increased, modern benchmarks such as CVUSA [21] and CVACT [22] specifically address the challenge of ground-satellite view alignment in GPS-absent environments. CVACT extends this paradigm by enabling accurate spatial alignment between ground panoramic views and satellite imagery. Multiple sample presentations of CVACT are given in Figure 1. VIGOR [23] proposed a new urban dataset that assumes the query can occur at any location in a given area, so that the street-view image is not spatially aligned with the center of the aerial image. In this case, the polar transform may fail to model the cross-view correspondence due to unknown spatial displacements and strong occlusions. Recent methodological breakthroughs to bridge the cross-view gap include the following: The authors in [24] proposed a graph neural network that reformulates the re-ranking problem for computational efficiency. In [25], the authors proposed an end-to-end cross-matching approach integrating cross-view synthesis modules and geo-localization modules. Furthermore, the authors in [26] devised a Transformer-based architecture that enables dynamic region focusing in the presence of significant scale changes.

2.2. Cross-Domain Feature Representation Learning

In recent years, approaches to developing cross-domain descriptors have shifted from manual feature extraction to deep learning models that learn feature representations automatically. Early deep learning approaches, such as the dual network architecture proposed by Zagoruyko [27], suffered from computational complexity problems, which were later alleviated by Simo-Serra's efficient Euclidean matching framework [28,29]. Building on this foundation, the triplet network architecture [30] marked a significant advance by demonstrating that unconstrained embedding space optimization [31,32,33] could outperform explicit triplet loss formulations. Subsequent innovations include hybrid 2D-3D descriptors [34] that address the limitations of triplet networks, and multi-loss frameworks that combine triplet loss with cross-entropy for joint local–global optimization [35]. Current cross-domain descriptor research shows three notable trends. First, self-supervised contrastive learning [36,37] effectively alleviates the dilemma of insufficient cross-view data labeling by constructing viewpoint-invariant representations. Second, Transformer-based hierarchical feature aggregation architectures [38] demonstrate unique advantages in cross-scale feature alignment. Furthermore, cross-modal knowledge distillation techniques [39] provide new ideas for feature mapping of heterogeneous sensor data. However, when facing extreme viewpoint differences (e.g., ground-satellite views) and dynamic environmental perturbations, existing methods still exhibit significant trade-offs between feature preservation and generalization capability.
Our work advances this trajectory through perspective-invariant descriptor learning in shared latent space. Unlike existing task-specific optimizations, we synergistically integrate metric learning with a spatial attention mechanism to simultaneously maintain intra-class similarity and maximize inter-class discrimination. This dual mechanism places particular emphasis on geometrically stable regions that are critical for cross-viewpoint matching.

2.3. Hierarchical Feature Learning in Deep Architectures

The complementary strengths of CNNs and Transformers provide a foundation for multi-scale feature learning in geolocalization tasks. Among other things, recent studies have progressively revealed the hierarchical discriminative capabilities of CNNs in fine-grained visual recognition. Jiang et al. [40] demonstrated that distinct CNN layers inherently capture category-specific discriminative regions through feature visualization, revealing a progressive transition from localized texture details to semantically meaningful regions along the network depth. Building upon this hierarchical characteristic, the author of [41] developed a sequence-diverse architecture incorporating lightweight subnetworks that enable cross-region information exchange, effectively addressing the feature isolation problem in fine-grained recognition. On the other hand, the attention learning mechanisms in deep networks have been systematically investigated to enhance feature selectivity. The authors in [42] identified temporal attention propagation patterns in visual processing and proposed an attention-transfer framework that iteratively refines region correlations through semantic relevance encoding. This approach substantially improves the model’s capability in locating discriminative spatial-semantic features. Complementing these developments, Du et al. [43] introduced a progressive multi-granularity integration strategy that dynamically optimizes the contribution weights of different feature hierarchies, achieving state-of-the-art performance through adaptive information fusion.
Emerging methodologies continue to address the challenges of feature representation learning in complex visual environments. Recent work by Chen et al. [44] extends these principles through cross-modal attention alignment, enabling effective knowledge transfer between heterogeneous sensory data. Meanwhile, Transformer-based architectures [45] have demonstrated superior performance in modeling long-range dependencies between subtle discriminative regions, particularly through self-attention mechanisms. These advancements collectively contribute to developing more robust models, though challenges persist in balancing computational efficiency with representation power across varying scales and domains. Notably, Neural Radiance Field (NeRF)-based characterization methods [46] are attempting to introduce geometric priors into descriptor learning, opening new avenues for improving geometric consistency in cross-domain localization.
Building on these foundations, we propose a novel cross-architecture attention framework that combines the local geometric sensitivity of CNNs with the global relational inference of Transformers.

3. Construction of the BHUniv Dataset

To address the limitations of current cross-view geolocalization datasets in spatial continuity and geotagging precision, we introduce the BHUniv dataset, a continuous-coverage benchmark spanning 10 km² of the Beihang University (BUAA) campus. As illustrated in Figure 2, the dataset integrates multi-temporal satellite imagery (0.5 m/pixel resolution, quarterly updated via the Google Earth API) with 300 ground-level panoramic image pairs, each rigorously aligned through the following pipeline:
  • High-precision georeferencing: Ground images were acquired using DJI Phantom 4 RTK drones equipped with dual-frequency GNSS receivers, achieving centimeter-level positioning accuracy (2.5 cm horizontal RMS error) in the WGS84 coordinate system. Each ground image incorporates EXIF metadata documenting acquisition time, solar azimuth/elevation angles, and sensor parameters.
  • Spatio-temporal alignment: Satellite-ground pairs are validated by phase correlation analysis to ensure subpixel geometric consistency, with a displacement error of less than 0.3 m (a minimal sketch of this check is given after the list). Temporal synchronization was validated by cross-modal matching of seasonal vegetation patterns and shadow dynamics, which is particularly critical for mitigating illumination variations between satellite and ground perspectives.
  • Dense sampling strategy: Unlike the discrete benchmark CVUSA dataset, the BHUniv dataset adopts a grid-based systematic sampling scheme of 30 points/km² with an adaptive density adjustment mechanism. This density is increased to 50 points/km² in areas of high feature variability, such as school buildings and complexes, while 20 points/km² is maintained in open areas, as shown in Figure 3.
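Regarding the spatio-temporal alignment step above, the following is a minimal NumPy sketch of an FFT-based phase-correlation check of the kind described; it estimates only an integer-pixel offset (the dataset's subpixel validation would refine the correlation peak), the 0.5 m/pixel scale and 0.3 m tolerance are taken from the text, and the image patches are synthetic.

```python
import numpy as np

def phase_correlation_shift(ref: np.ndarray, tgt: np.ndarray):
    """Estimate the translational offset between two grayscale patches via the
    normalized cross-power spectrum (phase correlation)."""
    cross_power = np.fft.fft2(ref) * np.conj(np.fft.fft2(tgt))
    cross_power /= np.abs(cross_power) + 1e-12          # keep phase information only
    corr = np.abs(np.fft.ifft2(cross_power))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = ref.shape
    # wrap shifts larger than half the image size back to negative offsets
    dy = dy - h if dy > h // 2 else dy
    dx = dx - w if dx > w // 2 else dx
    return dy, dx

ref_patch = np.random.rand(256, 256)
tgt_patch = np.roll(ref_patch, shift=(0, 1), axis=(0, 1))  # simulated 1-pixel misalignment
dy, dx = phase_correlation_shift(ref_patch, tgt_patch)
offset_m = np.hypot(dy, dx) * 0.5                          # pixels -> meters at 0.5 m/pixel
print(f"estimated offset: {offset_m:.2f} m, within tolerance: {offset_m < 0.3}")
```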
Figure 2 demonstrates spatio-temporally aligned satellite-ground pairs from the BUAA campus, with all satellite images incorporating time-series metadata and ground truth annotations containing field-of-view parameters for spatial-temporal perception validation. Representative ground-view sampling distributions are illustrated in Figure 3, Figure 4, Figure 5 and Figure 6, while Table 1 quantitatively summarizes key dataset statistics, including spatial coverage granularity and temporal sampling intervals. The integration of adaptive sampling density with centimeter-level georeferencing establishes the BHUniv dataset as a robust benchmark for evaluating cross-view localization robustness under real-world geometric and photometric variations.

4. Proposed ICT-Net Framework

We first describe the cross-view geo-localization problem and outline the designed ICT-Net approach in Section 4.1. Then, in Section 4.2, we present the Vision Transformer components used in our approach. Section 4.3 presents the proposed attention-guided non-uniform cropping strategy. Finally, Section 4.4 presents the acceleration technique used for model training.

4.1. Problem Description and Method Overview

The core task of multi-domain cross-view geo-localization can be formulated as follows: given a collection of ground-view query images $I_g$ and the corresponding collection of satellite reference images $I_s$, the goal is to construct a joint embedding space in which each ground-view image $I_g$ has maximum similarity to its true geolocated satellite reference image $I_s$. In this framework, a ground-view image and the satellite-view image corresponding to its geographic coordinates are defined as a positive sample pair, and all remaining combinations are considered negative sample pairs. For the case of multiple aerial images covering the same street-view area in the VIGOR dataset, this method adopts the nearest-neighbor principle to select the unique positive sample, while the batch sampling strategy is optimized during training to actively exclude adjacent aerial images around the same street-view area, eliminating the blurring of the supervised signal caused by geographic proximity.
As illustrated in Figure 6, this paper employs a dual-branch architecture for feature embedding of ground and satellite views, where the ground branch T g and satellite branch T s each consist of a CNN stem module and a Transformer encoder. The CNN stem module, shown in Figure 7 (left), employs a three-tier convolutional layer structure. Each convolutional layer is followed by a Batch Normalization (BN) layer, with a Gaussian Error Linear Unit (GELU) serving as the activation function.
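The following PyTorch sketch illustrates such a CNN stem; only the Conv → BN → GELU pattern of the three-tier structure is specified in the paper, so the channel widths, kernel sizes, and strides below are our own illustrative choices.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Three-tier CNN stem: each tier is Conv2d -> BatchNorm2d -> GELU.
    Channel widths and strides are illustrative assumptions, not from the paper."""
    def __init__(self, in_chans: int = 3, dims=(64, 128, 256)):
        super().__init__()
        layers, c_in = [], in_chans
        for c_out in dims:
            layers += [
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.GELU(),
            ]
            c_in = c_out
        self.stem = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.stem(x)

# each of the ground (T_g) and satellite (T_s) branches would own its own stem
x = torch.randn(2, 3, 256, 256)          # dummy batch of panoramas / satellite tiles
feat = ConvStem()(x)                     # -> (2, 256, 32, 32) local feature map
print(feat.shape)
```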
The model is trained with a soft-margin triplet loss $\mathcal{L}_t$, which can be expressed as
$\mathcal{L}_t = \log\left(1 + e^{\alpha\left(d_p - d_n\right)}\right),$  (1)
where $d_p$ and $d_n$ represent the squared $L_2$ distances between the positive and negative pairs, respectively. For each mini-batch containing $N$ cross-view pairs, we implement an exhaustive mining strategy to generate $2N(N-1)$ triplets, with all embedding features undergoing $L_2$-normalization prior to distance computation.
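A minimal PyTorch sketch of this loss with exhaustive in-batch mining is given below; the value of $\alpha$ and the embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def soft_margin_triplet_loss(g_emb: torch.Tensor, s_emb: torch.Tensor,
                             alpha: float = 10.0) -> torch.Tensor:
    """Eq. (1): L_t = log(1 + exp(alpha * (d_p - d_n))) with exhaustive in-batch
    mining. g_emb[i] and s_emb[i] form the i-th positive pair; every other pairing
    is a negative, giving 2N(N-1) triplets per batch. alpha here is an assumption."""
    g = F.normalize(g_emb, dim=1)                     # L2-normalize embeddings
    s = F.normalize(s_emb, dim=1)
    d = torch.cdist(g, s).pow(2)                      # squared L2 distances, (N, N)
    d_pos = d.diag()                                  # distances of matched pairs
    mask = ~torch.eye(d.size(0), dtype=torch.bool, device=d.device)
    # ground -> satellite triplets: anchor g_i, positive s_i, negatives s_j (j != i)
    loss_g2s = F.softplus(alpha * (d_pos.unsqueeze(1) - d))[mask]
    # satellite -> ground triplets: anchor s_j, positive g_j, negatives g_i (i != j)
    loss_s2g = F.softplus(alpha * (d_pos.unsqueeze(0) - d))[mask]
    return torch.cat([loss_g2s, loss_s2g]).mean()

loss = soft_margin_triplet_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```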
The proposed framework implements a two-stage training paradigm illustrated in Figure 1. The initial training stage establishes foundational feature representations using conventional triplet loss optimization, which facilitates preliminary metric learning for feature space regularization. Subsequently, a refinement stage is introduced where attention-driven spatial sampling mechanisms are employed. Specifically, attention heatmaps derived from satellite imagery are utilized to identify semantically critical regions, guiding non-uniform cropping operations as detailed in Section 4.3. This strategic approach enables dynamic redistribution of computational resources by eliminating irrelevant image areas while concentrating processing power on salient regions within reference aerial imagery. The hierarchical architecture utilizing the CNN stem and Transformer mechanism creates a synergistic effect between global feature alignment during initial training and localized detail enhancement in the refinement phase, ultimately achieving enhanced resolution processing for semantically meaningful areas while maintaining system-wide efficiency through optimized resource allocation.

4.2. Vision Transformer for Geo-Localization

We incorporate key components from the Vision Transformer (ViT) architecture [47] into our methodology, specifically adapting the patch embedding, positional embedding, and multi-head attention mechanisms. The hierarchical processing pipeline starts with patch embedding, and in order to preserve the spatial awareness that is crucial for the image understanding task, we integrate a positional embedding that encodes absolute coordinate information into each patch token. This injects geometric context into the Transformer's self-attention mechanism, enabling location-sensitive feature learning despite the permutation-invariant nature of standard Transformer architectures.
Patch Embedding: The input image $I \in \mathbb{R}^{H \times W \times C}$ is decomposed into $N = (H/P) \times (W/P)$ non-overlapping patches of size $P \times P$ (we select $P = 16$ as the optimal resolution). This segmentation strategy strikes a balance between local feature preservation and computational efficiency by converting each patch into a tokenized representation. As shown in Figure 4, these patches $I_p \in \mathbb{R}^{N \times P \times P \times C}$ are flattened and linearly projected through a trainable matrix to generate $N$ feature tokens $I_t \in \mathbb{R}^{N \times D}$, where $D$ denotes the latent dimension of the subsequent Transformer encoder.
Learnable Class Token: The ViT extends the BERT-style architecture by prepending a learnable class token to the input image tokens. This token aggregates classification signals across Transformer layers through residual connections, with its final-layer output processed by an MLP head to produce the classification vector. We utilize this vector as the embedding feature for the subsequent loss computation in Equation (1). To preserve spatial relationships, learnable position embeddings are added to all tokens, including the class token, forming a parameterized matrix in $\mathbb{R}^{(N+1) \times D}$.
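The sketch below illustrates the patch embedding, class token, and learnable position embedding described above in PyTorch; the embedding dimension $D = 384$ and the input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches, project each patch to D dims, prepend a
    learnable class token, and add learnable position embeddings of shape (N+1, D).
    D = 384 and the input size are illustrative choices."""
    def __init__(self, img_size: int = 256, patch: int = 16,
                 in_chans: int = 3, dim: int = 384):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2           # N = (H/P) * (W/P)
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.proj(x).flatten(2).transpose(1, 2)      # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)        # (B, 1, D)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

tokens = PatchEmbedding()(torch.randn(2, 3, 256, 256))
print(tokens.shape)   # torch.Size([2, 257, 384]): 256 patch tokens + 1 class token
```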
Squeezed Multihead Self-Attention: The Transformer encoder employs $L$ stacked blocks, each containing a Squeezed Multihead Self-Attention (SMHSA) mechanism. As illustrated in Figure 7 (right), SMHSA projects the input tokens into query $Q$, key $K$, and value $V$ matrices via parallel linear transformations. This module optimizes the standard self-attention workflow through spatial downsampling. Given an input $X \in \mathbb{R}^{L \times d_x}$, where $d_x$ is the input feature dimension, the keys $K \in \mathbb{R}^{L \times d_k}$ and values $V \in \mathbb{R}^{L \times d_v}$ undergo a depthwise separable convolution (DWConv) with stride = 2 and kernel = 3, which can be expressed as
$K' = \mathrm{DWConv}(K) \in \mathbb{R}^{\frac{L}{2} \times d_k}, \quad V' = \mathrm{DWConv}(V) \in \mathbb{R}^{\frac{L}{2} \times d_v},$  (2)
where $d_k$ and $d_v$ denote the projected dimensions for keys and values. The attention computation incorporates convolutional priors through
$\mathrm{SMHSA}(Q, K', V') = \mathrm{Softmax}\!\left(\frac{Q K'^{\top}}{\sqrt{d_k}} + B_{\mathrm{conv}}\right) V',$  (3)
where $Q \in \mathbb{R}^{L \times d_q}$ is the query matrix with dimension $d_q$, generated from the input features by linear projection and used to compute the similarity with the key matrix, and $K'$ is the processed key matrix, obtained by spatially downsampling the original key matrix with the depthwise separable convolution. Furthermore, $B_{\mathrm{conv}} \in \mathbb{R}^{L \times \frac{L}{2}}$ is a convolutional bias prior, learned through a separable convolutional layer, which injects local spatial correlations into the attention computation. The scaling factor $1/\sqrt{d_k}$ stabilizes gradient propagation.
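The following is a simplified single-head PyTorch sketch of the squeezed attention in Equations (2) and (3); the paper's version is multi-head, the same squeeze convolution is shared between keys and values here for brevity, and the learned bias $B_{\mathrm{conv}}$ is modeled as a plain learnable logit bias, which is our simplification.

```python
import torch
import torch.nn as nn

class SqueezedSelfAttention(nn.Module):
    """Sketch of Eqs. (2)-(3): keys and values are downsampled along the token axis
    by a depthwise separable convolution (stride 2, kernel 3) before attention,
    halving the key/value length. Single-head for brevity; B_conv is approximated
    by a learnable additive bias over the attention logits."""
    def __init__(self, dim: int = 384, seq_len: int = 256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # depthwise conv + pointwise conv = depthwise separable convolution
        self.squeeze = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1, groups=dim),
            nn.Conv1d(dim, dim, kernel_size=1),
        )
        self.bias = nn.Parameter(torch.zeros(seq_len, seq_len // 2))  # stands in for B_conv
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.q(x)                                        # (B, L, D)
        k, v = self.kv(x).chunk(2, dim=-1)
        k = self.squeeze(k.transpose(1, 2)).transpose(1, 2)  # (B, L/2, D)
        v = self.squeeze(v.transpose(1, 2)).transpose(1, 2)  # (B, L/2, D)
        attn = (q @ k.transpose(1, 2)) * self.scale + self.bias
        return attn.softmax(dim=-1) @ v                      # (B, L, D)

out = SqueezedSelfAttention()(torch.randn(2, 256, 384))
print(out.shape)   # torch.Size([2, 256, 384])
```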

4.3. Attention-Guided Non-Uniform Cropping

In cross-view geolocalization tasks, accurate localization of visually salient regions plays a key role in matching performance. Typically, the human visual system pays priority attention to salient regions in a scene and acquires detailed features through high-resolution localized observation. This cognitive mechanism is particularly important for cross-view matching, as there are often only a small number of shared visible regions between different views. The traditional single-CNN-based approach uses a uniform rectangular cropping strategy that has difficulty effectively filtering out discretely distributed irrelevant regions (e.g., building roofs visible in the satellite view but occluded in the street view), resulting in computational resources wasted on feature extraction from invalid regions.
Considering the dataset size and computational efficiency, this study proposes an attention-guided non-uniform cropping method within the pure Transformer architecture. Specifically, the attention distribution of the Transformer encoder is analyzed at the end of the satellite-view branch, which describes the contribution of each image patch to the final output. The correlation between the class token and all patch tokens is then chosen as the basis for attention allocation. As shown in the typical case of Figure 6, street areas usually show a high attentional response, while occluded building areas exhibit a markedly low response. In the designed automatic pruning mechanism based on attentional entropy, a token $t_d$ is appended at the end of each Transformer block, and the correlation weights between the patch features $f_i$ and $t_d$ are computed through cross-attention to iteratively generate the attention heatmap, which is computed by the following equation:
$\alpha_i = \frac{\exp\left(f_i^{\top} t_d / \tau\right)}{\sum_{j=1}^{L} \exp\left(f_j^{\top} t_d / \tau\right)},$  (4)
where $\tau$ is the temperature coefficient. The method achieves effective region screening by setting a retention ratio $\beta$, while introducing a resolution scaling factor $\gamma$ for local zoom-in processing: keeping the patch size unchanged, the input resolution is increased by a factor of $\gamma$, yielding $\gamma$ times the number of image patches. Here, the retention ratio $\beta = 0.64$ and scaling factor $\gamma = 1.56$ are selected through systematic ablation studies, balancing spatial coverage and resolution enhancement under the constraint $\beta \times \gamma = 1$. This ensures that 64% of the high-attention regions are preserved while the local resolution increases by 56%, achieving an optimal accuracy–efficiency trade-off. These parameter choices align with the attention energy distribution patterns observed in satellite imagery, as shown in Figure 8, where critical features are concentrated in a subset of patches.
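A minimal sketch of the patch-selection step implied by Equation (4) is shown below; the token shapes are illustrative, and the way the retained indices are consumed downstream (re-tokenization at higher resolution) is omitted.

```python
import torch

def select_patches(patch_feats: torch.Tensor, t_d: torch.Tensor,
                   beta: float = 0.64, tau: float = 0.07) -> torch.Tensor:
    """Eq. (4): attention weights alpha_i = softmax(f_i^T t_d / tau) over the L
    patch tokens of the satellite branch; keep the top beta*L patches and drop the
    rest. The heatmap can be precomputed after Stage-1 training and cached."""
    alpha = torch.softmax(patch_feats @ t_d / tau, dim=0)   # (L,) attention weights
    keep = int(round(beta * patch_feats.size(0)))
    return alpha.topk(keep).indices                          # indices of salient patches

# toy usage: one satellite image with L = 256 patch tokens of dimension 384
feats = torch.randn(256, 384)
token = torch.randn(384)
kept = select_patches(feats, token)        # 164 of 256 patches survive (beta = 0.64)
print(kept.shape)
```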
It is worth noting that this method can complete the precomputation and storage of the attention map in the training phase, without adding any computational overhead to the inference process. Experimental results show that the method significantly improves the accuracy of cross-view matching while keeping the inference efficiency of the ground-view branch unchanged. This adaptive processing strategy based on the attention mechanism provides a new solution to the feature screening problem in multi-source fusion cross-view geo-localization.

4.4. Training Acceleration Optimization Strategy

To address the critical need for rapid positioning in real-world applications, we propose an adaptive solving algorithm within a deep embedding clustering framework. For the historical embedded features $\{z_i\}_{i=1}^{N}$, we develop an enhanced variant of K-means with an adaptive number of clusters $K$, which is computed as
$K = \arg\min_{k} \left[ \frac{1}{N} \sum_{i=1}^{N} \min_{c_j \in C} \left\| z_i - c_j \right\|^{2} + \lambda k \right],$  (5)
where $\lambda$ is automatically tuned via silhouette analysis. To enhance environmental adaptability, a momentum update strategy refines the cluster centers progressively, which can be expressed as
$c_j^{(t+1)} = \gamma_m c_j^{(t)} + (1 - \gamma_m) \frac{\sum_{z_i \in B_t} q_{ij} z_i}{\sum_{z_i \in B_t} q_{ij}},$  (6)
where $B_t$ represents the $t$-th training batch, $\gamma_m \in [0, 1]$ is the momentum coefficient, and $q_{ij}$ denotes the soft assignment probability computed as described below. The ICT-Net framework implements joint learning through a parallel feature encoder (ICT-Net-query) and cluster optimizer (ICT-Net-cluster). The feature distribution loss is defined as
$\mathcal{L}_{KL} = \sum_{i=1}^{N} \sum_{j=1}^{K} p_{ij} \log \frac{p_{ij}}{q_{ij}},$  (7)
where the soft assignment probability $q_{ij}$ derives from the Student's t-distribution, as follows:
$q_{ij} = \frac{\left(1 + \left\| f_\theta(x_i) - c_j \right\|^{2} / \alpha\right)^{-\frac{\alpha + 1}{2}}}{\sum_{l=1}^{K} \left(1 + \left\| f_\theta(x_i) - c_l \right\|^{2} / \alpha\right)^{-\frac{\alpha + 1}{2}}},$  (8)
where $f_\theta$ denotes the learnable nonlinear feature mapping function and $\alpha$ is the temperature parameter of the t-distribution. The target distribution $p_{ij}$ is obtained by sharpening $q_{ij}$, which can be computed as
$p_{ij} = \frac{q_{ij}^{2} / f_j}{\sum_{l=1}^{K} q_{il}^{2} / f_l},$  (9)
where $f_j = \sum_{i=1}^{N} q_{ij}$ represents the cluster membership frequency. KL divergence minimization aligns the feature space with the geographic coordinates. For a query input image $x$, the final coordinates are resolved through
$\left(\hat{y}_x, \hat{y}_y\right) = \sum_{j=1}^{K} w_j \cdot \left(\mu_x^{(j)}, \mu_y^{(j)}\right),$  (10)
where the normalized weights $w_j = q_j / \sum_{l=1}^{K} q_l$ reflect cluster membership, and $\mu^{(j)} = \left(\mu_x^{(j)}, \mu_y^{(j)}\right)$ denotes the geographic centroid of cluster $j$. Compared to conventional IIR algorithms, our method reduces computational complexity, with a positioning error bound as follows:
$\epsilon \leq \frac{C}{\sqrt{K}},$  (11)
where $C$ is a constant related to feature space compactness. Experiments demonstrate sub-meter accuracy when $K > 50$. By jointly minimizing the loss in Equation (1) and the count of adaptive clusters in Equation (5), we address the overfitting issue without relying on data augmentation and simultaneously accelerate training. We also consider the impact of taking multiple photos from different angles at the same location where possible, as shown in Figure 9. We build on the enhancement process described above by using a hybrid CNN-Transformer encoder that divides the image into a sequence of 16 × 16 patches for global modeling. Let the input image $x \in \mathbb{R}^{H \times W \times 3}$ be processed by a group of rotationally equivariant convolutional layers $\{C_\theta \mid \theta \in \Theta\}$, where $\Theta = \{-30^{\circ}, 0^{\circ}, +30^{\circ}\}$ is a set of preset rotation angles. Each convolution kernel satisfies the group equivariance constraint, which can be computed as
$C_\theta(x) = \rho(\theta)^{-1} \cdot C_0\left(\rho(\theta) \cdot x\right),$  (12)
where $\rho(\theta)$ denotes the rotation operator applied to the input image, preserving rotational symmetry in the feature map, and $C_0$ is the base convolution kernel. The design implements rotational symmetry modeling through a parameter-sharing mechanism, such that when the input image undergoes a rotational transformation $R_\phi$, the feature mapping satisfies
$f\left(R_\phi(x)\right) = R_\phi\left(f(x)\right),$  (13)
and the cosine similarity of the deep feature maps can be expressed as
$\mathrm{Sim}\left(f(x), f\left(R_\phi(x)\right)\right) = \frac{\left\langle f(x), f\left(R_\phi(x)\right) \right\rangle}{\left\| f(x) \right\| \cdot \left\| f\left(R_\phi(x)\right) \right\|}.$  (14)
In this paper, the deep embedding optimization of multi-temporal coordinate image features is achieved by introducing the convolutional kernel expansion operation with the proposed ICT-Net co-processing mechanism. As shown in Figure 9, comparing the heat map of the similarity matrix generated by the traditional linear feature extraction method and ICT-Net, the method significantly improves the differentiation and stability of the feature representation. Quantitative analysis shows that ICT-Net represents a breakthrough in orientation-invariant feature learning. Statistically, the average cosine similarity of the features obtained from linear feature extraction is only 0.7, while the average similarity of ICT-Net features increases to 0.92, a relative increase of 31.43%. Especially for homologous remote sensing image samples with different rotation angles, the model effectively overcomes the sensitivity of the traditional method to changes in viewing angle through the independently constructed orientation-invariant feature space, as shown in Figure 10. In complex remote sensing application scenarios, ICT-Net can stably extract rotation-independent deep semantic features through a hierarchical orientation adaptive mechanism and a nonlinear feature decoupling strategy. This feature enables the model to show significant application advantages in practical engineering scenarios, such as satellite image matching and multi-temporal change detection.
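To make the clustering-based geographic parsing concrete, the sketch below implements the operations of Equations (6)–(10) in PyTorch under our own assumptions about tensor shapes and initialization; the adaptive selection of $K$ via Equation (5) and silhouette analysis is omitted for brevity, and the cluster centroids and geographic coordinates are synthetic.

```python
import torch

def soft_assign(z: torch.Tensor, centers: torch.Tensor, alpha: float = 1.0):
    """Eq. (8): Student's-t soft assignment q_ij of embeddings z to cluster centers."""
    d2 = torch.cdist(z, centers).pow(2)
    q = (1.0 + d2 / alpha).pow(-(alpha + 1) / 2)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q: torch.Tensor) -> torch.Tensor:
    """Eq. (9): sharpen q into the target distribution p_ij."""
    w = q.pow(2) / q.sum(dim=0)              # q_ij^2 / f_j with f_j = sum_i q_ij
    return w / w.sum(dim=1, keepdim=True)

def dec_loss(q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Eq. (7): KL(p || q) feature-distribution loss."""
    return (p * (p.clamp_min(1e-12).log() - q.clamp_min(1e-12).log())).sum()

def momentum_update(centers, z_batch, q_batch, gamma_m: float = 0.9):
    """Eq. (6): momentum refinement of cluster centers from one mini-batch."""
    new = (q_batch.t() @ z_batch) / q_batch.sum(dim=0, keepdim=True).t().clamp_min(1e-12)
    return gamma_m * centers + (1 - gamma_m) * new

def predict_coordinates(q_row: torch.Tensor, centroids_xy: torch.Tensor):
    """Eq. (10): weighted sum of cluster geographic centroids for one query."""
    w = q_row / q_row.sum()
    return w @ centroids_xy                  # (2,) -> predicted (x, y)

# toy run: 128 embeddings, K = 8 clusters with synthetic geographic centroids (meters)
z = torch.randn(128, 64)
centers = z[torch.randperm(128)[:8]].clone()        # crude K-means-style initialization
centroids_xy = torch.rand(8, 2) * 100
q = soft_assign(z, centers)
loss = dec_loss(q, target_distribution(q))
centers = momentum_update(centers, z, q)
print(loss.item(), predict_coordinates(q[0], centroids_xy))
```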

5. Experiment Analysis

In this section, we carry out systematic evaluation and comparison experiments on the CVUSA public benchmark dataset and on the independently developed BHUniv cross-scene dataset, combining quantitative indicators with qualitative analysis to compare the performance of different cross-view localization algorithms in terms of core metrics such as localization accuracy and robustness. On this basis, we conduct ablation experiments on the key modules of the algorithm to analyze the contribution of each component to the overall cross-view geo-localization performance.

5.1. Proposed Datasets and Metrics

Our method is evaluated on two large-scale cross-view localization benchmarks with complementary characteristics: the spatially aligned CVUSA dataset and our self-constructed BHUniv dataset, jointly covering external open environments and internal semi-enclosed scenarios under diverse operational conditions.
The CVUSA (Cross-View USA) dataset, originally developed for continental-scale geolocalization, contains over one million ground-aerial image pairs. Following established protocols, we employ its preprocessed subset comprising 35,532 training pairs and 8884 test pairs, where extrinsic parameter-driven spherical warping achieves geometric alignment between ground panoramas and satellite views. This explicit spatial correspondence enables pixel-level cross-modal matching for performance evaluation.
BHUniv dataset: This dataset is our newly developed campus-scale benchmark; it addresses indoor–outdoor hybrid geo-localization challenges through 300 training pairs and 100 test pairs of meticulously aligned ground-satellite imagery. Each ground panorama undergoes extrinsic parameter-based spherical projection to establish sub-meter alignment with corresponding satellite tiles. Notably, the BHUniv dataset introduces unique evaluation scenarios, including courtyards and transitional spaces, reflecting real-world campus navigation demands where GPS signals may degrade in semi-enclosed areas.
For quantitative evaluation, we employ two synergistic metrics. The first is Recall@k (R@k), which quantifies cross-view retrieval accuracy by measuring the probability of the ground-truth satellite image appearing among the top $k$ candidates ranked by cosine similarity in the embedding space. This embedding space is optimized using the soft-margin triplet loss with $\alpha = 0.2$, selected through grid search on the CVUSA validation set to balance feature discrimination and convergence stability. The second metric, positioning accuracy, calculates the geodesic distance between the predicted coordinates and the ground-truth coordinates, reporting the error in meters. The clustering process incorporates momentum-based centroid updates with $\gamma_m = 0.9$ to ensure smooth adaptation to feature distribution shifts, while the attention-guided non-uniform cropping uses a temperature coefficient $\tau = 0.07$ to sharpen region selection. All evaluations enforce standardized conditions: $L_2$-normalized embeddings, an inference batch size of $N = 64$, and a fixed random seed to guarantee reproducibility across hardware configurations. These hyperparameters, rigorously calibrated through ablation studies and cross-dataset validation, collectively ensure that the reported performance reflects architectural innovations rather than implementation biases.
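For clarity, a minimal sketch of the R@k computation under these conventions ($L_2$-normalized embeddings, cosine-similarity ranking, one matching reference per query) is given below; the embedding dimension is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_emb: torch.Tensor, ref_emb: torch.Tensor,
                ks=(1, 5, 10)) -> dict:
    """R@k for cross-view retrieval: query_emb[i] matches ref_emb[i].
    Both sets are L2-normalized and ranked by cosine similarity."""
    q = F.normalize(query_emb, dim=1)
    r = F.normalize(ref_emb, dim=1)
    sim = q @ r.t()                                             # cosine similarity matrix
    gt = torch.arange(sim.size(0), device=sim.device).unsqueeze(1)
    # 1-based rank of the ground truth = number of references scoring at least as high
    ranks = (sim >= sim.gather(1, gt)).sum(dim=1)
    return {f"R@{k}": (ranks <= k).float().mean().item() for k in ks}

scores = recall_at_k(torch.randn(100, 512), torch.randn(100, 512))
print(scores)
```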

5.2. Comparative Results Illustration

CVUSA dataset: On the spatially aligned CVUSA benchmark, our method achieves state-of-the-art results without relying on polar coordinate transformations, a common pre-processing step in previous CNN-based methods [48]. Compared to the non-polar-coordinate-transformed baseline, our R@1 is improved by 11.8% in absolute terms, which highlights the inherent geometric inference capability of the Transformer. Notably, compared to L2LTR [49], our approach achieves this advantage while consuming only 49% of the GPU memory, despite the latter using a large amount of pre-training data. The computational efficiency comparison in Section 5.4 further validates the utility of our framework. As shown in Table 2, the proposed ICT-Net method significantly outperforms the previous state-of-the-art methods. The relative improvement of 55.7% on R@1 under the cross-region protocol, compared with the prior method, indicates that our approach has strong learning capacity and robustness to cross-city distribution shifts in the cross-region setup, which uses different cities for training and testing.
BHUniv dataset: Our novel ICT-Net-based feature extraction architecture shows particular strengths on the BHUniv dataset due to the fundamental challenge posed by the spatial mismatch between the ground view and the satellite view. Meanwhile, unlike open areas, the interior of the campus is densely covered with vegetation. By using a convolutional kernel for mask expansion, the vegetation is sufficiently removed, as shown in Figure 11 and Figure 12. Similarly, we perform vegetation removal on the CVUSA dataset as well, the results of which are displayed in Figure 13.

5.3. Ablation Experiments Description

Cross-view geo-localization tasks are susceptible to factors such as photo angle and environmental occlusion. The sensitivity of the model to specific perturbations can be analyzed through a series of ablation experiments.

5.3.1. Effects of Coordinate System Transformations

This study systematically investigates the impact of explicit coordinate transformation mechanisms on geometric modeling. As quantified in Table 3, there are important findings for the four reference datasets:
  • CVUSA Dataset: Conventional CNNs achieve a 19.6% R@1 improvement with explicit Cartesian-to-polar conversion (76.41% to 86.02%), validating the effectiveness of artificial geometric priors. In contrast, our proposed ICT-Net framework achieves 98.2% R@1 through implicit positional encoding, demonstrating the inherent ability of Transformers to learn rotational variable representations.
  • CVACT Dataset: Polar transformations show conflicting effects. Specifically, the SAFA-based method increases R@1 by 28.3%, while the standard Transformer's R@1 is reduced by 3.4%. This highlights the necessity for adaptive geometric reasoning in misaligned scenarios, where ICT-Net achieves a state-of-the-art 83.5% R@1 without explicit transforms.
  • VIGOR Dataset: Unlike spatially aligned benchmarks, VIGOR introduces unconstrained query locations with arbitrary spatial offsets, where polar transformations degrade performance by 12.7%. Existing methods relying on explicit geometric alignment (e.g., CNN+Polar) suffer from severe viewpoint misalignment (63.66% R@1), while ICT-Net achieves 89.22% R@1 through dynamic attention-guided region focusing, resolving 76.5% of spatial displacement errors.
  • BHUniv Dataset: In the case of point-of-view offset, centered polar transformations result in 14.8% R@1 degradation, whereas ICT-Net maintains 93.5% accuracy through dynamic position-aware adaptation, resolving the geometric mismatch problem in semi-enclosed facilities.
This data-driven coordinate representation mechanism effectively resolves the geometric mismatch problem caused by spatial misalignment in complex indoor and outdoor environments [52], providing a unified solution for cross-view localization in urban, rural, and semi-enclosed scenarios.

5.3.2. Attention-Directed Non-Uniform Cropping

To validate the effectiveness of the attention-guided non-uniform cropping strategy, we designed three levels of progressive experiments, as illustrated in Table 4. The Stage-1 baseline model was trained for 100 epochs on the CVUSA and BHUniv datasets using full-image uniform sampling with $\gamma = 1.0$. It was found that simply extending the training schedule to 200 epochs (Stage-1+) does not lead to performance improvement, confirming the efficiency bottleneck of conventional brute-force training. The Stage-2 base model with $\beta = 0.64$, $\gamma = 1.0$ generates the heat map through the ICT-Net multi-head attention mechanism, which dynamically removes 36% of the low-response regions. With the resolution kept unchanged, the R@1 on CVUSA decreases by only 0.15%, and the localization error on BHUniv increases by no more than 0.2 m, which confirms the effectiveness of the attention guidance mechanism.
The advanced model Stage-2+ with $\beta = 0.64$, $\gamma = 1.56$ translates the computational savings into resolution enhancement through the computational resource reallocation strategy. The local sampling density is increased by 56% while 64% of the critical regions are preserved. This improves the R@1 on CVUSA to 98.2% (+1.1%) and the localization accuracy on BHUniv to 93.8% (+0.3%). The visualization of the attention mechanism at different epochs is shown in Figure 14. The visualization analysis shows that regions with high attention values are mostly concentrated in areas with strong geographic identifiers, such as building facades and road intersections, which verifies the spatial sensing ability of the dynamic resource allocation mechanism.

5.3.3. Overlap Ratio of Samples

By constructing the correlation between coordinates and features, the search complexity in the original remote sensing image can be effectively reduced. Through coordinate distribution mapping based on the cluster centers obtained during training, a small localization range can be obtained; blocks are then randomly selected to traverse this small continuous area, realizing query localization for a static image and allowing the localization performance to be tested under different overlap rates of the randomly selected blocks. The center coordinates of the localization range obtained for several test samples give the localization coordinates of the query results. The actual positioning accuracy at different overlap rates can be obtained using the Haversine formula, as shown in Table 5.
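For reference, the sketch below shows the Haversine computation used to turn predicted and ground-truth WGS84 coordinates into a metric localization error; the example coordinates are hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m: float = 6_371_000.0) -> float:
    """Great-circle distance in meters between two WGS84 lat/lon points, used to
    convert predicted vs. ground-truth coordinates into a localization error."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

# hypothetical query: predicted vs. ground-truth coordinates on the BUAA campus
pred = (39.98100, 116.34300)
truth = (39.98101, 116.34301)
print(f"localization error: {haversine_m(*pred, *truth):.2f} m")
```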
When the sample overlap rate is increased from 93% to 96%, the localization error is reduced by 41.3% (5.363 m → 3.149 m), indicating that a high overlap rate can effectively constrain the search space and improve the stability of the algorithm. The error value accounts for only 0.15–0.25% of the maximum diagonal distance (2121 m) of the test area, which verifies the practicality of the algorithm in complex urban scenes. As shown in Figure 15, Figure 16, Figure 17 and Figure 18, where the red markers denote the query localization and the green markers denote the original localization, the localization results of ICT-Net under a 96% overlap rate show spatial aggregation around the real coordinates. The density of localization points around strong geographic identifiers, such as building facades and road intersections, is significantly higher than that in open areas, which correlates strongly with the thermal distribution of the attention mechanism. Compared with the traditional uniform sampling method, the non-uniform cropping strategy proposed in this paper reduces computational complexity and achieves a 59% saving in computational resources. The KL divergence between the feature soft assignments and coordinate soft assignments of the ground truth and remote sensing images is given in Figure 19. In the proposed ICT-Net dual-network recognition process, a gradual aggregation of feature representations can be observed as the ground-truth image features gradually approach the soft assignment of coordinates. The ground image features gradually converge to the coordinate distribution through the soft assignment, which provides a sound basis for the subsequent procedure that determines the coordinate position of an image from its feature distribution. This dual-network feature fusion scheme can effectively improve spatial localization performance, providing a more accurate reference basis.

5.4. Computational Efficiency Analysis

To fully evaluate ICT-Net's practical applicability, we analyze its computational efficiency in terms of training/inference latency, GPU memory consumption, and floating-point operations (FLOPs). All experiments were conducted on an NVIDIA A100 GPU using PyTorch 2.0. For a fair comparison, we used the same batch size across all baselines, i.e., 32 for training and 64 for inference, and an input resolution of 256 × 256 pixels.
The average inference speed of ICT-Net is 18.2 milliseconds per image, which is 34% faster than TransGeo's 27.6 milliseconds and 22% faster than MixVPR's 23.4 milliseconds. This speedup stems from the attention-guided non-uniform cropping strategy, which reduces redundant computation by dynamically pruning 36% of the low-response patches. ICT-Net reduces FLOPs by 59% compared to the uniform cropping strategy while maintaining sub-meter localization accuracy. The two-stage hierarchical paradigm further improves training efficiency: the first stage of global alignment takes 12 h for 100 epochs, while local refinement in the second stage takes only 3 h, a total reduction of 40% in training time compared to the 25 h required by the end-to-end Transformer baseline. GPU memory usage during inference is optimized at 6.2 GB, significantly lower than TransGeo (10.1 GB) and GeoDTR (8.7 GB). This efficiency is attributed to the lightweight CNN stem and deep embedding clustering, which replaces resource-intensive metric learning with adaptive K-means. As visualized in Figure 19, the KL divergence between the query and reference features converges rapidly within 5000 iterations, confirming the stability of the proposed acceleration strategy. These results collectively validate ICT-Net's suitability for real-time deployment in resource-constrained environments, such as UAVs and edge devices.

6. Conclusions

In this paper, we propose ICT-Net, a synergistic perceptual framework that integrates CNNs and Transformers. First, by exploiting the complementary advantages of the CNN's local geometric detail modeling and the Transformer's global semantic association, ICT-Net achieves 98.2% R@1 localization accuracy on the CVUSA dataset, 19.6% higher than traditional CNN methods, verifying the effectiveness of local–global feature synergy. The attention-guided non-uniform cropping strategy proposed in this paper dynamically removes 36% of the low-response image patches, which reduces the computational cost by 42% while the localization error increases by only 0.2 m, providing a feasible solution for real-time localization in resource-constrained scenarios. The fast geographic information parsing system built on the deep embedded clustering algorithm achieves 93.5% localization accuracy on the BHUniv dataset. Future research will focus on extensions to 3D spatial localization, the design of multimodal data fusion frameworks, and the enhancement of online learning capabilities in dynamic environments.

Author Contributions

Conceptualization, M.W. and S.X.; methodology, S.X.; software, S.X. and Z.W.; validation, Z.W., X.Y. and J.D.; formal analysis, J.D. and G.C.; investigation, G.C. and Y.L.; resources, M.W. and S.X.; data curation, M.W., Z.W. and S.X.; writing—original draft preparation, J.D. and G.C.; writing—review and editing, M.W. and S.X.; visualization, M.W. and S.X.; supervision, J.D., G.C. and Y.L.; project administration, Y.L.; funding acquisition, J.D. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fundamental Research Funds for the Central Universities 202506, in part by the Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing 2025001, and in part by the Dreams Foundation of Jianghuai Advance Technology Center (No. 2023-ZMO1Z022).

Data Availability Statement

No public involvement in any aspect of this research.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, H.; Yu, G.; Wang, Z.; Zhao, F.; Chen, P. Online Calibration of LiDAR and GPS/INS Using Multi-Feature Adaptive Optimization in Unstructured Environments. IEEE Trans. Instrum. Meas. 2025, 74, 1–15. [Google Scholar] [CrossRef]
  2. Hu, Y.; Li, X.; Kong, D.; Wei, K.; Ni, P.; Hu, J. A Reliable Position Estimation Methodology Based on Multi-Source Information for Intelligent Vehicles in Unknown Environment. IEEE Trans. Intell. Veh. 2024, 9, 1667–1680. [Google Scholar] [CrossRef]
  3. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 1735–1742. [Google Scholar]
  4. Deng, W.; Zheng, L.; Ye, Q.; Kang, G.; Yang, Y.; Jiao, J. Image-Image Domain Adaptation with Preserved Self-Similarity and Domain-Dissimilarity for Person Re-Identification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 994–1003. [Google Scholar]
  5. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  6. Zhu, S.; Shah, M.; Chen, C. TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1152–1161. [Google Scholar]
  7. Yan, Y.; Wang, M.; Su, N.; Hou, W.; Zhao, C.; Wang, W. IML-Net: A Framework for Cross-View Geo-Localization with Multi-Domain Remote Sensing Data. Remote Sens. 2024, 16, 1249. [Google Scholar] [CrossRef]
  8. Cai, S.; Guo, Y.; Khan, S.; Hu, J.; Wen, G. Ground-to-aerial Image Geo-localization with a Hard Exemplar Reweighting Triplet Loss. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8391–8400. [Google Scholar]
  9. Li, C.; Yan, C.; Xiang, X.; Lai, J.; Zhou, H.; Tang, D. AMPLE: Automatic Progressive Learning for Orientation Unknown Ground-to-Aerial Geo-Localization. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–15. [Google Scholar] [CrossRef]
  10. He, Q.; Xu, A.; Zhang, Y.; Ye, Z.; Zhou, W.; Xi, R.; Lin, Q. A Contrastive Learning Based Multiview Scene Matching Method for UAV View Geo-Localization. Remote Sens. 2024, 16, 3039. [Google Scholar] [CrossRef]
  11. Philbin, J.; Chum, O.; Isard, M.; Sivic, J.; Zisserman, A. Object Retrieval with Large Vocabularies and Fast Spatial Matching. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8. [Google Scholar]
  12. Philbin, J.; Chum, O.; Isard, M.; Sivic, J.; Zisserman, A. Lost in Quantization: Improving Particular Object Retrieval in Large Scale Image Databases. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  13. Lin, T.Y.; Cui, Y.; Belongie, S.; Hays, J. Learning Deep Representations for Ground-to-Aerial Geolocalization. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5007–5015. [Google Scholar]
  14. Arandjelović, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5297–5307. [Google Scholar]
  15. Hu, S.; Feng, M.; Nguyen, R.M.H.; Lee, G.H. CVM-Net: Cross-View Matching Network for Image-Based Ground-to-Aerial Geo-Localization. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7258–7267. [Google Scholar]
  16. Weyand, T.; Kostrikov, I.; Philbin, J. PlaNet—Photo Geolocation with Convolutional Neural Networks. In Proceedings of the 14th European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Amsterdam, The Netherlands, 2016; Volume 9912, pp. 37–55. [Google Scholar]
  17. Tian, Y.; Chen, C.; Shah, M. Cross-View Image Matching for Geo-Localization in Urban Environments. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3608–3616. [Google Scholar]
  18. Liu, L.; Li, H. Lending Orientation to Neural Networks for Cross-View Geo-Localization. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5624–5633. [Google Scholar]
  19. Vo, N.N.; Hays, J. Localizing and Orienting Street Views Using Overhead Imagery. In Proceedings of the 14th European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Amsterdam, The Netherlands, 2016; pp. 494–509. [Google Scholar]
  20. Zhu, S.; Yang, T.; Chen, C. Revisiting Street-to-Aerial View Image Geo-Localization and Orientation Estimation. In Proceedings of the 2021 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 5–9 January 2021; pp. 756–765. [Google Scholar]
  21. Zhai, M.; Bessinger, Z.; Workman, S.; Jacobs, N. Predicting Ground-Level Scene Layout from Aerial Imagery. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 867–875. [Google Scholar]
  22. Shi, Y.; Yu, X.; Campbell, D.; Li, H. Where Am I Looking at? Joint Location and Orientation Estimation by Cross-View Matching. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 4064–4072. [Google Scholar]
  23. Zhu, S.; Yang, T.; Chen, C. Vigor: Cross-view image geo-localization beyond one-to-one retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021; pp. 3640–3649. [Google Scholar]
  24. Zhang, X.; Jiang, M.; Zheng, Z.; Tan, X.; Ding, E.; Yang, Y. Understanding Image Retrieval Re-Ranking: A Graph Neural Network Perspective. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 1–10. [Google Scholar]
  25. Tian, X.; Shao, J.; Ouyang, D.; Shen, H. UAV-Satellite View Synthesis for Cross-View Geo-Localization. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 4804–4815. [Google Scholar] [CrossRef]
  26. Dai, M.; Hu, J.; Zhuang, J.; Zheng, E. A Transformer-Based Feature Segmentation and Region Alignment Method for UAV-View Geo-Localization. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 4376–4389. [Google Scholar] [CrossRef]
  27. Han, X.; Leung, T.; Jia, Y.; Sukthankar, R.; Berg, A.C. MatchNet: Unifying Feature and Metric Learning for Patch-Based Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3279–3286. [Google Scholar]
  28. Simo-Serra, E.; Trulls, E.; Ferraz, L.; Kokkinos, I.; Fua, P.; Moreno-Noguer, F. Discriminative Learning of Deep Convolutional Feature Point Descriptors. In Proceedings of the 15th IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 11–18 December 2015; pp. 118–126. [Google Scholar]
  29. Tian, Y.; Fan, B.; Wu, F. L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 661–669. [Google Scholar]
  30. Balntas, V.; Lenc, K.; Vedaldi, A.; Mikolajczyk, K. HPatches: A Benchmark and Evaluation of Handcrafted and Learned Local Descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5173–5182. [Google Scholar]
  31. Hermans, A.; Beyer, L.; Leibe, B. In Defense of the Triplet Loss for Person Re-Identification. arXiv 2017, arXiv:1703.07737. [Google Scholar]
  32. Mishchuk, A.; Mishkin, D.; Radenovic, F.; Matas, J. Working Hard to Know Your Neighbor’s Margins: Local Descriptor Learning Loss. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  33. Keller, M.; Chen, Z.; Maffra, F.; Schmuck, P.; Chli, M. Learning Deep Descriptors with Scale-Aware Triplet Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2762–2770. [Google Scholar]
34. Pham, Q.-H.; Uy, M.A.; Hua, B.-S.; Nguyen, D.T.; Roig, G.; Yeung, S.-K. LCD: Learned Cross-Domain Descriptors for 2D-3D Matching. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11856–11864. [Google Scholar]
  35. Xiang, X.; Zhang, Y.; Jin, L.; Li, Z.; Tang, J. Sub-Region Localized Hashing for Fine-Grained Image Retrieval. IEEE Trans. Image Process. 2022, 31, 314–326. [Google Scholar] [CrossRef] [PubMed]
  36. Chen, X.; Gao, L.; Zhang, M.; Chen, C.; Yan, S. Spectral–Spatial Adversarial Multidomain Synthesis Network for Cross-Scene Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5518716. [Google Scholar] [CrossRef]
  37. Kim, J.-H.; Hong, I.-P. Cross-Domain Translation Learning Method Utilizing Autoencoder Pre-Training for Super-Resolution of Radar Sparse Sensor Arrays. IEEE Access 2023, 11, 61773–61785. [Google Scholar] [CrossRef]
  38. Zhang, G.; Yang, Y.; Zheng, Y.; Martin, G.; Wang, R. Mask-Aware Hierarchical Aggregation Transformer for Occluded Person Re-identification. IEEE Trans. Circuits Syst. Video Technol. 2025; early access. [Google Scholar] [CrossRef]
  39. Chen, Y.; Du, C.; Zi, Y.; Xiong, S.; Lu, X. Scale-Aware Adaptive Refinement and Cross-Interaction for Remote Sensing Audio-Visual Cross-Modal Retrieval. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4706914. [Google Scholar] [CrossRef]
  40. Jiang, P.T.; Zhang, C.B.; Hou, Q.; Cheng, M.M.; Wei, Y. LayerCAM: Exploring Hierarchical Class Activation Maps. IEEE Trans. Image Process. 2021, 30, 5875–5888. [Google Scholar] [CrossRef] [PubMed]
  41. Zhang, L.; Huang, S.; Liu, W. Learning Sequentially Diversified Representations for Fine-Grained Categorization. Pattern Recognit. 2022, 121, 108219. [Google Scholar] [CrossRef]
  42. Niu, Y.; Jiao, Y.; Shi, G. Attention-Shift Based Deep Neural Network for Fine-Grained Visual Categorization. Pattern Recognit. 2021, 116, 107947. [Google Scholar] [CrossRef]
  43. Du, R.; Chang, D.; Bhunia, A.K.; Xie, J.; Ma, Z.; Song, Y.Z.; Guo, J. Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Volume 12365, pp. 153–168. [Google Scholar]
  44. Held, M.; Rabe, A.; Senf, C.; van der Linden, S.; Hostert, P. Analyzing Hyperspectral and Hypertemporal Data by Decoupling Feature Redundancy and Feature Relevance. IEEE Geosci. Remote Sens. Lett. 2015, 12, 983–987. [Google Scholar] [CrossRef]
  45. De Bonfils Lavernelle, J.; Bonnefoi, P.-F.; Gonzalvo, B.; Sauveron, D. DMA: A Persistent Threat to Embedded Systems Isolation. In Proceedings of the IEEE 23rd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Sanya, China, 17–21 December 2024; pp. 101–108. [Google Scholar]
  46. Wang, Z.; Wu, X.; Zhang, X. Enhancing NERF Rendering in Architectural Environments Using Spherical Harmonic Functions and NEUS Methods. In Proceedings of the China Automation Congress (CAC), Qingdao, China, 1–3 November 2024; pp. 3304–3309. [Google Scholar]
47. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021. [Google Scholar]
  48. Wu, M.; Guo, K.; Li, X.; Lin, Z.; Wu, Y.; Tsiftsis, T.A.; Song, H. Deep Reinforcement Learning-Based Energy Efficiency Optimization for RIS-Aided Integrated Satellite-Aerial-Terrestrial Relay Networks. IEEE Trans. Commun. 2024, 72, 4163–4178. [Google Scholar] [CrossRef]
49. Yang, H.; Lu, X.; Zhu, Y. Cross-View Geo-Localization with Layer-to-Layer Transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 29009–29020. [Google Scholar]
50. Ali-Bey, A.; Chaib-Draa, B.; Giguère, P. MixVPR: Feature Mixing for Visual Place Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–7 January 2023; pp. 2998–3007. [Google Scholar]
51. Zhang, X.; Li, X.; Sultani, W.; Zhou, Y.; Wshah, S. Cross-View Geo-Localization via Learning Disentangled Geometric Layout Correspondence. Proc. AAAI Conf. Artif. Intell. 2023, 37, 3480–3488. [Google Scholar] [CrossRef]
  52. Gong, N.; Li, L.; Sha, J.; Sun, X.; Huang, Q. A Satellite-Drone Image Cross-View Geolocalization Method Based on Multi-Scale Information and Dual-Channel Attention Mechanism. Remote Sens. 2024, 16, 941. [Google Scholar] [CrossRef]
Figure 1. Representative samples from the CVUSA dataset.
Figure 2. Spatial and temporal correspondence for a sample set of the Beihang campus scene.
Figure 3. Sample map of the ground view of the BHUniv dataset.
Figure 4. Sample of the ground view of the BHUniv dataset.
Figure 5. Sample of the ground view of the BHUniv dataset. The red characters are the Chinese name of Beijing University of Aeronautics and Astronautics (Beihang University).
Figure 6. An overview of the proposed ICT-Net framework. Stage-1 uses regular training with Equation (1). Stage-2 follows the "attention and zoom-in" strategy, applying attention-guided non-uniform cropping (Section 4.3) to increase the resolution of important regions of the reference satellite image while keeping the patch embedding size unchanged. The red characters are the Chinese name of Beijing University of Aeronautics and Astronautics.
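As a point of reference for the Stage-2 "attention and zoom-in" strategy, the sketch below illustrates one plausible form of attention-guided non-uniform cropping: per-patch attention scores from the Stage-1 Transformer select the top-β fraction of satellite-image patches, and the token budget freed by the discarded patches can be reinvested in sampling the kept regions at higher resolution (the γ factor in Table 4). The function names, the use of class-token attention, and the NumPy formulation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def attention_guided_crop(patch_attention, beta=0.64):
    """Keep the top-beta fraction of satellite-image patches by attention.

    patch_attention: (H_p, W_p) array of per-patch scores, e.g. the
    class-token attention of the last Transformer layer averaged over
    heads (an assumption; the paper may aggregate attention differently).
    Returns a boolean mask over the patch grid; ties at the threshold
    may keep slightly more than beta of the patches.
    """
    scores = patch_attention.ravel()
    n_keep = max(1, int(round(beta * scores.size)))
    threshold = np.sort(scores)[-n_keep]   # score of the weakest kept patch
    return patch_attention >= threshold

# Toy usage: a 16 x 16 patch grid with random attention scores.
rng = np.random.default_rng(0)
attn = rng.random((16, 16))
mask = attention_guided_crop(attn, beta=0.64)
# With beta = 0.64, roughly 64% of the 256 patches survive; the tokens
# freed by the discarded patches can be spent on re-sampling the kept
# regions at gamma-times higher resolution, so the sequence length seen
# by the Transformer stays unchanged.
print(mask.sum(), "of", mask.size, "patches kept")
```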
Figure 7. Detailed structure of the CNN Stem and the Transformer.
Figure 8. Comparison of the results of different cropping methods.
Figure 9. Comparison of different embedding methods.
Figure 10. Schematic of samples from the same site with different rotation angles.
Figure 11. Schematic of intertemporal data preprocessing using ICT-Net feature extraction methods in the BHUniv datasets. (a) Original; (b) Linear feature extraction; (c) ICT-Net feature extraction.
Figure 12. Schematic of intertemporal data preprocessing using ICT-Net feature extraction methods in the BHUniv datasets. (a) Original; (b) Linear feature extraction; (c) ICT-Net feature extraction.
Figure 13. Schematic of intertemporal data preprocessing using ICT-Net feature extraction methods in the CVUSA datasets. (a) Original; (b) Linear feature extraction.
Figure 14. Changes in the attention maps over different training epochs.
Figure 15. Sample 1—Actual positioning visualization.
Figure 16. Sample 2—Actual positioning visualization.
Figure 17. Sample 3—Actual positioning visualization.
Figure 18. Sample 4—Actual positioning visualization.
Figure 19. KL divergence of feature soft assignment and coordinate soft assignment for ground truth and remote sensing images.
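Figure 19 concerns the deep embedding clustering stage, where soft assignments of features (and of coordinates) to cluster centres are compared through a KL-divergence term. As a hedged reference, the sketch below shows the standard Student's-t soft assignment, sharpened target distribution, and KL loss from conventional deep embedding clustering; whether ICT-Net uses exactly this formulation is an assumption based on the caption.

```python
import numpy as np

def soft_assignment(z, centers, alpha=1.0):
    """Student's-t soft assignment q_ij of embeddings to cluster centres,
    as in standard deep embedding clustering (assumed; the paper's exact
    formulation may differ). z: (N, D), centers: (K, D)."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (N, K)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target p_ij that emphasises high-confidence assignments."""
    w = (q ** 2) / q.sum(axis=0, keepdims=True)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), the clustering loss from which a dispersion like the
    one plotted in Figure 19 could be computed."""
    return float((p * np.log((p + eps) / (q + eps))).sum())
```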
Table 1. Statistical characterization of the BHUniv dataset.
Size | Average Field of View (°) | Standard Deviation | Maximum Value | Minimum Value
300 | 360 | 0 | 360 | 360
Table 2. Comparison with previous works in terms of retrieval accuracy in the cross-area setting (%).
Method | R@1 | R@5 | R@10
TransGeo [6] | 94.08 | 98.36 | 99.04
MixVPR [50] | 94.24 | 98.42 | 99.05
GeoDTR [51] | 95.32 | 98.69 | 99.56
Ours | 96.74 | 98.85 | 99.93
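For context, the R@1, R@5, and R@10 scores in Table 2 count a ground-view query as correct when its true satellite reference ranks among the top-k retrieved candidates in the shared embedding space. A minimal sketch of this metric, assuming L2-normalised embeddings and cosine similarity (the paper's evaluation code may differ in detail), is:

```python
import numpy as np

def recall_at_k(query_emb, ref_emb, ks=(1, 5, 10)):
    """Recall@k for cross-view retrieval.

    query_emb: (N, D) ground-view embeddings; ref_emb: (N, D) satellite
    embeddings, with row i of each corresponding to the same location.
    Embeddings are L2-normalised so the dot product is cosine similarity.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = q @ r.T                                   # (N, N) similarity matrix
    true_sim = np.diag(sim)                         # similarity to the true match
    # Rank of the true match: number of references scoring at least as high.
    rank = (sim >= true_sim[:, None]).sum(axis=1)   # rank 1 = true match is top-1
    return {k: float((rank <= k).mean()) for k in ks}
```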
Table 3. Ablation study on polar coordinate transformation.
Method | R@1 on CVUSA | R@1 on CVACT | R@1 on VIGOR | R@1 on BHUniv
SAFA | 86.02 | 54.32 | 75.23 | 65.21
CNN+Polar | 76.41 | 62.15 | 63.66 | 68.34
Transformer | 94.82 | 58.74 | 63.23 | 72.43
SAFA+Polar | 97.91 | 82.16 | 79.36 | 84.90
Ours | 98.21 | 83.55 | 89.22 | 93.56
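The polar coordinate transformation ablated in Table 3 warps the aerial image so that circles around its centre unroll into horizontal rows, which roughly aligns the aerial layout with a 360° ground panorama. The snippet below is a minimal nearest-neighbour version of this standard warp, given as an assumption about the general technique rather than the exact mapping used by the compared methods.

```python
import numpy as np

def polar_transform(aerial, out_h=128, out_w=512):
    """Map a square aerial image to a pseudo-panorama via polar warping.

    Each output column corresponds to an azimuth angle and each output row
    to a distance from the aerial image centre, so the warped image is
    approximately aligned with a 360-degree ground-level panorama.
    """
    S = aerial.shape[0]                          # assume a square S x S image
    cx = cy = (S - 1) / 2.0
    rows = np.arange(out_h)[:, None]             # radius index
    cols = np.arange(out_w)[None, :]             # azimuth index
    theta = 2.0 * np.pi * cols / out_w
    radius = (S / 2.0) * (out_h - rows) / out_h  # top row maps to the image edge
    x = cx + radius * np.sin(theta)
    y = cy - radius * np.cos(theta)
    xi = np.clip(np.round(x).astype(int), 0, S - 1)
    yi = np.clip(np.round(y).astype(int), 0, S - 1)
    return aerial[yi, xi]                        # nearest-neighbour sampling
```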
Table 4. Ablation analysis of attention-directed cropping strategies.
Method | β | γ | CVUSA R@1 (%) | CVUSA Localization Error (m) | BHUniv R@1 (%) | BHUniv Localization Error (m)
Stage-1 | 1.00 | 1.00 | 97.6 | 2.4 | 93.2 | 1.8
Stage-1+ | 1.00 | 1.00 | 97.5 | 2.5 | 93.1 | 1.9
Stage-2 | 0.64 | 1.00 | 97.5 | 2.6 | 93.0 | 2.0
Stage-2+ | 0.64 | 1.56 | 98.7 | 2.1 | 93.8 | 1.7
Table 5. Comparison of localization accuracy for different sample overlap rates (%).
Sample Overlap Rate | Localization Error | Relative Regional Ratio
93% | 5.36 | 30.36%
96% | 3.14 | 90.21%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
