1. Introduction
Multi-view geo-localization has emerged as a pivotal enabling technology for autonomous drone navigation, offering a reliable and robust means of determining precise geographic locations by matching visual data captured from significantly different viewpoints, typically between low-altitude aerial and satellite perspectives. This capability is particularly crucial for enhancing real-time situational awareness and operational autonomy in diverse and often unpredictable environments.
Currently, the majority of unmanned aerial vehicles (UAVs) depend heavily on the Global Positioning System (GPS) for autonomous navigation and self-localization. However, GPS-based methods are inherently constrained by their reliance on uninterrupted signal transmission, making them vulnerable to interference, signal blockage, or intentional jamming. In contrast, multi-view geo-localization provides a robust and GPS-independent alternative by leveraging visual cues from imagery to infer precise geolocations. This approach not only addresses the limitations of GPS-based systems in complex or adversarial settings but also enhances the operational reliability of drones in GPS-denied environments [1]. Furthermore, the applications of multi-view geo-localization extend well beyond aerial navigation. It plays an increasingly pivotal role in diverse domains, including autonomous driving, where accurate scene understanding and location awareness are critical [2]: cross-modal alignment between on-board camera feeds and LiDAR data supports reliable positioning in GPS-denied urban canyons or tunnels. In environmental monitoring, fusing satellite-scale spatial patterns with the fine-grained details captured by drone-view imagery enables precise ecological tracking, such as quantifying changes in vegetation coverage or glacial ablation rates. The technology also demonstrates considerable value in tasks such as target recognition and surveillance, where few-shot learning techniques have been successfully applied to enhance performance under limited data conditions [3]. In disaster response scenarios, pre-disaster satellite maps can be rapidly matched with real-time drone-view imagery to locate areas where people may be trapped. The expanding interest and advancements in multi-view geo-localization underscore its significance as a foundational technology within the broader framework of intelligent perception and navigation systems across diverse operational environments.
The core challenges in multi-view geo-localization lie in the substantial spatial domain discrepancies that exist between images taken from different viewpoints. Ground-level images often capture fine-grained details with rich textures, occlusions, and perspective distortions, while satellite images provide a top-down, globally consistent but semantically sparse representation of the same scene.
These pronounced multi-view variations result in particularly severe appearance and geometric mismatches, which fundamentally hinder the effectiveness of traditional feature-based image matching methods. Substantial differences in scale, orientation, lighting conditions, and object visibility further exacerbate the inherent domain gap, making it extremely difficult for models to establish accurate correspondences across views. As a result, designing inherently robust algorithms that can bridge these view-dependent disparities and extract reliable view-invariant features becomes critically essential for achieving high-performance multi-view geo-localization in complex real-world scenarios.
To address the challenges posed by substantial multi-view discrepancies, existing multi-view geo-localization approaches can be systematically classified into three technical paradigms according to their core design rationales. Specifically, spatial-domain feature learning methods center on optimizing backbone networks or introducing attention mechanisms to strengthen the extraction of local and global features, yet face difficulties in addressing large-scale perspective variations [4,5,6,7]. Cross-domain alignment methods employ style transfer, domain adaptation modules, or metric learning strategies to alleviate domain discrepancies between multi-view images, but frequently overlook the inherent geometric and appearance inconsistencies induced by viewpoint changes [8,9,10]. Frequency-domain feature utilization methods mainly rely on static frequency screening or simple frequency-domain fusion to complement spatial-domain features, with insufficient adaptive mining of discriminative frequency components. All three categories aim to tackle the challenges of matching drone-view and satellite-view imagery, but each exhibits distinct limitations in handling the complex variations inherent to multi-view geo-localization tasks.
For instance, Xia et al. [11] introduce a spatial alignment module within the feature domain, aiming to impose additional fine-grained multi-view constraints. This design significantly enhances the image representation capabilities of the backbone network, contributing to more accurate geo-localization. Yan et al. [12] adopt a deconvolutional network within the reconstruction pipeline, effectively bridging the semantic and structural differences in remote sensing image features across different domains. This enables the mapping of images from various platforms and viewpoints into a shared latent space with improved discriminative power, thus boosting robustness under large viewpoint changes.
Similarly, Ge et al. [13] propose a method that augments global features by incorporating part-level and image block-level representations. These are then fused back into the global feature representation to extract contextual information, thereby improving robustness and spatial awareness. Wang et al. [14] design MuSe-Net, a multi-environment adaptive network that dynamically mitigates domain shifts caused by varying environmental conditions. It employs a dual-branch structure comprising a style extraction module and an adaptive feature extraction module. The former captures style information specific to different environments, while the latter uses adaptive modulation to minimize environment-related style gaps, ensuring better generalization.
Lin et al. [15] propose Safe-Net, a unified end-to-end network designed to extract highly robust scale-invariant features. It incorporates two dedicated modules: a global representation-guided feature alignment module and a saliency-guided partitioning module. The former primarily employs global feature-guided affine transformations for adaptive alignment, whereas the latter utilizes saliency distributions to guide carefully calibrated adaptive partitioning, thereby significantly improving model sensitivity to scale variations.
Further, Sun et al. [16] present TirSA, a comprehensive three-stage framework consisting of preprocessing, feature embedding, and post-processing. In the preprocessing stage, a self-supervised feature enhancement method (SFEM) is used to generate building perception masks, encouraging the model to focus on structurally meaningful regions without external supervision. The embedding stage integrates an adaptive feature integration module (AFIM) and employs a refined cross-domain triplet loss to mitigate inter-view discrepancies. Finally, a re-ranking strategy is introduced in the post-processing phase to optimize the retrieval results and enhance final matching accuracy.
While these methods demonstrate impressive performance improvements, they largely focus on abstract feature representations to measure similarity, overlooking the rich frequency information inherently present in multi-view images. This omission can critically limit their ability to fully exploit view-invariant cues embedded in the frequency domain. In this context, Sun et al. [17] introduce a pioneering approach by first transforming high-level feature representations into the frequency domain using the Discrete Cosine Transform (DCT). This enables more effective feature screening in frequency space and facilitates substantially improved classification outcomes.
However, the aforementioned methods utilize frequency information in a relatively limited manner [18], without delving into the underlying structure and significance of specific frequency components within multi-source images. In practice, the frequency distribution of such imagery is far from uniform, and critical discriminative features are often concentrated in particular frequency bands. Ignoring this distributional characteristic leads to suboptimal utilization of valuable information embedded in the frequency spectrum.
Based on this insight, frequency information is posited to serve as a powerful complementary cue in multi-view geo-localization. Specifically, by integrating frequency-domain analysis into the matching process, it becomes possible to suppress the detrimental impact of spatial-domain inconsistencies while reinforcing invariant structural characteristics across views. This perspective underpins the design of our proposed method, which selectively screens and integrates frequency-based features to enhance multi-view matching robustness and accuracy.
This paper proposes a novel frequency-aware framework that adaptively evaluates and assigns importance to different frequency bands in multi-source images. This framework, named Positive-Incentive Information Screening (PIIS), is designed to capture and enhance the most salient and informative frequency components while reducing the influence of less relevant ones. By dynamically adjusting the contribution of each frequency band according to its discriminative value, PIIS allows the network to concentrate more effectively on features that distinguish between different views. This targeted enhancement supports more accurate and robust similarity measurement across heterogeneous perspectives, such as those captured from drone and satellite imagery, and leads to significant improvements in multi-view image matching and geo-localization performance, especially in challenging or GPS-denied environments with complex interference patterns.
The main contributions of this paper are summarized as follows:
(1) Multi-View Consistent Feature Mining in Frequency-Domain. Based on traditional abstract feature learning, the features are further projected into the frequency domain, which effectively addresses the spatial inconsistencies commonly present in multi-source images. This frequency-domain mapping serves to mitigate the effects of significant perspective changes and environmental variability by filtering out unstable or redundant components through spectral analysis. As a result, our approach strengthens the robustness and stability of multi-view image matching under challenging real-world conditions.
(2) Information-Theoretic Modeling of Inter-Branch Interaction. To enhance the comprehension of dual-branch siamese networks, the influence of shared parameters and inter-branch information exchange is examined from an information-theoretic perspective. By modeling the interaction as an information flow process, the aim is to maximize relevant mutual information while reducing redundancy. This formulation provides a principled framework for improving feature extraction and matching performance, especially in complex applications such as multi-view geo-localization. Additionally, the conventional notion of noise is revisited, proposing that in siamese architectures, what is often considered noise may in fact arise from a misalignment between learned representations and task-relevant signals. This perspective offers new insights for optimizing learning dynamics and enhancing model robustness.
(3) Positive-Incentive Information Screening in Frequency-Domain. The positive-incentive information screening strategy is proposed. This approach incorporates a core-band screening mechanism designed to identify and retain the frequency components with the highest task-relevant entropy. By approximating the maximum expected information entropy, the network is guided to emphasize structurally meaningful and discriminative spectral cues. Such selective preservation enables more robust and consistent feature alignment across views, ultimately improving performance in multi-view matching under significant appearance variations, geometric changes, and domain shifts. Notably, the proposed PIIS framework exhibits considerable architectural versatility, as it is readily applicable to various backbone networks, including both CNNs and Transformers, and maintains consistent performance improvements across diverse architectural paradigms.
The remainder of this paper is structured as follows:
Section 2 reviews related work in the field of multi-view geo-localization and frequency-domain analysis.
Section 3 presents the proposed PIIS framework in detail, including its underlying architecture and frequency-aware weighting strategy.
Section 4 reports the experimental results and comparative studies.
Section 5 discusses and analyzes the experimental results.
Section 6 concludes the paper and outlines potential directions for future work.
3. Materials and Methods
For the siamese network, the objective of each branch can be conceptualized as a distinct process encompassing feature capture and subsequent cognition. Specifically, features are extracted from the input data by each branch and processed into representations that enable effective comparison between sample pairs. The ultimate goal is to determine whether a given pair constitutes a positive or negative match, which equates to assessing if the outputs from the two branches correspond to a positive sample pair. Within this framework, the weights shared between the two branches are not merely a source of redundancy, but rather an essential mechanism that facilitates the exchange of learned knowledge.
3.1. Positive-Incentive Information Theory
When viewed through the lens of π-noise, the shared-weight strategy in siamese networks redefines the traditional notion of noise, transforming it into a form of positive excitation noise. Unlike conventional noise, which is regarded as detrimental and disruptive to learning, the shared parameters between network branches serve as a conduit for mutual knowledge exchange. This shared knowledge acts as a complementary information source, rather than a disturbance, and enhances the learning process in both branches.
Through this mechanism, the network benefits from a mutual enhancement effect, as the shared weights allow both branches to collaborate in extracting more comprehensive and discriminative features. In this context, what might superficially be interpreted as redundancy or overlap actually serves as a form of constructive interference, reinforcing essential representations. Therefore, this additional information can be regarded as positive noise, thereby enhancing the generalization capability of the model and consequently improving overall performance.
This perspective can be theoretically supported by the mutual information between the task $T$ and the noise $\varepsilon$:

$$I(T; \varepsilon) = H(T) - H(T \mid \varepsilon),$$

where $H(T)$ denotes the entropy of the task, and $H(T \mid \varepsilon)$ represents the conditional entropy of the task given the noise.

In the strictest sense, unexpected and harmful noise satisfies the condition:

$$I(T; \varepsilon) = 0,$$

which implies that such noise offers no useful information about the task and only adds uncertainty. In contrast, forward excitation noise, as introduced by the shared-weight paradigm, satisfies the following condition:

$$I(T; \varepsilon) > 0,$$

or equivalently:

$$H(T \mid \varepsilon) < H(T).$$

This indicates that the presence of this noise actually reduces the uncertainty of the task, thereby contributing positively to learning.
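As a concrete illustration (not taken from the paper), the two regimes can be checked numerically on a toy discrete task variable: noise independent of the task yields $I(T; \varepsilon) = 0$, while noise correlated with the task yields $I(T; \varepsilon) > 0$ and hence $H(T \mid \varepsilon) < H(T)$. The joint probability tables below are illustrative stand-ins.

```python
import numpy as np

def mutual_information(joint):
    """I(T; eps) in bits from a joint probability table p(t, eps)."""
    pt = joint.sum(axis=1, keepdims=True)   # marginal p(t)
    pe = joint.sum(axis=0, keepdims=True)   # marginal p(eps)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pt @ pe)[mask])).sum())

# Harmful (uninformative) noise: T and eps independent -> I(T; eps) = 0.
independent = np.outer([0.5, 0.5], [0.5, 0.5])

# Positive-incentive noise: eps carries information about T -> I(T; eps) > 0.
correlated = np.array([[0.4, 0.1],
                       [0.1, 0.4]])

assert abs(mutual_information(independent)) < 1e-12
assert mutual_information(correlated) > 0.0   # here about 0.278 bits
```

Both marginals are identical in the two tables, so the gain in the second case comes entirely from the dependence between task and noise, matching the conditions above.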
In this manner, the concept of π-noise can be reinterpreted as a beneficial inductive bias within siamese architectures. The “noise” induced by parameter sharing does not impede the learning process; instead, it accelerates the formation of robust and generalizable representations. Standard mutual information-based methods primarily focus on quantifying statistical dependence between paired images, leaving the intrinsic mechanism governing inter-branch interaction in siamese networks unaddressed. π-noise theory, in contrast, reinterprets shared weights in siamese networks as “positive-incentive noise” that mitigates task uncertainty rather than introducing disruptive interference.
This reinterpretation transforms the traditional weight-sharing mechanism from a simple parameter reuse strategy into a constructive information exchange channel, enabling mutual enhancement between the two branches of the siamese architecture. This improvement consequently elevates model efficacy in tasks including feature matching, multi-view geo-localization, and classification. Such enhanced performance is particularly evident in scenarios demanding high robustness to domain variations and input transformations under real-world constraints.
3.2. Frequency Positive-Incentive Information Screening
Building upon the preceding theoretical foundation, this paper poses a bold hypothesis: can frequency-domain spatial features act as positive excitation elements to enhance the optimization of siamese networks? Empirical results [17] affirm this hypothesis, demonstrating that appropriately integrated frequency-domain information can significantly boost the performance of multi-view matching tasks. However, the integration of such information presents several technical challenges. Notably, the lack of inherent constraints in frequency-domain integration, along with the possible impact of crowding operations, can result in the distortion or loss of discriminative frequency cues.
To overcome these limitations, we introduce a dedicated frequency filtering model, as depicted in Figure 1, which is designed to isolate and preserve frequency-domain features that provide positive excitation. This model enables the network to selectively amplify informative frequency components while suppressing irrelevant or noisy ones, thereby enhancing the optimization of the siamese network without compromising its learning dynamics. The core idea is to transform raw abstract features into the frequency domain, analyze them across distinct bands, and adaptively emphasize those components that contribute most meaningfully to discriminative representation.
(1) Frequency-Domain Transformation
The abstract spatial feature map $f(x, y)$ is first transformed into the frequency domain using the Discrete Fourier Transform (DFT):

$$F(u, v) = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} f(x, y)\, e^{-j 2\pi \left( \frac{u x}{H} + \frac{v y}{W} \right)},$$

where $F(u, v)$ denotes the resulting complex-valued frequency map, $H$ and $W$ represent the spatial dimensions of the input feature, and $(x, y)$ are the spatial coordinates. To facilitate centralized frequency analysis, a frequency shift is applied such that low-frequency components are relocated to the center of the spectrum. The normalized frequency indices $u$ and $v$ span from $-H/2$ to $H/2$ and from $-W/2$ to $W/2$, respectively. Frequencies beyond the Nyquist limit cannot be accurately captured, which imposes an upper limit on the usable frequency bandwidth:

$$|u| \le \frac{H}{2}, \qquad |v| \le \frac{W}{2}.$$
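The transform-and-shift step above can be sketched in a few lines of numpy on a dummy single-channel feature map; the array shapes and values are illustrative, not the paper's actual features.

```python
import numpy as np

# Hypothetical single-channel feature map (H x W); values are arbitrary.
H, W = 8, 8
rng = np.random.default_rng(0)
f = rng.standard_normal((H, W))

# 2-D DFT, then shift so the zero-frequency (DC) term sits at the center.
F = np.fft.fftshift(np.fft.fft2(f))

# After the shift, the frequency indices run over [-H/2, H/2) and [-W/2, W/2),
# i.e. up to the Nyquist limit in each dimension.
u = np.fft.fftshift(np.fft.fftfreq(H)) * H   # -4, -3, ..., 3 for H = 8
v = np.fft.fftshift(np.fft.fftfreq(W)) * W

assert F.shape == (H, W)
assert np.iscomplexobj(F)
# The centered DC coefficient equals the sum of all spatial values.
assert np.isclose(F[H // 2, W // 2], f.sum())
```

The final assertion is a quick sanity check that the shift really moved the DC component to the center of the spectrum.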
In typical CNN architectures, convolutional operations often act as high-pass filters, favoring the preservation of high-frequency content. However, this can bias the learned features towards finer details while sacrificing global context. This trade-off results in the use of smaller dilation rates to preserve high-frequency information, but at the expense of reduced receptive fields. To address this, our FreqSelect module is designed to balance high- and low-frequency components, thus enabling broader receptive fields while retaining essential fine-grained details.
(2) Frequency Band Decomposition and Reweighting
In the Positive-Incentive Information Screening (PIIS) framework, the frequency spectrum is decomposed into multiple non-overlapping bands using binary masks in the Fourier domain:

$$f_n(x, y) = \mathcal{F}^{-1}\!\left[ M_n(u, v) \odot F(u, v) \right],$$

where $\mathcal{F}^{-1}$ denotes the inverse Fast Fourier Transform (IFFT), and $M_n(u, v)$ is a binary mask defined as:

$$M_n(u, v) = \begin{cases} 1, & r_n \le \sqrt{u^2 + v^2} < r_{n+1} \\ 0, & \text{otherwise,} \end{cases}$$

where $r_n$ and $r_{n+1}$ are threshold values derived from a predefined set of octave-based frequency ranges. This frequency decomposition enables precise localization of salient information across different scales, from low-frequency global context to high-frequency texture patterns.
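A minimal sketch of the band decomposition, assuming radial masks with illustrative octave-style thresholds (the paper's exact threshold set is not specified here):

```python
import numpy as np

H, W = 8, 8
rng = np.random.default_rng(1)
f = rng.standard_normal((H, W))
F = np.fft.fftshift(np.fft.fft2(f))

# Radial distance of each (shifted) frequency coordinate from the center.
u = np.arange(H) - H // 2
v = np.arange(W) - W // 2
radius = np.sqrt(u[:, None] ** 2 + v[None, :] ** 2)

# Illustrative octave-style thresholds (low -> high frequency).
r_max = radius.max()
edges = [0.0, r_max / 4, r_max / 2, r_max + 1e-9]

bands = []
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (radius >= lo) & (radius < hi)           # binary band mask M_n
    f_n = np.fft.ifft2(np.fft.ifftshift(mask * F))  # back to spatial domain
    bands.append(f_n.real)

# Non-overlapping masks partition the spectrum, so the bands sum back to f.
assert np.allclose(sum(bands), f, atol=1e-10)
```

The closing assertion verifies the "non-overlapping" property: because every frequency coordinate belongs to exactly one band, the band-limited reconstructions recompose the original feature map exactly.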
Subsequently, PIIS dynamically reweights these frequency-specific features based on their spatial relevance:

$$\hat{f}(x, y) = \sum_{n=1}^{N} A_n(x, y) \cdot f_n(x, y),$$

where $\hat{f}(x, y)$ is the final reweighted frequency-balanced feature at location $(x, y)$, $f_n(x, y)$ denotes the reconstructed spatial feature map from the $n$-th frequency band, and $A_n(x, y)$ is the learned attention map indicating the relative importance of that band at each spatial location.
This adaptive fusion mechanism enables the network to selectively enhance frequency components that provide positive excitation and are most beneficial for discriminative learning, while noisy or redundant signals are effectively suppressed. In distinction from conventional frequency-domain approaches that treat frequency bands as discrete matching units, the proposed Frequency-Based Positive-Incentive Information Screening mechanism is grounded in π-noise theory and prioritizes the screening of task-relevant frequency components via entropy maximization. This approach ensures that the retained frequency bands not only exhibit high mutual information with the target task but also align with the optimization logic of siamese networks. This optimization logic centers on enhancing view-invariant features and suppressing domain discrepancies between multi-view images, thereby reinforcing feature alignment across heterogeneous perspectives, including drone-view and satellite imagery.
(3) Design of the PIIS model
Within the π-noise theoretical framework, the synergistic effects arising from parameter sharing constitute a form of positive-incentive noise that satisfies the mutual information condition $I(T; \varepsilon) > 0$, indicating its capacity to effectively reduce task uncertainty rather than introduce interference.
Based on this theoretical foundation, the Positive-Incentive Information Screening (PIIS) mechanism is proposed, in which the frequency-domain information configuration is optimized through three core operations. First, spatial features are mapped to the frequency domain using the Discrete Fourier Transform, with frequency shift operations enabling centralized spectral analysis. The spectrum is then decomposed into semantically heterogeneous, non-overlapping frequency bands using binary masks based on predefined octave-based thresholds. Finally, a learnable attention mechanism dynamically reweights the frequency-specific features to approximate the maximum expected information entropy.
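The three operations compose into a single filtering function, sketched below for one feature map. The band edges and the attention weights are random or evenly spaced stand-ins for the learned components, so this shows only the data flow, not the trained behavior.

```python
import numpy as np

def piis_filter(f, n_bands=3, rng=None):
    """Sketch of the three PIIS operations on one (H x W) feature map:
    (1) DFT + shift, (2) radial band split, (3) per-location reweighting.
    Attention weights are random stand-ins for learned ones."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W = f.shape

    # (1) frequency-domain transformation with centered spectrum
    F = np.fft.fftshift(np.fft.fft2(f))

    # (2) non-overlapping radial bands (evenly spaced edges for illustration)
    u = np.arange(H) - H // 2
    v = np.arange(W) - W // 2
    radius = np.sqrt(u[:, None] ** 2 + v[None, :] ** 2)
    edges = np.linspace(0.0, radius.max() + 1e-9, n_bands + 1)
    bands = [
        np.fft.ifft2(np.fft.ifftshift(((radius >= lo) & (radius < hi)) * F)).real
        for lo, hi in zip(edges[:-1], edges[1:])
    ]

    # (3) spatial reweighting; softmax keeps weights positive, summing to 1
    logits = rng.standard_normal((n_bands, H, W))
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    A = e / e.sum(axis=0, keepdims=True)
    return sum(A[n] * bands[n] for n in range(n_bands))

out = piis_filter(np.random.default_rng(1).standard_normal((8, 8)))
assert out.shape == (8, 8)
```

In the full models the output of such a filter is fused with the original abstract features rather than replacing them, as described for PIIS-C and PIIS-N below.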
For multi-view geo-localization tasks utilizing CNN backbone networks, the PIIS-C model is constructed, as illustrated in Figure 2a. This architecture enhances feature representation through frequency-domain analysis and further integrates the frequency-enhanced features with the original abstract-domain features, thereby forming a more discriminative comprehensive metric representation. The similarity estimation layer subsequently computes precise matching scores for image pairs.
Recognizing the superior capability of Transformer architectures in capturing global contextual information, the frequency-domain positive-incentive screening mechanism is extended to these architectures [36,37]. As demonstrated in Figure 2b, this extension leads to the development of the PIIS-N model. The use of frequency-domain spatial features as positive-incentive elements serves to enhance the optimization efficacy of the Transformer in multi-view matching tasks. Although CNNs and Transformers employ different spatial feature extraction mechanisms, the feature maps they generate can both be treated as two-dimensional signals governed by the same frequency-domain transformation laws.
Frequency-domain decomposition effectively decouples appearance variations from structural information: high-frequency components predominantly encode detailed texture features, whereas low-frequency components encapsulate global structural information. The adaptive screening of frequency components that provide positive task incentives is theoretically grounded: it aligns not only with the feature representation optimization principles of information bottleneck theory but also embodies the core “noise as information” tenet of π-noise theory. Operationally, the reconstruction error introduced by this screening process functions not as detrimental interference but as a constructive incentive, thereby enhancing model generalization capability.
In the PIIS-N implementation, the mechanism functions as a front-end processing unit for the Transformer encoder, first converting feature maps to the frequency domain via the Fast Fourier Transform, then employing a learnable weight matrix to enhance viewpoint-insensitive frequency components while suppressing those susceptible to viewpoint variations. This process essentially instantiates the mutual information gain mechanism of π-noise in the frequency domain, ensuring that the screened frequency noise satisfies the condition $I(T; \varepsilon) > 0$, thereby improving model robustness through reduced task conditional entropy. The enhanced frequency-domain features are ultimately fused with the original features to form a comprehensive metric representation with complementary information characteristics, with the similarity estimation layer computing final matching scores for image pairs.
The PIIS module in PIIS-C is inserted after the last residual block. This placement balances local texture details and preliminary semantic information, effectively suppressing perspective-induced noise in shallow features, preserving discriminative semantic cues, and avoiding the loss of fine-grained information caused by late-stage insertion or the unstable processing of early raw features. In contrast, the PIIS module in PIIS-N is placed at the front end of the encoder, prior to feature downsampling: frequency-domain transformation is first applied to screen view-invariant frequency components, and the enhanced features are then fed into the Transformer blocks. This design reduces the computational burden of global attention on noisy components, keeps the model focused on structure-invariant frequency patterns, and maximizes the advantage in modeling long-range spectral correlations.
4. Results
To verify the effectiveness of the proposed method, this paper conducts comprehensive experiments on public datasets, evaluating both matching accuracy and computational efficiency across multiple dimensions.
4.1. Datasets & Experiments Settings
First, the proposed algorithm is thoroughly validated using the multi-view matching and positioning dataset University-1652. This dataset, created within a simulation platform, offers a rich, diverse, and large-scale collection of training samples, thereby providing abundant metric learning signals for the model. University-1652 consists of a series of multi-view scenes, each paired with corresponding satellite imagery, allowing for a comprehensive evaluation across various perspectives. The drone images are captured in a circular ascending trajectory, enabling the acquisition of visual information from different angles and altitudes, particularly focusing on central buildings. This simulated setup provides controlled yet varied data, making it ideal for evaluating model performance across a range of viewing conditions.
In addition to the simulation-based validation, complementary experiments are conducted using the real-world dataset SUES-200, collected from actual drone flights in highly dynamic environments. The drone images reflect the inherent complexities of real-scene conditions, including prevalent environmental interferences such as weather variations, lighting changes, and optical distortions. This data introduces substantial noise and unpredictability, providing a particularly rigorous test for robustness and adaptability. A thorough assessment of model performance is thereby ensured through evaluation on both synthetic and real-world datasets, enabling validation across environments ranging from controlled settings to genuinely complex, dynamic scenarios.
The experiments for the PIIS-C model were conducted on a high-performance computing platform with an NVIDIA RTX 3090 GPU. An initial learning rate of 0.001 was adopted to balance convergence speed and model accuracy, while a dropout rate of 0.5 helped prevent overfitting. The model used a batch size of 4 for efficient GPU memory utilization and stable gradient estimation, with a stride of 2 to enable effective downsampling for multi-scale feature capture.
For the PIIS-N model, training was performed in a distributed environment using multiple RTX 3090 GPUs with NCCL backend. To suit the larger transformer-based architecture, the batch size was set to 8 and the learning rate to 0.0001. The model employed a ConvNeXt-b backbone processing 384 × 384 input images normalized with ImageNet statistics. Training lasted 10 epochs with gradient clipping and label smoothing for regularization. The data pipeline used 8 parallel workers with horizontal flipping and custom sampling, with evaluation performed each epoch. These hyperparameters are carefully selected to ensure robust training and reliable results across the experimental setup.
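For reference, the stated settings can be consolidated into a single configuration sketch. The field names are illustrative (not taken from the authors' released code), and the ImageNet normalization constants are the standard published values; settings the paper mentions only qualitatively (gradient clipping, label smoothing) are recorded as flags rather than guessed values.

```python
# Hypothetical consolidation of the training settings described above.
PIIS_C_CONFIG = {
    "gpu": "NVIDIA RTX 3090",
    "learning_rate": 1e-3,
    "dropout": 0.5,
    "batch_size": 4,
    "stride": 2,
}

PIIS_N_CONFIG = {
    "backbone": "ConvNeXt-B",
    "distributed_backend": "NCCL",
    "input_size": (384, 384),
    "normalize_mean": (0.485, 0.456, 0.406),  # standard ImageNet statistics
    "normalize_std": (0.229, 0.224, 0.225),
    "batch_size": 8,
    "learning_rate": 1e-4,
    "epochs": 10,
    "gradient_clipping": True,   # used; value not stated in the text
    "label_smoothing": True,     # used; value not stated in the text
    "num_workers": 8,
    "augmentations": ["horizontal_flip"],
}

assert PIIS_N_CONFIG["batch_size"] == 2 * PIIS_C_CONFIG["batch_size"]
```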
4.2. Precision Experiment on University-1652
To evaluate the performance of the proposed method against mainstream approaches, experiments have been conducted on two benchmark datasets, with the input image resolution of the proposed method and all comparative methods uniformly set to 384 × 384. The results on the University-1652 dataset are presented in Table 1.
As demonstrated in Table 1, the rapid advancements in the field of visual localization have led to significant improvements in accuracy through the introduction of novel methods. For example, the classical siamese network model based on ResNet, while pioneering, exhibits notable limitations in handling multi-view variations. As shown in the first three rows of the experimental results, this model struggles to effectively capture complex spatial relationships across varying perspectives, primarily due to its limited ability to model intricate inter-view transformations.
The introduction of attention mechanisms has considerably enhanced the ability of convolutional neural networks (CNNs) to selectively focus on salient regions within images. By assigning attention weights to features, these mechanisms allow the model to emphasize critical areas, thus improving its capability to handle multi-view discrepancies. When attention mechanisms, such as SENet and CBAM, are integrated into the siamese network framework, e.g., in Triplet + SENet and Triplet + CBAM configurations, they facilitate the selective activation of important regions within each branch of the network, leading to stronger and more consistent correspondences across views. This selective focus has been instrumental in improving the accuracy of visual geo-localization.
At its core, the integration of attention mechanisms constitutes a form of information enhancement within the backbone network, which augments the capacity to discern relevant features. However, in cases involving more complex multi-view transformations, more comprehensive strategies for feature mining can lead to even greater performance gains. For instance, methods that leverage densely connected multi-level features, such as Triplet Loss combined with DenseNet, allow for the extraction of rich, multi-scale interaction information that provides a more nuanced understanding of the scene. This approach, which goes beyond the fixed paradigm of attention mechanisms, offers further accuracy improvements. Yet, it is important to recognize that dense connectivity comes at the cost of increased network complexity and computational demands. Given the inherent limitations in edge computing resources, especially in visual geo-localization platforms that rely on lightweight devices, this method may face challenges in real-world deployment and practical implementation. Nevertheless, it provides valuable insights for future network designs, reinforcing the effectiveness of multi-scale information integration in tackling complex geo-localization tasks.
As backbone architectures continue to evolve, as summarized in
Table 2, increasingly sophisticated feature extraction frameworks, such as Transformer and ConvNeXt, are being introduced into visual geo-localization tasks. These advanced architectures contribute to further accuracy gains by improving feature extraction capabilities. However, it is crucial to align the choice of backbone with the specific requirements of the visual geo-localization task at hand. In particular, overcoming the challenge of handling multi-view variations remains a critical factor in improving performance. In response to this, the proposed method, PIIS, presents a novel approach by mapping traditional metric features into the frequency domain. In this domain, invariant multi-view information can be selectively filtered, allowing for the isolation and retention of stable representations across varying views. These stable representations are then reintegrated into the original metric feature space through adaptive fusion mechanisms, which significantly enhances the reliability and discriminability of the geo-localization model under multi-view conditions. This method thus represents a significant advancement in the effective handling of perspective changes in visual geo-localization.
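The PIIS pipeline described above can be sketched roughly as follows: map a metric feature map into the frequency domain, screen out view-dependent bands while retaining (assumed) invariant ones, transform back, and fuse adaptively with the original features. The low-pass mask, `keep_ratio`, and the scalar fusion weight `alpha` are simplifying assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def frequency_screen_and_fuse(feat, keep_ratio=0.25, alpha=0.5):
    """Illustrative sketch of frequency-domain screening with adaptive fusion.

    feat:       (C, H, W) metric feature map
    keep_ratio: fraction of the lowest spatial frequencies treated as
                view-invariant and retained (an assumption for illustration)
    alpha:      fusion weight between original and screened features
    """
    C, H, W = feat.shape
    # Map features into the frequency domain (centered 2-D FFT per channel)
    F = np.fft.fftshift(np.fft.fft2(feat, axes=(1, 2)), axes=(1, 2))
    # Build a centered low-frequency mask: keep a small radius around DC
    yy, xx = np.mgrid[0:H, 0:W]
    radius = keep_ratio * min(H, W) / 2.0
    mask = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2) <= radius ** 2
    # Screen: retain only the (assumed) stable, view-invariant bands
    F_screened = F * mask[None, :, :]
    stable = np.fft.ifft2(np.fft.ifftshift(F_screened, axes=(1, 2)),
                          axes=(1, 2)).real
    # Reintegrate the stable representation into the metric feature space
    return alpha * feat + (1.0 - alpha) * stable

feat = np.random.default_rng(1).standard_normal((4, 32, 32))
fused = frequency_screen_and_fuse(feat)
print(fused.shape)  # (4, 32, 32)
```

In the actual method the band selection and fusion weights would be learned rather than fixed; the sketch only shows where screening sits in the pipeline.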
Through systematic evaluation on the University-1652 dataset, this study validates the effectiveness of the frequency-domain feature screening mechanism in multi-view geo-localization tasks. As shown in
Table 1, the proposed PIIS-C and PIIS-N methods demonstrate significant performance advantages across various evaluation metrics:
The quantitative evaluation demonstrates that the PIIS-C model achieves 83.61% Recall@1, 94.14% Recall@5, and 85.99% mAP, surpassing the previous state-of-the-art method DUMM by substantial margins of 6.28% in Recall@1 and 4.54% in mAP. Notably, its Recall@10 and Recall@1% scores reach 96.29% and 96.53%, respectively, reflecting robust feature discrimination.
The PIIS-N model further advances these benchmarks, attaining 94.56% Recall@1, 98.44% Recall@5, and 95.44% mAP, which represents a 2.08% improvement in Recall@1 over the ConvNext baseline while maintaining superior performance across higher-recall metrics.
The substantial advantages of the frequency feature screening mechanism in complex multi-view localization are confirmed by these quantitative results. Furthermore, consistent performance improvements across multiple recall thresholds demonstrate high effectiveness in handling viewpoint variations and modality differences, which pose significant challenges in fine-grained retrieval and real-world operational scenarios.
4.3. Precision Experiment on Real-World Dataset
Although the University-1652 dataset provides a large-scale and diverse training resource, it is generated from a simulation platform. Consequently, certain discrepancies exist between the simulated data and the characteristics of real-world environments, particularly in terms of background noise, lighting variations, and scene complexity. To address this limitation and further evaluate the practical applicability of the proposed frequency screening method, additional experiments are conducted on the real-world dataset SUES-200. This dataset comprises drone images collected from actual outdoor scenes, thereby introducing real environmental interference factors such as occlusion, lighting inconsistency, and texture complexity. The inclusion of SUES-200 allows for a more comprehensive validation of the robustness and generalization capability of the proposed method under realistic conditions. The experimental results on SUES-200 are presented in
Table 3, demonstrating the method’s effectiveness in complex, real-world scenarios with significant domain shifts.
Through systematic verification on the SUES-200 real-scene dataset, this paper further examines the effectiveness of the frequency feature screening mechanism in complex geo-localization tasks. As shown in
Table 3, our proposed PIIS-C method shows significant advantages at multiple distance thresholds ranging from 150 m to 300 m. In the most challenging 150 m fine-grained localization task, PIIS-C achieves 65.47% Recall@1 and 69.83% AP, outperforming LPN by 3.89% and 2.60%, respectively. When the localization distance is extended to 300 m, the proposed method still maintains 85.90% Recall@1 and 88.12% AP, exceeding the traditional spatial-domain method SA_DOM by 12.85% and 11.73%. These quantitative results confirm the core value of frequency feature screening in real interference environments.
The simulation-to-reality gap revealed by the SUES-200 dataset manifests in three dimensions. First, dynamic environmental elements (such as transient occlusion and abrupt illumination changes) introduce unstructured disturbances into spatial features; for example, the AP of the SA_DOM method at 150 m in
Table 3 decreases by 11.0% relative to the simulation dataset. Second, multi-scale noise in real scenes (sensor noise, weather degradation, etc.) exhibits cross-band coupling in the frequency domain, and traditional CNN backbones show markedly limited robustness to such composite interference. Third, the geometric complexity of urban environments induces multi-view feature mismatch, as reflected by the AP of the LPN method at 250 m, which is 8.2% lower than on the simulated data.
The experiments further reveal that the suppression of real-world interference by the frequency filtering mechanism is range-sensitive. The resulting multi-scale cooperative mechanism overcomes the accuracy imbalance that traditional methods exhibit across localization ranges, and offers a new theoretical perspective and technical path toward pervasive geo-localization systems with enhanced environmental adaptability.
Based on systematic experimental evaluation using the SUES-200 dataset, this research establishes the theoretical superiority of the PIIS-N framework in frequency-domain feature representation. The empirical evidence presented in
Table 3 demonstrates that in the most demanding 150 m fine-grained localization scenario, PIIS-N attains 94.93% Recall@1 and 95.98% AP, not only exhibiting substantial gains over conventional spatial-domain approaches but also, more significantly, revealing the theoretical promise of frequency-domain screening through its narrow performance gap with the state-of-the-art MEAN method. From an information-theoretic standpoint, these findings verify that the frequency screening strategy, grounded in mutual information maximization, successfully isolates fundamental frequency patterns that are intrinsically invariant to geometric transformations.
When the localization range is extended to 300 m, PIIS-N maintains competitive performance, providing further validation for the scale-invariant properties inherent in frequency-domain representations. Particularly compelling is the outstanding achievement of the framework at intermediate ranges of 200 and 250 m, where it surpasses all benchmark methods. This result substantiates that the architecture achieves optimal multi-scale feature integration through adaptive frequency selection. This finding aligns with frequency-domain decomposition theory, which posits that geo-localization tasks across varying distances rely on distinct characteristic scales; the frequency selection mechanism in PIIS-N successfully maintains a dynamic equilibrium among these hierarchical features.
Through rigorous analysis grounded in signal processing principles, the exceptional performance of PIIS-N can be attributed to its innovative frequency-domain screening methodology. Operating within the π-noise (positive-incentive noise) theoretical framework, the method reconceptualizes conventional frequency redundancy as constructive excitation signals through mutual information optimization, thereby achieving robust management of multi-scale interference in complex environments. This theoretical advancement not only elucidates the mechanism behind the remarkable generalization capability of frequency-domain features in cross-distance applications but also establishes a new foundation for developing next-generation geo-localization systems with enhanced domain adaptation capacities.
Quantitative results provide theoretical validation of the generalization mechanism in the Positive-Incentive Information Screening (PIIS) framework at the frequency-domain level. While maintaining the representational capacity of the Transformer architecture, the frequency screening strategy based on mutual information maximization enables effective cross-domain transfer of discriminative frequency features. Experimental evidence confirms that adaptively selected core frequency bands maintain stable discriminability in unseen environments, a phenomenon rooted in the inherent invariance of frequency components to geometric transformations. This discovery elucidates the mathematical principle underlying frequency feature transfer. Within the π-noise theoretical framework, mutual information optimization enables the successful extraction of intrinsic frequency patterns, which exhibit robustness against environmental disturbances.
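Mutual-information-based band selection can be sketched with a simple histogram estimator: score each frequency band's energy by its mutual information with the location label and keep the top-scoring bands. The toy data, band count, and estimator below are assumptions for illustration, not the actual PIIS screening procedure.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of mutual information I(X; Y) in nats (sketch)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0  # avoid log(0); zero cells contribute nothing to the sum
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def select_bands(band_energy, labels, k=2):
    """Rank frequency bands by MI with the (toy) location label, keep top-k."""
    scores = [mutual_information(band_energy[:, b], labels)
              for b in range(band_energy.shape[1])]
    return np.argsort(scores)[::-1][:k]

# Toy data: 200 samples, 5 bands; band 0 is informative, the rest are noise
rng = np.random.default_rng(2)
labels = rng.integers(0, 2, size=200).astype(float)
band_energy = rng.standard_normal((200, 5))
band_energy[:, 0] += 3.0 * labels  # band 0 correlates with the label
top = select_bands(band_energy, labels, k=2)
print(top[0])  # band 0 should rank first
```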
The established mechanism represents a new theoretical paradigm for developing domain-adaptive geo-localization systems, where knowledge transfer from simulated to real-world scenarios is achieved through frequency-domain positive-incentive screening. This approach not only provides a principled solution to the domain adaptation challenge but also opens new pathways for constructing robust geo-localization systems capable of maintaining consistent and reliable performance across diverse operational environments.