1. Introduction
Multi-view geo-localization has emerged as a pivotal enabling technology for autonomous drone navigation, offering a reliable and robust means of determining precise geographic locations by matching visual data captured from significantly different viewpoints, typically between low-altitude aerial and satellite perspectives. This capability is particularly crucial for enhancing real-time situational awareness and operational autonomy in diverse and often unpredictable environments.
Currently, the majority of unmanned aerial vehicles (UAVs) depend heavily on the Global Positioning System (GPS) for autonomous navigation and self-localization. However, GPS-based methods are inherently constrained by their reliance on uninterrupted signal transmission, making them vulnerable to interference, signal blockage, or intentional jamming. In contrast, multi-view geo-localization provides a robust and GPS-independent alternative by leveraging visual cues from imagery to infer precise geolocations. This approach not only addresses the limitations of GPS-based systems in complex or adversarial settings but also enhances the operational reliability of drones in GPS-denied environments [1]. Furthermore, the applications of multi-view geo-localization extend well beyond aerial navigation. It plays an increasingly pivotal role in diverse domains, including autonomous driving, where accurate scene understanding and location awareness are critical [2]: cross-modal alignment between on-board camera feeds and LiDAR data supports reliable positioning in GPS-denied urban canyons or tunnels. In environmental monitoring, fusing satellite-scale spatial patterns with the fine-grained details captured by drone-view imagery enables precise ecological tracking, such as quantifying changes in vegetation coverage or glacial ablation rates. The technology also demonstrates considerable value in tasks such as target recognition and surveillance, where few-shot learning techniques have been successfully applied to enhance performance under limited data conditions [3]. In disaster response scenarios, pre-disaster satellite maps can be rapidly matched with real-time drone-view imagery to locate areas where people may be trapped. The expanding interest and advancements in multi-view geo-localization underscore its significance as a foundational technology within the broader framework of intelligent perception and navigation systems across diverse operational environments.
The core challenges in multi-view geo-localization lie in the substantial spatial domain discrepancies that exist between images taken from different viewpoints. Ground-level images often capture fine-grained details with rich textures, occlusions, and perspective distortions, while satellite images provide a top-down, globally consistent but semantically sparse representation of the same scene.
These pronounced multi-view variations result in particularly severe appearance and geometric mismatches, which fundamentally hinder the effectiveness of traditional feature-based image matching methods. Substantial differences in scale, orientation, lighting conditions, and object visibility further exacerbate the inherent domain gap, making it extremely difficult for models to establish accurate correspondences across views. As a result, designing inherently robust algorithms that can bridge these view-dependent disparities and extract reliable view-invariant features becomes critically essential for achieving high-performance multi-view geo-localization in complex real-world scenarios.
To address the challenges posed by substantial multi-view discrepancies, existing multi-view geo-localization approaches can be systematically classified into three technical paradigms according to their core design rationales. Specifically, spatial-domain feature learning methods center on optimizing backbone networks or introducing attention mechanisms to strengthen the extraction of local and global features, yet face difficulties in addressing large-scale perspective variations [4,5,6,7]. Cross-domain alignment methods employ style transfer, domain adaptation modules, or metric learning strategies to alleviate domain discrepancies between multi-view images, but frequently overlook the inherent geometric and appearance inconsistencies induced by viewpoint changes [8,9,10]. Frequency-domain feature utilization methods mainly rely on static frequency screening or simple frequency-domain fusion to complement spatial-domain features, with insufficient adaptive mining of discriminative frequency components. All three categories aim to tackle the challenges of matching drone-view and satellite-view imagery, but each exhibits distinct limitations in handling the complex variations inherent to multi-view geo-localization tasks.
For instance, Xia et al. [11] introduce a spatial alignment module within the feature domain, aiming to impose additional fine-grained multi-view constraints. This design significantly enhances the image representation capabilities of the backbone network, contributing to more accurate geo-localization. Yan et al. [12] adopt a deconvolutional network within the reconstruction pipeline, effectively bridging the semantic and structural differences in remote sensing image features across different domains. This enables the mapping of images from various platforms and viewpoints into a shared latent space with improved discriminative power, thus boosting robustness under large viewpoint changes.
Similarly, Ge et al. [13] propose a method that augments global features by incorporating part-level and image block-level representations. These are then fused back into the global feature representation to extract contextual information, thereby improving robustness and spatial awareness. Wang et al. [14] design MuSe-Net, a multi-environment adaptive network that dynamically mitigates domain shifts caused by varying environmental conditions. It employs a dual-branch structure comprising a style extraction module and an adaptive feature extraction module. The former captures style information specific to different environments, while the latter uses adaptive modulation to minimize environment-related style gaps, ensuring better generalization.
Lin et al. [15] propose Safe-Net, a unified end-to-end network designed to extract highly robust scale-invariant features. It incorporates two dedicated modules: a global representation-guided feature alignment module and a saliency-guided partitioning module. The former primarily employs global feature-guided affine transformations for adaptive alignment, whereas the latter utilizes saliency distributions to guide carefully calibrated adaptive partitioning, thereby significantly improving model sensitivity to scale variations.
Further, Sun et al. [16] present TirSA, a comprehensive three-stage framework consisting of preprocessing, feature embedding, and post-processing. In the preprocessing stage, a self-supervised feature enhancement method (SFEM) is used to generate building perception masks, encouraging the model to focus on structurally meaningful regions without external supervision. The embedding stage integrates an adaptive feature integration module (AFIM) and employs a refined cross-domain triplet loss to mitigate inter-view discrepancies. Finally, a re-ranking strategy is introduced in the post-processing phase to optimize the retrieval results and enhance final matching accuracy.
While these methods demonstrate impressive performance improvements, they largely focus on abstract feature representations to measure similarity, overlooking the rich frequency information inherently present in multi-view images. This omission can critically limit their ability to fully exploit view-invariant cues embedded in the frequency domain. In this context, Sun et al. [17] introduce a pioneering approach by first transforming high-level feature representations into the frequency domain using the Discrete Cosine Transform (DCT). This enables more effective feature screening in frequency space and facilitates substantially improved classification outcomes.
However, the aforementioned methods utilize frequency information in a relatively limited manner [18], without delving into the underlying structure and significance of specific frequency components within multi-source images. In practice, the frequency distribution of such imagery is far from uniform, and critical discriminative features are often concentrated in particular frequency bands. Ignoring this distributional characteristic leads to suboptimal utilization of valuable information embedded in the frequency spectrum.
Based on this insight, frequency information is posited to serve as a powerful complementary cue in multi-view geo-localization. Specifically, by integrating frequency-domain analysis into the matching process, it becomes possible to suppress the detrimental impact of spatial-domain inconsistencies while reinforcing invariant structural characteristics across views. This perspective underpins the design of our proposed method, which selectively screens and integrates frequency-based features to enhance multi-view matching robustness and accuracy.
This paper proposes a novel frequency-aware framework that adaptively evaluates and assigns importance to different frequency bands in multi-source images. This framework, named Positive-Incentive Information Screening (PIIS), is designed to capture and enhance the most salient and informative frequency components while reducing the influence of less relevant ones. By dynamically adjusting the contribution of each frequency band according to its discriminative value, PIIS allows the network to concentrate more effectively on features that distinguish between different views. This targeted enhancement supports more accurate and robust similarity measurement across heterogeneous perspectives, such as those captured from drone and satellite imagery, and leads to significant improvements in multi-view image matching and geo-localization performance, especially in challenging or GPS-denied environments with complex interference patterns.
The main contributions of this paper are summarized as follows:
(1) Multi-View Consistent Feature Mining in Frequency-Domain. Based on traditional abstract feature learning, the features are further projected into the frequency domain, which effectively addresses the spatial inconsistencies commonly present in multi-source images. This frequency-domain mapping serves to mitigate the effects of significant perspective changes and environmental variability by filtering out unstable or redundant components through spectral analysis. As a result, our approach strengthens the robustness and stability of multi-view image matching under challenging real-world conditions.
(2) Information-Theoretic Modeling of Inter-Branch Interaction. To enhance the comprehension of dual-branch siamese networks, the influence of shared parameters and inter-branch information exchange is examined from an information-theoretic perspective. By modeling the interaction as an information flow process, the aim is to maximize relevant mutual information while reducing redundancy. This formulation provides a principled framework for improving feature extraction and matching performance, especially in complex applications such as multi-view geo-localization. Additionally, the conventional notion of noise is revisited, proposing that in siamese architectures, what is often considered noise may in fact arise from a misalignment between learned representations and task-relevant signals. This perspective offers new insights for optimizing learning dynamics and enhancing model robustness.
(3) Positive-Incentive Information Screening in Frequency-Domain. The positive-incentive information screening strategy is proposed. This approach incorporates a core-band screening mechanism designed to identify and retain the frequency components with the highest task-relevant entropy. By approximating the maximum expected information entropy, the network is guided to emphasize structurally meaningful and discriminative spectral cues. Such selective preservation enables more robust and consistent feature alignment across views, ultimately improving performance in multi-view matching under significant appearance variations, geometric changes, and domain shifts. Notably, the proposed PIIS framework exhibits considerable architectural versatility, as it is readily applicable to various backbone networks, including both CNNs and Transformers, and maintains consistent performance improvements across diverse architectural paradigms.
The remainder of this paper is structured as follows:
Section 2 reviews related work in the field of multi-view geo-localization and frequency-domain analysis.
Section 3 presents the proposed PIIS framework in detail, including its underlying architecture and frequency-aware weighting strategy.
Section 4 reports the experimental results and comparative studies.
Section 5 discusses and analyzes the experimental results.
Section 6 concludes the paper and outlines potential directions for future work.
3. Materials and Methods
For the siamese network, the objective of each branch can be conceptualized as a distinct process encompassing feature capture and subsequent cognition. Specifically, features are extracted from the input data by each branch and processed into representations that enable effective comparison between sample pairs. The ultimate goal is to determine whether a given pair constitutes a positive or negative match, which equates to assessing if the outputs from the two branches correspond to a positive sample pair. Within this framework, the weights shared between the two branches are not merely a source of redundancy, but rather an essential mechanism that facilitates the exchange of learned knowledge.
3.1. Positive-Incentive Information Theory
When viewed through the lens of π-noise, the shared-weight strategy in siamese networks redefines the traditional notion of noise, transforming it into a form of positive excitation noise. Unlike conventional noise, which is regarded as detrimental and disruptive to learning, the shared parameters between network branches serve as a conduit for mutual knowledge exchange. This shared knowledge acts as a complementary information source, rather than a disturbance, and enhances the learning process in both branches.
Through this mechanism, the network benefits from a mutual enhancement effect, as the shared weights allow both branches to collaborate in extracting more comprehensive and discriminative features. In this context, what might superficially be interpreted as redundancy or overlap actually serves as a form of constructive interference, reinforcing essential representations. Therefore, this additional information can be regarded as positive noise, thereby enhancing the generalization capability of the model and consequently improving overall performance.
This perspective can be theoretically supported by the mutual information between the task $T$ and the noise $\varepsilon$:

$$I(T; \varepsilon) = H(T) - H(T \mid \varepsilon),$$

where $H(T)$ denotes the entropy of the task, and $H(T \mid \varepsilon)$ represents the conditional entropy of the task given the noise.

In the strictest sense, unexpected and harmful noise satisfies the condition:

$$I(T; \varepsilon) = 0,$$

which implies that such noise offers no useful information about the task and only adds uncertainty. In contrast, forward excitation noise, as introduced by the shared-weight paradigm, satisfies the following condition:

$$I(T; \varepsilon) > 0,$$

or equivalently:

$$H(T \mid \varepsilon) < H(T).$$

This indicates that the presence of this noise actually reduces the uncertainty of the task, thereby contributing positively to learning.
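As a concrete illustration (not taken from the paper), the two regimes can be checked numerically on a toy discrete task variable: noise independent of the task yields $I(T; \varepsilon) = 0$, while noise correlated with the task yields $I(T; \varepsilon) > 0$ and hence $H(T \mid \varepsilon) < H(T)$. The joint probability tables below are illustrative stand-ins.

```python
import numpy as np

def mutual_information(joint):
    """I(T; eps) in bits from a joint probability table p(t, eps)."""
    pt = joint.sum(axis=1, keepdims=True)   # marginal p(t)
    pe = joint.sum(axis=0, keepdims=True)   # marginal p(eps)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pt @ pe)[mask])).sum())

# Harmful (uninformative) noise: T and eps independent -> I(T; eps) = 0.
independent = np.outer([0.5, 0.5], [0.5, 0.5])

# Positive-incentive noise: eps carries information about T -> I(T; eps) > 0.
correlated = np.array([[0.4, 0.1],
                       [0.1, 0.4]])

assert abs(mutual_information(independent)) < 1e-12
assert mutual_information(correlated) > 0.0   # here about 0.278 bits
```

Both marginals are identical in the two tables, so the gain in the second case comes entirely from the dependence between task and noise, matching the conditions above.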
In this manner, the concept of π-noise can be reinterpreted as a beneficial inductive bias within siamese architectures. The “noise” induced by parameter sharing does not impede the learning process; instead, it accelerates the formation of robust and generalizable representations. Standard mutual information-based methods primarily focus on quantifying statistical dependence between paired images, leaving the intrinsic mechanism governing inter-branch interaction in siamese networks unaddressed. π-noise theory, in contrast, reinterprets shared weights in siamese networks as “positive-incentive noise” that mitigates task uncertainty rather than introducing disruptive interference.
This reinterpretation transforms the traditional weight-sharing mechanism from a simple parameter reuse strategy into a constructive information exchange channel, enabling mutual enhancement between the two branches of the siamese architecture. This improvement consequently elevates model efficacy in tasks including feature matching, multi-view geo-localization, and classification. Such enhanced performance is particularly evident in scenarios demanding high robustness to domain variations and input transformations under real-world constraints.
3.2. Frequency Positive-Incentive Information Screening
Building upon the preceding theoretical foundation, this paper poses a bold hypothesis: can frequency-domain spatial features act as positive excitation elements to enhance the optimization of siamese networks? Empirical results [17] affirm this hypothesis, demonstrating that appropriately integrated frequency-domain information can significantly boost the performance of multi-view matching tasks. However, the integration of such information presents several technical challenges. Notably, the lack of inherent constraints in frequency-domain integration, along with the possible impact of crowding operations, can result in the distortion or loss of discriminative frequency cues.
To overcome these limitations, we introduce a dedicated frequency filtering model, as depicted in Figure 1, which is designed to isolate and preserve frequency-domain features that provide positive excitation. This model enables the network to selectively amplify informative frequency components while suppressing irrelevant or noisy ones, thereby enhancing the optimization of the siamese network without compromising its learning dynamics. The core idea is to transform raw abstract features into the frequency domain, analyze them across distinct bands, and adaptively emphasize those components that contribute most meaningfully to discriminative representation.
(1) Frequency-Domain Transformation
The abstract spatial feature map $f(x, y)$ is first transformed into the frequency domain using the Discrete Fourier Transform (DFT):

$$F(u, v) = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} f(x, y)\, e^{-j 2\pi \left( \frac{u x}{H} + \frac{v y}{W} \right)},$$

where $F(u, v)$ denotes the resulting complex-valued frequency map, $H$ and $W$ represent the spatial dimensions of the input feature, and $(x, y)$ are the spatial coordinates. To facilitate centralized frequency analysis, a frequency shift is applied such that low-frequency components are relocated to the center of the spectrum. The normalized frequency indices $u$ and $v$ span from $-H/2$ to $H/2$ and from $-W/2$ to $W/2$, respectively. Frequencies beyond the Nyquist limit cannot be accurately captured, which imposes an upper limit on the usable frequency bandwidth:

$$|u| \le \frac{H}{2}, \qquad |v| \le \frac{W}{2}.$$
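The transform-and-shift step above can be sketched in a few lines of numpy on a dummy single-channel feature map; the array shapes and values are illustrative, not the paper's actual features.

```python
import numpy as np

# Hypothetical single-channel feature map (H x W); values are arbitrary.
H, W = 8, 8
rng = np.random.default_rng(0)
f = rng.standard_normal((H, W))

# 2-D DFT, then shift so the zero-frequency (DC) term sits at the center.
F = np.fft.fftshift(np.fft.fft2(f))

# After the shift, the frequency indices run over [-H/2, H/2) and [-W/2, W/2),
# i.e. up to the Nyquist limit in each dimension.
u = np.fft.fftshift(np.fft.fftfreq(H)) * H   # -4, -3, ..., 3 for H = 8
v = np.fft.fftshift(np.fft.fftfreq(W)) * W

assert F.shape == (H, W)
assert np.iscomplexobj(F)
# The centered DC coefficient equals the sum of all spatial values.
assert np.isclose(F[H // 2, W // 2], f.sum())
```

The final assertion is a quick sanity check that the shift really moved the DC component to the center of the spectrum.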
In typical CNN architectures, convolutional operations often act as high-pass filters, favoring the preservation of high-frequency content. However, this can bias the learned features towards finer details while sacrificing global context. This trade-off results in the use of smaller dilation rates to preserve high-frequency information, but at the expense of reduced receptive fields. To address this, our FreqSelect module is designed to balance high- and low-frequency components, thus enabling broader receptive fields while retaining essential fine-grained details.
(2) Frequency Band Decomposition and Reweighting
In the Positive-Incentive Information Screening (PIIS) framework, the frequency spectrum is decomposed into multiple non-overlapping bands using binary masks in the Fourier domain:

$$f_n(x, y) = \mathcal{F}^{-1}\!\left[ M_n(u, v) \odot F(u, v) \right],$$

where $\mathcal{F}^{-1}$ denotes the inverse Fast Fourier Transform (IFFT), and $M_n(u, v)$ is a binary mask defined as:

$$M_n(u, v) = \begin{cases} 1, & r_n \le \sqrt{u^2 + v^2} < r_{n+1} \\ 0, & \text{otherwise,} \end{cases}$$

where $r_n$ and $r_{n+1}$ are threshold values derived from a predefined set of octave-based frequency ranges. This frequency decomposition enables precise localization of salient information across different scales, from low-frequency global context to high-frequency texture patterns.
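A minimal sketch of the band decomposition, assuming radial masks with illustrative octave-style thresholds (the paper's exact threshold set is not specified here):

```python
import numpy as np

H, W = 8, 8
rng = np.random.default_rng(1)
f = rng.standard_normal((H, W))
F = np.fft.fftshift(np.fft.fft2(f))

# Radial distance of each (shifted) frequency coordinate from the center.
u = np.arange(H) - H // 2
v = np.arange(W) - W // 2
radius = np.sqrt(u[:, None] ** 2 + v[None, :] ** 2)

# Illustrative octave-style thresholds (low -> high frequency).
r_max = radius.max()
edges = [0.0, r_max / 4, r_max / 2, r_max + 1e-9]

bands = []
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (radius >= lo) & (radius < hi)           # binary band mask M_n
    f_n = np.fft.ifft2(np.fft.ifftshift(mask * F))  # back to spatial domain
    bands.append(f_n.real)

# Non-overlapping masks partition the spectrum, so the bands sum back to f.
assert np.allclose(sum(bands), f, atol=1e-10)
```

The closing assertion verifies the "non-overlapping" property: because every frequency coordinate belongs to exactly one band, the band-limited reconstructions recompose the original feature map exactly.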
Subsequently, PIIS dynamically reweights these frequency-specific features based on their spatial relevance:

$$\hat{f}(x, y) = \sum_{n=1}^{N} A_n(x, y) \cdot f_n(x, y),$$

where $\hat{f}(x, y)$ is the final reweighted frequency-balanced feature at location $(x, y)$, $f_n(x, y)$ denotes the reconstructed spatial feature map from the $n$-th frequency band, and $A_n(x, y)$ is the learned attention map indicating the relative importance of that band at each spatial location.
This adaptive fusion mechanism enables the network to selectively enhance frequency components that provide positive excitation and are most beneficial for discriminative learning, while noisy or redundant signals are effectively suppressed. In distinction from conventional frequency-domain approaches that treat frequency bands as discrete matching units, the proposed Frequency-Based Positive-Incentive Information Screening mechanism is grounded in π-noise theory and prioritizes the screening of task-relevant frequency components via entropy maximization. This approach ensures that the retained frequency bands not only exhibit high mutual information with the target task but also align with the optimization logic of siamese networks. This optimization logic centers on enhancing view-invariant features and suppressing domain discrepancies between multi-view images, thereby reinforcing feature alignment across heterogeneous perspectives, including drone-view and satellite imagery.
(3) Design of the PIIS model
Within the π-noise theoretical framework, the synergistic effects arising from parameter sharing constitute a form of positive-incentive noise that satisfies the mutual information condition $I(T; \varepsilon) > 0$, indicating its capacity to effectively reduce task uncertainty rather than introduce interference.
Based on this theoretical foundation, the Positive-Incentive Information Screening (PIIS) mechanism is proposed, in which the frequency-domain information configuration is optimized through three core operations. First, spatial features are mapped to the frequency domain using the Discrete Fourier Transform, with frequency shift operations enabling centralized spectral analysis. The spectrum is then decomposed into semantically heterogeneous, non-overlapping frequency bands using binary masks based on predefined octave-based thresholds. Finally, a learnable attention mechanism dynamically reweights the frequency-specific features to approximate the maximum expected information entropy.
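The three operations compose into a single filtering function, sketched below for one feature map. The band edges and the attention weights are random or evenly spaced stand-ins for the learned components, so this shows only the data flow, not the trained behavior.

```python
import numpy as np

def piis_filter(f, n_bands=3, rng=None):
    """Sketch of the three PIIS operations on one (H x W) feature map:
    (1) DFT + shift, (2) radial band split, (3) per-location reweighting.
    Attention weights are random stand-ins for learned ones."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W = f.shape

    # (1) frequency-domain transformation with centered spectrum
    F = np.fft.fftshift(np.fft.fft2(f))

    # (2) non-overlapping radial bands (evenly spaced edges for illustration)
    u = np.arange(H) - H // 2
    v = np.arange(W) - W // 2
    radius = np.sqrt(u[:, None] ** 2 + v[None, :] ** 2)
    edges = np.linspace(0.0, radius.max() + 1e-9, n_bands + 1)
    bands = [
        np.fft.ifft2(np.fft.ifftshift(((radius >= lo) & (radius < hi)) * F)).real
        for lo, hi in zip(edges[:-1], edges[1:])
    ]

    # (3) spatial reweighting; softmax keeps weights positive, summing to 1
    logits = rng.standard_normal((n_bands, H, W))
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    A = e / e.sum(axis=0, keepdims=True)
    return sum(A[n] * bands[n] for n in range(n_bands))

out = piis_filter(np.random.default_rng(1).standard_normal((8, 8)))
assert out.shape == (8, 8)
```

In the full models the output of such a filter is fused with the original abstract features rather than replacing them, as described for PIIS-C and PIIS-N below.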
For multi-view geo-localization tasks utilizing CNN backbone networks, the PIIS-C model is constructed, as illustrated in Figure 2a. This architecture enhances feature representation through frequency-domain analysis and further integrates the frequency-enhanced features with the original abstract-domain features, thereby forming a more discriminative comprehensive metric representation. The similarity estimation layer subsequently computes precise matching scores for image pairs.
Recognizing the superior capability of Transformer architectures in capturing global contextual information, the frequency-domain positive-incentive screening mechanism is extended to these architectures [36,37]. As demonstrated in Figure 2b, this extension leads to the development of the PIIS-N model. The use of frequency-domain spatial features as positive-incentive elements serves to enhance the optimization efficacy of the Transformer in multi-view matching tasks. Although CNNs and Transformers employ different spatial feature extraction mechanisms, the feature maps they generate can both be treated as two-dimensional signals governed by the same frequency-domain transformation laws.
Frequency-domain decomposition effectively decouples appearance variations from structural information: high-frequency components predominantly encode detailed texture features, whereas low-frequency components encapsulate global structural information. The adaptive screening of frequency components that provide positive task incentives is theoretically grounded: it aligns not only with the feature representation optimization principles of information bottleneck theory but also embodies the core “noise as information” tenet of π-noise theory. Operationally, the reconstruction error introduced by this screening process functions not as detrimental interference but as a constructive incentive, thereby enhancing model generalization capability.
In the PIIS-N implementation, the mechanism functions as a front-end processing unit for the Transformer encoder, first converting feature maps to the frequency domain via the Fast Fourier Transform, then employing a learnable weight matrix to enhance viewpoint-insensitive frequency components while suppressing those susceptible to viewpoint variations. This process essentially instantiates the mutual information gain mechanism of π-noise in the frequency domain, ensuring that the screened frequency noise satisfies the condition $I(T; \varepsilon) > 0$, thereby improving model robustness through reduced task conditional entropy. The enhanced frequency-domain features are ultimately fused with the original features to form a comprehensive metric representation with complementary information characteristics, with the similarity estimation layer computing final matching scores for image pairs.
The PIIS module in PIIS-C is inserted after the last residual block. This placement balances local texture details and preliminary semantic information, effectively suppressing perspective-induced noise in shallow features, preserving discriminative semantic cues, and avoiding the loss of fine-grained information caused by late-stage insertion or the unstable processing of early raw features. In contrast, the PIIS module in PIIS-N is placed at the front end of the encoder, prior to feature downsampling: frequency-domain transformation is first applied to screen view-invariant frequency components, and the enhanced features are then fed into the Transformer blocks. This design reduces the computational burden of global attention on noisy components, keeps the model focused on structure-invariant frequency patterns, and maximizes the advantage in modeling long-range spectral correlations.
4. Results
To verify the effectiveness of the proposed method, this paper conducts comprehensive experiments on public datasets, evaluating both matching accuracy and computational efficiency across multiple dimensions.
4.1. Datasets & Experiments Settings
First, the proposed algorithm is thoroughly validated using the multi-view matching and positioning dataset University-1652. This dataset, created within a simulation platform, offers a rich, diverse, and large-scale collection of training samples, thereby providing abundant metric learning signals for the model. University-1652 consists of a series of multi-view scenes, each paired with corresponding satellite imagery, allowing for a comprehensive evaluation across various perspectives. The drone images are captured in a circular ascending trajectory, enabling the acquisition of visual information from different angles and altitudes, particularly focusing on central buildings. This simulated setup provides controlled yet varied data, making it ideal for evaluating model performance across a range of viewing conditions.
In addition to the simulation-based validation, complementary experiments are conducted using the real-world dataset SUES-200, collected from actual drone flights in highly dynamic environments. The drone images reflect the inherent complexities of real-scene conditions, including prevalent environmental interferences such as weather variations, lighting changes, and optical distortions. This data introduces substantial noise and unpredictability, providing a particularly rigorous test for robustness and adaptability. A thorough assessment of model performance is thereby ensured through evaluation on both synthetic and real-world datasets, enabling validation across environments ranging from controlled settings to genuinely complex, dynamic scenarios.
The experiments for the PIIS-C model were conducted on a high-performance computing platform with an NVIDIA RTX 3090 GPU. An initial learning rate of 0.001 was adopted to balance convergence speed and model accuracy, while a dropout rate of 0.5 helped prevent overfitting. The model used a batch size of 4 for efficient GPU memory utilization and stable gradient estimation, with a stride of 2 to enable effective downsampling for multi-scale feature capture.
For the PIIS-N model, training was performed in a distributed environment using multiple RTX 3090 GPUs with NCCL backend. To suit the larger transformer-based architecture, the batch size was set to 8 and the learning rate to 0.0001. The model employed a ConvNeXt-b backbone processing 384 × 384 input images normalized with ImageNet statistics. Training lasted 10 epochs with gradient clipping and label smoothing for regularization. The data pipeline used 8 parallel workers with horizontal flipping and custom sampling, with evaluation performed each epoch. These hyperparameters are carefully selected to ensure robust training and reliable results across the experimental setup.
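For reference, the stated settings can be consolidated into a single configuration sketch. The field names are illustrative (not taken from the authors' released code), and the ImageNet normalization constants are the standard published values; settings the paper mentions only qualitatively (gradient clipping, label smoothing) are recorded as flags rather than guessed values.

```python
# Hypothetical consolidation of the training settings described above.
PIIS_C_CONFIG = {
    "gpu": "NVIDIA RTX 3090",
    "learning_rate": 1e-3,
    "dropout": 0.5,
    "batch_size": 4,
    "stride": 2,
}

PIIS_N_CONFIG = {
    "backbone": "ConvNeXt-B",
    "distributed_backend": "NCCL",
    "input_size": (384, 384),
    "normalize_mean": (0.485, 0.456, 0.406),  # standard ImageNet statistics
    "normalize_std": (0.229, 0.224, 0.225),
    "batch_size": 8,
    "learning_rate": 1e-4,
    "epochs": 10,
    "gradient_clipping": True,   # used; value not stated in the text
    "label_smoothing": True,     # used; value not stated in the text
    "num_workers": 8,
    "augmentations": ["horizontal_flip"],
}

assert PIIS_N_CONFIG["batch_size"] == 2 * PIIS_C_CONFIG["batch_size"]
```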
4.2. Precision Experiment on University-1652
To evaluate the performance of the proposed method against mainstream approaches, experiments have been conducted on two benchmark datasets, with the input image resolution of the proposed method and all comparative methods uniformly set to 384 × 384. The results on the University-1652 dataset are presented in Table 1.
As demonstrated in Table 1, the rapid advancements in the field of visual localization have led to significant improvements in accuracy through the introduction of novel methods. For example, the classical siamese network model based on ResNet, while pioneering, exhibits notable limitations in handling multi-view variations. As shown in the first three rows of the experimental results, this model struggles to effectively capture complex spatial relationships across varying perspectives, primarily due to its limited ability to model intricate inter-view transformations.
The introduction of attention mechanisms has considerably enhanced the ability of convolutional neural networks (CNNs) to selectively focus on salient regions within images. By assigning attention weights to features, these mechanisms allow the model to emphasize critical areas, thus improving its capability to handle multi-view discrepancies. When attention mechanisms, such as SENet and CBAM, are integrated into the siamese network framework, e.g., in Triplet + SENet and Triplet + CBAM configurations, they facilitate the selective activation of important regions within each branch of the network, leading to stronger and more consistent correspondences across views. This selective focus has been instrumental in improving the accuracy of visual geo-localization.
At its core, the integration of attention mechanisms constitutes a form of information enhancement within the backbone network, which augments the capacity to discern relevant features. However, in cases involving more complex multi-view transformations, more comprehensive strategies for feature mining can lead to even greater performance gains. For instance, methods that leverage densely connected multi-level features, such as Triplet Loss combined with DenseNet, allow for the extraction of rich, multi-scale interaction information that provides a more nuanced understanding of the scene. This approach, which goes beyond the fixed paradigm of attention mechanisms, offers further accuracy improvements. Yet, it is important to recognize that dense connectivity comes at the cost of increased network complexity and computational demands. Given the inherent limitations in edge computing resources, especially in visual geo-localization platforms that rely on lightweight devices, this method may face challenges in real-world deployment and practical implementation. Nevertheless, it provides valuable insights for future network designs, reinforcing the effectiveness of multi-scale information integration in tackling complex geo-localization tasks.
As backbone architectures continue to evolve, as summarized in
Table 2, increasingly sophisticated feature extraction frameworks, such as Transformer and ConvNeXt, are being introduced into visual geo-localization tasks. These advanced architectures contribute to further accuracy gains by improving feature extraction capabilities. However, it is crucial to align the choice of backbone with the specific requirements of the visual geo-localization task at hand. In particular, overcoming the challenge of handling multi-view variations remains a critical factor in improving performance. In response to this, the proposed method, PIIS, presents a novel approach by mapping traditional metric features into the frequency domain. In this domain, invariant multi-view information can be selectively filtered, allowing for the isolation and retention of stable representations across varying views. These stable representations are then reintegrated into the original metric feature space through adaptive fusion mechanisms, which significantly enhances the reliability and discriminability of the geo-localization model under multi-view conditions. This method thus represents a significant advancement in the effective handling of perspective changes in visual geo-localization.
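The PIIS pipeline described above can be sketched roughly as follows: map a metric feature map into the frequency domain, screen out view-dependent bands while retaining (assumed) invariant ones, transform back, and fuse adaptively with the original features. The low-pass mask, `keep_ratio`, and the scalar fusion weight `alpha` are simplifying assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def frequency_screen_and_fuse(feat, keep_ratio=0.25, alpha=0.5):
    """Illustrative sketch of frequency-domain screening with adaptive fusion.

    feat:       (C, H, W) metric feature map
    keep_ratio: fraction of the lowest spatial frequencies treated as
                view-invariant and retained (an assumption for illustration)
    alpha:      fusion weight between original and screened features
    """
    C, H, W = feat.shape
    # Map features into the frequency domain (centered 2-D FFT per channel)
    F = np.fft.fftshift(np.fft.fft2(feat, axes=(1, 2)), axes=(1, 2))
    # Build a centered low-frequency mask: keep a small radius around DC
    yy, xx = np.mgrid[0:H, 0:W]
    radius = keep_ratio * min(H, W) / 2.0
    mask = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2) <= radius ** 2
    # Screen: retain only the (assumed) stable, view-invariant bands
    F_screened = F * mask[None, :, :]
    stable = np.fft.ifft2(np.fft.ifftshift(F_screened, axes=(1, 2)),
                          axes=(1, 2)).real
    # Reintegrate the stable representation into the metric feature space
    return alpha * feat + (1.0 - alpha) * stable

feat = np.random.default_rng(1).standard_normal((4, 32, 32))
fused = frequency_screen_and_fuse(feat)
print(fused.shape)  # (4, 32, 32)
```

In the actual method the band selection and fusion weights would be learned rather than fixed; the sketch only shows where screening sits in the pipeline.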
Through systematic evaluation on the University-1652 dataset, this study validates the effectiveness of the frequency-domain feature screening mechanism in multi-view geo-localization tasks. As shown in
Table 1, the proposed PIIS-C and PIIS-N methods demonstrate significant performance advantages across various evaluation metrics:
The quantitative evaluation demonstrates that the PIIS-C model achieves 83.61% Recall@1, 94.14% Recall@5, and 85.99% mAP, surpassing the previous state-of-the-art method DUMM by substantial margins of 6.28% in Recall@1 and 4.54% in mAP. Notably, its Recall@10 and Recall@1% scores reach 96.29% and 96.53%, respectively, reflecting robust feature discrimination.
The PIIS-N model further advances these benchmarks, attaining 94.56% Recall@1, 98.44% Recall@5, and 95.44% mAP, which represents a 2.08% improvement in Recall@1 over the ConvNext baseline while maintaining superior performance across higher-recall metrics.
The substantial advantages of the frequency feature screening mechanism in complex multi-view localization are confirmed by these quantitative results. Furthermore, consistent performance improvements across multiple recall thresholds demonstrate high effectiveness in handling viewpoint variations and modality differences, which pose significant challenges in fine-grained retrieval and real-world operational scenarios.
4.3. Precision Experiment on Real-World Dataset
Although the University-1652 dataset provides a large-scale and diverse training resource, it is generated from a simulation platform. Consequently, certain discrepancies exist between the simulated data and the characteristics of real-world environments, particularly in terms of background noise, lighting variations, and scene complexity. To address this limitation and further evaluate the practical applicability of the proposed frequency screening method, additional experiments are conducted on the real-world dataset SUES-200. This dataset comprises drone images collected from actual outdoor scenes, thereby introducing real environmental interference factors such as occlusion, lighting inconsistency, and texture complexity. The inclusion of SUES-200 allows for a more comprehensive validation of the robustness and generalization capability of the proposed method under realistic conditions. The experimental results on SUES-200 are presented in
Table 3, demonstrating the method’s effectiveness in complex, real-world scenarios with significant domain shifts.
Through systematic verification on the SUES-200 real-scene dataset, this paper further examines the effectiveness of the frequency feature screening mechanism in complex geo-localization tasks. As shown in
Table 3, our proposed PIIS-C method shows significant advantages at multiple distance thresholds ranging from 150 m to 300 m. In the most challenging 150 m fine-grained localization task, PIIS-C achieves 65.47% Recall@1 and 69.83% AP, outperforming LPN by 3.89% and 2.60%, respectively. When the localization distance is extended to 300 m, the proposed method still maintains 85.90% Recall@1 and 88.12% AP, exceeding the traditional spatial-domain method SA_DOM by 12.85% and 11.73%. These quantitative results confirm the core value of frequency feature screening in real interference environments.
The simulation-to-reality gap revealed by the SUES-200 dataset manifests in three dimensions. First, dynamic environmental elements (such as transient occlusion and abrupt illumination changes) introduce unstructured disturbances into spatial features; for example, the AP of the SA_DOM method at 150 m in
Table 3 decreases by 11.0% relative to the simulation dataset. Second, multi-scale noise in real scenes (sensor noise, weather degradation, etc.) exhibits cross-band coupling in the frequency domain, and traditional CNN backbones show markedly limited robustness to such composite interference. Third, the geometric complexity of urban environments induces multi-view feature mismatch, as reflected by the AP of the LPN method at 250 m, which is 8.2% lower than on the simulated data.
The experiments further reveal that the suppression of real-world interference by the frequency filtering mechanism is range-sensitive. The resulting multi-scale cooperative mechanism overcomes the accuracy imbalance that traditional methods exhibit across localization ranges, and offers a new theoretical perspective and technical path toward pervasive geo-localization systems with enhanced environmental adaptability.
Based on systematic experimental evaluation using the SUES-200 dataset, this research establishes the theoretical superiority of the PIIS-N framework in frequency-domain feature representation. The empirical evidence presented in
Table 3 demonstrates that in the most demanding 150 m fine-grained localization scenario, PIIS-N attains 94.93% Recall@1 and 95.98% AP, not only exhibiting substantial gains over conventional spatial-domain approaches but also, more significantly, revealing the theoretical promise of frequency-domain screening through its narrow performance gap with the state-of-the-art MEAN method. From an information-theoretic standpoint, these findings verify that the frequency screening strategy, grounded in mutual information maximization, successfully isolates fundamental frequency patterns that are intrinsically invariant to geometric transformations.
When the localization range is extended to 300 m, PIIS-N maintains competitive performance, providing further validation for the scale-invariant properties inherent in frequency-domain representations. Particularly compelling is the outstanding achievement of the framework at intermediate ranges of 200 and 250 m, where it surpasses all benchmark methods. This result substantiates that the architecture achieves optimal multi-scale feature integration through adaptive frequency selection. This finding aligns with frequency-domain decomposition theory, which posits that geo-localization tasks across varying distances rely on distinct characteristic scales; the frequency selection mechanism in PIIS-N successfully maintains a dynamic equilibrium among these hierarchical features.
Through rigorous analysis grounded in signal processing principles, the exceptional performance of PIIS-N can be attributed to its innovative frequency-domain screening methodology. Operating within the π-noise (positive-incentive noise) theoretical framework, the method reconceptualizes conventional frequency redundancy as constructive excitation signals through mutual information optimization, thereby achieving robust management of multi-scale interference in complex environments. This theoretical advancement not only elucidates the mechanism behind the remarkable generalization capability of frequency-domain features in cross-distance applications but also establishes a new foundation for developing next-generation geo-localization systems with enhanced domain adaptation capacities.
Quantitative results provide theoretical validation of the generalization mechanism in the Positive-Incentive Information Screening (PIIS) framework at the frequency-domain level. While maintaining the representational capacity of the Transformer architecture, the frequency screening strategy based on mutual information maximization enables effective cross-domain transfer of discriminative frequency features. Experimental evidence confirms that adaptively selected core frequency bands maintain stable discriminability in unseen environments, a phenomenon rooted in the inherent invariance of frequency components to geometric transformations. This discovery elucidates the mathematical principle underlying frequency feature transfer. Within the π-noise theoretical framework, mutual information optimization enables the successful extraction of intrinsic frequency patterns, which exhibit robustness against environmental disturbances.
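Mutual-information-based band selection can be sketched with a simple histogram estimator: score each frequency band's energy by its mutual information with the location label and keep the top-scoring bands. The toy data, band count, and estimator below are assumptions for illustration, not the actual PIIS screening procedure.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of mutual information I(X; Y) in nats (sketch)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0  # avoid log(0); zero cells contribute nothing to the sum
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def select_bands(band_energy, labels, k=2):
    """Rank frequency bands by MI with the (toy) location label, keep top-k."""
    scores = [mutual_information(band_energy[:, b], labels)
              for b in range(band_energy.shape[1])]
    return np.argsort(scores)[::-1][:k]

# Toy data: 200 samples, 5 bands; band 0 is informative, the rest are noise
rng = np.random.default_rng(2)
labels = rng.integers(0, 2, size=200).astype(float)
band_energy = rng.standard_normal((200, 5))
band_energy[:, 0] += 3.0 * labels  # band 0 correlates with the label
top = select_bands(band_energy, labels, k=2)
print(top[0])  # band 0 should rank first
```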
The established mechanism represents a new theoretical paradigm for developing domain-adaptive geo-localization systems, where knowledge transfer from simulated to real-world scenarios is achieved through frequency-domain positive-incentive screening. This approach not only provides a principled solution to the domain adaptation challenge but also opens new pathways for constructing robust geo-localization systems capable of maintaining consistent and reliable performance across diverse operational environments.