Article

Towards Pseudo-Labeling with Dynamic Thresholds for Cross-View Image Geolocalization

1 State Key Laboratory of Spatial Datum, College of Remote Sensing and Geoinformatics Engineering, Faculty of Geographical Science and Engineering, Henan University, Zhengzhou 450046, China
2 Henan Industrial Technology Academy of Spatiotemporal Big Data, Henan University, Zhengzhou 450046, China
3 Xi’an Research Institute of Surveying and Mapping, Xi’an 710054, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(6), 944; https://doi.org/10.3390/rs18060944
Submission received: 20 January 2026 / Revised: 13 March 2026 / Accepted: 17 March 2026 / Published: 20 March 2026

Highlights

A cross-view image generation model (BEV-CycleGAN+CL) is proposed, integrating BEV geometric constraints with multi-scale contrastive learning to enhance structural consistency in ground-to-satellite image translation. A dynamic threshold-based pseudo-label self-training method is designed to adaptively balance label quality and quantity, effectively leveraging unlabeled data for improved geo-localization.
What are the main findings?
  • BEV-CycleGAN+CL achieves PSNR improvements of 42.30% (CVACT) and 41.45% (CVUSA) over CycleGAN, significantly improving image fidelity and structural consistency.
  • The dynamic threshold pseudo-labeling method attains an R@1 of 70.30% with only 1% labeled data on CVACT, outperforming the state-of-the-art UCVGL method (68.29%).
What are the implications of the main findings?
  • This study provides a high-quality, annotation-efficient framework for cross-view geo-localization, supporting applications such as disaster response and autonomous driving.
  • The adaptive pseudo-label selection mechanism effectively leverages unlabeled data, offering a scalable solution for semi-supervised and self-supervised geo-localization research.

Abstract

Cross-view image geolocalization aims to accurately localize ground-view images that lack geo-tags by matching them against geo-tagged satellite images. However, the imaging differences between ground and satellite viewpoints are large, and existing methods usually rely on many accurately labeled cross-view image pairs. To address significant perspective differences, high annotation costs, and low utilization of unpaired data, this paper proposes a cross-view generation model that integrates multi-scale contrastive learning and dynamic optimization. The method designs a multi-scale contrastive loss function to strengthen the semantic consistency between the generated images and the target domain, adaptively balances the quality and quantity of pseudo-labels through a dynamic threshold screening mechanism, and introduces a hard-sample triplet loss to enhance the model’s discriminative ability. Ablation experiments on the CVUSA and CVACT datasets show that the proposed BEV-CycleGAN+CL (Bird’s-Eye View Cycle-Consistent Generative Adversarial Network with Contrastive Learning) model significantly outperforms the comparative models in PSNR, SSIM, and RMSE. Specifically, on the CVACT dataset, compared with the BEV-CycleGAN, BEV, and CycleGAN baselines, PSNR increased by 2.83%, 16.02%, and 42.30%, SSIM increased by 6.12%, 8.00%, and 18.48%, and RMSE decreased by 9.28%, 15.51%, and 25.35%, respectively. Similar advantages are observed on the CVUSA dataset. Compared with current state-of-the-art models, the proposed dynamic threshold pseudo-label localization method demonstrates overall superiority in recall metrics such as R@1, R@5, R@10, and R@1%. For example, it achieves an R@1 of 98.94% on CVUSA, outperforming the best comparative model, Sample4G, which reached 98.68%.
This study provides innovative methodological support for disaster emergency response, high-precision map construction for autonomous driving, military reconnaissance, and other applications.

1. Introduction

With the rapid development of smart cities, drones, augmented reality, and other fields, the accurate acquisition and understanding of geospatial information has become a core technical need [1,2,3]. Cross-view geolocalization, a cutting-edge direction at the intersection of geographic information science and computer vision, aims to automatically localize images that lack geo-tags by matching ground-view images (e.g., street view, UAV images) with geo-tagged satellite or aerial images [4,5]. It has important application value for disaster emergency response [6], high-precision map construction for autonomous driving [7], military reconnaissance [8], and other scenarios. Traditional geo-localization methods are mainly based on GPS signals or SLAM technology; however, they have significant limitations in urban canyons, indoor scenes, or signal-restricted areas. In recent years, vision-based geolocalization techniques have become a research hotspot because of their universality. Early works constructed cross-view image matching models by extracting handcrafted features such as SIFT [9], HOG [10], and SURF [11], but performance degraded drastically when the difference in view angle exceeded 30°.
With the maturation of deep learning technologies, the research paradigm in this field has undergone a significant shift. Workman et al. [12] pioneered the application of convolutional neural networks to multi-view geolocation, achieving feature alignment between ground and aerial images through a designed cross-view training mechanism. The MGTL network proposed by Zhao et al. [13] significantly enhances cross-view feature extraction capabilities through a cascaded attention mechanism and spatial context enhancement modules. However, due to significant differences in imaging conditions, acquisition times, and geometric structures among images from different viewpoints, single-deep-learning-based localization methods still face core challenges such as feature alignment difficulties and insufficient model generalization [14].
To mitigate issues arising from viewpoint differences, researchers have attempted to bridge the gap between domains through image generation or transformation. Shi et al. [4] employed polar coordinate mapping to convert satellite images into a polar coordinate system, simulating the distribution of ground-level viewpoints. While this approach alleviated viewpoint distortion, the generated images still suffered from semantic content loss and edge blurring. Subsequently, generative adversarial networks (GANs) [15] were introduced for cross-view image translation, enabling unsupervised domain adaptation. Models like X-Fork and X-Seq [16] employed conditional GANs to generate cross-view images and reduce domain discrepancies. Recent studies incorporated attention mechanisms and multi-scale feature fusion, yet semantic misalignment caused by viewpoint differences remains an unresolved challenge [17]. The CycleGAN-Turbo model [18] enhances structural preservation by integrating latent diffusion modules, yet generated images exhibit structural distortions in complex texture regions (e.g., buildings, road intersections).
The aforementioned methods typically rely on large amounts of precisely annotated multi-view image pairs for supervised learning. However, in practical applications, high annotation costs and low utilization of unpaired data severely constrain the large-scale deployment of these techniques [19]. To reduce annotation dependency, pseudo-label learning methods have been introduced for multi-view localization tasks. UCVGL [20] employs a cross-view projection-guided model to retrieve initial pseudo labels, then improves their quality through rapid re-ranking. However, confidence estimation bias during pseudo label generation leads to noise propagation issues. Ran et al. [21] proposed a region-specific re-weighting strategy that assigns varying weights based on the spatial context of pseudo label regions, yet it fails to fully leverage domain-specific label information and lacks adaptability. CDPL [22] leverages consistent dense pseudo labels to enhance remote sensing object detection performance. However, its “dense pseudo labels” are tailored for instance-level detection and are unsuitable for image-level geolocation tasks. Most of the aforementioned pseudo label methods employ fixed thresholds for sample selection, making it difficult to adapt to changes in confidence distributions during model training. This results in unstable pseudo label quality and prominent noise accumulation issues.
To overcome the limitations of existing methods, this paper proposes a unified framework that integrates generative learning with pseudo-label learning paradigms. Specifically, we introduce BEV-CycleGAN+CL, a model that combines bird’s-eye view geometric constraints with multi-scale contrastive learning. Unlike pure feature matching approaches that struggle with large viewpoint differences, our method first explicitly transforms ground panoramas into BEV coordinates via geometric inverse projection based on the flat-ground assumption, significantly reducing viewpoint distortion. Unlike pure generative methods that often produce structural artifacts, our multi-scale contrastive loss enforces image generation at the feature level to align with real satellite imagery, preserving fine details and semantic structure. Unlike existing pseudo-label methods constrained by fixed thresholds and confidence bias, we design a dynamic threshold pseudo-label self-training mechanism. By integrating historical reliable thresholds with current confidence distributions, it adaptively balances pseudo-label quality and quantity. Additionally, a dynamic hard-sample triplet loss enhances the model’s discriminative capability between generated images and ground truth satellite imagery.
(1) BEV-CycleGAN: We propose a BEV-CycleGAN model that alleviates perspective distortion through BEV geometric constraints and multi-scale contrastive learning. By using a ground-plane assumption-based geometric inverse projection, the ground panorama image is explicitly converted to BEV coordinates. We introduce a spatial pyramid module to model geometric relationships and design pixel-level and feature-level contrast losses to improve the fidelity of generated images.
(2) Dynamic thresholding pseudo-labels: We propose a pseudo-label self-training method based on dynamic threshold filtering, setting an adaptive confidence threshold and dynamically adjusting the threshold using an exponential function to match the improvement trend of the model’s learning capability. At the same time, we use a dynamic hard-sample triplet loss to enhance the model’s discriminative ability between generated satellite images and real satellite images.
It should be noted that the BEV projection employed in this paper is based on the assumption of flat terrain. This assumption performs well in suburban or rural settings where building heights vary gradually. However, in urban areas characterized by dense high-rise buildings and steep topography, severe occlusion and perspective distortion lead to significantly increased geometric mapping errors. This limitation will be further analyzed in subsequent discussions. Moving forward, we will explore integrating multimodal data (such as LiDAR point clouds and Digital Surface Models, DSM) to explicitly model three-dimensional structures, thereby enhancing the model’s robustness in complex urban environments.

2. Materials and Methods

2.1. Materials

In this paper, we use two publicly available standard benchmark datasets for ground-satellite view matching, CVUSA [22] and CVACT [12]. CVUSA is mainly collected from the suburbs of U.S. cities and contains 35,532 image pairs for training and 8884 image pairs for testing, with high image resolution and panoramic ground-view imagery. The CVACT dataset is organized similarly to CVUSA and additionally provides 92,802 images for testing. The high-resolution satellite images in the CVACT dataset allow the model to learn finer features during training and testing, thus better validating the model’s generalization ability.
Evaluation indicators: In terms of image generation quality assessment, this study used multidimensional quantitative metrics to evaluate the generation results objectively. Pixel intensity deviation was measured by root mean square error (RMSE), image fidelity was assessed using peak signal-to-noise ratio (PSNR), and spatial structural consistency was measured using structural similarity (SSIM) to comprehensively analyze the extent to which the generated satellite images differed from the real data at the pixel level. In the context of image retrieval, R@K denotes the percentage of query images correctly matched among the top K satellite images returned by the retrieval. Specifically, R@1% denotes that for each query image, localization is considered successful if its correctly matched satellite image resides within the top 1% of candidates sorted by similarity (i.e., the candidate pool comprises 1% of the entire test dataset). This metric evaluates a model’s localization accuracy among extremely limited candidates, representing a highly challenging evaluation criterion for cross-view geolocation tasks.
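The evaluation indicators above can be illustrated with a minimal NumPy sketch. It is not the authors’ evaluation code: the function names are hypothetical, images are assumed to be arrays with an 8-bit intensity range (max_value = 255), the true match of query i is assumed to be gallery item i, and SSIM is omitted because it needs a windowed implementation.

```python
import numpy as np

def rmse(generated, reference):
    """Root mean square error between two images of equal shape."""
    diff = generated.astype(np.float64) - reference.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(generated, reference, max_value=255.0):
    """Peak signal-to-noise ratio in dB (max_value is an assumed intensity range)."""
    err = rmse(generated, reference)
    if err == 0.0:
        return float("inf")
    return float(20.0 * np.log10(max_value / err))

def recall_at_k(similarity, k):
    """R@K in percent. similarity[i, j] scores query i against gallery item j;
    the correct match of query i is assumed to be gallery item i."""
    true_scores = np.diag(similarity)[:, None]
    # Rank of the true match = number of gallery items scoring strictly higher.
    ranks = np.sum(similarity > true_scores, axis=1)
    return float(np.mean(ranks < k) * 100.0)

def recall_at_one_percent(similarity):
    """R@1%: K is 1% of the gallery size (at least 1)."""
    k = max(1, similarity.shape[1] // 100)
    return recall_at_k(similarity, k)
```

With a gallery of 8884 satellite images, for instance, this definition of R@1% counts a query as localized when its true match is among the top 88 candidates.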
All experimental results for comparison methods directly cite the highest metrics reported in their original papers. For methods lacking CVACT test set results, we reproduced them using official code and marked them with ‘†’. Input image resolution is uniformly set to 256 × 256 to align with our method.

2.2. Network Framework

As shown in Figure 1, the overall architecture is divided into two stages: the image generation module (BEV-CycleGAN+CL) and the geolocation module (dynamic threshold pseudo-label self-training). First, the process is based on geometric projection priors, providing structural constraints for subsequent generation. Then, through an adversarial learning mechanism, the generative networks (G, F) and the discriminative network (D) are jointly optimized to learn the pixel probability distribution features of BEV images and satellite images. In addition, an encoder structure (E) is incorporated into the BEV-CycleGAN model, introducing cross-scale positive and negative sample pairs in the feature space and designing a multi-scale contrastive loss function. Through a joint multi-task learning framework, adversarial loss, cycle consistency loss, and multi-scale contrastive loss are dynamically fused, and an adaptive weight allocation strategy is employed to balance the optimization goals of generation quality and structural fidelity. Finally, a pseudo-label self-training method based on dynamic threshold filtering is proposed, which selects high-confidence pseudo-labels through an adaptive confidence threshold and an exponential function dynamic adjustment mechanism, and uses a dynamic hard sample triplet loss for cross-view image matching.
To further clarify the interplay between the two stages, we detail their interaction as follows. The proposed framework comprises two sequential training phases. In the first (image generation) phase, the BEV-CycleGAN+CL model is trained using all paired ground-satellite images. After training, the generator G is fixed and applied to all ground images (including those that will later serve as labeled or unlabeled data) to produce pseudo-satellite images, forming a “generated view” dataset. In the second (geolocation) phase, a semi-supervised learning paradigm is adopted: the training set consists of a small number of labeled real satellite images, a large number of unlabeled real satellite images, and optionally the generated satellite images. The labeled data provide basic supervision, while the unlabeled real data are used to generate pseudo-labels via the dynamic threshold strategy. The generated images can serve as additional query or gallery samples to enhance feature learning. The two modules are indirectly synergistic through the shared ground image features: the generation module improves the structural alignment between ground and satellite views, thereby reducing the difficulty of feature matching for the localization module.
The proposed BEV-CycleGAN model adopts an encoder–decoder generator. The generator takes 256 × 256 × 3 images as input. The first layer is a 7 × 7 convolutional layer with 64 channels, a stride of 1, reflective padding, instance normalization, and ReLU activation. This is followed by two downsampling layers using 3 × 3 convolutions (stride 2), which increase the channel count to 128 and 256, respectively. The core of the generator consists of nine residual blocks; each block contains two 3 × 3 convolutional layers that maintain 256 channels, along with instance normalization and ReLU activation. The decoder performs upsampling via two 3 × 3 transposed convolutional layers (stride 2), reducing the channel dimensions to 128 and then 64. The final output layer uses a 7 × 7 convolution to map the features to 3 channels, employing a Tanh activation function. The generators G (BEV → satellite) and F (satellite → BEV) share this symmetric architecture. The discriminator adopts a PatchGAN structure, comprising five convolutional layers. The first four layers use 4 × 4 kernels with a stride of 2; their channel counts are 64, 128, 256, and 512, respectively. Each is followed by instance normalization and a LeakyReLU activation (negative slope of 0.2). The final layer is a 4 × 4 convolution (stride 1, 1 channel) that outputs a 30 × 30 feature map, discriminating the authenticity of local image patches. For multi-scale contrastive learning, the module shares the encoder part of the generator. It extracts feature maps at three scales: 64 × 64 (128 channels), 32 × 32 (256 channels), and 16 × 16 (512 channels). Each scale’s features are independently projected to a 64-dimensional latent vector via a 1 × 1 convolutional projection head.
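The spatial sizes quoted in this paragraph can be checked with the standard convolution output-size formula. The sketch below is only an arithmetic aid: the padding values (3 for the 7 × 7 layer, 1 for the 3 × 3 and 4 × 4 layers) are assumptions, since the text does not state them, and the function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def stack_out(size, layers):
    """Apply a list of (kernel, stride, padding) layers in sequence."""
    for kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
    return size

# Generator encoder: 7x7 stride-1 layer, then two 3x3 stride-2 layers.
encoder = [(7, 1, 3), (3, 2, 1), (3, 2, 1)]
bottleneck = stack_out(256, encoder)            # size seen by the residual blocks

# Discriminator as described: four 4x4 stride-2 layers, then a 4x4 stride-1 layer.
described = [(4, 2, 1)] * 4 + [(4, 1, 1)]
# Common 70x70 PatchGAN layout: three stride-2 layers, then two stride-1 layers.
common = [(4, 2, 1)] * 3 + [(4, 1, 1)] * 2

described_size = stack_out(256, described)
common_size = stack_out(256, common)
```

Under these assumed paddings the residual blocks operate on 64 × 64 feature maps. The five-layer discriminator with strides 2, 2, 2, 2, 1 yields a 15 × 15 map, whereas the reported 30 × 30 output is reproduced by the common PatchGAN layout with strides 2, 2, 2, 1, 1, so the actual stride or padding settings may differ from those assumed here.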
The image geolocation network uses ConvNeXt-B as its backbone to extract a 1024-dimensional global feature vector. Feature matching is based on cosine similarity. The dynamic threshold update mechanism is governed by an initial smoothing factor $\alpha_0 = 0.5$ and a decay rate $\gamma = 0.9$. The model was trained using the Adam optimizer. The generator’s initial learning rate was 0.0002, which was kept constant for the first 100 epochs before applying linear decay. The localization network’s initial learning rate was 0.001. The weights for the total loss function were set to $\lambda_{cycle} = 10$ and $\lambda_{PC} = 2$. The batch size was set to 8 for the generative adversarial training phase and 128 for the localization network training. To improve generalization, random horizontal flipping was applied as a data augmentation technique during training. An exponential decay strategy was used to dynamically adjust the learning rate, facilitating model convergence.

2.3. Image Generation for Multi-Scale Contrasts

This paper uses a geometric back-projection based on the flat-ground assumption, projecting street view images into a bird’s-eye view (BEV) through an explicit panorama-to-BEV transformation that requires neither depth estimation nor camera parameters. Specifically, given a ground panoramic image, we assume a flat ground plane with the camera located at the center of the BEV plane at height H. For each BEV pixel (i, j), the coordinates of the corresponding ground point P(x, y, z = 0) are determined. The corresponding pitch angle θ and azimuth angle φ are then calculated from the geometric relationships, and each ground pixel is back-projected onto the ground point in the world coordinate system through a spherical projection. Subsequently, these points are projected onto a top-down 2D plane to form the initial BEV representation. This process establishes a deterministic geometric mapping between ground image pixels and BEV plane coordinates, as shown in Figure 2.
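The flat-ground back-projection described above can be sketched as follows. This is a minimal illustration rather than the paper’s implementation: it assumes an equirectangular panorama whose horizon lies at mid-height and whose left edge faces north, uses nearest-neighbour sampling, and the names panorama_to_bev, ground_extent_m, and cam_height_m are hypothetical.

```python
import numpy as np

def panorama_to_bev(pano, bev_size=64, ground_extent_m=20.0, cam_height_m=2.0):
    """Resample an equirectangular panorama onto a top-down grid, assuming flat ground.

    pano: array of shape (H, W) or (H, W, C). The camera sits above the BEV centre.
    """
    pano_h, pano_w = pano.shape[:2]
    half = ground_extent_m / 2.0
    # Metric coordinates of each BEV cell centre (x to the east, y to the north).
    coords = (np.arange(bev_size) + 0.5) / bev_size * ground_extent_m - half
    x, y = np.meshgrid(coords, -coords)      # row 0 of the BEV is the northern edge
    dist = np.hypot(x, y)
    azimuth = np.arctan2(x, y)               # phi: 0 = north, clockwise positive
    pitch = np.arctan2(cam_height_m, dist)   # theta: angle below the horizon
    # Spherical (equirectangular) projection of each viewing ray into the panorama.
    u = ((azimuth / (2.0 * np.pi)) % 1.0) * pano_w
    v = (0.5 + pitch / np.pi) * pano_h
    cols = np.clip(u.astype(int), 0, pano_w - 1)
    rows = np.clip(v.astype(int), 0, pano_h - 1)
    return pano[rows, cols]
```

Because every ground point lies below the horizon, only the lower half of the panorama contributes to the BEV image; cells near the centre sample rows close to the nadir, and distant cells sample rows just below the horizon, which is where the flat-ground assumption is most sensitive to buildings and terrain.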
Although the acquired bird’s eye view (BEV image) is similar to satellite images in terms of geometric structure, there are still many differences between the two in the imaging process [21]. These differences are mainly reflected in various aspects, such as the degree of distortion, lighting conditions, occlusions, and weather conditions. To effectively reduce the imaging differences between these two viewpoint images, the generative CycleGAN model is introduced to further reduce the gap in the imaging effect of the two-view images through its powerful generative adversarial capability to improve the consistency and comparability between the images. The BEV-CycleGAN framework proposed in this paper mainly consists of bidirectional mapping modules, including a generator G from the X domain (BEV image) to the Y domain (satellite view image), and a generator F from the Y domain to the X domain, along with a corresponding discriminator D for adversarial training. The architecture realizes efficient transformation of cross-domain images by constructing bidirectional circular mapping relations, as shown in Figure 2.
(1) GAN Loss
$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}\left[\log D_Y(y)\right] + \mathbb{E}_{x \sim p_{data}(x)}\left[\log\left(1 - D_Y(G(x))\right)\right] \tag{1}$$
where $G$ is the generator from the BEV domain ($X$) to the satellite domain ($Y$), so that $G(x)$ is the generated satellite image; $D_Y$ is the discriminator in the satellite domain ($Y$), used to distinguish real satellite images $y$ from generated images; and $X$ and $Y$ denote the data distributions of BEV images and real satellite images, respectively.
(2) Cycle-consistency loss
$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\left[\lVert F(G(x)) - x \rVert_1\right] + \mathbb{E}_{y \sim p_{data}(y)}\left[\lVert G(F(y)) - y \rVert_1\right] \tag{2}$$
where $F$ denotes the generator from the satellite domain ($Y$) to the BEV domain ($X$). This loss ensures cycle consistency during the conversion, meaning that an image transformed by one generator and then restored by the other should be as close as possible to the original image, thereby preserving key content information.
However, CycleGAN tends to capture low-frequency features (color, illumination) and has difficulty retaining high-frequency details (edges and texture). Because it lacks explicit structural constraints, the generated images are often blurred and prone to implausible deformations, especially in complex texture regions (e.g., buildings and roads). Therefore, this study introduces contrastive learning, which pulls similar samples together in the representation space and pushes dissimilar samples apart. The aim is to enhance the model’s ability to discriminate image features so that it better captures subtle differences between views. To construct positive and negative samples, N + 1 patches are randomly selected from the input image $x$, and the patch at the location corresponding to one of them is taken from the generated satellite image $G(x)$. The two corresponding patches form the positive pair, whereas the other N patches in $x$ serve as negative samples. The InfoNCE (noise-contrastive estimation) loss is then used: it maximizes the similarity of the positive pair and minimizes the similarity between the anchor and the N negatives, which reduces interference and improves the accuracy of the feature representation. Specifically, the anchor, the positive, and the N negatives are first mapped to K-dimensional vectors denoted $v$, $v^+$, and $v^-$, respectively. Here $v$ is the feature vector extracted from the generated image $G(x)$ (anchor), $v^+$ is the feature vector extracted from the same location in the input image $x$ (positive sample), and $v^-_n$ is the feature vector extracted from the $n$th other location in the input image $x$ (negative sample):
$$\ell(v, v^+, v^-) = -\log \frac{\exp\left(\mathrm{sim}(v, v^+)/\tau\right)}{\exp\left(\mathrm{sim}(v, v^+)/\tau\right) + \sum_{n=1}^{N} \exp\left(\mathrm{sim}(v, v^-_n)/\tau\right)} \tag{3}$$
where $\mathrm{sim}(v, v^+)$ denotes the cosine similarity between $v$ and $v^+$, $v^-_n$ represents the $n$th negative sample, and $\tau$ is a temperature parameter (default value 0.07) that controls the concentration of the similarity distribution.
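As a numerical illustration of the InfoNCE term for a single anchor, the following NumPy sketch evaluates the loss with cosine similarity and τ = 0.07. The function names are hypothetical, and the maximum logit is subtracted only for numerical stability, which does not change the value of the loss.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity along the last axis (supports broadcasting)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE loss for one anchor (K,), one positive (K,), and negatives (N, K)."""
    pos = cosine_sim(anchor, positive) / tau
    neg = cosine_sim(anchor[None, :], negatives) / tau
    logits = np.concatenate([[pos], neg])
    logits = logits - logits.max()           # stabilise the exponentials
    # -log of the softmax probability assigned to the positive (index 0).
    return float(np.log(np.sum(np.exp(logits))) - logits[0])
```

The loss is close to zero when the anchor matches its positive far better than any negative, and it grows roughly linearly in 1/τ when a negative is more similar to the anchor than the positive is.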
(3) Multi-scale contrastive loss
To enhance the feature alignment between the generated images and real satellite images, a multi-scale contrastive loss is designed in this study. By comparing the feature representations of the generated and real images at different scales, this loss maximizes the similarity of positive sample pairs while minimizing the similarity of negative sample pairs. For an image $x$, features are extracted with the generator’s encoder $G_{enc}$ and the projection module $C$, and embedded in a feature stack $\{z_l\}_L = \{C_l(G_{enc}^l(x))\}_L$, where $l$ indexes the $l$th layer selected from $G_{enc}$. Each spatial location of a layer’s feature map corresponds to a different image patch, so the feature stack represents patches at several scales. Spatial locations are indexed by $s \in \{1, \ldots, S_l\}$, where $S_l$ is the number of spatial locations in layer $l$. Each time an anchor is selected, its feature is denoted $\hat{z}_l^s \in \mathbb{R}^{C_l}$, where $C_l$ is the number of channels in layer $l$. The corresponding feature (positive) is denoted $z_l^s \in \mathbb{R}^{C_l}$, and the features at all other locations (negatives) are denoted $z_l^{S \setminus s} \in \mathbb{R}^{(S_l - 1) \times C_l}$. The goal is to match corresponding patches (positives) of the generated and real images while pushing the other patches (negatives) away from the anchor. Therefore, the patch-wise multi-layer contrastive loss for the mapping X → Y (i.e., BEV image → satellite view image) is given by
$$\mathcal{L}_{PC}(G, C, X) = \mathbb{E}_{x \sim X} \sum_{l=1}^{L} \sum_{s=1}^{S_l} \ell\left(\hat{z}_l^s, z_l^s, z_l^{S \setminus s}\right) \tag{4}$$
Here, $L$ denotes the number of selected feature layers, and $S_l$ is the total number of spatial positions on the feature map of layer $l$. $\hat{z}_l^s$ denotes the feature vector extracted at position $s$ on the feature map of layer $l$ of the generated image $G(x)$ (anchor). $z_l^s$ denotes the feature vector extracted at the same position $s$ on the feature map of layer $l$ of the real satellite image (positive sample). $z_l^{S \setminus s}$ denotes the set of feature vectors extracted at all positions other than $s$ on the feature map of layer $l$ of the real satellite image (negative samples). The term $\ell(\hat{z}_l^s, z_l^s, z_l^{S \setminus s})$ is the contrastive loss of Equation (3), calculated at each position $s$ and for each layer $l$, with the aim of making the anchor similar to the positive sample and dissimilar to the negative samples.
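A compact way to picture the double sum over layers and positions is to compute, for each layer, an S × S similarity matrix between anchor features and reference features: the diagonal holds the positives and the off-diagonal entries of each row hold the negatives. The sketch below makes that concrete under stated assumptions: features are already projected and flattened to shape (S_l, C_l), one image is processed at a time, and the function names are hypothetical.

```python
import numpy as np

def patch_contrastive_layer(anchor_feats, ref_feats, tau=0.07):
    """Sum of per-position InfoNCE losses for one layer.

    anchor_feats, ref_feats: (S, C) arrays; row s of both refers to the same location.
    """
    a = anchor_feats / np.linalg.norm(anchor_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    logits = a @ r.T / tau                               # (S, S) cosine similarities / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries are the positives; summing them covers all positions s.
    return float(-np.trace(log_prob))

def multiscale_patch_loss(anchor_stack, ref_stack, tau=0.07):
    """Sum the per-layer losses over all selected layers (one image)."""
    return sum(patch_contrastive_layer(a, r, tau)
               for a, r in zip(anchor_stack, ref_stack))
```

In practice a training loop would average this quantity over a batch to approximate the expectation, and would usually sample a subset of positions per layer to keep the S × S matrix small; both choices are implementation details not specified in the text.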
(4) Total loss function
To jointly optimize the loss functions of the multiple tasks, this study defines a total loss function that combines the adversarial loss, the cycle-consistency loss, and the multi-scale contrastive loss, as shown in Equation (5):
$$\mathcal{L}_{total} = \mathcal{L}_{GAN} + \lambda_{cycle}\,\mathcal{L}_{cyc} + \lambda_{PC}\,\mathcal{L}_{PC} \tag{5}$$
Here, $\lambda_{cycle}$ and $\lambda_{PC}$ are hyperparameters that balance the weights of the individual losses. By jointly optimizing these loss functions, the model can simultaneously improve the quality of the generated images, the feature alignment capability, and the accuracy of cross-view matching.

2.4. Dynamic Thresholding for Geo-Localization of Pseudo-Tagged Images

Figure 3 illustrates the filtering process for dynamic threshold pseudo labels: the dynamic threshold δ_t is determined from the confidence distribution, and high-confidence samples are selected as pseudo labels. Pseudo-label self-training helps a model learn from unlabeled data by treating its own predictions on that data as labels. However, pseudo-label noise and the inevitable inconsistency between generated images and real satellite images make raw predictions unsuitable for direct supervision. To address noisy pseudo-labels and the uneven distribution of categories in cross-view geolocation tasks, this paper proposes a pseudo-label self-training method based on dynamic threshold filtering. It sets an adaptive confidence threshold and adjusts it with an exponential function to follow the trend of the model’s learning ability. The method comprises the following four steps.
Step 1: Confidence Calculation and Distribution Estimation.
For a batch of unlabeled real satellite images $u_i$, first extract their features using the current geolocation model. Calculate the cosine similarity between each unlabeled image feature and all candidate satellite image features (including both real and generated images). The maximum similarity is taken as the confidence level $c_i$ for that sample. Then, divide the confidence range into K = 10 equal intervals: $[0, 0.1), [0.1, 0.2), \ldots, [0.9, 1.0]$. Count the number of samples in each interval to obtain the approximate probability density distribution of the confidence values.
Step 2: Density Peak Threshold Selection.
Let $C = \{c_1, c_2, \ldots, c_N\}$ denote the confidence set of the N unlabeled samples in the current batch. Divide $[0, 1]$ into K equal intervals $I_k = \left[\frac{k-1}{K}, \frac{k}{K}\right)$ for $k = 1, \ldots, K - 1$, with the last interval closed: $I_K = \left[\frac{K-1}{K}, 1\right]$. Let $f_k = |\{c_i \in I_k\}|$ denote the number of samples falling within interval $I_k$. The density peak interval is defined as the interval with the highest frequency: $k^* = \arg\max_k f_k$. The candidate threshold, written here as $\tilde{\delta}_t$ to distinguish it from the smoothed threshold $\delta_t$, is taken as the right endpoint of this interval: $\tilde{\delta}_t = k^*/K$. This design is motivated by the observation that reliable samples tend to have confidence scores concentrated near the peak; selecting the right endpoint of this interval retains high-confidence samples while discarding low-confidence noise. For instance, with K = 10, if the peak interval is $I_{k^*}$ with $k^* = 9$ (i.e., the interval $[0.8, 0.9)$), then $\tilde{\delta}_t = 0.9$. Typically, in cross-view tasks, the model’s confidence rarely exceeds 0.95, and $\tilde{\delta}_t$ often lies within 0.7–0.9, reflecting the inherent difficulty of the task and the model’s current discriminative capability.
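The density-peak rule can be sketched with a histogram over K equal bins. This is a minimal illustration with a hypothetical function name; ties between equally full bins are resolved in favour of the lowest bin, which is an assumption since the text does not specify tie handling.

```python
import numpy as np

def density_peak_threshold(confidences, num_bins=10):
    """Right endpoint k*/K of the most populated confidence bin.

    np.histogram uses half-open bins [a, b) except for the last one, which is
    closed, matching the interval definition I_k above.
    """
    counts, _ = np.histogram(confidences, bins=num_bins, range=(0.0, 1.0))
    peak = int(np.argmax(counts))            # 0-based index of the fullest bin
    return (peak + 1) / num_bins

# Example: most confidences fall in [0.8, 0.9), so the candidate threshold is 0.9.
batch_conf = np.array([0.82, 0.85, 0.88, 0.81, 0.95, 0.40, 0.72])
candidate = density_peak_threshold(batch_conf)
kept = batch_conf[batch_conf >= candidate]
```

In this example only the sample with confidence 0.95 passes the candidate threshold, which shows that the right-endpoint rule excludes the peak bin itself and is therefore a conservative choice.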
Step 3: Exponential Weighted Moving Average Smoothing.
To avoid sharp fluctuations in the threshold caused by a single batch, an exponential weighted moving average is applied to smooth the candidate thresholds:
$$\delta_t = \alpha_t \cdot \delta_{t-1} + (1 - \alpha_t) \cdot \tilde{\delta}_t \tag{6}$$
where $\delta_{t-1}$ is the previous smoothed threshold, $\tilde{\delta}_t$ is the candidate threshold of the current batch, and $\alpha_t$ is the weight balancing the two, calculated by Formula (7):
$$\alpha_t = \alpha_0 \cdot \left(1 - \frac{t}{T}\right)^{\gamma} \tag{7}$$
In the formula, $\alpha_0 = 0.5$ is the initial weight, $\gamma = 0.9$ is the decay rate, $t$ is the current iteration count, and $T$ is the total iteration count. With this schedule, the historical threshold receives its largest weight ($\alpha_t \approx \alpha_0$) during the early training phase, when model prediction noise is high, which limits the introduction of erroneous pseudo labels. As training progresses and model confidence stabilizes, $\alpha_t$ decays toward 0 and the current threshold gradually takes precedence, enabling the threshold to adapt rapidly to improvements in model capability.
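The smoothing step can be written in a few lines. The sketch below assumes the weight schedule has the form α_0 · (1 − t/T)^γ, with α_0 = 0.5 and γ = 0.9 as stated in the text; the function names are hypothetical.

```python
def smoothing_weight(t, total_iters, alpha0=0.5, gamma=0.9):
    """Weight of the historical threshold at iteration t (decays from alpha0 to 0)."""
    return alpha0 * (1.0 - t / total_iters) ** gamma

def update_threshold(prev_threshold, candidate, t, total_iters,
                     alpha0=0.5, gamma=0.9):
    """Blend the previous threshold with the current batch's candidate threshold."""
    alpha = smoothing_weight(t, total_iters, alpha0, gamma)
    return alpha * prev_threshold + (1.0 - alpha) * candidate
```

For example, with a previous threshold of 0.7 and a candidate of 0.9, the update at t = 0 gives 0.5 · 0.7 + 0.5 · 0.9 = 0.8, while at t = T the weight on the history is zero and the update returns the candidate 0.9 unchanged.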
Step 4: Pseudo-label Selection and Training Scheduling.
Finally, select samples with confidence scores $c_i \ge \delta_t$ as pseudo labels and incorporate them into the next training round. The threshold $\delta_t$ is updated every 50 iterations. During the warm-up phase covering the first 20% of iterations, no pseudo labels are used; the model is trained solely on the limited labeled data to establish a reasonable initial feature representation. The complete workflow is summarized in Algorithm 1.
Algorithm 1: Dynamic Thresholding for Pseudo-Label Self-Training
Input:
1: Unlabeled sample set $U$, current model $M$, historical threshold $\delta_{t-1}$
2: Total iterations $T$, current iteration $t$, decay rate $\gamma = 0.9$, initial weight $\alpha_0 = 0.5$
3: Number of confidence intervals $K = 10$
Output: Updated threshold $\delta_t$, selected pseudo-label set $PL$
1: Extract features for each sample in $U$ using model $M$.
2: For each sample, compute the cosine similarity vector $s$ between its feature and all candidate satellite image features, and take the maximum value as the confidence $c = \max(s)$.
3: Divide the confidence interval [0, 1] into $K$ equal intervals.
4: Count the frequency of confidence values falling into each interval for all samples in the current batch, forming an approximate probability density distribution.
5: Find the interval with the highest frequency (density peak) and denote its right endpoint as the current batch threshold $\hat{\delta}_t$.
6: Calculate the weight factor using exponential decay: $\alpha_t = \alpha_0 \cdot (1 - t/T)^{\gamma}$.
7: Update the global dynamic threshold: $\delta_t = \alpha_t \cdot \delta_{t-1} + (1 - \alpha_t) \cdot \hat{\delta}_t$.
8: Select samples satisfying $c \geq \delta_t$ as the pseudo-label set $PL$.
9: Return $\delta_t$ and $PL$.
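Steps 2 and 8 of Algorithm 1 (confidence from the maximum cosine similarity, followed by threshold-based selection) can be sketched as below. The sketch uses plain Python lists so that it stays dependency-free; a practical version would operate on PyTorch feature tensors. The function names and the returned tuple layout are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_pseudo_labels(query_feats, gallery_feats, threshold):
    """Return [(query_idx, gallery_idx, confidence)] for queries with confidence >= threshold."""
    if not gallery_feats:
        raise ValueError("gallery_feats must be non-empty")
    selected = []
    for qi, q in enumerate(query_feats):
        sims = [cosine_similarity(q, g) for g in gallery_feats]
        # The best-matching satellite feature becomes the pseudo label;
        # its similarity is the confidence c = max(s).
        best = max(range(len(sims)), key=lambda j: sims[j])
        if sims[best] >= threshold:
            selected.append((qi, best, sims[best]))
    return selected
```

Raising the threshold shrinks the returned set, which is the quality-versus-quantity trade-off that the dynamic threshold is designed to balance.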
In cross-view geolocation, triplet loss is commonly used to enforce that anchor samples (ground images) are closer to positive samples (satellite images of the same location) than to negative samples (satellite images of different locations). However, standard triplet loss treats all samples equally and overlooks the critical role of hard examples—samples that lie near the decision boundary. Hard positives are positive samples with the lowest similarity to the anchor, while hard negatives are negative samples with the highest similarity to the anchor. Mining these hard examples forces the model to learn more discriminative features.
To this end, this study proposes a dynamic hard-sample triplet loss that explicitly strengthens the model’s ability to discriminate complex cross-view relationships by adaptively mining hard positive and hard negative samples. Within each batch, cosine similarity is used to score each anchor against its positive and negative samples. Based on these scores, the positive sample with the smallest similarity to the anchor is selected as the hard positive, and the negative sample with the largest similarity is selected as the hard negative. This selection mechanism improves the model’s ability to distinguish samples near the decision boundary. A dynamic weight adjustment mechanism is also introduced to assign higher loss weights to hard samples. See Equation (8).
$$L_{triplet} = \sum_i \max\left(d(a_i, p_i) - d(a_i, n_i) + \alpha,\ 0\right) \quad (8)$$
where $a_i$ is the anchor image, $p_i$ is the positive sample, $n_i$ is the negative sample, $d(x, y)$ is the feature distance between samples $x$ and $y$, and $\alpha$ is the margin hyperparameter, which controls the required separation between the positive and negative distances. To enhance the contribution of hard samples to training, a dynamic weight adjustment mechanism adjusts each sample’s loss weight according to the change in its similarity. Specifically, when a sample is recognized as a hard sample, its loss weight is increased so that it has a greater impact on model training. By focusing on hard samples, the model can more accurately distinguish cross-view image pairs that are visually different but geographically identical, which alleviates overfitting. In addition, the dynamic selection mechanism avoids the local optima caused by a fixed sample selection and enhances model generalization.
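The hard-sample mining and the margin term of Equation (8) can be illustrated as follows. The sketch assumes the feature distance is one minus cosine similarity, which is consistent with the similarity-based mining described above but is not stated explicitly. The dynamic loss weighting is omitted because its exact rule is not given here, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Assumed feature distance: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm > 0 else 0.0)


def hard_triplet_loss(anchors, positives, negatives, margin=0.2):
    """Sum over anchors of max(d(a, p_hard) - d(a, n_hard) + margin, 0).

    positives[i] and negatives[i] are the candidate feature lists for anchors[i].
    """
    total = 0.0
    for a, pos, neg in zip(anchors, positives, negatives):
        d_pos = max(cosine_distance(a, p) for p in pos)  # hard positive: least similar
        d_neg = min(cosine_distance(a, n) for n in neg)  # hard negative: most similar
        total += max(d_pos - d_neg + margin, 0.0)
    return total
```

An anchor contributes zero loss once its hardest negative is farther away than its hardest positive by at least the margin, so the gradient concentrates on triplets near the decision boundary.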

2.5. Training Details

All experiments were implemented using the PyTorch (v 1.12.0) framework and trained on an NVIDIA RTX 4090Ti GPU. The training settings for the image generation and geolocation stages are as follows:
(1) Image Generation Stage
This stage utilized all paired images from the CVUSA and CVACT datasets (training/test splits defined in Section 2.1). All input images were resized to 256 × 256 pixels and underwent data augmentation via random horizontal flipping. Both the generator and discriminator employed the Adam optimizer with an initial learning rate of 0.0002, held constant for the first 100 epochs before linearly decaying to zero. The batch size was set to 8. The total loss weights were set to λ_cycle = 10 and λ_PC = 2. The temperature parameter τ was set to 0.07 by default in the contrastive loss. Training was conducted for 200 epochs.
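The learning-rate schedule described for this stage can be written as a small function. This is a sketch under the assumption that the rate decays linearly from its initial value at epoch 100 to exactly zero at epoch 200; the function name `generator_lr` and the 0-based epoch convention are illustrative.

```python
def generator_lr(epoch, base_lr=2e-4, constant_epochs=100, total_epochs=200):
    """Learning rate at a 0-based epoch: constant first, then linear decay to zero."""
    if epoch < constant_epochs:
        return base_lr
    decay_epochs = total_epochs - constant_epochs
    remaining = max(total_epochs - epoch, 0)
    return base_lr * remaining / decay_epochs
```

In PyTorch, the same shape could be passed as a multiplicative factor to `torch.optim.lr_scheduler.LambdaLR` by dividing the returned value by `base_lr`.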
(2) Geolocation Phase
This phase employed a semi-supervised setup in which the training data, drawn from the complete training set, comprised a small number of labeled samples and a large number of unlabeled samples. We randomly retained a specified proportion (e.g., 1%, 5%, 10%, 20%, 50%, 75%, 100%) of samples as labeled data, while the remaining samples had their labels removed and were used as unlabeled data for training. The test set remained fully intact throughout and was not used for training, serving solely for final performance evaluation. Input images were uniformly resized to 256 × 256. Data augmentation included random horizontal flipping and color jittering. The feature extraction network was ConvNeXt-B, trained with the Adam optimizer, an initial learning rate of 0.001, and cosine annealing decay. The batch size was 128 (with the labeled-to-unlabeled sample ratio dynamically adjusted based on the annotation proportion). The triplet loss margin α was set to 0.2, and the temperature parameter τ defaulted to 0.07 in similarity calculations. The total number of training iterations T was determined by the annotation ratio (e.g., 50 epochs for 1% annotation, 120 epochs for 100% annotation), with network parameters updated at every iteration. The global threshold δ_t was updated every 50 iterations. Each iteration selected a number of pseudo labels equal to the batch size (i.e., 128) from the unlabeled samples for training, using the selection criteria detailed in Section 2.4. Pseudo labels were not used during the first 20% of iterations, in which only labeled data were used for warm-up; afterwards, they were gradually introduced according to the dynamic threshold mechanism described in Section 2.4.
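The warm-up and threshold-update schedule described above reduces to two predicates on the iteration counter. The sketch below assumes 0-based iteration indices and assumes that the threshold is not refreshed during warm-up, which the text does not state explicitly; both function names are hypothetical.

```python
def pseudo_labels_enabled(t, total_iters, warmup_fraction=0.2):
    """True once iteration t (0-based) is past the labeled-only warm-up phase."""
    return t >= int(warmup_fraction * total_iters)


def should_update_threshold(t, total_iters, period=50, warmup_fraction=0.2):
    """True on iterations where the global threshold is refreshed (assumed: after warm-up only)."""
    return pseudo_labels_enabled(t, total_iters, warmup_fraction) and t % period == 0
```

With 1000 total iterations, pseudo labels are first admitted at iteration 200, and the threshold is then refreshed at iterations 200, 250, 300, and so on.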

3. Results

3.1. Cross-View Image Generation

To validate the effectiveness of the self-supervised contrastive learning method for cross-view image generation, we removed each component of the method in turn, namely the BEV viewpoint transformation and contrastive learning. The ablation experiments were conducted on the CVUSA and CVACT datasets and compared four variants: BEV_CycleGAN with contrastive learning (BEV_CycleGAN+CL), BEV_CycleGAN, BEV, and CycleGAN. The results are shown in Figure 4.
Figure 4 presents a qualitative comparison of cross-view image translation results obtained with the different methods. The images generated by BEV_CycleGAN+CL exhibit the highest visual similarity to real satellite imagery and outperform BEV_CycleGAN, BEV, and CycleGAN in both image quality and detail recovery, with relatively accurate spatial structures in road areas (red boxes) and vegetation areas (yellow boxes) and richer fine-grained textures (blue boxes). In contrast, the images produced by the traditional CycleGAN are of relatively low quality, with noticeable blurring and distortion. While BEV and BEV_CycleGAN better preserve spatial consistency between ground and satellite images during generation, significant local structural distortions remain. The results indicate that the contrastive learning strategy enhances the model’s ability to capture image details by introducing additional similarity constraints during training. The proposed method uses the BEV transformation as its foundation and introduces CycleGAN into the framework; this design reduces the errors caused by direct cross-view conversion between ground and satellite perspectives. Specifically, for typical land cover types such as roads and vegetation, the method delivers substantial improvements in image quality, visual clarity, and structural integrity, which improves the accuracy of structural reconstruction and raises the overall quality of the generated images.
However, despite these improvements, residual geometric distortions—particularly slight tilting of building facades and edge blurring in complex structures—remain visible in densely built areas due to the inherent limitations of the flat-ground assumption underlying BEV projection. These residuals are most noticeable in the building regions (blue box) but are substantially mitigated compared to the BEV and BEV-CycleGAN baselines.
This paper systematically evaluates the cross-view image generation performance of four methods—BEV_CycleGAN+CL, BEV_CycleGAN, BEV, and CycleGAN—on the CVACT and CVUSA datasets using quantitative metrics (PSNR, SSIM, RMSE), as shown in Table 1.
Experimental results show that on the CVACT dataset, BEV_CycleGAN+CL achieves PSNR improvements of 2.83%, 16.02%, and 42.30% compared to BEV_CycleGAN, BEV, and CycleGAN, respectively, indicating that BEV_CycleGAN+CL generates images with the highest fidelity. SSIM improved by 6.12%, 8.00%, and 18.48%, respectively, indicating that BEV_CycleGAN+CL exhibits the highest spatial structural consistency with real satellite images; RMSE decreased by 9.28%, 15.51%, and 25.35%, respectively, demonstrating the smallest spatial offset between the images generated by BEV_CycleGAN+CL and real satellite images. On the CVUSA dataset, BEV_CycleGAN+CL outperformed BEV_CycleGAN, BEV, and CycleGAN with PSNR improvements of 3.76%, 19.71%, and 41.45%, SSIM improvements of 6.57%, 8.06%, and 16.28%, and RMSE reductions of 9.94%, 14.16%, and 22.32%, respectively. These results show that BEV_CycleGAN+CL achieves the best performance in image fidelity, structural consistency with real images, and offset from real images. They indicate that incorporating the BEV module and the contrastive learning strategy addresses the shortcomings of traditional methods in image detail and global structure and provides a more efficient and reliable approach for cross-view image generation.
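For reference, the pixel-level metrics and the percentage comparisons reported above can be computed as in the following sketch. It assumes 8-bit images flattened to sequences of pixel values and assumes that the reported percentages are relative changes with respect to the baseline; SSIM is omitted because it requires a windowed implementation. All function names are illustrative.

```python
import math


def rmse(generated, reference):
    """Root-mean-square error between two equally sized flat pixel sequences."""
    if len(generated) != len(reference) or not generated:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((g - r) ** 2 for g, r in zip(generated, reference)) / len(generated)
    return math.sqrt(mse)


def psnr(generated, reference, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = rmse(generated, reference)
    return math.inf if err == 0 else 20.0 * math.log10(max_value / err)


def relative_change(new, baseline):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline
```

A positive `relative_change` corresponds to an improvement for PSNR and SSIM, whereas for RMSE an improvement appears as a negative value.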
Although BEV-CycleGAN+CL significantly improves overall generation quality, the flat-ground assumption underlying BEV transformations still introduces noticeable geometric distortions in densely built areas. As shown in the third column of Figure 4, building outlines exhibit slight tilting or edge blurring. Compared to BEV-CycleGAN, introducing contrastive learning (BEV-CycleGAN+CL) partially mitigates such distortions through feature-layer consistency constraints. However, structural misalignments remain unresolved. Quantitative analysis on samples containing high-rise buildings reveals that BEV-CycleGAN+CL achieves a PSNR approximately 2.1 dB lower than on suburban scenes, indicating that geometric errors still affect generation quality. Synthetic images also exhibit subtle differences from real images in transient objects (e.g., vehicles, pedestrians). As shown in the blue-boxed regions of Figure 4, the baseline CycleGAN frequently generates false objects (the “hallucination” phenomenon), while BEV-CycleGAN+CL significantly reduces such artifacts, indicating that contrastive learning suppresses fabricated inconsistencies. However, slight shifts in texture details persist. Among samples exhibiting hallucinations, the R@1 of subsequent retrieval decreased by approximately 3.2 percentage points, indicating that residual domain gaps remain. This analysis shows that contrastive learning plays an important role in suppressing geometric distortions and hallucinations, though future work should integrate finer-grained geometric modeling or semantic priors to further narrow domain discrepancies.
To quantify the computational overhead introduced by the multi-scale contrastive loss L_PC, we compared the single-iteration training time of BEV_CycleGAN and BEV_CycleGAN+CL under identical hardware conditions (NVIDIA RTX 4090Ti). On the CVUSA dataset, BEV_CycleGAN averaged 0.32 s per iteration, increasing to 0.41 s with L_PC, a 28.1% rise. On the CVACT dataset, the time increased from 0.35 s to 0.45 s, representing a 28.6% increase. This additional overhead primarily stems from feature extraction across three feature layers (64 × 64, 32 × 32, 16 × 16) and InfoNCE loss computation. However, considering the significant improvements in generated image quality and localization accuracy, this trade-off is acceptable.
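The two overhead figures follow directly from the per-iteration timings quoted above, as the short calculation below confirms:

```latex
\frac{0.41 - 0.32}{0.32} = \frac{0.09}{0.32} \approx 28.1\%, \qquad
\frac{0.45 - 0.35}{0.35} = \frac{0.10}{0.35} \approx 28.6\%
```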

3.2. Image Geolocalization Results

On the CVUSA and CVACT datasets, the proposed model is compared with several state-of-the-art methods, as shown in Table 2: SAFA (Spatial Aware Feature Aggregation) [4], LPN (Local Pattern Network) [23], DSM (Dynamic Similarity Matching), L2LTR (layer-to-layer Transformer) [24], GeoDTR (Geometric Disentangled Transformer) [25], TransGeo [26], and Sample4G [27].
As shown, the proposed dynamic threshold pseudo-label geolocation model achieves state-of-the-art performance across all evaluation metrics (R@1, R@5, R@10, and R@1%) on both datasets. Among the compared methods, Sample4G achieved the highest positioning accuracy on both datasets and therefore serves as the strongest baseline. On the CVUSA dataset, the proposed model outperformed Sample4G by 0.26, 0.05, 0.10, and 0.02 percentage points in R@1, R@5, R@10, and R@1%, respectively. On the CVACT Val dataset, the proposed model outperformed Sample4G by 2.32, 1.95, 0.16, and 0.61 percentage points in R@1, R@5, R@10, and R@1%, respectively. On the CVACT Test dataset, the proposed model surpassed Sample4G by 1.88, 0.86, 0.18, and 0.26 percentage points in R@1, R@5, R@10, and R@1%, respectively.
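The recall metrics used in this comparison can be computed from the rank of the true satellite image for each query, as in the sketch below. It assumes 1-based ranks and assumes that R@1% counts a hit when the true match falls within the top 1% of the gallery, rounded up; the function names are illustrative.

```python
import math


def recall_at_k(ranks, k):
    """Percentage of queries whose true match has a 1-based rank <= k."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def recall_at_one_percent(ranks, gallery_size):
    """R@1%: the true match lies within the top 1% of the gallery (assumed rounded up)."""
    k = max(1, math.ceil(0.01 * gallery_size))
    return recall_at_k(ranks, k)
```

For a gallery of 8884 satellite images, the R@1% cutoff under this convention is rank 89.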
To comprehensively evaluate the practicality of our method, we analyzed the computational overhead and efficiency of the model. Table 3 compares the number of parameters (Params) and floating-point operations (FLOPs) between our method and several representative approaches.
During the geolocation phase, the feature extraction network ConvNeXt-B employed in this paper has 25.8 million parameters, with a single forward inference requiring approximately 14.5 GFLOPs. On an NVIDIA RTX 4090Ti GPU, the average time to retrieve a query image (including feature extraction and similarity calculation) is 0.18 s, meeting real-time requirements. Updates to the dynamically thresholded pseudo labels occur every 50 iterations, with negligible additional computational overhead. As shown in Table 3’s parameter and FLOPs comparison, our method achieves the highest localization accuracy (CVUSA R@1 98.94%, CVACT Test R@1 72.42%) while maintaining comparable parameters (25.8 M) and computational cost (14.5 G) to mainstream approaches. For instance, GeoDTR has 26.7 M parameters and 15.2 G FLOPs, while Sample4G has 24.6 M parameters and 13.9 G FLOPs. Our method slightly exceeds Sample4G in parameters but achieves higher accuracy, demonstrating a favorable balance between precision and efficiency. This further validates the feasibility and superiority of our approach in practical applications.
To validate the model’s reliability in pseudo-label adaptive filtering, we conducted semi-supervised training on the CVACT dataset using true label coverage rates of 1%, 5%, 10%, 20%, 50%, 75%, and 100%, and compared the results with the UCVGL [20] method (Table 4). Experimental results demonstrate that even in an extremely low-resource scenario with only 1% labeled data, our method achieves an R@1 of 70.30%, outperforming UCVGL’s 68.29%. As the labeled proportion increases from 10% to 20%, all metrics stabilize, indicating that the model requires only a small number of labeled samples to generate reliable pseudo labels for the majority of unlabeled images. When the annotation proportion increases from 20% to 75%, the R@1 of our method increases from 82.19% to 85.78%, approaching the fully supervised level of 86.32%, while UCVGL only increases from 79.60% to 83.88%. At all proportions, our method outperforms UCVGL, especially at 50% and 75%, where R@1 is 1.74 and 1.90 percentage points higher, respectively. These results demonstrate that the adaptive pseudo-label filtering method not only alleviates the shortage of labeled data and reduces manual annotation costs in low-resource scenarios, but also continues to leverage unlabeled data to improve model performance at medium and high labeling rates, showing good scalability and robustness.
The consistent superiority of our method across both datasets and evaluation metrics can be attributed to three key factors. First, the BEV-CycleGAN+CL module generates pseudo-satellite images with significantly improved structural fidelity (Table 1), which effectively bridges the domain gap between ground and satellite views. This provides the geolocation network with inputs that are already roughly aligned, thereby simplifying the feature matching task. Second, the dynamic threshold pseudo-labeling strategy adaptively selects high-confidence samples based on the evolving confidence distribution, avoiding the noise accumulation inherent in fixed-threshold methods. This is particularly evident in low-label regimes (e.g., 1% labeled data in Table 4), where our method outperforms UCVGL by 2.01 percentage points in R@1, as the warm-up phase and EMA smoothing prevent early-stage error propagation. Third, the proposed dynamic hard sample triplet loss explicitly focuses on ambiguous pairs near the decision boundary, further refining the feature embedding space. The synergy between high-quality generated images, adaptive pseudo-label selection, and hard sample mining enables our model to achieve state-of-the-art performance on both CVUSA and CVACT, with especially notable gains in the most challenging scenarios (e.g., CVACT test set, where our R@1 reaches 72.42% vs. 71.57% for Sample4G). Furthermore, the stability of our method is evidenced by the controlled threshold update process in Figure 5, which indicates that the training process is robust to batch-wise fluctuations.

4. Discussion

(1) Dynamic Threshold Adversarial Self-Training Evaluation
This paper draws inspiration from the concept of gradient updates in deep learning networks, employing an exponential moving average (EMA) weighting strategy for calculations. To validate whether the selected value of γ influences the weights, different values (such as γ = 0.9, 0.7, 0.5, 0.4) were chosen, as shown in Figure 5.
Figure 5 depicts the weight decay curves, with the horizontal axis representing the number of iterations and the vertical axis representing the weight. Curves of different colors correspond to distinct values of γ (e.g., γ = 0.9, 0.7, 0.5, 0.4). When γ is large (e.g., 0.9), the weight factor decreases more slowly, indicating that sample importance remains high throughout the iterations, which is suitable for more stable training. When γ is smaller (e.g., 0.4), the weight decreases rapidly, indicating greater influence in the early stages and reduced importance later on, which is suitable for quickly adjusting learning strategies. This decay strategy dynamically adjusts sample weights: during the early training phase, when the network’s predictive capability is weak, smaller weights are assigned to mitigate noise effects; as training progresses and predictive capability improves, weights gradually increase, enhancing model stability. The choice of γ affects training effectiveness: excessively large values may slow model learning, while excessively small values may cause premature convergence, increasing susceptibility to local optima. The decay rate (γ = 0.9) controls the speed at which the weight α_t decreases, ensuring that the weight decays gradually throughout the entire training process.
Figure 6 compares the performance of different threshold filtering methods. The dynamic threshold method effectively controls the number of pseudo labels while maintaining accuracy. The first two plots [20] show pseudo labels filtered by the difference between, or the sum of, the highest and second-highest retrieval similarity scores. Dynamic threshold filtering not only achieves high accuracy but also allows the pseudo-label accuracy to increase gradually as the threshold rises. Filtering pseudo labels using the discriminative power between scores eliminates erroneous labels more effectively than a single threshold, but this approach sacrifices label quantity for a stable increase in accuracy. For dynamic threshold filtering, the number of labels also decreases as the threshold increases, yet the accuracy improvement is the most significant (61.35% → 90.38%), resulting in a final accuracy substantially higher than that of the previous two methods. By dynamically adapting to data distributions, the dynamic threshold maximizes the filtering of erroneous pseudo labels while controlling label quantity, achieving the pseudo-label quality that self-training requires. Its high accuracy provides robust supervision signals for model iteration, validating the effectiveness of dynamic filtering within the self-training framework.
(2) Ablation Studies
To validate the effectiveness of the dynamic threshold pseudo-label self-training method, ablation studies were conducted. Here, Matching, BEV_CycleGAN+Matching, DT+Matching, and BEV_CycleGAN-DT+Matching represent direct matching, matching after BEV_CycleGAN perspective transformation, matching after dynamic threshold pseudo-label self-training, and matching after BEV_CycleGAN perspective transformation followed by dynamic threshold pseudo-label self-training, respectively, as shown in Table 5.
The baseline model using only traditional matching methods (Top1 = 19.2%) exhibits poor performance, indicating that cross-view perspective differences make direct matching challenging. Incorporating BEV-CycleGAN (Top1 = 38.7%) significantly improved performance (+19.5 percentage points), validating the effectiveness of the proposed multi-scale contrastive learning and geometric constraints for cross-view generation. The generated high-quality satellite images mitigated the perspective discrepancy. When using dynamic threshold pseudo-label self-training alone (DT+Matching), Top1 improved to 42.3% (+23.1 percentage points over the baseline), demonstrating that dynamically filtering high-confidence pseudo labels effectively leverages unlabeled data to enhance model discrimination. Combining both approaches (Top1 = 48.5%) achieves the best performance, demonstrating that the high-quality pseudo-satellite images generated by BEV-CycleGAN and the adaptive filtering mechanism of dynamic thresholds complement each other and further enhance feature alignment accuracy.
(3) Network Convergence Analysis
This study employs a systematic training strategy to optimize the cross-view image translation performance of the BEV_CycleGAN model. Experiments are conducted on two large-scale public geospatial datasets, CVUSA and CVACT, each containing 35,532 training pairs and 8884 testing pairs. Through an end-to-end training process, the model progressively learns the complex mapping relationship between bird’s-eye-view and satellite images. Figure 7 shows the training loss convergence curves of the generator with contrastive learning module (Generator+CL) and the baseline generator (Generator) on the CVUSA and CVACT datasets. As illustrated, during the training process on both datasets, the model with contrastive learning consistently exhibits lower loss values than the baseline model, and its convergence curve is smoother and more stable, with significantly reduced oscillations, especially in the mid-to-late stages of training. This indicates that multi-scale contrastive learning effectively constrains the training process of the generative model, enhancing semantic consistency across views and improving optimization efficiency and convergence robustness. This phenomenon is consistent across different datasets, further validating that the proposed module plays an important role in stabilizing cross-view image generation and consequently improving geolocation performance.
(4) Method Effectiveness
The superior performance of the proposed method stems from a synergistic multi-level design. First, the introduction of an explicit bird’s-eye-view (BEV) geometric prior, implemented via ground-plane-based inverse projection, imposes pixel-level spatial constraints between ground and satellite views. This transforms the perspective transformation problem into a detail-refinement task within a structurally aligned domain, significantly mitigating global distortion, as evidenced by the accurate reconstruction of road networks in Figure 4. Second, multi-scale contrastive learning compels the model to move beyond the superficial feature transfer (e.g., color and illumination) typical of CycleGAN. By constructing dense cross-view positive and negative sample pairs in the feature space, semantic consistency between input and output is reinforced. This is key to the improved fidelity in vegetation and architectural details and the notable SSIM improvement shown in Table 1. Finally, the dynamic pseudo-labeling mechanism enables efficient utilization of unlabeled data through adaptive confidence calibration. Its quality–quantity balancing strategy helps the model avoid noisy interference in early training stages while gradually releasing reliable supervisory signals later, leading to robust performance even under semi-supervised settings, as demonstrated by the results using only 1% ground-truth labels in Table 4. Together, these three components form the core foundation that ensures high accuracy and strong generalization in the cross-view geolocalization task.
(5) Analysis of Robustness and Limitations under Extreme Conditions
To systematically evaluate the applicability and robustness of the proposed method in real-world complex environments, this study further designed test scenarios encompassing extreme illumination variations and significant seasonal differences. Experiments utilized ground-based panoramic images including overcast conditions (row 1), sunny conditions (row 2), and strong backlighting (row 3), alongside data collected during summer with dense vegetation (row 4) and winter with sparse vegetation (rows 5 and 6). These scenarios simulate common challenges of illumination and surface coverage variations encountered in practical applications. Figure 8 illustrates the complete cross-view transformation workflow from raw panoramas to generated satellite views across these diverse scenarios, comprising four processing stages: input panorama (Panorama), bird’s-eye view (BEV) generated via geometric projection, satellite-view images synthesized using the proposed BEV_CycleGAN+CL framework, and corresponding real satellite imagery (Satellite) as reference benchmarks.
Analysis indicates that this method maintains excellent geometric consistency and preserves critical semantic information under the vast majority of extreme lighting and seasonal variation conditions. Particularly for features with regular geometric patterns, such as road networks and vegetated areas, the generated images demonstrate high alignment with actual satellite imagery in spatial layout and contour retention, validating the effectiveness of the integrated BEV geometric prior and multi-scale contrastive learning mechanism in cross-domain feature alignment. However, the analysis also reveals limitations: the model tends to exhibit local edge blurring under backlight or low-light conditions, may cause semantic confusion in seasonal change scenarios, and produces minor deformations in high-rise dense areas due to cumulative geometric projection errors. These phenomena primarily stem from alignment challenges posed by simplified ground plane geometric assumptions and cross-seasonal texture variations.
Furthermore, regarding the generalization capability across heterogeneous landscapes, the experimental validation in this study primarily relies on the CVUSA and CVACT datasets, which predominantly cover suburban and urban street scenes in the United States (CVUSA) and Canberra, Australia (CVACT), characterized by regular road networks and low-to-mid-rise buildings. However, in real-world applications, the model may encounter more diverse geographical environments, including rural areas, complex intersections, and regions with significant elevation changes such as hills or mountains. In these scenarios, the flat-ground assumption underlying BEV transformation may partially fail, leading to increased geometric projection errors. For instance, on steep slopes or undulating terrain, the BEV images generated under the flat-ground assumption may exhibit road distortions or misalignments; in rural areas, sparse building distributions and extensive natural landforms may also affect the stability of feature matching. Although a preliminary analysis of geometric distortions in complex terrains (e.g., an approximate 4.5 percentage point drop in R@1 on steep slopes) is provided in the sixth analysis of Section 4, a systematic cross-scene generalization evaluation still requires further investigation in future work. Subsequent research will consider collecting or synthesizing datasets covering a wider range of geographical landscapes (e.g., rural, mountainous, coastal areas) and integrating multimodal elevation data (e.g., LiDAR, DSM) to enhance the model’s adaptability to terrain variations, thereby improving its robustness and generalization performance in heterogeneous environments.
Figure 9 presents a visual comparison of the Top 5 retrieval results for cross-view geolocation under extreme lighting and seasonal variations. In the figure, each column displays, from top to bottom: the query ground panoramic image captured under low light, strong backlight, or different seasonal conditions; the corresponding satellite-view image generated by BEV_CycleGAN+CL; and the actual satellite imagery. To clearly distinguish retrieval accuracy, red boxes indicate correctly matched satellite images, while blue boxes denote incorrectly matched images.
Analysis reveals that in most extreme scenarios, the generated satellite views within red boxes maintain high spatial and semantic consistency with their real-world counterparts regarding key geographic features—such as road alignment, building outlines, and vegetation distribution. This demonstrates the model’s ability to robustly align features despite apparent variations caused by lighting and seasonal changes. Notably, top-ranked results (e.g., Top1 and Top2) exhibit significant visual correspondence between generated images and their correctly matched targets. Further examination of the mismatched samples indicated by blue boxes reveals that while the generated images retain structural similarity to the query scene in overall layout, their corresponding satellite imagery exhibits deviations from the actual target in terms of feature details, local textures, or spatial orientation. Such errors predominantly occur in retrieval results with relatively low similarity (e.g., Top4–Top5), indicating that under extreme lighting or complex vegetation coverage conditions, the model may still encounter matching ambiguities due to missing local information or cross-view imaging differences. This visualization further confirms, from a visual interpretability perspective, that our method achieves reliable cross-view image generation and matching under most extreme lighting and seasonal conditions, enhancing the model’s applicability in complex real-world environments.
Through error analysis, we identified three types of systematic errors in the model under extreme conditions: (1) Geometric information loss due to severe occlusion (e.g., Row 1 in Figure 9), where dense tree canopies obscure ground roads. The BEV transformation fails to reconstruct the complete road topology, resulting in fragmented or misaligned generated images. This causes the retrieval model to assign high similarity scores to other geographic locations with similarly fragmented structures (Top4 and Top5 in Row 1 are both incorrect matches). (2) Geometric distortion caused by the BEV transformation (e.g., Row 3 in Figure 9), where the transformation breaks down in dense high-rise areas because the ground-plane assumption does not hold, resulting in tilted and distorted building outlines in generated images. This affects the model’s ranking stability in fine localization, causing all Top2–Top5 matches to be incorrect. (3) CycleGAN’s “hallucination” phenomenon (e.g., Row 5 in Figure 9), which generates non-existent structures such as vehicles in sparse winter vegetation scenes. These false features cause the retrieval model to match them with similar noise patterns in real images, triggering mismatches. The cumulative effect of these errors significantly reduces localization reliability in complex scenes.
(6) Uncertainty Analysis
While the proposed method achieves strong performance in most scenarios—particularly in suburbs and rural areas where building height variations are moderate—its effectiveness diminishes in dense urban centers characterized by high-rise buildings. In such environments, the accurate reconstruction of roads and vegetation via BEV geometric priors and multi-scale contrastive learning is compromised by geometric distortions. The fundamental cause of this limitation lies in the partial failure of the “single ground-plane assumption” underlying the current approach within complex urban 3D settings. Dense buildings lead to severe occlusion and perspective foreshortening, causing a single pixel in the ground-level view to correspond to multiple physical points at different heights in the real world. Relying solely on image data makes it difficult to resolve such depth ambiguities, thereby introducing geometric errors during the projection transformation, as illustrated in Figure 10.
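The depth ambiguity described above can be illustrated with a small geometric sketch. It assumes an idealized pinhole camera at height `cam_height` above perfectly flat ground, which is a simplification and not the paper's actual BEV transformation; the function names and the example numbers are hypothetical. By similar triangles, a point at horizontal distance d and height h is projected onto the ground plane at d·H/(H − h), so elevated structures are pushed outward and several different (d, h) pairs land on the same ground location.

```python
import math


def flat_ground_projection(d, h, cam_height):
    """Distance at which a point lands when its viewing ray is intersected
    with the ground plane z = 0.

    d          -- true horizontal distance of the point from the camera
    h          -- true height of the point above the ground
    cam_height -- camera height above the ground (assumed known)

    Points at or above the camera never intersect the ground plane,
    so infinity is returned for them.
    """
    if h >= cam_height:
        return math.inf
    return d * cam_height / (cam_height - h)


def displacement(d, h, cam_height):
    """Radial error introduced by treating an elevated point as ground."""
    return flat_ground_projection(d, h, cam_height) - d


# Hypothetical example: camera 2 m above the ground.
H = 2.0
# Three different physical points that share one viewing ray:
# all of them are mapped to the same ground location.
for d, h in [(20.0, 0.0), (15.0, 0.5), (10.0, 1.0)]:
    print(d, h, flat_ground_projection(d, h, H), displacement(d, h, H))
```

Under these assumptions the error grows with both distance and height, and anything taller than the camera has no valid ground intersection at all, which is consistent with the distortions reported for high-rise areas.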
This limitation has direct implications for real-world deployments. For instance, in autonomous driving scenarios within dense urban cores, distorted reconstructions of building facades and road intersections could lead to inaccurate vehicle localization, potentially affecting downstream tasks such as path planning or obstacle avoidance. Similarly, in disaster response applications—where rapid and accurate geolocalization is critical for coordinating rescue efforts—geometric errors in high-rise areas may cause misalignment between ground reports and satellite imagery, delaying situational awareness and decision-making.
Furthermore, existing training datasets provide insufficient coverage of extreme occlusion and diverse urban facades, and the model lacks explicit perception of the scene’s three-dimensional structure. Consequently, when confronted with complex layouts not adequately learned, the model tends to produce over-smoothed results with lost details. Additionally, the flat-ground assumption inherent to BEV projection leads to notable geometric distortions in complex terrains such as steep slopes, where the generated images often exhibit misaligned road networks and deformed structures. The model is also sensitive to initial geometric calibration errors; experiments show that a camera height deviation exceeding 10% can degrade the R@1 accuracy by approximately 4.5 percentage points, underscoring the need for precise calibration or adaptive geometric modeling in practical deployments.
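One way to see why camera height calibration matters is the following minimal sketch. It assumes the same idealized flat-ground pinhole geometry, in which a pixel looking down at a depression angle θ below the horizon maps to ground distance H / tan(θ); the function names and values are hypothetical. It only shows how a height error propagates into the BEV geometry and does not reproduce the reported drop of about 4.5 percentage points in R@1, which comes from the paper's experiments.

```python
import math


def ground_distance(depression_deg, cam_height):
    """Flat-ground distance for a ray pointing depression_deg below the horizon."""
    if not 0.0 < depression_deg < 90.0:
        raise ValueError("depression angle must lie strictly between 0 and 90 degrees")
    return cam_height / math.tan(math.radians(depression_deg))


def relative_distance_error(depression_deg, true_height, assumed_height):
    """Relative error of the projected ground distance when the camera height
    used for the projection differs from the true camera height."""
    d_true = ground_distance(depression_deg, true_height)
    d_est = ground_distance(depression_deg, assumed_height)
    return (d_est - d_true) / d_true


# Hypothetical example: true height 2.0 m, assumed height 10% too large.
for angle in (10.0, 30.0, 60.0):
    print(angle, relative_distance_error(angle, 2.0, 2.2))
```

In this simplified model the relative distance error equals the relative height error at every viewing angle, so a 10% height deviation rescales the whole projected ground layout by 10%.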
To address these limitations, future research will focus on incorporating multimodal data fusion strategies. For instance, leveraging the 3D geometric information provided by LiDAR point clouds or digital surface models (DSMs) can effectively compensate for depth ambiguity caused by the flat-ground assumption in high-rise building areas. Simultaneously, integrating semantic segmentation priors can guide the generative network to focus on structural consistency across different object categories, thereby achieving more reliable cross-view localization performance in complex real-world environments.

5. Conclusions

This paper tackles the challenges of geometric misalignment and limited labeled data in cross-view geolocation by introducing the BEV-CycleGAN framework, which integrates multi-scale contrastive learning with dynamic pseudo-label optimization. By introducing bird’s-eye-view geometric prior constraints and a spatial pyramid feature fusion module, the framework explicitly models the projection relationship between ground and satellite views, effectively mitigating edge blurring and structural distortion in generated images. Combined with a dynamic threshold pseudo-label selection mechanism and a hard-sample triplet loss, it significantly enhances the model’s utilization efficiency for unlabeled data while ensuring pseudo-label quality. Experiments demonstrate that the proposed method achieves state-of-the-art performance on both CVUSA and CVACT datasets, validating the effectiveness of multi-scale contrastive learning and adaptive pseudo-label strategies. Future work will explore multimodal data fusion (e.g., LiDAR, semantic maps) and meta-learning-driven threshold optimization mechanisms to further enhance the model’s scene generalization capability and positioning robustness, advancing the large-scale deployment of cross-view geolocation technology in practical applications.

Author Contributions

Y.Y. and J.G. proposed the research framework and model design; Y.Y. and W.L. were responsible for writing the code, debugging, and modifying it; Z.L. conducted quantitative experimental analysis and statistical validation; Y.Y. drafted the manuscript and created the visual content; N.L. and R.Z. were responsible for supervising the quality of the research, coordinating resources, managing the project schedule, and making decisions regarding submission. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the State Key Laboratory of Spatial Datum Open Project under Grants SKLGIE2023-ZZ-9 and SKLGIE2024-ZZ-8; by the Henan Provincial Key Scientific and Technological Project under Grant 242103810017; and by the Joint Fund of the Henan Provincial Collaborative Innovation Center for Smart Central Plains Geographic Information Technology and the Key Laboratory of Spatio-Temporal Perception and Intelligent Processing, Ministry of Natural Resources, under Grant 231103.

Data Availability Statement

The CVUSA and CVACT datasets used in this study can be found at Crossview USA [22] and OriCNN [12]. https://github.com/yuan511/BEV-CycleGAN (accessed on 13 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Roche, S. Geographic Information Science I: Why does a smart city need to be spatially enabled? Prog. Hum. Geogr. 2014, 38, 703–711.
  2. Zhang, X.; Gao, H.; Guo, M.; Li, G.; Liu, Y.; Li, D. A study on key technologies of unmanned driving. CAAI Trans. Intell. Technol. 2016, 1, 4–13.
  3. Bharath, G.; Ch, R.; Karthik, M.; Chowdary, M. Revelation of geospatial information using augmented reality. In Proceedings of the 2021 Sixth International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), Chennai, India, 25–27 March 2021; pp. 303–308.
  4. Shi, Y.; Liu, L.; Yu, X.; Li, H. Spatial-aware feature aggregation for image based cross-view geo-localization. Adv. Neural Inf. Process. Syst. 2019, 32, 10090–10100.
  5. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
  6. Sun, Y.; Ye, Y.; Kang, J.; Fernandez-Beltran, R.; Feng, S.; Li, X.; Luo, C.; Zhang, P.; Plaza, A. Cross-view object geo-localization in a local region with satellite imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16.
  7. Hu, Y.; Liu, Y.; Hui, B. Combining OpenStreetMap with Satellite Imagery to Enhance Cross-View Geo-Localization. Sensors 2024, 25, 44.
  8. Fan, J.; Zheng, E.; He, Y.; Yang, J. A Cross-View Geo-Localization Algorithm Using UAV Image and Satellite Image. Sensors 2024, 24, 3719.
  9. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
  10. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 21–23 September 2005; pp. 886–893.
  11. Bay, H.; Tuytelaars, T.; Van Gool, L. Surf: Speeded up robust features. In Proceedings of the Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 404–417.
  12. Workman, S.; Souvenir, R.; Jacobs, N. Wide-area image geolocalization with aerial reference imagery. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 3961–3969.
  13. Zhao, J.; Zhai, Q.; Zhao, P.; Huang, R.; Cheng, H. Co-visual pattern-augmented generative transformer learning for automobile geo-localization. Remote Sens. 2023, 15, 2221.
  14. Durgam, A.; Paheding, S.; Dhiman, V.; Devabhaktuni, V. Cross-view geo-localization: A survey. IEEE Access 2024, 12, 192028–192050.
  15. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
  16. Regmi, K.; Borji, A. Cross-view image synthesis using conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, 18–22 June 2018; pp. 3501–3510.
  17. Huang, J.; Ye, D. Ground-to-aerial image geo-localization with cross-view image synthesis. In Proceedings of the International Conference on Image and Graphics, Haikou, China, 6–8 August 2021; pp. 412–424.
  18. Parmar, G.; Park, T.; Narasimhan, S.; Zhu, J.-Y. One-step image translation with text-to-image models. arXiv 2024, arXiv:2403.12036.
  19. Tian, X.; Shao, J.; Ouyang, D.; Shen, H.T. UAV-satellite view synthesis for cross-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4804–4815.
  20. Li, G.; Qian, M.; Xia, G.-S. Unleashing unlabeled data: A paradigm for cross-view geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16719–16729.
  21. Ran, L.; Wang, L.; Zhuo, T.; Xing, Y.; He, H.; Zhang, Y. Ddf: A novel dual-domain image fusion strategy for remote sensing image semantic segmentation with unsupervised domain adaptation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13.
  22. Zhao, T.; Zeng, Y.; Fang, Q.; Xu, X.; Xie, H. Semi-Supervised Object Detection for Remote Sensing Images Using Consistent Dense Pseudo-Labels. Remote Sens. 2025, 17, 1474.
  23. Wang, T.; Zheng, Z.; Yan, C.; Zhang, J.; Sun, Y.; Zheng, B.; Yang, Y. Each part matters: Local patterns facilitate cross-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 867–879.
  24. Yang, H.; Lu, X.; Zhu, Y. Cross-view geo-localization with layer-to-layer transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 29009–29020.
  25. Zhang, X.; Li, X.; Sultani, W.; Zhou, Y.; Wshah, S. Cross-view geo-localization via learning disentangled geometric layout correspondence. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 3480–3488.
  26. Zhu, S.; Shah, M.; Chen, C. Transgeo: Transformer is all you need for cross-view image geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1162–1171.
  27. Deuser, F.; Habel, K.; Oswald, N. Sample4geo: Hard negative sampling for cross-view geo-localisation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 16847–16856.
Figure 1. Modeling framework.
Figure 2. Multi-scale loss function construction. Red boxes indicate positive sample patches (semantically consistent regions), and blue boxes indicate negative sample patches (dissimilar regions for contrastive loss). Solid arrows denote image propagation, and dashed arrows denote loss calculation paths (cycle-consistency, adversarial, and pixel-wise contrastive loss).
Figure 3. Dynamic threshold pseudo-label filtering. The dashed lines indicate the adaptive confidence threshold δt, which separates high-confidence samples from low-confidence ones. The red box highlights the current threshold position δt′ in the confidence distribution. In the bar charts, blue bars denote high-confidence samples selected as pseudo labels, while yellow bars represent low-confidence samples to be filtered. In the pseudo label selection stage, green hexagons indicate retained reliable pseudo labels, and pink hexagons represent noisy pseudo labels to be removed.
Figure 4. Comparison of Image Conversion Results from Ground View to Satellite View.
Figure 5. Pattern of Threshold Weight Variation.
Figure 6. Different threshold filtering methods.
Figure 7. Loss function curves. (a) Loss curves on the CVACT dataset; (b) Loss curves on the CVUSA dataset.
Figure 8. Comparison of Cross-View Image Conversion Results Under Extreme Lighting Conditions and Seasonal Variations.
Figure 9. Visualization of Top 5 Retrieval Results for Cross-View Geolocation under Extreme Conditions. Red boxes represent correct matches, while blue boxes represent incorrect matches.
Figure 10. Uncertainty Analysis.
Table 1. Comparison of Model Ablation Experiment Results Based on CVACT and CVUSA Datasets. ↑ denotes an increase in performance, and ↓ denotes a reduction in error.
| Method | CVACT PSNR (↑) | CVACT SSIM (↑) | CVACT RMSE (↓) | CVUSA PSNR (↑) | CVUSA SSIM (↑) | CVUSA RMSE (↓) |
|---|---|---|---|---|---|---|
| BEV_CycleGAN+CL | 26.51 | 0.7866 | 0.1378 | 26.48 | 0.7858 | 0.1413 |
| BEV_CycleGAN | 25.78 | 0.7412 | 0.1519 | 25.52 | 0.7374 | 0.1569 |
| BEV | 22.85 | 0.7283 | 0.1631 | 22.12 | 0.7272 | 0.1646 |
| CycleGAN | 18.63 | 0.6639 | 0.1846 | 18.72 | 0.6758 | 0.1819 |
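As a quick arithmetic check, the relative PSNR gains of BEV_CycleGAN+CL over CycleGAN can be recomputed from the Table 1 values. The helper name `relative_change` is introduced here only for illustration.

```python
def relative_change(ours, baseline):
    """Relative change of `ours` with respect to `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# PSNR values taken from Table 1: (BEV_CycleGAN+CL, CycleGAN).
psnr = {"CVACT": (26.51, 18.63), "CVUSA": (26.48, 18.72)}

for dataset, (ours, baseline) in psnr.items():
    print(f"{dataset}: {relative_change(ours, baseline):.2f}%")
# prints:
# CVACT: 42.30%
# CVUSA: 41.45%
```

These figures match the 42.30% and 41.45% improvements quoted in the Highlights.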
Table 2. Comparison of Model Geolocation Results Based on CVUSA and CVACT Datasets.
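| Method | CVUSA R@1 | CVUSA R@5 | CVUSA R@10 | CVUSA R@1% | CVACT Val R@1 | CVACT Val R@5 | CVACT Val R@10 | CVACT Val R@1% | CVACT Test R@1 | CVACT Test R@5 | CVACT Test R@10 | CVACT Test R@1% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SAFA | 81.15 | 94.23 | 96.85 | 99.49 | 78.28 | 91.60 | 93.79 | 98.15 | - | - | - | - |
| SAFA | 89.84 | 96.93 | 98.14 | 99.64 | 81.03 | 92.80 | 94.84 | 98.17 | - | - | - | - |
| LPN | 92.83 | 98.00 | 98.85 | 99.78 | 83.66 | 94.14 | 95.92 | 98.41 | - | - | - | - |
| DSM | 91.96 | 97.50 | 98.54 | 99.67 | 82.49 | 92.44 | 93.99 | 97.32 | - | - | - | - |
| L2LTR | 91.99 | 97.68 | 98.65 | 99.75 | 83.14 | 93.84 | 95.51 | 98.40 | 58.33 | 84.23 | 88.60 | 95.83 |
| GeoDTR | 93.76 | 98.47 | 99.22 | 99.85 | 85.43 | 94.81 | 96.11 | 98.26 | 62.96 | 87.35 | 90.70 | 98.61 |
| TransGeo | 94.08 | 98.36 | 99.04 | 99.77 | 84.95 | 94.14 | 95.78 | 98.37 | - | - | - | - |
| Sample4G | 98.68 | 99.68 | 99.78 | 99.87 | 90.81 | 96.74 | 97.48 | 98.77 | 71.57 | 92.42 | 94.45 | 98.70 |
| Ours | 98.94 | 99.73 | 99.88 | 99.89 | 92.92 | 98.63 | 97.64 | 98.83 | 72.42 | 93.21 | 94.62 | 98.96 |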
indicates results reproduced using the authors’ publicly available code; all other results are directly quoted from the original literature. All methods employed a 256 × 256 input resolution to ensure fair comparison.
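For readers unfamiliar with the R@K and R@1% columns of Table 2, the sketch below shows how such recall metrics are commonly computed from a query-by-reference similarity matrix. It is an illustration and not the authors' evaluation code. It assumes one-to-one pairing, so that the true match of query i is reference i, and it takes R@1% as recall within the top `n // 100` references, which is a common convention but an assumption here. The toy similarity values are hypothetical.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose true match ranks within the top k.

    sim[i][j] is the similarity between query i and reference j;
    the true match of query i is assumed to be reference i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of references scoring strictly higher than the true match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def recall_at_1_percent(sim):
    """Recall within the top 1% of the reference set (at least one reference)."""
    k = max(1, len(sim[0]) // 100)
    return recall_at_k(sim, k)


# Toy example with three queries and three references.
toy_sim = [
    [0.9, 0.1, 0.2],  # true match ranked first
    [0.8, 0.5, 0.3],  # true match ranked second
    [0.1, 0.7, 0.6],  # true match ranked second
]
print(recall_at_k(toy_sim, 1))  # one of three queries is correct at rank 1
print(recall_at_k(toy_sim, 2))  # all three are correct within the top 2
```

By this definition recall can only stay the same or increase as K grows, which is a useful sanity check when reading the table.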
Table 3. Comparison of Localization Accuracy and Computational Overhead Across Different Methods (CVUSA/CVACT Datasets).
| Method | CVUSA R@1 | CVACT Test R@1 | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| SAFA | 81.15 | - | 23.4 | 14.2 |
| LPN | 92.83 | - | 12.8 | 8.9 |
| GeoDTR | 93.76 | 62.96 | 26.7 | 15.2 |
| Sample4G | 98.68 | 71.57 | 24.6 | 13.9 |
| Ours | 98.94 | 72.42 | 25.8 | 14.5 |
indicates results reproduced using the authors’ publicly available code; all other results are directly quoted from the original literature.
Table 4. Comparison of Geolocation Results for Ground-Truth Labels at Different Ratios (“GT Ratio” denotes the proportion of ground-truth labels used for training).
| GT Ratio | Ours R@1 | Ours R@5 | Ours R@10 | UCVGL [20] R@1 | UCVGL [20] R@5 | UCVGL [20] R@10 |
|---|---|---|---|---|---|---|
| 1% | 70.30 | 83.23 | 86.18 | 68.29 | 85.18 | 88.80 |
| 5% | 79.21 | 91.25 | 92.63 | 78.10 | 90.87 | 93.11 |
| 10% | 80.82 | 92.64 | 93.21 | 78.88 | 91.31 | 93.53 |
| 20% | 82.19 | 93.41 | 94.12 | 79.60 | 91.98 | 93.56 |
| 50% | 84.65 | 94.32 | 95.87 | 82.91 | 93.44 | 94.75 |
| 75% | 85.78 | 95.01 | 96.23 | 83.88 | 94.12 | 95.01 |
| 100% | 86.32 | 95.63 | 98.54 | 84.44 | 94.85 | 95.83 |
Table 5. Comparison Results of Different Method Combinations.
| Method | Top-1 | Top-5 | Top-1% |
|---|---|---|---|
| Matching | 19.2 | 42.3 | 97.1 |
| BEV_CycleGAN+Matching | 38.7 | 62.5 | 97.6 |
| DT+Matching | 42.3 | 68.5 | 98.2 |
| BEV_CycleGAN-DT+Matching | 48.5 | 76.8 | 99.3 |
