1. Introduction
With the rapid development of smart cities, drones, augmented reality, and other fields, the accurate acquisition and understanding of geospatial information has become a core technical need [1,2,3]. Cross-view geolocalization, a cutting-edge direction at the intersection of geographic information science and computer vision, aims to automatically localize images that lack geo-tags by matching ground-view images (e.g., street view, UAV images) against geo-tagged satellite or aerial images [4,5]. It has important application value for disaster emergency response [6], high-precision map construction for autonomous driving [7], military reconnaissance [8], and other scenarios. Traditional geo-localization methods are mainly based on GPS signals or SLAM technology; however, they have significant limitations in urban canyons, indoor scenes, and signal-restricted areas. In recent years, vision-based geolocalization techniques have become a research hotspot because of their broad applicability. Early works constructed cross-view image matching models by extracting hand-crafted features such as SIFT [9], HOG [10], and SURF [11], but their performance degraded drastically when the difference in view angle exceeded 30°.
With the maturation of deep learning technologies, the research paradigm in this field has undergone a significant shift. Workman et al. [12] pioneered the application of convolutional neural networks to multi-view geolocation, achieving feature alignment between ground and aerial images through a designed cross-view training mechanism. The MGTL network proposed by Zhao et al. [13] significantly enhances cross-view feature extraction capabilities through a cascaded attention mechanism and spatial context enhancement modules. However, due to significant differences in imaging conditions, acquisition times, and geometric structures among images from different viewpoints, localization methods based on deep learning alone still face core challenges such as feature alignment difficulties and insufficient model generalization [14].
To mitigate issues arising from viewpoint differences, researchers have attempted to bridge the gap between domains through image generation or transformation. Shi et al. [4] employed polar coordinate mapping to convert satellite images into a polar coordinate system, simulating the distribution of ground-level viewpoints. While this approach alleviated viewpoint distortion, the generated images still suffered from semantic content loss and edge blurring. Subsequently, generative adversarial networks (GANs) [15] were introduced for cross-view image translation, enabling unsupervised domain adaptation. Models like X-Fork and X-Seq [16] employed conditional GANs to generate cross-view images and reduce domain discrepancies. Recent studies incorporated attention mechanisms and multi-scale feature fusion, yet semantic misalignment caused by viewpoint differences remains an unresolved challenge [17]. The CycleGAN-Turbo model [18] enhances structural preservation by integrating latent diffusion modules, yet its generated images exhibit structural distortions in complex texture regions (e.g., buildings, road intersections).
The aforementioned methods typically rely on large amounts of precisely annotated multi-view image pairs for supervised learning. However, in practical applications, high annotation costs and low utilization of unpaired data severely constrain the large-scale deployment of these techniques [19]. To reduce annotation dependency, pseudo-label learning methods have been introduced for multi-view localization tasks. UCVGL [20] employs a cross-view projection-guided model to retrieve initial pseudo labels, then improves their quality through rapid re-ranking. However, confidence estimation bias during pseudo label generation leads to noise propagation. Ran et al. [21] proposed a region-specific re-weighting strategy that assigns varying weights based on the spatial context of pseudo label regions, yet it fails to fully leverage domain-specific label information and lacks adaptability. CDPL [22] leverages consistent dense pseudo labels to enhance remote sensing object detection performance. However, its “dense pseudo labels” are tailored for instance-level detection and are unsuitable for image-level geolocation tasks. Most of the aforementioned pseudo label methods employ fixed thresholds for sample selection, which makes it difficult to adapt to changes in confidence distributions during model training. This results in unstable pseudo label quality and pronounced noise accumulation.
To overcome the limitations of existing methods, this paper proposes a unified framework that integrates generative learning with pseudo-label learning paradigms. Specifically, we introduce BEV-CycleGAN+CL, a model that combines bird’s-eye view geometric constraints with multi-scale contrastive learning. Unlike pure feature matching approaches that struggle with large viewpoint differences, our method first explicitly transforms ground panoramas into BEV coordinates via geometric inverse projection based on the flat-ground assumption, significantly reducing viewpoint distortion. Unlike pure generative methods that often produce structural artifacts, our multi-scale contrastive loss enforces image generation at the feature level to align with real satellite imagery, preserving fine details and semantic structure. Unlike existing pseudo-label methods constrained by fixed thresholds and confidence bias, we design a dynamic threshold pseudo-label self-training mechanism. By integrating historical reliable thresholds with current confidence distributions, it adaptively balances pseudo-label quality and quantity. Additionally, a dynamic hard-sample triplet loss enhances the model’s discriminative capability between generated images and ground truth satellite imagery.
- (1) BEV-CycleGAN: We propose a BEV-CycleGAN model that alleviates perspective distortion through BEV geometric constraints and multi-scale contrastive learning. Using a geometric inverse projection based on the ground-plane assumption, the ground panorama image is explicitly converted to BEV coordinates. We introduce a spatial pyramid module to model geometric relationships and design pixel-level and feature-level contrastive losses to improve the fidelity of generated images.
- (2) Dynamic thresholding pseudo-labels: We propose a pseudo-label self-training method based on dynamic threshold filtering, which sets an adaptive confidence threshold and adjusts it dynamically with an exponential function to match the improvement trend of the model’s learning capability. In addition, we use a dynamic hard-sample triplet loss to enhance the model’s ability to discriminate between generated and real satellite images.
It should be noted that the BEV projection employed in this paper is based on the assumption of flat terrain. This assumption performs well in suburban or rural settings where building heights vary gradually. However, in urban areas characterized by dense high-rise buildings and steep topography, severe occlusion and perspective distortion lead to significantly increased geometric mapping errors. This limitation will be further analyzed in subsequent discussions. Moving forward, we will explore integrating multimodal data (such as LiDAR point clouds and Digital Surface Models, DSM) to explicitly model three-dimensional structures, thereby enhancing the model’s robustness in complex urban environments.
2. Materials and Methods
2.1. Materials
In this paper, we use two publicly available standard benchmark datasets for ground-to-satellite view matching, CVUSA [22] and CVACT [12]. CVUSA is mainly collected from the suburbs of U.S. cities and contains 35,532 image pairs for training and 8884 image pairs for testing, with high image resolution and panoramic ground-view imagery. CVACT is structured similarly to CVUSA and additionally provides 92,802 images for testing. The high-resolution satellite images in the CVACT dataset allow the model to learn finer features during training and testing, thus better validating the model’s generalization ability.
Evaluation indicators: In terms of image generation quality assessment, this study used multidimensional quantitative metrics to evaluate the generation results objectively. Pixel intensity deviation was measured by root mean square error (RMSE), image fidelity was assessed using peak signal-to-noise ratio (PSNR), and spatial structural consistency was measured using structural similarity (SSIM) to comprehensively analyze the extent to which the generated satellite images differed from the real data at the pixel level. In the context of image retrieval, R@K denotes the percentage of query images correctly matched among the top K satellite images returned by the retrieval. Specifically, R@1% denotes that for each query image, localization is considered successful if its correctly matched satellite image resides within the top 1% of candidates sorted by similarity (i.e., the candidate pool comprises 1% of the entire test dataset). This metric evaluates a model’s localization accuracy among extremely limited candidates, representing a highly challenging evaluation criterion for cross-view geolocation tasks.
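The retrieval metrics can be made concrete with a minimal sketch. It assumes that the true satellite match of query i is stored at gallery index i, that ties are resolved in favor of the query, and that the top 1% is rounded up to at least one candidate; none of these conventions is stated in the text, and the function names are illustrative.

```python
import math

def recall_at_k(sim, k):
    """R@K in percent. sim[i][j] is the similarity of query i to gallery item j.

    Assumption: the correct match of query i is gallery item i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the true match = number of gallery items scoring strictly higher.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)

def recall_at_1_percent(sim):
    """R@1%: the true match must fall within the top 1% of the gallery."""
    k = max(1, math.ceil(0.01 * len(sim[0])))  # assumed rounding rule
    return recall_at_k(sim, k)

# Toy similarity matrix: 3 queries against 3 gallery images.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
```

On this toy matrix the second query ranks its true match second, so it counts as a hit for K = 2 but not for K = 1.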
All experimental results for comparison methods directly cite the highest metrics reported in their original papers. For methods lacking CVACT test set results, we reproduced them using official code and marked them with ‘†’. Input image resolution is uniformly set to 256 × 256 to align with our method.
2.2. Network Framework
As shown in
Figure 1, the overall architecture is divided into two stages: the image generation module (BEV-CycleGAN+CL) and the geolocation module (dynamic threshold pseudo-label self-training). First, the process is based on geometric projection priors, providing structural constraints for subsequent generation. Then, through an adversarial learning mechanism, the generative networks (G, F) and the discriminative network (D) are jointly optimized to learn the pixel probability distribution features of BEV images and satellite images. In addition, an encoder structure (E) is incorporated into the BEV-CycleGAN model, introducing cross-scale positive and negative sample pairs in the feature space and designing a multi-scale contrastive loss function. Through a joint multi-task learning framework, adversarial loss, cycle consistency loss, and multi-scale contrastive loss are dynamically fused, and an adaptive weight allocation strategy is employed to balance the optimization goals of generation quality and structural fidelity. Finally, a pseudo-label self-training method based on dynamic threshold filtering is proposed, which selects high-confidence pseudo-labels through an adaptive confidence threshold and an exponential function dynamic adjustment mechanism, and uses a dynamic hard sample triplet loss for cross-view image matching.
To further clarify the interplay between the two stages, we detail their interaction as follows. The proposed framework comprises two sequential training phases. In the first (image generation) phase, the BEV-CycleGAN+CL model is trained using all paired ground-satellite images. After training, the generator G is fixed and applied to all ground images (including those that will later serve as labeled or unlabeled data) to produce pseudo-satellite images, forming a “generated view” dataset. In the second (geolocation) phase, a semi-supervised learning paradigm is adopted: the training set consists of a small number of labeled real satellite images, a large number of unlabeled real satellite images, and optionally the generated satellite images. The labeled data provide basic supervision, while the unlabeled real data are used to generate pseudo-labels via the dynamic threshold strategy. The generated images can serve as additional query or gallery samples to enhance feature learning. The two modules are indirectly synergistic through the shared ground image features: the generation module improves the structural alignment between ground and satellite views, thereby reducing the difficulty of feature matching for the localization module.
The proposed BEV-CycleGAN model adopts an encoder–decoder generator. The generator takes 256 × 256 × 3 images as input. The first layer is a 7 × 7 convolutional layer with 64 channels, a stride of 1, reflective padding, instance normalization, and ReLU activation. This is followed by two downsampling layers using 3 × 3 convolutions (stride 2), which increase the channel count to 128 and 256, respectively. The core of the generator consists of nine residual blocks; each block contains two 3 × 3 convolutional layers that maintain 256 channels, along with instance normalization and ReLU activation. The decoder performs upsampling via two 3 × 3 transposed convolutional layers (stride 2), reducing the channel dimensions to 128 and then 64. The final output layer uses a 7 × 7 convolution to map the features to 3 channels, employing a Tanh activation function. The generators G (BEV → satellite) and F (satellite → BEV) share this symmetric architecture. The discriminator adopts a PatchGAN structure, comprising five convolutional layers. The first four layers use 4 × 4 kernels with a stride of 2; their channel counts are 64, 128, 256, and 512, respectively. Each is followed by instance normalization and a LeakyReLU activation (negative slope of 0.2). The final layer is a 4 × 4 convolution (stride 1, 1 channel) that outputs a 30 × 30 feature map, discriminating the authenticity of local image patches. For multi-scale contrastive learning, the module shares the encoder part of the generator. It extracts feature maps at three scales: 64 × 64 (128 channels), 32 × 32 (256 channels), and 16 × 16 (512 channels). Each scale’s features are independently projected to a 64-dimensional latent vector via a 1 × 1 convolutional projection head.
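The spatial sizes implied by this description can be checked with the standard convolution size formulas. The sketch below is only an arithmetic check, not the authors' implementation; the padding values (3 for the 7 × 7 layers, 1 elsewhere) and the output padding of 1 in the transposed convolutions are assumptions, since the text does not state them.

```python
def conv_out(n, k, s, p):
    """Output size of a convolution along one spatial axis."""
    return (n + 2 * p - k) // s + 1

def deconv_out(n, k, s, p, out_pad):
    """Output size of a transposed convolution along one spatial axis."""
    return (n - 1) * s - 2 * p + k + out_pad

def generator_sizes(n=256):
    """Spatial size after each resolution-changing stage of the generator."""
    sizes = [n]
    sizes.append(conv_out(sizes[-1], 7, 1, 3))       # 7x7 stem, 64 channels
    sizes.append(conv_out(sizes[-1], 3, 2, 1))       # downsample, 128 channels
    sizes.append(conv_out(sizes[-1], 3, 2, 1))       # downsample, 256 channels
    # The nine residual blocks (3x3, stride 1, padding 1) keep the size unchanged.
    sizes.append(deconv_out(sizes[-1], 3, 2, 1, 1))  # upsample, 128 channels
    sizes.append(deconv_out(sizes[-1], 3, 2, 1, 1))  # upsample, 64 channels
    sizes.append(conv_out(sizes[-1], 7, 1, 3))       # 7x7 output layer, 3 channels
    return sizes

def discriminator_size(n=256, strides=(2, 2, 2, 2, 1)):
    """Output map size of a stack of 4x4 convolutions with padding 1 (assumed)."""
    for s in strides:
        n = conv_out(n, 4, s, 1)
    return n
```

Under these padding assumptions the generator maps 256 → 128 → 64 and back to 256. For the discriminator, four stride-2 layers followed by one stride-1 layer give a 15 × 15 map, whereas the stated 30 × 30 output is obtained when the fourth layer uses stride 1, as in the common 70 × 70 PatchGAN configuration.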
The image geolocation network uses ConvNeXt-B as its backbone to extract a 1024-dimensional global feature vector. Feature matching is based on cosine similarity. The dynamic threshold update mechanism is governed by an initial smoothing factor of 0.5 and a decay rate γ = 0.9. The model was trained using the Adam optimizer. The generator’s initial learning rate was 0.0002, which was kept constant for the first 100 epochs before applying linear decay. The localization network’s initial learning rate was 0.001. The weights for the total loss function were set to λ_cycle = 10 and λ_PC = 2. The batch size was set to 8 for the generative adversarial training phase and 128 for the localization network training. To improve generalization, random horizontal flipping was applied as a data augmentation technique during training. An exponential decay strategy was used to dynamically adjust the learning rate, facilitating model convergence.
2.3. Image Generation for Multi-Scale Contrasts
This paper uses a geometric back-projection based on the flat-ground assumption, projecting street view images into a bird’s-eye view (BEV) through an explicit panorama-to-BEV transformation that requires neither depth estimation nor camera parameters. Specifically, given a ground panoramic image, we assume a flat ground plane with the camera located above the center of the BEV plane at height H. For each BEV plane position (i, j), the coordinates of the corresponding ground point P(x, y, z = 0) are determined. The corresponding pitch angle θ and azimuth angle φ are then calculated from geometric relationships, and each ground pixel is back-projected onto the ground point in the world coordinate system through a spherical projection. Subsequently, these points are projected onto a top-down 2D plane to form the initial BEV representation. This process establishes a deterministic geometric mapping between ground image pixels and BEV plane coordinates, as shown in Figure 2.
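The geometric mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact projection used in the paper: the panorama is taken to be equirectangular with the forward direction at the center column and the horizon at the middle row, and the function names, the BEV extent, and the angle conventions are hypothetical.

```python
import math

def bev_cell_to_ground(i, j, size, half_extent):
    """Map BEV cell (row i, column j) to ground coordinates (x, y) in meters.

    Assumption: the BEV image is size x size and covers
    [-half_extent, half_extent] meters in both axes, camera at the center.
    """
    meters_per_cell = 2.0 * half_extent / size
    x = (j + 0.5 - size / 2.0) * meters_per_cell   # right of the camera
    y = (size / 2.0 - i - 0.5) * meters_per_cell   # in front of the camera
    return x, y

def ground_point_to_pano(x, y, cam_height, pano_w, pano_h):
    """Map ground point P(x, y, z=0) to equirectangular pixel coordinates (u, v)."""
    azimuth = math.atan2(x, y)                        # 0 = straight ahead
    pitch = math.atan2(cam_height, math.hypot(x, y))  # angle below the horizon
    u = (azimuth / (2.0 * math.pi) + 0.5) * pano_w
    v = (0.5 + pitch / math.pi) * pano_h              # horizon at the middle row
    return u, v
```

Sampling the panorama at (u, v) for every BEV cell yields the initial BEV image. Points far from the camera map close to the horizon row, which is why distant regions of the BEV image are sampled sparsely.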
Although the resulting bird’s-eye view (BEV) image is similar to satellite images in geometric structure, the two still differ considerably in the imaging process [21]. These differences are mainly reflected in the degree of distortion, lighting conditions, occlusions, and weather conditions. To reduce the imaging differences between the two views, a CycleGAN generative model is introduced, whose adversarial training narrows the appearance gap and improves the consistency and comparability of the images. The BEV-CycleGAN framework proposed in this paper mainly consists of bidirectional mapping modules: a generator G from the X domain (BEV images) to the Y domain (satellite-view images) and a generator F from the Y domain to the X domain, along with a corresponding discriminator D for adversarial training. The architecture realizes cross-domain image transformation by constructing bidirectional cyclic mappings, as shown in Figure 2.
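(1) GAN Loss
$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}\left[\log D_Y(y)\right] + \mathbb{E}_{x \sim p_{data}(x)}\left[\log\left(1 - D_Y(G(x))\right)\right] \quad (1)$$
where G represents the generator from the BEV domain (X) to the satellite domain (Y); $D_Y$ is the discriminator in the satellite domain (Y), used to distinguish between real satellite images y and generated images G(x); and $p_{data}(x)$ and $p_{data}(y)$ represent the data distributions of BEV images and real satellite images, respectively.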
(2) Cycle consistency loss
$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\left[\lVert F(G(x)) - x \rVert_1\right] + \mathbb{E}_{y \sim p_{data}(y)}\left[\lVert G(F(y)) - y \rVert_1\right] \quad (2)$$
where F denotes the generator from the satellite domain (Y) to the BEV domain (X). This loss enforces cycle consistency during the conversion: an image transformed by one generator and then restored by the other should be as close as possible to the original image, thereby preserving key content information.
However, CycleGAN tends to capture low-frequency features (color, illumination) but has difficulty retaining high-frequency details (edges and texture). Because it lacks explicit structural constraints, the generated images often have blurred details and are prone to implausible deformations, especially in complex texture regions (e.g., buildings and roads). Therefore, this study introduces contrastive learning, which pulls similar samples together in the representation space and pushes dissimilar samples apart. The focus is on enhancing the ability of the model to discriminate image features so that it can better capture subtle differences between views. To construct positive and negative samples, N + 1 patches are randomly selected from the input image x, and one corresponding patch is selected from the generated satellite image $\hat{y}$. The two corresponding patches form the positive pair, whereas the other N patches in x serve as negative samples. The InfoNCE (information noise-contrastive estimation) loss is used to enhance the recognition ability of the model by maximizing the similarity of the positive pair (corresponding patches in x and $\hat{y}$) and minimizing the similarity to the negative samples (the other N patches in x), which reduces interference and improves the accuracy of the feature representation. Specifically, the anchor, the positive, and the N negatives are first mapped to K-dimensional vectors, denoted v, $v^+$, and $v^-$, respectively. Here, v denotes the feature vector (anchor) extracted from the generated image $\hat{y}$, $v^+$ denotes the corresponding feature vector extracted from the same location in the input image x (positive sample), and $v^-$ denotes the feature vectors extracted from other locations in the input image x (negative samples):
$$\ell(v, v^+, v^-) = -\log \frac{\exp\left(\mathrm{sim}(v, v^+)/\tau\right)}{\exp\left(\mathrm{sim}(v, v^+)/\tau\right) + \sum_{n=1}^{N} \exp\left(\mathrm{sim}(v, v_n^-)/\tau\right)} \quad (3)$$
where $\mathrm{sim}(v, v^+)$ denotes the cosine similarity between v and $v^+$, $v_n^-$ represents the nth negative sample, and $\tau$ is a temperature parameter (default value 0.07) that controls the concentration of the similarity distribution.
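A minimal sketch of the InfoNCE computation for a single anchor is given below, using plain Python lists as feature vectors. The function names are illustrative, and the log-sum-exp trick is added only for numerical stability; it does not change the value of the loss.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE loss for one anchor, one positive, and N negatives."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, neg) / tau for neg in negatives]
    # Stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log( exp(pos) / sum(exp(all)) ) = log_sum - pos
    return log_sum - logits[0]
```

When the anchor is equally similar to the positive and to all N negatives, the loss equals log(N + 1); it approaches zero as the positive becomes the only vector aligned with the anchor. A small τ such as 0.07 sharpens this contrast.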
(3) Multi-scale contrastive loss
To enhance the feature alignment between the generated images and real satellite images, a multi-scale contrastive loss was designed in this study. This loss maximizes the similarity of positive sample pairs while minimizing the similarity of negative sample pairs by comparing the feature representations of the generated and real images at different scales. Features are extracted with the encoder module E and embedded in a feature stack $\{z_l\}_{l=1}^{L}$, where $z_l$ represents the output of the $l$th layer selected from E. Each spatial location in a layer of this stack corresponds to a different patch of the image. The spatial locations of layer $l$ are indexed by $s \in \{1, \ldots, S_l\}$, where $S_l$ is the number of spatial locations in that layer. Each time an anchor is selected, its feature is denoted as $\hat{z}_l^s \in \mathbb{R}^{C_l}$, where $C_l$ denotes the number of channels in layer $l$. The corresponding feature (positive) is denoted as $z_l^s$, and the remaining features (negatives) are denoted as $z_l^{S \setminus s}$. The goal is to match the corresponding patches (positives) of the two images while pushing the other patches (negatives) away from the anchor. Therefore, the patch-wise multi-layer contrastive loss for the mapping X → Y (i.e., BEV image → satellite view image) is given by
$$\mathcal{L}_{PC}(G, E, X) = \mathbb{E}_{x \sim X} \sum_{l=1}^{L} \sum_{s=1}^{S_l} \ell\left(\hat{z}_l^s, z_l^s, z_l^{S \setminus s}\right) \quad (4)$$
Here, $l$ denotes the selected feature layer, and $S_l$ indicates the total number of spatial positions on the feature map of layer $l$. $\hat{z}_l^s$ denotes the vector extracted from position s on the feature map of layer $l$ of the generated image G(X) (anchor). $z_l^s$ denotes the feature vector extracted from the same position s on the feature map of layer $l$ of the real satellite image Y (positive sample). $z_l^{S \setminus s}$ denotes the set of feature vectors extracted from all positions other than s on the feature map of layer $l$ of the real satellite image Y (negative samples). $\ell(\cdot)$ denotes the contrastive loss of Equation (3), which is calculated at each position s and for each layer, with the aim of making the anchor similar to the positive sample and dissimilar to the negative samples.
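The layer-wise and position-wise accumulation described above can be sketched as follows. Feature maps are represented as lists of per-position vectors, one list per layer; summing over layers and positions follows the description, although averaging per layer is a common alternative. All names are illustrative, and the per-patch loss is a compact InfoNCE with cosine similarity.

```python
import math

def _cos(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def patch_loss(anchor, positive, negatives, tau=0.07):
    """InfoNCE loss for one anchor patch."""
    logits = [_cos(anchor, positive) / tau] + [_cos(anchor, n) / tau for n in negatives]
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[0]

def multiscale_contrastive_loss(gen_feats, real_feats, tau=0.07):
    """Sum the patch loss over all selected layers and spatial positions.

    gen_feats[l][s]  : feature of the generated image at layer l, position s (anchor)
    real_feats[l][s] : feature of the reference image at the same layer and position
    """
    total = 0.0
    for gen_layer, real_layer in zip(gen_feats, real_feats):
        for s, anchor in enumerate(gen_layer):
            # Negatives: all other positions of the same layer in the reference image.
            negatives = real_layer[:s] + real_layer[s + 1:]
            total += patch_loss(anchor, real_layer[s], negatives, tau)
    return total

# Toy features: layer 1 has 2 positions, layer 2 has 3 positions.
real = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
]
misaligned = [list(reversed(layer)) for layer in real]
```

With spatially aligned features the loss is close to zero, and it grows sharply when patches of the generated image match the wrong locations, which is the structural constraint that plain CycleGAN lacks.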
(4) Total loss function
To jointly optimize multiple objectives, this study defines a total loss function that combines the adversarial loss, the cycle consistency loss, and the multi-scale contrastive loss, as shown in Equation (5):
$$\mathcal{L}_{total} = \mathcal{L}_{GAN} + \lambda_{cycle}\,\mathcal{L}_{cyc} + \lambda_{PC}\,\mathcal{L}_{PC} \quad (5)$$
where $\lambda_{cycle}$ and $\lambda_{PC}$ are hyperparameters used to balance the weights of the individual losses. By jointly optimizing these losses, the model can simultaneously improve the quality of the generated images, the feature alignment capability, and the accuracy of cross-view matching.
2.4. Dynamic Thresholding for Pseudo-Label Geo-Localization
Figure 3 illustrates the filtering process for dynamic threshold pseudo labels: the dynamic threshold δ_t is determined based on the confidence distribution, and high-confidence samples are selected as pseudo labels. Pseudo-label self-training helps a model learn from unlabeled data by using its own predictions on those data as labels. However, pseudo-label noise and the inevitable inconsistency between generated images and real satellite images make raw predictions unsuitable for direct use as supervision. To address noisy pseudo-labels and the uneven distribution of categories in cross-view geolocation tasks, this paper proposes a pseudo-label self-training method based on dynamic threshold filtering. Its core is to set an adaptive confidence threshold and adjust it dynamically with an exponential function so that it tracks the trend of the model’s learning ability. The method comprises the following four steps.
Step 1: Confidence Calculation and Distribution Estimation.
For a batch of unlabeled real satellite images U, first extract their features using the current geolocation model. Calculate the cosine similarity between each unlabeled image feature and all candidate satellite image features (including both real and generated images). The maximum similarity is taken as the confidence level for that sample. Then, divide the confidence range [0, 1] into K equal intervals and count the number of samples of the current batch in each interval to obtain the approximate probability density distribution of the confidence values.
Step 2: Density Peak Threshold Selection.
Let $C = \{c_1, \ldots, c_N\}$ denote the confidence set of the N unlabeled samples in the current batch. Divide the range of C into K equal intervals $I_1, \ldots, I_K$, where $I_k = [(k-1)/K,\ k/K)$. Let $f_k$ denote the frequency of samples falling within interval $I_k$. The density peak interval is defined as the interval with the highest frequency: $k^* = \arg\max_k f_k$. The candidate threshold is taken as the right endpoint of this interval: $\hat{\delta}_t = k^*/K$. This design is motivated by the observation that reliable samples tend to have confidence scores concentrated near the peak; selecting the right endpoint of this interval retains high-confidence samples while discarding low-confidence noise. For instance, if the peak interval is $I_{k^*}$ with $K = 10$ (i.e., the interval $[(k^*-1)/10,\ k^*/10)$), then $\hat{\delta}_t = k^*/10$. Typically, in cross-view tasks, the model’s confidence rarely exceeds 0.95 and often lies within 0.7–0.9, reflecting the inherent difficulty of the task and the model’s current discriminative capability.
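The density-peak selection in this step can be sketched as a simple histogram over [0, 1]. The tie-breaking rule (the lowest peak bin wins) and the clamping of values outside [0, 1] are assumptions made for the sketch; the function name is illustrative.

```python
def density_peak_threshold(confidences, num_bins=10):
    """Return the right endpoint of the most populated confidence interval."""
    counts = [0] * num_bins
    for c in confidences:
        k = int(c * num_bins)
        # Clamp so that c == 1.0 falls in the last bin and negative
        # cosine similarities fall in the first bin (assumed handling).
        k = max(0, min(k, num_bins - 1))
        counts[k] += 1
    # Index of the density peak; ties resolve to the lowest bin (assumption).
    peak = max(range(num_bins), key=lambda k: counts[k])
    return (peak + 1) / num_bins
```

For example, with K = 10 and the confidences [0.72, 0.75, 0.78, 0.81, 0.55, 0.91, 0.74], four samples fall in [0.7, 0.8), so the candidate threshold is 0.8 and only the samples at 0.81 and 0.91 would pass it.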
Step 3: Exponential Weighted Moving Average Smoothing.
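To avoid sharp fluctuations in the threshold caused by a single batch, an exponential weighted moving average is applied to smooth the candidate thresholds:
$$\delta_t = \beta_t\,\delta_{t-1} + (1 - \beta_t)\,\hat{\delta}_t \quad (6)$$
where $\beta_t$ denotes the weight balance factor of the historical threshold $\delta_{t-1}$ and the current candidate threshold $\hat{\delta}_t$, which is calculated by the following Formula (7):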
In the formula, $\beta_0$ is the initial weight, $\gamma$ is the decay rate, $t$ is the current iteration count, and $T$ is the total iteration count. This strategy ensures that during the early training phase, when model prediction noise is high, the historical threshold dominates and prevents erroneous pseudo labels from being introduced. As training progresses and model confidence stabilizes, the current threshold gradually takes precedence, enabling the threshold to adapt rapidly to improvements in model capability.
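The smoothing step can be sketched as below. The exact form of Formula (7) cannot be recovered from the text, so the decay used here, initial weight times the decay rate raised to t/T, is only an assumed form that is consistent with the listed quantities (initial weight, decay rate, current and total iteration counts). The names `beta0`, `gamma`, and both function names are illustrative.

```python
def blend_weight(t, total, beta0=0.5, gamma=0.9):
    """Weight placed on the historical threshold at iteration t.

    ASSUMPTION: exponential decay beta0 * gamma ** (t / total); the paper's
    Formula (7) may differ.
    """
    return beta0 * gamma ** (t / total)

def update_threshold(prev, candidate, t, total, beta0=0.5, gamma=0.9):
    """Exponentially weighted blend of the historical and candidate thresholds."""
    beta = blend_weight(t, total, beta0, gamma)
    return beta * prev + (1.0 - beta) * candidate
```

Whatever the exact decay, the update is a convex combination, so the new threshold always lies between the historical value and the current candidate. Under the assumed form with an initial weight of 0.5 and a decay rate of 0.9, the weight on history only falls from 0.5 to 0.45 over training, so a steeper schedule would be needed for the current threshold to clearly dominate late in training.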
Step 4: Pseudo-label Selection and Training Scheduling.
Finally, select samples with confidence scores $c_i \geq \delta_t$ as pseudo labels and incorporate them into the next training round. The threshold $\delta_t$ is updated every 50 iterations. During the warm-up phase covering the first 20% of iterations, no pseudo labels are used; the model is trained solely on the limited labeled data to establish a reasonable initial feature representation. The complete workflow is summarized in Algorithm 1.
| Algorithm 1: Dynamic Thresholding for Pseudo-Label Self-Training |
| Input: |
| 1: Unlabeled sample set U, current model M, historical threshold $\delta_{t-1}$ |
| 2: Total iterations T, current iteration t, decay rate γ = 0.9, initial weight $\beta_0$ |
| 3: Number of confidence intervals K = 10 |
| Output: Updated threshold $\delta_t$, selected pseudo-label set P |
| 1: Extract features for each sample in U using model M. |
| 2: For each sample, compute the cosine similarity vector s between its feature and all candidate satellite image features, and take the maximum value as the confidence $c_i$. |
| 3: Divide the confidence interval [0, 1] into K equal intervals. |
| 4: Count the frequency of confidence values falling into each interval for all samples in the current batch, forming an approximate probability density distribution. |
| 5: Find the interval with the highest frequency (density peak) and denote its right endpoint as the current batch threshold $\hat{\delta}_t$. |
| 6: Calculate the weight factor $\beta_t$ using exponential decay (Formula (7)). |
| 7: Update the global dynamic threshold: $\delta_t = \beta_t\,\delta_{t-1} + (1 - \beta_t)\,\hat{\delta}_t$. |
| 8: Select samples satisfying $c_i \geq \delta_t$ as the pseudo-label set P. |
| 9: Return $\delta_t$ and P. |
In cross-view geolocation, triplet loss is commonly used to enforce that anchor samples (ground images) are closer to positive samples (satellite images of the same location) than to negative samples (satellite images of different locations). However, standard triplet loss treats all samples equally and overlooks the critical role of hard examples—samples that lie near the decision boundary. Hard positives are positive samples with the lowest similarity to the anchor, while hard negatives are negative samples with the highest similarity to the anchor. Mining these hard examples forces the model to learn more discriminative features.
To this end, this study proposes a dynamic hard-sample triplet loss that explicitly strengthens the model’s ability to discriminate complex cross-view relationships by adaptively mining hard positive and hard negative samples. Within each batch, cosine similarity is used to score the anchor samples against the sets of positive and negative samples. Based on these scores, the positive sample with the smallest similarity to the anchor is selected as the hard positive, whereas the negative sample with the largest similarity is selected as the hard negative. This sample-selection mechanism effectively improves the ability of the model to distinguish boundary samples. A dynamic weight adjustment mechanism is introduced to assign higher loss weights to the hard samples. See Equation (8).
where a is the anchor image, p is the positive sample, n is the negative sample, d(i, j) is the feature distance between samples i and j, and α is the margin hyperparameter, which controls the degree of separation between hard samples. To enhance the contribution of hard samples to training, the dynamic weight adjustment mechanism adjusts the loss weight of each sample according to the change in similarity. Specifically, when a sample is recognized as a hard sample, its loss weight is dynamically increased so that it has a greater impact on model training. By focusing on hard samples, the model can more accurately match cross-view image pairs that are visually different but geographically identical, which alleviates overfitting. In addition, the dynamic filtering mechanism avoids the local optima caused by fixed sample selection and enhances model generalization.
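The hard-sample mining step can be illustrated with a minimal sketch for one anchor. The distance is taken as one minus cosine similarity, and the dynamic weight is modeled as growing linearly with the margin violation; both choices, the `scale` parameter, and the function names are assumptions for illustration, since the exact form of Equation (8) is not recoverable from the text.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hard_triplet_loss(anchor, positives, negatives, margin=0.2, scale=1.0):
    """Triplet loss on the hardest positive and hardest negative of one anchor."""
    # Hardest positive: lowest similarity, i.e. largest distance to the anchor.
    d_pos = max(1.0 - cosine(anchor, p) for p in positives)
    # Hardest negative: highest similarity, i.e. smallest distance to the anchor.
    d_neg = min(1.0 - cosine(anchor, n) for n in negatives)
    violation = max(0.0, d_pos - d_neg + margin)
    # ASSUMED dynamic weight: harder triplets (larger violation) count more.
    weight = 1.0 + scale * violation
    return weight * violation
```

With a margin of 0.2, a triplet whose hardest positive is at distance 0.4 and whose hardest negative is at distance 0.2 violates the margin by 0.4 and is up-weighted, while a triplet that already satisfies the margin contributes nothing.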
2.5. Training Details
All experiments were implemented using the PyTorch (v 1.12.0) framework and trained on an NVIDIA RTX 4090Ti GPU. The training settings for the image generation and geolocation stages are as follows:
(1) Image Generation Stage
This stage utilized all paired images from the CVUSA and CVACT datasets (training/test splits defined in Section 2.1). All input images were resized to 256 × 256 pixels and underwent data augmentation via random horizontal flipping. Both the generator and discriminator employed the Adam optimizer with an initial learning rate of 0.0002, held constant for the first 100 epochs before linearly decaying to zero. The batch size was set to 8. The total loss weights were set to λ_cycle = 10 and λ_PC = 2. The temperature parameter τ was set to 0.07 by default in the contrastive loss. Training was conducted for 200 epochs.
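The learning-rate schedule of this stage can be written as a small function. The sketch assumes zero-based epoch indices and that the rate reaches exactly zero at epoch 200; the text does not state whether the final training epoch still uses a small non-zero rate, and the function name is illustrative.

```python
def generator_lr(epoch, base_lr=2e-4, constant_epochs=100, total_epochs=200):
    """Constant learning rate for the first phase, then linear decay to zero."""
    if epoch < constant_epochs:
        return base_lr
    decay_epochs = total_epochs - constant_epochs
    remaining = max(0, total_epochs - epoch)
    return base_lr * remaining / decay_epochs
```

Under these assumptions the rate stays at 0.0002 through epoch 99, is halved by epoch 150, and reaches zero at epoch 200.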
(2) Geolocation Phase
This phase employed a semi-supervised setup, in which the training data comprised a small number of labeled samples and a large number of unlabeled samples. Input images were uniformly resized to 256 × 256. Data augmentation included random horizontal flipping and color jittering. The feature extraction network was ConvNeXt-B, trained with the Adam optimizer at an initial learning rate of 0.001 with cosine annealing decay. The batch size was 128 (with the labeled-to-unlabeled sample ratio dynamically adjusted based on the annotation proportion). The triplet loss margin α was set to 0.2, and the temperature parameter τ defaulted to 0.07 in similarity calculations. The total number of training iterations T was determined by the annotation ratio (e.g., 50 epochs for 1% annotation and 120 epochs for 100% annotation), with network parameters updated at every iteration. The global threshold δ_t was updated every 50 iterations. In each iteration, a number of pseudo labels equal to the batch size (i.e., 128) was selected from the unlabeled samples for training, with the selection criteria detailed in
Section 2.4. After training began, pseudo labels were not enabled during the first 20% of iterations (only labeled data were used for warm-up). Subsequently, pseudo labels were gradually introduced according to the dynamic threshold mechanism described in
Section 2.4. The training data for this phase were drawn from the complete training set: we randomly retained a specified proportion (e.g., 1%, 5%, 10%, 20%, 50%, 75%, 100%) of samples as labeled data, while the remaining samples had their labels removed and were used as unlabeled data for training. The test set remained fully intact throughout and was not used for training, serving solely for final performance evaluation.
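The data split and the two scheduling rules of this phase (labeled-only warm-up for the first 20% of iterations, threshold refresh every 50 iterations) can be sketched as below. All function names and the fixed random seed are hypothetical and serve only to make the procedure concrete.

```python
import random

def split_labeled(sample_ids, labeled_ratio, seed=0):
    """Randomly keep `labeled_ratio` of the ids as labeled; the rest lose their labels."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_labeled = max(1, round(labeled_ratio * len(ids)))
    return ids[:n_labeled], ids[n_labeled:]

def use_pseudo_labels(iteration, total_iterations, warmup_fraction=0.2):
    """Pseudo labels stay disabled during the labeled-only warm-up."""
    return iteration >= warmup_fraction * total_iterations

def should_update_threshold(iteration, every=50):
    """The global threshold is refreshed once every `every` iterations."""
    return iteration > 0 and iteration % every == 0
```

With 1000 training ids and a 1% ratio, `split_labeled` returns 10 labeled and 990 unlabeled ids, and the two sets never overlap.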
3. Results
3.1. Cross-View Image Generation
To validate the effectiveness of the self-supervised contrastive learning method for cross-view image generation, we removed each component of the method in turn, namely the BEV viewpoint transformation and contrastive learning. The ablation experiments were conducted on the CVUSA and CVACT datasets and compared four variants: BEV_CycleGAN+Contrastive learning (BEV_CycleGAN+CL), BEV_CycleGAN, BEV, and CycleGAN. The results are shown in
Figure 4.
Figure 4 presents a qualitative comparison of cross-view image translation results obtained with different methods. The images generated by BEV_CycleGAN+CL exhibit the highest visual similarity to real satellite imagery, particularly in road structures (red boxes), vegetation areas (yellow boxes), and fine-grained textures (blue boxes), whereas the images produced by CycleGAN exhibit noticeable blurring and distortion. The results show that the BEV_CycleGAN+CL method outperforms BEV_CycleGAN, BEV, and CycleGAN in both image quality and detail recovery. The generated images exhibit relatively accurate spatial structures in areas such as roads (red box in
Figure 4) and vegetation (yellow box in
Figure 4), with richer details (blue box in
Figure 4). The traditional CycleGAN method produced relatively low-quality images with noticeable blurring and distortion. While BEV and BEV_CycleGAN better preserved spatial consistency between ground and satellite images during generation, significant local structural distortions remained. The results demonstrate that the contrastive learning strategy enhances the model’s ability to capture image details by introducing additional similarity constraints during training. The proposed method uses the BEV transformation as its foundation and introduces CycleGAN into the framework; this design reduces the errors caused by direct cross-view conversion between ground and satellite perspectives. Specifically, for typical land cover types such as roads and vegetation, the method delivers substantial improvements in image quality, visual clarity, and structural integrity, which improves the accuracy of structural reconstruction and thereby the overall quality of the generated images.
However, despite these improvements, residual geometric distortions—particularly slight tilting of building facades and edge blurring in complex structures—remain visible in densely built areas due to the inherent limitations of the flat-ground assumption underlying BEV projection. These residuals are most noticeable in the building regions (blue box) but are substantially mitigated compared to the BEV and BEV-CycleGAN baselines.
This paper systematically evaluates the cross-view image generation performance of four methods—BEV_CycleGAN+CL, BEV_CycleGAN, BEV, and CycleGAN—on the CVACT and CVUSA datasets using quantitative metrics (PSNR, SSIM, RMSE), as shown in
Table 1.
Experimental results show that on the CVACT dataset, BEV_CycleGAN+CL achieves PSNR improvements of 2.83%, 16.02%, and 42.30% compared to BEV_CycleGAN, BEV, and CycleGAN, respectively, indicating that BEV_CycleGAN+CL generates images with the highest fidelity. SSIM improved by 6.12%, 8.00%, and 18.48%, respectively, indicating that BEV_CycleGAN+CL exhibits the highest spatial structural consistency with real satellite images; RMSE decreased by 9.28%, 15.51%, and 25.35%, respectively, demonstrating the smallest error between BEV_CycleGAN+CL generated images and real satellite images. On the CVUSA dataset, BEV_CycleGAN+CL outperformed the BEV_CycleGAN, BEV, and CycleGAN models with PSNR improvements of 3.76%, 19.71%, and 41.45%, respectively, and SSIM improvements of 6.57%, 8.06%, and 16.28%, respectively. RMSE was reduced by 9.94%, 14.16%, and 22.32%, respectively. These results demonstrate that BEV_CycleGAN+CL achieves the best performance in terms of image fidelity, structural consistency with real images, and error relative to real images. This indicates that incorporating the BEV module and the contrastive learning strategy not only addresses the shortcomings of traditional methods in image detail and global structure but also provides a more efficient and reliable approach to cross-view image generation.
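For reference, two of the metrics and the relative changes quoted above can be computed as in the sketch below. It assumes 8-bit images flattened into pixel lists and assumes that the percentages are relative changes with respect to each baseline; SSIM is omitted because it requires a windowed implementation.

```python
import math

def rmse(img_a, img_b):
    """Root-mean-square error between two equally sized flat pixel lists."""
    squared = [(a - b) ** 2 for a, b in zip(img_a, img_b)]
    return math.sqrt(sum(squared) / len(squared))

def psnr(img_a, img_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = rmse(img_a, img_b)
    if error == 0:
        return math.inf
    return 20.0 * math.log10(max_value / error)

def relative_change(ours, baseline):
    """Relative change of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline
```

A positive `relative_change` is an improvement for PSNR and SSIM, whereas for RMSE the improvement corresponds to a negative value.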
Although BEV-CycleGAN+CL significantly improves overall generation quality, the flat ground assumption underlying BEV transformations still introduces noticeable geometric distortions in densely built areas. As shown in the third column of
Figure 4, building outlines exhibit slight tilting or edge blurring. Compared to BEV-CycleGAN, introducing contrastive learning (BEV-CycleGAN+CL) partially mitigates such distortions through feature-layer consistency constraints; however, structural misalignments remain unresolved. Quantitative analysis shows that, on samples containing high-rise buildings, BEV-CycleGAN+CL achieves a PSNR approximately 2.1 dB lower than on suburban scenes, indicating that geometric errors still impact generation quality. Synthetic images also still differ subtly from real images in transient objects (e.g., vehicles, pedestrians). As shown in the blue-boxed regions of
Figure 4, the baseline CycleGAN frequently generates false objects (the “hallucination” phenomenon), while BEV-CycleGAN+CL significantly reduces such artifacts, indicating that contrastive learning effectively suppresses fabricated content. However, slight shifts in texture details persist. Statistics reveal that among samples exhibiting hallucinations, the R@1 of subsequent retrieval decreased by approximately 3.2 percentage points, indicating that a residual domain gap remains. This analysis demonstrates the important role of contrastive learning in suppressing geometric distortions and hallucinations, though future work should integrate finer-grained geometric modeling or semantic priors to further narrow domain discrepancies.
To quantify the computational overhead introduced by the multi-scale contrastive loss L_PC, we compared the single-iteration training time of BEV_CycleGAN and BEV_CycleGAN+CL under identical hardware conditions (NVIDIA RTX 4090Ti). On the CVUSA dataset, BEV_CycleGAN averaged 0.32 s per iteration, increasing to 0.41 s with L_PC, a 28.1% rise. On the CVACT dataset, the time increased from 0.35 s to 0.45 s, a 28.6% increase. This additional overhead primarily stems from feature extraction across three feature layers (64 × 64, 32 × 32, 16 × 16) and InfoNCE loss computation. However, considering the significant improvements in generated image quality and localization accuracy, this trade-off is acceptable.
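To make the source of this overhead concrete, the following sketch computes an InfoNCE loss for one query feature and averages it over several feature scales. It uses the temperature τ = 0.07 from Section 2.5; how patches are sampled and paired at each scale is not specified here, so the input layout (`per_scale_triples`) is an assumption.

```python
import math

def info_nce(query, positive, negatives, tau=0.07):
    """InfoNCE loss for one query against one positive and several negatives."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    # The positive logit comes first, followed by one logit per negative.
    logits = [cosine(query, positive) / tau]
    logits += [cosine(query, n) / tau for n in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - logits[0]

def multi_scale_loss(per_scale_triples, tau=0.07):
    """Average InfoNCE over feature scales (e.g., 64x64, 32x32, 16x16)."""
    losses = [info_nce(q, p, negs, tau) for q, p, negs in per_scale_triples]
    return sum(losses) / len(losses)
```

Each additional scale adds one more set of similarity computations per sampled feature, which is consistent with the loss contributing a noticeable share of the per-iteration time.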
3.2. Image Geolocalization Results
On the CVUSA and CVACT datasets, the proposed model is compared with several state-of-the-art methods to evaluate its performance, as shown in
Table 2. The compared methods are SAFA (Spatial Aware Feature Aggregation) [
4], SAFA
† (Spatial Aware Feature Aggregation), LPN
† (Local Pattern Network) [
23], DSM
† (Dynamic Similarity Matching), L2LTR (layer-to-layer Transformer) [
24], GeoDTR (Geometric Disentangled Transformer) [
25], TransGeo [
26], and Sample4G
† [
27].
As shown, the proposed dynamic threshold pseudo-label geolocation model achieves state-of-the-art performance across all evaluation metrics (R@1, R@5, R@10, and R@1%) on both datasets. Among the compared methods, Sample4G† achieved the highest localization accuracy. On the CVUSA dataset, the proposed model outperformed Sample4G† by 0.26, 0.05, 0.10, and 0.02 percentage points in R@1, R@5, R@10, and R@1%, respectively. On the CVACT Val dataset, it outperformed Sample4G† by 2.32, 1.95, 0.16, and 0.61 percentage points, respectively. On the CVACT Test dataset, it surpassed Sample4G† by 1.88, 0.86, 0.18, and 0.26 percentage points, respectively.
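The recall metrics used in this comparison can be computed from a query-by-reference similarity matrix as sketched below. The sketch assumes a one-to-one pairing in which query i matches reference i; the function names are illustrative.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose true match (index i for query i) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the true match = number of references scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)

def recall_at_one_percent(similarity):
    """R@1%: k equals 1% of the reference gallery size (at least 1)."""
    k = max(1, len(similarity[0]) // 100)
    return recall_at_k(similarity, k)
```

R@1 therefore counts only exact top-ranked matches, while R@1% tolerates the true satellite image appearing anywhere in the top 1% of the gallery.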
To comprehensively evaluate the practicality of our method, we analyzed the computational overhead and efficiency of the model.
Table 3 compares the number of parameters (Params) and floating-point operations (FLOPs) between our method and several representative approaches.
During the geolocation phase, the feature extraction network ConvNeXt-B employed in this paper has 25.8 million parameters, with a single forward inference requiring approximately 14.5 GFLOPs. On an NVIDIA RTX 4090Ti GPU, the average time to retrieve a query image (including feature extraction and similarity calculation) is 0.18 s, meeting real-time requirements. Updates to the dynamically thresholded pseudo labels occur every 50 iterations, with negligible additional computational overhead. As shown in
Table 3’s parameter and FLOPs comparison, our method achieves the highest localization accuracy (CVUSA R@1 98.94%, CVACT Test R@1 72.42%) while maintaining comparable parameters (25.8 M) and computational cost (14.5 G) to mainstream approaches. For instance, GeoDTR has 26.7 M parameters and 15.2 G FLOPs, while Sample4G
† has 24.6 M parameters and 13.9 G FLOPs. Our method slightly exceeds Sample4G
† in parameters but achieves higher accuracy, demonstrating a favorable balance between precision and efficiency. This further validates the feasibility and superiority of our approach in practical applications.
To validate the model’s reliability in pseudo-label adaptive filtering, we conducted semi-supervised training on the CVACT dataset using true label coverage rates of 1%, 5%, 10%, 20%, 50%, 75%, and 100%, and compared the results with the UCVGL [
20] method (
Table 4). Experimental results demonstrate that even in an extremely low-resource scenario with only 1% labeled data, our method achieves an R@1 of 70.30%, outperforming UCVGL’s 68.29%. As the labeled proportion increases from 10% to 20%, all metrics stabilize, indicating that the model requires only a small number of labeled samples to generate reliable pseudo labels for the majority of unlabeled images. When the annotation proportion increases from 20% to 75%, the R@1 of our method increases from 82.19% to 85.78%, approaching the fully supervised level of 86.32%, while that of UCVGL increases from 79.60% to 83.88%. At all proportions, our method outperforms UCVGL; at 50% and 75%, for example, its R@1 is 1.74 and 1.90 percentage points higher, respectively. These results indicate that the adaptive pseudo-label filtering method not only alleviates the shortage of labeled data and reduces manual annotation costs in low-resource scenarios, but also continues to leverage unlabeled data to improve model performance at medium and high labeling rates, showing good scalability and robustness.
The consistent superiority of our method across both datasets and evaluation metrics can be attributed to three key factors. First, the BEV-CycleGAN+CL module generates pseudo-satellite images with significantly improved structural fidelity (
Table 1), which effectively bridges the domain gap between ground and satellite views. This provides the geolocation network with inputs that are already roughly aligned, thereby simplifying the feature matching task. Second, the dynamic threshold pseudo-labeling strategy adaptively selects high-confidence samples based on the evolving confidence distribution, avoiding the noise accumulation inherent in fixed-threshold methods. This is particularly evident in low-label regimes (e.g., 1% labeled data in
Table 4), where our method outperforms UCVGL by 2.01 percentage points in R@1, as the warm-up phase and EMA smoothing prevent early-stage error propagation. Third, the proposed dynamic hard sample triplet loss explicitly focuses on ambiguous pairs near the decision boundary, further refining the feature embedding space. The synergy between high-quality generated images, adaptive pseudo-label selection, and hard sample mining enables our model to achieve state-of-the-art performance on both CVUSA and CVACT, with especially notable gains in the most challenging scenarios (e.g., CVACT test set, where our R@1 reaches 72.42% vs. 71.57% for Sample4G
†). Furthermore, the stability of our method is evidenced by the controlled threshold update process in
Figure 5, which indicates that the training process is robust to batch-wise fluctuations.
4. Discussion
- (1)
Dynamic Threshold Adversarial Self-Training Evaluation
This paper draws inspiration from the concept of gradient updates in deep learning networks, employing an exponential moving average (EMA) weighting strategy for calculations. To validate whether the selected value of γ influences the weights, different values (such as γ = 0.9, 0.7, 0.5, 0.4) were chosen, as shown in
Figure 5.
Figure 5 depicts the weight decay curves, with the horizontal axis representing the number of iterations and the vertical axis representing the weight. Curves of different colors correspond to distinct values of γ (e.g., γ = 0.9, 0.7, 0.5, 0.4). When γ is large (e.g., 0.9), the weight factor decreases more slowly, indicating that sample importance remains high throughout the iterations, which suits more stable training. When γ is small (e.g., 0.4), the weight decreases rapidly, indicating greater influence in the early stages and reduced importance later on, which suits quick adjustment of the learning strategy. This exponential decay strategy dynamically adjusts sample weights: during the early training phase when the network’s predictive capability is weak, smaller weights are assigned to mitigate noise effects; as training progresses and predictive capability improves, weights gradually increase, enhancing model stability. Selecting different values impacts training effectiveness: excessively large values may slow model learning, while excessively small values may cause premature convergence, increasing susceptibility to local optima. The decay rate γ controls the speed at which the weight decreases, ensuring that the weight decays gradually throughout the entire training process.
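The effect of γ can be illustrated with the standard exponential moving average. The exact update rule of the paper is given in Section 2.4 and is not reproduced here; the form below (new value = γ × old value + (1 − γ) × observation) and all function names are assumptions used only to show how γ sets the decay speed.

```python
def ema_update(previous, observation, gamma):
    """Standard EMA: keep a fraction gamma of the old value, blend in the new one."""
    return gamma * previous + (1.0 - gamma) * observation

def history_weight(age, gamma):
    """Weight that an observation made `age` updates ago still carries in the EMA."""
    return (1.0 - gamma) * gamma ** age

def track_threshold(batch_confidences, gamma, initial=0.0):
    """Apply the EMA over a sequence of per-update confidence statistics."""
    values, current = [], initial
    for confidence in batch_confidences:
        current = ema_update(current, confidence, gamma)
        values.append(current)
    return values
```

Under this form, an observation from five updates ago retains a much larger weight for γ = 0.9 than for γ = 0.4, which matches the slower decay of the γ = 0.9 curve described above.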
Figure 6 compares the performance of different threshold filtering methods. The dynamic threshold method effectively controls the number of pseudo-labels while maintaining accuracy. The first two plots [
20] show pseudo-labels selected using either the difference between, or the sum of, the highest and second-highest retrieval similarity scores. Dynamic threshold filtering not only achieves high accuracy but also allows pseudo-label accuracy to increase gradually as the threshold rises. Filtering pseudo-labels using the discriminative power between scores eliminates erroneous labels more effectively than a single threshold, at the cost of label quantity. For dynamic threshold filtering, the number of labels decreases as the threshold increases, yet the accuracy improvement is the largest (61.35% → 90.38%), resulting in a final accuracy substantially higher than that of the other two methods. By adapting to the data distribution, the dynamic threshold filters out as many erroneous labels as possible while controlling label quantity, which matches self-training’s demand for high-quality pseudo-labels. Its high accuracy provides robust supervision signals for model iteration, validating the effectiveness of “dynamic filtering” within the self-training framework.
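The quantity versus accuracy trade-off described here can be reproduced on toy data with the two score-based criteria (difference or sum of the top two similarities). This is a sketch of those baseline criteria only, not of the dynamic threshold rule of Section 2.4; the function names and the toy scores are hypothetical.

```python
def select_pseudo_labels(score_rows, threshold, criterion="margin"):
    """Return (query, predicted_reference) pairs whose confidence passes the threshold."""
    selected = []
    for q, scores in enumerate(score_rows):
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        top1, top2 = scores[order[0]], scores[order[1]]
        # "margin": top-1 minus top-2 score; any other value: their sum.
        confidence = top1 - top2 if criterion == "margin" else top1 + top2
        if confidence >= threshold:
            selected.append((q, order[0]))
    return selected

def pseudo_label_accuracy(selected, ground_truth):
    """Share of selected pseudo labels that agree with the true reference index."""
    if not selected:
        return 0.0
    correct = sum(1 for q, pred in selected if ground_truth[q] == pred)
    return correct / len(selected)
```

On a toy score matrix, raising the threshold from 0.0 to 0.1 keeps fewer pseudo labels but removes the ambiguous, incorrectly labeled query, so accuracy rises as the label count falls.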
- (2)
Ablation Studies
To validate the effectiveness of the dynamic threshold pseudo-label self-training method, ablation studies were conducted. Here, Matching, BEV_CycleGAN+Matching, DT+Matching, and BEV_CycleGAN-DT+Matching represent direct matching, matching after BEV_CycleGAN perspective transformation, matching after dynamic threshold pseudo-label self-training, and matching after BEV_CycleGAN perspective transformation followed by dynamic threshold pseudo-label self-training, respectively, as shown in
Table 5.
The baseline model using only traditional matching (Top1 = 19.2%) performs poorly, indicating that cross-view perspective differences make direct matching challenging. Incorporating BEV-CycleGAN (Top1 = 38.7%) significantly improved performance (+19.5 percentage points), validating the effectiveness of the proposed multi-scale contrastive learning and geometric constraints for cross-view generation. The generated high-quality satellite images mitigated the perspective discrepancy. When using dynamic threshold pseudo-label self-training alone (DT+Matching), Top1 improved to 42.3% (+23.1 percentage points over the baseline), demonstrating that dynamically filtering high-confidence pseudo-labels effectively leverages unlabeled data to enhance model discrimination. Combining both approaches (Top1 = 48.5%) achieves the best performance, demonstrating that the high-quality pseudo-satellite images generated by BEV-CycleGAN and the adaptive filtering mechanism of dynamic thresholds complement each other and further enhance feature alignment accuracy.
- (3)
Network Convergence Analysis
This study employs a systematic training strategy to optimize the cross-view image translation performance of the BEV_CycleGAN model. Experiments are conducted on two large-scale public geospatial datasets, CVUSA and CVACT, each containing 35,532 training pairs and 8884 testing pairs. Through an end-to-end training process, the model progressively learns the complex mapping relationship between bird’s-eye-view and satellite images.
Figure 7 shows the training loss convergence curves of the generator with contrastive learning module (Generator+CL) and the baseline generator (Generator) on the CVUSA and CVACT datasets. As illustrated, during the training process on both datasets, the model with contrastive learning consistently exhibits lower loss values than the baseline model, and its convergence curve is smoother and more stable, with significantly reduced oscillations, especially in the mid-to-late stages of training. This indicates that multi-scale contrastive learning effectively constrains the training process of the generative model, enhancing semantic consistency across views and improving optimization efficiency and convergence robustness. This phenomenon is consistent across different datasets, further validating that the proposed module plays an important role in stabilizing cross-view image generation and consequently improving geolocation performance.
- (4)
Method Effectiveness
The superior performance of the proposed method stems from a synergistic multi-level design. First, the introduction of an explicit bird’s-eye-view (BEV) geometric prior—implemented via ground-plane-based inverse projection—imposes pixel-level spatial constraints between ground and satellite views. This transforms the perspective transformation problem into a detail-refinement task within a structurally aligned domain, significantly mitigating global distortion, as evidenced by the accurate reconstruction of road networks in
Figure 4. Second, multi-scale contrastive learning compels the model to move beyond the superficial feature transfer (e.g., color and illumination) typical of CycleGAN. By constructing dense cross-view positive and negative sample pairs in the feature space, semantic consistency between input and output is reinforced. This is key to the improved fidelity in vegetation and architectural details and the notable SSIM improvement shown in
Table 1. Finally, the dynamic pseudo-labeling mechanism enables efficient utilization of unlabeled data through adaptive confidence calibration. Its quality–quantity balancing strategy helps the model avoid noisy interference in early training stages while gradually releasing reliable supervisory signals later, leading to robust performance even under semi-supervised settings—as demonstrated by the results using only 1% ground-truth labels in
Table 4. Together, these three components form the core foundation that ensures high accuracy and strong generalization in the cross-view geolocalization task.
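The ground-plane-based inverse projection mentioned above can be sketched as follows: each BEV pixel is mapped to a metric offset on the ground and then to the panorama pixel that observes it. The sketch assumes an equirectangular panorama with the heading at the image center and the horizon at mid-height; the axis conventions, function names, and parameter values are illustrative assumptions.

```python
import math

def bev_pixel_to_ground(row, col, bev_size, meters_per_pixel):
    """Metric offset (x right, y forward) of a BEV pixel; the camera is at the center."""
    center = (bev_size - 1) / 2.0
    return (col - center) * meters_per_pixel, (center - row) * meters_per_pixel

def ground_to_panorama(x, y, camera_height, pano_width, pano_height):
    """Panorama pixel (u, v) that sees the flat-ground point (x, y)."""
    distance = math.hypot(x, y)
    azimuth = math.atan2(x, y)  # 0 = straight ahead, positive to the right
    depression = math.atan2(camera_height, distance)  # angle below the horizon
    u = (azimuth / (2.0 * math.pi) + 0.5) * pano_width
    v = (0.5 + depression / math.pi) * pano_height
    return u, v
```

Sampling the panorama at `(u, v)` for every BEV pixel yields the BEV image; the mapping is exact only for points that actually lie on the ground plane.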
- (5)
Analysis of Robustness and Limitations under Extreme Conditions
To systematically evaluate the applicability and robustness of the proposed method in real-world complex environments, this study further designed test scenarios encompassing extreme illumination variations and significant seasonal differences. Experiments utilized ground-based panoramic images including overcast conditions (row 1), sunny conditions (row 2), and strong backlighting (row 3), alongside data collected during summer with dense vegetation (row 4) and winter with sparse vegetation (rows 5 and 6). These scenarios simulate common challenges of illumination and surface coverage variations encountered in practical applications.
Figure 8 illustrates the complete cross-view transformation workflow from raw panoramas to generated satellite views across these diverse scenarios, comprising four processing stages: input panorama (Panorama), bird’s-eye view (BEV) generated via geometric projection, satellite-view images synthesized using the proposed BEV_CycleGAN+CL framework, and corresponding real satellite imagery (Satellite) as reference benchmarks.
Analysis indicates that this method maintains excellent geometric consistency and preserves critical semantic information under the vast majority of extreme lighting and seasonal variation conditions. Particularly for features with regular geometric patterns, such as road networks and vegetated areas, the generated images demonstrate high alignment with actual satellite imagery in spatial layout and contour retention, validating the effectiveness of the integrated BEV geometric prior and multi-scale contrastive learning mechanism in cross-domain feature alignment. However, the analysis also reveals limitations: the model tends to exhibit local edge blurring under backlight or low-light conditions, may cause semantic confusion in seasonal change scenarios, and produces minor deformations in high-rise dense areas due to cumulative geometric projection errors. These phenomena primarily stem from alignment challenges posed by simplified ground plane geometric assumptions and cross-seasonal texture variations.
Furthermore, regarding the generalization capability across heterogeneous landscapes, the experimental validation in this study primarily relies on the CVUSA and CVACT datasets, which predominantly cover suburban and urban street scenes (in the United States for CVUSA and in Canberra, Australia, for CVACT), characterized by regular road networks and low-to-mid-rise buildings. However, in real-world applications, the model may encounter more diverse geographical environments, including rural areas, complex intersections, and regions with significant elevation changes such as hills or mountains. In these scenarios, the flat-ground assumption underlying BEV transformation may partially fail, leading to increased geometric projection errors. For instance, on steep slopes or undulating terrain, the BEV images generated under the flat-ground assumption may exhibit road distortions or misalignments; in rural areas, sparse building distributions and extensive natural landforms may also affect the stability of feature matching. Although a preliminary analysis of geometric distortions in complex terrains (e.g., an approximate 4.5 percentage point drop in R@1 on steep slopes) is provided in the sixth analysis of
Section 4, a systematic cross-scene generalization evaluation still requires further investigation in future work. Subsequent research will consider collecting or synthesizing datasets covering a wider range of geographical landscapes (e.g., rural, mountainous, coastal areas) and integrating multimodal elevation data (e.g., LiDAR, DSM) to enhance the model’s adaptability to terrain variations, thereby improving its robustness and generalization performance in heterogeneous environments.
Figure 9 presents a visual comparison of the Top 5 retrieval results for cross-view geolocation under extreme lighting and seasonal variations. In the figure, each column displays, from top to bottom: the query ground panoramic image captured under low light, strong backlight, or different seasonal conditions; the corresponding satellite-view image generated by BEV_CycleGAN+CL; and the actual satellite imagery. To clearly distinguish retrieval accuracy, red boxes indicate correctly matched satellite images, while blue boxes denote incorrectly matched images.
Analysis reveals that in most extreme scenarios, the generated satellite views within red boxes maintain high spatial and semantic consistency with their real-world counterparts regarding key geographic features—such as road alignment, building outlines, and vegetation distribution. This demonstrates the model’s ability to robustly align features despite apparent variations caused by lighting and seasonal changes. Notably, top-ranked results (e.g., Top1 and Top2) exhibit significant visual correspondence between generated images and their correctly matched targets. Further examination of the mismatched samples indicated by blue boxes reveals that while the generated images retain structural similarity to the query scene in overall layout, their corresponding satellite imagery exhibits deviations from the actual target in terms of feature details, local textures, or spatial orientation. Such errors predominantly occur in retrieval results with relatively low similarity (e.g., Top4–Top5), indicating that under extreme lighting or complex vegetation coverage conditions, the model may still encounter matching ambiguities due to missing local information or cross-view imaging differences. This visualization further confirms, from a visual interpretability perspective, that our method achieves reliable cross-view image generation and matching under most extreme lighting and seasonal conditions, enhancing the model’s applicability in complex real-world environments.
Through error analysis, we identified three types of systematic errors in the model under extreme conditions: (1) Geometric information loss due to severe occlusion (e.g., Row 1 in
Figure 9), where dense tree canopies obscure ground roads. The BEV transformation fails to reconstruct complete road topology, resulting in fragmented or misaligned generated images. This causes the retrieval model to incorrectly assign similarity scores to other geographic locations with similar fractured structures (Top4 and Top5 in Row 1 are both incorrect matches). (2) Geometric distortion caused by BEV transformation (e.g., Row 3 in
Figure 9), where the BEV geometric projection breaks down in dense high-rise areas because of the ground-plane assumption, resulting in tilted and distorted building outlines in the generated images. This affects the model’s ranking stability in fine localization, causing all Top2–Top5 matches to be incorrect. (3) CycleGAN’s “hallucination” phenomenon (e.g., Row 5 in
Figure 9), which generates non-existent structures like vehicles in sparse winter vegetation scenes. These false features cause the retrieval model to match them with similar noise patterns in real images, triggering misclassifications. The cumulative effect of these errors significantly reduces localization reliability in complex scenes.
- (6)
Uncertainty Analysis
While the proposed method achieves strong performance in most scenarios—particularly in suburbs and rural areas where building height variations are moderate—its effectiveness diminishes in dense urban centers characterized by high-rise buildings. In such environments, the accurate reconstruction of roads and vegetation via BEV geometric priors and multi-scale contrastive learning is compromised by geometric distortions. The fundamental cause of this limitation lies in the partial failure of the “single ground-plane assumption” underlying the current approach within complex urban 3D settings. Dense buildings lead to severe occlusion and perspective foreshortening, causing a single pixel in the ground-level view to correspond to multiple physical points at different heights in the real world. Relying solely on image data makes it difficult to resolve such depth ambiguities, thereby introducing geometric errors during the projection transformation, as illustrated in
Figure 10.
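The depth ambiguity can be quantified with similar triangles: a point at height z and horizontal distance d, seen from a camera at height h, is placed by a flat-ground projection at distance d·h/(h − z). The sketch below assumes a pinhole-style ray from the camera through the point; the function name is hypothetical.

```python
import math

def flat_ground_landing(distance, point_height, camera_height):
    """Distance at which a flat-ground projection places an elevated point.

    The ray from the camera through the point is extended until it meets
    the ground plane; points at or above camera height never meet it.
    """
    if point_height >= camera_height:
        return math.inf
    return distance * camera_height / (camera_height - point_height)
```

For a camera at 2 m, a facade point 1 m above the ground and 10 m away is drawn 20 m away in the BEV image, and anything above camera height cannot be placed at all, which is why building outlines appear stretched or tilted.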
This limitation has direct implications for real-world deployments. For instance, in autonomous driving scenarios within dense urban cores, distorted reconstructions of building facades and road intersections could lead to inaccurate vehicle localization, potentially affecting downstream tasks such as path planning or obstacle avoidance. Similarly, in disaster response applications—where rapid and accurate geolocalization is critical for coordinating rescue efforts—geometric errors in high-rise areas may cause misalignment between ground reports and satellite imagery, delaying situational awareness and decision-making.
Furthermore, existing training datasets provide insufficient coverage of extreme occlusion and diverse urban facades, and the model lacks explicit perception of the scene’s three-dimensional structure. Consequently, when confronted with complex layouts not adequately learned, the model tends to produce over-smoothed results with lost details. Additionally, the flat-ground assumption inherent to BEV projection leads to notable geometric distortions in complex terrains such as steep slopes, where the generated images often exhibit misaligned road networks and deformed structures. The model is also sensitive to initial geometric calibration errors; experiments show that a camera height deviation exceeding 10% can degrade the R@1 accuracy by approximately 4.5 percentage points, underscoring the need for precise calibration or adaptive geometric modeling in practical deployments.
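The sensitivity to camera height follows directly from the flat-ground geometry: the projected ground distance is h / tan(φ) for a depression angle φ, so a relative error in the assumed height h produces the same relative error in every projected distance. The sketch below illustrates this; the function names and the example heights are hypothetical.

```python
import math

def ground_distance(depression_deg, camera_height):
    """Flat-ground distance of a point seen at a given angle below the horizon."""
    return camera_height / math.tan(math.radians(depression_deg))

def projected_offset_error(true_height, assumed_height, depression_deg):
    """Shift of the projected ground point caused by a wrong camera height."""
    true_d = ground_distance(depression_deg, true_height)
    assumed_d = ground_distance(depression_deg, assumed_height)
    return assumed_d - true_d
```

With a true height of 2.0 m and an assumed height of 2.2 m (a 10% deviation), a point seen 45° below the horizon is shifted by about 0.2 m, and points nearer the horizon are shifted proportionally further.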
To address these limitations, future research will focus on incorporating multimodal data fusion strategies. For instance, leveraging the 3D geometric information provided by LiDAR point clouds or digital surface models (DSMs) can effectively compensate for depth ambiguity caused by the flat-ground assumption in high-rise building areas. Simultaneously, integrating semantic segmentation priors can guide the generative network to focus on structural consistency across different object categories, thereby achieving more reliable cross-view localization performance in complex real-world environments.