1. Introduction
In recent years, the global navigation satellite system (GNSS) has become the primary tool for obtaining geographic location information, and it plays a crucial role in daily human activities. However, GNSS signals are often unstable in complex urban environments, particularly when obstructions from dense buildings and interference from various electronic signals occur [1]. Under such conditions, GNSS positioning accuracy often fails to meet the requirements for high-precision localization. In such cases, image-based visual geo-localization (VG) methods can serve as an effective complement, helping to complete localization tasks when GNSS signals are obstructed.
VG, also known as visual place recognition or image-based localization [2], aims to estimate the geographic location of a query image without relying on additional information such as GNSS, as illustrated in Figure 1. This task is commonly treated as an image retrieval task, in which the key challenge is to learn discriminative feature representations [3]. Recently, deep learning has achieved remarkable success in computer vision because of its powerful feature extraction capabilities, which have significantly advanced the field of VG. However, two main challenges remain in deep learning-based VG. The first is the increase in false positive samples caused by weakly supervised learning that relies on GPS labels. As illustrated in Figure 2, owing to the limited field of view (FOV) of perspective images from pinhole cameras, such as those in the Nordland database [4], or from cropped panoramic images in the Pitts250k [5,6] and SF-XL [2] databases, geographically close images can exhibit different scene content when oriented differently [7,8]. However, current deep learning-based algorithms typically select positive and negative training samples based solely on GPS labels [9,10]. Specifically, reference images located within 10 m of the query image are considered positive samples; otherwise, they are classified as negative samples. This leads to the selection of a large number of false positive samples that might not share any overlapping region with the query. During training, these false positive samples inevitably introduce substantial misleading information, ultimately degrading model performance. The second challenge is the inability to achieve stable matching and localization in complex and dynamic urban environments. Urban scene appearance is highly susceptible to lighting changes, seasonal variation, and the movement of objects (e.g., pedestrians, vehicles, and obstacles) [11,12], which results in significant visual differences between images captured at different times at the same location. This remains a major obstacle to stable matching and localization.
To address the problem of inaccurate GPS labels in the context of weakly supervised learning, Kim et al. [13] attempted to extract true positive images by verifying their geometric relations; however, this approach was limited by the accuracy of off-the-shelf geometric techniques. Some researchers [14,15,16] have adopted panoramic images as both query and reference inputs. However, the high-dimensional feature similarity computation between panoramic image pairs imposes substantial computational overhead. Moreover, the precision optical components required by panoramic imaging systems raise the cost barrier for industrial-scale deployment.
To achieve stable geo-localization in dynamic urban environments, researchers have focused on constructing image datasets that include images captured at multiple times and under various conditions, thereby improving model robustness to urban scene variations [11]. However, this approach significantly increases the complexity of data collection and preprocessing. Other studies [17,18,19] have concentrated on modifying network architectures and integrating attention mechanisms to enhance the ability to learn discriminative features, at the cost of increased network complexity. Nevertheless, these methods primarily emphasize the extraction of visual appearance information while neglecting structural information.
To address these challenges, we propose a VG framework that matches perspective images to panoramic images. The framework compares query perspective images against panoramic reference images tagged with GPS labels, which eliminates the sample-selection errors caused by the limited FOV. However, directly extracting and matching features between perspective and panoramic images introduces substantial redundant information into the matching process. This redundancy stems from the inherent asymmetry in information content and data volume between the narrow FOV of perspective images and the wide FOV of panoramic images, and handling it is essential for high-precision, computationally efficient matching between the two image types. We therefore propose the adaptive scene alignment (ASA) module to address this problem. Furthermore, we propose the structural feature enhancement (SFE) module, which is based on linear feature extraction. This module automatically identifies and enhances continuous, clear linear features in an image, guiding the model to focus on regions rich in structural information and to extract long-term stable features in dynamic urban environments.
The contributions of this study can be summarized as follows:
An ASA-based perspective-to-panoramic image VG framework is proposed. The framework uses panoramic images as reference images to address the problem of inaccurate GPS labels in the context of weakly supervised learning. The ASA module is designed to eliminate redundant scene information between the perspective and panoramic images.
An SFE module is proposed to actively exploit and capture linear features. The linear features contain rich and stable structural attributes, providing crucial insights for aligning the image semantics.
State-of-the-art performance is achieved on two benchmark datasets for VG. The results of extensive experiments demonstrate the effectiveness of the proposed method in terms of both accuracy and efficiency.
3. Methodologies
This section describes our proposed method in detail. First, we provide an overview of the network architecture. Next, we present several key modules in the architecture. Finally, we describe the loss function.
3.1. Overall Framework
Current deep learning-based VG methods typically use perspective images with a limited FOV as reference images. However, under weakly supervised learning that relies on GPS labels, such methods struggle to correctly distinguish positive from negative samples. In contrast, we use panoramic images with a wide FOV as references, which makes it possible to correctly select positive and negative samples even when relying solely on GPS labels.
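For illustration, the sketch below shows GPS-only positive/negative selection as described above; the function names are ours, and the 10 m threshold simply follows the convention mentioned in the Introduction.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS points, in meters."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def select_samples(query_gps, reference_gps_list, pos_threshold_m=10.0):
    """Split reference indices into positives/negatives using GPS distance only."""
    positives, negatives = [], []
    for idx, ref_gps in enumerate(reference_gps_list):
        d = haversine_m(*query_gps, *ref_gps)
        (positives if d <= pos_threshold_m else negatives).append(idx)
    return positives, negatives
```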
Figure 3 presents the proposed network architecture. First, we set a panoramic image as the reference and input it into the network along with the query perspective image, which effectively reduces errors in sample selection. Then, we leverage the LskNet [26] encoder as the backbone network to extract information about the target of interest and the background context required for target recognition, thereby generating a feature representation that is robust to changes in scene appearance. We regard the four convolutional residual blocks in LskNet as four stages, and the features extracted at different stages have different depth levels. To further enhance the model's ability to adapt to changes in scene appearance, we integrate the proposed SFE module into the shallow stage 2; its purpose is to increase the weight of reliable matching features and reduce the weight of irrelevant features. In addition, to eliminate the feature redundancy between perspective and panoramic images, the deep features extracted from stage 3 are fed into the proposed ASA module, where the correlations between the perspective and panoramic image features are computed along the horizontal direction to search for aligned scene regions. The feature aggregation module MixVPR [27] is then introduced to aggregate the aligned scene region features into global representations. Finally, a weighted soft margin triplet loss function [28] is used to train the model.
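To make the data flow concrete, the following PyTorch-style sketch outlines one possible forward pass under this design; the stage attributes and module names are placeholders rather than the exact implementation, and the concrete SFE, ASA, and aggregation sketches are given in the corresponding subsections below.

```python
import torch.nn as nn

class ASAVGPipeline(nn.Module):
    """Simplified sketch of the overall forward pass (module structure is illustrative)."""

    def __init__(self, backbone, sfe, asa, aggregator):
        super().__init__()
        self.backbone = backbone      # LskNet-style encoder exposing four stages
        self.sfe = sfe                # structural feature enhancement (applied after stage 2)
        self.asa = asa                # adaptive scene alignment (applied to stage-3 features)
        self.aggregator = aggregator  # MixVPR-style aggregation into a global descriptor

    def encode(self, image):
        x = self.backbone.stage1(image)
        x = self.sfe(self.backbone.stage2(x))   # re-weight shallow features with the SFE mask
        return self.backbone.stage3(x)          # deep features used for scene alignment

    def forward(self, query_persp, ref_pano):
        fq = self.encode(query_persp)
        fr = self.encode(ref_pano)
        fr_aligned = self.asa(fq, fr)           # crop the panoramic region aligned with the query
        return self.aggregator(fq), self.aggregator(fr_aligned)
```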
3.2. SFE: Structural Feature Enhancement by Exploiting Linear Features
To extract features, the proposed model employs LskNet [26] as the backbone, which adaptively applies large kernels of different sizes and adjusts the receptive field for each spatial target as required. This capability enables the model not only to capture the feature information of the target itself but also to capture background information related to the foreground target. Building on this framework, we develop the SFE module and integrate it into LskNet to augment its capacity for learning stable structural features, thereby facilitating more comprehensive and robust feature representations.
The primary focus of the VG task is large urban areas. By analyzing urban street scenes, we observe that objects with stable geometric structures, such as buildings, roads, and streetlights, are ubiquitous. In contrast to visual appearance features, geometric structural features rarely change over time [29,30]. This universality and long-term stability make them highly suitable key features for street-view image matching. In general, objects with stable structural features exhibit clear and continuous linear features, whereas objects with ambiguous structures, such as pedestrians, trees, and vehicles, often lack pronounced linear features and appear as irregular textures and shapes in images. Therefore, the proposed SFE module is built on linear feature extraction and can automatically identify and capture linear features. These linear features focus the model on image regions with stable structural features, ultimately generating long-term stable feature representations. The module can be flexibly integrated into different stages of the network, and its detailed architecture is shown in the blue dashed frame at the bottom of Figure 3.
For the input feature map $F \in \mathbb{R}^{C \times H \times W}$, to emphasize its salient regions, we perform a sum-pooling operation along the channel axis. The aggregated feature map $A \in \mathbb{R}^{H \times W}$ can be obtained using
$$A = \mathrm{SumPool}(F),$$
where $\mathrm{SumPool}(\cdot)$ [31] is the sum-pooling operation.
The aggregated feature map $A$ is processed using two 3 × 3 convolutional kernels, $K_v$ and $K_h$, as illustrated in Figure 4. The kernels are specifically designed to maximize the responses to vertical and horizontal edges, respectively. Regarding these two convolution kernels as sliding windows that traverse the entire feature map $A$, the convolution results $G_v(i,j)$ and $G_h(i,j)$ at any spatial location $(i,j)$ of the feature map $A$ can be represented by the following formulas:
$$G_v(i,j) = \sum_{(m,n)} K_v(m,n)\, A(i+m,\, j+n), \qquad G_h(i,j) = \sum_{(m,n)} K_h(m,n)\, A(i+m,\, j+n),$$
where $(i,j)$ denotes the spatial coordinates on the feature map $A$, and $(m,n)$ denotes the relative coordinates inside the two convolution kernels $K_v$ and $K_h$, with $m, n \in \{-1, 0, 1\}$.
Next, we compute the sum of the squares of these two convolution results and take the square root. The resulting value represents the intensity $E(i,j)$ of pixel $(i,j)$ in the output mask, directly reflecting the significance of that pixel as a line feature. This process can be mathematically formulated as
$$E(i,j) = \sqrt{G_v(i,j)^2 + G_h(i,j)^2}.$$
After normalization, the final line attention mask $M$ is obtained. This process is mathematically expressed as follows:
$$M = \mathrm{ReLU}\big(\mathrm{BN}(E)\big),$$
where $\mathrm{BN}(\cdot)$ is batch normalization [32] and $\mathrm{ReLU}(\cdot)$ [33] is the rectified linear unit activation function.
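As a concrete illustration, a minimal PyTorch sketch of the line attention mask computation is given below; the Sobel-style kernel values are an assumption standing in for the kernels of Figure 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LineAttentionMask(nn.Module):
    """Sketch of the SFE line attention mask; kernel values assume Sobel-style filters."""

    def __init__(self):
        super().__init__()
        k_v = torch.tensor([[-1., 0., 1.],
                            [-2., 0., 2.],
                            [-1., 0., 1.]])   # responds to vertical edges (assumed values)
        k_h = k_v.t()                          # responds to horizontal edges
        self.register_buffer("kernels", torch.stack([k_v, k_h]).unsqueeze(1))  # (2, 1, 3, 3)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, feat):                   # feat: (B, C, H, W)
        a = feat.sum(dim=1, keepdim=True)      # sum-pooling along the channel axis -> (B, 1, H, W)
        g = F.conv2d(a, self.kernels, padding=1)                    # vertical/horizontal responses
        e = torch.sqrt(g.pow(2).sum(dim=1, keepdim=True) + 1e-12)   # line intensity E
        return torch.relu(self.bn(e))          # normalized line attention mask M, (B, 1, H, W)
```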
To utilize the generated line attention mask $M$ for highlighting important regions in the feature map $F$, we first replicate its channels to the same size as $F$ and generate the expanded channel-line attention mask as
$$M_C = \mathrm{Expand}(M),$$
where $\mathrm{Expand}(\cdot)$ extends the dimensions of a tensor by utilizing the tensor broadcasting mechanism to replicate the single-channel attention mask $M$ to $C$ channels, and $C$ is the channel size of $F$. $M_C$ is a soft mask acting as a weighting function. Straightforwardly stacking the line feature mask onto the original feature map (i.e., directly performing element-wise addition of $M_C$ and $F$) may lead to a decline in performance. To mitigate this issue, we employ attentional residual learning [34]. Specifically, we use the element-wise product to re-weight the values in the feature map $F$ and obtain the final feature map as follows:
$$F' = (M_C \otimes F) \oplus F,$$
where $\otimes$ represents element-wise multiplication, i.e., the corresponding elements of the two matrices are multiplied, and $\oplus$ represents element-wise addition, i.e., the corresponding elements of the two matrices are added.
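Putting the two steps together, a minimal SFE forward pass might look like the following sketch, reusing the LineAttentionMask sketch above.

```python
import torch.nn as nn

class SFE(nn.Module):
    """Sketch of the SFE module: line attention mask + attentional residual re-weighting."""

    def __init__(self):
        super().__init__()
        self.mask = LineAttentionMask()   # defined in the sketch above

    def forward(self, feat):              # feat: (B, C, H, W)
        m = self.mask(feat)               # single-channel line attention mask, (B, 1, H, W)
        m_c = m.expand_as(feat)           # broadcast/replicate the mask to C channels
        return m_c * feat + feat          # attentional residual: (M_C ⊗ F) ⊕ F
```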
3.3. ASA Module
Because of the scene content missing from the limited FOV of perspective street images, GPS-proximity filtering incorrectly selects many false positive samples that share no overlapping area with the query image. To address this challenge, panoramic images serve as reference images in our framework; their comprehensive FOV encompasses all scene information at a given location. By contrast, the query perspective images captured by a standard camera contain only partial scene information. Because of the significant differences in FOV coverage and information content between the two image types, a direct comparison of their features would introduce a substantial amount of redundant data, consequently reducing the accuracy and efficiency of the matching process. Drawing inspiration from the way the human visual system discerns scenes, in particular its natural scanning of the environment and acquisition of scene information from multiple directions to enable precise comparison and positioning, we develop the ASA module to emulate this process. Specifically, after extracting the deep features from the images, the module calculates the correlation between the perspective and panoramic features along the equatorial direction. The network then uses this correlation to adaptively crop the region with the highest response. This strategy preselects candidate regions with high scene similarity, thereby providing a strong initial condition for subsequent matching and localization. The detailed workflow is illustrated in Figure 5.
Upon completion of feature extraction by the backbone network, the perspective feature map $F_{per} \in \mathbb{R}^{C \times H \times W_{per}}$ can be used as a sliding window to search across the panoramic feature map $F_{pan} \in \mathbb{R}^{C \times H \times W_{pan}}$. However, because the panoramic image is formed with an equirectangular projection [35], the unfolding process cuts off the left and right boundary features. To preserve the continuity and completeness of the real-world scene, the left and right boundaries of the panoramic feature map are first circularly stitched to obtain a seamless panoramic feature map $\tilde{F}_{pan}$ before the sliding search is performed. While the perspective and panoramic features maintain a consistent vertical viewpoint, they differ in horizontal viewpoint, necessitating traversal along the equatorial axis. During the sliding search, we calculate the similarity between the perspective and panoramic features in all horizontal directions, thereby quantitatively evaluating the correlation between them. This strategy helps the model adaptively locate the scene alignment region of the panoramic image with respect to the perspective image. The correlation $S(i)$ between the perspective and panoramic features is expressed as
$$S(i) = \sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W_{per}} F_{per}(c, h, w)\, \tilde{F}_{pan}(c, h, w + i), \quad i = 0, 1, \ldots, W_{pan} - 1,$$
where $C$ and $H$ denote the number of channels and the height of the feature maps, respectively, and $W_{per}$ and $W_{pan}$ denote the widths of the perspective and panoramic feature maps, respectively.
After the correlation has been computed, the region with the maximum similarity score is taken as the scene alignment area of the panoramic image with respect to the perspective image. This area is extracted from the panoramic feature map and fed into MixVPR along with the perspective feature for further processing.
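A minimal sketch of this search, assuming the correlation is implemented as a circular sliding inner product along the width (realized here with a single grouped convolution), is given below; variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def asa_align(f_per, f_pan):
    """Sketch of ASA: slide the perspective feature circularly over the panoramic
    feature along the width, pick the best-correlated offset, and crop that region.

    f_per: (B, C, H, W_per) perspective features
    f_pan: (B, C, H, W_pan) panoramic features
    returns: (B, C, H, W_per) aligned panoramic crop
    """
    b, c, h, w_per = f_per.shape
    w_pan = f_pan.shape[-1]

    # Circularly stitch the left/right boundaries so the sliding window can wrap around.
    pan_wrap = torch.cat([f_pan, f_pan[..., :w_per - 1]], dim=-1)   # (B, C, H, W_pan + W_per - 1)

    # Correlation S(i) for every horizontal offset i: one grouped conv, one query per batch item.
    corr = F.conv2d(
        pan_wrap.reshape(1, b * c, h, -1),   # treat the batch dimension as groups
        f_per.reshape(b, c, h, w_per),       # each query acts as a correlation kernel
        groups=b,
    )                                        # shape (1, B, 1, W_pan)
    best = corr.view(b, -1).argmax(dim=1)    # offset with the highest similarity score

    # Crop the scene alignment area from the wrapped panorama.
    crops = []
    for i in range(b):
        s = int(best[i])
        crops.append(pan_wrap[i, :, :, s:s + w_per])
    return torch.stack(crops)
```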
3.4. Feature Aggregation Strategy
To achieve efficient and accurate visual geo-localization at city scale, an appropriate feature aggregation strategy is required to semantically encode the scene-aligned feature map pairs produced by the ASA module into compact and robust feature representations. To effectively integrate both visual features and spatial structural characteristics into the representation, this paper introduces an efficient feature aggregation strategy, MixVPR. Unlike traditional pyramid-style hierarchical aggregation that manually weights local regions, MixVPR employs multiple structurally identical feature mixers composed of multi-layer perceptrons (MLPs) [36] to iteratively fuse feature maps. These feature mixers leverage the capability of fully connected layers to aggregate features in a holistic manner rather than focusing on localized features, thereby generating more compact and robust representations. The detailed structure is illustrated in Figure 6.
The feature map $F \in \mathbb{R}^{c \times h \times w}$ extracted from LskNet can be regarded as a collection of $c$ two-dimensional features $X^i$ of size $h \times w$,
$$F = \{X^1, X^2, \ldots, X^c\},$$
where $c$, $h$, and $w$ denote the channels, height, and width of $F$, and $X^i$ denotes the $i$-th 2D feature along the channel dimension of $F$. Next, each $X^i$ in $F$ is flattened into a one-dimensional vector using the $\mathrm{Flatten}(\cdot)$ function, resulting in the flattened feature map $F_0 \in \mathbb{R}^{c \times (h \cdot w)}$.
Then, a feature mixer $\mathrm{FM}_1$ is applied to $F_0$ to produce an output $F_1$. This process can be mathematically expressed as
$$F_1 = \mathrm{FM}_1(F_0) = F_0 + W_2\big(\sigma(W_1 F_0)\big),$$
where $W_1$ and $W_2$ denote the weights of the two fully connected layers and $\sigma(\cdot)$ denotes the nonlinear activation function. This output is then fed into a second feature mixer $\mathrm{FM}_2$, and so on, until it has passed through the $L$-th feature mixer. Through this iterative fusion, cross-level feature relationships are effectively integrated, and the process can be mathematically expressed as
$$Z = F_L = \mathrm{FM}_L\big(\mathrm{FM}_{L-1}(\cdots \mathrm{FM}_1(F_0))\big),$$
where $\mathrm{FM}$ denotes the feature mixer and $L$ denotes the number of iterations.
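For reference, a minimal feature-mixer sketch in PyTorch following the residual MLP form above (layer sizes and the activation choice are illustrative):

```python
import torch.nn as nn

class FeatureMixer(nn.Module):
    """One feature mixer: a row-wise residual MLP over flattened 2D features."""

    def __init__(self, hw, ratio=1):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Linear(hw, hw // ratio),   # W1
            nn.ReLU(),                    # nonlinear activation
            nn.Linear(hw // ratio, hw),   # W2
        )

    def forward(self, x):                 # x: (B, c, h*w)
        return x + self.mix(x)            # residual connection
```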
The dimensions of $Z$ are the same as those of the flattened feature map $F_0$. To generate low-dimensional global descriptors, two fully connected layers are added after the feature mixers. The first fully connected layer reduces the depth (channel) dimension, and the second reduces the row dimension. Specifically, a depth projection maps the channel dimension of $Z$ from $c$ to $d$, expressed as
$$Z' = W_d\, Z,$$
where $W_d \in \mathbb{R}^{d \times c}$ denotes the weight of the first fully connected layer. Next, a row projection maps the row dimension from $h \cdot w$ to $r$, expressed as
$$O = Z'\, W_r,$$
where $W_r \in \mathbb{R}^{(h \cdot w) \times r}$ denotes the weight of the second fully connected layer and the dimensions of the final output $O$ are $d \times r$. Finally, the global feature descriptor is produced by flattening $O$ and applying L2 normalization.
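A compact sketch of the full aggregation path (flatten, L stacked mixers, depth and row projections, then L2 normalization) is shown below; all dimensions are placeholders rather than the values used in our experiments, and FeatureMixer refers to the sketch above.

```python
import torch.nn as nn
import torch.nn.functional as F

class MixVPRAggregator(nn.Module):
    """Sketch of MixVPR-style aggregation into a d*r global descriptor."""

    def __init__(self, c, h, w, d=256, r=4, num_mixers=4):
        super().__init__()
        hw = h * w
        self.mixers = nn.Sequential(*[FeatureMixer(hw) for _ in range(num_mixers)])
        self.depth_proj = nn.Linear(c, d)    # W_d: reduce channels c -> d
        self.row_proj = nn.Linear(hw, r)     # W_r: reduce rows h*w -> r

    def forward(self, feat):                 # feat: (B, c, h, w)
        x = feat.flatten(2)                  # flatten each 2D feature: (B, c, h*w)
        z = self.mixers(x)                   # iterative feature mixing
        z = self.depth_proj(z.transpose(1, 2)).transpose(1, 2)   # (B, d, h*w)
        o = self.row_proj(z)                 # (B, d, r)
        return F.normalize(o.flatten(1), p=2, dim=1)             # L2-normalized descriptor
```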
3.5. Weighted Soft Margin Triplet Loss
During training, the ASA module is applied to all perspective–panoramic pairs. For matching pairs, the training process maximizes the similarity between the feature descriptors of the perspective image and the scene alignment area (the area of the panoramic image that is most similar to the perspective image). This enhances the model's ability to handle viewpoint differences. For non-matching pairs, no region of the panoramic image shares the FOV of the perspective image; nonetheless, a region with the highest similarity still exists. By minimizing the similarity between the feature descriptors of the perspective image and this most similar region, the ability to distinguish hard-to-classify features is enhanced. Therefore, we train our network using the weighted soft margin triplet loss [28]:
$$\mathcal{L} = \log\Big(1 + e^{\,\gamma\,\big(d(f_q,\, f_{pos}) - d(f_q,\, f_{neg})\big)}\Big),$$
where $f_q$ denotes the feature descriptor of the perspective image, $f_{pos}$ and $f_{neg}$ denote the feature descriptors of the scene alignment areas cropped from the matching and non-matching panoramic images, respectively, $d(\cdot,\cdot)$ computes the Euclidean distance between the matching and non-matching pairs, and the parameter $\gamma$ controls the convergence rate of the training process.
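A minimal PyTorch sketch of this loss is given below; the value of γ is a placeholder, not the setting used in our experiments.

```python
import torch

def weighted_soft_margin_triplet_loss(f_q, f_pos, f_neg, gamma=10.0):
    """Weighted soft margin triplet loss over L2 distances of descriptor triplets.

    f_q, f_pos, f_neg: (B, D) global descriptors (assumed L2-normalized).
    gamma: weighting factor controlling the convergence rate (placeholder value).
    """
    d_pos = torch.norm(f_q - f_pos, p=2, dim=1)   # distance to the matching alignment area
    d_neg = torch.norm(f_q - f_neg, p=2, dim=1)   # distance to the non-matching alignment area
    return torch.log1p(torch.exp(gamma * (d_pos - d_neg))).mean()
```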
5. Conclusions
Traditional deep learning-based VG tasks typically rely on GPS labels to select positive and negative samples. When using perspective images as training samples, this method often leads to errors in sample selection. In this paper, we propose the ASA-VG framework, in which we select panoramic images with a complete FOV as reference images and leverage the LskNet backbone network integrated with the SFE module to extract contextual information containing geometric spatial relationships. Additionally, we design the ASA module to achieve scene alignment between perspective and panoramic images.
Experimental results demonstrate that the proposed ASA-VG algorithm achieves significant improvements in accuracy and efficiency compared with state-of-the-art (SOTA) algorithms. By modeling geometric structural information and contextual information in images, the method better adapts to changes in scene appearance over time. Additionally, benefiting from the scene pre-alignment operation, the framework achieves faster and more accurate retrieval and localization.
However, a challenge remains compared with traditional VG algorithms: the benchmark datasets required by this algorithm are relatively limited, and its generalization performance may be lower than that of traditional VG algorithms when training samples are scarce.
With the development of visual foundation models (VFMs), universal feature representations can be obtained from limited samples. In the future, we will continue to explore how to combine perspective–panoramic VG with VFMs to improve the model's generalization performance.