Article

GDCPlace: Geographic Distance Consistent Loss for Visual Place Recognition

School of Basic Medical Sciences, Peking University, Beijing 100083, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(7), 1418; https://doi.org/10.3390/electronics14071418
Submission received: 21 February 2025 / Revised: 25 March 2025 / Accepted: 27 March 2025 / Published: 31 March 2025
(This article belongs to the Special Issue Machine Vision for Robotics and Autonomous Systems)

Abstract

Visual place recognition (VPR) is essential for robots and autonomous vehicles to understand their environment and navigate effectively. Inspired by face recognition, a recent trend in training VPR models is to leverage a classification objective, where image embeddings are trained to be similar to their corresponding class centers. Ideally, the predicted similarities should be negatively correlated with the geographic distances. However, previous studies typically borrowed loss functions from face recognition due to the similarity between the two tasks; these cannot guarantee the rank consistency above, as face recognition is unrelated to geographic distance. Current methods that address distance–similarity or ordinal constraints are either designed for sample-to-sample training, only partially meet the constraint, or are unsuitable for the VPR task. To this end, we provide a mathematical definition, geographic distance consistent, that formalizes the consistency a VPR loss function should adhere to. Based on it, we derive an upper bound of the cross-entropy softmax loss under the desired constraint to minimize, and propose a novel loss function for VPR that is geographic distance consistent, called GDCPlace. To the best of our knowledge, GDCPlace is the first classification loss function designed for VPR. To evaluate our loss, we collected 11 benchmarks with high domain variability to test on. As our contribution concerns the loss function and previous classification-based VPR methods mostly adopt face recognition losses, we also collected several additional loss functions to compare against, e.g., losses for face recognition, image retrieval, ordinal classification, and general purposes. The results show that GDCPlace performs the best among the different losses and the former state-of-the-art (SOTA) for VPR. It is also evaluated on ordinal classification tasks to show its generalizability.

1. Introduction

Visual place recognition (VPR) helps robots and autonomous systems recognize familiar locations, which is crucial for tasks like mapping, localization, and path planning. Stated simply, it aims to predict the geographic location where a photo was captured. It is frequently framed as a retrieval task, meaning that the system stores a large number of images with known locations to retrieve from for an input image [1,2,3]. There are two classic types of training and inference strategies: (1) sample-to-sample training with inference in a retrieval form [4,5,6,7]; (2) training and inference both in a classification form [8,9]. Recently, motivated by other retrieval tasks like face recognition [10,11] and landmark retrieval [12,13], some methods train the retrieval model using a classification proxy but carry out inference as a retrieval task, achieving a significant boost in convergence speed and decreasing implementation difficulties while maintaining comparable performance [14,15].
There is a reasonable constraint that visual place recognition methods should address: if the distances from the location where the shot was taken to locations A, B, and C progressively increase, then the similarities of the image embedding to locations A, B, and C should correspondingly decrease. However, current methods are incapable of achieving this for classification training. Specifically, they are either (1) implemented in sample-to-sample training paradigms, which are not suitable for classification training [16,17]; (2) only aiming to place correct images ahead of wrong ones, without guaranteeing the exact order [18,19]; or (3) ordinal classification methods dealing with 1d ranking data (e.g., age) [20,21], whereas VPR has 2d geographic data. Furthermore, there is currently no classification loss designed specifically for VPR, with most approaches replicating CosFace [10] and ArcFace [11]. A list of related works can be viewed in Table 1.
Motivated by the above reasons, we first give a mathematical definition called geographic distance consistent, which characterizes loss functions that encourage embedding similarities negatively correlated with geographic distances. We derive an upper bound of the cross-entropy softmax loss under the desired constraint. Supported by a few lemmas, we prove that the resulting formulation, GDCPlace, meets the definition we require. We also reveal several other useful properties it exhibits. Further, inspired by hard negative mining (HNM) [22,23], we propose to implicitly mine hard negative classes merely by modifying the formulation of our loss function, thus incurring no time or memory overhead. We call this technique hard negative class mining (HNCM).
Extensive experiments were conducted to evaluate the performance of our method. To evaluate performance and generalization capacity, we collect 11 benchmarks which cover different cities, weather conditions, and viewpoints [4,24,25,26,27,28,29]. We test our loss function on top of a previously proposed method, EigenPlace [15]. We first compare with other classification loss functions that have been, or have the potential to be, adopted for VPR training [10,11,30,31,32,33,34]. Then, we also follow recent works in adopting the DINOv2 backbone [35] to further evaluate its performance against other SOTAs. Our method exhibits the overall best performance in these comparisons. Remarkably, under an identical training setup, substituting the CosFace loss [10] utilized by EigenPlace [15] with the GDCPlace loss increases recall@1 by over 11% on SVOX Night [25]. The generalizability is also evaluated on ordinal classification benchmarks, where GDCPlace is capable of making precise estimations. GDCPlace is both experimentally and theoretically shown to be effective for training a VPR model. It is also shown to be capable of benefiting other methods.
The main contributions can be summarized as follows:
(1)
To help models learn reasonable relationships between embeddings of different locations in VPR, we propose GDCPlace, a classification loss function that satisfies the similarity-distance constraint. To the best of our knowledge, this is also the first loss function designed specifically for training a classification proxy in a VPR retrieval model.
(2)
A mathematical analysis is presented to demonstrate the effectiveness of GDCPlace from a theoretical view.
(3)
We propose hard negative class mining through loss function design to help classification training with no overhead.
(4)
Experimental results show that GDCPlace outperforms previous classification loss functions and other SOTA methods in various VPR benchmarks. Experiments on ordinal classification further demonstrate its generalizability.
The rest of our article is organized as follows. We introduce and discuss the related works in Section 2. The details of our proposed method are presented in Section 3. The effectiveness of the proposed method is evaluated on extensive benchmarks, as introduced in Section 4. The discussion is given in Section 5. Finally, the conclusion and future work directions are presented in Section 6.

2. Related Works

2.1. Classification Loss for Retrieval

To tackle open-set classification tasks like VPR [10,12,14,36], it was at first the de facto choice to adopt sample-to-sample losses [4,19,37,38,39], as they fit the training objective well. Although this works well in several cases, it has some widely recognized drawbacks. For instance, it is difficult to converge [40], it requires hard negative mining [41], and it incurs training time overhead [42]. It has therefore been proposed to set up a proxy classification task, giving samples of the same kind a center to represent each category of images [43]. From then on, several subsequent works addressed the design of such proxy classification losses [30,44,45]. Subsequently, a long line of work focused on the margin. Just as most sample-to-sample losses have an additional margin penalty, it has been shown that margins also benefit classification loss functions [46]. The works can, in general, be divided into three lines: (1) finding the best location to place the margin term [10,11,45]; (2) determining how to dynamically tune the margin during training [31,32,47]; (3) performing theoretical analysis of margin losses [34,48]. For instance, ArcFace [11] places the margin term on the angle between the embedding and the central weights, contributing to a reasonable manifold explanation. AdaFace [31] further achieves dynamic tuning of the margin term. LM-Softmax [34] pushes the inter-class distance to the maximum, backed by theoretical analysis. Recent VPR works utilize face recognition loss functions, especially CosFace and ArcFace, to train models [8,14,15]. They treat different locations as different faces in face recognition to perform the training. However, these losses are not designed for VPR and are not capable of considering the relationship between embedding similarities and geographic distances, which motivates this work.

2.2. Visual Place Recognition

VPR aims to predict the location given photos taken in a certain place [28,49,50,51,52]. In contrast to place recognition based on LiDAR [53], it is more cost-effective and has lower power consumption. A dataset with photos and their corresponding locations is constructed so that the correct image can be matched given the query photo, and the location is thus obtained. Like many other computer vision areas, it began with hand-crafted local feature processing [54,55], such as SIFT [56], SURF [57], BRIEF [58], and fusions of different descriptors [59]. In the deep learning era, much attention has been dedicated to the design of head layers that aggregate local features into a global one. NetVLAD is a popular CNN module for solving VPR tasks in an end-to-end style [4], together with a novel weakly supervised ranking loss. It works as a global feature integrator that can be plugged into neural networks, like NetBoW [60] and NetFV [61]. To alleviate the influence of noisy labels, SFRS [62] was proposed to train with an image-to-region supervision signal. GeM pooling [63], a method that aggregates local features by their exponents, was also adopted in VPR. On top of that, MixVPR [5] was proposed to perform accurate feature aggregation given various pretrained backbones. These are all based on sample-to-sample losses. CosPlace [14] first adopted the idea of CosFace [10], in a proxy classification form, for VPR and showed satisfying performance. To achieve viewpoint invariance, singular value decomposition (SVD) was adopted to classify photos from different viewpoints [15]. A line of works also converted VPR to a pure classification task covering inference, with significant speed boosts and memory savings [8,64]. There are also works trying to address the similarity-distance constraint in a sample-to-sample training style [16,17]. However, a classification method that can handle this constraint remains missing.

2.3. Hard Negative Mining

Hard negative mining focuses on identifying and utilizing challenging negative examples to enhance model robustness and discrimination capabilities, particularly in contrastive learning and object detection. The evolution started with collecting hard-to-predict samples according to specific metrics [65,66]. For instance, ref. [67] introduced an online sample mining strategy, collecting samples in each batch. Building on this trend, some works synthesize hard negative samples and automatically select from them [68,69,70]. DCL [70] and HCL [71] sample hard negatives without knowing the true labels by leveraging positive-unlabeled learning [72]. Ref. [13] utilized curriculum learning [32] with hard negative mining to achieve SOTA for landmark retrieval. There are also works that set up a dynamic dictionary to train momentum networks to obtain more online samples in a contrastive style [13,73,74,75]. HCA [76] enables hard negative mining in the design of the loss function, but for sample-to-sample training. In contrast to the above works, we propose a loss function that implicitly incorporates the idea of hard negative mining. Specifically, compared to focal loss [77], which is also a loss-function-side approach to balancing negative samples, we directly drop the easy negative classes and maintain geographic distance consistency, which is important for VPR tasks.

3. Methods

In this section, we first formulate the VPR task and give the desired definition in Section 3.1. The derivation of our GDCPlace and the analysis showing that GDCPlace is geographic distance consistent are given in Section 3.2. We introduce hard negative class mining in Section 3.3. Last, but not least, the implementation details of GDCPlace are given in Section 3.4. All the methods introduced in this section are evaluated in Section 4.

3.1. Problem Formulation

For VPR with retrieval-style inference, users upload photos to the system. The algorithm then predicts the location where the photos were taken by finding the most similar photo in the database. As in previous works [14,15], we interpret VPR as such a retrieval task but train it with a classification proxy. Here, we formulate the classification training settings for it.
The geographic map U is divided into different blocks which form a set $\{u_i\}$ with a total number of N blocks. Each block has a geographic center $z_i$ and maintains learnable normalized unit-length weights $w_i$. There are multiple photos distributed in different blocks, with locations $z_j$ and the corresponding vision signals $x_j \in \mathcal{X}$. We aim to train our network $f_w: \mathcal{X} \to \mathcal{F}$ to obtain the latent feature $g_j \in \mathcal{F}$ (normalized, unit length) such that $g_j^{\top} w_{i_1} > g_j^{\top} w_{i_2} \Leftrightarrow d_{j i_1} < d_{j i_2}$, where $d_{xy} = \|z_x - z_y\|$ is the distance of the given x-th photo to the y-th block. Let $p_j = \arg\min_i d_{ji}$ be the index of the block that the given photo belongs to. Note that we mainly follow the partition strategies of EigenPlace [15], where more details on fields of view (FoVs) and group construction are given in Appendix B. We discuss an arbitrary single input image, and let $d_i$ be the distance from this photo to the i-th block. The dot product between the latent feature of this photo and the learnable weight of the i-th block is denoted as $\cos\theta_i$, as both vectors are of unit length, following previous works [10,11]. The loss function $\mathcal{L}(\{d\}, \{\cos\theta\}, p)$ is minimized to optimize the parameters w. The goal of our paper is to find such a loss function that is geographic distance consistent. We give a formal definition of geographic distance consistent:
Definition 1
(Geographic Distance Consistent). A loss function $\mathcal{L}$ is called geographic distance consistent if, given $d_1 < d_2 < \cdots < d_N$ and a set of values $\{\cos\theta_1, \ldots, \cos\theta_N\}$, the order that minimizes $\mathcal{L}$ is $\cos\theta_1 > \cos\theta_2 > \cdots > \cos\theta_N$, where $\cos\theta_i$ represents the cosine value from $\{\cos\theta_1, \ldots, \cos\theta_N\}$ that is paired with $d_i$.
The meaning of GDCPlace is also depicted visually in Figure 1. By ensuring geographic distance consistency, we obtain a loss function that encourages a correct relationship between embedding similarities and geographic distances. This is intuitive, as we want the cosine similarity to be higher when the sample is closer to the center, and vice versa.
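To make the definition concrete, the following small sketch (our own illustration, not code from the paper) shows why the plain cross-entropy softmax objective cannot enforce it: its value is identical no matter how the negative similarities are assigned to the negative blocks, so no particular geographic ordering is preferred or penalized.

```python
# Hypothetical illustration of Definition 1: plain cross-entropy softmax loss is
# invariant to how the negative similarities are paired with the negative blocks,
# so it cannot prefer the geographically consistent ordering.
import itertools
import math

def ce_loss(cosines, pos=0, s=30.0):
    logits = [s * c for c in cosines]
    return -logits[pos] + math.log(sum(math.exp(z) for z in logits))

cosines = [0.9, 0.5, 0.1, -0.3]          # positive block first, then three negatives
for perm in itertools.permutations(cosines[1:]):
    print(ce_loss([cosines[0], *perm]))  # same value for all 6 negative orderings
```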

3.2. Geographic Distance Consistent Loss (GDCPlace)

Our goal is to find a proper formulation of the loss function that meets the consistency condition. We start from the cross-entropy softmax loss function, as it offers a reasonable statistical view. Since we aim to relate it to the geographic constraint, integrating the constraint of Definition 1 into this initial form is a natural attempt. Since p is the index of the positive block, let n denote the indices of the other, negative blocks. We can always relabel the corresponding geographic distances such that $d_p < d_{n_1} < d_{n_2} < \cdots < d_{n_{N-1}}$. According to our constraint, we should have $\cos\theta_p > \cos\theta_{n_1} > \cdots > \cos\theta_{n_{N-1}}$. Then, we can define a monotonically decreasing function h of the geographic distance d such that (the choice of h is given in Section 3.4):
$$\min\big(\cos\theta_i, h(d_i)\big) \geq \max\big(\cos\theta_j, h(d_j)\big), \quad \forall\, d_i < d_j \qquad (1)$$
Starting from the cross-entropy softmax loss, we wish to derive its form under this constraint. However, it is not easy to obtain an analytic form in this way. Instead, we focus on an upper bound of the cross-entropy loss under our constraint. Here, we present the derivation:
$$
\begin{aligned}
\mathcal{L}_{ce} &= -\log\frac{\exp(s\cos\theta_p)}{\exp(s\cos\theta_p)+\sum_{i=1}^{N-1}\exp(s\cos\theta_{n_i})} && (2)\\
&= \frac{1}{2}\Bigg(\log\Big(1+\sum_{i=1}^{N-1}\frac{\exp(s\cos\theta_{n_i})}{\exp(s\cos\theta_p)}\Big)+\log\Big(1+\sum_{i=1}^{N-1}\frac{\exp(s\cos\theta_{n_i})}{\exp(s\cos\theta_p)}\Big)\Bigg) && (3)\\
&\leq \frac{1}{2}\Bigg(\log\Big(1+\sum_{i=1}^{N-1}\frac{\exp(s\cos\theta_{n_i})}{\exp(s\cos\theta_p)}\Big)+\log\Big(1+\sum_{i=1}^{N-1}\frac{\exp(s\cos\theta_{n_i})}{\exp(s\,h(d_{n_i}))}\Big)\Bigg) && (4)\\
&\leq \frac{1}{2}\Bigg(\log\big(1+\exp(s(h(d_p)-\cos\theta_p))\big)+\log\Big(1+\sum_{i=1}^{N-1}\exp\big(s(\cos\theta_{n_i}-h(d_{n_i}))\big)\Big)+\log\frac{N}{2}\Bigg) && (5)
\end{aligned}
$$
where s is the scaling hyperparameter used by [10,11,15,30] to facilitate training. The detailed proof can be found in Appendix A. Minimizing Equation (5) is equivalent to minimizing it without the constant term. We, therefore, obtain the first formulation of GDCPlace:
$$\mathcal{L}_{GDC} = \frac{1}{s}\Bigg(\log\big(1+\exp(s(h(d_p)-\cos\theta_p))\big)+\log\Big(1+\sum_{i=1}^{N-1}\exp\big(s(\cos\theta_{n_i}-h(d_{n_i}))\big)\Big)\Bigg) \qquad (6)$$
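As an illustration of this formulation, the following is our own minimal PyTorch sketch for a single sample (the function name `gdc_loss` and the tensor layout are our assumptions, not the authors' released code); the sigmoid-shaped h and the hyperparameters s, γ, and ζ anticipate the choices described in Section 3.4.

```python
# Minimal sketch of the GDCPlace loss for one sample (our rendering of Equation (6)).
import torch

def gdc_loss(cos_theta: torch.Tensor, dist: torch.Tensor, pos: int,
             s: float = 30.0, gamma: float = 0.2, zeta: float = 6.0) -> torch.Tensor:
    """cos_theta: (N,) cosine similarities to the N class centers.
       dist:      (N,) geographic distances (meters) to the class centers.
       pos:       index of the positive (nearest) block."""
    h = 1.0 / (1.0 + torch.exp(gamma * (dist - zeta)))   # monotonically decreasing h(d)
    mask = torch.ones_like(cos_theta, dtype=torch.bool)
    mask[pos] = False
    pos_term = torch.log1p(torch.exp(s * (h[pos] - cos_theta[pos])))
    neg_term = torch.log1p(torch.exp(s * (cos_theta[mask] - h[mask])).sum())
    return (pos_term + neg_term) / s
```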
The following lemma gives a sufficient condition for satisfying Definition 1, and it is also a property worth establishing for our GDCPlace.
Lemma 1.
Given a function f, if, for any list of cosine similarities and distances to different classes $((\cos\theta_1, d_1), \ldots, (\cos\theta_N, d_N))$ with $d_1 < \cdots < d_N$ and any pair $\cos\theta_i < \cos\theta_j$ where $i < j$, we have
$$f\big((\ldots,(\cos\theta_j, d_i),\ldots,(\cos\theta_i, d_j),\ldots)\big) < f\big((\ldots,(\cos\theta_i, d_i),\ldots,(\cos\theta_j, d_j),\ldots)\big)$$
then f is geographic distance consistent.
Proof. 
It is not hard to show that we can keep swapping misordered pairs until the loss reaches its minimum under the correct order of cosine values and distances. See Appendix A. □
The above lemma indicates that it suffices to prove that swapping any pair of $\cos\theta$ values that are not aligned with the distances d decreases the value of our loss function. We are now ready to show the following.
Theorem 1.
$\mathcal{L}_{GDC}$ is geographic distance consistent.
Proof. 
The proof starts by showing that $\mathcal{L}_{GDC}$ meets the condition described in Lemma 1. Using the convexity of the involved terms together with a special case of Jensen's inequality of the Mercer type [78] (also given as Lemma A1 in Appendix A), we prove that each order swap in Lemma 1 decreases the loss, which guarantees that $\mathcal{L}_{GDC}$ is geographic distance consistent. See Appendix A for details. □
Theorem 1 indicates that $\mathcal{L}_{GDC}$ is a loss function that meets our requirement. Figure 1 depicts the effect of GDCPlace. Note that the discussion so far concerns a single input image. For the whole training set, we can further extend our theorem to prove that consistency is still guaranteed. Here, $\mathcal{L}_{GDC}^{i}$ represents the loss of the i-th input image out of a total of M.
Corollary 1.
Given all M inputs from the training dataset, $\sum_{i}^{M}\mathcal{L}_{GDC}^{i}$ is still geographic distance consistent.
Proof. 
It is also supported by Lemmas 1 and A1. Compared with Theorem 1, the extra discussion concerns the condition of Lemma 1 applied across $\mathcal{L}_{GDC}^{i}$ of different i. The proof is in Appendix A. □
The recent work UniFace [33] suffers from an extreme gradient imbalance when the class number rises; namely, the negative-class gradients dominate as more classes are considered. UniFace overcomes this issue by adding an extra term to balance the gradients. In contrast, GDCPlace is gradient balanced by nature, as we will show. Deriving the gradients with respect to the correct and wrong classes, respectively:
$$\frac{\partial \mathcal{L}_{GDC}}{\partial \cos\theta_p} = -\frac{\exp\big(s(h(d_p)-\cos\theta_p)\big)}{1+\exp\big(s(h(d_p)-\cos\theta_p)\big)}$$
$$\frac{\partial \mathcal{L}_{GDC}}{\partial \cos\theta_{n_i}} = \frac{\exp\big(s(\cos\theta_{n_i}-h(d_{n_i}))\big)}{1+\sum_{j=1}^{N-1}\exp\big(s(\cos\theta_{n_j}-h(d_{n_j}))\big)}$$
The absolute values of the positive and negative gradients are all restricted between 0 and 1, regardless of the number of negative classes N − 1:
$$-1 < \frac{\partial \mathcal{L}_{GDC}}{\partial \cos\theta_p} < 0 < \sum_{i=1}^{N-1}\frac{\partial \mathcal{L}_{GDC}}{\partial \cos\theta_{n_i}} < 1$$
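These bounds can be checked numerically; the snippet below is our own illustrative sketch (hyperparameters follow the choices later described in Section 3.4) and is not part of the paper's code.

```python
# Hypothetical autograd check of the gradient bounds of GDCPlace.
import torch

s, gamma, zeta = 30.0, 0.2, 6.0
cos_theta = torch.tensor([0.7, 0.4, 0.2, -0.1, -0.5], requires_grad=True)
dist = torch.tensor([3.0, 12.0, 30.0, 80.0, 200.0])   # meters; class 0 is the positive
h = 1.0 / (1.0 + torch.exp(gamma * (dist - zeta)))

loss = (torch.log1p(torch.exp(s * (h[0] - cos_theta[0])))
        + torch.log1p(torch.exp(s * (cos_theta[1:] - h[1:])).sum())) / s
loss.backward()
print(cos_theta.grad[0].item())         # positive-class gradient, expected in (-1, 0)
print(cos_theta.grad[1:].sum().item())  # summed negative-class gradients, expected in (0, 1)
```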
Other useful properties that GDCPlace exhibits are presented in Appendix C.

3.3. Hard Negative Class Mining (HNCM)

In various sample-to-sample training strategies [13,37,42,79], hard negative mining or other engineering variants are seen as a must for better convergence. They mine hard negative samples for a given anchor sample so that it learns the differences. Although the performance boost is well examined, it comes at the cost of massive time overhead. In our training loss design, we incorporate this idea into the formulation of GDCPlace with no obvious overhead (see Section 4.9 for complexity analysis and benchmarking). Given a list of similarities and geographic distances to the negative classes, $\{(\cos\theta_{n_1}, d_{n_1}), \ldots\}$, we sort the pairs by $\cos\theta$ in descending order and obtain the ordered list $((\cos\theta_{n_1}, d_{n_1}), \ldots, (\cos\theta_{n_{N-1}}, d_{n_{N-1}}))$. Then, we further derive GDCPlace with top-K HNCM for $K < N-1$:
$$\mathcal{L}_{GDC}^{K} = \frac{1}{s}\Bigg(\log\big(1+\exp(s(h(d_p)-\cos\theta_p))\big)+\log\Big(1+\sum_{i=1}^{K}\exp\big(s(\cos\theta_{n_i}-h(d_{n_i}))\big)\Big)\Bigg)$$
which only considers the most similar negative classes. From the perspective of geographic distance consistency, this form of GDCPlace adheres to a variant of the constraint, as follows:
Corollary 2.
$\mathcal{L}_{GDC}^{K}$ is top-$(K+1)$ geographic distance consistent, i.e., let $d_1 < \cdots < d_{K+1}$ represent the top-$(K+1)$ shortest distances; then, $\mathcal{L}_{GDC}^{K}$ is minimized when $\cos\theta_1 > \cos\theta_2 > \cdots > \cos\theta_{K+1} > \max(\cos\theta_{K+2}, \ldots, \cos\theta_N)$. This conclusion can be extended to $\sum_{i}^{M}\mathcal{L}_{GDC}^{K,i}$ for arbitrary M samples, as per Corollary 1.
Proof. 
The idea is similar to the previous theorem. See Appendix A. □
This variant of geographic distance consistency only ensures the order of similarities with respect to the distance to the positive center and the top-K shortest distances to negative ones, which is actually desired. This is because there is less semantic information for the model to judge when the distance is too large. Specifically, when retrieving images from a database given a query image, it is observed that relatively closer places share some common semantic clues for the model to predict the distances. For those farther away, however, there is hardly any semantic information to be utilized. HNCM only guarantees geographic distance calibration among the top-K classes, which prevents far blocks with fewer semantic cues from being considered. The shape of the function h also exhibits a large value gap for closer blocks and a smaller gap for distant ones. The visualization is shown in Figure 2.
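Implementation-wise, HNCM only changes which negative terms enter the sum; a hedged sketch of the top-K variant (our own code, using `torch.topk` for the selection) is given below.

```python
# Sketch of GDCPlace with top-K hard negative class mining (our illustration).
import torch

def gdc_loss_hncm(cos_theta, dist, pos, K=2, s=30.0, gamma=0.2, zeta=6.0):
    h = 1.0 / (1.0 + torch.exp(gamma * (dist - zeta)))
    mask = torch.ones_like(cos_theta, dtype=torch.bool)
    mask[pos] = False
    # Keep only the K most similar (hardest) negative classes.
    neg_sim, idx = torch.topk(cos_theta[mask], k=K)
    neg_h = h[mask][idx]
    pos_term = torch.log1p(torch.exp(s * (h[pos] - cos_theta[pos])))
    neg_term = torch.log1p(torch.exp(s * (neg_sim - neg_h)).sum())
    return (pos_term + neg_term) / s
```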

3.4. Implementation of GDCPlace

The key aspect of implementing GDCPlace is the selection of the function h. As per our analysis, h should be monotonically decreasing with respect to the geographic distance. Additionally, we aim to constrain its values to ensure stable training. Taking these factors into consideration, we adopt a sigmoid-shaped function as h, which is given by
$$h(x) = \frac{1}{1+\exp\big(\gamma(x-\zeta)\big)}$$
where we take γ = 0.2 and ζ = 6 by comparing results on the SF-XL [14] validation set; the unit of x is meters. More choices of γ and ζ are examined in the ablation studies. K is the key variable for HNCM, where we take K = 2 and show the ablations on K in Section 4.8. We set the scale factor to s = 30. The shape of the function also addresses the problem that distant blocks share less semantic information, as in Figure 2. Concretely, when we measure the difference in function values between a pair of distances $x_1$ and $x_2$ with a constant gap $|x_2 - x_1|$, the difference in output values shrinks as x grows. This matches the phenomenon shown in Figure 2, whereby visual cues vanish as the distance becomes larger. The values of h for far class centers are very similar, and the differences mainly appear for the closer ones, as also shown in Figure 2. Note that the formulation of h is similar to the positive and negative influence functions $g^{+}$ and $g^{-}$ in [16] (Janine's Loss), but (1) when we expand the formulations, one finds they are actually different: h is added, whereas the g's act as multipliers. They are like the margin terms placed differently in various loss functions for face tasks [10,11,45]. (2) Owing to these formulation differences, GDCPlace exhibits useful properties that Janine's Loss does not, e.g., gradient balance (Section 3.2). (3) Janine's Loss is used for sample-to-sample training, while GDCPlace is used for classification. (4) We use HNCM, while Janine's Loss needs hard negative mining. Together, these reasons make our approach significantly different from Janine's Loss. A detailed analysis and performance comparisons are given in Section 4.5.
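The vanishing gap of h for distant blocks can be seen directly with the stated γ = 0.2 and ζ = 6; the short numeric example below is our own illustration.

```python
# For a fixed 10 m difference, the gap in h shrinks rapidly with distance.
import math

def h(x, gamma=0.2, zeta=6.0):
    return 1.0 / (1.0 + math.exp(gamma * (x - zeta)))

for x in (5.0, 25.0, 100.0):
    print(f"|h({x}) - h({x + 10})| = {abs(h(x) - h(x + 10)):.4f}")
# Nearby pairs differ substantially in h; far pairs become nearly indistinguishable.
```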

4. Experiments

We conduct extensive experiments to evaluate the effectiveness of our GDCPlace. In this section, we first introduce our common settings in Section 4.1, covering the experimental platform, dataset selection, and training settings. In Section 4.2, we compare GDCPlace with other classification loss functions that have been adopted in VPR or could potentially be, covering a wide range including face recognition, landmark retrieval, ordinal classification, and general-purpose losses. In Section 4.3, we compare with other SOTA methods, showing that GDCPlace performs the best among classification-training-based methods and contrastive-learning-based methods with the same ResNet-50 backbone [80]. The ranking results, discussion of previous loss functions, application to an ordinal classification task, and application to other methods are given in Section 4.4, Section 4.5, Section 4.6 and Section 4.7, respectively. As a few very recent works have adopted the large-scale pretrained vision foundation model DINOv2 [35], we also conduct experiments under the DINOv2 backbone. The ablations on backbone scalability, components, and hyperparameters are reported in Section 4.8. Section 4.9 reports the efficiency analysis. Finally, Section 4.10 shows visualizations of our method compared with EigenPlace [15].

4.1. Common Settings

4.1.1. Experimental Platform

We used PyTorch 1.10.0 with Python 3.8 for data processing, model building, training, and inference. For the mathematical analysis, we used MATLAB R2022b. For hardware, the platform is equipped with two NVIDIA A100 GPUs and a 20-core Intel Xeon Gold 6248 processor. The hardware setting is identical for all experiments.

4.1.2. Datasets Selection

For a fair comparison, we adhere to EigenPlace's approach [15] by training on the SF-XL dataset [14] and strictly replicating EigenPlace's training settings. As benchmarks to test on, we collect 11 standard VPR datasets [14,27,81,82,83] whose significance has been well acknowledged in previous studies. These datasets exhibit strong domain shifts, extreme weather, poor photo-taking conditions, or day/night variance. For instance, in NordLand [26], most photos were taken from a train, which has a significantly different domain from the SF-XL [14] training set. SVOX Night [25] is difficult due to the night conditions and long exposure times of the photos. AmsterTime [24] has a notable distinction between scanned archival images and street-view images. We follow [15], which divides the datasets into multi-view and frontal-view (facing the road ahead) ones. A more comprehensive introduction to each dataset is given in Appendix D.

4.1.3. Evaluation Metrics

Following the convention of previous works for evaluating VPR algorithms, we adopt Recall@1 for most experiments, and Recall@5 and @10 for the ablation studies. Here, Recall@k is the ratio of queries for which the top-k most confident predictions contain at least one photo whose location is within a specified proximity of the ground-truth coordinates. For most datasets considered, this proximity is 25 m. In the case of Nordland [26], it corresponds to a span of 10 frames. For AmsterTime [24], a correct prediction is identified by matching it with its counterpart in the database. It is a standard metric for evaluating VPR methods.
To further evaluate the ranking performance, several experiments adopt mAP@k as the metric to assess the ranking results of GDCPlace. This metric is designed to evaluate the quality of rankings within the top-k predictions. It has been used for large-scale retrieval problems, as in [84]. The formula is provided as follows:
$$\mathrm{mAP}@k = \frac{1}{M}\sum_{i}^{M}\frac{1}{\min(n_i, k)}\sum_{j}^{\min(n_i, k)} P(i,j)\,\mathrm{rel}(i,j)$$
where P(i, j) is the precision of the top-j most confident predictions for query i; rel(i, j) is an indicator function outputting 1 if the j-th prediction for the i-th query is within the proximity threshold introduced above; M is the total number of queries; and $n_i$ is the number of correct photos in the database for the i-th query. We select this metric rather than the complete mAP because some database images were taken very close to the query image but do not share its orientation and thus contain no semantic cue; the complete mAP would be largely decreased by this implicit bias.
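The following is a hedged sketch of how the metric above could be computed (our own code and variable names; `rel` flags whether a ranked prediction is within the proximity threshold and `n` holds the number of correct database photos per query).

```python
# Illustrative mAP@k computation (our interpretation of the formula above).
import numpy as np

def map_at_k(rel: np.ndarray, n: np.ndarray, k: int) -> float:
    """rel: (M, >=k) 0/1 relevance of the ranked predictions for each query.
       n:   (M,) number of correct database photos for each query."""
    M = rel.shape[0]
    total = 0.0
    for i in range(M):
        limit = min(int(n[i]), k)
        if limit == 0:
            continue
        hits = np.cumsum(rel[i, :limit])
        precision = hits / (np.arange(limit) + 1)        # P(i, j)
        total += float((precision * rel[i, :limit]).sum()) / limit
    return total / M

# Toy usage: two queries with their top-3 predictions.
rel = np.array([[1, 0, 1], [0, 1, 1]])
print(map_at_k(rel, n=np.array([5, 2]), k=3))
```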

4.2. Comparison with Other Classification Loss Functions

4.2.1. Baseline Selection

As our work introduces a novel loss function and there is a lack of loss functions specifically designed for VPR classification training, we also gathered additional classification loss functions from related areas for a sufficient comparison. Most of them are from face recognition, as these are the ones most often adopted by VPR classification works [14,15]; they include various well-known and recent SOTA loss functions: NormFace [30], SphereFace [45], CosFace [10], ArcFace [11], CurricularFace [32], AdaFace [31], and UniFace [33]. We also select MadaCos [85] and GeM-AP [19] from landmark retrieval, SORD [86] from ordinal classification, and LM-Softmax [34] as a general-purpose loss to compare with. Many ordinal classification methods are difficult to transfer to VPR, as they are often designed for 1d data; among these, we pick SORD as a good fit for VPR. Meanwhile, although GeM-AP is for sample-to-sample training, it is a competitor that partially meets the consistency (it only prioritizes correct images over wrong ones), so we include it in our comparison. All of them were SOTA in their respective tracks for some time and can be reproduced for VPR. Each method follows the same training settings as EigenPlace [15], except for ArcFace [11], for which we extended training to 60 epochs, as it took longer to converge in our experiments. All the loss functions used in this study were retrained on the SF-XL training set [14], except for CosFace, whose results were adapted from the CosPlace paper [14]. The hyperparameters were chosen based on the results on the SF-XL validation set, which are shown in Appendix E.

4.2.2. Results

The results can be viewed in Table 2. GDCPlace performs the best on 9 out of the 11 datasets compared with the other loss functions. It is not surprising that some older methods, e.g., NormFace [30] and SphereFace [45], underperform the newer SOTA methods. It is interesting, however, that ArcFace [11] cannot surpass CosFace [10] in VPR; it might need many more training epochs (D&C [8] trained ArcFace on SF-XL [14] for 300 epochs, for example). AdaFace [31] relies on the quality of face images, which does not help for VPR, so its result is intuitive. Although CosFace [10] is adopted by several VPR works [14,15], it turns out that LM-Softmax [34] performs second best overall, thanks to its solid theoretical grounding. The ranking-related loss functions, SORD and GeM-AP, fall short in comparison with GDCPlace. This shows that GDCPlace outperforms the other competitive loss functions and is, thus, the most robust classification loss function for VPR.

4.3. Comparison with State-of-the-Art Methods

4.3.1. Baseline Selection

We collected well-known and recent advanced works in VPR for comparison, including NetVLAD [4], SFRS [62], MixVPR [5], Conv-AP [7], D&C [8], CNN-Transformer [87], CosPlace [14], EigenPlace [15], HCA [76], and R2Former [88]. The methods are categorized into two groups, depending on whether training is based on classification or on sample-to-sample losses. As there are only a limited number of works training a VPR model in the classification style while using retrieval for inference, we also included the landmark retrieval model CFCD [85] for comparison, given the similarity between the two tasks. Additionally, as we only consider single-stage methods, we compare with the single-stage results of R2Former. We adopt the publicly accessible weights of these methods.

4.3.2. Vision Foundation Model

There are some very recently proposed methods that are based on the large-scale pretrained vision foundation model DINOv2 [35]. It is built on the Vision Transformer [89] architecture, distinct from ResNet [80], the common backbone in VPR. For a fair comparison, we also report the results of adopting our method with DINOv2. To evaluate the performance, we compare with SALAD [90], CricaVPR [91], and AnyLoc [1], which are all recently proposed DINOv2-based methods. We also reproduced EigenPlace [15] with the DINOv2 backbone under the same settings to evaluate the performance of the previous SOTA.

4.3.3. Results on ResNet50

We report the results in Table 3. GDCPlace outperforms the SOTA on most of the datasets considered. It is remarkable that GDCPlace significantly outperforms MixVPR [5], the SOTA of sample-to-sample training. MixVPR exhibits consistent advantages over other methods in the same training pattern, except on Pitts30k, where HCA performs the best. It is also a satisfying boost over EigenPlace [15], considering that a change of loss function alone brings such a score increase. However, GDCPlace underperforms MixVPR [5] on some frontal-view datasets. It is known that one of the main drawbacks of SF-XL-based methods, e.g., EigenPlace and CosPlace, is that they often display inferior performance compared to MixVPR and SALAD on such datasets. As GDCPlace is trained with EigenPlace, it inherits this behavior. Progress has also been made, however: it maintains the best performance on the rather hard SVOX datasets. It is always challenging for classification-based methods to compete with sample-to-sample ones on frontal-view datasets [15].

4.3.4. Results on DINOv2

The results are reported in Table 4. Although GDCPlace fails to surpass EigenPlace in Tokyo-24/7 [27] under ResNet50, it outperforms EigenPlace [15] under DINOv2 in all measures. This suggests a better scaling capacity compared with CosFace [10] for VPR. For the other recent DINOv2-based methods, although SALAD performs SOTA in a large portion of datasets, GDCPlace shows the best average performance compared with other sample-to-sample methods. Remarkably, this is achieved by a 4× smaller descriptor size compared with SALAD [90], contributing to a more memory-friendly VPR system.

4.4. mAP Results of GDCPlace

We provide quantitative ranking results for GDCPlace in Table 5 and Table 6. Although EigenPlace [15] outperforms CosPlace [14] on recall@1, the picture for mAP@k is different: CosPlace [14] performs better on NordLand [26] across all of k = 3, 5, 7. Furthermore, CosFace [10] outperforms LM-Softmax [34] on Nordland [26]. On the other hand, although MixVPR [5] leads on the recall metric, it falls behind EigenPlace [15] when evaluated with mAP@k. GDCPlace consistently performs the best for all mAP@k, which fits our motivation. The results are obtained with the public weights.

4.5. Comparison with Janine’s Loss

First, let us recap the formulation of Janine's Loss [16]. It is designed for sample-to-sample training, not for classification training as GDCPlace is. Given a set of anchor images $\{a_i\}$, we are interested in the Euclidean distances of the features to the anchors, $l_{ji} = \|f_\theta(x_j) - f_\theta(a_i)\|$, where $f_\theta$ is the neural network and $x_j$ is the j-th image. The corresponding geographic distances on the map are $d_{ji}$. Janine's Loss is formulated as follows:
$$\mathcal{L}_{Janine}(i) = \log\Big(1+\sum_j \exp\big(\eta\, g^{+}(d_{ji})\, l_{ji} - \mu\big)/\eta\Big) + \log\Big(1+\sum_j \exp\big(\mu - \eta\, g^{-}(d_{ji})\, l_{ji}\big)/\eta\Big)$$
where $g^{+}(y) = \frac{1}{1+\exp(\lambda-\gamma y)}$ and $g^{-}(y) = \frac{1}{1+\exp(\gamma y-\lambda)}$. On the other hand, for GDCPlace:
$$\mathcal{L}_{GDC}^{K} = \frac{1}{s}\Bigg(\log\big(1+\exp(s(h(d_p)-\cos\theta_p))\big)+\log\Big(1+\sum_{i=1}^{K}\exp\big(s(\cos\theta_{n_i}-h(d_{n_i}))\big)\Big)\Bigg)$$
where the $\cos\theta$ of GDCPlace and the l of Janine's Loss measure the similarities and distances between latent features, respectively. The h is added to $\cos\theta$, while the g is multiplied with l. Because of these formulation differences, Janine's Loss does not satisfy the gradient balance (see Section 3.2) that GDCPlace does. Here, we give the proof:
Proof. 
$$\sum_j\frac{\partial \mathcal{L}_{Janine}(i)}{\partial l_{ji}} = \frac{\sum_j g^{+}(d_{ji})\exp\big(\eta\, g^{+}(d_{ji})\, l_{ji}-\mu\big)}{1+\sum_j \exp\big(\eta\, g^{+}(d_{ji})\, l_{ji}-\mu\big)/\eta} - \frac{\sum_j g^{-}(d_{ji})\exp\big(\mu-\eta\, g^{-}(d_{ji})\, l_{ji}\big)}{1+\sum_j \exp\big(\mu-\eta\, g^{-}(d_{ji})\, l_{ji}\big)/\eta} \qquad (16)$$
It is not hard to see that Equation (16) is not bounded: it can be made arbitrarily large in magnitude by adjusting $g^{+}$ and $g^{-}$. □
Furthermore, GDCPlace is derived by integrating Definition 1 into the cross-entropy loss, which gives it a reasonable foundation. HNCM is also proposed for efficient hard class mining instead of the usual time-consuming hard negative mining. To evaluate the advantages of the GDCPlace formulation for classification, we adapt Janine's Loss to classification training with the same training settings that GDCPlace follows. As shown in Table 7, a clear performance gap is observed.

4.6. Application to Ordinal Classification Task

To demonstrate the generalizability of GDCPlace, we also compare our loss function with several other SOTA methods for ordinal classification, including regression [20], classification [92], DLDL-v2 [21], SORD [86], and OR-CNN [20]. We reimplemented all the chosen methods and conducted our experiments on two popular benchmarks, UTKFace [93] and AgeDB [94]. These two benchmarks are tasked with estimating people's ages given their face photos, which is one of the typical ordinal classification tasks. It is not the main focus of this work, but the results, shown in Table 8, reveal that GDCPlace generalizes well to ordinal classification tasks. The improvements, though minor, are observed consistently on both datasets. This suggests that further domain-specific designs could benefit ordinal classification. Meanwhile, as it is not the main contribution, we did not conduct exhaustive experiments to fully evaluate the method on ordinal classification.
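As a hedged illustration of how the loss could be reused in this setting (our assumption about the setup, not the paper's exact protocol), the 2d geographic distance is simply replaced by the absolute difference between ordinal labels; γ and ζ would need retuning for the new distance scale.

```python
# Sketch: GDCPlace-style loss for age estimation, with |label difference| as the distance.
import torch

s, gamma, zeta, K = 30.0, 0.2, 6.0, 2
ages = torch.arange(0, 101, dtype=torch.float32)     # ordinal classes 0..100
y = 37                                                # ground-truth age of one sample
dist = (ages - float(y)).abs()                        # 1d analogue of geographic distance
cos_theta = torch.randn(101).tanh()                   # similarities to the 101 class centers

h = 1.0 / (1.0 + torch.exp(gamma * (dist - zeta)))
mask = torch.ones(101, dtype=torch.bool)
mask[y] = False
neg_sim, idx = torch.topk(cos_theta[mask], k=K)       # hard negative class mining
loss = (torch.log1p(torch.exp(s * (h[y] - cos_theta[y])))
        + torch.log1p(torch.exp(s * (neg_sim - h[mask][idx])).sum())) / s
```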

4.7. Application to CosPlace

To evaluate GDCPlace as an out-of-the-box loss function, we replace the loss function of CosPlace [14] (CosFace [10]) with our GDCPlace to see if it can also work with other methods. CosPlace has a different map partition strategy compared to EigenPlace, where views of the same direction are categorized as being in the same class. The results are reported in Table 9. It is shown that GDCPlace + CosPlace performs consistently better than vanilla CosPlace in every benchmark.

4.8. Ablation Studies

4.8.1. Ablation on HNCM

In this section, we showcase experiments demonstrating the effect of HNCM and its dependence on K, which determines the number of hard classes. Table 10 first illustrates the advantage of using HNCM over not using it. When using HNCM, a small value of K further boosts performance. This suggests that only a small portion of hard negative classes needs to be compared during each loss computation. An additional visualization of HNCM is given in Section 4.10.

4.8.2. Ablation on Function h

There are two important hyperparameters, γ and ζ, in the implementation of the function h: γ controls the steepness of the sigmoid shape, whereas ζ shifts the curve left or right. Table 11 and Table 12 show that GDCPlace is not sensitive to the selection of γ and ζ. The hyperparameter tuning based on the SF-XL validation set is also reliable.

4.9. Time Complexity of GDCPlace

For N classes, a classification loss function typically has O(N) time complexity (omitting the descriptor size). GDCPlace requires an additional operation to pick the top-K negative classes, which is O(KN). As K is a constant set to 2, GDCPlace still exhibits O(N) time complexity, the same as other classification loss functions. The time cost was also benchmarked, as shown in Table 13: we trained EigenPlace and GDCPlace on SF-XL for 5 epochs and recorded the average time interval. This shows that the additional time cost is negligible, although the standard deviation increases slightly.
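A rough, hardware-dependent micro-benchmark sketch of the extra top-K selection cost (our own illustration; the numbers are not from the paper) could look as follows.

```python
# Timing the top-K selection used by HNCM for a large number of classes (K = 2).
import time
import torch

N = 50_000
cos_theta = torch.randn(N)
t0 = time.perf_counter()
for _ in range(1000):
    torch.topk(cos_theta, k=2)
print(f"avg top-K time: {(time.perf_counter() - t0) / 1000 * 1e6:.1f} us")
```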

4.10. Qualitative Results

To see whether GDCPlace guarantees a correct similarity order during inference, we present visualizations of EigenPlace [15] and our method in Figure A2. GDCPlace outputs the correct top six results, whereas EigenPlace [15] makes several mistakes for two queries from SVOX Night [25] and NordLand [26], respectively. The query photos in SVOX Night [25] were taken at night and are unclear, and the photos in Nordland [26] are extremely hard to distinguish, yet GDCPlace still performs well and gives robust results. AmsterTime [24] is less informative for showing the ranking capacity, as its photos are generally far from each other; nevertheless, Figure A3 shows that GDCPlace predicts the correct image more accurately than EigenPlace [15] and is less easily fooled by the photos from the Amsterdam City Archive, which exhibit a strong domain shift. The qualitative results of using HNCM are shown in Figure 3; they demonstrate that training with HNCM favors correct orderings of images. We also provide more visualizations on SVOX Night [25] and NordLand [26] in Figure A4, which consistently show that GDCPlace is stable and reliable in outputting a correct order of photos. In addition to SVOX Night [25], the query from Nordland [26] demonstrates a significant change in weather conditions, under which GDCPlace remains robust.

5. Discussion

Visual place recognition (VPR) serves as a critical enabling technology for autonomous systems, allowing them to reliably identify previously visited locations through visual inputs. This capability forms the foundation for large-scale navigation in GPS-denied environments, enabling loop closure in SLAM (Simultaneous Localization and Mapping) systems and facilitating long-term autonomy through cross-session place matching. By combining GDCPlace with robust matching algorithms, modern VPR systems have the potential to demonstrate remarkable resilience to viewpoint variations, illumination changes, and seasonal appearance shifts, making them indispensable for autonomous vehicles, delivery robots, and aerial drones operating in dynamic real-world conditions. The GDC condition can also serve as a standard mathematical constraint for future loss function designs. As for limitations, although GDCPlace shows the best performance overall and is backed by theoretical analysis, the FoV handling and group partition strategy are inherited from EigenPlace [15]. The hyperparameters are also decided via tuning on the validation set, so an automatic mechanism for this would be appealing. Meanwhile, although GDCPlace shows strong potential on ordinal classification tasks, there is still room for improvement, such that the 1d nature of ordinal data can be further exploited to refine our loss function.
On the other hand, since most face recognition and landmark retrieval methods can be applied to VPR, it is, conversely, also highly possible that GDCPlace can be adapted to other retrieval tasks. However, defining a distance to replace the geographic distance of VPR in other retrieval tasks poses a significant challenge. These are all directions for our future work.

6. Conclusions

In summary, this work addresses the challenge in VPR classification training that loss functions should encourage embedding similarities to be negatively correlated with geographic distances. We gave a formal definition, geographic distance consistent, and proved that our proposed loss function, GDCPlace, meets this requirement. As a novel loss function, we compared GDCPlace with other classification loss functions from areas including face recognition, ordinal classification, landmark retrieval, and visual place recognition itself. We also conducted comparisons with other SOTA methods proposed for the VPR task, with both the traditional ResNet-50 and DINOv2 backbones. As GDCPlace encourages a correct rank of geographic distances, we also evaluated the mAP results to test the rank prediction performance. Ablations were conducted on each of the hyperparameters, and qualitative results were provided to visually showcase the performance of GDCPlace. Overall, extensive experiments revealed that GDCPlace performs at the SOTA level compared with other cutting-edge methods. It is also the first VPR-specific loss function with solid theoretical support.

Author Contributions

Conceptualization, S.S.; Methodology, S.S.; Software, S.S.; Validation, S.S. and Q.C.; Formal analysis, S.S.; Writing—original draft, S.S.; Writing—review & editing, Q.C.; Supervision, Q.C.; Funding acquisition, Q.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

For requests for code, please contact the corresponding authors.

Acknowledgments

The authors would like to thank the three anonymous reviewers for their constructive feedback.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
VPR     Visual place recognition
SOTA    State of the art
GDC     Geographic distance consistent
HNM     Hard negative mining
HNCM    Hard negative class mining
mAP     Mean average precision
FOV     Field of view
SLAM    Simultaneous Localization and Mapping
PI      Pitts30k dataset
AM      AmsterTime dataset
EY      Eynsham dataset
TO      Tokyo-24/7 dataset
NL      Nordland dataset
SN      SVOX Night dataset
SO      SVOX Overcast dataset
SW      SVOX Snow dataset
SR      SVOX Rain dataset
SU      SVOX Sun dataset
MS      MSLS val dataset

Appendix A. Proofs

Appendix A.1. Proof of Equations (2)–(5)

$$
\begin{aligned}
\mathcal{L}_{ce} &= -\log\frac{\exp(s\cos\theta_p)}{\exp(s\cos\theta_p)+\sum_{i=1}^{N-1}\exp(s\cos\theta_{n_i})} && \text{(A1)}\\
&= \frac{1}{2}\Bigg(\log\Big(1+\sum_{i=1}^{N-1}\frac{\exp(s\cos\theta_{n_i})}{\exp(s\cos\theta_p)}\Big)+\log\Big(1+\sum_{i=1}^{N-1}\frac{\exp(s\cos\theta_{n_i})}{\exp(s\cos\theta_p)}\Big)\Bigg) && \text{(A2)}\\
&\leq \frac{1}{2}\Bigg(\log\Big(1+\sum_{i=1}^{N-1}\frac{\exp(s\cos\theta_{n_i})}{\exp(s\cos\theta_p)}\Big)+\log\Big(1+\sum_{i=1}^{N-1}\frac{\exp(s\cos\theta_{n_i})}{\exp(s\,h(d_{n_i}))}\Big)\Bigg) && \text{(A3)}
\end{aligned}
$$
Let $C=\log\Big(1+\sum_{i=1}^{N-1}\frac{\exp(s\cos\theta_{n_i})}{\exp(s\,h(d_{n_i}))}\Big)$. Then, splitting the coefficient of the first sum as $1=\frac{N-2}{2(N-1)}+\frac{N}{2(N-1)}$,
$$
\begin{aligned}
&= \frac{1}{2}\Bigg(\log\Big(1+\frac{N-2}{2(N-1)}\sum_{i=1}^{N-1}\frac{\exp(s\cos\theta_{n_i})}{\exp(s\cos\theta_p)}+\frac{N}{2(N-1)}\sum_{i=1}^{N-1}\frac{\exp(s\cos\theta_{n_i})}{\exp(s\cos\theta_p)}\Big)+C\Bigg) && \text{(A4)}\\
&\leq \frac{1}{2}\Bigg(\log\Big(1+\frac{N-2}{2(N-1)}\sum_{i=1}^{N-1}\frac{\exp(s\cos\theta_p)}{\exp(s\cos\theta_p)}+\frac{N}{2(N-1)}\sum_{i=1}^{N-1}\frac{\exp(s\,h(d_p))}{\exp(s\cos\theta_p)}\Big)+C\Bigg) && \text{(A5)}\\
&= \frac{1}{2}\Bigg(\log\Big(1+\frac{N-2}{2(N-1)}(N-1)+\frac{N}{2(N-1)}(N-1)\frac{\exp(s\,h(d_p))}{\exp(s\cos\theta_p)}\Big)+C\Bigg) && \text{(A6), (A7)}\\
&= \frac{1}{2}\Bigg(\log\Big(\frac{N}{2}+\frac{N}{2}\cdot\frac{\exp(s\,h(d_p))}{\exp(s\cos\theta_p)}\Big)+C\Bigg) && \text{(A8)}\\
&= \frac{1}{2}\Bigg(\log\big(1+\exp(s(h(d_p)-\cos\theta_p))\big)+\log\Big(1+\sum_{i=1}^{N-1}\exp\big(s(\cos\theta_{n_i}-h(d_{n_i}))\big)\Big)+\log\frac{N}{2}\Bigg) && \text{(A9)}
\end{aligned}
$$

Appendix A.2. Proofs of Theorems, Corollaries, and Lemmata

Lemma A1
(Special case of Jensen's inequality, Mercer type [78]). For real numbers $a \leq \min(b, c)$, $d \geq \max(b, c)$ with $a + d = b + c$, and a convex function f, it holds that $f(a) + f(d) \geq f(b) + f(c)$, where equality holds only if $a = b = c = d$.
Proof. 
$$\frac{d-b}{d-a}\,a + \frac{b-a}{d-a}\,d = b \qquad \text{(A10)}$$
Plugging in $c = a + d - b$:
$$\frac{b-a}{d-a}\,a + \frac{d-b}{d-a}\,d = c \qquad \text{(A11)}$$
$$f(a)+f(d) = \frac{d-b}{d-a}f(a) + \frac{b-a}{d-a}f(d) + \frac{b-a}{d-a}f(a) + \frac{d-b}{d-a}f(d) \qquad \text{(A12), (A13)}$$
As $\frac{d-b}{d-a} + \frac{b-a}{d-a} = 1$ and f is convex, by Jensen's inequality:
$$\geq f\Big(\frac{d-b}{d-a}\,a + \frac{b-a}{d-a}\,d\Big) + f\Big(\frac{b-a}{d-a}\,a + \frac{d-b}{d-a}\,d\Big) \overset{\text{(A10), (A11)}}{=} f(b) + f(c) \qquad \text{(A14), (A15)}$$
According to Jensen's inequality, equality is obtained only if $a = d$; combined with $a \leq \min(b, c)$ and $d \geq \max(b, c)$, this gives $a = b = c = d$. It is trivial to further show that the inequality also holds when $a + d > b + c$. □
Lemma 1.
Given a function f, if, for any list of cosine similarities and distances to different classes $((\cos\theta_1, d_1), \ldots, (\cos\theta_N, d_N))$ with $d_1 < \cdots < d_N$ and any pair $\cos\theta_i < \cos\theta_j$ where $i < j$, we have
$$f\big((\ldots,(\cos\theta_j, d_i),\ldots,(\cos\theta_i, d_j),\ldots)\big) < f\big((\ldots,(\cos\theta_i, d_i),\ldots,(\cos\theta_j, d_j),\ldots)\big) \qquad \text{(A16)}$$
then f is geographic distance consistent.
Proof. 
From sorting algorithms, we know that a finite number of swaps between pairs $\cos\theta_i$ and $\cos\theta_j$ with $\cos\theta_i < \cos\theta_j$ and $i < j$ suffices to reach the order $\cos\theta_1 > \cos\theta_2 > \cdots > \cos\theta_N$. Each swap decreases the value of f; thus, whenever $\cos\theta_1 > \cdots > \cos\theta_N$ does not already hold, the fully sorted arrangement attains a strictly smaller value of f than the current one. Hence, f is geographic distance consistent. □
Theorem 1.
$\mathcal{L}_{GDC}$ is geographic distance consistent.
Proof. 
To prove that $\mathcal{L}_{GDC}$ is geographic distance consistent, we first show that $\mathcal{L}_{GDC}$ is an instance of the f in Lemma 1. Suppose that $\cos\theta_i < \cos\theta_j$ and $d_i < d_j$, where $i < j$. We discuss two cases covering all possibilities:
Case 1. $i = p$:
$$
\begin{aligned}
&\mathcal{L}_{GDC}\big((\ldots,(\cos\theta_p, d_p),\ldots,(\cos\theta_j, d_j),\ldots)\big) - \mathcal{L}_{GDC}\big((\ldots,(\cos\theta_j, d_p),\ldots,(\cos\theta_p, d_j),\ldots)\big) && \text{(A17)}\\
&= \log\big(1+\exp(s(h(d_p)-\cos\theta_p))\big) - \log\big(1+\exp(s(h(d_p)-\cos\theta_j))\big) && \text{(A18)}\\
&\quad + \log\Big(1+\sum_{t\neq p,j}^{N}\exp\big(s(\cos\theta_t-h(d_t))\big)+\exp\big(s(\cos\theta_j-h(d_j))\big)\Big) && \text{(A19)}\\
&\quad - \log\Big(1+\sum_{t\neq p,j}^{N}\exp\big(s(\cos\theta_t-h(d_t))\big)+\exp\big(s(\cos\theta_p-h(d_j))\big)\Big) && \text{(A20)}\\
&> 0 && \text{(A21)}
\end{aligned}
$$
Case 2. $i \neq p$: Let $K = 1+\sum_{t\neq p,i,j}^{N}\exp\big(s(\cos\theta_t-h(d_t))\big)$. Then
$$
\begin{aligned}
&\mathcal{L}_{GDC}\big((\ldots,(\cos\theta_i, d_i),\ldots,(\cos\theta_j, d_j),\ldots)\big) - \mathcal{L}_{GDC}\big((\ldots,(\cos\theta_j, d_i),\ldots,(\cos\theta_i, d_j),\ldots)\big) && \text{(A22)}\\
&= \log\Big(K+\exp\big(s(\cos\theta_i-h(d_i))\big)+\exp\big(s(\cos\theta_j-h(d_j))\big)\Big) - \log\Big(K+\exp\big(s(\cos\theta_j-h(d_i))\big)+\exp\big(s(\cos\theta_i-h(d_j))\big)\Big) && \text{(A23)}
\end{aligned}
$$
To prove that this difference is positive, it suffices to show that
$$\exp\big(s(\cos\theta_i-h(d_i))\big)+\exp\big(s(\cos\theta_j-h(d_j))\big) > \exp\big(s(\cos\theta_j-h(d_i))\big)+\exp\big(s(\cos\theta_i-h(d_j))\big) \qquad \text{(A24)}$$
As
$$s\big(\cos\theta_i-h(d_i)\big) < \min\Big(s\big(\cos\theta_j-h(d_i)\big),\, s\big(\cos\theta_i-h(d_j)\big)\Big) \qquad \text{(A25)}$$
$$s\big(\cos\theta_j-h(d_j)\big) > \max\Big(s\big(\cos\theta_j-h(d_i)\big),\, s\big(\cos\theta_i-h(d_j)\big)\Big) \qquad \text{(A26)}$$
and
$$s\big(\cos\theta_i-h(d_i)\big)+s\big(\cos\theta_j-h(d_j)\big) = s\big(\cos\theta_i-h(d_j)\big)+s\big(\cos\theta_j-h(d_i)\big) \qquad \text{(A27)}$$
by Lemma A1, Equation (A24) holds. By Cases 1 and 2 together with Lemma 1, it follows that $\mathcal{L}_{GDC}$ is geographic distance consistent. □
Corollary 1.
Given all M inputs from the training dataset, $\sum_{i}^{M}\mathcal{L}_{GDC}^{i}$ is still geographic distance consistent.
Proof. 
Let $\cos\theta_{tj}$ and $d_{tj}$ denote the cosine similarity and the distance to the j-th class from the t-th sample. It suffices to prove that, if $\cos\theta_{tj} < \cos\theta_{kl}$ and $d_{tj} < d_{kl}$, then
$$\sum_{i\neq t,k}^{M}\mathcal{L}_{GDC}^{i} + \mathcal{L}_{GDC}^{t}\big(\ldots,(\cos\theta_{tj}, d_{kl}),\ldots\big) + \mathcal{L}_{GDC}^{k}\big(\ldots,(\cos\theta_{kl}, d_{tj}),\ldots\big) < \sum_{i\neq t,k}^{M}\mathcal{L}_{GDC}^{i} + \mathcal{L}_{GDC}^{t}\big(\ldots,(\cos\theta_{tj}, d_{tj}),\ldots\big) + \mathcal{L}_{GDC}^{k}\big(\ldots,(\cos\theta_{kl}, d_{kl}),\ldots\big) \qquad \text{(A28)}$$
By the grid sampling of classes and samples, we have $d_{tp} < d_{kl}$ for any t, k, and $l \neq p$. Therefore, we cannot have $j \neq p$ and $l = p$ simultaneously. When $t = k$, the statement is identical to Theorem 1, so we only consider $t \neq k$. The case where $j = p$ and $l \neq p$ is the same as Case 1 in the proof of Theorem 1. Here, we show that Equation (A28) holds when $j = p$ and $l = p$; the case $j \neq p$ and $l \neq p$ is proven very similarly.
$$
\begin{aligned}
&\sum_{i\neq t,k}^{M}\mathcal{L}_{GDC}^{i} + \mathcal{L}_{GDC}^{t}\big(\ldots,(\cos\theta_{tp}, d_{tp}),\ldots\big) + \mathcal{L}_{GDC}^{k}\big(\ldots,(\cos\theta_{kp}, d_{kp}),\ldots\big)\\
&\quad - \Big(\sum_{i\neq t,k}^{M}\mathcal{L}_{GDC}^{i} + \mathcal{L}_{GDC}^{t}\big(\ldots,(\cos\theta_{tp}, d_{kp}),\ldots\big) + \mathcal{L}_{GDC}^{k}\big(\ldots,(\cos\theta_{kp}, d_{tp}),\ldots\big)\Big) && \text{(A29)}\\
&= \log\big(1+\exp(s(h(d_{tp})-\cos\theta_{tp}))\big) + \log\big(1+\exp(s(h(d_{kp})-\cos\theta_{kp}))\big) && \text{(A30)}\\
&\quad - \log\big(1+\exp(s(h(d_{tp})-\cos\theta_{kp}))\big) - \log\big(1+\exp(s(h(d_{kp})-\cos\theta_{tp}))\big) && \text{(A31)}
\end{aligned}
$$
It is easy to verify that $f(x)=\log(1+e^{x})$ is a convex function. Meanwhile, consider
$$s\big(h(d_{kp})-\cos\theta_{kp}\big) < \min\Big(s\big(h(d_{tp})-\cos\theta_{kp}\big),\, s\big(h(d_{kp})-\cos\theta_{tp}\big)\Big) \qquad \text{(A32)}$$
$$s\big(h(d_{tp})-\cos\theta_{tp}\big) > \max\Big(s\big(h(d_{tp})-\cos\theta_{kp}\big),\, s\big(h(d_{kp})-\cos\theta_{tp}\big)\Big) \qquad \text{(A33)}$$
and
$$s\big(h(d_{tp})-\cos\theta_{tp}\big)+s\big(h(d_{kp})-\cos\theta_{kp}\big) = s\big(h(d_{tp})-\cos\theta_{kp}\big)+s\big(h(d_{kp})-\cos\theta_{tp}\big) \qquad \text{(A34)}$$
It then follows from Lemma A1 that Equation (A29) is positive; hence, Equation (A28) holds. This permutation property is lifted to geographic distance consistency by Lemma 1. □
Corollary 2.
$\mathcal{L}_{GDC}^{K}$ is top-$(K+1)$ geographic distance consistent, i.e., let $d_1 < \cdots < d_{K+1}$ represent the top-$(K+1)$ shortest distances; then, $\mathcal{L}_{GDC}^{K}$ is minimized when $\cos\theta_1 > \cos\theta_2 > \cdots > \cos\theta_{K+1} > \max(\cos\theta_{K+2}, \ldots, \cos\theta_N)$. This conclusion can be extended to $\sum_{i}^{M}\mathcal{L}_{GDC}^{K,i}$ for arbitrary M samples, as per Corollary 1.
Proof. 
Given the grid partition strategy (see Appendix B) and $d_1 < d_2 < \cdots$, we have $p = 1$. Let us first suppose that $i = 1$, $j > K+1$, and $\cos\theta_i < \cos\theta_j$, where $\cos\theta_j$ is within the top-$(K+1)$ highest values whereas $\cos\theta_i$ is not. Consider the following permutation:
$$
\begin{aligned}
&\mathcal{L}_{GDC}^{K}\big((\ldots,(\cos\theta_i, d_i),\ldots,(\cos\theta_j, d_j),\ldots)\big) - \mathcal{L}_{GDC}^{K}\big((\ldots,(\cos\theta_j, d_i),\ldots,(\cos\theta_i, d_j),\ldots)\big) && \text{(A35)}\\
&= \log\big(1+\exp(s(h(d_1)-\cos\theta_1))\big) - \log\big(1+\exp(s(h(d_1)-\cos\theta_j))\big) && \text{(A36)}\\
&> 0 && \text{(A37)}
\end{aligned}
$$
Then, suppose that $1 < i \leq K+1$ and $\cos\theta_j$ is one of the top-$(K+1)$ greatest cosine values, whereas $\cos\theta_i$ is not. Further, $d_j$ is not among the top-$(K+1)$ shortest distances. Let $C = \sum_{t\neq j,\, t\in Q}\exp\big(s(\cos\theta_t-h(d_t))\big)$, where Q is the set of indices whose corresponding cosine values are in the top-$(K+1)$. Then, we have
$$
\begin{aligned}
&\mathcal{L}_{GDC}^{K}\big((\ldots,(\cos\theta_i, d_i),\ldots,(\cos\theta_j, d_j),\ldots)\big) - \mathcal{L}_{GDC}^{K}\big((\ldots,(\cos\theta_j, d_i),\ldots,(\cos\theta_i, d_j),\ldots)\big) && \text{(A38)}\\
&= \log\Big(C+\exp\big(s(\cos\theta_j-h(d_j))\big)\Big) - \log\Big(C+\exp\big(s(\cos\theta_j-h(d_i))\big)\Big) && \text{(A39)}\\
&> 0 && \text{(A40)}
\end{aligned}
$$
Thus, minimizing the loss requires at least $\min(\cos\theta_1, \ldots, \cos\theta_{K+1}) > \max(\cos\theta_{K+2}, \ldots, \cos\theta_N)$ when $\max(d_1, \ldots, d_{K+1}) < \min(d_{K+2}, \ldots, d_N)$. Next, we need to discuss the order that minimizes $\mathcal{L}_{GDC}^{K}$ when this condition is fulfilled; the discussion is the same as in the proof of Theorem 1. It is not difficult to extend the conclusion to the multi-sample version $\sum_{i}^{M}\mathcal{L}_{GDC}^{K,i}$, following Corollary 1. □

Appendix B. Details on Map Partition for Classification Training

We follow the training strategy of EigenPlace [15]. For completeness, we recap here the details of the map partition for preparing the data; one can also refer to their paper [15] for details and motivation. The geographic map is divided into small blocks fully covering the whole region. However, adjacent blocks might share some common visual cues that confuse the classification labels. To alleviate this issue, the blocks are further grouped such that no adjacent blocks fall in the same group. Given all the shot positions within the blocks, SVD is applied to the map, and the 360° camera view is segmented into street-directed and building-facade views as two sub-groups. The overall illustration is shown in Figure A1.

Appendix C. Additional Gradient Analysis

Theorem 1 and Corollary 2 concern the value of the loss function. For the gradient, we give additional useful insights to make our method more convincing. It can be shown that the correct order of the similarities also leads to the minimum of the gradient magnitude. That means the loss function induces fewer updates of the network parameters when the similarity order is correct, as we desire.
Formally, the conclusion about the gradient can be given as follows:
Theorem 2.
$\|\nabla\mathcal{L}_{GDC}\|_2$ is Geographic Distance Consistent, and $\|\nabla\mathcal{L}_{GDC_K}\|_2$ is top-$(K+1)$ Geographic Distance Consistent.
Figure A1. The map partition strategy. Photos with the same number and color are trained in the same group, and photos in the same block are expected to be predicted as the same class.
Proof. 
$$\|\nabla\mathcal{L}_{GDC}\|_2^2 = \left(\frac{\exp\big(s\,(h(d_p) - \cos\theta_p)\big)}{1+\exp\big(s\,(h(d_p) - \cos\theta_p)\big)}\right)^{\!2} + \sum_{j=1}^{N-1}\left(\frac{\exp\big(s\,(\cos\theta_{n_j} - h(d_{n_j}))\big)}{1+\sum_{i=1}^{N-1}\exp\big(s\,(\cos\theta_{n_i} - h(d_{n_i}))\big)}\right)^{\!2} \quad \text{(A41)}$$
Supported by Lemma 1, we can again argue by cases, with the same Case 1 and Case 2 as in the proof of Theorem 1. It is not hard to show that the claim holds for Case 1; here, we analyze Case 2. Suppose that $\cos\theta_i < \cos\theta_j$ and $d_i < d_j$, where $i < j$ and $i \neq p$. We want to show that
$$\Big\|\nabla\mathcal{L}_{GDC}\big((\ldots,(\cos\theta_i, d_i),\ldots,(\cos\theta_j, d_j),\ldots)\big)\Big\|_2^2 - \Big\|\nabla\mathcal{L}_{GDC}\big((\ldots,(\cos\theta_i, d_j),\ldots,(\cos\theta_j, d_i),\ldots)\big)\Big\|_2^2 > 0 \quad \text{(A42)}$$
Let $C = 1 + \sum_{t \neq p, i, j}^{N} \exp\big(s\,(\cos\theta_t - h(d_t))\big)$, $a = s\,(\cos\theta_i - h(d_i))$, $b = s\,(\cos\theta_j - h(d_j))$, $x = s\,(\cos\theta_i - h(d_j))$, and $y = s\,(\cos\theta_j - h(d_i))$; it then suffices to prove the following inequality:
$$\frac{\exp(2a)+\exp(2b)}{\big(\exp(a)+\exp(b)+C\big)^2} > \frac{\exp(2x)+\exp(2y)}{\big(\exp(x)+\exp(y)+C\big)^2} \quad \text{(A43)}$$
which, after cross-multiplying and cancelling the common terms, is equivalent to
$$\begin{aligned}
&C^2\big(\exp(2a)+\exp(2b)\big) + 2\big(\exp(x+y+2a) + C\exp(x+2a) + C\exp(y+2a) \\
&\qquad + \exp(x+y+2b) + C\exp(x+2b) + C\exp(y+2b)\big) && \text{(A44)}\\
&> C^2\big(\exp(2x)+\exp(2y)\big) + 2\big(\exp(a+b+2x) + C\exp(a+2x) + C\exp(b+2x) \\
&\qquad + \exp(a+b+2y) + C\exp(a+2y) + C\exp(b+2y)\big) && \text{(A45)}
\end{aligned}$$
Grouping the terms into pairs, it turns out that $C^2(\exp(2a)+\exp(2b)) > C^2(\exp(2x)+\exp(2y))$, $2(\exp(x+y+2a)+\exp(x+y+2b)) > 2(\exp(a+b+2x)+\exp(a+b+2y))$, $2C(\exp(x+2a)+\exp(y+2b)) > 2C(\exp(a+2x)+\exp(b+2y))$, and $2C(\exp(y+2a)+\exp(x+2b)) > 2C(\exp(b+2x)+\exp(a+2y))$. Each of these inequalities follows easily from Lemma A1. Adding up all the left-hand sides and right-hand sides, respectively, yields the inequality in Equations (A44) and (A45), and hence Equation (A42) holds. The conclusion is not hard to extend to $\|\nabla\mathcal{L}_{GDC_K}\|_2$. □
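As a numerical sanity check on the inequality above (a sketch for illustration only), the script below draws random values of a, b, x, and y in the configuration to which Lemma A1 is applied, namely a below min(x, y) and b = x + y - a above max(x, y), together with a constant C >= 1, and confirms that inequality (A43) holds in every trial.

```python
import numpy as np

rng = np.random.default_rng(0)

def sides_of_a43(a, b, x, y, C):
    """Left- and right-hand sides of inequality (A43)."""
    lhs = (np.exp(2 * a) + np.exp(2 * b)) / (np.exp(a) + np.exp(b) + C) ** 2
    rhs = (np.exp(2 * x) + np.exp(2 * y)) / (np.exp(x) + np.exp(y) + C) ** 2
    return lhs, rhs

violations = 0
for _ in range(100_000):
    # Draw x and y freely, put a strictly below min(x, y), and set b = x + y - a,
    # which places b strictly above max(x, y): the configuration handled by Lemma A1.
    x, y = rng.uniform(-5.0, 5.0, size=2)
    a = min(x, y) - rng.uniform(0.01, 3.0)
    b = x + y - a
    C = 1.0 + rng.uniform(0.0, 50.0)      # C is 1 plus a sum of exponentials, hence >= 1
    lhs, rhs = sides_of_a43(a, b, x, y, C)
    violations += int(lhs <= rhs)
print("violations:", violations)          # expected output: 0
```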

Appendix D. Datasets

We collect Pitts30k [4], AmsterTime [24], Eynsham [29], Tokyo-24/7 [27], Nordland [26], SVOX-Night, -Overcast, -Snow, -Rain, -Sun [25], and MSLS-val [28] to benchmark our methods. Pitts30k provides a diverse set of urban scenes captured under different weather conditions and seasons. AmsterTime focuses on capturing the same locations over time. Eynsham is a challenging benchmark for visual localization because its photos are grayscale. Tokyo-24/7 covers a wide range of illumination conditions, from bright daylight to night scenes. Nordland consists of images captured along a train journey in Norway during different seasons. SVOX categorizes different visual condition variations (night, overcast, snow, rain, and sun) into separate small datasets. MSLS contains images curated from the Mapillary collaborative mapping platform, with day/night changes.

Appendix E. Classification Loss Configurations

We reproduced additional classification loss functions to compare with our GDCPlace. The detailed hyperparameter choices are reported in Table A1; they were determined through experiments on the validation set while also considering the recommendations in the respective papers. The corresponding results can be seen in Table 2.
Table A1. The hyperparameter configurations for each loss function.
NormFace [30]: s = 100 | SphereFace [45]: m = 4 | ArcFace [11]: s = 50, m = 0.2 | CurricularFace [32]: s = 100, m = 0.4
AdaFace [31]: s = 100, m = 0.4, h = 0.33, α = 0.99 | LM-Softmax [34]: s = 100 | MadaCos [85]: ρ = 0.02, ϵ = e−7 | UniFace [33]: s = 100, m = 0.2, λ = 0.5, r = 0.5
Figure A2. Visualization of the top six most similar database images for given query images. The geographic distance or frame index of each image relative to the query is shown. The top example is from SVOX Night [25] and the bottom is from Nordland [26]. Red dots and red lines denote far locations and wrong orders, respectively.
Figure A3. Visualization of GDCPlace on AmsterTime [24]. Images with green borders denote the corresponding database photos for the given queries.
Figure A4. Additional visualization on SVOX Night [25] and Nordland [26]. The geographic distance or frame index of each image relative to the query is shown. The top example is from SVOX Night [25] and the bottom is from Nordland [26]. Red dots and red lines denote far locations and wrong orders, respectively.

References

  1. Keetha, N.; Mishra, A.; Karhade, J.; Jatavallabhula, K.M.; Scherer, S.; Krishna, M.; Garg, S. Anyloc: Towards universal visual place recognition. IEEE Robot. Autom. Lett. 2023, 1286–1293. [Google Scholar] [CrossRef]
  2. Camara, L.G.; Gäbert, C.; Přeučil, L. Highly robust visual place recognition through spatial matching of CNN features. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 3748–3755. [Google Scholar]
  3. Hong, Z.; Petillot, Y.; Lane, D.; Miao, Y.; Wang, S. TextPlace: Visual place recognition and topological localization through reading scene texts. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2861–2870. [Google Scholar]
  4. Arandjelović, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1437–1451. [Google Scholar] [CrossRef] [PubMed]
  5. Ali-Bey, A.; Chaib-Draa, B.; Giguere, P. MixVPR: Feature mixing for visual place recognition. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2998–3007. [Google Scholar]
  6. Shen, Y.; Zhou, S.; Fu, J.; Wang, R.; Chen, S.; Zheng, N. StructVPR: Distill structural knowledge with weighting samples for visual place recognition. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 11217–11226. [Google Scholar]
  7. Ali-bey, A.; Chaib-draa, B.; Giguère, P. GSV-cities: Toward appropriate supervised visual place recognition. Neurocomputing 2022, 513, 194–203. [Google Scholar] [CrossRef]
  8. Trivigno, G.; Berton, G.; Aragon, J.; Caputo, B.; Masone, C. Divide&Classify: Fine-Grained Classification for City-Wide Visual Geo-Localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Vancouver, BC, Canada, 17–24 June 2023; pp. 11142–11152. [Google Scholar]
  9. Muller-Budack, E.; Pustu-Iren, K.; Ewerth, R. Geolocation estimation of photos using a hierarchical model and scene classification. In Proceedings of the ECCV 2018: 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 563–579. [Google Scholar]
  10. Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Li, Z.; Liu, W. Large Margin Cosine Loss for Deep Face Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5265–5274. [Google Scholar]
  11. Deng, J.; Guo, J.; Yang, J.; Xue, N.; Kotsia, I.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5962–5979. [Google Scholar] [CrossRef]
  12. Shao, S.; Chen, K.; Karpur, A.; Cui, Q.; Araujo, A.; Cao, B. Global Features are All You Need for Image Retrieval and Reranking. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023, Paris, France, 2–3 October 2023; pp. 11002–11012. [Google Scholar]
  13. Lee, S.; Seong, H.; Lee, S.; Kim, E. Correlation Verification for Image Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  14. Berton, G.; Masone, C.; Caputo, B. Rethinking visual geo-localization for large-scale applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4878–4888. [Google Scholar]
  15. Berton, G.; Trivigno, G.; Caputo, B.; Masone, C. EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 11080–11090. [Google Scholar]
  16. Thoma, J.; Paudel, D.P.; Gool, L.V. Soft contrastive learning for visual localization. Adv. Neural Inf. Process. Syst. 2020, 33, 11119–11130. [Google Scholar]
  17. Leyva-Vallina, M.; Strisciuglio, N.; Petkov, N. Regressing Transformers for Data-efficient Visual Place Recognition. arXiv 2024, arXiv:2401.16304. [Google Scholar]
  18. Brown, A.; Xie, W.; Kalogeiton, V.; Zisserman, A. Smooth-AP: Smoothing the path towards large-scale image retrieval. In Proceedings of the 16th European Conference (ECCV 2020), Glasgow, UK, 23–28 August 2020; pp. 677–694. [Google Scholar]
  19. Revaud, J.; Almazán, J.; Rezende, R.S.; Souza, C.R.d. Learning with average precision: Training image retrieval with a listwise loss. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5107–5116. [Google Scholar]
  20. Niu, Z.; Zhou, M.; Wang, L.; Gao, X.; Hua, G. Ordinal Regression with Multiple Output CNN for Age Estimation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4920–4928. [Google Scholar] [CrossRef]
  21. Gao, B.B.; Zhou, H.Y.; Wu, J.; Geng, X. Age Estimation Using Expectation of Label Distribution Learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 712–718. [Google Scholar]
  22. Shrivastava, A.; Gupta, A.; Girshick, R. Training region-based object detectors with online hard example mining. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769. [Google Scholar]
  23. Henriques, J.F.; Carreira, J.; Caseiro, R.; Batista, J. Beyond hard negative mining: Efficient detector learning via block-circulant decomposition. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 2760–2767. [Google Scholar]
  24. Yildiz, B.; Khademi, S.; Siebes, R.M.; Van Gemert, J. AmsterTime: A visual place recognition benchmark dataset for severe domain shift. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 2749–2755. [Google Scholar]
  25. Berton, G.M.; Paolicelli, V.; Masone, C.; Caputo, B. Adaptive-attentive geolocalization from few queries: A hybrid approach. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 2918–2927. [Google Scholar]
  26. Milford, M.J.; Wyeth, G.F. Mapping a suburb with a single camera using a biologically inspired SLAM system. IEEE Trans. Robot. 2008, 24, 1038–1053. [Google Scholar] [CrossRef]
  27. Torii, A.; Arandjelovic, R.; Sivic, J.; Okutomi, M.; Pajdla, T. 24/7 place recognition by view synthesis. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1808–1817. [Google Scholar]
  28. Warburg, F.; Hauberg, S.; Lopez-Antequera, M.; Gargallo, P.; Kuang, Y.; Civera, J. Mapillary street-level sequences: A dataset for lifelong place recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2626–2635. [Google Scholar]
  29. Cummins, M.; Newman, P. Highly scalable appearance-only SLAM-FAB-MAP 2.0. In Robotics: Science and Systems; The MIT Press: Cambridge, MA, USA, 2009; Volume 5. [Google Scholar]
  30. Wang, F.; Xiang, X.; Cheng, J.; Yuille, A.L. NormFace: L2 Hypersphere Embedding for Face Verification. In Proceedings of the 25th ACM international conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1041–1049. [Google Scholar] [CrossRef]
  31. Kim, M.; Jain, A.K.; Liu, X. AdaFace: Quality Adaptive Margin for Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 18750–18759. [Google Scholar]
  32. Huang, Y.; Wang, Y.; Tai, Y.; Liu, X.; Shen, P.; Li, S.; Li, J.; Huang, F. Curricularface: Adaptive curriculum learning loss for deep face recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5901–5910. [Google Scholar]
  33. Zhou, J.; Jia, X.; Li, Q.; Shen, L.; Duan, J. UniFace: Unified Cross-Entropy Loss for Deep Face Recognition. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 20730–20739. [Google Scholar]
  34. Zhou, X.; Liu, X.; Zhai, D.; Jiang, J.; Gao, X.; Ji, X. Learning Towards The Largest Margins. In International Conference on Learning Representations (ICLR). 2022. Available online: https://openreview.net/forum?id=hqkhcFHOeKD (accessed on 20 February 2025).
  35. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
  36. Ran, Z.; Wei, X.; Liu, W.; Lu, X. Multi-Scale Aligned Spatial-Temporal Interaction for Video-Based Person Re-Identification. IEEE Trans. Circuit Syst. Video Technol. 2024, 34, 8536–8546. [Google Scholar] [CrossRef]
  37. Hoffer, E.; Ailon, N. Deep metric learning using triplet network. In Similarity-Based Pattern Recognition, Proceedings of the Third International Workshop, SIMBAD 2015, Copenhagen, Denmark, 12–14 October 2015; Proceedings 3; Springer: Cham, Switzerland, 2015; pp. 84–92. [Google Scholar]
  38. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673. [Google Scholar]
  39. Li, Q.; Jia, X.; Zhou, J.; Shen, L.; Duan, J. UniTSFace: Unified Threshold Integrated Sample-to-Sample Loss for Face Recognition. Adv. Neural Inf. Process. Syst. 2023, 36, 32732–32747. [Google Scholar]
  40. Wu, C.Y.; Manmatha, R.; Smola, A.J.; Krahenbuhl, P. Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2840–2848. [Google Scholar]
  41. Yu, B.; Liu, T.; Gong, M.; Ding, C.; Tao, D. Correcting the triplet selection bias for triplet loss. In Proceedings of the 15th European Conference on Computer Vision, ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 71–87. [Google Scholar]
  42. Dong, X.; Shen, J. Triplet loss in siamese network for object tracking. In Proceedings of the 15th European Conference on Computer Vision, ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 459–474. [Google Scholar]
  43. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the 14th European Conference on Computer Vision, ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 499–515. [Google Scholar]
  44. Liu, Y.; Li, H.; Wang, X. Learning deep features via congenerous cosine loss for person recognition. arXiv 2017, arXiv:1702.06890. [Google Scholar]
  45. Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. SphereFace: Deep Hypersphere Embedding for Face Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  46. Liu, W.; Wen, Y.; Yu, Z.; Yang, M. Large-Margin Softmax Loss for Convolutional Neural Networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 507–516. [Google Scholar]
  47. Meng, Q.; Zhao, S.; Huang, Z.; Zhou, F. MagFace: A universal representation for face recognition and quality assessment. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14225–14234. [Google Scholar]
  48. Kasarla, T.; Burghouts, G.; van Spengler, M.; van der Pol, E.; Cucchiara, R.; Mettes, P. Maximum class separation as inductive bias in one matrix. Adv. Neural Inf. Process. Syst. 2022, 35, 19553–19566. [Google Scholar]
  49. Zhang, X.; Wang, L.; Su, Y. Visual place recognition: A survey from deep learning perspective. Pattern Recognit. 2021, 113, 107760. [Google Scholar]
  50. Tsintotas, K.A.; Bampis, L.; Gasteratos, A. Visual Place Recognition for Simultaneous Localization and Mapping. In Autonomous—Vehicles Volume 2: Smart Vehicles; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2022; pp. 47–79. [Google Scholar]
  51. Yang, M.; Mao, J.; He, X.; Zhang, L.; Hu, X. A sequence-based visual place recognition method for aerial mobile robots. J. Phys. Conf. Ser. 2020, 1654, 012080. [Google Scholar] [CrossRef]
  52. Niu, J.; Qian, K. Robust place recognition based on salient landmarks screening and convolutional neural network features. Int. J. Adv. Robot. Syst. 2020, 17, 1–8. [Google Scholar]
  53. Wang, Z.; Zhang, L.; Zhao, S.; Zhou, Y. Global Localization in Large-Scale Point Clouds via Roll-Pitch-Yaw Invariant Place Recognition and Low-Overlap Global Registration. IEEE Trans. Circuit Syst. Video Technol. 2024, 34, 3846–3859. [Google Scholar] [CrossRef]
  54. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  55. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  56. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar]
  57. Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded up robust features. In Proceedings of the 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 404–417. [Google Scholar]
  58. Calonder, M.; Lepetit, V.; Strecha, C.; Fua, P. BRIEF: Binary robust independent elementary features. In Proceedings of the 11th European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; pp. 778–792. [Google Scholar]
  59. Wang, J.; Zhong, S.; Yan, L.; Cao, Z. An Embedded System-on-Chip Architecture for Real-time Visual Detection and Matching. IEEE Trans. Circuit Syst. Video Technol. 2014, 24, 525–538. [Google Scholar] [CrossRef]
  60. Ong, E.J.; Husain, S.S.; Bober-Irizar, M.; Bober, M. Deep Architectures and Ensembles for Semantic Video Classification. IEEE Trans. Circuit Syst. Video Technol. 2019, 29, 3568–3582. [Google Scholar] [CrossRef]
  61. Miech, A.; Laptev, I.; Sivic, J. Learnable pooling with context gating for video classification. arXiv 2017, arXiv:1706.06905. [Google Scholar]
  62. Ge, Y.; Wang, H.; Zhu, F.; Zhao, R.; Li, H. Self-supervising fine-grained region similarities for large-scale image localization. In Proceedings of the 16th European Conference (ECCV 2020), Glasgow, UK, 23–28 August 2020; pp. 369–386. [Google Scholar]
  63. Radenović, F.; Iscen, A.; Tolias, G.; Avrithis, Y.; Chum, O. Revisiting oxford and paris: Large-scale image retrieval benchmarking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  64. Seo, P.H.; Weyand, T.; Sim, J.; Han, B. Cplanet: Enhancing image geolocalization by combinatorial partitioning of maps. In Proceedings of the 15th European Conference on Computer Vision, ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 536–551. [Google Scholar]
  65. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  66. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  67. Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737. [Google Scholar]
  68. Kalantidis, Y.; Sariyildiz, M.B.; Pion, N.; Weinzaepfel, P.; Larlus, D. Hard negative mixing for contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21798–21809. [Google Scholar]
  69. Lee, K.; Zhu, Y.; Sohn, K.; Li, C.L.; Shin, J.; Lee, H. i-mix: A domain-agnostic strategy for contrastive representation learning. arXiv 2020, arXiv:2010.08887. [Google Scholar]
  70. Chuang, C.Y.; Robinson, J.; Lin, Y.C.; Torralba, A.; Jegelka, S. Debiased contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 8765–8775. [Google Scholar]
  71. Robinson, J.; Chuang, C.Y.; Sra, S.; Jegelka, S. Contrastive learning with hard negative samples. arXiv 2020, arXiv:2010.04592. [Google Scholar]
  72. Chen, X.; Chen, W.; Chen, T.; Yuan, Y.; Gong, C.; Chen, K.; Wang, Z. Self-pu: Self boosted and calibrated positive-unlabeled training. In Proceedings of the International Conference on Machine Learning PMLR, Online, 13–18 July 2020; pp. 1510–1519. [Google Scholar]
  73. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  74. Chen, X.; Fan, H.; Girshick, R.; He, K. Improved baselines with momentum contrastive learning. arXiv 2020, arXiv:2003.04297. [Google Scholar]
  75. Chen, X.; Xie, S.; He, K. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9640–9649. [Google Scholar]
  76. Guan, P.; Cao, Z.; Fan, S.; Yang, Y.; Yu, J.; Wang, S. Hardness-aware Metric Learning with Cluster-guided Attention for Visual Place Recognition. IEEE Trans. Circuit Syst. Video Technol. 2025, 35, 367–379. [Google Scholar]
  77. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  78. Mercer, A.M. A variant of Jensen’s inequality. J. Inequalities Pure Appl. Math. 2003, 4, 73–74. [Google Scholar]
  79. Lee, S.; Lee, S.; Seong, H.; Kim, E. Revisiting Self-Similarity: Structural Embedding for Image Retrieval. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  80. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  81. Cummins, M.; Newman, P. Appearance-only SLAM at large scale with FAB-MAP 2.0. Int. J. Robot. Res. 2011, 30, 1100–1123. [Google Scholar]
  82. Torii, A.; Sivic, J.; Pajdla, T.; Okutomi, M. Visual place recognition with repetitive structures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 883–890. [Google Scholar]
  83. Chen, D.M.; Baatz, G.; Köser, K.; Tsai, S.S.; Vedantham, R.; Pylvänäinen, T.; Roimela, K.; Chen, X.; Bach, J.; Pollefeys, M.; et al. City-scale landmark identification on mobile devices. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 737–744. [Google Scholar]
  84. Weyand, T.; Araujo, A.; Cao, B.; Sim, J. Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  85. Zhu, Y.; Gao, X.; Ke, B.; Qiao, R.; Sun, X. Coarse-to-Fine: Learning Compact Discriminative Representation for Single-Stage Image Retrieval. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023. [Google Scholar]
  86. Diaz, R.; Marathe, A. Soft Labels for Ordinal Regression. In Proceedings of the IEEE/CVF Conference On Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  87. Wang, Y.; Qiu, Y.; Cheng, P.; Zhang, J. Hybrid CNN-Transformer Features for Visual Place Recognition. IEEE Trans. Circuit Syst. Video Technol. 2023, 33, 1109–1122. [Google Scholar] [CrossRef]
  88. Zhu, S.; Yang, L.; Chen, C.; Shah, M.; Shen, X.; Wang, H. R2former: Unified retrieval and reranking transformer for place recognition. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 19370–19380. [Google Scholar]
  89. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR). 2021. Available online: https://openreview.net/forum?id=YicbFdNTTy (accessed on 26 March 2025).
  90. Izquierdo, S.; Civera, J. Optimal transport aggregation for visual place recognition. arXiv 2023, arXiv:2311.15937. [Google Scholar]
  91. Lu, F.; Lan, X.; Zhang, L.; Jiang, D.; Wang, Y.; Yuan, C. CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition. arXiv 2024, arXiv:2402.19231. [Google Scholar]
  92. Liu, Y.; Kong, A.W.K.; Goh, C.K. A Constrained Deep Neural Network for Ordinal Regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  93. Zhang, Z.; Song, Y.; Qi, H. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5810–5818. [Google Scholar]
  94. Moschoglou, S.; Papaioannou, A.; Sagonas, C.; Deng, J.; Kotsia, I.; Zafeiriou, S. Agedb: The first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 51–59. [Google Scholar]
Figure 1. (a) An illustration of what it means for a loss function to be geographic distance consistent: the loss function is minimized when the geographic distances are negatively correlated with the similarities. (b) The quantitative results of GDCPlace. The loss function is minimized when the embedding similarities exhibit the correct rank.
Figure 2. Given the query image, the system outputs the index images most similar to the input. The visual cue gradually vanishes as farther photos are considered. The graph in the bottom-right corner shows how the value of function h changes with the geographic distance; the changes are significantly larger when the geographic distance is small.
Figure 3. Visualization of the predicted most similar index images with and without HNCM.
Table 1. List of related state-of-the-art papers.
References | Training Strategies | Inference Strategies | Methods | Limitations
Arandjelović et al. [4] | Sample-to-Sample | Retrieval | NetVLAD integrates a differentiable VLAD layer into a CNN, aggregating local features into a global descriptor. | High computational and memory requirements, along with sensitivity to variations in local features.
Ali-Bey et al. [5] | Sample-to-Sample | Retrieval | MixVPR fuses CNN feature maps using cascaded lightweight MLP blocks to generate compact global descriptors. | Its reliance on fixed pretrained backbones and MLP fusion can make it sensitive to noise.
Shen et al. [6] | Sample-to-Sample | Retrieval | StructVPR distills structural knowledge by dynamically weighting training samples to guide global descriptor learning. | Its effectiveness relies on accurate structural cues, making it potentially sensitive to noisy scene structures.
Ali-Bey et al. [7] | Sample-to-Sample | Retrieval | GSV-Cities introduces a large-scale, accurately labeled urban dataset to train VPR models. | Its approach may be less effective in non-urban settings and relies on extensive, high-quality annotations.
Trivigno et al. [8] | Classification | Classification | Divide&Classify reformulates the VPR task as a fine-grained classification task by partitioning dense urban maps into non-adjacent cells. | Its performance depends heavily on dense, well-distributed training data and may struggle with small-scale regions.
Muller-Budack et al. [9] | Classification | Classification | The method treats geolocalization as a classification task by hierarchically partitioning the Earth into cells and integrating scene context. | It suffers from ambiguity in scene content when defining effective partitions, reducing localization accuracy in transitional or unclear areas.
Berton et al. [14] | Classification | Retrieval | CosPlace reformulates visual geo-localization as a classification task by partitioning the city into discrete regions and training a CNN with a CosFace loss. | Its performance is highly sensitive to the partitioning strategy; errors in defining region boundaries or ambiguous scene cues can lead to misclassifications.
Berton et al. [15] | Classification | Retrieval | EigenPlace trains a CNN to produce viewpoint-robust global descriptors by incorporating eigen-decomposition techniques that emphasize stable, discriminative features. | Its reliance on eigen-decomposition and the need for diverse, extensive training data can limit generalization and add computational complexity.
Table 2. Comparison with other loss functions (recall@1). Bold denotes the best, and underline denotes the second best. Abbreviations: PI for Pitts30k, AM for AmsterTime, EY for Eynsham, TO for Tokyo-24/7, NL for NordLand, SN for SVOX Night, SO for SVOX Overcast, SW for SVOX Snow, SR for SVOX Rain, SU for SVOX Sun, MS for MSLS val.
Method | Track | PI | AM | EY | TO | NL | SN | SO | SW | SR | SU | MS | AVG
(Multi-view columns: PI, AM, EY, TO; frontal-view columns: NL, SN, SO, SW, SR, SU, MS.)
NormFace [30] | Face | 90.6 | 41.3 | 90.0 | 88.3 | 58.4 | 40.3 | 88.2 | 87.2 | 85.2 | 41.3 | 86.8 | 72.5
SphereFace [45] | Face | 86.4 | 43.1 | 79.5 | 63.2 | 59.3 | 52.0 | 90.7 | 87.7 | 79.8 | 63.1 | 83.4 | 71.7
CosFace [10] | Face | 92.5 | 48.9 | 90.7 | 93.0 | 71.2 | 58.9 | 93.1 | 93.1 | 90.0 | 86.4 | 89.1 | 82.4
ArcFace [11] | Face | 89.2 | 40.5 | 83.2 | 91.2 | 64.2 | 46.1 | 82.8 | 90.2 | 83.6 | 81.4 | 85.7 | 76.2
CurricularFace [32] | Face | 92.3 | 47.1 | 90.8 | 91.4 | 68.8 | 61.6 | 93.7 | 93.6 | 90.6 | 84.4 | 89.1 | 82.1
AdaFace [31] | Face | 86.7 | 44.8 | 75.4 | 74.6 | 61.1 | 44.7 | 89.1 | 83.9 | 74.8 | 51.9 | 79.6 | 69.7
SORD [86] | Ordinal | 91.8 | 49.2 | 90.1 | 91.6 | 70.8 | 63.9 | 93.2 | 93.5 | 90.6 | 85.3 | 88.0 | 82.5
LM-Softmax [34] | General | 92.2 | 49.6 | 90.4 | 90.5 | 73.2 | 65.2 | 94.0 | 93.4 | 90.1 | 85.4 | 89.6 | 83.1
MadaCos [85] | Landmark | 91.7 | 46.8 | 90.6 | 90.8 | 68.8 | 61.4 | 95.0 | 93.6 | 90.3 | 85.1 | 88.5 | 82.1
GeM-AP [19] | Landmark | 90.6 | 45.1 | 88.6 | 91.5 | 68.2 | 54.1 | 91.9 | 92.0 | 86.4 | 82.9 | 85.0 | 79.7
UniFace [33] | Face | 92.2 | 46.5 | 90.6 | 91.4 | 65.5 | 61.2 | 94.2 | 94.0 | 91.5 | 84.7 | 88.1 | 81.8
GDCPlace | VPR | 92.6 | 51.6 | 90.9 | 92.1 | 77.6 | 70.1 | 95.0 | 94.4 | 90.8 | 86.8 | 89.6 | 84.7
Table 3. Comparison with SoTAs equipped with ResNet50 and VGG16 backbones (recall@1). Blue denotes the best for both categories, and bold denotes the best in each category.
Method | Backbone | Dim. | PI | AM | EY | TO | NL | SN | SO | SW | SR | SU | MS | AVG
(Multi-view columns: PI, AM, EY, TO; frontal-view columns: NL, SN, SO, SW, SR, SU, MS.)
Classification Training
D&C [8] | ResNet50 | 1280 | 84.6 | 17.5 | 87.2 | 60.0 | 11.9 | 4.7 | 69.3 | 57.7 | 55.1 | 42.5 | 69.6 | 50.9
CFCD [85] | ResNet50 | 512 | 62.2 | 28.8 | 76.3 | 54.9 | 6.1 | 16.8 | 60.8 | 50.9 | 39.3 | 25.6 | 50.7 | 42.9
CosPlace [14] | ResNet50 | 2048 | 90.9 | 47.7 | 90.0 | 87.3 | 71.9 | 50.7 | 92.2 | 92.0 | 87.0 | 78.5 | 87.4 | 79.6
EigenPlace [15] | ResNet50 | 2048 | 92.5 | 48.9 | 90.7 | 93.0 | 71.2 | 58.9 | 93.1 | 93.1 | 90.0 | 86.4 | 89.1 | 82.4
GDCPlace | ResNet50 | 2048 | 92.6 | 51.6 | 90.9 | 92.1 | 77.6 | 70.1 | 95.0 | 94.4 | 90.8 | 86.8 | 89.6 | 84.7
Sample-to-Sample Training
SFRS [62] | VGG16 | 4096 | 89.1 | 29.7 | 72.3 | 80.3 | 16.0 | 28.6 | 81.1 | 76.0 | 69.7 | 54.8 | 70.0 | 60.7
NetVLAD [4] | VGG16 | 4096 | 85.0 | 16.3 | 77.7 | 69.8 | 13.1 | 8.0 | 66.4 | 54.4 | 51.5 | 35.4 | 58.9 | 48.8
R²Former-global [88] | ResNet50 | 256 | 73.1 | 12.8 | 82.4 | 48.3 | 24.6 | 13.5 | 75.8 | 60.8 | 47.6 | 28.0 | 80.3 | 49.7
CNN-Transformer [87] | VGG16 + Swin-B | 4096 | 86.3 | - | - | 78.9 | - | - | - | - | - | - | - | -
HCA [76] | VGG16 | 4096 | - | - | - | 86.7 | - | - | - | - | - | - | 76.4 | -
Conv-AP [7] | ResNet50 | 8192 | 90.5 | 35.0 | 87.6 | 72.1 | 62.9 | 43.4 | 91.9 | 91.0 | 82.8 | 80.4 | 82.4 | 74.5
MixVPR [5] | ResNet50 | 4096 | 91.5 | 40.2 | 89.4 | 85.1 | 76.2 | 64.4 | 96.2 | 96.8 | 91.5 | 84.8 | 87.2 | 82.1
Table 4. Comparison with recent DINOv2 methods (recall@1). Blue denotes the best for both categories, and bold denotes the best in each category.
Method | Backbone | Dim. | PI | AM | EY | TO | NL | SN | SO | SW | SR | SU | MS | AVG
(Multi-view columns: PI, AM, EY, TO; frontal-view columns: NL, SN, SO, SW, SR, SU, MS.)
Classification Training
EigenPlace [15] | DINOv2-B/14 | 2048 | 93.3 | 53.0 | 91.3 | 95.2 | 86.2 | 92.8 | 96.9 | 97.2 | 96.3 | 94.1 | 89.2 | 89.6
GDCPlace | DINOv2-B/14 | 2048 | 93.5 | 57.7 | 91.7 | 96.5 | 88.8 | 94.9 | 97.2 | 97.9 | 96.8 | 95.6 | 91.4 | 91.1
GDCPlace | DINOv2-B/14 | 8448 | 94.2 | 58.5 | 91.9 | 96.2 | 90.7 | 94.9 | 97.6 | 97.8 | 97.0 | 96.1 | 91.4 | 91.5
Sample-to-Sample Training
AnyLoc [1] | DINOv2-G/14 | 49152 | 87.5 | 43.9 | 87.5 | 91.4 | 35.7 | 76.5 | 91.9 | 88.0 | 85.6 | 86.5 | 70.3 | 76.8
SALAD [90] | DINOv2-B/14 | 8448 | 91.9 | 56.9 | 91.4 | 94.3 | 86.1 | 94.8 | 98.3 | 98.9 | 98.7 | 96.7 | 92.2 | 90.9
CricaVPR [91] | DINOv2-B/14 | 4096 | 94.9 | 64.7 | 91.6 | 93.0 | 90.7 | 85.1 | 96.7 | 96.0 | 95.0 | 93.8 | 90.0 | 90.1
Table 5. mAP@k results of SoTA methods. S2S represents sample-to-sample, and CLS stands for classification. Bold denotes the best in each measure, and underline denotes the second best.
Methods | Training Strategy | NordLand @3 | NordLand @5 | NordLand @7 | SVOX Night @3 | SVOX Night @5 | SVOX Night @7 | AmsterTime @3 | AmsterTime @5 | AmsterTime @7
CosPlace [14] | CLS | 60.8 | 51.8 | 44.6 | 40.1 | 31.8 | 26.0 | 54.8 | 56.2 | 56.7
EigenPlace [15] | CLS | 57.0 | 46.8 | 39.4 | 47.7 | 34.1 | 27.6 | 56.0 | 57.1 | 57.7
Conv-AP [7] | S2S | 49.1 | 40.4 | 34.3 | 31.6 | 24.3 | 19.5 | 40.5 | 41.9 | 42.4
MixVPR [5] | S2S | 60.7 | 50.2 | 42.7 | 47.2 | 35.9 | 28.4 | 46.6 | 47.6 | 47.9
GDCPlace | CLS | 63.6 | 52.8 | 44.8 | 54.6 | 42.3 | 33.7 | 58.3 | 59.7 | 60.0
Table 6. Comparison on mAP@k with other classification loss functions. Bold denotes the best in each measure, and underline denotes the second best.
Methods | Tracks | NordLand @3 | NordLand @5 | NordLand @7 | SVOX Night @3 | SVOX Night @5 | SVOX Night @7 | AmsterTime @3 | AmsterTime @5 | AmsterTime @7
NormFace [30] | Face | 39.5 | 31.2 | 25.7 | 29.0 | 22.5 | 18.0 | 47.9 | 49.0 | 49.6
SphereFace [45] | Face | 47.2 | 38.8 | 32.9 | 39.1 | 30.4 | 24.6 | 50.6 | 51.9 | 52.3
CosFace [10] | Face | 60.8 | 51.8 | 44.6 | 40.1 | 31.8 | 26.0 | 54.8 | 56.2 | 56.7
ArcFace [11] | Face | 38.9 | 31.7 | 26.8 | 19.3 | 14.6 | 11.7 | 45.3 | 46.6 | 47.2
CurricularFace [32] | Face | 55.1 | 45.4 | 38.3 | 46.7 | 36.0 | 28.8 | 55.3 | 56.5 | 57.0
AdaFace [31] | Face | 48.4 | 40.1 | 34.2 | 33.7 | 26.1 | 21.0 | 51.7 | 52.9 | 53.5
SORD [86] | Ordinal | 61.2 | 50.6 | 43.2 | 51.4 | 39.8 | 32.4 | 58.0 | 57.6 | 59.4
LM-Softmax [34] | General | 59.8 | 49.5 | 42.0 | 49.2 | 37.7 | 30.2 | 57.1 | 58.2 | 58.7
MadaCos [85] | Landmark | 55.9 | 46.5 | 39.5 | 46.8 | 36.2 | 29.1 | 54.7 | 55.8 | 56.1
GeM-AP [85] | Landmark | 58.2 | 47.9 | 41.5 | 48.7 | 37.8 | 31.2 | 56.2 | 57.2 | 58.0
UniFace [33] | Face | 52.5 | 43.3 | 36.7 | 44.3 | 33.8 | 27.0 | 55.0 | 56.1 | 56.6
GDCPlace | VPR | 63.6 | 52.8 | 44.8 | 54.6 | 42.3 | 33.7 | 58.3 | 59.7 | 60.0
Table 7. Comparison with Janine's Loss (recall@1). Bold denotes the best in each dataset.
Formulation | PI | AM | EY | TO | NL | SN | SO | SW | SR | SU | MS | AVG
(Multi-view columns: PI, AM, EY, TO; frontal-view columns: NL, SN, SO, SW, SR, SU, MS.)
Janine's Loss | 78.3 | 47.5 | 76.1 | 69.2 | 48.4 | 36.8 | 88.3 | 89.8 | 80.2 | 66.2 | 80.1 | 69.2
Janine's Loss + HNCM | 89.9 | 51.6 | 89.2 | 81.6 | 51.3 | 44.7 | 92.4 | 92.2 | 87.4 | 78.3 | 86.6 | 76.8
GDCPlace | 92.6 | 51.6 | 90.9 | 92.1 | 77.6 | 70.1 | 95.0 | 94.4 | 90.8 | 86.8 | 89.6 | 84.7
Table 8. GDCPlace applied to ordinal classification tasks (MAE). Bold denotes the best in each dataset.
Method | Regression [20] | Classification [92] | DLDL-v2 [21] | SORD [86] | OR-CNN [20] | GDCPlace
UTKFace | 4.78 | 4.95 | 4.67 | 4.77 | 4.64 | 4.59
AgeDB | 6.72 | 7.29 | 6.89 | 6.86 | 6.68 | 6.55
Table 9. GDCPlace applied to CosPlace (recall@1). Bold denotes the best in each dataset.
Method | Backbone | Dim. | PI | AM | EY | TO | NL | SN | SO | SW | SR | SU | MS | AVG
(Multi-view columns: PI, AM, EY, TO; single-view columns: NL, SN, SO, SW, SR, SU, MS.)
CosPlace | ResNet50 | 2048 | 90.9 | 47.7 | 90.0 | 87.3 | 71.9 | 50.7 | 92.2 | 92.0 | 87.0 | 78.5 | 87.4 | 79.6
CosPlace + GDCPlace | ResNet50 | 2048 | 91.5 | 50.2 | 90.0 | 89.5 | 75.8 | 54.2 | 93.0 | 93.5 | 87.8 | 79.0 | 88.0 | 81.1
Table 10. Ablations on different K of HNCM (recall). Bold denotes the best in each measure.
K in HNCM | Pitts30k R@1 | Pitts30k R@5 | Pitts30k R@10 | Tokyo-24/7 R@1 | Tokyo-24/7 R@5 | Tokyo-24/7 R@10 | SVOX Night R@1 | SVOX Night R@5 | SVOX Night R@10
1 | 92.4 | 96.5 | 97.3 | 92.1 | 96.8 | 97.1 | 50.3 | 72.2 | 78.1
2 | 92.6 | 96.4 | 97.3 | 92.1 | 97.5 | 98.4 | 51.6 | 72.5 | 77.0
3 | 92.4 | 96.5 | 97.4 | 92.7 | 96.2 | 97.1 | 51.1 | 73.4 | 78.1
50 | 92.0 | 96.4 | 97.2 | 90.2 | 95.6 | 96.8 | 50.4 | 72.5 | 77.0
No HNCM | 92.3 | 96.5 | 97.3 | 91.4 | 97.1 | 97.8 | 50.5 | 72.1 | 77.3
Table 11. Ablation on different γ in function h (recall). Bold denotes the best in each measure.
γ in Function h | Pitts30k R@1 | Pitts30k R@5 | Pitts30k R@10 | Tokyo-24/7 R@1 | Tokyo-24/7 R@5 | Tokyo-24/7 R@10 | SVOX Night R@1 | SVOX Night R@5 | SVOX Night R@10
0.1 | 92.4 | 96.4 | 97.4 | 91.4 | 97.5 | 97.8 | 50.5 | 72.0 | 77.8
0.2 | 92.6 | 96.4 | 97.3 | 92.1 | 97.5 | 98.4 | 51.6 | 72.5 | 77.0
0.4 | 92.3 | 96.5 | 97.4 | 91.8 | 97.1 | 98.0 | 50.4 | 71.6 | 77.6
0.6 | 92.4 | 96.5 | 97.2 | 91.1 | 96.8 | 97.5 | 50.9 | 72.6 | 77.8
0.8 | 92.3 | 96.2 | 97.2 | 90.8 | 97.1 | 97.8 | 50.9 | 72.3 | 77.3
Table 12. Ablation on different ζ in function h (recall). Bold denotes the best in each measure.
ζ in Function h | Pitts30k R@1 | Pitts30k R@5 | Pitts30k R@10 | Tokyo-24/7 R@1 | Tokyo-24/7 R@5 | Tokyo-24/7 R@10 | SVOX Night R@1 | SVOX Night R@5 | SVOX Night R@10
2 | 92.4 | 96.4 | 97.4 | 94.0 | 97.8 | 98.1 | 68.5 | 85.4 | 88.2
4 | 92.3 | 96.4 | 97.3 | 92.1 | 96.8 | 97.8 | 68.5 | 84.0 | 88.6
6 | 92.6 | 96.4 | 97.3 | 92.1 | 97.5 | 98.4 | 70.1 | 84.8 | 89.2
8 | 92.1 | 96.5 | 97.3 | 90.8 | 96.8 | 97.5 | 70.4 | 83.1 | 87.5
10 | 92.2 | 96.5 | 97.3 | 90.2 | 97.5 | 97.5 | 66.3 | 81.7 | 85.7
Table 13. The training efficiency.
Method | Time (min/5000 steps)
EigenPlace | 30.74 ± 0.21
GDCPlace | 30.69 ± 0.36
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
