CriSALAD: Robust Visual Place Recognition Using Cross-Image Information and Optimal Transport Aggregation
Abstract
1. Introduction
- Global descriptor generation with cross-image information: We introduce a novel global descriptor generation approach that integrates cross-image information with optimal transport-based feature aggregation, enhancing recognition robustness (a minimal aggregation sketch follows this list).
- Parameter-efficient adaptation strategy: We propose a lightweight adaptation mechanism that decouples spatial filtering, handled by depthwise convolutions, from channel-wise interactions, handled by pointwise convolutions. This design reduces computational overhead (FLOPs) compared with conventional adapters (see the adapter sketch after this list).
- State-of-the-art performance: Extensive experiments on multiple VPR benchmarks demonstrate that CriSALAD outperforms existing approaches on several datasets, including Pitts30k and Nordland.
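To make the optimal transport aggregation concrete, the sketch below shows log-domain Sinkhorn normalization turning feature-to-cluster scores into a soft assignment, followed by weighted aggregation into a flat global descriptor. This is a minimal PyTorch illustration assuming simple dot-product scores; it omits SALAD's dustbin cluster and learned marginals, and all names and dimensions are illustrative rather than the paper's implementation.

```python
import torch

def log_sinkhorn(scores: torch.Tensor, n_iters: int = 3) -> torch.Tensor:
    """Log-domain Sinkhorn normalization of a score matrix.

    scores: (N, M) similarities between N local features and M clusters.
    Returns an (N, M) soft assignment whose rows and columns are
    approximately balanced (a near doubly-stochastic transport plan).
    """
    log_p = scores
    for _ in range(n_iters):
        # Alternately normalize rows and columns in log space.
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)
    return log_p.exp()

def aggregate(features: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """Aggregate N local features (N, D) into an (M * D) global descriptor,
    weighting each feature by its transport-plan assignment."""
    scores = features @ centers.T   # (N, M) dot-product scores
    plan = log_sinkhorn(scores)     # (N, M) soft assignments
    desc = plan.T @ features        # (M, D) per-cluster weighted sums
    return torch.nn.functional.normalize(desc.flatten(), dim=0)

# Usage: 256 local tokens of dim 64, 8 clusters (illustrative sizes).
feats = torch.randn(256, 64)
centers = torch.randn(8, 64)
print(aggregate(feats, centers).shape)  # torch.Size([512])
```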
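The adapter idea can be sketched as follows: patch tokens from a ViT block are reshaped to their 2D grid, passed through a pointwise bottleneck, a depthwise 3 × 3 convolution, and a pointwise expansion, then added back residually. This is a minimal PyTorch sketch with hypothetical dimensions (768-d tokens, bottleneck 96), not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DSConvAdapter(nn.Module):
    """Illustrative depthwise separable convolutional adapter.

    The depthwise 3x3 conv handles spatial filtering; the pointwise 1x1
    convs handle channel interactions through a low-dimensional bottleneck.
    """
    def __init__(self, dim: int = 768, bottleneck: int = 96):
        super().__init__()
        self.down = nn.Conv2d(dim, bottleneck, kernel_size=1)     # pointwise
        self.depthwise = nn.Conv2d(bottleneck, bottleneck, 3,
                                   padding=1, groups=bottleneck)  # depthwise
        self.up = nn.Conv2d(bottleneck, dim, kernel_size=1)       # pointwise
        self.act = nn.GELU()

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, h*w, dim) patch tokens from a ViT block.
        b, n, d = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, d, h, w)
        x = self.up(self.act(self.depthwise(self.act(self.down(x)))))
        x = x.reshape(b, d, n).transpose(1, 2)
        return tokens + x  # residual: the adapter refines, the backbone dominates

# Usage: a 16 x 16 grid of 768-d patch tokens.
adapter = DSConvAdapter()
out = adapter(torch.randn(2, 256, 768), h=16, w=16)
print(out.shape)  # torch.Size([2, 256, 768])
```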
2. Related Work
2.1. Visual Place Recognition
2.2. Parameter-Efficient Transfer Learning
3. Methodology
3.1. Preliminaries
3.2. Depthwise Separable Convolutional Adapter
3.3. Global Descriptor Aggregation
3.4. Cross-Image Correlation-Aware Place Representation
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Comparison Methods
4.5. Results and Discussion
4.6. Ablation Studies
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
VPR | Visual Place Recognition
VFMs | Visual Foundation Models
SALAD | Sinkhorn Algorithm for Locally Aggregated Descriptors
SURF | Speeded-Up Robust Features
VLAD | Vector of Locally Aggregated Descriptors
ORB | Oriented FAST and Rotated BRIEF
BoW | Bag of Words
CNNs | Convolutional Neural Networks
ViT | Vision Transformer
MLP | Multi-Layer Perceptron
GeM | Generalized Mean Pooling
DiNOv2 | Distillation with NO labels (Version 2)
3D | Three-Dimensional
GPU | Graphics Processing Unit
PETL | Parameter-Efficient Transfer Learning
NLP | Natural Language Processing
LoRA | Low-Rank Adaptation
CLIP | Contrastive Language–Image Pre-training
MHA | Multi-Head Attention
LN | Layer Normalization
SoTA | State-of-the-Art
PCA | Principal Component Analysis
FLOPs | Floating-Point Operations
MACs | Multiply–Accumulate Operations
GSV-Cities | Google Street View-Cities
MSLS | Mapillary Street-Level Sequences
MS | Multi-Similarity
OHEM | Online Hard Example Mining
SF-XL | San Francisco eXtra Large Dataset
DSConv | Depthwise Separable Convolution
References
1. Masone, C.; Caputo, B. A survey on deep visual place recognition. IEEE Access 2021, 9, 19516–19547.
2. Schubert, S.; Neubert, P.; Garg, S.; Milford, M.; Fischer, T. Visual place recognition: A tutorial. arXiv 2023, arXiv:2303.03281.
3. Chen, Z.; Jacobson, A.; Sünderhauf, N.; Upcroft, B.; Liu, L.; Shen, C.; Reid, I.; Milford, M. Deep learning features at scale for visual place recognition. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3223–3230.
4. Middelberg, S.; Sattler, T.; Untzelmann, O.; Kobbelt, L. Scalable 6-DOF localization on mobile devices. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part II; Springer: Berlin/Heidelberg, Germany, 2014; pp. 268–283.
5. Qian, J.; Cheng, Y.; Ying, R.; Liu, P. A Novel Indoor Localization Method Based on Image Retrieval and Dead Reckoning. Appl. Sci. 2020, 10, 3803.
6. Peng, Z.; Song, R.; Yang, H.; Li, Y.; Lin, J.; Xiao, Z.; Yan, B. Adaptive Feature Refinement and Weighted Similarity for Deep Loop Closure Detection in Appearance Variation. Appl. Sci. 2024, 14, 6276.
7. Garg, S.; Sünderhauf, N.; Milford, M. LoST? Appearance-Invariant Place Recognition for Opposite Viewpoints using Visual Semantics. In Proceedings of the Robotics: Science and Systems XIV, Carnegie Mellon University, Pittsburgh, PA, USA, 26–30 June 2018.
8. Ming, Y.; Yang, X.; Calway, A. Object-Augmented RGB-D SLAM for Wide-Disparity Relocalisation. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 2203–2209.
9. Berton, G.; Trivigno, G.; Caputo, B.; Masone, C. EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 11046–11056.
10. Ming, Y.; Yang, X.; Zhang, G.; Calway, A. CGiS-Net: Aggregating Colour, Geometry and Implicit Semantic Features for Indoor Place Recognition. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 6991–6997.
11. Ming, Y.; Ma, J.; Yang, X.; Dai, W.; Peng, Y.; Kong, W. AEGIS-Net: Attention-Guided Multi-Level Feature Aggregation for Indoor Place Recognition. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 4030–4034.
12. Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-up robust features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359.
13. Jégou, H.; Douze, M.; Schmid, C.; Pérez, P. Aggregating local descriptors into a compact image representation. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3304–3311.
14. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571.
15. Angeli, A.; Filliat, D.; Doncieux, S.; Meyer, J.A. Fast and incremental method for loop-closure detection using bags of visual words. IEEE Trans. Robot. 2008, 24, 1027–1037.
16. A, V.; Garg, H.; Anand, A.; Nigam, R.; Gupta, A.; Murthy, K.N.B.; Natarajan, S. Aggregation of Deep Local Features using VLAD and Classification using R2 Forest. Procedia Comput. Sci. 2018, 143, 998–1006.
17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
18. Keetha, N.; Mishra, A.; Karhade, J.; Jatavallabhula, K.M.; Scherer, S.; Krishna, M.; Garg, S. AnyLoc: Towards Universal Visual Place Recognition. IEEE Robot. Autom. Lett. 2024, 9, 1286–1293.
19. Lu, F.; Lan, X.; Zhang, L.; Jiang, D.; Wang, Y.; Yuan, C. CricaVPR: Cross-Image Correlation-Aware Representation Learning for Visual Place Recognition. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16772–16782.
20. Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5297–5307.
21. Ali-Bey, A.; Chaib-Draa, B.; Giguere, P. MixVPR: Feature mixing for visual place recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 2998–3007.
22. Radenović, F.; Tolias, G.; Chum, O. Fine-tuning CNN image retrieval with no human annotation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1655–1668.
23. Izquierdo, S.; Civera, J. Optimal transport aggregation for visual place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17658–17668.
24. Hausler, S.; Garg, S.; Xu, M.; Milford, M.; Fischer, T. Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for Place Recognition. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14136–14147.
25. Guan, P.; Cao, Z.; Fan, S.; Yang, Y.; Yu, J.; Wang, S. Hardness-Aware Metric Learning With Cluster-Guided Attention for Visual Place Recognition. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 367–379.
26. Ming, Y.; Xu, M.; Yang, X.; Ye, W.; Wang, W.; Peng, Y.; Dai, W.; Kong, W. VIPeR: Visual Incremental Place Recognition With Adaptive Mining and Continual Learning. IEEE Robot. Autom. Lett. 2025, 10, 3038–3045.
27. Yang, X.; Ming, Y.; Cui, Z.; Calway, A. FD-SLAM: 3-D Reconstruction Using Features and Dense Matching. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 8040–8046.
28. Oh, J.; Eoh, G. Variational Bayesian Approach to Condition-Invariant Feature Extraction for Visual Place Recognition. Appl. Sci. 2021, 11, 8976.
29. Ye, M.; Tanaka, K. Visual Place Recognition of Robots via Global Features of Scan-Context Descriptors with Dictionary-Based Coding. Appl. Sci. 2023, 13, 9040.
30. Chen, B.; Song, X.; Shen, H.; Lu, T. Hierarchical Visual Place Recognition Based on Semantic-Aggregation. Appl. Sci. 2021, 11, 9540.
31. Berton, G.; Masone, C.; Caputo, B. Rethinking Visual Geo-localization for Large-Scale Applications. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4868–4878.
32. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193.
33. Huang, G.; Zhou, Y.; Hu, X.; Zhang, C.; Zhao, L.; Gan, W. DINO-Mix enhancing visual place recognition with foundational vision model and feature mixing. Sci. Rep. 2024, 14, 22100.
34. Lu, F.; Zhang, L.; Lan, X.; Dong, S.; Wang, Y.; Yuan, C. Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
35. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685.
36. He, J.; Zhou, C.; Ma, X.; Berg-Kirkpatrick, T.; Neubig, G. Towards a unified view of parameter-efficient transfer learning. arXiv 2021, arXiv:2110.04366.
37. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 2790–2799.
38. Lester, B.; Al-Rfou, R.; Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv 2021, arXiv:2104.08691.
39. Chen, S.; Ge, C.; Tong, Z.; Wang, J.; Song, Y.; Wang, J.; Luo, P. AdaptFormer: Adapting vision transformers for scalable visual recognition. Adv. Neural Inf. Process. Syst. 2022, 35, 16664–16678.
40. Xu, M.; Zhang, Z.; Wei, F.; Hu, H.; Bai, X. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2945–2954.
41. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
43. Ali-bey, A.; Chaib-draa, B.; Giguère, P. GSV-Cities: Toward appropriate supervised visual place recognition. Neurocomputing 2022, 513, 194–203.
44. Torii, A.; Sivic, J.; Pajdla, T.; Okutomi, M. Visual Place Recognition with Repetitive Structures. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 883–890.
45. Torii, A.; Arandjelović, R.; Sivic, J.; Okutomi, M.; Pajdla, T. 24/7 Place Recognition by View Synthesis. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 257–271.
46. Olid, D.; Fácil, J.M.; Civera, J. Single-View Place Recognition under Seasonal Changes. arXiv 2018, arXiv:1808.06516.
47. Warburg, F.; Hauberg, S.; López-Antequera, M.; Gargallo, P.; Kuang, Y.; Civera, J. Mapillary Street-Level Sequences: A Dataset for Lifelong Place Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2623–2632.
48. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019.
49. Wang, R.; Shen, Y.; Zuo, W.; Zhou, S.; Zheng, N. TransVPR: Transformer-Based Place Recognition with Multi-Level Attention Aggregation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13638–13647.
Method | Dim. | Pitts30k R@1 | Pitts30k R@5 | Pitts30k R@10 | Tokyo 24/7 R@1 | Tokyo 24/7 R@5 | Tokyo 24/7 R@10 | Nordland R@1 | Nordland R@5 | Nordland R@10 | MSLS-Val R@1 | MSLS-Val R@5 | MSLS-Val R@10
---|---|---|---|---|---|---|---|---|---|---|---|---|---
NetVLAD [20] | 32,768 | 81.9 | 91.2 | 93.7 | 60.6 | 68.9 | 74.6 | - | - | - | 53.1 | 66.5 | 71.1 |
Conv-AP [43] | 4096 | 89.4 | 94.7 | 95.9 | 81.0 | 88.3 | 92.4 | - | - | - | 69.2 | 80.3 | 83.1 |
Patch-NetVLAD [24] | / | 88.7 | 94.5 | 95.9 | 86.0 | 88.6 | 90.5 | - | - | - | 79.5 | 86.2 | 87.7 |
TransVPR [49] | / | 89.0 | 94.9 | 96.2 | 79.0 | 82.2 | 85.1 | - | - | - | 86.8 | 91.2 | 92.4 |
CosPlace [31] | 512 | 88.4 | 94.5 | 95.7 | 81.9 | 90.2 | 92.7 | 58.5 | 73.7 | 79.4 | 82.8 | 89.7 | 92.0 |
MixVPR [21] | 4096 | 91.5 | 95.5 | 96.3 | 85.1 | 91.7 | 94.3 | 76.2 | 86.9 | 90.3 | 88.0 | 92.7 | 94.6 |
EigenPlaces [9] | 2048 | 92.5 | 96.8 | 97.6 | 93.0 | 96.2 | 97.5 | 71.2 | 83.8 | 88.1 | 89.1 | 93.8 | 95.0 |
CricaVPR (batch size 16) [19] | 4096 | 93.7 | 97.0 | 98.1 | 93.0 | 97.1 | 97.8 | - | - | - | 89.9 | 95.4 | 96.2 |
DiNO-SALAD [23] | 8448 | 91.9 | 96.3 | 97.4 | 92.1 | 96.2 | 96.5 | 76.0 | 89.2 | 92.0 | 92.2 | 96.4 | 97.0 |
CriSALAD (ours) | 8448 | 94.7 | 97.2 | 97.8 | 90.8 | 95.6 | 97.1 | 89.3 | 95.7 | 97.2 | 87.4 | 93.8 | 95.0 |
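For reference, the R@K columns above follow the standard VPR protocol: a query counts as correctly localized if at least one of its top-K retrieved database images falls within the ground-truth tolerance (commonly 25 m, or a fixed frame offset for Nordland). A minimal NumPy sketch of the computation, with illustrative variable names:

```python
import numpy as np

def recall_at_k(q_desc, db_desc, gt, ks=(1, 5, 10)):
    """Recall@K: fraction of queries with at least one correct match
    among the top-K database images ranked by descriptor similarity.

    q_desc: (Q, D) query descriptors; db_desc: (N, D) database descriptors
    (both L2-normalized); gt: list of sets of correct database indices.
    """
    sims = q_desc @ db_desc.T          # cosine similarity for unit vectors
    ranked = np.argsort(-sims, axis=1) # best match first
    hits = {k: 0 for k in ks}
    for i, order in enumerate(ranked):
        for k in ks:
            if gt[i] & set(order[:k].tolist()):  # any correct match in top-K
                hits[k] += 1
    return {k: 100.0 * hits[k] / len(q_desc) for k in ks}
```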
Adapter | CIE | lr | Scheduler | Nordland R@1 | Nordland R@5 | Nordland R@10
---|---|---|---|---|---|---
DSConv-DiNOv2 | ✓ | 6 × 10⁻⁵ | Linear | 89.3 | 95.7 | 97.2
DSConv-DiNOv2 | ✓ | 1 × 10⁻⁴ | Linear | 80.9 | 90.4 | 93.4
DSConv-DiNOv2 | ✓ | 6 × 10⁻⁵ | Step | 85.0 | 93.1 | 95.3
MultiConv-DiNOv2 | ✓ | 6 × 10⁻⁵ | Linear | 84.8 | 93.1 | 95.3
MLP-DiNOv2 | ✓ | 6 × 10⁻⁵ | Linear | 86.4 | 94.1 | 95.2
DSConv-DiNOv2 | ✗ | 6 × 10⁻⁵ | Step | 69.4 | 80.9 | 85.4
Adapter | MACs (M) | FLOPs (M) | Params (K)
---|---|---|---
MLP adapter | 1.181 | 2.362 | 2356.08 |
MultiConv adapter | 44.236 | 88.472 | 761.472 |
Our adapter | 39.223 | 78.446 | 741.504 |
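For intuition behind these numbers, a standard k × k convolution over C_in input and C_out output channels on an H × W grid costs H·W·k²·C_in·C_out multiply–accumulates, whereas the separable version pays H·W·k²·C_in for the depthwise stage plus H·W·C_in·C_out for the pointwise stage. The table's FLOPs being exactly twice the MACs reflects the usual 1 MAC = 2 FLOPs convention. A quick check with illustrative dimensions (not the paper's):

```python
def standard_conv_macs(h, w, k, c_in, c_out):
    """MACs of a standard k x k convolution on an h x w feature map."""
    return h * w * k * k * c_in * c_out

def dsconv_macs(h, w, k, c_in, c_out):
    """MACs of the separable version: depthwise k x k, then pointwise 1 x 1."""
    depthwise = h * w * k * k * c_in   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 channel mixing
    return depthwise + pointwise

# Illustrative: 16 x 16 grid, 3 x 3 kernel, 96 channels in and out.
print(standard_conv_macs(16, 16, 3, 96, 96) / 1e6)  # ~21.23 M MACs
print(dsconv_macs(16, 16, 3, 96, 96) / 1e6)         # ~2.58 M MACs
```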