Monocular Camera Localization in Known Environments: An In-Depth Review
Abstract
1. Introduction
1.1. Overview
1.2. Basic Principles of Monocular Camera Localization
- Localization in unknown environments (online SLAM): The system builds a map of the scene in real time while simultaneously estimating the camera’s pose.
- Localization in known environments (the focus of this survey): A pre-existing map or reference dataset of the environment is already available. The map can take several forms:
- A database of reference images with known camera poses;
- A 3D point cloud or mesh with associated visual descriptors;
- A learned neural scene representation.
- (1) Feature extraction from the query image (using handcrafted keypoints and descriptors, learned local features, or dense features from CNNs/Transformers; in regression methods, this step is often implicit within the network).
- (2) Relating the query to the known environment (via explicit matching, like 2D-2D image-to-image or 2D-3D image-to-point-cloud correspondences, or implicitly through neural regression of poses or scene coordinates).
- (3) Pose computation either by solving a geometric problem (e.g., essential-matrix decomposition for 2D-2D; Perspective-n-Point solver for 2D-3D) or by direct neural regression (e.g., outputting the 6-DoF pose or per-pixel 3D coordinates followed by a lightweight solver).
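To make the matching step concrete, the sketch below implements the classic nearest-neighbor descriptor matching with Lowe's ratio test in NumPy. It is an illustrative, hypothetical example (the function name and descriptor shapes are assumptions, and descriptor extraction itself is omitted); real systems typically rely on optimized matchers and follow this with a geometric solver such as PnP.

```python
import numpy as np

def match_descriptors(query_desc, ref_desc, ratio=0.8):
    """Nearest-neighbor 2D-2D matching with Lowe's ratio test.

    query_desc: (N, D) array of query-image descriptors
    ref_desc:   (M, D) array of reference-image descriptors (M >= 2)
    Returns a list of (query_idx, ref_idx) index pairs that pass the test.
    """
    # Pairwise Euclidean distances between every query/reference descriptor
    d = np.linalg.norm(query_desc[:, None, :] - ref_desc[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(d):
        nn = np.argsort(row)[:2]             # two nearest reference descriptors
        if row[nn[0]] < ratio * row[nn[1]]:  # keep only unambiguous matches
            matches.append((i, int(nn[0])))
    return matches
```

The ratio test discards a match when the second-best candidate is nearly as close as the best one, which is where most ambiguous correspondences (e.g., from repetitive structures) are filtered out before pose computation.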
1.3. Existing Surveys and Limitations
1.4. Classification of Methods
1.5. Survey Structure
2. 2D-2D Feature Matching-Based Localization
2.1. Approximate 2D-2D Localization
2.1.1. Conventional Methods
2.1.2. Deep Learning-Based Methods
2.2. Precise 2D-2D Localization
3. 2D-3D Feature Matching-Based Localization
3.1. Creating a 3D Prior Map
- RGB-D sensors: Devices like Microsoft Kinect and Intel RealSense capture both color and depth (i.e., RGB-D) information simultaneously. These sensors can generate real-time 3D maps and are especially useful for indoor environments where lighting conditions are controlled.
- Innovative Point Cloud Generation: Advances in sensor technology have produced novel point cloud data types, including multisource fusion point clouds [53] and Interferometric Synthetic Aperture Radar (InSAR) point clouds [54], which have demonstrated distinct advantages in recent research [55,56,57].
3.2. Direct 2D-3D Localization Methods
3.2.1. SfM-Based Methods
3.2.2. 3D-Scanner-Based Methods
3.3. Hierarchical 2D-3D Localization Methods
4. Regression-Based Localization
4.1. Absolute Pose Regression
4.2. Scene Coordinate Regression
5. Datasets and Evaluation Metrics
5.1. Benchmark Datasets
5.1.1. Place Recognition Datasets
5.1.2. 6-DoF Pose Estimation Datasets
5.2. Evaluation Metrics
5.2.1. Place Recognition Metrics
5.2.2. 6-DoF Pose Estimation Metrics
6. Comparative Analysis
6.1. Comparing Place Recognition Methods on Common Benchmarks
6.2. Comparing 6-DoF Pose Estimation Methods on Common Benchmarks
6.2.1. Intra-Class Comparisons
6.2.2. Inter-Class Comparisons
7. Challenges and Future Work
7.1. Synthetic Data Generation
- Developing specialized GAN variants, such as geometry-aware GANs, that preserve 3D structural consistency during synthesis to better support tasks like scene coordinate regression (e.g., ensuring synthesized images maintain accurate depth cues for 2D-3D matching);
- Exploring hybrid synthesis methods that combine physics-based rendering (e.g., via tools like Blender or Unreal Engine) with GANs to simulate dynamic elements like weather or crowd movements, which are underrepresented in current datasets;
- Addressing scalability issues in generating diverse, large-scale synthetic environments by automating procedural generation pipelines tailored to urban or indoor scenes.
7.2. Generalizable Models
- Exploiting permutation equivariance in relative pose estimation, as in EquiPose [157], to handle unordered correspondences and improve zero-shot performance across varying scene structures;
- Incorporating hypernetworks for adaptive localization, such as in HyperPose [158], which uses meta-learning to dynamically adjust parameters for unseen environments and extends benchmarks like Cambridge Landmarks for broader evaluation;
- Scaling relative pose regression through large-scale training, as explored in Reloc3r [159], integrating foundation models pre-trained on massive datasets to enable fast, accurate localization with minimal adaptation, addressing domain shifts in real-time applications like autonomous navigation.
7.3. Multi-Sensor Fusion
- Data-Level Fusion: Combining raw sensor data, such as aligning LiDAR point clouds with camera images to create dense spatial representations, as seen in BEVFusion [160], which unifies Bird’s-Eye-View representations for multi-task perception.
- Feature-Level Fusion: Integrating feature descriptors from different modalities to improve matching accuracy, for instance, using cross-modal attention in transformer architectures to align tokens from camera and LiDAR embeddings.
- Decision-Level Fusion: Merging individual pose estimates from various sensors for improved robustness and consistency, often enhanced by probabilistic models or Kalman filters to handle uncertainties.
- Developing adaptive fusion mechanisms that dynamically weigh sensor contributions based on environmental conditions, extending methods like SAMFusion [161] to incorporate real-time uncertainty estimation via Bayesian transformers;
- Leveraging large vision–language models for semantic-aware fusion, such as integrating CLIP-like embeddings to enhance cross-modal alignment in unstructured environments, thereby improving generalization across diverse datasets like nuScenes or Waymo Open.
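The decision-level scheme described above can be illustrated by its simplest probabilistic form: inverse-variance (maximum-likelihood) weighting of independent per-sensor estimates, which is what a Kalman update reduces to for a static state. The function below is a hypothetical sketch under that assumption, not drawn from any cited method.

```python
import numpy as np

def fuse_estimates(means, variances):
    """Decision-level fusion of independent pose estimates.

    means:     list of (d,) arrays, one pose estimate per sensor
    variances: list of positive scalars, each sensor's error variance
    Returns the inverse-variance-weighted mean and the fused variance.
    """
    w = np.array([1.0 / v for v in variances])
    w /= w.sum()                                   # normalized fusion weights
    fused_mean = sum(wi * np.asarray(m) for wi, m in zip(w, means))
    fused_var = 1.0 / sum(1.0 / v for v in variances)  # always <= min(variances)
    return fused_mean, fused_var
```

A noisier sensor (larger variance) automatically contributes less, and the fused variance is smaller than that of any single sensor, which is the formal sense in which fusion improves robustness.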
7.4. Advanced Learning Paradigms
- Integrating state–space models like Mamba into vision pipelines, as in VimGeo [166], to achieve linear-time complexity for long-sequence processing in dynamic environments while maintaining accuracy;
- Developing hybrid architectures that combine transformers, GNNs, and SSMs for multi-modal fusion, addressing computational bottlenecks in real-time applications;
- Exploring diffusion-based paradigms for handling pose uncertainty, inspired by promptable 3D localization models [167], to improve robustness in noisy or incomplete data scenarios.
7.5. Lightweight Architectures
- Applying advanced model compression techniques, such as dynamic pruning and 4-bit quantization, to further reduce the footprint of transformer-based models while preserving pose estimation accuracy in dynamic scenes;
- Developing adaptive lightweight models that switch between low-power modes based on device constraints, incorporating federated learning for on-device personalization in diverse environments like urban navigation.
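As a minimal illustration of the quantization idea above, the sketch below performs post-training symmetric quantization of a weight tensor to a signed 4-bit integer grid. It is a hypothetical example for intuition only; practical low-bit schemes add per-channel scales, calibration, and often quantization-aware fine-tuning.

```python
import numpy as np

def quantize_symmetric(w, bits=4):
    """Map a float weight tensor onto a signed `bits`-bit integer grid.

    Returns the integer codes and the dequantized reconstruction.
    """
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit signed
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, q * scale                        # codes and float reconstruction
```

The reconstruction error is bounded by half a quantization step (scale / 2) per weight, which is the accuracy-versus-footprint trade-off that pruning and quantization schedules try to balance.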
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Dikmen, M.; Burns, C.M. Autonomous driving in the real world: Experiences with Tesla Autopilot and summon. In Proceedings of the 8th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, Ann Arbor, MI, USA, 24–26 October 2016; pp. 225–228. [Google Scholar]
- Liu, F.; Lu, Z.; Lin, X. Vision-based environmental perception for autonomous driving. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2025, 239, 39–69. [Google Scholar] [CrossRef]
- Dong, X.; Cappuccio, M.L. Applications of computer vision in autonomous vehicles: Methods, challenges and future directions. arXiv 2023, arXiv:2311.09093. [Google Scholar] [CrossRef]
- Zhang, K.Z. Applications and prospects of AI in autonomous cars-take Tesla as an example. In Proceedings of the 2nd International Conference on Mechatronic Automation and Electrical Engineering (ICMAEE 2024), Nanjing, China, 22–24 November 2024; pp. 355–360. [Google Scholar]
- Wu, Y.; Tang, F.; Li, H. Image-based camera localization: An overview. Vis. Comput. Ind. Biomed. Art 2018, 1, 8. [Google Scholar] [CrossRef]
- Piasco, N.; Sidibé, D.; Demonceaux, C.; Gouet-Brunet, V. A survey on visual-based localization: On the benefit of heterogeneous data. Pattern Recognit. 2018, 74, 90–109. [Google Scholar] [CrossRef]
- Xin, X.; Jiang, J.; Zou, Y. A review of visual-based localization. In Proceedings of the 2019 International Conference on Robotics, Intelligent Control and Artificial Intelligence, Shanghai, China, 20–22 September 2019; pp. 94–105. [Google Scholar]
- Humenberger, M.; Cabon, Y.; Pion, N.; Weinzaepfel, P.; Lee, D.; Guérin, N.; Sattler, T.; Csurka, G. Investigating the role of image retrieval for visual localization: An exhaustive benchmark. Int. J. Comput. Vis. 2022, 130, 1811–1836. [Google Scholar] [CrossRef]
- Sivic, J.; Zisserman, A. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the 9th IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; pp. 1470–1477. [Google Scholar]
- Nister, D.; Stewenius, H. Scalable recognition with a vocabulary tree. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 2161–2168. [Google Scholar]
- Philbin, J.; Chum, O.; Isard, M.; Sivic, J.; Zisserman, A. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 18–23 June 2007. [Google Scholar]
- Chum, O.; Philbin, J.; Sivic, J.; Isard, M.; Zisserman, A. Total recall: Automatic query expansion with a generative feature model for object retrieval. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–21 October 2007. [Google Scholar]
- Cummins, M.; Newman, P. FAB-MAP: Probabilistic localization and mapping in the space of appearance. Int. J. Robot. Res. 2008, 27, 647–665. [Google Scholar] [CrossRef]
- Perd’och, M.; Chum, O.; Matas, J. Efficient representation of local geometry for large scale object retrieval. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 9–16. [Google Scholar]
- Jégou, H.; Douze, M.; Schmid, C.; Pérez, P. Aggregating local descriptors into a compact image representation. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3304–3311. [Google Scholar]
- Chum, O.; Mikulik, A.; Perdoch, M.; Matas, J. Total recall II: Query expansion revisited. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; pp. 889–896. [Google Scholar]
- Arandjelović, R.; Zisserman, A. Three things everyone should know to improve object retrieval. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2911–2918. [Google Scholar]
- Torii, A.; Sivic, J.; Pajdla, T.; Okutomi, M. Visual place recognition with repetitive structures. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 2346–2359. [Google Scholar] [CrossRef]
- Kim, H.J.; Dunn, E.; Frahm, J.M. Predicting good features for image geo-localization using per-bundle vlad. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1170–1178. [Google Scholar]
- Torii, A.; Arandjelovic, R.; Sivic, J.; Okutomi, M.; Pajdla, T. 24/7 place recognition by view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1808–1817. [Google Scholar]
- Tolias, G.; Sicre, R.; Jégou, H. Particular object retrieval with integral max-pooling of CNN activations. arXiv 2015, arXiv:1511.05879. [Google Scholar]
- Babenko, A.; Lempitsky, V. Aggregating deep convolutional features for image retrieval. arXiv 2015, arXiv:1510.07493. [Google Scholar] [CrossRef]
- Gordo, A.; Almazán, J.; Revaud, J.; Larlus, D. Deep image retrieval: Learning global representations for image search. In Proceedings of the European Conference on Computer Vision 2016 (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 241–257. [Google Scholar]
- Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 21–30 June 2016; pp. 5297–5307. [Google Scholar]
- Radenović, F.; Tolias, G.; Chum, O. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 3–20. [Google Scholar]
- Kalantidis, Y.; Mellina, C.; Osindero, S. Cross-dimensional weighting for aggregated deep convolutional features. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 685–701. [Google Scholar]
- Noh, H.; Araujo, A.; Sim, J.; Weyand, T.; Han, B. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3456–3465. [Google Scholar]
- Gordo, A.; Almazan, J.; Revaud, J.; Larlus, D. End-to-end learning of deep visual representations for image retrieval. Int. J. Comput. Vis. 2017, 124, 237–254. [Google Scholar] [CrossRef]
- Radenović, F.; Tolias, G.; Chum, O. Fine-tuning CNN image retrieval with no human annotation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1655–1668. [Google Scholar] [CrossRef]
- Teichmann, M.; Araujo, A.; Zhu, M.; Sim, J. Detect-to-retrieve: Efficient regional aggregation for image search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5109–5118. [Google Scholar]
- An, G.; Huo, Y.; Yoon, S.E. Hypergraph propagation and community selection for objects retrieval. Adv. Neural Inf. Process. Syst. 2021, 34, 3596–3608. [Google Scholar]
- Tolias, G.; Jenicek, T.; Chum, O. Learning and aggregating deep local descriptors for instance-level recognition. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 460–477. [Google Scholar]
- Cao, B.; Araujo, A.; Sim, J. Unifying deep local and global features for image search. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 726–743. [Google Scholar]
- Hausler, S.; Garg, S.; Xu, M.; Milford, M.; Fischer, T. Patch-netvlad: Multi-scale fusion of locally-global descriptors for place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14141–14152. [Google Scholar]
- Yang, M.; He, D.; Fan, M.; Shi, B.; Xue, X.; Li, F.; Ding, E.; Huang, J. Dolg: Single-stage image retrieval with deep orthogonal fusion of local and global features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 11772–11781. [Google Scholar]
- Wu, H.; Wang, M.; Zhou, W.; Hu, Y.; Li, H. Learning token-based representation for image retrieval. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2703–2711. [Google Scholar] [CrossRef]
- Weinzaepfel, P.; Lucas, T.; Larlus, D.; Kalantidis, Y. Learning super-features for image retrieval. arXiv 2022, arXiv:2201.13182. [Google Scholar] [CrossRef]
- Shao, S.; Chen, K.; Karpur, A.; Cui, Q.; Araujo, A.; Cao, B. Global features are all you need for image retrieval and reranking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 11036–11046. [Google Scholar]
- Tan, F.; Yuan, J.; Ordonez, V. Instance-level image retrieval using reranking transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 12105–12115. [Google Scholar]
- Wei, T.; Lindenberger, P.; Matas, J.; Barath, D. Breaking the Frame: Visual Place Recognition by Overlap Prediction. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 2322–2331. [Google Scholar]
- Mohwald, A.; Jenicek, T.; Chum, O. Dark side augmentation: Generating diverse night examples for metric learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 11153–11163. [Google Scholar]
- Zhang, W.; Kosecka, J. Image based localization in urban environments. In Proceedings of the 3rd International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT’06), Chapel Hill, NC, USA, 14–16 June 2006; pp. 33–40. [Google Scholar]
- Melekhov, I.; Ylioinas, J.; Kannala, J.; Rahtu, E. Relative camera pose estimation using convolutional neural networks. In Proceedings of the International Conference on Advanced Concepts for Intelligent Vision Systems, Antwerp, Belgium, 18–21 September 2017; pp. 675–687. [Google Scholar]
- Laskar, Z.; Melekhov, I.; Kalia, S.; Kannala, J. Camera relocalization by computing pairwise relative poses using convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 929–938. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Balntas, V.; Li, S.; Prisacariu, V. Relocnet: Continuous metric learning relocalisation using neural nets. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 751–767. [Google Scholar]
- Saha, S.; Varma, G.; Jawahar, C.V. Improved visual relocalization by discovering anchor points. arXiv 2018, arXiv:1811.04370. [Google Scholar] [CrossRef]
- Ding, M.; Wang, Z.; Sun, J.; Shi, J.; Luo, P. CamNet: Coarse-to-fine retrieval for camera re-localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2871–2880. [Google Scholar]
- Li, H.; Zhao, J.; Bazin, J.C.; Chen, W.; Chen, K.; Liu, Y.H. Line-based absolute and relative camera pose estimation in structured environments. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 6914–6920. [Google Scholar]
- Zhou, Q.; Sattler, T.; Pollefeys, M.; Leal-Taixe, L. To learn or not to learn: Visual localization from essential matrices. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 3319–3326. [Google Scholar]
- Chen, K.; Snavely, N.; Makadia, A. Wide-baseline relative camera pose estimation with directional learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3258–3268. [Google Scholar]
- Ullman, S. The interpretation of structure from motion. Proc. R. Soc. London Ser. B Biol. Sci. 1979, 203, 405–426. [Google Scholar]
- Pan, Y.; Xia, Y.; Li, Y.; Yang, M.; Zhu, Q. Research on stability analysis of large karst cave structure based on multi-source point clouds modeling. Earth Sci. Inform. 2023, 16, 1637–1656. [Google Scholar] [CrossRef]
- Tong, X.; Zhang, X.; Liu, S.; Ye, Z.; Feng, Y.; Xie, H.; Chen, L.; Zhang, F.; Han, J.; Jin, Y.; et al. Automatic registration of very low overlapping array InSAR point clouds in urban scenes. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5224125. [Google Scholar] [CrossRef]
- Kabuli, L.A.; Foster, G. Elevation mapping with interferometric synthetic aperture radar for autonomous driving. In Proceedings of the 2024 IEEE Conference on Computational Imaging Using Synthetic Apertures (CISA), Boulder, CO, USA, 20–23 May 2024. [Google Scholar]
- da Silva Ruiz, P.R.; Almeida, C.M.; Schimalski, M.B.; Liesenberg, V.; Mitishita, E.A. Multi-approach integration of ALS and TLS point clouds for a 3-D building modeling at LoD3. Int. J. Archit. Comput. 2023, 21, 652–678. [Google Scholar] [CrossRef]
- Yang, Y.; Zhao, Z.; Zhou, D.; Lai, Z.; Chang, K.; Fu, T.; Niu, L. Identification and Analysis of the Geohazards Located in an Alpine Valley Based on Multi-Source Remote Sensing Data. Sensors 2024, 24, 4057. [Google Scholar] [CrossRef] [PubMed]
- Zhang, W.; Li, Y.; Li, P.; Feng, Z. A BIM and AR-based indoor navigation system for pedestrians on smartphones. KSCE J. Civ. Eng. 2025, 29, 100005. [Google Scholar] [CrossRef]
- Wong, M.O.; Lee, S. Indoor navigation and information sharing for collaborative fire emergency response with BIM and multi-user networking. Autom. Constr. 2023, 148, 104781. [Google Scholar] [CrossRef]
- Wehbi, R. Integration of BIM and Digital Technologies for Smart Indoor Hazards Management. Ph.D. Thesis, Université de Lille, Lille, France, 2021. [Google Scholar]
- Haralick, B.M.; Lee, C.N.; Ottenberg, K.; Nölle, M. Review and analysis of solutions of the three point perspective pose estimation problem. Int. J. Comput. Vis. 1994, 13, 331–356. [Google Scholar] [CrossRef]
- Bujnak, M.; Kukelova, Z.; Pajdla, T. A general solution to the P4P problem for camera with unknown focal length. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008. [Google Scholar]
- Bujnak, M.; Kukelova, Z.; Pajdla, T. New efficient solution to the absolute pose problem for camera with unknown focal length and radial distortion. In Proceedings of the Asian Conference on Computer Vision, Queenstown, New Zealand, 8–12 November 2010; pp. 11–24. [Google Scholar]
- Kukelova, Z.; Bujnak, M.; Pajdla, T. Real-time solution to the absolute pose problem with unknown radial distortion and focal length. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 2816–2823. [Google Scholar]
- Albl, C.; Kukelova, Z.; Pajdla, T. Rolling shutter absolute pose problem with known vertical direction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3355–3363. [Google Scholar]
- Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
- Chum, O.; Matas, J. Optimal randomized RANSAC. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1472–1482. [Google Scholar] [CrossRef]
- Lebeda, K.; Matas, J.; Chum, O. Fixing the locally optimized ransac–full experimental evaluation. In Proceedings of the British Machine Vision Conference, Surrey, UK, 3–7 September 2012. [Google Scholar]
- Sattler, T.; Sweeney, C.; Pollefeys, M. On sampling focal length values to solve the absolute pose problem. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 828–843. [Google Scholar]
- Barath, D.; Matas, J.; Noskova, J. MAGSAC: Marginalizing sample consensus. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10197–10205. [Google Scholar]
- Snavely, N.; Seitz, S.M.; Szeliski, R. Modeling the world from internet photo collections. Int. J. Comput. Vis. 2008, 80, 189–210. [Google Scholar] [CrossRef]
- Arth, C.; Wagner, D.; Klopschitz, M.; Irschara, A.; Schmalstieg, D. Wide area localization on mobile phones. In Proceedings of the 2009 8th IEEE International Symposium on Mixed and Augmented Reality, Orlando, FL, USA, 19–22 October 2009; pp. 73–82. [Google Scholar]
- Li, Y.; Snavely, N.; Huttenlocher, D.P. Location recognition using prioritized feature matching. In Proceedings of the European Conference on Computer Vision, Crete, Greece, 5–11 September 2010; pp. 791–804. [Google Scholar]
- Sattler, T.; Leibe, B.; Kobbelt, L. Fast image-based localization using direct 2D-to-3D matching. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 667–674. [Google Scholar]
- Sattler, T.; Leibe, B.; Kobbelt, L. Improving image-based localization by active correspondence search. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 752–765. [Google Scholar]
- Paudel, D.P.; Demonceaux, C.; Habed, A.; Vasseur, P. Localization of 2D cameras in a known environment using direct 2D-3D registration. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 196–201. [Google Scholar]
- Sattler, T.; Havlena, M.; Radenovic, F.; Schindler, K.; Pollefeys, M. Hyperpoints and fine vocabularies for large-scale location recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2102–2110. [Google Scholar]
- Feng, Y.; Fan, L.; Wu, Y. Fast localization in large-scale environments using supervised indexing of binary features. IEEE Trans. Image Process. 2015, 25, 343–358. [Google Scholar] [CrossRef] [PubMed]
- Sattler, T.; Leibe, B.; Kobbelt, L. Efficient & effective prioritized matching for large-scale image-based localization. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1744–1756. [Google Scholar] [PubMed]
- Liu, L.; Li, H.; Dai, Y. Efficient global 2D-3D matching for camera localization in a large-scale 3D map. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2372–2381. [Google Scholar]
- Song, Z.; Wang, C.; Liu, Y.; Shen, S. Recalling direct 2D-3D matches for large-scale visual localization. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 1191–1197. [Google Scholar]
- Nadeem, U.; Jalwana, M.A.; Bennamoun, M.; Togneri, R.; Sohel, F. Direct image to point cloud descriptors matching for 6-dof camera localization in dense 3D point clouds. In Proceedings of the International Conference on Neural Information Processing, Vancouver, BC, Canada, 8–14 December 2019; pp. 222–234. [Google Scholar]
- Nadeem, U.; Bennamoun, M.; Togneri, R.; Sohel, F. Unconstrained Matching of 2D and 3D Descriptors for 6-DOF Pose Estimation. arXiv 2020, arXiv:2005.14502. [Google Scholar] [CrossRef]
- Feng, M.; Hu, S.; Ang, M.H.; Lee, G.H. 2D3D-MatchNet: Learning to match keypoints across 2D image and 3D point cloud. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 4790–4796. [Google Scholar]
- Pham, Q.H.; Uy, M.A.; Hua, B.S.; Nguyen, D.T.; Roig, G.; Yeung, S.K. LCD: Learned cross-domain descriptors for 2D-3D matching. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11856–11864. [Google Scholar]
- Yu, H.; Ye, W.; Feng, Y.; Bao, H.; Zhang, G. Learning bipartite graph matching for robust visual localization. In Proceedings of the 2020 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Recife/Porto de Galinhas, Brazil, 9–13 November 2020; pp. 146–155. [Google Scholar]
- Yu, H.; Zhen, W.; Yang, W.; Zhang, J.; Scherer, S. Monocular camera localization in prior lidar maps with 2D-3D line correspondences. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 4588–4594. [Google Scholar]
- Sarlin, P.E.; Unagar, A.; Larsson, M.; Germain, H.; Toft, C.; Larsson, V.; Pollefeys, M.; Lepetit, V.; Hammarstrand, L.; Kahl, F.; et al. Back to the feature: Learning robust camera localization from pixels to pose. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3247–3257. [Google Scholar]
- Lai, B.; Liu, W.; Wang, C.; Fan, X.; Lin, Y.; Bian, X.; Wu, S.; Cheng, M.; Li, J. 2D3D-MVPNet: Learning cross-domain feature descriptors for 2D-3D matching based on multi-view projections of point clouds. Appl. Intell. 2022, 52, 14178–14193. [Google Scholar] [CrossRef]
- Kim, M.; Koo, J.; Kim, G. Ep2p-loc: End-to-end 3D point to 2D pixel localization for large-scale visual localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 21527–21537. [Google Scholar]
- Zhou, Q.; Agostinho, S.; Ošep, A.; Leal-Taixé, L. Is geometry enough for matching in visual localization? In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 407–425. [Google Scholar]
- Nguyen, S.T.; Fontan, A.; Milford, M.; Fischer, T. FUSELOC: Fusing Global and Local Descriptors to Disambiguate 2D-3D Matching in Visual Localization. arXiv 2024, arXiv:2408.12037. [Google Scholar] [CrossRef]
- Irschara, A.; Zach, C.; Frahm, J.M.; Bischof, H. From structure-from-motion point clouds to fast location recognition. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 2599–2606. [Google Scholar]
- Sattler, T.; Weyand, T.; Leibe, B.; Kobbelt, L. Image retrieval for image-based localization revisited. In Proceedings of the British Machine Vision Conference, Surrey, UK, 3–7 September 2012; p. 4. [Google Scholar]
- Cao, S.; Snavely, N. Minimal scene descriptions from structure from motion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 461–468. [Google Scholar]
- Sattler, T.; Torii, A.; Sivic, J.; Pollefeys, M.; Taira, H.; Okutomi, M.; Pajdla, T. Are large-scale 3D models really necessary for accurate visual localization? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1637–1646.
- Camposeco, F.; Cohen, A.; Pollefeys, M.; Sattler, T. Hybrid camera pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 136–144.
- Taira, H.; Okutomi, M.; Sattler, T.; Cimpoi, M.; Pollefeys, M.; Sivic, J.; Pajdla, T.; Torii, A. InLoc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7199–7209.
- Sarlin, P.E.; Debraine, F.; Dymczyk, M.; Siegwart, R.; Cadena, C. Leveraging deep visual descriptors for hierarchical efficient localization. In Proceedings of the Conference on Robot Learning, Zurich, Switzerland, 29–31 October 2018; pp. 456–465.
- Sarlin, P.E.; Cadena, C.; Siegwart, R.; Dymczyk, M. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12716–12725.
- Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-net: A trainable CNN for joint description and detection of local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8092–8101.
- Germain, H.; Bourmaud, G.; Lepetit, V. S2dnet: Learning accurate correspondences for sparse-to-dense feature matching. arXiv 2020, arXiv:2004.01673.
- Yang, T.Y.; Nguyen, D.K.; Heijnen, H.; Balntas, V. Ur2kid: Unifying retrieval, keypoint detection, and keypoint description without local correspondence supervision. arXiv 2020, arXiv:2001.07252.
- Shi, T.; Cui, H.; Song, Z.; Shen, S. Dense semantic 3D map based long-term visual localization with hybrid features. arXiv 2020, arXiv:2005.10766.
- Humenberger, M.; Cabon, Y.; Guerin, N.; Morat, J.; Leroy, V.; Revaud, J.; Rerole, P.; Pion, N.; De Souza, C.; Csurka, G. Robust image retrieval-based visual localization using kapture. arXiv 2020, arXiv:2007.13867.
- Shu, M.; Chen, G.; Zhang, Z. Efficient image-based indoor localization with MEMS aid on the mobile device. ISPRS J. Photogramm. Remote Sens. 2022, 185, 85–110.
- Yan, S.; Liu, Y.; Wang, L.; Shen, Z.; Peng, Z.; Liu, H.; Zhang, M.; Zhang, G.; Zhou, X. Long-term visual localization with mobile sensors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 July 2023; pp. 17245–17255.
- Kendall, A.; Grimes, M.; Cipolla, R. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2938–2946.
- Kendall, A.; Cipolla, R. Modelling uncertainty in deep learning for camera relocalization. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 4762–4769.
- Kendall, A.; Cipolla, R. Geometric loss functions for camera pose regression with deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5974–5983.
- Walch, F.; Hazirbas, C.; Leal-Taixe, L.; Sattler, T.; Hilsenbeck, S.; Cremers, D. Image-based localization using LSTMs for structured feature correlation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 627–637.
- Melekhov, I.; Ylioinas, J.; Kannala, J.; Rahtu, E. Image-based localization using hourglass networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 879–886.
- Wu, J.; Ma, L.; Hu, X. Delving deeper into convolutional neural networks for camera relocalization. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 5644–5651.
- Naseer, T.; Burgard, W. Deep regression for monocular camera-based 6-dof global localization in outdoor environments. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 1525–1530.
- Brahmbhatt, S.; Gu, J.; Kim, K.; Hays, J.; Kautz, J. Geometry-aware learning of maps for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2616–2625.
- Wang, B.; Chen, C.; Lu, C.X.; Zhao, P.; Trigoni, N.; Markham, A. AtLoc: Attention guided camera localization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 10393–10401.
- Cai, M.; Shen, C.; Reid, I. A Hybrid Probabilistic Model for Camera Relocalization. In Proceedings of the British Machine Vision Conference (BMVC), Newcastle, UK, 3–6 September 2018; p. 8.
- Chidlovskii, B.; Sadek, A. Adversarial transfer of pose estimation regression. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 646–661.
- Shavit, Y.; Ferens, R. Do we really need scene-specific pose encoders? In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 3186–3192.
- Blanton, H.; Greenwell, C.; Workman, S.; Jacobs, N. Extending absolute pose regression to multiple scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 38–39.
- Shavit, Y.; Ferens, R.; Keller, Y. Learning multi-scene absolute pose regression with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2733–2742.
- Shavit, Y.; Keller, Y. Camera pose auto-encoders for improving pose regression. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 140–157.
- Clark, R.; Wang, S.; Markham, A.; Trigoni, N.; Wen, H. VidLoc: A deep spatio-temporal model for 6-dof video-clip relocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6856–6864.
- Valada, A.; Radwan, N.; Burgard, W. Deep auxiliary learning for visual localization and odometry. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 6939–6946.
- Radwan, N.; Valada, A.; Burgard, W. Vlocnet++: Deep multitask learning for semantic visual localization and odometry. IEEE Robot. Autom. Lett. 2018, 3, 4407–4414.
- Bui, M.; Baur, C.; Navab, N.; Ilic, S.; Albarqouni, S. Adversarial networks for camera pose regression and refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019.
- Wang, S.; Kang, Q.; She, R.; Tay, W.P.; Hartmannsgruber, A.; Navarro, D.N. RobustLoc: Robust camera pose regression in challenging driving environments. In Proceedings of the AAAI Conference on Artificial Intelligence, Singapore, 20–27 January 2023; pp. 6209–6216.
- Xu, M.; Zhang, Z.; Gong, Y.; Poslad, S. Regression-based camera pose estimation through multi-level local features and global features. Sensors 2023, 23, 4063.
- Chen, S.; Li, X.; Wang, Z.; Prisacariu, V.A. Dfnet: Enhance absolute pose regression with direct feature matching. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 1–17.
- Chen, S.; Bhalgat, Y.; Li, X.; Bian, J.W.; Li, K.; Wang, Z.; Prisacariu, V.A. Neural refinement for absolute pose regression with feature synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 20987–20996.
- Shotton, J.; Glocker, B.; Zach, C.; Izadi, S.; Criminisi, A.; Fitzgibbon, A. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2930–2937.
- Guzman-Rivera, A.; Kohli, P.; Glocker, B.; Shotton, J.; Sharp, T.; Fitzgibbon, A.; Izadi, S. Multi-output learning for camera relocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1114–1121.
- Valentin, J.; Nießner, M.; Shotton, J.; Fitzgibbon, A.; Izadi, S.; Torr, P.H. Exploiting uncertainty in regression forests for accurate camera relocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4400–4408.
- Brachmann, E.; Krull, A.; Nowozin, S.; Shotton, J.; Michel, F.; Gumhold, S.; Rother, C. DSAC—Differentiable RANSAC for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6684–6692.
- Brachmann, E.; Rother, C. Learning less is more-6D camera localization via 3D surface regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4654–4662.
- Brachmann, E.; Rother, C. Expert sample consensus applied to camera re-localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7525–7534.
- Brachmann, E.; Rother, C. Visual camera re-localization from RGB and RGB-D images using DSAC. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5847–5865.
- Li, X.; Wang, S.; Zhao, Y.; Verbeek, J.; Kannala, J. Hierarchical scene coordinate classification and regression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11983–11992.
- Wang, S.; Laskar, Z.; Melekhov, I.; Li, X.; Zhao, Y.; Tolias, G.; Kannala, J. HSCNet++: Hierarchical scene coordinate classification and regression for visual localization with transformer. Int. J. Comput. Vis. 2024, 132, 2530–2550.
- Rekavandi, A.M.; Boussaid, F.; Seghouane, A.-K.; Bennamoun, M. B-Pose: Bayesian Deep Network for Accurate Camera 6-DoF Pose Estimation from RGB Images. IEEE Robot. Autom. Lett. 2023, 8, 6746–6754.
- Tang, S.; Tang, S.; Tagliasacchi, A.; Tan, P.; Furukawa, Y. Neumap: Neural coordinate mapping by auto-transdecoder for camera localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 July 2023; pp. 929–939.
- Chen, S.; Cavallari, T.; Prisacariu, V.A.; Brachmann, E. Map-relative pose regression for visual re-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 20665–20674.
- Revaud, J.; Cabon, Y.; Brégier, R.; Lee, J.; Weinzaepfel, P. Sacreg: Scene-agnostic coordinate regression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 688–698.
- Brachmann, E.; Cavallari, T.; Prisacariu, V.A. Accelerated coordinate encoding: Learning to relocalize in minutes using RGB and poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 July 2023; pp. 5044–5053.
- Lu, D.; Xiao, W.; Ran, T.; Yuan, L.; Lv, K.; Zhang, J. Attention-Based Accelerated Coordinate Encoding Network for Visual Relocalization. In Proceedings of the 2024 IEEE 7th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 20–22 September 2024; pp. 1675–1680.
- Philbin, J.; Chum, O.; Isard, M.; Sivic, J.; Zisserman, A. Lost in quantization: Improving particular object retrieval in large scale image databases. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008.
- Jegou, H.; Douze, M.; Schmid, C. Hamming embedding and weak geometric consistency for large scale image search. In Proceedings of the European Conference on Computer Vision, Marseille, France, 12–18 October 2008; pp. 304–317.
- Radenović, F.; Iscen, A.; Tolias, G.; Avrithis, Y.; Chum, O. Revisiting Oxford and Paris: Large-scale image retrieval benchmarking. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5706–5715.
- Weyand, T.; Araujo, A.; Cao, B.; Sim, J. Google Landmarks Dataset V2—A large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2575–2584.
- Badino, H.; Huber, D.; Kanade, T. Visual topometric localization. In Proceedings of the 2011 IEEE Intelligent Vehicles Symposium (IV), Baden-Baden, Germany, 5–9 June 2011; pp. 794–799.
- Maddern, W.; Pascoe, G.; Linegar, C.; Newman, P. 1 year, 1000 km: The Oxford RobotCar dataset. Int. J. Robot. Res. 2017, 36, 3–15.
- Huang, X.; Cheng, X.; Geng, Q.; Cao, B.; Zhou, D.; Wang, P.; Lin, Y.; Yang, R. The ApolloScape dataset for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 954–960.
- Sattler, T.; Maddern, W.; Toft, C.; Torii, A.; Hammarstrand, L.; Stenborg, E.; Safari, D.; Okutomi, M.; Pollefeys, M.; Sivic, J.; et al. Benchmarking 6DoF outdoor visual localization in changing conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8601–8610.
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
- Tzeng, E.; Hoffman, J.; Saenko, K.; Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7167–7176.
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Online, 6 November 2020; pp. 1597–1607.
- Liu, Y.; Dong, Q. EquiPose: Exploiting Permutation Equivariance for Relative Camera Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 1127–1137.
- Ferens, R.; Keller, Y. HyperPose: Hypernetwork-infused camera pose localization and an extended cambridge landmarks dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA, 11–15 June 2025; pp. 11547–11557.
- Dong, S.; Wang, S.; Liu, S.; Cai, L.; Fan, Q.; Kannala, J.; Yang, Y. Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA, 11–15 June 2025; pp. 16739–16752.
- Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.; Han, S. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. arXiv 2022, arXiv:2205.13542.
- Palladin, E.; Dietze, R.; Narayanan, P.; Bijelic, M.; Heide, F. SAMFusion: Sensor-adaptive multimodal fusion for 3D object detection in adverse weather. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 484–503.
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the 1st Conference on Language Modeling, Philadelphia, PA, USA, 11 April–10 May 2024.
- Dao, T.; Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv 2024, arXiv:2405.21060.
- Tang, Y.; Dong, P.; Tang, Z.; Chu, X.; Liang, J. VMRNN: Integrating Vision Mamba and LSTM for efficient and accurate spatiotemporal forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 5663–5673.
- Yan, W.; Yin, F.; Wang, J.; Leus, G.; Zoubir, A.M.; Tian, Y. Attentional Graph Neural Network Is All You Need for Robust Massive Network Localization. arXiv 2023, arXiv:2311.16856.
- Huang, J.; Wu, M.; Li, P.; Wu, W.; Yu, R. VimGeo: Efficient Cross-View Geo-Localization with Vision Mamba Architecture. In Proceedings of the 34th International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 16–22 August 2025; pp. 1188–1196.
- Hong, C.Y.; Wang, L.H.; Liu, T.L. Promptable 3-D Object Localization with Latent Diffusion Models. In Proceedings of the 39th Annual Conference on Neural Information Processing Systems, San Diego, CA, USA, 2–7 December 2025.
- Xu, Q.; Chen, Y.; Li, Y.; Liu, Z.; Lou, Z.; Zhang, Y.; Zheng, H.; He, X. MambaVesselNet++: A hybrid CNN-Mamba architecture for medical image segmentation. arXiv 2025, arXiv:2507.19931.
- Boukhari, D.E. Mamba-CNN: A Hybrid Architecture for Efficient and Accurate Facial Beauty Prediction. arXiv 2025, arXiv:2509.01431.
- Cao, A.; Li, Z.; Jomsky, J.; Laine, A.F.; Guo, J. MedSegMamba: 3D CNN-Mamba hybrid architecture for brain segmentation. arXiv 2024, arXiv:2409.08307.






| Dataset Name | Scale | Scene | Challenging Factors | Image Number |
|---|---|---|---|---|
| UKB [10] | Small | Indoor | Scale, illumination, background clutter | 10,200 |
| Oxford5K [11] | Medium | Urban landmarks | Scale, occlusion, repetitive structures | 5011 references, 55 queries |
| Oxford105K [11] | Large | Urban landmarks | Scale, occlusion, repetitive structures | 104,793 references, 55 queries |
| Paris6K [146] | Medium | Urban landmarks | Scale, occlusion, repetitive structures | 6245 references, 55 queries |
| Paris106K [146] | Large | Urban landmarks | Scale, occlusion, repetitive structures | 106,031 references, 55 queries |
| INRIA Holidays [147] | Medium | Diverse urban and suburban scenes | Illumination, viewpoint, scene diversity | 991 references, 500 queries |
| 24/7-Tokyo [20] | Large | Urban streets | Day–night, season, dynamic urban conditions | 75,984 references, 1125 queries, 597,744 synthetic views |
| Oxford [148] | Medium | Urban landmarks | Scale, occlusion, illumination | 4993 references, 70 queries |
| Paris [148] | Medium | Urban landmarks | Scale, occlusion, illumination | 6322 references, 70 queries |
| GLD [27] | Large | Global natural and human-made landmarks | Occlusion, clutter, scale, partial/multi-landmarks | 1,060,709 references, 111,036 queries |
| GLDv2 [149] | Large | Global natural and man-made landmarks | Occlusion, clutter, scale, partial/multi-landmarks | 4.1 M references, 118 K queries |
| Dataset Name | Scale | Scene | Capture | Challenging Factors | Image Number | 3D Map | Point Number | Other Sensor |
|---|---|---|---|---|---|---|---|---|
| Vienna [93] | Medium | Urban streets | Car trajectory | Illumination, weather, season | 1324 references, 266 queries | SfM | 1.12 M | None |
| Rome [73] | Large | Historic city | Free viewpoint | Occlusion, illumination, people | 15,179 references, 1000 queries | SfM | 4.31 M | None |
| CMU [150] | Medium | Urban, suburban, park | Car trajectory | Season, occlusion, illumination | 10,000–14,000 per traversal | SfM | - | GPS, IMU |
| Aachen [94] | Medium | Historic city | Free viewpoint | Day–night, weather, season | 3047 references, 369 queries | SfM | 1.54 M | None |
| 7-Scenes [131] | Small | Indoor | Free viewpoint | Illumination, motion blur, flat surfaces | 26,000 references, 17,000 queries | RGB-D camera | - | None |
| CLD [108] | Medium | Historic city | Free viewpoint | Illumination, weather, traffic, pedestrian | 6848 references, 4081 queries | SfM | 1.89 M | None |
| RobotCar [151] | Large | Urban streets | Car trajectory | Day–night, weather, season, traffic, pedestrian | 20 M | LiDAR | - | GPS, INS |
| University [44] | Medium | Indoor | Free viewpoint | Illumination, occlusion | 9694 references, 5068 queries | SfM | - | None |
| ApolloScape [152] | Large | Urban streets | Car trajectory | Illumination, weather, traffic, pedestrian | 14K | LiDAR | - | GPS, IMU |
| Aachen Day–Night [153] | Medium | Historic city | Free viewpoint | Day–night, weather, season | 4328 references, 922 queries | SfM | 1.65 M | None |
| RobotCar Seasons [153] | Large | Urban streets | Car trajectory | Day–night, weather, season, traffic, pedestrian | 20,862 references, 11,934 queries | SfM | 6.77 M | None |
| CMU Seasons [153] | Medium | Urban, suburban, park | Car trajectory | Weather, illumination, season, traffic, pedestrian | 7159 references, 75,335 queries | SfM | 1.61 M | None |
| Method | Oxford (Medium) | Oxford + R1M (Medium) | Oxford (Hard) | Oxford + R1M (Hard) |
|---|---|---|---|---|
| SuperGlobal [38] | 90.90 | 84.40 | 80.20 | 71.10 |
| Hypergraph Propagation + Community Selection [31] | 88.40 | 79.10 | 73.00 | 60.50 |
| Tokenizer [36] | 82.28 | 75.64 | 66.57 | 51.37 |
| FIRe [37] | 81.80 | 66.50 | 61.20 | 40.10 |
| DOLG [35] | 81.50 | 77.43 | 58.82 | 52.21 |
| RRT + αQE [39] | 80.40 | 71.70 | 64.00 | 50.90 |
| HOW [32] | 79.40 | 65.80 | 56.90 | 38.90 |
| DELG [33] | 78.50 | 62.70 | 59.30 | 39.30 |
| NetVLAD [24] | 73.91 | 60.51 | 56.45 | 37.92 |
| GEM + αQE [29] | 71.40 | 53.10 | 45.90 | 26.20 |
| DELF-ASMK + SP [27] | 67.80 | 53.80 | 43.10 | 31.20 |
| GEM [29] | 64.70 | 45.20 | 38.50 | 19.90 |
| R-MAC [21] | 60.90 | 39.30 | 32.40 | 12.50 |
| Method | Paris (Medium) | Paris + R1M (Medium) | Paris (Hard) | Paris + R1M (Hard) |
|---|---|---|---|---|
| SuperGlobal [38] | 93.30 | 84.90 | 86.70 | 71.40 |
| Hypergraph Propagation + Community Selection [31] | 92.60 | 86.60 | 83.30 | 72.70 |
| DOLG [35] | 89.81 | 80.79 | 77.70 | 62.83 |
| Tokenizer [36] | 89.34 | 79.76 | 78.56 | 61.56 |
| RRT + αQE [39] | 88.50 | 74.80 | 77.70 | 57.10 |
| NetVLAD [24] | 86.81 | 71.31 | 73.61 | 48.98 |
| FIRe [37] | 85.30 | 67.60 | 70.00 | 42.90 |
| GEM + αQE [29] | 84.00 | 60.30 | 67.30 | 32.30 |
| DELG [33] | 82.90 | 62.60 | 65.50 | 37.00 |
| HOW [32] | 81.60 | 61.80 | 62.40 | 33.70 |
| R-MAC [21] | 78.90 | 54.80 | 59.40 | 28.00 |
| GEM [29] | 77.20 | 52.30 | 56.30 | 24.70 |
| DELF-ASMK + SP [27] | 76.90 | 57.30 | 55.40 | 26.40 |
| Method | Oxford5K | Oxford105K | Paris6K | Paris106K |
|---|---|---|---|---|
| E2E-R-MAC + QE [28] | 90.60 | 89.40 | 96.00 | 93.20 |
| DIR + QE [23] | 87.10 | 85.20 | 95.30 | 91.80 |
| DIR [23] | 86.10 | 82.80 | 94.50 | 90.60 |
| MAC + QE [25] | 85.00 | 81.80 | 86.50 | 78.80 |
| DELF [27] | 83.80 | 82.60 | 85.00 | 81.70 |
| R-MAC + QE [21] | 82.90 | 77.90 | 85.60 | 78.30 |
| BoW (iSP + ctx QE) [16] | 82.70 | 76.70 | 80.50 | 71.00 |
| BoW (SPAUG + DQE + RootSIFT) [17] | 80.90 | 72.20 | - | - |
| MAC [25] | 79.70 | 73.90 | 82.40 | 74.60 |
| BoW (Elliptical regions + SP + QE) [14] | 78.40 | 72.80 | - | - |
| R-MAC [21] | 77.00 | 69.20 | 83.80 | 76.40 |
| CroW + QE [26] | 74.90 | 70.60 | 84.80 | 79.40 |
| CroW [26] | 70.80 | 65.30 | 79.70 | 72.20 |
| BoW (tf-idf + SP) [11] | 67.20 | 58.10 | - | - |
| SPoC [22] | 53.10 | - | 50.10 | - |
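The retrieval scores in the tables above are commonly reported as mean average precision (mAP, %): the average precision of each query's ranked database list, averaged over all queries. As a minimal sketch of that metric (not the benchmarks' official evaluation code, which additionally handles junk and distractor images), mAP can be computed from binary relevance labels:

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance is a 0/1 sequence over the ranked database."""
    rel = np.asarray(ranked_relevance, dtype=float)
    n_rel = rel.sum()
    if n_rel == 0:
        return 0.0  # query with no relevant images contributes zero
    # precision at each rank, accumulated only where a relevant image appears
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / n_rel)

def mean_average_precision(all_rankings):
    """mAP over a list of per-query relevance rankings."""
    return float(np.mean([average_precision(r) for r in all_rankings]))
```

For example, a query whose relevant images sit at ranks 1 and 3 gets AP = (1/1 + 2/3)/2 ≈ 0.833; the tables report this quantity averaged over all queries, scaled to a percentage.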
| Method Type | Method | TE (cm) | RE (°) |
|---|---|---|---|
| Approximate 2D-2D | DenseVLAD [20] | 26.14 | 13.11 |
| Precise 2D-2D | CamNet [48] | 3.86 | 1.69 |
| | AnchorNet [47] | 9.00 | 6.74 |
| | EssNet [50] | 19.00 | 4.28 |
| | RelocNet [46] | 21.00 | 6.73 |
| | Relative PN [44] | 21.00 | 9.28 |
| Direct 2D-3D | PixLoc [88] | 2.86 | 0.98 |
| | Active Search [79] | 3.71 | 1.18 |
| Hierarchical 2D-3D | HF-Net [100] | 3.14 | 1.09 |
| | InLoc [98] | 4.14 | 1.38 |
| APR | VLocNet++ [125] | 2.15 | 1.39 |
| | DFNet+ [130] | 2.42 | 0.79 |
| | VLocNet [124] | 4.80 | 3.80 |
| | DFNet [129] | 6.42 | 1.93 |
| | MS-Transformer+ [122] | 15.00 | 7.28 |
| | MS-Transformer [121] | 18.00 | 7.28 |
| | MLF [128] | 18.42 | 7.44 |
| | AdPR [126] | 19.42 | 7.47 |
| | AtLoc [116] | 19.71 | 7.56 |
| | MapNet [115] | 20.71 | 7.79 |
| | GeoPoseNet [110] | 22.86 | 8.12 |
| | HourGlass-Pose [112] | 23.29 | 9.53 |
| | VidLoc [123] | 25.00 | - |
| | BranchNet [113] | 29.00 | 8.30 |
| | LSTM PoseNet [111] | 31.29 | 9.85 |
| | PoseNet [108] | 44.14 | 10.40 |
| | Bayesian PoseNet [109] | 46.57 | 9.81 |
| SCR | ACE++ [145] | 0.30 | 1.00 |
| | ACE [144] | 0.33 | 1.08 |
| | HSCNet++ [139] | 2.29 | 0.81 |
| | HSCNet [138] | 2.71 | 0.90 |
| | DSAC* [137] | 2.71 | 1.36 |
| | NeuMap [141] | 3.14 | 1.09 |
| | MaRepo [142] | 3.18 | 1.54 |
| | DSAC++ [135] | 3.57 | 1.10 |
| | SACReg [143] | 3.71 | 1.22 |
| | DSAC [134] | 20.00 | 6.30 |
| Method Type | Method | TE (cm) | RE (°) |
|---|---|---|---|
| Approximate 2D-2D | DenseVLAD [20] | 255.75 | 7.10 |
| Precise 2D-2D | EssNet [50] | 83.00 | 1.36 |
| | AnchorNet [47] | 84.00 | 2.10 |
| Direct 2D-3D | BGNet [86] | 5.93 | 0.13 |
| | FuseLoc [92] | 10.00 | 0.20 |
| | UM [83] | 11.23 | 0.40 |
| | Active Search [79] | 13.80 | 0.23 |
| | PixLoc [88] | 15.00 | 0.25 |
| Hierarchical 2D-3D | HF-Net [100] | 10.80 | 0.20 |
| APR | DFNet+ [130] | 35.25 | 0.77 |
| | VLocNet [124] | 78.40 | 2.82 |
| | MS-Transformer+ [122] | 96.00 | 2.73 |
| | DFNet [129] | 119.25 | 2.90 |
| | MS-Transformer [121] | 128.00 | 2.73 |
| | LSTM PoseNet [111] | 130.00 | 5.52 |
| | SVS-Pose [114] | 132.50 | 5.17 |
| | GeoPoseNet [110] | 163.25 | 2.86 |
| | Bayesian PoseNet [109] | 192.00 | 6.28 |
| | PoseNet [108] | 208.50 | 6.83 |
| SCR | SACReg [143] | 8.75 | 0.23 |
| | ACE [144] | 10.25 | 0.30 |
| | ACE++ [145] | 11.25 | 0.28 |
| | HSCNet [138] | 13.00 | 0.30 |
| | HSCNet++ [139] | 13.50 | 0.29 |
| | DSAC* [137] | 13.50 | 0.35 |
| | NeuMap [141] | 14.00 | 0.33 |
| | DSAC++ [135] | 14.25 | 0.33 |
| | DSAC [134] | 31.75 | 0.78 |
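The TE/RE columns in the two error tables above are median translation error (cm) and median rotation error (degrees) between estimated and ground-truth camera poses. As a rough sketch (function names are illustrative, and poses are assumed to be given as rotation matrices with translations in meters), these per-query errors are typically computed as follows:

```python
import numpy as np

def translation_error_cm(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth positions, in cm."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)) * 100.0)

def rotation_error_deg(R_est, R_gt):
    """Angle of the relative rotation between estimated and ground-truth orientations."""
    R_rel = np.asarray(R_est).T @ np.asarray(R_gt)
    # clamp for numerical safety before arccos
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

def median_errors(poses_est, poses_gt):
    """Median TE (cm) and RE (deg) over a query set; each pose is an (R, t) pair."""
    te = [translation_error_cm(t_e, t_g) for (_, t_e), (_, t_g) in zip(poses_est, poses_gt)]
    re = [rotation_error_deg(R_e, R_g) for (R_e, _), (R_g, _) in zip(poses_est, poses_gt)]
    return float(np.median(te)), float(np.median(re))
```

The median (rather than the mean) is the conventional summary statistic here because it is robust to the occasional catastrophic localization failure.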
| Method Type | Method | Day (High) | Day (Mid) | Day (Low) | Night (High) | Night (Mid) | Night (Low) |
|---|---|---|---|---|---|---|---|
| Approximate 2D-2D | FAB-MAP [13] | 0.0 | 0.0 | 4.6 | 0.0 | 0.0 | 0.0 |
| | NetVLAD [24] | 0.0 | 0.2 | 18.9 | 0.0 | 0.0 | 14.3 |
| | DenseVLAD [20] | 0.0 | 0.1 | 22.8 | 0.0 | 1.0 | 19.4 |
| Direct 2D-3D | Active Search [79] | 85.3 | 92.2 | 97.9 | 39.8 | 49.0 | 64.3 |
| | BGNet [86] | 84.5 | 92.4 | 96.2 | 46.9 | 63.3 | 84.7 |
| | PixLoc [88] | 84.7 | 94.2 | 98.8 | 81.6 | 93.9 | 100.0 |
| Hierarchical 2D-3D | UR2KiD [103] | 79.9 | 88.6 | 93.6 | 45.9 | 64.3 | 83.7 |
| | S2DNet [102] | 84.5 | 90.3 | 95.3 | 74.5 | 82.7 | 94.9 |
| | D2-Net [101] | 84.3 | 91.9 | 96.2 | 75.5 | 87.8 | 95.9 |
| | DSLoc [104] | 89.3 | 95.4 | 97.6 | 44.9 | 67.3 | 87.8 |
| | HF-Net [100] | 89.6 | 95.4 | 98.8 | 86.7 | 93.9 | 100.0 |
| SCR | DSAC [134] | 0.4 | 2.4 | 34.0 | - | - | - |
| | ESAC [136] | 42.6 | 59.6 | 75.5 | 6.1 | 10.2 | 18.4 |
| | HSCNet [138] | 65.5 | 77.3 | 88.8 | 22.4 | 38.8 | 54.1 |
| | NeuMap [141] | 76.2 | 88.5 | 95.5 | 37.8 | 62.2 | 87.8 |
| | SACReg [143] | 85.8 | 95.0 | 99.6 | 67.5 | 90.6 | 100.0 |
| Method Type | Method | Day (High) | Day (Mid) | Day (Low) | Night (High) | Night (Mid) | Night (Low) |
|---|---|---|---|---|---|---|---|
| Approximate 2D-2D | NetVLAD [24] | 6.4 | 26.3 | 90.9 | 0.3 | 2.3 | 15.9 |
| | DenseVLAD [20] | 7.6 | 31.2 | 91.2 | 1.0 | 4.4 | 22.7 |
| Direct 2D-3D | Active Search [79] | 50.9 | 80.2 | 96.6 | 6.9 | 15.6 | 31.7 |
| | BGNet [86] | 55.7 | 77.0 | 94.1 | 8.4 | 21.7 | 50.9 |
| | PixLoc [88] | 56.9 | 82.0 | 98.1 | 34.9 | 67.7 | 89.5 |
| Hierarchical 2D-3D | S2DNet [102] | 53.9 | 80.6 | 95.8 | 14.5 | 40.2 | 69.7 |
| | D2-Net [101] | 54.5 | 80.0 | 95.3 | 20.4 | 40.1 | 55.0 |
| | HF-Net [100] | 56.9 | 81.7 | 98.1 | 33.3 | 65.9 | 88.8 |
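The day/night tables above report the percentage of queries localized within three accuracy bands; the benchmark protocol commonly uses thresholds of (0.25 m, 2°) for High, (0.5 m, 5°) for Mid, and (5 m, 10°) for Low, where a query counts as localized only if both its translation and rotation errors fall under the band's limits. A small sketch of that recall computation (threshold values assumed from the common protocol):

```python
import numpy as np

# (max translation error in meters, max rotation error in degrees) per accuracy band,
# assumed from the commonly used benchmark protocol
THRESHOLDS = {"High": (0.25, 2.0), "Mid": (0.5, 5.0), "Low": (5.0, 10.0)}

def recall_at_thresholds(t_err_m, r_err_deg):
    """Percentage of queries whose translation AND rotation errors meet each band."""
    t = np.asarray(t_err_m, dtype=float)
    r = np.asarray(r_err_deg, dtype=float)
    return {name: float(100.0 * np.mean((t <= t_max) & (r <= r_max)))
            for name, (t_max, r_max) in THRESHOLDS.items()}
```

Note that the bands are cumulative: any query counted under High is also counted under Mid and Low, which is why the Low column always dominates the other two in the tables.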
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yan, H.; Lau, A.; Fan, H. Monocular Camera Localization in Known Environments: An In-Depth Review. Appl. Sci. 2026, 16, 2332. https://doi.org/10.3390/app16052332

