Article

MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization

School of Automation Science and Engineering, South China University of Technology, Guangzhou 510641, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12642; https://doi.org/10.3390/app152312642
Submission received: 24 October 2025 / Revised: 22 November 2025 / Accepted: 23 November 2025 / Published: 28 November 2025

Abstract

Camera relocalization, a cornerstone capability of modern computer vision, accurately determines a camera’s position and orientation from images and is essential for applications in augmented reality, mixed reality, autonomous driving, delivery drones, and robotic navigation. Traditional deep learning-based methods regress camera pose from images within a single scene and therefore lack generalization and robustness in diverse environments. We propose MVL-Loc, a novel end-to-end multi-scene six-degrees-of-freedom camera relocalization framework. MVL-Loc leverages pretrained world knowledge from vision-language models and incorporates multimodal data to generalize across both indoor and outdoor settings. Furthermore, natural language is employed as a directive tool to guide the multi-scene learning process, facilitating semantic understanding of complex scenes and capturing spatial relationships among objects. Extensive experiments on the 7Scenes and Cambridge Landmarks datasets demonstrate MVL-Loc’s robustness and state-of-the-art performance in real-world multi-scene camera relocalization, with improved accuracy in both positional and orientational estimates.
Keywords: end-to-end camera relocalization; vision-language models; multi-scene generalization
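The positional and orientational accuracy figures reported on 7Scenes and Cambridge Landmarks conventionally refer to Euclidean translation error (metres) and the angular difference between predicted and ground-truth orientations (degrees). As a minimal illustration, not taken from the paper itself, the two metrics can be computed from a predicted pose as follows, assuming orientations are given as unit quaternions in (w, x, y, z) order:

```python
import math

def pose_errors(t_pred, q_pred, t_gt, q_gt):
    """Standard relocalization metrics: positional error in metres
    (Euclidean distance between camera centres) and orientational
    error in degrees (angle between unit quaternions)."""
    # Positional error: straight-line distance between camera centres.
    t_err = math.dist(t_pred, t_gt)

    # Normalize both quaternions before comparing them.
    def normalize(q):
        n = math.sqrt(sum(x * x for x in q))
        return [x / n for x in q]

    # q and -q encode the same rotation, so take the absolute dot product;
    # the rotation angle between the two orientations is 2 * acos(|q1 . q2|).
    dot = abs(sum(a * b for a, b in zip(normalize(q_pred), normalize(q_gt))))
    r_err = math.degrees(2.0 * math.acos(min(1.0, dot)))
    return t_err, r_err
```

For example, a prediction 1 m away from the ground truth and rotated 90° about the vertical axis yields errors of (1.0 m, 90.0°). Benchmark results on these datasets are typically reported as the median of these two errors over all test frames.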

Share and Cite

MDPI and ACS Style

Xiao, Z.; Yang, S.; Ji, S.; Yin, J.; Wen, Z.; Wei, W. MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization. Appl. Sci. 2025, 15, 12642. https://doi.org/10.3390/app152312642


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
