Article

MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization

School of Automation Science and Engineering, South China University of Technology, Guangzhou 510641, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12642; https://doi.org/10.3390/app152312642
Submission received: 24 October 2025 / Revised: 22 November 2025 / Accepted: 23 November 2025 / Published: 28 November 2025

Abstract

Camera relocalization, a cornerstone capability of modern computer vision, accurately determines a camera’s position and orientation from images and is essential for applications in augmented reality, mixed reality, autonomous driving, delivery drones, and robotic navigation. Traditional deep learning-based methods regress camera pose from images of a single scene and therefore lack generalization and robustness in diverse environments. We propose MVL-Loc, a novel end-to-end multi-scene six degrees of freedom camera relocalization framework. MVL-Loc leverages pretrained world knowledge from vision-language models and incorporates multimodal data to generalize across both indoor and outdoor settings. Furthermore, natural language is employed as a directive tool to guide the multi-scene learning process, facilitating semantic understanding of complex scenes and capturing spatial relationships among objects. Extensive experiments on the 7Scenes and Cambridge Landmarks datasets demonstrate MVL-Loc’s robustness and state-of-the-art performance in real-world multi-scene camera relocalization, with improved accuracy in both positional and orientational estimates.

1. Introduction

Camera relocalization, which estimates a camera’s six degrees of freedom (6-DoF) pose from visual inputs, is a fundamental problem in intelligent systems, enabling applications from autonomous navigation to augmented reality (AR). Traditional approaches solve this task by matching visual features to pre-constructed maps and estimating poses via algorithms such as Perspective-n-Point (PnP) and Kabsch, combined with Random Sample Consensus (RANSAC) [1], to establish 2D-3D correspondences. Although effective, these methods are computationally intensive and storage-heavy, as they rely heavily on the quantity and quality of reference data. Deep learning-based camera relocalization redefines the paradigm by enabling end-to-end pose regression directly from raw pixels, bypassing the need for handcrafted features. PoseNet [2] pioneered this approach, estimating 6-DoF poses from single images using GoogLeNet. Subsequent improvements include Long Short-Term Memory (LSTM) modules, ResNet34 skip connections, uncertainty modeling, and geometric constraints [3]. Video-based methods such as VidLoc [4] leverage temporal constraints, while AtLoc [5] enhances pose consistency with attention mechanisms. EffLoc [6] optimizes efficiency using a Vision Transformer with memory-efficient self-attention. Despite these advances, deep learning methods remain inherently limited in cross-scene generalization, struggling to adapt to unseen layouts due to scene-specific training dependency.
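As a concrete point of reference for the classical pipeline described above, the following minimal sketch recovers a pose from hypothetical 2D-3D correspondences with OpenCV’s solvePnPRansac; the correspondences and intrinsics are random placeholders and the snippet is illustrative, not tied to any specific method discussed in this paper.

```python
import cv2
import numpy as np

# Hypothetical 2D-3D correspondences and pinhole intrinsics (placeholders).
pts_3d = np.random.rand(100, 3).astype(np.float32)            # 3D map points
pts_2d = (np.random.rand(100, 2) * 640).astype(np.float32)    # matched 2D keypoints
K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]], dtype=np.float32)

# PnP inside a RANSAC loop rejects outlier correspondences.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts_3d, pts_2d, K, distCoeffs=None,
    iterationsCount=1000, reprojectionError=3.0)

if ok:
    R, _ = cv2.Rodrigues(rvec)               # axis-angle -> rotation matrix
    camera_center = (-R.T @ tvec).ravel()    # camera position in the world frame
```

The reliance of such pipelines on a pre-built 3D map and descriptor matching is exactly the storage and computation overhead that end-to-end pose regression avoids.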
To transcend scene-specific constraints, recent advances in embodied Artificial Intelligence (AI) have integrated vision-language models (VLMs) to augment transferability with cross-modal semantic reasoning. Enhanced with pre-trained web knowledge, these models have demonstrated significant progress. NavGPT [7] utilizes VLMs to interpret and execute complex navigational commands in natural language, integrating these instructions with real-time visual inputs to navigate diverse environments effectively. Similarly, RT-2 [8] merges visual perception with natural language processing, enabling service robots to respond with heightened context awareness. Nevertheless, no prior work has systematically leveraged VLMs for joint semantic-geometric optimization in camera relocalization.
In this work, we propose MVL-Loc, a novel Multi-scene Visual Language Localization framework, which incorporates a Vision-Language Model trained across multiple scenes for 6-DoF camera relocalization. As illustrated in Figure 1, we assign each scene unique, non-template-based, language-guided instructions to enhance the fusion of multi-scene settings. These specific contextual cues enable the model to differentiate between scenes and establish robust spatial and semantic relationships among objects. Our model surpasses existing multi-scene approaches, demonstrating superior accuracy in image-based camera relocalization across indoor and outdoor benchmarks. Specifically, MVL-Loc surpasses Multi-Scene PoseNet (MSPN) [9] on the 7-Scenes dataset, reducing the average position error by 23.8% and rotation error by 19.2%. On the Cambridge Landmarks dataset, MVL-Loc improves position error by 6% and rotation error by 7.3% compared to C2f-MS-Transformer [10], a recent transformer-based multi-scene approach. In summary, our main contributions are as follows:
  • We propose MVL-Loc, a novel multi-scene camera relocalization framework that harnesses pretrained world knowledge from vision-language models (VLMs), effectively generalizing to both indoor and outdoor environments.
  • We leverage natural language as a guiding tool for the multi-scene learning process of VLMs, enabling a deep semantic understanding of complex scenes and capturing the spatial relationships between objects and scenes.
  • Through extensive experiments on common benchmarks, i.e., 7-Scenes and Cambridge Landmarks datasets, we demonstrate that our MVL-Loc framework achieves state-of-the-art performance in the task of end-to-end multi-scene camera relocalization.

2. Related Work

2.1. Deep Learning for Camera Relocalization

Camera relocalization aims to recover the 6-DoF pose of a camera given an input image. Traditional approaches rely on geometric pipelines involving feature matching, structure-from-motion, and PnP with RANSAC [11]. However, these methods are sensitive to viewpoint changes and require extensive 3D maps. Recent advances in deep learning have enabled end-to-end pose regression directly from images. PoseNet [2] initiates this line by predicting camera pose without requiring explicit 2D-3D correspondences. Subsequent methods integrate temporal cues via RNNs or LSTMs, as in VidLoc, to improve stability in video streams. PoseNetV2 introduces geometric reprojection loss for spatial consistency, while DSAC and HybridPose revisit differentiable RANSAC for learning-based correspondence selection. Map-free employs depth-guided relative pose loss to remove map dependencies. AtLoc [5] and EffLoc [6] further enhance localization accuracy via attention mechanisms and efficient vision transformers, respectively. Generalization across diverse environments remains challenging. Multi-Scene PoseNet [9] attempts to train shared pose regressors, while transformer-based models like Ms-Transformer [12] and its coarse-to-fine variant C2f-MS-Transformer [10] offer better scalability. MapNet [13] fuses visual odometry and GPS, and FusionLoc [14] incorporates LiDAR to augment visual features.

2.2. Vision-Language Models for Geospatial Reasoning

The emergence of large vision-language models (LVLMs) has reshaped cross-modal learning. CLIP [15] demonstrates strong generalization by aligning vision and language in a contrastive manner, inspiring downstream tasks such as image editing [16], scene graph grounding, and 3D scene understanding [17]. Particularly relevant is [18], which utilizes CLIP to associate object-level semantics with visual correspondences for localization. However, its dependence on PROSAC-based ranking limits adaptability to dynamic scenes. In contrast, our method eliminates explicit matching through end-to-end learning, guided by language embeddings that encode spatial priors and object semantics. Recent studies extend LVLMs to geospatial domains. GeoLLM [19] leverages LLMs to infer spatial distributions, and GeoReasoner [20] performs coarse-grained city-level localization from street-view imagery. Spatial-LLaVA [21] enhances geospatial alignment by enabling detailed spatial inference beyond coarse scene semantics, while UniSceneVL [22] unifies vision-language understanding with 3D spatial correspondence learning. Yet these methods operate at low spatial resolution and fail in precise tasks such as camera pose estimation. Our approach draws insights from semantic modeling in generative tasks like progressive pose-conditioned generation [23] and rich-contextual diffusion frameworks for spatial consistency [24]. We adapt these ideas to pose estimation, enabling fine-grained semantic-to-spatial mapping through pre-trained vision-language priors.

3. Multi-Scene Visual Language Localization

Language-guided visual localization remains underexplored for multi-scene camera relocalization. We present MVL-Loc, a framework that regresses 6-DoF camera poses from single images across multiple scenes using open-set natural language descriptions as contextual guidance. By jointly processing visual inputs and language embeddings, the model achieves cross-scene generalization without requiring scene-specific fine-tuning.

3.1. MVL-Loc Framework

The pipeline of our framework is shown in Figure 2; it consists of language-guided text-to-scene correspondence generation and multi-scene camera pose estimation. In text-to-scene correspondence generation, vision-language integration is based on the similarity between the description embeddings and the visual feature embeddings. The localized region features $\mathring{V} \in \mathbb{R}^{C \times H \times W}$ are extracted by a 2D convolutional encoder from the input image $I$, where $C$, $H$, and $W$ denote the channel dimension, height, and width of the resulting feature map used to construct spatial hierarchies:
$\mathring{V} = f_{\mathrm{conv}}(I) + \gamma \cdot P(I),$
where $\gamma$ is the negative square root of the width $W$, which ensures that the positional embedding $P \in \mathbb{R}^{C \times H \times W}$ does not overshadow the visual input. The description text $T$ can vary in length across scenes, but it describes a subset of the objects related to each scene:
$\mathring{L} = f_{\mathrm{token}}(T) + \gamma \cdot P(T).$
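A minimal PyTorch sketch of Equations (1) and (2) follows, under stated assumptions: the conv encoder, the embedding sizes, and the reading of the scale factor as the inverse square root of the width are illustrative choices of ours, not the authors’ released implementation.

```python
import torch
import torch.nn as nn

C, H, W = 256, 14, 14

# Visual branch: a small conv encoder stands in for f_conv; P(I) is a learned
# positional embedding scaled by gamma (Eq. (1)).
conv_encoder = nn.Sequential(
    nn.Conv2d(3, C, kernel_size=7, stride=16, padding=3),
    nn.ReLU(),
)
pos_embed_img = nn.Parameter(torch.randn(1, C, H, W))
# Assumed reading of "negative square root of the width W": gamma = W**-0.5,
# a small scale so the positional term does not overshadow the visual input.
gamma = W ** -0.5

image = torch.randn(1, 3, 224, 224)
V0 = conv_encoder(image) + gamma * pos_embed_img       # V-ring, shape (1, C, H, W)

# Language branch: token embeddings plus a scaled positional embedding (Eq. (2)).
vocab_size, max_len = 49408, 77                        # CLIP-like sizes (assumed)
token_embed = nn.Embedding(vocab_size, C)              # stands in for f_token
pos_embed_txt = nn.Parameter(torch.randn(1, max_len, C))

tokens = torch.randint(0, vocab_size, (1, max_len))
L0 = token_embed(tokens) + gamma * pos_embed_txt       # L-ring, shape (1, max_len, C)
```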
For the Transformer block, we employed $N$ identical standard decoder layers to classify the $K$ scenes and experimented with $N = 2, 4, 8$, and $16$. The model failed to converge when $N = 2$, while $N = 8$ and $N = 16$ significantly increased the computational burden due to the quadratic cost of self-attention. In contrast, $N = 4$ achieves stable convergence with a substantially lower memory footprint and training time, while providing nearly the same accuracy as deeper configurations. Each layer comprises self-attention (SA) and multi-head attention (MHA). For each Transformer layer, the model jointly updates the visual and language features, represented as $(V, L)$, where $V$ and $L$ denote the feature tensors corresponding to the visual and language modalities, through the following SA and MHA operations:
$(V, L)_n = \mathrm{SA}(\mathrm{LN}((V, L)_{n-1})) + (V, L)_{n-1} \in \mathbb{R}^{C \times H \times W},$
$(V, L)_n = \mathrm{MHA}(\mathrm{LN}((V, L)_n)) + (V, L)_n \in \mathbb{R}^{C \times H \times W}.$
At the final layer N, the output is processed through a feedforward module with the Gaussian Error Linear Unit (GELU) activation function, a smooth nonlinearity that improves gradient flow and convergence stability:
$(V, L)_N = \mathrm{FF}(\mathrm{LN}((V, L)_n)) + (V, L)_n \in \mathbb{R}^{C \times H \times W}.$
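The decoder layer of Equations (3)-(5) could be sketched as below; the hidden dimension, head count, and the placement of the GELU feed-forward only at the final layer are assumptions made for illustration rather than the exact architecture.

```python
import torch
import torch.nn as nn

class JointDecoderLayer(nn.Module):
    """One pre-LN layer over the joint visual + language tokens (Eqs. (3)-(5))."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.sa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln3 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, vl, is_last=False):
        # vl: concatenated visual + language tokens, shape (B, N_v + N_l, dim).
        x = vl + self.sa(self.ln1(vl), self.ln1(vl), self.ln1(vl))[0]   # Eq. (3)
        x = x + self.mha(self.ln2(x), self.ln2(x), self.ln2(x))[0]      # Eq. (4)
        if is_last:
            x = x + self.ff(self.ln3(x))                                # Eq. (5)
        return x

layers = nn.ModuleList([JointDecoderLayer() for _ in range(4)])  # N = 4

# Toy forward pass: 196 flattened visual tokens + 77 language tokens.
vl = torch.cat([torch.randn(1, 196, 256), torch.randn(1, 77, 256)], dim=1)
for i, layer in enumerate(layers):
    vl = layer(vl, is_last=(i == len(layers) - 1))
```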

3.2. Language-Guided Camera Relocalization

We propose a cross-modal fusion encoder (Enc) to integrate natural language descriptions with visual features for enhanced multi-scene camera pose estimation. These descriptions, which emphasize spatial relationships (e.g., “in front of,” “surrounding”), are fused with visual features via dot product:
$(V, L) = \mathrm{Enc}[\mathring{V} \cdot \mathring{L}^{\top}],$
where $(\mathring{V}, \mathring{L})$ and $(V, L)$ represent the vision and language features before and after fusion, respectively. Furthermore, we transpose the language embedding matrix so that it aligns with the visual features, a requirement dictated by the scaled dot-product attention formulation. In cross-attention, visual features act as queries $Q$ and language embeddings as keys $K$. Their interaction is computed through the affinity matrix $QK^{\top}$, where the transpose ensures that each visual token forms valid inner-product similarities with all language tokens, yielding an $N_v \times N_l$ attention map. This operation is fundamental to the attention mechanism rather than an architectural choice, and it enables the model to construct dense cross-modal correspondences between spatial visual patterns and semantic language cues.
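A simplified stand-in for the dot-product fusion of Equation (6) is sketched below: visual tokens act as queries, language tokens as keys, and a softmax-normalized affinity map re-weights the language features. The softmax normalization and residual addition are our own simplifications and are not necessarily the exact Enc module.

```python
import torch
import torch.nn.functional as F

B, C, H, W = 1, 256, 14, 14
N_l = 77
V0 = torch.randn(B, C, H, W)                    # V-ring from the visual encoder
L0 = torch.randn(B, N_l, C)                     # L-ring from the text encoder

V_tok = V0.flatten(2).transpose(1, 2)           # (B, N_v, C) with N_v = H * W
affinity = V_tok @ L0.transpose(1, 2) / C ** 0.5   # (B, N_v, N_l), i.e., Q K^T
attn = F.softmax(affinity, dim=-1)              # each visual token attends to text
V_fused = V_tok + attn @ L0                     # language-conditioned visual tokens
```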
To construct the language priors, we manually create 4 human-written descriptions for each scene, focusing on stable structures and spatial object relations (e.g., “two monitors placed side by side on a white desk,” “a chessboard centered on a small square table surrounded by chairs”). These descriptions are used only during training to guide cross-modal alignment and are not required during inference. Unlike CLIP’s fixed templates (e.g., “a photo of a [scene]”), our descriptions are tailored to camera relocalization and explicitly encode spatial relationships among objects. Specific and detailed descriptions (e.g., “a chessboard on a small table surrounded by chairs”) help the model attend to key geometric cues, whereas broader descriptions (e.g., “a chessboard in a classroom”) provide weaker localization signals and lead to more diffuse attention, as shown in Figure 3. Comprehensive prompts such as “two monitors side by side on a cluttered desk with a chair in front” further anchor the model to salient structural elements that are critical for pose estimation. This training-time linguistic supervision enriches the visual features with semantic and spatial priors, enabling MVL-Loc to learn more discriminative and context-aware representations for multi-scene camera relocalization.
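For illustration only, the per-scene prompt sets could be organized as a simple dictionary keyed by scene name; the entries below merely echo the examples quoted above, and the full prompt lists are not reproduced here.

```python
# Illustrative per-scene description lists (4 human-written prompts per scene,
# used only at training time). Wording follows the examples quoted in the text;
# the complete prompt sets are hypothetical placeholders.
SCENE_DESCRIPTIONS = {
    "Chess": [
        "a chessboard centered on a small square table surrounded by chairs",
        "a chessboard on a small table surrounded by chairs",
        # ... two further prompts per scene
    ],
    "Office": [
        "two monitors placed side by side on a white desk",
        "two monitors side by side on a cluttered desk with a chair in front",
        # ...
    ],
}
```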

3.3. Multi-Scene Camera Pose Regression

Building on single-scene language-guided training, we extend this methodology by indexing each scene and pairing it with its corresponding detailed description. Unlike Multi-Scene PoseNet (MSPN) [9], which produces a separate model for each scene, we train all scenes in parallel for all poses, with each scene guided by its unique set of language prompts. The final decoder layer outputs the pose sequences corresponding to each scene’s latent features. Since a single image corresponds to the specific scene from which it was captured, the appropriate latent embedding must be selected. We index the scene list $K = [1, 2, \ldots, k]$, where $k$ denotes the number of predicted scenes, and pair each index with its corresponding detailed description. The input list is [“Chess”, “Fire”, “Heads”, …, “Red kitchen”] or [“King’s College”, “Old hospital”, …, “St Mary’s church”]. To determine which scene a query image belongs to, we pass the classification logits $z = [z_1, z_2, \ldots, z_K]$ through the softmax function to obtain the predicted scene probabilities:
$\pi_i = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}, \quad i = 1, 2, \ldots, K,$
where $\pi_i$ denotes the probability that the input image belongs to the $i$-th scene. Our work builds upon the pose estimation method of [13], regressing 6-DoF camera poses from the vision-language fused features through Multilayer Perceptrons (MLPs) corresponding to the selected scene $i$ with the maximal probability $\pi_i$:
$(p, q)_i = \mathrm{MLPs}((V, L)),$
where $p \in \mathbb{R}^3$ denotes the 3D camera position and $q \in \mathbb{R}^4$ denotes a 4D unit quaternion for orientation. We first compute the negative log-likelihood loss for classification, given the ground-truth scene index $k_0$:
$\mathcal{L}_{\mathrm{cls}} = -\log(\pi_{k_0}).$
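A minimal sketch of the multi-scene head implied by Equations (7)-(9) follows: a scene classifier over K scenes plus one pose MLP per scene, with cross-entropy serving as the negative log-likelihood term. Layer sizes, pooling, and the two-layer MLP design are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K_SCENES, DIM = 7, 256   # e.g., the seven indoor scenes of 7Scenes

class MultiScenePoseHead(nn.Module):
    def __init__(self, num_scenes=K_SCENES, dim=DIM):
        super().__init__()
        self.scene_logits = nn.Linear(dim, num_scenes)
        # One lightweight pose MLP per scene: 3 position + 4 quaternion values.
        self.pose_mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 7))
            for _ in range(num_scenes)])

    def forward(self, fused, scene_idx=None):
        # fused: pooled (V, L) features, shape (B, dim).
        logits = self.scene_logits(fused)
        probs = F.softmax(logits, dim=-1)                      # pi_i, Eq. (7)
        idx = probs.argmax(dim=-1) if scene_idx is None else scene_idx
        poses = torch.stack(
            [self.pose_mlps[int(i)](f) for i, f in zip(idx, fused)])
        return poses[:, :3], poses[:, 3:], logits              # p, q, logits

head = MultiScenePoseHead()
fused = torch.randn(4, DIM)
gt_scene = torch.tensor([0, 2, 5, 1])
p_hat, q_hat, logits = head(fused, gt_scene)
# Cross-entropy over the logits equals -log(pi_{k0}), the NLL term of Eq. (9).
loss_cls = F.cross_entropy(logits, gt_scene)
```

At inference time the ground-truth index is unavailable, so the head simply falls back to the scene with maximal predicted probability.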
We then integrate the camera pose loss and multi-scene classification loss to get the overall loss:
$\mathcal{L}_{\mathrm{ms}} = \left| p - \hat{p} \right|_1 e^{-\alpha} + \alpha + \left| \log q - \log \hat{q} \right|_1 e^{-\beta} + \beta + \mathcal{L}_{\mathrm{cls}},$
where $\alpha$ and $\beta$ are learnable scale factors that balance the positional and orientational losses. Here, $p$ and $q$ denote the ground-truth camera position and orientation, while $\hat{p}$ and $\hat{q}$ represent the predicted position and quaternion produced by the network. We use the logarithm of the unit quaternion rather than the unit-norm quaternion itself, which provides a minimal three-dimensional parameterization of rotation and therefore allows an L1 distance loss without requiring normalization. The L1 loss mitigates the influence of outliers, enhancing robustness to unusual observations while encouraging parameter and feature sparsity, thereby facilitating robust feature extraction and the assignment of negligible weights to trivial or nonessential features.
In the context of camera relocalization, the loss in Equation (10) reflects the discrepancy between the predicted and ground-truth camera poses for each training sample. For a query image captured at spatial position p with orientation q, the network outputs a predicted pose ( p ^ , q ^ ) . The translational component | p p ^ | 1 directly measures the deviation in the estimated 3D position, while the rotational term | log q log q ^ | 1 quantifies the angular discrepancy by comparing the logarithmic map of the ground-truth and predicted quaternions. This formulation ensures that orientation errors are evaluated in a minimally parameterized and geometrically meaningful space, allowing the model to learn pose regressions that remain stable across multi-scene environments.
Notably, the unit quaternion is represented as $q = (\eta, \zeta)$, where $\eta$ denotes the scalar real part and $\zeta \in \mathbb{R}^3$ represents the imaginary part corresponding to the rotation axis. The logarithmic map used in the orientational term of Equation (10) is formally defined as:
$\log q = \begin{cases} \frac{\zeta}{\lVert \zeta \rVert} \cos^{-1}(\eta), & \text{if } \lVert \zeta \rVert \neq 0, \\ 0, & \text{otherwise}. \end{cases}$
The quaternion logarithm is often utilized in camera pose regression because it provides a smooth and differentiable representation of orientation. Nonetheless, $q$ and $-q$ represent the same rotation because unit quaternions doubly cover the rotation group, one copy on each hemisphere. In this study, we restrict quaternions to a single hemisphere using absolute values to ensure uniqueness.
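Putting Equations (10) and (11) together, a hedged sketch of the loss could look as follows; the hemisphere restriction, the clamping for numerical safety, and the learnable α/β (initialized as in Section 4.1) follow the equations as written rather than any released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def log_quat(q, eps=1e-8):
    # q = (eta, zeta): scalar part eta, imaginary part zeta in R^3 (Eq. (11)).
    # Flip to one hemisphere so q and -q map to the same log vector.
    q = torch.where(q[..., :1] < 0, -q, q)
    eta, zeta = q[..., :1], q[..., 1:]
    norm = zeta.norm(dim=-1, keepdim=True)
    scale = torch.where(norm > eps,
                        torch.acos(eta.clamp(-1.0, 1.0)) / norm.clamp(min=eps),
                        torch.zeros_like(norm))
    return scale * zeta                                   # 3D logarithmic map

class MultiSceneLoss(nn.Module):
    def __init__(self, alpha0=4.0, beta0=2.0):
        super().__init__()
        # Learnable balance terms, initialized as stated in Section 4.1.
        self.alpha = nn.Parameter(torch.tensor(alpha0))
        self.beta = nn.Parameter(torch.tensor(beta0))

    def forward(self, p_hat, q_hat, p_gt, q_gt, loss_cls):
        l_pos = (p_gt - p_hat).abs().sum(dim=-1).mean()                    # L1
        l_rot = (log_quat(q_gt) - log_quat(q_hat)).abs().sum(dim=-1).mean()
        return (l_pos * torch.exp(-self.alpha) + self.alpha +
                l_rot * torch.exp(-self.beta) + self.beta + loss_cls)

# Toy usage with random placeholder predictions and ground truth.
criterion = MultiSceneLoss()
p_hat, q_hat = torch.randn(4, 3), F.normalize(torch.randn(4, 4), dim=-1)
p_gt, q_gt = torch.randn(4, 3), F.normalize(torch.randn(4, 4), dim=-1)
loss = criterion(p_hat, q_hat, p_gt, q_gt, loss_cls=torch.tensor(0.5))
```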

4. Experiments

4.1. Implementation Details

To ensure consistent network training, input images are cropped to a size of 224 × 224 using random and central cropping techniques. We utilized a pre-trained CLIP model, which was trained on a diverse set of datasets, including MS-COCO, Visual Genome, and others. We trained the model from scratch for 280 epochs on an Nvidia V100 GPU using PyTorch 1.11.0. To enhance data diversity, we applied large-scale ColorJitter augmentation by adjusting brightness, contrast, saturation, and hue with respective factors of 0.6, 0.7, 0.7, and 0.5. The model was optimized using the AdamW optimizer, coupled with a cosine learning rate scheduler. The initial learning rate was set to 4.5 × 10⁻⁵, with a weight decay of 4 × 10⁻⁵. The model was trained using a minibatch size of 64, a dropout rate of 0.5, and weight initializations of α = 4.0 and β = 2.0.
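A sketch of this training configuration using standard torchvision/PyTorch components is given below; the model is a placeholder and the inner training loop is elided.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms

# Augmentation and cropping as stated in Section 4.1.
train_tf = transforms.Compose([
    transforms.ColorJitter(brightness=0.6, contrast=0.7, saturation=0.7, hue=0.5),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])
test_tf = transforms.Compose([transforms.CenterCrop(224), transforms.ToTensor()])

model = nn.Linear(256, 7)   # placeholder for the full MVL-Loc network
optimizer = optim.AdamW(model.parameters(), lr=4.5e-5, weight_decay=4e-5)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=280)

for epoch in range(280):
    # ... iterate minibatches of size 64, compute the loss of Eq. (10),
    #     then call loss.backward() and optimizer.step() here ...
    scheduler.step()
```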

4.2. Datasets

The Cambridge Landmarks dataset [2] comprises labeled RGB images and video sequences captured across several outdoor locations around the University of Cambridge. For our experiments, we adopt four representative scenes—King’s College, Old Hospital, Shop Façade, and St Mary’s Church—covering a variety of architectural styles and viewing geometries. All data were collected using a handheld smartphone at a resolution of 1920 × 1080 pixels, with ground-truth 6-DoF camera poses reconstructed via structure-from-motion (SfM). These scenes exhibit substantial appearance variations across different lighting and weather conditions, including sunny and overcast skies, shadows cast by nearby buildings, and temporal changes between bright daylight and dusk. In addition, dynamic elements such as moving pedestrians, cyclists, and passing vehicles frequently enter the field of view, introducing occlusions and non-rigid motion. Such factors make Cambridge Landmarks a challenging benchmark for learning robust outdoor camera-pose regression models.
To complement the outdoor Cambridge dataset, we further employ the 7 Scenes dataset [25], which serves as a standard indoor benchmark for evaluating visual localization and camera-pose regression methods. It consists of RGB-D image sequences collected from seven small-scale office and household environments: Chess, Fire, Heads, Office, Pumpkin, Kitchen, and Stairs, using a Microsoft Kinect RGB-D sensor at a resolution of 640 × 480 pixels. Ground-truth 6-DoF camera poses were reconstructed with the KinectFusion system, which performs dense 3D mapping and pose tracking. The dataset exhibits pronounced appearance changes caused by variations in viewpoint, scale, and illumination, as well as perceptual aliasing from repetitive textures and reflective surfaces. Furthermore, subtle lighting differences between training and testing trajectories—arising from varying indoor illumination settings or time-of-day changes—make the dataset particularly challenging for consistent pose estimation. Together, these two datasets provide complementary testbeds that cover both indoor and outdoor scenarios, enabling a comprehensive evaluation of model robustness and generalization ability.

4.3. Baselines

To rigorously validate our framework, we benchmark against both single-scene and multi-scene end-to-end trainable approaches. For the experiments conducted on 7 Scenes dataset, we selected prominent learning-based methods: PoseNet [2], BayesianPoseNet [26], PoseNet-Lstm [27], PoseNet17 [3], IRPNet [28], Hourglass [29], and AtLoc [5]. Additionally, we validated the model’s generalization using the outdoor Cambridge Landmarks dataset, comparing it against several multi-scene state-of-the-art approaches: MSPN [9], MS-Trans [12] and c2f-MsTrans [10]. For a fair comparative analysis, we selected four benchmark scenes from the Cambridge Landmarks dataset, excluding Great Court and Street. This exclusion was necessary to ensure a balanced comparison of average multi-scene camera relocalization accuracy, as IRPNet, LSTM-PoseNet, and MS-Trans failed to converge on these two scenes.

4.4. Quantitative Results Analysis

On the Cambridge Landmarks dataset, MVL-Loc demonstrates effective generalization and sets a new state-of-the-art, as evidenced by the comprehensive results in Table 1. Our method achieves the best overall performance with an average accuracy of 0.93 m and 2.90 ° , underscoring its robustness across diverse outdoor scenes. The performance gain is particularly significant in expansive, structurally complex environments such as King’s College and Old Hospital. Here, MVL-Loc reduces the position error from 0.71 m to 0.62 m and from 1.50 m to 1.38 m, while improving rotation accuracy from 2.71 ° to 1.89 ° and from 2.98 ° to 2.41 ° , respectively, compared to the strong c2f-MsTrans [10] baseline. This advancement stems from the model’s ability to leverage language-guided semantic priors. By focusing on descriptive cues such as “the grand façade of King’s College Chapel” or “the red-brick walls of the Old Hospital,” the network anchors its pose estimation to permanent, geometrically stable landmarks, effectively ignoring transient distractions like pedestrians or vehicles that often degrade the performance of purely visual methods. Furthermore, in smaller-scale scenes with repetitive textures, such as Shop Façade and St Mary’s Church, MVL-Loc maintains exceptional precision. The model’s reliance on high-level semantic context mitigates the challenges of perceptual aliasing where visual-only approaches often fail. Language descriptions enhance pose estimation, with notable performance gains of 47.5 % and 29.2 % on Shop Façade and St Mary’s Church compared to PoseNet-LSTM [27]. This improvement highlights how aligning visual observations with language context provides a powerful mechanism for disambiguation, enabling accurate localization even when visual cues are ambiguous or insufficient. When compared directly with other multi-scene transformer baselines, the advantages of our vision-language framework become even more apparent. Against MS-Trans [12], MVL-Loc achieves an average improvement of 28.5 % in translation and 3.6 % in rotation accuracy. This consistent margin underscores the fundamental benefit of cross-modal semantic grounding, which provides inherent invariance to illumination changes, dynamic clutter, and viewpoint variations.
On the 7 Scenes dataset, MVL-Loc achieves state-of-the-art localization accuracy, as shown in Table 2. Compared to single-scene baselines, our method demonstrates clear advantages in both translation and rotation estimation, achieving an average improvement of 23.8% in position accuracy and 10.6% in rotation accuracy over AtLoc [5]. Specifically, in the Pumpkin scene, MVL-Loc reduces position error from 0.22 m to 0.16 m and rotation error from 7.32° to 3.82°, indicating that precise language descriptions significantly refine pose estimation and orientation stability. By incorporating semantic cues such as “the orange pumpkin on the table” or “the textured kitchen wall,” the model can anchor its localization to semantically meaningful and spatially stable regions, improving vertical alignment and geometric consistency compared to Hourglass [29]. The advantages of MVL-Loc are particularly evident in scenes characterized by repetitive patterns and limited spatial depth, such as Chess and Office, where visual-only models often fail due to perceptual aliasing. The integration of language-guided semantic information enables the network to disambiguate similar visual contexts by leveraging descriptive attributes, yielding more distinctive feature embeddings and stable pose predictions. This language grounding enhances cross-scene generalization and reduces dependency on specific viewpoint configurations, which are common challenges in small indoor environments. When compared directly with multi-scene transformer-based approaches such as MS-Trans [12] and c2f-MsTrans [10], MVL-Loc achieves a 20% improvement in translation and a 5.6% gain in rotation accuracy. Overall, these findings confirm that coupling language-guided semantic reasoning with multi-scene learning enables MVL-Loc to achieve robust and generalizable 6-DoF camera relocalization, setting a new benchmark on the 7 Scenes and Cambridge Landmarks datasets.

4.5. Visualization of Multi-Scene Fusion

Training on single scenes limits model generalization and scene adaptability. Multi-scene training, however, enables MVL-Loc to recognize key objects, interpret spatial relationships, and generalize across diverse environments. Figure 4 visualizes attention in the Heads and Fire scenes from the 7Scenes dataset, showing activations for one, three, and seven scenes (left to right columns). In Heads, single-scene training struggles to extract positional information, with attention sparsely distributed. With three scenes, the model starts identifying stable objects like the face and headset, while seven-scene fusion captures additional elements like the monitor and mannequin head, enhancing spatial understanding. For the Fire scene, single-scene training focuses on isolated regions, such as fire extinguishers, while three-scene input expands attention to the umbrella and background elements. Seven-scene fusion integrates local and global cues, improving object contextualization and scene comprehension.
On the Cambridge Landmarks dataset, MVL-Loc, guided by language prompts, outperforms MS-Trans by focusing on static, geometrically significant objects (e.g., spires and arches) in the College and Church scenes shown in Figure 5. MS-Trans, by contrast, focuses on dynamic objects like pedestrians and cars, which are less reliable for relocalization. MVL-Loc’s attention on structural elements like archways and carvings leads to better spatial awareness and enhanced relocalization across diverse environments.

4.6. Feature Stability Under Dynamic Conditions

To further assess the robustness of the proposed framework, we analyze the feature distance profiles illustrated in Figure 6. In this visualization, we compute the L2 distance between the feature of a reference frame and those of subsequent frames along the same trajectory. The extracted features from MS-Trans and MVL-Loc are compared to investigate how the incorporation of vision-language alignment enhances feature consistency under dynamic disturbances. MS-Trans exhibits sharp fluctuations in feature distance when moving pedestrians intermittently occlude the scene, indicating its sensitivity to local dynamic changes. In contrast, MVL-Loc produces smoother and more stable curves, benefiting from its language-guided global semantics that preserve structural understanding even under transient occlusions. Under changing illumination, where the overall brightness gradually decreases from normal to dim conditions, the feature representation of MS-Trans shifts irregularly. By leveraging language-guided semantic consistency, our model focuses on illumination-invariant scene properties, effectively decoupling appearance variations from geometric and semantic understanding. These observations validate that MVL-Loc effectively mitigates instability caused by pedestrian motion and illumination variation, demonstrating superior robustness and semantic consistency compared to MS-Trans.
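The feature-distance profile of Figure 6 can be reproduced conceptually with a few lines; extract_features below is a hypothetical per-frame encoder, and the frames are random placeholders.

```python
import torch

def feature_distance_profile(extract_features, frames):
    # frames: sequence of images from one trajectory; frame 0 is the reference.
    with torch.no_grad():
        feats = torch.stack([extract_features(f) for f in frames])   # (T, D)
    ref = feats[0]
    return torch.linalg.norm(feats - ref, dim=-1)                    # (T,) L2 curve

# Toy demo with a random "encoder" and random frames (placeholders).
frames = [torch.randn(3, 224, 224) for _ in range(10)]
toy_encoder = lambda f: f.mean(dim=(1, 2))     # stands in for a feature extractor
profile = feature_distance_profile(toy_encoder, frames)
# Comparing the curves of two models (e.g., MS-Trans vs. MVL-Loc encoders)
# would reproduce the stability comparison shown in Figure 6.
```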

5. Ablation Study

We conducted a comprehensive evaluation of the core components of our proposed MVL-Loc framework on two standard camera relocalization benchmarks: the indoor 7Scenes and the outdoor Cambridge Landmarks datasets. The ablation results, systematically presented in Table 3, delineate the individual and synergistic contributions of integrating world knowledge (CLIP-based visual-language pre-training), natural language descriptions, and multi-scene training. The initial configuration, which utilized a standard ImageNet-pre-trained model as the visual backbone, demonstrated limited generalization capabilities and suboptimal relocalization performance. This is attributed to its lack of cross-modal semantic understanding, which is crucial for disambiguating visually similar scenes. Replacing this with a CLIP-based encoder, thereby incorporating rich world knowledge, yielded a substantial improvement. This shift enhanced the model’s ability to interpret complex scene structures and semantic contexts, reducing the average position and rotation errors to 1.33 m and 3.57°, respectively, on the challenging Cambridge Landmarks dataset. The integration of natural language descriptions provided a further significant boost in performance. By leveraging textual cues to guide visual attention and feature extraction, the model achieved more precise spatial reasoning, as evidenced by the reduced errors of 0.18 m and 7.10° on 7Scenes and 1.01 m and 3.05° on Cambridge Landmarks. The synergistic integration of all components (world knowledge, language guidance, and multi-scene training) culminated in the optimal performance, underscoring the importance of a holistic design that combines semantic priors, language-driven focus, and diverse environmental exposure for robust and generalizable camera relocalization.
To thoroughly evaluate the advancements of MVL-Loc within the emerging paradigm of vision-language models for localization, we conduct a comparative analysis against two state-of-the-art VLM-based methods: Glo-Loc [18] and FM-Loc [17]. These methods represent the most direct antecedents to our work, as they also leverage pre-trained vision-language knowledge for camera pose estimation. This comparison is crucial for demonstrating the specific improvements afforded by our end-to-end, language-guided multi-scene framework. For a fair and consistent evaluation, we compare the performance of all methods on the 7Scenes and Cambridge Landmarks datasets using the standard training and testing splits as defined in the original datasets. The evaluation metrics are consistent across all compared methods. The results for Glo-Loc and FM-Loc are sourced from their respective publications to ensure an unbiased comparison.
The comparative analysis presented in Table 4 provides compelling evidence for the advancements achieved by the MVL-Loc framework within the domain of vision-language models applied to localization. When compared to Glo-Loc, MVL-Loc demonstrates a significant advantage in metric precision. Glo-Loc is inherently constrained by its reliance on discrete database matching, struggling to achieve the continuous, fine-grained accuracy required for precise 6-DoF pose estimation in environments with subtle geometric variations. In contrast, MVL-Loc’s end-to-end regression framework learns a direct mapping from multimodal inputs to the pose space, enabling it to capture nuanced spatial relationships that are critical for accurate localization.
Furthermore, the comparison with FM-Loc highlights the enhanced robustness afforded by MVL-Loc’s avoidance of object-detection dependencies. FM-Loc’s performance is intrinsically linked to the reliability of its object detector, making it susceptible to failure in the presence of occlusions, dynamic objects, or challenging lighting conditions—all common in real-world scenarios. MVL-Loc circumvents this fragility by leveraging dense, pre-trained VLM features that are semantically guided by natural language descriptions. This strategy allows the model to focus on stable, salient scene structures based on high-level semantic cues rather than relying on the precise detection of specific object instances. MVL-Loc’s integration of end-to-end regression with language-guided, multi-scene training establishes a more effective and robust pathway for leveraging world knowledge from vision-language models for the specific task of camera relocalization.
To further validate our architectural choices, we extended the ablation study to examine different vision-language encoders’ capabilities. As summarized in Table 5, we compared three prominent models—BLIP-2, OpenFlamingo, and CLIP—to identify the optimal backbone for our multi-modal framework. BLIP-2’s relatively weaker performance stems from its pre-training emphasis on image-text generative tasks rather than dense geometric understanding. While effective for captioning, this focus limits its ability to capture fine-grained spatial relationships essential for precise camera pose estimation. OpenFlamingo demonstrated moderate improvements, achieving 0.19 m and 7.21° on 7Scenes, owing to its stronger sequential reasoning capabilities. However, its architectural complexity for handling interleaved visual-linguistic sequences introduces unnecessary overhead for our single-image regression task. CLIP emerged as the superior encoder, achieving remarkable accuracy of 0.16 m and 6.98° on 7Scenes and 0.93 m and 2.90° on Cambridge Landmarks. This superiority originates from its contrastive pre-training paradigm, which explicitly aligns visual and textual representations in a shared embedding space. This process inherently enhances both semantic grounding and geometric awareness, creating ideal features for language-guided pose regression. The alignment quality is visually demonstrated in Figure 3, where CLIP-based attention maps show precise focus on linguistically-described scene elements.
Lastly, we conducted a comprehensive study on the optimal depth of the transformer decoder architecture, with results detailed in Table 6. This investigation reveals a clear performance trajectory based on decoder complexity: the 2-layer configuration exhibited significantly elevated errors (0.54 m, 8.15° on 7Scenes; 1.71 m, 3.93° on Cambridge Landmarks), indicating pronounced underfitting due to insufficient model capacity for complex multimodal fusion. Progressing to 4 decoder layers yielded the optimal balance, achieving precision of 0.16 m and 6.98° on 7Scenes—this depth provides adequate nonlinear transformation capacity for effective cross-modal integration while maintaining stable gradient propagation and computational efficiency. Notably, expanding to 6 and 8 layers resulted in performance degradation (0.17 m, 7.16° for 8 layers), suggesting overfitting and optimization difficulties where deeper architectures memorize dataset-specific artifacts rather than learning generalizable representations.

6. Conclusions

In this work, we propose MVL-Loc, a novel 6-DoF camera relocalization framework that leverages pretrained world knowledge from vision-language models (VLMs) and integrates multimodal data. The architecture generalizes effectively across both indoor and outdoor environments, demonstrating robust performance in various real-world settings. We employ natural language as a directive tool to guide multi-scene learning, allowing the model to capture both semantic information and spatial relationships among objects. Through extensive experiments, MVL-Loc achieves state-of-the-art performance in end-to-end multi-scene camera relocalization. Looking forward, we plan to explore integrating large language models, such as GPT-o1, with MVL-Loc to enable autonomous scene comprehension. This advancement has the potential to further enhance the precision and adaptability of camera relocalization in complex, real-world environments.

Author Contributions

Conceptualization, Z.X. and S.Y.; Methodology, J.Y. and Z.W.; Validation, Z.X., S.Y. and S.J.; Formal analysis, Z.X.; Investigation, Z.X.; Resources, W.W.; Writing—original draft, Z.X.; Writing—review and editing, W.W.; Visualization, Z.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available. The 7-Scenes dataset can be accessed at the Microsoft Research Cambridge website (https://www.microsoft.com/en-us/research/project/rgb-d-dataset-7-scenes/, accessed on 15 October 2025), and the Cambridge Landmarks dataset is available at the University of Cambridge Repository (https://www.repository.cam.ac.uk/handle/1810/251342, accessed on 15 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  2. Kendall, A.; Grimes, M.; Cipolla, R. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  3. Kendall, A.; Cipolla, R. Geometric loss functions for camera pose regression with deep learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  4. Clark, R.; Wang, S.; Markham, A.; Trigoni, N.; Wen, H. Vidloc: A deep spatio-temporal model for 6-dof video-clip relocalization. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  5. Wang, B.; Chen, C.; Lu, C.X.; Zhao, P.; Trigoni, N.; Markham, A. Atloc: Attention guided camera localization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10393–10401. [Google Scholar]
  6. Xiao, Z.; Chen, C.; Yang, S.; Wei, W. EffLoc: Lightweight Vision Transformer for Efficient 6-DOF Camera Relocalization. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 8529–8536. [Google Scholar]
  7. Zhou, G.; Hong, Y.; Wu, Q. NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models. arXiv 2023, arXiv:2305.16986. [Google Scholar] [CrossRef]
  8. Brohan, A.; Brown, N.; Carbajal, J.; Chebotar, Y.; Chen, X. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv 2023, arXiv:2307.15818. [Google Scholar]
  9. Blanton, H.; Greenwell, C.; Workman, S.; Jacobs, N. Extending Absolute Pose Regression to Multiple Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 38–39. [Google Scholar]
  10. Shavit, Y.; Ferens, R.; Keller, Y. Coarse-to-Fine Multi-Scene Pose Regression with Transformers. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 14222–14233. [Google Scholar] [CrossRef] [PubMed]
  11. Chen, D.M.; Baatz, G.; Köser, K.; Tsai, S.S.; Vedantham, R.; Pylvänäinen, T.; Roimela, K.; Chen, X.; Bach, J.; Pollefeys, M.; et al. City-scale landmark identification on mobile devices. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011. [Google Scholar]
  12. Shavit, Y.; Ferens, R.; Keller, Y. Learning Multi-Scene Absolute Pose Regression with Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2713–2722. [Google Scholar]
  13. Brahmbhatt, S.; Gu, J.; Kim, K.; Hays, J.; Kautz, J. Geometry-aware learning of maps for camera localization. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  14. Lee, J.; Lee, H.; Oh, J. FusionLoc: Camera-2D LiDAR Fusion Using Multi-Head Self-Attention for End-to-End Serving Robot Relocalization. IEEE Access 2023, 11, 75121–75133. [Google Scholar] [CrossRef]
  15. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: New York, NY, USA, 2021; pp. 8748–8763. [Google Scholar]
  16. Li, J.; Tan, H.; Bansal, M. Envedit: Environment Editing for Vision-and-Language Navigation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 15386–15396. [Google Scholar] [CrossRef]
  17. Mirjalili, R.; Krawez, M.; Burgard, W. FM-Loc: Using Foundation Models for Improved Vision-Based Localization. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 1381–1387. [Google Scholar] [CrossRef]
  18. Matsuzaki, S.; Sugino, T.; Tanaka, K.; Sha, Z.; Nakaoka, S.; Yoshizawa, S.; Shintani, K. CLIP-Loc: Multi-modal Landmark Association for Global Localization in Object-based Maps. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 13673–13679. [Google Scholar]
  19. Manvi, R.; Khanna, S.; Mai, G.; Burke, M.; Lobell, D.B.; Ermon, S. GeoLLM: Extracting Geospatial Knowledge from Large Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  20. Li, L.; Ye, Y.; Jiang, B.; Zeng, W. GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024. [Google Scholar]
  21. Sun, X.; Albin, D.; Mauceri, C.; Woods, D.; Heckman, C. Spatial-LLaVA: Enhancing Large Language Models with Spatial Referring Expressions for Visual Understanding. arXiv 2025, arXiv:2505.12194. [Google Scholar]
  22. Chen, H.; Müller, T.; Park, Y. UniSceneVL: A Unified Vision-Language Model for 3D Scene Understanding and Spatial Correspondence Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 5521–5532. [Google Scholar]
  23. Shen, F.; Ye, H.; Zhang, J.; Wang, C.; Han, X.; Yang, W. Advancing pose-guided image synthesis with progressive conditional diffusion models. arXiv 2023, arXiv:2310.06313. [Google Scholar]
  24. Shen, F.; Ye, H.; Liu, S.; Zhang, J.; Wang, C.; Han, X.; Wei, Y. Boosting consistency in story visualization with rich-contextual conditional diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 6785–6794. [Google Scholar]
  25. Criminisi, A.; Shotton, J.; Glocker, B.; Izadi, S.; Fitzgibbon, A. 7-Scenes RGB-D Dataset. In Proceedings of the Microsoft Research Cambridge Vision and Graphics Group Dataset Release, Cambridge, UK, 1 January 2013; Available online: https://www.microsoft.com/en-us/research/project/rgb-d-dataset-7-scenes/ (accessed on 6 November 2025).
  26. Kendall, A.; Cipolla, R. Modelling uncertainty in deep learning for camera relocalization. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016. [Google Scholar]
  27. Walch, F.; Hazirbas, C.; Leal-Taixe, L.; Sattler, T.; Hilsenbeck, S.; Cremers, D. Image-based localization using lstms for structured feature correlation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  28. Shavit, Y.; Ferens, R. Do We Really Need Scene-specific Pose Encoders? In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 3186–3192. [Google Scholar]
  29. Melekhov, I.; Ylioinas, J.; Kannala, J.; Rahtu, E. Image-Based Localization Using Hourglass Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; pp. 870–877. [Google Scholar]
Figure 1. Language guides multi-scene learning for camera pose estimation, even when visual appearances vary. Our MVL-Loc uses contextual cues, such as a chessboard on a small table (Scene 1) and dual monitors on a cluttered desk (Scene 2), to estimate the camera’s pose.
Figure 2. Overview of the MVL-Loc framework for multi-scene camera relocalization using open-world language guidance. The left section shows natural language descriptions guiding the model across three scenes. The center section depicts multi-modal input processing, integrating visual-language correspondence via Transformer blocks with self-attention and multi-head attention. The right section highlights the final 6-DOF camera pose output through multi-scene selection and pose regression.
Figure 3. Attention visualizations under the Chess scene. From left to right: (a) original image, (b) broad language-guided attention, and (c) detailed language-guided attention with sharper focus on the chessboard and surrounding objects.
Figure 4. Attention visualizations on Heads and Fire scenes with 1, 3, and 7-scene training.
Figure 5. Saliency maps on King’s College and St Mary’s Church. From left to right: original images, MS-Trans, and MVL-Loc (ours). MVL-Loc shows sharper focus on key architectural structures, aiding more accurate pose estimation.
Figure 6. Comparative analysis of feature stability under dynamic disturbances. We evaluate the robustness of feature representations extracted by MVL-Loc and MS-Trans on the Cambridge Landmarks dataset under two challenging scenarios. Under pedestrian-induced occlusions, moving individuals introduce transient appearance changes that lead to unstable feature responses in MS-Trans, whereas MVL-Loc tends to maintain smoother and more consistent feature trajectories. During illumination transitions between normal and dim conditions, MVL-Loc appears to provide relatively stable representations, suggesting preliminary evidence of enhanced robustness to real-world dynamic disturbances.
Table 1. Camera localization results on the Cambridge Landmarks dataset. The dataset comprises several outdoor scenes around the University of Cambridge; four representative scenes, King’s College, Old Hospital, Shop Façade, and St Mary’s Church, are used for evaluation. For each scene, we report the median position and rotation errors of various single-scene and multi-scene localization baselines and our proposed method. The best results in each column are highlighted in bold.

| Category | Method | King’s College | Old Hospital | Shop Façade | St Mary’s Church | Average |
|---|---|---|---|---|---|---|
| Single-Scene | PoseNet [2] | 1.94 m, 5.43° | 0.61 m, 2.92° | 1.16 m, 3.92° | 2.67 m, 8.52° | 1.60 m, 5.20° |
| | BayesianPoseNet [26] | 1.76 m, 4.08° | 2.59 m, 5.18° | 1.27 m, 7.58° | 2.13 m, 8.42° | 1.94 m, 6.32° |
| | MapNet [13] | 1.08 m, 1.91° | 1.96 m, 3.95° | 1.51 m, 4.26° | 2.02 m, 4.57° | 1.64 m, 3.67° |
| | PoseNet17 [3] | 1.62 m, 2.31° | 2.64 m, 3.93° | 1.16 m, 5.77° | 2.95 m, 6.50° | 2.09 m, 4.63° |
| | IRPNet [28] | 1.21 m, 2.19° | 1.89 m, 3.42° | 0.74 m, 3.51° | 1.89 m, 4.98° | 1.43 m, 3.53° |
| | PoseNet-Lstm [27] | 0.99 m, 3.74° | 1.53 m, 4.33° | 1.20 m, 7.48° | 1.54 m, 6.72° | 1.32 m, 5.57° |
| Multi-Scene | MSPN [9] | 1.77 m, 3.76° | 2.55 m, 4.05° | 2.92 m, 7.49° | 2.67 m, 6.18° | 2.48 m, 5.37° |
| | MS-Trans [12] | 0.85 m, 1.63° | 1.83 m, 2.43° | 0.88 m, 3.11° | 1.64 m, 4.03° | 1.30 m, 2.80° |
| | c2f-MsTrans [10] | 0.71 m, 2.71° | 1.50 m, 2.98° | 0.61 m, 2.92° | 1.16 m, 3.92° | 0.99 m, 3.13° |
| | MVL-Loc (ours) | 0.62 m, 1.69° | 1.38 m, 2.41° | 0.59 m, 3.21° | 1.09 m, 3.82° | 0.92 m, 2.79° |
Table 2. Camera localization results on the 7 Scenes dataset. The dataset contains seven small indoor environments: Chess, Fire, Heads, Office, Pumpkin, Kitchen, and Stairs. For each scene, we report the median position and rotation errors of various single-scene and multi-scene localization baselines and our proposed method. The most accurate results are highlighted in bold.

| Category | Method | Chess | Fire | Heads | Office | Pumpkin | Kitchen | Stairs | Average |
|---|---|---|---|---|---|---|---|---|---|
| Single-Scene | PoseNet [2] | 0.32 m, 7.60° | 0.48 m, 14.6° | 0.31 m, 12.2° | 0.48 m, 7.68° | 0.47 m, 8.42° | 0.59 m, 8.64° | 0.47 m, 13.81° | 0.45 m, 10.42° |
| | Bayesian [26] | 0.38 m, 7.24° | 0.43 m, 13.8° | 0.30 m, 12.3° | 0.49 m, 8.09° | 0.63 m, 7.18° | 0.59 m, 7.59° | 0.48 m, 13.22° | 0.47 m, 9.91° |
| | PN-Lstm [27] | 0.24 m, 5.79° | 0.34 m, 12.0° | 0.22 m, 13.8° | 0.31 m, 8.11° | 0.34 m, 7.03° | 0.37 m, 8.83° | 0.41 m, 13.21° | 0.32 m, 9.82° |
| | PoseNet17 [3] | 0.14 m, 4.53° | 0.29 m, 11.5° | 0.19 m, 13.1° | 0.20 m, 5.62° | 0.27 m, 4.77° | 0.24 m, 5.37° | 0.36 m, 12.53° | 0.24 m, 8.20° |
| | IRPNet [28] | 0.13 m, 5.78° | 0.27 m, 9.83° | 0.17 m, 13.2° | 0.25 m, 6.41° | 0.23 m, 5.83° | 0.31 m, 7.32° | 0.35 m, 11.91° | 0.24 m, 8.61° |
| | Hourglass [29] | 0.15 m, 6.18° | 0.27 m, 10.83° | 0.20 m, 11.6° | 0.26 m, 8.59° | 0.26 m, 7.32° | 0.29 m, 10.7° | 0.30 m, 12.75° | 0.25 m, 9.71° |
| | AtLoc [5] | 0.11 m, 4.37° | 0.27 m, 11.7° | 0.16 m, 11.9° | 0.19 m, 5.61° | 0.22 m, 4.54° | 0.25 m, 5.62° | 0.28 m, 10.9° | 0.21 m, 7.81° |
| Multi-Scene | MSPN [9] | 0.10 m, 4.76° | 0.29 m, 11.5° | 0.17 m, 13.2° | 0.17 m, 6.87° | 0.21 m, 5.53° | 0.23 m, 6.81° | 0.31 m, 11.81° | 0.21 m, 8.64° |
| | MS-Trans [12] | 0.11 m, 4.67° | 0.26 m, 9.78° | 0.16 m, 12.8° | 0.17 m, 5.66° | 0.18 m, 4.44° | 0.21 m, 5.99° | 0.29 m, 8.45° | 0.20 m, 7.40° |
| | c2f-MsTrans [10] | 0.10 m, 4.63° | 0.25 m, 9.89° | 0.14 m, 12.5° | 0.16 m, 5.65° | 0.16 m, 4.42° | 0.18 m, 6.29° | 0.27 m, 7.86° | 0.18 m, 7.32° |
| | MVL-Loc (ours) | 0.09 m, 3.95° | 0.22 m, 9.45° | 0.11 m, 11.6° | 0.14 m, 5.60° | 0.16 m, 3.82° | 0.14 m, 6.11° | 0.23 m, 8.11° | 0.16 m, 6.95° |
Table 3. Ablation study evaluating the effects of World Knowledge (WK), Language Descriptions (LD), and Multi-scene Training (MT) on camera relocalization performance across the 7 Scenes and Cambridge Landmarks datasets.

| Pre-Trained | WK | LD | MT | 7 Scenes | Cambridge Landmarks |
|---|---|---|---|---|---|
| ImageNet | × | × | × | 0.28 m, 7.97° | 1.63 m, 4.76° |
| CLIP | ✓ | × | × | 0.22 m, 7.38° | 1.33 m, 3.57° |
| CLIP | ✓ | ✓ | × | 0.18 m, 7.10° | 1.01 m, 3.05° |
| CLIP | ✓ | ✓ | ✓ | 0.16 m, 6.98° | 0.93 m, 2.90° |
Table 4. Comparative analysis with VLM-based localization methods on the 7Scenes and Cambridge Landmarks datasets. The best results are highlighted in bold.

| Method | Dataset | Avg. Position Error (m) | Avg. Rotation Error (°) |
|---|---|---|---|
| MVL-Loc (Ours) | 7Scenes | 0.16 | 6.98 |
| FM-Loc | 7Scenes | 0.19 | 7.60 |
| Glo-Loc | 7Scenes | 0.22 | 8.50 |
| MVL-Loc (Ours) | Cambridge | 0.93 | 2.90 |
| FM-Loc | Cambridge | 1.05 | 3.20 |
| Glo-Loc | Cambridge | 1.10 | 3.50 |
Table 5. Ablation study evaluating the performance of different vision-language encoders in our model, tested on the 7Scenes and Cambridge Landmarks datasets.

| Vision-Language Encoder | 7Scenes | Cambridge Landmarks |
|---|---|---|
| BLIP-2 | 0.18 m, 7.13° | 0.98 m, 3.16° |
| OpenFlamingo | 0.19 m, 7.21° | 1.03 m, 3.19° |
| CLIP | 0.16 m, 6.98° | 0.93 m, 2.90° |
Table 6. Ablation study on the impact of varying the number of decoder layers in the Transformer architecture, evaluated on the 7Scenes and Cambridge Landmarks datasets. The table reports the average median position and orientation errors across all scenes. The model with 4 decoder layers represents the chosen configuration due to its balanced performance on both datasets.

| Decoder Layers | 7Scenes | Cambridge Landmarks |
|---|---|---|
| 2 | 0.54 m, 8.15° | 1.71 m, 3.93° |
| 4 | 0.16 m, 6.98° | 0.93 m, 2.90° |
| 6 | 0.17 m, 7.02° | 0.97 m, 2.99° |
| 8 | 0.17 m, 7.16° | 0.98 m, 3.01° |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

