Article

MS2-CL: Multi-Scale Self-Supervised Learning for Camera to LiDAR Cross-Modal Place Recognition

School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(5), 1561; https://doi.org/10.3390/s26051561
Submission received: 26 January 2026 / Revised: 20 February 2026 / Accepted: 27 February 2026 / Published: 2 March 2026
(This article belongs to the Section Radar Sensors)

Abstract

Place recognition is a fundamental challenge for robotics and autonomous vehicles. While visual place recognition has achieved high precision, cross-modal place recognition—specifically, visual localization within large-scale point cloud maps—remains a formidable problem. Existing methods often struggle with the significant domain gap between modalities and can be computationally prohibitive, especially those processing raw 3D point clouds. Furthermore, they frequently fail to learn features invariant to viewpoint and scale variations, limiting generalization to unseen environments. In this paper, we formulate cross-modal recognition as a problem of learning a scale-invariant, unified embedding space. Our framework employs a hierarchical Swin Transformer to extract multi-scale features from unified 2D representations of both modalities. The central principle of our method is a multi-scale self-distillation paradigm, which recasts feature learning as an intra-modal knowledge transfer task. Specifically, the coarse-scale “teacher” features provide supervision for the fine-scale “student” features. The final inter-modal alignment is then achieved via a global contrastive loss, exclusively leveraging the semantically rich “teacher” embeddings to ensure a reliable and discriminative matching. Extensive experiments on the KITTI and KITTI-360 datasets demonstrate that our method achieves state-of-the-art performance. Notably, using only the KITTI-trained model without fine-tuning, Recall@1 exceeds 60% on all evaluable sequences of KITTI-360 at a 10 m threshold. Code and pre-trained models will be made publicly available upon acceptance.

1. Introduction

Visual place recognition plays a crucial role in mobile robotics, particularly in applications such as autonomous driving and simultaneous localization and mapping (SLAM). It relies on sensor data, such as images and point clouds, to achieve global localization and loop closure detection in pre-built maps and databases. Visual place recognition [1,2,3] is highly competitive due to the small size and low cost of cameras. However, cameras are sensitive to changes in lighting and field of view. LiDAR, on the other hand, provides precise geometric information and is robust to illumination changes, making it the standard for building High-Definition (HD) maps.
In practical autonomous driving scenarios, a major challenge is balancing system performance with hardware costs. While mapping vehicles are typically equipped with high-end LiDARs to construct accurate 3D HD maps, mass-produced passenger vehicles often rely on low-cost cameras for perception. Therefore, the ability to localize a camera-equipped vehicle within a pre-built LiDAR map is of significant practical value. This task, known as cross-modal place recognition, enables a cost-effective localization solution that leverages the structural accuracy of LiDAR maps without imposing heavy hardware costs on end-user vehicles. Given the discrepancies between the data modalities queried by mobile robots and those stored in the database, cross-modal place recognition tackles the challenges arising from these modality differences, thereby significantly enhancing the accuracy and resilience of global place recognition.
For example, an autonomous vehicle can achieve precise visual place recognition on a pre-built large-scale 3D point cloud map using query images captured by an inexpensive camera, as shown in Figure 1. Given the complementary nature of image and point cloud data, several multimodal fusion methods [4,5,6] have been developed to combine 2D and 3D data, significantly improving the precision of global localization. However, due to the inherent differences between these two modalities, fusion remains a challenging task. Furthermore, existing multimodal methods have not fully tackled the issue of performing global place recognition from one modality’s sensor to another modality’s map. To overcome the data representation differences between image and point cloud information, some methods [7,8] convert sensor data into intermediate representations, such as bird’s-eye views or depth images. Others [9,10,11] aim to extract structural information from visual data to align it with point clouds or directly use deep neural networks to match the data. These methods, however, are resource-intensive and involve large model parameters. Specifically, processing 3D point clouds often requires handling vast and complex data, which leads to high hardware costs. Moreover, even with sufficient training data, models may struggle to maintain strong performance on unseen data.
To address these limitations, we propose a novel and efficient framework for cross-modal place recognition, designed to learn discriminative, scale-invariant features while minimizing computational overhead. The core idea is to bridge the modality gap by first transforming both sensor inputs into a unified 2D representation and then leveraging a multi-scale self-distillation paradigm within a Transformer-based architecture. We first project the 3D point cloud into a 2D spherical representation and estimate a dense depth map from the query RGB image. This step not only creates a common ground for feature extraction but also significantly reduces the complexity associated with processing raw 3D data. Subsequently, we employ a shared-weight Swin Transformer backbone to extract hierarchical features from both 2D representations. The strength of the Swin Transformer lies in its ability to capture both local details and global context through its shifted window mechanism, producing feature maps at multiple scales. The key innovation of our work lies in how we supervise the learning of these multi-scale features. We introduce a multi-scale teacher–student learning strategy, a form of self-distillation. Specifically, we designate the features from the coarsest scale (e.g., the final stage), which encapsulate the most abstract and global semantic information, as the “teacher”. The features from finer scales (e.g., the earlier stages), which retain more spatial details, act as “students”. Through a Scale Consistency Supervision loss, we compel the student features to align with the teacher’s representation. This process enforces intra-modal consistency, forcing the network to learn features that are invariant to scale variations and viewpoint changes.
Finally, the global descriptors extracted from the “teacher” level of both modalities are used to compute a Global Contrastive Loss. Inspired by the success of contrastive learning paradigms like Contrastive Language-Image Pre-training (CLIP) [12] in pairing heterogeneous data, and in contrast to methods relying on triplet loss, this loss function pushes the descriptors of corresponding image-point cloud pairs closer together in the embedding space while pulling all other non-matching pairs within the batch apart. By integrating intra-modal scale consistency with inter-modal contrastive learning, our model effectively learns a unified embedding space that is highly discriminative.
In summary, the main contributions of our work are as follows:
  • We propose a lightweight framework that unifies 2D images and 3D point clouds into common 2D representations, enabling the use of an efficient Swin Transformer backbone for cross-modal place recognition.
  • We introduce a novel multi-scale teacher–student self-distillation paradigm. By enforcing consistency between coarse-scale (teacher) and fine-scale (student) features within each modality, our model learns highly discriminative and scale-invariant descriptors.
  • Our method achieves state-of-the-art performance on the KITTI and KITTI-360 benchmarks. Notably, a model trained solely on the KITTI dataset demonstrates exceptional generalization, achieving a Recall@1 exceeding 57.5% on eight sequences from KITTI-360 at a 20 m threshold, without any fine-tuning.

2. Related Work

2.1. Image-Based Place Recognition

Early visual place recognition research relied on manually designed features like SIFT [13] and ORB [14], which are designed to extract local features from the image. These methods performed well in static environments but were sensitive to lighting changes. With the advent of deep learning, CNN-based place recognition techniques [15] have seen substantial advancements. NetVLAD [1] introduced weakly supervised learning via triplet loss, enhancing feature associations. Integrating generalized mean pooling (GeM) [16] with contrastive learning significantly enhances the discriminative power of the extracted features. Hybrid architectures like Patch-NetVLAD [17] optimized local and global features through hierarchical fusion. Several researchers [2] have leveraged the Transformer architecture to tackle the challenge of image geolocation from multiple angles, effectively addressing the limitations inherent in traditional methods based on convolutional neural networks. More recently [18], depth estimation has been used to provide geometric verification for feature matching, and a two-stage retrieval process (global coarse localization followed by local fine matching) has markedly improved search efficiency in large-scale scenes.

2.2. Point-Cloud-Based Place Recognition

In recent years, point cloud-based place recognition methods have advanced along two primary directions: direct point cloud processing and structured projection. PointNetVLAD [19] integrates deep learning by generating descriptors via PointNet [20] and NetVLAD [1], but its point-wise mechanism limits local geometric modeling. Structured projection methods, including range image-based projection [21,22] and bird’s-eye view-based projection [23,24], transform unordered point cloud data into two-dimensional representation formats. OverlapNet [21] estimates the overlap and relative pose between 3D point clouds by leveraging siamese networks and channel attention mechanisms. BVMatch [23] generates descriptors by fusing multi-scale bird’s-eye view (BEV) features and employs the Bag-of-Words (BoW) method for frame retrieval. Recently, researchers [25] have improved robustness to field-of-view changes by fusing range and BEV images using cross-view transformers, achieving reliable place recognition.

2.3. Image to Point Cloud Place Recognition

Despite significant advancements in single-modal place recognition, cross-modal place recognition from images to point clouds still suffers from low accuracy. Consequently, there is considerable potential for further exploration in achieving image-based place recognition on large-scale point cloud maps. Previous research [26,27] has primarily focused on extracting shared embedding representations between 2D images and 3D point clouds but has largely overlooked the alignment and interaction of information across these two modalities, leading to suboptimal generalization performance. Methods such as those described in [28,29] focus on detecting local features within images and point clouds, facilitating the identification of cross-modal correspondences for place retrieval. The 2D3D-MatchNet framework [29] generates descriptors for images and point clouds using VGG and PointNet networks, respectively, trained with a triplet loss. However, this approach is limited to small-scale environments due to the computational demands of dense point cloud representation, making it challenging to scale effectively to larger environments. (LC)2 [8] integrates RGB images and point clouds into unified 2D depth images, encodes them into vector representations, and subsequently optimizes these representations via contrastive and triplet losses in a two-stage network training. I2P-Rec [7] leverages a depth estimation algorithm to reconstruct point clouds from images, mitigating modality differences, and utilizes BEV projection for global feature extraction to accurately identify and locate camera images within large-scale point cloud maps. LIP-Loc [30] achieves cross-modal place recognition by leveraging a batch contrastive loss for pre-training on a large number of image and point cloud pairs.
ModaLink [31] introduces a field-of-view transformation module that converts point clouds into range images, aligning modality information and providing more discriminative global descriptors through its designed encoder. However, these methods still fall short of achieving superior recognition accuracy at high precision levels.

3. Methods

This paper introduces MS2-CL, a lightweight cross-modal place recognition framework capable of achieving high accuracy on both seen and unseen datasets.

3.1. Data Processing

(1) Range image generation: The point cloud data, represented in sparse Cartesian coordinates (x, y, z), is transformed into a 2D range image to unify the modalities by emulating the spherical projection characteristics of a LiDAR sensor. Following an initial filtering stage that removes points exceeding the predefined maximum detection range R_{max}, we calculate the Euclidean distance from each remaining point to the sensor origin as the depth measurement r_i.
P = \{ p_i \mid p_i = (x_i, y_i, z_i),\; i = 1, \dots, N \},
P' = \{ p_i \mid 0 < \| v_i \|_2 < R_{max},\; v_i = (x_i, y_i, z_i) \},
r_i = \sqrt{x_i^2 + y_i^2 + z_i^2},
For the remaining valid points, we calculate their horizontal (yaw φ_i) and vertical (pitch θ_i) angles based on their 3D coordinates.
\varphi_i = \arctan\left( y_i / x_i \right),
\theta_i = \arcsin\left( z_i / r_i \right),
These angles are then used to determine the corresponding positions in the 2D projection plane. Specifically, the horizontal angle is mapped to a normalized interval, while the vertical position is computed proportionally to the total field of view. The resulting coordinates are scaled to fit the preset image dimensions, ensuring accurate placement of each point in the range image R ( u , v ) .
u_i = \frac{1}{2} \left( \frac{\varphi_i}{\pi} + 1 \right) W,
v_i = \left( 1 - \frac{\theta_i + |\alpha_{down}|}{\alpha} \right) H,
where α = |α_down| + |α_up| is the total vertical field of view and H and W represent the height and width of the range image, respectively. It is important to note that the spherical projection parameters, specifically the vertical field of view angles α_up and α_down, are sensor-dependent. In this work, these parameters are configured to match the Velodyne HDL-64E LiDAR sensor used in the KITTI datasets. For deployment on vehicles with different LiDAR configurations (e.g., 32-beam or solid-state LiDARs), these projection parameters must be adjusted to align with the specific sensor’s vertical field of view to ensure optimal image-to-point cloud alignment. To address potential occlusions due to overlapping points in the original point cloud, we sort all valid points by their distance values in ascending order, ensuring that closer points are not obscured by farther ones. Finally, a range image matrix is constructed, where each element represents the depth value at the corresponding pixel location. Additionally, each pixel value of the generated range image, which represents depth information, is multiplied by a fixed scaling factor β to enhance the contrast of the whole image.
R'(u, v) = \beta \, R(u, v),
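The projection steps above can be collected into a single routine. The sketch below (NumPy) is an illustrative implementation using the paper’s KITTI/HDL-64E configuration (64 × 900 image, 50 m range limit, +3°/−25° vertical field of view), not the authors’ released code:

```python
import numpy as np

def point_cloud_to_range_image(points, H=64, W=900,
                               r_max=50.0, alpha_up=3.0, alpha_down=-25.0,
                               beta=1.0):
    """Project an (N, 3) point cloud to a 2D range image via spherical projection.

    H, W, r_max and the vertical field-of-view angles follow the paper's
    KITTI/HDL-64E configuration; beta is the contrast scaling factor.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)

    # Filter points outside the valid detection range.
    valid = (r > 0) & (r < r_max)
    x, y, z, r = x[valid], y[valid], z[valid], r[valid]

    # Yaw (horizontal) and pitch (vertical) angles.
    phi = np.arctan2(y, x)            # in [-pi, pi]
    theta = np.arcsin(z / r)          # vertical angle

    # Map yaw to image columns and pitch to image rows.
    fov_down = np.deg2rad(alpha_down)
    fov = abs(fov_down) + abs(np.deg2rad(alpha_up))   # total vertical FoV
    u = ((phi / np.pi + 1.0) * 0.5 * W).astype(np.int32)
    v = ((1.0 - (theta + abs(fov_down)) / fov) * H).astype(np.int32)
    u = np.clip(u, 0, W - 1)
    v = np.clip(v, 0, H - 1)

    # Write points in descending range order so closer points, written
    # last, overwrite farther ones that fall on the same pixel.
    order = np.argsort(-r)
    image = np.zeros((H, W), dtype=np.float32)
    image[v[order], u[order]] = r[order]
    return beta * image
```

For sensors with a different vertical field of view, only `alpha_up`/`alpha_down` (and typically `H`) need to change, matching the sensor-dependence noted above.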
(2) Depth image generation: This study utilizes the Monodepth2 [32] framework to process monocular RGB image sequences from the KITTI and KITTI-360 datasets. We selected this method due to its temporal self-supervised learning approach based on monocular video, which constructs a differentiable photometric reprojection loss using camera motion parameters between consecutive frames. This approach implicitly decouples scene depth from motion estimation, effectively mitigating the monocular scale ambiguity issue. To further enhance robustness in dynamic traffic scenes, the method incorporates a multi-frame consistency-driven adaptive masking mechanism [33]. This mechanism effectively filters out moving objects (e.g., other vehicles) that violate the static scene assumption, significantly reducing reprojection error noise in dynamic regions. In this work, RGB images are processed offline at their original resolution, and the resulting depth images from monocular estimation are stored locally, substantially decreasing the time and computational resources required for the entire training process.

3.2. Multi-Scale Self-Supervised Learning Network

(1) Network Architecture: Our proposed framework for cross-modal place recognition employs a dual-stream, multi-scale architecture to achieve coherent feature alignment between visual and LiDAR data. This design comprises three core stages: modality-specific preprocessing, hierarchical feature encoding under a teacher–student paradigm, and a dual-objective loss function for comprehensive supervision. The network’s backbone consists of parallel Swin Transformer encoders, each followed by dedicated projection heads.
As depicted in Figure 2, the framework begins with modality-specific preprocessing. The camera stream converts an RGB image into a dense depth map, while the LiDAR stream transforms a 3D point cloud into a 2D spherical projection. These 2D representations are then fed into their respective Swin Transformer backbones, which extract features at multiple hierarchical scales. A key aspect of our architecture is the intra-modal teacher–student learning paradigm. For each modality, features from the encoder’s final, most global stage (e.g., 7 × 7 patch level) are designated as the ‘teacher’ embedding, representing a holistic scene view. Features from earlier, higher-resolution stages are termed ‘student’ embeddings. This structure facilitates a dual-objective optimization. First, an intra-modal Scale Consistency Supervision loss aligns the student embeddings with their corresponding teacher embedding, ensuring feature consistency across granularities. Second, the main Global Contrastive Loss is computed exclusively between the teacher-level embeddings of the two modalities. This primary loss drives the network to map corresponding camera and LiDAR views to a unified embedding space. This combined supervision enables the learning of discriminative descriptors for high-precision cross-modal place recognition.
(2) Scale Consistency Supervised Learning: Distinction from Conventional Self-Distillation: Our scale consistency supervision differs fundamentally from conventional knowledge distillation frameworks. Traditional self-distillation methods (e.g., [34]) typically distill from a larger teacher model to a smaller student for model compression. In contrast, we perform intra-modal, multi-scale self-distillation within a single encoder, where teacher and student roles are assigned based on feature granularity rather than model size. Specifically, the coarsest-scale embedding (capturing global semantics) supervises finer-scale embeddings (retaining spatial details), enforcing hierarchical consistency across scales. This design is tailored for cross-modal alignment: by ensuring scale-invariant representations within each modality before cross-modal matching, we effectively bridge the camera-LiDAR domain gap. The stop-gradient operation on teacher features prevents interference with the primary contrastive loss, ensuring stable optimization. Unlike existing methods that focus on inter-modal alignment alone, our intra-modal scale consistency provides a crucial regularization that improves feature robustness to viewpoint and scale variations.
Beyond the primary cross-modal alignment, we introduce an auxiliary self-supervised task to enforce feature consistency across different scales within each modality. This is achieved through an intra-modal teacher–student learning paradigm. For a given modality (e.g., camera depth images), we designate the embedding from the final, most global encoder stage as the ‘teacher’ embedding, z_T^D, which captures a coarse-grained, holistic representation of the scene. Embeddings from the preceding, higher-resolution stages, \{ z_s^D \}_{s=1}^{S-1}, are treated as ‘student’ embeddings, representing finer-grained local features.
The Scale Consistency Loss, L S C , then minimizes the distance between each student embedding and its corresponding teacher embedding. Crucially, the teacher embedding is treated as a fixed target by stopping the gradient flow from this auxiliary loss, ensuring that its representation is guided solely by the main global contrastive task. This process is applied symmetrically to both the camera depth (D) and LiDAR range (R) images. The loss for each modality is formulated as:
\mathcal{L}_{SC}^{D} = \frac{1}{S-1} \sum_{s=1}^{S-1} \mathcal{L}_{dist}\left( z_s^D, \mathrm{sg}(z_T^D) \right),
\mathcal{L}_{SC}^{R} = \frac{1}{S-1} \sum_{s=1}^{S-1} \mathcal{L}_{dist}\left( z_s^R, \mathrm{sg}(z_T^R) \right),
where S is the total number of scales, sg ( · ) denotes the stop-gradient operation, and L d i s t is a distance metric, for which we use the Smooth L1 loss. The total objective for scale consistency supervision is the sum of the individual modality losses:
\mathcal{L}_{SC} = \mathcal{L}_{SC}^{D} + \mathcal{L}_{SC}^{R}.
This auxiliary objective encourages the model to learn multi-scale descriptors that are hierarchically consistent, thereby enhancing feature robustness against variations in viewpoint and distance.
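A minimal NumPy sketch of the per-modality scale consistency loss. `smooth_l1` follows the standard Smooth L1 (Huber) definition; since plain NumPy has no autograd, the stop-gradient on the teacher is only indicated by comment (in a framework such as PyTorch it would be a `detach()`):

```python
import numpy as np

def smooth_l1(x, y, beta=1.0):
    """Element-wise Smooth L1 (Huber) distance, averaged over all entries."""
    d = np.abs(x - y)
    return np.mean(np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta))

def scale_consistency_loss(student_embs, teacher_emb):
    """L_SC for one modality: mean Smooth L1 between each student embedding
    and the teacher embedding. The teacher is treated as a fixed target
    (stop-gradient); here it is simply a constant array."""
    return float(np.mean([smooth_l1(z, teacher_emb) for z in student_embs]))
```

The total objective is then the sum of this quantity computed for the depth-image and range-image streams.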
(3) Global Contrastive Loss Function: Different from the triplet loss commonly employed in previous methods, we train the proposed network architecture for Image-to-Point Cloud correspondence retrieval using a cross-modal mini-batch contrastive loss function. Given a batch of 2D depth image descriptors \{E_i^D\}_{i=1}^{N} and their corresponding matching 2D range image descriptors \{E_i^R\}_{i=1}^{N}, where N represents the mini-batch size, this batch contains N positive pairings and N^2 − N negative pairings. For each pairing, the contrastive loss is computed as follows.
l(i, D, R) = -\log \frac{\exp(E_i^D \cdot E_i^R / \tau)}{\sum_{j=1}^{N} \exp(E_i^D \cdot E_j^R / \tau)} - \log \frac{\exp(E_i^R \cdot E_i^D / \tau)}{\sum_{j=1}^{N} \exp(E_i^R \cdot E_j^D / \tau)},
where τ is the temperature, similar to CLIP [12]. The parameter τ scales the logits to control the sharpness of the probability distribution. A properly chosen τ assists the model in mining hard negatives by emphasizing difficult samples during gradient computation. Following standard practices [30], we empirically set τ = 1.0 in our experiments. Throughout the training process using mini-batch, the objective function for the paired contrastive loss between images and point clouds is defined as follows:
\mathcal{L}(D, R) = \frac{1}{N} \sum_{i=1}^{N} l(i, D, R).
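The pairwise loss can be computed for a whole mini-batch at once. The following NumPy sketch assumes L2-normalized descriptors and mirrors the symmetric CLIP-style formulation; it is illustrative rather than the authors’ implementation:

```python
import numpy as np

def global_contrastive_loss(E_D, E_R, tau=1.0):
    """Symmetric mini-batch contrastive loss.

    E_D, E_R: (N, d) L2-normalized depth- and range-image descriptors,
    where row i of each forms a matching pair. tau is the temperature.
    """
    logits = E_D @ E_R.T / tau                   # (N, N) similarity matrix
    N = logits.shape[0]

    def matched_pair_ce(lg):
        # -log softmax of the diagonal (matching-pair) entries
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(N), np.arange(N)]

    # Depth->range and range->depth directions, summed per pair, then
    # averaged over the batch, matching L(D, R) = (1/N) sum_i l(i, D, R).
    return float((matched_pair_ce(logits) + matched_pair_ce(logits.T)).mean())
```

Lowering `tau` sharpens the softmax and increases the relative gradient weight of hard negatives, which is the mechanism described above for the temperature parameter.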
(4) Training Strategy: Our training process, detailed in Algorithm 1, is driven by the joint optimization of the global contrastive loss and the intra-modal scale consistency loss in a self-supervised manner. We employ the AdamW optimizer for model training and adopt a differential learning rate strategy. Specifically, the learning rates for the depth and range image encoders are set to 1 × 10^{−4}, while the projection heads are trained with a higher learning rate of 1 × 10^{−3} to facilitate faster adaptation. The weighting factor λ for the scale consistency loss is set to 0.5. To ensure training stability, gradients are clipped with a maximum norm of 1.0.
Algorithm 1 Multi-Scale Self-Supervised Cross-Modal Learning (MS2-CL)
Require: Depth image I_cam, LiDAR range image I_lid, temperature τ, loss weight λ
Ensure: Cross-modal matching model with scale-invariant embeddings
 1: Initialize:
 2:     Multi-scale Swin encoders E_cam, E_lid
 3:     Projection heads {P_cam^s}_{s=1}^{4}, {P_lid^s}_{s=1}^{4}
 4:     Scale consistency loss L_sc and contrastive loss L_ctr
 5: Step 1: Multi-scale feature extraction
 6:     {F_cam^s}_{s=1}^{4} ← E_cam(I_cam)
 7:     {F_lid^s}_{s=1}^{4} ← E_lid(I_lid)
 8: Step 2: Feature projection at each scale
 9:     for s = 1 to 4 do
10:         Z_cam^s ← P_cam^s(F_cam^s)
11:         Z_lid^s ← P_lid^s(F_lid^s)
12:     end for
13: Step 3: Teacher–student separation (coarse → teacher, fine → student)
14:     Z_cam^T ← Z_cam^4,   Z_lid^T ← Z_lid^4
15:     S_cam ← {Z_cam^1, Z_cam^2, Z_cam^3}
16:     S_lid ← {Z_lid^1, Z_lid^2, Z_lid^3}
17: Step 4: Intra-modal scale consistency supervision
18:     L_sc^cam ← (1/3) Σ_{Z ∈ S_cam} ‖Z − Z_cam^T‖_1
19:     L_sc^lid ← (1/3) Σ_{Z ∈ S_lid} ‖Z − Z_lid^T‖_1
20: Step 5: Cross-modal contrastive learning (teacher embeddings only)
21:     Compute logits: ℓ_ij = (Z_lid,i^T · Z_cam,j^T) / τ
22:     L_ctr ← (1/2)(CE(ℓ) + CE(ℓ^⊤))
23: Step 6: Final objective
24:     L ← L_ctr + λ(L_sc^cam + L_sc^lid)
25: return Trained model capable of cross-modal place recognition
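The loss wiring of Algorithm 1 can be sketched end to end with random stand-in embeddings (NumPy; the encoders and projection heads are replaced by random L2-normalized vectors, so only the combination of the two objectives is illustrated):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2n(x):
    """Row-wise L2 normalization."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(A, B, tau):
    """Mean cross-entropy of the matching (diagonal) pairs of A @ B.T / tau."""
    lg = A @ B.T / tau
    lg = lg - lg.max(axis=1, keepdims=True)              # numerical stability
    lp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
    return -lp[np.arange(len(A)), np.arange(len(A))].mean()

# Hypothetical per-scale projected embeddings for one batch (N=8, d=128);
# scales 1-3 are students, scale 4 is the teacher, for each modality.
N, d, lam, tau = 8, 128, 0.5, 1.0
Z_cam = [l2n(rng.normal(size=(N, d))) for _ in range(4)]
Z_lid = [l2n(rng.normal(size=(N, d))) for _ in range(4)]

# Step 4: intra-modal scale consistency (L1 distance, as in Algorithm 1).
L_sc_cam = np.mean([np.abs(Z - Z_cam[3]).mean() for Z in Z_cam[:3]])
L_sc_lid = np.mean([np.abs(Z - Z_lid[3]).mean() for Z in Z_lid[:3]])

# Step 5: cross-modal contrastive loss on the teacher embeddings only,
# symmetrized over the LiDAR->camera and camera->LiDAR directions.
L_ctr = 0.5 * (info_nce(Z_lid[3], Z_cam[3], tau) + info_nce(Z_cam[3], Z_lid[3], tau))

# Step 6: final objective with the paper's lambda = 0.5.
L = L_ctr + lam * (L_sc_cam + L_sc_lid)
```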

4. Experiments

In this section, we evaluate the performance of our MS2-CL method in image-based place recognition on large-scale point cloud maps. Using the KITTI and KITTI-360 datasets, we assess the recall@N(%) accuracy of MS2-CL and compare it against several baseline models. Furthermore, we perform a zero-shot transfer analysis of our method on the KITTI-360 dataset. Lastly, ablation studies are conducted to investigate the contributions of key algorithms in our approach.

4.1. Dataset

KITTI. Evaluation is conducted on the KITTI odometry dataset, which provides a large number of synchronized and calibrated LiDAR point clouds and RGB images. This dataset is widely adopted for validating autonomous driving–related algorithms. KITTI consists of 22 sequences (00–21). In our experimental setup, we define sequences 02, 05, 06, and 08 as the testing set to evaluate generalization performance. All remaining sequences (00, 01, 03, 04, 07, 09–21) constitute the training set. Note that while the official odometry benchmark does not provide ground truth for sequences 11–21, we obtain their precise poses from the raw GPS/IMU data included in the KITTI raw dataset, following the practice in LIP-Loc [30]. Additionally, we report the results on Sequence 00 in Table 1 to demonstrate the model’s fitting capability on the training data.
KITTI-360. To further investigate the zero-shot transferability of our method, we utilize the KITTI-360 dataset, which provides data from a Velodyne HDL-64E LiDAR sensor (Velodyne Lidar, San Jose, CA, USA) and raw images captured by a perspective camera. This dataset contains approximately 80,000 frames of LiDAR and image pairs covering a distance of 73.7 km, along with the precise vehicle pose information required for evaluation. We selected sequences 3, 4, 5, 6, 7, 9 and 10 for training and cross-validation and performed the evaluation using sequence 0.

4.2. Implementation and Experimental Setup

Our MS2-CL utilizes a pretrained swin-tiny-patch4-window7-224 as the backbone. For range image generation, we set the detection distance to 50 m, the upward field of view (α_up) to 3 degrees, and the downward field of view (α_down) to −25 degrees. The range image size is configured to be 64 × 900 pixels. Prior to feeding into the encoder, all input images are uniformly resized to 224 × 224 pixels.

4.3. Evaluation Metrics

(1) On the KITTI dataset, we evaluate our model using recall rates at Top-1 and Top-1%. A match between a query image and a retrieved point cloud is deemed positive if their Euclidean distance is less than 10 m. Recall at Top-N, a standard metric for retrieval tasks, is defined as the percentage of queries for which at least one correct match is found within the top-N ranked candidates. It is formally calculated as:
\mathrm{Recall@}N = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{I}\left[ R_N(q) \cap D_q^+ \neq \emptyset \right],
where Q is the set of all queries, R N ( q ) is the set of top-N retrieved candidates for a given query q, and D q + is the set of all ground truth positive matches for q in the database. The indicator function I [ · ] returns 1 if its condition is true (i.e., at least one positive match exists in the top-N results) and 0 otherwise.
Recall@Top-1% Definition: The Recall@Top-1% metric extends this concept by evaluating retrieval performance within the top 1% of the database. Formally:
\mathrm{Recall@Top\text{-}1\%} = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{I}\left[ R_{\lceil 0.01 \times |D| \rceil}(q) \cap D_q^+ \neq \emptyset \right],
where |D| is the database size (number of point clouds) for the sequence, ⌈0.01 × |D|⌉ denotes the top 1% of candidates (rounded up), R_k(q) represents the top-k retrieved point clouds for query q, and D_q^+ denotes the set of positive (correct) matches for query q. For example, in a sequence with 4500 point clouds in the database, Top-1% evaluates whether a correct match appears within the top 45 (4500 × 0.01 = 45) retrieved candidates. This metric assesses retrieval quality beyond the single best match, which is critical for downstream re-ranking or geometric verification stages in practical SLAM systems. It has been widely adopted in prior place recognition literature [20,23,33].
Specifically, for each test sequence, the Query Set Q is constructed using all RGB images in that sequence, while the Database Set comprises all corresponding LiDAR point clouds. Consequently, for a sequence of length M, both the query and database sets contain M samples. The retrieval is performed within the same sequence, where each query image searches against the entire point cloud database of that sequence.
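The retrieval protocol above can be sketched as follows (NumPy; descriptor and pose arrays are assumed precomputed and descriptors L2-normalized, and Recall@Top-1% is obtained by setting n = ⌈0.01 × |D|⌉):

```python
import numpy as np

def recall_at_n(query_desc, db_desc, query_pos, db_pos, n=1, dist_thresh=10.0):
    """Recall@N for cross-modal retrieval: a query counts as correct if any
    of its top-N database matches (ranked by descriptor similarity) lies
    within dist_thresh metres of the query's ground-truth position."""
    sims = query_desc @ db_desc.T                 # cosine similarity matrix
    topn = np.argsort(-sims, axis=1)[:, :n]       # top-N candidate indices
    hits = 0
    for q, cand in enumerate(topn):
        d = np.linalg.norm(db_pos[cand] - query_pos[q], axis=1)
        hits += bool(np.any(d < dist_thresh))
    return 100.0 * hits / len(query_desc)
```

For the KITTI protocol, `dist_thresh` is 10 m; for KITTI-360 it is 20 m, with n ∈ {1, 5, 20}.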
(2) On the KITTI-360 dataset, we utilize recall rates at TOP-1, TOP-5, and TOP-20 as evaluation metrics. A match is deemed positive if the Euclidean distance between the query image and the retrieved point cloud is less than 20 m.

4.4. Evaluation for Place Recognition on the KITTI Dataset

We compare our method with the baseline [28], I2P-Rec [7], (LC)2 [8], LIP-Loc [30] and ModaLink [31] on the KITTI dataset. The baseline approach [28] employs a DNN to process images and point clouds, generating descriptors using NetVLAD [1]. I2P-Rec utilizes various depth estimation algorithms such as monocular depth estimation (MIM [35]) and stereo matching estimation (PSM [36] and LEA [37]) to recover point clouds. It then generates descriptors based on BEV images of the recovered point clouds and uses PointNetVLAD to generate global descriptors for the point clouds. (LC)2 [8] transforms RGB images and point clouds into 2.5D depth images and performs a two-stage training process utilizing distinct loss functions. LIP-Loc [30] leverages large-scale pre-trained image-point cloud pairs and uses a batch contrastive loss function to bring them closer in the embedding space. ModaLink [31] transforms point clouds into image perspectives and relies on explicit FoV transformations for alignment, utilizing CNNs and NMF with triplet loss to generate descriptors. Distinct from ModaLink’s reliance on rigid geometric alignment, our proposed MS2-CL introduces a Teacher–Student Scale Consistency strategy within a Swin Transformer architecture. This enables implicit feature-level learning and utilizes a global contrastive loss, ensuring superior robustness compared to the local constraints of triplet loss.
Monocular vs. Stereo Input Analysis: It is noteworthy that our MS2-CL achieves state-of-the-art performance using only monocular depth estimation, while some strong baselines (PSM-I2P-Rec*, LEA-I2P-Rec*) require stereo camera systems. As shown in Table 2, our method achieves a 13.1% higher average Recall@1 than stereo-based LEA-I2P-Rec* while requiring only half the cameras.
This demonstrates that our multi-scale contrastive learning framework can extract more discriminative cross-modal features even from limited visual input. The monocular design also offers significant practical advantages: (1) Lower cost: single-camera systems are significantly cheaper than stereo rigs; (2) Reduced complexity: no camera synchronization or baseline calibration required; (3) Better robustness: no failure modes from stereo matching errors; (4) Easier deployment: compatible with existing monocular camera infrastructure in production vehicles.
As shown in Table 1, our proposed MS2-CL, which relies solely on monocular images, significantly outperforms all depth-estimation-based I2P-Rec variants, and surpasses ModaLink trained with the triplet loss on all sequences except Seq-05.
Furthermore, to demonstrate the performance of our method under progressively stringent evaluation criteria, we report the Top-1 recall rate at distance thresholds of 1 m, 4 m, 7 m, 10 m, 13 m and 16 m, as illustrated in Table 3. To rigorously benchmark our performance, we compared MS2-CL with the state-of-the-art baseline ModaLink [31] under identical settings. As evidenced in Table 3, MS2-CL exhibits superior robustness, particularly on the challenging KITTI-02 sequence, where it surpasses the baseline by 8.8 percentage points at the strictest 1 m threshold. Although ModaLink shows competitive results on sequence 05 at relaxed thresholds, our method consistently secures the highest recall at the precise 1 m threshold across all evaluated sequences. Overall, the experimental results confirm that MS2-CL achieves a recall exceeding 68% even under the tightest geometric constraint.
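The threshold-based evaluation above can be sketched as follows. This is an illustrative helper, not the authors' released evaluation code; the array names and shapes are our assumptions.

```python
import numpy as np

def recall_at_1(query_desc, db_desc, query_pos, db_pos, thresholds):
    """Top-1 recall under several distance thresholds (meters).

    query_desc/db_desc: L2-normalized global descriptors, shapes (N, D)/(M, D).
    query_pos/db_pos:   ground-truth 2D positions in meters, shapes (N, 2)/(M, 2).
    Returns {threshold: recall} for each threshold.
    """
    sim = query_desc @ db_desc.T                    # cosine similarity matrix
    top1 = sim.argmax(axis=1)                       # best database match per query
    err = np.linalg.norm(query_pos - db_pos[top1], axis=1)  # localization error (m)
    return {t: float((err <= t).mean()) for t in thresholds}
```

A retrieval counts as correct at threshold t whenever the Top-1 candidate's ground-truth position lies within t meters of the query position, matching the protocol of Table 3.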

4.5. Evaluation for Place Recognition on the KITTI-360 Dataset

We compare our method with AECMLoc [38], a technique that relies on fisheye images and necessitates more intricate preprocessing, as well as the benchmark LIP-Loc [30] for cross-modal localization on the KITTI-360 dataset. Zero-shot LIP-Loc was trained using all sequences selected from the KITTI dataset. To ensure a fair comparison, our Zero-shot MS2-CL model was trained on the exact same set of sequences.
As illustrated in Table 4, under the LIP-Loc evaluation protocol, our method achieves a Recall@1 of 79.8%, and the zero-shot results provide initial evidence of MS2-CL's strong generalization capability.
Moreover, our proposed method MS2-CL, as shown in Figure 3, is benchmarked against state-of-the-art methods including LIP-Loc and AECMLoc. The results demonstrate that MS2-CL consistently outperforms all baselines across the entire range of thresholds.

4.6. Zero-Shot Generalization Analysis

To rigorously evaluate the generalization capability of our MS2-CL model, we performed a challenging zero-shot cross-dataset evaluation. The model, trained exclusively on the KITTI dataset, was directly tested on the KITTI-360 dataset without any fine-tuning or retraining. As highlighted by prior work [38], this task presents a significant domain shift due to fundamental differences in sensor hardware and data acquisition: KITTI-360 utilizes a 180° fisheye camera, whereas KITTI employs a 90° standard perspective camera. This leads to drastic variations in image distortion and field of view. Furthermore, the two datasets exhibit no trajectory overlap, ensuring that the model is evaluated on entirely new environments. These factors collectively establish KITTI-360 as a true out-of-distribution testbed for our KITTI-trained model.
The quantitative results of this zero-shot evaluation are presented in Table 5. Despite the substantial domain gap and the absence of any adaptation, MS2-CL demonstrates remarkable generalization performance. The model achieves an average Recall@1 of 61.4% across all eight test sequences.

4.7. Ablation Study

To validate the effectiveness of our proposed Scale Consistency Loss and to determine its optimal contribution to the overall training objective, we conducted a series of ablation studies on the KITTI dataset. The results, summarized in Table 6, analyze the model’s performance by varying the loss weight λ .
Effectiveness of Scale Consistency Loss: The most significant finding is the critical role of the Scale Consistency Loss ( L S C ). By setting λ = 0 , we establish a baseline model trained solely with the global contrastive loss. Compared to our final model ( λ = 0.5 ), this baseline exhibits a catastrophic drop in performance. Specifically, Recall@1 plummets by approximately 28–32 percentage points across all test sequences (e.g., from 88.5% down to 56.3% on Seq-06). A similar drastic degradation is observed for Recall@1%. This clearly demonstrates that enforcing intra-modal consistency between multi-scale features is not merely an auxiliary task but a fundamental component for learning scale-invariant and discriminative descriptors. Without this supervision, the network fails to build a coherent hierarchical representation, resulting in poor retrieval accuracy.
Impact of the Weighting Factor λ : To pinpoint the optimal contribution of the Scale Consistency Loss, we conducted a fine-grained analysis with λ { 0 ,   0.25 ,   0.5 ,   0.75 ,   1.0 } . As shown in Table 6, the results exhibit a clear bell-shaped trend. Introducing the auxiliary loss with λ = 0.25 yields significant improvements over the baseline, yet it falls short of the peak performance. The performance reaches its maximum at λ = 0.5 . However, further increasing λ beyond this point leads to progressive degradation. For instance, on Seq-05, Recall@1 drops from 89.4% ( λ = 0.5 ) to 80.1% ( λ = 0.75 ) and finally to 76.9% ( λ = 1.0 ). This confirms that λ = 0.5 provides the optimal balance, effectively regularizing the feature hierarchy without interfering with the primary global contrastive learning objective.
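As a concrete illustration of the weighted objective analyzed above, the sketch below combines a symmetric InfoNCE-style global contrastive loss with a cosine-based scale consistency term weighted by λ. The specific loss formulations here are our assumptions for illustration; the paper's exact definitions may differ.

```python
import numpy as np

def info_nce(img_emb, pc_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of matched image/point-cloud embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    pc = pc_emb / np.linalg.norm(pc_emb, axis=1, keepdims=True)
    logits = img @ pc.T / tau                       # (B, B); diagonal = positive pairs
    ls_i2p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ls_p2i = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -0.5 * (np.mean(np.diag(ls_i2p)) + np.mean(np.diag(ls_p2i)))

def scale_consistency(teacher, student):
    """Mean cosine distance between coarse (teacher) and fine (student) features."""
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(t * s, axis=1)))

def total_loss(img_emb, pc_emb, teacher, student, lam=0.5):
    """Combined objective: global contrastive loss + lambda * scale consistency."""
    return info_nce(img_emb, pc_emb) + lam * scale_consistency(teacher, student)
```

Setting `lam=0` recovers the contrastive-only baseline of Table 6, while `lam=0.5` corresponds to the configuration used in our final model.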

4.8. Backbone Architecture Comparison

To validate our choice of Swin Transformer as the feature extraction backbone, we conducted a systematic comparison with three representative architectures under identical training settings:
ResNet-50 + FPN: While the Feature Pyramid Network provides multi-scale features, the CNN-based architecture captures strong local patterns but exhibits limited global context modeling capability, resulting in an average Recall@1 of 70.2%.
ViT-Small: The Vision Transformer excels at capturing global dependencies through self-attention but lacks the hierarchical multi-scale structure inherent to our teacher-student paradigm. Its uniform patch-based design is less suitable for the spatially structured range/depth images, achieving 75.9% average recall.
ConvNeXt-Tiny: This modernized CNN architecture balances local and global modeling, achieving 78.5% average recall. However, it still falls short of the Swin Transformer due to its lack of explicit hierarchical attention mechanisms.
Swin Transformer (Ours): Achieves the highest average Recall@1 of 86.6%, outperforming ConvNeXt by +8.1%, ViT by +10.7%, and ResNet + FPN by +16.4%.
As shown in Table 7, these results validate that the Swin Transformer is the optimal backbone choice for our multi-scale self-supervised cross-modal learning framework. The superior performance of the Swin Transformer over the standard ViT-Small can be attributed to its hierarchical architecture. While ViT processes images as uniform patches at a constant resolution, the Swin Transformer employs a shifted window mechanism to construct hierarchical feature maps. This structure allows our model to capture both fine-grained local geometric details (essential for depth-to-point cloud matching) and high-level semantic context, which aligns well with our multi-scale self-distillation paradigm.
Qualitative retrieval results are visualized in Figure 4, which further confirms the effectiveness of our method.

4.9. Complexity and Runtime Analysis

All experiments were conducted on a workstation equipped with an Intel i7-13700K CPU (Intel, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA). To validate the efficiency of the proposed MS2-CL, we analyze its model size and inference latency. The total number of parameters of our network is 29.6 M. Note that the backbone alone already accounts for 26.9 M, meaning our proposed cross-modal learning modules introduce only a marginal increase of 2.7 M parameters, adhering to a lightweight design.
In terms of inference speed, our method is highly efficient. Excluding the offline depth estimation step, feature extraction takes an average of 5.0 ms per frame. Although depth estimation is performed offline in our experiments for training efficiency, state-of-the-art lightweight monocular depth estimation networks achieve inference latencies of approximately 25-30 ms on modern onboard GPUs. Consequently, the complete pipeline, including online depth estimation and retrieval, remains well within the real-time latency constraints of autonomous driving systems. Candidates are ranked by cosine similarity, so Top-1 point cloud retrieval takes only 0.03 ms. This yields real-time performance suitable for onboard deployment.
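The cosine-similarity ranking described above reduces to a single matrix-vector product over L2-normalized descriptors, which is why Top-1 retrieval costs well under a millisecond even for thousands of candidates. A minimal sketch, with assumed array shapes (not the authors' deployment code):

```python
import numpy as np

def retrieve_top1(query_desc, db_desc):
    """Return the index and score of the database descriptor most similar
    to the query under cosine similarity.

    query_desc: global descriptor of the query image, shape (D,).
    db_desc:    global descriptors of the point cloud map, shape (M, D).
    """
    q = query_desc / np.linalg.norm(query_desc)
    db = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    scores = db @ q                      # (M,) cosine similarities
    best = int(scores.argmax())          # Top-1 candidate
    return best, float(scores[best])
```

For Top-k retrieval, replacing `argmax` with `np.argpartition(scores, -k)[-k:]` avoids a full sort over the database.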

5. Discussion

While our experimental results demonstrate the superior retrieval performance of MS2-CL on benchmark datasets, it is essential to contextualize these findings within real-world autonomous driving tasks. In this section, we discuss the practical implications of our image-to-point cloud place recognition framework, specifically focusing on global initialization, loop closure, and system redundancy.

5.1. Global Initialization in HD Maps

A critical challenge in autonomous driving is the “Kidnapped Robot Problem,” where a vehicle must determine its global pose within a map without any prior knowledge of its location. This typically occurs when a vehicle starts up in a GNSS-denied environment, such as an underground parking lot or an urban canyon. In this scenario, high-precision localizers often fail because they require a good initial pose guess to converge. Our MS2-CL serves as a robust Global Initialization module. By retrieving the most similar LiDAR submap from the pre-built 3D HD map using only the current camera view, our method provides a reliable coarse location estimate. This estimate acts as the initial seed for subsequent fine-grained 6-DoF pose estimation algorithms, significantly narrowing the search space and preventing local minima convergence.

5.2. Cross-Modal Loop Closure Detection

Long-term Simultaneous Localization and Mapping (SLAM) systems inevitably suffer from position drift over time. In a multi-sensor setup, our method functions as an effective Loop Closure Detection module. Consider a scenario where a mapping vehicle equipped with LiDAR has previously traversed a route, creating a geometric map. Later, a user vehicle equipped only with cameras revisits the same route. MS2-CL can identify this revisit by matching the current visual frame against the historical geometric map. By establishing a constraint between the current visual observation and the stored map, the backend optimization system (e.g., Pose Graph Optimization) can correct the accumulated drift, thereby maintaining the global consistency of the trajectory.

5.3. Safety Redundancy and Cost Efficiency

For Level 4 and Level 5 autonomous driving systems, sensor redundancy is a non-negotiable requirement for safety certification. In scenarios where primary localization sensors degrade—for example, if the GNSS signal is jammed or the on-board LiDAR fails due to mechanical issues—our vision-based retrieval against the 3D database provides a critical safety fallback layer. Furthermore, this cross-modal capability supports a cost-efficient fleet management model. It allows for a heterogeneous fleet structure: a small number of expensive, LiDAR-equipped vehicles can be responsible for maintaining the HD map, while a large fleet of mass-produced, camera-only vehicles can localize themselves within that map. This asymmetry significantly reduces the hardware cost for end-user vehicles while leveraging the high accuracy of LiDAR-based maps.

6. Conclusions

We proposed MS2-CL, a lightweight framework that significantly advances cross-modal place recognition. Distinct from methods focusing solely on cross-modal alignment, our approach introduces an intra-modal Scale Consistency Loss. This loss complements a global contrastive objective by ensuring features learned via a teacher–student paradigm are tightly aligned across multiple scales. This dual-supervision strategy leads to state-of-the-art results on the KITTI benchmark. Furthermore, our discussion on practical applications highlights the critical utility of MS2-CL in real-world autonomous driving tasks, specifically for global initialization in HD maps, cross-modal loop closure detection, and serving as a cost-effective redundancy layer for safety-critical systems. Finally, the model’s exceptional zero-shot performance on the out-of-distribution KITTI-360 dataset validates its strong generalization capabilities and potential for scalable deployment.

Author Contributions

Methodology, L.M.; software, L.M. and X.Z.; validation, L.M.; investigation, L.M.; resources, W.L.; data curation, L.M. and X.Z.; writing original draft preparation, L.M.; review and editing, L.M.; visualization, L.M.; supervision, W.L. and Z.D.; funding acquisition, W.L. and Z.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the National Natural Science Foundation of China under Grant No. 62372049.

Data Availability Statement

The code and pre-trained models are available at https://github.com/ml-bupt, accessed on 26 February 2026.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Arandjelović, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5297–5307. [Google Scholar]
  2. Zhu, S.; Shah, M.; Chen, C. TransGeo: Transformer is all you need for cross-view image geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1162–1171. [Google Scholar]
  3. Khaliq, A.; Milford, M.; Garg, S. Multires-NetVLAD: Augmenting place recognition training with low-resolution imagery. IEEE Robot. Autom. Lett. 2022, 7, 3882–3889. [Google Scholar] [CrossRef]
  4. Lai, H.; Yin, P.; Scherer, S. Adafusion: Visual-lidar fusion with adaptive weights for place recognition. IEEE Robot. Autom. Lett. 2022, 7, 12038–12045. [Google Scholar] [CrossRef]
  5. Oertel, A.; Cieslewski, T.; Scaramuzza, D. Augmenting visual place recognition with structural cues. IEEE Robot. Autom. Lett. 2020, 5, 5534–5541. [Google Scholar] [CrossRef]
  6. Ratz, S.; Dymczyk, M.; Siegwart, R.; Dube, R. Oneshot global localization: Instant lidar-visual pose estimation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020. [Google Scholar]
  7. Zheng, S.; Li, Y.; Yu, Z.; Yu, B.; Cao, S.-Y.; Wang, M.; Xu, J.; Ai, R.; Gu, W.; Luo, L. I2P-Rec: Recognizing images on large-scale point cloud maps through bird’s eye view projections. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 1395–1400. [Google Scholar]
  8. Lee, A.J.; Song, S.; Lim, H.; Lee, W.; Myung, H. (LC)2: LiDAR-camera loop constraints for cross-modal place recognition. IEEE Robot. Autom. Lett. 2023, 8, 3589–3596. [Google Scholar] [CrossRef]
  9. Cattaneo, D.; Vaghi, M.; Ballardini, A.L.; Fontana, S.; Sorrenti, D.G.; Burgard, W. CMRNet: Camera to LiDAR-map registration. In Proceedings of the IEEE Intelligent Transportation Systems Conference, Auckland, New Zealand, 27–30 October 2019; pp. 1283–1289. [Google Scholar]
  10. Chang, M.-F.; Mangelson, J.; Kaess, M.; Lucey, S. HyperMap: Compressed 3D map for monocular camera registration. In Proceedings of the IEEE International Conference on Robotics and Automation, Xi’an, China, 30 May–5 June 2021; pp. 11739–11745. [Google Scholar]
  11. Li, J.; Lee, G.H. DeepI2P: Image-to-point cloud registration via deep classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15960–15969. [Google Scholar]
  12. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  13. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  14. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar] [CrossRef]
  16. Radenović, F.; Tolias, G.; Chum, O. Fine-tuning CNN image retrieval with no human annotation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1655–1668. [Google Scholar] [CrossRef] [PubMed]
  17. Hausler, S.; Garg, S.; Xu, M.; Milford, M.; Fischer, T. PatchNetVLAD: Multi-scale fusion of locally-global descriptors for place recognition. arXiv 2021, arXiv:2104.10419. [Google Scholar]
  18. Cao, B.; Araujo, A.; Sim, J. Unifying deep local and global features for image search. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 726–743. [Google Scholar]
  19. Uy, M.A.; Lee, G.H. PointNetVLAD: Deep point cloud based retrieval for large-scale place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4470–4479. [Google Scholar]
  20. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  21. Chen, X.; Läbe, T.; Milioto, A.; Röhling, T.; Vysotska, O.; Haag, A.; Behley, J.; Stachniss, C. OverlapNet: Loop closing for LiDAR-based SLAM. arXiv 2021, arXiv:2105.11344. [Google Scholar]
  22. Ma, J.; Zhang, J.; Xu, J.; Ai, R.; Gu, W.; Chen, X. OverlapTransformer: An efficient and yaw-angle-invariant transformer network for LiDAR-based place recognition. IEEE Robot. Autom. Lett. 2022, 7, 6958–6965. [Google Scholar] [CrossRef]
  23. Luo, L.; Cao, S.-Y.; Han, B.; Shen, H.-L.; Li, J. BVMatch: LiDAR-based place recognition using bird’s-eye view images. IEEE Robot. Autom. Lett. 2021, 6, 6076–6083. [Google Scholar] [CrossRef]
  24. Luo, L.; Zheng, S.; Li, Y.; Fan, Y.; Yu, B.; Cao, S.-Y.; Li, J.; Shen, H.-L. BEVPlace: Learning LiDAR-based place recognition using bird’s eye view images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 8700–8709. [Google Scholar]
  25. Ma, J.; Xiong, G.; Xu, J.; Chen, X. CVTNet: A Cross-View Transformer Network for LiDAR-Based Place Recognition in Autonomous Driving Environments. IEEE Trans. Ind. Inform. 2024, 20, 4039–4048. [Google Scholar] [CrossRef]
  26. Cattaneo, D.; Vaghi, M.; Fontana, S.; Ballardini, A.L.; Sorrenti, D.G. Global visual localization in LiDAR-maps through shared 2D-3D embedding space. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 4365–4371. [Google Scholar]
  27. Wang, B.; Chen, C.; Cui, Z.; Qin, J.; Lu, C.X.; Yu, Z.; Zhao, P.; Dong, Z.; Zhu, F.; Trigoni, N.; et al. P2-Net: Joint description and detection of local features for pixel and point matching. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 16004–16013. [Google Scholar]
  28. Chen, K.; Yu, H.; Yang, W.; Yu, L.; Scherer, S.; Xia, G.-S. I2D-Loc: Camera localization via image to lidar depth flow. ISPRS J. Photogramm. Remote Sens. 2022, 194, 209–221. [Google Scholar] [CrossRef]
  29. Feng, M.; Hu, S.; Ang, M.H.; Lee, G.H. 2D3D-MatchNet: Learning to match keypoints across 2D image and 3D point cloud. In Proceedings of the IEEE International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 4790–4796. [Google Scholar]
  30. Shubodh, S.; Omama, M.; Zaidi, H.; Parihar, U.S.; Krishna, M. Lip-Loc: Lidar image pretraining for cross-modal localization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 948–957. [Google Scholar]
  31. Xie, W.; Luo, L.; Ye, N.; Ren, Y.; Du, S.; Wang, M.; Xu, J.; Ai, R.; Gu, W.; Chen, X. ModaLink: Unifying Modalities for Efficient Image-to-PointCloud Place Recognition. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 3326–3333. [Google Scholar]
  32. Bae, J.; Moon, S.; Im, S. Deep digging into the generalization of self-supervised monocular depth estimation. In Proceedings of the AAAI Conference on Artificial Intelligence; ACM Press: New York, NY, USA, 2023; Volume 37, pp. 187–196. [Google Scholar]
  33. Xie, Z.; Geng, Z.; Hu, J.; Zhang, Z.; Hu, H.; Cao, Y. Revealing the dark secrets of masked image modeling. arXiv 2022, arXiv:2205.13543. [Google Scholar] [CrossRef]
  34. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning; PMLR: London, UK, 2021; pp. 10347–10357. [Google Scholar]
  35. Cheng, X.; Zhong, Y.; Harandi, M.; Dai, Y.; Chang, X.; Li, H.; Drummond, T.; Ge, Z. Hierarchical neural architecture search for deep stereo matching. Adv. Neural Inf. Process. Syst. 2020, 33, 22158–22169. [Google Scholar]
  36. Chang, J.-R.; Chen, Y.-S. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5410–5418. [Google Scholar]
  37. Chen, C.F.R.; Fan, Q.; Panda, R. CrossViT: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 357–366. [Google Scholar]
  38. Zhao, Z.; Yu, H.; Lyv, C.; Yang, W.; Scherer, S. Attention-enhanced cross-modal localization between 360 images and point clouds. arXiv 2022, arXiv:2212.02757. [Google Scholar]
Figure 1. The objective of MS2-CL is to perform visual localization within large-scale point cloud maps, specifically to retrieve the most similar point cloud corresponding to a given query image.
Figure 2. The architecture of the proposed multi-scale self-supervised learning network.
Figure 3. Evaluation of localization tolerance on the KITTI-360 dataset.
Figure 4. Qualitative retrieval results on the KITTI dataset. (Left): The query depth image. (Middle): The ground truth point cloud (projected as a range image). (Right): The Top-1 retrieved candidate. Green borders indicate successful matches (error < 10 m), while the red border highlights a failure case. The numbers denote the localization error in meters.
Table 1. The Recall@1 and Recall@1% (%) of Image-to-Point Cloud Recognition on the KITTI dataset. Each cell reports Recall@1/Recall@1%.

| Method | Seq-00 | Seq-02 | Seq-05 | Seq-06 | Seq-08 |
|---|---|---|---|---|---|
| Baseline [28] | 58.0/– | 4.0/– | 10.6/– | 22.3/– | 5.3/– |
| MIM-Points [35] | 53.3/71.6 | 6.9/57.1 | 14.8/68.3 | 20.0/44.4 | 12.5/68.6 |
| PSM-Points * [36] | 58.0/72.3 | 8.2/53.6 | 16.0/71.1 | 32.1/50.0 | 14.7/75.0 |
| LEA-Points * [37] | 58.8/72.6 | 9.1/61.3 | 21.8/80.0 | 31.6/71.6 | 18.0/76.3 |
| MIM-I2P-Rec [7] | 74.3/88.6 | 46.3/89.7 | 49.9/89.1 | 41.9/81.5 | 42.1/88.4 |
| PSM-I2P-Rec * [7] | 81.7/97.4 | 47.6/92.9 | 63.8/94.1 | 52.1/88.5 | 53.5/92.1 |
| LEA-I2P-Rec * [7] | 92.6/99.7 | 77.0/98.4 | 83.4/98.8 | 55.5/91.5 | 69.4/96.4 |
| (LC)2 [8] | 31.4/84.5 | 23.1/67.4 | 40.5/88.6 | 39.8/71.5 | 38.4/87.3 |
| LIP-Loc [30] | 93.5/100 | 62.8/86.7 | 66.3/97.8 | 65.4/80.6 | 66.4/98.3 |
| ModaLink [31] | 98.0/100 | 70.5/97.7 | 91.3/99.3 | 87.4/100 | 84.4/99.9 |
| MS2-CL (ours) | 98.4/100 | 81.7/99.8 | 89.4/99.8 | 88.5/100 | 86.8/100 |

* indicates methods requiring stereo images as input; "–" denotes results not reported.
Table 2. Input modality comparison between MS2-CL and baseline methods. We report average Recall@1 (%) across all test sequences on the KITTI dataset. Methods marked with * require stereo image pairs as input.

| Method | Input | Avg Recall@1 (%) | Hardware |
|---|---|---|---|
| PSM-I2P-Rec * | Stereo | 59.8 | 2 cameras |
| LEA-I2P-Rec * | Stereo | 75.8 | 2 cameras |
| LIP-Loc | Monocular | 70.9 | 1 camera |
| ModaLink | Monocular | 86.3 | 1 camera |
| MS2-CL (Ours) | Monocular | 88.9 | 1 camera |
Table 3. The Recall@1 (%) comparison between MS2-CL and ModaLink against different distance thresholds on the KITTI dataset.

| Sequence | Method | 1 m | 4 m | 7 m | 10 m | 13 m | 16 m |
|---|---|---|---|---|---|---|---|
| KITTI-02 | ModaLink | 59.3 | 60.1 | 64.3 | 70.5 | 71.7 | 72.9 |
| KITTI-02 | MS2-CL | 68.1 | 75.7 | 78.9 | 81.7 | 83.1 | 84.4 |
| KITTI-05 | ModaLink | 71.7 | 82.2 | 85.7 | 91.3 | 92.5 | 92.7 |
| KITTI-05 | MS2-CL | 72.0 | 80.2 | 86.1 | 89.4 | 90.4 | 90.7 |
| KITTI-06 | ModaLink | 66.2 | 73.1 | 80.9 | 87.4 | 87.7 | 88.2 |
| KITTI-06 | MS2-CL | 70.6 | 79.4 | 85.5 | 88.5 | 88.7 | 89.6 |
| KITTI-08 | ModaLink | 75.2 | 80.6 | 82.3 | 84.4 | 85.1 | 86.3 |
| KITTI-08 | MS2-CL | 76.3 | 82.3 | 83.6 | 86.8 | 87.8 | 88.9 |
Table 4. Comparison of Image-to-Point Cloud Place Recognition performance on Sequence-0 of the KITTI-360 dataset.

| Method | Recall@1 | Recall@5 | Recall@10 | Recall@20 |
|---|---|---|---|---|
| AECMLoc [38] | 46.2 | 66.0 | 72.5 | 78.2 |
| Zero-shot LIP-Loc [30] | 54.0 | 77.0 | 86.0 | 91.9 |
| LIP-Loc [30] | 68.6 | 86.6 | 92.7 | 96.6 |
| Zero-shot MS2-CL (ours) | 61.8 | 79.3 | 90.4 | 95.1 |
| MS2-CL (ours) | 79.8 | 91.0 | 96.4 | 99.4 |

AECMLoc requires fisheye images as input.
Table 5. Performance of Zero-shot MS2-CL without fine-tuning on the KITTI-360 dataset.

| Sequence | 0 | 3 | 4 | 5 | 6 | 7 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|
| Recall@1 (%) | 61.8 | 62.4 | 59.7 | 66.5 | 57.5 | 60.3 | 64.9 | 58.2 |
Table 6. Ablation study on the weight of the Scale Consistency Loss (λ). We investigated a finer range of values (λ ∈ {0, 0.25, 0.5, 0.75, 1.0}) to identify the optimal balance. We report Recall@1/Recall@1% (%) on sequences from the KITTI dataset.

| Value of λ | Seq-02 | Seq-05 | Seq-06 | Seq-08 |
|---|---|---|---|---|
| 0.0 (Baseline) | 53.6/78.1 | 60.8/81.5 | 56.3/79.3 | 59.2/80.4 |
| 0.25 | 55.1/80.3 | 61.2/81.8 | 57.8/80.1 | 59.9/81.5 |
| 0.5 (Ours) | 81.7/99.8 | 89.4/99.8 | 88.5/100 | 86.8/100 |
| 0.75 | 74.5/95.5 | 80.1/95.9 | 77.5/93.4 | 73.4/95.2 |
| 1.0 | 73.1/94.7 | 76.9/95.8 | 69.2/91.5 | 71.2/93.1 |
Table 7. Ablation study on backbone architectures. All models are trained with the same multi-scale contrastive learning framework (λ = 0.5). We report Recall@1/Recall@1% (%) on sequences from the KITTI dataset.

| Backbone | Seq-02 | Seq-05 | Seq-06 | Seq-08 |
|---|---|---|---|---|
| ResNet-50 + FPN | 68.3/95.2 | 72.1/96.8 | 70.8/94.5 | 69.5/95.1 |
| ViT-Small | 73.5/96.8 | 78.9/97.5 | 76.2/96.2 | 74.8/96.9 |
| ConvNeXt-Tiny | 75.8/97.5 | 81.3/98.2 | 79.4/97.8 | 77.6/97.3 |
| Swin-Tiny (Ours) | 81.7/99.8 | 89.4/99.8 | 88.5/100 | 86.8/100 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, W.; Ma, L.; Zhuang, X.; Deng, Z. MS2-CL: Multi-Scale Self-Supervised Learning for Camera to LiDAR Cross-Modal Place Recognition. Sensors 2026, 26, 1561. https://doi.org/10.3390/s26051561
