This paper introduces MS2-CL, a lightweight cross-modal place recognition framework capable of achieving high accuracy on both seen and unseen datasets.
3.1. Data Processing
(1) Range image generation: The point cloud data, represented in sparse Cartesian coordinates $(x, y, z)$, is transformed into a 2D range image to unify the modalities by emulating the spherical projection characteristics of a LiDAR sensor. Following an initial filtering stage that removes points exceeding the predefined maximum detection range, we calculate the Euclidean distance from each remaining point to the sensor origin as the depth measurement:
$$r = \sqrt{x^2 + y^2 + z^2}.$$
For the remaining valid points, we calculate their horizontal (yaw, $\theta = \operatorname{arctan2}(y, x)$) and vertical (pitch, $\phi = \arcsin(z/r)$) angles based on their 3D coordinates. These angles are then used to determine the corresponding positions in the 2D projection plane. Specifically, the horizontal angle is mapped to a normalized interval, while the vertical position is computed proportionally to the total field of view. The resulting coordinates are scaled to fit the preset image dimensions, ensuring accurate placement of each point in the range image:
$$u = \frac{1}{2}\left(1 - \frac{\theta}{\pi}\right) W, \qquad v = \left(1 - \frac{\phi + f_{\mathrm{down}}}{f_v}\right) H,$$
where $f_v = f_{\mathrm{up}} + f_{\mathrm{down}}$ is the total vertical field of view and $H$ and $W$ represent the height and width of the range image, respectively. It is important to note that the spherical projection parameters, specifically the vertical field of view angles $f_{\mathrm{up}}$ and $f_{\mathrm{down}}$, are sensor-dependent. In this work, these parameters are configured to match the Velodyne HDL-64E LiDAR sensor used in the KITTI datasets. For deployment on vehicles with different LiDAR configurations (e.g., 32-beam or solid-state LiDARs), these projection parameters must be adjusted to align with the specific sensor's vertical field of view to ensure optimal image-to-point-cloud alignment. To address potential occlusions due to overlapping points in the original point cloud, we sort all valid points by their distance values in ascending order, ensuring that closer points are not obscured by farther ones. Finally, a range image matrix is constructed, where each element represents the depth value at the corresponding pixel location. Additionally, each pixel of the generated range image, which stores a depth value, is multiplied by a fixed scaling factor $\alpha$ to enhance the contrast of the whole image.
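The projection described above can be sketched as follows. The vertical field of view matches the HDL-64E mentioned above; the image size (64 × 900), the 80 m range cap, and the unit contrast scale are illustrative choices of ours, not values taken from the paper.

```python
import numpy as np

def point_cloud_to_range_image(points, H=64, W=900,
                               fov_up=np.deg2rad(2.0),
                               fov_down=np.deg2rad(-24.8),
                               max_range=80.0, scale=1.0):
    """Project an (N, 3) point cloud into an (H, W) range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)          # depth of each point

    # Filter out points beyond the maximum detection range (and at origin).
    valid = (r > 0) & (r <= max_range)
    x, y, z, r = x[valid], y[valid], z[valid], r[valid]

    yaw = np.arctan2(y, x)                   # horizontal angle
    pitch = np.arcsin(z / r)                 # vertical angle
    fov = fov_up - fov_down                  # total vertical field of view

    # Map angles to pixel coordinates and clamp to the image bounds.
    u = np.clip(np.floor(0.5 * (1.0 - yaw / np.pi) * W), 0, W - 1).astype(np.int64)
    v = np.clip(np.floor((1.0 - (pitch - fov_down) / fov) * H), 0, H - 1).astype(np.int64)

    # Write points in descending depth order so closer points land last;
    # this is equivalent to the paper's ascending sort with first-write-wins.
    order = np.argsort(r)[::-1]
    img = np.zeros((H, W), dtype=np.float32)
    img[v[order], u[order]] = r[order] * scale
    return img
```

With a single point at (1, 0, 0) this yields a nonzero pixel in the middle column of the image; a second, farther point along the same ray is overwritten by the closer one.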
(2) Depth Image Generation: This study utilizes the Monodepth2 [32] framework to process monocular RGB image sequences from the KITTI and KITTI-360 datasets. We selected this method because its temporally self-supervised learning approach, based on monocular video, constructs a differentiable photometric reprojection loss using camera motion parameters between consecutive frames. This approach implicitly decouples scene depth from motion estimation, effectively mitigating the monocular scale ambiguity issue. To further enhance robustness in dynamic traffic scenes, the method incorporates a multi-frame consistency-driven adaptive masking mechanism [33]. This mechanism effectively filters out moving objects (e.g., other vehicles) that violate the static-scene assumption, significantly reducing reprojection error noise in dynamic regions. In this work, RGB images are processed offline at their original resolution, and the resulting monocular depth estimates are stored locally, substantially decreasing the time and computational resources required for the entire training process.
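The offline precomputation step can be sketched as a simple caching loop. Here `estimate_depth` is a hypothetical stand-in for a per-image depth predictor (e.g., a Monodepth2 forward pass), and the `.npy` cache layout is our own choice, not part of the paper.

```python
import numpy as np
from pathlib import Path

def cache_depth_maps(image_paths, estimate_depth, out_dir):
    """Run monocular depth estimation once and store results locally.

    estimate_depth: callable mapping an image path to a 2D depth array
    (a stand-in for any depth network's inference function).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for p in map(Path, image_paths):
        target = out_dir / (p.stem + ".npy")
        if target.exists():          # skip frames already processed
            continue
        np.save(target, estimate_depth(p))
```

Training then loads the cached `.npy` arrays instead of re-running depth inference every epoch.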
3.2. Multi-Scale Self-Supervised Learning Network
(1) Network Architecture: Our proposed framework for cross-modal place recognition employs a dual-stream, multi-scale architecture to achieve coherent feature alignment between visual and LiDAR data. This design comprises three core stages: modality-specific preprocessing, hierarchical feature encoding under a teacher–student paradigm, and a dual-objective loss function for comprehensive supervision. The network’s backbone consists of parallel Swin Transformer encoders, each followed by dedicated projection heads.
As depicted in Figure 2, the framework begins with modality-specific preprocessing. The camera stream converts an RGB image into a dense depth map, while the LiDAR stream transforms a 3D point cloud into a 2D spherical projection. These 2D representations are then fed into their respective Swin Transformer backbones, which extract features at multiple hierarchical scales. A key aspect of our architecture is the intra-modal teacher–student learning paradigm. For each modality, features from the encoder's final, most global stage (e.g., 7 × 7 patch level) are designated as the 'teacher' embedding, representing a holistic scene view. Features from earlier, higher-resolution stages are termed 'student' embeddings. This structure facilitates a dual-objective optimization. First, an intra-modal Scale Consistency Supervision loss aligns the student embeddings with their corresponding teacher embedding, ensuring feature consistency across granularities. Second, the main Global Contrastive Loss is computed exclusively between the teacher-level embeddings of the two modalities. This primary loss drives the network to map corresponding camera and LiDAR views to a unified embedding space. This combined supervision enables the learning of discriminative descriptors for high-precision cross-modal place recognition.
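The dual-stream wiring above can be sketched schematically. The encoder and projection heads below are random/toy stand-ins (the real model uses trained Swin Transformer encoders and learned heads); the four stage widths are borrowed from Swin-T and the 256-dimensional embedding size is an assumption of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder_stages(x, dims=(96, 192, 384, 768)):
    """Stand-in for a 4-stage Swin backbone: one pooled feature
    vector per hierarchical scale (random values here)."""
    return [rng.standard_normal(d) for d in dims]

def project(f, out_dim=256):
    """Stand-in projection head: a fixed toy linear map per input size."""
    w = np.ones((out_dim, f.shape[0])) / f.shape[0]
    return w @ f

# Dual-stream forward pass: each modality yields 4 projected embeddings;
# the last (most global) one acts as the teacher, the rest as students.
depth_img = rng.standard_normal((64, 900))   # camera-derived depth map
range_img = rng.standard_normal((64, 900))   # LiDAR spherical projection
emb_D = [project(f) for f in encoder_stages(depth_img)]
emb_R = [project(f) for f in encoder_stages(range_img)]
teacher_D, students_D = emb_D[-1], emb_D[:-1]
teacher_R, students_R = emb_R[-1], emb_R[:-1]
```

The scale consistency loss would compare each `students_*` entry to its `teacher_*`, while the global contrastive loss acts only on `teacher_D` and `teacher_R`.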
(2) Scale Consistency Supervised Learning: Distinction from conventional self-distillation: our scale consistency supervision differs fundamentally from conventional knowledge distillation frameworks. Traditional self-distillation methods (e.g., [34]) typically distill from a larger teacher model to a smaller student for model compression. In contrast, we perform intra-modal, multi-scale self-distillation within a single encoder, where teacher and student roles are assigned based on feature granularity rather than model size. Specifically, the coarsest-scale embedding (capturing global semantics) supervises the finer-scale embeddings (which retain spatial details), enforcing hierarchical consistency across scales. This design is tailored for cross-modal alignment: by ensuring scale-invariant representations within each modality before cross-modal matching, we effectively bridge the camera–LiDAR domain gap. The stop-gradient operation on teacher features prevents interference with the primary contrastive loss, ensuring stable optimization. Unlike existing methods that focus on inter-modal alignment alone, our intra-modal scale consistency provides a crucial regularization that improves feature robustness to viewpoint and scale variations.
Beyond the primary cross-modal alignment, we introduce an auxiliary self-supervised task to enforce feature consistency across different scales within each modality. This is achieved through an intra-modal teacher–student learning paradigm. For a given modality (e.g., camera depth images), we designate the embedding from the final, most global encoder stage as the 'teacher' embedding, $z_S$, which captures a coarse-grained, holistic representation of the scene. Embeddings from the preceding, higher-resolution stages, $\{z_s\}_{s=1}^{S-1}$, are treated as 'student' embeddings, representing finer-grained local features.
The Scale Consistency Loss, $\mathcal{L}_{sc}$, then minimizes the distance between each student embedding and its corresponding teacher embedding. Crucially, the teacher embedding is treated as a fixed target by stopping the gradient flow from this auxiliary loss, ensuring that its representation is guided solely by the main global contrastive task. This process is applied symmetrically to both the camera depth (D) and LiDAR range (R) images. The loss for each modality $M \in \{D, R\}$ is formulated as:
$$\mathcal{L}_{sc}^{M} = \frac{1}{S-1} \sum_{s=1}^{S-1} d\left(z_s^{M}, \operatorname{sg}\left(z_S^{M}\right)\right),$$
where $S$ is the total number of scales, $\operatorname{sg}(\cdot)$ denotes the stop-gradient operation, and $d(\cdot,\cdot)$ is a distance metric, for which we use the Smooth L1 loss. The total objective for scale consistency supervision is the sum of the individual modality losses:
$$\mathcal{L}_{sc} = \mathcal{L}_{sc}^{D} + \mathcal{L}_{sc}^{R}.$$
This auxiliary objective encourages the model to learn multi-scale descriptors that are hierarchically consistent, thereby enhancing feature robustness against variations in viewpoint and distance.
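A minimal NumPy sketch of this auxiliary loss, assuming Smooth L1 with beta = 1 and an even weighting over student scales (both assumptions of ours):

```python
import numpy as np

def smooth_l1(a, b, beta=1.0):
    """Elementwise Smooth L1 (Huber) distance, averaged over dimensions."""
    d = np.abs(a - b)
    return np.mean(np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta))

def scale_consistency_loss(students, teacher):
    """Average Smooth L1 between each student embedding and the teacher.

    In NumPy the teacher is naturally a constant, which mirrors the
    stop-gradient sg(.) applied to it in the actual training graph.
    """
    teacher = teacher.copy()            # stands in for sg(teacher)
    return np.mean([smooth_l1(s, teacher) for s in students])
```

The loss is zero when all student embeddings already coincide with the teacher, and grows with their distance from it.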
(3) Global Contrastive Loss Function: Different from the triplet loss commonly employed in previous methods, we train the proposed network architecture for image-to-point-cloud correspondence retrieval using a cross-modal mini-batch contrastive loss function. Given a batch of 2D depth image descriptors $\{d_i\}_{i=1}^{N}$ and their corresponding matching 2D range image descriptors $\{r_i\}_{i=1}^{N}$, where $N$ represents the size of the mini-batch, this batch contains $N$ positive pairings and $N^2 - N$ negative pairings. For each pairing, the contrastive loss is computed as follows:
$$\mathcal{L}_i^{D \to R} = -\log \frac{\exp\left(\langle d_i, r_i \rangle / \tau\right)}{\sum_{j=1}^{N} \exp\left(\langle d_i, r_j \rangle / \tau\right)},$$
with $\mathcal{L}_i^{R \to D}$ defined symmetrically by exchanging the roles of the two modalities.
where $\tau$ is the temperature, similar to CLIP [12]. The parameter $\tau$ scales the logits to control the sharpness of the probability distribution. A properly chosen $\tau$ assists the model in mining hard negatives by emphasizing difficult samples during gradient computation. Following standard practices [30], we set $\tau$ empirically in our experiments. Throughout the mini-batch training process, the objective function for the paired contrastive loss between images and point clouds is defined as follows:
$$\mathcal{L}_{gc} = \frac{1}{2N} \sum_{i=1}^{N} \left( \mathcal{L}_i^{D \to R} + \mathcal{L}_i^{R \to D} \right).$$
(4) Training Strategy: Our training process, detailed in Algorithm 1, is driven by the joint optimization of the global contrastive loss and the intra-modal scale consistency loss in a self-supervised manner. We employ the AdamW optimizer and adopt a differential learning rate strategy: the depth and range image encoders are trained with a lower learning rate, while the projection heads use a higher one to facilitate faster adaptation. The weighting factor $\lambda$ for the scale consistency loss is set to 0.5. To ensure training stability, gradients are clipped with a maximum norm of 1.0.
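A toy sketch of the per-group learning rates and global-norm clipping just described; plain SGD stands in for the AdamW update actually used, and the concrete rate values are placeholders.

```python
import numpy as np

def clip_global_norm(grads, max_norm=1.0):
    """Scale all gradients so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

def sgd_step(params, grads, lrs):
    """One update with per-group learning rates (encoders vs. heads).

    lrs holds one learning rate per parameter group, mirroring the
    paper's lower encoder rate and higher projection-head rate.
    """
    grads = clip_global_norm(grads, max_norm=1.0)
    return [p - lr * g for p, g, lr in zip(params, grads, lrs)]
```

With gradients of global norm 5 and a max norm of 1, every gradient is scaled by 0.2 before the per-group rates are applied.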
Algorithm 1 Multi-Scale Self-Supervised Cross-Modal Learning (MS2-CL)
Require: Depth image $I_D$, LiDAR range image $I_R$, temperature $\tau$, loss weight $\lambda$
Ensure: Cross-modal matching model with scale-invariant embeddings
1: Initialize:
2:   Multi-scale Swin encoders $E_D$, $E_R$
3:   Projection heads $P_s^D$, $P_s^R$, $s = 1, \dots, 4$
4:   Scale consistency loss $\mathcal{L}_{sc}$ and contrastive loss $\mathcal{L}_{gc}$
5: Step 1: Multi-scale feature extraction
6:   $\{F_s^D\}_{s=1}^{4} \leftarrow E_D(I_D)$
7:   $\{F_s^R\}_{s=1}^{4} \leftarrow E_R(I_R)$
8: Step 2: Feature projection at each scale
9:   for $s = 1$ to 4 do
10:    $z_s^D \leftarrow P_s^D(F_s^D)$
11:    $z_s^R \leftarrow P_s^R(F_s^R)$
12:  end for
13: Step 3: Teacher–student separation (coarse → teacher, fine → student)
14:   teachers: $z_4^D$, $z_4^R$; students: $\{z_s^D\}_{s=1}^{3}$, $\{z_s^R\}_{s=1}^{3}$
15: Step 4: Intra-modal scale consistency supervision
16:   $\mathcal{L}_{sc} \leftarrow \mathcal{L}_{sc}^{D} + \mathcal{L}_{sc}^{R}$, matching students to $\operatorname{sg}(\text{teacher})$
17: Step 5: Cross-modal contrastive learning (teacher embeddings only)
18:   Compute logits $\langle z_4^D, z_4^R \rangle / \tau$ over the mini-batch
19:   $\mathcal{L}_{gc} \leftarrow$ symmetric contrastive loss on the logits
20: Step 6: Final objective
21:   $\mathcal{L} \leftarrow \mathcal{L}_{gc} + \lambda \mathcal{L}_{sc}$
22: return Trained model capable of cross-modal place recognition.