1. Introduction
Three-dimensional (3D) scene understanding is fundamental to safe navigation and decision making in autonomous driving. Among its core tasks, LiDAR point cloud semantic segmentation assigns a semantic label to each point in 3D space and provides fine-grained scene understanding. Although LiDAR captures accurate geometric structures, point clouds remain sparse, irregular, and largely devoid of texture because of the sensing process. These characteristics impose a fundamental limitation on LiDAR-only methods, especially when recognizing distant small objects (e.g., pedestrians) or separating geometrically similar yet semantically different surfaces (e.g., road versus vegetation).
Existing LiDAR segmentation methods can be broadly categorized into three paradigms. Projection-based methods project 3D point clouds onto 2D representations and process them with mature 2D CNNs, building upon the fully convolutional network (FCN) [1] paradigm. Representative works include the SqueezeSeg family [2,3,4], PolarNet [5], AMVNet [6], RangeFormer [7], and FRNet [8]; however, the projection process inevitably incurs information loss and quantization errors. Point-based methods, pioneered by PointNet [9], directly process raw point sets without intermediate representations. Extensions such as PointNet++ [10], RandLA-Net [11], and KPConv [12] introduce hierarchical learning and flexible convolution operators yet suffer from high computational costs in large-scale scenes. Voxel-based methods utilize sparse convolution to efficiently handle voxelized grids (e.g., MinkowskiEngine [13], SPVCNN [14], and Cylinder3D [15]) and represent the current mainstream. Hybrid approaches, such as RPVNet [16], 2-S3Net [17], and UniSeg [18], further combine multiple representations. Related efforts also exploit scene-completion priors [19], transformer-based context modeling in SphereFormer [20], and non-uniform cylindrical partitioning in NUC-Net [21] to strengthen LiDAR-only perception, while M3Net [22] extends this line toward universal LiDAR segmentation across heterogeneous datasets and sensor setups. In parallel, temporal models, such as MemorySeg [23] and TASeg [24], exploit sequential observations to recover semantics for sparse, occluded, or fast-moving regions. Despite this progress, conventional sparse convolutions employ static shared weights that cannot adaptively adjust the receptive field according to local geometric environments (e.g., planar surfaces vs. corners), limiting the capacity to model complex fine-grained structures.
To compensate for the inherent texture deficiency of point clouds, multimodal fusion methods have attracted considerable attention. Early approaches, such as PMF [25], align point clouds with image pixels via geometric projection, whereas later methods perform feature fusion in a unified bird’s-eye view (BEV) space, including BEVFusion [26] and its variant [27], as well as MSeg3D [28]. These methods improve segmentation accuracy, but they also require LiDAR and camera inputs during inference, which increases the computational load and imposes strict synchronization requirements. Camera failure under adverse weather conditions or calibration drift can therefore degrade the overall system. Vision Foundation Models (VFMs) built on the Vision Transformer (ViT) [29] architecture, such as SAM [30] and DINOv2 [31], have shown strong generalization in the 2D domain. DINOv3 [32] further refines the Masked Image Modeling (MIM) strategy and improves boundary localization and semantic consistency in dense prediction tasks. Recent representative studies have begun to transfer such foundation-model semantics to LiDAR more directly: Puy et al. [33] analyze scalable VFM distillation for LiDAR; ELiTe [34] introduces patch-to-point multi-stage transfer with parameter-efficient teacher adaptation, and later works, such as LiMoE [35] and LiMA [36], extend image-to-LiDAR learning to multi-representation or long-horizon pretraining. Nevertheless, these methods mainly emphasize representation pretraining quality, stage-wise feature transfer, or broader multimodal perception rather than end-to-end topology-preserving distillation tailored to LiDAR semantic segmentation.
Cross-Modal Knowledge Distillation (CMKD) [37] offers a promising alternative, following the paradigm of multimodal enhancement during training and efficient single-modal deployment during inference. Early cross-modal distillation works, including RGB-depth supervision transfer [38] and audiovisual learning via SoundNet [39], demonstrated the feasibility of cross-modality knowledge transfer. In the context of 3D understanding, xMUDA [40] pioneered cross-modal unsupervised domain adaptation for semantic segmentation. Subsequent methods—2D3DNet [41], contrastive multimodal fusion approaches [42], PVKD [43], cross-domain distillation [44], bidirectional LiDAR–camera distillation [45], and CMDFusion [46]—progressively improved LiDAR segmentation under teacher guidance. Beyond pointwise transfer, structure-aware distillation methods in related domains preserve local topology or affinity graphs, such as graph-structure distillation [47] and inter-region affinity distillation [48]. However, existing LiDAR-image distillation methods still largely rely on direct feature/logit transfer or handcrafted cross-modal interaction designs [44,45,46]. They do not explicitly account for the fact that image and point-cloud features lie on heterogeneous manifolds linked by sparse, calibration-dependent correspondences; consequently, forcing the student to mimic the teacher’s numerical distribution often damages geometric discriminability and leads to negative transfer.
To address these challenges, we propose Cross-Modal Collaborative Manifold Distillation (CMCMD). The framework is built on the hypothesis that across image and LiDAR modalities, pairwise semantic relations are more transferable than absolute feature coordinates. A DINOv3 Vision Foundation Model serves as the visual teacher and provides dense semantic priors. The student network adopts an Adaptive Relation Convolution (ARConv) backbone so that feature aggregation can respond to local geometric variation. A Unified Bidirectional Mapping Module (UBMM) enables explicit interaction between the 2D and 3D branches and reduces the gap between their representations. Manifold-Aware Topological Distillation (MATD) then aligns inter-sample affinity structures in a latent subspace instead of enforcing numerical feature matching. Experiments on SemanticKITTI and nuScenes show that CMCMD achieves mIoU values of 72.9% and 81.2%, respectively, exceeding existing single-modal and cross-modal distillation methods while remaining competitive with multimodal fusion baselines at lower inference cost.
The main contributions of this paper are summarized as follows:
We propose an asymmetric cross-modal distillation framework that couples the generic semantic capabilities of DINOv3 with the geometric modeling power of ARConv, improving the representation capability of single-modal point cloud segmentation;
We design the Adaptive Relation Convolution (ARConv) module, which dynamically perceives local geometric topology via a dual-stream mechanism to generate adaptive aggregation weights, effectively overcoming the limitation of fixed receptive fields in standard sparse convolutions;
We propose the Unified Bidirectional Mapping Module (UBMM), which integrates geometry-guided visual enhancement (P2I) and semantic-injected geometric refinement (I2P) within a unified framework, rectifying the spatial misalignment and representational heterogeneity between 2D and 3D features;
We propose the Manifold-Aware Topological Distillation (MATD) algorithm, which decouples modality-private attributes and aligns the inter-sample affinity matrix to achieve the robust transfer of high-order topological knowledge across heterogeneous feature spaces.
2. Materials and Methods
2.1. Algorithm’s Overall Framework
To address the inherent challenges of texture deficiency and limited geometric feature expression in LiDAR point cloud semantic segmentation, this paper proposes a Cross-Modal Collaborative Manifold Distillation (CMCMD) framework based on an ARConv backbone and Vision Foundation Models (VFMs). As illustrated in Figure 1, the framework adopts the paradigm of “multimodal enhancement during training and efficient single-modal deployment during inference”, consisting of a multimodal teacher network and a single-modal student network.
In the feature-encoding stage, we construct an asymmetric dual-stream architecture as follows:
2D Image Branch. This branch integrates the DINOv3 Vision Foundation Model as the feature extractor. Leveraging its robust self-supervised pretraining, DINOv3 provides hierarchical image features, $\{F_{2D}^{s}\}_{s=1}^{4}$, where $F_{2D}^{s}\in\mathbb{R}^{H_s\times W_s\times C_s}$ and $s=1,\dots,4$, serving as high-quality semantic guidance for the 3D network.
3D Point Cloud Branch. This branch employs an ARConv backbone. Unlike conventional sparse convolutions with static weights, this architecture dynamically perceives the local geometric distribution of the point cloud to generate adaptive aggregation weights, thereby producing multi-scale point features, $\{F_{3D}^{s}\}_{s=1}^{4}$, where $F_{3D}^{s}\in\mathbb{R}^{N_s\times C_s}$.
To achieve deep alignment of heterogeneous modalities, a Unified Bidirectional Mapping Module (UBMM) is designed. This module receives 2D semantic features and 3D geometric features at multiple scales and eliminates the modal gap through explicit interaction mechanisms. The fused multi-scale features undergo progressive upsampling and residual aggregation before being fed into the segmentation head to produce semantic predictions.
During training, the student network is guided by the Cross-Modal Collaborative Manifold Distillation (CMCMD) strategy described in Section 2.4, which couples soft-label semantic alignment with manifold-aware topological distillation. During inference, the system retains only the single-modal student network, achieving high-precision segmentation without incurring additional sensor costs.
For reproducibility, the complete workflow can be summarized as the following stage-wise procedure (a minimal training-step sketch is given after the list):
Extract four-scale image features, $F_{2D}^{1}$–$F_{2D}^{4}$, with the frozen DINOv3 teacher and four-scale point features, $F_{3D}^{1}$–$F_{3D}^{4}$, with the ARConv-based student encoder;
At each scale, establish calibrated 2D/3D correspondences, apply the UBMM P2I path to inject geometric priors into the image branch, and apply the I2P path to feed sampled visual semantics back to the point branch;
Decode the refined point features to obtain student segmentation logits, while the teacher branch supplies soft semantic targets and intermediate features for training supervision;
Optimize the student with the supervised segmentation loss together with $\mathcal{L}_{soft}$ and $\mathcal{L}_{MATD}$, where MATD aligns teacher/student inter-sample affinity matrices in the shared latent manifold subspace;
Remove the teacher branch and all the distillation paths during inference, and deploy only the LiDAR student network for efficient single-modal segmentation.
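The following PyTorch-style sketch mirrors these five stages for a single training iteration. It is purely illustrative: the callables (`teacher_2d`, `student_3d`, `ubmm`, `seg_head`, `teacher_head`, and the three loss functions) are placeholder names for the components described in the remainder of this section, not identifiers from our implementation.

```python
import torch

def cmcmd_training_step(batch, teacher_2d, student_3d, ubmm, seg_head, teacher_head,
                        seg_loss, soft_loss, matd_loss, lam=1.0):
    """One CMCMD training step (illustrative sketch; all module names are placeholders)."""
    points, images, labels, calib = batch

    # Stage 1: four-scale features from the frozen DINOv3 teacher and the ARConv student.
    with torch.no_grad():
        feats_2d = teacher_2d(images)        # [F_2D^1, ..., F_2D^4]
    feats_3d = student_3d(points)            # [F_3D^1, ..., F_3D^4]

    # Stage 2: UBMM bidirectional interaction (P2I and I2P) at every scale.
    feats_2d, feats_3d = ubmm(feats_2d, feats_3d, calib)

    # Stage 3: student logits plus teacher soft targets at matched points.
    logits_s = seg_head(feats_3d)
    logits_t = teacher_head(feats_2d, calib)

    # Stage 4: supervised segmentation loss plus the two distillation terms.
    loss = seg_loss(logits_s, labels) + lam * (
        soft_loss(logits_s, logits_t) + matd_loss(feats_2d[-1], feats_3d[-1]))

    # Stage 5 (deployment): the teacher branch, UBMM paths, and distillation losses are removed;
    # only student_3d and seg_head are kept for single-modal inference.
    return loss
```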
2.2. 2D/3D Branch Design
To fully exploit the complementary advantages of dense image semantics and sparse point cloud geometry, we design specific feature-encoding branches tailored to the characteristics of each modality.
2.2.1. 2D Branch: Generic Semantic Representation with DINOv3
The primary objective of the image branch is to extract semantic representations with strong generalization capabilities. Unlike prior approaches that rely on ResNet [49] or DINOv2, we employ the DINOv3 [32] model to construct the 2D backbone. DINOv3 adopts an improved Masked Image Modeling (MIM) pretraining strategy, which significantly enhances performance in dense prediction tasks, particularly with respect to boundary localization and category consistency within complex scenes.
During implementation, input images are processed by a frozen DINOv3 encoder to extract pyramid features, $\{F_{2D}^{s}\}_{s=1}^{4}$. Freezing the encoder reduces training memory usage and preserves the visual knowledge learned during large-scale pretraining. The teacher, therefore, provides a stable semantic signal for cross-modal distillation without drifting toward domain-specific representations during fine tuning.
2.2.2. 3D Branch: Adaptive Geometric Modeling with ARConv
Standard sparse convolution networks (e.g., MinkowskiEngine) typically employ static geometric weights shared across all the spatial locations. This weight-sharing mechanism limits the network’s capacity to adapt to diverse local geometric morphologies, such as distinguishing flat road surfaces from slender poles. To mitigate this limitation, we design the student network backbone based on the Adaptive Relation Convolution (ARConv) module, illustrated in
Figure 2.
Formally, we describe one ARConv layer on a set of active 3D locations with coordinates $\{p_i\}_{i=1}^{N}$, $p_i\in\mathbb{R}^{3}$, and input features $\{f_i\}_{i=1}^{N}$, $f_i\in\mathbb{R}^{C_{in}}$. For a given center location, $p_i$, we define its local neighborhood as $\mathcal{N}(i)=\{\,j \mid \lVert p_j-p_i\rVert_2 \le r\,\}$, where $r$ is the query radius. The ARConv output, $\hat{f}_i\in\mathbb{R}^{C_{out}}$, is computed as a dynamic geometry-conditioned aggregation as follows:

$$ \hat{f}_i=\sum_{j\in\mathcal{N}(i)} w_{ij}\odot \tilde{f}_j, \tag{1} $$

where ⊙ denotes element-wise multiplication, $\tilde{f}_j\in\mathbb{R}^{C_{out}}$ is the transformed neighbor feature, and $w_{ij}\in\mathbb{R}^{C_{out}}$ denotes the channel-wise adaptive aggregation weights. The ARConv module generates these weights and features through a dual-stream parallel mechanism as follows:
Spatial Stream (Weight Generation). This stream focuses on parsing the local geometric topology. It takes the relative coordinates, $\Delta p_{ij}=p_j-p_i$, as input to model the spatial relationship. A learnable function, $\phi(\cdot)$, parameterized by a multi-layer perceptron (MLP), maps these geometric cues to channel-wise logits. The resulting adaptive weights are normalized over the neighbor index within $\mathcal{N}(i)$ as follows:

$$ w_{ij}^{c}=\frac{\exp\big(\phi(\Delta p_{ij})_{c}\big)}{\sum_{k\in\mathcal{N}(i)}\exp\big(\phi(\Delta p_{ik})_{c}\big)}, \qquad c=1,\dots,C_{out}. $$

This mechanism ensures that the convolution kernel, $w_{ij}$, is input dependent, dynamically adjusting its focus based on the local shape (e.g., planes, edges, or corners). The Softmax is applied over the neighbor index ($j$) within $\mathcal{N}(i)$ for each output channel ($c$).
Feature Stream (Semantic Transformation). This stream performs the linear transformation of semantic information. The input features, $f_j$, of neighboring points are processed by a linear layer, $W_f$, to generate transformed high-dimensional semantic attributes as follows:

$$ \tilde{f}_j = W_f\, f_j, $$

where $W_f\in\mathbb{R}^{C_{out}\times C_{in}}$ is the learnable weight matrix for feature projection.
By synthesizing the output features through Equation (1), ARConv endows the network with an adaptive receptive field. This design improves the representation of fine geometric structures in sparse, unstructured LiDAR point clouds and supports more accurate segmentation at object boundaries and on thin structures.
For clarity, Equation (1) adopts a generic radius-based notation for the local set $\mathcal{N}(i)$. In the actual sparse-voxel implementation, however, the neighborhood is instantiated on a hierarchical voxel pyramid rather than by an explicit k-NN search. Concretely, the 3D encoder contains four stages connected by sparse convolutions with a stride of 2. With the input voxel size fixed at 0.1 m, the effective voxel resolutions of the four stages become 0.1, 0.2, 0.4, and 0.8 m, respectively. To instantiate the multi-scale neighborhood in ARConv, we use stage-specific pyramid grid sizes for the four stages while fixing the number of reference centroids at 16 at each stage. Under our voxelization setting, these grids correspond to stage-dependent effective physical receptive extents that grow with depth, which provides a practical balance between small-object sensitivity and large-context aggregation in outdoor LiDAR scenes.
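The dense-tensor sketch below illustrates the ARConv aggregation of Equation (1) in PyTorch. For readability, it gathers neighbors through a precomputed index tensor rather than the sparse voxel pyramid used in practice, and the hidden width of the spatial-stream MLP is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARConv(nn.Module):
    """Adaptive Relation Convolution: geometry-conditioned neighbor aggregation (sketch)."""

    def __init__(self, c_in, c_out, hidden=32):
        super().__init__()
        # Feature stream: linear transform of neighbor features.
        self.w_f = nn.Linear(c_in, c_out, bias=False)
        # Spatial stream: MLP from relative coordinates to channel-wise weight logits.
        self.phi = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, c_out))

    def forward(self, coords, feats, nbr_idx):
        # coords: (N, 3), feats: (N, C_in), nbr_idx: (N, K) neighbor indices per center.
        nbr_coords = coords[nbr_idx]                 # (N, K, 3)
        nbr_feats = feats[nbr_idx]                   # (N, K, C_in)
        rel = nbr_coords - coords.unsqueeze(1)       # relative coordinates Δp_ij

        w = F.softmax(self.phi(rel), dim=1)          # normalize over the neighbor index j
        f_tilde = self.w_f(nbr_feats)                # transformed neighbor features
        return (w * f_tilde).sum(dim=1)              # Eq. (1): sum_j w_ij ⊙ f~_j
```

In the full model, `nbr_idx` would be supplied by the stage-specific voxel pyramid described above rather than by a radius or k-NN query.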
2.3. Unified Bidirectional Mapping Module
In multimodal fusion tasks, 2D images provide rich texture and semantic context, while 3D point clouds offer precise geometric structure and spatial depth. However, the heterogeneity in data representation and the misalignment of coordinate spaces pose significant challenges for effective feature interaction. To bridge this gap, we propose the Unified Bidirectional Mapping Module (UBMM). As illustrated in Figure 3, the UBMM integrates two synergistic pathways—Point-to-Image (P2I) and Image-to-Point (I2P)—within a unified framework to achieve geometry-guided visual enhancement and semantic-injected geometric refinement.
2.3.1. Coordinate Space Alignment
The prerequisite for cross-modal interaction is establishing the correspondence between heterogeneous spaces. At a given interaction stage ($s$), let the point features be $F_{3D}^{s}\in\mathbb{R}^{N_s\times C_s}$ and the image features be $F_{2D}^{s}\in\mathbb{R}^{H_s\times W_s\times C_s}$. We utilize the calibrated camera intrinsic matrix ($K$) and the LiDAR-to-camera extrinsic matrix ($[R \mid t]$) to project a 3D point ($p_i=(x_i,y_i,z_i)^{\top}$) onto the image plane as follows:

$$ d_i\begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = K\,[R \mid t]\begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \end{bmatrix}, $$

where $d_i$ is the depth of the point in the camera coordinate frame. For multi-scale interactions, the projected coordinates are rescaled to the stage-$s$ feature-map resolution as follows:

$$ \big(u_i^{s},\, v_i^{s}\big)=\Big(u_i\cdot\tfrac{W_s}{W},\; v_i\cdot\tfrac{H_s}{H}\Big), $$

where $W\times H$ is the input image resolution. We further define a validity mask $m_i\in\{0,1\}$, with $m_i=1$ only if $d_i>0$ and $(u_i^{s}, v_i^{s})$ falls inside the image bounds, so that only valid point–image correspondences participate in cross-modal interactions. In the current implementation, we do not introduce an explicit depth-based visibility competition mechanism; instead, we rely on the projection validity mask to retain only usable correspondences for subsequent interactions.
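The projection and validity masking can be written compactly as follows; the homogeneous-coordinate layout and the single scale ratio shared by both axes are assumptions made for illustration.

```python
import torch

def project_points(points, K, T_lc, img_h, img_w, scale=1.0):
    """Project LiDAR points onto a (possibly downscaled) image plane.

    points: (N, 3) LiDAR coordinates; K: (3, 3) camera intrinsics;
    T_lc: (4, 4) LiDAR-to-camera extrinsics; scale: feature-map / image resolution ratio.
    Returns pixel coordinates (N, 2) at the target scale and a boolean validity mask (N,).
    """
    ones = torch.ones(points.shape[0], 1, device=points.device)
    pts_cam = (torch.cat([points, ones], dim=1) @ T_lc.T)[:, :3]   # camera-frame coordinates
    depth = pts_cam[:, 2]

    uvw = pts_cam @ K.T                                            # perspective projection
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)
    uv = uv * scale                                                # rescale to stage-s resolution

    valid = (depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < img_w * scale) \
                        & (uv[:, 1] >= 0) & (uv[:, 1] < img_h * scale)
    return uv, valid
```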
2.3.2. P2I: Geometry-Guided Visual Feature Enhancement
The P2I pathway (left branch in Figure 3) aims to inject sparse 3D geometric priors into the 2D visual representation. At stage $s$, the valid point features in $F_{3D}^{s}$ are scattered to the image coordinates $(u_i^{s}, v_i^{s})$, and bilinear interpolation is used to generate a dense geometric prior map ($G^{s}\in\mathbb{R}^{H_s\times W_s\times C_s}$) aligned with the image resolution.

To adaptively fuse these priors, the image features ($F_{2D}^{s}$) are concatenated with $G^{s}$ to generate a spatial attention map via a gating branch. The gating mechanism consists of convolutional layers and a Sigmoid activation as follows:

$$ M^{s}=\sigma\Big(\mathrm{Conv}\big(\delta\big(\mathrm{Conv}\big([\,F_{2D}^{s};\,G^{s}\,]\big)\big)\big)\Big), $$

where $\sigma(\cdot)$ denotes the Sigmoid function, and $\delta(\cdot)$ is the ReLU activation. This attention map ($M^{s}$) modulates the geometric priors, which are then concatenated with the original image features and fused via a $1\times 1$ convolution to produce the geometry-enhanced visual output as follows:

$$ \hat{F}_{2D}^{s}=\mathrm{Conv}_{1\times 1}\big([\,F_{2D}^{s};\; M^{s}\odot G^{s}\,]\big). $$
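A minimal sketch of the P2I gating path is given below, assuming the dense geometric prior map has already been scattered and interpolated onto the stage-s image grid; channel widths and kernel sizes are illustrative.

```python
import torch
import torch.nn as nn

class P2IFusion(nn.Module):
    """Geometry-guided visual enhancement: gate the prior map, then fuse via a 1x1 conv (sketch)."""

    def __init__(self, c):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * c, c, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, 1, kernel_size=1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * c, c, kernel_size=1)

    def forward(self, f_2d, g_prior):
        # f_2d, g_prior: (B, C, H_s, W_s) image features and dense geometric prior map.
        attn = self.gate(torch.cat([f_2d, g_prior], dim=1))    # spatial attention map M^s
        return self.fuse(torch.cat([f_2d, attn * g_prior], dim=1))
```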
2.3.3. I2P: Semantic-Injected Point Feature Refinement
Conversely, the I2P pathway (right branch in Figure 3) transfers visual semantics back to the 3D domain. Using the projection correspondence, we bilinearly sample the image features at the projected coordinates of each point to obtain pointwise visual contexts ($V^{s}$). This interpolation strategy preserves the local image continuity when the projected coordinates fall at non-integer pixel locations as follows:

$$ V_i^{s}=\mathcal{B}\big(F_{2D}^{s},\,(u_i^{s}, v_i^{s})\big), $$

where $V_i^{s}$ is the $i$th sampled visual feature, and $\mathcal{B}(\cdot)$ denotes bilinear interpolation on the image feature map. Stacking all the sampled vectors yields $V^{s}\in\mathbb{R}^{N_s\times C_s}$.

To achieve deep fusion, the point features ($F_{3D}^{s}$) and sampled visual contexts are processed by multi-layer perceptrons (MLPs) and concatenated. An MLP-based gating mechanism computes the cross-modal importance weights ($g^{s}$) as follows:

$$ g^{s}=\sigma\Big(\mathrm{MLP}\big([\,\mathrm{MLP}(F_{3D}^{s});\;\mathrm{MLP}(V^{s})\,]\big)\Big). $$

The final refined point features are obtained by modulating the sampled visual features with $g^{s}$ and fusing them with the original geometric features as follows:

$$ \hat{F}_{3D}^{s}=\mathrm{MLP}\big([\,F_{3D}^{s};\; g^{s}\odot\mathrm{MLP}(V^{s})\,]\big). $$
This bidirectional loop effectively rectifies modal misalignment, yielding robust multimodal representations.
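As a concrete illustration of the I2P path, the sketch below samples image features at the projected coordinates with `grid_sample` and applies the MLP-based gate; the MLP widths and the concatenation-based fusion layout are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class I2PFusion(nn.Module):
    """Semantic-injected point refinement: sample, gate, and fuse per-point visual context (sketch)."""

    def __init__(self, c):
        super().__init__()
        self.mlp_p = nn.Sequential(nn.Linear(c, c), nn.ReLU(inplace=True))
        self.mlp_v = nn.Sequential(nn.Linear(c, c), nn.ReLU(inplace=True))
        self.gate = nn.Sequential(nn.Linear(2 * c, c), nn.ReLU(inplace=True),
                                  nn.Linear(c, c), nn.Sigmoid())
        self.fuse = nn.Linear(2 * c, c)

    def forward(self, f_3d, f_2d, uv, img_hw):
        # f_3d: (N, C) point features; f_2d: (1, C, H_s, W_s) image features; uv: (N, 2) pixel coords.
        h, w = img_hw
        grid = torch.empty_like(uv)
        grid[:, 0] = 2.0 * uv[:, 0] / (w - 1) - 1.0     # normalize to [-1, 1] for grid_sample
        grid[:, 1] = 2.0 * uv[:, 1] / (h - 1) - 1.0
        sampled = F.grid_sample(f_2d, grid.view(1, -1, 1, 2), align_corners=True)
        v = sampled.squeeze(0).squeeze(-1).t()          # (N, C) bilinearly sampled contexts V^s

        p, s = self.mlp_p(f_3d), self.mlp_v(v)
        g = self.gate(torch.cat([p, s], dim=1))         # cross-modal importance weights g^s
        return self.fuse(torch.cat([f_3d, g * s], dim=1))
```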
2.4. Cross-Modal Collaborative Manifold Distillation
The core objective of cross-modal knowledge distillation is to construct a high-performance multimodal teacher network and transfer its learned generic semantic representations to a student network that relies solely on single-modal input. However, owing to the intrinsic heterogeneous gap between the data manifolds of LiDAR point clouds (sparse, unstructured, and geometry dominant) and camera images (dense, regular, and texture dominant), conventional point-to-point feature matching (e.g., MSE) often forces the student to mimic modality-specific numerical distributions of the teacher. This rigid alignment can inadvertently damage the distinct geometric discriminability of point clouds, leading to negative transfer.
To address this challenge, we propose the Cross-Modal Collaborative Manifold Distillation (CMCMD) framework. Instead of aligning absolute feature values, this framework focuses on maintaining the consistency of topological structural relations within the feature space. Specifically, we decouple the distillation process into two orthogonal optimization paths: semantic distribution alignment and manifold topology alignment.
2.4.1. Soft-Label Semantic Alignment
To transfer the robust discriminative power of the Vision Foundation Model (VFM) for open-world categories to the point cloud network, distillation is performed at the logit level. Unlike hard labels that provide binary supervision, the soft probability distribution output by the teacher network encapsulates rich inter-class similarity information (e.g., the semantic proximity between “truck” and “bus” is higher than that between “truck” and “vegetation”).
Let $z_i^{T}, z_i^{S}\in\mathbb{R}^{K}$ denote the logits output by the teacher and student networks at voxel $i$, respectively, where $K$ is the number of semantic classes. We introduce a temperature coefficient ($\tau$) to smooth the distributions, thereby explicitly mining semantic correlations among long-tail classes. For class index $k$, the softened teacher and student probabilities are defined as follows:

$$ p_{i,k}^{T}=\frac{\exp\big(z_{i,k}^{T}/\tau\big)}{\sum_{k'=1}^{K}\exp\big(z_{i,k'}^{T}/\tau\big)}, \qquad p_{i,k}^{S}=\frac{\exp\big(z_{i,k}^{S}/\tau\big)}{\sum_{k'=1}^{K}\exp\big(z_{i,k'}^{S}/\tau\big)}. $$

We employ the Kullback–Leibler (KL) divergence to minimize the information discrepancy between the student distribution ($p_i^{S}$) and the teacher distribution ($p_i^{T}$) as follows:

$$ \mathcal{L}_{soft}=\frac{\tau^{2}}{N}\sum_{i=1}^{N} D_{\mathrm{KL}}\big(p_i^{T}\,\Vert\, p_i^{S}\big)=\frac{\tau^{2}}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} p_{i,k}^{T}\log\frac{p_{i,k}^{T}}{p_{i,k}^{S}}. $$
This loss function compels the student network to learn not only “what the class is” but also “how the teacher distinguishes” between ambiguous categories, thereby enhancing semantic generalization.
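In code, the temperature-scaled soft-label loss reduces to a standard KL divergence between softened distributions (sketch); the $\tau^{2}$ factor keeps gradient magnitudes comparable across temperatures.

```python
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, tau=2.0):
    """KL divergence between temperature-softened teacher and student distributions.

    student_logits, teacher_logits: (N, K) per-voxel class logits; tau: temperature.
    """
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    p_t = F.softmax(teacher_logits / tau, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau ** 2
```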
2.4.2. Manifold-Aware Topological Distillation (MATD)
To overcome modality heterogeneity, we propose Manifold-Aware Topological Distillation (MATD). The underlying hypothesis is that although feature distributions differ across modalities, pairwise semantic relations should remain topologically consistent within the semantic manifold. In MATD, these relations are modeled with cosine affinity rather than the absolute feature distance. The method contains three coupled steps.
Projection and Decoupling. We collect a matched set of $B$ point/voxel features from the teacher and student branches, denoted by $F^{T}\in\mathbb{R}^{B\times C_T}$ and $F^{S}\in\mathbb{R}^{B\times C_S}$. These features are projected into a shared latent manifold subspace through learnable nonlinear heads $\psi_T(\cdot)$ and $\psi_S(\cdot)$. The teacher and student projection heads are implemented as two-layer MLPs with an output dimension of 128, namely, $\psi_T:\mathbb{R}^{C_T}\rightarrow\mathbb{R}^{128}$ and $\psi_S:\mathbb{R}^{C_S}\rightarrow\mathbb{R}^{128}$, respectively, with ReLU activation and batch normalization between layers. This projection reduces modality-specific statistical bias and separates modality-private attributes as follows:

$$ E^{T}=\psi_T\big(F^{T}\big)\in\mathbb{R}^{B\times 128}, \qquad E^{S}=\psi_S\big(F^{S}\big)\in\mathbb{R}^{B\times 128}. $$

Inter-sample Affinity Modeling. Within the sampled set of size $B$, we construct teacher and student affinity matrices ($A^{T}, A^{S}\in\mathbb{R}^{B\times B}$). Here, each “sample” corresponds to a point/voxel feature (or a fixed-size subset thereof) selected from the current batch to control the $\mathcal{O}(B^{2})$ complexity. The $(i,j)$th element characterizes the cosine similarity between the $i$th and $j$th samples on the feature manifold, effectively encoding global topological structural information as follows:

$$ A_{ij}^{T}=\frac{\langle e_i^{T}, e_j^{T}\rangle}{\lVert e_i^{T}\rVert_2\,\lVert e_j^{T}\rVert_2}, \qquad A_{ij}^{S}=\frac{\langle e_i^{S}, e_j^{S}\rangle}{\lVert e_i^{S}\rVert_2\,\lVert e_j^{S}\rVert_2}, $$

where $e_i^{T}$ and $e_i^{S}$ denote the $i$th row vectors of $E^{T}$ and $E^{S}$, respectively. Crucially, these matrices capture high-order structural information and remain invariant to the absolute scaling of specific feature magnitudes.

Topology Consistency Constraint. The student network is then trained to reconstruct the teacher’s topological view in the latent space by minimizing the Frobenius norm distance between the teacher’s affinity matrix ($A^{T}$) and the student’s affinity matrix ($A^{S}$) as follows:

$$ \mathcal{L}_{MATD}=\frac{1}{B^{2}}\big\lVert A^{T}-A^{S}\big\rVert_F^{2}. $$
Through the MATD strategy, the student network learns the essential cross-modal law of “whether sample A and sample B are semantically similar or dissimilar”, ignoring low-level numerical discrepancies. This manifold alignment paradigm significantly improves the robustness and convergence speed of the model in complex scenarios.
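The three MATD steps translate into a short PyTorch module, sketched below; the 128-dimensional projection heads follow the text, whereas the sampling of the $B$ matched teacher/student features per batch is assumed to happen upstream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MATD(nn.Module):
    """Manifold-Aware Topological Distillation: match teacher/student affinity matrices (sketch)."""

    def __init__(self, c_teacher, c_student, d=128):
        super().__init__()
        # Two-layer projection heads into the shared 128-d latent subspace.
        self.proj_t = nn.Sequential(nn.Linear(c_teacher, d), nn.BatchNorm1d(d),
                                    nn.ReLU(inplace=True), nn.Linear(d, d))
        self.proj_s = nn.Sequential(nn.Linear(c_student, d), nn.BatchNorm1d(d),
                                    nn.ReLU(inplace=True), nn.Linear(d, d))

    @staticmethod
    def cosine_affinity(e):
        e = F.normalize(e, dim=1)               # row-normalize so E E^T holds cosine similarities
        return e @ e.t()                        # (B, B) affinity matrix

    def forward(self, f_teacher, f_student):
        # f_teacher: (B, c_teacher), f_student: (B, c_student) matched point/voxel features.
        a_t = self.cosine_affinity(self.proj_t(f_teacher))
        a_s = self.cosine_affinity(self.proj_s(f_student))
        b = a_t.shape[0]
        return ((a_t - a_s) ** 2).sum() / (b * b)   # (1/B^2) * ||A^T - A^S||_F^2
```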
2.5. Optimization Objective and Loss Functions
To endow the single-modal student network with high-precision segmentation capabilities while effectively inheriting cross-modal knowledge, we design a compound optimization objective. This objective consists of two primary components: a hybrid segmentation loss for supervised learning and a collaborative distillation loss for knowledge transfer.
2.5.1. Hybrid Segmentation Loss
LiDAR point clouds in outdoor scenes typically exhibit a long-tailed distribution, where background classes (e.g., roads and buildings) dominate over foreground classes (e.g., pedestrians and cyclists). To mitigate the optimization bias caused by this class imbalance, we employ a hybrid supervision strategy combining the standard Cross-Entropy loss ($\mathcal{L}_{ce}$) and the Lovász-Softmax loss [50] ($\mathcal{L}_{ls}$).

Specifically, $\mathcal{L}_{ce}$ ensures the convergence of pixel-wise classification probabilities, while $\mathcal{L}_{ls}$ serves as a differentiable surrogate for the Jaccard index (IoU), directly optimizing the segmentation quality of sparse and small-scale objects. The segmentation loss is formulated as follows:

$$ \mathcal{L}_{seg}=\mathcal{L}_{ce}+\mathcal{L}_{ls}. $$
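For completeness, a compact sketch of the hybrid supervision is shown below. The Lovász term follows the standard flattened formulation of [50]; the equal weighting of the two terms and the `ignore_index` convention are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def lovasz_grad(gt_sorted):
    """Gradient of the Lovász extension w.r.t. sorted errors (after [50])."""
    gts = gt_sorted.sum()
    intersection = gts - gt_sorted.cumsum(0)
    union = gts + (1.0 - gt_sorted).cumsum(0)
    jaccard = 1.0 - intersection / union
    if gt_sorted.numel() > 1:
        jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_softmax(probs, labels, num_classes):
    """Flattened Lovász-Softmax: a differentiable surrogate of the per-class IoU."""
    losses = []
    for c in range(num_classes):
        fg = (labels == c).float()
        if fg.sum() == 0:                        # skip classes absent from this batch
            continue
        errors = (fg - probs[:, c]).abs()
        errors_sorted, perm = torch.sort(errors, descending=True)
        losses.append(torch.dot(errors_sorted, lovasz_grad(fg[perm])))
    return torch.stack(losses).mean()

def segmentation_loss(logits, labels, num_classes, ignore_index=255):
    """Hybrid supervision: cross-entropy plus Lovász-Softmax on valid points."""
    ce = F.cross_entropy(logits, labels, ignore_index=ignore_index)
    valid = labels != ignore_index
    ls = lovasz_softmax(F.softmax(logits[valid], dim=1), labels[valid], num_classes)
    return ce + ls
```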
2.5.2. Collaborative Distillation Loss
As detailed in Section 2.4, the distillation loss comprises two parts: the logit-level soft-label alignment ($\mathcal{L}_{soft}$) for the semantic probability transfer, and the feature-level manifold-aware topological distillation ($\mathcal{L}_{MATD}$) for geometric structure alignment. The total distillation loss is defined as follows:

$$ \mathcal{L}_{distill}=\mathcal{L}_{soft}+\mathcal{L}_{MATD}. $$
2.5.3. Total Optimization Objective
The final objective function for end-to-end training is a weighted sum of the segmentation supervision and the cross-modal distillation constraints as follows:

$$ \mathcal{L}_{total}=\mathcal{L}_{seg}+\lambda\,\mathcal{L}_{distill}, $$

where $\lambda$ is a hyperparameter balancing the contribution of the teacher’s guidance. By minimizing $\mathcal{L}_{total}$, the student network simultaneously learns precise decision boundaries from ground-truth labels and robust topological representations from the multimodal teacher network.
4. Discussion
The results indicate that CMCMD narrows the performance gap between LiDAR-only and multimodal semantic segmentation. Using DINOv3 as the teacher backbone is particularly beneficial for texture-dependent or geometrically ambiguous classes, improving “motorcyclist” by 20.8% over Cylinder3D, “bicycle” by 27.2% on nuScenes, and “person” by 4.6%.
Table 4 further shows that MATD alone (71.7%) outperforms pointwise transfer losses, such as L2 mimicking (70.6%) and KL matching (71.0%), supporting the premise that aligning inter-sample relations is more effective than rigid numerical matching across heterogeneous modalities. The combination of $\mathcal{L}_{soft}$ and $\mathcal{L}_{MATD}$ reaches 72.9%, indicating that the two objectives provide complementary supervision.
Compared with recent VFM-guided image-to-LiDAR representation learning methods, such as Three Pillars [33], ELiTe [34], LiMoE [35], and LiMA [36], our framework is optimized directly for supervised semantic segmentation and emphasizes topology-preserving distillation within the downstream training loop. Rather than focusing on the generic 3D representation quality alone, CMCMD addresses a specific failure mode in segmentation learning, namely, heterogeneous numerical alignment between modalities, through shared-manifold affinity matching.
From a practical standpoint, the CMCMD student achieves a 72.9% mIoU with only 71 ms of latency and 2.4 GB of memory (Table 6), reducing the computational cost by 3.4×–4.4× relative to multimodal baselines and satisfying the <100 ms real-time constraint of autonomous driving. Eliminating camera dependency during inference further enhances robustness to adverse weather, sensor failures, and calibration drift. The qualitative evaluation in unseen campus scenes (Figure 5) demonstrates strong cross-domain generalization, attributable to the DINOv3 teacher’s open-world representations and MATD’s focus on modality-invariant topological structures.
Several limitations remain. Recent temporal LiDAR segmentation models, such as MemorySeg [23] and TASeg [24], process temporal contexts explicitly, whereas the current framework operates on individual LiDAR frames. Incorporating 4D convolutions, recurrent memory, or sequence-level distillation could improve consistency for moving objects and severely occluded regions. UBMM also relies on camera–LiDAR calibration and 2D/3D projection quality. Although the validity mask removes out-of-view points, the current formulation does not model visibility competition, projection uncertainty, or rolling-synchronization errors explicitly, which may reduce robustness under calibration drift or severe occlusion. The $\mathcal{O}(B^{2})$ affinity computation in MATD may also require efficient approximations for denser sampling or larger batch construction. Extending the distillation paradigm to additional sensor modalities (e.g., radar) and integrating online learning for continuous adaptation remain natural directions for future work. The present study reports the main quantitative results from single training runs under a fixed protocol. The expanded ablations, sensitivity analysis, efficiency study, and real-world campus validation provide complementary evidence, but multi-seed variance analysis and formal statistical significance testing are still needed.