1. Introduction
Three-dimensional (3D) scene understanding is fundamental to safe navigation and decision making in autonomous driving. Among its core tasks, LiDAR point cloud semantic segmentation assigns a semantic label to each point in 3D space and provides fine-grained scene understanding. Although LiDAR captures accurate geometric structures, point clouds remain sparse, irregular, and largely devoid of texture because of the sensing process. These characteristics impose a fundamental limitation on LiDAR-only methods, especially when recognizing distant small objects (e.g., pedestrians) or separating geometrically similar yet semantically different surfaces (e.g., road versus vegetation).
Existing LiDAR segmentation methods can be broadly categorized into three paradigms. Projection-based methods project 3D point clouds onto 2D representations and process them with mature 2D CNNs, building upon the fully convolutional network (FCN) [1] paradigm. Representative works include the SqueezeSeg family [2,3,4], PolarNet [5], AMVNet [6], RangeFormer [7], and FRNet [8]; however, the projection process inevitably incurs information loss and quantization errors. Point-based methods, pioneered by PointNet [9], directly process raw point sets without intermediate representations. Extensions such as PointNet++ [10], RandLA-Net [11], and KPConv [12] introduce hierarchical learning and flexible convolution operators yet suffer from high computational costs in large-scale scenes. Voxel-based methods utilize sparse convolution to efficiently handle voxelized grids (e.g., MinkowskiEngine [13], SPVCNN [14], and Cylinder3D [15]) and represent the current mainstream. Hybrid approaches, such as RPVNet [16], 2-S3Net [17], and UniSeg [18], further combine multiple representations. Related efforts also exploit scene-completion priors [19], transformer-based context modeling in SphereFormer [20], and non-uniform cylindrical partitioning in NUC-Net [21] to strengthen LiDAR-only perception, while M3Net [22] extends this line toward universal LiDAR segmentation across heterogeneous datasets and sensor setups. In parallel, temporal models, such as MemorySeg [23] and TASeg [24], exploit sequential observations to recover semantics for sparse, occluded, or fast-moving regions. Despite this progress, conventional sparse convolutions employ static shared weights that cannot adaptively adjust the receptive field according to local geometric environments (e.g., planar surfaces vs. corners), limiting the capacity to model complex fine-grained structures.
To compensate for the inherent texture deficiency of point clouds, multimodal fusion methods have attracted considerable attention. Early approaches, such as PMF [25], align point clouds with image pixels via geometric projection, whereas later methods perform feature fusion in a unified bird’s-eye view (BEV) space, including BEVFusion [26] and its variant [27], as well as MSeg3D [28]. These methods improve segmentation accuracy, but they also require LiDAR and camera inputs during inference, which increases the computational load and imposes strict synchronization requirements. Camera failure under adverse weather conditions or calibration drift can therefore degrade the overall system. Vision Foundation Models (VFMs) built on the Vision Transformer (ViT) [29] architecture, such as SAM [30] and DINOv2 [31], have shown strong generalization in the 2D domain. DINOv3 [32] further refines the Masked Image Modeling (MIM) strategy and improves boundary localization and semantic consistency in dense prediction tasks. Recent representative studies have begun to transfer such foundation-model semantics to LiDAR more directly: Puy et al. [33] analyze scalable VFM distillation for LiDAR; ELiTe [34] introduces patch-to-point multi-stage transfer with parameter-efficient teacher adaptation, and later works, such as LiMoE [35] and LiMA [36], extend image-to-LiDAR learning to multi-representation or long-horizon pretraining. Nevertheless, these methods mainly emphasize representation pretraining quality, stage-wise feature transfer, or broader multimodal perception rather than end-to-end topology-preserving distillation tailored to LiDAR semantic segmentation.
Cross-Modal Knowledge Distillation (CMKD) [37] offers a promising alternative, following the paradigm of multimodal enhancement during training and efficient single-modal deployment during inference. Early cross-modal distillation works, including RGB-depth supervision transfer [38] and audiovisual learning via SoundNet [39], demonstrated the feasibility of cross-modality knowledge transfer. In the context of 3D understanding, xMUDA [40] pioneered cross-modal unsupervised domain adaptation for semantic segmentation. Subsequent methods—2D3DNet [41], contrastive multimodal fusion approaches [42], PVKD [43], cross-domain distillation [44], bidirectional LiDAR–camera distillation [45], and CMDFusion [46]—progressively improved LiDAR segmentation under teacher guidance. Beyond pointwise transfer, structure-aware distillation methods in related domains preserve local topology or affinity graphs, such as graph-structure distillation [47] and inter-region affinity distillation [48]. However, existing LiDAR-image distillation methods still largely rely on direct feature/logit transfer or handcrafted cross-modal interaction designs [44,45,46]. They do not explicitly account for the fact that image and point-cloud features lie on heterogeneous manifolds linked by sparse, calibration-dependent correspondences; consequently, forcing the student to mimic the teacher’s numerical distribution often damages geometric discriminability and leads to negative transfer.
To address these challenges, we propose Cross-Modal Collaborative Manifold Distillation (CMCMD). The framework is built on the hypothesis that across image and LiDAR modalities, pairwise semantic relations are more transferable than absolute feature coordinates. A DINOv3 Vision Foundation Model serves as the visual teacher and provides dense semantic priors. The student network adopts an Adaptive Relation Convolution (ARConv) backbone so that feature aggregation can respond to local geometric variation. A Unified Bidirectional Mapping Module (UBMM) enables explicit interaction between the 2D and 3D branches and reduces the gap between their representations. Manifold-Aware Topological Distillation (MATD) then aligns inter-sample affinity structures in a latent subspace instead of enforcing numerical feature matching. Experiments on SemanticKITTI and nuScenes show that CMCMD achieves mIoU values of 72.9% and 81.2%, respectively, exceeding existing single-modal and cross-modal distillation methods while remaining competitive with multimodal fusion baselines at lower inference cost.
The main contributions of this paper are summarized as follows:
We propose an asymmetric cross-modal distillation framework that couples the generic semantic capabilities of DINOv3 with the geometric modeling power of ARConv, improving the representation capability of single-modal point cloud segmentation;
We design the Adaptive Relation Convolution (ARConv) module, which dynamically perceives local geometric topology via a dual-stream mechanism to generate adaptive aggregation weights, effectively overcoming the limitation of fixed receptive fields in standard sparse convolutions;
We propose the Unified Bidirectional Mapping Module (UBMM), which integrates geometry-guided visual enhancement (P2I) and semantic-injected geometric refinement (I2P) within a unified framework, rectifying the spatial misalignment and representational heterogeneity between 2D and 3D features;
We propose the Manifold-Aware Topological Distillation (MATD) algorithm, which decouples modality-private attributes and aligns the inter-sample affinity matrix to achieve the robust transfer of high-order topological knowledge across heterogeneous feature spaces.
2. Materials and Methods
2.1. Algorithm’s Overall Framework
To address the inherent challenges of texture deficiency and limited geometric feature expression in LiDAR point cloud semantic segmentation, this paper proposes a Cross-Modal Collaborative Manifold Distillation (CMCMD) framework based on an ARConv backbone and Vision Foundation Models (VFMs). As illustrated in Figure 1, the framework adopts the paradigm of “multimodal enhancement during training and efficient single-modal deployment during inference”, consisting of a multimodal teacher network and a single-modal student network.
In the feature-encoding stage, we construct an asymmetric dual-stream architecture as follows:
2D Image Branch. This branch integrates the DINOv3 Vision Foundation Model as the feature extractor. Leveraging its robust self-supervised pretraining, DINOv3 provides hierarchical image features, $\{F_{2D}^{s}\}_{s=1}^{4}$, where $F_{2D}^{s}\in\mathbb{R}^{H_s\times W_s\times C_s}$ and $s=1,\dots,4$, serving as high-quality semantic guidance for the 3D network.
3D Point Cloud Branch. This branch employs an ARConv backbone. Unlike conventional sparse convolutions with static weights, this architecture dynamically perceives the local geometric distribution of the point cloud to generate adaptive aggregation weights, thereby producing multi-scale point features, $\{F_{3D}^{s}\}_{s=1}^{4}$, where $F_{3D}^{s}\in\mathbb{R}^{N_s\times C_s}$.
To achieve deep alignment of heterogeneous modalities, a Unified Bidirectional Mapping Module (UBMM) is designed. This module receives 2D semantic features and 3D geometric features at multiple scales and eliminates the modal gap through explicit interaction mechanisms. The fused multi-scale features undergo progressive upsampling and residual aggregation before being fed into the segmentation head to produce semantic predictions.
During training, the student network is guided by the Cross-Modal Collaborative Manifold Distillation (CMCMD) strategy described in Section 2.4, which couples soft-label semantic alignment with manifold-aware topological distillation. During inference, the system retains only the single-modal student network, achieving high-precision segmentation without incurring additional sensor costs.
For reproducibility, the complete workflow can be summarized as the following stage-wise procedure (a minimal training-step sketch is given after the list):
Extract four-scale image features, $F_{2D}^{1}$–$F_{2D}^{4}$, with the frozen DINOv3 teacher and four-scale point features, $F_{3D}^{1}$–$F_{3D}^{4}$, with the ARConv-based student encoder;
At each scale, establish calibrated 2D/3D correspondences, apply the UBMM P2I path to inject geometric priors into the image branch, and apply the I2P path to feed sampled visual semantics back to the point branch;
Decode the refined point features to obtain student segmentation logits, while the teacher branch supplies soft semantic targets and intermediate features for training supervision;
Optimize the student with the supervised segmentation loss together with $\mathcal{L}_{soft}$ and $\mathcal{L}_{MATD}$, where MATD aligns teacher/student inter-sample affinity matrices in the shared latent manifold subspace;
Remove the teacher branch and all the distillation paths during inference, and deploy only the LiDAR student network for efficient single-modal segmentation.
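The following PyTorch-style sketch mirrors these five stages for a single training iteration. It is purely illustrative: the callables (`teacher_2d`, `student_3d`, `ubmm`, `seg_head`, `teacher_head`, and the three loss functions) are placeholder names for the components described in the remainder of this section, not identifiers from our implementation.

```python
import torch

def cmcmd_training_step(batch, teacher_2d, student_3d, ubmm, seg_head, teacher_head,
                        seg_loss, soft_loss, matd_loss, lam=1.0):
    """One CMCMD training step (illustrative sketch; all module names are placeholders)."""
    points, images, labels, calib = batch

    # Stage 1: four-scale features from the frozen DINOv3 teacher and the ARConv student.
    with torch.no_grad():
        feats_2d = teacher_2d(images)        # [F_2D^1, ..., F_2D^4]
    feats_3d = student_3d(points)            # [F_3D^1, ..., F_3D^4]

    # Stage 2: UBMM bidirectional interaction (P2I and I2P) at every scale.
    feats_2d, feats_3d = ubmm(feats_2d, feats_3d, calib)

    # Stage 3: student logits plus teacher soft targets at matched points.
    logits_s = seg_head(feats_3d)
    logits_t = teacher_head(feats_2d, calib)

    # Stage 4: supervised segmentation loss plus the two distillation terms.
    loss = seg_loss(logits_s, labels) + lam * (
        soft_loss(logits_s, logits_t) + matd_loss(feats_2d[-1], feats_3d[-1]))

    # Stage 5 (deployment): the teacher branch, UBMM paths, and distillation losses are removed;
    # only student_3d and seg_head are kept for single-modal inference.
    return loss
```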
2.2. 2D/3D Branch Design
To fully exploit the complementary advantages of dense image semantics and sparse point cloud geometry, we design specific feature-encoding branches tailored to the characteristics of each modality.
2.2.1. 2D Branch: Generic Semantic Representation with DINOv3
The primary objective of the image branch is to extract semantic representations with strong generalization capabilities. Unlike prior approaches that rely on ResNet [49] or DINOv2, we employ the DINOv3 [32] model to construct the 2D backbone. DINOv3 adopts an improved Masked Image Modeling (MIM) pretraining strategy, which significantly enhances performance in dense prediction tasks, particularly with respect to boundary localization and category consistency within complex scenes.
During implementation, input images are processed by a frozen DINOv3 encoder to extract pyramid features, $\{F_{2D}^{s}\}_{s=1}^{4}$. Freezing the encoder reduces training memory usage and preserves the visual knowledge learned during large-scale pretraining. The teacher, therefore, provides a stable semantic signal for cross-modal distillation without drifting toward domain-specific representations during fine tuning.
2.2.2. 3D Branch: Adaptive Geometric Modeling with ARConv
Standard sparse convolution networks (e.g., MinkowskiEngine) typically employ static geometric weights shared across all the spatial locations. This weight-sharing mechanism limits the network’s capacity to adapt to diverse local geometric morphologies, such as distinguishing flat road surfaces from slender poles. To mitigate this limitation, we design the student network backbone based on the Adaptive Relation Convolution (ARConv) module, illustrated in
Figure 2.
Formally, we describe one ARConv layer on a set of active 3D locations with coordinates $\{p_i\}_{i=1}^{N}$, $p_i\in\mathbb{R}^{3}$, and input features $\{f_i\}_{i=1}^{N}$, $f_i\in\mathbb{R}^{C_{in}}$. For a given center location, $p_i$, we define its local neighborhood as $\mathcal{N}(i)=\{\,j \mid \lVert p_j-p_i\rVert_2 \le r\,\}$, where $r$ is the query radius. The ARConv output, $\hat{f}_i\in\mathbb{R}^{C_{out}}$, is computed as a dynamic geometry-conditioned aggregation as follows:

$$ \hat{f}_i=\sum_{j\in\mathcal{N}(i)} w_{ij}\odot \tilde{f}_j, \tag{1} $$

where ⊙ denotes element-wise multiplication, $\tilde{f}_j\in\mathbb{R}^{C_{out}}$ is the transformed neighbor feature, and $w_{ij}\in\mathbb{R}^{C_{out}}$ denotes the channel-wise adaptive aggregation weights. The ARConv module generates these weights and features through a dual-stream parallel mechanism as follows:
Spatial Stream (Weight Generation). This stream focuses on parsing the local geometric topology. It takes the relative coordinates, $\Delta p_{ij}=p_j-p_i$, as input to model the spatial relationship. A learnable function, $\phi(\cdot)$, parameterized by a multi-layer perceptron (MLP), maps these geometric cues to channel-wise logits. The resulting adaptive weights are normalized over the neighbor index within $\mathcal{N}(i)$ as follows:

$$ w_{ij}^{c}=\frac{\exp\big(\phi(\Delta p_{ij})_{c}\big)}{\sum_{k\in\mathcal{N}(i)}\exp\big(\phi(\Delta p_{ik})_{c}\big)}, \qquad c=1,\dots,C_{out}. $$

This mechanism ensures that the convolution kernel, $w_{ij}$, is input dependent, dynamically adjusting its focus based on the local shape (e.g., planes, edges, or corners). The Softmax is applied over the neighbor index ($j$) within $\mathcal{N}(i)$ for each output channel ($c$).
Feature Stream (Semantic Transformation). This stream performs the linear transformation of semantic information. The input features, $f_j$, of neighboring points are processed by a linear layer, $W_f$, to generate transformed high-dimensional semantic attributes as follows:

$$ \tilde{f}_j = W_f\, f_j, $$

where $W_f\in\mathbb{R}^{C_{out}\times C_{in}}$ is the learnable weight matrix for feature projection.
By synthesizing the output features through Equation (1), ARConv endows the network with an adaptive receptive field. This design improves the representation of fine geometric structures in sparse, unstructured LiDAR point clouds and supports more accurate segmentation at object boundaries and on thin structures.
For clarity, Equation (1) adopts a generic radius-based notation for the local set $\mathcal{N}(i)$. In the actual sparse-voxel implementation, however, the neighborhood is instantiated on a hierarchical voxel pyramid rather than by an explicit k-NN search. Concretely, the 3D encoder contains four stages connected by sparse convolutions with a stride of 2. With the input voxel size fixed at 0.1 m, the effective voxel resolutions of the four stages become 0.1, 0.2, 0.4, and 0.8 m, respectively. To instantiate the multi-scale neighborhood in ARConv, we use stage-specific pyramid grid sizes for the four stages while fixing the number of reference centroids at 16 at each stage. Under our voxelization setting, these grids correspond to stage-dependent effective physical receptive extents that grow with depth, which provides a practical balance between small-object sensitivity and large-context aggregation in outdoor LiDAR scenes.
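The dense-tensor sketch below illustrates the ARConv aggregation of Equation (1) in PyTorch. For readability, it gathers neighbors through a precomputed index tensor rather than the sparse voxel pyramid used in practice, and the hidden width of the spatial-stream MLP is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARConv(nn.Module):
    """Adaptive Relation Convolution: geometry-conditioned neighbor aggregation (sketch)."""

    def __init__(self, c_in, c_out, hidden=32):
        super().__init__()
        # Feature stream: linear transform of neighbor features.
        self.w_f = nn.Linear(c_in, c_out, bias=False)
        # Spatial stream: MLP from relative coordinates to channel-wise weight logits.
        self.phi = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, c_out))

    def forward(self, coords, feats, nbr_idx):
        # coords: (N, 3), feats: (N, C_in), nbr_idx: (N, K) neighbor indices per center.
        nbr_coords = coords[nbr_idx]                 # (N, K, 3)
        nbr_feats = feats[nbr_idx]                   # (N, K, C_in)
        rel = nbr_coords - coords.unsqueeze(1)       # relative coordinates Δp_ij

        w = F.softmax(self.phi(rel), dim=1)          # normalize over the neighbor index j
        f_tilde = self.w_f(nbr_feats)                # transformed neighbor features
        return (w * f_tilde).sum(dim=1)              # Eq. (1): sum_j w_ij ⊙ f~_j
```

In the full model, `nbr_idx` would be supplied by the stage-specific voxel pyramid described above rather than by a radius or k-NN query.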
2.3. Unified Bidirectional Mapping Module
In multimodal fusion tasks, 2D images provide rich texture and semantic context, while 3D point clouds offer precise geometric structure and spatial depth. However, the heterogeneity in data representation and the misalignment of coordinate spaces pose significant challenges for effective feature interaction. To bridge this gap, we propose the Unified Bidirectional Mapping Module (UBMM). As illustrated in Figure 3, the UBMM integrates two synergistic pathways—Point-to-Image (P2I) and Image-to-Point (I2P)—within a unified framework to achieve geometry-guided visual enhancement and semantic-injected geometric refinement.
2.3.1. Coordinate Space Alignment
The prerequisite for cross-modal interaction is establishing the correspondence between heterogeneous spaces. At a given interaction stage ($s$), let the point features be $F_{3D}^{s}\in\mathbb{R}^{N_s\times C_s}$ and the image features be $F_{2D}^{s}\in\mathbb{R}^{H_s\times W_s\times C_s}$. We utilize the calibrated camera intrinsic matrix ($K$) and the LiDAR-to-camera extrinsic matrix ($[R \mid t]$) to project a 3D point ($p_i=(x_i,y_i,z_i)^{\top}$) onto the image plane as follows:

$$ d_i\begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = K\,[R \mid t]\begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \end{bmatrix}, $$

where $d_i$ is the depth of the point in the camera coordinate frame. For multi-scale interactions, the projected coordinates are rescaled to the stage-$s$ feature-map resolution as follows:

$$ \big(u_i^{s},\, v_i^{s}\big)=\Big(u_i\cdot\tfrac{W_s}{W},\; v_i\cdot\tfrac{H_s}{H}\Big), $$

where $W\times H$ is the input image resolution. We further define a validity mask $m_i\in\{0,1\}$, with $m_i=1$ only if $d_i>0$ and $(u_i^{s}, v_i^{s})$ falls inside the image bounds, so that only valid point–image correspondences participate in cross-modal interactions. In the current implementation, we do not introduce an explicit depth-based visibility competition mechanism; instead, we rely on the projection validity mask to retain only usable correspondences for subsequent interactions.
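The projection and validity masking can be written compactly as follows; the homogeneous-coordinate layout and the single scale ratio shared by both axes are assumptions made for illustration.

```python
import torch

def project_points(points, K, T_lc, img_h, img_w, scale=1.0):
    """Project LiDAR points onto a (possibly downscaled) image plane.

    points: (N, 3) LiDAR coordinates; K: (3, 3) camera intrinsics;
    T_lc: (4, 4) LiDAR-to-camera extrinsics; scale: feature-map / image resolution ratio.
    Returns pixel coordinates (N, 2) at the target scale and a boolean validity mask (N,).
    """
    ones = torch.ones(points.shape[0], 1, device=points.device)
    pts_cam = (torch.cat([points, ones], dim=1) @ T_lc.T)[:, :3]   # camera-frame coordinates
    depth = pts_cam[:, 2]

    uvw = pts_cam @ K.T                                            # perspective projection
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)
    uv = uv * scale                                                # rescale to stage-s resolution

    valid = (depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < img_w * scale) \
                        & (uv[:, 1] >= 0) & (uv[:, 1] < img_h * scale)
    return uv, valid
```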
2.3.2. P2I: Geometry-Guided Visual Feature Enhancement
The P2I pathway (left branch in Figure 3) aims to inject sparse 3D geometric priors into the 2D visual representation. At stage $s$, the valid point features in $F_{3D}^{s}$ are scattered to the image coordinates $(u_i^{s}, v_i^{s})$, and bilinear interpolation is used to generate a dense geometric prior map ($G^{s}\in\mathbb{R}^{H_s\times W_s\times C_s}$) aligned with the image resolution.

To adaptively fuse these priors, the image features ($F_{2D}^{s}$) are concatenated with $G^{s}$ to generate a spatial attention map via a gating branch. The gating mechanism consists of convolutional layers and a Sigmoid activation as follows:

$$ M^{s}=\sigma\Big(\mathrm{Conv}\big(\delta\big(\mathrm{Conv}\big([\,F_{2D}^{s};\,G^{s}\,]\big)\big)\big)\Big), $$

where $\sigma(\cdot)$ denotes the Sigmoid function, and $\delta(\cdot)$ is the ReLU activation. This attention map ($M^{s}$) modulates the geometric priors, which are then concatenated with the original image features and fused via a $1\times 1$ convolution to produce the geometry-enhanced visual output as follows:

$$ \hat{F}_{2D}^{s}=\mathrm{Conv}_{1\times 1}\big([\,F_{2D}^{s};\; M^{s}\odot G^{s}\,]\big). $$
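A minimal sketch of the P2I gating path is given below, assuming the dense geometric prior map has already been scattered and interpolated onto the stage-s image grid; channel widths and kernel sizes are illustrative.

```python
import torch
import torch.nn as nn

class P2IFusion(nn.Module):
    """Geometry-guided visual enhancement: gate the prior map, then fuse via a 1x1 conv (sketch)."""

    def __init__(self, c):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * c, c, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, 1, kernel_size=1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * c, c, kernel_size=1)

    def forward(self, f_2d, g_prior):
        # f_2d, g_prior: (B, C, H_s, W_s) image features and dense geometric prior map.
        attn = self.gate(torch.cat([f_2d, g_prior], dim=1))    # spatial attention map M^s
        return self.fuse(torch.cat([f_2d, attn * g_prior], dim=1))
```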
2.3.3. I2P: Semantic-Injected Point Feature Refinement
Conversely, the I2P pathway (right branch in Figure 3) transfers visual semantics back to the 3D domain. Using the projection correspondence, we bilinearly sample the image features at the projected coordinates of each point to obtain pointwise visual contexts ($V^{s}$). This interpolation strategy preserves the local image continuity when the projected coordinates fall at non-integer pixel locations as follows:

$$ V_i^{s}=\mathcal{B}\big(F_{2D}^{s},\,(u_i^{s}, v_i^{s})\big), $$

where $V_i^{s}$ is the $i$th sampled visual feature, and $\mathcal{B}(\cdot)$ denotes bilinear interpolation on the image feature map. Stacking all the sampled vectors yields $V^{s}\in\mathbb{R}^{N_s\times C_s}$.

To achieve deep fusion, the point features ($F_{3D}^{s}$) and sampled visual contexts are processed by multi-layer perceptrons (MLPs) and concatenated. An MLP-based gating mechanism computes the cross-modal importance weights ($g^{s}$) as follows:

$$ g^{s}=\sigma\Big(\mathrm{MLP}\big([\,\mathrm{MLP}(F_{3D}^{s});\;\mathrm{MLP}(V^{s})\,]\big)\Big). $$

The final refined point features are obtained by modulating the sampled visual features with $g^{s}$ and fusing them with the original geometric features as follows:

$$ \hat{F}_{3D}^{s}=\mathrm{MLP}\big([\,F_{3D}^{s};\; g^{s}\odot\mathrm{MLP}(V^{s})\,]\big). $$
This bidirectional loop effectively rectifies modal misalignment, yielding robust multimodal representations.
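As a concrete illustration of the I2P path, the sketch below samples image features at the projected coordinates with `grid_sample` and applies the MLP-based gate; the MLP widths and the concatenation-based fusion layout are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class I2PFusion(nn.Module):
    """Semantic-injected point refinement: sample, gate, and fuse per-point visual context (sketch)."""

    def __init__(self, c):
        super().__init__()
        self.mlp_p = nn.Sequential(nn.Linear(c, c), nn.ReLU(inplace=True))
        self.mlp_v = nn.Sequential(nn.Linear(c, c), nn.ReLU(inplace=True))
        self.gate = nn.Sequential(nn.Linear(2 * c, c), nn.ReLU(inplace=True),
                                  nn.Linear(c, c), nn.Sigmoid())
        self.fuse = nn.Linear(2 * c, c)

    def forward(self, f_3d, f_2d, uv, img_hw):
        # f_3d: (N, C) point features; f_2d: (1, C, H_s, W_s) image features; uv: (N, 2) pixel coords.
        h, w = img_hw
        grid = torch.empty_like(uv)
        grid[:, 0] = 2.0 * uv[:, 0] / (w - 1) - 1.0     # normalize to [-1, 1] for grid_sample
        grid[:, 1] = 2.0 * uv[:, 1] / (h - 1) - 1.0
        sampled = F.grid_sample(f_2d, grid.view(1, -1, 1, 2), align_corners=True)
        v = sampled.squeeze(0).squeeze(-1).t()          # (N, C) bilinearly sampled contexts V^s

        p, s = self.mlp_p(f_3d), self.mlp_v(v)
        g = self.gate(torch.cat([p, s], dim=1))         # cross-modal importance weights g^s
        return self.fuse(torch.cat([f_3d, g * s], dim=1))
```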
2.4. Cross-Modal Collaborative Manifold Distillation
The core objective of cross-modal knowledge distillation is to construct a high-performance multimodal teacher network and transfer its learned generic semantic representations to a student network that relies solely on single-modal input. However, owing to the intrinsic heterogeneous gap between the data manifolds of LiDAR point clouds (sparse, unstructured, and geometry dominant) and camera images (dense, regular, and texture dominant), conventional point-to-point feature matching (e.g., MSE) often forces the student to mimic modality-specific numerical distributions of the teacher. This rigid alignment can inadvertently damage the distinct geometric discriminability of point clouds, leading to negative transfer.
To address this challenge, we propose the Cross-Modal Collaborative Manifold Distillation (CMCMD) framework. Instead of aligning absolute feature values, this framework focuses on maintaining the consistency of topological structural relations within the feature space. Specifically, we decouple the distillation process into two orthogonal optimization paths: semantic distribution alignment and manifold topology alignment.
2.4.1. Soft-Label Semantic Alignment
To transfer the robust discriminative power of the Vision Foundation Model (VFM) for open-world categories to the point cloud network, distillation is performed at the logit level. Unlike hard labels that provide binary supervision, the soft probability distribution output by the teacher network encapsulates rich inter-class similarity information (e.g., the semantic proximity between “truck” and “bus” is higher than that between “truck” and “vegetation”).
Let $z_i^{T}, z_i^{S}\in\mathbb{R}^{K}$ denote the logits output by the teacher and student networks at voxel $i$, respectively, where $K$ is the number of semantic classes. We introduce a temperature coefficient ($\tau$) to smooth the distributions, thereby explicitly mining semantic correlations among long-tail classes. For class index $k$, the softened teacher and student probabilities are defined as follows:

$$ p_{i,k}^{T}=\frac{\exp\big(z_{i,k}^{T}/\tau\big)}{\sum_{k'=1}^{K}\exp\big(z_{i,k'}^{T}/\tau\big)}, \qquad p_{i,k}^{S}=\frac{\exp\big(z_{i,k}^{S}/\tau\big)}{\sum_{k'=1}^{K}\exp\big(z_{i,k'}^{S}/\tau\big)}. $$

We employ the Kullback–Leibler (KL) divergence to minimize the information discrepancy between the student distribution ($p_i^{S}$) and the teacher distribution ($p_i^{T}$) as follows:

$$ \mathcal{L}_{soft}=\frac{\tau^{2}}{N}\sum_{i=1}^{N} D_{\mathrm{KL}}\big(p_i^{T}\,\Vert\, p_i^{S}\big)=\frac{\tau^{2}}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} p_{i,k}^{T}\log\frac{p_{i,k}^{T}}{p_{i,k}^{S}}. $$
This loss function compels the student network to learn not only “what the class is” but also “how the teacher distinguishes” between ambiguous categories, thereby enhancing semantic generalization.
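In code, the temperature-scaled soft-label loss reduces to a standard KL divergence between softened distributions (sketch); the $\tau^{2}$ factor keeps gradient magnitudes comparable across temperatures.

```python
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, tau=2.0):
    """KL divergence between temperature-softened teacher and student distributions.

    student_logits, teacher_logits: (N, K) per-voxel class logits; tau: temperature.
    """
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    p_t = F.softmax(teacher_logits / tau, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau ** 2
```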
2.4.2. Manifold-Aware Topological Distillation (MATD)
To overcome modality heterogeneity, we propose Manifold-Aware Topological Distillation (MATD). The underlying hypothesis is that although feature distributions differ across modalities, pairwise semantic relations should remain topologically consistent within the semantic manifold. In MATD, these relations are modeled with cosine affinity rather than the absolute feature distance. The method contains three coupled steps.
Projection and Decoupling. We collect a matched set of $B$ point/voxel features from the teacher and student branches, denoted by $F^{T}\in\mathbb{R}^{B\times C_T}$ and $F^{S}\in\mathbb{R}^{B\times C_S}$. These features are projected into a shared latent manifold subspace through learnable nonlinear heads $\psi_T(\cdot)$ and $\psi_S(\cdot)$. The teacher and student projection heads are implemented as two-layer MLPs with an output dimension of 128, namely, $\psi_T:\mathbb{R}^{C_T}\rightarrow\mathbb{R}^{128}$ and $\psi_S:\mathbb{R}^{C_S}\rightarrow\mathbb{R}^{128}$, respectively, with ReLU activation and batch normalization between layers. This projection reduces modality-specific statistical bias and separates modality-private attributes as follows:

$$ E^{T}=\psi_T\big(F^{T}\big)\in\mathbb{R}^{B\times 128}, \qquad E^{S}=\psi_S\big(F^{S}\big)\in\mathbb{R}^{B\times 128}. $$

Inter-sample Affinity Modeling. Within the sampled set of size $B$, we construct teacher and student affinity matrices ($A^{T}, A^{S}\in\mathbb{R}^{B\times B}$). Here, each “sample” corresponds to a point/voxel feature (or a fixed-size subset thereof) selected from the current batch to control the $\mathcal{O}(B^{2})$ complexity. The $(i,j)$th element characterizes the cosine similarity between the $i$th and $j$th samples on the feature manifold, effectively encoding global topological structural information as follows:

$$ A_{ij}^{T}=\frac{\langle e_i^{T}, e_j^{T}\rangle}{\lVert e_i^{T}\rVert_2\,\lVert e_j^{T}\rVert_2}, \qquad A_{ij}^{S}=\frac{\langle e_i^{S}, e_j^{S}\rangle}{\lVert e_i^{S}\rVert_2\,\lVert e_j^{S}\rVert_2}, $$

where $e_i^{T}$ and $e_i^{S}$ denote the $i$th row vectors of $E^{T}$ and $E^{S}$, respectively. Crucially, these matrices capture high-order structural information and remain invariant to the absolute scaling of specific feature magnitudes.

Topology Consistency Constraint. The student network is then trained to reconstruct the teacher’s topological view in the latent space by minimizing the Frobenius norm distance between the teacher’s affinity matrix ($A^{T}$) and the student’s affinity matrix ($A^{S}$) as follows:

$$ \mathcal{L}_{MATD}=\frac{1}{B^{2}}\big\lVert A^{T}-A^{S}\big\rVert_F^{2}. $$
Through the MATD strategy, the student network learns the essential cross-modal law of “whether sample A and sample B are semantically similar or dissimilar”, ignoring low-level numerical discrepancies. This manifold alignment paradigm significantly improves the robustness and convergence speed of the model in complex scenarios.
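The three MATD steps translate into a short PyTorch module, sketched below; the 128-dimensional projection heads follow the text, whereas the sampling of the $B$ matched teacher/student features per batch is assumed to happen upstream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MATD(nn.Module):
    """Manifold-Aware Topological Distillation: match teacher/student affinity matrices (sketch)."""

    def __init__(self, c_teacher, c_student, d=128):
        super().__init__()
        # Two-layer projection heads into the shared 128-d latent subspace.
        self.proj_t = nn.Sequential(nn.Linear(c_teacher, d), nn.BatchNorm1d(d),
                                    nn.ReLU(inplace=True), nn.Linear(d, d))
        self.proj_s = nn.Sequential(nn.Linear(c_student, d), nn.BatchNorm1d(d),
                                    nn.ReLU(inplace=True), nn.Linear(d, d))

    @staticmethod
    def cosine_affinity(e):
        e = F.normalize(e, dim=1)               # row-normalize so E E^T holds cosine similarities
        return e @ e.t()                        # (B, B) affinity matrix

    def forward(self, f_teacher, f_student):
        # f_teacher: (B, c_teacher), f_student: (B, c_student) matched point/voxel features.
        a_t = self.cosine_affinity(self.proj_t(f_teacher))
        a_s = self.cosine_affinity(self.proj_s(f_student))
        b = a_t.shape[0]
        return ((a_t - a_s) ** 2).sum() / (b * b)   # (1/B^2) * ||A^T - A^S||_F^2
```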
2.5. Optimization Objective and Loss Functions
To endow the single-modal student network with high-precision segmentation capabilities while effectively inheriting cross-modal knowledge, we design a compound optimization objective. This objective consists of two primary components: a hybrid segmentation loss for supervised learning and a collaborative distillation loss for knowledge transfer.
2.5.1. Hybrid Segmentation Loss
LiDAR point clouds in outdoor scenes typically exhibit a long-tailed distribution, where background classes (e.g., roads and buildings) dominate over foreground classes (e.g., pedestrians and cyclists). To mitigate the optimization bias caused by this class imbalance, we employ a hybrid supervision strategy combining the standard Cross-Entropy loss ($\mathcal{L}_{ce}$) and the Lovász-Softmax loss [50] ($\mathcal{L}_{ls}$).

Specifically, $\mathcal{L}_{ce}$ ensures the convergence of pixel-wise classification probabilities, while $\mathcal{L}_{ls}$ serves as a differentiable surrogate for the Jaccard index (IoU), directly optimizing the segmentation quality of sparse and small-scale objects. The segmentation loss is formulated as follows:

$$ \mathcal{L}_{seg}=\mathcal{L}_{ce}+\mathcal{L}_{ls}. $$
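For completeness, a compact sketch of the hybrid supervision is shown below. The Lovász term follows the standard flattened formulation of [50]; the equal weighting of the two terms and the `ignore_index` convention are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def lovasz_grad(gt_sorted):
    """Gradient of the Lovász extension w.r.t. sorted errors (after [50])."""
    gts = gt_sorted.sum()
    intersection = gts - gt_sorted.cumsum(0)
    union = gts + (1.0 - gt_sorted).cumsum(0)
    jaccard = 1.0 - intersection / union
    if gt_sorted.numel() > 1:
        jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_softmax(probs, labels, num_classes):
    """Flattened Lovász-Softmax: a differentiable surrogate of the per-class IoU."""
    losses = []
    for c in range(num_classes):
        fg = (labels == c).float()
        if fg.sum() == 0:                        # skip classes absent from this batch
            continue
        errors = (fg - probs[:, c]).abs()
        errors_sorted, perm = torch.sort(errors, descending=True)
        losses.append(torch.dot(errors_sorted, lovasz_grad(fg[perm])))
    return torch.stack(losses).mean()

def segmentation_loss(logits, labels, num_classes, ignore_index=255):
    """Hybrid supervision: cross-entropy plus Lovász-Softmax on valid points."""
    ce = F.cross_entropy(logits, labels, ignore_index=ignore_index)
    valid = labels != ignore_index
    ls = lovasz_softmax(F.softmax(logits[valid], dim=1), labels[valid], num_classes)
    return ce + ls
```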
2.5.2. Collaborative Distillation Loss
As detailed in Section 2.4, the distillation loss comprises two parts: the logit-level soft-label alignment ($\mathcal{L}_{soft}$) for the semantic probability transfer, and the feature-level manifold-aware topological distillation ($\mathcal{L}_{MATD}$) for geometric structure alignment. The total distillation loss is defined as follows:

$$ \mathcal{L}_{distill}=\mathcal{L}_{soft}+\mathcal{L}_{MATD}. $$
2.5.3. Total Optimization Objective
The final objective function for end-to-end training is a weighted sum of the segmentation supervision and the cross-modal distillation constraints as follows:

$$ \mathcal{L}_{total}=\mathcal{L}_{seg}+\lambda\,\mathcal{L}_{distill}, $$

where $\lambda$ is a hyperparameter balancing the contribution of the teacher’s guidance. By minimizing $\mathcal{L}_{total}$, the student network simultaneously learns precise decision boundaries from ground-truth labels and robust topological representations from the multimodal teacher network.
4. Discussion
The results indicate that CMCMD narrows the performance gap between LiDAR-only and multimodal semantic segmentation. Using DINOv3 as the teacher backbone is particularly beneficial for texture-dependent or geometrically ambiguous classes, improving “motorcyclist” by 20.8% over Cylinder3D, “bicycle” by 27.2% on nuScenes, and “person” by 4.6%.
Table 4 further shows that MATD alone (71.7%) outperforms pointwise transfer losses, such as L2 mimicking (70.6%) and KL matching (71.0%), supporting the premise that aligning inter-sample relations is more effective than rigid numerical matching across heterogeneous modalities. The combination of $\mathcal{L}_{soft}$ and $\mathcal{L}_{MATD}$ reaches 72.9%, indicating that the two objectives provide complementary supervision.
Compared with recent VFM-guided image-to-LiDAR representation learning methods, such as Three Pillars [33], ELiTe [34], LiMoE [35], and LiMA [36], our framework is optimized directly for supervised semantic segmentation and emphasizes topology-preserving distillation within the downstream training loop. Rather than focusing on the generic 3D representation quality alone, CMCMD addresses a specific failure mode in segmentation learning, namely, heterogeneous numerical alignment between modalities, through shared-manifold affinity matching.
From a practical standpoint, the CMCMD student achieves a 72.9% mIoU with only 71 ms of latency and 2.4 GB of memory (Table 6), reducing the computational cost by 3.4×–4.4× relative to multimodal baselines and satisfying the <100 ms real-time constraint of autonomous driving. Eliminating camera dependency during inference further enhances robustness to adverse weather, sensor failures, and calibration drift. The qualitative evaluation in unseen campus scenes (Figure 5) demonstrates strong cross-domain generalization, attributable to the DINOv3 teacher’s open-world representations and MATD’s focus on modality-invariant topological structures.
Several limitations remain. Recent temporal LiDAR segmentation models, such as MemorySeg [23] and TASeg [24], process temporal contexts explicitly, whereas the current framework operates on individual LiDAR frames. Incorporating 4D convolutions, recurrent memory, or sequence-level distillation could improve consistency for moving objects and severely occluded regions. UBMM also relies on camera–LiDAR calibration and 2D/3D projection quality. Although the validity mask removes out-of-view points, the current formulation does not model visibility competition, projection uncertainty, or rolling-synchronization errors explicitly, which may reduce robustness under calibration drift or severe occlusion. The $\mathcal{O}(B^{2})$ affinity computation in MATD may also require efficient approximations for denser sampling or larger batch construction. Extending the distillation paradigm to additional sensor modalities (e.g., radar) and integrating online learning for continuous adaptation remain natural directions for future work. The present study reports the main quantitative results from single training runs under a fixed protocol. The expanded ablations, sensitivity analysis, efficiency study, and real-world campus validation provide complementary evidence, but multi-seed variance analysis and formal statistical significance testing are still needed.