1. Introduction
Point cloud semantic segmentation is a critical task in computer vision that involves assigning semantic labels to individual points within a three-dimensional (3D) point cloud. This methodology has garnered considerable attention due to its wide-ranging applications, particularly in navigation and perception for autonomous vehicles, robotics, and virtual reality environments.
Point cloud data is typically collected using light detection and ranging (LiDAR), a crucial technology that enables self-driving vehicles to perceive their environment [1]. Semantic segmentation methods for point clouds have advanced significantly through fully supervised learning approaches. However, PointNet [2] faces substantial computational burdens when processing raw point cloud data directly, particularly on large-scale datasets, which limits its suitability for real-time applications. In contrast, RandLA-Net [3] improves efficiency by randomly downsampling the input point cloud to a subset of points for training. While this approach supports real-time performance, it can compromise segmentation accuracy in more complex scenarios. Both approaches depend heavily on extensive point cloud annotations. For instance, labeling complex areas in SemanticKITTI [4], such as residential neighborhoods, is highly labor-intensive, significantly driving up annotation costs.
To reduce time and labor costs, weakly supervised learning methods for point clouds have emerged, training networks with only a small subset of labeled points. By relying on sparse or minimal annotations, these approaches significantly reduce annotation effort and time. Recent studies [5,6,7,8,9] have explored learning from weakly annotated point clouds, using the consistency constraint technique [10] to improve segmentation accuracy. This technique narrows the gap between predictions and ground truth by comparing the prediction for an unlabeled point with that for its perturbed version [11]. Although this approach improves point cloud segmentation and reduces labeling costs, it remains limited in capturing the full complexity of point cloud data: consistency constraints and a small set of labeled points alone are insufficient for extracting the rich contextual information embedded in the point cloud.
Over the past two years, the perturbation self-distillation (PSD) [12] framework has emerged to address these challenges. Drawing on self-supervised learning principles, PSD generates perturbed branches and enforces predictive consistency between them and the original branch. By integrating an auxiliary supervision mechanism, it effectively captures the graph topology of the entire point cloud, thereby enabling better information transfer between labeled and unlabeled points. Despite achieving significant average improvements, this method has several limitations. First, relying on a single-layer RandLA-Net [3] for point cloud training can result in the loss of intricate details, reducing the model's ability to accurately capture point cloud features. Second, insufficient supervisory information may restrict the model's adaptability to varied point cloud structures. Lastly, the method's success heavily depends on the network's ability to understand point cloud features and acquire effective supervisory information, both of which are crucial for producing high-quality predictions.
To address these challenges, this study proposes a novel heterogeneous dual-branch consistency learning framework that enhances semantic segmentation performance under limited supervision. Unlike conventional mean teacher approaches [13] with identical architectures, our method employs structurally asymmetric encoders: a deep student network and a shallower-but-wider teacher network, where the teacher's parameters are updated via an exponential moving average (EMA) with cross-architectural variable mapping. This design allows the teacher to act as a stabilized, distilled version of the student, providing consistent target signals while preserving essential geometric details. Beyond the standard prediction-level consistency, we introduce multi-hierarchical consistency constraints that simultaneously regularize predictions, features, and local structures.
To generate diverse yet geometrically meaningful perturbations, we propose a structure-aware augmentation module that incorporates four distinct operations (mirroring, rotation, coordinate jittering, and regionwise masking), applied randomly during training. Importantly, the augmentation process is weighted by a learned channel attention mechanism, allowing the model to emphasize informative feature channels. Furthermore, an explicit Structure Relation Context (SR-Context) module is embedded to encode local geometric relationships, providing an inductive bias that is particularly beneficial for outdoor and medical point clouds with strong structural patterns.
In summary, the main contributions of this study are outlined as follows:
(1) We propose a structurally asymmetric dual-branch design featuring a deep student encoder and a shallow-but-wide teacher encoder. The teacher’s parameters are updated via a cross-architectural exponential moving average (EMA) with tailored variable mapping, which stabilizes training and preserves fine geometric details often lost in homogeneous setups.
(2) We introduce a unified regularization scheme that enforces agreement at three complementary levels: prediction consistency through Jensen-Shannon divergence, feature-level consistency via contrastive regularization (CR loss) that clusters same-class embeddings, and local structural consistency through a novel Laplacian smoothing loss that encourages similar predictions among neighboring points. This multi-level approach is systematically implemented in the loss function.
(3) We design a structure-sensitive augmentation module that combines mirroring, rotation, jittering, and regionwise masking, weighted by a learnable channel attention mechanism. Furthermore, an explicit Structure Relation Context (SR-Context) module captures local geometric relationships via edge convolution, providing a strong inductive bias for segmenting complex outdoor and medical point clouds.
(4) We conduct rigorous evaluations on multiple large-scale benchmarks (including S3DIS and Toronto-3D) under extremely low-label regimes (e.g., 0.1% labels). Our framework not only outperforms existing weakly and semi-supervised methods but also matches or surpasses several fully supervised baselines, demonstrating its effectiveness and robustness.
The rest of this paper is organized as follows:
Section 2 reviews related work on semi-supervised, fully supervised, and unsupervised point cloud segmentation.
Section 3 details our proposed heterogeneous dual-branch framework, including the EMA teacher mechanism, structure-aware enhancement, the SR-Context module, and the design of the loss functions.
Section 4 presents extensive experiments on benchmarks such as S3DIS and Toronto-3D, including comparisons with state-of-the-art methods, ablation studies, and qualitative analyses. Finally,
Section 5 concludes this paper and discusses potential directions for future research.
3. Methodology
3.1. Overall Network Model Architecture
The overall network framework is shown in Figure 1. Point cloud data, denoted as P, typically contains N points, each defined by spatial coordinates (x, y, z) and color attributes (r, g, b). The dual-branch consistency learning network proposed in this study employs an enhanced hierarchical encoder–decoder architecture with an exponential moving average (EMA) teacher mechanism and consists of two main synergistic components. The traditional mean teacher [13] employs identical network architectures for both student and teacher models: the teacher's parameters are a direct exponential moving average (EMA) of the student's, and the architectures are symmetric. In contrast, we implement structurally asymmetric encoders while maintaining parameter-update linkage through a novel variable-mapping EMA scheme. This architectural divergence introduces functional specialization: the student explores complex feature representations while the teacher maintains a stabilized, distilled feature space. The EMA update propagates not just parameters but architecturally transformed knowledge from deep to shallow representations.
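To make the cross-architectural update concrete, the following sketch shows one way such a variable-mapped EMA could be implemented in PyTorch. The `VAR_MAP` table, the module names, and the decay value are illustrative assumptions, not the exact mapping used in our implementation.

```python
import torch

# Illustrative mapping from student modules to teacher modules; the real
# variable mapping depends on the concrete architectures (assumption).
VAR_MAP = {
    "encoder.block1": "encoder.block1",  # shared shallow stage
    "encoder.block2": "encoder.block2",
    "encoder.block4": "encoder.block3",  # deep student stage -> last teacher stage
}

@torch.no_grad()
def ema_update(student, teacher, decay=0.99):
    """Copy an exponential moving average of mapped student weights into the teacher."""
    s_params = dict(student.named_parameters())
    t_params = dict(teacher.named_parameters())
    for s_prefix, t_prefix in VAR_MAP.items():
        for name, s_param in s_params.items():
            if not name.startswith(s_prefix):
                continue
            t_name = t_prefix + name[len(s_prefix):]
            t_param = t_params.get(t_name)
            # Only update when the mapped tensors have identical shapes.
            if t_param is not None and t_param.shape == s_param.shape:
                t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)
```

Called once per training step after the student's optimizer update, this keeps the teacher a temporally smoothed copy of the student wherever the two architectures share compatible tensors.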
In the first component, the original point cloud undergoes comprehensive data augmentation. Whereas the traditional mean teacher applies generic augmentations (flips, rotations, noise addition), we additionally introduce regionwise masking and a channel attention mechanism. Both the original and augmented point clouds are processed through a multi-scale feature extraction pipeline comprising dilated residual blocks with relative position encoding and attention-based pooling operations. The encoder utilizes random sampling and neighborhood aggregation to capture hierarchical features, while the decoder employs nearest interpolation to recover spatial resolution.
In the second component, the network implements a student–teacher paradigm in which the student network processes augmented inputs and the teacher network, updated via EMA from the student's parameters, provides stable supervisory signals. Consistency constraints are enforced through a multi-objective optimization framework combining cross-entropy loss, Lovasz loss, JS divergence between student and teacher predictions, contrastive regularization for feature embedding alignment, and Laplacian smoothing for spatial coherence. Whereas the traditional mean teacher enforces consistency through a single loss term (typically MSE or KL divergence) between student and teacher predictions, we implement a multi-tier consistency framework. This comprehensive loss strategy enables effective knowledge transfer between labeled and unlabeled data while preserving structural integrity across varying point densities. The proposed framework further enhances semantic feature extraction through structural context preservation modules that explicitly model local geometric relationships and adaptive feature aggregation, significantly improving the network's capability to capture both fine-grained details and complex contextual information in 3D scenes.
3.2. Point Cloud Augmentation Branching
To enhance model robustness and generalization capabilities while extracting more stable semantic features from limited labeled data, controlled perturbations are applied to the input data or features, thereby forming a perturbation branch. Various perturbation operations are first applied to the original point cloud, including geometric transformations (rotation and scaling), noise injection, and mirroring. These augmentations help ensure that the model’s predictions stay consistent across different perturbations, thereby mitigating the sensitivity of the point cloud to noise. Furthermore, a self-supervised region masking strategy is employed, which systematically masks portions of the point cloud in geometric space. This approach generates extra supervisory signals during training by leveraging the data’s inherent structural or semantic information. Training data selection is performed probabilistically, helping the model learn more robust semantic features.
Before the training process begins, the original point cloud is mapped into feature dimensions using a fully connected layer, preparing it for further network processing. The input data for the current layer of the original point cloud is characterized by the coordinates $P_l$, where $\mathcal{N}_l$ represents the neighbor index for the current layer and $\mathcal{D}_l$ signifies the downsampling index for the same layer. Data augmentation employs several techniques, including mirror transformation, axis rotation, jittering, and a self-supervised masking strategy. The module selects among these enhancements randomly, following a probability distribution aligned with predetermined criteria.
The mirror transformation, as detailed in reference [23], is a rigid reflection, a type of linear transformation that symmetrically reflects the point cloud across the Y-axis; it can be mathematically expressed as $(x, y, z) \mapsto (-x, y, z)$. This operation preserves the topological properties of the object and encourages the model to become invariant to reflective symmetry.
The random rotation matrix, described in reference [24], is essential for 3D point cloud data. Its primary role is to enhance model robustness and generalization by applying diverse geometric transformations, implicitly enlarging the training dataset and decreasing the model's dependence on the original orientation distribution. In our approach, we use a random rotation matrix to rotate the point cloud coordinates around the Y-axis.
In this framework, the rotation angle follows a uniform distribution $\theta \sim U(0, 2\pi)$. The global geometric transformation is applied via matrix multiplication, expressed as $P' = P R_y(\theta)$. Furthermore, Gaussian noise is incorporated by perturbing the positions within the point cloud. This noise is typically described by a distribution $\mathcal{N}(0, \sigma^2)$, with an offset $\delta \sim \mathcal{N}(0, \sigma^2)$ added to perturb point positions. Consequently, the modified point cloud is expressed as $\tilde{P} = P R_y(\theta) + \delta$.
Jitter enhancement, as described in reference [25], adds Gaussian noise to point cloud coordinates to simulate sensor noise or minor deformations. This technique boosts the model's robustness against real-world noisy data. Implementing jitter enhancement requires careful tuning of noise intensity, framework compatibility, and feature synchronization. In our approach, we set the noise intensity to $\sigma$ with a maximum absolute noise threshold of $c$ to prevent excessive distortion of the point cloud.
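The sketch below illustrates the three geometric augmentations described so far in NumPy. The mirror convention (negating x), the jitter parameters `sigma` and `clip`, and the uniform selection among operations are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def mirror(points):
    """Reflect across the plane through the Y-axis by negating x (assumed convention)."""
    out = points.copy()
    out[:, 0] = -out[:, 0]
    return out

def rotate_y(points, rng):
    """Rotate around the Y-axis by an angle drawn from U(0, 2*pi)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    return points @ R.T

def jitter(points, rng, sigma=0.01, clip=0.05):
    """Add clipped Gaussian noise; sigma and clip are illustrative values."""
    noise = np.clip(sigma * rng.standard_normal(points.shape), -clip, clip)
    return points + noise

def augment(points, rng):
    """Pick one augmentation at random, mimicking the probabilistic selection."""
    ops = [mirror, lambda p: rotate_y(p, rng), lambda p: jitter(p, rng)]
    return ops[rng.integers(len(ops))](points)
```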
We adopt a region-based masking strategy in our framework due to its strong alignment with the characteristics of real-world 3D scenes. Unlike random point masking, which discards isolated points and may fail to capture structural semantics, region masking occludes spatially contiguous subsets of the point cloud, thereby simulating realistic scenarios such as object occlusion, missing scans, or sensor noise. This strategy compels the network to reason over larger spatial contexts and infer semantics from surrounding structures, which is particularly beneficial in large-scale outdoor environments where contextual dependencies dominate. Moreover, region masking avoids excessively sparse supervision signals by preserving the overall distribution of labeled points, making it more effective than random masking in weakly supervised settings. Empirical studies further demonstrate that region masking leads to more robust feature learning and improves the model’s ability to generalize to unseen data.
This technique encourages the network to focus on global semantics rather than overfitting to local details by simulating sensor occlusion or partial object loss. The point cloud's 3D space is first divided into a uniform grid of dimensions $G \times G \times G$. Each grid cell is defined as a cubic region centered at $c_r$ with a side length of $s$, where $r$ represents the grid cell corresponding to a point $p_i$ and is calculated as $r = \lfloor (p_i - p_{\min}) / s \rfloor$. The point cloud is then subjected to regional sampling, wherein $K$ regions are randomly selected from the $G^3$ grid cells, with $K = \lfloor p \cdot G^3 \rfloor$, where $p$ denotes the masking ratio. Experimental investigations conducted across various masking ratios identify the ratio that yields the best training performance, as shown in Figure 2. To prevent overfitting, the masking range is randomly adjusted throughout the training phase. This region masking consistently occludes the irregular and unordered point cloud, thereby helping the model extract meaningful contextual patterns.
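A minimal sketch of this region masking procedure is given below, assuming a cubic grid of dimension `grid_dim` and an illustrative masking ratio; the actual grid resolution and ratio follow the settings discussed above.

```python
import numpy as np

def region_mask(points, grid_dim=8, mask_ratio=0.1, rng=None):
    """Mask K = floor(p * grid_dim^3) randomly chosen grid cells; values illustrative."""
    rng = rng or np.random.default_rng()
    # Assign each point to a cubic cell of the uniform grid.
    mins = points.min(axis=0)
    side = (points.max(axis=0) - mins).max() / grid_dim
    cells = np.floor((points - mins) / (side + 1e-9)).clip(0, grid_dim - 1).astype(int)
    cell_ids = cells[:, 0] * grid_dim**2 + cells[:, 1] * grid_dim + cells[:, 2]
    # Randomly select K cells to occlude.
    k = int(mask_ratio * grid_dim**3)
    masked = rng.choice(grid_dim**3, size=k, replace=False)
    keep = ~np.isin(cell_ids, masked)
    # Return the visible points plus the boolean mask for bookkeeping.
    return points[keep], keep
```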
In the preceding discussion, we introduced four data enhancement strategies, one of which is randomly chosen during training. Real-world point clouds exhibit diverse attributes; for example, color is essential for distinguishing categories such as doors and windows. However, for classes with similar colors, such as columns and walls, relying on color may hinder effective feature extraction. To address this issue, we introduce an attribute attention layer that adaptively learns weights for the input attributes, serving as a learnable transformation to handle the inherent diversity of point clouds. Specifically, the enhanced point cloud features are concatenated with the original ones to form a combined feature $F$, which is then processed through an unbiased fully connected layer, producing a channel attention score $S = FW$, where $W$ indicates the learnable parameter matrix. The attention distribution is subsequently obtained through normalization via the Softmax function [27], expressed as $A = \mathrm{softmax}(S)$. Ultimately, the perturbed point cloud is created by applying the learned feature weights, $\hat{F} = F \odot A$.
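The attribute attention layer can be sketched as follows; the module name and the feature layout (one feature vector per point) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttributeAttention(nn.Module):
    """Learnable channel weighting over combined original/augmented features."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim, bias=False)  # unbiased fully connected layer

    def forward(self, feats):            # feats: (N, dim) combined feature F
        scores = self.fc(feats)          # channel attention score S = F W
        attn = torch.softmax(scores, dim=-1)  # attention distribution A
        return feats * attn              # reweighted (perturbed) features
```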
3.3. The Dual-Branch Network Architecture
To enhance the model's capacity for extracting contextual semantic features from point clouds, provide stronger supervisory signals with limited labeled data, and generate higher-quality segmentation results, this study proposes a dual-branch consistency learning network with an EMA teacher mechanism. The primary architecture is based on RandLA-Net [3], and the overall structure of the network is shown in Figure 1. Each branch follows an encoder–decoder architecture, where the encoder serves as the feature extractor for the point cloud data. The rationale for our architectural choices stems from three principles:
Diversity–Stability Trade-Off: The deeper student explores complex hypothesis spaces while the shallower teacher maintains stable targets.
Information Filtering Hypothesis: The deep–shallow parameter mapping filters out task-irrelevant features, retaining robust representations.
Temporal Smoothing as Regularization: Teacher weights represent temporally smoothed student behavior, preventing overfitting to recent batches.
The student network builds upon an enhanced hierarchical architecture that processes both original and augmented point clouds. The encoder integrates dilated residual blocks with local feature aggregation through relative position encoding and attention-based pooling. As a point cloud P with N points progresses through the encoder layers, systematic downsampling reduces the point count while feature dimensions expand progressively from 8 to 512, capturing multi-scale contextual information while preserving structural details through skip connections.
The teacher network employs a streamlined three-layer encoder that processes only the original point cloud, with parameters updated via an exponential moving average (EMA) from the student network. This design provides stable supervisory signals while maintaining computational efficiency. The teacher branch undergoes analogous dimensional transformations at a coarser granularity, enabling effective global context capture with reduced computational overhead.
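The asymmetric data flow between the two branches can be summarized by the following sketch, in which `student` and `teacher` stand for the deep and shallow encoder–decoder networks described above.

```python
import torch

def forward_step(student, teacher, pts, aug_pts):
    """One training pass: student sees both views; the EMA teacher sees only the original."""
    p_orig = student(pts)          # deep five-layer encoder-decoder
    p_aug = student(aug_pts)
    with torch.no_grad():          # teacher provides stable targets, no gradients
        p_teacher = teacher(pts)   # shallow three-layer encoder-decoder
    return p_orig, p_aug, p_teacher
```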
The decoder processes features from the final encoder layer through a hierarchical upsampling path. At each stage, features from the preceding decoder layer are combined with corresponding encoder features via skip connections, after applying 1 × 1 convolutions for dimensional alignment. Nearest-neighbor interpolation restores spatial resolution using stored indices from the encoder’s downsampling operations. Feature fusion occurs through channel-wise concatenation followed by transpose convolution for enhanced feature learning. The final output is generated through 1 × 1 convolutional layers that produce high-level semantic segmentation logits.
The encoder of the branch network consists of three layers, each functioning similarly to those in the backbone network, aiming to effectively capture the global information inherent in the point cloud. After the decoder, the output features undergo dropout regularization [28], which enhances the model's robustness by randomly masking neurons to help prevent overfitting. These features are then processed through a classification layer, where convolutional operations map them to the task-specific category space. This produces per-category predictions for each point, and the argmax function [29] is applied to identify the index of the maximum value for segmentation. Ultimately, the output comprises the predicted labels for the point cloud.
In the decoder phase, both networks employ nearest interpolation with skip connections to recover spatial resolution. The student decoder integrates features from corresponding encoder layers through feature concatenation and transpose convolution operations, progressively restoring point cloud resolution while incorporating multi-scale contextual information. The teacher decoder follows a similar pattern but processes only the original point cloud features.
The architecture incorporates structural context modules with edge convolution to enhance local geometric awareness, and introduces a graph-structured Laplacian smoothing loss that enforces local prediction consistency by minimizing the symmetric KL divergence between the probability distributions of neighboring points. This multi-objective optimization framework integrates cross-entropy loss, Lovasz loss, Jensen–Shannon divergence [30], and contrastive regularization, thereby achieving both local smoothness and global semantic accuracy while ensuring robustness to noise and structural variations.
3.4. Structure-Relation Context (SR-Context) Module
To explicitly encode local geometric relationships and enhance the model’s awareness of point cloud structure, we introduce a Structure-Relation Context (SR-Context) module. This module operates directly on point-level features and is designed to capture context from the local neighborhood of each point.
Given the input point-wise features $F \in \mathbb{R}^{N \times d}$, where $N$ is the number of points and $d$ is the feature dimension, the module first constructs a local graph using the $k$-nearest neighbors ($k$-NN) based on the point coordinates. For each point $i$, let $\mathcal{N}(i)$ denote its neighbor set. We then compute edge-aware features by applying an EdgeConv-style operation:

$$e_{ij} = \mathrm{MLP}\big(f_i \oplus (f_j - f_i)\big), \quad j \in \mathcal{N}(i),$$
where $\oplus$ denotes concatenation, $f_i$ and $f_j$ are the features of point $i$ and its neighbor $j$, and $e_{ij}$ is the resulting edge feature. The edge features are aggregated via a channel-wise attentive pooling scheme:

$$g_i = \max_{j \in \mathcal{N}(i)} \big(\sigma(e_{ij} W) \odot e_{ij}\big),$$

where $\sigma$ is the sigmoid function, $W$ is a learnable weight matrix, $\odot$ denotes element-wise multiplication, and $\max(\cdot)$ performs channel-wise max pooling.
Finally, the original feature and the structure-aware feature are concatenated and passed through a lightweight projection layer to obtain the enhanced output:

$$\hat{f}_i = \mathrm{Proj}\big(f_i \oplus g_i\big).$$

The SR-Context module is lightweight and differentiable, allowing it to be inserted at multiple stages of the network. It provides a strong geometric inductive bias that helps the model better distinguish between semantically similar but structurally different objects (e.g., chairs vs. tables), which is particularly beneficial in weakly supervised settings where labeled data is scarce.
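A compact PyTorch sketch of the SR-Context module is given below, assuming precomputed $k$-NN indices; the layer widths and single-layer MLPs are simplifications of the actual module.

```python
import torch
import torch.nn as nn

class SRContext(nn.Module):
    """EdgeConv-style relation encoding with channel-wise attentive pooling (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.attn = nn.Linear(dim, dim, bias=False)   # learnable W for attention
        self.proj = nn.Linear(2 * dim, dim)           # lightweight projection

    def forward(self, feats, knn_idx):
        # feats: (N, d); knn_idx: (N, k) neighbor indices from coordinate k-NN
        center = feats.unsqueeze(1).expand(-1, knn_idx.shape[1], -1)       # f_i
        neigh = feats[knn_idx]                                             # f_j, (N, k, d)
        edges = self.edge_mlp(torch.cat([center, neigh - center], dim=-1)) # e_ij
        gates = torch.sigmoid(self.attn(edges))                            # sigma(e W)
        pooled = (gates * edges).max(dim=1).values                         # channel-wise max
        return self.proj(torch.cat([feats, pooled], dim=-1))               # enhanced f_i
```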
3.5. Loss Function Design
The loss function in our proposed framework is designed as a comprehensive multi-objective optimization strategy that integrates multiple complementary components to enhance segmentation performance and ensure consistency across the dual-branch architecture.
The cross-entropy loss ($\mathcal{L}_{ce}$) measures the divergence between predicted probability distributions and ground-truth labels, with class-weighted adjustments to address category imbalance in point cloud data. Simultaneously, the Lovasz-softmax loss ($\mathcal{L}_{ls}$) directly optimizes the intersection-over-union (IoU) metric, effectively handling the non-convex and combinatorial nature of segmentation evaluation while being particularly beneficial for imbalanced class distributions.
Beyond these fundamental segmentation losses, we introduce several consistency and regularization terms. The JS divergence loss ($\mathcal{L}_{js}$) enforces prediction consistency between the student network's original and augmented views, as well as between student and teacher network outputs. The contrastive regularization loss ($\mathcal{L}_{cr}$) aligns feature embeddings by maximizing agreement between semantically similar points while separating dissimilar ones. The graph Laplacian smoothing loss ($\mathcal{L}_{lap}$) explicitly enforces local consistency in the prediction space by minimizing the symmetric KL divergence between neighboring points' probability distributions.
The complete optimization objective is formulated as a weighted combination of these terms (given in Section 3.6). For the segmentation terms, let $\hat{Y}$ indicate the tensor of predicted labels, $C$ the number of categories, and $Y$ the actual labels, and let $K$ refer to the number of labeled points in the point cloud. The associated weighted cross-entropy loss can then be expressed as

$$\mathcal{L}_{ce} = -\frac{1}{K} \sum_{i=1}^{K} \sum_{c=1}^{C} w_c \, y_{i,c} \log \hat{y}_{i,c},$$
where $i$ and $j$ denote distinct points within the point cloud, $y_i$ and $y_j$ signify the categorical labels of points $i$ and $j$, respectively, and $\hat{y}_i$ and $\hat{y}_j$ represent the corresponding predicted semantics. The Lovasz loss function [31] is designed to optimize the IoU metric and is frequently employed in segmentation tasks. Specifically,

$$\mathcal{L}_{ls} = \frac{1}{C} \sum_{c=1}^{C} \overline{\Delta_{J_c}}\big(m(c)\big),$$

where $\overline{\Delta_{J_c}}$ indicates the convex surrogate of the Jaccard error for category $c$, and $m(c)$ is the vector of prediction errors for that category, computed from the predicted labels $\hat{y}$ and the true labels $y$.
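For reference, the segmentation losses can be sketched as follows. The weighted cross-entropy wraps the standard PyTorch primitive, and the Lovasz-softmax follows the published surrogate of Berman et al. [31], shown here in a simplified per-cloud form; `probs` is assumed to be softmax output of shape (N, C).

```python
import torch
import torch.nn.functional as F

def weighted_ce(logits, labels, class_weights):
    """Class-weighted cross-entropy over the K labeled points only."""
    return F.cross_entropy(logits, labels, weight=class_weights)

def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension of the Jaccard loss (Berman et al.)."""
    gts = gt_sorted.sum()
    inter = gts - gt_sorted.cumsum(0)
    union = gts + (1 - gt_sorted).cumsum(0)
    jaccard = 1.0 - inter / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_softmax(probs, labels):
    """Convex surrogate of the per-class Jaccard error, averaged over present classes."""
    losses = []
    for c in range(probs.shape[1]):
        fg = (labels == c).float()          # binary ground truth for class c
        if fg.sum() == 0:
            continue                        # skip classes absent from this cloud
        errors = (fg - probs[:, c]).abs()
        errs_sorted, perm = torch.sort(errors, descending=True)
        losses.append(torch.dot(errs_sorted, lovasz_grad(fg[perm])))
    return torch.stack(losses).mean()
```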
3.6. Consistency Loss Function Design
We employ a comprehensive consistency regularization framework using the Jensen–Shannon divergence [30] to enforce prediction alignment between the student and teacher networks, as well as between the original and augmented point cloud views processed by the student network. This includes an augmentation consistency loss between the student's predictions for original and augmented data,

$$\mathcal{L}_{js}^{aug} = \mathrm{JS}\big(p_s^{o} \,\|\, p_s^{a}\big),$$

and a teacher–student consistency loss between the student and EMA teacher predictions:

$$\mathcal{L}_{js}^{ts} = \mathrm{JS}\big(p_s^{o} \,\|\, p_t\big),$$
where $p_s^{o}$ and $p_s^{a}$ represent the student network's predicted probability distributions for the original and augmented point clouds, respectively, and $p_t$ denotes the teacher network's predictions. Additionally, we incorporate a contrastive regularization loss to align feature embeddings:

$$\mathcal{L}_{cr} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(z_i, z_i') / \tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(z_i, z_j') / \tau\big)},$$
where $z_i$ and $z_i'$ represent feature embeddings from different augmentations of the same point, $\mathrm{sim}(\cdot,\cdot)$ is a similarity function (e.g., cosine similarity), and $\tau$ is a temperature parameter. The graph Laplacian smoothing loss further enforces local consistency:

$$\mathcal{L}_{lap} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|\mathcal{N}(i)|} \sum_{k \in \mathcal{N}(i)} \mathrm{KL}_{sym}\big(p_i \,\|\, p_{i_k}\big),$$
where $p_{i_k}$ represents the prediction of the $k$-th neighbor of point $i$ and $\mathrm{KL}_{sym}$ denotes the symmetric KL divergence. The complete optimization objective integrates all components:

$$\mathcal{L} = \mathcal{L}_{seg} + \lambda_1 \mathcal{L}_{js} + \lambda_2 \mathcal{L}_{cr} + \lambda_3 \mathcal{L}_{lap},$$

where $\mathcal{L}_{seg} = \mathcal{L}_{ce} + \mathcal{L}_{ls}$ combines the weighted cross-entropy and Lovasz losses for semantic segmentation, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are balancing parameters that control the relative importance of each consistency term.
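The consistency terms can be sketched as follows; the epsilon smoothing, the temperature, and the placeholder λ values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two categorical distributions of shape (N, C)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * ((a + eps) / (b + eps)).log()).sum(dim=-1)
    return (0.5 * kl(p, m) + 0.5 * kl(q, m)).mean()

def contrastive_reg(z1, z2, tau=0.1):
    """InfoNCE-style alignment of embeddings from two augmented views (sketch)."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                       # pairwise cosine similarities
    targets = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, targets)          # matching pairs on the diagonal

def laplacian_smoothing(probs, knn_idx, eps=1e-8):
    """Symmetric KL between each point and its k nearest neighbors' predictions."""
    neigh = probs[knn_idx]                           # (N, k, C)
    center = probs.unsqueeze(1)                      # (N, 1, C), broadcasts over k
    kl_pq = (center * ((center + eps) / (neigh + eps)).log()).sum(dim=-1)
    kl_qp = (neigh * ((neigh + eps) / (center + eps)).log()).sum(dim=-1)
    return (kl_pq + kl_qp).mean()

def total_loss(l_seg, l_js, l_cr, l_lap, lambdas=(1.0, 1.0, 1.0)):
    """L = L_seg + l1*L_js + l2*L_cr + l3*L_lap; lambda values here are placeholders."""
    l1, l2, l3 = lambdas
    return l_seg + l1 * l_js + l2 * l_cr + l3 * l_lap
```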
5. Conclusions
This study presents a dual-branch consistency learning framework with an EMA teacher mechanism for weakly supervised point cloud semantic segmentation. By incorporating supplementary supervisory cues for unlabeled data, the model surpasses conventional single-layer networks and improves its focus on the global characteristics of point clouds. Consequently, the proposed network consistently delivers accurate and reliable point cloud segmentation across diverse scenes. Extensive experimental results demonstrate that our method achieves competitive performance in weakly supervised segmentation of large-scale point clouds while maintaining computational efficiency. Future work will focus on extending the dual-branch consistency learning framework to weakly supervised point cloud detection and instance segmentation tasks, exploring more efficient graph construction methods for Laplacian smoothing, and investigating adaptive weighting strategies for the multi-objective loss function to further enhance performance across diverse 3D scenes.
5.1. Discussion and Limitations
Although our proposed framework demonstrates strong performance on established indoor (S3DIS) and outdoor (SemanticKITTI/Toronto-3D) benchmarks in extreme low-label settings, we acknowledge several limitations that point to valuable future work.
Our evaluation is primarily within-dataset; cross-domain generalization (e.g., from driving scenes to medical scans) remains challenging due to shifts in density, scale, and semantics. Future work calls for rigorous cross-dataset benchmarks (e.g., aerial LiDAR) and the integration of domain adaptation into our framework. Additionally, while efficient, the dual-branch design increases training cost, and performance is backbone-dependent. Future directions include architectural refinements for greater efficiency and testing on broader 3D benchmarks.
5.2. Limitations and Trade-Offs
A core trade-off exists between high accuracy under extreme label scarcity and increased training cost, a deliberate choice to maximize learning from minimal labels. Performance also depends on the backbone network’s capacity; while our design is modular, we used RandLA-Net for its efficiency. Hyperparameter tuning is required, but sensitivity analysis confirms robustness within a practical range. Finally, like most data-driven methods, optimal performance assumes similar training and test distributions; generalization across vastly different domains remains challenging and may require dedicated adaptation techniques.
5.3. Future Work
Building upon the identified limitations and trade-offs, several promising directions emerge for future research: (1) architectural distillation into a single efficient network for deployment; (2) exploring integration with more powerful backbones (e.g., transformers) to boost performance; (3) incorporating domain adaptation techniques to enhance cross-domain robustness; (4) extensive validation on broader 3D benchmarks (e.g., aerial LiDAR, medical point clouds).