TGR-T: Truncated-Gaussian-Weighted Reliability for Adaptive Dynamic Thresholding in Weakly Supervised Indoor 3D Point Cloud Segmentation

Luo, Ziwei; Liu, Xinyue; Jiang, Jun; Qi, Hanyu; Wang, Chen; Xie, Zhong; Zeng, Tao

doi:10.3390/ijgi15030108

by Ziwei Luo ^1,2, Xinyue Liu ^1,3, Jun Jiang ^1,4, Hanyu Qi ^1, Chen Wang ^1, Zhong Xie ^2 and Tao Zeng ^5,6,*

^1 School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan 430200, China

^2 School of Computer Science, China University of Geosciences, Wuhan 430074, China

^3 Engineering Research Center of Natural Resource Information Management and Digital Twin Engineering Software, Ministry of Education, Wuhan 430074, China

^4 Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan 430074, China

^5 School of Electronic Information Engineering, Sichuan University, Chengdu 610065, China

^6 Chengdu Qianjia Technology Co., Ltd., Chengdu 610207, China

^* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2026, 15(3), 108; https://doi.org/10.3390/ijgi15030108

Submission received: 14 January 2026 / Revised: 23 February 2026 / Accepted: 2 March 2026 / Published: 4 March 2026

(This article belongs to the Special Issue Indoor Mobile Mapping and Location-Based Knowledge Services)

Abstract

Indoor 3D point cloud semantic segmentation is a fundamental task for fine-grained scene understanding and intelligent perception. Due to the prohibitive cost of dense point-wise annotations, weakly supervised learning has emerged as a promising alternative for indoor point cloud segmentation. However, existing weakly supervised methods commonly rely on fixed confidence thresholds for pseudo-label selection, which exhibit limited generalization caused by threshold sensitivity, underutilization of informative low-confidence regions, and progressive noise accumulation during self-training. To address these issues, we propose TGR-T, a weakly supervised framework for indoor 3D point cloud semantic segmentation that incorporates truncated-Gaussian-weighted reliability with adaptive dynamic thresholding. Specifically, a reliability-adaptive dynamic thresholding strategy is introduced to guide pseudo-label selection based on the evolving confidence statistics of unlabeled mini-batches, with exponential moving average smoothing employed to produce stable global estimates and robust separation of reliable and ambiguous regions. To further exploit uncertain regions, a learnable truncated Gaussian weighting function is designed to explicitly model prediction uncertainty within the ambiguous set, providing soft supervision by assigning adaptive weights to low-confidence predictions during optimization. 
Extensive experiments on standard indoor 3D scene benchmarks demonstrate that TGR-T makes markedly better use of unlabeled data and achieves competitive or superior segmentation performance under extremely sparse supervision. With only 1% labeled points, it can even outperform several fully supervised baselines trained with dense annotations, thereby substantially narrowing the performance gap between weakly supervised and fully supervised 3D semantic segmentation methods.

Keywords:

3D point cloud semantic segmentation; indoor scene; weakly supervised learning; dynamic thresholding; truncated-Gaussian weighting; pseudo-label refinement

1. Introduction

In recent years, 3D point cloud technology has been widely applied in remote sensing because it provides high-precision spatial information. Advanced 3D scanning sensors accurately capture the complex structure of indoor environments, providing critical data support for indoor space analysis and reconstruction. Semantic segmentation of indoor 3D point clouds aims to assign a semantic label to each point. By capturing fine-grained structural features, it yields richer and more detailed environmental information and thereby promotes automated scene understanding and intelligent perception in indoor environments. It also supports automated data analysis and intelligent decision-making in a wide range of remote sensing applications, including autonomous robotic systems [1,2], indoor navigation [3,4], augmented reality [5,6], and Building Information Modeling (BIM) [7,8].

In early studies on 3D point cloud semantic segmentation, many methods relied on fully supervised approaches, which require exhaustive per-point annotations. Such annotations are labor-intensive, time-consuming, and prone to human error, particularly in complex urban or natural environments [9]. To alleviate these challenges, weakly supervised methods have been proposed, leveraging sparse or incomplete annotations while exploiting the rich geometric and structural information inherent in 3D point clouds to enable scalable and high-fidelity 3D scene understanding [10,11,12]. Among these approaches, pseudo-label-based self-training has become one of the most widely adopted frameworks. In this paradigm, a model trained with sparse supervision generates pseudo-labels for unlabeled points, which are then iteratively refined to improve model performance. Many existing methods typically employ fixed confidence thresholds to select high-confidence pseudo-labels [13,14,15]. While this strategy effectively suppresses noisy labels, it inevitably discards informative yet ambiguous samples that contain valuable semantic cues. Fixed-threshold approaches exhibit three main limitations: they under-utilize unlabeled data, struggle to adapt to class-specific characteristics, and are sensitive to the dynamic changes in confidence distributions during training [16]. These limitations have motivated the development of more flexible pseudo-labeling mechanisms that can better exploit unlabeled points, improve training stability, and enhance model generalization, thereby addressing the intrinsic challenges of weakly supervised 3D point cloud segmentation.

Although fixed-threshold strategies have demonstrated effectiveness in weakly supervised point cloud segmentation [17,18], they remain fundamentally constrained by several persistent and intractable issues that hinder their generalization to complex scenarios. These drawbacks manifest in three critical aspects: (1) High sensitivity to the choice of threshold—since the optimal confidence threshold varies significantly with dataset characteristics and training dynamics, a fixed threshold determined on one dataset or at an early training phase often leads to unstable pseudo-label quality across different datasets and training stages, resulting in either over-retaining noisy predictions or under-utilizing informative samples. (2) The systematic neglect of low-confidence regions—conventional fixed-threshold methods discard all predictions below the predefined confidence threshold, yet these low-confidence regions may still contain valuable semantic cues that are crucial for modeling fine-grained geometric structures and accurate object boundary delineation. (3) Vulnerability to noise accumulation—in the iterative self-training pipeline, incorrect pseudo-labels generated in early stages are repeatedly used as supervision signals; the fixed-threshold strategy cannot distinguish between “informative low-confidence predictions” and “noisy predictions”, thus allowing erroneous pseudo-labels to be mistakenly reinforced in subsequent training cycles, which ultimately degrades the model’s segmentation accuracy and robustness.

To address these limitations, we propose TGR-T, a unified weakly supervised framework for indoor 3D point cloud semantic segmentation that integrates confidence-distribution-aware dynamic thresholding with uncertainty-aware soft supervision. Specifically, unlabeled points are dynamically partitioned into a reliable set and an ambiguous set based on batch-wise confidence statistics. High-confidence predictions within the reliable set are utilized as hard pseudo-labels, whereas points in the ambiguous set are softly reweighted using a truncated Gaussian function. This mechanism allows the model to effectively suppress noise while exploiting discriminative information from uncertain regions. Furthermore, to enhance semantic consistency and generalization, we introduce a class-balanced confidence regularization mechanism to alleviate category bias in pseudo-label learning. The contributions of this work are as follows:

1. We propose a reliability-adaptive dynamic thresholding estimation that adjusts pseudo-label selection based on the evolving confidence statistics of unlabeled mini-batches, with these statistics smoothed via an exponential moving average to obtain stable global estimates. Unlabeled points are then partitioned into reliable and ambiguous sets according to the adaptive threshold, enabling selective supervision that effectively mitigates noise from ambiguous regions.
2. We propose a learnable truncated Gaussian weighting function to explicitly model uncertainty within the ambiguous set. This soft supervision approach allows the model to learn effectively from uncertain regions by assigning adaptive weights to low-confidence predictions during optimization, thereby improving generalization across complex object boundaries.
3. We propose TGR-T, a unified weakly supervised framework for indoor 3D point cloud semantic segmentation and evaluate it extensively on standard indoor scene datasets. Experimental results demonstrate that TGR-T achieves competitive or superior performance under extremely sparse supervision and can even outperform some fully supervised baselines trained with dense annotations while using only 1% of labeled points.

2. Related Works

Weakly supervised learning reduces the annotation cost of 3D point cloud segmentation by leveraging sparse or coarse labels with unlabeled data but is challenged by ambiguous supervision, noisy pseudo-labels, and inaccurate boundaries. Beyond semantic segmentation, unified panoptic parsing of point clouds has also been actively studied [19], highlighting the broad spectrum of 3D scene understanding tasks. To address these issues, existing methods mainly fall into three paradigms: pseudo-label-based methods expand sparse supervision via iterative self-training and confidence-based filtering; contrastive-learning-based methods improve feature discriminability by enforcing similarity and dissimilarity across points, regions, or views under weak supervision; and consistency-regularization-based methods stabilize training by encouraging prediction invariance under perturbations or alternative views, enabling more effective utilization of unlabeled data.

2.1. Pseudo-Label-Based Methods

Pseudo-label self-training trains a teacher on labeled data, uses it to generate pseudo-labels for unlabeled samples, and then trains a student model with both real and pseudo annotations, often with iterative teacher→student updates. While a few early methods used fixed confidence thresholds to select pseudo-labels [20], most later approaches adopted dynamic thresholding, adaptively adjusting the threshold based on training progress or prediction statistics to improve both reliability and sample utilization. This paradigm has become the foundation of modern pseudo-labeling methods. Meanwhile, transformer-based 3D mask/instance segmentation methods [21,22] have demonstrated strong representation capability under full supervision, motivating the exploration of annotation-efficient learning paradigms for 3D scene understanding.

2.1.1. Fixed Threshold Filtering

In 3D point cloud and 3D segmentation tasks, researchers have proposed a variety of improvements to address key challenges such as pseudo-label noise, instance-level consistency, and severe class imbalance. A representative line of work is based on fixed-threshold pseudo-label self-training frameworks, in which semantic and instance pseudo-labels are generated and filtered under static confidence criteria, and mutual structural constraints are introduced to improve pseudo-label quality. For example, Wei [23] introduced Multi-Path Region Mining, where multi-branch attention cues produced class-specific localization signals to transfer weak labels into point-level supervision. Ünal [24] proposed a scribble-supervised LiDAR segmentation pipeline using a mean-teacher model and class-range-balanced self-training with fixed high-confidence predictions to improve pseudo-label balance and quality. These fixed-threshold-based strategies are simple and stable, but their performance is sensitive to confidence calibration and may suffer from limited flexibility in complex real-world scenes.

2.1.2. Dynamic Threshold Filtering

A complementary line of research focuses on dynamic-threshold or adaptive pseudo-label filtering strategies, which aim to further improve robustness by continuously adjusting pseudo-label selection criteria during training. A substantial body of semi-supervised learning (SSL) research has systematically investigated confidence-driven pseudo-label refinement, with particular emphasis on threshold scheduling, confidence calibration, and sample reweighting mechanisms. For instance, FixMatch [25] employs a fixed high-confidence threshold to select reliable pseudo-labels; AdaMatch [26] improves pseudo-label reliability through distribution alignment and adaptive confidence calibration; FlexMatch [27] introduces curriculum-inspired, class-wise adaptive thresholds to mitigate class imbalance; FreeMatch [28] further refines dynamic threshold selection by leveraging the temporal evolution of model confidence; and SoftMatch [29] replaces hard filtering with soft confidence-aware weighting to suppress the adverse impact of low-confidence pseudo-labels. Although these approaches were primarily developed for 2D image classification and segmentation tasks, they provide a principled foundation for adaptive pseudo-label optimization. The core idea—namely, regulating pseudo-label quality through dynamic confidence thresholding or sample-level weighting—remains highly relevant for weakly supervised 3D point cloud semantic segmentation, where uncertainty, boundary ambiguity, and geometric sparsity further complicate pseudo-label reliability. Tang [30] constructed a superpoint graph and learned inter-superpoint affinities encoding semantic and geometric relations; a semantic-aware random walk propagated sparse point-click annotations to yield high-confidence pseudo-labels whose reliability was progressively refined. 
TWIST [18] introduced a mutual-labeling mechanism for instance segmentation, enabling two-way pseudo-label generation and object-level denoising, though closed-loop propagation could accumulate errors. Tao [31] proposed SegGroup, extending sparse click annotations to segments and grouping unlabeled segments hierarchically, but the performance depended heavily on over-segmentation quality. Wang [32] developed an OCOC framework using one-click-per-class supervision, constraining pseudo-label generation to the clicked class set and dynamically reweighting labels based on entropy confidence. Ünal [33] incorporated Bayesian uncertainty to filter or weight pseudo-labels at point and instance levels, substantially improving robustness under label noise. Recently, Deng [34] proposed a self-training network with hierarchical pseudo-label optimization, dynamically merging similar categories to enhance medium-confidence pseudo-label quality. Class-Aware Pseudo-Labeling [35] adjusts pseudo-label thresholds according to class distribution but heavily depends on the accuracy of initial class distribution estimation, and errors may be amplified when labeled data are extremely scarce or the estimated distribution significantly deviates from the true distribution. PAIS [36] improves the utilization of pixel- or mask-level pseudo-labels through a dynamic alignment loss. These dynamic strategies have been shown in image and instance segmentation to significantly enhance pseudo-label quality and downstream performance, and their ideas can be directly transferred or adapted to point cloud segmentation scenarios. Such advances in instance-level segmentation [21] further indicate the potential of transferring robust architectural priors to annotation-efficient point cloud learning.

2.2. Contrastive-Learning-Based Methods

Contrastive learning is a paradigm that constructs supervisory signals through relationships between samples. Its core idea is to bring representations of the same instance closer together under different data augmentations or perspectives, while pushing representations of different instances further apart. This approach enables the model to learn discriminative features without requiring manual annotation, effectively enhancing its utilization efficiency for unlabeled data and improving performance on downstream tasks.

Researchers are addressing core challenges such as constructing geometrically and semantically aligned sample pairs and designing multi-scale contrast mechanisms suitable for segmentation tasks. On one hand, researchers focus on building robust and discriminative contrastive pairs. Early work relied on strict geometric correspondences: Xie [37] proposed PointContrast, forming point-to-point positives from overlapping views with geometric augmentations, laying the foundation for 3D contrastive learning. Wang [38] further used intrinsic local density to guide contrastive objectives, enabling features to better encode structural properties and improving representation discriminability. These methods provide universal and transferable representations. However, negative sampling within a mini-batch may introduce false-negative pairs (i.e., semantically similar instances treated as negatives), which can harm representation learning. Liu [39] proposed a thresholding-based false-negative pair calibration strategy that identified incorrect negatives via pair-alignment scores and calibrated them into positives, improving the robustness of contrastive learning. On the other hand, many studies embed contrastive learning as a regularizer in weakly or semi-supervised segmentation. Li [14] proposed hybrid contrastive regularization, jointly applying local patch-level and global scene-level contrast to enforce multi-scale structural consistency under sparse labels. Luo [40] proposed DSDCL, introducing context-aware and scene-aware dense contrastive constraints to enhance feature discrimination and robustness under noisy pseudo-label supervision. Huang [41] designed TCCWS, combining teacher–student pseudo-label consistency with contrastive learning to enhance semantic boundary discrimination. Wang [42] proposed MCCR, integrating multi-scale feature learning with point- and region-level contrastive regularization to improve semantic consistency under weak labels. 
Liu [43] incorporated contrastive learning into the LESS framework, pairing it with active learning to impose contrastive constraints on the most informative region. Group contrastive learning [44] mitigates confirmation bias by constructing group-based positives and negatives, reducing the influence of noisy pseudo-labels. Uncertainty-Guided Contrastive Learning [45] further uses prediction uncertainty to select reliable anchors and filter noisy pairs, improving robustness under sparse supervision. Tang [46] proposed Contrastive Boundary Learning, enhancing boundary-aware discrimination through multi-scale subsampling to improve both boundary and overall segmentation performance.

2.3. Consistency-Regularization-Based Methods

In semi-supervised learning, consistency regularization is a core category of methods, whose basic idea is that the model’s predictions should remain consistent when the input data undergo different perturbations or augmentations. This assumption leverages the local smoothness and invariance of point cloud data in spatial structures, enabling unlabeled samples to provide effective regularization signals for the model.

Early consistency regularization methods originated from 2D vision. Temporal Ensembling [47] and Mean Teacher [48] generate stable targets using historical predictions or EMA teachers, despite early-stage noise and architecture sensitivity, establishing the foundation for consistency-based learning. In weakly supervised 3D point cloud segmentation, SQN [49] propagates extremely sparse labels via superpoint querying using local semantic consistency without explicit consistency losses or pseudo-labeling. In the 3D domain, Zhao [50] introduced a teacher–student structure with cross-augmentation consistency, boosting performance on ScanNet and S3DIS. Xu [17] expanded sparse labels into local neighborhoods through geometric regularization and spatial smoothness, though the generated pseudo-labels remain inaccurate near boundaries and small objects. Hou [51] enforced cross-view consistency based on PointContrast, aligning features across multi-view projections to improve robustness to geometric variations. Zhang [52] further proposed perturbed self-distillation, enforcing prediction consistency between perturbed and original samples to propagate information via learned graph topology. More recent work integrates consistency with other regularization strategies. Zhao [53] introduced self-ensembling consistency in SESS, aligning predictions under perturbations via a teacher–student model to enhance 3D detection robustness. Hui [54] designed a geometry-aware edge loss enforcing intra-instance compactness and inter-instance separation, providing strong structural priors for instance segmentation. MCGC [55] integrates global structural planes with local surface convexity and color constraints within a graph-cut clustering framework to maintain local–global structural coherence in indoor point clouds. Wu [56] proposed RAC-Net, adaptively scaling consistency strength based on prediction confidence to avoid noise propagation. 
Deng [57] incorporated superpoint-guided consistency, fusing local geometric coherence with the global semantic structure, though such methods require careful tuning of weights, hyperparameters, and augmentations to prevent over-regularization, underfitting, or training instability.

3. Methods

In this section, we introduce our proposed framework for weakly supervised point cloud semantic segmentation, as shown in Figure 1. We first describe the task setting and notation and then elaborate on the two core components: reliability-adaptive dynamic thresholding estimation and truncated-Gaussian-weighted consistency regularization. The overall training pipeline of TGR-T is summarized in Algorithm 1.

3.1. Notation Definition

We address the task of weakly supervised semantic segmentation for 3D point clouds, where only a small subset of points is annotated while the remaining points are unlabeled. The labeled dataset is denoted as $D_{L} = \{(X_{i}, Y_{i})\}_{i=1}^{N_{L}}$, and the unlabeled dataset is denoted as $D_{U} = \{X_{j}\}_{j=1}^{N_{U}}$, where $X$ is a point cloud containing $N$ points with coordinates and optional attributes such as color or intensity, and $Y \in \{0, 1\}^{N \times C}$ is the per-point one-hot semantic label over $C$ classes. We adopt a shared-parameter segmentation backbone $f_{θ}(\cdot)$. Given an input point cloud $X$, the network outputs per-point class probability distributions:

$$P = f_{θ}(X) \in \mathbb{R}^{N \times C}, \quad p_{i} \in \mathbb{R}^{C}, \tag{1}$$

where $p_{i} = [p_{i,1}, \dots, p_{i,C}]$ denotes the predicted class probability vector for the $i$th point. As illustrated in Figure 1, our training procedure consists of three forward branches that share the same backbone parameters.

Algorithm 1 Full training procedure of TGR-T

Require: Labeled set $D_{L}$, unlabeled set $D_{U}$; network $f_{θ}$; augmentation $T(\cdot)$; EMA momentum $m$; threshold coefficient $k_{τ}$; Gaussian width factor $k$; loss weights $λ_{r}, λ_{a}$; iterations $T$
1: Initialize running statistics $(μ_{0}, σ_{0}^{2})$; initialize parameters $θ$
2: for $t = 1$ to $T$ do
3:   Sample: $(X_{l}, Y_{l}) \sim D_{L}$, $X_{u} \sim D_{U}$
4:   Augment: $X_{u}^{aug} = T(X_{u})$
5:   Three-branch forward: $P_{l} = f_{θ}(X_{l})$, $P_{u} = f_{θ}(X_{u})$, $P_{u}^{aug} = f_{θ}(X_{u}^{aug})$
6:   Supervised loss: $L_{seg} = CE(P_{l}, Y_{l})$
7:   Confidence: $q_{i} = \max_{c} P_{u}(i, c)$ for each unlabeled point $i$
8:   EMA stats: compute batch statistics $(\hat{μ}_{b}, \hat{σ}_{b}^{2})$ from $\{q_{i}\}$
9:   $μ_{t} = m μ_{t-1} + (1 - m) \hat{μ}_{b}$, $σ_{t}^{2} = m σ_{t-1}^{2} + (1 - m) \hat{σ}_{b}^{2}$
10:   Dynamic threshold: $τ_{t} = μ_{t} - k_{τ} σ_{t}$
11:   Partition: $R_{t} = \{i \mid q_{i} \geq τ_{t}\}$, $A_{t} = \{i \mid q_{i} < τ_{t}\}$
12:   Pseudo-labels on reliable set: $\hat{y}_{i} = \arg\max_{c} P_{u}(i, c)$ for $i \in R_{t}$
13:   Reliable-set loss: $L_{r} = \frac{1}{|R_{t}|} \sum_{i \in R_{t}} CE(P_{u}^{aug}(i, \cdot), \hat{y}_{i})$
14:   Weights on ambiguous set: $w_{i} = \exp\left(-\frac{(q_{i} - μ_{t})^{2}}{2 (σ_{t}^{2} / k^{2})}\right)$ for $i \in A_{t}$
15:   Ambiguous-set consistency: $L_{a} = \frac{1}{|A_{t}|} \sum_{i \in A_{t}} w_{i} \, KL(P_{u}(i, \cdot) \,\|\, P_{u}^{aug}(i, \cdot))$
16:   Total objective: $L_{total} = L_{seg} + λ_{r} L_{r} + λ_{a} L_{a}$
17:   Update: back-propagate and update $θ$ (one optimizer step)
18: end for

The labeled point cloud $X_{l}$ is fed into the backbone to obtain per-point predictions $P_{l} = f_{θ}(X_{l})$. A supervised cross-entropy loss is computed against the ground-truth labels $Y_{l}$:

$$L_{seg} = -\frac{1}{|Ω_{l}|} \sum_{i \in Ω_{l}} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c}), \tag{2}$$

where $Ω_{l} = \{1, 2, \dots, N\}$ denotes the index set of annotated labeled points in $X_{l}$, $C$ is the number of classes, $y_{i,c}$ is the one-hot ground-truth label, and $p_{i,c}$ is the predicted probability of point $i$ belonging to class $c$. For unlabeled point clouds $X_{u} \in D_{U}$, we obtain predictions $P_{u} = f_{θ}(X_{u})$ and compute point-wise confidence scores to quantify the reliability of the predictions. The confidence statistics estimated from the current unlabeled mini-batch are used to update the running mean and variance, from which a dynamic threshold $τ_{t}$ is derived (Section 3.2) to partition unlabeled points into a reliable set $R_{t}$ and an ambiguous set $A_{t}$. In addition, for points in $R_{t}$, the original-view predictions are converted into hard pseudo-labels.

We further construct an augmented view $X_{u}^{aug} = T(X_{u})$, where $T(\cdot)$ denotes a stochastic point-cloud augmentation operator, including random rotation, scaling, jittering, and color perturbation, and compute $P_{u}^{aug} = f_{θ}(X_{u}^{aug})$. The augmented predictions are used together with the original-view outputs to form unlabeled training signals (Section 3.3): (i) on the reliable set $R_{t}$, we apply cross-entropy between the hard pseudo-labels generated from $P_{u}$ and the augmented predictions $P_{u}^{aug}$ to obtain $L_{r}$; (ii) on the ambiguous set $A_{t}$, we enforce prediction consistency between $P_{u}$ and $P_{u}^{aug}$ using KL divergence, where each ambiguous point is reweighted by a confidence-dependent Gaussian function to yield $L_{a}$.
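As a rough illustration of the confidence-dependent Gaussian reweighting in step 14 of Algorithm 1, the following minimal Python sketch evaluates $w_{i}$ for a single ambiguous point. The function name `gaussian_weight` and all numeric values are hypothetical, and only the Gaussian form stated in Algorithm 1 is implemented; no truncation rule is applied because none has been specified at this point in the text.

```python
import math


def gaussian_weight(q_i, mu_t, sigma_t, k):
    """Weight w_i = exp(-(q_i - mu_t)^2 / (2 * sigma_t^2 / k^2)).

    q_i     : confidence of one ambiguous point
    mu_t    : running mean of the confidence scores
    sigma_t : running standard deviation of the confidence scores
    k       : Gaussian width factor (larger k gives a narrower Gaussian)
    """
    effective_var = (sigma_t ** 2) / (k ** 2)
    return math.exp(-((q_i - mu_t) ** 2) / (2.0 * effective_var))


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for k in (1.0, 2.0):
        print(k, gaussian_weight(q_i=0.45, mu_t=0.65, sigma_t=0.2, k=k))
```

With these illustrative numbers, increasing the width factor from 1 to 2 narrows the Gaussian, so the same point one standard deviation below the mean receives a smaller weight.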

The final loss function comprises three components, namely the supervised learning loss $L_{seg}$, the hard pseudo-label loss $L_{r}$, and the soft consistency loss $L_{a}$:

$$L_{total} = L_{seg} + λ_{r} L_{r} + λ_{a} L_{a}, \tag{3}$$

where $λ_{r}$ and $λ_{a}$ are the loss weights of the two unlabeled terms. In essence, labeled data provide strong supervision, whereas unlabeled data are exploited in a confidence-aware manner: high-confidence points contribute via hard pseudo-label learning, and low-confidence points contribute via softly weighted consistency regularization.

3.2. Reliability-Adaptive Dynamic Thresholding Estimation

In indoor 3D scene segmentation, weak supervision faces additional challenges beyond sparse annotation. Indoor point clouds are typically characterized by enclosed spaces, severe occlusions, strong structural regularities, and high inter-class similarity between objects such as walls, ceilings, doors, and furniture. These properties often lead to uneven confidence distributions across scenes and training stages, making the reliability of pseudo-labels highly variable. As a result, adopting a fixed confidence threshold is inadequate: an overly strict threshold severely limits the number of usable pseudo-labeled points in cluttered indoor regions, whereas a relaxed threshold tends to introduce noisy labels around object boundaries and occluded areas.

To address this issue, we propose a dynamic thresholding strategy that adapts to the evolving confidence statistics of unlabeled indoor point clouds, as illustrated in Figure 2. This design enables the model to automatically adjust pseudo-label selection criteria in response to scene complexity and training progress.

Formally, at iteration $t$, we sample an unlabeled mini-batch $B_{U} \subset D_{U}$ from indoor scenes. Let $P_{u} = f_{θ}(X_{u})$ denote the model predictions for this mini-batch. For each point, we define its confidence score as the maximum predicted class probability:

$$q_{i} = \max_{c} p_{i,c}^{u}. \tag{4}$$

Based on these confidence scores, we compute the batch-wise mean and variance over $B_{U}$:

$$\hat{μ}_{b} = \frac{1}{|B_{U}|} \sum_{i \in B_{U}} q_{i}, \quad \hat{σ}_{b}^{2} = \frac{1}{|B_{U}|} \sum_{i \in B_{U}} (q_{i} - \hat{μ}_{b})^{2}. \tag{5}$$
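The confidence score and the batch statistics of Equations (4) and (5) can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the helper names `confidence_scores` and `batch_stats` and the four-point toy batch are assumptions, and a practical version would operate on tensors instead of nested lists.

```python
def confidence_scores(probs):
    """Eq. (4): q_i is the largest class probability of point i."""
    return [max(p) for p in probs]


def batch_stats(q):
    """Eq. (5): mean and population variance of the confidence scores."""
    if not q:
        raise ValueError("the unlabeled mini-batch must contain at least one point")
    mu_b = sum(q) / len(q)
    var_b = sum((x - mu_b) ** 2 for x in q) / len(q)
    return mu_b, var_b


if __name__ == "__main__":
    # Toy per-point class probabilities (4 points, 3 classes), illustrative only.
    probs = [
        [0.70, 0.20, 0.10],
        [0.40, 0.35, 0.25],
        [0.90, 0.05, 0.05],
        [0.60, 0.30, 0.10],
    ]
    q = confidence_scores(probs)
    print(q, batch_stats(q))
```

Note that Equation (5) divides by $|B_{U}|$, so the sketch uses the population variance rather than the unbiased sample variance.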

Since indoor scenes exhibit substantial variation in spatial layout and object density, instantaneous batch statistics can fluctuate considerably. To obtain stable global estimates, we update the confidence statistics using an exponential moving average:

$$μ_{t} = m μ_{t-1} + (1 - m) \hat{μ}_{b}, \tag{6}$$

$$σ_{t}^{2} = m σ_{t-1}^{2} + (1 - m) \hat{σ}_{b}^{2}, \tag{7}$$

where $m \in (0, 1)$ is a momentum coefficient. The batch statistics $(\hat{μ}_{b}, \hat{σ}_{b})$ reflect the instantaneous confidence distribution of the current indoor mini-batch, while $(μ_{t}, σ_{t})$ provide smoothed global estimates that evolve across training iterations. A larger $m$ emphasizes historical information and suppresses abrupt fluctuations caused by scene-specific complexity, whereas a smaller $m$ allows faster adaptation to recent predictions.
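A minimal sketch of the exponential moving average in Equations (6) and (7) is given below. The function names are hypothetical, and the momentum value 0.9 and the starting statistics are chosen only for illustration; they are not settings reported in the paper.

```python
def ema_update(previous, batch_value, m):
    """One EMA step: m * previous + (1 - m) * batch_value, with 0 < m < 1."""
    if not 0.0 < m < 1.0:
        raise ValueError("momentum m must lie in (0, 1)")
    return m * previous + (1.0 - m) * batch_value


def update_running_stats(mu_prev, var_prev, mu_b, var_b, m):
    """Eqs. (6)-(7): smooth the batch mean and the batch variance separately."""
    return ema_update(mu_prev, mu_b, m), ema_update(var_prev, var_b, m)


if __name__ == "__main__":
    # Illustrative numbers: running stats (0.5, 0.04), batch stats (0.65, 0.0325).
    print(update_running_stats(0.5, 0.04, 0.65, 0.0325, m=0.9))
```

The sketch smooths the variance, as in Equation (7), and leaves the square root to the point where the standard deviation is needed.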

The dynamic confidence threshold is then defined as

$$τ_{t} = μ_{t} - k_{τ} \cdot σ_{t}, \tag{8}$$

where $k_{τ}$ controls the strictness of pseudo-label selection by specifying how many standard deviations below the mean are accepted as reliable. In indoor environments, $σ_{t}$ effectively captures the uncertainty induced by occlusions, cluttered object arrangements, and ambiguous semantic boundaries. Because $σ_{t}$ enters Equation (8) with a negative sign, a larger $σ_{t}$ places $τ_{t}$ further below the running mean for a given $μ_{t}$ (with $k_{τ} > 0$), so when the model is unstable or encounters complex layouts the threshold follows the wider spread of confidence scores rather than a fixed cut-off; as training progresses and predictions become more consistent across rooms and object categories, the threshold adapts accordingly, allowing the reliable set to expand.

Using $τ_{t}$, unlabeled points are partitioned into an ambiguous set and a reliable set:

$$A_{t} = \{i \mid q_{i} < τ_{t}\}, \quad R_{t} = \{i \mid q_{i} \geq τ_{t}\}. \tag{9}$$
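The threshold of Equation (8) and the partition of Equation (9) might be sketched as follows. The function names, the coefficient $k_{τ} = 0.5$, and the running statistics are illustrative assumptions, not values from the paper.

```python
import math


def dynamic_threshold(mu_t, var_t, k_tau):
    """Eq. (8): tau_t = mu_t - k_tau * sigma_t, with sigma_t = sqrt(var_t)."""
    return mu_t - k_tau * math.sqrt(var_t)


def partition(q, tau_t):
    """Eq. (9): indices with q_i >= tau_t are reliable, the rest ambiguous."""
    reliable = [i for i, q_i in enumerate(q) if q_i >= tau_t]
    ambiguous = [i for i, q_i in enumerate(q) if q_i < tau_t]
    return reliable, ambiguous


if __name__ == "__main__":
    # Illustrative running statistics: mean 0.65, variance 0.04 (sigma 0.2).
    tau = dynamic_threshold(mu_t=0.65, var_t=0.04, k_tau=0.5)
    print(tau, partition([0.70, 0.40, 0.90, 0.60], tau))
```

Every point lands in exactly one of the two sets, so the reliable and ambiguous losses never act on the same point within an iteration.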

For points in the reliable set $R_{t}$, we generate hard pseudo-labels by one-hot encoding the argmax prediction:

$$\hat{y}_{i,c} = \operatorname{onehot}\big(\arg\max_{c} p_{i,c}\big), \qquad i \in R_{t}. \tag{10}$$

The reliable-set loss is defined as the cross-entropy between these pseudo-labels and the predictions from the augmented view:

$$L_{r} = -\frac{1}{|R_{t}|} \sum_{i \in R_{t}} \sum_{c=1}^{C} \hat{y}_{i,c} \log p_{i,c}^{\mathrm{aug}}. \tag{11}$$

By selectively consolidating high-confidence predictions in indoor scenes, this loss term effectively expands supervision beyond the labeled set $D_{L}$, mitigates noise introduced by ambiguous regions, and facilitates the formation of more robust and spatially consistent decision boundaries.
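The reliable-set loss of Equations (10) and (11) can be illustrated with plain Python lists standing in for per-point class probabilities. The function name `reliable_set_loss`, the `eps` clamp, and the convention that an empty reliable set yields zero loss are assumptions made for this sketch only.

```python
import math


def reliable_set_loss(probs, probs_aug, reliable, eps=1e-12):
    """Mean cross-entropy between argmax pseudo-labels (original view)
    and the predicted probabilities of the augmented view."""
    if not reliable:
        return 0.0  # assumption: no reliable points means no contribution
    total = 0.0
    for i in reliable:
        # Eq. (10): the hard pseudo-label is the most probable class of the original view.
        pseudo_label = max(range(len(probs[i])), key=probs[i].__getitem__)
        # Eq. (11): with a one-hot target only the pseudo-label term of the inner sum survives.
        total -= math.log(max(probs_aug[i][pseudo_label], eps))
    return total / len(reliable)
```

Because the target is one-hot, the inner sum over classes reduces to a single log-probability per point.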

3.3. Truncated Gaussian-Weighted Consistency Regularization

We further introduce a truncated-Gaussian-weighted consistency regularization to modulate the contribution of each ambiguous point, as illustrated in Figure 3. In indoor 3D scenes, points in the ambiguous set $A_{t}$ are frequently located near semantic boundaries (e.g., wall–door or floor–furniture interfaces) or belong to sparsely sampled and occluded structures. In such regions, predictions tend to be unstable, and hard pseudo-labels are more prone to error. Directly enforcing hard supervision on these points may therefore amplify noise and degrade training stability.

To mitigate this issue, we adopt a consistency regularization strategy on $A_{t}$, encouraging prediction invariance under point cloud perturbations while avoiding explicit hard labeling. Given an unlabeled indoor point cloud $X_{u}$ and its augmented view $X_{u}^{\mathrm{aug}} = T(X_{u})$, we obtain the corresponding predictions

$$P_{u} = f_{\theta}(X_{u}), \qquad P_{u}^{\mathrm{aug}} = f_{\theta}(X_{u}^{\mathrm{aug}}). \tag{12}$$

For each ambiguous point $i \in A_{t}$, we quantify the discrepancy between the original and augmented predictions using the Kullback–Leibler (KL) divergence:

$$\mathrm{KL}\big(p_{i} \,\|\, p_{i}^{\mathrm{aug}}\big) = \sum_{c=1}^{C} p_{i,c} \log \frac{p_{i,c}}{p_{i,c}^{\mathrm{aug}}}. \tag{13}$$
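A small stand-alone sketch of the per-point divergence in Equation (13) is given below. The function name `kl_divergence` and the `eps` clamp on the augmented probabilities are assumptions added for numerical safety, not details taken from the paper.

```python
import math


def kl_divergence(p, p_aug, eps=1e-12):
    """Eq. (13): KL(p || p_aug) = sum_c p_c * log(p_c / p_aug_c)."""
    return sum(
        pc * math.log(pc / max(qc, eps))
        for pc, qc in zip(p, p_aug)
        if pc > 0.0  # terms with p_c = 0 contribute zero by convention
    )
```

The divergence is zero when both views agree and is not symmetric in its arguments, so the order of the original and augmented predictions matters.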

However, not all ambiguous points in indoor environments are equally informative. Points with extremely low confidence often correspond to severe occlusions, cluttered regions, or geometrically indistinguishable structures, where predictions are close to random and can destabilize optimization if treated uniformly. To account for this, we introduce a Gaussian weighting function to adaptively reweight each ambiguous point according to its confidence.

The key insight is that the proposed weighting mechanism is not an independently introduced heuristic but is parameter-coupled with the reliability-adaptive dynamic thresholding through a shared statistical state that co-evolves during training. Concretely, the dynamic thresholding module estimates confidence scores $\{q_{i}\}_{i \in B_{U}}$ from the current unlabeled batch $B_{U}$ and updates the exponentially smoothed global statistics $(\mu_{t}, \sigma_{t})$ via EMA. These statistics serve a dual functional role. On the one hand, they define the adaptive confidence boundary, which partitions unlabeled points into the reliable set $R_{t}$ and the ambiguous set $A_{t}$. On the other hand, the same statistics directly parameterize the Gaussian kernel in the weighting function: $\mu_{t}$ acts as the global reference center, while $\sigma_{t}$ determines the effective kernel bandwidth, thereby modulating the per-point consistency gradient magnitude within $A_{t}$.

Through this shared-statistics design, the framework realizes a coherent risk-aware learning mechanism characterized by global filtering and local weighting. At the global level, the adaptive threshold $\tau_{t}$ prevents unreliable predictions from entering the hard pseudo-label branch, thereby reducing confirmation bias. At the local level, the Gaussian weighting performs fine-grained modulation within the ambiguous region, upweighting near-reliable samples while strongly suppressing highly uncertain samples far below $\tau_{t}$. Since both the decision boundary and the weighting function are parameterized by the same $(\mu_{t}, \sigma_{t})$, the weight distribution is intrinsically aligned with the adaptive boundary. This alignment ensures that consistency regularization concentrates on informative uncertainty around the confidence boundary, rather than being dominated by stochastic low-confidence noise.

Furthermore, this parameter coupling naturally induces an adaptive matching behavior over the course of training. In early iterations, a larger $\sigma_{t}$ reflects unstable and poorly calibrated predictions, resulting in a more conservative boundary (i.e., a broader ambiguous region) and a wider Gaussian bandwidth, which allows cautious exploration while preventing gradient explosion from random predictions. As training progresses and predictions become better calibrated, $\sigma_{t}$ gradually decreases, leading to a tighter boundary and a narrower kernel bandwidth. Consequently, the optimization focus shifts toward boundary-adjacent hard samples. This synchronized co-evolution between dynamic thresholding and Gaussian weighting forms an implicit confidence-driven curriculum, enabling stable convergence and improved robustness in weakly supervised 3D point cloud segmentation without introducing additional scheduling heuristics.

Formally, the weight assigned to each ambiguous point is

$$w_{i} = \exp\!\left(-\frac{(q_{i} - \mu_{t})^{2}}{2\,(\sigma_{t}^{2} / k^{2})}\right), \qquad i \in A_{t}, \tag{14}$$

where $\mu_{t}$ and $\sigma_{t}$ denote the running mean and standard deviation of confidence scores estimated by the confidence-adaptive dynamic thresholding module, and $k$ controls the effective width of the Gaussian kernel. As illustrated in Figure 3, different values of $k$ lead to distinct weighting profiles: a smaller $k$ (e.g., $k = 0.8$) produces a wider curve, allowing a broader range of ambiguous points to receive non-negligible weights, whereas a larger $k$ (e.g., $k = 1.5$) results in a narrower curve that concentrates the weights more tightly around the dynamic threshold. Specifically, $k$ determines how rapidly the weight decays as the confidence score deviates from $\tau_{t} = \mu_{t} - k_{\tau} \sigma_{t}$, which nearly coincides with the kernel center $\mu_{t}$ when $k_{\tau}$ is small. When $k$ is small, the Gaussian kernel has a larger effective bandwidth, leading to a smoother decay and permitting moderately uncertain points to contribute to the consistency regularization. In contrast, a larger $k$ sharpens the decay behavior, strongly suppressing points far below the threshold and focusing the learning process on near-threshold samples.

This design assigns larger weights to ambiguous points whose confidence lies close to the dynamic threshold, i.e., near-reliable points that typically appear around indoor object boundaries, while strongly down-weighting highly uncertain points far below $\tau_{t}$. Consequently, the learning process focuses on ambiguous regions that are more likely to provide meaningful and stable gradients. In our implementation, $k$ was fixed across all experiments to ensure reproducibility and avoid dataset-specific tuning. Its value was chosen to achieve a balanced trade-off between noise suppression and information utilization, as reflected in Figure 3, where the intermediate setting (e.g., $k = 1.0$) provides a moderate decay profile that neither over-suppresses informative ambiguous samples nor overly amplifies noisy ones.
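The effect of $k$ on the weighting profile of Equation (14) can be checked numerically with a short sketch. The function name `gaussian_weight` is hypothetical, $\sigma_{t}$ is assumed to be strictly positive, and the restriction to $i \in A_{t}$ (the truncation) is assumed to be applied by the caller.

```python
import math


def gaussian_weight(q, mean, std, k=1.0):
    """Eq. (14): w = exp(-(q - mu_t)^2 / (2 * sigma_t^2 / k^2)); assumes std > 0."""
    scaled_var = (std * std) / (k * k)
    return math.exp(-((q - mean) ** 2) / (2.0 * scaled_var))


# Weight of a point lying one standard deviation below the running mean,
# evaluated for the three values of k discussed in the text.
weights_at_one_sigma = {k: gaussian_weight(0.7, 0.8, 0.1, k) for k in (0.8, 1.0, 1.5)}
```

At a fixed distance from the kernel center, the weight shrinks as $k$ grows, which matches the wider ($k = 0.8$) and narrower ($k = 1.5$) profiles described above.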

The ambiguous-set consistency loss is then formulated as

$$L_{a} = \frac{1}{|A_{t}|} \sum_{i \in A_{t}} w_{i} \cdot \mathrm{KL}\big(p_{i} \,\|\, p_{i}^{\mathrm{aug}}\big). \tag{15}$$

By jointly leveraging dynamic separation and Gaussian-weighted consistency regularization, the proposed framework exploits unlabeled indoor point clouds in a risk-aware manner: reliable points contribute through hard pseudo-label supervision, while ambiguous points provide softly weighted invariance constraints. This complementary design enhances robustness to geometric perturbations and occlusions, and effectively suppresses the adverse impact of noisy pseudo-labels in complex indoor scenes.
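To make the ambiguous-set term concrete, the following sketch combines the Gaussian weight of Equation (14) with the KL divergence of Equation (13) into the loss of Equation (15). The function name `ambiguous_set_loss` and the zero-loss convention for an empty ambiguous set are assumptions; the augmented probabilities are assumed to be strictly positive, and $\sigma_{t}$ is assumed to be nonzero.

```python
import math


def ambiguous_set_loss(probs, probs_aug, confidences, ambiguous, mean, std, k=1.0):
    """Eq. (15): average over A_t of w_i * KL(p_i || p_i^aug)."""
    if not ambiguous:
        return 0.0  # assumption: no ambiguous points means no contribution
    total = 0.0
    for i in ambiguous:
        # Eq. (14): Gaussian weight centered at mu_t with variance sigma_t^2 / k^2.
        weight = math.exp(-((confidences[i] - mean) ** 2) * k * k / (2.0 * std * std))
        # Eq. (13): divergence between original and augmented predictions.
        kl = sum(p * math.log(p / q) for p, q in zip(probs[i], probs_aug[i]) if p > 0.0)
        total += weight * kl
    return total / len(ambiguous)
```

A point whose confidence sits at the kernel center contributes its full divergence, while points far from the center contribute only a small fraction of it.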

4. Experiments

4.1. Dataset

To enable a rigorous evaluation and analysis, we adopted two widely used indoor point cloud benchmarks, S3DIS [58] and ScanNet [59], and evaluated point-level semantic labeling for indoor 3D semantic segmentation on both.

S3DIS Dataset: The Stanford 3D Indoor Spaces Dataset (S3DIS) is a widely adopted large-scale benchmark for indoor 3D scene understanding. It consists of six spatially distinct areas (Area 1–Area 6) and contains 271 annotated indoor rooms, covering representative functional spaces such as offices, conference rooms, and corridors. The dataset was collected using the Matterport scanning system, which integrates multiple structured-light depth sensors and performs 360° rotational acquisition at each scan location, enabling synchronized capture of RGB and depth information. Multiple RGB–D observations obtained at each position are fused to construct dense 3D point clouds. Each point in S3DIS is represented by its 3D coordinates and associated RGB features and is annotated with a point-wise semantic label in 3D space. The dataset defines 13 semantic classes, including ceiling, floor, wall, beam, column, window, door, table, chair, sofa, bookcase, board, and clutter. Following the standard evaluation protocol, Areas 1, 2, 3, 4, and 6 were used for training, while Area 5 was reserved for testing. With its area-wise split protocol, S3DIS offers a standardized benchmark for evaluating cross-area generalization performance, facilitating fair and consistent comparisons of indoor point cloud semantic segmentation models under complex geometric structures and diverse scene distributions.

ScanNet Dataset: ScanNet is a large-scale RGB-D dataset for indoor 3D scene understanding, comprising continuous scans of 1,513 real-world indoor scenes with approximately 2.5 million frames. Data were captured using an RGB-D sensor (a coupled RGB camera and depth sensor), providing per-pixel appearance and depth measurements. For each scene, ScanNet offers key annotations and outputs, including camera pose estimation, surface reconstruction, and instance-level semantic segmentation, facilitating joint modeling and evaluation from 2D observations to 3D geometry. In terms of scene coverage, ScanNet spans diverse indoor environments such as residential spaces, offices, and public areas, exhibiting substantial scene variability and realistic sensor noise. Its annotation protocol assigns semantic category labels at the instance level, supporting fine-grained understanding at both the object and scene scales; the commonly used setup includes 21 object semantic classes. Consistent with this benchmark, our experimental setting also adopted the same 21 semantic classes for training and evaluation. ScanNet contains thousands of reconstructed indoor scenes with complex geometry and noise distributions, facilitating the assessment of model robustness and generalization.

4.2. Implementation Details

Configuration details are described as follows. Initially, to mitigate the computational burden induced by large-scale raw point clouds, we conducted a point-sampling procedure, which reduced point density while preserving the geometric structure of the scene. During training, the grid size was set to 0.04 m for S3DIS and 0.02 m for ScanNet. For all experiments, the weighting parameters $\lambda_{r}$ and $\lambda_{a}$ in Equation (3) were both set to one for simplicity. The momentum parameter $m$ in Equations (6) and (7) was fixed to 0.999.

In Equation (8), the coefficient $k_{\tau}$ adjusts the strictness of the dynamic threshold $\tau_{t}$ relative to the EMA confidence statistics $(\mu_{t}, \sigma_{t})$, explicitly controlling the “quality–quantity” trade-off in pseudo-label selection. Because $k_{\tau} \sigma_{t}$ is subtracted from $\mu_{t}$, a smaller $k_{\tau}$ raises $\tau_{t}$ and yields a more conservative criterion (fewer but cleaner pseudo-labels), while a larger $k_{\tau}$ lowers $\tau_{t}$ and increases coverage at a higher noise risk. We fixed $k_{\tau} = 0.001$ across all experiments. With confidence normalized to $[0, 1]$, this setting introduces only a per-mille-level threshold perturbation, ensuring that $\tau_{t}$ remains primarily data-driven by $(\mu_{t}, \sigma_{t})$ rather than dominated by a manually amplified offset. This choice further mitigates threshold jitter induced by transient mini-batch fluctuations when $\sigma_{t}$ is unstable in early training, while still allowing $\tau_{t}$ to evolve naturally as $(\mu_{t}, \sigma_{t})$ stabilizes, leading to a smooth and risk-controlled expansion of the reliable set $R_{t}$. Keeping $k_{\tau}$ fixed avoids stage-wise scheduling or dataset-specific tuning, improving reproducibility and robustness.

In Equation (14), $k$ is introduced as a variance scaling factor in the truncated Gaussian weighting function to control the bandwidth of the confidence-aware reweighting by scaling $\sigma_{t}^{2}$. This parameter affects the curvature of the weighting profile but does not change the monotonic mapping from confidence deviation to weight assignment. In all experiments, we set $k$ to 1.0 and kept it fixed to ensure reproducibility.

Since the number of points varied across sub-clouds, the batch size was not fixed; instead, the maximum number of points per training iteration was capped at 10,000. The model was trained for 100 epochs using the Adam optimizer with an initial learning rate of $10^{-3}$. All experiments were implemented in the PyTorch (v1.10.1) framework and conducted on a workstation equipped with an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60 GHz and an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA).
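For quick reference, the settings listed above can be collected into a small configuration object. This is a sketch only: the class name `TgrtConfig` and all field names are hypothetical and do not come from the authors' code; only the numeric values are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TgrtConfig:
    """Hyperparameters reported in Section 4.2 (field names are illustrative)."""

    grid_size_m: float                # sampling grid size in meters (dataset-specific)
    lambda_r: float = 1.0             # weight of the reliable-set loss, Eq. (3)
    lambda_a: float = 1.0             # weight of the ambiguous-set loss, Eq. (3)
    ema_momentum: float = 0.999       # m in Eqs. (6)-(7)
    k_tau: float = 0.001              # threshold coefficient in Eq. (8)
    k: float = 1.0                    # Gaussian bandwidth factor in Eq. (14)
    max_points_per_iter: int = 10_000
    epochs: int = 100
    optimizer: str = "adam"
    initial_lr: float = 1e-3


S3DIS = TgrtConfig(grid_size_m=0.04)
SCANNET = TgrtConfig(grid_size_m=0.02)
```

Only the grid size differs between the two benchmarks; every other value is shared across the two datasets.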

4.3. Evaluation on S3DIS

In Figure 4 and Figure 5, we visualize qualitative segmentation results on the S3DIS dataset, providing an insightful comparison among PSD [52], SQN [49], and our method TGR-T. A noteworthy observation is that indoor point clouds often exhibit subtle local geometric variations, and the discrimination is further complicated by intrinsic boundary-point ambiguity and occlusion-induced incompleteness. In 3D semantic segmentation, these factors make categories with similar shape priors particularly prone to confusion, especially for tail classes. Typical examples include the fine-grained distinction between beam vs. wall and chair vs. sofa. Moreover, segmenting thin structures and fine-grained boundary details in indoor scenes substantially increases the task difficulty. For instance, TGR-T demonstrates a clear advantage over competing methods in separating the red wall from the green beam. To facilitate a clear and fair qualitative comparison, all methods are visualized using the same class-to-color mapping, and the red boxes consistently denote representative regions where our method yields more accurate and coherent 3D semantic predictions than the competing approaches (the same convention is used throughout all qualitative results shown in this paper). This is especially evident in the first row of Figure 4 and Figure 5, where our method more accurately localizes the true boundary between beam and wall, producing a cleaner and more spatially continuous structural delineation. In contrast, PSD and SQN fail to effectively disentangle the beam from the wall in that region, resulting in noticeable misclassification. In the selected region of the second row, our method can fully and coherently recover the chair category, preserving the main body and contour integrity of the chair, which indicates stronger fine-grained semantic modeling. 
By comparison, although PSD and SQN can detect the chair, their predictions are often incomplete, with the chair region partially missing or being absorbed into the surrounding background. Furthermore, TGR-T exhibits stronger structural consistency and spatial generalization in uncertain regions. As shown in the third row of Figure 4, the wall structure between two large unlabeled blocks is still stably identified and accurately segmented by our method, yielding continuous and consistent semantic predictions. In contrast, PSD and SQN fail to recover a complete wall surface at this location, leading to missing wall regions. A similar trend is observed in the third row of Figure 5, where the competing methods exhibit noticeably lower semantic consistency in the unlabeled regions, leading to fragmented and category-inconsistent predictions.

Table 1 and Table 2 report the quantitative comparisons of different methods under fully supervised and weakly supervised settings, and Figure 6 provides a more intuitive comparison of the performance of our algorithm with that of the other algorithms. For clarity, bold numbers in these tables indicate the best performance for each metric/category (i.e., the highest IoU/mIoU), and the same highlighting convention is used consistently throughout all tables in this paper unless otherwise specified. Under the weakly supervised setting with only 1% labeled points, TGR-T substantially improves the per-class IoU for most categories compared with PSD and SQN, highlighting its ability to effectively exploit limited annotations to boost segmentation performance. Specifically, TGR-T achieves an mIoU improvement of 1.6% over PSD, and yields a 0.2% higher mIoU over SQN and 0.7% higher mIoU over SAF-C3 [60]. Notably, under the same evaluation protocol, TGR-T still exceeds the fully supervised PointCNN [61] by more than 4.7% in mIoU, indicating strong annotation efficiency and competitive performance even relative to fully supervised counterparts. With only 0.1% labeled points, TGR-T improves the mIoU by 4.8% over PSD and outperforms SQN by over 5.7% in mIoU. The fully supervised results are reported from prior work. Under extremely sparse supervision, TGR-T achieves competitive performance, significantly narrowing the gap between weakly supervised and fully supervised approaches. More concretely, TGR-T achieves the highest IoU on several representative categories, including floor, table, chair, sofa and board. In the extremely sparse annotation regime with only 0.1% labeled data, although the performance of most competing methods drops markedly on low-sample classes (e.g., floor, table, board, sofa and chair), TGR-T consistently delivers the best results on these categories, indicating superior robustness and generalization under ultra-low supervision.

4.4. Evaluation on ScanNet

Figure 7 and Figure 8 present qualitative semantic segmentation results on the ScanNet benchmark. To facilitate a clear and fair qualitative comparison, all methods are visualized using the same class-to-color mapping, and the black boxes consistently denote representative regions where our method yields more accurate and coherent 3D semantic predictions than the competing approaches (the same convention is used throughout all qualitative results shown in this paper). Since ScanNet point clouds are reconstructed from RGB-D scans, different scenes exhibit substantial variation in sensor noise, point density, and occlusion patterns, which further increases the difficulty of semantic segmentation and makes models prone to boundary mislabeling. In Figure 7 (first row), the predicted semantic map and spatial layout demonstrate the superior performance of TGR-T in segmenting cabinet (yellow) and wall (blue). In contrast, PSD and SQN exhibit pronounced segmentation errors by incorrectly classifying these regions as door (red). In the second and third rows of Figure 7 as well as the first three rows of Figure 8, TGR-T continues to show superior capability by accurately recovering the structural extent of table (light pink), door (red), and chair (yellow-green). The fourth row depicts a particularly challenging indoor scene containing multiple complex categories, including other furniture, chair, sofa, and floor. In contrast, SQN shows evident class confusion in this region, mislabeling it as table, while PSD fails to adequately model the contact boundary, resulting in semantic adhesion between desk and chair. In the fourth row of Figure 8, TGR-T further demonstrates stronger fine-grained discrimination and robustness. Although a small number of points remain incorrectly predicted in that area, the overall segmentation maintains high semantic consistency and boundary integrity. 
Overall, these qualitative results demonstrate that the proposed framework constructs more stable and effective point-wise pseudo-supervision under the coexistence of occlusion, non-uniform point density, and RGB-D reconstruction noise. The adaptive thresholding mechanism aligns pseudo-label selection with batch-wise confidence statistics, while truncated-Gaussian reweighting suppresses noisy gradients from highly uncertain boundary and occluded points, thereby improving boundary consistency and enhancing the robustness and generalization of semantic segmentation in complex indoor scenes.

The experimental results reported in Table 3 and Table 4 provide a comprehensive comparison of different methods under both fully supervised and weakly supervised protocols across diverse scenes in the dataset. Figure 9 provides the corresponding visualizations of these experimental results. Notably, under the 1% labeling setting, TGR-T improves the mIoU over PSD by 2.0%, and over SQN by 9.3%, respectively. Moreover, compared with DCL [64], which achieves an mIoU of 59.3 under the same 1% labeling protocol, our method further obtains a 0.2% higher mIoU. TGR-T achieves particularly strong category-wise performance on chair, counter, and picture, with IoU scores of 83.2%, 70.4%, and 47.1%. Under the more challenging 0.1% weak-supervision regime, the IoU of TGR-T on most categories does not exhibit a noticeable decline. Moreover, with 0.1% labels, the mIoU of TGR-T surpasses SQN by 10.5% and exceeds PSD by 6.6%. The superior mIoU can be attributed to TGR-T’s ability to suppress negative transfer induced by low-confidence noisy pseudo-labels, thereby enhancing multi-class semantic separability and boundary consistency, which in turn strengthens cross-category generalization. For instance, TGR-T attains a 94.1% IoU on bed, 83.2% on chair, and 81.0% on table, demonstrating robust effectiveness in maintaining high segmentation quality even under extremely sparse annotation. We further report fully supervised results as reference points. We note that these fully supervised numbers are taken from their original publications and may involve different backbone architectures, preprocessing pipelines, and training recipes; thus, they are not intended as a strictly controlled, backbone-consistent comparison. 
Nevertheless, TGR-T trained with only 0.1% labels achieves performance that is competitive with several reported fully supervised baselines on certain categories (e.g., MTML [65] and 3D-BoNet [66]), highlighting the strong annotation efficiency of the proposed weakly supervised framework. While PSD and SQN deliver commendable performance, TGR-T remains consistently competitive in terms of mIoU, particularly for categories that are traditionally challenging due to occlusion, structural ambiguity, and uncertain class boundaries. Furthermore, relative to widely adopted fully supervised baselines, our method also exhibits strong performance and even achieves better results in some cases, further validating its effectiveness for semantic segmentation in complex indoor point cloud environments.

4.5. Discussion

In this section, we evaluate the individual contributions of the temperature parameter, the reliability-adaptive dynamic thresholding mechanism, and the truncated-Gaussian weighting strategy within the TGR-T framework. In addition, we conduct a sensitivity analysis on the key hyperparameters $\lambda_{r}$ and $\lambda_{a}$, which control the relative importance of the reliable-region loss and the ambiguous-region consistency loss in Equation (3). Comprehensive ablation studies are performed on the S3DIS benchmark under the extremely sparse 0.1% annotation setting, and the quantitative results are summarized in Table 5.

(1) Temperature Parameter: In this subsection, we investigate the influence of the temperature coefficient $T$ in TGR-T on segmentation performance. In the weakly supervised training pipeline, the temperature is applied to the network’s logits to modulate the distributions of the generated soft pseudo-labels and consistency targets. Its primary role is to directly control the sharpness (entropy) of the class-probability distribution. Keeping all other training settings identical, we evaluated $T \in \{0.5, 1.0, 2.0\}$, where $T = 1.0$ was the default configuration. Specifically, given the logit vector $z = (z_{1}, \dots, z_{C})$ of a sampled point in a point cloud, the predicted probability for class $i$ was computed as

$$p_{i} = \frac{\exp(z_{i} / T)}{\sum_{j} \exp(z_{j} / T)}. \tag{16}$$
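The sharpening and smoothing effect of the temperature in Equation (16) can be reproduced with a minimal sketch. The function name `softmax_with_temperature` is hypothetical, and subtracting the maximum scaled logit is a standard numerical-stability step that leaves the result unchanged.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Eq. (16): p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # same shift in numerator and denominator cancels out
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Probability of the top class for the three temperatures studied in the text.
top_prob = {t: max(softmax_with_temperature([2.0, 1.0, 0.0], t)) for t in (0.5, 1.0, 2.0)}
```

For the same logits, the top-class probability is largest at $T = 0.5$ and smallest at $T = 2.0$, which corresponds to the sharper and smoother distributions discussed in this subsection.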

As shown in Table 6 and Figure 10, when $T = 0.5$, the smaller temperature induces a sharper probability distribution, amplifying high-confidence responses and encouraging the model to focus on only a small subset of highly discriminative points. However, under weak supervision, the initial supervisory signal is sparse and incomplete, making pseudo-labels inevitably noisy. The reduced tolerance to noisy pseudo-supervision leads to a performance drop, with the mIoU decreasing to 53.5% and the avg. F1 score to 63.36%. When $T = 2.0$, a larger temperature yields a smoother distribution, which can partially alleviate the adverse effect of noisy predictions. Nevertheless, excessive smoothing suppresses high-confidence responses on reliable points, making it difficult for the model to capture inter-class discriminative cues and weakening the discriminative capacity of learned representations. As a result, the mIoU drops to 52.5%, and the avg. F1 score decreases to 65.05%, indicating degraded boundary localization and detail preservation. Performance was evaluated in terms of mIoU and avg. F1 score on the validation split of S3DIS.

In contrast, TGR-T achieves the best overall segmentation performance at $T = 1.0$. Compared with $T = 0.5$, it improves the mIoU by 2.3% and avg. F1 by 4.0%; compared with $T = 2.0$, it increases the mIoU by 3.3% and avg. F1 by 2.31%. These results demonstrate that for weakly supervised 3D point cloud semantic segmentation, an appropriate temperature configuration is crucial for balancing pseudo-label reliability and model discriminability.
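As a consistency check on the figures quoted in this subsection, the stated gains can be added back to the two non-default temperatures. The snippet below uses only numbers from the text; the variable and function names are hypothetical, and the resulting $T = 1.0$ scores are implied by the reported gains rather than quoted directly here.

```python
# Scores reported in the text for the non-default temperatures (in %).
reported = {
    0.5: {"miou": 53.5, "f1": 63.36},
    2.0: {"miou": 52.5, "f1": 65.05},
}
# Gains of T = 1.0 over each alternative, as stated in the text.
gains_of_default = {
    0.5: {"miou": 2.3, "f1": 4.0},
    2.0: {"miou": 3.3, "f1": 2.31},
}


def implied_default_scores(reported, gains):
    """Add each stated gain to its baseline to recover the implied T = 1.0 score."""
    return {
        t: {metric: round(reported[t][metric] + gains[t][metric], 2) for metric in reported[t]}
        for t in reported
    }


implied = implied_default_scores(reported, gains_of_default)
```

Both baselines lead to the same implied $T = 1.0$ scores, so the quoted gains are mutually consistent.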

(2) Dynamic Threshold: To validate the effectiveness of the proposed reliability-adaptive dynamic thresholding mechanism in pseudo-label selection, we constructed a strictly controlled comparative experiment in which the dynamic thresholding module was replaced with a fixed-threshold strategy, while keeping all other training configurations unchanged to eliminate confounding factors. Specifically, the fixed threshold was set to 0.95 [10,68]. This value was not arbitrarily chosen but aligned with commonly adopted settings in mainstream weakly supervised point cloud semantic segmentation methods, thereby ensuring fairness and reproducibility of the comparison. Under this strong baseline configuration, experiments were conducted on the S3DIS dataset with an extremely sparse annotation ratio of 0.1%, and the results are reported in Table 5 and Figure 11.

The experimental results showed that the fixed-threshold strategy with $\tau = 0.95$ performed consistently worse than the proposed dynamic thresholding scheme. When the threshold was fixed at 0.95, the mIoU decreased by 2.3% compared to the dynamic thresholding approach. This observation indicates that under extremely limited supervision, the model’s early-stage prediction distribution is inherently unstable, and a fixed threshold inevitably introduces two limitations. First, an overly high threshold may result in an insufficient number of reliable pseudo-labels, leading to weak supervisory signals and slower convergence. Second, as training progresses, a fixed threshold cannot adapt to inter-class and inter-scene confidence variations, which may cause excessive filtering for certain categories while admitting noisy predictions for others, thereby inducing class imbalance or noise accumulation. In contrast, the proposed dynamic thresholding mechanism estimates the threshold adaptively based on batch-wise confidence statistics and employs exponential moving averaging to obtain stable global estimates. This enables pseudo-label selection to dynamically match the model’s current reliability level during training, achieving a better trade-off between pseudo-label quality and quantity. Consequently, it delivers more stable and sustained performance improvements.

Overall, the comparative results further demonstrate the necessity and superiority of the proposed dynamic thresholding mechanism, particularly under extremely sparse supervision, where it more effectively enhances pseudo-label utilization efficiency and robustness.

(3) Truncated-Gaussian Weighting: To further investigate the necessity of the truncated-Gaussian weighting mechanism in modeling uncertainty within the ambiguous set, we conducted a controlled ablation study. Specifically, the Gaussian-based confidence modulation was removed, and samples in the ambiguous set were assigned a uniform weighting coefficient $\lambda_{a}$ without confidence-aware reweighting. All other components, including the reliability-adaptive dynamic thresholding module, network architecture, optimizer settings, learning rate schedule, data augmentation strategy, and training iterations, were kept identical to those of the full TGR-T framework to ensure a fair comparison. Experiments were conducted on the S3DIS benchmark under the extremely sparse 0.1% annotation setting. Quantitative results are summarized in Table 5 and Figure 11.

When the truncated-Gaussian weighting was replaced with a constant weighting scheme, the mIoU decreased by 3.4% compared to the complete model. This performance degradation indicates that globally scaling the loss of ambiguous samples via $\lambda_{a}$ alone is insufficient to capture the heterogeneous uncertainty distribution inherent in pseudo-label predictions. From an optimization perspective, the confidence distribution within the ambiguous set is typically highly non-uniform in weakly supervised point cloud semantic segmentation. Some samples lie slightly below the adaptive threshold yet remain highly informative, while others are located near decision boundaries and exhibit substantial noise risk. Assigning identical weights to all ambiguous samples suppresses the contribution of informative borderline samples and simultaneously amplifies the adverse impact of unreliable pseudo-labels, leading to suboptimal convergence behavior. In contrast, the proposed truncated Gaussian weighting function performs continuous confidence-aware modeling, producing a smooth attenuation profile over low-confidence predictions. This soft reweighting strategy effectively balances noise suppression and information exploitation, particularly around complex object boundaries and class transition regions. As a result, the model achieves improved stability during optimization and enhanced generalization performance.

Overall, the ablation results confirm that the truncated-Gaussian weighting mechanism plays a critical role in uncertainty modeling and pseudo-label refinement within the TGR-T framework, especially under extremely sparse supervision, where fine-grained confidence calibration is essential for robust performance.

(4) Key Hyperparameters: We further performed a systematic hyperparameter ablation on the relative weights of the two core supervision signals in the loss under the 1% labeling protocol of the S3DIS dataset, as shown in Table 7. The aim was to assess the synergy and performance sensitivity between distribution-level consistency regularization (KL divergence) and point-wise classification supervision (cross-entropy, CE) within our weakly supervised self-training framework. Specifically, we kept all other settings strictly unchanged, including network architecture, data split, training schedule, optimizer, and learning-rate policy, and only varied the loss weights. We compared $(λ_{a}, λ_{r}) \in \{(1.0, 1.0), (1.0, 0.5), (0.5, 1.0)\}$, where $λ_{a}$ weights the KL term and $λ_{r}$ weights the CE term. The results showed that the default configuration $(1.0, 1.0)$ achieved the best performance with an mIoU of 61.9. When halving the CE weight to $(1.0, 0.5)$, the mIoU dropped to 59.9, which was 2.0 points lower than the optimum. In contrast, halving the KL weight to $(0.5, 1.0)$ resulted in a much larger degradation, with the mIoU further decreasing to 55.8, representing a 6.1-point drop relative to the best configuration.
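The reported drops can be checked directly from the Table 7 values. The short sketch below only redoes that arithmetic; the dictionary layout and variable names are illustrative.

```python
# mIoU (%) values as reported in Table 7, keyed by (lambda_a, lambda_r).
miou = {(1.0, 1.0): 61.9, (1.0, 0.5): 59.9, (0.5, 1.0): 55.8}

best = max(miou.values())
# Difference of each configuration to the best one, rounded to one decimal.
delta = {weights: round(value - best, 1) for weights, value in miou.items()}

for weights, d in delta.items():
    print(weights, miou[weights], d)
```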

These findings indicate that under extremely sparse point-level annotations, the KL-based consistency term plays a more critical role in leveraging unlabeled points, stabilizing pseudo-label learning, and mitigating error accumulation/confirmation bias during self-training. First, by enforcing prediction distribution alignment across different views/augmentations, the KL objective enables effective utilization of low-confidence and boundary-adjacent points without relying solely on hard pseudo-labels, alleviating the coverage bias caused by learning only from highly confident samples. Second, this distribution-level constraint provides smoother gradients when predictions are unstable in early training, thereby reducing noise propagation induced by erroneous pseudo-labels and improving optimization stability and generalization. By comparison, moderately weakening CE primarily reduces the learning strength over the reliable (high-confidence) pseudo-labeled set, resulting in a relatively mild performance drop, suggesting CE mainly serves to consolidate discriminative learning on confident points rather than being the dominant driver for unlabeled knowledge transfer. Based on this observation, we adopted $(λ_{a}, λ_{r}) = (1.0, 1.0)$ as the default setting for all subsequent experiments to achieve a better balance between discriminative pseudo-label supervision (CE) on reliable points and robust consistency regularization (KL) on ambiguous/unlabeled regions, thereby maximizing unlabeled-data exploitation while suppressing noise accumulation in weakly supervised 3D point cloud semantic segmentation.
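How the two weights enter the objective can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: the weak/strong view pairing, the KL direction, the per-point `weights` argument, and the names `combined_loss`, `lam_a`, and `lam_r` are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def combined_loss(logits_weak, logits_strong, threshold, weights,
                  lam_a=1.0, lam_r=1.0, eps=1e-8):
    """Assumed form: lam_r * CE(reliable) + lam_a * weighted KL(ambiguous)."""
    p_weak, p_strong = softmax(logits_weak), softmax(logits_strong)
    conf, pseudo = p_weak.max(axis=1), p_weak.argmax(axis=1)
    reliable = conf >= threshold      # hard pseudo-labels are used here
    ambiguous = ~reliable             # soft consistency is used here
    # Cross-entropy of the strong view against the weak-view pseudo-label.
    ce = -np.log(p_strong[np.arange(len(pseudo)), pseudo] + eps)
    loss_r = ce[reliable].mean() if reliable.any() else 0.0
    # KL(p_weak || p_strong) per point, scaled by the per-point weight.
    kl = (p_weak * (np.log(p_weak + eps) - np.log(p_strong + eps))).sum(axis=1)
    loss_a = (weights[ambiguous] * kl[ambiguous]).mean() if ambiguous.any() else 0.0
    return lam_a * loss_a + lam_r * loss_r, loss_a, loss_r

# Toy data: 6 points, 5 classes; values are random and purely illustrative.
rng = np.random.default_rng(0)
logits_weak = rng.normal(size=(6, 5))
logits_strong = logits_weak + 0.1 * rng.normal(size=(6, 5))
weights = np.ones(6)
total, loss_a, loss_r = combined_loss(logits_weak, logits_strong, 0.5, weights)
total_half_kl, _, _ = combined_loss(logits_weak, logits_strong, 0.5, weights, lam_a=0.5)
print(total, total_half_kl)
```

In this form, changing `lam_a` or `lam_r` rescales only the ambiguous-set or reliable-set term respectively, which is the knob varied in the ablation.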

5. Conclusions

We proposed TGR-T, a unified weakly supervised framework for indoor 3D point cloud semantic segmentation that systematically addressed the challenges of label sparsity and uncertainty. By introducing a reliability-adaptive dynamic thresholding strategy, pseudo-label selection was guided by evolving confidence statistics of unlabeled mini-batches, yielding stable global estimates and enabling effective separation of reliable and ambiguous regions for selective supervision. Furthermore, a learnable truncated Gaussian weighting function was proposed to explicitly model uncertainty within ambiguous regions, allowing the network to exploit low-confidence predictions through soft supervision and thereby enhance generalization, particularly around complex object boundaries. Extensive experiments on standard indoor benchmarks demonstrated that TGR-T delivered competitive or superior performance under extremely sparse supervision and could even surpass several fully supervised baselines while using only 1% labeled points. These results highlight the effectiveness and practicality of the proposed framework for scalable and annotation-efficient 3D scene understanding.

Author Contributions

Conceptualization, Ziwei Luo; Methodology, Ziwei Luo and Xinyue Liu; Software, Ziwei Luo; Validation, Ziwei Luo and Xinyue Liu; Formal analysis, Ziwei Luo; Investigation, Ziwei Luo and Xinyue Liu; Resources, Ziwei Luo; Data curation, Ziwei Luo and Xinyue Liu; Writing—original draft, Ziwei Luo, Xinyue Liu, Jun Jiang, Hanyu Qi, Chen Wang and Zhong Xie; Writing—review and editing, Ziwei Luo, Xinyue Liu and Jun Jiang; Visualization, Ziwei Luo and Xinyue Liu; Supervision, Ziwei Luo and Tao Zeng; Project administration, Ziwei Luo and Tao Zeng; Funding acquisition, Ziwei Luo and Tao Zeng. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of Hubei Province (2025AFB341), the Sichuan Science and Technology Program (2025ZNSFSC0529), the National Natural Science Foundation of China (62301373), the Open Fund of Hubei Engineering Research Center for High-Precision Deformation Monitoring with “Beidou + Cloud” (HBBDGJ202508Y), and the Joint Open Fund of the Research Platforms of School of Computer Science, China University of Geosciences, Wuhan (No. PTLH2024-B-11).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Author Tao Zeng was employed by the company Chengdu Qianjia Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Riz, L.; Saltori, C.; Ricci, E.; Poiesi, F. Novel class discovery for 3d point cloud semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 9393–9402.
2. González-de Santos, L.M.; Díaz-Vilariño, L.; Balado, J.; Martínez-Sánchez, J.; González-Jorge, H.; Sánchez-Rodríguez, A. Autonomous point cloud acquisition of unknown indoor scenes. ISPRS Int. J. Geo-Inf. 2018, 7, 250.
3. Qiu, S.; Anwar, S.; Barnes, N. Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021; pp. 1757–1767.
4. Xie, F.; Schwertfeger, S. Robust lifelong indoor lidar localization using the area graph. IEEE Robot. Autom. Lett. 2023, 9, 531–538.
5. Sun, Y.; Zhang, X.; Miao, Y. A review of point cloud segmentation for understanding 3D indoor scenes. Vis. Intell. 2024, 2, 14.
6. Tahara, T.; Seno, T.; Narita, G.; Ishikawa, T. Retargetable AR: Context-aware augmented reality in indoor scenes based on 3D scene graph. In Proceedings of the 2020 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), Recife, Brazil, 9–13 November 2020; pp. 249–255.
7. Luo, Z.; Zeng, Z.; Wan, J.; Tang, W.; Jin, Z.; Xie, Z.; Xu, Y. D2T-Net: A dual-domain transformer network exploiting spatial and channel dimensions for semantic segmentation of urban mobile laser scanning point clouds. Int. J. Appl. Earth Obs. Geoinf. 2024, 132, 104039.
8. Tang, S.; Li, X.; Zheng, X.; Wu, B.; Wang, W.; Zhang, Y. BIM generation from 3D point clouds by combining 3D deep learning and improved morphological approach. Autom. Constr. 2022, 141, 104422.
9. Luo, Z.; Zeng, Z.; Tang, W.; Wan, J.; Xie, Z.; Xu, Y. Dense dual-branch cross attention network for semantic segmentation of large-scale point clouds. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5700216.
10. Kweon, H.; Kim, J.; Yoon, K.J. Weakly supervised point cloud semantic segmentation via artificial oracle. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 3721–3731.
11. Wang, Y.; Liu, Y.; Zhou, S.; Huang, Y.; Tang, C.; Zhou, W.; Chen, Z. Emotion-oriented Cross-modal Prompting and Alignment for Human-centric Emotional Video Captioning. IEEE Trans. Multimed. 2025, 27, 3766–3780.
12. Li, X.; Xu, Q.; Zhang, J.; Zhang, T.; Yu, Q.; Sheng, L.; Xu, D. Multi-modality affinity inference for weakly supervised 3D semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 3216–3224.
13. Cheng, M.; Hui, L.; Xie, J.; Yang, J. Sspc-net: Semi-supervised semantic 3d point cloud segmentation network. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Online, 2–9 February 2021; Volume 35, pp. 1140–1147.
14. Li, M.; Xie, Y.; Shen, Y.; Ke, B.; Qiao, R.; Ren, B.; Lin, S.; Ma, L. Hybridcr: Weakly-supervised 3d point cloud semantic segmentation via hybrid contrastive regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 14930–14939.
15. Sun, B.; Yang, Y.; Zhang, L.; Cheng, M.M.; Hou, Q. Corrmatch: Label propagation via correlation matching for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 3097–3107.
16. Jiang, L.; Shi, S.; Tian, Z.; Lai, X.; Liu, S.; Fu, C.W.; Jia, J. Guided Point Contrastive Learning for Semi-Supervised Point Cloud Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6423–6432.
17. Xu, X.; Lee, G.H. Weakly supervised semantic point cloud segmentation: Towards 10x fewer labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 14–19 June 2020; pp. 13706–13715.
18. Chu, R.; Ye, X.; Liu, Z.; Tan, X.; Qi, X.; Fu, C.W.; Jia, J. Twist: Two-way inter-label self-training for semi-supervised 3d instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1100–1109.
19. Su, S.; Xu, J.; Wang, H.; Miao, Z.; Zhan, X.; Hao, D.; Li, X. PUPS: Point cloud unified panoptic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 2339–2347.
20. Luo, Z.; Zeng, T.; Jiang, J.; Cai, Z.; Wu, W.; Xie, Z.; Xu, Y. P3CL: Pseudo-Label Confidence-Calibrated Curriculum Learning for Weakly Supervised Urban Airborne Laser Scanning Point Cloud Classification. Remote Sens. 2026, 18, 552.
21. Schult, J.; Engelmann, F.; Hermans, A.; Litany, O.; Tang, S.; Leibe, B. Mask3d: Mask transformer for 3d semantic instance segmentation. arXiv 2022, arXiv:2210.03105.
22. Li, X.; Ding, H.; Yuan, H.; Zhang, W.; Pang, J.; Cheng, G.; Chen, K.; Liu, Z.; Loy, C.C. Transformer-based visual segmentation: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10138–10163.
23. Wei, J.; Lin, G.; Yap, K.H.; Hung, T.Y.; Xie, L. Multi-path region mining for weakly supervised 3D semantic segmentation on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 14–19 June 2020; pp. 4384–4393.
24. Unal, O.; Dai, D.; Van Gool, L. Scribble-supervised lidar semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2697–2707.
25. Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.A.; Cubuk, E.D.; Kurakin, A.; Li, C.L. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Adv. Neural Inf. Process. Syst. 2020, 33, 596–608.
26. Berthelot, D.; Roelofs, R.; Sohn, K.; Carlini, N.; Kurakin, A. Adamatch: A unified approach to semi-supervised learning and domain adaptation. arXiv 2021, arXiv:2106.04732.
27. Zhang, B.; Wang, Y.; Hou, W.; Wu, H.; Wang, J.; Okumura, M.; Shinozaki, T. Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. Adv. Neural Inf. Process. Syst. 2021, 34, 18408–18419.
28. Wang, Y.; Chen, H.; Heng, Q.; Hou, W.; Fan, Y.; Wu, Z.; Wang, J.; Savvides, M.; Shinozaki, T.; Raj, B.; et al. Freematch: Self-adaptive thresholding for semi-supervised learning. arXiv 2022, arXiv:2205.07246.
29. Chen, H.; Tao, R.; Fan, Y.; Wang, Y.; Wang, J.; Schiele, B.; Xie, X.; Raj, B.; Savvides, M. SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning. arXiv 2023, arXiv:2301.10921.
30. Tang, L.; Hui, L.; Xie, J. Learning inter-superpoint affinity for weakly supervised 3D instance segmentation. In Proceedings of the Asian Conference on Computer Vision (ACCV), Macao, China, 4–8 December 2022; pp. 1282–1297.
31. Tao, A.; Duan, Y.; Wei, Y.; Lu, J.; Zhou, J. SegGroup: Seg-level supervision for 3D instance and semantic segmentation. IEEE Trans. Image Process. 2022, 31, 4952–4965.
32. Wang, P.; Yao, W.; Shao, J. One class one click: Quasi scene-level weakly supervised point cloud semantic segmentation with active learning. ISPRS J. Photogramm. Remote Sens. 2023, 204, 89–104.
33. Unal, O.; Sakaridis, C.; Van Gool, L. Bayesian Self-training for Semi-supervised 3D Segmentation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 89–107.
34. Deng, J.; Lu, J.; Zhang, T. Quantity-quality enhanced self-training network for weakly supervised point cloud semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 3580–3596.
35. Xie, M.K.; Xiao, J.; Liu, H.Z.; Niu, G.; Sugiyama, M.; Huang, S.J. Class-distribution-aware pseudo-labeling for semi-supervised multi-label learning. Adv. Neural Inf. Process. Syst. 2023, 36, 25731–25747.
36. Hu, J.; Chen, C.; Cao, L.; Zhang, S.; Shu, A.; Jiang, G.; Ji, R. Pseudo-label alignment for semi-supervised instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 16337–16347.
37. Xie, S.; Gu, J.; Guo, D.; Qi, C.R.; Guibas, L.; Litany, O. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 574–591.
38. Wang, X.; Zhang, B.; Yu, L.; Xiao, J. Hunting sparsity: Density-guided contrastive learning for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 3114–3123.
39. Liu, Y.; Feng, S.; Liu, S.; Zhan, Y.; Tao, D.; Chen, Z.; Chen, Z. Sample-cohesive pose-aware contrastive facial representation learning. Int. J. Comput. Vis. 2025, 133, 3727–3745.
40. Luo, Z.; Zeng, T.; Jiang, X.; Peng, Q.; Ma, Y.; Xie, Z.; Pan, X. Dense Supervised Dual-Aware Contrastive Learning for Airborne Laser Scanning Weakly Supervised Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5701015.
41. Huang, S.; Hu, Q.; Ai, M.; Zhao, P.; Li, J.; Cui, H.; Wang, S. Weakly supervised 3D point cloud semantic segmentation for architectural heritage using teacher-guided consistency and contrast learning. Autom. Constr. 2024, 168, 105831.
42. Wang, J.; He, J.; Liu, Y.; Chen, C.; Zhang, M.; Tan, H. Multi-Scale Classification and Contrastive Regularization: Weakly Supervised Large-Scale 3D Point Cloud Semantic Segmentation. Remote Sens. 2024, 16, 3319.
43. Liu, M.; Zhou, Y.; Qi, C.R.; Gong, B.; Su, H.; Anguelov, D. Less: Label-efficient semantic segmentation for lidar point clouds. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 70–89.
44. Zheng, Z.; Song, H. Group contrastive learning for weakly-supervised 3D point cloud semantic segmentation. J. East China Norm. Univ. (Natural Sci.) 2024, 2024, 108–118.
45. Yao, B.; Dong, L.; Qiu, X.; Song, K.; Yan, D.; Peng, C. Uncertainty-guided contrastive learning for weakly supervised point cloud segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5704913.
46. Tang, L.; Zhan, Y.; Chen, Z.; Yu, B.; Tao, D. Contrastive boundary learning for point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8489–8499.
47. Laine, S.; Aila, T. Temporal Ensembling for Semi-Supervised Learning. arXiv 2016, arXiv:1610.02242.
48. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30, 1195–1204.
49. Hu, Q.; Yang, B.; Fang, G.; Guo, Y.; Leonardis, A.; Trigoni, N.; Markham, A. Sqn: Weakly-supervised semantic segmentation of large-scale 3d point clouds. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 600–619.
50. Zhao, H.; Jiang, L.; Fu, C.W.; Jia, J. Pointweb: Enhancing local neighborhood features for point cloud processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 5565–5573.
51. Hou, J.; Graham, B.; Nießner, M.; Xie, S. Exploring data-efficient 3d scene understanding with contrastive scene contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021; pp. 15587–15597.
52. Zhang, Y.; Qu, Y.; Xie, Y.; Li, Z.; Zheng, S.; Li, C. Perturbed self-distillation: Weakly supervised large-scale point cloud semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15520–15528.
53. Zhao, N.; Chua, T.S.; Lee, G.H. Sess: Self-ensembling semi-supervised 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 14–19 June 2020; pp. 11079–11087.
54. Hui, L.; Tang, L.; Shen, Y.; Xie, J.; Yang, J. Learning superpoint graph cut for 3d instance segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 36804–36817.
55. Luo, Z.; Xie, Z.; Wan, J.; Zeng, Z.; Liu, L.; Tao, L. Indoor 3D point cloud segmentation based on multi-constraint graph clustering. Remote Sens. 2022, 15, 131.
56. Wu, Z.; Wu, Y.; Lin, G.; Cai, J. Reliability-adaptive consistency regularization for weakly-supervised point cloud segmentation. Int. J. Comput. Vis. 2024, 132, 2276–2289.
57. Deng, S.; Dong, Q.; Liu, B.; Hu, Z. Superpoint-guided semi-supervised semantic segmentation of 3D point clouds. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 9214–9220.
58. Armeni, I.; Sener, O.; Zamir, A.R.; Jiang, H.; Brilakis, I.; Fischer, M.; Savarese, S. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1534–1543.
59. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5828–5839.
60. Su, Y.; Cheng, M.; Yuan, Z.; Liu, W.; Zeng, W.; Zhang, Z.; Wang, C. Spatial adaptive fusion consistency contrastive constraint: Weakly supervised building facade point cloud semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5703214.
61. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. Pointcnn: Convolution on x-transformed points. Adv. Neural Inf. Process. Syst. 2018, 31, 828–838.
62. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
63. Landrieu, L.; Simonovsky, M. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4558–4567.
64. Yao, B.; Xiao, H.; Zhuang, J.; Peng, C. Weakly supervised learning for point cloud semantic segmentation with dual teacher. IEEE Robot. Autom. Lett. 2023, 8, 6347–6354.
65. Lahoud, J.; Ghanem, B.; Pollefeys, M.; Oswald, M.R. 3d instance segmentation via multi-task metric learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Long Beach, CA, USA, 16–20 June 2019; pp. 9256–9266.
66. Yang, B.; Wang, J.; Clark, R.; Hu, Q.; Wang, S.; Markham, A.; Trigoni, N. Learning object bounding boxes for 3d instance segmentation on point clouds. Adv. Neural Inf. Process. Syst. 2019, 32, 1–10.
67. Liu, C.; Furukawa, Y. MASC: Multi-Scale Affinity with Sparse Convolution for 3D Instance Segmentation. arXiv 2019, arXiv:1902.04478.
68. Wu, Y.; Yan, Z.; Cai, S.; Li, G.; Han, X.; Cui, S. Pointmatch: A consistency training framework for weakly supervised semantic segmentation of 3d point clouds. Comput. Graph. 2023, 116, 427–436.

Figure 1. An overview of the proposed framework.

Figure 2. Reliability-adaptive dynamic thresholding for partitioning pseudo-labels into reliable and ambiguous sets.

Figure 3. Truncated-Gaussian reweighting of ambiguous pseudo-labels.

Figure 4. The visualized semantic segmentation result of S3DIS under the 1% setting.

Figure 5. The visualized semantic segmentation result of S3DIS under the 0.1% setting.

Figure 6. Comparisons of performance on S3DIS: (left) 0.1% annotation, (right) 1% annotation.

Figure 7. The visualized semantic segmentation result of ScanNet under the 1% setting.

Figure 8. The visualized semantic segmentation result of ScanNet under the 0.1% setting.

Figure 9. Comparisons of performance on ScanNet: (left) 0.1% annotation, (right) 1% annotation.

Figure 10. Ablation study on the temperature coefficient T.

Figure 11. Ablation results: impact of dynamic threshold (DT) and Gaussian weighting (GW).

Table 1. Semantic segmentation results on the S3DIS dataset under different supervision settings.

Setting	Method	mIoU (%)
Fully	PointNet [62]	41.9
Fully	SPGraph [63]	57.9
Fully	PointCNN [61]	57.3
0.1%	SQN [49]	50.1
0.1%	PSD [52]	51.0
0.1%	Ours	55.8
1%	SAF-C3 [60]	60.9
1%	PSD [52]	60.0
1%	SQN [49]	61.4
1%	Ours	61.6

Table 2. Semantic segmentation results on S3DIS dataset. The results include the IoU score (%) of each category and the mIoU score (%).

Methods	mIoU (%)	Ceiling	Floor	Wall	Beam	Column	Window	Door	Table	Chair	Sofa	Bookcase	Board	Clutter
PointNet [62]	41.9	88.8	97.3	69.8	0.1	3.9	48.3	13.5	59.6	53.7	5.8	42.3	27.2	34.4
SPGraph [63]	57.9	89.4	96.9	78.3	0.0	42.6	48.8	61.6	84.8	74.1	69.4	52.6	2.1	52.6
PointCNN [61]	57.3	92.3	98.2	79.4	0.0	17.6	22.8	62.1	74.4	80.6	31.7	66.7	62.1	56.7
SQN (0.1%) [49]	50.1	87.7	94.4	71.4	0.0	10.2	32.3	34.8	61.1	74.6	41.0	63.0	37.0	44.1
PSD (0.1%) [52]	51.0	90.6	95.5	74.8	0.0	18.9	51.0	18.4	59.8	69.3	31.7	61.3	49.7	42.1
Ours (0.1%)	55.8	76.9	98.0	71.5	0.0	17.3	42.6	41.1	77.7	84.6	62.6	54.4	60.3	39.5
SQN (1%) [49]	61.4	91.7	95.6	78.7	0.0	24.2	55.9	63.1	70.5	83.1	60.7	67.8	56.1	50.6
PSD (1%) [52]	60.0	91.9	96.6	79.7	0.0	19.0	60.1	39.4	72.8	81.6	53.0	70.4	62.7	52.9
Ours (1%)	61.6	89.5	98.3	78.5	0.0	15.6	42.2	51.2	79.3	86.2	74.7	69.8	67.4	47.6
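As a quick consistency check on Table 2, the sketch below averages the 13 per-class IoU values of the "Ours (1%)" row. It assumes that the reported mIoU is the plain unweighted mean over the listed categories; the variable names are illustrative.

```python
# Per-class IoU (%) for the "Ours (1%)" row of Table 2, in column order.
iou_ours_1pct = {
    "Ceiling": 89.5, "Floor": 98.3, "Wall": 78.5, "Beam": 0.0,
    "Column": 15.6, "Window": 42.2, "Door": 51.2, "Table": 79.3,
    "Chair": 86.2, "Sofa": 74.7, "Bookcase": 69.8, "Board": 67.4,
    "Clutter": 47.6,
}

# Unweighted mean over classes (assumed definition of mIoU here).
miou = sum(iou_ours_1pct.values()) / len(iou_ours_1pct)
print(round(miou, 1))
```

The mean rounds to the 61.6 listed in the mIoU column for that row.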

Table 3. Semantic segmentation results on the ScanNet dataset under different supervision settings.

Setting	Method	mIoU (%)
Fully	MASC [67]	44.7
Fully	3D-Bonet [66]	43.8
Fully	MTML [65]	55.6
0.1%	PSD [52]	46.0
0.1%	SQN [49]	42.1
0.1%	Ours	52.6
1%	PSD [52]	57.5
1%	SQN [49]	50.2
1%	DCL [64]	59.3
1%	Ours	59.5

Table 4. Semantic segmentation results on ScanNet dataset. The results include the IoU score (%) of each category and the mIoU score (%).

Methods	mIoU (%)	Bathtub	Bed	Bookshelf	Cabinet	Chair	Counter	Curtain	Desk	Door	Floor	Other Furn.	Picture	Fridge	s.Curtain	Sink	Sofa	Table	Toilet	Wall	Window
MASC [67]	44.7	52.8	55.5	38.1	38.2	63.3	0.2	50.9	26.0	36.1	-	43.2	32.7	45.1	57.1	36.7	63.9	38.6	98.0	-	27.6
3D-Bonet [66]	48.8	100.0	67.2	59.0	30.1	48.4	9.8	62.0	30.6	34.1	-	25.9	12.5	43.4	79.6	40.2	49.9	51.3	90.9	-	43.9
MTML [65]	54.9	100.0	80.7	58.8	32.7	64.7	0.4	81.5	18.0	41.8	-	36.4	18.2	44.5	100.0	44.2	68.8	57.1	100.0	-	39.6
SQN (0.1%) [49]	42.1	46.2	90.3	31.4	50.3	60.6	51.2	50.4	16.8	26.8	36.4	32.7	37.2	36.6	40.8	40.2	17.9	50.6	45.9	44.9	35.6
PSD (0.1%) [52]	46.0	43.6	92.5	33.9	60.7	68.5	53.7	56.0	23.8	40.6	63.5	8.3	46.0	46.3	28.7	36.3	40.5	63.9	32.6	51.2	29.1
Ours (0.1%)	52.6	61.5	92.6	39.8	54.3	70.5	58.9	55.2	38.4	39.3	61.3	10.2	48.9	45.6	56.3	37.2	54.3	76.4	42.7	69.8	38.8
SQN (1%) [49]	50.2	56.7	91.2	46.5	65.3	75.0	60.0	61.3	17.5	32.0	43.4	43.7	41.1	45.4	46.5	59.1	23.7	59.4	45.9	50.2	41.1
PSD (1%) [52]	57.5	66.4	93.6	48.2	73.0	78.6	69.1	63.0	35.2	53.9	74.1	22.8	50.2	51.2	47.7	57.0	27.5	77.1	50.4	62.8	48.8
Ours (1%)	59.5	77.3	94.1	48.4	66.5	83.2	70.4	62.6	47.1	47.2	66.3	14.0	52.8	48.1	61.0	41.0	59.2	81.0	48.9	74.7	45.0

Table 5. Ablation study on the effects of dynamic threshold (DT) and Gaussian Weighting (GW).

Setting	DT	GW	mIoU (%)	avg.F1 (%)
Full model	🗸	🗸	61.6	71.5
w/o Dynamic Threshold	–	🗸	58.2 (−3.4)	68.3 (−3.2)
w/o Gaussian Weighting	🗸	–	59.3 (−2.3)	68.9 (−2.6)

Table 6. Ablation study on the temperature coefficient T.

T	mIoU (%)	avg.F1 (%)
0.5	53.5	63.36
1.0	55.8	67.36
2.0	52.5	65.05

Table 7. Ablation on the loss weights of KL divergence and cross-entropy (CE).

$λ_{a}$ (KL)	$λ_{r}$ (CE)	mIoU (%)	$Δ$ mIoU (%)
1.0	1.0	61.9	0.0
1.0	0.5	59.9	−2.0
0.5	1.0	55.8	−6.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

Share and Cite

MDPI and ACS Style

Luo, Z.; Liu, X.; Jiang, J.; Qi, H.; Wang, C.; Xie, Z.; Zeng, T. TGR-T: Truncated-Gaussian-Weighted Reliability for Adaptive Dynamic Thresholding in Weakly Supervised Indoor 3D Point Cloud Segmentation. ISPRS Int. J. Geo-Inf. 2026, 15, 108. https://doi.org/10.3390/ijgi15030108

