Article

SemABC: Semantic-Guided Adaptive Bias Calibration for Generative Zero-Shot Point Cloud Segmentation

School of Information Science and Engineering, Shandong Normal University, Jinan 250358, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8359; https://doi.org/10.3390/app15158359
Submission received: 20 June 2025 / Revised: 15 July 2025 / Accepted: 25 July 2025 / Published: 27 July 2025
(This article belongs to the Special Issue Applications of Artificial Intelligence in Industrial Engineering)

Abstract

Due to the limited quantity and high cost of high-quality three-dimensional annotations, generalized zero-shot point cloud segmentation aims to transfer knowledge from seen classes to unseen classes by leveraging semantic correlations. Existing generative point cloud semantic segmentation approaches rely on generators trained on seen classes to synthesize visual features for unseen classes so that the segmentation model can generalize, but this often leads to a bias toward seen classes. To address this issue, we propose a semantic-guided adaptive bias calibration approach with a dual-branch network architecture. This network consists of a novel visual–semantic fusion branch alongside the primary segmentation branch to suppress the bias toward seen classes. Specifically, the visual–semantic fusion branch exploits the visual–semantic relevance of the synthesized features of unseen classes to provide auxiliary predictions. Furthermore, we introduce an adaptive bias calibration module that dynamically integrates the predictions from the main and auxiliary branches to achieve unbiased segmentation results. Extensive experiments conducted on standard benchmarks demonstrate that our approach significantly outperforms state-of-the-art methods on both seen and unseen classes, validating its effectiveness.

1. Introduction

Among three-dimensional (3D) scene understanding tasks, point cloud segmentation plays a fundamental role in fields such as autonomous driving [1,2], robot navigation [3,4], and augmented reality [5]. However, due to the sparsity and unordered nature of 3D point clouds, as well as the high cost of acquiring high-quality annotations [6], supervised learning methods for point clouds [7,8] struggle to generalize to a wider range of scenarios. Since generalized zero-shot learning (GZSL) [9,10,11] enables knowledge transfer from seen to unseen classes without labeled data for the unseen classes during training, researchers have recently become particularly interested in exploring the potential of these techniques for predicting both seen and unseen classes in point cloud segmentation tasks. GZSL can be divided into transductive approaches [12,13,14], which leverage data samples from unseen classes during training, and inductive approaches [15,16,17], which rely solely on samples of seen classes during training. Given the difficulty of acquiring unlabeled samples of all possible classes in real-world scenarios, inductive GZSL is better suited to practical applications.
Inductive GZSL methods for point cloud semantic segmentation, such as 3DGenZ [18] and SV-Seg [19], employ generative models [20,21] to create synthesized features based on semantic embeddings [22,23,24,25] of unseen classes. However, since the generator is trained exclusively on seen classes, the features it generates for unseen classes are inevitably biased toward the characteristics of seen classes. Consequently, the segmentor trained with these features tends to favor the seen classes. To address this issue, these approaches [18,19] typically reduce the bias toward seen classes by subtracting a predefined constant value from their predicted probability score for calibration purposes, which may affect the generalization and robustness of the model in open scenarios.
To address this problem, we propose a semantic-guided adaptive bias calibration approach with a dual-branch network architecture. This network comprises a primary segmentation branch (main branch) and a visual–semantic fusion branch (auxiliary branch). The primary segmentation branch encodes the input point cloud and predicts the segmentation probabilities, as other approaches do, while the visual–semantic fusion branch additionally explores the correlation between the visual features of the point clouds and the semantic embeddings of the categories to reduce the bias toward seen classes. Our motivation is that, since these synthesized features are generated based on category semantics of unseen classes, their relevance to semantics is inherently higher than that of the seen classes. Therefore, we propose to incorporate the spatial and channel attention mechanisms to exploit the visual–semantic correlation and align the two modalities via contrastive learning, as shown in the first row of Figure 1. In this way, the visual–semantic fusion branch is able to produce segmentation predictions in favor of unseen classes, which forms a complement to those from the primary segmentation branch.
With this dual-branch network architecture, we further propose an adaptive bias calibration module that dynamically adjusts the weights of the auxiliary predictions based on the prediction of the main branch, as shown in the bottom row of Figure 1. When the main branch exhibits high confidence with the predicted probability, we assign less weight to the auxiliary predictions to avoid introducing additional noise or making unnecessary corrections to high-confidence predictions. In contrast, when confidence is low, we increase the influence of the auxiliary predictions, leveraging different perspectives from the auxiliary branch to enhance the reliability of the overall prediction.
In real-world dynamic scenarios, our method can be applied to the identification of unknown obstacles in urban autonomous driving, for example, recognizing new types of vehicles not covered in the training set, such as electric scooters and delivery robots. Through the visual–semantic branch, these objects are associated with the “small mobile vehicle” semantic category and are thus not misclassified as background. It can also be applied in construction-site adaptation scenarios, where temporary roadblocks, cones, and other objects with variable shapes result in low confidence in the main branch. In such cases, our model gives more weight to the auxiliary branch, using the semantic association to improve segmentation performance.
Our main contributions are summarized as follows:
  • We propose a dual-branch network where the auxiliary branch exploits visual–semantic correlations to complement the prediction of the main branch. It leverages the higher semantic relevance of the synthesized visual features of unseen classes to prevent bias toward the seen classes.
  • We propose an adaptive bias calibration based on the confidence of segmentation predictions. This module dynamically integrates the predictions from the main and auxiliary branches, thus effectively suppressing the network’s bias towards seen classes.
  • Extensive experiments show that our method significantly outperforms existing generative GZSL methods in terms of both segmentation accuracy and robustness on the benchmark datasets.

2. Related Works

2.1. Point Cloud Semantic Segmentation

As a core task of 3D visual understanding, point cloud semantic segmentation aims to endow each spatial point in the unstructured point cloud with an accurate semantic label. Many works design network architectures for point clouds, such as PointNet [26], PointNet++ [27], KPConv [28], and RandLA-Net [29], and utilize a fully supervised learning paradigm that depends on large-scale labeled datasets for training. However, the fine-grained annotation of point clouds requires enormous human resources, which significantly limits the application and advancement of these methods in practical scenarios.

2.2. Zero-Shot Point Cloud Semantic Segmentation

Although there have been significant advances in zero-shot learning for images [30,31], research on zero-shot segmentation for 3D point clouds is still underexplored. The pioneering work in this area was carried out by Cheraghian et al. [32], who adapted the two-dimensional (2D) Zero-Shot Learning (ZSL) framework to 3D point cloud data by aligning the PointNet features and semantic embeddings in the same space. Subsequently, they further expanded 3D ZSL through several studies, including introducing transductive learning strategies [33] and building geometric primitive transfer models [34]. However, these methods assume the availability of unlabeled samples from unseen categories, which greatly limits their applicability in real-world situations.
To eliminate reliance on samples from unseen categories during training, Michele et al. [18] proposed a generative framework that utilizes a generator to synthesize pseudo-features based on the semantic information of unseen classes, achieving breakthrough progress in 3D ZSL and GZSL. Building on this, subsequent work by SV-Seg [19] improved feature diversity by enhancing the multi-dimensional alignment of the visual–semantic space. Since these methods rely on data from seen classes to learn models for recognizing both seen and unseen data, they tend to exhibit bias toward seen classes.
By contrast, we propose a dual-branch network architecture that includes a visual–semantic fusion branch. This branch takes advantage of the higher semantic relevance of synthetic features for unseen classes compared to seen classes. By doing so, it allows us to obtain auxiliary predictions that are unbiased toward seen classes through visual–semantic alignment. This adaptive approach helps guide the calibration for semantic segmentation.

2.3. Calibrated Stacking

In generalized zero-shot learning, since the training data only include seen classes while the test data come from both seen and unseen classes, the test data from unseen classes are often misclassified as seen classes. To address this problem, Chao et al. [35] proposed the calibrated stacking method, which suppresses the bias toward seen classes by subtracting a predefined constant value. This has been widely used in both 2D [36] and 3D [18,19] GZSL approaches. However, this fixed-parameter calibration may weaken the model’s adaptability in open environments, and using homogenized calibration parameters for heterogeneous seen classes would reduce the model’s robustness. To address these limitations, we propose an adaptive bias calibration module that dynamically adjusts the weights of auxiliary predictions according to the confidence of the main branch’s prediction. The weighted auxiliary predictions are subsequently integrated into the main branch prediction, yielding the final prediction following adaptive bias calibration.

3. Methods

3.1. Overview

We propose a semantic-guided adaptive bias calibration approach for the zero-shot point cloud semantic segmentation problem. Our approach specifically addresses the problem that existing generative segmentation approaches [18,19] often misclassify between seen and unseen classes. As illustrated in Figure 2, we design a dual-branch network architecture. One branch is the primary segmentation branch, following existing work [19], which tends to produce segmentation results that can better distinguish different objects but often misclassifies between seen and unseen classes. The other branch is the visual–semantic fusion branch, which fuses the semantic embeddings with point cloud features through contrastive learning to identify potential features, especially for unseen classes. The outputs of the two branches are adaptively combined to calibrate the bias and produce accurate segmentation probabilities.
To formulate our problem, we establish an object category set $C = \{C^S, C^U\}$, which includes the seen classes $C^S$ and the unseen classes $C^U$. Following the general setting of inductive generalized ZSL works, our training process takes the category embeddings of all classes in $C$ and only the point clouds of the seen classes $C^S$ with their ground truth labels to train the network. After training, the network will encounter point clouds from either seen or unseen classes.
In the remainder of this section, we first introduce the network modules, including the primary segmentation branch (Section 3.2), the visual–semantic fusion branch (Section 3.3), and the adaptive bias calibration module (Section 3.4). Then, we describe the entire training strategy, which is a three-stage training process (Section 3.5).

3.2. Primary Segmentation Branch

We adopt the network architecture from [19] as the primary segmentation branch. Specifically, this branch consists of three modules: an encoder $E$, a segmentor $S$, and an auxiliary generator $G$. The encoder takes point clouds as input, where each point contains a 3D coordinate and the accompanying attributes provided by the specific dataset, and outputs per-point features. The segmentor uses these features as input to predict the segmentation $P_{main}$, i.e., the probability that each point belongs to each class.
The auxiliary generator $G$ is used only during the training process to assist in the inductive generalization to unseen classes. It takes a class embedding $t_u$ as input, which is the embedding of a given class label encoded by pre-trained Natural Language Processing (NLP) models, i.e., GloVe [37] + Word2Vec [38]. Note that $t_u$ is perturbed by random noise $z_u$ before being fed into the generator in order to introduce randomness and diversify the generated data. The generator outputs the features $\tilde{F}_u$ of the given classes, which share the same dimensions as the output of encoder $E$. In this way, the primary segmentation branch can produce visual features $F_{s+u}$ for both the seen and unseen classes.
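To make the generator's role concrete, the following is a minimal PyTorch sketch of a conditional feature generator of this kind; the class name, layer sizes, and dimensions (e.g., FeatureGenerator, embed_dim=600, feat_dim=64) are illustrative assumptions rather than the exact GMMN architecture used in [19,20].

```python
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """Synthesizes per-point visual features for a class from its semantic
    embedding perturbed by random noise (layer sizes are illustrative)."""
    def __init__(self, embed_dim=600, noise_dim=300, feat_dim=64, hidden_dim=512):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(embed_dim + noise_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, class_embedding, num_points):
        # class_embedding: (embed_dim,) GloVe + Word2Vec vector t_u of one class
        t = class_embedding.unsqueeze(0).expand(num_points, -1)       # (N, embed_dim)
        z = torch.randn(num_points, self.noise_dim, device=t.device)  # noise z_u for diversity
        return self.net(torch.cat([t, z], dim=-1))                    # (N, feat_dim) synthetic features
```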

3.3. Visual–Semantic Fusion Branch

The proposed visual–semantic fusion branch aims to integrate the visual features $F_{s+u}$ with the category semantic embeddings $T$ for a more accurate classification between the seen and unseen classes. The network architecture of this branch is illustrated in Figure 3, and we train it with a contrastive learning strategy. $F_{s+u}$ denotes the visual features produced by the primary segmentation branch, consisting of the seen-class features from the encoder $E$ and the unseen-class features synthesized by the generator $G$. $T$ denotes the semantic embeddings of all the category labels in $C$. We first project the visual features $F_{s+u}$ and the semantic embeddings $T$ separately so that they share the same number of feature channels, and denote the results as $\hat{F}$ and $\hat{T}$.
Then, we apply channel and spatial attention to fuse the visual features $\hat{F}$ and semantic embeddings $\hat{T}$. Specifically, channel attention calculates the importance of each channel of one modality, allowing the features of the other modality to concentrate on the relevant key channels. Meanwhile, spatial attention enhances significant regions and suppresses irrelevant ones by analyzing the spatial correlation between the two modalities.
The channel attention is formulated as follows:
$$Q_t^c = \hat{T} W_{q_t}^c, \quad K_t^c = \hat{T} W_{k_t}^c, \quad V_f^c = \hat{F} W_{v_f}^c, \quad F^c = \hat{F} + \mathrm{Softmax}\!\left(\frac{(Q_t^c)^{\top} K_t^c}{\sqrt{D_K}}\right) V_f^c, \tag{1}$$
$$Q_f^c = \hat{F} W_{q_f}^c, \quad K_f^c = \hat{F} W_{k_f}^c, \quad V_t^c = \hat{T} W_{v_t}^c, \quad T^c = \hat{T} + \mathrm{Softmax}\!\left(\frac{(Q_f^c)^{\top} K_f^c}{\sqrt{D_K}}\right) V_t^c, \tag{2}$$
where $W_{q_t}^c$, $W_{k_t}^c$, and $W_{v_f}^c$ are the learnable projection weights that map $\hat{T}$ to the query ($Q_t^c$) and key ($K_t^c$) vectors and map $\hat{F}$ to the value ($V_f^c$) vectors. Similarly, $W_{q_f}^c$, $W_{k_f}^c$, and $W_{v_t}^c$ project $\hat{F}$ to the corresponding query ($Q_f^c$) and key ($K_f^c$) and project $\hat{T}$ to the value ($V_t^c$). $D_K$ is the dimension of the projected vectors. In this way, we obtain the fused visual features $F^c$ and semantic features $T^c$ encoded with channel attention.
The spatial attention is computed as follows:
$$Q_f^s = F^c W_{q_f}^s, \quad K_t^s = T^c W_{k_t}^s, \quad V_t^s = T^c W_{v_t}^s, \quad F^s = F^c + \mathrm{Softmax}\!\left(\frac{Q_f^s (K_t^s)^{\top}}{\sqrt{D_K}}\right) V_t^s, \tag{3}$$
$$Q_t^s = T^c W_{q_t}^s, \quad K_f^s = F^c W_{k_f}^s, \quad V_f^s = F^c W_{v_f}^s, \quad T^s = T^c + \mathrm{Softmax}\!\left(\frac{Q_t^s (K_f^s)^{\top}}{\sqrt{D_K}}\right) V_f^s, \tag{4}$$
where $W_{q_f}^s$, $W_{k_t}^s$, and $W_{v_t}^s$ denote the learnable weights that map $F^c$ to its corresponding query ($Q_f^s$) and map $T^c$ to the key ($K_t^s$) and value ($V_t^s$) vectors, whereas $W_{q_t}^s$, $W_{k_f}^s$, and $W_{v_f}^s$ similarly project $T^c$ and $F^c$ to the query ($Q_t^s$), key ($K_f^s$), and value ($V_f^s$). $D_K$ is the dimension of the projected vectors. After the spatial attention, we obtain the fused visual features $F^s$ and semantic features $T^s$.
Next, the features $F^s$ and $T^s$ are passed through a multi-layer perceptron (MLP) followed by layer normalization to obtain the final fused features:
$$F_{fusion} = \mathrm{LayerNorm}(F^s + \mathrm{MLP}(F^s)), \tag{5}$$
$$T_{fusion} = \mathrm{LayerNorm}(T^s + \mathrm{MLP}(T^s)). \tag{6}$$
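The following PyTorch sketch illustrates how the cross-modal channel and spatial attention of Equations (1)–(6) can be wired together; the module name, the shared dimension dim (used as $D_K$), and the MLP/LayerNorm details are assumptions for illustration rather than the authors' exact implementation.

```python
import math
import torch.nn as nn
import torch.nn.functional as F

class VisualSemanticFusion(nn.Module):
    """Cross-modal channel + spatial attention with residual MLP/LayerNorm,
    mirroring Eqs. (1)-(6); dimensions and layer names are illustrative."""
    def __init__(self, dim=128):
        super().__init__()
        self.dim = dim
        # channel-attention projections (queries/keys from one modality, values from the other)
        self.q_tc, self.k_tc, self.v_fc = (nn.Linear(dim, dim, bias=False) for _ in range(3))
        self.q_fc, self.k_fc, self.v_tc = (nn.Linear(dim, dim, bias=False) for _ in range(3))
        # spatial-attention projections
        self.q_fs, self.k_ts, self.v_ts = (nn.Linear(dim, dim, bias=False) for _ in range(3))
        self.q_ts, self.k_fs, self.v_fs = (nn.Linear(dim, dim, bias=False) for _ in range(3))
        self.mlp_f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.mlp_t = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_f, self.norm_t = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, F_hat, T_hat):
        # F_hat: (N, dim) projected point features; T_hat: (C, dim) projected class embeddings
        s = 1.0 / math.sqrt(self.dim)  # scaling by sqrt(D_K), here D_K == dim
        # channel attention: a (dim x dim) map re-weights the channels of the other modality
        attn_cf = F.softmax(self.q_tc(T_hat).t() @ self.k_tc(T_hat) * s, dim=-1)
        Fc = F_hat + self.v_fc(F_hat) @ attn_cf
        attn_ct = F.softmax(self.q_fc(F_hat).t() @ self.k_fc(F_hat) * s, dim=-1)
        Tc = T_hat + self.v_tc(T_hat) @ attn_ct
        # spatial attention: points attend to classes and vice versa
        Fs = Fc + F.softmax(self.q_fs(Fc) @ self.k_ts(Tc).t() * s, dim=-1) @ self.v_ts(Tc)
        Ts = Tc + F.softmax(self.q_ts(Tc) @ self.k_fs(Fc).t() * s, dim=-1) @ self.v_fs(Fc)
        # residual MLP + LayerNorm (Eqs. (5)-(6))
        return self.norm_f(Fs + self.mlp_f(Fs)), self.norm_t(Ts + self.mlp_t(Ts))
```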
We compute a contrastive loss for the visual–semantic fusion branch. For a visual feature, the semantic feature corresponding to its category is regarded as the positive sample, while all other semantic features are considered negatives. The auxiliary prediction $P_{aux}$ and the contrastive loss $\mathcal{L}_{aux}$ are defined as follows:
$$P_{aux} = \mathrm{softmax}\left(F_{fusion}^{\top} T_{fusion}\right), \tag{7}$$
$$\mathcal{L}_{aux} = -\log \frac{\exp\left(F_{fusion}^{c} \cdot T_{fusion}^{c} / \tau\right)}{\exp\left(F_{fusion}^{c} \cdot T_{fusion}^{c} / \tau\right) + \sum_{\hat{c} \neq c} \exp\left(F_{fusion}^{c} \cdot T_{fusion}^{\hat{c}} / \tau\right)}, \tag{8}$$
where $P_{aux}$ represents the probability of each point belonging to each category, $\tau$ is the temperature parameter, $c$ indicates the class associated with the visual feature, and $\hat{c}$ ranges over the classes different from $c$.
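A compact sketch of this loss is given below, assuming per-point class labels for the (real or synthesized) features; the function name and the temperature value tau=0.07 are illustrative assumptions, since the text does not specify them.

```python
import torch.nn.functional as F

def aux_contrastive_loss(f_fusion, t_fusion, point_labels, tau=0.07):
    """Contrastive loss of Eq. (8): for each point feature, the embedding of its own
    class is the positive and all other class embeddings are negatives.
    f_fusion: (N, D) fused point features, t_fusion: (C, D) fused class embeddings,
    point_labels: (N,) class index per point."""
    sim = f_fusion @ t_fusion.t()                    # (N, C) point-to-class similarities
    p_aux = sim.softmax(dim=-1)                      # Eq. (7): auxiliary prediction P_aux
    loss = F.cross_entropy(sim / tau, point_labels)  # -log(exp(pos/tau) / sum_c exp(sim_c/tau))
    return p_aux, loss
```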

3.4. Adaptive Bias Calibration Module

We introduce an adaptive bias calibration method based on the confidence of the predicted segmentation probabilities to improve the model’s adaptability in open scenarios. That is, we estimate a confidence that indicates the likelihood of a successful prediction and use it as a weight to integrate the predictions of two branches. Intuitively, when the confidence of the primary segmentation branch is high, we assign a smaller weight to the prediction from the visual–semantic fusion branch, and conversely, when the confidence is low, we give more weight to the visual–semantic fusion branch’s prediction.
Specifically, we adopt the kurtosis measure to estimate the prediction confidence, denoted as $\kappa$ (see Equation (9)); kurtosis is a statistical measure that describes the shape of a data distribution. It reflects the steepness or flatness of the distribution, particularly the height of the peak of the probability density curve around the mean. A higher kurtosis indicates that the distribution has a more pronounced peak, representing a higher degree of prediction confidence. Given the predicted point-wise segmentation probabilities $P_{main}$, we adopt $\kappa$ to balance the effect of the two branches:
$$\kappa = \mathbb{E}\!\left[\left(\frac{P_{main} - \mu}{\sigma}\right)^{4}\right]^{\rho}, \tag{9}$$
$$P = P_{main} + \frac{\beta}{\kappa}\, P_{aux}, \tag{10}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of $P_{main}$, $\rho$ is a parameter that modulates the strength of the kurtosis influence, and $\beta$ is a hyperparameter that controls the confidence weight.
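A minimal sketch of this calibration step, following the reconstruction of Equations (9) and (10) above (the per-point kurtosis acts as a confidence score that down-weights the auxiliary prediction); the tensor shapes and the epsilon guards are assumptions.

```python
def calibrated_prediction(p_main, p_aux, beta=1.2, rho=1.0, eps=1e-6):
    """Adaptive bias calibration: p_main and p_aux are (N, C) probability tensors
    (PyTorch). High kurtosis (high confidence) shrinks the auxiliary weight."""
    mu = p_main.mean(dim=-1, keepdim=True)
    sigma = p_main.std(dim=-1, keepdim=True)
    # per-point kurtosis of the predicted distribution, Eq. (9)
    kappa = (((p_main - mu) / (sigma + eps)) ** 4).mean(dim=-1, keepdim=True) ** rho
    # Eq. (10): low kappa (low confidence) -> larger weight on the auxiliary branch
    return p_main + (beta / (kappa + eps)) * p_aux
```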

3.5. Three-Stage Training Strategy

Next, a three-stage training strategy is introduced to progressively train the modules of our dual-branch network. The first stage trains the encoder $E$ and segmentor $S_{seen}$ with the point clouds $P_s$ of the seen classes from the training set. We calculate the cross-entropy loss between the predictions and the ground truth labels $Y_s$:
$$\mathcal{L}_s = -\sum_{c} Y_s \log\left(S_{seen}(E(P_s))\right). \tag{11}$$
The second stage freezes the encoder E and trains the generator G on the data of seen classes from the training set. The goal is to ensure that the synthesized features from G closely resemble the real features extracted by the frozen encoder E. The loss function for this stage is
$$\mathcal{L}_G = \sum_{c} \left(\mathcal{L}_{align} + \mathcal{L}_{con} + \mathcal{L}_{MMD} + \alpha \mathcal{L}_{cst}\right), \tag{12}$$
where $\mathcal{L}_{align}$ refers to the semantic–visual alignment loss, which computes the cosine distance between the semantic and visual features of the seen classes. $\mathcal{L}_{con}$ represents the contrastive loss between the features of real and synthetic data for the seen classes. $\mathcal{L}_{MMD}$ is the maximum mean discrepancy (MMD) loss, which reduces the distributional mismatch between the synthesized features $\tilde{F}_s$ and the real features $F_s$ of the seen classes. $\mathcal{L}_{cst}$ denotes the semantic–visual consistency loss, which ensures that the distance distributions of seen and unseen classes are consistent in the synthesized visual and semantic spaces by minimizing the cosine distance. $\alpha$ is a hyperparameter for loss balancing.
The third stage freezes the encoder $E$ and generator $G$ to train the visual–semantic fusion branch and the final segmentor $S$. It leverages $E$ to extract real features for the seen classes and $G$ to produce synthesized features for the unseen classes. The loss for this stage is as follows:
$$\mathcal{L}_{all} = \lambda \mathcal{L}_{seg} + (1 - \lambda)\, \mathcal{L}_{aux}, \tag{13}$$
where λ is the hyperparameter to balance the two losses.
The segmentation loss $\mathcal{L}_{seg}$ is defined as
$$\mathcal{L}_{seg} = -\sum_{c} Y_s \log\left(S(F_s)\right) - \sum_{c} Y_u \log\left(S(\tilde{F}_u)\right), \tag{14}$$
where $Y_u$ denotes the ground truth labels of the unseen classes.
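For clarity, the third-stage objective of Equations (13) and (14) can be sketched as follows; the tensor names, and the assumption that the segmentation terms are standard cross-entropies over logits, are illustrative.

```python
import torch.nn.functional as F

def stage3_loss(logits_seen, y_seen, logits_unseen_syn, y_unseen, loss_aux, lam=0.7):
    """Third-stage objective: cross-entropy on real seen features and on synthesized
    unseen features (Eq. (14)), blended with the auxiliary contrastive loss (Eq. (13)).
    logits_*: (N, C) segmentor outputs; y_*: (N,) class indices; loss_aux: scalar tensor."""
    loss_seg = F.cross_entropy(logits_seen, y_seen) + F.cross_entropy(logits_unseen_syn, y_unseen)
    return lam * loss_seg + (1.0 - lam) * loss_aux
```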

4. Experiments

4.1. Settings

Datasets. 
We conduct experiments on three public 3D semantic segmentation datasets: ScanNet [39], S3DIS [40], and SemanticKITTI [41], which follow the setting of SV-Seg [19]. We select four unseen categories for each dataset. For ScanNet, which includes 20 categories, the unseen classes are “desk”, “bookshelf”, “sofa”, and “toilet”. For S3DIS, with 13 categories, the unseen classes are “beam”, “column”, “window”, and “sofa”. For SemanticKITTI, with 19 categories, we set “motorcycle”, “truck”, “bicyclist”, and “traffic sign” as unseen classes.
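For reference, these splits can be captured in a small configuration mapping; the dictionary name and structure below are illustrative, and the values are copied from the text.

```python
# Unseen-class splits used in the experiments; the remaining categories
# of each dataset are treated as seen classes.
UNSEEN_CLASSES = {
    "ScanNet":       ["desk", "bookshelf", "sofa", "toilet"],               # 20 classes in total
    "S3DIS":         ["beam", "column", "window", "sofa"],                  # 13 classes in total
    "SemanticKITTI": ["motorcycle", "truck", "bicyclist", "traffic sign"],  # 19 classes in total
}
```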
Metrics. 
We employ the mean Intersection-over-Union (mIoU) to measure the accuracy of segmentation results and the Harmonic mean of mIoU (HmIoU) to report the overall performance on the combined seen and unseen classes. The metrics are defined as follows:
$$mIoU = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FP_i + FN_i}, \tag{15}$$
$$HmIoU = \frac{2 \times mIoU_{seen} \times mIoU_{unseen}}{mIoU_{seen} + mIoU_{unseen}}, \tag{16}$$
where $TP_i$, $FP_i$, and $FN_i$ represent the counts of true positives, false positives, and false negatives for category $c_i$, respectively, and $k$ is the total number of categories.
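A minimal sketch of these metrics computed from a confusion matrix follows; the function name, NumPy usage, and epsilon guards are assumptions.

```python
import numpy as np

def miou_and_hmiou(conf, seen_ids, unseen_ids):
    """Per-class IoU from a k x k confusion matrix `conf` (rows = ground truth,
    cols = prediction), then mIoU over seen/unseen subsets and their harmonic
    mean as in Eqs. (15)-(16)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1e-9)
    miou_seen, miou_unseen = iou[seen_ids].mean(), iou[unseen_ids].mean()
    hmiou = 2 * miou_seen * miou_unseen / max(miou_seen + miou_unseen, 1e-9)
    return miou_seen, miou_unseen, hmiou
```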
Baselines. 
We compare the proposed approach with four baselines. Following 3DGenZ [18], we evaluate ZSLPC-Seg [32] and DeViSe-3DSeg [42], which were originally developed for ZSL classification and 2D segmentation and have been adapted to the zero-shot point cloud segmentation problem. In addition, we compare with 3DGenZ [18] and SV-Seg [19], which are the state-of-the-art generative point cloud segmentation methods.

4.2. Implementation Details

To facilitate comparison with the baselines, we utilize the same encoders and generator as SV-Seg [19]. The encoders differ per dataset: FKAConv [43] for ScanNet, ConvPoint [44] for S3DIS, and KPConv [28] for SemanticKITTI. The generator uniformly employs the Generative Moment Matching Network (GMMN) [20]. All encoders are pre-trained on seen-class data using the parameters recommended in their respective papers. After pre-training, we train the generator using the frozen encoder to obtain features of the seen classes. We use GloVe + Word2Vec to compute a 600-dimensional semantic embedding vector for each category label. The model is trained for a total of 20 epochs with the Adam optimizer. The learning rates are 0.0002 for the generator, 0.007 for the encoder, and 0.07 for the segmentor. For the visual–semantic fusion branch, we adopt the SGD optimizer with an initial learning rate of 0.0002. We use the same hyperparameters for all three datasets: the parameter ρ in Equation (9) is set to 1, the parameter β in Equation (10) is set to 1.2, and the parameter λ in Equation (13) is set to 0.7.
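The training settings stated above can be summarized in a single configuration object; in the sketch below the dictionary keys are illustrative, while the values are taken from the text.

```python
# Training configuration as stated in this section.
CONFIG = {
    "epochs": 20,
    "optimizer_main": "Adam",
    "lr_generator": 2e-4,
    "lr_encoder": 7e-3,
    "lr_segmentor": 7e-2,
    "optimizer_fusion_branch": "SGD",
    "lr_fusion_branch": 2e-4,
    "semantic_embedding_dim": 600,  # GloVe + Word2Vec
    "rho": 1.0,     # Eq. (9)
    "beta": 1.2,    # Eq. (10)
    "lambda": 0.7,  # Eq. (13)
}
```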

4.3. Comparisons

Our comparison against the state-of-the-art methods is presented in Table 1. Compared to the best performance (from SV-Seg [19]), our approach achieves HmIoU improvements of 0.7%, 1.3%, and 3.2% on the ScanNet, S3DIS, and SemanticKITTI datasets, respectively. The mIoU improvements for unseen classes are 0.6%, 0.7%, and 2.4% for the three datasets, demonstrating that our approach effectively mitigates semantic bias and enhances generalization to unseen classes. In addition, the improvements for seen classes are 0.4%, 9.2%, and 3.1%. This indicates that fewer unseen classes are wrongly classified as seen ones, leading to an mIoU increase for seen classes.
We conduct a qualitative comparison with SV-Seg [19], which performs bias calibration by subtracting a pre-defined constant value from its predicted probability scores, on the three benchmark datasets, as shown in Figure 4. For the unseen classes, such as “desk” in ScanNet, “beam” in S3DIS, and “motorcycle” in SemanticKITTI, our model outperforms SV-Seg [19], demonstrating its effectiveness in addressing the bias toward seen classes. Regarding the seen classes, such as “sink” in ScanNet, “chair” in S3DIS, and “pole” in SemanticKITTI, the baseline incorrectly classifies them as the unseen classes “toilet”, “beam/column”, and “truck”, respectively. In contrast, our model exhibits better accuracy in recognizing these seen-class samples. These observations indicate that the fixed calibration strategy in SV-Seg [19] can negatively impact the accuracy for seen classes, while our adaptive bias calibration method proves more effective in this regard.

4.4. Ablation Studies

4.4.1. Module Effectiveness

Table 2 evaluates the effectiveness of the proposed visual–semantic fusion branch and the adaptive bias calibration module, especially their collaborative mechanism. The first baseline, denoted as Fixed Calibration, directly adopts the prediction results of the existing generative segmentation network SV-Seg [19]; here, bias calibration is performed by subtracting a fixed calibration value. The second experiment, i.e., w/o BCM, employs our dual-branch network architecture but calibrates the bias by averaging the predictions of the two branches rather than using our adaptive calibration module. The third experiment, i.e., w/o VSFM, removes the core fusion module of the visual–semantic branch, directly calculates the similarity matrix between visual and semantic features as the output segmentation probability of this branch, and then applies our adaptive bias calibration scheme.
The experimental results reveal that the most significant improvement in the performance occurs when the VSFM and BCM modules work together, resulting in an increase of 0.8% in HmIoU. The absence of the VSFM module (w/o VSFM) leads to a more noticeable decline in the performance of seen classes, whereas the lack of BCM (w/o BCM) causes a more substantial decrease in the performance of unseen classes. These results suggest that the two proposed modules establish a bidirectional compensatory system with complementary characteristics: on one hand, the VSFM module employs the channel and spatial attention mechanism to facilitate dynamic interaction and fusion between visual–semantic features, effectively enhancing the spatial separability of visual features for both seen and unseen classes. On the other hand, the BCM module implements a confidence-based weight adjustment strategy that dynamically regulates the contribution of auxiliary predictions while adaptively incorporating these results into the predictions of the main branch.

4.4.2. Channel and Spatial Attention

To evaluate the effectiveness of channel and spatial attention mechanisms used in the VSFM, we conduct a quantitative analysis through an ablation study (see Table 3). The experimental results indicate that while each mechanism yields only slight improvements when used independently, their combination results in a 1.4% increase in the HmIoU metric on the ScanNet dataset. This underscores the cumulative enhancement effect of the dual attention mechanism: spatial attention improves the representation of critical areas by analyzing spatial relationships between modalities, while channel attention fosters a joint focus on key channels across them. The collaboration between these two mechanisms effectively balances the need for detailed recognition and generalization.

4.4.3. Prediction Confidence Estimation

To assess the impact of various prediction confidence estimation methods on segmentation performance, we compare five approaches based on the prediction probability distribution: Maximum Probability (Max), Entropy [45], Energy Score [46], Variance [47], and Kurtosis [48]. In addition, as a reference, SV-Seg [19], which employs pre-defined calibration values, is adopted as the baseline. As shown in Table 4, the kurtosis-based estimation method exhibits the most significant performance improvement on the ScanNet and S3DIS datasets, with the HmIoU metrics increasing by 0.7% and 1.3%, respectively. Moreover, all methods that utilize prediction confidence for adaptive bias calibration achieve segmentation performances that are close to or exceed the baseline in terms of mIoU for unseen classes.
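The sketch below shows standard formulations of the compared confidence estimates computed from per-point logits; these are textbook definitions and may differ in detail from the variants used in the cited works [45,46,47,48].

```python
import torch

def confidence_scores(logits):
    """Per-point confidence estimates from segmentation logits of shape (N, C),
    corresponding to the alternatives compared in Table 4."""
    p = logits.softmax(dim=-1)
    scores = {
        "max": p.max(dim=-1).values,                             # higher = more confident
        "entropy": -(p * p.clamp_min(1e-12).log()).sum(dim=-1),  # lower = more confident
        "energy": -torch.logsumexp(logits, dim=-1),              # lower = more confident [46]
        "variance": p.var(dim=-1),                               # higher = more peaked
    }
    mu, sigma = p.mean(dim=-1, keepdim=True), p.std(dim=-1, keepdim=True)
    scores["kurtosis"] = (((p - mu) / (sigma + 1e-12)) ** 4).mean(dim=-1)  # Eq. (9) with rho = 1
    return scores
```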

4.4.4. Hyperparameters

To verify the impact of hyperparameter values on network performance, we conduct experiments on the S3DIS and ScanNet datasets for the hyperparameter β, which controls the confidence weight in Equation (10), and the hyperparameter λ, which balances the two branches in Equation (13), as shown in Figure 5. Figure 5a demonstrates that on the S3DIS dataset, setting β to 1.2 results in the optimal HmIoU value. Figure 5b reveals that on the ScanNet dataset, the best HmIoU value is achieved when λ is 0.7. These findings indicate that during network training, the main segmentation branch plays a more important role than the auxiliary branch.

4.5. Limitations

Our semantic-guided adaptive bias calibration approach still has limitations. In the visual–semantic fusion branch, segmentation prediction is obtained by aligning visual and semantic data. However, it aligns point cloud features of the same category with a single semantic embedding, overlooking the complexity and diversity within these features. We will explore methods to achieve multi-scale visual and semantic alignment in the future, which could lead to more accurate segmentation results.

5. Discussion

We propose a semantic-guided adaptive bias calibration approach to alleviate the bias towards seen classes in generalized zero-shot point cloud segmentation. We explore the visual–semantic correlation through the spatial-channel attention mechanism and utilize visual–semantic alignment to effectively distinguish different category samples in the feature space and generate auxiliary predictions to counteract the bias. We adaptively adjust the fusion weights between the main prediction from the primary segmentation branch and the auxiliary prediction from the visual–semantic fusion branch, providing auxiliary guidance preferentially for low-confidence samples while suppressing unnecessary corrections to high-confidence predictions. The experiments on common benchmarks confirm that our approach has significantly enhanced zero-shot segmentation performance. Despite these advances, our method still faces challenges in aligning visual and semantic information. This method aligns point cloud features of the same class with a single semantic embedding, which, in practical applications, may lead to segmentation misjudgments due to differences in object morphology, similar to other methods. In the future, we plan to explore multi-scale alignment methods between visual and semantic information further to improve the performance of zero-shot point cloud semantic segmentation.

Author Contributions

Conceptualization, Y.W.; methodology, Y.W. and M.Q.; software, Y.W.; validation, Y.W. and M.Q.; formal analysis, Y.W.; investigation, Y.W.; resources, M.Q.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W. and M.Q.; visualization, Y.W.; supervision, M.Q.; project administration, M.Q.; funding acquisition, M.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (61902225) and the Joint Funds of Natural Science Foundation of Shandong Province under Grant (ZR2021LZL011).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
3D      Three-Dimensional
2D      Two-Dimensional
ZSL     Zero-Shot Learning
GZSL    Generalized Zero-Shot Learning
NLP     Natural Language Processing
MLP     Multi-Layer Perceptron
MMD     Maximum Mean Discrepancy
mIoU    Mean Intersection-over-Union
HmIoU   Harmonic Mean of mIoU
SGD     Stochastic Gradient Descent
Adam    Adaptive Moment Estimation
GMMN    Generative Moment Matching Network
S3DIS   Stanford Large-Scale 3D Indoor Spaces
VSFM    Visual–Semantic Fusion Module
BCM     Adaptive Bias Calibration Module
Max     Maximum Probability

References

  1. Ansary, S.I.; Mishra, A.; Deb, S.; Deb, A.K. A framework for robotic grasping of 3D objects in a tabletop environment. Multimed. Tools Appl. 2025, 84, 25865–25894. [Google Scholar] [CrossRef]
  2. Hu, D.; Gan, V.J.; Yin, C. Robot-assisted mobile scanning for automated 3D reconstruction and point cloud semantic segmentation of building interiors. Autom. Constr. 2023, 152, 104949. [Google Scholar] [CrossRef]
  3. Huang, C.; Mees, O.; Zeng, A.; Burgard, W. Visual language maps for robot navigation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 10608–10615. [Google Scholar]
  4. Xiao, X.; Liu, B.; Warnell, G.; Stone, P. Motion planning and control for mobile robot navigation using machine learning: A survey. Auton. Robot. 2022, 46, 569–597. [Google Scholar] [CrossRef]
  5. Arena, F.; Collotta, M.; Pau, G.; Termine, F. An overview of augmented reality. Computers 2022, 11, 28. [Google Scholar] [CrossRef]
  6. Liu, G.; van Kaick, O.; Huang, H.; Hu, R. Active self-training for weakly supervised 3D scene semantic segmentation. Comput. Vis. Media 2024, 10, 425–438. [Google Scholar] [CrossRef]
  7. Shao, Y.; Tong, G.; Peng, H. Exploring high-contrast areas context for 3D point cloud segmentation via MLP-driven Discrepancy mechanism. Comput. Graph. 2025, 129, 104222. [Google Scholar] [CrossRef]
  8. Sun, C.Y.; Tong, X.; Liu, Y. Semantic segmentation-assisted instance feature fusion for multi-level 3D part instance segmentation. Comput. Vis. Media 2023, 9, 699–715. [Google Scholar] [CrossRef]
  9. Cheraghian, A.; Rahman, S.; Chowdhury, T.F.; Campbell, D.; Petersson, L. Zero-shot learning on 3d point cloud objects and beyond. Int. J. Comput. Vis. 2022, 130, 2364–2384. [Google Scholar] [CrossRef]
  10. Xu, W.; Xu, R.; Wang, C.; Xu, S.; Guo, L.; Zhang, M.; Zhang, X. Spectral prompt tuning: Unveiling unseen classes for zero-shot semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 6369–6377. [Google Scholar]
  11. Zhang, Y.; Guo, M.H.; Wang, M.; Hu, S.M. Exploring regional clues in CLIP for zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3270–3280. [Google Scholar]
  12. Su, T.; Wang, H.; Qi, Q.; Wang, L.; He, B. Transductive learning with prior knowledge for generalized zero-shot action recognition. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 260–273. [Google Scholar] [CrossRef]
  13. Ge, C.; Wang, J.; Qi, Q.; Sun, H.; Xu, T.; Liao, J. Semi-transductive learning for generalized zero-shot sketch-based image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Montréal, QC, Canada, 8–10 August 2023; Volume 37, pp. 7678–7686. [Google Scholar]
  14. Liu, Y.; Tao, K.; Tian, T.; Gao, X.; Han, J.; Shao, L. Transductive zero-shot learning with generative model-driven structure alignment. Pattern Recognit. 2024, 153, 110561. [Google Scholar] [CrossRef]
  15. Khan, F.B.; Khan, A.; Durad, M.H.; Khan, F.A.; Ali, A. ISAnWin: Inductive generalized zero-shot learning using deep CNN for malware detection across windows and android platforms. PeerJ Comput. Sci. 2024, 10, e2604. [Google Scholar]
  16. Chen, S.; Wang, W.; Xia, B.; Peng, Q.; You, X.; Zheng, F.; Shao, L. Free: Feature refinement for generalized zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 122–131. [Google Scholar]
  17. Wu, J.; Zhang, T.; Zha, Z.J.; Luo, J.; Zhang, Y.; Wu, F. Prototype-augmented self-supervised generative network for generalized zero-shot learning. IEEE Trans. Image Process. 2024, 33, 1938–1951. [Google Scholar] [CrossRef]
  18. Michele, B.; Boulch, A.; Puy, G.; Bucher, M.; Marlet, R. Generative zero-shot learning for semantic segmentation of 3d point clouds. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; pp. 992–1002. [Google Scholar]
  19. Yang, Y.; Hayat, M.; Jin, Z.; Zhu, H.; Lei, Y. Zero-shot point cloud segmentation by semantic-visual aware synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 11586–11596. [Google Scholar]
  20. Li, Y.; Swersky, K.; Zemel, R. Generative moment matching networks. In Proceedings of the International Conference on Machine Learning. PMLR, Lille, France, 7–9 July 2015; pp. 1718–1727. [Google Scholar]
  21. Odena, A.; Olah, C.; Shlens, J. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the International Conference on Machine Learning. PMLR, Sydney, Australia, 6–11 August 2017; pp. 2642–2651. [Google Scholar]
  22. Wei, Z.; Chen, P.; Yu, X.; Li, G.; Jiao, J.; Han, Z. Semantic-aware SAM for point-prompted instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3585–3594. [Google Scholar]
  23. Zheng, X.; Lyu, Y.; Wang, L. Learning modality-agnostic representation for semantic segmentation from any modalities. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 146–165. [Google Scholar]
  24. Li, H.; Zhang, D.; Dai, Y.; Liu, N.; Cheng, L.; Li, J.; Wang, J.; Han, J. Gp-nerf: Generalized perception nerf for context-aware 3d scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 21708–21718. [Google Scholar]
  25. Zhai, X.; Han, W.; Li, X.; Huang, S. PLGCA: A Progressive Local-Global Context-Aware Semantic Segmentation Network for Crop Remote Sensing Mapping. In Proceedings of the 2024 6th International Conference on Electronics and Communication, Network and Computer Technology (ECNCT), Guangzhou, China, 19–21 July 2024; pp. 491–495. [Google Scholar]
  26. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  27. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 5105–5114. [Google Scholar]
  28. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
  29. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11108–11117. [Google Scholar]
  30. Zhou, Z.; Lei, Y.; Zhang, B.; Liu, L.; Liu, Y. Zegclip: Towards adapting clip for zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11175–11185. [Google Scholar]
  31. Tian, J.; Aggarwal, L.; Colaco, A.; Kira, Z.; Gonzalez-Franco, M. Diffuse attend and segment: Unsupervised zero-shot segmentation using stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3554–3563. [Google Scholar]
  32. Cheraghian, A.; Rahman, S.; Petersson, L. Zero-shot learning of 3d point cloud objects. In Proceedings of the 2019 16th International Conference on Machine Vision Applications (MVA), Tokyo, Japan, 27–31 May 2019; pp. 1–6. [Google Scholar]
  33. Cheraghian, A.; Rahman, S.; Campbell, D.; Petersson, L. Mitigating the hubness problem for zero-shot learning of 3d objects. arXiv 2019, arXiv:1907.06371. [Google Scholar]
  34. Cheraghian, A.; Rahman, S.; Campbell, D.; Petersson, L. Transductive zero-shot learning for 3d point cloud classification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 923–933. [Google Scholar]
  35. Chao, W.L.; Changpinyo, S.; Gong, B.; Sha, F. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 52–68. [Google Scholar]
  36. Shi, Y.; Jiang, C.; Song, F.; Ye, Q.; Long, Y.; Zhang, H. Multi-domain feature-enhanced attribute updater for generalized zero-shot learning. Neural Comput. Appl. 2025, 37, 8397–8414. [Google Scholar] [CrossRef]
  37. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  38. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26, 3111–3119. [Google Scholar]
  39. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5828–5839. [Google Scholar]
  40. Armeni, I.; Sener, O.; Zamir, A.R.; Jiang, H.; Brilakis, I.; Fischer, M.; Savarese, S. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1534–1543. [Google Scholar]
  41. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9297–9307. [Google Scholar]
  42. Frome, A.; Corrado, G.S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.; Mikolov, T. Devise: A deep visual-semantic embedding model. Adv. Neural Inf. Process. Syst. 2013, 26, 2121–2129. [Google Scholar]
  43. Boulch, A.; Puy, G.; Marlet, R. FKAConv: Feature-kernel alignment for point cloud convolution. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November– 4 December 2020. [Google Scholar]
  44. Boulch, A. ConvPoint: Continuous convolutions for point cloud processing. Comput. Graph. 2020, 88, 24–34. [Google Scholar] [CrossRef]
  45. Malinin, A.; Gales, M. Predictive uncertainty estimation via prior networks. Adv. Neural Inf. Process. Syst. 2018, 31, 7047–7058. [Google Scholar]
  46. Liu, W.; Wang, X.; Owens, J.; Li, Y. Energy-based out-of-distribution detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21464–21475. [Google Scholar]
  47. Seo, S.; Seo, P.H.; Han, B. Learning for single-shot confidence calibration in deep neural networks through stochastic inferences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9030–9038. [Google Scholar]
  48. Tang, Y.; Lin, Z.; Wang, Q.; Zhu, P.; Hu, Q. Amu-tuning: Effective logit bias for clip-based few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 23323–23333. [Google Scholar]
Figure 1. We propose a visual–semantic fusion branch that effectively aligns and fuses the features of the two modalities to obtain auxiliary predictions that are not biased toward seen classes. Additionally, we introduce an adaptive bias calibration module, which dynamically combines the predictions from the main and auxiliary branches.
Figure 2. The framework of the semantic-guided dual-branch bias calibration network contains three components: a primary segmentation branch (main branch), a visual–semantic fusion branch (auxiliary branch), and an adaptive bias calibration module. The encoder, generator, segmentor, and visual–semantic fusion branch undergo sequential optimization during the training process. Only the encoder is used to process the point cloud for features without using the generator during inference.
Figure 3. The computational process of the visual–semantic fusion module. ⊕ denotes element-wise addition. ⊗ denotes the dot product.
Figure 4. Qualitative comparison with SV-Seg [19] in the inductive generalized zero-shot setting. The areas within the boxes emphasize the efficacy of our approach.
Figure 5. Effect of hyperparameters: confidence weight β on S3DIS (a) and branch balance parameter λ on ScanNet (b); β = 1.2 and λ = 0.7 show the best results.
Table 1. Experimental results on three benchmark datasets. The first part reports the upper bounds obtained under various supervision levels, while the second part reports the performance of the baselines and our method on generalized 3D zero-shot semantic segmentation. Here, $C^S$ denotes the seen classes, $C^U$ the unseen classes, and $\hat{C}^U$ the synthesized unseen data.
| Method | Encoder / Segmentor Training Set | ScanNet mIoU (C^S / C^U / All) | ScanNet HmIoU | S3DIS mIoU (C^S / C^U / All) | S3DIS HmIoU | SemanticKITTI mIoU (C^S / C^U / All) | SemanticKITTI HmIoU |
|---|---|---|---|---|---|---|---|
| Full supervision | C^S, C^U / C^S, C^U | 43.3 / 51.9 / 45.1 | 47.2 | 74.0 / 50.0 / 66.6 | 59.6 | 59.4 / 50.3 / 57.5 | 54.5 |
| Unseen classes only for segmentor | C^S / C^S, C^U | 41.5 / 39.2 / 40.3 | 40.3 | 60.9 / 21.5 / 48.7 | 31.8 | 52.9 / 13.2 / 42.3 | 21.2 |
| Supervision with seen classes | C^S / C^S | 39.2 / 0.0 / 31.3 | 0.0 | 70.2 / 0.0 / 48.6 | 0.0 | 55.8 / 0.0 / 44.0 | 0.0 |
| ZSLPC-Seg [32] | C^S / C^U | 16.4 / 4.2 / 13.9 | 6.7 | 5.2 / 1.3 / 4.0 | 2.1 | 26.4 / 10.2 / 21.8 | 14.7 |
| DeViSe-3DSeg [42] | C^S / C^U | 12.8 / 3.0 / 10.9 | 4.8 | 3.6 / 1.4 / 3.0 | 2.0 | 42.9 / 4.2 / 27.6 | 7.5 |
| 3DGenZ [18] | C^S / C^S, Ĉ^U | 32.8 / 7.7 / 27.8 | 12.5 | 53.1 / 7.3 / 39.0 | 12.9 | 41.4 / 10.8 / 35.0 | 17.1 |
| SV-Seg [19] | C^S / C^S, Ĉ^U | 34.5 / 14.3 / 30.4 | 20.2 | 58.9 / 9.7 / 43.8 | 16.7 | 46.4 / 12.8 / 39.4 | 20.1 |
| Ours | C^S / C^S, Ĉ^U | 34.9 / 14.9 / 30.9 | 20.9 | 68.1 / 10.4 / 50.3 | 18.0 | 49.5 / 15.2 / 42.3 | 23.3 |
Table 2. Effects of the VSFM and BCM on the ScanNet dataset.
| Method | VSFM | BCM | mIoU (C^S / C^U / All) | HmIoU |
|---|---|---|---|---|
| Fixed Calibration | ✗ | ✗ | 34.6 / 14.2 / 30.5 | 20.1 |
| w/o BCM | ✓ | ✗ | 34.9 / 13.9 / 30.8 | 19.9 |
| w/o VSFM | ✗ | ✓ | 32.4 / 14.0 / 28.7 | 19.6 |
| Ours | ✓ | ✓ | 34.9 / 14.9 / 30.9 | 20.9 |
✗ symbolizes the absence of the corresponding module, ✓ symbolizes the presence of the corresponding module.
Table 3. Ablation experiments of the VSFM on the ScanNet dataset, including Channel and Spatial Attention.
| Channel | Spatial | mIoU (C^S / C^U / All) | HmIoU |
|---|---|---|---|
| ✗ | ✗ | 32.8 / 13.9 / 29.0 | 19.5 |
| ✓ | ✗ | 32.0 / 13.8 / 28.3 | 19.3 |
| ✗ | ✓ | 31.8 / 14.4 / 28.4 | 19.9 |
| ✓ | ✓ | 34.9 / 14.9 / 30.9 | 20.9 |
✗ symbolizes the absence of the corresponding attention mechanism, ✓ symbolizes its presence.
Table 4. The influence of common uncertainty algorithms on segmentation performance.
| Method | ScanNet mIoU (C^S / C^U / All) | ScanNet HmIoU | S3DIS mIoU (C^S / C^U / All) | S3DIS HmIoU |
|---|---|---|---|---|
| Baseline | 34.5 / 14.3 / 30.4 | 20.2 | 58.9 / 9.7 / 43.8 | 16.7 |
| Max | 34.8 / 13.3 / 30.5 | 19.3 | 67.7 / 9.8 / 49.9 | 17.1 |
| Entropy | 34.4 / 14.1 / 30.3 | 20.0 | 68.1 / 9.3 / 50.0 | 16.4 |
| Energy | 34.6 / 14.1 / 30.5 | 20.0 | 67.3 / 9.5 / 49.5 | 16.7 |
| Variance | 34.8 / 13.8 / 30.6 | 19.8 | 66.9 / 9.8 / 49.3 | 17.1 |
| Kurtosis | 34.9 / 14.9 / 30.9 | 20.9 | 68.1 / 10.4 / 50.3 | 18.0 |