Article

Cosine Prompt-Based Class Incremental Semantic Segmentation for Point Clouds

Lei Guo, Hongye Li, Min Pang, Kaowei Liu, Xie Han and Fengguang Xiong

1 Shanxi Key Laboratory of Machine Vision and Virtual Reality, North University of China, Taiyuan 030051, China
2 School of Computer Science and Technology, North University of China, Taiyuan 030051, China
3 Luzhou North Chemical Industries Co., Ltd., Luzhou 646003, China
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(10), 648; https://doi.org/10.3390/a18100648
Submission received: 13 August 2025 / Revised: 9 October 2025 / Accepted: 14 October 2025 / Published: 16 October 2025
(This article belongs to the Section Randomized, Online, and Approximation Algorithms)

Abstract

Although current 3D semantic segmentation methods have achieved significant success, they suffer from catastrophic forgetting when confronted with dynamic, open environments. To address this issue, class incremental learning is introduced to update models while maintaining a balance between plasticity and stability. In this work, we propose CosPrompt, a rehearsal-free approach for class incremental semantic segmentation. Specifically, we freeze the prompts for existing classes and incrementally expand and fine-tune the prompts for new classes, thereby generating discriminative and customized features. We employ clamping operations to regulate backward propagation, ensuring smooth training. Furthermore, we utilize the learning without forgetting loss and pseudo-label generation to further mitigate catastrophic forgetting. We conduct comparative and ablation experiments on the S3DIS dataset and ScanNet v2 dataset, demonstrating the effectiveness and feasibility of our method.

1. Introduction

Three-dimensional semantic segmentation aims to assign a semantic label to each point to facilitate scene understanding. This technique has extensive applications in domains such as autonomous driving, robotic grasping, and virtual reality [1]. Traditional methods typically require samples from all classes during training. However, in practical scenarios, there is an ongoing need to learn new classes, while storing data for old classes requires a massive amount of space. Furthermore, old class data may become unavailable due to privacy concerns or other reasons. Directly training on new classes causes backpropagation-based algorithms to update weights with the new data, leading to catastrophic forgetting of old classes. Consequently, research is needed to mitigate performance degradation in the model’s predictions for old classes during the learning of new knowledge. To address this challenge, 3D class incremental semantic segmentation (3D-CISS) has been proposed.
Incremental learning has been extensively studied in 2D image classification, primarily encompassing replay-based methods, distillation-loss-based methods, regularization-based methods, and parameter-isolation-based methods [2,3]. Replay-based methods first mine or generate representative samples for each class to preserve model performance and have shown promising results [4,5,6]. However, mining old samples requires access to old data, and generating representative samples for past data may struggle in complex scenarios, which to some extent limits the applicability of such methods. Distillation-based methods leverage distillation losses to align the outputs and feature representations of the new model with those of the previous model, thereby facilitating stable knowledge transfer [7,8,9]. Regularization-based methods enhance model stability by constraining changes in important parameters [10,11]. Parameter-isolation-based methods aim to balance old and new knowledge by freezing the backbone network and fine-tuning a subset of parameters [12,13,14]. Among these methods, distillation-based, regularization-based, and parameter-isolation-based approaches represent the current mainstream directions, as they store knowledge within the model itself. Owing to their architectural advantages, prompt-based models, a representative of parameter-isolation methods, have achieved notable success in incremental learning for 2D image tasks. These methods typically freeze the backbone and adapt to new data by fine-tuning a small set of parameters, such as the prompts and the segmentation head. Compared to other approaches, this category of methods enhances the stability of feature representations by freezing the backbone network, without requiring the involvement of old samples.
Research on incremental learning has recently begun to extend to the 3D domain. Incremental learning for 3D point cloud classification has been explored in [15,16,17,18,19]. Specifically, references [15,16] describe replay-based methods that achieved good results by mining old exemplars. However, compared to image data, point cloud data require significantly larger storage space, and in complex scenes a few exemplars are often insufficient to represent an entire class, imposing greater limitations. In the domain of 3D semantic segmentation, incremental learning has been investigated in [20,21,22], primarily employing distillation and pseudo-label generation to preserve memory of old knowledge. To facilitate more convenient application, this study therefore focuses on rehearsal-free methods.
Inspired by prior work, we propose a prompt-based class incremental semantic segmentation method in the 3D domain. We are the first to apply prompt expansion and freezing to 3D-CISS. A cosine prompt module is introduced to achieve fast and stable learning. During the incremental learning phase, we fine-tune only the prompts of new classes, the segmentation head, and the high-level backbone layers, significantly reducing computational overhead. During incremental training, we use clamped backward propagation to reduce the interference of outliers. To further mitigate catastrophic forgetting, we also incorporate pseudo-label generation and a learning without forgetting (LwF) loss. In pseudo-label generation, low confidence thresholds are assigned to minority classes, promoting class balance and better optimizing the final output. The LwF loss constrains the new model's outputs for old classes. The primary contribution of our work is a dynamic prompt mechanism that generates customized features by computing affinities across all existing class prompts. This mechanism is integrated within a comprehensive framework that orchestrates a frozen low-level backbone, knowledge distillation, and refined pseudo-labels. The synergy of these components delivers a more effective balance between stability and plasticity than any individual technique can achieve alone.
Specifically, the contributions are threefold.
  • We present a cosine prompt-based class incremental learning approach for 3D semantic segmentation, achieving a balance between old and new knowledge through prompt expansion, pseudo-label generation, and the LwF loss, thereby forming an end-to-end rehearsal-free framework.
  • To accommodate new feature representations, we design a cosine prompt module. This module incorporates learnable prompts dedicated to new classes into a shared prompt pool while freezing the prompts associated with old classes, facilitating stable and discriminative feature learning.
  • Extensive comparative experiments against other methods on S3DIS and ScanNet v2 datasets demonstrate the superior performance of our proposed approach. Furthermore, we conduct in-depth ablation studies to evaluate each component.

2. Related Work

2.1. Class Incremental Learning

Class incremental learning is a technique that enables models to continuously learn from new data while mitigating catastrophic forgetting. Current class incremental learning approaches fall into four types: replay-based methods, distillation-loss-based methods, regularization-based methods, and parameter-isolation-based methods. Replay-based methods preserve prior knowledge by selecting or generating representative old exemplars and training the model jointly on these old exemplars alongside new ones [4,5,6]. Distillation-loss-based and regularization-based methods constrain changes to critical parameters and features caused by new knowledge through the introduction of specific loss functions [7,8,9,10,11]. Parameter-isolation-based methods learn new knowledge by introducing new parameters while keeping existing parameters fixed [12,13,14]. Due to factors such as privacy concerns, acquiring old exemplars is often challenging, and generated old exemplars struggle to adapt to complex scenarios, limiting the applicability of replay-based methods. The other three categories can be considered rehearsal-free methods, which effectively reduce catastrophic forgetting by restricting changes to parameters and network outputs. Therefore, this work combines parameter-isolation-based and distillation-loss-based methods to strike a balance between the plasticity and stability of the neural network.

2.2. Class Incremental Semantic Segmentation

Recent years have witnessed the extension of incremental learning to semantic segmentation. A key distinction from incremental classification lies in the semantic drift of the background class. For new data, the background class may contain a mixture of old classes. Directly treating these old classes as background during training severely degrades the predictive performance on them. MiB, the pioneering work in class incremental semantic segmentation, employed knowledge distillation to retain knowledge of old classes [23]. PLOP mitigated catastrophic forgetting using a pseudo-labeling strategy [24]. SPPA introduced a feature alignment loss to constrain the feature representation of the models for old classes [25]. Alternatively, freezing partial parameters represents another prominent approach. RCIL [26] and EWF [27] proposed dual-branch architectures, comprising a frozen branch and a trainable branch, to balance plasticity and stability. These methods require fewer trainable parameters, offering valuable insights for related research.

2.3. Class Incremental Learning on Point Cloud

Recent research has extended class incremental learning to 3D point clouds, covering both classification and segmentation tasks. In classification, I3DOL mitigates catastrophic forgetting via geometric-aware attention mechanisms and representative samples [15]. RCR enhances feature stability by reconstructing and mining old point cloud exemplars and retaining discriminative features [16]. Meanwhile, ReFu updates regularized autocorrelation matrices to continuously consolidate knowledge while adapting to new classes [18]. Three-dimensional point cloud segmentation is conceptually equivalent to per-point classification. Yang et al. pioneered 3D-CISS using geometry-aware distillation and pseudo-labeling schemes to maintain historical knowledge [20]. OpenDistill3D further reduced forgetting through continual self-distillation and high-fidelity pseudo-labels [21], while Su et al. jointly protected prior knowledge and acquired new concepts via residual distillation learning and balanced pseudo-label training [22]. Collectively, knowledge distillation and pseudo-label generation are established as core techniques in point cloud CIL, effectively alleviating catastrophic forgetting. Nevertheless, these methods suffer from two primary drawbacks. First, their strategy of fine-tuning the entire backbone during incremental learning leads to substantial computational overhead and low training efficiency. Second, in pseudo-labeling, minority classes often suffer from low-quality pseudo-labels due to sparse samples and imbalanced data distributions. To overcome these limitations, we propose a dual-strategy solution. First, we freeze the low-level backbone to enhance feature stability and introduce an adaptive prompt feature learning mechanism that fully leverages the model's inherent knowledge. Second, our work introduces class-aware confidence thresholds, where higher thresholds for majority classes ensure reliable supervision while lower thresholds for minority classes improve their representation and performance.

3. Methodology

3.1. Problem Formulation

The 3D class incremental semantic segmentation (3D-CISS) training framework comprises two core phases: base learning and incremental learning. The learning process is as follows (see the sketch after this list):
  • Base learning phase: Train the feature extractor $M_{\text{base}}$ and classifier $Cls_{\text{base}}$ on the base dataset $D_{\text{base}}$.
  • Incremental learning phase (step $i$): Initialize the feature extractor $M_i^{\text{novel}}$ and classifier $Cls_i^{\text{novel}}$; inherit feature extractor parameters from $M_{\text{base}}$ (if $i = 1$) or $M_{i-1}^{\text{novel}}$ (if $i > 1$); copy classifier parameters for existing classes from $Cls_{\text{base}}$ ($i = 1$) or $Cls_{i-1}^{\text{novel}}$ ($i > 1$), with new-class parameters randomly initialized. Perform forward propagation, compute the loss, and optimize.
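The initialization of an incremental step can be summarized in a short PyTorch sketch. This is a minimal illustration only: the helper name, the linear segmentation head, and the feature dimension are assumptions, not the paper's exact implementation.

```python
import copy

import torch
import torch.nn as nn

def init_incremental_step(prev_model, prev_head, n_old, n_new, feat_dim=384):
    """Initialize the step-i model from the previous step (illustrative sketch)."""
    # Inherit the feature extractor from M_base (i = 1) or M_{i-1}^novel (i > 1).
    model = copy.deepcopy(prev_model)
    # The new segmentation head covers old + new classes.
    head = nn.Linear(feat_dim, n_old + n_new)
    with torch.no_grad():
        # Copy weights for existing classes; new-class rows keep their random init.
        head.weight[:n_old].copy_(prev_head.weight)
        head.bias[:n_old].copy_(prev_head.bias)
    return model, head
```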

3.2. Framework Overview

To mitigate catastrophic forgetting in 3D class incremental semantic segmentation, we present a prompt-based framework that enables stable yet plastic learning, as shown in Figure 1. The proposed framework facilitates balanced learning of both new and old knowledge during incremental learning by incorporating learnable prompts for new classes while freezing the prompts corresponding to old classes. A pseudo-label generation strategy is introduced to further boost performance in rehearsal-free incremental learning scenarios. Additionally, the LwF loss is integrated to reduce logit-level forgetting. Implemented on a Point-BERT backbone with inserted prompt modules, the pipeline processes a point cloud sample $P \in D_i^{\text{novel}}$ with $N$ points as follows: after grouping and low-level encoding yield initial features $h$, these are fused with prompt features to generate adapted high-level features, which are finally decoded by a segmentation head into logits. Finally, the cross-entropy loss and the LwF loss are combined, enabling end-to-end learning over all learnable parameters and significantly mitigating catastrophic forgetting.

3.3. Cosine Prompt Learning

Prompt-based learning represents an emerging technique for incremental learning. In this work, we leverage prompt freezing and expansion to balance new and old knowledge. Specifically, stable knowledge updates are achieved by freezing the parameters of the low-level backbone network and the existing class-specific prompts while fine-tuning the prompts for new classes, the high-level feature extraction parameters, and the segmentation head. Depending on the input, class-aware prompt features are dynamically generated and prepended to the key/value features of the attention mechanism, yielding discriminative feature representations.
We define a set of prompts $p \in \mathbb{R}^{C \times L_p \times D}$, where $C$ denotes the number of previously and currently learned classes, $L_p$ represents the number of prompts per class, and $D$ indicates the feature embedding dimensionality. To enhance learning stability, we compute the cosine similarity $\beta_i$ between the feature $h$ and the $i$th key $K_i$ in the prompt pool:

$$\beta_i = \mathrm{CosSim}(h, K_i)$$

where $\mathrm{CosSim}(\cdot)$ is the cosine similarity function and $K_i \in \mathbb{R}^{1 \times D}$. Finally, the customized prompt $p^{\mathrm{Cust}}$ is generated as follows:

$$p^{\mathrm{Cust}} = \sum_{i=1}^{N} \beta_i p_i$$

where $p_i$ is the $i$th prompt.
For a Multi-head Self-Attention (MSA) layer embedded with prompts, the prompting function is formally defined as

$$f_{\mathrm{Pro\_MSA}}(h, p^{\mathrm{Cust}}) = \mathrm{MSA}(h_Q, [h_K; p_K^{\mathrm{Cust}}], [h_V; p_V^{\mathrm{Cust}}])$$

where $h_Q = h_K = h_V = h$; $p_K^{\mathrm{Cust}}$ denotes the first half of $p^{\mathrm{Cust}}$; $p_V^{\mathrm{Cust}}$ denotes the second half of $p^{\mathrm{Cust}}$; and $[\cdot\,;\cdot]$ indicates the concatenation operation. This process can be interpreted as an adaptive filtering mechanism that generates $p^{\mathrm{Cust}}$ conditioned on input characteristics, thereby yielding discriminative feature representations. Finally, multiple prompt features are embedded into the encoder layers, enabling rapid incremental learning via lightweight fine-tuning.
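The cosine prompt computation can be sketched in PyTorch as follows. The module name, tensor shapes, and the use of the mean token as the similarity query are illustrative assumptions; only the cosine-weighted prompt aggregation and the key/value split follow the formulation above.

```python
import torch
import torch.nn.functional as F

class CosinePrompt(torch.nn.Module):
    """Sketch of the cosine prompt module; shapes are assumptions."""
    def __init__(self, n_classes, L_p, dim):
        super().__init__()
        self.keys = torch.nn.Parameter(torch.randn(n_classes, dim))          # K_i
        self.prompts = torch.nn.Parameter(torch.randn(n_classes, L_p, dim))  # p_i

    def forward(self, h):
        # h: (B, N, D) token features; summarize with the mean token (an assumption).
        q = h.mean(dim=1)                                                    # (B, D)
        # beta_i = CosSim(h, K_i): affinity to every class key.
        beta = F.cosine_similarity(q.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)  # (B, C)
        # p_Cust = sum_i beta_i * p_i: affinity-weighted sum over all prompts.
        p_cust = torch.einsum('bc,cld->bld', beta, self.prompts)             # (B, L_p, D)
        # Split into key/value halves (assumes L_p is even).
        p_k, p_v = p_cust.chunk(2, dim=1)
        return p_k, p_v
```

Inside a prompted attention layer, the two halves would then be prepended along the token dimension, e.g., `k = torch.cat([p_k, h], dim=1)` and `v = torch.cat([p_v, h], dim=1)`, before attention is computed.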
During the training process, we compute the $L_2$ norm of the gradients:

$$\|\mathrm{gra}\|_2 = \sqrt{\sum_{i=1}^{m} \|g_i\|_2^2}$$

where $g_i$ represents the gradient of the $i$th layer and $\|\cdot\|_2$ denotes the $L_2$ norm. If the gradient norm exceeds a predetermined threshold, the gradients for the current batch are set to zero. This approach effectively prevents training failure by mitigating gradient explosion.
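A minimal sketch of this clamped backward step follows; the threshold value is an assumption, as the paper does not specify one here.

```python
import torch

def clamped_backward_step(model, loss, threshold=10.0):
    """Backward pass with the gradient clamp described above (a sketch)."""
    loss.backward()
    # Global L2 norm over per-layer gradients.
    total_sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_sq += p.grad.norm(2).item() ** 2
    if total_sq ** 0.5 > threshold:
        # Norm exceeds the threshold: zero this batch's gradients so the
        # optimizer step becomes a no-op for the batch.
        for p in model.parameters():
            if p.grad is not None:
                p.grad.zero_()
```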
As illustrated in Figure 2, prompts are expanded during incremental learning. In the base class learning phase, a fixed set of $C_{\text{base}} \times L_p$ prompts is initialized. During incremental learning phase $i$, existing prompts are frozen to stabilize representations of prior knowledge, while the $C_k^{\text{novel}} \times L_p$ newly added prompts are fine-tuned. This dual strategy of freezing old prompts and expanding new prompts enhances the model's representational capacity, thereby improving its adaptability to novel data while preserving stability for learned classes.
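One possible implementation of this expand-and-freeze step is sketched below; the gradient-hook freeze and the initialization scale are implementation assumptions.

```python
import torch

def expand_prompt_pool(prompts, n_new):
    """Expand a (C_old, L_p, D) prompt Parameter with n_new new-class prompts
    while freezing the old rows (a sketch)."""
    C_old, L_p, D = prompts.shape
    new_rows = 0.02 * torch.randn(n_new, L_p, D, device=prompts.device)
    expanded = torch.nn.Parameter(torch.cat([prompts.data, new_rows], dim=0))

    def zero_old_grad(grad):
        grad = grad.clone()
        grad[:C_old] = 0  # old-class prompts receive no updates
        return grad

    expanded.register_hook(zero_old_grad)
    return expanded
```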

3.4. Pseudo-Label Generation

During incremental phases, labels for base classes are unavailable, and points from those classes are incorrectly treated as background. This conflicts with the base model's training, causing knowledge interference that severely degrades performance on the old classes. To mitigate this, we leverage pseudo-label generation (PLG) by the previous model to guide the learning of the new model. Accounting for class imbalance, we assign class-specific thresholds: low thresholds for minority classes and high thresholds for majority classes. The pseudo-label generation is formalized as
$$
y_i^{\mathrm{Gen}} =
\begin{cases}
\arg\max_c \tilde{g}_i^c & y_i \in C_{\text{base}} \ \text{and} \ \max_c \tilde{g}_i^c \geq \mathrm{threshold}_j \\
\text{ignored} & y_i \in C_{\text{base}} \ \text{and} \ \max_c \tilde{g}_i^c < \mathrm{threshold}_j \\
y_i + |C_{\text{base}}| & y_i \in C_k^{\text{novel}}
\end{cases}
$$

where $\tilde{g}_i^c$ denotes the old model's output of the $i$th point for class $c$, $C_{\text{base}}$ represents the base class set, $\mathrm{threshold}_j$ is the filtering threshold for the $j$th class, $C_k^{\text{novel}}$ denotes the set of new classes at step $k$, and $|C_{\text{base}}|$ is the number of base classes. During pseudo-label generation, different thresholds are applied to old data to mitigate class imbalance effects. For new classes, a fixed offset of $|C_{\text{base}}|$ is added to ensure label-space separation between old and new classes. Ignored labels are assigned a value of −1 and are excluded from the gradient computation, thereby exerting no influence on the model update process. By generating pseudo-labels for old data and integrating them into training, the model further preserves previously acquired knowledge.
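This class-aware pseudo-labeling rule can be sketched as follows; the tensor layouts and label conventions are assumptions.

```python
import torch

def generate_pseudo_labels(old_probs, labels, thresholds, n_base, ignore_index=-1):
    """Per-point pseudo-labels following the rule above (a sketch).

    old_probs:  (N, C_base) old-model class probabilities per point.
    labels:     (N,) current annotations; new-class points hold ids in
                [0, C_novel), all other points hold ignore_index.
    thresholds: (C_base,) class-aware confidence thresholds
                (lower for minority classes).
    """
    conf, pred = old_probs.max(dim=-1)            # argmax over old classes
    keep = conf >= thresholds[pred]               # per-class threshold lookup
    pseudo = torch.where(keep, pred, torch.full_like(pred, ignore_index))
    is_new = labels != ignore_index               # points annotated with new classes
    pseudo[is_new] = labels[is_new] + n_base      # offset by |C_base|
    return pseudo
```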

3.5. LwF Loss

To further mitigate catastrophic forgetting, we integrate the LwF loss. This approach leverages knowledge distillation by comparing the outputs of the old and new models, rather than one-hot labels, transferring more prior knowledge through soft targets. LwF achieves robust performance in traditional incremental learning scenarios. The formal definition is

$$\mathcal{L}_{\mathrm{LwF}}(\hat{g}_i, \tilde{g}_i) = -\sum_{c=1}^{|C_{\text{base}}|} \tilde{g}_i^c \log \hat{g}_i^c$$

where $\hat{g}_i$ and $\tilde{g}_i$ represent the outputs of the new and old models, respectively, and $|C_{\text{base}}|$ denotes the number of base classes.
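A sketch of the LwF term, following the standard formulation of Li and Hoiem [7]; the temperature and the softmax normalization are assumptions, as the text above gives only the cross-entropy form.

```python
import torch.nn.functional as F

def lwf_loss(new_logits, old_logits, n_base, T=2.0):
    """LwF distillation over the base-class outputs (a sketch; T is assumed)."""
    soft_targets = F.softmax(old_logits[..., :n_base] / T, dim=-1)   # old model
    log_probs = F.log_softmax(new_logits[..., :n_base] / T, dim=-1)  # new model
    # Cross-entropy between soft targets and new-model predictions.
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```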

3.6. Training Process

The training procedure comprises two phases: base learning and incremental training. During base learning, optimization is performed using the cross-entropy loss:
$$\mathcal{L}_{\mathrm{base}} = \mathcal{L}_{\mathrm{CE}}(Cls_{\text{base}}(M_{\text{base}}(x)), y)$$

where $\mathcal{L}_{\mathrm{CE}}(\cdot)$ denotes the cross-entropy loss, $M_{\text{base}}$ represents the feature extraction model, and $Cls_{\text{base}}$ is the segmentation head for base classes. During the incremental training phase, we freeze low-level weights and old-class prompts while fine-tuning high-level weights, new-class prompts, and the segmentation head. The segmentation head is initialized with weights corresponding to previously learned classes. End-to-end optimization employs a composite loss function:

$$\mathcal{L}_{\mathrm{incre}} = \mathcal{L}_{\mathrm{CE}}(y^{\mathrm{gen}}, y) + \lambda \mathcal{L}_{\mathrm{LwF}}$$

where $y^{\mathrm{gen}}$ denotes the generated pseudo-labels and $\lambda$ is a balancing hyperparameter. We formally refer to our method as Cosine Prompt-based Class Incremental Learning, abbreviated as CosPrompt.
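Combining the two terms, a sketch of the composite objective follows, reusing the lwf_loss sketch above; λ = 1.0 is a placeholder, not the paper's tuned value.

```python
import torch.nn.functional as F

def incremental_loss(logits, old_logits, pseudo_labels, n_base, lam=1.0):
    """Composite objective L_incre = L_CE + lambda * L_LwF (a sketch)."""
    # ignore_index=-1 excludes ignored pseudo-labels from the gradient.
    ce = F.cross_entropy(logits, pseudo_labels, ignore_index=-1)
    return ce + lam * lwf_loss(logits, old_logits, n_base)
```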

4. Experiments

4.1. Experimental Setup

Experimental validation is conducted on the S3DIS dataset [28] and the ScanNet v2 dataset [29]. The S3DIS dataset comprises six distinct areas and encompasses thirteen semantic classes. Following the standard protocol, models are trained on Areas 1, 2, 3, 4, and 6, and evaluated on Area 5. To facilitate efficient processing, the S3DIS dataset is partitioned into 7547 blocks. The ScanNet v2 dataset is a large-scale indoor 3D scene understanding dataset comprising 1513 scans of 707 distinct indoor scenes. The dataset contains twenty-one classes: twenty annotated classes and one unannotated class. Following the official split, 1201 scenes are employed for training and 312 scenes are used for validation. Aligned with common practice in 2D class incremental segmentation, the model is first trained on the base classes. Subsequently, the base model is successively fed new data for incremental training. Quantitative evaluation employs the mean Intersection-over-Union (mIoU), defined as
$$mIoU = \frac{1}{C} \sum_{i=1}^{C} IoU_i$$

$$IoU_i = \frac{TP_i}{TP_i + FP_i + FN_i}$$

where $TP_i$, $FP_i$, and $FN_i$ denote the numbers of true positives, false positives, and false negatives for class $i$, respectively, $IoU_i$ is the Intersection-over-Union for class $i$, and $C$ represents the total number of classes.
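These definitions translate directly into code; a short NumPy sketch over flattened point-wise labels:

```python
import numpy as np

def mean_iou(pred, gt, n_classes):
    """Per-class IoU and mIoU, matching the definitions above.
    Classes absent from both pred and gt are skipped."""
    ious = []
    for c in range(n_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return float(np.mean(ious)), ious
```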
For these two datasets, the experimental setup for training is the same. During the base learning phase, training is conducted for 100 epochs with an initial learning rate of 0.00005 using the Adam optimizer, alongside a step learning rate scheduler with gamma set to 0.5. In the incremental learning phase, the model is trained for another 100 epochs with an initial learning rate of 0.0001, again using Adam and the same step scheduler with gamma 0.5. In the ablation studies, only the number of training epochs is adjusted, while all other settings, including the Adam optimizer, remain unchanged. We conduct all the ablation studies on the S3DIS dataset. All experiments are conducted on an NVIDIA A6000 GPU (NVIDIA, Santa Clara, CA, USA).

4.2. Comparison Experiments

In this subsection, we conduct comparison experiments against Fine-tuning (FT), LwF [7], 3DPC [20], BalDis [22], and Joint Training (JT). The FT method trains directly on new data. LwF is a distillation-loss-based incremental learning method. 3DPC is a representative method for 3D class incremental segmentation. BalDis employs a residual distillation learning strategy and a pseudo-label learning strategy to mitigate forgetting in 3D class incremental segmentation. The JT method trains on a mixture of old and new data, representing the upper bound of model performance. It is important to note that FT, LwF, JT, and our method share the same backbone network, namely Point-BERT.
Table 1 presents the comparison of mIoU results between the compared methods and our approach on the S3DIS dataset. We evaluate under three settings: $C_{\text{Novel}}$ = 5, 3, and 1. The Joint method achieves strong results across all classes. Compared to the Joint method, the FT method exhibits a significant performance degradation on old classes. Furthermore, the FT method demonstrates relatively good performance on new classes when contrasted with other incremental learning approaches. This occurs because the model focuses on learning the new classes during incremental learning. The weight updates via backpropagation partially degrade the representations learned for old data, leading to catastrophic forgetting. In contrast, LwF demonstrates better performance on old classes, exceeding FT by at least 3.51%. However, its performance on new classes is generally inferior to FT. This indicates that LwF constrains the model's plasticity by forcing its outputs for old classes to approximate those of the old model. Overall, our method surpasses 3DPC, particularly on old classes, yet remains slightly inferior to the current state-of-the-art approach, BalDis. This demonstrates the competitiveness of our method while also revealing its limitations. Freezing the majority of the model weights inherently constrains the model's representational capacity. However, the freezing operations significantly enhance incremental learning efficiency, as detailed in the ablation studies in Section 4.3. As part of our future work, we plan to investigate adaptive backbone fine-tuning strategies to strike a balance between training efficiency and segmentation performance. Under the $C_{\text{Novel}}$ = 1 setting (representing a challenging "clutter" class), our overall results are slightly weaker. This performance difference can be attributed to factors such as catastrophic forgetting and point density, as discussed in the "Segmentation Performance Analysis" part of the ablation studies in Section 4.3.
Table 2 presents the results of our method and the compared methods on the ScanNet v2 dataset. Evaluations are also conducted under $C_{\text{Novel}}$ = 5, 3, and 1. Unlike the observations on the S3DIS dataset, the FT approach performs poorly on the old classes. This can be attributed to the greater complexity of the ScanNet dataset, which causes the FT method to forget most of the old knowledge while learning new classes. In contrast, the LwF method achieves relatively better performance on the old classes, demonstrating the effectiveness of its distillation loss. Both 3DPC and BalDis exhibit strong results across both old and new classes, indicating that these comparative methods strike a favorable balance between plasticity and stability. In terms of all classes on the ScanNet v2 dataset, our method outperforms both 3DPC and the current state-of-the-art approach, BalDis, demonstrating its superior performance. Taken together, the results on both the S3DIS and ScanNet v2 datasets confirm the effectiveness of our proposed method.

4.3. Ablation Studies

Contribution of individual model components. To evaluate the contribution of each component, we ablate three modules sequentially: CosPrompt, pseudo-label generation, and the LwF loss. Experiments are conducted under the $C_{\text{Novel}}$ = 3 setting. According to Table 1 and Table 3, removing the CosPrompt, pseudo-label generation, and LwF loss modules results in mIoU decreases of 2.77%, 6.67%, and 0.87% on old classes, respectively. This demonstrates that pseudo-label generation contributes most significantly to the overall performance. During incremental learning, focusing solely on new classes adversely impacts the model's internal knowledge of old classes. Introducing pseudo-label generation effectively balances learning across both old and new classes. The CosPrompt module also provides a substantial contribution to performance improvement. CosPrompt strikes a balance between the stability and plasticity of prompts through freezing and extension operations. The LwF loss contributes moderately to performance gains by constraining the model's output distributions.
Contribution of the number of prompts per class. To validate the impact of the number of prompts per class, we conduct experiments under the setting $C_{\text{Novel}} = 3$, systematically comparing different values of $L_p$. As illustrated in Figure 3, the configurations with $L_p$ = 8 and 16 yield marginally better performance than $L_p$ = 2 and 4. This improvement can be attributed to the enhanced representational capacity provided by the increase in prompt diversity. Notably, the performance on base classes (1–10) significantly surpasses that of new classes (11–13). This disparity stems from two primary factors: (1) our method exhibits robust knowledge preservation capabilities for previously learned classes, and (2) the novel classes present inherently greater learning challenges. In conclusion, considering performance and model complexity, we adopt $L_p$ = 8 as the default configuration for the other experiments.
The effect of incremental training epochs. We perform the experiments shown in Figure 4 under the condition $C_{\text{Novel}} = 3$. We compare the performance of incremental models trained for 0, 10, 50, 100, and 200 epochs with FT, LwF, and our method. The incremental models at epoch = 0 are obtained by simply adding the prompts or the channels corresponding to the new classes, while the base model is the model trained on the old classes. For FT and LwF, the mIoU at epoch = 0 differs slightly from that of our method, which is attributed to their lack of prompts and slight structural differences. By comparing the base models with those at epoch = 0, we observe that the mIoU for Classes 1–10 remains largely unchanged, indicating that the incorporation of the new module has minimal impact on the old knowledge. With more training epochs, the mIoU of Classes 1–10 for FT and LwF decreases, accompanied by a corresponding increase in the mIoU of Classes 11–13. In comparison, the FT method exhibits a faster performance improvement on the new classes, whereas LwF shows a slower increase. This is because FT uses only the labels of the new classes during incremental learning, leading to rapid performance gains on the new classes. In contrast, LwF constrains the model output to some extent, thereby limiting the acquisition of new knowledge. Moreover, the FT method does not impose constraints on the old knowledge, resulting in generally inferior performance on Classes 1–10. As for our method, after training for 100 epochs, the model converges, achieving an mIoU exceeding 47% for the old classes, over 36% for the new classes, and an overall mIoU exceeding 44%. These experimental results further validate the effectiveness of our approach.
Evolution of Features and Predictions. To better qualitatively evaluate our method, we provide t-SNE visualizations of the features and the prediction results. Figure 5a,b shows the t-SNE visualizations of the base and incremental models, respectively. Figure 6a–d presents the prediction results and ground truths of the base and incremental models, respectively.
Both visualizations are conducted under the setting $C_{\text{Novel}} = 3$, meaning that the novel classes are bookcase, board, and clutter, while the remaining classes are base classes. Different from image classification, during t-SNE visualization we aggregate multiple samples and visualize them at the point level. For the prediction results, we concatenate the predictions of all point clouds in a single room. According to Figure 5a,b, in the feature space, samples from the same class form compact clusters, with relatively clear boundaries separating the majority of classes. This demonstrates that our method effectively preserves the separation of old classes, which forms the basis for mitigating catastrophic forgetting. According to Figure 6, by comparing the predictions with the ground truths, we find that the predictions are generally accurate. According to Figure 6a,b, the base model can correctly predict some of the points as the hard classes, column and door. However, as shown in Figure 6c,d, the incremental model mispredicts these parts as bookcase and clutter. This indicates that the model faces challenges and exhibits forgetting when predicting a few difficult samples. In the future, we aim to improve its capability in handling such samples.
Efficiency Analysis. For performance comparison, we primarily compare our method with the representative approach 3DPC, as summarized in Table 4. Since both FT and LWF methods employ the same backbone as ours in Table 1 and exhibit similar performance across multiple metrics, comparisons with them are not performed. As shown in Table 4, in terms of total parameters, trainable parameters, FLOPs, and inference time, 3DPC achieves 0.41 M, 0.41 M, 2.61 GFLOPs, and 2.29 ms, respectively, while our method obtains 21.9 M, 4.10 M, 9.83 GFLOPs, and 6.42 ms, respectively. The trainable parameters of our method constitute approximately one-fifth of the total parameters, yet remain higher than those of 3DPC. In the aforementioned aspects, our approach still falls short of 3DPC, which can be attributed to the more complex architecture of our Point-BERT backbone. In terms of GPU memory usage and incremental training time, our method requires only 10.09 GB and 2.00 h, respectively, significantly outperforming 3DPC, which requires 14.87 GB and 5.52 h. The short incremental learning time is a key advantage of our method, which is vital for scalable incremental learning. Although our model employs a greater number of trainable parameters than 3DPC, we freeze the lower-level backbone layers. This design confines the backward pass to the unfrozen top modules, thereby shortening the gradient propagation path. In contrast, 3DPC necessitates a sequential backward pass through the entire network, leading to substantially higher memory overhead and longer training time. This result further validates the efficiency of our method from a computational perspective.
Segmentation Performance Analysis. For an in-depth analysis of the performance of each class, the density, IoU, and confusion matrix are shown in Table 5 and Figure 7. According to Table 5, we observe that the IoUs for window, door, table, and chair are relatively high, while those for beam and sofa are comparatively low. Based on Figure 7, we find that the segmentation model tends to misclassify other classes as wall and clutter, which are classes with higher point density or new classes, leading to their correspondingly lower IoU. On one hand, performance is strongly correlated with density, which is influenced by material and shape size—i.e., lower density generally corresponds to poorer performance. Essentially, this is attributed to the insufficient emphasis on sparse point clouds during the sampling process, thereby affecting segmentation performance. On the other hand, this confirms that the method still suffers from a degree of catastrophic forgetting, where some samples of old classes are incorrectly predicted as new classes. This confirms that merely assigning lower thresholds to minority classes is insufficient to address class imbalance, and our method exhibits certain limitations in handling point cloud sparsity. For future work, greater attention should be devoted to exploiting spatial information during sampling and achieving a better balance between stability and plasticity. Furthermore, strategies to mitigate class imbalance, such as class-balanced sampling or loss re-weighting, will be explored.

5. Conclusions

In this work, we propose a cosine-prompt-based method to mitigate catastrophic forgetting in 3D class incremental semantic segmentation. Our approach integrates CosPrompts into the model architecture, which preserves previously acquired knowledge while adapting to new classes. By freezing existing prompts and selectively expanding new ones during incremental learning phases, our method successfully balances plasticity and stability, thereby reducing performance degradation on old classes without compromising the learning of new ones. To further enhance model stability, our framework incorporates distillation loss and pseudo-label generation techniques, which constrain the optimization process and mitigate drift in feature representation. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed method.
In future work, we plan to extend the evaluation of our approach to large-scale outdoor scene datasets, such as SemanticKITTI and nuScenes, to assess its generalization capability in more complex and dynamic environments. Additionally, we will pursue adaptive backbone fine-tuning to balance representation learning and efficiency while also developing spatial-aware sampling strategies and incorporating class-imbalance mitigation techniques, such as loss re-weighting, to collectively improve the stability–plasticity trade-off.

Author Contributions

Methodology, L.G., F.X. and H.L.; software, L.G., H.L. and M.P.; validation, K.L. and X.H.; writing—original draft, L.G. and F.X.; writing—review and editing, F.X. and X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Shanxi Province (No. 202203021212138), the National Natural Science Foundation of China (No. 62272426), and the Foundation of Shanxi Key Laboratory of Machine Vision and Virtual Reality (No. 447-110103).

Data Availability Statement

The datasets are available at the following links: https://stanford.redivis.com/datasets/9q3m-9w5pa1a2h/ accessed on 1 June 2016 and http://www.scan-net.org/ accessed on 1 July 2017.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results. Author Hongye Li is employed by the company Luzhou North Chemical Industries Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Sarker, S.; Sarker, P.; Stone, G.; Gorman, R.; Tavakkoli, A.; Bebis, G.; Sattarvand, J. A comprehensive overview of deep learning techniques for 3D point cloud classification and semantic segmentation. Mach. Vis. Appl. 2024, 35, 67. [Google Scholar] [CrossRef]
  2. Wang, L.; Zhang, X.; Su, H.; Zhu, J. A comprehensive survey of continual learning: Theory, method and application. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5362–5383. [Google Scholar] [CrossRef] [PubMed]
  3. Zhou, D.W.; Wang, Q.W.; Qi, Z.H.; Ye, H.J.; Zhan, D.C.; Liu, Z. Class-incremental learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9851–9873. [Google Scholar] [CrossRef] [PubMed]
  4. Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2001–2010. [Google Scholar]
  5. Douillard, A.; Cord, M.; Ollion, C.; Robert, T.; Valle, E. Podnet: Pooled outputs distillation for small-tasks incremental learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 86–102. [Google Scholar]
  6. Gao, R.; Liu, W. Ddgr: Continual learning with deep diffusion-based generative replay. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; PMLR: Cambridge, MA, USA, 2023; pp. 10744–10763. [Google Scholar]
  7. Li, Z.; Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947. [Google Scholar] [CrossRef] [PubMed]
  8. Lee, K.; Lee, K.; Shin, J.; Lee, H. Overcoming catastrophic forgetting with unlabeled data in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 312–321. [Google Scholar]
  9. Gao, Z.; Han, S.; Zhang, X.; Xu, K.; Zhou, D.; Mao, X.; Dou, Y.; Wang, H. Maintaining fairness in logit-based knowledge distillation for class-incremental learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 16763–16771. [Google Scholar]
  10. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef] [PubMed]
  11. Benzing, F. Unifying importance based regularisation methods for continual learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 28–30 March 2022; PMLR: Cambridge, MA, USA, 2022; pp. 2372–2396. [Google Scholar]
  12. Yan, S.; Xie, J.; He, X. Der: Dynamically expandable representation for class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3014–3023. [Google Scholar]
  13. Douillard, A.; Ramé, A.; Couairon, G.; Cord, M. Dytox: Transformers for continual learning with dynamic token expansion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9285–9295. [Google Scholar]
  14. Wang, Z.; Zhang, Z.; Ebrahimi, S.; Sun, R.; Zhang, H.; Lee, C.Y.; Ren, X.; Su, G.; Perot, V.; Dy, J.; et al. DualPrompt: Complementary prompting for rehearsal-free continual learning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 631–648. [Google Scholar]
  15. Dong, J.; Cong, Y.; Sun, G.; Ma, B.; Wang, L. I3DOL: Incremental 3D object learning without catastrophic forgetting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 6066–6074. [Google Scholar]
  16. Zamorski, M.; Stypułkowski, M.; Karanowski, K.; Trzciński, T.; Zięba, M. Continual learning on 3d point clouds with random compressed rehearsal. Comput. Vis. Image Underst. 2023, 228, 103621. [Google Scholar] [CrossRef]
  17. Liu, Y.; Cong, Y.; Sun, G.; Zhang, T.; Dong, J.; Liu, H. L3DOC: Lifelong 3D object classification. IEEE Trans. Image Process. 2021, 30, 7486–7498. [Google Scholar] [CrossRef] [PubMed]
  18. Yang, Y.; Zhong, L.; Zhuang, H. ReFu: Recursive Fusion for Exemplar-Free 3D Class-Incremental Learning. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 3396–3405. [Google Scholar]
  19. Chowdhury, T.; Cheraghian, A.; Ramasinghe, S.; Ahmadi, S.; Saberi, M.; Rahman, S. Few-shot class-incremental learning for 3d point cloud objects. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 204–220. [Google Scholar]
  20. Yang, Y.; Hayat, M.; Jin, Z.; Ren, C.; Lei, Y. Geometry and uncertainty-aware 3d point cloud class-incremental semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21759–21768. [Google Scholar]
  21. Boudjoghra, M.E.A.; Lahoud, J.; Cholakkal, H.; Anwer, R.M.; Khan, S.; Khan, F.S. Continual Learning and Unknown Object Discovery in 3D Scenes via Self-distillation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 416–431. [Google Scholar]
  22. Su, Y.; Chen, S.; Wang, Y.G. Balanced residual distillation learning for 3D point cloud class-incremental semantic segmentation. Expert Syst. Appl. 2025, 269, 126399. [Google Scholar] [CrossRef]
  23. Cermelli, F.; Mancini, M.; Bulo, S.R.; Ricci, E.; Caputo, B. Modeling the background for incremental learning in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9233–9242. [Google Scholar]
  24. Douillard, A.; Chen, Y.; Dapogny, A.; Cord, M. Plop: Learning without forgetting for continual semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4040–4050. [Google Scholar]
  25. Lin, Z.; Wang, Z.; Zhang, Y. Continual semantic segmentation via structure preserving and projected feature alignment. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 345–361. [Google Scholar]
  26. Zhang, C.B.; Xiao, J.W.; Liu, X.; Chen, Y.C.; Cheng, M.M. Representation compensation networks for continual semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7053–7064. [Google Scholar]
  27. Xiao, J.W.; Zhang, C.B.; Feng, J.; Liu, X.; van de Weijer, J.; Cheng, M.M. Endpoints weight fusion for class incremental semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7204–7213. [Google Scholar]
  28. Armeni, I.; Sener, O.; Zamir, A.R.; Jiang, H.; Brilakis, I.; Fischer, M.; Savarese, S. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1534–1543. [Google Scholar]
  29. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
Figure 1. The framework of our approach, CosPrompt. This method mitigates catastrophic forgetting from both the architecture and logits of models by leveraging CosPrompt, pseudo-label generation, and LWF loss. The gray boxes represent the old model, while the light blue boxes represent the new model.
Figure 2. Architecture of the proposed cosine prompt. Our work achieves a balance between plasticity and stability through end-to-end learning, prompt expansion, and freezing. By leveraging the backward clamp operations, we enhance training stability.
Figure 3. Impact of the number of prompts on the S3DIS dataset. The experiment was conducted under the setting $C_{\text{Novel}} = 3$.
Figure 4. The effect of incremental training epochs. The experiment was conducted under the setting $C_{\text{Novel}} = 3$. (a) FT. (b) LwF. (c) CosPrompt.
Figure 5. t-SNE visualization of features for the base and incremental models on the S3DIS dataset. (a) Feature visualization for the base model. (b) Feature visualization for the incremental model.
Figure 6. Predictions of the base and incremental models on the S3DIS dataset. (a) Predictions of the base model. (b) Ground truths of Figure 6a. (c) Predictions of the incremental model. (d) Ground truths of Figure 6c.
Figure 7. Confusion matrix. The experiment was conducted under the setting $C_{\text{Novel}} = 3$.
Table 1. Comparison of mIoU (%) results on the S3DIS dataset. We compare our method with the FT, LwF, 3DPC, BalDis, and JT methods.
Methods      |      C_Novel = 5      |      C_Novel = 3      |      C_Novel = 1
             |  1–8    9–13   All    |  1–10   11–13  All    |  1–12   13     All
FT           |  36.06  39.81  37.50  |  41.21  36.69  40.17  |  41.04  25.36  39.84
LwF [7]      |  50.88  20.01  39.01  |  46.04  24.13  40.99  |  44.55  24.37  42.99
3DPC [20]    |  48.94  39.56  45.33  |  45.15  45.33  45.19  |  44.08  35.69  43.43
BalDis [22]  |  50.68  40.62  46.81  |  49.20  44.12  47.26  |  46.94  38.35  46.28
CosPrompt    |  49.99  37.21  45.07  |  47.93  36.44  45.28  |  46.84  25.50  45.20
Joint        |  53.37  42.73  49.28  |  51.79  40.90  49.28  |  50.00  40.60  49.28
Table 2. Comparison of mIoU (%) results on the ScanNet v2 dataset. We compare our method with the FT, LwF, 3DPC, BalDis, and JT methods.
Methods      |      C_Novel = 5      |      C_Novel = 3      |      C_Novel = 1
             |  1–15   16–20  All    |  1–17   18–20  All    |  1–19   20     All
FT           |   9.88  13.47  10.41  |   8.67  10.17   8.90  |   8.26  10.67   8.85
LwF [7]      |  30.07   9.07  24.81  |  25.71  10.70  23.46  |  24.48  10.14  23.76
3DPC [20]    |  34.16  13.43  28.98  |  28.38  14.31  26.27  |  25.74  12.62  25.08
BalDis [22]  |  33.82  15.30  29.19  |  31.40  15.63  29.04  |  30.02  15.57  29.30
CosPrompt    |  40.71  10.45  33.15  |  35.46  11.59  31.88  |  33.60  10.52  32.45
Joint        |  42.42  15.63  35.72  |  38.82  18.13  35.72  |  36.78  15.63  35.72
Table 3. Ablation studies of individual model components on mIoU (%) on the S3DIS dataset. We validate the contribution of each module by removing CosPrompt, pseudo-label generation (PLG), and the LwF loss from the standard framework (✓: included; ✗: removed).
CosPrompt | PLG | LwF | 1–10   | 11–13  | All
✗         | ✓   | ✓   | 45.16  | 38.54  | 43.63
✓         | ✗   | ✓   | 41.26  | 39.01  | 40.74
✓         | ✓   | ✗   | 47.06  | 37.50  | 44.85
Table 4. Efficiency comparison between 3DPC and our method on the S3DIS dataset. Total Params: Number of Total Parameters; Trainable: Number of Trainable Parameters; GPU Mem.: GPU Memory Usage; Inc. T: Incremental Learning Time; Inf. T: Inference Time; Freeze BB: Freeze Lower-level Backbone.
Metrics     | Total Params | Trainable | GPU Mem.  | FLOPs        | Inc. T  | Inf. T   | Freeze BB
3DPC [20]   | 0.41 M       | 0.41 M    | 14.87 GB  | 2.61 GFLOPs  | 5.52 h  | 2.29 ms  | No
CosPrompt   | 21.9 M       | 4.10 M    | 10.09 GB  | 9.83 GFLOPs  | 2.00 h  | 6.42 ms  | Yes
Table 5. The per-class density and IoU on the S3DIS test set. The experiment was conducted under the setting $C_{\text{Novel}} = 3$.
Class                      | Ceil. | Floor | Wall | Beam | Col. | Win. | Door | Table | Chair | Sofa | Book. | Board | Clut.
Density (10^4 points/m^2)  | 1.5   | 13.2  | 23.2 | 0.03 | 1.3  | 3.3  | 2.4  | 3.4   | 1.7   | 0.1  | 8.4   | 1.0   | 7.1
IoU (%)                    | 84.6  | 94.0  | 71.8 | 1.2  | 14.0 | 51.4 | 17.4 | 63.3  | 68.7  | 12.9 | 49.5  | 25.3  | 34.5