6.1. Overall Performance Comparison
In this section, we present the performance of different normalization strategies on histopathology images and compare ten CL approaches covering replay-based, prompt-based, architecture-based, and regularization-based methods. When presenting the results of replay-based and prompt-based methods, we also report the average accuracy and forgetting values under different buffer sizes and training epochs, allowing us to examine the influence of these hyperparameters on model performance.
To assess the effect of normalization strategies on CL performance, we evaluate two representative methods, DER++ and DualPrompt, under four commonly used normalization types.
Table 3 reports the corresponding average accuracy results (epochs = 5 for DualPrompt; epochs = 50, buffer size = 500 for DER++). Dataset-level normalization consistently achieves the highest performance for both DER++ and DualPrompt, indicating that normalization aligned with the dataset’s inherent distribution is more effective than ImageNet-based, per-image, or Macenko normalization.
The superior performance of dataset-level normalization can be explained by its better balance between domain alignment and preservation of discriminative histopathological cues. ImageNet normalization uses statistics derived from natural images, which may introduce a domain mismatch for H&E-stained histopathology patches. Per-image normalization standardizes each image independently, reducing image-level brightness and contrast variation but also suppressing differences in cross-sample color and staining intensity that may be informative for distinguishing tissue categories. Macenko normalization reduces stain variation by aligning images to a reference stain appearance, but may also remove useful stain-related differences or introduce additional variability when stain estimation is unstable.
In contrast, dataset-level normalization estimates the mean and standard deviation from the target training dataset and applies the same statistics to all samples. This provides a consistent histopathology-specific input scale while preserving relative differences in color and staining intensity across tissue categories. From the perspective of the stability–plasticity tradeoff in CIL, such consistent input scaling can reduce input-induced representation drift when new classes are introduced, thereby improving stability for previously learned classes while retaining sufficient plasticity to learn newly introduced tissue categories.
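The contrast between dataset-level and per-image normalization can be illustrated with a minimal sketch. It assumes images are given as lists of RGB pixel tuples scaled to [0, 1]; the function names and the toy pixel values are hypothetical and are not taken from our pipeline.

```python
from statistics import fmean, pstdev

def channel_stats(images):
    """Per-channel (mean, std) computed over every pixel of the given images."""
    stats = []
    for c in range(3):
        values = [px[c] for img in images for px in img]
        stats.append((fmean(values), pstdev(values)))
    return stats

def normalize(img, stats, eps=1e-8):
    """Standardize one image with a shared (mean, std) pair per channel."""
    return [
        tuple((px[c] - stats[c][0]) / (stats[c][1] + eps) for c in range(3))
        for px in img
    ]

def per_image_normalize(img):
    """Standardize one image with statistics estimated from that image alone."""
    return normalize(img, channel_stats([img]))

# Toy data (illustrative values only): a pale and a dark two-pixel "image".
pale = [(0.8, 0.6, 0.8), (0.9, 0.7, 0.9)]
dark = [(0.4, 0.2, 0.5), (0.5, 0.3, 0.6)]

# Dataset-level: statistics come from the training set and are reused for all samples.
train_stats = channel_stats([pale, dark])
ds_pale = normalize(pale, train_stats)
ds_dark = normalize(dark, train_stats)

# Per-image: each image is rescaled independently.
pi_pale = per_image_normalize(pale)
pi_dark = per_image_normalize(dark)
```

In this toy case the two images remain distinguishable after dataset-level normalization, whereas per-image normalization maps them to nearly identical values, which mirrors the loss of cross-sample staining intensity discussed above.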
Table 4 and Table 5 further examine the influence of the training process by reporting the variation of average accuracy and forgetting under different training epoch settings. These results allow us to evaluate how training duration affects both overall performance and knowledge retention in the CIL setting for histopathology image classification, where new tissue categories are introduced sequentially across tasks (e.g., Task 1: [ADI, BACK, DEB], Task 2: [LYM, MUC, MUS], and Task 3: [NORM, STR, TUM]).
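As an illustration of this class-incremental protocol, the sketch below builds the three-task split from the nine tissue labels and lists the classes that a single-head classifier must separate after each task. The helper names are hypothetical.

```python
CLASSES = ["ADI", "BACK", "DEB", "LYM", "MUC", "MUS", "NORM", "STR", "TUM"]

def make_tasks(classes, classes_per_task):
    """Split the ordered class list into consecutive, disjoint tasks."""
    return [
        classes[i:i + classes_per_task]
        for i in range(0, len(classes), classes_per_task)
    ]

def seen_classes(tasks, task_idx):
    """Classes the model must discriminate after finishing task `task_idx` (0-based)."""
    return [c for task in tasks[:task_idx + 1] for c in task]

tasks = make_tasks(CLASSES, classes_per_task=3)
for t, task in enumerate(tasks):
    print(f"Task {t + 1}: trains on {task}, evaluated on {seen_classes(tasks, t)}")
```

Training data for each task contains only that task's classes, while evaluation covers all classes seen so far, which is what makes forgetting measurable.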
To examine how the buffer size and number of training epochs influence the effectiveness of replay-based methods, we evaluate ER, DER, and DER++ under different combinations of epoch (20, 50) and buffer size (200, 500). As shown in Table 4, DER and DER++ exhibit a clear trend in which increasing both the number of epochs and buffer size improves average accuracy while reducing forgetting, with the best performance observed at Epochs = 50 and Buffer size = 500. ER also benefits from larger buffers and longer training. Although ER achieves the highest average accuracy under certain settings, this is accompanied by substantially higher forgetting; in contrast, DER++ consistently attains the lowest forgetting while maintaining a competitive average accuracy. Thus, from a stability-oriented perspective, DER++ is preferable among the replay-based methods, as it better preserves previously learned tissue-category knowledge while learning new classes.
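To make the role of the buffer size concrete, the following sketch shows a fixed-capacity rehearsal buffer filled by reservoir sampling. This section does not state the buffer sampling policy, so reservoir sampling is an assumption here (it is a common choice for rehearsal buffers); the class name and stored item format are hypothetical.

```python
import random

class ReservoirBuffer:
    """Fixed-size memory in which every sample seen so far is kept with equal probability."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.num_seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        """Offer one training sample (e.g., an image, its label, and optionally its logits)."""
        self.num_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Replace a stored item with probability capacity / num_seen.
            j = self.rng.randrange(self.num_seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        """Draw up to k stored samples to mix into the current training batch."""
        return self.rng.sample(self.items, min(k, len(self.items)))

buffer = ReservoirBuffer(capacity=200)
for sample_id in range(1000):
    buffer.add((sample_id, sample_id % 9))  # (hypothetical image id, class index)
replay_batch = buffer.sample(32)
```

A larger capacity retains more examples per old class, which is consistent with the improvement observed when moving from a buffer size of 200 to 500.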
These observations also highlight a limitation in the current approach to evaluating CL methods. When comparing different methods, average accuracy and forgetting may favor different models; one method may achieve higher accuracy but suffer from greater forgetting, whereas another may obtain slightly lower accuracy while preserving previous knowledge more effectively. Although composite evaluation schemes have been proposed, such as CLscore, which combines criteria including accuracy, backward transfer, memory usage, and computational efficiency [54], these metrics are designed for broader multi-criteria evaluation and may depend on application-specific weighting. In our setting, the key challenge is the lack of a simple and interpretable unified metric for jointly considering average accuracy and forgetting when selecting the best-performing CL method. Developing such a metric remains an important direction for future work.
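For reference, the two metrics can be computed from a task-wise accuracy matrix as sketched below. The sketch assumes the definitions commonly used in the CL literature (final-step mean accuracy, and the mean drop from each old task's best earlier accuracy); the exact formulas used for our tables are not restated in this section. The accuracy values are illustrative and are not experimental results.

```python
def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the last task.

    acc[t][j] is the accuracy (%) on task j after training on task t, with j <= t.
    """
    final = acc[-1]
    return sum(final) / len(final)

def forgetting(acc):
    """Mean drop from the best earlier accuracy to the final accuracy, over old tasks."""
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    drops = []
    for j in range(num_tasks - 1):
        best_before_end = max(acc[t][j] for t in range(j, num_tasks - 1))
        drops.append(best_before_end - acc[-1][j])
    return sum(drops) / len(drops)

# Illustrative lower-triangular accuracy matrix for three tasks.
acc = [
    [98.0],
    [90.0, 97.0],
    [85.0, 88.0, 96.0],
]
print(average_accuracy(acc), forgetting(acc))
```

Because the two quantities are computed from different parts of the matrix, a method can rank first on one and not on the other, which is the selection difficulty described above.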
To examine whether prompt-based methods require only a small number of epochs to achieve good generalization, we evaluate L2P, CODA-Prompt, and DualPrompt using 5, 10, and 20 training epochs. As shown in Table 5, L2P and DualPrompt achieve their best performance on the NCT-CRC-HE-100K dataset at 5 epochs, indicating that these methods converge effectively under a short training schedule. In contrast, CODA-Prompt continues to improve with longer training and achieves better performance at 20 epochs. These results demonstrate that the effect of training duration differs across prompt-based methods.
Overall, DER++ and DualPrompt are the leading methods within the replay-based and prompt-based categories, respectively, under our experimental settings. Our analysis of sensitivity to the number of epochs further shows that L2P and DualPrompt reach their best performance at 5 epochs, while CODA-Prompt continues to improve with longer training. These findings motivate a more detailed analysis of the tradeoffs between replay-based and prompt-based strategies in terms of performance, efficiency, stability, and deployment feasibility.
Figure 4 shows that replay-based CL methods (ER, DER, and DER++) achieve higher average accuracy and lower forgetting than prompt-based methods (L2P, CODA-Prompt, and DualPrompt) when sufficient replay buffer and training epochs are available. However, these gains require longer training and come at the cost of storing past samples. In contrast, DualPrompt achieves competitive performance while training for only 5 epochs and without replaying past data. Therefore, in privacy-constrained clinical environments, prompt-based methods provide a practical and effective alternative for maintaining performance on both newly introduced and previously learned tasks. This is particularly relevant for the histopathology image classification task addressed in this work.
On the NCT class-incremental benchmark, regularization-based methods such as Online EWC, LwF, and SI still exhibit substantial catastrophic forgetting even after extensive hyperparameter tuning. Increasing the regularization strength to favor previously acquired knowledge (e.g., for Online EWC, for LwF, and for SI) results in reduced plasticity but does not prevent performance degradation, with models still showing severe forgetting of earlier tasks. As reported in Table 6, these methods achieve very low average accuracy (16–30%) with severe forgetting (75–100%) and with noticeable variance across different random seeds, indicating instability under the class-incremental histopathology setting. These results suggest that parameter-space regularization alone is insufficient to handle the strong distribution shift and class interference inherent to CIL. This behavior aligns with prior studies indicating that regularization-based methods often fail in CIL settings because they lack positive training signals for previously learned classes [55]. Furthermore, under the single-head evaluation setup, these methods suffer from severe class imbalance and logit bias towards new classes, leading to catastrophic forgetting [16, 56]. Consequently, our results confirm the limitations of regularization-based approaches in realistic medical scenarios with expanding diagnostic categories.
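The parameter-space constraint discussed above can be sketched as the quadratic penalty used by EWC-style methods. This is a simplified illustration with hypothetical names and toy values; Online EWC additionally maintains a running importance estimate across tasks, which is omitted here.

```python
def ewc_penalty(params, anchor_params, importance, strength):
    """Quadratic penalty that discourages moving important parameters.

    params:        current parameter values
    anchor_params: parameter values stored after the previous task
    importance:    per-parameter importance (e.g., a diagonal Fisher estimate)
    strength:      regularization strength (hypothetical value in the example below)
    """
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, importance)
    )

# Toy example: only the second parameter has moved away from its anchor.
penalty = ewc_penalty(
    params=[1.0, 2.0],
    anchor_params=[1.0, 1.0],
    importance=[0.5, 2.0],
    strength=10.0,
)
print(penalty)  # 10.0
```

The penalty only restricts parameter movement; it supplies no labeled examples of earlier classes, which is consistent with the lack of positive training signals noted above.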
For completeness, we also evaluate DyTox on the NCT-CRC-HE-100K dataset for a more comprehensive comparison. Under 500 training epochs with a memory buffer of 1000, DyTox achieves an average accuracy of 25.09 and a forgetting score of 8.59 on our histopathology classification tasks. These results indicate that DyTox struggles to learn discriminative representations for histopathology images in the CIL setting. By jointly examining accuracy and forgetting, we observe that this limitation is primarily reflected in low final accuracy rather than excessive forgetting; while DyTox exhibits comparable forgetting to other methods, its accuracy remains substantially lower. This behavior can be explained by the interaction between DyTox’s architecture based on shared tokens and the fine-grained characteristics of colorectal histopathology patches.
From the architectural perspective, DyTox relies on shared self-attention encoder blocks to extract patch token representations across all tasks, while task-specific information is mainly modeled through dynamically expanded task tokens and classifier branches [17]. Therefore, the effectiveness of the task-specific tokens depends strongly on whether the shared encoder has already preserved discriminative pathological features. This design differs from replay-based methods [31], which can revisit previous samples to reinforce old class-specific tissue patterns, as well as from prompt-based methods such as DualPrompt [21], which exploit a pretrained ViT backbone and use general and expert prompts to guide task-adaptive feature extraction. In contrast, DyTox updates the shared transformer representation during incremental learning, and does not explicitly preserve previous histopathology patterns through replay or a frozen pretrained representation space.
This design is less favorable for colorectal histopathology images, where discriminative information often appears as subtle, local, and spatially distributed microtextures rather than as object-level semantic structures [9, 57]. Tissue categories such as tumor epithelium, normal mucosa, mucus, debris, and stroma, together with cues such as glandular organization, staining variation, and local epithelial arrangement, can be visually similar within patches [52]. Fixed-size tokenization may weaken or merge these local pathological cues before they reach the task token decoder. As a result, when visually similar tissue classes are introduced incrementally, the shared representation may drift and the expanded task tokens may be insufficient to preserve old class-specific microtexture patterns. This explains why DyTox performs poorly in our experiments despite its architecture expansion-based design.
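A minimal sketch of fixed-size tokenization is given below for a single-channel toy image. The image size and token size are illustrative assumptions and do not correspond to the configuration used in our experiments; the function name is hypothetical.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch tokens.

    Each token is the flattened pixel block; a transformer would then project
    each block to a single embedding vector.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens

# Toy 8 x 8 image whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
tokens = patchify(image, patch=4)
print(len(tokens), len(tokens[0]))  # 4 16
```

All pixels inside one block are summarized by a single token, so structures smaller than the block size are represented only through that token's embedding.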
These results indicate clear differences in how CL paradigms behave on histopathology images, motivating further analysis of their underlying mechanisms and related practical implications.
6.3. Generalization Analysis
To further examine whether our observations generalize beyond the NCT-CRC-HE-100K dataset, we conduct additional experiments on the CRC-HE-7K dataset using the representative CL categories (prompt-based methods and replay-based methods). The CRC-HE-7K dataset [2] contains 7180 histopathological images collected from 50 patients with colorectal adenocarcinoma, and has no overlapping images with the NCT-CRC-HE-100K dataset.
Table 8 shows that the dataset-level normalization strategy also improves performance on the CRC-HE-7K dataset. In particular, both DER++ and DualPrompt achieve their best results under dataset-level normalization compared with the other normalization strategies, suggesting that dataset-level statistics provide a more suitable normalization scheme for this histopathology dataset.
Based on the results in Table 9 and Table 10, we observe trends on the CRC-HE-7K dataset that are consistent with those obtained on the NCT-CRC-HE-100K dataset. First, when considering the stability–plasticity tradeoff as measured by forgetting and average accuracy, DER++ performs better than ER and DER among replay-based methods, while DualPrompt outperforms L2P and CODA-Prompt among prompt-based methods. Second, for replay-based methods, increasing the number of training epochs and buffer size generally improves the performance of ER and DER++. In contrast, DER achieves its best performance with 20 epochs and a buffer size of 200, which may be attributed to the distribution of replay samples. Similar to the observations on the NCT-CRC-HE-100K dataset, Table 10 shows that L2P and DualPrompt achieve their best balance between high average accuracy and low forgetting with 5 training epochs. For CODA-Prompt, however, better performance is obtained with more training epochs, likely because its more complex architecture and larger number of parameters require additional training steps for effective optimization. Although increasing the number of epochs may slightly improve the overall average accuracy of DualPrompt, it also increases forgetting; therefore, we use the 5-epoch setting to achieve a better balance between classification performance and forgetting control.
The CRC-HE-7K results show that the main trends observed on the NCT-CRC-HE-100K dataset are largely preserved. First, the effect of normalization remains consistent, with dataset-level normalization achieving the best average accuracy for both DER++ and DualPrompt among the evaluated normalization strategies. This suggests that the benefit of dataset-level normalization is not restricted to the original NCT-CRC-HE-100K split. Second, the relative behavior within CL method families is also consistent. Among replay-based methods, DER++ remains a strong baseline and achieves the best average accuracy and lowest forgetting under the setting with 50 epochs and a buffer size of 500. Among prompt-based methods, DualPrompt shows the strongest and most stable performance, achieving high average accuracy with low forgetting across training epochs. These findings provide additional support for our conclusion that DualPrompt can serve as a competitive replay-free alternative to replay-based methods in colorectal histopathology CIL.
Figure 8 displays the confusion matrices of DualPrompt and DER++ on the CRC-HE-7K dataset. Both methods achieve strong diagonal dominance, indicating that the selected CL methods maintain high class-wise classification accuracy on the additional dataset. DER++ shows fewer off-diagonal errors than DualPrompt, suggesting better generalization and more stable performance across histological classes. Notably, however, DualPrompt does not misclassify NORM samples as TUM, which is clinically meaningful for reducing false tumor predictions.
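The quantities read from such a confusion matrix can be sketched as follows. The labels and predictions are toy values chosen for illustration, not results from our experiments, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix with rows indexed by the true class and columns by the predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix

def per_class_recall(matrix):
    """Diagonal entry divided by the row sum for each class."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]

labels = ["NORM", "STR", "TUM"]
y_true = ["NORM", "NORM", "STR", "TUM", "TUM", "TUM"]
y_pred = ["NORM", "STR", "STR", "TUM", "TUM", "STR"]

matrix = confusion_matrix(y_true, y_pred, labels)
norm_as_tum = matrix[labels.index("NORM")][labels.index("TUM")]
print(matrix, per_class_recall(matrix), norm_as_tum)
```

The off-diagonal entry in the NORM row and TUM column corresponds to the false tumor predictions discussed above.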
Figure 9 presents the multiclass ROC curves of DualPrompt and DER++ on CRC-HE-7K. Both methods obtain high AUC values across most classes, demonstrating strong discriminative capability on the additional dataset. These results further support the conclusion that the selected prompt-based and replay-based CL methods generalize well beyond the NCT-CRC-HE-100K dataset.
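A one-vs-rest AUC of the kind summarized by multiclass ROC curves can be computed as sketched below, using the rank-based formulation (the probability that a positive sample is scored above a negative one, with ties counted as one half). The predicted probabilities are toy values, and the function names are hypothetical.

```python
def binary_auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("both positive and negative samples are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(probs, y_true, labels):
    """Per-class AUC, treating each class in turn as positive and all others as negative."""
    return {
        label: binary_auc([row[k] for row in probs], [t == label for t in y_true])
        for k, label in enumerate(labels)
    }

labels = ["NORM", "STR", "TUM"]
y_true = ["NORM", "STR", "TUM", "TUM"]
probs = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.3, 0.4, 0.3],
]
auc = one_vs_rest_auc(probs, y_true, labels)
print(auc)
```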