4.2.1. Results of ORACIL
Figure 2 presents the phase-wise accuracy comparison between ORACIL and representative class-incremental learning methods on CIFAR-100, CUB200, and OmniBenchmark. The compared methods cover both raw-sample-free approaches that avoid storing or replaying historical images, such as L2P, CODA-Prompt, DualPrompt, SimpleCIL, FeTrIL, and GDDSG, and replay-based approaches that rely on exemplar memory or replay buffers, including FOSTER, MEMO, iCaRL, WA, BiC, and DER++. The symbols √ and × indicate whether each method avoids raw historical image replay.
As the number of incremental phases increases, most baseline methods show a continuous decline in accuracy, reflecting the accumulation of catastrophic forgetting when new classes are introduced sequentially. This degradation is particularly evident for DER++, SimpleCIL, DualPrompt, L2P, and CODA-Prompt on CUB200 and OmniBenchmark. Although replay-based methods such as iCaRL, WA, and BiC alleviate forgetting to a certain extent, their accuracy still gradually decreases in later phases. In contrast, ORACIL maintains a more stable accuracy trend across the incremental process and achieves the highest final-phase accuracy on all three datasets without relying on raw historical image replay.
On CIFAR-100, ORACIL starts with relatively lower accuracy in the first two phases, but its performance rapidly improves and remains stable above 95% throughout the later phases. This indicates that ORACIL can effectively adapt to the incremental class order and preserve discriminative ability over long-term updates. On CUB200, a fine-grained recognition dataset with highly similar categories, ORACIL gradually surpasses the competing methods and maintains nearly 94% accuracy at the final phase, showing its advantage in mitigating interference among visually similar classes. On OmniBenchmark, where the data distribution is more diverse and cross-domain class variations are larger, ORACIL also exhibits the most stable trend and achieves the best final-phase performance among all compared methods.
These results demonstrate that ORACIL provides strong long-term incremental learning ability in challenging class-incremental scenarios. The conflict-graph-based dynamic grouping strategy helps reduce interference caused by the consecutive arrival of similar or conflicting classes, while the analytic closed-form update mechanism enables efficient model adaptation without iterative replay-based retraining. As a result, ORACIL achieves a favorable balance between stability, resistance to forgetting, and classification performance across different incremental learning scenarios.
Figure 3 further compares ORACIL with representative raw-sample-free class-incremental learning methods, including L2P, CODA-Prompt, DualPrompt, GDDSG, SimpleCIL, and FeTrIL. These methods avoid storing or replaying raw historical images, making them directly comparable under a raw-sample-free class-incremental learning setting. Overall, most baseline methods exhibit a continuous decline in accuracy as the number of incremental phases increases, indicating that long-term forgetting remains a challenging issue even without considering replay-based methods. In contrast, ORACIL maintains a much more stable trend across the incremental process and achieves the best final-phase performance on all three datasets.
On CIFAR-100, several baselines start with higher initial accuracy, especially CODA-Prompt and FeTrIL. However, their performance gradually decreases as more classes are introduced. ORACIL starts with relatively lower accuracy in the first two phases, but it quickly improves and remains stable above 95% in the later phases, achieving the highest accuracy in the final phase. On CUB200, the advantage of ORACIL becomes more evident as the incremental process proceeds. Although methods such as SimpleCIL, DualPrompt, and FeTrIL perform competitively in the early phases, they show clear performance degradation in later phases, while ORACIL maintains around 94% accuracy and achieves the best final-phase result. On OmniBenchmark, where the class distribution is more diverse and cross-domain variations are larger, most raw-sample-free baselines suffer from substantial accuracy drops. In contrast, ORACIL keeps a relatively stable trajectory and obtains the highest final-phase accuracy among all compared raw-sample-free methods.
These results demonstrate that ORACIL has stronger long-term stability and resistance to forgetting than existing raw-sample-free baselines. The conflict-graph-based dynamic grouping strategy helps reduce interference among newly introduced and previously learned classes, while the group-recognition router and analytic incremental updating strategy enable effective adaptation during continuous class expansion. Consequently, ORACIL maintains stable performance across different datasets without storing or replaying raw historical images.
Figure 4 presents the phase-wise accuracy comparison among replay-based class-incremental learning methods, including FOSTER, MEMO, iCaRL, WA, BiC, and DER++, on CIFAR-100, CUB200, and OmniBenchmark. These methods rely on exemplar memory or replay buffers to preserve historical information during incremental learning. Compared with the raw-sample-free methods shown in Figure 3, replay-based methods can explicitly reuse historical samples or stored experience to mitigate catastrophic forgetting. However, their performance still varies significantly across datasets and incremental phases.
On CIFAR-100, most replay-based methods start from high initial accuracy and gradually degrade as the incremental process proceeds. FOSTER maintains the best final-phase accuracy among the replay-based baselines, while iCaRL, WA, and BiC show similar decreasing trends. MEMO exhibits a moderate decline, and DER++ suffers from the most pronounced performance drop in the later phases. On CUB200, iCaRL and MEMO show relatively strong long-term performance, whereas FOSTER and DER++ degrade more rapidly as new fine-grained classes are introduced. This suggests that exemplar replay alone may be insufficient when inter-class visual similarity is high. On OmniBenchmark, which contains more diverse visual domains, FOSTER achieves the most stable performance among the replay-based methods, while DER++ again experiences a substantial accuracy decrease across phases.
These results provide a complementary reference for evaluating ORACIL. Although replay-based methods can access historical samples, their accuracy still declines under long-term incremental learning, especially on fine-grained or diverse datasets. In contrast, the results in Figure 2 and Figure 3 show that ORACIL achieves competitive or superior final-phase performance without replaying raw historical images. This indicates that the proposed conflict-graph-based grouping, group-recognition routing, and analytic incremental updating mechanisms can effectively improve long-term stability and resistance to forgetting without relying on exemplar memory.
Table 3, Table 4, and Table 5 report the phase-wise accuracy and final average forgetting (AF) of different class-incremental learning methods on CIFAR-100, CUB200, and OmniBenchmark, respectively. The column “No Raw Replay” indicates whether a method avoids storing or replaying raw historical images during incremental learning. A check mark denotes raw-sample-free methods, while a cross mark denotes methods that rely on raw historical images, exemplars, or replay buffers. AF measures the final average forgetting rate, where a lower value indicates better retention of previously learned knowledge.
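For reference, the following minimal sketch shows how a final average forgetting value can be computed from a table of per-phase accuracies. The exact AF formula is not restated in this section, so the sketch assumes a commonly used definition: for each earlier phase, forgetting is the gap between the best accuracy ever recorded on that phase's classes and the accuracy after the final phase, averaged over all earlier phases. The function name `average_forgetting` and the toy accuracy values are illustrative only.

```python
def average_forgetting(acc):
    """Final average forgetting under an assumed, commonly used definition.

    acc[t][j] is the accuracy (%) on the classes introduced in phase j,
    measured after training on phase t (only j <= t is defined).
    """
    last = len(acc) - 1
    if last == 0:
        return 0.0  # a single phase has nothing to forget
    drops = []
    for j in range(last):
        # Best accuracy on phase-j classes before the final phase.
        best_earlier = max(acc[t][j] for t in range(j, last))
        drops.append(best_earlier - acc[last][j])
    return sum(drops) / len(drops)


# Toy values for illustration only (not taken from Tables 3-5).
acc = [
    [90.0],
    [88.0, 85.0],
    [86.0, 84.0, 80.0],
]
af = average_forgetting(acc)
print(af)  # 2.5
```

In this toy case the first phase loses 4 points and the second loses 1 point, giving an AF of 2.5.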
Across the three datasets, ORACIL consistently achieves the best final-phase accuracy and the lowest AF. On CIFAR-100, ORACIL reaches 95.77% accuracy at the final phase with an AF of only 0.16%, outperforming both raw-sample-free methods and replay-based baselines. Although several methods, such as CODA-Prompt, iCaRL, WA, BiC, and DER++, obtain higher accuracy in the initial phase, their performance gradually declines as more classes are introduced. In contrast, ORACIL improves after the early phases and remains highly stable throughout the later incremental stages.
On CUB200, which contains fine-grained categories with high visual similarity, most methods suffer from more evident performance degradation. Prompt-based methods such as L2P, CODA-Prompt, and DualPrompt drop substantially from the initial to the final phase, while replay-based methods such as iCaRL, WA, BiC, and MEMO also show noticeable forgetting. ORACIL achieves the highest final accuracy of 93.86% and the lowest AF of 0.77%, indicating that the proposed conflict-aware grouping strategy is effective in reducing interference among visually similar classes.
On OmniBenchmark, the performance gap becomes more pronounced due to the more diverse and cross-domain data distribution. Many baselines experience significant accuracy degradation, especially DER++, L2P, CODA-Prompt, DualPrompt, and several replay-based methods. ORACIL maintains a stable accuracy trend and achieves the best final accuracy of 88.12% with the lowest AF of 1.04%. Compared with GDDSG, which also avoids raw replay and shows relatively low forgetting, ORACIL achieves both lower AF and higher final-phase accuracy, demonstrating stronger long-term incremental learning ability.
Overall, Table 3, Table 4, and Table 5 show that ORACIL does not necessarily obtain the highest accuracy at the initial phase, but it consistently exhibits stronger stability as the incremental process proceeds. This behavior is expected because early phases provide limited class observations, making class relationship estimation and group assignment less reliable. As more classes are observed, the conflict graph, group-recognition router, and analytic updating mechanism become more effective, enabling ORACIL to reduce inter-class interference and preserve previously learned knowledge. These results demonstrate that ORACIL achieves a favorable balance between classification accuracy, forgetting resistance, and raw-sample-free incremental learning.
4.2.2. Robustness to Class Order
As shown in Figure 5, on the CIFAR-100 dataset, GDDSG achieves the lowest values on both metrics (MOPD and AOPD), indicating that it is almost unaffected by class order. ORACIL obtains a relatively low AOPD of 1.78, but its MOPD reaches 4.9, which is noticeably higher than that of the other methods. This suggests that its order sensitivity is mainly concentrated in a few specific phases rather than reflected as persistent fluctuations throughout the whole process. Both APD [25] and its variant APDfix show relatively high values on this dataset, while HALRP [26] lies in the middle range.
On the fine-grained CUB200 dataset, ORACIL achieves the lowest MOPD and AOPD, with values of 7.72 and 3.22, respectively. Both values are lower than those of GDDSG, indicating that ORACIL is more robust under both average and worst-case conditions. On the more challenging OmniBenchmark dataset, which exhibits larger distribution discrepancies, ORACIL again achieves the best performance, with an MOPD of 10.7 and an AOPD of 4.7. These values are lower than GDDSG’s 11.2 and 7.6, respectively, with a particularly notable reduction in the average disparity.
Overall, ORACIL clearly improves order robustness over GDDSG on CUB200 and OmniBenchmark. Although its MOPD on CIFAR-100 is relatively high due to a single-phase peak, its overall average disparity remains small.
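The difference between a high MOPD and a low AOPD can be made concrete with a small sketch. It assumes the usual reading of these metrics: the per-phase disparity is the gap between the best and worst accuracy obtained across class orders, MOPD is the maximum of these gaps, and AOPD is their mean. The function name `order_disparity` and the accuracy values are hypothetical and chosen only to mimic a single-phase peak.

```python
def order_disparity(acc_by_order):
    """Return (MOPD, AOPD) under the assumed per-phase max-min definition.

    acc_by_order[r][t] is the accuracy (%) at phase t under the r-th class order.
    """
    num_phases = len(acc_by_order[0])
    per_phase = [
        max(run[t] for run in acc_by_order) - min(run[t] for run in acc_by_order)
        for t in range(num_phases)
    ]
    return max(per_phase), sum(per_phase) / len(per_phase)


# Three hypothetical class orders over four phases; only phase 2 disagrees strongly.
runs = [
    [90.0, 92.0, 95.0, 95.0],
    [91.0, 88.0, 95.0, 96.0],
    [90.0, 93.0, 94.0, 95.0],
]
mopd, aopd = order_disparity(runs)
print(mopd, aopd)  # 5.0 2.0
```

Here a single phase with a 5-point gap sets the MOPD, while the three remaining phases differ by only 1 point each, so the AOPD stays at 2.0.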
Figure 6 presents the average accuracy of ORACIL on the three datasets when trained under five different class arrival orders. The accuracy ranges on CIFAR-100, CUB200, and OmniBenchmark are 94.46–95.74%, 93.02–95.73%, and 87.25–91.57%, respectively, corresponding to spreads of about 1.3, 2.7, and 4.3 percentage points. The variation is small on CIFAR-100 and CUB200 and moderate on OmniBenchmark, with no severe fluctuations when the class arrival order changes, which indicates that ORACIL is robust to class order. This robustness mainly benefits from the dynamic grouping mechanism. During the incremental process, it constructs a conflict graph and adjusts the class-group structure, so that newly arriving classes are assigned to groups whose existing classes are less similar to them, thereby achieving order robustness.
4.2.3. Ablation Results
Table 6 reports the ablation results of the three key designs in ORACIL, including the analytic head, the recursive update mechanism, and the dynamic grouping with the group-recognition router. The comparison reveals a clear progressive effect among these components. When only the analytic head is retained, the model fails to maintain discriminative knowledge across incremental phases. Although the analytic head provides a closed-form classifier, it is not sufficient by itself to handle continuously expanding class sets. This is reflected by the extremely low final accuracy and the very high final forgetting on all three datasets. Therefore, the analytic head should be viewed as the computational basis of ORACIL rather than a complete solution to class-incremental learning.
After introducing the recursive update mechanism, the performance improves substantially. This indicates that recursively updating the analytic solution with newly observed data is critical for preserving previously learned decision boundaries while incorporating new classes. Compared with the analytic-head-only variant, this setting greatly reduces final forgetting and raises the final accuracy across CIFAR-100, CUB200, and OmniBenchmark. These results confirm that the recursive update mechanism is the core factor that enables ORACIL to perform raw-sample-free incremental learning without repeatedly retraining the classifier from scratch.
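To illustrate what a recursive update of an analytic solution can look like, the sketch below implements a standard recursive least-squares update for a ridge-regression classifier on fixed features. This is an assumption about the general form of such mechanisms, not a reproduction of ORACIL's update rule, which is not restated in this section. The names `make_state`, `recursive_update`, `add_classes`, and the regularization value `gamma` are hypothetical.

```python
def make_state(dim, num_outputs, gamma=1.0):
    """R starts as (gamma * I)^-1 and W as zeros (ridge solution with no data)."""
    R = [[(1.0 / gamma) if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    W = [[0.0] * num_outputs for _ in range(dim)]
    return R, W


def recursive_update(R, W, x, y):
    """Fold one (feature, one-hot target) pair into the solution (Sherman-Morrison)."""
    dim = len(x)
    Rx = [sum(R[i][k] * x[k] for k in range(dim)) for i in range(dim)]  # R x
    denom = 1.0 + sum(x[i] * Rx[i] for i in range(dim))                # 1 + x^T R x
    R_new = [[R[i][j] - Rx[i] * Rx[j] / denom for j in range(dim)] for i in range(dim)]
    gain = [v / denom for v in Rx]                                     # equals R_new x
    pred = [sum(x[i] * W[i][c] for i in range(dim)) for c in range(len(y))]
    W_new = [[W[i][c] + gain[i] * (y[c] - pred[c]) for c in range(len(y))]
             for i in range(dim)]
    return R_new, W_new


def add_classes(W, extra):
    """Append zero columns for newly arrived classes."""
    return [row + [0.0] * extra for row in W]


# Phase 1: two classes, one toy sample each.
R, W = make_state(dim=2, num_outputs=2, gamma=1.0)
for x, y in [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]:
    R, W = recursive_update(R, W, x, y)

# Phase 2: a third class arrives; earlier samples are not revisited.
W = add_classes(W, 1)
R, W = recursive_update(R, W, [1.0, 1.0], [0.0, 0.0, 1.0])
print(W)
```

For this toy data, the final `W` equals the batch ridge solution computed on all three samples at once, which is the property that lets a recursive scheme avoid retraining from scratch or storing earlier samples.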
The full model further incorporates dynamic grouping and the group-recognition router. This component does not merely add another classifier; rather, it changes how class interference is controlled during incremental learning. By assigning classes into low-conflict groups and routing test samples through soft group probabilities, ORACIL reduces the competition among visually similar or order-sensitive classes. As a result, the full model achieves the best final accuracy on all three datasets and further suppresses final forgetting. The improvement is especially meaningful because it shows that recursive updates alone can preserve knowledge, but dynamic grouping is needed to organize the expanding class space more effectively.
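One plausible way to read "routing test samples through soft group probabilities" is sketched below: the router's group logits are turned into probabilities, and each group head's class scores are weighted by the probability of its group before the final prediction is taken. The fusion rule, the function names `softmax` and `fuse`, and all numbers are assumptions made for illustration. The section does not specify the exact rule used in ORACIL.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(router_logits, head_scores, group_classes):
    """Weight each group's head scores by the router's probability for that group."""
    probs = softmax(router_logits)
    fused = {}
    for p, scores, classes in zip(probs, head_scores, group_classes):
        for cls, s in zip(classes, scores):
            fused[cls] = p * s
    return fused


# Two hypothetical groups; visually similar classes sit in different groups.
group_classes = [["sparrow", "car"], ["finch", "truck"]]
router_logits = [2.0, 0.0]                # the router favours group 0
head_scores = [[0.6, 0.4], [0.9, 0.1]]    # per-group head outputs

fused = fuse(router_logits, head_scores, group_classes)
prediction = max(fused, key=fused.get)
print(prediction)  # sparrow
```

Although the second head scores "finch" at 0.9, the router assigns its group a low probability, so "sparrow" wins. Under this reading, similar classes compete through the router rather than within a single head.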
The AOPD and MOPD metrics provide additional insight into order robustness. Although the full model does not always produce the lowest AOPD or MOPD, it achieves a much better balance between final accuracy, forgetting, and class-order stability. In particular, the recursive-update-only variant can sometimes show lower order-difference values because its overall performance is lower and less variable, but this does not indicate stronger practical performance. The full ORACIL framework maintains high accuracy while keeping forgetting at a low level, demonstrating that the proposed components are complementary: the analytic head provides an efficient closed-form classifier, the recursive update enables continual adaptation without raw-sample replay, and the dynamic grouping router improves robustness by reducing inter-group interference.
4.2.7. Sensitivity Analysis of Threshold Strategies in Conflict-Graph Construction
Table 9 presents the sensitivity analysis of different threshold strategies used in conflict-graph construction. The threshold strategy directly affects the density of the conflict graph and thus determines the granularity of the generated class groups. A stricter threshold tends to introduce more conflict edges and produces more groups, whereas a looser threshold merges more classes into the same group.
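The relation between conflict-graph density and group count can be illustrated with a simple greedy grouping sketch. It assumes that two classes conflict when their prototype similarity reaches a cutoff `tau`, and that each class is placed in the first group containing no conflicting class. In this sketch, a lower cutoff plays the role of the stricter threshold because it creates more conflict edges. The function `greedy_groups`, the similarity matrix, and the cutoff values are hypothetical, and the actual grouping procedure may differ.

```python
def greedy_groups(sim, tau):
    """Place each class in the first group that holds no class it conflicts with.

    Two classes conflict (share a conflict-graph edge) when sim >= tau.
    """
    groups = []
    for c in range(len(sim)):
        for g in groups:
            if all(sim[c][m] < tau for m in g):
                g.append(c)
                break
        else:
            groups.append([c])  # every existing group contains a conflict
    return groups


# Hypothetical prototype similarities between four classes.
sim = [
    [1.00, 0.90, 0.20, 0.60],
    [0.90, 1.00, 0.30, 0.55],
    [0.20, 0.30, 1.00, 0.70],
    [0.60, 0.55, 0.70, 1.00],
]
loose = greedy_groups(sim, tau=0.8)   # few conflict edges
strict = greedy_groups(sim, tau=0.5)  # more conflict edges
print(loose)   # [[0, 2, 3], [1]]
print(strict)  # [[0, 2], [1], [3]]
```

With the looser cutoff only classes 0 and 1 conflict, so two groups suffice. The stricter cutoff adds edges that force class 3 into a group of its own.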
The proposed mean-similarity threshold achieves the most stable overall trade-off across the three datasets. It adaptively estimates the compactness of each class by averaging the cosine similarities between samples and their corresponding class prototype. Therefore, the threshold is determined by the intra-class distribution rather than by a manually selected global value. As shown in Table 9, mean similarity obtains competitive or superior final accuracy on all datasets while maintaining a moderate number of groups.
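The mean-similarity computation described above can be sketched as follows. The sketch assumes that the class prototype is the mean feature vector of the class and that the threshold is the average cosine similarity between each sample and that prototype. How the resulting value is compared against inter-class similarities when adding conflict edges is not shown here. The function names and toy features are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity_threshold(features):
    """Average cosine similarity between a class's samples and its mean prototype."""
    dim = len(features[0])
    prototype = [sum(f[k] for f in features) / len(features) for k in range(dim)]
    sims = [cosine(f, prototype) for f in features]
    return sum(sims) / len(sims)


# Two hypothetical classes: one compact, one spread out.
tight = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
spread = [[1.0, 0.0], [1.0, 1.0], [1.0, -1.0]]

t_tight = mean_similarity_threshold(tight)
t_spread = mean_similarity_threshold(spread)
print(round(t_tight, 4), round(t_spread, 4))
```

The compact class yields a threshold close to 1, while the spread-out class yields a lower one, so the value adapts to each class's intra-class distribution without a manually chosen global constant.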
Although the fixed threshold of 0.8 achieves the highest accuracy on CIFAR-100, it relies on a manually specified hyperparameter and does not consistently outperform mean similarity on CUB200 and OmniBenchmark. This indicates that a fixed global threshold may be dataset-dependent. Median similarity provides a robust alternative against outliers, but it yields higher forgetting on CUB200 and OmniBenchmark, suggesting that ignoring the overall intra-class distribution may weaken the stability of conflict-graph construction.
The adaptive threshold generates more groups, especially on CIFAR-100, where the number of groups increases from 12 to 21. This indicates that the adaptive strategy makes the conflict graph denser and leads to over-fragmented class partitions. Excessive fragmentation increases the difficulty of group recognition and multi-head fusion, which can reduce final accuracy and increase forgetting.
Overall, these results show that ORACIL is reasonably robust to different threshold strategies, but the mean-similarity threshold provides the best balance among accuracy, forgetting, group compactness, and hyperparameter-free adaptability. Therefore, we use mean similarity as the default threshold strategy in ORACIL.