1. Introduction
In recent years, improvements in deep learning model performance [1,2] and increases in computational resources have enabled high-accuracy image classification [3,4,5,6]. Convolutional neural networks [7] are a representative deep learning model for images, but image classification accuracy has also improved dramatically through various other techniques and practical know-how. A recent trend in image classification and object detection based on deep neural network models is to make use of pretrained models [8], and various pretrained models for these tasks have been released and are widely available; examples include YOLO (You Only Look Once), VGGNet, and ResNet. Furthermore, recent attempts have been made to estimate consumer impression evaluations using pairwise deep neural networks [9,10]. Typically, image classification models built using standard supervised learning [11,12] are trained on a given training dataset, and the trained classification model is then used to classify target data. In other words, this is a static learning approach designed for continuous classification of target input data using a pre-trained classifier. Naturally, it assumes that the number of classes to be classified is known and fixed, and it does not anticipate data from unknown new classes arriving while the classifier is in use.
Nevertheless, in actual image classification tasks, the number of target classes to be classified often increases incrementally during the phase in which the classifier is used [13,14]. In such situations, retraining a large-scale model from scratch every time new classes are added, just to teach the model about these new classes, is inefficient in terms of both time and monetary cost. To address this issue, a framework is needed that can efficiently learn newly added classes while leveraging existing pre-trained models [15]. Against this backdrop, the significance of Class-Incremental Learning (CIL) [13], which enables continuous learning of newly added classes within an existing model, has been advocated, and various studies have been conducted. Among the models applicable to CIL, Forward Compatible Few-Shot Class-Incremental Learning (FACT) [15] is known for achieving high classification accuracy on both previously learned base classes and newly added incremental classes. By reserving regions in the embedding space in advance to represent the knowledge of incremental classes, FACT prevents interference between the embeddings of base classes and incremental classes, thereby maintaining both forms of knowledge while achieving highly accurate inference.
However, FACT represents the generative distribution of each class in the embedding space as a multivariate normal distribution and assumes its covariance matrix to be the identity matrix. This assumption makes it difficult to flexibly represent the extent of each class region according to its semantic diversity (i.e., the breadth or narrowness of each class concept): FACT effectively assumes a uniform Gaussian distribution across all classes, regardless of conceptual breadth. Furthermore, FACT lacks a mechanism for using newly obtained data on incremental classes to update the model during deployment. Consequently, it cannot improve its performance in real time by exploiting the new data that continuously accumulates. From a practical business perspective, this represents a missed opportunity to enhance accuracy and adaptability, as valuable new information is not leveraged to update the model in real time.
The objective of this study is, therefore, to “incorporate the semantic diversity of each class into the embedding region, while simultaneously improving model accuracy in real time by utilizing data accumulated during deployment”. Concretely, we propose an extended model called FACT+, which introduces (1) parameters (covariance matrices) that represent the semantic diversity of each class, and (2) a mechanism to dynamically update the predictive distribution of each class using new data obtained during deployment. FACT+ aligns with a more natural scenario and can achieve higher accuracy than FACT. In addition, we apply the proposed method to multi-class classification tasks under various scenarios and conduct evaluation experiments and analyses. We demonstrate the effectiveness of our approach in terms of classification accuracy and the properties of the learned embedding space.
2. Preliminaries
In this chapter, we introduce the foundational concepts of this study. We begin by defining the problem setting of Class-Incremental Learning (CIL) and its key challenges, including catastrophic forgetting. We then explain our baseline model, Forward Compatible Few-Shot Class-Incremental Learning (FACT), and the underlying principle of Prototype Learning.
2.1. Problem Settings
We focus on the class-incremental learning (CIL) setting, where a model is required to recognize an ever-expanding set of classes while mitigating catastrophic forgetting. The learning process proceeds over one base session followed by multiple incremental sessions. We summarize the notations and definitions that are used consistently throughout this paper.
Input and encoder: Each training instance is denoted as $(\mathbf{x}_n, y_n)$, where $\mathbf{x}_n \in \mathbb{R}^D$ and $D$ is the input dimension, i.e., the number of features. An encoder $\phi(\cdot)$ maps the input into a $d$-dimensional embedding space, yielding $\mathbf{z}_n = \phi(\mathbf{x}_n) \in \mathbb{R}^d$.
Base session: In the base session, the initial dataset is defined as $\mathcal{D}^0 = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N_0}$, where $y_n \in \mathcal{C}^0$, $\mathcal{C}^0$ is the set of base classes, and $N_0$ is the number of initial data points in $\mathcal{D}^0$. The classifier for base classes is represented by a set of prototypes $\{\mathbf{p}_c\}_{c \in \mathcal{C}^0}$, where each prototype $\mathbf{p}_c \in \mathbb{R}^d$ for $c \in \mathcal{C}^0$ is a trainable parameter during the base session.
Virtual classes: To reserve embedding space for future sessions, a set of virtual classes $\mathcal{V}$ is introduced. Their prototypes are $\{\mathbf{v}_k\}_{k=1}^{K}$, where $K = \sum_{b=1}^{B} |\mathcal{C}^b|$, with $B$ denoting the number of incremental sessions and $|\mathcal{C}^b|$ the number of classes introduced in the $b$-th session. For simplicity, we assume that the number of classes introduced per session is constant, i.e., $|\mathcal{C}^1| = \cdots = |\mathcal{C}^B|$. Each prototype $\mathbf{v}_k$ is jointly trained with $\{\mathbf{p}_c\}_{c \in \mathcal{C}^0}$.
Incremental sessions: In the $b$-th incremental session ($b = 1, \dots, B$), the dataset is given by $\mathcal{D}^b = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N_b}$, where $y_n \in \mathcal{C}^b$ and $\mathcal{C}^b$ denotes the set of new classes. For each class $c \in \mathcal{C}^b$, the prototype is computed as $\mathbf{p}_c = \frac{1}{N_c} \sum_{(\mathbf{x}_n, y_n) \in \mathcal{D}^b} \mathbb{1}[y_n = c]\, \phi(\mathbf{x}_n)$, where $N_c$ is the number of data points in class $c$ and $\mathbb{1}[\cdot]$ is the indicator function. Thus, unlike base session prototypes, incremental prototypes are not trainable but estimated directly from few-shot examples (a short illustrative code sketch of this computation follows the notation summary below).
Covariance matrices: For each class, we characterize its distribution in the embedding space by a prototype and a covariance matrix. Specifically, the prototype $\mathbf{p}_c$ (or $\mathbf{v}_k$ for a virtual class) represents the mean feature vector of the class, while the covariance matrix $\Sigma_c$ describes the spread of feature vectors around the prototype. Here, $\sigma_c^2$ are scalar parameters controlling the variance and $I_d$ is the $d \times d$ identity matrix; that is, $\Sigma_c = \sigma_c^2 I_d$.
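To make the incremental prototype computation concrete, the following minimal sketch estimates class prototypes from few-shot embeddings; the encoder is replaced by precomputed embedding vectors, and all variable names are illustrative rather than taken from the original implementation.

```python
import numpy as np

def compute_prototypes(embeddings: np.ndarray, labels: np.ndarray) -> dict:
    """Estimate one prototype per class as the mean embedding of its few-shot samples.

    embeddings: array of shape (N, d), outputs of the frozen encoder phi(x_n)
    labels:     array of shape (N,), class labels y_n
    """
    prototypes = {}
    for c in np.unique(labels):
        mask = labels == c               # indicator 1[y_n = c]
        prototypes[c] = embeddings[mask].mean(axis=0)
    return prototypes

# Toy usage: 2 incremental classes, 5 shots each, embedding dimension d = 8
rng = np.random.default_rng(0)
z = rng.normal(size=(10, 8))
y = np.repeat([60, 61], 5)
protos = compute_prototypes(z, y)
for c, p in protos.items():
    print(int(c), p.shape)               # one 8-dimensional prototype per class
```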
2.2. Related Work
Class-Incremental Learning (CIL) [13] refers to a learning framework aimed at adapting to newly added classes in an incremental manner while preserving knowledge about base classes (those learned in the initial stage). This sequential learning process consists of a base session, where initial classes are learned, followed by multiple incremental sessions, in which new classes are learned one after another. The ultimate goal of CIL is to build a single model that can accurately classify all classes observed so far, both old and new. Our baseline model, FACT, addresses this CIL framework by incrementally adding new class-specific parameters to the model without the need for a full retraining of the existing model.
A critical challenge in CIL is catastrophic forgetting [16,17,18], a phenomenon in which the knowledge acquired from previously learned base classes rapidly degrades as the model learns new incremental classes. This degradation typically manifests as a significant drop in classification performance on base classes. Catastrophic forgetting is often attributed to several factors, including classifier bias [19,20,21], task overwriting in the embedding space [21,22,23], and the collapse of class distributions in the embedding space [24,25].
To mitigate catastrophic forgetting and achieve robust performance across all learned classes, various CIL approaches have been proposed.
2.3. Few-Shot Class-Incremental Learning
In many real-world scenarios, the number of available data points for incremental classes is limited. Few-Shot Class-Incremental Learning (FSCIL) [26,27,28,29] is a CIL framework designed to address this challenge. In addition to catastrophic forgetting, the key challenge FSCIL must address is preventing overfitting to incremental classes with limited data. To resolve these issues, methods have been proposed to mitigate overfitting during model updates. These include the Continually Evolved Classifier (CEC) [30], which uses graph structures to learn the relationships between prototypes of base and incremental classes, and Self-Promoted Prototype Refinement (SPPR) [31], which prevents forgetting by randomly and iteratively training the model on incremental classes and separating the feature spaces of existing and incremental classes.
2.4. Forward Compatible Few-Shot Class-Incremental Learning
Forward Compatible Few-Shot Class-Incremental Learning (FACT) [15] is a method that secures embedding regions for incremental classes in advance during the learning of base classes. FACT achieves high inference accuracy in scenarios where the target classes to be classified continue to increase over time.
During the base session, the model is trained on a dataset for the base classes to learn (1) an encoder that maps the input to a lower-dimensional space and (2) the base class classifier (hereafter referred to as prototypes). Let $\mathbf{z}_n = \phi(\mathbf{x}_n)$ be the feature vector, $(\mathbf{x}_n, y_n)$ be the input and its corresponding label for the $n$-th data point, and $\mathcal{C}^0$ be the set of base classes.
At the same time, in order to secure embedding space for future incremental classes, a set of "virtual classes" is introduced, and their prototypes $\{\mathbf{v}_k\}$ are jointly trained with $\{\mathbf{p}_c\}_{c \in \mathcal{C}^0}$. The data for virtual classes are generated by applying manifold mixup [32] to the data of two different base classes.
By designing a loss function that takes these virtual classes into account, the data of each class are concentrated into class-specific regions in the embedding space, while the base classes and the virtual classes are uniformly positioned within that space. (The loss functions for the base session are detailed in Appendix A). This design enables FACT to preserve the embedding representations of previously learned classes, thereby alleviating catastrophic forgetting. FACT+ inherits this mechanism to maintain stable performance across incremental updates.
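As a rough illustration of how virtual-class data can be synthesized, the sketch below applies a mixup-style interpolation to embeddings of two different base classes. This is a schematic example rather than FACT's exact procedure: manifold mixup [32] mixes hidden representations inside the network, whereas here, for simplicity, final embeddings are mixed, and all names and parameter values are illustrative.

```python
import numpy as np

def mix_embeddings(z_a, z_b, alpha=2.0, rng=None):
    """Create a 'virtual' sample by interpolating two embeddings from different base classes.

    z_a, z_b: embeddings of shape (d,), taken from two distinct base classes
    alpha:    Beta-distribution parameter controlling how strongly the two are mixed
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient lambda in (0, 1)
    return lam * z_a + (1.0 - lam) * z_b  # interpolated embedding used as virtual-class data

# Toy usage with random 8-dimensional embeddings
rng = np.random.default_rng(1)
z_cat, z_dog = rng.normal(size=8), rng.normal(size=8)
z_virtual = mix_embeddings(z_cat, z_dog, rng=rng)
print(z_virtual.shape)  # (8,)
```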
In the $b$-th incremental session, the model uses a small dataset $\mathcal{D}^b$ for the incremental classes $\mathcal{C}^b$ to compute the prototype $\mathbf{p}_c$ for each newly added class $c \in \mathcal{C}^b$ using Equation (5) (where $N_c$ is the number of data points for class $c$, and $\mathbb{1}[\cdot]$ is an indicator function).
Next, the probability that an input belongs to class $c$ is computed by Equations (7)–(11). Let $\mathbf{p}_c$ be the prototype of class $c$, $\alpha$ be the parameter that adjusts the influence of the virtual classes, and $\Sigma_c$ and $\Sigma_k$ be the covariance matrices (assumed to be identity matrices) for class $c$ and a virtual class $k$, respectively. In FACT, both variance parameters are fixed to 1 and never updated, remaining constant across the base session, incremental sessions, and inference. Here, $\mathcal{N}(\cdot \mid \mathbf{p}_c, \Sigma_c)$ denotes a multivariate Gaussian distribution with mean $\mathbf{p}_c$ and covariance $\Sigma_c$.
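A minimal sketch of this kind of prototype-based scoring is given below: each class is scored by an isotropic Gaussian likelihood around its prototype and the scores are normalized with a softmax. It is a simplified stand-in for Equations (7)–(11) (which additionally account for the virtual classes), and all function and variable names are chosen for illustration only.

```python
import numpy as np

def class_probabilities(z, prototypes, variances):
    """Score an embedding z against per-class isotropic Gaussians N(p_c, sigma_c^2 I).

    z:          embedding of shape (d,)
    prototypes: array of shape (C, d), one prototype per class
    variances:  array of shape (C,), scalar variance sigma_c^2 per class (all ones in FACT)
    """
    d = z.shape[0]
    sq_dist = np.sum((prototypes - z) ** 2, axis=1)           # squared distance to each prototype
    log_lik = -0.5 * (sq_dist / variances + d * np.log(2 * np.pi * variances))
    log_lik -= log_lik.max()                                   # numerical stabilization before softmax
    probs = np.exp(log_lik)
    return probs / probs.sum()

# Toy usage: 3 classes in an 8-dimensional embedding space
rng = np.random.default_rng(0)
P = rng.normal(size=(3, 8))
sigma2 = np.ones(3)                                            # FACT's identity-covariance assumption
print(class_probabilities(P[1] + 0.1 * rng.normal(size=8), P, sigma2))
```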
2.5. Prototype Learning
Prototype learning is a method that classifies input data based on the similarity between the input data and representative feature vectors for each class (prototypes). Generally, the prototype $\mathbf{p}_c$ for a class $c$ is defined by Equation (12), where $\mathbf{x}$ is an input sample, $\mathcal{X}_c$ is the set of samples belonging to class $c$, and $\phi$ is the encoder mapping the input to the embedding space.
While traditional supervised learning models in classification tasks can sometimes overfit to specific training data, prototype learning often demonstrates superior generalization performance by referencing the representative prototype for each class [33].
3. Proposed Methodology
In this chapter, we present our proposed method, FACT+. Building upon the baseline FACT model, we introduce two key improvements to enhance performance. First, we model each class with a flexible embedding region by learning a class-specific variance, which captures the semantic diversity of the data. Second, we introduce a mechanism to update the model using post-classification data that becomes available during real-world deployment. This allows us to continuously improve the model’s performance by alleviating data scarcity and bias for incremental classes.
3.1. Motivation
In our proposed method (FACT+), we adopt the core approach of FACT, which allows for minimal updates when new classes are added in incremental sessions instead of retraining the entire model. Building on this framework, we aim to (1) flexibly secure embedding regions that consider the semantic diversity of each class. To define these regions, various approaches can be considered, such as using a multivariate normal distribution [34] or a box-based representation [35]. This paper adopts the former approach, as it enables us to model not only the region's extent but also the uncertainty of the embedding, which is a crucial aspect for handling diverse class concepts. Our second objective is to (2) introduce a mechanism to leverage post-classification data for model updates, thereby alleviating data scarcity and biases associated with incremental classes. This design enables improving classification performance during model deployment—an essential capability for real-world business environments where classes are added incrementally and data continually accumulates. In the base session, the class-specific variance parameters are treated as trainable positive scalars, whereas in incremental sessions they are computed by formulas introduced later in this paper.
Figure 1 illustrates a comparison between the proposed method (FACT+) and the original FACT framework. In terms of objective (2), FACT discards the data acquired during inference without using it for model updates, as shown in Figure 1a. In contrast, FACT+ incorporates a mechanism to update only the incremental classes (indicated by the blue circles in Figure 1b), while keeping the base classes fixed (black circles in Figure 1b).
3.2. Incorporating Flexible Variance
We define the covariance matrix of a base class $c$ and that of an incremental class $c'$ as $\Sigma_c = \sigma_c^2 I_d$ and $\Sigma_{c'} = \sigma_{c'}^2 I_d$, respectively ($\sigma_c, \sigma_{c'} > 0$; $I_d$: the $d \times d$ identity matrix). The variance $\sigma_c^2$ of base class $c$ is treated as a learnable parameter from the base session, and the variance $\sigma_{c'}^2$ of incremental class $c'$ is computed in the $b$-th incremental session from the few-shot embeddings of that class using Equation (13).
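As a concrete illustration of estimating such a class-specific scalar variance from few-shot data, the sketch below uses the empirical variance of the embeddings around the class prototype, averaged over dimensions. This estimator is an assumption made for illustration; Equation (13) may differ in detail (e.g., bias correction or shrinkage toward 1).

```python
import numpy as np

def estimate_class_variance(embeddings: np.ndarray, prototype: np.ndarray) -> float:
    """Estimate a scalar variance sigma^2 for one class under Sigma = sigma^2 * I_d.

    embeddings: array of shape (N_c, d), few-shot embeddings of the class
    prototype:  array of shape (d,), the class prototype (mean embedding)
    """
    diffs = embeddings - prototype
    # Average squared deviation per dimension; with Sigma = sigma^2 * I_d a single
    # scalar summarizes the spread of the class region.
    return float(np.mean(diffs ** 2))

# Toy usage: a 5-shot class in an 8-dimensional embedding space
rng = np.random.default_rng(0)
z = rng.normal(loc=0.0, scale=0.7, size=(5, 8))
p = z.mean(axis=0)
print(round(estimate_class_variance(z, p), 3))   # scalar variance estimate for this class
```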
3.3. Model Update Using Post-Classification Data
While the model is in operation, we collect labeled data for incremental classes after classification and use it to update the model parameters.
More concretely, consider updating the model in the $b$-th incremental session. Let the set of incremental classes for which data annotation is performed be denoted $\mathcal{A}^b$. During the inference session immediately following the $b$-th incremental session, the input data consist of samples from both base classes and the newly added incremental classes. Each data point is labeled either with one of the classes in $\mathcal{A}^b$ or as "[N/A]", where "[N/A]" indicates that the data point does not belong to any of the classes in $\mathcal{A}^b$. Consequently, data belonging to the base classes are labeled as "[N/A]" and are not used for model updates. If the assigned label is not "[N/A]", we use that data point for model updates. (In practice, we explore multiple scenarios for selecting which data points to use for model updates, but we introduce only one scenario here due to space constraints.)
Suppose $m$ data points are labeled with an incremental class $c'$. Using Equations (5) and (13), we compute new estimates of the prototype and variance from these $m$ points, denoted $\hat{\mathbf{p}}_{c'}$ and $\hat{\sigma}_{c'}^2$. We then update $\mathbf{p}_{c'}$ and $\sigma_{c'}^2$ by combining the previous estimates with the new ones, with weights determined by the numbers of data points on which each estimate is based. Here, $N_{c'}$ denotes the cumulative number of data points observed for class $c'$ before the $b$-th session.
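The sketch below illustrates one plausible form of such a count-weighted running update for the prototype and variance of an incremental class. The exact update rule of FACT+ is given by the equations referenced above; the weighted-average form, function, and variable names here are assumptions made for illustration.

```python
import numpy as np

def update_class_statistics(p_old, var_old, n_old, new_embeddings):
    """Refine a class prototype and scalar variance with newly labeled deployment data.

    p_old:          previous prototype, shape (d,)
    var_old:        previous scalar variance sigma^2
    n_old:          cumulative number of data points observed for the class so far
    new_embeddings: array of shape (m, d), embeddings of the newly labeled data points
    """
    m = new_embeddings.shape[0]
    p_new = new_embeddings.mean(axis=0)                       # new prototype estimate (cf. Eq. (5))
    var_new = float(np.mean((new_embeddings - p_new) ** 2))   # new variance estimate (cf. Eq. (13))

    w_old, w_new = n_old / (n_old + m), m / (n_old + m)       # count-based weights
    p_updated = w_old * p_old + w_new * p_new
    var_updated = w_old * var_old + w_new * var_new
    return p_updated, var_updated, n_old + m

# Toy usage: refine a 5-shot class with 3 additional labeled points
rng = np.random.default_rng(0)
p0, v0 = rng.normal(size=8), 1.0
p1, v1, n1 = update_class_statistics(p0, v0, n_old=5, new_embeddings=rng.normal(size=(3, 8)))
print(p1.shape, round(v1, 3), n1)
```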
4. Experimental Evaluation
In this chapter, we conduct experiments to evaluate the effectiveness of the proposed method on a multi-class classification task under the CIL setting.
4.1. Experimental Setting
The dataset used in this experiment is CIFAR-100 [36], an image dataset consisting of color photographs of objects such as animals, plants, equipment, and vehicles. Table 1 shows a list of the classes in CIFAR-100. Following previous work [15], we split the dataset so that classes with IDs 0–59 are treated as base classes, and those with IDs 60–99 are incremental classes. Each base class contains 500 data points, and each new incremental class introduced for the first time in any incremental session contains five data points. We add five classes per incremental session, for a total of eight incremental sessions, and evaluate accuracy across all nine sessions (including the base session).
We train for 600 epochs with a batch size of 256 and set the learning rate in the base session to 0.01. We use stochastic gradient descent [37] for optimization and adopt ResNet20 [38] as the encoder. All experiments were conducted on a workstation equipped with an Intel (Santa Clara, CA, USA) Core i9-9900K CPU (3.6 GHz, 8 cores), an NVIDIA (Santa Clara, CA, USA) GeForce GTX 1660 SUPER GPU (6 GB VRAM), and 64 GB of system memory.
To investigate the significance of each proposed component through an ablation study, we compare four models whose configurations are summarized in Table 2.
4.2. Experimental Results
Figure 2 shows the classification accuracy of each model. The base session is labeled as session 0, and incremental sessions are labeled as 1–8.
From Figure 2, we see that FACT+ w/o update surpasses FACT in some sessions (sessions 0, 1, 4, and 7). Moreover, the proposed method (FACT+) outperforms FACT in 8 out of 9 sessions. This suggests that introducing flexible variance improves accuracy for some classes, and using deployment data to update the parameters effectively supplements knowledge about incremental classes with limited initial data, helping maintain higher accuracy in later sessions.
Next, we visualize the embedding space learned by FACT+ to examine how class prototypes are organized after training and incremental updates. Figure 3 shows the prototypes categorized by their class type (base, incremental, and virtual), whereas Figure 4 groups them by the 20 superclasses defined in CIFAR-100. Both figures present two-dimensional UMAP [39] projections of the learned prototypes.
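For reference, a minimal sketch of producing such a projection with the umap-learn package is shown below; the prototype matrix and labels are random placeholders standing in for the learned FACT+ prototypes.

```python
import numpy as np
import umap                     # pip install umap-learn
import matplotlib.pyplot as plt

# Placeholder prototypes: 60 base + 40 incremental + 40 virtual vectors in a 64-dim space
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(140, 64))
kinds = np.array(["base"] * 60 + ["incremental"] * 40 + ["virtual"] * 40)

# Project the prototypes to two dimensions with UMAP [39]
coords = umap.UMAP(n_components=2, random_state=0).fit_transform(prototypes)

for kind, marker in [("base", "o"), ("incremental", "s"), ("virtual", "x")]:
    sel = kinds == kind
    plt.scatter(coords[sel, 0], coords[sel, 1], marker=marker, label=kind, alpha=0.7)
plt.legend()
plt.title("UMAP projection of class prototypes")
plt.show()
```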
In Figure 3, blue, purple, and gray points denote the prototypes of base, incremental, and virtual classes, respectively. Observing Figure 3, we see that semantically related base and incremental classes are embedded in close proximity within the embedding space. While the base classes are sufficiently learned by the model, the incremental classes are not explicitly trained; their prototypes are instead obtained solely through the pretrained encoder. Even with the introduction of class-specific variances in the proposed method, no distortion or bias was observed in the spatial arrangement of either the base or incremental classes. These observations indicate that the validity of the embedding space obtained through FACT+ is well preserved.
Figure 4 colors each class prototype according to the 20 superclasses defined in CIFAR-100. The color legend corresponds to the 20 superclasses listed in Table 1, and the order of the legend items is consistent with the superclass IDs in that table (e.g., 0: aquatic mammals, 1: fish, …, 19: vehicles 2). Prototypes belonging to the same superclass tend to be embedded in close proximity within the embedding space. For instance, food containers (center-left), household furniture (upper-left), people (lower-right), and trees (upper-center) each form a compact cluster. The virtual prototypes (gray "×") are interspersed across regions, bridging semantically related classes without forming a dominant cluster.
These results confirm that FACT+ effectively captures semantic consistency across both previously learned and newly introduced classes, producing an embedding space that preserves meaningful organization.
5. Discussion
In this chapter, we conduct additional experiments and analyses under various conditions to more comprehensively evaluate our proposed method, FACT+.
In Section 5.1, we compare the computational complexity, memory cost, and runtime of FACT and FACT+. Then, in Section 5.2 and Section 5.3, we compare the classification accuracy of each model on characteristic datasets (data with different class hierarchies and data where the features of base and incremental classes differ significantly). In Section 5.4, we compare the accuracy of our proposed model with a model in which the covariance matrices for virtual classes are also variable, in addition to those for base and incremental classes; this allows us to verify the validity of varying only the covariance matrices for base and incremental classes. Finally, in Section 5.5, we discuss the model's sensitivity to the quality and amount of both training and post-deployment data.
5.1. Computational Complexity, Memory Cost, and Runtime Analysis
The number of model parameters for FACT and FACT+ is compared in Table 3. Introducing class-specific covariance (variance) parameters in FACT+ leads to only a marginal increase in parameter count (an increase of only 0.00036%). Accordingly, the memory cost and computational load remain almost unchanged in theory. However, FACT+ involves the Mahalanobis distance calculation, which requires additional matrix inversion steps during both training and inference.
To quantify this additional cost, we also measured the actual computation times. For the base session (training on Session 0), FACT required approximately 24,169 s (≈6.7 h), while FACT+ took 24,436 s (≈6.8 h)—an increase of only about 1%. For inference across the incremental sessions (Sessions 1–8), the total computation time was 611 s (≈10 min) for FACT and 800 s (≈13 min) for FACT+, reflecting a modest rise mainly due to the Mahalanobis distance computation during prediction. Overall, the additional complexity in FACT+ introduces only a minor runtime overhead while maintaining nearly identical memory consumption to FACT.
While this additional computation time is relatively small, it may become non-negligible for large-scale datasets. Exploring optimization strategies to reduce this overhead remains a promising direction for future research.
5.2. Validation with Data of Varying Class Granularity
5.2.1. Experimental Setting
The CIFAR-100 dataset is organized into 20 superclasses, each consisting of five subclasses (100 classes in total). We first select 10 superclasses and treat them as higher-level (large-grained) classes, and take the 50 subclasses under the remaining 10 superclasses as lower-level (small-grained) classes. Our training dataset is then constructed to include both large-grained (higher-level) classes and small-grained (lower-level) classes. Specifically, the base session includes six higher-level classes and 30 lower-level classes, and each incremental session adds one new higher-level class and five new lower-level classes.
We simulate a scenario of one base session followed by four incremental sessions, each adding six classes. Regardless of class granularity, each base class has 500 data points, and each incremental class has five data points, randomly chosen. Other settings (optimizer, encoder, etc.) follow Section 4.1, and we use FACT+ as the model.
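The following sketch shows one way such a mixed-granularity split could be constructed from the CIFAR-100 superclass structure; the superclass-to-subclass mapping, the shuffling seed, and the resulting assignment are placeholders, not the exact split used in our experiments.

```python
import random

# Placeholder superclass structure: 20 superclasses x 5 subclasses, mirroring CIFAR-100.
# The real names from Table 1 would be substituted here; these identifiers are illustrative.
superclass_to_subclasses = {
    f"superclass_{i}": [f"subclass_{i}_{j}" for j in range(5)] for i in range(20)
}

random.seed(0)
supers = list(superclass_to_subclasses)
random.shuffle(supers)
higher_level = supers[:10]                                   # 10 superclasses used as coarse classes
lower_level = [sub for s in supers[10:]                      # the 50 subclasses of the other 10,
               for sub in superclass_to_subclasses[s]]       # used as fine-grained classes

# Base session: 6 higher-level + 30 lower-level classes;
# each of the 4 incremental sessions adds 1 higher-level + 5 lower-level classes.
base_classes = higher_level[:6] + lower_level[:30]
incremental_sessions = [
    [higher_level[6 + b]] + lower_level[30 + 5 * b: 35 + 5 * b]
    for b in range(4)
]
print(len(base_classes), [len(s) for s in incremental_sessions])   # 36 [6, 6, 6, 6]
```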
5.2.2. Experimental Results
Figure 5 shows the accuracy comparisons of each model. Session 0 is the base session, and sessions 1–4 are incremental.
From Figure 5, FACT+ outperforms FACT in all sessions. Comparing FACT+ with FACT+ w/o dynamic-var, both of which reuse data, FACT+ exhibits higher accuracy in four of the five sessions. This indicates that flexible variance is effective when classes differ significantly in their semantic diversity.
Next, Figure 6 shows the distribution of variance values for each class. From Figure 6, the variances for base classes are generally smaller for lower-level classes and larger for higher-level classes; at the end of session 8 in FACT+, the average variance of the higher-level base classes exceeds that of the lower-level base classes, and Figure 6 reports the corresponding average variances for the higher- and lower-level incremental classes as well. Notably, for the base classes in particular, the method successfully infers a larger variance for classes with broader semantic diversity and a smaller variance for classes with narrower diversity.
Combining these findings with the results in Section 4.2 ("flexible variance boosts accuracy for certain classes"), it can be inferred that effective estimation of variance has a positive impact on predictive performance.
However, as shown in Figure 6, not all superclasses had larger variances than all subclasses. One possible reason is the limited number of data points for incremental classes. In this experiment, following Zhou et al. [15], the number of data points for superclasses was matched to that of subclasses, meaning that an incremental class newly added to the model in session b has only five data points. This implies that, among the five subclasses grouped as a superclass, some may not be represented in the incremental session's data at all, suggesting that there are insufficient data points for calculating incremental class parameters.
In the experiments in Section 5.2.2, our aim was to evaluate the model's robustness under scenarios where various types of classes could be introduced. Note, however, that the proposed model does not currently encode hierarchical class relationships. In real-world scenarios, incremental classes may appear in strict inclusion or hierarchical relationships with previously learned base classes. Therefore, integrating the proposed approach with models that explicitly encode hierarchical or nested structures—such as box-based hierarchical embeddings [40,41]—remains an open and promising direction for future investigation.
5.3. Validation with Datasets Where Base and Incremental Class Features Differ Significantly
This section verifies that the proposed method's mechanism functions effectively in situations where the features of base classes and incremental classes differ significantly. Under the experimental conditions of Section 4.1, CIFAR-100 class numbers were assigned randomly regardless of class category, so there was no bias toward similar classes being grouped into either base or incremental classes. However, in practical applications, the types of base and incremental classes may be entirely different, potentially leading to inaccurate embedding representations for incremental classes if the encoder trained on base classes cannot adequately handle them. For example, a user who has previously photographed and stored many images of people might suddenly become interested in trains or airplanes; as photos of these new subjects accumulate in their folders, they need to be appropriately tagged. To simulate such a scenario, this experiment artificially assigns base and incremental classes within CIFAR-100 to reproduce a dataset where the features of base and incremental classes differ significantly.
5.3.1. Experimental Setting
Based on the 20 superclasses of CIFAR-100, as shown in Table 1, the dataset was split such that the base classes comprised natural objects such as animals and plants (superclasses 0, 1, 2, 3, 4, 7, 8, 9, 10, 11, 12, 13), and the incremental classes comprised people or artificial objects (superclasses 5, 6, 14, 15, 16, 17, 18, 19). This setup establishes a configuration where the types of base and incremental classes differ significantly. Other experimental conditions are the same as those described in Section 4.1.
5.3.2. Experimental Results
Figure 7 shows the comparison of accuracy for each model. From Figure 7, it can be observed that FACT+ w/o dynamic-var outperforms FACT in classification accuracy in eight out of nine sessions. This suggests that incorporating a data reuse mechanism can suppress the decrease in accuracy to some extent, even when incremental classes with low similarity to base classes appear. On the other hand, FACT+ surpassed FACT's accuracy only in sessions 6 and 7. This result suggests that while the data reuse mechanism functions well when base and incremental class features differ significantly, as in this experiment, introducing dynamic variance is not effective in such cases.
5.4. Application of Dynamic Variance to Virtual Classes
In the proposed method, the covariance matrices of base and incremental classes were variable, but the covariance matrix of the virtual classes was fixed as an identity matrix. This section investigates how making the covariance matrix of the virtual classes variable as well affects accuracy and discusses the contributing factors.
5.4.1. Experimental Setting
In this experiment, a model with a variable covariance matrix for virtual classes (FACT+ w/o update w/ variable-$\sigma_k$) is introduced as a new comparison method. In this model, in addition to $\Sigma_c = \sigma_c^2 I_d$ and $\Sigma_{c'} = \sigma_{c'}^2 I_d$ ($\sigma_c, \sigma_{c'} > 0$; $I_d$: the $d \times d$ identity matrix), the covariance matrix of a virtual class $k$ is defined as $\Sigma_k = \sigma_k^2 I_d$. The variances of the base classes $\sigma_c^2$ and of the virtual classes $\sigma_k^2$ are learnable parameters in the base session, and the variances of the incremental classes are calculated in the incremental sessions. In this experiment, FACT+ w/o update is used as the comparison model, and the data reuse mechanism is not introduced; only the variability of the variance is manipulated. Other experimental conditions are the same as those described in Section 4.1.
5.4.2. Experimental Results
Figure 8 shows the comparison of accuracy for each model. From Figure 8, FACT+ w/o update w/ variable-$\sigma_k$ underperformed FACT+ w/o update in eight out of nine sessions.
The following two points are inferred as the reasons why the accuracy trend of FACT+ w/o update w/ variable-$\sigma_k$ was unstable and no improvement in accuracy was observed. The first point is the increase in the number of learnable parameters. Generally, as the number of learnable parameters in a machine learning model increases, its expressive power improves, but the possibility of overfitting to the training data also increases [42]. Moreover, learning can become unstable as the number of learnable parameters increases [38,43]. The second point is the structure of the likelihood calculation in the proposed method. The probability density function of the normal distribution, given by Equations (9) and (10), includes the determinant of the covariance matrix in its denominator. Specifically, it contains the structure shown in Equation (16):
$$\frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}.$$
With $\Sigma = \sigma^2 I_d$, this term involves $|\Sigma|^{1/2} = \sigma^{d}$, so deviations of $\sigma$ from 1 are raised to the power of the embedding dimension. It is known that when the embedding $\mathbf{z}$ is high-dimensional, the likelihood calculation in Equation (16) becomes unstable due to the curse of dimensionality [44,45,46].
Therefore, it is inferred that the magnitude of the variance significantly affects likelihood calculation, leading to instability, increased misclassifications, and thus a decline in classification performance. Fundamentally, unlike base and incremental classes, virtual classes are not tied to specific concrete classes, so considering the magnitude of their variance is thought to offer few advantages. Based on this result, the proposed method fixes the variance of virtual classes to an identity matrix.
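The numerical sketch below illustrates the scale of this effect under the isotropic-covariance assumption: the normalization term $\sigma^{d}$ quickly explodes or vanishes as the embedding dimension grows, so likelihoods computed with even slightly different variances become difficult to compare. The dimensions and variance values are chosen purely for illustration.

```python
import numpy as np

# For Sigma = sigma^2 * I_d, the Gaussian normalization constant is proportional to sigma**(-d).
# Compare two variance values across embedding dimensions.
for d in (2, 16, 64, 256):
    for sigma in (0.9, 1.1):
        log_norm = -d * np.log(sigma)          # log of sigma**(-d)
        print(f"d={d:4d}  sigma={sigma}  sigma**(-d) = {np.exp(log_norm):.3e}")
```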
5.5. Noise Sensitivity and Data Quality
Regarding image noise, CIFAR-100 [36] consists of real-world photographs annotated with class labels; thus, some level of variation inherent in everyday image capture (e.g., lighting, background, or angle) is already reflected in the dataset. For label noise, CIFAR-100 is also known to contain mislabeled samples [47]. The proposed model maintained stable performance under these conditions, suggesting a certain degree of robustness to moderate noise in both inputs and labels.
Furthermore, the effectiveness of continual updates may depend on both the amount and quality of post-deployment data. Although this study mainly focused on verifying the model's update mechanism, it is reasonable to assume that the quantity of update data would influence performance. Similarly, data quality—specifically, the informativeness and representativeness of samples—could have a comparable effect. This aspect is discussed further in relation to active learning strategies in Section 6.
6. Conclusions and Future Work
In this work, we proposed FACT+, a novel model that extends the Forward Compatible Few-Shot Class-Incremental Learning (FACT) framework with two key improvements.
First, we addressed the limitation of FACT’s assumption that all classes have a uniform Gaussian distribution in the embedding space, which hinders the model’s ability to capture the semantic diversity of each class. To overcome this, FACT+ introduces learnable variance–covariance matrices for each class’s embedding region. This flexibility allows the model to better reflect the true semantic scope of each class, from broad categories to narrow sub-classes.
Second, we tackled the challenge of improving classification performance for incremental classes after the initial few-shot learning. The original FACT model, once deployed, cannot refine its knowledge of new classes as more data becomes available. In contrast, FACT+ incorporates a mechanism for dynamically updating class prototypes and their variances using labeled data obtained during deployment. This capability allows the model to continuously adapt to and refine its understanding of new classes.
To validate the effectiveness of these improvements, we conducted a series of comprehensive experiments. We first confirmed that FACT+ achieves higher overall accuracy across most sessions compared to the original FACT model. This success is attributed to our model’s ability to secure embedding regions that accurately reflect the semantic diversity of each class and its capability to continuously supplement knowledge about incremental classes during deployment.
Future directions include model compression and exploring more effective annotation strategies. Since introducing flexible variance in the proposed model increases the number of parameters in proportion to the number of classes and necessitates additional variance calculations, reducing memory usage and computational overhead is desirable. Lightweight compression approaches such as parameter pruning, low-rank factorization, or knowledge distillation could be explored to reduce memory and computational requirements without significant loss of accuracy. Although their feasibility within the proposed framework has not yet been verified, this represents a promising direction worth investigating to improve scalability.
Additionally, incorporating active learning techniques has the potential to further enhance the effectiveness of model updates. In this study, we introduced a mechanism that reuses post-deployment data for continual model updates. However, in the current implementation, all newly collected data are treated equally, without considering their informativeness or reliability.
As discussed above, the effectiveness of updates may also depend on the quality of the data used—particularly, whether the samples provide valuable information for refining class boundaries or reinforcing representative prototypes. Active learning offers a promising way to address this issue by selectively leveraging such high-value samples. For instance, prioritizing uncertain or boundary samples may strengthen class separation, whereas selecting representative samples near class prototypes could improve stability during updates. While the practical impact of these strategies remains to be validated, they represent a natural extension of the current framework toward more efficient and robust continual adaptation.
Finally, although our experiments considered classes of different granularity that do not form a strict hierarchical or inclusion relationship, testing the proposed method under scenarios where classes are in an inclusion relationship remains an important area for future investigation.