Article

Semi-Supervised Few-Shot Incremental Learning with k-Probabilistic Principal Component Analysis

Department of Statistics, Florida State University, Tallahassee, FL 32304, USA
* Author to whom correspondence should be addressed.
Electronics 2024, 13(24), 5000; https://doi.org/10.3390/electronics13245000
Submission received: 22 November 2024 / Revised: 11 December 2024 / Accepted: 16 December 2024 / Published: 19 December 2024

Abstract

This paper introduces a novel method for Semi-Supervised Few-Shot Class Incremental Learning (SSFSCIL) that exhibits virtually no catastrophic forgetting. The method uses a generic feature extractor that was pretrained without supervision on a large image dataset, and a classifier based on a Probabilistic PCA (PPCA) model for each class instead of the standard fully connected layer usually employed as the projection head. The PPCA models are localized around the class means and the models for existing classes are not retrained when new classes are added. The learning algorithm is a modified k-Means that freezes the models on the existing classes and only updates models for the new classes. This makes the approach both computationally efficient and accurate. Extensive experiments on CUB200, CIFAR100, and miniImageNet show the effectiveness of the proposed approach. Additionally, experiments on the ImageNet-1k dataset, which previous methods have avoided due to its size, demonstrate its applicability to large-scale datasets.

1. Introduction

Class Incremental Learning (CIL) has become an active research topic alongside the development of deep learning. It aims to incrementally learn a unified classifier that recognizes all classes encountered during training. However, because labeled data are costly and scarce in practice, there is growing interest in making better use of the available labels. Against this background, the Few-Shot Class Incremental Learning (FSCIL) task [1] has recently emerged from CIL. An FSCIL model must sequentially incorporate new classes from limited labeled observations without forgetting how to discriminate the old classes. However, due to the limited number of labeled samples, FSCIL methods are more prone to overfitting and catastrophic forgetting than traditional CIL methods.
From the FSCIL perspective, as discussed in the Related Work subsection below, preserving representative memories [1,2,3,4] and knowledge distillation [5,6] are popular techniques for maintaining the representation capability for old classes and addressing catastrophic forgetting. To address the overfitting caused by the scarcity of labeled samples, existing methods have suggested various approaches: some have proposed decoupling the encoder from the classifier [2], while others have introduced node-adjustable networks [7].
Compared to labeled data, unlabeled data are less expensive and easier to access in the real world. Hence, it is natural to use them to improve performance in the incremental sessions. Cui et al. [6] first proposed Semi-Supervised Few-Shot Class-Incremental Learning (SSFSCIL), a sub-task of FSCIL. They utilized exemplars and knowledge distillation to mitigate catastrophic forgetting and alleviated overfitting with massive amounts of unlabeled samples. Among the other recent SSFSCIL works, FeSSSS [8] fuses features from self-supervised and supervised pretrained feature generators through a multi-layer perceptron to overcome overfitting.
These neural network-based SSFSCIL methods achieved remarkable performance on popular FSCIL benchmarks. However, retraining neural networks [2,3,5,6,8,9,10,11] can be difficult due to several factors, e.g., continuous adaptation to the data stream, data imbalance, privacy concerns, computational effort, etc. To reduce the inconvenience of retraining and lessen catastrophic forgetting, this paper proposes a novel SSFSCIL framework, termed k-Probabilistic Principal Component Analyzers (k-PPCAs). The code is available at https://github.com/barbua/KPPCA (accessed on 21 November 2024). First, the proposed method models each class embedding in the feature space with a Gaussian distribution reconstructed from a lower-dimensional subspace. Specifically, each class is modeled by an individual Probabilistic PCA (PPCA) [12], and all classes are organized as a collection of PPCAs, which is essentially a mixture of Gaussians [13]. As opposed to the standard representations based on fully connected linear projection heads, this PPCA-based representation is localized in the feature space, easily allowing other classes to be added without changing the models of the existing classes. When new classes are added, the PPCA models of the old classes do not need to be retrained, and catastrophic forgetting can be significantly alleviated. Second, a Mahalanobis distance-based k-Means is introduced to perform parameter estimation and classification for the FSCIL problem. This modification takes advantage of the Gaussian nature of the class models, which can be clustered, in contrast to the standard representations of linear projection head classifiers. Furthermore, this metric-based classifier can help identify out-of-distribution samples. Third, the framework uses a generic pretrained feature extractor that is frozen and static during training. This design can take advantage of recent developments in self-supervised learning algorithms, and decoupling the representation learner from the classifier has also been shown to mitigate catastrophic forgetting [2,4,8]. To demonstrate the effectiveness of the proposed method, extensive experiments are performed on four benchmark datasets: CUB200, CIFAR100, miniImageNet, and ILSVRC-2012 (ImageNet-1k). The proposed method outperforms most of the state-of-the-art SSFSCIL methods on the first three benchmarks. To the best of our knowledge, no SSFSCIL method has been evaluated on the ImageNet-1k dataset due to its large scale; the proposed k-PPCAs achieve an accuracy of 58.44% on it.
The main contributions of the paper are as follows:
  • It proposes a semi-supervised FSCIL framework named k-PPCAs that avoids the retraining of old classes so that catastrophic forgetting can be significantly reduced when new classes are incorporated.
  • A Mahalanobis distance-based k-Means classifier is adapted for the FSCIL task, which can capture the shape information of the class embeddings and classify out-of-distribution samples as “unknown” via a low-confidence score.
  • It performs a comprehensive comparison between the proposed method and state-of-the-art semi-supervised FSCIL methods on three popular FSCIL benchmarks: CUB200, CIFAR100, and miniImageNet. The results show that the proposed method outperforms most of the other methods.
  • It conducts experiments on the large-scale dataset ImageNet-1k, which is rarely evaluated for the FSCIL task.

Related Work

Because FSCIL is a subdomain of class-incremental learning, this section starts with related work on class-incremental learning.
Class-Incremental Learning (CIL). Class-incremental learning aims to train a unified classifier to recognize new incremental classes without forgetting the old class representations. There are two popular ways to mitigate catastrophic forgetting: exploring better exemplars for old classes and modifying the loss function. iCaRL [14] dynamically updates important exemplars and adopts the nearest-class-mean rule [15] for classification. Nakata et al. [16] proposed a k-Nearest Neighbor (KNN) classifier based on the zero-shot pretrained feature extractor CLIP [17], which achieves state-of-the-art results on several popular benchmarks. EEIL [18] introduces a knowledge distillation loss in an end-to-end framework. NCM [19] employs three rectification components in the loss function to discriminate the old and new classes by magnitude and spatial orientation.
Few-Shot Class-Incremental Learning (FSCIL). FSCIL is a recently active research area derived from CIL, aiming to classify inputs from few-shot labeled samples. Two major strategies are employed by recent FSCIL methods. The first is knowledge representation learning and refinement. TOPIC [1] incorporates a neural gas network [20] to obtain and store the topology structure of the feature space. Zhang et al. [2] proposed the Continually Evolved Classifier (CEC), which decouples representation learning from classification and employs a graph model as the representation learner to propagate the global context between old and new sessions. Self-Promoted Prototype Refinement (SPPR) [3] constructs a cosine-similarity-based relation matrix between old and new classes, which acts as a transitional coefficient for adapting old prototypes in the new feature space. The ALICE [4] framework incorporates the angular penalty loss to adapt the feature extractor and obtain well-clustered features. The other strategy is knowledge distillation. Dong et al. [5] proposed ERL++, an exemplar relation graph, to preserve the angular structural relations of old classes, with a distillation loss used to retain the relation information in the new session.
Semi-Supervised FSCIL (SSFSCIL). SSFSCIL algorithms introduce additional unlabeled data to the FSCIL task to mitigate overfitting. From a representation learning perspective, FeSSSS [8] combines a self-supervised ResNet50 encoder with a pretrained supervised ResNet18 model through a feature fusion design that concatenates the outputs of the two feature extractors. Kalla and Biswas [9] proposed S3C, which uses a stochastic classifier with a self-supervised feature extractor whose weights are kept frozen in the incremental sessions to reduce catastrophic forgetting. There are also methods based on knowledge distillation. Cui et al. [6] introduced a detailed semi-supervised configuration for a distillation-based network that revises the network until all unlabeled data obtain high-confidence pseudo-labels; the unlabeled samples are then combined with the labeled data to enhance FSCIL performance. They further proposed Us-KD [10], which incorporates an uncertainty-guided component to filter out low-certainty unlabeled data and alleviate overfitting and noise during knowledge transfer. In their latest work [11], they observed that easily classifiable classes require fewer unlabeled samples to achieve high prediction accuracy, so a data selection method was devised to avoid contaminating well-learned classes with less reliable unlabeled data.

2. Materials and Methods

In this section, we present a method for semi-supervised few-shot class-incremental learning using probabilistic PCA (PPCA). First, an overview of PPCA is given, followed by the introduction of our proposed classifier, k-PPCAs, and its parameter estimation. Then the proposed algorithm for the SSFSCIL task is explained, along with a discussion of its computational complexity.

2.1. Probabilistic PCA Representation

Probabilistic PCA (PPCA) [12] models a data cluster with a Gaussian probability proxy, with mean $\mu$ and covariance matrix $\Sigma$ reconstructed from a lower-dimensional subspace.
PPCA derives from latent factor analysis. The latent factor model represents the data as a hyperplane plus noise, where a $d$-dimensional observable variable $x \in \mathbb{R}^d$ is approximated using a latent $q$-dimensional variable $t \in \mathbb{R}^q$ as follows:
$$x = W t + \mu + \epsilon, \qquad (1)$$
where the latent variable $t \sim N(0, I_q)$ and $\mu \in \mathbb{R}^d$ is the mean of the observable variable $x$. The transformation matrix $W \in \mathbb{R}^{d \times q}$ connects the $d$-dimensional space and the lower $q$-dimensional subspace, and $\epsilon$ represents independent and identically distributed (i.i.d.) Gaussian noise $N(0, \Psi)$ with a diagonal covariance matrix $\Psi$. Then the conditional distribution of $x$ given the latent variable $t$ is
$$x \mid t \sim N(W t + \mu, \Psi). \qquad (2)$$
The marginal distribution of $x$ can be obtained by integrating over the latent variable $t$, obtaining $x \sim N(\mu, \Sigma)$, where $\Sigma = W W^T + \Psi$.
To better capture the principal information of $x$ by a latent variable in a lower-dimensional subspace, Principal Component Analysis (PCA) [21,22] is naturally considered in the factor analysis. Tipping and Bishop [13] show that PCA arises from factor analysis when $\Psi = \sigma^2 I$ and the $d-q$ smallest eigenvalues of the sample covariance matrix are equal. In this respect, the covariance matrix $\Sigma$ above is defined as the combination of the reconstruction from the principal components that span the lower-dimensional space and the covariance of the noise term. It can be written as
$$\Sigma = L^T D^2 L + \lambda I_d, \qquad (3)$$
where $L^T$ is the matrix containing the first $q$ ($q < d$) principal eigenvectors as columns, and $D^2$ is the diagonal matrix with the first $q$ principal eigenvalues as diagonal elements. The parameter $\lambda > 0$ is a small number (e.g., $\lambda = 0.01$ in our experiments) representing the variance of the noise. From the singular value decomposition of the sample covariance matrix,
$$V S^2 V^T = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_k)(x_i - \mu_k)^T, \qquad (4)$$
it is clear that $L^T$ consists of the first $q$ columns of $V$, and $D^2$ is the upper-left $q \times q$ sub-matrix of $S^2$.
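As a concrete illustration, the following NumPy sketch fits the PPCA parameters of Equations (3) and (4) to a data matrix; the function names are ours, not from the paper.

```python
import numpy as np

def fit_ppca(X, q, lam=0.01):
    """Fit a PPCA model (Equations (3)-(4)) to the rows of X.

    X   : (n, d) feature matrix for one class
    q   : number of principal components (q < d)
    lam : noise variance lambda > 0
    Returns mu (d,), L (q, d) with the principal eigenvectors as rows,
    and s2 (q,) holding the first q eigenvalues (the diagonal of D^2).
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    # Sample covariance, Equation (4)
    Sigma = Xc.T @ Xc / (X.shape[0] - 1)
    V, eigvals, _ = np.linalg.svd(Sigma)  # columns of V are eigenvectors
    L = V[:, :q].T                        # L^T = first q columns of V
    s2 = eigvals[:q]                      # diagonal of D^2
    return mu, L, s2

def reconstruct_cov(L, s2, lam=0.01):
    """Rebuild Sigma = L^T D^2 L + lambda * I_d, Equation (3)."""
    d = L.shape[1]
    return L.T @ np.diag(s2) @ L + lam * np.eye(d)
```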

2.2. k-Means Classifier with the Mahalanobis Distance (k-PPCAs)

When using PPCA for classification, the classes are assumed to be mostly separable from each other in the feature space. Each class is modeled by a PPCA with its own parameters, and all classes together form a mixture of Gaussian distributions. Instead of classifying the data based on the posterior and estimating the parameters by EM [13], our approach performs estimation by k-Means with the Mahalanobis distance.
Suppose the data belonging to the $k$-th class follow a Gaussian distribution with mean $\mu_k$ and covariance $\Sigma_k$. Then, dropping the constant term $d \log(2\pi)$, twice the negative log-likelihood of observing data $x$ in a given class $k$ is
$$s_k(x) = -2 \log p(x \mid y = k) = \log |\Sigma_k| + (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k), \qquad (5)$$
where a smaller score corresponds to a higher likelihood.
Mahalanobis Distance: The Mahalanobis distance is a well-known measure of distance first introduced by P. C. Mahalanobis in 1936 [23]. It accounts for the correlations of the variables and measures the distance between a point and the mean of a distribution in units of the standard deviation of that distribution.
In our proposed k-PPCAs, the score function from Equation (5) can be simplified to the Mahalanobis distance given below:
$$r(x) = (x - \mu)^T \Sigma^{-1} (x - \mu). \qquad (6)$$
Given the PPCA parameters $\theta = (\mu, L, S)$ and denoting by $s \in \mathbb{R}^q$ the vector containing the first $q$ diagonal elements of $S$, the score (6) can be computed more efficiently using the following theorem.
Theorem 1 
([24]). The score (6) can also be computed as follows:
$$r(x; \mu, L, S) = \|x - \mu\|^2 / \lambda - \|u(x)\|^2 / \lambda, \qquad (7)$$
where $u(x) = \mathrm{diag}\left(s / \sqrt{s^2 + \lambda \mathbf{1}_q}\right) L (x - \mu)$, with the division and square root applied element-wise.
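To make the speedup of Theorem 1 concrete, here is a small NumPy sketch of Equation (7) together with a numerical check against the direct form of Equation (6); the test data and dimensions are our own assumptions.

```python
import numpy as np

def score(x, mu, L, s2, lam=0.01):
    """Mahalanobis score of Equation (7), avoiding the O(d^3) inversion.

    u(x) = diag(s / sqrt(s^2 + lam)) L (x - mu), where s^2 = diag(D^2).
    """
    z = x - mu
    u = (np.sqrt(s2) / np.sqrt(s2 + lam)) * (L @ z)
    return (z @ z) / lam - (u @ u) / lam

# Numerical check against the direct form of Equation (6)
rng = np.random.default_rng(0)
d, q, lam = 16, 4, 0.01
X = rng.normal(size=(200, d))
mu = X.mean(axis=0)
V, eigvals, _ = np.linalg.svd(np.cov(X, rowvar=False))
L, s2 = V[:, :q].T, eigvals[:q]
Sigma = L.T @ np.diag(s2) @ L + lam * np.eye(d)   # Equation (3)
x = rng.normal(size=d)
direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
assert np.isclose(score(x, mu, L, s2, lam), direct)
```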
Parameter Estimation of k-PPCAs: The k-Means clustering method [25] is a popular unsupervised learning algorithm. It divides observations into $k$ clusters by assigning each point to its closest cluster centroid and recomputing each centroid as the mean of the observations assigned to its cluster. In our proposed method, the Euclidean distance is replaced by the Mahalanobis distance to capture the shape information of the clusters in the feature space. To extend k-Means from clustering to classification, the clusters are initialized from the labeled samples in each session, as discussed in the following section.
For $n$ observations $x_1, \ldots, x_n$, the first step is to assign each of them to its closest cluster,
$$y_i = \arg\min_k r_k(x_i) = \arg\min_k (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k), \qquad (8)$$
and then update the parameters $(\hat\mu_k, \hat\Sigma_k)$ of each cluster,
$$\hat\mu_k = \frac{1}{n_k} \sum_{\{i \mid y_i = k\}} x_i, \qquad (9)$$
$$\hat\Sigma_k = \frac{1}{n_k - 1} \sum_{\{i \mid y_i = k\}} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T, \qquad (10)$$
where $n_k$ is the number of observations assigned to the $k$-th cluster. These two steps are iterated until the parameters converge. When incorporating the PPCA models into k-Means, the covariance matrix above is decomposed using Equation (4) and remodeled from the first $q$ principal components plus the probabilistic noise term, following Equation (3). It is formulated as follows:
$$\hat\Sigma_k = L_k^T D_k^2 L_k + \lambda I_d. \qquad (11)$$
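A minimal sketch of one such assignment-update iteration is shown below, reusing fit_ppca and score from the earlier sketches; the minimum-points guard for refitting is our own assumption, not part of the paper.

```python
import numpy as np

def kppca_iteration(X, models, q=10, lam=0.01):
    """One k-Means iteration with the Mahalanobis distance.

    X      : (n, d) feature matrix
    models : list of per-class PPCA parameters (mu, L, s2)
    Assigns each row to the class with the smallest score (Equation (8)),
    then refits each class model from its assigned points (Equations (9)-(11)).
    """
    scores = np.array([[score(x, mu, L, s2, lam) for (mu, L, s2) in models]
                       for x in X])      # (n, k) score matrix
    labels = scores.argmin(axis=1)       # Equation (8)
    new_models = []
    for k, model in enumerate(models):
        Xk = X[labels == k]
        # refit only when enough points were assigned; otherwise keep the old model
        new_models.append(fit_ppca(Xk, q, lam) if len(Xk) > q else model)
    return labels, new_models
```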

2.3. k-PPCAs for Semi-Supervised FSCIL

In a preprocessing step, a pretrained and frozen feature extractor is used to compute features from images or any other type of raw data, and the subsequent incremental learning steps are all based on these features.
For the base session and the successive incremental sessions, the training dataset $D_{train}$ includes a subset of labeled features $D_{train}^{(l)}$ and a much larger subset of unlabeled features $D_{train}^{(u)}$. To initialize a PPCA for each new class in the training set $D_{train}$, singular value decomposition (SVD) is applied to the covariance matrix of the labeled features of each class, and the PPCA parameters $\theta = (\mu, L, S)$ are obtained from PCA, as described in Algorithm 1.
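For illustration, a sketch of this preprocessing step is given below, with a torchvision ResNet18 standing in for the backbones of Section 3.2 (CLIP or ResNet encoders); the preprocessing constants follow standard ImageNet practice and are assumptions here.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Stand-in backbone; the paper uses CLIP or ResNet encoders (Section 3.2)
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()   # drop the classification head, keep features
backbone.eval()                     # frozen: never updated during the sessions
for p in backbone.parameters():
    p.requires_grad = False

preprocess = T.Compose([
    T.Resize(224),                  # resize the short edge, keep aspect ratio
    T.CenterCrop(224),              # central crop to a square input
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(images):
    """Map a batch of PIL images to d-dimensional feature vectors."""
    batch = torch.stack([preprocess(im) for im in images])
    return backbone(batch)          # (n, 512) for ResNet18
```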
Algorithm 1 Initialization with PCA
Input: Labeled dataset $D^{(l)} = \{D_1^{(l)}, D_2^{(l)}, \ldots, D_k^{(l)}, \ldots, D_K^{(l)}\}$ for all classes $k \in C$, where $C$ is the set of classes
Output: Initialized PPCA models $\theta_k = (\mu_k, L_k, S_k)$ for each class $k \in C$
1: for each class $k \in C$ do
2:    Obtain $\mu_k$ and $\Sigma_k$ from $D_k^{(l)}$ using Equations (9) and (10), respectively
3:    Obtain $S_k$ from $V_k S_k^2 V_k^T = \Sigma_k$ using SVD
4:    Obtain $L_k$ as the first $q$ columns of $V_k$
5: end for
After the PPCA model initialization, the set of existing classes $C_o$ is empty for the base session $D_{train,0}$, so all base session classes are considered for update, and the PPCA models are adapted for each class. The labeled and unlabeled data are processed in different ways. The labeled observations are directly assigned to the clusters corresponding to their labels. For the unlabeled observations, the Mahalanobis distance between the feature vector and each cluster is calculated as a score using Equation (7), and each unlabeled observation receives a pseudo-label indicating the cluster with the smallest score, that is, the assigned cluster. Then the mean and the covariance of each cluster are updated using Equations (9) and (10), respectively. By applying SVD to the updated covariance of each cluster, the updated PPCA parameters are obtained; e.g., for the $k$-th PPCA model, the parameters $\theta_k = (\mu_k, L_k, S_k)$ are obtained from $\hat\Sigma_k$ by Equation (4).
After training the base session, the updated PPCA models will be stored and frozen for the following incremental sessions, and the classes in the base session will be moved to  C o .
In the incremental sessions, the PPCAs of the new classes will be initialized and appended to the PPCA models inherited from the previous sessions. Then all existing PPCA models will be included in the score calculation and pseudo-label assignment for the new observations, but the PPCAs from the old classes will be kept frozen and only the new models will be updated. After training an incremental session, similar to the base session, all existing models are frozen for the next session and the classes in the current session will be treated as existing classes. The new PPCA models are continually added and updated along with new classes until all incremental sessions are processed. The flow chart of the k-PPCAs is illustrated in Figure 1 and the procedure is described in Algorithm 2.
In order to improve the robustness of our method, observations with abnormally large scores are identified as extreme values and removed from the model update step. Under the assumption that the scores in the $k$-th class follow a normal distribution, any observation whose score is greater than $\mu_{r_k} + 1.96\,\sigma_{r_k}$ is labeled as an extreme value, where $\mu_{r_k}$ and $\sigma_{r_k}$ are the mean and the standard deviation of the scores in the $k$-th class, respectively. Applying this criterion is approximately equivalent to filtering out the top 2.5% of observations with the greatest scores in the $k$-th class.
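A small sketch of this filter, under the same normality assumption, might look as follows (the function name is ours):

```python
import numpy as np

def extreme_value_mask(scores, tau=1.96):
    """Boolean mask of observations to KEEP for the model update.

    Under a normality assumption on the per-class scores, tau = 1.96
    filters out roughly the top 2.5% of scores (Section 2.3).
    """
    mu_r, sigma_r = scores.mean(), scores.std()
    return scores < mu_r + tau * sigma_r

# e.g., X_keep = X_k[extreme_value_mask(scores_k)]
```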
Algorithm 2 k-PPCAs
Input: Models $\theta_k = (\mu_k, L_k, S_k)$ for existing classes $k \in C_o$; new training dataset $D_{train} = \{D_{train}^{(l)}, D_{train}^{(u)}\}$ for the current incremental session, with labeled set $D_{train}^{(l)}$, unlabeled set $D_{train}^{(u)}$, and new classes $C_n$
Output: Updated PPCA models $\theta_k = (\mu_k, L_k, S_k)$ for each new class $k \in C_n$
1: for each new class $k \in C_n$ do
2:    Initialize $\theta_k = (\mu_k, L_k, S_k)$ by PCA with $q$ PCs on the labeled data from class $k$ using Equation (4)
3: end for
4: for $j = 1$ to $n_{iter}$ do
5:    Initialize running averages (RAVEs, Section 2.4) $R_k = (0, 0, 0)$, $k \in C_n$
6:    for each new observation $x \in D_{train}$ do
7:       if $x \in D_{train}^{(l)}$ then
8:          Set $k = y$ as the observation label $y$
9:       else
10:         Compute scores $r_k = r(x; \mu_k, L_k, S_k)$, $k \in C_n \cup C_o$
11:         Obtain label $k = \arg\min_k r_k$
12:      end if
13:      if $k \in C_n$ and $r_k < \mu_{r_k} + \tau \sigma_{r_k}$ then
14:         Add $x$ to $R_k$ using Equation (15)
15:      end if
16:   end for
17:   for $k \in C_n$ do
18:      Obtain $\mu_k$ and $\Sigma_k$ from $R_k$ using Equation (14)
19:      Obtain $S_k$ from $V_k S_k^2 V_k^T = \Sigma_k$ using SVD
20:      Obtain $L_k$ as the first $q$ columns of $V_k$
21:   end for
22: end for

2.4. Computing Covariance Matrices with Running Averages

In practice, lines 6–16 of Algorithm 2, which assign labels to the observations based on the Mahalanobis distance, are executed in mini-batches due to the scale of the covariance matrices in the PPCAs and the memory limitations of the GPU. However, the mean and the covariance (Equations (9) and (10)) are computed only after all observations have been assigned, by which time the mini-batches have been discarded. In our proposed method, an online learning technique, the running averages framework (RAVE) [26], is employed to compute and update the mean and the covariance batch-incrementally while screening the mini-batches.
Given a set of observations $x_i$, $i \in J$, we aim to compute $\hat\mu_J = \frac{1}{|J|} \sum_{i \in J} x_i$ and
$$\hat\Sigma_J = \frac{1}{|J| - 1} \sum_{i \in J} (x_i - \hat\mu_J)(x_i - \hat\mu_J)^T. \qquad (12)$$
This can be performed efficiently by maintaining the running averages $R_J = (|J|, S_x(J), S_{xx}(J))$, where
$$S_x(J) = \frac{1}{|J|} \sum_{i \in J} x_i, \qquad S_{xx}(J) = \frac{1}{|J|} \sum_{i \in J} x_i x_i^T. \qquad (13)$$
From these running averages we can obtain $\hat\mu_J = S_x(J)$ and
$$\hat\Sigma_J = \frac{|J|}{|J| - 1} \left( S_{xx}(J) - \hat\mu_J \hat\mu_J^T \right). \qquad (14)$$
The running averages can be updated incrementally, e.g., one observation at a time. For simplicity of notation, assume that $J = \{1, \ldots, n\}$ and denote $S_x(J) = S_x(n)$ and $S_{xx}(J) = S_{xx}(n)$. Then the running averages can be updated as follows:
$$S_x(n) = \frac{n-1}{n} S_x(n-1) + \frac{1}{n} x_n, \qquad (15)$$
and similarly for $S_{xx}(n)$.
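The following sketch maintains the running averages of Equations (13)-(15) for whole mini-batches at a time, so that each batch can be discarded as soon as it has been folded in; the class and method names are our own.

```python
import numpy as np

class RAVE:
    """Running averages (Equations (13)-(15)) for batch-incremental
    estimation of the mean and the covariance matrix."""

    def __init__(self, d):
        self.n = 0
        self.sx = np.zeros(d)          # S_x : running average of x
        self.sxx = np.zeros((d, d))    # S_xx: running average of x x^T

    def add_batch(self, X):
        """Fold a mini-batch X of shape (m, d) into the averages, Equation (15)."""
        m = X.shape[0]
        w = self.n / (self.n + m)
        self.sx = w * self.sx + X.sum(axis=0) / (self.n + m)
        self.sxx = w * self.sxx + X.T @ X / (self.n + m)
        self.n += m

    def mean_cov(self):
        """Recover mu and Sigma from the running averages, Equation (14)."""
        mu = self.sx
        cov = self.n / (self.n - 1) * (self.sxx - np.outer(mu, mu))
        return mu, cov
```

Folding in batches of any size yields the same $\hat\mu$ and $\hat\Sigma$ as a single pass over the concatenated data, up to floating-point error.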

2.5. Complexity Analysis of k-PPCAs

In this section, we provide a detailed complexity analysis of the k-PPCAs model, addressing the computational requirements of its key operations. To clarify and enhance the insights provided, we break down the analysis step by step and discuss the implications of the growth factors involved.
Let n, k, q, and d represent the sample size, the number of classes, the number of principal components, and the dimensionality of the feature space, respectively. The computational complexity of each major step in the k-PPCAs pipeline is as follows:
1. Mahalanobis Distance Calculation: Calculating the Mahalanobis distance from one observation to a single class without using PPCA, as given by Equation (6), is dominated by the covariance inversion step, which has a complexity of $O(d^3)$. By leveraging Equation (7), the computational cost is reduced to $O(q^2 d)$. Extending this computation to all $n$ observations and $k$ classes, the total complexity for assigning observations becomes $O(knq^2d)$.
2. Covariance Estimation: The covariance matrix for each class is computed based on the O ( n ) samples assigned to the class, which incurs a time complexity of O ( n d 2 ) .
3. Singular Value Decomposition (SVD): Performing SVD on the covariance matrix of each class requires O ( d 3 ) operations. Since this step is repeated for all k classes, the overall complexity for this stage is O ( k d 3 ) .
By combining the complexities of these steps, the total computational complexity of the k-PPCAs model can be expressed as $O(knq^2d + nd^2 + kd^3)$.
Discussion and Insights: The formula encapsulates the computational cost of k-PPCAs in terms of the primary parameters n, k, q, and d. Each term in the expression reflects a trade-off between these parameters. The term O ( k n q 2 d ) dominates for large datasets (high n) and a large number of classes (high k), indicating that assigning observations is the most computationally intensive step in such cases. The term O ( k d 3 ) highlights that SVD is computationally expensive for high-dimensional data, especially when the feature dimensionality d is large. The term O ( n d 2 ) becomes significant when the feature dimensionality d is high, as it directly impacts the covariance estimation. These insights emphasize that the scalability of k-PPCAs depends on balancing the model’s parameters to fit the available computational resources. Further optimizations, such as parallel processing for SVD or efficient approximations for the Mahalanobis distance, could mitigate these costs in practical applications.

3. Results

3.1. Datasets

We evaluated our method on four popular datasets, the Caltech-UCSD Birds-200-2011 (CUB200), CIFAR100, miniImageNet, and ILSVRC-2012 (ImageNet-1k) datasets. The first three datasets are widely used benchmarks for FSCIL, and ImageNet-1k is one of the most popular datasets for classification and incremental learning.
Caltech-UCSD Birds-200-2011 [27]. The CUB200 dataset includes 5994 training and 5794 test fine-grained images of 200 classes of birds, with about 30 training samples per class. The CUB200 images vary in size, as they come from real-world bird photographs.
CIFAR100 [28]. The CIFAR100 dataset comprises 60,000 images of size 32 × 32, equally distributed over 100 classes. Each class has 500 training and 100 test images.
miniImageNet. miniImageNet is a 100-class subset of ImageNet-1k [29], where we used the same 100 classes as [30]. Each class contains 500 training images and 100 test images. Each image is of the size 84 × 84.
ImageNet. Compared to the other datasets mentioned above, ImageNet-1k is a large-scale dataset containing 1,281,167 training and 50,000 validation images from 1000 object classes. There are about 1300 training images and 50 validation images in each class. The original dataset contains high-resolution images of varying sizes, which are typically resized to standard dimensions during preprocessing for compatibility with specific model architectures.
Tao et al. [1] proposed a series of training schemes for few-shot class incremental learning on CUB200, CIFAR100, and miniImageNet. The setting uses all samples as labeled data in the base session and only few-shot labeled samples in the incremental sessions, and it has been widely adopted by subsequent FSCIL research. For ease of comparison with previous related work, the experiments on these three datasets are performed under the same schedules. Considering that labeled data in the base session can also be expensive in reality, the proposed method is additionally evaluated under a stricter setting in the ablation study, where the labeled samples are few-shot in the base session as well, just as in the incremental sessions.
For CIFAR100 and miniImageNet, the 100 classes are split into 60 classes for the base session and 5 classes for each of the eight incremental sessions. In each incremental session, five samples per class are labeled and the remaining samples are unlabeled. Therefore, the experiments on these two benchmarks follow a 5-way 5-shot incremental setting. For CUB200, a 10-way 5-shot setting is adopted: 100 classes were selected for the base session and the remaining 100 classes were equally divided into 10 incremental sessions. Due to the large scale of ImageNet-1k, previous SSFSCIL methods have rarely been evaluated on this benchmark; to the best of our knowledge, we are the first to evaluate an SSFSCIL method on ImageNet-1k. In our experiment, the dataset is split into a base session with 500 classes and five 100-class incremental sessions. In order to test the limits of our approach, we provide only two labeled samples per class in each session, including the base session, and the remaining images are used as unlabeled samples. That is, we performed a 100-way 2-shot setting on ImageNet-1k.

3.2. Implementation Details

As mentioned in Section 2.3, different pretrained and frozen backbones are adopted as feature extractors in this paper. CLIP [17] is a flexible and generic contrastive learning method that maximizes the similarity between image features and the associated text embeddings. The image encoders in CLIP (ResNet-50x4, ResNet-50, ViT-B/32, ViT-L/14, etc.), pretrained on a wide variety of images from the Internet, are used and kept frozen as some of the feature extractors in our method. In addition to CLIP, a ResNet-18 pretrained on ImageNet-1k is used in our experiments on CIFAR100 and CUB200. Because CUB200 is a dataset with fine-grained bird classes, features pretrained on the coarser-grained ImageNet may not separate the bird classes well enough, so we fine-tuned this backbone on the base session of CUB200. Because ImageNet-1k and miniImageNet overlap, we adopt for miniImageNet a ResNet18 pretrained on OpenImages-v6 [31] by Ahmad et al. [8].
For image preprocessing on the CUB200, miniImageNet, and ImageNet datasets, we follow the settings of the feature extractors. The short edge of the image is resized while maintaining the original aspect ratio, followed by central cropping to obtain a square image as input for the feature extractor. The transformed size is 288 for CLIP-ResNet, and 224 for the plain ResNet and CLIP-ViT backbones.
Because the number of principal components $q$ is limited by the number of labeled samples per class, data augmentation (random cropping and random horizontal flipping) was applied, generating ten augmented images for each labeled image.
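An illustrative augmentation pipeline in torchvision is sketched below; the exact crop parameters are not specified in the paper and are assumptions here.

```python
import torchvision.transforms as T

# Assumed augmentation pipeline: random cropping and random horizontal flipping
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
    T.RandomHorizontalFlip(p=0.5),
])

def augment_labeled(image, n_aug=10):
    """Generate n_aug augmented copies of a labeled PIL image."""
    return [augment(image) for _ in range(n_aug)]
```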
The value of the PPCA parameter λ in Equations (3) and (11) is set to λ = 0.01 . For all benchmarks, the model is trained for 10 epochs. Five independent runs were conducted on each dataset and the mean and standard deviation of the accuracy were reported for each session.
Experimental Setup:
Hardware Specifications: NVIDIA RTX 3060 GPU, 48 GB RAM, and AMD Ryzen 7 5700G processors.
Software Specifications: PyTorch version 1.9.1, CUDA version 11.7, and Python 3.8.
Random Number Settings: In our experiments, we use random initializations with different seeds for each run to ensure robustness and evaluate the consistency of the proposed method across varying random configurations.

3.3. Evaluation Results

The results on CUB200, CIFAR100, miniImageNet, and ImageNet-1k are reported in Table 1, Table 2, Table 3, and Table 4, respectively. In addition to the average accuracy over the five independent runs, the standard deviation is given in parentheses. Based on the results from Table 5 in the ablation study, $q = 10$ principal components were used for all experiments.
Table 1, Table 2 and Table 3 present a comparison of k-PPCAs with other state-of-the-art (SOTA) SSFSCIL models. Table 4 shows our results on ImageNet-1k only.
Table 1 shows the results on CUB200. Comparing the accuracy after the last incremental session, our approach with ResNet18 obtains an accuracy of 58.96%, which is comparable to and outperforms most previous SOTA methods. On the CIFAR100 dataset (Table 2), the accuracy with a plain ResNet18 is 61.40%, better than all other SOTA methods and exceeding the second-best method by 6.9%. For miniImageNet (Table 3), our approach with a ResNet18 pretrained on OpenImages-v6 [31] obtains an accuracy of 52.17% and outperforms most of the SOTA methods except FeSSSS [8]. However, FeSSSS utilizes a larger backbone combining two components, a supervised ResNet18/20 and a self-supervised ResNet50 feature extractor; it concatenates the features from the two components and projects them into a fused space.
The proposed approach significantly enhances accuracy when utilizing the CLIP-ResNet backbone. On the CUB200 dataset, it achieves an accuracy of 66.06%, while on the CIFAR100 dataset, the accuracy is 66.15%. For the miniImageNet dataset, the method attains a notable accuracy of 79.02% using the ResNet50x4 backbone with CLIP. Additionally, a narrower ResNet50 backbone with CLIP achieves an accuracy of 74.19%.
The proposed approach also demonstrates a strong performance with CLIP-ViT-based backbones. On the CUB200 dataset, using ViT-B/32 achieves an accuracy of 59.18%, while ViT-L/14 achieves an accuracy of 76.78%. For the CIFAR100 dataset, ViT-B/32 and ViT-L/14 achieve accuracies of 67.16% and 80.83%, respectively. On the miniImageNet dataset, the method attains accuracies of 80.68% with ViT-B/32 and an impressive 88.91% with ViT-L/14. These results underline the substantial performance gains and adaptability of the proposed method across various datasets and backbones.
It is interesting to note that across these three datasets our method does not always perform best after the base session; however, thanks to its robustness to catastrophic forgetting, our approach retains a more stable accuracy across the incremental sessions.
As mentioned in Section 3.1, we may be the first to evaluate an SSFSCIL method on the ImageNet-1k dataset, so only the results of k-PPCAs are listed in Table 4. k-PPCAs reaches an accuracy of 58.44% on ImageNet-1k under a 100-way 2-shot setting. A plain k-Means with the same feature extractor, using the Euclidean distance, is also evaluated as a baseline and obtains an accuracy of 51.93%.

4. Discussion

4.1. The Number of Principal Components in PPCA

In the proposed method, more principal components in the PPCAs can reconstruct more of the variation and shape information of the classes embedded in the feature space. However, there is a trade-off when deciding on the number of principal components, because too many may overfit the training data. As mentioned in Section 3.2, ten augmented images are generated for each labeled image under the 5-way 5-shot setting. Table 5 shows the miniImageNet accuracy for different numbers of principal components, from zero (a constant covariance in the PPCA, i.e., k-Means) to all components (a full covariance for each class). We observe that the accuracy initially increases as more components are introduced, peaks at ten principal components, and then starts to drop as more are added.

4.2. Different Shots in the Base Session

A few-shot base session can aggravate the overfitting problem in optimization-based FSCIL methods. However, considering that labeled data in the base session can also be expensive in reality, we compare a few-shot base session with the conventional fully labeled design. We used a 5-shot base session and examined our method on miniImageNet; all experiments used the CLIP-based ResNet as the feature extractor. The miniImageNet results are listed in Table 6. We observe a decrease in accuracy with the few-shot setting compared to the fully labeled design, but this difference shrinks after the last incremental session. The result shows our method is less affected by overfitting in the few-shot setting.

4.3. The Criteria of Extreme Values

As described in Section 2.3, we assume the scores in each cluster follow a normal distribution, and any observation with a score over a threshold is identified as an extreme value and excluded from the parameter update. Experiments with the extreme value criterion $\mu_{r_k} + \tau \sigma_{r_k}$ for different values of $\tau$ were conducted on miniImageNet. The results in Table 7 demonstrate that a less strict criterion for extreme values leads to higher accuracy, because more unlabeled features can help the estimation of the PPCA parameters; the peak appears at a threshold of $\mu_{r_k} + 1.96\,\sigma_{r_k}$.

4.4. Discussion on the Strengths of the Proposed Method

Using Separate Probabilistic PCA Models for Each Class: By employing a flat design where each class is modeled by its own probabilistic PCA, the method captures class-specific variance with high precision. This approach allows the model to create tailored representations for each class, avoiding interference between unrelated classes and improving the overall discrimination capability.
Effectiveness of Mahalanobis Distance in Classification and Updating: The classification step leverages Mahalanobis distance, which is particularly effective in accounting for the variance structure of each class. This ensures that new data points are evaluated in a way that aligns with the learned probabilistic subspace, enhancing classification accuracy. During the k-Means-based parameter update, the Mahalanobis distance helps refine the probabilistic PCA parameters to better fit the evolving data distribution.
Semi-Supervised Strengths: The semi-supervised learning process ensures efficient use of both labeled and unlabeled data. By iteratively updating the PPCA parameters based on Mahalanobis-distance-based clustering, the model effectively integrates information from new classes while preserving the representations of previously learned classes.

4.5. Discussion About CLIP and ImageNet

We would also like to discuss the training set of CLIP, because our results on miniImageNet and ImageNet-1k could be less reliable if there were overlap between the training set of CLIP and ImageNet. Although the detailed data sources are not published in the CLIP paper, the authors mention that their dataset was created from a variety of publicly available sources on the Internet; more convincingly, CLIP is designed for zero-shot transfer learning and was evaluated on ImageNet-1k in the original paper. In addition, an existing method has already applied CLIP to CIL [16]. Considering the above evidence, we believe that using a CLIP-based encoder on ImageNet benchmarks is reasonable.

5. Conclusions

This paper concentrated on SSFSCIL, a challenging but practically important scenario where models continually learn new classes from limited labeled samples and a relatively large amount of unlabeled data. We presented k-PPCAs, an efficient probabilistic classifier built on a pretrained self-supervised generic feature extractor. The proposed k-PPCAs approach models the classes with a mixture of probabilistic principal component analyzers, which can successfully capture and discriminate the variance information of the classes in the feature space under the FSCIL setting. Extensive experiments show the effectiveness of the proposed approach on the CUB200, CIFAR100, and miniImageNet benchmarks. Experiments were also conducted on the ImageNet-1k dataset, which had not been evaluated by SSFSCIL methods before due to its large size.
Limitations and future work. The proposed k-PPCAs method is built upon a pretrained feature extractor; therefore, the effectiveness and intended purpose of the feature extractor can significantly influence the downstream classification performance. For example, if the feature extractor was trained to identify a broad concept, such as motor vehicles, it may pose challenges when classifying the various subcategories of vehicles. This constraint prompts us to consider exploring an end-to-end PPCA-based classification algorithm in the future. In addition, we also plan to apply the proposed method to even larger datasets, e.g., the full ImageNet, which contains more than 20,000 classes. Finally, the PPCA models rely on assumptions of normality and data free of label noise, and are thus sensitive to outliers. Handling noisy labeled data is another direction of future work; in the supervised setup, it becomes a robust PPCA estimation problem.

Author Contributions

Conceptualization, A.B.; methodology, A.B. and K.H.; software, K.H. and A.B.; validation, K.H. and A.B.; formal analysis, K.H. and A.B.; investigation, K.H. and A.B.; resources, K.H. and A.B.; data curation, K.H. and A.B.; writing—original draft preparation, K.H.; writing—review and editing, K.H. and A.B.; visualization, K.H.; supervision, A.B.; project administration, K.H. and A.B.; funding acquisition, A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The following publicly available datasets have been used in experiments: Caltech-UCSD Birds-200-2011 [27], CIFAR-100 [28], miniImageNet [30], ImageNet-1k [29].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CIL      Class Incremental Learning
FSCIL      Few-Shot Class Incremental Learning
SSFSCIL      Semi-Supervised Few-Shot Class-Incremental Learning
PCA      Principal Component Analysis
PPCA      Probabilistic Principal Component Analysis
CUB200      Caltech-UCSD Birds-200-2011

References

  1. Tao, X.; Hong, X.; Chang, X.; Dong, S.; Wei, X.; Gong, Y. Few-shot class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12183–12192. [Google Scholar]
  2. Zhang, C.; Song, N.; Lin, G.; Zheng, Y.; Pan, P.; Xu, Y. Few-shot incremental learning with continually evolved classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12455–12464. [Google Scholar]
  3. Zhu, K.; Cao, Y.; Zhai, W.; Cheng, J.; Zha, Z.J. Self-promoted prototype refinement for few-shot class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6801–6810. [Google Scholar]
  4. Peng, C.; Zhao, K.; Wang, T.; Li, M.; Lovell, B.C. Few-shot class-incremental learning from an open-set perspective. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 382–397. [Google Scholar]
  5. Dong, S.; Hong, X.; Tao, X.; Chang, X.; Wei, X.; Gong, Y. Few-shot class-incremental learning via relation knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 1255–1263. [Google Scholar]
  6. Cui, Y.; Xiong, W.; Tavakolian, M.; Liu, L. Semi-Supervised Few-Shot Class-Incremental Learning. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 1239–1243. [Google Scholar]
  7. Yang, B.; Lin, M.; Liu, B.; Fu, M.; Liu, C.; Ji, R.; Ye, Q. Learnable Expansion-and-Compression Network for Few-shot Class-Incremental Learning. arXiv 2021, arXiv:2104.02281. [Google Scholar]
  8. Ahmad, T.; Dhamija, A.R.; Cruz, S.; Rabinowitz, R.; Li, C.; Jafarzadeh, M.; Boult, T.E. Few-Shot Class Incremental Learning Leveraging Self-Supervised Features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3900–3910. [Google Scholar]
  9. Kalla, J.; Biswas, S. S3C: Self-supervised stochastic classifiers for few-shot class-incremental learning. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 432–448. [Google Scholar]
  10. Cui, Y.; Deng, W.; Xu, X.; Liu, Z.; Liu, Z.; Pietikäinen, M.; Liu, L. Uncertainty-guided semi-supervised few-shot class-incremental learning with knowledge distillation. IEEE Trans. Multimed. 2022, 25, 6422–6435. [Google Scholar] [CrossRef]
  11. Cui, Y.; Deng, W.; Chen, H.; Liu, L. Uncertainty-Aware Distillation for Semi-Supervised Few-Shot Class-Incremental Learning. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 14259–14272. [Google Scholar] [CrossRef]
  12. Tipping, M.E.; Bishop, C.M. Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 1999, 61, 611–622. [Google Scholar] [CrossRef]
  13. Tipping, M.E.; Bishop, C.M. Mixtures of probabilistic principal component analyzers. Neural Comput. 1999, 11, 443–482. [Google Scholar] [CrossRef]
  14. Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2001–2010. [Google Scholar]
  15. Mensink, T.; Verbeek, J.; Perronnin, F.; Csurka, G. Distance-based image classification: Generalizing to new classes at near-zero cost. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2624–2637. [Google Scholar] [CrossRef] [PubMed]
  16. Nakata, K.; Ng, Y.; Miyashita, D.; Maki, A.; Lin, Y.C.; Deguchi, J. Revisiting a kNN-based image classification system with high-capacity storage. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 457–474. [Google Scholar]
  17. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  18. Castro, F.M.; Marín-Jiménez, M.J.; Guil, N.; Schmid, C.; Alahari, K. End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 233–248. [Google Scholar]
  19. Hou, S.; Pan, X.; Loy, C.C.; Wang, Z.; Lin, D. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 831–839. [Google Scholar]
  20. Martinetz, T. A “neural-gas” network learns topologies. In Artificial Neural Networks; Elsevier: Amsterdam, The Netherlands, 1991; pp. 397–402. [Google Scholar]
  21. Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572. [Google Scholar] [CrossRef]
  22. Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 417. [Google Scholar] [CrossRef]
  23. Mahalanobis, P.C. On the Generalized Distance in Statistics; National Institute of Science of India: Odisha, India, 1936. [Google Scholar]
  24. Wang, B.; Barbu, A. Scalable Learning with Incremental Probabilistic PCA. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; pp. 5615–5622. [Google Scholar]
  25. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
  26. Sun, L.; Wang, M.; Zhu, S.; Barbu, A. A novel framework for online supervised learning with feature selection. J. Nonparametr. Stat. 2024, 1–27. [Google Scholar] [CrossRef]
  27. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; Technical Report CNS-TR-2011-001; California Institute of Technology: Pasadena, CA, USA, 2011. [Google Scholar]
  28. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. Master’s Thesis, University of Toronto, Toronto, ON, Canada, 2009. [Google Scholar]
  29. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  30. Ravi, S.; Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  31. Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. Int. J. Comput. Vis. 2020, 128, 1956–1981. [Google Scholar] [CrossRef]
Figure 1. Flowchart of k-PPCAs (2-way 5-shot). The colored points are labeled or classified observations. The gray points are unlabeled.
Table 1. Comparison of k-PPCAs with the state-of-the-art SSFSCIL methods on the CUB200 dataset under 10-way 5-shot settings with 100 base and 100 incremental classes. Our results were obtained from five independent runs, with the standard deviation in parentheses. Columns 0–10 give the accuracy (%) in each session.

| Method | Backbone | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SS-iCaRL [6] | RN18 | 69.89 | 61.24 | 55.81 | 50.99 | 48.18 | 46.91 | 43.99 | 39.78 | 37.50 | 34.54 | 31.33 |
| SS-NCM [6] | RN18 | 69.89 | 61.91 | 55.51 | 51.71 | 49.68 | 46.11 | 42.19 | 39.03 | 37.96 | 34.05 | 32.65 |
| SS-NCM-CNN [6] | RN18 | 69.89 | 64.87 | 59.82 | 55.14 | 52.48 | 49.60 | 47.87 | 45.10 | 40.47 | 38.10 | 35.25 |
| FeSSSS [8] | RN18+RN50 | 79.60 | 73.46 | 70.32 | 66.38 | 63.97 | 59.63 | 58.19 | 57.56 | 55.01 | 54.31 | 52.98 |
| Us-KD [10] | RN18 | 74.69 | 71.71 | 69.04 | 65.08 | 63.60 | 60.96 | 59.06 | 58.68 | 57.01 | 56.41 | 55.54 |
| S3C [9] | RN18 | 80.62 | 77.55 | 73.19 | 68.54 | 68.05 | 64.33 | 63.58 | 62.07 | 60.61 | 59.79 | 58.95 |
| UaD-CE [11] | RN18 | 75.17 | 73.27 | 70.87 | 67.14 | 65.49 | 63.66 | 62.42 | 62.55 | 60.99 | 60.48 | 60.72 |
| k-PPCAs | RN18 | 76.72 (0.03) | 73.87 (0.27) | 70.90 (0.31) | 66.90 (0.50) | 65.28 (0.40) | 63.16 (0.44) | 62.17 (0.67) | 61.12 (0.65) | 59.38 (0.73) | 59.42 (0.68) | 58.96 (0.79) |
| k-PPCAs | CLIP-RN50x4 | 81.46 (0.04) | 78.84 (0.26) | 76.87 (0.33) | 73.09 (0.66) | 71.90 (0.86) | 70.02 (0.74) | 69.24 (0.74) | 67.93 (1.13) | 66.25 (1.34) | 66.18 (1.35) | 66.06 (1.29) |
| k-PPCAs | CLIP-ViT-B/32 | 75.89 (0.03) | 73.01 (0.36) | 70.74 (0.48) | 67.02 (0.79) | 65.43 (0.76) | 63.29 (0.79) | 62.14 (0.82) | 60.94 (0.90) | 59.23 (0.93) | 59.56 (0.87) | 59.18 (1.20) |
| k-PPCAs | CLIP-ViT-L/14 | 86.47 (0.02) | 84.30 (0.13) | 83.06 (0.38) | 79.98 (0.75) | 79.60 (0.87) | 78.17 (0.98) | 77.89 (0.90) | 77.44 (0.84) | 76.51 (0.89) | 76.84 (0.90) | 76.78 (1.08) |
Table 2. Experiments on CIFAR100 under 5-way 5-shot settings with 60 base and 40 incremental classes. For S3C [9], only the accuracy of the final incremental session is provided. Columns 0–8 give the accuracy (%) in each session, with the standard deviation of our results in parentheses.

| Method | Backbone | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|
| SS-NCM-CNN [6] | RN18 | 64.1 | 62.22 | 61.11 | 58.0 | 54.22 | 50.66 | 48.88 | 46.0 | 44.44 |
| FeSSSS [8] | RN20+RN50 | 75.35 | 70.81 | 66.7 | 62.73 | 59.62 | 56.45 | 54.33 | 52.10 | 50.23 |
| S3C [9] | RN20 | – | – | – | – | – | – | – | – | 53.96 |
| Us-KD [10] | RN18 | 76.85 | 69.87 | 65.46 | 62.36 | 59.86 | 57.29 | 55.22 | 54.91 | 54.42 |
| UaD-CE [11] | RN18 | 75.55 | 72.17 | 68.57 | 65.35 | 62.80 | 60.27 | 59.12 | 57.05 | 54.50 |
| k-PPCAs | RN18 | 69.87 (0.07) | 68.83 (0.10) | 67.83 (0.33) | 65.39 (0.34) | 65.05 (0.34) | 63.73 (0.32) | 63.74 (0.32) | 62.88 (0.31) | 61.40 (0.28) |
| k-PPCAs | CLIP-RN50x4 | 73.06 (0.07) | 71.54 (0.15) | 70.30 (0.24) | 68.23 (0.26) | 67.41 (0.17) | 66.99 (0.18) | 67.18 (0.15) | 66.81 (0.18) | 66.15 (0.15) |
| k-PPCAs | CLIP-ViT-B/32 | 75.02 (0.04) | 72.88 (0.16) | 72.27 (0.14) | 70.24 (0.09) | 69.59 (0.09) | 68.96 (0.10) | 68.92 (0.16) | 68.37 (0.10) | 67.16 (0.14) |
| k-PPCAs | CLIP-ViT-L/14 | 85.43 (0.04) | 84.12 (0.06) | 83.56 (0.10) | 82.01 (0.12) | 81.89 (0.13) | 81.82 (0.10) | 81.68 (0.10) | 81.62 (0.11) | 80.83 (0.11) |
Table 3. Experiments on miniImageNet under 5-way 5-shot settings with 60 base and 40 incremental classes. For S3C [9], only the accuracy of the final incremental session is provided. Columns 0–8 give the accuracy (%) in each session, with the standard deviation of our results in parentheses.

| Method | Backbone | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|
| SS-NCM-CNN [6] | RN18 | 62.88 | 60.88 | 57.63 | 52.8 | 50.66 | 48.28 | 45.27 | 41.65 | 41.21 |
| Us-KD [10] | RN18 | 72.35 | 67.22 | 62.41 | 59.85 | 57.81 | 55.52 | 52.64 | 50.86 | 50.47 |
| UaD-CE [11] | RN18 | 72.35 | 66.91 | 62.13 | 59.89 | 57.41 | 55.52 | 53.26 | 51.46 | 50.52 |
| S3C [9] | RN18 | – | – | – | – | – | – | – | – | 52.14 |
| FeSSSS [8] | RN18+RN50 | 81.5 | 77.04 | 72.92 | 69.56 | 67.27 | 64.34 | 62.07 | 60.55 | 58.87 |
| k-PPCAs | RN18 | 71.50 (0.04) | 67.02 (0.12) | 63.64 (0.08) | 61.14 (0.10) | 58.94 (0.05) | 56.66 (0.20) | 54.42 (0.30) | 53.16 (0.24) | 52.17 (0.22) |
| k-PPCAs | CLIP-RN50 | 76.82 (0.07) | 76.17 (0.14) | 75.06 (0.22) | 75.03 (0.21) | 75.01 (0.22) | 74.61 (0.23) | 73.75 (0.20) | 73.97 (0.18) | 74.19 (0.17) |
| k-PPCAs | CLIP-RN50x4 | 81.57 (0.05) | 81.28 (0.04) | 80.12 (0.11) | 79.98 (0.11) | 79.97 (0.11) | 79.44 (0.07) | 78.60 (0.08) | 78.92 (0.07) | 79.02 (0.06) |
| k-PPCAs | CLIP-ViT-B/32 | 83.51 (0.05) | 82.99 (0.15) | 81.97 (0.18) | 81.51 (0.51) | 81.63 (0.47) | 81.20 (0.45) | 80.39 (0.45) | 80.62 (0.42) | 80.68 (0.39) |
| k-PPCAs | CLIP-ViT-L/14 | 91.05 (0.05) | 90.86 (0.11) | 89.78 (0.09) | 89.84 (0.09) | 89.88 (0.09) | 89.44 (0.09) | 88.72 (0.12) | 88.85 (0.12) | 88.91 (0.11) |
Table 4. ImageNet-1k experiments under 100-way 2-shot settings with 500 base and 500 incremental classes. No ImageNet results were found for FSCIL in the literature. Columns 0–5 give the accuracy (%) in each session, with the standard deviation in parentheses.

| Method | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| k-Means (CLIP-RN50x4) | 53.68 (0.39) | 54.99 (0.51) | 54.03 (0.49) | 52.81 (0.46) | 51.49 (0.19) | 51.93 (0.23) |
| k-PPCAs (CLIP-RN50x4) | 57.05 (0.27) | 59.27 (0.33) | 59.44 (0.52) | 59.05 (0.44) | 58.10 (0.32) | 58.44 (0.27) |
Table 5. The influence of the number of principal components (PC) on accuracy for the miniImageNet dataset. Columns 0–8 give the accuracy (%) in each session, with the standard deviation in parentheses.

| Number of PC | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 PC (k-Means) | 62.46 (1.05) | 62.95 (0.97) | 61.96 (1.07) | 62.22 (0.99) | 62.49 (0.91) | 62.34 (1.03) | 61.84 (0.99) | 62.15 (1.41) | 62.37 (0.99) |
| 5 PC | 73.19 (1.09) | 73.47 (1.05) | 72.82 (1.01) | 73.05 (0.89) | 73.44 (0.90) | 73.30 (0.87) | 72.91 (0.84) | 73.39 (0.82) | 73.92 (0.77) |
| 10 PC | 73.96 (1.29) | 74.28 (1.17) | 73.68 (1.13) | 73.93 (1.08) | 74.32 (1.03) | 74.12 (0.95) | 73.63 (0.93) | 74.22 (0.90) | 74.69 (0.83) |
| 20 PC | 74.00 (1.25) | 74.24 (1.13) | 73.44 (1.02) | 73.71 (1.00) | 74.20 (0.97) | 73.87 (0.83) | 73.27 (0.94) | 73.89 (0.91) | 74.42 (0.87) |
| 30 PC | 73.92 (1.31) | 74.12 (1.17) | 73.22 (1.15) | 73.53 (1.13) | 74.05 (1.11) | 73.60 (0.91) | 72.96 (0.89) | 73.58 (0.85) | 74.11 (0.82) |
| 50 PC | 73.87 (1.26) | 74.07 (1.15) | 73.10 (1.07) | 72.97 (1.23) | 73.45 (1.21) | 72.87 (1.17) | 72.04 (1.05) | 72.76 (1.01) | 73.31 (0.97) |
| All PC (full cov.) | 62.46 (1.03) | 62.95 (0.94) | 61.96 (1.06) | 62.24 (0.97) | 62.51 (0.89) | 62.34 (1.01) | 61.85 (0.97) | 62.15 (1.38) | 62.51 (1.14) |
Table 6. Comparison of a few-shot base session and an all-labeled base session on miniImageNet. Columns 0–8 give the accuracy (%) in each session, with the standard deviation in parentheses.

| Base session | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 5-shot (CLIP-RN50x4) | 73.99 (1.16) | 74.33 (1.07) | 73.69 (1.05) | 73.96 (0.96) | 74.34 (0.93) | 74.15 (0.86) | 73.68 (0.85) | 74.27 (0.82) | 74.76 (0.76) |
| all-labeled (CLIP-RN50x4) | 81.57 (0.05) | 81.28 (0.04) | 80.12 (0.11) | 79.98 (0.11) | 79.97 (0.11) | 79.44 (0.07) | 78.60 (0.08) | 78.92 (0.07) | 79.02 (0.06) |
Table 7. Comparison of the extreme value thresholds $\mu_{r_k} + \tau \sigma_{r_k}$ on the miniImageNet dataset. The % column gives the approximate fraction of observations filtered out; the last row applies no filtering. Columns 0–8 give the accuracy (%) in each session, with the standard deviation in parentheses.

| τ | % | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 73.02 (1.36) | 73.17 (1.26) | 72.32 (1.12) | 72.55 (1.00) | 73.00 (0.94) | 72.75 (0.93) | 72.22 (0.91) | 72.74 (0.83) | 73.23 (0.80) |
| 1 | 16 | 73.74 (1.30) | 74.03 (1.17) | 73.37 (1.12) | 73.60 (0.99) | 73.99 (0.95) | 73.79 (0.92) | 73.33 (0.89) | 73.83 (0.85) | 74.35 (0.78) |
| 1.64 | 5 | 73.97 (1.09) | 74.29 (0.99) | 73.66 (0.98) | 73.91 (0.91) | 74.30 (0.87) | 74.12 (0.80) | 73.66 (0.76) | 74.21 (0.73) | 74.73 (0.70) |
| 1.96 | 2.5 | 73.99 (1.16) | 74.33 (1.07) | 73.69 (1.05) | 73.96 (0.96) | 74.34 (0.93) | 74.15 (0.86) | 73.68 (0.85) | 74.27 (0.82) | 74.76 (0.76) |
| 2.56 | 0.5 | 73.96 (1.29) | 74.28 (1.17) | 73.68 (1.13) | 73.93 (1.08) | 74.32 (1.03) | 74.12 (0.95) | 73.63 (0.93) | 74.22 (0.90) | 74.69 (0.83) |
| ∞ | 0 | 73.97 (1.11) | 74.31 (1.03) | 73.70 (0.98) | 73.97 (0.91) | 74.33 (0.89) | 74.14 (0.83) | 73.67 (0.82) | 74.25 (0.79) | 74.72 (0.74) |
