Class-wise Classifier Design Capable of Continual Learning using Adaptive Resonance Theory-based Topological Clustering

This paper proposes a supervised classification algorithm capable of continual learning by utilizing an Adaptive Resonance Theory (ART)-based growing self-organizing clustering algorithm. The ART-based clustering algorithm is theoretically capable of continual learning, and the proposed algorithm applies it independently to each class of training data to generate a classifier. Whenever an additional training data set from a new class is given, a new ART-based clustering network is defined in a separate learning space. Thanks to these features, the proposed algorithm realizes continual learning. Simulation experiments showed that the proposed algorithm has superior classification performance compared with state-of-the-art clustering-based classification algorithms capable of continual learning.


Introduction
With the recent development of IoT technology, a wide variety of big data from different domains have become easily available. In order to make effective use of such big data, many studies have been conducted in various research fields.
One of the major challenges in the field of machine learning is to achieve continual learning like the human brain by computational algorithms. In general, it is difficult for many computational algorithms to avoid catastrophic forgetting, i.e., previously learned knowledge is destroyed by the learning of new knowledge [1], which makes it difficult for the algorithms to realize efficient learning for big data with increasing amounts and types of data. Recently, therefore, computational algorithms capable of continual learning have attracted much attention as a promising approach for continually and efficiently extracting knowledge from big data [2].
Continual learning is categorized into three scenarios: domain incremental learning, task incremental learning, and class incremental learning [3,4]. This paper focuses on class incremental learning, which includes the common real-world problem of incrementally learning new classes of information. In the case of layer-wise neural networks capable of continual learning, their memory capacity is often limited because most of the network architecture is fixed [5]. One promising approach to overcome the limitation of memory capacity is to apply a growing self-organizing clustering algorithm as a classifier. Self-Organizing Incremental Neural Networks (SOINN) [6] is a well-known growing self-organizing clustering algorithm which is inspired by Growing Neural Gas (GNG) [7]. Several studies have shown that SOINN-based classifiers capable of continual learning have good classification performance [8][9][10]. However, the self-organizing process of SOINN-based algorithms is highly unstable due to the instability of GNG-based algorithms.
This paper proposes a new classification algorithm capable of continual learning by utilizing a growing self-organizing clustering algorithm based on Adaptive Resonance Theory (ART) [11]. Among recent ART-based clustering algorithms, algorithms that utilize the Correntropy-Induced Metric (CIM) [12] as a similarity measure show faster and more stable self-organizing performance than GNG-based algorithms [13][14][15]. We apply CIM-based ART with Edge and Age (CAEA) [16] as the base clustering algorithm in the proposed algorithm. Moreover, we also propose two variants of the proposed algorithm by modifying the computation of the CIM to improve the classification performance.
The contributions of this paper are summarized as follows: (i) A new classification algorithm capable of continual learning, called the CAEA Classifier (CAEAC), is proposed by applying an ART-based clustering algorithm. (ii) Two variants of CAEAC are introduced by modifying the computation of the CIM. (iii) Empirical studies show that CAEAC and its variants have superior classification performance to state-of-the-art clustering-based classifiers. (iv) The parameter sensitivity of CAEAC (and its variants) is analyzed in detail.
The paper is organized as follows. Section 2 presents a literature review of growing self-organizing clustering algorithms and classification algorithms capable of continual learning. Section 3 describes the details of the mathematical backgrounds of CAEAC and its variants. Section 4 presents extensive simulation experiments to evaluate the classification performance of CAEAC and its variants by using real-world datasets. Section 5 concludes this paper.

Literature Review

Growing Self-organizing Clustering Algorithms
In general, the major drawback of classical clustering algorithms such as the Gaussian Mixture Model (GMM) [17] and k-means [18] is that the number of clusters/partitions has to be specified in advance. GNG [7] and ASOINN [8] are typical growing self-organizing clustering algorithms that can overcome this drawback of GMM and k-means. GNG and ASOINN adaptively generate topological networks by generating nodes and edges to represent sequentially given data. However, since these algorithms permanently insert new nodes into topological networks for extracting new knowledge, they may forget previously learned knowledge (i.e., catastrophic forgetting). More generally, this phenomenon is called the plasticity-stability dilemma [19]. As a SOINN-based algorithm, SOINN+ [20] can detect clusters of arbitrary shapes in noisy data streams without any predefined parameters. Grow When Required (GWR) [21] is a GNG-based algorithm which can avoid the plasticity-stability dilemma by adding nodes whenever the state of the current network does not sufficiently match the data. One problem of GWR is that as the number of nodes in the network increases, the cost of calculating a threshold for each node increases, and thus the learning efficiency decreases.
In contrast to GNG-based algorithms, ART-based algorithms can theoretically avoid the plasticity-stability dilemma by utilizing a predefined similarity threshold (i.e., a vigilance parameter) for controlling the learning process. Thanks to this ability, a number of ART-based algorithms and their improvements have been proposed for both supervised learning [22][23][24] and unsupervised learning [25][26][27][28]. Specifically, algorithms which utilize the CIM as a similarity measure have achieved faster and more stable self-organizing performance than GNG-based algorithms [14,15,29,30]. One drawback of ART-based algorithms is the specification of data-dependent parameters such as a similarity threshold. Several studies have proposed to solve this drawback by utilizing multiple vigilance levels [31], adjusting parameters during the learning process [24], and estimating parameters from given data [16]. In particular, CAEA [16], which utilizes the CIM as a similarity measure, has shown superior clustering performance while successfully reducing the effect of data-dependent parameters.

Classification Algorithms Capable of Continual Learning
In recent years, neural networks have demonstrated high performance in object recognition, speech recognition, and natural language processing. On the other hand, the ability of neural networks to perform continual learning without catastrophic forgetting is not sufficient [32]. Continual learning is categorized into three scenarios [3,4]: domain incremental learning [33,34], task incremental learning [35], and class incremental learning [5,[35][36][37]. In general, when new information is given, layer-wise neural networks capable of continual learning use one of the following two learning mechanisms: selective learning of weight coefficients between neurons, or sequential addition of neurons in the output layer corresponding to the new information. However, the major problem with the above approaches is that the structure of the networks is basically fixed, and therefore, there is an upper limit on the memory capacity.
One promising approach to overcome this difficulty caused by the fixed network structure is to apply a growing self-organizing clustering algorithm as a classifier. Growing self-organizing clustering algorithms adaptively and continually generate nodes to represent new information. Typical classifiers of this type are Episodic-GWR [38] and the ASOINN Classifier (ASC) [8], which utilize GWR and ASOINN, respectively. One state-of-the-art algorithm is SOINN+ with ghost nodes (GSOINN+) [10]. GSOINN+ has successfully improved the classification performance by generating ghost nodes near the decision boundary of each class.
Another successful approach is ART-based supervised learning algorithms, i.e., ARTMAP [22][23][24]26,29,39]. As mentioned in Section 2.1, ART-based algorithms theoretically realize sequential and class-incremental learning without catastrophic forgetting. However, especially for algorithms with an ARTMAP architecture, label information is not fully utilized during the supervised learning process. In general, the class label of each node is determined based on the frequency of label appearance. Therefore, there is a possibility that the decision boundary of each class cannot be learned clearly.

Proposed Algorithm
In this section, first the overview of CAEAC is introduced. Next, the mathematical backgrounds of the CIM and CAEA are explained in detail. Then, modifications of the CIM computation are introduced for the variants of CAEAC. Table 1 summarizes the main notations used in this paper.

Class-wise Classifier Design Capable of Continual Learning
Figure 1 shows the overview of CAEAC. The architecture of CAEAC is inspired by ASC [8]. As shown in Fig. 2, ASC is an ASOINN-based supervised classifier incorporating k-means and two node clearance mechanisms after class-wise unsupervised self-organizing processes. The main difference between CAEAC and ASC is that CAEAC does not require k-means and node clearance mechanisms because CAEA has superior clustering performance to ASOINN.
In CAEAC, a training dataset is divided into multiple subsets based on their class labels. The number of subsets is the same as the number of classes. Each subset is used to generate a classifier (i.e., nodes and edges) through a self-organization process by CAEA. Since CAEA is capable of continual learning, each classifier can be continually updated. Moreover, when a training dataset belongs to a new class, a self-organizing space of CAEA is newly defined. Thus, it is possible to learn new knowledge without destroying existing knowledge. When classifying an unknown data point, the classifiers (one classifier for each class) are installed in the same space, and the label information of the nearest neighbor node of the unknown data point is output as the classification result. The learning procedure of CAEAC is summarized in Algorithm 1.
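The class-wise design described above can be sketched as follows. Note that `NodeMemory` is a hypothetical stand-in for CAEA (it simply stores every data point as a node), and plain Euclidean nearest-node matching stands in for the CIM-based prediction; this is an illustration of the architecture, not the actual CAEA implementation.

```python
import numpy as np

class NodeMemory:
    """Trivial stand-in for CAEA: stores every data point as a node."""
    def __init__(self):
        self.nodes = []

    def fit(self, X):
        self.nodes.extend(np.asarray(X, dtype=float))

class ClasswiseClassifier:
    """One independent clustering model per class label.

    A new class label simply gets a new model in its own space, so the
    models (learned knowledge) of existing classes are never touched."""
    def __init__(self, make_model):
        self.make_model = make_model  # factory for a per-class model
        self.models = {}              # label -> model

    def fit(self, X, labels):
        # Split the training data by class; train each model independently.
        X, labels = np.asarray(X), np.asarray(labels)
        for label in np.unique(labels):
            if label not in self.models:
                self.models[label] = self.make_model()
            self.models[label].fit(X[labels == label])

    def predict(self, x):
        # Pool the nodes of all per-class models into one space and
        # return the label of the nearest node.
        best_label, best_dist = None, np.inf
        for label, model in self.models.items():
            for node in model.nodes:
                d = np.linalg.norm(x - node)
                if d < best_dist:
                    best_label, best_dist = label, d
        return best_label
```

Calling `fit` later with data of an entirely new class only creates an additional per-class model, which is the continual learning property the paragraph above describes.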
In the following subsections, the mathematical backgrounds of the CIM and CAEA are explained in detail.

Algorithm 1: Learning Algorithm of CAEAC
Input: the training data points X = {x_1, x_2, ..., x_L} (x_l ∈ R^d), the predefined interval for computing σ and deleting isolated nodes λ, and the predefined threshold of an age of edge a_max.
Output: the CAEA models (one per class).

Correntropy and Correntropy-induced Metric
Correntropy [12] provides a generalized similarity measure between two arbitrary data points x = (x_1, x_2, ..., x_d) and y = (y_1, y_2, ..., y_d) as follows:

C(x, y) = E[κ_σ(x − y)],   (1)

where E[·] is the expectation operation, and κ_σ(·) denotes a positive definite kernel with a bandwidth σ. The correntropy is estimated as follows:

Ĉ(x, y) = (1/d) Σ_{i=1}^{d} κ_σ(x_i − y_i).   (2)

In this paper, we use the following Gaussian kernel in the correntropy:

κ_σ(x_i − y_i) = exp(−(x_i − y_i)^2 / (2σ^2)).   (3)

A nonlinear metric called the CIM is derived from the correntropy [12]. The CIM quantifies the similarity between two data points x and y as follows:

CIM(x, y, σ) = [1 − Ĉ(x, y)]^{1/2}.   (4)

Here, since the Gaussian kernel in (3) does not have the coefficient 1/(√(2π)σ), the CIM value is bounded in the range [0, 1]. In general, the Euclidean distance suffers from the curse of dimensionality. However, the CIM reduces this drawback since the correntropy calculates the similarity between two data points by using a kernel function. Moreover, it has also been shown that the CIM with the Gaussian kernel has a high outlier rejection ability [12].
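The estimated correntropy and the CIM with the unnormalized Gaussian kernel can be sketched directly:

```python
import numpy as np

def gaussian_kernel(z, sigma):
    # Gaussian kernel WITHOUT the normalizing coefficient 1/(sqrt(2*pi)*sigma),
    # so that kappa_sigma(0) = 1 and the CIM stays in [0, 1].
    return np.exp(-z**2 / (2.0 * sigma**2))

def correntropy(x, y, sigma):
    # Sample estimate: average kernel similarity over the d attributes.
    return np.mean(gaussian_kernel(x - y, sigma))

def cim(x, y, sigma):
    # Correntropy-Induced Metric between two d-dimensional points.
    return np.sqrt(1.0 - correntropy(x, y, sigma))
```

Identical points give a CIM of 0, and the value saturates toward 1 as the points move apart, which is why outliers have a bounded influence.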

CIM-based ART with Edge and Age
CAEA [16] is an ART-based topological clustering algorithm capable of continual learning. In [16], CAEA and its hierarchical variant show comparable clustering performance to recently proposed clustering algorithms without the difficulty of specifying parameters for each dataset. The learning procedure of CAEA is divided into four parts: 1) an initialization process for nodes and a bandwidth of a kernel function in the CIM, 2) winner node selection, 3) a vigilance test, and 4) node learning and edge construction. Each of them is explained in the following subsections.
In this paper, we use the following notations. A set of training data points is X = {x_1, x_2, ..., x_n, ...}, where x_n = (x_n1, x_n2, ..., x_nd) is a d-dimensional feature vector. A set of prototype nodes in CAEA at the time of the presentation of a data point x_n is Y = {y_1, y_2, ..., y_K} (K ∈ Z+), where a node y_k = (y_k1, y_k2, ..., y_kd) has the same dimension as x_n. Furthermore, each node y_k has an individual bandwidth σ_k for the CIM, i.e., S = {σ_1, σ_2, ..., σ_K}.

Initialization Process for Nodes and a Bandwidth of a Kernel Function in the CIM
In the case that CAEA does not have any nodes, i.e., the set of prototype nodes Y = ∅, the first λ/2 training data points X_init = {x_1, x_2, ..., x_{λ/2}} directly become prototype nodes, i.e., Y_init = {y_1, y_2, ..., y_{λ/2}}, where y_k = x_k (k = 1, 2, ..., λ/2) and λ ∈ Z+ is a predefined parameter of CAEA. This parameter is also used for the node deletion process explained in Section 3.3.4.
In an ART-based clustering algorithm, a vigilance parameter (i.e., a similarity threshold) plays an important role in the self-organizing process. Typically, the similarity threshold is data-dependent and specified by hand. In contrast, CAEA takes, for each node in Y_init = {y_1, y_2, ..., y_{λ/2}}, the minimum CIM value to the other nodes, and the average of these minimum pairwise CIM values is used as the similarity threshold V_threshold, i.e.,

V_threshold = (2/λ) Σ_{i=1}^{λ/2} min_{j ≠ i} [CIM(y_i, y_j, σ)],   (5)

where σ is a kernel bandwidth in the CIM.
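A minimal sketch of this threshold estimation, under the "average of each node's minimum pairwise CIM" reading described above (the `cim` helper is included to keep the sketch self-contained):

```python
import numpy as np

def cim(x, y, sigma):
    # Correntropy-Induced Metric with an unnormalized Gaussian kernel.
    return np.sqrt(1.0 - np.mean(np.exp(-(x - y)**2 / (2.0 * sigma**2))))

def similarity_threshold(nodes, sigma):
    """For each initial node, take the CIM to its most similar other node;
    V_threshold is the average of these per-node minimum values."""
    K = len(nodes)
    mins = []
    for i in range(K):
        vals = [cim(nodes[i], nodes[j], sigma) for j in range(K) if j != i]
        mins.append(min(vals))
    return float(np.mean(mins))
```

Because the threshold is an average of per-node minima, a few widely separated initial nodes do not inflate it as much as a plain average over all pairs would.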
In general, the bandwidth of a kernel function can be estimated from λ instances belonging to a certain distribution [40], which is defined as follows:

Σ = U(F) Γ λ^{−1/(2ν+1)},   (6)

U(F) = ( (2^{ν+1} π^{1/2} (ν!)^2 R(F)) / (2ν (2ν − 1)!! κ_ν^2(F)) )^{1/(2ν+1)},   (7)

where Γ denotes a rescale operator (a d-dimensional vector) which is defined by the standard deviation of each of the d attributes among the λ instances, ν is the order of the kernel, the single factorial of ν is calculated as the product of the integers from 1 to ν, the double factorial notation is defined as (2ν − 1)!! = (2ν − 1)(2ν − 3) ⋯ 3 · 1 (commonly known as the odd factorial), R(F) is a roughness function, and κ_ν(F) is the moment of the kernel. The details of the derivation of (6) and (7) can be found in [40].
In this paper, we use the Gaussian kernel for the CIM. Therefore, ν = 2, R(F) = (2√π)^{−1}, and κ_ν^2(F) = 1. Then, (7) is rewritten as follows:

Σ = Γ (4 / (3λ))^{1/5}.   (8)

Equation (8) is known as Silverman's rule [41]. In (8), Σ contains the bandwidth of each attribute (i.e., Σ is a d-dimensional vector). In this paper, the median of the d elements of Σ is selected as the representative bandwidth of the Gaussian kernel in the CIM, i.e.,

σ = median(Σ).   (9)

In CAEA, the initial prototype nodes Y_init = {y_1, y_2, ..., y_{λ/2}} have a common bandwidth for the Gaussian kernel in the CIM, i.e., S_init = {σ_1, σ_2, ..., σ_{λ/2}} where σ_k = σ (k = 1, 2, ..., λ/2). When a new node y_{K+1} is generated from x_n, a bandwidth σ_{K+1} is estimated from the past λ/2 data points, i.e., {x_{n−λ/2}, ..., x_{n−2}, x_{n−1}}, by using (6) and (7). As a result, each new node has a different bandwidth σ depending on the distribution of the training data points. In addition, a set of counters M = {M_1, M_2, ..., M_{λ/2}}, where each counter is initialized as M_k = 1, is maintained for the initial nodes. Although the similarity threshold V_threshold depends on the distribution of the initial λ/2 training data points, we regard that an adaptive V_threshold estimation is realized by assigning a different bandwidth σ, which affects the CIM value, to each node in response to changes in the data distribution.
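The bandwidth estimation can be sketched as follows, assuming per-attribute Silverman factors (4/(3λ))^{1/5} applied to the attribute-wise standard deviations, followed by the median rule described above:

```python
import numpy as np

def silverman_bandwidth(window):
    """Per-attribute bandwidths via Silverman's rule of thumb
    (nu = 2, Gaussian kernel), then the median as the shared
    bandwidth sigma for the CIM.

    `window` holds the most recent data points, shape (n, d)."""
    window = np.asarray(window, dtype=float)
    n, _ = window.shape
    gamma = np.std(window, axis=0)                 # rescale operator per attribute
    sigma_vec = gamma * (4.0 / (3.0 * n)) ** 0.2   # Silverman factor, one per attribute
    return float(np.median(sigma_vec)), sigma_vec
```

Using the median rather than the mean keeps a single unusually spread (or constant) attribute from dominating the shared bandwidth.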

Winner Node Selection
Once a data point x_n is presented to CAEA with the prototype node set Y = {y_1, y_2, ..., y_K}, the two nodes which have a state most similar to the data point x_n are selected, namely, the winner nodes y_{k1} and y_{k2}. The winner nodes are determined based on the value of the CIM as follows:

k_1 = arg min_{k ∈ {1, ..., K}} [CIM(x_n, y_k, mean(S))],   (10)

k_2 = arg min_{k ∈ {1, ..., K} \ {k_1}} [CIM(x_n, y_k, mean(S))],   (11)

where k_1 and k_2 denote the indexes of the 1st and 2nd winner nodes, i.e., y_{k1} and y_{k2}, respectively, and S is the set of bandwidths of the Gaussian kernel in the CIM for each node.

Vigilance Test
Similarities between the data point x_n and the 1st and 2nd winner nodes are defined as follows:

V_{k1} = CIM(x_n, y_{k1}, mean(S)),   (12)

V_{k2} = CIM(x_n, y_{k2}, mean(S)).   (13)

The vigilance test classifies the relationship between a data point and a node into three cases by using the predefined similarity threshold V_threshold, i.e.,

• Case I: The similarity between the data point x_n and the 1st winner node y_{k1} is larger (i.e., less similar) than V_threshold, namely:

V_{k1} > V_threshold.   (14)

• Case II: The similarity between the data point x_n and the 1st winner node y_{k1} is smaller (i.e., more similar) than V_threshold, and the similarity between the data point x_n and the 2nd winner node y_{k2} is larger (i.e., less similar) than V_threshold, namely:

V_{k1} ≤ V_threshold and V_{k2} > V_threshold.   (15)

• Case III: The similarities between the data point x_n and the 1st and 2nd winner nodes are both smaller (i.e., more similar) than V_threshold, namely:

V_{k1} ≤ V_threshold and V_{k2} ≤ V_threshold.   (16)

Node Learning and Edge Construction
Depending on the result of the vigilance test, a different operation is performed. If x_n is classified as Case I by the vigilance test (i.e., (14) is satisfied), a new node y_{K+1} = x_n is added to the prototype node set Y = {y_1, y_2, ..., y_K}. A bandwidth σ_{K+1} for the node y_{K+1} is calculated by (9). In addition, the number of data points that have been accumulated by the node y_{K+1} is initialized as M_{K+1} = 1.

If x_n is classified as Case II by the vigilance test (i.e., (15) is satisfied), first, the age of each edge connected to the 1st winner node y_{k1} is updated as follows:

a_{(k1, j)} ← a_{(k1, j)} + 1   (j ∈ N_{k1}),   (17)

where N_{k1} is the set of all neighbor nodes of the node y_{k1}. After updating the age of each of those edges, any edge whose age is greater than the predefined threshold a_max is deleted. In addition, the counter M_{k1} for the number of data points that have been accumulated by y_{k1} is also updated as follows:

M_{k1} ← M_{k1} + 1.   (18)

Then, y_{k1} is updated as follows:

y_{k1} ← y_{k1} + (1 / M_{k1}) (x_n − y_{k1}).   (19)

When updating the node, the difference between x_n and y_{k1} is divided by M_{k1}. Thus, the change of the node position is smaller when M_{k1} is larger. This is based on the idea that the information around a node where data points are frequently given is important and should be held by the node.

If x_n is classified as Case III by the vigilance test (i.e., (16) is satisfied), the same operations as in Case II (i.e., (17)-(19)) are performed. In addition, the neighbor nodes of y_{k2} are updated as follows:

y_j ← y_j + (1 / (10 M_j)) (x_n − y_j)   (j ∈ N_{k2}).   (20)

In Case III, moreover, if there is no edge between y_{k1} and y_{k2}, a new edge e_{(k1, k2)} is defined and its age is initialized as follows:

a_{(k1, k2)} = 0.   (21)

In the case that there is already an edge between the nodes y_{k1} and y_{k2}, its age is also reset by (21). Apart from the above operations in Cases I-III, for the purpose of noise reduction, the nodes without edges are deleted every λ training data points (i.e., the node deletion interval is the presentation of λ training data points).
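The three cases can be sketched as one learning step. This is a simplified illustration, not the exact CAEA implementation: a single shared bandwidth stands in for mean(S), the per-node bandwidth bookkeeping and the Case III neighbor update (20) are omitted, and at least two nodes are assumed to exist.

```python
import numpy as np

def cim(x, y, sigma):
    return np.sqrt(1.0 - np.mean(np.exp(-(x - y)**2 / (2.0 * sigma**2))))

def learn_one(x, nodes, counts, edges, sigma, v_thr, a_max):
    """One CAEA-style learning step: winner selection, vigilance test,
    then node creation (Case I) or node/edge updates (Cases II/III).

    `nodes`: list of arrays; `counts`: win counters M_k;
    `edges`: dict mapping a sorted node-index pair to its age."""
    sims = np.array([cim(x, y, sigma) for y in nodes])
    k1, k2 = np.argsort(sims)[:2]
    if sims[k1] > v_thr:                      # Case I: x is novel
        nodes.append(x.copy())
        counts.append(1)
        return "case I"
    # Cases II/III: age the edges of k1, drop old ones, move k1 toward x.
    for e in [e for e in edges if k1 in e]:
        edges[e] += 1
        if edges[e] > a_max:
            del edges[e]
    counts[k1] += 1
    nodes[k1] += (x - nodes[k1]) / counts[k1]  # shrinking step size
    if sims[k2] <= v_thr:                      # Case III: also link k1 and k2
        edges[tuple(sorted((k1, k2)))] = 0     # create the edge or reset its age
        return "case III"
    return "case II"
```

The 1/M_{k1} step size means frequently-winning nodes barely move, which is exactly the stability half of the plasticity-stability trade-off.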
The learning procedure of CAEA is summarized in Algorithm 2. Note that, in CAEAC, the classification process of an unknown data point is similar to ASC. That is, the unknown data point is assigned to the class of its nearest neighbor node.

Modifications of the CIM Computation
As shown in (2) and (3), the CIM in CAEAC uses a common bandwidth σ for all attributes. Thus, a specific attribute may have a large impact on the value of the CIM if the common bandwidth σ is not appropriate for that attribute.
In this paper, two modifications of the CIM computation [42] are integrated into CAEAC in order to mitigate the above-mentioned effects: 1) the CIM is computed for each individual attribute separately, and the average CIM value is used as the similarity measure; and 2) a clustering algorithm is applied to the attribute values so that attributes with similar value ranges are grouped, the CIM is computed for each attribute group, and the average CIM value is used as the similarity measure.

Algorithm 2: Learning Algorithm of CAEA [16]
Input: a set of training data points X = {x_1, x_2, ..., x_n, ...} where x_n = (x_n1, x_n2, ..., x_nd), the predefined interval λ, and the predefined threshold of an age of edge a_max.
Output: the CAEA model (nodes Y, bandwidths S, counters M, and edges).
1: for each training data point x_l in X do
2:   if the number of nodes K < λ/2 then
3:     Create the new node as y_{K+1} = x_l.
4:     Initialize a counter M_{K+1} = 1.
5:     Update the set of counters M ← M ∪ {M_{K+1}}.
6:   else
7:     if K = λ/2 then compute the common bandwidth σ by (9) and calculate the similarity threshold V_threshold by (5).
8:     Search the indexes of the winner nodes k_1 and k_2 by (10) and (11), respectively.
9:     if Case I (i.e., (14) is satisfied) then
10:      Create the new node as y_{K+1} = x_l with a bandwidth σ_{K+1} by (9), initialize a counter M_{K+1} = 1, and update M ← M ∪ {M_{K+1}}.
11:    else
12:      Update the age of each edge connected to y_{k1} by (17).
13:      if a_{(k1, j)} > a_max then
14:        Delete the edge.
15:      Update the state of M_{k1} by (18).
16:      Update the state of y_{k1} by (19).
17:      if Case III (i.e., (16) is satisfied) then
18:        Update the state of the neighbor nodes y_j by (20).
19:        Create or reset the edge e_{(k1, k2)} and initialize its age by (21).
20:   Every λ data points, delete the nodes without edges.

Individual-based Approach

In this approach, the CIM is computed for each individual attribute separately, and the average of the attribute-wise CIM values is used as the similarity measure. The similarity between a data point x_n and a node y_k is defined by CIM_I as follows:

CIM_I(x_n, y_k, σ_k) = (1/d) Σ_{i=1}^{d} [1 − κ_{σ_ki}(x_ni − y_ki)]^{1/2},

where σ_k = (σ_k1, σ_k2, ..., σ_kd) is the bandwidth vector of the node y_k. The bandwidth for the ith attribute of y_k (i.e., σ_ki) is defined as follows:

σ_ki = Γ_i (4 / (3λ))^{1/5},

where Γ_i denotes a rescale operator which is defined by the standard deviation of the ith attribute values among the λ data points.
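Under the assumption that the individual-based similarity is the average of one-dimensional CIM values, one per attribute, each with its own bandwidth, it might look like:

```python
import numpy as np

def cim_individual(x, y, sigma_vec):
    """Individual-based CIM (sketch): a one-dimensional CIM per attribute,
    using that attribute's own bandwidth, averaged over the d attributes."""
    per_attr = np.sqrt(1.0 - np.exp(-(x - y)**2 / (2.0 * sigma_vec**2)))
    return float(np.mean(per_attr))
```

Because each attribute is measured against its own bandwidth, an attribute whose scale is poorly matched to a shared bandwidth can no longer dominate the overall similarity.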
In this paper, CAEAC with the individual-based approach is called CAEAC-Individual (CAEAC-I).

Clustering-based Approach
In this approach, for every λ data points, the clustering algorithm presented in [42] is applied to the attribute values. Each attribute of the λ data points is regarded as a one-dimensional vector and used as an input to the clustering algorithm. As a result, attributes with similar value ranges are grouped together. By using this grouping information, the similarity is calculated for each attribute group, and the average over the groups is defined as the similarity between a data point x_n and a node y_k. Specifically, by using the grouping information, the data point x_n = (x_n1, x_n2, ..., x_nd) is transformed into x_C_n = (u_n1, u_n2, ..., u_nJ) (J ≤ d) by the clustering algorithm, where u_nj represents the jth attribute group. Similarly, the node y_k = (y_k1, y_k2, ..., y_kd) is transformed into y_C_k = (v_k1, v_k2, ..., v_kJ), where v_kj represents the jth attribute group. The dimensionality of each attribute group is represented as d_C = {d_1, d_2, ..., d_J}, where d_j is the dimensionality of the jth attribute group (i.e., the number of attributes in the jth attribute group).
Here, the similarity between the data point x_C_n and the node y_C_k is defined by CIM_C as follows:

CIM_C(x_C_n, y_C_k) = (1/J) Σ_{j=1}^{J} [1 − Ĉ(u_nj, v_kj)]^{1/2},  where  Ĉ(u_nj, v_kj) = (1/d_j) Σ_{i=1}^{d_j} κ_{σ_j}(u_nji − v_kji),

and y_C_k = (v_k1, v_k2, ..., v_kJ) is the node y_k whose attributes are grouped based on the attribute grouping of x_C_n. A bandwidth σ_j for the jth attribute group is defined as follows:

σ_j = median_{i ∈ group j}(Γ_i) (4 / (3λ))^{1/5},

where Γ_i denotes a rescale operator which is defined by the standard deviation of the ith attribute values in the jth attribute group among the λ data points.
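Assuming the grouped similarity is a per-group CIM averaged over the J groups, a sketch follows. The grouping step of [42] is not reproduced here; the groups are passed in explicitly as index sets, and `sigmas` holds one assumed bandwidth per group.

```python
import numpy as np

def cim_grouped(x, y, groups, sigmas):
    """Clustering-based CIM (sketch): attributes are partitioned into
    groups of similar value ranges; a CIM is computed within each group
    with a group-wise bandwidth, and the group values are averaged.

    `groups`: list of index arrays; `sigmas`: one bandwidth per group."""
    vals = []
    for idx, s in zip(groups, sigmas):
        # Correntropy estimate restricted to this group's attributes.
        c_hat = np.mean(np.exp(-(x[idx] - y[idx])**2 / (2.0 * s**2)))
        vals.append(np.sqrt(1.0 - c_hat))
    return float(np.mean(vals))
```

With a single group covering all attributes, this reduces to the plain CIM of (4), so the grouped form is a strict generalization.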
In this paper, CAEAC with the clustering-based approach is called CAEAC-Clustering (CAEAC-C).
The differences in attribute processing among the general approach, the individual-based approach (CAEAC-I), and the clustering-based approach (CAEAC-C) are shown in Fig. 3. The source codes of CAEAC, CAEAC-I, and CAEAC-C are available on GitHub.
Simulation Experiments

Compared Algorithms

ASC is a similar approach to CAEAC. GSOINN+ is a state-of-the-art SOINN-based classification algorithm capable of continual learning. FTCAC and SOINN+C are classifiers based on FTCA [15] and SOINN+ [20], respectively, which have the same architecture as CAEAC (i.e., Fig. 1). FTCA is a state-of-the-art ART-based clustering algorithm while SOINN+ is a state-of-the-art GNG-based clustering algorithm. Because those algorithms have similar characteristics to CAEA, we can construct the related classifiers (i.e., FTCAC and SOINN+C) and use them in our computational experiments.

Datasets
We utilize five synthetic datasets and nine real-world datasets selected from commonly used clustering benchmarks [43] and public repositories [44,45]. Table 2 summarizes the statistics of the datasets.

Parameter Specifications
This section describes the parameter settings of each algorithm for the classification tasks in Section 4.3. ASC, FTCA, GSOINN+, CAEAC, CAEAC-I, and CAEAC-C have parameters that affect the classification performance, while SOINN+C does not have any parameters.
Table 3 summarizes the parameters of each algorithm. Because SOINN+C does not have any parameters, it is not listed in Table 3. In each algorithm, two parameters are specified by grid search. The ranges of the grid search are the same as in the original paper of each algorithm, or wider. In ASC, the parameter k for noise reduction uses the same setting as in [8]. During the grid search in our experiments, the training data points in each dataset are presented to each algorithm only once without pre-processing. For each parameter specification, we repeat the evaluation 20 times. In each of the 20 runs, the training data points (with no pre-processing) are first randomly ordered using a different random seed; the re-ordered training data points are then used for all algorithms. Table 4 summarizes the parameter values specified by grid search. N/A indicates that the corresponding algorithm could not build a predictive model. With the parameter specifications in Tables 3 and 4, each algorithm shows the highest Accuracy for each dataset.

Classification Performance

Using the parameter specifications in Tables 3 and 4, we repeat the evaluation 20 times. Similar to Section 4.2, the training data points (with no pre-processing) are first randomly ordered using a different random seed, and the re-ordered training data points are then used for all algorithms. The classification performance is evaluated by Accuracy, NMI [46], and the Adjusted Rand Index (ARI) [47].
As a statistical analysis, the Friedman test and the Nemenyi post-hoc analysis [48] are utilized. The Friedman test is used to test the null hypothesis that all algorithms perform equally. If the null hypothesis is rejected, the Nemenyi post-hoc analysis is conducted. The Nemenyi post-hoc analysis is used for all pairwise comparisons based on the ranks of results over all the evaluation metrics for all datasets. Here, the null hypothesis is rejected at the significance level of 0.05 in both the Friedman test and the Nemenyi post-hoc analysis. All computations are carried out on Matlab 2020a with a 2.2 GHz Xeon Gold 6238R processor and 768 GB RAM.

Results
Table 5 shows the results of the classification performance. The best value in each metric is indicated in bold for each dataset. The standard deviation is indicated in parentheses. N/A indicates that the corresponding algorithm could not build a predictive model. Training time is in [sec]. The brighter the cell color, the better the performance. In ASC, k-means is applied to adjust the positions of the generated nodes in order to achieve a good approximation of the distribution of data points. In addition, some generated nodes and edges are deleted during the learning process. Therefore, the number of clusters is not shown for ASC in Table 5.
As an overall trend, ASC, CAEAC, CAEAC-I, and CAEAC-C show better classification performance than SOINN+C and GSOINN+. FTCAC showed the best classification performance on several datasets, whereas it could not build a predictive model for TOX171.
Regarding training time, ASC, FTCAC, and CAEAC are faster than SOINN+C, GSOINN+, CAEAC-I, and CAEAC-C. With respect to generated nodes and clusters, CAEAC and its variants tend to generate a larger number of nodes and clusters than the compared algorithms.
Here, the null hypothesis is rejected by the Friedman test over all the evaluation metrics and datasets. Thus, we apply the Nemenyi post-hoc analysis. Fig. 4 shows a critical difference diagram based on the classification performance over all the evaluation metrics and datasets. Better performance is indicated by a lower average rank, i.e., toward the right side of the critical difference diagram. In theory, algorithms within a critical distance (i.e., a red line) do not have a statistically significant difference [48]. In Fig. 4, CAEAC-C shows the lowest rank value, but there is no statistically significant difference from CAEAC, CAEAC-I, ASC, and FTCAC. Among the compared algorithms, FTCAC cannot build a predictive model for TOX171, as shown in Table 5. This is a critical problem of FTCAC. SOINN+C and GSOINN+ show inferior classification performance to ASC, CAEAC, CAEAC-I, and CAEAC-C, as shown in Fig. 4 (and Table 5). Although ASC shows comparable classification performance to CAEAC, CAEAC-I, and CAEAC-C, ASC deletes some generated nodes and edges during its learning process. Therefore, it may be difficult for ASC to maintain continual learning capability when learning additional data points after the deletion process. This can be regarded as a functional limitation of ASC. The above observations suggest that CAEAC, CAEAC-I, and CAEAC-C have several advantages over the compared algorithms, not only in classification performance but also from functional perspectives. The characteristics of CAEAC and its variants are summarized as follows:

• CAEAC: This algorithm shows stable and good classification performance while maintaining fast computation.

• CAEAC-I: The classification performance and computation time of this algorithm are both inferior to CAEAC and CAEAC-C. However, in Table 5, it has the highest number of best evaluation metric values among CAEAC and its variants. Therefore, this algorithm is worth trying when high classification performance is desired.

• CAEAC-C: This algorithm can be the first-choice algorithm because it shows superior classification performance to CAEAC and CAEAC-I.

Sensitivity of Parameters
Figs. 5-7 show the effects of the parameter settings on Accuracy for ASC, FTCAC, and GSOINN+, respectively. For FTCAC, each parameter must be carefully specified to obtain high Accuracy on classification tasks. If the parameters of FTCAC are not specified within an appropriate range, either the classification performance deteriorates dramatically or a predictive model cannot be built. On the other hand, parameter specification for ASC and GSOINN+ is easier than for FTCAC because the range of parameters that provide high Accuracy is wider.
Figs. 8-10 show the effects of the parameter settings on Accuracy for CAEAC, CAEAC-I, and CAEAC-C, respectively. λ is the predefined interval for computing σ and deleting an isolated node, and a_max is the predefined threshold of an age of edge. As a general trend among CAEAC and its variants, there are no large undulations in the bar graphs for each parameter, unlike FTCAC (i.e., Fig. 6).
From the above observations, the parameter sensitivity of ASC, GSOINN+, CAEAC, CAEAC-I, and CAEAC-C is lower than that of FTCAC. Moreover, it can be considered that ASC, GSOINN+, CAEAC, CAEAC-I, and CAEAC-C can utilize a common parameter setting for a wide variety of datasets without difficulty in parameter specification.

Conclusion
This paper proposed a supervised classification algorithm capable of continual learning by means of an ART-based growing self-organizing clustering algorithm, called CAEAC. In addition, its two variants, called CAEAC-I and CAEAC-C, were also proposed by modifying the CIM computation. Experimental results showed that CAEAC and its variants have superior classification performance compared with state-of-the-art clustering-based classification algorithms.
In regard to continual learning, the concepts underlying learned information sometimes change over time. This is known as concept drift [49]. Although handling concept drift is an important capability for classification algorithms capable of continual learning, CAEAC focuses only on avoiding catastrophic forgetting. Thus, as a future research topic, we will focus on handling concept drift in CAEAC and its variants in order to improve the functionality of the algorithms.

Figure 3 .
Figure 3. Differences in the CIM calculation.

Figure 4 .
Figure 4. Critical difference diagram of classification tasks.

Table 2 .
Statistics of datasets for classification tasks

Table 3 .
Parameter settings of each algorithm

Table 4 .
Parameters specified by grid search

Table 5 .
Results of the classification performance.
The best value in each metric is indicated in bold. The standard deviation is indicated in parentheses. Training time is in [sec].