Hierarchical Concept Learning by Fuzzy Semantic Cells

Concept modeling and learning have been important research topics in artificial intelligence and knowledge discovery. This paper studies a hierarchical concept learning method that requires only a small amount of data to achieve competitive performance. The method starts from a set of fuzzy prototypes called Fuzzy Semantic Cells (FSCs) and, through FSC parameter optimization, creates a hierarchical data–prototype–concept structure. Experiments are conducted to demonstrate the effectiveness of our approach on a classification problem. In particular, when faced with limited training data, our proposed method is comparable with traditional techniques in terms of robustness and generalization ability.


Introduction
This work is mainly concerned with concept learning, the process of partitioning samples into classes for the purpose of generalization, discrimination, and inference [1]. Concepts are the basis of most cognitive processes such as inference, learning, and reasoning [2][3][4]. Concept modeling is fundamental in the fields of cognitive science and artificial intelligence, and the learning and modeling of fuzzy concepts in particular has been a hot topic in both fields. One prominent account of the cognitive representation of concepts in natural language is prototype theory [5,6]. Many classic machine learning algorithms are related to prototype theory, such as K-means and K-nearest neighbors. The modeling of concept vagueness in artificial intelligence has been dominated by ideas from fuzzy set theory, as originally proposed by Zadeh [7,8]. Goodman and Nguyen later provided a solid foundation for fuzzy conceptual representations [9]. Lawry and Tang introduced an approach to uncertainty modeling for vague concepts by combining prototype theory and random set theory [10,11]. In subsequent in-depth studies, they developed a semantic representation for modeling uncertain concepts called the Information Cell [12][13][14]. Tang and Xiao [15] also adopted the name Fuzzy Semantic Cell (FSC) for this model and, based on it, provided an efficient way to perform unsupervised concept learning [15].
Our motivation is to model the concepts underlying data and to represent concepts as partitions of samples for further classification. Our work builds on the model given in [15] and is inspired by the set partition method based on conditional entropy described in [16]. The fuzzy semantic cell (FSC) model comprises a prototype P, a distance function d, and a probability density function δ of granularity. This structure is considered the smallest unit of vague concepts and the building block of concept representation. In their study, Tang and Xiao proposed three principles for developing reasonable FSCs: maximum coverage, maximum specificity, and maximum fuzzy entropy. In this paper, we focus on solving supervised concept learning problems, particularly those with sparse data. We learn from Śmieja and Geiger [16] that the consistency of a set segmentation can be measured using conditional entropy, a viewpoint that has greatly influenced our understanding of fuzzy semantic cell optimization. We combine the above two approaches from a supervised point of view to obtain a simplified way to build FSCs, replacing the three principles with a single conditional entropy minimization principle optimized with the Adam algorithm [17]. The Adam algorithm is an efficient method for solving unconstrained nonlinear optimization problems and has been widely used in the optimization of neural networks in recent years. Moreover, we propose a hierarchical concept learning structure for modeling and learning the abstract concepts to which samples pertain. Our method directly applies to classification tasks with few training samples (only 10% of samples as the training set). In our experiments, our method is not only competitive in classification accuracy but also exhibits robustness and generalization ability.
As conventional baselines, we choose the NaiveBayes algorithm [18], k-nearest neighbor (KNN) [19], Decision Tree (DT) [20], Support Vector Machine (SVM) [21], AdaBoost [22], and a neural network algorithm for controlled experiments.
The remainder of this paper is organized as follows. In Section 2, we revisit the ideas of [15], which presents a cognitive structure of vague concepts named Fuzzy Semantic Cell (FSC). In Section 3, we present a detailed introduction to our proposed supervised hierarchical concept learning method based on FSC. Section 4 reports and analyzes various experimental results to demonstrate the effectiveness of our proposed method. Conclusions are presented in Section 5. Our source code in PyTorch [23] is publicly available on github.com/Ming0405/HCL_.

Fuzzy Semantic Cell
Tang and Lawry [10,11] introduced a novel cognitive structure L = ⟨P, d, δ⟩ to model vague concepts of the form "about P", "similar to P", or "close to P". In this paper, we follow the definition in Tang and Xiao's work [15], named the Fuzzy Semantic Cell.

Definition 1.
A fuzzy semantic cell for a vague concept L_i on the domain Ω is a triple L_i = ⟨P_i, d, δ_i⟩, where P_i is the prototype of L_i, d is a distance metric on Ω, and δ_i is the probability density function of the neighborhood size of the vague concept L_i.
In other words, a fuzzy semantic cell for a vague concept L_i is made up of a semantic cell nucleus, represented by the prototype P_i, and a semantic cell membrane, represented by an uncertain boundary of L_i determined by d and δ_i. Hence, the fuzzy semantic cell L_i = ⟨P_i, d, δ_i⟩ is assumed to be the smallest semantic unit of vague concepts. Figure 1a shows a fuzzy semantic cell L_i = ⟨P_i, d, δ_i⟩ in two-dimensional space as an illustrative example.

Definition 2.
On the domain Ω, for any fuzzy semantic cell L_i = ⟨P_i, d, δ_i⟩ and neighborhood size ε ≥ 0, the ε-neighborhood N^ε_Li of L_i is defined as follows:

N^ε_Li = {x ∈ Ω : d(x, P_i) ≤ ε}.

According to Definition 2, the ε-neighborhood N^ε_Li includes all the elements x ∈ Ω whose distance to the prototype P_i is no greater than the given neighborhood size ε of L_i, as shown in Figure 1a. Since ε is a random variable with the probability density function δ_i, N^ε_Li can be considered a random-set neighborhood of L_i. Therefore, for any x ∈ Ω, the degree of x belonging to the fuzzy semantic cell L_i should be equal to the probability that the ε-neighborhood N^ε_Li of L_i is a set containing x, denoted by Prob(ε : x ∈ N^ε_Li). Then, the neighborhood function of L_i can be defined as follows.
Definition 3. On the domain Ω, for any fuzzy semantic cell L_i = ⟨P_i, d, δ_i⟩ and x ∈ Ω, the neighborhood function μ_Li(x) of L_i is defined as follows:

μ_Li(x) = Prob(ε : x ∈ N^ε_Li) = ∫_{d(x, P_i)}^{+∞} δ_i(ε) dε. (1)

The neighborhood function μ_Li(x) can be interpreted as the belief of a sample x being a neighbor of the fuzzy semantic cell L_i = ⟨P_i, d, δ_i⟩ with the prototype P_i and the uncertain neighborhood size ε. Figure 1b shows an example of the neighborhood function μ_Li(x) of L_i. Since μ_Li(x) is a decreasing function of the distance d(x, P_i), an exponential form of μ_Li(x) is defined in [24] as follows:

μ_Li(x) = exp(−d(x, P_i)/σ_i), (2)

where σ_i is the parameter of the probability density function δ_i, and it is related to the extent of the distribution. According to (2), the probability density function δ_i(ε | σ_i) can be readily derived as follows:

δ_i(ε | σ_i) = (1/σ_i) exp(−ε/σ_i). (3)

Moreover, we use the following semimetric for d in this paper: d(x, P_i) = ‖x − P_i‖²₂. A semimetric on Ω is a function d : Ω × Ω → ℝ⁺ that satisfies d(x, y) = 0 iff x = y and d(x, y) = d(y, x) for all x, y ∈ Ω.
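To make the neighborhood function concrete, here is a minimal Python (NumPy) sketch of μ_Li(x) = exp(−d(x, P_i)/σ_i) with the squared-Euclidean semimetric from above; the function name and array conventions are our own, not from the released code.

```python
import numpy as np

def membership(x, prototype, sigma):
    """Neighborhood (membership) function of an FSC.

    Implements mu_Li(x) = exp(-d(x, P_i) / sigma_i), where d is the squared
    Euclidean semimetric d(x, P_i) = ||x - P_i||^2. This form follows from
    integrating the exponential density delta_i(eps | sigma_i) from d(x, P_i)
    to infinity, as in Equations (1)-(3)."""
    x = np.asarray(x, dtype=float)
    prototype = np.asarray(prototype, dtype=float)
    d = np.sum((x - prototype) ** 2)  # squared-L2 semimetric
    return float(np.exp(-d / sigma))

# Membership is 1 at the prototype and decays with squared distance.
print(membership([0.0, 0.0], [0.0, 0.0], 1.0))  # 1.0
print(membership([1.0, 0.0], [0.0, 0.0], 1.0))  # exp(-1) ≈ 0.3679
```

Note that because d is the squared distance, the membership decays like a Gaussian in the plain Euclidean distance.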

Method
This section suggests a method for obtaining a hierarchical structure from data concerning the concept using appropriate fuzzy semantic cells. First, a theoretical analysis of the proposed method is provided to demonstrate the rationality of this hierarchical structure. Then we explain the classification decision method based on the hierarchical structure. Finally, we present the principles of learning the fuzzy semantic cells of a hierarchical structure, followed by the developed algorithm.

Hierarchical Structure of Concept Learning
When people describe an abstract concept corresponding to a specific object, they may use one or more familiar prototypes in the explanation. These prototypes are highly relevant, and the boundaries between them are somewhat vague. Within a certain range, each prototype can explain the meaning of the object. Then, starting with these prototypes, we can reach a higher level of abstraction to determine the concept's meaning. We may need only one prototype for relatively clear and simple data; however, more ambiguous and broad concepts may necessitate multiple prototypes to cover all interpretable ranges adequately. Taking into account the ambiguous nature of abstract concepts, we develop a hierarchical concept learning model based on the fuzzy semantic cell. As shown in Figure 2, the hierarchical structure is composed of three layers: the bottom is the data layer, the middle is the fuzzy semantic cell layer, and the top is the concept layer. The membership degrees μ characterize the relationship between the samples in the data layer and the fuzzy semantic cells in the middle layer. In particular, μ_Li(x) indicates how well the semantic cell L_i can interpret the sample x. Conversely, each fuzzy semantic cell can explain a range of different radii, covering a different number of samples in the data layer. Each concept is partitioned at the concept level by one or more FSCs. Because of the uncertainty and overlap of the boundaries between FSCs, this partition is ambiguous.

Formally, suppose we have a dataset DB ⊂ Ω with category information, where this category information is a series of abstract concepts. DB can be partitioned into K subsets by the category information:

S : DB = D_1 ∪ D_2 ∪ ... ∪ D_K,

where K is the number of categories, and we denote this partition by S. In other words, each D_j is a subset of DB, and all of the elements x in D_j belong to the same category.
At the same time, assume that Ω can also be partitioned into LA = {L_1, ..., L_M}, where L_i = ⟨P_i, d, δ_i⟩ is a fuzzy semantic cell with a prototype P_i ∈ Ω, a density function δ_i defined on [0, +∞), and a distance metric d on Ω. We then obtain the appropriate fuzzy semantic cell partition using the method described in Section 3.3.

Decision Rule of Classification
Based on the above discussion, we have obtained the hierarchical structure from known data to the concept. For a new sample x, the abstract concept it belongs to is identified as follows: compute the membership degree of x to the M fuzzy semantic cells, find the cell L_i* with the largest membership, i* = argmax_i μ_Li(x), and choose the concept that L_i* belongs to as the abstract concept of x. The classification decision rule outlined above is based on a hierarchical concept learning structure [24,25].
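The decision rule can be sketched in a few lines of NumPy. This is a minimal illustration assuming the exponential membership form from Section 2; the function interface and the `cell_to_class` mapping from cells to concepts are our own naming, not from the paper's released code.

```python
import numpy as np

def classify(x, prototypes, sigmas, cell_to_class):
    """Assign x the concept of the FSC with the largest membership degree.

    prototypes: (M, D) array of FSC prototypes P_i
    sigmas:     (M,) array of FSC widths sigma_i
    cell_to_class: sequence mapping FSC index i to its concept label"""
    x = np.asarray(x, dtype=float)
    d = np.sum((prototypes - x) ** 2, axis=1)   # squared distances to all prototypes
    mu = np.exp(-d / sigmas)                    # memberships to all M cells
    return cell_to_class[int(np.argmax(mu))]    # concept of the winning cell

prototypes = np.array([[0.0, 0.0], [5.0, 5.0]])
sigmas = np.array([1.0, 1.0])
print(classify([0.5, 0.2], prototypes, sigmas, [0, 1]))  # 0
```

Since exp(−d/σ) is monotonically decreasing in d, with equal σ values this reduces to nearest-prototype classification; with unequal σ the cell widths also matter.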

Optimization of Fuzzy Semantic Cells
In this section, we introduce the principles of learning FSCs. A proper partition LA is one that makes the concepts of the elements covered by each L_i as consistent as possible. That is to say, the concept purity of the elements covered by every L_i should be as high as possible.
To this end, for each concept (category), we compute the sum of the memberships of all elements under that concept to every fuzzy semantic cell L_i, denoted r_ij. In fact, r_ij approximates the average number of samples in D_j covered by L_i:

r_ij = Σ_{x ∈ D_j} μ_Li(x).

Then we normalize the r_ij's for every L_i to estimate the probability that a sample covered by L_i lies in D_j:

p_ij = r_ij / Σ_{j'=1}^{K} r_ij'.

As a result, the entropy of every L_i is defined as:

H(S | L_i ∩ DB) = −Σ_{j=1}^{K} p_ij log p_ij.

We minimize H(S | L_i ∩ DB) to obtain the prior distribution for every fuzzy semantic cell. As mentioned above, our goal is to find the L_i which maximizes the concept purity. In other words, we need to find a partition of fuzzy semantic cells that is highly consistent with the concept partition. Our solution is inspired by [16], where the authors consider a finite dataset X with a subset X_l ⊂ X of labeled data points and prove that the conditional entropy H(Z | Y, X_l) can be used to measure the consistency of a dataset segmentation: the smaller the conditional entropy, the higher the consistency. Inspired by this idea, we define the partition consistency of LA and S through the conditional entropy, which in our scenario is:

H(S | LA) = Σ_{i=1}^{M} q_i H(S | L_i ∩ DB), with q_i = Σ_{j=1}^{K} r_ij / Σ_{i'=1}^{M} Σ_{j=1}^{K} r_i'j. (4)

By minimizing this conditional entropy, we update the fuzzy semantic cells so that the resulting partition has the highest possible purity. At the same time, the parameter M controls the granularity of the FSCs. A proper value of M ensures that the fuzzy semantic cells are neither too coarse nor too fine. LA being too fine means M ≫ K. In this case, too many FSCs correspond to the same category: although each L_i has high internal purity, each fuzzy semantic cell covers only a very small radius. This causes great redundancy.
In the extreme case, the partition is perfectly pure when LA is separated into cells that each contain only one element, that is, when M equals the number of samples x. Because there are then far too many fuzzy semantic cells, the computational complexity is high. Therefore, under the premise of high purity, the fewer FSCs per category, the better. As a result, we should include the constraint that M be as small as possible while each L_i remains as pure as possible.
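The quantities r_ij and p_ij and the conditional-entropy objective can be sketched as follows. This is a NumPy illustration; the coverage weights q_i used to combine the per-cell entropies are our assumption about the exact weighting in Equation (4).

```python
import numpy as np

def conditional_entropy(memberships, labels, n_classes):
    """Conditional entropy H(S | LA) of the concept partition S given the FSCs.

    memberships: (n, M) array, memberships[n, i] = mu_Li(x_n)
    labels:      (n,) array of concept indices for each sample
    r[i, j] sums memberships of class-j samples to cell i; p[i, j] normalizes
    over classes; cells are weighted by their relative coverage q[i]."""
    M = memberships.shape[1]
    r = np.zeros((M, n_classes))
    for j in range(n_classes):
        r[:, j] = memberships[labels == j].sum(axis=0)          # r_ij
    p = r / (r.sum(axis=1, keepdims=True) + 1e-12)              # p_ij
    q = r.sum(axis=1) / r.sum()                                 # coverage weights
    h = -np.sum(p * np.log(p + 1e-12), axis=1)                  # H(S | L_i)
    return float(np.sum(q * h))                                 # H(S | LA)

# A cell covering only one class contributes (near) zero conditional entropy.
mu = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(conditional_entropy(mu, np.array([0, 0, 1]), 2))  # ~0.0
```

When a cell's mass is split evenly between two classes, its entropy rises toward log 2, so minimizing this objective drives each cell toward covering a single concept.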
Finally, the learning problem of the hierarchical structure becomes an optimization problem over the objective function. Due to the difficulty of calculating a closed-form solution, we use an iterative method instead. To obtain the optimal fuzzy semantic cells, we apply the Adam method [17], an efficient method for solving unconstrained nonlinear optimization problems, to minimize the objective function in Equation (4). Existing works also use gradient descent [24], evolutionary optimization [26], or particle swarm optimization [27] to optimize such models. We believe numerical methods are more stable in optimization, and Adam can avoid saddle points better than gradient descent. The procedure of hierarchical concept learning with fuzzy semantic cells is summarized in Algorithm 1.

Algorithm 1 Hierarchical Concept Learning by Fuzzy Semantic Cells
Require: dataset DB with its category partition S, the number of prototypes M, and initial prototypes P and widths σ. Repeatedly minimize the objective O(P, σ) in Equation (4) by Adam until convergence, and return the optimized (P, σ). The main runtime cost in our training is the computation of Equation (4), namely the conditional entropy. The computational complexity of Equation (4) is O(nM), which scales linearly with n, the number of samples. Empirically, the computation takes less than a second on an Intel Xeon 4116 CPU under Ubuntu 16.04. The prototypes and σ are the parameters of our fuzzy semantic cells to optimize, and the only influential hyper-parameter is the number of prototypes, which we discuss in Section 4.3.
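For illustration, a single Adam update step can be sketched as below. The released code relies on torch.optim.Adam, so this manual NumPy version is only a didactic stand-in, and the quadratic toy loss merely replaces the actual objective O(P, σ).

```python
import numpy as np

def adam_step(theta, grad, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update on parameter vector theta given its gradient."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad           # first-moment estimate
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2      # second-moment estimate
    m_hat = state["m"] / (1 - b1 ** state["t"])              # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# Toy run: minimize f(theta) = theta^2, a stand-in for O(P, sigma).
theta = np.array([3.0])
state = {"t": 0, "m": np.zeros(1), "v": np.zeros(1)}
for _ in range(2000):
    theta = adam_step(theta, 2 * theta, state, lr=0.05)  # grad of theta^2 is 2*theta
print(float(theta[0]))  # close to the minimum at 0
```

In the actual method, theta would stack the prototypes P and the widths σ, and the gradient of the conditional-entropy objective would be obtained by automatic differentiation.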
In summary, we describe the learning method to build the hierarchical structure from the data to the concepts. Based on this structure, given a new sample x, we also introduce the decision-making approach to infer its abstract concept.

Discussion
In this part, we discuss the similarities and differences between our proposed method and related supervised learning methods. The discussion covers four aspects: prototype classification, hard/soft margins, limited samples, and unsupervised learning.
Prototype classification. Prototype classification has been a thriving area in artificial intelligence. Mean-of-class prototype classification is a typical method to learn prototypes based on which classification is done [28]. There are three kinds of mean-of-class prototype classification methods. Non-margin classifiers, such as linear discriminant analysis (LDA) [29,30], form non-overlapping areas to represent concepts. Such methods use a similar strategy as our method to construct concepts. Other advanced works [31,32] introduce deep models to extract features for learning prototypes. However, one concept may have multiple prototypes and a non-margin classifier assumes only one prototype, while our method assumes multiple prototypes intrinsically.
Hard/soft margin. Hard margin classifiers include Support Vector Machine [21] and similar methods. They assume a hard hyper-plane or hyper-sphere to separate concepts. Such assumption does not consider the vagueness and uncertainty among samples, while our fuzzy semantic cells assume vagueness with an uncertain margin. Soft margin classifiers obtain prototypes by applying a regularized preprocessing and then classifying samples using hard margin classifiers [21]. This kind of methods also fail to consider multiple prototypes. Another typical soft margin classifier is mixture classification models [33,34]. The main idea is to fit Gaussian mixture models on samples in each class respectively and then to classify new samples with linear discriminant analysis. The fitting process is similar to our prototype learning procedure. However, our method further considers the exclusion among concepts with conditional entropy.
Limited samples. How to learn concepts with limited data is also a fundamental topic. As the number of samples is limited, prior knowledge [35], strong assumptions [36], appropriate augmentation [37,38] or transferred knowledge [39] is important. They can serve as an important inductive bias for out-of-distribution samples or regularize the model to avoid potential overfitting. The virtue that learning concepts from data can prevent adversarial attacks is also discussed [40]. Our method is based on the assumption of prototype theory [11] and the hierarchical structure of prototypes and concepts [24,25]. These two assumptions are from the perspective of how humans form concepts. Given the no-free lunch theorem, there is no universally applicable assumption. However, as our motivation is to model concepts, our assumptions of hierarchical prototypes are instructive.
Unsupervised learning. Learning prototypes and concepts has also been an important topic in unsupervised learning. Under the assumption of density peaks [41], message propagation [42], mixtures of distributions [43], or disjunctive combination [44], the underlying concepts can also be learned. A recent work [45] also extracts deep features for clustering. The motivation of these unsupervised learning methods is likewise to learn concepts from samples, but the learning procedures are not supervised by labels. How to learn fuzzy concepts in an unsupervised manner under a reasonable assumption is an important topic for our future work.

Experiments
In this section, we conduct experimental studies of the proposed approach on five datasets.

Datasets
The description of each dataset is given as follows. Synthetic dataset [15]: This synthetic two-dimensional dataset contains three classes, each following a Gaussian distribution. The dataset contains 750 samples, with 250, 300, and 200 samples in each class, respectively.
Forest [46]: The Forest dataset is from the UCI Machine Learning Repository. It contains training and testing data from a remote sensing study that mapped forest types based on their spectral characteristics at visible-to-near-infrared wavelengths, with four classes ("s"—"Sugi" forest, "h"—"Hinoki" forest, "d"—"Mixed deciduous" forest, "o"—"Other" non-forest land). The original data contain 27 attributes, of which attributes 1–9 are numerical spectral properties and 10–27 are non-numerical attributes. We took only the first nine numerical columns and merged the original training and test sets, yielding 523 samples as a 523 × 9 array.
Pendigits [47]: A handwritten-digit database collected from 44 writers (250 samples each), stored in a 10,992 × 16 array. It is also from the UCI Machine Learning Repository.
MNIST [48]: The MNIST database contains a total of 70,000 examples of handwritten digits of size 28 × 28 pixels. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
MIT face [49]: This dataset from MIT contains synthetic face images of 10 subjects, with 324 images per subject. The synthetic images were rendered from 3D head models and stored as 64 × 64 images. The final dataset is stored as a 3240 × 4096 array, containing ten categories, one per subject. Table 1 summarizes the characteristics of the five datasets involved in the experiments.

Initialization Method
The dataset needs to be preprocessed before starting the classification experiment. To ensure the rationality of the dataset segmentation, the data of each category are first extracted proportionally and then spliced together as the training set. Finally, we shuffle the samples in the training set, and the remaining data are regarded as the test set.
For the initialization of the parameter P, we try a variety of methods.
• The first method performs K-means on the training set of each class and uses the cluster centers as the initial prototypes.
• The second method obtains the cluster centers by running K-means on the entire training set.
• The third method randomly selects samples from each category as prototypes.
• The last method takes the average of each category's data as the initialization.
Throughout this paper, we assume the number of clusters used in initialization equals the number of prototypes. The first three methods take a number of prototypes M proportional to the number of categories K. Notice that K denotes the number of categories and is not related to the K in K-means. The final method takes the mass centers of the samples of each class as the prototypes, so it assumes the number of prototypes equals the number of categories. In the experimental section, we compare these four initialization methods to investigate the effects of different prototype initializations.
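Two of the four initializations (random per-class samples and per-class means) can be sketched as follows. The function name and interface are our own; the K-means variants would substitute cluster centers (e.g., from scikit-learn's KMeans) for these choices.

```python
import numpy as np

def init_prototypes(X, y, n_classes, per_class=1, method="mean", seed=None):
    """Initialize FSC prototypes from labeled data.

    method="mean":   one prototype per class, the class centroid (fourth method)
    method="random": `per_class` random samples from each class (third method)
    Returns the prototype array and a cell-to-class index mapping."""
    rng = np.random.default_rng(seed)
    protos, cell_to_class = [], []
    for j in range(n_classes):
        Xj = X[y == j]
        if method == "mean":
            protos.append(Xj.mean(axis=0))
            cell_to_class.append(j)
        else:  # draw prototypes at random from the class samples
            idx = rng.choice(len(Xj), size=per_class, replace=False)
            protos.extend(Xj[idx])
            cell_to_class.extend([j] * per_class)
    return np.array(protos), np.array(cell_to_class)

X = np.array([[0., 0.], [0., 2.], [4., 4.], [4., 6.]])
y = np.array([0, 0, 1, 1])
P, c2c = init_prototypes(X, y, 2)
print(P.tolist())  # [[0.0, 1.0], [4.0, 5.0]]
```

The "mean" branch corresponds to the case where the number of prototypes equals the number of categories; the "random" branch allows M to grow proportionally to K via `per_class`.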
For the parameter σ, which controls the width of the bell-shaped distribution, experiments show that its initialization affects both the convergence speed and the classification result of the algorithm. Empirically, we found that a better classification result is obtained when σ is initialized to one third of the average distance between the training samples and their corresponding prototype. However, we have not found theoretical support for this phenomenon and leave it for future work.

Table 2 shows the changes in classification accuracy under different training set scales. We repeat the experiment 100 times to obtain the mean and standard deviation of the classification results, accounting for the randomness of the dataset segmentation and differences in the initial parameters. In this experiment, we use the second initialization method for P; in particular, we use K-means to initialize one prototype in each class and then combine them as the prototype initialization. For σ, we take one third of the average distance between the samples of each category and their prototypes. From the experimental results, we can see that our method is robust: even with a small amount of training data (only 10%), we obtain an acceptable classification result. For the prototype initialization, we can vary the initialization method and the number of initial prototypes. In the following two experiments, 30% (3290 samples) of the Pendigits dataset is used as the training set and the rest as the test set (7702 samples). Table 3 shows the difference in classification results for different numbers of prototypes. We initialize one to four prototypes in each of the 10 categories by K-means, so the total number of prototypes ranges from 10 to 40. We again set σ to one third of the average distance between the samples of each category and their prototypes.
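The empirical σ initialization described above can be sketched as follows. This is a NumPy illustration; whether the "average distance" is the squared semimetric from Section 2 or the plain Euclidean distance is our assumption — we follow the squared form here.

```python
import numpy as np

def init_sigma(X, y, prototypes, cell_to_class):
    """Initialize each sigma_i to one third of the average (squared-Euclidean)
    distance between the samples of the cell's class and its prototype."""
    sigmas = np.empty(len(prototypes))
    for i, (P, j) in enumerate(zip(prototypes, cell_to_class)):
        d = np.sum((X[y == j] - P) ** 2, axis=1)  # squared distances to prototype
        sigmas[i] = d.mean() / 3.0
    return sigmas

X = np.array([[0., 0.], [0., 2.], [4., 4.], [4., 6.]])
y = np.array([0, 0, 1, 1])
P = np.array([[0., 1.], [4., 5.]])
print(init_sigma(X, y, P, [0, 1]))  # both ≈ 0.333 (mean squared distance 1, divided by 3)
```

A larger σ widens the cell's membership function, so tying it to the within-class spread gives each cell a scale matched to its data.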
We also repeated the experiment 10 times to obtain the mean and standard deviation of the classification results, taking into account the randomness of dataset segmentation and differences in the initial parameters. The experimental results are consistent with the above analysis: the classification performance improves as the number of prototypes increases. However, as the number of prototypes grows, the computational complexity also increases, and if there are too many prototypes, most fuzzy semantic cells become too small and redundant. As a result, we must select an appropriate number of prototypes to strike a balance between computational efficiency and classification performance. In the following experiment, we conducted a controlled trial of the four initialization methods described in Section 4.2 to show the classification accuracy under different prototype initializations on the Pendigits dataset, as shown in Table 4. We used ten prototypes (#prototypes = #classes), and the initialization method for σ is the same as before. The experiment is repeated six times. The results show that the training accuracy decays slightly as more samples become difficult to fit, while the testing accuracy increases substantially as the model learns more about the training distribution. Despite the objective function being non-convex, the classification performance is nearly stable across the different prototype initialization methods. When prototypes are initialized by category, the results are consistent across repeated experiments, and the standard deviation is low. When the prototypes are initialized globally, the overall classification accuracy is not significantly worse, though the results differ slightly and the standard deviation is relatively larger. For the last prototype initialization with 0.3 and 0.5 as the training ratio, the average training accuracy is slightly lower than the testing accuracy.
However, when considering the deviation, such a difference is not significant, and we attribute this phenomenon to randomness.

Our use of Adam [17] is based on the following empirical findings. The learning problem of our method is an unconstrained non-convex optimization, and the loss is estimated within batches of samples. It has been shown that Adam converges faster than other first-order optimization methods [17], such as SGD and RMSprop [50]. Among second-order methods, Newton and quasi-Newton approaches require expensive computation and memory; L-BFGS is an efficient quasi-Newton approximation. We conduct experiments on SGD, RMSprop, and L-BFGS, using the fourth initialization method for prototypes, and compare the test accuracies in Table 5. L-BFGS introduces instability when the training ratio is 0.1, and the other methods perform worse than Adam in this scenario. Evolutionary optimization and particle swarm optimization are also used in related works [26,27], but we believe numerical optimization methods are more reproducible.

In the final experiment, we compare the proposed method with six conventional classification methods, implemented with packages in scikit-learn [51]. SVM uses the polynomial kernel function, and AdaBoost's learning rate is set to 0.3. The neural network is a multi-layer perceptron with two hidden layers of five and two neurons, respectively; the nonlinear activation is ReLU, and the output neurons are softmaxed to give a probability distribution. A cross-entropy loss and an L-BFGS optimizer are used to optimize the model; other settings are the defaults. We test these typical classification methods (DecisionTree, NaiveBayes, KNN, SVM, AdaBoost, Neural Network) in classification accuracy on five different datasets. For each dataset, we select 10% as the training set and the remaining 90% as the test set to compare the classification performance with a small training set.
Because fewer samples are employed for training, most algorithms achieve outstanding performance on the training set, but their performance on the test set varies. The classification accuracy on the test set is shown in Table 6.

Experimental Results
From the results in Table 6, we can conclude that the seven algorithms have similar performance on the synthetic dataset. When the sample size used for training is relatively large, the neural network algorithm performs best, as on the handwritten-digit datasets Pendigits and MNIST. However, when the number of training samples is small, the stability of the neural network algorithm rapidly declines: not only does the classification performance deteriorate, but it also fluctuates dramatically (high standard deviation) on the Forest and MIT face datasets. In contrast, regardless of the amount of data, the algorithm proposed in this paper produces stable output. Moreover, the KNN algorithm also performs well on different datasets; however, KNN reaches its best performance only when the neighborhood parameter is manually chosen. In our algorithm, by contrast, the same parameter initialization method is used for all datasets, yielding a general classification algorithm without further human intervention. In this experiment, we use K-means (K = 3) in each category as the prototype initialization and take one third of the average distance between the samples of each class and their prototypes as the corresponding σ initialization. The Naive Bayes algorithm is more sensitive to the dataset and produces diverse experimental results: for example, it performs well on the Forest dataset but lags behind other methods on the MIT face dataset, which shows that it does not have strong generalization ability. The performance of the SVM algorithm on each dataset is neither remarkable nor poor. As for the DecisionTree and AdaBoost algorithms, their performance lags behind the others when only 10% of the data is used as the training set. In general, the method proposed in this paper is strongly competitive in terms of classification performance, robustness, and generalization ability.
Further experiments show that the Euclidean distance is not a suitable measure in high-dimensional data spaces; after dimension reduction, however, the data produce notable experimental results. As a result, the classification results for the high-dimensional data in this experiment are based on the data after dimension reduction.
In most of our experiments, we repeat six times in order to reduce estimation error. Most existing works repeat experiments three times to obtain the average and standard deviation of performances for comparison. Suppose the standard deviation of the ground-truth performance, viewed as a random variable, is std; then the standard deviation of the average of three repeats is std/√3. Since the mean is an unbiased estimator of the expectation, this standard deviation quantifies the estimation error. A three-time repeat is an acceptable balance between computational cost and estimation accuracy. As our method is trained on a small number of samples, we are able to repeat six times to obtain a more precise estimate, with a standard deviation of std/√6.

Conclusions
In this paper, we propose a hierarchical concept learning model based on the fuzzy semantic cell. The model makes two main contributions. Firstly, it can be used to model and learn abstract concepts in a supervised manner. Secondly, it achieves good learning performance with a small amount of data (even only 10%). According to the experimental results, our proposed method is comparable with typical techniques in terms of robustness and generalization ability under limited training data.
In the future, we will improve the objective function to increase its modeling capacity, while also looking for better strategies to initialize the parameters and to solve the non-convex optimization problem. We will concentrate on combining the proposed concept learning mechanism with other common models and on investigating feasible ways of extending it to more complex abstract concepts to improve concept learning performance.