Output Effect Evaluation Based on Input Features in Neural Incremental Attribute Learning for Better Classification Performance

Machine learning is an important approach to pattern classification. This paper provides further insight into Incremental Attribute Learning (IAL) by analyzing why it can exhibit better performance than conventional batch training. IAL is a novel supervised machine learning strategy that gradually trains features in one or more chunks. Previous research showed that IAL can obtain lower classification error rates than a conventional batch training approach, yet the reason has remained unclear. In this study, the feasibility of IAL is verified mathematically. Moreover, experimental results obtained by IAL neural networks on benchmarks confirm the mathematical validation.


Introduction
Machine learning is a very useful technology for pattern classification and regression. It has been widely and successfully applied in a number of different fields, where it can deliver good performance and accurate results [1,2,3,4]. The Neural Network (NN) is one of the most popular machine learning technologies and has been widely employed in many scenarios [5,6]. An NN is often built according to some machine learning strategy, and Incremental Attribute Learning (IAL) is one of the newest such strategies.
IAL is a "divide-and-conquer" machine learning strategy that gradually trains input features in one or more chunks. Previous research has shown that IAL is applicable to multidimensional problems in pattern classification when integrated with predictive algorithms such as Genetic Algorithm (GA) [7,8], NN [9,10], Support Vector Machine (SVM) [11], Particle Swarm Optimization (PSO) [12] and Decision Tree (DT) [13]. The results of these studies also showed that IAL can exhibit better performance than conventional methods, in which all input features are trained together in one batch.
Generally, two important factors allow IAL to outperform conventional batch-training machine learning. One is the incremental training structure of IAL. For example, Incremental Learning in terms of Input Attributes (ILIA) [9] and Incremental Training with an Increasing input Dimension (ITID) [10] have been shown to achieve better performance with neural network based IAL. The other factor is feature ordering, a preprocessing step unique to IAL [14,15,16,17,18]. In comparison with the results derived by conventional batch training approaches, both the structure and the feature ordering preprocessing in IAL have a positive effect on classification accuracy. However, why the structure and the feature ordering enhance classification performance and reduce error rates in IAL is a question that has not yet been answered.
In this paper, Single Discriminability (SD) [14], a frequently used metric, is taken as an example for evaluating a feature's classification capacity. The structure of IAL neural networks and the feature ordering of IAL are analyzed in detail to clarify why the unique structure and the preprocessing are important to IAL, and how IAL is able to reduce the error rate in the final classification results.

IAL Based on Neural Networks
IAL gradually imports features one by one. New approaches and algorithms for IAL have been presented on the basis of intelligent predictive methods such as NN. For example, ITID was shown to be applicable for classification. It divides the whole input space into several subspaces, each of which corresponds to an input feature. Instead of learning all input features together as one input vector in a training instance, ITID learns input features one after another through their corresponding sub-networks, while the structure of the NN gradually grows with an increasing input dimension based on Incremental Learning in terms of Input Attributes (ILIA) [9]. During training, information obtained by a new sub-network is merged with the information obtained by the old network. Such an architecture is based on ILIA1. After training, if the outputs of the NN are collapsed and an additional network is placed on top, with links to the collapsed output units and to all the input units so as to collect more information from the inputs, the result is ILIA2, as shown in Figure 1. Finally, a pruning technique is adopted to find the appropriate network architecture. Previous experiments have shown that, with less internal interference among input features, ITID achieves higher generalization accuracy than conventional batch training methods [10].
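The step-by-step import of features can be sketched as follows. This is only an illustration and not ITID itself: a single logistic unit stands in for the sub-networks, all weights keep training after each import (an assumption; the text does not say whether old weights are frozen), and the names `train_incremental`, `epochs` and `lr` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for extreme inputs.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))

def train_incremental(X, y, order, epochs=200, lr=0.5):
    """Import features one at a time in `order` and keep training.

    X: (n_samples, n_features) array, y: 0/1 labels, order: feature indices.
    Returns the final weights, the bias, and the training error rate (%)
    recorded after each import step.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    order = list(order)
    weights = np.zeros(0)  # one weight per feature imported so far
    bias = 0.0
    step_errors = []
    for k in range(1, len(order) + 1):
        # Grow the input dimension by one; earlier weights are retained.
        weights = np.append(weights, 0.0)
        Xk = X[:, order[:k]]
        for _ in range(epochs):
            grad = sigmoid(Xk @ weights + bias) - y
            weights -= lr * (Xk.T @ grad) / len(y)
            bias -= lr * grad.mean()
        pred = (sigmoid(Xk @ weights + bias) >= 0.5).astype(float)
        step_errors.append(100.0 * float(np.mean(pred != y)))
    return weights, bias, step_errors
```

Calling `train_incremental(X, y, order=[0, 1])` imports feature 0 first and feature 1 second, returning one step error rate per import.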

Feature Ordering and Single Discriminability
Many previous studies have shown that preprocessing, such as feature selection, feature ordering and feature extraction, usually plays a very important role in the final performance [19,20,21]. Feature ordering is naturally treated as an independent preprocessing stage in IAL [14], because features are imported into an IAL predictive system one by one. Thus, it is necessary to decide which features should be trained early and which should be placed later. The criterion for feature sorting usually depends on a metric that measures each feature's discrimination ability.
Feature discrimination ability is an expected measure of each single feature's contribution to the final classification rate in pattern classification. It can be used as a predictive tool to evaluate the final classification performance. There are many approaches to estimating feature discrimination ability for feature ordering [14,15,16,17,18,22]. Usually, feature discrimination ability can be derived from each single feature's contribution or from some statistical metrics. In previous studies, SD [14] was used as a metric for feature ordering. However, why it is applicable for evaluating a feature's discrimination ability was unknown until this study. It is analyzed mathematically in the next section. The definition of SD is as follows.
Definition 1. Single Discriminability (SD) refers to the discriminating capacity of one input feature $f_i$ in distinguishing all output features $\omega_1, \omega_2, \ldots, \omega_m$, where $f_i$ is the i-th feature in the input set and m is the number of output features. Let $f = [f_1, f_2, \ldots, f_n]$ be the pool of inputs and $\Omega = [\omega_1, \omega_2, \ldots, \omega_m]$ the pool of outputs, where $f_i$ ($1 \le i \le n$) is the i-th input feature in $f$ and $\omega_j$ ($1 \le j \le m$) is the j-th output feature in $\Omega$. SD is calculated from $\mu_j(f_i)$, the mean of feature i in output j, and $\mathrm{std}_j(f_i)$, its standard deviation, where n is the number of inputs and m is the number of outputs. SD provides an indicative feature ordering in categorization problems with two or more outputs.
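A minimal sketch of how an SD-style score could be computed for one feature is given here. The exact formula is not reproduced in the text above, so the ratio used (standard deviation of the per-class means divided by the sum of the per-class standard deviations) is an assumption, as are the function name `single_discriminability` and the use of population standard deviations.

```python
import numpy as np

def single_discriminability(x, labels):
    """Assumed SD form: std of per-class means / sum of per-class stds.

    x: 1-D array with the values of one feature.
    labels: 1-D array with the class label of each pattern.
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # mu_j(f_i): mean of the feature within each output class.
    means = np.array([x[labels == c].mean() for c in classes])
    # std_j(f_i): population standard deviation within each class.
    stds = np.array([x[labels == c].std() for c in classes])
    if stds.sum() == 0.0:
        raise ValueError("feature is constant within every class")
    return float(means.std() / stds.sum())
```

Under this assumed form, a feature whose class means are far apart relative to its within-class spread receives a larger score and would be imported earlier.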

Classification Estimation in IAL
As simple and efficient classifiers, linear classification methods can be employed to estimate each feature's discrimination ability in IAL preprocessing. Although the result is not very accurate, the estimate of a feature's single discrimination ability is still useful. Usually, classification can be treated as a search for a hyperplane or set of hyperplanes in a high- or finite-dimensional space. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. In supervised learning, datasets are usually divided into a training dataset and a testing dataset. Assume a dataset with n features, where $f = \{f_1, f_2, \ldots, f_n\}$ is the data vector containing all input data and $f_{trn} = \{f_1^{trn}, f_2^{trn}, \ldots, f_n^{trn}\}$ is the training data, which is a subset of $f$. The computation of SD should be based on $f_{trn}$. In IAL, features are incrementally imported into the predictive system; thus, the feature space starts from one dimension and then grows to more dimensions, step by step. When $f_{trn}$ is introduced into the predictive system for the first time, only one feature is introduced, and each classification hyperplane is a single point. As the number of features grows, the dimensionality of the hyperplanes also increases. When all n features are imported, the hyperplanes have n − 1 dimensions.
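The one-feature case, where the separating "hyperplane" is a single point, can be illustrated with a small sketch. It assumes exactly two classes and places the decision point midway between the two class means; that choice and the function names are illustrative assumptions, not a method given in the text.

```python
import numpy as np

def fit_point_classifier(x, labels):
    """Fit a one-feature, two-class linear classifier.

    Returns the decision point and the labels assigned below and above it.
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    if classes.size != 2:
        raise ValueError("this sketch handles exactly two classes")
    means = np.array([x[labels == c].mean() for c in classes])
    # The class with the smaller mean lies on the low side of the point.
    low, high = classes[np.argsort(means)]
    threshold = float(means.mean())  # midpoint of the two class means
    return threshold, low, high

def predict_point_classifier(x, threshold, low, high):
    """Assign `low` to values below the decision point, `high` otherwise."""
    return np.where(np.asarray(x, dtype=float) < threshold, low, high)
```

With a second imported feature the same idea becomes a line in the plane, and with n features a hyperplane of n − 1 dimensions.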
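Assume that the feature ordering in IAL for $f_{trn}$ is $f_1^{trn}, f_2^{trn}, \ldots, f_n^{trn}$, which indicates that whenever a new feature is introduced into the system, its SD is no larger than those of the features introduced before it. Namely,

$$\mathrm{SD}(f_1) \ge \mathrm{SD}(f_2) \ge \cdots \ge \mathrm{SD}(f_n) \qquad (2)$$

In another aspect, because the classification is based on ITID neural networks, $\mathrm{SD}(f_1, f_2)$, the SD of the integration of $f_1$ and $f_2$, is

$$\mathrm{SD}(f_1, f_2) = w_1\,\mathrm{SD}(f_1) + w_2\,\mathrm{SD}(f_2) \qquad (3)$$

where $w_1$ and $w_2$ are the weights in the neural networks, and $w_1 + w_2 = 1$. Similarly, if n features are imported into the system,

$$\mathrm{SD}(f_1, f_2, \ldots, f_n) = w_1\,\mathrm{SD}(f_1) + w_2\,\mathrm{SD}(f_2) + \cdots + w_n\,\mathrm{SD}(f_n) \qquad (4)$$

where $w_1 + \cdots + w_n = 1$. According to Equations (2) and (3),

$$\mathrm{SD}(f_2) \le w_1\,\mathrm{SD}(f_1) + w_2\,\mathrm{SD}(f_2) \le \mathrm{SD}(f_1)$$

Namely,

$$\mathrm{SD}(f_2) \le \mathrm{SD}(f_1, f_2) \le \mathrm{SD}(f_1) \qquad (5)$$

Based on Equation (5), if the $\mathrm{SD}(f_i)$ value reflects the real classification ability of each single feature, then the classification performance is evaluated by $\mathrm{SD}(f_1, \ldots, f_n)$ for the conventional batch-training method and by $\mathrm{SD}(f_1, \ldots, f_n)_{IAL}$ for IAL.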
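The two-feature bound can be worked out explicitly. The step below uses the descending ordering $\mathrm{SD}(f_1) \ge \mathrm{SD}(f_2)$ and $w_1 + w_2 = 1$, and it additionally assumes $w_1, w_2 \ge 0$; non-negativity is not stated in the text but is needed for the inequalities to hold.

```latex
\begin{aligned}
\mathrm{SD}(f_1, f_2) &= w_1\,\mathrm{SD}(f_1) + w_2\,\mathrm{SD}(f_2) \\
  &\le w_1\,\mathrm{SD}(f_1) + w_2\,\mathrm{SD}(f_1) = (w_1 + w_2)\,\mathrm{SD}(f_1) = \mathrm{SD}(f_1), \\
\mathrm{SD}(f_1, f_2) &= w_1\,\mathrm{SD}(f_1) + w_2\,\mathrm{SD}(f_2) \\
  &\ge w_1\,\mathrm{SD}(f_2) + w_2\,\mathrm{SD}(f_2) = (w_1 + w_2)\,\mathrm{SD}(f_2) = \mathrm{SD}(f_2).
\end{aligned}
```

In words, a weighted average with non-negative weights summing to 1 always lies between its smallest and largest terms.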
The proof of Theorem 1 is in the Appendix. This theorem indicates that, under certain conditions, IAL usually performs better than conventional batch-training methods in classification if features are imported into the system according to a feature ordering sorted by their discrimination ability in descending order. However, SD is only a metric that gives an expected value of the classification performance derived from features. It is not the real classification result that is finally obtained. Moreover, because only training data are employed in the feature ordering calculation, SD results always have a bias when the testing data are imported in later steps.

Experiments
In this study, eight classification benchmarks from the UCI Machine Learning Repository are employed to verify that SD is a feasible measure of each feature's classification capacity with respect to the final classification performance of IAL. They are Diabetes, Cancer, Glass, Thyroid, Ionosphere, Musk1, Sonar and Semeion. In these experiments, all the patterns were randomly divided into three groups: a training set (50%), a validation set (25%) and a testing set (25%), and SD was employed to evaluate feature discrimination ability. After evaluation, all the features were sorted according to their SD values. Neural networks with the ITID structure were employed for classification, using datasets formatted according to the SD feature orderings shown in Table 1.
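The data preparation described here can be sketched as a random 50/25/25 split followed by a descending sort on precomputed SD values. The function names, the `seed` parameter, and the rule that the testing set absorbs any remainder when the number of patterns is not divisible by four are assumptions for illustration.

```python
import numpy as np

def split_indices(n_patterns, seed=0):
    """Randomly split pattern indices into training, validation and testing."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_patterns)
    n_train = n_patterns // 2  # 50% for training
    n_val = n_patterns // 4    # 25% for validation
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:]  # remaining patterns (about 25%)
    return train, val, test

def order_features(sd_values):
    """Return feature indices sorted by SD value in descending order."""
    sd_values = np.asarray(sd_values, dtype=float)
    # Stable sort keeps the original order among features with equal SD.
    return np.argsort(-sd_values, kind="stable")
```

The SD values passed to `order_features` should be computed on the training patterns only, in line with the statement that the SD computation is based on the training data.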

Result Analysis
According to the results shown in Table 2, all the final results derived by ITID (SD-ILIA2) are better than those obtained by conventional batch training: IAL with feature orderings based on SD obtained lower final classification error rates. Moreover, the Correlation Coefficient between the SD values and the error rates obtained in each ILIA1 step shows that there is a strong positive correlation between SD values and step error rates. Therefore, in IAL, SD-based feature ordering is more likely to exhibit better performance when neural networks based on ITID are employed for classification.
Figure 2 shows the correlation between SD values and the ILIA1 classification error rates obtained in each feature importing step. It also confirms that there is a strong positive correlation between SD values and classification error rates. According to Figure 2, the SD values of the ordered features and the ILIA1 classification error rates generally follow the same decreasing trend during the IAL classification process. This phenomenon coincides with the Correlation Coefficient values shown in Table 2, which also means that the SD value is an applicable metric for estimating final classification performance. However, in Figure 2, the ILIA1 classification results fluctuate in almost all datasets, although the general trend is decreasing. This means that some features trained in later steps contribute more than some of those trained in earlier steps. This is influenced by sampling. At present, no effective approach exists to cope with the difference between sample and population. Another way to tackle such a fluctuation of results is feature selection. If feature selection is used, better results can be obtained. Taking Cancer as an example, if feature selection is employed on this dataset, only features 3, 2, 6 and 7 should be used, and the other features are discarded. Thus, the final classification can be easily improved. This is an important issue that will be discussed and solved in the future.
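The correlation check described above can be sketched with the Pearson coefficient between the SD value of the feature imported at each step and the error rate recorded after that step. Using the Pearson form and the name `sd_error_correlation` are assumptions; the text does not state which coefficient was computed.

```python
import numpy as np

def sd_error_correlation(sd_values, step_error_rates):
    """Pearson correlation between per-step SD values and step error rates."""
    sd = np.asarray(sd_values, dtype=float)
    err = np.asarray(step_error_rates, dtype=float)
    if sd.shape != err.shape or sd.size < 2:
        raise ValueError("need two equally long sequences with at least 2 steps")
    # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal.
    return float(np.corrcoef(sd, err)[0, 1])
```

A value close to 1 means that the step error rate falls as features with smaller SD are imported, which is the pattern reported for the benchmarks.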

Conclusions
This paper analyzes why IAL can outperform conventional batch training approaches and emphasizes that SD is a feasible metric for feature ordering, which is a preprocessing step of IAL. In this study, the feasibility of IAL is verified by mathematical proof. According to the mathematical validation and the benchmark results, if features are sorted according to their SD values and imported into the IAL system in that order, IAL can usually obtain lower classification error rates than conventional batch training approaches. Thus, feature ordering, which depends on evaluating each feature's contribution to the final classification performance, is very important to IAL. Moreover, under some conditions on the neural network weights, IAL is more suitable than conventional batch training approaches for obtaining a lower classification error rate.
In general, IAL is a novel machine learning approach that gradually trains input attributes in one or more chunks. Feature ordering is a preprocessing step unique to IAL pattern recognition, and it plays a very important role in improving results. This study clarifies why IAL can often obtain lower final classification error rates than conventional batch training approaches. Feature ordering based on SD can be employed as a preprocessing step in neural IAL classification to obtain lower error rates.

Figure 1. The network structure of ITID.

The ILIA1 results derived in the last feature importing step and the final classification results (ILIA2 results) are compared with those derived by conventional batch-training approaches in Table 2. The final classification error reduction and the Correlation Coefficient between SD and Step Error Rate are also shown in that table.

Table 1. Single Discriminability (SD) feature ordering of each dataset.

Table 2. Column headings: ITID (SD-ILIA1) Classification Error Rate (%); ITID (SD-ILIA2) Final Classification Error Rate (%); Batch-Training Classification Error Rate (%); Final Classification Error Reduction (%); Correlation Coefficient between SD and Step Error Rate.
Figure 2. SD values and classification error rates derived in each step when Incremental Learning in terms of Input Attributes (ILIA1) is applied and features are imported into the Incremental Training with an Increasing input Dimension (ITID) neural networks one by one according to the feature ordering sorted by SD. Both the SD values and the classification error rates derived by ILIA1 in each step show the same downward trend during the process. Panels (a-h) compare SD values and classification error rates as new features are imported into ITID training: (a) Diabetes; (b) Cancer; (c) Glass; (d) Thyroid; (e) Ionosphere; (f) Sonar; (g) Musk1; (h) Semeion.