Two-Stage Feature Generator for Handwritten Digit Classification

In this paper, a novel feature generator framework is proposed for handwritten digit classification. The proposed framework comprises a two-stage cascaded feature generator. The first stage is based on principal component analysis (PCA), which generates the projections of the data on the principal components as features. The second stage is constructed by a partially trained neural network (PTNN), which uses the projected data as inputs and generates the hidden layer outputs as features. The features obtained from the PCA- and PTNN-based feature generator are tested on the MNIST and USPS handwritten digit datasets. Minimum distance classifier (MDC) and support vector machine (SVM) methods are exploited as classifiers for the obtained features within this framework. The performance evaluation results show that the proposed framework outperforms state-of-the-art techniques and achieves accuracies of 99.9815% and 99.9863% on the MNIST and USPS datasets, respectively. The results also show that the proposed framework achieves almost perfect accuracies even with significantly small training data sizes.


Introduction
Pattern recognition typically involves both feature generation and classification. In pattern recognition approaches, such as face recognition and digit recognition, a feature extractor aims to find the characteristics of patterns that can discriminate and separate the classes. However, variability in the features can lead to difficulties in such approaches. For example, even though small within-class variability is desirable in a face recognition approach, varying lighting conditions can lead to differences in features. Similarly, digits written by different people in digit recognition systems can cause variability in features [1][2][3][4][5][6][7]. Hence, determining and using the most efficient framework for feature generation and classification is crucial in pattern recognition approaches.
There are studies conducted on feature generation and classification in the literature. For instance, linear transformation techniques, such as principal component analysis (PCA), singular value decomposition (SVD), independent component analysis, the discrete Fourier transform, the Hadamard and Haar transforms, and the discrete time wavelet transform (DTWT), are used for feature generation [7]. Moreover, neural networks (NNs) are used for classification in numerous studies, such as in [2][3][4][5][6][7]. It should be stressed that typical recognition architectures use a single feature extractor followed by a supervised classifier. However, as stated in [8,9], two successive stages of feature generation yield higher accuracies than a one-stage extractor. There are also studies in the literature in which two or more feature extractors are cascaded and the resulting features are used to train a supervised classifier. Even though such studies exploit cascaded feature generation, the framework proposed in this study differs in that its second stage is a partially trained neural network.

State of the Art
There are studies based on handwritten character recognition in the literature. For instance, Mellouli et al. [1] proposed a new convolutional neural network (CNN) architecture using morphological filters for digit recognition. The morphological configuration, called Morph-CNN, achieved a test accuracy of 99.66% on the MNIST dataset. Patel et al. proposed a multi-resolution technique using a discrete wavelet transform (DWT)-based approach for handwritten character recognition [10]. The authors used the DWT to extract features and the MDC to recognize the system output. Their technique achieved an overall success rate of 90.00%. Ayyaz et al. [11] proposed a hybrid feature extraction system based on the SVM. Their system was tested on both handwritten digits and uppercase alphabets and achieved higher efficiency compared to other methods. Shubhangi et al. [12] proposed a structural micro-feature system based on the SVM to recognize handwritten English characters and digits with a high recognition rate.
Liu et al. [13] proposed an NN-based system, which improved accuracy by discriminative training and achieved a 98.45% recognition rate on the CENPARMI dataset. Suen et al. [14] developed a system to sort and identify cheques and financial documents on the CENPARMI dataset, which achieved a success rate of 98.85%. Lee et al. [15] proposed an offline handwritten digit recognition system for the CEDAR dataset, which achieved a recognition rate of 99.09%. Filatov et al. [16] designed a system based on address scripts to identify handwritten postal addresses for US mail on the CEDAR dataset, which achieved a success rate of 99.54%.
In [17], a discriminative cascaded CNN model was used, which achieved an error rate of 0.18% on the MNIST dataset. Ganapathy et al. [18] studied a multiscale NN recognition system. In [19], a single-layer NN achieved a 98.39% accuracy on the MNIST dataset. In [20], four different techniques, i.e., PCA, CNN, SVM, and multi-classifier systems, were used to develop a powerful system for handwritten character recognition, which achieved a success rate of 98.50% on the MNIST dataset. In [21], cascaded PCA, binary hashing, and block-wise histograms were used with a very simple deep learning network for image classification, which achieved a 99.67% recognition rate. In [22], a system based on a multicolumn deep neural network (MCDNN) was developed using 35 pre-trained CNNs, which achieved an error rate of 0.23% on the MNIST dataset. Bruna et al. [9] used an invariant scattering convolution network, which achieved an error rate of 0.43% on the MNIST dataset. Goodfellow et al. [23] used a convolutional maxout system to regularize dropout, which achieved an error rate of 0.45% on the MNIST dataset. Zeiler et al. [24] proposed stochastic pooling for deep CNNs, which achieved an error rate of 0.47% on the MNIST dataset. In [25], a context-dependent deep NN/hidden Markov model was used for large-vocabulary speech recognition. This system was tested on both the MNIST and TIMIT datasets and achieved an error rate of 0.83% on the MNIST dataset. Jarrett et al. [8] used large CNNs and achieved an error rate of 0.53% on the MNIST dataset without distortions. Yu et al. [26] used a hierarchical two-layer sparse coding network on pixels, which achieved an error rate of 0.77% on the MNIST dataset. Keysers et al. [27] proposed an image distortion model based on local optimization, which achieved a low error rate of 0.54% on the MNIST dataset. In [28], a scalable generative model based on a convolutional deep belief network was used for unlabeled data from the MNIST dataset, which achieved an error rate of 0.82%.
In [29], pattern recognition using average patterns of categorical k-nearest neighbors was proposed, which achieved error rates of 1.27% and 3.44% on the MNIST and USPS datasets, respectively, using kernel classification on categorical average patterns. In [30], a discriminative supervised dictionary learning method was developed, which achieved test error rates of 0.60% and 2.40% for the MNIST and USPS datasets, respectively. Error rates of 1.66% and 2.59% were achieved using the SVM and KNN, respectively, on the USPS test set in [31]. In [32], perceptron learning of a modified quadratic discriminant function (MQDF) was used to achieve error rates of 1.49% and 2.19% on the MNIST and USPS datasets, respectively, which indicates that discriminative learning of the MQDF can further improve its performance. Xu et al. [33] presented a nonnegative representation-based classifier for pattern classification, which achieved accuracies of 99% and 95.1% on the MNIST and USPS datasets, respectively. Prasad et al. [34] presented novel features and cascaded KNN and SVM classifiers, resulting in an accuracy of 99.26% on the MNIST dataset.

Proposed Framework
Employing appropriate features to classify data can directly influence the desired learning results. Therefore, selecting and generating features that are easily separable is vital for accurate classification [1][2][3]. Considering this motivation, a two-stage cascaded feature generator framework is proposed in this study.
In this sub-section, first, a one-stage feature generator, which provides the basis for the proposed framework, is discussed, and then the proposed two-stage feature generator framework of this study is introduced.

Soft Sensor Implementation for Feature Generation
Handwritten digit classification (HDC) has found applications in practice such as postal automation, bank check automation, and human-computer interaction. Many studies have been conducted on the classification of digits, as mentioned in Section 2. The first and most vital step in the recognition cycle is the collection of handwritten digits from people. There exist various ways to acquire the digits depending on how the digits are generated; therefore, different sensors are utilized for capturing them. While digits written on paper can be recorded by handheld scanners or cameras, digits created in the air can be captured by Kinect cameras, wearable inertial measurement unit (IMU) sensors, and wearable smart gloves and armbands, which rely on capturing hand and finger movements. In addition, a smart pen that exploits inertial force sensors can record the digits [35][36][37][38].
The proposed method has been implemented using both hardware sensors (cameras, scanners, etc.) and soft sensors. The former capture the digits; the latter provide features that no hardware sensor is able to measure. In this study, a soft sensor model was developed to generate features for handwritten digit classification. The soft sensor is realized by two cascaded modules, namely the PCA and the PTNN. The following sections present the details of each module.

One-Stage Feature Generator
Figure 1 depicts a one-stage feature generator framework that employs the PCA for feature extraction. As can also be seen from the figure, the implementation of the one-stage classifier is based on either the MDC, which is a simple algorithm, or the SVM, which is a sophisticated algorithm, for classification [11,12]. The MDC calculates the distance between an unknown sample and each class center and assigns the sample to the class center with the shortest Euclidean distance.
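As a minimal illustration of this rule (our own sketch, not code from the paper; X_train, y_train, and X_test are assumed numpy arrays holding feature vectors and class labels), the MDC can be written as follows:

import numpy as np

def mdc_fit(X_train, y_train):
    # One centroid per class: the mean of that class's training vectors.
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes, centroids

def mdc_predict(X_test, classes, centroids):
    # Euclidean distance from every test vector to every class centroid.
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    # Assign each sample to the class with the shortest distance.
    return classes[np.argmin(dists, axis=1)]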

Algorithm 1, which is presented below, describes the steps to generate the features based on the PCA within this framework.
Algorithm 1: Obtaining principal component (PC)-based features
Input: The data matrix D = (d_1 d_2 . . . d_N) of size MxN, where d_i represents the i-th sample and i = 1, . . ., N, with N the number of examples in the data matrix.
• S1: Calculate the mean of the samples and subtract it from each sample in D.
• S2: Compute the covariance matrix of the centered data.
• S3: Compute the eigenvectors of the covariance matrix and select the K unit eigenvectors c_1, c_2, . . ., c_K corresponding to the K largest eigenvalues as the principal components.
• S4: Project the samples on the principal components, f_j = D^T c_j for j = 1, . . ., K.
• S5: Construct the feature matrix F = (f_1 f_2 . . . f_K) with the size of NxK.
Although the algorithm ends in this step, the following step demonstrates the effectiveness of the generated features.
• S6: Train a classifier (such as SVM or MDC) by the rows of the matrix F.
It is easy to show that the elements of matrix F are the projections of each data sample on the principal component vectors. In F, the product d_i · c_j denotes the inner product of the two vectors. Hence, we can express this product as
d_i · c_j = ‖d_i‖ ‖c_j‖ cos θ, (1)
where θ is the angle between d_i and c_j, i = 1, . . ., N and j = 1, . . ., K. Since ‖c_j‖ = 1, the inner product can be written as
d_i · c_j = ‖d_i‖ cos θ. (2)
Equation (2) represents the projection of d_i onto c_j, that is,
proj_{c_j} d_i = ‖d_i‖ cos θ. (3)
Consequently, the projected data are employed as features to train the selected classifier, which is based on the MDC or SVM. It is a fact that the variance in the clusters obtained using the PCA is very large; hence, the MDC or SVM classifiers cannot successfully separate one cluster from another. This is due to the features being sparsely scattered around the center of the cluster (i.e., the distances between samples within the same cluster are high). This reduces the classification performance, which yields low success rates.
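The projection steps of Algorithm 1 can be sketched in a few lines of numpy (our illustration under the algorithm's conventions: D is the MxN data matrix with one sample per column, and the eigen-decomposition of the covariance matrix is one standard way to obtain the unit principal components c_j):

import numpy as np

def pca_features(D, K):
    # S1: center the samples (columns of D).
    X = D - D.mean(axis=1, keepdims=True)
    # S2: M x M covariance matrix of the centered data.
    cov = np.cov(X)
    # S3: K unit eigenvectors with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    C = eigvecs[:, np.argsort(eigvals)[::-1][:K]]
    # S4/S5: N x K feature matrix of the projections d_i . c_j.
    F = X.T @ C
    return F, C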

Two-Stage Feature Generator
To enhance the performance of the one-stage generator, we propose inserting another transformation operator between the PCA and MDC/SVM modules to form a two-stage feature generator framework. Figure 2 depicts the proposed framework. The framework for the two-stage feature generator is explained step by step in Algorithm 2.

The PTNN module in the framework is simply a multilayer perceptron (MLP) [2] with one hidden layer containing a varying number of neurons. It is structured for the purpose of classification. Thus, the outputs of the network correspond to the clusters to be identified, i.e., the number of digits in our test cases. The network is fed by the projected data features. However, the network is not fully trained, but partially trained: training is halted after a few epochs. The epoch errors are still high at this point, which indicates that training is far from complete, and the network cannot yet correctly identify the clusters. Nevertheless, we keep training the network to observe the behavior of the PTNN at this early stage. In summary, the PTNN is simply an MLP without full training, i.e., the training is stopped after a predefined number of epochs.
Figure 3a,b illustrate the mean squared error (MSE) results obtained from the fully trained NN and the PTNN training, respectively. The MSE is computed as the mean of the squared differences between the actual output and the estimated output. Figure 3b represents the performance of the neural network at the early stages of training. The results show that the MSE decreases rapidly at the beginning of the training phase and changes slowly until 2000 epochs are reached. Then, it remains almost constant, implying that the NN is fully trained. In total, 60,000 and 10,000 samples are employed during the training and testing phases, respectively, and a test accuracy of 98.58% is achieved [39]. Additionally, when an MLP NN is trained to classify the digits in the MNIST dataset with no feature extraction, the number of epochs required varies from 40 to 50 to achieve test accuracies between 87% and 98%, using 60,000 samples for the training set and 10,000 samples for the test set [40,41].
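Such partial training can be sketched as follows (a minimal illustration assuming scikit-learn's MLPClassifier, with its loss curve used as a stand-in for the MSE of Figure 3; F_train and y_train are our placeholder names for the projected features and their labels):

from sklearn.neural_network import MLPClassifier

# Partially trained network: training is deliberately halted after a
# predefined, small number of epochs, so the loss is still decreasing.
ptnn = MLPClassifier(hidden_layer_sizes=(50,),   # 50 hidden nodes, as in the experiments
                     activation='logistic',      # activation is our assumption
                     learning_rate_init=0.5,     # learning rate used in the text
                     max_iter=15,                # halt in early iterations
                     random_state=0)
ptnn.fit(F_train, y_train)   # rows of F as inputs
print(ptnn.loss_curve_)      # far from converged: the network is only partially trained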
Despite stopping the training at a significantly early stage, if the outputs of the hidden units of the partially trained network are used as features, we find that the intra-cluster distances are reduced compared to those in the PCA feature space. On the other hand, the size of the feature vectors in the two-stage feature generator is higher than in the one-stage feature generator (i.e., larger than K). That is, the feature space composed by the two-stage feature generator includes more features than that of the one-stage feature generator. Hence, the proposed approach does not reduce the number of features; however, it improves the accuracy of the classifier.
Algorithm 2 describes the transformation to generate features based on the PCA plus PTNN.
Algorithm 2: Obtaining neural network-based features from the projected data on the PCs
Input: The feature matrix F = (f_1 f_2 . . . f_K).
• S1: Build an MLP network with one hidden layer and P hidden nodes (neurons).
• S2: Start training the network for classifying the examples represented by the rows of F.
• S3: Halt training in early iterations.
• S4: Calculate the outputs of the hidden layer, h_i = φ(W f_i + b), where φ(·) is the activation function, W is the weight matrix between the input and hidden layers, b is the bias vector, and i = 1, . . ., P.
• S5: Construct the hidden layer output matrix H = (h_1 h_2 . . . h_P), whose size is NxP.
Although the algorithm ends in this step, the following step demonstrates the effectiveness of the generated features.
• S6: Train a classifier (such as SVM or MDC) by the rows of the matrix H.
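Steps S4 and S5 can be reproduced by reading the first-layer weights out of the partially trained network. The sketch below continues the previous one (ptnn and F_train are assumed from there, and the logistic activation matches the assumption made above):

import numpy as np

def hidden_features(ptnn, F):
    # S4: pre-activations of the hidden layer for every sample (row of F).
    Z = F @ ptnn.coefs_[0] + ptnn.intercepts_[0]
    # Logistic activation, matching the network built above (assumed).
    H = 1.0 / (1.0 + np.exp(-Z))
    return H   # S5: N x P hidden layer output matrix

H_train = hidden_features(ptnn, F_train)   # two-stage features for the MDC/SVM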
The algorithms discussed above are tested on the MNIST and USPS digit datasets to analyze the distance distribution of each digit class in this study. For this purpose, the distances within a class and between classes are calculated, where the Euclidean distance is used as the distance metric. Let d_i and d_j be row vectors in R^N. Then, the Euclidean distance between these two vectors is defined as
L(d_i, d_j) = ‖d_i − d_j‖. (4)
The within-cluster distances are calculated by Algorithm 3.
Algorithm 3: Calculating the distances among the feature vectors within a digit class
Input: Assume that F_m = (f_1 f_2 . . . f_S) is the feature matrix for the m-th class, where m = 1, 2, . . ., O and S is the number of examples in the given class.
• S1: Calculate the centroid of the class m as F_c^m = (1/S) (f_1 + f_2 + . . . + f_S).
• S2: Calculate the Euclidean distance between each example and the centroid vector as L_i = ‖f_i − F_c^m‖ for i = 1, . . ., S.
It is envisaged that the proposed framework should yield minimized intra-cluster distances or maximized inter-class distances. This expectation is verified in the following section by considering both the one-stage and two-stage feature generators and the algorithms above.
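Algorithm 3 and Equation (4) translate directly into code (our sketch; F_m is assumed to hold the feature vectors of one digit class as its rows):

import numpy as np

def within_class_distances(F_m):
    # S1: centroid of the class.
    centroid = F_m.mean(axis=0)
    # S2: Euclidean distance of every example to the centroid, Equation (4).
    L = np.linalg.norm(F_m - centroid, axis=1)
    return L, L.std()   # distances and their standard deviation (cf. Tables 1 and 2)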

Verification of Inter- and Intra-Class Distributions
In this section, the intra-class and inter-class distance distributions are verified using the distance metric presented in Equation (5). To form the metric, first of all, the standard deviation, which indicates how sparsely or densely the distances are distributed within a class, is determined for each class. Then, to quantify the distance between the classes, the separation metric (SM) is formed as
SM_ij = d_ij / (σ_i + σ_j), (5)
where d_ij is the distance between the centers of classes i and j, while σ_i and σ_j are the standard deviations for classes i and j, respectively. This metric represents the degree of separability (a short computational sketch is given after the two cases below). The inter-class distances are calculated by Algorithm 4.
Algorithm 4: Calculating the distances between two digit classes
Input: Assume that F_m = (f_1 f_2 . . . f_S) is the feature matrix for the m-th class, where m = 1, 2, . . ., O and S is the number of examples in the given class.
• S1: Calculate the centroids for each class in the given dataset.
• S2: Calculate the distance between the centroids of two classes, L_{m(m−1)} = ‖F_c^m − F_c^{m−1}‖.
In step 2 of Algorithm 4, L_{m(m−1)} represents the distance of the center of each class from the centers of all the other classes. Suppose the following:
• Case 1: if the distance remains constant and the standard deviations in Equation (5) have small values, then the SM becomes higher. Note that a higher SM indicates better separation.

• Case 2: if the standard deviations in Equation (5) are constant and the distance has high values, then the SM becomes higher.
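The sketch below computes the SM of Equation (5) for one pair of classes (our illustration; the form of the denominator follows the reconstruction above, and F_i and F_j are assumed feature matrices with one sample per row):

import numpy as np

def separation_metric(F_i, F_j):
    # Centroids of the two classes (Algorithm 4, S1).
    c_i, c_j = F_i.mean(axis=0), F_j.mean(axis=0)
    # Distance between the class centers (Algorithm 4, S2).
    d_ij = np.linalg.norm(c_i - c_j)
    # Standard deviations of the within-class distances (Algorithm 3).
    s_i = np.linalg.norm(F_i - c_i, axis=1).std()
    s_j = np.linalg.norm(F_j - c_j, axis=1).std()
    return d_ij / (s_i + s_j)   # Equation (5): higher means better separated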
These two cases are illustrated in Figure 4. Tables 1 and 2 show the standard deviations of the distances to the class centers using the one-stage and two-stage feature generators, respectively, on the USPS dataset. The results show that the standard deviations (i.e., the σ's) using the two-stage feature generator are smaller than those using the one-stage feature generator. This is associated with the fact that the samples in a given class are distributed close to the center of the class. A consequence of this is that the data are more separable in the feature space formed by the PCA plus PTNN. In other words, the boundary or volume of each cluster shrinks inward. On the other hand, in the one-stage case, the samples in each class are scattered away from the center of the class, so that large values of the standard deviations are obtained. Consequently, the variation within a cluster without the PTNN is higher than that with the PTNN. Tables 3 and 4 show the separability values calculated by Equation (5) for the one-stage and two-stage feature generators, respectively. It can be seen that the classes scattered in the feature space are more separable in the two-stage case (i.e., the separability increases). In pattern recognition, this is one of the desired requirements for a classifier to classify data accurately. Furthermore, we can take the ratio of the SM value of a selected class in Table 4 to the value of that class in Table 3. Once these ratios are calculated, it can be seen from Table 5 that they are mostly greater than 1. Thus, the classes in the feature space built from the PCA plus NN are more separable compared to those in the PCA space. The same cluster behavior is also observed for the MNIST digit dataset. Tables 6 and 7 show the standard deviations for the clusters formed with 5000 and 10,000 samples, respectively. It can be seen from the tables that the variation within a cluster with the two-stage extractor is lower than that with the one-stage generator. The separability values and SM ratios for the MNIST dataset are shown in Tables 8-10.

Results and Discussion
The performance of the proposed feature generator is tested on the MNIST and USPS digit datasets. The USPS handwritten digit dataset is derived from a project on recognizing handwritten digits on envelopes [42]. The digits have sizes of 16 × 16 pixels. It contains 7291 samples for the training set and 2007 samples for the test set. The standard MNIST dataset is derived from the NIST dataset and was created by LeCun et al. [43]. The digits have sizes of 28 × 28 pixels. It has 60,000 samples for the training set and 10,000 samples for the test set. Figures 5 and 6 show some examples of the digits from the MNIST and USPS datasets, respectively. The MDC and SVM are utilized to identify the digits in these datasets. The MDC is a simple classifier. In the training phase, the training vectors are separated by class; then, the mean values of each class are computed. In the test phase, the closest mean to the test vector is found via the Euclidean distance, and the corresponding class is predicted. The SVM is much more complex than the MDC. It is capable of extracting not only linear but also curved decision boundaries. Thus, more accurate classification can be achieved by setting a maximum margin separator among the sample points, where the margin is defined as the distance of the decision boundary to the closest sample.
In Table 11, the results of the MDC for the USPS digit classes are shown for both the one-stage and two-stage feature generators. We then determine the accuracies for different eigenvalues and different training sizes. The results show that the best recognition rate is achieved using 4000 samples for the training set and 5298 samples for the test set with K = 10. Note that the NN is partly trained for various epochs, i.e., the training is halted in the early stage of the iterations. As an example, Table 12 presents the accuracies for K = 10 at different epochs and different training sizes. The table shows that the performance of the PCA plus PTNN (two-stage generator) is higher than that of the one-stage extractor. Moreover, as an example, the performance of the two-stage generator framework with a training size of 500 samples is improved by 2.386 points with reference to the one-stage extractor at an epoch of 15 for the USPS dataset. During the training for each scenario, the learning rate and the number of hidden nodes are set to 0.5 and 50, respectively. Then, the hidden layer outputs are extracted from the NN. The mean values of these outputs are calculated for each digit class. For the unseen test data, the hidden layer output is calculated, and then the Euclidean distances of the test data to the mean values of the digit classes are computed. The test data are classified according to the digit class with the minimum distance. For all the scenarios, the two-stage features lead to higher performance than the one-stage features. The average test recognition rates for the 10 classes are 91.60% and 90.13% at a training size of 4000 for the two-stage and one-stage cases, respectively. Table 13 presents the performance rates for the MNIST digit classes. As seen, the performance of the two-stage extractor is lower than that of the one-stage extractor for small training sizes. However, an improvement in the performance appears for the full training size of 60,000. Tables 14 and 15 show the test success rates of the SVM classifier for the USPS and MNIST datasets, respectively. The experiments on the SVM are performed with the RBF kernel function. Although the best performance is obtained using 60,000 samples for the training set and 10,000 samples for the test set, it is clear that small training sizes also result in very high accuracies. The PTNN is trained with a learning rate of 0.50 and 50 hidden nodes. Although the PTNN is trained with 5% to 30% of the MNIST and USPS datasets, the proposed method achieves almost perfect performance with the SVM. Furthermore, the performance is acceptable even for a simple MDC. The results show that the proposed approach provides more relevant features for the data. Hence, the classifier achieves much better performance scores, i.e., 99.9863% and 99.9815% for the USPS and MNIST datasets, respectively. To the best of our knowledge, these are the best performances in the current literature.
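Putting the stages together, the pipeline evaluated in Tables 14 and 15 can be sketched end to end as follows (our reconstruction with scikit-learn; the learning rate of 0.5, the 50 hidden nodes, the early halt, and the RBF kernel follow the text, pca_features and hidden_features refer to the earlier sketches, and everything else is an assumption):

from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def two_stage_svm(D_train, y_train, D_test, K=10, epochs=15):
    # Stage 1: project onto the K leading principal components (Algorithm 1).
    F_train, C = pca_features(D_train, K)
    F_test = (D_test - D_train.mean(axis=1, keepdims=True)).T @ C
    # Stage 2: partially trained MLP, halted after a few epochs (Algorithm 2).
    ptnn = MLPClassifier(hidden_layer_sizes=(50,), activation='logistic',
                         learning_rate_init=0.5, max_iter=epochs, random_state=0)
    ptnn.fit(F_train, y_train)
    H_train = hidden_features(ptnn, F_train)
    H_test = hidden_features(ptnn, F_test)
    # Final classifier: SVM with the RBF kernel.
    svm = SVC(kernel='rbf').fit(H_train, y_train)
    return svm.predict(H_test)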
Table 17 shows the effectiveness of the two-stage feature extractor. The improvements in the accuracies with respect to the one-stage extractor are clear for each classifier. This shows that the proposed features give the classifiers considerably better generalization ability. Tables 18 and 19 show comparisons of the performance of our framework and some state-of-the-art methods on the MNIST and USPS datasets, respectively. The results show that the proposed method outperforms well-known techniques in the literature. Note that the SVM using the two-stage features achieves error rates of 0.0185% and 0.0137% for the MNIST and USPS datasets, respectively, which are currently the best performances in the literature.

Conclusions and Future Work
In this paper, we proposed a novel framework based on a two-stage feature generator for handwritten digit classification. The first stage of this framework relies on the PCA, which generates the projected data from the eigenvectors corresponding to the K largest eigenvalues. The second stage has been constructed by a PTNN whose training has been halted at early epochs, i.e., it was not fully trained to recognize the input classes. This PTNN has been fed by the data projected on the principal components, and its hidden layer outputs have then been selected as new features, which have been used to train the MDC and SVM classifiers.
We evaluated the performance of the proposed method on the MNIST and USPS datasets. On both datasets, the best results are obtained using an SVM classifier. We found that the two-stage feature extractor leads to noticeable improvements in terms of accuracy. Moreover, compared to current state-of-the-art methods, the proposed framework results in almost perfect performance even with small training sizes. In addition, our experiments have shown that the proposed method can achieve error rates of 0.0185% and 0.0137% for the MNIST and USPS datasets, respectively, which can currently be considered the best performances in the literature.


Figure 1. Structure of the one-stage feature generator with handwritten digit inputs.


Figure 2. Structure of the two-stage feature generator with handwritten digit inputs.

Figure 3. MSE of the fully (a) and partially (b) trained neural network.






Figure 4. Representative distribution of features in space with high and low standard deviations.

Figure 5. Samples of digits from the MNIST dataset.

Figure 6. Samples of digits from the USPS dataset.

Table 1. Standard deviations for each digit class in USPS with a training size of 500.

Table 2. Standard deviations for each digit class in USPS with a training size of 1000.

Table 3. Separability values for only PCA with a training size of 500.

Table 4. Separability values for PCA plus NN with a training size of 500.

Table 5. Separability ratios of PCA + NN to PCA with a training size of 500.

Table 6. Standard deviations for each digit class in MNIST with a training size of 5000.

Table 7. Standard deviations for each digit class in MNIST with a training size of 1000.

Table 8. Separability values for only PCA with a training size of 5000.

Table 9. Separability values for PCA + NN with a training size of 5000.

Table 10. Separability ratios of PCA + NN to PCA with a training size of 5000.

Table 11. Test recognition rates of the MDC for USPS.

Table 12. Accuracies with K = 10 and various sizes and epochs for USPS.

Table 13. Test recognition rates of the MDC for MNIST.

Table 14. Test recognition rates of the SVM for USPS.

Table 15. Test recognition rates of the SVM for MNIST.

Table 16 lists the accuracies with K = 8 at different epochs and training sizes. The improvements in the performance of the two-stage extractor are clear; for instance, the accuracy is increased by 1.5869 points with respect to the one-stage extractor at an epoch of 10 for a training size of 5000.

Table 16. Accuracies with K = 8 and various sizes and epochs for MNIST.

Table 17. Performance scores with one-stage and two-stage feature extractors.

Table 18. Comparison with state-of-the-art methods on MNIST.

Table 19. Comparison with state-of-the-art methods on USPS.