A Novel Framework Using Deep Auto-Encoders Based Linear Model for Data Classification

This paper proposes a novel data classification framework combining sparse auto-encoders (SAEs) and a post-processing system consisting of a linear model whose parameters are estimated with the Particle Swarm Optimization (PSO) algorithm. Sensitive, high-level features are extracted by the first auto-encoder, which is wired to the second auto-encoder; a Softmax layer then classifies the features extracted by the second layer. The two auto-encoders and the Softmax classifier are stacked and trained in a supervised fashion with the well-known backpropagation algorithm to enhance the performance of the neural network. Afterwards, the linear model transforms the calculated output of the deep stacked sparse auto-encoder to a value close to the anticipated output. This simple transformation increases the overall data classification performance of the stacked sparse auto-encoder architecture. The PSO algorithm estimates the parameters of the linear model in a metaheuristic fashion. The proposed framework is validated on three public datasets and presents promising results when compared with the current literature. Furthermore, the framework can be applied to any data classification problem with minor updates, such as altering parameters including the input features, hidden neurons and output classes.


Introduction
Deep learning (DL) is a new paradigm of neural networks, employed in fields such as image classification and recognition, medical imaging and robotics. The deep auto-encoder (DAE) is also a popular deep learning technique and has recently been adapted to various applications in different fields [1][2][3][4]. Bhatkoti and Paul propose a new framework for Alzheimer's disease diagnosis based on deep learning and the k-sparse autoencoder (KSA) algorithm. In this application, the results of the modified approach are compared to the non-modified k-sparse method, and the modified KSA algorithm improves diagnostic competence compared to previous research [5,6]. Tong et al. present a software defect prediction application using the advantages of stacked denoising auto-encoders (SDAEs) and a two-stage ensemble (TSE). In the first step, SDAEs are used to learn deep representations from the software metrics. Moreover, a new ensemble learning method, TSE, is proposed to address the label imbalance problem. The proposed method is trained and tested on 12 NASA benchmark datasets, showing that the SDAEsTSE system is significantly effective for software defect prediction [7]. Other work compares classical methods with the performance of DCNNs for classifying the fetal facial standard plane in clinical detection [15].
Visual inspection of large volumes of data has drawbacks and weaknesses: it is time-consuming and prone to conflicts in recognition, classification and detection, which are fundamental problems when handling large amounts of data. Therefore, many computer-aided diagnosis systems based on machine learning techniques have been proposed for data classification and processing. Despite recent research interest, this remains an open field in need of further solutions, which essentially motivates the authors to contribute to it. Accordingly, as aforementioned, this study introduces a general framework for data classification and processing, verified on a number of benchmark datasets from different fields. Overall, the main advantage of this framework is its remarkable experimental results when compared with former studies. Furthermore, the proposed framework can be used in any field with minimum effort by setting the model parameters based on the characteristics of the problem.
Generally, the two sparse auto-encoders are utilized to reduce the dimension of the input features and learn refined features, which are then classified by a Softmax layer. The whole model is stacked to allow a supervised training methodology. The critical contribution is then achieved by integrating a linear model, optimized with a metaheuristic algorithm, to enhance the deep sparse auto-encoder's performance.

Literature Review
Several studies related to three different datasets (epileptic seizure detection, cardiac arrhythmia and SPECTF classification) are analyzed and presented in Tables 10-12. Epileptic seizure is one of the most studied diseases in the field of computer-aided detection systems. Srinivasan et al. propose a new system based on time-frequency domain feature extraction, with an RNN used to classify the features; the proposed method presents 99.60% accuracy [16]. Subasi and Ercelebi propose an artificial neural network (ANN) based on the wavelet transform (WT) and produce only 92% performance [17]. Subasi proposes a discrete WT based on a mixture-of-experts model, which presents only 94.5% performance [18]. Kannathal et al. propose an adaptive neuro-fuzzy inference system (ANFIS) based on entropy measures and produce 95% performance [19]. Tzallas et al. propose a new method based on time-frequency analysis and an ANN, which produces a high accuracy of 100% [20]. Polat and Güneş propose a fast Fourier transformation and decision tree (DT) approach, which presents 98.72% performance [21]. Acharya et al. employ wavelet packet decomposition (WPD) to decompose segments and principal component analysis (PCA) to extract eigenvalues from the coefficients; a supervised technique, namely a Gaussian mixture model (GMM) classifier, is then employed to categorize the extracted features and obtains 99% accuracy [22]. Acharya et al. propose a combination of entropies, HOS, Higuchi FD, the Hurst exponent and FC, and the proposed method offers 99.70% accuracy [23]. Peker et al. propose a complex-valued artificial neural network (CVANN) based on the dual-tree complex wavelet transformation (DTCWT); the proposed method presents 100% performance [24]. Karim et al. propose a new framework involving deep sparse auto-encoders (DSAE) utilizing the Taguchi optimization method, and the proposed method presents 100% accuracy [25]. Recently, Karim et al. modified the same framework by incorporating an energy spectral density function, used to extract features, into a similar DSAE architecture. The results reveal that it outperforms many existing systems, especially on medical datasets [26].
Additionally, an important study on arrhythmias was recently offered, in which a model for the estimation of cardiac arrhythmias is proposed [27]. The presented method applies two conventional supervised techniques (k-NN and SVM) and is validated and tested on the UCI dataset; while k-NN presents a 73.8% accuracy rate, SVM surprisingly achieves only a 68.8% accuracy rate (Mustaqeem et al. [28]). Zuo et al. present a technique for the taxonomy of cardiac arrhythmia using a k-nearest neighbor classifier; the submitted method outperforms traditional KNN algorithms and produces more than 70% accuracy [29]. Besides that, an ANN-based architecture is applied to classify electrocardiography (ECG) records for cardiac arrhythmia taxonomy, and it is claimed that the experimental results yield more than 87% classification accuracy [30]. Moreover, Persada et al. propose Best First and CsfSubsetEval for the feature selection process. The selected features are classified using several classifiers, and the best precision, 81%, is obtained by the RBF classifier with the combination of BFS and CsfSubsetEval techniques [31]. Jadhav et al. propose a modular neural network model for the binary classification (normal or abnormal) of the arrhythmia dataset; the proposed model is claimed to attain 82.22% accuracy on the given dataset [32]. Further corresponding studies can be found in [33][34][35].
Moreover, a number of previous studies in the field of SPECTF classification are available. Srinivas et al. propose an SVM technique relying on sparsity-based dictionary learning; the proposed method presents 97.8% accuracy [36]. An alternative study offers a Bayesian network to select features; the method entails a vast number of features and produces 95.76% accuracy [37]. Cha et al. propose a new data description approach, namely support vector data description (SVDD), which is assessed on datasets from the UCI repository and achieves almost 95% accuracy for the given dataset [38]. Furthermore, Liu et al. propose a new SVDD-based method, which offers 90% accuracy [39]. Previously, Cui et al. combined an improved version of k-nearest neighbors with the transductive confidence machine (TCM); the authors claim that this approach (TCM-IKNN) presents 90% accuracy on the UCI dataset [40]. Alternatively, a previous study on a discretization approach, namely core-generating approximate minimum entropy discretization, was also presented in [41]. This aims to select the lowest-entropy cuts in order to create discrete data points providing nonempty cores. The presented method is also validated on the UCI dataset and achieves an 84% accuracy rate [41].

Material and Methods
The main contribution of this paper is to integrate a post-processing procedure into a data classification framework. Accordingly, a strong deep learning framework combining sparse auto-encoders (SAEs) followed by a Softmax classifier, a generalization of the binary form of the logistic regression method, is initially designed. The auto-encoder layers and the classifier layer are stacked so as to be trained in a supervised approach based on a backpropagation algorithm. In order to increase the overall classification accuracy, a linear transformation function is integrated into the framework. This layer, in essence, improves the results obtained from the DAEs based on a linear model. The critical issue here is to estimate the optimum parameters for the linear transformation model. A strong and reliable metaheuristic algorithm, PSO, is employed to approximate the optimal model parameters. All these steps are detailed in the following subsections.

Stacked Sparse Auto-Encoder
The stacked sparse auto-encoder (SSAE) is principally a neural network consisting of a number of auto-encoders, where each auto-encoder represents a layer and is trained in an unsupervised fashion using unlabeled data. The input of each auto-encoder is the output of the previous one. Training an auto-encoder estimates the optimal parameters by using algorithms that reduce the divergence between the input x and the output x̂. The encoding between input and output is represented by the equations below. Here, the input vector x = (x_1, x_2, x_3, ..., x_N) is transformed into the hidden representation x̂ by employing a nonlinear model:

n_i^(1) = M(w_i x + b_i)    (1)

Here, n_i^(1) refers to the ith neuron at the first layer of the architecture, M is an activation function, and w_i and b_i refer to the weight matrix and the bias parameter, respectively. The final mathematical model is illustrated in Equation (4). The discrepancy between the input x and the output x̂ is represented by a cost function, and several algorithms are used to find the optimum parameters of the network; the corresponding mathematical model can be seen in [25,42]. The model of the stacked sparse auto-encoder (SSAE) used in the proposed framework is illustrated in Figure A1 in Appendix A. The model has two hidden layers and a classifier layer (Softmax).
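To make the cost structure concrete, the following numpy sketch computes a sparse auto-encoder cost of the kind used here: reconstruction error plus an L2 weight penalty and a KL-divergence sparsity penalty. This is an illustration only, not the authors' implementation; the function names and coefficient values are hypothetical.

```python
import numpy as np

def kl_divergence(rho, rho_hat):
    """Kullback-Leibler divergence between the desired sparsity rho and
    the average hidden activations rho_hat."""
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sparse_ae_cost(x, x_hat, weights, activations, lam=1e-4, beta=3.0, rho=0.05):
    """Sparse auto-encoder cost: reconstruction MSE plus L2 weight
    regularization (coefficient lam) and sparsity regularization (beta)."""
    mse = np.mean((x - x_hat) ** 2)              # reconstruction error E
    omega_weights = 0.5 * np.sum(weights ** 2)   # L2 weight regularization
    rho_hat = np.mean(activations, axis=0)       # average activation per unit
    omega_sparsity = kl_divergence(rho, rho_hat)
    return mse + lam * omega_weights + beta * omega_sparsity

# Toy example: perfect sparsity (rho_hat == rho) and zero weights leave
# only the reconstruction term.
cost = sparse_ae_cost(np.ones((2, 3)), np.zeros((2, 3)),
                      np.zeros((3, 2)), np.full((2, 2, ), 0.05))
```

In the proposed framework, the regularization coefficients would be set per dataset, as in the parameter tables below.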

The Particle Swarm Optimization (PSO) Algorithm
PSO algorithms are population-based metaheuristic algorithms proposed by [43][44][45][46]. These algorithms mimic the social behavior of bird flocks for problem solving. The PSO algorithm starts with a group of arbitrary solutions representing the particles, and then explores the search space to approximate an optimal solution by updating the generations. In each iteration, every particle is updated by considering two best values, namely the local and global best. The best solution attained so far by the particle itself is stored as the local best, known as the "pbest" value. The global best refers to the best solution achieved thus far by any particle in the population, known as the "gbest" value. After selecting these two best solutions, the particle updates its velocity and position by employing Equations (5) and (6).
V_i^(k+1) = w V_i^k + c_1 r_1 (P_i^k − X_i^k) + c_2 r_2 (P_g^k − X_i^k)    (5)

X_i^(k+1) = X_i^k + V_i^(k+1)    (6)

Here, X_i^k represents the particle position, V_i^k the particle velocity, P_i^k the best remembered individual particle position (pbest), and P_g^k the best swarm position (gbest); c_1 and c_2 are the cognitive and social parameters. Additionally, r_1 and r_2 are random parameters in (0,1), and w is the inertia coefficient in (0,1), which controls convergence and the explore-exploit trade-off in the PSO algorithm. PSO offers a number of advantages when compared with other optimization algorithms: it is a fast optimization algorithm and needs only a few parameters to tune. In particular, when PSO is compared with one of its main counterparts, the genetic algorithm (GA), PSO converges faster and needs fewer parameters to be configured.
Accordingly, PSO is successfully applied in several fields, such as neural networks, optimization problems, etc. Algorithm 1 refers to the conventional PSO algorithm [47].

For each particle
    Set the particle position in a random manner
End
Do
    For each particle
        Estimate the local best "pBest"
        If the "pBest" is enhanced
            Update the "pBest" value
        End
    End
    The global best "gBest" is updated as the best of the "pBests"
    For each particle
        Estimate the velocity of the particle via Equations (5) and (6)
        Update the position of the particle
    End
End
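Algorithm 1 and the velocity/position updates of Equations (5) and (6) can be sketched in a few lines of numpy. This is an illustrative minimal PSO; the parameter values and the sphere test function are arbitrary choices, not the paper's settings.

```python
import numpy as np

def pso(cost, dim, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5,
        bounds=(-5.0, 5.0), seed=0):
    """Minimal PSO following Algorithm 1: random initialization, then
    iterative pbest/gbest updates with the rules of Equations (5)-(6)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))   # particle positions
    v = np.zeros((n_particles, dim))              # particle velocities
    pbest = x.copy()
    pbest_cost = np.array([cost(p) for p in x])
    g = pbest[np.argmin(pbest_cost)].copy()       # global best (gbest)
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Eq. (5): inertia, cognitive and social terms
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        # Eq. (6): position update
        x = x + v
        c = np.array([cost(p) for p in x])
        improved = c < pbest_cost                 # update each pbest
        pbest[improved] = x[improved]
        pbest_cost[improved] = c[improved]
        g = pbest[np.argmin(pbest_cost)].copy()   # update gbest
    return g, float(pbest_cost.min())

# Example: minimize the sphere function, whose optimum is at the origin.
best, best_cost = pso(lambda p: float(np.sum(p ** 2)), dim=3)
```

With these settings the swarm converges close to the origin; in the proposed framework the cost function would instead be the MSE of the linear post-processing model.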

A New Deep Learning Framework Using Deep Auto-Encoders and a Linear Model Based on PSO
Suppose a trained deep stacked auto-encoder is used to classify an object into one of M classes. The input layer of the deep stacked auto-encoder involves N neurons that correspond to the object features X_1, X_2, ..., X_N, and the output layer involves M neurons that stand for the expected output (class labels) Ẑ_1, Ẑ_2, ..., Ẑ_M (see Figure 1).

The deep auto-encoder involves two auto-encoders and a Softmax layer, where the auto-encoders try to learn high-level features from the input data X. The aim of using a number of auto-encoders is to reduce the number of features gradually, because dropping the number of features suddenly in one auto-encoder can lead to missing important features and affect the accuracy. The cost function of the stacked auto-encoders is represented as Equation (7).

Here, the error rate is denoted by E, the input features are denoted by x, the reconstructed features by x̂, λ is the coefficient for the L2 weight regularization, β is the coefficient for the sparsity regularization, and Ω_weights signifies the L2 weight regularization term, which can be represented as shown in Equation (8).
Here, L represents the number of hidden layers, n the number of observations, and k the variable number of the current training data.
Finally, Ω sparsity is the Sparsity Regularization parameter which adjusts the degree of sparsity of the output from the hidden layers, as illustrated in Equation (9).
Here, the desired value is represented by ρ, ρ̂_i symbolizes the average output activation of a neuron i, and KL represents the Kullback-Leibler divergence, a function measuring the difference between two probability distributions over the same data. Furthermore, the features that produce the minimum cost in Equation (7) are selected and become the input to Softmax, see Equation (10). Softmax is exploited as a classifier of the extracted features from X to the labels Z (see Figure 1).
Here, the net input z is defined as z^(i) = w^T x^(i), where w represents the weight vector and x^(i) symbolizes the feature vector of the ith training sample. Essentially, the Softmax function calculates the probability that a training sample x^(i) belongs to a class j by taking into account the given weights and the net input z^(i). Softmax is used without other classifiers because it is a transfer function and multiclass classifier which acts as an output layer for the preceding auto-encoders. The auto-encoders and Softmax layers are then combined and trained by using a backpropagation algorithm in a supervised fashion to improve the performance of the network.
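As a small illustration of the mapping from net inputs z_j = w_j·x to class probabilities, consider the sketch below; the weight values are made-up numbers, not learned parameters.

```python
import numpy as np

def softmax(z):
    """Softmax over the net inputs z: the probability of each class."""
    z = z - np.max(z)        # subtract the max to stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

# Net input for one sample: z_j = w_j . x, one weight vector per class.
x = np.array([0.2, 0.5, 0.3])
W = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.5, 0.0]])
probs = softmax(W @ x)       # class membership probabilities
```

The outputs are nonnegative and sum to one, which is why the layer can serve directly as the classifier on top of the stacked auto-encoders.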
Moreover, unlike previous deep learning applications, the output of the deep auto-encoder does not generate the final prediction; it is instead optimized by using a linear model [48]. Essentially, the performance of a deep network is determined by the network's structure, transfer function and learning algorithm, yet a network classifier tends to be weak once it is designed with an inappropriate structure, and there is no certain way to estimate a proper structure. A recent study proposed a linear model as a post-processing layer based on the Kalman filter to improve overall classification performance [49]. Our study is inspired by this previous work and employs the linear model to transform the predicted output of the network to a value close to the desired output via a linear combination of the object features and the expected output. This simple transformation can be considered a post-processing step, reducing the network error and enhancing classification performance. A metaheuristic approach, PSO, is employed to optimize the parameters of the linear model; the parameters are calculated during the iterations of the PSO algorithm. The linear model utilizes the predicted output of the deep network and the object features as input to estimate the class labels. The output of the DSAE, Ẑ, is processed in a linear model by using X, the coefficients A and B and the error term e to produce the optimized result Z (see Equation (12)).
Z = AẐ + BX + e    (12)

Here, A represents a diagonal M × M matrix as shown in (13), B denotes an M × N matrix as shown in (14), and e is the error term. The coefficients A and B are unknown for the linear model [50]; their values are estimated by using a PSO algorithm, and the parameters of PSO are selected depending on the problem type and input features.
The details of the linear model mathematics are explained in [49], and the whole framework flowchart is illustrated in Figure 2.
In each iteration of PSO, the predicted Z is compared with the optimal prediction Q by using the MSE, as illustrated in Equation (15).
Here, m denotes the number of examples, Q_i is the optimum class label for the input features, and the MSE is the discrepancy between Z_i and Q_i.
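The post-processing step of Equations (12) and (15) can be sketched as follows. This is a schematic numpy version; the choice of packing diag(A), B and e into one vector (so that a PSO particle can carry all parameters) is our own illustrative layout, not the paper's.

```python
import numpy as np

def linear_post(z_hat, x, a_diag, B, e):
    """Equation (12): Z = A*Z_hat + B*X + e, with the diagonal M x M
    matrix A represented by its diagonal a_diag."""
    return a_diag * z_hat + B @ x + e

def mse_cost(params, z_hats, xs, targets, M, N):
    """Equation (15)-style cost over m examples; `params` is one candidate
    solution packing diag(A), the entries of B, and e."""
    a_diag = params[:M]
    B = params[M:M + M * N].reshape(M, N)
    e = params[M + M * N:]
    preds = np.array([linear_post(zh, x, a_diag, B, e)
                      for zh, x in zip(z_hats, xs)])
    return float(np.mean((preds - targets) ** 2))

# Identity parameters (A = I, B = 0, e = 0) leave the DSAE output unchanged,
# so the cost is zero when the predictions already match the targets.
M, N = 2, 3
params = np.concatenate([np.ones(M), np.zeros(M * N), np.zeros(M)])
z_hats = np.array([[0.9, 0.1], [0.2, 0.8]])
xs = np.zeros((2, N))
cost = mse_cost(params, z_hats, xs, z_hats, M, N)
```

PSO would search over `params` to minimize this cost, which is exactly the role Equation (15) plays in the framework.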
The MSE is represented as a cost function, and PSO minimizes its value by estimating the best values for the parameters A, B and e.
Each dataset has been divided into test and training sets according to preliminary experiments and based on our previous studies. Accordingly, the epileptic seizure dataset is divided into 100 samples for training and another 100 for testing, i.e., 50% for training and 50% for testing. The SPECTF classification dataset, on the other hand, is arranged as 187 samples (70%) for training and 87 (30%) for testing. The final dataset, the cardiac arrhythmia dataset, consists of 450 instances from 16 classes, with 70% of the data employed for training and 30% for testing, respectively. Overfitting is a critical problem for classification models; in order to prevent it, a random subsampling validation technique was applied during the training process. Each experiment is repeated five times and the average of those experiments is recorded.
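The splitting and repetition protocol above (50/50 for the epileptic seizure data, 70/30 for the other two, five repetitions averaged) can be sketched as follows; the evaluator here is a placeholder, not one of the paper's models.

```python
import numpy as np

def random_subsampling(data, labels, train_frac, evaluate, repeats=5, seed=0):
    """Random subsampling validation: repeat a random train/test split
    `repeats` times and average the score returned by `evaluate`."""
    rng = np.random.default_rng(seed)
    n = len(data)
    n_train = int(round(train_frac * n))
    scores = []
    for _ in range(repeats):
        idx = rng.permutation(n)            # a fresh random split each run
        tr, te = idx[:n_train], idx[n_train:]
        scores.append(evaluate(data[tr], labels[tr], data[te], labels[te]))
    return float(np.mean(scores))

# Dummy evaluator that just reports the test-set size as a sanity check
# (a 50/50 split of 200 samples yields 100 test samples every repetition).
data, labels = np.arange(200.0), np.zeros(200)
avg = random_subsampling(data, labels, 0.5,
                         lambda Xtr, ytr, Xte, yte: len(Xte))
```

Because each repetition draws a fresh permutation, the averaged score is less sensitive to one lucky or unlucky split, which is the point of the protocol.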

Epileptic Seizure
The proposed framework is confirmed by employing a popular public dataset provided by Bonn University [51]. The dataset consists of 200 samples, with each sample consisting of 4096 features. The EEG data is split into two groups for training and testing procedures. Each group involves 100 examples, 50 of which are normal and the remaining 50 are abnormal. Those cases are illustrated in Figure 3. According to the framework, the first and second auto-encoders extract high-level features obtained from EEG signals and then diminish the number of features to 2007 and 112, respectively. Details of the parametric configuration of auto-encoders are shown in Table 1. Later, the Softmax layer classifies the extracted features as being normal and abnormal.
The linear model is then used to enhance the results, and the parameters of the linear model are estimated by using the PSO algorithm. The linear model parameters are estimated in 30 epochs and a reasonable MSE value is produced, as shown in Figure 4. Besides, the parameters of PSO are presented in Table 2. The test process is repeated five times with the same parameters and hidden layer values, but in each implementation the training and test data are arbitrarily designated to avoid overfitting. The average results of the dataset based on the previously defined evaluation parameters are shown in Table 3, which represents the results during the testing process.

SPECTF Classification
The proposed framework is assessed by employing another benchmark dataset, namely the SPECTF (Single Photon Emission Computed Tomography) heart dataset, presented in [52]. This dataset involves "normal" and "abnormal" classes comprising 267 examples, with each instance consisting of 44 features. There are 40 instances of each class in the training dataset, whereas the validation dataset contains 172 normal and 15 abnormal examples. As noted, auto-encoders can reduce the input dimension; accordingly, the features in auto-encoders 1 and 2 are reduced step by step to 40 and 35, respectively, which essentially extracts high-level and sensitive features from the input data.
The parameters of the auto-encoders are illustrated in Table 4, and the parameters of the PSO algorithm are presented in Table 5. The experimental results are evaluated by calculating the values of the evaluation parameters, as presented in Table 6. For this dataset, the linear model parameters converge in almost 20 epochs and produce a 2.03 error value, as illustrated in Figure 5.

Diagnosis of Cardiac Arrhythmia
The final benchmark dataset involves data regarding cardiac arrhythmia, presented in [52]. This dataset consists of 450 instances from 16 different classes, with each instance having 279 features. The proposed framework is trained on this dataset, according to which the first auto-encoder is trained in an unsupervised approach and reduces the number of features from 279 to 250. The output of the first auto-encoder is passed to the second, which is also trained in an unsupervised manner and reduces the number of features from 250 to 200. Essentially, these auto-encoder layers extract appropriate features in an unsupervised manner. The output is fed to the Softmax layer for multi-class classification, which generates the classification probabilities. The whole architecture, on the other hand, propagates the error by using a backpropagation algorithm, which gives the framework its supervised characteristics, as aforementioned. The auto-encoder parameters for this dataset are shown in Table 7. Table 8 presents the parameters of PSO, which are employed to estimate the best parameters of the linear model. Table 9 demonstrates the experimental performance of the proposed framework regarding the performance evaluation parameters. For this dataset, the linear model parameters are estimated in almost 28 epochs and produce a 2.11 error rate, as illustrated in Figure 6.
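The gradual 279 → 250 → 200 reduction can be sketched as two stacked encoding steps. The sigmoid activation and the random placeholder weights below are illustrative assumptions, not the trained parameters.

```python
import numpy as np

def encode(x, W, b):
    """One auto-encoder's encoding step: a nonlinear map of Wx + b."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))   # sigmoid activation

# Dimension flow for the arrhythmia dataset: 279 -> 250 -> 200 features,
# after which a 16-class Softmax layer would produce class probabilities.
rng = np.random.default_rng(0)
x = rng.random(279)                              # one input instance
h1 = encode(x, 0.01 * rng.standard_normal((250, 279)), np.zeros(250))
h2 = encode(h1, 0.01 * rng.standard_normal((200, 250)), np.zeros(200))
```

Each stage shrinks the representation only moderately, which is the gradual reduction the text argues for over a single aggressive compression.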

Statistical Significance Analysis of Algorithms in the Proposed Method
In applied machine learning, comparing algorithms and proposing a final appropriate model for the problem at hand is a common approach. Models are generally evaluated using resampling methods (k-fold cross-validation, etc.), in which mean performance scores are calculated and compared directly. This approach can be misleading because it is difficult to understand whether the difference between mean performance scores is real or the result of statistical chance. Statistical significance tests are proposed to overcome this problem: they measure the likelihood of observing the samples under the assumption that they were drawn from the same distribution. If this assumption, or null hypothesis, is rejected (if the p-value is smaller than the significance level), it suggests that the difference in skill scores is statistically significant.
Once the data is distributed normally, the two-sample t-test (for independent sets) and the paired t-test (for matched samples) are possibly the most extensively preferred methods in statistics for the assessment of differences between two samples [53]. A t-test is a type of statistical test employed to compare the means of two groups. A two-tailed paired t-test is preferred in this study to compare the results without post-processing using PSO against the results after post-processing with PSO (Figures 7-9), in order to evaluate whether there is a statistically significant difference when the results are optimized. Two-tailed tests are able to identify differences in either direction, greater or less than [54].

A two-tailed paired t-test is applied in Excel on the two matched groups for epileptic seizure detection, and the p-value is calculated as 0.002463, which is less than the standard level of significance (p < 0.05); thus, a statistically significant difference is noted between the results with and without PSO. The null hypothesis can be rejected since the sample data support the hypothesis that the population means are dissimilar.
A two-tailed paired t-test is applied in Excel on the two matched groups for SPECTF classification, and the p-value is calculated as 0.020919, which is less than the standard level of significance (p < 0.05); thus, a statistically significant difference is noted between the results with and without PSO. The null hypothesis can be rejected since the sample data support the hypothesis that the population means are dissimilar.
for the presented problem is a common approach. Models are generally evaluated using resampling methods (k-fold cross-validation etc.). In these methods, mean performance scores are calculated and compared directly. This approach can give wrong ideas because it is difficult to understand whether the difference between mean performance scores is real or the result of a statistical chance. Statistical significance tests are proposed to overcome this problem and measure the likelihood of the samples with the assumption that they were selected from the equivalent distribution. If this assumption, or null hypothesis, is rejected (if a critical value is smaller than the significance level), it suggests that the difference in skill scores is statistically significant.
Once the data is distributed normally, the two-sample t-test (regarding independent sets) and the paired t-test (for matched samples) are possibly considered the most extensively preferred methods in statistics for the assessment of differences between two samples [53]. A t-test is a type of statistical test that is employed to compare the means of two groups. A 2-tailed paired t-test is preferred in this study to compare the difference between the results without post-processing using PSO and the results after post-processing with PSO (Figures 7-9) in order to evaluate if there is a statistically significant difference when the results are optimized. Two-tailed tests are able to identify differences in either path, greater or less than [54]. Sensors 2020, 20, x FOR PEER REVIEW 14 of 20 A 2-tailed paired t-test is applied in Excel on the two matched groups of epileptic seizure detection and p-value is calculated as 0.002463, that is, less than the standard level of significance (p < 0.05) so a statistically significant difference is noted on this data without using PSO and using PSO. The null hypothesis can be rejected since the sample data support the hypothesis that the population means are dissimilar. A 2-tailed paired t-test is applied in Excel on the two matched groups of SPECTF classification and p-value is calculated as 0.020919, that is, less than the standard level of significance (p < 0.05) so a statistically significant difference is noted on this data without using PSO and using PSO. The null hypothesis can be rejected since the sample data support the hypothesis that the population means are dissimilar.
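The two-tailed paired t-test described above can be sketched in a few lines. This is a minimal standard-library illustration, not the paper's actual computation: the matched accuracy scores below are hypothetical placeholders, and the critical value is taken from standard t tables for the chosen sample size.

```python
# Two-tailed paired t-test implemented with the standard library only.
# The accuracy scores are hypothetical placeholders, not the paper's results.
import math
import statistics

without_pso = [0.952, 0.948, 0.961, 0.957, 0.950]
with_pso = [0.971, 0.969, 0.978, 0.974, 0.970]

diffs = [b - a for a, b in zip(without_pso, with_pso)]
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)           # sample standard deviation
t_stat = mean_d / (sd_d / math.sqrt(n))  # paired t statistic, df = n - 1

# Critical value of the t distribution for a two-tailed test,
# alpha = 0.05, df = 4 (from standard t tables).
T_CRIT = 2.776
significant = abs(t_stat) > T_CRIT
print(f"t = {t_stat:.3f}, significant at alpha = 0.05: {significant}")
```

In practice, a library routine such as SciPy's paired t-test would also return the exact p-value rather than relying on a tabulated critical value.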

[Figure: SPECTF Classification, DSAEs without post-processing vs. DSAEs using PSO]

A two-tailed paired t-test is applied in Excel to the two matched groups of cardiac arrhythmia diagnosis results, and the p-value is calculated as 0.000307, which is less than the standard significance level (p < 0.05); thus, a statistically significant difference is observed between the results obtained without PSO and with PSO. The null hypothesis can be rejected, since the sample data support the hypothesis that the population means differ.

[Figure: Diagnosis of Cardiac Arrhythmia, DSAEs without post-processing vs. DSAEs using PSO]

Performance Evaluation of the Framework Using Benchmark Datasets
The results of the proposed method on the benchmark datasets are compared with several studies presented in this field, and these previous studies are analyzed to put the performance of the proposed framework in context. The comparison results for each dataset are detailed in Tables 10-12. Table 10 presents the comparison between the proposed framework and the leading state-of-the-art studies on the Epileptic Seizure Dataset [51], whereas Table 11 covers the comparison based on the SPECTF Dataset. Table 12, in turn, presents the performance comparison on the Cardiac Arrhythmia Dataset. Details of both the SPECTF and Cardiac Arrhythmia Datasets can be found in [52].

Epileptic Seizure Dataset

Table 10. Performance comparison of the proposed framework and state-of-the-art studies on the Epileptic Seizure Dataset.

Reference | Method | Accuracy
[36] | Time-frequency domain features + RNN | 99.6%
[17] | WT + ANN | 92.0%
[18] | Discrete WT + mixture of experts model | 94.5%
[19] | Entropy measures + ANFIS | 92.22%
[20] | Time-frequency analysis + ANN | 100%
[21] | Fast Fourier transform + DT | 98.72%
[22] | WPD + PCA + GMM | 99.00%
[23] | Entropies + HOS + Higuchi FD + Hurst exponent + FC | 99.70%
[24] | DTCWT + CVANN-3 | 100%
[25] | Deep auto-encoder using the Taguchi method | 100%
[26] | Deep auto-encoder + energy spectral density | 100%
Proposed framework | Deep auto-encoder and PSO-based linear model | 100%

According to the results shown in Table 10, the proposed framework outperforms a number of studies [17][18][19][21][22][23][36] and matches the results of the remaining studies while differing in complexity and execution time. Peker et al. [24] propose traditional machine learning techniques that require a long processing time compared with our framework, particularly for high-dimensional feature sets such as those in epileptic seizure detection. Moreover, in a recent study, the authors propose training DAEs using the Taguchi method for complex systems; its parameters are fitted manually, whereas our framework automatically optimizes the obtained results without the need to repeat experiments manually to reach the best accuracy [25].

SPECTF Dataset
In this sub-section, the results obtained from the proposed framework are compared with well-known studies in the field of SPECTF classification, as shown in Table 11.

Cardiac Arrhythmia Dataset
Finally, the proposed framework shows remarkable results when compared with well-known studies in the field of cardiac arrhythmia, as illustrated in Table 12.
These studies can be found in [28][29][30][31][32][33][34][35]. The results verify the advantage of the proposed system over previous relevant papers using the Cardiac Arrhythmia dataset. As previously mentioned, the dataset is labelled with 16 different classes; even so, the proposed method achieves the best result when compared with the state-of-the-art studies.

Conclusions
This paper proposes a framework for data classification problems. The novel framework combines an efficient deep learning approach (DAEs) with a linear model trained by a metaheuristic algorithm (PSO). Despite their efficiency, DAEs may perform poorly on complex problems, such as EEG signal classification and motion estimation. Accordingly, the overall goal of this framework is to increase the performance of DAEs by integrating a post-processing layer. This layer optimizes the results obtained from the DAEs through a linear model trained by the PSO algorithm, where the metaheuristic is employed to estimate the parameters of the linear model. PSO has produced satisfactory results in various problems, is easy to implement, and requires only a few parameters to tune.
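The post-processing idea described above can be sketched as follows: PSO searches for the parameters (w, b) of a linear model y = w * x + b that maps the network's raw outputs toward the desired targets by minimizing the mean squared error. This is a minimal sketch under assumed details; the raw outputs, targets, and PSO hyper-parameters below are illustrative placeholders, not the paper's settings.

```python
# Minimal global-best PSO fitting a linear post-processing model y = w*x + b.
# Data and hyper-parameters are hypothetical placeholders.
import random

random.seed(0)

# Hypothetical raw network outputs and their desired targets.
raw = [0.10, 0.35, 0.55, 0.80, 0.95]
target = [0.0, 0.0, 1.0, 1.0, 1.0]

def mse(params):
    w, b = params
    return sum((w * x + b - t) ** 2 for x, t in zip(raw, target)) / len(raw)

n_particles, n_iters, dim = 20, 200, 2
inertia, c1, c2 = 0.7, 1.5, 1.5  # standard PSO coefficients
pos = [[random.uniform(-2, 2) for _ in range(dim)] for _ in range(n_particles)]
vel = [[0.0] * dim for _ in range(n_particles)]
pbest = [p[:] for p in pos]
pbest_val = [mse(p) for p in pos]
gbest = min(zip(pbest_val, pbest), key=lambda t: t[0])[1][:]

for _ in range(n_iters):
    for i in range(n_particles):
        for d in range(dim):
            r1, r2 = random.random(), random.random()
            vel[i][d] = (inertia * vel[i][d]
                         + c1 * r1 * (pbest[i][d] - pos[i][d])
                         + c2 * r2 * (gbest[d] - pos[i][d]))
            pos[i][d] += vel[i][d]
        val = mse(pos[i])
        if val < pbest_val[i]:
            pbest_val[i], pbest[i] = val, pos[i][:]
            if val < mse(gbest):
                gbest = pos[i][:]

w_opt, b_opt = gbest
print(f"fitted linear model: y = {w_opt:.3f} * x + {b_opt:.3f}")
```

For this convex two-parameter objective the swarm converges close to the least-squares solution; the appeal of PSO in the framework is that the same search works unchanged for models where no closed-form fit exists.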
Experimental results reveal that the proposed framework offers a number of advantages over previous studies in the literature: it learns from less data than other methods, and its greedy layer-wise deep learning scheme speeds up processing for high-dimensional features compared with convolutional techniques. The framework also demonstrates that the overall performance of DAEs on complex problems can be enhanced by integrating a post-processing layer. Based on the results obtained, the introduced framework shows favorable results and can be adapted by researchers to any type of data classification problem. As future work, nonlinear and dynamic model systems can be explored as post-processing techniques to further enhance the classification accuracy of the proposed framework. Moreover, other optimization algorithms, such as the genetic algorithm, the grey wolf optimizer, or the bat algorithm, can be employed to train the models instead of PSO, and other classification models, such as support vector machines, naive Bayes, or decision trees, can be combined with linear and nonlinear models.
Figure A1. The model of a Stacked Sparse Auto-encoder (SSAE) with two hidden layers and a classifier (SoftMax).
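The forward pass of the architecture in Figure A1 can be sketched as two stacked encoder layers followed by a SoftMax classification layer. This is an illustrative standard-library sketch only: the weights are random placeholders and the layer sizes are assumptions, whereas the actual model is pre-trained layer-wise and fine-tuned with backpropagation.

```python
# Forward-pass sketch of a two-hidden-layer stacked auto-encoder with a
# SoftMax classifier. Weights are random placeholders, not trained values.
import math
import random

random.seed(1)

def layer(x, n_out, sigmoid=True):
    """Fully connected layer with random placeholder weights."""
    w = [[random.gauss(0.0, 0.5) for _ in x] for _ in range(n_out)]
    z = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    if sigmoid:
        return [1.0 / (1.0 + math.exp(-v)) for v in z]
    return z

def softmax(z):
    m = max(z)                      # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

x = [0.2, 0.8, 0.5, 0.1]            # hypothetical input feature vector
h1 = layer(x, 3)                    # first auto-encoder's hidden features
h2 = layer(h1, 2)                   # second auto-encoder's hidden features
probs = softmax(layer(h2, 2, sigmoid=False))  # SoftMax class probabilities
print(probs)
```

The output is a probability distribution over the classes; in the proposed framework, these values are what the PSO-trained linear model subsequently refines.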