Preclinical Diagnosis of Magnetic Resonance (mr) Brain Images via Discrete Wavelet Packet Transform with Tsallis Entropy and Generalized Eigenvalue Proximal Support Vector Machine (gepsvm)

Background: Developing an accurate computer-aided diagnosis (CAD) system of MR brain images is essential for medical interpretation and analysis. In this study, we propose a novel automatic CAD system to distinguish abnormal brains from normal brains in MRI scanning. Methods: The proposed method simplifies the task to a binary classification problem. We used discrete wavelet packet transform (DWPT) to extract wavelet packet coefficients from MR brain images. Next, Shannon entropy (SE) and Tsallis entropy (TE) were harnessed to obtain entropy features from DWPT coefficients. Finally, generalized eigenvalue proximal support vector machine (GEPSVM), and GEPSVM with radial basis function (RBF) kernel, were employed as classifier. We tested the four proposed diagnosis methods (DWPT + SE + GEPSVM, DWPT + TE + GEPSVM, DWPT + SE + GEPSVM + RBF, and DWPT + TE + GEPSVM + RBF) on three benchmark datasets of Dataset-66, Dataset-160, and Dataset-255. Results: The 10 repetition of K-fold stratified cross validation results showed the proposed DWPT + TE + 1796 GEPSVM + RBF method excelled not only other three proposed classifiers but also existing state-of-the-art methods in terms of classification accuracy. In addition, the DWPT + TE + GEPSVM + RBF method achieved accuracy of 100%, 100%, and 99.53% on Dataset-66, Dataset-160, and Dataset-255, respectively. For Dataset-255, the offline learning cost 8.4430s and online prediction cost merely 0.1059s. Conclusions: We have proved the effectiveness of the proposed method, which achieved nearly 100% accuracy over three benchmark datasets.


Introduction
Magnetic resonance imaging (MRI) is a low-risk, fast, non-invasive imaging technique that produces high quality images of the anatomical structures of the human body (especially the brain), and provides rich information for clinical diagnosis and biomedical research [1].Soft tissue structures are clearer and more detailed with MRI than other imaging modalities such as X-ray, CT, etc. [2].Researchers are working not only to improve the magnetic resonance (MR) image quality, but also to develop novel methods for easier and quicker pre-clinical diagnosis from MR images [3].In this study, we focused on preclinical diagnosis of MR brain images.
The problem arose that existing manual methods of preclinical diagnosis are tedious, time consuming, costly, and irreproducible, due to the huge amount of imaging data.This required technicians to design an automatic computer-aided diagnosis (CAD) tool [4].In the last decade, various methods were proposed for brain MR image classification.Chaplot et al. [5] used the approximation coefficients obtained by discrete wavelet transform (DWT), and employed the self-organizing map (SOM) neural network and support vector machine (SVM).Maitra and Chatterjee [6] employed the Slantlet transform, which is an improved version of DWT.Their feature vector of each image was created by considering the magnitudes of Slantlet transform outputs corresponding to six spatial positions chosen according to a specific logic.Then, they used the common back-propagation neural network (BPNN).El-Dahshan et al. [7] extracted the approximation and detail coefficients of 3-level DWT, reduced the coefficients by principal component analysis (PCA), and used feed-forward back-propagation artificial neural network (FP-ANN) and K-nearest neighbor (KNN) classifiers.Zhang et al. [8] proposed to use DWT for feature extraction, PCA for feature reduction, and feed-forward neural network (FNN) with scaled chaotic ABC as classifier.Based on it, Zhang et al. [9] suggested to replace SCABC with a scaled-conjugate-gradient method.Ramasamy and Anandhakumar [10] used a fast-Fourier-transform based expectation-maximization Gaussian mixture model for brain tissue classification of MR images.Zhang and Wu [11] proposed the use of kernel SVM, and suggested three kernels for this purpose: homogeneous polynomial, inhomogeneous polynomial, and Gaussian radial basis.Saritha et al. [12] proposed a novel feature of wavelet-entropy (WE), employed spider-web plots (SWP) to further reduce features, and used the probabilistic neural network (PNN).Das et al. [13] proposed Ripplet transform (RT) + PCA + least square SVM (LS-SVM), and the 5 × 5 CV showed high classification accuracies.Kalbkhani et al. [14] modelled the detailed coefficients of 2-level DWT by generalized autoregressive conditional heteroscedasticity (GARCH) model, the parameters of which were considered as the primary feature vector.Their classifier was chosen as KNN and SVM.Zhang et al. [15] presented a SVM decision tree (SVMDT) diagnosis method for Alzheimer's disease (AD) based on structural MR images.
All those methods achieved good results, nevertheless, most methods were vulnerable to following two points: (i) they commonly used DWT, which is translation-variant, namely, the wavelet coefficients behave unpredictably under translation of the input signal; and (ii) the classifiers performed well on training images, but poorly on new query images.
To address those problems, we suggest in this study three improvements: (i) we employ a discrete wavelet packet transform (DWPT) to replace DWT, (ii) we introduce two entropy methods: Shannon entropy (SE) and Tsallis entropy (TE), to extract features from the wavelet packet coefficients, and (iii) we introduce the generalized eigenvalue proximal SVM (GEPSVM) that has better generalization ability, and used a kernel technique.
The rest of this article is organized as follows: the next section presents the Materials and Methods.Experiments in Section 3 compare the proposed methods with state-of-the-art methods over three different benchmark datasets.Section 4 is devoted to discussion.Finally, Section 5 concludes the paper.

Benchmark Dataset
Three different benchmark MR image datasets, i.e., Dataset-66, Dataset-160, and Dataset-255, were used for tests in this study.All datasets consist of T2-weighted MR brain images in axial plane and 256 × 256 in-plane resolution, which were downloaded from the website of Harvard Medical School (http://med.harvard.edu/AANLIB/).Dataset-66 and Dataset-160 were already widely used in brain MR image classification.They consist of abnormal images from seven types of diseases along with normal images.The abnormal brain MR images of the two datasets consists of the following diseases: glioma, meningioma, Alzheimer's disease, Alzheimer's disease plus visual agnosia, Pick's disease, sarcoma, and Huntington's disease.Das, Chowdhury and Kundu [13] proposed the third dataset "Dataset-255", which contains 11 types of diseases, among which seven types of diseases are the same as the Dataset-66 and Dataset-160 mentioned before, and four new types of diseases (chronic subdural hematoma, cerebral toxoplasmosis, herpes encephalitis, and multiple sclerosis) were included.Figure 1 shows samples of brain MR images.
The cost of predicting an abnormal brain as normal is severe.It may delay the treatment of the subject.In contrast, the misclassification of a normal brain to an abnormal brain can be corrected by other auxiliary diagnosis techniques.How to deal with a cost-sensitivity problem?The common methods introduce biases into the error-based classification methods, and they can be divided into following five kinds of solutions [16]: • Changing the class distribution: resampling, instance reweighting, metacost; • Boost methods: AdaBoost/Adacost, cost boosting, asymmetric boosting; • Modifying the learning algorithms: modifying the decision tree, modifying neural networks, modifying SVMs, modifying naive Bayes classifier; • Direct cost-sensitive learning: Laplace correction, smoothing, curtailment, Platt calibration, and Isotonic regression; • Other methods: Cost-sensitive CBR, Cost-sensitive specification, Cost-sensitive genetic programming.
In this study, we can access original data, so we change the class distribution at the step of creating the dataset, i.e., we intentionally create three imbalanced datasets, with the aim of solving the cost-sensitivity problem.The imbalanced class distribution [17] can feed into the classifier more abnormal brains, so the classifier is biased to abnormal brains to solve the cost-sensitivity problem.

CV Setting
Following common conventions and taking advantage of ease of stratified cross validation, 5 × 6-fold stratified cross validation (CV) was used for Dataset-66, and 5 × 5-fold stratified CV was used for the other two datasets.Table 1 showed the statistical characteristics and CV settings of the three datasets.In addition, the abnormal brains were set to true and normal brains to false.We take the Dataset-66 as example, which contains 18 normal and 48 abnormal brains.We use 6-fold stratified cross validation, and assign each fold three normal and eight abnormal brains.Hence, the validation set (one fold) contains three normal and eight abnormal brains, and the training set (the other five folds) contains 15 normal and 40 abnormal ones.Co-registration was not needed because it is not essential.The proposed technique was similar to those used for face recognition, in which some scholars used co-registration [18,19] but others did not use it [20,21].Moreover, some past publications about abnormal brain detection did not use co-registration [5][6][7][8][9][10][11][12][13][14][15] but they all obtained good results, which were comparable to those obtained employing co-registration [22,23].

Discrete Wavelet Transform
The discrete wavelet transform (DWT) is a powerful implementation of the wavelet transform (WT) using the dyadic scales and positions.The fundamental of DWT are introduced as follows: suppose x(t) is a square-integrable function, then the continuous WT of x(t) relative to a given wavelet ψ(t) is defined as [24]: where: Here, the wavelet ψ(t|fs, ft) is calculated from the mother wavelet ψ(t) by translation and dilation: fs is the scale factor, ft the translation factor (both real positive numbers), and C the coefficients of WT.
There are several different kinds of wavelets which have gained popularity throughout the development of wavelet analysis.Equation ( 1) can be discretized by restraining fs and ft to a discrete lattice (fs = 2^ft & fs > 0) to give the DWT, which can be expressed as follows: Here the coefficients L and H refer to the approximation components and the detail components, respectively.The functions l(n) and h(n) denote the low-pass filter and high-pass filter, respectively.The parameters j and k represent the values of scale and translation factors, respectively.The DS operator means the downsampling.
The above decomposition process can be iterated with successive approximations being decomposed so that one signal is broken down into various levels of resolution [25].The whole process is called a wavelet decomposition tree.An example of 2-level 1D-DWT is shown in Figure 2a.In applying this technique to MR images, the DWT is applied separately to each dimension.Figure 2b illustrates a schematic diagram of a 2-level 2D-DWT.As a result, there are four subband (LL, LH, HH, HL) images at each scale.The subband LL is used for the next 2D-DWT.The LL subband can be regarded as the approximation component of the image, while the LH, HL, and HH subbands can be regarded as the detailed components of the image.As the level of the decomposition increased, we obtained more compact yet coarser approximation components.Thus, wavelets provide a simple hierarchical framework for interpreting the image information.

Discrete Wavelet Packet Transform
DWPT is an extension of DWT, whereby all nodes in the tree structure are allowed to split further at each level of decomposition.In the DWT, each level is calculated by passing only the previous wavelet approximation coefficients through discrete-time low and high pass quadrature mirror filters.However in the DWPT; both the detail and approximation coefficients are decomposed to create the full binary tree (Compare Figure 3 and Figure 2a).Therefore, features can be generated based on approximation and detail coefficients at different levels to obtain more information.The WPT of a signal x(t) is defined as: where n is the channel number, j the number of decomposition level (scale parameter), p the position parameter, S the maximum decomposition level, and C the coefficients of WPT.After decomposing signal x(t) by WPT, 2 S sequences can be produced in the S-th level.The fast decomposition equation to next level is: For j levels of decomposition the DWPT produces 2 j different sets of coefficients as opposed to (3j + 1) sets for the DWT.However; due to the downsampling process the overall number of coefficients of DWPT is still the same of those of DWT; so there is no redundancy.

Shannon and Tsallis Entropy
Entropy is a statistical measure of randomness, associated with the order of irreversible processes from a traditional point of view.The entropy concept of Boltzmann/Gibbs was redefined as as a measure of uncertainty regarding the information content of a system as Shannon entropy (SE): where i represents the greylevel of reconstructed coefficient, p the probability of greylevel of i, and L the total number of greylevels.
The SE was restricted to the domain of validity of the Boltzmann-Gibbs-Shannon (BGS) statistics, which only described nature when the effective microscopic interactions and the microscopic memory were short ranged [26].Suppose a physical system can be decomposed into two statistically independent subsystems A and B, the SE has the extensive property (additivity): However, for a certain class of physical systems which entail long-range interactions, long time memory, and fractal-type structures, it is necessary to use non-extensive entropy.Tsallis [27] proposed a generalization of BGS statistics, which was termed TE with its form depicted as: where the real number q denotes an entropic index that characterizes the degree of non-extensivity.Above expression will meet the SE in the limit q→1.The TE is non-extensive in such a way that for a statistical dependent system [28].Its entropy is defined with the obey of pseudo-additivity rule: Three different entropies can be defined with regard to different values of q [29].For q < 1, the TE becomes a sub-extensive entropy where Sq(A + B) < Sq(A) + Sq(B); for q = 1, the TE reduces to an standard extensive entropy where Sq(A + B) = Sq(A) + Sq(B); for q > 1, the TE becomes a superextensive entropy where Sq(A + B) > Sq(A) + Sq(B).
In this study, TE was employed to extract features from 16 subbands of DWPT coefficients of MR brain images.There were two reasons.First, TE had been successfully applied in brain images [30][31][32].Second, the combination of TE and wavelet transform had proven to perform better than either TE or wavelet transform [33][34][35].Third, brain images possess long-range interaction and fractal-type structure, because of the self-similarity observed brain structures imaged with a finite resolution.In a word, there are similarities at different spatial scales in brain images, which are more suitable for TE rather than SE.

Feature Extraction
The features were obtained as the entropies of the DWPT coefficients (see Algorithm 1).We used two kinds of entropies, Shannon entropy and Tsallis entropy, as described above.We compared them in the experiments to find which one was better.Next, we established two classifiers in this study: GEPSVM and kernel GEPSVM.

Algorithm 1: Feature Extraction
Step 1 Import MR image.
Step 3 Calculate the entropy of each subband.

Generalized Eigenvalue Proximal SVM
We consider the task of abnormal brain detection as a binary classification problem.SVM is employed due to its excellent performance reported in the latest literature.In the original SVM, two parallel planes are generated such that each plane is closest to one of two datasets and the two planes are as far apart as possible.Mangasarian and Wild [36] proposed GEPSVM that dropped the parallelism condition on the two hyperplanes, and required each plane be as close as possible to one of the data sets and as far as possible from the other data set.Reports have shown that GEPSVM achieved better classification performance than standard SVM.
Suppose sample data belonging to class 1 and 2 are represented by matrices X1 and X2, respectively.GEPSVM aims to determine two nonparallel planes: 0 and 0 where the first plane is closest to the points of class 1 and furthest from points in class 2, and the second plane closest to points in class 2 and furthest from points in class 1.To obtain the first plane, we minimize the sum of squares of Euclidean distance between each of the points of class 1 to the plane divided by the squares of Euclidean distance between each of points of class 2 to the plane, which leads to the following optimization problem: Simplifying Equation ( 12) gives: The Tikhonov regularization term is introduced to reduce the norm of the problem variables (w, b) that determine the first plane in Equation ( 11): where δ is a nonnegative Tikhonov factor.Equation ( 15) becomes the "Rayleigh quotient" of the form: where G and H are symmetric matrices in ( 1) ( 1) Using the boundedness and stationarity properties of Rayleigh Quotient, solution of ( 16) is obtained by solving the generalized eigenvalue problem: where the global minimum of Equation ( 16) is achieved at an eigenvector z1 corresponding to the smallest eigenvalue λmin of Equation (19).Hence, w1 and b1 can be obtained through Equation (13), and used to determine the plane in Equation (11).Next, we create another optimization problem analogous to Equation ( 14) by interchanging the roles of X1 and X2.The eigenvector z2* corresponding to the smallest eigenvalue of the second generalized eigenvalue problem will yield the second plane which is close to points of class 2.

Kernel Technique
Traditional SMVs constructed a hyperplane to classify data, so they cannot deal with classification problems in which the different types of data are located at different sides of a hypersurface; hence the kernel strategy is applied to GEPSVM in this study.Readers can refer to Section 3 in [36] for a detailed description.From another point of view, the kernel technique allows one to fit the hyperplane in a transformed feature space.The transformation may be nonlinear and the transformed space higher dimensional; thus though the classifier is a hyperplane in the higher-dimensional feature space, it may be nonlinear in the original input space.RBF was chosen in this study due to its excellent performance reported in many literatures.

Implementation of the Proposed Method
The aim of this study was to develop an automatic MR brain image classification system with high classification accuracy.The proposed system consists of three different stages (Figure 4): DWPT decomposition, entropy calculation, and classification.The implementation of the proposed system is two-fold (Algorithm 2): offline learning with the aim of training the classifier, and online prediction with the aim of predict normal/abnormal labels for subjects.Particularly, we varied the value of q from 0.1 to 1 with increment of 0.1, and record the corresponding classification accuracy to select the optimal value of q.Classifier Training: The set of features along with the corresponding labels, were used to train the classifier.10 repetition of K-fold stratified CV was employed for get the out-of-sample evaluation.
Evaluation: Report the classification performance.Step 4.
Parameter Selection: Above three steps iterated with q varies from 0.1 to 1 with increment of 0.1.Select the optimal q that corresponds to the highest classification accuracy.
Phase II: Online prediction (users are doctors and radiologists) Step 1.
Feature Extraction: Users presented to the system the query image to be classified.Its feature was obtained as in the offline learning phase.
Predict: Input the features of the query image to previously trained classifier.The classifier labeled the input query image as normal or abnormal.

Evaluation
The experiments were carried out on an IBM platform with a 3 GHz core i3 processor and 8 GB RAM, running under the Windows 7 operating system.The algorithm was in-house developed via Matlab 2014a (The Mathworks, Natick, MA, USA).
We designed four experimental tasks in this study.First, we made a visual comparison between DWT and DWPT.A 2-level Haar wavelet was used.Second, we tested the four proposed diagnosis methods:

(i) DWPT + SE + GEPSVM; (ii) DWPT + TE + GEPSVM; (iii) DWPT + SE + GEPSVM + RBF; (iv) DWPT + TE + GEPSVM + RBF;
We compared them with state-of-the-art methods.Third, we analyzed the computation time for every step of offline learning and online prediction.Finally, we show how to set the optimal parameter q.

Setting of Parameter q
A problem arose with how to determine the optimal value of q? Obviously q should be less than 1 since the brain is a subextensive system that consists of complex areas and tissues.Then, we varied the value of q from 0.1 to 1 with increment of 0.1, and recorded the average classification accuracy over 10 repetitions of Dataset-255 by two proposed methods "DWPT + TE + GEPSVM" and "DWPT + TE + GEPSVM + RBF".The results are shown in Figure 6 and Table 4.

Computational Burden Analysis
Computation time is another important factor to evaluate a classifier.Table 5 records the time consumed of each step using Dataset-255.For the offline learning phase, the three procedures, i.e., DWPT, Entropy calculation, and classifier training, took 4.0565 s, 3.4961 s, and 0.8904 s, respectively.For the online learning phase, the three procedures, i.e., DWPT, Entropy calculation, and brain classification, took 0.0817 s, 0.0213 s, and 0.0029 s, respectively.Value of q Average Accuracy (%) DWPT+TE+GEPSVM+RBF DWPT+TE+GEPSVM
Another finding was that TE was superior to traditional SE, after comparing the result of "DWPT + TE + GEPSVM" with "DWPT + SE + GEPSVM", and comparing the result of "DWPT + TE + GEPSVM + RBF" with "DWPT + SE + GEPSVM + RBF".The reason is because Tsallis entropy is a generalization of traditional SE [37].Brain MR images showed the presence of correlation between pixels of the same tissue, so they were especially suitable to be represented by TE.
Finally, "DWPT + TE + GEPSVM + RBF" performed the best among the four proposed diagnosis methods, obtaining classification accuracies of 100.00%, 100.00%, and 99.53% for the three datasets.The also proved the effectiveness of GEPSVM, which dropped the parallelism condition, leading to a more flexible hyperplane construction.The next is RT + PCA + LS-SVM [13] that achieved 100.00%, 100.00%, and 99.39% for the three datasets.Table 4 showed the proposed DWPT + TE + GEPSVM + RBF outperformed not only three other proposed diagnosis methods, but also the state-of-the-art algorithms of MR brain classification.
The results in Table 3 showed that the DWPT + TE + GEPSVM + RBF performed perfectly on either Dataset-66 or Dataset-160.For Dataset-255, its sensitivity was 100%, specificity was 97.14%, precision was 99.55%, and accuracy was 99.61%.This corresponded to the situation where one normal brain image was mislabeled as abnormal while all abnormal brains were recognized correctly.Note that Table 4 were obtained based on ten repetitions, hence, the accuracies of Tables 2 and 3 were not coherent.
The curves in Figure 6 and the data in Table 4 showed the value of q produced slight but discernible effects on the average classification accuracy.As q increased from 0.1 to 0.8, the average classification accuracy increased gradually until the point where q = 0.8, which produced the highest classification accuracy.Then, the curves decreased sharply when q increased from 0.8 to 0.1.The Tsallis entropy degraded to Shannon entropy when q equals to 1.It was obvious that introducing Tsallis entropy did improve the classification performance.
This result (q = 0.8) was identical to the conclusions of Sturzbecher et al. [38] and Cabella et al. [39].Diniz et al. [40] on the other hand found that q was best as 0.2 for CSF, 0.1 for white matter and 1.5 for gray matter.For this study, we had to assign a single value of q on the whole image, hence, our result of 0.8 can be regarded as an average of best q values of CSF, white matter and gray matter.
From Table 5, we can calculate that the offline learning cost totally cost as 4.0565 + 3.4961 + 0.8904 = 8.4430 s, whereas the computer required only 0.0817 + 0.0213 + 0.0029 = 0.1059 s for online prediction.The offline learning takes much more time than online prediction, because offline learning needs to operate with 255 images whereas online prediction just focuses on one query image.In addition, the classifier training in the offline learning phase had already obtained the weights/biases of the classifier, so the brain classification in the online prediction phase directly used a trained classifier to obtain the output.
DWPT takes more time than DWT on average, because it needs to decompose not only the approximation coefficients, but also the detailed coefficients.In spite of this, the computation time in practical use (online prediction) was 0.1059 s, which was fast and met real-time requirements.

Discussion on the Proposed Method
Revisiting the whole methodology of the proposed CAD system, there were three reasons why we used DWPT, Tsallis entropy, and GEPSVM.First, DWPT can obtain more high-frequency information and more detailed frequency domain information than DWT, which are essentially important in training the classifier.Second, entropy can efficiently represent the complexity and uncertainty of signals.Finally, GEPSVM dropped the parallelism condition required by standard SVM, leading to a more flexible hyperplane construction.
We chose the 2-level Haar wavelet by experience; however, there are other outstanding wavelets such as db and bior series, and other decomposition levels may give better performance.In the future, we expect to develop an algorithm that can automatically determine the optimal wavelet and the optimal decomposition level.
The GEPSVM is an excellent classifier that relaxes the universal requirement that bounding or proximal planes should be parallel in the input space for linear classifier or in the higher dimensional feature space for nonlinear kernel classifier.This study showed its success in the application of MR brain image classification.The nonparallel planes are easily obtained using a single Matlab command that solved the classical generalized eigenvalue problem.The simple program formulation, computational efficiency, and high classification accuracy proved it was an extremely effective algorithm for classification.TE proved to be better than SE in this study.The setting of value of q still remained a problem, since we performed a rather coarse search by varying its value from the range of 0.1 to 1 with increment of 0.1.The problem belonged to hyperparameter optimization, which needs to choose a set of parameters for a learning algorithm with the goal of optimizing the algorithm's performance.Some other hyperparameter optimization techniques will be used like random search, Bayesian optimization, etc.
A limitation was that the classifier was machine-oriented not human-oriented.Although machine-oriented classifiers yielded better classification performance than human-oriented classifiers, technicians cannot understand or interpret what the weights/biases of the classifier mean, which limited its usage in reality.Another limitation was the computation time of DWPT, which took the most time during either the offline learning stage or online prediction stage.In the future we will try to use fast algorithms to implement DWPT.

Conclusions and Future Research
In this study, we treated the abnormal brain detection as a binary classification problem.To solve it, we proposed to use DWPT to replace the traditional DWT method, and we proposed two entropy methods (SE and TE), and two classifiers (GEPSVM with and without RBF kernel), with the aim of developing an automatic classifier for MR brain images.The experiments showed the proposed "DWPT + TE + GEPSVM + RBF" diagnosis method yielded superior performance to not only the

Figure 4 .
Figure 4. Flowchart of the proposed system.

Algorithm 2 :
Pseudocodes of the proposed system.Phase I: Offline learning (users are scientists)Step 1.Feature Extraction: Users decompose images by DWPT, and extract Tsallis entropies from all subbands.Step 2.

First, we
carried out both DWT and DWPT on a normal and AD MR brain image, respectively.A 2-level Haar wavelet was utilized.

Figure 6 .
Figure6.Effect of q value on classification (q = 1 corresponds to SE).

Table 1 .
Statistical characteristics and CV setting of the three datasets.

Table 5 .
Computation time analysis based on Dataset-255.