Dual-Tree Complex Wavelet Transform and Twin Support Vector Machine for Pathological Brain Detection

(Aim) Classification of brain images as pathological or healthy is a key pre-clinical step for potential patients. Manual classification is irreproducible and unreliable. In this study, we aim to develop an automatic system for classifying brain images in magnetic resonance imaging (MRI).


Introduction
Stroke, brain tumors, neurodegenerative diseases, and inflammatory/infectious diseases are the four main types of brain diseases. Stroke is also called vascular disease of the cerebral circulation. Brain tumors occur when abnormal cells form inside the brain. Neurodegenerative diseases occur when neurons progressively lose structure or function. Inflammatory/infectious diseases involve inflammation or infection in or around the brain tissues. All of these diseases cause serious problems for both patients and society. Hence, it is important to build early diagnosis systems, with the aim of providing more opportunities for better treatment. This type of task is commonly named "pathological brain detection (PBD)".
Magnetic resonance imaging (MRI) offers the best diagnostic information for the brain; however, making a diagnosis usually requires manual human interpretation. Existing manual methods are costly, tedious, lengthy, and irreproducible, because of the huge volume of MRI data. These shortcomings motivate the development of automatic tools such as computer-aided diagnosis (CAD) systems [1,2]. Due to the better performance provided by magnetic resonance (MR) images, many CAD systems are based on MR images [3].
Existing brain CAD systems can be divided into two types according to the data dimension. One type works on three-dimensional (3D) images, but it needs a scan of the whole brain. The other type is based on a single slice that contains the disease-related areas, which is cheap and commonly used in Chinese hospitals. El-Dahshan et al. [4] employed a 3-level discrete wavelet transform (DWT), followed by principal component analysis (PCA) to reduce features; finally, they used the K-nearest neighbor (KNN) classifier. Patnaik et al. [5] utilized DWT to extract the approximation coefficients, and then employed a support vector machine (SVM) for classification. Dong et al. [6] further suggested training a feedforward neural network (FNN) with a novel scaled conjugate gradient (SCG) approach; their classification method achieved good results on MRI. Wu [7] proposed to use the kernel SVM (KSVM), and suggested three new kernels: Gaussian radial basis, homogeneous polynomial, and inhomogeneous polynomial. Das et al. [8] proposed to combine the Ripplet transform (RT), PCA, and least-squares SVM (LS-SVM); their 5 × 5 cross-validation tests showed high classification accuracies. El-Dahshan et al. [9] used a feedback pulse-coupled neural network to preprocess the MR images, DWT and PCA for feature extraction and reduction, and a feedforward back-propagation neural network (FBPNN) to detect pathological brains among normal brains. Dong et al. [10] combined the discrete wavelet packet transform (DWPT) and Tsallis entropy (TE). In order to segment and classify malignant and benign brain tumor slices in Alzheimer's disease (AD), Wang et al. [11] employed the stationary wavelet transform (SWT) to replace the commonly used DWT; besides, they proposed a hybridization of particle swarm optimization and the artificial bee colony algorithm (HPA) to obtain the optimal weights and biases of an FNN. Nazir et al. [12] implemented denoising first; their method achieved an overall accuracy of 91.8%. Sun et al. [13] combined wavelet entropy (WE) and Hu moment invariants (HMI) as features.
After analyzing the above methods, we found that all of these studies treated the PBD problem as a classification problem and aimed at improving the classification accuracy. As is well known, a standard classification task is composed of two steps: feature extraction and classifier training.
The former step, feature extraction, in PBD usually employs the discrete wavelet transform (DWT) [5], because DWT can provide multiresolution analysis at any desired scale for a particular brain image. Besides, the abundant texture features in various brain regions are coherent with wavelet analysis [14,15]. However, DWT is shift-variant, i.e., a slight shift of the image degrades the performance of DWT-based classification. Our study chose a variant of DWT, viz., the dual-tree complex wavelet transform (DTCWT). Scholars have proven that DTCWT offers more directional selectivity than the canonical DWT, with merely 2^n redundancy for n-dimensional data [16][17][18].
Although the stationary wavelet transform (SWT) can also deal with the shift-variance problem, it leads to more redundancy than DTCWT. Next, we need to extract features from the DTCWT results. In this paper, we propose to use the variance and entropy (VE) [19] of all the subbands of DTCWT.
Although energy is also a common feature extracted from wavelet subbands, scholars have shown that it is not as efficient as entropy in MR image classification [20,21]. Besides, the variance, of the form E[(x − μ)²] (E, expectation; x, random variable; μ, the mean), indicates how closely the data points cluster around the mean value, while the energy, of the form E[x²], does not consider the mean value. Hence, it is self-explanatory that variance will perform better than energy even if the expected mean value has a slight shift.
The latter step, classifier training, usually employs support vector machines (SVMs) or artificial neural networks. Compared with conventional classification methods, SVMs have significant advantages: elegant mathematical tractability [22], direct geometric interpretation [23], high accuracy [24], etc. Hence, we continued to choose SVM. Besides, two variants of SVM were introduced in this study, the generalized eigenvalue proximal SVM (GEPSVM) and the twin SVM (TSVM), with the aim of further augmenting the classification performance.
The contribution of our study is three-fold: we applied DTCWT to pathological brain detection; we applied both variance and entropy to extract features; and we applied TSVM for classification. The rest of the paper is organized as follows: Section 2 describes the materials used in this study. Section 3 presents the dual-tree complex wavelet transform, and offers the mathematical fundamentals of SVM and its two variants. Section 4 designs the experiments and gives the evaluation measures. Section 5 contains the experimental results and offers discussion. Section 6 is devoted to conclusions. The abbreviations used in this work are listed at the end of this paper.

Materials
At present, there are three benchmark datasets, Dataset66, Dataset160, and Dataset255, containing 66, 160, and 255 images, respectively. All datasets contain T2-weighted MR brain images obtained along the axial plane with a size of 256 × 256. We downloaded all the slices of subjects from the website of the Medical School of Harvard University (Boston, MA, USA) [25]. Then, we selected five slices from each subject. The selection criterion is that for healthy subjects the slices were selected at random, while for pathological subjects the slices should contain the lesions, as confirmed by three radiologists with ten years of experience.
The former two datasets (Dataset66 and Dataset160) consist of 7 types of diseases (meningioma, AD, AD plus visual agnosia, sarcoma, Pick's disease, Huntington's disease, and glioma) along with normal brain images. The last dataset, Dataset255, contains all 7 types of diseases mentioned before, plus 4 new diseases (multiple sclerosis, chronic subdural hematoma, herpes encephalitis, and cerebral toxoplasmosis).
Figure 1 shows samples of the brain MR images. Our method is intended for hospital use rather than research. In Chinese hospitals, one usually scans the single slice that is closest to the potential focus, rather than the whole brain; hence, one slice was obtained from each subject. Each slice in Figure 1 is selected from regions related to the foci of the diseases (in total 26 axial slices).
Note that we treated all of the different diseased brains as pathological, so our task is a two-class classification problem, that is, to detect pathological brains among healthy brains. The whole image is treated as the input. We did not choose local characteristics such as points and edges; instead, we extract global image characteristics that are then learned by the CAD system. Note that our method differs from the way neuroradiologists work: they usually select local features and compare them with a standard template to check whether foci exist, such as shrinkage, expansion, bleeding, inflammation, etc. Our method is instead like AlphaGo [26]: computer scientists give the machine enough data, and the machine then learns how to classify automatically.
Including subjects' information (age, gender, handedness, memory tests, etc.) could add more information and thus might help improve the classification performance. Nevertheless, the CAD system in our study is based only on the imaging data; besides, the imaging data from the website does not include the subjects' information.
The cost of predicting a pathological case as healthy is severe, because the patient may be told that he/she is healthy and thus ignore the mild symptoms displayed; treatment may then be deferred. Nevertheless, the cost of misclassifying a healthy case as pathological is low, since the correct remedy can be given by other diagnostic means.
This cost-sensitivity (CS) problem was solved by changing the class distribution at the beginning, since the original data was accessible. That is, we intentionally picked more pathological brains than healthy ones for the dataset, with the aim of biasing the classifier toward pathological brains [27]. The overfitting problem was monitored by the cross-validation technique.

Methodology
The proposed method consists of three key steps: wavelet analysis by the dual-tree complex wavelet transform (DTCWT), feature extraction by "Variance & Entropy (VE)", and classification by three independent classifiers (SVM, GEPSVM, and TSVM). Figure 2 illustrates our modular framework. The output of DTCWT is a set of wavelet subband coefficients, which are then submitted to the VE block.

Discrete Wavelet Transform
The discrete wavelet transform (DWT) is an image processing method [28] that provides a multi-scale representation of a given signal or image [29]. The standard DWT is vulnerable to the shift-variance problem, and only has horizontal and vertical directional selectivity [30]. Suppose s represents a particular signal, n represents the sampling point, h and g represent a high-pass filter and a low-pass filter, respectively, and H and L represent the coefficients of the high-pass and low-pass subbands. We have

L(n) = Σ_k s(k) g(2n − k)    (1)
H(n) = Σ_k s(k) h(2n − k)    (2)

Figure 3 shows the directional selectivity of DWT. LH denotes a low-pass filter along the x-axis and a high-pass filter along the y-axis. HL denotes a high-pass filter along the x-axis followed by a low-pass filter along the y-axis. LL denotes low-pass filters along both directions, and HH denotes high-pass filters along both directions.
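As a concrete illustration, the filter-and-downsample operation above can be sketched in a few lines of Python. The Haar filter pair is our own choice for illustration (the paper does not state which wavelet family it uses), and the code uses the correlation convention, which is equivalent to the convolution above up to a time-reversal of the filters.

```python
import math

def dwt_1d_level(s, g, h):
    """One analysis level of the 1-D DWT: filter, then downsample by 2.
    Uses the correlation form L(n) = sum_k g(k) s(2n + k), i.e. the
    convolution in the text with the filters time-reversed."""
    half = len(s) // 2
    L = [sum(g[k] * s[2 * n + k] for k in range(len(g))) for n in range(half)]
    H = [sum(h[k] * s[2 * n + k] for k in range(len(h))) for n in range(half)]
    return L, H

# Haar filters: an orthonormal low-pass / high-pass pair
r = 1.0 / math.sqrt(2.0)
g = [r, r]     # low-pass: local average
h = [r, -r]    # high-pass: local difference

approx, detail = dwt_1d_level([1.0, 2.0, 3.0, 4.0], g, h)
```

The approximation coefficients capture local averages (the LL-type content), while the detail coefficients capture local differences, i.e., edges.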

Here, the HL and LH subbands are well defined for the vertical and horizontal orientations, respectively. The HH subband, however, mixes the −45° and +45° directions together, which stems from the use of real-valued filters in the DWT. This mixing also impedes direction checking [31].

Dual-Tree Complex Wavelet Transform
To improve the directional selectivity impaired in the DWT, we used a dual-tree DWT, implemented by two separate two-channel filter banks. Note that the scaling and wavelet filters in the dual tree cannot be selected arbitrarily [32]. In one tree, the wavelet and scaling filters should produce a wavelet and scaling function that are approximate Hilbert transforms of those generated by the other tree [33]. In this way, the complex-valued wavelet and scaling function generated from both trees are approximately analytic; the resulting transform is called the dual-tree complex wavelet transform (DTCWT).

The DTCWT obtains directional selectivity by using approximately analytic wavelets, i.e., wavelets that have support on only one half of the frequency domain [34]. At each scale, a 2D DTCWT produces six directionally selective subbands (±15°, ±45°, ±75°) for both the real (R) and imaginary (I) parts [35]. Figure 4 shows the directional selectivity of the DTCWT. The first row depicts the six directional wavelets of the real oriented DTCWT, and the second row shows the imaginary counterparts. The R and I parts are oriented in the same direction, and together they form the DTCWT as

M = √(R² + I²)    (3)

where M represents the magnitude of the DTCWT coefficients.
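The elementwise magnitude combination of the real and imaginary subbands can be sketched as follows (plain Python over 2-D lists; the toy coefficient arrays are illustrative only):

```python
import math

def dtcwt_magnitude(real_part, imag_part):
    """Elementwise magnitude M = sqrt(R^2 + I^2) of one DTCWT subband,
    given the real and imaginary coefficient arrays as 2-D lists."""
    return [[math.hypot(r, i) for r, i in zip(row_r, row_i)]
            for row_r, row_i in zip(real_part, imag_part)]

# toy real and imaginary coefficients of one subband
R = [[3.0, 0.0], [1.0, 2.0]]
I = [[4.0, 1.0], [0.0, 2.0]]
M = dtcwt_magnitude(R, I)   # M[0][0] is 5.0
```

It is this magnitude, rather than the raw coefficients, that is approximately shift-invariant, which is why the later features are computed on M.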

Comparison between DWT and DTCWT
To compare the directional selectivity of DWT and DTCWT, we performed a simulation experiment. Two simulation images (a pentagon and a heptagon) were generated. We decomposed both images to 4 levels by DWT and DTCWT, respectively. Then, we reconstructed them from the 4th-level detail subbands to obtain an approximation of the original images. The results are shown in Figure 5.

The first column in Figure 5 shows the simulation images, the second column shows the DWT reconstruction results, and the last column shows the DTCWT reconstruction results. Both DWT and DTCWT can extract edges from detail subbands, which are abundant in brain tissues.

These edges are discriminative features that differ between pathological and healthy brains, because all focus-related areas shrink, expand, bleed, or become inflamed, and such structural alterations are associated with edges. We find from the last column in Figure 5 that the edges detected by DTCWT have clear contours, so DTCWT can detect nearly all directions clearly. Nevertheless, the edges detected by DWT (see Figure 5) are discontinuous, stemming from the fact that DWT can only detect horizontal and vertical edges. The results fall in line with Figure 3.

Variance and Entropy (VE)
Based on the coefficients of DTCWT, we extract variance and entropy (VE) features for each decomposition level s and each direction d. Suppose (x, y) is the spatial coordinate within the corresponding subband, and (L, W) are the length and width of that subband. We define the variance V(s, d) as

V(s,d) = (1 / (L·W)) Σ_x Σ_y ( M^(s,d)(x, y) − μ^(s,d) )²    (4)

where μ denotes the mathematical expectation (mean) of M. The variance V measures the spread of the grey-level distribution of the subband: the larger the value of V, the more widely the grey levels of the image vary. V also reflects the contrast of the texture.
Another indicator is the entropy E, which measures the randomness of the grey-level distribution [36]: the larger the value of E, the more randomly the grey-level distribution spreads [37]. The entropy E is defined as

E(s,d) = − Σ_x Σ_y P( M^(s,d)(x, y) ) log P( M^(s,d)(x, y) )    (5)

where P denotes the probability function. Variance and entropy together are sufficient to produce good performance. All directional subbands of these two kinds of features are combined to form the feature sets V_s and E_s:

V_s = [V(s,−15), V(s,15), V(s,−45), V(s,45), V(s,−75), V(s,75)]    (6)
E_s = [E(s,−15), E(s,15), E(s,−45), E(s,45), E(s,−75), E(s,75)]    (7)

Hence, we extract 12 features for each scale: 6 for V_s and 6 for E_s. For an s-level decomposition, we obtain 12s features in total:

VE = [V_1, E_1, V_2, E_2, ..., V_s, E_s]    (8)

The 12s VE features were then submitted to classifiers. SVM is probably among the best classification approaches for small-sample (fewer than one thousand samples) problems [7]. To further enhance the classification performance, two variants of SVM were introduced:
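A minimal sketch of the two per-subband features defined above. Note that the histogram-based probability estimate in the entropy (including the bin count) is our assumption; the paper does not specify how P is estimated.

```python
import math
from collections import Counter

def variance(subband):
    """V = E[(x - mu)^2] over all coefficients of one subband (2-D list)."""
    vals = [v for row in subband for v in row]
    mu = sum(vals) / len(vals)
    return sum((v - mu) ** 2 for v in vals) / len(vals)

def entropy(subband, bins=8):
    """Shannon entropy of the coefficient distribution, estimated from a
    simple equal-width histogram (the binning scheme is our assumption)."""
    vals = [v for row in subband for v in row]
    lo, hi = min(vals), max(vals)
    if hi == lo:
        return 0.0                       # constant subband: no randomness
    width = (hi - lo) / bins
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in vals)
    n = len(vals)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

flat = [[5.0, 5.0], [5.0, 5.0]]
# a constant subband has zero spread and zero randomness
assert variance(flat) == 0.0 and entropy(flat) == 0.0
```

For an s-level DTCWT, calling these two functions on each of the 6 directional magnitude subbands per level yields the 12s-dimensional VE vector.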

Generalized Eigenvalue Proximal SVM
Mangasarian and Wild [38] proposed the generalized eigenvalue proximal SVM (GEPSVM). It drops the parallelism condition on the two hyperplanes (recall that parallelism is required in the original SVM). Recent literature shows that GEPSVM can yield classification performance superior to canonical support vector machines [39,40]. Suppose the samples come from either class 1 (denoted by X_1) or class 2 (denoted by X_2). GEPSVM finds two optimal nonparallel planes of the form (w and b denote the weight and bias of the classifier, respectively)

x^T w_1 + b_1 = 0,  x^T w_2 + b_2 = 0    (9)

To obtain the first plane, we deduce from Equation (9) the following problem, where o is a vector of ones of appropriate dimension:

min_{(w,b)≠0}  ‖X_1 w + o b‖² / ‖X_2 w + o b‖²    (10)

Simplifying formula (10), let

z = [w^T, b]^T    (11)

which gives

min_{z≠0}  ‖[X_1 o] z‖² / ‖[X_2 o] z‖²    (12)

We include Tikhonov regularization to decrease the norm of z, which corresponds to the first hyperplane. The new formulation including the Tikhonov regularization term is

min_{z≠0}  ( ‖X_1 w + o b‖² + t‖z‖² ) / ‖X_2 w + o b‖²    (13)

where t is a positive (or zero) Tikhonov factor. Formula (13) turns into the "Rayleigh quotient (RQ)"

min_{z≠0}  (z^T P z) / (z^T Q z)    (14)

where P and Q are symmetric matrices in R^((p+1)×(p+1)) of the forms

P = [X_1 o]^T [X_1 o] + t I    (15)
Q = [X_2 o]^T [X_2 o]    (16)

The solution of (14) is deduced by solving the generalized eigenvalue problem

P z = λ Q z    (17)

after using the stationarity and boundedness characteristics of the RQ.
Here, the optimal minimum of (14) is attained at the eigenvector z_1 corresponding to the smallest eigenvalue λ_min of formula (17). Therefore, w_1 and b_1 can be recovered from z_1 through formula (11), and used to determine the first plane in formula (9). Afterwards, a similar optimization problem, analogous to (12) with the symbols X_1 and X_2 exchanged, is solved to obtain z_2* in the same way.
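A small numerical sketch of this solution, assuming NumPy is available. The toy data, the Tikhonov factor, and the extra regularizer added to Q (to keep it invertible when reducing the generalized eigenproblem to an ordinary one) are our assumptions, not part of the paper.

```python
import numpy as np

def gepsvm_plane(X1, X2, t=1e-3):
    """Fit one GEPSVM plane x.w + b = 0 that lies close to the class-1
    points and far from the class-2 points, by minimising the Rayleigh
    quotient z'Pz / z'Qz with z = [w; b]."""
    A = np.hstack([X1, np.ones((X1.shape[0], 1))])    # [X1 o]
    B = np.hstack([X2, np.ones((X2.shape[0], 1))])    # [X2 o]
    P = A.T @ A + t * np.eye(A.shape[1])              # Tikhonov term t*I
    # Q = B'B; the extra t*I here is our addition, so that Q stays
    # invertible on this toy problem.
    Q = B.T @ B + t * np.eye(B.shape[1])
    # Generalized eigenproblem P z = lambda Q z, reduced to inv(Q) P.
    vals, vecs = np.linalg.eig(np.linalg.inv(Q) @ P)
    z = np.real(vecs[:, np.argmin(np.real(vals))])    # smallest eigenvalue
    return z[:-1], z[-1]                              # w, b

# Toy data: class 1 hugs the line y = 0, class 2 hugs the line y = 2.
X1 = np.array([[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
X2 = np.array([[-1.0, 2.0], [0.0, 2.0], [1.0, 2.0]])
w1, b1 = gepsvm_plane(X1, X2)   # plane for class 1, roughly y = 0
```

Exchanging the roles of X1 and X2 in the call yields the second plane, exactly as the text describes for z_2*.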

Twin Support Vector Machine
In 2007, Jayadeva et al. [41] presented a novel classifier, the twin support vector machine (TSVM). TSVM is similar to GEPSVM in that both obtain nonparallel hyperplanes; the difference lies in that GEPSVM and TSVM are formulated entirely differently. Each of the two quadratic programming (QP) problems in the TSVM pair is formulated like a typical SVM. Reports have shown that TSVM outperforms both SVM and GEPSVM [42][43][44]. Mathematically, TSVM is constructed by solving the two QP problems

min_{w_1, b_1, q}  (1/2) ‖X_1 w_1 + o_1 b_1‖² + c_1 o_2^T q
s.t.  −(X_2 w_1 + o_2 b_1) + q ≥ o_2,  q ≥ 0    (18)

min_{w_2, b_2, q}  (1/2) ‖X_2 w_2 + o_2 b_2‖² + c_2 o_1^T q
s.t.  (X_1 w_2 + o_1 b_2) + q ≥ o_1,  q ≥ 0    (19)

where q is a nonnegative slack variable, c_i (i = 1, 2) are positive parameters, and o_i (i = 1, 2) are vectors of ones as in formula (10). By this means, TSVM constructs two hyperplanes [45]. The first term in each of (18) and (19) is a sum of squared distances, and the second is the sum of error variables; therefore, minimizing (18) and (19) forces each hyperplane to stay close to the data of its own class while minimizing the misclassification rate [46]. Finally, the constraints require each hyperplane to be at a distance of at least one from the points of the other class. Another advantage of TSVM is that it converges about four times faster than the conventional SVM [47].
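For prediction, both GEPSVM and TSVM assign a query sample to the class whose hyperplane lies nearer, measured by perpendicular distance. A minimal pure-Python sketch of this decision rule, with hand-picked (not learned) planes:

```python
import math

def nearest_plane_class(x, planes):
    """Assign x to the class whose hyperplane x.w + b = 0 is nearest,
    using the perpendicular distance |x.w + b| / ||w||.
    `planes` is a list of (w, b) pairs; classes are numbered from 1."""
    def dist(w, b):
        num = abs(sum(xi * wi for xi, wi in zip(x, w)) + b)
        return num / math.sqrt(sum(wi * wi for wi in w))
    return min(range(len(planes)), key=lambda i: dist(*planes[i])) + 1

# Illustrative planes: class 1 hugs the x-axis, class 2 hugs the y-axis.
planes = [((0.0, 1.0), 0.0),   # plane y = 0
          ((1.0, 0.0), 0.0)]   # plane x = 0
label = nearest_plane_class((5.0, 0.2), planes)   # -> 1 (closer to y = 0)
```

In our system the (w_i, b_i) pairs would come from solving (18) and (19) (or (17) for GEPSVM); here they are purely illustrative.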

Pseudocode of the Whole System
The implementation covers two phases, offline learning and online prediction, whose goals are training the classifier and predicting new instances, respectively. Table 1 offers the pseudocode of the proposed method.
Table 1. Pseudocode of our system.

Phase I: Offline learning
Step A Wavelet Analysis: Perform s-level dual-tree complex wavelet transform (DTCWT) on every image in the ground-truth dataset
Step B Feature Extraction: Compute the variance and entropy (VE) of each directional subband, yielding 12s features per image
Step C Classifier Training: Train the classifier (SVM, GEPSVM, or TSVM) on the VE features under stratified cross validation

Phase II: Online prediction
Step A Feature Extraction: Perform s-level DTCWT on the query image and compute its 12s VE features
Step B Prediction: Feed the VE features to the trained classifier and output "pathological" or "healthy"

Statistical Setting
In order to carry out a strict statistical analysis, stratified cross validation (SCV) was used, since it is a model validation technique suitable for small-size data [48]. A 6-fold SCV was employed for Dataset66, and a 5-fold SCV for Dataset160 and Dataset255. The SCV settings are listed in Table 2. The 10-fold setting was not used for two reasons: one is that past literature used the same settings as Table 2; the other is stratification, viz., we expect each fold to cover the same class proportions, and dividing these datasets into 10 folds would breach the stratification. Figure 6 illustrates an example of K-fold SCV, by which the dataset is partitioned into K folds with the same class distributions. (K − 1) folds are used for training, and the remaining fold for testing, i.e., query images come from the remaining fold. The evaluation is based on the test images. This process repeats K times, so that each fold is used for testing once. The final accuracy of K-fold SCV is obtained by averaging the K results. The K-fold SCV is repeated 10 times to further remove randomness (see Section 5.4).
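The stratified fold assignment can be sketched by grouping sample indices per class and dealing them round-robin into K folds, which keeps the class proportions equal across folds (a common construction; the paper does not detail its exact assignment):

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Return a list of k folds (lists of sample indices) with
    approximately equal class distribution in every fold."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):   # deal each class round-robin
            folds[pos % k].append(idx)
    return folds

# 8 pathological (1) and 4 healthy (0) samples split into 4 folds:
labels = [1] * 8 + [0] * 4
folds = stratified_folds(labels, 4)
# every fold gets 2 pathological and 1 healthy sample
```

Training then uses the union of (K − 1) folds and tests on the held-out fold, rotating K times, exactly as described above.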

Parameter Estimation for s
It remains an issue to find the optimal value of the decomposition level s. From the view of information provided, a smaller s offers less information to the classifier and a larger s offers more. From the view of avoiding overfitting, a smaller s may prevent overfitting to a greater degree than a larger s. This study used the grid-search method [49] to find the optimal value of s, i.e., we vary s from 1 to 5 in increments of 1, and check the corresponding average accuracies. The value associated with the largest accuracy is the optimal s.
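The grid search reduces to a short loop. The `cross_validated_accuracy` callback and the toy accuracy table below are hypothetical stand-ins for the real cross-validation runs:

```python
def grid_search_s(cross_validated_accuracy, candidates=range(1, 6)):
    """Return the decomposition level s (from the grid 1..5) with the
    highest average cross-validated accuracy."""
    scores = {s: cross_validated_accuracy(s) for s in candidates}
    return max(scores, key=scores.get)

# hypothetical stand-in for the real evaluation: accuracy peaks at s = 3
toy_acc = {1: 0.92, 2: 0.95, 3: 0.99, 4: 0.97, 5: 0.96}
best_s = grid_search_s(toy_acc.get)   # -> 3
```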

Evaluation
The pathological brains are treated as positive (true) and the healthy brains as negative (false), following common convention. To evaluate the performance, we first calculated the overall confusion matrix of the 10 runs, and then computed TP (true positive), TN (true negative), FP (false positive), and FN (false negative). The classification accuracy (Acc) is defined as:

Acc = (TP + TN) / (TP + TN + FP + FN)
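Given the pooled predictions of the 10 runs, the confusion-matrix counts and the accuracy above can be computed as follows (the toy label vectors are illustrative only):

```python
def confusion_and_accuracy(y_true, y_pred):
    """Count TP/TN/FP/FN (pathological = 1 = positive) and return the
    accuracy (TP + TN) / (TP + TN + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return (tp, tn, fp, fn), acc

counts, acc = confusion_and_accuracy([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
# counts == (2, 1, 1, 1), acc == 0.6
```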

Results and Discussions
The algorithms were developed in-house based on 64-bit Matlab 2015a (The MathWorks, Natick, MA, USA). Figure 7 shows the graphical user interface (GUI). The simulation experiments were run on an IBM P4 platform with a 3.2 GHz processor and 8 GB of random access memory (RAM), under the Windows 7 operating system.


Classifier Comparison
In the second experiment, we compared three classifiers: SVM, GEPSVM, and TSVM. All three datasets were tested. Ten runs of K-fold SCV were carried out, with accuracy used for evaluation. The results are listed in Table 3. The SVM achieved 100.00%, 99.69%, and 98.43% accuracy for Dataset66, Dataset160, and Dataset255, respectively. The GEPSVM achieved accuracies of 100.00%, 99.75%, and 99.25% on the three datasets. The TSVM yielded accuracies of 100.00%, 100.00%, and 99.57%, respectively. The data in Table 3 indicate that GEPSVM is superior to standard SVM: for Dataset160, the accuracy of GEPSVM is higher than that of SVM by 0.06%, and for Dataset255 it is higher by 0.82%. Meanwhile, TSVM is superior to GEPSVM: for Dataset160, the accuracy of TSVM is 0.25% higher than that of GEPSVM, and for Dataset255 it is 0.32% higher.
The parallel-hyperplane constraint restrains standard SVM from generating complicated and flexible decision boundaries. GEPSVM and TSVM discard this constraint, so their performance is much better than that of SVM. TSVM is similar in spirit to GEPSVM, since both use non-parallel hyperplanes. The difference between them is that TSVM uses a simpler formulation than GEPSVM, and it can be solved by merely two quadratic programming (QP) problems. Our results align with the finding of Kumar and Gopal [50] that the "generalization performance of TSVM is better than GEPSVM and conventional SVM". In the following experiments, TSVM is the default classifier.
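The non-parallel-hyperplane idea can be illustrated by the decision rule that GEPSVM and TSVM share: each class owns its own hyperplane, and a sample is assigned to the class whose hyperplane lies nearest. The two planes below are hand-picked toy values for a 2-D illustration, not planes learned by solving the TSVM QP problems.

```python
import math

def nearest_plane_predict(x, plane_pos, plane_neg):
    """Assign x to class 1 or 0 by distance to each class's hyperplane (w, b)."""
    def dist(plane):
        w, b = plane
        dot = sum(wi * xi for wi, xi in zip(w, x))
        return abs(dot + b) / math.sqrt(sum(wi * wi for wi in w))
    return 1 if dist(plane_pos) <= dist(plane_neg) else 0

# Toy 2-D planes: positive class clusters near the line x2 = x1,
# negative class near x2 = -x1 + 4 (the planes are non-parallel)
pos_plane = ([1.0, -1.0], 0.0)   # x1 - x2 = 0
neg_plane = ([1.0, 1.0], -4.0)   # x1 + x2 - 4 = 0
print(nearest_plane_predict([1.0, 1.2], pos_plane, neg_plane))  # -> 1
print(nearest_plane_predict([3.0, 0.9], pos_plane, neg_plane))  # -> 0
```

A single-plane SVM could not fit two such "cross"-shaped classes well, which is the intuition behind the accuracy gains reported in Table 3.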

Optimal Decomposition Level Selection
The decomposition level s was varied over (1, 2, 3, 4, 5). We chose the TSVM classifier, and all datasets were tested. Ten runs of K-fold SCV were implemented with varying s. The curve of average accuracy versus decomposition level is shown in Figure 8. Remember that for s = 1, only 12 features are used; for s = 2, 24 features in total are used. The number of employed features is 12 times the value of s. Figure 8 shows that the accuracy tends to decrease as the decomposition level s increases. The reason is that more features will attenuate the classification performance [51]. Reducing the number of features can simplify the model, shorten training time, and improve generalization performance through a reduction of variance [52].
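As a rough sketch of where the 12 x s feature count comes from: each DTCWT level yields 6 directional subbands, and each subband contributes one variance and one entropy, giving 12 features per level. The histogram-based entropy below is a simple illustration and need not match the paper's exact entropy definition.

```python
import math

def subband_features(subband):
    """Variance and a crude 4-bin histogram entropy of one subband's values.
    With 6 directional subbands per level, s levels yield 12 * s features."""
    n = len(subband)
    mean = sum(subband) / n
    var = sum((v - mean) ** 2 for v in subband) / n
    lo, hi = min(subband), max(subband)
    width = (hi - lo) / 4 or 1.0          # guard against a constant subband
    counts = [0, 0, 0, 0]
    for v in subband:
        counts[min(3, int((v - lo) / width))] += 1
    ent = -sum(c / n * math.log2(c / n) for c in counts if c)
    return var, ent

print(subband_features([0.0, 1.0, 2.0, 3.0]))  # -> (1.25, 2.0)
```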

Comparison to State-of-the-Art Approaches
We have already compared SVM with its variants in the preceding section. In this section, we compare the proposed method with state-of-the-art approaches.
Table 4 shows the comparison between the best method, "DTCWT + VE + TSVM", and the state-of-the-art approaches. The first column lists the method name, the second the number of features employed, the third the total number of runs (all algorithms were run 10 times, except some older algorithms that were run five times, as reported in literature [8]), and the last three columns the average accuracy over the three datasets. Investigating the results in Table 4, it is clear that 11 of the 13 methods achieve perfect classification (100%) on Dataset66, which stems from its small size. On the larger Dataset160, only four methods yield perfect classification: RT + PCA + LS-SVM [8], DWPT + TE + GEPSVM [10], SWT + PCA + HPA-FNN [11], and the proposed "DTCWT + VE + TSVM". A common point among these four methods is that they all use advanced feature extraction (RT, DWPT, SWT, and DTCWT) and classification techniques (LS-SVM, GEPSVM, HPA-FNN, and TSVM). This suggests learning and applying the latest artificial-intelligence and machine-learning approaches to the field of MR brain classification. On the largest dataset (Dataset255), no algorithm achieves perfect classification, because it contains a relatively wide variety of diseases. Among all methods, the proposed "DTCWT + VE + TSVM" achieves the highest accuracy of 99.57%, which demonstrates its effectiveness and feasibility.
It is reasonable that all methods achieved high accuracy. Consider the similar problem of facial recognition systems (FRS): the latest FRS achieve nearly perfect performance and have been applied to banking customers [56], vehicle security [57], etc. Pathological brain detection is simpler than face recognition, because it does not need to identify each subject, only the status (pathological or healthy). Hence, it is expected that our method can achieve high classification accuracy.
The differences in accuracy in Table 4 are not large; however, they were obtained by strict statistical analysis, viz., 10 runs of K-fold stratified cross validation. Hence, this slight improvement is reliable and convincing. Even the largest dataset contains only 255 images, so we will try to create a larger dataset with more images and more types of diseases.
The proposed CAD system cannot give physical interpretations of particular brain regions. Nevertheless, comparing classifiers with human memory, we believe expert systems are analogous to declarative memory, while support vector machines are analogous to nondeclarative memory. Thus, it is difficult for SVMs (and their variants) to give physical interpretations. In the future, we may try to use expert systems that can mimic the reasoning process of doctors, although they may not give accuracies as high as SVMs.

Results of Different Runs
The numbers of correctly classified instances, together with their accuracies, are listed in Table 5 (F = fold, R = run). Each row lists the results of a different run, and each column lists the results of a different fold. The last row averages the results, and the last column summarizes the results over the folds.

Computation Time
The computation time of each step of our method was calculated and recorded; the training time was measured over Dataset255. The results for offline learning and online prediction are listed in Tables 6 and 7, respectively. Table 6 shows that, in the offline-learning procedure, DTCWT costs 8.41 s, VE costs 1.81 s, and TSVM training costs 0.29 s. These times are relatively large because there are 255 images in the dataset, and the training process must handle all of them. The total time is 10.51 s.
Online prediction deals with only one query image, so the computation time drops sharply. Table 7 shows that DTCWT costs 0.037 s, VE costs 0.009 s, and TSVM costs 0.003 s for one query image; the total is 0.049 s. Therefore, the proposed system is feasible in practice.
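The per-stage timings behind Tables 6 and 7 amount to straightforward wall-clock bookkeeping around each pipeline stage, which can be sketched as below. The three lambda stages are dummy placeholders standing in for DTCWT, VE, and TSVM, not real implementations.

```python
import time

def timed(fn, *args):
    """Run one pipeline stage and return (output, elapsed wall-clock seconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# Stand-in stages; in the real system these would be DTCWT, VE, and TSVM.
feats, t1 = timed(lambda img: [v * 0.5 for v in img], [1.0] * 1024)
entro, t2 = timed(lambda f: sum(f) / len(f), feats)
label, t3 = timed(lambda e: 1 if e > 0.25 else 0, entro)
print(label)            # -> 1
total = t1 + t2 + t3    # per-query total, analogous to the 0.049 s in Table 7
```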

Comparison to Human Reported Results
In the final experiment, we invited three senior neuroradiologists, each with over ten years of experience. We scanned 20 subjects (5 healthy and 15 pathological) and picked five slices from each subject. For the healthy subjects, the five slices were selected randomly. For the pathological subjects, the five slices had to contain the lesions, as confirmed by all three senior neuroradiologists.
Afterwards, a double-blind test was performed. Four junior neuroradiologists, each with less than one year of experience, were asked to predict the status of each brain (pathological or healthy). Three minutes were allotted per image. Their diagnosis accuracies are listed in Table 8 (O = observer).
From Table 8, we see that our computer-aided diagnosis method achieves an accuracy of 96%. Compared with Table 4, the performance of the proposed method degrades in the real-world scenario.
The reasons are complicated. First, the source dataset was downloaded from Harvard Medical School and is intended for teaching; hence, the dataset intentionally highlights the difference between pathological and healthy brains, and the slice positions were selected with care. Second, images from real hospitals are of poorer quality, with poorer localization of lesions. All of this contributes to the worsened performance of our method. Even so, an accuracy of 96% in a real-world scenario is good and promising.
Another finding in Table 8 is that the four junior neuroradiologists obtained accuracies of 74%, 78%, 77%, and 79%, respectively, all below 80%. This validates the power of computer vision and machine learning, since computers have proven to deliver better performance in face recognition, video security, etc. Nevertheless, neuroradiologists can draw on other, more visible symptoms suggesting that something may be wrong in the brain; therefore, this simple test does not reflect realistic diagnosis accuracy in real hospitals.

Conclusions and Future Research
We proposed a novel CAD system for pathological brain detection (PBD) based on DTCWT, VE, and TSVM. The results show that the proposed method yields better results than 12 existing methods in terms of classification accuracy.
Our contributions include three points: (1) we investigated the potential of the dual-tree complex wavelet transform (DTCWT) for MR image classification and showed that DTCWT is effective; (2) we utilized the twin support vector machine (TSVM) and showed that it is better than canonical SVM and GEPSVM; (3) the proposed "DTCWT + VE + TSVM" system is superior to nineteen state-of-the-art systems.
One limitation of our method is that the dataset size is small; we will re-check our method by creating a larger dataset. Another limitation is that the dataset does not reflect real-world scenarios, so we need to obtain more data directly from hospitals in the future. A third limitation is that our data involve only middle- and late-stage diseases; hence, our method does not perform well on MR images of early-stage diseases.
In the future, we will consider validating our method on real clinical data and using advanced classification methods, such as RBFNN, deep learning, and least-squares techniques. Besides, we will try to apply our method to remote-sensing fields or hearing loss detection [58]. Advanced parameter estimation, case-based reasoning [59], and optimization [60] techniques will be investigated thoroughly. Fuzzy methods [61] may be applied to remove outliers in the dataset. Coarse-graining [62] can help extract more efficient entropy measures that are robust to noise. Video-on-demand services [63,64] may be applied to help reduce computational resources. In particular, we shall acquire more datasets and compare our method with human interpretation.

Figure 2. Modular framework of the proposed system for magnetic resonance (MR) brain classification (K may be 5 or 6 according to the dataset).

Figure 5. The reconstruction comparison between DWT and DTCWT.
Step B (Feature Extraction): Obtain 12 x s features (6 x s variances and 6 x s entropies, where s represents the decomposition level) from the subbands of DTCWT.
Step C (Training): Submit the set of features, together with the class labels, to the classifier in order to train its weights/biases.
Step D (Evaluation): Record the classification performance based on 10 runs of K-fold stratified cross validation.

Figure 7. Graphical user interface (GUI) of our developed programs.

Table 2. Stratified cross validation (SCV) setting of all datasets.

Table 8. Comparison to human reported results.