Classification of MRI Brain Images Using DNA Genetic Algorithms Optimized Tsallis Entropy and Support Vector Machine

As a non-invasive diagnostic tool, Magnetic Resonance Imaging (MRI) has been widely used in brain imaging. Classifying the condition shown in an MRI brain image is challenging both technically and clinically, because MRI is primarily used for soft tissue anatomy and can generate large amounts of detailed information about the brain condition of a subject. To classify benign and malignant MRI brain images, we propose a new method. The discrete wavelet transform (DWT) is used to extract wavelet coefficients from MRI images. Then, Tsallis entropy with parameters optimized by a DNA genetic algorithm (DNA-GA), called DNAGA-TE, is used to obtain entropy features from the DWT coefficients. Finally, a DNA-GA-optimized support vector machine with a radial basis function (RBF) kernel, called DNAGA-KSVM, is applied as the classifier. In our experiments, two kinds of images are used to validate the availability and effectiveness of the algorithm: simulated images from the Simulated Brain Database and real MRI images downloaded from the Harvard Medical School website. Experimental results demonstrate that our method (DNAGA-TE+KSVM) obtains better classification accuracy.


Introduction
Human brain image segmentation, as a branch of medical image segmentation, can help us diagnose diseases of the human brain. MRI image segmentation technology can help doctors diagnose patients with brain diseases such as brain tumors, cerebral infarction and Parkinson's disease quickly and accurately. The use of brain imaging techniques to qualitatively and quantitatively analyze brain function is important for effective diagnosis of brain diseases. MRI has greater potential advantages than other imaging methods; in particular, it is safer than many other modalities because it involves no ionizing radiation. High-quality MRI images have been routinely used for obtaining the anatomical and pathological conditions of the brain in both biomedical research and clinical diagnosis [1]. Since MRI is primarily used for soft tissue anatomy and can generate large amounts of detailed information about the brain condition of a subject, the use of MRI images to classify normal and pathological brain conditions is very important in clinical diagnosis [2]. However, so much data makes manual interpretation difficult, so automated image analysis tools need to be developed. This study develops a new classification method that uses the wavelet transform (WT) for feature extraction and a support vector machine (SVM) for classification, applied to distinguishing normal and pathological brain conditions. The WT allows images to be analyzed at different resolution levels because of its multi-resolution analysis property; however, this technology requires a large amount of storage and is computationally intensive [3]. This paper integrates DWT-based feature extraction with DNA-GA-optimized Tsallis entropy and a kernel SVM to achieve accurate and stable automatic classification of benign and malignant MRI brain images.
The rest of the paper is structured as follows: Section 2 presents the feature extraction components, including DWT and Tsallis entropy. Section 3 first introduces the principles of the linear SVM and then adds the concept of kernel functions. Section 4 introduces the DNAGA-TE+KSVM method: it gives the principles of KSVM, uses the DNA genetic algorithm to optimize the values of the parameters q, C and σ, and finally uses five-fold cross-validation to protect the classifier from overfitting. Section 5 uses two kinds of datasets to analyze the availability and effectiveness of the algorithm, comparing DNAGA-TE+KSVM with traditional back-propagation neural network (BP-NN) and RBF neural network (RBF-NN) methods. Finally, Section 6 gives conclusions and an outlook.

Feature Extraction with DWT and Tsallis Entropy
In our study, the first step in extracting features from an image is the DWT. The wavelet transform is a localized analysis method in which the window size is fixed but its shape can be changed. In classical signal analysis theory, the Fourier transform (FT) is the most widely used analysis method. However, it is a pure frequency-domain analysis method that provides no frequency information over a local time period. The wavelet transform obtains better time resolution in the high-frequency part of the signal and better frequency resolution in the low-frequency part, so it can effectively extract information from the signal. The FT provides a representation of an image based solely on its frequency content [23]; therefore, the wavelet function is localized in space, while the FT representation is not. The FT decomposes a signal into a spectrum, whereas wavelet analysis decomposes a signal from the coarsest scale into a hierarchy of scale ranges, providing image representations at different resolutions. The wavelet transform is therefore a better image feature extraction tool. In this paper, the decimated wavelet transform is adopted because it is relatively simple compared with the non-decimated wavelet transform.

Feature Extraction
The FT is the most common signal analysis tool. When using the FT to extract the signal spectrum, all of the time-domain information of the signal must be used; it is a global transformation. This gives the FT a serious shortcoming: an analyst cannot tell from a Fourier spectrum when a particular event took place. Classification performance may therefore decrease because time information is lost.
The Fourier transform was improved by Gabor so that it can analyze a small portion of the signal at a time. This technique is called windowing, or the short-time FT (STFT) [24]. In fact, calculating the STFT amounts to dividing a longer signal into shorter segments of equal length and computing the FT on each segment. The STFT balances time and frequency information better than the FT.
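As a concrete illustration of this idea (not part of the original method), the STFT can be sketched in a few lines of Python; the function names and the toy signal are ours, and a naive DFT stands in for an FFT library:

```python
import cmath

def dft(frame):
    """Naive discrete Fourier transform of one frame."""
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

def stft(signal, frame_len):
    """Split the signal into equal-length segments and DFT each one,
    so every spectrum is tied to a known time interval."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return [dft(f) for f in frames]
```

Applied to a signal whose frequency changes halfway through, the first frame's spectrum peaks at the DC bin while the second peaks at a higher-frequency bin, which is exactly the time localization the plain FT cannot provide.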
The wavelet transform takes the next logical step: it uses a window of variable size and so retains both frequency and time information of the signal. Most importantly, the wavelet transform gives an adjustable time-frequency window. When the frequency increases, the width of the time window automatically narrows to increase the resolution; when the frequency decreases, the width of the time window automatically widens to preserve the integrity of the information.

Discrete Wavelet Transform
DWT discretizes the scale and translation of the basic wavelet transform. The continuous wavelet transform of a function f(t) is defined as

	W(a, b) = ∫ f(t) ψ_a,b(t) dt,	(1)

where ψ_a,b(t) means that the mother wavelet ψ(t) is stretched and translated:

	ψ_a,b(t) = (1/√a) ψ((t − b)/a).	(2)

In (2), the dilation factor a acts to vary the time scale of the probing function ψ(t) and b is the translation parameter across f(t) (both real positive numbers). In the development of wavelet analysis, many different types of wavelets have received wide attention. Among them, the most typical one is the Haar wavelet [25], because it is the simplest and most commonly used wavelet in many applications.
The discrete wavelet transform is obtained by evaluating a, b in the wavelet basis function (1) at discrete points. One of the most common discretizations is the dyadic lattice (a = 2^j, b = 2^j k, with j, k ∈ Z), which gives the DWT

	ca_j,k = DS[ Σ_t f(t) l*_j(t − 2^j k) ],
	cd_j,k = DS[ Σ_t f(t) h*_j(t − 2^j k) ],	(3)

where ca_j,k represent the coefficients of the approximation components and cd_j,k represent the coefficients of the detail components. The function l*(t) is the tap-coefficient sequence of the low-pass filter, and h*(t) is that of the high-pass filter. j refers to the wavelet scale and k to the translation factor. The DS operator represents the downsampling operation. The functions in (3) give the basis of the wavelet decomposition: this operation decomposes a signal f(t) into the detail components cd_j,k and the approximation coefficients ca_j,k. We call this process one-level decomposition.
The above decomposition process can be iterated so that a signal is decomposed into various levels of resolution. A wavelet decomposition tree can be used to depict the entire process graphically. An example of a two-level 1D-DWT is given in Figure 2.
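For the Haar wavelet, one level of this decomposition reduces to pairwise averaging and differencing (scaled by √2), which can be sketched as follows; the function name is ours:

```python
import math

def haar_dwt_1level(x):
    """One-level Haar decomposition: pairwise low-pass (average) and
    high-pass (difference) filtering followed by downsampling by 2."""
    s = math.sqrt(2.0)
    ca = [(x[2 * i] + x[2 * i + 1]) / s for i in range(len(x) // 2)]  # approximation
    cd = [(x[2 * i] - x[2 * i + 1]) / s for i in range(len(x) // 2)]  # detail
    return ca, cd
```

A two-level decomposition, as in Figure 2, is obtained by applying the same function again to the approximation coefficients ca. Note that the transform preserves the signal's energy, which is a quick sanity check on any implementation.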

When the DWT is applied to two-dimensional (2D) images, it is applied to each dimension separately. Figure 3 shows a one-level 2D DWT; the downward arrow denotes the DS operation. At each scale, there are four sub-band images (LL, LH, HL and HH): the low-frequency approximation LL, the horizontal high-frequency detail component HL, the vertical high-frequency detail component LH and the diagonal high-frequency detail component HH. These four sub-bands can each be regarded as an approximation component or a detail component of the image. The sub-band LL is used for the next 2D DWT: the next layer of decomposition is performed on the low-frequency subgraph obtained from the decomposition of the layer above. A simple layered framework of wavelets can therefore be used to interpret image information.
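The row-then-column filtering described above can be sketched for the Haar case as follows; the function names are ours, and the sub-band naming follows the convention used in the text (conventions vary between libraries):

```python
def haar1d(x):
    """One-level 1D Haar split into approximation and detail halves."""
    s = 2 ** 0.5
    ca = [(x[2 * i] + x[2 * i + 1]) / s for i in range(len(x) // 2)]
    cd = [(x[2 * i] - x[2 * i + 1]) / s for i in range(len(x) // 2)]
    return ca, cd

def haar2d_1level(img):
    """One-level 2D Haar DWT: filter the rows, then the columns of each
    half, yielding the four sub-bands LL, LH, HL, HH
    (here LH = row-low/column-high)."""
    low_rows, high_rows = [], []
    for row in img:
        ca, cd = haar1d(row)
        low_rows.append(ca)
        high_rows.append(cd)

    def filter_columns(mat):
        cols = list(zip(*mat))                        # transpose
        lo, hi = zip(*(haar1d(list(c)) for c in cols))
        return ([list(r) for r in zip(*lo)],          # transpose back
                [list(r) for r in zip(*hi)])

    LL, LH = filter_columns(low_rows)
    HL, HH = filter_columns(high_rows)
    return LL, LH, HL, HH
```

On a 2×2 image, each sub-band collapses to a single coefficient: LL is (scaled) the sum of all four pixels, and the three detail bands capture horizontal, vertical and diagonal differences.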

Tsallis Entropy
Entropy describes the degree of chaos in a system; it is a statistical measure of randomness. Shannon entropy reflects the degree of disorder of the system and is represented by S. The more ordered a system is, the lower its information entropy will be. We use Shannon entropy to represent the amount of information contained in the aggregated features of the grayscale distribution of the image. Let p_i denote the proportion of the gray level i of the reconstruction coefficients; then the Shannon entropy of the image is

	S = − Σ_{i=1..n} p_i ln p_i,	(4)

where n is the total number of grey levels. Shannon entropy describes an extensive system, while a real system is more or less correlated in time or space, that is, nonextensive [26]. For physical systems with long-range interactions, long-term memory or fractal structures, a nonextensive entropy must be used. Tsallis proposed a generalization of BGS statistics called Tsallis entropy, represented by S_q, with the following form:

	S_q = (1 − Σ_{i=1..n} p_i^q) / (q − 1),	(5)

where q represents the non-extensive parameter. When q approaches 1, S_q in (5) converges to the Shannon entropy. The difference between Shannon entropy and Tsallis entropy is the introduction of the non-extensive parameter q, which can be used to describe the degree of nonextensiveness of the system under test [27]. For two independent subsystems A and B, the system entropy satisfies the pseudo-additivity rule

	S_q(A + B) = S_q(A) + S_q(B) + (1 − q) S_q(A) S_q(B).	(6)
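The two entropies and their relationship can be checked numerically with a short sketch (function names are ours); for q close to 1 the Tsallis value approaches the Shannon value, and for independent distributions the pseudo-additivity rule holds exactly:

```python
import math

def shannon(p):
    """S = -sum p_i ln p_i over the nonzero bin probabilities."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def tsallis(p, q):
    """S_q = (1 - sum p_i^q) / (q - 1); converges to Shannon as q -> 1."""
    return (1.0 - sum(pi ** q for pi in p if pi > 0)) / (q - 1.0)
```

For a uniform two-level distribution, shannon([0.5, 0.5]) equals ln 2, and tsallis([0.5, 0.5], q) tends to the same value as q approaches 1.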
TE has been widely used in brain image processing [28], and the combination of TE and DWT performs better than either TE or DWT alone [29]. MRI images have fractal structures and long-range interactions, because self-similarity is observed in brain structures even with limited-resolution imaging. In our study, TE is used to extract features from the sub-bands of the DWT coefficients of MRI images. First, the DWT decomposes an image into a multiscale hierarchy of low and high sub-band images. Next, a noise reduction operation is performed using the Tsallis entropy of the local variance of the directional high sub-band image modulus. Thereafter, a nonlinear operator is used to modify the wavelet coefficients of the high sub-bands, and finally a power-law transform is applied to the low sub-band image of the first scale to suppress the background.
Because the brain is a sub-extensive system consisting of tissues and complex areas, the value of q should be smaller than 1. A key question is how to find the best value of q. The value of q is varied between 0.1 and 1, in increments of 0.1, and the average classification rate of 10 replicates is recorded on the test dataset (see Section 5.1). The result is shown in Figure 4. The average classification rate fluctuates as q increases from 0.1 to 0.9 and reaches a peak at q = 0.8. The result in Figure 4 demonstrates that the value of q influences the classification performance. In this study, DNA-GA is used to search for a good value of q. More details about the search procedure are described in Section 4.


SVMs and Parameters
Detection of malignant brain tumors is treated as a binary classification problem. Because of their good performance, SVMs are used to build the classification functions. A brief introduction to the primal and dual formulations of SVMs is provided, and this section discusses the need to search for good parameter values. More detailed treatments of SVMs are available in other publications [9].

Support Vector Machine
The basic model of the SVM is to find the optimal separating hyperplane in the feature space, i.e., the one that maximizes the margin between the positive and negative samples in the training set. Given some p-dimensional data points, the goal of the SVM is to produce a (p−1)-dimensional hyperplane; the optimal hyperplane is the one with the largest separation between the two classes.
Given a dataset {(x_n, y_n) | x_n ∈ R^p, y_n ∈ {−1, +1}}, where x_n represents a p-dimensional data point and y_n stands for its category (y_n is +1 or −1, representing the two classes), the target of a classifier is to find a hyperplane in the p-dimensional data space. Assuming the dataset is not perfectly separable, we create the classification function

	f(x) = w^T x + b.	(8)

The direction of the hyperplane is determined by the normal vector, represented by w in (8); the distance between the origin and the hyperplane is determined by the value of the offset, represented by b. To find a good hyperplane, we need the values of w and b that solve Equation (9):

	min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1..m} ξ_i
	s.t. y_i(w^T x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, m,	(9)

where C is the penalty coefficient and ξ_i is the slack variable for sample point i. The objective function in (9) charges a cost loss C Σ_{i=1..m} ξ_i for the slack variables. The greater C is, the greater the penalty for misclassification will be; conversely, the smaller C is, the lower the penalty for misclassification will be. Although the value of C is significant for any application, there is no uniform method to determine its optimal value. The general approach is to find a satisfactory value through repeated experiments.
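The role of C and the slack variables can be made concrete with a minimal sketch (the function name is ours): for a fixed hyperplane (w, b), each slack ξ_i is the hinge loss max(0, 1 − y_i(w·x_i + b)), and the objective trades margin width against the C-weighted sum of slacks.

```python
def hinge_objective(w, b, X, y, C):
    """Soft-margin objective 0.5 * ||w||^2 + C * sum(xi_i), where the slack
    xi_i = max(0, 1 - y_i * (w . x_i + b)) measures margin violation."""
    norm2 = sum(wj * wj for wj in w)
    slacks = [max(0.0, 1.0 - yi * (sum(wj * xij for wj, xij in zip(w, xi)) + b))
              for xi, yi in zip(X, y)]
    return 0.5 * norm2 + C * sum(slacks)
```

Points classified with margin at least 1 contribute zero slack, so for them the objective reduces to 0.5‖w‖²; a point inside the margin adds C times its violation, which is why a large C penalizes misclassification heavily.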

Kernel Functions
A conventional SVM constructs a hyperplane for data classification. In the case of linear inseparability, the SVM first completes its calculations in the low-dimensional input space, then maps the input space to a high-dimensional feature space through a kernel function, and finally creates an optimal hyperplane in that high-dimensional feature space. The use of kernel functions is essential to the SVM: nonlinear classification functions are constructed in high-dimensional feature spaces through kernels.
Inner products are computed through kernel functions. The Gaussian kernel function [30] is defined as

	K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)),	(11)

where ‖x_i − x_j‖ is the Euclidean distance between x_i and x_j, and σ is the width parameter, which controls the radial range of action. In other words, σ controls the local range of the Gaussian kernel function, and its value determines the performance of the SVM [31]. Typically, the value of σ has been obtained by repeated experiments in previous studies.
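A minimal sketch of this kernel follows (the function name is ours; we use the common 2σ² normalization in the exponent, which matches the formula above, though some texts omit the factor 2):

```python
import math

def rbf_kernel(xi, xj, sigma):
    """K(xi, xj) = exp(-||xi - xj||^2 / (2 * sigma^2)); sigma controls
    how fast similarity decays with Euclidean distance."""
    d2 = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-d2 / (2.0 * sigma ** 2))
```

Identical points always have kernel value 1, and for a fixed pair of points a larger σ yields a value closer to 1, i.e., a wider radial range of action; this is exactly why σ must be tuned per problem.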

DNA-GA for Optimal Parameters
Appropriate kernel functions and parameter values play a very important role in the classification performance of the SVM. In our paper, DNA-GA is proposed to get the appropriate values of the parameters in the SVM. In addition, DNA-GA is also used to get the appropriate values of Tsallis entropy parameters q in (5), regularization parameter C in (9), and the width parameter σ in (11).
We use one selection operation, three crossover operations and four mutation operations in the DNA-GA algorithm; this section details these operations. To find good values for the parameters C and σ, DNA-GA is included in the SVM training procedure. Optimizing these two parameters helps the algorithm converge quickly and makes the classification results more accurate.

The Proposed Algorithm
In the iterative process of the DNAGA-TE+KSVM algorithm, we set two termination conditions. The first is that the evolutionary generation reaches the maximum number of iterations G_max; the second is that the change in the optimal fitness value is less than a predetermined threshold δ. In each evolutionary step, the local search, the selection operation, the crossover operation and the mutation operation are performed sequentially according to their probabilities. The detailed description of the DNAGA-TE+KSVM algorithm is as follows:
Step 1: Collecting the MRI brain images dataset.
Step 3: Initialization. Initialize all parameters of the algorithm, including the maximum number of generations G_max, the size of the DNA soup T, the length of a chromosome L, and the probabilities of chromosome crossover and mutation, p_c and p_m. Generate T chromosomes randomly to initialize the DNA soup.
Step 4: Decoding. Decode each chromosome to obtain values of the parameters q, C and σ. Then, we use these decoded parameter values to train the SVM and calculate the fitness value.
Step 5: Extracting feature by using TE. Tsallis Entropy is used to extract features from sub-bands of DWT coefficients of MRI images.
Step 6: Five-fold cross-validation. As described in Section 4.4, the MRI image dataset is randomly divided into five equal-sized mutually exclusive subsets, four of which are used to train the SVM while the other is used for verification. The procedure is repeated five times so that each subset is used for verification once.
Step 7: Termination. If the number of iterations attains the maximum value or if the change in the fitness value is less than δ, the chromosomes in the current population are recorded and decoded. Otherwise, go to step 8.
Step 8: The selection operation. As described in Section 4.3.1, the algorithm selects a better chromosome in the current population based on the fitness value of each individual. The selected chromosome is added to a new population until the number of chromosomes in the new population reaches a predetermined number T.
Step 9: The crossover operation. As described in Section 4.3.2, perform the crossover operation to chromosomes in the new population based on the probabilities of the crossover operations p c .
Step 10: The mutation operations. As described in Section 4.3.3, the mutation procedure is applied to individuals in the new population based on the value of p m .
Step 11: If the iteration condition is not met, go back to step 5.
Step 13: Submitting new MRI images to the Tsallis entropy feature extractor and the trained KSVM, and outputting the prediction. It should be noted that the algorithm checks in step 11 whether the termination criteria are reached. If either of the two termination conditions is reached, the iterative operation ends and steps 12 and 13 are performed. Finally, the data are classified using the KSVM trained with the optimal parameter values obtained by decoding the best chromosome.
The flow chart of the algorithm is depicted in Figure 5.
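The evolutionary loop of the steps above can be sketched as a highly simplified skeleton. This is not the paper's algorithm: one-point crossover and point mutation stand in for the three crossover and four mutation operations, the fitness function is an argument rather than the SVM cross-validation rate, and the probability values (0.8 and 0.05) are illustrative assumptions.

```python
import random

BASES = "ATCG"  # the four nucleotide bases act as base-4 digits

def decode(strand, lo, hi):
    """Map a base-4 DNA string to a real value in [lo, hi]."""
    n = sum(BASES.index(c) * 4 ** i for i, c in enumerate(strand))
    return lo + (hi - lo) * n / (4 ** len(strand) - 1)

def evolve(fitness, strand_len=8, pop=20, g_max=50, delta=1e-6):
    """Skeleton of the DNAGA loop: evaluate, stop when G_max is reached or
    the best-fitness change drops below delta, else select (tournament),
    cross over and mutate."""
    soup = ["".join(random.choice(BASES) for _ in range(strand_len))
            for _ in range(pop)]
    best_prev = None
    for _ in range(g_max):
        ranked = sorted(soup, key=fitness, reverse=True)
        best = fitness(ranked[0])
        if best_prev is not None and abs(best - best_prev) < delta:
            break  # second termination condition
        best_prev = best
        # tournament selection: the better of two random individuals survives
        soup = [max(random.sample(ranked, 2), key=fitness) for _ in range(pop)]
        # one-point crossover with probability p_c (0.8, hypothetical)
        for i in range(0, pop - 1, 2):
            if random.random() < 0.8:
                cut = random.randrange(1, strand_len)
                soup[i], soup[i + 1] = (soup[i][:cut] + soup[i + 1][cut:],
                                        soup[i + 1][:cut] + soup[i][cut:])
        # point mutation with per-base probability p_m (0.05, hypothetical)
        soup = ["".join(random.choice(BASES) if random.random() < 0.05 else c
                        for c in s) for s in soup]
    return max(soup, key=fitness)
```

Plugging in a toy fitness that rewards a decoded value near some target shows the loop steering the population toward it, which is the same mechanism the paper uses with the cross-validated classification rate as fitness.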

DNA Encoding and Decoding
According to biological research, DNA carries major genetic information about the growth and reproduction of all known organisms. Biological DNA is an essential element having four nucleotide bases: Adenine (A), Thymine (T), Cytosine (C) and Guanine (G). DNA coding is more conducive to individual gene-level genetic manipulation, such as recombination and mutation. DNA soup, also known as the DNA population, is made up of thousands of DNA strands with specific characteristics. Hereinafter, the number of chromosomes in the DNA population is represented by T.
Artificial DNA computing models can be used to solve many real-life optimization problems. Each strand in a DNA individual is like a string, and mathematically, strings are often used to encode the parameters of a particular problem. Researchers introduced the characteristics of biological DNA into genetic algorithms and developed the DNA-GA algorithm. In DNAGA-KSVM, parameters are encoded as strings forming parts of a chromosome; in particular, each chromosome of the population encodes the values of the parameters q, C and σ.

In Figure 6, the DNA string on the top encodes the value of one parameter, and the DNA chromosome at the bottom is made of multiple DNA strings, each representing one parameter. In general, the length of a DNA string depends on the actual problem. If the length of a DNA chromosome is L and one chromosome contains N parameters, then the length of each parameter's DNA string is l = L/N. In this paper, we use S_m to represent the mth chromosome in the population; correspondingly, the fitness value of each chromosome is expressed as fit(S_m). Before calculating fit(S_m), S_m is decoded into the values of the parameters it represents. Since each chromosome contains the SVM parameters C and σ, the symbols fit(S_m) and fit(C, σ) can be used interchangeably.
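The encoding scheme can be sketched as follows; the function names and the example parameter ranges are ours, and we read each base as one base-4 digit:

```python
BASES = "ATCG"  # each base is one base-4 digit: A=0, T=1, C=2, G=3

def decode_strand(strand, lo, hi):
    """Map one DNA string to a real parameter value in [lo, hi]."""
    n = 0
    for c in strand:
        n = n * 4 + BASES.index(c)
    return lo + (hi - lo) * n / (4 ** len(strand) - 1)

def decode_chromosome(chrom, ranges):
    """Split a chromosome of N concatenated strings of equal length l = L/N
    and decode each string into its parameter range (e.g. q, C, sigma)."""
    n = len(ranges)
    l = len(chrom) // n
    return [decode_strand(chrom[i * l:(i + 1) * l], lo, hi)
            for i, (lo, hi) in enumerate(ranges)]
```

For example, the chromosome "AAGG" with two parameter ranges decodes into the minimum of the first range and the maximum of the second, since "AA" is the smallest two-digit base-4 number and "GG" the largest.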

DNA Genetic Operators
In our algorithm, three types of genetic operators are employed: selection, crossover and mutation. As the genetic process continues, selection and crossover cause individuals in the population to become similar to each other. If individuals become similar in the early stages, the lack of diversity may cause the solution process to fall into a local optimum. Mutation is a supplementary method for generating new individuals; it helps maintain the diversity of the population, strengthens the search ability of DNA-GA, and avoids premature convergence. These DNA genetic operations are described below.

The Selection Operation
Selection is usually the first step in DNA genetic manipulation. Its goal is to give fitter DNA strands more opportunities to breed offspring, so that superior characteristics are inherited, reflecting the survival-of-the-fittest principle in nature. We use a tournament method with an elitist strategy, designed to prevent losing the best individual during evolution: individuals in the current population are scanned and compared, and the chromosome with the greater fitness value is replicated into the next-generation population.
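The selection step above can be sketched as follows; the tournament size of two and the single-elite carry-over are assumptions of this sketch, not details fixed by the paper.

```python
import random

def tournament_select(population, fitness, k=2):
    """Pick k individuals at random and return the fittest of them."""
    contestants = random.sample(range(len(population)), k)
    best = max(contestants, key=lambda i: fitness[i])
    return population[best]

def next_generation(population, fitness):
    """Elitist step: the best chromosome is always carried over, and the
    remaining slots are filled by repeated tournaments, so the best
    individual can never be lost between generations."""
    elite = max(range(len(population)), key=lambda i: fitness[i])
    new_pop = [population[elite]]
    while len(new_pop) < len(population):
        new_pop.append(tournament_select(population, fitness))
    return new_pop
```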

The Crossover Operations
Crossover is the core operation that determines the convergence of a genetic algorithm and is the most important genetic operation. In the following, let R = R_5 R_4 R_3 R_2 R_1 be a DNA sequence.
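The specific crossover operators applied to sequences such as R are not reproduced here; as an illustration only, a generic length-preserving two-point crossover on fixed-length DNA strings can be sketched as:

```python
import random

def two_point_crossover(parent1, parent2):
    """Exchange the segment between two random cut points. The chromosome
    length is preserved, as required by the fixed-length encoding."""
    assert len(parent1) == len(parent2)
    i, j = sorted(random.sample(range(1, len(parent1)), 2))
    child1 = parent1[:i] + parent2[i:j] + parent1[j:]
    child2 = parent2[:i] + parent1[i:j] + parent2[j:]
    return child1, child2

c1, c2 = two_point_crossover("AAAAAAAA", "TTTTTTTT")
```

Because the cut points satisfy 1 ≤ i < j, the exchanged segment is never empty, so each child always inherits material from both parents.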

The Mutation Operations
The mutation operator changes the gene values at certain loci of individual strings in the population. Various DNA sequence manipulations have been proposed in the literature [32]; however, some of them alter the length of the DNA sequence and are therefore not used in our study. We mainly use four mutation operations:

Fitness Function
A classifier fitted to a given set of training data may perform well on that set yet fail to generalize to other independent datasets; this is called overfitting. Cross-validation makes the classification more reliable, so that it extends to other independent datasets or to new observations. In cross-validation, most of the samples are taken as a training set, while a small part is held out to validate the model just created.
To distribute the observations evenly across the K folds, stratified K-fold cross-validation is used, in which each fold has almost the same number of observations of each category [33]. To balance computation time against the reliability of the estimate, we set K = 5, i.e., five-fold cross-validation. The MRI image dataset is randomly divided into five equal-sized, mutually exclusive subsets; four subsets are used for training and the remaining one for validation. The fitness function is the classification rate,

fit(C, σ) = N_m / N_s,

where N_s denotes the total number of observations and N_m denotes the number of correctly classified observations in the validation set. The fitness function is maximized during DNA-GA training.
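The stratified split and the classification-rate fitness can be sketched as follows; the round-robin assignment of shuffled indices within each class is an assumption of this sketch, not necessarily the paper's exact procedure.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Split sample indices into k folds so that each fold holds roughly
    the same number of observations of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)   # round-robin within each class
    return folds

def classification_rate(predictions, truth):
    """Fitness fit = N_m / N_s: the fraction of validation observations
    classified correctly."""
    n_m = sum(p == t for p, t in zip(predictions, truth))
    return n_m / len(truth)
```

With 5 benign and 85 malignant labels (the 90-image setting of the AANLIB experiment), each of the five folds receives 18 images, exactly one of them benign.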

Experimental Study
The significant contribution of this study is an algorithm integrating DWT, Tsallis entropy, DNA-GA and KSVM for classifying benign and malignant MRI brain images, which can help clinicians diagnose patients. The algorithm is implemented with the wavelet toolbox and the bio-statistical toolbox of Matlab R2014b, together with an open-source SVM toolkit. The program can run on any computer platform on which Matlab is installed.

Database
In this paper, we used two types of images. The first comes from the Simulated Brain Database (SBD), from which MRI images can be downloaded. The second consists of real MRI images downloaded from the Harvard Medical School website. In our experiments, we retained only the three main brain tissues, GM, WM and CSF, although the original images also contained the skull and other tissues.
SBD: The database contains DWI, T1 and T2 images of size 181 × 217 pixels. In our experiments, the classification results were quantified against the ground truth provided by SBD. We added Rician noise to the simulated images, with the SNR (signal-to-noise ratio) set to 5 dB, 10 dB, 15 dB and 20 dB, respectively.
AANLIB: This image dataset contains many T2-weighted MRI brain images of higher resolution. Because T2 images have higher contrast and a clearer appearance than the T1 modality, we chose the T2 model for this dataset. Its collection of malignant brain MRI images is comprehensive. Sample images of disease are shown in Figure 11.
Figure 11. Sample of brain MRIs.
We randomly selected five images of each type from the dataset, which includes 17 types of malignant brain images and one benign type; the malignant types cover four brain diseases. That is, we selected 5 × (1 + 17) = 90 images to form the brain MRI image dataset.
The dataset is partitioned into five equally distributed subsets, each containing one brain MRI image of each of the 18 types. Since five-fold cross-validation is used, the model is trained and validated five times: in each run, four subsets are used for training and the remaining one for validation, so that every subset is used exactly once for validation.

Feature Extraction
As shown in Figure 12, three levels of wavelet decomposition greatly reduce the size of the input image. The level-3 approximation coefficients lie in the upper-left corner of the wavelet-coefficient image, and their size is only 16 × 16 = 256. To avoid boundary distortion, boundary values are computed with the symmetric padding method [34].
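A minimal sketch of this feature-extraction step, using a normalized (averaging) Haar decomposition and assuming the input slice has been resized to 128 × 128 so that three levels leave a 16 × 16 approximation sub-band; the paper's actual wavelet choice and its symmetric-padding handling are not reproduced here.

```python
import numpy as np

def haar_approx_level(img):
    """One level of a normalized (averaging) Haar decomposition; only the
    LL approximation sub-band is kept, since the feature vector is built
    from the level-3 approximation coefficients."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # pair and average rows
    return (a[:, 0::2] + a[:, 1::2]) / 2.0    # pair and average columns

def extract_features(img, levels=3):
    """Three levels reduce an assumed 128 x 128 input to a 16 x 16 = 256
    coefficient sub-band, flattened into the feature vector."""
    a = np.asarray(img, dtype=float)
    for _ in range(levels):
        a = haar_approx_level(a)
    return a.ravel()
```

With this averaging normalization, each level-3 coefficient is simply the mean of an 8 × 8 block of pixels, which makes the dimensionality reduction easy to verify.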

Results on SBD
For the SBD image dataset, the values of q, C and σ are randomly generated in the intervals (0, 2), (50, 200) and [0.5, 2], respectively. The optimal parameters obtained by DNA-GA and the corresponding results are shown in Table 2.

Results on AANLIB
For the AANLIB image dataset, the final parameters obtained by DNA-GA are q = 0.8, C = 143.3 and σ = 1.132. The performance of the method with these parameter values is compared against randomly selected values, with q, C and σ generated randomly in the intervals (0, 1), (50, 200) and [0.5, 2], respectively. The results in Table 2 show that the classification rate varies with the parameter values, so it is important to determine the optimal values before constructing the classifier. Because randomly selected values rarely achieve good performance, a systematic search method is necessary, and DNA-GA is an effective one compared with random selection.

The KSVM uses the RBF kernel function. The performance of the DNAGA-TE+KSVM method is compared with those of a one-hidden-layer feed-forward neural network (FF-NN) and an RBF neural network (RBF-NN). The results are shown in Table 3: FF-NN matched 388 cases correctly, a classification rate of 86.22%; RBF-NN correctly matched 411 cases, a rate of 91.33%; and DNAGA-TE+KSVM reached 97.78% by matching 440 cases correctly. The newly developed method therefore performed the best.

The DWT efficiently extracts information from the raw brain MRI images with little loss, since it captures both frequency and spatial location information. In this study, Tsallis entropy proved superior to Shannon entropy. The value of q is a hyperparameter that must be optimized to maximize the performance of the algorithm, and Tsallis entropy degrades to Shannon entropy when q = 1. Clearly, the introduction of Tsallis entropy does improve classification performance. The purpose of introducing DNA-GA is to obtain better values of the parameters q, σ and C.
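For reference, the Tsallis entropy of a probability distribution and its Shannon limit at q = 1 can be sketched as:

```python
import math

def tsallis_entropy(p, q):
    """Tsallis entropy S_q = (1 - sum_i p_i^q) / (q - 1) of a probability
    distribution p; it reduces to the Shannon entropy as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        # limit q -> 1: the Shannon entropy -sum p_i log p_i
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return (1.0 - sum(pi ** q for pi in p if pi > 0)) / (q - 1.0)
```

Numerically, values of q only slightly away from 1 already approximate the Shannon entropy closely, which is why q acts as a smooth hyperparameter for DNA-GA to tune.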
The method using the optimized values for these parameters performs much better than the one using randomly selected values. Apparently, it is hard to obtain good parameter values without a systematic search, so DNA-GA is an effective way to find the optimal values.

Conclusions
In this study, a novel DWT+TE+KSVM+DNAGA classification method is developed to distinguish benign from malignant brain MRI images. The RBF kernel function is used in the SVM. The experiments demonstrated that the DNAGA-TE+KSVM method obtained a better classification rate under five-fold validation on two types of image datasets, while the FF-NN and RBF-NN achieved lower classification rates.
Future work will mainly focus on four areas. First, the proposed SVM-based method could be employed to classify MRI brain images acquired with other contrast mechanisms, such as T1-weighted, proton-density-weighted, and diffusion-weighted images. Second, the computation could be accelerated by using advanced wavelet transforms such as the lifting wavelet, and the performance of different wavelet transforms should be compared. Third, multi-class classification focusing on brain MRI images of specific disorders can be explored. Fourth, novel kernels and DNA-GA optimized parameters will be tested to increase the classification rate and accelerate the computation.