A Kernel-Based Intuitionistic Fuzzy C-Means Clustering Using a DNA Genetic Algorithm for Magnetic Resonance Image Segmentation

MRI segmentation is critically important for clinical study and diagnosis. Existing methods based on soft clustering have several drawbacks, including low accuracy in the presence of image noise and artifacts, and high computational cost. In this paper, we formulate the MRI segmentation problem as a kernel-based intuitionistic fuzzy C-means (KIFCM) clustering problem and propose a new DNA-based genetic algorithm to obtain the optimal KIFCM clustering. While this algorithm searches the solution space for the optimal model parameters, it also obtains the optimal clustering, and therefore the optimal MRI segmentation. We perform an empirical study comparing our method with six state-of-the-art soft clustering methods on a set of UCI (University of California, Irvine) datasets and a set of synthetic and clinical MRI datasets. The preliminary results show that our method outperforms the other methods in both clustering metrics and computational efficiency.


Introduction
Segmentation of brain magnetic resonance images (MRIs) into non-overlapping regions of white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) is essential for studying anatomical structure changes and brain quantification [1] and for building tumor growth models. Due to the existence of noise, bias field, and the partial volume effect, brain MRI segmentation faces several challenging issues. Existing methods still suffer from a lack of robustness to outliers [2], a high computational cost [3], the need for manual adjustment of crucial parameters [4], a limited segmentation accuracy in the presence of high-level noise [5], and a loss of image details [6].
One promising approach to MRI segmentation is soft clustering, an unsupervised learning technique that groups similar patterns into clusters with soft boundaries. By assigning each pixel to each cluster with a varying degree of membership, this approach is able to account for the uncertainty [7]. Several soft clustering methods have been proposed for MRI segmentation, including the well-known fuzzy C-means (FCM) clustering [8], mixture modeling [9], and hybrid methods based on the former two [10]. Although the accuracies of these soft clustering algorithms are good in the absence of image noise, they are sensitive to noise and other imaging artifacts.
Recently, several methods [11][12][13] have been proposed to improve the noise tolerance of FCM clustering by representing clusters as intuitionistic fuzzy sets (IFSs) [14]. Aruna et al. presented a modified intuitionistic fuzzy C-means (IFCM) clustering algorithm [15], which adopts a new IFS generator and the Hausdorff distance. Verma et al. presented an improved intuitionistic fuzzy C-means (IIFCM) algorithm [16], which takes local spatial information into consideration. IIFCM is able to tolerate noise and does not require any parameter tuning. However, these algorithms are still based on the Euclidean or Hausdorff distance among pixels. As a result, they can only find linearly separable clusters, and the clustering results depend on the initial choice of centroids.
It is well known that kernel functions can be used to find clusters that cannot be linearly separated [17,18]. However, the performance of kernel-based methods is highly sensitive to the choice of kernel parameters [19]. Although several methods [20,21] have been proposed to estimate the optimal values of kernel parameters, the problem has not been completely solved.
We believe that by applying kernel functions to find the IFS-based FCM clustering, we can obtain a robust method for MRI segmentation. Using this approach, the segmentation problem can be transformed into an optimization problem: finding the optimal kernel parameters that lead to optimal noise-tolerant fuzzy clusters.
The DNA genetic algorithm, based on DNA computing [22] and the Genetic Algorithm (GA) [23], has recently been introduced to solve complex optimization problems in many areas, such as chemical engineering process parameter estimation [24], function optimization [25], clustering analysis [26,27], and membrane computation [28]. This technique can be used to solve the aforementioned optimization problem.
Motivated by the previous discussion, we formulate the MRI segmentation problem as a kernel-based intuitionistic fuzzy C-means clustering problem and provide a DNA-based genetic algorithm for solving it. Specifically, our contributions in this paper are as follows.
We formulate the image segmentation problem as a kernel-based intuitionistic fuzzy C-means (KIFCM) clustering problem by specifying a new parametric objective function. This formulation includes a new measure of local pixel noise, a method that models fuzzy clusters as intuitionistic fuzzy sets instead of conventional fuzzy sets, and an adaptation of the kernel trick to improve performance.
We propose a new DNA-based genetic algorithm to learn the KIFCM clustering. This algorithm uses a DNA coding scheme to represent individuals (i.e., potential solutions) and a set of improved DNA genetic operators to search the solution space for optimal solutions. Each individual encodes a set of values of the model parameters, including the kernel parameters. While the algorithm searches for the optimal set of model parameters, it also obtains the optimal IFS-based fuzzy clusters.
We perform an empirical study comparing our method with six existing state-of-the-art fuzzy clustering algorithms on a set of UCI data mining datasets, a set of synthetic MRI data, and a set of clinical MRI datasets. Our preliminary results show that our algorithm outperforms the compared algorithms in both clustering metrics and computational efficiency.
The rest of this paper is organized as follows: Section 2 presents several basic concepts, such as IFSs, FCM, and DNA-GA. Section 3 discusses related work. In Section 4, we develop the objective function that formulates the kernel-based intuitionistic fuzzy C-means problem. In Section 5, we present our algorithm. In Section 6, we present results from experiments performed on UCI data and on synthetic and clinical brain MR images. In Section 7, we discuss the computational performance of our algorithm and several related algorithms. Section 8 concludes the paper.

Preliminaries
In this section, we briefly review basic concepts that lay a foundation for the discussions in later sections. Interested readers should refer to the references for more details.

Intuitionistic Fuzzy Sets (IFSs)
Given a set X, an intuitionistic fuzzy set [14] over X is defined as A = {(x, u_A(x), v_A(x)) | x ∈ X}, where 0 ≤ u_A(x) ≤ 1 is the membership degree and 0 ≤ v_A(x) ≤ 1 the non-membership degree of x with respect to A. In general, each element x in an intuitionistic fuzzy set also has a third parameter 0 ≤ π_A(x) ≤ 1 called the hesitation degree, which represents the lack of knowledge about the membership of x (i.e., the indecision about how large the degrees of membership and non-membership of x should be). These three degrees satisfy the relationship u_A(x) + v_A(x) + π_A(x) = 1. Notice that if there is no hesitation, i.e., π_A(x) = 0 or, equivalently, v_A(x) = 1 − u_A(x), for every x ∈ A, then A is also a fuzzy set [29]. Furthermore, if v_A(x) = π_A(x) = 0 for every x ∈ A, then A is also a crisp (i.e., ordinary) set.
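As a toy numerical illustration (ours, not from the paper), the relationship among the three degrees can be checked directly:

```python
# Toy illustration of the three degrees of an intuitionistic fuzzy set:
# membership u, non-membership v, and hesitation pi = 1 - u - v.
def hesitation(u, v):
    """Return the hesitation degree implied by u and v."""
    assert 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0 and u + v <= 1.0
    return 1.0 - u - v

print(round(hesitation(0.6, 0.3), 2))   # 0.1: a tenth of the evidence is undecided
print(round(hesitation(0.6, 0.4), 2))   # 0.0: no hesitation, an ordinary fuzzy set
```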

Fuzzy C-Means
For a set of data X = {x_1, x_2, . . . , x_n} ⊂ R^d, where each data point x_k is a d-dimensional vector with real-valued elements, and an integer c > 1, the fuzzy C-means clustering problem is to assign each data vector to the c fuzzy clusters so that the total intra-cluster distance, defined by the following objective function, is minimized: J_m(U, V) = Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik^m ||x_k − v_i||^2, where: • u_ik is the membership degree of data x_k in the fuzzy cluster with centroid v_i, and m > 1 is the fuzzification exponent.
• U = [u_ik]_{c×n} is the membership matrix, which satisfies two conditions: (a) u_ik ∈ [0, 1] for all i and k, and (b) Σ_{i=1}^{c} u_ik = 1 for every k. The optimal membership degrees and cluster centers can be obtained via an iterative process known as alternate optimization using the following equations: u_ik = 1 / Σ_{j=1}^{c} (||x_k − v_i|| / ||x_k − v_j||)^{2/(m−1)} and v_i = Σ_{k=1}^{n} u_ik^m x_k / Σ_{k=1}^{n} u_ik^m.
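For concreteness, the alternate optimization just described can be sketched as follows; this is a minimal NumPy sketch under our own naming, showing only the plain FCM baseline that the paper's algorithm extends:

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy C-means by alternate optimization: repeat the centroid
    update and the membership update derived from the objective function."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)                                # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)  # centroid update
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)  # (c, n) distances
        d = np.maximum(d, 1e-12)                      # guard against zero distance
        U = d ** (-2.0 / (m - 1.0))
        U /= U.sum(axis=0)                            # membership update
    return U, V
```

On two well-separated toy blobs, the hardened memberships (argmax over clusters) recover the expected grouping.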

DNA Genetic Algorithm
A DNA genetic algorithm [24] solves an optimization problem defined by an objective function f(X) over a set of decision variables X, as shown in the following algorithm. It encodes each potential solution (referred to as an individual) as a DNA string and searches for the best individual by mimicking DNA genetic operations. It starts by generating at random an initial population of individuals and uses a set of DNA-based genetic operations, such as mutation and crossover, to generate new individuals. It uses the objective function f() to compute the fitness values of individuals and evolves the population based on genetic principles. The process repeats until a pre-determined termination condition is met. The best-fit individual is then decoded and returned as the best solution to the problem.
The generic DNA-GA, named Algorithm 1, is described below.
Algorithm 1: Generic DNA-GA
Input: f(X): the objective function; n: the number of variables in X; dom(X): the domains of the n variables; N: the size of the initial population
Output: the value of X that optimizes f(X)
Method:
1. POP = a population of N randomly generated individuals encoded as DNA strings
2. For each p in POP
3.   Calculate the fitness value of p using f(decode(p))
4. While the termination condition is not satisfied, evolve POP: select individuals, apply the DNA genetic operations to create new individuals, and re-calculate fitness values
5. Return decode(bestFitIndividual)

In Algorithm 1, an individual is a sequence of values for the n decision variables. Each variable is encoded as a DNA sequence of a pre-determined number of nucleotide bases. Each nucleotide base is represented by a base-4 digit representing one of adenine (A), guanine (G), cytosine (C), and thymine (T). The fitness value of an individual is computed using the decoded variable values. The best-fit individual in an iteration of the algorithm represents the best solution found at that point in time. At each iteration, the next generation of the population is created. The best-fit individual in the current generation is kept, and new individuals are created by applying at random a set of DNA genetic operations to individuals randomly selected from the current generation of the population.
By mimicking the DNA reproduction process, the genetic operations perform copying, mutation, and crossover on DNA strands. To avoid premature convergence toward local optimum solutions, the next generation must provide sufficient variety by including not only the better-fit individuals, but also individuals with very different genetic characteristics.
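A minimal sketch of this generic loop is shown below. The single-point crossover, the 0.1 mutation rate, and the toy fitness function are our illustrative assumptions, not the operators defined later in the paper:

```python
import random

BASES = "AGCT"  # each nucleotide is a base-4 digit: A=0, G=1, C=2, T=3

def dna_ga(fitness, strand_len, pop_size=20, generations=50, seed=1):
    """Skeleton of a generic DNA-GA: random DNA population, fitness-based
    selection, crossover and mutation, with elitism for the best individual."""
    rng = random.Random(seed)
    pop = ["".join(rng.choice(BASES) for _ in range(strand_len))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        next_gen = [pop[0]]                       # elitism: keep the best-fit
        while len(next_gen) < pop_size:
            p1, p2 = rng.sample(pop[:pop_size // 2], 2)  # parents from top half
            cut = rng.randrange(1, strand_len)    # single-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.1:                # point mutation
                i = rng.randrange(strand_len)
                child = child[:i] + rng.choice(BASES) + child[i + 1:]
            next_gen.append(child)
        pop = next_gen
    return max(pop, key=fitness)

# Toy use: maximize the number of 'T' bases in the strand.
best = dna_ga(lambda s: s.count("T"), strand_len=12)
```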

Related Work
The original FCM algorithm [8] is based on the objective function given in Equation (3). Since it does not include any local information, it is sensitive to noise in the image and loses clustering accuracy as noise and image artifacts increase.
The FCM_S algorithm [30] used the following modified objective function: J = Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik^m ||x_k − v_i||^2 + (β/N_R) Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik^m Σ_{x_r ∈ N_k} ||x_r − v_i||^2, in which the second term includes the spatial information of the neighborhood around each pixel, where N_k is the set of pixels around pixel x_k, N_R is the average cardinality of N_k, and 0 < β < 1 is a parameter that controls the influence of the spatial information of the neighbors.
Although it helps with handling noise, it is computationally expensive because the neighborhood term must be calculated for each pixel in each iteration.
Entropy 2017, 19, 578

The FCM_S1 and FCM_S2 algorithms [31] used a further improved objective function, in which the second term is based on x̄_k, the grayscale of a filtered image, which only needs to be calculated once in advance; β is the same as in Equation (6). These algorithms also use kernel functions to calculate the Euclidean distance.
The difference between the two versions is that FCM_S1 uses the average filter and FCM_S2 uses the median filter. Although FCM_S1 and FCM_S2 improved the clustering accuracy, they are sensitive to high levels and different types of noise. In addition, the parameter β, which is critical to the performance, must be determined manually according to prior knowledge about the noise. Yang and Tsai [21] proposed a Gaussian kernel-based FCM method. This method also has two forms, GKFCM1 and GKFCM2, for the average and median filters, respectively. It replaces the parameter β with a parameter η_j, which must be calculated in every iteration for every cluster. Using an appropriate value of η_j, this method can lead to better results than FCM_S1 and FCM_S2. However, to estimate a good value for η_j, the cluster centroids must be well separated, which is difficult to guarantee. As a result, the algorithm may take many iterations to converge. Moreover, the learning scheme requires many patterns and many cluster centroids to find the optimal value of η_j. To handle this drawback, Elazab et al. [32] proposed an adaptively regularized kernel-based FCM framework (ARKFCM) with a new parameter to control the local information of every pixel.
To overcome the problem of prior parameter adjustment, Krinidis and Chatzis [2] introduced the fuzzy local information C-means (FLICM) algorithm, which includes a fuzzy factor G_ij in the objective function to represent the spatial and grayscale information of the neighborhood of each pixel. Although FLICM enhances robustness against noise and artifacts, it is computationally expensive because the fuzzy factor must be calculated in each iteration. The KWFLICM algorithm [33] enhances FLICM by using a trade-off weighted fuzzy factor G_ij to control the local neighbor relationship and by replacing the Euclidean distance with a kernel function. Using this factor, KWFLICM can accurately estimate the damping extent of neighboring pixels, however, at the expense of a substantially increased computational cost. In addition, the algorithm does not preserve small image details.

Problem Formulation
In this section, we formulate the FCM problem with a new objective function and show, step by step, the derivation of this objective function. Starting from the basic objective function in Equation (7), we modify existing terms or add new terms as new features are included to improve the accuracy and the performance of the algorithm for MRI segmentation. In this section, each data point x_i is a vector in a multi-dimensional space (for example, the coordinates and the greyscale of a pixel). To ease the presentation, we also refer to x_i as the pixel i.

Local Intensity Variance
First, we improve the noise resistance by replacing the β in Equation (7) with a weight that measures local uniformity. Let N_k be the neighborhood of a given size, e.g., 3 × 3 pixels, around pixel x_k. The local intensity variance (LIV) of the pixel x_k is defined as: where |N_k| denotes the number of pixels in N_k, and x̄_k is the mean grayscale taken over all pixels in N_k. LIV estimates the discrepancy of the grayscales in the neighborhood, normalized by the local average grayscale. A large LIV indicates a higher level of noise within the neighborhood.
We then define a brightness weight for the pixel x_k as: where: Intuitively, ζ_k measures the noise level of the neighborhood N_k, and ω_k is ζ_k normalized over the neighborhood. Thus, a large ω_k indicates that pixel x_k is brighter than its neighbors.
Finally, we define the variance for the pixel x_k as: The variance ϕ_k is larger for pixels with a high LIV (or that are brighter than their neighbors) and smaller otherwise. If the local average grayscale is equal to the grayscale of the central pixel, ϕ_k will be zero. The constant 2 in Equation (11) was determined through experiments and provides a balance between the convergence rate and the capability to preserve details.
We can now rewrite the objective function as follows: Since ϕ_k depends only on the grayscales within a specified neighborhood, which is independent of any cluster, it only needs to be calculated once, before the clustering process. This is very different from FCM_S, FLICM, and KWFLICM, where the contextual information must be updated for each pixel and each cluster in each iteration. Thus, Equation (12) results in a much lower computational cost.
The contextual information provided by ϕ_k is based on the heterogeneity of the grayscale distribution within the neighborhood. This is different from existing methods, which base the contextual information on the difference between the grayscales of neighboring pixels and the cluster centroids. As a result, ϕ_k tends to produce a homogeneous clustering, while the local information used in existing methods [33] tends to generate clusterings with more misclassified labels.
Figure 1a shows an image with Rician noise (Level 15), in which the white rectangle is an area of 6 × 6 pixels. Figure 1b shows the greyscale values of the pixels in this area, as well as three 3 × 3 neighborhood areas A, B, and C. Figure 1c,d show, respectively, the LIV and ϕ_k for each pixel.

Intuitionistic Fuzzy C-Means Clustering
We now change the fuzzy clusters from conventional fuzzy sets to intuitionistic fuzzy sets. According to the definition given in Section 2.1, we rewrite the objective function in Equation (12) as follows: Here, u*_ik = u_ik + π_ik is the membership of pixel x_k with respect to the intuitionistic fuzzy cluster with centroid v_i, u_ik is the membership degree, and π_ik is the hesitation degree.
The second term in Equation (13) is called the intuitionistic fuzzy entropy (IFE) and measures the vagueness or ambiguity in the intuitionistic clusters. It is included to minimize the entropy of the histogram of the given data and therefore to maximize the number of good points in each cluster.
We can calculate the hesitation degrees using Yager's class of fuzzy complements [20]: ν(u) = (1 − u^α)^{1/α}, where α > 0, ν(1) = 0, and ν(0) = 1. Thus: π_ik = 1 − u_ik − (1 − u_ik^α)^{1/α}, and the variance π*_i in the third term in Equation (13) is defined by: π*_i = (1/n) Σ_{k=1}^{n} π_ik. Notice that the Yager class parameter α in Equation (14) controls the effect of the local information and is crucial for the quality of the fuzzy clustering.

Kernel Intuitionistic FCM (KIFCM)
The Euclidean distance metric used in Equation (13) is valid under the assumption of uncorrelated clusters with the same spherical shape, which may not be true for real data. In addition, it is sensitive to perturbations and outliers. One way to address these problems is to use a different distance metric, such as the Mahalanobis distance. Another way is to use kernel functions to project the data into a higher-dimensional space so that the data can be separated more easily [34]. An advantage of the latter method is that the so-called kernel trick can be used to transform a linear algorithm into a nonlinear one using dot products [35]. To improve the segmentation accuracy and the robustness with respect to outliers, we utilize the kernel trick and replace the Euclidean distance with a kernel-induced distance. Accordingly, the objective function in Equation (13) can be rewritten as: where Φ is an implicit nonlinear mapping and ||Φ(x_k) − Φ(v_i)||^2 is the squared Euclidean distance between the mapped pixels Φ(x_k) and Φ(v_i) in the feature space, which can be calculated using the kernel function in the input space as follows: ||Φ(x_k) − Φ(v_i)||^2 = K(x_k, x_k) + K(v_i, v_i) − 2K(x_k, v_i), where K() is a kernel function.
A widely used kernel function is the Gaussian radial basis function (GRBF) [19]: K(x_k, v_i) = exp(−||x_k − v_i||^2 / σ^2), where the kernel parameter σ determines the spread of the kernel and ||x_k − v_i||^2 is the squared Euclidean distance between the pixels in the original space. Since K(x_k, x_k) = 1 for this kernel, Equation (17) can be rewritten as: ||Φ(x_k) − Φ(v_i)||^2 = 2(1 − K(x_k, v_i)). Substituting Equation (19) into Equation (16), we obtain our final objective function: Subject to the conditions required of U and V (see Section 2.2) and given the parameters m, σ, and α, Equation (20) is minimized by the membership degrees: and the centroids of the intuitionistic fuzzy clusters: We call this formulation the kernel-based Intuitionistic FCM (KIFCM) clustering.
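The kernel-induced distance is cheap to compute in the input space. The sketch below assumes the common form K(x, v) = exp(−||x − v||²/σ²) (the normalisation of σ is a modelling choice) and uses the identity ||Φ(x) − Φ(v)||² = 2(1 − K(x, v)) stated above:

```python
import math

def grbf(x, v, sigma):
    """Gaussian radial basis function kernel in the input space."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-d2 / sigma ** 2)

def kernel_distance2(x, v, sigma):
    """Squared feature-space distance ||Phi(x) - Phi(v)||^2 = 2(1 - K(x, v)),
    which holds because K(x, x) = 1 for the GRBF kernel."""
    return 2.0 * (1.0 - grbf(x, v, sigma))

print(kernel_distance2((0.0, 0.0), (0.0, 0.0), 1.0))  # identical points -> 0.0
```

Note that the kernel distance is bounded by 2, which is one source of the robustness to outliers mentioned above: a far-away outlier cannot dominate the objective the way an unbounded squared Euclidean distance can.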

The DNA Genetic Algorithm
In this section, we present a DNA genetic algorithm for the kernel-based intuitionistic fuzzy C-means clustering problem, which we call Algorithm 2: KIFCM-DNAGA. This algorithm is a specialization of the generic DNA-GA presented in Section 2.3.
Depending on the calculation of x̄_i in Step 9, there are variations of this algorithm. Specifically, if x̄_i is the average greyscale, we call the algorithm KIFCM1-DNAGA, and if x̄_i is the median greyscale, it is denoted as KIFCM2-DNAGA. We discuss more details in the following subsections.
KIFCM-DNAGA contains three nested loops. The outermost loop iterates the evolution process and is typically controlled by a user-specified constant, such as the number of iterations. Each of the inner loops is repeated for each of the n pixels. In the innermost loop, each pixel is compared to the centroids of the c fuzzy clusters, and the distance calculations involve the d dimensions. Thus, the complexity of this algorithm is O(cn^2 d). Notice that the kernel matrix is computed only once, at the very beginning of the program.
The detailed description of KIFCM-DNAGA is presented in Algorithm 2 below.

Algorithm 2: KIFCM-DNAGA (outline)
While the termination condition is not satisfied do
  for p ∈ POP do
    for every pixel x_k ∈ M
      Calculate x̄_i (Step 9)
    Calculate the fitness value for p using Equation (20)
  V = the V of the best-fit individual in POP
  NewPOP = apply the selection operator on POP (Step 14)
Return U_t and V

DNA Encoding and Decoding
Recall that if the values of the model parameters m, σ, and α are given, Equations (21) and (22) can be used to calculate the fitness value and the clustering (i.e., the membership degrees and cluster centroids). The optimal clustering is obtained provided that the optimal values of these parameters are given. As pointed out by Chaira [12], the values of the parameters m and α affect the performance of IFCM, and, as pointed out by Graves and Pedrycz [36], the value of the parameter σ in a kernel function affects the performance. Our algorithm is designed to find the optimal values of the parameters m, σ, and α, and to obtain the optimal clustering along the way.
In this algorithm, an individual, i.e., a potential solution, consists of values for three variables: v_1 representing the parameter σ, v_2 representing α, and v_3 representing m. We encode each variable as a string of n base-4 digits, where n is a system parameter and each digit represents a nucleotide base, for example, 0 for A, 1 for G, 2 for C, and 3 for T. Figure 2 shows an example of an encoded individual, where n = 6.
For decoding, each variable v_i is mapped into a decimal number as follows: cv_i = Σ_{j=1}^{n} bit(j) · 4^{n−j}, where bit(j) is the j-th digit from the left of the encoding segment for v_i. Depending on the bounds of the variable, the value of the variable is obtained as follows: v_i = l_i + cv_i (h_i − l_i)/(4^n − 1), where l_i and h_i are the lowest and the highest values of the variable v_i, and (h_i − l_i)/(4^n − 1) is the precision of the decoded value of v_i. For example, in Figure 2 the encoding of α (i.e., v_2) is the base-4 number 120132, representing GCAGTC. The corresponding decimal number is cv_2 = 1566. If the range of α is [1, 15], the decoded value of α is 6.35.

The Initialization
In Step 1, N individuals, encoded as described in the previous subsection, are randomly generated to form the initial population. For each nucleotide base in an individual's DNA strand, we randomly assign one of A, C, G, and T (in fact, the base-4 digit corresponding to that base). Unlike existing DNA-GAs, which assign A, C, G, and T with an equal probability of 0.25, our algorithm assigns A, C, G, and T with the differential probabilities 0.156, 0.343, 0.344, and 0.157, respectively. These probabilities are chosen according to the nature of biological structures [37].
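A sketch of this biased initialization, using the probabilities from the text (the strand length and population size below are arbitrary):

```python
import random

# Each base is drawn with the differential probabilities from the text
# rather than uniformly at 0.25.
BASES   = ("A", "C", "G", "T")
WEIGHTS = (0.156, 0.343, 0.344, 0.157)

def random_individual(strand_len, rng=random):
    """Generate one DNA strand with the biased base distribution."""
    return "".join(rng.choices(BASES, weights=WEIGHTS, k=strand_len))

population = [random_individual(18) for _ in range(40)]   # e.g., N = 40
```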

Selection Operator
In Step 14, 50% of the current population is selected to participate in the genetic operations that generate the new individuals of the next generation. In general, three types of selection methods, namely roulette wheel, ranking, and tournament, have been used in various DNA-GAs. To be specific, we describe the tournament selection here. In this method, pairs of individuals from the current population are randomly selected and, for each pair, the individual with the higher fitness value is selected. In addition, the so-called elitism strategy is also applied, which includes the best-fit individual among the selected individuals.
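A sketch of tournament selection with elitism, on a toy population where the fitness of an individual is simply its value (the helper names are ours):

```python
import random

def tournament_select(pop, fitness, k, rng=random):
    """Tournament selection with elitism: pick k individuals by pairwise
    tournaments, then ensure the overall best individual is included."""
    chosen = []
    for _ in range(k):
        a, b = rng.sample(pop, 2)                 # random pair
        chosen.append(a if fitness(a) >= fitness(b) else b)
    best = max(pop, key=fitness)                  # elitism strategy
    if best not in chosen:
        chosen[0] = best
    return chosen

# Toy use: select 50% of a population of numbers.
survivors = tournament_select(list(range(10)), lambda x: x, k=5)
```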

DNA Genetic Operators
In Step 15, different types of genetic operations are applied randomly to randomly selected individuals in the new population. These operations are applied in sequence to evolve the new generation and are discussed in the following subsections. Each of these operations generates offspring individuals from parent individuals and preserves the size of the generation. After an operator is applied to the parent individuals, the offspring individuals survive and participate in the next operation.

Crossover Operator
A crossover operator generates two offspring individuals from two randomly selected parent individuals by swapping and mixing the nucleotide bases of randomly selected DNA segments of the two parents. Figure 3 shows an example illustrating the crossover operator. To ease the presentation, only two variables are shown for each individual, and each variable has a sequence of 6 nucleotide bases. Assume that the shaded segments in the two parents are randomly selected. To generate the two offspring individuals, the corresponding nucleotide bases are swapped according to the "choose big" and "choose small" strategies [38]. Namely, offspring 1 is created by choosing at each nucleotide base the larger of the two parents' values, and offspring 2 the smaller. Consider the shaded segment in v_2 in the two parents. The sequences are 211 and 130. The larger digits at each position in these two sequences are, respectively, 2, 3, and 1. Similarly, the smaller digits are 1, 1, and 0. Thus, in this segment, offspring 1 has the sequence 231 and offspring 2 has the sequence 110.
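The "choose big"/"choose small" rule is easy to express on base-4 digit strings. The sketch below reproduces the worked example from the text, applying the rule to whole segments (in the algorithm it is applied only to the randomly selected segments):

```python
def choose_big_small(parent1, parent2):
    """'Choose big'/'choose small' crossover on equal-length digit strings:
    offspring 1 takes the larger digit at each position, offspring 2 the smaller."""
    big   = "".join(max(a, b) for a, b in zip(parent1, parent2))
    small = "".join(min(a, b) for a, b in zip(parent1, parent2))
    return big, small

# The worked example from the text: segments 211 and 130.
print(choose_big_small("211", "130"))   # ('231', '110')
```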


Mutation Operator
A mutation operator generates an offspring from a parent by randomly changing (mutating) the digit at a randomly selected nucleotide base.
Unlike existing GAs, which select the position to mutate with an equal probability, we adopt a strategy of shifting probabilities. Specifically, we divide the DNA strand of an individual into a high-bit section and a low-bit section, and mutate the digits in the different sections using the following probabilities, P_mh for the high-bit section and P_ml for the low-bit section: where a_1 is the initial mutation probability, b_1 is the range of the mutation probability, g is the evolution generation, g_0 is the generation where a greater mutation probability occurs, and c is the speed of change.
The idea is inspired by the DNA "hot spots" and "cold spots" in biology [39]. Nucleotide bases in "cold spots" mutate more slowly than those in "hot spots". To mimic this phenomenon, P_mh and P_ml are designed to change with the evolution stage. At the beginning of the evolution, P_mh is larger than P_ml. As the evolution progresses, P_mh is gradually reduced and P_ml increased.
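The exact shifting-probability formulas are not reproduced here. One plausible reading, in which P_mh and P_ml follow complementary logistic curves in the generation number g, is sketched below; the logistic form and all constants are our assumptions, not the paper's values:

```python
import math

def mutation_probs(g, a1=0.02, b1=0.2, g0=50, c=0.1):
    """Assumed shifting mutation schedule: the high-bit probability P_mh
    starts large and decays with generation g, while the low-bit P_ml grows."""
    s = 1.0 / (1.0 + math.exp(-c * (g - g0)))   # goes from 0 toward 1 as g grows
    p_mh = a1 + b1 * (1.0 - s)                  # high-bit section: shrinks
    p_ml = a1 + b1 * s                          # low-bit section: grows
    return p_mh, p_ml
```

Whatever the exact constants, the intended behaviour is the crossover of the two curves: early generations mutate the high bits (coarse search), late generations mutate the low bits (fine tuning).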

Reconstruction Operator
In Step 16, the reconstruction operator is applied to every individual in the new population with a probability P_r. The reconstruction operator generates an offspring by replacing each nucleotide base in a randomly selected segment of each variable by the complementary nucleotide base, namely, A and T replace each other, and so do G and C. This operation is inspired by the DNA double-helix complementary principle [40]. After the complementary operation is applied to the new generation, new random individuals are generated as needed to keep the population size at N.
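A sketch of the complementary replacement on a selected segment (the segment bounds here are arbitrary):

```python
# Reconstruction operator: replace each base in a selected segment by its
# Watson-Crick complement (A <-> T, G <-> C), keeping the rest unchanged.
COMPLEMENT = str.maketrans("AGCT", "TCGA")

def reconstruct(strand, start, end):
    """Complement the segment strand[start:end]."""
    return strand[:start] + strand[start:end].translate(COMPLEMENT) + strand[end:]

print(reconstruct("AAGGCCTT", 2, 6))   # 'AACCGGTT'
```

Applying the operator twice to the same segment restores the original strand, so the operation is an involution on the selected segment.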

Experiments and Results
In this section, we present results from our experiments.
Our algorithm and the compared algorithms are evaluated using three types of data: a set of UCI machine learning datasets [42], a set of synthetic MRI data, and a set of clinical MRI data. The UCI datasets include Haberman's Survival Data, Contraceptive Method Choice, Wisconsin Prognostic Breast Cancer, and SPECT Heart Data; their details are shown in Table 1. The synthetic brain MR images include a T1-weighted axial slice of 217 × 181 pixels corrupted with 7% noise and 20% grayscale non-uniformity, a T1-weighted sagittal slice of 181 × 217 pixels corrupted with 7% noise and 20% grayscale non-uniformity, and a T1-weighted axial slice of 217 × 181 pixels corrupted with 10% Rician noise.
The clinical brain MR images include two collections of datasets. The first collection is the dataset used by the MICCAI BRATS 2014 challenge [1]. This dataset contains 220 clinical samples, and each sample has a sequence of 155 T1-weighted axial slice images of 240 × 240 pixels. We randomly selected three samples (whose files are pat266_1, pat192_1, and pat013_1) and, among the 155 slice images of each selected sample, selected the one slice that contains the most details and features and has the best visual quality (specifically, the 80th, 86th, and 90th slice from the three samples, respectively). The second collection consists of MR images collected using a Philips Medical Systems Intera 3T instrument, obtained from the brain development project [43]. This collection contains 581 samples, and each sample contains 150 to 200 slices. We randomly selected a sample (the 160th sample) and a coronal slice (the 80th slice) of 512 × 300 pixels.
We measure the performance of the clustering algorithms using three metrics: the Jaccard Similarity (JS) [44], the adjusted mutual information (AMI), and the adjusted Rand index (ARI) [45].

•
The Jaccard Similarity is defined by JS(S_1, S_2) = |S_1 ∩ S_2| / |S_1 ∪ S_2|, where S_1 is the segmented volume and S_2 is the ground truth volume.

•
AMI is defined as follows: AMI(S_1, S_2) = (MI(S_1, S_2) − E(MI)) / (max(H(S_1), H(S_2)) − E(MI)), where H(S_1) and H(S_2) are the entropies of S_1 and S_2, respectively. The mutual information MI quantifies the information shared between the two random variables and can be defined using the entropy definitions; its expected value is E(MI) = ∑_Γ MI(Γ)P(Γ).

•
ARI is defined as follows: ARI = 2(N_00 N_11 − N_01 N_10) / ((N_00 + N_01)(N_01 + N_11) + (N_00 + N_10)(N_10 + N_11)), where N_11 denotes the number of pairs that are in the same cluster in both U and V, N_00 the number of pairs that are in different clusters in both U and V, N_01 the number of pairs that are in the same cluster in U but in different clusters in V, and N_10 the number of pairs that are in different clusters in U but in the same cluster in V.
These metrics are commonly used to compare the performance of clustering algorithms. For all three metrics, a higher value indicates a better clustering result, with the highest possible value being 1.
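Two of the three metrics can be sketched directly from the definitions above. The functions below follow the standard Jaccard and pair-counting ARI formulas; AMI is omitted because it is typically computed with a library routine rather than by hand:

```python
def jaccard(s1, s2):
    """JS(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2| over sets of pixel indices."""
    s1, s2 = set(s1), set(s2)
    return len(s1 & s2) / len(s1 | s2)

def adjusted_rand_index(n11, n00, n01, n10):
    """Pair-counting form of the ARI built from the N11/N00/N01/N10
    pair counts defined in the text."""
    numerator = 2.0 * (n00 * n11 - n01 * n10)
    denominator = ((n00 + n01) * (n01 + n11) +
                   (n00 + n10) * (n10 + n11))
    return numerator / denominator
```

For two identical clusterings, N_01 = N_10 = 0 and the ARI reduces to 1.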
We implemented the algorithms in MATLAB. The experiments were conducted using a neighborhood of 3 × 3 pixels. The termination conditions are a maximum number of iterations t = 100 and a convergence threshold ε = 0.001.

Results on UCI Datasets
In this experiment, we investigate how our algorithm performs for generic FCM clustering using the set of UCI machine learning datasets (see Table 1). The same UCI datasets are commonly used in machine learning research [36]. Because these are not image data, we set the brightness variance ϕ_k to 0 in these experiments. As shown in Table 2, KIFCM1-DNAGA obtains a higher AMI and ARI than the other algorithms, especially for datasets of higher complexity. Moreover, the algorithms using kernel functions perform better than traditional algorithms such as MICO and RSCFCM. The best results are marked in bold.

Results on Synthetic Brain MR Images
We performed several experiments using the Simulated Brain Database (SBD) [46], which contains a set of realistic MR volumes produced by an MR imaging simulator.These data come with the ground truth, i.e., pixels are labeled for CSF, GM, and WM regions.
The first experiment is to segment a T1-weighted axial slice with 217 × 181 pixels corrupted with 7% noise and 20% grayscale non-uniformity into WM, GM, and CSF. Figure 5 shows the segmentation results, and Table 3 summarizes the JS and average running times. According to Table 3, KIFCM1-DNAGA and KIFCM2-DNAGA outperform the other six algorithms. On average, the improvement in JS is about 1.3-4.8%. The best results are marked in bold.
The second experiment is to segment a T1-weighted sagittal slice with 181 × 217 pixels corrupted with 7% noise and 20% grayscale non-uniformity. This image is chosen to show the capability of preserving details. The segmentation results are shown in Figure 6 and the JS values are presented in Table 4. Based on Table 4, the average JS values of KIFCM1-DNAGA and KIFCM2-DNAGA are better than those of the other algorithms by about 1.4-8.9%. The best results are marked in bold.
The third experiment is to segment a T1-weighted axial slice (number 91) with 217 × 181 pixels corrupted with 10% Rician noise. Rician noise commonly affects MR images [47]. The segmentation results are shown in Figure 7 and the JS values with average running times are given in Table 5. Again, the average JS values of KIFCM1-DNAGA and KIFCM2-DNAGA are better than those of the other algorithms (by about 1.2-3.6%). The higher JS values imply that our algorithms do a better job of preserving image details in the presence of noise and grayscale inhomogeneity. The best results are marked in bold.

Results on Clinical Brain MR Images
In these experiments, we used MR images from two collections of clinical MRI datasets. We first used three T1-weighted axial slices (slice ranks 80, 86, and 90, denoted as Brats1, Brats2, and Brats3, respectively) of 240 × 240 pixels included in the MICCAI BRATS 2014 challenge. In these images, pixels with grayscales from black to white represent, respectively, the background, CSF, GM, and WM. It is worth noting that clustering is carried out only for CSF, GM, and WM, with the pathology region being treated as background. We then used a coronal slice (denoted as Brats4) with 512 × 300 pixels [43].
Figure 8 shows the results for Brats1 and Figure 9 shows the results for Brats2. The areas within the red circles are enlarged to show more detail. Figures 10 and 11 show the results for Brats3 and Brats4. Since the images come only with the ground truth for the pathology and do not have the ground truth for normal tissues, we also used the entropy-based metric [48] to measure the quality of the algorithms. This metric aims to maximize the uniformity of pixels within each segmented region and to minimize the uniformity across regions. The entropy of an image region j is defined as H(R_j) = −∑_{m ∈ V_j} (L_j(m)/S_j) log(L_j(m)/S_j), where V_j is the set of all possible grayscales in region j, L_j(m) is the number of pixels belonging to region j with grayscale m, and S_j is the area of region j. For a grayscale image I, the E measure is defined as E = ∑_j (S_j/S_I) H(R_j) − ∑_j (S_j/S_I) log(S_j/S_I), where S_I is the area of the whole image; the first term represents the
expected region entropy of the segmentation and the second term is the layout entropy. Note that a smaller E indicates a better segmentation. Table 6 shows the segmentation accuracy in terms of the E measure as well as the running times of the algorithms. The E values are calculated only for the WM, GM, and CSF regions. The best results are marked in bold. From Table 6, our algorithms achieve a smaller E than the other algorithms. Visually, GKFCM1 (Subfigure b in Figures 8-11) and GKFCM2 (Subfigure c in Figures 8-11) achieve good results but are unable to preserve fine details. For example, the CSF is incorrectly broken, possibly due to the difficulty of estimating the parameter η_j. On the other hand, FLICM (Subfigure d in Figures 8-11) and KWFLICM (Subfigure e in Figures 8-11) yield smooth areas but do not preserve fine details either. For example, the CSF in those results is smaller than the surrounding GM and WM, possibly because of the difficulty of estimating optimal values for the parameters G_ij and G'_ij. For the MICO algorithm (Subfigure f in Figures 8-11), the edges are not smooth enough, hence the CSF is not well preserved. For the RSCFCM algorithm (Subfigure g in Figures 8-11), many details are missing. In comparison, KIFCM1-DNAGA (Subfigure h in Figures 8-11) and KIFCM2-DNAGA (Subfigure i in Figures 8-11) show a good balance between smooth borders and image details. They achieve a better balance in preserving image details in the presence of different noise than the six other algorithms. Notably, even with a larger MR image, the proposed approach still performs well on both the E measure and the running time.
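The entropy-based E measure defined above can be sketched as follows, with each segmented region represented simply as the list of its pixel grayscales (the dictionary-of-regions interface is an illustrative choice, not the paper's implementation):

```python
import math
from collections import Counter

def region_entropy(pixels):
    """H(R_j) = -sum_m (L_j(m)/S_j) * log(L_j(m)/S_j), where L_j(m) is
    the number of pixels with grayscale m and S_j is the region area."""
    s_j = len(pixels)
    counts = Counter(pixels)
    return -sum((l / s_j) * math.log(l / s_j) for l in counts.values())

def e_measure(regions):
    """E = expected region entropy + layout entropy; smaller is better.
    `regions` maps a region label to the list of its pixel grayscales."""
    s_total = sum(len(p) for p in regions.values())
    expected = sum((len(p) / s_total) * region_entropy(p)
                   for p in regions.values())
    layout = -sum((len(p) / s_total) * math.log(len(p) / s_total)
                  for p in regions.values())
    return expected + layout
```

A perfectly uniform region contributes zero region entropy, so for such segmentations E reduces to the layout entropy alone.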

Discussion
In this section, we discuss and compare the computational complexity and running times of our algorithms and the other algorithms used in our experiments.
Table 7 summarizes the computational complexity of the eight algorithms. Assuming all algorithms go through exactly the same number of rounds of evolution, the complexity of these algorithms differs mainly in the number of times each pixel is processed within a single round. GKFCM requires O(cnd²) to compute the partition matrix in each iteration [36]. If the kernel matrix is computed at the beginning of each iteration, the complexity of the partition matrix is O(cnd). Each of FLICM, MICO, and RSCFCM requires O(cnd) per iteration, in which the kernel matrix requires the kernel function to be evaluated cn times. The complexity of the kernel algorithms depends on the kernel used; the complexity given in Table 7 is for the Gaussian radial basis function. It is also assumed that, for the KIFCM-DNAGA algorithm, the kernel matrix is computed once rather than at every iteration.
To compare the efficiency of the algorithms, we measured the running time (in seconds) averaged over 10 runs for the brain MR image experiments described in Section 6. For each run, the initial population was made identical for all algorithms. Figure 12 shows the results. As shown in Figure 12, GKFCM1 and GKFCM2 have the longest running times. This is perhaps due to the parameter η_j used by these algorithms, which needs an additional loop over all clusters to update the local contextual information. MICO comes next, with running times spanning 1.3 to 2.85 s. It performs fast calculations owing to the convexity of its energy function, particularly in the presence of less noise, but tends to need many iterations in the presence of high-level noise. RSCFCM spent about 1.05 to 2.95 s on the experiments. It uses a spatial fuzzy factor constructed from the posterior and prior probabilities that takes the spatial direction into account, which is likely to increase the computational cost. FLICM and KWFLICM have similar performance. FLICM uses the parameter G_ij, which needs an additional loop over the neighborhood of the current pixel to calculate the local information in every iteration, and KWFLICM uses the parameter G'_ij, which requires two additional loops to process the neighborhood. Finally, KIFCM1-DNAGA and KIFCM2-DNAGA both incorporate local contextual information obtained by LVC, which only needs to be calculated once. As a result, both algorithms require about 0.25-1.55 s to run the experiments, the lowest among the algorithms.

Conclusions
In this paper, we introduced a new formulation of the MRI segmentation problem as a kernel-based intuitionistic fuzzy C-means (KIFCM) clustering problem and proposed a new DNA-based genetic algorithm to learn the optimal KIFCM clustering. While this algorithm searches the solution space for the optimal model parameters, it also obtains the optimal clustering and, therefore, the optimal MRI segmentation. We performed an empirical study comparing our method with six state-of-the-art soft clustering methods using a set of UCI data mining datasets and a set of synthetic and clinical MRI data. Our preliminary results show that our method outperforms the other methods in both the clustering metrics (against the ground truth) and computational efficiency. In future work, we plan to further improve KIFCM-DNAGA and test it on a larger set of clinical MRIs.

Figure 1 .
Figure 1. An example of LIV and ϕ_k in an image. (a) A Rician noisy image; (b) Grayscales in the 6 × 6 white rectangle area from (a), with 3 neighborhood areas A, B, and C; (c) The LIV associated with each pixel, calculated using Equation (8); (d) The variance ϕ_k associated with each pixel, calculated using Equation (11).

Algorithm 2 :
KIFCM-DNAGA
Input: ε: convergence threshold; t: maximum number of iterations; c: the number of fuzzy clusters; K: a kernel function; M: an MRI image; N: the size of the population
Output: U: the membership matrix; V: the centroids of the fuzzy clusters
Method:
1. POP = create N randomly generated individuals encoded as DNA strings
2. t = 0; initialize U(0) and V
3.
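The overall loop of the algorithm can be sketched as below. This is only a hedged skeleton of the GA outer loop assembled from the operators described in this paper (selection, point mutation, and complement-based reconstruction, with the population padded back to size N); the decoding of DNA strands into KIFCM parameters and the KIFCM membership/centroid updates are omitted, and the rates and sizes are illustrative:

```python
import random

def dna_ga(fitness, strand_len=20, pop_size=10, generations=30, seed=0):
    """Minimal DNA-GA outer-loop sketch: keep the fitter half, produce
    children by point mutation plus complementing a short segment, and
    pad the population back to pop_size with random individuals."""
    rng = random.Random(seed)
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    rand_ind = lambda: "".join(rng.choice("ATGC") for _ in range(strand_len))
    pop = [rand_ind() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]      # elitist selection
        children = []
        for s in survivors:
            # point mutation at one random position
            i = rng.randrange(strand_len)
            child = s[:i] + rng.choice("ATGC") + s[i + 1:]
            # reconstruction: complement a random 3-base segment
            j = rng.randrange(strand_len - 2)
            child = (child[:j] +
                     "".join(comp[b] for b in child[j:j + 3]) +
                     child[j + 3:])
            children.append(child)
        pop = survivors + children
        while len(pop) < pop_size:            # keep population size at N
            pop.append(rand_ind())
    return max(pop, key=fitness)
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases across generations.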

Figure 2 .
Figure 2.An example of an encoded individual.


Figure 3 .
Figure 3. An example of the crossover operator.

Figure 4
shows an example of the reconstruction operator. On the left, the nucleotide symbols are shown for the parent and the offspring, and on the right the encoded base-4 numbers are shown for the same parent and offspring. Assume that the right-most 3 bases of each variable are randomly chosen for the operation based on Watson-Crick [41] pairs. The nucleotide bases ATG (encoded as 031) in v_1 and GCT (encoded as 123) in v_2 of the parent are changed to TAC (encoded as 302) and CGA (encoded as 210), respectively, in the offspring.
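The base-4 encoding can be sketched from the example above. The mapping A=0, G=1, C=2, T=3 is inferred from the figure's values (ATG → 031, GCT → 123) rather than stated explicitly in this text:

```python
# Base-4 encoding inferred from the Figure 4 example: A=0, G=1, C=2, T=3,
# so Watson-Crick complementation maps digit d to 3 - d.
ENCODE = {"A": "0", "G": "1", "C": "2", "T": "3"}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def encode(bases):
    """Translate a nucleotide string into its base-4 digit string."""
    return "".join(ENCODE[b] for b in bases)

def complement(bases):
    """Replace each base by its Watson-Crick complement."""
    return "".join(COMPLEMENT[b] for b in bases)
```

Under this mapping, complementing a strand is the same as replacing each digit d by 3 − d, which reproduces the 031 → 302 and 123 → 210 transitions in the figure.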


Figure 4 .
Figure 4. An example of the reconstruction operator.


Figure 12 .
Figure 12. Average running times for the experiments shown in subfigure (a) of Figures 5-11.


Table 1 .
The detailed features of the UCI datasets.

Table 2 .
Clustering results for UCI machine learning datasets.

Table 3 .
JS and running time on the T1-weighted axial slice from SBD with 7% noise and 20% grayscale non-uniformity.


Table 4 .
JS and running time on the T1-weighted sagittal slice from SBD with 7% noise and 20% grayscale non-uniformity.


Table 5 .
JS and running time on the T1-weighted axial slice (number 91) from SBD corrupted with 10% Rician noise.


Table 6 .
Segmentation performance in the E measure and running times on Brats1, Brats2, Brats3, and Brats4.