A Novel Consensus Fuzzy K-Modes Clustering Using Coupling DNA-Chain-Hypergraph P System for Categorical Data

Abstract: In this paper, a data clustering method named consensus fuzzy k-modes clustering is proposed to improve clustering performance for categorical data. At the same time, a coupling DNA-chain-hypergraph P system is constructed to realize the clustering process. This P system can prevent the clustering algorithm from falling into local optima and realizes the clustering process with implicit parallelism. The consensus fuzzy k-modes algorithm combines the advantages of the fuzzy k-modes, weighted fuzzy k-modes and genetic fuzzy k-modes algorithms. The fuzzy k-modes algorithm realizes a soft partition, which is closer to reality, but it treats all variables equally. The weighted fuzzy k-modes algorithm introduces a weight vector that strengthens basic k-modes clustering by associating higher weights with features useful in the analysis. These two methods only improve the k-modes algorithm itself, so the genetic fuzzy k-modes algorithm was proposed, which uses genetic operations in the clustering process. In this paper, we examine these three kinds of k-modes algorithms and further introduce DNA genetic optimization operations in the final consensus process. Finally, we conduct experiments on seven UCI datasets and compare the clustering results with four other categorical clustering algorithms. The experimental and statistical test results show that our method obtains better clustering results than the compared algorithms.


Introduction
Data clustering has recently attracted increasing attention in practical applications. However, most studies concern numerical data, for which the distance between cluster centers and data objects is calculated with standard distance metrics. Many categorical datasets, in contrast, have no natural order or distance between attribute values. For example, in the real world, the blood-type attribute takes one of the discrete values A, B, O or AB. Therefore, clustering categorical data is a difficult and challenging task, and it attracts many data mining researchers.
In 1998, Huang [1] proposed the k-modes algorithm for categorical data clustering. The k-modes algorithm calculates the distance between an object and a cluster center by the Hamming distance instead of the Euclidean distance used in the k-means algorithm. Then, Huang [2] proposed the fuzzy k-modes algorithm (FKM), an extended version of the k-modes algorithm. Thereafter, many algorithms have been proposed for clustering categorical data [3]. These methods were mostly based on numerical data clustering algorithms, such as ROCK [4], CACTUS [5], COOLCAT [6], LIMBO [7], wk-modes [8], MOGA [9], NSGA-FMC [10] and SBC [11]. Membrane computing (the P system) abstracts computational models from the structure and function of biological cells and the cooperation between organs and tissues. Up to now, membrane computing mainly includes three basic models: cell-like P systems, tissue-like P systems and neuron-like P systems. In the process of calculation, each cell is regarded as an independent unit; the cells operate independently and do not interfere with each other, so the whole membrane system operates in a highly parallel mode. Research on P systems can be divided into theoretical study and application study. On the theoretical side, researchers use the membrane model directly to solve problems [23]; some new P system models have been proposed that improve computational power with fewer cells or spikes [24,25], and many other variants of membrane systems were proposed in [26-29]. In application research, membrane algorithms have been used to solve practical problems [30], and some researchers used coupled membrane systems to realize clustering [31,32]. These membrane systems are all improvements based on cell-like, tissue-like and neuron-like P systems.
These systems are all designed with simple structures and cannot handle problems with complicated structure; for example, the traditional P system cannot store multivariate data with complex relationships. Therefore, our team previously proposed P systems with simplicial complex [33] and chain structures. For instance, Liu and Xue established a new P system based on the simplicial complex structure in [33], Luan and Liu designed the chain P system [34], and Yan and Xue proposed the chain-hypergraph P system [35].
Based on the above analysis, we propose a novel hybrid DNA-chain-hypergraph P system to implement consensus clustering (DCHP-FCC). The DCHP-FCC system contains three reaction chain membrane subsystems and one consensus subsystem. Three different base clustering algorithms are used in the three subsystems, respectively; this combines the different advantages of the three algorithms, while a DNA genetic algorithm implements the consensus clustering process in the consensus system. Experiments are conducted on seven UCI datasets, and the results show that our proposed method outperforms state-of-the-art methods.
This work makes the following contributions: (1) A novel DCHP system is designed which combines the advantage of the chain structure and hypergraph topology structure. Three reaction chain membrane subsystems and one consensus subsystem are designed to generate the basic partitions and integrate basic partitions, respectively.
(2) A revised k-means which is optimized by the DNA genetic algorithm is used for a basic partition integration strategy which can optimize the initial clustering center and obtain the global optimal solutions.
(3) Simulation is performed using well-known datasets in the UCI machine learning repository to verify the clustering quality of the DCHP-FCC.
The rest of this paper is organized as follows. Section 2 introduces the basic concepts of the k-modes algorithm, consensus clustering and chain and hypergraph structure. The coupling DCHP system is illustrated in Section 3. Experiments and results are analyzed in Section 4. Section 5 summarizes conclusions and future research directions.

Three Basic Fuzzy K-Modes Clustering Algorithms
The fuzzy k-modes algorithm (FKM) was proposed by Huang and Ng [1] and is one of the most popular clustering algorithms for categorical data. It improves the k-modes algorithm by assigning each object a membership degree in every cluster.
Definition 1. A 4-tuple S = (U, A, V, f) is an information system, where U represents a non-empty finite set of objects, A refers to a non-empty finite set of attributes, V = ∪_{a∈A} V_a, where V_a records the domain of attribute a, and f : U × A → V is a total function such that f(u, a) ∈ V_a for every (u, a) ∈ U × A.
Let X be a dataset of n categorical objects. Each object x_i (1 ≤ i ≤ n) has p attributes, so that x_i = (x_{i1}, x_{i2}, ..., x_{ip}). The objective of the FKM is characterized as follows:

$$\min F(U, Z) = \sum_{j=1}^{k} \sum_{i=1}^{n} u_{ji}^{\alpha} \, d(x_i, z_j)$$

subject to:

$$u_{ji} \in [0, 1], \qquad \sum_{j=1}^{k} u_{ji} = 1 \ (1 \le i \le n), \qquad 0 < \sum_{i=1}^{n} u_{ji} < n \ (1 \le j \le k)$$

where α (>1) is the fuzziness weighting component, U = (u_{ji}) is a k × n matrix recording the fuzzy membership degrees, Z = {z_1, ..., z_k} is the set of cluster centers, and X = (X_1, X_2, ..., X_n) is the data matrix, where X_i is the ith point. d(x_i, z_j) is the distance between the object x_i and the cluster center z_j, calculated by the simple matching dissimilarity measure (Hamming distance):

$$d(x_i, z_j) = \sum_{l=1}^{p} \delta(x_{il}, z_{jl}), \qquad \delta(x_{il}, z_{jl}) = \begin{cases} 0, & x_{il} = z_{jl} \\ 1, & x_{il} \neq z_{jl} \end{cases}$$

Based on this scheme, a weight vector is added to the conventional fuzzy k-modes algorithm [2]: W = [w_1, w_2, ..., w_p], where w_l represents the weight of the lth variable, ∀l = 1, 2, ..., p. The objective function of the WFK-modes (WFKM) algorithm is then:

$$\min F(U, W, Z) = \sum_{j=1}^{k} \sum_{i=1}^{n} \sum_{l=1}^{p} u_{ji}^{\alpha} \, w_l^{\beta} \, \delta(x_{il}, z_{jl})$$

where β is the power of the attribute weight w_l, which is a measure of emphasis on the weights. Similar to the FKM algorithm, z_{jl} is the lth term of Z_j and x_{il} refers to the lth term of X_i.
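As an illustration, the matching dissimilarity and the membership update implied by the objective above can be sketched in code. This is a minimal sketch, not the authors' implementation; the function names and the default value of the fuzziness parameter `alpha` are our own choices.

```python
import numpy as np

def matching_dissimilarity(x, z):
    """Simple matching (Hamming) dissimilarity: count of mismatched attributes."""
    return sum(xi != zi for xi, zi in zip(x, z))

def update_memberships(X, Z, alpha=2.0):
    """One FKM membership update: u_ji = 1 / sum_h (d_j / d_h)^(1/(alpha-1)),
    with u_ji = 1 when object i coincides with mode j (zero distance)."""
    k, n = len(Z), len(X)
    U = np.zeros((k, n))
    for i, x in enumerate(X):
        d = [matching_dissimilarity(x, z) for z in Z]
        if 0 in d:                                   # object equals a mode
            U[d.index(0), i] = 1.0
        else:
            for j in range(k):
                U[j, i] = 1.0 / sum((d[j] / d[h]) ** (1.0 / (alpha - 1.0))
                                    for h in range(k))
    return U
```

Each column of `U` sums to one, as required by the constraint above; an object whose attributes exactly match a mode receives full membership in that cluster.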
In addition to improving the algorithm itself, optimization algorithms have also been used to enhance the FKM algorithm. For example, the genetic fuzzy k-modes algorithm (GFKM) integrates the genetic algorithm with the conventional fuzzy k-modes algorithm [36]. This algorithm has five basic steps: (1) string representation, (2) population initialization, (3) selection, (4) crossover, and (5) mutation. Optimizing the fuzzy k-modes algorithm with the genetic algorithm helps it approach the globally optimal solution and speeds up convergence.

Consensus Clustering

Consensus clustering is a framework that runs multiple algorithms, or the same algorithm under different parameters, and combines their results to obtain a better clustering. As shown in Figure 1, let X = {x_1, x_2, ..., x_n} represent the dataset; arbitrary clustering algorithms are applied p times to get p different basic partitions (BPs) π_1, π_2, ..., π_p [37]. The numbers of clusters K_1, K_2, ..., K_p in the different partitions are arbitrary. Each clustering result can be transferred to a corresponding binary-valued vector representation, where the binary value 1 indicates that sample i belongs to the cluster, and 0 otherwise. Then, a consensus function aggregates the BPs and obtains the final clustering result π. In this paper, a revised k-means algorithm optimized by a DNA genetic algorithm is used as the consensus function; the DNA genetic algorithm optimizes the initial cluster centers of the k-means algorithm. Clustering quality can be evaluated by common evaluation indicators, such as normalized mutual information, accuracy and F_measure. The consensus clustering result is typically better than that obtained by the best single BP. This process can be illustrated as a consensus function Γ mapping the set of basic partitions to the final partition: Γ : {π_1, π_2, ..., π_p} → π.
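The binary-valued vector representation described above can be sketched as a small helper that one-hot encodes each basic partition and stacks the encodings column-wise. This is a hypothetical illustration, not part of the authors' system; `bps_to_binary` is our own name.

```python
import numpy as np

def bps_to_binary(bps):
    """Stack the one-hot encodings of each basic partition column-wise:
    row i is the binary representation of sample i across all partitions."""
    blocks = []
    for labels in bps:                     # one basic partition per entry
        labels = np.asarray(labels)
        clusters = np.unique(labels)
        onehot = (labels[:, None] == clusters[None, :]).astype(int)
        blocks.append(onehot)
    return np.hstack(blocks)               # shape n x (K_1 + ... + K_p)

# two basic partitions of the same 4 samples, with K_1 = 2 and K_2 = 3
bps = [[0, 0, 1, 1], [0, 1, 1, 2]]
X2 = bps_to_binary(bps)
```

Each row of the resulting matrix contains exactly p ones, one per basic partition, matching the description of the binary dataset that the consensus function operates on.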

Chain and Hypergraph Structure
A simplicial complex S is a set of non-empty simplices s_1, s_2, ..., s_p. If s_1 ≺ s_2, then s_1 represents a vertex or face of the simplex s_2. Each simplex should also be oriented. An S-chain is therefore a simplicial complex with p-dimensional simplices, as defined in [33]. All simplices with the same orientation can be combined into a chain domain.
A hypergraph H = (v, e) is a graph whose edges can contain an arbitrary number of vertices [38-40]; v is the vertex set and e is the hyper-edge set. A hyper-edge can contain more than two vertices and can be formally represented by a non-empty subset of v, as shown in Figure 2. A hypergraph can represent complex information and local group information better than a traditional graph. In a traditional graph, there is only one edge between two vertices if they are similar; in a hypergraph, we can construct a hyper-edge connecting more than two vertices. Therefore, the local group information and complex relationships hidden in the data can be captured by the hypergraph model [41].
We can also represent the hypergraph H = (v, e) by an incidence matrix H, where H(v, e) = 1 if the hyper-edge e contains the vertex v, and H(v, e) = 0 otherwise. The hypergraph in Figure 2 can be expressed in this matrix form.
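A minimal sketch of this incidence-matrix construction follows; the function and the small example hypergraph are ours, for illustration only (they do not reproduce the specific hypergraph of Figure 2).

```python
import numpy as np

def incidence_matrix(vertices, hyperedges):
    """Build H with H[i, j] = 1 iff vertex i belongs to hyper-edge j."""
    H = np.zeros((len(vertices), len(hyperedges)), dtype=int)
    index = {v: i for i, v in enumerate(vertices)}
    for j, edge in enumerate(hyperedges):
        for v in edge:
            H[index[v], j] = 1
    return H

# a small hypergraph: the first hyper-edge joins three vertices at once,
# something an ordinary pairwise edge cannot express
V = ["v1", "v2", "v3", "v4"]
E = [{"v1", "v2", "v3"}, {"v3", "v4"}]
H = incidence_matrix(V, E)
```

A column with three or more ones is exactly the "local group" information the text attributes to hypergraphs: one hyper-edge covering a whole group of similar vertices.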

Coupling DNA-Chain-Hypergraph P System for Consensus Clustering (DCHP-FCC)
In this section, the coupling DCHP system is proposed. Firstly, the membrane structure of the DCHP system is introduced. Then, the different operations in the subsystem and consensus system are introduced, respectively. The flowchart of the proposed DCHP-FCC algorithm is shown in Figure 3.


(Figure 3 outlines the procedure: input the dataset; initialize the membrane structure; generate BPs with different algorithms in the different subsystems; transfer the BPs to the co-occurrence binary dataset X^(2); realize the consensus clustering with the k-means algorithm based on the DNA genetic algorithm (the consensus function); and output the clustering result.)

Membrane Structure of the DCHP System
The DCHP system has two main membrane structures: the chain membrane structure and the hyper-membrane structure. The basic framework of the chain membrane structure is shown in Figure 4, and the structure of the hyper-membrane structure is shown in Figure 5.

According to the basic membrane structures in Figures 4 and 5, a novel P system is designed for the consensus clustering process. The membrane structure of the DCHP system is shown in Figure 6.
• µ represents the membrane structure, including the structure of the chain membranes, hyper-membranes and consensus membrane;
• ω_1, ω_2, ..., ω_m are objects in O, representing the initial multisets in the m membranes at the beginning of the calculation. We denote the number of chain membranes by m_1, the number of hyper-membranes by m_2 and the number of membranes in the consensus system by m_3, with m_1 + m_2 + m_3 = m. λ means the membrane contains no object;
• subsys_i is a subsystem used to generate the basic partitions of the clustering; the three subsystems execute three kinds of clustering algorithm, respectively;
• consys is the consensus clustering membrane, used to generate the final clustering result;
• i_0 is the output membrane of the system.

The Consensus Clustering Realized with the DCHP System
To implement the consensus clustering of the fuzzy k-modes algorithm, we propose three kinds of subsystem (i.e., reaction chain-hyper P system, consensus system, and global DCHP system). As shown in Figure 6, three classic algorithms (FKM [1], WFKM [2,16] and GFKM [36]) are simultaneously implemented in the three subsystems. The fuzzy k-modes algorithm generates a fuzzy partition matrix for the categorical data and assigns each object a confidence in each cluster. We call this a soft partition, which is closer to reality than a hard partition. However, this method treats all the variables that determine cluster membership equally, while the situation in the real world is usually different. The WFKM algorithm introduces a weight vector into the traditional FKM algorithm; this modification associates higher weights with the features that are instrumental in recognizing the clustering pattern of the data. These two methods only improve the k-modes algorithm itself, and drawbacks remain in convergence speed and global optimization. To speed up convergence, the GFKM algorithm was proposed, which uses one-step crossover, mutation and selection processes within the clustering process.

Reaction Chain-Hypergraph P System in Subsystem
Initially, objects are randomly generated in the reaction chain-hypergraph P system; each object represents a set of cluster centers. Suppose the dataset X = {X_1, X_2, ..., X_n} ⊆ R^{n×d} has K clusters, where n is the number of data points and d is the dimension. For each subsystem, K_i (i = 1, 2, 3) of the n data points are randomly selected as the initial cluster centers, denoted as Z = {z_1, z_2, ..., z_{K_i}}, which forms the initial string in the reaction chain-hypergraph P system. Different reaction chain-hypergraph P systems in the same subsystem conduct the same k-modes algorithm, while different subsystems conduct different k-modes algorithms with different parameters in parallel.
Then, each reaction chain-hypergraph P system generates a clustering result with a different cluster number K_i. The clustering result π_i is transferred to the corresponding subsystem. These results are also called basic partitions (BPs). Next, the object co-occurrence strategy is applied to these BPs (details of this method can be found in [42]), and the BPs are transformed into the binary dataset X^(2).

Local Communication Membrane System
Afterwards, the transformed binary dataset X^(2) is transferred to the consensus system by the communication rule (p, a_k/λ, q), where a_k is the basic partition result in the subsystem, λ means there is no object in the membrane, p is the subsystem, and q is the consensus system.

Consensus System
A revised k-means algorithm optimized by the DNA genetic algorithm is used as the consensus function; the DNA genetic algorithm optimizes the initial cluster centers of the k-means algorithm. When the dataset X^(2) appears in the consensus system, its values are already in binary form, so they can be used directly as the initial population. The specific selection, crossover and mutation processes of the DNA genetic algorithm are described as follows:
(1) Selection operation. The optimal individuals within the top 10% of fitness are directly inherited to the next generation, and the remaining individuals are selected according to the roulette-wheel selection strategy.
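The selection step (10% elitism plus roulette-wheel sampling) can be sketched as follows. This is a hedged illustration, not the paper's code; the function shape and the `elite_frac` parameter are our assumptions.

```python
import random

def select(population, fitness, elite_frac=0.10):
    """Keep the top elite_frac individuals unchanged, then fill the rest of
    the next generation by fitness-proportionate (roulette-wheel) sampling."""
    n = len(population)
    n_elite = max(1, int(n * elite_frac))
    ranked = sorted(range(n), key=lambda i: fitness[i], reverse=True)
    next_gen = [population[i] for i in ranked[:n_elite]]   # elitism
    total = sum(fitness)
    while len(next_gen) < n:
        r, acc = random.uniform(0, total), 0.0
        for ind, f in zip(population, fitness):
            acc += f                                       # spin the wheel
            if acc >= r:
                next_gen.append(ind)
                break
    return next_gen
```

Elitism guarantees the best individual of each generation survives, which complements the adaptive mutation rule below that never mutates the fittest individual.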
(2) Crossover and mutation operation. One-point crossover and adaptive mutation are used in this step, respectively. The crossover probability is set as P_c. The mutation probability is adjusted according to the fitness of the individual:

$$P_m = \begin{cases} k \dfrac{f_{max} - f}{f_{max} - f_{avg}}, & f \ge f_{avg} \\ k, & f < f_{avg} \end{cases}$$

where P_m is the mutation probability, k is the base mutation rate, f_max refers to the maximum fitness in each generation, f_avg is the average fitness in each generation, and f is the fitness of the individual. If the fitness equals the maximum fitness, the mutation probability is 0; this guarantees that the optimal individual is not changed by the mutation operation. The absolute-deviation criterion below is used to measure the clustering quality:

$$E = \sum_{k=1}^{K} \sum_{x_i \in C_k} dist(x_i, o_k)$$

where x_i is data point i in the dataset X^(2), o_k is the cluster center of C_k, and dist(x_i, o_k) is the Euclidean distance between the data point x_i and the corresponding cluster center o_k. The computation stops when the predefined maximum number of iterations is reached or the difference between two adjacent iterations is less than a given threshold ε. Then, the DCHP system halts, and the final results are output in the membrane i_0.
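One common adaptive-mutation form consistent with the description above (the probability shrinks to zero for the generation's fittest individual), together with the absolute-deviation criterion, can be sketched as follows. The base rate `pm0` and the exact formula are assumptions reconstructed from the text, not the authors' published constants.

```python
import math

def adaptive_pm(f, f_max, f_avg, pm0=0.01):
    """Adaptive mutation probability: individuals at the generation's maximum
    fitness are never mutated; below-average individuals use the base rate."""
    if f >= f_avg and f_max > f_avg:
        return pm0 * (f_max - f) / (f_max - f_avg)
    return pm0

def absolute_deviation(X, labels, centers):
    """Sum of Euclidean distances from each point to its cluster center."""
    return sum(math.dist(x, centers[k]) for x, k in zip(X, labels))
```

In a run, `absolute_deviation` would serve as the quantity compared across adjacent iterations against the threshold ε to decide when the consensus system halts.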

Data Sets and Parameter Settings
Seven datasets collected from the UCI Machine Learning Repository [43] are used in this section; the data type of all seven datasets is categorical. These datasets are also used by the comparison algorithms. Although the comparison algorithms used several other datasets in addition to these seven, those datasets gave poor results in the comparison algorithms and are therefore not used as experimental data in this paper. Table 1 shows detailed information about these datasets. To generate the BPs, three different clustering algorithms are used. The number of clusters K_i (i = 1, 2, 3) in the BP-generation process is set to values between K_a and √n, where K_a is the actual number of clusters of the dataset. The cluster number of the consensus clustering process is set as K_a. All experiments are simulated in MATLAB R2014b running on the 64-bit edition of Windows 7, on a PC with an Intel Core i7-4770 3.4 GHz CPU and 8 GB of RAM.

Evaluation Metric
To evaluate the performance of the DCHP-FCC algorithm, three external clustering evaluation metrics are selected [44]: the Adjusted Rand Index (ARI), Clustering Accuracy (ACC) and F_measure (F).
The ARI is defined as follows:

$$ARI = \frac{2(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)}$$

where T represents the pre-determined clustering labels and C represents the labels of the clustering result. a, b, c and d count the object pairs that are: (1) in the same class in both T and C, (2) in the same class in T but not in C, (3) in the same class in C but not in T, and (4) in different classes in both C and T, respectively. The ACC can be calculated by:

$$ACC = \frac{1}{n} \sum_{i=1}^{n} \delta(l_i, map(g_i))$$

where n is the number of data points, g_i represents the cluster label of data point x_i obtained by the algorithm, and l_i is its true class label. map(·) is the mapping function that maps the cluster labels obtained by the algorithm to the real class labels. When l_i = map(g_i), the value of δ(l_i, map(g_i)) equals 1, and otherwise it equals 0. The F_measure is defined as:

$$F = \sum_{l} \frac{s_l}{n} \max_{k} \frac{2 P(k, l) R(k, l)}{P(k, l) + R(k, l)}$$

where P(k, l) = s_kl / s_k and R(k, l) = s_kl / s_l. P(k, l) refers to the precision of cluster k with respect to class l, and R(k, l) represents the recall of cluster k with respect to class l. s_kl refers to the number of data points that belong to both cluster k and class l, s_k is the number of data points in cluster k, and s_l is the number of data points in class l.
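The pair-counting version of the ARI described above can be sketched in a few lines. This is an illustrative implementation of one common pair-counting form of the index, consistent with the a, b, c, d definitions in the text, and not necessarily the exact variant used in [44].

```python
from itertools import combinations

def pair_counts(T, C):
    """Count object pairs by their agreement in true labels T and clustering C."""
    a = b = c = d = 0
    for i, j in combinations(range(len(T)), 2):
        same_t, same_c = T[i] == T[j], C[i] == C[j]
        if same_t and same_c:
            a += 1          # together in both T and C
        elif same_t:
            b += 1          # together in T only
        elif same_c:
            c += 1          # together in C only
        else:
            d += 1          # separated in both
    return a, b, c, d

def ari(T, C):
    a, b, c, d = pair_counts(T, C)
    return 2.0 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))
```

Identical partitions score 1, while a clustering that systematically disagrees with the true labels scores below zero, which is why the ARI columns in Tables 2-4 can be negative for poor clusterings.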

Experiment Results and Analysis
In this section, we first select the number of BPs. According to the analysis in previous research [37], the clustering effect gradually stabilizes when the number of BPs is around 50. In this paper, we also need to determine the number of BPs, but we do not use the same BP-generation strategy as the other methods: here, BPs are generated anew in each run. At the same time, considering the structure of the DCHP system and the characteristics of the three basic clustering algorithms, we need to guarantee that each algorithm generates the same number of BPs, so the final number of BPs is a multiple of 3. Taking into account that the clustering effect is best when the number of BPs stays around 50, we conduct preliminary experiments with 30, 60 and 90 BPs, respectively. Each experiment is run 30 times, and the mean and variance of the results are recorded in each case. The experimental results in Table 2 show that the clustering effect is best when the number of BPs is 60. Next, the DCHP-FCC algorithm is compared with the three basic clustering algorithms and one improved algorithm, IWFKM, which was proposed in [16]. In order to preserve the originality of the benchmark algorithms, this paper obtains their results with their own implementations and parameters, which are shown in Table 3. The mutation probability in the GFKM algorithm is also set as 0.01. For comparison, the maximum number of iterations of the comparison algorithms is 100.
Every experiment is also run 30 times, and the mean and variance of the results are recorded in each case. The experimental results are shown in Table 4. As can be seen from Table 4, the performance of DCHP-FCC is much better than the compared algorithms on the Soybean-small, Spect heart, Voting, Zoo and Mushroom datasets, and partly better on the Tic-tac-toe and Breast cancer datasets. For Tic-tac-toe and Breast cancer, the DCHP-FCC algorithm obtains the best ARI value, while the GFKM algorithm attains the best ACC and F_measure values. Even where DCHP-FCC does not perform best, its metric values are only about 0.01 lower than the best. So, the DCHP-FCC algorithm is more effective for categorical clustering than the compared clustering algorithms.

Significance Testing
In this section, hypothesis tests are computed on the results over the seven UCI datasets between the DCHP-FCC algorithm and the four comparison clustering algorithms. The results are shown in Tables 5-7, with the significance level set at p < 0.05. In Tables 5-7, the symbol '+' indicates that the difference between the DCHP-FCC algorithm and the compared algorithm is significant, and the symbol '−' indicates that it is not. We can see from Tables 5-7 that the results of the hypothesis tests are almost all '+', which shows a clear difference between the algorithm in this paper and the comparison algorithms.

Conclusions
In this paper, we propose a novel P system (DCHP) with a hybrid structure that combines the advantages of the chain structure and the hypergraph topology structure for consensus fuzzy k-modes clustering. The DCHP system has three subsystems and one consensus system. The subsystems are used to generate three kinds of basic partitions, respectively, and the consensus system is used to realize the consensus clustering process with evolution operations. The DCHP-FCC algorithm is compared with four k-modes clustering algorithms on seven UCI datasets using three performance validation indices: ARI, ACC and F_measure. The experimental results show that the DCHP-FCC algorithm obtains better clustering results than the compared algorithms.
There are several ways to continue this study in the future. Firstly, we could use different basic clustering algorithms in the consensus clustering or design a new P system structure instead of the DCHP system. Secondly, algorithms that can identify the optimal number of clusters for categorical data should be investigated for the consensus clustering process. In addition, other clustering methods could be used in the consensus clustering process.