A Clustering Method Based on the Maximum Entropy Principle

Clustering is an unsupervised process to determine which unlabeled objects in a set share interesting properties. The objects are grouped into k subsets (clusters) whose elements optimize a proximity measure. Methods based on information theory have proven to be feasible alternatives. They are based on the assumption that a cluster is one subset with the minimal possible degree of "disorder". They attempt to minimize the entropy of each cluster. We propose a clustering method based on the maximum entropy principle. Such a method explores the space of all possible probability distributions of the data to find one that maximizes the entropy subject to extra conditions based on prior information about the clusters. The prior information is based on the assumption that the elements of a cluster are "similar" to each other in accordance with some statistical measure. As a consequence of such a principle, those distributions of high entropy that satisfy the conditions are favored over others. Searching the space to find the optimal distribution of objects in the clusters represents a hard combinatorial problem, which disallows the use of traditional optimization techniques. Genetic algorithms are a good alternative to solve this problem. We benchmark our method relative to the best theoretical performance, which is given by the Bayes classifier when data are normally distributed, and a multilayer perceptron network, which offers the best practical performance when data are not normal. In general, a supervised classification method will outperform a non-supervised one, since, in the first case, the elements of the classes are known a priori. In what follows, we show that our method's effectiveness is comparable to a supervised one. This clearly exhibits the superiority of our method.
Entropy 2015, 17 152


Introduction
Pattern recognition is a scientific discipline whose methods allow us to describe and classify objects. The descriptive process involves the symbolic representation of these objects through a numerical vector $\vec{x}$:

$$\vec{x} = [x_1, x_2, \ldots, x_n]$$

where its n components represent the values of the attributes of such objects. Given a set of objects ("dataset") X, there are two approaches to attempt the classification: (1) supervised; and (2) unsupervised.
In the unsupervised approach, no prior class information is used. Such an approach aims at finding a hypothesis about the structure of X based only on the similarity relationships among its elements. These relationships allow us to divide the space of X into k subsets, called clusters. A cluster is a collection of elements of X that are "similar" to one another and "dissimilar" to the elements belonging to other clusters. Usually, the similarity is defined by a metric or distance function $d : X \times X \to \mathbb{R}$ [1][2][3]. The process to find the appropriate clusters is typically denoted as a clustering method.

Clustering Methods
In the literature, many clustering methods have been proposed [4,5]. Most of them begin by defining a set of k centroids (one for each cluster) and associating each element in X with the nearest centroid based on a distance d. This process is repeated until the difference between centroids at iteration t and iteration t − 1 tends to zero or when some other optimality criterion is satisfied. Examples of this approach are k-means [6] and fuzzy c-means [7][8][9]. Other methods not following the previous approach are: (1) hierarchical clustering methods that produce a tree or a graph in which, during the clustering process, based on some similarity measure, the nodes (which represent a possible cluster) are merged or split; in addition to the distance measure, we must also have a stopping criterion to tell us whether the distance measure is sufficient to split a node or to merge two existing nodes [10][11][12]; (2) density clustering methods, which consider clusters as dense regions of objects in the dataset that are separated by regions of low density; they assume an implicit, optimal number of clusters and make no restrictions with respect to the form or size of the distribution [13]; and (3) meta-heuristic clustering methods that handle the clustering problem as a general optimization problem; these imply a greater flexibility of optimization algorithms, but also longer execution times [14,15].
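The centroid-based iteration described above can be sketched as follows. This is a minimal illustration of the k-means style of loop, not the implementation used in the paper; the function name `kmeans`, the explicit `init` centroids, the Euclidean distance and the tolerance `tol` are assumptions made for the example.

```python
import math

def kmeans(points, init, tol=1e-9, max_iter=100):
    """Assign each point to its nearest centroid, recompute centroids, repeat."""
    centroids = list(init)                      # hypothetical: caller supplies the start
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid under the Euclidean distance d
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep centroid of an empty cluster
        # Stop when the centroids at iteration t and t - 1 (almost) coincide
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return labels, centroids

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, init=[points[0], points[3]])
```

The result depends on the initial centroids, which is one reason why the quality of a centroid-based solution has to be assessed separately.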
We propose a numerical clustering method that lies in the group of meta-heuristic clustering methods, where the optimization criterion is based on information theory. Some methods with similar approaches have been proposed in the literature (see Section 1.6).

Determining the Optimal Value of k
In most clustering methods, the number k of clusters must be explicitly specified. Choosing an appropriate value has important effects on the clustering results. An adequate selection of k should consider the shape and density of the clusters desired. Although there is not a generally accepted approach to this problem, many methods attempting to solve it have been proposed [5,16,17]. In some clustering methods (such as hierarchical and density clustering), there is no need to explicitly specify k. Implicitly, its value is, nevertheless, still required: the cardinality or density of the clusters must be given. Hence, there is always the need to select a value of k. In what follows, we assume that the value of k is given a priori. We do not consider the problem of finding the optimal value of k; for that problem, see, for example, [18][19][20].

Evaluating the Clustering Process
Given a dataset X to be clustered and a value of k, the most natural way to find the clusters is to determine the similarity among the elements of X. As mentioned, usually, such similarity is established in terms of proximity through a metric or distance function. In the literature, there are many metrics [21] that allow us to find a variety of clustering solutions. The problem is how to determine if a certain solution is good or not. For instance, in Figures 1 and 2, we can see two datasets clustered with k = 2. For such datasets, there are several possible solutions, as shown in Figures 1b,c and 2b,c, respectively.
Frequently, the goodness of a clustering solution is quantified through validity indices [4,8,22,23]. The indices in the literature are classified into three categories: (1) external indices that are used to measure the extent to which cluster labels match externally-supplied class labels (F-measure [24], NMI measure [25], entropy [26], purity [27]); (2) internal indices, which are used to measure the goodness of a clustering structure with respect to the intrinsic information of the dataset ([8], [28][29][30][31]); and (3) relative indices that are used to compare two different clustering solutions or particular clusters (the RAND index [32] and adjusted RAND index [33]). Clusters are found and then assigned a validity index. We bypass the problem of qualifying the clusters and, rather, define a quality measure and then find the clusters that optimize it. This allows us to define the clustering process as follows:
Definition 1. Given a dataset X to be clustered and a value of k, clustering is a process that allows the partition of the space of X into k regions, such that the elements that belong to them optimize a validity index q.
Let $C_i$ be a partition of X and $\Pi = \{C_1, C_2, \ldots, C_k\}$ the set of such partitions that represents a possible clustering solution of X. We can define a validity index q as a function of the partition set Π, such that the clustering problem may be reduced to a constrained optimization problem in which q(Π) is optimized subject to constraints $g_i$ and $h_j$, likely based on prior information about the partition set Π (e.g., the desired properties of the clusters). We want to explore the space of those feasible partition sets and find one that optimizes a validity index, instead of doing an a posteriori evaluation of a partition set (obtained by any optimization criterion) through such an index. This approach allows us to find "the best clustering" within the limits of a validity index. To confirm such an assertion, we resort to the statistical evaluation of our method (see Section 5).

Finding the Best Partition of X
Finding the suitable partition set Π of X is a difficult problem. The total number of different partition sets of the form $\Pi = \{C_1, C_2, \ldots, C_k\}$ may be expressed by the function S(N, k) associated with the Stirling number of the second kind [34], which is given by:

$$S(N, k) = \frac{1}{k!} \sum_{i=0}^{k} (-1)^{i} \binom{k}{i} (k - i)^{N}$$

where N = |X|. For instance, for N = 50 and k = 2, $S(N, k) \approx 5.63 \times 10^{14}$. This number illustrates the complexity of the problem that we need to solve. Therefore, it is necessary to resort to a method that allows us to explore efficiently a large solution space. In the following section, we briefly describe some meta-heuristic searches.
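The figure quoted for N = 50 and k = 2 can be checked numerically with the standard inclusion-exclusion formula for Stirling numbers of the second kind. The helper name `stirling2` is chosen here for illustration only.

```python
from math import comb, factorial

def stirling2(n, k):
    """Number of ways to partition n labeled objects into k non-empty subsets."""
    total = sum((-1) ** i * comb(k, i) * (k - i) ** n for i in range(k + 1))
    return total // factorial(k)   # the sum is always divisible by k!

count = stirling2(50, 2)
print(count)   # 562949953421311, i.e. 2**49 - 1, about 5.63e14
```

Even for this small dataset and only two clusters, exhaustive enumeration is out of reach, which motivates a heuristic search.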

Choosing the Meta-Heuristic
During the last few decades, there has been a tendency to consider computationally-intensive methods that can search very large spaces of candidate solutions. Among the many methods that have arisen, we mention tabu search [35][36][37], simulated annealing [38,39], ant colony optimization [40], particle swarm optimization [41] and evolutionary computation [42]. Furthermore, among the many variations of evolutionary computation, we find evolutionary strategies [43], evolutionary programming [44], genetic programming [45] and genetic algorithms (GAs) [46]. All of these methods are used to find approximate solutions for complex optimization problems. It was proven that an elitist GA always converges to the global optimum [47]. Such a convergence, however, is not bounded in time, and the selection of the GA variation with the best dynamic behavior is very convenient. In this regard, we rely on the conclusions of previous analyses [48,49], which showed that a breed of GA, called the eclectic genetic algorithm (EGA), achieves the best relative performance. Such an algorithm incorporates the following:
(1) Full elitism over a set of size n of the last population. Given that, by generation t, the number of individuals tested is nt, the population in such a generation consists of the best n individuals.
(2) Deterministic selection as opposed to the traditional proportional selection operator. Such a scheme emphasizes genetic variety by imposing a strategy that enforces the crossover of predefined individuals. After sorting the individuals by fitness from best to worst, the i-th individual is combined with the (n − i + 1)-th individual.
(4) Random mutation of a percentage of bits of the population.
In Appendix A, we present the pseudocode of EGA. The reader can find detailed information in [48][49][50][51].
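The deterministic selection rule of EGA (the i-th best individual is combined with the (n − i + 1)-th) can be illustrated with a short sketch. It covers only the pairing step, not the full algorithm of Appendix A; the function name `deterministic_pairs` and the `maximize` flag are assumptions made for the example.

```python
def deterministic_pairs(fitness, maximize=True):
    """Pair the i-th best individual with the (n - i + 1)-th best one.

    Returns a list of (index_a, index_b) tuples referring to positions
    in the `fitness` list.
    """
    # Rank individuals from best to worst
    order = sorted(range(len(fitness)), key=lambda i: fitness[i], reverse=maximize)
    n = len(order)
    # 1-based rule i <-> n - i + 1 becomes i <-> n - 1 - i with 0-based indices
    return [(order[i], order[n - 1 - i]) for i in range(n // 2)]

fitness = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]
pairs = deterministic_pairs(fitness)
```

In this example the best individual (index 1) is paired with the worst (index 4), the second best with the second worst, and so on, so that every individual takes part in a crossover.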

Related Works
Our method is meta-heuristic (see Section 1.1). The main idea behind our method is to handle the clustering problem as a general optimization problem. There are various suitable meta-heuristic clustering methods. For instance, in [52][53][54], three different clustering methods based on differential evolution, ant colony and multi-objective programming, respectively, are proposed.
Both meta-heuristic and analytical methods (iterative, density-based, hierarchical) have objective functions, which are associated with a metric. In our method, the validity index becomes the objective function.
Our objective function is based on information theory. Several methods based on the optimization of information theoretical functions have been studied [55][56][57][58][59][60]. Typically, they optimize quantities such as entropy [26] and mutual information [61]. These quantities involve determining the probability distribution of the dataset in a non-parametric approach, which does not make assumptions about a particular probability distribution. The term non-parametric does not imply that such methods completely lack parameters; they aim to keep the number of assumptions as weak as possible (see [57,60]). Non-parametric approaches involve density functions, conditional density functions, regression functions or quantile functions in order to find the suitable distribution. Typically, such functions imply tuning parameters that determine the convergence of searching for the optimal distribution.
A common clustering method based on information theory is ENCLUS (entropy clustering) [62], which allows us to split iteratively the space of the dataset X in order to find those subspaces that minimize the entropy. The method is motivated by the fact that a subspace with clusters typically has lower entropy than a subspace without clusters. In order to split the space of X, the value of a resolution parameter must be defined. If such a value is too large, the method may not be able to capture the differences in entropy in different regions of the space.
Our proposal differs from traditional clustering methods (those based on minimizing a proximity measure, such as k-means or fuzzy c-means) in that, instead of finding those clusters that optimize such a measure and then defining a validity index to evaluate their quality, our method finds the clusters that optimize a purported validity index.
As mentioned, such a validity index is an information theoretical function. In this sense, our method is similar to the aforementioned methods based on information theory. However, it does not explicitly use a non-parametric approach to find the distribution of the dataset. It explores the space of all possible probability distributions to find one that optimizes our validity index. For this reason, the use of a suitable meta-heuristic is compulsory. With the exception of the parameters of such a meta-heuristic, our method does not resort to additional parameters to find the optimal distribution and, consequently, the optimal clustering solution.
Our validity index involves entropy. Usually, in the methods based on information theory, the entropy is interpreted as a "disorder" measure; thus, the obvious approach is to minimize such a measure. In our work, we propose to maximize it, in accordance with the maximum entropy principle.

Organization of the Paper
The rest of this work is organized as follows: In Section 2, we briefly discuss the concept of information content, entropy and the maximum entropy principle. Then, we approach such issues in the context of the clustering problem. In Section 3, we formalize the underlying ideas and describe the main line of our method (in what follows, clustering based on entropy (CBE)). In Section 4, we present a general description of the datasets. Such datasets were grouped into three categories: (1) synthetic Gaussian datasets; (2) synthetic non-Gaussian datasets; and (3) experimental datasets taken from the UCI database repository. In Section 5, we present the methodology to evaluate the effectiveness of CBE regardless of a validity index. In Section 6, we show the experimental results. We use synthetically-generated Gaussian datasets and apply a Bayes classifier (BC) [63][64][65]. We use the results as a standard for comparison due to its optimal behavior. Next, we consider synthetic non-Gaussian datasets. We show that CBE yields results comparable to those obtained by a multilayer perceptron network (MLP). We show that our (non-supervised) method's results are comparable to those of BC and MLP, even though these are supervised. The results of other clustering methods are also presented, including an information theoretic-based method (ENCLUS). Finally, in Section 7, we present our conclusions. We have included three appendices expounding important details.

Maximum Entropy Principle and Clustering
The so-called entropy [26] appeals to an evaluation of the information content of a random variable Y with possible values $\{y_1, y_2, \ldots, y_n\}$. From a statistical viewpoint, the information of the event $(Y = y_i)$ is inversely proportional to its likelihood. This information is denoted by $I(y_i)$, which can be expressed as:

$$I(y_i) = \log \frac{1}{p(y_i)} = -\log p(y_i)$$

From information theory [26,66], the entropy of Y is the expected value of I. It is given by:

$$H(Y) = \sum_{i=1}^{n} p(y_i) I(y_i) = -\sum_{i=1}^{n} p(y_i) \log p(y_i)$$

Typically, the log function may be taken to be $\log_2$, and then, the entropy is expressed in bits; otherwise, as ln, in which case the entropy is in nats. We will use $\log_2$ for the computations in this paper.
When $p(y_i)$ is uniformly distributed, the entropy of Y is maximal. This means that Y has the highest level of unpredictability and, thus, the maximal information content. Since entropy reflects the amount of "disorder" of a system, many methods employ some form of such a measure in the objective function of clustering. It is expected that each cluster has a low entropy, because data points in the same cluster should look similar [56][57][58][59][60].
We consider a stochastic system with a set of states $S = \{\Pi_1, \Pi_2, \ldots, \Pi_n\}$ whose probabilities are unknown (recall that $\Pi_i$ is a likely partition set of X of the form $\Pi = \{C_1, C_2, \ldots, C_k\}$). A possible assertion would be that the probabilities of all states are equal ($p(\Pi_1) = p(\Pi_2) = \ldots = p(\Pi_{n-1}) = p(\Pi_n)$), and therefore, the system has the maximum entropy. However, if we have additional knowledge, then we should be able to find a probability distribution that is better in the sense that it is less uncertain. This knowledge can consist of certain average values or bounds on the probability distribution of the states, which somehow define several conditions imposed upon such a distribution. Usually, there is an infinite number of probability models that satisfy these conditions.
The question is: which model should we choose? The answer lies in the maximum entropy principle, which may be stated as follows [67]: when an inference is made on the basis of incomplete information, it should be drawn from the probability distribution that maximizes the entropy, subject to the constraints on the distribution. As mentioned, a condition is an expected value of some measure about the probability distributions. Such a measure is one for which each of the states of the system has a value denoted by $g(\Pi_i)$.
For example, let X = {9, 10, 9, 2, 1} be a discrete dataset to be clustered with k = 2. We assume that the "optimal" labeling of the dataset is the one that is shown in Table 1.
The probability model of $\Pi^*$ is given by the conditional probabilities $p(x|C_i)$ that represent the likelihood that, when observing cluster $C_i$, we can find object x. Table 2 shows such probabilities. The entropy of X conditioned on the random variable C taking a certain value $C_i$ is denoted as $H(X|C = C_i)$. The total entropy of $\Pi^*$ is defined from these conditional entropies in Equation (6). Given the probability model of $\Pi^*$ and Equation (6), the total entropy of $\Pi^*$ is shown in Table 3. We may also calculate the mean and the standard deviation of the elements $x \in C_i$, the latter denoted as $\sigma(C_i)$, in order to define a quality index. Alternative partition sets are shown in Tables 4 and 5.
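The two quantities used in this example can be computed with a few lines of code. The sketch below rests on several assumptions that are not confirmed by the surviving text: the total entropy is taken as the unweighted sum of the per-cluster conditional entropies, $H(\Pi) = \sum_i H(X|C = C_i)$; the quality index is taken as the sum of the population standard deviations $\sigma(C_i)$; $p(x|C_i)$ is estimated as the relative frequency of x within $C_i$; and the labeling `pi_star` is the one that seems implied by the data. The partition `pi_alt` is purely hypothetical and is not one of the partitions of Tables 4 and 5.

```python
from collections import Counter
from math import log2, sqrt

def cluster_entropy(values):
    """H(X | C = C_i) in bits, with p(x | C_i) as relative frequency."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def cluster_std(values):
    """Population standard deviation sigma(C_i) of a one-dimensional cluster."""
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def evaluate(partition):
    """Return (total entropy, sum of standard deviations) of a partition set."""
    H = sum(cluster_entropy(c) for c in partition)
    g = sum(cluster_std(c) for c in partition)
    return H, g

pi_star = [[9, 10, 9], [2, 1]]   # assumed "optimal" labeling of X = {9, 10, 9, 2, 1}
pi_alt = [[9, 10], [9, 2, 1]]    # hypothetical alternative, for contrast only

H_star, g_star = evaluate(pi_star)
H_alt, g_alt = evaluate(pi_alt)
```

Under these assumptions the hypothetical alternative has a higher entropy but a much larger spread, which shows why entropy alone cannot select the partition and an additional condition is needed.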
In terms of maximum entropy, partition set $\Pi_2$ is better than $\Pi_1$. Indeed, we can say that $\Pi_2$ is as good as $\Pi^*$. If we assume that a cluster is a partition with the minimum standard deviation, then the partition set $\Pi^*$ is the best possible solution of the clustering problem. In this case, the minimum standard deviation represents a condition that may guide the search for the best solution. We may consider any other set of conditions depending on the desired properties of the clusters. The best solution will be the one with the maximum entropy that complies with the selected conditions.
In the example above, we knew the labels of the optimal class, and the problem became trivial. In clustering problems, there are no such labels, and thus, it is compulsory to find the optimal solution based solely on the prior information supplied by one or more conditions. We are facing an optimization problem of the form:

$$\text{Maximize: } H(\Pi) \quad \text{subject to: } g_i(\Pi) \le \epsilon_i \text{ for every condition } i, \quad \Pi \in S$$

where $\epsilon_i$ is the upper bound of the value of the i-th condition. We do not know the value of $\epsilon_i$, and thus, we do not have enough elements to determine compliant values of $g_i$. Based on prior knowledge, we infer whether the value of $g_i$ is required to be as small (or large) as possible. In our example, we postulated that $g_i(\Pi)$ (based on the standard deviation of the clusters) should be as small as possible. Here, $g_i(\Pi)$ becomes an optimization condition. Thus, we can redefine the above problem as:

$$\text{Maximize: } H(\Pi) \;\wedge\; \text{Minimize: } g_i(\Pi) \text{ for every condition } i \quad \text{subject to: } \Pi \in S \qquad (9)$$

which is a multi-objective optimization problem [68]. Without loss of generality, we can reduce the problem in Equation (9) to one of maximizing the entropy and minimizing the sum of the standard deviations of the clusters (for practical purposes, we choose the standard deviation; however, we may consider any other statistics, even higher-order statistics). The resulting optimization problem is given by:

$$\text{Maximize: } H(\Pi) \;\wedge\; \text{Minimize: } g(\Pi) \quad \text{subject to: } \Pi \in S \qquad (10)$$

where $g(\Pi)$ is the sum of the standard deviations of the clusters. In an n-dimensional space, the standard deviation of the cluster $C_i$ is a vector of the form $\vec{\sigma} = (\sigma_1, \sigma_2, \ldots, \sigma_n)$, where $\sigma_j$ is the standard deviation of the j-th dimension of the objects $\vec{x} \in C_i$. In general, the standard deviation $\sigma(C_i)$ of a cluster is calculated from this vector, and the corresponding $g(\Pi)$ is defined as the sum over the clusters:

$$g(\Pi) = \sum_{i=1}^{k} \sigma(C_i)$$

The entropy of the partition set Π, denoted by $H(\Pi)$, is given by Equation (6). The problem of Equation (10) is intractable via classical optimization methods [69,70]. The challenge is to simultaneously optimize all of the objectives. The tradeoff point is a Pareto-optimal solution [71]. Genetic algorithms are popular tools used to search for Pareto-optimal solutions [72,73].

Solving the Problem through EGA
Most multi-objective optimization algorithms use the concept of dominance in their search to find a Pareto-optimal solution. For each objective function, there exists one different optimal solution. An objective vector constructed with these individual optimal objective values constitutes the ideal objective vector of the form:

$$z^* = (f_1^*, f_2^*, \ldots, f_m^*)$$

where $f_i^*$ is the optimal value of the i-th objective function. Given two vectors z and w, it is said that z dominates w if each component of z is less than or equal to the corresponding component of w, and at least one component is smaller:

$$z \prec w \iff \forall i : z_i \le w_i \;\wedge\; \exists j : z_j < w_j$$

We use EGA to solve the problem of Equation (10).
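The dominance relation can be written directly as a predicate. The sketch below assumes that every objective is expressed as a minimization, so the entropy is negated before comparison; the helper names `dominates`, `objectives` and `pareto_front`, and the numeric candidates, are illustrative only.

```python
def dominates(z, w):
    """True if z is no worse than w in every component and better in at least one."""
    return (all(a <= b for a, b in zip(z, w))
            and any(a < b for a, b in zip(z, w)))

def objectives(H, g):
    """Map (entropy, spread) to a vector in which both components are minimized."""
    return (-H, g)   # maximizing H is the same as minimizing -H

def pareto_front(vectors):
    """Keep the vectors that are not dominated by any other vector."""
    return [z for z in vectors if not any(dominates(w, z) for w in vectors)]

# Hypothetical (H, g) values for three candidate partition sets
candidates = [objectives(1.92, 0.97), objectives(2.58, 4.06), objectives(1.50, 4.50)]
front = pareto_front(candidates)
```

Here the third candidate is dominated (lower entropy and larger spread than the others), while the first two are mutually non-dominated and remain on the front.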

Encoding a Clustering Solution
We established that a solution (an individual) is a random sequence of symbols s from the alphabet {1, 2, 3, ..., k}. Every symbol in s represents the cluster to which an object $\vec{x} \in X$ belongs. The length of s is given by the cardinality of X. An individual determines a feasible partition set Π of X. This is illustrated in Figure 3. From this encoding, EGA generates a population of feasible partition sets and evaluates them in accordance with their dominance. Evolution of the population takes place after the repeated application of the genetic operators, as described in [48,49].
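The encoding can be illustrated as follows: an individual is a list of cluster labels, one per object, and decoding it yields the partition set. The function names `random_individual` and `decode` are assumptions made for this sketch.

```python
import random

def random_individual(n_objects, k, rng):
    """A random sequence of labels from {1, ..., k}, one label per object."""
    return [rng.randint(1, k) for _ in range(n_objects)]

def decode(individual, X, k):
    """Turn a label sequence into the partition set [C_1, ..., C_k]."""
    clusters = [[] for _ in range(k)]
    for x, label in zip(X, individual):
        clusters[label - 1].append(x)   # labels are 1-based
    return clusters

X = [9, 10, 9, 2, 1]
individual = [1, 1, 1, 2, 2]            # one symbol per element of X
partition = decode(individual, X, k=2)  # [[9, 10, 9], [2, 1]]

rng = random.Random(0)
candidate = random_individual(len(X), 2, rng)
```

Note that a purely random individual may leave a cluster empty; how such individuals are treated is not specified here.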

Finding The Probability Distribution of the Elements of a Cluster
Based on the encoding of an individual of EGA, we can determine the elements $\vec{x}$ that belong to $C_i$ for all i = 1, 2, ..., k. To illustrate the way $p(\vec{x}|C_1)$ is calculated, we refer to the partition set shown in Figure 4. The shaded area represents the proportion of $\vec{x}$ that belongs to cluster $C_1$. We divide the probability space of cluster $C_1$ into a fixed number of quantiles (see Figure 5). The conditional proportion of $\vec{x}$ in $C_i$ is the proportion of the number of objects $\vec{x}$ in a given quantile. A one-dimensional case is illustrated in Figure 5. This idea may be extended to a multidimensional space, in which case a quantile is a multidimensional partition of the space of $C_i$, as is shown in Figure 6. In general, $p(\vec{x}|C_i)$ is the density of the quantile to which $\vec{x}$ belongs in terms of the percentage of data contained in it. We want to obtain quantiles that contain at most 0.01 percent (a proportion of 0.0001) of the elements. Thus, we divide the space of $C_i$ into 10,000 quantiles.
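A one-dimensional version of this estimate can be sketched as below. The sketch assumes that the "quantiles" are equal-width intervals spanning the range of the cluster, which the text does not state explicitly; the function name `conditional_proportions` and the parameter `n_bins` are likewise assumptions.

```python
def conditional_proportions(cluster, n_bins=10_000):
    """Estimate p(x | C_i) as the share of the cluster that falls in x's interval."""
    lo, hi = min(cluster), max(cluster)

    def bin_of(x):
        if hi == lo:                      # degenerate cluster: a single interval
            return 0
        # Equal-width intervals over [lo, hi]; the maximum goes in the last one
        return min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)

    counts = {}
    for x in cluster:
        b = bin_of(x)
        counts[b] = counts.get(b, 0) + 1
    n = len(cluster)
    return [counts[bin_of(x)] / n for x in cluster]

probs = conditional_proportions([9, 10, 9], n_bins=4)
```

For the small cluster {9, 10, 9} the two objects with value 9 share an interval and each receives 2/3, while the object with value 10 receives 1/3.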

Determining the Parameters of CBE
The EGA in CBE was executed with the following parameter values: $P_c = 0.90$, $P_m = 0.09$, population size = 100, generations = 800. These values are based on a preliminary study, which showed that, from a statistical viewpoint, EGA converges to the optimal solution around such values when the problems are demanding (those with a non-convex feasible space) [48,49]. The value of k in all experiments is known a priori from the dataset.

Datasets
We present a general description of the datasets used in our work. They comprise three categories: (1) synthetic Gaussian datasets; (2) synthetic non-Gaussian datasets; and (3) experimental datasets taken from the UCI database repository, i.e., real-world datasets. Important details about these categories are shown in Appendix B.

Synthetic Dataset
A supervised method is, in general, more effective than an unsupervised one. Knowing this, we make an explicit comparison between a supervised method and our own clustering algorithm. We will show that the performance of our method is comparable to that of a supervised method. This fact underlines the effectiveness of the proposed clustering algorithm. It is known that, given a dataset X divided into k classes whose elements are drawn from a normal distribution, a BC achieves the best solution relative to other classifiers [65]. Therefore, we will gauge the performance of CBE relative to BC.
Without the normality assumption, we also want to measure the performance of CBE relative to a suitable classifier; in this regard, we use MLP, which has been shown to be effective without making assumptions about the dataset.
Our claim is that if CBE performs satisfactorily when it is compared against a supervised method, it will also perform reasonably well relative to other clustering (non-supervised) methods. In order to support this claim, we generated both Gaussian and non-Gaussian synthetic datasets, as described in Appendix B. For each dataset, the class labels were known. They are meant to represent the expected clustering of the dataset. It is very important to stress that CBE is unsupervised, and hence, it does not require the set of class labels.

Real-World Datasets
Likewise, we also used a suite of datasets that represent "real-world" problems from the UCI repository, also described in Appendix B.3. We used an MLP as a practical approximation to the best expected classification of these datasets.

Methodology to Gauge the Effectiveness of a Clustering Method
In what follows, we present the methodology to determine the effectiveness of any clustering method (CM). We solve a large set of clustering problems by generating a sufficiently large supply of synthetic datasets. First, the datasets are made to distribute normally. Given the fact that BC will classify optimally when faced with such datasets, we will use them to compare a CM vs. BC. Afterwards, other datasets are generated, but they are no longer required to be Gaussian. We will use these last datasets to compare a CM vs. MLP. In both cases, we test a large number of classification problems to ensure that the behavior of a CM, relative to the best supervised algorithm (BC or MLP), will be statistically significant.
With the real-world datasets, the effectiveness of a CM is given by the number of matches between the cluster labels obtained by such a CM and the a priori class labels of the dataset.

Determining the Effectiveness Using Synthetic Gaussian Datasets
Given a labeled Gaussian dataset $X^*_i$ (i = 1, 2, ..., n), we use BC to classify $X^*_i$. The same set not including the labels will be denoted $X_i$. The classification yields $Y^*_i$, which will be used as the benchmarking reference. We may determine the effectiveness of any CM as follows:
(1) Obtain a sample $X^*_i$.
(2) Train the BC based on $X^*_i$.
(3) From the trained BC, obtain the "reference" labels for all $x^* \in X^*_i$ (denoted by $Y^*_i$).
(4) Train the CM with $X_i$ to obtain a clustering solution denoted as $Y_i$.
(5) Obtain the percentage of the number of elements of $Y_i$ that match those of $Y^*_i$.
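Step (5) can be sketched as follows. Because cluster labels produced by an unsupervised method are arbitrary, the sketch assumes that the labels of $Y_i$ are first aligned with those of $Y^*_i$ by trying every relabeling and keeping the best one; this alignment step and the name `matching_percentage` are assumptions, not something stated in the text. The exhaustive search over k! relabelings is only practical for small k.

```python
from itertools import permutations

def matching_percentage(reference, clustering, k):
    """Percentage of elements whose cluster label matches the reference label,
    under the relabeling of clusters that maximizes the agreement."""
    labels = list(range(1, k + 1))
    best = 0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))   # candidate cluster -> class relabeling
        hits = sum(1 for r, c in zip(reference, clustering) if mapping[c] == r)
        best = max(best, hits)
    return 100.0 * best / len(reference)

reference = [1, 1, 1, 2, 2, 2]     # hypothetical reference labels Y*_i
clustering = [2, 2, 1, 1, 1, 1]    # hypothetical clustering solution Y_i
score = matching_percentage(reference, clustering, k=2)
```

In this example swapping the two cluster labels makes five of the six elements agree with the reference.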

Determining the Effectiveness of Using Synthetic Non-Gaussian Datasets
Given a labeled non-Gaussian dataset $X^*_i$ for i = 1, 2, ..., N, we followed the same steps as described in Section 5.1, but we replaced the BC with an MLP. We are assuming (as discussed in [65]) that MLPs are the best practical choice as a classifier when data are not Gaussian.

Determining the Effectiveness Using Real-World Datasets
With the real-world datasets, the effectiveness of a CM is given by the percentage matching between cluster labels obtained by such a CM and the a priori class labels of the dataset.

Determining the Statistical Significance of the Effectiveness
We statistically evaluated the effectiveness of any CM using synthetic datasets (both Gaussian and non-Gaussian). In every case, we applied a CM to obtain $Y_i$. We denote the relative effectiveness as ϕ. We wrote a computer program that takes the following steps:
(1) A set of N = 36 datasets is randomly selected.
(2) Every one of these datasets is used to train a CM and BC or MLP, when data are Gaussian or non-Gaussian, respectively.
(3) $\phi_i$ is recorded for each problem.
(4) The average of all $\phi_i$ is calculated. Such an average is denoted as $\bar{\phi}_m$.
(5) Steps 1-4 are repeated until the $\bar{\phi}_m$ are normally distributed. The mean and standard deviation of the resulting normal distribution are denoted by $\mu_{\bar{\phi}}$ and $\sigma_{\bar{\phi}}$, respectively.
From the central limit theorem, $\bar{\phi}_m$ will be normally distributed for an appropriate number M of repetitions. Experimentally (see Appendix C), we found that an adequate value of M is given by Equation (15), where P is the probability that the experimental $\chi^2$ is less than or equal to 3.28, and there are five or more observations per quantile; a = 0.046213555, b = 12.40231200 and c = 0.195221110. For p ≤ 0.05, from Equation (15), we have that M ≥ 85. In other words, if M ≥ 85, the probability that $\bar{\phi}$ is normally distributed is better than 0.95. Ensuring that $\bar{\phi} \sim N(\mu_{\bar{\phi}}, \sigma_{\bar{\phi}})$, we can easily infer the mean µ and the standard deviation σ of the distribution of ϕ from:

$$\mu = \mu_{\bar{\phi}}, \qquad \sigma = \sqrt{N}\, \sigma_{\bar{\phi}}$$

From Chebyshev's inequality [74], the probability that the performance ϕ of a CM lies in the interval [µ − λσ, µ + λσ] is given by:

$$P(\mu - \lambda\sigma \le \phi \le \mu + \lambda\sigma) \ge 1 - \frac{1}{\lambda^2}$$

where λ denotes the number of standard deviations. By setting λ = 3.1623, we ensure that the values of ϕ will lie in this interval with probability p ≈ 0.9. Assuming a symmetric distribution of the ϕ values, an upper bound on ϕ that holds with probability p ≈ 0.95 is given by µ + λσ.
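The last two steps of this procedure (recovering µ and σ from the batch means and applying Chebyshev's inequality) can be sketched as follows. The relation σ = √N times the standard deviation of the batch means is the usual central limit theorem scaling and is assumed here; the function names and the sample batch means are hypothetical, and the expression for M is not reproduced.

```python
from math import sqrt
from statistics import mean, pstdev

def chebyshev_coverage(lam):
    """Lower bound on P(mu - lam*sigma <= phi <= mu + lam*sigma)."""
    return 1.0 - 1.0 / lam ** 2

def infer_population_stats(batch_means, batch_size):
    """Infer mean and standard deviation of phi from the means of batches of size N."""
    mu = mean(batch_means)
    sigma = pstdev(batch_means) * sqrt(batch_size)   # undo the 1/sqrt(N) shrinkage
    return mu, sigma

def upper_bound(mu, sigma, lam=sqrt(10)):
    """One-sided bound mu + lam*sigma; lam = sqrt(10) is about 3.1623."""
    return mu + lam * sigma

batch_means = [0.10, 0.12, 0.14]                     # hypothetical batch averages
mu, sigma = infer_population_stats(batch_means, batch_size=36)
coverage = chebyshev_coverage(sqrt(10))              # two-sided coverage, about 0.9
bound = upper_bound(mu, sigma)
```

With λ = √10 the two-sided coverage is 1 − 1/10 = 0.9; if the distribution is symmetric, at most 0.05 of the mass lies above the bound, which gives the one-sided 0.95 level.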

Results
In what follows, we show the results from the experimental procedure. We wanted to explore the behavior of our method relative to other clustering methods. We used a method based on information theory called ENCLUS and two other methods: k-means and fuzzy c-means. For all methods, we used the experimental methodology (with Gaussian and non-Gaussian datasets). All methods were made to solve the same problems.

Synthetic Gaussian Datasets
Table 6 shows the values of ϕ for Gaussian datasets. Lower values imply better closeness to the results achieved from BC (see Section 5.1). The Gaussian datasets were generated in three different arrangements: disjoint, overlapping and concentric. Figure 7 shows an example of such arrangements.
All four methods yield similar values for disjoint clusters (simple problem). However, important differences are found when tackling the overlapping and concentric datasets (hard problems). We can see that CBE is noticeably better.
Since BC exhibits the best theoretical effectiveness given Gaussian datasets, we have included the percentage behavior of the four methods relative to it in Table 7.

Synthetic Non-Gaussian Datasets
Table 8 shows the values of ϕ for non-Gaussian datasets. These do not have a particular arrangement (disjoint, overlapping or concentric). As before, lower values imply better closeness to the results achieved from MLP (see Section 5.2).
The values of CBE and ENCLUS are much better than those of the traditional clustering methods. Nevertheless, when compared to ENCLUS, CBE is 62.5% better.
Based on Section 5.3, we calculate the effectiveness for the same set of CMs on the real-world datasets. Here, we used MLP as a practical reference of the best expected classification. The results are shown in Table 10. Contrary to previous results, a greater value represents a higher effectiveness.

Conclusions
A new unsupervised classifier system (CBE) has been defined that uses entropy as a quality index (QI) in order to pinpoint those elements in the dataset that, collectively, optimize such an index. The effectiveness of CBE has been tested by solving a large number of synthetic problems. Since there is a large number of possible combinations of the elements in the space of the clusters, an eclectic genetic algorithm is used to iteratively find the assignments in a way that increases the intra- and inter-cluster entropy simultaneously. This algorithm is run for a fixed number of iterations and yields the best possible combination of elements given a preset number of clusters k. That the GA will eventually reach the global optimum is guaranteed by the fact that it is elitist. That it will approach the global optimum satisfactorily in a relatively small number of iterations (800) is supported by extensive tests performed elsewhere (see [48,49]).
We found that, when compared to BC's performance over Gaussian distributed datasets, CBE and BC have practically indistinguishable success ratios, showing that CBE is comparable to the best theoretical option. Here, we again stress that BC corresponds to supervised learning, whereas CBE does not. The advantage is evident. When compared to BC's performance over non-Gaussian sets, CBE, as expected, displayed a much better success ratio. The conclusions above have been reached for a statistical p-value of 0.05. In other words, the probability that such results persist on datasets outside our study is better than 0.95, which supports the reliability of CBE.
The performance value ϕ was calculated as follows: (1) a method was provided to produce an unlimited supply of data; (2) this method is able to yield Gaussian and non-Gaussian distributed data; (3) batches of N = 36 datasets were clustered; (4) for every one of the N Gaussian sets, the difference between our algorithm's classification and BC's classification (ϕ) was calculated; (5) for every batch, ϕ was recorded; (6) the process described in Steps 3, 4 and 5 was repeated until the recorded ϕ were normally distributed; (7) once the ϕ are normal, we know the mean and standard deviation of the means; (8) given these, we may infer the mean and standard deviation of the original pdf of the ϕs; (9) from Chebyshev's theorem, we may obtain the upper bound of ϕ with probability 0.95. From this methodology, we may establish a worst-case upper bound on the behavior of all of the analyzed algorithms. In other words, our conclusions are applicable to any clustering problem (even those outside of this study) with a probability better than 0.95.
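Steps 7 to 9 above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the code used in the experiments: the function name `chebyshev_upper_bound` is hypothetical, the ϕ values are random placeholders rather than clustering results, and the two-sided Chebyshev inequality is used to obtain a conservative one-sided bound.

```python
import math
import random
import statistics

N = 36  # datasets per batch, as in the text


def chebyshev_upper_bound(batch_means, batch_size=N, confidence=0.95):
    """Upper bound on phi that holds with probability >= confidence."""
    mu = statistics.fmean(batch_means)            # mean of the batch means
    sigma_means = statistics.stdev(batch_means)   # std. dev. of the batch means
    sigma = math.sqrt(batch_size) * sigma_means   # std. dev. of the original pdf
    # Chebyshev: P(|phi - mu| >= k * sigma) <= 1 / k**2
    k = 1.0 / math.sqrt(1.0 - confidence)
    return mu + k * sigma


# Placeholder phi values (assumption): uniform noise instead of real results.
rng = random.Random(0)
batch_means = [
    statistics.fmean([rng.uniform(0.0, 0.2) for _ in range(N)])
    for _ in range(85)
]
print(chebyshev_upper_bound(batch_means))
```

For a confidence of 0.95, the multiplier is k = 1/√0.05 ≈ 4.47, so the bound is deliberately loose: it holds for any distribution with the estimated mean and standard deviation.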
For completeness, we also tested the effectiveness of CBE by solving a set of "real world" problems. We used an MLP as the comparison criterion and measured the performance of CBE by how close its effectiveness comes to that of the MLP. Of all of the unsupervised algorithms that were evaluated, CBE achieved the best performance.
Here, two issues should be underlined: (1) Whereas BC and MLP are supervised, CBE is not. The distinct superiority of one method over the others, in this context, is evident; (2) As opposed to BC, CBE does not require the prior calculation of conditional probabilities, a marked advantage of CBE over BC. It is important to mention that our experiments include stringent tests comparing the behavior of other clustering methods. We compared the behavior of k-means, fuzzy c-means and ENCLUS against BC and MLP. The results of the comparison showed that CBE outperforms them (with 95% reliability).
For disjoint sets, which offer no major difficulty, all four methods (k-means, fuzzy c-means, ENCLUS and CBE) perform very well relative to BC. However, when tackling partially overlapping and concentric datasets, the differences are remarkable. The information-based methods (CBE and ENCLUS) showed the best results; nevertheless, CBE was better than ENCLUS. For non-Gaussian datasets, CBE and ENCLUS also outperform the other methods. When CBE is compared to ENCLUS, CBE is 62.5% better.
In conclusion, CBE is a non-supervised, highly efficient, universal (in the sense that it may be easily utilized with other quality indices) clustering technique whose effectiveness has been tested by tackling a very large number of problems (1,530,000 combined instances). It has been used, in practice, to solve complex clustering problems that other methods were not able to solve.
(1) The best (overall) n individuals are considered. The best and worst individuals (1 − n) are selected; then, the second-best and next-to-worst individuals (2 − [n − 1]) are selected, etc.
(2) Crossover is performed with a probability $P_c$. Annular crossover makes this operation position independent. Annular crossover allows for unbiased building block search, a central feature of the GA's strength. Two randomly selected individuals are represented as two rings (the parent individuals). Semi-rings of equal size are selected and interchanged to yield a set of offspring. Each parent contributes the same amount of information to its descendants.
(3) Mutation is performed with probability $P_m$. Mutation is uniform and, thus, is kept at very low levels. For efficiency purposes, we do not work with mutation probabilities for every independent bit. Rather, we work with the expected number of mutations, which is statistically equivalent to calculating mutation probabilities for every bit. Hence, the expected number of mutations is calculated from $\ell \cdot n \cdot P_m$, where $\ell$ is the length of the genome in bits and $n$ is the number of individuals in the population.
In what follows, we present the pseudocode of EGA:
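As an illustration of Steps 1 to 3 above, the following Python sketch implements one generation. It is a sketch under assumptions and not the authors' pseudocode: the names `annular_crossover` and `ega_generation` are hypothetical, genomes are lists of bits, the semi-ring is taken to be half of the genome, and elitism is assumed to keep the best n of the 2n parents and offspring.

```python
import random


def annular_crossover(a, b, rng):
    """Swap a semi-ring of length len(a) // 2 starting at a random locus."""
    length = len(a)
    start = rng.randrange(length)
    child_a, child_b = a[:], b[:]
    for offset in range(length // 2):
        i = (start + offset) % length  # wrap around: the genome is a ring
        child_a[i], child_b[i] = b[i], a[i]
    return child_a, child_b


def ega_generation(population, fitness, p_c, p_m, rng):
    """One generation: deterministic pairing, crossover, mutation, elitism."""
    n = len(population)  # assumed to be at least 2
    length = len(population[0])
    ranked = sorted(population, key=fitness, reverse=True)  # best first
    offspring = []
    for i in range(n // 2):
        a, b = ranked[i], ranked[n - 1 - i]  # 1 with n, 2 with n - 1, ...
        if rng.random() < p_c:
            a, b = annular_crossover(a, b, rng)
        offspring += [a[:], b[:]]  # copies, so parents are never mutated
    # Flip the expected number of bits, l * n * P_m, at random positions.
    for _ in range(round(length * n * p_m)):
        genome = offspring[rng.randrange(len(offspring))]
        genome[rng.randrange(length)] ^= 1
    # Elitism (assumption): keep the best n among parents and offspring.
    return sorted(ranked + offspring, key=fitness, reverse=True)[:n]


# Toy usage: maximize the number of ones in a 16-bit genome.
rng = random.Random(1)
population = [[rng.randint(0, 1) for _ in range(16)] for _ in range(10)]
for _ in range(50):
    population = ega_generation(population, sum, p_c=0.9, p_m=0.01, rng=rng)
print(max(sum(g) for g in population))
```

Because the parents always compete with their offspring for survival, the best fitness in the population can never decrease from one generation to the next.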

B.1. Synthetic Gaussian Datasets
To generate a Gaussian element $x = [x_1, x_2, \ldots, x_n]$, we use the acceptance-rejection method [75,76]. In this method, a uniformly distributed random point $(x_1, x_2, \ldots, x_n, y)$ is generated and accepted iff $y < f(x_1, x_2, \ldots, x_n)$, where $f$ is the Gaussian pdf. For instance, assuming $x \in \mathbb{R}^2$, Figure 8 shows different Gaussian sets obtained by applying this method. In the context of some of our experiments, $X$ is a set of $k$ Gaussian sets in $\mathbb{R}^n$. We wanted to test the effectiveness of our method with datasets that satisfy the following definitions: .k and i = j.
See Figure 9. In Table 12, we show the parameters of this process.
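The acceptance-rejection step described above can be sketched in Python as follows. This is a minimal illustration and not the generator used in the experiments: the names `gaussian_pdf` and `sample_gaussian` are hypothetical, the coordinates are assumed to be uncorrelated (no ρ), and candidate points are drawn from a box of ±4 standard deviations around the mean, which truncates the tails.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of an axis-aligned (uncorrelated) Gaussian at point x."""
    p = 1.0
    for xi, m, s in zip(x, mean, std):
        p *= math.exp(-0.5 * ((xi - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return p


def sample_gaussian(mean, std, rng, width=4.0):
    """Draw one point by acceptance-rejection inside a bounding box."""
    f_max = gaussian_pdf(mean, mean, std)  # the pdf peaks at the mean
    while True:
        # Uniform candidate (x_1, ..., x_n, y) in the box.
        x = [rng.uniform(m - width * s, m + width * s) for m, s in zip(mean, std)]
        y = rng.uniform(0.0, f_max)
        if y < gaussian_pdf(x, mean, std):  # accept iff y < f(x)
            return x


# One Gaussian set in two dimensions (parameter values are illustrative).
rng = random.Random(42)
cluster = [sample_gaussian([1.0, -2.0], [1.0, 0.5], rng) for _ in range(500)]
print(len(cluster))
```

The acceptance rate falls quickly as the dimension grows, since the box volume increases much faster than the mass under the pdf; this is a known cost of the method rather than a defect of the sketch.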

B.2. Synthetic Non-Gaussian Datasets
To generate non-Gaussian patterns in $\mathbb{R}^n$, we resort to polynomial functions of the form: Given that such functions have larger degrees of freedom, we can generate many points uniformly distributed in $\mathbb{R}^n$. As reported in the previous section, we wrote a computer program that allows us to obtain a set of 500 different problems. These problems were generated for random values of

B.3. "Real World" Datasets
In order to illustrate the performance of our method when faced with "real world" problems, we selected five datasets (Abalone [77], Cars [78], Census Income [79], Hepatitis [80] and Yeast [81]) from the UCI Machine Learning repository, whose properties are shown in Table 13. We chose datasets that represent classification problems where the class labels for each object are known. Then, we can determine the performance of any clustering method as an effectiveness percentage. The selection criteria of these datasets were based on the following features:
• Multidimensionality.
• Data with missing values.
Some of these features involve preprocessing tasks to guarantee the quality of a dataset. We applied the following preprocessing techniques:
(1) Categorical variables were encoded using dummy binary variables [82].
(3) In order to complete missing information, we interpolate the unknown values with natural splines (known to minimize the curvature of the approximant) [83].
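The two preprocessing techniques listed above can be sketched in Python as follows. This is an illustration under assumptions, not the preprocessing code used for the UCI datasets: the function names are hypothetical, one binary column is produced per category, missing values are marked with `None`, and the row index is assumed to be the independent variable of the spline.

```python
import bisect


def dummy_encode(values):
    """Map each categorical value to a 0/1 vector, one column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def natural_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Solve the tridiagonal system for the second-derivative coefficients c.
    l, mu, z = [1.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    b, c, d = [0.0] * n, [0.0] * (n + 1), [0.0] * n  # c[0] = c[n] = 0 (natural)
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        j = max(0, min(n - 1, bisect.bisect_right(xs, x) - 1))
        t = x - xs[j]
        return ys[j] + b[j] * t + c[j] * t ** 2 + d[j] * t ** 3

    return evaluate


def fill_missing(column):
    """Replace None entries by the spline through the known entries."""
    known = [(i, v) for i, v in enumerate(column) if v is not None]
    spline = natural_spline([i for i, _ in known], [v for _, v in known])
    return [v if v is not None else spline(i) for i, v in enumerate(column)]


print(dummy_encode(["M", "F", "I", "M"]))
print(fill_missing([1.0, 3.0, None, 7.0, 9.0]))
```

Using all categories as columns keeps the encoding symmetric; dropping one column per variable is an equally common convention and would be a one-line change.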

C. Ensuring Normality in an Experimental Distribution
We wish to experimentally find the parameters of the unknown probability density function (pdf) of a finite, but unbounded, dataset $X$. To do this, we must determine the minimum number of elements $M$ that we need to sample to ensure that our experimental approximation is correct with a desired probability $P$. We start by observing that a mean value is obtained from $X$ by averaging $N$ randomly selected values of the $x_j$, thus: $\bar{x}_i = \frac{1}{N}\sum_{j=1}^{N} x_j$. We know that any sampling distribution of the means $\bar{x}_i$ (sdm) will be normal with parameters $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$ when $N \to \infty$. In practice, it is customary to consider that a good approximation will be achieved when $N > 20$; hence, we decided to make $N = 36$. Now, all we need to determine the value of $M$ is to find sufficient $\bar{x}_i$'s ($i = 1, 2, \ldots, M$) until they distribute normally. Once this occurs, we immediately have the parameters $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$. The parameters of the unknown pdf are then easily calculated as $\mu = \mu_{\bar{x}}$ and $\sigma = \sqrt{N}\,\sigma_{\bar{x}}$. To determine that normality has been reached, we adopted the following strategy. We used a setting similar to the one used in the $\chi^2$ test in which: (1) We defined ten categories $d_i$ (deciles) and the corresponding 11 limiting values ($v_i$) assuming a normal distribution, such that one tenth of the observations are expected per decile: $\int_{v_i}^{v_{i+1}} N(\mu, \sigma)\,dx \approx 0.1$, $i = 0, 1, \ldots, 9$.
For this to be true, we make $v_0 = -5.000$, $v_1 = -1.285$, $v_2 = -0.845$, $v_3 = -0.530$, $v_4 = -0.255$, $v_5 = 0.000$ and the positive symmetrical values; (2) We calculated $\chi^2 = \sum_i (o_i - e_i)^2 / e_i$ for every $\bar{x}_i$, where $o_i$ is the observed number of events in the i-th decile and $e_i$ is the expected number of such events. For the k-th observed event, clearly, the number of expected events per decile is $k/10$; (3) We further require, as is usual, that there be at least $o_{min} = 5$ events per decile. Small values of the calculated $\chi^2$ indicate that there is a larger similarity between the hypothesized and the true pdfs. In this case, a small $\chi^2$ means that the observed behavior of the $\bar{x}_i$'s is closer to a normal distribution. The question we want to answer is: How small should $\chi^2$ be in order for us to ascertain normality? Remember that the $\chi^2$ test is designed to verify whether two distributions differ significantly, so that one may reject the null hypothesis, i.e., conclude that the two populations are not statistically equivalent. This happens for large values of $\chi^2$ and is a function of the degrees of freedom (df). In our case, df = 7. Therefore, if we wanted to be 95% certain that the observed $\bar{x}_i$'s were not normally distributed, we would demand that $\chi^2 \ge 14.0671$. However, this case is different. We want to ensure the likelihood that the observed behavior of the $\bar{x}_i$'s is normal. In order to do this, we performed a Monte Carlo experiment along the following lines. We set a desired probability $P$ that the $\bar{x}_i$'s are normal.
We establish a best desired value of $\chi^2$, which we will call $\chi_{best}$. We make the number of elements in the sample $N_S = 50$. We then generate $N_S$ instances of $N(0, 1)$ and count the number of times the value of the instance is in every decile. We calculate the value of the corresponding $\chi^2$ and store it. We thus calculate 100,000 combinations of size $N_S$. Out of these combinations, we count those for which $\chi^2 \le \chi_{best}$ and there are at least $o_{min}$ observations per decile. This number divided by 100,000, which we shall call $p$, is the experimental probability that, for $N_S$ observations, the sample "performs" as required. We repeat this process, increasing $N_S$ up to 100. In every instance, we test whether $p > P$. If such is the case, we decrement the value of $\chi_{best}$ and restart the process. Only when $p \le P$ does the process end.
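The decile test and one pass of the Monte Carlo experiment described above can be sketched in Python as follows. This is an illustration under assumptions, not the original experiment: the function names are hypothetical, the decile limits are computed from the standard normal quantiles (with open-ended outer deciles instead of the limits ±5.000), and far fewer than 100,000 combinations are generated by default to keep the run short.

```python
import bisect
import random
from statistics import NormalDist

# Nine interior decile limits of N(0, 1): about -1.28, -0.84, ..., 0.84, 1.28.
DECILE_LIMITS = [NormalDist().inv_cdf(i / 10) for i in range(1, 10)]


def decile_counts(sample):
    """Count how many values of the sample fall in each of the ten deciles."""
    counts = [0] * 10
    for x in sample:
        counts[bisect.bisect_right(DECILE_LIMITS, x)] += 1
    return counts


def chi_square(counts):
    """Chi-square statistic against equal expected counts per decile."""
    expected = sum(counts) / 10
    return sum((o - expected) ** 2 / expected for o in counts)


def acceptance_probability(sample_size, chi_best, o_min=5, trials=2000, rng=None):
    """Fraction of N(0, 1) samples with chi2 <= chi_best and >= o_min per decile."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(trials):
        counts = decile_counts([rng.gauss(0.0, 1.0) for _ in range(sample_size)])
        if chi_square(counts) <= chi_best and min(counts) >= o_min:
            hits += 1
    return hits / trials


# One point of the experiment: p for a given sample size and chi_best.
print(acceptance_probability(sample_size=85, chi_best=3.28))
```

Repeating the last call for sample sizes from 50 to 100, and lowering `chi_best` whenever the returned value exceeds the desired probability, reproduces the structure of the search described in the text.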
In Figure 12, we show the graph of the sample size $M$ vs. the experimental probability $p$ that $\chi^2 \le \chi_{best}$ and $o_i \ge o_{min}$ (from the Monte Carlo experiment).
As may be observed, $p < 0.05$ for $M \ge 85$, which says that the probability of obtaining $\chi^2 \le 3.28$ by chance alone is less than five in one hundred. Therefore, it is enough to obtain 85 or more $\bar{x}_i$'s (3060 $x_i$'s) to calculate $\mu$ and $\sigma$ and be 95% sure that the values of these parameters will be correct.

Figure 5. Example of the probability space of $C_i$ in $\mathbb{R}$.

Figure 6. Example of the probability space of $C_i$ in $\mathbb{R}^3$.

Figure 7. Example of possible arrangements of a Gaussian dataset.
Two possible examples of such datasets are illustrated for n = 2 and k = 3. The value ρ is the correlation coefficient.

Figure 12. Size of sample $M$ vs. $p$.

Table 4. Probability model and entropy of $\Pi_1$.

Table 5. Probability model and entropy of $\Pi_2$.

Table 8. Average effectiveness (ϕ) for non-Gaussian datasets. ENCLUS, entropy clustering; CBE, clustering based on entropy.
Since MLP is our reference for non-Gaussian datasets, Table 9 reports the performance of the four methods as a percentage relative to it.

Table 10. Effectiveness achieved by CBE and other clustering methods for experimental data.
As expected, the best value was achieved by MLP. The relative performance is shown in Table 11.

Table 11. Performance relative to MLP.

Table 12. Parameters of the generation process of datasets.

Table 13. Properties of the selected datasets.