Abstract
We introduce a new approach to clustering categorical data: Condorcet clustering with a fixed number of groups, denoted α-Condorcet. Like k-modes, this approach is essentially based on similarity and dissimilarity measures. The paper is divided into three parts: first, we propose a new Condorcet criterion with a fixed number of groups, used to assign cases to clusters. In the second part, we propose a heuristic algorithm to carry out the task. In the third part, we compare α-Condorcet clustering with k-modes clustering. The comparison is made using a quality index, the accuracy of measurement, and a within-cluster sum-of-squares index. Our findings are illustrated using real datasets: the feline dataset and the US Census 1990 dataset.
Keywords:
k-modes; Condorcet clustering; categorical data; quality index; within-cluster sum-of-squares index
MSC:
62H30; 91C20
1. Introduction
In 1909, Jan Czekanowski proposed the first clustering method [1]. This kind of method has become fundamental to many branches of statistics and the social sciences. With clustering, we seek to classify a set of objects into relatively homogeneous groups, usually referred to as clusters. That is, for a given dataset, the goal of a cluster analysis is to define a set of clusters and to assign to each of them the observations that are close to each other according to some distance or similarity measure, while observations in different clusters are far from each other. There are ongoing discussions surrounding the best clustering method, as one can gather from the large number of review articles (see, for example, [2,3,4,5,6]). Many authors have proposed different clustering algorithms, and most techniques and algorithms deal with quantitative data. However, categorical data are common, particularly in the social sciences [7,8,9,10,11,12]. As such, applying clustering methods to categorical data is important, and methods have been proposed to deal with these types of data. An extension of the k-means approach to clustering, k-modes clustering [13], is prominent among these. In this paper, we present a novel algorithm to group qualitative data: an extension of Condorcet clustering [14]. We demonstrate that, with a fixed number of clusters, a unique partition of the data can be achieved by maximizing a Condorcet criterion [14]. We developed a heuristic algorithm that proved to be very useful. Moreover, an adjustment rate index was used to evaluate the quality of the partitions of k-modes and α-Condorcet on the basis of real datasets. The rest of the paper is organized as follows: in Section 2, we present some related work. In Section 3, we introduce some relevant concepts and definitions. In Section 4, we present some theoretical results. The clustering algorithm is presented in Section 5. Using real data, in Section 6, we compare α-Condorcet clustering to k-modes clustering. Finally, our concluding remarks are given in Section 7.
2. Related Work
Clusters may be regarded as crisp or fuzzy. In fuzzy clustering, an observation may belong to more than one cluster with given probabilities, whereas in crisp clustering, an observation belongs to one and only one cluster. Most clustering algorithms, but not all, may be classified into two categories: partitioning and hierarchical algorithms.
k-means is prominent among the partitional methods, and is one of the most popular techniques for clustering quantitative data [15,16,17,18]. Given a set of $n$ multivariate observations $x_1, \dots, x_n$, where each $x_i$ is a $d$-dimensional vector, the $k$-means algorithm partitions the data into $k$ clusters, $C_1, \dots, C_k$, such that the sum of squares within each cluster is minimized. That is, k-means seeks to minimize
$$\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,$$
where $\mu_j$ is the mean of the points in $C_j$. This algorithm is fast and easy to implement [16,18]. Once the number of clusters is defined, this method chooses, at random, k points in the attribute space as initial values. After that, observations are assigned to the closest cluster and the centroids are updated. Because the algorithm does not guarantee convergence to the global optimum, and since it is usually fast, it is common to run it multiple times with different starting conditions. This method may, however, be badly affected by outliers.
Several methods have been proposed to deal with qualitative or mixed data. The k-modes and k-prototype methods, proposed by Huang [13], are prominent among these; both are extensions of k-means (see Table 1).
Table 1.
Some classical methods of clustering for categorical or mixed data.
k-modes, in particular, is the k-means method with the Euclidean distance metric substituted by a simple matching dissimilarity measure, and with the centers of the clusters represented by their modes instead of their means. To introduce k-modes, let X and Z be two objects described by n categorical attributes. Then, a simple dissimilarity measure between these objects is the total number of mismatches of the corresponding attribute values of the two objects. That is,
$$d(X, Z) = \sum_{j=1}^{n} \delta(x_j, z_j),$$
where
$$\delta(x_j, z_j) = \begin{cases} 0, & \text{if } x_j = z_j, \\ 1, & \text{if } x_j \neq z_j. \end{cases}$$
Let $S = \{X_1, \dots, X_m\}$ be a set of $m$ objects described by $n$ categorical attributes denoted by $A_1, \dots, A_n$. Then a mode of $S$ is a vector $Q = (q_1, \dots, q_n)$ that minimizes
$$D(S, Q) = \sum_{i=1}^{m} d(X_i, Q).$$
Note that $Q$ is not necessarily an object of $S$. Finally, the k-modes algorithm partitions the set of $m$ objects described by $n$ categorical attributes into $k$ clusters, $C_1, \dots, C_k$, by minimizing the following expression:
$$\sum_{l=1}^{k} \sum_{X_i \in C_l} d(X_i, Q_l),$$
where $Q_l$ is the mode of cluster $C_l$. For a survey of k-modes, see [19]. For a different approach to clustering categorical data, see [20].
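As an illustration of the simple matching dissimilarity and of the componentwise mode described above, the following minimal R sketch (the function names are ours, not the paper's) computes both for a small character matrix:

```r
# Simple matching dissimilarity between two objects x and z
# (vectors of categorical values): the number of mismatching attributes.
match_dissim <- function(x, z) sum(x != z)

# Mode of a set of objects (rows of a character matrix): for each
# attribute, take the most frequent category.
cluster_mode <- function(S) {
  apply(S, 2, function(col) names(which.max(table(col))))
}

# Small example with three objects and three categorical attributes.
S <- rbind(c("a", "b", "c"),
           c("a", "b", "d"),
           c("e", "b", "d"))
match_dissim(S[1, ], S[2, ])  # 1 mismatch
cluster_mode(S)               # componentwise mode: "a" "b" "d"
```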
Although the k-modes method has the advantage of being scalable to very large datasets, the final solution may be influenced by the initialization, which uses random initial modes as centers. A number of suggestions have been made to overcome the shortcomings of k-modes. For example, Lakshmi et al. [21] propose a different algorithm to overcome the initialization problem of k-modes. Moreover, Dorman and Maitra [22] adapt the Hartigan algorithm for k-means and develop several approaches to select the initial centroids to improve the efficiency of k-modes. Two other approaches to initialize the k-modes algorithm are given in [23,24]. A fuzzy version of the k-modes algorithm is proposed by Huang and Ng [25] to improve the performance of k-modes. Other fuzzy versions of the k-modes method are given in [26,27,28]. Ng et al. [29] modify the simple matching dissimilarity measure to obtain clusters with greater intra-cluster similarity and describe extensions of k-modes to efficiently cluster large categorical datasets. A different dissimilarity measure is provided by Cao et al. [30].
For different approaches to clustering categorical data, see [20,31,32,33].
Besides the k-modes algorithm, Huang [13] also proposes k-prototype, an algorithm that integrates the k-means and k-modes algorithms to cluster mixed types of objects. The dissimilarity between two mixed-type objects, X and Z, described by $p$ quantitative attributes and $n - p$ qualitative attributes, may be measured by
$$d(X, Z) = \sum_{j=1}^{p} (x_j - z_j)^2 + \gamma \sum_{j=p+1}^{n} \delta(x_j, z_j).$$
Of course, the first term corresponds to the squared Euclidean distance, which is applied to the quantitative attributes, and the second term is the simple matching dissimilarity measure, which is applied to the qualitative attributes. Here, $\gamma$ is a weight used to avoid favoring either type of attribute. Thus, k-prototype seeks to minimize the following cost function:
$$E = \sum_{l=1}^{k} \sum_{i=1}^{m} w_{il}\, d(X_i, Q_l),$$
where $W = (w_{il})$ is an $m \times k$ partition matrix with elements $w_{il} \in \{0, 1\}$, and $Q = \{Q_1, \dots, Q_k\}$ is a set of prototypes in the same object domain.
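A minimal sketch of the mixed-type dissimilarity just described (the function name and the example values are ours; gamma plays the role of the weight mentioned above):

```r
# Mixed-type dissimilarity: squared Euclidean distance on the numeric
# attributes plus gamma times the simple matching dissimilarity on the
# categorical attributes.
mixed_dissim <- function(x_num, z_num, x_cat, z_cat, gamma = 1) {
  sum((x_num - z_num)^2) + gamma * sum(x_cat != z_cat)
}

mixed_dissim(x_num = c(1.2, 0.5), z_num = c(0.8, 0.7),
             x_cat = c("red", "yes"), z_cat = c("blue", "yes"),
             gamma = 0.5)   # 0.2 + 0.5 * 1 = 0.7
```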
Next, Marcotorchino [14] and Michaud [34,35,36] were the first to propose a clustering method for categorical data using a dissimilarity measure. These authors developed the relational analysis theory and introduced the relation aggregation problem in order to solve Condorcet's paradox in voting systems, relating it to the similarity problem.
This approach consists of using pairwise comparisons and applying the simple majority decision rule. Indeed, aggregating equivalence relations using the simple majority decision rule guarantees optimal solutions under some constraints and without fixing a priori the number of groups. In our work, we used the approach introduced by Michaud and Marcotorchino, setting a priori the number of groups.
3. Materials and Methods
Let $V = \{v^1, \dots, v^n\}$ be a set of $n$ variables and $S = \{s_1, \dots, s_m\}$ a set of $m$ objects. Let $C$ be a Condorcet matrix whose element $c_{ii'}$ corresponds to the number of variables for which $s_i$ is similar to $s_{i'}$, denoted by $s_i \sim s_{i'}$; that is, $C$ is an $m \times m$ matrix such that
$$c_{ii'} = \sum_{k=1}^{n} \mathbf{1}\{ s_i \sim_k s_{i'} \}.$$
For two given objects, $s_i$ and $s_{i'}$, by $s_i \sim_k s_{i'}$ we mean that $s_i$ and $s_{i'}$ have the same value for the variable $v^k$, with $k = 1, \dots, n$, while $s_i \sim s_{i'}$ means that $s_i$ and $s_{i'}$ are similar.
In the relational analysis methodology, Marcotorchino and Michaud [14] suggest the maximization of Condorcet's criterion, under some restrictions, given by
$$C(Y) = \sum_{i=1}^{m} \sum_{i'=1}^{m} \left[ c_{ii'}\, y_{ii'} + \bar{c}_{ii'}\, (1 - y_{ii'}) \right],$$
with $\bar{c}_{ii'} = n - c_{ii'}$, where $Y = (y_{ii'})$ is an $m \times m$ matrix that maximizes the function given in Equation (6). This matrix takes values 0 and 1.
Then, the model associated with the absolute global majority is defined by
$$\max_{Y} \sum_{i=1}^{m} \sum_{i'=1}^{m} \left[ c_{ii'}\, y_{ii'} + \bar{c}_{ii'}\, (1 - y_{ii'}) \right] \quad \text{subject to} \quad y_{ii'} \in \{0, 1\}, \quad y_{ii'} = y_{i'i}, \quad y_{ii'} + y_{i'i''} - y_{ii''} \leq 1,$$
where Y is the matrix of similarities. The first constraint represents binarity, the second represents symmetry, and the third represents transitivity.
The following example explains how to obtain the matrix Y that maximizes Condorcet's criterion under the restrictions given above.
Let E be a dataset composed of three items, $s_1$, $s_2$, and $s_3$, with three qualitative variables being measured. The dataset is presented in Table 2.
Table 2.
Example of dataset E.
Using Table 2, we identify the Condorcet matrix C. Then, the possible solutions Y that satisfy the constraints are
Next, we compute the criterion value for each of these candidate matrices $Y_i$. The matrix attaining the largest value maximizes Condorcet's criterion. Finally, we obtain the number of clusters, equal to 2, and the corresponding two clusters.
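Assuming the criterion takes the relational-analysis form reconstructed above, the exhaustive search of this example can be mimicked in a few lines of R; the Condorcet matrix below is hypothetical (Table 2 is not reproduced here) and the function name is ours:

```r
# Evaluate the Condorcet criterion
# sum_{i,i'} [ c_ii' * y_ii' + (n - c_ii') * (1 - y_ii') ]
# for a crisp partition encoded as a vector of cluster labels.
condorcet_value <- function(C, labels, n_vars) {
  Y <- outer(labels, labels, "==") * 1   # y_ii' = 1 if same cluster
  sum(C * Y + (n_vars - C) * (1 - Y))
}

# A hypothetical 3 x 3 Condorcet matrix for n = 3 variables.
C <- matrix(c(3, 2, 0,
              2, 3, 1,
              0, 1, 3), nrow = 3, byrow = TRUE)

# All distinct partitions of three objects.
partitions <- list(c(1, 1, 1), c(1, 1, 2), c(1, 2, 1), c(1, 2, 2), c(1, 2, 3))
sapply(partitions, condorcet_value, C = C, n_vars = 3)  # pick the maximizer
```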
Although the proposed method of clustering does not require fixing the number of classes beforehand, there are instances where this is not convenient, as is the case in a psychometric analysis. In this paper, we take this point of view and assume that the number of clusters is given.
Therefore, in this paper, we fix the number of groups, denoted by $\alpha$, and focus on finding the solution of Equation (7), giving its algorithm and comparing its results to k-modes for some fixed value of $\alpha$.
Next, let us denote by $P_\alpha$ a partition of the set of objects $S$. In this partition, the number of clusters is $\alpha$.
Recall that the matrix $C$ represents the similarities between pairs of objects that we want to cluster. Similarly, we introduce a matrix of dissimilarities between pairs of the same objects and denote it by $\bar{C}$. Next, we define $n$ categorical variables denoted by $v^1, \dots, v^n$, and let $v_i^k$ be the modality of $v^k$ assigned to object $s_i$. Then, each variable $v^k$ is associated with a matrix $C^k$. As a consequence, we obtain
$$C = \sum_{k=1}^{n} C^k,$$
where the elements of matrix $C^k$ are given by
$$c_{ii'}^{k} = \begin{cases} 1, & \text{if } v_i^k = v_{i'}^k, \\ 0, & \text{otherwise.} \end{cases}$$
By abuse of notation, we write and .
Using Equation (8), the general terms of the collective relational matrix $C$ are given by $c_{ii'} = \sum_{k=1}^{n} c_{ii'}^{k}$. Furthermore, we define the general terms of the collective relational matrix $\bar{C}$ as $\bar{c}_{ii'} = n - c_{ii'}$. Note that $\bar{c}_{ii'}$ represents the number of variables for which $s_i$ and $s_{i'}$ are not similar.
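As an illustration, the collective matrices $C$ and $\bar{C}$ can be assembled from a categorical data table as in the following sketch (the helper name and the toy data are ours; it uses the relation $\bar{c}_{ii'} = n - c_{ii'}$ stated above):

```r
# Build the Condorcet matrix C (pairwise counts of agreeing variables)
# and its complement Cbar = n - C from a data frame of categorical
# variables (rows = objects, columns = variables).
condorcet_matrices <- function(df) {
  X <- as.matrix(df)
  m <- nrow(X); n <- ncol(X)
  C <- matrix(0, m, m)
  for (k in seq_len(n)) {
    C <- C + outer(X[, k], X[, k], "==")  # add 1 where objects agree on variable k
  }
  list(C = C, Cbar = n - C)
}

E <- data.frame(v1 = c("a", "a", "b"),
                v2 = c("x", "y", "y"),
                v3 = c("p", "p", "q"))
condorcet_matrices(E)$C
```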
4. Main Theoretical Results
4.1. α-Condorcet Criterion Function
Now, we present our first important result: a new Condorcet criterion function.
Definition 1.
Let $P_\alpha = \{G_1, \dots, G_\alpha\}$ be a partition of a set of objects $S$. We define a new Condorcet criterion function $g$ as
$$g(P_\alpha) = \sum_{i=1}^{m} \sum_{i'=1}^{m} \left[ c_{ii'}\, y_{ii'} + \bar{c}_{ii'}\, (1 - y_{ii'}) \right],$$
with $y_{ii'} = 1$ if $s_i$ and $s_{i'}$ belong to the same cluster of $P_\alpha$ and $y_{ii'} = 0$ if they do not.
Using Equations and , the formula above can be rewritten as follows
Knowing the exact number of groups, the following model allows us to group the objects by similarity, in the sense of having common characteristics. Then, we obtain the partition $P_\alpha$ by maximizing the following function
where
Using the third restriction of the model, the function g can be simplified to the following expression
The next theorem ensures the existence of at least one solution to the problem given in Equation (10).
Theorem 1.
Let $S$ be a set of $m$ objects with $m \geq \alpha$. Then, there exists at least one partition $P_\alpha$ of $S$ that maximizes $g$.
Proof of Theorem 1.
We know that if the number of objects $m$ is less than the number of clusters $\alpha$, then no solution of Equation (10) exists.
We now suppose that the number of objects is greater than or equal to the number of clusters $\alpha$. Then, we have $d$ possible partitions, where the parameter $d$ is the number of partitions of the set of $m$ objects into $\alpha$ clusters, which can be expressed as follows
where $m_j$, $j = 1, \dots, \alpha$, is the number of objects in cluster $G_j$. Furthermore, $d$ is a positive integer. Since the number of possible partitions is finite and positive, there exists at least one partition of $S$ that maximizes $g$. □
Theorem 2.
We assume that the dataset does not present Condorcet's paradox. Then, for some value of α, there exists a unique partition of S that maximizes g.
Proof of Theorem 2.
To simplify the calculations, we consider a simple case. We suppose that an object $s_i$ is similar to $s_{i'}$ and not similar to $s_{i''}$; then, in the absence of Condorcet's paradox, we conclude that $s_{i'}$ is not similar to $s_{i''}$. Finally, there exists a unique partition that maximizes $g$. □
Next, we give some axiomatic conditions studied by Michaud [37], which are verified by Condorcet's rule and respond to K. Arrow's impossibility theorem, presented below.
Theorem 3
(Theorem of K. Arrow). According to Arrow’s impossibility theorem, it is impossible to formulate a social ordering without violating one of the following conditions:
- 1.
- Non-dictatorship: a single voter's preferences cannot represent the whole community. The wishes of multiple voters should be taken into consideration.
- 2.
- Pareto efficiency: unanimous individual preferences must be respected. If every voter prefers candidate A over candidate B, candidate A should win.
- 3.
- Independence of irrelevant alternatives: if a choice is removed, then the order of the others should not change. If candidate A ranks ahead of candidate B, candidate A should still be ahead of candidate B, even if a third candidate, candidate C, is removed from participation.
- 4.
- Unrestricted domain: voting must account for all individual preferences.
- 5.
- Social ordering: each individual should be able to order their choices in a connected and transitive relation.
Michaud [37] proved that the rule of Condorcet verifies the conditions given in the following theorem. The following result concerns the verification of these conditions by the α-Condorcet method given in Equation (10).
Theorem 4
(Axiomatic conditions). In the context of similarity aggregation problems, the rule of Condorcet verifies the following conditions for some values of α:
- 1.
- Non-dictatorship condition: this condition means that no variable can, by itself, determine an item (individual or object) classification maximizing the Condorcet criterion.
- 2.
- Pareto pair unanimity condition: if two items are similar with respect to every variable, then these two items must be found in the same cluster.
- 3.
- Condition of total neutrality: the classification obtained must be independent of the order of individuals or items or variables.
- 4.
- Condition of coherent union: if two disjoint sets of variables give the same partition, then the union of the two sets will give the same partition.
4.2. Total Inertia
We now focus on the inertia, or within-cluster sum-of-squares. First, we present some preliminaries. Let $P_\alpha$ be a partition of $S$. We build a cloud of points, denoted $N$, in which each dimension corresponds to a category of a variable $v^k$, $k = 1, \dots, n$. Let $N = \{(x_i, p_i),\ i = 1, \dots, m\}$ be a cloud of mass points, where $x_i$ is the coordinate point of object $s_i$ and $p_i$ is its corresponding mass. In general, the expression of the within-cluster sum-of-squares is given by
$$W(P_\alpha) = \sum_{j=1}^{\alpha} \sum_{s_i \in G_j} p_i\, d^2(x_i, g_j),$$
where $g_j$ is the center of gravity of the cloud of cluster $G_j$.
The within-cluster sum-of-squares is a measure of how similar the objects are within each cluster. However, this measure alone does not allow us to judge the quality of a partition in absolute terms: values of inertia closer to zero indicate a better partition.
Next, consider the modalities of the variables $v^1, \dots, v^n$, and let $p$ be a fixed parameter. Then, we define an indicator that equals one if object $i$ has modality $k$ and zero otherwise, together with the total number of objects that have the same modality $k$.
In the next theorem, we present an expression for inertia given by [38,39].
Theorem 5.
A relational expression of within-cluster sum-of-squares is given by
Note that this expression does not require the number of clusters to be specified a priori.
The following result is an application of Equation (12) and Theorem 5.
Theorem 6.
A new relational expression of the within-cluster sum-of-squares, with the number of clusters α fixed, is given by
where $m_j$ is the cardinality of cluster $G_j$.
Proof of Theorem 6.
To prove this result, we need to consider the chi-square metric to find and compute a closed form of the expression of the within-cluster sum-of-squares. First, we define some preliminary elements. Let be a general term given by
with , and .
Then, we have
where , and with is the cardinality of class .
Note that is the mass of each individual given by .
It follows that
Finally, we obtain
□
We next introduce a quality index given by
This index was given in [14], and measures the quality of the partition.
5. Algorithm
Several algorithms have been proposed in the past three decades to solve problem (7) by linear programming techniques when the population under study is relatively small. Unfortunately, the classical linear programming techniques require many restrictions. In this case, heuristic methods have been adopted in order to process large amounts of data. Although these heuristic algorithms are fast, they do not always ensure an optimal solution.
Next, for all pairs of objects $s_i$ and $s_{i'}$, we maximize the expression
$$\sum_{i=1}^{m} \sum_{i'=1}^{m} (c_{ii'} - \bar{c}_{ii'})\, y_{ii'}.$$
Note that the term $\sum_{i,i'} \bar{c}_{ii'}$ is omitted because it is constant.
The above formula represents the sum of the links between objects $s_i$ and $s_{i'}$, denoted by $l_{ii'}$, and we write $l_{ii'} = c_{ii'} - \bar{c}_{ii'} = 2c_{ii'} - n$. Moreover, we denote the general link matrix by $L = 2C - nJ$, where $J$ is a matrix with elements equal to 1.
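Assuming the link has the form $l_{ii'} = c_{ii'} - \bar{c}_{ii'} = 2c_{ii'} - n$ written above, the general link matrix can be formed directly from $C$, as in this small sketch (the helper name is ours):

```r
# General link matrix L = C - Cbar = 2C - n * J, where J is the
# all-ones matrix and n_vars is the number of variables.
link_matrix <- function(C, n_vars) {
  2 * C - n_vars * matrix(1, nrow(C), ncol(C))
}

# Example: L <- link_matrix(condorcet_matrices(E)$C, n_vars = ncol(E)),
# using the condorcet_matrices() helper sketched in Section 3.
```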
The α-Condorcet clustering algorithm is illustrated by the steps given in Algorithm 1, for a database $D$ of $m$ points and a partition $P_\alpha$ of $S$ such that $P_\alpha = \{G_1, \dots, G_\alpha\}$.
Similar to the algorithm given by [40], we compute the following steps.
| Algorithm 1. Heuristic Algorithm -Condorcet. | |
| Input | : Number of partitions |
| : Number of observations | |
| : observations | |
| : Number of variables | |
| : Feature Matrix | |
| Output | : is a partition of clusters |
| 1 | ): the generation of the Condorcet matrix C |
| 2 | : where is a matrix of ones |
| 3 | |
| 4 | |
| 5 | |
| 6 | for todo |
| 7 | for to do |
| 8 | if then |
| 9 | |
| 10 | endif |
| 11 | endfor |
| 12 | where is the largest value of the |
| column of matrix L | |
| 13 | endfor |
| 14 | is the position of the first occurrence |
| of the largest value of vector K | |
| 15 | |
| 16 | EliminateGroup(): Elimination of the cluster |
| 17 | |
| 18 | |
| 19 | |
| 20 | |
| 21 | |
| 22 | if () then |
| 23 | : GenCombi gives the combination |
| which maximizes the link F | |
| 24 | : is the cardinality function |
| 25 | if then |
| 26 | |
| 27 | |
| 28 | , goto 5 |
| 29 | else |
| 30 | goto 5 |
| 31 | endif |
| 32 | endif |
| 33 | if then |
| 34 | goto 5 |
| 35 | endif |
| 36 | if then |
| 37 | return |
| 38 | endif |
- 1.
- First, we find the largest value in each column of the Condorcet matrix C, which corresponds to the number of characteristics that a pair of observations share. We denote this largest value by a and record its position. In this case, we put the corresponding pair of observations in the same group, with k representing the kth column.
- 2.
- Next, we remove the value a from the matrix C and define b as the largest remaining value of the kth column.
- 3.
- We distinguish some conditions:
- 3.1
- If b &lt; a, we repeat the first point.
- 3.2
- If b = a, then the Condorcet criterion is applied. We group the elements that maximize the Condorcet criterion.
- 4.
- We repeat the process.
- 5.
- This process stops when the α groups are identified (a simplified sketch of this procedure is given below).
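The following R sketch is a deliberately simplified rendering of the merging idea behind these steps: it repeatedly merges the most similar pair of objects lying in different groups until α groups remain, and it does not reproduce the column-wise scan, the comparison of a and b, or the tie-breaking combinatorics of step 3.2. The function name is ours. On the Condorcet matrix of the example in Section 5.1 (with the eliminated entry restored by symmetry), it happens to recover the same two groups as the full heuristic.

```r
# Simplified greedy sketch (not the full heuristic of Algorithm 1).
greedy_condorcet <- function(C, alpha) {
  m <- nrow(C)
  labels <- seq_len(m)                          # start with one group per object
  while (length(unique(labels)) > alpha) {
    cross <- outer(labels, labels, "!=")        # pairs in different groups
    idx <- which(cross & C == max(C[cross]), arr.ind = TRUE)[1, ]
    labels[labels == labels[idx["col"]]] <- labels[idx["row"]]  # merge the two groups
  }
  labels
}

# Condorcet matrix of the illustrative example (Table 4), symmetric,
# with the diagonal replaced by zero as described in Section 5.1.
C <- matrix(c(0, 2, 1, 3, 1, 1,
              2, 0, 0, 2, 0, 0,
              1, 0, 0, 1, 2, 3,
              3, 2, 1, 0, 1, 1,
              1, 0, 2, 1, 0, 2,
              1, 0, 3, 1, 2, 0), nrow = 6, byrow = TRUE)
greedy_condorcet(C, alpha = 2)  # groups {s1, s2, s4} and {s3, s5, s6}
```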
5.1. Illustrative Example Using the Heuristic Algorithm
We now consider a dataset D, which is composed of six items with three qualitative variables being measured. The dataset is presented in Table 3.
Table 3.
Example of dataset D.
Then, the Condorcet matrix C is given in Table 4.
Table 4.
The Condorcet’s matrix C.
In general, the diagonal of the Condorcet matrix represents the number of variables measured for these observations; it also represents the maximum possible similarity that can occur between two observations $s_i$ and $s_{i'}$ with $i \neq i'$. For our heuristic algorithm, we replace the diagonal entries by zero.
The goal of this example is to create a partition $P = \{G_1, G_2\}$, fixing the number of groups $\alpha = 2$. Before creating this partition, we suppose that each element represents a group and we write $G_i = \{s_i\}$ with $i = 1, \dots, 6$. Let $K$ be a vector whose elements are the maxima of the columns of the Condorcet matrix. Then, we have $K = (3, 2, 3, 3, 2, 3)$. So, we identify the first maximum number of the vector $K$, called $a$, with $a = 3$, and its position $(4, 1)$, which represents the fourth position of the first column.
The following step is to put the elements $s_1$ and $s_4$ together in the same group $G_1$. In the first column, we eliminate the fourth value, and we write:
| - | 2 | 1 | 3 | 1 | 1 | |
| 2 | - | 0 | 2 | 0 | 0 | |
| 1 | 0 | - | 1 | 2 | 3 | |
| - | 2 | 1 | - | 1 | 1 | |
| 1 | 0 | 2 | 1 | - | 2 | |
| 1 | 0 | 3 | 1 | 2 | - |
Computing the maximum of the first column, we obtain $b = 2$. Comparing the value of both parameters $a$ and $b$, we find that $b &lt; a$. Thus, we keep $G_1 = \{s_1, s_4\}$. Then, the vector $K$ is recalculated without considering columns 1 and 4, obtaining $K = (2, 3, 2, 3)$ over the remaining columns. So, we identify the first maximum number of the vector $K$ with $a = 3$, and its position $(6, 3)$, which represents the last position of the third column. In the third column, we eliminate the sixth value, and we write:
| - | 2 | 1 | 3 | 1 | 1 | |
| 2 | - | 0 | 2 | 0 | 0 | |
| 1 | 0 | - | 1 | 2 | 3 | |
| - | 2 | 1 | - | 1 | 1 | |
| 1 | 0 | 2 | 1 | - | 2 | |
| 1 | 0 | - | 1 | 2 | - |
Computing the maximum of the third column, we obtain $b = 2$. In this case, $b &lt; a$, so the elements $s_3$ and $s_6$ are kept together in the group $G_2$. Again, the vector $K$ is recalculated without considering columns 3 and 6, and we have $K = (2, 2)$ over the remaining columns 2 and 5. The first maximum of vector $K$ is found in the second column, and we have $a = 2$, with position in the Condorcet matrix given by $(1, 2)$. This position leads us to add the element $s_2$ to the group $G_1$. We eliminate the first value of the second column, and write:
| - | - | 1 | 3 | 1 | 1 | |
| 2 | - | 0 | 2 | 0 | 0 | |
| 1 | 0 | - | 1 | 2 | 3 | |
| - | 2 | 1 | - | 1 | 1 | |
| 1 | 0 | 2 | 1 | - | 2 | |
| 1 | 0 | - | 1 | 2 | - |
Next, the maximum of the second column is equal to 2, and we have $b = 2$ with position $(4, 2)$. Both parameters $a$ and $b$ are equal. In this case, it is not necessary to carry out combinatorics between the corresponding elements, because the element in position $(4, 2)$ already belongs to $G_1$. Now, we define a new vector $K$ without the second column; the only remaining column is column 5, whose maximum is equal to 2. The first maximum of vector $K$ can be found in two positions, meaning that the element $s_5$ can be in the first group or the second group. In this case, we must check which of the two partitions maximizes the Condorcet criterion function. After simple calculations, we deduce that $s_5$ belongs to $G_2$. Finally, we obtain two groups, $G_1 = \{s_1, s_2, s_4\}$ and $G_2 = \{s_3, s_5, s_6\}$.
5.2. Advantage of Heuristic Algorithm
The main goals of this section are threefold. Firstly, we compare the partition quality, given in Equation (21), for the feline dataset using both exact and heuristic algorithms. Secondly, we use the inertia index, given in Equation (13), to compare the exact and heuristic algorithms. Finally, the execution time of the two methods is compared. For each step, we choose the first felines of the feline dataset.
Table 5 shows that the use of the exact algorithm becomes impractical as the sample size increases. Furthermore, the inertia and quality indexes of the heuristic and exact algorithms are almost equal.
Table 5.
Time comparison between an exact algorithm and a heuristic algorithm using the feline dataset.
Finally, observing the last column of Table 5, the execution time of the exact algorithm increases considerably with the data size, reaching 586.95 s, while the heuristic algorithm remains much faster. Furthermore, for the largest sizes, the quality and inertia indexes cannot be computed for the exact algorithm; however, we know that the exact algorithm provides an optimal solution, so we can deduce that its quality index is at least as large as that of the heuristic algorithm. Consequently, we confirm that the exact algorithm is computationally very expensive compared to the heuristic algorithm. Note that, for the exact algorithm, the use of large datasets generates two important problems: the first is related to the execution time, while the second concerns the temporary memory required by the programs being used at a particular moment (e.g., the R environment).
6. Comparison between α-Condorcet and k-Modes
Firstly, in this section, we describe the experiments and their results. We ran our algorithm on the feline dataset obtained from [14] and presented in Table A1 and Table A2 of Appendix A. We tested the performance of α-Condorcet clustering against the k-modes algorithm. Our algorithms were implemented in the R language. The α-Condorcet algorithm was implemented according to the description given above, and for k-modes, we used an implementation already available in R. The quality of the partitions was compared using the fit rate [14] given by Equation (21).
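The paper does not state which R implementation of k-modes was used; as a hedged illustration, a run with the kmodes() function from the klaR package (an assumed choice, and the data frame name felines is also ours) could look as follows:

```r
# Hypothetical k-modes run on a categorical data frame `felines`
# (one row per feline, one column per categorical variable), with
# four groups as in Figure 2. The package choice is an assumption.
library(klaR)
set.seed(1)                       # k-modes starts from random initial modes
fit <- kmodes(felines, modes = 4)
fit$cluster                       # crisp cluster assignment of each feline
```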
In previous studies [14,37], the similarity aggregation method gave an optimal solution of four groups. This solution was closer to the classification recognized by zoologists by species and genus (Figure 1).
Figure 1.
Optimal solution by the Condorcet method.
On the other hand, in the partition into four groups obtained by applying the k-modes algorithm, it is observed that certain species belong to more than one group, and the result does not agree with the classification recognized by zoologists (Figure 2).
Figure 2.
Optimal solution by k-modes, fixing the number of groups at four.
The accuracy of measurement, given in Equation (22), of the solutions shown in Figure 1 and Figure 2 is 1 and 0.83, respectively.
Next, a comparison was made between the α-Condorcet method and the k-modes method for different values of α, in order to find the method that best fits the feline data, using the within-class inertia index and the adjustment rate given in Equations (13) and (21), respectively.
Figure 3 contrasts the quality of the groupings by means of the adjustment rate. It is observed that the α-Condorcet method presents a better partition quality than the k-modes method for the different values of α.
Figure 3.
The partitioning quality of the feline data under both the k-modes and α-Condorcet methods.
Figure 4 contrasts the quality of the clustering fit through inertia on the same dataset. From this figure, it is concluded that the intra-class inertia is almost the same for both methods for different values of α.
Figure 4.
The optimal solution minimizing the within-cluster sum-of-squares under both the k-modes and α-Condorcet methods.
We now use the 1990 US Census dataset to compare the heuristic algorithm with the k-modes algorithm. This dataset contains a 1% sample of the public use microdata sample person records drawn from the full 1990 census sample. For further reference, see https://archive.ics.uci.edu/ml/datasets/US+Census+Data+%281990%29 (accessed on 20 December 2021). The comparisons between both methods were made with 50, 100, 150, and 200 observations.
Table 6 shows that the inertia index is almost the same for both algorithms. However, we observe that the heuristic algorithm is better than the k-modes algorithm from the point of view of the quality index.
Table 6.
Comparison between the k-modes and heuristic algorithms for different sample sizes of the US Census 1990 dataset, with the number of groups α fixed.
7. Conclusions
In clustering categorical data, many researchers have succeeded in developing unsupervised classification methods without fixing the number of classes a priori. Fixing the number of clusters beforehand may be a major drawback.
However, sometimes it is convenient to identify the number of groups beforehand, as, for instance, in psychometrics. Several methods have been proposed with a known number of clusters. We believe, however, that these methods do not always provide optimal solutions. For this reason, we proposed a new method with a fixed number of groups. This new approach is an extension of the Condorcet method. Although the exact algorithm of this new approach gives an optimal solution, it consumes too much time. Hence, the heuristic algorithm was introduced. Table 5 shows that the proposed algorithm produces almost the same values of the quality and inertia indexes as the exact algorithm.
Next, comparing our approach with k-modes for the feline data, we found that the accuracy index gave a better result for the heuristic algorithm. In this case, the comparison was made with the accuracy index because we know a priori that the number of clusters, α, is 4. This comparison was also made with the US Census 1990 data, using both the partition quality and intra-class inertia indexes. The results in Table 6 show that both methods have almost the same inertia. However, the heuristic algorithm shows an improvement over k-modes in terms of partition quality. Consequently, the following may be concluded:
- We proposed a heuristic algorithm as an alternative to the exact one. It gives the same solution as the exact algorithm, or a close approximation to it.
- From the simulations presented in Table 5, we can conclude that the heuristic algorithm is faster than the exact algorithm.
- The heuristic algorithm produces results similar to (or even better than) those of k-modes.
- We conclude that α-Condorcet is a valid competitor to the k-modes clustering technique.
Author Contributions
Conceptualization: T.F. and L.F.-L.; methodology: T.F. and L.F.-L.; software: R.C.-S., T.F. and J.M.A.-B.; formal analysis, writing—review, and editing: T.F. and L.F.-L. All authors have read and agreed to the published version of the manuscript.
Funding
This work was partially supported by FONDECYT (grant 11200749) and the University of Bío-Bío (grant DIUBB 2020525 IF/R). Partial support was provided by the University of Bío-Bío to Luis Firinguetti-Limone (grant DIUBB 183808 3/R).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
The dataset of Table A1 has 30 felines and 14 variables that describe the characteristics of each feline. The full name and the modalities of each variable are described in the following Table A2.
Table A1.
The feline dataset is a multivariate dataset introduced by P. Michaud and F. Marcotorchino in their articles [14,34,37].
| French Name | English Name | Tipopiel | Longpoill | Retract | Comport | Orielles | Larynx | Tailler | Poids | Longueurs | Queue | Dents | Typproie | Arbre | Chasse |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lion | Lion | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 3.00 | 3.00 | 3.00 | 2.00 | 1.00 | 1.00 | 0.00 | 1.00 |
| Tigre | Tiger | 3.00 | 0.00 | 1.00 | 3.00 | 1.00 | 1.00 | 3.00 | 3.00 | 3.00 | 2.00 | 1.00 | 1.00 | 0.00 | 0.00 |
| Jaguar | Jaguar | 2.00 | 0.00 | 1.00 | 2.00 | 1.00 | 1.00 | 3.00 | 3.00 | 2.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| Leopardo | Leopard | 2.00 | 0.00 | 1.00 | 3.00 | 1.00 | 1.00 | 3.00 | 3.00 | 2.00 | 2.00 | 1.00 | 2.00 | 1.00 | 0.00 |
| Once | Oncilla | 2.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 2.00 | 2.00 | 2.00 | 3.00 | 1.00 | 2.00 | 1.00 | 0.00 |
| Guepardo | Cheetah | 2.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 3.00 | 2.00 | 2.00 | 3.00 | 0.00 | 2.00 | 0.00 | 1.00 |
| Puma | Puma | 1.00 | 0.00 | 1.00 | 2.00 | 1.00 | 0.00 | 2.00 | 3.00 | 2.00 | 3.00 | 1.00 | 2.00 | 1.00 | 0.00 |
| Nebul | Clouded leopard | 4.00 | 0.00 | 1.00 | 3.00 | 1.00 | 1.00 | 2.00 | 2.00 | 2.00 | 3.00 | 1.00 | 3.00 | 1.00 | 0.00 |
| Serval | Serval | 2.00 | 0.00 | 1.00 | 1.00 | 2.00 | 0.00 | 2.00 | 2.00 | 2.00 | 1.00 | 0.00 | 3.00 | 1.00 | 1.00 |
| Ocelot | Ocelot | 2.00 | 0.00 | 1.00 | 2.00 | 1.00 | 0.00 | 2.00 | 2.00 | 2.00 | 2.00 | 0.00 | 3.00 | 1.00 | 0.00 |
| Lynx | Lynx | 2.00 | 1.00 | 1.00 | 2.00 | 2.00 | 0.00 | 2.00 | 2.00 | 2.00 | 1.00 | 1.00 | 2.00 | 1.00 | 0.00 |
| Caracal | Caracal | 1.00 | 0.00 | 1.00 | 2.00 | 2.00 | 0.00 | 2.00 | 2.00 | 1.00 | 1.00 | 0.00 | 3.00 | 1.00 | 1.00 |
| Viverrin | Fishing cat | 2.00 | 0.00 | 1.00 | 2.00 | 1.00 | 0.00 | 1.00 | 1.00 | 2.00 | 2.00 | 0.00 | 3.00 | 0.00 | 0.00 |
| Yaguarun | Jaguarundi | 1.00 | 0.00 | 1.00 | 2.00 | 1.00 | 0.00 | 1.00 | 2.00 | 2.00 | 3.00 | 0.00 | 3.00 | 1.00 | 0.00 |
| Chaus | Chaus | 1.00 | 1.00 | 1.00 | 3.00 | 2.00 | 0.00 | 1.00 | 2.00 | 1.00 | 2.00 | 0.00 | 3.00 | 1.00 | 0.00 |
| Dore | Golden cat | 1.00 | 0.00 | 1.00 | 3.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 2.00 | 0.00 | 3.00 | 1.00 | 0.00 |
| Merguay | Margay | 2.00 | 0.00 | 1.00 | 3.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 2.00 | 0.00 | 3.00 | 1.00 | 0.00 |
| Margerit | Sand cat | 1.00 | 1.00 | 1.00 | 2.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 2.00 | 0.00 | 3.00 | 0.00 | 0.00 |
| Cafer | Caffer cat | 3.00 | 0.00 | 1.00 | 3.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 2.00 | 0.00 | 3.00 | 1.00 | 1.00 |
| Chine | Chinese mountain cat | 1.00 | 0.00 | 1.00 | 2.00 | 2.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 3.00 | 1.00 | 0.00 |
| Bengale | Bengal cat | 2.00 | 0.00 | 1.00 | 3.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 2.00 | 0.00 | 3.00 | 1.00 | 0.00 |
| rouilleu | Rusty spotted cat | 2.00 | 0.00 | 1.00 | 2.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 2.00 | 0.00 | 3.00 | 1.00 | 0.00 |
| Malais | Malai | 1.00 | 1.00 | 1.00 | 3.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 3.00 | 1.00 | 0.00 |
| Borneo | Bornean bay cat | 1.00 | 0.00 | 1.00 | 3.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 2.00 | 0.00 | 3.00 | 1.00 | 0.00 |
| Nigripes | Black footed cat | 2.00 | 0.00 | 1.00 | 2.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 3.00 | 1.00 | 1.00 |
| Manul | Manul | 1.00 | 1.00 | 1.00 | 3.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 3.00 | 1.00 | 0.00 |
| Marbre | Marbled cat | 4.00 | 0.00 | 1.00 | 3.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 3.00 | 0.00 | 3.00 | 1.00 | 0.00 |
| Tigrin | Tiger cat | 2.00 | 0.00 | 1.00 | 3.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 2.00 | 0.00 | 3.00 | 1.00 | 0.00 |
| Temminck | Temminck | 1.00 | 0.00 | 1.00 | 3.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 2.00 | 0.00 | 3.00 | 1.00 | 0.00 |
| Andes | Andean mountain cat | 2.00 | 1.00 | 1.00 | 3.00 | 1.00 | 0.00 | 1.00 | 1.00 | 2.00 | 2.00 | 0.00 | 2.00 | 1.00 | 0.00 |
Table A2.
Description of variables given in Table A1.
| Variable | Description | Modalities |
|---|---|---|
| Typpel | Appearance of the coat | Unblemished, plain |
| Spotted | ||
| Striped | ||
| Marble | ||
| Longpoill | Fur | Short hairs |
| Long hairs | ||
| Retract | Retractable claws | Yes |
| No | ||
| Comport | Predatory behavior | Diurnal |
| Diurnal or nocturnal | ||
| Nocturnal | ||
| Orielles | Type of ears | Round or rounded |
| Pointed | ||
| Larynx | Presence of hyoid bone | Yes |
| No | ||
| Taille | Waist at the withers | Small |
| Average | ||
| Big | ||
| Poids | Weight | Low |
| Middle | ||
| Heavy | ||
| Longueur | Body length | Small |
| Middle | ||
| Big | ||
| Queue | The relative length of the tail | Short |
| Medium | ||
| Long | ||
| Dents | Developed fangs | Yes |
| No | ||
| Typproie | Type of prey | Big |
| Big or small | ||
| Small | ||
| Arbres | Climb tree | Yes |
| No | ||
| Chasse | On the run or on the lookout (prowl) | Yes |
| No |
References
- Czekanowski, J. Zur Differentialdiagnose der Neandertalgruppe. Korespondentblatt der Deutschen Gesellschaft für Anthropologie Ethnologie und Urgeschichte 1909, XL, 44–47. [Google Scholar]
- Harkanth, S.; Phulpagar, B.D. A survey on clustering methods and algorithms. Int. J. Comput. Sci. Inf. Technol. 2013, 4, 687–691. [Google Scholar]
- Madhulatha, T.S. An overview on clustering methods. arXiv 2012, arXiv:1205.1117. [Google Scholar] [CrossRef]
- Madhulatha, T.S. An overview of clustering methods. Intell. Data Anal. 2007, 11, 583–605. [Google Scholar] [CrossRef]
- Xu, R.; Wunsch, D. Survey of clustering algorithms. IEEE Trans. Neural Netw. 2005, 16, 645–678. [Google Scholar] [CrossRef] [PubMed]
- Xu, D.; Tian, Y. A comprehensive survey of clustering algorithms. Ann. Data Sci. 2015, 2, 165–193. [Google Scholar] [CrossRef]
- Ahlquist, J.S.; Breunig, C. Model-based clustering and typologies in the social sciences. Political Anal. 2010, 20, 325–346. [Google Scholar] [CrossRef]
- Aldenderfer, M.S.; Blashfield, R.K. A review of clustering methods. Clust. Anal. 1984, 33–61. [Google Scholar] [CrossRef]
- Díaz-Costa, E.; Fernández-Cano, A.; Faouzi, T.; Henríquez, C.F. Validación del constructo subyacente en una escala de evaluación del impacto de la investigación educativa sobre la práctica docente mediante análisis factorial confirmatorio. Rev. Investig. Educ. 2015, 33, 47–63. [Google Scholar] [CrossRef]
- Díaz-Costa, E.; Fernández-Cano, A.; Faouzi-Nadim, T.; Caamaño-Carrillo, C. Modelamiento y estimación del índice de impacto de la investigación sobre la docencia. Revista Electrónica Interuniversitaria de Formación del Profesorado 2019, 22, 211–228. [Google Scholar]
- Fonseca, J.R.S. Clustering in the field of social sciences: That is your choice. Int. J. Soc. Res. Methodol. 2013, 16, 403–428. [Google Scholar] [CrossRef]
- Rice, P.M.; Saffer, M.E. Cluster analysis of mixed-level data: Pottery provenience as an example. J. Archaeol. Sci. 1982, 9, 395–409. [Google Scholar] [CrossRef]
- Huang, Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 1998, 2, 283–304. [Google Scholar] [CrossRef]
- Marcotorchino, F.; Michaud, P. Agregation de similarites en classification automatique. Rev. Stat. Appl. 1982, 30, 21–44. [Google Scholar]
- Bock, H.-H. Origins and extensions of the k-means algorithm in cluster analysis. Electron. J. Hist. Probab. Stat. 2008, 4, 1–18. [Google Scholar]
- Forgy, E.W. Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics 1965, 21, 768–769. [Google Scholar]
- Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666. [Google Scholar] [CrossRef]
- MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Oakland, CA, USA, 1967; Volume 1, pp. 281–297. [Google Scholar]
- Goyal, M.; Aggarwal, S. A Review on K-Mode Clustering Algorithm. Int. J. Adv. Res. Comput. Sci. 2017, 8, 1615–1620. [Google Scholar] [CrossRef]
- Xiong, T.; Wang, S.; Mayers, A.; Monga, E. DHCC: Divisive hierarchical clustering of categorical data. Data Min. Knowl. Discov. 2012, 24, 103–135. [Google Scholar] [CrossRef]
- Lakshmi, K.; Visalakshi, N.K.; Shanthi, S.; Parvathavarthini, S. Clustering categorical data using K-Modes based on cuckoo searsh optimization algorithm. ICTACT J. Soft Comput. 2017, 8, 1561–1566. [Google Scholar] [CrossRef]
- Dorman, K.S.; Maitra, R. An Efficient k-modes Algorithm for Clustering Categorical Datasets. Stat. Anal. Data Min. ASA Data Sci. J. 2022, 15, 83–97. [Google Scholar] [CrossRef]
- Ali, D.S.; Ghoneim, A.; Saleh, M. K-modes and Entropy Cluster Centers Initialization Methods. In ICORES; SciTePress: Setúbal, Portugal, 2017; pp. 447–454. [Google Scholar]
- Khan, S.S.; Kant, S. Computation of Initial Modes for K-modes Clustering Algorithm Using Evidence Accumulation. In IJCAI; Morgan Kaufmann Publishers: San Francisco, CA, USA, 2007; pp. 2784–2789. [Google Scholar]
- Huang, Z.; Ng, M.K. A fuzzy k-modes algorithm for clustering categorical data. IEEE Trans. Fuzzy Syst. 1999, 7, 446–452. [Google Scholar] [CrossRef]
- Gan, G.; Ma, C.; Wu, J. Data Clustering: Theory, Algorithms, and Applications; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2020. [Google Scholar]
- Jiang, Z.; Liu, X. A novel consensus fuzzy k-modes clustering using coupling DNA-chain-hypergraph P system for categorical data. Processes 2020, 8, 1326. [Google Scholar] [CrossRef]
- Kim, D.-W.; Lee, K.H.; Lee, D. Fuzzy clustering of categorical data using fuzzy centroids. Pattern Recognit. Lett. 2004, 25, 1263–1271. [Google Scholar] [CrossRef]
- Ng, M.K.; Li, M.J.; Huang, J.Z.; He, Z. On the impact of dissimilarity measure in k-modes clustering algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 503–507. [Google Scholar] [CrossRef] [PubMed]
- Cao, F.; Liang, J.; Li, D.; Bai, L.; Dang, C. A dissimilarity measure for the k-modes clustering algorithm. Knowl.-Based Syst. 2012, 26, 120–127. [Google Scholar] [CrossRef]
- Hazarika, I.; Mahanta, A.K.; Das, D. A New Categorical Data Clustering Technique Based on Genetic Algorithm. Int. J. Appl. Eng. Res. 2017, 12, 12075–12082. [Google Scholar]
- Khandelwal, G.; Sharma, R. A simple yet fast clustering approach for categorical data. Processes 2015, 120, 25–30. [Google Scholar] [CrossRef]
- Seman, A.; Bakar, Z.A.; Sapawi, A.M.; Othman, I.R. A medoid-based method for clustering categorical data. J. Artif. Intell. 2013, 6, 257. [Google Scholar] [CrossRef]
- Michaud, P.; Marcotorchino, F. Modèles d’optimisation en analyse des données relationnelles. Math. Sci. Hum. 1979, 67, 7–38. [Google Scholar]
- Michaud, P. Condorcet: A man of the avant–garde. Appl. Stoch. Model. Data Anal. 1987, 3, 173–189. [Google Scholar] [CrossRef]
- Michaud, P. The true rule of the Marquis de Condorcet. In Compromise, Negotiation and Group Decision; Springer: New York, NY, USA, 1988; pp. 83–100. [Google Scholar]
- Michaud, P. Agrégation à la Majorité II: Analyse du Résultat d’un Vote; Centre Scientifique IBM France: Paris, France, 1985. [Google Scholar]
- Hägele, G.; Pukelsheim, F. Llull's writings on electoral systems. Stud. Lul. 2001, 41, 3–38. [Google Scholar]
- Marcotorchino, F. Liaison Analyse Factorielle-Analyse Relationnelle: Dualité Burt-Condorcet; IEEE Centre Scientifique IBM France: Paris, France, 1989. [Google Scholar]
- Lebbah, M.; Bennani, Y.; Grozavu, N.; Benhadda, H. Relational analysis for clustering consensus. Mach. Learn. 2010, 45–59. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).