Categorical Data Clustering: A Bibliometric Analysis and Taxonomy

Numerous real-world applications apply categorical data clustering to find hidden patterns in the data. The K-modes-based algorithm is a popular algorithm for solving common issues in categorical data, from outlier and noise sensitivity to local optima, utilizing metaheuristic methods. Many studies have focused on increasing clustering performance, with new methods now outperforming the traditional K-modes algorithm. It is important to investigate this evolution to help scholars understand how the existing algorithms overcome the common issues of categorical data. Using a research-area-based bibliometric analysis, this study retrieved articles from the Web of Science (WoS) Core Collection published between 2014 and 2023. This study presents a deep analysis of 64 articles to develop a new taxonomy of categorical data clustering algorithms. This study also discusses the potential challenges and opportunities in possible alternative solutions to categorical data clustering.


Introduction
Currently, the internet and artificial intelligence are undergoing significant development. Consequently, they generate vast quantities of data: structured data such as personal biodata, surveys, stock market data, medical records, marketing data, and e-commerce transactions; data generated by applications used in fields such as science and engineering; and unstructured data gathered from the internet, such as data from the Google search engine or information extracted from social media platforms. Therefore, mining these data to derive insightful information has become more important. This process is called data mining or knowledge discovery in databases (KDD).
Numerous methods exist in data mining, depending on how they process the data. For example, supervised learning involves processing data based on their labeled attributes. This method utilizes historical data to train the model and subsequently generates results based on the patterns learned during training. Classification, prediction, and regression are tasks performed under supervised learning. In contrast, unsupervised learning involves processing data without explicit supervision or labeled target variables. One common method is clustering, which identifies hidden patterns based on similarities within the data. The more similar the data points are, the more likely it is they will be grouped into the same cluster.
Since similarity is an important factor in enhancing cluster quality, it is essential to comprehend how to measure the similarity between data objects in the dataset. Each data object possesses attributes, each distinguished by its data type. For instance, the Iris dataset [12] comprises 150 data objects and four attributes: sepal length, sepal width, petal length, and petal width. These attributes are numerical; the other broad data type is categorical. Numerical data types are further divided into interval and ratio, while categorical types are divided into nominal and ordinal [13].
Different data types require different distance metrics to calculate the similarity between their data objects, and different clustering algorithms are employed to process them. Two widely used clustering algorithms are the K-means algorithm [14], used for clustering numerical data, and the K-modes algorithm [15], used for clustering categorical data.
At least three characteristics, concerning similarity metrics and cluster representation, distinguish the K-modes algorithm from the K-means algorithm: (1) The K-means algorithm employs means to represent the clusters, whereas the K-modes algorithm uses modes. (2) The K-means algorithm utilizes Euclidean distance, whereas the K-modes algorithm employs a matching-based dissimilarity metric. (3) The K-modes algorithm applies a frequency-based method to update the mode. Both algorithms have many variants, and numerous studies classify them into several taxonomies of clustering methods that complement each other.
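The three characteristics above can be illustrated with a minimal sketch. It assumes the simple matching dissimilarity (a count of mismatched attributes) and a batch-style update; the function names, the toy data, and the choice of initial modes are illustrative assumptions, not taken from any of the reviewed articles.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Simple matching dissimilarity: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def update_mode(members):
    """Frequency-based update: the most frequent category of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*members))

def k_modes(data, init_modes, max_iter=100):
    """Batch-style K-modes sketch: assign objects to the nearest mode, then update modes."""
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)), key=lambda k: matching_dissimilarity(row, modes[k]))
            for row in data
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        for k in range(len(modes)):
            members = [row for row, lab in zip(data, labels) if lab == k]
            if members:  # an empty cluster keeps its previous mode
                modes[k] = update_mode(members)
    return labels, modes

# Toy categorical data (hypothetical): colour, size, shape.
data = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("green", "large", "square"),
]
labels, modes = k_modes(data, init_modes=[data[0], data[3]])
print(labels)
print(modes)
```

Because the result depends on the initial modes, a different choice of `init_modes` may converge to a different partition, which is the local-optimum sensitivity that many of the reviewed algorithms try to address.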
A respected taxonomy of clustering methods was proposed by [16], which categorizes clustering into hierarchical and partitional based on how clusters are produced. Additionally, clustering can be differentiated based on memberships, such as hard and fuzzy clustering. Taxonomies may vary and overlap. For instance, ref. [17] categorizes clustering into hard, fuzzy, and rough set clustering. Another taxonomy, proposed by [18], classifies distance or similarity metrics for categorical data, distinguishing similarity into context-sensitive and context-free, with the context-free category comprising probabilistic, information-theoretic, and frequency-based approaches.
Previous review studies employ various methodologies such as systematic literature reviews (SLR), bibliometric analysis, or meta-analysis depending on the scope of the studies, number of studies, and objectives. Therefore, drawing inspiration from previous review studies, particularly those focusing on categorical data, as shown in Table 1, this study aims to develop a taxonomy to refine the previous classification of categorical data clustering. This objective will be pursued by performing bibliometric analysis, presenting quantitative and qualitative synthesis, and analyzing eligible articles based on content screening.
Table 1 (excerpt). [17]: A survey of categorical clustering. The classification of clustering consists of hard, fuzzy, and rough sets and three evaluation metrics (accuracy, precision, and recall).
The second stage remains closely linked to the first, involving data cleaning and imputation to ensure no duplication. However, misspellings identified during data retrieval from the WoS, such as "<i>k</i>-modes," need correction to "k-modes." The next phase involves content analysis to determine the eligibility of articles based on the aims, scope, and criteria of this study.
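The cleaning step described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (a dictionary with a "title" field), the lowercasing, and both function names are hypothetical and are not the procedure actually used in this study.

```python
import re

def clean_keyword(raw):
    """Strip HTML-style tags and normalise whitespace and case (assumed normalisation)."""
    text = re.sub(r"<[^>]+>", "", raw)  # "<i>k</i>-modes" becomes "k-modes"
    return " ".join(text.split()).lower()

def deduplicate(records, key="title"):
    """Keep the first record for each normalised value of the chosen field."""
    seen, unique = set(), []
    for record in records:
        normalised = clean_keyword(record[key])
        if normalised not in seen:
            seen.add(normalised)
            unique.append(record)
    return unique

# Hypothetical records exported from a bibliographic database.
records = [
    {"title": "Fuzzy <i>K</i>-Modes"},
    {"title": "fuzzy k-modes "},
    {"title": "Rough sets"},
]
cleaned = deduplicate(records)
print(clean_keyword("<i>k</i>-modes"))
print(len(cleaned))
```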
In the third stage, the results are categorized into three phases: (1) quantitative synthesis and analysis, (2) qualitative synthesis and analysis, and (3) taxonomy development. In the first phase, bibliometric analysis is employed to analyze the performance of articles, focusing on publication years, titles, publishers, and authors. Moreover, visualization using co-word and citation networks is demonstrated for science mapping. Following this, the qualitative synthesis and analysis phase is conducted, resulting in the development of the taxonomy.
In the final stage, the discussion and conclusion are presented. This section addresses potential challenges and outlines future research directions, contributing alternative solutions to the existing methods.
In line with the previous study detailed in Table 1 and the research design procedure outlined in Figure 1, articles are retrieved in the CSV and RIS formats, and keywords and databases are specified. Given the emphasis on categorical data clustering, the chosen keywords are "clustering" and "categorical data." Although the terms "categorical data clustering" and "clustering categorical data" are often used interchangeably, they may convey slightly different connotations depending on the context. Generally, both phrases refer to the process of grouping or categorizing data points with categorical attributes.
The database used in this study is the WoS Core Collection. The WoS is well known for indexing high-quality journals, and the databases used for article retrieval by previous review papers in the clustering domain (Table 1) were also examined. For instance, Ezugwu conducted two systematic reviews. In the first review, Ezugwu [25] utilized the WoS database to examine nature-inspired metaheuristic algorithms for both non-automatic and automatic clustering, identifying 40 automatic clustering algorithms. In the second review, Ezugwu [26] utilized the Scopus database and identified 37 automatic clustering algorithms. This comparison suggests that the WoS database alone is sufficient for retrieving clustering-related articles. Additionally, many review papers rely on the official website [17–19,24] or the publisher's website [20–22] as their primary source. In another study, Wang et al. [23] retrieved articles indexed by Google Scholar and the WoS database. Hence, after considering and comparing numerous database sources, this study restricts the database to the WoS Core Collection. A total of 1731 articles were identified. After applying the inclusion criteria, only 567 articles were selected for further analysis. These articles were directly retrieved from the WoS in CSV and RIS format.
The first criterion specifies publication years with index dates between 1 January 2014 and 5 December 2023. This time frame is selected to align with a previous survey on categorical data clustering conducted in [17]. That study represents the first survey on categorical data, analyzing 32 algorithms published from 1998, when the K-modes algorithm was first introduced, to 2017. Of these 32 algorithms, 27 are related, and only 6 meet the inclusion criteria in this study [37–42].
The second criterion is the document type. This study excludes proceeding papers, book chapters, editorial materials, and other document types, retaining only articles, as the number of other document types is insignificant. All documents are in English and belong to areas of computer science, mathematics, and engineering. Further details are provided in the flow diagram according to PRISMA 2020, shown in Figure 2. Additionally, Table 2 presents a summary of the keywords used and the databases selected, where n denotes the number of articles.
The inclusion criteria for articles involve the topic of categorical data clustering, focusing specifically on partition-based clustering and its variations. However, articles on statistic-based or model-based clustering, such as Latent Data Analysis [43–45] and EM algorithms [46], are excluded. Additionally, the exclusion criteria encompass articles related to multiview, co-clustering, consensus, deep learning clustering, and methods related to data stream clustering. This decision is based on the fact that many of these methods are employed in semi-supervised learning, which differs from the unsupervised learning approach adopted in this study.
Furthermore, this study excludes algorithms that process numerical, mixed (numerical and categorical), text data, and sequential categorical data. Despite the variety of data types, processing categorical data in clustering remains challenging compared to numerical data. This is primarily due to differences in calculating the distance between data points; therefore, the scope of this study is limited to addressing the specific research questions.

Results
This section consists of three phases: (1) quantitative synthesis and analysis, (2) qualitative synthesis and analysis, and (3) taxonomy development.

Quantitative Synthesis and Analysis
Quantitative synthesis and analysis involve employing bibliometric analysis to explore trends, identify patterns, and analyze the performance of articles. Initially, the contributions of research constituents related to categorical data clustering are assessed by presenting the performance analysis in a descriptive format. Three types of performance analysis are conducted: (1) citation-related metrics, (2) publication-related metrics, and (3) citation-and-publication-related metrics. This study solely utilizes publication-related metrics. Subsequently, science mapping is performed to investigate the relationships between research constituents. Co-word and citation analyses are employed to determine the connections between topics/keywords and cited publications.

• Publication Years
Figure 3 shows the publication years, where the Y-axes represent the number of publications (left) and citations (right). The publication trend shows an increase, with 2019 emerging as the most productive year. In that year, 11 articles related to clustering were published, covering hierarchical-based [47,48], rough-set-based [49,50], weight-based [51], graph-based [52], a variant of fuzzy clustering [53–55], integer linear programming [56], and clustering validity [57]. Moreover, for a deeper comprehension of the citations and publications spanning the period from 2014 to 2023, Table 3 presents the top ten cited articles. TC denotes the total citations, while AC represents the average citations per year.

• Publication Titles and Publishers
Table 4 shows the publication titles, with TP representing the total publications. All publication titles (journals) are categorized under "computer science, artificial intelligence" or "computer science, information systems." In total, there are 41 journals, with Neurocomputing ranking highest on the list. Furthermore, Table 5 presents the publishers, with these five publishers covering over 80% of published articles.

• Authors
Each author contributes to a specific area of categorical data clustering. The total publications (TP) presented in Table 6 count the articles written by each person as either author or co-author. Ten authors contribute to more than 50% of the articles. Furthermore, the most productive authors in categorical data clustering are J. Y. Liang and R. J. Kuo. These authors have collaborated with co-authors who rank highly and have contributed three to four articles. For example, J. Y. Liang, as a co-author, collaborated with L. Bai on a study optimizing the objective function of partition clustering [39,65,66]. Additionally, J. Y. Liang collaborated with F. Y. Cao and J. Z. X. Huang on proposing clustering algorithms tailored for various data types, such as set-valued features [40,67] and matrix-object data [68]. Collaborations with authors such as W. Wei [47] and Y. H. Qian [59] further exemplify J. Y. Liang's significant contributions to the field. Similar to J. Y. Liang, R. J. Kuo has conducted studies on metaheuristic-based clustering in collaboration with T. P. Q. Nguyen [2,53–55,60,69]. Other authors, such as S. Salem, S. Naouali, and Z. Chtourou, have also worked on rough-set clustering [64,70–72]. Furthermore, Y. M. Cheung, as the second author, has proposed numerous methods related to distance metrics with Y. Q. Zhang [73–75] and H. Jia [58]. Moreover, F. L. Chen proposed variant methods for optimizing the objective function in subspace clustering algorithms [61,76,77].
Considering the most productive authors alongside the citation counts of the top cited articles in Table 3, several findings are revealed: (1) Four of the ten articles are attributed to the top ten authors [59–61,64]. However, it is notable that the first author of the most cited article is not the most productive, even though their co-author is included among the ten most productive authors. (2) Despite the productivity of certain authors, articles focusing on topics like set-valued features and matrix-object data, as well as the subject of cluster validity, do not appear to receive significant citation counts. (3) Many articles related to metaheuristics published after 2018 are also not among the top cited articles. Nevertheless, these findings warrant further investigation, primarily due to the scope and limitations of this study, including the range of publication years and the impact of topics on citation counts.

Science Mapping
Science mapping constitutes one of the principal methodologies in bibliometric analysis. It involves a range of techniques, each distinguished by its usage and data utilization. These techniques include citation analysis, co-citation analysis, bibliographic coupling, co-word analysis, and co-authorship analysis [29]. For the purposes of this study, we focused solely on two specific techniques: co-word analysis and citation analysis.
Co-word analysis involves examining the co-occurrence of word pairs or the frequency with which two or more words appear together in a given corpus. In this study, the words were extracted from "author keywords." This method operates on the assumption that keywords frequently appearing together are thematically related, thereby aiding in the formation of thematic clusters that define specific topics.
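The counting step behind co-word analysis can be sketched as follows. The keyword lists are hypothetical, and the sketch only counts pairwise co-occurrences; it does not reproduce the normalization, layout, or clustering that a dedicated science mapping tool applies afterwards.

```python
from collections import Counter
from itertools import combinations

def coword_counts(keyword_lists):
    """Count the number of articles in which each unordered keyword pair co-occurs."""
    pairs = Counter()
    for keywords in keyword_lists:
        # Normalise and deduplicate so a pair is counted once per article.
        unique = sorted({k.strip().lower() for k in keywords})
        pairs.update(combinations(unique, 2))
    return pairs

# Hypothetical author keywords for three articles.
articles = [
    ["k-modes", "rough set theory", "outlier detection"],
    ["K-modes", "rough set theory"],
    ["fuzzy clustering", "genetic algorithm"],
]
counts = coword_counts(articles)
print(counts.most_common(3))
```

Pairs with higher counts are candidates for belonging to the same thematic cluster, while pairs that never co-occur, such as "fuzzy clustering" and "k-modes" in this toy input, have a count of zero.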
In contrast, citation analysis focuses on the relationships among publications rather than their content. Additionally, for further examination, co-citations can be employed to relate publications frequently cited together. In a co-citation network, the connection between two publications is determined by their co-occurrence in the reference lists of other publications. Although co-citation analysis can identify highly influential publications, this study primarily aims to explore the relationships among publications over a specific ten-year period. Consequently, the use of co-citations in this study might yield overly generalized results.
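The difference between direct citation and co-citation can be made concrete with a small sketch. The paper identifiers and reference lists are hypothetical, and the first function counts only incoming citations from within the analyzed set, which is one possible reading of a citation link and is an assumption here.

```python
from collections import Counter
from itertools import combinations

def citation_links(references):
    """Direct citation: for each paper in the set, count the other papers in the set citing it."""
    in_set = set(references)
    links = Counter()
    for citing, cited_list in references.items():
        for cited in set(cited_list):
            if cited in in_set and cited != citing:
                links[cited] += 1
    return links

def cocitation_counts(references):
    """Co-citation: count how often two publications appear in the same reference list."""
    pairs = Counter()
    for cited_list in references.values():
        pairs.update(combinations(sorted(set(cited_list)), 2))
    return pairs

# Hypothetical reference lists: A, B, C are in the analyzed set; X, Y are older works.
references = {
    "A": ["X", "Y"],
    "B": ["A", "X", "Y"],
    "C": ["A", "B", "X"],
}
direct = citation_links(references)
cocited = cocitation_counts(references)
print(direct)
print(cocited.most_common(2))
```

In this toy input the strongest co-citation pairs involve X and Y, which lie outside the analyzed set. This illustrates why co-citation tends to surface older, widely cited works rather than relationships within a fixed publication window.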
Moreover, visualization techniques serve as valuable tools for representing the science map. Each science map employs distinct analysis techniques and algorithms. Cobo [30] conducted a comparative study of nine science mapping applications, clarifying their advantages and drawbacks.
This study utilized VOSviewer, employing network analysis as its method. As illustrated in Figure 4, each label (keyword) is interconnected, with the size of the labels corresponding to their frequency. Larger labels indicate a higher frequency of appearance. Furthermore, thematic clusters are distinguished through the use of different colors in the visualization. The color sequence is as follows: red, green, blue, and yellow.
The co-word network depicted in Figure 4 reveals four thematic clusters comprising a total of 182 keywords. Notably, Cluster 1 is significantly larger than the other clusters. While certain keywords such as "distance metric," "internal cluster validity index," "evaluation," and "dissimilarity measure for clustering" possess a general scope, a few keywords stand out for their association with specific concepts, such as "outlier detection," "k-modes," "k-modes clustering," "condorcet clustering," and "rough set theory."
Within Cluster 2, numerous keywords relating to variations of fuzzy concepts are apparent, including "fuzzy centroid," "fuzzy clustering," "fuzzy k-modes," "fuzzy k-modes algorithm," "fuzzy sv-k-modes," "intuitionistic fuzzy set," "rough fuzzy clustering," and "wfk-modes." Furthermore, alongside fuzzy clustering, keywords related to metaheuristics such as "genetic algorithm," "particle swarm optimization," "sine cosine algorithm," and "simulated annealing" are prevalent. Additionally, another prominent topic within Cluster 2 is "multi-objective optimization."
Cluster 3 presents a distinct set of algorithms compared to the previous clusters. Specific keywords within this cluster include "hierarchical clustering," "graph embedding," "divisive clustering," "granular computing," "locality-sensitive hashing," "distribution approximation," and "holo-entropy." Notably, the keyword "hierarchical clustering" strongly connects to "rough set."
The final cluster also features specific keywords, mainly related to "high-dimensional data," "attribute weighting," "cluster weighting," "dissimilarity," "similarity," "distance measure," "coupled dcp system," and "kernel density estimation," in addition to clustering methods such as the "k-mw-modes algorithm" and "automatic clustering." A summary of the thematic clusters is provided in Table 7.

Table 7. Summary of the thematic clusters.
Cluster | #Keywords | Summary
1 | (not given) | A strong connection exists between the K-modes algorithm and rough set theory. Furthermore, rough sets are linked to outlier detection, which, in turn, is associated with the initial cluster centers. This linkage suggests that rough sets are utilized to address outliers in the K-modes algorithm arising from the random initialization of cluster centroids.
2 | 43 | This cluster covers the fuzzy clustering algorithm, including variations such as the fuzzy K-modes (FKM) algorithm and rough fuzzy clustering. Additionally, the cluster highlights a growing trend in optimizing fuzzy clustering using metaheuristic-based algorithms. Consequently, future studies should delve deeper into investigating the optimization of fuzzy clustering, leveraging not only genetic algorithms and particle swarm optimization but also other metaheuristics to enhance algorithm performance.
3 | 42 | This cluster covers hierarchical clustering and its relationship with rough set theory. Additionally, it includes keywords related to cluster analysis, such as graph embedding and cluster validity functions.
4 | 42 | The keywords in this cluster are associated with dissimilarity methods and attribute weighting, such as kernel density estimation and probabilistic frameworks.
Since co-word analysis relies on authors' keywords, redundancy can occur. Therefore, this study experimented with visualization techniques employing multiple clusters to address this issue. The findings revealed that four thematic clusters effectively identified and represented the relationships between categorical data clustering topics. Additionally, to clarify the relationships among publications, this study constructed a citation network. Among the 64 articles analyzed, 56 were interconnected, while 8 exhibited no connections.
The citation analysis shown in Figure 5 illustrates the relationships among publications, with larger labels (articles) indicating the most influential publications. An interesting aspect to explore is the relationship between the total number of citations (TC) and the number of citations between publications (links). For example, reference [58] by Jia et al. in 2016 is associated with 15 links and 74 TCs. In other words, out of the 64 articles analyzed, 15 are linked to the work of Jia et al. [58]. Further details are provided in Table 8.

Qualitative Synthesis and Analysis
Qualitative synthesis and analysis are conducted following the quantitative synthesis and analysis. This phase comprehensively explains the 64 articles identified through the screening process. First, the articles are categorized according to the classification proposed by [16,17], which distinguishes between hierarchical clustering and partitional clustering. Partitional clustering further encompasses hard, fuzzy, and rough-set-based clustering methods. Subsequently, the third and fourth sections focus on algorithms that specifically modify the distance function and weighting method, while the fifth section discusses algorithms related to validity functions. Additionally, this subsection provides a summary of the datasets and performance evaluation criteria utilized by the various algorithms. Detailed explanations and patterns identified during the analysis of these algorithms are presented in the following subsections.

Hierarchical Clustering
The hierarchical clustering algorithms are categorized into divisive and agglomerative hierarchical clustering. Among the identified articles, three algorithms are based on divisive hierarchical clustering, while two focus on agglomerative hierarchical clustering, as shown in Table 9. Notably, many of these algorithms are based on information theory. Furthermore, there has been significant advancement in the performance of previous algorithms, shown by the improvement of the min-min-roughness (MMR) algorithm [99].
Table 9. Hierarchical clustering algorithms (only part of the table is recoverable).
Author (year) | Algorithm | Approach | Compared algorithms
… | … | Divisive, based on an information-theoretic approach | MMR, MGR, MDA [103], TR [104]
Sun et al. (2017) | HPCCD [96] | Agglomerative, based on an information-theoretic approach | MGR [41], COOLCAT, LIMBO [105], K-modes, …
Altameem et al. (2023) | P-ROCK [106] | Agglomerative, link-based | ROCK [107]
(1) Divisive Hierarchical Clustering
Li et al. [38] introduced the maximum total mean distribution precision (MTMDP), aiming to improve the min-min-roughness (MMR) algorithm [99] based on probabilistic rough set theory. The MTMDP algorithm involves three main improvements: (1) It utilizes distribution approximation precision instead of the accuracy of approximation employed in the MMR algorithm. (2) Candidate attributes are ranked by total mean distribution precision rather than by max mean distribution precision. (3) Leaf node splitting is performed based on the smallest cohesion degree rather than selecting the leaf node with more objects for further splitting. As a result, the proposed algorithm demonstrates efficacy in handling uncertain and imbalanced datasets, enabling automatic cluster detection and the analysis of high-dimensional datasets. A future study of the MTMDP algorithm may explore automatic subspace clustering for high-dimensional data or its implementation for mixed numeric and categorical datasets.
Similar to MTMDP, Qin et al. [41] introduced an algorithm inspired by MMR called mean gain ratio (MGR), which is based on information theory. Unlike the MMR algorithm, MGR avoids bias towards extreme selection, as extreme selection can potentially decrease accuracy. First, MGR selects a clustering attribute using the mean gain ratio and then identifies an equivalence class on the clustering attribute using cluster entropy. Notably, the MGR algorithm can operate without specifying the number of clusters. In each iteration, a cluster is discovered regardless of length, followed by a binary split on the remaining objects.
Consequently, this algorithm is well suited for large categorical datasets with imbalanced distributions. Experimental results demonstrate that MGR is efficient and scalable. In the future, enhancing accuracy can be pursued in two ways: integrating the MGR algorithm with the genetic clustering algorithm (G-ANMI) [101], and incorporating the reprocessing procedure from the COOLCAT algorithm [102].
Wei et al. [47] proposed another approach to improve the splitting of clusters in divisive hierarchical clustering. Initially, they conducted a comprehensive analysis of existing divisive hierarchical clustering algorithms, including MMR [99], MGR [41], MDA [103], and TR [104]. After that, they created a unified framework based on the strengths and weaknesses of these algorithms. Within this framework, the mean normalized information gain (MNIG) was introduced, specifically designed to address the limitations of MGR. Additionally, the K-modes object function (KOF) identifies suitable measures for attribute selection. Both KOF and MNIG contribute to determining the method for splitting clusters into subclusters and identifying which cluster should be split in each iteration. While KOF, MNIG, and other measures, such as a maximum number of objects (MO) and information entropies (IE), perform well in certain steps, identifying the optimal measure that universally fits all problems remains challenging.
(2) Agglomerative Hierarchical Clustering
On the other hand, Sun et al. [96] developed their algorithm based on agglomerative hierarchical clustering. Their proposed algorithm, named hierarchical projected clustering for categorical data (HPCCD), clusters high-dimensional data using the weighted holoentropy [108] instead of pairwise-similarity-based measures for merging two subclusters. HPCCD can distinguish relevant attributes within clusters and identify both the principal feature space and the core feature space, which is critical for clustering high-dimensional data. The experimental results indicate that HPCCD outperforms the MGR [41], K-modes [15], COOLCAT [102], and scalable information bottleneck (LIMBO) [105] in terms of efficiency, accuracy, and reproducibility.
In contrast to the aforementioned variations of hierarchical-based clustering, the algorithm proposed by Altameem et al. [106] stands out.Their approach aims to modify the ROCK algorithm [107] by allowing user-defined parameters as input, thus enhancing the flexibility of the algorithm.This modified version is named The Parameterized-ROCK (P-ROCK).The parameters involved include the threshold (θ) for neighborhood decision, f(θ), and h(θ).The P-ROCK algorithm was tested using two datasets from the UCI repository: the small soybean dataset and the congressional votes dataset.The results indicate that the P-ROCK algorithm shows improved accuracy and runtime compared to the original ROCK algorithm.Furthermore, P-ROCK outperforms other variations of ROCK, such as QROCK [109] and MROCK [110], in terms of computing time.

Partition Clustering
Partition clustering includes hard, fuzzy, and rough-set clustering methods [17]. Among the 64 articles analyzed, 11 developed algorithms based on hard clustering, 12 focused on fuzzy clustering, and 10 were based on rough-set clustering. Each article is categorized based on its specific approach or characteristics, which helps identify its contributions.
(1) Hard Clustering
The summary of algorithms for hard clustering is presented in Table 10. Given the variation in terms and acronyms used across different articles or algorithms, this study standardized their names to enhance clarity. For instance, acronyms such as "KMD," "KM," "K-modes," and "Huang's K-modes" are all standardized as "K-modes" algorithms. However, it is worth noting that similar acronyms may refer to different algorithms; for instance, "WKM" and "Cao" may each refer to more than one algorithm. Additionally, there are cases where the same algorithm is referenced differently, such as the Hamming distance (HD). In such cases, this study follows the conventions established in the original articles. For further details, refer to their corresponding references.
Similarly, their other proposed algorithm, the set-valued K-modes (SV-k-modes) algorithm [40], designed for clustering data with set-valued features, employs a heuristic approach to update cluster centers. The distance between two set-valued objects is measured using the Jaccard coefficient. Furthermore, this approach is tested for scalability and enhances the initialization mechanism for cluster centers. The results demonstrate a superior performance compared to benchmark algorithms, confirming that the SV-k-modes algorithm is scalable for large, high-dimensional datasets.
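To make the Jaccard-based distance concrete, the following is a minimal sketch rather than the SV-k-modes implementation from [40]. The function names and the choice to sum per-feature distances are assumptions made for illustration only.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def set_valued_distance(x: list, y: list) -> float:
    """Sum the per-feature Jaccard distances of two set-valued objects.

    Summing over features is an illustrative assumption, not taken from [40].
    """
    return sum(jaccard_distance(a, b) for a, b in zip(x, y))


# Two hypothetical objects, each described by two set-valued features.
x = [{"rock", "jazz"}, {"en"}]
y = [{"rock", "pop"}, {"en", "fr"}]
d = set_valued_distance(x, y)  # (1 - 1/3) + (1 - 1/2)
```

Objects that share more values in each feature receive a smaller distance, and identical objects receive a distance of zero.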

• Optimizing the number of clusters
One method for handling uncertainty is belief clustering for dynamic partition (BCDP), proposed by Hariz. This study extends the belief K-modes method (BKM) [113] to dynamic environments. Unlike BKM, which maintains a fixed number of clusters, objects, and features, BCDP considers the uncertainty of attribute values and the potential adjustment of cluster numbers using the concepts of cluster cohesion and separation. This adjustment can involve either increasing (IK-BKM) [139] or decreasing (DK-BKM) [140] the number of clusters. As a result, the partitioning of clusters is updated without requiring a complete re-clustering process from scratch.

• Optimizing the cluster centers
In addition to determining the number of clusters beforehand, another obstacle faced by the K-modes algorithm is overcoming the initialization problem. Various algorithms have been developed based on dissimilarity measures, such as the optimal transfer quick transfer (OTQT) algorithm. The OTQT algorithm [82], developed by Dorman and Maitra, incorporates the Hartigan algorithm for the K-means algorithm [141]. Following the initialization step, the OTQT algorithm implements optimal and quick transfer stages to enhance the objective function rather than relying solely on distance metrics. One improvement in this method is ensuring that clusters are nonempty at the initialization step and in any iteration by initializing with K distinct modes. The OTQT algorithm demonstrates significantly improved accuracy and scalability in clustering complex data.
Dinh and Huynh [98] introduced a method for generating initial clusters based on frequent pattern mining, marking the first attempt to combine this approach with partitional clustering. The pattern-based clustering algorithm for categorical data (k-PbC) relies on the Fp-Max algorithm [142] for maximal frequent itemsets mining (MFIM). Additionally, k-PbC establishes cluster centers through a kernel density estimation method and computes distances using an information-theoretic-based dissimilarity measure (ITBD).
Chen et al. [76] also addressed the sensitivity of the K-modes algorithm in initializing clusters and modes by employing kernel clustering. They utilized the self-expressive kernel density estimation (SKDE) to develop a self-expressive kernel subspace clustering algorithm for categorical data (SKCC). SKCC incorporates feature weighting to discern the importance of attributes.

• Optimizing the objective function for large datasets
Fauzi et al. proposed the α-Condorcet [92] as an extension of Condorcet clustering [138]. Unlike the traditional approach, which does not set the number of clusters a priori and instead uses pairwise comparisons and a simple majority decision rule to maximize Condorcet's criterion, the α-Condorcet sets the number of clusters, α, beforehand. It introduces a new Condorcet criterion function that incorporates similarity measures and proposes a heuristic algorithm. As a result, the algorithm efficiently processes large datasets and produces superior partitions compared to the K-modes algorithm for various values of α.
Large datasets containing more than 100,000 data objects pose challenges in clustering categorical data. To address this issue, Xiao et al. [56] proposed a new algorithm that combines K-modes with integer linear programming (ILP). While ILP techniques are typically effective for small-size data, the proposed method leverages ILP and the framework of variable neighborhood search (VNS) to develop a heuristic approach. This approach minimizes the total inner-distance function of the K-modes algorithm, thereby reducing the computation cost of clustering large datasets.

• Optimizing the objective function based on a multi-objective approach
Another method related to multi-objective clustering based on sequential games is the MOCSG [81]. Inspired by their previous work, clustering based on sequential multiobjective games (CluSMOG) [117], MOCSG extends this approach to numerical data. As a multi-objective clustering algorithm, MOCSG integrates multiple objective functions to optimize R-square (RSQ), connectivity, and intra-cluster inertia objectives. Additionally, MOCSG can dynamically determine the number of clusters.

• Representing data based on the discretization method
Another method designed to handle large datasets is the Manhattan frequency K-means (MFk-M) algorithm [64], proposed by Ben Salem et al. MFk-M employs a K-means-based approach to process categorical data by converting it into numeric values using relative frequency. The use of relative frequency aims to improve the simple matching similarity measure [143]. Additionally, the algorithm utilizes Manhattan distance (L1 norm) instead of Euclidean distance to address outliers and noisy data [144,145]. By adopting this approach, MFk-M results in lower computational costs than the K-modes algorithm, as computing means is less expensive than computing modes.
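The conversion step described above can be sketched as follows. This is only an illustration under the assumption that each category is replaced by its relative frequency within its own attribute; the exact encoding used by MFk-M in [64] may differ, and the function names are hypothetical.

```python
from collections import Counter


def relative_frequency_encode(rows):
    """Replace each categorical value by its relative frequency in its column."""
    n = len(rows)
    columns = list(zip(*rows))
    counts = [Counter(col) for col in columns]
    return [[counts[j][v] / n for j, v in enumerate(row)] for row in rows]


def manhattan(u, v):
    """L1 distance between two numeric vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical toy dataset with two categorical attributes.
rows = [
    ("red", "small"),
    ("red", "large"),
    ("blue", "small"),
    ("red", "small"),
]
encoded = relative_frequency_encode(rows)
d01 = manhattan(encoded[0], encoded[1])  # objects differ only in the second attribute
```

Once the data are numeric, cluster centers can be updated as means, which is the source of the cost advantage over mode computation mentioned above.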
Similar to MFk-M, the algorithm proposed by Bai and Liang [76], categorical data clustering based on data representation with spectral embedding (CDC_DR+SE), also employs a conversion method to represent categorical data as a graph representation instead of using direct ordinal or one-hot encoding methods. The algorithm learns the representation of categorical values from their graph structure, making it easier to capture potential similarities between categorical values and convert them into numerical data. Consequently, existing numerical clustering algorithms can effectively cluster categorical data.
(2) Fuzzy Clustering
A summary of the fuzzy clustering algorithms is presented in Table 11.

• Heuristic approach to cluster set-valued attributes
Fuzzy clustering offers fuzzy membership, allowing one object to belong to more than one cluster based on the percentage of membership. However, both hard and fuzzy clustering algorithms encounter similar challenges, and various techniques have been proposed to address their drawbacks. Cao et al. introduced the SV-k-modes algorithm for clustering categorical data with set-valued attributes [40] and extended it to fuzzy-based clustering, named fuzzy SV-k-modes [67].

• Multivariate membership approach
Furthermore, in relation to fuzzy membership, Maciel et al. introduced multiple fuzzy partitions for FKM to address the ambiguity in data that share properties across different clusters. Their proposed method, the multivariate fuzzy K-modes (MFKM) algorithm [148], acknowledges that attributes in distinct clusters may possess varying degrees of membership. This approach to membership assignment differs from FKM, which assigns uniform membership to all attributes across all clusters. Additionally, the study proposed an internal validation index termed the multivariate fuzzy silhouette index, capable of assessing clustering validity by identifying a relevant subset of variables. Experimental results demonstrate that the MFKM algorithm yields superior solutions, particularly as the number of categories for each variable increases.

• Metaheuristic approach
Another concern arises from the random initialization of centroids, which leads to fast convergence to local optima. To address this, Kuo and Nguyen introduced metaheuristic-based fuzzy clustering to determine initial centroids, emphasizing global search. In their work [2], Kuo and Nguyen integrated the particle swarm optimization algorithm (PSO), genetic algorithm (GA), and artificial bee colony algorithm (ABC) with FKM. Among these methods, the GA-based FKM algorithm achieves the highest accuracy, with PSO demonstrating the most stability.

• Possibilistic-based approach with metaheuristic
Another study by Kuo et al. extended the possibilistic fuzzy C-means (PFCM) [160] to cluster categorical data, yielding the possibilistic fuzzy K-modes (PFKM) algorithm. This algorithm aims to overcome noise and outliers by employing frequency probability-based distance [58] as a dissimilarity measure and the possibility concept from the PFCM algorithm. Metaheuristic approaches are then utilized to optimize the PFKM algorithm to achieve the optimal solution. Among the three methods considered (PSO, the sine-cosine algorithm (SCA), and GA), PFKM based on PSO and SCA demonstrates higher performance and requires less computational time than the GA-based variant, which requires more complex updating rules.

• Intuitionistic fuzzy set theory-based approach
In [55], Kuo and Nguyen further integrated the frequency-probability-based distance metric with the intuitionistic fuzzy set (IFS), designed to handle uncertainty. Their study primarily extends previous methodologies employing IFS to cluster numerical datasets [161–164] to accommodate categorical data. Additionally, the study introduces attribute weighting, adopting the approach outlined by Saha and Das [80] within the framework of IFS, assigning weight factors to each categorical attribute.
However, the performance of the proposed method, the intuitionistic weighted fuzzy K-modes (IWFKM) algorithm by Kuo and Nguyen, is lower than that of the benchmark algorithms GA-FKM [146], SBC [59], and MaOfCentroids [63], because IWFKM cannot avoid the local optima problem. To address this limitation, the authors propose a second algorithm, GIWFKM, which combines IWFKM with GA. The results demonstrate that GIWFKM outperforms all benchmark algorithms.
In 2023, Jiang et al. introduced an algorithm named the kernel-based intuitionistic weight fuzzy K-modes (KIWFKM), which integrates the IFS with kernel-trick and weighting mechanisms [93]. This algorithm aims to overcome noise and distinguish important attributes. Moreover, KIWFKM establishes the coupled DCP system, a chained tissue-like P system integrating DNA genetic rules. The P system, originally proposed by Paun [165], belongs to membrane computing, a nature-inspired computational model that can be optimized using a DNA genetic algorithm [166]. Consequently, KIWFKM is combined with the coupled DCP system, as it provides a novel dynamic evolution model for existing P systems and can address non-combinatorial optimization problems. Experimental results demonstrate that the KIWFKM-DCP algorithm outperforms other related algorithms across various datasets in terms of adjusted Rand index (ARI), normalized mutual information (NMI), accuracy, and F-measure.

• Multi-objective approach
Furthermore, another algorithm based on fuzzy centroids, named MaOfCentroids [63], was proposed by Zhu and Xu. Their preliminary experiments suggest that fuzzy centroids are more effective and stable compared to other traditional fuzzy clustering methods. However, because single-objective algorithms struggle to find the optimal partition, MaOfCentroids adopts a multi-objective clustering approach utilizing a reference point-based non-dominated sorting genetic algorithm to address this challenge. In this approach, fuzzy memberships serve as the chromosome representation. This study is significant as it was the first to employ more than three objective functions based on various cluster validity indexes (CVIs) to evaluate the specific structure or distribution of data.
In addition to MaOfCentroids, several other multi-objective clustering approaches have been integrated with fuzzy clustering, including NSGA-FMC [60], EGA-FMC [90], AFC-NSPSO [53], and PM-FGCA [54]. In 2015, Yang et al. proposed the non-dominated sorting genetic algorithm-fuzzy membership chromosome (NSGA-FMC). NSGA-FMC aims to optimize clustering quality using fuzzy compactness and separation as objective functions. Rather than encoding attributes, NSGA-FMC initializes its chromosomes with fuzzy memberships and proposes a more efficient procedure that selects a solution from the non-dominated Pareto front, leading to faster computation.
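The notion of a non-dominated Pareto front used by these algorithms can be illustrated with a short sketch. It is not the selection procedure of NSGA-FMC [60]; it only shows plain Pareto filtering, assuming both objectives are expressed as costs to be minimized (for example, compactness and a negated separation score). The sample values are hypothetical.

```python
def dominates(p, q):
    """True if p is no worse than q on every objective and strictly better on one.

    All objectives are assumed to be minimized.
    """
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def non_dominated(points):
    """Keep the points that no other point dominates (the Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (compactness cost, separation cost) pairs for four candidate partitions.
candidates = [(0.2, 0.9), (0.3, 0.5), (0.4, 0.6), (0.5, 0.1)]
front = non_dominated(candidates)  # (0.4, 0.6) is dominated by (0.3, 0.5)
```

A multi-objective clustering algorithm then still has to pick one partition from this front, which is the step the selection procedures described above aim to make efficient.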
In contrast, the enhanced genetic algorithm-based fuzzy K-modes clustering (EGA-FMC) proposed by Narasimhan [90] is derived from GA-FKM to enhance both the selection and elitism phases. Unlike NSGA-FMC, EGA-FMC demonstrates efficient clustering of larger datasets. Although the objective functions remain the same as in NSGA-FMC, EGA-FMC employs multi-objective rank-based selection alongside enhanced elitism operations, ensuring the replacement of the worst child of the new population with the best parent before evolution.
Another way to approach multiple objectives is through automatic fuzzy clustering, using the non-dominated sorting particle swarm optimization (AFC-NSPSO) algorithm [53]. This algorithm uses global compactness and fuzzy separation as objective functions. Moreover, the algorithm process is divided into two parts, incorporating control variables to automatically determine the cluster number and allocate objects to their respective clusters. Additionally, the proposed algorithm can identify the maximum number of clusters, which reduces computational time by minimizing iterations.
The main focus of multi-objective clustering algorithms is to enhance the performance of categorical data clustering according to the specific constraints of these algorithms. Another algorithm, known as the partition-and-merge-based fuzzy genetic clustering algorithm (PM-FGCA), is dedicated to determining the optimal number of clusters, starting from a predetermined initial number of clusters [54]. Initially, PM-FGCA employs a multi-objective fuzzy clustering approach similar to that of NSGA-FMC to generate an intermediate clustering solution based on the initial number of clusters. Subsequently, fuzzy centroids are utilized to improve the results. This process involves iteratively merging clusters until satisfactory solutions are obtained. Consequently, the computational time required by PM-FGCA tends to be longer than that of NSGA-FMC.

• Soft subspace clustering based on locality-sensitive hashing (LSH)
Mau et al. introduced the LSHFk-centers [97] algorithm, which incorporates locality-sensitive hashing (LSH) into the fuzzy clustering approach Fk-centers [156] to reduce dimensions. This process involves applying LSH to predict initial fuzzy clusters in a low-dimensional space. The LSHFk-centers algorithm is an extension of LSH-based methods for hard clustering [157]. Despite its effectiveness compared to benchmark algorithms, the computational time of LSHFk-centers remains higher than that of its original method. Moreover, it is even more time-consuming than other membership chromosome-based techniques, such as the MaOfCentroids algorithm. Hence, alternative measures other than distance learning dissimilarity for categorical data (DILCA), such as context-based dissimilarity measures, can be explored. Additionally, to enhance locality-sensitive factors, utilizing properties of multi-attributes as the LSH hash function is recommended.
(3) Rough-set-based clustering
Table 12 presents an overview of the different algorithms for rough-set-based clustering.

• RST based on the K-modes algorithm
Fuzzy set theory and rough set theory (RST) represent two common approaches for handling uncertainty in data. However, they employ distinct techniques. While fuzzy set theory assigns membership degrees within the range of 0 to 1, with 0 indicating no membership and 1 indicating full membership, RST tackles uncertainty by discerning lower and upper approximations.
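The lower and upper approximations mentioned above follow the standard rough set definitions: the lower approximation collects the indiscernibility classes fully contained in a target set, and the upper approximation collects those that overlap it. The sketch below illustrates this on a hypothetical toy table; the function and variable names are illustrative and do not come from any of the surveyed algorithms.

```python
def equivalence_classes(objects, attributes):
    """Group object names that share the same values on the given attributes."""
    classes = {}
    for name, values in objects.items():
        key = tuple(values[a] for a in attributes)
        classes.setdefault(key, set()).add(name)
    return list(classes.values())


def approximations(objects, attributes, target):
    """Return the (lower, upper) approximations of the target set of objects."""
    lower, upper = set(), set()
    for block in equivalence_classes(objects, attributes):
        if block <= target:   # class lies entirely inside the target
            lower |= block
        if block & target:    # class overlaps the target
            upper |= block
    return lower, upper


# Hypothetical information table; o1 and o2 are indiscernible.
objects = {
    "o1": {"color": "red", "size": "small"},
    "o2": {"color": "red", "size": "small"},
    "o3": {"color": "blue", "size": "large"},
    "o4": {"color": "blue", "size": "small"},
}
target = {"o1", "o3"}
lower, upper = approximations(objects, ["color", "size"], target)
```

Here o3 certainly belongs to the target, while o1 and o2 only possibly belong because they cannot be told apart; the difference between the two approximations is the boundary region that rough clustering methods exploit.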
In their work to enhance the K-modes algorithm, Suri and Murty proposed the rough K-modes (RKModes) algorithm [85], integrating lower and upper approximations from rough sets. This method, employing Cao's initialization technique [115] for cluster initialization, iteratively maximizes the modes' density until convergence, thereby introducing an effective approach to outlier detection within the K-modes framework.
Another algorithm, known as the density rough K-modes (DRk-M) algorithm [70–72], has been proposed to address the issue of random selection during the update of modes in the K-modes algorithm. The DRk-M algorithm calculates the density of the modes and subsequently applies RST to select the most suitable modes based on the concepts of lower and upper approximations in RST.
Moreover, Ammar et al. integrate possibility theory with RST, aiming to manage uncertainty in attribute values by utilizing possibility degrees and uncertain clusters through possibilistic membership degrees. This approach extends their prior work [180] by employing a discretization method to convert numeric values into semantically more meaningful linguistic variables with possibilistic memberships based on the K-modes algorithm [167].

• RST based on information theory
Park and Choi introduced the information-theoretic dependency roughness (ITDR) [62]. This algorithm concentrates on the dependencies of information-theoretic attributes, employing rough attribute dependencies in categorical-valued information systems to select clustering attributes based on their rough entropy values. Furthermore, ITDR employs a divide-and-conquer approach for object splitting and utilizes the mean degree of rough entropy to select the partition attribute. However, the ITDR algorithm still encounters challenges associated with entropy roughness in identifying the clustering attribute.
Therefore, Uddin et al. introduced the maximum value attribute (MVA) algorithm [84], which integrates the concept of the number of automated clusters (NoACs) to improve cluster purity while reducing complexity compared to other existing rough sets-based clustering algorithms. The MVA algorithm, which adopts the principles of RST, contains three main steps: (1) computing the value sets for each attribute, (2) determining the cardinality of each attribute value set, and (3) selecting the clustering attribute based on the maximum cardinality of the value set. By adopting this approach, the MVA algorithm effectively handles the limitations and issues associated with the random selection of clustering attributes, particularly in cases of independence and insignificant data. Comparative evaluations demonstrate that the MVA algorithm outperforms existing rough sets-based clustering algorithms, including the ITDR algorithm.
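The three steps listed above can be read literally as the following sketch. It covers only the attribute-selection step, not the full MVA algorithm or its NoACs mechanism from [84]; the function name and the tie-breaking rule (first attribute wins) are assumptions for illustration.

```python
def select_clustering_attribute(rows, attribute_names):
    """Pick the attribute whose value set has the largest cardinality."""
    # Step 1: compute the value set of each attribute.
    value_sets = {
        name: {row[i] for row in rows} for i, name in enumerate(attribute_names)
    }
    # Step 2: determine the cardinality of each value set.
    cardinalities = {name: len(values) for name, values in value_sets.items()}
    # Step 3: select the attribute with maximum cardinality
    # (ties resolved by attribute order, an assumption of this sketch).
    best = max(attribute_names, key=lambda name: cardinalities[name])
    return best, cardinalities


# Hypothetical toy table with three categorical attributes.
rows = [("a", "x", "p"), ("b", "x", "p"), ("c", "y", "p")]
best, cards = select_clustering_attribute(rows, ["A1", "A2", "A3"])
```

Because the selection is deterministic, it avoids the random choice of clustering attribute that the text identifies as a weakness of earlier approaches.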

• RST based on fuzzy k-partition algorithm
In addition to FKM, other popular fuzzy clustering methods include fuzzy k-partition (FkP) [170] and fuzzy centroids [150]. Yanto et al. proposed a modification of FkP known as modified FkP based on indiscernibility relation (MFk-PIND) to address the limitations of FkP, such as high computational time and low clustering purity. Unlike FkP, which relies on the likelihood function of multivariate multinomial distributions, MFk-PIND is based on the indiscernibility relation. As a result, the MFk-PIND algorithm outperforms both FkP and fuzzy centroids in terms of clustering performance.

• Fuzzy rough clustering
Saha et al. integrate the rough fuzzy K-modes (RFKMd) algorithm with metaheuristic methods. The resulting algorithms are called SARFKMd when RFKMd is integrated with simulated annealing, and GARFKMd when integrated with genetic algorithms. Both are referred to as SARFKMd-RF and RFKMd-RF when combined with random forest (RF). These algorithms are based on a generalized approach termed integrated rough fuzzy clustering using random forest (IRFKMd-RF) [50]. The utilization of metaheuristic methods aims to optimize the initial cluster modes, addressing the issue of indiscernibility and vagueness inherent in RFKMd, which often leads to local optima.
Furthermore, random forest trains the central points to classify peripheral points and their subsets effectively, including semi-best and pure peripheral points. The roughness measure is then utilized to select the best central points among the three algorithms, aiming to improve clustering performance.
Moreover, Xu et al. introduced a fuzzy rough clustering (FRC) algorithm [49] based on RST, combining information granularity and dimension reduction. FRC employs a weighted distance metric to measure dissimilarity in categorical datasets by converting them into numerical datasets. This conversion enables the utilization of manifold learning techniques to reduce the dimensionality of data points, resulting in decreased complexity compared to using the rough set algorithm directly.

• Distance metric based on the VD and VOW
In 2014, Lee and Lee introduced CATCH [181], a categorical data dissimilarity measure designed to cluster high-dimensional multi-valued data effectively. CATCH distinguishes the level of difference between categorical values using the value difference (VD). It incorporates the implicit influence of each attribute on constructing a particular cluster through value distribution-oriented dimensional weight (VOW).

• Kernel-based method
Chen et al. and Chen proposed two algorithms for clustering high-dimensional data into subspaces: the subspace clustering of categories (SCC) algorithm [61] and the K-means-type projective clustering of the categorical data (KPC) algorithm [77].
The SCC algorithm is a partition-based clustering approach that utilizes kernel density estimation (KDE) to assign a weight to each attribute, reflecting the smoothed dispersion of categories within a cluster. Furthermore, it employs a probabilistic distance function to measure dissimilarity between data objects and defines a cluster validity index for estimating the number of clusters. Further improvement involves assigning individual weighting exponents to each cluster and adaptively estimating parameters. Additionally, the method can be extended to general kernel functions and tested across various kernels. Similarly, the KPC algorithm uses a probability-based learning framework, leveraging KDE to optimize both attribute weights and cluster centers.
The clustering with weighted categories (CWC) algorithm also conducts subspace clustering. Unlike the KPC algorithm, CWC is non-center-based. CWC performs better on most datasets due to its adaptive learning of distances based on category heterogeneity instead of relying on the independence assumption for computing object-to-cluster distances. However, KPC typically requires less computational time than CWC.

• Space structure-based method
Qian et al. [59] introduced a novel data representation scheme that maps categorical objects into Euclidean space, where each object corresponds to a single coordinate. This scheme forms the basis of the space structure-based clustering (SBC) framework. In the experiments, SBC uses Euclidean and cosine distances, and its performance is compared with various K-modes-type algorithms.
However, because computing similarity matrices for large datasets is time-consuming and the dimensionality of the representation grows with the number of data objects, the SBC algorithm requires heavy memory loads and has high computational complexity. To address these challenges, Zheng et al. proposed the SBC algorithm with pre-clustering (SBC-C) [91]. SBC-C tackles the limitations of the SBC algorithm by employing two strategies: selecting an appropriate reference set and combining the K-means algorithm with the proposed representation. This strategy differs from SBC, which directly applies the K-means algorithm to the entire representation.

• Learning-based dissimilarity method
Rios [83] introduced a learning-based dissimilarity approach that focuses on capturing per-attribute object similarity rather than relying on attribute interdependence. This dissimilarity measure aims to identify correlations between values of categorical attributes through ensemble classification. If such correlations indicate a similarity relation, they assist in determining the appropriate cluster for each object.
An advantage of the learning-based dissimilarity approach is its ability to predict the values of a target attribute. Consequently, this measure can be applied effectively in classification tasks.

• Coupled similarity learning method
Jian et al. proposed another measure known as coupled metric similarity (CMS) [88], which is designed to assess the intrinsic similarity of categorical data, particularly data that is not independent and identically distributed (non-IID). CMS is capable of flexibly capturing both intra-attribute and inter-attribute couplings, as well as value-to-attribute-to-object hierarchical couplings to measure object similarity.
In scalability testing, CMS demonstrated significantly faster and superior capability in capturing couplings compared to other similarity measures. Furthermore, CMS can be integrated with feature selection or weighting techniques to increase effectiveness and efficiency. Additionally, CMS has the potential to be extended for handling heterogeneous data, designing data structures for scalable clustering, and automatically determining the strength of couplings in the data.

• The mixed categorical attributes (nominal and ordinal) method
The HD-NDW algorithm [74], or homogeneous distance-novel distance weighting, is a clustering algorithm that incorporates HD intra-attribute information, focusing on the intrinsic connection between ordinal and nominal attributes. At the same time, the NDW calculates the weights of intra-attribute distances defined by HD to achieve optimal clustering results.
Furthermore, the authors of HD-NDW also introduced two additional methods tailored to mixed categorical attributes: the unified distance metric (UDM) [73] and the entropy-based distance metric (EBDM) [75]. Both UDM and EBDM are centered around information-theoretic principles, utilizing entropy-based distance metrics.
The EBDM unifies distance measurement by incorporating order information from ordinal attributes and statistical information from nominal attributes. Additionally, a unified attribute weighting scheme is introduced to differentiate attribute contributions. However, clustering performance can be improved if EBDM incorporates valuable information from other attributes. Thus, Zhang and Cheung proposed the UDM, which considers intra-attribute and inter-attribute statistical information in distance measurement. Despite its effectiveness, UDM falls short of algorithms like MWKM and SCC, which are specifically designed for nominal data.
Similarly, the dissimilarity measure introduced by Yuan et al. [78] is designed for both ordinal and nominal attributes. This method offers a dissimilarity measure for ordinal attributes, quantifying the degree of ordering based on rough set theory. Comparative analysis against previous algorithms, such as SBC [59] and CMS [88], demonstrates superior performance in measuring ordinal attributes.

• Distance metric based on Graph
Another approach to measuring dissimilarity is the heterogeneous graph-based similarity (HGS) proposed by Ye et al. [52]. First, a heterogeneous weighted graph is constructed to capture latent relationships among attributes. Additionally, HGS considers both the occurrence and co-occurrence relationships between objects and attributes. Leveraging this concept, the similarity measure for objects and attribute values, including their structures, is iteratively calculated until convergence.

• Information-theoretic based approach
Kar et al. [86] introduced an entropy-based dissimilarity metric inspired by Boltzmann's principles of counting microstates to cluster diverse datasets. This dissimilarity measure calculates the entropy of each attribute, followed by determining the weight of each attribute to indicate its significance in the dataset.
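The general idea of deriving attribute weights from per-attribute entropy can be sketched as follows. Note the swapped-in technique: this sketch uses Shannon entropy and weights proportional to entropy purely as stand-ins, since the Boltzmann-inspired formulation and weighting rule of [86] are not reproduced here. All names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy (in bits) of the category distribution of one attribute."""
    n = len(column)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(column).values())


def entropy_weights(rows):
    """Normalize per-attribute entropies into weights that sum to 1.

    Weighting proportionally to entropy is an assumption of this sketch.
    """
    columns = list(zip(*rows))
    entropies = [shannon_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        # Every attribute is constant: fall back to uniform weights.
        return [1 / len(columns)] * len(columns)
    return [h / total for h in entropies]


# Hypothetical data: the first attribute varies, the second is constant.
rows = [("a", "x"), ("b", "x"), ("a", "x"), ("b", "x")]
weights = entropy_weights(rows)
```

In this toy case the constant attribute carries no information and receives zero weight, which conveys the intuition behind entropy-driven weighting even though the actual measure in [86] is defined differently.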
Similarly, Jiang et al. [37] employ an information-theoretic approach to select initial clusters. Their method, named initialization of K-modes using outlier detection (k-MODET), uses a weighted matching distance metric and integrates traditional distance-based outlier detection techniques (ini_distance) with partition entropy-based outlier detection techniques (ini_entropy).

• Frequency-based approach
The most cited article in this study, as mentioned in Section 3.1, is by Jia et al. [58]. They proposed a novel distance metric to measure the distance between categorical data. This metric is based on frequency probability, enabling the measurement of the distance of each attribute value in the entire dataset. Moreover, they introduced a dynamic weighting scheme to adjust the contribution of each attribute distance to the overall object distance. The proposed distance metric encompasses three cases: (1) frequency probability-based, (2) adjusted distance metric with dynamic attribute weight without considering the relationship between attributes, and (3) the complete distance metric. Lastly, considering that some attributes are interdependent, the degree of dependency between each pair is calculated using frequency probability and frequently co-occurring items.
Sulc and Rezankova [48] also employed a frequency distribution of categories to address attributes with more than two categories. Their proposed variability-based similarity measures include the variable entropy (VE) and the variable mutability (VM) measure, integrated with three hierarchical cluster analysis linkage methods.

• Ensemble dissimilarity based on hierarchical clustering
Amiri et al. [94] introduced another dissimilarity measure based on hierarchical clustering. Their approach focuses on ensembled dissimilarity designed for datasets with low, high, and varying dimensions. For high-dimensional data, categorical vectors are separated into equal and unequal lengths by including an additional layer of assembly. Alignment procedures are then employed to standardize the unequal categorical vectors. The results demonstrate improved performance of the ensembled clustering method under average linkage (AL) or complete linkage (CL). Currently, due to the absence of clustering methods for unequal-length categorical vectors, the proposed approach can only be compared with the output of phylogenetic trees.

• Bayesian dissimilarity and KL divergence approach
In 2023, a novel fuzzy clustering objective function was introduced, leveraging the concept of approximating the maximum a posteriori (MAP) and employing a Bayesian dissimilarity measure [89]. Moreover, to increase clustering performance, the objective function includes Kullback-Leibler divergence-based graph regularization to identify patterns within datasets.

Weighting Method
The summary of the weighting method is provided in Table 14. Its entries include Mod-2 [127], Mod-3 [128], and the attribute-weighted methods IWFKM [55], EWKM [114], Saha [80], SBC [59], Chan [130], and Jia [205].
• Automatic feature weight
WFK-modes [80], proposed by Saha and Das in 2015, is an automated feature weight learning method designed to adaptively adjust feature weights based on their contributions to clustering. It aims to minimize the objective function and determine cluster membership within the FKM algorithm [217]. Experimental results indicate that this algorithm performs effectively, especially in datasets containing noise features. Although the authors attempted to modify the algorithm for scenarios with an unknown number of clusters, it still requires setting threshold values for the maximum and minimum number of clusters. Thus, further studies can explore the effectiveness of different cluster validity indices and their relationship with the weight vector. Additionally, parameters such as "β" associated with attribute weight and objective function minimization need optimization. Furthermore, extending performance evaluation to larger datasets would be beneficial.
Additionally, Oskouei et al. [79] explored automated attribute weighting, extending the work of [218], which employed cluster weighting to select initial centers in the FCM algorithm. However, since FCM primarily handles numerical attributes, their proposed method, the categorical fuzzy K-modes clustering with automated attribute-weight and cluster-weight learning (FKMAWCW) algorithm, is implemented for categorical attributes. This algorithm uses a local attribute weighting mechanism to appropriately weigh attributes within each cluster and a cluster weighting mechanism to address initialization sensitivity. Furthermore, to mitigate noise sensitivity, they introduce a novel distance function combining frequency probability-based distance [58] and non-Euclidean distance [219]. Exploring the suitability of the FKMAWCW algorithm for clustering mixed data, especially considering its emphasis on categorical data, would be valuable. Moreover, future studies can explore the automatic determination of the number of clusters during the clustering process.

• Information-theoretic approach
Kim [95] introduced a novel attribute weighting approach for the K-modes and FKM algorithms based on within-cluster and between-cluster impurity measures to identify attribute relevance in separate clusters. These impurity measures, such as entropy and Gini impurity, assign large weights to variables with lower entropy or Gini impurity. However, the effectiveness of attribute weights depends on the parameter "c", which controls the balance between within-cluster and between-cluster information. Determining the optimal value for "c" relies on general guidelines and requires further investigation. Furthermore, the proposed method can be expanded to accommodate numerical and mixed attributes by employing inhomogeneous measures for numerical features.
Peng and Liu [51] aim to improve the cluster center during the initialization phase of the K-modes algorithm by employing an attribute-weighted distance metric and weighted average density rather than relying solely on the simple matching distance metric. This approach helps prevent the possibility of outliers becoming cluster centers or multiple cluster centers converging around a single center. Furthermore, this approach can broaden its scope in future studies by employing feature selection techniques to identify significant attributes for distance measurement between instances during cluster center initialization.
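The impurity-based weighting idea described for [95] (larger weights for attributes that are purer within clusters) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it uses only the within-cluster Gini impurity, omits the between-cluster term and the balancing parameter "c", and normalizes purity linearly. The names `within_cluster_gini` and `impurity_weights` are hypothetical.

```python
from collections import Counter, defaultdict


def gini(values):
    """Gini impurity of a list of categorical values."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())


def within_cluster_gini(data, labels):
    """Size-weighted average Gini impurity of each attribute inside the clusters."""
    clusters = defaultdict(list)
    for row, label in zip(data, labels):
        clusters[label].append(row)
    n, m = len(data), len(data[0])
    impurity = [0.0] * m
    for rows in clusters.values():
        for j, col in enumerate(zip(*rows)):
            impurity[j] += len(rows) / n * gini(col)
    return impurity


def impurity_weights(data, labels):
    """Assumed rule: weight is proportional to purity (1 - impurity)."""
    purity = [1.0 - g for g in within_cluster_gini(data, labels)]
    total = sum(purity)  # strictly positive, since Gini impurity is below 1
    return [p / total for p in purity]


# Attribute 0 separates the two clusters perfectly; attribute 1 does not.
data = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = [0, 0, 1, 1]
weights = impurity_weights(data, labels)
```

Here the attribute that is constant within each cluster receives twice the weight of the attribute that is evenly mixed, which matches the qualitative behavior described above.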

Validity Function
Table 15 shows the summary of cluster validity. Its entries include Ng's K-modes [143,220], K-modes [15,221], and WKM [176]; the generalized validity function of Bai and Liang (2015) [65] with K-modes, CU [222], and IE [102]; and IDC and CUBOS of Gao and Wu (2019) [57] with CCI [223], CDCS [224], IE, CU, and NCC [225].
All the validity functions reviewed in this study are internal validity functions. In 2014, Bai and Liang [39] improved the K-modes algorithm by optimizing its objective function to incorporate both between-cluster separation and within-cluster compactness. Their proposed algorithm, named between-cluster information K-modes (BCIk-M), demonstrated improved effectiveness compared to traditional FKM algorithms. The integration of between-cluster information with the FKM algorithm, as verified in [178], shows its superior effectiveness. Furthermore, this study enhanced several original K-modes algorithms, including Ng's K-modes [143,220], Huang's K-modes algorithm [15,221], and the weighted K-modes algorithm (WKM) [176], by including both types of information. Despite the increased computational time required by the improved K-modes algorithm in scalability tests, the increase rate remains linear, guaranteeing its effectiveness and scalability.
In 2015, Bai and Liang [65] introduced a study focusing on the generalized validity function. Initially, they examined three existing internal validity functions: K-modes [221], category utility function (CU) [222], and information entropy function (IE) [102]. As these functions solely relied on within-cluster information, the study aimed to investigate the impact of including between-cluster information on performance. The experimental results demonstrated that these three validity functions effectively evaluated clustering results even without utilizing between-cluster information. Additionally, the study proposed normalizations for these internal validity functions and found that normalization increases performance.
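As an illustration of an internal validity function that relies on within-cluster information, the category utility can be sketched in Python. The sketch assumes the commonly used definition, $CU = \frac{1}{k}\sum_{l} P(C_l) \sum_{j}\sum_{v} \left[ P(A_j = v \mid C_l)^2 - P(A_j = v)^2 \right]$; the variant and normalization used in the reviewed articles may differ, and the helper name `sum_sq_probs` is hypothetical.

```python
from collections import Counter, defaultdict


def sum_sq_probs(rows):
    """Sum over attributes and values of the squared value probabilities."""
    n = len(rows)
    return sum((c / n) ** 2 for col in zip(*rows) for c in Counter(col).values())


def category_utility(data, labels):
    """Category utility of a partition (higher is better)."""
    clusters = defaultdict(list)
    for row, label in zip(data, labels):
        clusters[label].append(row)
    n = len(data)
    baseline = sum_sq_probs(data)  # predictability without any clustering
    gain = sum(
        len(rows) / n * (sum_sq_probs(rows) - baseline)
        for rows in clusters.values()
    )
    return gain / len(clusters)


data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
good = category_utility(data, [0, 0, 1, 1])  # clusters match the structure
bad = category_utility(data, [0, 1, 0, 1])   # clusters mix the two groups
```

The partition that follows the natural grouping scores higher than the one that mixes the groups, without any reference to ground-truth labels.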
Gao and Wu [57] conducted a comprehensive review of existing functions of internal validity indices and, based on that, proposed the categorical data cluster utility based on silhouette (CUBOS). The CUBOS method combines the Silhouette index with an improved distance metric for categorical data (IDC). It considers the relationship between different attribute values and the detailed distribution information among data objects. IDC represents a novel improvement measure inspired by category distance [87]. Furthermore, the CUBOS framework facilitates a more detailed distribution of information within clustering results.
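The way a Silhouette index can be paired with a categorical distance is sketched below in Python. The sketch uses the normalized simple matching distance as a stand-in, since the IDC metric of [57] is not specified here; any other categorical distance can be passed through the `dist` parameter. It is not an implementation of CUBOS itself.

```python
def matching_distance(x, y):
    """Fraction of attributes on which two objects differ (stand-in for IDC)."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def silhouette(data, labels, dist=matching_distance):
    """Mean silhouette width; objects in singleton clusters score 0."""
    n = len(data)
    cluster_ids = set(labels)
    if len(cluster_ids) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i in range(n):
        sums = {c: 0.0 for c in cluster_ids}
        counts = {c: 0 for c in cluster_ids}
        for j in range(n):
            if i != j:
                sums[labels[j]] += dist(data[i], data[j])
                counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:  # singleton cluster
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]  # mean distance to the object's own cluster
        b = min(sums[c] / counts[c] for c in cluster_ids if c != own)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / n


data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
good = silhouette(data, [0, 0, 1, 1])
bad = silhouette(data, [0, 1, 0, 1])
```

Swapping in a distance that accounts for relationships between attribute values changes only the `dist` argument, which is the design point this sketch is meant to show.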
In addition to the hierarchical and partition clustering methods, as well as the dissimilarity functions, weighting methods, and cluster validity measures outlined in Tables 8-15, it is relevant to highlight the datasets utilized in these studies. Each study employs different datasets depending on its objectives, although the datasets may not always be categorized based on their scalability or dimensions. The following section combines several frequently used datasets with several validity functions.

Datasets
The total number of datasets used in all articles is 51, as outlined in Table 16. Five of these datasets are extensively featured in over 30 articles. These datasets include breast cancer Wisconsin (original), congressional votes, mushroom, soybean small, and zoo. The summary aligns with the original dataset specifications from the UCI Repository [12], including the information on the number of records (#rec), attributes (#attr), and clusters (#clus).

Performance Evaluation
Table 17 presents the performance evaluation methods employed in the 64 articles. Among these, 20 internal and 13 external validity functions were utilized. Notably, the accuracy, adjusted Rand index (ARI), and normalized mutual information (NMI) were employed in over 20 articles. Furthermore, Table 18 illustrates the most frequently used validity functions corresponding to the most common datasets. In Tables 19-21, algorithms are classified based on their types to illustrate the best algorithm with the best result, aligned with the most frequent datasets and validity indexes employed in the articles. The validity indexes include accuracy, ARI, and NMI. As several articles propose more than one algorithm, the summary only presents the best algorithm along with its corresponding results.
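For readers unfamiliar with the external indexes named above, the following Python sketch computes the ARI and the NMI from a contingency count of true and predicted labels. It assumes the standard pair-counting form of the ARI and, for the NMI, normalization by the geometric mean of the two entropies; other normalizations exist, and the reviewed articles may use a different one. Clustering accuracy is omitted because it additionally requires matching cluster labels to class labels.

```python
from collections import Counter
from math import comb, log, sqrt


def adjusted_rand_index(truth, pred):
    """ARI from pair counts; requires at least two objects."""
    n = len(truth)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g., one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_information(truth, pred):
    """NMI with geometric-mean normalization (an assumed choice)."""
    n = len(truth)
    pt, pp = Counter(truth), Counter(pred)
    mi = sum(
        (c / n) * log(c * n / (pt[t] * pp[p]))
        for (t, p), c in Counter(zip(truth, pred)).items()
    )
    ht = sum(-(c / n) * log(c / n) for c in pt.values())
    hp = sum(-(c / n) * log(c / n) for c in pp.values())
    if ht == 0 or hp == 0:  # at least one side has a single cluster
        return 1.0 if ht == hp else 0.0
    return mi / sqrt(ht * hp)


truth = [0, 0, 1, 1]
same = [1, 1, 0, 0]    # identical partition with permuted label names
mixed = [0, 1, 0, 1]   # partition that splits every true class
```

Both indexes are invariant to how the cluster labels are named, which is why the permuted labeling still counts as a perfect match.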
Despite the limited number of datasets and validity indexes used in performance evaluation, valuable insights can still be provided. In this study, articles employing rough-set-based clustering do not utilize the ARI, and none of the weighting methods employ the NMI as the validity index.
Many algorithms utilizing the soybean dataset achieve the highest values across all validity indices. However, concerning the mushroom dataset, only the EN-CL algorithm achieves 100% accuracy, with an ARI value of one. These results indicate that EN-CL, an ensembled dissimilarity approach, achieves better results compared to other methods, particularly for high-dimensional and scalable datasets. Moreover, all weighting method algorithms applied to the soybean small dataset achieve the highest accuracy, and one of these methods, FKMAWCW by Oskouei et al. [79], also secures the highest ARI value. Additionally, the cluster validity algorithm CUBOS by Gao and Wu [57] achieves the highest value across all validity indexes using the same dataset.

Taxonomy
Numerous taxonomies related to clustering are presented, as outlined in Table 1, with most of them addressing numerical and categorical data. However, as shown in Figure 6, this study aims to construct a taxonomy specifically for categorical data clustering. Nonetheless, this task presents challenges due to the varied perspectives and classification approaches found in each study. To the best of our knowledge, no comprehensive taxonomy has been established for categorical data clustering. Therefore, the proposed taxonomy shown in Table 5 assists scholars by providing a simple yet comprehensive classification that covers all relevant topics in categorical data clustering. This taxonomy is based on mapping domain research in categorical data. However, it has some limitations. For instance, it only covers nominal, ordinal, and mixed data types, excluding sequential categorical data such as DNA.
First, this study adopts the taxonomy of [16], which classifies clustering types into hierarchical and partition-based. Hierarchical clustering is further divided into divisive and agglomerative, with the agglomerative approach consisting of single, average, and complete links. Partition-based clustering is divided into hard and soft clustering based on membership degree. In hard clustering, the data points belong to only one cluster, whereas in fuzzy partitioning, the data points can belong to multiple clusters based on their membership degree. Graph clustering is considered a separate category instead of part of partition-based clustering. The levels of graph clustering are similar to partition-based clustering, alongside other types, including model-based, density-based, grid-based, and space-structure-based. However, space-structure-based clustering may overlap with grid-based or density-based methods since spatial data can be clustered based on density or divided into a grid. Nevertheless, due to the growing research in this area, this study treats space-structure-based clustering as an independent category. Furthermore, clustering techniques are classified based on background theory, including rough-set, fuzzy-set, probabilistic, possibilistic, and belief functions. These techniques can be applied in hierarchical, partition-based, or other clustering methods as they primarily aim to handle uncertainty and address challenges posed by traditional clustering algorithms. While some references, such as Naouali [17], place hard, fuzzy, and rough-set-based clustering at the same level, with probabilistic and possibilistic methods considered part of fuzzy theory, this taxonomy treats all these theories as equal under the category "clustering techniques," aiming to cover the belief functions theory.
Additionally, distance functions and attribute weighting can be integrated with each other. Although only a few attribute weighting methods are listed, including entropy and Gini weighting from the information-theoretic approach, entropy can also be utilized in distance functions. The taxonomy for distance functions is derived from [13,18,192], where the term "distance function" refers to dissimilarity measures. Hence, the taxonomy uses both metric and non-metric distance measures under the term "similarity distance." Moreover, the concept of context-free and context-relative, as proposed in [18], is applied solely to unsupervised learning, encompassing frequency-based, information-theoretic, and probabilistic approaches. On the other hand, this study adds a kernel-based approach since many studies have proposed distance metrics based on the kernel.
Similarly, regarding validation functions, this study found that all methods proposed in the past decade are associated with internal validation. As a result, the validation section remains unchanged, following the previous taxonomy, which consists of internal, external, and mixed validation functions. Another update to the taxonomy relates to datasets and optimization. Instead of combining these two aspects as part of the clustering issue, this study categorizes them based on the source or root cause of the problem. For example, issues such as noise sensitivity, outlier detection, and imbalanced data are caused by the dataset characteristics. Outliers may not always be problematic, as they can be useful depending on the clustering objective. Similarly, addressing high-dimensional data may involve transformation methods to reduce dimensionality, but this study does not focus on dimensionality reduction or feature selection.
The final aspect relates to optimization. This study divides optimization into several parameters or processes that can be optimized, such as the objective function and the number of clusters. Furthermore, optimization approaches cover both exact and heuristic approaches rather than solely focusing on metaheuristic approaches, as many algorithms still utilize these traditional optimization strategies.

Discussion
This study conducts a bibliometric analysis focusing on categorical data clustering, particularly in partition-based clustering. The quantitative synthesis and analysis subsection provides a performance overview of articles and science mapping. A limitation of this study is its dependence solely on articles from the WoS Core Collection from 2014 to 2023. Future studies can expand the scope to include other databases like Scopus or broaden the inclusion criteria. However, the comparison between the 567 and 64 articles over the past decade effectively captures research trends in the field. In the science mapping section, co-word and citation analyses are visualized using VOSviewer. Comparing the top ten most cited articles in Table 3 with the most productive authors in Table 6 reveals interesting insights [59-61,64]. Even though four of the top ten articles are authored by the top ten authors, the most cited article [58] is not written by the most productive author. This comparison and citation analysis provide a deeper understanding of the research trends.
For example, although [58] is cited by 15 of the 64 articles, the second most cited article is cited by only 9, indicating that citation count alone may not fully capture the significance of a publication. Overall, the citation network provides a comprehensive overview and detailed insights into trends and topics in categorical data clustering, suggesting further analysis related to the relationship between cited articles and authors.
After the quantitative synthesis and analysis, the qualitative synthesis and analysis for the 64 articles are presented. The classification follows a benchmark taxonomy of type, technique, distance function, attribute weighting, validation, dataset, and optimization. The detailed analysis and classification are sequentially presented in Section 3.2. Specifically, there are five studies related to hierarchical clustering, with four of them based on rough-set theory (MTMDP [38], MGR [41], MNIG [47], and HPCCD [96]), the exception being P-ROCK [106]. Two studies focus on agglomerative hierarchical clustering, while the remaining three focus on divisive hierarchical clustering.
Regarding hierarchical clustering combined with rough-set theory, most proposed algorithms show capabilities in handling uncertain and imbalanced datasets, automatically discovering the number of clusters, and clustering high-dimensional datasets. Moreover, these algorithms have improved over the MMR algorithm [99] in terms of increased accuracy and efficiency.
For future research directions, these algorithms could be extended toward scalable clustering and automatic subspace clustering. Extending these algorithms to handle mixed numeric and categorical data can be a promising avenue for further investigation.
Another type of clustering is partition-based clustering. This study proposes an expanded classification of clustering type. Instead of dividing clustering types into hierarchical and partition-based, this taxonomy places all the clustering types at the same level. This includes space structure-based [59,91], which was previously categorized separately. Furthermore, spectral clustering is incorporated into the graph-based category, and entropy-based clustering is now considered part of model-based clustering.
During the period from 2014 to 2023, while the most productive year was 2019, there has been a downward trend since 2021. However, several significant works have emerged, particularly in distance function methods. Other popular research topics include multi-objective optimization clustering based on information-theoretic, kernel-based, and frequency-based approaches. Additionally, many algorithms have been developed to address the challenge of clustering high-dimensional and scalable data.
Many algorithms performed scalability testing, such as those mentioned in references [40,41,56,63,64,67,68,70,96,98,106], aiming to improve clustering methods for high-dimensional data. Notably, algorithms like SCC [61] and SKSCC [76] utilize probabilistic distance functions based on kernel density estimation to increase clustering performance. Some methods focus on data representation techniques, such as discretization (converting categorical data into numeric values) [64] or representing categorical data as graph structures [66] to reduce time complexity in high-dimensional datasets. Additionally, soft subspace clustering methods like LSHFk-centers aim to reduce dimensionality before data processing. However, despite their effectiveness, these algorithms still suffer from high computational time, indicating a need for further research to improve efficiency and reduce time complexity.
On the other hand, several algorithms have been developed based on rough-set theory to address data uncertainty. These algorithms aim to prevent uncertainty associated with attribute values and uncertain clustering outcomes. For example, the RKModes [85] algorithm focuses on outlier detection and sensitivity analysis. While some algorithms are based on the K-modes, others, like MFk-PIND [42], modify fuzzy k-partition (FkP) and fuzzy centroids to improve computational efficiency and clustering purity. Additionally, certain algorithms utilize information-theoretic dependencies, such as the widely used ITDR [62], which employs entropy roughness to identify clustering attributes. However, a challenge arises when clustering attributes possess zero or equal significance values, leading to random attribute selection. To address this issue, the MVA algorithm [84] was proposed, which overcomes the limitations of ITDR but requires further analysis in combination with other rough purity approaches (RPA).
Several algorithms have been developed to automate clustering by optimizing the number of clusters without requiring a predetermined initialization. One such algorithm is the α-Condorcet algorithm [92], which highlights the practicality of pre-identifying the cluster number in certain real-world scenarios, such as psychometrics. Developed based on a heuristic approach, this algorithm provides valuable insights into cluster number determination. Additionally, metaheuristic approaches have been integrated with clustering algorithms to improve their performance. For instance, fuzzy-based algorithms have been combined with GA, PSO, ABC, and other metaheuristic methods to optimize cluster center initialization [2,55,69]. Furthermore, optimization techniques that combine metaheuristics with multi-objective algorithms have been explored [53,60,90]. However, it is worth noting that while fuzzy-based algorithms utilize various metaheuristics, most multi-objective algorithms primarily rely on GA and PSO. Hence, conducting comparative performance evaluations with other metaheuristic approaches can provide valuable insights, particularly considering the diverse objective functions employed by these algorithms. Additionally, assessing algorithm efficiency in terms of time and space utilization alongside objective function optimization is recommended.
Exploring the characteristics of algorithms capable of handling empty clusters is important, especially considering how common this issue is in algorithms like K-modes. Many scholars have already used brute-force approaches to address this challenge. In this context, the OTQT algorithm stands out for its adoption of the Hartigan algorithm, a variation of K-means, to ensure that clusters remain nonempty during the initialization step. This innovative approach offers a promising solution to prevent the empty cluster problem commonly encountered in categorical data clustering.
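The empty cluster problem and a simple repair of the brute-force kind can be illustrated with the Python sketch below. It assigns each object to its nearest mode under simple matching and then refills any empty cluster with the object that lies farthest from its current mode. This is an assumed, generic repair rule for illustration only; it is neither the Hartigan procedure nor the OTQT algorithm, and the function name `assign_nonempty` is hypothetical.

```python
def matching_distance(x, y):
    """Number of attributes on which two objects differ."""
    return sum(a != b for a, b in zip(x, y))


def assign_nonempty(data, modes):
    """Assign each object to its nearest mode, then refill any empty cluster."""
    labels = [
        min(range(len(modes)), key=lambda k: matching_distance(x, modes[k]))
        for x in data
    ]
    for k in range(len(modes)):
        if k in labels:
            continue
        # Donors are objects whose current cluster would stay non-empty.
        donors = [i for i, lab in enumerate(labels) if labels.count(lab) > 1]
        if not donors:
            raise ValueError("more clusters than objects")
        far = max(donors, key=lambda i: matching_distance(data[i], modes[labels[i]]))
        labels[far] = k  # move the worst-fitting object into the empty cluster
    return labels


data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
# Two identical initial modes: without the repair step, cluster 1 stays empty.
modes = [("a", "x"), ("a", "x")]
labels = assign_nonempty(data, modes)
```

With duplicate initial modes, every object ties toward the first cluster, so the repair step is what keeps the second cluster populated before the modes are updated.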
Overall, the methods discussed in this study contribute to enhancing the proposed taxonomy. In the future, this taxonomy can serve as a foundational framework for further advancements in clustering algorithms, aligning with the trends identified in the bibliometric analysis. As data complexity continues to increase, there are opportunities to refine existing methods for improvement and innovation. Future research directions may involve integrating clustering methods with deep learning and ensemble techniques and exploring semi-supervised learning approaches capable of clustering mixed labeled and unlabeled data. Furthermore, algorithms can be developed to effectively cluster mixed datasets, thereby improving the overall performance and efficiency of clustering algorithms.

Conclusions
The bibliometric analysis conducted between 2014 and 2023, focusing on categorical data clustering and sourced from the WoS Core Collection, identified 64 relevant articles following content screening. Through co-word and citation network analyses, research trends and relationships among publications and clustering topics were presented. Subsequently, a qualitative synthesis and analysis were conducted to explore the details of the studies. The 64 articles were classified according to a previous taxonomy, leading to the development of a new taxonomy based on emerging methods and trends.
Numerous methods were identified to address the limitations of traditional algorithms, particularly in partition-based clustering. These methods include optimization techniques employing metaheuristics and uncertainty methods such as fuzzy and rough-set theory. Various distance functions were proposed to mitigate the shortcomings of simple matching distance, with some considering both within-cluster cohesion and between-cluster separation. Additionally, several attribute weighting methods were introduced to discern the importance of attributes.
This study also synthesized the most commonly used datasets and summarized the performance results. However, it is important to note that no single algorithm can address all clustering challenges, as efficiency depends on factors such as dataset characteristics.
Moreover, this study may not capture all issues presented in the articles, and due to the complexity of categorical data clustering, the proposed taxonomy may not cover all methods in detail.
For future works, the majority of studies aim to enhance the existing methods for improved scalability and efficiency while also extending these approaches to accommodate mixed data types beyond categorical datasets. Despite the declining trend observed since 2021 and the numerous algorithms proposed over the past decade for categorical data clustering, certain challenges persist, particularly in addressing issues inherent to traditional algorithms like the K-modes-based methods discussed herein. Consequently, it is recommended that modern clustering techniques be explored in future works to tackle these ongoing challenges effectively.

Table 1. Previous review studies in the clustering domain.

Table 2. Keywords and database selection.
