Categorical Data Clustering: A Bibliometric Analysis and Taxonomy
Abstract
1. Introduction
- What existing categorical data clustering algorithms are capable of improving clustering performance?
- What are the research trends based on the co-word and citation networks of these algorithms?
- What is the updated taxonomy for categorical data clustering?
- What potential challenges and future research directions could offer alternatives to the existing methods?
2. Methods
3. Results
3.1. Quantitative Synthesis and Analysis
3.1.1. Performance Analysis
- Publication Years
- Publication Titles and Publishers
- Authors
3.1.2. Science Mapping
3.2. Qualitative Synthesis and Analysis
3.2.1. Hierarchical Clustering
- (1) Divisive Hierarchical Clustering
- (2) Agglomerative Hierarchical Clustering
3.2.2. Partition Clustering
- (1) Hard Clustering
- Clustering various data types
- Optimizing the number of clusters
- Optimizing the cluster centers
- Optimizing the objective function for large datasets
- Optimizing the objective function based on a multi-objective approach
- Representing data based on the discretization method
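To make the hard-partition branch concrete, here is a minimal Python sketch of a k-modes-style loop in the spirit of Huang (1998): simple matching dissimilarity and per-attribute modes as cluster centers. The function names, the fixed iteration count, and the random initialization are illustrative assumptions, not any surveyed implementation.

```python
import random
from collections import Counter

def matching_dissimilarity(a, b):
    # Simple matching: number of attributes on which two objects disagree.
    return sum(x != y for x, y in zip(a, b))

def mode_of(cluster):
    # Per attribute, take the most frequent category among the cluster's objects.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def k_modes(data, k, n_iter=10, seed=0):
    # Classic hard-partition loop: assign each object to the nearest mode,
    # then recompute the modes; an empty cluster keeps its previous mode.
    rng = random.Random(seed)
    modes = rng.sample(data, k)
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for obj in data:
            nearest = min(range(k), key=lambda i: matching_dissimilarity(obj, modes[i]))
            clusters[nearest].append(obj)
        modes = [mode_of(c) if c else modes[i] for i, c in enumerate(clusters)]
    return modes, clusters
```

The variants listed above differ mainly in how this loop's initialization, centers, and objective function are optimized.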
- (2) Fuzzy Clustering
- Heuristic approach to cluster set-valued attributes
- Multivariate membership approach
- Metaheuristic approach
- Possibilistic-based approach with metaheuristic
- Intuitionistic fuzzy set theory-based approach
- Multi-objective approach
- Soft subspace clustering based on locality-sensitive hashing (LSH)
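The fuzzy variants above replace hard assignments with graded memberships. A minimal sketch of the standard fuzzy membership update (fuzzifier m; memberships inversely proportional to relative dissimilarity, with the zero-distance case handled explicitly); the function name and input layout are assumptions:

```python
def fuzzy_memberships(dists, m=2.0):
    # dists[i][j]: dissimilarity between cluster mode i and object j.
    # Membership of object j in cluster i is inversely proportional to its
    # relative dissimilarity; an object coinciding with one or more modes
    # splits full membership among them.
    k, n = len(dists), len(dists[0])
    u = [[0.0] * n for _ in range(k)]
    exp = 1.0 / (m - 1.0)
    for j in range(n):
        zero = [i for i in range(k) if dists[i][j] == 0.0]
        if zero:
            for i in zero:
                u[i][j] = 1.0 / len(zero)
            continue
        for i in range(k):
            u[i][j] = 1.0 / sum((dists[i][j] / dists[l][j]) ** exp for l in range(k))
    return u
```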
- (3) Rough-set-based clustering
- RST based on the K-modes algorithm
- RST based on information theory
- RST based on fuzzy k-partition algorithm
- Fuzzy rough clustering
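All the RST-based variants above start from indiscernibility: objects that take identical values on a chosen attribute subset are equivalent. A minimal sketch of that partition step (the function name and attribute-index interface are illustrative):

```python
def indiscernibility_classes(data, attrs):
    # Group object indices that agree on the chosen attribute indices:
    # the equivalence classes of the RST indiscernibility relation.
    classes = {}
    for idx, obj in enumerate(data):
        key = tuple(obj[a] for a in attrs)
        classes.setdefault(key, []).append(idx)
    return list(classes.values())
```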
3.2.3. Distance Function
- Distance metric based on the VD and VOW
- Kernel-based method
- Space structure-based method
- Learning-based dissimilarity method
- Coupled similarity learning method
- Mixed categorical attributes (nominal and ordinal) method
- Distance metric based on graphs
- Information-theoretic approach
- Frequency-based approach
- Ensemble dissimilarity based on hierarchical clustering
- Bayesian dissimilarity and KL divergence approach
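Several entries above weight (dis)similarity by category frequency. A minimal, hedged sketch of one such scheme: a mismatch costs 1, a match costs the matched category's relative frequency, so agreement on rare categories yields a smaller distance than agreement on common ones. This is one simple member of the frequency-based family; the exact weighting varies across the surveyed measures.

```python
from collections import Counter

def frequency_dissimilarity(data):
    # Build per-attribute frequency tables over the dataset, then return a
    # dissimilarity function d(a, b), averaged over attributes.
    n = len(data)
    freqs = [Counter(col) for col in zip(*data)]
    def d(a, b):
        total = 0.0
        for attr, (x, y) in enumerate(zip(a, b)):
            total += 1.0 if x != y else freqs[attr][x] / n
        return total / len(a)
    return d
```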
3.2.4. Weighting Method
- Automatic feature weighting
- Information-theoretic approach
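A common information-theoretic weighting idea is to score each attribute by the Shannon entropy of its category distribution. A minimal sketch; the unit-sum normalization is an assumption, and the surveyed methods map entropy to weights in different ways:

```python
import math
from collections import Counter

def entropy_weights(data):
    # Shannon entropy of each attribute's category distribution, normalized
    # to sum to 1 so the weights can plug into a weighted matching distance.
    n = len(data)
    ents = []
    for col in zip(*data):
        ents.append(-sum((c / n) * math.log2(c / n) for c in Counter(col).values()))
    total = sum(ents) or 1.0
    return [e / total for e in ents]
```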
3.2.5. Validity Function
3.2.6. Datasets
3.2.7. Performance Evaluation
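Clustering papers in this area typically report external indices such as accuracy, purity, NMI, and ARI against ground-truth labels. A minimal sketch of purity, one of the simpler indices (the function name is illustrative):

```python
from collections import Counter

def purity(labels_true, labels_pred):
    # Each predicted cluster is credited with its majority true class; purity
    # is the fraction of objects covered by those majorities.
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, []).append(t)
    hits = sum(Counter(members).most_common(1)[0][1] for members in clusters.values())
    return hits / len(labels_true)
```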
3.3. Taxonomy
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Abdul-Rahman, S.; Arifin, N.F.K.; Hanafiah, M.; Mutalib, S. Customer segmentation and profiling for life insurance using k-modes clustering and decision tree classifier. Int. J. Adv. Comput. Sc. 2021, 12, 434–444. [Google Scholar] [CrossRef]
- Kuo, R.J.; Potti, Y.; Zulvia, F.E. Application of metaheuristic based fuzzy k-modes algorithm to supplier clustering. Comput. Ind. Eng. 2018, 120, 298–307. [Google Scholar] [CrossRef]
- Hendricks, R.; Khasawneh, M. Cluster analysis of categorical variables of parkinson’s disease patients. Brain Sci. 2021, 11, 1290. [Google Scholar] [CrossRef] [PubMed]
- Narita, A.; Nagai, M.; Mizuno, S.; Ogishima, S.; Tamiya, G.; Ueki, M.; Sakurai, R.; Makino, S.; Obara, T.; Ishikuro, M.; et al. Clustering by phenotype and genome-wide association study in autism. Transl. Psychiat 2020, 10, 290. [Google Scholar] [CrossRef]
- Farhang, Y. Face extraction from image based on k-means clustering algorithms. Int. J. Adv. Comput. Sc. 2017, 8, 9. [Google Scholar] [CrossRef]
- Huang, H.; Meng, F.Z.; Zhou, S.H.; Jiang, F.; Manogaran, G. Brain image segmentation based on FCM clustering algorithm and rough set. IEEE Access 2019, 7, 12386–12396. [Google Scholar] [CrossRef]
- Wei, P.C.; Zhou, Z.; Li, L.; Jiang, J. Research on face feature extraction based on k-mean algorithm. Eurasip. J. Image Vide 2018, 2018, 1–9. [Google Scholar] [CrossRef]
- Bushel, P.R. Clustering of gene expression data and end-point measurements by simulated annealing. J. Bioinform. Comput. Biol. 2009, 7, 193–215. [Google Scholar] [CrossRef]
- Castro, G.T.; Zárate, L.E.; Nobre, C.N.; Freitas, H.C. A fast parallel k-modes algorithm for clustering nucleotide sequences to predict translation initiation sites. J. Comput. Biol. 2019, 26, 442–456. [Google Scholar] [CrossRef]
- Fonseca, J.R.S. Clustering in the field of social sciences: That is your choice. Int. J. Soc. Res. Method. 2013, 16, 403–428. [Google Scholar] [CrossRef]
- Luo, N.C. Massive data mining algorithm for web text based on clustering algorithm. J. Adv. Comput. Intell. Inform. 2019, 23, 362–365. [Google Scholar] [CrossRef]
- Dua, D.G. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ (accessed on 10 January 2024).
- Tan, P.-N.; Steinbach, M.S.; Karpatne, A.; Kumar, V. Introduction to Data Mining, 2nd ed.; Pearson Education, Inc.: London, UK, 2019. [Google Scholar]
- MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 21 June 1967; pp. 281–297. [Google Scholar]
- Huang, Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 1998, 2, 283–304. [Google Scholar] [CrossRef]
- Jain, A.K.; Murty, M.N.; Flynn, P.J. Data clustering: A review. ACM Comput. Surv. 1999, 31, 264–323. [Google Scholar] [CrossRef]
- Naouali, S.; Ben Salem, S.; Chtourou, Z. Clustering categorical data: A survey. Int. J. Inf. Technol. Decis. Mak. 2020, 19, 49–96. [Google Scholar] [CrossRef]
- Alamuri, M.; Surampudi, B.R.; Negi, A. A survey of distance/similarity measures for categorical data. In Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN), Beijing, China, 6–11 July 2014; pp. 1907–1914. [Google Scholar]
- Hancer, E.; Karaboga, D. A comprehensive survey of traditional, merge-split and evolutionary approaches proposed for determination of cluster number. Swarm Evol. Comput. 2017, 32, 49–67. [Google Scholar] [CrossRef]
- Alloghani, M.; Al-Jumeily, D.; Mustafina, J.; Hussain, A.; Aljaaf, A.J. A systematic review on supervised and unsupervised machine learning algorithms for data science. In Supervised and Unsupervised Learning for Data Science; Unsupervised and Semi-Supervised Learning; Springer: Berlin/Heidelberg, Germany, 2020; pp. 3–21. [Google Scholar]
- Awad, F.H.; Hamad, M.M. Big data clustering techniques challenged and perspectives: Review. Informatica 2023, 47, 6. [Google Scholar] [CrossRef]
- Ikotun, A.M.; Absalom, E.E.; Abualigah, L.M.; Abuhaija, B.; Jia, H. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Inf. Sci. 2022, 622, 178–210. [Google Scholar] [CrossRef]
- Wang, Y.; Qian, J.; Hassan, M.; Zhang, X.; Zhang, T.; Yang, C.; Zhou, X.; Jia, F. Density peak clustering algorithms: A review on the decade 2014–2023. Expert Syst. Appl. 2024, 238, 121860. [Google Scholar] [CrossRef]
- Parsons, L.; Haque, E.; Liu, H. Subspace clustering for high dimensional data: A review. SIGKDD Explor. 2004, 6, 90–105. [Google Scholar] [CrossRef]
- Ezugwu, A.E.; Shukla, A.K.; Agbaje, M.B.; Oyelade, O.N.; José-García, A.; Agushaka, J.O. Automatic clustering algorithms: A systematic review and bibliometric analysis of relevant literature. Neural Comput. Appl. 2020, 33, 6247–6306. [Google Scholar] [CrossRef]
- Ezugwu, A.E. Nature-inspired metaheuristic techniques for automatic clustering: A survey and performance study. SN Appl. Sci. 2020, 2, 273. [Google Scholar] [CrossRef]
- Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef] [PubMed]
- Gutiérrez-Salcedo, M.; Martínez, M.Á.; Moral-Munoz, J.A.; Herrera-Viedma, E.; Cobo, M.J. Some bibliometric procedures for analyzing and evaluating research fields. Appl. Intell. 2017, 48, 1275–1287. [Google Scholar] [CrossRef]
- Donthu, N.; Kumar, S.; Mukherjee, D.; Pandey, N.; Lim, W.M. How to conduct a bibliometric analysis: An overview and guidelines. J. Bus. Res. 2021, 133, 285–296. [Google Scholar] [CrossRef]
- Cobo, M.J.; López-Herrera, A.G.; Herrera-Viedma, E.; Herrera, F. science mapping software tools: Review, analysis, and cooperative study among tools. J. Am. Soc. Inf. Sci. Technol. 2011, 62, 1382–1402. [Google Scholar] [CrossRef]
- Aria, M.; Cuccurullo, C. Bibliometrix: An R-tool for comprehensive science mapping analysis. J. Informetr. 2017, 11, 959–975. [Google Scholar] [CrossRef]
- Pranckutė, R. Web of Science (WoS) and Scopus: The titans of bibliographic information in today’s academic world. Publications 2021, 9, 12. [Google Scholar] [CrossRef]
- Shiau, W.-L.; Dwivedi, Y.K.; Yang, H.S. Co-citation and cluster analyses of extant literature on social networks. Int. J. Inf. Manag. 2017, 37, 390–399. [Google Scholar] [CrossRef]
- Perianes-Rodriguez, A.; Waltman, L.; van Eck, N.J. Constructing bibliometric networks: A comparison between full and fractional counting. J. Informetr. 2016, 10, 1178–1195. [Google Scholar] [CrossRef]
- van Eck, N.J.; Waltman, L. Citation-based clustering of publications using CitNetExplorer and VOSviewer. Scientometrics 2017, 111, 1053–1070. [Google Scholar] [CrossRef]
- Orduña-Malea, E.; Costas, R. Link-based approach to study scientific software usage: The case of VOSviewer. Scientometrics 2021, 126, 8153–8186. [Google Scholar] [CrossRef]
- Jiang, F.; Liu, G.Z.; Du, J.W.; Sui, Y.F. Initialization of k-modes clustering using outlier detection techniques. Inf. Sci. 2016, 332, 167–183. [Google Scholar] [CrossRef]
- Li, M.; Deng, S.B.; Wang, L.; Feng, S.Z.; Fan, J.P. Hierarchical clustering algorithm for categorical data using a probabilistic rough set model. Knowl. -Based Syst. 2014, 65, 60–71. [Google Scholar] [CrossRef]
- Bai, L.; Liang, J.Y. The k-modes type clustering plus between-cluster information for categorical data. Neurocomputing 2014, 133, 111–121. [Google Scholar] [CrossRef]
- Cao, F.Y.; Huang, J.Z.X.; Liang, J.Y.; Zhao, X.W.; Meng, Y.F.; Feng, K.; Qian, Y.H. An algorithm for clustering categorical data with set-valued features. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 4593–4606. [Google Scholar] [CrossRef] [PubMed]
- Qin, H.W.; Ma, X.Q.; Herawan, T.; Zain, J.M. MGR: An information theory based hierarchical divisive clustering algorithm for categorical data. Knowl. -Based Syst. 2014, 67, 401–411. [Google Scholar] [CrossRef]
- Yanto, I.T.R.; Ismail, M.A.; Herawan, T. A modified fuzzy k-partition based on indiscernibility relation for categorical data clustering. Eng. Appl. Artif. Intell. 2016, 53, 41–52. [Google Scholar] [CrossRef]
- McNicholas, P.D. Model-based clustering. J. Classif. 2016, 33, 331–373. [Google Scholar] [CrossRef]
- Goodman, L.A. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 1974, 61, 215–231. [Google Scholar] [CrossRef]
- Weller, B.E.; Bowen, N.K.; Faubert, S.J. Latent class analysis: A guide to best practice. J. Black Psychol. 2020, 46, 287–311. [Google Scholar] [CrossRef]
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 2018, 39, 1–22. [Google Scholar] [CrossRef]
- Wei, W.; Liang, J.Y.; Guo, X.Y.; Song, P.; Sun, Y.J. Hierarchical division clustering framework for categorical data. Neurocomputing 2019, 341, 118–134. [Google Scholar] [CrossRef]
- Sulc, Z.; Rezanková, H. Comparison of similarity measures for categorical data in hierarchical clustering. J. Classif. 2019, 36, 58–72. [Google Scholar] [CrossRef]
- Xu, S.L.; Liu, S.L.; Zhou, J.; Feng, L. Fuzzy rough clustering for categorical data. Int. J. Mach. Learn. Cybern. 2019, 10, 3213–3223. [Google Scholar] [CrossRef]
- Saha, I.; Sarkar, J.P.; Maulik, U. Integrated rough fuzzy clustering for categorical data analysis. Fuzzy Sets Syst. 2019, 361, 1–32. [Google Scholar] [CrossRef]
- Peng, L.W.; Liu, Y.G. Attribute weights-based clustering centres algorithm for initialising k-modes clustering. Clust. Comput. -J. Netw. Softw. Tools Appl. 2019, 22, S6171–S6179. [Google Scholar] [CrossRef]
- Ye, Y.Q.; Jiang, J.; Ge, B.F.; Yang, K.W.; Stanley, H.E. Heterogeneous graph based similarity measure for categorical data unsupervised learning. IEEE Access 2019, 7, 112662–112680. [Google Scholar] [CrossRef]
- Nguyen, T.P.Q.; Kuo, R.J. Automatic fuzzy clustering using non-dominated sorting particle swarm optimization algorithm for categorical data. IEEE Access 2019, 7, 99721–99734. [Google Scholar] [CrossRef]
- Nguyen, T.P.Q.; Kuo, R.J. Partition-and-merge based fuzzy genetic clustering algorithm for categorical data. Appl. Soft Comput. 2019, 75, 254–264. [Google Scholar] [CrossRef]
- Kuo, R.J.; Nguyen, T.P.Q. Genetic intuitionistic weighted fuzzy k-modes algorithm for categorical data. Neurocomputing 2019, 330, 116–126. [Google Scholar] [CrossRef]
- Xiao, Y.Y.; Huang, C.H.; Huang, J.Y.; Kaku, I.; Xu, Y.C. Optimal mathematical programming and variable neighborhood search for k-modes categorical data clustering. Pattern Recognit. 2019, 90, 183–195. [Google Scholar] [CrossRef]
- Gao, X.N.; Wu, S. CUBOS: An internal cluster validity index for categorical data. Teh. Vjesn. -Tech. Gaz. 2019, 26, 486–494. [Google Scholar] [CrossRef]
- Jia, H.; Cheung, Y.M.; Liu, J.M. A new distance metric for unsupervised learning of categorical data. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 1065–1079. [Google Scholar] [CrossRef] [PubMed]
- Qian, Y.H.; Li, F.J.; Liang, J.Y.; Liu, B.; Dang, C.Y. Space structure and clustering of categorical data. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 2047–2059. [Google Scholar] [CrossRef] [PubMed]
- Yang, C.L.; Kuo, R.J.; Chien, C.H.; Quyen, N.T.P. Non-dominated sorting genetic algorithm using fuzzy membership chromosome for categorical data clustering. Appl. Soft Comput. 2015, 30, 113–122. [Google Scholar] [CrossRef]
- Chen, L.F.; Wang, S.R.; Wang, K.J.; Zhu, J.P. Soft subspace clustering of categorical data with probabilistic distance. Pattern Recognit. 2016, 51, 322–332. [Google Scholar] [CrossRef]
- Park, I.K.; Choi, G.S. Rough set approach for clustering categorical data using information-theoretic dependency measure. Inf. Syst. 2015, 48, 289–295. [Google Scholar] [CrossRef]
- Zhu, S.W.; Xu, L.H. Many-objective fuzzy centroids clustering algorithm for categorical data. Expert. Syst. Appl. 2018, 96, 230–248. [Google Scholar] [CrossRef]
- Ben Salem, S.; Naouali, S.; Chtourou, Z. A fast and effective partitional clustering algorithm for large categorical datasets using a k-means based approach. Comput. Electr. Eng. 2018, 68, 463–483. [Google Scholar] [CrossRef]
- Bai, L.; Liang, J.Y. Cluster validity functions for categorical data: A solution-space perspective. Data Min. Knowl. Discov. 2015, 29, 1560–1597. [Google Scholar] [CrossRef]
- Bai, L.; Liang, J.Y. A categorical data clustering framework on graph representation. Pattern Recognit. 2022, 128, 108694. [Google Scholar] [CrossRef]
- Cao, F.Y.; Huang, J.Z.X.; Liang, J.Y. A fuzzy SV-k-modes algorithm for clustering categorical data with set-valued attributes. Appl. Math. Comput. 2017, 295, 1–15. [Google Scholar] [CrossRef]
- Cao, F.Y.; Yu, L.Q.; Huang, J.Z.X.; Liang, J.Y. K-mw-modes: An algorithm for clustering categorical matrix-object data. Appl. Soft Comput. 2017, 57, 605–614. [Google Scholar] [CrossRef]
- Kuo, R.J.; Zheng, Y.R.; Nguyen, T.P.Q. Metaheuristic-based possibilistic fuzzy k-modes algorithms for categorical data clustering. Inf. Sci. 2021, 557, 1–15. [Google Scholar] [CrossRef]
- Ben Salem, S.; Naouali, S.; Chtourou, Z. The DRk-M for clustering categorical datasets with uncertainty. IEEE Intell. Syst. 2021, 36, 113–121. [Google Scholar] [CrossRef]
- Ben Salem, S.; Naouali, S.; Chtourou, Z. A rough set based algorithm for updating the modes in categorical clustering. Int. J. Mach. Learn. Cybern. 2021, 12, 2069–2090. [Google Scholar] [CrossRef] [PubMed]
- Naouali, S.; Ben Salem, S.; Chtourou, Z. Uncertainty mode selection in categorical clustering using the rough set theory. Expert. Syst. Appl. 2020, 158, 113555. [Google Scholar] [CrossRef]
- Zhang, Y.Q.; Cheung, Y.M. A new distance metric exploiting heterogeneous interattribute relationship for ordinal-and-nominal-attribute data clustering. IEEE Trans. Cybern. 2022, 52, 758–771. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Y.Q.; Cheung, Y.M. Learnable weighting of intra-attribute distances for categorical data clustering with nominal and ordinal attributes. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3560–3576. [Google Scholar] [CrossRef]
- Zhang, Y.Q.; Cheung, Y.M.; Tan, K.C. A unified entropy-based distance metric for ordinal-and-nominal-attribute data clustering. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 39–52. [Google Scholar] [CrossRef]
- Chen, H.; Xu, K.P.; Chen, L.F.; Jiang, Q.S. Self-expressive kernel subspace clustering algorithm for categorical data with embedded feature selection. Mathematics 2021, 9, 1680. [Google Scholar] [CrossRef]
- Chen, L.F. A probabilistic framework for optimizing projected clusters with categorical attributes. Sci. China-Inf. Sci. 2015, 58, 072104:1–072104:15. [Google Scholar] [CrossRef]
- Yuan, F.; Yang, Y.L.; Yuan, T.T. A dissimilarity measure for mixed nominal and ordinal attribute data in k-modes algorithm. Appl. Intell. 2020, 50, 1498–1509. [Google Scholar] [CrossRef]
- Oskouei, A.G.; Balafar, M.A.; Motamed, C. FKMAWCW: Categorical fuzzy k-modes clustering with automated attribute-weight and cluster-weight learning. Chaos Solitons Fractals 2021, 153, 111494. [Google Scholar] [CrossRef]
- Saha, A.; Das, S. Categorical fuzzy k-modes clustering with automated feature weight learning. Neurocomputing 2015, 166, 422–435. [Google Scholar] [CrossRef]
- Heloulou, I.; Radjef, M.S.; Kechadi, M.T. A multi-act sequential game-based multi-objective clustering approach for categorical data. Neurocomputing 2017, 267, 320–332. [Google Scholar] [CrossRef]
- Dorman, K.S.; Maitra, R. An efficient k-modes algorithm for clustering categorical datasets. Stat. Anal. Data Min. 2022, 15, 83–97. [Google Scholar] [CrossRef]
- Rios, E.J.R.; Medina-Pérez, M.A.; Lazo-Cortés, M.S.; Monroy, R. Learning-based dissimilarity for clustering categorical data. Appl. Sci. -Basel 2021, 11, 3509. [Google Scholar] [CrossRef]
- Uddin, J.; Ghazali, R.; Deris, M.M.; Iqbal, U.; Shoukat, I.A. A novel rough value set categorical clustering technique for supplier base management. Computing 2021, 103, 2061–2091. [Google Scholar] [CrossRef]
- Suri, N.; Murty, M.N.; Athithan, G. Detecting outliers in categorical data through rough clustering. Nat. Comput. 2016, 15, 385–394. [Google Scholar] [CrossRef]
- Kar, A.K.; Mishra, A.C.; Mohanty, S.K. An efficient entropy based dissimilarity measure to cluster categorical data. Eng. Appl. Artif. Intell. 2023, 119, 105795. [Google Scholar] [CrossRef]
- Chen, B.G.; Yin, H.T. Learning category distance metric for data clustering. Neurocomputing 2018, 306, 160–170. [Google Scholar] [CrossRef]
- Jian, S.L.; Cao, L.B.; Lu, K.; Gao, H. Unsupervised coupled metric similarity for Non-IID categorical data. IEEE Trans. Knowl. Data Eng. 2018, 30, 1810–1823. [Google Scholar] [CrossRef]
- Zhang, C.B.; Chen, L.; Zhao, Y.P.; Wang, Y.X.; Chen, C.L.P. Graph enhanced fuzzy clustering for categorical data using a bayesian dissimilarity measure. IEEE Trans. Fuzzy Syst. 2023, 31, 810–824. [Google Scholar] [CrossRef]
- Narasimhan, M.; Balasubramanian, B.; Kumar, S.D.; Patil, N. EGA-FMC: Enhanced genetic algorithm-based fuzzy k-modes clustering for categorical data. Int. J. Bio-Inspired Comput. 2018, 11, 219–228. [Google Scholar] [CrossRef]
- Zheng, Q.B.; Diao, X.C.; Cao, J.J.; Liu, Y.; Li, H.M.; Yao, J.N.; Chang, C.; Lv, G.J. From whole to part: Reference-based representation for clustering categorical data. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 927–937. [Google Scholar] [CrossRef]
- Faouzi, T.; Firinguetti-Limone, L.; Avilez-Bozo, J.M.; Carvajal-Schiaffino, R. The α-Groups under condorcet clustering. Mathematics 2022, 10, 718. [Google Scholar] [CrossRef]
- Jiang, Z.N.; Liu, X.Y.; Zang, W.K. A kernel-based intuitionistic weight fuzzy k-modes algorithm using coupled chained P system combines DNA genetic rules for categorical data. Neurocomputing 2023, 528, 84–96. [Google Scholar] [CrossRef]
- Amiri, S.; Clarke, B.S.; Clarke, J.L. Clustering categorical data via ensembling dissimilarity matrices. J. Comput. Graph. Stat. 2018, 27, 195–208. [Google Scholar] [CrossRef]
- Kim, K. A weighted k-modes clustering using new weighting method based on within-cluster and between-cluster impurity measures. J. Intell. Fuzzy Syst. 2017, 32, 979–990. [Google Scholar] [CrossRef]
- Sun, H.J.; Chen, R.B.; Qin, Y.; Wang, S.R. Holo-entropy based categorical data hierarchical clustering. Informatica 2017, 28, 303–328. [Google Scholar] [CrossRef]
- Mau, T.N.; Inoguchi, Y.; Huynh, V.N. A novel cluster prediction approach based on locality-sensitive hashing for fuzzy clustering of categorical data. IEEE Access 2022, 10, 34196–34206. [Google Scholar] [CrossRef]
- Dinh, D.T.; Huynh, V.N. k-PbC: An improved cluster center initialization for categorical data clustering. Appl. Intell. 2020, 50, 2610–2632. [Google Scholar] [CrossRef]
- Parmar, D.; Wu, T.; Blackhurst, J. MMR: An algorithm for clustering categorical data using rough set theory. Data Knowl. Eng. 2007, 63, 879–893. [Google Scholar] [CrossRef]
- He, Z.Y.; Xu, X.F.; Deng, S.C. K-ANMI: A mutual information based clustering algorithm for categorical data. Inf. Fusion. 2008, 9, 223–233. [Google Scholar] [CrossRef]
- Deng, S.C.; He, Z.Y.; Xu, X.F. G-ANMI: A mutual information based genetic clustering algorithm for categorical data. Knowl. -Based Syst. 2010, 23, 144–149. [Google Scholar] [CrossRef]
- Barbará, D.; Li, Y.; Couto, J. COOLCAT: An entropy-based algorithm for categorical clustering. In Proceedings of the Eleventh International Conference on Information and Knowledge Management, McLean, VA, USA, 4–9 November 2002; pp. 582–589. [Google Scholar]
- Herawan, T.; Deris, M.M.; Abawajy, J.H. A rough set approach for selecting clustering attribute. Knowl. -Based Syst. 2010, 23, 220–231. [Google Scholar] [CrossRef]
- Mazlack, L.; He, A.; Zhu, Y.; Coppock, S. A rough set approach in choosing partitioning attributes. In Proceedings of the ISCA 13th International Conference (CAINE-2000), Honolulu, HI, USA, 1–3 November 2000; pp. 1–6. [Google Scholar]
- Andritsos, P.; Tsaparas, P.; Miller, R.J.; Sevcik, K.C. Limbo: A scalable algorithm to cluster categorical data. In Proceedings of the International Conference on Extending Database Technology, Berlin/Heidelberg, Germany, 7–10 December 2003; pp. 123–146. [Google Scholar]
- Altameem, A.; Poonia, R.C.; Kumar, A.; Raja, L.; Saudagar, A.K.J. P-ROCK: A sustainable clustering algorithm for large categorical datasets. Intell. Autom. Soft Comput. 2023, 35, 553–566. [Google Scholar] [CrossRef]
- Guha, S.; Rastogi, R.; Shim, K. ROCK: A robust clustering algorithm for categorical attributes. Inf. Syst. 2000, 25, 345–366. [Google Scholar] [CrossRef]
- Wu, S.; Wang, S. Information-theoretic outlier detection for large-scale categorical data. IEEE Trans. Knowl. Data Eng. 2013, 25, 589–602. [Google Scholar] [CrossRef]
- Dutta, M.; Mahanta, A.K.; Pujari, A.K. QROCK: A quick version of the ROCK algorithm for clustering of categorical data. Pattern Recognit. Lett. 2005, 26, 2364–2373. [Google Scholar] [CrossRef]
- Saruladha, K.; Likhitha, P. Modified rock (MROCK) algorithm for clustering categorical data. Adv. Nat. Appl. Sci. 2015, 9, 518–525. [Google Scholar]
- Ben Hariz, S.; Elouedi, Z. New dynamic clustering approaches within belief function framework. Intell. Data Anal. 2014, 18, 409–428. [Google Scholar] [CrossRef]
- Smets, P. The transferable belief model and other interpretations of Dempster-Shafer’s model. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, Cambridge, MA, USA, 27–29 July 1990. [Google Scholar]
- Ben Hariz, S.; Elouedi, Z.; Mellouli, K. Clustering Approach Using Belief Function Theory; Springer: Berlin/Heidelberg, Germany, 2006; pp. 162–171. [Google Scholar]
- Cao, F.; Liang, J.; Li, D.; Zhao, X. A weighting k-modes algorithm for subspace clustering of categorical data. Neurocomputing 2013, 108, 23–30. [Google Scholar] [CrossRef]
- Cao, F.; Liang, J.; Li, D.; Bai, L.; Dang, C. A dissimilarity measure for the k-modes clustering algorithm. Knowl. -Based Syst. 2012, 26, 120–127. [Google Scholar] [CrossRef]
- Chi-Hyon, O.; Honda, K.; Ichihashi, H. Fuzzy clustering for categorical multivariate data. In Proceedings of the Joint 9th IFSA World Congress and 20th NAFIPS International Conference (Cat. No. 01TH8569), Vancouver, BC, Canada, 25–28 July 2001; Volume 4, pp. 2154–2159. [Google Scholar]
- Heloulou, I.; Radjef, M.S.; Kechadi, M.T. Clustering Based on Sequential Multi-Objective Games; Springer International Publishing: Munich, Germany, 2014; pp. 369–381. [Google Scholar]
- Kaufman, L.; Rousseeuw, P. Finding Groups in Data: An Introduction to Cluster Analysis; John Wiley & Sons: New York, NY, USA, 1990. [Google Scholar]
- Zhang, M.-L.; Zhou, Z.-H. Multi-instance clustering with applications to multi-instance prediction. Appl. Intell. 2009, 31, 47–68. [Google Scholar] [CrossRef]
- Giannotti, F.; Gozzi, C.; Manco, G. Clustering Transactional Data; Springer: Berlin/Heidelberg, Germany, 2002; pp. 175–187. [Google Scholar]
- Khan, S.S.; Ahmad, A. Cluster center initialization algorithm for k-modes clustering. Expert. Syst. Appl. 2013, 40, 7444–7456. [Google Scholar] [CrossRef]
- Wu, S.; Jiang, Q.; Huang, J.Z. A new initialization method for clustering categorical data. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Nanjing, China, 22–25 May 2007. [Google Scholar]
- Arthur, D.; Vassilvitskii, S. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035. [Google Scholar]
- Bahmani, B.; Moseley, B.; Vattani, A.; Kumar, R.; Vassilvitskii, S. Scalable k-means++. Proc. VLDB Endow. 2012, 5, 622–633. [Google Scholar] [CrossRef]
- Fuyuan, C.; Jiye, L.; Liang, B. A new initialization method for categorical data clustering. Expert. Syst. Appl. 2009, 36, 10223–10228. [Google Scholar] [CrossRef]
- San, O.M.; Huynh, V.-N.; Nakamori, Y. An alternative extension of the k-means algorithm for clustering categorical data. Int. J. Appl. Math. Comput. Sci. 2004, 14, 241–247. [Google Scholar]
- Nguyen, T.-H.T.; Huynh, V.-N. A k-means-like algorithm for clustering categorical data using an information theoretic-based dissimilarity measure. In Proceedings of the International Symposium on Foundations of Information and Knowledge Systems, Linz, Austria, 7–11 March 2016. [Google Scholar]
- Nguyen, T.-H.T.; Dinh, D.-T.; Sriboonchitta, S.; Huynh, V.-N. A method for k-means-like clustering of categorical data. J. Ambient. Intell. Humaniz. Comput. 2019, 14, 15011–15021. [Google Scholar] [CrossRef]
- Nguyen, H.H. Clustering categorical data using community detection techniques. Comput. Intell. Neurosci. 2017, 2017, 8986360. [Google Scholar] [CrossRef] [PubMed]
- Chan, E.Y.; Ching, W.-K.; Ng, M.K.P.; Huang, J.Z. An optimization algorithm for clustering using weighted dissimilarity measures. Pattern Recognit. 2004, 37, 943–952. [Google Scholar] [CrossRef]
- Bai, L.; Liang, J.; Dang, C.; Cao, F. A novel attribute weighting algorithm for clustering high-dimensional categorical data. Pattern Recognit. 2011, 44, 2843–2861. [Google Scholar] [CrossRef]
- Ng, A.Y.; Jordan, M.I.; Weiss, Y. On spectral clustering: Analysis and an algorithm. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, Vancouver, BC, Canada, 3–8 December 2001; pp. 849–856. [Google Scholar]
- Lee, D.D.; Seung, H.S. Algorithms for non-negative matrix factorization. In Proceedings of the 13th International Conference on Neural Information Processing Systems, Denver, CO, USA, 28–30 November 2000; pp. 535–541. [Google Scholar]
- Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef]
- Ralambondrainy, H. A conceptual version of the k-means algorithm. Pattern Recognit. Lett. 1995, 16, 1147–1157. [Google Scholar] [CrossRef]
- Iam-On, N.; Boongeon, T.; Garrett, S.; Price, C. A link-based cluster ensemble approach for categorical data clustering. IEEE Trans. Knowl. Data Eng. 2012, 24, 413–425. [Google Scholar] [CrossRef]
- Jian, S.; Cao, L.; Pang, G.; Lu, K.; Gao, H. Embedding-based representation of categorical data by hierarchical value coupling learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 1937–1943. [Google Scholar]
- Marcotorchino, F.; Michaud, P. Agregation de similarites en classification automatique. Rev. De Stat. Appliquée 1982, 30, 21–44. [Google Scholar]
- Hariz, S.B.; Elouedi, Z. IK-BKM: An incremental clustering approach based on intra-cluster distance. In Proceedings of the ACS/IEEE International Conference on Computer Systems and Applications—AICCSA 2010, Washington, DC, USA, 16–19 May 2010; pp. 1–8. [Google Scholar]
- Ben Hariz, S.; Elouedi, Z. DK-BKM: Decremental k Belief k-Modes Method; Springer: Berlin/Heidelberg, Germany, 2010; pp. 84–97. [Google Scholar]
- Hartigan, J.A.; Wong, M.A. A k-means clustering algorithm. J. R. Stat. Society. Ser. C (Appl. Stat.) 1979, 28, 100–108. [Google Scholar] [CrossRef]
- Grahne, G.; Zhu, J. High performance mining of maximal frequent itemsets. In Proceedings of the 6th International Workshop on High Performance Data Mining, San Francisco, CA, USA, 1–3 May 2003; p. 34. [Google Scholar]
- Ng, M.K.; Li, M.J.; Huang, J.Z.; He, Z. On the impact of dissimilarity measure in k-modes clustering algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 503–507. [Google Scholar] [CrossRef]
- Ben Salem, S.; Naouali, S.; Sallami, M. Clustering categorical data using the k-means algorithm and the attribute’s relative frequency. World Acad. Sci. Eng. Technol. Int. J. Comput. Electr. Autom. Control Inf. Eng. 2017, 11, 708–713. [Google Scholar]
- Semeh Ben, S.; Sami, N.; Moetez, S. A computational cost-effective clustering algorithm in multidimensional space using the manhattan metric: Application to the global terrorism database. World Acad. Sci. Eng. Technol. Int. J. Comput. Electr. Autom. Control Inf. Eng. 2017, 2017, 14. [Google Scholar]
- Gan, G.; Wu, J.; Yang, Z. A genetic fuzzy k-Modes algorithm for clustering categorical data. Expert. Syst. Appl. 2009, 36, 1615–1620. [Google Scholar] [CrossRef]
- Mukhopadhyay, A.; Maulik, U.; Bandyopadhyay, S. Multiobjective genetic algorithm-based fuzzy clustering of categorical attributes. IEEE Trans. Evol. Comput. 2009, 13, 991–1005. [Google Scholar] [CrossRef]
- Maciel, D.B.M.; Amaral, G.J.A.; de Souza, R.; Pimentel, B.A. Multivariate fuzzy k-modes algorithm. Pattern Anal. Appl. 2017, 20, 59–71. [Google Scholar] [CrossRef]
- Trigo, M. Using Fuzzy k-Modes to Analyze Patterns of System Calls for Intrusion Detection. Master’s Thesis, California State University, Los Angeles, CA, USA, 2005. [Google Scholar]
- Kim, D.-W.; Lee, K.H.; Lee, D. Fuzzy clustering of categorical data using fuzzy centroids. Pattern Recognit. Lett. 2004, 25, 1263–1271. [Google Scholar] [CrossRef]
- Cesario, E.; Manco, G.; Ortale, R. Top-down parameter-free clustering of high-dimensional categorical data. IEEE Trans. Knowl. Data Eng. 2007, 19, 1607–1624. [Google Scholar] [CrossRef]
- Tengke, X.; Shengrui, W.; André, M.; Ernest, M. DHCC: Divisive hierarchical clustering of categorical data. Data Min. Knowl. Discov. 2012, 24, 103–135. [Google Scholar] [CrossRef]
- Bouguessa, M. Clustering categorical data in projected spaces. Data Min. Knowl. Discov. 2015, 29, 3–38. [Google Scholar] [CrossRef]
- Potdar, K.; Pardawala, T.; Pai, C. A comparative study of categorical variable encoding techniques for neural network classifiers. Int. J. Comput. Appl. 2017, 175, 7–9. [Google Scholar] [CrossRef]
- Lucasius, C.B.; Dane, A.D.; Kateman, G. On k-medoid clustering of large data sets with the aid of a genetic algorithm: Background, feasibility and comparison. Anal. Chim. Acta 1993, 282, 647–669. [Google Scholar] [CrossRef]
- Mau, T.N.; Huynh, V.N. Kernel-based k-representatives algorithm for fuzzy clustering of categorical data. In Proceedings of the 2021 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Luxembourg, 11–14 July 2021. [Google Scholar] [CrossRef]
- Mau, T.N.; Huynh, V.N. An LSH-based k-representatives clustering method for large categorical data. Neurocomputing 2021, 463, 29–44. [Google Scholar] [CrossRef]
- Tao, X.; Wang, R.; Chang, R.; Li, C. Density-sensitive fuzzy kernel maximum entropy clustering algorithm. Knowl.-Based Syst. 2019, 166, 42–57. [Google Scholar] [CrossRef]
- Teng, Y.; Qi, S.; Han, F.; Xu, L.; Yao, Y.; Qian, W. Two graph-regularized fuzzy subspace clustering methods. Appl. Soft Comput. 2021, 100, 106981. [Google Scholar] [CrossRef]
- Pal, N.R.; Pal, K.; Keller, J.M.; Bezdek, J.C. A possibilistic fuzzy c-means clustering algorithm. IEEE Trans. Fuzzy Syst. 2005, 13, 517–530. [Google Scholar] [CrossRef]
- Chaudhuri, A. Intuitionistic fuzzy possibilistic c-means clustering algorithms. Adv. Fuzzy Syst. 2015, 2015, 238237. [Google Scholar] [CrossRef]
- Xu, D.; Xu, Z.; Liu, S.; Zhao, H. A spectral clustering algorithm based on intuitionistic fuzzy information. Knowl.-Based Syst. 2013, 53, 20–26. [Google Scholar] [CrossRef]
- Xu, Z.; Chen, J.; Wu, J. Clustering algorithm for intuitionistic fuzzy sets. Inf. Sci. 2008, 178, 3775–3790. [Google Scholar] [CrossRef]
- Xu, Z. Intuitionistic fuzzy hierarchical clustering algorithms. J. Syst. Eng. Electron. 2009, 20, 90–97. [Google Scholar]
- Păun, G. Computing with membranes. J. Comput. Syst. Sci. 2000, 61, 108–143. [Google Scholar] [CrossRef]
- Zang, W.; Sun, M.; Jiang, Z. A DNA genetic algorithm inspired by biological membrane structure. J. Comput. Theor. Nanosci. 2016, 13, 3763–3772. [Google Scholar] [CrossRef]
- Ammar, A.; Elouedi, Z.; Lingras, P. Semantically segmented clustering based on possibilistic and rough set theories. Int. J. Intell. Syst. 2015, 30, 676–706. [Google Scholar] [CrossRef]
- Tripathy, B.K.; Ghosh, A. SDR: An algorithm for clustering categorical data using rough set theory. In Proceedings of the 2011 IEEE Recent Advances in Intelligent Computational Systems, Trivandrum, India, 22–24 September 2011; pp. 867–872. [Google Scholar]
- Tripathy, B.K.; Adhir, G. SSDR: An algorithm for clustering categorical data using rough set theory. Adv. Appl. Sci. Res. 2011, 2, 314–326. [Google Scholar]
- Yang, M.-S.; Chiang, Y.-H.; Chen, C.-C.; Lai, C.-Y. A fuzzy k-partitions model for categorical data and its comparison to the GoM model. Fuzzy Sets Syst. 2008, 159, 390–405. [Google Scholar] [CrossRef]
- He, Z.; Xu, X.; Deng, S. A cluster ensemble method for clustering categorical data. Inf. Fusion 2005, 6, 143–151. [Google Scholar] [CrossRef]
- Ng, M.K.; Wong, J.C. Clustering categorical data sets using tabu search techniques. Pattern Recognit. 2002, 35, 2783–2790. [Google Scholar] [CrossRef]
- Jain, A.K.; Dubes, R.C. Algorithms for Clustering Data; Prentice-Hall, Inc.: Saddle River, NJ, USA, 1988. [Google Scholar]
- Saha, I.; Sarkar, J.P.; Maulik, U. Ensemble based rough fuzzy clustering for categorical data. Knowl.-Based Syst. 2015, 77, 114–127. [Google Scholar] [CrossRef]
- Peters, G.; Lampart, M.; Weber, R. Evolutionary rough k-medoid clustering. In Transactions on Rough Sets VIII; Peters, J.F., Skowron, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2008; pp. 289–306. [Google Scholar]
- Huang, J.Z.; Ng, M.K.; Rong, H.; Li, Z. Automated variable weighting in k-means type clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 657–668. [Google Scholar] [CrossRef]
- Qin, H.; Ma, X.; Zain, J.M.; Herawan, T. A novel soft set approach in selecting clustering attribute. Knowl.-Based Syst. 2012, 36, 139–145. [Google Scholar] [CrossRef]
- Bai, L.; Liang, J.; Dang, C.; Cao, F. A novel fuzzy clustering algorithm with between-cluster information for categorical data. Fuzzy Sets Syst. 2013, 215, 55–73. [Google Scholar] [CrossRef]
- Hassanein, W.A.; Elmelegy, A.A. An algorithm for selecting clustering attribute using significance of attributes. Int. J. Database Theory Appl. 2013, 6, 53–66. [Google Scholar] [CrossRef]
- Ammar, A.; Elouedi, Z.; Lingras, P. The k-modes method using possibility and rough set theories. In Proceedings of the 2013 Joint IFSA World Congress and NAFIPS Annual Meeting (IFSA/NAFIPS), Edmonton, AB, Canada, 24–28 June 2013; pp. 1297–1302. [Google Scholar]
- Lee, J.; Lee, Y.J. An effective dissimilarity measure for clustering of high-dimensional categorical data. Knowl. Inf. Syst. 2014, 38, 743–757. [Google Scholar] [CrossRef]
- Li, T.; Ma, S.; Ogihara, M. Entropy-based criterion in categorical clustering. In Proceedings of the Twenty-First International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004. [Google Scholar] [CrossRef]
- Bai, L.; Liang, J.; Dang, C.; Cao, F. The impact of cluster representatives on the convergence of the k-modes type clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1509–1522. [Google Scholar] [CrossRef]
- Esposito, F.; Malerba, D.; Tamma, V.; Bock, H.-H. Classical Resemblance Measures; Springer: Berlin/Heidelberg, Germany, 2000; pp. 139–152. [Google Scholar]
- Ahmad, A.; Dey, L. A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recognit. Lett. 2007, 28, 110–118. [Google Scholar] [CrossRef]
- Knorr, E.M.; Ng, R.T. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the Very Large Data Bases Conference, New York, NY, USA, 24–27 August 1998. [Google Scholar]
- Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231. [Google Scholar]
- Fraley, C.; Raftery, A.E. Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 2002, 97, 611–631. [Google Scholar] [CrossRef]
- Ahmad, A.; Dey, L. A k-mean clustering algorithm for mixed numeric and categorical data. Data Knowl. Eng. 2007, 63, 503–527. [Google Scholar] [CrossRef]
- Wang, C.; Cao, L.; Wang, M.; Li, J.; Wei, W.; Ou, Y. Coupled nominal similarity in unsupervised learning. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, Glasgow, UK, 24–28 October 2011; pp. 973–978. [Google Scholar]
- Wang, C.; Dong, X.; Zhou, F.; Cao, L.; Chi, C.-H. Coupled Attribute Similarity learning on categorical data. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 781–797. [Google Scholar] [CrossRef]
- Boriah, S.; Chandola, V.; Kumar, V. Similarity measures for categorical data: A comparative evaluation. In Proceedings of the 2008 SIAM International Conference on Data Mining (SDM); SIAM: Atlanta, GA, USA, 2008; pp. 243–254. [Google Scholar]
- Bock, H.-H.; Diday, E. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data; Springer Science & Business Media: Munich, Germany, 2000. [Google Scholar]
- von Luxburg, U. A tutorial on spectral clustering. Stat. Comput. 2007, 17, 395–416. [Google Scholar] [CrossRef]
- Jones, K.S. A statistical interpretation of term specificity and its application in retrieval. In Document Retrieval Systems; Taylor Graham Publishing: Abingdon, UK, 1988; pp. 132–142. [Google Scholar]
- Goodall, D.W. A new similarity index based on probability. Biometrics 1966, 22, 882–907. [Google Scholar] [CrossRef]
- Li, C.; Li, H. A modified Short and Fukunaga metric based on the attribute independence assumption. Pattern Recognit. Lett. 2012, 33, 1213–1218. [Google Scholar] [CrossRef]
- Eskin, E.; Arnold, A.; Prerau, M.; Portnoy, L.; Stolfo, S. A geometric framework for unsupervised anomaly detection. In Applications of Data Mining in Computer Security; Barbará, D., Jajodia, S., Eds.; Springer: Boston, MA, USA, 2002; pp. 77–101. [Google Scholar]
- Morlini, I.; Zani, S. A new class of weighted similarity indices using polytomous variables. J. Classif. 2012, 29, 199–226. [Google Scholar] [CrossRef]
- Lin, D. An information-theoretic definition of similarity. In Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI, USA, 24–27 July 1998; pp. 296–304. [Google Scholar]
- Sokal, R.R.; Michener, C.D. A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 1958, 38, 1409–1438. [Google Scholar]
- Ienco, D.; Pensa, R.G.; Meo, R. Context-Based Distance Learning for Categorical Data Clustering; Springer: Berlin/Heidelberg, Germany, 2009; pp. 83–94. [Google Scholar]
- Ienco, D.; Pensa, R.G.; Meo, R. From context to distance: Learning dissimilarity for categorical data clustering. ACM Trans. Knowl. Discov. Data 2012, 6, 1. [Google Scholar] [CrossRef]
- Jing, L.; Ng, M.K.; Huang, J.Z. An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Trans. Knowl. Data Eng. 2007, 19, 1026–1041. [Google Scholar] [CrossRef]
- Jia, H.; Cheung, Y.-m. Subspace clustering of categorical and numerical data with an unknown number of clusters. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 3308–3325. [Google Scholar]
- Jian, S.L.; Pang, G.S.; Cao, L.B.; Lu, K.; Gao, H. CURE: Flexible categorical data representation by hierarchical coupling learning. IEEE Trans. Knowl. Data Eng. 2019, 31, 853–866. [Google Scholar] [CrossRef]
- Zhu, C.; Cao, L.; Yin, J. Unsupervised heterogeneous coupling learning for categorical representation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 533–549. [Google Scholar] [CrossRef]
- Zhang, Y.; Cheung, Y.-m. An ordinal data clustering algorithm with automated distance learning. Proc. AAAI Conf. Artif. Intell. 2020, 34, 6869–6876. [Google Scholar] [CrossRef]
- Murthy, K.P.N. Ludwig Boltzmann, transport equation and the second law. arXiv 2006, arXiv:cond-mat/0601566. [Google Scholar]
- Du, M.; Ding, S.; Xue, Y. A novel density peaks clustering algorithm for mixed data. Pattern Recognit. Lett. 2017, 97, 46–53. [Google Scholar] [CrossRef]
- Hamming, R.W. Error detecting and error correcting codes. Bell Syst. Tech. J. 1950, 29, 147–160. [Google Scholar] [CrossRef]
- Gambaryan, P. A mathematical model of taxonomy. Izvest. Akad. Nauk. Armen. SSR 1964, 17, 47–53. [Google Scholar]
- Burnaby, T.P. On a method for character weighting a similarity coefficient, employing the concept of information. J. Int. Assoc. Math. Geol. 1970, 2, 25–38. [Google Scholar] [CrossRef]
- Chatzis, S.P. A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional. Expert. Syst. Appl. 2011, 38, 8684–8689. [Google Scholar] [CrossRef]
- de Amorim, R.C.; Makarenkov, V. Applying subclustering and Lp distance in weighted k-means with distributed centroids. Neurocomputing 2016, 173, 700–707. [Google Scholar] [CrossRef]
- Mahamadou, A.J.D.; Antoine, V.; Nguifo, E.M.; Moreno, S. Categorical fuzzy entropy c-means. In Proceedings of the 2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Glasgow, UK, 19–24 July 2020; pp. 1–6. [Google Scholar]
- Huang, J.Z.; Ng, M.K. A fuzzy k-modes algorithm for clustering categorical data. IEEE Trans. Fuzzy Syst. 1999, 7, 446–452. [Google Scholar] [CrossRef]
- Hashemzadeh, M.; Oskouei, A.G.; Farajzadeh, N. New fuzzy C-means clustering method based on feature-weight and cluster-weight learning. Appl. Soft Comput. 2019, 78, 324–345. [Google Scholar] [CrossRef]
- Zhi, X.B.; Fan, J.L.; Zhao, F. Robust local feature weighting hard c-means clustering algorithm. Neurocomputing 2014, 134, 20–29. [Google Scholar] [CrossRef]
- He, Z.; Deng, S.; Xu, X. Improving k-Modes Algorithm Considering Frequencies of Attribute Values in Mode; Springer: Berlin/Heidelberg, Germany, 2005; pp. 157–162. [Google Scholar]
- Huang, J.Z. A fast clustering algorithm to cluster very large categorical data sets in data mining. In Proceedings of the Data Mining and Knowledge Discovery, Tucson, AZ, USA, 11 May 1997. [Google Scholar]
- Gluck, M.; Corter, J. Information, uncertainty, and the utility of categories. In Proceedings of the Seventh Annual Conference of the Cognitive Science Society, Irvine, CA, USA, 15–17 August 1985; pp. 283–287. [Google Scholar]
- Gao, C.; Pedrycz, W.; Miao, D.Q. Rough subspace-based clustering ensemble for categorical data. Soft Comput. 2013, 17, 1643–1658. [Google Scholar] [CrossRef]
- Chang, C.-H.; Ding, Z.-K. Categorical Data Visualization and Clustering Using Subjective Factors; Springer: Berlin/Heidelberg, Germany, 2004; pp. 229–238. [Google Scholar]
- Michaud, P. Clustering techniques. Future Gener. Comput. Syst. 1997, 13, 135–147. [Google Scholar] [CrossRef]
Author (Year) | Summary | Database | Data Collection Period | #Articles/Algorithms | Methodology |
---|---|---|---|---|---|
* Alamuri et al. (2014) | A taxonomy of categorical data similarity measures and the categorical clustering algorithms [18] | Official website | n/a | n/a | Survey |
Parsons et al. (2004) | A survey of the various subspace clustering algorithms [24] | Official website | n/a | 11 algorithms | Survey |
Hancer & Karaboga (2017) | A comprehensive review of the determination of cluster numbers based on traditional, merge-split, and evolutionary computation (EC)-based approaches [19] | Official website | n/a | Single-objective: 43 algorithms; multi-objective: 15 algorithms | Survey |
Alloghani et al. (2019) | A systematic review of supervised and unsupervised machine learning techniques [20] | EBSCO, ProQuest Central Databases | 2015–2018 | 84 articles | SLR and Meta-analysis using PRISMA |
Naouali et al. (2019) | A survey of categorical data clustering that classifies algorithms into hard, fuzzy, and rough-set approaches and evaluates them using three metrics (accuracy, precision, and recall) [17] | Official website | 1998–2017 | 32 algorithms | Survey
Ezugwu (2020) | A taxonomical overview and bibliometric analysis of clustering algorithms and automatic clustering algorithms, as well as the systematic review of all the nature-inspired metaheuristic algorithms for both non-automatic and automatic clustering [25] | WoS database | 1989–2019 | 4875 articles for bibliometric analysis and 86 articles for SLR (45 non-automatic and 40 automatic clustering) | SLR, Bibliometric analysis via VOSviewer |
Ezugwu (2020) | A systematic review of nature-inspired metaheuristic algorithms for automatic clustering [26] | Scopus database | n/a–2020 | 1649 articles for bibliometric analysis; 37 automatic clustering algorithms; and an experimental study of 5 metaheuristic algorithms using 41 datasets | SLR, Bibliometric analysis via VOSviewer |
Awad & Hamad (2022) | A review of clustering techniques to handle big data issues [21] | Publisher website | 2015–2022 | >200 articles | SLR |
Ikotun et al. (2023) | A comprehensive overview and taxonomy of the K-means clustering algorithm and its variants [22] | Publisher website | 1984–2021 | 83 articles | SLR |
Wang et al. (2024) | A review of all the density peak clustering (DPC)-related works [23] | Google Scholar and WoS database | 2014–2023 | >110 articles | SLR |
This study | An up-to-date taxonomical overview and bibliometric analysis of categorical data clustering | WoS database | 2014–2023 | 64 articles | Combined PRISMA and bibliometric analysis via VOSviewer
Filters | n |
---|---|
Keywords: ((ALL = (clustering)) AND ALL = (categorical data)) | 1731 |
Publication Years: 2014–2023 (index date: 1 January 2014 to 5 December 2023) | 1113 |
Document Types: Article | 1083 |
Languages: English | 1067 |
Research Areas: Computer science, mathematics, and engineering | 567 |
Content Screening | 64 |
Articles | AC | TC |
---|---|---|
A new distance metric for unsupervised learning of categorical data [58] | 8.22 | 74 |
Initialization of K-modes clustering using outlier detection techniques [37] | 7.33 | 66 |
Space structure and clustering of categorical data [59] | 6 | 54 |
Hierarchical clustering algorithm for categorical data using a probabilistic rough set model [38] | 3.64 | 40 |
Non-dominated sorting genetic algorithm using fuzzy membership chromosome for categorical data clustering [60] | 3.5 | 35 |
Soft subspace clustering of categorical data with probabilistic distance [61] | 3.67 | 33 |
Rough set approach for clustering categorical data using information-theoretic dependency measure [62] | 3 | 30 |
Many-objective fuzzy centroids clustering algorithm for categorical data [63] | 3.71 | 26 |
Comparison of similarity measures for categorical data in hierarchical clustering [48] | 4.17 | 25 |
A fast and effective partitional clustering algorithm for large categorical datasets using a K-means based approach [64] | 3.43 | 24 |
Publication Titles | TP |
---|---|
Neurocomputing | 7 (10.938%) |
IEEE Transactions on Neural Networks and Learning Systems | 5 (7.813%) |
Applied Soft Computing | 3 (4.688%) |
Pattern Recognition | 3 (4.688%) |
IEEE Access | 3 (4.688%) |
Applied Intelligence | 2 (3.125%) |
Engineering Applications of Artificial Intelligence | 2 (3.125%) |
Expert Systems with Applications | 2 (3.125%) |
Information Sciences | 2 (3.125%) |
International Journal of Machine Learning and Cybernetics | 2 (3.125%) |
Knowledge-Based Systems | 2 (3.125%) |
Mathematics | 2 (3.125%) |
Others (29 publication titles) | 29 (45.313%) |
Publishers | TP |
---|---|
Elsevier | 27 (42.188%) |
IEEE | 13 (20.313%) |
Springer Nature | 11 (17.188%) |
MDPI | 3 (4.688%) |
IOS Press | 2 (3.125%) |
Others (8 publishers) | 8 (12.500%) |
Authors | TP |
---|---|
J. Y. Liang | 8 (12.500%) |
R. J. Kuo | 6 (9.375%) |
S. B. Salem | 4 (6.250%) |
Z. Chtourou | 4 (6.250%) |
S. Naouali | 4 (6.250%) |
Y.M. Cheung | 4 (6.250%) |
T. P. Q. Nguyen | 4 (6.250%) |
L. Bai | 3 (4.688%) |
F. Y. Cao | 3 (4.688%) |
J. Z. X. Huang | 3 (4.688%) |
L. F. Chen | 3 (4.688%) |
Y. Q. Zhang | 3 (4.688%) |
Others (153 authors) | 18 (29.685%) |
Cluster | #Keywords | Summary |
---|---|---|
1 | 73 | A strong connection exists between the K-modes algorithm and rough set theory. Furthermore, rough sets are linked to outlier detection, which, in turn, is associated with the initial cluster centers. This linkage suggests that rough sets are utilized to address outliers in the K-modes algorithm arising from the random initialization of cluster centroids. |
2 | 43 | This cluster covers the fuzzy clustering algorithm, including variations such as the fuzzy K-modes (FKM) algorithm and rough fuzzy clustering. Additionally, the cluster highlights a growing trend in optimizing fuzzy clustering using metaheuristic-based algorithms. Consequently, future studies should delve deeper into investigating the optimization of fuzzy clustering, leveraging not only genetic algorithms and particle swarm optimization but also other metaheuristics to enhance algorithm performance. |
3 | 42 | This cluster covers hierarchical clustering and its relationship with rough set theory. Additionally, it includes keywords related to cluster analysis, such as graph embedding and cluster validity functions. |
4 | 42 | The keywords in this cluster are associated with dissimilarity methods and attribute weighting, such as kernel density estimation and probabilistic frameworks. |
Articles | Cluster | Links | TC | Articles | Cluster | Links | TC |
---|---|---|---|---|---|---|---|
Jia et al. (2016) [58] | 3 | 15 | 74 | Yuan et al. (2020) [78] | 1 | 4 | 5 |
Oskouei et al. (2021) [79] | 1 | 11 | 8 | Saha & Das (2015) [80] | 1 | 4 | 22 |
Zhu & Xu (2018) [63] | 1 | 10 | 26 | Heloulou et al. (2017) [81] | 2 | 4 | 15 |
Salem et al. (2021) [71] | 1 | 9 | 3 | Xiao et al. (2019) [56] | 4 | 4 | 12 |
Kuo & Nguyen (2019) [55] | 1 | 9 | 15 | Salem et al. (2018) [64] | 1 | 3 | 24 |
Jiang et al. (2016) [37] | 1 | 9 | 66 | Chen et al. (2021) [76] | 2 | 3 | 3 |
Dorman & Maitra (2022) [82] | 2 | 9 | 4 | Bai & Liang (2015) [65] | 2 | 3 | 9 |
Qian et al. (2016) [59] | 4 | 9 | 54 | Qin et al. (2014) [41] | 2 | 3 | 19 |
Chen et al. (2016) [61] | 4 | 8 | 33 | Rios et al. (2021) [83] | 3 | 3 | 3 |
Yanto et al. (2016) [42] | 1 | 7 | 15 | Cao et al. (2018) [40] | 4 | 3 | 16 |
Bai & Liang (2014) [39] | 1 | 7 | 22 | Uddin et al. (2021) [84] | 1 | 2 | 1 |
Nguyen & Kuo (2019a) [54] | 4 | 7 | 19 | Peng & Liu (2019) [51] | 1 | 2 | 2 |
Naouali et al. (2020) [72] | 1 | 6 | 9 | Suri et al. (2016) [85] | 1 | 2 | 11 |
Yang et al. (2015) [60] | 1 | 6 | 35 | Wei et al. (2019) [47] | 2 | 2 | 10 |
Li et al. (2014) [38] | 1 | 6 | 40 | Kuo et al. (2021) [69] | 3 | 2 | 15 |
Kar et al. (2023) [86] | 3 | 6 | 2 | Ye et al. (2019) [52] | 3 | 2 | 0 |
Bai & Liang (2022) [66] | 3 | 6 | 3 | Chen & Yin (2018) [87] | 4 | 2 | 6 |
Jian et al. (2018) [88] | 3 | 6 | 20 | Cao et al. (2017b) [67] | 4 | 2 | 10 |
Zhang et al. (2023) [89] | 4 | 6 | 6 | Narasimhan et al. (2018) [90] | 1 | 1 | 3 |
Zheng et al. (2020) [91] | 4 | 6 | 7 | Faouzi et al. (2022) [92] | 2 | 1 | 0 |
Jiang et al. (2023) [93] | 1 | 5 | 1 | Nguyen & Kuo (2019) [53] | 2 | 1 | 11 |
Salem et al. (2021a) [70] | 1 | 5 | 0 | Amiri et al. (2018) [94] | 2 | 1 | 10 |
Park & Choi (2015) [62] | 1 | 5 | 30 | Kim (2017) [95] | 2 | 1 | 4 |
Zhang & Cheung (2022a) [73] | 3 | 5 | 2 | Sun et al. (2017) [96] | 2 | 1 | 2 |
Zhang & Cheung (2022b) [74] | 3 | 5 | 11 | Chen (2015) [77] | 2 | 1 | 2 |
Zhang et al. (2020) [75] | 3 | 5 | 15 | Sulc & Rezankova (2019) [48] | 3 | 1 | 25 |
Mau et al. (2022) [97] | 1 | 4 | 2 | Gao & Wu (2019) [57] | 4 | 1 | 2 |
Dinh & Huynh (2020) [98] | 1 | 4 | 18 |
Authors (Year) | Algorithms | Methods | Comparisons |
---|---|---|---|
Li et al. (2014) | MDP, TMDP, MTMDP [38] | Divisive, based on probabilistic rough set theory approach | MMR |
Qin et al. (2014) | MGR [41] | Divisive, based on an information-theoretic approach | MMR, k-ANMI [100], G-ANMI [101], COOLCAT [102] |
Wei et al. (2019) | KOF, MNIG [47] | Divisive, based on an information-theoretic approach | MMR, MGR, MDA [103], TR [104] |
Sun et al. (2017) | HPCCD [96] | Agglomerative, based on an information-theoretic approach | MGR [41], COOLCAT, LIMBO [105], K-modes
Altameem et al. (2023) | P-ROCK [106] | Agglomerative, linked-based | ROCK [107] |
Authors (Year) | Algorithms | Methods | Comparisons |
---|---|---|---|
Hariz & Elouedi (2014) | BCDP: IKBKM and DKBKM [111] | dynamic clustering based on the K-modes algorithm that uses the Transferable Belief Model (TBM) concepts [112] | BKM [113] |
Cao et al. (2017) | k-mw-modes [68] | clustering categorical matrix-object data based on the K-modes algorithm | K-modes, Wk-modes [114], Cao [115], FCCM [116] |
Heloulou et al. (2017) | MOCSG [81] | the multi-objective clustering based-sequential game theoretic that extends the ClusSMOG algorithm [117] | K-modes, PAM [118], and single linkage algorithm [16] |
Salem et al. (2018) | MFk-M [64] | frequency-based method to update the modes and the Manhattan distance metric to compute the distance | K-modes, K-means |
Cao et al. (2018) | SV-k-modes [40] | heuristic method to update the centroids and the Jaccard coefficient to measure the distance between two set-valued objects | a multi-instance clustering algorithm (BAMIC) [119], K-modes, and TrK-means [120] |
Xiao et al. (2019) | IPO-ILP-VNS [56] | integer linear programming (ILP) approach under variable neighborhood search (VNS) framework | K-modes, Khan [121], k-MODET [37], Wu [122] |
Dinh & Huynh (2020) | k-PbC [98] | the MFI-based approach integrated with partitional clustering with a kernel-based method and information-theoretic-based dissimilarity measure | K-means++ [123], K-means|| [124], Cao [125], Khan, k-MODET [37], K-modes, K-representatives [126], M-K-Centers (Mod-2) [127] and New (Mod-3) [128] and CD-Clustering [129] |
Chen et al. (2021) | SKSCC [76] | subspace clustering algorithm based on kernel density estimation (KDE), self-expressiveness-based methods, and probability-based similarity measurement | K-modes, WKM [130], MWKM [131] |
Bai & Liang (2022) | CDC_DR, CDC_DR + SE [66] | graph-based representation method | graph-embedding methods: Non-Embedding (NE), Spectral Embedding (SE) [132], Non-negative Matrix Factorization (NMF) [133], and Autoencoder (AE) [134] using joint and mean operation; categorical data encodings: K-modes, K-means with ordinal encoding, one-hot encoding [135], link-graph encoding [136], and coupled data embedding (CDE) [137] |
Dorman & Maitra (2022) | OTQT [82] | based on the Hartigan algorithm for K-means | K-modes
Faouzi et al. (2022) | α-Condorcet [92] | based on Condorcet clustering [138] | K-modes |
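Nearly every baseline in the table above extends the classic K-modes iteration. For readers unfamiliar with it, the following is a minimal sketch of that iteration (simple-matching dissimilarity and per-attribute mode update); the function and variable names are our own and do not come from any cited implementation.

```python
import random
from collections import Counter

def matching_dissimilarity(x, y):
    """Simple-matching distance: number of attributes on which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def k_modes(data, k, max_iter=100, seed=0):
    """Minimal K-modes: assign each object to its nearest mode, then refit
    each mode as the per-attribute most frequent category of its cluster."""
    rng = random.Random(seed)
    modes = rng.sample(data, k)          # random initialization of modes
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for obj in data:
            j = min(range(k), key=lambda i: matching_dissimilarity(obj, modes[i]))
            clusters[j].append(obj)
        new_modes = []
        for j, members in enumerate(clusters):
            if not members:              # keep the old mode for an empty cluster
                new_modes.append(modes[j])
                continue
            new_modes.append(tuple(
                Counter(col).most_common(1)[0][0] for col in zip(*members)))
        if new_modes == modes:           # converged: modes no longer change
            break
        modes = new_modes
    return modes, clusters
```

The random initialization in the first step is precisely what the outlier-detection-based initialization methods in the table (e.g., k-MODET) aim to improve.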
Authors (Year) | Algorithms | Methods | Comparisons |
---|---|---|---|
Yang et al. (2015) | NSGA-FMC [60] | fuzzy genetic algorithm and multi-objective optimization | GA-FKM [146], MOGA [147] |
Cao et al. (2017) | Fuzzy SV-k-modes [67] | FKM for clustering the set-valued attributes | FKM |
Maciel et al. (2017) | MFKM [148] | FKM with multivariate approach | FKM and LFkM [149] |
Kuo et al. (2018) | PSOFKM, GAFKM, ABCFKM [2] | FKM with PSO, GA, and ABC algorithm | FKM |
Narasimhan et al. (2018) | EGA-FMC [90] | GA-FKM with multi-objective rank-based selection | MOGA, GA-FKM, NSGA-FMC |
Zhu & Xu (2018) | MaOFCentroids [63] | many-objective clustering with fuzzy centroid algorithm | FKM, Fuzzy Centroids [150], SBC [59], NSGA-FMC |
Nguyen & Kuo (2019) | PM-FGCA [54] | MOGA with fuzzy membership chromosomes | K-Modes, FKM, GA-FKM, NSGA-FMC |
Nguyen & Kuo (2019) | AFC-NSPSO [53] | automatic fuzzy clustering using non-dominated PSO | AT-DC [151], DHCC [152], PROCAD [153], MOCSG [81] |
Kuo & Nguyen (2019) | IWFKM, GIWFKM [55] | intuitionistic fuzzy set and genetic algorithm | FKM, WFKM [80], GA-FKM, SBC, MaOFCentroids |
Kuo et al. (2021) | PFKM, GA-PFKM, PSO-PFKM, SCA-PFKM [69] | possibilistic fuzzy c-means for the categorical data and metaheuristic methods (GA, PSO, and SCA) | FKM |
Mau et al. (2022) | LSHFK-centers [97] | locality-sensitive hashing (LSH)-based approach | FCM, FEK-means [154], SBC, K-medoids [155], K-modes, K-representative [126], K-centers [150], FK-centers [156], FKM, SGA-Dist, SGA-Sep, SGA-SepDist [146], MOGA, NSGA-FMC, MaOFCentroids, LSHK-reps [157]
Jiang et al. (2023) | KIWFKM, KIWFKM-DCP [93] | intuitionistic fuzzy set and coupled DCP system | MEC [158], FSC [159], FKM, WFKM, IWFKM [55], GIWFKM [55] |
Authors (Year) | Algorithms | Methods | Comparisons |
---|---|---|---|
Ammar et al. (2015) | semantically segmented clustering based on possibilistic and rough set theories [167] | K-modes algorithm based on possibility and rough set theories (KM-PR) with semantic interpretations as a discretization method | n/a |
Park & Choi (2015) | ITDR [62] | RST integrated with possibility based on information-theoretic attribute dependencies to handle uncertainty in values of attributes and uncertain clusters | K-Means, FKM, Fuzzy Centroids [150], SDR [168], SSDR [169], MMR [99] |
Suri et al. (2016) | RKModes [85] | K-modes algorithm based on RST for outlier detection | K-Modes, MMR [99], MTMDP [38] |
Yanto et al. (2016) | MFk-PIND [42] | fuzzy k-Partition based on indiscernibility relation | Fuzzy Centroids [150] and Fuzzy k-Partition [170] |
Xu et al. (2019) | FRC [49] | K-modes algorithm based on RST with the information granularity and dimension reduction method | Cao [115], WKModes [114], K-modes |
Saha et al. (2019) | SARFKMd, GARFKMd, IRFKMd-RF [50] | the rough fuzzy K-modes (RFKMd) with random forest and the metaheuristic methods (simulated annealing, GA) | ccdByEnsemble [171], G-ANMI [101], MMR [99], Tabu Search based FKM [172], AL [173], FKM, RFKMd [174], Rough K-medoids [175], K-medoids [118], K-modes |
Naouali et al. (2020) | DRK-M [72] | RST uses the density to update the modes | K-modes, original weighted K-modes [176], original Ng’s K-modes [177], improved weighted K-modes [39], improved Huang’s K-modes [39], improved Ng’s K-modes [39] |
Salem et al. (2021) | DRK-M [70] | RST uses the density to update the modes | K-modes, Ng’s K-modes [143], Cao [115] |
Salem et al. (2021) | DRK-M [71] | RST uses the density to update the modes | K-modes, Ng’s K-modes [143], Cao [115], the improved Huang’s K-modes, the Weighted K-modes [39], improved Ng’s K-modes, Bai [178], Khan [121], FKM |
Uddin et al. (2021) | MVA [84] | the concept of a number of automated clusters (NoACs) with a rough value set | MDA [103], MSA [179], ITDR [62] |
Authors (Year) | Algorithms | Measurement-Based | Comparisons |
---|---|---|---|
Lee & Lee (2014) | CATCH [181] | value difference (VD) and value distribution-oriented dimensional weight (VOW) to cluster the high-dimensional multi-valued data | Jaccard coefficient, which is embedded with the K-modes algorithm |
Chen et al. (2015) | Subspace clustering of categories (SCC) [61] | probabilistic distance function based on kernel density estimation | non-mode clustering (KR) [126], WKM [130], mode-frequency-based (MWKM) [131], complement-entropy-based (CWKM) [114] |
Chen (2016) | KPC [77] | a probability-based learning framework with a kernel smoothing method to optimize the attribute weights | K-modes, DWKM [130], MWKM [131], CWKM, and EBC [182] |
Qian et al. (2016) | SBC [59] | space structure-based representation scheme | K-modes, Chan [130], Mkm-nof, Mkm-ndm [183] |
Jia et al. (2016) | Frequency probability-based distance measure (FPDM) [58] | frequency probability and co-occurrence probability | Hamming distance (HD) [184], Ahmad’s distance [185] |
Jiang et al. (2016) | k-MODET (Ini_Distance, Ini_Entropy) [37] | traditional distance-based outlier detection technique [186], partition entropy-based outlier detection technique | Khan [121], Cao [125], Wu [122], and the random initialization method embedded with the K-modes algorithm. |
Amiri et al. (2018) | EN-KM, EN-MBC, EN-SL, EN-AL, EN-CL [94] | ensembled dissimilarity based on the hierarchical method and Hamming distance | K-modes, DBSCAN [187], ROCK [107], MBC [188], ensembled version of K-modes and MBC using dissimilarity matrix D (EN-KM and EN-MBC), agglomerative (SL, AL, CL, EN-SL, EN-AL, EN-CL) |
Jian et al. (2018) | A coupled metric similarity (CMS) measure [88] | the intra-attribute similarity (frequency-based) integrated with the inter-attribute measure (correlation-based) | ALGO [189], coupled object similarity (COS) [190,191], distance matrix (DM) [58], occurrence frequency-based measure (OF) [192], and HD [193] embedded with spectral clustering [194] and K-modes algorithms. |
Chen & Yin (2018) | CWC [87] | a non-center-based algorithm based on weighted similarity. | OF [195], Goodall3 [192,196], and MSFM measures embedded with K-modes [197], KPC [77], entropy-weighting K-modes (CEWKM) [114], and the MWKM algorithm [131] |
Sulc & Rezankova (2019) | VE, VM [48] | relative frequency-based measures, where VE uses entropy and VM uses the Gini coefficient | ES [198], G1, G2, G3, G4, LIN1 [192], MZ [199], OF, IOF [195], LIN [200], and simple matching [201] embedded with three linkage methods of hierarchical cluster analysis
Ye et al. (2019) | Heterogeneous Graph-based Similarity measure (HGS) [52] | a heterogeneous weighted graph combining the content-based and structural-based similarity measures | HD [143], OF [18], Lin [200], ALGO [189], and CMS [88] embedded with spectral clustering (SC) and K-modes algorithm |
Zhang et al. (2020) | EBDM [75] | entropy-based distance metric with a weighting scheme for the mixed-categorical attributes | Hamming Distance, Ahmad’s distance [185], ABDM [189], CBDM [202,203], CDDM [58] embedded with K-modes, WKM [176], entropy weighting (EW) K-means [204], WOC and EBC [182] |
Yuan et al. (2020) | mixed-type dissimilarity measure [78] | the idea of mining ordinal information and the rough set theory for the mixed-categorical attributes | Huang, Cao [125], SBC [59], and CMS [88] embedded with K-modes |
Zheng et al. (2020) | SBC-C [91] | space structure-based representation scheme | SBC, SC [132], K-modes, One-Hot Encoding |
Rios et al. (2021) | learning-based dissimilarity [83] | a classification ensemble to compute a confusion matrix for each attribute | Eskin [198], Lin, OF, IOF, Goodall, Gambaryan, Euclidean, and Manhattan embedded with K-means++ |
Zhang & Cheung (2022) | UDM [73] | entropy-based distance metric with attribute weighting | Distance measures: HD [184], Goodall [196], Lin [200], context-based distance metric (CBDM) [203], FPDM [58], and EBDM [75] embedded with K-modes, entropy-based categorical data clustering (ECC) [182], the representative attribute weighting K-modes (WKM) [130], mixed attribute WKM (MWKM) [131], SCC [61], and WOC [205] |
Zhang & Cheung (2022) | HD-NDW [74] | an automatic distance weighting mechanism based on the intrinsic connection of ordinal and nominal attributes | HD, Lin, CBDM, FPDM, EBDM, CMS embedded with K-modes, ECC, WKM, MWKM, WOC, SBC, Coupled Data Embedding-based clustering (CDE) [206], UNsupervised heTerogeneous couplIng lEarning-based clustering (UNTIE) [207], Distance Learning-based Clustering (DLC) [208] |
Kar et al. (2023) | an entropy-based dissimilarity measure [86] | Boltzmann’s entropy [209] | Distance Measure with Entropy (DME) [210], HD [211], Weighted Similarity Measure (WSM) [109], FPDM, Gambaryan [212], Burnaby [213], embedded with K-modes, weighted K-modes [176] and Density Peak Clustering for Mixed Data (DPC-MD) algorithms [210] |
Zhang et al. (2023) | MAP, BFKMG [89] | a Bayesian dissimilarity measure, with Kullback–Leibler (KL) divergence-based regularization to find patterns in datasets | Cao [125], FKMFC [150], KL-FCM-GM [214], MWK-DC [215], SBC-C, CFE [216], UDM |
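Many of the measures above share one core idea: agreement on a rare category is more informative than agreement on a frequent one. As a rough illustration of that idea only (not any surveyed author’s exact formulation, e.g., FPDM or CMS; all function names here are ours), the stdlib sketch below contrasts plain Hamming dissimilarity with a frequency-weighted variant:

```python
from collections import Counter

def hamming(x, y):
    """Simple matching (Hamming) dissimilarity: fraction of mismatched attributes."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y)) / len(x)

def frequency_dissimilarity(x, y, data):
    """Toy frequency-weighted dissimilarity: a mismatch costs 1, while a match
    costs the relative frequency of the shared category, so matches on rare
    categories yield smaller (more similar) distances."""
    n, m = len(data), len(x)
    total = 0.0
    for j in range(m):
        if x[j] != y[j]:
            total += 1.0
        else:
            freq = Counter(row[j] for row in data)[x[j]] / n
            total += freq  # rare shared category -> cost near 0
    return total / m

data = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
print(hamming(("a", "x"), ("a", "y")))                     # 0.5
print(frequency_dissimilarity(("a", "x"), ("a", "x"), data))
```

Under this weighting, two identical objects are no longer automatically at distance zero unless their categories are unique, which is exactly the kind of design decision the surveyed measures resolve in different ways.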
Authors (Year) | Algorithms | Comparisons |
---|---|---|
Saha & Das (2015) | WFK-modes [80] | n/a |
Kim (2017) | attribute weighting method based on within-cluster and between-cluster impurity measures [95] | K-modes, FKM |
Peng & Liu (2019) | a weighting method combining distance and density measures, based on rough set and information theory, to select the cluster centers [51] | Random method, Khan [121], Cao [125], Wu [122] |
Oskouei et al. (2021) | FKMAWCW [79] | Initialization sensitivity reduction methods: Khan [121], Cao [125], Wu [122], k-MODET [37], Peng [51], Mod-2 [127], Mod-3 [128], and Attribute-weighted method: IWFKM [55], EWKM [114], Saha [80], SBC [59], Chan [130], Jia [205] |
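The attribute-weighting schemes compared above typically derive weights from how pure each attribute is within the clusters. A minimal stdlib sketch of that idea, loosely in the spirit of impurity-based weighting such as Kim (2017) but not any specific algorithm’s formula (the scoring function here is our own illustrative choice):

```python
import math
from collections import Counter

def attribute_weights(clusters):
    """Toy attribute weighting: score each attribute inversely to its average
    within-cluster entropy (purer attributes get larger scores), then
    normalize the scores to sum to 1."""
    m = len(clusters[0][0])  # number of attributes
    scores = []
    for j in range(m):
        ent = 0.0
        for cluster in clusters:
            counts = Counter(row[j] for row in cluster)
            n = len(cluster)
            ent += -sum((c / n) * math.log(c / n) for c in counts.values())
        scores.append(1.0 / (1.0 + ent / len(clusters)))  # low entropy -> high score
    total = sum(scores)
    return [s / total for s in scores]

# Attribute 0 is pure within each cluster; attribute 1 is mixed.
clusters = [[("a", "x"), ("a", "y")], [("b", "y"), ("b", "x")]]
print(attribute_weights(clusters))  # first weight exceeds the second
```

In the surveyed algorithms such weights are re-estimated inside the clustering loop (e.g., per iteration of fuzzy K-modes), rather than computed once as above.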
Authors (Year) | Function | Comparisons |
---|---|---|
Bai & Liang (2014) | BCIk-M [39] | Ng’s K-modes [143,220], K-modes [15,221], WKM [176] |
Bai & Liang (2015) | generalized validity function [65] | K-modes, CU [222], IE [102] |
Gao & Wu (2019) | IDC, CUBOS [57] | CCI [223], CDCS [224], IE, CU, NCC [225] |
Datasets | #rec | #attr | #clus | n | Datasets | #rec | #attr | #clus | n |
---|---|---|---|---|---|---|---|---|---|
Adult + Stretch | 48,842 * | 14 * | 2 | 3 | HIV-1 protease cleavage | 6590 | 9 * | 2 | 1 |
Arrhythmia | 452 | 279 | 16 | 1 | Horse Colic | 368 * | 24 * | 2 | 1 |
Audiology | 226 | 69 | 24 | 2 | Letter Recognition (E, F) | 1543 | 16 | 2 | 4 |
Australian Credit Approval | 690 | 14 | 2 | 1 | Lung Cancer | 286 | 9 | 2 | 17 |
Balance | 625 | 4 | 3 | 12 | Lymphography | 148 * | 18 | 4 * | 17 |
Balloon | 20 | 4 | 2 | 6 | Mammographic Masses | 961 * | 4 | 2 | 3 |
Breast Cancer Wisconsin (Original) | 699 * | 9 | 2 | 38 | Microsoft Web | 37,711 | 294 | - | 3 |
Car Evaluation | 1728 | 6 * | 4 * | 1 | Monk | 432 | 6 | 2 | 5 |
Cervical Cancer | 858 | 32 * | 4 | 1 | Mushroom | 8124 | 22 * | 2 | 38 |
Chess | 3196 | 36 | 2 | 14 | Nursery | 12,960 | 8 * | 3 * | 15 |
Chess (Big) | 28,056 | 6 | 18 | 1 | Page Blocks | 5473 | 10 | 5 | 1 |
Congressional Votes | 435 * | 16 | 2 | 37 | Primary Tumor | 339 * | 17 * | 21 | 7 |
Connect-4 | 67,557 | 42 | 3 | 4 | Post-Operative Patient | 90 * | 8 | 3 | 2 |
Contraceptive Method Choice | 1473 | 10 | 3 | 1 | Shuttle Landing Control | 15 | 6 | 2 | 2 |
Credit Approval | 690 * | 15 * | 2 | 10 | Solar Flare | 1066 | 10 * | 6 | 9 |
Dermatology | 366 | 34 * | 3 | 16 | Soybean Large | 307 * | 35 | 19 * | 6 |
DNA Splice | 3190 | 60 | 3 | 10 | Soybean Small | 47 | 35 * | 4 | 43 |
DNA Promoter | 106 | 57 | 2 | 14 | Spect Heart | 267 | 22 | 2 | 10 |
Drug Consumption | 1885 | 6 * | 7 | 1 | Sponge | 75 | 45 | 12 | 1 |
Fitting Contact Lenses | 24 | 4 | 3 | 8 | Student | 300 | 32 | 3 | 1 |
Flag | 194 | 30 | - | 1 | Thoracic | 470 | 16 | 2 | 1 |
Germany | 1000 * | 20 | 2 | 1 | Tic-Tac-Toe | 958 | 9 | 2 | 17 |
Hayes-Roth | 132 | 4 | 3 | 17 | Train | 10 | 32 | 2 | 1 |
HCC survival | 165 | 49 * | 2 | 1 | Optical Recognition of Handwritten Digits | 5620 * | 64 | 10 | 1 |
Heart Disease | 303 | 8 | 2 | 8 | Zoo | 101 | 16 * | 7 | 40 |
Hepatitis | 155 * | 19 * | 2 | 3 |
No. | Internal Validity Functions | n | No. | External Validity Functions | n |
---|---|---|---|---|---|
1. | Silhouette coefficient | 4 | 1. | Accuracy (AC) | 42 |
2. | Davies–Bouldin index (DBI) | 5 | 2. | Adjusted Rand index (ARI) | 32
3. | Category utility function (CU) | 4 | 3. | Rand index (RI) | 5
4. | Dunn | 2 | 4. | Normalized mutual information (NMI) | 21 |
5. | Calinski–Harabasz index (CH) | 2 | 5. | Purity | 12 |
6. | New Condorcet criteria (NCC) | 1 | 6. | Entropy | 8 |
7. | Compactness | 1 | 7. | Precision (PE) | 10 |
8. | Separation | 1 | 8. | Recall (RE) | 10 |
9. | Fuzzy silhouette coefficient (FSI) | 1 | 9. | F-measure | 9 |
10. | Multivariate FSI (MFSI) | 1 | 10. | Jaccard coefficient | 2 |
11. | Sum of square error (SSE) | 1 | 11. | Micro-p | 1 |
12. | Pseudo F index based on the mutability (PSFM) | 1 | 12. | Fowlkes–Mallows index (FM) | 1
13. | Pseudo F index based on the entropy (PSFE) | 1 | 13. | Roughness measure | 2 |
14. | Partition entropy coefficient (PE) | 1 | |||
15. | Partition coefficient (PC) | 1 | |||
16. | Cluster cardinality index (CCI) | 1 | |||
17. | Categorical data clustering with subjective factors (CDCS) | 1 | |||
18. | Information entropy (IE) | 1 | |||
19. | Czekanowski–Dice index (CDI) | 1 | |||
20. | Kulczynski index | 1 |
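For the external indices that dominate the result tables below, accuracy (AC) and normalized mutual information (NMI), a small stdlib sketch shows how they are typically computed from true and predicted labels. The accuracy function here brute-forces the cluster-to-class relabeling (assuming the number of predicted clusters does not exceed the number of classes), which is adequate for the small cluster counts in these benchmarks; the Hungarian algorithm is the usual choice for larger numbers of clusters:

```python
import math
from collections import Counter
from itertools import permutations

def clustering_accuracy(true, pred):
    """AC: best fraction of agreement over all one-to-one relabelings of the
    predicted clusters (brute force over label permutations)."""
    t_labs, p_labs = sorted(set(true)), sorted(set(pred))
    best = 0
    for perm in permutations(t_labs, len(p_labs)):
        mapping = dict(zip(p_labs, perm))
        best = max(best, sum(mapping[p] == t for t, p in zip(true, pred)))
    return best / len(true)

def nmi(true, pred):
    """NMI with arithmetic-mean normalization of the two label entropies."""
    n = len(true)
    joint = Counter(zip(true, pred))
    pt, pp = Counter(true), Counter(pred)
    mi = sum((c / n) * math.log((c / n) * n * n / (pt[t] * pp[p]))
             for (t, p), c in joint.items())
    ht = -sum((c / n) * math.log(c / n) for c in pt.values())
    hp = -sum((c / n) * math.log(c / n) for c in pp.values())
    return mi / ((ht + hp) / 2) if ht + hp else 1.0

print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: labels permuted
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
```

Note that published results can still differ slightly across articles, since NMI admits several normalizations (arithmetic mean, geometric mean, max) and AC depends on how ties in the matching are broken.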
Number of articles (#Articles) reporting each validity index per benchmark dataset:

Validity Index | Breast Cancer Wisconsin (Original) | Congressional Votes | Mushroom | Soybean Small | Zoo
---|---|---|---|---|---
AC | 19 | 20 | 20 | 22 | 21 |
ARI | 16 | 17 | 15 | 21 | 21 |
NMI | 12 | 10 | 8 | 12 | 11 |
Algorithm * | Breast Cancer Wisconsin (Original) | Congressional Votes | Mushroom | Soybean Small | Zoo |
---|---|---|---|---|---|
Hierarchical Clustering | |||||
MNIG [47] | 92.7 | 87.4 | 84.8 | 97.9 | 93.1 |
MGR [41] | 88.4 | 82.8 | 67.7 | - | 93.1 |
HPCCD [96] | - | 92.18 | 86.41 | 100 | 96.04 |
P-ROCK [106] | - | 79.77 | - | - | - |
Partition Clustering: Hard Clustering | |||||
MFk-M [64] | - | - | 45 | - | - |
SKSCC [76] | 96.59 | 87.34 | 81.94 | 90.85 | 80.43 |
DKBKM-Max [111] | 79.1 | 82.19 | 81.7 | - | 74.5 |
MOCSG [81] | 89.1 | - | - | 100 | 83.2 |
k-PbC [98] | 96.14 | 88.05 | 88.61 | 100 | 89.11 |
Partition Clustering: Fuzzy Clustering | |||||
GAFKM [2] | - | 86.6 | - | - | - |
GIWFKM [55] | 69.2 | 91 | 93.2 | 98.5 | 92.7 |
MaOFcentroids [63] | - | 88.1 | 88.5 | 100 | 91 |
SCA-PFKM [69] | 94.07 | 86.44 | 88.95 | 100 | - |
KIWFKM-DCP [93] | 70.28 | - | 88.31 | 98.72 | 90.1 |
Partition Clustering: Rough-set-based Clustering | |||||
IRFKMd-RF [50] | - | 88.79 | 91.50 | 99.85 | 98.38 |
DRk-M [70] | 93.29 | - | 85.91 | - | 88.56 |
DRk-M [71] | 93.29 | - | 85.91 | 100 | - |
MFk-PIND [42] | 97.17 | - | - | 100 | 89.96 |
MVA [84] | - | - | - | 72 | 82 |
Distance Function | |||||
Ini_Entropy [37] | 93.28 | 86.9 | 88.76 | 100 | 90.1 |
SBC [59] | 92.93 | 87.83 | - | 96.66 | - |
SCC [61] | 97 | - | - | - | - |
HD-NDW [74] | 65.1 | 87.6 | - | 84.9 | 76 |
EBDM [75] | - | 87.1 | - | - | - |
EN-CL [94] | - | - | 100 | 100 | 99 |
mixed-type dissimilarity measure [78] | - | - | - | 95.75 | - |
CDMs [86] | 53.29 | 86.64 | 85.05 | - | 75.91 |
Weighting Method | |||||
weighted attributes [51] | - | 86.71 | 91.85 | 100 | 89.33 |
FKMAWCW [79] | - | 89.22 | 81.82 | 100 | 82.18 |
Cluster Validity | |||||
CUBOS [57] | 78.7 | 87.9 | - | 100 | - |
Improved Ng’s k-modes [39] | 87.7 | - | 83.66 | 99.79 | 89 |
Total | 19 | 20 | 20 | 22 | 21 |
Algorithm * | Breast Cancer Wisconsin (Original) | Congressional Votes | Mushroom | Soybean Small | Zoo |
---|---|---|---|---|---|
Hierarchical Clustering | |||||
MNIG [47] | 0.725 | 0.556 | 0.475 | 0.937 | 0.945 |
MTMDP [38] | 0.585 | - | 0.274 | 1 | 0.96
MGR [41] | 0.79 | 0.8 | 0.65 | - | 0.96 |
HPCCD [96] | - | 0.7109 | 0.5302 | 1 | 0.963 |
Partition Clustering: Hard Clustering | |||||
CDC_DR + SE [66] | 0.89 | - | 0.61 | 0.74 | 0.64 |
MOCSG [81] | - | - | - | 1 | 0.851 |
OTQT [82] | 0.67 | - | 0.61 | 0.95 | 0.66 |
Partition Clustering: Fuzzy Clustering | |||||
AFC-NSPSO [53] | 0.713 | 0.617 | 0.634 | 0.958 | 0.898 |
PM-FGCA [54] | 0.467 | 0.624 | 0.468 | 0.938 | 0.832 |
GIWFKM [55] | 0.388 | 0.649 | 0.703 | 0.967 | 0.93 |
NSGA-FMC [60] | - | 0.508 | - | 0.919 | 0.8 |
MaOFcentroids [63] | - | 0.578 | 0.593 | 1 | 0.894 |
EGA-FMC [90] | - | 0.79 | - | 1 | 0.92 |
KIWFKM-DCP [93] | 0.5022 | - | 0.7967 | 0.9864 | 0.9457
Distance Function | |||||
SBC [59] | 0.7331 | 0.5715 | - | 0.94 | - |
HD-NDW [74] | 0.09 | 0.564 | - | 0.803 | 0.721 |
EBDM [75] | - | 0.548 | - | - | - |
EN-CL [94] | - | - | 1 | 1 | 0.99 |
CDMs [86] | 0.0019 | 0.5349 | 0.4847 | - | 0.7195
BFKMG [89] | 0.9138 | 0.6412 | 0.4958 | 1 | 0.9087 |
Weighting Method | |||||
FWFKM [95] | 0.9111 | - | - | 0.9787 | 0.877 |
FKMAWCW [79] | - | 0.6137 | 0.4053 | 1 | 0.7806 |
Cluster Validity | |||||
CUBOS [57] | 0.247 | 0.574 | - | 1 | - |
generalized validity function [65] | 0.7712 | 0.5181 | 0.6059 | 1 | 0.644 |
Total | 16 | 17 | 15 | 21 | 21
Algorithm * | Breast Cancer Wisconsin (Original) | Congressional Votes | Mushroom | Soybean Small | Zoo |
---|---|---|---|---|---|
Hierarchical Clustering | |||||
MTMDP [38] | 0.541 | - | 0.443 | 1 | 0.925 |
Partition Clustering: Hard Clustering | |||||
CDC_DR + SE [66] | 0.8269 | - | 0.5845 | 0.8627 | 0.7777 |
MFk-M [64] | - | - | 0.0962 | - | - |
Partition Clustering: Fuzzy Clustering | |||||
PM-FGCA [54] | 0.507 | 0.532 | 0.448 | 0.882 | 0.775 |
KIWFKM-DCP [93] | 0.0071 | - | 0.5632 | 0.9727 | 0.8298 |
Distance Function | |||||
HGS [52] | 0.316 | 0.3 | - | 0.709 | 0.753 |
DM3 [58] | 0.6917 | 0.4987 | 0.3182 | 0.8991 | 0.7927 |
SCC [61] | 0.78 | - | - | - | - |
HD-NDW [74] | 0.062 | 0.489 | - | 0.897 | 0.809 |
EBDM [75] | - | 0.483 | - | - | - |
CMS-enabled k-modes [88] | 0.595 | 0.447 | - | 1 | 0.842 |
BFKMG [89] | 0.8503 | 0.5625 | 0.4845 | 1 | 0.8997 |
Cluster Validity | |||||
CUBOS [57] | 0.144 | 0.51 | - | 1 | - |
generalized validity function [65] | 0.6534 | 0.4555 | 0.5465 | 1 | 0.8071 |
Total | 12 | 10 | 8 | 12 | 11
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Cendana, M.; Kuo, R.-J. Categorical Data Clustering: A Bibliometric Analysis and Taxonomy. Mach. Learn. Knowl. Extr. 2024, 6, 1009-1054. https://doi.org/10.3390/make6020047