# The k-means Algorithm: A Comprehensive Survey and Performance Evaluation

## Abstract

## 1. Introduction

- Existing variants of the k-means algorithm are outlined and discussed, together with a taxonomy, to clarify how these variants relate to one another.
- This research frames the problems, analyses their solutions and presents a concrete study of advances in the development of the k-means algorithm.
- Experiments on six benchmark datasets evaluate the effectiveness of the improved k-means algorithms.

#### Paper Roadmap

## 2. k-means Variants for Solving the Problem of Initialization

## 3. k-means Variants for Solving the Problem of Data Issue

## 4. Performance Evaluation of k-means Representative Variants

#### 4.1. Metrics Used for Experimental Analysis

**Accuracy**: This measure outlines the extent to which the predicted labels are in agreement with the true labels. The predicted labels correspond to the class labels where new instances are clustered. The accuracy is calculated by Equation (1).

$$\mathrm{Accuracy}=\frac{\text{Correctly identified class}}{\text{Total number of class}}\times 100$$

**Adjusted rand index (ARI)**: Provides a score of similarity between two different clustering results of the same dataset. For a given set $S$ consisting of $\alpha$ elements and two partitions $Y=\{Y_{1},Y_{2},\cdots ,Y_{b}\}$ and $X=\{X_{1},X_{2},\cdots ,X_{c}\}$ of $S$, the overlap between the two partitions can be summarized in the following contingency table, where $\alpha_{ij}$ is the number of elements shared by $X_{i}$ and $Y_{j}$, $r_{i}$ is the sum of row $i$ and $s_{j}$ is the sum of column $j$:

$$\begin{array}{c|cccc|c} & Y_{1} & Y_{2} & \cdots & Y_{b} & \mathrm{Sums} \\ \hline X_{1} & \alpha_{11} & \alpha_{12} & \cdots & \alpha_{1b} & r_{1} \\ X_{2} & \alpha_{21} & \alpha_{22} & \cdots & \alpha_{2b} & r_{2} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ X_{c} & \alpha_{c1} & \alpha_{c2} & \cdots & \alpha_{cb} & r_{c} \\ \hline \mathrm{Sums} & s_{1} & s_{2} & \cdots & s_{b} & \end{array}$$

$$ARI=\frac{\sum_{ij}\binom{\alpha_{ij}}{2}-\left[\sum_{i}\binom{r_{i}}{2}\sum_{j}\binom{s_{j}}{2}\right]/\binom{\alpha}{2}}{\frac{1}{2}\left[\sum_{i}\binom{r_{i}}{2}+\sum_{j}\binom{s_{j}}{2}\right]-\left[\sum_{i}\binom{r_{i}}{2}\sum_{j}\binom{s_{j}}{2}\right]/\binom{\alpha}{2}}$$
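The two metrics can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the experiments: the function names `accuracy` and `adjusted_rand_index` are hypothetical, the accuracy function assumes that cluster identifiers have already been mapped to class labels, and returning 1.0 for degenerate partitions (fewer than two elements, or a zero denominator) is an assumed convention that the text does not specify.

```python
from collections import Counter
from math import comb


def accuracy(true_labels, predicted_labels):
    """Percentage of instances whose predicted label matches the true label.

    Assumes cluster ids were already mapped to class labels (not done here).
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels) * 100


def adjusted_rand_index(x_labels, y_labels):
    """ARI of two partitions of the same set, given as per-element labels."""
    if len(x_labels) != len(y_labels):
        raise ValueError("label sequences must have the same length")
    n = len(x_labels)  # alpha in the text: number of elements in S
    if n < 2:
        return 1.0  # assumed convention: no pairs to disagree on

    cells = Counter(zip(x_labels, y_labels))  # alpha_ij (non-zero cells only)
    row_sums = Counter(x_labels)              # r_i
    col_sums = Counter(y_labels)              # s_j

    sum_cells = sum(comb(a, 2) for a in cells.values())
    sum_rows = sum(comb(r, 2) for r in row_sums.values())
    sum_cols = sum(comb(s, 2) for s in col_sums.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # chance-level term
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # assumed convention for a zero denominator
    return (sum_cells - expected) / (max_index - expected)
```

Because ARI depends only on which elements are grouped together, relabelling the clusters of one partition leaves the score unchanged, whereas the accuracy of Equation (1) changes unless the labels are aligned first.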

#### 4.2. Results

#### 4.3. Computational Complexity Analysis

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

**Figure 2.** Refined initial instances, adapted from [54].

Reference | Application | Algorithm |
---|---|---|
[10] | Face detection | Symmetry-based version of k-means (SBKM) |
[11] | Mobile storage positioning | Potential k-means |
[12] | Load pattern | Hierarchical k-means (H-Kmeans) |
[13] | Wireless sensor networks | Distributed k-means and fuzzy c-means |
[14] | Partial multiview data | Weighted k-means |
[15] | Mobile health | k-means implemented with CORDIC |
[16] | Endpoint detection | k-means for real-time detection |
[17] | Big data | Privacy-preserving k-means |
[18] | Multiview data | k-means |
[19] | Wind power forecasting | k-means with bagging neural network |
[20] | Social tags | k-means based on latent semantic analysis |
[21] | Sensing for IGBT current | k-means with neural network |
[22] | Image segmentation | Kernel k-means Nyström approximation |
[23] | Image compression | k-means cuckoo optimization |
[24] | Sound source angle estimation | Neural network based on global k-means |
[25] | Shape recognition | Fuzzy k-means clustering ensemble (FKMCE) |
[26] | Signal processing | Compressive k-means clustering (CKM) |
[27] | Text processing | Vanilla k-means |
[28] | High dimensional data processing | Fast adaptive k-means (FAKM) |
[29] | Computational complexity | Multiple kernel k-means with late fusion |
[30] | Image processing | A hybrid parallelization of k-means algorithm |
[31] | Adaptive clustering | Fuzzy k-means with S-distance |
[32] | DDoS detection | Semi-supervised k-means algorithm with hybrid feature |
[33] | Optimization | Non-alternating stochastic k-means |
[34] | Data summarization | Modified x-means |

Survey | Initialization | Data Types | Applications | Experiments |
---|---|---|---|---|
Yang [38] | ✓ | × | × | × |
Filippone [39] | ✓ | × | × | × |
Rai [40] | ✓ | × | × | × |
This paper | ✓ | ✓ | ✓ | ✓ |

Dataset | Summary |
---|---|
Cleveland Heart Disease | Widely used by machine learning researchers. The goal is to detect the presence of heart disease in a patient. |
KDD-Cup 1999 (10%) | Contains network traffic with different types of cyber attacks simulated in a military network. |
Wisconsin Diagnostic Breast Cancer | Includes features calculated from images of a fine needle aspirate of a breast mass. |
Epileptic Seizure Recognition | Commonly used for epileptic seizure prediction. |
Credit Approval | Contains a mix of attribute types, which makes it a suitable test for k-means variants designed for mixed attributes. |
Postoperative | Contains both categorical and integer values. The missing values are replaced with an average. |

Metric | k-means | Constrained k-means | x-means |
---|---|---|---|
**Wisconsin Diagnostic Breast Cancer** | | | |
Accuracy | $0.223\pm 0.310$ | $0.596\pm 0.406$ | $0.086\pm 0.042$ |
ARI | $0.690\pm 0.134$ | $0.682\pm 0.13$ | $0.683\pm 0.128$ |
**KDD Cup 1999** | | | |
Accuracy | $0.195\pm 0.077$ | $0.118\pm 0.087$ | $0.045\pm 0.034$ |
ARI | $0.004\pm 0.007$ | $0.107\pm 0.059$ | $0.085\pm 0.169$ |
**Epileptic Seizure** | | | |
Accuracy | $0.101\pm 0.060$ | $0.099\pm 0.012$ | $0.102\pm 0.053$ |
ARI | $0.005\pm 0.001$ | $0.002\pm 0.001$ | $0.002\pm 0.002$ |

Metric | k-prototype | Kernel k-means |
---|---|---|
**Credit Approval** | | |
Accuracy | $0.456\pm 0.061$ | $0.437\pm 0.283$ |
ARI | $0.004\pm 0.005$ | $0.044\pm 0.092$ |
**Cleveland Heart Disease** | | |
Accuracy | $0.462\pm 0.043$ | $0.590\pm 0.080$ |
ARI | $0.003\pm 0.001$ | $0.017\pm 0.041$ |
**Postoperative** | | |
Accuracy | $0.462\pm 0.043$ | $0.590\pm 0.080$ |
ARI | $0.003\pm 0.001$ | $0.017\pm 0.041$ |

Complexity | k-means | Constrained k-means | x-means |
---|---|---|---|
Time | $\mathcal{O}\left(n^{2}\right)$ | $\mathcal{O}\left(kn\right)$ | $\mathcal{O}\left(n\log k_{\max}\right)$ |
Space | $\mathcal{O}\left((n+k)d\right)$ | $\mathcal{O}\left((n+k)d\right)$ | $\mathcal{O}\left((n+k)d\right)$ |

