# An Improved Three-Way K-Means Algorithm by Optimizing Cluster Centers

## Abstract

## 1. Introduction

- The clustering results of k-means depend on the random selection of the initial cluster centers, so the algorithm readily converges to a local optimum.
- Traditional k-means algorithms assume that a cluster is represented by a single set with a sharp boundary. Only two types of relationship between an object and a cluster are considered, i.e., belongs to and does not belong to. A sharp boundary is convenient for analyzing clustering results, but it may not adequately reflect the uncertainty information in the dataset.

## 2. Related Work

#### 2.1. Three-Way Clustering

#### 2.2. Three-Way k-Means

- (1) If $T\ne \varphi $, then $v\in \mathrm{support}\left({C}_{i}\right)$ and $v\in \mathrm{support}\left({C}_{j}\right)$.
- (2) If $T=\varphi $, then $v\in \mathrm{support}\left({C}_{i}\right)$.
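The two rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `three_way_assign`, the threshold `eps`, and the definition $T=\{j\ne i: d(v,{x}_{j})-d(v,{x}_{i})\le \epsilon\}$ (with ${x}_{i}$ the nearest center to $v$) are assumptions, since the definition of $T$ is not restated here.

```python
import numpy as np

def three_way_assign(X, centers, eps):
    """Return a boolean (n_samples, k) matrix whose entry [p, j] is True
    when sample p is placed in support(C_j).

    Assumption (hypothetical, for illustration): for a sample v whose
    nearest center is x_i, T = {j != i : d(v, x_j) - d(v, x_i) <= eps}.
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Euclidean distance from every sample to every center.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    rows = np.arange(len(X))
    nearest = dist.argmin(axis=1)
    d_min = dist[rows, nearest]
    # Rule (1): every center j in T also receives v in its support.
    support = (dist - d_min[:, None]) <= eps
    # Rules (1) and (2): the nearest center always receives v.
    support[rows, nearest] = True
    return support

X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
centers = np.array([[0.0, 0.0], [2.0, 0.0]])
support = three_way_assign(X, centers, eps=0.5)
```

In this toy example the middle point is equidistant from both centers, so it falls in the support of both clusters, while the two outer points each belong to a single support.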

Algorithm 1: Three-way k-means [21]

## 3. The Improved Three-Way k-Means

#### 3.1. Random Probability Selection Strategy

#### 3.2. The Improved Three-Way k-Means Algorithm

Algorithm 2: The improved three-way k-means based on the ant colony algorithm.

## 4. Experimental Analysis

#### 4.1. Evaluation Indices

- Accuracy ($ACC$).$$ACC=\frac{1}{n}\sum _{i=1}^{k}{C}_{i}$$In the above formula, n is the total number of samples in the dataset, ${C}_{i}$ is the number of samples correctly assigned to cluster i, and k is the number of clusters. $ACC$ is the ratio of the number of correctly partitioned samples to the total number of samples. A greater $ACC$ value implies a better clustering result. When $ACC$ = 1, the result of the clustering algorithm is consistent with the real result.
- Davies–Bouldin index ($DBI$) [47].$$DBI=\frac{1}{c}\sum _{i=1}^{c}\underset{j\ne i}{\max}\left\{\frac{S\left({C}_{i}\right)+S\left({C}_{j}\right)}{d({x}_{i},{x}_{j})}\right\},$$$$S\left({C}_{i}\right)=\frac{{\sum}_{v\in {C}_{i}}\Vert v-{x}_{i}\Vert}{\mid {C}_{i}\mid}.$$Here c is the number of clusters and ${x}_{i}$ is the center of cluster ${C}_{i}$. Because $DBI$ is a function of the ratio of the within-cluster scatter to the between-cluster separation, a lower value means a better clustering result.
- Average silhouette index ($AS$) [47].$$AS=\frac{1}{n}\sum _{i=1}^{n}{S}_{i},$$$${S}_{i}=\frac{{b}_{i}-{a}_{i}}{\max\{{a}_{i},{b}_{i}\}},$$where ${a}_{i}$ is the average distance between object ${x}_{i}$ and all other objects in its own cluster, and ${b}_{i}$ is the minimum, over the other clusters, of the average distance between ${x}_{i}$ and the objects in that cluster. The range of the average silhouette index is $[-1,1]$; a larger value means a better clustering result.
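The three indices above can be computed directly from their definitions. The following is a minimal NumPy sketch rather than the evaluation code used in the experiments. The function names are hypothetical, cluster centers are taken to be cluster means, and for $ACC$ each cluster is matched to its most frequent true class, which is an assumption because the matching rule is not specified here.

```python
import numpy as np

def accuracy(labels_true, labels_pred):
    labels_true, labels_pred = np.asarray(labels_true), np.asarray(labels_pred)
    correct = 0
    for c in np.unique(labels_pred):
        # C_i: size of the majority true class in cluster i (assumed rule).
        _, counts = np.unique(labels_true[labels_pred == c], return_counts=True)
        correct += counts.max()
    return correct / len(labels_true)

def davies_bouldin(X, labels):
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    ids = np.unique(labels)
    centers = np.array([X[labels == c].mean(axis=0) for c in ids])
    # S(C_i): mean distance of the members of C_i to its center.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centers[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    sep = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    np.fill_diagonal(sep, np.inf)  # drops the j == i term from the max
    ratio = (scatter[:, None] + scatter[None, :]) / sep
    return ratio.max(axis=1).mean()

def average_silhouette(X, labels):
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    ids = np.unique(labels)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    s = np.zeros(len(X))
    for p in range(len(X)):
        own = labels == labels[p]
        if own.sum() == 1:
            continue  # convention: a singleton cluster contributes S_i = 0
        a = dist[p, own].sum() / (own.sum() - 1)  # excludes the object itself
        b = min(dist[p, labels == c].mean() for c in ids if c != labels[p])
        if max(a, b) > 0:
            s[p] = (b - a) / max(a, b)
    return s.mean()

X = np.array([[0.0], [1.0], [10.0], [11.0]])
pred = np.array([0, 0, 1, 1])
acc = accuracy([0, 0, 0, 1], pred)
dbi = davies_bouldin(X, pred)
sil = average_silhouette(X, pred)
```

On this toy dataset with two well-separated clusters, $DBI$ is small and $AS$ is close to 1, in line with the interpretation that lower $DBI$ and higher $AS$ indicate better clustering.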

#### 4.2. Performances of Proposed Algorithm

#### 4.3. Experimental Results Analysis

- Like the k-means algorithm, the proposed method achieves good results on convex datasets but fails to give good results on non-convex datasets.
- The time complexity and computational cost of the proposed algorithm are higher than those of k-means and three-way k-means, which means it is not suitable for big data.

## 5. Conclusions and Future Work

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

**Figure 1.** The TAO of three-way decision (adapted from [25]).

**Figure 4.** Time comparison of different algorithms on UCI datasets. (**1**) Wine. (**2**) Class. (**3**) Ecoli. (**4**) Forest. (**5**) Bank. (**6**) Iris. (**7**) Contraceptive. (**8**) Molecular Biology. (**9**) Libras. (**10**) Caffeine Consumption.

ID | Datasets | Samples | Attributes | Classes |
---|---|---|---|---|
1 | Wine | 178 | 13 | 3 |
2 | Class | 214 | 9 | 6 |
3 | Ecoli | 366 | 7 | 8 |
4 | Forest | 523 | 27 | 4 |
5 | Bank | 1372 | 4 | 2 |
6 | Iris | 150 | 4 | 3 |
7 | Contraceptive | 1473 | 9 | 3 |
8 | Molecular Biology | 106 | 52 | 2 |
9 | Libras | 360 | 90 | 15 |
10 | Caffeine Consumption | 1885 | 12 | 7 |

ID | Datasets | k-Means | FCM | Three-Way k-Means | Proposed Algorithm |
---|---|---|---|---|---|
1 | Wine | 0.6573 | 0.6692 | 0.6831 | 0.6911 |
2 | Class | 0.5981 | 0.6007 | 0.6112 | 0.6366 |
3 | Ecoli | 0.6339 | 0.6335 | 0.6652 | 0.6773 |
4 | Forest | 0.7795 | 0.7540 | 0.7807 | 0.8294 |
5 | Bank | 0.5758 | 0.5969 | 0.6123 | 0.6131 |
6 | Iris | 0.8866 | 0.8933 | 0.9040 | 0.9040 |
7 | Contraceptive | 0.2145 | 0.2179 | 0.2822 | 0.2826 |
8 | Molecular Biology | 0.6037 | 0.6226 | 0.6547 | 0.6659 |
9 | Libras | 0.8611 | 0.9162 | 0.9256 | 0.9240 |
10 | Caffeine Consumption | 0.2005 | 0.1960 | 0.2411 | 0.2422 |

ID | Datasets | k-Means | FCM | Three-Way k-Means | Proposed Algorithm |
---|---|---|---|---|---|
1 | Wine | 1.7835 | 1.6922 | 1.5542 | 1.5431 |
2 | Class | 1.0475 | 1.2233 | 0.7855 | 0.7596 |
3 | Ecoli | 1.1504 | 1.0273 | 0.9667 | 0.9425 |
4 | Forest | 1.2774 | 1.2253 | 1.200 | 1.1879 |
5 | Bank | 1.1913 | 1.1952 | 1.1332 | 1.1267 |
6 | Iris | 0.7609 | 0.7507 | 0.7236 | 0.7355 |
7 | Contraceptive | 1.2716 | 1.2539 | 1.2323 | 1.2220 |
8 | Molecular Biology | 4.9588 | 4.8236 | 4.6783 | 4.6689 |
9 | Libras | 1.9240 | 1.9126 | 1.9033 | 1.9023 |
10 | Caffeine Consumption | 1.9116 | 1.8072 | 1.6655 | 1.6048 |

ID | Datasets | k-Means | FCM | Three-Way k-Means | Proposed Algorithm |
---|---|---|---|---|---|
1 | Wine | 0.3383 | 0.2337 | 0.3347 | 0.3574 |
2 | Class | 0.5309 | 0.5543 | 0.5887 | 0.6038 |
3 | Ecoli | 0.4419 | 0.4326 | 0.4433 | 0.4524 |
4 | Forest | 0.4029 | 0.4302 | 0.4559 | 0.4669 |
5 | Bank | 0.5000 | 0.4954 | 0.5111 | 0.5280 |
6 | Iris | 0.6959 | 0.7091 | 0.7114 | 0.7188 |
7 | Contraceptive | 0.4236 | 0.4309 | 0.4597 | 0.4672 |
8 | Molecular Biology | 0.0553 | 0.0538 | 0.0558 | 0.0585 |
9 | Libras | 0.3519 | 0.3000 | 0.3533 | 0.3556 |
10 | Caffeine Consumption | 0.3150 | 0.3491 | 0.3517 | 0.3563 |

