# A Fast K-prototypes Algorithm Using Partial Distance Computation

## Abstract


## 1. Introduction

- Reduction: the computational cost is reduced without any additional data structure or memory space.
- Simplicity: it is simple to implement because it requires no complex data structure.
- Generality: it can be combined with other fast k-means algorithms, which accelerate the distance computation between each cluster center and an object over the numerical attributes.
- Speed: it is faster than the conventional k-prototypes algorithm.

## 2. Related Works

#### 2.1. K-means

- It chooses k cluster centers in some manner. The final result of the algorithm is sensitive to this initial selection, and many efficient initialization methods have been proposed to obtain better final centers.
- K-means then repeats two steps until the k centers no longer change: each object is assigned to its nearest center, and each center is updated to the mean of the object vectors assigned to it.
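The assign-and-update iteration described above can be sketched as a single Lloyd step (an illustrative sketch with our own naming, not code from the paper):

```python
def kmeans_step(points, centers):
    """One k-means iteration (illustrative): assign each point to its
    nearest center, then replace each center with the mean of the points
    assigned to it. Points and centers are tuples of floats."""
    k = len(centers)
    groups = [[] for _ in range(k)]
    for p in points:
        # assignment step: nearest center by squared Euclidean distance
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
        groups[j].append(p)
    # update step: per-dimension mean of each group (empty groups keep their center)
    return [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)]
```

Repeating this step until the returned centers equal the input centers reproduces the convergence condition described above.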

Since $\|x - c\|^2 = \sum_{j=1}^{p} (x_j - c_j)^2$, the distance between a point $x$ and a center $c$ can be calculated by summing squared differences in each dimension. When computing the distance between $x$ and another center $c'$, if the partial sum already exceeds $\|x - c\|^2$, then $\|x - c'\|^2$ cannot be the minimum distance, so the distance calculation can stop before all attributes are processed. Partial distance search is usually most effective in high dimensions.
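The early-exit idea is easy to sketch in Python (an illustrative sketch with our own function names, not the paper's code):

```python
def partial_sq_dist(x, c, best):
    """Accumulate the squared distance dimension by dimension; bail out
    as soon as the running sum exceeds `best`, since the full distance
    could then no longer be the minimum."""
    s = 0.0
    for xj, cj in zip(x, c):
        s += (xj - cj) ** 2
        if s > best:
            return None  # pruned: cannot beat the best distance so far
    return s

def nearest_center(x, centers):
    """Find the nearest center using partial distance search."""
    best, best_idx = float("inf"), -1
    for i, c in enumerate(centers):
        d = partial_sq_dist(x, c, best)
        if d is not None:
            best, best_idx = d, i
    return best_idx, best
```

For a far-away center, the inner loop typically terminates after only a few dimensions, which is where the savings in high dimension come from.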

#### 2.2. K-prototypes

## 3. K-prototypes Using Partial Distance Computation

**Definition 1.**

**Lemma 1.**

**Proof.**

#### 3.1. Proposed Algorithm

**Algorithm 1** Proposed k-prototypes algorithm

```
Input:  n: the number of objects; k: the number of clusters;
        p: the number of numeric attributes; q: the number of categorical attributes
Output: k clusters

01: INITIALIZE                                     // randomly choose k objects and assign them as centers C_j
02: while not converged do
03:   for i = 1 to n do
04:     dist_n[] = DIST-COMPUTE-NUM(X_i, C, k, p)  // distance over the numeric attributes only
05:     first_min  = DIST-COMPUTE-NUM.first_min    // smallest value among d_r(X_i, C_j)
06:     second_min = DIST-COMPUTE-NUM.second_min   // second-smallest value among d_r(X_i, C_j)
07:     if (second_min − first_min < m) then
08:       dist[] = dist_n[] + DIST-COMPUTE-CATE(X_i, C, k)
09:     else
10:       dist[] = dist_n[]                        // categorical distance computation is skipped
11:     num = argmin_z dist[z]
12:     X_i is assigned to C_num
13:     UPDATE-CENTER(C_num)
```
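The assignment step of Algorithm 1 can be sketched compactly in Python (our own naming; `m` stands for the skip threshold tested in line 07):

```python
def assign(x_num, x_cat, centers_num, centers_cat, m):
    """Assign one object to a cluster. Numeric distances to all k centers
    are computed first; the categorical part is computed only when the two
    smallest numeric distances are closer together than the threshold m,
    i.e., only when categorical mismatches could still change the winner."""
    dist_n = [sum((a - b) ** 2 for a, b in zip(x_num, c)) for c in centers_num]
    first_min, second_min = sorted(dist_n)[:2]
    if second_min - first_min < m:
        # categorical mismatches may flip the nearest center: pay for them
        dist = [dn + sum(a != b for a, b in zip(x_cat, c))
                for dn, c in zip(dist_n, centers_cat)]
    else:
        dist = dist_n  # categorical computation safely skipped
    return min(range(len(dist)), key=dist.__getitem__)
```

Note how a large numeric gap (relative to `m`) lets the function return without ever touching the categorical attributes, which is exactly the saving the algorithm targets.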

**Algorithm 2** DIST-COMPUTE-NUM()

```
Input:  X_i: an object vector; C: a set of cluster center vectors;
        k: the number of clusters; p: the number of numeric attributes
Output: dist_n[], first_min, second_min

01: for i = 1 to k do
02:   dist_n[i] = 0
03:   for j = 1 to p do
04:     dist_n[i] += (X_i[j] − C_i[j])^2
05: first_min  = dist_n[1]
06: second_min = dist_n[1]
07: for i = 2 to k do
08:   if (dist_n[i] < first_min) then
09:     second_min = first_min
10:     first_min  = dist_n[i]
11:   else if (dist_n[i] < second_min) then
12:     second_min = dist_n[i]
13: Return dist_n[], first_min, second_min
```
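A Python sketch of Algorithm 2 (0-based lists instead of the pseudocode's 1-based indexing):

```python
def dist_compute_num(x, centers):
    """Numeric-part distances to all centers, plus the two smallest
    values, tracked in a single scan."""
    dist_n = [sum((xj - cj) ** 2 for xj, cj in zip(x, c)) for c in centers]
    first_min = second_min = float("inf")
    for d in dist_n:
        if d < first_min:
            first_min, second_min = d, first_min  # old best becomes runner-up
        elif d < second_min:
            second_min = d
    return dist_n, first_min, second_min
```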

**Algorithm 3** DIST-COMPUTE-CATE()

```
Input:  X_i: an object vector; C: a set of cluster center vectors;
        k: the number of clusters; p: the number of numeric attributes;
        m: the total number of attributes (m = p + q)
Output: dist_c[]

01: for i = 1 to k do
02:   dist_c[i] = 0
03:   for j = p + 1 to m do
04:     if (X_i[j] ≠ C_i[j]) then
05:       dist_c[i] += 1
06: Return dist_c[]
```
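Algorithm 3 is the simple matching dissimilarity, which is a one-liner per center in Python (illustrative sketch):

```python
def dist_compute_cate(x_cat, centers_cat):
    """Count, for each center, the categorical attribute positions where
    the object and the center disagree (simple matching dissimilarity)."""
    return [sum(xj != cj for xj, cj in zip(x_cat, c)) for c in centers_cat]
```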

**Algorithm 4** UPDATE-CENTER()

```
Input: C_i: the i-th cluster center vector

01: foreach o ∈ C_i do
02:   for j = 1 to p do
03:     sum[j] += o[j]
04: for j = 1 to p do
05:   C_i[j] = sum[j] / |C_i|             // numeric attributes: mean
06: for j = p + 1 to m do
07:   C_i[j] = argmax COUNT(o[j])         // categorical attributes: mode
```
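The center update rule (mean over numeric attributes, mode over categorical ones) can be sketched as follows; the `(numeric_tuple, categorical_tuple)` object layout is our own assumption for the sketch:

```python
from collections import Counter

def update_center(cluster):
    """Recompute one cluster center from its member objects: per-dimension
    mean for numeric attributes, most frequent value for categorical ones.
    `cluster` is a non-empty list of (numeric_tuple, categorical_tuple)."""
    nums = [o[0] for o in cluster]
    cats = [o[1] for o in cluster]
    mean = tuple(sum(col) / len(nums) for col in zip(*nums))
    mode = tuple(Counter(col).most_common(1)[0][0] for col in zip(*cats))
    return mean, mode
```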

#### 3.2. Time Complexity

## 4. Experimental Results

#### Effect of Cardinality

## 5. Conclusions

## Author Contributions

## Conflicts of Interest

## References

- MacQueen, J.B. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1967; pp. 281–297.
- Aloise, D.; Deshpande, A.; Hansen, P.; Popat, P. NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn. **2009**, 75, 245–249.
- Dasgupta, S.; Freund, Y. Random projection trees for vector quantization. IEEE Trans. Inf. Theory **2009**, 55, 3229–3242.
- Drake, J.; Hamerly, G. Accelerated k-means with adaptive distance bounds. In Proceedings of the 5th NIPS Workshop on Optimization for Machine Learning, Lake Tahoe, NV, USA, 7–8 December 2012.
- Elkan, C. Using the triangle inequality to accelerate k-means. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003); Fawcett, T., Mishra, N., Eds.; AAAI Press: Washington, DC, USA, 2003; pp. 147–153.
- Hamerly, G. Making k-means even faster. In Proceedings of the 2010 SIAM International Conference on Data Mining, Columbus, OH, USA, 29 April–1 May 2010; pp. 130–140.
- Huang, Z. Clustering large data sets with mixed numeric and categorical values. In Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, 23–24 February 1997; pp. 21–34.
- Pelleg, D.; Moore, A.W. Accelerating exact k-means algorithms with geometric reasoning. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 15–18 August 1999; ACM Press: New York, NY, USA, 1999; pp. 277–281.
- Kanungo, T.; Mount, D.M.; Netanyahu, N.S.; Piatko, C.D.; Silverman, R.; Wu, A.Y. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. **2002**, 24, 881–892.
- Cheng, D.; Gersho, A.; Ramamurthi, B.; Shoham, Y. Fast search algorithms for vector quantization and pattern matching. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, CA, USA, 19–21 March 1984; pp. 372–375.
- McNames, J. Rotated partial distance search for faster vector quantization encoding. IEEE Signal Process. Lett. **2000**, 7, 244–246.

**Figure 1.** A process of assigning an object $X_i$ to the cluster whose center is closest to the object.

**Figure 3.** Effect of cardinality. FKPT (fast k-prototypes) is the result of our proposed k-prototypes algorithm, and TKPT (traditional k-prototypes) is the result of the original k-prototypes algorithm.

© 2017 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kim, B.
A Fast K-prototypes Algorithm Using Partial Distance Computation. *Symmetry* **2017**, *9*, 58.
https://doi.org/10.3390/sym9040058
