Cluster analysis is one of the most important research directions in the field of data mining. As the saying goes, “Things are clustered and people are grouped”; compared with other data mining methods, clustering can group data without prior knowledge. Clustering algorithms can be divided into partition-based, density-based, and model-based types [1]. Clustering is the process of dividing a set of physical or abstract objects into groups of similar objects. A cluster is a collection of data objects; objects in the same cluster are similar to each other and different from objects in other clusters [2]. For a clustering task, we want the objects within each cluster to be as close to one another as possible; the initial cluster centers are usually taken from the samples or data points. However, the randomness of this center point selection can keep the clustering from converging. Cluster analysis relies on the similarity among the data in a data set and is a form of unsupervised learning.
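The sensitivity to initial center selection mentioned above can be illustrated with a minimal sketch. It assumes a plain Lloyd-style K-means written with the Python standard library only; the names `kmeans` and `sq_dist`, the synthetic three-blob data, and the chosen seeds are illustrative assumptions and are not taken from the referenced works. The within-cluster sum of squared errors (SSE) may differ from one random initialization to another.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed, n_iter=100):
    """Plain K-means; the initial centers are k points sampled at random."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the old center if a cluster is empty
        if new_centers == centers:  # centers stopped moving
            break
        centers = new_centers
    labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, sse

# Synthetic data (assumption): three Gaussian blobs of 50 points each.
data_rng = random.Random(0)
blobs = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in blobs for _ in range(50)]

# Run K-means from ten different random initializations and compare the SSE.
sses = [kmeans(points, 3, seed)[2] for seed in range(10)]
print("best SSE:", min(sses), "worst SSE:", max(sses))
```

A common mitigation, under the same assumptions, is to run the algorithm from several initializations and keep the result with the lowest SSE.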
Among partition-based clustering algorithms, the K-means algorithm has many advantages, such as simple mathematical ideas, fast convergence, and easy implementation [3]. Its applications are therefore very broad, including the classification of documents, music, and movies; classification based on user purchase behavior; and the construction of recommendation systems based on user interests. As the amount of data increases, the traditional K-means algorithm struggles to meet practical needs when analyzing massive data sets. In view of these shortcomings, many scholars have proposed improvements based on K-means. For instance, Reference [4] presents a simple and efficient implementation of the K-means clustering algorithm that builds a kd-tree data structure over the data points to address the problem of poorly determined cluster center points. The algorithm is easy to implement and can, to some extent, avoid getting trapped in a local optimum. To address the inability of traditional clustering algorithms to take advantage of background knowledge (about the domain or the data set), Reference [5] presents an improved K-means algorithm based on multiple information domains; the authors apply this method to six data sets and to the real-world problem of automatically detecting road lanes from global positioning system (GPS) data. Experiments show that the improved algorithm selects K values more accurately when solving practical problems. Reference [6] reports two algorithms that extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values; they combine a pattern mixing algorithm with an effectiveness measure to handle the complex and noisy data found in the real world. Reference [7] implements a principal component analysis (PCA) method; the authors use an artificial neural network (ANN) together with K-nearest neighbor (KNN) and support vector machine (SVM) classification algorithms to extract and analyze features, which effectively achieves the classification of malware. The clustering algorithm is also applied to the early detection of pulmonary nodules [8]; the authors propose a novel optimized method of feature selection for both the cluster and classifier components. In the field of medical imaging, clustering and classification based on selected features effectively improve the classification performance of computer-aided detection (CAD) systems. With the advent of deep learning methods in pattern recognition applications, some scholars have applied them to cluster analysis. For example, Reference [9] studies the performance of a CAD system for lung nodules in computed tomography (CT) as a function of slice thickness and proposes a method for comparing the performance of CAD systems trained with nonuniform data.
In summary, building on the traditional K-means clustering algorithm, this paper discusses algorithms for quickly determining the value of K. The remainder of this paper is organized as follows: Section 2 provides a brief description of the K-means clustering algorithm. Section 3 presents four K-value selection algorithms (Elbow Method, Gap Statistic, Silhouette Coefficient, and Canopy) and illustrates each method with sample data and experimental results. Finally, a discussion and conclusions are given in Section 4