An Improved K-Means Algorithm Based on Evidence Distance

The clustering effect of the k-means algorithm is mainly influenced by the selection of the initial cluster centers and by the distance measure between sample points. The traditional k-means algorithm uses the Euclidean distance to measure the distance between sample points; as a result, it differentiates poorly between the attributes of sample points and is prone to local optimal solutions. To address this problem, this paper proposes an improved k-means algorithm based on evidence distance. First, the attribute values of each sample point are modelled as the basic probability assignment (BPA) of that sample point. Then, the traditional Euclidean distance is replaced by the evidence distance for measuring the distance between sample points. Finally, k-means clustering is carried out using UCI data. Experimental comparisons are made with the traditional k-means algorithm, the k-means algorithm based on the aggregation distance parameter, and the Gaussian mixture model. The experimental results show that the improved k-means algorithm based on evidence distance proposed in this paper has a better clustering effect and better convergence.


Introduction
With the rapid development of technologies such as cloud computing and the internet of things [1,2], the number of connected devices is increasing and the data generated during human-computer interaction and system operation is growing exponentially [3][4][5]. In response to fast-growing data, data mining technology is constantly updated and iterated [6][7][8]. Clustering is a method of data mining [9]. A data set is divided into multiple clusters through a certain process [10,11]. Data similarity within clusters is high, while data similarity between clusters is low [12][13][14]. Depending on the clustering method and characteristics, clustering algorithms can be classified as: divisional, hierarchical, density algorithms, graph theoretic clustering, grid algorithms, model algorithms, etc. [15,16].
The k-means algorithm has been widely used due to its simple idea, easy implementation, and high efficiency when processing large-scale data [17,18]. However, the traditional k-means algorithm has major limitations [19,20]. For example, when the Euclidean distance is used, the degree of discrimination between clusters is low and the output is unstable [21,22]. In view of these shortcomings, the k-means algorithm can be improved from different perspectives, such as random sampling, distance optimization, and density estimation [23]. Better results can be obtained by improving the method of measuring distances between sample points, and researchers have done a lot of work on distance optimization. Tang et al. [24] proposed the d-k-means algorithm, which weighs the influence of density and distance on clustering based on traditional algorithms, and weights the data. On the basis of the weights, the principle of minimum and maximum is introduced to automatically determine the initial cluster centers and the number of centers. Wang et al. [25] proposed an improved k-means algorithm based on distance and sample weights, using a dimensionally weighted Euclidean distance to calculate the distance between samples. Wang et al. [26] proposed a new algorithm that uses random and evolutionary processes, based on several ideas from evolutionary computation, to help k-means jump out of a local optimum. Zhao et al. [27] proposed a new variant of k-means in which the clustering process is driven by an explicit objective function, which makes the k-means process simpler and lets it converge to a better local optimal solution. Qi et al. [28] proposed an optimized k-means clustering method, named k*-means, together with three optimization principles, which can reduce the risk of randomly selecting seeds and reduce the adjustable space. Chen et al. [29] proposed an efficient hybrid clustering algorithm called QALO-K, which combines k-means with an optimized quantum-inspired antlion to make the k-means algorithm converge towards the global optimum. Zhang et al. [30] proposed the DC-kmeans algorithm, which adds the idea of canopy and combines the sample density in the process of finding the initial clusters. It has a good effect when dealing with low-density areas; however, outliers may be grouped into one class during clustering, which affects the clustering effect.
Dempster-Shafer (DS) theory, also known as evidence theory, was first proposed by Dempster in 1967 and was refined and developed by his student Shafer in 1976. Because evidence theory can handle uncertainty and uncertain information flexibly and effectively without relying on a priori knowledge, it is widely used in many fields, such as correlation analysis, clustering, and classification. Fred et al. discussed the problem of clustering data based on evidence. The n d-dimensional data are decomposed into a large number of compact clusters, and the k-means algorithm is then run several times to obtain several clustering results, which constitute the association matrix. The final clustering results are obtained by applying the MST algorithm to the association matrix. This method can effectively identify arbitrary clusters in multidimensional data [31]. Li et al. proposed a clustering integration algorithm based on evidence theory, which focuses on the fusion process in the clustering integration algorithm. The probability of belonging to each label is obtained from the label distribution in the neighborhood of the object under test, and these probability values are used to form the basic partition. Fusion is then performed using the Dempster-Shafer fusion rules to obtain the final clustering results. This algorithm avoids blind trust in the obtained labels [32]. Yu et al. proposed a three-way density-peak clustering algorithm based on evidence theory, which uses a density-peak clustering algorithm to obtain cluster centers and noise points, and then uses a mid-distance comparison scheme to merge neighboring points. Finally, the remaining points are assigned using the evidence distance fusion rule. The method effectively solves the problem of error propagation of clustering labels [33].
The main feature of k-means clustering is the high degree of similarity of data in the same class and the low degree of similarity of data in different classes. The evidence distance in evidence theory can be used to describe the degree of similarity between two bodies of evidence. In order to explore whether a new distance measure can be obtained by using the evidence distance instead of the Euclidean distance, an improved k-means algorithm based on evidence distance is proposed in this paper. We use the attribute values of each sample point to form the body of evidence of that sample point, and then assign each sample point to the class of the cluster center with the smallest evidence distance to its body of evidence. Finally, the data are divided into k classes to obtain the final clustering results. Validation on the UCI data sets and toy data sets, and experimental comparison with the traditional k-means algorithm, the k-means algorithm based on the aggregation distance parameter, and the Gaussian mixture model, show that the improved k-means algorithm in this paper has a better clustering effect and better convergence. The rest of the paper is organized as follows. The second section provides a review of relevant theory. The third section introduces the algorithmic ideas and motivation of this paper and proposes the improved k-means algorithm based on evidence distance. The fourth section describes the experimental setting and the chosen evaluation metrics. The fifth section is devoted to experiments on the UCI data sets and the toy data sets and compares the experimental results with some existing algorithms. Finally, the sixth section provides the conclusion.

Traditional K-Means Algorithm
The core idea of the k-means algorithm is: After inputting the k value, randomly select k sample points in the sample point set as the initial clustering center. Then, the distances of the remaining sample points to the initial cluster centers are calculated and the sample points are grouped into the closest clusters. In the generated new clusters, new cluster centroids are reselected and the sample points are clustered and classified again until the clustering classification results no longer change [34]. In the actual application process, after multiple iterations, due to various factors, the termination conditions may not be met. Therefore, a maximum number of iterations will be set in the actual application process, and the calculation will be terminated when the maximum number of iterations is reached. The pseudo-code of the traditional k-means algorithm is summarized as Algorithm 1.

Algorithm 1 The traditional k-means algorithm.
Input: data set, k value
Output: divided into k clusters
1. select k points from the sample as the initial cluster centers
2. repeat
3. for each sample point x_i do
4. calculate the Euclidean distance from sample point x_i to each cluster center
5. determine the cluster class mark of x_i according to the closest distance
6. divide the sample points into corresponding clusters
7. end for
8. calculate new cluster centers
9. until the cluster allocation result remains unchanged

The distance measures available to the traditional k-means algorithm include: Euclidean metric, city block distance, Pearson correlation, absolute value correlation, absolute non-central correlation, Spearman rank correlation, and Kendall's tau. The traditional k-means algorithm mainly uses the Euclidean distance [35].
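The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the experiments; the function names `euclidean` and `kmeans`, the `max_iter` and `seed` parameters, and the toy points are assumptions introduced here.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means following Algorithm 1; returns (labels, centers)."""
    rng = random.Random(seed)
    # Step 1: pick k sample points as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):  # cap on iterations, as described in the text
        # Steps 3-7: assign every sample point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # Step 9: allocation unchanged, stop
            break
        labels = new_labels
        # Step 8: each new center is the mean of its cluster members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster became empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers


# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, k=2)
```

The label numbering depends on the random initial centers, so only the grouping of the points is meaningful, not which group is called 0 or 1.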
The Euclidean metric [36,37] (also known as the Euclidean distance) is a commonly adopted definition of distance and refers to the true distance between two points in m-dimensional space, or the natural length of a vector (i.e., the distance from that point to the origin). The Euclidean distance in two and three dimensions is the actual distance between two points [38].
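The distance measurement formula in two-dimensional space:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

where d is the Euclidean distance between the points (x_1, y_1) and (x_2, y_2).
The distance measurement formula in three-dimensional space:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$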
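The two- and three-dimensional cases are instances of the general m-dimensional Euclidean distance, which is the form that applies to sample points with m attributes. The symbols p, q, p_i, and q_i below are notation introduced here for illustration, followed by a short worked example.

```latex
% General m-dimensional Euclidean distance between p = (p_1, ..., p_m)
% and q = (q_1, ..., q_m)
d(p, q) = \sqrt{\sum_{i=1}^{m} (q_i - p_i)^2}

% Worked example with p = (1, 2, 3) and q = (4, 6, 3):
d(p, q) = \sqrt{(4-1)^2 + (6-2)^2 + (3-3)^2} = \sqrt{9 + 16 + 0} = 5
```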

D-S Evidence Theory
Evidence theory, also known as Dempster-Shafer (D-S) evidence theory, is an imprecise reasoning theory that was first proposed by Dempster [39] and further developed by his student Shafer [40]. As an uncertain reasoning method, its main characteristics are that it requires weaker conditions than Bayesian probability theory and that it can express 'uncertainty' and 'not knowing' directly. At the heart of D-S evidence theory is the Dempster combination rule, which fuses the basic probability assignments of multiple information sources and outputs a new basic probability assignment [41][42][43][44].

Definition 1.
Assume that a non-empty set Θ is composed of m mutually exclusive events. Θ is then called the identification frame (frame of discernment) [39,40]:

$$\Theta = \{\theta_1, \theta_2, \ldots, \theta_m\}$$

Definition 2.
The mass function m is a mapping from the power set 2^Θ to [0, 1]. For any A ∈ 2^Θ, it satisfies the following conditions [39,40]:

$$m(\emptyset) = 0, \qquad \sum_{A \in 2^{\Theta}} m(A) = 1$$

where m(A) is the basic probability assignment (BPA) of A.
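Definition 3. The body of evidence (BOE) is the collection of all focal elements and their corresponding mass values, expressed as follows [39,40]:

$$\mathrm{BOE} = \{ \langle B, m(B) \rangle : B \in 2^{\Theta},\ m(B) > 0 \}$$

where each focal element B is a member of the power set 2^Θ.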

Definition 4. The belief function Bel(A) represents the total trust in A, and the plausibility function Pl(A) represents the degree to which A is not denied. Bel and Pl are the lower limit function and upper limit function of A, respectively, defined as follows [39,40]:

$$\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B), \qquad \mathrm{Pl}(A) = \sum_{B \cap A \neq \emptyset} m(B)$$

where Bel(A) ≤ Pl(A).
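As a small worked example of Definition 4, the following sketch computes the belief and plausibility of a set from a mass function stored as a dictionary. The representation (frozensets as focal elements), the function names `bel` and `pl`, and the mass values are assumptions chosen for illustration.

```python
def bel(mass, a):
    """Belief: total mass of the focal elements contained in a."""
    return sum(v for b, v in mass.items() if b <= a)


def pl(mass, a):
    """Plausibility: total mass of the focal elements that intersect a."""
    return sum(v for b, v in mass.items() if b & a)


# Hypothetical BPA over the frame {x, y, z}; the masses sum to 1.
mass = {
    frozenset({"x"}): 0.5,
    frozenset({"x", "y"}): 0.3,
    frozenset({"x", "y", "z"}): 0.2,
}

a = frozenset({"x", "y"})
belief = bel(mass, a)        # 0.5 + 0.3: only subsets of {x, y} count
plausibility = pl(mass, a)   # 0.5 + 0.3 + 0.2: every focal element meets {x, y}
```

For this set the interval [Bel(A), Pl(A)] is approximately [0.8, 1.0], which illustrates that Bel is the lower bound and Pl the upper bound.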

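Definition 5.
Assume that, under the same identification frame, the basic probability assignments of two bodies of evidence (BOE) are m_1 and m_2. They are combined according to the Dempster rule as follows [39]:

$$m(C) = \begin{cases} \dfrac{1}{1-K} \sum\limits_{A \cap B = C} m_1(A)\, m_2(B), & C \neq \emptyset \\ 0, & C = \emptyset \end{cases}$$

where m_1(A) and m_2(B) are the masses that the two bodies of evidence assign to A and B, and m(C) represents the consensus of the two bodies of evidence. K represents the conflict factor between the two bodies of evidence and is defined as follows:

$$K = \sum_{A \cap B = \emptyset} m_1(A)\, m_2(B)$$

Definition 6. Evidence distance [45][46][47] is usually used to describe the degree of difference between two bodies of evidence, and its calculation formula is as follows:

$$d_{BOE}(m_1, m_2) = \sqrt{\frac{1}{2} \left(\vec{m}_1 - \vec{m}_2\right)^{T} D \left(\vec{m}_1 - \vec{m}_2\right)}$$

The actual calculation formula used is as follows:

$$d_{BOE}(m_1, m_2) = \sqrt{\frac{1}{2} \left( \langle \vec{m}_1, \vec{m}_1 \rangle + \langle \vec{m}_2, \vec{m}_2 \rangle - 2 \langle \vec{m}_1, \vec{m}_2 \rangle \right)}, \qquad \langle \vec{m}_1, \vec{m}_2 \rangle = \sum_{i=1}^{2^N} \sum_{j=1}^{2^N} m_1(A_i)\, m_2(A_j)\, \frac{|A_i \cap A_j|}{|A_i \cup A_j|}$$

Here d_BOE is the distance between the two bodies of evidence; m_1 and m_2 represent body of evidence 1 and body of evidence 2; and \vec{m}_1, \vec{m}_2 are the vectors formed by the basic probability assignments of the two bodies of evidence. D is a 2^N × 2^N matrix whose row index corresponds to m_1 and whose column index corresponds to m_2; its entries measure the similarity between the focal elements of the two bodies of evidence. Each element of the matrix can be represented as:

$$D(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad A, B \in 2^{\Theta}$$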
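Definitions 5 and 6 can be illustrated with a short Python sketch. It assumes that the entries of D are the ratio |A∩B|/|A∪B| between focal elements, as in the commonly used Jousselme distance, and that no mass is placed on the empty set. The dictionary representation, the function names `dempster_combine` and `evidence_distance`, and the mass values are hypothetical.

```python
import math
from itertools import product


def dempster_combine(m1, m2):
    """Combine two BPAs (dict: frozenset -> mass) with Dempster's rule.

    Returns the combined BPA and the conflict factor K.
    """
    combined, conflict = {}, 0.0
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        c = a & b
        if c:
            combined[c] = combined.get(c, 0.0) + va * vb
        else:
            conflict += va * vb  # K: products with empty intersection
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    fused = {c: v / (1.0 - conflict) for c, v in combined.items()}
    return fused, conflict


def evidence_distance(m1, m2):
    """Evidence distance, assuming D(A, B) = |A & B| / |A | B|."""
    focal = set(m1) | set(m2)
    diff = {a: m1.get(a, 0.0) - m2.get(a, 0.0) for a in focal}
    quad = sum(
        diff[a] * diff[b] * len(a & b) / len(a | b)
        for a in focal
        for b in focal
    )
    return math.sqrt(max(0.5 * quad, 0.0))  # guard against tiny negatives


# Hypothetical bodies of evidence over the frame {x, y}.
m1 = {frozenset({"x"}): 0.6, frozenset({"x", "y"}): 0.4}
m2 = {frozenset({"y"}): 0.7, frozenset({"x", "y"}): 0.3}

fused, K = dempster_combine(m1, m2)
d = evidence_distance(m1, m2)
```

In this example only the pair ({x}, {y}) has an empty intersection, so K is the product 0.6 × 0.7, and the remaining products are renormalised by 1 − K. Two bodies of evidence that each put all their mass on different singletons are at distance 1, the maximum.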

Algorithm Idea Description
Because the selection of the k value is not optimized in this method, the optimum number of cluster centers, i.e., the k value, is found by trial and error. First, k sample points are randomly selected as the initial cluster centers. Each attribute of a sample point can be regarded as an expert that judges which class the sample point belongs to, so the attribute values of each sample point are used to form its body of evidence. After that, the distance from each body of evidence to each initial cluster center is calculated using the evidence distance formula. After the initial division of the sample points, the cluster centers are recomputed as the arithmetic mean of the points in each cluster. Finally, these steps are iterated until the cluster centers no longer change.

Algorithm Flow
Step 1: For a given data set, randomly select k sample points as the initial cluster centers.
Step 2: Use the attribute values of each sample point to form the body of evidence of that sample point.
Step 3: Use Formula (9) to calculate the evidence distance from each sample point to each cluster center, and assign the sample point to the class of the cluster center with the smallest distance.
Step 4: Recompute the k cluster centers as the arithmetic mean of the sample points in each class.
Step 5: Determine whether the cluster centers have changed. If they have changed, return to Step 3 and continue the iteration; if they remain the same, output the corresponding clustering result.
The algorithm flow chart is shown in Figure 1. The pseudo code for the evidence distance-based k-means algorithm proposed in this paper is summarized in Algorithm 2.
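The five steps can be sketched in Python as below. The text does not spell out how attribute values are turned into a BPA, so this sketch makes a loud assumption: non-negative attribute values are normalised to sum to 1 and each attribute is treated as a singleton focal element. Under that assumption the matrix D reduces to the identity and the evidence distance becomes a scaled Euclidean distance between the normalised vectors; the construction used in the paper may differ. The names `to_bpa`, `evidence_distance`, `ed_kmeans`, and the toy samples are hypothetical.

```python
import math
import random


def to_bpa(sample):
    """ASSUMPTION: normalise non-negative attributes into singleton masses."""
    total = sum(sample)
    return [v / total for v in sample]


def evidence_distance(m1, m2):
    """Evidence distance for singleton-only BPAs (D is the identity)."""
    return math.sqrt(0.5 * sum((a - b) ** 2 for a, b in zip(m1, m2)))


def ed_kmeans(samples, k, max_iter=100, seed=0):
    """Sketch of the evidence-distance k-means flow; returns cluster labels."""
    rng = random.Random(seed)
    bpas = [to_bpa(s) for s in samples]   # Step 2: bodies of evidence
    centers = rng.sample(bpas, k)         # Step 1: random initial centers
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each sample to the center at smallest distance.
        new_labels = [
            min(range(k), key=lambda j: evidence_distance(m, centers[j]))
            for m in bpas
        ]
        # Step 5: unchanged assignments imply unchanged centers, so stop.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: new centers are arithmetic means of the member BPAs.
        for j in range(k):
            members = [m for m, lab in zip(bpas, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels


# Hypothetical samples with two attributes and two clear groups.
samples = [(9, 1), (8, 2), (17, 3), (1, 9), (2, 8), (3, 17)]
labels = ed_kmeans(samples, k=2)
```

Because the mean of several BPAs is itself a BPA (its masses still sum to 1), the recomputed centers in Step 4 remain valid bodies of evidence.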

Experiment Preparation
The experimental environment is: AMD A10-7300 processor, AMD Radeon R7 M260DX graphics card, 8 GB of RAM, Windows 10 operating system, and 32-bit Python 3.7 as the programming environment.

Experimental Data Set
The data set used in this article comes from the UCI data set. The name of the data set and its attributes are shown in Table 1.

Experimental Evaluation Indicators
The evaluation indicators used in this paper are the adjusted Rand index, the silhouette coefficient, and the number of iterations. The adjusted Rand index and the silhouette coefficient are used to evaluate the clustering performance of the algorithm, and the number of iterations is used to evaluate the convergence of the algorithm.
(1) Adjusted Rand index. In the clustering model, assume that the actual category information is C and the clustering result is K. Let a denote the number of pairs of elements that are in the same category in both C and K, and let b denote the number of pairs of elements that are in different categories in both C and K. The Rand index is defined as:

$$RI = \frac{a + b}{C_{n_{samples}}^{2}}$$

where C_{n_samples}^{2} represents the total number of pairs of elements that can be formed in the data set. The range of RI is [0, 1], and a higher value of RI means that the clustering results match the real situation more closely.
The problem with the Rand index is that for two random partitions, its value is not a constant close to zero. Therefore, the adjusted Rand index (ARI) is used, which has a higher degree of discrimination. The ARI is calculated as:

$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$$

where RI is the Rand index, E[RI] is its expected value under random partitioning, and max(RI) is its maximum value. The range of values for ARI is [-1, 1]. A larger value of ARI means that the clustering results match the real situation more closely.
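Both indices can be computed directly from two label lists, as in the following sketch. The ARI here uses the standard pair-counting form based on the contingency table of the two partitions, which is an assumption about how the formula is evaluated in practice; the function names and the example labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairs(n):
    """Number of unordered pairs that can be formed from n items."""
    return n * (n - 1) // 2


def rand_index(truth, pred):
    """Fraction of sample pairs on which the two partitions agree."""
    idx = list(combinations(range(len(truth)), 2))
    agree = sum(
        (truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in idx
    )  # counts a + b
    return agree / len(idx)


def adjusted_rand_index(truth, pred):
    """Chance-corrected Rand index from the contingency table.

    Assumes the two partitions are not both trivial (otherwise the
    denominator is zero).
    """
    sum_ij = sum(pairs(c) for c in Counter(zip(truth, pred)).values())
    sum_a = sum(pairs(c) for c in Counter(truth).values())
    sum_b = sum(pairs(c) for c in Counter(pred).values())
    expected = sum_a * sum_b / pairs(len(truth))
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)


# Hypothetical labels: one sample of class 0 is placed in the wrong cluster.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]

ri = rand_index(truth, pred)
ari = adjusted_rand_index(truth, pred)
```

Here 10 of the 15 pairs agree, so RI is about 0.67, while the ARI is noticeably lower (about 0.32) because part of that agreement is expected by chance. Both indices ignore the naming of the clusters, so a perfect clustering with permuted labels still scores 1.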
(2) Silhouette Coefficient The silhouette coefficient is a way of evaluating how well clustering works. It was first proposed by Peter J. Rousseeuw in 1986. It combines both cohesion and separation factors.
Suppose that clustering has been completed by some clustering algorithm. For any sample, let A be the average distance between the sample and the other samples in its own cluster, and let B be the average distance between the sample and the samples in the nearest other cluster. The silhouette coefficient of the sample is:

$$S = \frac{B - A}{\max(A, B)}$$

where S denotes the silhouette coefficient of a single sample. The total silhouette coefficient of a clustering is the average of the silhouette coefficients of all samples. The silhouette coefficient ranges over [−1, 1], with values closer to 1 indicating better clustering performance and values closer to −1 indicating worse clustering performance.
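A minimal sketch of this computation is given below. It assumes the Euclidean distance, at least two clusters, no duplicate points (so that max(A, B) is non-zero), and the common convention that a sample alone in its cluster scores 0; the function name and the toy data are hypothetical.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient using the Euclidean distance."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    scores = []
    for i, p in enumerate(points):
        # Collect distances from sample i to all other samples, per cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(own) / len(own)                                  # cohesion A
        b = min(sum(d) / len(d) for d in by_cluster.values())    # separation B
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical one-dimensional data with two tight, distant clusters.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
score = silhouette(points, labels)
```

For the first point, A = 1 and B = 10.5, giving S = 9.5 / 10.5; averaging over all four points yields a score of roughly 0.90, close to the upper end of the range.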

Experimental Procedure
(1) Import the iris data set and enter the cluster category k value.
(2) The traditional k-means method and the improved k-means method are used for clustering, respectively.

Iris Data Set Test Results
After 10 clustering runs, the ARI value of each run is shown in Figure 2, the silhouette coefficient value is shown in Figure 3, and the number of iterations is shown in Figure 4. The final result is obtained by averaging over the runs. For the traditional method, the ARI value is 0.603, the silhouette coefficient is 0.5371, and the average number of iterations is 8.8. For the improved method, the ARI value is 0.719, the silhouette coefficient is 0.5514, and the average number of iterations is 8.3. From Figures 2-4, it can be seen that the new method adopted in this paper is more stable than the traditional method, and the ARI value and the silhouette coefficient have been effectively improved. Therefore, the accuracy of this method is better than that of the traditional method, and a better clustering effect can be obtained. The improved method has generally reduced the number of iterations compared with the traditional method, so the convergence of the new method is also better than that of the traditional method.

Validation Results of Other Data Sets
The clustering effect and convergence of the algorithm were verified using the wine, breast cancer, digits, and pima data sets, with ARI values, silhouette coefficients, and numbers of iterations as shown in Figures 5-7. The analysis of Figures 5 and 6 shows that the new method obtains better clustering results and that its output is more stable. However, when the data set has more attributes, the improvement of the new method is smaller. The analysis of Figure 7 shows that, except on the breast cancer data set, where the convergence of the new method is slightly worse than that of the traditional method, the convergence of the new method is better than that of the traditional method.

In summary, the new method used in this experiment can obtain better clustering results than traditional methods, and in the output process, the variance between the results is smaller and the output is more stable. At the same time, the convergence of the algorithm is improved to a certain extent.

Algorithm Comparison
In order to verify the performance of the improved k-means algorithm based on evidence distance in more depth, the traditional k-means algorithm (T-K-means), the k-means algorithm based on aggregated distance parameters (AD-K-means) [48], the Gaussian mixture model (GMM) [49], and the k-means algorithm based on evidence distance proposed in this paper (ED-K-means) were selected for experimental comparison. The data sets used for the experiments were three UCI data sets (iris, digits, and wine) and four toy data sets (noisy-moon, blobs, anisotropically distributed data, and blobs with varied variances), in that order. The parameters for the four toy data sets are shown in Figure 8.

The experimental results were evaluated in terms of adjusted Rand index (ARI), silhouette coefficient, number of iterations, and algorithm runtime. The experimental results are shown in Figures 9-12.
Figure 9 shows the results of the adjusted Rand index, with larger values indicating that the clustering results are more consistent with the actual situation. The ED-K-means algorithm proposed in this paper gives higher results than the other three algorithms on both the digits and noisy-moon data sets. On the iris, wine, and blobs with varied variances data sets, the results are slightly lower than those of the GMM algorithm, but higher than those of the other two algorithms. Figure 10 shows the values of the silhouette coefficient, with larger values indicating more cohesive and better-separated clusters. The results of the ED-K-means algorithm proposed in this paper outperformed the other three algorithms on both the toy data sets and the UCI data sets. Therefore, the ED-K-means algorithm proposed in this paper can achieve better clustering results.
Figure 11. Number of iterations.
Figure 11 shows the average number of iterations of each algorithm and Figure 12 shows the algorithm running time. For both metrics, smaller values indicate better convergence of the algorithm. The analysis of these figures shows that on the toy data sets, the ED-K-means algorithm proposed in this paper has the lowest number of iterations, and its running time is comparable with that of T-K-means and slightly higher than that of the GMM algorithm. On the three UCI data sets, the number of iterations is lower and the running time is better than that of T-K-means and GMM, and similar to that of AD-K-means. Therefore, on balance, the ED-K-means algorithm proposed in this paper has better convergence.

Conclusions
In the era of big data, the volume of data keeps expanding, so clustering algorithms have a wide range of application scenarios. This paper presents an improved k-means algorithm based on evidence distance. The algorithm uses the attribute values of each sample point to form the body of evidence of that sample point. Then, the distance between sample points is measured using the evidence distance instead of the Euclidean distance. Finally, the k-means procedure is used to cluster the data. Experimental comparison shows that the improved k-means algorithm based on evidence distance proposed in this paper has a good clustering effect and good convergence. However, the initial cluster centers are still selected randomly in this paper, so the algorithm can be further optimized in this respect.