K-Means++ Clustering Algorithm in Categorization of Glass Cultural Relics

Abstract: We used statistical methods to study the classification of high-potassium glass and lead-barium glass and analyzed the correlation between the chemical compositions of different types of glass samples. We investigated the categorization methodology of glass cultural relics, conducted a principal component analysis on the chemical composition data of the glass, and developed a case-specific clustering algorithm (K-Means++) to further categorize the glass cultural relics. K-Means++ reduces the sensitivity of the traditional K-Means clustering algorithm to initialization by choosing each next cluster center with a probability that increases with its distance from the existing cluster centers. We then verified the validity of the six subcategories we defined using inertia and the silhouette score and evaluated the sensitivity of the clustering algorithm. We obtained a robustness ratio that remained above 0.9 in the random noise test and a silhouette score of 0.525 in the clustering, which indicates clear separation among the clusters and supports the reasonableness of the result. With our proposed algorithm and classification result, a more comprehensive understanding of glass relics can be gained.


Introduction
Glass has long been recorded among Chinese historical materials, but research on ancient Chinese glass started late. There is a lack of research on the weathering and composition of ancient silicate glass, and most of it is from the perspective of dynasty replacement. The cultural and artistic forms of glass and the laws of its own operation and development are studied in terms of cultural exchange and chemical analysis. Few scholars have systematically established mathematical models and used intelligent algorithms to qualitatively and quantitatively predict the original composition and subclassification methods of weathered silicate glass.
Machine-learning algorithms now play an important role in the exploration of ancient cultures, helping in the search for statistical insight and classification. Measuring and statistically analyzing the chemical compositions of ancient cultural relics, looking for statistical rules, and classifying the types of relics can provide a reliable basis for identifying relic types and tracing their history from a statistical perspective. At the most basic level, ancient glass cultural relics discovered in China are mainly divided into two categories: high-potassium glass and lead-barium glass. This is because, in the process of making these glasses, people needed to add many kinds of auxiliary solvents to the main component, SiO2. Commonly seen in the southeast of China, high-potassium glass usually has plant ash as its auxiliary solvent, which is rich in potassium (in the form of K2O), while lead-barium glass, found elsewhere in China, is rich in PbO and BaO [1,2]. As time went by, some of the glass relics were weathered and their chemical compositions changed accordingly. Wang Chengyu's [3] in-depth study on the mechanism of weathering has certain reference significance for component prediction. Zhao Fengyan, et al. [4] classified the chemical composition of glassware by nondestructive pXRF analysis.
However, the existing chemical research methods cannot accurately and reasonably classify glass according to its composition. Therefore, we consider introducing machine learning to solve practical problems by using classification models and intelligent algorithms. Intelligent algorithms have been widely used in materials research in recent years, for example in the construction and application of the database of ancient Chinese unearthed glass beads studied by Feng Bailing [5] and in the summary by Zhang Liyan [6] of the main theoretical basis, simulation process, and application status of seven simulation methods for glass composition properties, but few methods shed light on the specific field of glass relics. Li Jiangang [7] conducted glass defect detection based on deep learning, but there is still a gap, at home and abroad, in the use of machine learning to study the weathering and subclassification of ancient glass [8].
In this paper, we first study the classification rules of high-potassium glass and lead-barium glass by statistical methods and try to recover the original proportions of chemical components at weathered points, i.e., their chemical composition before being weathered. After that, we conduct the subcategorization based on dimensionality reduction by PCA and make an evaluation. The K-Means++ clustering model is established using three principal components, and the rationality and sensitivity of the model are tested.
The rest of this paper is organized as follows: In Section 2, the classification law of high-potassium glass and lead-barium glass is studied based on sample data. Section 3 proposes the K-Means++ clustering algorithm and establishes the classification model. In Section 4, the validity of the model is analyzed. Section 5 analyzes the model sensitivity. Section 6 discusses limitations of the study and future work. The conclusions are provided in Section 7.

Data Overview
The data were acquired from a private archaeology database, which contains the results of chemical composition detection for 58 glass cultural relics in total. Because two parts of one glass relic (for example, the inside and outside of a glass vase) may differ significantly in chemical composition, 11 glass relics were measured at two different points. Therefore, the data consisted of 69 data points collected from 58 glass relics.
For each glass relic, its ornamentation, color, and type (high-potassium/lead-barium) were recorded, and for each detection point, its weathering degree (unweathered, weathered, severely weathered) and the proportions of 14 chemical components were recorded. The 14 chemical components were SiO2, Na2O, K2O, CaO, MgO, Al2O3, Fe2O3, CuO, PbO, BaO, P2O5, SrO, SnO, and SO2. The first three lines of the data are shown in Table 1. Because the chemical detection method in archaeology may produce a slight error, which makes the sum of the 14 components deviate from 100%, we regard a sum between 85% and 105% as reasonable. Based on this, we dropped two data points whose sums were 79.47% and 71.89%, respectively, and 67 data points remained for further processing.

Exploratory Data Analysis
Based on the chemical composition of the artifact samples and other testing means, the high-potassium glass and lead-barium glass were classified from a total of 67 data samples (including ornamentation, color, weathering degree, the proportion of the main components, etc.). The data distribution of sample color and the data distribution of color based on type are shown in Figure 1.
The samples are marked 01-67, including 18 high-potassium and 49 lead-barium samples. To see how much information the basic data provided, a Kendall correlation coefficient matrix was established to initially observe the correlations among the four qualitative variables, and a correlation diagram was drawn, as shown in Figure 2.
Additionally, because a later section recovers the original chemical compositions of weathered points, we took a brief look at the linear relationship between weathering degree and the other variables (3 qualitative variables and 14 continuous variables) by linear regression. This was intended to indicate how well we could discover the nonlinear clusters and distinguish weathered data from unweathered data using the other variables. In the regression result shown in Figure 3, we obtained an adjusted R-squared value of 0.75. In Figure 4, we can see the positive and negative correlations between weathering degree and certain chemical components. From the model diagnostics graph shown in Figure 5, we conclude that there is a nonlinear relationship (from Residuals vs. Fitted), the residuals are approximately normally distributed (from Normal Q-Q), heteroscedasticity appears in the model (from Scale-Location), and there are no global outliers in the data (from Residuals vs. Leverage).

Analysis of Classification Laws Based on Sample Data
It is necessary to explore the classification pattern of high-potassium and lead-barium glass, thus enabling the further classification of subcategories in terms of chemical composition content. To explore the relationship between the proportions of chemical components and the division into main categories (high-potassium and lead-barium), we used the supervised learning algorithm KNN to construct a K-nearest neighbor graph between sample points, which showed that the K2O content and the (PbO + BaO) content were the key indicators for distinguishing the main categories.
Semantically, "high-potassium glass" should be glass with a high potassium oxide content (noted as K2O below), while "lead-barium glass" is glass with a high content of lead oxide (PbO) and barium oxide (BaO). It can be assumed that if the K2O content and the (PbO + BaO) content of the glass cultural relics are taken as two dimensions, a clear demarcation line can be drawn in the Cartesian plane. We validated this idea on the 67 data samples with the supervised KNN (K-Nearest Neighbor) algorithm.
The KNN algorithm is a typical supervised learning algorithm that classifies each record in a dataset. To classify one new point, the distances between this point and all labeled points are calculated, and the n_neighbors closest points are selected. The category with the largest share among these n_neighbors points is assigned to the new point [9][10][11][12].
First, the values of the two independent input variables K2O and (PbO + BaO) were calculated for the 67 samples. Among the 67 samples, 70% were randomly selected as the training set and 30% as the test set; taking n_neighbors = 5, we obtained a 100% correct classification rate, as shown in Figure 6 (both training and testing data are plotted in the figure). To further check the correctness of this classification criterion, the proportion of the training set was reduced to 50% and the other 50% was used as the test set; with n_neighbors = 5, we still obtained a 100% correct classification rate.
From this, we can infer statistically that the division into high-potassium glass and lead-barium glass is a clean binary partition according to the K2O and (PbO + BaO) contents.
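The KNN voting rule described above can be sketched in a few lines. This is a minimal illustration and not the code used in the study: the function name `knn_predict` and the (K2O, PbO + BaO) values are hypothetical and invented for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, n_neighbors=5):
    """Classify `query` by majority vote among its n_neighbors nearest training points."""
    # Rank the labeled points by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Count the labels of the closest n_neighbors points and return the most common one.
    votes = Counter(train_labels[i] for i in order[:n_neighbors])
    return votes.most_common(1)[0][0]

# Hypothetical (K2O, PbO + BaO) percentages, invented for illustration only.
train_points = [(9.5, 0.5), (12.1, 1.0), (10.3, 0.0), (8.7, 1.8), (11.0, 0.9),
                (0.2, 45.0), (0.0, 38.5), (0.4, 52.3), (0.1, 30.7), (0.3, 41.2)]
train_labels = ["high-potassium"] * 5 + ["lead-barium"] * 5

print(knn_predict(train_points, train_labels, (10.0, 1.2)))
print(knn_predict(train_points, train_labels, (0.5, 35.0)))
```

Because the two groups are far apart in this two-dimensional space, the five nearest neighbors of either query all carry the same label, which mirrors the clear demarcation line described in the text.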

Prediction of the Chemical Composition Content before Weathering on Weathered Samples
The weathering of glass products leads to significant changes in their composition, so it is necessary to predict the composition before weathering from the data measured at weathered points. We chose to use the similarity between glass products to predict the chemical composition that a specific glass product had before it was weathered.
Because the present compositions of weathered samples deviate from their original compositions, we cannot compare similarities using the measured chemical compositions directly. As an alternative, we first conducted a principal component analysis using only unweathered data, so that the unimportant components would be eliminated. Second, we applied this PCA to the weathered data, so that the principal components (PCs) obtained in this way would be largely unrelated to the weathering degree.
First, the original data were tested for validity, and only unweathered samples were selected, which gave 35 data points [13][14][15]. The 14 chemical composition values of these data were taken into the PCA. The percentage of variance explained by each orthogonal component was analyzed; the first four values were [0.72766864 0.15677105 0.05843352 0.02400202], and the first three components (those with explained variance greater than 0.05) were taken as the new PCs, noted as PC1, PC2, and PC3.

• Calculate the eigenvalue decomposition of the unweathered data matrix X: $X^{T}X = Q\Sigma Q^{T}$.
• Calculate the PCs for the weathered data matrix $X_1$: $Y_1 = X_1 P$, where P is the projection matrix built from the leading columns of Q.

In this way, the PCA model fitted on the unweathered samples was applied to calculate the PCs of the weathered samples. The resulting PC matrix is significantly less related to weathering: no weathering-related factors are introduced when constructing the model, and using the model to compute the factor values of weathered samples amounts to expressing them, as far as possible, in terms of the unweathered samples, in line with the prediction requirements.
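The two steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the study's implementation: the helper names `fit_pca` and `project` are hypothetical, the data are synthetic, and the columns are mean-centered before the decomposition, which the text does not state explicitly.

```python
import numpy as np

def fit_pca(X_unweathered, n_components=3):
    """Return (mean, P, explained) where the columns of P are the leading principal axes."""
    mean = X_unweathered.mean(axis=0)
    Xc = X_unweathered - mean                  # ASSUMPTION: columns are mean-centered
    # Eigen-decomposition of the symmetric matrix Xc^T Xc = Q Sigma Q^T.
    eigvals, Q = np.linalg.eigh(Xc.T @ Xc)
    order = np.argsort(eigvals)[::-1]          # eigh returns eigenvalues in ascending order
    P = Q[:, order[:n_components]]             # leading columns of Q
    explained = eigvals[order] / eigvals.sum() # explained-variance ratios
    return mean, P, explained[:n_components]

def project(X, mean, P):
    """Apply the fitted model to new rows: Y = (X - mean) P."""
    return (X - mean) @ P

# Synthetic stand-ins for the 14-component tables; row counts and values are arbitrary.
rng = np.random.default_rng(0)
scales = np.linspace(5.0, 0.1, 14)
X_unweathered = rng.normal(size=(35, 14)) * scales
X_weathered = rng.normal(size=(10, 14)) * scales

mean, P, explained = fit_pca(X_unweathered)
Y1 = project(X_weathered, mean, P)             # PCs of the weathered samples
print(Y1.shape, explained.round(3))
```

The key design point is that `fit_pca` only ever sees unweathered rows, so the axes in `P` carry no information about weathering; the weathered rows are merely projected onto them.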
To define similarity, we used the Euclidean distance. Because there are many chemical components and some of them are impurities, we first used principal component analysis to reduce the dimensionality of the chemical content data, which also ensured that impurities in the glass products (such as strontium oxide, which is less than 0.01 in all glasses) could be ignored and would not affect the subsequent prediction results [16][17][18].
Based on the above definition and comparative analysis, the three PCs were used as the indicators of similarity between two glass products a and b. For each weathered point, the n points most similar to it were taken, and the predicted content for this point was obtained as the weighted average of the chemical composition contents of the n points, with the similarity between each of the n points and this point as the weight. For the selection of n, we tried n = 3, 4, 5 and found that the difference between the predicted values was less than 5%. Thus, the three cases could be regarded as equivalent within the error tolerance, and the prediction results were obtained by taking n = 4.
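The weighted-average prediction can be sketched as below. The exact similarity formula is not recoverable from the text, so this sketch assumes, purely for illustration, that similarity is the reciprocal of the Euclidean distance in PC space; the function name `predict_original` and all numbers are hypothetical.

```python
import math

def predict_original(query_pcs, ref_pcs, ref_compositions, n=4, eps=1e-9):
    """Similarity-weighted average of the compositions of the n most similar reference samples."""
    # Pair each reference sample's distance to the query with its composition,
    # then keep the n closest samples.
    scored = sorted(
        ((math.dist(query_pcs, pcs), comp) for pcs, comp in zip(ref_pcs, ref_compositions)),
        key=lambda pair: pair[0],
    )[:n]
    # ASSUMPTION: similarity = 1 / distance (eps avoids division by zero).
    weights = [1.0 / (dist + eps) for dist, _ in scored]
    total = sum(weights)
    n_components = len(scored[0][1])
    return [sum(w * comp[j] for w, (_, comp) in zip(weights, scored)) / total
            for j in range(n_components)]

# Invented PC coordinates and two-component compositions, for illustration only.
ref_pcs = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
ref_compositions = [[60.0, 10.0], [70.0, 20.0], [0.0, 0.0]]
print(predict_original((1.0, 0.0, 0.0), ref_pcs, ref_compositions, n=2))
```

With this weighting, closer reference samples dominate the average, and a query that is equidistant from its neighbors receives their plain mean.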

K-Means++ Algorithm
Next, we performed the subclass classification within high-potassium glass and lead-barium glass. In the absence of a priori knowledge of the subclass classification criteria and without real labels for reference, we first chose the unsupervised clustering algorithm K-Means for modeling and analysis, using the three PCs obtained in the PCA section as the three feature variables in the clustering.
At initialization, the K-Means algorithm selects K points uniformly at random as the centers of mass. In each iteration, it calculates the distance from each point to the K centers of mass, assigns each sample to the cluster of the closest center, and then updates the center of each cluster to the mean of all samples within it. This is repeated until the change in the positions of the centers is less than a specified threshold (default 0.0001) or the maximum number of iterations is reached [19][20][21][22]. Because the classification of glass products is not related to weathering, it was necessary to eliminate the influence of weathering on chemical composition and classification, so the three PCs obtained above were used as the features of each sample.
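The iteration just described can be sketched in plain Python. This is a minimal illustration, not the implementation used in the study; the function name `kmeans`, its parameters, and the sample points are hypothetical.

```python
import math
import random

def kmeans(points, k, tol=1e-4, max_iter=300, seed=0):
    """Plain K-Means with uniformly random initial centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # uniform random initialization
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center in place
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                         # stop once the centers barely move
            break
    return centers, labels

# Synthetic three-feature points forming two well-separated groups.
points = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.5, 0.0, 0.0),
          (10.0, 0.0, 0.0), (10.2, 0.0, 0.0), (10.6, 0.0, 0.0)]
centers, labels = kmeans(points, k=2)
print(labels)
```

The cluster indices themselves are arbitrary and depend on which points are drawn as initial centers; only the grouping is meaningful.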
The weakness of K-Means lies in its initialization. Because every data point is equally likely to be chosen as an initial center of mass, there is a significant chance that several close points are simultaneously selected as centers, which means that in later iterations these points are highly likely to be assigned to different clusters. However, if two points are close to each other in feature space, they should be in the same cluster for unsupervised learning. Therefore, we considered improving the initialization strategy and developing K-Means++.
Since K-Means is sensitive to its initial values, the initialization was improved. The probability of each sample point becoming the next cluster center is set to increase with its distance from the existing cluster centers, so the farther a sample point is from the existing centers, the more likely it is to be selected. In the mathematical formulation, we denote the set of existing cluster centers as S and write $D(x) = \min_{s \in S} \lVert x - s \rVert$ for the distance from x to its nearest existing center. In the standard K-Means++ formulation, the probability of choosing x as the next cluster center is

$$P(x) = \frac{D(x)^{2}}{\sum_{x'} D(x')^{2}}.$$

In other words, K-Means++ tends to select points that are far from all the existing cluster centers. In this way, the initialization encourages sufficient separation (in the sense of distance) among different clusters, which improves the effectiveness of clustering and the convergence rate.
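The seeding rule can be sketched as follows. The sketch uses the standard K-Means++ weighting by squared distance to the nearest chosen center; whether the study used exactly this weighting is an assumption, and the function name `kmeans_pp_init` and the sample points are hypothetical.

```python
import math
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers by K-Means++ sampling (weights are squared distances)."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]             # first center: uniform at random
    while len(centers) < k:
        # Squared distance from every point to its nearest already-chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Draw the next center with probability proportional to that squared distance.
        # Already-chosen points have weight 0 and cannot be drawn again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Synthetic three-feature points in three tight groups (values invented).
points = [(0.0, 0.0, 0.0), (0.1, 0.2, 0.0), (0.2, 0.1, 0.1),
          (5.0, 5.0, 5.0), (5.1, 4.9, 5.2), (4.8, 5.1, 5.0),
          (9.0, 0.0, 9.0), (9.2, 0.1, 8.9), (8.9, 0.2, 9.1)]
print(kmeans_pp_init(points, k=3))
```

Points close to an existing center receive a small weight, so the drawn centers tend to land in different groups, although this is a tendency and not a guarantee. The function assumes at least k distinct points.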
The algorithm was tested and found to be significantly less sensitive to initialization, so the improved algorithm was named K-Means++. All 67 samples were classified into six classes using the K-Means++ algorithm, and the classification results are shown in Figure 7 and Table 2. By contrast, the classification result of K-Means is shown in Figure 8.
To see the improvement in the sensitivity of K-Means++, we compared K-Means++ with K-Means on the classification results, i.e., we ran both algorithms 1000 times in clustering the 67 samples. The initial results for both algorithms are shown in Figures 7 and 8. Each run yielded a classification with six cluster centers, and we recorded the accumulated difference between these six centers and the initial six centers (represented by the absolute difference of center distance). The larger the accumulated difference, the more volatile the classification and the more sensitive the algorithm. From Figure 9, we can see that the accumulated difference of K-Means++ increases more slowly than that of K-Means, which verifies the sensitivity improvement.
Define the error function $J(c, \mu) = \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^{2}$, where $c_i$ is the cluster to which $x_i$ belongs and $\mu_{c_i}$ is the center of that cluster.

Model Validity Analysis
Whether the division into parent classes is respected is an intuitive indicator of model validity. The subclass classification relies on the chemical composition content, which makes the samples in the same subclass highly similar with respect to some chemical components. To better reflect this similarity, linear regression can be used to explore the composition characteristics of each subclass as another criterion for subclass classification. The silhouette coefficient of the clustering results is a measure of whether the clustering is reasonable and valid [23]. In this paper, we mainly analyzed the reasonableness of the K-Means++ clustering model from these three aspects.

Comparison Verification with Parent Classes
A more intuitive indicator is to verify that the division of subclasses satisfies the division of the parent classes (i.e., high-potassium vs. lead-barium). The statistical results of the K-Means++ clustering algorithm are shown in Table 3. It can be seen that although the six subclasses were divided without introducing any parent variables and clustering relied only on the dimensionality-reduced factors of the chemical content, the samples of each subclass all belonged to the same parent class. This indicates that the clustering results obtained by this method have a certain degree of validity.

Significance Test of Linear Regression
Subclass classification relies on chemical composition content, which makes samples within the same subclass highly similar with respect to some chemical components. To better represent this similarity, linear regression can be used to explore the compositional characteristics of each subclass, using the different subclasses within the same parent class as dependent variables and the chemical composition content of each sample before weathering (including predicted values for weathered samples) as independent variables [24].
Before the component regression, the subclass numbers need to be adjusted so that the subclasses belonging to the same parent class have consecutive serial numbers. This is because there is a difference in component content between the parent classes themselves.
Analyzing the composition table, we can clearly see that the two subclasses of the "high-potassium" parent class (subclasses 1 and 3) differ significantly in CaO content, so they are named "high-potassium high-calcium" and "high-potassium low-calcium". At the same time, subclasses 0 and 2 of the "lead-barium" parent class differ significantly in CuO content: the value for subclass 0 is significantly higher than the mean, the value for subclass 2 is significantly lower than the mean, and the values for subclasses 4 and 5 are close to the mean. The difference between subclasses 4 and 5 is reflected in the CaO content. Therefore, the lead-barium subclasses were named "lead-barium high-copper", "lead-barium low-copper", "lead-barium medium-copper low-calcium", and "lead-barium medium-copper high-calcium". The details of the subclasses are shown in Table 4.

| Subclass Name | Subclass Number |
|---|---|
| Lead-barium high-copper | 0 |
| High-potassium high-calcium | 1 |
| Lead-barium low-copper | 2 |
| High-potassium low-calcium | 3 |
| Lead-barium medium-copper low-calcium | 4 |
| Lead-barium medium-copper high-calcium | 5 |

Silhouette Coefficient and Square Sum of Error
The silhouette coefficient reflects the consistency of the clustering results and can be used to assess the degree of separation among clusters after clustering. For a sample u belonging to cluster $C_i$, we denote by d(u, v) the distance between u and v and define

$$a(u) = \frac{1}{|C_i| - 1} \sum_{v \in C_i,\, v \neq u} d(u, v), \qquad b(u) = \min_{j \neq i} \frac{1}{|C_j|} \sum_{v \in C_j} d(u, v). \tag{2}$$

Here a(u) is the mean distance from u to the other members of its own cluster, and b(u) is the smallest mean distance from u to the members of any other cluster. We define the silhouette value of u as $s(u) = \frac{b(u) - a(u)}{\max\{a(u), b(u)\}}$, lying between −1 and 1. If s(u) is close to 1, the sample is reasonably clustered; if it is close to −1, the sample should be assigned to another cluster. If s(u) is close to 0, the sample is on the boundary of two clusters. The mean of s(u) over all samples is called the silhouette coefficient, which is a measure of whether the clustering is reasonable and valid.
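As a small worked example of the definition above, the following sketch computes the mean silhouette value for a toy labeling. It follows the standard definition of the silhouette coefficient; the function name `mean_silhouette` and the sample points are hypothetical, and at least two clusters are assumed.

```python
import math

def mean_silhouette(points, labels):
    """Mean of s(u) = (b(u) - a(u)) / max(a(u), b(u)) over all samples."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for u, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)                 # common convention for singleton clusters
            continue
        # a(u): mean distance to the other members of u's own cluster
        # (the term d(u, u) = 0 adds nothing to the sum).
        a = sum(math.dist(u, v) for v in own) / (len(own) - 1)
        # b(u): smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(u, v) for v in other) / len(other)
                for key, other in clusters.items() if key != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Four synthetic one-feature samples in two obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = mean_silhouette(points, [0, 0, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

The natural grouping scores close to 1, while the labeling that mixes the two groups scores below 0, matching the interpretation of s(u) given in the text.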
The sum of squared errors (inertia) is defined as the sum of the squared distances between all samples and the center of mass of the cluster to which they belong, and the optimal number of clusters should be taken at the point where the inertia curve bends most sharply. To choose the optimal number of clusters for K-Means++ clustering, two evaluation metrics (inertia and silhouette coefficient) are used. We traverse the possible numbers of clusters k = 2, 3, ..., 19, 20, using the silhouette_score function implemented in the Python sklearn library for validation and plotting the curves of inertia and silhouette coefficient, as shown in Figures 11 and 12. From the inertia curve and the silhouette_score curve, it is best to divide the data into six categories.

Model Sensitivity Analysis
For a certain type of glass artifact, the randomness associated with the production process or prolonged exposure to the environment may cause some change in the proportions of the various chemical components, but the change is minor enough that it should not affect our determination of the glass category. We consider such perturbations as noise added to the sample data [25]. Since the data sample in this study is small and estimates of noise are highly susceptible to overfitting, we assume that the prior probability distribution of the noise is F = N(0, 0.0001), i.e., a Gaussian distribution with mean 0 and standard deviation σ = 0.01. For a random variable X following F,

$$P(-\sigma \le X \le \sigma) = \int_{-\sigma}^{\sigma} f(t)\,dt = 0.682,$$

where f(t) is the probability density function of the distribution F. That is, there is about a 68.2% probability that the noise lies within one standard deviation of zero, so it has only a small effect on the proportions of the chemical components.
The random noise test proceeds as follows. The number of random tests is initially set to T = 100. In the i-th test (1 ≤ i ≤ T), noise sampled from the above prior distribution is added to all original unweathered samples to obtain noisy sample data, and the predicted labels are obtained by the clustering algorithm. The number of samples whose predicted labels after adding noise are the same as the original predicted labels is counted and denoted $t_i$. After T tests, we compute (N is the number of samples)

$$\text{Random noise test ratio} = \frac{\sum_{i=1}^{T} t_i}{T \cdot N}.$$

This measure takes values in [0, 1], and the closer it is to 1, the better the noise immunity of the model.
The K-Means++ clustering algorithm was put through the random noise test 10 times, and the results were averaged to obtain a random noise test ratio of 0.980, indicating that the model is insensitive to noise under the current noise prior distribution.
To further investigate the relationship between model sensitivity and the noise standard deviation, we varied σ according to

$$\sigma = 0.01 + 0.00474\,t, \qquad 0 \le t < 20.$$

Increasing the standard deviation in the above random noise test reveals the robustness of K-Means++. Compared with the K-Means algorithm, as σ increases from 0.01 to 0.1, the clustering accuracy of K-Means++ remains significantly higher and shows a smaller variance, which illustrates the performance improvement of K-Means++. The random noise test ratio as a function of σ is shown in Figure 13.
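The random noise test ratio can be sketched as below. To keep the example short and deterministic, it makes a simplifying assumption that differs from the procedure in the text: instead of re-running the clustering algorithm on the noisy data, noisy samples are relabeled by their nearest fixed cluster center. The function names and data are hypothetical.

```python
import math
import random

def nearest_center_labels(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c])) for p in points]

def noise_test_ratio(points, centers, sigma=0.01, T=100, seed=0):
    """Share of labels left unchanged by N(0, sigma^2) noise, pooled over T trials."""
    rng = random.Random(seed)
    baseline = nearest_center_labels(points, centers)
    unchanged = 0
    for _ in range(T):
        # Add independent Gaussian noise to every feature of every sample.
        noisy = [tuple(x + rng.gauss(0.0, sigma) for x in p) for p in points]
        # ASSUMPTION: relabel against fixed centers instead of re-clustering.
        labels = nearest_center_labels(noisy, centers)
        unchanged += sum(a == b for a, b in zip(labels, baseline))  # this trial's t_i
    return unchanged / (T * len(points))

# Synthetic three-feature points and centers (values invented).
points = [(0.0, 0.1, 0.0), (0.2, 0.0, 0.1), (0.1, 0.2, 0.0),
          (5.0, 5.1, 5.0), (5.2, 5.0, 4.9), (4.9, 5.1, 5.2)]
centers = [(0.1, 0.1, 0.03), (5.03, 5.07, 5.03)]

# Sweep sigma along the schedule sigma = 0.01 + 0.00474 t, 0 <= t < 20.
for t in (0, 10, 19):
    sigma = 0.01 + 0.00474 * t
    print(round(sigma, 4), noise_test_ratio(points, centers, sigma=sigma))
```

Using fixed centers also sidesteps the fact that cluster indices can be permuted between runs of a clustering algorithm, which would otherwise have to be matched before comparing labels.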

Model Sensitivity Analysis
For a given type of glass artifact, randomness in the production process or prolonged exposure to the environment may change the proportions of the chemical components, but the change is small enough that it should not affect the category to which the glass is assigned. We treat such perturbations as noise added to the sample data [25]. Since the data sample in this study is small and estimates of the noise are highly susceptible to overfitting, we assume that the prior probability distribution of the noise is F = N(0, 0.0001), i.e., a Gaussian distribution with mean 0 and standard deviation σ = 0.01. For a noise variable X ∼ F,

$$P(-\sigma \le X \le \sigma) = \int_{-\sigma}^{\sigma} f(t)\,dt = 0.682,$$

where f(t) is the probability density function of the distribution F. There is thus about a 68.2% probability that the noise lies within one standard deviation of zero, i.e., in [−0.01, 0.01], so it has only a small effect on the weight of each chemical component.
The random noise test proceeds as follows. The number of random tests is set to T = 100. In the i-th test (1 ≤ i ≤ T), noise sampled from the above prior distribution is added to all original unweathered samples to obtain noisy sample data, and the predicted labels are obtained by the clustering algorithm. The number of samples whose predicted labels after adding noise match the original predicted labels is counted and denoted t_i. After T tests, we calculate (N is the number of samples)

$$\text{Random noise test ratio} = \frac{\sum_{i=1}^{T} t_i}{T \cdot N}.$$

This measure takes values in [0, 1]; the closer it is to 1, the better the noise immunity of the model.
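The random noise test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: labels are obtained by assigning each sample to the nearest of a set of fixed cluster centers (the text does not say whether the model is refit on the noisy data), and the sample values, the centers, and the names `nearest_center` and `random_noise_test_ratio` are hypothetical.

```python
import math
import random


def nearest_center(sample, centers):
    """Return the index of the center closest to `sample` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(sample, centers[k]))


def random_noise_test_ratio(samples, centers, sigma=0.01, T=100, seed=0):
    """Share of labels left unchanged by Gaussian noise, pooled over T tests."""
    rng = random.Random(seed)
    original = [nearest_center(s, centers) for s in samples]
    N = len(samples)
    unchanged_total = 0
    for _ in range(T):
        # Perturb every component of every sample with noise drawn from N(0, sigma^2).
        noisy = [[v + rng.gauss(0.0, sigma) for v in s] for s in samples]
        predicted = [nearest_center(s, centers) for s in noisy]
        # t_i: number of samples whose label survived the perturbation.
        unchanged_total += sum(p == o for p, o in zip(predicted, original))
    # Sum of t_i divided by T * N, so the result lies in [0, 1].
    return unchanged_total / (T * N)


# Hypothetical two-component compositions (weight fractions), not data from the study.
samples = [[0.60, 0.10], [0.62, 0.11], [0.30, 0.40], [0.28, 0.42]]
centers = [[0.61, 0.105], [0.29, 0.41]]
ratio = random_noise_test_ratio(samples, centers)
```

Because the two hypothetical groups are separated by far more than σ = 0.01, essentially no label changes here; the ratio drops only when σ becomes comparable to the distance between cluster centers.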
The random noise test was run 10 times for the K-Means++ clustering algorithm, and the results were averaged to give a random noise test ratio of 0.980, indicating that the model is insensitive to noise under the current noise prior distribution.
To further investigate the relationship between model sensitivity and the noise standard deviation, we varied σ according to $\sigma = 0.01 + 0.00474t$, $0 \le t < 20$. Increasing the standard deviation in the above random noise test reveals the robustness of the K-Means++ model. Compared with the K-Means algorithm, as σ increases from 0.01 to 0.1, the clustering accuracy of K-Means++ remains significantly higher and shows a smaller variance, which illustrates the performance improvement of K-Means++. The variation of the random noise test ratio with σ is shown in Figure 13.
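As a quick check of this schedule, the sketch below enumerates σ for integer values of t. Integer steps are an assumption here, since the text gives only the range 0 ≤ t < 20; the variable name `sigmas` is illustrative.

```python
# sigma_t = 0.01 + 0.00474 * t for t = 0, 1, ..., 19 (integer t is an assumption).
# The step 0.00474 is approximately (0.1 - 0.01) / 19, so the 20 values
# are evenly spaced from 0.01 up to about 0.1.
sigmas = [0.01 + 0.00474 * t for t in range(20)]

for t, sigma in enumerate(sigmas):
    print(f"t = {t:2d}  sigma = {sigma:.5f}")
```

Each σ in the list would then be passed to the random noise test in place of the default 0.01.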
in which $D(x, a) = \mathrm{Similarity}_{x,a}^{2}$.
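The role of the squared term D(x, a) can be illustrated with the standard K-Means++ seeding rule, in which each new center is drawn with probability proportional to the squared distance from a point to its nearest already-chosen center. The sketch below assumes that "Similarity" denotes Euclidean distance, which the text does not confirm; the function name `kmeans_pp_seeds` and the sample points are hypothetical.

```python
import math
import random


def kmeans_pp_seeds(points, k, seed=0):
    """Pick k initial centers from `points` using D^2-weighted sampling."""
    rng = random.Random(seed)
    # The first center is chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Draw the next center with probability proportional to that squared distance;
        # points already chosen have weight 0 and cannot be drawn again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Hypothetical 2-D points forming three loose groups (not data from the study).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.0, 0.1)]
seeds = kmeans_pp_seeds(points, k=3)
```

Weighting by squared distance makes far-away points much more likely to be picked, which spreads the initial centers out and reduces the sensitivity to initialization seen with plain K-Means. The sketch assumes at least k distinct points.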

Figure 9. Accumulated difference of two clustering algorithms.

We define the silhouette value of a sample u as s(u), lying between −1 and 1. If s(u) is close to 1, the sample is reasonably clustered; if it is close to −1, the sample should be assigned to another cluster. If s(u) is close to 0, the sample lies on the boundary between two clusters. The mean of s(u) over all samples is called the silhouette coefficient, which measures whether the clustering is reasonable and valid.
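To make the interpretation of these values concrete, the sketch below computes per-sample silhouette values with the standard definition s(u) = (b(u) − a(u)) / max{a(u), b(u)}, where a(u) is the mean distance from u to the other members of its own cluster and b(u) is the smallest mean distance from u to the members of any other cluster. The symbols a(u) and b(u), the function name `silhouette_values`, and the sample points are assumptions for illustration; the sketch also assumes at least two clusters and Euclidean distance.

```python
import math


def silhouette_values(points, labels):
    """Per-sample silhouette s(u) = (b - a) / max(a, b), each value in [-1, 1]."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Common convention: a singleton cluster gets silhouette 0.
            values.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so dividing by len - 1 excludes it).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Hypothetical 2-D points in two well-separated clusters (not data from the study).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
values = silhouette_values(points, labels)
score = sum(values) / len(values)  # silhouette coefficient: mean of s(u)
```

For these tightly grouped, widely separated points every s(u) is close to 1; moving a point toward the other cluster pushes its value toward 0 and then below it.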

Figure 11. Variation in the sum of squared errors with the number of clusters for the K-Means++ clustering algorithm.

Figure 12. Variation in silhouette_score with the number of clusters for the K-Means++ clustering algorithm.

Figure 13. K-Means++ and K-Means clustering model sensitivity with noise standard deviation.

Table 1. The first three lines of the data.

Table 2. Results of K-Means++ clustering for all samples.

Table 3. Results of clusters by parent class obtained by the K-Means++ clustering algorithm.

Table 4. Subclass results obtained by the K-Means++ clustering algorithm.
