Research on Rapid Identification Technology of Sand and Dust Characteristic Monitoring Data Based on Optimized K-Means Clustering

Abstract: The criteria-based method for determining sand and dust weather is cumbersome and time-consuming when applied to large volumes of raw data, and it suffers from poor repeatability and reproducibility. Based on a statistical analysis of automatic air monitoring data from cities affected by sand and dust, this paper proposes an optimized k-means algorithm based on maximum density and percentile distance (MDPD-k-means), which quickly filters sand and dust characteristic data and identifies the days affected by sand and dust. The method improves data processing efficiency, addresses the poor reproducibility and large human error of traditional methods, and can support the operational removal of sand and dust data. The method is applied to monitoring data from 10 cities in Shaanxi Province from 2016 to 2022; it identifies a total of 1107 sand and dust days and indicates that the number of days affected by sand and dust is increasing year by year. After the effect of sand and dust is excluded, urban PM10 concentrations decrease by between 1.41% and 18.42%, which provides important data for accurately evaluating the effectiveness of air pollution prevention and control.


Introduction
Sand and dust weather is a hazardous weather phenomenon that degrades the quality of the ecological environment. It is produced mainly by wind erosion of soil in arid and semiarid areas. It not only seriously threatens the ecological environment, human health, and industrial and agricultural production in and around the sand source areas, but also affects regions along the transmission path to varying degrees [1]. Sudden sand and dust weather causes large changes in air quality monitoring data. During sand and dust transmission, indexes such as particle concentration, the air quality index (AQI), and the comprehensive index far exceed their normal levels, and the air quality level often reaches heavy or severe pollution. Unlike the heavy pollution episodes in autumn and winter, in which PM2.5 is the primary pollutant, sand and dust weather causes the mass concentrations of total suspended particulate (TSP) and PM10 in ambient air to rise rapidly within a short period. In severe cases, the PM10 mass concentration quickly rises above 1000 µg/m³ and the AQI remains at its upper limit (AQI = 500), which seriously threatens human health [2]. Therefore, fast and accurate identification of sand and dust weather is of great significance for studying the patterns of sand and dust transmission, distinguishing heavy pollution processes from sand and dust transmission processes, and ensuring stable operation of the automatic monitoring network and sound regional air quality evaluation.
Sand and dust sources in northern China are mainly distributed in the Hexi Corridor and Alxa region, the southern margin of the Southern Xinjiang Basin, and the central region of Inner Mongolia, whose central and western regions are one of the main sources of sand and dust in northwestern China [3,4]. In the absence of precipitation, the continuous increase in temperature in the sand source area leads to a decrease in the water content of the bare soil, which provides a material basis for the formation of sand and dust, while multiple strands of cold air move from north to south, resulting in strong winds and forming a transmission path for sand and dust diffusion [5][6][7]. With the development of atmospheric environment monitoring network, high temporal resolution monitoring data continue to accumulate, research on sand and dust characteristics continues to deepen, and sand and dust transmission paths and data characteristics become increasingly apparent. The area of effect and duration of sand and dust can be determined by identifying the start time, end time, peak concentration of particles, and physical properties of the particles [8,9].
For a single sand and dust transmission process, indexes including the depolarization ratio, extinction coefficient, and mixed layer height are observed directly by spaceborne lidar and ground-based lidar networks, and these indexes are combined with meteorological data and automatic particle monitoring data from the same period to determine the area of effect and duration of sand and dust [10,11]. The extinction coefficient is the degree to which particles attenuate light at a specific spatial coordinate point, and it is usually positively related to the intensity of sand and dust. The depolarization ratio is a physical quantity that distinguishes spherical particles from non-spherical particles. The higher the proportion of non-spherical particles, the more prominent the dust characteristics [12]. When the extinction coefficient and depolarization ratio of near-surface observations increase rapidly, the dust transport height decreases or becomes stratified. With the rapid settlement of coarse particles, the near-surface dust intensity reaches a peak and then gradually subsides [13]. When dust features are observed at high altitude, the dust process persists and continues to affect areas downstream of the transmission channel, and the backward trajectory model (HYSPLIT) is often used to analyze and verify the transmission path [14]. The quality of lidar data is strongly affected by natural conditions such as rain and snow, and lidar cannot observe particle concentration directly. Therefore, lidar data are usually used to support ground-based particle concentration observations when determining sand and dust weather [15].
Aiming at the characteristics of sand and dust over a long time span, polar-orbiting and geostationary satellite remote sensing data, such as MODIS, Landsat, Himawari 8, FY-4, and CALIPSO are nested and superimposed, and the spectral characteristics of sand and dust particles in different regions are identified to determine sand sources, transmission paths, and affected areas, which are often used to identify the characteristics of sand and dust frequency changes in a large area over a long time span [16][17][18][19]. However, multispectral remote sensing methods are still restricted by objective factors such as cloud cover and land desertification, and the inversion results of dust intensity still need to be corrected by ground-based monitoring data.
With the rapid development of automatic atmospheric monitoring networks, particle concentration data with high temporal resolution are accumulating continuously, which provides a large body of basic information for determining sand and dust weather. The "Supplementary Regulations on the Evaluation of Urban Air Quality Affected by Sand and Dust Weather Processes", issued by the Ministry of Ecology and Environment of China in 2018, stipulates a method for determining sand and dust weather based on criteria applied to the hourly PM10 and PM2.5 concentrations from the urban atmospheric monitoring network. The sand and dust process is determined by identifying when the sand and dust starts and ends. As a quantifiable identification method, the criteria method can determine the intensity and duration of sand and dust in affected cities. However, the determination process involves many calculation, screening, judgment, and audit steps, of which the selection, judgment, and audit all rely on operator experience. Therefore, the method has poor reproducibility and repeatability when applied to massive historical monitoring data from a large number of cities.

Atmosphere 2022, 13, 1720
Based on the criteria method, this paper proposes a fast identification method of sand and dust monitoring data based on optimized k-means clustering with the goal of supporting the business application of sand and dust identification. This method is used to study the sand and dust days in 10 cities in Shaanxi Province from 2016 to 2022, summarize the characteristics of air quality, and analyze the law of sand and dust transmission.

Criteria Method
The criteria method determines the period affected by sand and dust mainly from the hourly changes in PM10 and PM2.5 concentrations measured by the urban atmospheric monitoring network. PM10 is the most important characteristic factor of sand and dust weather, so the hourly change in PM10 mass concentration is the main consideration when judging the impact of sand and dust. Sand and dust weather is usually accompanied by a sharp, rapid increase in PM10 mass concentration and a rapid decrease in the mass concentration ratio of PM2.5 to PM10 [20,21]. In addition, the changes in PM2.5 are taken into account, and the particulate matter monitoring data for periods with obvious external sand and dust intrusion are analyzed to identify the start and end times of the sand and dust influence.
The criteria method identifies the starting time of sand and dust weather as the first hour that meets either of two conditions: (a) the city-average hourly PM10 mass concentration is greater than or equal to twice the average PM10 mass concentration of the previous 6 h and is greater than 150 µg/m³; or (b) the city's hourly PM2.5 to PM10 mass concentration ratio is less than or equal to 50% of the average hourly ratio over the previous 6 h.
The criteria method identifies the ending time of sand and dust weather as the first hour that meets either of two conditions: (a) the city-average hourly PM10 mass concentration drops to within a relative deviation of 10% of the average PM10 mass concentration in the 6 h before the sand and dust weather; or (b) the city-average hourly PM10 mass concentration drops below 1.1 times the average PM10 mass concentration in the 6 h before the sand and dust weather.
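The start and end criteria can be expressed as a short Python sketch. This is only an illustration under stated assumptions, not the regulation's reference implementation: the function names `find_start` and `find_end` are hypothetical, the inputs are city-average hourly series with strictly positive values, and the two ending conditions are merged into a single test because both amount to the PM10 concentration falling to about 1.1 times the pre-event 6 h average.

```python
from statistics import mean

WINDOW = 6  # hours of history used by the criteria

def find_start(pm10, pm25):
    """Index of the first hour meeting either start condition, or None."""
    for n in range(WINDOW, len(pm10)):
        prev10 = pm10[n - WINDOW:n]
        prev25 = pm25[n - WINDOW:n]
        mean_pm10 = mean(prev10)
        mean_ratio = mean(f / c for f, c in zip(prev25, prev10))
        # (a) PM10 at least doubles relative to the previous 6 h and exceeds 150
        surge = pm10[n] >= 2 * mean_pm10 and pm10[n] > 150
        # (b) PM2.5/PM10 falls to half of its previous 6 h average or lower
        ratio_drop = pm25[n] / pm10[n] <= 0.5 * mean_ratio
        if surge or ratio_drop:
            return n
    return None

def find_end(pm10, start):
    """Index of the first hour after `start` at which PM10 is back within
    about 10% of the average of the 6 h before the event (assumed merge of
    the two ending conditions), or None."""
    baseline = mean(pm10[start - WINDOW:start])
    for n in range(start + 1, len(pm10)):
        if pm10[n] <= 1.1 * baseline:
            return n
    return None

# Synthetic hourly values in µg/m³, for illustration only
pm10 = [80, 82, 78, 81, 79, 80, 400, 600, 300, 120, 85]
pm25 = [40, 41, 39, 40, 40, 40, 60, 70, 50, 45, 42]
start = find_start(pm10, pm25)
end = find_end(pm10, start)
```

Even in this reduced form, every hour requires a rolling 6 h comparison, which hints at why manual application to multi-year, multi-city records is laborious.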
The above judgment method is suitable for identifying a single sand and dust process in a single city with a small amount of data. When monitoring data covering a long period, a large area, and multiple cities are processed, the huge amount of data and the complexity of the process lead to a substantial increase in manual errors. In addition, data characteristics differ between regions, and the workload of data review and sand and dust weather determination is enormous. It is therefore difficult to obtain reliable and accurate statistical results in a short period of time.

Data Preprocessing
Reasonable data preprocessing can effectively improve the efficiency and accuracy of the clustering algorithm. In distance-based clustering, the mean and variance of the data set strongly influence the clustering results. Too many outliers shift the cluster centers, and sand and dust data or outliers that lie far from the cluster centers can blur the classification boundaries, so that points near the boundaries are difficult to classify accurately. Therefore, before the cluster analysis, the original monitoring data need to be processed to screen the target data and sharpen the data characteristics of the cluster centers.
It can be seen from the criteria method that the necessary conditions for determining sand and dust weather involve the hourly concentration of PM10, the concentration ratio of PM10 to PM2.5, and whether PM10 is the primary pollutant of the AQI. Only when all three conditions are met at the same time are the basic requirements for sand and dust data identification satisfied. Therefore, the hourly concentration of PM10, the Individual Air Quality Index (IAQI) of PM10, and the ratio of PM10 to PM2.5 are analyzed as clustering elements.

1. Hourly concentration of PM10. The characteristic pollutants of sand and dust weather are particulate matter (PM10, PM2.5), of which the short-term change in PM10 determines the strength of the sand and dust transmission process. The hourly concentration of PM10 is therefore an important condition for determining sand and dust weather. When the mass concentration of PM10 is greater than 150 µg/m³, the other data characteristics of sand and dust weather can appear.

2. IAQI of PM10. When sand and dust weather occurs, PM10 is the only primary pollutant, and the sub-index of PM10 (IAQI_PM10) is equal to the AQI. IAQI_PM10 is calculated from the PM10 concentration in the original data, and whether PM10 is the primary pollutant is identified by comparing IAQI_PM10 with the AQI. Since the purpose of the experiment is to identify the sand and dust data for which PM10 is the only primary pollutant, the data characteristics are highlighted with a binary index: when IAQI_PM10 = AQI (PM10 is the primary pollutant), the index is 1; when IAQI_PM10 < AQI, the index is 0.

3. The concentration ratio of PM10 to PM2.5. The concentration ratio of PM10 to PM2.5 is another important factor in the determination of sand and dust weather. According to the PM10 and PM2.5 air quality sub-indices and the concentration limits of the corresponding pollutants given in the "Ambient Air Quality Index (AQI) Technical Regulations (Trial)" [22], when IAQI = 100, C_PM10/C_PM2.5 = 2; when IAQI = 150, C_PM10/C_PM2.5 = 2.17; when IAQI = 200, C_PM10/C_PM2.5 = 2.33; when IAQI = 300, C_PM10/C_PM2.5 = 1.68; when IAQI = 400, C_PM10/C_PM2.5 = 1.43; and when IAQI = 500, C_PM10/C_PM2.5 = 1.2. When sand and dust weather occurs, air quality levels can range from mild to severe pollution, with the AQI ranging from 100 to 500. If the primary pollutant is PM10, the characteristic distribution of C_PM10/C_PM2.5 should be as shown in Figure 1.
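The ratios listed in item 3 can be reproduced from the sub-index breakpoints. The sketch below is illustrative only: the breakpoint values are an assumption taken from the 24-hour average limits in HJ 633-2012 (they are not listed in this paper, although they do reproduce the ratios quoted above), and the names `iaqi` and `breakpoint_ratios` are hypothetical.

```python
# IAQI levels and the matching concentration breakpoints in µg/m³.
# Assumed from the 24-hour average limits of HJ 633-2012.
IAQI_LEVELS = [0, 50, 100, 150, 200, 300, 400, 500]
BREAKPOINTS = {
    "PM10": [0, 50, 150, 250, 350, 420, 500, 600],
    "PM2.5": [0, 35, 75, 115, 150, 250, 350, 500],
}

def iaqi(pollutant, conc):
    """Piecewise-linear interpolation of the sub-index between breakpoints."""
    bp = BREAKPOINTS[pollutant]
    if conc >= bp[-1]:
        return 500.0  # the index is capped at its upper limit
    for i in range(1, len(bp)):
        if conc <= bp[i]:
            lo_c, hi_c = bp[i - 1], bp[i]
            lo_i, hi_i = IAQI_LEVELS[i - 1], IAQI_LEVELS[i]
            return (hi_i - lo_i) / (hi_c - lo_c) * (conc - lo_c) + lo_i

def breakpoint_ratios():
    """C_PM10 / C_PM2.5 at each IAQI level from 100 to 500."""
    return {
        level: round(c10 / c25, 2)
        for level, c10, c25 in zip(
            IAQI_LEVELS[2:], BREAKPOINTS["PM10"][2:], BREAKPOINTS["PM2.5"][2:]
        )
    }

ratios = breakpoint_ratios()
```

The ratio rises from 2 to 2.33 between IAQI 100 and 200 and then falls toward 1.2, which is the shape that Figure 1 is described as showing.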
In addition, the criteria method stipulates that "the hourly mass concentration ratio of PM2.5 to PM10 is less than or equal to 50% of the average value of the ratio in the previous 6 h" [23], as shown in Equation (1):

$$\frac{1}{A_n} \le 50\% \times \frac{1}{6}\sum_{i=1}^{6}\frac{1}{A_{n-i}}, \qquad A_n = \frac{C_{\mathrm{PM10},n}}{C_{\mathrm{PM2.5},n}} \tag{1}$$

where C_PM10 is the mass concentration of PM10; C_PM2.5 is the mass concentration of PM2.5; and A_n is the mass concentration ratio of PM10 to PM2.5 at the nth hour, so that 1/A_n is the ratio of PM2.5 to PM10. Since C_PM10 ≥ C_PM2.5 always holds, therefore:

$$\frac{1}{A_{n-i}} = \frac{C_{\mathrm{PM2.5},n-i}}{C_{\mathrm{PM10},n-i}} \le 1 \tag{2}$$

By substituting Equation (2) into (1), it can be deduced that:

$$\frac{1}{A_n} \le 50\%, \quad \text{i.e.,} \quad A_n = \frac{C_{\mathrm{PM10},n}}{C_{\mathrm{PM2.5},n}} \ge 2 \tag{3}$$

Therefore, on the basis of satisfying the distribution law shown in Figure 1, the concentration ratio of PM10 to PM2.5 should further satisfy the determination condition C_PM10/C_PM2.5 ≥ 2.
In summary, to further highlight the characteristics of sand and dust data, the experimental data used for cluster analysis are converted from the three indexes in the original data (PM10 concentration, PM2.5 concentration, and AQI) into PM10 concentration, the concentration ratio of PM10 to PM2.5, and the index of whether PM10 is the primary pollutant (yes is 1, no is 0). The index settings are shown in Table 1.

Table 1. Index setting of sand dust data feature extraction.

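Original Data | Converted Data | Basic Conditions for Determining Dust Data
PM10 concentration | PM10 concentration | > 150 µg/m³
PM2.5 concentration | Concentration ratio of PM10 to PM2.5 | ≥ 2
AQI | Whether PM10 is the primary pollutant (yes is 1, no is 0) | = 1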
Since the PM10 concentration and the PM10 to PM2.5 concentration ratio have different dimensions, the data should be normalized first, transforming the variables into dimensionless numbers in [0, 1] using the mapminmax function in MATLAB (The MathWorks, Inc., Natick, MA, USA). No rounding should be performed at any intermediate step of the analysis, especially during data processing; otherwise, original data information may be lost during denormalization.
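For readers without MATLAB, the normalization and its inverse can be sketched in plain Python. This is a minimal illustration that assumes samples are stored as rows and features as columns (MATLAB's mapminmax operates on rows and defaults to the range [-1, 1], so the [0, 1] range has to be requested explicitly there); the function names are hypothetical.

```python
def fit_minmax(rows):
    """Per-column minimum and maximum of a list of samples."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def normalize(rows, lo, hi):
    """Map each column to [0, 1]; a constant column maps to 0."""
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(r, lo, hi)]
        for r in rows
    ]

def denormalize(rows, lo, hi):
    """Inverse mapping back to the original units."""
    return [[v * (h - l) + l for v, l, h in zip(r, lo, hi)] for r in rows]

# Columns: PM10 concentration (µg/m³), PM10/PM2.5 ratio (synthetic values)
samples = [[100.0, 1.5], [300.0, 3.0], [500.0, 4.5]]
lo, hi = fit_minmax(samples)
scaled = normalize(samples, lo, hi)
restored = denormalize(scaled, lo, hi)
```

Keeping `lo` and `hi` at full precision is what allows the cluster centers to be mapped back to physical units later, which is the reason for the warning about rounding.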

K-Means Clustering
Cluster analysis divides similar objects in a data set into multiple categories based on the similarity of their features; it is an exploratory classification process. It does not require classification criteria to be specified in advance; instead, it starts from the data themselves and clusters them by unsupervised learning. Because data of the same type are as similar as possible and data of different types differ markedly, the real distribution of the data can be revealed. In a stably operating data system, clusters of normal data are usually large and dense, while clusters of abnormal data are small and sparse. Thus, abnormal data can be preliminarily identified by clustering, and other technical methods can then be used to further analyze the data characteristics [24].
The k-means clustering algorithm is an iterative partitioning clustering algorithm proposed by James MacQueen in 1967. It is simple, fast, and suitable for processing large-scale data. The basic idea is to randomly select k data objects from the data set as the initial cluster centers and calculate the Euclidean distance between each data object and the k cluster centers. Each data object is assigned to the class represented by its closest cluster center, and the k cluster centers are updated to the mean of the data objects in each class. If the change in the cluster centers between adjacent iterations exceeds the specified threshold, all data objects are reassigned according to the new cluster centers; if the change is less than the specified threshold, the algorithm has converged and the clustering result is output [25].
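The assignment and update loop can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: the function name `kmeans`, the tolerance `tol`, and the iteration cap `max_iter` are assumptions, and the initial centers are passed in explicitly so that any seeding strategy can be plugged in.

```python
import math

def kmeans(points, centers, tol=1e-6, max_iter=100):
    """Lloyd-style k-means; returns (centers, labels, iterations)."""
    centers = [tuple(c) for c in centers]
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point joins its nearest center
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # converged: centers moved less than the threshold
            break
    return centers, labels, iteration

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels, iterations = kmeans(points, [(0, 0), (10, 10)])
```

Because the initial centers are an input, the outcome depends entirely on how they are chosen, which is the weakness the seeding strategies discussed next try to address.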
Since the initial cluster centers of k-means clustering are randomly selected, the final clustering results may vary. The k-means++ algorithm proposed by David Arthur and Sergei Vassilvitskii in 2007 improves the selection of initial cluster centers for k-means clustering [26]. First, a sample is randomly selected from the data set as the cluster center C_n. From the Euclidean distance D_n between each sample and its nearest selected cluster center, the probability P_n that each sample becomes the next cluster center is calculated (in the original formulation, this probability is proportional to the squared distance). A new initial cluster center C_n+1 is then randomly selected according to this probability. These steps are repeated until k initial cluster centers have been obtained, after which the k-means clustering process is iterated to determine the k final cluster centers and output the clustering result. The k-means++ algorithm is essentially a process for optimizing the initial cluster centers of the k-means algorithm so that the k initial centers are as far apart as possible, thereby improving clustering accuracy and iteration efficiency and yielding a relatively stable clustering result.
Since the k-means algorithm randomly selects the initial cluster centers, an unreasonable choice affects the quality of the clustering results or the number of iterations. As an improvement on k-means, k-means++ screens the initial cluster centers so that they are separated as much as possible, which makes the clustering more reasonable and reduces the number of iterations. However, the first initial cluster center of the k-means++ algorithm is still randomly selected, and when the roulette method is used to select the remaining initial cluster centers, the clustering is still affected if there are many outliers. The selection of initial cluster centers in both methods is therefore random, and different clustering results may appear when identifying sand and dust characteristic data, which makes it difficult to support operational application of these methods.
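The seeding procedure can be sketched as follows. This is an illustrative version under stated assumptions: the function name `kmeanspp_init` and the `seed` parameter are hypothetical, and the selection weights are the squared distances to the nearest already-chosen center, as in the original k-means++ formulation.

```python
import math
import random

def kmeanspp_init(points, k, seed=None):
    """Choose k initial centers by k-means++ style roulette selection."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # the first center is purely random
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center
        d2 = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
        # Roulette selection: far-away points are more likely to be picked
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (5.0, 5.0)]
seeds = kmeanspp_init(points, 3, seed=0)
```

Two runs with different seeds can return different centers, and an isolated point such as (5.0, 5.0) receives a large weight, which illustrates both the residual randomness and the sensitivity to outliers.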

MDPD-k-Means Clustering
For the application scenario of feature recognition of sand and dust data, this paper proposes a k-means initial cluster center optimization based on maximum density and percentile distance (MDPD-k-means). An initial cluster center is determined by dividing the coordinate system into an equally spaced grid and finding the center point of the cell with the maximum density. The remaining initial cluster centers, up to k, are selected by percentile distance instead of the roulette method, which eliminates the randomness of the algorithm. Once the number of clusters k is fixed, the result does not change even if the clustering is performed multiple times, which supports the operational application of the method.
First, the dimensionality of the original data needs to be reduced. By replacing the AQI with a 0/1 indicator of whether PM10 is the primary pollutant, the scatter diagram for identifying sand and dust is transformed from one three-dimensional coordinate system into two two-dimensional coordinate systems. The two two-dimensional coordinate systems are then combined into a single two-dimensional coordinate system in which each point carries the information 0 or 1. Taking the monitoring results of Xi'an, Shaanxi Province from 1 February to 30 April 2018 as an example, the preprocessed data are shown in Figure 2.
After the data in Figure 2 are normalized to [0, 1] using MATLAB's mapminmax function, a grid of 0.1 × 0.1 cells is drawn to cover the entire coordinate system, and the number of data points in each of the 100 cells is counted and sorted. Since sand and dust characteristic data are few and sparse, whereas non-dust data are many and dense, the cell with the largest number of points belongs to the non-dust data set A_0. Therefore, the center coordinates C_0 of all points in A_0 are calculated and taken as the first initial cluster center.
When selecting the remaining initial cluster centers, the Euclidean distance from each point to each existing cluster center is calculated, the shortest distance D(x) is taken for each point, the distances are sorted, and the point at the chosen percentile distance is selected; this is repeated until k initial cluster centers are obtained. Based on the sand emission frequency of sand source cities in Northwest China, and with reference to experience in identifying sand and dust data by the criteria method, the 95th percentile distance is used as the screening condition for the initial cluster centers. These improvements increase the accuracy and convergence speed of clustering and eliminate the uncertainty. The MDPD-k-means algorithm can be summarized as follows:
Step 1. Initializing: reading the data set and the point information, and normalizing the data set so that x, y ∈ [0, 1];
Step 2. Dividing the density grid: dividing the two-dimensional coordinate system into 100 cells of 0.1 × 0.1, counting the number of scatter points in each cell, and sorting the cells by density from largest to smallest;
Step 3. Selecting the center point C1 of the cell with the largest number of scatter points, and calculating the distance D1(x) from all scatter points to C1;
Step 4. Sorting D1(x) from smallest to largest, selecting the point C2 at the 95th percentile distance, calculating the distances from all scatter points to C1 and C2, and taking the minimum distance D12(x) from each point to C1 and C2;
Step 5. Sorting D12(x) from smallest to largest, selecting the point C3 at the 95th percentile distance, calculating the distances from all scatter points to C1, C2, and C3, and taking the minimum distance D123(x) from each point to C1, C2, and C3;
Step 6. Repeating Step 5 until k center points have been obtained, and outputting the k center point coordinates;
Step 7. Performing k-means clustering with the k center points as the initial cluster centers, and outputting the number of iterations, the final cluster centers, the clustering results, and the number of cases;
Step 8. After denormalization, carrying out feature determination based on the case information.
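Steps 2 to 6 can be sketched in Python as below. This is an illustrative reading of the procedure, not the authors' MATLAB program, and several details are assumptions: the function name `mdpd_init` is hypothetical; the first center is taken as the centroid of the points in the densest cell (Step 3 could also be read as the geometric center of that cell); ties between equally dense cells are broken by cell index; and the percentile is taken by the nearest-rank rule.

```python
import math
from collections import defaultdict

def mdpd_init(points, k, grid=10, percentile=95):
    """Deterministic initial centers for 2-D points normalized to [0, 1]."""
    # Step 2: count the points falling in each grid cell
    cells = defaultdict(list)
    for x, y in points:
        ix = min(int(x * grid), grid - 1)
        iy = min(int(y * grid), grid - 1)
        cells[(ix, iy)].append((x, y))
    # Step 3: the densest cell gives the first center (centroid of its points)
    densest = max(sorted(cells), key=lambda c: len(cells[c]))
    members = cells[densest]
    centers = [(
        sum(x for x, _ in members) / len(members),
        sum(y for _, y in members) / len(members),
    )]
    # Nearest-rank index of the requested percentile, in integer arithmetic
    n = len(points)
    rank = max(0, -(-percentile * n // 100) - 1)
    # Steps 4 to 6: repeatedly take the point at the percentile distance
    while len(centers) < k:
        nearest = [min(math.dist(p, c) for c in centers) for p in points]
        order = sorted(range(n), key=lambda i: nearest[i])
        centers.append(tuple(points[order[rank]]))
    return centers

# Synthetic data: a dense group near the origin plus a few sparse points
dense = [(0.02 + 0.005 * i, 0.03) for i in range(12)]
sparse = [(0.9, 0.9), (0.95, 0.85), (0.5, 0.5), (0.55, 0.45),
          (0.3, 0.8), (0.8, 0.2), (0.85, 0.25), (0.2, 0.6)]
init_centers = mdpd_init(dense + sparse, 3)
```

No random number generator is involved, so repeated calls on the same data return the same centers. Choosing the 95th percentile rather than the maximum distance also skips the single most extreme point, here (0.95, 0.85), when the second center is selected.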

Results of Clustering
The ambient air quality monitoring data of Xi'an City, Shaanxi Province from 1 February to 30 April 2018 were taken as the research object, including the hourly mean concentrations of PM10 and PM2.5 and the city's hourly AQI. A total of 2136 groups of raw data were extracted from the Shaanxi Provincial Ambient Air Quality Monitoring Network Management Platform. After the data were preprocessed, MATLAB was used to run the MDPD-k-means clustering program; the number of clusters was set to 7 according to the elbow rule, and the 95th percentile distance was used. The clustering converged after 22 iterations. After denormalization of the clustering results, seven cluster centers were obtained, as shown in Table 2, and the clustering results are shown in Figure 3.

Table 2. Clustering centers of MDPD-k-means algorithm.

The three basic characteristics of the sand and dust data in Table 1 were used to check, one by one, whether the cluster centers in Table 2 conform to the characteristics of dust data. Cluster 4, Cluster 6, and Cluster 7 generated by the MDPD-k-means algorithm conform to all three basic characteristics at the same time, while at least one of the indexes in Cluster 1, Cluster 2, Cluster 3, and Cluster 5 did not meet the characteristics of sand and dust, so these clusters could not be identified as sand and dust data. The identification results are shown in Table 3 and Figure 4.
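The check of cluster centers against the three basic conditions can be sketched as a simple predicate. The thresholds for PM10 (> 150 µg/m³) and the PM10 to PM2.5 ratio (≥ 2) come from the text; how the 0/1 primary-pollutant index is summarized for a whole cluster is not specified, so the `primary_share` argument and its 0.5 cut-off are assumptions made for illustration, as are the example values, which are not the values of Table 2.

```python
def is_dust_cluster(pm10, ratio, primary_share, min_share=0.5):
    """True if a cluster center meets all three basic dust conditions.

    pm10          -- PM10 concentration of the center, in µg/m³
    ratio         -- PM10 / PM2.5 concentration ratio of the center
    primary_share -- share of member hours with PM10 as primary pollutant
                     (assumed summary of the 0/1 index)
    """
    return pm10 > 150 and ratio >= 2 and primary_share >= min_share

# Hypothetical cluster centers, for illustration only
example_centers = {
    "A": (60.0, 1.6, 0.10),   # clean hours: fails every condition
    "B": (210.0, 1.4, 0.20),  # haze-like: high PM10 but low ratio
    "C": (480.0, 4.2, 1.00),  # dust-like: meets all three conditions
}
dust_flags = {name: is_dust_cluster(*c) for name, c in example_centers.items()}
```

Applying one fixed predicate to every cluster center removes the operator judgment that the criteria method depends on.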

Accuracy Analysis
Taking the identification results of the traditional criteria method as the true classification, a confusion matrix was used to evaluate the accuracy of the MDPD-k-means clustering results, which were compared with the k-means, k-means++, and DBSCAN clustering results. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based unsupervised clustering algorithm that is suitable for detecting outliers in samples. First, the sand and dust data recognized by each of the four clustering algorithms were compared with the 118 groups of data recognized by the criteria method. Since the purpose of the experiment was to identify sand and dust characteristic data, the identification results of the four algorithms were divided into two categories: sand-dust data and non-sand-dust data. Next, the intersect function of MATLAB was used to compare the overlap between the sand and dust characteristic data identified by each of the four clustering algorithms and those identified by the criteria method, as shown in Table 4.

Table 4. Comparison between the recognition results of 4 algorithms and criterion method.

Methods
Dust Data (Group)

As can be seen from Table 4, among the 116 sets of dust data identified by the MDPD-k-means algorithm, 109 were consistent with the results of the criteria method and 7 were identified incorrectly. The seven misidentified sets all appeared in the initial stage of rising PM10 concentration and falling PM2.5 concentration, that is, when haze days and dusty days occurred consecutively and PM2.5 pollution transitioned rapidly to PM10 pollution. A total of nine sets of dust characteristic data were not identified. The main reason was that these data were located at the end of the sand and dust transmission process: the pollutant concentration had begun to decline rapidly toward normal levels while the data still conformed to the characteristics of sand and dust, so the clustering algorithm could not classify them accurately. Although the MDPD-k-means algorithm did not identify all the sand and dust data identified by the criteria method, its recognition rate of 92.37% (109 of 118) was still higher than that of k-means, k-means++, and DBSCAN, so it better supports the quick identification of sand and dust characteristic data.
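The comparison performed with MATLAB's intersect function corresponds to ordinary set operations. The sketch below is an illustration with synthetic hour identifiers arranged to reproduce the counts reported above (118 reference hours, 116 identified, 109 in common); the function name `compare_with_reference` is hypothetical.

```python
def compare_with_reference(identified, reference):
    """Counts of consistent, misidentified, and missed records."""
    identified, reference = set(identified), set(reference)
    return {
        "consistent": len(identified & reference),     # found by both
        "misidentified": len(identified - reference),  # clustering only
        "missed": len(reference - identified),         # criteria method only
    }

# Synthetic hour identifiers, not the actual monitoring records
reference_hours = range(0, 118)    # 118 hours from the criteria method
identified_hours = range(9, 125)   # 116 hours from the clustering
counts = compare_with_reference(identified_hours, reference_hours)
recognition_rate = counts["consistent"] / len(reference_hours)
```

The recognition rate is the consistent count divided by the size of the reference set, which gives 109/118, about 92.37%.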
The main reason was that these data were located at the end of the sand and dust transmission process, and the pollutant concentration began to decline rapidly and tended to normal levels, but the data conformed to the characteristics of sand and dust, thus the clustering algorithm could not accurately classify categories. Although the MDPD-k-means algorithm did not identify all the sand and dust data identified by the criteria method, its recognition rate of 92.37% was still higher than that of k-means, k-means++, DBSCAN and other algorithms, which can better support the sand and dust characteristic data quick identification.
To evaluate the accuracy of the four clustering methods for sand and dust data identification, the confusion matrices of the identification results of the four clustering methods against the true value (criteria method results) were drawn, as shown in Figure 5. According to the confusion matrix in Figure 5, the ratio of the number of correctly identified sand and dust samples to the total number of actual sand and dust samples (TPR), the ratio of the number of incorrectly identified sand and dust samples to the total number of actual sand and dust samples (FNR), the ratio of the number of incorrectly identified non-sand-dust samples to the total number of actual non-sand-dust samples (FPR), and the ratio of the number of correctly identified non-sand-dust samples to the total number of actual non-sand-dust samples (TNR) can all be calculated. The formulas are as follows:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{FNR} = \frac{FN}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}, \quad \mathrm{TNR} = \frac{TN}{FP + TN}$$
where TP and FN are the numbers of actual sand and dust samples identified correctly and incorrectly, and TN and FP are the numbers of actual non-sand-dust samples identified correctly and incorrectly. F1_score is calculated according to Equations (4) and (5). It is a standard measure for classification problems, and its value ranges from 0 to 1, where 0 represents the worst output of the model and 1 represents the best.
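As a minimal sketch of how these rates follow from a 2x2 confusion matrix, the function below uses the verbal definitions above together with the usual textbook definitions of accuracy, precision and F1, which are assumed (not confirmed) to match the paper's equations. The counts `tp`, `fn` and `fp` are the MDPD-k-means figures quoted in the text (109 consistent, 9 unidentified, 7 misidentified); `tn` is a placeholder, because the number of non-sand-dust samples is not stated in this section.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Rates derived from a 2x2 confusion matrix (positive class = sand-dust)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to TPR
    return {
        "TPR": recall,
        "FNR": fn / (tp + fn),
        "FPR": fp / (fp + tn),
        "TNR": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "F1": 2 * precision * recall / (precision + recall),
    }


# tp, fn, fp come from the text; tn=600 is a hypothetical placeholder,
# so FPR, TNR and accuracy below are illustrative only.
m = confusion_metrics(tp=109, fn=9, fp=7, tn=600)
```

TPR, FNR, precision and F1 do not depend on `tn`, so with these counts the TPR reproduces the 92.37% recognition rate reported for MDPD-k-means.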
The TPR, FNR, FPR, TNR, accuracy, precision and F1_score of the four clustering methods are shown in Table 5. It can be seen from Table 5 that k-means and its optimized variants are suitable for the rapid identification of sand and dust feature data. Among them, the MDPD-k-means algorithm outperformed the other clustering algorithms in recognition accuracy, precision and F1_score, and the sand and dust characteristic data it determined were closest to those of the criteria method. For the hourly sand and dust data identified by the MDPD-k-means algorithm, applying the principle that a day is affected by sand and dust if the impact time is ≥ 1 h yielded 9 affected days, exactly the same dates as those identified by the criteria method. Therefore, the MDPD-k-means algorithm has high accuracy and efficiency in identifying the sand and dust features of hourly monitoring data and can support the business application of sand and dust feature recognition.
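The rule that converts hourly identifications into affected days (impact time ≥ 1 h) can be sketched as follows. The function name `dust_days` and the timestamps are assumptions made for illustration.

```python
from datetime import datetime


def dust_days(flagged_hours, min_hours=1):
    """Return the dates with at least `min_hours` hourly records
    flagged as sand-dust."""
    counts = {}
    for ts in flagged_hours:
        day = ts.date()
        counts[day] = counts.get(day, 0) + 1
    return sorted(day for day, n in counts.items() if n >= min_hours)


# Invented timestamps, for illustration only.
hours = [
    datetime(2021, 3, 15, 8),
    datetime(2021, 3, 15, 9),
    datetime(2021, 3, 16, 23),
]
days = dust_days(hours)
```

With the default threshold, a single flagged hour is enough to mark the whole day, which is why the second date is included even though it has only one flagged record.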

Applicability Analysis
The criteria method is supported by the Ministry of Ecology and Environment's "Supplementary Regulations on the Evaluation of Urban Air Quality Affected by Sand-dust Weather Processes" and is widely used in practical work. It first calculates the characteristic data and then filters them in combination, so the original data are dimensionally reduced by calculation and screening to meet the requirements of data feature identification. However, the determination process involves a large amount of calculation, screening, judgment and review, of which screening, judgment and review are easily affected by the subjectivity of operators, especially when processing monitoring data sets under critical conditions; different operators may reach different results depending on their experience. When faced with massive historical monitoring data from a large number of cities, the method is time-consuming and labor-intensive, demands considerable operator experience, and has poor reproducibility.
The technical idea of using clustering to identify sand and dust characteristic monitoring data differs from that of the criteria method. There is no manual calculation or screening at the data level. All original monitoring data are clustered according to their features through unsupervised machine learning, which is a process of classifying the data and then re-identifying their features; the original data are dimensionally reduced by classification to meet the requirements of data feature judgment. Because the classification is completed by the clustering algorithm, a large amount of monitoring data can be processed at the same time under a unified algorithm. The clustering result therefore does not lose the original data information, the process is highly reproducible, and no discrepancies in judgment arise from the inexperience of staff during data calculation and classification. In partition-based clustering methods represented by the k-means algorithm, however, there is no uniform standard for the number of clusters k, and data features with relatively low mass concentrations of PM10 and PM2.5 are easily missed; for example, the end time of sand and dust weather is determined earlier than by the criteria method. In addition, both the k-means and the k-means++ algorithms select the initial cluster centers with a certain degree of randomness. Even when the preset indicators are the same, the results of multiple runs may be inconsistent, so these algorithms cannot support business application.
The MDPD-k-means algorithm eliminates the randomness with which the k-means and k-means++ algorithms select the initial cluster centers, ensures through the density grid and percentage distance that a cluster center falls within the sand and dust data, improves the efficiency of feature extraction, and reduces the number of iterations, so it can support the business application of sand and dust identification. Compared with the criteria method, the MDPD-k-means clustering algorithm minimizes the workload of manual judgment and avoids manual and systematic errors to the greatest extent, while the determination process has better reproducibility and repeatability; it can therefore be used for the simultaneous identification of multi-region, large-scale hourly monitoring data. However, the k value still affects the clustering performance. When a large number of samples are processed, the amount of computation increases rapidly as k increases. The recommended sample size is 720 continuous hours (1 month) to 8760 continuous hours (1 year), with k between 5 and 10; within this range the model runs quickly and the results are accurate.
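The reproducibility argument can be illustrated with a minimal sketch: plain k-means (Lloyd) iterations started from a random draw may depend on the seed, whereas a deterministic start returns the same centers on every run. The deterministic rule used here is a simple farthest-point stand-in chosen for brevity; it is not the MDPD rule, whose density-grid and percentage-distance steps are not reproduced. All function names and the toy (PM10, PM2.5) pairs are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centers, max_iter=100):
    """Plain k-means (Lloyd) iterations from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster; keep the old
        # center if a cluster happens to be empty.
        new_centers = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def random_init(points, k, seed):
    """k-means style start: k points drawn at random (seed-dependent)."""
    return random.Random(seed).sample(points, k)


def farthest_point_init(points, k):
    """Deterministic stand-in (NOT the MDPD rule): start from the smallest
    point, then repeatedly add the point farthest from the chosen centers."""
    centers = [min(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


# Invented (PM10, PM2.5) pairs in µg/m³, for illustration only.
points = [(50, 30), (60, 35), (55, 28), (400, 60), (420, 70), (900, 90), (950, 100)]

run_a = lloyd(points, farthest_point_init(points, 3))
run_b = lloyd(points, farthest_point_init(points, 3))
seeded = lloyd(points, random_init(points, 3, seed=0))
```

Here `run_a` and `run_b` are identical by construction, while `seeded` can change with the seed, which is the behavior the text identifies as an obstacle to business application.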

Sand and Dust Weather Determination Results
A total of 10 cities in Shaanxi Province were used as research objects. The data source was the real-time hourly PM10 and PM2.5 concentrations of 50 automatic air monitoring stations in the 10 cities from 1 January 2016 to 31 May 2022, obtained from the "Shaanxi Provincial Ambient Air Quality Monitoring Network Management Platform". According to the distribution of stations in each city, the city hourly mean value was calculated and the hourly AQI was derived, giving a total of 1.703 million pieces of hourly PM10, PM2.5 and AQI data for the 10 cities from 2016 to 2022. These data were divided into 10 groups by city, and the MDPD-k-means clustering method was used to identify the hourly data of each city that conformed to the characteristics of sand and dust. According to the clustering results, the monitoring data conforming to the characteristics of sand and dust in each city were screened out, and the sand-dust-affected days of the 10 cities from 2016 to 2022 were determined day by day according to the corresponding time. Since the data were identified from the characteristics of hourly values, a day was counted as affected by sand and dust if the influence time was ≥ 1 h. The resulting numbers of days affected by sand and dust in the 10 cities in Shaanxi Province from 2016 to 2022 are shown in Table 6.
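The step from station readings to a city hourly mean can be sketched as below. The text does not specify the averaging rule, so a simple arithmetic mean over the stations with valid readings is assumed; the function name, station codes and values are hypothetical.

```python
from collections import defaultdict


def city_hourly_mean(records):
    """Average station readings into one value per (city, hour).

    `records` is an iterable of (city, station, hour, value) tuples;
    missing readings are given as None and skipped.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for city, _station, hour, value in records:
        if value is None:
            continue
        acc = sums[(city, hour)]
        acc[0] += value
        acc[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


# Invented PM10 readings (µg/m³) with hypothetical station codes.
records = [
    ("Yulin", "S1", "2021-03-15 08:00", 300.0),
    ("Yulin", "S2", "2021-03-15 08:00", 500.0),
    ("Yulin", "S3", "2021-03-15 08:00", None),
    ("Xi'an", "S4", "2021-03-15 08:00", 120.0),
]
means = city_hourly_mean(records)
```

Grouping the resulting hourly series by city gives the per-city data sets to which the clustering is then applied.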

Characteristics of Data Changes
After excluding the impact of sand and dust, the annual average PM10 concentration in the 10 cities changes to varying degrees. It can be seen from Figure 6 that northern Shaanxi was heavily affected by sand and dust from 2016 to 2022. Yulin, located in the transition zone between the Loess Plateau and the Inner Mongolia Plateau and close to the Mu Us sand source, is the most seriously affected: its PM10 concentration drops by 18.42% after excluding the impact of sand and dust. Yan'an is also severely affected, with its PM10 concentration dropping by 12.99%. Hanzhong and Ankang in southern Shaanxi are less affected, and their PM10 concentrations decrease by 1.41-1.69%. Shangluo is at the end of the sand and dust transmission path; because the background value of its urban environment is low, the effect of sand and dust is prominent, and its PM10 concentration drops by 8.62%. Cities in the Guanzhong area all lie in the region affected by sand and dust transmission. Owing to the combined effect of sand and dust transmission and local fugitive dust, their PM10 concentrations decrease by 5.68-7.14%.
ArcGIS was used to draw the distribution map of sand and dust days in Shaanxi Province from 2016 to 2021. As shown in Figure 7, from 2016 to 2021 the 10 cities in Shaanxi Province experienced sand and dust weather to varying degrees, and the number of sand and dust days showed an overall upward trend, with an obvious increase in 2021. The number of times cities were affected by sand and dust increased gradually from south to north. Yulin and Yan'an were frequently affected, and the number of impacts increased year by year. Cities in the Guanzhong region were seriously affected by sand and dust from 2018 to 2020, with roughly the same number of sand and dust days each year, followed by a significant increase in 2021. The southern Shaanxi region was less affected. However, because Shangluo is located at the end of the sand and dust transmission path, it was affected more often than the other two cities in southern Shaanxi.
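The percentage decrease in PM10 concentration after excluding sand and dust can be illustrated with a small calculation. The sketch assumes that the mean is taken over daily values and that affected days are dropped entirely; the exclusion procedure actually applied may differ, and the numbers are invented.

```python
def pm10_decrease_pct(daily_pm10, dust_day_flags):
    """Percent drop in mean PM10 when sand-dust-affected days are left out."""
    all_mean = sum(daily_pm10) / len(daily_pm10)
    kept = [v for v, is_dust in zip(daily_pm10, dust_day_flags) if not is_dust]
    kept_mean = sum(kept) / len(kept)
    return (all_mean - kept_mean) / all_mean * 100


# Invented daily PM10 means (µg/m³); the third day is flagged as a dust day.
daily = [60.0, 80.0, 500.0, 60.0]
flags = [False, False, True, False]
drop = pm10_decrease_pct(daily, flags)
```

Because a few very high values dominate the mean, even a small number of affected days can produce a large percentage drop, which is consistent with the larger decreases reported for the northern cities.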
From the analysis of topography and transmission paths, the northwestern sand and dust originate from the Hexi Corridor in Gansu and enter the Guanzhong Plain from west to east through Tianshui-Baoji. At the same time, under the influence of sand and dust returning from the east, sand and dust remain and accumulate in Guanzhong, causing secondary pollution to the Guanzhong cities and showing intermittent occurrence with decreasing intensity over several consecutive days. The northern sand and dust, originating from Inner Mongolia and Ningxia, travel south through Yulin and Yan'an and enter the Guanzhong Plain from north to south, causing a rapid increase in PM10 concentrations in cities along the way, and then cross the Qinling Mountains to affect southern Shaanxi, mainly Shangluo, showing large-scale, high-intensity transport characteristics. Another northern transmission path starts from Mongolia and travels south through the Beijing-Tianjin-Hebei region. At the end of this transmission process, the sand and dust usually enter the Guanzhong Plain from Shanxi, which has a certain impact on Xi'an, Xianyang and Weinan. Through years of sand control and soil erosion control in Shaanxi Province, the impact of local sand and dust has been basically eliminated, and the PM10 concentration has dropped significantly. However, many cities are located in the sand and dust transmission channels, and the transmission process leads to a short-term rapid increase in PM10 concentration. Although the PM10 concentration can be reduced through measures such as regional air pollution prevention and control, it is difficult to effectively reduce the number of sandy and dusty days.
It can be seen from the frequency and distribution of sand and dust weather from 2016 to 2022 shown in Figure 8 that Shaanxi Province was affected by sand and dust mainly from March to May and secondarily from November to December. The transmission of sand and dust led to a rapid increase in PM10 concentration, with more pollution days and heavier pollution levels. March was the month with the highest frequency of sand and dust occurrences, accounting for 29.7% of all sand and dust days in Shaanxi Province from 2016 to 2022. Ranking the daily average PM10 concentrations of all sand and dust processes from largest to smallest, the highest value appeared in Yulin, 3673 µg/m³ on 15 March 2021, followed by 2980 µg/m³ in Yan'an on 16 March. In addition, in 2021, Yulin and Yan'an had four sand and dust days with a daily average PM10 concentration above 1000 µg/m³, which seriously threatened human health. In Tongchuan, Baoji, Xianyang, Weinan, Xi'an and Shangluo, the daily average PM10 concentration exceeded 600 µg/m³ many times, which seriously affected the ambient air quality. The process with the widest impact was the sand and dust transmission from 12 to 14 May 2019, which affected all cities in Shaanxi Province; the daily average PM10 concentrations in Shangluo on 12 and 13 May were 626 µg/m³ and 499 µg/m³. This process alone raised the 2019 annual average PM10 concentration in Shangluo by 2.7 µg/m³, while sand and dust over the whole of 2019 raised it by 4.0 µg/m³ to 58 µg/m³ and caused the PM10 concentrations in the three cities of Yulin, Yan'an and Hanzhong to exceed the standard (>70 µg/m³), which was not conducive to air quality evaluation and national ranking. Moreover, the transmission of sand and dust markedly worsened the air quality level.
From 2016 to 2022, there were 142 days of severe and above pollution caused by sand and dust, accounting for 12.7% of the total number of days with sand and dust, which was not conducive to the reduction of heavily polluted weather. While there is no effective method to control sand emission in the northern sand source areas, sand prevention and dust suppression measures can be taken to reduce the superimposed pollution from sand and dust transmission and local particle sources. The overall changes in air quality in the 10 cities in Shaanxi Province from 2016 to 2022 show that, although ambient air quality is improving overall, the contribution of sand and dust transmission to air quality will further increase as the number of sand and dust occurrences rises year by year, thereby reversing the trend of air quality improvement.
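Monthly shares of sand and dust days, such as the 29.7% reported for March, can be obtained from a list of identified dates as sketched below. The function name `monthly_share` and the dates are invented for illustration and do not reproduce the reported statistics.

```python
from collections import Counter
from datetime import date


def monthly_share(dust_dates):
    """Percentage of sand-dust days falling in each calendar month."""
    counts = Counter(d.month for d in dust_dates)
    total = sum(counts.values())
    return {month: 100 * n / total for month, n in sorted(counts.items())}


# Invented dates, for illustration only.
dates = [date(2021, 3, 15), date(2021, 3, 16), date(2021, 5, 7), date(2020, 11, 20)]
share = monthly_share(dates)
```

Pooling dates across years, as done here, matches the way the text reports a single share per calendar month for the whole study period.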

Conclusions
Traditional sand and dust weather determination methods are cumbersome and time-consuming when processing massive raw data from atmospheric environment monitoring networks, and they cannot avoid repeatability and reproducibility errors. To address these problems, this paper discusses the feasibility of using clustering algorithms to identify sand and dust data, optimizes the k-means clustering algorithm, and proposes an MDPD-k-means algorithm based on maximum density and percentage distance, forming a relatively complete sand and dust data identification process that can quickly process a large amount of raw data. The determination efficiency of the proposed method is much higher than that of the criteria method, which effectively solves the problems of cumbersome calculation, poor process reproducibility and large manual error when faced with a large amount of original data, and its recognition accuracy is also higher than that of the other clustering methods. The proposed method is suitable for the business application of sand and dust data elimination and provides strong support for the research and judgment of the regional atmospheric pollution situation.
In addition, the MDPD-k-means algorithm is used to identify the characteristics of sand and dust in the hourly data of the atmospheric environment monitoring network in 10 cities in Shaanxi Province from 2016 to 2022. According to the principle that a day is affected by sand and dust if the impact time is ≥ 1 h, a total of 1107 sand and dust days are identified, including 142 days with severe and above pollution, and the daily average PM10 concentration exceeded 600 µg/m³ many times, which seriously affects the ambient air quality and threatens human health. The changes in the days affected by sand and dust show that the number of sand and dust weather occurrences in the 10 cities shows an overall upward trend from 2016 to 2022, with a larger increase from 2021 to 2022. As the concentration of air pollutants continues to decline, the contribution of sand and dust transport to air quality will further increase, which will seriously hinder the improvement of air quality and is not conducive to reducing heavily polluted weather. After eliminating the impact of sand and dust, the PM10 concentrations in the 10 cities in Shaanxi Province decreased by 1.41% to 18.42%, providing important data for accurately assessing the effectiveness of air pollution prevention and control and for assessing ambient air quality.
