Predicting PM 2.5 in the Northeast China Heavy Industrial Zone: A Semi-Supervised Learning with Spatiotemporal Features

: Particulate matter PM 2.5 pollution affects the Chinese population, particularly in cities such as Shenyang in northeastern China, which occupies a number of traditional heavy industries. This paper proposes a semi-supervised learning model used for predicting PM 2.5 concentrations. The model incorporates rich data from the real world, including 11 air quality monitoring stations in Shenyang and nearby cities. There are three types of data: air monitoring, meteorological data, and spatiotemporal information (such as the spatiotemporal effects of PM 2.5 emissions and diffusion across different geographical regions). The model consists of two classiﬁers: genetic programming (GP) to forecast PM 2.5 concentrations and support vector classiﬁcation (SVC) to predict trends. The experimental results show that the proposed model performs better than baseline models in accuracy, including 3% to 18% over a classic multivariate linear regression (MLR), 1% to 11% over a multi-layer perceptron neural network (MLP-ANN), and 21% to 68% over a support vector regression (SVR). Furthermore, the proposed GP approach provides an intuitive contribution analysis of factors for PM 2.5 concentrations. The data of backtracking points adjacent to other monitoring stations are critical in forecasting shorter time intervals (1 h). Wind speeds are more important in longer intervals (6 and 24 h).


Introduction
A major concern among the Chinese public has been increasing air pollution as a result of rapid industrialization and urbanization. Among the pollutants, particulate matter (PM) 2.5 , which refers to particles less than 2.5 µm in diameter, has become the biggest threat to human health. The prediction of PM 2.5 concentrations has attracted both scientists and society. The study in this field starts at the end of the 20th century and has reached a new height for machine learning exploding in recent years. The most widely used models are multivariate linear regression (MLR), support vector regression (SVR), and artificial neural networks (ANNs). Despite so many approaches and models, most are still far from good enough in accuracy and robustness.
This paper collects real-world data to train machine learning models, including hourly PM 2.5 concentration from January to March 2022, from all air-monitoring sites in Shenyang, China., Predictions are based on meteorological data, especially daily sunlight, which has an essential influence on forming PM 2.5 concentrations [1,2]. Historical PM 2.5 concentrations from the target site and nearby sites, such as backtracking points, are critical aspects of the proposed models. We separate the task of predicting PM 2.5 into two models: one for exploring the PM 2.5 concentrations and the other for forecasting the trend. Our model predicts concentrations mainly using genetic programming (GP) after using the shortestdistance clustering method to cluster the correlation coefficients of PM 2.5 concentrations at each site. In addition, we provide support vector classification (SVC) to forecast the trend, i.e., an increase or a decrease in the PM 2.5 index, in advance of time intervals of 1 h, 6 h, and 24 h.
(2005) built credit-scoring models and compared GP with approximate sets, logistic regression, decision trees, and artificial neural networks [24]. They found that GP provided better accuracy and robustness than the other models. Muttil et al. (2005) presented a realtime model using GP to predict algal blooms, prevent harm to fisheries, and provide a suggestion for the management departments [25]. Baykasoğlu et al. (2008) applied GP to predict limestone's compressive and tensile strengths [26]. They chose a GP rather than a NN because the former produces prediction equations explicitly, while the latter cannot.

Features Engineering
This paper's data, including PM2.5 concentration values, meteorological data, and the latitude-longitude data, came from the official departments in Shenyang and some others from the internet, such as the daily times for sunrise and sunset from http://www.weather.com.cn/ (accessed on 1 January 2022). This section introduces the preprocessing of raw data and selecting features before training models.

Air Quality Data, Meteorological Data, and Their Relevance
According to official air quality standards published on the website of the China National Environmental Monitoring Center in 2012, PM2.5 concentration values were divided into six categories: 0 to 35 μg • m , 36 to 75 μg • m , 76 to 115 μg • m , 116 to 150 μg • m , 151 to 250 μg • m , and over 251μg • m , corresponding togood, general, mildly polluted, moderately polluted, severely polluted, and seriously polluted levels, respectively. The percentages of the six categories are shown in Figure 1 (some concentration values greater than 250 μg • m are included in the seriously polluted category). The category values are seen as the target of the models. Meteorological data were gathered at hourly intervals corresponding to the PM2.5 concentration values, which were temperature, atmospheric pressure, relative humidity, wind direction, speed. We preliminarily analyzed the correlations of PM2.5 concentrations and each meteorological feature using scatter grams shown in Figure 2. Figure 2 indicates that when the temperature was very low (lower than −20 °C), the concentration of PM2.5 was also relatively low; when the atmospheric pressure was high, the PM2.5 concentrations were relatively low; when the relative humidity was lower than 20%, the PM2.5 concentrations were relatively low; and when the wind speed was high, the PM2.5 concentrations Meteorological data were gathered at hourly intervals corresponding to the PM 2.5 concentration values, which were temperature, atmospheric pressure, relative humidity, wind direction, speed. We preliminarily analyzed the correlations of PM 2.5 concentrations and each meteorological feature using scatter grams shown in Figure 2. Figure 2 indicates that when the temperature was very low (lower than −20 • C), the concentration of PM 2.5 was also relatively low; when the atmospheric pressure was high, the PM 2.5 concentrations were relatively low; when the relative humidity was lower than 20%, the PM 2.5 concentrations were relatively low; and when the wind speed was high, the PM 2.5 concentrations were relatively low. However, the PM 2.5 concentrations were not always high when the wind speed was low. The wind direction was measured by an angle, with north at 0 • , east at 90 • , south at 180 • , and west at 270 • . The remaining wind directions were calculated from this. The last picture in Figure 2 also tells us that a correlation between PM 2.5 concentrations and wind direction cannot be made directly according to the scatter gram.
were relatively low. However, the PM2.5 concentrations were not always high when the wind speed was low. The wind direction was measured by an angle, with north at 0°, east at 90°, south at 180°, and west at 270°. The remaining wind directions were calculated from this. The last picture in Figure 2 also tells us that a correlation between PM2.5 concentrations and wind direction cannot be made directly according to the scatter gram.

Spatiotemporal Data
This paper introduces a spatiotemporal site relevance concept for air quality prediction, based on locations sites, wind direction, and wind speed. Then, PM2.5 concentrations of the related sites, called backtracking points, are selected as input features to train the model. For the specifics, see Section 4. For each record of each site, we found the records of the same site 1 h, 6 h, and 24 h earlier. Then, we considered the PM2.5 concentration values in the found records as a feature for prediction.

Sunlight Effect
PM2.5 concentrations are remarkably affected by sunlight. However, there are many complex effects on light intensity, especially weather. Without the loss of generality, in this study, we simply considered whether the site was in the dark hemisphere. First, we collected data on sunrise and sunset times in Shenyang on a daily basis.. When a record was acquired after sunrise and before sunset, we marked 0 to indicate that the site was in the daylight hemisphere.. Otherwise, if it was later than the time of sunset and earlier than

Spatiotemporal Data
This paper introduces a spatiotemporal site relevance concept for air quality prediction, based on locations sites, wind direction, and wind speed. Then, PM 2.5 concentrations of the related sites, called backtracking points, are selected as input features to train the model. For the specifics, see Section 4. For each record of each site, we found the records of the same site 1 h, 6 h, and 24 h earlier. Then, we considered the PM 2.5 concentration values in the found records as a feature for prediction.

Sunlight Effect
PM 2.5 concentrations are remarkably affected by sunlight. However, there are many complex effects on light intensity, especially weather. Without the loss of generality, in this study, we simply considered whether the site was in the dark hemisphere. First, we collected data on sunrise and sunset times in Shenyang on a daily basis.. When a record was acquired after sunrise and before sunset, we marked 0 to indicate that the site was in the daylight hemisphere.. Otherwise, if it was later than the time of sunset and earlier than sunrise on the next day, we recorded 1. Table 1 shows the relationship between the PM 2.5 index and whether the site was in the dark hemisphere.

The Proposed Model
This paper proposes a semi-supervised learning model for predicting PM 2.5 concentrations, which consists of two classifiers: genetic programming (GP) to forecast PM 2.5 concentrations and nu-support vector classification (SVC) to predict trends.

A Genetic Programming Model
A GP model begins with an initial randomly generated population, in which each individual is one predicting model. To achieve a better effect, we specified the basic terminal set and function set for individuals. The terminator set is a function of various parameters and variables, including constants, logic constants, variables, etc. In this research, the terminal set consisted of the data mentioned in Section 3.1 and random numbers. The function set may include basic arithmetic operations, standard programming operations, standard mathematical functions, logic functions, or any other mathematical function. After generating the initial generation, we defined the fitness function, i.e., RMSE in our research, to rank the individuals. Common genetic manipulations, such as replication, crossover, and mutation, were used to generate new offspring populations. Replication involves copying an existing individual directly into a newly created child population without modifying it. Crossover refers to selecting a part of two parents that has a better fitness exchange to produce two new individuals. The process of mutation involves randomly altering a specific part of the parent, thereby creating a new individual.. To finish modeling, we set the maximum number of evolutions and target fitness. When the times of evolution reached the set number or the fitness of the best individual in the population reached the target, the modeling ended, and the prediction model was constructed using the best individual. GP's operating process is shown in Figure 3.
Atmosphere 2022, 13, x FOR PEER REVIEW 5 of 12 sunrise on the next day, we recorded 1. Table 1 shows the relationship between the PM2.5 index and whether the site was in the dark hemisphere.

The Proposed Model
This paper proposes a semi-supervised learning model for predicting PM2.5 concentrations, which consists of two classifiers: genetic programming (GP) to forecast PM2.5 concentrations and nu-support vector classification (SVC) to predict trends.

A Genetic Programming Model
A GP model begins with an initial randomly generated population, in which each individual is one predicting model. To achieve a better effect, we specified the basic terminal set and function set for individuals. The terminator set is a function of various parameters and variables, including constants, logic constants, variables, etc. In this research, the terminal set consisted of the data mentioned in Section 3.1 and random numbers. The function set may include basic arithmetic operations, standard programming operations, standard mathematical functions, logic functions, or any other mathematical function. After generating the initial generation, we defined the fitness function, i.e., RMSE in our research, to rank the individuals. Common genetic manipulations, such as replication, crossover, and mutation, were used to generate new offspring populations. Replication involves copying an existing individual directly into a newly created child population without modifying it. Crossover refers to selecting a part of two parents that has a better fitness exchange to produce two new individuals. The process of mutation involves randomly altering a specific part of the parent, thereby creating a new individual.. To finish modeling, we set the maximum number of evolutions and target fitness. When the times of evolution reached the set number or the fitness of the best individual in the population reached the target, the modeling ended, and the prediction model was constructed using the best individual. GP's operating process is shown in Figure 3.   Table 2 shows the GP parameters. Some other parameters that were not listed would not likely affect the results; thus, we used the default values. Table 3 shows the meaning of some of the operators in GP.

A nu-SVC Model
The SVC model is a type of SVM which can be used for classification purposes. In this paper, we used nu-SVC to predict the changing tendency of PM 2.5 concentrations. In nu-SVC modelling, we established an optimal decision hyperplane that maximizes the distance between the two types of samples on both sides of the plane, thus providing a good generalization capability for classification problems. For a multidimensional sample set, the system randomly generates a hyperplane and keeps it moving, classifying the samples until the sample points belonging to different classes in the training sample are located exactly on both sides of the hyperplane. There may be many hyperplanes satisfying the condition. SVC formally seeks such a hyperplane while guaranteeing classification accuracy, so that the blank area on both sides of the hyperplane is maximized, thereby achieving an optimal classification of linear separable samples. The parameters of nu-SVC in our research are shown in Table 4.

Related Sites and Backtracking Points
For the site-related concept, we used 1 h predictions of PM 2.5 concentration based on spatial location information and wind speed, which can be calculated as follows. First, we calculated the correlation coefficient matrix of the PM 2.5 concentrations between monitoring sites. Second, we created a coordinate and put all the sites into it by referring to their latitudes and longitudes, and then calculated the distance between the monitoring. As can be seen from these two appendices, the correlation of the PM 2.5 concentration between the two sites did not necessarily decrease as the distance between them decreased. Third, we calculated a backtracking point for each record of each site based on a vector whose starting point is the predicted location, whose direction is the opposite of the wind direction, and whose length is the wind speed multiplied by one hour.. Then, we used this vector as a center and drew a fan shape with left and right deviations of 15 degrees. We then selected the sites within the fan shape that have either the highest correlation with the PM 2.5 concentrations of the predicted site or the shortest distance with the predicted site as the backtracking point. (As for the 6 h and 24 h predictions, because of errors in wind direction and speed, the simplification of wind speed, and the accumulation of these errors in the 6 and 24 h situations, we abandoned the site-related concept in those two groups of experiments.) Finally, we obtained the PM 2.5 concentrations of the backtracking point an hour earlier as a basis to predict the PM 2.5 concentrations of the original site. As Figure 4 shows, we assumed that a north wind blew at site A, and the wind speed was 1 m/s. We then constructed a vector, and B was the backtracking point of site A in this record. So, we considered the PM 2.5 concentration values of site B as an independent variable in the prediction of the PM 2.5 concentrations for site A. Furthermore, to decide whether to choose a backtracking point with a correlation coefficient or a distance, we experimented with both ways of selecting a backtracking point and compared the error of the two ways.
sphere 2022, 13, x FOR PEER REVIEW 7 of 12 experiments.) Finally, we obtained the PM2.5 concentrations of the backtracking point an hour earlier as a basis to predict the PM2.5 concentrations of the original site. As Figure 4 shows, we assumed that a north wind blew at site A, and the wind speed was 1 m/s. We then constructed a vector, and B was the backtracking point of site A in this record. So, we considered the PM2.5 concentration values of site B as an independent variable in the prediction of the PM2.5 concentrations for site A. Furthermore, to decide whether to choose a backtracking point with a correlation coefficient or a distance, we experimented with both ways of selecting a backtracking point and compared the error of the two ways.

Experiments and Discussion
From the official departments in Shenyang, we obtained a total of 22,528 datasets across 11 monitoring stations from 01:00 on 1 January to 08:00 on 26 March 2022; the data were recorded hourly. Because there were missing values, we selected 22,421 sets of data. We applied ten-fold cross-validation (CV) to obtain the prediction results and validate the process. In order to evaluate the model, the data were arranged chronologically, and the last 30% was taken as a hold-out test set. The remaining 70% of the data were used to train the model, and a 10-fold CV was applied. The averages of the CV evaluations provided evidence of the model's effectiveness on the test dataset. Using average evaluations, we tuned the model's parameters. Regarding the experimental software and tools, we adopted Weka 3.8.2 for clustering, logistic regression (LR), and MLPNN, as well as Pycharm 2017.1 for SVR and SVC. For GP, we used GPTIPS, an extension pack of MATLAB. Our MATLAB version was R2017b.

Site Clustering
To predict the PM2.5 concentrations more accurately, we first used the shortest distance clustering method to cluster the correlation coefficients of PM2.5 concentrations at each site. The distance between the clusters was set at 0.3, and the 11 monitoring sites were divided into eight clusters, as shown in Table 5.

Experiments and Discussion
From the official departments in Shenyang, we obtained a total of 22,528 datasets across 11 monitoring stations from 01:00 on 1 January to 08:00 on 26 March 2022; the data were recorded hourly. Because there were missing values, we selected 22,421 sets of data. We applied ten-fold cross-validation (CV) to obtain the prediction results and validate the process. In order to evaluate the model, the data were arranged chronologically, and the last 30% was taken as a hold-out test set. The remaining 70% of the data were used to train the model, and a 10-fold CV was applied. The averages of the CV evaluations provided evidence of the model's effectiveness on the test dataset. Using average evaluations, we tuned the model's parameters. Regarding the experimental software and tools, we adopted Weka 3.8.2 for clustering, logistic regression (LR), and MLPNN, as well as Pycharm 2017.1 for SVR and SVC. For GP, we used GPTIPS, an extension pack of MATLAB. Our MATLAB version was R2017b.

Site Clustering
To predict the PM 2.5 concentrations more accurately, we first used the shortest distance clustering method to cluster the correlation coefficients of PM 2.5 concentrations at each site. The distance between the clusters was set at 0.3, and the 11 monitoring sites were divided into eight clusters, as shown in Table 5.

Regression
The performances of LR, SVR, MLPNN, and GP were compared by predicting the PM 2.5 index at different time intervals of 1 h, 6 h, and 24 h. In the 1 h interval group, we considered the influence of the backtracking point. We used the root-mean-square error (RMSE) to evaluate the pros and cons of the model, as demonstrated by the RMSE = ∑(predictPM−actualPM) 2 n . A lower RMSE indicated that the predicted value was closer to the actual value and that the model was more accurate.
The purpose of the parameter settings of all the predicting methods was that the RMSE was the lowest, while the time and space complexities were in an acceptable range. The parameters are shown in Tables 6 and 7. The parameters that were not listed would not likely affect the results; thus, we used the default values. For the sake of brevity, we only included experimental results of 1 h prediction, as shown in Table 8. To decide whether to choose a backtracking point by using a correlation coefficient or distance, we experimented on both ways and compared their errors. In Table 8, "c" denotes using a correlation coefficient to obtain the backtracking point and "d" denotes using distance to obtain the backtracking point. As can be seen from the results, using a correlation coefficient to obtain the backtracking point was better than using distance. The RMSEs not marked with 'c' or 'd' mean that the backtracking points obtained using the correlation coefficient and distance were the same. The results of 1 h intervals indicate that without the backtracking point-based features, the approximate RMSEs of both the LR and the MLPNN were 11 to 27, the approximate RMSE of the SVR was 30 to 60, and the approximate RMSE of GP was 10 to 26, which was lower than other methods. When the related site concept was added, almost every method was reduced, but the reduction was not obvious. For the 6 h interval, the RMSEs of LR, MLPNN, SVR, and GP were separately approximately 28 to 51, 24 to 44, 33 to 60, and 23 to 44, respectively. The RMSEs of the four methods in the 24 h interval were 27 to 50 (LR), 26 to 44 (MLPNN), 33 to 60 (SVR), and 25 to 43 (GP). It can be inferred from the results that GP performed better than all the other methods with all the different time intervals; however, the reductions caused by related site features were observed but not evident. Furthermore, when the time interval increased, the RMSEs of all the methods increased. The predicted RMSEs of some sites were obviously higher than others in each time interval and each method, which was caused by data fluctuations at those sites. Notes: the term "LR with BP" indicates the results of a logistic regression with backtracking points. Other terms "with BP" or "without BP" are in the same situation. The symbol "(c)" means that a correlation coefficient is used to obtain the backtracking point; the symbol "(d)" means that distance is used to obtain the backtracking point.

Tendency Classification
Using nu-SVC, we predicted whether PM 2.5 concentrations rose or fell in an hour. The measurement was defined as Recall = TP TP+FN , i.e., a kind of variation in the original index recall from a confusion matrix, calculated as the ratio between the number of positive samples correctly classified as positive and the total number of positive samples. In this scenario of predicting PM 2.5 concentrations, people care more about the rises than the falls. Therefore, in this paper, we looked more at the prediction accuracy when the concentration rises. Let T represent that the PM 2.5 index rises or is equal and F represents that it falls, TP indicates that the prediction is T and the actual answer is T, and FN indicates that the prediction is F and the actual answer is F. Consequently, it is evident the higher the recall, the more accurate the model's predictions when air pollution is a concern. We used recall to evaluate our model; the results are shown in Table 9.
Overall, the average recall was approximately 0.7915. It can be inferred that the prediction targeting the tendency of PM 2.5 concentrations had some effects. In particular, recalls were higher at the 1 h interval with a backtracking point than without a backtracking point, but this was not obvious. Moreover, recalls did not decrease significantly when the time interval increased. The symbol "(c)" means that a correlation coefficient is used to obtain the backtracking point; the symbol "(d)" means that distance is used to obtain the backtracking point.

Feature Influence
We used GP to obtain a more intuitive model to analyze the effect of the factors on PM 2.5 concentration. Table 10 lists the most influential factors in each set of experiments in each time interval. The results of the 1 h interval indicate that the most influential factors were the PM 2.5 concentrations in the previous period, the backtracking point, and the wind speed. Because the time interval was short, PM 2.5 concentrations in the previous period and the backtracking point were relatively important for prediction. However, as the time interval increased, other factors began to dominate the prediction, especially wind speed. The symbol "(c)" means that a correlation coefficient is used to obtain the backtracking point; the symbol "(d)" means that distance is used to obtain the backtracking point.

Conclusions
Based on our experiments, we can draw many conclusions. First, GP dominated LR, MLPNN, and SVR in predicting PM 2.5 concentrations at time intervals of 1 h, 6 h, and 24 h, and the root-mean-square errors of GP were 10 to 26 (1 h), 23 to 44 (6 h), and 25 to 43 (24 h). Second, the new spatiotemporal feature produced a slight reduction in all the methods, with values ranging between 0.0001 and 0.01. We think the possible reasons are one or more of the following: (a) the wind direction and wind speed are constantly changing, which causes some inaccuracy in the choice of the backtracking point; (b) adapting the backtracking point concept to each method creates some differences; and (c) the concept of the backtracking point is itself incorrect or has some other insufficiency. Third, the PM 2.5 concentration tendency prediction based on SVC was relatively satisfactory with an average recall of 0.7915. Finally, GP provides an intuitive model for analyzing the factors which affect predictions. We found that when the time interval was relatively short, the most influential features were the PM 2.5 concentrations in the previous period and the backtracking points. Nevertheless, as the time interval increased, other factors, such as wind speed and the location of the predicted site, had a more significant influence on the prediction.
Our contributions in this study are as follows: (a) we innovatively applied GP to predict the PM 2.5 concentrations and verified the advantage of that method over LR, MLPNN, and SVR; (b) we innovatively proposed the backtracking point concept based on wind direction, wind speed, the correlation coefficient, and spatial geographic information; (c) we used SVC to predict the tendency of PM 2.5 concentrations and achieved some progress; and (d) we used GP to identify influential factors for PM 2.5 concentration prediction.