PM 2.5 Concentration Prediction Based on Spatiotemporal Feature Selection Using XGBoost-MSCNN-GA-LSTM

Abstract: With the rapid development of China's industrialization, air pollution is becoming more and more serious. Predicting air


Introduction
With the increase of environmental pollution, haze is spreading across China's major cities, and PM 2.5 has become a major air-pollution problem. Recent studies have shown that PM 2.5 leads to respiratory diseases, immune diseases, cardiovascular and cerebrovascular diseases, and tumors [1,2]. Accurate prediction and early warning of PM 2.5 concentration are therefore of great significance. Many scholars have begun to integrate multiple data features, but too many data and factor features degrade the prediction effect, and redundant features hurt model performance. Therefore, many scholars have turned to feature selection before prediction; in power systems, for example, a cooperative search algorithm has been used to select features. However, most existing researches are limited to cities in specific regions and ignore the predictive performance of the model itself, resulting in poor applicability and transferability of the models used.
The main contributions of this paper are as follows: (1) In terms of the research object, the air quality of the Fenwei Plain is worse than that of other regions in China, so it is typical to predict and analyze the PM 2.5 concentration of cities in this region. This paper predicts the PM 2.5 concentration of 12 cities in the region, and the simulation and comparison across the 12 cities verify the portability and applicability of the study. (2) In terms of the prediction model, Pearson correlation analysis and XGBoost are first used to select PM 2.5 features, solving the problem of feature redundancy. The optimal features are then extracted through one-dimensional multi-scale convolution kernels, capturing the local temporal and spatial feature relationships in the air-quality data. Next, the parameters of the LSTM are optimized by a genetic algorithm to improve model accuracy, and finally the extracted features are input into the LSTM for prediction. The resulting XGBoost-MSCGL (XGBoost-MSCNN-GA-LSTM) model is proposed to improve PM 2.5 prediction for the Fenwei Plain. The combined model constructed in this paper not only conforms to the temporal characteristics of the prediction data and solves the feature redundancy and insufficient accuracy of traditional machine-learning models, but also follows the optimal-and-simplest principle in its nesting. (3) In terms of prediction results, the experiments also discuss hourly PM 2.5 concentration prediction under the influence of different features. The results show that appropriate input features help to improve the prediction accuracy of the model, and repeated tests show that the combined prediction model proposed in this paper is more accurate than a single deep-learning model.
After many experiments, the prediction results of XGBoost-MSCGL are found to be better than those of the XGBoost-CNN, XGBoost-LSTM, XGBoost-MLP, and XGBoost-CNN-LSTM models. The advantages of the proposed model are verified from multiple angles and with multiple evaluation indexes, and the experimental results show that the proposed model has good robustness.

Study Area
The Fenwei Plain is the general name of the Fenhe Plain, the Weihe Plain, and their surrounding terraces in the Yellow River Basin. It ranges from Yangqu County in Shanxi Province in the north to the Qinling Mountains in Shaanxi Province in the south, and to Baoji City in Shaanxi Province in the west. It runs in a northeast-southwest direction, about 760 km long and 40-100 km wide. It has a population of 55,5445 and includes Xi'an, Baoji, Xianyang, Weinan, and Tongchuan in Shaanxi Province; Taiyuan, Jinzhong, Lvliang, Linfen, and Yuncheng in Shanxi Province; and Luoyang and Sanmenxia in Henan Province. Since 2019, the Fenwei Plain has remained the area with the highest PM 2.5 concentration in China. The average PM 2.5 concentration in autumn and winter is about twice that of the other seasons, and the days of heavy pollution in autumn and winter account for more than 95% of the whole year [31]. In 2020, the average PM 2.5 concentration in the Fenwei Plain was 70 µg/m³, and serious pollution occurred on 152 days. Since December 2013, the China Environmental Protection Agency (EPA) has published open air-quality observation data from China's ground monitoring stations. The study data in this article are the hourly concentration data set of atmospheric pollutants (PM 2.5, PM 10, NO 2, SO 2, O 3, CO) for the 12 cities of Xi'an, Baoji, Xianyang, Weinan, Tongchuan, Taiyuan, Jinzhong, Lvliang, Linfen, Yuncheng, Luoyang, and Sanmenxia from 1 January 2020 to 31 December 2020; Table 1 lists the atmospheric pollutant factors of the PM 2.5 concentration prediction model [32]. In total there are 2,838,240 records of air-quality and meteorological data for the 12 cities.

Meteorological Data
The meteorological data of this paper come from the Chinese weather website platform. As shown in Table 2, after data preprocessing, 21 types of meteorological factors are selected in this paper: average surface temperature, maximum surface temperature, minimum surface temperature, daily average wind speed, daily maximum wind speed, daily maximum wind direction, maximum wind speed, maximum wind direction, daily precipitation of maximum wind speed, 20-8 h (mm) precipitation, 8-20 h (mm) precipitation, 20-20 h (mm) precipitation, average temperature, maximum temperature, minimum temperature, daily average pressure, daily maximum pressure, daily minimum pressure, sunshine hours, daily average relative humidity, daily minimum relative humidity, and season. The data set must be divided before it is input to the model for training; otherwise the prediction model has no held-out data for evaluation, and the training results may overfit because all data were used for training. In the experiment, each data set is first divided into a training set and a test set, after which the training set is further divided into a training set and a validation set, giving a training : test : validation ratio of 6:2:2. The training set is used to learn from the sample data and fit the model's parameters. The validation set is used to determine the network structure or the parameters controlling model complexity, such as the number of hidden units in the neural network. The test set is used to measure the performance of the finally selected optimal model, mainly its resolution (recognition rate, etc.).
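As a minimal sketch of the 6:2:2 split described above (toy placeholder data, not the paper's series), the data can be cut chronologically so the test period follows the training period:

```python
# Chronological 6:2:2 split: 60% train, 20% validation, 20% test.
# 'records' is a stand-in for the hourly air-quality series (hypothetical data).
def split_622(records):
    n = len(records)
    n_train = int(n * 0.6)
    n_val = int(n * 0.2)
    train = records[:n_train]
    val = records[n_train:n_train + n_val]
    test = records[n_train + n_val:]
    return train, val, test

data = list(range(100))                  # placeholder for 100 hourly samples
train, val, test = split_622(data)
print(len(train), len(val), len(test))   # 60 20 20
```

Keeping the split in time order avoids leaking future observations into the training period, which matters for time-series evaluation.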

Raw Data Processing
Identification and Processing of Abnormal Data
Abnormal data may be caused by errors in the process of collecting and recording data. Because abnormal data affect the prediction accuracy of the model, they must be identified and processed. Outlier detection is used to find outliers; here, quartile analysis is applied. First, the first and third quartiles of each variable are computed. If a value is less than the first quartile or greater than the third quartile, it is identified as an outlier. The horizontal processing method is then used to correct the abnormal data.
The horizontal processing method is given in Equations (1) and (2). If

|y_i − y_{i−1}| > ε_a and |y_i − y_{i+1}| > ε_a, (1)

then

y_i = (y_{i−1} + y_{i+1}) / 2, (2)

where y_i represents the concentration of air pollutants on a certain day or hour, y_{i−1} the concentration on the previous day or hour, y_{i+1} the concentration on the next day or hour, and ε_a the threshold.
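A minimal sketch of the quartile screening and the horizontal correction described above (toy data; the quartile rule here is the one stated in the text, i.e. values outside [Q1, Q3] are flagged, and each flagged interior point is replaced by the mean of its neighbours):

```python
import statistics

# Flag values outside [Q1, Q3] and correct each flagged interior point y_i
# with the neighbour average (y_{i-1} + y_{i+1}) / 2, as in Equation (2).
def correct_outliers(series):
    q1, _, q3 = statistics.quantiles(series, n=4)
    fixed = list(series)
    for i in range(1, len(series) - 1):
        if not (q1 <= series[i] <= q3):
            fixed[i] = (series[i - 1] + series[i + 1]) / 2
    return fixed

pm25 = [40, 42, 300, 44, 43]      # 300 is an injected spike (toy data)
print(correct_outliers(pm25))     # [40, 42, 43.0, 44, 43]
```

In practice the threshold ε_a from Equation (1) would gate the correction; the sketch omits it for brevity.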

Data Normalization
Physical quantities such as air pressure and evaporation have different meanings and dimensions, which affects the results when they are input to the prediction model, so such data need to be normalized. Feeding normalized data into the prediction model can effectively reduce training time, accelerate convergence, and further improve prediction accuracy. The normalization formula is shown in Equation (3); this method scales the original data proportionally [33]:

x_norm = (x − x_min) / (x_max − x_min), (3)

where x_norm is the normalized value, x is the original data, x_min is the minimum of the original data, x_max is the maximum of the original data, and the normalized data are constrained to the interval [0, 1].
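Equation (3) can be written directly as a one-line scaler; the pressure values below are hypothetical toy numbers:

```python
# Min-max scaling of Equation (3): x_norm = (x - x_min) / (x_max - x_min).
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

pressure = [980.0, 1000.0, 1020.0]   # toy daily pressure readings (hPa)
print(min_max(pressure))             # [0.0, 0.5, 1.0]
```

Note the scaler assumes x_max > x_min; a constant feature would need special handling.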

XGBoost
XGBoost is an extreme gradient boosting decision-tree machine learning algorithm. It introduces a regularization term during tree generation and prunes at the same time, making the algorithm more efficient and more accurate [34].
XGBoost (eXtreme Gradient Boosting) can be expressed in additive form, as shown in Equation (4):

ŷ_i = Σ_{k=1}^{K} f_k(x_i), f_k ∈ F, (4)

where ŷ_i represents the predicted value of the model, K the number of decision trees, f_k the k-th sub-model, x_i the i-th input sample, and F the set of all decision trees. The objective function of XGBoost consists of two parts, a loss function and a regularization term, as shown in Equations (5) and (6):

L(ϕ)^(t) = Σ_i l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t), (5)

Ω(f_k) = γT + (1/2)λ‖ω‖², (6)

where L(ϕ)^(t) represents the objective function of the t-th iteration, ŷ^(t−1) the predicted value of the (t − 1)-th iteration, and Ω(f_k) the regularization term of the t-th iteration's model, which reduces overfitting; γ and λ are regularization coefficients that prevent the decision tree from being too complicated, and T is the number of leaf nodes of the model.
Expanding the objective function with Taylor's formula gives Equation (7):

L^(t) ≈ Σ_{j=1}^{T} [ (Σ_{i∈I_j} g_i) ω_j + (1/2)(Σ_{i∈I_j} h_i + λ) ω_j² ] + γT, (7)

where g_i represents the first derivative of sample x_i, h_i the second derivative of sample x_i, ω_j the output value of the j-th leaf node, and I_j the sample subset of the j-th leaf node.
It can be seen from Equation (7) that the objective function is convex. Taking the derivative with respect to ω_j and setting it to zero, the objective function reaches its minimum at Equation (8):

ω_j* = − (Σ_{i∈I_j} g_i) / (Σ_{i∈I_j} h_i + λ). (8)

Substituting ω_j* back gives Equation (9), which evaluates the quality of a tree model; the smaller the value, the better the tree:

L* = −(1/2) Σ_{j=1}^{T} (Σ_{i∈I_j} g_i)² / (Σ_{i∈I_j} h_i + λ) + γT. (9)

From this, the scoring formula for splitting a node follows as Equation (10):

Gain = (1/2) [ G_L²/(H_L + λ) + G_R²/(H_R + λ) − (G_L + G_R)²/(H_L + H_R + λ) ] − γ, (10)

where G_L, H_L and G_R, H_R are the sums of g_i and h_i over the left and right child nodes, respectively. Equation (10) is used to calculate the split node of the tree model.
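The leaf-weight and split-gain formulas above can be sketched numerically; the gradient statistics G (sum of g_i) and H (sum of h_i) below are toy values, not results from the paper:

```python
# Optimal leaf weight, Equation (8): w* = -G / (H + lambda).
def leaf_weight(G, H, lam):
    return -G / (H + lam)

# Split gain, Equation (10), from left/right gradient sums.
def split_gain(GL, HL, GR, HR, lam, gamma):
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

print(leaf_weight(G=4.0, H=3.0, lam=1.0))   # -1.0
print(split_gain(GL=2.0, HL=1.0, GR=2.0, HR=2.0, lam=1.0, gamma=0.0))
```

A negative gain means the split does not improve the regularized objective, which is exactly when XGBoost prunes the candidate split.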

One-Dimensional Multi-Scale Convolution Kernel (MSCNN)
Convolutional neural networks have been successfully applied to image recognition, which verifies their strong ability to extract features from feature maps. Analysis of the data set shows that the data are multi-featured and numerical rather than in feature-map form. Therefore, this study preprocesses the data, combines the data's characteristics into a feature map, and inputs it into the convolutional neural network to extract the spatial and temporal characteristics of the air-pollutant concentration data and meteorological factors [35]. The spatiotemporal feature extraction of the single factor PM 2.5 is shown in Figure 1. The feature map is traversed from left to right along the data-feature axis by a one-dimensional multi-scale convolution kernel with stride 1 to complete the convolution operation, and the feature vectors output by different convolution kernels are spliced and fused to obtain the single factor's spatial feature relationships. Along the time axis, the convolution kernel traverses from top to bottom with stride 1, yielding the local trend of the single factor over time. Finally, the spliced and fused feature vectors are merged along the data-feature direction, and the spatiotemporal features of multi-site PM 2.5 are output. The following is the formula derivation of MSCNN's convolution operation on the feature map. The feature map contains N sample data and M air-pollutant factors. The feature-map formulas for a single factor i are shown in Equations (11) and (12):

X_i = [x_1^i, x_2^i, ..., x_N^i]^T, (11)
X_i^{t:t+T−1} = [x_t^i, x_{t+1}^i, ..., x_{t+T−1}^i]^T, (12)

where x_t^i ∈ R represents the value of the single factor i at time t, X_i^{t:t+T−1} represents the T-row vector of X_i in the time window [t, t + T − 1], and the superscript T denotes the matrix transpose.
The convolution operation multiplies the weight matrix W_j by X_i^{t:t+T−1}.
(1) Single-factor spatial feature relationship: multiply W_j by X_i^{t:t+T−1} on the data-feature axis.
(2) Single-factor time-change feature: multiply W_j by X_i^{t:t+T−1} on the time axis.
When a convolution kernel traverses the entire feature map along the time axis with stride 1, a feature vector a_i^j of size N − T + 1 is obtained. The feature vectors obtained by Z convolution kernels are merged along the data-feature direction into a matrix A_i of size [N − T + 1] × Z, where A_i represents the single-factor spatiotemporal feature matrix, as shown in Equations (13) and (14).
So far, single-factor spatiotemporal feature extraction has been completed, but the data set also contains other factors, such as NO 2, SO 2, and CO, for a total of M factors. The same operation is applied to each of the M factors to extract its single-factor spatiotemporal feature matrix, and the matrices are then linearly spliced and fused to form the multi-factor fused spatiotemporal feature matrix A, as shown in Equation (15):

A = [A_1, A_2, ..., A_M]. (15)

Based on the MSCNN convolutional neural network, the spatiotemporal characteristics of the air-quality data are extracted. The method applies a simple transformation to the two-dimensional feature map to form side-by-side one-dimensional feature maps, which gives the network better generalization ability during training. Meanwhile, the automatic feature extraction of the convolutional neural network replaces the traditional manual feature-selection method, making feature extraction more comprehensive and deeper.
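A toy sketch of the multi-scale 1D convolution along the time axis (pure Python, hypothetical kernel weights): kernels of different widths T slide with stride 1 over one factor's series, and each produces a vector of length N − T + 1, which are then concatenated as the text describes.

```python
# Valid 1D convolution with stride 1: output length is N - T + 1.
def conv1d(series, kernel):
    T = len(kernel)
    return [sum(series[t + k] * kernel[k] for k in range(T))
            for t in range(len(series) - T + 1)]

# Apply several kernels of different scales and collect their outputs.
def multi_scale(series, kernels):
    return [conv1d(series, k) for k in kernels]

pm25 = [1.0, 2.0, 3.0, 4.0, 5.0]             # N = 5 hourly values (toy)
feats = multi_scale(pm25, [[1.0, 1.0],        # width-2 smoothing kernel
                           [1.0, 0.0, -1.0]]) # width-3 difference kernel
print([len(f) for f in feats])                # [4, 3]  (N - T + 1 per kernel)
```

In the actual model the kernels are learned weight matrices over all input factors; here they are fixed scalars purely to show the sliding-window arithmetic.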

Genetic Algorithm
The genetic algorithm performs crossover and mutation operations on feasible solutions in the population, so its objective function need not be differentiable or continuous. The genetic algorithm applies a probabilistic optimization method to automatically obtain and guide the optimized search space and to adaptively adjust the search direction. It is simple, universal, and suitable for parallel processing. The specific steps of the algorithm are shown in Figure 2. The GA process can be divided into six stages: initialization, fitness calculation, checking termination conditions, crossover, selection, and mutation. In the initialization phase, chromosomes are chosen arbitrarily in the search space, and the fitness of each chromosome is then determined by the preset fitness function. For optimization algorithms such as GA, the fitness function is a key factor affecting model performance. Chromosomes are randomly selected based on their fitness, so dominant chromosomes have a higher chance of being inherited by the next generation. The selected dominant chromosomes produce offspring through the exchange of similar segments and changes in gene combinations.
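As a minimal sketch of those six stages (toy fitness on an 8-bit chromosome, not the paper's LSTM-hyperparameter encoding), a GA loop can be written as:

```python
import random

random.seed(0)

# Toy fitness: chromosomes decode to an integer; the peak is at x = 37.
def fitness(bits):
    x = int("".join(map(str, bits)), 2)
    return -(x - 37) ** 2

def evolve(pop_size=20, n_bits=8, gens=60, p_mut=0.05):
    # Initialization: random bit-string chromosomes.
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(gens):                      # termination: fixed generation count
        pop.sort(key=fitness, reverse=True)    # fitness calculation + ranking
        parents = pop[: pop_size // 2]         # selection: keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_bits)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each gene with probability p_mut.
            child = [1 - g if random.random() < p_mut else g for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(int("".join(map(str, best)), 2))
```

In the paper's setting, the chromosome would instead encode LSTM parameters (e.g. layer sizes, learning rate), and fitness would be validation error.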

LSTM
Long Short-Term Memory (LSTM) is an improvement of the Recurrent Neural Network (RNN) [36]. RNNs suffer from gradient vanishing and gradient explosion during training and have a long-term dependence problem; LSTM can effectively solve these problems. LSTM introduces a gate mechanism, which gives it longer-term memory than an RNN and makes learning more effective. In LSTM, each neuron is equivalent to a memory cell (cell, c_t), and LSTM controls the state of the memory cell through the "gate" mechanism, thereby adding or deleting the information in it. The structure of LSTM is shown in Figure 3. In the LSTM cell structure, the Input Gate (i_t) determines what information is added to the cell, the Forget Gate (f_t) determines what information is deleted from the cell, and the Output Gate (o_t) determines what information is output from the cell. In the complete training process of LSTM, at each time t the three gates receive the input vector x_t, the hidden state h_{t−1} of the LSTM at time t − 1, and the information of the memory unit, and perform logical operations on the received information: the activation function σ decides whether to activate i_t, the processing results of the input gate and the forget gate are combined to generate a new memory unit c_t, and the final output h_t is obtained through the nonlinear operation of the output gate. The calculation formula for each process is shown in Equations (16)-(20).
Input Gate: i_t = σ(W_xi x_t + W_hi h_{t−1} + b_i). (16)
Forget Gate: f_t = σ(W_xf x_t + W_hf h_{t−1} + b_f). (17)
Output Gate: o_t = σ(W_xo x_t + W_ho h_{t−1} + b_o). (18)
Memory unit (internal hidden state): c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t−1} + b_c). (19)
Hidden state: h_t = o_t ⊙ tanh(c_t). (20)

Among them, σ generally represents a nonlinear activation function, such as a sigmoid or tanh function; W_xi, W_xf, W_xo, W_xc are the weight matrices connecting each layer to the input vector x_t; W_hi, W_hf, W_ho, W_hc are the weight matrices connecting each layer to the previous short-term state h_{t−1}; and b_i, b_f, b_o, b_c are the bias terms of each layer's nodes. In short, the input gate in LSTM can identify important inputs, and the forget gate can reasonably retain important information and extract it when needed, so LSTM can effectively identify long-term patterns such as those in time series, making training converge faster.

Figure 4 shows the XGBoost-MSCGL process. First, the atmospheric-pollutant data and meteorological data are normalized and missing values are processed. Second, Pearson analysis measures the correlation of the original data and XGBoost selects features by importance. The selected features are then input into the MSCNN, which extracts the temporal and spatial features of the data. At the same time, GA optimizes the parameters of the LSTM: the best fitness output of the chromosomes is used as the global optimal parameter combination of the LSTM network, and the spatiotemporally extracted features are input into the optimized LSTM for prediction. Finally, to better verify the effect of the model, combined models such as XGBoost-MLP, XGBoost-LSTM, and XGBoost-CNN are used for comparison, evaluated with RMSE, MAE, MAPE, and other indicators.
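Equations (16)-(20) can be traced with a single scalar LSTM step; all weights below are toy scalars (hypothetical values), standing in for the weight matrices of the full model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One LSTM step: gates i_t, f_t, o_t (Eqs. 16-18), memory update c_t (Eq. 19),
# hidden state h_t (Eq. 20). W holds the toy scalar weights and biases.
def lstm_step(x_t, h_prev, c_prev, W):
    i_t = sigmoid(W["xi"] * x_t + W["hi"] * h_prev + W["bi"])
    f_t = sigmoid(W["xf"] * x_t + W["hf"] * h_prev + W["bf"])
    o_t = sigmoid(W["xo"] * x_t + W["ho"] * h_prev + W["bo"])
    g_t = math.tanh(W["xc"] * x_t + W["hc"] * h_prev + W["bc"])
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

W = {k: 0.5 for k in ("xi", "hi", "bi", "xf", "hf", "bf",
                      "xo", "ho", "bo", "xc", "hc", "bc")}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, W=W)
print(round(h, 4), round(c, 4))   # 0.3696 0.5568
```

Because c_prev = 0 here, the forget gate has no effect on this first step; feeding h and c back in for the next x_t would exercise the full recurrence.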

Evaluation Index
To measure the accuracy of the prediction model, this paper uses the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) as evaluation indicators, shown in Equations (21)-(23):

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)² ), (21)
MAE = (1/N) Σ_{i=1}^{N} |ŷ_i − y_i|, (22)
MAPE = (100%/N) Σ_{i=1}^{N} |(ŷ_i − y_i)/y_i|. (23)
Where ŷ_i represents the predicted value, y_i the true value, and N the number of test samples. The ranges of RMSE, MAE, and MAPE are all [0, +∞). Generally, the larger the RMSE and MAE, the greater the error and the lower the prediction accuracy of the model. MAPE is the most intuitive accuracy criterion: when MAPE tends to 0%, the model is perfect, and when MAPE tends to 100%, the model is inferior. Generally, prediction accuracy can be considered high when the MAPE is less than 10% [37]. R² measures how well the model fits the sample values and can test the model's prediction ability: the closer to 1, the better the fit; the closer to 0, the worse the fit.
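The four indices can be computed directly from their definitions; the true/predicted values below are toy numbers, not the paper's results:

```python
import math

# RMSE, Equation (21).
def rmse(y, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

# MAE, Equation (22).
def mae(y, p):
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)

# MAPE in percent, Equation (23).
def mape(y, p):
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, p)) / len(y)

# Coefficient of determination R^2.
def r2(y, p):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

y_true = [50.0, 60.0, 80.0]
y_pred = [55.0, 58.0, 82.0]
print(round(rmse(y_true, y_pred), 3), round(mae(y_true, y_pred), 3))  # 3.317 3.0
```

Note MAPE is undefined when a true value is zero, which is why it suits strictly positive series such as PM 2.5 concentration.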

Analysis of Factor Characteristics
In order to better analyze the characteristics of the model input factors, the Pearson correlation method is used. As shown in Figure 5, in Yuncheng the factors strongly correlated with PM 2.5 are PM 10 (0.7) and CO (0.8); average humidity (0.5), minimum humidity (0.5), and season (0.5) are moderately positively correlated; and average temperature (−0.5), maximum temperature (−0.5), and minimum temperature (−0.4) are moderately negatively correlated.
Meteorological elements affect air quality by affecting the accumulation, diffusion, and elimination of pollutants. Studies of PM 2.5 and PM 10 concentration find them closely related to meteorological elements such as temperature, precipitation, and wind speed. According to existing studies, relative humidity is a key factor for fine-particle concentration [38]: at higher relative humidity, pollutants attach more easily to water vapor, and aqueous solutions are a favorable medium for chemical reactions [39]. Wind direction and speed affect the dispersion of particulate matter in the air [40]. Chen et al. predicted PM 2.5 concentration in Zhejiang Province and found that meteorological factors such as air temperature, air pressure, evaporation, and humidity are remarkably correlated with PM 2.5 concentration [41]. Zhang Zhifei et al. found that hourly O 3 mass concentration is positively correlated with air temperature, solar radiation, visibility, and wind speed, whereas NO 2 concentration is positively correlated with relative humidity and atmospheric pressure [42]. Precipitation [43,45], season [44], sunshine duration [46], and other factors also have remarkable impacts on the concentrations of air pollutants. Different city characteristics likewise have different impacts on PM 2.5: the correlation coefficients of PM 10 and CO in Jinzhong and Linfen are 0.9, while in Lvliang the two numbers are 0.6 and 0.7. The correlation coefficients of the 12 cities show that temperature, surface temperature, atmospheric pressure, air humidity, and sunshine duration all affect PM 2.5, and further analysis is needed to select appropriate features for the model.
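The Pearson coefficient used throughout this analysis follows the standard definition; the PM 2.5 / CO series below are toy values for illustration:

```python
import math

# Pearson correlation: covariance of the two series divided by the product
# of their standard deviations.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

pm25 = [35.0, 50.0, 80.0, 120.0]   # toy PM2.5 series
co = [0.6, 0.8, 1.1, 1.6]          # toy CO series
print(round(pearson(pm25, co), 3)) # 0.999
```

Values near +1 or −1 correspond to the "strong" correlations discussed above; values near ±0.5 to the "moderate" ones.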

Feature Selection
Through Pearson analysis, it is found that, in addition to the traditional six atmospheric pollutants, meteorological factors such as surface temperature, temperature, sunshine duration, and humidity are also main factors affecting PM 2.5 concentration. Unrelated and redundant factors may obscure the role of important factors, so the raw data need to be mined and refined.

Feature Importance Sorting Principle
The traditional GBDT algorithm uses only the first derivative, while XGBoost expands the error function with a second-order Taylor series, using both first-order and second-order derivatives. XGBoost also uses column sampling of features to select the proportion of features used in training, effectively preventing over-fitting, and its parallel approximate histogram algorithm for computing feature-split gain can make full use of multi-core CPUs. Traditional feature-selection models iterate continuously, generating a new tree after each iteration; on complex data sets they may iterate hundreds of thousands of times and are therefore inefficient. To overcome this disadvantage, the XGBoost algorithm builds models with regression trees. This system is based on the Boosting algorithm, which has made great breakthroughs in prediction accuracy and training speed. In fact, XGBoost selects the split-point feature based on the gain of the structure score. The importance of a feature is the sum of the times it occurs in all trees: the more an attribute is used to build decision trees in a model, the more important it is. With gradient boosting it is relatively easy to retrieve the importance of each attribute after building the boosted trees. In general, importance is a score indicating how useful or valuable each attribute is in building the boosted decision trees of the model; the more an attribute is used for key decisions in the trees, the higher its relative importance.
This importance is explicitly calculated for each attribute in the data set, so attributes can be ranked and compared with one another. The importance within a single decision tree is calculated from the performance improvement at each attribute's split points, weighted by the number of observations for which the node is responsible.

Experimental Process and Analysis of Feature Selection
We conduct feature filtering on part of the training set, dividing the data into training and validation sets. First, we build an XGBoost model containing all features of the training set, use five-fold cross-validation to find the optimal parameters, and sort the features by Fscore. Then we filter the sorted feature set: we evaluate whether each feature should be preserved based on its Fscore value and delete the lowest-scoring features one by one. The AUC value of the validation set under the new feature subset is used to judge whether the remaining features predict better. Both the number of features and the improvement to the model should be considered when selecting features; since some features improve the model only marginally, the experiment should use the features with the greatest impact on PM 2.5 concentration prediction. A threshold h is set (its exact value is chosen according to the experimental results) to select features. If the AUC value of the validation set increases by more than h, the recently deleted features are saved; if the AUC increases by less than h or decreases, the deleted features remain removed. The algorithm can thus filter out the features with the greatest impact on the target variable and reduce redundancy among features.
As shown in Figure 6, features are filtered by XGBoost. We use the "importance_type = gain" method to calculate feature importance, and five-fold cross-validation with grid search to find the optimal parameters of the XGBoost model. The parameters of the XGBoost algorithm are set according to the feature weights. The importance of a feature can serve as a model explanatory value; this method represents the average gain obtained when the feature is used as a split point across all trees.
In all trees, the number of times a feature is used to split nodes is its Weight, and the total gain the feature brings over all of its splits is its Total_gain. The FScore formula is shown in Equation (25):

FScore = Total_gain / Weight. (25)

The average gain is calculated as Equation (26). XGBoost selects the split-point feature based on the increase of the structure score. The importance of a feature is the sum of the number of times it occurs in all trees; the more an attribute is used to construct decision trees in a model, the more important it is. XGBoost ranks the feature importance, and Figure 6 lists the top 10 features for each of the 12 cities, whose PM 2.5 feature importances differ. The features are sorted by importance with XGBoost, and the threshold h is set to 0.002. As shown in Figure 7, the y-axis represents each city and the x-axis represents each feature; the numbers in the boxes give the feature-importance values in different cities, and the color depth of a box represents the size of the Fscore: the darker the color, the more important the feature. The top 10 feature importances of the 12 cities are listed in the chart. Consistent with the earlier Pearson correlation analysis, strongly correlated air-pollutant features such as PM 10 and CO rank first and second in all 12 cities, while strongly negatively correlated factors such as maximum temperature and average wind speed also have high importance. The feature importance of PM 2.5 varies between cities. We input the filtered features into MSCNN-GA-LSTM.
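The gain-based ranking and the threshold filter described above can be sketched as follows; the importance numbers and the four feature names are toy values, not the paper's results:

```python
# Rank features by FScore = Total_gain / Weight (Equation 25) and drop
# everything below the threshold h = 0.002 used in the text.
def filter_features(total_gain, weight, h=0.002):
    fscore = {f: total_gain[f] / weight[f] for f in total_gain}
    kept = [f for f, s in sorted(fscore.items(), key=lambda kv: -kv[1]) if s >= h]
    return kept, fscore

total_gain = {"PM10": 4.2, "CO": 3.1, "season": 0.09, "sunshine": 0.0004}
weight = {"PM10": 30, "CO": 25, "season": 40, "sunshine": 10}
kept, fscore = filter_features(total_gain, weight)
print(kept)   # ['PM10', 'CO', 'season']
```

With a real model, Total_gain and Weight would come from the trained booster's per-feature statistics rather than hand-set dictionaries.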

Model Comparison before and after Feature Selection
At the beginning of this section, we evaluate the performance of the different models using their predictions on the test set. Figures 8 and 9 show the simulated prediction results of PM 2.5 in the 12 cities using nine models. First, the PM 2.5 test-set data are input into the four trained single models, and the predicted hourly PM 2.5 concentrations are compared with the measured values. The predictions are generally close to the measured values, but when the measured hourly PM 2.5 concentration increases rapidly, the predicted values deviate from the measured values significantly. This may be due to feature redundancy and the influence of spatiotemporal characteristics: accurate prediction is difficult if the model is not trained on filtered feature values. The MLP model behaves like the LSTM model in that the predicted values deviate greatly from the measured values when the measured values increase or decrease sharply. The main reason the XGBoost model is not efficient is that it cannot achieve accurate prediction on time-series data: when the measured values are small, its predicted PM 2.5 concentrations are consistent with the measured values, but when the measured values are large, the predicted values exceed the measured values. Comparing the predicted PM 2.5 concentrations of the four single models across the 12 cities, the LSTM model gives the best results. Next, the PM 2.5 test-set data are input into the five trained combination models, and the predicted hourly PM 2.5 concentrations of the 12 cities are compared with the measured values shown in Figures 8 and 9. In the figures, the predicted values of the XGBoost-MSCGL model are consistent with the measured values; even when some individual hourly PM 2.5 concentrations increase or decrease sharply, the predicted values stay close to the measured values.
XGBoost-LSTM behaves similarly to the XGBoost-MSCGL model in that when the measured value increases or decreases sharply, the predicted value deviates only slightly from the measured value, but its results are slightly worse than those of XGBoost-MSCGL. For the XGBoost-MLP model, when the measured value is very high or very low, the predicted value deviates more from the measured value and is smaller than it. The CNN-LSTM model performs better when the measured value increases or decreases sharply, but compared with the other eight models its overall prediction effect is the worst, and for average PM 2.5 concentration its predicted values exceed the measured values. Comparing XGBoost-MLP, XGBoost-LSTM, XGBoost-CNN, and XGBoost-MSCGL with CNN, LSTM, MLP, and CNN-LSTM, we found that the predicted values of the models after feature selection are closer to the measured values than before feature selection, with a marked increase in accuracy and a marked decrease in deviation. Comparing the predicted hourly PM 2.5 concentrations of the nine models with the corresponding measured values, the XGBoost-MSCGL model has the best prediction effect.

Model Accuracy Evaluation
The accuracy of the models was evaluated by RMSE, MAPE, MAE, and R2: the smaller the RMSE, MAPE, and MAE, and the larger the R2, the higher the accuracy of the model. To better assess the error, prediction effect, and prediction accuracy of the nine models, we evaluated the performance of each model in each city with these four indexes, as shown in Table 4. Among the nine models predicting hourly PM2.5 concentration, XGBoost-MSCGL had the best prediction effect: its averages over the 12 cities, RMSE (8.26), MAE (5.6), and MAPE (9.9%), were the lowest, and its R2 (0.95) the highest. The XGBoost model had the worst predictive effect of the nine, with average RMSE (21.67), MAE (15.25), and MAPE (31.94%) over the 12 cities, and the smallest R2 (0.69) of the nine models. The average correlation coefficient R2 of the four single models was 79.07%, which may be related to the unstable time series of PM2.5 concentration and the lack of feature screening during model building, preventing further improvement of model accuracy. Judging from the prediction effect after feature selection, combining XGBoost-based feature selection with a single model remarkably improves the overall prediction. Compared with CNN, LSTM, MLP, and CNN-LSTM, the RMSE values of XGBoost-CNN, XGBoost-LSTM, XGBoost-MLP, and XGBoost-MSCGL in the 12 cities decreased by 13%. By analyzing the predicted data of the 12 cities in the Fenwei Plain, we note that the prediction models differ in how well they reduce errors and improve the consistency of changes in different cities. The prediction error may be related to the differing characteristics of the chosen cities and to the differing dispersion of air-pollutant concentration values in each season.
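The four evaluation indexes follow their standard definitions and can be computed directly from predicted and measured series. A minimal NumPy sketch (the function names are ours, not the paper's):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes no zero measured values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

Since RMSE squares the residuals before averaging, it always satisfies RMSE >= MAE and penalizes the large deviations at sharp fluctuations more heavily.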
Training and validating with the four deep-learning combined models shows that XGBoost-MSCGL has the highest prediction accuracy on most city training sets, and its prediction performance is better than that of the other models. The RMSE, MAE, and MAPE indicators show that XGBoost-MSCGL outperforms XGBoost-MLP, XGBoost-LSTM, and XGBoost-CNN: across the 12 cities, RMSE, MAE, and MAPE decreased by 11.11%, 15.97%, and 15.36%, respectively. The MAPE values of XGBoost-LSTM in Xianyang, XGBoost-MLP in Weinan, and XGBoost-CNN in Jin are only slightly higher than those of XGBoost-MSCGL, while the errors of XGBoost-MSCGL in Xi'an, Taiyuan, Sanmenxia, and other cities declined significantly. Overall, XGBoost-MSCGL has the smallest error values among the four combined prediction models, outstanding performance, and the best prediction effect.
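The reported reductions are relative decreases of each error index against the corresponding baseline model. As a small illustration (our helper, with made-up example values):

```python
def pct_decrease(baseline, improved):
    """Relative decrease of an error index (e.g. RMSE) in percent,
    measured against a baseline model's value."""
    return (baseline - improved) / baseline * 100.0
```

For example, reducing a baseline RMSE of 20 to 15 is a 25% decrease; the 11.11% figure above is this quantity averaged over the 12 cities.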

Discussion
In this study, PM2.5 feature selection based on XGBoost, combined with MSCNN to extract temporal and spatial features and a GA-optimized LSTM, was used to establish the XGBoost-MSCGL air-pollutant concentration prediction model. Compared with other machine-learning approaches, combining feature selection and feature extraction with deep learning is an effective method for processing big data, especially spatio-temporal feature data, and combining spatio-temporal features with such models can improve the performance of spatio-temporal data prediction to a certain extent. The importance of the PM2.5 influencing factors differs between cities, so it is necessary to select the influencing factors for each city and delete redundant features to avoid degrading the accuracy of the prediction model. The prediction method proposed in this paper is feasible for hourly PM2.5 concentration prediction in multiple cities, and it can be applied to other regions and to the prediction of different atmospheric pollutant concentrations. In terms of input variables, regular monitoring data from the National Environmental Monitoring Station and the China Meteorological Administration are used. In terms of modeling methods, machine-learning and deep-learning algorithms are combined: after eliminating redundant features, spatial and temporal features are considered, and a genetic algorithm is used to optimize the parameters of the LSTM network so that it captures the long-term dependences hidden in the air-quality data more accurately, further improving the prediction accuracy.
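To illustrate the GA step, the skeleton below evolves LSTM hyperparameters through selection, crossover, and mutation. It is a hedged sketch only: the search space and the fitness function are stand-ins (a toy analytic surrogate instead of actually training an LSTM and returning its negative validation RMSE), not the paper's actual settings:

```python
import random

# Hypothetical search space; the paper's actual parameter ranges are not given here.
SPACE = {
    "units":  [32, 64, 128, 256],       # LSTM hidden units
    "lr":     [1e-4, 5e-4, 1e-3, 5e-3], # learning rate
    "window": [6, 12, 24, 48],          # input window length (hours)
}

def fitness(ind):
    # Stand-in for "train an LSTM with these parameters and return the
    # negative validation RMSE"; this toy surrogate peaks at
    # units=128, lr=1e-3, window=24 so the sketch runs instantly.
    return (-abs(ind["units"] - 128) / 128
            - abs(ind["lr"] - 1e-3)
            - abs(ind["window"] - 24) / 24)

def random_ind():
    return {k: random.choice(v) for k, v in SPACE.items()}

def crossover(a, b):
    # Uniform crossover: each gene comes from either parent.
    return {k: random.choice((a[k], b[k])) for k in SPACE}

def mutate(ind, rate=0.2):
    # Re-sample each gene with probability `rate`.
    return {k: (random.choice(SPACE[k]) if random.random() < rate else v)
            for k, v in ind.items()}

def ga_search(pop_size=20, generations=15):
    random.seed(0)  # reproducible demo run
    pop = [random_ind() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 4]  # elitism: keep the best quarter
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=fitness)
```

In the actual model, each fitness evaluation would train and validate an LSTM, so the population size and number of generations trade search quality against training cost.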
A shortcoming of this research is that the performance of the XGBoost-MSCGL model may differ between cities owing to driving factors, spatio-temporal characteristics, model types, model structure, and model-development methods. We find that in cities such as Xi'an the model performs well, but in some cities it cannot achieve the same accuracy and prediction effect. The dispersion of PM2.5 concentration data and of other urban air pollutants may also affect the prediction performance of the model, as might the data volume, the dispersion between air-pollutant concentration values, and spatial features; the reasons for these differences need further analysis. In addition, this study does not consider range and interval prediction of air-pollutant concentrations, which should be discussed in detail in future research. Only then could the relevant governments and enterprises better monitor and manage the release of air pollutants.
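The dispersion mentioned above can be compared across cities or seasons with a simple unit-free measure such as the coefficient of variation (our illustration, not a metric reported in the paper):

```python
import numpy as np

def coefficient_of_variation(series):
    """Population std divided by mean: a unit-free dispersion measure
    for comparing how scattered PM2.5 concentrations are between
    cities or seasons. Assumes a nonzero mean."""
    series = np.asarray(series, dtype=float)
    return float(series.std() / series.mean())
```

A city whose concentrations have a high coefficient of variation presents the model with more extreme fluctuations, which is one plausible source of the per-city accuracy differences.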

Conclusions
In this study, based on the hourly concentration data of six air pollutants and meteorological data for 12 cities in the Fenwei Plain in 2020, a PM2.5 concentration prediction model based on XGBoost-MSCGL was established, and its performance was compared with XGBoost-MLP, XGBoost-LSTM, and XGBoost-CNN. The main research results are as follows. In PM2.5 concentration prediction, the XGBoost-MSCGL model performs better in the 12 cities of the Fenwei Plain, with smaller error values and better prediction results. As for feature selection, prediction with the selected factors is significantly better than prediction with all influencing factors. From the perspective of spatio-temporal characteristics, hourly concentration prediction for the 12 cities is better when spatio-temporal characteristics are considered than when they are not. From the perspective of optimization, the accuracy of the optimized model is significantly higher than that of the unoptimized model. In general, screening the influencing factors of PM2.5 according to their importance helps to reduce the feature redundancy of the data set. In terms of overall performance, the XGBoost-MSCGL model is generally better than the XGBoost-MLP, XGBoost-LSTM, and XGBoost-CNN models. Compared with other prediction methods, PM2.5 concentration prediction based on the XGBoost-MSCGL model reproduces the actual data in different cities more accurately, with a larger accuracy improvement and better predictions, especially at the extreme high and low points of sharp fluctuations. The transferability of the model is verified by the prediction results of the 12 cities in the Fenwei Plain.
The direction of change and the volatility of PM2.5 concentration need to be further considered in future research.