Analysis of Atmospheric Pollutants and Meteorological Factors on PM 2.5 Concentration and Temporal Variations in Harbin

: With rapid economic development, the problem of air pollution has become increasingly prominent. Countries have paid attention to PM 2.5 , one of the main air pollutants, and have gradually addressed this issue. Based on the 2015–2019 air quality data, meteorological data, and aerosol optical depth data from Harbin, China, this study investigated the relationship between PM 2.5 , a number of inﬂuencing factors, and their temporal changes using a machine-learning method. It can be seen from the analysis that the random forest model can predict PM 2.5 concentration. In this model, the mean RH and AOD have a high impact on PM 2.5 concentration, but there was negligent correlation with PM 2.5 . The results indicated that the level of PM 2.5 pollution continuously decreased from 2015 to 2019, and there were signiﬁcant seasonal differences in PM 2.5 concentration and its variations. In 2019, due to the impact of heating and adverse meteorological conditions, PM 2.5 pollution during the heating period increased signiﬁcantly. This study provides theoretical and data support for the analysis of PM 2.5 pollution in Harbin and formulation of air pollution control policies.


Introduction
In recent years, with the continuous advancement of industrialization and urbanization, polluting gases from fuel combustion and fugitive dust [1] from traffic and construction have triggered frequent occurrence of haze pollution globally under unfavorable meteorological conditions of diffusion.In haze-polluted environments, PM 2.5 is one of the most significant pollutants [2].It increases travel barriers, impacts economic development, and adversely affects human health [3].Presently, strategies to improve the management of PM 2.5 have been launched worldwide.In China, from 2016 to 2020, the 13th Five-Year Plan was implemented with the aim of controlling the precursors of fine particulate matter, including SO 2 and NO x [4].At the same time, under the China's existing carbon peak and carbon-neutral goal, effective management of PM 2.5 has also become an important objective.To effectively control PM 2. 5 , it is crucial to analyze air pollutants and meteorological factors closely associated with it.
Numerous studies have been conducted with the goal of identifying means to effectively prevent and control air pollution.Many factors affect the concentration of PM 2.5 , including social, economic, and meteorological factors, and its interaction with other pollutants.By definition, PM 2.5 resides in the atmosphere, so changes in meteorological conditions directly affect its diffusion, transformation, and deposition.The impacts of other factors on PM 2.5 are essentially the same as those on other manmade pollutants.Therefore, scholars have focused their research on the following: meteorological conditions and other pollutants.For example, Cifuentes et al. [5] found that solar radiation and temperature were the most significant predictors using statistical models.Park et al. [6] found that meteorological conditions such as wind and turbulent motion at the surface were sensitive to PM 2.5 concentrations.Greatly reduced PM 2.5 concentrations were mainly influenced by synoptic rather than local conditions.Li et al. [7] found that the daily variation in the PM 2.5 concentration is related to weather conditions.Chen et al. [8] found obvious seasonal and regional differences in the effect of meteorological factors on the concentration of PM 2.5 .The effects of temperature, humidity, and wind on the concentration of PM 2.5 were greater than those of other meteorological factors.Jing et al. [9] found significant temporal and spatial differences in the impact of meteorological conditions and precursors of human activities on PM 2.5 .Zheng et al. [10] elucidated that PM 10 , SO 2 , NO 2 , and CO are the main factors affecting the concentration of PM 2.5 , while meteorological factors and O 3 are secondary factors affecting its concentration.Luo [11] found that climatic conditions, NO 2 , and O 3 are the Granger causes of PM 2.5 .Zhang et al. [12] employed diverse statistical methods to evaluate the regional and seasonal differences of PM 2.5 concentrations based on long-term air quality data.Shao et al. [13] evaluated the influence of typical adverse meteorological conditions in Tianjin on PM 2.5 and observed that an increase in wind speed and planetary boundary height reduced the PM 2.5 concentration, with inversion having the most significant effect on PM 2.5 .Therefore, to predict PM 2.5 concentrations more accurately, studies must consider meteorological variables such as temperature, wind direction and speed, rainfall, and other air pollutants, such as SO 2 , CO, O 3 , NO 2 , and PM 10 .
Because of the uncertainty in the variation of PM 2.5 concentration and the diversity of the influencing factors, it is of great scientific significance to explore an effective PM 2.5 concentration prediction model.A PM 2.5 concentration forward prediction model based on physical principles, starting from meteorological factors and pollution conditions, has many shortcomings owing to the imperfect initial state, inaccurate lower boundary conditions, and approximation of physical parameters [14].At the same time, with the development of computer technology, data-driven statistical methods have attracted increasing attention.Among them, machine learning has received extensive attention because of its advantages in automatically improving algorithms through experience [15].Some typical nonlinear machine-learning models, such as random forests and neural networks, have exhibited good predictive performance [16].The integration algorithm is not a single machinelearning algorithm.It builds and combines multiple machine learners, creating a strong learner to complete a task.Random forest is a type of integration algorithm that comprises a large number of individual decision trees that operate as an ensemble [17].The key difference between the neural network method and the conventional parametric model method is that the former is a data-driven adaptive technique and does not require any prior assumptions on the problem model.When the internal rules of problem solving are unknown, neurons can acquire the hidden functional relationships between the data by learning and training samples [18].It is appropriate for solving problems that are challenging to explain using hypotheses and existing theories but have sufficient data and observed variables.Owing to its outstanding learning ability, machine learning has been widely used for predicting PM 2.5 .Wang et al. [19] introduced gaseous pollutant data (NO 2 , SO 2 , CO, and O 3 ) as predictors into a deep neural network model to predict the PM 2.5 concentration in the missing AOD (aerosol optical depth) area, reducing the estimation bias caused by insufficient AOD.Wang selected a single category of predictors, considering only gaseous pollutants.Yang et al. [20] developed a random forest model that integrated TOA reflectance, meteorological field, and land use variables to evaluate ground PM 2.5 and demonstrated its excellent performance.Yang emphasized the impact of ground-level factors on PM 2.5 , selected meteorological factors, and land use variables.Zheng et al. [21] selected PM 10 , SO 2 , NO 2 , temperature, pressure, humidity, wind direction, and wind speed as influencing factors.A prediction model based on a radial basis function (RBF) neural network was used to predict PM 2.5 .The results showed that the model has significant advantages.Zheng used a more comprehensive combination of gaseous pollutants and meteorological factors as predictors.Shi et al. [22] proposed a neural network method based on the attention mechanism to predict the 24-h PM 2.5 concentration.He found that the model had a better predictive effect with respect to the root-mean-square error (RMSE) and the mean absolute error (MAE) evaluation indicators.Lu et al. [23] established an enhanced deep neural network model to estimate the hourly PM 2.5 concentration of the Yangtze River Delta Urban Agglomeration.The results demonstrate that this model can significantly improve the estimation accuracy of the algorithm.Du et al. [24] designed a hybrid deeplearning framework of one-dimensional convolutional neural networks and bidirectional long short-term memory networks to handle PM 2.5 air pollution forecasts with suitable accuracy.Czernecki et al. [25] tested four machine-learning models and confirmed the high applicability of machine learning to short-term air quality prediction with the perfect approach employing analysis and cross-validation.The above studies focused on enhancing the existing model to achieve better prediction accuracy and performance, ignoring the interpretability of the model and not studying the various factors affecting PM 2.5 .
Most of the existing PM 2.5 studies in China have focused on the Beijing-Tianjin-Hebei, Yangtze River Delta, and Pearl River Delta regions [26][27][28], and only a few have focused on the Northeast region.Northeast China has a high latitude and a mid-temperate continental monsoon climate with four distinct seasons.The summer is short, hot, and rainy, while the winter is long, cold, and dry.Development has mainly focused on heavy industries that consume a significant amount of energy.The heating period is long, and a large amount of straw is burned in the open air in the surrounding rural areas, causing frequent smog in winter.Harbin is a typical representative of northeast cities with typical characteristics of air pollution in Northeast China.The evaluation of the current situation with PM 2.5 pollution in Harbin is easy for the public to have a better and more intuitive understanding of the current situation of the air environment.This can also provide a basis for environmental management departments and government decision-making bodies to conduct air environment control in a targeted manner.It is of utmost significance to formulate urban development plans and maintain sustainable economic development.
We used the random forest model to study the influence of air quality data, meteorological data and AOD on the PM 2.5 concentration changes and quantify the different contributions of these factors.A prediction model was established using air quality, meteorological, and AOD data of the Harbin City from 2015 to 2019 as the predictors and the PM 2.5 concentration as the outcome.Under this model, the SHAP method was used to establish the critical influencing factors of PM 2.5 , and the effect of each factor on PM 2.5 was quantified.The Pearson algorithm was used to analyze the correlation between each factor and PM 2.5 .The inter-and intra-annual variations in PM 2.5 concentration were analyzed, and multiple reasons for the variation in the PM 2.5 concentration in Harbin were investigated.This can provide theoretical and data support for air pollution control and air quality management in Harbin.

Data
The following datasets were used in this study: air quality data, meteorological data, and AOD data.
Meteorological data were obtained from a greenhouse gas data-sharing platform (http://data.sheshiyuanyi.com/,accessed on 12 March 2021).We obtained the daily mean temperature (mean TEMP, • C), and the total daily precipitation (PRE, mm), mean relative humidity (mean RH, %), daily mean pressure (mean PRS, hPa), daily sunshine duration (SSD, h), and daily mean wind speed (mean WIN, m/s) in Harbin from 2015 to 2019.
AOD data were obtained from the MODIS aerosol products from the Terra satellite of the National Aeronautics and Space Administration (NASA).We downloaded the data from 2015 to 2019 from the Level 1 Atmospheric Archive and Distribution System (LAADS, http://ladswebnascomnasagov/data/searchhtml, accessed on 15 March 2021).MODIS provides two AOD products with resolutions of 10 km and 3 km.In this study, the dark target and deep blue methods were combined to retrieve AOD data with a resolution of 10 km.The AOD data obtained by the Terra satellite at 10:30 am transit time were the average for the day.
The data used in this study are shown in Table 1.[29].Among the ensemble-learning algorithms, bagging and RF are two representative parallelization methods in which individual learners do not have strong dependence on each other; can be generated simultaneously.The principle of bagging is to extract a sampling set of several training samples from the sample set using the bootstrap method, train a weak learner based on this sampling set, repeat training many times, and then combine the trained multiple learners.The classification tasks are simply voted in the output of the prediction, and the regression tasks are simply averaged to obtain the final result.RF is an extended variant of bagging.Its algorithm principle is shown in Figure 1.First, a RF uses a decision tree based on the classification and regression tree (CART) algorithm as a weak learner, and the RF introduces a random selection of attributes to the training process of decision trees.For a traditional decision tree, an optimal feature is selected from all N sample features on the node to divide the left and right subtrees of the decision tree.However, RF randomly selects N sub (N sub < N) sample features on the node to select an optimal feature to divide the left and right subtrees of the decision tree.This further enhances the generalizability of the model.In each round of random sampling of bagging, for the k th tree, approximately 36.8% of the training data do not participate in the generation of the k th tree.These are called out-of-bag samples of the k th tree.These unselected data are not involved in modeling and can be used as verification data to evaluate the model performance.
In summary, RF uses the bootstrap method to generate a sample subset and randomly selects F features for node splitting to build a single regression decision subtree.The above process is repeated many times to construct T regression decision subtrees, and each tree grows freely without pruning to form a random forest.The final prediction result is determined by the mean value of all decision trees for sample training.A flowchart of the algorithm for the random forest is shown in Figure 2. The random forest method can deal with high-dimensional data and is not prone to overfitting owing to the introduction of sample randomness and feature randomness.Moreover, it achieves an unbiased estimate of the importance of each feature and obtains good results for default value problems.This method can be highly parallelized during training.It has advantages of speed for training large samples, strong adaptability to datasets, and high prediction accuracy.In summary, RF uses the bootstrap method to generate a sample subset and ran domly selects F features for node splitting to build a single regression decision subtree The above process is repeated many times to construct T regression decision subtrees, and each tree grows freely without pruning to form a random forest.The final prediction re sult is determined by the mean value of all decision trees for sample training.A flowchar of the algorithm for the random forest is shown in Figure 2. The random forest method can deal with high-dimensional data and is not prone to overfitting owing to the intro duction of sample randomness and feature randomness.Moreover, it achieves an unbi ased estimate of the importance of each feature and obtains good results for default valu problems.This method can be highly parallelized during training.It has advantages o speed for training large samples, strong adaptability to datasets, and high prediction ac curacy.In summary, RF uses the bootstrap method to generate a sample subset and randomly selects F features for node splitting to build a single regression decision subtree.The above process is repeated many times to construct T regression decision subtrees, and each tree grows freely without pruning to form a random forest.The final prediction result is determined by the mean value of all decision trees for sample training.A flowchart of the algorithm for the random forest is shown in Figure 2. The random forest method can deal with high-dimensional data and is not prone to overfitting owing to the introduction of sample randomness and feature randomness.Moreover, it achieves an unbiased estimate of the importance of each feature and obtains good results for default value problems.This method can be highly parallelized during training.It has advantages of speed for training large samples, strong adaptability to datasets, and high prediction accuracy.In our study, to obtain the best prediction, a grid search was used to find the optimal parameters of the model.Grid search is an exhaustive search parameter-tuning method: the procedure is to loop through all the candidate parameters, try every possibility, and obtain the best-performing set of parameters.Some of the key parameter descriptions and settings of random forest used in the current analysis are shown in Table 2, and default values were adopted for the other parameters.

Data Analysis Method
Explainability refers to the degree to which humans grasp the purpose for their decisions.Machine learning is based on an algorithm that acquires potential patterns and relationships from data and ultimately generates decisions or predictions.The more explainable it is, the easier it is for people to understand the rationale for certain decisions or predictions.People are not only satisfied with the effect of the model, but also think more about the reasons for its effect.Such thinking helps to optimize the model and features and can also help to comprehend the model itself and enhance the quality of the model.
SHAP is an additive explanation model inspired by the Shapley value [30].Its goal is to explain the prediction of an instance X by computing the contribution of each feature to the prediction.The feature values of a data instance act as "contributors", and SHAP attributes the output values to the Shapely values of each feature.Shapely values are measures of contributions each feature has in a machine-learning model.The Shapley value for feature X j in the model is as follows [31]: where p is the total number of features, N\{j} is the set of all the conceivable combinations of the features excluding X j , S is the feature set in N\{j}, f (S) is the model prediction with features in S and f (S ∪ {j}) is the model prediction with features in S and X j .The interpretation of Equation (1) indicates that the Shapley value of a feature is its marginal contribution to the model prediction averaged over all potential models with diverse permutations of features.As the main obstacle to the widespread adoption of Shapley values is computation, Lundberg and Lee [30] created SHAP.It calculates the Shapley value of each feature value to measure the influence of feature values on the final output value, and specifies the explanation as follows: where g is the explanation model, M is the number of input features, z ∈ {0, 1} M is the simplified features and indicates whether the corresponding feature exists (1 or 0), φ j ∈ R is the feature attribution for feature j, the Shapley values, and φ 0 is usually the mean of the target variable for all the samples.For instance, with the X we are explaining, all the feature values are "present", i.e., all the simplified features are 1.At this point, the above formula can be simplified as follows: It can be regarded as the contribution of the sum of Shapley values of each feature, which changes the predicted result from the mean to the predicted result, namely g(z ).
The advantage of SHAP is that we can clearly see whether each feature has a positive or negative impact on the prediction.
The analysis reported in this paper adopted the SHAP library of Python.By calling this library, the random forest prediction model established in the current analysis can be explained.89.R 2 was as high as 0.94; RMSE and MAE were 13.23 µg/m 3 and 6.69 µg/m 3 , respectively, i.e., relatively low and, thus, satisfactory.It can be seen from the results above that the adopted approach accurately predicted PM 2.5 concentrations.A random forest model was also used to predict PM 2.5 and obtained similar results with an R 2 of 0.93 [32].On this basis, the influence of different factors can be analyzed.feature values are "present", i.e., all the simplified features are 1.At this point, the above formula can be simplified as follows:

Prediction Results and Model Evaluation
It can be regarded as the contribution of the sum of Shapley values of each feature, which changes the predicted result from the mean to the predicted result, namely   .The advantage of SHAP is that we can clearly see whether each feature has a positive or negative impact on the prediction.
The analysis reported in this paper adopted the SHAP library of Python.By calling this library, the random forest prediction model established in the current analysis can be explained.

Prediction Results and Model Evaluation
A random forest prediction model was established; it considered the atmospheric pollutants (PM10, NO2, CO, SO2, and O3), meteorological conditions (PRE, SSD, mean TEM, mean RH, mean PRS, and mean WIN), and AOD in Harbin from 2015 to 2019 as characteristic quantities.Seven out of ten of them were used as training data, and those remaining were used as test data.Daily concentrations of PM2.5 were the output of the model.
The results of fitting the test samples are shown in Figure 3.The density scatter diagram shows that the model had an excellent prediction ability.The sample points are roughly distributed on both sides of the straight line and are relatively clustered.The predicted PM2.5 concentrations obtained by the prediction model were very close to the actual PM2.5 concentrations.The fitting equation was Y = 0.97X + 1.89.R 2 was as high as 0.94; RMSE and MAE were 13.23 μg/m 3 and 6.69 μg/m 3 , respectively, i.e., relatively low and, thus, satisfactory.It can be seen from the results above that the adopted approach accurately predicted PM2.5 concentrations.A random forest model was also used to predict PM2.5 and obtained similar results with an R 2 of 0.93 [32].On this basis, the influence of different factors can be analyzed.

Influence of Predictive Factors on the PM 2.5 Concentration
We used the SHAP method to explain the model results and the roles of the influencing factors (Figure 4). Figure 4a shows that the mean RH had the greatest influence on PM 2.5 and its SHAP value was as high as 22.52.It was discovered that RH has an inverted U-shaped relationship with PM 2.5 concentrations (peaking at RH = 45-70%) [33].Dry (RH = 45-60%) and low-humidity (RH = 60-70%) conditions positively affected PM 2.5 and exerted an accumulation effect.Under normal weather conditions, RH is mostly 45-60%; the SHAP value of RH is thus relatively high.The influence of AOD was the second highest, with a SHAP value of 8.32, which was less than one-third of the mean RH.The influence of the other factors (SSD, PRE, NO 2 , SO 2 , CO, mean TEMP, PM 10 , O 3 , mean WIN, and mean PRS) was lower, from 5.12 to 0.23.
(mean TEMP).The absolute values of SHAP for both remained in a relatively stable range around the 0 line.
When PRE 1.3 mm, its influence on the PM2.5 concentration was positive and the degree of influence remained relatively stable.When the mean WIN 3.9 m/s, the influence on the PM2.5 concentration was mostly positive.When the mean WIN 3.9 m/s, it began to have a negative influence on PM2.5.There was an obvious cutoff point for the influence of SO2 on PM2.5: when SO2 17 μg/m 3 , it exerted a negative influence on PM2.5, with a drastic variation in effect.When SO2 17 μg/m 3 , the impact became positive and the variation decreased slightly as the concentration increased.The influence of AOD on the PM2.5 concentration remained stable.The influence trend of the concentration of O3 first decreased and then increased, with a turning point at O3 = 60 μg/m 3 .

Correlation between PM2.5 and Predictors
We used the Pearson correlation algorithm to calculate the correlation between PM2.5 and the studied predictors.The results are shown in Figure 5. Figure 4b demonstrates the effect of different values of each feature on PM 2.5 .With an increase in the values of the mean PRS and the mean RH, as well as of the NO 2 , CO, and PM 10 concentrations, their influence on the PM 2.5 concentration gradually changed from negative to positive.The turning points were found at around 1002 hPa (mean PRS), 70% (mean RH), 45 µg/m 3 NO 2 , 1.1 mg/m 3 CO, and 85 µg/m 3 PM 10 , respectively.The SHAP absolute values of the mean PRS and the mean RH changed a little around the 0 line, while the SHAP values of the NO 2 , CO, and PM 10 concentrations increased significantly with an increase in their numerical values.The NO 2 concentration increased slowly, the CO concentration increased more rapidly, and the PM 10 concentration increased even faster.The influence of SSD and the mean TEMP on the PM 2.5 concentration changed from positive to negative as their values increased, with turning points around 6 h (SSD) and 6 °C (mean TEMP).The absolute values of SHAP for both remained in a relatively stable range around the 0 line.
When PRE >1.3 mm, its influence on the PM 2.5 concentration was positive and the degree of influence remained relatively stable.When the mean WIN < 3.9 m/s, the influence on the PM 2.5 concentration was mostly positive.When the mean WIN > 3.9 m/s, it began to have a negative influence on PM 2.5 .There was an obvious cutoff point for the influence of SO 2 on PM 2.5 : when SO 2 < 17 µg/m 3 , it exerted a negative influence on PM 2.5 , with a drastic variation in effect.When SO 2 > 17 µg/m 3 , the impact became positive and the variation decreased slightly as the concentration increased.The influence of AOD on the PM 2.5 concentration remained stable.The influence trend of the concentration of O 3 first decreased and then increased, with a turning point at O 3 = 60 µg/m 3 .

Correlation between PM 2.5 and Predictors
We used the Pearson correlation algorithm to calculate the correlation between PM 2.5 and the studied predictors.The results are shown in Figure 5.Among the major atmospheric pollutants, the O3 concentration was negatively but weakly correlated (−0.217) with PM2.5.The main reason is that an increase in the PM2.5 concentration can increase the scattering of solar radiation in the visible and near-infrared bands, thus reducing the photochemical rate and, finally, leading to a decrease in the O3 Among the major atmospheric pollutants, the O 3 concentration was negatively but weakly correlated (−0.217) with PM 2.5 .The main reason is that an increase in the PM 2.5 concentration can increase the scattering of solar radiation in the visible and near-infrared bands, thus reducing the photochemical rate and, finally, leading to a decrease in the O 3 concentration.However, SSD and temperature had a more obvious impact on the total solar radiation energy; consequently, O 3 had a stronger correlation with the mean TEMP and SSD, with correlation coefficients reaching 0.624 and 0.383, respectively.Significant positive correlations were observed between other pollutants and PM 2.5 , with coefficients ranging from 0.906 to 0.598 in the order of PM 10 > NO 2 > CO > SO 2 .In fact, PM 10 , NO 2 , CO, and SO 2 concentrations were positively correlated, indicating that the emission sources of these pollutants were similar; for example, they may have been the burning of straw and coal [34].In addition, the photochemical reactions of NO 2 , CO, and SO 2 can generate nitrate and carbonate, which can lead to an increase in PM 2.5 [35].
Among the meteorological factors, the mean PRS was positively correlated with the PM 2.5 concentration, while the mean TEMP, mean WIN, SSD, and PRE were negatively correlated with the PM 2.5 concentration.This is because when pollution occurs in a season with low temperatures, such as winter, high pressure and low wind speed promote a stable state for the near-surface atmosphere, which increases the strength of the thermal inversion layer and hinders the diffusion of pollutants [36].A larger SSD can enhance the photolysis of organic carbon, which is a major component of particulate matter, thereby reducing the PM 2.5 concentration [37].A higher wind speed has a diffusion and dilution effect on PM 2.5 , reducing its concentration.More precipitation corresponds to a more significant removal of moisture, which is also an important reason for the reduction in the PM 2.5 concentration [38,39].The correlation coefficients between the mean RH and AOD and the PM 2.5 concentration were less than 0.1 and could thus be considered irrelevant.According to a previous study [33], RH has an inverted U-shaped relationship with PM 2.5 concentrations (peaking at RH = 45-70%).The RH fluctuated between 45-60% and the PM 2.5 concentrations were always high, which may be attributed to a low correlation.
The hygroscopic growth of aerosol particles is an important factor in the estimation of particle concentration.The measurement of the PM 2.5 concentration near the ground is achieved through a drying process, so the air quality data represent the concentration under dry air.However, AOD via satellite remote sensing was obtained by inversion under environmental RH conditions.Higher atmospheric humidity causes a hygroscopic growth of aerosol particles, which affects the composite refractive index, resulting in greater AOD values.Therefore, to reduce the uncertainty introduced by particle hygroscopic growth and the difference between AOD and the PM 2.5 concentration in the monitoring environment and increase their correlation, we made an appropriate humidity correction for AOD.The revised formula [40] is as follows: where AOD RH is the aerosol optical thickness corrected for humidity, RH is the relative humidity in the atmosphere (%), and f (RH) is the scattering hygroscopic growth factor of particulate matter, which represents the influence of air humidity on AOD.After revision, the correlation coefficient between AOD and the PM 2.5 concentration was −0.032.This represented a slight improvement, but the value remained very low.Based on this, the correlation coefficient between AOD and the PM 2.5 concentration in the four seasons was calculated as shown in Table 3.The correlation between AOD and the PM 2.5 concentration exhibited seasonal differences in Harbin and was the highest in winter.However, the correlation coefficient was only −0.218, which may be because the AOD data were obtained by the Terra satellite at 10:30 am transit time, while the PM 2.5 concentration data were daily averages and the degree of time matching of the two data sources was insufficient.

Time Series Analysis of PM 2.5
In this section, we report our investigation on the temporal relationships between the PM 2.5 concentration and the abovementioned factors.They are described in terms of annual and monthly averages.

Interannual Variation of PM 2.5
We analyzed the trend of the PM 2.5 concentration in Harbin.As shown in Figure 6, the PM 2.5 concentration in 2015 had the highest median, quartile, and upper margin values, corresponding to the worst pollution.In 2016, the PM 2.5 concentration decreased overall and the number of outliers increased, but the violin plot widened around the median, indicating that the number of days with a low PM 2.5 concentration increased and the pollution situation improved.In 2017, the PM 2.5 concentration decreased slightly and the number of outliers decreased significantly, but the number of days with a low PM 2.5 concentration also decreased.In 2018, the overall distribution and outliers of the PM 2.5 concentration decreased significantly and the distribution was more concentrated near the median, making it the year with the lightest pollution in the past five years.This result may be attributable to the comprehensive and in-depth promotion of air pollution prevention and control work in Harbin in 2018.A total of 1191 coal-fired boilers were eliminated, 415 coal-fired boiler pollution prevention and control facilities were upgraded, 376 scattered pollution enterprises were investigated and cleaned up, and the comprehensive utilization rate of straw reached more than 75% [41].PM 2.5 pollution increased slightly in 2019, but the overall level did not differ significantly from that of the previous year.According to the Ambient Air Quality Standard (GB3095-2012), the annual average concentration limit and the 24-hour average concentration limit of PM2.5 are 35 μg/m 3 and 75 μg/m 3 , respectively.As shown in Figure 7, the number of days of PM2.5 exceeding the standard and the annual average concentration of PM2.5 showed a declining trend overal According to the Ambient Air Quality Standard (GB3095-2012), the annual average concentration limit and the 24-hour average concentration limit of PM 2.5 are 35 µg/m 3 and 75 µg/m 3 , respectively.As shown in Figure 7, the number of days of PM 2.5 exceeding the standard and the annual average concentration of PM 2.5 showed a declining trend overall from 2015 to 2019.In 2015, fine particulate matter pollution was severe in Harbin, with an average PM 2.5 concentration nearly twice the standard, and the number of days exceeding the standard was approximately one-third of the whole year.The annual average PM 2.5 concentration in 2018 was the lowest, but it still failed to meet the standard.The number of days exceeding the standard in 2019 was less than half that in 2015.In the past five years, the annual average concentration of PM 2.5 decreased in a wave pattern, showing a phenomenon of "annual decline," and the number of days of PM 2.5 exceeding the standard decreased sharply.According to the climate characteristics of Harbin, the seasons are divided as follows spring lasts from March to May, summer-from June to August, autumn-from Septem ber to November, and winter-from December to February.Figure 8 shows the averag PM2.5 concentration data for the different seasons.It can be seen from the figure that th average concentration of PM2.5 changed as follows: in the spring of 2015-2018, it was ba sically stable at 46 μg/m 3 and decreased slightly in 2019; in the summer, a stepwise de crease occurred from 30 μg/m 3 to 15 μg/m 3 , and in the autumn of 2015-2017, the PM2 concentration decreased first, then increased, exhibiting a value as low as 23 μg/m 3 and a cliff-like decline in 2018 and 2019.In fact, in 2018 and 2019, efforts were made to ban straw burning, which may have prevented air pollution caused by open burning during the har vest season [42].Finally, the concentration in the winter did not show a clear trend.According to the climate characteristics of Harbin, the seasons are divided as follows: spring lasts from March to May, summer-from June to August, autumn-from September to November, and winter-from December to February.Figure 8

Monthly Variation of PM2.5
After analyzing the interannual variation trend as reported in the previous section, it was found that with respect to the number of days exceeding the standard and seasonal variation, PM2.5 pollution in 2019 was moderate.These findings are in line with national governance policies and urban living requirements and are most representative of the current PM2.5 pollution levels.
We investigated the monthly variation of PM2.5 and the related factors in 2019.The results are shown in Figure 9.In 2019, Harbin's non-heating period lasted from 28 April to 19 October.During the heating period, three small peaks of PM2.5 were observed, specifically, in January, February, and December.The PM10, NO2, CO, and SO2 concentrations showed similar trends.This could be attributed to the surge of gas precursors and PM2.5 caused by coal combustion for heating in winter.Moreover, the meteorological conditions at that time, namely low wind speed, low temperature, low sunshine duration, high pressure, and high relative humidity, were conducive to the accumulation of PM2.5.In March, the PM2.5 concentration dropped sharply, remaining below 100 μg/m 3 , and it fluctuated around this value until the end of April.In fact, the temperature increased significantly during this period, and the intensity of urban heating was reduced.From the middle of October, the PM2.5 concentration began to increase slightly, and the PM10, NO2, CO, AOD, and SO2 concentrations also increased slightly at the same time.Several heating companies in Harbin began a hot-state trial operation of the pipe network before the heating season, and research [43] has shown that the main source of air pollution is coal burning (62.365%).Combined with the changes in pollutants, it can be seen that coal combustion could have led to an increase in PM2.5.This continued until November.During the entire heating period, the factors contributing to O3 were dominated by temperature and sunshine duration and the trend in O3 was completely different from that of PM2.5.

Monthly Variation of PM 2.5
After analyzing the interannual variation trend as reported in the previous section, it was found that with respect to the number of days exceeding the standard and seasonal variation, PM 2.5 pollution in 2019 was moderate.These findings are in line with national governance policies and urban living requirements and are most representative of the current PM 2.5 pollution levels.
We investigated the monthly variation of PM 2.5 and the related factors in 2019.The results are shown in Figure 9.In 2019, Harbin's non-heating period lasted from 28 April to 19 October.During the heating period, three small peaks of PM 2.5 were observed, specifically, in January, February, and December.The PM 10 , NO 2 , CO, and SO 2 concentrations showed similar trends.This could be attributed to the surge of gas precursors and PM 2.5 caused by coal combustion for heating in winter.Moreover, the meteorological conditions at that time, namely low wind speed, low temperature, low sunshine duration, high pressure, and high relative humidity, were conducive to the accumulation of PM 2.5 .In March, the PM 2.5 concentration dropped sharply, remaining below 100 µg/m 3 , and it fluctuated around this value until the end of April.In fact, the temperature increased significantly during this period, and the intensity of urban heating was reduced.From the middle of October, the PM 2.5 concentration began to increase slightly, and the PM 10 , NO 2 , CO, AOD, and SO 2 concentrations also increased slightly at the same time.Several heating companies in Harbin began a hot-state trial operation of the pipe network before the heating season, and research [43] has shown that the main source of air pollution is coal burning (62.365%).Combined with the changes in pollutants, it can be seen that coal combustion could have led to an increase in PM 2.5 .This continued until November.During the entire heating period, the factors contributing to O 3 were dominated by temperature and sunshine duration and the trend in O 3 was completely different from that of PM 2.5 .During the non-heating period, the slightly higher PM 2.5 concentration in mid-to-early May could be attributable to residual pollution caused by heating.During the rest of the non-heating period, the PM 2.5 concentration remained at approximately 50 µg/m 3 , and the concentrations of PM 10 , NO 2 , CO, and SO 2 were low.In addition, the increases in temperature and the mean RH, SDD, and PRE were conducive to the dissipation of PM 2.5 .During this period, the low concentration of PM 2.5 not only reduced the weakening effect of solar radiation, but also reduced the heterogeneous absorption of the precursors of O 3 , such as HO 2 , NO 2 , NO 3 , and N 2 O 5 , leading to an acceleration of the O 3 photochemical generation rate and, thus, to an increase in its concentration [43].However, the O 3 concentration was very unstable and fluctuated significantly.Each drop in O 3 concentration was accompanied by a large amount of precipitation.
Briefly, according to the evaluation of changes in the PM 2.5 concentration, its pollution levels in Harbin gradually reduced from 2015 to 2019.The annual average concentration of PM 2.5 indicated a wavelike decline, whereas the number of days when the PM 2.5 concentration exceeded the standard demonstrated a stepwise decline.There were significant seasonal differences in the PM 2.5 concentration over the past five years.The PM 2.5 concentration was highest in winter but declined in a wavelike pattern, slightly lower in autumn, declining in a cliff-like manner, low but stable in spring, and lowest in summer, decreasing gradually year by year.During the 2019 heating period, the meteorological conditions were conducive to the accumulation of PM 2.5 .Hence, the concentration of PM 2.5 was high, while the opposite was true in the non-heating period.In addition, PM 2.5 has a similar trend to PM 10 , NO 2 , CO, and SO 2 , indicating that its concentration is significantly affected by urban heating.

Conclusions
In this study, a random forest model was established, which used the air quality data, meteorological data, and atmospheric aerosol optical thickness data of Harbin from 2015 to 2019 to predict the concentration of PM 2.5 .After evaluating the prediction accuracy of the model, the SHAP method was used to analyze the contribution of each feature to the predicted value.The correlation between each factor and PM 2.5 and the temporal and spatial patterns of PM 2.5 were identified.The following conclusions were drawn.
The random forest model had a very good effect on the prediction of the PM 2.5 concentration.The SHAP method was employed to explain the model.It was found that the mean RH and AOD have a great influence on the PM 2.5 concentration, but the correlation coefficient with the PM 2.5 concentration is less than 0.1.PM 10 , NO 2 , CO, SO 2 , and the mean PRS were positively correlated with PM 2.5 , whereas the mean TEMP, SSD, O 3 , mean WIN, and PRE were negatively correlated with PM 2.5 .There were seasonal differences in the correlation between AOD and PM 2.5 , and the highest correlation was found in winter, with a correlation coefficient of −0.218.
The degree of PM 2.5 pollution in Harbin gradually diminished from 2015 to 2019, and there were substantial seasonal differences in the PM 2.5 concentration and its variations.In 2019, the PM 2.5 concentration was considerably affected by heating; together with the difference in meteorological conditions, it led to remarkable differences in the PM 2.5 concentration between the heating and non-heating periods.
However, due to the different accuracy of the different data, this article can only be used for calculation, which may cause deviation of the analysis results.In future work, we will continue to improve this issue.

Figure 1 .
Figure 1.Schematic diagram of the random forest principle.

Figure 1 .
Figure 1.Schematic diagram of the random forest principle.

Figure 2 .
Figure 2. Flow chart of the algorithm for the random forest.

Figure 2 .
Figure 2. Flow chart of the algorithm for the random forest.

A
random forest prediction model was established; it considered the atmospheric pollutants (PM 10 , NO 2 , CO, SO 2 , and O 3 ), meteorological conditions (PRE, SSD, mean TEM, mean RH, mean PRS, and mean WIN), and AOD in Harbin from 2015 to 2019 as characteristic quantities.Seven out of ten of them were used as training data, and those remaining were used as test data.Daily concentrations of PM 2.5 were the output of the model.The results of fitting the test samples are shown in Figure 3.The density scatter diagram shows that the model had an excellent prediction ability.The sample points are roughly distributed on both sides of the straight line and are relatively clustered.The predicted PM 2.5 concentrations obtained by the prediction model were very close to the actual PM 2.5 concentrations.The fitting equation was Y = 0.97X + 1.

Figure 4 .
Figure 4. Contribution of features to predicted values.(a) Ranking of feature contributions.(b) Variation of feature influence.

Figure 4 .
Figure 4. Contribution of features to predicted values.(a) Ranking of feature contributions.(b) Variation of feature influence.

Figure 5 .
Figure 5. Heatmap of correlations between the variables.
Atmosphere 2022, 13, x FOR PEER REVIEW 12 of 18 may be attributable to the comprehensive and in-depth promotion of air pollution pre vention and control work in Harbin in 2018.A total of 1191 coal-fired boilers were elimi nated, 415 coal-fired boiler pollution prevention and control facilities were upgraded, 376 scattered pollution enterprises were investigated and cleaned up, and the comprehensive utilization rate of straw reached more than 75% [41].PM2.5 pollution increased slightly in 2019, but the overall level did not differ significantly from that of the previous year.

Figure 6 .
Figure 6.Distribution of PM 2.5 by year.

Figure 7 .
Figure 7. Histogram of PM 2.5 pollution over time.
shows the average PM 2.5 concentration data for the different seasons.It can be seen from the figure that the average concentration of PM 2.5 changed as follows: in the spring of 2015-2018, it was basically stable at 46 µg/m 3 and decreased slightly in 2019; in the summer, a stepwise decrease occurred from 30 µg/m 3 to 15 µg/m 3 , and in the autumn of 2015-2017, the PM 2.5 concentration decreased first, then increased, exhibiting a value as low as 23 µg/m 3 and a cliff-like decline in 2018 and 2019.In fact, in 2018 and 2019, efforts were made to ban straw burning, which may have prevented air pollution caused by open burning during the harvest season [42].Finally, the concentration in the winter did not show a clear trend.

Figure 8 .
Figure 8. Seasonal changes in the PM 2.5 concentration.

Figure 9 .
Figure 9. Changes of the PM2.5 concentration, air pollutants, and meteorological conditions in (a) Concentrations of PM2.5 and meteorological conditions.(b) Concentrations of PM2.5 and a lutants.

Figure 9 .
Figure 9. Changes of the PM 2.5 concentration, air pollutants, and meteorological conditions in 2019.(a) Concentrations of PM 2.5 and meteorological conditions.(b) Concentrations of PM 2.5 and air pollutants.

Table 2 .
Random forest model: parameter settings.

Table 3 .
Seasonal correlation between AOD and the PM 2.5 concentration.