PM 2.5 Prediction Based on Random Forest, XGBoost, and Deep Learning Using Multisource Remote Sensing Data

Abstract: In recent years, air pollution has become an important public health concern. High concentrations of fine particulate matter with a diameter less than 2.5 µm (PM2.5) are known to be associated with lung cancer, cardiovascular disease, respiratory disease, and metabolic disease. Predicting PM2.5 concentrations can help governments warn people at high risk, thus mitigating complications. Although attempts have been made to predict PM2.5 concentrations, the factors influencing PM2.5 prediction have not been investigated. In this work, we study feature importance for PM2.5 prediction in Tehran's urban area, implementing random forest (RF), extreme gradient boosting (XGBoost), and deep learning machine learning (ML) approaches. We use 23 features, including satellite and meteorological data, ground-measured PM2.5, and geographical data, in the modeling. The best model performance obtained was R² = 0.81 (R = 0.9), MAE = 9.93 µg/m³, and RMSE = 13.58 µg/m³, using the XGBoost approach combined with elimination of unimportant features. All three ML methods performed similarly: R² varied from 0.63 to 0.67 when aerosol optical depth (AOD) at 3 km resolution was included, and from 0.77 to 0.81 when it was excluded. In contrast to the PM2.5 lag data, satellite-derived AODs did not improve model performance.

In Tehran, the capital of Iran, the annual PM2.5 concentration of 86.8 ± 33 µg/m³ (based on four years of observations, from 2015 to 2018) significantly exceeds the World Health Organization (WHO) guideline [8]. Gasoline and diesel vehicles, industrial emissions, and dust storms are the main causes of the high PM2.5 concentration in Tehran. Taghvaee et al. [12] reported that diesel exhaust and industrial emissions have a greater impact on cancer risk (~70%) than other air pollution sources in Tehran. Tehran is not the city in Iran with the worst air pollution; however, it has received more attention [8,12–18] because of its large population (estimated to be 9 million in 2019 [19]). Dehghan et al. [18] investigated the impact of Tehran's air pollution on the mortality rate related to respiratory diseases, reporting that from 2005 to 2014, high concentrations of O3, NO2, PM10, and PM2.5 were strongly associated with respiratory mortality. Tehran has an annual precipitation of 429 mm and a temperature range from −5 to 38 °C. Elevation in Tehran varies significantly, from 1117 to 1712.6 m above sea level. Tehran's urban extent and population have been increasing over the last few decades. According to the national census conducted in 2016, the population was 8.69 million, which was 10.8% of the total population of the country (80.28 million); based on the newest revision of the UN World Urbanization Prospects, it is estimated to have reached ~9 million in 2019 [19]. Although Tehran is not the most polluted city in Iran, more people are exposed to air pollution there than in other cities because of its high population density. Tehran is located on the southern slope of the Alborz Mountains, which have a significant effect on the region's weather. Air pollution in Tehran mainly originates from three major sources: transportation, industry, and dust storms.
Tehran is served by 42 air pollution monitoring (APM) stations, established by the Department of Environment (23 stations) and the Municipality of Tehran (19 stations). The nearest meteorological station to the APM sites is Mehrabad, located in the urban area. The average distance of the APM sites from Mehrabad is approximately 10 km (1 to 20 km for APM stations with less than 75% missing data).

Data
Ground-measured PM2.5 concentrations, satellite-derived AODs at 3 and 10 km spatial resolution, and meteorological data were utilized (see Table 1). Other auxiliary data were also used, such as APM site geographic information (longitude, latitude, and altitude) and temporal parameters such as day of year, day of week, and season. The study period was from 2015 to the end of 2018. The National Department of Environment and the Municipality of Tehran have set up 42 APM stations, distributed as illustrated in Figure 1. Air pollution parameters such as PM2.5, PM10, CO, O3, NO2, and SO2 are recorded at the APM sites. The daily average of the PM2.5 data is accessible through the Tehran Municipality ICT website [42] and the Air Pollution Monitoring System platform of the Department of Environment [43]. In addition, station parameters such as altitude, longitude, and latitude were obtained from the same sources.

Aerosol Optical Depth (AOD) Data
Aerosol optical depth is defined as the accumulated attenuation factor over a perpendicular column of unit cross section [44]. The Moderate Resolution Imaging Spectroradiometer (MODIS) AOD products are well known in air pollution studies. The MODIS instrument is installed aboard both the Aqua and Terra satellites. Recently, the MODIS atmosphere team published a new level-2, collection 6.1 global aerosol optical depth product, offered at two spatial resolutions, 10 km and 3 km, labeled MOD04_L2 and MOD04_3K, respectively. The AOD is retrieved using variations of the Dark Target (DT) and Deep Blue (DB) aerosol retrieval algorithms over urban areas. The recent version of the product has improved the estimation of AOD values, in particular for areas with extremely variable topography, such as Iran. However, a high rate of missing values remains, and bias in the AOD values is observed for our study area.

Both the 10 km and 3 km aerosol optical depth products were downloaded for the study period (2015 to the end of 2018) from the NASA Atmosphere Archive & Distribution System (LAADS) archive portal [45]. The products are in the Hierarchical Data Format (HDF) and contain several subdatasets. The "AOD_550_Dark_Target_Deep_Blue_Combined" subdataset was extracted from the 10 km resolution product (MOD04_L2) and the "Optical_Depth_Land_And_Ocean" subdataset from MOD04_3K.

Meteorological Data
The climatic features utilized in this study include air temperature (T), maximum and minimum air temperature (T_max,T_min), relative humidity (RH), daily rainfall, visibility, wind speed (Windsp), sustained wind speed (ST_windsp), air pressure, and dew point. Climatic data from Mehrabad weather station was used for all APM stations, because it is the nearest meteorological station to the study area and the APM stations are all within about 10 km of it. Data were downloaded from the Iran Meteorological Organization (IMO) portal [46].

Methodology
This work involved sampling, data pre-processing, data aggregation, and three modeling methods for prediction and validation of PM2.5 concentration. Out of the 42 available sites, 37 APM sites were used in the modeling. The purpose of the current study is to find the best model to predict PM2.5 at the selected sites, using climatic, satellite, and auxiliary data. We analyzed and identified the most important features for the study area based on several methods. The random forest and XGBoost algorithms have built-in methods for detecting important features, but for deep learning we implemented feature permutation to evaluate feature importance. In addition, a recursive feature-elimination and model-training procedure based on XGBoost was carried out, with the mean absolute error (MAE) used as the criterion for feature removal.

Data Preprocessing and Matching
Data processing and matching are necessary because the data were obtained from different sources. Daily climatic data were downloaded from the IMO portal. There were a few missing values and, in some cases, fully missing daily records; therefore, we used interpolation to estimate and fill in the missing data. The PM2.5 data were collected from the Tehran Municipality ICT website [42] and the Air Pollution Monitoring System platform of the Iran Department of Environment [43]. The altitude, longitude, and latitude of each APM station were also recorded from those sources, to be used later in modeling and as the reference for AOD data sampling. The time format of the data offered by the Department of Environment was not Julian and was therefore converted to be compatible with the other data. Missing values over short or long periods are a common problem at air pollution monitoring stations, occurring when there is a critical failure or a temporary power cutoff [17,47]. For Tehran's APM sites, there are many missing values that cannot be compensated for by interpolation.
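As an illustration of the gap-filling step, short runs of missing daily values can be interpolated with pandas. This is a minimal sketch: the series and its values are hypothetical, and the paper does not specify the interpolation method, so linear interpolation is assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature series with two missing days
temps = pd.Series([12.0, np.nan, np.nan, 15.0, 16.5],
                  index=pd.date_range("2015-01-01", periods=5))

# Linear interpolation fills short gaps from the neighboring days
filled = temps.interpolate(method="linear")
print(filled.tolist())  # [12.0, 13.0, 14.0, 15.0, 16.5]
```

Long outages cannot be filled this way, which is why, as noted above, many records had to be discarded instead.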
Next, we downloaded both the 10 km (MOD04_L2) and 3 km (MOD04_3K) spatial resolution MODIS AOD products from the NASA Atmosphere Archive & Distribution System (LAADS) portal. The products are in HDF file format and have multiple subdatasets. We used the "AOD_550_Dark_Target_Deep_Blue_Combined" subdataset from MOD04_L2 and the "Optical_Depth_Land_And_Ocean" subdataset from MOD04_3K. AOD values were sampled at the locations of all 37 APM sites for the entire study period (four years, equal to 1460 days).
The AOD and PM2.5 data, together with the altitude, longitude, and latitude of each station, were merged. The corresponding climatic data were concatenated to this dataset based on the sampling date. Day of year (DoY), season, and weekday were also derived for each day and added to the database. We also used the PM2.5 and rainfall values with lags of one and two days, so new columns were added to the database as PM2.5_lag1, PM2.5_lag2, Rainfall_lag1, and Rainfall_lag2. Finally, the PM2.5 monitoring organization, as a Boolean value (zero = Tehran Municipality, one = DOE), and the distance of each station from the Mehrabad weather station were added to the database. Descriptive statistics of the meteorological parameters, PM2.5, and AOD values were calculated to evaluate the datasets: the mean, standard deviation, maximum and minimum values, and the 25, 50, and 75% quartiles of each parameter. These are given in the Supplementary Materials, Section S3 and Tables S1 and S2.
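The lag-feature construction described above can be sketched with pandas; the station names, column names, and values below are illustrative, not from the paper's dataset. The key point is that lags must be computed per station, so a station's first day never borrows another station's value.

```python
import pandas as pd

# Toy records for two hypothetical APM stations
df = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"] * 2),
    "pm25": [80.0, 95.0, 70.0, 60.0, 55.0, 65.0],
})
df = df.sort_values(["station", "date"]).reset_index(drop=True)

# One- and two-day lags, computed per station so lags never cross stations
df["PM2.5_lag1"] = df.groupby("station")["pm25"].shift(1)
df["PM2.5_lag2"] = df.groupby("station")["pm25"].shift(2)

# Calendar features derived from the sampling date
df["DoY"] = df["date"].dt.dayofyear
df["weekday"] = df["date"].dt.dayofweek
print(df[["station", "date", "PM2.5_lag1", "PM2.5_lag2"]])
```

The first day of each station correctly receives a missing lag value rather than a value leaked from the other station.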

Normalization
Data normalization is an important step for many machine learning estimators, particularly when dealing with deep learning. The preferred range of features for most ML approaches is between −1 and 1; features with a wider range can cause instability during model training [48]. Standardization was used to rescale the features by subtracting the mean and dividing by the standard deviation, with the standardized value Z_i calculated as Z_i = (x_i − x̄)/δ_x, where x_i, x̄, and δ_x are the sample value, mean, and standard deviation of the feature, respectively. After applying standard normalization, the train and test datasets were prepared: records were shuffled and split into 70% for training and 30% for testing. Owing to the high rate of missing values in AOD03 (94%), the training was carried out once including AOD03 and once excluding it. The records for each step were purged of missing values, depending on which features were used for modeling.
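The standardization and 70/30 shuffled split above can be sketched in a few lines of numpy; the feature matrix here is synthetic and stands in for the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=20.0, size=(1000, 5))  # toy feature matrix

# Z-score standardization: Z_i = (x_i - mean) / std, per feature column
mu, sigma = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / sigma

# Shuffle the records and split 70% / 30%, as in the paper
idx = rng.permutation(len(Z))
n_train = int(0.7 * len(Z))
X_train, X_test = Z[idx[:n_train]], Z[idx[n_train:]]

print(X_train.shape, X_test.shape)  # (700, 5) (300, 5)
```

After this step, every column has zero mean and unit variance, which keeps gradient-based training stable.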

Random Forest Modeling
Random forest (RF), introduced by Ho [49], is a supervised ensemble learning method based on decision trees. It can be used for both classification and regression, and it is very flexible and fast. To conduct RF analysis, the model's hyperparameters must be tuned. A grid search for model performance optimization was carried out with the 10-fold cross-validation technique, based on the R² metric. Table 2 shows the RF hyperparameter ranges and the optimized values found by the grid search.
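The tuning workflow can be sketched with scikit-learn's `GridSearchCV`. The hyperparameter grid below is illustrative only; the paper's actual ranges and tuned values are in its Table 2, and the data here is synthetic.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the PM2.5 feature table
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

# Illustrative grid; the paper's ranges are given in its Table 2
param_grid = {"n_estimators": [20, 50], "max_depth": [5, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=10,         # 10-fold cross-validation, as in the paper
    scoring="r2",  # optimize on the R^2 metric
)
search.fit(X, y)
print(search.best_params_)
```

`search.best_params_` then plays the role of the "optimized values" column of Table 2.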

Extreme Gradient Boosting
Extreme Gradient Boosting (XGBoost) is a successful machine learning library based on a gradient boosting algorithm, proposed by Tianqi Chen [50]. It provides better control against overfitting than prior algorithms by using a more regularized model formalization. It has a high rate of success in Kaggle competitions, particularly on structured data [50]. Like random forest, XGBoost is tuned using hyperparameters. A grid search over the hyperparameters with 10-fold cross-validation was carried out to find the best model based on the R² metric (see Table 3).

Deep Learning
Deep learning is a machine learning method descended from the artificial neural network (ANN) [48,51]. Owing to significant developments in hardware and algorithms, networks with more hidden layers and more neurons per layer can now be implemented, enabling deep neural network modeling. These developments have allowed deep learning to progress from research to industrial applications, and it has recently been shown to have performance comparable to human experts [52–54]. For PM2.5 prediction with deep learning, there have been attempts with different structures, including deep neural networks, long short-term memory (LSTM) networks, and convolutional neural networks (CNN) [33,35,37,38,55,56]. One of the most challenging problems for deep learning methods is missing values. The rate of missing values in our study area was very high, and thus CNN and LSTM could not be applied. Therefore, in this study, we used a six-layer deep neural network with an Adam optimizer (see Figure 2). L1 and L2 regularization were applied to the layers to avoid over-fitting [57]. The structure of the deep neural network used in this study is presented in Table 4.
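As a rough, library-agnostic sketch of such a network, scikit-learn's `MLPRegressor` can express a four-hidden-layer network trained with the Adam optimizer. This is an approximation, not the paper's exact model: the layer sizes below are invented (the real ones are in Table 4), and `MLPRegressor` supports only an L2 penalty (`alpha`), not the L1 term the paper also used.

```python
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Toy standardized regression problem standing in for the PM2.5 dataset
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)
X = StandardScaler().fit_transform(X)
y = (y - y.mean()) / y.std()  # scale the target as well, for stable training

model = MLPRegressor(
    hidden_layer_sizes=(64, 64, 32, 16),  # four hidden layers (sizes are illustrative)
    solver="adam",                        # Adam optimizer, as in the paper
    alpha=1e-4,                           # L2 penalty (sklearn offers no L1 term)
    max_iter=2000,
    random_state=0,
)
model.fit(X, y)
print(round(model.score(X, y), 2))  # training R^2
```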

Feature Importance Assessment
A Spearman correlation coefficient analysis was carried out to evaluate the mutual associations of the features (for more details, please refer to Supplementary Material Section S2). The features were also analyzed for their importance in PM2.5 concentration prediction. This provides a better understanding of the trained model and of feature importance; eliminating features with a negative or neutral effect on the model's performance can reduce cost and improve prediction performance.
In this study, we used three methods for feature importance estimation. First, we utilized the built-in functions of random forest and XGBoost regression, which estimate feature importance based on the impurity variance of decision tree nodes, a fast but imperfect method. Second, feature permutation was implemented: the performance (R², MAE, and RMSE) of a well-trained deep neural network using all features was obtained, and the trained model was then re-evaluated while permuting just one of the features in each round. Lower performance indicates that the permuted feature has higher importance. This method is fast and reliable and has very low computation cost. Third, XGBoost was used for recursive feature elimination, with the MAE metric used for model performance evaluation. First, a model was trained using all features and its performance was measured. Second, model training and performance calculation were repeated, excluding one of the features at a time and including the others. Third, the excluded feature whose removal yielded the best prediction performance (lowest MAE) was removed from the feature list, and these steps were repeated until just three features remained.
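The permutation step (the second method above) can be written model-agnostically: score a fitted model, then re-score it with one feature column shuffled; the drop in score is that feature's importance. A minimal sketch, with a simple linear fit standing in for the trained deep neural network and synthetic data in which feature 0 dominates by construction:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = LinearRegression().fit(X, y)   # stand-in for the trained DNN
base = r2_score(y, model.predict(X))   # baseline performance, all features intact

importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # break the feature/target link
    importance.append(base - r2_score(y, model.predict(Xp)))  # drop in R^2

print(int(np.argmax(importance)))  # feature 0 shows the largest drop
```

Because only predictions are re-computed, no retraining is needed, which is why the paper describes this method as having very low computation cost.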

Figure 2. The deep neural network configuration with six layers (four hidden layers) employed in this study to predict PM2.5 concentration.


Results
Descriptive statistics of the meteorological data and APM sites for the entire four-year study period are given in the Supplementary Materials, Section S3 and Tables S1 and S2. The missing-value rate for most of the variables is less than 1.3%; for visibility it is 10.19%. The missing-value rate for PM2.5 was approximately 54.11%, caused by critical station failures, power outages, maintenance, and so on [17]. The AOD03 data show a high rate of missing values, approximately 94.09%; thus, despite the improvements in version 6.1 of the MODIS AOD product, it is not acceptable for this study area. The proportion of missing values in AOD10 is approximately 63.13%, which is better than AOD03. However, the product's spatial resolution (10 km) is not high enough, and multiple APM stations share the same value when sampling AOD10. The minimum and maximum altitudes of the APM stations are 1023 m and 1758 m above sea level. Five of the 42 APM stations have no instrument for PM2.5 concentration measurement. Histograms of the features are illustrated in Section S4, Figure S1 of the Supplementary Materials. The PM2.5, AOD03, AOD10, windsp, and air_pressure histograms have an almost bell-shaped distribution, while the other features show no particular distribution.
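Missing-value rates like those quoted above are simply the per-column fraction of NaN entries; a minimal pandas sketch, with a toy frame and injected gaps (the column names follow the paper's features, the values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame with injected gaps; values are illustrative only
df = pd.DataFrame({
    "PM2.5": [80.0, np.nan, 70.0, np.nan],
    "AOD03": [np.nan, np.nan, np.nan, 0.4],
    "T":     [12.0, 13.5, 14.0, 15.0],
})

# Percentage of missing values per column
missing_rate = df.isna().mean() * 100
print(missing_rate.to_dict())  # {'PM2.5': 50.0, 'AOD03': 75.0, 'T': 0.0}
```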

Model Performance Validation
In this study, we used satellite-derived AOD data, in addition to ground-measured climatic data and 37 APM stations, to predict the PM2.5 concentration. Three machine learning methods, RF, XGBoost, and deep learning, were used for the predictions. The dataset size has a significant impact on model training performance, particularly for the deep neural network. Because AOD03 has a high rate of missing values (94%), we conducted three tests for each training: including AOD03, excluding AOD03, and excluding both AODs. In addition, records with missing values were excluded from the training and test datasets. All of the 1900 (including AOD03) and 11,800 (excluding AOD03) non-missing records, out of the total 41.2 k records, were used for training the models. The R², MAE, and RMSE metrics were used to evaluate and compare the performance of the three methods.

Random Forest
The optimum configuration of the random forest model was obtained using the built-in grid search function in Python, with the 10-fold cross-validation technique; the optimum values are shown in Table 2. The best performance was obtained when excluding AOD03 from the model input features, with R² = 0.78 (R = 0.88), MAE = 10.8 µg/m³, and RMSE = 14.54 µg/m³. The predicted vs. observed points are distributed around the y = x reference line, with the highest density close to the line, exhibiting a reasonable prediction of PM2.5 (see Figure 3a,b). Excluding both AODs gives almost the same performance as the test with AOD10.

XGBoost
The scatter plot of predicted PM 2.5 versus observed values using the XGBoost method is illustrated in Figure 4. Figure 4a shows the scatterplot of predicted vs. observed PM 2.5 values considering all parameters. Figure 4b shows the scatterplot of predicted vs. observed PM 2.5 values excluding the AOD03 variable. The scattered points are distributed around the y = x reference line, demonstrating a reasonable prediction of PM 2.5 concentration. Excluding both AODs shows the same performance as a test with AOD10.

Deep Learning
Deep neural networks are very sensitive to the range of the input features and easily become unstable during the training process. In the first step, 1900 (including AOD03) and 11,800 (excluding AOD03) non-missing records were selected. A standard scaler was used to normalize the features to (−1, 1). The model was trained on 70% of the selected records, which were shuffled in advance. The scatter plots of predicted PM2.5 concentration versus observed values, considering all features, are illustrated in Figure 5. The best model performance was achieved by excluding AOD03 from the input features, with R² = 0.77 (R = 0.88), MAE = 10.99 µg/m³, and RMSE = 14.86 µg/m³. Although the R² value obtained by deep learning is lower than for RF and XGBoost, the distribution of predicted vs. observed points around the y = x reference line still demonstrates acceptable performance. Excluding both AODs shows the same performance as the test with AOD10.


The results for all three modeling approaches, with and without AOD03 and without both AODs, are shown in Table 5. The XGBoost method demonstrates the highest model performance, with R² = 0.8 (R = 0.894), MAE = 10.0 µg/m³, and RMSE = 13.62 µg/m³, while excluding AOD03. This shows that AOD03 is not a good feature for PM2.5 concentration prediction. In addition, excluding both AODs did not reduce the performance.
Therefore, it can be inferred that other features can act as substitutes for the AODs, which decreases the importance of AODs for PM2.5 prediction. In addition, sample size has a significant impact on modeling and prediction performance. Considering the performance metrics, all three ML methods demonstrate almost similar performance: the R² values varied from 0.63 to 0.67 when including AOD03, and from 0.77 to 0.80 with AOD10 or without AODs. The best model performance was obtained with the XGBoost ML method, with a very low time cost of 19 s.

RF and XGBoost Feature Importance Ranking
Some features do not contribute to the modeling and only increase the complexity of the model. Therefore, we conducted a feature importance assessment to detect and eliminate useless features. RF and XGBoost have built-in functions that evaluate feature importance. The feature importance bar graphs based on the RF and XGBoost models are shown in Figures 6 and 7, with the features sorted by their importance. In both RF and XGBoost, PM2.5_lag1 and visibility show significant importance compared to the other features. However, there are large differences between the RF and XGBoost feature importance rankings. For example, AOD10 has the lowest rank in the XGBoost feature ranking, while it is ranked seventh by the RF method. Some studies have reported that the built-in feature importance ranking of RF is biased and unreliable [58] and suggest carrying out feature permutation for feature importance ranking instead.


Feature Permutation Using Deep Neural Network
Table 6 shows the impact of feature permutation on the prediction performance of a well-trained DNN model. As expected, permuting an important feature yields lower prediction performance.
Considering the feature importance ranking obtained by feature permutation, we repeated the DNN training 23 times. In round one, PM2.5_lag1 alone was used as the input feature and model performance was measured. In the second round of training, the features ranked 1 and 2 were used as input features. This procedure was repeated to cover all 23 features, measuring the R² value at each step to evaluate model performance. The best model performance during this procedure was obtained using the 15 most important features (from PM2.5_lag1 to dew point), with an R² of 0.776. The results of this procedure are presented in the Table 6 column "R² based on ranking", with the less important features marked in bold. Adding further features had a negative effect on the R² value. Therefore, the best DNN model performance after removing useless features was an R² of 0.776.
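The rank-based retraining loop above can be sketched as follows. A linear model stands in for the DNN, the data is synthetic, and the importance order is assumed precomputed (here it is simply hard-coded); only the loop structure mirrors the paper's procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = 3.0 * X[:, 2] + 1.0 * X[:, 0] + rng.normal(scale=0.5, size=600)

ranking = [2, 0, 1, 3, 4]  # features sorted by (assumed precomputed) importance
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for k in range(1, len(ranking) + 1):
    cols = ranking[:k]  # round k uses the k most important features
    model = LinearRegression().fit(X_tr[:, cols], y_tr)
    scores.append(r2_score(y_te, model.predict(X_te[:, cols])))

best_k = int(np.argmax(scores)) + 1
print(best_k, round(max(scores), 3))
```

Plotting `scores` against `k` reproduces the shape reported in the paper: R² rises while informative features are added, then plateaus or drops once only noise features remain.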

MAE Based Feature Elimination Using XGBoost
In addition to the methods explained above, we conducted a recursive XGBoost training procedure with feature removal based on the MAE metric. In the first step, a model was trained using all features and its performance was measured using MAE. In the second step, the training was repeated 23 times, removing one of the features in each round and measuring the MAE. In the third step, the feature with the lowest effect on model performance (lowest MAE when removed) was eliminated from the feature set. The procedure was then repeated with the remaining 22 features, and so on, until three features remained. The results are illustrated in Figure 8. They demonstrate that removing RH in the first step improved the model. Model performance did not decrease when RH, longitude, T, sustained wind speed, distance, rainfall_lag2, T_min, org., and weekday (located on the left side of the dashed blue line in Figure 8) were removed. In addition, removing season, T_max, AOD10, and rainfall did not change the R² value and had only a small effect on MAE and RMSE. Feature dependency may be the reason for these small changes in model performance. The best model performance obtained by this method was R² = 0.81.
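The backward elimination loop can be sketched as follows; a linear model again stands in for XGBoost, and the data is synthetic with two informative features (1 and 3) plus three pure-noise features. Only the loop structure mirrors the paper's procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = 4.0 * X[:, 1] + 2.0 * X[:, 3] + rng.normal(scale=0.5, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def mae_with(cols):
    """Train on a feature subset and return the held-out MAE."""
    m = LinearRegression().fit(X_tr[:, cols], y_tr)
    return mean_absolute_error(y_te, m.predict(X_te[:, cols]))

remaining = list(range(X.shape[1]))
while len(remaining) > 3:
    # Try dropping each remaining feature; drop the one whose removal hurts least
    trial = {f: mae_with([c for c in remaining if c != f]) for f in remaining}
    least_useful = min(trial, key=trial.get)  # lowest MAE after its removal
    remaining.remove(least_useful)

print(sorted(remaining))  # the informative features (1 and 3) survive
```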

Discussion
Using different methods for feature importance evaluation, we obtained slightly different results. However, in most of the methods, historical observations of PM 2.5 , wind speed, visibility, day of the year, altitude, and temperature were very important in modeling. The feature importance rankings obtained by the different modeling approaches, together with the median ranking of each feature, are presented in Table 7; features are sorted from top down by the median of their importance rankings. Features such as latitude are important for RF and XGBoost but rank lower in the deep learning feature permutation method. This is because some features are dependent and can be replaced by other features.
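The median-based ordering used for Table 7 amounts to the following aggregation. The rank values below are hypothetical placeholders for illustration; the paper's table covers all 23 features across four ranking methods.

```python
import numpy as np

# Hypothetical per-method ranks (e.g. RF, XGBoost, permutation, recursive removal)
# for a few features; the paper's Table 7 covers all 23 features.
rankings = {
    "PM2.5_lag1": [1, 1, 1, 1],
    "wind_speed": [2, 3, 4, 2],
    "latitude":   [3, 4, 12, 6],   # important for RF/XGBoost, lower for DNN permutation
    "RH":         [20, 19, 22, 23],
}

# Median rank per feature, then sort ascending (most important first).
median_rank = {f: float(np.median(r)) for f, r in rankings.items()}
ordered = sorted(median_rank, key=median_rank.get)
print(ordered)
```

The median is preferable to the mean here because a single method that ranks a dependent feature very low (as permutation does for latitude) should not dominate the aggregate.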
The features used in this study can be divided into three categories. The first category is features that directly carry spatial information, such as latitude, longitude, and altitude. The second category is features that indirectly carry APM station spatial information, such as AOD10, AOD03, PM2.5_lag1, and PM2.5_lag2. The third category is parameters that are shared across all stations, such as meteorological data, day of year, day of week, and sampling season. The Spearman's correlation coefficient heat map of the features is shown in Figure 9. The PM 2.5 historical observation values have the highest correlation with PM 2.5 . Air pressure, AOD03, and AOD10 have a positive correlation with PM 2.5 , while visibility, wind speed, rainfall, altitude, and latitude show a negative correlation. The high correlation between other features also reveals feature dependency: dependent features can be predicted from other features and subsequently eliminated from the modeling to reduce model complexity and the cost of prediction.
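The matrix behind such a heat map is straightforward to compute with pandas. The sketch below uses synthetic data with assumed relationships (RH anti-correlated with temperature, PM 2.5 driven by its lag), purely to illustrate the Spearman computation; the column names are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 365
temp = rng.normal(20, 8, n)
df = pd.DataFrame({
    "T": temp,
    "RH": 80 - 1.5 * temp + rng.normal(0, 5, n),   # assumed RH ~ inverse of T
    "pm25_lag1": rng.gamma(4, 20, n),              # skewed, strictly positive
})
df["pm25"] = 0.8 * df["pm25_lag1"] + rng.normal(0, 10, n)

# Spearman rank correlation matrix (what a heat map such as Figure 9 visualizes)
corr = df.corr(method="spearman")
print(corr.round(2))
```

Spearman correlation is used rather than Pearson because it is rank-based, so it captures monotonic but nonlinear relationships and is robust to the skewed distributions typical of PM 2.5 and AOD.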
Air temperature, air pressure, dew point, and RH are dependent on altitude. We did not use the exact meteorological parameters for each station because of a lack of data; however, meteorological parameters can be modeled and predicted for each station based on altitude and other available features. Based on the Spearman correlation heat map, RH and air pressure are highly correlated with temperature; considering the cost of prediction, they can be used as substitutes for each other. Some features, such as temperature, RH, and pressure, have a seasonal trend, and thus the feature "day of year" can facilitate modeling and improve performance; its ranking varies from 2 to 10 with a median value of 5.5.
In this study, three machine learning techniques (RF, deep learning, and XGBoost) were used to predict PM 2.5 concentration. The XGBoost technique demonstrated the highest performance with an acceptable training time. To determine feature importance, permutation and recursive feature removal were used in addition to the RF and XGBoost built-in functions. Some of the differences in feature importance rankings could be a result of feature dependency. However, overall, the feature rankings obtained in this paper are logical and beneficial for future studies.

Conclusions
In this study, we utilized Random forest, XGBoost, and deep learning machine learning techniques to predict PM 2.5 concentration in Tehran's urban area. Widely distributed ground-measured PM 2.5 data, meteorological features, and remote sensing AOD data were used. In previous research, different methods and features were employed for PM 2.5 concentration prediction; however, few studies dealt with the limitations of our study area, including the high rate of PM 2.5 and AOD missing values. In addition, the air pollution monitoring sites in our study area were densely distributed, with just one available weather station. We also utilized the 3 and 10 km MODIS AOD products and geographic properties of the monitoring stations such as latitude, longitude, and topography. Also included were historical observation values of PM 2.5 and rainfall, in addition to day of year, day of week, and season. Feature importance and correlation were evaluated using the Spearman correlation method, permutation, recursive feature removal, and the default built-in functions of the XGBoost and RF techniques.
In comparison to the RF and deep learning methods, XGBoost achieved the best performance of approximately R 2 = 0.81 (R = 0.9), MAE = 9.93 µg/m 3 , and RMSE = 13.58 µg/m 3 , with a very low time cost (19 s). Although a DNN model was used for modeling and prediction, XGBoost, with its simpler structure, performed better. However, all three ML methods performed similarly: R 2 varied from 0.63 to 0.67 when Aerosol Optical Depth (AOD) at 3 km resolution was included, and from 0.77 to 0.81 when it was excluded. Based on the feature importance rankings, we found that some features have a high dependency on other features and can therefore be ranked differently depending on the machine learning structure. We investigated 23 features and determined that acceptable PM 2.5 prediction performance can be achieved using eight to twelve of them. For example, with MAE-based XGBoost feature removal, using only nine of the most important features (PM2.5_lag1, day of year, wind speed, visibility, latitude, air pressure, dew point, PM2.5_lag2, and altitude; see Figure 8), an acceptable performance of R 2 = 0.79 (R = 0.888), MAE = 10.20 µg/m 3 , and RMSE = 14 µg/m 3 was obtained.
Most notably, this is, to our knowledge, the first study to investigate the importance of features for PM 2.5 concentration prediction. New features such as latitude, longitude, altitude, and dew point, in addition to day of year, day of week, and season, were utilized in a way that has not been done in previous work. However, some limitations are worth noting. Although we achieved reasonable PM 2.5 prediction performance, the satellite-derived AODs did not have a significant impact on the predictions, whereas historical values of PM 2.5 proved necessary for reasonable prediction. In particular, AOD03 has a very high rate of missing values and is thus not useful for our study area. Prediction of the spatial distribution pattern of PM 2.5 is therefore limited without historical PM 2.5 values. Future work will focus on images with high spatial resolution, based on the important features introduced in this research.