Machine Learning for Determining Interactions between Air Pollutants and Environmental Parameters in Three Cities of Iran

Abstract: Air pollution, one of the most significant environmental challenges, has adversely affected the global economy, human health, and ecosystems. Consequently, comprehensive research is being conducted to provide solutions for air quality management. Recently, it has been demonstrated that environmental parameters, including temperature, relative humidity, wind speed, air pressure, and vegetation, interact with air pollutants, such as particulate matter (PM), NO2, SO2, O3, and CO, and that these interactions contribute to frameworks for forecasting air quality. The objective of the present study is to explore these interactions in three Iranian metropolises (Tehran, Tabriz, and Shiraz) from 2015 to 2019 and to develop a machine learning-based model to predict daily air pollution. Three assessment criteria were used to evaluate the proposed XGBoost model: R squared (R2), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). Preliminary results showed that although air pollutants were significantly associated with meteorological factors and vegetation, the formulated model had low predictive accuracy (R2 = 0.36 for PM2.5, 0.27 for PM10, 0.46 for NO2, 0.41 for SO2, 0.52 for O3, and 0.38 for CO). Accordingly, future studies should consider more variables, including emission data from factories and traffic, as well as sunlight and wind direction. It is also suggested that strategies be applied to compensate for the lack of observational data: considering second- and third-order interactions between parameters, increasing the number of stations that monitor air pollution and meteorology simultaneously, and using hybrid machine learning models based on proximal and satellite data.


Introduction
As a result of the increasing demand for energy over the previous 50 years, air pollution has grown dramatically, with an alarming acceleration that cannot be halted [1][2][3][4]. Approximately 91% of the world's population lives in regions with high levels of air pollution [5], which contributes to the deaths of seven million people annually [6]. According to Statista [7], nearly 1.1 million Americans live in zones with high levels of PM2.5, and more than 90% of European citizens are exposed to PM exceeding the WHO standard [8]. Neurological and psychological disruptions, eye irritation, and the progression of various conditions such as asthma, Alzheimer's, Parkinson's, autism, and low birth weight (LBW) are among the short-term and long-term consequences of air pollution [9]. Annual financial damages of O3 and PM2.5 in the healthcare sector are estimated to be $5.5-12.5 billion and $48.6-140.7 billion, respectively [10]. Premature death and air pollution-related diseases have been documented to cause financial losses in India totaling $28.8 billion and $8 billion, respectively [11]. Air pollution causes $2.9 trillion in economic losses to the global economy [12]. These economic statistics refer only to the financial losses caused by air pollution in the public health sector.
In addition to human health, air pollution is also a significant threat to ecosystems [13]. According to research by Ito et al. [14], plants exposed to NO2 had a lower dry weight. An examination of the effects of air pollution on lichens in northeastern Norway by Hogda et al. [15] showed that lichen-rich areas fell from 30% in 1973 to 1.5% in 1992. Furthermore, Bignal et al. [16] explored the impacts of air pollutant emissions on vegetation along two highways in the UK and observed that the deforestation rate rose. SO2, O3, and NOX can alter the physiological processes that affect plants' growth patterns by damaging the leaf cuticles and affecting stomatal conductance, thereby having a direct impact on photosynthesis, leaf longevity, and carbon allocation [17,18]. Moreover, these pollutants can change the competitive balance between plant species and alter the composition of the plant community; thus, they can reduce crop yield and, subsequently, the economic performance of agricultural systems [19]. According to assessments, O3 is responsible for yield losses of 7-12% in wheat and 3-5% in corn. The economic losses from O3 on 23 crops in Europe in 2000 were estimated to be around €6.7 billion [20]. Vlachokostas et al. [21] reported that O3-induced economic damage to crops in Thessaloniki (Greece) is estimated at roughly €43 million per year. Hence, air pollution imperils both public health and economic development.
Presently, air pollution is a substantial concern for emerging Asian economies [22]. Approximately 70% of air pollution-related deaths occur in Asia and the Pacific [23], and 98% of cities with a population of more than 100,000 in these regions do not meet WHO air quality guidelines [24]. As an Asian country, Iran is facing various environmental challenges such as an arid and semi-arid climate, water crisis, soil salinity, desertification, floods, and air pollution [25][26][27], and it ranked 23rd among 106 countries in air pollution [28]. Extensive use of fossil fuels, an antiquated transportation system, and industrial activities, along with natural dust, are the leading causes of air pollution in Iran [29]. It has been estimated that a one percent rise in the production of gasoline in Iran would result in a 0.59 percent increase in the country's carbon emissions [30]. In 2014, Iran ranked eighth out of 27 countries in terms of CO2 emissions from energy consumption [31]. The financial consequences of CO2 emissions in some Asian countries from 1970 to 2018 are shown in Figure 1.
Sustainability 2022, 14, 8027
Figure 1. The economic loss caused by CO2 emissions in some Asian countries; data source: [32].

To counter ecological issues, environmental engineering provides low-cost and practical solutions [33]. Monitoring the concentration of ambient air pollutants is a helpful way to maintain public health and sustainable development [34]. Recently, various technologies have been employed to monitor air pollution [35]. In Iran, the Department of Environment (DOE) monitors air quality and executes the national strategies to reduce air pollution [29]. Although reducing air pollution by curbing industrial activities and traffic is an approach that policymakers and administrators follow, the effectiveness of this strategy in decreasing pollution risks is controversial [36], since urbanization and rising energy usage ultimately render air pollution management policies ineffective [37]. Moreover, lowering emissions does not directly reduce air pollution, because other characteristics, such as topography and climatic factors, are also involved in air quality.
Meteorological parameters, directly or indirectly, play a crucial role in ambient air quality by affecting the formation, emission, and deposition of pollutants [38]. In a study by Liu et al. [39], the concentration of air pollutants depended on geography. Jayamurugan et al. [40] reported that the concentration of pollutants is influenced by wind speed, wind direction, relative humidity, and temperature. Zhang et al. [41] found that the increase in PM2.5 concentrations might be due to fluctuations in relative humidity. Yang et al. [42] examined the interaction between PM2.5 and meteorological parameters in Chinese cities for 22 months and found a positive correlation between relative humidity and PM2.5. In most zones, wind speed showed a negative relationship with PM2.5. Lou et al. [43] observed that low humidity drives the accumulation of PM2.5. In another study by Zhou et al. [44] in Beijing and Nanjing, the seasonal averages of PM2.5, PM10, SO2, CO, and NO2 were significantly correlated with wind speed, and relative humidity had a contrasting impact on pollutant accumulation. Pearson correlation analysis also demonstrated a significant relationship between air pollutants and meteorological parameters in Iran [45]. Therefore, to control air quality efficiently, it is necessary to explore the interactions between air pollution and meteorological factors based on long-term daily data. In this regard, Fan et al. [46] showed that improving air quality in some Chinese cities was associated with changes in weather conditions. Sunday and Haruna [47] concluded that evaluating the effect of climatic conditions on seasonal changes in pollutant concentrations might help to reduce ambient air pollution.
In addition to meteorological factors, plants also affect air quality [48]. Although plants are victims of air pollution, they have protection mechanisms to absorb air pollutants [49]. Hence, vegetation is one of the major sources of ecosystem services that enhance the quality of urban life [50] by preventing the release of contaminants [51]. In an investigation by Klingberg et al. [52], NO2 levels were lower in vegetated areas, indicating that leaf area and tree bark can be critical elements in improving air quality [53]. The results obtained by Jeanjean et al. [54] demonstrated that trees trap 7% of air pollutants. In an examination in the United States by Nowak et al. [55], forest trees removed 17.4 million tons of air pollutants, saving $6.8 billion in public health costs in 2010. In an analysis by Wu et al. [56] in Shenzhen, China, it was found that the removal of PM2.5 by vegetation was nearly 1000 tons in 2015, and the average removal rate was measured at 16 g m−2 per year. Alonso et al. [57] documented that removing vegetation increases O3 levels. Mirsanjari et al. [58] found that a drop in dense vegetation and the expansion of regions with poor vegetation were positively associated with an increase in air pollution in Karaj (Iran). Although urban vegetation is regarded as an eco-friendly solution, Xing and Brimblecombe [59] found that cities rarely remove more than 1% of air pollutants via plants. Moreover, the deposition of pollutants on branches and leaves does not appreciably improve air quality. Nemitz et al. [60] reported that urban vegetation in the UK reduced PM2.5 by an average of 1%. According to Viippola et al. [61], there is inadequate empirical evidence that forests ameliorate urban air pollution. Given this conflicting evidence, more investigations are required on the efficiency of vegetation in lowering air pollution [62].
The interaction of air pollutants with meteorological factors and plants is explored in order to forecast the behavior of pollutants and the status of air quality. Because an early-warning system is essential to protect people against the detrimental consequences of air pollution [63], forecasting air pollution using machine learning, neural networks, and deep learning has recently been addressed by many researchers. Environmental sciences, including weather prediction, soil erosion, waste disposal, dust storms, and air pollution, make extensive use of machine learning techniques [64][65][66]. Conventional air pollution prediction techniques can be divided into statistical methods, artificial intelligence, and numerical forecasting [67]. Sharma et al. [68] used time-series analysis of 2009-2017 data to predict New Delhi air quality. Kaya and Oguducu [69] developed a 4-, 12-, and 24-h forecasting model based on deep learning using hourly PM10 data from Istanbul (Turkey) between 2014 and 2018. By applying the classification and regression tree method, Gocheva-Ilieva et al. [70] presented a model for forecasting daily PM10 concentration with 90% accuracy in Ruse and Pernik (Bulgaria). Madan et al. [71] mentioned that a variety of machine learning methods, including linear regression, decision tree, random forest, neural network, and support vector machine, have been used to predict air quality. The air quality prediction model developed by Mahalingam et al. [72] using a neural network algorithm and a support vector machine proved effective.
Pasupuleti et al. [73] found that the random forest method is more accurate than regression and decision tree methods for predicting pollutants (r = 0.79 for CO, 0.79 for O3, 0.70 for NO2, 0.86 for PM2.5, and 0.79 for PM10). In another study, Pan [74] demonstrated that Extreme Gradient Boosting (XGBoost) significantly outperforms random forest, multiple linear regression, decision tree, and support vector machine algorithms for hourly PM2.5 concentration forecasts (r = 0.95 for PM2.5). Furthermore, Ma et al. [75] conducted a study in the northern United States and demonstrated that XGBoost was able to accurately model PM2.5 interactions with the environment. Liu et al. [76] revealed that the integration of the ridge regression (RR) model and the XGBoost algorithm generalized better than conventional machine learning techniques for forecasting pollutants. Kumar and Pande [77] found that XGBoost yielded the strongest linear agreement between predicted and observed data. Accordingly, many air quality models have been developed [78]. However, air pollution is driven by a complex combination of meteorological factors, physical obstacles, and chemical reactions among pollutants [79], which lowers model precision.
This research first examines two hypotheses: (H1) as urbanization and population grow, air pollution increases annually (Section 3.1); and (H2) vegetation has a significant impact on lowering air pollution (Section 3.2). Considering that few investigations have assessed the interactions of air pollutants with the ambient environment in Iran, the objective of the current study is to (i) evaluate relationships between air pollutants, meteorological parameters, and vegetation in three Iranian metropolises between 2015 and 2019 (Section 3.3), and, thus, (ii) develop an XGBoost-based model to predict air quality and assess its performance under real-world conditions (Section 3.4).

Case Study
Tehran is the largest metropolis and capital of Iran, and it is located at 35°41′ N, 51°26′ E [80]. Its altitude is 900 to 1800 m above sea level; its northern part has cold and dry weather, while its southern part is relatively hot and dry. The yearly temperature ranges from 15 °C to 18 °C, though it varies by about 3 °C in different parts of the city. The area of the city is approximately 730 km2, and its population density is estimated at 10,555 people per km2. Because Tehran, with a population of nearly 13 million, hosts important governmental, political, economic, and industrial headquarters, it attracts substantial in-migration. Tehran's population growth rate is 4.1% and is expected to rise in the forthcoming years. More than 2 million cars, 500,000 motorcycles, and 5000 industrial units operate in Tehran. Considering that Tehran is the industrial and commercial capital of Iran and uses over 20% of the country's total energy, its air pollution is one of the most prominent environmental issues in Iran [81]. Tehran has two international airports and twelve active air pollution monitoring stations [82].
Tabriz, the capital of East Azerbaijan, is one of Iran's largest and oldest cities, located at 38°4′ N, 46°25′ E. The area of the city is 324 km2, and its altitude is 1350 to 1550 m above sea level. It is the most populous metropolis (1,559,000 people) in northwestern Iran. Moreover, Tabriz is known as an air pollution hotspot due to its extensive industrial activities. The city has ten municipal districts, an international airport, and eight air pollution sensors.
Shiraz is located in the mountainous region of Zagros at 29°36′ N, 52°33′ E and an altitude of 1486 m above sea level. This metropolis has an area of 217 km2 and is divided into eleven municipal districts. In 2016, about 32% of the population of Fars province lived in Shiraz (1,566,000 people) [83], and the population density of the city was 7215 people per km2. Shiraz has an international airport and three active air pollution monitoring stations. Figure 2 shows a schematic of the study zones.

Data
The air pollution data in this investigation comprise the recorded values of CO, O3, NO2, SO2, PM10, PM2.5, and the air quality index (AQI), as registered by the DOE monitoring system. The AQI corresponds to the highest pollutant level measured per day, and it rises as air pollution worsens. Average daily data from January 2015 to December 2019 were acquired directly from the Air Quality Monitoring System (AQMS) (available at https://aqms.doe.ir, accessed on 20 June 2021).
The Weather Underground archive (available at https://www.wunderground.com, accessed on 20 June 2021) was used to obtain meteorological data. Approximately 6000 automated meteorological stations operate at international airports, and their data are updated every 1, 3, or 6 h. The meteorological variables in this study include temperature (T, °C), relative humidity (RH, %), wind speed (WS, mph), and air pressure (AP, mmHg). The international airports connected to this system are Mehrabad Airport in Tehran (35 ...
NDVI is a well-known and broadly used indicator for numerically quantifying vegetation and measuring the health status of plants based on the reflection of light at specific frequencies by plants [84]. Remote sensing studies use data on the wavelengths of light absorbed and reflected by the surface, as recorded by satellite sensors. NDVI expresses vegetation as a number between −1 and 1: values near −1 correspond to water; rocky places, sand, and snow score 0.1 or less; shrubs, grasslands, or old plants fall between 0.2 and 0.5; and dense plants, forests, and farm canopies fall between 0.6 and 0.9 [85]. The Sentinel-Hub database (available at https://apps.sentinel-hub.com/eo-browser, accessed on 27 June 2021) enables researchers to retrieve images from different satellites with a variety of indicators for monitoring water, soil, and the atmosphere, and it can automatically calculate indices in a selected zone. NDVI was derived from Landsat 8 L1 imagery. The data provided by the automatic calculation system include the maximum, average, and minimum values every ten days and every 20 days. The average data from 2015-2019 for the three studied zones were selected as the reference for this study.

Mapping
NDVI is calculated from near-infrared (NIR) and red (R) reflectance (Equation (1)) [86]:
$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{R}}{\mathrm{NIR} + \mathrm{R}} \qquad (1)$$
Landsat 8 satellite imagery (OLI_TIRS) was employed for NDVI zoning. The images were downloaded from the USGS database (available at https://earthexplorer.usgs.gov, accessed on 27 June 2021) for six days in 2020 and 2021. Bands 4 (red) and 5 (NIR) of Landsat 8 were used to prepare this layer, and NDVI was computed on the images in the ArcGIS Pro software; thus, a vegetation density layer was obtained.
Kriging interpolation was used to zone AQI. Kriging is a robust geostatistical method for estimating a surface from a scattered set of points with z-values [87]. In the present study, after analyzing the distribution pattern of the points and the difference between their mean and variance, AQI was used as the known z-value and zoned for the study areas. To better compare NDVI and AQI, the Zonal Statistics tool in the spatial statistics toolbox of ArcGIS was applied. Zonal Statistics computes statistics on the cell values of a raster (the value raster) inside the zones defined by another dataset. The tool calculates a single statistic at a time and produces a raster output in which every cell belonging to a zone receives that zone's value. Because each output cell can hold only one value, the statistic is computed for only one zone where zones overlap [88]. Here, the tool calculated the average value of the raster layer cells (NDVI and AQI) within polygons based on the created Thiessen polygons. Each Thiessen polygon contains exactly one point input feature, and any location inside a Thiessen polygon is closer to its associated point than to any other point input feature.
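The NDVI calculation can be illustrated with a minimal sketch in plain Python. It assumes the standard normalized-difference form, (NIR − R) / (NIR + R); the function names and the small reflectance grids are hypothetical and are not taken from the study's ArcGIS Pro workflow.

```python
def ndvi(nir, red):
    """NDVI for one pixel; returns 0.0 when both reflectances are zero."""
    total = nir + red
    if total == 0:
        return 0.0  # avoid division by zero on empty pixels
    return (nir - red) / total


def ndvi_grid(nir_band, red_band):
    """Apply ndvi() cell by cell to two equally sized 2-D grids."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Hypothetical reflectances (for Landsat 8, band 5 is NIR and band 4 is red).
nir_band = [[0.50, 0.30], [0.05, 0.40]]
red_band = [[0.10, 0.20], [0.10, 0.40]]
grid = ndvi_grid(nir_band, red_band)
```

In this toy grid the first pixel (strong NIR, weak red) lands in the dense-vegetation range, while the pixel whose red reflectance exceeds its NIR reflectance becomes negative, as expected for water-like surfaces.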
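The zonal averaging over Thiessen polygons can be sketched without GIS software: assigning every raster cell to its nearest station is equivalent to placing it in that station's Thiessen polygon, after which a mean is taken per zone. This is only an illustration of the idea, not the ArcGIS Zonal Statistics implementation; the raster values, station coordinates, and function names are hypothetical.

```python
from math import hypot


def nearest_station(x, y, stations):
    """Index of the station closest to (x, y), i.e. the Thiessen zone of the cell."""
    return min(
        range(len(stations)),
        key=lambda i: hypot(x - stations[i][0], y - stations[i][1]),
    )


def zonal_mean(raster, stations):
    """Mean raster value inside each station's Thiessen zone (None if a zone is empty)."""
    sums = [0.0] * len(stations)
    counts = [0] * len(stations)
    for row_idx, row in enumerate(raster):
        for col_idx, value in enumerate(row):
            zone = nearest_station(col_idx, row_idx, stations)
            sums[zone] += value
            counts[zone] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical 2 x 4 NDVI raster and two stations given as (x, y) in cell units.
raster = [[0.2, 0.2, 0.6, 0.6],
          [0.2, 0.2, 0.6, 0.6]]
stations = [(0.5, 0.5), (2.5, 0.5)]
means = zonal_mean(raster, stations)
```

The same routine can be run on an AQI raster with the same stations, which yields one NDVI mean and one AQI mean per polygon for comparison.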

Statistical Analysis
SPSS version 18 was used for statistical analysis. Means and standard deviations were used as descriptive statistics. One-way analysis of variance and the Tukey test were carried out to compare the means of the variables NDVI, PM2.5, PM10, SO2, NO2, O3, and CO. The Pearson correlation coefficient was employed to evaluate the correlation of the variables T, RH, WS, AP, and NDVI with the variables PM2.5, PM10, SO2, NO2, O3, and CO. Results were considered significant at p < 0.05.
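As an illustration of the correlation step, the following sketch computes the Pearson coefficient and the associated t statistic in plain Python; the p-value would then come from a t distribution with n − 2 degrees of freedom. The study itself used SPSS, and the wind speed and PM2.5 values below are made up for demonstration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing r != 0 with n - 2 degrees of freedom (requires |r| < 1)."""
    return r * sqrt((n - 2) / (1 - r * r))


# Hypothetical daily values: wind speed (mph) and PM2.5 concentration.
wind_speed = [1, 2, 3, 4, 5]
pm25 = [50, 45, 35, 30, 20]

r = pearson_r(wind_speed, pm25)
t = t_statistic(r, len(wind_speed))
```

A strongly negative r here mirrors the kind of inverse relationship between wind speed and particulate matter reported in the literature reviewed above; with real data the coefficient would be far weaker.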

Modeling
Extreme Gradient Boosting (XGBoost) and Gradient Boosting (GB) are ensemble tree techniques that boost weak learners within a gradient descent framework. XGBoost, however, strengthens the basic GB framework through algorithmic optimizations.
The XGBoost package is part of the Distributed Machine Learning Community. The data are first fitted using a weak regressor. A further weak regressor is then added to improve the accuracy of the algorithm without changing the prior regressor, and the procedure is repeated. Each subsequent regressor should focus on the cases where the preceding regressor failed to perform appropriately. Figure 3 illustrates the flow of the general boosting algorithm. Initially, a decision tree produces an approximation y1; the second tree is then fitted to the previous step's residual, y − y1, and so on. In this way, the error of the algorithm may be substantially reduced. Table A1 in Appendix A shows the air quality features studied in the modeling process.
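The residual-fitting loop described above can be sketched in a few lines of plain Python. This is a simplified illustration of generic gradient boosting with squared-error loss and one-split trees (stumps) on a single feature; it is not XGBoost itself, and the data, the learning rate, and the function names are assumptions made for the example.

```python
def fit_stump(x, residual):
    """Least-squares regression stump (a single split) on a 1-D feature."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # every split leaves both sides non-empty
        left = [r for xi, r in zip(x, residual) if xi <= threshold]
        right = [r for xi, r in zip(x, residual) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees=20, learning_rate=0.5):
    """Fit each new stump to the current residual y - prediction."""
    prediction = [sum(y) / len(y)] * len(y)  # start from the mean
    errors = []
    for _ in range(n_trees):
        residual = [yi - pi for yi, pi in zip(y, prediction)]
        tree = fit_stump(x, residual)
        prediction = [pi + learning_rate * tree(xi)
                      for xi, pi in zip(x, prediction)]
        errors.append(sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)) / len(y))
    return prediction, errors


# Hypothetical 1-D training data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3, 4, 6, 9, 12, 14, 15, 15]
prediction, errors = boost(x, y)
```

The list `errors` holds the training mean squared error after each round; it never increases from one round to the next, which is the sense in which each added tree corrects the shortcomings of its predecessors.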

Following Chen and Guestrin [90] and Friedman [91], the GB and XGBoost algorithms are represented as follows. Let
$$D = \{(x_i, y_i)\}, \quad |D| = n, \quad x_i \in \mathbb{R}^m, \quad y_i \in \mathbb{R}$$
denote a dataset, where n is the number of samples, m is the number of parameters, and x and y represent the dataset's features and target variable. The ambient air database comprises 1697 samples and five parameters. The GB prediction for dataset D is the sum of the scores forecast by K trees, determined by the K-additive function indicated in Equation (2):
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \qquad (2)$$
where ŷ_i indicates the forecast of the i-th instance at the k-th boost and x_i is the i-th instance of the training dataset. The value of the k-th tree is f_k(x_i), and F represents the space of all decision trees. GB minimizes the loss function L_k, defined in Equation (3):
$$L_k = \sum_{i=1}^{n} L(\hat{y}_i, y_i) \qquad (3)$$
Because GB and XGBoost are both decision tree-based techniques, several tree-related hyper-parameters, such as subsample and max depth, were employed to avoid overfitting and optimize predictive accuracy. Moreover, the learning rate scales the weight of each tree added to the model and thereby slows the model's adaptation to the training dataset. These hyper-parameters are defined in the same way by XGBoost, and their explanations are reported in Table A2 in Appendix A.
The XGBoost objective operation includes a regularization mechanism that facilitates the selection of prediction operations and the management of model complexity.The objective function of the XGBoost is obtained by combining the loss function with the regularization term.The loss function controls the model's predictive power, while the regularization term determines the model's complexity.The XGBoost's aim function can be represented as follows in Equation ( 4): where L is the loss value that represents the model's compatibility on training dataset, ŷi is the predicted label, and y i is the actual label.R(f ) is capable of reducing the dynamics of the training tree's functions.It also addresses the overfitting issue.In order to demonstrate the complexity, we should first describe the details of the tree f (x) as Equation (5): The leaf score vector is represented by w, q is a mapping function that maps data samples to the associated leaf is represented by q, and T is the number of leaves.Equation ( 6) is based on the equation for penalizing the model's complexity: where γ and α are the hyper-parameters or constant coefficients, α represents each leaf value, and T is the total number of leaves in the tree.||w|| 2 signifies the L2-norm of the leaf weight controlled by the γ term, whereas ||w|| indicates the L1-norm of the leaf weight controlled by the α term.The weights are driven to be modest by L2 regularization (controlled by the reg lambda term), whereas sparsity is encouraged by L1 regularization (controlled by the reg alpha term).The lowest loss reduction is determined by the hyperparameter γ towards further division.
A hyper-parameter wmc (min child weight) maintains the depth of the tree, similar to alpha, and a substantial wmc could therefore make the system more precise in the splitting process.The objective function of XGBoost is optimized via gradient descent.The model is an additive model, which means that it introduces a tree to the model every time the forecast outcome equals the sum of the previous and new trees.So, among these equations, at the t-th step, Equation (7) calculates the target at each step, and a f t is used to minimize the error, which seems to minimize errors between both the predicted and measured output with loading f t .
Because a derivative is not available in closed form for every loss function, the objective is replaced by its second-order Taylor approximation, as indicated in Equation (8), where g_i and h_i are given by Equations (9) and (10), respectively. After removing the constant terms and adding the regularization term of Equation (6), Equation (11) gives the objective function at the t-th step.
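The derivation outlined above can be sketched as follows. The forms below follow the standard XGBoost derivation in Chen and Guestrin [90] and are an assumption about what Equations (7)-(11) contain. Here ŷ_i^(t−1) denotes the prediction after t−1 trees, and "const" collects terms that do not depend on the new tree f_t.

```latex
\begin{align}
\mathrm{Obj}^{(t)} &= \sum_{i=1}^{n} L\!\left(\hat{y}_i^{(t-1)} + f_t(x_i),\ y_i\right) + R(f_t) + \mathrm{const} \tag{7} \\
\mathrm{Obj}^{(t)} &\approx \sum_{i=1}^{n} \left[ L\!\left(\hat{y}_i^{(t-1)}, y_i\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^{2} \right] + R(f_t) + \mathrm{const} \tag{8} \\
g_i &= \frac{\partial L\!\left(\hat{y}_i^{(t-1)}, y_i\right)}{\partial \hat{y}_i^{(t-1)}} \tag{9} \\
h_i &= \frac{\partial^{2} L\!\left(\hat{y}_i^{(t-1)}, y_i\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^{2}} \tag{10} \\
\mathrm{Obj}^{(t)} &\approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^{2} \right] + R(f_t) \tag{11}
\end{align}
```

For a squared-error loss L = (1/2)(ŷ − y)^2, these reduce to g_i = ŷ_i^(t−1) − y_i and h_i = 1.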
In comparison with GB, XGBoost also employs column subsampling to further reduce overfitting. Column subsampling has been shown to be more effective than conventional row subsampling in preventing overfitting [92]. The hyper-parameter "subsample" subsamples the data by row; its definition is given in Table A2 in Appendix A, which also defines the "colsample_bytree" hyper-parameter. Because it is not practical to enumerate all possible trees at the same time, the tree is built by computing the leaf scores, regularization, and objective function at each level. The tree structure is reused in subsequent rounds, which reduces the computational complexity dramatically.
In addition, during the node splitting process, the gain of every feature is computed. The algorithm iteratively selects the optimal split point until the maximum depth is reached. Nodes whose splits yield a negative gain are then pruned in a bottom-up direction. In this way, XGBoost partitions the data down to the deepest levels of the trees. The findings were derived using the tuned hyper-parameters; when a hyper-parameter is not specified, XGBoost uses its default value.
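To make the split-gain step concrete, the following minimal sketch scores every candidate threshold of a single feature with the standard XGBoost gain expression, assuming a squared-error loss (so g_i = ŷ_i − y_i and h_i = 1). The function names `leaf_score` and `best_split`, the default values of `lam` and `gamma`, and the toy data are illustrative assumptions and not part of the study's code.

```python
def leaf_score(g_sum, h_sum, lam):
    """Structure score of a leaf: G^2 / (H + lambda)."""
    return g_sum * g_sum / (h_sum + lam)


def best_split(x, y, y_pred, lam=1.0, gamma=0.0):
    """Return (threshold, gain) of the best split on one feature, or (None, 0.0).

    Assumes squared-error loss, so g_i = y_pred_i - y_i and h_i = 1.
    """
    grads = [p - t for p, t in zip(y_pred, y)]
    order = sorted(range(len(x)), key=lambda i: x[i])
    g_total = sum(grads)
    h_total = float(len(x))

    best_thr, best_gain = None, 0.0
    g_left = h_left = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += grads[i]
        h_left += 1.0
        if x[i] == x[nxt]:
            continue  # cannot split between identical feature values
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = 0.5 * (
            leaf_score(g_left, h_left, lam)
            + leaf_score(g_right, h_right, lam)
            - leaf_score(g_total, h_total, lam)
        ) - gamma
        if gain > best_gain:  # only splits with positive gain are kept
            best_thr, best_gain = (x[i] + x[nxt]) / 2.0, gain
    return best_thr, best_gain


# Toy data: the target jumps when the feature exceeds 2.5.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 10.0, 20.0, 20.0]
y_pred = [15.0, 15.0, 15.0, 15.0]  # current ensemble prediction
threshold, gain = best_split(x, y, y_pred)
print(threshold, gain)
```

Raising `gamma` above the largest raw gain makes the function return no split at all, which mirrors how γ acts as the minimum loss reduction required for a further division.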

Hyperparameters Optimization
Hyper-parameter optimization is the process of determining which hyper-parameters of a given learning algorithm yield the best results when evaluated on validation data. Equation (12) represents hyper-parameter optimization:

$$x^{*} = \arg\min_{x \in X} f(x) \quad (12)$$

where f(x) denotes the objective score to be minimized, evaluated on the validation dataset; x* is the set of hyper-parameters that gives the minimum score; and x is any value in the domain X. Determining the model hyper-parameters that result in the best validation metric score is crucial. A further challenge is that evaluating the objective function is exceedingly expensive: for each candidate set of hyper-parameters, the model must be trained on the training set, used to predict outcomes on the validation dataset, and then scored with the validation metric. With many hyper-parameters and complex models, such as ensembles of deep neural networks, this procedure cannot be carried out manually. The four typical hyper-parameter optimization methods are (i) grid search, (ii) manual search, (iii) random search, and (iv) Bayesian optimization. Grid search is a conventional technique that performs an exhaustive search over a manually specified subset of the learning algorithm's hyper-parameter space. Because this space may include real-valued or unbounded ranges for some parameters, bounds must be set before a grid search can be executed. Grid search suffers from the curse of dimensionality, but it can usually be parallelized easily because the hyper-parameter settings it evaluates are generally independent of one another.
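As an illustration of the minimization described above and of the exhaustive nature of grid search, the sketch below enumerates a small, manually bounded grid and keeps the combination with the lowest objective score. The helper `grid_search` and the toy objective `toy_validation_error` are hypothetical stand-ins: in practice f(x) would be a validation error obtained by training a model for each combination.

```python
from itertools import product


def grid_search(objective, grid):
    """Evaluate objective on every grid combination.

    Returns (best_params, best_score, n_evaluated).
    """
    names = list(grid)
    best_params, best_score, n_evaluated = None, float("inf"), 0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        n_evaluated += 1
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated


def toy_validation_error(params):
    """Hypothetical objective with its minimum at learning_rate=0.1, max_depth=5."""
    return (params["learning_rate"] - 0.1) ** 2 + 0.01 * (params["max_depth"] - 5) ** 2


grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 5, 7],
}
best_params, best_score, n_evaluated = grid_search(toy_validation_error, grid)
print(best_params, best_score, n_evaluated)
```

The number of evaluations is the product of the list lengths (3 × 3 = 9 here), which is why the cost grows quickly as more hyper-parameters are added to the grid.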

Preprocessing Dataset
The ambient air dataset was divided into training and testing datasets comprising 80% and 20% of the samples, respectively. The training dataset was employed in model training and optimization. When the values immediately above and below a missing entry are both numerical, their mean is used to fill in the missing value. Feature scaling is a method used to standardize the range of features. In this regard, normalization was used to re-scale the features to the range [0, 1]. To normalize the data, min-max scaling was applied to each feature column, where the new value x_norm of a sample x is computed by Equation (13):

$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \quad (13)$$

where x_min and x_max are the minimum and maximum values of the feature column.
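The preprocessing steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names are hypothetical, missing values are represented as `None`, gaps are assumed to be isolated (both neighbors present), and the split is a simple ordered 80/20 cut because the section does not say whether the data were shuffled.

```python
def fill_with_neighbor_mean(column):
    """Replace None by the mean of the values directly above and below it.

    Entries at the edges, or with a missing neighbor, are left unchanged.
    """
    filled = list(column)
    for i in range(1, len(column) - 1):
        if column[i] is None and column[i - 1] is not None and column[i + 1] is not None:
            filled[i] = (column[i - 1] + column[i + 1]) / 2.0
    return filled


def min_max_scale(column):
    """Re-scale a numeric column to [0, 1] with min-max scaling."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in column]


def train_test_split_80_20(rows):
    """Ordered split: first 80% of rows for training, the rest for testing."""
    cut = len(rows) * 4 // 5
    return rows[:cut], rows[cut:]


temperature = [10.0, None, 14.0, 20.0, 30.0]  # made-up readings
filled = fill_with_neighbor_mean(temperature)
scaled = min_max_scale(filled)
train, test = train_test_split_80_20(scaled)
print(filled, scaled, train, test)
```

In a real pipeline the minimum and maximum would usually be taken from the training portion only and then reused for the test portion, so that no information from the test data leaks into training.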

XGBoost Training and Hyper-Parameter Optimization
After preprocessing the training and test datasets, the XGBoost regressor of the target variable was fitted to the training data with Grid Search Optimization, which was used to fine-tune the hyper-parameters. Eight parameters were tuned in this study: learning_rate, n_estimators, min_child_weight, max_depth, subsample, gamma, reg_lambda, and booster. The learning rate improves the model's stability and robustness, while min_child_weight, max_depth, subsample, and gamma control overfitting. Similarly, the reg_lambda regularization parameter penalizes complex models. The evaluation indicator is the mean squared error of 6-fold stratified cross-validation on the training examples, which varies depending on the objective function considered. The main objective function is the XGBoost algorithm with different sets of hyper-parameters. Figure 4 illustrates the proposed model.

Figure 4 shows the cross-validation of a training phase and the assessment of the mean of the mean squared error values for a set of XGBoost hyper-parameters. Grid Search Optimization seeks the hyper-parameter set with the best (lowest) cross-validated mean squared error. Once the specified number of iterations is completed, the model with the best mean squared error is selected for prediction on the holdout test data. On the holdout test data, a variety of evaluation metrics were employed to assess the performance of the selected optimized XGBoost model.
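The cross-validation scoring step can be illustrated without any external library. The sketch below computes the mean of the fold-wise mean squared errors for a six-fold split. The function names are hypothetical, `mean_predictor` is a deliberately trivial stand-in for the XGBoost regressor, and plain contiguous (non-stratified) folds are used for simplicity, which is an assumption that differs from the stratified folds mentioned in the text.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_mse(fit_predict, X, y, k=6):
    """Mean of the fold-wise mean squared errors."""
    fold_mse = []
    for held_out in k_fold_indices(len(y), k):
        held = set(held_out)
        train_idx = [i for i in range(len(y)) if i not in held]
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in held_out],
        )
        errors = [(p - y[i]) ** 2 for p, i in zip(preds, held_out)]
        fold_mse.append(sum(errors) / len(errors))
    return sum(fold_mse) / len(fold_mse)


def mean_predictor(X_train, y_train, X_valid):
    """Hypothetical baseline model: always predict the training mean."""
    mean = sum(y_train) / len(y_train)
    return [mean for _ in X_valid]


X = [[float(i)] for i in range(12)]
y = [2.0] * 12  # constant target, so the baseline is exact
score = cross_val_mse(mean_predictor, X, y, k=6)
print(score)
```

In the study's setting, `fit_predict` would wrap an XGBoost regressor configured with one candidate hyper-parameter set, and the grid search would keep the set with the lowest score.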

Evaluation Metrics
The performance of the developed method was assessed using three evaluation metrics: RMSE (root mean squared error), R² (R squared), and MAE (mean absolute error). In machine learning, it is useful to have a single number with which to evaluate a model's performance, whether during training, cross-validation, or monitoring after deployment. Root mean squared error is one of the most frequently used measures for this purpose. It is a simple scoring rule that is also consistent with several of the most common statistical assumptions.
The mean absolute error (MAE) measures the errors between paired observations describing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed values, a subsequent time versus an initial time, and one measurement technique versus an alternative technique.

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n}$$

The coefficient of determination, denoted R² and pronounced "R squared", is the fraction of the variation in the dependent variable that is predictable from the independent variable(s). It is a statistic used with statistical models whose primary objective is either to forecast future results or to evaluate hypotheses based on other data. It indicates how well the observed results are replicated by the model, based on the fraction of the overall variance of the outputs that the model explains. If ȳ is the mean of the observed data y_i and f_i are the values forecasted by the model, then:

$$R^{2} = 1 - \frac{\sum_{i} (y_i - f_i)^{2}}{\sum_{i} (y_i - \bar{y})^{2}}$$
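For reference, the three metrics can be computed directly from their definitions. The following is a minimal sketch with hypothetical function names and made-up numbers; it is not the evaluation code used in the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]   # made-up observations
predicted = [1.0, 2.0, 3.0, 2.0]  # made-up model output
print(rmse(observed, predicted), mae(observed, predicted), r_squared(observed, predicted))
```

RMSE and MAE are expressed in the units of the target variable, whereas R² is unitless, so an R² of about 0.2 in this toy case means that roughly 20% of the variance in the observations is explained by the predictions.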

Changes of Pollutants Emission
Although air pollution in the metropolitan regions was expected to increase over 2015-2019 owing to population growth and urbanization (hypothesis H1), preliminary results showed a declining trend of CO and PM2.5 in Tehran. PM10, SO2, T, and WS had a stable trend, and only NO2 and RH increased slightly (Figure 5A). The annual average of PM2.5 in Tabriz decreased significantly from 70 µg m−3 in 2015 to 50 µg m−3 in 2019. There were no significant variations in the mean SO2 and NO2 levels, while O3 and CO levels increased in 2017-2019. Furthermore, T, RH, WS, and AP were relatively unchanged (Figure 5B). In Shiraz, there was no considerable difference in the PM2.5 level between 2015-2017 and 2018-2019. Over the same period, PM10 and SO2 levels dropped dramatically. NO2 and O3 increased significantly, whereas CO decreased only in 2019. WS and T had a consistent trend, while RH increased (Figure 5C). In addition, AP was unchanged in all cities. Accordingly, since no substantial increase was recorded in all or most of the pollutants, hypothesis H1 is rejected. Moreover, several pollutants showed a decreasing or unchanged trend in 2015-2019.

Interactions between Air Pollutants and Vegetation
The differences in NDVI (mean and standard deviation) between each pair of cities (Tehran, Shiraz, and Tabriz) were statistically significant (p < 0.05) (Table A3 in Appendix A). The correlation of PM2.5 with NDVI was r = −0.15 in Tehran, r = 0.12 in Shiraz, and r = −0.25 in Tabriz; only the correlation in Shiraz was non-significant (p = 0.218). The correlation between PM10 and NDVI was r = 0.08 in Tehran, r = 0.20 in Shiraz, and r = 0.08 in Tabriz, all of which were insignificant. The correlation of SO2 with NDVI was not significant in any city. The correlation of NO2 with NDVI was significant in Shiraz and Tabriz. The correlation of O3 with NDVI was r = 0.54 in Tehran, r = −0.03 in Shiraz, and r = 0.34 in Tabriz, and only Shiraz had an insignificant correlation (p = 0.801). The correlation of CO with NDVI was insignificant in Tehran (p = 0.183) and Tabriz (p = 0.066). Relationships between NDVI and pollutants in Tehran, Tabriz, and Shiraz are illustrated in Figure 6A-C.
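According to the note to Table A4, the coefficients reported here are Pearson correlations. As a reminder of what that statistic measures, a minimal sketch is given below; the function name and the NDVI and pollutant values are invented for illustration and are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: the pollutant level falls as NDVI rises.
ndvi = [0.1, 0.2, 0.3, 0.4]
pollutant = [40.0, 30.0, 20.0, 10.0]
r = pearson_r(ndvi, pollutant)
print(r)
```

A perfectly linear decreasing relationship such as this toy case gives r close to −1, whereas the coefficients reported in this section are much closer to zero, which indicates weak linear association.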
In response to hypothesis 2 (H2), this research reveals that vegetation may have a minor effect in lowering PM2.5, SO2, and CO levels in Iranian cities (r PM2.5 = −0.03, r SO2 = −0.08, and r CO = −0.17) (Table A4 in Appendix A). This finding is consistent with the results of prior studies that found plants to be inefficient in lowering pollution levels. According to Yli-Pelkonen et al. [93], the influence of urban vegetation on enhancing air quality and lowering pollutants in Helsinki, Finland, was low.
Numerous investigations have been conducted to demonstrate visually and spatially the relationship between vegetation and air pollution. In this context, Zhou et al. [94] used the Pearson correlation coefficient to examine the association between NDVI and pollutants in Chinese cities. They found that regions with higher NDVI had lower AQI, with a negative relationship such that an increase of 0.1 NDVI units reduced AQI by 3.75 units (95% confidence interval). Zheng et al. [95] also evaluated the connection between air pollution and land use in Hangzhou, China, and found that areas with low NDVI and high surface temperature had high concentrations of PM, NO2, SO2, and CO. Prakasam et al. [96] examined satellite images from 2001-2021 and found that decreasing vegetation was clearly connected with poor air quality in Himachal Pradesh (India). Sun et al. [97] reported that concentrations of PM2.5, PM10, CO, NO2, and SO2 were negatively correlated with NDVI levels. Figures 7-9 show that regions with less vegetation have higher pollutant levels, leading to an increase in AQI. Although the figures demonstrate that AQI is higher in regions with low NDVI, the contribution of vegetation to decreasing pollution cannot be deemed effective.

Interactions between Air Pollutants and Meteorological Factors
The mean PM2.5 in Tehran (25.8 ± 90.2) was statistically different from that in Shiraz and Tabriz. The mean PM10 in Tehran (51.7 ± 16.2) differed significantly from that in Shiraz and slightly from that in Tabriz. The mean SO2 in Tehran (24.9 ± 5.8) was insignificantly different from that in Shiraz and considerably different from that in Tabriz. The mean NO2 in Tehran (62.3 ± 16.3) was significantly different from that in Shiraz and Tabriz. Tehran's mean O3 level (32.5 ± 19.4) clearly differed from that of Shiraz and Tabriz. The mean CO in Tehran (38 ± 10.7) showed a significant difference from that in Shiraz and Tabriz.
In the analysis of the average daily meteorological data, T in Tehran (66 ± 17.9) showed a slight difference from Shiraz and a significant difference from Tabriz. The mean RH in Tehran (34 ± 17.6) showed an insignificant difference from Shiraz and a significant difference from Tabriz. The mean WS in Tehran (7.2 ± 3) differed significantly from that in Shiraz and Tabriz. The difference between the mean AP in Tehran (26 ± 0.1) and that in Shiraz and Tabriz was significant (Table A5 in Appendix A).
In Shiraz and Tabriz, all correlations of meteorological parameters with PM2.5 were significant. Tehran recorded the highest negative correlation of PM2.5 with WS (r = −0.38) and Tabriz recorded the highest positive correlation of PM2.5 with RH (r = 0.24). The most negative relationship of PM10 with RH was recorded in Shiraz (r = −0.26) and the most positive correlation with T was recorded in Shiraz (r = 0.24). The most negative correlation of SO2 with WS was recorded in Tehran (r = −0.28) and the most positive relationship with AP was recorded in Tehran (r = 0.24). The most negative correlation of NO2 with WS was observed in Tehran (r = −0.31) and the most positive relationship with T was observed in Shiraz (r = 0.14). The most negative correlation of O3 with RH (r = −0.42) was observed in Tehran, and the most positive relationship with T (r = 0.50) was also observed in Tehran. The most negative correlation of CO with WS was recorded in Tehran (r = −0.28) and the most positive relationship with AP in Shiraz (r = 0.28) (Table A6 in Appendix A).
The findings of this study are consistent with several previous studies. According to the results obtained by Qiao et al. [98], RH, WS, and T are the main factors affecting air quality in China. In addition, Jayamurugan et al. [40] observed a significant negative correlation between RH and PM, which is similar to the outcome of the PM relationship analysis in Shiraz. In a study by Zhou et al. [44], the mean O3 had the highest positive correlation with T, which was also recorded in Tehran. In addition, in an investigation by Kayes et al. [99], most pollutants had a negative relationship with T and RH, and this outcome was also observed in Tabriz. In a study by Sezer Turalıoğlu et al. [100] in Erzurum, Turkey, higher SO2 concentrations were associated with lower T, lower WS, and higher RH. Moreover, the results of linear and nonlinear regression analyses of SO2 with meteorological parameters showed a moderate and weak connection between this pollutant and meteorological parameters in Elazig [101]. In research by Ilten and Selici [102] in Balikesir, higher concentrations of total daily particulate matter and SO2 were associated with lower T, lower WS, higher AP, and higher RH. In an analysis by Kliengchuay et al. [103] in Mae Hong Son Province, Thailand, PM10 concentrations were significantly associated with RH (r = −0.37). The results of the Spearman analysis by Jassim et al. [104] in Bahrain revealed that the correlation coefficients between RH and the concentrations of PM10 and PM2.5 were r = −0.595 and r = −0.526, respectively, a remarkable negative relationship. There was a considerable positive correlation between temperature and PM10 (r = 0.42) and PM2.5 (r = 0.48). The correlations of PM10 with RH and T in Tehran and the correlations of PM2.5 and PM10 with RH and T in Shiraz were similar to the outcomes obtained by Jassim et al. [104].

Model Evaluation
In the hyper-parameter optimization phase, the Grid Search optimization approach was used to apply various combinations of XGBoost parameters and to optimize the mean squared error under 6-fold stratified cross-validation for each of the models. It was hypothesized that the Grid Search iterations could provide an ideal set of parameters. Table A7 in Appendix A presents the model assessment results and shows that the model performance (R² test) in the daily forecast of PM2.5, PM10, NO2, SO2, O3, and CO was 0.36, 0.27, 0.46, 0.41, 0.52, and 0.38, respectively, which indicates better performance in predicting gaseous pollutants. However, this accuracy is insufficient to predict air pollution on an urban scale.
Since the meteorological and air pollution stations were at different locations, the output of the forecasting model was expected to be closer to the readings of the sensors near airports. Therefore, the data generated by the model were compared with the actual data for November 2021, and the distance factor was found to be unrelated to the performance of the model. Since there is no significant linear relationship between the model output and the real data reported by sensors in Tehran (Figure 10A), Tabriz (Figure 10B), and Shiraz (Figure 10C), the model's performance in predicting air pollution varies from region to region, which means it is not usable in real conditions. It appears that other algorithms should have been employed in addition to XGBoost. However, exploring the modeling challenges is more important than obtaining a more efficient model. Numerous prior articles used neural network models, although training such models requires large, long-term datasets [105], which were not obtainable for this research.
Although air pollution prediction can be effective for managing urban operations, the air pollution status depends on various environmental factors, which makes it difficult to predict the concentration of pollutants [106]. It has been reported that modeling dynamic real-world phenomena like air pollution is a significant challenge owing to their non-linearity and high-dimensional sample space [107]. In this study, several factors appear to have reduced the accuracy of the model, including the unavailability of information regarding pollutant emissions from sources such as factories and traffic, the high dynamics of environmental parameters, and the lack of data due to sensor errors. Kang et al. [108] reported that sensor flaws or incomplete data make it exceedingly difficult to forecast air quality through modeling. Additionally, since linear regression techniques are not efficient for predicting time-dependent data [109], it is challenging to predict air pollution in the presence of invalid and missing inputs [110]. Liao et al. [111] documented that shallow statistical methods and flawed sensors restrict the air quality prediction process. On the basis of the findings, the following summary of obstacles and solutions for reliable air quality forecasting is presented:

• The use of deep learning techniques to improve prediction [111,112];
• This survey did not consider second- and third-order interactions between parameters. Researchers should, therefore, address these interactions in the modeling process;
• It is suggested that in machine learning-based investigations, correlations across weather stations and nearby air quality stations should be explored to improve prediction accuracy [113]. In addition, it is necessary to develop dynamic and integrated air quality models employing hybrid machine learning algorithms [108];
• Modeling the emission from sources, chemical reactions of pollutants, and urban activities is required to improve forecasting accuracy [114], which was not considered in the present investigation.

Eventually, clean air may only be restored whenever governments shift their approach toward sustainable environmental strategies [115].
Sustainability 2022, 14, 8027

Conclusions
Air pollution is an inevitable phenomenon caused by the development of industry and urbanization in recent decades, which has adversely affected human and ecosystem health. Although some actions have been taken to reduce it, they have not been significantly effective. Given that air pollutants, including PM, NO2, SO2, O3, and CO, interact with their surrounding environment, many researchers use the interaction of the pollutants with vegetation and meteorological parameters, such as temperature, relative humidity, wind speed, and air pressure, to develop air quality forecasting models. The present research attempted to explore the relationships between air pollutants and the ambient environment from a statistical perspective in Tehran, Tabriz, and Shiraz, Iran, and then to create a model for predicting air pollution using machine learning. For the regression task, the improved XGBoost algorithm was applied in the proposed model. Three distinct assessment criteria were used to assess the proposed technique: R², Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). Grid Search optimization was employed as the hyper-parameter optimization method in modeling, and it was shown to be a beneficial approach for obtaining the optimum hyper-parameters. According to the results of the experiments, it may be inferred that the proposed forecasting model could enhance the level of decision-making during air quality prediction.
Although there were evident connections between pollutants and meteorological factors and vegetation in all three cities, they were not adequate to allow the model to accurately predict daily air pollution (R² PM2.5 = 0.36, R² PM10 = 0.27, R² NO2 = 0.46, R² SO2 = 0.41, R² O3 = 0.52, and R² CO = 0.38). In addition to meteorological factors, other factors are also involved in the diffusion of air pollutants in the atmosphere, such as sunlight, wind direction, and chemical reactions of pollutants. It appears that the lack of data caused by sensor errors, the lack of data regarding polluting sources such as factories and traffic, and the high dynamics of environmental conditions reduced the accuracy of the model. Thus, it is concluded that examining only the interaction of pollutants with meteorological and vegetation parameters is not sufficient for modeling and predicting air pollution. Furthermore, the spatial diversity of pollution monitors and meteorological stations made it difficult to develop a model for predicting air pollution region by region. The following strategies can be effective for future studies: (1) equalizing the number of air pollution and meteorological monitoring stations; (2) using small and low-cost sensors to develop the pollution monitoring network; (3) addressing data loss due to sensor errors with deep learning methods; and (4) integrating satellite observations with proximal data.

Figure 2. The map of study zones in Iran.

Figure 3. A schematic of GB algorithm, adapted from [89].

Following a comprehensive review of the related literature by Chen and Guestrin [90] and Friedman [91], the GB and XGBoost algorithms are represented as follows:

$$D = \{(x_i, y_i)\} \quad (|D| = n,\ x_i \in \mathbb{R}^{m},\ y_i \in \mathbb{R}) \quad (1)$$

where D denotes a dataset, n is the number of samples, m is the number of parameters, and x and y represent the dataset's features and target variable. The ambient air database comprises 1697 samples and five parameters. The prediction for dataset D in GB is the sum of the scores forecasted by K trees, which is determined by the K-additive function, as indicated in Equation (2):

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \quad (2)$$

Figure 4. Grid Search Optimization proposes a set of hyper-parameters for six-fold cross-validation of the XGBoost model; adapted from [89].

Figure 6. Relationships between air pollutants and NDVI in Tehran (A), Tabriz (B), and Shiraz (C) from 2015 to 2019. Value of the NDVI is between 0 and 1.

Figure 7. Relationships between NDVI and AQI in Tehran; AQI is collected from air pollution sensors distributed in the city, and NDVI is obtained from Landsat 8; (a) 6 November 2020; and (b) 2 June 2021.

Figure 8. Relationships between NDVI and AQI in Tabriz; AQI is collected from air pollution sensors distributed in the city, and NDVI is obtained from Landsat 8; (a) 1 November 2020; and (b) 29 May 2021.

Figure 9. Relationships between NDVI and AQI in Shiraz; AQI is collected from air pollution sensors distributed in the city, and NDVI is obtained from Landsat 8; (a) 15 November 2020; and (b) 26 May 2021.
2.5 with WS (r = −0.38)and Tabriz recorded the highest positive correlation of PM 2.5 with RH (r = 0.24).The most negative relationship of PM 10 with RH was recorded in Shiraz (r = −0.26)and the most positive correlation with T was recorded in Shiraz (r = 0.24).The most negative correlation of SO 2 with WS was recorded in Tehran (r = −0.28)and the most positive relationship with AP was recorded in Tehran (r = 0.24).The most negative correlation of NO 2 with WS was observed in Tehran (r = −0.31)and the most positive relationship with T was observed in Shiraz (r = 0.14).The most negative correlation of O 3 with RH (r = −0.42)was observed in Tehran, and the most positive relationship with T (r = 0.50) was also observed in Tehran.The most negative correlation of CO with WS was recorded in Tehran (r = −0.28)and the most positive relationship with AP in Shiraz (r = 0.28) (Table A6 in Appendix A).

Figure 10. Correlations of the AQI predicted by the model with the AQI reported by sensors in Tehran (A), Tabriz (B), and Shiraz (C) in November 2021. The stations are listed in order of their distance from the airport.

Table A3. The difference of NDVI in the cities. One-way ANOVA; Tukey test for post hoc test; significance level was set at 0.05. Significant quantities are shown in bold format.

Table A4. The correlation between air pollutants and NDVI. r: The Pearson correlation test; the significance level was set at 0.05. Significant quantities are shown in bold format.

Table A5. The difference between average air pollution and meteorological data. SD: Standard deviation; one-way analysis of variance and Tukey test; the significance level was set at 0.05. Significant quantities are shown in bold format.

Table A6. The obtained correlations between air pollutants and meteorological parameters. r: The Pearson correlation; the significance level was set at 0.05. Significant quantities are shown in bold format.

Table A7. The obtained results by model evaluation.