PM2.5 Retrieval Using Aerosol Optical Depth, Meteorological Variables, and Artificial Intelligence

Abstract: Particulate matter (PM) is one of the major air pollutants with adverse impacts on human health. The aim of this study is to present an alternative approach for retrieving fine PM (particles with an aerodynamic diameter less than 2.5 µm, PM2.5) using artificial intelligence. Ground-based instruments, including a hand-held Microtops II sun photometer (for aerosol optical depth), a PurpleAir sensor (for PM2.5), and Rotronic sensors (for temperature and relative humidity), are used to train the machine learning algorithm. The retrieved PM2.5 shows adequate performance, with an error of 0.08 µg m−3 and a Pearson correlation coefficient of 0.84.


Introduction
Particulate matter (PM)-related air pollution is a major environmental risk affecting human health and the environment [1]. Thus, precise knowledge of the spatiotemporal distribution of PM mass concentration is vital for quantitatively assessing its environmental impact and investigating the health risks for the public [2]. Conventional reference-grade instruments face several limitations, mainly due to their high installation and operation costs. As a result, the density of regulatory monitoring sites is limited, and the sites are unable to capture the small-scale variations of PM concentrations across complex environments. Recent advances in electronics have enabled PM monitoring with low-cost and portable sensing modules. Low-cost sensor technologies constitute a promising tool to supplement existing PM monitoring networks and enhance their spatiotemporal resolution.
During the last two decades, alternative techniques for retrieving the spatiotemporal distribution of PM2.5 have multiplied rapidly. They exploit the relationship between satellite-based aerosol optical depth (AOD) and PM2.5 in conjunction with advanced mathematical methods [3]. Some of the most frequently implemented methods are multiple linear regression models [4] and machine learning (ML) algorithms such as artificial neural networks [5], support vector machines [6], and random forest [7,8]. The accuracy of PM2.5 estimations depends on the uncertainties of the satellite AOD products. In addition, since AOD measurements from satellites are available only 1-2 times per day, PM2.5 retrievals are provided solely on a daily basis.
In this work, an alternative machine learning methodology for retrieving PM2.5 is proposed. It considers, for the first time, the importance of using AOD at several spectral channels along with meteorological variables, based on quality-assured data from ground-based instruments.

Data
The data used in this study were collected at the Laboratory of Atmospheric Physics at the University of Patras (38.291° N, 21.789° E) and were divided into three main categories. The first category includes aerosol optical properties, namely aerosol optical depth (AOD) at four spectral channels (440, 500, 675, and 870 nm), as collected from a hand-held Microtops II (MII) sun photometer. MII retrieves columnar AOD using the Bouguer-Lambert-Beer law [9]. All the MII measurements were acquired under cloud-free conditions at a 30 min resolution.
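As an illustration of the Bouguer-Lambert-Beer inversion mentioned above, the following minimal Python sketch recovers AOD from a direct-sun signal. It is a simplified sketch, not the MII processing chain: the function name, the argument names, and the numerical values are hypothetical, and corrections such as the Earth-Sun distance and gas absorption are deliberately left out.

```python
import math


def aerosol_optical_depth(signal, signal_toa, air_mass, rayleigh_od=0.0):
    """Invert V = V0 * exp(-m * tau_total) and remove the Rayleigh part.

    signal      -- direct-sun signal measured at the surface (V)
    signal_toa  -- calibration constant, i.e. the signal at zero air mass (V0)
    air_mass    -- relative optical air mass along the slant path (m)
    rayleigh_od -- Rayleigh (molecular) optical depth at this wavelength
    """
    if signal <= 0 or signal_toa <= 0 or air_mass <= 0:
        raise ValueError("signal, signal_toa and air_mass must be positive")
    total_od = math.log(signal_toa / signal) / air_mass
    return total_od - rayleigh_od


# Hypothetical reading at 500 nm (illustrative numbers only).
aod_500 = aerosol_optical_depth(
    signal=650.0, signal_toa=1000.0, air_mass=1.5, rayleigh_od=0.14
)
print(f"AOD at 500 nm: {aod_500:.3f}")
```

The same function would be applied once per spectral channel, each with its own calibration constant and Rayleigh optical depth.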
The second category includes calibrated PM2.5 measurements from a PurpleAir-II low-cost particle concentration sensor (PAir). PAir monitors integrate a set of PMS 5003 sensors (Plantower Co., Ltd., Beijing, China) and conduct simultaneous PM concentration measurements at approximately 2 min temporal resolution. The PMS sensors operate on the principle of particle light scattering and report the size distribution of particles with diameters between 0.3 and 10 µm, as well as the mass concentrations of PM1, PM2.5, and PM10. They are equipped with a built-in fan that draws ambient air (flow rate: 0.1 L min−1) and a laser at 680 nm wavelength that is used as the light source. Particles pass through the laser beam and the scattered light is collected by a photodetector; a proprietary algorithm is used to determine PM mass concentrations based on the output signal. The sensitivity and reliability of PAir sensors have been widely investigated during the last few years, showing good performance and long-term stability [10][11][12]. Low-cost sensors, however, require site-specific calibration to assure good data quality [13,14]. In this work, PAir PM2.5 values were corrected by implementing the calibration method proposed in [15], which is appropriate for the examined area.
The third data category contains meteorological data, namely ambient temperature (T) and relative humidity (RH), obtained from Rotronic sensors (MP101A-T7-W4W) at the automatic weather station located at the University campus in Patras, Greece. Within the study period, which spanned from April 2021 to October 2022, 1767 measurements were acquired. The meteorological and PM2.5 data were temporally aggregated within a time window of 2 min (±1 min) centered on the MII timestamp.
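The temporal matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source does not say which statistic is used for the aggregation, so the arithmetic mean is assumed here, and the function name, record layout, and sample values are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def window_mean(records, center, half_width=timedelta(minutes=1)):
    """Average the values whose timestamp lies within center +/- half_width.

    records -- iterable of (timestamp, value) pairs, e.g. PM2.5 or T readings
    center  -- timestamp of the sun photometer (MII) measurement
    Returns None when no record falls inside the window.
    """
    values = [value for stamp, value in records if abs(stamp - center) <= half_width]
    return mean(values) if values else None


# Hypothetical PM2.5 readings (µg m−3) around one MII measurement at 10:30:00.
pm25_records = [
    (datetime(2021, 6, 1, 10, 28, 30), 4.0),  # 90 s before: outside the window
    (datetime(2021, 6, 1, 10, 29, 10), 5.0),  # 50 s before: inside
    (datetime(2021, 6, 1, 10, 30, 40), 6.0),  # 40 s after: inside
    (datetime(2021, 6, 1, 10, 31, 30), 9.0),  # 90 s after: outside
]
mii_time = datetime(2021, 6, 1, 10, 30, 0)
matched_pm25 = window_mean(pm25_records, mii_time)
print(matched_pm25)
```

Returning None for empty windows makes it easy to drop MII measurements that have no coincident PM2.5 or meteorological record.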

Methodology
PM2.5 is retrieved from the following parameters: (1) AOD at four spectral channels (440, 500, 675, and 870 nm), (2) T, and (3) RH. AOD is a suitable variable for capturing the intra-day variations of PM2.5 mass concentrations, since aerosol emissions, dynamical transport, etc., affect both parameters. The whole dataset, which consists of the above parameters, was first separated into a training set and a test set, containing 70% and 30% of the data, respectively. For this study, an ensemble technique, the random forest (RF), is adopted. RF is a very effective supervised machine learning algorithm that can produce very accurate predictions on large datasets, for either classification or regression tasks. In this study, the RF is used for regression, and the training set is used to train the RF algorithm.
To achieve optimal accuracy, a randomized search was performed during training to find the best combination of hyperparameters, using 10-fold cross-validation with the mean square error as the loss function. After training, the RF scheme with the highest performance, i.e., the one with the best combination of hyperparameters, is applied to the test dataset for evaluation.
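The split, randomized search, and evaluation steps described in this section could look roughly like the following scikit-learn sketch. It is not the configuration used in the study: the data are synthetic stand-ins, and the hyperparameter grid, the number of search iterations, and the random seeds are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the real inputs (assumption): 4 AOD channels, T, RH.
rng = np.random.default_rng(0)
n_samples = 300
X = np.column_stack([
    rng.uniform(0.02, 1.0, size=(n_samples, 4)),  # AOD at four channels
    rng.uniform(4.0, 40.0, size=n_samples),       # T in °C
    rng.uniform(10.0, 90.0, size=n_samples),      # RH in %
])
# Synthetic PM2.5 target with noise, for illustration only.
y = 2.0 + 8.0 * X[:, 0] + 0.05 * X[:, 4] + rng.normal(0.0, 0.5, n_samples)

# 70% training / 30% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Hypothetical search space; the source does not list the tuned hyperparameters.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [1.0, "sqrt"],
}

# Randomized search with 10-fold cross-validation, scored by (negative) MSE.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=10,
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the best configuration on the held-out test set.
y_pred = search.best_estimator_.predict(X_test)
test_mse = float(np.mean((y_pred - y_test) ** 2))
print(search.best_params_, round(test_mse, 3))
```

scikit-learn maximizes scores, which is why the mean square error enters the search with a negative sign.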

Descriptive Statistics
According to Table 1, PM2.5 ranged from 0.37 to 18.76 µg m−3, with a mean of 4.72 µg m−3, indicating a modest pollution level at the study site. During the same period, the mean AOD values ranged between 0.11 and 0.21. The city of Patras, located in southern Europe, is frequently affected by dust particles transported from the Sahara Desert, which produce high AOD levels (maximum values of 0.93-1.10). Nevertheless, fine particles are dominant across the area, as shown by a mean AE 440-870nm (Ångström Exponent between 440 and 870 nm) of 1.41. The AE 440-870nm from MII is computed using the Ångström power formula from the corresponding AOD channels. T ranged from 4.40 to 39.70 °C and RH from 11.80 to 89.80%, with average values of 24.26 °C and 45.36%, respectively.
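For reference, the Ångström power law, AOD(λ) ∝ λ^(−AE), gives a simple two-wavelength expression for the exponent. The sketch below assumes this two-channel form; the function name and the sample AOD values are hypothetical, and the source does not state whether the MII software uses exactly this form.

```python
import math


def angstrom_exponent(aod_1, aod_2, wavelength_1_nm, wavelength_2_nm):
    """Two-wavelength Angstrom exponent: AE = -ln(AOD1 / AOD2) / ln(l1 / l2)."""
    return -math.log(aod_1 / aod_2) / math.log(wavelength_1_nm / wavelength_2_nm)


# Hypothetical AOD pair that follows a power law with exponent 1.41.
aod_440 = 0.20
aod_870 = aod_440 * (870.0 / 440.0) ** -1.41
ae_440_870 = angstrom_exponent(aod_440, aod_870, 440.0, 870.0)
print(f"AE 440-870nm: {ae_440_870:.2f}")
```

Larger exponents indicate a stronger wavelength dependence of AOD, which is typical of fine particles, whereas coarse dust yields values closer to zero.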

Machine Learning Algorithm Performance
To investigate the effects of spectral AOD and meteorological variables on the retrieval performance of the model, a sensitivity analysis of the input parameters was performed during the training of the RF algorithm. In total, 15 different cases were applied, with the aerosol optical properties as a baseline (Table 2). The first scenario (Scenario 1) consisted of five sub-scenarios. Scenario 1.1 included solely the AOD 440nm as an input parameter for the RF algorithm training, scenario 1.2 added the AOD 500nm, and so on for the rest of the sub-scenarios. Thus, scenario 1.5 included the AOD at the four MII spectral channels and AE 440-870nm. The cases in Scenarios 2 and 3 are similar to Scenario 1 but additionally included T and RH, respectively, as input parameters. Figure 1 illustrates the findings of the sensitivity analysis for the 15 training scenarios. In the literature, the majority of the studies dedicated to PM2.5 retrieval via ML use satellite-based AOD at a specific channel. In this study, the effect of spectral AOD information on ML algorithm performance (Scenario 1) is investigated first, and it is apparent that the performance of the ML algorithm increases as more spectral channels of AOD are included. In particular, the Mean Absolute Error (MAE) (Root Mean Square Error (RMSE)) values range from 1.76 µg m−3 (2.25 µg m−3) to 1.10 µg m−3 (1.53 µg m−3). In terms of the correlation coefficient (R), the ML algorithm performance increased substantially when all four spectral channels of AOD were included (from 0.45 to 0.78). The effect of AE 440-870nm was marginal for all scenarios. In total, including all spectral AOD channels reduced the MAE (RMSE) by ~38% (~32%) compared to using only AOD 440nm.
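The three evaluation metrics and the quoted percentage reductions can be reproduced with a short Python sketch. The function names are hypothetical and standard textbook definitions of MAE, RMSE, and Pearson's R are assumed; only the MAE and RMSE values passed to the last two calls come from the text above.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def relative_reduction(before, after):
    """Percentage by which an error metric decreased."""
    return 100.0 * (before - after) / before


# Scenario 1.1 versus all spectral channels, using the values quoted in the text.
mae_reduction = relative_reduction(1.76, 1.10)
rmse_reduction = relative_reduction(2.25, 1.53)
print(f"MAE reduced by {mae_reduction:.1f}%, RMSE by {rmse_reduction:.1f}%")
```

The computed reductions of 37.5% and 32.0% are consistent with the rounded ~38% and ~32% reported above.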
Secondly, the effect of two meteorological parameters on ML performance was investigated together with AOD (Table 1). Including T (Scenario 2) in ML training increased the model's performance, reducing the MAE (RMSE) from 1.46 µg m−3 (1.90 µg m−3) to 0.97 µg m−3 (1.38 µg m−3). In addition, R improved from 0.62 to 0.82. For scenario 3, RH was also included in ML training in addition to AOD and T, leading to a further improvement of the model's performance from 1.31 µg m−3 (1.72 µg m−3) to 0.91 µg m−3 (1.30 µg m−3) for MAE (RMSE), and from 0.70 to 0.84 for R. Including the two meteorological parameters decreased MAE (RMSE) by ~20% (~15%) compared to using the parameters of scenario 1.5. Figure 2a shows the linear relationship between the ML-based (estimations) and ground-based (measurements) PM2.5 for the scenario with the highest accuracy (Scenario 3.5). The findings revealed a dispersion of 26.9%. Figure 2b depicts the frequency distribution of differences between the ML-based (estimations) and ground-based (measurements) PM2.5 for the same scenario. For 69% (89%) of the test dataset, the differences between the PM2.5 estimations and measurements were lower than 1 µg m−3 (2 µg m−3).
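The shares quoted for Figure 2b correspond to the fraction of test samples whose absolute estimation error stays below a threshold. A minimal sketch of that calculation is given below; the function name and the four sample values are hypothetical and serve only to show the computation, not the study's data.

```python
def share_within(y_true, y_pred, threshold):
    """Percentage of samples with |estimate - measurement| below the threshold."""
    diffs = [abs(p - t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(d < threshold for d in diffs) / len(diffs)


# Hypothetical measured and estimated PM2.5 values in µg m−3.
measured = [4.0, 5.0, 6.0, 7.0]
estimated = [4.5, 6.5, 6.2, 9.5]

within_1 = share_within(measured, estimated, 1.0)
within_2 = share_within(measured, estimated, 2.0)
print(within_1, within_2)
```

Applied to the Scenario 3.5 test set, the same two calls would give the 69% and 89% figures reported above.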

Conclusions
Quantitative and qualitative information on surface PM2.5 mass concentration is vital for monitoring and regulating air quality. In this work, an alternative ML-based methodology relying on the synergy of ground-based AOD and meteorological measurements is proposed for retrieving PM2.5. The most interesting finding of this study is the substantial improvement in the ML algorithm's performance when AOD spectral information is included. Moreover, the addition of two meteorological parameters, T and RH, further increased the retrieval performance of the ML algorithm. Owing to their high temporal resolution, the results of the proposed methodology could be used to fill gaps in and extend PM2.5 time series derived from ground-based measurements. In addition, the retrieved PM2.5 can be used as a reference measurement for the validation of retrieval algorithms based on satellite measurements.

Figure 1. (a) MAE, (b) RMSE, and (c) R for the 15 scenarios. The description of each scenario is presented in Table 2.

Figure 2. (a) Linear relationship and (b) frequency distribution of differences between the ML-based (estimations) and ground-based (measurements) PM2.5 for scenario 3.5 (see Table 2).

Table 1. Minimum, maximum, and average values of ML algorithm input parameters.

Table 2. Scenarios applied during the RF algorithm training procedure.