Compositional Spatio-Temporal PM 2.5 Modelling in Wildﬁres

: Wildﬁres are natural ecological processes that generate high levels of ﬁne particulate matter (PM 2.5 ) that are dispersed into the atmosphere. PM 2.5 could be a potential health problem due to its size. Having adequate numerical models to predict the spatial and temporal distribution of PM 2.5 helps to mitigate the impact on human health. The compositional data approach is widely used in the environmental sciences and concentration analyses (parts of a whole). This numerical approach in the modelling process avoids one common statistical problem: the spurious correlation. PM 2.5 is a part of the atmospheric composition. In this way, this study developed an hourly spatio-temporal PM 2.5 model based on the dynamic linear modelling framework (DLM) with a compositional approach. The results of the model are extended using a Gaussian–Mattern ﬁeld. The modelling of PM 2.5 using a compositional approach presented adequate quality model indices (NSE = 0.82, RMSE = 0.23, and a Pearson correlation coefﬁcient of 0.91); however, the correlation range showed a slightly lower value than the conventional/traditional approach. The proposed method could be used in spatial prediction in places without monitoring stations.


Introduction
Wildfires are natural or human-based phenomena that emit various air pollutants into the atmosphere [1,2]. PM 2.5 is one of the most critical pollutants to human health produced by wildfires [3,4]. PM 2.5 , inhaled and transported by the bloodstream, can impair the lungs and other vital organs, and its impact is more harmful if the source is from wildfires [5,6]. On the other hand, PM 2.5 emitted from biomass burning (carbonaceous aerosols from wildfires) contributes to one of the largest variables of uncertainty in the current estimates of radiative forcing [7,8].
The accurate predictions of fine particulate matter related to wildfires can aid decisionmakers in mitigating the environmental and socio-economic impacts of wildfires [9][10][11]. In this sense, among the most important studies are those models that seek to estimate the emission of PM 2.5 using a set of fixed-source profiles (land use, vegetation inventories, types of forest, chemistry, and physics characteristics) [12][13][14]. In this way, we could mention some examples, such as the BlueSky modelling framework developed by the Fire Consortium for the Advanced Modeling of Meteorology and Smoke (FCAMMS), which combines state of the art emissions, meteorology, and dispersion models to generate the best possible predictions of smoke impacts across the landscape. Another example is the Sparse Matrix Operator Kerner Emissions Modeling System (SMOKE), developed by the Center for Environmental Modeling for Policy Development (CEMPD), which is based on RatePerStart (RPS) emission rates [15]. However, the results from the emission models could be wrong even if representative source profiles are used, and thus a contradiction in the empirical evidence for ground-level monitoring could appear [16][17][18][19][20][21][22]. On the other hand, measures of PM 2.5 from monitoring stations on the surface could be used in statistical models under a dispersion modelling approach. The dispersion models are usually presented in univariate spatio-temporal research [23][24][25][26]. For instance, Mirzaei et al. used a land use regression with ground-level monitoring of smoke to propose exposure models [27]. The dynamic linear modelling framework is commonly used in air quality models due to its flexibility in treating time series in both stationary and non-stationary approaches [28][29][30][31][32][33]. For instance, Cameletti et al. developed a daily spatio-temporal model for PM 10 for Piemonte in Italy with an extensive network of monitoring stations [34]. Sánchez-Balseca and Pérez-Foguet, with a limited number of monitoring stations, presented hourly spatio-temporal PM 2.5 modelling in wildfires events, a validation method using PM 10 levels and a PM 2.5 /PM 10 ratio was proposed as well. Both studies used DLM with a Gaussian-Mattern field due to its low computational cost [35]. PM 2.5 is an air pollutant and thus part of an atmospheric composition (e.g., µg/L, mg/kg, wt%). Compositional data (CoDa) belong to a sample space called the simplex. If PM 2.5 data are not treated under a compositional approach, the results could draw wrong conclusions [36,37]. One statistical problem if compositional data are not adequately treated is the spurious correlation. In a composition of two elements that sum a constant, the increase in one of them means reducing the other component, and vice versa. The two elements have an inverse correlation imposed upon them, even if these two elements have no relationship. This imposed correlation is called a spurious correlation and could be eliminated through transformations in the form of logarithms of ratios (log-ratios) [38]. The isometric log-ratio (ilr) transformation is the most used due to its advantage of representing the simplex space orthogonally [39]. In addition, the CoDa approach has been widely used in other environmental fields (soil, water, geology, etc.), but the application in air pollution modelling is scarce.
This article presented a compositional, hourly spatio-temporal model for PM 2.5 based on a dynamic linear modelling framework. To extend the results of the model in places with no monitoring stations, a Gaussian-Mattern field is used. The remainder of this article provides the site description, datasets used, a brief background on the statistical tools (DLM and CoDa), the methodology (Section 2), the results (Section 3), the discussion (Section 4), and the principal conclusions (Section 5).

Wildfire Description
Quito had unprecedented wildfires in September 2015, and the 14th of September was the most remarkable air pollution event. Quito is located in Ecuador in the Andean mountains at 2800 m.a.s.l., and it has 2,240,000 inhabitants. Figure 1 presents the satellite image that represents the wildfire event with points of red colour. The MODIS Terra/Aqua sensor platform was used to obtain the thermal anomalies/active fire image [40]. The yellow points are the monitoring stations for PM 2.5 .

Data
2.2.1. PM 2.5 Data PM 2.5 data were collected hourly during September (720 hours) by the Air Quality Network of Quito, which is formed by five monitoring stations, and they are described in Table 1. The monitoring network used a Thermo Fisher Scientific FH62C14-DHS Continuous Ambient Particulate Monitor 5014i with beta rays' attenuation method (Waltham, Massachusetts, USA), as suggested by the Environmental Protection Agency (EPA). The Air Quality Network of Quito is a permanent air pollution surveillance network. The data were obtained from the open-source online data repository managed by the environmental agency of Quito, and hosted at Secretaria de Ambiente del Distrito Metropolitano de Quito [41].

Meteorological Data
The meteorological data were collected from meteorological assimilation systems based on satellite data. This article used Modern-Era Retrospective analysis for Research and Applications version 1 and 2 (MERRA and MERRA-2) from NASA's Giovanni web platform; MERRA-2 published many analysis products used in meteorological and air quality modelling [42,43]. Some works used the soil surface temperature variable to indicate wildfire events [44][45][46]. Table 2 shows the main characteristics of meteorological data. Two equations defined the dynamic linear modelling; the first one is denoted as the observation equation. The dependent variable, y st , is the observed generic pollutant concentration at spatial location s (s = 1, . . . , S) on time t (t = 1, . . . , T) and it is described in Equation (1): where v st denotes the measurement error, which is assumed to be independent, and it has a variance, σ 2 v . The vector of regression coefficients is represented by vector β; X st represents a vector of regressors that change temporally. Operator "×" is used to indicate multiplication of scalars, vectors or matrices depending on the context in this article. The second equation that describes the dynamic linear modelling is related to the term θ st ; its name is the system equation, and it describes a dynamic autoregressive first-order model, shown as: where w st is the temporal and spatial error; it has a normal distribution and a variance, σ 2 w / 1 − a 2 . The temporal and spatial variance (σ 2 w ) is based on the correlation between monitoring stations and their Euclidean spatial distance using a Gaussian-Mattern field, and is parameterized by the empirically derived correlation range (ρ). This empirically derived correlation range is the distance at which the correlation is close to 0.1. For more details, see [34,[47][48][49]].

Compositional Data (CoDa) Approach
Compositional data belong to a sample space called the simplex S D , which could be represented in mathematical terms as: where K is defined a priori and is a positive constant. x i represents the components of a composition. The next equation represents the isometric log-ratio (ilr) transformation (Egozcue et al. [36]).
where x is the vector with D components of the compositions, V is a D × (D − 1) matrix that denotes the orthonormal basis in the simplex, and Z is the vector with the D − 1 log-ratio coordinates of the composition on the basis, V. The ilr transformation allows for the definition of the orthonormal coordinates through the sequential binary partition (SBP), and thus, the elements of Z, with respect to the V, could be obtained using Equation (5) (for more details see [39]).
where g m (x k+ ) and g m (x k− ) are the geometric means of the components in the kth partition, and r k and s k are the number of components. After the log-ratio coordinates are obtained, conventional statistical tools can be applied. For a 2-part composition, x = (x 1, x 2 ), an orthonormal basis could be V = [ 1 , and then the log-ratio coordinate is defined using Equation (6): After the log-ratio coordinates are obtained, conventional statistical tools can be applied.

Methodology: Proposed Approach Application in Steps
To propose a compositional spatio-temporal PM 2.5 model in wildfire events, our approach encompasses the following steps: (i) pre-processing data (PM 2.5 data expressed as hourly 2-part compositions), (ii) transforming the compositions into log-ratio coordinates, (iii) applying the DLM to compositional data, and (iv) evaluating the compositional spatiotemporal PM 2.5 model.
Models were performed using the INLA [48], OpenAir, and Compositions [50] packages in the R statistical environment, following the algorithm showed in Figure 2. The R script is described in [51]. Step 1. Pre-processing data To account for missing daily PM 2.5 data, we used the compositional robust imputation method of k-nearest neighbor imputation [52,53]. Then, the air density from the ideal gas law was used to transform the concentration from volume to weight (Equation (7)). The concentration by weight has absolute units, while the volume concentration has relative units that depend on the temperature [49]. The air density is defined by temperature (T), pressure (P), and the ideal gas constant for dry air (R).
The closed composition can then be defined as [PM 2.5 , Res], where Res is the residual or complementary part. We fixed K = 1 million (ppm by weight). Due to the sum(x i ) for all compositions x is less than K, and the complementary part is Res = K -sum(x i ) for each hour. The meteorological and geographical covariates were standardized using both the mean and standard deviation values of each covariate. For meteorological missing data, a simple imputation method was used.
Step 2. Log-ratio transformation The two components (PM 2.5 and the residual) are first log-transformed into one logratio coordinate for each hour (Z * 1 ) using Equation (6), where x 1 represents the PM 2.5 levels and x 2 describes the residual part (Res) for each hour.
Step 3. Model application The log-ratio coordinate is the dependent variable (y st ) in the DLM modelling framework, and the independent variables (X st ) are described by the meteorological data that change spatial-temporally. The posterior estimates-β, v st , w st , σ v , σ w , a, and ρ-are obtained from the regression using Bayesian inference. The empirically derived correlation range was defined in km. The spatial distribution of PM 2.5 in places with no monitoring stations was featured using a triangular irregular mesh for monitoring stations of PM 2.5 and a grid of 4 km between each intersection of meteorological data, as proposed by Sánchez-Balseca and Pérez-Foguet (2020) [49].
It is necessary to recover the original units for the estimates in compositional data analysis [54]. Once results are back-transformed in proportions, (p×E; sum(p×E) = 1), they are multiplied by K to obtain the model results in original units.
Step 4. Model Evaluation For this step, the Nash-Sutcliffe efficiency index (NSE) and the Pearson correlation coefficient were used. Both the NSE and the Pearson correlation are independent of the scale of measurement of the variables. The NSE scale ranges from 0 to 1, whereby NSE = 1 means the model is perfect, NSE = 0 means that the model is equal to the average of the observed data, and negative values mean that the average is a better predictor.

Results
The compositional spatio-temporal air pollution modelling used five monitoring stations and 720 hours in a wildfire event. The posterior estimates (mean, quantiles, and standard deviation) for the parameters σ 2 v , σ 2 w , a, and ρ are presented in Table 3. The spatial variance (σ 2 w ) was slightly more significant than the measure variance (σ 2 v ). The empirically derived correlation range was about 26.006 km; this represents the distance at which the correlation is close to 0.1. The parameter a is 0.7547, which was directly proportional to the spatial and temporal variance. The compositional model presented an intercept of about −12.618 that represents, in the original units, 0.018 ppm of PM 2.5 (see Table 4). Considering the threshold for fine particulate matter suggested by WHO in a 24 h average, about 0.022 ppm (using an air density value equal to 1.15 kg/m 3 to transform it into concentration in mass), the intercept value does not exceed the limit in a wildfire event. The regression coefficients of altitude, air temperature, and radiation had negative values. The concentration of PM 2.5 decreases with increasing altitude [55]. The air temperature and radiation are related to thermal inversion and air density, and thus their increase means the PM 2.5 concentration decreases [56]. The surface soil temperature had a positive influence on the concentration of PM 2.5 . The compositional model presented an NSE of 0.82, an RMSE of 0.23, and a Pearson correlation coefficient of 0.91. Figure 3 shows the highest hourly concentration of PM 2.5 presented in the wildfire at 16:00 h on 14 September 2015. It illustrates the spatial ilcoordinate (without back-transformed process) and the logarithmic concentration of PM 2.5 on its original units (ppm).

Discussion
This article presented a compositional spatio-temporal air pollution model for PM 2.5 using meteorological and geographical covariates. The proposed model showed adequate quality model metrics; in addition, spurious correlation was avoided by applying the ilrtransformation. The values of the quality model metrics obtained in this article were similar to those obtained using a conventional approach. The RMSE criterion displayed the most evident difference; it was about 0.23 when using a compositional process, whereas it was about 0.32 when using a traditional approach. The empirically derived correlation range, when using a conventional approach, was about 27 km; this is slightly higher than the value obtained in previous work, which was 26 km (Sánchez-Balseca and Pérez-Foguet [35]). In this sense, the compositional approach had better quantitative modelling performance but a slightly lower capacity for spatial correlation than the conventional approach [34].
The interpretation for modelling ilr-coordinates could be complicated because the information is only in the relationships between the parts [36]. For this reason, the log-ratio used in this article should be interpreted as the influence of PM 2.5 in the composition of air when using a relative approach. This approach transforms a univariate analysis into a bivariate (multivariate) analysis [37]. Usually, the variable thermal anomalies are used to identify wildfires; however, this information is available only two times per day in some territories. For this reason, this article uses the temperature of the surface soil as a spatial wildfire indicator due to the temporal resolution needed (hourly). However, the PM 2.5 measures could be distorted by the secondary organic aerosol (SOA) formation [57][58][59].
For further works, the compositional approach for univariate analysis could be performed using the centered log-ratio (clr) or the additive log-ratio (alr), which Aitchison proposed in 1982 [60].

Conclusions
The compositional approach performs the modelling of PM 2.5 slightly better than the conventional approach. However, the compositional approach presented a slightly lower correlation range than the traditional approach. The compositional spatio-temporal PM 2.5 model showed adequate quality indexes and thus could be used to determine the concentration of fine particulate matter in places where there are no monitoring stations for wildfire scenarios. This information could allow for the determination of zones with significant impacts on human health.