Analysis of the Air Quality of the Basque Autonomous Community Using Spatial Interpolation

This work presents the results obtained from a spatial modeling and analysis process on pollutants measured in the air through forty-three monitoring stations located in the three provinces of the Basque Autonomous Community (Spain). The pollutants measured correspond to the set of nitrogen oxides (nitric oxide, NO; nitrogen dioxide, NO2; and nitrogen oxides, NOx) and atmospheric particulate matter with a diameter less than or equal to 10 micrometers (PM10). The objective of this work was to generate a map of the pollutants that exhaustively covers the entire area of the Basque Autonomous Community using geostatistical techniques, in such a way that it serves as a basis for short and midterm environmental studies.


Introduction
Environmental pollution is a global problem that puts the health of the population at risk [1]. Air quality is a determining factor in the health of population and ecosystems [2]. Emission of anthropogenic origin polluting gases unbalances the natural composition of the atmosphere (e.g., CO 2 or carbon dioxide is the main greenhouse gas due to its abundance in the atmosphere [3]). It is the cause of various diseases, environmental damage, and the increase of the Earth's temperature. These are the reasons international and national organizations have established various agendas for evaluating emissions and creating plans for their reduction.
Thanks to these types of measurements, air quality in Europe is slowly improving according to the European Air Quality Report [4]. Nevertheless, high concentrations of air pollutants continue harming European citizens' health, so it is still a priority to avoid, prevent and reduce the existence of these pollutants. According to the aforementioned report [4], the most harmful pollutants are suspended particles (PM), nitrogen dioxide (NO 2 ) and tropospheric ozone (O3) (although the list of pollutants covers around ten substances).
In addition, it is appropriate to carry out this type of study, since in recent years, technologies that generate particles with a diameter less than or equal to 10 microns have proliferated. A relevant example of this type of technologies is the additive manufacturing of metals. These technologies generate particles in this range of sizes: if this equipment is not operated properly or is not in adequate conditions, these particles can travel extensive distances in the air, and they have risks associated that raised the alarms of the scientific community for at least the last 15 years [5]. Thus, among others, the novel situation associated with the popularization of these technologies, thanks to the explosion of Industry 4.0 and their consequent risks justify the forms of modeling and representation exposed in this writing.
Finally, many territories or parts of territories do not have too many monitoring stations to collect these data and offer pollution measurement statistics. Therefore, the models developed and tested in this writing have served to offer a mechanism that generates very valuable complementary information.
In this sense, many administrations, both national and regional, have established different control systems to measure levels of contamination in the environment. The European directive that sets the bases for the measurement and evaluation of air quality [6] establishes the need to make public all this information and it also says that measurements may be accompanied by results of modeling techniques. Modeling air quality information is, therefore, an appropriate way to publicize that information. This work is proposed with the aim of building a whole map of the air quality in the aforementioned country using modeling techniques.

Normative
In Spain, the reference regulation regarding air quality is the Royal Decree-Law 102/2011, of the 28th of January, which transposes the Directive 2008/50/CE, of the 21st of May 2008 and the Directive 2004/107/EC, of the 15th December 2004. These regulations establish the limits for the most important pollutants and regulate the way in which air quality should be measured and evaluated, the information that should be provided to population and the actions to be carried out in case of exceeding certain concentration values. This Royal Decree was modified by the Royal Decree 678/2014 which modifies the quality objectives of carbon sulfide; and by the Royal Decree 39/2017, to transpose the Directive 2015/1480 into our legal system, which establishes regulations about reference methods, data validation and locations to obtain air quality measurements. In addition, in this last royal decree the approval of a National Air Quality Index is foreseen, which allows citizens to be informed about air quality, in a clear and homogeneous way throughout the country.
Based on this standard, the Autonomous Communities of Spain have the obligation to measure the levels of pollutants in the atmosphere in their territory, establishing necessary control networks and publishing the results. These networks must also have a guarantee and quality control system for their process, guaranteeing the accuracy of the data and the method in which they were obtained (following established standards). That is why the used measuring equipments must be approved according to the reference regulations.
Measurement stations are distributed throughout the territory, in a way that they allow measuring background pollution in natural conditions, air quality in urban areas and industrial environments. The impossibility of placing measurement stations in all points of the territory makes it necessary to apply interpolation techniques that enable having information on all the points in the territory.
Limits for health protection of some pollutants have been set: SO 2 (sulfur dioxide), NO 2 (nitrogen dioxide), PM 10 (atmospheric particulate matter with a diameter less than or equal to 10 micrometers), PM 2.5 (atmospheric particulate matter with a diameter less than or equal to 2.5 micrometers), CO (carbon monoxide), O 3 (ozone), C 6 H 6 (benzene), Pb (lead), As (arsenic), Cd (cadmium), Ni (nickel) and B(a)P (benzo(a)pyrene). In this work, data related to nitrogen oxides and atmospheric particulate matter with a diameter less than or equal to 10 micrometers (PM 10 ) has been analyzed, as they are the only pollutants with a sufficient number of readings. The limits established in the regulations for the protection of human health of the studied pollutants can be seen in Table 1. The article is organized as follows: in Section 2 the methodology followed in this work is presented; in Section 3 the techniques applied in the study are explained; Sections 4 and 5 are devoted to statistical and semivariogram analysis; in Sections 6 and 7 the results of the estimation and the model validation are described, and finally, in Section 8 conclusions derived from the study are reported.

Analyzed Area
The Basque Country (officially the Basque Autonomous Community) is located in the north of Spain. It covers 7230 km 2 and it is structured in three historical territories: Biscay, Gipuzkoa and Álava, see Figure 1. It has approximately two million inhabitants, of which the 40% are concentrated in the Bilbao Metropolitan Area [7], the area with the most industry and traffic. It is in this area where the greatest number of measurement stations is concentrated (16 out of the 43 are located in this area). The rest of the stations are distributed in the suburbs of Donostia-San Sebastian (with 10 measurement locations), in the city of Vitoria-Gasteiz (3 locations), and the rest distributed in the municipalities with the highest concentration of population and industry (17 locations distributed in municipalities such as Amorebieta, Durango, Mondragon, Zumarraga, Llodio, or Agurain). To assess the air quality of the Basque Country, the territory is divided into eight zones, in accordance with the requirements of current regulations, as shown in Figure 1. This division is calculated based on aerial basins of similar orography in which the levels of pollutants are fundamentally influenced by the same sources, and by the same transport processes of the aerial mass of the aforementioned sources. The zoning of the territory also depends on the pollutant. It has to be said that in the case of ozone, five zones are considered due to their different behavior.

Data
The data used in this work belong to the Basque Government's Air Quality Control Network. This network consists of 59 stations that are arranged throughout the Basque territory. They measure the pollutants set by the regulations on air quality; mainly sulfur dioxide (SO2), nitrogen oxides (NO and NO 2 ), tropospheric ozone, carbon monoxide (CO), benzene and particles in suspension (PM 10 and PM 2.5 ). In addition, meteorological parameters such as wind speed and direction, temperature, relative humidity, pressure, radiation and precipitation are measured.
The Basque Government publishes all the data collected by the stations on its Open Data website [8]. They also create a map of the Air Quality Index in which the state of air quality is assessed taking into account a category defined according to a concentration range. Analysis of the data registered throughout the network is carried out annually, and an annual report is elaborated with all this information.
For this study, data from 2019 was used. The contaminants that have been taken into account in the study are nitrogen oxides (NO, NO 2 and NO x ) and atmospheric particulate matter with a diameter less than or equal to 10 micrometers (PM 10 ). The rest of the pollutants have not been incorporated because they do not have a minimum number of records to perform an accurate statistical analysis. In all the cases (PM 10 , NO, NO 2 and NO x ) and for each station, the annual average of 2019 has been used. First, a daily average has been calculated using hourly data; and then, an annual average has been calculated using daily averages.

Analysis
In the first stage of the study, a statistical description of the polluting substances was developed, where exhaustive models are defined for PM 10 and NO x (that is we do not include NO and NO 2 ). This is due to the fact that the group of nitrogen oxides, NO x , is formed from a relationship between NO and NO 2 . Both pollutants, NO and NO 2 , present a linear correlation level of 98% with NO x , which justifies a simple regression linear fitting from the estimation model of NO x .
Subsequently, in order to analyze the feasibility of using geostatistical estimation methods, the spatial correlation of PM 10 and NO x is studied, by means of the experimental semivariogram. For PM 10 a spatial dependency of 10,000 m is interpreted in an isotropic manner at the provincial level, specifically in Biscay and Gipuzkoa, where 84% of the total monitoring stations are located. In Álava, since there are only seven stations and since they are separated by large distances, it is difficult to perform a semivariographic analysis, so results do not provide conclusive information and they were not considered for the spatial modeling of the pollutant. The PM 10 is fitted to a nested semivariographic model with a first nugget structure and a second exponential one. On the other hand, NO x does not present spatial correlation in any of the three provinces. Therefore, it is fitted to a pure nugget effect model and the use of geostatistical estimation methods on this pollutant is rejected.
Taking into account the information obtained from the spatial correlation analysis of the data, and considering the extensive area of the Basque Country, the simple krigging (SK) seemed the geostatistical technique possible to implement for PM 10 , as this technique uses a global mean to compensate for the lack of information in sectors where there are no monitoring stations. However, this technique is extremely smooth for this case study, given the poor sampling density; hence, we concluded that the application of kriging methods is not adequate for this study. In contrast, we propose to use linear-type radial basis function (RBF) interpolation. The choice of this advanced interpolator is not accidental, as under certain circumstances they are equivalent to kriging methods.
Finally, the resulting maps of PM 10 and NO x are evaluated using cross validation. Concretely, we have used the leave-one-out cross validation, concluding the validation of the estimation model and generating maps for NO and NO 2 using the simple linear regression model with NO x . The steps followed in this study can be seen in Figure 2.

Theoretical Framework
Geostatistics is a ramification of statistics that consist of modeling the spatial continuity of a target variable. It is used to estimate models based on observed data, to predict values in not sampled locations, and to test hypothesis [9]. Its mathematical formulation originated in France in the 1960s [10,11] and since then, it has been applied to many disciplines [12], such as: mining, environmental sciences, ecology, civil engineering, petroleum engineering, etc.
Geostatistics studies variables that present a geographical structure, i.e., variables distributed spatially in a structured manner, with some degree of spatial self-correlation [13]. Given a domain D ⊂ R d in space, where d is the dimension of the space (d = 2 for 2D and d = 3 for 3D), and given a variable Z, a regionalized variable makes correspond a value of the variable Z(x) to each point x in space that belongs to the domain D. This value depends on the position of the point x in space. The regionalized variable is stationary if its joint distribution function is invariant by translation in space. The semivariogram is the main tool used to quantify the spatial self-correlation of a regionalized variable [14]. It is used in different directions of the space to study the continuity of the variable.
We will assume that Z(x) is the regionalized variable, and that we have n observations Z(x 1 ), Z(x 2 ), . . ., Z(x n ). Kriging consists of predicting Z * (x 0 ), being x 0 / ∈ {x 1 , x 2 , . . . , x n }. It is an interpolation technique based on regression algorithms generalized by least squares. There exist different type of kriging estimates (simple kriging, ordinary kriging, universal kriging, etc.) and all of them are variants of basic linear regression estimates. They are given by the following expression: being λ i the weight values, m(x 0 ) and m(x i ) are the values expected by Z(x 0 ) and Z(x i ) respectively. The objective of the kriging is the minimization of the variance of the error σ 2 E (x), which is given by the following expression: being Z * (x) the estimated value and Z(x) the observed real value. The variance of the error and a deterministic one that defines the trend of m(x): The residual component is modeled as a zero-mean stationary random variable and covariance In this way, the expected value of the random variable Z at a certain location x is given by the value of the trend component at that location, i.e., E [Z(x)] = m(x). This is the general kriging theory, and its different variations depend on the information that we have about the mean and the structutal analysis performed. In the case of simple kriging, we have a regionalized variable with known mean m and known covariance. In the simple kriging model the variable Z(x) is decomposed into a residual component (x) and the mean. In this case, errors are not independent: The predictor of the variable of interest in a location x 0 is defined by: being * (x 0 ) the prediction of the random error in x 0 . Solving * (x 0 ) from (6), which is defined as follows: The predictor is unbiased if E [Z(x 0 )] = m, and this happens when E [ (x 0 )] = 0. The simple kriging weights are calculated by minimizing the variance Radial basis function interpolants were introduced by [15]. Among all the interpolation techniques that exist, RBFs are useful in the absence of grid data [16], and they may be useful when the results obtained by kriging are not accurate (because the number of samples is poor and their distribution non-adequate [17]). We will continue with the regionalized variable Z(x) and the n observations of this variable, Z(x 1 ), Z(x 2 ), . . ., Z(x n ). RBF interpolants consist of defining a function s f [18,19]: in such a way that s f (x k ) = Z(x k ) for 1 ≤ k ≤ n. In our case · will be the Euclidean norm in R d , where d is the dimension of the space. The coefficients α 1 , α 2 , . . . , α n are determined by solving the linear system that comes from: Several authors have proved that in some cases, the kriging and RBF approaches are equivalent [20][21][22]. The choice of the function φ(r) could be different; linear (φ(r) = r), cubic (φ(r) = r 3 ), thin plate spline (φ(r) = r 2 log r), etc. could be chosen. In this work our choice has been the linear-type RBF φ(r) = r.

Statistical Analysis
The Basque Country is divided into three provinces: Biscay, Gipuzkoa, and Álava. The 47% of the monitoring stations are located in Biscay, the 37% in Gipuzkoa and the 16% in Álava. As it has been explained before, the values worked by station correspond to an annual average of 2019. This has been calculated by averaging daily averages (which have been calculated averaging hourly data). The fact that NO x achieves a positive linear correlation of 98% with the variables NO and NO 2 is highlighted. It is justified that after estimating the exhaustive map of NO x , values of NO and NO 2 can be estimated by a simple regression model. NO can be estimated using NO x by: and NO 2 can be calculated in function of NO x by this formula: In Figure 3 the scatter diagrams can be seen. The correlation between PM 10 and NO x is 66%, see Figure 4, which means that independent estimation processes have to be followed for these two pollutants.
By applying multiple linear regression (MLR) the interaction of NO x , NO and NO 2 can be calculated, which is expressed as follows: and the P-values are: for the intercept, 2.456217 · 10 −3 ; for NO, 4.880887 · 10 −47 , and for NO 2 , 8.012475 · 10 −48 . Therefore, the regression coefficients are significant. When comparing the multiple linear regression of the variable NO x with real data, a linear correlation coefficient of 100% is obtained, see Figure 5.
In Tables 2-4 the values of descriptive statistics of the pollutants in each province are shown: the mean, the standard deviation (SD), the minimum and the maximum values, and the first, the second and the third quartile (Q1, Q2 and Q3, respectively). In addition, in Table 5, values of the Basque Country are presented, taking into account the data of the three provinces.
In Figures 6 and 7 the variation of PM 10 and NO x can be seen in each of the provinces. The annual average per pollutant is similar in the two main provinces of the Basque Country (Biscay and Gipuzkoa), and this area could be considered homogeneous for the geostatistical estimation process. The province of Álava differs from the other two by its low concentrations in both PM 10 and NO x . Because of this reason, an option could be to divide the overall area into two estimation domains: the first domain formed by Biscay and Gipuzkoa; and the second formed by Álava. However, the disadvantage of dividing the area into two domains is the absence of data in the second domain (province of Álava), since there are only seven measurements per pollutant. This is the reason a global modeling process has been developed.        The analysis of anomalous values has been done. We have detected 3 high measurements of PM 10 in Gipuzkoa, and 1 of NO x in Biscay, see Figure 8. Nevertheless, as these values were obtained by averaging daily averages, they correspond to real measured values and we have maintained them during the process.

Semivariogram Analysis
We define the parameters of a grid that will cover the area of the Basque Country, see Table 6. These parameters will serve to develop the estimation process. Measurements are located in the geodetic reference system European Terrestrial Reference System 1989 (ETRS89). In Figure 9 the values of PM 10 in all the monitoring stations can be seen. To measure the spatial continuity of the pollutants, the experimental semivariogram is used.
In the configuration of the parameters for the experimental semivariogram, a step of 0.5 m and a tolerance in step of 1/2 of the same will be used. Four directions will be studied: 0 • , 45 • , 90 • and 135 • . In addition, 20 steps in each of the four studied directions will be used. In this way, the entire space where the monitoring stations are located can be traversed.
We consider an angular tolerance of 22.5 • and a bandwidth equal to the tolerance in the step. The experimental semivariogram for 4 azimuth directions over the 43 monitoring stations does not reflect any spatial structure or a greater continuity in any particular direction, Figure 10.  Using the omnidirectional experimental semivariogram independently in each province, an isotropic continuity with a range of 10,000 m for PM 10 is interpreted in the provinces of Biscay and Gipuzkoa, see Figure 11. It is not possible to fit a continuity model in Álava, given the lack of information.
Due to the fact that the provinces Biscay and Gipuzkoa occupy 84% of the overall data, the total area has been modeled using a nugget effect model, added to an exponential structure: The fit nested structural model for the Basque Country is presented in Figure 12.   The same procedure was followed with the pollutant NO x . In Figure 13 the values of NO x in all the monitoring stations are represented.
For this pollutant, there is not spatial structure in any of the three provinces. Therefore, it was modeled by a pure nugget effect model, with a value equal to the statistical variance of the pollutant. Notice that the pure nugget effect implies a total absence of spatial correlation of the observed values. The experimental semivariograms per province are shown in Figure 14 and the pure nugget effect in Figure 15.

Estimation of Pollutant PM 10
First of all, simple krigging was applied, using a global mean of the three provinces (15.31 µg/m 3 ). The parameters of the process and the results are presented in Table 7. The map generated by simple kriging can be seen in Figure 16.  7228 estimates for PM 10 were generated from 43 measurements. A reduction of the standard deviation of this pollutant is observed, which is a feature of the estimation method as it tends to smooth the data. Given the great extension of the Basque Country and considering the number of observations of the variable under study, the estimated value is equal to or very close to the global mean in all sectors without monitoring stations. The total mean value of the estimate is equal to the statistical mean of the 43 observations, and the change in shape of the histogram is consequence of the over-influence of this position statistic that simple kriging incorporates, Figure 17. The statistics obtained after applying simple kriging can be seen in Table 8. Figure 17. Histograms of PM 10 before (left) and after (right) simple kriging application. σ 2 (PM 10 ) is the estimated variance, which presents an approximation of the error of the estimation process. A reduction of this variable is observed in the monitoring stations. However, the majority of the Basque Country presents a PM 10 estimate of high uncertainty using simple kriging, see Figure 18.  Given the purpose of interpreting the entire extension of the Basque country using only 43 stations, the simple kriging method is discarded, as it attributes too much weight to the statistical mean of the data changing abruptly the original distribution of the variable. The same happens with other types of kriging, such as the ordinary kriging; they cannot fit to the area given the poor amount of data and the long distances between stations. The linear-type radial basis function (RBF) was used instead. This method considers in an equal manner the 43 observations, and it generates a more realistic smoothing of the estimation grid. The PM 10 map generated by linear-type RBF is shown in Figure 19. The descriptive statistics of PM 10 after applying linear-type RBF are presented in Table 9. The skewness of the estimated distribution of PM 10 is similar to that of the original data, a tail to the right is observed in the histogram describing a positive bias, Figure 20. This means that there is a greater amount of data smaller than the mean value. On the other hand, kurtosis changes from platicurtic to leptokurtic, assuming a higher concentration of data in an interval that ranges between 11 and 15 microns of PM 10 . In Table 10 the vias factor and the kurtosis after RBF application to the pollutant PM 10 can be seen.

Estimation of Pollutant NO x
Using the same technique the NO x map has been generated, Figure 21. The descriptive statistics of this pollutant are in Table 11. The shape of the estimated NO x distribution is similar to the original one with respect to the skewness factor, see Figure 22. In addition, the kurtosis goes from being leptokurtic to platicurtic, Table 12.

Model Validation
Model validation was developed using the leave-one-out cross validation. The cross validation consists in extracting a known value from the set of observations (in our case, n = 43), and estimating the value at that location based on the selected model. Afterwards, the estimated value is compared with the real one n times. It is the same as k-fold cross validation by doing k = n, being n the total number of observations. The estimate generated by the RBF model is quite accurate with respect to the mean of the pollutants in the extension of the Basque Country. The global difference in PM 10 is of 0.41 microns, and 1.40 microns in NO x , see Table 13. The percentage of the relative root mean square error (RRMSE) has been calculated by dividing the RMSE by the difference between the maximum and minimum value: The obtained results are presented in Table 14. The global effectiveness or accuracy of the RBF model is 69.40% for PM 10 and 80.01% for NO x , which indicate that the model is valid.  Figures 23 and 24 show the absolute error |Z * (x) − Z(x)| for each station for each pollutant. The size of the circle is directly proportional to this error. In Table 15 values of the cross-validation per station are presented.  In Figures 25 and 26 NO and NO 2 maps were built, using the map generated for NO x by the linear-type RBF and applying simple linear regression, Equations (10) and (11).

Conclusions
In this work, measurements of the pollutants PM 10 , NO x , NO 2 and NO were studied in 43 measurement stations in the Basque Country, generating maps of the aforementioned pollutants for the entire area. Exploratory analysis has shown a correlation between the set of nitrogen oxides (nitric oxide, NO; nitrogen dioxide, NO 2 ; and nitrogen oxides, NO x ). For the geostatistical analysis, a kriging analysis was first performed. This technique has insufficient results because the amount of data is poor and the distances between stations are long to cover all the area of the Basque Country. Hence, linear-type RBF has been used instead, obtaining high global effectiveness.
Therefore, the geostatistical methods applied to the air quality data in the Basque Country have been valuable, not only for showing the incidence of the studied pollutants but also providing a deeper understanding of the phenomenon, which can help to design the control network itself and to publish adequate information about its distribution, as current regulations establish. The developed study also serves as a basis for short and midterm environmental studies that will support the decision making regarding transport, industry, and environmental policies.
We detected three high measurements of PM 10 in Gipuzkoa, and 1 of NO x in Biscay. The anomalous values occur in stations located within urban centers. The high value of NO x in Biscay happens in Bilbao, in the station of M a Diaz de Haro, one of the busiest points in the community. The PM 10 high values of Gipuzkoa correspond to the stations of Easo and Ategorrieta (stations in the center of Donostia), and Beasain, a smaller municipality but with an important industrial network, located on the axis of the A-1 motorway, which is a road with heavy traffic. Therefore, anomalous values may be due to traffic, one of the anthropogenic sources that generates the majority of PM 10 and NO x emissions.
With the demonstration case presented herein, it can be said that the developed approach is valid as a complementary tool for territories that do not have a significant amount of monitoring stations. It is worth mentioning that this work received the support of the BBK Fundazioa fundation, and raised the interest of its BBK-Behatokia observatory, which gathers information to inform the general public about the basic foundations of society. Among the potential further research avenues for the presented work, both the entity and the research team are considering the possibility of comparing 2019 and 2020 data, to analyze the indirect consequences of the COVID-19 in the quality of air and vice versa. Even though it is not sufficiently proved, it seems that the concentration of pollutants is correlated with higher transmissions of the pandemic, as the Director of the Center for the Coordination of Alerts and Emergencies of Spain, Fernando Simon, states [23].
Additionally, it is worth mentioning that meteorological conditions such as wind speed, high atmospheric pressure, low temperature, low humidity and large diurnal amplitude can directly have an impact on the concentration of pollutants. Thus, this concentration is not usually constant during the year. Low temperatures, which make NO 2 to remain longer in the air, are not conducive to its conversion, and they help in the gradual accumulation of this pollutant. It has to be also said that NO 2 concentration changes during the day, being higher in the morning and in the evening. Therefore, the analysis of the climate conditions, pollution, and COVID-19 infection trinomial will be the basis for further research.