Addressing Missing Environmental Data via a Machine Learning Scheme

An important aspect of the environmental sciences is the study of air quality by means of statistical methods (environmental statistics) that utilize large datasets of climatic parameters. The air-quality-monitoring networks that operate in urban areas provide data on the most important pollutants, which, via environmental statistics, can be used to develop continuous surfaces of pollutants' concentrations. Generating ambient air-quality maps can help policy makers and researchers formulate measures that minimize the adverse effects of air pollution. The information needed for a mapping application can be obtained by applying spatial interpolation methods to the available data in order to estimate air-quality distributions. This study used point-monitoring data from the network of stations that operates in Athens, Greece. A machine-learning scheme was applied as a method to spatially estimate pollutants' concentrations, and the results can be effectively used to fill in missing values and provide representative data for statistical analysis.


Introduction
Studying the distribution of air-quality parameters is an important task for urban communities. According to the European Environment Agency (EEA), air pollution is a major environmental health hazard in Europe, as hundreds of thousands of Europeans are affected each year by air-quality issues [1][2][3]. Furthermore, the concentrations of air-quality parameters are associated with effects that are not health-related and can influence the interactions between humans and the environment that surrounds them [4,5]. Effective planning strategies require constant monitoring of the various pollutants, which creates databases suitable for statistical analysis. Increased data availability can help researchers produce more reliable results. However, in areas where the number of air-quality-monitoring sites that are part of a network is limited and/or the sites are not fully functional (possibly due to high establishment and maintenance costs, etc.), the smaller number of available observations cannot reflect the spatiotemporal distribution of the pollutants; in such cases, interpolation methods are of great significance. Spatial interpolation techniques have been widely used in air-quality studies [6,7], as they can be utilized effectively to fill gaps in pollutant time series with missing values and even to provide estimates for sites of interest with no data availability. These techniques can be classified into several categories. According to Li and Heap [8], they can typically be grouped into non-geostatistical, geostatistical and combined methods. The importance of using these methodologies to fill data gaps is demonstrated by the abundance of research studies on this subject, which, especially during the last few years, emphasize the need for the development of more advanced methodologies [9][10][11][12].
Machine learning (ML) and, in particular, artificial neural networks (ANNs) are considered a novel and superior alternative to traditional data imputation techniques, due to their ability to treat the relationships among the various air-quality parameters as nonlinear, in contrast with other statistical schemes which assume that these linkages are linear [13,14]. While ANNs have mostly been utilized for temporal predictions in the field of air-quality and climatic-parameter forecasting [15][16][17][18], they have also been applied as a tool to provide spatial estimations in order to create datasets without missing values [12][19][20][21]. Additionally, these gap-filled databases can support the development of informational tools, such as Air Quality Indices (AQIs), which present new insight to policy makers and the public in a comprehensible manner [22][23][24]. The EEA proposed a European Air Quality Index (EAQI) which is based on hourly concentrations of five key pollutants (PM10, PM2.5, NO2, O3 and SO2) and has six different levels based on each pollutant's concentrations. This study aimed to present an ANN scheme for filling gaps in environmental and climate sciences and specifically in the field of air quality. The use of ANNs for spatial-interpolation purposes is limited, and this work concentrates on the development of an effective method to spatially approximate air-quality parameters. Starting from the original datasets, and based on concentration time series for the selected pollutants of the EAQI, a gap-filling process based on shallow neural networks was followed. This methodology can be utilized as a fast and effective tool which will contribute to the development of indexes such as the EAQI, which will subsequently visualize air pollutants' profiles and provide insight into patterns and relationships.
Atmosphere 2021, 12, 499

Data
The air-quality-monitoring sites from which the data were derived are located in metropolitan Athens, Greece. As part of the Southeastern Mediterranean region, Athens has a climate defined by dry summers (long periods during which temperatures are considerably high) and wet, mild winters [25]. The basin is bounded by mounts Parnitha, Pentelikon, Hymettus and Aigaleo to the north, northeast, east-central and west, respectively. Due to the transport mechanisms, the topography of the area and the proximity to the sea, the air pollution fields are greatly affected by various flows of different scales [26][27][28]. The monitoring sites in the area are part of an air-quality-monitoring network that has operated since 1984, under the supervision of the Hellenic Ministry of Environment and Energy (MEE). Figure 1 presents the area of study and the locations of the monitoring sites (Table 1).
The network is considered representative of the pollutants' spatial variability and, thus, suitable for the application of advanced statistical methodologies. As input data for the development of the neural network models, a different number of stations was selected for each pollutant. The criterion for this selection was that a station should have at least some available data and, thus, could contribute to the gap-filling methodology. The percentage of data availability for each station and pollutant was, in most cases, above 80%. The few exceptions for which the percentage was lower than 80% were also included in the analysis, as they could contribute to the interpolation process and, additionally, many of their missing concentrations could be targeted for gap filling. Only the stations that had no data availability for a whole year were excluded from this process. For the five pollutants, NO2, O3, PM10, PM2.5 and SO2, the number of stations used was fourteen, thirteen, eleven, six and six, respectively.
All five were monitored hourly, and the time period of the analysis was three years (2016-2018).

Methodology
The first step in this study, after the database development, was to find the number of gaps (missing hourly concentrations) present in each station's data for 2018, with each station treated in turn as the target station. This task was performed for all pollutants individually. However, in order to apply the machine learning spatial interpolation scheme effectively, a specific criterion was adopted: for each gap at a target station, all the remaining stations must have an available measurement at the same hour. If even one of them also had a gap at that hour, the gap was not included in the interpolation process. This criterion was adopted in order to avoid using a limited number of stations (or even an individual station) to interpolate missing values, as the networks perform better when more information is provided. The same procedure could be performed by using fewer stations' data (and thus, not fulfilling the criterion mentioned above). In that case, less information would be available for training the models, but more gaps could be filled, which would lead to a more complete database.
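The selection criterion described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the data layout (a dictionary of hourly series with `None` marking a gap), the function names `fillable_gaps` and `training_rows`, and the toy values are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Assumed layout: station name -> hourly values, with None marking a gap.
Series = Dict[str, List[Optional[float]]]


def fillable_gaps(series: Series, target: str) -> List[int]:
    """Hours where the target is missing and every other station has a value."""
    others = [name for name in series if name != target]
    return [
        t
        for t, value in enumerate(series[target])
        if value is None and all(series[name][t] is not None for name in others)
    ]


def training_rows(series: Series, target: str) -> List[int]:
    """Hours where all stations, including the target, have a value."""
    n_hours = len(series[target])
    return [
        t
        for t in range(n_hours)
        if all(values[t] is not None for values in series.values())
    ]


# Hypothetical toy data: three stations, five hourly values each.
toy = {
    "A": [10.0, None, 12.0, None, 14.0],
    "B": [20.0, 21.0, None, 23.0, 24.0],
    "C": [30.0, 31.0, 32.0, None, 34.0],
}
print(fillable_gaps(toy, "A"))  # hour 3 is skipped because station C is also missing
print(training_rows(toy, "A"))
```

Relaxing the criterion, as discussed in the text, would amount to replacing `all(...)` in `fillable_gaps` with a weaker condition on the number of available stations, at the cost of training separate models for each subset of inputs.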
The results of the first step of the methodology are presented in Table 2 and show the number of missing values that can potentially be estimated and used to increase the available data points. The next step was to apply an ANN approach for spatial estimation purposes. To achieve this, a Shallow Neural Network (SNN) was utilized as a practical and fairly simple ANN that is only moderately demanding in terms of time and computational power, yet can effectively simulate complex nonlinear relationships between parameters. In detail, two-layer networks with sigmoid hidden neurons and linear output neurons were used (Figure 2).
The training of the networks was performed with the Levenberg-Marquardt backpropagation algorithm. The dataset was randomly divided into three subsets used for training, validation and testing, and each subset corresponded to a specific percentage of the original data (70% training, 15% validation and 15% testing). To reduce overfitting, the early stopping approach was applied on the validation subset [26]. This approach terminates the training process when the validation subset's error begins to increase. Depending on the pollutant, the number of data points used for the subsets was different (as the number of stations with data availability is different) and is presented in Table 3. The network architecture includes a number of inputs equal to the number of all stations minus the target station (13 for NO2, 12 for O3, 10 for PM10, 5 for PM2.5 and 5 for SO2), while the output is always one (target station).
Regarding the number of neurons in the hidden layer, the performance of each network was evaluated by using the Mean Absolute Error (MAE) statistical criterion [29][30][31][32][33], which is calculated by using the following equation:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|E_i - O_i\right|$$
where E denotes the estimated concentration, O the observed concentration and n the number of data points. Two more statistical metrics, the Root Mean Squared Error (RMSE) and the coefficient of determination (R²), were also calculated, and in combination with MAE, they were used to provide a comparison between the results of the ANN methodology and a Multiple Linear Regression (MLR) scheme. The MLR was applied according to the same criterion as the ANN models.
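To make the network structure and the data split concrete, the following Python sketch shows a random 70/15/15 split and the forward pass of a two-layer network with sigmoid hidden neurons and a single linear output. It is only an illustration under stated assumptions: the logistic function is assumed as the sigmoid, the names `split_indices`, `sigmoid` and `forward` and all weights are hypothetical, and the Levenberg-Marquardt training and early stopping used in the study are not implemented here.

```python
import math
import random
from typing import List, Sequence, Tuple


def split_indices(n: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Randomly split n sample indices into 70% training, 15% validation, 15% testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(0.70 * n)
    n_val = round(0.15 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written via tanh so that large |x| cannot overflow."""
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def forward(
    x: Sequence[float],
    w_hidden: Sequence[Sequence[float]],
    b_hidden: Sequence[float],
    w_out: Sequence[float],
    b_out: float,
) -> float:
    """Estimate the target-station concentration from the other stations' values.

    x holds one value per input station (all stations minus the target);
    w_hidden has one row of input weights per hidden neuron.
    """
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    # Linear output neuron: weighted sum of hidden activations plus a bias.
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


train, val, test = split_indices(100)
print(len(train), len(val), len(test))

# Hypothetical untrained network: 5 inputs (e.g., the PM2.5 case) and 3 hidden neurons.
rng = random.Random(1)
w_h = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(3)]
b_h = [0.0, 0.0, 0.0]
print(forward([12.0, 15.0, 9.0, 20.0, 11.0], w_h, b_h, [1.0, 1.0, 1.0], 0.0))
```

In practice, the inputs would normally be scaled before being passed to the network; that step is omitted here for brevity.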
The equation for RMSE is the following:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(E_i - O_i\right)^2}$$
where E, O and n are defined as for MAE; R² is the coefficient of determination between the estimated and observed concentrations. Lower MAE and RMSE values and a higher R² indicate the better-performing scheme. Regarding the ANN method, five runs were performed for all models and for numbers of hidden-layer neurons ranging from 1 to 40. The best-performing networks and their architecture are presented in Table 4. By using these selected SNN models with the corresponding inputs of 2018, the gaps in each station and pollutant were filled. Finally, mean and variance values were calculated on the interpolated datasets and compared with the corresponding values of the original datasets for 2018.
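The three evaluation metrics named above can be computed as in the following Python sketch. MAE and RMSE follow directly from their definitions; for R², the common form 1 - SS_res/SS_tot is assumed here, which may differ from the exact expression used in the study (for example, a squared correlation coefficient). The function names and the sample values are hypothetical.

```python
import math
from typing import Sequence


def mae(est: Sequence[float], obs: Sequence[float]) -> float:
    """Mean Absolute Error between estimated and observed concentrations."""
    return sum(abs(e - o) for e, o in zip(est, obs)) / len(obs)


def rmse(est: Sequence[float], obs: Sequence[float]) -> float:
    """Root Mean Squared Error between estimated and observed concentrations."""
    return math.sqrt(sum((e - o) ** 2 for e, o in zip(est, obs)) / len(obs))


def r_squared(est: Sequence[float], obs: Sequence[float]) -> float:
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - e) ** 2 for e, o in zip(est, obs))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and estimated hourly concentrations (same units).
observed = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 39.0]
print(mae(estimated, observed))
print(rmse(estimated, observed))
print(r_squared(estimated, observed))
```

Applying the same three functions to the outputs of the neural networks and of the linear regression on an identical test subset gives a like-for-like comparison of the two schemes.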

Results and Discussion
A total of 12,526 missing values were estimated, and the percentage of gaps that were filled in each station was above 40% for PM10 and PM2.5, above 20% for O3 and SO2 and above 15% for NO2. Regarding O3 and NO2, for which the percentage of interpolated values is lower, it needs to be considered that they had a higher number of stations with data availability (inputs for the networks), and, thus, the criterion that none of the inputs should have a missing value for each gap of the target station was more difficult to fulfill. Table 2 presents in detail the original number of gaps and the number remaining after the interpolation, as well as the percentage of missing values that were estimated. It is noted that the number of gaps after the interpolation was calculated based on the criterion explained in the Methodology section and, thus, before the interpolation process, which provided the corresponding concentrations for each missing value.
The number of data points for the training, validation and testing subsets for each pollutant is presented in Table 3. Pollutants with a lower number of input stations are associated with a higher number of data points per station (with fewer stations, it is less likely that at least one of them has a missing value at a given hour). However, a larger number of stations (as for NO2 and O3) contributes additional data points overall. NO2 and PM2.5 are the pollutants which provided the most data for training, validation and testing purposes.
The architecture of the best-performing models is presented in Table 4. The number of hidden neurons reported is the average over all the stations for each pollutant. The average MAE, RMSE and R² values in these cases are also included (MAE and RMSE are measured in the same units as the concentrations of the pollutants, µg/m³). All networks for a given pollutant have the same number of inputs, and all networks have a single output (the target station). The average number of hidden neurons ranges from 21.7 to 25.2, which indicates that the models are at an almost equal complexity level. As mentioned beforehand, to illustrate the validity of the ANN approach, the steps of the analysis that were applied on the available datasets were also performed for MLR. Table 4 additionally presents the MAE, RMSE and R² results for the MLR method. The ANNs outperform MLR in all cases. The detailed results, which include a station-by-station comparison, are provided in Supplementary Materials Tables S1-S5.
Tables 5-9 present the results for the mean and variance values of both the original and the gap-filled datasets for the five pollutants and the 2018 time period. It is noted that the differences are marginal in nearly all cases, as shown by the error percentages (mean error and variance error).
By examining the total number of gaps in the original and interpolated databases of all pollutants (Table 2), it is evident that a considerable number of missing data points was estimated after the application of the methodology, which relies on the data-point availability of all the selected stations to interpolate the corresponding (time-related) missing data points of the target station. In particular, the application of the ANNs, which utilized the available observations of the pollutants' concentrations based on the criterion that was introduced in the methodology, recovered a percentage of the missing values that ranged from about 16% to 45%, depending on how many stations were used as inputs and on the overall distribution of the existing concentrations (whether, for the same hour, one or more stations had data availability). While the ANNs could also be used to estimate data points at the target station when not all stations had available data at the same hour, there are some factors that need to be considered. Although the ANNs provide representative estimations, there is always an associated error percentage when the results are compared with the observational data. This error percentage can increase when the information provided to the models is not adequate for them to train effectively. Utilizing fewer stations could lead to higher errors, and, in any case, the results should be analyzed carefully to find out how using a different number of inputs affects the output.
As proposed by Willmott and Matsuura [29], dimensioned evaluations of model-performance error should be based on MAE.
However, a better understanding of the MAE values can be achieved by calculating the percentage of error (ratio of MAE to mean concentration). According to the results in Table 4, it can be concluded that the error percentage is higher when the number of input stations is lower and, subsequently, the information provided for training is more limited. O3 is an exception to this statement because, although the number of input stations is 12 versus 13 for NO2 and, correspondingly, the available data points are nearly half, the error percentage is considerably lower. This can be explained by examining other behavioral characteristics of this pollutant (differences in mean values among stations, more easily identifiable patterns in datasets, etc.). When comparing PM2.5 and SO2, for which the number of input neurons is five in both cases, the prediction performance for SO2 is lower, possibly due to the smaller number of data points, according to Table 3 (PM2.5 has nearly three times more data points). Different approaches to evaluating the performance of the models can be followed (scatter diagrams, etc.), and more types of neural network models of similar complexity can be examined.
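The two summary checks discussed in this section, the MAE expressed as a percentage of the mean concentration and the change in mean and variance between the original and gap-filled series, can be sketched as follows. This is an illustrative example only: the function names and sample values are hypothetical, and the population variance (`statistics.pvariance`) is an assumption, since the text does not state which variance estimator was used.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def relative_mae_percent(mae_value: float, mean_concentration: float) -> float:
    """MAE expressed as a percentage of the mean concentration."""
    return 100.0 * mae_value / mean_concentration


def percent_change(original: float, updated: float) -> float:
    """Signed percentage difference of `updated` relative to `original`."""
    return 100.0 * (updated - original) / original


def mean_variance_shift(
    original: Sequence[float], gap_filled: Sequence[float]
) -> Tuple[float, float]:
    """Percentage change in mean and in variance after gap filling."""
    return (
        percent_change(fmean(original), fmean(gap_filled)),
        percent_change(pvariance(original), pvariance(gap_filled)),
    )


# Hypothetical series: the gap-filled one adds a single estimated value (20.0).
original_series = [10.0, 20.0, 30.0]
gap_filled_series = [10.0, 20.0, 30.0, 20.0]
print(relative_mae_percent(2.0, 25.0))
print(mean_variance_shift(original_series, gap_filled_series))
```

Small shifts in both quantities suggest that the estimated values are consistent with the distribution of the observed data, which is the sense in which the differences in the mean and variance tables are described as marginal.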

Conclusions
This study applied SNN models as a tool for point spatial interpolation of air-quality parameters, using data from an air-quality-monitoring network located in a densely populated urban area. Five air-quality parameters were selected (PM10, PM2.5, NO2, O3 and SO2), due to their importance in the field of air-quality indices and, more specifically, based on the EAQI (proposed by the EEA). The results highlight that the models' performance is significantly affected by the density of the air-quality-monitoring network (number of stations and data points per station), as well as by the specific patterns that characterize each pollutant's concentrations. The training dataset is crucial for the networks' development and needs to be carefully selected in order to provide adequate information which will improve the networks' generalization ability. This work can be utilized as an alternative to commonly used spatial interpolation methods in the field of air quality, and further improvements can be made by using more advanced networks and/or adding meteorological/climatic parameters as inputs.
Author Contributions: C.G.T. and A.A. were involved in the conceptualization, writing-original draft preparation and writing-review and editing of this work; individually, C.G.T. was responsible for the data curation and validation of the results and supervised the whole procedure. All authors (C.G.T., A.A. and I.K.) performed the various steps of the methodology, processed the data and developed the neural network models. All authors were involved in the discussion of the results and commented on the manuscript. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: The datasets generated during and/or analyzed during the current study are publicly available in the Ministry of Environment and Energy repository (ypen.gov.gr).