Application of a Machine Learning Methodology for Data Implementation

Abstract: An important aspect of environmental sciences is the study of air quality using statistical methods (environmental statistics) that draw on large datasets of climatic parameters. The air quality monitoring networks that operate in urban areas provide data on the most important pollutants, which environmental statistics can turn into continuous surfaces of pollutants’ concentrations. Generating ambient air quality maps can help guide policy makers and researchers in formulating measures to minimize the adverse effects of air pollution. The information needed for a mapping application can be obtained by applying spatial interpolation methods to the available data to estimate air quality distributions. This study used point monitoring data from the network of stations that operates in Athens. A machine learning scheme was applied to spatially estimate pollutants’ concentrations, and the results could be effectively used to fill in missing values and provide representative data for statistical analysis.


Introduction
Studying the distribution of air quality parameters is an important task for urban communities. According to the European Environment Agency (EEA), air pollution is a major environmental health hazard in Europe, as hundreds of thousands of Europeans are affected each year by air quality issues [1][2][3]. Effective planning strategies require constant monitoring of the various pollutants, creating databases suitable for statistical analysis. Increased data availability can help researchers produce more reliable results. Spatial interpolation techniques have been widely used in air quality studies [4,5], as they can be utilized for data implementation (filling gaps) in pollutant time series with missing values and even for sites of interest with no data availability. Additionally, informational tools built on these implemented databases, such as Air Quality Indices (AQI), can present new insight to policy makers and the public in a comprehensible manner [6][7][8]. The EEA proposed a European Air Quality Index (EAQI), which is based on hourly concentrations of five key pollutants (PM10, PM2.5, NO2, O3, and SO2) and has six different levels based on each pollutant's concentrations. This study aims to present a methodology for filling gaps in environmental datasets, specifically in the field of air quality. Starting from the original datasets, and based on concentration time series for the selected pollutants of the EAQI, a machine learning data implementation process was followed. This methodology can be utilized as a fast and effective tool that contributes to the development of indexes such as the EAQI, which in turn visualize air pollutants' profiles and provide insight into patterns and relationships.

Data
The air quality monitoring sites from which the data were derived are located in the metropolitan area of Athens, Greece. As part of the Southeastern Mediterranean region, Athens has a climate defined by dry summers (long periods during which the temperatures are considerably high) and wet winters (these periods are usually short) [9]. The basin is bounded by mounts Parnitha, Pentelikon, Hymettus, and Aigaleo to the north, northeast, east-central, and west, respectively. Due to the transport mechanisms, the topography of the area, and the proximity to the sea, the air pollution fields are greatly affected by various flows of different scales [10][11][12][13]. The monitoring sites in the area are part of an air quality monitoring network that has operated since 1984 under the supervision of the Hellenic Ministry of Environment and Energy (MEE). The network is considered representative of the pollutants' spatial variability and thus suitable for the application of advanced statistical methodologies. For the development of the EAQI, a different number of stations was selected for each pollutant. The criterion for this selection was that a station should have at least a small percentage of available data and could thus contribute to the data implementation methodology. For the five pollutants, NO2, O3, PM10, PM2.5, and SO2, the number of stations used was fourteen, thirteen, eleven, six, and six, respectively. All five were monitored hourly, and the time period of the analysis was three years (2016-2018).

Methodology
The first step in this study, after the database development, was to find the number of gaps (missing hourly concentrations) present in each target station's data for 2018. This task was performed for all pollutants individually. However, in order to apply the machine learning spatial interpolation scheme effectively, a specific criterion was adopted: for each gap at a target station, all the remaining stations had to have an available measurement at the same time. If even one of them also had a gap, that hour was not included in the interpolation process. The results of this step are presented in Table 1 and reveal the number of missing values that could potentially be estimated and used to increase the available data points.

The next step was to apply an Artificial Neural Network (ANN) approach for spatial estimation purposes. To achieve this, a Shallow Neural Network (SNN) was utilized as a practical and fairly simple ANN that is only moderately demanding in terms of time and computational power, yet can effectively simulate complex nonlinear relationships between parameters. In detail, two-layer networks with sigmoid hidden neurons and linear output neurons were used (Figure 1). The hourly concentrations used for the models were those for which none of the stations had a missing value. The training of the networks was performed with the Levenberg-Marquardt backpropagation algorithm. The dataset was randomly divided into three subsets used for training, validation, and testing, each corresponding to a specific percentage of the original data (70% training, 15% validation, 15% testing). The number of data points in the subsets differed by pollutant and is presented in Table 2. The network architecture included a number of inputs equal to the number of all stations minus the target station (13 for NO2, 12 for O3, 10 for PM10, 5 for PM2.5, and 5 for SO2), while the output was always one (the target station). Regarding the number of neurons in the hidden layer, the performance of each network was evaluated by using the Mean Absolute Error (MAE) statistical criterion [14][15][16][17][18], which is calculated by using the following equation:

\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| E_i - O_i \right| \]

where E denotes the estimated concentration, O the observed concentration, and n the number of data points. The network with the lowest MAE was considered the best performing. Five runs were performed for all schemes and for hidden layer sizes that ranged from 1 to 40 neurons. The best performing networks and their architecture are presented in Table 3. By applying these selected SNN models to the corresponding inputs of 2018, the gaps in each station and pollutant were filled.
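The training and selection procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the Levenberg-Marquardt update is written directly in NumPy, tanh is assumed as the sigmoid-type hidden activation, the validation subset is assumed to be the one used for comparing MAE values, the sweep covers only a few hidden-layer sizes instead of 1 to 40, and the data are synthetic stand-ins for five hypothetical neighbouring stations.

```python
import numpy as np

def unpack(theta, n_in, n_hid):
    """Split the flat parameter vector into W1, b1, w2, b2."""
    k = n_in * n_hid
    return (theta[:k].reshape(n_in, n_hid), theta[k:k + n_hid],
            theta[k + n_hid:k + 2 * n_hid], theta[-1])

def predict(theta, X, n_hid):
    """Two-layer network: tanh hidden layer, single linear output."""
    W1, b1, w2, b2 = unpack(theta, X.shape[1], n_hid)
    return np.tanh(X @ W1 + b1) @ w2 + b2

def jacobian(theta, X, n_hid):
    """Derivative of every prediction with respect to every parameter."""
    W1, b1, w2, _ = unpack(theta, X.shape[1], n_hid)
    H = np.tanh(X @ W1 + b1)
    dZ = (1.0 - H ** 2) * w2  # d(output) / d(hidden pre-activation)
    dW1 = (X[:, :, None] * dZ[:, None, :]).reshape(len(X), -1)
    return np.hstack([dW1, dZ, H, np.ones((len(X), 1))])

def train_lm(X, y, n_hid, seed, iters=40, mu=1e-2):
    """Levenberg-Marquardt: damped Gauss-Newton steps on the squared error."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.5, size=X.shape[1] * n_hid + 2 * n_hid + 1)
    err = predict(theta, X, n_hid) - y
    for _ in range(iters):
        J = jacobian(theta, X, n_hid)
        step = np.linalg.solve(J.T @ J + mu * np.eye(theta.size), J.T @ err)
        trial = theta - step
        trial_err = predict(trial, X, n_hid) - y
        if trial_err @ trial_err < err @ err:  # accept the step, reduce damping
            theta, err, mu = trial, trial_err, mu * 0.1
        else:                                  # reject the step, increase damping
            mu *= 10.0
    return theta

def mae(estimated, observed):
    """Mean Absolute Error between estimated and observed values."""
    return float(np.mean(np.abs(estimated - observed)))

# Synthetic stand-in data: 5 hypothetical input stations, 1 target station.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X.mean(axis=1) + 0.3 * np.sin(X[:, 0]) + rng.normal(scale=0.05, size=300)

# Random 70% / 15% / 15% split into training, validation, and testing subsets.
idx = rng.permutation(len(X))
tr, va, te = idx[:210], idx[210:255], idx[255:]

# Sweep hidden-layer sizes with five runs each; keep the lowest validation MAE.
best = None
for n_hid in (1, 2, 4, 8):
    for run in range(5):
        theta = train_lm(X[tr], y[tr], n_hid, seed=run)
        score = mae(predict(theta, X[va], n_hid), y[va])
        if best is None or score < best[0]:
            best = (score, n_hid, theta)

val_mae, best_hidden, best_theta = best
test_mae = mae(predict(best_theta, X[te], best_hidden), y[te])
baseline_mae = mae(np.full(len(te), y[tr].mean()), y[te])
print(f"hidden neurons: {best_hidden}, test MAE: {test_mae:.3f}, "
      f"mean-only baseline MAE: {baseline_mae:.3f}")
```

Once a network has been selected this way, filling a gap amounts to calling `predict` with the measurements of the other stations for the hour in question. The mean-only baseline is included here only as a sanity check for the sketch.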

Results
A total of 12,526 missing values were estimated, and the percentage of gaps that were filled in each station was above 40% for PM10 and PM2.5, above 20% for O3 and SO2, and above 15% for NO2. Regarding O3 and NO2, where the percentage of interpolated values is lower, it needs to be considered that they had a higher number of input stations, and thus the criterion that none of the inputs should have a missing value for each gap of the target station was more difficult to fulfill. Table 1 presents in detail the gaps originally and after the interpolation, as well as the percentage of missing values that were estimated.
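The eligibility criterion behind these percentages can be illustrated with a short Python sketch. The station names and concentrations below are hypothetical, and representing the data as a dictionary of hourly lists with `None` for a missing value is an assumption made for the example.

```python
def fillable_gaps(readings, target):
    """Return (all gap hours, gap hours that can be estimated) for the target.

    A gap can be estimated only if every other station has a measurement
    for the same hour, since those measurements are the network inputs.
    """
    others = [s for s in readings if s != target]
    gaps = [h for h, v in enumerate(readings[target]) if v is None]
    fillable = [h for h in gaps
                if all(readings[s][h] is not None for s in others)]
    return gaps, fillable

# Hypothetical hourly concentrations (None marks a missing value).
readings = {
    "A": [10.0, None, 12.0, None, 9.0],
    "B": [11.0, 10.5, None, 8.0, 9.5],
    "C": [9.0, 9.5, 10.0, None, 8.5],
}

gaps, fillable = fillable_gaps(readings, "A")
percent_filled = 100.0 * len(fillable) / len(gaps)
print(gaps, fillable, percent_filled)  # [1, 3] [1] 50.0
```

In this toy case station A has two gaps, but only hour 1 qualifies because station C is also missing at hour 3. Each additional input station adds another chance for such a coincidence, which is consistent with the lower filled percentages for NO2 and O3.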
The number of data points for the training, validation, and testing subsets for each pollutant is presented in Table 2. Pollutants with a lower number of input stations were associated with higher data point numbers per station (with fewer stations, it is less likely that at least one of them has a missing value at a given hour). However, more stations (NO2, O3) provided additional data points. NO2 and PM2.5 are the pollutants that provided the most data for training, validation, and testing purposes.
The architecture of the best performing models is presented in Table 3. The number of hidden neurons is reported as the average over all the stations for each pollutant. The average MAE values (measured in the same units as the concentrations of the pollutants, μg/m³) in these cases are also included. By contrast, all networks for a given pollutant had the same number of inputs, and all networks had a single output (the target station). The average number of hidden neurons ranged from 21.7 to 25.2, which revealed that the models were at an almost equal complexity level.

Discussion
According to the results in Table 3, it could be concluded that the error percentage was higher when the number of input stations was lower and, subsequently, the information provided for training was more limited. O3 was an exception to this statement: although the number of input stations was 12 versus 13 for NO2 and, correspondingly, the available data points were nearly half, the error percentage was considerably lower. This can be explained by examining other behavioral characteristics of this pollutant (differences in mean values among stations, more easily identifiable patterns in datasets, etc.). When comparing PM2.5 and SO2, where the input neurons were five for both, the prediction performance for SO2 was lower, possibly due to the smaller number of data points, according to Table 2 (PM2.5 had nearly three times more data points). Different approaches to evaluating the performance of the models can be followed (scatter diagrams, correlation metrics, etc.), and more types of neural network models of similar complexity can be examined.
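The error percentage discussed here is the MAE relative to the mean concentration, which makes pollutants with very different concentration levels comparable. A minimal Python sketch of the calculation follows; the numbers are hypothetical rather than values from the study, and taking the mean of the observed concentrations as the denominator is an assumption.

```python
def mean_absolute_error(estimated, observed):
    """MAE = (1/n) * sum(|E_i - O_i|), in the units of the concentrations."""
    return sum(abs(e - o) for e, o in zip(estimated, observed)) / len(observed)

def error_percentage(estimated, observed):
    """MAE expressed as a percentage of the mean observed concentration."""
    mean_observed = sum(observed) / len(observed)
    return 100.0 * mean_absolute_error(estimated, observed) / mean_observed

# Hypothetical concentrations in μg/m³, not values from the study.
observed = [20.0, 30.0, 40.0, 50.0]
estimated = [22.0, 27.0, 41.0, 46.0]

mae_value = mean_absolute_error(estimated, observed)  # 2.5
pct = error_percentage(estimated, observed)           # about 7.14
print(f"MAE = {mae_value} μg/m³, error = {pct:.2f}% of the mean concentration")
```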

Conclusions
This study applied SNN models as a tool for point spatial interpolation of air quality parameters, using data from an air quality monitoring network located in a densely populated urban area. Five air quality parameters were selected (PM10, PM2.5, NO2, O3, and SO2) due to their importance in the field of air quality indexes and, more specifically, based on the EAQI (proposed by the EEA). The results highlight that the models' performance was significantly affected by the density of the air quality monitoring network (number of stations and data points per station) as well as the specific patterns that characterize each pollutant's concentrations. The training dataset is crucial for the networks' development and needs to be carefully selected in order to provide adequate information that will improve the networks' generalization ability. This work can be utilized as an alternative to commonly used spatial interpolation methods in the field of air quality, and further improvements can be made by using more advanced networks and/or adding meteorological parameters as inputs.
Author Contributions: C.G.T. and A.A. were involved in the conceptualization, writing (original draft preparation), and writing (review and editing) of this work, while C.G.T. individually was responsible for the data curation and the validation of the results and supervised the whole procedure. All authors (C.G.T., A.A. and I.K.) performed the various steps of the methodology, processed the data and developed the neural network models. All authors were involved in the discussion of the results and commented on the manuscript. All authors have read and agreed to the published version of the manuscript.

Figure 1. A two-layer network with sigmoid hidden neurons and linear output neurons.

Table 1. Number of missing values (gaps) during 2018, for the original and spatially interpolated dataset.

Table 2. Number of data points distributed to the training, validation, and testing subsets for the 2016-2017 time period.

Table 3. Number of input, hidden (average), and output neurons, as well as Mean Absolute Error (MAE) (average), mean concentration values, and percentage of error (MAE to mean concentration) for the best performing models and the 2016-2017 time period.