Estimation of Prediction Error in Regression Air Quality Models

Abstract: Combustion of energy fuels or organic waste is associated with the emission of harmful gases and aerosols into the atmosphere, which strongly affects air quality. Air quality monitoring devices are unreliable, and measurement gaps appear quite often. Missing data modeling techniques can be used to complete the monitoring data. Concentrations of monitored pollutants can be approximated with regression modeling tools, such as artificial neural networks. In this study, a long-term data set from the air monitoring station in Zabrze (Silesia, South Poland) was analyzed. Concentration prediction was tested for the main air pollutants, i.e., O3, NO, NO2, SO2, PM10, and CO. Multilayer perceptrons were used to model the concentrations. The predicted concentrations were compared to the observed ones to evaluate the approximation accuracy. Prediction errors were calculated for the whole concentration range as well as for the specified concentration subranges. Several different error measures were estimated. It was found that the use of a single measure of approximation accuracy may lead to incorrect interpretation. The application of one neural network to the entire concentration range results in different prediction accuracy in various concentration subranges. Replacing one neural network with several networks adjusted to specific concentration subranges should improve the modeling accuracy.


Introduction
Air pollutants can be gases or particles. The basic gas pollutants include O3, NO, NO2, SO2, CO, and volatile organic compounds (VOC). Aerosol pollutants usually appear as airborne particles, i.e., very fine particles made up of either solid or liquid matter that can stay suspended in the air for a long time and spread with the wind [1]. They are called PM10, PM2.5, or PM1.0, depending on the particle size. Air pollution comes from both natural and anthropogenic sources. In urban areas, the combustion of fuels, biomass, and organic waste is the main source of gaseous and particle pollution.
The impact of air pollution on the environment, economy, and human health is indisputable. It is also increasingly well documented in scientific reports. Air pollution is considered to be one of the most important factors influencing human health [2][3][4][5]. Polluted air can cause negative changes in living organisms, even when the concentrations do not exceed the permissible levels. Air pollution is linked to mental health disorders [6,7]. It has also been reported that air pollution can have negative economic effects related to lower employee productivity and labor supply [8][9][10][11]. The World Health Organization reported that about 7 million people died in 2012 because of poor air quality [12]. This points to a significant global threat from air pollution. In many European countries, particulate matter (measured as PM10 or PM2.5), NO2, and O3 concentrations are still above acceptable limits [13,14].
Air pollution is an important social, economic, and health problem, especially in highly urbanized areas. The level of pollutant concentration in the air is standardized and
The devices measuring pollutants were placed in a thermostated kiosk. Each of them was equipped with an auto-calibration system. The measurements were carried out automatically.
The data subjected to the regression analysis consisted of three groups: time data, concentration data, and meteorological data. The following symbols were used to describe the variables:
D - day
H - hour
O3 - hourly O3 concentration, µg/m3
NO - hourly NO concentration, µg/m3
NO2 - hourly NO2 concentration, µg/m3
SO2 - hourly SO2 concentration, µg/m3
PM10 - hourly PM10 concentration, µg/m3
CO - hourly CO concentration, mg/m3
T - hourly mean temperature, °C
I - hourly mean intensity of solar radiation, W/m2
WS - hourly mean wind speed, m/s

Transformation of Time Data
The discrete date variable was converted to a cyclic form in which the same values are assigned to the same dates in different years. This procedure assigns higher values to dates in the winter months, with a maximum of 1.00 on 31 December, and lower values to dates in the summer months, with a minimum of 0.00 on 2 July. Thus, from 2 July to 31 December, the date variable increases linearly by 0.005494 per day (in leap years by 0.005479), while after 31 December, it decreases at the same rate for half a year, reaching 0.00 again on 2 July.
For the variable describing the time of day, the minimum value of 0.00 was assigned to 12 a.m. (midnight), and the maximum value of 1.00 was assigned to 12 p.m. (noon). From midnight to noon, the value increases linearly from 0.00 to 1.00 by 0.08333 per hour, and then it decreases by 0.08333 per hour from noon to midnight.
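The two cyclic transformations can be illustrated with a minimal Python sketch. The function names `encode_date` and `encode_hour` are hypothetical. The date encoding assumes plain linear interpolation between the anchor dates (2 July and 31 December), so the daily step on the falling half of the year may differ slightly from the increments quoted above.

```python
from datetime import date


def encode_date(d: date) -> float:
    """Cyclic date feature: 0.0 on 2 July, 1.0 on 31 December."""
    trough = date(d.year, 7, 2)
    if d >= trough:
        # Rising half: 2 July -> 31 December of the same year.
        peak = date(d.year, 12, 31)
        return (d - trough).days / (peak - trough).days
    # Falling half: 31 December of the previous year -> 2 July.
    prev_peak = date(d.year - 1, 12, 31)
    return 1.0 - (d - prev_peak).days / (trough - prev_peak).days


def encode_hour(hour: int) -> float:
    """Cyclic hour feature: 0.0 at midnight (hour 0), 1.0 at noon (hour 12)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    # Distance from noon, scaled so that each hour changes the value by 1/12.
    return 1.0 - abs(hour - 12) / 12.0
```

With this encoding, 11 a.m. and 1 p.m. receive the same value, as do dates equally far from 31 December, which is what makes the variables cyclic rather than monotonic.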

Regression Models Concept
Artificial neural networks were used to build all regression models. Regression relationships present in the data were used to predict the concentrations of pollutants. In the specific case of the time series, the concentration of a selected pollutant was correlated with the time data, concentrations of other pollutants, and meteorological data. The knowledge hidden in the data can be used to make predictions according to the pattern shown in Figure 1.
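As a minimal sketch of this setup, the snippet below lists the output and the candidate inputs for each pollutant model. It assumes that every model takes all remaining variables from the three data groups as inputs; the selection actually used is the one given in Table 1. The variable names follow the symbols of the data description, and `model_layout` is a hypothetical helper.

```python
TIME_VARS = ["D", "H"]
POLLUTANTS = ["O3", "NO", "NO2", "SO2", "PM10", "CO"]
METEO_VARS = ["T", "I", "WS"]
ALL_VARS = TIME_VARS + POLLUTANTS + METEO_VARS


def model_layout(target: str) -> dict:
    """Return the output variable and the input variables of one regression model."""
    if target not in POLLUTANTS:
        raise ValueError(f"unknown pollutant: {target}")
    # Assumption: all other variables serve as inputs for the chosen target.
    return {"output": target, "inputs": [v for v in ALL_VARS if v != target]}


# One layout per modeled pollutant.
layouts = {p: model_layout(p) for p in POLLUTANTS}
```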

Artificial Neural Networks
All neural models used a multilayer perceptron with five neurons in a single hidden layer (Figure 2). Such a relatively simple neural network structure allows for efficient exploration of the knowledge hidden in the data [37]. Six perceptrons were created, one for each pollutant as the output, with the 10 other variables as the inputs. The choice of the input variables for each of the 6 models is presented in Table 1.
The analysis was carried out using the Artificial Neural Network module in the Statistica program. During the neural network training, the analyzed data set was randomly divided into three subsets: the training subset (70% of cases), the verification subset (15% of cases), and the test subset (15% of cases). The BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm was used in the network learning process. The weights were initialized randomly, using random Gaussian initialization, before the network training started. A logistic activation function was used in the neurons of the hidden layer as well as at the output. Each neural network was trained 5 times, and then the best model was selected for further analysis. The learning process was limited to 200 epochs, which is sufficient to stabilize the modeling error. A learning rate of 0.1 was assumed. The sum of squares (SOS) was used as the error function.
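The network structure described here can be sketched in plain Python. This is only an illustration of the forward pass and the SOS error function, not the Statistica implementation: the BFGS training loop is omitted, the standard deviation `sigma` of the Gaussian initialization is an assumed value, and all function names are hypothetical. Because the output neuron is logistic, the sketch assumes that targets are scaled to the interval from 0 to 1.

```python
import math
import random


def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def init_mlp(n_inputs=10, n_hidden=5, seed=0, sigma=1.0):
    """Random Gaussian initialization; the last weight in each row is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.gauss(0.0, sigma) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.gauss(0.0, sigma) for _ in range(n_hidden + 1)]
    return hidden, output


def predict(mlp, x):
    """Forward pass: logistic hidden layer followed by a logistic output neuron."""
    hidden, output = mlp
    h = [logistic(sum(w * v for w, v in zip(row[:-1], x)) + row[-1])
         for row in hidden]
    return logistic(sum(w * v for w, v in zip(output[:-1], h)) + output[-1])


def sum_of_squares(mlp, inputs, targets):
    """SOS error over a set of cases (targets assumed scaled to 0..1)."""
    return sum((t - predict(mlp, x)) ** 2 for x, t in zip(inputs, targets))
```

A training procedure would adjust the weights returned by `init_mlp` to reduce `sum_of_squares` on the training subset, and repeating this from several random seeds corresponds to training each network several times and keeping the best one.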
The time series of the 6-year measurement period included 52,608 hourly cases. Only 25,523 of them had no missing data for any variable, and these cases were used to train the networks. When the best network was chosen, the approximation errors were calculated for the entire range of concentrations as well as for several concentration subranges. The errors and their variability in different subranges were analyzed further.
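The case selection and the random 70/15/15 split can be sketched as follows. The sketch assumes that each case is a tuple of values with `None` marking a missing measurement; the helper names and the fixed seed are illustrative choices, not part of the original workflow.

```python
import random


def complete_cases(rows):
    """Keep only the rows in which no variable is missing (None)."""
    return [r for r in rows if all(v is not None for v in r)]


def split_cases(rows, fractions=(0.70, 0.15, 0.15), seed=0):
    """Randomly split rows into training, verification, and test subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_verif = round(fractions[1] * len(shuffled))
    # The test subset takes whatever remains after the first two subsets.
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_verif],
            shuffled[n_train + n_verif:])
```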

Estimation of Prediction Error
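The prediction errors were estimated from the divergences between the predicted concentrations (model outputs) $y_i$ and the actual concentrations $x_i$. Seven different measures of approximation error were calculated for each regression model. The corresponding formulas are listed below, where $n$ is the number of observations, $\bar{y}$ is the average value in the set of predicted concentrations, and $\bar{x}$ is the average value in the set of observed concentrations.
MAE, mean absolute error:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| x_i - y_i \right|$$
MSE, mean squared error:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( x_i - y_i \right)^2$$
RMSE, root mean squared error:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( x_i - y_i \right)^2}$$
MARE, mean absolute relative error:
$$\mathrm{MARE} = \frac{1}{n}\sum_{i=1}^{n} \frac{\left| x_i - y_i \right|}{x_i}$$
r, Pearson's correlation coefficient:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$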
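The measures named above can be computed with a short Python function. The sketch assumes the standard textbook definitions of MAE, MSE, RMSE, MARE, and Pearson's r. Two choices are assumptions of this sketch rather than of the study: cases with a zero observed concentration are skipped when computing MARE, and r is reported as NaN when either series has zero variance. The measures d and d1 are not defined in this section, so they are left out.

```python
import math


def error_measures(observed, predicted):
    """Return MAE, MSE, RMSE, MARE and Pearson's r for paired concentrations."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")
    diffs = [y - x for x, y in zip(observed, predicted)]
    mae = sum(abs(e) for e in diffs) / n
    mse = sum(e * e for e in diffs) / n
    # Assumption: cases with x == 0 are excluded from the relative error.
    rel = [abs(e) / x for e, x in zip(diffs, observed) if x != 0]
    mare = sum(rel) / len(rel) if rel else float("nan")
    mean_x = sum(observed) / n
    mean_y = sum(predicted) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, predicted))
    sxx = sum((x - mean_x) ** 2 for x in observed)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    r = sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else float("nan")
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "MARE": mare, "r": r}
```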

Results
The approximation errors were calculated for 6 air pollutants: O3, NO, NO2, SO2, PM10, and CO. The hourly concentrations of these pollutants were modeled using multilayer perceptrons. For each pollutant, the most accurate of the created neural networks was chosen. The modeled concentrations were compared to the actual concentrations to assess the prediction error. The modeling errors of the hourly concentrations were averaged over the entire 6-year period (2011-2016). The seven measures of prediction error listed above were calculated separately for each pollutant. The errors were estimated for the entire range of observed concentrations as well as for several concentration subranges. The error values are shown in Tables 2-7, separately for each pollutant. The tables also show the number of observations as well as the mean observed and predicted concentrations in the various concentration ranges.
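The per-subrange evaluation can be sketched as below. Cases are grouped by their observed concentration, and each group gets its own count, mean values, and errors. The half-open intervals (lower bound included, upper bound excluded) and the helper name `errors_by_subrange` are assumptions of this sketch; only MAE and RMSE are shown to keep it short.

```python
import math


def errors_by_subrange(observed, predicted, edges):
    """Summarize prediction errors within subranges of the observed values.

    edges: ascending boundaries, e.g. [0, 20, 40]; subrange k covers
    edges[k] <= x < edges[k + 1].
    """
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        pairs = [(x, y) for x, y in zip(observed, predicted) if lo <= x < hi]
        n = len(pairs)
        if n == 0:
            rows.append({"range": (lo, hi), "n": 0})
            continue
        mse = sum((y - x) ** 2 for x, y in pairs) / n
        rows.append({
            "range": (lo, hi),
            "n": n,
            "mean_obs": sum(x for x, _ in pairs) / n,
            "mean_pred": sum(y for _, y in pairs) / n,
            "MAE": sum(abs(y - x) for x, y in pairs) / n,
            "RMSE": math.sqrt(mse),
        })
    return rows
```

Comparing `mean_obs` with `mean_pred` in each row also shows directly whether a model underestimates or overestimates concentrations in that subrange.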

Modelling of O3 Concentrations
The prediction results for the O3 concentrations are presented in Table 2. The error measures MAE, MSE, and RMSE behave alike. They reach their minimum values in the first O3 concentration subrange (0-20 µg/m3), which is the most numerous one. Moving to higher concentration subranges, the modeling precision decreases along with the number of observations. The mean absolute relative error (MARE) changes in the opposite direction. The r, d, and d1 measures have much lower values in the specified subranges than in the entire concentration range.

Modelling of NO Concentrations
The prediction results for the NO concentrations are presented in Table 3. The values of MAE, MSE, and RMSE behave alike. They reach their minimum values in the first NO concentration subrange (0-20 µg/m3), which is also the most numerous subrange. These errors gradually increase, with minor variations, in the higher concentration subranges, while the number of observations in the subranges decreases. The values of MARE, r, d, and d1 change in the opposite direction. In the case of r, d, and d1, the decreasing values may be interpreted as a loss of modeling accuracy. As for ozone, r, d, and d1 have much lower values in the specified subranges than in the entire concentration range.

Modelling of NO2 Concentrations
The prediction results for the NO2 concentrations are presented in Table 4. The MAE, MSE, and RMSE reach their minimum values in the first NO2 concentration subrange (0-20 µg/m3). This subrange is also the most numerous one. Moving to higher concentrations, the MAE, MSE, and RMSE values in the subranges increase, which means that the modeling accuracy decreases. The number of observations in the subranges decreases in the same direction. The other measures, MARE, r, d, and d1, change in the opposite direction. The decreasing values of r, d, and d1 may be interpreted as a loss of modeling accuracy. The r, d, and d1 measures show much lower values in the specified subranges than in the entire concentration range.

Modelling of SO2 Concentrations
The prediction results for the SO2 concentrations are presented in Table 5. The error measures MAE, MSE, and RMSE reach their minimum values in the first SO2 concentration subrange (0-10 µg/m3). This subrange is also the most numerous one. Moving to the higher concentration subranges, the prediction precision decreases, and the number of observations also decreases. The MARE, r, d, and d1 change in the opposite direction. When a subrange is wider, the values of r, d, and d1 may increase. In the wide subranges, 100-200 µg/m3 and 200-308 µg/m3, these values are higher than in the narrow subranges, for example, 90-100 µg/m3. The highest values of these measures are observed for the entire concentration range.

Modelling of PM10 Concentrations
The prediction results for the PM10 concentrations are presented in Table 6. The MAE, MSE, and RMSE reach their minimum values in the second PM10 concentration subrange (20-40 µg/m3). This subrange is also the most numerous one. Moving from the second subrange to the higher subranges, these errors increase, which means that the modeling accuracy decreases. The number of observations in the subranges decreases in the same direction. The values of the MARE gradually decrease in the higher concentration subranges. The values of r, d, and d1 decrease up to the 140-160 µg/m3 subrange. For the wider subranges, i.e., those above 200 µg/m3, the values of r, d, and d1 are higher, which can be explained by the effect of the range extension. The highest values of these three measures were estimated for the entire concentration range (0-1000 µg/m3).

Modelling of CO Concentrations
The prediction results for the CO concentrations in subranges are presented in Table 7. The MAE, MSE, and RMSE reach their minimum values in the first CO concentration subrange (0-1 mg/m3). This is also the most numerous subrange. Moving from this subrange to the higher subranges, these errors increase up to the 4-5 mg/m3 subrange, while the number of observations in the subranges decreases quickly. For the higher subranges, the upward trends are disrupted. The values of r, d, and d1 also decrease up to the 4-5 mg/m3 subrange, and this trend is likewise disrupted in the higher subranges. The highest values of r, d, and d1 were estimated for the entire concentration range (0-9 mg/m3).
Table 6. Values of approximation errors calculated for different subranges and the entire range of PM10,obs concentrations (hourly data, Zabrze 2011-2016). PM10,obs means the real PM10 concentration, and PM10,pred means the predicted PM10 concentration.

The Comparison of Real and Predicted Concentrations
The scatterplots of the observed and the predicted concentration values for O3, NO, NO2, SO2, PM10, and CO are presented in Figure 3. The scatterplots show that the neural prediction models underestimate the values for the higher concentrations. The same conclusion follows from comparing the averages of the actual and the predicted concentrations in the specified subranges (Tables 2-7). This effect applies to all the studied pollutants.

Discussion
The error measures based on differences between the sets of real and predicted concentrations, such as MAE, MSE, and RMSE, behave similarly. They reach their minimum values in the most numerous concentration subranges. For most pollutants measured at the Zabrze station, this is the first subrange, with the lowest concentrations. The only exception is PM10, for which the second subrange (20-40 µg/m3) is the most numerous one. A similar effect was noted in a previous study for a dataset from another air monitoring station, in Lodz (central Poland) [30]. In that work, the error values for two pollutants O3 (Table 3). The negative sign means a negative correlation between the actual and the predicted concentrations in this subrange. This result shows that conclusions about model accuracy based on the value of r must be drawn carefully. Accuracy measures such as r, d, and d1 fail when assessing the prediction precision, but they can be used to compare the precision of models created for different air pollutants.
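The sensitivity of r to the width of the concentration range can be reproduced with synthetic data. The sketch below does not use the Zabrze measurements: the observed values are drawn uniformly from 0 to 100, and the "predictions" are the observed values plus Gaussian noise with the same spread everywhere, so the absolute error is comparable in every subrange by construction. All names, the seed, and the noise level are assumptions chosen for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else float("nan")


rng = random.Random(0)
# Synthetic data: uniform "observed" values, constant-variance prediction noise.
observed = [rng.uniform(0, 100) for _ in range(5000)]
predicted = [x + rng.gauss(0, 10) for x in observed]

r_full = pearson_r(observed, predicted)
r_sub = {}
for lo in range(0, 100, 20):
    pairs = [(x, y) for x, y in zip(observed, predicted) if lo <= x < lo + 20]
    r_sub[(lo, lo + 20)] = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])

print(f"entire range: r = {r_full:.3f}")
for bounds, value in r_sub.items():
    print(f"subrange {bounds}: r = {value:.3f}")
```

In this construction, r within each 20-unit subrange is clearly lower than r over the entire range even though the noise level is identical, which illustrates why a low r in a narrow subrange does not by itself indicate a less accurate model.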
The errors based on differences between the actual and the predicted concentrations, such as MAE, MSE, and RMSE, reflect the prediction quality better. However, a neural network adapted to the whole range of concentrations cannot predict with the same quality in different concentration subranges. In the less numerous concentration subranges, the modeling precision falls because fewer training cases are available. This follows from the nature of machine learning: the adaptation process is dominated by the cases from the most numerous concentration ranges. The MSE formula is based on the sum of squared errors of the individual cases, and the sum of squares (SOS) is also used as the error function during the neural network adaptation process. Thus, the adaptation process minimizes the MSE as well as the related errors, i.e., RMSE and MAE. The advantage of measures such as MAE, MSE, and RMSE is their ability to reflect the real modeling accuracy in different concentration subranges.
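The dominance of the most numerous subranges can be made explicit by splitting the SOS error function by subrange. In the notation below, which is introduced here only for illustration, $S_k$ is the set of cases in subrange $k$, $n_k$ is their number, and $\mathrm{MSE}_k$ is the mean squared error within that subrange; $x_i$ and $y_i$ are the observed and predicted concentrations.

```latex
\mathrm{SOS}
  = \sum_{i=1}^{n} (x_i - y_i)^2
  = n \cdot \mathrm{MSE}
  = \sum_{k} \sum_{i \in S_k} (x_i - y_i)^2
  = \sum_{k} n_k \, \mathrm{MSE}_k
```

Each subrange therefore contributes to the minimized function in proportion to its number of cases, so a reduction of the error in a subrange with many cases lowers SOS more than the same reduction in a sparsely populated one.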
In publications on modeling the air pollutant concentrations, the dominant approach is to create a single model that works across the entire concentration range. Among the publications, there are those in which only one measure of prediction accuracy is used, for example, MAE [38] or R2 [39]. This approach seems risky, especially when using measures