Forecasting PM10 in the Bay of Algeciras Based on Regression Models

Different forecasting methodologies, classified into parametric and nonparametric, were studied in order to predict the average concentration of PM10 over the course of 24 h. The comparison of the forecasting models was based on four quality indexes (Pearson’s correlation coefficient, the index of agreement, the mean absolute error, and the root mean squared error). The proposed experimental procedure was put into practice in three urban centers belonging to the Bay of Algeciras (Andalusia, Spain). The prediction results obtained with the proposed models exceed those obtained with the reference models through the introduction of low-quality measurements as exogenous information. This proves that it is possible to improve performance by using additional information from the existing nonlinear relationships between the concentration of the pollutants and the meteorological variables.


Introduction
Atmospheric pollution is currently one of the most important environmental problems on a global scale, with a direct and principal impact on human health [1,2]. For this reason, the European Environment Agency conducted a study which concluded that a large proportion of European populations and ecosystems are still exposed to air pollution that exceeds European standards, and therefore a considerable impact on human health and on the environment persists [3].
Regulatory levels of ambient air quality for particulate matter (PM10 and PM2.5) are set out in Directive 2008/50/EC of the European Parliament and of the Council [4]. The implementation of those measures in Spain is contained in Royal Decree 102/2011 on the improvement of ambient air quality [5]. This legislation defines a common strategy to establish objectives for ambient air quality in the Community and to assess ambient air quality on the basis of common methods and criteria.
Air quality in cities is not limited to a single factor. In fact, it depends on multiple causes such as meteorological variables, topographical characteristics, the degree of industrialization, and traffic and population densities [6][7][8][9]. The problem of atmospheric pollutants and their effects on health and the environment, as well as the intrinsic complexity of these phenomena, justifies the need for developing management and control strategies that safeguard the environment. These problems have attracted the interest of environmental authorities and researchers, who have developed different air quality models as forecasting strategies.
Modeling atmospheric pollutants is a powerful analysis tool with multiple applications, e.g., the evaluation of emission control strategies, support in environmental decision making, and the generation of scientific information for a better understanding of the atmospheric dynamics and pollution in an area. The importance of the models lies in their use for the development and implementation of environmental policies, predictions of pollutant levels, information systems, forewarning and prevention of environmental pollution, and standardization of databases. Regarding the industrial sector, they can also report on the effects of new installations and the optimization of processes.
Mathematical models are generally used to simulate the physical and chemical processes that affect pollutants, and their dispersion and transformation in the atmosphere. As indicated before, the diffusion mechanism of the pollutants in the atmosphere is a complex process that depends on numerous parameters, making the development of traditional mathematical models more difficult.
The purpose of the study was twofold: to draw up a detailed analysis of the environmental, meteorological, and seasonal variables that may influence the levels of suspended particles in order to build a solid and reliable database, and to develop and assess regression models to forecast particulate matter PM10 in the Bay of Algeciras with a prediction horizon of 24 h. This area has the most complex environmental issues in Andalusia because it is located in the Strait of Gibraltar. Furthermore, the zone concentrates a large population and significant industrial and port development. For this reason, the Bay of Algeciras has an extensive network of air quality stations; this availability of data enabled us to explore, develop, and improve predictive models.
The regression models developed in this work are based on different techniques of artificial neural networks (ANN), multiple linear regression (MLR), and persistence. These models are based on statistical and empirical equations, in connection with the data relative to pollution and other variables that may influence it. Regarding the last two, the following works can be highlighted: persistence [10][11][12][13][14] and MLR [9][15][16][17][18].
ANNs are widely applied to prediction tasks and have been used extensively in the literature. The approaches in References [18][19][20][21][22][23][24] apply ANN-based models and support the present research.
The paper is organized as follows. Section 2 presents the region and the raw data from the on-site equipment. Section 3 summarizes the theoretical framework. The experimental procedure is outlined in Section 4, and the results are presented in Section 5. Finally, our conclusions are explained in Section 6.

Target Area and Experimental Data
The Bay of Algeciras is located in the south of Andalusia, Spain. It is around 10 km long by 8 km wide, covering an area of some 75 km². The global and regional variations in the climate, along with the topographical conditions of the area studied, affect the transport and dispersion of pollutants [25][26][27].
The data used in the simulation models came from the European Environment Agency database, where the information is collected by the air quality stations of the Environmental Quality Surveillance Network in Andalusia, as well as by other methods [28,29]. These data combine meteorological and air pollutant measurements, which were used as exogenous variables in the configuration of the proposed models. Time series from the measurement stations in the Bay of Algeciras for the period from 2005 to 2010 were used.
The stations are strategically located with the goal of improving the spatial coverage of pollution data in the Bay of Algeciras (Andalusia) (Figure 1), providing a high-density grid of measurement points over the region. These stations are designed to monitor the levels of air pollution in urban areas, traffic, maximum values, or background contamination. Tables 1 and 2 contain detailed descriptions of all of the parameters that can be monitored or displayed through these stations.

The pollutants and meteorological variables were selected taking into account the following criteria:
• Limited data availability according to: (1) the location of the measurement stations, (2) the parameters measured at each one, and (3) the update period of the European Environment Agency database.
• Reliability of data, considering the data with a higher percentage of validity.
• The geographical location, selecting the stations in the principal urban areas of El Campo de Gibraltar.
Invalid data may have been caused by possible faults in the sensors of the measuring stations, poor calibration of the equipment, configuration errors, power outages, etc.
Table 3 shows the percentages of valid PM10 data for the period between 2005 and 2010. Because a greater number of measuring stations measure PM10, these stations were used in the study. The database is built with variables that are selected by regression analysis, and is complemented with trial-and-error tests. This database contains information regarding the parameter to be predicted, concentrations of other atmospheric pollutants, meteorological variables, day of the week (DW), season (SS), and autoregressive data. Furthermore, as an acceptance criterion, every selected variable had to have at least 85% valid annual data during three consecutive years. This minimum threshold was chosen in order to obtain a database in which the evolution and seasonality of the variables were registered.
From the analysis carried out and considering the main urban centers of El Campo de Gibraltar, three databases were obtained for the development of models in the municipalities of San Roque, Algeciras, and La Línea de la Concepción.

Prediction Models
In this work, five forecasting methodologies were used. They were classified into parametric and nonparametric. The parametric techniques consist of persistence and multiple linear regression models, while the nonparametric techniques are based on ANNs. More precisely, three ANN types were used: adaptive linear neuron, multilayer perceptron, and radial basis function.

Persistence Model
The persistence model is the most common reference method for forecasting horizons of up to 3-6 h and needs no complex computation. It states that the predicted value at time instant $t$, $\hat{y}_t$, is taken to be the last measurement, $y_{t-1}$, that is, $\hat{y}_t = y_{t-1}$ [14].
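As a minimal sketch of the persistence rule, the following Python fragment shifts a series by one step. The function name and the PM10 values are hypothetical and serve only as an illustration; they are not taken from the study's database.

```python
def persistence_forecast(series):
    """Return one-step-ahead persistence forecasts.

    The forecast for step t is the observation at step t-1, so the
    first observation has no forecast and the output is one item shorter.
    """
    return list(series[:-1])


# Hypothetical daily mean PM10 values, for illustration only.
observed = [32.0, 35.5, 41.0, 38.2, 29.7]
predicted = persistence_forecast(observed)  # forecasts for days 2 to 5
targets = observed[1:]                      # observations for days 2 to 5

for y, y_hat in zip(targets, predicted):
    print(f"observed={y:5.1f}  predicted={y_hat:5.1f}")
```

Because it requires no fitting, this forecast is a natural baseline against which the other models can be compared.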

Multiple Linear Regression
A multiple linear regression model has at least two predictors. Regression analysis conveys the idea of finding descriptive or predictive models from the observed relationships in a set of data. It is a widely used method in the prediction of atmospheric pollutants. Multiple linear regression defines the level and the dependence relationship of the involved parameters [15].
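The following sketch fits a multiple linear regression by ordinary least squares through the normal equations, using only the Python standard library. The two predictors (previous-day PM10 and wind speed), the coefficients, and all function names are assumptions chosen for illustration; the data are synthetic and noise-free.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(features, targets):
    """Ordinary least squares; returns [intercept, b1, ..., bk]."""
    rows = [[1.0] + list(f) for f in features]  # prepend the bias column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_mlr(coefs, feature_row):
    """Evaluate the fitted linear model on one row of predictors."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], feature_row))


# Synthetic example: (previous-day PM10, wind speed) -> next-day PM10.
features = [(30.0, 2.0), (40.0, 3.0), (35.0, 1.0), (50.0, 4.0), (28.0, 2.5)]
targets = [5.0 + 0.8 * pm - 1.5 * wind for pm, wind in features]

coefs = fit_mlr(features, targets)
forecast = predict_mlr(coefs, (45.0, 2.0))
print(coefs, forecast)
```

With noise-free data the fit recovers the generating coefficients; with real measurements the coefficients would only be least-squares estimates.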

Adaptive Linear Neuron
These networks are simpler than feedforward networks as they do not have hidden layers.
The training of this model is based on the Widrow-Hoff rule [30], which obtains the weights and biases that minimize the mean square error (MSE, Equation (1)):

$$\mathrm{MSE} = \frac{1}{N}\sum_{t=1}^{N}\left(y_t - \hat{y}_t\right)^2 \qquad (1)$$

where $N$ is the number of data, $y_t$ is the observed data, and $\hat{y}_t$ is the predicted data.
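A minimal sketch of the Widrow-Hoff (least mean squares) update for a single linear neuron is given below. The learning rate, the number of epochs, the function names, and the synthetic data are assumptions made for illustration and do not reproduce the configuration used in the study.

```python
def train_adaline(inputs, targets, learning_rate=0.1, epochs=200):
    """Fit a single linear neuron with the Widrow-Hoff (LMS) rule."""
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(inputs, targets):
            y_hat = bias + sum(w * xi for w, xi in zip(weights, x))
            error = y - y_hat
            # Step against the gradient of this sample's squared error.
            weights = [w + learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


def mean_square_error(inputs, targets, weights, bias):
    """MSE between the targets and the outputs of the linear neuron."""
    errors = [y - (bias + sum(w * xi for w, xi in zip(weights, x)))
              for x, y in zip(inputs, targets)]
    return sum(e * e for e in errors) / len(errors)


# Synthetic, noise-free data already scaled to [-1, 1].
inputs = [(-1.0,), (-0.5,), (0.0,), (0.5,), (1.0,)]
targets = [0.6 * x[0] - 0.2 for x in inputs]

weights, bias = train_adaline(inputs, targets)
mse = mean_square_error(inputs, targets, weights, bias)
print(weights, bias, mse)
```

Each update uses one sample at a time, so the rule approximates gradient descent on the MSE without ever forming the full error surface.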

Multilayer Perceptron
The multilayer perceptron, trained by error backpropagation with the Levenberg-Marquardt algorithm, is the most widely used structure. This technique consists of updating the weights of the connections between neurons in such a way that the weight changes are proportional to the estimated error between the desired output and the output produced at each step of an iterative process [31].

Radial Basis Function (RBF)
RBF networks have a structure similar to that of a multilayer network [32]. The main difference lies in the hidden neurons, which operate on the Euclidean distance between the input vector and the synaptic vector (the so-called centroid). These localized neurons respond with appreciable intensity only when the presented input vector and the centroid of the neuron lie close together in the input space. The training of RBF networks comprises two stages. The first one is unsupervised and is accomplished by obtaining the cluster centers of the training set inputs. The second consists of solving a system of linear equations.
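The two training stages can be sketched as follows, under several assumptions that are not stated in the text: k-means is used as the clustering step, the hidden units use the Gaussian form exp(-(d/spread)^2), a constant bias unit is added, and the output weights come from the normal equations with a tiny ridge term for numerical stability. All names and the sine-shaped target are illustrative.

```python
import math


def kmeans(points, k, iterations=20):
    """Stage 1 (unsupervised): k-means seeded with k evenly spaced samples."""
    step = (len(points) - 1) / (k - 1)
    centers = [list(points[round(i * step)]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(v) / len(members) for v in zip(*members)]
    return centers


def hidden_layer(x, centers, spread):
    """Gaussian response of every hidden neuron to x, plus a bias unit."""
    return [math.exp(-(math.dist(x, c) / spread) ** 2) for c in centers] + [1.0]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def train_rbf(points, targets, k, spread, ridge=1e-8):
    """Stage 2: solve (H^T H + ridge I) w = H^T y for the output weights."""
    centers = kmeans(points, k)
    h = [hidden_layer(p, centers, spread) for p in points]
    size = k + 1
    hth = [[sum(r[i] * r[j] for r in h) + (ridge if i == j else 0.0)
            for j in range(size)] for i in range(size)]
    hty = [sum(r[i] * y for r, y in zip(h, targets)) for i in range(size)]
    return centers, solve_linear_system(hth, hty)


def predict_rbf(x, centers, weights, spread):
    """Linear combination of the hidden responses."""
    return sum(w * a for w, a in zip(weights, hidden_layer(x, centers, spread)))


# Synthetic one-dimensional example on inputs scaled to [-1, 1].
points = [(i / 10,) for i in range(-10, 11)]
targets = [math.sin(math.pi * p[0]) for p in points]
centers, weights = train_rbf(points, targets, k=5, spread=0.5)
train_mse = sum((y - predict_rbf(p, centers, weights, 0.5)) ** 2
                for p, y in zip(points, targets)) / len(points)
print(centers, train_mse)
```

Only the second stage involves the targets, which is why it reduces to a linear problem once the centers and the spread are fixed.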

Experimental Procedure
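The experimental procedure, depicted in the conceptual map of Figure 2, comprises three stages: preprocessing of the data, implementation of the models, and evaluation of the models.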

Preprocessing of Data
In this step, the stations with more than 85% of valid data were selected. In addition, a statistical analysis was performed in order to eliminate the outliers. Finally, the variables of the database were ordered according to the correlation coefficient between the exogenous variables and the variable to be predicted (PM10 concentration).
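The validity check and the correlation-based ordering can be sketched as below. The variable names, the values, the use of `None` to mark invalid measurements, and the ordering by absolute correlation are assumptions made for illustration; the outlier-removal step is not shown.

```python
import math


def valid_fraction(series):
    """Share of entries that are real measurements (None marks invalid data)."""
    return sum(v is not None for v in series) / len(series)


def pearson(a, b):
    """Pearson correlation coefficient between two equally long series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def rank_exogenous(target, candidates):
    """Order candidate variables from highest to lowest |correlation|."""
    scores = {name: pearson(target, values)
              for name, values in candidates.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)


# Hypothetical values, for illustration only.
pm10 = [20.0, 30.0, 25.0, 40.0, 35.0]
candidates = {
    "humidity": [60.0, 61.0, 59.0, 60.0, 62.0],
    "wind_speed": [5.0, 3.0, 4.0, 2.0, 2.0],
    "SO2": [2.0, 3.0, 2.5, 4.0, 3.5],
}
ranking = rank_exogenous(pm10, candidates)
keeps_station = valid_fraction([12.0, None, 15.0, 14.0]) > 0.85
print(ranking, keeps_station)
```

The resulting order is the one in which exogenous variables would be fed to the models, from the most to the least correlated.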

Implementation of the Models
The implementation of the reference models (persistence and MLR) did not present any problem. The ANN models evaluated in this paper are a linear network (LIN), backpropagation networks with one and two hidden layers (BP1 and BP2), and a radial basis function network (RBF). The following premises hold for all of them:
• Data were normalized so that they fall into the interval [−1, 1], to achieve a faster computation. Equation (2) shows the mapping used:

$$y = \frac{(y_{\max} - y_{\min})\,(x - x_{\min})}{x_{\max} - x_{\min}} + y_{\min} \qquad (2)$$

where $x$ is an element of the vector (input or output) to normalize, $x_{\max}$ is the value of the greatest element of the vector to normalize, $x_{\min}$ is the value of the smallest element of the vector to normalize, $y$ is the normalized value of $x$, $y_{\max}$ is the maximum value (1), and $y_{\min}$ is the minimum value (−1).
• The dataset was randomly divided into three subsets: training, evaluation, and test sets. The first two sets were used for ANN model building with 70% and 15% of the data, respectively; the third set, with the remaining 15%, was used to test the predictive power of a model on out-of-sample data.
• Each experiment was repeated 100 times for each model to limit the effect of randomness on the results. Training of the tested networks was carried out until the error on the evaluation set started increasing. At this point, the training was stopped, and the performance of the network was assessed.
• The simulation started without exogenous variables, and variables were then added progressively (from the highest to the lowest correlation).
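The normalization to [−1, 1] and the random 70%/15%/15% division described in the premises above can be sketched as follows. The function names, the fixed seed, and the sample values are assumptions made for illustration.

```python
import random


def minmax_scale(values, y_min=-1.0, y_max=1.0):
    """Map values linearly so the smallest becomes y_min and the largest y_max."""
    x_min, x_max = min(values), max(values)
    return [(y_max - y_min) * (x - x_min) / (x_max - x_min) + y_min
            for x in values]


def random_split(n_samples, train_frac=0.70, eval_frac=0.15, seed=0):
    """Shuffle sample indices and cut them into training, evaluation, and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train_frac * n_samples)
    n_eval = round(eval_frac * n_samples)
    return (indices[:n_train],
            indices[n_train:n_train + n_eval],
            indices[n_train + n_eval:])


# Hypothetical PM10 values, for illustration only.
pm10 = [18.0, 25.0, 40.0, 33.0, 62.0]
scaled = minmax_scale(pm10)
train_idx, eval_idx, test_idx = random_split(100)
print(scaled)
print(len(train_idx), len(eval_idx), len(test_idx))
```

In a repeated-experiment setting such as the 100 runs mentioned above, a different seed per run would give a different random division each time.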
Hereinafter, the particularities of the models are detailed. Table 4 lists the parameters of the network architecture and the activation functions of the neural networks. The rule to select the range of the neurons in the hidden layers for the BP1 and BP2 models is as follows. The number of neurons in the first hidden layer is the mean of the numbers of neurons in the input and output layers, while for the second hidden layer it is one half of the neurons in the first hidden layer [31], as shown in Table 4.
For the RBF model, it is compulsory to specify an appropriate value of the Gaussian kernel spread. If this value is too small or too high, the network might not generalize well and a lot of neurons would be required to fit a fast-changing approximation function.

Evaluation of the Models
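The models' performance was assessed via the following four quality indicators: Pearson's correlation coefficient (ρ), the index of agreement (IOA), the mean absolute error (MAE), and the root mean square error (RMSE):

$$\rho = \frac{\sigma_{y\hat{y}}}{\sigma_{y}\,\sigma_{\hat{y}}}$$

$$\mathrm{IOA} = 1 - \frac{\sum_{t=1}^{N}\left(y_t - \hat{y}_t\right)^2}{\sum_{t=1}^{N}\left(\left|\hat{y}_t - \bar{y}\right| + \left|y_t - \bar{y}\right|\right)^2}$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{t=1}^{N}\left|y_t - \hat{y}_t\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(y_t - \hat{y}_t\right)^2}$$

where $\sigma_{y\hat{y}}$ is the covariance between $y_t$ (observed data) and $\hat{y}_t$ (predicted data), $\sigma_{y}$ and $\sigma_{\hat{y}}$ are their respective standard deviations, $\bar{y}$ is the mean of the observed data, and $N$ is the number of data.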
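A minimal sketch of how the four quality indicators might be computed is given below. The index of agreement is assumed here to take Willmott's original form, which the text does not state explicitly; the function name and the sample values are illustrative.

```python
import math


def quality_indexes(observed, predicted):
    """Return Pearson's rho, IOA (Willmott's form assumed), MAE, and RMSE."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n

    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted)) / n
    std_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in observed) / n)
    std_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in predicted) / n)

    squared_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Denominator of the index of agreement: potential error around the observed mean.
    potential_error = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2
                          for o, p in zip(observed, predicted))

    return {
        "rho": cov / (std_obs * std_pred),
        "ioa": 1.0 - squared_error / potential_error,
        "mae": sum(abs(o - p) for o, p in zip(observed, predicted)) / n,
        "rmse": math.sqrt(squared_error / n),
    }


# Hypothetical values, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
scores = quality_indexes(observed, predicted)
print(scores)
```

Higher values of ρ and IOA and lower values of MAE and RMSE indicate a better model, so the four indicators are read together rather than in isolation.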

Results
As mentioned in Section 2, three databases were obtained for the development of the models in the cities of San Roque, Algeciras, and La Línea de la Concepción. These databases were designed according to the correlation coefficient between the exogenous variables and the variable to be predicted, obtaining the relations shown in Table 5. Although the inclusion of exogenous variables with correlation coefficients lower than 0.6 would appear to be a mistake, these were included because of the large volume of data. Thanks to the computing power of the models used, we were able to study the appropriateness of using such data. After the data are preprocessed, the four quality indexes of all implemented models are obtained, and the best configurations are selected as a function of the number of exogenous variables. For example, the database used in San Roque has ten exogenous variables (rows 4 to 13 of Table 5), ordered according to their correlation coefficient with the variable to be predicted (row 3).
Once the models that best minimized the errors of the evaluation set were selected, they were used to test the predictive power of each model on the out-of-sample set. The results are shown in Table 6. After an in-depth assessment of the colored maps of each municipality (see Figure 3 for San Roque) and the results in Table 7, the following conclusions were drawn:

• In San Roque, the BP2 model with 10 exogenous variables is the best with respect to the reference models. The optimal configuration of the BP2 model is as follows: number of neurons in hidden layer 1 = 17, number of neurons in hidden layer 2 = 7, training condition: epochs = 500, and performance function: MSE = 0.001.
• In Algeciras, the RBF model with 3 exogenous variables is the best with respect to the reference models. The optimal configuration of the RBF model is as follows: number of neurons in the hidden layer = 20, spread = 0.4, and performance function: MSE = 0.001.
• In La Línea de la Concepción, the RBF model with 6 exogenous variables is the best with respect to the reference models. The optimal configuration of the RBF model is as follows: number of neurons in the hidden layer = 13, spread = 2.7, and performance function: MSE = 0.001.
Figure 4 shows the best result obtained for each city for the daily concentration of particulate matter. These concentration levels are standardized with respect to the limit value of 150 µg/m³, which has been the 24 h PM10 limit set by the National Ambient Air Quality Standards since 1987 [33].

Conclusions
In this paper, five forecasting methodologies, classified into parametric and nonparametric techniques, have been studied with the goal of predicting the average concentration of PM10 over the course of 24 h. These models were applied in three urban centers: San Roque, Algeciras, and La Línea de la Concepción.
Different results were obtained according to the locations under study. With respect to the reference models, the best model at each site and its percentages of improvement in MAE and RMSE are as follows: San Roque (BP2 model with 10 exogenous variables; [0.20%, 4.54%]-PERS and
In summary, it can be concluded that the prediction results with the proposed models exceed those obtained with the reference models. This proves that it is possible to improve performance by using additional information from the existing nonlinear relationships between the concentration of the pollutants and the meteorological variables. In this sense, the inclusion of new stations from other networks of meteorological stations and/or amateur observers available on websites should be used to increase

Figure 1. Location of the air quality stations of the Environmental Quality Surveillance Network in the Bay of Algeciras (Andalusia).

Figure 2. Graphical abstract of the paper.

Figure 3. Colormap of quality indicators of the best models obtained in San Roque in accordance with the number of exogenous variables and the model used.

Figure 4. Prediction results using the best models for each city (top: San Roque; middle: Algeciras; bottom: La Línea de la Concepción).

Table 1. Stations and parameters analyzed in the Bay of Algeciras.

Table 2. Continuation of Table 1.

Table 3. Annual percentage of valid data.

Table 4. Parameters for the network models.

Table 5. Descending order of the parameters.

Table 6. Results of the best models obtained at each site.

Table 7. Percentages of improvement over the reference models at each site.