Forecasting Urban Air Quality via a Back-propagation Neural Network and a Selection Sample Rule

In this paper, a new model for forecasting daily SO2, NO2, and PM10 concentrations at seven sites in Guangzhou was developed from a sample selection rule and a Back Propagation (BP) neural network, using data from January 2006 to April 2012. A meteorological similarity principle was applied in the development of the sample selection rule. The key meteorological factors influencing SO2, NO2, and PM10 daily concentrations, as well as the weight matrices and threshold matrices, were determined. A basic model was then developed based on the improved BP neural network. To improve the basic model, identification of the factor variation consistency was added to the rule, and seven sets of sensitivity experiments at one of the seven sites were conducted to obtain the selected model. A comparison with the basic model from May 2011 to April 2012 at one site showed that the selected model for PM10 displayed better forecasting performance, with Mean Absolute Percentage Error (MAPE) values decreasing by 4% and $R^2$ values increasing from 0.53 to 0.68. Evaluations conducted at the six other sites revealed a similar performance. On the whole, the analysis showed that the models presented here could provide local authorities with reliable and precise predictions and alarms about air quality if used at an operational scale.


Introduction
Air quality has recently become a serious issue in several of the large cities in China. This problem has significant potential for adverse impacts on human health and the environment [1][2][3]. Therefore, it is extremely important to accurately forecast the concentrations of pollutants to provide guidance for travel advice and governmental policies.
Forecasting the concentrations of air pollutants is a difficult task due to the complexity of the physical and chemical processes involved. Nevertheless, many researchers have focused on these types of forecasts [4][5][6][7][8]. The most common forecasting approaches are numerical models and statistical models. Numerical models do not require a large quantity of measured data, but they demand sound knowledge of pollution sources, the chemical composition of the exhaust gases, and the physical processes in the atmospheric boundary layer. This crucial knowledge is often limited. Thus, approximations and simplifications are often employed in the modeling process.
In contrast, statistical models usually necessitate a large quantity of measurement data under a large variety of atmospheric conditions. By applying regression and machine learning techniques, a number of functions can be used to fit the pollution data in terms of selected predictors. Neural networks, a subset of statistical models, are usually presented as systems of interconnected neurons that can compute values from inputs by feeding information through the network. Unlike other statistical models, neural networks make no prior assumptions concerning the data distribution. They can model highly nonlinear functions and can be trained for accurate generalization. These features make the neural network an attractive alternative to numerical and other statistical models [9][10][11][12].
There have been many applications of neural networks in air quality forecasting since the 1990s, and researchers have obtained fairly good results [13][14][15][16]. Despite the successful applications of neural networks in atmospheric science, the method has its own weaknesses and limitations. Studies have shown that three main factors affect neural network effectiveness: network topology, learning algorithm, and learning samples [17,18]. Previous research mainly concentrated on the network structure and learning algorithm, which improved the forecasting accuracy of the network [19][20][21][22][23][24]. However, once improvements in the network structure and learning algorithm reach a certain degree, improvements in the accuracy of air quality forecasting models plateau. Therefore, the selection of learning samples has become a vital factor that determines the mapping ability and generalization of the network. This is because the selection can ensure the representativeness of the learning samples and remove unnecessary interference, thereby improving the forecasting accuracy of the model. Niska et al. [21] used a genetic algorithm for selecting the inputs and designing the high-level architecture of a multi-layer perceptron model for forecasting NO2 concentrations. Sousa et al. [22] predicted hourly ozone concentrations based on feed-forward artificial neural networks using principal components as inputs, and they improved the predictions of the models by reducing their complexity and eliminating data collinearity.
The main objective of this paper is to develop a sample filter method for the prediction of the daily NO2, SO2, and PM10 concentrations in Guangzhou, in the Pearl River Delta region, based on a similarity principle of weather and pollutant background concentration. During the development of the prediction models, the selection of parameters is conducted by means of sensitivity experiments, and the Back Propagation (BP) neural network is used for data-driven computation. The above actions are all part of an integrated environmental strategy designed and run by the local authorities of Guangzhou, according to the demands of the Action Plan on Prevention and Control of Air Pollution. Currently, this action plan is the most rigorous and systematic framework for improving air quality in China.

Data
A significant quantity of observational data under a wide variety of atmospheric conditions was required for this study. The dataset in this paper includes meteorological parameters and pollutant concentrations in Guangzhou, which is located in the south central part of Guangdong Province, China (23°06′ N Latitude, 113°15′ E Longitude).
Real-time monitored meteorological parameters, including temperature, wind speed, wind direction, rainfall, atmospheric pressure, relative humidity, and solar radiation intensity, were obtained from an automatic air quality monitoring station at Sun Yat-Sen University, located in the Haizhu District of Guangzhou City. Forecast meteorological data, including temperature, wind speed, wind direction, and rainfall, were obtained from Guangzhou Weather Forecasts [25]. All the data were processed into daily mean values as needed, according to the National Ambient Air Quality Standards (GB 3095-2012) issued by Environment Protection Administration (EPA) of China [26]. The monitored meteorological data were used as historical meteorological data in the model, and the forecast meteorological data were used as the meteorological data of the forecasting day. To reduce the interference of different geographic locations on the monitored meteorological data, pollutant concentration forecasting was performed for seven state-controlled air quality monitoring sites in urban Guangzhou. Thus, the applied monitoring data of the atmospheric environment were derived from the daily pollutant concentration data from seven state-controlled air quality monitoring sites as reported by the Guangzhou Environmental Protection [27]. These state-controlled air quality monitoring sites are the Guangya Middle School (Num. 1), the Guangzhou No. 5 Middle School (Num. 2), the Guangzhou Environmental Monitor Station (Num. 3), the Experimental Kindergarten of Tianhe Vocational School (Num. 4), Luhu Park (Num. 5), Guangdong University of Business Studies (Num. 6), and the Guangzhou No. 86 Middle School (Num. 7). The data span the period from January 2006 to April 2012, and a total of 23,195 valid samples were used for the paper.

Methods
In view of the small variation in weather during our study period, a similarity principle of weather and concentration parameters was applied. The multilayer selection rule for historical samples from Guangzhou was then constructed. This step is very important for the development of predictive models. The selection of historical samples can improve the similarity between the occurrence of historical pollution and future pollution, and a proper selection can improve the efficiency of data-driven models (e.g., BP neural networks). This is also in line with the formation of pollution, where the main factors affecting the diffusion and transport of pollutants are the meteorological parameters, and every meteorological parameter has a different influence on NO2, SO2, and PM10 [28,29]. Thus, the sample selection was based on meteorological similarity and the consistency of the variation trend. The rule was divided into two parts, namely the identification of meteorological parameter similarity and the consistency of the variation trend, i.e., the identification of similarity in background concentrations.
First, a comprehensive correlation analysis of pollutant concentration and meteorological parameters was performed to determine the key factors of the selection rule, and these parameters were also used as inputs into the BP neural network. Next, the three-layer selection sample rule was applied. Finally, we utilized the improved BP neural network for data-driven computation to establish the air quality forecasting model of urban Guangzhou.

Identification of the Key Factors
A comprehensive correlation analysis of pollutant concentration and meteorological factors was conducted. The number of related days was set to two: the meteorology for the forecasting day and for the day before the forecasting day. Meanwhile, the daily mean value of pollutant concentration two days before the forecasting day was used as an input factor in an attempt to counteract the lack of pollutant emission source data.
A comprehensive analysis of pollutant concentration and meteorological factors was conducted for different pollutants, mainly through correlation analysis and weight analysis of the influencing factors in each pollution scenario. The analysis was intended to identify the degree of influence of each meteorological factor on pollutants, thus resulting in the selection of the factors with the greatest impact on pollutants and the allocation of the corresponding influencing weights. The correlation analysis started with the comparison of two typical pollution scenarios, namely, the ascending or descending periods of each pollutant, and the serious pollution or slight pollution periods. In this way, the degree of influence that the meteorological factors had on pollutants under these two situations was obtained. The average value of the two scenarios was calculated and multiplied by a correlation coefficient to obtain the comprehensive weight of the influence of each meteorological factor on different pollutants.
The ascending and descending periods of each pollutant are defined as the periods when the change in the pollutant concentration between consecutive days exceeds 0.05 mg/m³. Serious pollution and slight pollution are defined as periods when the Air Pollution Index of the pollutant exceeds 100 or is lower than 20, respectively.
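The two period definitions above amount to simple threshold tests. As a minimal sketch (the function names are illustrative, and it assumes that the sign of the day-to-day change separates ascending from descending periods, which the text implies but does not state), they could be coded as:

```python
def label_trend_periods(conc, threshold=0.05):
    """Label each day-to-day change as 'ascending', 'descending' or None.

    conc: daily mean concentrations in mg/m^3, in chronological order.
    The returned list has one entry per pair of consecutive days.
    """
    labels = []
    for prev, curr in zip(conc, conc[1:]):
        change = curr - prev
        if change > threshold:
            labels.append("ascending")
        elif change < -threshold:
            labels.append("descending")
        else:
            labels.append(None)  # change too small to count as either period
    return labels


def label_pollution_level(api_values, serious=100, slight=20):
    """Label each day as 'serious', 'slight' or None from its Air Pollution Index."""
    labels = []
    for api in api_values:
        if api > serious:
            labels.append("serious")
        elif api < slight:
            labels.append("slight")
        else:
            labels.append(None)
    return labels
```

Days labelled `None` fall into neither scenario and would simply be left out of the corresponding weight analysis.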
The influencing weight of each meteorological factor under the above-mentioned periods was identified using the following steps: (a) Obtaining the representative data for the meteorological factor. The specific data include the average value of the ascending period $M_{iu}$, the average value of the descending period $M_{id}$, the maximum value of the analysis period $M_i^{max}$, the minimum value of the analysis period $M_i^{min}$, and the overall average value $M_i^{adv}$, where $i$ represents the specific meteorological factor. (b) Numerical normalization. (c) Variation analysis of the meteorological factor ($D_i$). (d) Computation of the influencing weight.
Finally, the comprehensive influencing weights between meteorological factors and pollutant concentrations were determined by the following equation:
$$r = R \times \frac{w_1 + w_2}{2}$$
where $r$ is the comprehensive influencing weight between the meteorological factor and the pollutant concentration; $R$ is the correlation coefficient between the meteorological factor and the pollutant concentration; $w_1$ is the influencing weight in the ascending or descending period; and $w_2$ is the influencing weight in the serious or slight pollution periods.
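As described earlier in this section, the comprehensive weight is the average of the two scenario weights multiplied by the correlation coefficient. A minimal sketch of that last step, with hypothetical factor names and input values used only for illustration (they are not results from the paper):

```python
def comprehensive_weight(R, w1, w2):
    """Average the two scenario weights and scale by the correlation coefficient."""
    return R * (w1 + w2) / 2.0


# Hypothetical (R, w1, w2) triples per meteorological factor.
factors = {
    "wind_speed": (0.5, 0.4, 0.6),
    "rainfall": (0.3, 0.2, 0.4),
}
weights = {name: comprehensive_weight(*vals) for name, vals in factors.items()}

# Factors can then be ranked by weight to pick the most influential ones.
ranked = sorted(weights, key=weights.get, reverse=True)
```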

A Selection Sample Rule Based on the Similarity Principle
Multiple meteorological factors create a variety of meteorological parameter spaces that impose different impacts on the transport and diffusion of pollutants. During air quality forecasting, if the appropriate meteorological space is found, the intrinsic relationship between multiple physical quantities and the pollutant will have a reference. An appropriate set of samples was selected for the main influencing factors such that forecasting could be targeted, and the mapping ability and generalization of the network could be improved. Thus, three-layer sample screening principles based on meteorological similarity criteria were proposed.

The Basic Description
The first level of screening identifies samples where the similarity of each meteorological factor falls within a certain threshold range. The screened samples should conform to the following formula:
$$\Delta y_j = \left| y_j^{pre} - y_j^{sam} \right| \le y_j^{set}$$
where $y_j^{pre}$ is the meteorological factor on the day of forecasting; $y_j^{sam}$ is the meteorological factor of the sample; $\Delta y_j$ is the meteorological similarity of the meteorological factors between the sample and the day of forecasting; $j$ is the specific meteorological factor; and $y_j^{set}$ is the threshold value screened by the meteorological factor, forming a primary threshold matrix $Y$. In this matrix, the threshold value can change dynamically according to the sample size demanded.
The second level of screening applies a threshold value range for total weighted meteorological similarity. The screened samples should conform to the following formula:
$$S = \sum_{j=1}^{M_{num}} w_j \, \Delta y_j \le S_{set}$$
where $S$ is the entire meteorological similarity; $S_{set}$ is the threshold value screened by the entire meteorological similarity; $w_j$ is the weight of each meteorological factor, forming the weight matrix $W$; and $M_{num}$ is the number of meteorological factors.
The third level of screening identifies the $n$ samples with the highest meteorological similarity. The screened samples should conform to the following formula:
$$Q_{num} \le n$$
where $Q_{num}$ is the number of samples in the sequenced sample column, and $n$ is the number of samples needed.
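The three screening levels can be read as a filter, a second filter, and a top-n cut. The sketch below is one way to express that reading; it assumes that similarity is measured by the raw absolute difference of each factor (no normalization), that a smaller weighted sum means a more similar sample, and that days are represented as dictionaries. All names are illustrative rather than taken from the paper.

```python
def select_samples(forecast, samples, thresholds, weights, s_set, n):
    """Three-level screening of historical samples.

    forecast:   dict factor -> value on the forecasting day
    samples:    list of dicts with the same keys (historical days)
    thresholds: dict factor -> per-factor threshold
    weights:    dict factor -> weight of that factor
    s_set:      threshold on the total weighted difference
    n:          maximum number of samples to keep
    """
    scored = []
    for sample in samples:
        diffs = {j: abs(forecast[j] - sample[j]) for j in thresholds}
        # Level 1: every factor must lie within its own threshold.
        if any(diffs[j] > thresholds[j] for j in thresholds):
            continue
        # Level 2: the weighted sum of differences must not exceed s_set.
        total = sum(weights[j] * diffs[j] for j in thresholds)
        if total > s_set:
            continue
        scored.append((total, sample))
    # Level 3: keep at most n samples, most similar first.
    scored.sort(key=lambda pair: pair[0])
    return [sample for _, sample in scored[:n]]
```

Because the thresholds are plain arguments, they can be loosened or tightened until the rule returns the sample size that training needs.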
Among these criteria, the selection of the weight matrices and the threshold matrices is key to obtaining high quality samples. Hence, the following identification approaches for weight matrices and threshold matrices were adopted.

Identification of $w_j$
The establishment of the weight matrix $w_j$ was integrated with the selection of model input factors, and a comprehensive correlation analysis of pollutant concentration and meteorological factors was performed. While choosing the input parameters of the neural network, the weight matrix of the selection sample rule was also established.

Identification of $y_j^{set}$
The establishment of the threshold matrix $y_j^{set}$ was accomplished via the orthogonal test method, which is a highly efficient experimental design method used for the arrangement of multi-factor experiments and the search for optimal horizontal combinations [30]. For the different pollutants, we set different levels of factors and selected some representative experimental points (horizontally mixed) for the experiments. The optimal horizontal combination was selected to generate the threshold matrix of the selection sample rule [31].
Based on the results of the above weight matrix $w_j$, the tested experimental factors were identified. In accordance with prior knowledge, the level of each experimental factor was confirmed. The minimum absolute error of the forecasting model was adopted as the experimental objective to seek the optimal combination and finally identify the sample optimization threshold matrix.

Identification of the Variation Trend Consistency
Under the selection rule stated above in Section 3.2, there will be some scenarios in which the wind speed decreased relative to the previous day in the historical sample but increases relative to the previous day on the prediction day. Such a scenario will lead to an error in the prediction model when the sample is used in the BP neural network. Therefore, it is necessary to identify the variation trend consistency.
The factors considered were deduced according to the weight matrix of the selection rule (see Section 3.1) and the principles of pollution formation. The chosen factors were rainfall, wind speed, and background concentration. However, sensitivity experiments were still needed to determine the key factor for NO2, PM10, and SO2. The details of the experimental results will be introduced in the following section.

Variation Trend Consistency for Wind Speed
Because wind is a vector quantity, wind speed is described by its two components, $w_x$ and $w_y$:
$$w_x = w_s \cdot \cos(w_d), \qquad w_y = w_s \cdot \sin(w_d)$$
where $w_s$ is the recorded wind speed and $w_d$ is the recorded wind direction.
Thus, the steps for the identification of the variation trend consistency for wind speed are as follows:
(1) Calculate the variation between the forecasting day and the day before, i.e., the difference between the squared values of wind speed on the day of forecasting and the day before. Here $w_{p-x}$ and $w_{p-y}$ are the two wind vector components on the day of forecasting, and the squared wind speed of a day is the sum of the squares of its two components.
(2) Calculate the variation between the two adjacent days in the samples selected in Section 3.1 in the same way, i.e., the difference between the squared values of wind speed on the sample day and the day before, where $w_{t-x}$ and $w_{t-y}$ are the two wind vector components on the sample day.
(3) Identify whether the wind speed in the forecasting data shows the same tendency of ascending or descending as that in the selected samples. If the tendency is the same, the samples are retained; otherwise, the samples are removed.
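The wind speed check can be sketched as follows. The sketch assumes that wind direction is given in degrees and that a sample is kept only when both changes in squared wind speed have the same sign; a change of exactly zero is treated as having no trend. Function names and the `(speed, direction)` tuple layout are illustrative.

```python
import math


def wind_components(speed, direction_deg):
    """Split a recorded wind speed and direction into x and y components."""
    rad = math.radians(direction_deg)
    return speed * math.cos(rad), speed * math.sin(rad)


def squared_speed_change(today, yesterday):
    """Difference in squared wind speed between a day and the day before.

    today, yesterday: (speed, direction_deg) tuples.
    """
    x1, y1 = wind_components(*today)
    x0, y0 = wind_components(*yesterday)
    return (x1 ** 2 + y1 ** 2) - (x0 ** 2 + y0 ** 2)


def filter_by_wind_trend(forecast_pair, sample_pairs):
    """Keep the samples whose wind speed trend matches the forecast trend.

    forecast_pair: (forecasting day, day before)
    sample_pairs:  list of (sample day, day before) pairs
    """
    target = squared_speed_change(*forecast_pair)
    kept = []
    for pair in sample_pairs:
        # A positive product means both changes point the same way.
        if target * squared_speed_change(*pair) > 0:
            kept.append(pair)
    return kept
```

The same sign comparison applies to any scalar factor, so a rainfall check could reuse the final function with a different change calculation.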

The Variation Trend Consistency Identification of Rainfall
The variation in the rainfall level in the forecasting data was calculated as the difference between the rainfall level on the forecasting day and that on the day before. The variation in the historical rainfall levels was calculated in the same way for the two adjacent days in the sample data. We then identified whether the rainfall level in the forecasting data showed the same tendency of ascending or descending as that in the sample data. If the tendency is the same, the samples are retained; otherwise, the samples are removed.

Similarity Identification of Background Concentration
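The following steps were used to conduct the similarity identification of the background concentration:
(1) The background concentration on the day of forecasting, $BC_1$, is calculated as a weighted sum of two concentration terms with weights 0.6 and 0.4.
(2) The background concentration in the sample data, $BC_2$, is calculated in the same way.
(3) Identify whether the absolute difference between the two background concentrations is within the threshold value:
$$\left| BC_1 - BC_2 \right| \le Set \qquad (13)$$
If it is within the range, the samples are retained; otherwise, they are removed.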
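The background concentration check can be sketched as below. The operands of the 0.6/0.4 weighted sum are not recoverable from this copy of the text, so the sketch assumes, purely for illustration, that the weights apply to the concentrations of the day before and of two days before the forecasting day. That assignment and all names are hypothetical.

```python
def background_concentration(c_day_minus_1, c_day_minus_2,
                             w_recent=0.6, w_older=0.4):
    """Weighted background concentration.

    ASSUMPTION: 0.6 weights the day before and 0.4 weights two days before.
    The source does not state which terms the weights apply to.
    """
    return w_recent * c_day_minus_1 + w_older * c_day_minus_2


def filter_by_background(bc_forecast, sample_bcs, bc_set):
    """Keep sample background concentrations within bc_set of the forecast one."""
    return [bc for bc in sample_bcs if abs(bc_forecast - bc) <= bc_set]
```

For example, `filter_by_background(background_concentration(0.10, 0.05), [0.07, 0.12, 0.085], 0.02)` keeps the first and third sample values.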

Improvements in BP Neural Network
Due to its strong learning and generalization ability, a BP neural network was used as the data-driven computation method [32]. In this paper, a BP neural network with three layers was applied to predict the daily concentrations of NO2, PM10, and SO2. The layers included an input layer, a hidden layer, and an output layer. The data described in Section 2 were divided into training, validation, and test sets. The training and validation sets covered January 2006 to April 2011 at the seven air quality monitoring sites; 80% of these data were randomly selected for the training set, and the remaining 20% comprised the validation set. In addition, the data from May 2011 to April 2012 were used for the test set, in order to test and compare the model performance at the seven air quality monitoring sites. There are two main components affecting pollutant concentration: emission sources, and pollutant transmission and diffusion conditions. The key factor that affects pollutant transmission and diffusion in a city is the meteorological conditions. Therefore, the meteorological factors identified in Section 3.1 were considered as the major input factors for the BP neural network. According to the conclusions in the literature [33,34], the daily concentrations of NO2, PM10, and SO2 for the two days before the forecasting day were also used as input factors for the BP neural network to reduce the influence of the missing emissions data. The final number of variables used in the input layer (NInput) in each forecast model is shown in Table 1.
The number of neurons in the hidden layer is half that of the input layer [35]. Different neural network structures were established for NO2, PM10, and SO2. The neuron in the output layer was regarded as the forecasted daily concentration of NO2, PM10, or SO2.
The training termination conditions in the BP neural network were also changed to improve the overall accuracy of the forecasting model. When the average relative error of all training samples reached a specified error value, the training would cease. The specified error value was determined experimentally by testing different error values. For NO2, PM10, and SO2, the optimal specified error values were 0.5, 0.4, and 0.35, respectively. Each group of training samples was trained five times, which means that five models were developed. The model with the smallest average relative error was selected as the prediction model, reducing the randomness of the BP neural network.
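A minimal sketch of the training procedure described in this section: a three-layer network whose hidden layer has half as many neurons as the input layer, training that stops once the average relative error of the training samples reaches a specified value, and five repeated runs from which the best is kept. The sigmoid hidden units, linear output, learning rate, weight initialisation range, and epoch cap are assumptions made for illustration; the text does not specify them.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_in, rng):
    n_hidden = max(1, n_in // 2)  # hidden layer is half the input layer
    # Each weight row carries one extra entry for the bias term.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return w1, w2


def forward(net, x):
    w1, w2 = net
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w1]
    out = sum(w * h for w, h in zip(w2, hidden + [1.0]))  # linear output neuron
    return hidden, out


def mean_relative_error(net, xs, ys):
    return sum(abs(forward(net, x)[1] - y) / abs(y) for x, y in zip(xs, ys)) / len(ys)


def train(xs, ys, target_error, rng, lr=0.05, max_epochs=2000):
    """Back-propagation with the error-based stopping rule (plus an epoch cap)."""
    net = init_network(len(xs[0]), rng)
    w1, w2 = net
    for _ in range(max_epochs):
        if mean_relative_error(net, xs, ys) <= target_error:
            break  # specified average relative error reached
        for x, y in zip(xs, ys):
            hidden, out = forward(net, x)
            err = out - y
            # Hidden-layer error terms use the output weights before their update.
            deltas = [err * w2[k] * h * (1.0 - h) for k, h in enumerate(hidden)]
            for k, h in enumerate(hidden + [1.0]):
                w2[k] -= lr * err * h
            for k, row in enumerate(w1):
                for i, v in enumerate(x + [1.0]):
                    row[i] -= lr * deltas[k] * v
    return net, mean_relative_error(net, xs, ys)


def best_of_restarts(xs, ys, target_error, restarts=5, seed=0):
    """Train several times from random starts and keep the lowest-error model."""
    rng = random.Random(seed)
    runs = [train(xs, ys, target_error, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])
```

Inputs are expected as lists of floats scaled to a small range (for example 0 to 1), and targets must be non-zero because the stopping rule divides by them.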

Indices of Model Evaluation
We used the following indicators to evaluate the models: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), correlation coefficient (R), tendency forecasting accuracy (TFA), Nash-Sutcliffe coefficient of efficiency (Ef), and accuracy factor (Af) [36]. The TFA is the accuracy rate of forecasting the upward or downward trend of pollutant concentrations over two consecutive days on the basis of monitoring results. Ef, an indicator of the model fit, is a normalized measure (−∞ to 1) that compares the mean square error generated by a particular model simulation to the variance of the target output sequence. An Ef value closer to 1 indicates better model performance; an Ef value of zero indicates that the model is, on average, performing only as well as using the mean target value for prediction; and an Ef value < 0 indicates an altogether questionable choice of model. Af is a simple multiplicative factor indicating the spread of the results around the prediction. The larger the Af value, the less accurate the average estimate.
The MAE, MAPE, TFA, Af, and Ef are defined as follows:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left| y_{pre,i} - y_{mon,i} \right|$$
$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left| y_{pre,i} - y_{mon,i} \right|}{y_{mon,i}}$$
$$\mathrm{TFA} = \frac{A}{N}$$
$$\mathrm{Af} = 10^{\frac{1}{N}\sum_{i=1}^{N}\left| \log_{10}\left( y_{pre,i} / y_{mon,i} \right) \right|}$$
$$\mathrm{Ef} = 1 - \frac{\sum_{i=1}^{N}\left( y_{pre,i} - y_{mon,i} \right)^2}{\sum_{i=1}^{N}\left( y_{mon,i} - \bar{y}_{mon} \right)^2}$$
where $y_{pre}$ and $y_{mon}$ are the predicted and measured values, respectively, and $\bar{y}_{mon}$ is the mean of the measured values of the response variable. $N$ is the total number of observations. $A$ is the number of correct forecasts for the upward or downward trend of pollutant concentrations over two consecutive days.
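The evaluation indices can be computed directly from paired predicted and measured series. The sketch below uses the standard forms of MAE, MAPE, the Nash-Sutcliffe efficiency, and the accuracy factor. For TFA it assumes that a forecast trend is the predicted value compared with the previous day's measured value, and that the rate is taken over the day pairs that can be compared; both are readings of the text rather than confirmed definitions.

```python
import math


def evaluate(pred, obs):
    """Return MAE, MAPE, Ef, Af and TFA for paired predictions and measurements."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mae = sum(abs(p - o) for p, o in zip(pred, obs)) / n
    mape = sum(abs(p - o) / o for p, o in zip(pred, obs)) / n
    sse = sum((p - o) ** 2 for p, o in zip(pred, obs))
    variance_sum = sum((o - mean_obs) ** 2 for o in obs)
    ef = 1.0 - sse / variance_sum
    af = 10 ** (sum(abs(math.log10(p / o)) for p, o in zip(pred, obs)) / n)
    # TFA: share of days whose forecast direction of change (relative to the
    # previous day's measurement) matches the measured direction of change.
    correct = sum(
        1 for i in range(1, n)
        if (pred[i] - obs[i - 1]) * (obs[i] - obs[i - 1]) > 0
    )
    tfa = correct / (n - 1)
    return {"MAE": mae, "MAPE": mape, "Ef": ef, "Af": af, "TFA": tfa}
```

The function needs at least two days of strictly positive values whose measurements are not all identical, since it divides by the measurements and by their spread.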

The Results of the Sensitivity Experiments in Guangzhou No. 5 Middle School (Num. 2)
As described in Section 3.3, sensitivity experiments were performed to determine the key factors. The data were obtained from the Guangzhou No. 5 Middle School site. Seven groups of experiments were performed for SO2, PM10, and NO2. The first experiment (called "Group 1") was run with the model based on the selection rules described in Section 3.2. That is to say, Group 1 was run using the Basic Model. In addition to these selection rules, the second to fourth experiments added the variation trend consistency identification of rainfall (RF), wind speed (WS), and background concentration (BC), respectively, while the fifth to seventh experiments considered RF + WS, RF + BC, and WS + BC. These experiments were referred to as Group 2, Group 3, Group 4, Group 5, Group 6, and Group 7, respectively. Table 1 summarizes the results of the seven groups of sensitivity experiments. The models with the best performance were selected (termed the Selected Models).
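The seven groups are the Basic Model plus every single and pairwise combination of the three additional checks. A small sketch that enumerates them in the order given above (the function name and the use of tuples are illustrative):

```python
from itertools import combinations

# Additional checks: rainfall, wind speed, background concentration.
FILTERS = ("RF", "WS", "BC")


def experiment_groups():
    """Map group number to the checks added on top of the basic selection rule."""
    combos = [()]  # Group 1: Basic Model, no additional check
    for size in (1, 2):
        combos.extend(combinations(FILTERS, size))
    return {number: combo for number, combo in enumerate(combos, start=1)}
```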
For PM10, the values of Ef and Af of Group 6 were much closer to 1.0 compared with the other models. Compared with Group 1, the Mean Absolute Percentage Error (MAPE) of Group 6 was 4% lower (0.227), R increased by almost 14%, and TFA increased by nearly 6% (0.550). For NO2, Group 7 had the best results with an MAPE of only 0.225, an R value of 0.688, and a TFA value of 0.567. The Ef and Af of Group 7 were 0.397 and 1.271, respectively, which were much closer to 1.0 than the other experiments. Group 2 had the most ideal experimental results for SO

Errors in the Selected Models for Other Sites
The selected model for SO2, NO2, and PM10 was tested at the remaining six sites (detailed description in Section 2) in the urban district of Guangzhou, and a comparison was made between the Selected Model and the Basic Model. The results are shown in Table 2. On the whole, the Selected Model was equal to or better than the Basic Model for SO2, NO2, and PM10. For SO2, the MAPE of the Selected Model decreased from 0.417 to 0.377, the correlation increased from 0.409 to 0.477, and the TFA increased from 0.490 to 0.517. In addition, the Ef and Af were closer to 1 compared with the Basic Model. Adding the variation tendency identification of the rainfall level changes to the sample optimization rules improved the forecast accuracy of the different pollutants to different degrees at every site. For PM10, the MAPE of the Selected Model was 0.250 for the six sites, which was almost 0.10 lower than that of the Basic Model. The correlation was greater than 0.7, and the TFA increased by 24%, from 0.421 to 0.523. Adding the variation tendency identification of the rainfall level changes and the similarity identification of the background concentrations to the model resulted in an effective improvement of the forecast accuracy of PM10. Regarding NO2, adding the variation tendency identification of the wind speed changes and the similarity identification of the background concentrations did not greatly improve the forecast results. The Selected Model is useful for the six sites, and the errors of the model are acceptable for application purposes.
then developed based on the improved BP neural network. The selection sample rule consisted of three layers.
(2) To improve the basic model, identification of the variation consistency of some factors was added to the rule, and seven sets of sensitivity experiments at one of the seven sites were conducted to obtain the selected model. These experiments determined that the rainfall level variation consistency should be added to the SO2 forecast model, the rainfall level variation tendency and the background concentration similarity identification to the PM10 forecast model, and the wind speed variation identification and the background concentration similarity identification to the NO2 forecast model. The improved BP neural network was also used for data-driven computation.
(3) Evaluations at the site by comparison with the basic model from May 2011 to April 2012 showed that the selected model for PM10 displayed better forecasting performance, with MAPE values decreasing by 4% and $R^2$ values increasing from 0.53 to 0.68. The selected model for NO2 showed little improvement over the basic model, while the MAPE values of the selected model for SO2 were as high as 36.6% with $R^2$ values of 0.51.
(4) Evaluations conducted at the six other sites revealed similar performances. The MAPE values of the selected models for SO2, PM10, and NO2 were 37.7%, 25.0%, and 22.0%, respectively. These results suggest that the SO2 model may be further improved in future research, by developing a combined model or by considering the interaction of atmospheric pollutants.
$M_i^{max}$, the minimum value $M_i^{min}$ of the analysis period, and the overall average value $M_i^{adv}$. The index $i$ represents the specific meteorological factor. (b) Numerical normalization. (c) Variation analysis of the meteorological factor ($D_i$). (d) Computation of the influencing weight.
between the squared values of wind speed on the day of forecasting and the day before; $w_{p-x}$ and $w_{p-y}$ are the two wind vector components on the day of forecasting, and the remaining two symbols represent the two wind vector components on the day before the forecasting day.

(3) Identify whether the absolute difference between the background concentration in the sample data and the background concentration on the day of forecasting is within the range of the threshold value. If it is within the range, the samples are retained; otherwise, they are removed.

Figure 2. MAPE of models for Num. 2 from May 2011 to April 2012.

Table 1. Forecasting results of the seven groups of sensitivity experiments.
Note: * the Selected Model determined by the experiments.
and NO2 were observed in February, when the daily concentrations were almost the highest due to the bad weather; the BP neural network is not sensitive to extremely high or low values [33,34]. However, the MAPE values of the SO2, PM10, and NO2 models were 0.383, 0.353, and 0.290, respectively. These MAPE values are acceptable for operational forecasts.