Application Study of Comprehensive Forecasting Model Based on Entropy Weighting Method on Trend of PM2.5 Concentration in Guangzhou, China

PM2.5 is the main factor influencing haze-fog pollution in China. In this study, the trend of PM2.5 concentration was analyzed from a qualitative point of view on the basis of mathematical models and simulation. A comprehensive forecasting model (CFM) was developed following the idea of combination forecasting. The Autoregressive Integrated Moving Average (ARIMA) model, the Artificial Neural Networks (ANNs) model, and the Exponential Smoothing Method (ESM) were each used to predict the time series of PM2.5 concentration. The CFM results were obtained by combining the results of the three methods with weights from the Entropy Weighting Method. The trend of PM2.5 concentration in Guangzhou, China was then forecasted quantitatively with the CFM. The results were compared with those of the three single models, and PM2.5 concentrations for the next ten days were predicted. The CFM balanced the deviations of the single prediction methods and had better applicability. It offers a new prediction method for the field of air quality forecasting.


Introduction
With the development of industry and the consumption of fossil fuels, air quality is worsening. In recent years, haze-fog pollution has occurred frequently in many parts of China [1]. In some areas of China, the average concentration of PM2.5, the main factor influencing haze-fog, exceeds the annual mean value of 10 µg/m³ in the World Health Organization standard [2]. Haze-fog pollution has grown from a local environmental problem into a nationwide environmental disaster. In January 2013 in particular, there was prolonged haze-fog weather in the mid-eastern part of China. In the following months, the haze-fog pollution extended from the Beijing-Tianjin-Hebei region to the Yangtze River Delta region [3]. The wide spread of haze-fog caused public panic and seriously disrupted normal production and operation. Air pollution is not only a threat to public health, which affects social stability, but also a bottleneck to economic development in many places [4]. Haze-fog pollution has negative effects on the environment, climate, human health, the economy, and other aspects, including chronic, respiratory, and cardiac diseases, reduced visibility, damage to natural and agricultural systems, and traffic accidents on land, on waterways, and in the air [5,6].
The most important factor in the formation of haze-fog pollution in the atmosphere is PM2.5, that is, particulate matter with an aerodynamic diameter ≤ 2.5 μm, which can remain suspended in the air for a long time. It has been regulated in developed countries such as the USA, Australia, and some European countries [7,8]. In order to analyze atmospheric pollution in China quantitatively and protect the living environment, the new "Ambient Air Quality Standards (GB 3095-2012)" were introduced by the Ministry of Environmental Protection of China [9]. Under the new standards, sulfur dioxide (SO2), nitrogen dioxide (NO2), PM10, ozone (O3), carbon monoxide (CO), and PM2.5 were set as the six basic monitoring indicators and are released in real time. At the same time, the "Air Quality Index" (AQI) was introduced to replace the earlier "Air Pollution Index" (API). PM2.5 is the indicator newly added in these standards, and it is a key influencing factor.
Severe haze pollution and PM2.5 have attracted widespread attention from scholars. Some researchers argued that haze-fog formation was closely connected with the chemical reactions of pollutants in the planetary boundary layer and with thermal and dynamic processes in the atmospheric environment [10,11]. Liu et al. (2013) and Zhang et al. (2013) also believed that haze-fog formation might be influenced by primary pollutant emissions, anti-cyclone synoptic conditions, and the boundary layer height [12,13]. The major components of PM2.5 in the haze-fog pollution in Shanghai, China were nitrate, secondary sulfate, and organic aerosols [14]. Haze-fog pollution was extremely serious during the winter in central and eastern China, and emissions from coal combustion for heating together with stagnant meteorological conditions affected the haze-fog greatly [15,16].
Because the atmosphere is seriously polluted, studies that predict the concentrations of important atmospheric indicators and analyze air quality trends have both theoretical and practical significance. Soltani et al. (2007) developed a time-series model to forecast climatic fluctuations [17]. Autoregressive (AR) models, moving average (MA) models, and autoregressive integrated moving average (ARIMA) models have been used in air-pollution modeling to predict and analyze time series data [18,19]. However, statistical analyses of air pollutant concentrations have so far focused mainly on the prediction and analysis of common indicators such as NO2, O3, and PM10. Chelani and Devotta (2006) used the ARIMA model to forecast the NO2 concentration in Delhi, India [20]. Prybutok and Mitchell (2000) developed a neural network model for forecasting daily maximum ozone levels [21]. Stadlober et al. (2008) presented a forecasting model to analyze the performance and quality of PM10 forecasts [22]. By contrast, the new indicator PM2.5, which is the main influencing factor of haze-fog pollution in China, has not yet been forecasted and analyzed.
In this study, PM2.5 was set as the research indicator, and the time series of PM2.5 concentration were analyzed and forecasted. Three methods, namely the ARIMA model, the ANNs model, and the Exponential Smoothing Method, were used to forecast the time series of PM2.5 concentration. Their results were combined using the entropy weighting method, and the comprehensive forecasting model was developed on the basis of combination forecasting ideas. The comprehensive forecasting model was applied to predict and analyze the time series of PM2.5 concentration in Guangzhou, China quantitatively, and the trend of haze-fog pollution in Guangzhou was analyzed. The results are expected to provide a quantitative basis for the management and control of haze pollution.

ARIMA Model
The Autoregressive Integrated Moving Average model is an important time series prediction method. It was presented by Box and Jenkins in the 1970s [23]. The basic idea of the ARIMA model is as follows. The time series of the prediction object is regarded as a stochastic sequence, and this sequence is fitted with a mathematical model. Once the model is identified, future values can be predicted from the past and present values of the time series [24]. The ARIMA model can be divided into three types: (1) the autoregressive model (AR model), where p is the number of autoregressive terms; (2) the moving average model (MA model), where q is the number of moving average terms; (3) the autoregressive integrated moving average model, that is, ARIMA (p, d, q), where d is the number of times the time series is differenced to become stationary, and d is generally less than 2 in practical applications [25].
Assume that the random variable Yt is the observation value at time t (t = 1, 2, …, n). The series of Yt then constitutes a stochastic process. The ARIMA (p, d, q) model can be written as Yt ~ ARIMA (p, d, q), and its definition is as follows:

$$\varphi(B)(1 - B)^{d} Y_t = \theta(B)\varepsilon_t, \qquad \varphi(B) = 1 - \varphi_1 B - \cdots - \varphi_p B^{p}, \qquad \theta(B) = 1 - \theta_1 B - \cdots - \theta_q B^{q}$$

where εt is white noise, and εt ~ N(0, σa²); p, d and q are non-negative integers; B is the backward shift operator, and BYt = Yt-1; φ1, φ2, …, φp are the autoregressive parameters, while θ1, θ2, …, θq are the moving average parameters.
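As a worked illustration that is not taken from the original study, consider the ARIMA (1, 1, 1) case. Assuming the Box-Jenkins sign convention, in which the moving average terms enter with a minus sign, expanding the operators gives an explicit recursion for Yt:

```latex
\begin{aligned}
(1 - \varphi_1 B)(1 - B)\,Y_t &= (1 - \theta_1 B)\,\varepsilon_t \\
Y_t - (1 + \varphi_1)\,Y_{t-1} + \varphi_1 Y_{t-2} &= \varepsilon_t - \theta_1 \varepsilon_{t-1} \\
Y_t &= (1 + \varphi_1)\,Y_{t-1} - \varphi_1 Y_{t-2} + \varepsilon_t - \theta_1 \varepsilon_{t-1}
\end{aligned}
```

Setting d = 0 removes the factor (1 - B) and leaves the ARMA (p, q) form.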
The modeling processes of ARIMA model are as follows.
(1) Sample pretreatment. The ARIMA model requires the time series to be a stationary stochastic process, so the data should be tested for stationarity before modeling.
(2) Pattern recognition. After the non-stationary time series has been differenced, the key step is to determine the order of the ARIMA model. There are four methods for determining the order: (i) the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) method; (ii) the Final Prediction Error (FPE) method; (iii) the Akaike Information Criterion (AIC) method; (iv) the corrected Akaike Information Criterion (AICC) method. In this study, the ACF and PACF method was used to identify the general form of the model and to determine its order. The initial choices need to be adjusted repeatedly according to the concrete problem.
The ARIMA model can extract the characteristics and trends of the variables from the time series data and forecast future values effectively. It is a prediction method with a sound statistical basis and has the advantages of high accuracy and strong adaptive ability. It is widely applied in many fields [26,27].
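The pretreatment and order-identification steps can be illustrated with a minimal Python sketch. It is not the study's MATLAB program: the function names `difference` and `sample_acf` and the sample values are illustrative assumptions, and only the differencing operator and the sample autocorrelation function are shown.

```python
def difference(series, d=1):
    """Apply the differencing operator (1 - B) to the series d times."""
    out = list(series)
    for _ in range(d):
        out = [out[i] - out[i - 1] for i in range(1, len(out))]
    return out


def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag (requires max_lag < len(series))."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(v * v for v in dev) / n  # lag-0 autocovariance (must be non-zero)
    acf = []
    for k in range(max_lag + 1):
        ck = sum(dev[t] * dev[t - k] for t in range(k, n)) / n
        acf.append(ck / c0)
    return acf


# Hypothetical daily concentrations, used only to exercise the functions.
series = [35.0, 42.0, 51.0, 47.0, 60.0, 58.0, 66.0, 71.0]
diffed = difference(series, d=1)
print(sample_acf(diffed, max_lag=3))
```

A slowly decaying ACF on the raw series would suggest further differencing, while the lags at which the ACF and PACF of the differenced series cut off give first guesses for q and p.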

Artificial Neural Networks Model
The Artificial Neural Network model has been a popular research topic in the field of artificial intelligence since the 1980s. It simulates the way the neural networks of the human brain process information, and different network models can be constructed from different connection patterns. In recent years, research on artificial neural networks has made great progress. The model is widely used in many fields, such as pattern recognition, intelligent robots, automatic control, biology, medicine, and economics [28,29]. It has successfully solved many practical problems that are difficult to solve with conventional computing, and it shows good intelligent characteristics [30]. The artificial neural network model is generally composed of an input layer, a hidden layer, and an output layer, and its structure is shown in Figure 1. The model has good nonlinear combination characteristics and is a global approximation network. It has strong learning ability and can achieve nonlinear mapping between the input and the output [31]. An artificial neuron in the ANNs model acts as a simple processor that sums the incoming signals with appropriate weights, and its general expression is:

$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

where xi (i = 1, 2, …, n) are the input data; wi (i = 1, 2, …, n) are the weights; b is a threshold value; f is the transfer function; and y is the output result. The ANNs model can solve many problems concerning nonlinear systems, such as function approximation and system identification. The choice of transfer functions and the sample pretreatment deserve particular attention during modeling. The MATLAB neural network toolbox provides many functions for the design, training, and simulation of neural network models. Users can call these functions as needed to design and simulate a neural network model, which spares them the trouble of writing large and complex algorithms and programs.
The MATLAB neural network toolbox was utilized to develop the ANNs model in this study.
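The weighted-sum neuron and the three-layer structure described above can be sketched in a few lines of Python. This is a hedged illustration rather than the toolbox implementation: the helper names `neuron` and `forward` and all weight values are hypothetical, and training is not shown. The only MATLAB facts relied on are that `tansig` is mathematically equivalent to the hyperbolic tangent and that `purelin` is the identity.

```python
import math


def tansig(x):
    # Hyperbolic tangent sigmoid; equivalent to MATLAB's tansig.
    return math.tanh(x)


def purelin(x):
    # Linear transfer function: returns its input unchanged.
    return x


def neuron(inputs, weights, bias, transfer):
    """Weighted sum of the inputs plus a threshold, passed through a transfer function."""
    return transfer(sum(w * x for w, x in zip(weights, inputs)) + bias)


def forward(inputs, hidden_layer, output_neuron):
    """One forward pass: tansig hidden neurons feeding a single purelin output neuron."""
    hidden = [neuron(inputs, w, b, tansig) for w, b in hidden_layer]
    w_out, b_out = output_neuron
    return neuron(hidden, w_out, b_out, purelin)


# Hypothetical, untrained weights: 2 lagged inputs, 5 hidden neurons, 1 output.
hidden_layer = [([0.1 * (k + 1), -0.05 * (k + 1)], 0.0) for k in range(5)]
output_neuron = ([0.2] * 5, 0.5)
print(forward([0.6, 0.4], hidden_layer, output_neuron))
```

In practice the inputs would be scaled lagged observations and the weights would come from training, which this sketch leaves out.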

Exponential Smoothing Method
The Exponential Smoothing Method is one of the important time series forecasting methods. It has a simple principle and good applicability. It can be used not only for short-term prediction but also performs well on medium-term and long-term prediction problems. The basic idea is as follows. The average value of the first few periods is set as the initial value of the prediction. When a new observation arrives, the earliest observation is removed from the window and the new observation is added. The new prediction value is then obtained from the new observation, the previous prediction value, and the weight of the latest observation [32]. The Exponential Smoothing Method can filter out accidental changes in the time series while giving more importance to recent data.
Brown's quadratic polynomial exponential smoothing method was employed to predict the PM2.5 concentration time series in this study. This method can track non-linear trend changes well. Its equation is:

$$Y_{t+m} = a_t + b_t m + c_t m^{2}$$

where Yt+m is the prediction value at time t + m (t = 1, 2, …, n); m is the prediction step; and at, bt and ct are the parameters to be estimated from the original time series data.
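A minimal Python sketch of Brown's quadratic smoothing is given below, assuming the common textbook formulas for at, bt and ct in terms of the single, double, and triple smoothed values; the source does not list these formulas, so they are an assumption here. The function name, the smoothing constant `alpha`, and the choice of three initial periods are also hypothetical.

```python
def brown_quadratic_forecast(series, alpha, m=1, init_periods=3):
    """Forecast m steps ahead with Brown's quadratic (triple) exponential smoothing."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    head = series[:init_periods]
    start = sum(head) / len(head)  # initial value: mean of the first few periods
    s1 = s2 = s3 = start
    for y in series:
        s1 = alpha * y + (1 - alpha) * s1   # single smoothing
        s2 = alpha * s1 + (1 - alpha) * s2  # double smoothing
        s3 = alpha * s2 + (1 - alpha) * s3  # triple smoothing
    k = alpha / (2 * (1 - alpha) ** 2)
    a = 3 * s1 - 3 * s2 + s3
    b = k * ((6 - 5 * alpha) * s1 - 2 * (5 - 4 * alpha) * s2 + (4 - 3 * alpha) * s3)
    c = alpha * k * (s1 - 2 * s2 + s3)
    return a + b * m + c * m ** 2


# Hypothetical daily values, used only to show the call.
history = [48.0, 52.0, 55.0, 61.0, 58.0, 66.0, 70.0]
print(brown_quadratic_forecast(history, alpha=0.5, m=1))
```

On a constant series the forecast equals the constant, and on a long linear series it continues the line, which gives two quick sanity checks for the coefficients.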

Entropy Weighting Method
In information theory, entropy is a very important concept. Information entropy is a measure of the degree of disorder of the information in a system, and it can measure the amount of useful information in the data [33]. The basic idea of the entropy weighting method is as follows. When the data of one object show great differences, its entropy, according to information theory, is low. This shows that the object contributes much useful information, so its weight should be set high; otherwise, the weight should be set correspondingly low [34]. The entropy weighting method is an objective weighting method. In this study, it was used to weight the results of the three prediction methods. The process of determining the weights is as follows: (i) The original data of all objects should be normalized to eliminate the effects of dimension. For a benefit object, the higher its value, the greater its impact. Its equation is:

$$r_{ij} = \frac{x_{ij} - \min_{j} x_{ij}}{\max_{j} x_{ij} - \min_{j} x_{ij}}$$

For a cost object, the lower its value, the greater its impact. Its equation is:

$$r_{ij} = \frac{\max_{j} x_{ij} - x_{ij}}{\max_{j} x_{ij} - \min_{j} x_{ij}}$$

where xij (i = 1, 2, …, m, and j = 1, 2, …, n) is the observation value of the j-th object on the i-th index, and rij is the normalized, dimensionless value.
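The following Python sketch shows one common way to carry the procedure through to the weights. Only the normalization of step (i) is described in the text above; the later steps used here (shares of each normalized value, entropy scaled by ln n, weights proportional to one minus the entropy) are the usual textbook formulation and are an assumption about the study's procedure. The function names and the sample matrix are hypothetical.

```python
import math


def normalize(row, benefit=True):
    """Min-max normalization of one index across its n objects."""
    lo, hi = min(row), max(row)
    if hi == lo:
        return [1.0] * len(row)  # no variation: every object gets the same value
    if benefit:
        return [(x - lo) / (hi - lo) for x in row]
    return [(hi - x) / (hi - lo) for x in row]


def entropy_weights(matrix, benefit=True):
    """Entropy weights for m indexes (rows), each observed on n >= 2 objects (columns).

    Raises ZeroDivisionError if no row shows any variation at all.
    """
    entropies = []
    for row in matrix:
        r = normalize(row, benefit)
        total = sum(r)
        shares = [v / total for v in r]
        # Terms with a zero share contribute nothing (0 * ln 0 is taken as 0).
        h = -sum(p * math.log(p) for p in shares if p > 0) / math.log(len(row))
        entropies.append(h)
    spread = [1.0 - h for h in entropies]  # low entropy -> large weight
    return [s / sum(spread) for s in spread]


# Hypothetical data: 2 indexes observed on 3 objects.
print(entropy_weights([[1.0, 2.0, 3.0], [1.0, 1.0, 4.0]]))
```

The second row concentrates all of its normalized mass on one object, so its entropy is lower and its weight is higher, which matches the basic idea stated in the text.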

Simulation Data and Qualitative Trend Analysis
The comprehensive forecasting model was used to predict the PM2.5 concentration in the atmosphere of Guangzhou, China. Guangzhou is the capital of Guangdong Province and its political, economic, scientific and technological, educational, and cultural center. It is located in the south of Guangdong Province in southern China, at the northern margin of the Pearl River Delta. Guangzhou borders the South China Sea and has the marked characteristics of an oceanic climate. With the Tropic of Cancer crossing the north of the city, it is warm and rainy, with plenty of heat, a small temperature range, and a long summer. Guangzhou is typical of the southern coastal cities in China, so the results of this study are significant for research on haze-fog pollution in this category of cities. The original data were the time series of 24-h average values in Guangzhou from 2 December 2013 to 21 January 2015 [35], obtained from the China National Environmental Monitoring Center.
The factors that influence changes in PM2.5 concentration fall into two groups: one is the base concentration determined by the actual air quality, and the other is the impact of the external meteorological environment and random factors. With changes in sunshine, temperature, and pressure, the concentration of PM2.5 changes over time. Changes in the external environment, such as an increasing quantity of automobile exhaust and more garbage incineration, also affect the concentration of PM2.5.
The long-term trend of PM2.5 concentration over one year was investigated. This is the trend produced by fundamental factors acting over a long period. The monthly averages of the PM2.5 concentration were calculated, as shown in Figure 2. The averages were highest in winter and lowest in summer. This may mainly be caused by seasonal temperature, precipitation, and other meteorological factors. Summer precipitation is substantial, and rainwater can carry some of the particulate matter to the ground. In addition, the weather is warm in summer, so people in China do not burn coal to keep warm. Summer weather conditions, such as a high atmospheric boundary layer and frequent precipitation, are also conducive to clearing the particles. Thus PM2.5 concentrations were relatively low in summer. In winter, by contrast, temperatures in China are always low and the atmospheric pressure is high, so the concentrations of PM2.5 were generally high. Wind speed, relative humidity, and other meteorological factors also affect the concentration of PM2.5.
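The monthly averaging step can be sketched as follows. This is a minimal Python illustration that assumes the data are available as (date, value) pairs of 24-h averages; the function name `monthly_means` and the sample records are hypothetical and are not the study's data.

```python
from collections import defaultdict
from datetime import date


def monthly_means(records):
    """Average daily values by (year, month); records are (date, value) pairs."""
    sums = defaultdict(lambda: [0.0, 0])
    for day, value in records:
        key = (day.year, day.month)
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}


# Hypothetical 24-h averages, used only to show the call.
records = [
    (date(2013, 12, 2), 60.0),
    (date(2013, 12, 3), 80.0),
    (date(2014, 1, 1), 50.0),
]
print(monthly_means(records))
```

Keying on both year and month keeps December 2013 separate from December 2014, which matters for a record that spans more than twelve months.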

Comprehensive Forecasting Model
There are a variety of forecasting models and methods for time series prediction, such as regression analysis, the ARIMA model, the gray forecasting system, and ANNs. Because their modeling mechanisms and application conditions differ, each has limitations for a given prediction problem. In 1969, Bates and Granger first proposed the idea of "combination forecasting" in Operational Research Quarterly [36], which began the systematic study of the combination forecasting issue. Several forecasting methods are combined into one comprehensive prediction model. In this way, a comprehensive description of the objective system can be made, and combination forecasting models have come to be widely used.
Different prediction methods give different forecasting values. We developed mathematical models based on the ARIMA model, the ANNs model, and the Exponential Smoothing Method, and combined their predictive values at the same time point with the weights from the entropy weighting method to obtain the combination forecasting values. The combination equation was:

$$\hat{Y}_t = k_1 \hat{Y}_{1t} + k_2 \hat{Y}_{2t} + \cdots + k_n \hat{Y}_{nt}$$

where Ŷit is the value predicted by the i-th method at time t and Ŷt is the combination forecasting value; k1 + k2 +···+ kn = 1, and ki ≥ 0 (i = 1, 2, ···, n) are the weights of each prediction sequence.

Simulation Results
Based on the algorithm of the comprehensive forecasting model in Section 2.1, we implemented the models in MATLAB for the time series of PM2.5 concentration. The ARIMA model that we developed was an ARMA (2, 3) model, given as Equation (9). The prediction results are shown in Figure 3a. The ANNs model was constructed with the MATLAB toolbox. The input and output transfer functions were "tansig" (hyperbolic tangent sigmoid transfer function) and "purelin" (linear transfer function), respectively. The number of neurons in the hidden layer was selected on the principle that the sum of squared prediction errors should be smallest, and 5 neurons were finally selected. The number of training steps was set to 20,000, and the error precision was 0.001. The prediction values were then obtained, as shown in Figure 3b.
According to the formulas in Section 2.3, the Exponential Smoothing Model was established from the time series data, and the results are shown in Figure 3c.
From the entropy weighting method, the weights of the three methods were k1 = 0.2399, k2 = 0.5419 and k3 = 0.2182, respectively. The results of the three methods were combined with these weights to obtain the results of the comprehensive forecasting model, which are shown in Figure 3d.
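The weighted combination can be written as a short Python sketch. The three weights are the values reported above; the mapping of k1, k2 and k3 to ARIMA, ANNs and ESM follows the order in which the methods are presented and is an assumption, and the function name `combine` and the forecast values are hypothetical.

```python
# Weights reported in the text; assumed order: ARIMA, ANNs, ESM.
WEIGHTS = (0.2399, 0.5419, 0.2182)


def combine(forecasts, weights=WEIGHTS):
    """Weighted sum of several forecast sequences of equal length."""
    if any(k < 0 for k in weights) or abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("weights must be non-negative and sum to 1")
    if len(forecasts) != len(weights):
        raise ValueError("one weight is needed per forecast sequence")
    length = len(forecasts[0])
    if any(len(f) != length for f in forecasts):
        raise ValueError("forecast sequences must have the same length")
    return [sum(k * f[t] for k, f in zip(weights, forecasts)) for t in range(length)]


# Hypothetical single-model forecasts for two days.
arima = [52.0, 55.0]
anns = [47.0, 49.0]
esm = [58.0, 60.0]
print(combine([arima, anns, esm]))
```

Because the weights are non-negative and sum to one, each combined value always lies between the smallest and largest single-model forecast for that day.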
The four figures show that the prediction results of the methods differed and that each had its own characteristics. The results of the ARIMA model and the ESM model tracked the original time series, but they might lag behind the original data. The results of the ANNs followed the trend of the original time series and stayed near the mean of the data sequence. The prediction results of the comprehensive forecasting model were the combination of the results of the three methods. The original PM2.5 concentration data were strongly affected by the external meteorological environment and random factors, so they fluctuated greatly and often irregularly. The ANNs model excluded these irregular changes and captured the basic trend of the time series. The results of the CFM therefore followed the trend of the original data, as the ANNs results did, while fluctuating around the mean of the original data sequence in line with the results of the ARIMA and ESM models. Thus the prediction results of the comprehensive forecasting model also tracked the original time series, and its curve fluctuated with the curve of the original data.

Accuracy Test
In order to investigate the accuracy and precision of the prediction results, the results should be tested with error testing indexes. The error testing indexes included the Mean Absolute Error (MAE), Mean Percentage Error (MPE), Root Mean Square Error (RMSE), Theil inequality coefficient, bias ratio, and variance ratio. The calculation formulas and functions of the error testing indexes are shown in Table 1. The various error testing indexes were calculated, and the results are shown in Table 2.
Note: In Table 1, $y_i$ (i = 1, 2, …, n) are the actual observation values; $\hat{y}_i$ are the prediction values; $\bar{y}$ and $\bar{\hat{y}}$ are the averages of $y_i$ and $\hat{y}_i$; $s_y$ and $s_{\hat{y}}$ are the standard deviations of $y_i$ and $\hat{y}_i$.
The MAE, MPE, RMSE, and Theil inequality coefficient describe the system errors and indicate the dispersion between the prediction results and the original sequence. The smaller these four indexes are, the better the prediction results. By these four indexes in Table 2, the results of the ARIMA model were the best, and those of the CFM were slightly worse than those of the ARIMA model. The results of the ANNs and ESM were worse than those of the former two models. The bias ratio and variance ratio measure how far the average and the variance of the forecasting sequence deviate from those of the original sequence. These two indexes should also be as small as possible for good forecasting results. The bias ratio of the ARIMA model was the best, while that of the ESM was the worst; the variance ratio of the ESM was the best, while that of the CFM was the worst.
To sum up, the accuracy of the ARIMA model was the best for the historical data in this study, and the accuracy of the CFM was close to it. Both were significantly higher than the accuracy of the other two methods. This shows that one prediction method may happen to suit a particular sequence, so that its forecasting accuracy is better than that of other methods. However, the applicability of a single prediction method is often limited and not universal. We therefore combined multiple methods and developed a combination forecasting model. The combination forecasting model can balance the deviations of the single prediction methods and has better applicability, while still achieving high accuracy. Thus the overall performance of the combination forecasting model is good in practical applications.
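The error testing indexes can be computed with a short Python sketch. Table 1 is not reproduced in this text, so the definitions below are the commonly used ones and are assumptions: MPE is taken as the mean of the absolute relative errors, the Theil coefficient uses the form bounded between 0 and 1, and the bias and variance ratios are the usual proportions of the mean squared error with population standard deviations. The function name and sample values are hypothetical.

```python
import math
from statistics import mean, pstdev


def forecast_errors(actual, predicted):
    """Error indexes for two equal-length sequences; actual values must be non-zero."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    root_mean_sq_pred = math.sqrt(sum(p * p for p in predicted) / n)
    root_mean_sq_act = math.sqrt(sum(a * a for a in actual) / n)
    result = {
        "MAE": sum(abs(e) for e in errors) / n,
        "MPE": sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n,
        "RMSE": rmse,
        "Theil": rmse / (root_mean_sq_pred + root_mean_sq_act),
    }
    if mse > 0:  # the two ratios are undefined for a perfect forecast
        result["bias_ratio"] = (mean(predicted) - mean(actual)) ** 2 / mse
        result["variance_ratio"] = (pstdev(predicted) - pstdev(actual)) ** 2 / mse
    return result


# Hypothetical observations and forecasts.
print(forecast_errors([2.0, 4.0, 6.0], [3.0, 5.0, 7.0]))
```

In this example every forecast is one unit too high, so the whole error is attributed to bias and none to a difference in variance.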

Prediction of Next Ten Days
The comprehensive forecasting model was applied to predict PM2.5 concentrations in Guangzhou for the next ten days and was compared with the ARIMA model, the ANNs model, and the ESM model. The results are shown in Table 3. The actual observation values were obtained from the website "Historical data of PM2.5" [35], where PM2.5 concentrations are updated in real time. The important error testing indexes were calculated to evaluate the prediction accuracy, as shown in Table 4.
Table 3. Prediction results of the next ten days.
Table 4 shows that the MAE, MPE, RMSE, and Theil inequality coefficient of the CFM were significantly smaller than those of the other three methods. This shows that the combination forecasting values of the CFM were closer to the actual observation values. The prediction accuracy of the CFM was higher than that of the three single methods, and its results were more effective and reliable. However, the comprehensive forecasting model has some shortcomings: its workload may be heavier than that of a single prediction method.

Conclusions
Haze-fog was the most serious air pollution in 2013. The most important factor in haze-fog pollution was PM2.5, which has wide sources and a complex formation process. Describing the trend of haze-fog pollution is very important for strengthening pollution prevention and control. In this study, a comprehensive forecasting model was developed from three prediction models. The time series of PM2.5 concentration were forecasted by the ARIMA model, the ANNs model, and the ESM model, and their results were combined with weights from the entropy weighting method to obtain the combination forecasting results. The comprehensive forecasting model was applied to predict PM2.5 concentration in Guangzhou, China, and good forecasting results were obtained, with high accuracy compared with those of the three single methods. The combination forecasting model can balance the deviations of the single forecasting methods and overcome their individual applicability limitations. It offers a new prediction method for the field of air quality forecasting. This study could provide a scientific basis for the prevention and prediction of haze-fog pollution in the city, and a methodological basis for this kind of scientific research.

Author Contributions
Dong-jun Liu and Li Li developed the evaluation model. Dong-jun Liu developed the programs and performed data analysis and discussion. Li Li provided a lot of instructive suggestions. Dong-jun Liu wrote the initial manuscript draft and Li Li contributed to manuscript revision.