Day-Ahead PM2.5 Concentration Forecasting Using WT-VMD Based Decomposition Method and Back Propagation Neural Network Improved by Differential Evolution

Accurate PM2.5 concentration forecasting is crucial for protecting public health and atmospheric environment. However, the intermittent and unstable nature of PM2.5 concentration series makes its forecasting become a very difficult task. In order to improve the forecast accuracy of PM2.5 concentration, this paper proposes a hybrid model based on wavelet transform (WT), variational mode decomposition (VMD) and back propagation (BP) neural network optimized by differential evolution (DE) algorithm. Firstly, WT is employed to disassemble the PM2.5 concentration series into a number of subsets with different frequencies. Secondly, VMD is applied to decompose each subset into a set of variational modes (VMs). Thirdly, DE-BP model is utilized to forecast all the VMs. Fourthly, the forecast value of each subset is obtained through aggregating the forecast results of all the VMs obtained from VMD decomposition of this subset. Finally, the final forecast series of PM2.5 concentration is obtained by adding up the forecast values of all subsets. Two PM2.5 concentration series collected from Wuhan and Tianjin, respectively, located in China are used to test the effectiveness of the proposed model. The results demonstrate that the proposed model outperforms all the other considered models in this paper.


Introduction
Over the past few decades, with the rapid development of industrialization and urbanization, the occurrence of haze pollution episodes has become more frequent and more severe in China [1,2]. According to the statistics of China's National Development and Reform Commission, since early 2013, many areas including the north China, Huanghuai, Jianghuai, Jianghan, south of the Yangtze River and the north of southern China have suffered severe and continuous haze weather. Haze pollution brings serious adverse effects on the environment, clime, ecological systems, economy and public health, thus causes great harm to the human production and life on a global scale [3,4]. Even though the mechanism of haze formation is still not clear [5], the high level concentrations of fine particles with aerodynamic diameter of 2.5 µm or less (PM 2.5 ) was inferred as the main reason of haze pollution episodes, and thus attracted widespread public concerns [6,7]. Compared to the PM 10 (particulate matter with aerodynamic diameter below 10 µm), PM 2.5 has smaller diameter and stronger adsorption capacity of hazardous materials such as heavy metal and organic pollutants, and therefore has more 4 of 22

Wavelet Transform (WT)
WT is a powerful technique for processing the non-periodic, non-stationary and transient signals [33]. WT decomposes a time series into different components at different frequency levels: one low frequency approximation subset which shows the general trend of the signal and several high frequency detail subsets which are related to the noise and disturbance [34]. As shown in Figure 1, the process of m-level decomposition by WT for time series S(t) can be defined as follows: Compared to the original signal, these subsets generated by WT usually have some better behaviors such as more stable variance and fewer outliers, which facilitates the prediction task and therefore improves the overall prediction accuracy [35]. WTs can be divided into the following two categories: (1) continuous wavelet transform (CWT); and (2) discrete wavelet transform (DWT).
The CWT of a signal f (t) is defined as follows: where a and b are the parameters of scale and translation, respectively; * represents the complex conjugate; and the mother wavelet ψ(t) denotes the transforming function. The DWT of a signal f (t) is defined as follows: where the integer m is the scale factor (decomposition level), the integer n is the sampling time, T is the length of signal f (t), and t is the discrete time index.

Wavelet Transform (WT)
WT is a powerful technique for processing the non-periodic, non-stationary and transient signals [33]. WT decomposes a time series into different components at different frequency levels: one low frequency approximation subset which shows the general trend of the signal and several high frequency detail subsets which are related to the noise and disturbance [34]. As shown in Figure 1, the process of m -level decomposition by WT for time series ( ) S t can be defined as follows: Compared to the original signal, these subsets generated by WT usually have some better behaviors such as more stable variance and fewer outliers, which facilitates the prediction task and therefore improves the overall prediction accuracy [35]. WTs can be divided into the following two categories: (1) continuous wavelet transform (CWT); and (2) discrete wavelet transform (DWT).
The CWT of a signal ( ) f t is defined as follows: where a and b are the parameters of scale and translation, respectively; * represents the complex conjugate; and the mother wavelet where the integer m is the scale factor (decomposition level), the integer n is the sampling time, T is the length of signal ( ) f t , and t is the discrete time index. The number of decomposition levels and selection of the mother wavelet have considerable effects on the characteristics of subsets and thus influence significantly the overall prediction error. In the decomposition process of WT, more levels will result in more stationary subsets; however, large number of levels might cause decomposition information loss and thus low prediction accuracy [34]. Based on the above considerations, this paper adopts a three-level DWT with mother wavelet of the Daubechies wavelet of order 4 (Db4), which has the ability of providing a balance between wavelength and smoothness [35].

Variational Mode Decomposition (VMD)
VMD is an effective signal decomposition method proposed by Dragomiretskiy and Zosso in 2014 [36]. VMD can decompose a real valued signal into a discrete set of band-limited modes The number of decomposition levels and selection of the mother wavelet have considerable effects on the characteristics of subsets and thus influence significantly the overall prediction error. In the decomposition process of WT, more levels will result in more stationary subsets; however, large number of levels might cause decomposition information loss and thus low prediction accuracy [34]. Based on the above considerations, this paper adopts a three-level DWT with mother wavelet of the Daubechies wavelet of order 4 (Db4), which has the ability of providing a balance between wavelength and smoothness [35].

Variational Mode Decomposition (VMD)
VMD is an effective signal decomposition method proposed by Dragomiretskiy and Zosso in 2014 [36]. VMD can decompose a real valued signal into a discrete set of band-limited modes (denoted by y k ) which have specific sparsity properties when producing main signal. It is assumed that each mode y k generated by VMD can be compressed around a center pulsation ω k which is determined along with the decomposition process. In order to obtain the bandwidth of each mode, the following procedures should be accomplished: (1) for each mode µ k , compute the associated analytic signal with the benefit of Hilbert transform to obtain a unilateral frequency spectrum; (2) mix with an exponential tuned to the respective estimated center frequency in order to shift the mode's frequency spectrum to baseband; and (3) estimate the bandwidth of each model through Gaussian smoothness of the demodulated signal. Then, the constrained variational problem can be provided as follows: where f is the original signal, µ is its mode, ω is the frequency, δ is the Dirac distribution, t is time script, k is the number of modes, and * denotes convolution. Recall that, in the VMD framework, the original signal f is decomposed into a set of modes denoted µ (see Equation (5)) each having a bandwidth in Fourier domain (see Equation (4)) and compacted around a center pulsation ω k . The solution to the original minimization problem (see Equation (4)) is the saddle point of the following augmented Lagrangian (L) expression: where λ is the Lagrange multiplier and α represents the balancing parameter of the data-fidelity constraint. Consequently, the solutions for u and ω can be obtained based on the following two equations: where n is the number of iterations.

Back Propagation (BP) Neural Network
Artificial neural networks (ANNs) include a family of intelligent models that mimic the biological neural networks. The BP neural network including one or more hidden layers is one of the ANN models, which has a relative simple structure and thus can be realized easily. Since the distinguish performance of the BP neural network, it has been popularly used in many practical fields such as wind speed forecasting [37], plastic injection molding [38], natural gas load forecasting [39] and so forth. The BP neural network used in this study has a three-layer network consisting of an input layer, a hidden layer, and an output layer (see Figure 2). The BP neural network distinguishes itself by the presence of hidden layers whose computation nodes are correspondingly called hidden neurons. The function of hidden neurons is to connect the input and the network output. Given a training set of input-output data, the most common learning rule for multi-layer perceptron (MLP) neural networks is the back-propagation algorithm which involves two following phases: the first one is a feed-forward phase in which the external input information at the input nodes is propagated forward to compute the output information signal at the output unit; the second one is a backward phase in which modifications to the connection weights are made based on the differences between the computed and observed information signals at the output units. In this study, a tangent sigmoid function is used as the neuron transfer function. multi-layer perceptron (MLP) neural networks is the back-propagation algorithm which involves two following phases: the first one is a feed-forward phase in which the external input information at the input nodes is propagated forward to compute the output information signal at the output unit; the second one is a backward phase in which modifications to the connection weights are made based on the differences between the computed and observed information signals at the output units. In this study, a tangent sigmoid function is used as the neuron transfer function. After determination of the network topology and initialization of the associated network parameters, the BP neural network should be trained and tested through the following three steps: Step 1: Calculate the output of the jth node in the hidden layer using the following equation.
where i is the index of neuron in the input layer, n is the number of neurons in the input layer, ij ω is the connection weights between input layer and hidden layer, i x is the ith input value, j a is threshold value, and j H and f represent the output of hidden layer and the incentive function of neurons, respectively.
Step 2: Calculate the fitted value or forecasting value of the kth node in the output layer using the following equation.
where jk ω is the connection weights between the output layer and hidden layer, k b is threshold value, and m is the number of neurons in the output layer.
Step 3: Calculate the fitting error k e based on the fitted value and expected output, and update the weight factor and threshold value by the following formula. After determination of the network topology and initialization of the associated network parameters, the BP neural network should be trained and tested through the following three steps: Step 1: Calculate the output of the jth node in the hidden layer using the following equation.
where i is the index of neuron in the input layer, n is the number of neurons in the input layer, ω ij is the connection weights between input layer and hidden layer, x i is the ith input value, a j is threshold value, and H j and f represent the output of hidden layer and the incentive function of neurons, respectively. Step 2: Calculate the fitted value or forecasting value of the kth node in the output layer using the following equation.
where ω jk is the connection weights between the output layer and hidden layer, b k is threshold value, and m is the number of neurons in the output layer.
Step 3: Calculate the fitting error e k based on the fitted value and expected output, and update the weight factor and threshold value by the following formula.
The training process of the BP neural network is stopped when one of the following two conditions is satisfied: (1) the maximum number of iterations is reached; and (2) the fitting accuracy meets the requirement.

Differential Evolution (DE) Algorithm
DE algorithm proposed by Storn and Price in 1997 is a stochastic, population-based and direct search algorithm which has the characteristics of simple structure, less control parameters, fast convergence, and strong robustness, and therefore has significant advantages for dealing with the non differentiable, nonlinear and multimodal functions [40]. As shown in Figure 3, the standard DE algorithm consists of four main operations: initialization, mutation, crossover and selection. 1,2,..., k m = (14) where η denotes the learning rate, ( ) x i is the ith input value.
The training process of the BP neural network is stopped when one of the following two conditions is satisfied: (1) the maximum number of iterations is reached; and (2) the fitting accuracy meets the requirement.

Differential Evolution (DE) Algorithm
DE algorithm proposed by Storn and Price in 1997 is a stochastic, population-based and direct search algorithm which has the characteristics of simple structure, less control parameters, fast convergence, and strong robustness, and therefore has significant advantages for dealing with the non differentiable, nonlinear and multimodal functions [40]. As shown in Figure 3, the standard DE algorithm consists of four main operations: initialization, mutation, crossover and selection. The basic steps of DE algorithm are illustrated as follows: Step 1: Population initialization. Initializing population of DE algorithm based on the following formula.
( ) ( ) , , ,min , ,max ,min where , , where k m i j ≠ ≠ ≠ , F is a scaling factor and is the base vector.
Step 3: Crossover. Crossover operation is introduced into DE algorithm in order to improve the multiplicity of the perturbed parameter vectors. The trial point where R C is crossover probability and The trial vector is formed of both current parameter vectors and mutant vector parameters (see formula (17)).
Step 4: Selection. The trial vector , 1 i G X + can be obtained by comparing the fitness value of the vector obtained through mutation and crossover, and the process can be denoted as follows: The basic steps of DE algorithm are illustrated as follows: Step 1: Population initialization.
Initializing population of DE algorithm based on the following formula. x where x j,i,o denotes the value of ith individual in the 0th generation and jth dimension.
Step 2: Mutation. Based on the randomly selected three indices, m, i and j, m = i = j, a mutant vector V k,G is generated based on the following formula.
where k = m = i = j, F is a scaling factor and F ∈ [0, 2], X m,G is the base vector.
Step 3: Crossover. Crossover operation is introduced into DE algorithm in order to improve the multiplicity of the perturbed parameter vectors. The trial point U j,k,G+1 is established from its parents V j,k,G+1 and X j,k,G by the following formula.
where C R is crossover probability and C R ∈ [0, 1], rnbr(i) is a randomly selected index in the set of {1, 2, 3, . . . , D}, which ensures that U j,k,G+1 obtains at least one parameter from V j,k,G+1 . The trial vector is formed of both current parameter vectors and mutant vector parameters (see formula (17)).
Step 4: Selection. The trial vector X i,G+1 can be obtained by comparing the fitness value of the vector obtained through mutation and crossover, and the process can be denoted as follows: Step 5: Iterative computing and stop the DE algorithm if the result satisfies the error requirement or the maximum number of iterations is reached. Otherwise, return to Step 2.

The DE-BP Model
In the BP neural network, the two kinds of training parameters of weight matrices (ω ij and ω jk ) and thresholds (a j and b k ) have significant influences on the prediction accuracy. In order to improve the function approximation ability of the BP neural network, especially on the catastrophe points, in this study, DE algorithm is utilized to optimize the weight matrices and thresholds, see Figure 4. The fitness function of DE algorithm used in this study is the RMSE of forecast results, and is defined as follows: whereX(t) denotes the forecast value at time t, X(t) represents the actual value at time t, and N is the total number of data. The individual owning the minimal fitness value is the global best point, which can be used to determine the parameters of the BP neural network. The steps of DE-BP model are described as follows: Step 1: Initialization. Determine the network topology of the network and initialize the parameters of DE algorithm including population size, maximum iteration number, probabilities of mutation and crossover operators. The initial population is generated using Equation (15).
Step 2: Calculate the fitness value of each individual using Equation (19). The DE algorithm is stopped when the stop criterion is satisfied, and go to Step 4.
Step 3: Update the population of DE algorithm based on mutation, crossover and selection operators.
Go to Step 2.
Step 4: The optimal individual obtained from DE algorithm is adopted as the initial connection weights and thresholds of the BP neural network.
Step 5: Train and test the BP neural network based on the training and testing samples.

Hybrid WT-VMD-DE-BP Forecasting Model
In this section, the proposed WT-VMD-DE-BP model is established for daily PM2.5 concentration forecasting. As shown in Figure 5, the basic structure of the hybrid forecasting method includes the following five steps: Step 1: First decomposition. The WT decomposition technique is utilized to decompose the PM2. 5 concentration series into one low frequency approximation subset and several high frequency detail subsets.
Step 2: Second decomposition. In order to increase the forecasting accuracy, the VMD technique is further employed to conduct the secondary decomposition of each subset generated by WT, and consequently a number of VMs are obtained.
Step 3: Individual forecasting. Each VM generated by VMD is forecasted using DE-BP model.
Step 4: First summation. The forecast value of each subset generated by WT is obtained by adding up all the forecast values of VMs generated by VMD decomposition of this subset.
Step 5: Second summation. The forecast series of PM2.5 concentration is obtained by aggregating the forecast result of each subset.

Hybrid WT-VMD-DE-BP Forecasting Model
In this section, the proposed WT-VMD-DE-BP model is established for daily PM 2.5 concentration forecasting. As shown in Figure 5, the basic structure of the hybrid forecasting method includes the following five steps: Step 1: First decomposition. The WT decomposition technique is utilized to decompose the PM 2. 5 concentration series into one low frequency approximation subset and several high frequency detail subsets.
Step 2: Second decomposition. In order to increase the forecasting accuracy, the VMD technique is further employed to conduct the secondary decomposition of each subset generated by WT, and consequently a number of VMs are obtained.
Step 3: Individual forecasting. Each VM generated by VMD is forecasted using DE-BP model.
Step 4: First summation. The forecast value of each subset generated by WT is obtained by adding up all the forecast values of VMs generated by VMD decomposition of this subset.
Step 5: Second summation. The forecast series of PM 2.5 concentration is obtained by aggregating the forecast result of each subset.
Forecasting each VM using BP neural network optimized by differential evolution algorithm

Summation of forecast values of di and an
The ultimate forecast result . . .

Study Area and Data Description
In this paper, two PM2.5 concentration series respectively collected from Wuhan and Tianjin located in China are adopted for testing the validity of the proposed model. The specific locations of the two study areas are briefly depicted in Figure 6. Wuhan, situated in the middle-lower Yangtze Plain and the eastern part of Jianghan Plain (30° N and 114° E), has been regarded as China's important industrial base, integrated transportation hub, and science and education base. The Yangtze River, which is the third longest river in the world, and the largest tributary of the Han River meet at this city, making Wuhan become a very important inland river port. Wuhan has a sub-tropical monsoon humid climate with abundant rainfall, abundant sunshine and four distinct seasons. Tianjin, the largest coastal city in northern China, is located along the west coast of the Bohai Gulf (39° N and 117° E). Tianjin has become a new growth pole and a hub of advanced industry and financial activity in China. Tianjin has a sub-humid warm temperate monsoon climate that characterized by significant winds and four distinct seasons. With rapid development of urbanization in the past several decades, both Wuhan and Tianjin become two megalopolises with a population of more than ten million people. Simultaneously, because of the development of industrialization and increase of motor vehicles, the occurrence of haze weather in these two

Study Area and Data Description
In this paper, two PM 2.5 concentration series respectively collected from Wuhan and Tianjin located in China are adopted for testing the validity of the proposed model. The specific locations of the two study areas are briefly depicted in Figure 6. Wuhan, situated in the middle-lower Yangtze Plain and the eastern part of Jianghan Plain (30 • N and 114 • E), has been regarded as China's important industrial base, integrated transportation hub, and science and education base. The Yangtze River, which is the third longest river in the world, and the largest tributary of the Han River meet at this city, making Wuhan become a very important inland river port. Wuhan has a sub-tropical monsoon humid climate with abundant rainfall, abundant sunshine and four distinct seasons. Tianjin, the largest coastal city in northern China, is located along the west coast of the Bohai Gulf (39 • N and 117 • E). Tianjin has become a new growth pole and a hub of advanced industry and financial activity in China. Tianjin has a sub-humid warm temperate monsoon climate that characterized by significant winds and four distinct seasons. With rapid development of urbanization in the past several decades, both Wuhan and Tianjin become two megalopolises with a population of more than ten million people. Simultaneously, because of the development of industrialization and increase of motor vehicles, the occurrence of haze weather in these two megalopolises becomes more frequent and more severe, which makes it an urgent need for researchers and relevant government departments to simulate and forecast the PM 2.5 concentration in order to protect public health and atmospheric environment. megalopolises becomes more frequent and more severe, which makes it an urgent need for researchers and relevant government departments to simulate and forecast the PM2.5 concentration in order to protect public health and atmospheric environment. In this paper, the two original daily PM2.5 concentration series from 1 January 2014 to 30 June 2016 with a total of 912 observations in Wuhan and Tianjin are collected from China's online air quality monitoring and analysis platform (http://www.aqistudy.cn/), as shown in Figure 7. In Figure  7, it can be seen that the two PM2.5 concentration series share some common features, for example, both Wuhan and Tianjin have high level of PM2.5 concentration in winter (roughly between November and February of each year). However, since Wuhan and Tianjin have different geographical positions, climatic characteristics and industrial structures, the PM2.5 concentration series associated with the two megalopolises appear to be different, and therefore are suitable for testing the effectiveness and practicability of the proposed forecasting model. Specifically, in each PM2.5 concentration series, the 1st-882nd data (1 January 2014-31 May 2016) and 883rd-912th data (1 June 2016-30 June 2016) are adopted as the training and testing samples, respectively. This study selects four accuracy measures presented in Section 3.2 in order to evaluate the performance of the proposed forecasting model. In addition, it should be noted that all considered models adopted in this paper are coded in MATLAB R2010a. Figure 6. Geographical locations of the study areas.

Tianjin Wuhan
In this paper, the two original daily PM 2.5 concentration series from 1 January 2014 to 30 June 2016 with a total of 912 observations in Wuhan and Tianjin are collected from China's online air quality monitoring and analysis platform (http://www.aqistudy.cn/), as shown in Figure 7. In Figure 7, it can be seen that the two PM 2.5 concentration series share some common features, for example, both Wuhan and Tianjin have high level of PM 2.5 concentration in winter (roughly between November and February of each year). However, since Wuhan and Tianjin have different geographical positions, climatic characteristics and industrial structures, the PM 2.5 concentration series associated with the two megalopolises appear to be different, and therefore are suitable for testing the effectiveness and practicability of the proposed forecasting model. Specifically, in each PM 2.5 concentration series, the 1st-882nd data (1 January 2014-31 May 2016) and 883rd-912th data (1 June 2016-30 June 2016) are adopted as the training and testing samples, respectively. This study selects four accuracy measures presented in Section 3.2 in order to evaluate the performance of the proposed forecasting model. In addition, it should be noted that all considered models adopted in this paper are coded in MATLAB R2010a.

Performance Criteria of Forecasting Accuracy
This study adopts the following four error metrics to testify the effectiveness and practicability of the proposed forecasting model: mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE) and Theil's inequality coefficient (TIC

Performance Criteria of Forecasting Accuracy
This study adopts the following four error metrics to testify the effectiveness and practicability of the proposed forecasting model: mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE) and Theil's inequality coefficient (TIC). The performance measures of MAE, RMSE and MAPE are utilized to quantify the errors of forecast values, and the smaller they are, the better the prediction accuracy is. TIC is employed to evaluate the predictive capability of different forecasting models, and the smaller it is, the better the forecasting capability that the model has.
The computational formulas of these four performance measures are provided as follows: where n is the number of observed PM 2.5 concentration values; andx(t) and x(t) are the forecast and observed values of PM 2.5 concentration at time t, respectively.

Analysis of Decomposition Results
The multiple frequency components in PM 2.5 concentration series are always the challenging parts in forecasting, making the models which work on the original time series cannot handle them appropriately. In order to improve the forecast accuracy, in this study, WT is firstly employed to divide the PM 2.5 concentration series collected from Wuhan into four components including one low frequency approximation subset and three high frequency detail subsets (see Figure 8). The four components are denoted respectively as d 1 , d 2 , d 3 and a 3 , where a 3 is the low frequency approximation subset which illustrates the general trend of the PM 2.5 concentration series, and d 1 ,d 2 , and d 3 indicate the high frequency detail subsets, which are related to the noise and disturbance.  After decomposition of original PM2.5 concentration series by WT, DE-BP model is utilized to forecast all the subsets. There is no doubt that the PM2.5 concentrations of previous several days have a great influence on the latter ones. Therefore, in this study, a certain number of previous PM2.5 concentration data are taken as the input of DE-BP model for forecasting the latter one. After several simulations and predictions, the optimal length of predicted series is set as eight in DE-BP model in order to obtain the higher accuracy. In DE algorithm, the parameter settings are listed as follows:  After decomposition of original PM 2.5 concentration series by WT, DE-BP model is utilized to forecast all the subsets. There is no doubt that the PM 2.5 concentrations of previous several days have a great influence on the latter ones. Therefore, in this study, a certain number of previous PM 2.5 concentration data are taken as the input of DE-BP model for forecasting the latter one. After several simulations and predictions, the optimal length of predicted series is set as eight in DE-BP model in order to obtain the higher accuracy. In DE algorithm, the parameter settings are listed as follows: population size: N p = 100, scaling factor: F = 0.5, crossover probability: C R = 0.5, max iterations:G M = 100. The above parameter settings and input determination method are used in all tests throughout the paper in order to ensure fair and valid comparisons between different forecasting models.
Based on the above parameter settings, each subset is forecasted using DE-BP model, and the forecast results are illustrated in Figure 9. It is obvious that all the four subsets cannot be forecasted with high accuracy, especially the subsets d 1 , d 2 , and d 3 , which are related to the noise and disturbance. Therefore, it can be concluded that the single decomposition process by WT cannot effectively extract the multiple frequency components existed in the PM 2.5 concentration series, and therefore leads to a relatively inferior forecasting performance. After decomposition of original PM2.5 concentration series by WT, DE-BP model is utilized to forecast all the subsets. There is no doubt that the PM2.5 concentrations of previous several days have a great influence on the latter ones. Therefore, in this study, a certain number of previous PM2.5 concentration data are taken as the input of DE-BP model for forecasting the latter one. After several simulations and predictions, the optimal length of predicted series is set as eight in DE-BP model in order to obtain the higher accuracy. In DE algorithm, the parameter settings are listed as follows: population size: The above parameter settings and input determination method are used in all tests throughout the paper in order to ensure fair and valid comparisons between different forecasting models. Based on the above parameter settings, each subset is forecasted using DE-BP model, and the forecast results are illustrated in Figure 9. It is obvious that all the four subsets cannot be forecasted with high accuracy, especially the subsets 1 d , 2 d , and 3 d , which are related to the noise and disturbance. Therefore, it can be concluded that the single decomposition process by WT cannot effectively extract the multiple frequency components existed in the PM2.5 concentration series, and therefore leads to a relatively inferior forecasting performance.  Therefore, in order to solve the drawback of WT and further improve the forecast accuracy, VMD is further applied to conduct the secondary decomposition of each subset generated by WT. In this study, each subset is decomposed into eight VMs, and the decomposition results of d 1 , d 2 , d 3 and a 3 by VMD are illustrated in Figure 10. Then, DE-BP model is employed to forecast all VMs based on the rolling technology. Next, the forecasting value of each subset is obtained by aggregation of the forecast values of all the VMs generated by VMD decomposition of this subset. Finally, the ultimate forecast result of PM 2.5 concentration series can be obtained by adding up the forecast values of each subset. The ultimate forecast results and the corresponding MAE, RMSE, MAPE and TIC of WT-VMD-DE-BP model for all the four subsets are presented in Figure 11 and Table 1. From Figure 11 and Table 1, it is obvious that after secondary decomposition by VMD, the forecast accuracies of all the four subsets are significantly improved, which confirms the effectiveness of the hybrid decomposition technology proposed in this study.
Finally, the ultimate forecast result of PM2.5 concentration series can be obtained by adding up the forecast values of each subset. The ultimate forecast results and the corresponding MAE, RMSE, MAPE and TIC of WT-VMD-DE-BP model for all the four subsets are presented in Figure 11 and Table 1. From Figure 11 and Table 1, it is obvious that after secondary decomposition by VMD, the forecast accuracies of all the four subsets are significantly improved, which confirms the effectiveness of the hybrid decomposition technology proposed in this study.

Results and Discussions
In this section, to verify the superiority of the proposed WT-VMD-DE-BP model in forecasting capability, forecasting models of BP, DE-BP, WT-DE-BP, VMD-DE-BP and WT-VMD-DE-BP are adopted as the benchmark models. Four error measurements, MAE, RMSE, MAPE and TIC, are employed for evaluating the performance of all the forecasting models.
The forecast results of all considered models are shown in Figure 12, and the forecast errors

Results and Discussions
In this section, to verify the superiority of the proposed WT-VMD-DE-BP model in forecasting capability, forecasting models of BP, DE-BP, WT-DE-BP, VMD-DE-BP and WT-VMD-DE-BP are adopted as the benchmark models. Four error measurements, MAE, RMSE, MAPE and TIC, are employed for evaluating the performance of all the forecasting models.
The forecast results of all considered models are shown in Figure 12, and the forecast errors including MAE, RMSE, MAPE and TIC of the proposed model and benchmark models are presented in Table 2 where the smallest value of each row is marked in boldface. As shown in Table 2, the error values of MAE, RMSE, MAPE and TIC of the proposed model are all smallest compared with all the other benchmark models, which confirms that the proposed hybrid WT-VMD-DE-BP model based on WT-VMD decomposition technique has the best forecasting performance. In order to present the comparison more intuitively, the error figures MAE, RMSE, MAPE and TIC of different models are also provided in Figure 13.   Table 3. Based on the results listed in Table 3, the following three categories of findings can be obtained. 16   In order to further analyze the effects of the decomposition technique and DE optimization algorithm on the proposed model, the following three categories of comparisons are conducted in this experiment. The first category of comparison (Comparison I), which is designed for testing the positive effects of single decomposition techniques, is conducted between the forecasting models embedded  Table 3. Based on the results listed in Table 3, the following three categories of findings can be obtained.   Table 3. Based on the results listed in Table 3, the following three categories of findings can be obtained.      In Table 3, it is obvious that the values of MAE, RMSE, MAPE and TIC of DE-BP model have been reduced by 50.12%, 51.08%, 56.67% and 50.00%, respectively, via integrating the WT decomposition technique into DE-BP model, and have been decreased by 44.21%, 45.04%, 44.02% and 42.86%, respectively, through combining the VMD decomposition technique into DE-BP model. Based on the above comparison results, it can be concluded that through decomposing the PM 2.5 concentration series into a set of subsets with different frequencies, the single decomposition technique (WT and VMD) can decrease the characteristics of non-linearity and non-stability existed in the original PM 2.5 concentration series to some extent, and thus is benefit for improving the forecasting ability of DE-BP model. In Table 2, it can be found that the values of MAE, RMSE, MAPE and TIC of WT-VMD-DE-BP model decrease by 66.91%, 62.94%, 57.01% and 57.14%, respectively, compared with those of WT-DE-BP model, and 70.42%, 67.69%, 66.72% and 62.50%, respectively, compared with those of VMD-DE-BP model. Therefore, based on the above analysis, it can be easily found that the proposed WT-VMD-DE-BP model can significantly decrease the errors including MAE, RMSE, MAPE and TIC of WT-DE-BP and VMD-DE-BP models. Thus, it can be concluded that the hybrid WT-VMD decomposition technique proposed in this paper is very effective for improving the forecast accuracy. The reason lies in that the single decomposition techniques (WT and VMD) have the drawback of mode mixing problem with different levels, which makes the multiple frequency components existed in the PM 2.5 concentration series cannot be effectively extracted, and therefore leads to an inferior forecasting performance.

(3) Findings of Comparison III (DE-BP vs. BP)
In Table 2, it is obvious that the values of MAE, RMSE, MAPE and TIC of BP model has been reduced by 15.50%, 13.69%, 19.14% and 14.29%, respectively, via integrating the DE algorithm into BP model. Thus, it can be concluded that through optimizing weight matrices and thresholds using DE algorithm, the BP model obtains stronger approximation ability. In addition, it can also be seen that the DE algorithm cannot effectively decrease the forecast errors without decomposition techniques (WT and VMD), which confirms that the multiple frequency components existed in the PM 2.5 concentration series have remarkable influence on the forecast accuracy.

PM 2.5 Concentration Forecasting in Tianjin
In order to further systematically and comprehensively testify the validity and applicability of the proposed WT-VMD-DE-BP model, the PM 2.5 concentration series collected in Tianjin (see Figure 7) is also taken as another study case. Similar to the case in Wuhan, the decomposition results of original PM 2.5 concentration series using WT decomposition method are shown in Figure 14. For each subset, the forecast result of WT-DE-BP model is shown in Figure 15. The decomposition results of each subset based on VMD decomposition method are depicted in Figure 16. In addition, for each subset, the forecast result of WT-VMD-DE-BP model is provided in Figure 17. Finally, the ultimate PM 2.5 concentration forecast results of different models are illustrated in Figure 18. The forecast errors, MAE, RMSE, MAPE and TIC, of all the forecasting models are also calculated and displayed, respectively, in Tables 4 and 5 and Figure 19.

Conclusions
Accurate PM2.5 concentration forecasting is crucial for risk-analysis and decision-making in environmental protection departments. However, the multiple frequency components that exist in PM2.5 concentration series are always the challenging parts in forecasting, making models that work on the original time series unable to handle them appropriately. Thus, many researchers have been making efforts to solve this problem using different data decomposition techniques such as WT and VMD before forecasting. Since all single decomposition techniques have the drawback of mode mixing problem with different levels, which makes the multiple frequency components that exist in the PM2.5 concentration series unable to be effectively extracted, and consequently leading to an inferior forecasting performance. Therefore, in order to solve the mode mixing problem existed in the single decomposition technique, this paper, through combing the advantages of WT and VMD, proposes a novel hybrid WT-VMD decomposition technique, and then established a forecasting model based on WT-VMD and DE-BP model to improve the forecast accuracy of PM2.5 concentration.
In order to demonstrate the effectiveness and applicability of the proposed model, two PM2.5 concentration series collected from Wuhan and Tianjin located in China are taken for conducting the empirical study. Based on the experimental results, four main conclusions can be obtained as follows: (1) The proposed WT-VMD-DE-BP model owns the best performance compared with all the other considered benchmark models including BP, DE-BP, WT-DE-BP and VMD-DE-BP, which demonstrates that the proposed model is highly suitable for the non-stationary PM2.5 concentration 16

Conclusions
Accurate PM2.5 concentration forecasting is crucial for risk-analysis and decision-making in environmental protection departments. However, the multiple frequency components that exist in PM2.5 concentration series are always the challenging parts in forecasting, making models that work on the original time series unable to handle them appropriately. Thus, many researchers have been making efforts to solve this problem using different data decomposition techniques such as WT and 16

Conclusions
Accurate PM 2.5 concentration forecasting is crucial for risk-analysis and decision-making in environmental protection departments. However, the multiple frequency components that exist in PM 2.5 concentration series are always the challenging parts in forecasting, making models that work on the original time series unable to handle them appropriately. Thus, many researchers have been making efforts to solve this problem using different data decomposition techniques such as WT and VMD before forecasting. Since all single decomposition techniques have the drawback of mode mixing problem with different levels, which makes the multiple frequency components that exist in the PM 2.5 concentration series unable to be effectively extracted, and consequently leading to an inferior forecasting performance. Therefore, in order to solve the mode mixing problem existed in the single decomposition technique, this paper, through combing the advantages of WT and VMD, proposes a novel hybrid WT-VMD decomposition technique, and then established a forecasting model based on WT-VMD and DE-BP model to improve the forecast accuracy of PM 2.5 concentration.
In order to demonstrate the effectiveness and applicability of the proposed model, two PM 2.5 concentration series collected from Wuhan and Tianjin located in China are taken for conducting the empirical study. Based on the experimental results, four main conclusions can be obtained as follows: (1) The proposed WT-VMD-DE-BP model owns the best performance compared with all the other considered benchmark models including BP, DE-BP, WT-DE-BP and VMD-DE-BP, which demonstrates that the proposed model is highly suitable for the non-stationary PM 2.5 concentration forecasting; (2) The single decomposition techniques of WT and VMD cannot improve the forecasting ability of DE-BP model significantly due to the drawback of mode mixing problemwith different levels existed in WT and VMD; (3) The hybrid WT-VMD decomposition technique performs better than the single decomposition methods of WT and VMD in extracting the multiple frequency components that exist in the PM 2.5 concentration series, thus leading to a good forecasting performance; (4) DE algorithm has a positive effect on the BP model by optimizing the weights and thresholds between input layer and hidden layer.
However, as mentioned above, the intermittent and unstable nature of PM 2.5 concentration series makes its forecasting become a very difficult task. Therefore, there are still several research directions left for the future. For example, some meteorological factors such as atmospheric pressure, temperature, and precipitation may be integrated into the forecasting model to improve the forecast accuracy. Furthermore, since the PM 2.5 concentration series have some similar characteristics as other time series such as non-linearity and non-stability, the proposed model in this study can also be used for other complex time series forecasting, such as forecasting of electricity load, wind speed and stock price.