MultiStep Ahead Forecasting for Hourly PM10 and PM2.5 Based on Two-Stage Decomposition Embedded Sample Entropy and Group Teacher Optimization Algorithm

Abstract: The randomness, nonstationarity and irregularity of air-pollutant data make forecasting difficult. To improve forecast accuracy, we propose a novel hybrid approach based on two-stage decomposition embedded with sample entropy, the group teaching optimization algorithm (GTOA), and the extreme learning machine (ELM) to forecast the concentrations of particulate matter (PM10 and PM2.5). First, the improved complete ensemble empirical mode decomposition with adaptive noise (ICEEMDAN) is employed to decompose the PM10 and PM2.5 concentration data into a set of intrinsic mode functions (IMFs) with different frequencies. In addition, the wavelet transform (WT) is utilized to further decompose the high-frequency IMFs, identified by their sample entropy values. The GTOA is then used to optimize the ELM, and the resulting GTOA-ELM is applied to predict all the subseries. The final forecast is obtained by aggregating the forecasts of all subseries. To further demonstrate the predictive performance of the hybrid approach on air pollutants, hourly concentration data of PM2.5 and PM10 are used to make one-step-, two-step-, and three-step-ahead predictions. The empirical results demonstrate that the hybrid ICEEMDAN-WT-GTOA-ELM approach has superior forecasting performance and stability over other methods. This novel method also provides an effective and efficient approach to forecasting nonlinear, nonstationary and irregular data.


Introduction
Predicting future events from noisy and nonstationary time series data is a challenging problem, found in a wide range of research areas including finance, the biological sciences, environment, ecology, and engineering [1][2][3][4][5]. Although a number of statistical inference methods and machine learning algorithms have been proposed, it remains a substantial challenge to produce accurate and reliable forecasts for time series with high volatility.
The forecasting of atmospheric pollution is an important instance of noisy, nonstationary time series analysis. In recent years, atmospheric pollution has become one of the most pressing issues worldwide [6,7]. Particulate matter with an aerodynamic diameter of at most 2.5 or 10 microns (PM2.5 and PM10) is a major air pollutant that significantly affects human health and daily activities [8]. Therefore, a precise short-term forecasting model should be developed to enhance the prediction of air-pollutant concentrations, which not only can give valuable support for

ICEEMDAN
Empirical mode decomposition (EMD) is an adaptive data-analysis method for nonlinear and nonstationary signals proposed by Huang et al. [24]; it decomposes a complex signal into several intrinsic mode functions (IMFs) that capture the local characteristics of the original signal at different time scales. Ensemble empirical mode decomposition (EEMD) [25] extends EMD by adding white Gaussian noise to alleviate the mode-mixing problem of EMD: the IMFs are obtained by EMD decomposition of the noise-added signal. However, the reconstruction error cannot be eliminated completely within a limited number of iterations. To overcome this drawback of EEMD, the complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) [26] was developed, which reduces the reconstruction error at lower computational cost. Nevertheless, the modes of CEEMDAN still contain some residual noise. The improved CEEMDAN (ICEEMDAN) was developed by Colominas et al. [27] to achieve perfect reconstruction and avoid spurious modes; it provides more physical meaning to the IMFs and has better convergence and stability [28]. The ICEEMDAN technique decomposes the time series G(t) into a number of IMFs and one residual component as follows:
1. The EMD algorithm is used to calculate the local means of n noisy realizations G_i(t) = G(t) + A I_1(κ_i(t)), where the operator I_j(·) extracts the j-th EMD mode of its argument, κ_i(t) is Gaussian white noise with variance between 0 and 1, and A = φ_0 std(G(t))/std(I_1(κ_i(t))) controls the fraction of noise energy added at the first stage of the algorithm (std denotes the standard deviation).
2. Calculate the first residue r_1(t) as the average of the local means of the realizations, where the operator E(·) produces the local mean; the first mode is then IMF_1(t) = G(t) − r_1(t).
3. Estimate the second residue r_2(t) in the same way from noise-added copies of r_1(t), and compute the second mode IMF_2(t) = r_1(t) − r_2(t).
4. Repeat step 3 for the subsequent residues until all the IMFs are obtained.

Wavelet Transform (WT)
Wavelet transform (WT) is a powerful signal-processing technique. WT can decompose a nonlinear and nonstationary signal into components at different resolution levels: a low-frequency approximation subset and several high-frequency subseries. The low-frequency approximation subset reflects the tendency of the original signal, while the high-frequency subseries represent turbulence and noise components [29]. As an effective data-preprocessing method, WT can reduce the difficulty of feature learning and enhance prediction performance, and it has therefore been widely used in many fields [30].
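The mother wavelet and decomposition depth used in this study are not restated here; as a minimal, self-contained sketch of the idea, the following NumPy code performs a one-level Haar wavelet decomposition (the Haar choice is an illustrative assumption), splitting a series into a low-frequency approximation and a high-frequency detail subseries, with perfect reconstruction:

```python
import numpy as np

def haar_dwt(signal):
    """One-level Haar wavelet decomposition: returns a low-frequency
    approximation and a high-frequency detail subseries."""
    x = np.asarray(signal, dtype=float)
    if len(x) % 2:
        x = np.append(x, x[-1])                  # pad to even length
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # low-frequency trend
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # high-frequency fluctuation
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse one-level Haar transform (perfect reconstruction)."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x
```

Applying `haar_dwt` recursively to the approximation yields the multi-level decomposition used in practice; in the hybrid approach, each resulting subseries would be forecast separately and the forecasts recombined.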

Sample Entropy
Sample entropy (SE) is an algorithm for measuring the complexity of a time series, proposed by Richman and Moorman [31]: the greater the sample entropy, the higher the complexity. As an improvement over approximate entropy, SE is largely independent of data length and has better consistency. The sample entropy is denoted SE(m, r, N), where r is the similarity tolerance (set to r = 0.2 standard deviations in this paper) and m is the reconstruction dimension (set to m = 2). The steps to calculate the sample entropy are as follows.
Step 1: From the given time series, construct the m-dimensional template vectors Y_i.
Step 2: Define the distance D_m(Y_i, Y_j) between Y_i and Y_j as the maximum absolute difference of their corresponding elements.
Step 3: For a given tolerance r, count the number of vectors satisfying D_m(Y_i, Y_j) < r, calculate the ratio B_i^m(r), and average over i to obtain B^m(r).
Step 4: Set m to m + 1 and repeat Steps 1-3 to obtain B^{m+1}(r).
Step 5: For a finite time series, the SE value is estimated as SE(m, r, N) = −ln[B^{m+1}(r)/B^m(r)].
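The steps above can be sketched directly in NumPy. The implementation below uses the Chebyshev (maximum) distance and excludes self-matches, with the paper's settings m = 2 and r = 0.2 standard deviations as defaults:

```python
import numpy as np

def sample_entropy(series, m=2, r_frac=0.2):
    """Sample entropy SE(m, r, N) with Chebyshev distance; r is set to
    r_frac times the standard deviation, as in the paper (m = 2, r = 0.2 std)."""
    x = np.asarray(series, dtype=float)
    r = r_frac * np.std(x)

    def match_count(dim):
        # all dim-length template vectors of the series
        templates = np.array([x[i:i + dim] for i in range(len(x) - dim)])
        total = 0
        for i in range(len(templates)):
            # Chebyshev distance from template i to all templates
            dist = np.max(np.abs(templates - templates[i]), axis=1)
            total += int(np.sum(dist < r)) - 1     # exclude the self-match
        return total

    b = match_count(m)        # matches at template length m
    a = match_count(m + 1)    # matches at template length m + 1
    return -np.log(a / b) if a > 0 and b > 0 else float("inf")
```

Consistent with the interpretation above, a smooth periodic series yields a much lower SE value than white noise of the same length, which is exactly the property used to decide which IMFs receive the second-stage WT decomposition.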

Group Teaching Optimization Algorithm (GTOA)
The group teaching optimization algorithm (GTOA), proposed by Zhang and Jin [32] in 2020, is a novel metaheuristic optimization algorithm inspired by the group teaching mechanism. A key advantage of the GTOA is that it requires very few control parameters: apart from the population size and the termination criterion, no extra control parameters are needed. Furthermore, the GTOA has fast convergence and strong global search ability. The algorithm consists of four phases: the ability-grouping phase, the teacher phase, the student phase and the teacher-allocation phase, described as follows [32].
In the ability-grouping phase, the knowledge of the whole class is assumed to follow a normal distribution, as shown in Equation (15), where x is the knowledge of a student, µ is the mean knowledge of the whole class, and σ is the standard deviation of knowledge among students. All students are divided into two groups according to their ability to accept knowledge: an outstanding group and an average group.
In the teacher phase, students acquire knowledge from the teacher: the students of the outstanding group and the average group update their knowledge by Equations (16) and (17), respectively,
where X_{t,j}^{l+1} represents the knowledge of student j learned from the teacher at time l + 1, X_j^l is the knowledge of student j at time l, T^l is the knowledge of the teacher at time l, and M^l is the average knowledge of the group at time l; α, β, γ and λ are random numbers between 0 and 1. Then, based on the following equation, each student decides whether to accept the knowledge learned from the teacher.
In the student phase, students gain new knowledge through self-learning and cooperation with other students. The updated knowledge is expressed as follows,
where X_{s,j}^{l+1} is the knowledge of student j at time l + 1 obtained through the student phase, and a and b are random numbers between 0 and 1. As in the teacher phase, each student decides whether to accept the knowledge gained in the student phase according to Equation (20).
In the teacher allocation phase, the teacher allocation can be expressed by the following equation for accelerating the convergence as well as avoiding local minimum.
where X_1^l, X_2^l and X_3^l denote the first-, second-, and third-best students.
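As a rough, simplified sketch of the teacher phase (not the exact published Equations (16)-(18), whose precise forms are not reproduced in this text), the code below moves each student toward the teacher while pulling away from a blend of the group mean and its own knowledge, and keeps an update only if it improves the objective, illustrated on a toy sphere-minimization problem:

```python
import numpy as np

rng = np.random.default_rng(42)

def sphere(x):
    """Toy objective: f(x) = sum(x_i^2), minimized at the origin."""
    return float(np.sum(x ** 2))

def teacher_phase(group, teacher, fitness):
    """Simplified teacher-phase update: each student moves toward the
    teacher and away from a blend of the group mean and itself; the
    update is kept only if it improves the objective (greedy acceptance)."""
    mean = group.mean(axis=0)                    # average knowledge M
    updated = np.empty_like(group)
    for j, x in enumerate(group):
        a, b, c = rng.random(3)                  # random numbers in [0, 1)
        candidate = x + a * (teacher - c * (b * mean + (1 - b) * x))
        updated[j] = candidate if fitness(candidate) < fitness(x) else x
    return updated

# demo: 10 students ("candidate solutions") in 3 dimensions
students = rng.uniform(-5.0, 5.0, size=(10, 3))
init_best = min(sphere(s) for s in students)
for _ in range(50):
    teacher = min(students, key=sphere)          # the best student teaches
    students = teacher_phase(students, teacher, sphere)
best = min(sphere(s) for s in students)
```

In the full GTOA, ability grouping into outstanding and average groups, a student phase, and the teacher-allocation rule over the three best students complete each iteration; the greedy acceptance above plays the role of the accept/reject decisions in Equations (18) and (20).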

Modified Extreme Learning Machine
The extreme learning machine (ELM), proposed by Huang et al. [33], is a single-hidden-layer feedforward neural network (SLFN) with an efficient learning algorithm. The ELM chooses the input weights and thresholds randomly and determines the output weights analytically, and thus trains very fast. It also has good nonlinear mapping capability and is widely used in the field of prediction. However, the random selection of weights and thresholds may degrade its generalization ability and stability. In this paper, the ELM is modified by using the GTOA to optimize the initial weights and thresholds: the mean absolute error (MAE) of the ELM is set as the objective function of the GTOA, and the resulting GTOA-ELM is employed to enhance the forecasting performance, stability, and generalization ability of the ELM.
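A minimal ELM for regression can be sketched as follows: the input weights and hidden thresholds are drawn at random, and the output weights are obtained analytically with the Moore-Penrose pseudo-inverse. In the paper's GTOA-ELM, the random draw of W and b would instead be replaced by a GTOA search minimizing the MAE; the weight range and hidden-layer size below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, y, n_hidden=30):
    """Basic ELM training: random input weights and thresholds, output
    weights computed analytically with the Moore-Penrose pseudo-inverse."""
    W = rng.uniform(-5.0, 5.0, size=(X.shape[1], n_hidden))  # random input weights
    b = rng.uniform(-5.0, 5.0, size=n_hidden)                # random thresholds
    H = np.tanh(X @ W + b)                                   # hidden-layer outputs
    beta = np.linalg.pinv(H) @ y                             # analytic output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# demo: fit a simple nonlinear target
X = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
y = np.sin(3.0 * X).ravel()
W, b, beta = elm_train(X, y)
train_mae = np.mean(np.abs(elm_predict(X, W, b, beta) - y))
```

Because only the linear output layer is solved for, training is a single least-squares problem, which is what makes the ELM fast enough to be retrained repeatedly inside a metaheuristic loop.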

Multistep Prediction
The PM10 and PM2.5 concentrations are undoubtedly strongly affected by the concentrations of the previous several days. Thus, many forecasting models use the data of the previous days as input to predict the next value [22]. Accordingly, in this paper the n-step-ahead forecasting formula is given by Equation (23),
ŷ_{l+n} = f(y_l, y_{l−1}, . . . , y_{l−h}),
where ŷ_{l+n} is the n-step-ahead forecast of the original time series at time l + n, and y_{l−h} is the actual value of the time series at time l − h. We use the data of the previous seven days to predict the one-step-, two-step-, and three-step-ahead concentrations of PM10 and PM2.5.
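Concretely, n-step-ahead forecasting with lagged inputs amounts to building a supervised data set from the series; the sketch below uses an illustrative window length rather than the paper's seven-day window:

```python
import numpy as np

def make_supervised(series, window, horizon):
    """Build input/target pairs for horizon-step-ahead forecasting:
    each row of X holds `window` consecutive past values, and y is the
    value `horizon` steps after the end of that window."""
    X, y = [], []
    for t in range(len(series) - window - horizon + 1):
        X.append(series[t:t + window])
        y.append(series[t + window + horizon - 1])
    return np.array(X), np.array(y)

series = np.arange(10.0)                 # toy series 0, 1, ..., 9
X, y = make_supervised(series, window=3, horizon=2)
# first sample: inputs [0, 1, 2] -> target 4 (two steps ahead of 2)
```

For hourly data with a seven-day history, `window` would be 168 and `horizon` would be 1, 2 or 3 for the one-, two- and three-step-ahead settings.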

Framework of the Proposed Hybrid Approach
In order to enhance the forecasting performance for PM2.5 and PM10 concentrations, a novel hybrid approach is developed based on a data-preprocessing technique that embeds sample entropy into a two-stage decomposition, together with the GTOA-modified ELM. The structure of the proposed approach is shown in Figure 1 and contains three main steps: Step 1: Data preprocessing. In order to reduce the prediction difficulty and errors, the ICEEMDAN algorithm is first used to decompose the original PM10 and PM2.5 concentration series into several IMFs. Since some of the IMFs remain complex, sample entropy is adopted to measure their complexity, and the WT algorithm is further utilized to decompose the IMFs with higher sample entropy values so as to fully extract the complex features of the data.
Step 2: Individual forecasting. In order to improve the forecasting performance, stability and generalization ability of the ELM, the new metaheuristic optimization algorithm GTOA is used to optimize the initial weights and thresholds of the ELM. The modified ELM (GTOA-ELM) is then employed to predict all the subseries.
Step 3: Prediction result aggregation. The prediction outcomes of the IMFs are obtained by summing the forecasting results of the subseries decomposed by WT, and the results of all IMFs are then summed to obtain the final prediction.

Empirical Results and Analysis
In this section, we use the hybrid ICEEMDAN-WT-GTOA-ELM model to make one-step-, two-step-, and three-step-ahead predictions for the hourly PM2.5 and PM10 concentrations of Beijing.

Data Set and Evaluation Criteria
In this study, the data comprise hourly concentrations of particulate matter (PM10 and PM2.5) in Beijing; each time series includes 10,510 observations from 10 May 2018 to 1 August 2019. The concentrations of PM10 and PM2.5 are shown in Figures 2 and 3. For both the PM2.5 and PM10 series, the data of the last three months are used as the testing set and the rest as the corresponding training set. In order to evaluate the performance of the proposed model, several statistical indicators and statistical tests are used to make reasonable comparisons: mean absolute error (MAE), mean absolute percentage error (MAPE), root mean square error (RMSE), normalized root mean square error (NRMSE), Theil's inequality coefficient (TIC), and direction statistics (Ds) are utilized to evaluate the forecasting errors of the different models.
In particular, MAE, MAPE, NRMSE and TIC quantify the horizontal errors, i.e., the deviation between the predicted and actual values [34], while Ds measures the capability of forecasting the direction of change. These statistical indicators are computed as follows, where T is the total number of observations in the testing set, G(t) is the original PM10 or PM2.5 concentration series, Ĝ(t) is the predicted concentration, and G̅(t) is the mean of the actual concentrations. In order to further evaluate the forecasting performance of the different models, the Diebold-Mariano (DM) test [35] is utilized to assess the statistical significance of the differences between the proposed model and the benchmark models, with mean square error as the loss function of the DM test. The null hypothesis is that there is no significant difference between the performance of the benchmark model and the tested model; the alternative hypothesis is that the forecasting performance of the tested model is better than that of the benchmark models.
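The horizontal-error indicators and the DM statistic can be sketched as follows; note that the NRMSE normalization (by the observed range) and the plain DM statistic without an autocorrelation-robust variance estimate are simplifying assumptions:

```python
import numpy as np

def mae(actual, pred):
    return float(np.mean(np.abs(actual - pred)))

def mape(actual, pred):
    return float(np.mean(np.abs((actual - pred) / actual))) * 100.0

def rmse(actual, pred):
    return float(np.sqrt(np.mean((actual - pred) ** 2)))

def nrmse(actual, pred):
    # normalized by the observed range (one common convention)
    return rmse(actual, pred) / float(actual.max() - actual.min())

def dm_statistic(actual, pred_1, pred_2):
    """Plain Diebold-Mariano statistic with squared-error loss; positive
    values indicate that model 2 forecasts better than model 1."""
    d = (actual - pred_1) ** 2 - (actual - pred_2) ** 2
    return float(np.mean(d) / np.sqrt(np.var(d, ddof=1) / len(d)))
```

Under the null hypothesis of equal accuracy the DM statistic is asymptotically standard normal, so the p-values reported in the tables correspond to comparing it against normal quantiles.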

Result and Analysis
In this section, the concentrations of particulate matter PM2.5 and PM10 are used to verify the superiority of the proposed model. The forecasting errors of all models for PM10 are reported in Tables 1-3 and illustrated in Figures 4-8, from which we can conclude the following. (1) The proposed model, based on sample entropy and two types of data-decomposition technology, achieves the best horizontal and directional forecasting performance: its MAE, MAPE, NRMSE, and TIC are all the smallest in one-step-, two-step-, and three-step-ahead forecasting, and its direction statistics of 97.02%, 90.09%, and 88.53% are higher than those of the other benchmark models. (2) The models that embed sample entropy selection and two-stage decomposition perform better than the models with only one type of decomposition or none. For example, the MAPE of ICEEMDAN-WT-GTOA-ELM in one-step-ahead forecasting is 3.11%, compared with 4.06% and 6.79% for ICEEMDAN-GTOA-ELM and WT-GTOA-ELM, respectively; the same conclusion can be drawn for two-step and three-step forecasting. (3) The forecasting errors of models with decomposition are all lower than those of models without decomposition, which demonstrates that decomposition can effectively extract the features of the nonstationary PM10 concentration series and thereby improve the forecasting performance of a single model. (4) The forecasting errors of GTOA-ELM are smaller than those of DE-ELM and ELM, and DE-ELM also performs better than ELM. This shows that an optimization algorithm can improve the prediction accuracy of a single model, and that the GTOA is more suitable than the DE algorithm for optimizing the weights and thresholds of the ELM.
The DM test is utilized to assess the statistical difference between the tested model and the benchmark models; the null hypothesis is that their performance does not differ significantly, and the alternative hypothesis is that the tested model forecasts better than the benchmark models. Tables 1-3 report the DM test results for one-step-, two-step-, and three-step-ahead predictions; the values in brackets are the p-values of the DM test, and the digits above them are the values of the DM statistic.
From the results of the DM test, several conclusions can be drawn. (1) When the proposed ICEEMDAN-WT-GTOA-ELM is set as the tested model, the p-values are all smaller than 0.01; we conclude that the data-preprocessing technology embedding sample entropy into the two-stage decomposition effectively reduces the forecasting difficulty of the PM10 concentrations and significantly improves the prediction accuracy of the hybrid model. (2) GTOA-ELM outperforms the ELM model at the 5% significance level in one-step-, two-step- and three-step-ahead forecasting, indicating that the GTOA is an effective optimization tool. (3) The predictive capability of GTOA-ELM is also better than that of DE-ELM, which shows that the GTOA improves the forecasting precision of the ELM significantly more than the DE algorithm does.

Comparison Analysis of PM2.5 Forecasting
In this subsection, the new dataset of hourly PM2.5 concentrations is utilized to test the stability of the new hybrid model. The multistep forecasting results of PM2.5 via the proposed hybrid model are also compared with the results of other benchmark models.
The forecasting errors of all the models are shown in Table 4, with the optimal values highlighted in bold and visualized in Figures 9-13; Tables 5-7 provide the DM test results for one-step, two-step, and three-step forecasting. From Table 4 and Figures 9-13, similar conclusions can be drawn. (1) In most cases, the proposed ICEEMDAN-WT-GTOA-ELM model performs better than the other benchmark models. Taking one-step-ahead prediction as an example, the MAE, MAPE, NRMSE and TIC of the proposed model are reduced by more than 80% compared with GTOA-ELM, further demonstrating that data preprocessing with sample entropy and two-stage decomposition leads to a significant enhancement of forecasting performance. (2) Comparing the models that use only one of the two decomposition techniques, the hybrid models based on the ICEEMDAN technique perform better than those based on the WT technique. (3) Based on the direction statistics, the hybrid models embedding a decomposition algorithm perform well on directional forecasting, while the other models cannot forecast the variation directions precisely. Tables 5-7 show that the proposed ICEEMDAN-WT-GTOA-ELM model is also significantly better than the other benchmark models; the horizontal and directional accuracy are greatly improved by the data-preprocessing technology that embeds sample entropy into the two-stage decomposition technique. The p-value between GTOA-ELM and DE-ELM in two-step-ahead forecasting is greater than 0.05, but it is smaller than 0.01 in one-step- and three-step-ahead forecasting, and the forecasting errors of GTOA-ELM are all smaller than those of DE-ELM; the GTOA is therefore still considered superior to the DE algorithm.

Conclusions and Discussion
In this paper, a hybrid approach combining the GTOA and ELM with data-preprocessing technology is proposed to predict the concentrations of PM10 and PM2.5. The hybrid ICEEMDAN-WT-GTOA-ELM model achieves precise predictions of PM10 and PM2.5 concentrations. The original concentration data of PM10 and PM2.5 are decomposed by ICEEMDAN; in order to fully extract the features of the nonstationary and irregular time series, WT is employed to further decompose the high-frequency subseries, i.e., those with greater sample entropy values. This data-preprocessing technology, which embeds sample entropy into a two-stage decomposition, significantly reduces the volatility and prediction difficulty of the initial data. The GTOA is used to optimize the initial weights and thresholds of the ELM, and the modified ELM is adopted to predict the subseries. To test the superiority of the hybrid ICEEMDAN-WT-GTOA-ELM model, we used the concentration series of PM10 and PM2.5 to make one-step-, two-step-, and three-step-ahead predictions. The empirical results demonstrate that the proposed hybrid approach has superior forecasting accuracy and stability for air-pollutant forecasting compared with the benchmark models, which indicates that the data-preprocessing technology embedding sample entropy into the two-stage decomposition is a powerful tool for complex series forecasting and that the GTOA heuristic can efficiently optimize the weights and thresholds of the ELM.
Although the proposed approach makes substantial progress toward reliable forecasting of irregular and nonstationary time series, several issues remain to be addressed. First, this paper considers univariate time series prediction; additional factors could also be taken into account. In addition, deep learning methods with powerful feature-extraction ability have emerged and may further improve the prediction accuracy of air-pollutant concentrations. In future work, we will take meteorological and other influencing factors into consideration and combine deep learning models with other heuristic algorithms to further improve prediction accuracy. Furthermore, it would be important to apply the improved approach to observed data in real time for making timely predictions. Finally, the proposed model could also be used to forecast complex time series in other fields such as finance, the biological sciences, ecology, and engineering. All of these issues will be topics of our future research.