Deep Prediction Model Based on Dual Decomposition with Entropy and Frequency Statistics for Nonstationary Time Series

The prediction of time series is of great significance for rational planning and risk prevention. However, time series data in various natural and artificial systems are nonstationary and complex, which makes them difficult to predict. An improved deep prediction method is proposed herein based on the dual variational mode decomposition of a nonstationary time series. First, criteria were determined based on information entropy and frequency statistics to determine the quantity of components in the variational mode decomposition, including the number of subsequences and the conditions for dual decomposition. Second, a deep prediction model was built for the subsequences obtained after the dual decomposition. Third, a general framework was proposed to integrate the data decomposition and deep prediction models. The method was verified on practical time series data with some contrast methods. The results show that it performed better than single deep network and traditional decomposition methods. The proposed method can effectively extract the characteristics of a nonstationary time series and obtain reliable prediction results.


Introduction
A time series is a significant representation of various objects and systems that describes their changing processes. The prediction of time series aims at estimating their future trends with hidden characteristics in the historical data. Time series prediction has gained widespread attention in many fields, such as meteorology [1,2], the stock market [3], environment pollution control [4,5], and data mining on the Internet. The reliable prediction of future trends can help administrators in comprehensive and scientific decision making [6]. At the core of making predictions is an appropriate model based on the features of the data. Predicting a time series effectively has been a hot issue in the fields of data mining and machine learning.
The early methods of time series prediction are based on statistical theory [7][8][9], which uses statistical analysis to model time series. Statistical methods perform well on stationary series, but they usually fail when the data comprises complex noises and nonstationary trends. Shallow neural networks [10] have been widely used due to their self-learning ability. The research shows that shallow networks face overfitting and insufficient fitting problems in different data conditions. Deep networks [11] were developed in view of the network structure and data scale. They can fit data trends based on the inner multiple layers that help model high-dimensional nonlinear relationships. Deep networks also benefit from mass data, which can train models with vast iterations. Although they have powerful learning abilities, deep networks may fail on nonstationary data with unpredictable noise and complex features.
For nonstationary time series data with complex noise, studies have been conducted to decompose the data for the feature extraction, such as seasonal-trend decomposition procedure based on LOESS (STL), wavelet decomposition (WD), empirical mode decomposition (EMD), and ensemble empirical mode decomposition (EEMD). Variational mode decomposition (VMD) [12][13][14] improves the mode aliasing and noise sensitivity of the existing method. It can be regarded as a relatively robust decomposition method, and it has been proved valid in practical data analysis. In VMD, the number of decomposition layers can be preset to effectively separate different frequency components, which is more widely applicable to nonstationary data. However, some practical issues occur in the VMD method, including the determination of the components and the inadequate decomposition of the subsequence. The degree of the decomposed subsequences directly influences the feature extraction. For data with implicit trends and noise, the complexity of the decomposition components is still very high. The ideal simple components cannot be obtained by a single decomposition. The question of how to obtain the subsequence with clear trend features and without complex noise has been a major issue.
Considering the feature extraction ability of VMD and the prediction performance of deep networks, an integrated model is proposed in this paper. Nonstationary time series data are the main object of analysis, and a deep prediction model with dual decomposition is proposed. In the dual decomposition, a new method was designed to determine the number of VMD subsequences based on information entropy. The requirements of the dual decomposition were determined with the frequency characteristics of the components. Finally, a general framework was built, which consists of a dual VMD and a gated recurrent unit (GRU). Experiments were conducted to show that the proposed method can extract features from time series data with an automatic mechanism and achieve reliable prediction results. This paper is organized as follows. Section 2 introduces the related works on time series prediction. Section 3 presents the main prediction method, including the decomposition mechanism and the deep neural networks. Section 4 presents the experiments. The experimental results are discussed in Section 5. The work is concluded in Section 6.

Prediction Methods Based on Machine Learning
As mentioned in Section 1, research on time series prediction has gained widespread attention. The methods can be classified into three categories, namely statistical models, shallow neural networks, and deep networks. In this section, the related works are introduced, with a focus on machine learning methods, especially deep networks.
Statistical models describe changes in time series data using mathematical analysis. The ARMA [15] model is a typical statistical model that combines the AR and MA models. The ARIMA [16] model adds differential processing based on the ARMA model and can be used to process nonstationary data. In addition, the GARCH [17] and state space methods, represented by Kalman filters, [18,19] are used to build predictive models.
Artificial neural networks, such as the BP, Elman, and RBF, can autonomously capture nonlinear features in data. Various methods have been studied to solve practical problems. Xue et al. [20] used extreme learning machines to study financial time series, and Lin et al. [21] established the support vector machine method to predict electricity generation. Amjady et al. [22] constructed a fuzzy neural network method to predict electricity prices. Yang et al. [23] proposed the dynamic regularized echo state network (DRESN) model, which can dynamically determine the structure of an echo state network (ESN) network. Compared to ESN networks, DRESN networks have better accuracy.
In recent years, neural networks with deeper layers have been used to build models. Che et al. [24] built the RNN network to predict multivariate time series. Based on the RNN network, Hochreiter et al. [25] proposed a long short-term memory (LSTM) network to effectively solve the problems of vanishing gradients and the long-term dependence of  [26] used the LSTM network for financial market prediction in their research, and LSTM is superior to memoryless classification methods, such as random forest, deep neural networks, and logistic regression classifier. As an improvement to LSTM, GRU [27] reduces the gating cycle unit based on LSTM, thereby reducing the number of parameters. While obtaining the same or even better results, the calculation speed is increased. Ding et al. [28] used the GRU network to predict short-term wind speed.
Statistical models are appropriate for time series with obvious trends. Machine learning methods supplement statistical models for their lack of nonlinear fitting ability. In the existing literature and previous experimental studies, it has been shown that deep networks still need to improve the fitting of nonstationary time series with noise. It is possible to improve the prediction accuracy by introducing data pre-analysis before deep network training.

Decomposition and Prediction Methods
For the feature extraction of complex time series data, decomposition methods have been applied widely. In addition, decomposition has been combined with machine learning to solve the problem of time series prediction. The aim of these methods is to decompose raw data into multiple subsequences, and appropriate machine learning models have been established according to the characteristics of different subsequences.
Various decomposition methods have been studied to extract data features. The seasonal-trend decomposition procedure based on LOESS (STL) [29,30] decomposes time series into three components-period, trend, and residual. Wavelet decomposition (WD) [31] improves the Fourier transform method. An original series is decomposed into a series of wavelets by translating and scaling the wavelets. Empirical mode decomposition (EMD) [32] obtains an inherent mode function without a pre-set basis function. Ensemble empirical mode decomposition (EEMD) [33,34] can distribute a series into appropriate reference scales by introducing white noise. Although they have the ability of feature extraction, there are shortcomings in these methods. STL is generally applied to periodic time series with stable trends. It is difficult for WD to choose a suitable wavelet base. EMD has problems with mode aliasing and noise sensitivity, and EEMD generates a lot of redundant information during the decomposition process.
The integrated prediction methods have been applied based on decomposition. Jin et al. [35] decomposed data into period, trend, and residual components by STL. For the period and residual components, GRU networks were built to predict the values. The trend component has good linear characteristics, so the ARIMA model, which has a good predictive ability for linear systems, was selected to build the prediction model. The prediction results of the three models were integrated into the final result. Compared to the results of a single GRU model and all three components using the GRU network, the proposed model achieved the best performance. In addition, Jin et al. [36] used EMD to divide the original data into intrinsic mode functions (IMFs) with different frequency characteristics. They used the convolutional neural network to extract local feature components to build a classification model and reconstructed components with similar frequency characteristics into a group. All components were divided into three groups, and each group was modeled by the GRU network. Their method solved the shortcomings of the uncertain layers of EMD and achieved the best accuracy compared to other multiple-component combinations. Niu et al. [37] proposed a hybrid model based on evolutionary extreme learning machine and VMD to predict annual runoff time series. First, the VMD method was used to decompose the original stream into a series of components. Secondly, each component was predicted by constructing an appropriate extremum learning machine model, and the model parameters were adjusted using the gravity search algorithm. Finally, the prediction results of all the models were aggregated and used as the simulation output. Yang et al. [38] established a hybrid method combining wavelet transform, kernel extreme value learning machine (KELM) based on adaptive particle swarm optimization, and ARMA. An adaptive particle swarm optimization algorithm (SAPSO) was used to find the best kernel parameters. After testing the wavelet decomposition components, the ARMA model was applied to predict stationary sequences as a new input set, and the SAPSO-KELM model was constructed to predict nonstationary sequences. The experimental results showed that the method has more accurate prediction ability, better versatility, and practicability than the single method and other hybrid methods. Xie et al. [13] constructed a comprehensive model to predict solar output power combining VMD, ARMA, and a deep belief network (DBN). The time series were decomposed into components of different frequencies by VMD, and then the DBN and ARMA were established to predict high-frequency components and low-frequency components, respectively. Finally, the predicted values were reconstructed to get the final result. The hybrid method was superior to the single prediction model, proving its good accuracy and reliability.
As mentioned above, the decomposition methods are not consummate, and VMD is regarded as a relatively effective solution. In previous studies, we applied the VMD method to PM 2.5 monitoring data analysis. PM 2.5 concentration data in Beijing were recorded every hour [39]. The data from 100 days were selected for analysis, starting on 1 January 2016, with a total of 2400 h. The PM 2.5 concentration data were decomposed with VMD, in which the different decomposition layers were tested. The decomposition results are shown in Figure 1.
nally, the prediction results of all the models were aggregated and used as the simulation output. Yang et al. [38] established a hybrid method combining wavelet transform, kernel extreme value learning machine (KELM) based on adaptive particle swarm optimization, and ARMA. An adaptive particle swarm optimization algorithm (SAPSO) was used to find the best kernel parameters. After testing the wavelet decomposition components, the ARMA model was applied to predict stationary sequences as a new input set, and the SAPSO-KELM model was constructed to predict nonstationary sequences. The experimental results showed that the method has more accurate prediction ability, better versatility, and practicability than the single method and other hybrid methods. Xie et al. [13] constructed a comprehensive model to predict solar output power combining VMD, ARMA, and a deep belief network (DBN). The time series were decomposed into components of different frequencies by VMD, and then the DBN and ARMA were established to predict high-frequency components and low-frequency components, respectively. Finally, the predicted values were reconstructed to get the final result. The hybrid method was superior to the single prediction model, proving its good accuracy and reliability.
As mentioned above, the decomposition methods are not consummate, and VMD is regarded as a relatively effective solution. In previous studies, we applied the VMD method to PM 2.5 monitoring data analysis. PM 2.5 concentration data in Beijing were recorded every hour [39]. The data from 100 days were selected for analysis, starting on 1 January 2016, with a total of 2400 h. The PM 2.5 concentration data were decomposed with VMD, in which the different decomposition layers were tested. The decomposition results are shown in Figure 1. The subsequence and the frequency distribution after decomposition are shown in Figure 1, where the data are decomposed into 4, 6, and 8 layers, respectively. The center frequencies of the subsequences after the four-layer decomposition were 3.49, 23.14, 65.57, and 107.51. After six layers of decomposition, the center frequencies of the subsequences were 3.33, 21.79, 58.24, 100.42, 139.43, and 208.38. The center frequencies of the subsequences obtained after the eight-layer decomposition were 2.63, 16.46, 34.22, 68.45, 102.15,139.67, 203.53, and 294.96. It can be seen that the frequencies of the subsequences increased along with the decomposition layers. Subsequences with higher frequencies had better periodicity, which is easier to model and can improve the prediction accuracy.
Two practical problems can be seen in the analysis above. First, the number of subsequences impacted the feature complexity, but it was usually determined by artificial experience. An automatic mechanism should be explored for the determination of the subsequence quantity. Second, some subsequences still had higher center frequencies, which negatively impacted the prediction modeling. The complex subsequences should be decomposed more than once, and criteria should be determined for which subsequences should be decomposed again. The subsequence and the frequency distribution after decomposition are shown in Figure 1, where the data are decomposed into 4, 6, and 8 layers, respectively. The center frequencies of the subsequences after the four-layer decomposition were 3.49, 23.14, 65.57, and 107.51. After six layers of decomposition, the center frequencies of the subsequences were 3.33, 21.79, 58.24, 100.42, 139.43, and 208.38. The center frequencies of the subsequences obtained after the eight-layer decomposition were 2.63, 16.46, 34.22, 68.45, 102.15,139.67, 203.53, and 294.96. It can be seen that the frequencies of the subsequences increased along with the decomposition layers. Subsequences with higher frequencies had better periodicity, which is easier to model and can improve the prediction accuracy.
Two practical problems can be seen in the analysis above. First, the number of subsequences impacted the feature complexity, but it was usually determined by artificial experience. An automatic mechanism should be explored for the determination of the subsequence quantity. Second, some subsequences still had higher center frequencies, which negatively impacted the prediction modeling. The complex subsequences should be decomposed more than once, and criteria should be determined for which subsequences should be decomposed again.
In this study, we mainly investigated deep networks with data feature extraction for nonstationary time series, which is significantly different from previous studies. Our innovative contributions are highlighted as follows: 1.
An automatic mechanism was designed for the decomposition process of VMD, in which the criteria are determined based on the entropy and frequency to determine the number of subsequences and the dual decomposition parts.

2.
A general framework was constructed to integrate the dual decomposition mechanism and deep networks for time series prediction. The integrated deep model effectively solves the prediction issue with nonstationary time series. VMD is an adaptive signal processing method with a completely non-recursive mechanism. It decomposes the input signal into several IMFs with limited bandwidth. In the optimization process, each component can be compressed around a central frequency to achieve the separation of inherent modal components. It overcomes the problems of end effects and modal component aliasing in the EMD method. For time series data with high complexity and strong nonlinearity, the VMD method can reduce the non-stationarity to obtain relatively stationary subsequences. For the original signal, the corresponding constrained variational model is expressed as where k is the number of modals to be decomposed. u k , w k are the k-th modal component and center frequency after decomposition, respectively. δ(t) is the Dirac function, and * is the convolution operator. The Lagrange multiplication operator λ is introduced to transform the constrained variational problem into an unconstrained variational problem.
The augmented Lagrange expression is as follows: Finally, the original signal f can be decomposed to K IMFs. The algorithm of VMD is shown as the following Algorithm 1:

Criteria for the Number of Components
The decomposition K number is a significant parameter in the Algorithm 1. Modal aliasing may occur when the K is large, and the original data cannot be decomposed effectively when the K is small. In practice, the number of components after the decomposition is determined artificially based on personal experience. The subjective judgment may not apply to the current condition. Moreover, the number determined cannot be generalized when the data trends and features change. An automatic mechanism should be explored to determine the number of decomposed components. The criteria are proposed based on the analysis of feature changes with information entropy.
The original time series data are decomposed to form the component set of IMFs. In the decomposition process, the IMF is marked as u k , the center frequency of which is w k . The center frequency reflects the nonstationary characteristics of the decomposed IMF. The IMFs are approximate when their frequencies are similar to low gradients, which may lead to modal aliasing. The change in the center frequency can be considered the terminal condition of the decomposition. Information entropy is introduced to measure the change rate of the center frequency. The information entropy e k of the existing IMFs is calculated as follows: where p is the probability measure of the object data. Based on the criteria, the probability can be concerted from the normalization, as follows: The change rate of e k is obtained as The change rate of the information entropy d k is considered to be the criteria. The decomposition of VMD ends if d k ≤ αe k , where α is an adjusting parameter that is usually set between 0.05 and 0.2 following the Pareto rule. The determination of the component amount can be conducted as shown in the flowchart in Figure 2.

Criteria for Dual Decomposition
For prediction based on decomposition, the complexity degree of the decomposed component impacts the prediction difficulty level and precision. For IMFs obtained from original data, some still have high nonlinear and nonstationary trends. The complex IMFs with low center frequencies need further decomposition, which is the dual decomposition method of this study.
When the center frequency of an adjacent IMF changes greatly, the decomposition is insufficient and it needs to be decomposed again. The flowchart of the dual decomposition is shown in Figure 3. As shown in Figure 3, first, the change rates of the center frequencies of the first and last IMF are defined as 0 p and 0 p p b = × , respectively. ( 0 1 b < < ) is defined as the second decomposition threshold. Then, starting from the first subcomponent, the rate of change between the n-th IMF and the (n + 1)-th IMF center frequency is calculated as n p . When n p is greater than the threshold, it indicates that the data in this frequency range is not sufficiently decomposed and a second decomposition is required. However, when n p is less than the threshold value for the first time, it indicates that the latter high-frequency components have been decomposed completely in the first VMD and there is no need to continue with a second decomposition. All the low-frequency components that need to be decomposed are divided into three components again. When the rate of change of all adjacent components obtained from the first decomposition is less than the threshold, dual decomposition is not required.

Basic Deep Network of GRU
GRU is a recurrent neural network that updates and optimizes weights by gradient descent algorithms. With its delicate network characteristics and structure, it achieves good performance in the prediction of nonstationary time series. The GRU contains two gating units-the reset gate and the update gate. The reset gate is used to ignore information in the current state. The smaller the value, the more the information is ignored. The update gate is used to record the information of the current state. The larger the value, the more the information is recorded in the current state. In our method, the GRU network As shown in Figure 3, first, the change rates of the center frequencies of the first and last IMF are defined as p 0 and p = p 0 × b, respectively. (0 < b < 1) is defined as the second decomposition threshold. Then, starting from the first subcomponent, the rate of change between the n-th IMF and the (n + 1)-th IMF center frequency is calculated as p n . When p n is greater than the threshold, it indicates that the data in this frequency range is not sufficiently decomposed and a second decomposition is required. However, when p n is less than the threshold value for the first time, it indicates that the latter high-frequency components have been decomposed completely in the first VMD and there is no need to continue with a second decomposition. All the low-frequency components that need to be decomposed are divided into three components again. When the rate of change of all adjacent components obtained from the first decomposition is less than the threshold, dual decomposition is not required.

Basic Deep Network of GRU
GRU is a recurrent neural network that updates and optimizes weights by gradient descent algorithms. With its delicate network characteristics and structure, it achieves good performance in the prediction of nonstationary time series. The GRU contains two gating units-the reset gate and the update gate. The reset gate is used to ignore information in the current state. The smaller the value, the more the information is ignored. The update gate is used to record the information of the current state. The larger the value, the more the information is recorded in the current state. In our method, the GRU network is used to construct a prediction model for each subsequence. Figure 4 shows the internal structure of the GRU. For a GRU cell, the relationship between its internal parameters can be expressed by Equation (6), as follows:

Framework of the Deep Prediction Model with Dual Decomposition
The framework for a deep prediction model with dual decomposition is proposed in this paper. It consists of the modules of a first decomposition, second decomposition, and prediction and fusion. The general framework design is shown in Figure 5. As shown in Figure 5, the proposed model includes three parts-first decomposition, second decomposition, and prediction and fusion. In the module of the first decomposi- For a GRU cell, the relationship between its internal parameters can be expressed by Equation (6), as follows: where r t , u t , h t , h t , and y t represent reset gate, update gate, hidden state, output state, and output respectively. W rx , W rh , W ux , W uh , W hx , W hr , and W y represent the weight matrix of training. b r , b u , b h , and b y represent the bias. σ and tanh represent the activation function.

Framework of the Deep Prediction Model with Dual Decomposition
The framework for a deep prediction model with dual decomposition is proposed in this paper. It consists of the modules of a first decomposition, second decomposition, and prediction and fusion. The general framework design is shown in Figure 5. For a GRU cell, the relationship between its internal parameters can be expressed by Equation (6), as follows:

Framework of the Deep Prediction Model with Dual Decomposition
The framework for a deep prediction model with dual decomposition is proposed in this paper. It consists of the modules of a first decomposition, second decomposition, and prediction and fusion. The general framework design is shown in Figure 5.  As shown in Figure 5, the proposed model includes three parts-first decomposition, second decomposition, and prediction and fusion. In the module of the first decomposition, the original data are decomposed into IMFs with the VMD method, in which the As shown in Figure 5, the proposed model includes three parts-first decomposition, second decomposition, and prediction and fusion. In the module of the first decomposition, the original data are decomposed into IMFs with the VMD method, in which the criteria are set to determine the number of components. The first decomposition ends when the frequency variation meets the conditions following the calculation in Section 3.1.2. For IMFs with high nonlinear and nonstationary features, they are processed with a second decomposition. The criteria in Section 3.1.3 guide the determination and selection of the components to be decomposed again. Finally, in the prediction and fusion module, the decomposed subsequences are predicted with GRUs. The prediction results of all individual network models are fused to obtain the final output.

Experimental Setting and Data Set
Experiments were designed and carried out to verify the proposed method. For the prediction model, a practical data set was used to test its performance. The data of PM 2.5 concentration mentioned in Section 2.2 were selected as the practical data.
For the data totaling 2400 h, the first 80% of the data were used as the training set to build the model, and the other 20% of the data were used to test the accuracy of the model. The experimental platform was built on 64-bit Windows. The RAM was 8 GB, and the core was i7-8565u with 1.8 GHz. The program used Python version 3.7.4. The deep learning framework used the Tensorflow interface of Keras, and the optimizer was Adam.
The aim proposed method is to solve the problem of data decomposition, including determining the number of components and which ones should be decomposed again. Three sets of tests were conducted to verify the different aspects of the proposed method.
Test 1: The criteria for the number of decomposed components were tested. The original data were analyzed with the method described in Section 3.1.2 to determine the optimal number of IMFs. Then, the data were decomposed with different layers following the exhaustion method. The calculated and exhaustive results were analyzed.
Test 2: The effect of the dual decomposition on the prediction was tested. For the decomposed components in Test 1, parts of them still had complex features. The criteria were set to determine which components should be decomposed again. The prediction performance of the dual decomposition was analyzed.
Test 3: The proposed method was tested and compared to other methods. As discussed in Section 2, five well-known models were used to verify our method, including RNN, LSTM, GRU, decomposition-ARIMA-GRU-GRU [35], and EMDCNN-GRU [36], which are introduced briefly as follows.
RNN, LSTM, and GRU are typical machine learning methods that have been widely used in time series prediction. RNN establishes weighted connections between neurons in layers. LSTM and GRU were developed from RNN by introducing different kinds of gates to enhance information storage and screening.
Decomposition-ARIMA-GRU-GRU and EMDCNN-GRU are proposed as advanced approaches based on machine learning models. They mainly focus on the decomposition of the original data to reduce complexity. The decomposition is executed in a single pass, which is different from the dual decomposition presented in this paper.
To evaluate the performance of the model, the following metrics were used: Entropy 2022, 24, 360 10 of 16 where y r represents the real value, y p represents the predicted value, T represents the number of data, y rv represents the average value of the real value, and y pv represents the average value of the predicted value. The root mean squared error (RMSE) measures the deviation between the predicted value and the true value. The smaller the value of the RMSE, the better the model prediction. The mean absolute error (MAE) is used to measure the error between the observed value and the real value. When the error is zero, it is a perfect model. The square of R is between 0 and 1. The closer it is to 1, the more accurate the model is. CC is the correlation coefficient [39] and represents a linear relationship between the predicted and real values. The closer it is to 1, the stronger the linear relationship is.

Test on the Number of Decomposition Components
Different decomposition layers significantly influence the prediction results. The algorithm for the criteria proposed in Section 3.1.2 was applied to obtain the number of decomposed layers, and the calculation result was 6. In the first test, the original data were decomposed following the exhaustion method, in which data are decomposed into 3-12 layers. The decomposed IMFs were predicted with GRUs and the fusion result was finally obtained. The RMSEs of the prediction results of different decomposition layers are listed in Table 1. The calculation time spent for each experiment is also presented. As the number of decomposition layers increased, the accuracy of the model increased. When the number of decomposition layers reached six, the model performance tended to be stable. Although the 8-, 9-, and 12-layer decomposition models performed better than the six-layer decomposition, the time spent increased significantly. The changes in the RMSEs are shown in Figure 6. The horizontal axis is the number of decomposition layers and the vertical axis is the RMSE. As shown in Figure 6, the prediction performance got better as the number position layers increased and was between three and six layers. When the num composition layers was greater than 6, the prediction accuracy tended to be mentioned at the beginning of Section 4.2.1, it was observed that the exhaust met the calculated values, where the optimal number of decomposed IMF layer

Test of Dual Decomposition
In Test 1, the optimal number of decompositions was determined to be s previous analysis, the low-frequency subsequence still had highly complex tren on the criteria in Section 3.1.3, the third to the sixth IMFs still had obvious nons features. Then, the dual composition was applied to the IMFs. In the test, the numbers of IMFs were decomposed again to verify the performance. In the foll sults, "IMF 1" means only the first IMF was decomposed twice, "IMF 1-2" mean and second IMFs were decomposed twice, "IMF 1-3" means the first to the t "IMF 1-4" means the first to the fourth. The prediction results based on the dif composition situations are shown in Figure 7. The absolute errors of the predicti are shown in Figure 8.  As shown in Figure 6, the prediction performance got better as the number of decomposition layers increased and was between three and six layers. When the number of decomposition layers was greater than 6, the prediction accuracy tended to be stable. As mentioned at the beginning of Section 4.2.1, it was observed that the exhausted results met the calculated values, where the optimal number of decomposed IMF layers was six.

Test of Dual Decomposition
In Test 1, the optimal number of decompositions was determined to be six. In the previous analysis, the low-frequency subsequence still had highly complex trends. Based on the criteria in Section 3.1.3, the third to the sixth IMFs still had obvious nonstationary features. Then, the dual composition was applied to the IMFs. In the test, the different numbers of IMFs were decomposed again to verify the performance. In the following results, "IMF 1" means only the first IMF was decomposed twice, "IMF 1-2" means the first and second IMFs were decomposed twice, "IMF 1-3" means the first to the third, and "IMF 1-4" means the first to the fourth. The prediction results based on the different decomposition situations are shown in Figure 7. The absolute errors of the prediction results are shown in Figure 8. As shown in Figure 6, the prediction performance got better as the number o position layers increased and was between three and six layers. When the numb composition layers was greater than 6, the prediction accuracy tended to be st mentioned at the beginning of Section 4.2.1, it was observed that the exhauste met the calculated values, where the optimal number of decomposed IMF layers

Test of Dual Decomposition
In Test 1, the optimal number of decompositions was determined to be si previous analysis, the low-frequency subsequence still had highly complex trend on the criteria in Section 3.1.3, the third to the sixth IMFs still had obvious nonst features. Then, the dual composition was applied to the IMFs. In the test, the numbers of IMFs were decomposed again to verify the performance. In the follo sults, "IMF 1" means only the first IMF was decomposed twice, "IMF 1-2" means and second IMFs were decomposed twice, "IMF 1-3" means the first to the th "IMF 1-4" means the first to the fourth. The prediction results based on the diffe composition situations are shown in Figure 7. The absolute errors of the predictio are shown in Figure 8.   Figure 7 shows the prediction results from the second decomposition. The curve represents the real result. It was observed that the prediction curve of IMF 1closer to the true value, which indicates the best performance. Figure 8 shows the ab error curve calculated by Equation (11). The black straight line represents the sta line corresponding to the true value. The closer it is to the black line indicates the p tion error of the model was relatively minimal and the performance was better. It seen that the dual decomposition of IMF 1-4, represented by the blue curve, is clo the black curve.
For the quantitative evaluation, the prediction performance is presented in T The performance metrics are also shown in Figure 9.     Figure 7 shows the prediction results from the second decomposition. The black curve represents the real result. It was observed that the prediction curve of IMF 1-4 was closer to the true value, which indicates the best performance. Figure 8 shows the absolute error curve calculated by Equation (11). The black straight line represents the standard line corresponding to the true value. The closer it is to the black line indicates the prediction error of the model was relatively minimal and the performance was better. It can be seen that the dual decomposition of IMF 1-4, represented by the blue curve, is closest to the black curve.
For the quantitative evaluation, the prediction performance is presented in Table 2. The performance metrics are also shown in Figure 9.   Figure 7 shows the prediction results from the second decomposition. Th curve represents the real result. It was observed that the prediction curve of IMF 1 closer to the true value, which indicates the best performance. Figure 8 shows the a error curve calculated by Equation (11). The black straight line represents the st line corresponding to the true value. The closer it is to the black line indicates the tion error of the model was relatively minimal and the performance was better. It seen that the dual decomposition of IMF 1-4, represented by the blue curve, is clo the black curve.
For the quantitative evaluation, the prediction performance is presented in T The performance metrics are also shown in Figure 9.     Figure 9 shows the change curves of RMSE, MAE, R 2 , and CC. As the components used for secondary decomposition increased, the RMSE and MAE gradually decreased, while R 2 and CC gradually increased. The performance of the model achieved the optimal value when the IMF 1-4 components were subjected to dual decomposition. As can be observed from the results above, when the number of decomposition layers was four, the highest model accuracy was achieved. As the number of decomposition layers increased, the accuracy of the model gradually improved. When the number of decomposition layers was four, the best effect was achieved. Without decomposition, the RMSE was 18.2552, and the result of the four-layer decomposition was improved by 17.0461%. The MAE increased by 10.0290%, the R 2 increased by 3.5655%, and the CC increased by 1.9523%.

Comparison of Contrast Methods
In order to verify the proposed model, the RNN, LSTM, GRU, decomposition-ARIMA-GRU-GRU, and EMDCNN-GRU models were selected as the comparison models. The first three models represent classic machine learning and were constructed by three layers of simple RNN, LSTM, and GRU in Keras to simulate the input layer, hidden layer, and output layer, respectively. For the decomposition-ARIMA-GRU-GRU model, the period, trend, and residual components of the time series were generated by STL decomposition firstly. Then, the period and residual components were trained by the GRUs respectively, and the trend component was trained by the ARIMA model. Finally, all results were fused to form a prediction result. For the EMDCNN-GRU mode, the IMFs were generated by EMD decomposition, then all IMFs were divided into three groups by CNN. Each group was trained and predicted with the GRU, and the three prediction results were integrated. Table 3 shows that the dual decomposition model obtained the smallest RMSE and MAE and the largest R 2 and CC. Compared to RNN, LSTM, GRU, decomposition-ARIMA-GRU-GRU, and EMDCNN-GRU, our model achieved the best performance. The results show that the dual decomposition model can fully mine the data features and has a good predictive ability for nonstationary time series.

Discussion
In this study, a new prediction method for nonstationary time series data was proposed, combining dual VMD with a GRU network. On the basis of the first decomposition, dual decomposition is performed on subsequences with lower frequencies to fully reduce complexity and further improve the prediction accuracy. In this study, the criteria were designed to reasonably determine the decomposition layers and dual decomposition conditions in VMD.
Test 1 decomposed the original data with different layers. The optimal number of layers was consistent with the results obtained by the proposed criteria, which proves that the method is effective in determining the number of layers through information entropy. By decomposing the components obtained in Test 1, Test 2 quantitatively determined the components that needed to be further decomposed, which fully reduced the complexity and further improved the prediction accuracy based on the first decomposition, verifying the effectiveness of the method for determining the components of decomposition. Test 3 compared the RNN, LSTM, GRU, decomposition-ARIMA-GRU-GRU, and EMCDCNN-GRU models with our proposed method. Our dual decomposition model obtained the highest accuracy, which proves that dual decomposition is more effective.
The proposed model has outstanding prediction performance, but it takes a long time to build the model and takes up more computing resources, which we further investigate in the future. For data with different characteristics, we will try to combine multiple decomposition methods to make the method more effective. In addition, we will explore online learning methods to improve their applicability.

Conclusions
In this study, we investigated the issue of the prediction of nonlinear and nonstationary time series. A framework of deep prediction models with dual decomposition was proposed as the core method. In the proposed method, the quantitative criteria were explored to solve the feature extraction problem in VMD decomposition. The GRU deep model performed well in the prediction of subsequences. The experiments and results show that the first and second decomposition methods can reduce the difficulty of modeling and improve prediction performance. The criteria for the decomposition are effective and practical, which can help the automatic feature extraction of the time series. In the future, we will further investigate non-stationary time series forecasting [40][41][42][43][44], including optimization methods for decomposition models and multi-model fusion methods [45][46][47].