Application of Combined Models Based on Empirical Mode Decomposition, Deep Learning, and Autoregressive Integrated Moving Average Model for Short-Term Heating Load Predictions

: Short-term building energy consumption prediction is of great signiﬁcance for the optimized operation of building energy management systems and energy conservation. Due to the high-dimensional nonlinear characteristics of building heat loads, traditional single machine-learning models cannot extract the features well. Therefore, in this paper, a combined model based on complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN), four deep learning (DL), and the autoregressive integrated moving average (ARIMA) models is proposed. The DL models include a convolution neural network, long- and short-term memory (LSTM), bi-directional LSTM (bi-LSTM), and the gated recurrent unit. The CEEMDAN decomposed the heating load into different components to extract the different features, while the DL and ARIMA models were used for the prediction of heating load features with high and low complexity, respectively. The single-DL models and the CEEMDAN-DL combinations were also implemented for comparison purposes. The results show that the combined models achieved much higher accuracy compared to the single-DL models and the CEEMDAN-DL combinations. Compared to the single-DL models, the average coefﬁcient of determination (R 2 ), root mean square error (RMSE), and coefﬁcient of variation of the RMSE (CV-RMSE) were improved by 2.91%, 47.93%, and 47.92%, respectively. Furthermore, CEEMDAN-bi-LSTM-ARIMA performed the best of all the combined models, achieving values of R2 = 0.983, RMSE = 70.25 kWh, and CV-RMSE = 1.47%. This study provides a new guide for developing combined models for building energy consumption prediction.


Introduction
Building energy consumption accounts for more than one-third of the global total energy consumption and is expected to continue increasing with the acceleration of urbanization in developing countries [1]. Consequently, energy saving and sustainable buildings have become the top priority of all countries in the world. To solve this problem, researchers have put forward a series of building energy-saving measures, such as renewable energy utilization and high-performance building envelopes. Among them, efficient intelligent control is one of the most effective measures to reduce energy consumption during the building's lifecycle [2]. However, building energy consumption exhibits substantial randomness, so its accurate and reliable prediction is of great significance to predictive building systems' control and optimal scheduling [3][4][5].
Depending on the time-scale range, building energy consumption prediction (BECP) methods can be divided into short-term, medium-term, and long-term [6]. Short-term building energy forecasting is mainly used for the daily operation and management of building energy systems [7]; it can help users adjust the equipment in real time according to the predicted values so as to achieve the best match between energy supply and demand as well as formulate the operational and regulation strategy of building energy systems [8]. Medium-term BECP is mainly applied to the scheduling of energy systems [9], while long-term prediction is mainly used for capacity design [10].
BECP methods can be mainly divided into physical models and data-driven models [11]. The physical models are based on energy consumption analysis and make predictions according to the heat-transfer process and outdoor meteorological conditions [12]. A notable example of this approach is the EnergyPlus simulation tool [13]. Physical models require detailed physical building information, including parameters related to exterior walls, doors and windows, ground, and other details, as well as the heat sources in the building. This information is often difficult to obtain, which limits the wide adoption of physical models [14]. With the continuous development of the Internet of Things, building energy consumption monitoring technologies are gradually maturing, and a large amount of historical data have become available, which has allowed the rapid development of data-driven energy consumption prediction models [15].
Machine-learning models have become one of the most commonly used data-driven methods for BECP due to their high accuracy and excellent ability to deal with highdimensional nonlinear problems [16]. Such methods include artificial neural networks (NN) [17], support vector machines (SVMs) [18], least square SVMs [19], decision trees [20], random forests (RFs) [21], extreme gradient boosting (XGBoost) [22], etc. For example, Ling et al. [23] adopted a back-propagation neural network (BPNN) and an SVM to select the optimal input parameters for BECP. Chen and Chou [24] reviewed the machine-learning models commonly used for this purpose. Yang et al. [25] compared an RF with a BPNN and an SVM for BECP and concluded that the RF exhibits obvious advantages in terms of performance. Liu et al. [26] developed the 24-h-ahead BECP model based on long-and short-term memory (LSTM) and demonstrated that their proposed models achieved high accuracy and reliable predictions. Lei et al. [27] adopted a rough dataset and deep learning (DL) models for short-and medium-term BECP using one-year data from a laboratory building of a university in Dalian. The comparison with a BPNN, an Elman-NN, and a fuzzy NN showed that the DL models using the rough dataset produced the highest accuracy. Zhou et al. [28] compared 15 machine-learning models for short-term building heat load prediction and concluded that the tree-based models that adopted decision tree as the base model easily countered over-fitting problems, while the SVM and Gaussian process regression models produced the highest accuracy. Li et al. [29] developed a genetic algorithm (GA) and an NN BECP model for the building design and summarized that the application of GA can improve the accuracy and efficiency of BECP.
As the accuracy of single models is to some degree limited, researchers have proposed different model combinations to improve the prediction accuracy of machine-learning models [30]. For example, Li et al. [31] compared five combined models, namely an LSTM-convolutional neural network (LSTM-CNN), a CNN-LSTM, LSTM-Attention, a CNN-Attention-LSTM, and an LSTM-Attention-CNN, using data from 60 buildings. The results showed that LSTM-Attention-LSTM performed the best, as the RMSE was decreased by 5.6% compared to the LSTM alone. Li et al. [32] developed different ensemble models, including a teaching learning-based optimization algorithm-BPNN (TLBO-BPNN), a TLBO-SVM, BPNN-Adaboost, an extreme learning machine, and an RF. The results showed that the ensemble models performed much better compared to the single models in all cases. Zhang et al. [33] proposed two combined models for short-term BECP. The first was based on LSTM and BPNN and was proposed to solve the time-lag problems, while the second was based on a dimensionless sensitivity index and a weighted Manhattan distance to allow the machine-learning models to be explained more easily using domain knowledge. The results showed that the combined models achieved much higher accuracy. Overall, combined models offer an efficient way to improve the accuracy of BECP. Generally, the combined techniques used for BECP include ensemble-based models [32], cluster-based models [34], decomposition-based models, etc. The decomposition-based models are based on decomposition algorithm and machine-learning models and can process multi-resolution data by extracting the embedded information from non-stationary BECP data. It is also an efficient method for BECP. However, there is little research on decomposition-based models.
Due to the high-dimensional nonlinear characteristics of building load, traditional single models face difficulties in capturing the features of the original data accurately [35]. In recent years, researchers have proposed to decompose the time series using decomposition algorithms in an attempt to capture the different features of the original data. Such methods include the wavelet transform, empirical mode decomposition, ensemble empirical model decomposition, and so on [36]. Following this trend, ensemble empirical model decomposition with adaptive noise (CEEMDAN) has become one of the most widely used decomposition algorithms, as it does not require a predetermined mother wavelet basis and can be used to solve problems such as modal aliasing. Through the decomposition algorithm, different features of the building load data can be extracted effectively, and then, the most appropriate model for each component can be found, which leads to the improved prediction accuracy of single models [37].
Based on the above, in this study, a decomposition-based model is developed based on CEEMDAN, DL models (including CNN, LSTM, bi-directional LSTM, and gated recurrent unit (GRU)), and autoregressive integrated moving average (ARIMA) models for BECP. The one-step prediction model (single model) and two-step prediction models (CEEMDAN-DLs) are adopted for comparison. First, CEEMDAN is used to decompose the original data into n intrinsic mode functions (IMFs) and a residual (RES). Then, the sample entropy of each component is calculated, based on which the complexity of different components is determined. As for deep learning models, the deep structure can fit the highly nonlinear relationships; however, there are issues with overfitting and high computational complexity. In comparison, numerous studies have shown that the ARIMA model has been proven to be excellent in handling linear problems [28]. Furthermore, its structure is much simpler, and its computational complexity is much lower than that of DL models. Therefore, ARIMA is used to predict the low-complexity components, and DL models are used to predict the high-complexity components. Finally, the CEEMDAN-DL-ARIMA model is proposed as an integration of the prediction results of the two algorithms, and it is validated using real building energy consumption data. Consequently, in this paper, a novel approach to high-dimensional nonlinear building load forecasting is proposed. Figure 1 shows the development framework of the combined model. First, the original heating load data are decomposed into n IMFs and a RES through CEEMDAN, and the sample entropy of each component is calculated. For components with high complexity, a DL model is used for prediction, while the ARIMA model is used for components with low complexity. Finally, the prediction results of the different components are weighted linearly to obtain the final prediction results.

Empirical Model Decomposition
The decomposition algorithm can be used to extract the embedded information of non-stationary time series and obtain multi-resolution data. Generally, the time series is decomposed into several low-frequency components and one high-frequency component. Wavelet analysis is one of the most widely used decomposition algorithms. However, the wavelet basis and decomposition need to be selected manually [38], which limits the method's applicability. Empirical mode decomposition (EMD) can be used to select the wavelet basis and decomposition layers automatically [39] and has been widely adopted. However, EMD is prone to modal aliasing when there are jump changes in the time se-Sustainability 2022, 14, 7349 4 of 20 ries. Therefore, Wu et al. [40] proposed the ensemble empirical model decomposition (EEMD) model, which can eliminate modal aliasing through the addition of white noise to the time series. However, redundant white noise may exist after the reconstruction [38]. Consequently, Torres et al. [41] proposed the complete ensemble empirical model decomposition with adaptive noise (CEEMDAN). By adding adaptive noise, CEEMDAN not only avoids the problem of modal aliasing but also eliminates the redundant white noise in the reconstructed time series.

Empirical Model Decomposition
The decomposition algorithm can be used to extract the embedded information of non-stationary time series and obtain multi-resolution data. Generally, the time series is decomposed into several low-frequency components and one high-frequency component. Wavelet analysis is one of the most widely used decomposition algorithms. However, the wavelet basis and decomposition need to be selected manually [38], which limits the method's applicability. Empirical mode decomposition (EMD) can be used to select the wavelet basis and decomposition layers automatically [39] and has been widely adopted. However, EMD is prone to modal aliasing when there are jump changes in the time series. Therefore, Wu et al. [40] proposed the ensemble empirical model decomposition (EEMD) model, which can eliminate modal aliasing through the addition of white noise to the time series. However, redundant white noise may exist after the reconstruction [38]. Consequently, Torres et al. [41] proposed the complete ensemble empirical model decomposition with adaptive noise (CEEMDAN). By adding adaptive noise, CEEMDAN not only avoids the problem of modal aliasing but also eliminates the redundant white noise in the reconstructed time series.
The calculation steps of CEEMDAN are as follows: (1) White noise ωi is added into the original signal x; then, the IMFs are obtained via decomposition of the signal using EMD. Thus, the first model is calculated using the following equation: where Ei(‧) is the i-th IMF decomposed using EMD; is the white noise; and is the signal-to-noise ratio.
(2) Calculate the first residual (r1) at the first stage: (3) The signal ( + ) is further decomposed using EMD until IMF1 is obtained.  The calculation steps of CEEMDAN are as follows: (1) White noise ω i is added into the original signal x; then, the IMFs are obtained via decomposition of the signal using EMD. Thus, the first model is calculated using the following equation: where E i (·) is the i-th IMF decomposed using EMD; ω i is the white noise; and ε is the signal-to-noise ratio.
(2) Calculate the first residual (r 1 ) at the first stage: (3) The signal (r 1 + εE 1 [ω i ]) is further decomposed using EMD until IMF 1 is obtained. Then, IMF 2 can be calculated as follows: (4) The k-th residual is calculated.
(5) The realization signal (r k + εE k [ω i ]) is further decomposed using EMD, and IMF k+1 is calculated: Sustainability 2022, 14, 7349 5 of 20 (6) Repeat the above steps until the residual satisfies the iteration termination condition, i.e., when there is no further decomposition possible. Then, the final residual is: The original signal can be reconstructed as follows: CNNs consist of three main mapping layers types, namely convolutional, pooling, and fully-connected layers [42]. Convolutional layers are used to extract the features of input data; pooling layers are used to reduce the input data's dimensions, while fully-connected layers are used to produce the output. Overall, CNNs are a reliable technique for the efficient extraction of hidden features and the creation of filters according to these features. Weight sharing and local connections are two main characteristics of CNNs [43]. The topological structure of a CNN is shown in Figure 2.
(5) The realization signal ( + ) is further decomposed using EMD, and IMFk+1 is calculated: (6) Repeat the above steps until the residual satisfies the iteration termination condition, i.e., when there is no further decomposition possible. Then, the final residual is: The original signal can be reconstructed as follows:

Convolutional Neural Networks
CNNs consist of three main mapping layers types, namely convolutional, pooling, and fully-connected layers [42]. Convolutional layers are used to extract the features of input data; pooling layers are used to reduce the input data's dimensions, while fullyconnected layers are used to produce the output. Overall, CNNs are a reliable technique for the efficient extraction of hidden features and the creation of filters according to these features. Weight sharing and local connections are two main characteristics of CNNs [43]. The topological structure of a CNN is shown in Figure 2. The convolutional layer is defined as [44]: where f is the activation function; is the weight vector; x is the input data; * is the convolution operator; is a bias matrix. For the activation function, the rectified linear unit (ReLU) is adopted and calculated as follows:

… … Input layer Convolutional layer Pooling layer
Fully-connected layer Output layer The convolutional layer is defined as [44]: where f is the activation function; W k is the weight vector; x is the input data; * is the convolution operator; b k is a bias matrix. For the activation function, the rectified linear unit (ReLU) is adopted and calculated as follows:

Long-Short-Term Memory Networks
Long-and short-term memory (LSTM) has been proposed as a method to solve the time dependence problems of recursive neural networks. It is an excellent tool with strong memory for long time-series prediction. LSTMs consist of four gates, including the input, forget, control, and output gate [31]. The topological structure of the LSTM is shown in Figure 3.

Long-Short-Term Memory Networks
Long-and short-term memory (LSTM) has been proposed as a method to solve the time dependence problems of recursive neural networks. It is an excellent tool with strong memory for long time-series prediction. LSTMs consist of four gates, including the input, forget, control, and output gate [31]. The topological structure of the LSTM is shown in Figure 3. (1) Depending on the input xt and the state of the last hidden layers ht−1, the LSTM can determine the information to be thrown away by forget gate ft: where is the activation function, is the weight matrix, and is the bias vector.
(2) The last output and the current input value are transferred to the input gate. Then, the output it and candidate state are generated.
= tanh • ℎ , + (3) The current state of cell Ct is updated through the integration of the input of the forget gate with the last state of the cell.
(4) Finally, the output gate ot is calculated using ht-1 and xt. Then, the final output ht is obtained from the output gate.

Gated Recurrent Unit
The structure of the GRU is similar to that of LSTM, but there is no cell that stores the state. The GRU instead has a reset gate and an update gate, which are used to determine the information being stored or thrown away [45]. The reset gate allows the dismissal of

Input gate
Forget gate Output gate (1) Depending on the input x t and the state of the last hidden layers h t−1 , the LSTM can determine the information to be thrown away by forget gate f t : where σ is the activation function, ω f is the weight matrix, and b f is the bias vector.
(2) The last output and the current input value are transferred to the input gate. Then, the output i t and candidate state C t are generated.
(3) The current state of cell C t is updated through the integration of the input of the forget gate with the last state of the cell.
(4) Finally, the output gate o t is calculated using h t−1 and x t . Then, the final output h t is obtained from the output gate.

Gated Recurrent Unit
The structure of the GRU is similar to that of LSTM, but there is no cell that stores the state. The GRU instead has a reset gate and an update gate, which are used to determine the information being stored or thrown away [45]. The reset gate allows the dismissal of irrelevant information from the hidden state, while the update gate controls how much information from the previous hidden state (h t−1 ) can be stored in the current hidden state (h t ) [46]. The general topological structure of the GRU is shown in Figure 4. irrelevant information from the hidden state, while the update gate controls how much information from the previous hidden state (ht−1) can be stored in the current hidden state (ht) [46]. The general topological structure of the GRU is shown in Figure 4. The reset gate rt is defined as follows: The update gate zt is defined as follows: where is the new input; , , , and are the weight matrices; , and are the bias vectors. The output ht is calculated as follows: where ℎ is the candidate state, calculated as follows:

Bidirectional Long-Short-Term Memory Network (Bi-LSTM)
Bi-LSTM transfers the input data at each time to the network for calculation. Each hidden layer contains two LSTM units. The first is the forward LSTM, which receives the output data corresponding to the previous instance and forwards it to the next instance after calculation to ensure that past information is taken into account. The other is the backward LSTM, which receives the output data of the next instance, calculates, and sends it back to the previous instance to capture future information in current calculations [47]. Bi-LSTM's topological structure is shown in Figure 5. The reset gate r t is defined as follows: The update gate z t is defined as follows: where x t is the new input; W r , W z , U r , and U z are the weight matrices; b r , and b z are the bias vectors. The output h t is calculated as follows: where h t is the candidate state, calculated as follows:

Bidirectional Long-Short-Term Memory Network (Bi-LSTM)
Bi-LSTM transfers the input data at each time to the network for calculation. Each hidden layer contains two LSTM units. The first is the forward LSTM, which receives the output data corresponding to the previous instance and forwards it to the next instance after calculation to ensure that past information is taken into account. The other is the backward LSTM, which receives the output data of the next instance, calculates, and sends it back to the previous instance to capture future information in current calculations [47]. Bi-LSTM's topological structure is shown in Figure 5. Defining the last hidden vector as ℎ ⃗ and the next hidden vector as ℎ ⃖ , the output of bi-LSTM is calculated as follows: where and are the constants and + = 1.

Input layer
Output layer Defining the last hidden vector as → h and the next hidden vector as ← h , the output of bi-LSTM is calculated as follows: where α and β are the constants and α + β = 1.

ARIMA Model
The autoregressive integrated moving average (ARIMA) model converts a non-stationary time series into a stationary one through the d-order difference between an autoregressive model (AR) and a moving average model (MA) [48]. Generally, it is represented as ARIMA (p, q, d), where p is the order of the AR model; q is the order of the MA model; and d is the difference order.
The AR model captures the relationship between the values at the current time and the previous p instances. It can be expressed as: where x t is the current value; µ is a constant; Y i is the correlation coefficients; and ε t is the error. The MA model is the weighted sum of white noises: where θ is a coefficient. The combined AR (p) and MA (q) models can be obtained from the following equation: The difference order can be determined using Bayesian information criterion (BIC): where M is the optimal log likelihood value, and N is the number of samples.

Data Description
The research data were obtained from the monitoring data of the heating pipe network of Xi'an University of Architecture and Technology, China. The total heating area was 392,219.63 m 2 and included student dormitories, restaurants, teaching buildings, office buildings, and hospitals. The research data include hourly heating load data covering from 2 January to 2 March 2016. Figure 6 shows the heating load and the corresponding IMFs and RES decomposed using the CEEMDAN algorithm. The training and testing dataset are decomposed separately. First, the training dataset was decomposed, and then, the testing dataset was decomposed based on the training dataset. The total number of heating load samples was 1480, and 991 samples were used as training data, while the remaining 489 samples were taken as test data. The heating loads of the training dataset varied from 3743.9-5707.8 kWh with an average value of 7518.6 kWh, while the range of the test dataset was 3313.8-4764.0 kWh with an average value of 5612.9 kWh.
IMFs and RES decomposed using the CEEMDAN algorithm. The training and testing dataset are decomposed separately. First, the training dataset was decomposed, and then, the testing dataset was decomposed based on the training dataset. The total number of heating load samples was 1480, and 991 samples were used as training data, while the remaining 489 samples were taken as test data. The heating loads of the training dataset varied from 3743.9-5707.8 kWh with an average value of 7518.6 kWh, while the range of the test dataset was 3313.8-4764.0 kWh with an average value of 5612.9 kWh.

Correlation Analysis
Methods based on correlation coefficients are the most widely used for feature selection. Among these methods, Pearson's correlation coefficient is one of the most popular options. For example, Zhang et al. [34] adopted Pearson's correlation coefficient to determine the key impact parameters of day-ahead and intra-day heating loads. However, generally, Pearson's correlation coefficient can only be used to describe linear correlations [9]. For the building heating load, due to the thermal inertial of buildings, the hourly-ahead heating load presents high nonlinear correlation with historic heating loads. As a result, Pearson's correlation coefficient's applicability is limited in building heating load prediction. In contrast, Spearman's correlation coefficient can describe the nonlinear correlation between two parameters. It is a nonparametric index to measure the dependence of two variables, and it has the advantages of universality, low complexity, and high efficiency. Therefore, in this study, Spearman's correlation coefficient was adopted. For a sample size N, the N original samples are converted into rank data, and the correlation coefficient can be calculated through the following equation:

Correlation Analysis
Methods based on correlation coefficients are the most widely used for feature selection. Among these methods, Pearson's correlation coefficient is one of the most popular options. For example, Zhang et al. [34] adopted Pearson's correlation coefficient to determine the key impact parameters of day-ahead and intra-day heating loads. However, generally, Pearson's correlation coefficient can only be used to describe linear correlations [9]. For the building heating load, due to the thermal inertial of buildings, the hourly-ahead heating load presents high nonlinear correlation with historic heating loads. As a result, Pearson's correlation coefficient's applicability is limited in building heating load prediction. In contrast, Spearman's correlation coefficient can describe the nonlinear correlation between two parameters. It is a nonparametric index to measure the dependence of two variables, and it has the advantages of universality, low complexity, and high efficiency. Therefore, in this study, Spearman's correlation coefficient was adopted. For a sample size N, the N original samples are converted into rank data, and the correlation coefficient ρ can be calculated through the following equation: where X i and Y i represent the independent and dependent variables, respectively. x i and y i represent the average values of independent and dependent variables. Figure 7 illustrates the Spearman correlation coefficients between the hour-ahead heating load with the 12-h historic heating loads. It is evident that the historic heating loads have a high correlation with the hour-ahead heating load, as the corresponding Spearman correlation coefficient values are higher than 0.70. For the 2-h historic heating load in particular, the corresponding values are above 0.90. As values further back in time are considered, the correlation coefficient decreases gradually. Considering the model complexity and accuracy comprehensively, the historical heating loads with correlation coefficients above 0.80 were selected as input parameters. That is, the 7-h historical heating loads were selected as model input parameters.
yi represent the average values of independent and dependent variables. Figure 7 illustrates the Spearman correlation coefficients between the hour-ahead heating load with the 12-h historic heating loads. It is evident that the historic heating loads have a high correlation with the hour-ahead heating load, as the corresponding Spearman correlation coefficient values are higher than 0.70. For the 2-h historic heating load in particular, the corresponding values are above 0.90. As values further back in time are considered, the correlation coefficient decreases gradually. Considering the model complexity and accuracy comprehensively, the historical heating loads with correlation coefficients above 0.80 were selected as input parameters. That is, the 7-h historical heating loads were selected as model input parameters. Figure 7. Correlation analysis between current heating load and historical heating loads.

Model Evaluation
Four evaluation indicators, namely the coefficient of determination (R 2 ), the root mean square error (RMSE), the coefficient of RMSE variation (CV-RMSE), and the mean bias error (MBE), were adopted in this study. The respective calculation equations are as follows [49]: where Yi,m is the measured value, Yi,c is the predicted value, and n is the number of the observations. R 2 represents the ratio of the sum of regression squares to the total sum of squares, and values closer to 1 correspond to better performance. The RMSE describes the short-term performance of the model by comparing the deviation between the predicted and true values. CV-RMSE characterizes the performance of the model depending on its

Model Evaluation
Four evaluation indicators, namely the coefficient of determination (R 2 ), the root mean square error (RMSE), the coefficient of RMSE variation (CV-RMSE), and the mean bias error (MBE), were adopted in this study. The respective calculation equations are as follows [49]: where Y i,m is the measured value, Y i,c is the predicted value, and n is the number of the observations. R 2 represents the ratio of the sum of regression squares to the total sum of squares, and values closer to 1 correspond to better performance. The RMSE describes the short-term performance of the model by comparing the deviation between the predicted and true values. CV-RMSE characterizes the performance of the model depending on its value as follows: when it is less than 10%, the accuracy of the model is considered to be excellent; between 10-30%, the model's accuracy is considered good; when it is greater than 30%, the accuracy of the model is considered poor [50]. MBE describes the long-term performance of the model.

Sample Entropy Analysis
Sample entropy is a dimensionless parameter that measures the complexity of a time series through the measurement of the probability of new patterns occurring in the signal. Greater pattern-generation probabilities correspond to more complex sequences [51]. In this study, sample entropy was adopted to measure the complexity of the components. Figure 8 shows the values of sample entropy of the components decomposed using CEEMDAN. value as follows: when it is less than 10%, the accuracy of the model is considered to be excellent; between 10−30%, the model's accuracy is considered good; when it is greater than 30%, the accuracy of the model is considered poor [50]. MBE describes the long-term performance of the model.

Sample Entropy Analysis
Sample entropy is a dimensionless parameter that measures the complexity of a time series through the measurement of the probability of new patterns occurring in the signal. Greater pattern-generation probabilities correspond to more complex sequences [51]. In this study, sample entropy was adopted to measure the complexity of the components. Figure 8 shows the values of sample entropy of the components decomposed using CEEMDAN. Comparing the change trends of the sample entropies of the components, if the differences between the sample entropies of the IMFs and that of RES are large, or the change range of the IMFs' sample entropy suddenly becomes larger, then the last sequence between the two IMFs is used as the first sequence predicted by the ARIMA model, and the RES is used as the last sequence predicted by ARIMA model [52]. As seen in Figure 7, IMF5-RES are suitable for analysis using the ARIMA model, and IMF1−IMF4 are suitable for analysis using the DL models.

Model Development and Parameter Setting
In this study, four DL models, namely the CNN, LSTM, bi-LSTM, and GRU models, were adopted to develop the combined models. The process comprises three steps: (1) Data preprocessing: The CEEMDAN algorithm is applied to decompose the heating load into six IMFs and one RES. Then, the sample entropies are calculated, and the corresponding prediction models of each component are determined. Comparing the change trends of the sample entropies of the components, if the differences between the sample entropies of the IMFs and that of RES are large, or the change range of the IMFs' sample entropy suddenly becomes larger, then the last sequence between the two IMFs is used as the first sequence predicted by the ARIMA model, and the RES is used as the last sequence predicted by ARIMA model [52]. As seen in Figure 7, IMF5-RES are suitable for analysis using the ARIMA model, and IMF1−IMF4 are suitable for analysis using the DL models.

Model Development and Parameter Setting
In this study, four DL models, namely the CNN, LSTM, bi-LSTM, and GRU models, were adopted to develop the combined models. The process comprises three steps: (1) Data preprocessing: The CEEMDAN algorithm is applied to decompose the heating load into six IMFs and one RES. Then, the sample entropies are calculated, and the corresponding prediction models of each component are determined. The adaptive moment estimation (Adam) method is adopted to optimize the weights of the CNN, LSTM, bi-LSTM, and GRU networks. The reader can refer to [45] for an introduction to Adam. The hyper-parameters optimized for the four DL models are shown in Table 1. In this study, the Tree of Parzen Estimators (TPE) [53] method was used to optimize the hyper parameters, where the number of maximum estimations was set to be 100.
The CNN, LSTM, GRU, and bi-LSTM are one-step prediction models and were adopted to predict the heating load without any data preprocessing. CEEMDAN-CNN, CEEMDAN-LSTM, CEEMDAN-GRU, and CEEMDAN-bi-LSTM are the two-step prediction models based on the CEEMDAN algorithm, where the components of the heating load, including the IMFs and RES were decomposed using CEEMDAN algorithm and then predicted by the DL models. CEEMDAN-CNN-ARIMA, CEEMDAN-LSTM-ARIMA, CEEMDAN-GRU-ARIMA, and CEEMDAN-bi-LSTM-ARIMA are the three-step prediction models, which combine the DL models, the CEEMDAN algorithm, and the ARIMA model, where the ARIMA model is adopted to predict the components of low frequencies instead of the DL models. Table 2 shows the evaluation indicators of the one-, two-, and three-step prediction models. Overall, all models achieved high accuracy.   Figure 9 shows the prediction results of the one-step prediction models. For the CNN model, it can be seen in the diagram of Figure 8 that before the 350th heating load sample, the predicted heating load showed good agreement with the measured values. However, after the 350th sample, the agreement between them gradually deteriorated, which was the cause of the lowest R 2 values. The main reason is that in the early stage of testing, the heating load change trend is similar to that in the training phases. Thus, the CNN model could fit the training set very well through its multi-layer structure. That is, when the change trend in the test phase is similar to that of the training phase, the CNN models will produce excellent predictions. However, as time progressed, the temperature gradually warmed up, the heating load gradually decreased, and the change trend of the heating load began to change. Because the CNN model does not have any memory, it is difficult to capture the change trend of long-lasting series. As a result, the coincidence of CNN prediction and measured values declined after that point.

Model Performance Comparison
Although the change trend of the LSTM model's heating load prediction was similar to that of the measured values at all stages, its deviation degree was high. Therefore, the R 2 value of LSTM was higher than that of CNN, while the RMSE and CV-RMSE were higher than those of CNN. Moreover, it can be seen from Figure 8 that the coincidence degree of the LSTM continued to improve as time progressed. This is because the LSTM has a memory function and continues to learn through the prediction, so its performance improves continuously. Similarly, the performance of bi-LSTM and GRU also improved continuously, and the agreement between heating load predictions and measured values was very high in all testing phases.
Comparing the GRU and the bi-LSTM, it can be seen that the GRU had higher R 2 value, while the bi-LSTM yielded lower RMSE, CV-RMSE, and MBE. Building heating load is related to climate conditions, historical heating loads, building characteristics, and other factors, and its randomness is strong. The bi-LSTM generates the current prediction value considering the past and future information of time series, so its prediction is relatively smooth, resulting in poor consistency of the change trend between the predicted and the measured values compared with the GRU. However, because the bi-LSTM integrates these additional aspects of the time series, its prediction deviation is smaller than that of the GRU, and the prediction error is smaller. Thus, the RMSE, CV-RMSE, and MBE of bi-LSTM were lower than the corresponding values of the GRU. Figure 10 shows a comparison of predicted and measured heating load of the two-step prediction models, in which it can be easily seen that the performance was greatly improved compared with one-step prediction models. The average R 2 , RMSE, CV-RMSE, and MBE were 0.980, 143.72 kWh, 3.02%, and 88.88 kWh, respectively, improved by 2.78%, 31.03%, 31.01%, and 5.08% compared to the corresponding one-step prediction models. Similarly, CEEMDAN-GRU produced the highest R 2 at a value of 0.988, while CEEMDAN-bi-LSTM yielded the lowest RMSE and CV-RMSE with values of 92.27 kWh and 1.94%, respectively. Figure 11 shows a comparison of the predicted and measured heating load of the three-step prediction models. In a similar fashion to the previous comparison, compared with two-step prediction models, the performance was greatly improved, with average R 2 = 0.981, RMSE = 103.10 kWh, CV-RMSE = 2.16%, and MBE = 55.58 kWh. These values were improved by 0.08%, 28.26%, 28.28%, and 39.71%, respectively. CEEMDAN-CNN-ARIMA had worst performance; the R 2 value of CEEMDAN-GRU-ARIMA was the highest one; CEEMDAN-bi-LSTM-ARIMA produced the best performance with the lowest RMSE, CV-RMSE, and MBE. Figures 12-14 show the performance improvement percentages of the models investigated in this paper. It is evident that with regard to prediction accuracy, increasing the number of steps generally improved the prediction results. Furthermore, one-step models with worse performance showed higher performance improvement when used in combined models. For example, compared with one-step models, CNN models yielded the lowest R 2 , while GRU models have highest R 2 . The performance improvement percentages of the two-and three-step models were 5.45% and 5.45%, respectively. In contrast, the corresponding values for GRU were 0.92% and 0.92%, respectively. value considering the past and future information of time series, so its prediction is relatively smooth, resulting in poor consistency of the change trend between the predicted and the measured values compared with the GRU. However, because the bi-LSTM integrates these additional aspects of the time series, its prediction deviation is smaller than that of the GRU, and the prediction error is smaller. Thus, the RMSE, CV-RMSE, and MBE of bi-LSTM were lower than the corresponding values of the GRU.  Figure 10 shows a comparison of predicted and measured heating load of the two-step prediction models, in which it can be easily seen that the performance was greatly improved compared with one-step prediction models. The average R 2 , RMSE, CV-RMSE, and MBE were 0.980, 143.72 kWh, 3.02%, and 88.88 kWh, respectively, improved by 2.78%, 31.03%, 31.01%, and 5.08% compared to the corresponding one-step prediction models. Similarly, CEEMDAN-GRU produced the highest R 2 at a value of 0.988, while CEEMDAN-bi-LSTM yielded the lowest RMSE and CV-RMSE with values of 92.27 kWh and 1.94%, respectively.  Figure 11 shows a comparison of the predicted and measured heating load of the th step prediction models. In a similar fashion to the previous comparison, compared with t step prediction models, the performance was greatly improved, with average R 2 = 0. RMSE = 103.10 kWh, CV-RMSE = 2.16%, and MBE = 55.58 kWh. These values were impro by 0.08%, 28.26%, 28.28%, and 39.71%, respectively. CEEMDAN-CNN-ARIMA had w performance; the R 2 value of CEEMDAN-GRU-ARIMA was the highest one; CEEMDAN LSTM-ARIMA produced the best performance with the lowest RMSE, CV-RMSE, and M  Figures 12-14 show the performance improvement percentages of the models in tigated in this paper. It is evident that with regard to prediction accuracy, increasing     Figure 12. Improvement percentage of R 2 between different models investigated in this study. Notes: Improvement of R 2 = (R 2 in columns − R 2 in rows)/R 2 in columns.

Improvements of Combined Models
Sustainability 2022, 14, x FOR PEER REVIEW 17 of 21 Figure 12. Improvement percentage of R 2 between different models investigated in this study. Notes: Improvement of R 2 = (R 2 in columns − R 2 in rows)/R 2 in columns.

Conclusions
Accurate prediction of building energy consumption is of great significance to building energy management systems. In this paper, a model combining DL models (including CNN, LSTM, bi-LSTM, GRU) and the ARIMA model is proposed based on CEEMDAN. The CEEMDAN model was used to decompose the original heating load data, while the DL models were used to predict complex high-frequency subsequences; the ARIMA model was used to predict relatively stable low-frequency subsequences. Through the validation of models using actual heating load data, the following conclusions are obtained: (1) At the initial stage of testing, the change trend of the heating load is similar to that observed during training phases, and the CNN model could follow the heating load due to its deep structure. Therefore, at the initial stage of testing, the CNN had the best performance. As time progressed, the heating load gradually decreased, the change trend became less familiar to the model, and the CNN prediction performance decreased. In comparison, the LSTM, bi-LSTM, and GRU have memory functions resulting in their prediction performance improving over time. The Bi-LSTM had the best comprehensive performance because it can integrate information from the previous and the following samples of the time series. (2) Combined models performed better than single models. Compared with the single model, the average performance improvement percentages of R 2 , RMSE, and CV-RMSE of the two-step models combining CEEMDAN and DL models were 2.83%, 30.22%, and 30.15%, respectively. The corresponding values of the three-step models, which combined CEEMDAN, DL, and ARIMA, were 2.91%, 47.93%, and 47.92%, respectively. (3) Among all models, the CEEMDAN-Bi-LSTM-ARIMA model had the best performance. ARIMA model can predict the low-frequency subsequences decomposed by CEEMDAN effectively and reduce the prediction error of the bi-LSTM on these lowfrequency subsequences. This resulted in an improvement in the overall prediction accuracy of the hybrid model.
In conclusion, building energy consumption has high-dimensional nonlinearity and randomness. The CEEEMD-DL-ARIMA model can not only solve the problem of BECP but also solve other time series prediction problems, such as the solar radiation prediction  Figure 14. Improvement percentage of RMSE between different models investigated in this study (%). Notes: Improvement of CV-RMSE = (CV-RMSE in columns − CV-RMSE in rows)/CV-RMSE in columns.
Compared to one-step models, the performance improvement percentages of R 2 , RMSE, and CV-RMSE of the two-step models were 2.83%, 30.22%, and 30.15%, respectively, and 2.91%, 47.93%, and 47.92% for the three-step models.

Conclusions
Accurate prediction of building energy consumption is of great significance to building energy management systems. In this paper, a model combining DL models (including CNN, LSTM, bi-LSTM, GRU) and the ARIMA model is proposed based on CEEMDAN. The CEEMDAN model was used to decompose the original heating load data, while the DL models were used to predict complex high-frequency subsequences; the ARIMA model was used to predict relatively stable low-frequency subsequences. Through the validation of models using actual heating load data, the following conclusions are obtained: (1) At the initial stage of testing, the change trend of the heating load is similar to that observed during training phases, and the CNN model could follow the heating load due to its deep structure. Therefore, at the initial stage of testing, the CNN had the best performance. As time progressed, the heating load gradually decreased, the change trend became less familiar to the model, and the CNN prediction performance decreased. In comparison, the LSTM, bi-LSTM, and GRU have memory functions resulting in their prediction performance improving over time. The Bi-LSTM had the best comprehensive performance because it can integrate information from the previous and the following samples of the time series. (2) Combined models performed better than single models. Compared with the single model, the average performance improvement percentages of R 2 , RMSE, and CV-RMSE of the two-step models combining CEEMDAN and DL models were 2.83%, 30.22%, and 30.15%, respectively. The corresponding values of the threestep models, which combined CEEMDAN, DL, and ARIMA, were 2.91%, 47.93%, and 47.92%, respectively. (3) Among all models, the CEEMDAN-Bi-LSTM-ARIMA model had the best performance. ARIMA model can predict the low-frequency subsequences decomposed by CEEMDAN effectively and reduce the prediction error of the bi-LSTM on these low-frequency subsequences. This resulted in an improvement in the overall prediction accuracy of the hybrid model.
In conclusion, building energy consumption has high-dimensional nonlinearity and randomness. The CEEEMD-DL-ARIMA model can not only solve the problem of BECP but also solve other time series prediction problems, such as the solar radiation prediction problem. However, the different components of the model are still summed using a simple linear weighting. As the time series changes dynamically, the weight of each component also changes dynamically as time progresses. Therefore, in future research, a variable weight combination method should be considered to improve the prediction accuracy of building energy consumption.

Data Availability Statement:
The data were obtained from the monitoring system of the heating pipe network in Xi'an University of Architecture and Technology, China. Data are available on request to the authors.

Conflicts of Interest:
The authors declare no conflict of interest.