Predicting the Energy Consumption of Commercial Buildings Based on Deep Forest Model and Its Interpretability

Abstract: Building energy assessment models are considered to be one of the most informative methods in building energy efficiency design, and most of the current building energy assessment models have been developed based on machine learning algorithms. Deep learning models have proved their effectiveness in fields such as image and fault detection. This paper proposes a deep learning energy assessment framework with interpretability to support building energy efficiency design. The proposed framework is validated using the Commercial Building Energy Consumption Survey dataset, and the results show that the wrapper feature selection method (Sequential Forward Generation) significantly improves the performance of deep learning and machine learning models compared with the filter (Mutual Information) and embedded (Least Absolute Shrinkage and Selection Operator) feature selection algorithms. Moreover, the Deep Forest model has an R² of 0.90 and outperforms the Deep Multilayer Perceptron, the Convolutional Neural Network, the Back-propagation Neural Network, and the Radial Basis Function Network in terms of prediction performance. In addition, the model interpretability results reveal how the features affect the prediction results and the contribution of the features to the energy consumption in a single building sample. This study helps building energy designers assess the energy consumption of new buildings and develop improvement measures.


Introduction
According to the Global Alliance for Building and Construction (GABC) Global Status Report, buildings account for approximately 40% of global energy use and 38% of global greenhouse gas (GHG) emissions [1]. Commercial buildings play a crucial role as the primary contributors to energy consumption and greenhouse gas emissions in the sector. In line with this, the World Green Building Council (World GBC) has emphasized the urgency of adopting net-zero carbon buildings as a standard practice in the commercial sector, with a target starting in 2030 [2]. Accurately predicting building energy consumption and developing various building energy efficiency strategies (e.g., optimization of air conditioning systems, lighting systems, etc.) is the best way to reduce greenhouse gas emissions from buildings. It is of great practical importance to study the mechanisms and patterns of building energy consumption and to develop accurate and effective building energy prediction models to assess the average annual energy consumption at the beginning of building design. Current building energy consumption prediction methods are divided into three main categories: physical modeling (white-box model), data-driven modeling (black-box model), and hybrid modeling (gray-box model) [3,4]. Physical modeling methods use thermodynamic principles for energy consumption modeling and analysis, such as thermodynamic models [5], computational fluid dynamics models (CFD) [6], etc. Accurate simulations usually require the input of very detailed building information, such as individual spatial characteristics, the thermal properties of building materials, etc., which are often difficult to obtain [7]. In addition, physical modeling is often applied to individual buildings or small subsets of buildings and is not ideal for depicting the spatial variability of energy use in neighborhoods, blocks, or broader areas [8]. 
Compared with physical modeling approaches, data-driven approaches do not require extensive expertise in parametric mechanisms and internal building components and use historical data for energy consumption prediction with more accurate and faster predictions [9]. Hybrid approaches involve a combination of both physical and data-driven methods and use the outputs of physical models as inputs to data-driven models [10]. These models aim to offset some of the limitations involved in physical modeling through the flexibility of statistical methods [11]. However, this method requires more computational resources and has high runtimes and computational costs as it requires running two models simultaneously. In addition, since two different types of models are involved, the predicted results may be more difficult to interpret and understand.
Energy consumption in buildings is influenced by a variety of factors, such as building type, climatic conditions, building structure, heating system, and human activities. Therefore, predicting building energy consumption is a complex and variable problem. Applying machine learning models to predict building energy consumption is more advantageous than using traditional statistical models. For example, Arjunan [12] compared the performance of multiple linear regression (MLR), multiple linear regression with feature interactions (MLRi), and gradient boosted tree (GBT) and showed that GBT outperformed the two linear regression models. Chen et al. [13] developed a neural network model specifically for high-rise buildings, showcasing its accuracy, computational speed, and flexibility compared with conventional building simulation methods. This model enables designers to predict building performance during the early stages of design, thereby improving energy efficiency and occupant comfort. Fu et al. [14] applied a clustering method to study the factors influencing energy consumption in a school building, and the results showed that occupancy rate was the main influencing factor.
In addition to the above machine learning models, many researchers have begun to explore the effectiveness of deep learning for building energy consumption prediction and have achieved some results. For example, Razak's study [15] compared the performance of eight machine learning models and one deep learning model in predicting the annual energy consumption of residential buildings. The results showed that the deep neural network (DNN) outperformed the other machine learning models with an R² of 0.95 and an RMSE of 1.19, which motivates building designers to use it to make informed decisions and to manage and optimize their designs prior to construction. In Moisés' study [3], the hourly electricity consumption of a single-family home was predicted using a variety of machine learning and deep learning models, and the results showed that LSTM had the best prediction performance (nRMSE of 4.74%). In addition to predicting residential energy consumption, LSTM also shows excellent results in predicting the energy consumption of other building types. For example, the LSTM models developed in the studies of Kim [16] and Li [17] achieved better performance in predicting the energy consumption of hospital buildings. The study of Dinh [18] also demonstrated the effectiveness of LSTM in predicting the energy consumption of a commercial building.
Although a large number of studies [19][20][21] have shown that LSTM has promising performance in predicting energy consumption in buildings, it is limited to predicting short-term energy consumption (energy consumption with a time granularity of daily or hourly) [18]. From the existing literature, most of the previous studies have been conducted on a specific building or a small number of buildings in a region, which provides limited guidance on energy efficiency for other building types and poor scalability. There is a lack of more generalized predictive models, and with the increased availability of data on actual energy consumption, there is now ample opportunity to apply advanced techniques to large datasets in order to build more generalized predictive models [22].
The Commercial Building Energy Consumption Survey (CBECS) [23] is the most comprehensive publicly available data set on commercial building energy use in the United States, originally developed to make statistical inferences about the national commercial building population [24]. The CBECS is also attractive to building energy modelers due to the ample amount of data, the representativeness of the sample, and the broad geographic coverage [22]. Deng et al. [22] predicted the total EUI, heating EUI, cooling EUI, and plug load EUI of an office building using the CBECS dataset, and the results showed that the RF model had an RMSE of 28.3, which was superior to other machine learning models. Norouziasl et al. proposed a data-driven prediction framework based on the energy consumption of lighting in office buildings. The results showed that the Support Vector Machine (SVM) algorithm provided the best prediction performance with an R² of 0.78. Robinson et al. [11] used the CBECS dataset to develop separate prediction models for 18 building types, including office buildings, hotels, and educational buildings, using machine learning models such as linear regression, random forest, and gradient boosting. The results showed that Extreme Gradient Boosting (XGBoost) has a higher prediction performance (R² of 0.82) than other machine learning models. However, this method is time-consuming and has low scalability as it has to model each building type individually. Kumar et al. [25] developed a Random Forest prediction model for the fuel consumption of an entire commercial building, which has better prediction and scalability with training and validation delays of only 0.82 s and 1.14 s, respectively.
Despite the widespread use of various data-driven models in the field of building energy consumption, machine learning methods remain the dominant approach due to limitations in data volume and dimensionality. However, numerous studies have shown the great potential of deep learning methods in this field [26][27][28][29]. For instance, Goodfellow et al. [30] argued that having 5000 complete labeled samples can lead to improved results for deep learning algorithms. Nevertheless, deep learning methods also have their limitations and face significant challenges in interpreting predicted results. Furthermore, most prediction models are built for a single building type, lacking the ability to characterize other similar buildings and limiting their guidance for energy saving and emission reduction. Therefore, this study aims to develop a model based on Deep Forest (DF) and the SHAP value theory for assessing energy consumption in multiple types of commercial buildings. In summary, the novelty of this study is as follows:
• Current deep neural network algorithms in the field of building energy consumption prediction achieve good prediction performance but sacrifice interpretability. The Deep Forest algorithm proposed in this paper fills the gap of poor interpretability of deep learning in the field of building energy consumption prediction;
• Unlike the predictive models developed in the literature based on a single or a few building types, the building energy assessment model in this paper was developed based on 20 commercial building types, such as office buildings, warehouses, and schools. It has broader adaptability to provide energy consumption prediction and energy saving recommendations for 20 commercial building types.
The rest of this study is organized as follows. Section 2 describes the proposed modeling framework, dataset, and related theories for the energy consumption assessment of commercial buildings. Section 3 demonstrates the performance evaluation and DF model interpretation results of various deep learning models used to predict energy consumption. Finally, the conclusion of this study is given in Section 4.

Materials and Methods
The building energy consumption assessment model proposed in this paper is a deep learning-based prediction method constructed on DF. The framework of this study is shown in Figure 1. First, the CBECS dataset is pre-processed with missing value processing, outlier processing, and constant value processing. Subsequently, feature selection is performed after removing a large number of redundant features using Spearman feature filtering, including three different types of feature selection methods: mutual information feature selection (MI), forward feature selection (SFS), and Least Absolute Shrinkage and Selection Operator (LASSO). After extracting features using three different selection methods, these features are used as inputs for energy consumption prediction in three deep learning and two machine learning models. The prediction effects of combining various feature selection algorithms and prediction models are then compared to develop an effective energy consumption assessment model. Finally, by using SHAP values for the interpretability study of the established prediction model, the main influencing factors are analyzed, the causes of high energy consumption are identified, and the model prediction results and analysis conclusions are used as feedback on which to formulate energy-saving strategies that will inversely guide the energy-saving design of new buildings.
Figure 1. The framework of the energy consumption assessment model for commercial buildings.

Data Description
CBECS 2012 is the largest commercial building energy survey conducted by the U.S. Energy Information Administration (EIA) to date, providing data on the annual energy consumption of over 6700 commercial buildings [22]. This dataset represents approximately 5.6 million commercial buildings across the United States [31]. Figure 2a displays the distribution of different building types, including office buildings, school buildings, and shopping malls, among others. Office buildings are of particular interest to most researchers due to their higher representation compared with other building types. The dataset comprises more than 1181 features related to four types of energy consumption, including heating, cooling, ventilation, and others. Figure 2b illustrates the percentage of energy consumption in commercial buildings, with fuel and electricity accounting for over 80% of the total energy consumption and serving as the primary sources of energy consumption and emissions. The remaining two energy sources contribute to only 17.3% of the total energy consumption. In this study, due to a significant number of missing values for fuel and natural gas consumption and their relatively low contribution to the total energy consumption (less than 20%), this paper primarily focuses on combining fuel and electricity consumption to represent the total energy consumption.

Data Pre-Processing
While the CBECS 2012 dataset provides valuable insights for energy efficiency analysis and prediction research, it is not exempt from certain limitations. These include a considerable number of missing values, the presence of outliers, and inconsistencies within the data. These factors can potentially compromise the accuracy and reliability of the analysis. Therefore, it is essential to preprocess the raw data before conducting any data analysis. The preprocessing phase involves various operations, such as data cleaning, missing value imputation, data standardization, and data filtering. These measures are implemented to enhance the quality and reliability of the data, enabling improved data analysis and inference. In this study, the following preprocessing steps were executed on the CBECS 2012 dataset:
• Empirical removal of variables not relevant to predicting energy consumption, e.g., 'Imputed roof replacement' used to describe whether the roof parameter input is estimated or measured data, or the removal of constant and quasi-constant parameters (<5% change), leaving 614 features;
• To reduce the uncertainty of the data, 162 variables with more than 80% missing values were removed to ensure data quality, leaving 452 features;
• Missing values were filled using 0 values because missing values are not applicable in the database and 0 values have no significant effect on the results in regression prediction; in addition, samples with outliers were directly removed to preserve the characteristics of the data;
• In order to better satisfy the model's requirement that the data conform to the properties of a normal distribution, this paper used Z-Score normalization to process the data so that they are close to a normal distribution.
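The preprocessing steps listed above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' pipeline: the column-oriented table layout, the helper names, and the toy values are assumptions. Only the 80% missing-value threshold, the zero filling, and the Z-score step come from the text; the empirical variable removal and the outlier removal are omitted.

```python
import math

def drop_sparse_columns(table, max_missing=0.8):
    """Keep only columns whose share of missing values (None) is at most max_missing."""
    kept = {}
    for name, values in table.items():
        missing_share = sum(v is None for v in values) / len(values)
        if missing_share <= max_missing:
            kept[name] = values
    return kept

def fill_missing_with_zero(table):
    """Replace every missing value (None) with 0.0."""
    return {name: [0.0 if v is None else float(v) for v in values]
            for name, values in table.items()}

def z_score(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]

# Hypothetical toy table: column name -> one value per building.
raw = {
    "SQFT": [1000, 2500, None, 4000, 5500, 7000],
    "RARELY_REPORTED": [None, None, None, None, None, 1],  # 5/6 missing, above 80%
}
cleaned = fill_missing_with_zero(drop_sparse_columns(raw))
standardized = {name: z_score(values) for name, values in cleaned.items()}
```

Note that Z-score standardization rescales the data but does not change the shape of its distribution, so a skewed variable remains skewed after this step.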

Feature Selection
During the development of an energy consumption assessment model, feature selection plays a crucial role as it directly impacts the model's performance [32]. Building energy consumption features often exhibit high dimensionality, and as the number of features increases, the density of training samples dramatically decreases, leading to an elevated risk of overfitting. Moreover, high-dimensional features require more computational resources and longer training times, which can increase the cost of the machine learning problem. Currently, feature selection methods can be classified into three types: filter, wrapper, and embedded methods [33]. In this study, we applied and compared these three typical feature selection methods to evaluate their performance.
MI (Mutual Information) is a commonly used filter feature selection method that measures the dependence between two variables. Specifically, the MI method calculates the mutual information I(X, Y) between two variables X and Y, which measures their degree of interdependence. In feature selection, we usually represent the original dataset as an n × p matrix, where n denotes the number of samples and p denotes the number of features. For each feature, the MI method calculates the mutual information between it and the target variable and selects the features that have the greatest dependence on the target variable. The formula of the MI method is shown in Equation (1):
$$I(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \tag{1}$$
where $p(x, y)$ denotes the joint probability of X = x and Y = y, $p(x)$ denotes the marginal probability of X = x, and $p(y)$ denotes the marginal probability of Y = y. The MI method is suitable for dealing with both discrete and continuous variables, does not need to normalize or standardize the data beforehand, and is widely used in filter feature selection.
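As an illustration of the mutual information definition, the sketch below estimates I(X, Y) for two discrete variables from empirical frequencies. It is a minimal example under stated assumptions: the function name and the toy data are hypothetical, the natural logarithm is used, and continuous features would first need binning or a different estimator.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

# A variable shares log(2) nats with itself when it is a fair binary variable.
mi_identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
# Two independent binary variables share no information.
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Ranking features by this quantity against the target and keeping those above a threshold gives a filter-style selection.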
SFS begins with an empty set S. It iteratively selects features from the original feature set based on evaluation criteria and adds them to the current feature subset S. This process continues until the desired number of features in the subset is achieved or no further optimal feature can be selected. SFS based on Random Forest (RF) is a popular wrapper feature selection method. It evaluates feature importance by constructing a Random Forest and gradually selects features to build the best model possible. SFS allows for stepwise selection of features, preventing overfitting, and can find the optimal feature subset within a predefined range of feature numbers. SFS has demonstrated excellent performance in fields such as gene expression data analysis, image processing, and text classification.
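The greedy loop described above can be sketched as follows. This is an illustrative outline rather than the authors' implementation: in the paper the evaluation criterion comes from a Random Forest, whereas here a hypothetical toy scoring function (`toy_score`, with made-up usefulness values and a redundancy penalty) stands in so that the example stays self-contained.

```python
def sequential_forward_selection(candidates, score, k):
    """Greedily add the feature that most improves score(subset), up to k features."""
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining and len(selected) < k:
        # Score every one-feature extension of the current subset.
        trial_scores = {f: score(selected + [f]) for f in remaining}
        best_feature = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_feature] <= best_score:
            break  # no remaining feature improves the subset
        best_score = trial_scores[best_feature]
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected, best_score

# Hypothetical per-feature usefulness; "floors" is redundant with "SQFT".
usefulness = {"SQFT": 0.6, "hours": 0.3, "floors": 0.25, "noise": -0.05}

def toy_score(subset):
    total = sum(usefulness[f] for f in subset)
    if "SQFT" in subset and "floors" in subset:
        total -= 0.3  # penalty for the redundant pair
    return total

selected, best = sequential_forward_selection(usefulness, toy_score, k=3)
```

With these toy values the loop stops after two features, because adding either remaining feature lowers the score.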
LASSO is a widely employed feature selection method known for its interpretability and stability compared with traditional approaches [33][34][35]. The core concept behind LASSO is to select features based on linear regression while imposing a constraint that the sum of the absolute values of the independent variable coefficients is less than a threshold. By minimizing the sum of squared residuals, the LASSO method forces the coefficients of weakly correlated independent variables with the target variable to become zero. There are several techniques available to solve the LASSO model and, in this study, we employ the widely recognized Least Angle Regression (LAR) method [36,37], known for its computational efficiency. The LAR algorithm identifies the next most influential feature along the direction defined by the first two selected features, requiring only a few iterations of least squares fitting to obtain the feature variable coefficients β, thereby achieving a rapid solution.
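For reference, the constrained least-squares problem described above is commonly written as follows. The symbols $y_i$, $x_{ij}$, $\beta_0$, $t$, and $\lambda$ are notation assumed here for illustration and are not taken from the paper; n and p denote the numbers of samples and features.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le t

% Equivalent penalized (Lagrangian) form, with \lambda \ge 0 playing the role of t:
\hat{\beta} = \arg\min_{\beta_0,\,\beta} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^2
+ \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

A smaller threshold t (or a larger penalty λ) drives more coefficients exactly to zero, which is what makes the method usable for feature selection.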

Deep Forest Model
The Deep Forest Model (DF) consists of a Multi-Grained Scanning (MGS) structure and a cascade forest structure. The process of the multi-grained scanning structure is illustrated in Figure 3. Each sample in the input dataset is divided into multiple subsamples by MGS, and each subsample contains a continuous segment of the feature vector that is fed into a Random Forest (R-Forest) and a completely random tree forest (E-Forest) for prediction. Each forest outputs a real value, so a Deep Forest model obtains multiple real numbers as prediction outputs and stitches these results into a vector that is input into the cascade forest structure for further prediction.
Figure 3. Schematic diagram of the multi-grained scanning structure.
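The scanning step can be pictured as a sliding window over each sample's feature vector. The sketch below is only an illustration under stated assumptions: the window size, stride, and the stand-in `mean_predictor` (used in place of trained R-Forest and E-Forest models) are hypothetical.

```python
def sliding_windows(features, window, stride=1):
    """Split one feature vector into contiguous sub-vectors of length `window`."""
    last_start = len(features) - window
    return [features[i:i + window] for i in range(0, last_start + 1, stride)]

def mean_predictor(sub_vector):
    """Stand-in for a trained forest: returns one real value per sub-vector."""
    return sum(sub_vector) / len(sub_vector)

sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]       # one building with six features
windows = sliding_windows(sample, window=3)    # four sub-vectors of length three
# One real-valued output per window, concatenated into the vector
# that would be passed on to the cascade forest.
scanned_vector = [mean_predictor(w) for w in windows]
```

Using several window sizes in turn would produce the multiple "grains" that give the structure its name.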

One of the sources of the superior performance of deep neural networks is their multilayer connection structure [38]; similarly, the cascade forest structure in the Deep Forest model also adopts a multilayer cascade structure. The cascade forest consists of multiple cascade layers, each consisting of two R-Forests and two E-Forests. As shown in Figure 4, each Random Forest is trained after receiving the multi-grain scan data features and using the out-of-bag estimation as the prediction of the corresponding samples. The four feature vectors will be combined into a new vector and spliced with the features input from the previous layer into new features. After the second cascade layer receives the features passed down from the first layer, in addition to outputting the new features, it will also compare the prediction performance of this layer with the previous layer using an evaluation function. If the performance improvement exceeds a set threshold, the new cascade layer is retained and the category vector of the previous layer is updated with the new category vector and then passed down. When the prediction performance of the new cascade layer does not meet expectations, the cascade forest stops training. It then combines all the previous cascade layers to form the final cascade forest model. The cascade forest's prediction value is obtained by averaging the output values of the last cascade layer.
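The layer-by-layer growth with a stopping threshold can be outlined as below. This is a schematic sketch, not the Deep Forest implementation: `train_layer` and `evaluate` are hypothetical callables, and the simulated validation scores only serve to show when growth stops.

```python
def grow_cascade(train_layer, evaluate, min_gain=1e-3, max_layers=10):
    """Add cascade layers while the validation score improves by more than min_gain."""
    layers = []
    best = float("-inf")
    for depth in range(max_layers):
        layer = train_layer(depth, layers)   # would train the forests of one layer
        score = evaluate(layers + [layer])   # performance with the candidate layer
        if score - best <= min_gain:
            break                            # improvement too small: stop growing
        layers.append(layer)
        best = score
    return layers, best

# Simulated validation scores for cascades with 1, 2, 3, and 4 layers.
simulated_scores = [0.80, 0.86, 0.88, 0.8805]

layers, best = grow_cascade(
    train_layer=lambda depth, previous: f"layer-{depth}",
    evaluate=lambda cascade: simulated_scores[len(cascade) - 1],
)
```

Here the fourth layer improves the score by only 0.0005, so it is discarded and the cascade keeps three layers.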

Deep Multilayer Perceptron
Deep Multilayer Perceptron (Deep MLP) is a deep neural network used to efficiently predict nonlinearly varying data, typically with more than three hidden layers. Figure 5 illustrates a multilayer perceptron network structure with a 3-component input layer and a 1-component output layer with 4 hidden layers, with each layer containing 4 neurons.
The function $f(x)$ represents the activation function and can be one of the Sigmoid, Tanh, or ReLU functions. The role of the activation function is to introduce nonlinear operations into the learning network so that the network can approximate any nonlinear function and substantially improve the model generalization ability. The most widely used activation function is ReLU, which, compared with the Sigmoid and Tanh functions, avoids the vanishing gradient ("gradient disappearance") defect, i.e., the output y no longer being sensitive to increases in x once x is large.
Here, $d_i$ represents the true value and $y_i$ is the predicted value; the resulting error is minimized using the error back-propagation strategy. Deep MLP offers a significant improvement in model generalization ability and prediction accuracy compared with the classical neural network model due to its multiple hidden layers, multiple neural nodes, and activation functions.
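To make the layer structure concrete, the sketch below runs a forward pass through a network with the shape described for Figure 5 (3 inputs, 4 hidden layers of 4 neurons, 1 output) using ReLU in the hidden layers. The random weights, zero biases, and input values are assumptions for illustration; training by back-propagation is not shown.

```python
import random

def relu(x):
    """ReLU activation: passes positive values, clips negatives to zero."""
    return max(0.0, x)

def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: activation(W x + b), one output per weight row."""
    outputs = []
    for weight_row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(weight_row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs

def init_layer(n_in, n_out, rng):
    """Random weights in [-1, 1] and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
sizes = [3, 4, 4, 4, 4, 1]  # input, four hidden layers, output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def forward(x):
    """Propagate one sample through all layers; the output layer stays linear."""
    for index, (weights, biases) in enumerate(layers):
        is_output = index == len(layers) - 1
        x = dense(x, weights, biases, None if is_output else relu)
    return x[0]

prediction = forward([0.5, -1.2, 2.0])
```

The output layer is left linear because the task is regression, which is a design choice assumed here rather than stated in the paper.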

Convolutional Neural Network
Convolutional neural network (CNN) belongs to one of the deep learning methods, and its network architecture is shown in Figure 6. CNN is composed of three fundamental components: the convolutional layer, the pooling layer, and the activation layer [39]. The role of the convolutional layer is to first convolve the input one-dimensional feature data with a one-dimensional convolutional kernel, and then use the activation function to nonlinearize the data after the convolutional operation [40].

Figure 6. Convolutional neural network structure.
The main function of the pooling layer is to remove some redundant information and extract the important features while maintaining feature invariance, i.e., feature extraction and data dimensionality reduction [41]. The common pooling methods are divided into mean pooling, maximum pooling, etc., and mean pooling is less used because its performance is inferior to that of maximum pooling [42,43].
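The convolution, activation, and max pooling steps on one-dimensional feature data can be sketched as follows. The kernel, the input values, and the pool size are arbitrary assumptions; as in most deep learning frameworks, the "convolution" here is computed without flipping the kernel.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (no padding, stride 1, kernel not flipped)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing incomplete window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

features = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 1.0]
kernel = [1.0, 0.0, -1.0]

convolved = conv1d(features, kernel)   # [-1.0, -2.0, -2.0, 5.0, 3.0]
activated = relu(convolved)            # [0.0, 0.0, 0.0, 5.0, 3.0]
pooled = max_pool1d(activated, 2)      # [0.0, 5.0]
```

Pooling halves the length of the activated sequence while keeping the strongest response in each window, which is the dimensionality reduction described above.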
The fully connected layer encodes the local features into global features of the input data. The model parameters are then adjusted and updated by calculating the error between the model prediction and the true label [39] and applying the back-propagation algorithm. After several iterations of training, the loss converges and the model parameters are no longer updated. At this point, the optimal model parameters are obtained and saved for subsequent testing and evaluation of the algorithm on the validation set [44].

Model Evaluation
Based on the prediction results of the test set, the goodness of fit (R²) and root mean square error (RMSE) are used as the evaluation indexes of the regression model performance in this paper. The evaluation indicators R² and RMSE are defined as shown in Equations (3) and (4):
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} \tag{3}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \tag{4}$$
where $\hat{y}_i$ is the predicted value of the i-th sample; $y_i$ is the true value of the i-th sample; $\bar{y}$ is the sample mean; and N is the number of samples. R² is used to measure the goodness of fit of the regression model to the data; a higher value of R² indicates a better fit of the algorithm and better interpretability [18]. RMSE is used to compare the predictive performance of the model. It measures the average difference between the model's predicted values and the actual observed values, i.e., the standard deviation of the prediction error. The smaller the value of RMSE, the better the predictive performance of the model. Therefore, it is widely used to evaluate the model performance in the field of building energy consumption prediction [45][46][47][48]. By using both R² and RMSE, we aim to gain a comprehensive understanding of how well our model captures potential patterns in the data and how accurately it predicts future outcomes.
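Both evaluation metrics are straightforward to compute directly. The sketch below uses their standard definitions on a small set of made-up values; the function names and the toy data are assumptions for illustration.

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error of the predictions."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

score_r2 = r_squared(y_true, y_pred)   # 1 - 2/20 = 0.9
score_rmse = rmse(y_true, y_pred)      # sqrt(2/4), about 0.707
```

R² can become negative when a model predicts worse than the sample mean, which is how a value such as −0.191 can arise on skewed data.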

Model Interpretation
An effective energy consumption assessment model should not only be accurate but should also provide an appropriate interpretation of the model in terms of its features. Although tree-based machine learning algorithms such as XGBoost and RF can provide feature importance results, they do not show the effect of each feature on the predicted composite energy consumption of each sample. Furthermore, they do not indicate whether these features are positively or negatively correlated with composite energy consumption.
This study uses the SHAP method to interpret the proposed energy assessment model. SHAP is an algorithm that utilizes a game theoretic approach to interpret the output of any machine learning model. Traditionally, simple models such as decision trees and linear regression have been easy to visualize, but visualizing other models, especially ensemble learning models, is not feasible. To address this problem, Lundberg et al. proposed the use of local interpretation, based on game theory, to understand advanced decision tree-based models [49]. SHAP values were developed to analyze the results of ensemble decision tree models, especially the importance of the feature parameters. The specific process of interpreting the trained RF model using SHAP values is as follows: first, the contribution of each feature to the integrated energy consumption is calculated for each sample; then, the mean of the absolute SHAP values over all interpreted samples is computed to obtain the contribution value of each feature to the integrated energy consumption. The specific formula for the SHAP value of the i-th feature is given in Equation (5), where M is the number of feature parameters, F is the set of all feature parameters, f is the interpreted model, and $x_i$ is an instance of the interpreted feature vector.
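To illustrate the game-theoretic idea behind SHAP, the sketch below computes exact Shapley values for a tiny hypothetical model by enumerating all feature subsets, weighting each marginal contribution by |S|!(M − |S| − 1)!/M!. The toy model, the instance, and the all-zero baseline for "absent" features are assumptions; practical SHAP implementations rely on much faster algorithms than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n_features):
    """Exact Shapley values for a set function value(S) over n_features players."""
    M = n_features
    phi = [0.0] * M
    for i in range(M):
        others = [j for j in range(M) if j != i]
        for size in range(M):
            weight = factorial(size) * factorial(M - size - 1) / factorial(M)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical model with an interaction between features 0 and 2.
def model(x):
    return 2.0 * x[0] + 1.0 * x[1] + x[0] * x[2]

instance = [1.0, 3.0, 2.0]
baseline = [0.0, 0.0, 0.0]

def value(subset):
    """Model output when only the features in `subset` take the instance's values."""
    mixed = [instance[j] if j in subset else baseline[j] for j in range(3)]
    return model(mixed)

phi = shapley_values(value, 3)  # approximately [3.0, 3.0, 1.0]
```

The three values sum to the difference between the prediction for the instance (7.0) and the prediction for the baseline (0.0), and the interaction term is shared equally between features 0 and 2.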

Methodology
The flow of the Deep Forest-based prediction algorithm proposed in this paper is shown in Figure 7. First, the CBECS data set is preprocessed, including the handling of outliers and missing values. Then, the data are divided into a training set, a validation set, and a test set at a ratio of 8:1:1. In order to avoid information leakage, feature selection is performed on the training set only: first, the Pearson correlation coefficient (PCC) is used for feature filtering, and then the forward feature selection method is used to select key features. Then, the prediction model is built: first, the hyperparameters of the DF model are initialized, and then the selected features are entered into the prediction model for training. After training, the model performance is evaluated using the validation set, and the model parameters are output if the model performance is stable; otherwise, the model parameters are further adjusted. The performance of the trained model is evaluated using R² and RMSE. In addition, the model is interpreted using SHAP values, and energy saving strategies are developed based on the interpretation results to provide a reference for the energy management department.
Figure 7. Algorithmic framework for Deep Forest-based energy consumption prediction in commercial buildings and its interpretability study.
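The 8:1:1 division described in this section can be sketched as a shuffled split. The function name, the fixed seed, and the use of integer indices in place of building records are assumptions; the sample count of 6442 is the size of the processed dataset reported in the paper.

```python
import random

def split_811(samples, seed=0):
    """Shuffle and split samples into 80% train, 10% validation, and the rest as test."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Indices standing in for the 6442 buildings of the processed dataset.
train, val, test = split_811(range(6442))
```

To avoid information leakage, any statistics used for feature filtering or selection would then be computed from `train` alone and only applied to `val` and `test`.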

Pre-Processing
Skewness, kurtosis, and standard deviation were used to characterize the distribution of the data after removing the 165 samples with missing values from the CBECS 2012 dataset. Here, skewness measures the asymmetry of the data distribution; kurtosis describes the sharpness or the height of the peaks of the data distribution; and the standard deviation measures the degree of dispersion of the data. Table 1 shows the distribution characteristics of the energy consumption and energy intensity data of 6555 buildings in the CBECS 2012 dataset before and after the removal of outliers. The raw data show that the total energy consumption and total energy intensity are skewed and do not follow the assumption of a normal distribution. The data have a high standard deviation, and there are a large number of outliers. After removing the outliers, the right skewness and kurtosis of the data distribution are weakened, bringing the data closer to a normal distribution. The processed dataset consists of 6442 buildings, which are then divided into three parts: 80% for the training set, 10% for the validation set, and 10% for the test set. Here, the validation set is mainly used for tuning the parameters of the network model, and the model parameters are optimized using grid search, which is the most common technique for optimizing multiple parameters at the same time.
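The three distribution statistics can be computed from central moments, as in the sketch below. The paper does not state which estimator was used, so the population (biased) moment definitions and excess kurtosis are assumptions here, as are the toy data.

```python
import math

def distribution_stats(values):
    """Return (standard deviation, skewness, excess kurtosis) from population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2)
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return std, skewness, excess_kurtosis

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 1.0, 2.0, 10.0]  # one large value acts like an outlier

std_sym, skew_sym, kurt_sym = distribution_stats(symmetric)
std_right, skew_right, kurt_right = distribution_stats(right_skewed)
```

The symmetric sample has zero skewness, while the single large value in the second sample produces a positive (right) skew and a larger standard deviation, mirroring the effect of outliers described above.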

Feature Selection Results
To address the issue of significant redundant data in CBECS 2012, it was crucial to eliminate redundant features before feature selection. Features that exhibited a Pearson correlation coefficient greater than 0.9 were filtered out. Subsequently, the MI, SFS, and LASSO algorithms were applied to the training dataset to identify the most appropriate features for the EUI prediction model. Figure 8 presents the MI coefficients between the candidate features and the EUI data, ranked from highest to lowest. A higher MI coefficient indicates a better feature for developing the EUI model. Based on the recommendations of Peng et al. [50], features with MI coefficients greater than 0.2 are considered significant (red dotted line). Consequently, the top 28 features are selected to construct the prediction model in this study.
The second feature selection algorithm is SFS, in which the optimal feature subset is selected from all the candidate features using model reconstruction and the recursive pruning of the features with the least ranking importance in the current set. Since the SFS process depends on a specific regression algorithm, each machine learning model extracts different features from the candidate features. Figure 9 shows the feature selection results of the SFS method used for the RF model. The green dots indicate the average MSE values from five-fold cross-validation of the selected feature subset on the training dataset. The optimal number of selected features is the point with the minimum MSE, at x = 30 (red dotted line), so the top 30 features are selected as the model input features.
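The correlation-based redundancy filter can be sketched as follows. The 0.9 threshold comes from the text; the rule of keeping the first feature of a highly correlated pair, the helper names, and the toy columns are assumptions made for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column carries no linear correlation
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| with every already kept feature is at most threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy columns: "SQFT_copy" is nearly a multiple of "SQFT".
columns = {
    "SQFT": [1.0, 2.0, 3.0, 4.0, 5.0],
    "SQFT_copy": [2.0, 4.0, 6.0, 8.0, 10.5],
    "hours": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(columns)
```

The filter is order dependent, so a different column order could keep "SQFT_copy" instead of "SQFT"; a real pipeline would need an explicit rule for which member of a pair to retain.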
The third feature selection method is LASSO regression, which selects the best input features by developing a regularized regression model that removes irrelevant and redundant features by adjusting the penalty parameter. The optimal penalty parameter λ is determined by five-fold cross-validation. The variation of MSE with the penalty parameter λ is shown in Figure 10. The minimum MSE is reached when λ is 0.0077 (red dotted line in Figure 10), but at this value 130 features still have nonzero coefficients; Figure 11 shows the importance results for the first 40 features. In this paper, the first 29 features, whose coefficients have absolute values greater than 0.05, were selected as model inputs (Figure 11). Table 2 presents the results of the three feature selection methods (see Appendix A for feature descriptions). As shown in the table, the selected features fall into four main categories: physical features, occupancy features, geographic features, and equipment features. Floor area (SQFT) is selected as a very important feature by every feature selection algorithm, indicating its crucial role in predicting energy consumption.

Table 2. Results of selecting energy consumption characteristics of commercial buildings.
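The LASSO selection rule described above (fit a penalized regression, then keep the features whose coefficients exceed 0.05 in absolute value) can be illustrated with a small coordinate-descent sketch. Everything here is an assumption made for illustration: the solver is a plain textbook implementation rather than the authors' code, the data are synthetic, and the penalty of 0.1 is chosen for the toy problem instead of the cross-validated λ = 0.0077 reported for CBECS.

```python
import random


def soft_threshold(rho, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent.

    Assumes the columns of X and the target y are already centred,
    so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    z = [sum(v * v for v in col) / n for col in cols]
    b = [0.0] * p
    resid = list(y)  # residual y - Xb, with b = 0 initially
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of feature j with the residual that excludes b[j].
            rho = sum(v * (r + v * b[j]) for v, r in zip(cols[j], resid)) / n
            new_bj = soft_threshold(rho, lam) / z[j]
            delta = new_bj - b[j]
            if delta:
                resid = [r - v * delta for v, r in zip(cols[j], resid)]
                b[j] = new_bj
    return b


# Synthetic data: the target depends on features 0 and 1 but not on feature 2.
rng = random.Random(0)
n = 300
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.1) for row in X]

# Centre features and target.
means = [sum(row[j] for row in X) / n for j in range(3)]
X = [[row[j] - means[j] for j in range(3)] for row in X]
y_mean = sum(y) / n
y = [v - y_mean for v in y]

coef = lasso_cd(X, y, lam=0.1)
selected = [j for j, c in enumerate(coef) if abs(c) > 0.05]
print(coef, selected)
```

The penalty shrinks the coefficient of the irrelevant feature to zero, so the 0.05 threshold retains only the two informative features.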

Model Performance Evaluation
Before performing the model evaluation, we considered the uncertainties in each model. Since Deep MLP, CNN, BP, and RBF are all network models, they are particularly sensitive to hyperparameters. Therefore, we selected the best hyperparameters for the network models using grid search, which is the conventional method for tuning multiple hyperparameters simultaneously [51][52][53]. The DF model was not sensitive to hyperparameters, so its hyperparameters were set to reasonable empirical values with reference to Zhou et al. [54][55][56] and then fine-tuned to fit the dataset. The hyperparameter settings of the above models are shown in Table 3. Each feature selection result was used to develop three typical deep learning models (DF, Deep MLP, CNN) and two typical machine learning models (BP, RBF). To reduce the effect of randomness in the experiments, we ran each model from 10 different initial points and computed the mean values of R² and RMSE on the test dataset; the results are shown in Table 4. The distribution of the original data was skewed, which led to poor or invalid model predictions (the R² of the CNN model is −0.191). To facilitate the comparison between models, this paper mainly predicts the log-transformed Total Energy Consumption (Total EC) and Total Energy Intensity (Total EUI) data. For Total EC prediction, the deep models had significant advantages over traditional machine learning models on larger datasets, and their overall R² was higher than 0.80. Among them, the Deep Forest model performed best, with the highest average R² of 0.89 and the lowest average RMSE of 1.04. It was followed by the Deep MLP model with an average R² of 0.88.
The CNN prediction was the worst among the deep models, with an average R² of 0.81, and its performance was inferior to that of the BP neural network among the machine learning models, which had an average R² of 0.82. In addition, SFS achieved the best model performance compared with both MI and LASSO feature selection. The model performance for predicting Total EUI using the same approach is shown in Table 5. The results show that, compared with the prediction results for Total EC, the prediction results for Total EUI were less satisfactory, and the overall R² was below 0.65, mainly because the correlation between Total EUI and the features was weakened by the floor area. The model with the best prediction, Deep Forest, had the highest average R² of 0.58 and the lowest average RMSE (1.03), and its advantage is more significant in predicting Total EUI than in predicting Total EC (the average R² of Deep MLP is 0.55, that of CNN is 0.38, and that of BP is 0.45). In addition, the R² of the RBF model was negative in predicting Total EUI, mainly because RBF assumes that the data are uniformly distributed in the feature space, while the CBECS dataset mostly follows a skewed distribution, so RBF is not suitable for the prediction task on this dataset. In summary, the DF model is suitable for developing evaluation models for building energy consumption and energy intensity and has the best results when combined with SFS feature selection. In this study, three feature selection algorithms were combined with five prediction models, and the results showed that the DF model using the SFS algorithm (SFS + DF) performed the best in predicting Total EC. It had the highest average R² value of 0.90 and the lowest average RMSE value of 1.00.
Although SFS + Deep MLP also showed good predictive performance, with a high average R² and a low RMSE (1.09), it is not the best choice for developing an energy rating model for the following reasons:
• Deep MLP requires the identification of a large number of suitable hyperparameters, resulting in a complex and time-consuming model development process. In addition, its stability is greatly reduced by its sensitivity to the initial values. In contrast, the SFS + DF model requires fewer hyperparameters and can adaptively select the most suitable parameters, making it less affected by the initial values.
• The Total EUI prediction results show that SFS + DF also has the best performance, further validating the stability of the model. Therefore, this model is the preferred choice for developing models for the overall EC assessment of buildings.
• Compared with the other network models, Deep Forest is an ensemble learning method based on tree models, and its prediction process is relatively transparent, with good interpretability.
In summary, the SFS + DF model is the best choice for developing energy assessment models. In addition, the high stability and better interpretability of the model make it attractive in application scenarios where model decisions or prediction results need to be interpreted, such as control, optimization, and fault diagnosis strategies.
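The two evaluation metrics used throughout this section, and the averaging over repeated runs, can be written out as a short sketch. The function names and the toy values are assumptions for illustration; the example also shows how R² becomes negative when a model predicts worse than the mean of the observations, as reported for some models on the skewed data.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


# Toy targets and predictions from two hypothetical runs of one model.
y_true = [1.0, 2.0, 3.0, 4.0]
runs = [
    [1.1, 1.9, 3.2, 3.8],
    [0.9, 2.1, 2.9, 4.1],
]
r2_per_run = [r2_score(y_true, pred) for pred in runs]
mean_r2 = sum(r2_per_run) / len(r2_per_run)
mean_rmse = sum(rmse(y_true, pred) for pred in runs) / len(runs)

# A model that is worse than predicting the mean yields a negative R².
r2_bad = r2_score(y_true, [4.0, 1.0, 4.0, 1.0])
print(mean_r2, mean_rmse, r2_bad)
```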

Model Interpretation
Although the feature selection algorithms can all quantify the impact of input features on energy consumption (output) to some extent, their analysis does not depend on the machine learning model developed and can only provide results on feature importance. As a result, they cannot show the impact of each feature on the predicted combined energy consumption of each sample, or whether a feature is positively or negatively correlated with the combined energy consumption. Therefore, the above feature selection algorithms are not the best choice for interpreting the prediction results, whereas SHAP values not only provide an interpretation of individual samples but also remain consistent across different models.
For the SFS + DF model, the contribution of each feature is calculated and evaluated using SHAP values, as shown in the global SHAP interpretation plot in Figure 12a. The x-axis represents the SHAP value of each feature, where positive and negative values represent positive and negative correlations, respectively, between the input feature and the output energy consumption. The y-axis lists the feature names ranked by feature importance, and each point corresponds to the SHAP value of that feature for one sample. The results show that in the SFS + DF model, the most significant features that are positively correlated with Total EC (Total Energy Consumption) are SQFT (floor area) and NWKER (number of employees). A larger floor area and a larger number of employees lead to higher energy consumption. On the other hand, features such as EMCS (Building automation system), RFTILT (Roof tilt), and HWRDCL (How to reduce cooling) are negatively correlated with Total EC, but their effects are relatively small. It is worth noting that certain features exhibit a long-tailed distribution of SHAP values, such as NELVTR (Number of elevators) and RFGWIN (Number of walk-in units). Although their importance is relatively low, certain extreme values of these features may have a stronger impact on Total EC than certain NWKER values. The global SHAP interpretation plot helps in understanding the overall impact trends in existing buildings, aiding energy management efforts in establishing a comprehensive direction for enhancing energy efficiency in the building industry. For important features, such as SQFT and NWKER, it is important for energy-efficient building design to understand how they affect the predicted energy consumption values.
Figure 13a,b show the effect of these two features on the total energy consumption, where the horizontal axis shows the feature values and the vertical axis shows the corresponding SHAP values, which represent the contribution of the feature to the model output; each blue dot represents one building sample. The results in Figure 13a show that SQFT = 5 × 10^3 is a critical value for the SHAP value of the feature: above it the feature has a positive impact and below it a negative impact, and the impact stabilizes for values greater than 6 × 10^5. The results in Figure 13b show that the number of employees always has a positive effect, but when the number of employees is less than 100, the corresponding SHAP values are widely distributed, indicating that there is a strong feature interaction in the model. Although the SHAP dependence plot of a feature contains rich information, some of the features interact too strongly, which makes the analysis too complicated. The SHAP force plot addresses this limitation by illustrating how the different features affect the output for a particular sample, and it is more easily understood by architects than the SHAP dependence plot. As shown in Figure 14a,b, E[f(x)] represents the baseline value of the SHAP values, which is the mean value of the model predictions; blue represents a negative influence and red represents a positive influence. From the bottom red bar in Figure 14a, it can be seen that 21 insignificant features produce a positive influence of 0.15. In addition, SQFT = 18,000 produces a negative influence of 0.85, while WKHRS = 168 produces a positive influence of 0.61, and so on, to obtain the final predicted value of the sample, which is 20.249.
In this sample, SQFT, PBAPLUS, etc., are the main factors leading to the decrease in energy consumption, while WKHRS is the main factor leading to the increase in energy consumption. Figure 14b shows a notable difference from Figure 14a. In this instance, the value of SQFT increases from 18,000 in Figure 14a to 30,000 in Figure 14b. Despite the increase in SQFT, its impact on energy consumption reduction decreases to 0.51. This decrease can be attributed to the concurrent increase in NWKER, which weakens the influence of SQFT on energy consumption. Furthermore, as NWKER rises from 7 in Figure 14a to 60 in Figure 14b, its effect on energy consumption reduction becomes significantly stronger. This type of figure can provide a reference for building designers when choosing the key features of new low-energy buildings.
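The additive structure behind the force plots in Figure 14, where the baseline E[f(x)] plus the per-feature SHAP values gives the prediction for a sample, can be illustrated with an exact Shapley computation on a toy model. This sketch is not the procedure used for the DF model: the toy model, its coefficients, the background rows, and the interventional value function (features outside a coalition are replaced by background values) are all assumptions chosen so that the example stays small and exact.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, background):
    """Exact Shapley values of model(x) relative to the mean background prediction.

    Features outside a coalition take their values from the background
    rows (interventional value function, an assumption of this sketch).
    """
    p = len(x)

    def value(coalition):
        total = 0.0
        for b in background:
            z = [x[j] if j in coalition else b[j] for j in range(p)]
            total += model(z)
        return total / len(background)

    phi = [0.0] * p
    for i in range(p):
        others = [j for j in range(p) if j != i]
        for size in range(p):
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return value(set()), phi


def toy_model(z):
    """Hypothetical model with an SQFT x NWKER interaction term."""
    sqft, nwker, wkhrs = z
    return 0.0005 * sqft + 0.05 * nwker + 0.01 * wkhrs + 0.000002 * sqft * nwker


# Synthetic background rows: [SQFT, NWKER, WKHRS].
background = [
    [5000, 10, 40],
    [30000, 60, 60],
    [12000, 25, 84],
    [80000, 200, 168],
]
x = [18000, 7, 168]

base, phi = shapley_values(toy_model, x, background)
prediction = toy_model(x)
print(base, phi, prediction)
```

The baseline plus the sum of the three contributions reproduces the prediction exactly, which is the property a force plot visualizes. Because SQFT and NWKER interact in the toy model, the contribution of SQFT depends on the value of NWKER, analogous to the change observed between Figure 14a and Figure 14b.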

Conclusions
This study proposes a framework for a data-driven commercial building energy assessment model that comprises five parts: data pre-processing, feature selection, predictive model development, model evaluation, and model interpretation. Taking the CBECS 2012 dataset as an example, three feature selection methods (the filter method (MI), the wrapper method (SFS), and the embedded method (LASSO)) were used to select the most favorable features for energy consumption prediction; the selected features were then used as inputs to build three deep learning models and two machine learning models, whose performance was compared. Finally, SHAP theory was used from multiple perspectives to explain the best-performing model and analyze the explanation results.
Among the results of the proposed framework, the feature selection algorithm SFS yields the most obvious improvement in prediction performance, especially when predicting Total EUI, where it yields a 2–9.6% improvement for all five prediction algorithms. In addition, the combination of SFS with the DF model exhibits the best performance, achieving an R² value of 0.90. Although the combination of SFS with the Deep MLP model achieves a similar level of accuracy, the Deep MLP model has certain drawbacks compared with the DF model, such as the complexity of hyperparameter tuning and poorer model stability. Therefore, the SFS + DF model is the best choice for developing building energy assessment models. The model interpretability was analyzed at three levels: the impact of 20 features on the output, the impact of individual features on the output, and the impact of the features in a single sample on the output. The global SHAP interpretation plots show that the features with the most significant impact on energy consumption are SQFT and NWKER, which are positively correlated with it, followed by NELVTR, WKHRS, etc. In addition, certain features (NWKER) have stronger interactions, and it is difficult to analyze the cause of energy consumption from their feature dependence plots, whereas the single-sample SHAP force plots make it easier to see the roles of the features in a given sample and are not affected by feature interactions. The results show that although SQFT is the most influential factor in the global explanation, the influence of certain secondary factors, such as NWKER, NELVTR, and WKHRS, on energy consumption in the SHAP explanation of a single sample may be greater than that of SQFT. Therefore, architects should take into account the extreme values of nonsignificant features in addition to the major factors when designing buildings for energy efficiency.
This paper demonstrates the effectiveness of deep learning algorithms in the field of building energy consumption prediction, and the proposed model can provide suggestions and references for building designers, building energy managers, and other related staff.
The limitation of this study is that tuning the hyperparameters of the DF model using grid search does not significantly improve the model performance and is time-consuming. In the future, the predictive performance of the model will be further improved by pruning and optimizing the structure of the DF model for higher-dimensional feature sets.

Conflicts of Interest:
The authors declare no conflicts of interest.
Appendix A