Machine Learning Approach to Predict Building Thermal Load Considering Feature Variable Dimensions: An Office Building Case Study

Abstract: An accurate and fast building load prediction model is critically important for guiding building energy system design, optimizing operational parameters, and balancing a power grid between energy supply and demand. A physics-based simulation tool is traditionally used to provide the building load demand; however, it is constrained by its complex model development process and requirement for engineering judgments. Machine learning algorithms (i.e


Literature Study
Building thermal load prediction plays an important role in energy management and efficiency. It has wide applications, such as determining the capacity of heating, ventilation, and air conditioning (HVAC) systems in the building design phase [1]; providing operational optimization control in building energy systems for existing or retrofitted buildings [2]; and determining the demand response baseline for grid-integrated buildings [3][4][5]. In the field of building load prediction, the thermal loads for heating and cooling are the most difficult to predict accurately, owing to the complex, nonlinear relationships among the influencing factors (input feature variables), such as the weather conditions, building physics, and different operational behaviors. Therefore, previous studies have focused on the thermal load of buildings, especially on the loads of HVAC systems. Prediction approaches are widely categorized into three types: physics-based
XGBoost was first released in 2014 and has become a powerful algorithm; it has featured in many winning Kaggle competition solutions [27,28]. XGBoost is based on a gradient boosting algorithm for assembling weak learners into a strong learner. XGBoost is readily available on the Python, R, Julia, and Scala platforms [29]. Wang et al. [13] predicted long-term building thermal loads using XGBoost. Five input feature variables were considered: the day of the week, hour of the day, holiday status, temperature, and relative humidity. They found that, among shallow machine learning algorithms, XGBoost performed best. Yan et al. [23] obtained similar results. In the cooling season, they found that 11 input features could represent the main factors influencing cooling energy consumption. The prediction performances of XGBoost (CVRMSE: 62%), RF (CVRMSE: 64%), SVR (CVRMSE: 64%), and ANNs (CVRMSE: 73%) were poor relative to the results in [13]. Wang et al. [24] investigated models including XGBoost, RF, ANN, and SVR for predicting the energy consumption associated with the thermal load of a residential building in Tianjin, China. The CVRMSE values of the prediction results from these models were as follows: RF (5.0%), XGBoost (5.8%), SVR (6.2%), and ANN (7.0%). Although the prediction accuracy was good, the computation cost becomes very high when the input feature dimensions and data size are large. In recent years, Light Gradient Boosting Machine (LightGBM) has been proposed as a novel and promising gradient boosting framework; it is similar to XGBoost, and both are tree-boosting methods. Shi et al. [22] concluded that LightGBM obtained a higher prediction accuracy than SVR in electric load forecasting. Zhang et al. [21] proposed a model of LightGBM integrated with the Shapley Additive exPlanation algorithm to predict energy usage and greenhouse gas emissions. The results show that the proposed LightGBM can achieve a higher prediction accuracy than XGBoost, RF, and SVR. The literature review indicates that tree-based algorithms, including LightGBM and XGBoost, are promising methods for obtaining better prediction results [30].
RF is a supervised learning algorithm based on decision trees. Compared with other algorithms, fewer parameters need to be tuned when using the RF model [31]. Ahmad et al. [11] compared three different algorithms for energy predictions. They selected the ambient temperature and relative humidity ratio as the input feature variables and found that the RF model (MAPE: 2.64%) outperformed LMSR (MAPE: 3.10%) and NARM (MAPE: 4.21%). Besides the advantages of less overfitting and higher accuracy, RF reports the importance of the input features, which can be used in the model training and testing processes. A feature importance analysis keeps the main features and skips the weak features; this is critical to accelerating the computational process and ensuring the prediction accuracy.
LSTM is a type of recurrent neural network (RNN) algorithm. It was first introduced by Hochreiter and Schmidhuber in 1997 [32]. Unlike a traditional feedforward neural network, LSTM passes information from the previous time step to the next time step through recurrent connections. This makes LSTM a natural choice for processing sequential data. It handles complex tasks with long time lags, at which traditional RNN algorithms perform poorly. LSTM has performed better in short-term load predictions than the linear regression, SVM, RF, and XGBoost algorithms [13,33]. The detailed theory of LightGBM, RF, and LSTM can be seen in Section 2.2.
According to previous studies, different algorithms have their merits for different building energy datasets, and the generalization performance of a prediction model depends mainly on the quality of the data. For a specific building case, the prediction accuracy of a data-driven model is now good enough [30], whereas the generalization performance for other, unseen buildings is still poor in the field of building energy predictions. Many papers have investigated data-driven models using a specific building case, and these models are usually biased and can generalize poorly [34,35]. To generalize well, the data source should cover the full possible range of each main variable and represent the overall energy consumption patterns of buildings, because a data-driven model cannot reliably infer results for unseen data. To this end, previous studies have attempted to acquire big data from numerous buildings. For instance, data from 5000 residential buildings in the Ministry of Housing, Communities and Local Government (MHCLG) repository have been used to develop a data-driven model [36]. Meteorological variables with specific ranges and a 5-min interval energy dataset from EnerNOC were used to develop a data-driven interval forecasting model for building energy predictions [37]. Although more dimensions of information have been considered, the diversity and depth of each main variable are still worth investigating for building a well-generalized model.
Although many studies have already investigated the prediction performances of different data-driven algorithms, two research gaps remain. The first gap is whether a data-driven model is good enough to stand in for physics-based tools. The feasibility of using a data-driven model in the field of building load forecasting therefore needs to be investigated. To the best of the authors' knowledge, previous studies mostly focused on existing buildings with acquired historical data, which means that the model is well developed for the specific building but usually generalizes poorly to other buildings, especially buildings in the design phase. Therefore, using physics-based tools to generate a massive dataset that covers all the common energy use scenarios could be a promising way to develop a data-driven model that generalizes well. The second research gap concerns providing a sufficient number of key input features to determine the building energy use in data-driven models. For building owners, input features of different dimensions are not equally easy to obtain, and some key information might be unavailable. Thus, presenting the prediction performance as a function of the number of main feature variables is useful for building owners who need to estimate energy consumption with a certain accuracy. Researchers have reached different conclusions in the past. For instance, Yan et al. [23] concluded that 11 input features were sufficient, whereas Wang et al. [13] stated that 5 features were adequate. How many input features should be used in the model to obtain a satisfactory prediction result remains an open question. To address these research gaps, first, we acquired data by running a large number of EnergyPlus building models that represented different buildings. Note that the data can also be acquired from any physics-based tool or from on-site measurements.
Second, different dimensions of the key feature variables were selected to develop three widely used models: LightGBM, RF, and LSTM in the context of forecasting the HVAC electrical load. Notably, the HVAC electrical load was predicted as an equivalent of a thermal load, as the electricity load is more commonly used in modern grid-integrated buildings.

Motivations and Contributions
In practice, it is extremely difficult to measure all of the required inputs to a physics-based building energy model. These models require thousands of input parameters derived from prior knowledge, and their simulations are typically computationally intensive. Thus, a data-driven model could be a promising approach to solve this problem. However, developing a data-driven model that represents a physics-based model well is still a big challenge. First, the value range of each input variable should fully cover the practical situations; second, the simulation scenarios and dataset sizes should be big enough; finally, the computational costs should be acceptable for practical engineering applications. According to previous studies, weather information is widely used in data-driven models; furthermore, building physical and thermophysical variables, such as the window-to-wall ratio and total heat transfer coefficient of the envelope, have been increasingly considered. In data-driven models, the number of input features can influence the prediction accuracy and computation speed. There is a tradeoff in constructing input feature sets of different sizes under practical conditions. Therefore, it is worthwhile to investigate the input feature dimensions comprehensively for cases in which building managers can obtain building information of different dimensions. In this context, developing a data-driven model by using the massive energy data from simulation tools to represent a physics-based model is an appealing approach to building load predictions, and doing so is the main contribution of this work. The developed models are of high practical value whether building energy managers have ample or limited building information, especially for design phase buildings where the available information is lacking.
Buildings 2023, 13, 312
The goal of this paper is to develop a suitable data-driven model to estimate the thermal load demand of buildings promptly and accurately, especially in the phase of building design. For existing buildings, the proposed model can be calibrated and evaluated using real data.
The remainder of this paper is organized as follows. Section 2 introduces the methodology, the theories of the selected algorithms and evaluation indexes. Section 3 presents the overall comparison results and discussion. Finally, the main conclusions and future applications are presented in Section 4.

Methodology
Data-driven models do not require establishing thermal equilibrium equations; usually, fewer inputs are required compared to physics-based simulation tools. A data-driven model uses data to deduce the hidden relationships between output (e.g., cooling load) and input feature variables (e.g., weather and building physics information) using a statistical approach. This approach is well adapted to buildings in the design phase, where detailed input parameters may be lacking. The research outline is shown in Figure 1. The methodology was designed with four main steps. The first step was EnergyPlus model development, which was the main step when using the co-simulation method of Python and EnergyPlus to obtain the required dataset. Note that the development of building energy models can be replaced by any other physics-based simulation tools. The second step processed the data and analyzed the key input features. The widely used input variables included these three types: (1) time-related information, such as the day type, occupancy, and equipment schedules; (2) weather conditions, such as the temperature, humidity, and solar radiation; and (3) building physical parameters, such as the window-to-wall ratio and R-value of the wall. The output targets are generally thermal loads or electricity consumption [38,39]. The third step selected the data-driven algorithm by reviewing previous studies and tuned the selected models. Data-driven models have gained great interest in the buildings field because of their simplicity and flexibility [40]. In this section, three promising data-driven methods are presented, including LightGBM, RF, and LSTM. Additionally, some important hyperparameters are tuned, and they can be seen in Tables A1-A3. The last step evaluated the developed models and listed the future applications. 
It is worth noting that simulation data, rather than real data, were used because we needed to vary each input feature variable over a wide range to obtain enough data points (thousands of building types and millions of data points) to train a model that generalizes well, which is nearly impossible to achieve with real buildings.

Seed Model Description
This section describes the office building models used in EnergyPlus (Version 9.0.1) and the ranges of the input features used in these cases to obtain energy data. Three categories of building energy models were developed: (1) small office building, (2) medium office building, and (3) large office building. DOE commercial prototype building models were used as the starting point [41]. Table 2 provides key information on these three models, including their geometry; all of the models have a rectangular footprint. The small office building has 1 floor, the medium office building has 3 floors, and the large office building has 12 above-grade floors and 1 basement. Furthermore, the envelope types and HVAC system types are different. Table 2 also shows the types of exterior walls, roofs, heating and cooling systems, and HVAC system operation schedules.

Figure 1. Research outline. Step 1, development of building energy models: the seed models are the prototype office building models from the U.S. Department of Energy (EnergyPlus), covering large, medium, and small office buildings, with seventeen changeable input variables; Python with the eppy package is used to modify the IDF files for EnergyPlus. Step 2, dataset acquisition and feature selection: 768,000 data samples with the hourly HVAC electricity load as the predicted output; feature sets of six input features (weather conditions), nine input features (weather conditions and operational information), and fifteen input features (weather conditions, operational information, and physical information). Step 3, data-driven algorithm selection and model development: LightGBM, random forest (RF), and long short-term memory (LSTM), selected through a literature review; model training, testing, and evaluation; hyperparameter tuning. Step 4, data-driven model evaluation and applications: the evaluation indexes are the CVRMSE, RMSE, $R^2$, and computational time, so that both prediction accuracy and time are considered; the applications are an energy predictor for building energy management systems, baseline calculation for demand response, and grid-integrated efficient buildings.

The weather variables in Table 3 (EnergyPlus standard TMY3 weather data) are time series data that change for each simulation time step, whereas the other variables are constant in each step once determined. Based on the default values provided by the office prototype building models, we determined the ranges of these input variables. We used ±20% of the default values as the lower and upper limits for most of the input variables [42], except for the cooling and heating temperature set points. For these two input variables, we used ±1.11 °C (±2 °F) of the default values as the lower and upper limits, so as to ensure that the indoor temperature was within a preferable range. Table 3 lists the input variable ranges for these three types of office buildings in detail. To simplify the data-driven model, we selected the most important input feature variables in Table 3 and neglected less important variables such as airtightness.
A total of 768,000 valid data samples were generated from the three EnergyPlus seed models described above, and the dataset and full code for developing the data-driven model can be downloaded freely from [43]. To generate the dataset, the Python package eppy was used to co-simulate with EnergyPlus. All the variables listed in Table 3 were varied in turn in different steps, and the total simulation time to obtain all the data was about 700 h (running on a Dell Precision 7920 Tower with a 20-core CPU).
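The range rule described above (±20% of the default for most variables, ±1.11 °C for the two temperature set points) can be illustrated with a minimal sketch. The variable names and default values below are hypothetical placeholders, not the entries of Table 3, and drawing values uniformly at random is an assumption made only for illustration; it is not a description of how the published dataset was generated, and the eppy and EnergyPlus calls are deliberately left out.

```python
import random

# Hypothetical default values for illustration only (not taken from Table 3).
DEFAULTS = {
    "wall_r_value": 2.0,          # m2.K/W, assumed
    "window_to_wall_ratio": 0.3,  # dimensionless, assumed
    "cooling_setpoint_c": 24.0,   # deg C, assumed
    "heating_setpoint_c": 21.0,   # deg C, assumed
}
SETPOINT_KEYS = {"cooling_setpoint_c", "heating_setpoint_c"}


def variable_range(name, default, rel=0.20, setpoint_delta=1.11):
    """Return (lower, upper) limits for one input variable."""
    if name in SETPOINT_KEYS:
        # Set points use an absolute band of +/-1.11 deg C (+/-2 deg F).
        return default - setpoint_delta, default + setpoint_delta
    # All other variables use a relative band of +/-20% of the default.
    return default * (1 - rel), default * (1 + rel)


def sample_case(defaults, rng):
    """Draw one simulation case uniformly within each variable's limits."""
    return {k: rng.uniform(*variable_range(k, v)) for k, v in defaults.items()}


rng = random.Random(0)
cases = [sample_case(DEFAULTS, rng) for _ in range(5)]
for case in cases:
    print({k: round(v, 3) for k, v in case.items()})
```

Each sampled dictionary would then be written into a copy of the seed model before a simulation run; that step depends on the local EnergyPlus installation and is omitted here.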

Input Feature Selection
Feature selection is critical for data-driven models. External weather conditions, physical parameters, and operational schedules for equipment and occupant behavior are three common input feature types [44][45][46]. The appropriate features can differ from building to building. We selected the input feature variables for three reasons: (1) the importance of the input variables in physics-based thermal equilibrium equations, (2) prior knowledge in building energy consumption estimation (including the literature study and engineering experience), and (3) the difficulty of obtaining them in practice. Through a literature study, we found that the outdoor dry bulb temperature, outdoor relative humidity, solar radiation, day of the week, and hour of the day were the five most frequently used features in data-driven models. In addition to the external climate data, the physical information of the buildings (such as the number of floors, wall area, glazing area, and window-to-wall ratio) was used to improve the prediction accuracy. Therefore, we selected 17 widely used variables that are relatively easy to obtain and have a great impact on building energy consumption. The number of input features influences the computational cost and prediction accuracy. To study this impact, three input feature scenarios were investigated, aiming to meet different demands for practical applications. Table 4 shows the details of these scenarios. The key input variables carry the most weight in the machine learning models. Achieving satisfactory prediction results at an acceptable time cost with limited input information is a concern for building energy managers.

Light Gradient Boosting Machine (LightGBM)
LightGBM is a gradient-boosting framework built on a tree-based learning algorithm, the gradient-boosting decision tree (GBDT). The GBDT is widely used in machine learning owing to its efficiency and accuracy; XGBoost is a typical framework employing this algorithm.
However, when the input feature dimensions and/or data size are large, as in modern buildings, the scalability and computation speed of GBDT remain unsatisfactory. To tackle these deficiencies, LightGBM was proposed based on two novel techniques: gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB). These two techniques speed up the training process by up to 20 times, with almost the same accuracy as the traditional GBDT algorithm [47]. LightGBM was first released on 17 October 2016 as a part of Microsoft Corporation's "Distributed Machine Learning Toolkit" project [48]. It was designed to be distributed and efficient, with the advantages of a faster training speed, higher efficiency, lower memory usage, parallel support, and the ability to handle large-scale data. LightGBM is a promising algorithm for big data [47,49]. GBDT is a mature algorithm, and its detailed theory is discussed in other references [13,28]; thus, in this study, we only introduce the theories for GOSS and EFB in detail.
GOSS is a technique for balancing data reduction against prediction performance. GOSS reduces the computation and memory costs by distinguishing between instances with different gradients, retaining the instances with larger gradients while randomly sampling those with smaller gradients. The gradient magnitude of an instance reflects its training error; thus, an instance with a small gradient is already well trained and could be dropped. To avoid large changes in the training data distribution caused by dropping all such instances, GOSS keeps a random sample of the small-gradient instances to preserve the integrity of the original data. In this way, although GOSS reduces the number of instances, the generalization error is close to that calculated using the full data instances. To show this, the variance gain $V_j(d)$ of feature $j$ at splitting point $d$ is defined as shown in Equation (1).
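Equation (1) is not reproduced in this text. For reference, the variance gain defined in the LightGBM paper [47] has the following form. The symbols $x_{ij}$ (the value of feature $j$ for instance $x_i$) and the counts $n_O$, $n^j_{l|O}(d)$, and $n^j_{r|O}(d)$ follow that paper's notation and are assumptions with respect to the notation of this article.

```latex
V_{j|O}(d) = \frac{1}{n_O}\left(
  \frac{\Big(\sum_{\{x_i \in O:\, x_{ij} \le d\}} g_i\Big)^{2}}{n^{j}_{l|O}(d)}
  +
  \frac{\Big(\sum_{\{x_i \in O:\, x_{ij} > d\}} g_i\Big)^{2}}{n^{j}_{r|O}(d)}
\right),
\qquad
\begin{aligned}
n_O &= \textstyle\sum I[x_i \in O],\\
n^{j}_{l|O}(d) &= \textstyle\sum I[x_i \in O:\, x_{ij} \le d],\\
n^{j}_{r|O}(d) &= \textstyle\sum I[x_i \in O:\, x_{ij} > d].
\end{aligned}
```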
In the above, $x$ is the training set with $i$ instances, $g_i$ is the negative gradient of the loss function, and $O$ is the training dataset on a fixed node of the decision tree.
The training instances are first ranked by their absolute gradients in descending order; next, the top a% of the instances with the largest gradients are selected as subset $A$; then, b% of the remaining instances are randomly selected as subset $B$. Thus, the estimated variance gain $\tilde{V}_j(d)$ over the subset $A \cup B$ can be defined as shown in Equation (2). It has been shown theoretically that GOSS does not lose much training accuracy compared with using the full dataset [47], as quantified by the GOSS approximation error $\varepsilon_{\mathrm{GOSS}}$.
EFB is another technique for improving the computational efficiency; it reduces the feature dimensions by bundling features. Usually, the bundled features are mutually exclusive, i.e., they rarely take non-zero values at the same time; therefore, such features can be bundled together without losing information. When two features are not strictly mutually exclusive, a "conflict ratio" can be used to measure the degree of non-exclusion. When this ratio is small, the two features can be bundled without excessively affecting the final accuracy. There are three steps in the EFB method. In Step 1, the features are sorted according to the total number of non-zero values; in Step 2, the conflict ratio between different features is calculated; and in Step 3, the conflict ratio is minimized by iterating through each feature and then bundling the features. In this way, the time complexity is reduced from $O(N_{data} \times N_{feature})$ to $O(N_{data} \times N_{bundle})$, where $N_{bundle} \ll N_{feature}$.
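The GOSS sampling rule described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the LightGBM implementation: the function name `goss_sample` is hypothetical, `a` and `b` are given as fractions rather than percentages, and `b` is taken as a fraction of all instances so that the re-weighting constant $(1-a)/b$ used in [47] restores the original instance count.

```python
import random


def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Select instance indices and weights with a GOSS-style rule.

    a: fraction of instances kept because they have the largest |gradient|.
    b: fraction of all instances drawn at random from the remainder
       (assumed convention, see the lead-in).
    """
    n = len(gradients)
    # Rank instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = round(a * n)
    rand_n = round(b * n)
    subset_a = order[:top_n]            # large-gradient instances, always kept
    rest = order[top_n:]                # small-gradient instances
    rng = random.Random(seed)
    subset_b = rng.sample(rest, min(rand_n, len(rest)))
    # Small-gradient samples are amplified so the data distribution is preserved.
    amplify = (1.0 - a) / b
    indices = subset_a + subset_b
    weights = [1.0] * len(subset_a) + [amplify] * len(subset_b)
    return indices, weights


# Made-up gradients for 100 instances.
grads = [0.05 * ((-1) ** i) * (i % 17) for i in range(100)]
idx, w = goss_sample(grads, a=0.2, b=0.1)
print(len(idx), sum(w))
```

With these settings only 30 of the 100 instances are used for split finding, while the weights sum back to the original instance count.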

Random Forest (RF)
The prediction process of RF is shown schematically in Figure 2. Each decision tree is randomly formed with different features and training samples, and the trees can be trained in parallel. Thus, the prediction accuracy is higher than that of a single decision tree. In the RF model, the number of trees and depth of a tree are the two key parameters that must be tuned; therefore, fewer parameters must be set than in other algorithms [12,31,50]. The RF algorithm includes four main processes: bootstrap resampling, bagging and out-of-bag error (OOBE) estimation, random feature selection, and full-depth decision tree growth [51]. First, $N$ samples from the training dataset $S_n$ are randomly selected as bootstrap samples chosen with replacement, i.e., the same sample $(X_i, Y_i)$ may appear repeatedly. Second, the bagging technique uses the $N$ bootstrap samples to grow a tree; the samples that were not drawn comprise the out-of-bag dataset. Third, a predefined number $p$ of the $k$ total features is randomly selected, and RF searches for the best split among these $p$ features. Finally, the best split is set by minimizing the cost function, and this is repeated until the full-depth decision tree is grown. The OOBE, an estimate of the generalization error, is highly effective for assessing the generalization ability of the constructed model. Owing to these techniques, the main advantage of the RF algorithm is its robustness to noise [51,52].
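The bootstrap, out-of-bag, and random feature selection steps can be illustrated with the standard library alone. This is a minimal sketch rather than an RF implementation: the helper names are hypothetical, the sample and feature counts are arbitrary, and no tree is actually grown.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def random_feature_subset(n_features, p, rng):
    """Pick p of the n_features candidate features for one split search."""
    return sorted(rng.sample(range(n_features), p))


rng = random.Random(42)
in_bag, oob = bootstrap_split(1000, rng)
features = random_feature_subset(15, 5, rng)

# For large N the out-of-bag share approaches (1 - 1/N)^N, roughly 0.368.
oob_fraction = len(oob) / 1000
print(len(set(in_bag)), len(oob), features)
```

Because roughly a third of the samples are never seen by a given tree, they can be used to estimate that tree's error without a separate validation set, which is the idea behind the OOBE.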

Long Short-Term Memory (LSTM)
LSTM is a type of RNN algorithm. Figure 3 shows the principle of LSTM. It is inherently suited to accurately modeling complex multivariate sequences (such as building energy demands), although at an increased computation cost [53]. It handles complex tasks with long time lags, which traditional RNNs handle poorly. In one study [13], LSTM performed better for load prediction than the SVM and XGBoost algorithms.

Equations (3) and (4) define the architecture of the basic RNN algorithm. In this algorithm, only the memory of the last time step $t-1$ can be passed to time step $t$. However, the longer memory of the past time steps $t-n$ can be passed by introducing three special gates and two memory cells: the input gate $i_t$, forget gate $f_t$, and output gate $o_t$ are defined as shown in Equations (5)-(7), respectively; the candidate memory cell $\tilde{c}_t$ and memory cell $c_t$ are defined in Equations (8) and (9), respectively.
In the above, $g(x)$ and $g'(x)$ are two activation functions; $a_t$ is the activation at time step $t$; $x_t$ and $y_t$ are the input and output at time step $t$, respectively; $b$ is the bias; $W$ is the weight factor; $\sigma(x)$ is the sigmoid function, which is defined in Equation (13); and $\tanh(x)$ is the hyperbolic tangent function, as defined in Equation (12).
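Equations (3)-(13) are not reproduced in this text. As a reference, a standard formulation of the basic RNN and the LSTM cell that is consistent with the symbols defined above is sketched below. The weight and bias subscripts, the concatenation $[a_{t-1}, x_t]$, and the element-wise product $\odot$ are notational assumptions, and the original equations may differ in detail.

```latex
\begin{align}
% Basic RNN: hidden activation and output
a_t &= g\!\left(W_{aa}\, a_{t-1} + W_{ax}\, x_t + b_a\right), &
y_t &= g'\!\left(W_{ya}\, a_t + b_y\right) \\
% LSTM gates
i_t &= \sigma\!\left(W_i\,[a_{t-1}, x_t] + b_i\right), &
f_t &= \sigma\!\left(W_f\,[a_{t-1}, x_t] + b_f\right), \qquad
o_t = \sigma\!\left(W_o\,[a_{t-1}, x_t] + b_o\right) \\
% Candidate memory cell and memory cell update
\tilde{c}_t &= \tanh\!\left(W_c\,[a_{t-1}, x_t] + b_c\right), &
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
% Hidden activation passed to the next time step
a_t &= o_t \odot \tanh(c_t) \\
% Activation functions
\tanh(x) &= \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, &
\sigma(x) &= \frac{1}{1 + e^{-x}}
\end{align}
```

The forget gate $f_t$ controls how much of $c_{t-1}$ is retained, which is what allows information from time steps earlier than $t-1$ to reach time step $t$.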

Data-Driven Model Development Process
The development process (i.e., Step 3 in Figure 1) of our proposed data-driven model is shown in Figure 4. First, after data pre-processing, the massive dataset is divided into training and test datasets. Second, feature engineering is implemented for the model inputs, which include the weather conditions (e.g., temperature, humidity, and solar radiation); building physical parameters (e.g., R-value of the wall, floor height, internal mass, and shape coefficient); and operational information (e.g., temperature setting and fresh air volume). The model output is the HVAC electrical load. Then, the different models are trained, and their hyperparameters are tuned for better results. Finally, the test dataset is used to test and evaluate the developed model.

Prediction Performance Indices
To evaluate the prediction performances of different algorithms, three indices are generally used: the CVRMSE, root mean squared error (RMSE), and coefficient of determination ($R^2$). The CVRMSE is a scale-independent indicator obtained by normalizing the RMSE by the mean of the actual values. The CVRMSE has been used in studies [13,54] and is recommended by the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) Guideline 14 [55]; the RMSE is a scale-dependent indicator and thus maintains the same scale as the original data; and $R^2$ ranges from 0 to 1 for a reasonable model and reflects the goodness of fit. These three indices are defined in Equations (14)-(16), respectively.
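Equations (14)-(16) are not reproduced in this text. The commonly used definitions of the three indices are given below for reference; they are assumed to match the original equations, although some guideline variants adjust the denominator of the RMSE for the number of model parameters.

```latex
\begin{align}
\mathrm{CVRMSE} &= \frac{1}{\bar{y}}
  \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}} \times 100\% \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}} \\
R^{2} &= 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}
                  {\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}
\end{align}
```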
where $\hat{y}_i$, $y_i$, and $\bar{y}$ represent the predicted value of sample $i$, the actual value of sample $i$, and the mean of the actual values over all samples, respectively; $n$ denotes the number of samples.
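As a quick numeric check, the three indices can be computed from paired actual and predicted values with a few lines of plain Python. The function names and the four data points are made up for illustration, and the formulas are the usual definitions (RMSE divided by the mean of the actual values for the CVRMSE, and one minus the ratio of residual to total sum of squares for $R^2$), which are assumed to match the paper's.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the units of the data."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


def cvrmse(actual, predicted):
    """RMSE normalized by the mean of the actual values, in percent."""
    mean_actual = sum(actual) / len(actual)
    return 100.0 * rmse(actual, predicted) / mean_actual


def r2(actual, predicted):
    """Coefficient of determination."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [10.0, 12.0, 14.0, 16.0]     # made-up hourly loads, kW
predicted = [11.0, 11.0, 15.0, 15.0]  # made-up model outputs, kW

print(rmse(actual, predicted))    # 1.0
print(cvrmse(actual, predicted))  # about 7.69
print(r2(actual, predicted))      # 0.8
```

Every prediction is off by exactly 1 kW, so the RMSE is 1 kW; dividing by the mean load of 13 kW gives a CVRMSE of about 7.7%.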


Results and Discussion
The prediction performances were calculated and compared under the above-mentioned scenarios, with a prediction granularity of 1 h. For each scenario, we first split the 768,000 data samples into training and test sets at a ratio of 99:1 because of the large size of the dataset [56], although the commonly used ratio is 80:20 or 70:30 for normal-sized datasets [36]. Second, different feature scenarios were selected as the inputs, and the hourly predicted HVAC electrical load was the output. Finally, we applied the five-fold cross-validation approach implemented in the scikit-learn Python package for the hyperparameter tuning in our models. The hyperparameter search ranges and optimum results are shown in Tables A1-A3. All of the models were trained and tested on identical datasets. We compared three different machine learning algorithms (LightGBM, RF, and LSTM) under these three scenarios. To address cases with historical energy load data available (such as existing buildings) and without (such as buildings in the design phase), the input features were applied with and without the historical HVAC electricity load. In this paper, we obtained all the required HVAC electricity loads at the stage where we simulated all the seed models mentioned in Section 2.1. Accordingly, the cases in each of the following three scenarios are distinguished by whether the historical load data are used. In all scenarios, three prediction performance indices were used to evaluate the prediction performance on the testing dataset: RMSE, CVRMSE, and $R^2$.
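The sizes implied by the 99:1 split and the five-fold cross-validation can be worked out with a small index-based sketch. The paper used scikit-learn for this step; the standard-library version below is only an illustration, the function names are hypothetical, and shuffling before the split is an assumption.

```python
import random


def train_test_split_indices(n, test_fraction=0.01, seed=0):
    """Shuffle sample indices and hold out a fraction for testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(train_idx, k=5):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    folds = [train_idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, val


train_idx, test_idx = train_test_split_indices(768_000)
fold_sizes = [(len(fit), len(val)) for fit, val in k_fold_indices(train_idx)]

print(len(train_idx), len(test_idx))  # 760320 7680
print(fold_sizes[0])                  # (608256, 152064)
```

Even at a 1% share, the test set still contains 7680 hourly samples, which is why such a small fraction can be adequate for a dataset of this size.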

Scenario 1: Six Input Features
The simplified scenario used only six key input features to develop the model. As shown in Table 5, LightGBM has the highest prediction accuracy (CVRMSE: 7.14%) and the fastest computational speed (5.4 s); the results from LSTM are the worst in terms of both accuracy (CVRMSE: 26.15%) and computational cost (716.5 s). If the CVRMSE is below 30% at an hourly prediction step, the model is considered acceptable and sufficiently close to physical reality for engineering purposes [57]. By this criterion, the LSTM result without the historical load is unacceptable. Additionally, the computation cost of the LSTM is more than a hundred times that of LightGBM (716.5 s versus 5.4 s); this is consistent with our theoretical analysis in Section 2 and with the conclusion from the literature that LSTM is recommended for small datasets and short-term predictions. Figure 5 presents the hourly predicted and actual HVAC electrical load profiles on the testing dataset; ten days were randomly selected from the testing data for visualization. The prediction performance of the LSTM algorithm is visibly poor. All the models provide better prediction results when the historical load data are considered. The LSTM algorithm without historical load data performs the worst, with large prediction deviations during the peak and valley load times.
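For reference, the three performance indices and the 30% hourly acceptance criterion can be written out as follows. This is a sketch under stated assumptions: the CVRMSE is taken as the RMSE divided by the mean of the actual values, R² is computed as the coefficient of determination (1 minus the ratio of residual to total sum of squares), and the function names and sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the units of the data."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def cvrmse(actual, predicted):
    """RMSE normalized by the mean of the actual values, in percent."""
    mean_actual = sum(actual) / len(actual)
    return 100.0 * rmse(actual, predicted) / mean_actual


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def is_acceptable_hourly(cvrmse_percent, threshold=30.0):
    """Hourly acceptance criterion quoted in the text (CVRMSE below 30%)."""
    return cvrmse_percent < threshold


# Hypothetical hourly loads, for illustration only
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(round(rmse(actual, predicted), 3))       # 2.121
print(round(cvrmse(actual, predicted), 3))     # 8.485
print(round(r_squared(actual, predicted), 3))  # 0.964
```

Both CVRMSE values quoted in the paragraph above (7.14% and 26.15%) pass this check; the LSTM value without historical load is reported in Table 5 and is not reproduced here.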

Scenario 2: Nine Input Features
As mentioned in Section 1, HVAC energy consumption is influenced by multiple factors, such as weather conditions, building physics, and operational parameters. Because only six input features were used in Scenario 1, its prediction results might not be convincing. Therefore, we added operational parameters (room temperature setting and fresh air volume) to give the models more information to learn from. A comparison of Tables 5 and 6 shows that the prediction accuracy improves slightly in all three models, although the computation cost also increases. By adding these operational parameters, the mean improvement percentages in the CVRMSE are approximately 10.9%, 17.6%, and 0.5% for LSTM, LightGBM, and RF, respectively. The improvement is evident in the LightGBM and LSTM models. As shown in Figure 6, the prediction deviation remains large during the peak and valley load times for the LSTM.
Figure 6. Hourly prediction performances using nine input features; panel (b): without historical HVAC electricity load data.
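The text does not define the "mean improvement percentage" explicitly. One plausible reading, assumed here, is the relative reduction in CVRMSE between two scenarios, averaged over the cases with and without historical load data. The sketch below implements that reading; the function names and the input values are hypothetical and are not taken from Tables 5 and 6.

```python
def relative_improvement(cvrmse_before, cvrmse_after):
    """Percentage reduction in CVRMSE relative to the starting value."""
    return 100.0 * (cvrmse_before - cvrmse_after) / cvrmse_before


def mean_relative_improvement(pairs):
    """Average the relative improvement over (before, after) pairs."""
    values = [relative_improvement(before, after) for before, after in pairs]
    return sum(values) / len(values)


# Hypothetical CVRMSE pairs (before, after) for one algorithm:
# first pair with historical load data, second pair without.
example_pairs = [(20.0, 18.0), (10.0, 8.0)]
print(mean_relative_improvement(example_pairs))  # 15.0
```

Under this definition, an improvement is expressed relative to the earlier scenario's CVRMSE rather than as a difference in percentage points.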

Scenario 3: Fifteen Input Features
In addition to the features in Scenario 2, physical information about the building can be used to improve the prediction accuracy. We used 15 key input features to train the models in Scenario 3; three physical aspects (building shape, R-values of walls, and building thermal mass) were considered. Table 7 shows the hourly results from the different models with and without the historical load data. The best CVRMSE was 5.25%, obtained with LightGBM, a promising result for the field of thermal load prediction. Furthermore, the computation time was only 7 s. The best results from the LSTM and RF approaches were 22.06% and 18.54%, respectively, i.e., close to the results from the previous study discussed in the Introduction. By adding six more building physical parameters, the mean improvement percentages of the CVRMSE were approximately 12.2%, 30.3%, and 1.6% for the LSTM, LightGBM, and RF approaches, respectively. The CVRMSE values of all three models were lower than 30%, indicating that they were all acceptable and sufficiently close to physical reality for engineering purposes. Figures 5-7 show that the prediction gradually improves as additional input features are employed.

Discussion
Generally, the prediction accuracy improves when more information about the building is used in the data-driven models. As shown in Figure 8, Scenario 3 with the historical HVAC load data gives the best CVRMSE for all three models, although the improvement is larger for LightGBM and LSTM. For RF, the scenarios yield similar prediction accuracy, which suggests that additional building information does little to improve it. The historical HVAC load data are also important for improving the accuracy of the models. Overall, a CVRMSE of 7.1% can be achieved when only the weather information is used, and the best CVRMSE of 5.3% is reached when the weather, operational, and physical information of the building is considered. Beyond these fifteen input features, adding further information, such as occupant behavior, is worth investigating in future work.
As shown in Figure 8, LightGBM is the best of the three algorithms for building thermal load prediction. Generally, the more data samples used in model training, the better the prediction accuracy, as more of the hidden relationship between the inputs and outputs can be learned. We investigated this effect by increasing the size of the training dataset in LightGBM, and Figure 9 shows the results. The prediction was the worst with 76,800 data samples, and the accuracy generally improved as the sample size increased from 76,800 to 768,000. Additional data samples improved the prediction accuracy but also increased the computation cost.
A quantitative investigation of feature importance is also of interest to researchers. Therefore, we investigated the feature importance in the LightGBM model. Figure 10 ranks the importance of the fifteen input features. For our building, the results showed that the hour of the day, outdoor dry bulb temperature, historical load data, global horizontal radiation, and relative humidity are the five most important features for building thermal load prediction. However, features with lower importance values can still be used to further improve the prediction accuracy.
Figure 8. Prediction performances in different scenarios (with and without historical data, by algorithm).
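The text does not state which importance measure underlies Figure 10. As a model-agnostic illustration of how a feature ranking can be obtained, the sketch below uses permutation importance, which is a swapped-in technique and not necessarily the measure reported by LightGBM: each feature column is shuffled in turn, and the resulting increase in mean squared error is taken as that feature's importance. The toy model, its coefficients, and the random data are hypothetical.

```python
import random


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in MSE when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = mean_squared_error(targets, [predict(r) for r in rows])
    scores = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled_rows = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        shuffled_mse = mean_squared_error(
            targets, [predict(r) for r in shuffled_rows]
        )
        scores.append(shuffled_mse - baseline)
    return scores


def toy_predict(row):
    """Hypothetical stand-in for a trained model: feature 2 is ignored."""
    return 3.0 * row[0] + 0.5 * row[1] + 0.0 * row[2]


data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(3)] for _ in range(200)]
targets = [toy_predict(r) for r in rows]

scores = permutation_importance(toy_predict, rows, targets, n_features=3)
ranking = sorted(range(3), key=lambda j: scores[j], reverse=True)
print(ranking)  # feature indices, most important first
```

A feature that the model ignores receives a score of zero, while a low but non-zero score indicates a feature that still carries some predictive information, in line with the remark above about lower-ranked features.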
As using more features can improve the predictive accuracy, we investigated the effect of the number of features on the accuracy and computation speed of the three algorithms. The computational expense comparison was performed on Windows 10 with a 2.6 GHz processor (Intel Corporation Core i7-10700) and 16 GB of RAM. The Spyder scientific Python integrated development environment was used to implement the prediction tasks. As shown in Figure 11, LightGBM requires the least computation time for model training and testing, whereas LSTM takes much longer. Generally, the computation expense increases with the number of input features, and the model development time reaches its maximum when all 15 input features are considered.
The LSTM model takes much longer than the other two models, which makes it unsuitable for a real-time energy management control system. The computational time of LightGBM is very short (several seconds), which makes LightGBM a suitable model for a real-time control system. Note that acquiring the original massive dataset is time-consuming: we spent about 700 h obtaining it on a powerful workstation (Dell Precision 7920 Tower, 20-core CPU). However, the developed model achieves higher prediction accuracy, and once well trained it can be easily generalized to various energy use scenarios and building types. In this way, model developers do not need to develop a specific model for each building and energy use scenario.
Figure 11. Computation time of the models in different scenarios.
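Computation times of the kind compared in Figure 11 can be collected with a small wall-clock timing helper. The sketch below shows one way to do so with the standard library; the helper name and the `dummy_train` workload are hypothetical stand-ins, and the actual measurement procedure used in the study is not described in the text.

```python
import time


def timed(fn, *args, **kwargs):
    """Run `fn` once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def dummy_train(n):
    """Hypothetical stand-in for a model training and testing routine."""
    return sum(i * i for i in range(n))


result, seconds = timed(dummy_train, 100_000)
print(f"elapsed: {seconds:.4f} s")
```

Measured times depend on the hardware and on background load, so comparisons between algorithms are only meaningful when all runs are made on the same machine, as was done here.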

Conclusions
Developing a data-driven model that can stand in for physics-based tools is a challenge in both academia and practice. Traditionally, well-designed physics-based models, i.e., white-box models, have been widely applied. However, a white-box model requires a large number of detailed input parameters, which can be difficult for engineers to obtain, especially for a building in the design phase. A fast and accurate building thermal load prediction method is critically important for optimal HVAC control, energy demand-side management, smart building management, and other tasks. In this study, therefore, we ran a large number of EnergyPlus simulations to obtain a massive energy dataset that covers the common energy use scenarios, with the aim of developing a data-driven model that generalizes well. Using this data source, three machine learning models were developed and compared in three different input feature scenarios. The following conclusions were reached.
(1) LightGBM is the most accurate and fastest prediction model. In the best scenario, the CVRMSE and R² of LightGBM are 5.25% and 0.99, respectively. Compared with the other two algorithms and with results in the existing literature, LightGBM is the most promising algorithm for building thermal load prediction.
(2) When trained with a large amount of energy data generated by physics-based tools or collected on site, a data-driven model is able to represent a physics-based tool with comparable accuracy.
(3) The dimensions of the input features influence the prediction performance. Compared with a scenario using only weather information, the CVRMSE can be further improved when physical and operational information is considered. Although a larger number of input features yields better accuracy, it also reduces the computational speed. Therefore, there will always be a tradeoff between the required prediction accuracy and the tolerable prediction time.
The findings and the proposed models in this study are useful for real applications, such as smart building energy management, baseline calculation of demand response programs, and grid-integrated efficient building improvements. LightGBM is strongly recommended when dealing with large amounts of data, as it is faster and more robust.
