Development of Building Thermal Load and Discomfort Degree Hour Prediction Models Using Data Mining Approaches

Thermal load and indoor comfort level are two important building performance indicators, and rapid prediction of them can significantly reduce the computation time during design optimization. In this paper, a three-step approach is used to develop and evaluate prediction models. Firstly, the Latin Hypercube Sampling Method (LHSM) is used to generate a representative 19-dimensional design database, and DesignBuilder is then used to obtain the thermal load and discomfort degree hours through simulation. Secondly, samples from the database are used to develop and validate seven prediction models, using data mining approaches including multilinear regression (MLR), chi-square automatic interaction detector (CHAID), exhaustive CHAID (ECHAID), back-propagation neural network (BPNN), radial basis function network (RBFN), classification and regression trees (CART), and support vector machines (SVM). It is found that the MLR and BPNN models outperform the others in the prediction of thermal load, with average absolute errors of less than 1.19%, and that the BPNN model is the best at predicting the discomfort degree hour, with a 0.62% average absolute error. Finally, two hybrid models, MLR (MLR + BPNN) and MLR-BPNN, are developed. The MLR-BPNN models are found to be the best prediction models, with average absolute errors of 0.82% for the thermal load and 0.59% for the discomfort degree hour.


Introduction
Building design optimization involves the integration of an optimization algorithm with building performance calculation. Oftentimes the building performance calculation, conducted by simulation software, is time-consuming; the development of performance prediction models is therefore a good alternative that can significantly reduce the computation time.
Annual thermal load and indoor comfort level are two important factors in evaluating the performance of buildings, and they are often the objectives of building design optimization [1][2][3][4][5][6][7]. For example, Gong et al. [2] applied the orthogonal method and the listing method to optimize passive building design to minimize the annual thermal load. Yu et al. [3] applied a multiobjective genetic algorithm to optimize building energy efficiency and thermal comfort.
Insulation thickness, concrete slab thickness, window-to-wall ratio (WWR), and the optical properties of the envelope (solar absorption/reflection) are critical factors that affect building performance and have attracted the interest of many researchers [6,8,9]. For example, Yuan et al. [6] proposed a method to find the optimal combination of reflectivity and insulation thickness of building exterior walls to minimize the annual thermal load and the cost of the building envelope. Wang et al. [8] investigated the optimal slab thickness of the building envelope to maintain the indoor air temperature within a prescribed range without turning on the heating, ventilation, and air-conditioning (HVAC) system. A concrete slab thickness of 25 cm was recommended for the ceiling and floor and 10 cm for the envelope wall. The maximum WWR was then given as a function of the diurnal temperature amplitude. Olivieri et al. [9] performed an experimental study to find the optimal insulation thickness of a vertical green wall under the continental Mediterranean climate and found an insulation thickness of 9 cm to be sufficient.
Building simulation software, such as TRNSYS [1], THERB [2], EnergyPlus [4], and New HASP/ACLD-β [6], has been used to obtain the thermal load and/or the indoor thermal comfort condition. Such programs perform dynamic simulation to calculate the hourly/subhourly thermal load and indoor comfort condition. Producing annual results becomes time-consuming, especially when the simulation is coupled with an optimization algorithm and many iterations are needed to find the optimum building design solutions.
Data mining techniques can be used to develop prediction models based on experimental or simulation datasets to replace extensive simulation efforts, so as to reduce the computation time needed to evaluate the building performance indices. For instance, artificial neural network (ANN) models have been developed to predict the annual building energy consumption/thermal comfort condition to reduce the computation time during the optimization process [1,3,5].
Prediction models based on data mining techniques have been verified to perform well in the prediction of heating and cooling loads [10], building energy demand [11], electricity demand [12,13], and energy consumption [14][15][16]. For example, Tsanas and Xifara [10] used statistical machine learning tools to predict building heating and cooling loads, achieving low mean absolute error deviations of 0.51 and 1.42 with a random forest (RF) approach, compared with the results from Ecotect. Yu et al. [11] developed a decision tree method to predict building energy demand with 93% accuracy on training data and 92% accuracy on test data. Wang et al. [12] developed an 'Ensemble Bagging Trees' (EBT) technique, using data obtained from meteorological systems and building-level occupancy and meter data, to predict the hourly electricity demand of a test building with a mean absolute prediction error ranging from 2.97% to 4.63%.
Some researchers have employed different approaches and compared the outcomes of prediction from various models [17][18][19][20][21]. Those models were developed for predictions of hourly energy usage [17], steam load [18], energy consumption [19,20], and cooling and heating loads [21]. For instance, Chou and Bui [21] utilized support vector regression (SVR), ANN, classification and regression trees (CART), the chi-squared automatic interaction detector, general linear regression, and ensemble inference models to predict the energy performance of buildings and found that the ensemble approach (SVR + ANN) and SVR were the best models for predicting cooling load and heating load, with mean absolute percentage errors of 3.46% and 1.13%, respectively. Ahmad et al. [19] compared the performance of RF and ANN in the prediction of building energy consumption and found that ANN performed marginally better than RF.
It can be foreseen that data mining approaches can also be applied to predict annual thermal load and indoor thermal comfort conditions with satisfactory performance. Therefore, in this paper, seven data mining techniques, including multilinear regression (MLR), the chi-square automatic interaction detector (CHAID), exhaustive CHAID (ECHAID), back-propagation neural network (BPNN), radial basis function network (RBFN), CART, and support vector machines (SVM), are used to develop prediction models for the annual building thermal load and discomfort degree hours, and their performances are evaluated. Finally, two hybrid models, called the MLR (MLR + BPNN) and MLR-BPNN models, are developed to improve the prediction accuracy.

Base Building Model
A three-story residential building (see Figure 1) with a floor area of 146.43 m², a total construction area of 303.9 m², and a height of 11.77 m was selected for study. It is located in Wuhan, a representative city of the hot summer and cold winter region in China. Most of the cities in this region are in the middle and lower reaches of the Yangtze River, and all are located north of the Tropic of Cancer. The buildings in this region are mainly oriented towards the south in order to obtain more solar radiation in winter. According to the residential building energy efficiency design standard for the hot summer/cold winter region, JGJ134-2010 [22], the optimal building orientation in Wuhan is 15° from south toward the west, which is applied in this study. Natural ventilation is adopted to use free cooling to reduce the thermal load. The infiltration rate is 0.5 air changes per hour (ACH), according to the building energy efficiency standard [22]. There is an overhang at the entrance of the building to provide shading. Low-E glazing is selected to ensure sufficient daylighting while effectively reducing unwanted solar radiation in the daytime, and the roof overhangs act as shading devices for the windows. Internal shading devices can be used when needed.
The occupancy level is 50 m²/person, and the infiltration rate is 0.5 ACH, which is also the minimum fresh air rate required by GB 50736-2012 [23]. The metabolic factor is 0.87 (men = 1.0, women = 0.85, children = 0.75), representing two adult men, two adult women, and two children. The clothing level is 1.0 clo in winter and 0.5 clo in summer [24]. The heating temperature setpoint is 18 °C with a setback temperature of 16 °C, and the cooling temperature setpoint is 26 °C with a setback temperature of 28 °C, according to JGJ134-2010 [22]. Natural ventilation is ON with a maximum ventilation rate of 3 ACH by zone control to reduce the building thermal load. A heat pump is selected to provide cooling in summer and heating in winter. The HVAC system is ON when occupied.

Independent Variables
Double-layer Low-E windows are installed on each side of the building. The layer-to-layer composition of the roof is as follows (from exterior to interior): asphalt waterproof layer, extruded polystyrene board (XPS) insulation layer, concrete layer, and lime-and-cement mortar layer. No skylight is assumed. The structure of the exterior walls is as follows: face brick layer, XPS insulation layer, concrete layer, and lime-and-cement mortar layer. WWR, absorptance of solar radiation at the outer layer surface, insulation thickness, and concrete thickness are identified as the four groups of parameters that have an important impact on the building thermal performance, for the following reasons [24]: (1) Thermal mass can affect the fluctuation of the daily temperature inside the house. (2) Insulation can affect the conduction heat gain/loss through the opaque envelope. (3) The absorptance of solar radiation of the opaque envelope and the location and size of the windows can affect the solar heat gain. Both concrete and brick act as thermal mass; concrete is chosen over brick because concrete can be prefabricated and is not limited in size [25]. Although bricks come in different sizes, these are confined to a small range and the types of bricks are limited [25]. In addition, the conductivity of concrete can be much lower than that of brick (0.24 W/m·K vs. 0.84 W/m·K), meaning the building will be better insulated when the thicknesses are the same.
To fully capture the impact of these four factors, different values are assigned for each facade. In addition, different value ranges are given (see Table 1) to cover the possible variation of each factor. A total of 19 design parameters are determined to be the independent variables.

Dependent Variables
The annual building thermal load and the discomfort degree hour are the dependent variables. The annual thermal load is the sum of the cooling load and the heating load:

Q = Q_C + Q_H (1)

The discomfort degree hour, proposed by Zhang et al. [26], is composed of the summer discomfort degree hours and the winter discomfort degree hours:

DDH = DDH_S + DDH_W (2)

The summer discomfort degree hour can be calculated as

DDH_S = Σ_i max[t_i(x) − t_H, 0] × Δτ (3)

where t_i(x) is the indoor air temperature at time i; t_H is the upper limit temperature of the thermal comfort range, taken as 26 °C according to the energy-efficient building design standard JGJ134-2010 [22]; and Δτ is the simulation time step. The indoor air temperature was calculated with a time step of 0.5 h by DesignBuilder [27].
The winter discomfort degree hour can be calculated as

DDH_W = Σ_i max[t_L − t_i(x), 0] × Δτ (4)

where t_L is the lower limit temperature of the thermal comfort range, taken as 18 °C according to JGJ134-2010 [22].
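As a concrete illustration, the summer and winter discomfort degree hours described above can be accumulated from a series of simulated indoor temperatures. The following is a minimal pure-Python sketch, assuming a fixed 0.5 h time step; the function and variable names are illustrative, not taken from the paper.

```python
# Sketch of the summer/winter discomfort degree hour calculation described
# above; names and structure are illustrative, not the authors' code.
T_HIGH = 26.0  # upper comfort limit in degrees C, per JGJ134-2010
T_LOW = 18.0   # lower comfort limit in degrees C, per JGJ134-2010
DT = 0.5       # simulation time step in hours (DesignBuilder output)

def summer_ddh(indoor_temps):
    """Sum of positive exceedances above T_HIGH, weighted by the time step."""
    return sum(max(t - T_HIGH, 0.0) * DT for t in indoor_temps)

def winter_ddh(indoor_temps):
    """Sum of positive shortfalls below T_LOW, weighted by the time step."""
    return sum(max(T_LOW - t, 0.0) * DT for t in indoor_temps)

# Example: three half-hour readings in summer
temps = [25.0, 27.0, 28.5]
print(summer_ddh(temps))  # (1.0 + 2.5) * 0.5 = 1.75 degree hours
```

In an annual run, the full series of 17,520 half-hour temperatures for the building would be passed to each function and the two results summed.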

Data Sampling Method
The accuracy and reliability of data mining depend to a great extent on the quality of the data. Data preparation and preprocessing are two key steps in using data mining techniques to discover the relationships between the dependent and independent variables. It has been estimated that data preparation accounts for 80% of the workload of the entire data mining process [28]. In order to develop prediction models for the annual thermal load and discomfort degree hour, a database containing the building design parameters as inputs and the building load and discomfort degree hours as outputs is created. There are a total of 19 inputs in this study, as shown in Table 1. To effectively reduce the number of samples, the Latin Hypercube Sampling Method (LHSM), proposed by McKay [29], is adopted. The LHSM is a multidimensional stratified sampling method that works according to the following principles: (1) Determine the number of samples needed, N. (2) The range of each input is divided into N strata of equal probability according to Equation (5):

P(x_{k−1} < x ≤ x_k) = 1/N, k = 1, 2, . . ., N (5)

(3) Only one sample is drawn from each stratum, and the location of the sample within each stratum is randomly determined.
Studies have shown that this method can reduce the sample size while ensuring the representativeness of the samples. In this study, the number of samples was determined to be 450, slightly higher than 22.5 times the number of independent variables, following Conraud [30] and Magnier and Haghighat [1]. The building thermal load and number of discomfort degree hours are obtained through the simulation software DesignBuilder [27] and subsequent calculations, and are then used for data analysis and model development.
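The stratify-then-shuffle procedure above can be sketched in a few lines of Python. This is a generic textbook LHSM, not the authors' implementation; the two variable ranges in the example are placeholders rather than values from Table 1.

```python
import random

def latin_hypercube(n_samples, bounds, seed=0):
    """Basic Latin Hypercube Sampling: draw one point per equal-probability
    stratum for each variable, then shuffle strata independently per dimension."""
    rng = random.Random(seed)
    dims = len(bounds)
    samples = [[0.0] * dims for _ in range(n_samples)]
    for d, (lo, hi) in enumerate(bounds):
        # one point inside each of the n equal-width strata on [0, 1)
        points = [(k + rng.random()) / n_samples for k in range(n_samples)]
        rng.shuffle(points)
        for i in range(n_samples):
            samples[i][d] = lo + points[i] * (hi - lo)
    return samples

# e.g. 450 samples over two of the 19 design variables (ranges illustrative)
design = latin_hypercube(450, [(0.1, 0.6), (0.02, 0.12)])
```

Each design row would then be simulated in DesignBuilder to produce the corresponding thermal load and discomfort degree hour outputs.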

Single-Algorithm Models
Seven data mining algorithms are selected to study the relationship between the input variables and the output variables: MLR, the chi-square automatic interaction detector (CHAID), ECHAID, BPNN, RBFN, CART, and SVM.

Multilinear Regression (MLR)
A regression modeling approach is frequently used in data analysis; for example, it was applied by Capozzoli et al. [20] to estimate heating energy consumption and by Wang et al. [17] to predict hourly energy usage. Regression analysis not only quantitatively estimates the relationship among variables, but also the "strength" of that relationship. Multiple regression analysis refers to the correlation analysis of two or more independent variables and one dependent variable. In this study, the MLR model is adopted, which can be presented as follows:

y = β_0 + β_1 x_1 + β_2 x_2 + · · · + β_n x_n

where β_0 is the regression constant and β_1, β_2, · · ·, β_n are the regression coefficients.
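The coefficients of such a model are typically obtained by least squares. The sketch below is a generic normal-equations solver in pure Python for illustration (in practice a statistics package would be used); it is not the authors' code.

```python
def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bn*xn via the normal
    equations, solved by Gaussian elimination (illustrative sketch)."""
    rows = [[1.0] + list(x) for x in X]        # prepend intercept column
    p = len(rows[0])
    # Build A = X^T X and b = X^T y
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    # Solve A beta = b with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, p))) / A[r][r]
    return beta  # [b0, b1, ..., bn]
```

For example, fitting data generated from y = 1 + 2x_1 + 3x_2 recovers the coefficients [1, 2, 3] exactly, since the system is consistent.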

Chi-Square Automatic Interaction Detector (CHAID)
CHAID, proposed by Kass et al. [31], is an efficient classification tree generation algorithm. As a decision tree algorithm, CHAID determines the current best grouping of variables and segmentation points based on the p-values of each variable as a predictor from statistical significance testing (F-test). CHAID has also been widely used, e.g., for steam load prediction [18]. The process of CHAID is as follows: Firstly, the categories that are judged to be statistically similar with respect to the target variable based on the F-test are merged; then, the p-values of the remaining variables are calculated, and those that are the best predictors (lowest p-values) are selected to form the first branch of the decision tree. The process is carried out recursively until the decision tree is fully grown.

Exhaustive CHAID (ECHAID)
In the CHAID algorithm, the grouping selection is based on p-values. However, the number of variables in each group might not be the same, which means that the degrees of freedom of the F-test for each group might not be the same, and this can directly affect the calculation of the p-values. CHAID stops merging when it detects that all remaining categories are statistically different.
ECHAID is an improved algorithm based on CHAID, proposed by Biggs et al. [32], and mainly focuses on avoiding the impact of the degrees of freedom on the p-values. ECHAID continues the merging process until only two super categories are left, so as to ensure that all input variables have the same degrees of freedom in the statistical test. ECHAID is therefore more suitable for finding the best grouping of variables, but it is less efficient than CHAID. Applications of ECHAID can be found in steam load prediction [18] and the prediction of the coefficient of performance (COP) of heat pumps [33].

Back-Propagation Neural Network (BPNN)
BPNN is a widely used ANN, composed of an input layer, hidden layer(s), and an output layer. The learning process of BPNN consists of the forward propagation of signals and the backward propagation of errors. In BPNN, different layers are connected by neurons. In the forward propagation of signals, the data obtained from the output layer are compared with the targeted values. If the error precision is not met, BPNN enters the process of backward error propagation and continuously revises the weighting factors associated with the neurons to improve the accuracy of the BPNN prediction model. BPNN has been proved capable of predicting the thermal performance of a ground source heat pump system [33,34].
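The forward-signal/backward-error loop can be seen in a minimal didactic sketch with one hidden layer of sigmoid units and a linear output. The network size, learning rate, and training data below are illustrative and unrelated to the paper's model.

```python
import math, random

def train_bpnn(X, y, hidden=8, lr=0.05, epochs=2000, seed=1):
    """Minimal one-hidden-layer BPNN trained by stochastic gradient descent.
    Didactic sketch only; real applications would use an ML library."""
    rng = random.Random(seed)
    n_in = len(X[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        # sigmoid hidden activations, linear output
        h = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(ws, x)) + b)))
             for ws, b in zip(w1, b1)]
        return h, sum(w * hi for w, hi in zip(w2, h)) + b2

    for _ in range(epochs):
        for x, t in zip(X, y):
            h, out = forward(x)
            err = out - t                      # dLoss/dout for squared error
            for j in range(hidden):
                # gradient through hidden unit j (uses w2[j] before updating it)
                grad_h = err * w2[j] * h[j] * (1.0 - h[j])
                w2[j] -= lr * err * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h * x[i]
                b1[j] -= lr * grad_h
            b2 -= lr * err
    return lambda x: forward(x)[1]
```

Trained on a smooth two-input target, the returned predictor reproduces the training data closely after a few thousand epochs.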

Radial Basis Function Network (RBFN)
RBFN is a special feedforward neural network that possesses a high learning speed and good nonlinear mapping ability [35]. Compared with BPNN, RBFN has one and only one hidden layer, and its structure is simpler. Meanwhile, the classification and prediction mechanisms of the two are not exactly the same. A radial basis function is used for the hidden layer nodes in RBFN, and a linear adder and sigmoid excitation function are used for the output nodes. In BPNN, the weighting factors between each layer and the next need to be constantly revised, while in RBFN, the weighting factors between the input layer and the hidden layer are fixed at 1, and only the weighting factors between the hidden layer and the output layer are adjusted. Therefore, the learning process in RBFN is more efficient than in BPNN. RBFN has been applied to predict the performance of direct evaporative cooling systems [36] and critical water parameters in desalination plants [37] with high accuracy.
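The key point above, that only the hidden-to-output weights are trained, can be sketched as follows. The Gaussian centers and width are fixed by hand, and a plain linear output is used instead of a sigmoid to keep the sketch short; all names are illustrative.

```python
import math

def train_rbfn(X, y, centers, width=0.5, lr=0.1, epochs=2000):
    """Sketch of an RBF network: Gaussian hidden units with fixed centers;
    only the hidden-to-output weights (and bias) are trained."""
    def features(x):
        # Gaussian radial basis activation for each fixed center
        return [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                         / (2 * width ** 2))
                for c in centers]
    Phi = [features(x) for x in X]     # hidden activations are fixed per sample
    w = [0.0] * len(centers)
    b = 0.0
    for _ in range(epochs):
        for phi, t in zip(Phi, y):
            out = sum(wj * pj for wj, pj in zip(w, phi)) + b
            err = out - t
            for j in range(len(w)):    # only output-layer weights are updated
                w[j] -= lr * err * phi[j]
            b -= lr * err
    return lambda x: sum(wj * pj for wj, pj in zip(w, features(x))) + b
```

Because the hidden activations are fixed, the training problem is convex in the output weights, which is why RBFN learning is faster than full backpropagation.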

Classification and Regression Trees (CART)
CART was proposed by Breiman et al. [38]. Similar to CHAID, CART includes two processes: tree growing and tree pruning. In the tree growing process, the input data are split into two subsets so as to reduce the differences among the values of the variables within each subset. This process continues to produce subsets until the output variables are of the same category or until certain stopping criteria are met. Tree pruning is mainly used to prevent the decision tree from growing "too precise", in which case the tree would overfit the sample data and be unsuitable for prediction. CART has been applied to predict heating energy consumption [20] and the COP of refrigeration equipment [39].
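The tree-growing step, greedy binary splitting that minimizes the squared error of the two children, can be sketched as a tiny regression tree. Pruning is omitted for brevity, and the interface is invented for illustration.

```python
def build_tree(X, y, depth=0, max_depth=3, min_leaf=2):
    """Tiny CART-style regression tree: greedy binary splits minimizing the
    summed squared error of the two children (sketch, no pruning)."""
    mean = sum(y) / len(y)
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return mean                      # leaf: predict the subset mean
    best = None                          # (sse, feature, threshold)
    for f in range(len(X[0])):
        for thr in sorted(set(x[f] for x in X)):
            left = [t for x, t in zip(X, y) if x[f] <= thr]
            right = [t for x, t in zip(X, y) if x[f] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            sse = (sum((t - sum(left) / len(left)) ** 2 for t in left)
                   + sum((t - sum(right) / len(right)) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, thr)
    if best is None:
        return mean
    _, f, thr = best
    L = [(x, t) for x, t in zip(X, y) if x[f] <= thr]
    R = [(x, t) for x, t in zip(X, y) if x[f] > thr]
    return (f, thr,
            build_tree([x for x, _ in L], [t for _, t in L], depth + 1, max_depth, min_leaf),
            build_tree([x for x, _ in R], [t for _, t in R], depth + 1, max_depth, min_leaf))

def predict_tree(node, x):
    while isinstance(node, tuple):       # descend until a leaf value is reached
        f, thr, left, right = node
        node = left if x[f] <= thr else right
    return node
```

On a one-dimensional step function, the tree finds the step boundary as its first split and predicts each plateau exactly.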

Support Vector Machines (SVM)
SVM was proposed by Boser et al. [40]. SVM takes the training samples as the data object; by analyzing the relationship between the input and output variables, a corresponding prediction model is developed, and the output values of new samples with the same distribution are predicted. In SVM, the regression analysis of multiple input variables often maps the sample dataset to a higher-dimensional space indirectly through a kernel function and a nonlinear transformation, in order to find a hyperplane satisfying the required condition. SVM has been applied to predict district heating load [41].

Evaluation Method
In order to comparatively analyze and evaluate the prediction accuracy of each algorithm, five evaluation indices are selected: the mean absolute error (MAE), the standard deviation of the absolute error (Std_AE), the mean absolute percentage error (MAPE), the standard deviation of the absolute percentage error (Std_APE), and the correlation coefficient (R), which are calculated as follows:

MAE = (1/n) Σ_{i=1..n} |ŷ_i − y_i| (6)

Std_AE = sqrt[ (1/n) Σ_{i=1..n} (|ŷ_i − y_i| − MAE)² ] (7)

MAPE = (100%/n) Σ_{i=1..n} |ŷ_i − y_i| / y_i (8)

Std_APE = sqrt[ (1/n) Σ_{i=1..n} (|ŷ_i − y_i| / y_i × 100% − MAPE)² ] (9)

R = Σ_{i=1..n} (y_i − ȳ)(ŷ_i − ŷ̄) / sqrt[ Σ_{i=1..n} (y_i − ȳ)² Σ_{i=1..n} (ŷ_i − ŷ̄)² ] (10)

where ŷ_i is the predicted value, y_i is the targeted value, ȳ and ŷ̄ are the corresponding mean values, and n is the number of samples used for training and validation, equal to 450 in this study.
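These five indices can be implemented directly from their standard definitions. The snippet below is an illustrative pure-Python version (using population standard deviations), not the authors' code.

```python
import math

def evaluate(y_true, y_pred):
    """Compute MAE, Std_AE, MAPE, Std_APE, and R for a set of predictions."""
    n = len(y_true)
    ae = [abs(p - t) for t, p in zip(y_true, y_pred)]            # absolute errors
    ape = [abs(p - t) / abs(t) * 100.0 for t, p in zip(y_true, y_pred)]
    mae = sum(ae) / n
    mape = sum(ape) / n
    std_ae = math.sqrt(sum((e - mae) ** 2 for e in ae) / n)
    std_ape = math.sqrt(sum((e - mape) ** 2 for e in ape) / n)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    r = cov / math.sqrt(sum((t - mt) ** 2 for t in y_true)
                        * sum((p - mp) ** 2 for p in y_pred))
    return {"MAE": mae, "Std_AE": std_ae, "MAPE": mape, "Std_APE": std_ape, "R": r}
```

The same function would be applied separately to the training and validation subsets of the 450 samples.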

Results and Discussion of Single-Algorithm Models
Tables 2 and 3 present comparisons of the prediction results of the thermal load and discomfort degree hours for the different algorithms. Based on the results in Table 2, it can be found that SVM has the worst performance in thermal load prediction, with MAPEs close to 10% in the training process and higher than 10% in the validation process. The MAPEs for CHAID, ECHAID, and CART are much lower than those of SVM, being 3.5–4.0% in the training process and 3.55–4.07% in the validation process. The MAPEs for RBFN in the training and validation processes are both less than 2.5%. MLR and BPNN are the two best algorithms, with MAPEs of less than 1.2%.
It can be found from Table 3 that the performances of the various algorithms in the prediction of discomfort degree hours are similar to those in the prediction of thermal load. SVM has the worst performance, with a MAPE of 5.8% during the training process and 6.45% in the validation process. The MAPEs for CHAID, ECHAID, and RBFN are close to each other, ranging from 2.0% to 2.5% during the training process and below 2.8% in the validation process; among these three, RBFN performs best, with MAPEs of less than 2.0%. The MAPEs of the MLR models are less than 1.0%. BPNN has the best performance, with MAPEs close to 0.50%. Except for BPNN, the MAPEs of all algorithms in the validation process are higher than those in the training process. Therefore, the BPNN model for the discomfort degree hour is not only the most accurate, but also more stable than the other models.
The standard deviation of the absolute percentage error (Std_APE) measures the degree of dispersion of the errors; even if two models have the same MAPE, their Std_APE values may differ. As discussed above, the MAPEs of the SVM models are large, which indicates that SVM is not an ideal method for predicting the thermal load or the discomfort degree hours. Although the MAPEs of CHAID, ECHAID, and RBFN are smaller, these are not the best algorithms either. Therefore, the focus will be on the MLR and BPNN models. The average percentage errors of MLR and BPNN are very close to each other. The Std_APEs of the thermal load and discomfort degree hours are 0.98% and 0.83%, respectively, for the MLR models, and 1.08% and 0.57%, respectively, for the BPNN models. For the MLR and BPNN models, the maximum absolute errors in the building thermal load prediction are 1906.27 kW (6.93%) and 2335.46 kW (10.16%), respectively. Thus, the MLR algorithm is more stable for building thermal load forecasting, with a lower relative error, while the BPNN algorithm outperforms the MLR algorithm in predicting discomfort degree hours. Tables 4 and 5 present the percentage of cases in which the error falls into certain ranges for the thermal load model and the discomfort degree hour model, respectively. It is found that the relative errors for both the thermal load and discomfort degree hour models using the MLR algorithm are less than 10%, with average errors of 1.2% and 0.9%, respectively. The maximum relative error for the thermal load model using the BPNN algorithm is higher than 10%; however, the average error is only 1.1%. The maximum relative error for the discomfort degree hour model using the BPNN algorithm is less than 5%, with an average error of 0.6%. The models with the lowest/second lowest errors are highlighted in bold in Tables 4 and 5.

Hybrid Model
As observed in the above section, the MLR algorithm performs better in predicting the annual building thermal load, while the BPNN algorithm outperforms the MLR algorithm in predicting the discomfort degree hour. In this section, two hybrid models, called the MLR (MLR + BPNN) model and the MLR-BPNN model, are developed to combine the strengths of the MLR and BPNN models.

MLR (MLR + BPNN) Method
In this method, an MLR model is developed based on the outcomes of the MLR model and the BPNN model, which can be presented as follows:

y = α_0 + α_1 y_1 + α_2 y_2 (11)

where α_0 is the regression constant, α_1 and α_2 are the regression coefficients, y_1 is the output of the prediction model using the MLR algorithm, and y_2 is the output of the prediction model using the BPNN algorithm.
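The coefficients of Equation (11) can be estimated by ordinary least squares on the two base models' outputs. The sketch below solves the 3 × 3 normal equations in closed form via Cramer's rule; the function name and interface are illustrative, not the authors' code.

```python
def fit_hybrid(y1, y2, y):
    """Least-squares fit of y = a0 + a1*y1 + a2*y2, where y1/y2 are the MLR
    and BPNN predictions; solved via Cramer's rule on the normal equations."""
    n = len(y)
    s1, s2 = sum(y1), sum(y2)
    s11 = sum(a * a for a in y1)
    s22 = sum(a * a for a in y2)
    s12 = sum(a * b for a, b in zip(y1, y2))
    t0 = sum(y)
    t1 = sum(a * b for a, b in zip(y1, y))
    t2 = sum(a * b for a, b in zip(y2, y))
    A = [[n, s1, s2], [s1, s11, s12], [s2, s12, s22]]   # normal-equation matrix
    b = [t0, t1, t2]
    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
              - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
              + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det3(A)
    coeffs = []
    for i in range(3):
        Ai = [row[:] for row in A]
        for r in range(3):
            Ai[r][i] = b[r]                              # replace column i with b
        coeffs.append(det3(Ai) / d)
    return coeffs  # [alpha0, alpha1, alpha2]
```

In effect, this is a simple stacking ensemble: the blend weights reflect how much each base model should contribute to the final prediction.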

MLR-BPNN Model
In this method, the output of the MLR model (y_1) is used as an additional input variable to the BPNN model, which can be presented as follows:

y = f(x_1, x_2, . . ., x_n, y_1)
Results and Discussion of the Hybrid Model
Table 6 presents the evaluation of the performance of the MLR-BPNN method, which shows improvement compared with the MLR method and the BPNN method. The Std_APE of the annual thermal load is less than 0.74%, and the correlation coefficients for both training and validation are as high as 0.996. The Std_APE of the discomfort degree hour is less than 0.54%, and the correlation coefficients for both training and validation are higher than 0.995. Table 7 shows the percentages of cases where the error falls into the specified ranges for both methods. Significant improvements are found compared with the MLR method and the BPNN method. The percentage of cases in which the error falls below 2.5% increased to 93.3% and 97.8% for the annual thermal load with the MLR (MLR + BPNN) method and the MLR-BPNN method, respectively, and to as high as 98.7% and 99.6% for the discomfort degree hour. The performance of MLR-BPNN in this range is therefore highlighted in bold. The average errors for the annual thermal load and discomfort degree hour are less than 0.82%.

Figure 1. Overview of the base building.

Figures 2 and 3 present the regressions between the predicted and simulated thermal load and discomfort degree hour. It is found that the MLR-BPNN model outperforms all other models, with R-squared values of 0.9929 for the annual thermal load and 0.9912 for the discomfort degree hour.

Table 1. Groups and ranges of the independent variables.

Table 2. Comparisons of different thermal load models.

Table 3. Comparisons of different discomfort degree hour models.

Table 4. Percentage of cases when the error falls into the given range for the thermal load model.

Table 5. Percentage of cases when the error falls into the given range for the discomfort degree hour model.

Table 6. Performance evaluation for the MLR-BPNN method.

Table 7. Percentage of cases where the error falls into the given range.