Predicting Heating Load in Energy-Efficient Buildings Through Machine Learning Techniques

The heating load calculation is the first step of the iterative heating, ventilation, and air conditioning (HVAC) design procedure. In this study, we employed six machine learning techniques, namely multi-layer perceptron regressor (MLPr), lazy locally weighted learning (LLWL), alternating model tree (AMT), random forest (RF), ElasticNet (ENet), and radial basis function regression (RBFr), for the problem of designing energy-efficient buildings. These approaches were used to model the relationship between the input parameters and the energy performance of buildings. The outcomes of each model were analyzed using several well-known statistical indices: root relative squared error (RRSE), root mean squared error (RMSE), mean absolute error (MAE), correlation coefficient (R²), and relative absolute error (RAE). Among the six machine learning solutions (MLPr, LLWL, AMT, RF, ENet, and RBFr), RF emerged as the most appropriate predictive network. For the training dataset, the RF model yielded R², MAE, RMSE, RAE, and RRSE values of 0.9997, 0.19, 0.2399, 2.078, and 2.3795, respectively; for the testing dataset, the corresponding values were 0.9989, 0.3385, 0.4649, 3.6813, and 4.5995. These results demonstrate the suitability of the presented RF model for early estimation of the heating load in energy-efficient buildings.


Introduction
In recent decades, artificial intelligence-based methods have been widely applied by scientists in different fields of study, particularly in energy systems engineering (such as in Nguyen et al. [1] and Najafi et al. [2]). In this regard, machine learning-based techniques are considered a proper alternative for forecasting the energy demand of buildings. Consequently, they offer an appropriate means of inspecting the energy performance of buildings and of optimizing building designs. In the following, several machine learning techniques, namely multi-layer perceptron regressor (MLPr), lazy locally weighted learning (LLWL), alternating model tree (AMT), random forest (RF), ElasticNet (ENet), and radial basis function regression (RBFr), are employed to estimate the heating load (HL) in energy-efficient buildings.

Database Collection
The required initial dataset was obtained from Tsanas and Xifara [43]. The records comprise eight inputs (i.e., conditional factors) and a single output, the heating load (i.e., the response or dependent variable). Based on the main design factors of a residential building, the inputs were X1 (relative compactness), X2 (surface area), X3 (wall area), X4 (roof area), X5 (overall height), X6 (orientation), X7 (glazing area), and X8 (glazing area distribution). The heating load of the building is the quantity to be forecast from these inputs. The characteristics of the analyzed building and the fundamental assumptions are detailed in [43]. A total of 768 building variants were modelled, considering twelve distinct building shapes, five glazing area distribution scenarios, four orientations, and four glazing areas. The data were generated with the Ecotect simulation software. A graphical view of this process is illustrated in Figure 1.
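For readers reproducing this workflow in WEKA's Java API, a minimal loading sketch is given below; the file name energy_efficiency.arff is an assumption, as the data of [43] must first be exported to a WEKA-readable format:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadEnergyData {
    public static void main(String[] args) throws Exception {
        // Load the 768-sample dataset (file name assumed; export the
        // data of Tsanas and Xifara [43] to ARFF or CSV beforehand).
        DataSource source = new DataSource("energy_efficiency.arff");
        Instances data = source.getDataSet();
        // The heating load is the last attribute, after X1..X8.
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Instances:  " + data.numInstances());  // expected: 768
        System.out.println("Attributes: " + data.numAttributes()); // expected: 9
    }
}
```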

Statistical Details of the Dataset
As stated earlier, the heating load was taken as the main target for the energy-efficient buildings, while the main influential parameters were roof area, wall area, relative compactness, surface area, overall height, glazing area, glazing area distribution, and orientation. The statistical description of the energy-efficient residential buildings, including the conditional variables, is tabulated in Table 1. In addition, Figure 2 plots each of the variables, namely relative compactness, surface area, wall area, roof area, overall height, orientation (i.e., north, northeast, east, southeast, south, southwest, west, northwest), glazing area, and glazing area distribution, on the x-axis against the heating load (Figure 3) on the y-axis.

Model Development
An acceptable prediction workflow using artificial intelligence-based systems such as the MLPr, LLWL, AMT, RF, ENet, and RBFr models to predict the heating load in energy-efficient buildings requires several steps, after which the best-fitting model is selected. Firstly, the initial database is separated into a training dataset (80% of the whole dataset) and a testing dataset (20% of the whole dataset). In the current study, the predictive ability of the generated networks on the testing dataset is taken as proof of their validity; a sufficiently large share of the data is therefore reserved for testing so that the trained networks can be assessed reliably. Secondly, in order to obtain the best predictive network, appropriate machine learning-based solutions have to be introduced. Lastly, the outcome of each trained network is validated and verified on the randomly selected testing dataset. The dataset utilized in this work comprises the most influential input layers, namely surface area, roof area, relative compactness, wall area, glazing area, glazing area distribution, overall height, and orientation, which are the effective parameters influencing the heating load in energy-efficient buildings. Note that the employed dataset was obtained from the study by Tsanas and Xifara [43].
All six machine learning analyses provided in the current study were performed using the Waikato Environment for Knowledge Analysis (WEKA). WEKA is a Java-based open-source machine learning software suite developed at the University of Waikato, New Zealand. Each of the proposed techniques was run with optimized settings, as explained in this section.
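A minimal sketch of the 80/20 split with the WEKA Java API is shown below; the shuffling seed is an assumption, as none is reported in the text:

```java
import java.util.Random;
import weka.core.Instances;

public class SplitData {
    /** Split the data into 80% training / 20% testing after shuffling. */
    public static Instances[] split(Instances data) {
        Instances shuffled = new Instances(data);
        shuffled.randomize(new Random(1)); // seed assumed, for reproducibility
        int trainSize = (int) Math.round(shuffled.numInstances() * 0.8);
        Instances train = new Instances(shuffled, 0, trainSize);
        Instances test = new Instances(shuffled, trainSize,
                shuffled.numInstances() - trainSize);
        return new Instances[] { train, test };
    }
}
```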

Multi-Layer Perceptron Regressor (MLPr)
The MLP is a widely used and well-known predictive network. The MLPr aims to find the best regression fit over a set of data samples (denoted here by S), dividing S into training and testing subsets. An MLP involves several layers of computational nodes. Similar to many previous MLPr-based studies, a single hidden layer was used here, since even a single hidden layer with a suitable number of nodes can achieve an excellent rate of prediction. Figure 4 shows a common MLP structure. The optimum number of neurons in the hidden layer was obtained through a series of trial-and-error processes (i.e., a sensitivity analysis), as shown in Figure 5. Notably, only one hidden layer was selected since its accuracy was found to be high enough that a more complicated MLP structure was unnecessary.

Each node generates a local output and passes it to the subsequent layer (the next nodes in a further hidden layer) until the output nodes, i.e., the nodes placed in the output layer, are reached. Equation (1) shows the operation carried out by the j-th neuron, considering a dataset of N groups of records, to compute the predicted output:

$$ y_j = F\Big( \sum_i W_{ij}\, I_i + b_j \Big) \qquad (1) $$

where I symbolizes the input, b denotes the bias of the node, W is the weighting factor, and F signifies the activation function. Here, tansig (i.e., the tangent sigmoid activation function) is employed (Equation (2)):

$$ F(x) = \frac{2}{1 + e^{-2x}} - 1 \qquad (2) $$

Note that several types of activation functions exist (e.g., (i) sigmoid or logistic; (ii) tanh, the hyperbolic tangent; (iii) ReLU, rectified linear units), and their performances suit different purposes. In the specific case of the sigmoid, the function (i) is real-valued and differentiable (i.e., gradients can be found); (ii) is analytically tractable under differentiation; and (iii) is an acceptable mathematical representation of biological neuronal behavior.
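The neuron-count sensitivity analysis of Figure 5 can be sketched as follows with WEKA's Java API; the core MultilayerPerceptron learner, the search range, and its default training parameters are assumptions, since the text reports only the outcome of the analysis:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;

public class MlpSweep {
    /** Try 1..maxNeurons hidden neurons and report test-set performance. */
    public static void sweep(Instances train, Instances test, int maxNeurons)
            throws Exception {
        for (int h = 1; h <= maxNeurons; h++) {
            MultilayerPerceptron mlp = new MultilayerPerceptron();
            mlp.setHiddenLayers(String.valueOf(h)); // one hidden layer, h neurons
            mlp.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(mlp, test);
            System.out.printf("h = %2d  R = %.4f  RMSE = %.4f%n",
                    h, eval.correlationCoefficient(), eval.rootMeanSquaredError());
        }
    }
}
```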

Lazy Locally Weighted Learning (LLWL)
Similar to the K-star technique (i.e., an instance-based classifier), locally weighted learning (LWL) [44] is one of the common types of lazy learning-based solutions. Lazy learning approaches provide valuable training algorithms and representations for learning about complex phenomena, for example during autonomous adaptive control of complex systems. Such methods do, however, have a notable disadvantage: because computation is deferred until prediction, lazy learners introduce a considerable delay during network simulation. More explanations about this model are provided by Atkeson et al. [44].
The key options of the LLWL include the number of decimal places (numDecimalPlaces), batch size (batchSize), KNN (following the k-nearest neighbors algorithm), nearest neighbor search algorithm (nearestNeighborSearchAlgorithm), and weighting kernel (weightingKernel). Each of these influential parameters is explained below.

numDecimalPlaces: the number of decimal places to be used for the output of numbers in the model.
batchSize: the preferred number of instances to process if batch prediction is being performed. A typical value is 100; it was kept constant here, as it did not have a significant impact on the outputs.
KNN: the number of neighbors used to set the width of the weighting function (KNN <= 0 means all neighbors are considered).
nearestNeighborSearchAlgorithm: the nearest neighbor search algorithm to apply (the default, which was also selected in our study, is LinearNN).
weightingKernel: the integer that determines the weighting function (0 = linear; 1 = Epanechnikov; 2 = tricube; 3 = inverse; 4 = Gaussian; 5 = constant; default 0 = linear).
An illustrative example of the k-nearest neighbors rule is shown in Figure 6. The test sample (red dot) is to be classified either as a blue square or as a green triangle. If k = 3 (the solid-line circle), it is assigned to the green triangles, as there are two triangles and only one square inside the inner circle. If k = 5 (the dashed-line circle), it is assigned to the blue squares (three blue squares vs. two green triangles inside the outer circle). The variation of the correlation coefficient (R²) with the number of KNN neighbors is shown in Figure 7; it can be seen that tuning KNN can significantly improve the correlation coefficient. A configuration sketch for this learner follows.
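A minimal configuration sketch for the LLWL in WEKA's Java API is given below; the linear-regression base learner inside LWL is an assumption, as the text does not state which base model was weighted:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.lazy.LWL;
import weka.core.Instances;

public class LwlExample {
    /** Build an LWL regressor; the linear-regression base learner is assumed. */
    public static Evaluation run(Instances train, Instances test, int knn)
            throws Exception {
        LWL lwl = new LWL();
        lwl.setKNN(knn);            // <= 0 means "use all neighbours"
        lwl.setWeightingKernel(0);  // 0 = linear kernel (WEKA default)
        // The default search algorithm (LinearNN) is left unchanged, as in the study.
        lwl.setClassifier(new LinearRegression());
        lwl.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(lwl, test);
        return eval;
    }
}
```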

Alternating Model Tree (AMT)
The alternating model tree (AMT) [45] builds on ensemble learning, yet a single tree forms the structure of the AMT. It can therefore be compared with the M5P tree algorithm (i.e., a reconstruction of Quinlan's M5 algorithm for developing trees of regression models). It is well known that M5P combines a conventional decision tree with the possibility of linear regression functions at the nodes, and the model has been successfully employed in different subjects [46,47]. As the technique most similar to the AMT, alternating decision trees (ADT) provide the predictive power of decision tree ensembles in a single tree structure; however, existing approaches for growing alternating decision trees focus on classification problems. In this paper, to find a relationship between the input layers and the output layer, we employ the AMT for regression, inspired by work on model trees for regression. As in most machine learning-based solutions, different parameters can directly influence the accuracy of the prediction, so we ran a sensitivity analysis on the influential parameters. Since the largest variation in the results stemmed from the 'number of iterations' term, the analysis was repeated with different iteration numbers, using 10-fold cross-validation as a separate data validation scheme. It can be seen that R² decreases as the number of iterations increases; therefore, a value of 10 iterations, the default in the WEKA software, was used.
Some of the influential terms that can influence the accuracy of the regression are number of iterations (numberOfIterations), batch size (batchSize), and number of decimal places (numDecimalPlaces).
numberOfIterations: sets the number of iterations to perform. A sensitivity analysis was run to select a proper number of iterations for the proposed AMT structure (as shown in Table 2 and Figure 8); a cross-validation sketch follows below.
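A sketch of the iteration sweep under 10-fold cross-validation is given below; note that AlternatingModelTree ships as an optional WEKA package, and both the class name and the -I (iterations) option string are assumptions to be verified against the installed package version:

```java
import java.util.Random;
import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;

public class AmtCrossValidation {
    /** 10-fold CV of the AMT for a given iteration count. */
    public static double crossValidate(Instances data, int iterations)
            throws Exception {
        // AlternatingModelTree comes from the optional "alternatingModelTrees"
        // WEKA package; class name and -I option are assumptions.
        Classifier amt = AbstractClassifier.forName(
                "weka.classifiers.trees.AlternatingModelTree",
                new String[] { "-I", String.valueOf(iterations) });
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(amt, data, 10, new Random(1));
        return eval.correlationCoefficient(); // reported per iteration setting
    }
}
```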

Random Forest (RF)
The random forest (RF) technique [48] is a well-known ensemble-learning solution that can be applied to regression as well as classification trees [49]. RF improves on individual classification trees by randomly varying how the trees are grown, which decorrelates their predictions. To build the forest, some parameters (for example, the number of variables considered at each node split (g) and the number of trees (t)) need to be set by the user. The settings chosen for the RF technique here were as follows: seed = 1; number of execution slots = 1; number of decimal places = 2; batch size = 100; number of iterations = 100; maximum depth = 0 (unlimited); compute attribute importance = false; number of features = 0 (WEKA's automatic default). These settings are mirrored in the sketch below. The technique has been employed and recommended as a good solution in numerous studies (Ho [50], Svetnik et al. [51], Diaz-Uriarte and de Andres [52], and Cutler et al. [53]).
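The listed settings map directly onto WEKA's RandomForest options (the WEKA 3.8 Java API is assumed); a minimal sketch:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;

public class RfModel {
    /** Configure the RF with the settings reported in the text. */
    public static Evaluation run(Instances train, Instances test) throws Exception {
        RandomForest rf = new RandomForest();
        rf.setSeed(1);
        rf.setNumExecutionSlots(1);
        rf.setNumDecimalPlaces(2);
        rf.setBatchSize("100");
        rf.setNumIterations(100);                // number of trees
        rf.setMaxDepth(0);                       // 0 = unlimited depth
        rf.setComputeAttributeImportance(false);
        rf.setNumFeatures(0);                    // 0 = WEKA's automatic default
        rf.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(rf, test);
        return eval;
    }
}
```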

ElasticNet (ENet)
To understand how ENet finds a solution, some notation is needed. Consider a set of samples {(x_i, y_i), i = 1, 2, ..., N}, where each x_i ∈ R^p and y_i ∈ R. Also, let y = (y_1, y_2, ..., y_N)^T denote the response vector and X ∈ R^{N×p} the design matrix. ENet (as described in Zou and Hastie [54]) estimates the target through an optimization with two tuning parameters: it minimizes the squared loss with an ℓ2-regularization term under an ℓ1-norm constraint, where β = [β_1, β_2, ..., β_p]^T ∈ R^p denotes the weight vector, μ_2 ≥ 0 is the ℓ2-regularization factor, and g > 0 is the ℓ1-norm budget. The ℓ1 constraint encourages the solution to be sparse. The presence of the ℓ2 regularization term makes the problem strictly convex, and hence the solution unique; even when p ≫ N, the optimization remains stable for noticeable values of g, and the solution is also more stable when there is high correlation between the features. For the number of models (i.e., the length of the lambda sequence to be generated), the value of 100 was used. For the number of decimal places (i.e., the number of decimal places to be used for the output of numbers in the model, as described for LLWL), the usual value of 2 was selected; up to two decimals are considered sufficient for the required accuracy of the final outputs. The batch size (the preferred number of instances to process) was kept at the default of 100, and the values of alpha and epsilon were set to 0.001 and 0.0001, respectively. Along with the above-mentioned structure, a unique linear regression equation can also be extracted from the ENet, as shown in Equation (4); the underlying optimization is sketched below.
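In the notation above, the ENet estimate can be written in the standard constrained form given by Zou and Hastie [54]:

$$ \hat{\beta} = \underset{\beta \in \mathbb{R}^{p}}{\arg\min} \; \lVert y - X\beta \rVert_2^2 + \mu_2 \lVert \beta \rVert_2^2 \quad \text{subject to} \quad \lVert \beta \rVert_1 \le g $$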


Radial Basis Function Regression (RBFr)
Radial basis function regression (RBFr) has a unique network structure, as illustrated in Figure 9. Equation (5) gives the basis function of this network [55]. To solve the regression problem, RBFr fits a collection of kernels to the dataset; in addition, the method accounts for the positions of noisy samples.

Here, O_i stands for the output of the i-th neuron, x_i denotes the center of kernel k, and τ_i stands for the width of the i-th RBF unit. A common choice for the basis function of Equation (5) is the Gaussian form

$$ k(\lVert x - x_i \rVert) = \exp\!\left( -\frac{\lVert x - x_i \rVert^2}{2 \tau_i^2} \right) \qquad (5) $$

The RBFr model utilizes a batch algorithm to determine the number of kernels to grow. The approximating function used in the RBFr model then takes the form

$$ f(x) = \sum_{i=1}^{z} \varphi_i \, k(\lVert x - x_i \rVert) \qquad (6) $$

where ‖·‖ symbolizes the Euclidean norm, {k(‖x − x_i‖) | i = 1, 2, ..., z} stands for a group of z nonlinear, constant RBF units, and φ_i is the regression coefficient.


Model Assessment Approaches
To evaluate the reliability of the early estimated heating load in energy-efficient residential buildings, five statistical indices widely used in academic studies, namely the root mean squared error (RMSE), relative absolute error (RAE, in %), mean absolute error (MAE), root relative squared error (RRSE, in %), and coefficient of determination (R²), are used to rank the network performances. The outputs of these statistical indices are also used for color-intensity ranking. Equations (7)-(11) give the definitions of R², MAE, RMSE, RAE, and RRSE, respectively.
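In their standard forms, these indices are defined as:

$$ R^2 = \left[ \frac{\sum_{i=1}^{S} \big(Y_i^{obs} - \bar{Y}^{obs}\big)\big(Y_i^{pred} - \bar{Y}^{pred}\big)}{\sqrt{\sum_{i=1}^{S} \big(Y_i^{obs} - \bar{Y}^{obs}\big)^2} \sqrt{\sum_{i=1}^{S} \big(Y_i^{pred} - \bar{Y}^{pred}\big)^2}} \right]^2 \qquad (7) $$

$$ MAE = \frac{1}{S} \sum_{i=1}^{S} \big| Y_i^{obs} - Y_i^{pred} \big| \qquad (8) $$

$$ RMSE = \sqrt{ \frac{1}{S} \sum_{i=1}^{S} \big( Y_i^{obs} - Y_i^{pred} \big)^2 } \qquad (9) $$

$$ RAE = \frac{\sum_{i=1}^{S} \big| Y_i^{obs} - Y_i^{pred} \big|}{\sum_{i=1}^{S} \big| Y_i^{obs} - \bar{Y}^{obs} \big|} \times 100 \qquad (10) $$

$$ RRSE = \sqrt{ \frac{\sum_{i=1}^{S} \big( Y_i^{obs} - Y_i^{pred} \big)^2}{\sum_{i=1}^{S} \big( Y_i^{obs} - \bar{Y}^{obs} \big)^2} } \times 100 \qquad (11) $$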
where Y_i^{obs} and Y_i^{pred}, as used in Equations (7)-(11), are the actual and estimated values of the heating load in energy-efficient buildings, respectively; S stands for the number of instances, and Ȳ^{obs} denotes the mean of the observed heating load values. The WEKA software environment was employed to run the machine learning models; a sketch of retrieving these indices from WEKA follows.
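All five indices are available directly from WEKA's Evaluation class; a minimal sketch, assuming the models were trained via the WEKA Java API as above:

```java
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;

public class ReportIndices {
    /** Print the five indices for a trained model on a held-out test set. */
    public static void report(Classifier model, Instances train, Instances test)
            throws Exception {
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);
        // WEKA reports the correlation coefficient; the paper labels it R^2.
        System.out.printf("R    = %.4f%n", eval.correlationCoefficient());
        System.out.printf("MAE  = %.4f%n", eval.meanAbsoluteError());
        System.out.printf("RMSE = %.4f%n", eval.rootMeanSquaredError());
        System.out.printf("RAE  = %.4f %%%n", eval.relativeAbsoluteError());
        System.out.printf("RRSE = %.4f %%%n", eval.rootRelativeSquaredError());
    }
}
```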

Results and Discussion
The present research aimed to provide a reliable early estimation of the heating load in energy-efficient building systems through several well-known machine learning solutions, namely the MLPr, LLWL, AMT, RF, ENet, and RBFr models. After running all of these techniques, the best outputs can be selected as the most trustworthy solutions for early estimation of the heating load in energy-efficient residential buildings. Therefore, to find the most appropriate predictive network, the proposed AI models (i.e., MLPr, LLWL, AMT, RF, ENet, and RBFr) were evaluated and compared. The results of the employed machine learning-based solutions and their performances are presented in Tables 3 and 4, and the overall scoring of the proposed techniques is provided in Table 5.
As illustrated in Figures 10 and 11, the AMT, RF, and MLPr models provided significant accuracy in predicting the heating load in energy-efficient buildings; however, the RF-based model can be nominated as more reliable than the other machine learning-based estimators. Total scores of 10, 25, 30, 5, 20, and 15 were calculated for the LLWL, AMT, RF, ENet, MLPr, and RBFr techniques, respectively. These scores demonstrate the superiority of the RF compared with the other nominated models.
The reliability of all proposed models, based on their R² performance for both training and testing, is shown in Figures 10 and 11. As stated earlier, the RF model provided the most reliable predictive network, with higher accuracy than the other proposed techniques, on both datasets. The network outputs of the proposed RF are illustrated in Figures 10d and 11d. Given this information, the predictive network of the RF proved slightly better than the other proposed techniques and was superior in establishing a regression relationship between the estimated and actual values.

Conclusions
In the current study, several predictive networks were introduced and evaluated. The study aimed to assess and compare several of the most well-known machine learning-based techniques in order to identify the most reliable predictive method for early estimation of the heating load in energy-efficient residential building systems. Machine learning-based solutions, namely the MLPr, LLWL, AMT, RF, ENet, and RBFr models, were employed to estimate the heating load, and the results of the best of the proposed techniques were presented. Based on the outcomes, it may be said that, except for the ENet model, all models (i.e., MLPr, LLWL, AMT, RF, and RBFr) produce good predictions of the heating load in energy-efficient building systems. The RF machine learning technique can be suggested as the most reliable and accurate of the predictive techniques examined in the present work, learning better than the other models on both the training and testing datasets. The values of R², MAE, RMSE, RAE (%), and RRSE (%) for the RF model on the training dataset were 0.9997, 0.19, 0.2399, 2.078, and 2.3795, respectively; for the AMT model on the training dataset, they were 0.9985, 0.4096, 0.5449, 4.4788, and 5.4036, respectively. The validated testing datasets of the selected techniques also showed appropriate accuracy: R², MAE, RMSE, RAE (%), and RRSE (%) for the testing output of the RF model were 0.9989, 0.3385, 0.4649, 3.6813, and 4.5995, respectively, and for the AMT model 0.9981, 0.4869, 0.6236, 5.2956, and 6.1693, respectively. The worst validation was found for the ENet technique, with R², MAE, RMSE, RAE (%), and RRSE (%) equal to 0.896, 3.2585, 4.4683, 35.4392, and 44.2052, respectively.