Review

Machine Learning Modeling for Building Energy Performance Prediction Based on Simulation Data: A Systematic Review of the Processes, Performances, and Correlation of Process-Related Variables

1 Construction Sciences Ph.D. Program, Graduate School, Istanbul Technical University, 34469 Istanbul, Türkiye
2 Department of Architecture, Faculty of Architecture, Istanbul Technical University, 34367 Istanbul, Türkiye
* Author to whom correspondence should be addressed.
Buildings 2025, 15(8), 1301; https://doi.org/10.3390/buildings15081301
Submission received: 9 March 2025 / Revised: 7 April 2025 / Accepted: 9 April 2025 / Published: 15 April 2025
(This article belongs to the Section Building Energy, Physics, Environment, and Systems)

Abstract

Machine learning models have become a viable alternative for building energy performance studies since they provide fast and reliable prediction results. However, decisions in the modeling process are sometimes made without knowing their possible impact on the results, which may lead to unstable process management. Therefore, this study aims to derive a machine learning modeling process framework focused on critical decision subjects through a systematic review of the recent literature. The preferences of current supervised modeling practices regarding process-related variables, made to obtain prediction models with high accuracy, were analyzed in studies using simulation data. In this paper, a general framework of the processes is presented through their steps and decision subjects. For these steps, the frequency of the methods used, the strategies followed against limitations, common sources of concern, and intertwined workflows are analyzed together with their effects on prediction performance in terms of accuracy. In addition, correlations between process-related variables, i.e., decision subjects, and model performance are investigated to quantify the impacts. As a result, the most effective decision subjects regarding accuracy were observed to be, in order, the machine learning algorithm used, the input variables included, and the range of the sample size.

1. Introduction

Designing buildings by evaluating the impact of design variables on the energy performance results is critical for producing energy-efficient buildings. In studies with this end in mind, both computer simulation and machine learning are used as methods to predict energy performance [1,2].
The use of computer simulation in building energy performance calculations, on which much work is present in the literature, can offer energy performance results highly calibrated against real data [3,4,5,6]. Developments in the tools used, and the integration of different tools into multi-domain platforms, have improved the extent to which dynamic effects are taken into account [7]. As a drawback, this white-box modeling, in which the results (e.g., heating and cooling loads) are obtained through a hierarchically bottom-up structure, requires expert knowledge to manage the large and complex data involved and also increases the modeling labor and calculation time [2,6].
Machine learning prediction models, on the other hand, use a top-down approach through regression modeling, which defines the relationships between outputs and their influencing factors [5,8]. The use of machine learning models for predicting energy performance has become widespread because it is faster than computer simulation, has a simpler modeling approach for the model user with fewer features to specify, and provides reliable results [2,5,9]. These advantages also make such models attractive for the early design stage, where they can contribute more effectively to energy efficiency in buildings. However, this is a black-box modeling approach, and thus the factors affecting and boosting prediction performance cannot be clearly known during the generation of these models. Additionally, there are limitations in transferring a monolithic model produced on a case-by-case basis to other contexts, and solutions are therefore sought for these issues [2,5,8].
For the generation of machine learning prediction models, the two alternative dataset sources are simulation results and real consumption data, and the modeling processes followed with these two data types differ at some points. With real data, more intensive work has been required for collecting appropriate/valid data from various sources and cleaning those data (e.g., by managing ambiguous values, missing values, outliers, meaningless data, etc.) [10,11]. With simulation results, on the other hand, in line with the bottom-up-structured white-box nature of the simulation, proper steps can be taken toward well-represented data. This can also supply a traceback point that helps the machine learning modeler build a modeling process in which there are many alternative decisions. In addition, simulation is a more suitable method for obtaining the data needed to construct models for analyzing design alternatives in the early design period. Predicting an energy consumption result based on the relationship between the inputs and outputs of simulation data requires supervised learning, where the model is trained and makes predictions based on a labeled dataset, i.e., the input and output variables [5,9,12].
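The supervised setup described above can be sketched minimally: a labeled dataset of simulation inputs and outputs trains a regression model, which then predicts the output for a new design. The feature names and the linear model below are illustrative assumptions, not taken from any reviewed study.

```python
import numpy as np

# Labeled dataset: rows of design inputs X with simulated outputs y.
# Feature meanings (e.g., WWR, U-value, floor area) are illustrative only.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0       # stand-in for simulation results

# Train: fit a linear regression by least squares (intercept via a ones column).
X1 = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Predict for a new, unseen design (last entry is the intercept term).
x_new = np.array([0.3, 0.5, 0.2, 1.0])
y_pred = x_new @ coef
```

Any supervised regressor (artificial neural network, support vector regression, tree ensemble, etc.) can replace the least-squares fit; the labeled input–output structure of the dataset is what defines the supervised setting.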
Previous reviews that examined building energy prediction modeling studies, including those using simulation data and, therefore, supervised machine learning, usually analyzed the subject in a broad context to understand the general trends and approaches. For instance, in [13], where studies on building energy performance forecasting at both individual and urban scales were reviewed, the commonly used machine learning models, commonly investigated building functions and energy consumption types, overall input types commonly used in the models, etc., were put forth and discussed. Likewise, in [14], where studies on forecasting building energy demand were classified into groups and sub-groups by main modeling approach, the targets of the studies, the commonly used prediction models and software, the limits and advantages, etc., were investigated for each sub-group. Similarly, in [15], where studies using data-driven approaches, including machine learning, for building energy consumption prediction were investigated, the models employed, building functions considered, energy consumption types predicted, input types utilized, etc., were analyzed. However, an in-depth analysis of the details and dynamics of the modeling process, together with its interconnected steps and decision points, was not included within the scope of these reviews. The current review was limited to supervised machine learning modeling studies using energy performance simulation data to produce the dataset, but not to particular inputs, outputs, or algorithms; it focused on investigating how all the decisions in the modeling process affected the accuracy performance of the prediction models. With this narrowed scope, a wider variety of studies regarding these decision points could be included.
This also allowed questioning the factors affecting the decision-making process within the scope of this review.
Accordingly, one of the primary aims of this study was to obtain a general modeling framework by reviewing these research studies. As aforementioned, it also aimed to discuss the possible impacts of the decisions made and the steps followed on the prediction accuracy of the models. In this respect, decisions made at different steps were examined individually or cumulatively, reflecting the holistic nature of the prediction model, together with analyses determining the common limitations and concerns handled at different steps. In this paper, following the ‘2. Methodology’ section, which explains the systematic review and analysis processes, the results obtained through these processes are presented and discussed under headings in line with the modeling framework structure obtained via the review.

2. Methodology

The Web of Science database was used to search for recent research studies on prediction modeling for building energy performance results published between 2013 and 2023 (Figure 1; Appendix A as a PRISMA flow diagram). The keyword string of ‘prediction or estimation or forecasting’, ‘building’, ‘energy’, and ‘machine learning’ was initially searched in the topic field of the database, and a total of 989 publication records in English were found. In the succeeding abstract screening and elimination phase, modeling studies targeting building energy performance output variables grouped under the categories of operational energy (either as load or consumption), comfort, environment, and economy were selected for in-depth analysis. Studies considering building energy performance, for instance, through the prediction of occupancy and associated activity [16] or of building interior temperature [17], were excluded from the systematic review because they lacked the targeted output variables. Likewise, studies predicting photovoltaic power, wind power, or the power needs of electric vehicles were eliminated for the same reason. Additionally, one of the goals of this review was to examine prediction models that may assist architects during the design process with regard to the energy efficiency of buildings, and therefore modeling studies that did not consider the physical variables of the building in the input set were excluded as well. This also provided the opportunity to compare models targeting the early design phase with those without that limitation. For similar reasons, studies that used a real dataset only, or that performed a multi-objective optimization without a prediction, were also eliminated through title and abstract screening, in addition to those that could not be accessed.
A total of 55 studies remained that used a dataset obtained by the simulation method for machine learning prediction modeling, and these were analyzed in detail. Since a set of labeled inputs associated with the outputs was used for training and testing the models in all of the studies reviewed, the analysis was ultimately limited to the regression modeling of supervised learning [18] (the list of these studies and detailed information on modeling-related issues are presented in Table A1).
In the in-depth examination of these 55 studies, the prediction modeling process framework was first structured with its main stages and their constituent steps, by systematically reviewing the steps followed in each study while also taking its main aim into account. Additionally, the decision subjects in these steps found to be critical to prediction performance were gathered and grouped to discuss the common and differing points in the studies. The common concerns causing the limitations, and the approaches used to overcome these concerns, related both to machine learning modeling (i.e., approaches and methods) and to building energy performance, were reviewed and listed as well.
In the second step of the in-depth examination, the distributions of the process variables, i.e., the input and output types and the machine learning algorithms, were first investigated and discussed. The variations in these respects in the studies targeting the early design stage were examined and discussed as well. Then, for the impacts of the process variables, the relationships between the accuracy performance and individual decision subjects were examined first. Afterward, as the last step, correlation analyses were conducted between the process-related variables to observe their effects on each other, considering each model as a whole with its decided variables and accuracy performance. The correlation analysis methods used in this step varied depending on the data type. Theil’s U statistic [19] was used for the analyses among categorical (discontinuous) variables. The correlation ratio (eta) [20] was used for analyses between continuous and categorical variables, and both the Pearson correlation [21] and mutual information [22] were used for the continuous variables. While the Pearson correlation [21] analyzes the linear relationship strength between two variables, the mutual information method can also capture non-linear relationships [22] and was therefore implemented to investigate the presence of any non-linear conditions. The result ranges of the correlation methods used and the evaluation of their results differ from each other. The results of both Theil’s U statistic and the correlation ratio range between 0 and 1, where 0 means no information is contained about the relevant variable, indicating no correlation, and 1 means full information is contained, indicating a full correlation. Considering the scale provided in [20], correlation ratio values less than 0.25 were accepted as weak correlations, those between 0.25 and 0.4 as moderate, and those greater than 0.4 as strong.
The results of the Pearson correlation method range between −1 and 1, where 0 means no correlation; values up to 0.4 (or down to −0.4) are evaluated as a weak correlation, those between 0.4 and 0.7 (or −0.4 and −0.7) as moderate, and those above 0.7 (or below −0.7) as a high correlation [21]. Dython and Scikit-learn, which are data analysis toolsets in Python, were utilized for conducting these analyses [23,24]. As part of the correlation analyses, statistical significance tests were also performed to check the validity of any relations found. A chi-squared test [25,26] between categorical variables and ANOVA [26] between categorical and numerical variables were implemented for this.
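The two categorical-aware measures used above can be sketched in plain Python/NumPy. The hand-rolled functions below are for illustration only (the study itself used the Dython and Scikit-learn implementations), and the example data are invented.

```python
import math
from collections import Counter
import numpy as np

def correlation_ratio(categories, values):
    """Eta: association between a categorical and a continuous variable.
    0 means no correlation, 1 means a full correlation."""
    values = np.asarray(values, dtype=float)
    cats = np.asarray(categories)
    grand_mean = values.mean()
    ss_between = sum(
        (cats == c).sum() * (values[cats == c].mean() - grand_mean) ** 2
        for c in np.unique(cats)
    )
    ss_total = ((values - grand_mean) ** 2).sum()
    return math.sqrt(ss_between / ss_total) if ss_total > 0 else 0.0

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def theils_u(x, y):
    """Uncertainty coefficient U(x|y): how much knowing y reduces the
    uncertainty about x. Asymmetric; 0 = no information, 1 = full."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0
    n = len(x)
    h_x_given_y = 0.0
    for yv in set(y):
        idx = [i for i, v in enumerate(y) if v == yv]
        h_x_given_y += (len(idx) / n) * entropy([x[i] for i in idx])
    return (h_x - h_x_given_y) / h_x

# Example: algorithm choice (categorical) vs. accuracy (continuous);
# the values are hypothetical, not from the reviewed studies.
eta = correlation_ratio(["ANN", "ANN", "SVM", "SVM"],
                        [0.95, 0.97, 0.90, 0.92])
```

On the scale cited above, the resulting eta (above 0.4) would read as a strong association, since within-algorithm accuracies cluster tightly around their group means.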
In addition to these machine learning-related analyses, the distribution of building functions and climate types considered in the studies reviewed was investigated, as shown in Figure 2, for a brief analysis in the field of building energy performance. The Köppen–Geiger climate classification based on main climate and subsequent precipitation and temperature conditions was used for this purpose [27]. Among the 40 studies where both the building function and location were specified, the most studied climate was found to be the Cfa type, i.e., warm temperate climate with humid and hot summers, which is also known to be a humid subtropical climate. Regarding the building functions, residential (20) and office buildings (18) were the most-studied types.

3. Machine Learning Modeling Process in Building Energy Performance Prediction

Building energy performance prediction by machine learning modeling aims to obtain results quickly and with high accuracy. The review of the selected publications showed that the process of generating a machine learning model for this purpose consists of three main stages: (i) dataset preparation, (ii) machine learning model preparation, and (iii) objective function(s) optimization. The first two stages contain sub-stages, in other words steps, some of which do not necessarily take place in every study, as shown in Figure 3. The last stage aims to optimize the building performance and was excluded from the detailed discussions because it is not directly related to prediction modeling.
In this model generation process, some steps existed in all the studies reviewed: the case description step of the dataset preparation stage, and the data preprocessing, machine learning algorithm selection, training and testing, and performance evaluation steps of the machine learning model preparation stage. These steps can therefore be considered the necessary minimum to build a prediction model. For the remaining steps of the first two stages (except the initial dataset production step, which is present in studies that do not use data taken from another source, and the variable importance determination step), their presence was determined largely by the aims of the studies. These aims fell into three groups as follows (Figure 3):
(i) Improving the prediction performance by focusing on the dataset preparation stage;
(ii) Improving the prediction performance by focusing on issues related to the machine learning model preparation stage;
(iii) Developing a particular model, approach, or framework, e.g., for the early design stage [2,28], occupant-behavior consideration [29,30], or the optimization of objective functions [31,32].
During the machine learning model generation process, there are decision subjects in each step that may raise concerns. The sources of these concerns were observed to fall into three groups:
(i) Dimensionality, i.e., the trade-off between representing the data sufficiently and keeping the complexity manageable;
(ii) Uncertainty regarding the accuracy of the predictions compared to the real building energy consumption/impact results, which is affected by factors such as the use of appropriate tools together with sufficient knowledge of simulation modeling, the wide range of probable building design options combined with a lack of detailed information about them in the design stages considered, the performance of the machine learning algorithm used, etc.;
(iii) Assembling an appropriate combination of approaches or methods.
It was also observed in the studies that strategic approaches to these and other concerns recurred throughout the process, as in the case of the dataset size, which is shaped across the case description, feature selection, and sampling steps, as shown in Figure 4, along with the need to resolve the tension between the constraints and the aims of the studies. This review also revealed interrelated workflows, mainly related to the dataset preparation stage. The associated variables of the dataset (the dataset size, the features to be included, the feature values, the feature value distribution, and the abstraction relevant to the subject at hand) could either be decided all at once directly under the relevant step(s) of dataset generation, or the dataset could be created with a logic gradually intertwined across different steps, as shown in Figure 4. These interrelated workflows between the steps and the common concerns necessitate a holistic approach to the process.
In the following subsections, which are structured around the steps of the first two stages, different studies regarding each step and the contribution of that step to the whole model are explained. Additionally, the factors affecting modeling performance are evaluated in detail in Section 3.2.4 (Performance Evaluation) by comparing all the associated studies and particular decisions in the steps. The data preprocessing and variable importance determination steps are not discussed in separate sections, since comparative studies on their impact on accuracy performance did not exist in the articles reviewed. In brief, data preprocessing mainly prepares the dataset used for machine learning modeling by eliminating missing and repeating values and/or normalizing and randomizing the data to improve generalizability. The variable importance step is applied to obtain the contribution of each feature to the trained models.
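As a minimal illustration of that preprocessing step, an assumed generic pipeline (not the procedure of any particular reviewed study) that drops missing and repeated samples, min–max normalizes the inputs, and shuffles could look like:

```python
import numpy as np

def preprocess(X, y, seed=0):
    """Generic preprocessing sketch: eliminate missing values and
    repeating rows, min-max normalize inputs, and randomize order."""
    data = np.column_stack([X, y])
    data = data[~np.isnan(data).any(axis=1)]            # eliminate missing values
    data = np.unique(data, axis=0)                      # eliminate repeating rows
    Xc, yc = data[:, :-1], data[:, -1]
    lo, hi = Xc.min(axis=0), Xc.max(axis=0)
    Xc = (Xc - lo) / np.where(hi > lo, hi - lo, 1.0)    # min-max normalization
    order = np.random.default_rng(seed).permutation(len(Xc))  # randomization
    return Xc[order], yc[order]
```

Normalization bounds would in practice be computed on the training split only and reused for the test split, to avoid leaking test information into the model.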

3.1. Dataset Preparation

In the preparation of the dataset, the commonly followed steps were (i) description of the case; (ii) production of the data by simulation, unless the decision was made to use existing data; (iii) augmentation of the produced data, if preferred or found necessary; and (iv) selection of the features, in other words the building energy performance variables, to allow a faster prediction, when preferred or found necessary. Regarding the last step, feature engineering was sometimes performed either instead of or in addition to feature selection. Working on a dataset with relevant features (i.e., building energy load variables) in line with the aim and output of the study was observed to be one of the important determinants of the model’s prediction performance.

3.1.1. Case Description

The case description step covers the data collection processes that allow the construction of the simulation models in the next step, such as the weather data of the selected location, information on the common characteristics of the buildings at that location, etc. These were determined considering the abstraction related to the case description and the design stage targeted. During data collection, the studies reviewed showed that different methods were used to acquire building features, such as the form and envelope-related features. In addition to the field research method, commonly applied fully or partially in studies where real data were used regarding energy use (e.g., in [32,33,34]), literature research (e.g., in [35,36]) or hybrid methods (e.g., in [32,33,37]) could be used.
Obtaining data that are as close to the real situation as possible and expanding the case scope to consider larger-scale and more diverse situations were among the common goals in preparing the input dataset for the computer simulations. However, this effort might also have increased the level of uncertainty regarding the validity of the resulting prediction model for the real world. This is because the difficulty of validating simulation-produced data grows as dimensionality increases with the effort to model dynamic variations close to the real world, and because accessing the real data of buildings is difficult [38,39]. Another uncertainty source was the design stage targeted by the prediction model: especially in the early design stage, the uncertainty was high, since the key decisions had not yet been made (e.g., [2,8]). These uncertainties usually resulted in the simplification/abstraction of the model, and most studies were conducted considering a typical built environment and/or building model representative of the existing buildings in a particular area. In this regard, abstraction approaches such as creating building models with the common features of the existing buildings or building groups (e.g., in [29,40,41]), working in limited locations (e.g., in [42,43,44]), and considering monthly or annual weather conditions rather than hourly or daily data (e.g., in [39,45]) have been used. Standards were also observed to be used as data sources in this respect. ASHRAE Standards, for instance, were utilized for determining the occupant-behavior-related variables and their values related to the building’s operational characteristics, such as the lighting power, occupancy, and electric equipment power densities [29,33].
In this manner, some studies also proposed frameworks to account for, e.g., occupancy [29,30] and weather variations [32]. Despite the efforts to obtain larger datasets for these more comprehensive cases, tested models obtained with small datasets yet with sufficient accuracy also appeared in the literature [9].
To benefit from the potential of the machine learning model to work faster and with fewer parameters than the simulation, studies targeting the early design stage, where the most effective decisions for the final building are made, have also been carried out. For this purpose, some studies developed tools and framework suggestions enabling the designer to explore and compare the energy results of numerous potential designs without creating a simulation model. Because only a few parameters can be decided at the early design stage, very many possible building design alternatives remain, shaped by the parameters usually considered at the detailed design stage. For example, the use of a simplified model in the simulations considering the early design stage presented larger prediction gaps than using a detailed model [6]. Therefore, a probabilistic approach was observed to be preferred in some studies instead of a deterministic one in order to deal with those kinds of uncertainties. In this approach, the predicted results in response to the parameters decided at the early design stage are presented as a range, covering the results of all probable designs generated by considering the corresponding values of the parameters not yet decided. This was applied, e.g., for buildings’ environmental impacts [28] or energy use intensity [46].
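That range-based presentation can be sketched as follows. The `predict` surrogate and the parameter names are hypothetical placeholders; the reviewed studies' actual models are of course more elaborate.

```python
from itertools import product

def prediction_range(decided, undecided_options, predict):
    """Evaluate the surrogate for every combination of the not-yet-decided
    parameters and report the early-design prediction as a (min, max) range."""
    results = []
    for combo in product(*undecided_options.values()):
        design = {**decided, **dict(zip(undecided_options, combo))}
        results.append(predict(design))
    return min(results), max(results)

# Toy linear surrogate with invented parameters (illustrative only).
predict = lambda d: d["area"] * d["u_value"] + 10 * d["wwr"]
lo, hi = prediction_range(
    decided={"area": 100},                               # fixed early on
    undecided_options={"wwr": [0.2, 0.4],                # still open
                       "u_value": [1.0, 2.0]},
    predict=predict,
)
```

The designer then sees the interval `[lo, hi]` rather than a single deterministic value, reflecting the spread over the parameters not yet decided.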
In addition, the studies aiming to evaluate the generalization flexibility of machine learning modeling were observed to comparatively use both complex datasets and their simplified versions, or input datasets with different features. In this context, static and dynamic models [2], building stocks with and without various building types [40], cases with building shapes of different complexities [6,32], and cases considering different environmental contexts [47] were compared. The results of these studies show that, although increasing the complexity is likely to lower the prediction accuracy, some models, albeit in limited cases, showed better results with increased complexity (e.g., [32,40]).
In relation to the aim of the studies, it was observed that those focusing on the improvement of the machine learning models might prefer using either limited input categories [43] or location-independent generic models [35]. Also, the dataset from Tsanas and Xifara’s study [42] was used in more than one-fourth of the studies, i.e., in 16 of the 55 studies including the original [1,5,48,49,50,51,52,53,54,55,56,57,58,59,60]. Therefore, especially for these studies using datasets obtained from another source, the scope of the case description step could be relatively small compared to those focusing on other aims. It was also seen that using the same dataset paved the way for comparisons between studies, as a solution to the difficulty of evaluating the impact of various decisions on the prediction performance of the machine learning models.

3.1.2. Data Production by Simulation

In the data production step, the main decision subjects for starting the simulations were observed to be the output types to be calculated numerically, the computer simulation software to be used, and the number of samples to be included in the dataset for machine learning modeling, i.e., the dataset size, the building and environment-related features to be included in the simulation, their values and distribution, and the abstraction level, in other words, the level of detail.
Simulation software commonly found to be reliable was preferred in the studies, to ensure that the gap between the real energy consumption data and the data produced for the prediction model was as small as possible. EnergyPlus and DesignBuilder were the most-used simulation software in the studies reviewed for the outputs in the energy and comfort categories. Working on larger scales through the integration of tools or software was also seen to be possible, as in the case where Modelica was used in combination with EnergyPlus for integrated district modeling [39]. A seamless simulation process was also implemented in some studies to make data production faster and more automated, using tools/plugins to link the physical modeling tool with the energy performance simulation software. For instance, Rhino was linked with Daysim through Grasshopper for the daylight simulation of a parametrically changing building form and fenestration design solution [61], and Revit was linked with the ICE database and EnergyPlus through Dynamo for the environmental performance assessment of different generated design scenarios [28]. Also, Tally, a Revit plugin that uses the GaBi database as a life cycle assessment data source, was used directly, without the need for any additional tool, for environmental impact assessment [34]. Likewise, regarding EnergyPlus, the jEPlus tool was used to support the operation of parametrically processed simulations [62], and a model producer script was written to create the EnergyPlus input files representing different occupant behaviors and building sizes, automating a process that involved repetitive modeling for a relatively large number of cases [29].
In the studies focusing on framework development as the main aim, Building Information Modeling (BIM) software was also seen to be used as a main part of the studies since it reduces the remodeling effort associated with the physical modeling of the building(s) and has alternative ways to transfer that information to the energy simulation software [34,63].
The review and categorization of the inputs generated in the studies for machine learning modeling showed that there were six different categorical types of input variables as shown in Figure 5 collectively, detailed in Table A2 for each study, and presented fully in Table A3 for all input levels. These were (i) time (e.g., hour of the day [62], month of the year [35,62], and time period [32]); (ii) location definition as determined in the case description step; (iii) climatic data, again as determined in the case description step; (iv) data on surrounding built environment, in other words, the urban context; (v) data on building features; and (vi) data on room features. Input variables at different complexity levels were seen to be preferred. In this context, the data could be used in:
  • Basic form, individually (e.g., as building width and building length individually [8,28]);
  • Processed/complex form, i.e., as a combination of basic data (e.g., window-to-wall ratio instead of both the window area and wall area [44,64], or compactness ratio instead of considering the building width and length separately [40]);
  • Interconnected form by using a feature combined with other variables (e.g., window-to-wall ratios [8,31,40], and solar radiation absorption coefficients [65] taken as input variables multiple times for each orientation condition separately).
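These three forms can be illustrated with hypothetical raw features; the names and values below are invented for illustration and are not drawn from any reviewed study.

```python
# Raw simulation inputs for one design sample (hypothetical values)
sample = {"window_area_n": 6.0, "wall_area_n": 24.0,
          "window_area_s": 12.0, "wall_area_s": 24.0,
          "width": 10.0, "length": 20.0}

# Basic form: use width and length individually as model inputs
basic = (sample["width"], sample["length"])

# Processed/complex form: one derived feature replaces two basic ones
wwr_n = sample["window_area_n"] / sample["wall_area_n"]

# Interconnected form: the same derived feature repeated per orientation,
# yielding one input variable for each orientation condition
wwr = {o: sample[f"window_area_{o}"] / sample[f"wall_area_{o}"]
       for o in ("n", "s")}
```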
From another perspective, in addition to the numerical/quantitative data type of features, such as floor area (e.g., [29,46]) belonging to the form/geometry category or wind speed (e.g., [2,29]) belonging to the wind category, the categorical/qualitative data types could be used in the datasets to represent different circumstances. Examples of such categorical/qualitative data used in the studies were the orientations, insulation materials belonging to the construction/material category [45], and window operation types [29] belonging to various categories for the building.
Among the input variable categories listed in Figure 5, building features, in other words variables at the building scale, such as those associated with the envelope, orientation, form/geometry, HVAC system, and internal gain parameters, were used more frequently. Variables related to cloud, illuminance, and precipitation conditions under the climate category, and those related to the lighting systems in the buildings, were included as parameters only on a limited basis. In the studies targeting the early design stages, in addition to the strategies explained in the case description step above, the use of only the parameters decided at these stages was also observed (e.g., focusing on building parameters decided in the early design stages for the cooling load and environmental impacts [66], or for the operational energy demand [67]). As another example, in a study focusing on the building form to account for the shadowing and reflection effects of the surroundings, an approach was developed by decomposing the complex mass and aggregating the results obtained with these simplified forms, and this approach was then incorporated into the proposed metamodel-based prediction model [47].
The analysis of the input–output relations in the studies and their distributions (Figure 5) showed that, in the majority of studies (i.e., 35 out of 41), outputs within the operational energy category were accounted for, which was expected given the search and review focus. This was followed by the comfort, environment, and economy categories; the input–output distribution in the studies conducted for the early design stage followed the same order. In the environment and economy categories, the final output was determined by considering other associated outputs, mostly operational energy, and, within that, chiefly the heating and cooling loads/consumptions. Studies on the embodied energy/impact of the materials used, the ventilation load, and the equipment load remained quite limited. Regarding the embodied energy, in addition to the study in the associated column of Figure 5, where this output was accounted for separately (i.e., [34]), three other studies considered it as one of the values summed to calculate the total environmental impact (i.e., [28,44,66]). Similarly, the ventilation load in [28,65] and the equipment load in [28] were used as summed values. No significant difference was observed between the number of studies that used the outputs separately (e.g., as heating and cooling loads/consumption) and those using them as a total value (e.g., as total energy consumption and life cycle cost). Additionally, among the studies predicting multiple outputs, thanks to the ability of algorithms to weight features separately for different outputs during modeling, 28 used a common dataset for all outputs, while only 4 used different input sets for different outputs.
Relatedly, as operational energy was the most frequently targeted output category, it could also be seen as the determinant of the input set in the studies; the HVAC [31] and internal gain [68] variables included in datasets for the prediction of visual comfort can be given as examples.

3.1.3. Data Augmentation

The data augmentation step aims to obtain large and properly distributed data to improve the generalization performance of the prediction model, and sampling methods are implemented for this purpose. Although parametric modeling, which can be used in the previous data production step, serves the same purpose, sampling methods enable a large sample size to be obtained while preserving the distribution characteristics of the original data.
Among the sampling methods, Latin hypercube sampling (LHS) was the most used (i.e., in 13 out of 24 studies that stated sampling method use). It is a space-filling method, in which each sample is located in a hyperplane aligned with an axis, and the data sampled with this method, even if small in size, reflect the diversity of the base data [32,61,69]. Yet, a study comparatively evaluating five different sampling methods (i.e., the classic Fourier amplitude sensitivity test, LHS, quasi-random sampling (QRS), random sampling (RS), and Sobol) in various settings found that Sobol had the best performance in all the circumstances investigated [64]. RS, LHS, and QRS are among the Monte Carlo methods, which are based on repeated random sampling to generate numerical parameters in different scenarios [64]. The Sobol method, which had a better accuracy and required less application time in the case study in [64] compared to these Monte Carlo methods, is a quasi-random low-discrepancy sequence method [19], and it is said to provide better uniformity properties than the LHS or RS methods [70]. As an alternative to deciding on a sampling method directly or to comparing and selecting the best-performing one, using various sampling methods in combination was also seen, such as in [44], where five sampling methods commonly used in building performance analyses were combined to fully represent the uncertainty characteristics and accelerate convergence.
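As a minimal, hypothetical illustration of these options, the samplers in `scipy.stats.qmc` can generate LHS, Sobol, and plain random designs over the unit hypercube, compare their uniformity via a discrepancy measure (lower is more uniform), and scale the points to assumed design-variable ranges; the three ranges below are invented for the sketch:

```python
import numpy as np
from scipy.stats import qmc

# Hypothetical design-variable ranges (e.g., window-to-wall ratio,
# envelope U-value, plan depth) -- invented for illustration only.
l_bounds, u_bounds = [0.1, 0.15, 3.0], [0.9, 2.50, 12.0]
d, n = len(l_bounds), 128

# Latin hypercube sampling: one point per axis-aligned stratum.
lhs = qmc.LatinHypercube(d=d, seed=0).random(n)
# Sobol: quasi-random low-discrepancy sequence (2**7 = 128 points).
sobol = qmc.Sobol(d=d, scramble=True, seed=0).random_base2(m=7)
# Plain Monte Carlo random sampling for comparison.
rs = np.random.default_rng(0).random((n, d))

# Lower discrepancy = more uniform coverage of the unit hypercube.
for name, s in [("RS", rs), ("LHS", lhs), ("Sobol", sobol)]:
    print(f"{name}: discrepancy = {qmc.discrepancy(s):.5f}")

# Scale the unit-cube samples to the physical parameter ranges.
design_points = qmc.scale(sobol, l_bounds, u_bounds)
```

The scaled `design_points` would then be fed to the simulation engine to produce the training targets.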
In association with the sample size as a decision subject, some studies preferred a performance comparison of various sample sizes to find the one that provided the optimum accuracy and computational time for the particular energy prediction case at hand [5,29,33,47]. These studies showed that increasing the sample size often provides a better accuracy and generalization capability in prediction modeling, but it increases the training time. Additionally, other decisions in the whole process, such as the algorithm to be used, might be a determinant of a sufficient sample size. In a study on predicting heating and cooling loads, for instance, a sample size of 25,000 was found to be optimum for Gradient Boosted Regression Trees (GBRT) in terms of the accuracy and the time required for training, while the use of techniques such as the Gaussian Process (GP) was suggested in the case of a small dataset [5]. Similarly, a study on occupant-behavior-sensitive cooling energy consumption prediction showed that Artificial Neural Networks (ANN) performed better than Ensemble Bagging Trees (EBT) in most of the evaluations made with different sample sizes and needed a smaller sample size for an effective model, but required more time for training [29].
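A sample-size comparison of this kind can be sketched as follows, using synthetic data in place of simulation results and a GBRT-type model from scikit-learn; the sizes and the toy "load" function are illustrative only:

```python
import time
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for simulation outputs: an invented nonlinear "load" function.
def fake_load(X):
    return 10 * X[:, 0] ** 2 + 5 * np.sin(3 * X[:, 1]) + X[:, 2] * X[:, 0]

results = {}
for n in (500, 2000, 8000):  # candidate sample sizes
    X = rng.random((n, 3))
    y = fake_load(X) + rng.normal(0, 0.1, n)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    t0 = time.perf_counter()
    model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
    results[n] = (model.score(X_te, y_te), time.perf_counter() - t0)

for n, (r2, t) in results.items():
    print(f"n={n:5d}  R2={r2:.3f}  train time={t:.2f}s")
```

The accuracy/time trade-off read from such a table is what the cited studies used to pick an optimum size for their cases.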
The fact that decisions can be made on only a limited set of parameters in the early design stage leads to many possible building alternatives in association with the decisions to be taken in the detailed design process, and sampling methods can be integrated to generate these alternatives in the studies. For instance, in [28,46], based on the design decisions made in the early design process, the data were repeatedly sampled in the background, and the range of probable energy performance results was obtained accordingly with the integrated prediction model generated with machine learning. Also, choosing the values of the design variables that have not yet been decided in the design process from a uniform distribution can be deemed appropriate to obtain probable results against uncertainty [46,67].

3.1.4. Feature Selection and Feature Engineering

Simpler models obtained by eliminating features with a low impact on the output can reduce overfitting, since this elimination decreases the number of weights and/or parameters in the model and improves the readability of the algorithms [29]. It can also reduce the computational cost, as the model adapts to the data faster [5,9,63]. Therefore, feature selection is applied to eliminate these kinds of features, and sensitivity analysis is generally employed for this purpose.
In the studies reviewed, two different sensitivity analysis approaches were applied: global and local. In global sensitivity analysis, the model is treated as a whole and the mutual effects of the inputs are considered, since the selected features may be affected by one another and some features may carry common information because of complex/multilinear relationships. In local sensitivity analysis, on the other hand, the inputs are considered independent of each other [9,64]. The high-impact features determined by these two approaches may differ. For instance, in a study comparing 'sensitivity analysis about the mean' and 'state-based sensitivity analysis' as local and global techniques, respectively, the impact level of some features varied according to the technique used. It was concluded that calculating the variability of a single input while keeping the others constant (i.e., local analysis) might not be a correct approach due to possibly high correlations between inputs, and global sensitivity analysis was suggested for more realistic results when certain input variables are correlated with others [1]. Additionally, the sensitivity analysis results were observed to vary depending on the case description (e.g., building type [32], location [69]) and the level of detail of the output considered [33].
Most studies performed feature selection by ranking the parameters according to their effect on the output and eliminating those in the lower ranks, i.e., with the least impact. On the other hand, some studies that used subsets created in a mixed order, with features having varying levels of impact according to the feature analysis, showed that better results could be obtained with that strategy [69,71]. Concerning how the sensitivity analysis results were used, in addition to the direct use of the results of a selected method, a concept called the 'cumulative contribution rate' was also seen, which takes the total effect of different analyses and/or different outputs to account for the results of different methods [44,64]. Additionally, since some algorithms may extract more meaning from ineffective features (e.g., GBRT [5]), the results of the feature selection process also need to be tested together with the selected machine learning algorithm for a better prediction accuracy.
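A simple impact-ranking selection of this kind can be sketched with permutation importance (one of many possible sensitivity measures) on synthetic data; the impact threshold of 0.02 is an arbitrary, case-specific choice:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((2000, 5))
# Only the first three features drive this invented output; 3 and 4 are noise.
y = 4 * X[:, 0] + 2 * X[:, 1] ** 2 + np.sin(5 * X[:, 2]) + rng.normal(0, 0.05, 2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Permutation importance on held-out data: drop in R2 when a feature is shuffled.
imp = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]
# Keep features above a (case-specific) impact threshold, drop the rest.
selected = sorted(i for i in ranking if imp.importances_mean[i] > 0.02)
print("ranking:", ranking.tolist(), "selected:", selected)
```

As the surrounding text notes, any such selection should still be re-tested with the final algorithm, since some algorithms can extract value from features a sensitivity measure ranks low.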
In the case of energy prediction during the early design stage, some studies proposed tools that integrate sensitivity analysis in order to provide information on the effects of the parameters for the designer to take into account [28,46]. For instance, in a tool of this kind, the designer was expected to make decisions by reducing the value ranges of the variables and was informed about the effective parameters with the instant sensitivity analysis depending on the decisions made [46].
As an alternative to feature selection considering the sensitivity analysis results, in feature engineering, raw data are extracted and transformed by taking into account complex relationships, and this can be applied in various ways. For instance, in [2], a component-based system that parameterized the whole building and its components was used for this purpose to allow for the reusability of the model. In the same respect, in [72], physical equations of heat flow were used to transform the data from raw into a form that captures the interaction between the building and its environment. In the same context, in [61], the spatial properties of the sensor points were obtained by reorienting and normalizing their positions relative to the local coordinate system of the room.

3.2. Machine Learning Prediction Model Preparation

At the main stage of machine learning prediction model preparation, the data preprocessing, machine learning algorithm selection, training and testing, algorithm optimization, performance evaluation, and variable importance steps are usually carried out. In the following subsections, the main issues in relation to these steps, except for the first and last, are presented.

3.2.1. Machine Learning Algorithm Selection

The algorithm selection process of the studies varied, mainly depending on the aim of the studies and the dataset features that could be explored during data preprocessing with the analysis of, e.g., data distribution, complexity, or uncertainty level. The algorithms that were used in the studies reviewed are presented in Table 1, together with their commonly used abbreviated forms, which will be used hereafter.
Overall, a total of 56 different algorithms, which can be grouped under 19 different algorithm classes, were used in the reviewed studies. The use of single-type models was usually preferred (49 out of 56 studies), while the use of relatively new ensemble-type models, where two or more algorithms are combined to take advantage of the strengths of each, was lower (23 out of 56 studies). Yet, they were explored increasingly because of their efficiency according to the test results in different cases. It was also seen that, in addition to already developed algorithms commonly preferred in the field of building energy performance (e.g., Neural Network algorithms, such as DNN, BPNN, and CNN), new algorithms created by combining different algorithms were used as well (e.g., Geometric Semantic Genetic Programming with Local Search [52], the Grasshopper Optimization Algorithm with Artificial Neural Networks [55], and Stochastic Fractal Search with Artificial Neural Networks [56]).
The most-used algorithms were the ANN, RF, and SVR, respectively, and in 34 of 55 studies, at least one of these algorithms was used. For the studies focusing on the early design stage, the ANN’s and RF’s dominance continued with their use in 8 studies out of a total of 13 studies. Since there is no single algorithm with a superior performance suitable for every problem, comparing the algorithms to determine the appropriate one was a frequently used approach in the studies reviewed (i.e., in 34 of 55 studies). It was observed that there were two trends in the choice of algorithms for that comparison. The first of these was the evaluation of only the frequently used algorithms for building energy studies. The second one, on the other hand, was implemented by comparing new or limitedly used algorithms with the frequently used ones. Regarding the latter approach, performance evaluations were performed for the DNN [54], ELM [43], EMARS [60], and LSSVM [73], as algorithms with limited use in the building energy field, by comparing them with the frequently used ones, such as the ANN, or SVM, with a proven good performance for complex problems of this field. These comparisons showed that the best model fitted to the data could also be built with an algorithm that was not tested enough in the associated field [43,73].
Algorithms, in connection with their own learning strategies, may perform certain operations internally, and this may be an element to be considered while selecting a modeling algorithm. For instance, the ANN, RF, and SVR are among those where the weighting of inputs in correspondence to the outputs is performed integrally during modeling [40,62]. Likewise, as inbuilt processes during modeling, GBRT can extract additional information from non-statistically correlated features [51], RF performs sampling [34], and Fuzzy Logic-based algorithms take into account the uncertainty effect [53]. In addition, it is possible to find algorithm options that can show sufficient prediction performances for some limitations, such as a small dataset [9], or a dataset with very few features [53].

3.2.2. Training and Testing

During the generation of a prediction model, the selected algorithm is first trained with a certain part of the dataset to obtain an appropriate model, and the model is then tested with an unseen dataset to evaluate its generalization performance. The evaluation of the prediction model is performed by considering its performance on the data in the testing set. An accuracy obtained with the testing set that is lower than that obtained with the training set is an indication of overfitting, showing that the model cannot derive appropriate meaning from the data [37,70].
Sets for training and testing are created by randomly dividing the entire dataset into two sets, or three when an additional validation set is preferred. The most frequently used proportional divisions in the studies without a validation set were 80/20% and 75/25%, i.e., in 10 and 8 studies, respectively. When a validation set was preferred, 10% or 15% of the dataset was generally used for it, reducing the portions of the training and/or testing sets (e.g., 10% for the validation set and 65/25% for the training/testing sets [29]; 15% for the validation set and 70/15% for the training/testing sets [1,32,45,66,69]). Changing the proportions of the sets in different phases of modeling was also seen. For instance, in [49], an 85% split for the training set was preferred for the overall preliminary design of the prediction models, and the training set was then reduced to 70% of the dataset while working on individual parts (i.e., networks). The dataset size was also observed to be a determinant of these proportions; for example, due to the large dataset with 100,000 samples, the test data could be set as small as 5% of the total data in [68]. In addition, cross-validation is applied during the training process to repeatedly tune the algorithms according to the prediction errors, using the subsets generated by dividing the training and testing sets into smaller pieces [2]. In terms of the fold number, 10-fold cross-validation was the most used, observed in 10 studies (e.g., [30,40,50]).
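The common 70/15/15% split and 10-fold cross-validation can be sketched with scikit-learn as follows; the generated dataset and the Ridge model are stand-ins for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in for a simulation-generated dataset.
X, y = make_regression(n_samples=1000, n_features=8, noise=5.0, random_state=0)

# 70/15/15% train/validation/test split via two successive random splits.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)

model = Ridge().fit(X_train, y_train)

# 10-fold cross-validation on the training set, the most common fold count.
cv_scores = cross_val_score(Ridge(), X_train, y_train, cv=10, scoring="r2")
print(f"CV R2: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
print(f"test R2: {model.score(X_test, y_test):.3f}")
```

A test-set score well below the cross-validation score would be the overfitting symptom described above.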
One of the general goals in machine learning prediction modeling is to obtain a model suitable for a wider range of applications than the limited problem on which it was trained. Therefore, instead of the common practice of generating the training and testing sets from a common dataset, a performance evaluation can occasionally be performed with deliberately differentiated training and testing sets, where unseen design cases or feature values are used in the latter. For instance, in the reviewed studies using this strategy, the training and testing sets were differentiated by considering buildings with different numbers of stories [72] or different shapes [32].

3.2.3. Algorithm Optimization

Hyperparameters are used to control the learning process and the flexibility of the model in order to obtain an adequate generalization capability. Overfitting or underfitting to the data is prevented by configuring them. For instance, using a complex model for simpler data causes overfitting through overlearning, so the model predicts poorly on unseen data [50]. In such cases and others, algorithm optimization through the tuning of hyperparameters is therefore an important step that improves the prediction performance of the model.
Hyperparameters are specific to machine learning algorithms, and they are predefined outside the training procedure [50,63,70]. For example, the most commonly used neural networks are developed by imitating the neuron and synapse structure of the human brain; the modeling is optimized through the numbers of neurons and layers, on which the weighting and activation information is based, and, accordingly, through the number of iterations and the learning hyperparameters and their values (e.g., [43,64]). Interconnected neurons stimulated with different weights create the flexibility to work with non-linear data, in contrast to linear regression. In the ANN, generally one hidden neuron layer processes the input-layer signal to the output layer, whereas the DNN has more than one hidden layer and takes more complex relationships into account [29]. Tree-based ensemble-type algorithms, such as the RF, ET, and GBRTs, have hyperparameters to control the depth of the model via components of the algorithms, such as the number of trees and leaves (e.g., [51,65]); the final estimate is obtained by combining the estimates of different Decision Trees (DTs) [62]. Gradient Boosting adds weak learners to the model sequentially, so that each learner fits the residuals of the previous one. The GP algorithm, which is based on the Darwinian theories of natural selection and survival, has hyperparameters such as the chromosomes, the number of genes, and the mutation rate, and its population of potential solutions goes through repeated evolutions over successive generations via mutation and crossover processes [43]. Apart from the frequently used algorithms, limitedly used ones were also tested in the studies to observe their potential, and in some cases they showed performance superior to that of commonly used algorithms. For example, MPMR, which performed better than the DNN, is framed in a way that maximizes the probability of its predictions falling within given bounds [54].
Another relatively new type of algorithm, referred to as hybrid models, combines an algorithm with another algorithm or strategy. For example, Geometric Semantic Genetic Programming with linear scaling and local search, as a hybrid model, showed a better performance than Geometric Semantic Genetic Programming alone [52], since the linear scaling and local search strategies supplied convergence in a small number of generations and produced quantitatively better results.
When common inputs were used for various outputs, tuning the models for each output separately was usually preferred in the studies to manage the different weighted connections specific to the dataset; for example, the heating and cooling loads were modeled separately in 10 studies, apart from being taken as a total value in 8 studies. In the modeling of the cooling and heating loads, the cooling load was optimized with a higher number of estimators, indicating the need for a more complex model structure [51]. Also, in [74], modeling the lighting energy demand and the cooling energy demand separately resulted in a higher accuracy than modeling the output obtained by summing them.
The search for a suitable configuration of the different hyperparameters and their values could be done either manually or with automatic methods. Since the manual search requires too much time and effort, automatic approaches were used more often in the studies reviewed, i.e., in 26 of the 55 studies. Among the automatic optimization methods, the grid search method was used the most, i.e., in 11 of the 26 studies. Yet, in some of the studies, an optimization method selection phase was integrated into this step to compare the performances of different methods and select accordingly [44,50,58,59,69]. For instance, in [50], ten different methods, including some base methods, were compared, and a modified method called modified Jaya was found efficient for the case. Likewise, in [44], randomized search showed the same accuracy performance as grid search, but a superior performance in terms of computational cost. Additionally, the use of hybrid methods was seen as an alternative to a single method, e.g., in [53], where a hybrid of backpropagation and least-squares estimation was used and found to be efficient.
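A minimal sketch of the two most common automatic approaches, grid search and randomized search, on an illustrative Random Forest task; the hyperparameter ranges and the generated dataset are arbitrary choices for the sketch:

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for a simulation-generated dataset.
X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)

# Grid search: exhaustively evaluates every combination in the grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8, None]}
grid = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3).fit(X, y)

# Randomized search: draws a fixed budget of configurations from distributions,
# often matching grid search accuracy at a lower computational cost (cf. [44]).
param_dist = {"n_estimators": randint(50, 200), "max_depth": [4, 8, None]}
rand = RandomizedSearchCV(RandomForestRegressor(random_state=0), param_dist,
                          n_iter=6, cv=3, random_state=0).fit(X, y)

print("grid best:", grid.best_params_, f"CV R2={grid.best_score_:.3f}")
print("random best:", rand.best_params_, f"CV R2={rand.best_score_:.3f}")
```

The cost difference scales with the grid size: grid search here fits 6 configurations regardless of need, while the randomized budget (`n_iter`) is set explicitly.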

3.2.4. Performance Evaluation

Prediction accuracy and computational cost stand as the main evaluation criteria for the performance of the model built. Studies often evaluate performance based on the results of more than one evaluation metric for these criteria. Regarding accuracy, to measure the degree to which the predicted values fit the real values/results, calculations were made along two main dimensions: trend fit and location fit. The trend fit evaluates the degree of compliance of the predicted data with the trend of the studied dataset, and the most commonly used metric for it is the coefficient of determination (R2). The location fit evaluates the accuracy based on the pointwise difference of the predicted values from the actual data, for which the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are the most used metrics [1].
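For a small, invented set of "simulated" and "predicted" load values, the three metrics can be computed as follows:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([120.0, 95.0, 143.0, 88.0, 110.0])   # e.g., simulated loads
y_pred = np.array([118.5, 97.0, 140.0, 90.5, 108.0])   # model predictions

r2 = r2_score(y_true, y_pred)                        # trend fit
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # location fit; penalizes large errors
mae = mean_absolute_error(y_true, y_pred)            # location fit; mean absolute deviation
print(f"R2={r2:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")
```

Because RMSE squares the errors before averaging, it is always at least as large as MAE and is more sensitive to occasional large deviations, which is why studies often report both.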
Regarding computational cost, the computational time required for the various modeling processes is especially critical for deciding on the optimum model for large datasets, alongside the accuracy. In this respect, the training time (i.e., the time required for fitting or processing a model) was considered in the studies as a metric to compare algorithms. For instance, the training times of algorithms differed by more than 7 times in [70], a deeper algorithm structure required more training time in [8,29,47], and the training times of optimized models differed by 12 times or more across algorithms in [63], all of which indicate that computational costs might need to be considered in addition to accuracy while deciding on process-related variables. Similarly, the training time was used to find the optimum sample size in [5,29] and the optimum feature set in [33] (the training times of the models in the studies making comparative assessments for a limited number of variables are given in Appendix D, along with their variable and common features). The testing time was also evaluated as another metric for comparing algorithms [5]. The prediction time of the trained machine learning model, on the other hand, was usually used for comparison with that of numerical simulation, and the results for the prediction models were lower than those of simulation [9,63].
To comprehensively examine the impacts of the various decisions on model performance, comparative analyses were made on the studies reviewed. These comparisons were made through the results of the R2 metric, since it was the most commonly used one, and data from a total of 35 studies sharing R2 results were collected. In addition to the R2 results, those data comprised the inputs and outputs, categorized according to the scheme given in Figure 5; the algorithms used, grouped under the algorithm classes given in Table 1; the aims of the studies, defined according to the scheme given in Figure 3; and the dataset sample sizes, grouped afterward under frequently seen intervals. The database formed by combining the data collected from these 35 studies was also used for the analysis given in Section 4. Regarding the accuracy of the models, R2 results higher than 0.75 are recommended for an acceptable-level performance, while a result greater than 0.9 is an indicator of a satisfying performance [76]. Therefore, while evaluating the impacts of decisions through the R2 value, these scores were used as thresholds for determining the R2 ranges. Accordingly, the values less than 0.75 and the values between 0.75 and 0.9 each formed a single group without sub-groups, while the values above 0.9 were divided into four subgroups with 0.025 intervals. In the evaluations, in addition to the analyses performed on all studies providing an R2 result, separate analyses were made on the studies using the same dataset and on those working on the early design stage.
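The grouping described above, with 0.75 and 0.9 as thresholds and 0.025-wide subgroups above 0.9, can be sketched with `pandas.cut` on hypothetical R2 values:

```python
import pandas as pd

# Hypothetical R2 results collected from reviewed studies.
r2_results = pd.Series([0.62, 0.81, 0.88, 0.91, 0.93, 0.942, 0.961, 0.987])

# Thresholds from [76] (0.75 acceptable, 0.9 satisfying), with 0.025-wide
# subgroups above 0.9, mirroring the grouping used in this review.
bins = [0, 0.75, 0.90, 0.925, 0.95, 0.975, 1.0]
labels = ["<0.75", "0.75-0.9", "0.9-0.925", "0.925-0.95", "0.95-0.975", ">0.975"]
groups = pd.cut(r2_results, bins=bins, labels=labels)
print(groups.value_counts().sort_index())
```

Such binned counts are what the distribution figures (e.g., Figures 6-10) summarize per algorithm class, output, and sample-size interval.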
The analysis of the relation between the class of the algorithms used and the aim of the study through the R2 results (Figure 6) indicated that, regardless of the purposes of the studies, the energy results could be predicted accurately with all classes of algorithms, with some exceptions among the commonly used ones. Neural networks, decision trees, and support vector machines, as commonly used algorithm classes, sometimes fell below the acceptable performance limit (i.e., R2 < 0.75). These performance decreases were observed to be caused by the dataset features and the algorithm optimization process. It was also observed that some less-used algorithm classes, such as the Light Gradient Boosting Machine, multivariate adaptive regression splines, and minimax probability machine regression, showed superior performances with an R2 above 0.975. No significant differentiation was observed in the distribution of the accuracy performance depending on the aims of the studies.
The analysis of the relation between the classes of algorithms used and the type of outputs (Figure 7) revealed that the algorithms used in models targeting the early design stage resulted in a poor performance (i.e., R2 < 0.75) and an inadequate model more often than the remaining ones (Figure 7a vs. Figure 7b,c). Relatedly, the decision tree class of algorithms often yielded results below 0.9 and 0.75 in the studies for the early design stage, while their results were above 0.9 in all studies using the same dataset. They also performed better in the studies outside these two groups more often than in those for the early design stage. Regarding the linear regression class, it is generally accepted to be inappropriate for working with complex data [9,32], and linear regression algorithms were considered in the studies entirely within the scope of testing multiple algorithms. However, they showed a satisfying performance, with an R2 above 0.9, in the studies using the same dataset. Yet, among the studies for the early design stage, no study working with these algorithms achieved a sufficient performance.
As mentioned before regarding Figure 5, the analysis of the inputs showed that, in the studies focusing on the early design stage, the inputs were less detailed than in the remaining studies. The outputs were almost the same, except for the embodied energy, which was not present in the studies for the early design stage. The analysis of the relation between the inputs and outputs through the R2 results (Figure 8) showed that there was no guarantee that modeling would achieve a good performance, even when an input variable group at a particular detail level was used. Both in the studies focusing on the early design stage, where a smaller input variable group limited to the frequently used building parameters was preferred, and in the remaining studies, where a larger input variable group including detailed parameters regarding the building and its surroundings was used, there were occasions with an unacceptable level of accuracy, with R2 values below 0.75 (Figure 8b,c). Although there were few studies in the economy and environment output categories, a satisfying performance with an R2 above 0.9 was obtained in all of them. When the distributions of the results for the early design stage and for the remaining studies were examined comparatively (Figure 8b,c), it could be seen that especially the studies conducted for the early design stage had insufficient R2 results for the heating load. Conversely, for thermal comfort, the studies excluding the early design stage showed an insufficient performance more often.
When the sample size–output relationship in Figure 9 and the sample size–algorithm relationship in Figure 10 were examined, no correlation could be confirmed; an increase in the sample size did not indicate an increase in the accuracy, neither in terms of the algorithm used nor in terms of the predicted outputs. The reasons for this instability could lie in ineffective decisions on the data quality representing the case with proper variation, the algorithm to be used, the optimization, etc. However, it was worth noting that, in the studies not focusing on the early design stage (Figure 9c and Figure 10c), sample sizes over 11,000 always showed a high performance, with an R2 above 0.9.

4. Discussion

To understand the effects of the process-related variables (i.e., the study aims, decision subjects, and prediction model accuracy), correlational analyses were conducted considering the R2 results. The same 35 studies analyzed in Section 3.2.4 were evaluated in this respect (Figure 11), with some differences in the way the inputs and study aims were taken into account. Instead of searching for the individual impact of each input, their combined effect on each model's accuracy was investigated through the correlation analysis; the same strategy was also used for the study aims. To discuss this in detail, the data were analyzed further with the individual R2 distribution maps presented in Figure A1. Theil's U statistic was preferred for the correlations of the categorical variables, considering that it is a method convenient for data with asymmetrical features, such as the algorithm class and algorithm in the studied case (i.e., an algorithm class corresponds to multiple algorithms, but an algorithm corresponds to a unique algorithm class only). The correlation ratio (eta) [20] was used for the analysis between continuous and categorical variables; to obtain a result for every variable under consideration, the sample size data were organized in ranges, in other words, converted to categorical data, as in Section 3.2.4. The Pearson correlation [21] and mutual information [22] methods were preferred for the correlation of the continuous variables (i.e., R2 and the actual sample size). In the following subsection, the correlations of the process-related variables, i.e., the impacts of the decision subjects on the models' prediction performance and the interactions within these decision subjects, are discussed based on these analyses. In addition, the aims and concerns of the modeling process and future directions for the research are presented and discussed in the subsequent subsections.
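A minimal sketch of the two categorical association measures named here, implemented from their entropy and variance definitions on toy algorithm/class data (the values are invented), illustrating the asymmetry that motivated the choice of Theil's U:

```python
import numpy as np
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * np.log(c / n) for c in Counter(xs).values())

def theils_u(x, y):
    """U(x|y): fraction of the uncertainty in x removed by knowing y (asymmetric)."""
    hx = entropy(x)
    if hx == 0:
        return 1.0
    n = len(x)
    h_x_given_y = sum(
        (cnt / n) * entropy([xi for xi, yi in zip(x, y) if yi == yv])
        for yv, cnt in Counter(y).items()
    )
    return (hx - h_x_given_y) / hx

def correlation_ratio(categories, values):
    """Eta: association between a categorical and a continuous variable."""
    cats, vals = np.asarray(categories), np.asarray(values, dtype=float)
    grand = vals.mean()
    ssb = sum(vals[cats == c].size * (vals[cats == c].mean() - grand) ** 2
              for c in np.unique(cats))
    sst = ((vals - grand) ** 2).sum()
    return float(np.sqrt(ssb / sst))

# Toy illustration of the asymmetry: each algorithm maps to exactly one class,
# but a class covers several algorithms.
algos = ["ANN", "DNN", "RF", "ET", "SVR", "ANN"]
classes = ["NN", "NN", "Tree", "Tree", "SVM", "NN"]
r2_vals = [0.95, 0.96, 0.82, 0.85, 0.90, 0.94]

print("U(class|algorithm):", theils_u(classes, algos))  # 1.0: fully determined
print("U(algorithm|class):", theils_u(algos, classes))  # < 1.0
print("eta(class, R2):", correlation_ratio(classes, r2_vals))
```

This directionality is exactly why a symmetric measure would be misleading for the algorithm/algorithm-class pair.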

4.1. Correlation of Process-Related Variables

The correlation ratio results showed that the machine learning algorithm used has a strong correlation with the models' performance (i.e., R2), with the highest result of 0.59 (Figure 11). This was followed, respectively, by the second-level input variable configuration, the sample size organized in intervals, and the first-level input configuration, all of which also had strong correlations with R2. The study aim was found to be in moderate correlation, while the design stage targeted was in weak correlation. The Pearson correlation and mutual information results showed, on the other hand, that the actual sample size has a negligible impact, while the sample size in intervals has a strong one.
Regarding the correlation of the machine learning algorithms, the performance distribution graph showed that an acceptable level of performance (i.e., R2 > 0.75) was always achieved with some of the algorithms (e.g., ANN, DNN, XGBoost, GBRT, and SVM), while it was not guaranteed with others (e.g., RF and SVR), another indication that algorithm selection was the decision affecting the prediction accuracy the most (Figure A1a). The ANN, RF, and SVR were the three most commonly studied algorithms, and among these, the ANN proved its flexibility to be constructed appropriately for data of different complexities, with R2 results always above the acceptability threshold (e.g., with large data [70], with an expanded variable set [62], etc.).
Concerning the correlation seen for the input variables, the performance distribution maps of the first- and second-level input variable configurations similarly showed that different input variable configurations had R2 ranges varying considerably from each other, some of which were always in an acceptable range (Figure A1b,c). For instance, for datasets containing first-level inputs in the building and climate categories, the R2 range was narrow and the performance was always satisfactory. However, for datasets containing inputs in the building, time, and climate categories, the R2 range was wider, with insufficient accuracy occurring more often than in the others. Additionally, working with datasets containing inputs in the building category only was observed to not always ensure an acceptable performance.
Regarding the correlation of the sample size with the model accuracy, the sample size organized in intervals presented a similar pattern, with R2 ranges varying depending on the interval; the range was wider, for instance, for sample sizes in the ranges of 2000–3000, 3000–4000, and 5000–6000 (Figure A1d). A review of the circumstances behind the unacceptable results in the models with a 2000–6000 sample size showed that they occurred either in studies targeting the early design period (e.g., [63]), in studies working with certain inputs (i.e., surrounding effects [74], time [32], or time and climate [62]), or in studies working with linear regression (e.g., [9]). As a solution to the decreased accuracy, modeling with different algorithms under the same conditions was preferred in those studies, which resulted in sufficient prediction accuracy. Moreover, for the sample sizes under 2000, which were always within the acceptable range, most of the models were observed to be constructed with input variables in the building category only, another indication of the combined effect of the decision variables. Among those with a sample size of more than 11,000, the models with insufficient accuracy were observed to be built with linear regression.
The correlation analysis conducted among the decision subjects shows a strong correlation between the second-level input configuration and both the study aim and the design stage targeted, with values of 0.74 and 0.60, respectively (Figure 11). This can be interpreted as the second-level input configuration being largely shaped by the aims of the study and the design stage targeted. The preference for particular machine learning algorithms in association with the decision to study the early design period was also found in the matrix, with a correlation value of 0.61.

4.2. Strategies Implemented to Overcome Concerns

In the machine learning modeling process for prediction, as mentioned above, the common sources of concern for the steps in the dataset and machine learning model preparation stages were determined to be (i) the dimensionality of the case considered, (ii) the uncertainty in it, and (iii) finding an appropriate ensemble of approaches/methods within and across the steps. The process of obtaining a model that provides a sufficient level of accuracy under these concerns can face various limitations, and certain approaches to overcome such limitations were observed to support obtaining an efficient model in the studies.
Regarding the dimensionality trade-off between representing the data sufficiently and handling the resulting complexity, an increase in the dataset sample size and its variation led to better results in terms of expanding the scope of the case and the accuracy of the prediction model. To this end, automation of the workflow and seamless integration of simulation tools [39,61], and working with BIM to reduce the remodeling effort associated with the physical modeling of the building [34,63], were implemented in the studies to obtain extended and larger samples efficiently. Furthermore, in some studies, the sampling step was conducted to increase the sample size, which improves the generalization performance. However, there was also a tendency to limit the sample size because of the increased computational cost of training, testing, and prediction. Working comparatively on simplified versions and on input datasets with different features, e.g., static and dynamic models [2] or building stocks with and without various building types [40], allowed for a controlled process when dealing with increased dimensionality. For dataset production, collecting distinct rather than similar samples proved to be the more effective strategy [67]. Modeling for the early design stage creates the need to deal with larger data, because studies targeting this stage are expected to model the probable decisions of the detailed design stage; for this purpose, a probabilistic approach with sampling method integration was implemented in the studies, e.g., in [28,46]. As another factor to consider, the complexity of the machine learning model needs to match the dataset's complexity, since using a complex model on simpler data causes overfitting through overlearning, and the model then predicts poorly on unseen data.
On the other hand, a simple machine learning model applied to complex data may cause underfitting, which also yields an inefficient prediction model. For instance, in [38], Lasso regression and feedforward neural networks were therefore constructed with increased complexity to be able to work with the large dataset produced, and in [8], the process of identifying the neural network complexity that best fits the data complexity was automated.
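The over-/underfitting diagnosis described above amounts to comparing training and held-out error. A minimal, self-contained Python sketch (illustrative only, using synthetic data and two extreme hypothetical models rather than any model from the reviewed studies) makes the gap visible: a mean predictor is too simple for the signal, while a 1-nearest-neighbour memorizer fits the training noise perfectly and degrades on unseen points:

```python
import math
import random

random.seed(0)
# Synthetic "simulation" data: an output as a noisy function of one design variable
x = [i / 20 for i in range(40)]
y = [math.sin(2 * xi) + random.gauss(0, 0.15) for xi in x]
x_tr, y_tr = x[::2], y[::2]       # training half
x_te, y_te = x[1::2], y[1::2]     # held-out half

def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

# Underfitting model: predict the training mean everywhere (too simple)
mean_y = sum(y_tr) / len(y_tr)
under_tr = rmse([mean_y] * len(y_tr), y_tr)
under_te = rmse([mean_y] * len(y_te), y_te)

# Overfitting model: 1-nearest neighbour memorizes the training noise
def knn1(xq):
    return min(zip(x_tr, y_tr), key=lambda p: abs(p[0] - xq))[1]

over_tr = rmse([knn1(v) for v in x_tr], y_tr)   # zero: perfect recall of training set
over_te = rmse([knn1(v) for v in x_te], y_te)   # inflated by memorized noise

print(under_tr, under_te)   # both high: model too simple for the signal
print(over_tr, over_te)     # train error 0, test error clearly larger
```

A model of appropriate complexity would show both errors low and close to each other, which is the condition the automated complexity search in [8] targets.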
Regarding the uncertainty about whether the data obtained were representative of real conditions and contained sufficient information to enable predictions close to real consumption, simplification/abstraction of the model, e.g., by using a typical built environment and/or building model or working for a limited location, was preferred to reduce the data-related uncertainty through decreased variation in the case. However, valid, larger, and expanded data were usually needed to reflect the dynamism of the corresponding real data; for this purpose, validated simulation tools and appropriately expanded feature sets with a sufficient sample size were preferred. To make prediction models usable during the early design phase, proposals were made to address the high uncertainty that arises because the detailed design decisions determining building energy performance have not yet been made. Therefore, particularly in the studies aimed at supporting the designer in the early design stage, proposing a framework with steps for requesting data from the designer regarding those decisions is one strategy to overcome this problem. In [46], for instance, an instant sensitivity analysis was integrated into the proposed early design stage energy prediction tool, regarding the decisions made about the building and its parts. The designer, as the tool user, iteratively narrows down the value ranges of the effective design variables with the help of the instant sensitivity analysis, which decreases the uncertainty and yields more precise prediction results. As an alternative to this kind of application that integrates the designer into the process, a prediction interval covering the possible values of the undecided parameters can be provided together with the results of the feature analysis and the uncertainty levels, as done in [28].
Furthermore, for overall support during building design, the integration of multi-objective optimization steps into the prediction models and frameworks was also used as an approach to obtain an optimum building design scheme for multiple energy efficiency objectives, as done in [31,44]. In addition, limiting the scope of some variables that have an effect on energy use was observed as a strategy. In [74], for instance, two different prediction models combined with an optimization stage were developed in this context concerning ventilation, i.e., for single-sided and cross ventilation.
Finding the appropriate ensemble of approaches/methods within and across the steps was an important part of the prediction modeling, and the performance analyses presented in Section 3.2.4 showed that alternative processes, shaped by different decisions on algorithm selection, sample size determination, and input feature selection corresponding to the output variables, can produce an efficient and/or high-performing model. The whole modeling process was therefore generally iterative and cyclical: different configurations were experimented with in the studies, and the models' performances under different decisions were tested. To this end, some studies automated these cyclic processes; for example, the process of identifying the neural network complexity that best fits the data was automated in [8], and a methodology called Octahedric regression was proposed to simplify the entire process from raw data acquisition to model generation [57]. As an alternative to the comparative analysis of different methods or configurations, combinations of several feature analysis and sampling methods were used in the studies to exploit the differences between them (e.g., [44]). Likewise, in terms of the algorithms, the use of ensemble-type algorithms made it possible to benefit from the strengths of different algorithms (e.g., [50]).
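The iterative comparison of candidate configurations described above is typically scored with cross-validation. A minimal Python sketch (illustrative only; the two hypothetical candidate "algorithms" here are a mean predictor and a one-variable linear fit, not models from the reviewed studies) shows the selection loop:

```python
import math

def fit_mean(xs, ys):
    """Baseline candidate: predict the training mean everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Second candidate: one-variable least-squares line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return lambda x: my + slope * (x - mx)

def cv_rmse(xs, ys, fit, k=4):
    """k-fold cross-validation RMSE for one candidate model."""
    errs = []
    for fold in range(k):
        tr = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        te = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        model = fit([x for x, _ in tr], [y for _, y in tr])
        errs += [(model(x) - y) ** 2 for x, y in te]
    return math.sqrt(sum(errs) / len(errs))

# Toy data with a clean linear trend, so the linear candidate should win
xs = list(range(12))
ys = [2.0 * x + 1.0 for x in xs]

candidates = {"mean": fit_mean, "line": fit_line}
scores = {name: cv_rmse(xs, ys, f) for name, f in candidates.items()}
best = min(scores, key=scores.get)
print(best)   # → line
```

In the reviewed studies, the same loop structure appears with simulation-derived datasets, real algorithm libraries, and richer metrics (R2, MAE, etc.) in place of this toy RMSE comparison.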
Various methods/approaches developing over time can also provide solutions to research needs that remain partially or fully unfulfilled in the context of building energy performance prediction. For example, the multivariate modeling approach allows multiple input variables and their interdependencies to be considered simultaneously in time-series forecasting, as performed for CO2 emission forecasting for different regions in [77]. Regarding the computation time, which increases with the complexity of the machine learning algorithm, an optimization-based stacked ensemble machine learning model was proposed, for instance, in [78] to balance model accuracy against computational time. From a broader perspective, to mitigate bias, another issue that needs to be addressed in the model, fairness-aware approaches have been developed as well [79,80]. The mechanisms for improving algorithmic fairness developed in this respect were grouped into the categories of pre-process (i.e., changing the training data), in-process (i.e., adding fairness-related constraints or penalties to the objective function during the training phase), and post-process (i.e., modifying the prediction results of a classifier). Likewise, for dataset preparation, automatic feature selection methods are being developed to manage the selection of appropriate features. For instance, the recursive feature elimination method performs feature selection automatically by iteratively removing features and building a model with the remaining ones [81], and deep learning-based feature extraction methods extract features by flattening and concatenating through the application of various hyperparameters [82].
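As a rough illustration of the recursive elimination idea, the Python sketch below drops the weakest feature one at a time. It is deliberately simplified: it ranks features by univariate Pearson correlation with the target instead of refitting a model at each step, as full recursive feature elimination [81] does, and the feature names and values are hypothetical:

```python
import math

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb) if va and vb else 0.0

def recursive_elimination(features, target, keep):
    """Drop the weakest feature (lowest |r| with the target) one at a time.
    Full RFE would refit a model and re-rank after each drop."""
    features = dict(features)
    while len(features) > keep:
        weakest = min(features, key=lambda k: abs(pearson(features[k], target)))
        del features[weakest]
    return sorted(features)

# Hypothetical toy dataset: a heating load driven by wall U-value and
# window ratio, with 'floor_id' as an uninformative feature
u_value   = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
win_ratio = [0.1, 0.3, 0.2, 0.5, 0.4, 0.6]
floor_id  = [3, 1, 2, 1, 3, 2]
load      = [12.0, 19.1, 22.4, 33.0, 36.2, 44.9]

print(recursive_elimination(
    {"u_value": u_value, "win_ratio": win_ratio, "floor_id": floor_id},
    load, keep=2))   # → ['u_value', 'win_ratio']
```

The sketch keeps the two physically meaningful variables and discards the noise feature, mirroring the intended effect of automatic feature selection on dataset preparation.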

4.3. Future Directions

In the machine learning modeling process for energy performance prediction, studies focusing on almost all of the decision subjects have been carried out, enabled by the flexibility of the modeling process, which allows for alternative paths. With this flexibility, slightly more studies were carried out to develop the machine learning modeling phase than to develop frameworks or to improve dataset quality (i.e., 42 studies for the first, and 36 studies for both of the remaining ones). In another study published in 2020, which examined reviews and technical papers on building climatization load prediction [30], the studies on the machine learning side were found to outnumber those focusing on the dataset by more than two to one; this variation in the findings may indicate that dataset production has started to receive more attention. In this respect, considering that using typical building models and working for limited locations is a strategy for dealing with uncertainty, and that most of the studies were found to focus on residential and office buildings in humid subtropical climates according to the Köppen–Geiger classification, working on other building functions and climate types may enrich the field. Additionally, climate change is expected to alter buildings' energy use worldwide, yet in only one of the studies were building energy performance prediction models developed for short- (2021–2040), medium- (2041–2060), and long-term (2061–2100) future climate scenarios [32]. Increasing the number of such studies would also enrich the field and prepare the building construction sector for the indirect effects of climate change. Likewise, in the case of district heating system usage, models suitable for buildings with different functions are needed to reflect the usage-induced changes in energy demands.
In [39], correction functions were employed to predict for multiple building functions. In [40], prediction modeling established with a dataset that took into account the operational characteristics of each building function provided sufficient results with various algorithms. Increasing the number of studies searching for other strategies to make prediction modeling fit various functions would also enrich the field.
One of the most important limitations in the field of machine learning modeling for building energy performance is the lack of high-quality datasets collected from the real world [30]. Studies using data obtained by numerical simulation with validated tools are therefore conducted to overcome this limitation, and the models generated with those data can provide reliable results to some extent by taking the dynamic changes in real conditions into account as much as possible. To increase the level of generalization to real data, strategies were implemented against the drawbacks of simulations, which may arise especially when high abstraction levels are preferred in the simulation models. Partial calibration of the simulation results with collected real data was one such strategy, used, e.g., in [39]. For calibration with complex real data, selecting appropriate features and dealing with data-related uncertainties that are difficult to capture, such as occupancy and unexpected dynamics or losses in mechanical systems, are important concerns. To further reduce the gap between actual and predicted results, subjects that were included only to a limited extent in the reviewed studies, such as occupancy and urban-related factors, can be investigated further. Regarding occupancy, only the occupant density [29] and the preferences on heating and cooling setpoints [30] were considered as variables for operational energy in the reviewed studies. Accordingly, variations in thermal comfort needs, preferences, and behavior at an hourly level could be considered together. Studies carried out in the smart grid context at the urban level may provide the opportunity to work with realistic information by accumulating occupant-related data through collection and monitoring.
In addition, future studies on the inclusion of the components surrounding the building (e.g., other buildings and trees), which may have a great effect on energy needs and on the related climatic variables, are important for developing more realistic models and, in turn, more accurate results. For instance, one of the reviewed studies performed a comprehensive analysis of the variability of the effect of the surrounding environment [41], considering the sky view factor and the height-to-width ratio of the street for the heating load, as well as variations in climatic factors, i.e., temperature, humidity, wind, and solar radiation, and showed that machine learning applications were suitable for this kind of city-scale energy prediction. Additionally, transferring the knowledge gained in studies using simulation data to models prepared for real data emerges as a feasible approach that can be explored further, especially in the case of a lack of data. The study on the capability of transfer learning techniques [32], although only simulation-based data were used, was an example of that approach, aiming to avoid separate training processes of different algorithms for different datasets. With an increase in such studies, information that supports solutions to the deficiencies in prediction modeling, along with the lack of studies focusing on real-life big data [83], larger scales, etc., can be obtained within the models generated by simulation data.
Despite the good performance of a wide range of machine learning algorithms in the energy performance prediction models, the ANN and SVR were selected the most. These algorithms have also been widely used and tested in several other areas. In a building management review study, SVM-based methods and ANN algorithms were found to lie in the good- and medium-robustness ranges, respectively, while hybrid and ensemble methods were in the high-robustness range [84]. In another review study on renewable energy and electricity needs, the ANN was found to be usable for real data that were not too complex [85]. Although the suitability of the ANN and SVR for building energy performance data has been repeatedly demonstrated, the current review showed that low prediction accuracy is also possible. One of the reasons for this is the large number of different configurations that can be used when building these algorithms; for example, the studies considered in this review that used the same data revealed that the same algorithm provided different prediction accuracies after changes, e.g., in the algorithm optimization for the ANN [5,54]. A solution, as in the case of these studies using the same publicly shared data, could be to increase the number of studies with open-access datasets and their codes in order to make them reusable by others.

5. Conclusions

In this review study, one of the aims was to obtain a process framework of machine learning modeling for building energy performance prediction with data obtained from numerical simulations. The modeling process considered was therefore based on supervised learning with a labeled dataset, and 55 studies identified by a systematic search of the Web of Science database were analyzed to determine the different processes followed according to the different decisions made on different subjects. In addition to the general framework showing the main stages and their steps with respect to the specific aims of the studies, the associated decision subjects on which the performance of a machine learning prediction model depends were gathered from these studies and presented together with the intertwined workflows and common concerns in these steps. During these reviews, the frequency of use of the algorithms and the accuracy distribution of the prediction models with respect to the dataset size, the inputs and outputs used, and the algorithms implemented were analyzed as well. Moreover, the studies targeting the early design stage and those using the same datasets were examined separately and compared with all of the studies to present and discuss the variations. Additionally, a correlation analysis was performed to identify the effective decision points at the modeling scale.
The frequency analyses showed that, among the 56 different machine learning algorithms used in the studies, belonging to 19 different classes, the ANN, RF, and SVR were the most used, in that order. Among the rarely used algorithms, the Light Gradient Boosting Machine, Multivariate adaptive regression splines, and Minimax probability machine regression also delivered high-accuracy performance in the studies.
Regarding the sampling and algorithm optimization processes, the Latin hypercube sampling method and grid search, as an automatic algorithm optimization method, stood out among the others. Yet, various other methods were observed to yield sufficient accuracy if the whole model was built properly. According to the correlation analysis of the effects of the process-related variables, the decisions regarding the machine learning algorithm, input set, and sample size (in intervals) were observed to have more effect on the performance results than the design stage targeted and the aim of the modeling study, while the relation between the output type and the algorithm class was insignificant.
From the machine learning side, the accuracy distribution analyses showed that many alternative algorithms can be used to build effective prediction models. On the other hand, efforts related to high input dimensionality and the early design period were limited, and these subjects need to be explored further, considering the performance instability observed in the current studies.

Author Contributions

Conceptualization, D.K.; methodology, D.K. and E.E.; formal analysis and investigation, D.K.; writing—original draft preparation, D.K.; writing—review and editing, D.K. and E.E.; visualization, D.K.; supervision, E.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. PRISMA Flow Diagram

[Figure: PRISMA flow diagram]

Appendix B. Machine Learning Modeling-Related Information of the Studies Analyzed

Table A1. Building functions, locations, and climate types considered in the studies, and sample sizes, methods, and evaluation metrics used for prediction modeling.
Ref. No | Building Function(s) | City/Country (Climate Type) | Sampling Method(s) | Sample Size: All * | Feature Analysis Method(s) | ML Algorithm(s) | ML Algorithm Optimization Method(s) | Evaluation Metric(s)
[1]Residential **Athens/Greece (Cfa) **No sampling768Local sensitivity analysis, Global sensitivity analysisDNN, ANNN/SR2, RMSE, MAE, prediction interval (PI)
[2]OfficeN/SLatin hypercube (800), (200),(200)Pearson correlation ANNN/SR2, Maximum deviation
[5]Residential **Athens/Greece (Cfa) ** 768N/SANN, SVM, GPR, RF, GBRT, XGBoostGrid searchRMSE, MAE, R2, Fit time, Test time, the average fitting time of all tested models
[5]Various types (health facility, residential, hotel, office, restaurant, retail, educational, warehouse)Various cities/United States (N/S)N/S1000, 10,000, 25,000 100,000 Correlation estimation, Principal Component Analysis, Sobol MethodANN, SVM, GPR, RF, GBRT, XGBoost Grid searchRMSE, MAE, R2, Fit time, Test time, the average fitting time of all tested models
[6]OfficeMunich/Germany (Cfb)N/S4000 CBMLN/SR2, RMSE
[8]ResidentialPo Valley area/Italy (Cfa)No sampling600,000 ANNManualDistribution of the relative errors
[9]Agro-industrial building (a winery building)Toscanella di Dozza, in the countryside close to Bologna/Italy (Cfa)No sampling5150Spearman correlation SVR, RF, LR, XGBoostGrid searchComputational time, MAE, MSE, R2
[12]Office spacesChampaign/United States (Dfa)No sampling300Principal component analysisLR, SLR, GAMN/SRMSE
[28]OfficeHarbin/China (Dwa), Chengdu/China (Cwa)No sampling1152 ELMParticle swarm optimizationR2
[29]OfficePhoenix/United States(BWh), Houston/United States (Cfa), San Jose/United States (Csb), New York/United States (Cfa), Chicago/United States (Dfa)Latin hypercube 1000, 2000, 5000, 10,000, 20,000, 50,000, 100,000, 200,000, 500,000, 1,000,000Neighborhood component analysisCART, ANN, EBT, DNNGrid searchCV, RMSE, R2, training time
[30]ResidentialChongqing/China (Cfa)N/S50, 100, 150, 200, 250, 300, 400, 500Spearman correlation LR, ANN, SVRArgument tunelength for the train function in the caret packageMAE, RMSE, NMAE, NRMSE
[31]ResidentialShanghai/China (Cfa)Latin hypercube 800 SLR, BPNN, SVM, RFN/SR2, relative error ranges
[32]ResidentialSingapore (Af)Latin hypercube 2500Pearson correlation LR, LSTM, MLP, XGBoostN/SR2, MSE, MAE
[33]Residential (high-rise buildings)Lusail city/Qatar (BWh)Latin hypercube 585, 1755, 2925, 5850, 8775 Standardized regression coefficient, Random forest variable importance, T-value, sensitivity value indexMLR, SVR, ANN, XGBoostManualR2, NMBE, CV(RMSE) %, Process time (s)
[34]ResidentialSeville/Spain (Csa)No sampling240Correlation coefficientRFN/SR2, MAE, RMSE, OOB (Out-of-bag)
[35]N/SN/SNo sampling421,245The deep learning methods described by GarsonLSTM, GLM, DNN, RF, GBRTAdaptive moment estimationRMSE, MAE, R2
[36]OfficeJaipur/India (BSh), Hyderabad/India (BWh)Domain knowledge based sampling, Sampling by clustering(2400), (2304) LR, RR, LASSO, RANSAC, Theil Sen, KNN, SVR, DTN/Sr (Spearman’s rank correlation coefficient), MAPE, error analysis, residual analysis
[37]ResidentialTaipei/Taiwan (Cfa), Taichung/Taiwan (Cwa), Kaohsiung/Taiwan (Aw)No sampling8184Spearman correlation indexGBRTN/SRMSE, percentage error, R2
[39]Various types (residential, kindergarten, school)Belgrade/Serbia (Cfa)No samplingnot clear SVR, ANNGrid searchR2
[40]Various types (residential and non-residential buildings [office, hotel, mall, hospital, educational])Chongqing/China (Cfa)No samplingN/S SVR, RF, XGBoost, OLS, RR, LASSO, EN, ANNN/SNMAE, NRMSE, relative error
[41]Various types (mainly residential)Fribourg/Switzerland (Dfb)Morris methodN/SMorris methodLGBMRoot mean square propagationMAE, MAPE
[42]ResidentialAthens/Greece (Cfa)No sampling768 IRLS, RFN/SMAE, MSE, MRE
[43]ResidentialIstanbul/Turkey (Cfa)No sampling180 ELM, GP, ANNmanualRMSE, R2, Pearson coefficient
[44]OfficeShandong/China (Dwa)Fourier amplitude sensitivity test, Sobol sampling, Random sampling, Orthogonal experiment, Latin hypercube 16,320FAST extend, Sobol method, standardized correlation coefficients, Standardized rank correlation coefficients, partial correlation coefficient, partial rank correlation coefficient, Spearman correlation coefficient, Pearson correlation coefficient, Kolmogorov–SmirnovRF, GBRT, ANNRandomized search, Grid searchMAE, RMSE, R2
[45]ResidentialNottingham/United Kingdom (Cfb)Latin hypercube 40,000A backwards feature selection procesANNGrid searchMAE, residual analysis
[46]OfficeN/SSobol sampling26,000A regression-based methodCNNManualMAPE, RMSE
[47]OfficeBeijing/China (Dwa)Latin hypercube 4740, 4990, 5490, 5990, 7490, 8990 ANN, RF, GPR, SVM, DACE, MARSN/SR2
[48]Residential **Athens/Greece (Cfa) **No sampling768 MLP, SVR, Ibk, LWL, M5P, REPTreeN/SR2, MAE, RMSE
[49]Residential **Athens/Greece (Cfa) **No sampling768 SVR, MLPN/SCorrelation coefficient (r), MSE, RMSE, MAE
[50]Residential **Athens/Greece (Cfa) **No sampling768 XGBoostAnt Lion Optimizer, Black Hole Optimizer, Cuckoo Search, Dragonfly Algorithm, Differential Evolution, Genetic Algorithm, Gray Wolf Optimizer, Jaya, Particle Swarm Optimization, Modified JayaRMSE, R2, MAE
[51]Residential **Athens/Greece (Cfa) **No sampling768Pearson correlation RF, ERT, GBRTGrid searchMSE, MAE, MAPE
[52]Residential **Athens/Greece (Cfa) **No sampling768 GPLocal search method, Linear scalingMAE, MRE, MSE
[53]Residential **Athens/Greece (Cfa) **No sampling768Fuzzy Inductive Reasoning maskFIR, ANFISA combination of backpropagation and least-squares estimation RMSE, MAE, SI (a synthesis index = combination of RMSE and MAE measures)
[54]Residential **Athens/Greece (Cfa) **No sampling768 DNN, GBRT, GPR, MPMR, GPR, LR, ANN, RBFNN, SVMN/SVAF, RAAE, RMAE, R2, RSR, NMBE, MAPE, NS, RMSE, WMAPE.
[55] | Residential ** | Athens/Greece (Cfa) ** | No sampling | 768 | | MLP | The grasshopper optimization algorithm, wind-driven optimization, biogeography-based optimization | R2, RMSE
[56] | Residential ** | Athens/Greece (Cfa) ** | No sampling | 768 | | MLP, ANN | N/S | RMSE, MAE, R2
[57] | Residential ** | Athens/Greece (Cfa) ** | No sampling | 768 | | Octahedric Regression Model | N/S | MSE, MAE, MAPE, REC (regression error characteristic)
[58] | Residential ** | Athens/Greece (Cfa) ** | No sampling | 768 | | MLP | Shuffled complex evolution, moth–flame optimization, optics-inspired optimization | R2, MAE, RMSE
[59] | Residential ** | Athens/Greece (Cfa) ** | No sampling | 768 | | MLP | Firefly algorithm, optics-inspired optimization, shuffled complex evolution, teaching–learning-based optimization | RMSE, MAE, R2
[60] | Residential ** | Athens/Greece (Cfa) ** | No sampling | 768 | ANOVA | EMARS, MARS, BPNN, RBFNN, CART, SVM | Artificial bee colony | RMSE, MAPE, MAE, R2
[61] | Office space | Harbin/China (Dwa) | Latin hypercube | 1610 | Sobol method | ANN | Bayesian hyperparameter optimization | MSE, MAE, MAPE, R2
[62] | Residential | Aswan/Egypt (BWh) | No sampling | 5208 | | ANN, RF, KNN, SVR, EL, Voting | Bayesian hyperparameter optimization | R2, RMSE
[63] | Office | Tehran/Iran (BSh) | No sampling | 3384 | Mutual information method | ANN, KNN, DT, RF, BR, AB, ERT | Grid search | R2, MAE, MSE, the training time of the model
[64] | Office | Qingdao International Academician Park/Qingdao, China (Cfa) | Fourier amplitude sensitivity test, Latin hypercube, Quasi-random sampling, Random sampling, Sobol sampling | 1680 | FASTC, FASTE (first order and total order), Partial correlation coefficient, Partial rank correlation coefficient, Spearman correlation coefficient, Pearson correlation coefficient, Standardized regression coefficient, Standardized rank regression coefficient, Sobol method, Morris method, Kolmogorov–Smirnov | LR, SVR, KNN, DT, RF, ET, BA, AB, GBRT, ANN | Grid search | R2, MAE, RMSE
[65] | Educational | Wuhan/China (Cfa) | Orthogonal experiment | 1,001,000 | | RF | With the ‘auto’ setting offered in the tool | RMSE, R2
[66] | Residential | Riyadh city/Saudi Arabia (BWh) | No sampling | 201 | | ANN | N/S | MSE, R2, error percentages calculated at 0.26%, 0.25%, 0.03% and 0.27%
[67] | Office | Munich/Germany (Cfb) | Latin hypercube sampling method, Sobol sampling | (400), (600), (800) | | CBML | Manual | RMSE, R2, MAPE, scatter plots and histograms of errors
[68] | Office space | Fribourg/Switzerland (Dfb) | No sampling | 1,000,000 | | GBRT | N/S | R2
[69] | Office | Ifrane/Morocco (Cfa), Meknes/Morocco (Csa), Marrakesh/Morocco (BSh) | Latin hypercube | 2400 | Sobol method | ANN, SVM | N/S | R2, RMSE, standard deviation (STD), Performance score = rank (R2) + rank (RMSE) + rank (STD)
[70] | Residential | N/S | Sobol sampling | 46,696 | | ANN, GBRT | Grid search | R2, RMSE, MAE, AE95 (the 95th percentile of the absolute error)
[71] | Residential | Paris, Lille, Lyon, Clermont-Ferrand/France (Cfb) | No sampling | N/S | Pearson correlation index, Kendall correlation index, Spearman correlation index | SVM | N/S | R2, RMSE
[72] | Office | Brussels/Belgium (Cfb) | Sobol sampling | 6000 | | ANN, CNN | Adaptive moment estimation | R2, MAPE
[73] | Educational | Wuhan/China (Cfa) | Orthogonal experiment | 54 sets from the orthogonal experiment | Single-factor sensitivity analysis | LSSVM, BPNN, WNN, SVM | Grid search | RMSE, R2
[74] | Residential | Hong Kong/China (Cwa), Los Angeles/United States (Csa) | Latin hypercube | 5610 | | MLR, MARS, SVM | N/S | R2, RMSE, residual plots
[75] | Educational (multipurpose university building) | Montreal/Canada (Dfb) | No sampling | 4720 | | ANN | N/S | MSE
Abbreviations: ML: Machine Learning; N/S: Not specified; Notes: *: In studies that work on more than one case, the sample sizes of different cases are given in separate parentheses. **: The data produced in [42] had been used.
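Several studies in Table A1 construct their simulation datasets with Latin hypercube sampling. As a minimal illustration (not code from any reviewed study), the following sketch draws a 768-point Latin hypercube design over three hypothetical design variables with SciPy's `qmc` module; the variable names and ranges are placeholder assumptions:

```python
# Hedged sketch: a Latin hypercube design for simulation inputs.
# The three design variables and their ranges are illustrative only:
# window-to-wall ratio, envelope U value, and orientation (degrees).
from scipy.stats import qmc

bounds_low = [0.1, 0.15, 0.0]
bounds_high = [0.9, 2.5, 360.0]

sampler = qmc.LatinHypercube(d=3, seed=42)
unit_samples = sampler.random(n=768)   # 768 points in the unit cube [0, 1)^3
samples = qmc.scale(unit_samples, bounds_low, bounds_high)

print(samples.shape)  # (768, 3)
```

Each of the 768 rows would then be fed to the building simulation tool to produce one labeled training example.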
Table A2. Inputs and outputs used for prediction modeling.
Reference no | Time | Location | Climate (Cloud; Degree Days; Illuminance; Precipitation; Pressure and Moisture; Solar Position-Related Parameters; Solar Radiation and Irradiation; Temperature; Wind) | Surrounding Effect/Urban Context Morphology Effect | Building (Orientation; Form/Geometry; Envelope: Construction/Material, Geometrical Feature, Shading; HVAC System; Lighting System; Internal Gain; Type; Other) | Output Variables (Operational Energy: Total Energy Consumption, Heating Load, Cooling Load, Lighting Load; Comf.: Visual Comfort, Thermal Comfort; Env.: Environmental Impact, Embodied Energy/Material Stage; Economy)
[2] ++++++ ++++ + ++
[5] + + ++ ++++ + ++
[6] ++++ + ++ +
[8] ++++ + +
[9] + + +
[12] ++++ ++
[28] + ++ + +
[29] ++ +++ + + + + +
[30] + + + ++
[31] +++++ ++ +
[32]+ +++++ + ++
[33] + + ++ ++ + +
[34] + +
[35]+ + + + + ++ +
[36] +++++ +
[37] + + + + +
[39] + +++ +++++ +
[40] +++ + + ++
[41] + ++++ +++ +
[42] ++ + ++
[43] + +
[44] + ++++ + ++ +
[45] +++ + ++ +
[46] ++++ ++
[47] ++++ + +++
[61] ++++ +
[62]+ + +++++ ++ + + + +
[63] +++++ ++ +
[64] + ++++ + ++
[65] ++ + + ++ +
[66] + +++ + +
[67] ++++ + + +
[68] ++ ++ +
[69] ++ + ++
[70] + ++++ +
[71]+ + +++ + + +
[72] ++++++ + ++
[73] ++ +
[74] ++++ + + ++
[75] ++ ++ + + +
Abbreviations—Comf.: Comfort; Env.: Environment.
Table A3. Input variables used in studies analyzed.
Level 1 | Level 2 | Level 3 | Level 4 | Level 5
Time | Day of the Month, Day of the Year, Hour of the Day, Month of the Year, The hourly indicator of daily data, Time period
Location
Climate | Cloud | Sky cover
Climate | Degree days | Degree days
Climate | Illuminance | Illuminance
Climate | Precipitation | Precipitable water, Rain status
Climate | Pressure and moisture | Air density, Atmospheric pressure, Humidity
Climate | Solar position-related parameters | Solar altitude angle, Solar azimuth angle, Solar declination, Solar hour angle
Climate | Solar radiation and irradiation | Solar irradiation, Solar radiation
Climate | Temperature | Air temperature, Dew point temperature, Dry bulb temperature, Ground temperature, Sky temperature, Surface temperature, Water mains temperature
Climate | Wind | Wind direction, Wind speed
Surrounding effect/urban context morphology effect | External obstruction angle, Height-to-distance ratio, Sky view factor
Building | Type | Boundary conditions, Building archetype, Floor plan, Roof topology, Roof type
Building | Form/geometry | Dimensional features/ratio of dimensions | Aspect ratio, Building height, Building length, Building width, Ceiling height, Facade length (each separately), Floor height, Floor to floor height, Length to width ratio, Number of rooms, Number of stories, Overall height, Ratio between axes, Roof height, Roof length, Roof width, Room depth, Room width
Building | Form/geometry | Surface area/ratio of surface areas | Building footprint, Floor area, Roof area, Roof-to-wall ratio, Window-to-floor ratio, Window-to-ground ratio
Building | Form/geometry | Volume/volume-proportional features | Compactness ratio, Volume
Building | Orientation
Building | Envelope | Geometrical feature | Surface area/ratio of surface areas | Envelope area, Facade area, Glazing area, Glazing area distribution, Heat loss surface, Roof area, Surface area, Wall area, Window area, Window-to-wall ratio
Building | Envelope | Geometrical feature | Window-related layout | Window operation
Building | Envelope | Construction/material | Type | Component, Material
Building | Envelope | Construction/material | Optical properties | Visible absorptance, Visible transmittance
Building | Envelope | Construction/material | Physical properties | Density, Thickness
Building | Envelope | Construction/material | Thermal properties | Air infiltration, Air permeability, Air tightness, Attenuation, g value, Heat capacity, Heat gain, Heat loss, Heat transfer coefficient, Solar absorptance, Solar heat gain coefficient, Solar radiation absorption coefficient, Solar radiation rate, Solar transmittance, Specific heat, Superficial mass, Thermal absorptance, Thermal conductivity, Thermal lag, Thermal mass, Thermal resistance, U value
Building | Envelope | Shading | Shading device-related properties | Shading device dimensional properties, Shading device optical properties, Shading device type
Building | Envelope | Shading | Shading factor | Shading
Building | HVAC system | Other | Air change rate, Air temperature, District heating outlet temperature, Fresh air supply, HVAC available proportion for heating and cooling, Outdoor air flow rate, Supply air temperature, Ventilation rate
Building | HVAC system | Efficiency | Coefficient of performance, Efficiency
Building | HVAC system | Schedule and setpoint | Schedule, Setpoint
Building | HVAC system | Type | Boiler pump type, Chiller type, Heating method, HVAC system
Building | Lighting system
Building | Internal gain | Equipment power density, Hot water density, Internal heat gain, Lighting power density, Occupant density, Operating hours, Ventilation flow rate density, Window operation, Zone total internal total heat gain
Building | Other | Green roof configurations
Room | Form/geometry | Dimensional features/ratio of dimensions | Aspect ratio, Room depth, Room height, Room length, Room width
Room | Form/geometry | Surface area/ratio of surface areas | Floor area, Window-to-floor ratio
Room | Form/geometry | Volume/volume-proportional features | Volume, Window/volume ratio
Room | Orientation
Room | Envelope | Geometrical feature | Surface area/ratio of surface areas | Window area, Window-to-wall ratio
Room | Envelope | Geometrical feature | Window-related layout | Distance from the window, Number of windows, Window end position/room width, Window operation, Window sill height/room height, Window start position/room width, Window top height/room height
Room | Envelope | Construction/material | Optical properties | Surface reflectance, Visible transmittance
Room | Envelope | Construction/material | Physical properties | Thickness
Room | Envelope | Construction/material | Thermal properties | U value
Room | Envelope | Shading | Shading device-related properties | Shading device dimensional properties, Shading device presence

Appendix C. R2 Result Distributions Within Decision Subjects

Table A4. Items considered under variable headings for the correlation analyses, and values assigned for the R2 distribution maps (when necessary).
Decision Variable Heading | Items Under Headings | Value Assigned *
Study aim | Studies with the aim of ‘improving the prediction performance by focusing on the dataset preparation stage’ | 0
Study aim | Studies with the aims of ‘improving the prediction performance by focusing on the dataset preparation stage’ and ‘improving the prediction performance by focusing on issues related to the machine learning model preparation stage’ | 1
Study aim | Studies with all the aims determined in this study | 2
Study aim | Studies with the aims of ‘improving the prediction performance by focusing on the dataset preparation stage’ and ‘developing a particular model, approach, or framework’ | 3
Study aim | Studies with the aim of ‘improving the prediction performance by focusing on issues related to the machine learning model preparation stage’ | 4
Study aim | Studies with the aims of ‘improving the prediction performance by focusing on issues related to the machine learning model preparation stage’ and ‘developing a particular model, approach, or framework’ | 5
Study aim | Studies with the aim of ‘improving the prediction performance by focusing on issues related to the machine learning model preparation stage’ | 6
Design stage | Early design stage, Not specified
Input variable configuration, 1st level **:
Building | 0
Climate + Building | 1
Surrounding effects + Building | 2
Time + Building | 3
Time + Climate + Building | 4
Input variable configuration, 2nd level **:
Building/Form/geometry + Building/HVAC system + Building/Internal gain + Building/Orientation + Building/Envelope | 0
Building/Form/geometry + Building/HVAC system + Climate/Precipitation + Climate/Pressure and moisture + Climate/Temperature + Climate/Wind + Building/Internal gain + Climate/Solar radiation and irradiation | 1
Building/Form/geometry + Building/Internal gain + Building/Orientation + Building/Envelope + Climate/Temperature | 2
Building/Form/geometry + Building/Orientation + Building/Type + Building/Envelope + Climate/Cloud + Climate/Pressure and moisture + Climate/Solar position related parameters + Climate/Solar radiation and irradiation + Climate/Temperature + Climate/Wind + Time | 3
Building/Form/geometry + Building/Orientation + Building/Envelope | 4
Building/Form/geometry + Building/Orientation + Building/Envelope + Surrounding effect/urban context morphology effect | 5
Building/Form/geometry + Building/Orientation + Building/Envelope + Time | 6
Building/HVAC system + Building/Envelope | 7
Building/Internal gain + Building/Envelope + Climate/Solar position related parameters + Climate/Solar radiation and irradiation | 8
Building/Orientation + Building/Envelope | 9
Building/Envelope | 10
Building/Envelope + Building/Form/geometry + Building/HVAC system | 11
Building/Envelope + Building/Form/geometry + Building/HVAC system + Building/Orientation + Building/Internal gain + Building/Type | 12
Building/Envelope + Building/Form/geometry + Building/HVAC system + Building/Orientation + Climate/Precipitation + Climate/Pressure and moisture + Climate/Solar position related parameters + Climate/Temperature + Climate/Wind | 13
Building/Envelope + Building/Form/geometry + Building/HVAC system + Climate/Pressure and moisture + Climate/Temperature + Building/Internal gain | 14
Building/Envelope + Building/Form/geometry + Building/Orientation | 15
Building/Envelope + Building/Form/geometry + Climate/Pressure and moisture + Climate/Temperature + Building/Internal gain + Climate/Degree Days + Climate/Solar radiation and irradiation | 16
Building/Envelope + Building/Orientation | 17
Building/Envelope + Building/Orientation + Climate/Temperature + Climate/Wind | 18
Climate/Pressure and moisture + Climate/Solar radiation and irradiation + Climate/Cloud + Climate/Illuminance + Room/Envelope + Time | 19
Sample size, actual size | 54, 180, 201, 240, 768, 800, 1000, 1610, 2500, 3384, 4000, 5000, 5150, 5208, 5610, 6000, 10,000, 11,700, 25,000, 46,696, 421,245, 1,000,000, 1,001,000
Sample size, as intervals:
0–1000 | 0
1000–2000 | 1
2000–3000 | 2
3000–4000 | 3
5000–6000 | 4
10,000 | 5
11,000 and higher | 6
Output | Total energy consumption, Heating load, Cooling load, Visual comfort, Thermal comfort, Embodied energy/material stage impact, Environmental impact, Economy
Machine Learning Algorithm, algorithm class | Component-based machine learning model, Decision Tree models, Gaussian Process regression, Genetic programming model, Lazy learning algorithm, Linear Regression, Minimax probability machine regression, Multivariate adaptive regression splines, Nearest Neighbors, Neural Networks, Support Vector Machine, Voting Heterogeneous Ensemble Learning
Machine Learning Algorithm, algorithm itself:
Adaboost | 0
Artificial Neural Networks | 1
Back-propagation neural networks | 2
Bagging regressor | 3
Bayesian Regression Technique | 4
Classification and regression tree | 5
Component-based machine learning model | 6
Convolutional neural networks | 7
Decision Tree | 8
Deep neural networks | 9
Ensemble bagging trees | 10
Evolutionary multivariate adaptive regression splines | 11
Extreme Gradient Boosting | 12
Extreme Learning Machine | 13
Extremely randomized trees | 14
Gaussian process regression | 15
Generalized Linear Model | 16
Genetic programming model | 17
Gradient Boosting Decision Tree algorithm | 18
IBk Linear NN Search | 19
K-Nearest Neighbors Regressor | 20
Least square support vector machine | 21
Linear regression | 22
Locally Weighted Learning | 23
Long short-term memory model | 24
Minimax probability machine regression | 25
Model Trees Regression | 26
Multi-layer perceptron neural network | 27
Multiple linear regression | 28
Multivariate adaptive regression splines | 29
Radial basis function neural network | 30
Random forest algorithm | 31
Reduced Error Pruning Tree | 32
Stepwise linear regression | 33
Support vector machine | 34
Support vector regression | 35
Voting Heterogeneous Ensemble Learning | 36
Wavelet neural network | 37
Notes: *: Number representations were used in Figure A1. **: The level of the input variable is based on Figure 5.
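The ‘value assigned’ column of Table A4 maps each categorical item to an integer code so that the R2 distribution maps in Figure A1 can plot categories on a numeric axis. A minimal sketch of this kind of encoding, using pandas on placeholder records rather than the review's actual dataset:

```python
# Hedged sketch: integer-coding a categorical process variable, in the
# spirit of Table A4's "value assigned" column. The records below are
# illustrative placeholders, not rows from the reviewed studies.
import pandas as pd

records = pd.DataFrame({
    "ml_algorithm": ["ANN", "SVM", "ANN", "RF"],
    "r2": [0.97, 0.92, 0.95, 0.99],
})

# factorize() assigns 0, 1, 2, ... in order of first appearance,
# giving each category a stable integer code for plotting
codes, labels = pd.factorize(records["ml_algorithm"])
records["algorithm_code"] = codes

print(dict(zip(labels, range(len(labels)))))  # {'ANN': 0, 'SVM': 1, 'RF': 2}
```

The coded column can then serve as the Y axis of a scatter of R2 results per category, as in Figure A1.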
Figure A1. Distribution of R2 results according to (a) machine learning algorithm used; (b) 1st-level input configuration used; (c) 2nd-level input configuration used; (d) sample size used (as intervals). The variable values corresponding to the numbers on the Y axis are presented in Table A4.
Table A5. p-values for the relations between process-related variables.
Variables 1 | p-Value | Statistical Significance 2
Between categorical variables 3:
Output vs. Machine learning algorithm | 8.76 × 10⁻¹ | Insignificant
Output vs. Machine learning algorithm class | 7.48 × 10⁻¹ | Insignificant
Sample size interval vs. Machine learning algorithm class | 1.67 × 10⁻² | Significant
Design stage targeted vs. Machine learning algorithm class | 3.37 × 10⁻³ | Significant
Design stage targeted vs. 1st-level input configuration | 1.08 × 10⁻³ | Significant
1st-level input configuration vs. Machine learning algorithm class | 1.93 × 10⁻⁴ | Significant
2nd-level input configuration vs. Machine learning algorithm class | 1.95 × 10⁻⁷ | Significant
Study aim vs. Design stage targeted | 8.12 × 10⁻⁸ | Significant
Sample size interval vs. Machine learning algorithm | 4.77 × 10⁻⁸ | Significant
Design stage targeted vs. Machine learning algorithm | 4.94 × 10⁻⁹ | Significant
Design stage targeted vs. Output | 1.57 × 10⁻⁹ | Significant
1st-level input configuration vs. Machine learning algorithm | 1.32 × 10⁻¹⁰ | Significant
Study aim vs. Machine learning algorithm | 4.36 × 10⁻¹¹ | Significant
Sample size interval vs. Output | 2.13 × 10⁻¹¹ | Significant
Design stage targeted vs. 2nd-level input configuration | 1.31 × 10⁻¹³ | Significant
Design stage targeted vs. Sample size interval | 6.83 × 10⁻¹⁴ | Significant
2nd-level input configuration vs. Machine learning algorithm | 5.57 × 10⁻¹⁴ | Significant
Study aim vs. Machine learning algorithm class | 5.92 × 10⁻¹⁶ | Significant
1st-level input configuration vs. Output | 4.92 × 10⁻¹⁷ | Significant
Study aim vs. Output | 1.78 × 10⁻²⁹ | Significant
Study aim vs. Sample size interval | 5.78 × 10⁻³⁵ | Significant
2nd-level input configuration vs. Output | 6.57 × 10⁻⁵⁶ | Significant
1st-level input configuration vs. Sample size interval | 9.7 × 10⁻⁶⁰ | Significant
Study aim vs. 1st-level input configuration | 2.01 × 10⁻⁶⁰ | Significant
2nd-level input configuration vs. Sample size interval | 3.42 × 10⁻⁷³ | Significant
Study aim vs. 2nd-level input configuration | 4.18 × 10⁻⁹⁶ | Significant
1st-level input configuration vs. 2nd-level input configuration | 1.40 × 10⁻¹⁰⁵ | Significant
Machine learning algorithm class vs. Machine learning algorithm | 1.27 × 10⁻²⁰⁴ | Significant
Between categorical and numerical variables 4:
2nd-level input configuration vs. Sample size | 0 | Insignificant
Machine learning algorithm class vs. Sample size | 8.80 × 10⁻¹ | Insignificant
Output vs. R2 results | 2.94 × 10⁻¹ | Insignificant
Machine learning algorithm class vs. R2 results | 1.84 × 10⁻¹ | Insignificant
Machine learning algorithm vs. Sample size | 1.22 × 10⁻² | Significant
Design stage targeted vs. Sample size | 1.21 × 10⁻² | Significant
Design stage targeted vs. R2 results | 1.13 × 10⁻² | Significant
2nd-level input configuration vs. R2 results | 3.28 × 10⁻³ | Significant
Machine learning algorithm vs. R2 results | 1.85 × 10⁻³ | Significant
Study aim vs. R2 results | 6.94 × 10⁻⁴ | Significant
Sample size interval vs. R2 results | 1.45 × 10⁻⁵ | Significant
1st-level input configuration vs. R2 results | 2.96 × 10⁻⁶ | Significant
Study aim vs. Sample size | 4.08 × 10⁻¹⁰ | Significant
1st-level input configuration vs. Sample size | 4.08 × 10⁻¹⁰ | Significant
Sample size interval vs. Sample size | 1.46 × 10⁻¹⁶ | Significant
Output vs. Sample size | 6.41 × 10⁻²¹ | Significant
Between numerical variables 4:
Sample size vs. R2 results | 8.2 × 10⁻⁵ | Significant
1: The variables are sorted in descending order of p-value. 2: The relation is statistically significant when the p-value is smaller than 0.05. 3: For the relations between categorical variables, the chi-squared test [22,23] was used to obtain the p-value. 4: For the relations between categorical and numerical variables, ANOVA was used to obtain the p-value [26].
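The notes to Table A5 name the two tests behind these p-values: the chi-squared test of independence for categorical–categorical pairs and ANOVA for categorical–numerical pairs. A minimal sketch of both on toy data (SciPy implementations; the values below are illustrative, not those of the review):

```python
# Hedged sketch of the two tests used in Table A5, on placeholder data.
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

df = pd.DataFrame({
    "algorithm_class": ["NN", "NN", "Tree", "Tree", "SVM", "SVM"],
    "output":          ["Heating", "Cooling", "Heating", "Cooling", "Heating", "Cooling"],
    "r2":              [0.97, 0.95, 0.99, 0.98, 0.92, 0.90],
})

# Categorical vs. categorical: chi-squared test on the contingency table
table = pd.crosstab(df["algorithm_class"], df["output"])
chi2, p_cat, dof, expected = chi2_contingency(table)

# Categorical vs. numerical: one-way ANOVA on R2 values grouped by class
groups = [g["r2"].values for _, g in df.groupby("algorithm_class")]
f_stat, p_num = f_oneway(*groups)

# A relation is reported as significant when p < 0.05
print(p_cat < 0.05, p_num < 0.05)
```

In this toy example the uniform contingency table yields an insignificant chi-squared result, while the clearly separated R2 groups yield a significant ANOVA result.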

Appendix D. Training Times in the Studies That Comparatively Assessed Process-Related Variable Alternatives [8,29,35,51,67,71]


References

  1. Sadeghi, A.; Sinaki, R.Y.; Weckman, G.R.; Young, W.A. An intelligent model to predict energy performances of residential buildings based on deep neural networks. Energies 2020, 13, 571. [Google Scholar] [CrossRef]
  2. Geyer, P.; Singaravel, S. Component-based machine learning for performance prediction in building design. Appl. Energy 2018, 228, 1439–1453. [Google Scholar] [CrossRef]
  3. Clarke, J. Energy Simulation in Building Design, 2nd ed.; Routledge: London, UK, 2001. [Google Scholar] [CrossRef]
  4. Hensen, J.L.M.; Lamberts, R. (Eds.) Building Performance Simulation for Design and Operation, 2nd ed.; Routledge: London, UK, 2019. [Google Scholar] [CrossRef]
  5. Seyedzadeh, S.; Pour Rahimian, F.; Rastogi, P.; Glesk, I. Tuning machine learning models for prediction of building energy loads. Sustain. Cities Soc. 2019, 47, 101484. [Google Scholar] [CrossRef]
  6. Singh, M.M.; Singaravel, S.; Klein, R.; Geyer, P. Quick energy prediction and comparison of options at the early design stage. Adv. Eng. Inform. 2020, 46, 101185. [Google Scholar] [CrossRef]
  7. Flager, F.; Welle, B.; Bansal, P.; Soremekun, G.; Haymaker, J. Multidisciplinary process integration and design optimization of a classroom building. ITcon 2009, 14, 595–612. [Google Scholar]
  8. Pittarello, M.; Scarpa, M.; Ruggeri, A.G.; Gabrielli, L.; Schibuola, L. Artificial neural networks to optimize zero energy building (Zeb) projects from the early design stages. Appl. Sci. 2021, 11, 5377. [Google Scholar] [CrossRef]
  9. Barbaresi, A.; Ceccarelli, M.; Torreggiani, D.; Tassinari, P.; Bovo, M.; Menichetti, G. Application of Machine Learning Models for Fast and Accurate Predictions of Building Energy Need. Energies 2022, 15, 1266. [Google Scholar] [CrossRef]
  10. Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Comput. Sci. 2021, 2, 160. [Google Scholar] [CrossRef]
  11. Ahmad, T.; Chen, H.; Huang, R.; Yabin, G.; Wang, J.; Shair, J.; Azeem Akram, H.M.; Hassnain Mohsan, S.A.; Kazim, M. Supervised based machine learning models for short, medium and long-term energy prediction in distinct building environment. Energy 2018, 158, 17–32. [Google Scholar] [CrossRef]
  12. Lee, J.; Boubekri, M.; Liang, F. Impact of building design parameters on daylighting metrics using an analysis, prediction, and optimization approach based on statistical learning technique. Sustainability 2019, 11, 1474. [Google Scholar] [CrossRef]
  13. Fathi, S.; Srinivasan, R.; Fenner, A.; Fathi, S. Machine learning applications in urban building energy performance forecasting: A systematic review. Renew. Sustain. Energy Rev. 2020, 133, 110287. [Google Scholar] [CrossRef]
  14. Ahmad, T.; Chen, H.; Guo, Y.; Wang, J. A comprehensive overview on the data driven and large scale based approaches for forecasting of building energy demand: A review. Energy Build. 2018, 165, 301–320. [Google Scholar] [CrossRef]
  15. Amasyali, K.; El-Gohary, N.M. A review of data-driven building energy consumption prediction studies. Renew. Sustain. Energy Rev. 2018, 81, 1192–1205. [Google Scholar] [CrossRef]
  16. Dridi, J.; Bouguila, N.; Amayri, M. Transfer learning for estimating occupancy and recognizing activities in smart buildings. Build. Environ. 2022, 217, 109057. [Google Scholar] [CrossRef]
  17. Villa, S.; Sassanelli, C. The data-driven multi-step approach for dynamic estimation of buildings’ interior temperature. Energies 2020, 13, 6654. [Google Scholar] [CrossRef]
  18. de Wilde, P.; Martinez-Ortiz, C.; Pearson, D.; Beynon, I.; Beck, M.; Barlow, N. Building simulation approaches for the training of automated data analysis tools in building energy management. Adv. Eng. Inform. 2013, 27, 457–465. [Google Scholar] [CrossRef]
  19. Theil, H. Applied Economic Forecasting; by Henri Theil assisted by Beerens, G.A.C., De Leeuw, C.G., Tilanus, C.B.; Rand McNally: Chicago, IL, USA, 1966. [Google Scholar]
  20. Wang, C.; Yang, Y.; Causone, F.; Ferrando, M.; Ye, Y.; Gao, N.; Li, P.; Shi, X. Dynamic predictions for the composition and efficiency of heating, ventilation and air conditioning systems in urban building energy modeling. J. Build. Eng. 2024, 96, 110562. [Google Scholar] [CrossRef]
  21. Fu, T.; Tang, X.; Cai, Z.; Zuo, Y.; Tang, Y.; Zhao, X. Correlation research of phase angle variation and coating performance by means of Pearson’s correlation coefficient. Prog. Org. Coat. 2020, 139, 105459. [Google Scholar] [CrossRef]
  22. Laarne, P.; Zaidan, M.A.; Nieminen, T. ennemi: Non-linear correlation detection with mutual information. SoftwareX 2021, 14, 100686. [Google Scholar] [CrossRef]
  23. Zychlinski, S. Dython. 2024. Available online: http://shakedzy.xyz/dython/ (accessed on 17 January 2025).
  24. Buitinck, L.; Louppe, G.; Blondel, M.; Pedregosa, F.; Mueller, A.; Grisel, O.; Niculae, V.; Prettenhofer, P.; Gramfort, A.; Grobler, J.; et al. Scikit-Learn. 2013. Available online: https://scikit-learn.org/stable/index.html# (accessed on 17 January 2025).
  25. Risan, H.; Al-Azzawi, A.; Al-Zwainy, F. Chapter 4: T-, Chi-square, and F-distributions and hypothesis testing. In Statistical Analysis for Civil Engineers; Woodhead Publishing: Sawston, UK, 2025; pp. 155–214. ISBN 9780443273629. [Google Scholar] [CrossRef]
  26. Pinder, J. Chapter 9: Simulation fit significance: Chi-square, ANOVA. In Introduction to Business Analytics Using Simulation, 2nd ed.; Academic Press: Cambridge, MA, USA, 2022; pp. 273–325. ISBN 9780323917179. [Google Scholar] [CrossRef]
  27. Kottek, M.; Grieser, J.; Beck, C.; Rudolf, B.; Rubel, F. World Map of the Köppen-Geiger climate classification updated. Meteorol. Z. 2006, 15, 259–263. [Google Scholar] [CrossRef]
  28. Feng, K.; Lu, W.; Wang, Y. Assessing environmental performance in early building design stage: An integrated parametric design and machine learning method. Sustain. Cities Soc. 2019, 50, 101596. [Google Scholar] [CrossRef]
  29. Amasyali, K.; El-Gohary, N. Machine learning for occupant-behavior-sensitive cooling energy consumption prediction in office buildings. Renew. Sustain. Energy Rev. 2021, 142, 110714. [Google Scholar] [CrossRef]
  30. Li, X.; Yao, R. A machine-learning-based approach to predict residential annual space heating and cooling loads considering occupant behaviour. Energy 2020, 212, 118676. [Google Scholar] [CrossRef]
  31. Lin, Y.; Zhao, L.; Liu, X.; Yang, W.; Hao, X.; Tian, L.; Nord, N. Design optimization of a passive building with green roof through machine learning and group intelligent algorithm. Buildings 2021, 11, 192. [Google Scholar] [CrossRef]
  32. Yan, H.; Ji, G.; Yan, K. Data-driven prediction and optimization of residential building performance in Singapore considering the impact of climate change. Build. Environ. 2022, 226, 109735. [Google Scholar] [CrossRef]
  33. Jia, B.; Hou, D.; Wang, L.; Kamal, A.; Hassan, I.G. Developing machine-learning meta-models for high-rise residential district cooling in hot and humid climate. J. Build. Perform. Simul. 2022, 15, 553–573. [Google Scholar] [CrossRef]
  34. Martínez-Rocamora, A.; Rivera-Gómez, C.; Galán-Marín, C.; Marrero, M. Environmental benchmarking of building typologies through BIM-based combinatorial case studies. Autom. Constr. 2021, 132, 103980. [Google Scholar] [CrossRef]
  35. Ngarambe, J.; Yun, G.Y.; Kim, G.; Irakoze, A. Comparative performance of machine learning algorithms in the prediction of indoor daylight illuminances. Sustainability 2020, 12, 4471. [Google Scholar] [CrossRef]
  36. Sangireddy, S.A.R.; Bhatia, A.; Garg, V. Development of a surrogate model by extracting top characteristic feature vectors for building energy prediction. J. Build. Eng. 2019, 23, 38–52. [Google Scholar] [CrossRef]
  37. Tsay, Y.-S.; Yeh, C.-Y.; Chen, Y.-H.; Lu, M.-C.; Lin, Y.-C. A machine learning-based prediction model of lCCO2 for building envelope renovation in Taiwan. Sustainability 2021, 13, 8209. [Google Scholar] [CrossRef]
  38. Edwards, R.E.; New, J.; Parker, L.E.; Cui, B.; Dong, J. Constructing large scale surrogate models from big data and artificial intelligence. Appl. Energy 2017, 202, 685–699. [Google Scholar] [CrossRef]
  39. Sánchez, V.F.; Garrido Marijuan, A. Integrated model concept for district energy management optimisation platforms. Appl. Therm. Eng. 2021, 196, 117233. [Google Scholar] [CrossRef]
  40. Li, X.; Yao, R. Modelling heating and cooling energy demand for building stock using a hybrid approach. Energy Build. 2021, 235, 110740. [Google Scholar] [CrossRef]
  41. Todeschi, V.; Boghetti, R.; Kämpf, J.H.; Mutani, G. Evaluation of urban-scale building energy-use models and tools—Application for the city of Fribourg, Switzerland. Sustainability 2021, 13, 1595. [Google Scholar] [CrossRef]
  42. Tsanas, A.; Xifara, A. Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy Build. 2012, 49, 560–567. [Google Scholar] [CrossRef]
  43. Naji, S.; Keivani, A.; Shamshirband, S.; Alengaram, U.J.; Jumaat, M.Z.; Mansor, Z.; Lee, M. Estimating building energy consumption using extreme learning machine method. Energy 2016, 97, 506–516. [Google Scholar] [CrossRef]
  44. Chen, R.; Tsay, Y.-S.; Zhang, T. A multi-objective optimization strategy for building carbon emission from the whole life cycle perspective. Energy 2023, 262, 125373. [Google Scholar] [CrossRef]
  45. Hey, J.; Siebers, P.-O.; Nathanail, P.; Ozcan, E.; Robinson, D. Surrogate optimization of energy retrofits in domestic building stocks using household carbon valuations. J. Build. Perform. Simul. 2023, 16, 16–37. [Google Scholar] [CrossRef]
  46. Singh, M.M.; Deb, C.; Geyer, P. Early-stage design support combining machine learning and building information modelling. Autom. Constr. 2022, 136, 104147. [Google Scholar] [CrossRef]
  47. Zhu, S.; Ma, C.; Zhang, Y.; Xiang, K. A hybrid metamodel-based method for quick energy prediction in the early design stage. J. Clean. Prod. 2021, 320, 128825. [Google Scholar] [CrossRef]
  48. Namlı, E.; Erdal, H.; Erdal, H.I. Artificial Intelligence-Based Prediction Models for Energy Performance of Residential Buildings; Springer International Publishing: Berlin/Heidelberg, Germany, 2019. [Google Scholar] [CrossRef]
  49. Moradzadeh, A.; Mansour-Saatloo, A.; Mohammadi-ivatloo, B.; Anvari-Moghaddam, A. Performance Evaluation of Two Machine Learning Techniques in Heating and Cooling Loads Forecasting of Residential Buildings. Appl. Sci. 2020, 10, 3829. [Google Scholar] [CrossRef]
  50. Sauer, J.; Mariani, V.C.; dos Santos Coelho, L.; Ribeiro MH, D.M.; Rampazzo, M. Extreme gradient boosting model based on improved Jaya optimizer applied to forecasting energy consumption in residential buildings. Evol. Syst. Interdiscip. J. Adv. Sci. Technol. 2022, 13, 577–588. [Google Scholar] [CrossRef]
  51. Papadopoulos, S.; Azar, E.; Woon, W.-L.; Kontokosta, C.E. Evaluation of tree-based ensemble learning algorithms for building energy performance estimation. J. Build. Perform. Simul. 2018, 11, 322–332. [Google Scholar] [CrossRef]
  52. Castelli, M.; Trujillo, L.; Vanneschi, L.; Popovič, A. Prediction of energy performance of residential buildings: A genetic programming approach. Energy Build. 2015, 102, 67–74. [Google Scholar] [CrossRef]
  53. Nebot, A.; Mugica, F. Energy performance forecasting of residential buildings using fuzzy approaches. Appl. Sci. 2020, 10, 720. [Google Scholar] [CrossRef]
  54. Roy, S.S.; Samui, P.; Nagtode, I.; Jain, H.; Shivaramakrishnan, V.; Mohammadi-ivatloo, B. Forecasting heating and cooling loads of buildings: A comparative performance analysis. J. Ambient Intell. Humaniz. Comput. 2020, 11, 1253–1264. [Google Scholar] [CrossRef]
  55. Moayedi, H.; Mosavi, A. Double-target based neural networks in predicting energy consumption in residential buildings. Energies 2021, 14, 1331. [Google Scholar] [CrossRef]
  56. Moayedi, H.; Mosavi, A. Suggesting a stochastic fractal search paradigm in combination with artificial neural network for early prediction of cooling load in residential buildings. Energies 2021, 14, 1649. [Google Scholar] [CrossRef]
  57. Navarro-Gonzalez, F.J.; Villacampa, Y. An octahedric regression model of energy efficiency on residential buildings. Appl. Sci. 2019, 9, 4978. [Google Scholar] [CrossRef]
  58. Zheng, S.; Lyu, Z.; Foong, L.K. Early prediction of cooling load in energy-efficient buildings through novel optimizer of shuffled complex evolution. Eng. Comput. 2020, 38, 105–119. [Google Scholar] [CrossRef]
  59. Almutairi, K.; Algarni, S.; Alqahtani, T.; Moayedi, H.; Mosavi, A. A TLBO-tuned neural processor for predicting heating load in residential buildings. Sustainability 2022, 14, 5924. [Google Scholar] [CrossRef]
  60. Cheng, M.-Y.; Cao, M.-T. Accurately predicting building energy performance using evolutionary multivariate adaptive regression splines. Appl. Soft Comput. 2014, 22, 178–188. [Google Scholar] [CrossRef]
  61. Han, Y.; Shen, L.; Sun, C. Developing a parametric morphable annual daylight prediction model with improved generalization capability for the early stages of office building design. Build. Environ. 2021, 200, 107932. [Google Scholar] [CrossRef]
  62. Ayoub, M. Contrasting accuracies of single and ensemble models for predicting solar and thermal performances of traditional vaulted roofs. Sol. Energy 2022, 236, 335–355. [Google Scholar] [CrossRef]
  63. Forouzandeh, N.; Zomorodian, Z.S.; Tahsildoost, M.; Shaghaghian, Z. Room energy demand and thermal comfort predictions in early stages of design based on the machine Learning methods. Intell. Build. Int. 2023, 15, 3–20. [Google Scholar] [CrossRef]
  64. Chen, R.; Tsay, Y.-S. Carbon emission and thermal comfort prediction model for an office building considering the contribution rate of design parameters. Energy Rep. 2022, 8, 8093–8107. [Google Scholar] [CrossRef]
  65. Wu, X.; Feng, Z.; Chen, H.; Qin, Y.; Zheng, S.; Wang, L.; Liu, Y.; Skibniewski, M.J. Intelligent optimization framework of near zero energy consumption building performance based on a hybrid machine learning algorithm. Renew. Sustain. Energy Rev. 2022, 167, 112703. [Google Scholar] [CrossRef]
  66. Hamida, A.; Alsudairi, A.; Alshaibani, K.; Alshamrani, O. Environmental impacts cost assessment model of residential building using an artificial neural network. Eng. Constr. Archit. Manag. 2021, 28, 3190–3215. [Google Scholar] [CrossRef]
  67. Singh, M.M.; Singaravel, S.; Geyer, P. Machine learning for early stage building energy prediction: Increment and enrichment. Appl. Energy 2021, 304, 117787. [Google Scholar] [CrossRef]
  68. Papinutto, M.; Boghetti, R.; Colombo, M.; Basurto, C.; Reutter, K.; Lalanne, D.; Kämpf, J.H.; Nembrini, J. Saving energy by maximising daylight and minimising the impact on occupants: An automatic lighting system approach. Energy Build. 2022, 268, 112176. [Google Scholar] [CrossRef]
  69. Abdou, N.; El Mghouchi, Y.; Jraida, K.; Hamdaoui, S.; Hajou, A.; Mouqallid, M. Prediction and optimization of heating and cooling loads for low energy buildings in Morocco: An application of hybrid machine learning methods. J. Build. Eng. 2022, 61, 105332. [Google Scholar] [CrossRef]
  70. Olinger, M.S.; de Araújo, G.M.; Dutra, M.L.; Silva, H.A.D.; Júnior, L.P.; de Macedo, D.D. Metamodel Development to Predict Thermal Loads for Single-family Residential Buildings. Mob. Netw. Appl. 2022, 27, 1977–1986. [Google Scholar] [CrossRef]
  71. Paudel, S.; Elmitri, M.; Couturier, S.; Nguyen, P.H.; Kamphuis, R.; Lacarrière, B.; Le Corre, O. A relevant data selection method for energy consumption prediction of low energy building based on support vector machine. Energy Build. 2017, 138, 240–256. [Google Scholar] [CrossRef]
  72. Singaravel, S.; Suykens, J.; Geyer, P. Deep convolutional learning for general early design stage prediction models. Adv. Eng. Inform. 2019, 42, 100982. [Google Scholar] [CrossRef]
  73. Chen, B.; Liu, Q.; Wang, L.; Deng, T.; Wu, X.; Chen, H.; Zhang, L. Multiobjective optimization of building energy consumption based on BIM-DB and LSSVM-NSGA-II. J. Clean. Prod. 2021, 294, 126153. [Google Scholar] [CrossRef]
  74. Chen, X.; Yang, H. A multi-stage optimization of passively designed high-rise residential buildings in multiple building operation scenarios. Appl. Energy 2017, 206, 541–557. [Google Scholar] [CrossRef]
  75. Sharif, S.A.; Hammad, A. Developing surrogate ANN for selecting near-optimal building energy renovation methods considering energy consumption, LCC and LCA. J. Build. Eng. 2019, 25, 100790. [Google Scholar] [CrossRef]
  76. ASHRAE. ASHRAE Guideline 14-2014: Measurement of Energy, Demand, and Water Savings; American Society of Heating, Refrigerating and Air-Conditioning Engineers: Atlanta, GA, USA, 2014. [Google Scholar]
  77. Giannelos, S.; Bellizio, F.; Strbac, G.; Zhang, T. Machine learning approaches for predictions of CO2 emissions in the building sector. Electr. Power Syst. Res. 2024, 235, 110735. [Google Scholar] [CrossRef]
  78. Kazemi, F.; Asgarkhani, N.; Jankowski, R. Optimization-based stacked machine-learning method for seismic probability and risk assessment of reinforced concrete shear walls. Expert Syst. Appl. 2024, 255, 124897. [Google Scholar] [CrossRef]
  79. Sun, Y.; Haghighat, F.; Fung, B.C.M. The generalizability of pre-processing techniques on the accuracy and fairness of data-driven building models: A case study. Energy Build. 2022, 268, 112204. [Google Scholar] [CrossRef]
  80. Pessach, D.; Shmueli, E. Improving fairness of artificial intelligence algorithms in Privileged-Group Selection Bias data settings. Expert Syst. Appl. 2021, 185, 115667. [Google Scholar] [CrossRef]
  81. Ahmed, R.; Fahad, N.; Miah, M.S.U.; Hossen, M.J.; Mahmud, M.; Mostafizur Rahman, M.; Morol, M.K. A novel integrated logistic regression model enhanced with recursive feature elimination and explainable artificial intelligence for dementia prediction. Healthc. Anal. 2024, 6, 100362. [Google Scholar] [CrossRef]
  82. Kaushik, B.; Chadha, A.; Mahajan, A.; Ashok, M. A three layer stacked multimodel transfer learning approach for deep feature extraction from Chest Radiographic images for the classification of COVID-19. Eng. Appl. Artif. Intell. 2025, 147, 110241. [Google Scholar] [CrossRef]
  83. Sun, Y.; Haghighat, F.; Fung, B.C.M. A review of the state-of-the-art in data-driven approaches for building energy prediction. Energy Build. 2020, 221, 110022. [Google Scholar] [CrossRef]
  84. Sina, A.; Abdolalizadeh, L.; Mako, C.; Torok, B.; Amir, M. Systematic review of deep learning and machine learning for building energy. Front. Energy Res. 2022, 10, 786027. [Google Scholar] [CrossRef]
  85. Ahmad, T.; Zhang, H.; Yan, B. A review on renewable energy and electricity requirement forecasting models for smart grid and buildings. Sustain. Cities Soc. 2020, 55, 102052. [Google Scholar] [CrossRef]
Figure 1. Search and review methodology.
Figure 2. The number of building types by function and climate types modeled in studies.
Figure 3. Stages and steps of machine-learning-based energy performance prediction that the reviewed studies followed in their modeling processes, together with their aims [1,2,5,6,8,9,12,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75].
Figure 4. Decision subjects, interrelated workflows, and concerns in machine learning model generation.
Figure 5. The number of different input variables/features per different output variables in all studies and in the studies targeting the early design stage [38,42].
Figure 6. ‘Algorithm class—aim of study’ interrelation for all studies (note: algorithm classes are ordered by the frequency of use, which can be seen in Table 1).
Figure 7. ‘Algorithm class—output’ interrelation for (a) studies focusing on the early design stage, (b) studies using the same dataset, and (c) all studies, except those in (a,b). (Note: algorithm classes are ordered by the frequency of use, which can be seen in Table 1, and the frequencies of output variables can be seen in Figure 5).
Figure 8. ‘Input—output’ interrelation for (a) all studies, (b) studies focusing on the early design stage, and (c) all studies, except those in (b) (the frequencies of input and output variables can be seen in Figure 5, and the detailed input list is presented in Table A3).
Figure 9. ‘Dataset sample size—output’ interrelation for (a) all studies, (b) studies focusing on the early design stage, and (c) all studies, except those in (b) (the frequencies of output variables can be seen in Figure 5).
Figure 10. ‘Dataset sample size—algorithm class’ interrelation for (a) all studies, (b) studies focusing on the early design stage, and (c) all studies, except those in (b) (the frequencies of algorithm classes can be seen in Table 1).
Figure 11. Results of correlation analyses between process-related variables (i.e., study aims and decision subjects) and model performance (i.e., R2).
Table 1. Distribution of algorithms used in all studies and in the studies focusing on the early design stage.
| Algorithm Type | Algorithm Class | Algorithm | Abbreviation | Number of Studies: All Studies * | Number of Studies: Early Design Stage * |
|---|---|---|---|---|---|
| Ensemble | Bagging Algorithm | Bootstrap Aggregating (Bagging) Algorithm | BA | 1 | – |
| Ensemble | Decision Tree | Adaboost | AB | 2 | 1 |
| Ensemble | Decision Tree | Bagging Regressor | BR | 1 | 1 |
| Ensemble | Decision Tree | Ensemble Bagging Trees | EBT | 1 | – |
| Ensemble | Decision Tree | Extra Tree Algorithm | ET | 1 | – |
| Ensemble | Decision Tree | Extreme Gradient Boosting | XGBoost | 6 | – |
| Ensemble | Decision Tree | Extremely Randomized Trees | ERT | 2 | 1 |
| Ensemble | Decision Tree | Gradient Boosting Decision Tree Algorithm | GBRT | 9 | 1 |
| Ensemble | Decision Tree | Random Forest | RF | 14 | 3 |
| Ensemble | Light Gradient Boosting Machine Model | Light Gradient Boosting Machine Model | LGBM | 1 | – |
| Ensemble | Voting Heterogeneous Ensemble Learning | Voting Heterogeneous Ensemble Learning | ELVoting | 1 | – |
| Single | Component-based Machine Learning | Component-based Machine Learning | CBML | 2 | 2 |
| Single | Decision Tree | Classification and Regression Tree | CART | 2 | – |
| Single | Decision Tree | Decision Tree | DT | 3 | 2 |
| Single | Decision Tree | Model Trees Regression | M5P | 1 | – |
| Single | Decision Tree | Reduced Error Pruning Tree | REPTree | 1 | – |
| Single | Design and Analysis of Computer Experiments | Design and Analysis of Computer Experiments | DACE | 1 | 1 |
| Single | Fuzzy Logic | Adaptive Neuro Fuzzy Inference System | ANFIS | 1 | – |
| Single | Fuzzy Logic | Fuzzy Inductive Reasoning | FIR | 1 | – |
| Single | Gaussian Process Regression | Gaussian Process Regression | GPR | 3 | 1 |
| Single | Generalized Additive Models | Generalized Additive Models | GAM | 1 | – |
| Single | Genetic Programming | Genetic Programming | GP | 2 | – |
| Single | Lazy Learning Algorithm | IBk Linear Nearest Neighbour Search | IBk | 1 | – |
| Single | Lazy Learning Algorithm | Locally Weighted Learning | LWL | 1 | – |
| Single | Linear Regression | Bayesian Regression Technique | – | 1 | – |
| Single | Linear Regression | Bayesian Ridge Regression | – | 1 | 1 |
| Single | Linear Regression | Elastic Net | EN | 1 | – |
| Single | Linear Regression | Generalized Linear Model | GLM | 1 | 1 |
| Single | Linear Regression | Huber | – | 1 | 1 |
| Single | Linear Regression | Iteratively Reweighted Least Squares | IRLS | 1 | – |
| Single | Linear Regression | Least Absolute Shrinkage and Selection Operator | LASSO | 3 | 1 |
| Single | Linear Regression | Linear Regression | LR | 7 | 1 |
| Single | Linear Regression | Multiple Linear Regression | MLR | 2 | – |
| Single | Linear Regression | Multivariate Adaptive Regression Splines | MARS | 3 | 1 |
| Single | Linear Regression | Ordinary Least-Squares Linear Regression | OLS | 1 | – |
| Single | Linear Regression | Random Sample Consensus | RANSAC | 1 | 1 |
| Single | Linear Regression | Ridge Regression | RR | 2 | 1 |
| Single | Linear Regression | Stepwise Linear Regression | SLR | 2 | – |
| Single | Linear Regression | Theil Sen | Theil Sen | 1 | 1 |
| Single | Minimax Probability Machine Regression | Minimax Probability Machine Regression | MPMR | 1 | – |
| Single | Multivariate Adaptive Regression Splines | Evolutionary Multivariate Adaptive Regression Splines | EMARS | 1 | – |
| Single | Nearest Neighbors | K-Nearest Neighbors Regressor | KNN | 4 | 2 |
| Single | Neural Networks | Artificial Neural Networks | ANN | 24 | 7 |
| Single | Neural Networks | Back-Propagation Neural Networks | BPNN | 3 | – |
| Single | Neural Networks | Convolutional Neural Networks | CNN | 2 | 2 |
| Single | Neural Networks | Deep Neural Networks | DNN | 4 | 1 |
| Single | Neural Networks | Extreme Learning Machine | ELM | 2 | 1 |
| Single | Neural Networks | Feed Forward Neural Networks | – | 1 | – |
| Single | Neural Networks | Long Short-term Memory Model | LSTM | 2 | 1 |
| Single | Neural Networks | Multi-layer Perceptron Neural Network | MLP | 7 | – |
| Single | Neural Networks | Radial Basis Function Neural Network | RBFNN | 2 | – |
| Single | Neural Networks | Wavelet Neural Networks | WNN | 1 | – |
| Single | Octahedric Regression Model | Octahedric Regression Model | – | 1 | – |
| Single | Support Vector Machine | Least-Square Support Vector Machine | LSSVM | 1 | – |
| Single | Support Vector Machine | Support Vector Machine | SVM | 9 | 1 |
| Single | Support Vector Machine | Support Vector Regression | SVR | 10 | 1 |
*: In the heatmap, darker-green cells indicate more frequent use. Additionally, heatmaps for all studies and for the studies considering the early design stage are colored separately.
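To make the comparison in Table 1 concrete, the short sketch below (a hypothetical illustration, not code from any reviewed study) trains two of the most frequently used algorithm classes, random forest (RF) and linear regression (LR), on a synthetic dataset that stands in for building energy simulation outputs. The input names, sample size, target function, and the use of scikit-learn are all assumptions made for this example.

```python
# Hedged sketch: comparing RF and LR on synthetic "simulation" data.
# All inputs and the target function are invented for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n = 500  # hypothetical sample size

# Hypothetical design inputs: window-to-wall ratio, envelope U-value,
# orientation in degrees (stand-ins for typical simulation parameters).
X = rng.uniform([0.1, 0.2, 0.0], [0.9, 3.0, 360.0], size=(n, 3))

# Hypothetical heating-load target with a nonlinearity and mild noise.
y = (20.0 * X[:, 0]
     + 15.0 * X[:, 1] ** 2
     + 5.0 * np.cos(np.radians(X[:, 2]))
     + rng.normal(0.0, 1.0, n))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for name, model in [("RF", RandomForestRegressor(random_state=0)),
                    ("LR", LinearRegression())]:
    model.fit(X_tr, y_tr)
    results[name] = r2_score(y_te, model.predict(X_te))
    print(f"{name} R2: {results[name]:.3f}")
```

On such smooth, low-noise targets, the tree ensemble usually attains a higher R2 than the linear baseline, which is consistent with the review's observation that the choice of machine learning algorithm is among the most influential decisions on prediction accuracy.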
Share and Cite

Kömürcü, D.; Edis, E. Machine Learning Modeling for Building Energy Performance Prediction Based on Simulation Data: A Systematic Review of the Processes, Performances, and Correlation of Process-Related Variables. Buildings 2025, 15, 1301. https://doi.org/10.3390/buildings15081301
