Article

Benchmarking Automated Machine Learning for Building Energy Performance Prediction: A Comparative Study with SHAP-Based Interpretability

by Zuyi Tang 1, Jinyu Chen 2 and Jiayu Cheng 3,*
1 Guangdong Architectural Design and Research Institute Group Co., Ltd., Guangzhou 510010, China
2 School of Civil Engineering and Transportation, Guangzhou University, Guangzhou 510006, China
3 Department of Building and Real Estate, The Hong Kong Polytechnic University, Hong Kong, China
* Author to whom correspondence should be addressed.
Buildings 2026, 16(1), 185; https://doi.org/10.3390/buildings16010185 (registering DOI)
Submission received: 11 November 2025 / Revised: 25 December 2025 / Accepted: 30 December 2025 / Published: 1 January 2026
(This article belongs to the Special Issue Sustainable Energy in Built Environment and Building)

Abstract

The growing demand for energy-efficient buildings necessitates innovative approaches to reduce energy consumption during early design stages. While traditional physics-based simulations remain time- and expertise-intensive, automated machine learning (AutoML) offers a promising alternative by enabling data-driven building performance prediction with minimal human intervention. This study conducts a benchmark evaluation of AutoML’s potential in building energy applications through three objectives: (1) a literature review revealing AutoML’s nascent adoption (10 identified studies) and primary use cases (heating/cooling prediction, energy demand forecasting); (2) a benchmark comparing three AutoML frameworks (AutoGluon, H2O, Auto-sklearn) against baseline and ensemble ML models using R², RMSE, MSE, and MAE metrics; and (3) SHAP (SHapley Additive exPlanations)-based interpretability analysis. Results demonstrate AutoGluon’s superior accuracy (R² = 0.993, RMSE = 2.280 kWh/m²) in predicting energy performance, outperforming traditional methods. Key influential features, including solar heat gain coefficient (SHGC) and U-values, were identified through SHAP, offering actionable design insights. The primary novelty of this work lies in its two-step methodology: a focused review to identify pertinent AutoML frameworks, followed by a comparative benchmarking of these frameworks against traditional ML for early-stage prediction. This process substantiates AutoML’s potential to democratize energy modeling and deliver practical, interpretable workflows for architectural design.

1. Introduction

1.1. The Demands for Building Energy Management

The building sector is a major contributor to global energy consumption and carbon dioxide emissions. Research indicates that buildings account for approximately 40% of worldwide energy use [1]. In the United States, residential buildings alone consume 21% of the nation’s total energy, a figure that rises to 75% in some major cities [2]. Similarly, in China, buildings are responsible for 25% of national and 6% of global energy consumption [3]. Furthermore, due to population growth, urbanization, and rapid industrialization, building energy demand is projected to increase steadily, with an estimated annual growth rate of 1.3% between 2018 and 2050 [4]. Excessive energy consumption in buildings has significant environmental consequences, including greenhouse gas emissions, urban heat-island effects, and air pollution, all of which pose risks to public health and socioeconomic development [5].
Consequently, there is growing emphasis on reducing energy use and emissions in the building sector. Urban planners and policymakers are actively exploring innovative strategies to enhance the sustainability of existing buildings, such as developing comprehensive energy efficiency plans and long-term renovation strategies [6]. These measures aim to minimize overall energy consumption and carbon emissions by leveraging large-scale building energy performance data. Researchers, driven by global climate initiatives, are focusing on two primary approaches to reduce building energy consumption, namely, renewable energy integration and energy management [3]. Renewable energy generation seeks to replace conventional energy sources, such as coal and oil, with sustainable alternatives like solar and hydropower.
Meanwhile, energy management strategies aim to improve energy efficiency, reduce consumption, and lower carbon emissions [7]. Among those strategies, enhancing building energy efficiency has become a prominent research topic because it proactively eliminates energy waste at the early design stage. Specifically, it requires multifaceted interventions and decisions, including modern design principles, proper insulation, efficient HVAC systems, and energy-saving appliances [8]. Therefore, building energy prediction plays a critical role in building energy conservation and optimization, as well as in promoting the penetration of renewable energy in buildings. In addition, accurate energy performance modeling can help evaluate new building design alternatives and optimize early-stage designs.

1.2. Physical vs. Data-Driven Building Energy Performance Modeling

According to Pan and Zhang [3], building energy evaluation can generally be accomplished through either physical modeling or data-driven approaches, as shown in Table 1. Physics-driven building energy modeling uses fundamental thermodynamic principles and detailed building parameters to simulate energy consumption accurately. This “white box” approach uses heat and mass balance equations to represent a building’s thermal behavior through tools such as EnergyPlus and TRNSYS [9]. The method requires comprehensive inputs, including building envelope properties, temperature profiles, system operations, and occupancy patterns. While offering high accuracy, these models demand significant computational resources and specialized expertise in building physics. Another primary limitation of physics-based modeling is its substantial dependence on domain expertise, physical principles, and comprehensive building characteristics, encompassing both geometric and non-geometric parameters [10].
In contrast, data-driven modeling applies machine learning to historical energy data, identifying patterns between consumption and influencing factors like weather, occupancy, and building characteristics [9]. This “black box” approach is particularly useful when physical parameters are unknown, as in early design phases. The method has been reported to achieve over 10% energy savings through optimized performance predictions, with the added advantage of improving accuracy as more operational data becomes available [9]. Therefore, this method has gained considerable traction in the building energy sector due to its ability to predict and estimate energy consumption with relatively limited building information [11]. Additionally, it can reveal complex relationships among independent variables that may be difficult to identify using conventional methods.

1.3. Automated Machine Learning and Its Potential for Building Energy Performance Estimation

1.3.1. Traditional Machine Learning vs. Automated Machine Learning

While traditional machine learning approaches have been validated as effective for accurately estimating building energy performance, they present several inherent constraints, including substantial demands for domain expertise, labor-intensive hyperparameter tuning, and computationally expensive trial-and-error model development [12,13,14,15]. These methodologies often require specialized knowledge to manually select algorithms, engineer features, and optimize parameters, which consumes significant time and resources while introducing human bias. Automated machine learning (AutoML) has emerged as a systematic solution to these challenges by automating the entire model development pipeline [16].
As defined by Bahri et al. [17], AutoML aims to develop computational systems capable of autonomously determining optimal model configurations without human intervention, while maintaining robust performance and computational efficiency. The term “model configuration” encompasses all predefined procedures that significantly influence final model performance, including data pre-processing, feature engineering (comprising feature selection and dimensionality reduction), hyperparameter optimization, and model architecture selection [16]. Therefore, AutoML encapsulates data pre-processing, feature selection, algorithm optimization, and model evaluation into a unified automated framework [18].
Figure 1 illustrates the differences in workflow between traditional machine learning and AutoML methodologies in data analysis pipelines. The conventional machine learning pipeline comprises multiple sequential stages: data preprocessing, feature engineering, model selection, hyperparameter optimization, and model evaluation [17]. In traditional implementations, each of these stages requires manual execution, demanding substantial expertise in computer science, statistical modeling, and iterative experimental design. Practitioners must manually perform data cleansing operations, engineer relevant features through domain knowledge, and conduct numerous trial-and-error iterations for hyperparameter tuning to identify optimal model configurations [19]. AutoML fundamentally transforms this workflow by automating these complex processes. This automation significantly lowers the technical barrier to implementation, enabling users without specialized machine learning expertise to effectively conduct advanced data analysis [17,20]. The critical distinction between these approaches consequently lies in the degree of human intervention. While traditional methods rely heavily on manual execution by skilled practitioners, AutoML systems implement intelligent automation to streamline the entire modeling lifecycle.
When compared with conventional methods, AutoML demonstrates three principal advantages: (1) enhanced operational efficiency through automation of repetitive tasks, allowing practitioners to focus on strategic problem-solving rather than technical implementation; (2) reduced human error in model configuration through systematic optimization techniques; and (3) democratized access to advanced analytics by lowering the expertise threshold for effective model deployment. Representative frameworks, such as Auto-sklearn, H2O AutoML, and TPOT, operationalize these benefits through features like parallel algorithm evaluation, Bayesian hyperparameter optimization, and dynamic pipeline composition.

1.3.2. AutoML for Building Energy Research

As an emerging branch of artificial intelligence, AutoML facilitates data-driven building energy performance modeling by autonomously selecting optimal model architectures and hyperparameters to maximize prediction accuracy. This capability proves particularly valuable for early-stage design decisions where conventional approaches face significant limitations [12]. The implementation of AutoML addresses two critical challenges in building energy modeling. Firstly, architectural designers typically lack specialized expertise in computer science and machine learning algorithms, making traditional ML implementation impractical due to its numerous technical decisions regarding feature selection, model choice, and optimization strategies [21]. AutoML overcomes this barrier by enabling automatic energy performance evaluation directly from design parameters, allowing designers to compare alternatives quickly and optimize building energy efficiency [21,22].
Secondly, while data-driven modeling already offers advantages over physics-based approaches by eliminating time-consuming simulation processes, AutoML further enhances this methodology. It not only maintains the simulation-free advantage but also improves decision-making quality through automated optimization. Empirical studies demonstrate these benefits: Zhang, Tian, Zhao and Lu [12] evaluated six AutoML frameworks (Auto-WEKA, H2O, TPOT, AutoGluon, FLAML, and AutoKeras) for predicting real-world building heating, cooling, and electrical loads, revealing distinct performance advantages among different frameworks. Similarly, Lu, Li, Reddy Penaka and Olofsson [13] developed automated ML pipelines for heating/cooling load prediction during early design stages, with results demonstrating superior performance compared to traditional machine learning approaches. These advancements position AutoML as a transformative tool for building energy modeling, combining the accuracy of data-driven methods with the accessibility needed for practical design applications.

1.4. Research Gaps and Objectives

Given the rapid adoption of ML in the field alongside the emerging potential of AutoML to alleviate its expert-dependent bottlenecks, a dedicated synthesis of existing AutoML applications is timely. A preliminary assessment indicates that while the volume of traditional ML studies is substantial [23,24], focused AutoML applications remain fragmented and have not been systematically reviewed. This highlights a pressing need to identify effective methodologies, establish reliable performance benchmarks, and frame key research challenges. Recognizing AutoML as a powerful yet underutilized tool for energy performance estimation, this study investigates its potential to advance predictive modeling in architectural design contexts. Two critical research gaps motivate this work: (1) the lack of comparisons between AutoML frameworks and conventional machine learning approaches for building energy prediction, and (2) the need for improved model interpretability to facilitate practical implementation. Unlike standard AutoML systems focused solely on automation, interpretable AutoML better aligns with the stringent reliability requirements of building performance analysis. These considerations lead to three key research objectives:
1. To conduct a review of AutoML applications in building energy research;
2. To benchmark AutoML performance for early-stage building energy prediction;
3. To enhance model transparency through Explainable Artificial Intelligence (XAI) techniques.
While comprehensive reviews exist for broader innovations in building performance [25], a distinct gap remains in the systematic, empirical benchmarking of AutoML for early-stage performance prediction. This study makes three main contributions to address this gap. First, it provides a focused review of AutoML applications in building energy research, synthesizing the current state of this emerging field. Second, and primarily, it provides the first comparative benchmark of leading AutoML frameworks (AutoGluon, H2O, Auto-sklearn) against established traditional and ensemble ML methods for this specific task. Third, it introduces a novel integration of SHAP-based explainable AI (XAI) with the top-performing AutoML model, translating automated predictions into quantitatively actionable insights for building designers. The novelty of this work is thus anchored in a two-step methodology: a literature review that synthesizes the emerging field and identifies pertinent AutoML frameworks, followed by a comparative benchmark that empirically evaluates these frameworks against strong traditional ML baselines for early-design-stage energy prediction. In this way, the work advances the field not only by summarizing current knowledge but, more importantly, by delivering new empirical evidence and a practical, interpretable workflow that substantiates AutoML’s practical value in architectural design.

2. Materials and Methods

This study employed a tripartite methodological approach to achieve its research objectives, as illustrated in Figure 2. First, we conducted a literature review of AutoML applications in building energy research using the Scopus database, fulfilling Objective 1 by establishing the current state of knowledge in this emerging field.
For Objective 2, we performed benchmark validation of AutoML frameworks for building energy performance prediction. This comparative analysis evaluated multiple AutoML implementations against traditional machine learning approaches, assessing their relative effectiveness in improving both predictive accuracy and decision-making quality. The validation used a representative public building energy dataset developed by Elbeltagi et al. [26] to evaluate AutoML’s capabilities across a wide range of building design configurations.
Objective 3 was addressed by developing an interpretability module integrated with AutoML outputs. This innovation enhances model transparency and supports trustworthy decision-making in early-stage design assistance [27]. The module employs advanced explainable AI techniques to elucidate the relationships between input parameters and energy performance predictions.
Through this three-phase investigation, we demonstrate AutoML’s significant potential to transform building energy performance estimation, particularly when combined with interpretability features for design-stage applications. The integrated approach not only advances predictive accuracy but also bridges the gap between data science and architectural practice.
The following subsections discuss these three components in turn.

2.1. Review of AutoML for Building Energy Research

This study conducted a literature review to examine current applications of AutoML in building energy research. The review followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) framework [28,29], as shown in Figure 3, which consisted of three key phases: identification, screening, and inclusion.
During the identification phase, the research team developed a search strategy using two keyword sets. The first set captured AutoML-related terms: (“automated machine learning” OR “AutoML” OR “automated deep learning” OR “AutoDL”). The second set focused on the building energy domain: (“building” AND “energy”). These keyword combinations were searched in the Scopus database, selected for its extensive coverage of peer-reviewed literature across multiple disciplines. The initial search returned 46 potentially relevant articles.
The screening phase applied three inclusion criteria to evaluate these articles: (1) publication in the English language, (2) publication in peer-reviewed academic journals, and (3) direct relevance to AutoML applications in building energy research. Each article underwent evaluation of its title, abstract, and, when necessary, full text. As illustrated in Figure 3, 4 articles were excluded as non-English publications and 27 were excluded after a preliminary title review because they did not meet criteria (2) or (3). A subsequent in-depth review of the abstracts removed a further 7 articles, leaving a corpus of 8 articles for detailed analysis.
To ensure comprehensive coverage of relevant literature, the study implemented a supplementary snowball sampling technique during the inclusion phase. This involved examining the reference lists of all eight selected articles to identify additional relevant publications that might have been missed in the initial database search. Through this method, one more study was included. Finally, a total of 9 articles were included. While the present review consists of a relatively modest number of articles (n = 9), this reflects both the emerging nature of the field and our methodological emphasis on depth over breadth. Similar article counts are not uncommon in systematic reviews of specialized domains [30,31]. To understand this emerging field, our analysis combines quantitative and qualitative approaches. The quantitative analysis characterizes the field’s growth trajectory and tool adoption by examining the distribution of publication years and the frequency of AutoML frameworks used (see Section 3.1.1). The qualitative analysis provides an in-depth examination of each study, focusing on its specific application area, input/output variables, main findings, and the AutoML framework adopted. The results of this structured qualitative synthesis are detailed in Section 3.1.2. Together, these analyses map the current research landscape, identify prevailing trends and gaps, and establish the foundation for our subsequent empirical benchmarking study.
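The screening arithmetic described above can be checked with a short tally. The sketch below uses only the counts stated in the text; the function and variable names are illustrative and not part of the original study.

```python
# Counts come from the screening description; names are illustrative only.

def prisma_flow(identified, exclusions, added_by_snowballing):
    """Return (screened_in, final_total) for a simple PRISMA-style tally."""
    screened_in = identified - sum(exclusions.values())
    return screened_in, screened_in + added_by_snowballing


exclusions = {
    "non_english": 4,       # excluded for language
    "title_review": 27,     # excluded after preliminary title review
    "abstract_review": 7,   # excluded after in-depth abstract review
}

screened_in, final_total = prisma_flow(46, exclusions, added_by_snowballing=1)
print(f"after screening: {screened_in}, after snowballing: {final_total}")
```

The tally reproduces the 8 articles retained after screening and the 9 articles included once the snowballed study is added.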
It is acknowledged that the final corpus of studies (N = 9) is of a scale more typical of a focused scoping review than of a conventional systematic review [32,33]. Nonetheless, the PRISMA protocol was rigorously followed throughout the identification, screening, and inclusion phases. This decision was made to ensure methodological transparency, reproducibility, and scientific rigor in our inquiry. The structured approach minimizes selection bias and provides a clear audit trail, establishing a reliable foundation for mapping the current research landscape and identifying gaps.
The selected articles underwent thematic analysis to examine their research characteristics systematically [29]. Each publication was carefully reviewed to extract and categorize the following key elements: (1) specific building energy application area, (2) AutoML framework employed, (3) independent variables considered, (4) dependent variables measured, and (5) main findings. This structured extraction process facilitated the mapping of current AutoML applications in building energy research.

2.2. Benchmark Validation of AutoML for Building Energy Performance Prediction

The study implemented a structured workflow to systematically evaluate AutoML effectiveness for building energy performance estimation. As shown in Figure 4, the workflow comprises five interconnected modules: the pre-training module, traditional ML module, AutoML module, model evaluation module, and results presentation module.

2.2.1. Pre-Training Module

The pre-training module established robust data preparation protocols, beginning with data preprocessing that addressed missing values through median imputation, normalized numerical features using min-max scaling, and detected outliers via the interquartile range method. Feature engineering employed Spearman correlation analysis (threshold: ρ > 0.7) or Principal Component Analysis (retaining components explaining 95% of the variance) to identify optimal predictor combinations. The data partitioning strategy utilized stratified 9:1 train-test splits with five-fold cross-validation, ensuring robust model generalization and mitigating overfitting risks.
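The three preprocessing steps named above (median imputation, min-max scaling, and interquartile-range outlier detection) can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the authors' pipeline: the 1.5 × IQR fence, the "inclusive" quartile method, and the sample column are all assumptions, and a practical implementation would more likely rely on a library such as scikit-learn.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def iqr_outlier_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (k = 1.5 is assumed)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


# Hypothetical column with one missing entry and one extreme value.
raw_column = [0.3, 0.4, None, 0.5, 0.6, 5.0]
imputed = impute_median(raw_column)
flags = iqr_outlier_flags(imputed)
scaled = min_max_scale(imputed)
print(imputed, flags, scaled, sep="\n")
```

Because min-max scaling is a linear map, flagging outliers before or after scaling yields the same flags; the order used here is only for readability.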

2.2.2. Traditional ML Module

The traditional machine learning module incorporated carefully selected baseline algorithms representing distinct methodological approaches: linear regression (LR) provided a parametric benchmark, random forest (RF) demonstrated ensemble tree performance, and naïve Bayes (NB) offered probabilistic modeling capabilities. These algorithms were selected as they are among the most commonly applied methods in building energy simulation, optimization, and management, as documented by Villano, Mauro and Pedace [15]. To enable a more advanced comparative analysis, three state-of-the-art ensemble methods were also implemented: XGBoost, LightGBM, and AdaBoost. Their inclusion is justified by their prevalent use and demonstrated efficacy in building energy research [34,35].
For a fair and reproducible comparison under a low-expertise scenario, simulating a practitioner applying these models without dedicated tuning, the three ensemble methods were implemented using standard library default hyperparameters [36]. This approach provides a realistic baseline of performance achievable with minimal configuration effort, which is the primary benchmark for evaluating the added value of AutoML’s automated tuning. The specific settings were: XGBoost (n_estimators = 100; max_depth = 6; learning_rate = 0.1), LightGBM (num_leaves = 31), and AdaBoost (n_estimators = 50; base estimator: DecisionTreeRegressor(max_depth = 1)) [36]. These values align with the common defaults or recommended starting points in the official documentation (XGBoost, LightGBM, scikit-learn) for regression tasks.
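For reference, the settings listed above can be collected in one configuration mapping. The dictionary layout is an illustrative assumption rather than the authors' code; in practice the values would be passed as keyword arguments to the corresponding estimator constructors in the XGBoost, LightGBM, and scikit-learn libraries.

```python
# Hyperparameter values are those stated in the text; the structure of this
# mapping (and the nested description of the AdaBoost base estimator) is an
# illustrative assumption.
BASELINE_PARAMS = {
    "XGBoost": {"n_estimators": 100, "max_depth": 6, "learning_rate": 0.1},
    "LightGBM": {"num_leaves": 31},
    "AdaBoost": {
        "n_estimators": 50,
        "base_estimator": {"type": "DecisionTreeRegressor", "max_depth": 1},
    },
}

for model_name, params in BASELINE_PARAMS.items():
    settings = ", ".join(f"{key}={value}" for key, value in params.items())
    print(f"{model_name}: {settings}")
```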

2.2.3. AutoML Module

The study evaluates three prominent AutoML frameworks as candidate solutions: Auto-sklearn, AutoGluon, and H2O. Auto-sklearn enhances the scikit-learn ecosystem by incorporating automated preprocessing, model selection, and Bayesian hyperparameter optimization, utilizing meta-learning to efficiently identify high-performing traditional machine learning models (excluding deep learning architectures) [37]. AutoGluon, developed by Amazon, emphasizes user accessibility and scalability while supporting automated deep learning, tabular data modeling, and ensemble techniques through multi-model stacking [38]. H2O AutoML offers an enterprise-grade solution featuring automated training and optimization of generalized linear models, gradient boosting machines, random forests, and deep neural networks, with particular emphasis on computational scalability and model interpretability [39]. These three frameworks were selected for the benchmark based on the review results and prior studies. AutoGluon and H2O AutoML were included as they are among the most frequently adopted and well-supported frameworks in the emerging building energy AutoML literature. Auto-sklearn was chosen as a robust and widely recognized benchmark in the broader AutoML field, representing a mature approach based on the scikit-learn ecosystem [40]. Collectively, this selection provides a representative and diverse comparison across different AutoML architectures, enabling the evaluation of their performance for building energy estimation.
The inclusion of diverse AutoML frameworks serves two primary research objectives: first, to assess their predictive capability in building energy performance modeling; second, to conduct comparative analysis between automated and traditional machine learning approaches, thereby evaluating the relative performance advantages of AutoML methodologies. The three AutoML frameworks were evaluated with their recommended default or prescribed time budgets. Auto-sklearn ran with a 72 h time limit, incorporating meta-learning from 140 OpenML datasets. AutoGluon utilized default 4 h presets with multi-layer stacking enabled. H2O AutoML completed training within a 24 h window using all available CPU cores, generating a leaderboard of 20 model variants. Each framework processed identical feature sets and training data partitions to ensure comparability. These time limits are integral to their design, representing a realistic computational constraint under which their automated search for optimal models and hyperparameters operates. The use of different durations reflects the frameworks’ own recommendations for achieving robust performance, allowing each to showcase its optimization efficiency under a practical constraint.
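Since these frameworks take their time budgets in seconds, the stated hour budgets have to be converted before use. The sketch below shows one hypothetical way to organize that conversion. The keyword names are the time-limit arguments as commonly documented for each library (`time_left_for_this_task` for Auto-sklearn, `time_limit` for AutoGluon's `fit`, `max_runtime_secs` for H2O AutoML); they should be treated as assumptions and verified against the installed versions.

```python
SECONDS_PER_HOUR = 3600

# Hours are the budgets stated in the text. The keyword names are assumptions
# about each library's public API and should be checked before use.
TIME_BUDGETS = {
    "Auto-sklearn": {"hours": 72, "keyword": "time_left_for_this_task"},
    "AutoGluon": {"hours": 4, "keyword": "time_limit"},
    "H2O AutoML": {"hours": 24, "keyword": "max_runtime_secs"},
}


def budget_kwargs(framework):
    """Return the time-limit keyword argument (in seconds) for a framework."""
    entry = TIME_BUDGETS[framework]
    return {entry["keyword"]: entry["hours"] * SECONDS_PER_HOUR}


for name in TIME_BUDGETS:
    print(name, budget_kwargs(name))
```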

2.2.4. Model Evaluation Module

The predictive performance of AutoML in building energy performance estimation was evaluated using four standard regression metrics: mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and coefficient of determination (R²). This set of metrics aligns with common practice in building energy performance prediction and in AutoML studies [13,41]. R² quantifies the proportion of variance explained and serves as the primary goodness-of-fit indicator. RMSE and MAE provide complementary, interpretable measures of absolute error in the units of the target variable (kWh/m²), which is crucial for assessing the practical significance of prediction errors in an energy context. While standards like ASHRAE Guideline 14 define calibration thresholds for simulation models [42], the relative performance of data-driven models in early design is typically assessed through the comparative magnitude of these metrics. The mathematical formulations for these metrics are expressed as
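$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

where $y_i$ represents observed values, $\hat{y}_i$ denotes predicted values, $\bar{y}$ is the mean of the observations, and $n$ is the number of data points.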
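The four metrics defined above translate directly into code. The following is a minimal standard-library sketch; the sample values are hypothetical pEUI observations and predictions chosen only to exercise the formulas.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R2 for paired observations/predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    residuals = [obs - pred for obs, pred in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)        # sum of squared residuals
    mean_y = sum(y_true) / n
    ss_tot = sum((obs - mean_y) ** 2 for obs in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all observations are equal")
    mse = ss_res / n
    return {
        "MSE": mse,
        "RMSE": sqrt(mse),
        "MAE": sum(abs(r) for r in residuals) / n,
        "R2": 1 - ss_res / ss_tot,
    }


# Hypothetical values in kWh/m2, for illustration only.
observed = [100.0, 120.0, 140.0, 160.0]
predicted = [102.0, 118.0, 141.0, 157.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Note that RMSE and MAE carry the units of the target, whereas MSE is in squared units and R² is dimensionless, which is why the text treats RMSE and MAE as the more directly interpretable error measures.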
The model evaluation employs a strict hold-out validation strategy. The dataset was initially split into a stratified 90% training set and a 10% held-out test set, with the test set remaining completely untouched during all training, hyperparameter optimization, and feature selection processes. Five-fold cross-validation was performed exclusively on the 90% training set for model selection and hyperparameter tuning within the AutoML frameworks. The final performance metrics (R², RMSE, MAE) reported in Section 3.2 and used for comparison are calculated solely on the unseen 10% test set, providing an unbiased estimate of model generalization performance.
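The separation between the held-out test set and the cross-validation folds can be illustrated with index bookkeeping alone. The sketch below is a simplified assumption-laden version: it uses a plain shuffled split and omits the stratification mentioned in the text, and the sample count, seed, and function names are hypothetical.

```python
import random


def holdout_split(n_samples, test_fraction=0.1, seed=42):
    """Shuffle sample indices and return (train_indices, test_indices).

    Simplification: no stratification on the target is applied here.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold(train_indices, k=5):
    """Yield (fit_indices, validation_indices) pairs over the training set."""
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, folds[i]


train_idx, test_idx = holdout_split(100)
cv_folds = list(k_fold(train_idx, k=5))
for fit_idx, val_idx in cv_folds:
    # The test indices never appear in any fold, so the final test-set
    # metrics are computed on data unseen during model selection.
    assert set(val_idx).isdisjoint(test_idx)
print(len(train_idx), len(test_idx), [len(v) for _, v in cv_folds])
```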

2.2.5. Results Presentation Module

The final module presents the performance evaluation of AutoML frameworks alongside a comparative analysis with traditional machine learning approaches, using RMSE, MAE, and R² metrics. AutoML performance is ranked and visualized through bar plots, while side-by-side comparisons with conventional approaches highlight accuracy improvements. Statistical tests assess significance, and feature importance analysis identifies key predictors. Runtime-accuracy plots demonstrate computational trade-offs, collectively highlighting AutoML’s advantages for building energy prediction models.

2.3. AutoML with Interpretation to Support Transparent ML Decision-Making

To interpret the complex ensembles generated by AutoML, we employ post hoc explainable AI (XAI). XAI methods are broadly categorized as either intrinsic (interpretability built into the model structure) or post hoc (applied after model training) [43]. Post hoc methods are necessary here, as they can analyze pre-trained, complex models such as AutoML ensembles without requiring architectural changes. Specifically, we use SHAP, a leading post hoc method chosen for its theoretical robustness and computational efficiency with tree-based models, which dominate high-performing AutoML ensembles for tabular data. The SHAP framework quantifies feature importance through two mechanisms. Individual SHAP values represent the marginal contribution of each feature to specific predictions, while global importance is determined by ranking features according to their mean absolute SHAP values across the dataset [44]. This approach satisfies all Shapley value axioms and maintains robustness to feature correlations while providing interpretability at both instance and population levels.
We specify the use of the TreeExplainer from the SHAP library, which is computationally efficient and provides exact Shapley values for tree-based models, making it suitable for the tree-dominated ensemble from AutoGluon. To balance computational efficiency with representativeness, we calculated SHAP values using a background sample of 100 data points selected via k-means clustering from the training set. To ensure robust interpretations given feature correlations, we assessed stability by calculating SHAP values across five cross-validation folds. The mean absolute SHAP values reported represent the average across these folds, mitigating variance from correlations in any single model fit.
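To make the two mechanisms concrete (per-instance contributions and global ranking by mean absolute value), the sketch below computes exact Shapley values by brute-force enumeration of feature subsets. It is not the TreeExplainer algorithm and does not use the SHAP library; it also replaces the 100-point background sample with a single baseline point. The toy model, its coefficients, and the feature names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance via subset enumeration.

    Features outside a coalition take their baseline value. The cost grows
    exponentially with the number of features, so this suits toy cases only.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def global_importance(predict, rows, baseline):
    """Mean absolute Shapley value per feature across all rows."""
    totals = [0.0] * len(baseline)
    for row in rows:
        for i, contribution in enumerate(shapley_values(predict, row, baseline)):
            totals[i] += abs(contribution)
    return [total / len(rows) for total in totals]


def toy_predict(z):
    """Hypothetical pEUI model with an SHGC x WWR interaction term."""
    shgc, u_value, wwr = z
    return 40.0 + 30.0 * shgc + 8.0 * u_value + 20.0 * shgc * wwr


baseline = [0.4, 1.5, 0.3]          # hypothetical reference design
designs = [[0.6, 2.0, 0.5], [0.3, 1.0, 0.2]]
phi = shapley_values(toy_predict, designs[0], baseline)
importance = global_importance(toy_predict, designs, baseline)
print(phi, importance)
```

By construction, the contributions for an instance sum to the difference between its prediction and the baseline prediction, which is the efficiency axiom referred to in the text.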

2.4. Building Energy Consumption Dataset

This study utilizes the dataset developed by Elbeltagi, Wefki, Abdrabou, Dawood and Ramzy [26], which incorporates a set of design variables to simulate building energy performance in a single, fixed climate zone. This design choice allows for a controlled, fundamental comparison of model performance in learning relationships between building design parameters and energy outcomes, isolated from climatic variance. Specifically, the Cairo International Airport weather file supported by the EnergyPlus Weather (EPW) file was adopted. The parameters used for dataset generation encompass building dimensions, orientation, construction type, window-to-wall ratio, glazing properties, lighting and appliance loads, and heating/cooling temperature setpoints. While the geometric parameters, including building dimensions, orientation, and window-to-wall ratio, were modeled using Grasshopper, the remaining parameters were configured in EnergyPlus. The simulation outputs are represented by predicted Energy Use Intensity (pEUI), which quantifies site energy consumption for each design scenario based on modeled energy performance. The detailed variables within the dataset are shown in Table 2. Since the dataset contains categorical values, they were encoded using one-hot encoding to create binary features for each category, preventing the model from incorrectly interpreting ordinal relationships.
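As an illustration of the one-hot encoding step, the sketch below replaces a categorical column with one binary column per category. The records, the column names, and the category labels are hypothetical and are not taken from the dataset in Table 2.

```python
def one_hot_encode(records, column):
    """Replace a categorical column with one binary column per category."""
    categories = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            row[f"{column}_{c}"] = 1 if r[column] == c else 0
        encoded.append(row)
    return encoded


# Hypothetical design records; names and values are illustrative only.
designs = [
    {"wwr_south": 0.3, "roof_type": "A"},
    {"wwr_south": 0.5, "roof_type": "B"},
    {"wwr_south": 0.4, "roof_type": "A"},
]
encoded = one_hot_encode(designs, "roof_type")
```

Because each category receives its own 0/1 column, no numeric ordering between categories is implied.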

3. Results

3.1. Results of Review for AutoML in Building Energy Research

3.1.1. Overview

Figure 5 presents a comparative analysis of AutoML adoption trends in building energy research versus other industries, based on annual publication counts. The results reveal that AutoML gained significant traction across general industries beginning in 2015, coinciding with the rise of AI and advancements in neural networks. Since then, AutoML has emerged as a prominent research focus, as evidenced by the sustained growth in publications. In contrast, building energy research exhibited delayed adoption, with no notable AutoML-related studies appearing until 2023. Notably, only nine publications to date have explored AutoML applications for building energy performance estimation, underscoring both the untapped potential and the critical need for further development in this domain.
Figure 6 summarizes the distribution of AutoML frameworks adopted across the ten identified studies. The analysis reveals that H2O, AutoGluon, and TPOT dominate current applications, primarily due to their mature hyperparameter optimization and model selection capabilities. However, an emerging trend shows growing adoption of innovative frameworks such as FLAML, Auto-Weka, AutoKeras, and Auto-sklearn. This diverse landscape of AutoML utilization demonstrates that building energy researchers are actively exploring various automated approaches to enhance energy performance prediction.

3.1.2. Thematic Analysis

This review employed a structured analytical approach to synthesize findings from the ten identified studies. The core content of each study was extracted and summarized across five dimensions: (1) the specific building energy application area, (2) the input parameters used, (3) the output variables predicted, (4) the AutoML framework(s) adopted, and (5) the key performance results. Table 3 presents the resulting summary.
Firstly, the most prevalent application of AutoML involves heating and cooling load estimation. For example, Xiao et al. [45] developed a dual-prediction-horizon model using the AutoGluon framework to support control optimization in large central cooling systems, demonstrating a shift from pure prediction towards operational decision support. Their study incorporated operational features such as chilled water flow rate and cooling water flow rate to predict cooling demand. Similarly, Liu et al. [46] employed AutoGluon to estimate office building loads using weather-related inputs, including outdoor air temperature, humidity, rainfall, and UV index. This focus on climatic drivers highlights a common data paradigm in load forecasting. Recognizing the diversity of available AutoML tools, Zhang et al. [47] conducted a comparative analysis of four frameworks (AutoGluon, FLAML, TPOT, and H2O) for cooling load prediction using historical time-series data. Their work introduced a novel interpretable AutoML approach, achieving accuracy improvements of 4.24% to 8.79% over baseline methods, and extended their prior research comparing AutoML frameworks with an expanded feature set (e.g., cooling load, outdoor temperature, relative humidity). Such multi-framework comparisons represent an important step beyond single-framework validation. Finally, Lu, Li, Reddy Penaka and Olofsson [13] investigated heating and cooling load prediction using both the H2O and TPOT frameworks, incorporating a comprehensive set of input features. Their results demonstrated exceptional predictive performance, achieving R2 values of 0.9965 for heating load and 0.9899 for cooling load estimation.
Secondly, AutoML has been effectively applied to energy consumption and demand estimation, often integrated with broader optimization goals. For example, Sheng, Arbabi, Ward, Álvarez, and Mayfield [41] employed the Auto-sklearn framework to predict building energy consumption using architectural and geometric features, including total floor area and the number of habitable rooms. Similarly, Shi and Chen [48] leveraged the H2O framework with an integrated optimization algorithm to solve a multi-criteria optimization problem, simultaneously minimizing energy consumption and life-cycle cost while maximizing thermal comfort hours. Their model incorporated key building parameters such as window type, external wall insulation material, and insulation thickness, achieving prediction accuracies of 97.43% for energy consumption and 96.44% for thermal comfort time. This illustrates AutoML’s role not just as a predictor, but as an engine for sustainable design exploration. Finally, Biessmann, Kamble, and Streblow [21] developed an energy demand prediction model using AutoGluon, integrating historical yearly energy consumption data, building characteristics, and local weather data.
Thirdly, AutoML applications have extended to address additional building performance metrics, including end-use intensity and thermal comfort hours. Cui et al. [49] incorporated geographical location data and phase-change material (PCM) properties as model inputs, utilizing the H2O AutoML framework to predict end-use energy intensity and thermal comfort levels simultaneously. Their validation results demonstrated that the H2O framework, when combined with optimization algorithms, achieved optimal thermal comfort conditions by selecting an appropriate PCM. Figure 7 summarizes the findings from the thematic analysis.
A detailed analysis of the included studies indicates that AutoML in building energy research is indeed an emerging field. Notably, three studies published between 2023 and 2024 employed multiple AutoML frameworks to validate the effectiveness of the AutoML approach. Our work distinguishes itself in two key aspects. (1) We first conduct a review to map and confirm the nascent state of this domain systematically. (2) More importantly, our primary objective is not to validate AutoML internally, but to establish and demonstrate a generic benchmark workflow (as illustrated in Figure 4) that clearly quantifies the performance lift of AutoML relative to traditional machine learning methods. To serve this demonstrative purpose effectively within the benchmark, we have selected two of the most popular frameworks in the existing literature (AutoGluon and H2O) and one foundational framework (Auto-sklearn).
Table 3. Summary of AutoML research for building energy research.
| Ref | Year | Application Area | Input | Output | AutoML Framework | Results |
| --- | --- | --- | --- | --- | --- | --- |
| [41] | 2025 | Energy consumption estimation | Total floor area, perimeter, Relh2, NPI, Vxcount, Builtrate, number habitable rooms, number heated rooms, lighting description | Energy consumption | Auto-sklearn | R2 score of 0.828 |
| [45] | 2025 | Cooling load estimation | Cooling water flow rate, chilled water flow rate, cooling water supply temperature, cooling water return temperature, chilled water return temperature, chilled water supply temperature | Cooling power | AutoGluon | MAP is 16.2% when compared to the predicted and the real. |
| [48] | 2024 | Energy consumption estimation, thermal comfort estimation | Window type, type of external wall insulation material, thickness of external wall insulation material, type of roof insulation material, thickness of roof insulation material, type of ground insulation material, thickness of ground insulation material | Energy consumption, life cycle cost, thermal comfort hours | H2O | Accuracy reaches 97.43% and 96.44% for energy consumption prediction and thermal comfort time, respectively. |
| [46] | 2024 | Cooling load | Outdoor air temperature, humidity, rainfall, UV index | Cooling load | AutoGluon | The proposed method can achieve a coefficient of variation of less than 10% for long-term predictions on the test dataset. |
| [49] | 2024 | End-use intensity, thermal comfort hours ratio optimization | Location (external wall, lounge, office, consultation room & treatment room), phase change material (PCM) thickness | End-use intensity, thermal comfort hours ratio | H2O | Optimized thermal comfort when using PCM. |
| [47] | 2024 | Cooling load estimation | Time variables, historical cooling load | Cooling load | AutoGluon, FLAML, TPOT, H2O | The proposed approach can improve the existing AutoML accuracy between 4.24% and 8.79%. |
| [21] | 2023 | Energy demand estimation | Yearly energy consumption data, building data (size, category), and weather data | Energy demand prediction | AutoGluon | AutoML-based methods achieve lower errors when predicting energy demand after the implementation of energy efficiency measures. |
| [12] | 2023 | Heating, cooling, and electrical load estimation | Cooling load, outdoor temperature, and outdoor air relative humidity | Heating, cooling, and electrical loads | AutoWeka, H2O, TPOT, AutoGluon, FLAML, and AutoKeras | AutoML improved overall accuracy by 1.10% to 18.66%. |
| [13] | 2023 | Heating and cooling load | Relative compactness, surface area, wall area, roof area, overall height, orientation, glazing area percentage, glazing area distribution | Heating load, cooling load | H2O, TPOT | R2 of 0.9965 and RMSE of 0.602 kWh/m2 for heating load prediction, and R2 of 0.9899 and RMSE of 0.973 kWh/m2 for cooling load prediction. |
An analysis of the publication years of the included studies (see Table 1) reveals that the application of AutoML in building energy research is a very recent phenomenon, with all identified studies published in 2023 or later. While the annual publication count remains small and stable within this short timeframe, a qualitative analysis of the studies’ content reveals a steady evolution in research focus. The earliest studies (e.g., [12,13]) primarily employed frameworks such as AutoGluon and H2O to tackle fundamental prediction tasks, such as heating and cooling loads. In contrast, the more recent studies (e.g., [41,45,46,47,48,49]) demonstrate diversification in both methodology (e.g., increased use of Auto-sklearn and TPOT) and application scope, expanding into complex areas such as multi-objective optimization and occupant comfort prediction. This progression from foundational proofs of concept to more sophisticated, application-driven studies marks the initial development trajectory of this emerging subfield.

3.2. Benchmark the Feasibility of AutoML to Estimate Building Energy Performance: A Case Study

3.2.1. Data Exploration

Before formal analysis, exploratory data examination was conducted. Table 4 presents the statistical characteristics of variables included in the dataset. The features exhibit varying dimensions, with distinct mean and standard deviation values. This diversity in feature distributions may improve model generalization capability, as expanded feature ranges enable more robust prediction of building energy performance across different design configurations.
Data pre-processing involved two cleaning and preparation steps: outlier detection and removal using the interquartile range (IQR) method, in which a value is treated as an outlier if it falls below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR [50], and one-hot encoding of categorical variables. Given the relatively low dimensionality of the feature set and the benchmark objective of assessing AutoML with minimal manual intervention, explicit feature selection was not enforced as a filtering step in the final modeling pipeline. The reported Pearson correlation analysis thus provides explanatory insight into feature relationships rather than a directive for feature removal.
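The IQR rule can be sketched as follows. This is a minimal illustration under stated assumptions: the column name `peui` and the sample values are hypothetical, and the quartiles use the default method of Python's `statistics.quantiles`, which may differ slightly from the quartile convention applied in the study.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(rows, column, k=1.5):
    """Keep only rows whose value in `column` lies within the IQR fences."""
    lower, upper = iqr_bounds([r[column] for r in rows], k)
    return [r for r in rows if lower <= r[column] <= upper]


# Hypothetical pEUI values (kWh/m2); the last one is an obvious outlier.
rows = [{"peui": v} for v in [110, 115, 120, 118, 122, 117, 119, 400]]
cleaned = remove_outliers(rows, "peui")
```

Applied per numeric column, the same function would drop rows that are extreme in that column while leaving the remaining design scenarios untouched.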
Pearson correlation analysis reveals generally weak linear correlations (|r| < 0.3) among most building parameter variables in the dataset. These results suggest that simple linear modeling approaches may offer limited predictive performance in forecasting building energy consumption. To effectively capture complex, non-linear relationships between building envelope material properties (such as U-value and SHGC) and geometric features (such as WWR and orientation), non-linear machine learning models are typically employed. This modeling approach enables a more robust characterization of higher-order interactions among design parameters, thereby enhancing the capability to predict building energy consumption.
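A short sketch of the Pearson coefficient helps to show why a weak linear correlation does not rule out a strong non-linear dependence. The two toy series below are hypothetical and unrelated to the dataset: one relationship is exactly linear, while the other is exactly quadratic yet yields a coefficient of zero.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [2.0 * v + 1.0 for v in x]   # perfectly linear in x
quadratic = [v * v for v in x]        # fully determined by x, but not linearly

r_linear = pearson_r(x, linear)
r_quadratic = pearson_r(x, quadratic)
```

Here the quadratic series is a deterministic function of `x`, which is the kind of structure that tree-based and other non-linear learners can exploit even when |r| is small.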

3.2.2. AutoML Performance

H2O AutoML results are presented in Table 5. The model labeled “StackedEnsemble_AllModels_1_AutoML_1_20250716_112613” achieved the highest R2 score among all models evaluated in the H2O framework. The model’s low RMSE and MAE values further demonstrate its strong predictive capability for building energy performance. This optimal model represents a stacked ensemble that strategically combines predictions from all base machine learning models to enhance overall performance. Notably, the alternative model “StackedEnsemble_BestOfFamily_1_AutoML_1_20250716_112613” also exhibited competitive performance, attaining an R2 score of 0.981. This secondary ensemble model employs a selective approach by incorporating only the top-performing model from each algorithm family, thereby optimizing accuracy through focused model combination.
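The difference between the two ensembles lies in which base models are passed to the stacking step. The sketch below illustrates only the "best of family" selection, that is, keeping the top-scoring model from each algorithm family. The leaderboard entries, family labels, and scores are made-up toy values and are not the results in Table 5, and the subsequent metalearner training is not shown.

```python
def best_of_family(leaderboard):
    """Keep the highest-scoring model from each algorithm family."""
    best = {}
    for model in leaderboard:
        family = model["family"]
        if family not in best or model["r2"] > best[family]["r2"]:
            best[family] = model
    return sorted(best.values(), key=lambda m: m["r2"], reverse=True)


# Hypothetical leaderboard; names and scores are illustrative only.
leaderboard = [
    {"name": "gbm_1", "family": "GBM", "r2": 0.97},
    {"name": "gbm_2", "family": "GBM", "r2": 0.98},
    {"name": "drf_1", "family": "DRF", "r2": 0.95},
    {"name": "glm_1", "family": "GLM", "r2": 0.60},
]

all_models = [m["name"] for m in leaderboard]             # "all models" variant
selected = [m["name"] for m in best_of_family(leaderboard)]  # "best of family" variant
```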
As illustrated in Table 6, the model ExtraTreesMSE_BAG_L2 achieved the highest R2 score on the test dataset while also demonstrating strong performance on the validation dataset. Notably, ten models within the AutoGluon framework attained R2 values exceeding 0.990, further underscoring its capability to identify high-performing models. These results suggest that AutoGluon delivers superior model identification and performance metrics compared to other AutoML frameworks.
As demonstrated in Table 7, the gradient boosting model implemented within the Auto-sklearn framework achieved superior predictive accuracy, attaining an R2 score of 0.9315. This performance highlights the framework’s effectiveness in automatically selecting and optimizing machine learning models for the given prediction task.

3.2.3. Comparison of Model Predictions

Table 8 presents the predictive performance metrics of the AutoML model compared to other baseline ML models. The results indicate that AutoML achieves superior accuracy relative to the baseline models. Specifically, AutoGluon attained the highest R2 value (0.993), followed by the H2O framework (0.981). Among traditional ML models, XGBoost and LightGBM also demonstrated high accuracy. This outcome is expected, given the similarities between AutoML and ensemble learning methods, and the strong performance of AutoML underscores the effectiveness of automated hyperparameter tuning and model architecture search in enhancing predictive accuracy.
Furthermore, AutoML achieves lower RMSE and MAE than the baseline models, indicating lower prediction errors. These findings suggest that the ensemble-based AutoML approach effectively mitigates bias and variance in the base models, thereby improving both accuracy and reliability in building energy performance prediction. Consequently, AutoML emerges as a highly promising tool for predictive modeling applications.
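For reference, the four metrics used in the comparison can be computed as in the following sketch. The actual and predicted pEUI values are hypothetical toy numbers and do not reproduce any entry of Table 8.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return R2, RMSE, MSE, and MAE for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": sqrt(mse),
        "MSE": mse,
        "MAE": sum(abs(e) for e in errors) / n,
    }


# Hypothetical pEUI values in kWh/m2.
actual = [100.0, 120.0, 140.0, 160.0]
predicted = [102.0, 118.0, 141.0, 159.0]
metrics = regression_metrics(actual, predicted)
```

RMSE and MAE are expressed in the units of the target (kWh/m2), whereas R2 is dimensionless, which is why the two kinds of metric are reported side by side.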
Figure 8 presents scatter plots comparing predicted versus actual pEUI values generated by both AutoML and traditional machine learning models. Visual assessment reveals superior predictive performance of the AutoML model, evidenced by (1) closer alignment of data points with the y = x ideal prediction line and (2) fewer outliers exceeding the 20% error threshold compared to conventional ML approaches. Quantitative analysis confirms that the AutoML model maintains this enhanced accuracy across the full range of pEUI values, demonstrating robust performance in building energy performance prediction.
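The share of points falling outside the 20% error band can be quantified with a few lines of code. The sketch assumes the band is defined relative to the actual value, which is one common reading of such a threshold, and the sample values are hypothetical.

```python
def share_outside_band(actual, predicted, tolerance=0.20):
    """Fraction of predictions whose relative error exceeds `tolerance`."""
    outside = sum(
        1 for a, p in zip(actual, predicted) if abs(p - a) > tolerance * abs(a)
    )
    return outside / len(actual)


# Hypothetical pEUI values in kWh/m2.
actual = [100.0, 200.0, 50.0, 80.0]
predicted = [110.0, 150.0, 52.0, 100.0]
share = share_outside_band(actual, predicted)
```

Comparing this share between models gives a numeric counterpart to the visual impression of how tightly the points cluster around the y = x line.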

3.3. Explanation of the Prediction Based on AutoML-SHAP

3.3.1. Comparative Analysis of AutoML-SHAP and Feature Importance Analysis

To contextualize the SHAP interpretation from the best-performing AutoML model (AutoGluon) and address whether automated modeling alters the perceived influence of input variables, we compared it with a model-agnostic feature importance metric derived directly from the dataset. We employed a simple random forest on the raw dataset to establish a baseline understanding of variable influence grounded in the data’s intrinsic structure, referring to [51].
The comparative results, visualized in Figure 9, reveal both consistency and instructive divergence. Key parameters such as SHGC and SOG were consistently ranked as highly influential by both methods, confirming their fundamental role in determining energy performance. However, notable shifts in ranking were observed. For instance, while the feature importance plot ranked roof type (categorical value) and north (WWR direction) higher, the SHAP analysis from AutoGluon assigned significantly greater importance to U-value and Wall construction type.
We posit two plausible explanations for this divergence. Firstly, AutoML’s automated hyperparameter optimization may have configured the ensemble model to assign differential weights to features based on complex, non-linear interactions that simpler importance metrics fail to capture. Secondly, and more importantly, the superior performance of AutoGluon suggests it may have learned higher-order interaction effects between envelope properties (e.g., the combined impact of Wall type and U-value), which SHAP inherently accounts for in its marginal contribution calculations. In contrast, the baseline method evaluates features in greater isolation. This comparison underscores that while raw data correlations provide an initial guide, advanced ML models can uncover and leverage more sophisticated, interaction-driven relationships to improve predictions, thereby refining our understanding of influential design parameters.

3.3.2. Dependency Plot for the Best-Performing AutoML Model

In addition to the global SHAP plot, Figure 10 illustrates the dependency plot for the top important features, namely, WWR (Window-to-Wall Ratio) by façade, U-value, and SHGC values. Observing the general trends across the plots provides insight into how various physical properties affect energy use. For instance, in the “South” plot, as the South WWR increases, the SHAP values generally become more negative, suggesting that a greater window area on the south façade (likely leading to more passive solar heating) tends to reduce overall energy consumption. Conversely, the “UValue” plot shows a clear positive correlation; higher U-values, which indicate poor insulation, are associated with higher positive SHAP values and thus greater energy use. The color coding often reinforces these observations, showing that variations in the SHGC modify the strength of these primary relationships.
The SHAP analysis of building energy features provides clear direction for energy-efficient architectural design, primarily emphasizing passive strategies and envelope performance. Maximizing south-facing WWR and optimizing the distribution of window areas by façade can significantly lower energy demand by leveraging passive solar gains. Furthermore, the strong link between U-value and consumption underscores the need for superior insulation and airtight construction to minimize heat transfer. Careful solar heat gain management is also crucial, requiring the selection of appropriate glazing or the integration of shading systems to balance daylighting and heat control effectively across all facades, especially those with high east- and west-facing WWRs.

4. Discussion

4.1. Current Findings

The pressing need for energy-efficient buildings has increased demand for accessible, data-driven performance prediction tools in early design stages [52]. While AI has enabled alternatives to time-consuming physics-based simulations, conventional machine learning (ML) still requires substantial computational expertise for feature engineering and model tuning [9,53]. AutoML offers a solution by automating the ML pipeline [20], yet its adoption and benchmarking in building energy research remain unclear. To address these gaps, this study pursued three objectives: reviewing existing applications, benchmarking frameworks, and interpreting influential features.
To achieve the first research objective, a literature review was conducted in Scopus using keywords related to AutoML and building energy. Analysis of the nine identified articles, compared with general AutoML publications, revealed that AutoML applications in building energy remain at a preliminary stage and fall into three areas: (1) heating and cooling load prediction, (2) energy consumption and demand forecasting, and (3) other performance metrics. Figure 6 presents the distribution of AutoML frameworks used across these studies.
The second objective was addressed by benchmarking three baseline ML models (linear regression, random forest, naïve Bayes), three ensemble learning methods (XGBoost, LightGBM, AdaBoost), and three AutoML frameworks (Auto-sklearn, AutoGluon, H2O) using the building energy dataset developed by Elbeltagi, Wefki, Abdrabou, Dawood and Ramzy [26]. Comparative analysis demonstrated AutoGluon’s superior predictive accuracy (R2 = 0.993, RMSE = 2.280 kWh/m2, MAE = 1.116 kWh/m2), followed by other ensemble methods and traditional ML models. Scatter plots visually confirmed the performance advantages of AutoML.
Finally, the third research objective was addressed through SHAP analysis. Specifically, SHAP analysis of AutoGluon’s best model (Figure 9) identified Solar Heat Gain Coefficient (SHGC), floor area (SOG), U-values, and roof characteristics as the most influential features affecting building energy performance predictions.

4.2. Research Implications

Our study advances the theoretical integration of automated machine learning within building performance simulation. First, we demonstrate that AutoML frameworks, particularly AutoGluon, can autonomously discover and optimize model architectures that rival or exceed the performance of expertly tuned traditional ML models. This validates the theoretical proposition that automated pipeline construction can effectively navigate the vast hyperparameter and model selection space inherent to building energy prediction, a domain characterized by high-dimensional, non-linear relationships.
For practitioners, this research provides actionable guidance for implementing data-driven energy prediction. The benchmark results offer a clear, evidence-based recommendation; AutoGluon is the preferred out-of-the-box solution for achieving high prediction accuracy with minimal configuration effort, making advanced ML accessible to architects and designers without data science expertise. The SHAP analysis translates model outputs into direct design intelligence; for instance, the quantified relationship showing that reducing SHGC from 0.8 to 0.4 decreases predicted EUI by 15–20 kWh/m2 provides a concrete performance target for envelope specification. This moves beyond generic “feature importance” to deliver specific, magnitude-based insights that can directly inform material selection and façade design decisions during early-stage workshops.
The primary end users envisioned for this AutoML-based approach are building designers, architects, and sustainability consultants involved in the early design stages. These professionals possess deep domain knowledge in building physics and design parameters (e.g., U-values, SHGC, window-to-wall ratios). Still, they may lack specialized expertise in machine learning or computational optimization. Therefore, our explainability methodology has been customized for this audience in two key ways. (1) Instead of presenting raw SHAP values, we translate them into quantitative, design-relevant impacts. For instance, we report that ‘reducing SHGC from 0.8 to 0.4 is associated with a pEUI decrease of approximately 15–20 kWh/m2’ (Section 3.3). This directly speaks the language of performance benchmarking familiar to designers. (2) The SHAP analysis prioritizes and interprets the influence of actionable early-design variables (e.g., envelope properties, orientation, glazing ratios) over abstract or latent features. The dependence plots (e.g., SHGC vs. U-Value interaction) are presented to reveal trade-offs and synergies between these familiar parameters, providing intuitive guidance for design trade-off decisions.

4.3. Limitations and Future Directions

The predictive models in this study are trained on data generated from a simplified, early-design-stage energy simulation model. While this is appropriate for the conceptual phase, the underlying dataset assumes idealized HVAC system behavior and is based on a single climatic context. Consequently, the predicted pEUI values and the identified parameter importance (e.g., SHGC) reflect performance under these simplified, climate-specific conditions. This abstraction is a necessary trade-off for early-stage analysis. Still, it implies that the models should be validated or retrained with operational data or more detailed simulation outputs across diverse climates before being applied to detailed system sizing or operational optimization.
In addition, while this study established a performance hierarchy using standard point-estimate metrics (R2, RMSE, MAE), a more comprehensive diagnostic analysis of model errors was beyond its comparative scope. Future applied studies focusing on the deployment of a specific model, notably the best-performing AutoGluon ensemble, would benefit from more in-depth analyses. These include formally examining residual distributions for normality, conducting heteroscedasticity tests to validate error consistency, and performing a detailed breakdown of ‘difficult’ prediction cases to identify systematic failure modes or regions in the design space where model confidence should be calibrated.
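As a starting point for such diagnostics, the sketch below computes three simple residual summaries: the mean residual (bias), the sample skewness (a rough symmetry check), and the correlation between absolute residuals and predictions (a crude indicator of heteroscedasticity). These are screening statistics under our own assumptions, not formal hypothesis tests, and the sample values are hypothetical.

```python
from math import sqrt


def skewness(values):
    """Moment-based sample skewness; 0 for a symmetric sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def residual_diagnostics(actual, predicted):
    """Summarize bias, asymmetry, and error growth with prediction size."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return {
        "mean_residual": sum(residuals) / len(residuals),
        "skewness": skewness(residuals),
        "abs_resid_vs_pred_r": pearson_r(predicted, [abs(r) for r in residuals]),
    }


# Hypothetical values in which the error magnitude grows with the prediction.
predicted = [100.0, 120.0, 140.0, 160.0, 180.0]
actual = [101.0, 118.0, 143.0, 156.0, 185.0]
diag = residual_diagnostics(actual, predicted)
```

A correlation near zero in the last entry would be consistent with homogeneous errors, whereas a value close to one, as in this toy example, would flag a region of the design space where predictions are less reliable.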
Moreover, this study’s primary benchmark focused on whether AutoML can match or surpass the predictive accuracy of carefully configured traditional ML models for the target task, a question our case study results answered affirmatively. As noted in the results, several models achieved high accuracy (R2 ≥ 0.95). However, in practical deployment, selecting one high-accuracy model over another involves trade-offs beyond marginal gains in R2 or RMSE. Factors such as the computational resources (CPU/GPU) consumed, the total wall-clock time required to produce a deployable model (including training and tuning), and the degree of human expert intervention throughout the process are critical for adoption in real-world settings [16,54]. Our current benchmarking did not perform a systematic comparative evaluation of these practical operational costs. Future work should therefore extend the benchmark to include a multi-criteria assessment framework that evaluates leading models across this broader spectrum of accuracy, efficiency, and usability metrics, providing practitioners with a more holistic guide for tool selection.
Finally, while SHAP provides powerful, consistent explanations for model predictions, several inherent limitations should be acknowledged. First, SHAP explanations are model-specific; the feature importance (e.g., SHGC) identified is contingent on the particular AutoGluon ensemble we trained. A different model architecture might yield a slightly different ranking, although the dominant physical drivers are likely to remain the same. Second, SHAP explains the correlations learned by the model from the data, rather than fundamental physical causality. While the strong influence of parameters like U-value aligns with building physics, the explanations do not guarantee a causal relationship in the absence of a controlled experiment. Third, our analysis utilized a global background sample for computational efficiency, which can sometimes obscure local prediction nuances. Finally, future work could extend the explainability analysis by performing a comparative SHAP study across all major model types identified in the benchmark (e.g., tree-based ensembles, neural networks, and linear models). This would provide a comprehensive understanding of how different algorithms interpret the feature space and could reveal algorithm-specific biases or insights.

5. Conclusions

This study established a performance benchmark for AutoML in early-stage building energy prediction. Among the evaluated frameworks, AutoGluon demonstrated superior accuracy, achieving an R2 of 0.993 and an RMSE of 2.280 kWh/m2. H2O AutoML also delivered strong performance (R2 = 0.982), while LightGBM (R2 = 0.973) represented the best-performing traditional ensemble method. This hierarchy highlights a practical trade-off; AutoML frameworks provide high accuracy with minimal manual tuning, thereby democratizing access to advanced prediction, whereas manually optimized models, such as LightGBM, offer high performance with greater transparency for experts willing to invest time in tuning.
The integrated SHAP analysis translated predictions into actionable design intelligence, quantitatively confirming the dominant influence of Solar Heat Gain Coefficient (SHGC) and U-values on energy performance. This coupling of automated model development with automated interpretation provides a practical workflow, making sophisticated analytics both accessible and transparent for design professionals.
The primary contribution of this work is an evidence-based foundation for adopting AutoML in architectural practice. Future work should focus on validating this pipeline across diverse climates and integrating it into parametric design environments to enable real-time, insight-driven design iteration.

Author Contributions

Conceptualization, Z.T. and J.C. (Jiayu Cheng); methodology, Z.T.; software, J.C. (Jinyu Chen); validation, Z.T., J.C. (Jinyu Chen) and J.C. (Jiayu Cheng); formal analysis, Z.T.; investigation, Z.T. and J.C. (Jiayu Cheng); resources, J.C. (Jinyu Chen); data curation, Z.T.; writing—original draft preparation, Z.T.; writing—review and editing, Z.T., J.C. (Jinyu Chen) and J.C. (Jiayu Cheng); visualization, J.C. (Jinyu Chen); supervision, J.C. (Jiayu Cheng); project administration, J.C. (Jiayu Cheng). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

Author Zuyi Tang was employed by the company Guangdong Architectural Design and Research Institute Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Pérez-Lombard, L.; Ortiz, J.; Pout, C. A review on buildings energy consumption information. Energy Build. 2008, 40, 394–398. [Google Scholar] [CrossRef]
  2. Santamouris, M.; Vasilakopoulou, K. Present and future energy consumption of buildings: Challenges and opportunities towards decarbonisation. e-Prime-Adv. Electr. Eng. Electron. Energy 2021, 1, 100002. [Google Scholar] [CrossRef]
  3. Pan, Y.; Zhang, L. Data-driven estimation of building energy consumption with multi-source heterogeneous data. Appl. Energy 2020, 268, 114965. [Google Scholar] [CrossRef]
  4. EIA. Global Energy Consumption Driven by More Electricity in Residential, Commercial Buildings; Energy Information Administration: Washington, DC, USA, 2019. [Google Scholar]
  5. Piracha, A.; Chaudhary, M.T. Urban Air Pollution, Urban Heat Island and Human Health: A Review of the Literature. Sustainability 2022, 14, 9234. [Google Scholar] [CrossRef]
  6. Dalla Mora, T.; Peron, F.; Romagnoni, P.; Almeida, M.; Ferreira, M. Tools and procedures to support decision making for cost-effective energy and carbon emissions optimization in building renovation. Energy Build. 2018, 167, 200–215. [Google Scholar] [CrossRef]
  7. Dubin, F.S. Energy-efficient building design: Innovative HVAC, lighting, energy-management control, and fenestration. Appl. Energy 1990, 36, 11–20. [Google Scholar] [CrossRef]
  8. Simpeh, E.K.; Pillay, J.-P.G.; Ndihokubwayo, R.; Nalumu, D.J. Improving energy efficiency of HVAC systems in buildings: A review of best practices. Int. J. Build. Pathol. Adapt. 2022, 40, 165–182. [Google Scholar] [CrossRef]
  9. Chen, Y.; Guo, M.; Chen, Z.; Chen, Z.; Ji, Y. Physical energy and data-driven models in building energy prediction: A review. Energy Rep. 2022, 8, 2656–2671. [Google Scholar] [CrossRef]
  10. Oh, K.; Kim, E.-J.; Park, C.-Y. A Physical Model-Based Data-Driven Approach to Overcome Data Scarcity and Predict Building Energy Consumption. Sustainability 2022, 14, 9464. [Google Scholar] [CrossRef]
  11. Mugnini, A.; Coccia, G.; Polonara, F.; Arteconi, A. Performance Assessment of Data-Driven and Physical-Based Models to Predict Building Energy Demand in Model Predictive Controls. Energies 2020, 13, 3125. [Google Scholar] [CrossRef]
  12. Zhang, C.; Tian, X.; Zhao, Y.; Lu, J. Automated machine learning-based building energy load prediction method. J. Build. Eng. 2023, 80, 108071. [Google Scholar] [CrossRef]
  13. Lu, C.; Li, S.; Reddy Penaka, S.; Olofsson, T. Automated machine learning-based framework of heating and cooling load prediction for quick residential building design. Energy 2023, 274, 127334. [Google Scholar] [CrossRef]
  14. Quan, S.J. Comparing hyperparameter tuning methods in machine learning based urban building energy modeling: A study in Chicago. Energy Build. 2024, 317, 114353. [Google Scholar] [CrossRef]
  15. Villano, F.; Mauro, G.M.; Pedace, A. A Review on Machine/Deep Learning Techniques Applied to Building Energy Simulation, Optimization and Management. Thermo 2024, 4, 100–139. [Google Scholar] [CrossRef]
  16. Karmaker, S.K.; Hassan, M.M.; Smith, M.J.; Xu, L.; Zhai, C.; Veeramachaneni, K. AutoML to Date and Beyond: Challenges and Opportunities. ACM Comput. Surv. 2021, 54, 175. [Google Scholar] [CrossRef]
  17. Bahri, M.; Salutari, F.; Putina, A.; Sozio, M. AutoML: State of the art with a focus on anomaly detection, challenges, and research directions. Int. J. Data Sci. Anal. 2022, 14, 113–126. [Google Scholar] [CrossRef]
  18. Yuan, H.; Yu, K.; Xie, F.; Liu, M.; Sun, S. Automated machine learning with interpretation: A systematic review of methodologies and applications in healthcare. Med. Adv. 2024, 2, 205–237. [Google Scholar] [CrossRef]
  19. Truong, A.; Walters, A.; Goodsitt, J.; Hines, K.; Bruss, C.B.; Farivar, R. Towards Automated Machine Learning: Evaluation and Comparison of AutoML Approaches and Tools. In Proceedings of the 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), Portland, OR, USA, 4–6 November 2019; pp. 1471–1479. [Google Scholar]
  20. He, X.; Zhao, K.; Chu, X. AutoML: A survey of the state-of-the-art. Knowl.-Based Syst. 2021, 212, 106622. [Google Scholar] [CrossRef]
  21. Biessmann, F.; Kamble, B.; Streblow, R. An Automated Machine Learning Approach towards Energy Saving Estimates in Public Buildings. Energies 2023, 16, 6799. [Google Scholar] [CrossRef]
  22. Alkhulaifi, N.; Bowler, A.L.; Pekaslan, D.; Triguero, I.; Watson, N.J. Exploring Automated Feature Engineering for Energy Consumption Forecasting with AutoML. In Proceedings of the 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Kuching, Malaysia, 6–10 October 2024; pp. 2993–2998. [Google Scholar]
  23. Li, D.; Qi, Z.; Zhou, Y.; Elchalakani, M. Machine Learning Applications in Building Energy Systems: Review and Prospects. Buildings 2025, 15, 648. [Google Scholar] [CrossRef]
  24. Khalil, M.; McGough, A.S.; Pourmirza, Z.; Pazhoohesh, M.; Walker, S. Machine Learning, Deep Learning and Statistical Analysis for forecasting building energy consumption—A systematic review. Eng. Appl. Artif. Intell. 2022, 115, 105287. [Google Scholar] [CrossRef]
  25. Michailidis, P.; Michailidis, I.; Minelli, F.; Coban, H.H.; Kosmatopoulos, E. Model Predictive Control for Smart Buildings: Applications and Innovations in Energy Management. Buildings 2025, 15, 3298. [Google Scholar] [CrossRef]
  26. Elbeltagi, E.; Wefki, H.; Abdrabou, S.; Dawood, M.; Ramzy, A. Visualized strategy for predicting buildings energy consumption during early design stage using parametric analysis. J. Build. Eng. 2017, 13, 127–136. [Google Scholar] [CrossRef]
  27. Chen, H.; Chan, I.Y.; Dong, Z.; Samuel, T.A. Unraveling Trust in Collaborative Human-Machine Intelligence from Neurophysiological Perspective: A Review of EEG and fNIRS Features. Adv. Eng. Inf. 2025, 67, 103555. [Google Scholar] [CrossRef]
  28. Lim, Z.Q.; Shah, K.W.; Gupta, M. Autonomous Mobile Robots Inclusive Building Design for Facilities Management: Comprehensive PRISMA Review. Buildings 2024, 14, 3615. [Google Scholar] [CrossRef]
  29. Chen, H.; Dong, Z.; Chan Isabelle, Y.S. Biometric Evaluation and Immersive Construction Environments: A Research Overview of the Current Landscape, Challenges, and Future Prospects. J. Constr. Eng. Manag. 2025, 151, 03125005. [Google Scholar] [CrossRef]
  30. Popat, S.; Starkey, L. Learning to code or coding to learn? A systematic review. Comput. Educ. 2019, 128, 365–376. [Google Scholar] [CrossRef]
  31. Bower, I.; Tucker, R.; Enticott, P.G. Impact of built environment design on emotion measured via neurophysiological correlates and subjective indicators: A systematic review. J. Environ. Psychol. 2019, 66, 101344. [Google Scholar] [CrossRef]
  32. Gray, R. Why do all systematic reviews have fifteen studies? Nurse Author Ed. 2020, 30, 27–29. [Google Scholar] [CrossRef]
  33. Wang, X.; Edison, H.; Khanna, D.; Rafiq, U. How Many Papers Should You Review? A Research Synthesis of Systematic Literature Reviews in Software Engineering. In Proceedings of the 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), New Orleans, LA, USA, 26–27 October 2023; pp. 1–6. [Google Scholar]
  34. Bassi, A.; Shenoy, A.; Sharma, A.; Sigurdson, H.; Glossop, C.; Chan, J.H. Building Energy Consumption Forecasting: A Comparison of Gradient Boosting Models. In Proceedings of the 12th International Conference on Advances in Information Technology, Bangkok, Thailand, 29 June–1 July 2021; p. 27. [Google Scholar]
  35. Dai, Z.; Huang, W. Improving energy management practices through accurate building energy consumption prediction: Analyzing the performance of LightGBM, RF, and XGBoost models with advanced optimization strategies. Electr. Eng. 2025, 107, 12583–12605. [Google Scholar] [CrossRef]
  36. Yoon, H.I.; Lee, H.; Yang, J.-S.; Choi, J.-H.; Jung, D.-H.; Park, Y.J.; Park, J.-E.; Kim, S.M.; Park, S.H. Predicting Models for Plant Metabolites Based on PLSR, AdaBoost, XGBoost, and LightGBM Algorithms Using Hyperspectral Imaging of Brassica juncea. Agriculture 2023, 13, 1477. [Google Scholar] [CrossRef]
  37. Feurer, M.; Eggensperger, K.; Falkner, S.; Lindauer, M.; Hutter, F. Auto-sklearn 2.0: The next generation. arXiv 2020, arXiv:2007.04074. [Google Scholar]
  38. Erickson, N.; Mueller, J.; Shirkov, A.; Zhang, H.; Larroy, P.; Li, M.; Smola, A. Autogluon-tabular: Robust and accurate automl for structured data. arXiv 2020, arXiv:2003.06505. [Google Scholar]
  39. LeDell, E.; Poirier, S. H2O automl: Scalable automatic machine learning. In Proceedings of the AutoML Workshop at ICML, Virtual, 18 July 2020; p. 24. [Google Scholar]
  40. Hutter, F.; Kotthoff, L.; Vanschoren, J. Automated Machine Learning: Methods, Systems, Challenges; Springer Nature: Berlin/Heidelberg, Germany, 2019. [Google Scholar]
  41. Sheng, Y.; Arbabi, H.; Ward, W.O.C.; Álvarez, M.A.; Mayfield, M. City-scale residential energy consumption prediction with a multimodal approach. Sci. Rep. 2025, 15, 5313. [Google Scholar] [CrossRef]
  42. Garrett, A.; New, J.R. Suitability of ASHRAE Guideline 14 Metrics for Calibration; Oak Ridge National Laboratory (ORNL): Oak Ridge, TN, USA, 2015. [Google Scholar]
  43. Malinverno, L.; Barros, V.; Ghisoni, F.; Visonà, G.; Kern, R.; Nickel, P.J.; Ventura, B.E.; Šimić, I.; Stryeck, S.; Manni, F. A historical perspective of biomedical explainable AI research. Patterns 2023, 4, 100830. [Google Scholar] [CrossRef]
  44. Wang, H.; Liang, Q.; Hancock, J.T.; Khoshgoftaar, T.M. Feature selection strategies: A comparative analysis of SHAP-value and importance-based methods. J. Big Data 2024, 11, 44. [Google Scholar] [CrossRef]
  45. Xiao, Z.; Zhang, J.; Xiao, F.; Chen, Z.; Xu, K.; So, P.M.; Lau, K.T. An AI-enabled optimal control strategy utilizing dual-horizon load predictions for large building cooling systems and its cloud-based implementation. Energy Build. 2025, 330, 115352. [Google Scholar] [CrossRef]
  46. Liu, Y.; Zhao, X.; Qin, S.J. Dynamically engineered multi-modal feature learning for predictions of office building cooling loads. Appl. Energy 2024, 355, 122183. [Google Scholar] [CrossRef]
  47. Zhang, C.; Lu, J.; Huang, J.; Zhao, Y. End-to-end data-driven modeling framework for automated and trustworthy short-term building energy load forecasting. Build. Simul. 2024, 17, 1419–1437. [Google Scholar] [CrossRef]
  48. Shi, Y.; Chen, P. Energy retrofitting of hospital buildings considering climate change: An approach integrating automated machine learning with NSGA-III for multi-objective optimization. Energy Build. 2024, 319, 114571. [Google Scholar] [CrossRef]
  49. Cui, H.; Zhang, L.; Yang, H.; Shi, Y. Optimizing thermal comfort and energy efficiency in hospitals with PCM-Enhanced wall systems. Energy Build. 2024, 323, 114740. [Google Scholar] [CrossRef]
  50. Alshammari, T.; Ramadan, R.A.; Ahmad, A. Temporal Variations Dataset for Indoor Environmental Parameters in Northern Saudi Arabia. Appl. Sci. 2023, 13, 7326. [Google Scholar] [CrossRef]
  51. Zucchini, N.; Capozzella, E.; Giuffrè, M.; Mastronardi, M.; Casagranda, B.; Crocè, S.L.; de Manzini, N.; Palmisano, S. Advanced Non-linear Modeling and Explainable Artificial Intelligence Techniques for Predicting 30-Day Complications in Bariatric Surgery: A Single-Center Study. Obes. Surg. 2024, 34, 3627–3638. [Google Scholar] [CrossRef]
  52. Mörth, M.; Heinz, A.; Heimrath, R.; Edtmayer, H.; Mach, T.; Kaisermayer, V.; Gölles, M.; Hochenauer, C. Grey-Box Model for Efficient Building Simulations: A Case Study of an Integrated Water-Based Heating and Cooling System. Buildings 2025, 15, 1959. [Google Scholar] [CrossRef]
  53. Shirzadi, N.; Lau, D.; Stylianou, M. Surrogate Modeling for Building Design: Energy and Cost Prediction Compared to Simulation-Based Methods. Buildings 2025, 15, 2361. [Google Scholar] [CrossRef]
  54. Cruz, G.G.; Kor, A.-L.; Jawad, N.; Georges, J.-P. Exploring the Costs of Automation: A Comparative Study on the Energy Consumption and Performance of Open-Source Automated Machine and Deep Learning Libraries. In Proceedings of the International Sustainable Ecological Engineering Design for Society (SEEDS) Conference 2025, Loughborough, UK, 3–5 September 2025. [Google Scholar]
Figure 1. Comparison between traditional ML and AutoML.
Figure 2. Overall workflow of the current study.
Figure 3. Overall steps following PRISMA principles.
Figure 4. Overall workflow for benchmark validation of AutoML for building energy performance prediction.
Figure 5. Comparison between AutoML in general and AutoML in building energy.
Figure 6. Distribution of the AutoML framework used in building energy research.
Figure 7. Summary of AutoML-driven workflow for building energy performance estimation.
Figure 8. Comparative analysis of different ML models for building energy performance estimation.
Figure 9. SHAP vs. feature importance plot.
Figure 10. SHAP dependence plot.
Table 1. Comparison between physical and data-driven modeling.

| Criteria | Physical-Driven Modeling | Data-Driven Modeling |
| --- | --- | --- |
| Nature of analysis | "white-box" | "black-box" |
| Knowledge | Intensive | Less intensive |
| Setting time | High | Low |
| Analysis approach | Use of simulation software, such as EnergyPlus and DesignBuilder | Machine learning, deep learning |
Table 2. List of variables within the dataset.

| ID | Variables | Types | Explanations | Value Range |
| --- | --- | --- | --- | --- |
| 1 | Walls | Categorical | Represents the type of wall construction. Four categories: Brick, aluminum plate, glass curtain, and steel curtain. | [‘Single wall’, ‘Double wall’, ‘Double red brick wall with air gap’] |
| 2 | Roof | Categorical | Represents the type of roof construction. Two categories: type 1 and type 2. | [‘Floor slab 15 cm’, ‘Floor slab 20 cm’] |
| 3 | SOG | Categorical | Represents the type of slab-on-grade construction. Two categories: type 1 and type 2. | [‘Ground Floor 15 cm’, ‘Ground Floor 20 cm’] |
| 4 | Length | Numerical (number) | Represents the building shape dimension. | Min. 10.0 m, Max. 30.0 m |
| 5 | Depth | Numerical (number) | Represents the building shape dimension. | Min. 10.0 m, Max. 30.0 m |
| 6 | Height | Numerical (number) | Represents the building shape dimension. | Min. 3.0 m, Max. 15.0 m |
| 7 | Orientation | Numerical (degree) | Represents the building orientation angle. | Min. 0°, Max. 360° |
| 8 | South | Numerical (percentage) | Represents the window-to-wall ratio in the south direction. | Min. 0%, Max. 80% |
| 9 | East | Numerical (percentage) | Represents the window-to-wall ratio in the east direction. | Min. 0%, Max. 80% |
| 10 | North | Numerical (percentage) | Represents the window-to-wall ratio in the north direction. | Min. 0%, Max. 80% |
| 11 | West | Numerical (percentage) | Represents the window-to-wall ratio in the west direction. | Min. 0%, Max. 80% |
| 12 | U-Value | Numerical (number) | Glass U-values. | Min. 0, Max. 1.2 (Default) |
| 13 | SHGC | Numerical (number) | Glass Solar Heat Gain Coefficient. | Min. 0, Max. 1.0 |
| 14 | VT | Numerical (number) | Glass visual transmittance. | Min. 0, Max. 1.0 |
| 15 | Heating_SP | Numerical (number) | Heating set points. | Min. 18°, Max. 28° |
| 16 | Cooling_SP | Numerical (number) | Cooling set points. | Min. 18°, Max. 26° |
| 17 | pEUI | Numerical (number) | Energy use for the project is based on modeled site energy, which represents the summation of heating, cooling, lighting, and equipment energy consumed in one year, measured in kWh. | |
Note: The minimum and maximum values for variables such as U-value, SHGC, and VT are as defined in the source dataset for simulation purposes. Some extreme values (e.g., 0 or 1) may represent conceptual placeholders or boundary conditions used in the dataset generation and do not reflect physically realistic glazing properties.
Table 4. Detailed distribution of variables in the dataset.

| Feature | Mean | St.D. | Min. | Med. | Max. |
| --- | --- | --- | --- | --- | --- |
| Length | 20.012 | 5.735 | 10.000 | 20.020 | 30.000 |
| Depth | 20.078 | 5.776 | 10.000 | 20.205 | 30.000 |
| Height | 9.466 | 3.171 | 4.000 | 9.460 | 15.000 |
| Orientation | 179.935 | 104.749 | 0.000 | 180.000 | 360.000 |
| South | 0.398 | 0.232 | 0.000 | 0.400 | 0.800 |
| East | 0.398 | 0.231 | 0.000 | 0.400 | 0.800 |
| North | 0.398 | 0.231 | 0.000 | 0.400 | 0.800 |
| West | 0.399 | 0.231 | 0.000 | 0.400 | 0.800 |
| UValue | 0.619 | 0.341 | 0.010 | 0.630 | 1.200 |
| SHGC | 0.499 | 0.285 | 0.010 | 0.500 | 0.990 |
| VT | 0.502 | 0.287 | 0.010 | 0.500 | 0.990 |
| Heating_SP | 9.988 | 1.419 | 8.000 | 10.000 | 12.000 |
| Cooling_SP | 22.966 | 3.163 | 18.000 | 23.000 | 28.000 |
| pEUI | 51,612.500 | 27,344.366 | 6903.000 | 45,336.500 | 214,820.000 |
Table 5. Model ranks in the H2O framework.

| Models | R² | RMSE | MAE |
| --- | --- | --- | --- |
| StackedEnsemble_AllModels_1_AutoML_1_20250716_112613 | 0.989 | 4.754 | 2.296 |
| StackedEnsemble_BestOfFamily_1_AutoML_1_20250716_112613 | 0.981 | 4.917 | 2.448 |
| GBM_1_AutoML_1_20250716_112613 | 0.979 | 5.297 | 2.832 |
| GBM_5_AutoML_1_20250716_112613 | 0.975 | 5.358 | 2.993 |
| GBM_grid_1_AutoML_1_20250716_112613_model_1 | 0.973 | 5.440 | 3.033 |
| GBM_2_AutoML_1_20250716_112613 | 0.954 | 5.508 | 3.170 |
| GBM_3_AutoML_1_20250716_112613 | 0.948 | 5.659 | 3.282 |
| GBM_grid_1_AutoML_1_20250716_112613_model_2 | 0.937 | 5.660 | 3.233 |
| XGBoost_3_AutoML_1_20250716_112613 | 0.922 | 5.757 | 3.646 |
| DeepLearning_grid_1_AutoML_1_20250716_112613_model_1 | 0.916 | 5.888 | 2.918 |

Note: Same unit applied for RMSE and MAE: kWh/m² (same as target variable).
Table 6. Model ranks in AutoGluon framework.

| Model | R² for Test Set | R² for Validation Test |
| --- | --- | --- |
| ExtraTreesMSE_BAG_L2 | 0.994 | 0.982 |
| WeightedEnsemble_L3 | 0.993 | 0.983 |
| CatBoost_BAG_L2 | 0.993 | 0.979 |
| LightGBM_BAG_L2 | 0.993 | 0.981 |
| RandomForestMSE_BAG_L2 | 0.992 | 0.981 |
| XGBoost_BAG_L2 | 0.992 | 0.980 |
| NeuralNetFastAI_BAG_L2 | 0.992 | 0.982 |
| LightGBMXT_BAG_L2 | 0.992 | 0.978 |
| CatBoost_BAG_L1 | 0.992 | 0.982 |
| WeightedEnsemble_L2 | 0.992 | 0.982 |
| LightGBMXT_BAG_L1 | 0.986 | 0.970 |
| LightGBM_BAG_L1 | 0.983 | 0.970 |
| NeuralNetFastAI_BAG_L1 | 0.975 | 0.957 |
| ExtraTreesMSE_BAG_L1 | 0.941 | 0.924 |
| RandomForestMSE_BAG_L1 | 0.935 | 0.919 |
| KNeighborsDist_BAG_L1 | 0.761 | 0.743 |
| KNeighborsUnif_BAG_L1 | 0.759 | 0.740 |
Table 7. Model ranks in the Auto-sklearn framework.

| Model | R² Score |
| --- | --- |
| standard_scaler_random_forest | 0.9138 |
| standard_scaler_gradient_boosting | 0.9315 |
| standard_scaler_linear_regression | 0.8645 |
| standard_scaler_ridge | 0.8645 |
| standard_scaler_lasso | 0.8645 |
| standard_scaler_elastic_net | 0.7671 |
| standard_scaler_svr | 0.655 |
Table 8. Prediction performance of traditional ML and AutoML methods.

| Model | R² | RMSE | MAE |
| --- | --- | --- | --- |
| Linear Regression | 0.880 | 9.364 | 6.671 |
| Random Forest | 0.934 | 6.964 | 4.545 |
| Naive Bayes | 0.71 | 17.467 | 9.896 |
| XGBoost | 0.966 | 5.013 | 3.454 |
| LightGBM | 0.973 | 4.447 | 2.745 |
| AdaBoost | 0.478 | 19.520 | 17.354 |
| H2O | 0.982 | 3.653 | 2.057 |
| AutoGluon | 0.993 | 2.280 | 1.116 |
| AutoSklearn | 0.947 | 6.235 | 3.855 |

Note: Same unit applied for RMSE and MAE: kWh/m² (same as the target variable).
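
As a reading aid for the metrics reported above, the sketch below computes R², RMSE, MSE, and MAE from their standard definitions and then works out the relative RMSE reduction of AutoGluon over LightGBM from the reported figures. The `y_true` and `y_pred` values are made up for illustration and are not from the study's dataset; only the entries in `reported_rmse` are taken from the results table. The helper name `regression_metrics` is an assumption, not part of any framework used in the paper.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, MSE, and MAE for paired observations and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": sqrt(mse),
        "MSE": mse,
        "MAE": sum(abs(e) for e in errors) / n,
    }


# Made-up energy values (kWh/m2), for illustration only.
y_true = [50.0, 62.0, 48.0, 71.0]
y_pred = [52.0, 60.0, 47.0, 73.0]
m = regression_metrics(y_true, y_pred)
print({k: round(v, 3) for k, v in m.items()})

# RMSE values (kWh/m2) as reported in the results table.
reported_rmse = {"LightGBM": 4.447, "H2O": 3.653, "AutoGluon": 2.280}
reduction = 1.0 - reported_rmse["AutoGluon"] / reported_rmse["LightGBM"]
print(f"AutoGluon RMSE reduction vs. LightGBM: {reduction:.1%}")
```

Because RMSE squares the errors before averaging, it can never be smaller than MAE on the same predictions, which is a quick consistency check when reading any row of the table.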
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tang, Z.; Chen, J.; Cheng, J. Benchmarking Automated Machine Learning for Building Energy Performance Prediction: A Comparative Study with SHAP-Based Interpretability. Buildings 2026, 16, 185. https://doi.org/10.3390/buildings16010185
