Article

Machine Learning-Based Carbon Compliance Forecasting and Energy Performance Assessment in Commercial Buildings

Department of Mechanical Engineering, University of Maryland, College Park, MD 20742, USA
* Author to whom correspondence should be addressed.
Energies 2025, 18(15), 3906; https://doi.org/10.3390/en18153906
Submission received: 20 June 2025 / Revised: 15 July 2025 / Accepted: 20 July 2025 / Published: 22 July 2025

Abstract

Owing to the need for continuous improvement in building energy performance standards (BEPSs), facilities must adhere to benchmark performances in their quest to achieve net-zero performance. This research explores machine learning models that leverage historical energy data from a cluster of buildings, along with relevant ambient weather data and building characteristics, with the objective of predicting the buildings’ energy performance through the year 2040. Using the forecasted emission results, the portfolio of buildings is analyzed for the incurred carbon non-compliance fees based on their on-site fossil fuel CO2e emissions to assess and pinpoint facilities with poor energy performance that need to be prioritized for decarbonization. The forecasts from the machine learning algorithms predicted that the portfolio of buildings would incur an annual average penalty of $31.7 million ($1.09/sq. ft.) and ~$348.7 million ($12.03/sq. ft.) over 11 years. To comply with these regulations, the building portfolio would need to reduce on-site fossil fuel CO2e emissions by an average of 58,246 metric tons (2.01 kg/sq. ft.) annually, totaling 640,708 metric tons (22.10 kg/sq. ft.) over a period of 11 years. This study demonstrates the potential for robust machine learning models to generate accurate forecasts to evaluate carbon compliance and guide prompt action in decarbonizing the built environment.

1. Introduction

To prevent the global temperature from rising over 2 °C above pre-industrial levels, bold measures must be taken to drastically reduce greenhouse gas (GHG) emissions by 2050 [1]. Over 145 countries, contributing to 90% of global GHG emissions, have established climate targets as part of the Paris Agreement, aiming for net-zero emissions by mid-century [2,3]. The agreement mandates a 45% reduction in GHG emissions by 2030 compared to 2010 levels [3]. Despite a predicted decrease in CO2 emissions from industrial processes, if current emissions trends continue the world could face a 2.7 °C rise by 2100, with a 99.5% chance of exceeding 1.5 °C, which poses severe risks to societies and economies [4,5]. The warming climate increases the frequency and severity of extreme weather events, such as heat waves, droughts, flooding, and stronger tropical cyclones. Climate models suggest that further global warming will worsen these impacts, threatening infrastructure, agriculture, health, and ecosystems [6]. Without significant reductions in emissions across all sectors, many regions may struggle to adapt, leading to irreversible damage and economic losses.
Among major sources of emissions, the building sector accounts for 30–40% of global energy consumption and 26% of energy-related emissions [7,8], highlighting the need for policymakers to understand building energy use for future energy and climate policies [9]. In 2023, the U.S. residential and commercial sectors consumed approximately 28% of total end-use energy [10], resulting in 561 million metric tons of CO2e emissions, a 6.5% decrease from 2022 [11]. However, aging buildings often decline in energy performance [12], with the EU and World Economic Forum projecting that at least 70–80% of current buildings will still be in use by 2050 [13,14]. As buildings age, their energy efficiency deteriorates [15], making it necessary to optimize their energy performance and improve their efficiency. To this end, major economies are tightening energy performance requirements for buildings. In China, the Ministry of Housing implemented a code in April 2022 mandating energy efficiency for all new and renovated structures [8]. Japan revised its regulations in 2022 to achieve zero-energy performance for all new buildings by 2030 and for existing buildings by 2050 [8]. The European Union’s 2023 revision of the Energy Performance of Buildings Directive aims for climate neutrality in the sector by 2050, requiring zero emissions for new public buildings by 2026 and all new buildings by 2028, with stricter standards for existing buildings over time [8]. In the U.S., several states have introduced building performance standards to enhance energy efficiency in the existing stock, as shown in Table 1.
With building performance standards mandating benchmark performances and stringent GHG emission goals by 2030, building energy performance must be predicted accurately. This helps facilities managers evaluate design alternatives and operation strategies to enhance energy efficiency and improve demand and supply management [22]. Two main approaches for predicting energy consumption are physical modeling (white-box models) and data-driven (black-box) methods. Physical models employ thermodynamic principles to analyze energy needs, incorporating variables like building materials and metadata, HVAC design, and weather data to calculate heating and cooling loads [22]. The white-box approach is a physics-based method that applies thermodynamic concepts for comprehensive energy modeling and analysis. Software packages like EnergyPlus [23], TRNSYS [24], eQuest [25], ESP-r [26], DOE-2 [27], Ecotect [28], and Trace 3D Plus [29] are examples of building energy simulation tools that use physical models. They assess energy use by considering the building’s design, environmental conditions, HVAC systems, operational schedules, and climate data. However, urban areas with clusters of buildings may lack this kind of detailed information, which can significantly affect prediction accuracy [30]. Data-driven approaches for assessing building energy performance, on the other hand, do not require such detailed information. Using historical data, data-driven (black-box) approaches can identify patterns and correlations that enable accurate predictions. Over the past decade, data-driven methods have garnered considerable attention as a credible methodology for assessing energy performance and facilitating accurate energy and GHG forecasting [31]. Numerous prior studies have employed black-box methodologies, notable examples of which include artificial neural networks (ANNs), support vector machines (SVMs), random forests (RF), gradient boosting (GB), and multiple linear regression (MLR) [32].
This research presents a machine learning (ML) tool for predicting energy efficiency and carbon footprint reduction compliance in commercial buildings. This study focuses on commercial facilities that are 3251 m2 (35,000 square feet) or larger, as regulated by the Maryland Building Energy Performance Standards (BEPSs), which mandate that they achieve a 20% reduction in net direct GHG emissions by 2030 compared with 2025 levels and, subsequently, net-zero direct GHG emissions by 2040 [33]. This study forecasts energy consumption and GHG emissions until 2040 for a portfolio of over 100 buildings totaling 2.7 million square meters (~29 million square feet). Two machine learning models are developed: one for energy consumption forecasting, which feeds into the second model for GHG emissions forecasting until 2040. The models are trained on a dataset of 310 buildings comprising 6.6 million square meters (~71 million square feet). Using the GHG emissions forecasts, the “cost of inaction” is then calculated for the entire dataset based on the carbon non-compliance fees for excess on-site fossil fuel CO2e emissions as dictated by the Maryland Climate Solutions Now Act (CSNA) [19].
The rest of the paper is organized as follows. Section 2 presents a literature review of physics-based and machine learning models and the problem statement. Section 3 discusses the methodology used to curate the data and develop and evaluate machine learning models for forecasting. Section 3 also introduces the methodology involved in calculating carbon non-compliance fees associated with on-site fossil fuel emissions for the portfolio of buildings. Section 4 presents the study’s results by comparing the forecasting accuracies of the different machine learning models and discusses the implications of the forecasts for the next two decades for the portfolio of buildings. Finally, Section 5 concludes the study with takeaways, lessons learned from the research, and the study’s limitations.

2. Literature Review and Problem Statement

2.1. Literature Review

Accurate building energy consumption predictions are vital for effective electricity market management, which helps ensure grid safety and mitigate financial risks. Over the past few decades, there has been remarkable progress in developing time series load forecasts that cater to various domains and objectives [34]. Research conducted by Ma et al. [35] forecasts that by 2070, the total floor area worldwide will surpass 540 billion square meters. The rapid growth will increase demand for building materials like cement, steel, and aluminum, leading to high embodied carbon emissions. Additionally, rising energy consumption in residential and commercial buildings is predicted to result in operational carbon emissions, particularly from activities like space cooling [36]. For these reasons, this section introduces existing research on the three main forecasting techniques, i.e., physics-based, data-driven, and hybrid approaches, to facilitate the development of energy prediction models for the building sector.
Higgins et al. [37] explored energy efficiency opportunities for a public transportation maintenance facility, demonstrating that various energy efficiency measures could reduce annual emissions by 584 metric tons of CO2e and save $162,402. However, the lack of access to complete building drawings and HVAC designs often limits the effectiveness of physical modeling. Neto et al. [38] compared the prediction accuracy of EnergyPlus and an artificial neural network (ANN) for building energy consumption. They found that EnergyPlus had a prediction error of ±13% while the ANN achieved ±10%. They identified lighting, equipment, and occupancy schedules as key sources of uncertainty. The literature indicates significant variation in prediction errors for different cases, and historical operational data can enhance stability. Kim et al. [39] created a physical building information model (BIM) using the Dymola 2012 platform, employing the object-based Modelica language to simulate energy consumption. Their study incorporated the Modelica building library [40]. Similarly, Chen et al. [41] developed a model of an office building to assess the HVAC load and overall energy consumption of the building.
Using a multi-criteria ranking tool they developed, Ramnarayan et al. [42,43] identified a cluster of university buildings for energy efficiency improvements and reductions in GHG emissions. Leveraging historical data from 2015 to 2023, Zargarzadeh et al. [44] investigated Solar PV electricity production on a university campus to reduce CO2 emissions over nine years. A novel tree-based ensemble learning model was implemented to predict monthly photovoltaic-generated electricity, thereby enhancing model explainability through the application of SHAP (SHapley Additive exPlanations) techniques. Their findings indicated a potential reduction of approximately 18% in the CO2e emissions of the campus. In a study by Olu-Ajayi et al. [45], a myriad of machine and deep learning models were tested to determine the best-performing model for energy forecasting, revealing that the decision tree model had the fastest computation time (1.2 s), while the deep neural network achieved the highest accuracy. Another study by Zhang et al. [46] developed a light gradient boost model with explainable artificial intelligence (xAI) to predict the energy consumption and corresponding GHG emissions of Seattle’s stock of residential buildings. This model identified square footage and natural gas consumption as the key factors influencing energy use and GHG emissions in Seattle’s residential buildings. Chen et al. [47] introduced a support vector regression (SVR) model that uses the ambient temperature two hours ahead for short-term electrical load prediction. This approach improves accuracy by reducing the lagging effects of weather-sensitive loads from a building’s thermal inertia. The model uses ambient temperature and time series data to focus solely on weather information. Wang et al. [48] analyzed XGBoost for building thermal load prediction using five variables: day of week, hour of day, holiday, temperature, and relative humidity. They found that XGBoost outperformed other algorithms, achieving a CV-RMSE of 21.1%, compared to SVM (25.0%), RF (23.7%), and LSTM (31.9%). In a study by Yousaf et al. [49], compact empirical models for air conditioners and heat pumps that work with standard building simulation engines were created, achieving a prediction error of under 5% on key outputs. In a similar vein, Yousaf et al. [50] present a steady-state gray-box model, which can be trained with as few as five data points, utilizing symbolic regression for component UA correlations. Their study effectively predicted important performance metrics, including cooling capacity, coefficient of performance (COP), and sensible heat ratio (SHR), achieving a mean absolute percentage error (MAPE) of under 3.4%.
The modeling and calibration of “white box” software for building energy stakeholders is challenging due to the extensive input parameters, making development time-consuming and costly. “Black box” models simplify the relationship between inputs and outputs but require large historical datasets and significant training time for accurate predictions. To blend the strengths of both white- and black-boxes, “grey box” models have been introduced, leveraging a simplified physical model and accessible data to simulate building energy demand effectively [51]. Dong et al. [52] created a hybrid model that combines data-driven and physics-based methods to estimate residential energy consumption. This model outperformed five other methods—ANN, SVR, LS-SVM, GPR, and GMM—in a 24 h prediction accuracy. Xu et al. [53] developed a model that couples EnergyPlus with an artificial neural network (ANN) to predict the energy consumption of a cluster of ten single-family residential houses. The model considered external empirical sources, including experimental data and demographic information.
Extensive research on data-driven forecasting techniques has also been conducted to predict greenhouse gas (GHG) emissions in the industrial sector, which accounted for about 35% of energy consumption in the U.S. in 2023 and was responsible for 20% of energy-related emissions, totaling 963 million metric tons of CO2 [10]. India and China, with their large populations and significant manufacturing industries, contribute substantially to GHG emissions. In their research, Bakır et al. [54] forecasted that India’s GHG emissions will increase by 2.1 to 2.4 times by 2050; this study employed the marine predators algorithm (MPA) for optimization. Another study, by Sen et al. [55], focused on an Indian pig iron manufacturer and used an autoregressive integrated moving average (ARIMA) algorithm, achieving high accuracy in emissions prediction. Javed et al. [56] developed a gray forecasting technique for GHG emissions across four sectors in India and China (buildings, transportation, manufacturing/construction, and waste), predicting declines in emissions from China’s manufacturing/construction sector while all sectors in India are expected to see increases. The findings highlight the crucial necessity of accurately forecasting GHG emissions to inform effective policy formulation in both nations. Many studies predict that energy consumption and GHG emissions will increase across the residential, commercial, and industrial sectors, driven by factors such as population growth, economic affordability, GDP expansion, and the adoption of luxury appliances. There is therefore an urgent need to develop and implement policies aimed at reducing GHG emissions and improving energy efficiency [57].
GHG emission forecasting is a valuable tool for policymakers to address climate change. With strict targets to combat global warming and climate change [3,16,17,18,19,20,21], effective policies must focus on reducing emissions from key sectors: residential, commercial, industrial, and transportation. A study by Chang et al. [58] on policy-oriented CO2 emissions forecasting in China identified rising electricity consumption as a significant factor influencing CO2 emission predictions, which could lead to increased emissions in the absence of new restrictions or innovations. This study emphasized the importance of understanding the factors driving emissions for developing effective energy policies and highlighted the value of interpretable machine learning algorithms. Similarly, a study by Nam et al. [59] in South Korea explored how Jeju Island could achieve 100% sustainable electricity using a deep learning approach. Four renewable energy scenarios were assessed using a gated recurrent unit (GRU) to forecast Jeju Island’s renewable energy goals. The most likely scenario includes 100 MW from onshore wind, 400 MW from offshore wind, 60 MW from solar, 210 MW from fuel cells, and 520 MW from an integrated gasification combined cycle (IGCC) with carbon capture and sequestration (CCS). This scenario would reduce economic-environmental costs by 36.16% compared to the current energy policy. Studies [58,59] demonstrate how machine and deep learning models can aid in crafting adaptive, evidence-based policies for emission reduction.

2.2. Machine Learning Models

Seven machine learning models—support vector machines (SVMs), artificial neural networks (ANNs), Gaussian process regression (GPR), decision trees (DTs), random forests (RF), and bagging and boosting ensembles—were trained and tested on the dataset. An 80-20 split was used for the training and testing data, based on several studies [60,61,62], to partition the dataset for high accuracy and reduced bias. This section reviews the literature for the three best-performing models for forecasting this study’s energy consumption and GHG emissions, as described in Section 4.2.

2.2.1. Support Vector Machines (SVMs)

SVMs are robust models widely used in research and industry for non-linear regression and classification tasks, including building energy forecasting. In 2005, Dong et al. [63] developed an SVM model to predict monthly electricity consumption in Singapore’s commercial buildings using temperature, humidity, and solar radiation data, yielding an impressively low error rate of 4% with a radial basis function (RBF) kernel. Another study, by Massana et al. [64], compared SVMs with other machine learning models like ANN and multiple linear regression for short-term electricity demand predictions, finding that SVMs offered higher accuracy at a lower computational cost. Jain et al. [65] used sensor-based data from a multi-family building in New York City to investigate the effect of time intervals and locations on energy consumption forecasting accuracy, concluding that hourly data from the building’s floor level produced the best results.
Suppose each input sample comprises a vector Xi (where i denotes the ith sample) and a corresponding output Yi, in this case the building energy consumption. The SVM relates inputs to outputs using the following equations [66]:
Y = W \cdot \phi(X) + b \quad (1)
The ϕ(X) function maps X non-linearly to a high-dimensional feature space. ‘b’ is the bias dictated by the selected kernel function. W is the weight vector given by Equation (2) and approximated by the empirical risk function as shown:
\text{Minimize: } \frac{1}{2}\lVert W \rVert^2 + C \sum_{i=1}^{N} L_\varepsilon (Y_i, f(X_i)) \quad (2)
Lε is the ε-insensitive loss function, given in Equation (3).
L_\varepsilon (Y_i, f(X_i)) = \begin{cases} \lvert f(X_i) - Y_i \rvert - \varepsilon, & \lvert f(X_i) - Y_i \rvert \ge \varepsilon \\ 0, & \text{otherwise} \end{cases} \quad (3)
Equations (1)–(3) succinctly describe the mathematics behind support vector machines (SVMs) [66]. SVMs can be trained with a small number of data samples, making them valuable in situations with limited historical data. A key advantage is their unique and globally optimal solutions, avoiding the pitfalls of non-linear optimization that can lead to local minima [67]. However, a significant drawback is their high computation time, approximately O(N3), where N represents the number of samples.
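As an illustration of Equations (1)–(3), the sketch below fits an ε-insensitive SVM regression in MATLAB, the environment used in this study; the feature matrix, response vector, and hyperparameter values are placeholders chosen for demonstration, not the study’s data or final settings.

```matlab
% Minimal sketch of epsilon-insensitive SVM regression (Equations (1)-(3)).
% X and y are illustrative placeholders, not the study's dataset.
rng(1);                                    % reproducibility
X = rand(200, 11);                         % e.g., 11 principal components
y = 3*X(:,1) - 2*X(:,2) + 0.1*randn(200, 1);

mdl = fitrsvm(X, y, ...
    'KernelFunction', 'gaussian', ...      % RBF kernel acting as phi(X)
    'BoxConstraint', 1, ...                % C in Equation (2)
    'Epsilon', 0.05, ...                   % epsilon in Equation (3)
    'Standardize', true);

yPred = predict(mdl, X);                   % in-sample predictions
fprintf('Training RMSE: %.4f\n', sqrt(mean((y - yPred).^2)));
```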

2.2.2. Gaussian Process Regression (GPR)

Gaussian process regression (GPR) is a Bayesian approach that models distributions over functions without relying on predetermined parameters, making it suitable for complex, non-linear data patterns [68]. Unlike traditional regression, GPR does not assume a specific functional form for input-output relationships. A study by Rastogi et al. [69] found GPR to be four times more accurate than linear regression in emulating building performance simulations with EnergyPlus. Another study by Zhang et al. [70] compared GPR, Gaussian mixture models (GMM), change-point models, and feed-forward artificial neural networks (ANN) for predicting an office building’s HVAC energy usage based on ambient temperature. It revealed that GMM performed best, but it was also noted that the ANN was the least effective due to insufficient data. Despite GMM and GPR showing slightly better accuracy than change-point regression, the latter was preferred for its simplicity. Lastly, GMM was used to forecast daily and hourly energy usage in commercial buildings, providing adaptable local uncertainty quantification for the data [71].
When dealing with a set of ‘n’ independent input vectors Xi (i = 1, …, n), the associated observations Yi exhibit correlation through the covariance function K, which follows a normal distribution as presented by [72] and shown below in Equation (4).
P(y; m, K) = \frac{1}{(2\pi)^{n/2} \lvert K(X, X) \rvert^{1/2}} \exp\!\left( -\frac{1}{2} (y - m)^T K(X, X)^{-1} (y - m) \right) \quad (4)
where K(X, X) is the N × N covariance matrix whose entries are k(xi, xj), with k being the kernel function, which can be any valid kernel such as a polynomial, radial basis, or sigmoid function. Furthermore, to account for uncertainty in the prediction of yi, white noise with variance σ2 is assumed; thus, the covariance of y becomes that shown in Equation (5).
\mathrm{cov}(y) = K(X, X) + \sigma^2 I \quad (5)
Once the covariance matrix for y is available, it is possible to estimate y* using Equations (6) and (7).
y^{*} = \sum_{i=1}^{N} \alpha_i \, k(x_i, x^{*}) \quad (6)
\alpha = \left( K(X, X) + \sigma^2 I \right)^{-1} y \quad (7)
In building energy efficiency prediction and decarbonization, Gaussian process regression aids in estimating energy usage [73,74,75]. Estimating and forecasting a building’s performance can introduce uncertainty into these predictions, but GPR effectively accounts for this uncertainty. However, a downside of GPR is its computational expense, particularly as data volumes grow, due to the complexity of the covariance matrix and the O(N3) time required for matrix inversion during predictions [76].
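The snippet below is a minimal MATLAB sketch of GPR as formulated in Equations (4)–(7), using a Matern 5/2 covariance; the data and initial noise level are placeholders chosen for illustration only.

```matlab
% Minimal sketch of Gaussian process regression (Equations (4)-(7)).
% X and y are illustrative placeholders, not the study's dataset.
rng(2);
X = rand(200, 11);
y = sin(2*pi*X(:,1)) + 0.5*X(:,2) + 0.05*randn(200, 1);

gprMdl = fitrgp(X, y, ...
    'KernelFunction', 'matern52', ...      % covariance function k
    'Sigma', 0.1, ...                      % initial white-noise level sigma
    'Standardize', true);

% Predictions with uncertainty: ysd is the predictive standard deviation.
[yPred, ysd] = predict(gprMdl, X);
fprintf('Mean predictive standard deviation: %.4f\n', mean(ysd));
```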

2.2.3. Ensemble Models (LSBoost)

Least squares boosting (LSBoost) is an ensemble learning method that combines predictions from weak learners to improve regression accuracy, making it particularly effective for tasks like forecasting building energy consumption and GHG emissions [77,78]. A study by Ahmad et al. [77] introduced three machine learning models: a non-linear autoregressive model (NARM), a linear model using stepwise regression (LMSR), and least square boosting (LSBoost) to improve the accuracy of energy usage predictions across various sectors. The LSBoost model showed a coefficient of variation (CV) of 5.019%, 3.159%, 3.292%, and 3.184% in summer, autumn, winter, and spring, respectively. The findings indicate that while LSBoost performs moderately, both NARM and LMSR provide better forecasting accuracy than the existing GPR method. The study also suggests that future research should consider additional factors, like economic variables, to improve model performance in energy demand forecasting. In a similar study by Ke et al. [78], researchers combined data-driven predictive control (DPC) with a cloud-based system for building energy management, developing regression tree (RT) and LSBoost algorithms. LSBoost outperformed other strategies, reducing peak power consumption by 30.4% relative to TDNN while maintaining occupant comfort. The DPC-LSBoost approach decreases energy costs and prediction time, proving viable for large-scale energy systems.
The LSBoost model leverages multiple tree regression weak learners (C) to minimize the mean squared error (MSE) between the validation parameter (Z) and the boosted forecast (ZF). It starts with an initial estimate of the combined validation parameter (Z̄) and the predictor median function parameters (Y), then employs a weighted approach to integrate regression learners C1, C2, C3, …, Cn, enhancing forecasting performance and accuracy [77].
Z_F(Y) = \bar{Z}(Y) + w \sum_{n=1}^{N} \rho_n C_n(Y) \quad (8)
LSBoost is effective for forecasting building energy consumption and GHG emissions as it models complex, non-linear relationships between inputs (like weather data, building age, occupancy, etc.) and outputs [79]. It refines predictions, minimizes errors, and ranks feature importance, identifying key factors impacting consumption and emissions [80]. This methodology is valuable for optimizing building performance and reducing environmental impact.
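A minimal MATLAB sketch of an LSBoost ensemble in the spirit of Equation (8) is shown below; the placeholder data, tree depth, learner count, and learning rate are illustrative assumptions rather than the tuned values reported later in Section 4.3.2.

```matlab
% Minimal sketch of an LSBoost ensemble of regression trees (Equation (8)).
% X and y are illustrative placeholders, not the study's dataset.
rng(3);
X = rand(300, 11);
y = 2*X(:,1).*X(:,2) + X(:,3).^2 + 0.05*randn(300, 1);

tTree = templateTree('MaxNumSplits', 20);       % weak learner C_n
ens = fitrensemble(X, y, ...
    'Method', 'LSBoost', ...
    'NumLearningCycles', 200, ...               % number of boosting steps
    'LearnRate', 0.1, ...                       % shrinkage weight w
    'Learners', tTree);

yPred = predict(ens, X);
imp = predictorImportance(ens);                 % ranks feature importance
fprintf('Most important predictor index: %d\n', find(imp == max(imp), 1));
```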

2.3. Gaps in Existing Literature

Significant strides have been made in physical and data-driven modeling approaches for building energy performance prediction [22,34,38,54,55,56,58,59]. Over the past decades, models have become more robust, generalizable, and scalable with improvements in computing power and technology, data availability, and software capable of performing complex simulations [22]. However, to the best of the authors’ knowledge, no study has investigated the economic penalties associated with excess on-site CO2e emissions. This study evaluates the performance of a cluster of buildings against pre-defined emissions standards [34] to calculate the carbon non-compliance penalties the cluster would incur. By developing machine learning models capable of simulating the “cost of inaction”, i.e., assuming no energy efficiency upgrades are implemented in these facilities, this study demonstrates the importance of taking swift action to improve energy efficiency and decarbonize the built environment, thereby avoiding hefty non-compliance costs.

3. Methodology

The research methodology is divided into two phases, Phase 1 and Phase 2, focusing on developing machine learning models to forecast energy consumption and GHG emissions until 2040, as shown in Figure 1. Using a training dataset of 310 buildings, making up ~6.6 million m2 (~71 million ft2), machine learning models are trained, validated, and tested for a portfolio of 100 buildings. Model development was carried out on a 12th Gen Intel Core i7-12700K processor (Intel, Santa Clara, CA, USA) at 3.60 GHz with 16 gigabytes (GB) of RAM, using MATLAB 2024a. The following sections discuss the methodology in detail.

3.1. Data Collection and Preprocessing

A central database [81] was used to gather monthly utility bills for all facilities from all utilities as part of the training dataset for six years, from 2018 to 2023. The utility bills comprised seven main utilities, i.e., electricity, natural gas, oil #2, propane, steam, chilled water, and hot water. The distribution of building types for the training dataset is shown in Figure 2, with a total of 310 buildings and sixteen unique building types.
Missing and/or incomplete data can result in inaccurate forecasts and unreliable machine learning models [82]. The dataset encompasses monthly utility data across seven utilities for 310 buildings from 2018 to 2023. This results in approximately 22,320 building-month entries (310 buildings × 6 years × 12 months), which translates to around 156,240 total utility data points. Over the six-year analysis period, some facilities had data gaps (142 entries, <1%), requiring careful data cleaning and preprocessing. Missing values were imputed with the average of existing values for the same month from different years, as noted in [83]. Additional features were then integrated to enhance the machine learning predictions [84]. These included building age, type, square footage, and weather data (heating degree days (HDD) and cooling degree days (CDD)) to assess the influence of weather on energy consumption and GHG emissions. Due to the absence of occupancy sensors, maximum occupancy levels for various building types were estimated based on ASHRAE Standard 62.1 occupant density values [85]. Finally, a breakdown of utility percentages for each year was added as a feature to improve the energy and GHG emission forecasts [86]. Once the data was prepared for initial training and testing, it was normalized to enhance the accuracy of the machine learning algorithms and forecasts [87]. Normalization helps address data skewness from outliers or non-normal distributions, leading to faster training and improved model performance [87]. The data and variables in this study were scaled between 0 and 1 for better accuracy and resolution using Equation (9).
X_{\text{normalized}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \quad (9)
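The sketch below illustrates, under simplifying assumptions, the two preprocessing steps just described: month-wise imputation of missing utility values followed by min-max scaling per Equation (9). The ‘usage’ matrix and the number of injected gaps are hypothetical stand-ins for the actual utility dataset.

```matlab
% Minimal sketch of the Section 3.1 preprocessing: month-wise imputation of
% missing utility entries, then min-max normalization (Equation (9)).
% 'usage' is a hypothetical buildings-by-months matrix, not the real data.
usage = rand(310, 72);                        % 310 buildings x 72 months
usage(randi(numel(usage), 142, 1)) = NaN;     % inject ~142 gaps (<1%)

for m = 1:12                                  % impute each calendar month
    cols = m:12:size(usage, 2);               % same month across years
    rowMeans = mean(usage(:, cols), 2, 'omitnan');
    for c = cols
        gaps = isnan(usage(:, c));
        usage(gaps, c) = rowMeans(gaps);      % average of other years
    end
end

% Min-max scaling of each column to [0, 1], i.e., Equation (9).
usageNorm = normalize(usage, 'range');
```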
This study presents two machine learning models: one for energy forecasting and another for GHG emissions forecasting, with the output of the first model used as input for the second. Table 2 below lists the input features for both models.

3.2. Feature Selection and Principal Component Analysis

After cleaning and normalizing the data, the next step was to select features impacting model performance. Feature selection facilitates simplified model development and ensures data clarity [88]. The features for predicting energy consumption and GHG emissions are listed in Columns 2 and 3 of Table 2. Using annual utility data from 2018 to 2022 (electricity, natural gas, oil #2, propane, chilled water, steam, and hot water), along with weather data (CDD and HDD) and building characteristics (age, type, size, and occupancy), 58 features were used for energy consumption predictions. For GHG emissions, 63 features were used, including historical emissions data and energy consumption for 2022. Two feature selection algorithms, minimum redundancy maximum relevance (MRMR) and F Test, were used to evaluate the contribution of each variable to energy consumption and GHG emission predictions. MRMR is popular for selecting the most relevant features while minimizing redundancy, thus improving model performance and interpretability [89,90]. The F Test assesses the significance of differences between group means by comparing within-group variance to between-group variance [91,92]. Both methods identified identical features with higher correlations for the 2023 energy consumption prediction. Moreover, the variables selected from the MRMR and F Test are highly correlated, resulting in multicollinearity in the predictive model, as seen in Figure 3.
However, for the GHG emissions predictions for 2023, the five highest-ranked features varied between the two feature selection methods, as seen in Figure 4. Such differences are particularly significant in building energy datasets where features often exhibit multicollinearity, such as between building geometry characteristics or occupancy patterns [22]. While the F Test evaluates features based solely on their individual correlation with the target variable [92], MRMR employs a more comprehensive approach by simultaneously maximizing relevance to the target while minimizing redundancy among selected features [93]. Using the MRMR and F-test scores for the two target variables, energy consumption and GHG emissions, it was observed from Figure 3 and Figure 4 that the scores dropped off markedly after the fifth feature. The five highest-scoring features for 2023 were therefore retained, and the remaining features were excluded to simplify the model, increase generalizability and robustness, and eliminate redundancy.
Due to the high number of input features, principal component analysis (PCA) was implemented to reduce dimensionality and minimize multicollinearity by producing independent components [94]. PCA helps mitigate data overfitting from an abundance of variables, enhancing generalization [95]. A commonly employed criterion for determining the optimal number of principal components to retain in PCA is the 95% variance threshold [96]. This threshold is strategically selected to ensure that the analysis captures the most significant sources of variability present in the dataset, thereby facilitating a reduction in dimensionality while preserving the essential structure of the data [97]. For this study’s dataset, PCA was applied at a 95% explained variance, similar to [98], to choose the optimal number of principal components. Figure 5 shows a scree plot of the principal components for the energy consumption and GHG emission prediction features. The scree plot helps determine how many principal components (PCs) are retained in a PCA by plotting the explained variance (or eigenvalues) of each PC in descending order [99]. For the two datasets on which PCA was performed, the number of principal components before the elbow point is 11, indicating that 11 principal components are needed to reach 95% variance, as shown in Figure 5. Based on this analysis, eleven principal components were selected for predictive modeling. The dataset was divided into 80% for training and 20% for testing. Various machine learning models were then tested to identify the one that performed best.
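As a compact illustration of this workflow, the sketch below ranks candidate features with the MRMR and F-test functions of the MATLAB Statistics and Machine Learning Toolbox (fsrmrmr and fsrftest, assuming a recent toolbox release) and then retains enough principal components to explain 95% of the variance; the feature matrix and target are synthetic placeholders.

```matlab
% Minimal sketch of the Section 3.2 feature selection and PCA steps.
% Xnorm and y are synthetic placeholders, not the study's dataset.
rng(4);
Xnorm = rand(310, 58);                         % e.g., 58 candidate features
y = Xnorm(:, 1:5) * (1:5)' + 0.1*randn(310, 1);

% Rank features by MRMR and by univariate F-tests (regression variants).
[idxMRMR, scoreMRMR] = fsrmrmr(Xnorm, y);
[idxF,    scoreF]    = fsrftest(Xnorm, y);
fprintf('Top five (MRMR):   %s\n', mat2str(idxMRMR(1:5)));
fprintf('Top five (F-test): %s\n', mat2str(idxF(1:5)));

% PCA retaining enough components for 95% cumulative explained variance.
[~, score, ~, ~, explained] = pca(Xnorm);
nPC = find(cumsum(explained) >= 95, 1);        % 11 PCs in this study
Xpca = score(:, 1:nPC);
fprintf('Components needed for 95%% variance: %d\n', nPC);
```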

4. Results and Discussion

4.1. Historical Weather Data

Table 2 outlines the input variables needed for both machine learning models, with historical data being crucial. This study utilizes historical ambient weather data in the form of HDD and CDD from thirteen weather stations. To forecast energy consumption and GHG emissions until 2040, HDD and CDD values must be projected into the future [100]. One such example of predictions for CDD and HDD is shown for the Baltimore weather station in Figure 6. These predictions use historical data from 1951 to 2023, fitting six probability distributions: normal, Nakagami, Birnbaum-Saunders, Gamma, Lognormal, and Burr. Of the six distribution fits, the Birnbaum-Saunders model best represented the data for the Baltimore weather station. This model was then used to generate forecasts for CDD and HDD at the Baltimore weather station, as seen in Figure 7. The same approach was applied to the remaining twelve weather stations to determine their HDD and CDD forecasts.
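A minimal sketch of this distribution-fitting step is given below; the annual degree-day series is synthetic, and the selection criterion (negative log-likelihood) is an assumption, since the paper does not state which goodness-of-fit measure was used.

```matlab
% Minimal sketch of the Section 4.1 degree-day distribution fitting.
% 'hdd' is a hypothetical vector of annual HDD totals (1951-2023).
rng(5);
hdd = 4200 + 450*randn(73, 1);                % illustrative annual HDD

candidates = {'Normal', 'Nakagami', 'BirnbaumSaunders', ...
              'Gamma', 'Lognormal', 'Burr'};
nll = zeros(numel(candidates), 1);
for k = 1:numel(candidates)
    pd = fitdist(hdd, candidates{k});         % maximum-likelihood fit
    nll(k) = negloglik(pd);                   % compare goodness of fit
end
[~, best] = min(nll);
fprintf('Best-fitting distribution: %s\n', candidates{best});

% Sample future HDD values from the selected distribution (2024-2040).
pdBest = fitdist(hdd, candidates{best});
hddForecast = random(pdBest, 17, 1);
```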

4.2. Evaluating Machine Learning Models

After completing PCA and normalizing the HDD and CDD forecasts for thirteen weather stations, the next step comprised developing machine learning models for accurately forecasting energy consumption and GHG emissions for the cluster of buildings. The three best-performing models were the Gaussian process regression, least squares boost ensemble, and support vector machine models detailed in Section 2.2. Table 3 presents the three best-performing models’ normalized accuracy metrics and training times. Based on the accuracy metrics, the GPR model was chosen for forecasting energy consumption (Model 1). The effectiveness of the models was assessed using the coefficient of determination (R2), normalized mean absolute error (NMAE), and normalized root mean square error (NRMSE). R2 indicates the variability explained by the independent variables, while the mean absolute error (MAE) measures the average prediction error and the root mean square error (RMSE) gives more weight to larger errors. The models were validated using data from the calendar year 2023.
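For clarity, the sketch below shows one common way to compute R2, NMAE, and NRMSE from actual and predicted 2023 values, normalizing by the observed range; the paper does not state its normalization convention, and the numbers here are placeholders.

```matlab
% Minimal sketch of the Table 3 validation metrics: R^2, NMAE, and NRMSE.
% yTrue and yPred are placeholder vectors, not the study's results.
yTrue = [100; 120; 95; 130; 110];
yPred = [ 98; 125; 90; 128; 115];

ssRes = sum((yTrue - yPred).^2);
ssTot = sum((yTrue - mean(yTrue)).^2);
R2    = 1 - ssRes/ssTot;

yRange = max(yTrue) - min(yTrue);             % normalization factor (assumed)
NMAE   = mean(abs(yTrue - yPred)) / yRange;
NRMSE  = sqrt(mean((yTrue - yPred).^2)) / yRange;
fprintf('R2 = %.3f, NMAE = %.3f, NRMSE = %.3f\n', R2, NMAE, NRMSE);
```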
Figure 8 illustrates the predicted versus actual energy consumption responses for 2023, with two dashed lines marking ±25% deviation bounds used to flag outliers. This comparison highlights how well each model can mimic and generalize the performance of building clusters. Outliers may arise from unmodeled factors affecting energy use, such as unique occupancy patterns, specialized equipment, or specific architectural features. Additionally, the absence of variables like HVAC system types and building characteristics can increase forecast variance. Notably, GPR shows the fewest outliers, making it the preferred model for forecasting energy consumption. A similar method applies to GHG forecasting, with added uncertainty related to utility consumption proportions from 2024 to 2040, as the model assumes average consumption from prior years to mimic the “cost of inaction”.
The accuracy metrics for the second machine learning model are presented in Table 4. As with the energy consumption forecasting, the models were validated on 2023 data. The LSBoost ensemble model was preferred due to its higher R2 and lower NRMSE and NMAE values. Figure 9 illustrates the actual versus predicted GHG emissions for 2023, showing that the LSBoost model has the lowest deviation from the recorded values.

4.3. Hyperparameter Tuning

Hyperparameter tuning is critical for enhancing the performance and generalization of machine learning models employed in energy forecasting and GHG emissions prediction [101]. Unlike model parameters, hyperparameters dictate the model’s structure and learning process, requiring careful adjustment to attain optimal performance. In this study, we optimized the hyperparameters of the machine learning models leveraging Bayesian optimization, grid search, and random search techniques to enhance predictive accuracy and generalizability across diverse building types. The three best-performing models, Gaussian process regression, least square boost ensemble, and support vector machine, were subjected to a systematic hyperparameter tuning process to investigate the hyperparameter space, striking a balance between accuracy and computational efficiency, further discussed in this section.

4.3.1. Gaussian Process Regression Kernel Selection

Gaussian process regression was chosen for this study because of its capacity to model uncertainty and effectively capture complex, non-linear relationships between the input features and energy consumption, as well as greenhouse gas emissions. A crucial aspect of GPR is the selection of the kernel function, which dictates the covariance structure of the predictions. The kernel influences the model’s smoothness and generalization capabilities, making its selection a vital factor in determining performance [102,103]. Multiple kernel functions were evaluated to determine the most suitable configuration for GPR. The radial basis function (RBF) kernel was tested due to its smoothness and ability to model complex relationships, while alternative kernels such as the Matern (ν = 3/2, ν = 5/2), rational quadratic, and exponential kernels were also considered to assess their suitability for capturing non-stationary behaviors in energy and emissions data [104,105]. Kernel selection in GPR significantly affects model performance. Due to GPR’s high computational cost (O(N3)), efficient hyperparameter optimization is crucial [106,107]. We used three strategies, namely grid search, random search, and Bayesian optimization, to explore different kernel configurations and select the best-performing model.
Grid search is an exhaustive search method that evaluates all combinations of hyperparameters from a set of values. Although comprehensive, it is computationally expensive, especially for GPR, which requires tuning multiple parameters like length scale, variance, and noise [108]. We used grid search as a baseline for comparing kernels but found it impractical for high-dimensional hyperparameter spaces. Random search samples hyperparameters randomly from a defined range, reducing computational costs while often matching the performance of grid search [109]. In our study, random search effectively identified promising areas in the hyperparameter space, especially for complex kernels like Matern 5/2 and rational quadratic. Bayesian optimization is a probabilistic model-based method that iteratively improves hyperparameter selection by balancing exploration and exploitation; it is particularly effective given the high computational complexity of GPR. We utilized a Gaussian process-based acquisition function, expected improvement per second plus, to direct the search for optimal kernel configurations [110]. This approach consistently outperformed grid and random searches in computational efficiency and final model accuracy. Table 5 showcases the performance of the different kernel functions for the GPR model. The Matern (ν = 5/2) kernel provided the lowest NRMSE, suggesting that its flexibility in handling real-world variations makes it more suitable than the smoother exponential kernel. While the rational quadratic kernel also performed well, its additional complexity did not significantly improve performance. The RBF kernel performed the worst, likely due to its tendency to overfit abrupt changes in data patterns.
One of the main challenges associated with using Gaussian Process Regression (GPR) for large-scale energy forecasting is its computational complexity, which scales as O(N3) in relation to the number of training samples. To address this issue, Bayesian optimization was employed to effectively explore the hyperparameter space while minimizing the number of costly full-model evaluations, similar to [111]. This strategy notably reduced training time when compared to traditional exhaustive grid search methods.
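The sketch below shows how such a kernel search can be expressed with MATLAB’s built-in Bayesian optimization and the expected-improvement-per-second-plus acquisition function; the data, the optimized hyperparameter list, and the evaluation budget are illustrative assumptions.

```matlab
% Minimal sketch of GPR kernel selection via Bayesian optimization
% (Section 4.3.1). X and y are placeholders, not the study's dataset.
rng(6);
X = rand(200, 11);
y = X(:,1).^2 + 0.5*X(:,2) + 0.05*randn(200, 1);

gprOpt = fitrgp(X, y, ...
    'OptimizeHyperparameters', {'KernelFunction', 'KernelScale', 'Sigma'}, ...
    'HyperparameterOptimizationOptions', struct( ...
        'AcquisitionFunctionName', 'expected-improvement-per-second-plus', ...
        'MaxObjectiveEvaluations', 30, ...
        'ShowPlots', false, 'Verbose', 0));
```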

4.3.2. LSBoost Hyperparameter Selection

The LSBoost model, a gradient-boosting ensemble based on least-squares loss, was extensively tuned to optimize predictive performance for energy consumption and GHG emissions. The model’s core strength lies in its ability to sequentially minimize residual errors via weak learners (decision trees), thereby achieving robust non-linear mappings of high-dimensional features [112]. Three key hyperparameters were tuned:
  • Maximum tree depth: governing the complexity of each decision tree. Shallow trees may result in underfitting, whereas deeper trees have the potential to overfit the data.
  • Number of learners (estimators): dictates the total number of boosting iterations. Utilizing too few learners can lead to underfitting, while an excessive number can prolong training time and heighten the risk of overfitting.
  • Learning rate: scales the impact of each learner’s contribution. A smaller learning rate can enhance generalization but may necessitate a greater number of learners.
An extensive hyperparameter tuning strategy was implemented utilizing Bayesian optimization, grid search, and random search. Bayesian optimization facilitated efficient convergence by modeling the objective function (e.g., RMSE) as a probabilistic function and sequentially sampling hyperparameters to maximize performance. Grid search enabled thorough evaluation across discrete parameter grids, whereas random search provided a stochastic exploration of a more expansive search space, occasionally uncovering non-obvious optimal combinations [107]. The LSBoost model was the best-performing model for predicting GHG emissions for the cluster of buildings. Table 6 identifies the optimal hyperparameters for forecasting GHG emissions.
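A minimal sketch of this tuning step is given below, searching over the number of learners, learning rate, and tree depth with MATLAB’s Bayesian optimization; the data and the evaluation budget are placeholders, and the resulting values will generally differ from those reported in Table 6.

```matlab
% Minimal sketch of LSBoost hyperparameter tuning (Section 4.3.2).
% X and y are placeholders for the GHG-emission training data.
rng(7);
X = rand(300, 11);
y = 2*X(:,1) + X(:,2).*X(:,3) + 0.05*randn(300, 1);

ensOpt = fitrensemble(X, y, ...
    'Method', 'LSBoost', ...
    'OptimizeHyperparameters', ...
        {'NumLearningCycles', 'LearnRate', 'MaxNumSplits'}, ...
    'HyperparameterOptimizationOptions', struct( ...
        'MaxObjectiveEvaluations', 30, ...
        'ShowPlots', false, 'Verbose', 0));
```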

4.3.3. SVM Hyperparameter Selection

In this study, the support vector machine was chosen as one of the top-performing models due to its effective balance between prediction accuracy and computational efficiency in building-level forecasting. The primary hyperparameters tuned within the SVM models included the kernel function, which governs the transformation of the input space into a higher-dimensional space. Several kernels were assessed, such as the linear kernel, which is effective for linearly separable features; the polynomial kernel, which captures moderate feature interactions through controlled degrees; and the radial basis function (RBF), celebrated for its flexibility and non-linearity, making it especially prevalent in energy-related regression tasks. In addition, the box constraint (C) was analyzed, as it regulates the trade-off between bias and variance: a smaller value encourages generalization, while a larger one aims to reduce training error. Lastly, epsilon (ε) was defined as the margin of tolerance within which no penalty is applied in the loss function, thus influencing the model’s sensitivity to noise and its potential for overfitting [113,114,115]. Hyperparameter tuning was performed using a variety of methods. Using a surrogate probabilistic model, Bayesian optimization was employed to explore high-performing regions within the parameter space efficiently. In contrast, grid search facilitated a comprehensive evaluation of structured combinations of parameters such as C, epsilon, and kernel types. Additionally, random search was leveraged to probe diverse areas of the hyperparameter space without leading to a combinatorial explosion. The evaluation process utilized a 5-fold cross-validation (CV) framework, and various performance metrics, including root mean squared error (RMSE) and coefficient of determination (R2), were recorded for each combination of kernel function and tuning strategy [116]. Since the SVM model was not the best-performing model for either energy consumption or GHG emissions forecasting, Table 7 reports the performance of the different kernel functions for energy forecasting only, showing that the model performs well, but not well enough to justify its use for either forecasting task.
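The sketch below mirrors this tuning setup for an SVM regressor, searching over the kernel function, box constraint C, and epsilon under 5-fold cross-validation; the data and evaluation budget are illustrative assumptions, not the study’s configuration.

```matlab
% Minimal sketch of SVM hyperparameter tuning with 5-fold CV (Section 4.3.3).
% X and y are placeholders, not the study's dataset.
rng(8);
X = rand(200, 11);
y = X * linspace(1, 0.1, 11)' + 0.05*randn(200, 1);

svmOpt = fitrsvm(X, y, 'Standardize', true, ...
    'OptimizeHyperparameters', ...
        {'KernelFunction', 'BoxConstraint', 'Epsilon'}, ...
    'HyperparameterOptimizationOptions', struct( ...
        'KFold', 5, ...                        % 5-fold cross-validation
        'MaxObjectiveEvaluations', 30, ...
        'ShowPlots', false, 'Verbose', 0));
```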
Figure 10 shows the box plot method adopted to represent the performance of the seven developed models. The GPR model shows the best performance for Model 1 (energy consumption prediction), whereas the LSBoost ensemble model shows the best performance for Model 2 (GHG emissions forecasting).
After determining the two best models for the forecast, the next step was to use them for predictions up to 2040. For Model 1, we employed GPR to forecast energy consumption until 2040. These forecasts served as inputs for the LSBoost ensemble model, enabling us to predict GHG emissions for the same time period. The results of the energy consumption forecasts and GHG emissions predictions are presented in Section 4.4.

4.4. The Cost of Inaction

After validating the machine learning models using accuracy metrics from Table 3 and Table 4, unsupervised time-series forecasting was conducted to predict energy consumption and GHG emissions through 2040. For this forecasting, the utility percentage breakdown for each year from 2024 to 2040 was assumed as the average from 2018 to 2023, based on the premise that without energy efficiency upgrades/retrofits, consumption patterns would remain unchanged, reflecting a “cost of inaction”. Figure 11 illustrates the cumulative energy consumption predictions for the building portfolio (~100 buildings), with the green line indicating a linear decrease through 2030 to align with the Maryland Executive Order 2023 [117].
The outputs of Model 1 serve as inputs for Model 2 to forecast on-site fossil fuel CO2e emissions through 2040 and evaluate the ramifications of carbon non-compliance fees, as shown in Figure 12. This forecast is expected to show a flat trend, indicating consistent emissions without energy efficiency upgrades. Unlike Figure 11, this forecast extends to 2040 per the guidelines in [19]. The dashed line represents a linear decline from 2018 to 2040, which the portfolio of buildings would aim to follow to meet net-zero targets.
In 2020 and 2021, energy consumption and greenhouse gas emissions dropped significantly due to the COVID-19 pandemic, which led to reduced operations in many facilities [118,119,120]. By 2022, operations resumed, but occupancy patterns shifted to hybrid work models, resulting in lower on-site presence. These changes indicate that state-owned buildings need major renovations, including energy efficiency upgrades and renewable energy integration, to meet ambitious state goals [19,117].

4.5. Carbon Non-Compliance Fees

The Maryland Climate Solutions Now Act (CSNA) mandates that facilities over 3252 m2 (35,000 sq. ft.) will pay an alternative compliance fee for every metric ton of net direct emissions exceeding the annual standard [19]. This focuses on Scope 1 emissions from on-site fossil fuel use, such as natural gas, oil, propane, steam, and hot water. Non-compliance fees will start in 2030, with a penalty of $230 per metric ton of excess CO2e, adjusted for inflation, based on 2020 dollars. The non-compliance fees until 2040 can be calculated using Equation (10), shown below.
\text{Non-Compliance Fee } (\$) = \$230 \times (1 + i)^{(\text{Year} - 2020)} \quad (10)
where ‘i’ is the rate of inflation.
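As a worked illustration of Equation (10), the sketch below computes annual fees for 2030–2040 under a 5% inflation assumption; the excess-emission values are hypothetical placeholders, not the forecasts produced by the models.

```matlab
% Minimal sketch of Equation (10): non-compliance fees for 2030-2040.
% 'excessCO2e' is a hypothetical vector of excess on-site CO2e emissions
% (metric tons per year), not the study's forecasted values.
years      = 2030:2040;
i          = 0.05;                             % assumed inflation rate
excessCO2e = 55000 * ones(size(years));        % placeholder emissions

feePerTon = 230 * (1 + i).^(years - 2020);     % $/metric ton, Equation (10)
annualFee = feePerTon .* excessCO2e;           % dollars per year
fprintf('Total 2030-2040 fees: $%.1f million\n', sum(annualFee)/1e6);
```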
Using a conservative inflation rate of 5% for Maryland, as provided by the Environmental Protection Agency (EPA) [121], we calculate the non-compliance fees for the building portfolio from 2030 to 2040. Figure 13 shows the rates for excess on-site CO2e emissions. These fees are determined using the specified rates alongside the machine learning forecasts for the portfolio of buildings. Additionally, the forecasts for this portfolio illustrate the “cost of inaction,” representing a scenario in which no energy efficiency upgrades or improvements are implemented at the facilities. Leveraging the machine learning forecasts and the non-compliance fees, it was found that the cluster of buildings would incur an annual average penalty of $31.7 million ($1.09/sq. ft.) and ~$348.7 million ($12.03/sq. ft.) over 11 years in carbon non-compliance fees. Figure 14 shows the non-compliance fees from 2030 to 2040 for the building portfolio. The figure shows a significant increase in penalties incurred by the portfolio in the years 2035 and 2040, respectively. This is owing to the nature of the mandates set by [19], which increase the stringency of the benchmark performances in blocks of 5 years, i.e., 2030–2034, 2035–2039, and 2040 and beyond. Moreover, the secondary y-axis illustrates the quantity of on-site fossil fuel CO2e emissions the cluster of buildings must reduce to prevent incurring carbon non-compliance fees. As with the rise in non-compliance fees, the stringent mandates dictate the reduction in on-site fossil fuel CO2e emissions required of the building cluster [19]. This analysis empowers facility managers, building owners, and policymakers to evaluate their facilities’ performance, paving the way for transformative energy efficiency measures that enhance energy performance and drive the decarbonization of their operations. We utilize a conservative 5% inflation assumption (based on the EPA guidelines) for 2030–2040. We also evaluated inflation rates of 3% and 7% to account for uncertainty: the total non-compliance fee is approximately $233.4 million at 3% (compared to $348.7 million at 5%) and around $427.8 million at 7%. In all scenarios, the penalties for non-compliance remain significant, underscoring the validity of our findings and highlighting the urgent need for facility owners and building managers to prioritize prompt decarbonization measures.

5. Conclusions and Limitations

5.1. Conclusions

This study presented a novel forecasting technique that utilizes two machine learning models, with the output of one model serving as the input for the other. The models were developed using data from more than 300 commercial buildings across sixteen different building types, covering a total of 6.6 million square meters (approximately 71 million square feet), and aimed at forecasting energy consumption and fossil fuel emission trends. By employing a Gaussian process regression model alongside a least squares boost ensemble model, the study generated long-term forecasts for energy consumption and on-site fossil fuel CO2e emissions for the building portfolio at hand. These forecasts were designed to reflect the scenario of the “cost of inaction,” assuming that these facilities do not undertake any energy efficiency improvements or retrofits to enhance their energy performance. By adopting this approach, estimating the carbon non-compliance penalties each facility would face if it exceeded permissible on-site fossil fuel CO2e emissions became possible. For the building portfolio under investigation, encompassing over 100 buildings, it was found that the cluster of buildings would incur an annual average penalty of $31.7 million ($1.09/sq. ft.) and ~$348.7 million ($12.03/sq. ft.) over 11 years in carbon non-compliance fees. To comply with the regulation, the selected building portfolio would need to reduce on-site fossil fuel CO2e emissions by 58,246 metric tons on average annually and 640,708 metric tons over 11 years. The proposed machine learning framework is scalable and adaptable for larger portfolios, including public sector facilities like schools and hospitals. The models used, especially LSBoost and SVM, were chosen for their computational efficiency, allowing application to datasets with hundreds to thousands of buildings. While Gaussian process regression has computational limits due to its O(N3) complexity, alternatives like sparse GPR and ensemble methods can be integrated. This study highlights the urgent need for action on energy efficiency and decarbonization, contributing a valuable tool for carbon compliance and energy-saving opportunities in the global building sector.

5.2. Limitations

Energy and GHG emission forecasting offer insights into a facility’s performance and assist in planning energy efficiency and decarbonization. However, the assumptions in each model may not precisely reflect actual performance due to the black-box nature of machine learning. The “cost of inaction” scenario presumes no energy efficiency improvements in the portfolio over 18 years. To enhance the accuracy of machine learning models, a larger dataset of facilities, along with historical data beyond 2018 and additional features, is needed to prevent overfitting and improve predictions with larger scalability and generalization. Another limitation of the study is the limited availability of HVAC equipment data for the buildings, with information available for only about 10% of the portfolio. This lack of data restricts the use of features necessary for assessing energy performance and developing complex models. Additionally, insufficient data on lighting power densities and real-time occupancy levels hamper feature engineering and accurate forecasts. This study underscores that there is no universal solution for decarbonization and energy efficiency; each facility should evaluate its equipment to target Scope-1 emissions and work towards net-zero emissions in line with the Paris Agreement goals.

Author Contributions

Conceptualization, A.R., F.d.C. and A.S.; Methodology, A.R. and A.S.; Validation, A.R.; Formal analysis, A.R., F.d.C. and A.S.; Investigation, A.R., F.d.C. and A.S.; Resources, M.O.; Data curation, A.R.; Writing—original draft, A.R. and F.d.C.; Writing—review & editing, A.R., F.d.C., A.S. and M.O.; Visualization, A.R.; Supervision, F.d.C., A.S. and M.O.; Project administration, M.O.; Funding acquisition, M.O. All authors have read and agreed to the published version of the manuscript.

Funding

The authors declare no competing financial interests. This research received no external funding.

Data Availability Statement

The datasets presented in this article are not readily available because of data privacy reasons based on requests from the sponsors. Requests to access the datasets should be directed to the corresponding author.

Acknowledgments

The authors thank the Office of Energy Sustainability within the State of Maryland’s Department of General Services (DGS) for supporting this project.

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

ANN	Artificial Neural Network
ASHRAE	American Society of Heating, Refrigerating, and Air-Conditioning Engineers
BEPSs	Building Energy Performance Standards
BIM	Building Information Model
BPSs	Building Performance Standards
Btu	British Thermal Units
CBECS	Commercial Buildings Energy Consumption Survey
CDD	Cooling Degree Day
COP	Coefficient of Performance
CSNA	Climate Solutions Now Act
DOE	Department of Energy
EEM	Energy Efficiency Measure
EO	Executive Order
EUI	Energy Use Intensity
GB	Gradient Boosting
GHG	Greenhouse Gas
GJ	Gigajoules
GMM	Gaussian Mixture Model
GPR	Gaussian Process Regression
GT	Gigatons
HDD	Heating Degree Day
HVAC	Heating, Ventilation, and Air Conditioning
LED	Light Emitting Diode
LPD	Lighting Power Density
MAE	Mean Absolute Error
MLR	Multiple Linear Regression
MMmt	Million Metric Tons
MRMR	Minimum Redundancy Maximum Relevance
PCA	Principal Component Analysis
RMSE	Root Mean Square Error
SVM	Support Vector Machine
TRNSYS	TRaNsient SYstems Simulation
xAI	Explainable Artificial Intelligence
ϕ(X)	Maps X non-linearly to a high-dimensional feature space
Loss function
σ	White noise
Z	Validation parameter
C	Tree regression weak learners

References

  1. IPCC. AR6 (2021), Summary for Policymakers. In Climate Change 2021: The Physical Science Basis; Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change; IPCC: Geneva, Switzerland, 2021. [Google Scholar]
  2. NPUC. Race to Net Zero: Carbon Neutral Goals by Country; NPUC: Ridgefield, WA, USA, 2021. [Google Scholar]
  3. T.I.A.S. Paris Agreement to the United Nations Framework Convention on Climate Change; No. 16–1104; T.I.A.S: Tilburg, The Netherlands, 2015. [Google Scholar]
  4. Tracker, C.A. Warming Projections Global Update; Climate Analytics and New Climate Institute: Berlin, Germany, 2021. [Google Scholar]
  5. Hoegh-Guldberg, O.; Jacob, D.; Taylor, M.; Bolaños, T.G.; Bindi, M.; Brown, S.; Camilloni, I.A.; Diedhiou, A.; Djalante, R.; Ebi, K.; et al. The human imperative of stabilizing global climate change at 1.5 °C. Science 2019, 365, eaaw6974. [Google Scholar] [CrossRef]
  6. Betts, R.A.; Alfieri, L.; Bradshaw, C.; Caesar, J.; Feyen, L.; Friedlingstein, P.; Gohar, L.; Koutroulis, A.; Lewis, K.; Morfopoulos, C.; et al. Changes in climate extremes, fresh water availability and vulnerability to food insecurity projected at 1.5 °C and 2 °C global warming with a higher-resolution global climate model. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2018, 376, 20160452. [Google Scholar] [CrossRef] [PubMed]
  7. Huovila, P. Buildings and Climate Change: Status, Challenges, and Opportunities; UNEP SBCI: Nairobi, Kenya, 2007. [Google Scholar]
  8. IEA. Tracking Buildings 2023; IEA: Paris, France, 2023. [Google Scholar]
  9. Allouhi, A.; El Fouih, Y.; Kousksou, T.; Jamil, A.; Zeraouli, Y.; Mourad, Y. Energy consumption and efficiency in buildings: Current status and future trends. J. Clean. Prod. 2015, 109, 118–130. [Google Scholar] [CrossRef]
  10. Frequently Asked Questions (faqs)—U.S. Energy Information Administration (EIA). Available online: https://www.eia.gov/tools/faqs/faq.php?id=86&t=1 (accessed on 24 October 2024).
  11. EIA. U.S. Energy Information Administration—EIA—Independent Statistics and Analysis. Available online: https://www.eia.gov/environment/emissions/carbon/ (accessed on 24 October 2024).
  12. Bell, M. Energy efficiency in existing buildings: The role of building regulations. In Proceedings of the Construction and Building Research (COBRA) Conference, Leeds, UK, 7–8 September 2004; pp. 7–8. [Google Scholar]
  13. European Commission. Communication from the Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions: Healthier, Safer, More Confident Citizens: A Health and Consumer Protection Strategy: Proposal for a Decision of the European Parliament and of the Council Establishing a Programme of Community Action in the Field of Health and Consumer Protection 2007–2013 (Text with EEA Relevance); Office for Official Publications of the European Communities: Luxembourg, 2005. [Google Scholar]
  14. World Economic Forum. For Net-Zero Cities, We Need to Retrofit Our Older Buildings. Here’s What’s Needed. 2022. Available online: https://www.weforum.org/stories/2022/11/net-zero-cities-retrofit-older-buildings-cop27/#:~:text=Around%2080%25%20of%20the%20buildings,retrofit%20them%20for%20energy%20efficiency (accessed on 24 October 2024).
  15. Aksoezen, M.; Daniel, M.; Hassler, U.; Kohler, N. Building age as an indicator for energy consumption. Energy Build. 2015, 87, 74–86. [Google Scholar] [CrossRef]
  16. California, S. Supporting California’s Goal to Achieve Carbon Neutrality by 2045. Carbon Neutrality by 2045—Office of Planning and Research. Available online: https://opr.ca.gov/climate/carbon-neutrality.html (accessed on 24 October 2024).
  17. GHG Pollution Reduction Roadmap 2.0. GHG Pollution Reduction Roadmap 2.0 | Colorado Energy Office. Available online: https://climate.colorado.gov/greenhouse-gas-roadmap (accessed on 24 October 2024).
  18. Climate Change. DOEE. Available online: https://doee.dc.gov/service/climate-change (accessed on 24 October 2024).
  19. Maryland Senate Bill 528. Climate Solutions Now Act of 2022. Available online: https://mgaleg.maryland.gov/2022RS/bills/sb/sb0528E.pdf (accessed on 24 October 2024).
  20. Greenhouse Gas Emissions Reduction. NYSERDA. Available online: https://www.nyserda.ny.gov/Impact-Greenhouse-Gas-Emissions-Reduction (accessed on 24 October 2024).
  21. Reducing Greenhouse Gas Emissions—Washington State Department of Ecology. Available online: https://ecology.wa.gov/air-climate/reducing-greenhouse-gas-emissions (accessed on 24 October 2024).
  22. Amasyali, K.; El-Gohary, N.M. A review of data-driven building energy consumption prediction studies. Renew. Sustain. Energy Rev. 2018, 81, 1192–1205. [Google Scholar] [CrossRef]
  23. EnergyPlus, National Renewable Energy Laboratory, Building Technologies Office, the US Department of Energy (DOE). Available online: https://energyplus.net/downloads (accessed on 24 October 2024).
  24. TRNSYS, Transient System Simulation Tool, Solar Energy Laboratory, University of Wisconsin-Madison. Available online: https://sel.me.wisc.edu/trnsys/user18-resources/index.html (accessed on 24 October 2024).
  25. Equest, The Quick Energy Simulation Tool, the US Department of Energy (DOE). Available online: http://doe2.com/equest/index.html (accessed on 24 October 2024).
  26. ESP-r, University of Strathclyde, Department of Mechanical Engineering, Scotland, UK. Available online: https://www.strath.ac.uk/research/energysystemsresearchunit/applications/esp-r/ (accessed on 24 October 2024).
  27. DOE-2, DOE-2 Building Energy Use Modeling, the US Department of Energy (DOE). Available online: http://doe2.com/DOE2/index.html (accessed on 24 October 2024).
  28. Ecotect, Autodesk Ecotect Analysis. Available online: https://www.asidek.es/en/architecture-and-engineering/autodesk-ecotect-analysis/ (accessed on 24 October 2024).
  29. Trane Trace® 3D Plus. Trane Commercial HVAC. Available online: https://www.trane.com/commercial/north-america/us/en/products-systems/design-and-analysis-tools/trane-design-tools/trace-3d-plus.html (accessed on 24 October 2024).
  30. Reinhart, C.F.; Davila, C.C. Urban building energy modeling—A review of a nascent field. Build. Environ. 2016, 97, 196–202. [Google Scholar] [CrossRef]
  31. Esen, H.; Esen, M.; Ozsolak, O. Modelling and experimental performance analysis of solar-assisted ground source heat pump system. J. Exp. Theor. Artif. Intell. 2015, 29, 1–17. [Google Scholar] [CrossRef]
  32. Wei, Y.; Zhang, X.; Shi, Y.; Xia, L.; Pan, S.; Wu, J.; Han, M.; Zhao, X. A review of data-driven approaches for prediction and classification of building energy consumption. Renew. Sustain. Energy Rev. 2018, 82, 1027–1047. [Google Scholar] [CrossRef]
  33. Maryland Building Energy Performance Standards. Maryland Department of the Environment. Available online: https://mde.maryland.gov/programs/air/ClimateChange/Pages/BEPS.aspx (accessed on 24 October 2024).
  34. Chen, Y.; Guo, M.; Chen, Z.; Chen, Z.; Ji, Y. Physical energy and data-driven models in building energy prediction: A review. Energy Rep. 2022, 8, 2656–2671. [Google Scholar] [CrossRef]
  35. Ma, M.; Zhou, N.; Feng, W.; Yan, J. Challenges and opportunities in the global net-zero building sector. Cell Rep. Sustain. 2024, 1, 100154. [Google Scholar] [CrossRef]
  36. Yan, R.; Ma, M.; Zhou, N.; Feng, W.; Xiang, X.; Mao, C. Towards COP27: Decarbonization patterns of residential building in China and India. Appl. Energy 2023, 352, 122003. [Google Scholar] [CrossRef]
  37. Higgins, J.; Ramnarayan, A.; Family, R.; Ohadi, M. Analysis of Energy Efficiency Opportunities for a Public Transportation Maintenance Facility—A Case Study. Energies 2024, 17, 1907. [Google Scholar] [CrossRef]
  38. Neto, A.H.; Fiorelli, F.A.S. Comparison between detailed model simulation and artificial neural network for forecasting building energy consumption. Energy Build. 2008, 40, 2169–2176. [Google Scholar] [CrossRef]
  39. Kim, J.B.; Jeong, W.; Clayton, M.J.; Haberl, J.S.; Yan, W. Developing a physical BIM library for building thermal energy simulation. Autom. Constr. 2015, 50, 16–28. [Google Scholar] [CrossRef]
  40. Modelica, Modelica Language, Modelica Association, Simulation in Europe Basic Research Working Group (SiE-WG). Available online: https://simulationresearch.lbl.gov/modelica/ (accessed on 24 October 2024).
  41. Chen, Y.; Chen, Z.; Xu, P.; Li, W.; Sha, H.; Yang, Z.; Li, G.; Hu, C. Quantification of electricity flexibility in demand response: Office building case study. Energy 2019, 188, 116054. [Google Scholar] [CrossRef]
  42. Ramnarayan, A.; Sarmiento, A.; Gerami, A.; Ohadi, M. An energy consumption intensity ranking system for rapid energy efficiency evaluation of a cluster of commercial buildings. ASHRAE Trans. 2023, 129, 602–611. [Google Scholar]
  43. Ramnarayan, A.; Sarmiento, A.; Ohadi, M. A Multi-Criterion Ranking Approach for Energy Efficiency Ranking of a Cluster of Commercial Buildings. ASHRAE Trans. 2025, 131. [Google Scholar]
  44. Zargarzadeh, S.; Ramnarayan, A.; de Castro, F.; Ohadi, M. ML-Enabled Solar PV Electricity Generation Projection for a Large Academic Campus to Reduce Onsite CO2 Emissions. Energies 2024, 17, 6188. [Google Scholar] [CrossRef]
  45. Olu-Ajayi, R.; Alaka, H.; Sulaimon, I.; Sunmola, F.; Ajayi, S. Building energy consumption prediction for residential buildings using deep learning and other machine learning techniques. J. Build. Eng. 2022, 45, 103406. [Google Scholar] [CrossRef]
  46. Zhang, Y.; Teoh, B.K.; Wu, M.; Chen, J.; Zhang, L. Data-driven estimation of building energy consumption and GHG emissions using explainable artificial intelligence. Energy 2022, 262, 125468. [Google Scholar] [CrossRef]
  47. Chen, Y.; Xu, P.; Chu, Y.; Li, W.; Wu, Y.; Ni, L.; Bao, Y.; Wang, K. Short-term electrical load forecasting using the Support Vector Regression (SVR) model to calculate the demand response baseline for office buildings. Appl. Energy 2017, 195, 659–670. [Google Scholar] [CrossRef]
  48. Wang, Z.; Hong, T.; Piette, M.A. Building thermal load prediction through shallow machine learning and deep learning. Appl. Energy 2020, 263, 114683. [Google Scholar] [CrossRef]
  49. Yousaf, S.; Bradshaw, C.R.; Kamalapurkar, R.; San, O. Development and performance evaluation of empirical models compatible with building energy modeling engines for unitary equipment. Sci. Technol. Built Environ. 2025, 31, 533–550. [Google Scholar] [CrossRef]
  50. Yousaf, S.; Bradshaw, C.R.; Kamalapurkar, R.; San, O. A gray-box model for unitary air conditioners developed with symbolic regression. Int. J. Refrig. 2024, 168, 696–707. [Google Scholar] [CrossRef]
  51. Gassar, A.A.A.; Cha, S.H. Energy prediction techniques for large-scale buildings towards a sustainable built environment: A review. Energy Build. 2020, 224, 110238. [Google Scholar] [CrossRef]
  52. Dong, B.; Li, Z.; Rahman, S.M.; Vega, R. A hybrid model approach for forecasting future residential electricity consumption. Energy Build. 2016, 117, 341–351. [Google Scholar] [CrossRef]
  53. Xu, X.; Taylor, J.E.; Pisello, A.L.; Culligan, P.J. The impact of place-based affiliation networks on energy conservation: An holistic model that integrates the influence of buildings, residents and the neighborhood context. Energy Build. 2012, 55, 637–646. [Google Scholar] [CrossRef]
  54. Bakır, H.; Ağbulut, Ü.; Gürel, A.E.; Yıldız, G.; Güvenç, U.; Soudagar, M.E.M.; Hoang, A.T.; Deepanraj, B.; Saini, G.; Afzal, A. Forecasting of future greenhouse gas emission trajectory for India using energy and economic indexes with various metaheuristic algorithms. J. Clean. Prod. 2022, 360, 131946. [Google Scholar] [CrossRef]
  55. Sen, P.; Roy, M.; Pal, P. Application of ARIMA for forecasting energy consumption and GHG emission: A case study of an Indian pig iron manufacturing organization. Energy 2016, 116, 1031–1038. [Google Scholar] [CrossRef]
  56. Javed, S.A.; Cudjoe, D. A novel grey forecasting of greenhouse gas emissions from four industries of China and India. Sustain. Prod. Consum. 2022, 29, 777–790. [Google Scholar] [CrossRef]
  57. Tanaka, K. Review of policies and measures for energy efficiency in industry sector. Energy Policy 2011, 39, 6532–6550. [Google Scholar] [CrossRef]
  58. Chang, L.; Mohsin, M.; Hasnaoui, A.; Taghizadeh-Hesary, F. Exploring carbon dioxide emissions forecasting in China: A policy-oriented perspective using projection pursuit regression and machine learning models. Technol. Forecast. Soc. Change 2023, 197, 122872. [Google Scholar] [CrossRef]
  59. Nam, K.; Hwangbo, S.; Yoo, C. A deep learning-based forecasting model for renewable energy scenarios to guide sustainable energy policy: A case study of Korea. Renew. Sustain. Energy Rev. 2020, 122, 109725. [Google Scholar] [CrossRef]
  60. Anand, P.; Deb, C.; Yan, K.; Yang, J.; Cheong, D.; Sekhar, C. Occupancy-based energy consumption modelling using machine learning algorithms for institutional buildings. Energy Build. 2021, 252, 111478. [Google Scholar] [CrossRef]
  61. Veiga, R.; Veloso, A.; Melo, A.; Lamberts, R. Application of machine learning to estimate building energy use intensities. Energy Build. 2021, 249, 111219. [Google Scholar] [CrossRef]
  62. Egwim, C.N.; Alaka, H.; Egunjobi, O.O.; Gomes, A.; Mporas, I. Comparison of machine learning algorithms for evaluating building energy efficiency using big data analytics. J. Eng. Des. Technol. 2022, 22, 1325–1350. [Google Scholar] [CrossRef]
  63. Dong, B.; Cao, C.; Lee, S.E. Applying support vector machines to predict building energy consumption in tropical region. Energy Build. 2005, 37, 545–553. [Google Scholar] [CrossRef]
  64. Massana, J.; Pous, C.; Burgas, L.; Melendez, J.; Colomer, J. Short-term load forecasting in a non-residential building contrasting models and attributes. Energy Build. 2015, 92, 322–330. [Google Scholar] [CrossRef]
  65. Jain, R.K.; Smith, K.M.; Culligan, P.J.; Taylor, J.E. Forecasting energy consumption of multi-family residential buildings using support vector regression: Investigating the impact of temporal and spatial monitoring granularity on performance accuracy. Appl. Energy 2014, 123, 168–178. [Google Scholar] [CrossRef]
  66. Seyedzadeh, S.; Rahimian, F.P.; Glesk, I.; Roper, M. Machine learning for estimation of building energy consumption and performance: A review. Vis. Eng. 2018, 6, 5. [Google Scholar] [CrossRef]
  67. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28. [Google Scholar] [CrossRef]
  68. Zeng, A.; Ho, H.; Yu, Y. Prediction of building electricity usage using Gaussian Process Regression. J. Build. Eng. 2020, 28, 101054. [Google Scholar] [CrossRef]
  69. Rastogi, P.; Khan, M.E.; Andersen, M. Gaussian-process-based emulators for building performance simulation. In Proceedings of the BS 2017, San Francisco, CA, USA, 7–9 August 2017. [Google Scholar]
  70. Zhang, Y.; O’Neill, Z.; Dong, B.; Augenbroe, G. Comparisons of inverse modeling approaches for predicting building energy performance. Build. Environ. 2015, 86, 177–190. [Google Scholar] [CrossRef]
  71. Srivastav, A.; Tewari, A.; Dong, B. Baseline building energy modeling and localized uncertainty quantification using Gaussian mixture models. Energy Build. 2013, 65, 438–447. [Google Scholar] [CrossRef]
  72. Li, Z.; Han, Y.; Xu, P. Methods for benchmarking building energy consumption against its past or intended performance: An overview. Appl. Energy 2014, 124, 325–334. [Google Scholar] [CrossRef]
  73. Heo, Y.; Choudhary, R.; Augenbroe, G. Calibration of building energy models for retrofit analysis under uncertainty. Energy Build. 2012, 47, 550–560. [Google Scholar] [CrossRef]
  74. Heo, Y.; Zavala, V.M. Gaussian process modeling for measurement and verification of building energy savings. Energy Build. 2012, 53, 7–18. [Google Scholar] [CrossRef]
  75. Noh, H.Y.; Rajagopal, R. Data-driven forecasting algorithms for building energy consumption. In Sensors and Smart Structures Technologies for Civil, Mechanical, and Aerospace Systems 2013; SPIE: Bellingham, WA, USA, 2013; Volume 8692, pp. 221–228. [Google Scholar]
  76. Shi, J.Q.; Choi, T. Gaussian Process Regression Analysis for Functional Data; CRC Press: Boca Raton, FL, USA, 2011. [Google Scholar]
  77. Ahmad, T.; Chen, H. Nonlinear autoregressive and random forest approaches to forecasting electricity load for utility energy management systems. Sustain. Cities Soc. 2019, 45, 460–473. [Google Scholar] [CrossRef]
  78. Ke, J.; Qin, Y.; Wang, B.; Yang, S.; Wu, H.; Yang, H.; Zhao, X.; Guo, H. Data-Driven Predictive Control of Building Energy Consumption under the IoT Architecture. Wirel. Commun. Mob. Comput. 2020, 2020, 8849541. [Google Scholar] [CrossRef]
  79. Silva, J.; Praça, I.; Pinto, T.; Vale, Z. Energy consumption forecasting using ensemble learning algorithms. In Proceedings of the Distributed Computing and Artificial Intelligence, 16th International Conference, Special Sessions, Avila, Spain, 26–28 June 2019; Springer International Publishing: Cham, Switzerland, 2019; pp. 5–13. [Google Scholar]
  80. Wang, Z.; Srinivasan, R.; Wang, Y. Homogeneous ensemble model for building energy prediction: A case study using ensemble regression tree. In Proceedings of the 2016 ACEEE Summer Study on Energy Efficiency in Buildings, Pacific Grove, CA, USA, 21–26 August 2016; pp. 21–26. [Google Scholar]
  81. EnergyCAP. Energy Management & Monitoring Solutions. Available online: https://www.energycap.com/ (accessed on 24 October 2024).
  82. Zhao, Y.; Li, T.; Zhang, X.; Zhang, C. Artificial intelligence-based fault detection and diagnosis methods for building energy systems: Advantages, challenges and the future. Renew. Sustain. Energy Rev. 2019, 109, 85–101. [Google Scholar] [CrossRef]
  83. Little, R.J.; Rubin, D.B. Statistical Analysis with Missing Data; John Wiley & Sons: Hoboken, NJ, USA, 2019; Volume 793. [Google Scholar]
  84. Zheng, A.; Casari, A. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2018. [Google Scholar]
  85. ANSI/ASHRAE Standard 62.1-2019; Ventilation for Acceptable Indoor Air Quality. American Society of Heating, Refrigerating, and Air-Conditioning Engineers, Inc.: Atlanta, GA, USA, 2019.
  86. Giannelos, S.; Bellizio, F.; Strbac, G.; Zhang, T. Machine learning approaches for predictions of CO2 emissions in the building sector. Electr. Power Syst. Res. 2024, 235, 110735. [Google Scholar] [CrossRef]
  87. Cabello-Solorzano, K.; de Araujo, I.O.; Peña, M.; Correia, L.; Tallón-Ballesteros, A.J. The impact of data normalization on the accuracy of machine learning algorithms: A comparative analysis. In Proceedings of the International Conference on Soft Computing Models in Industrial and Environmental Applications, Salamanca, Spain, 5–7 August 2023; Springer: Cham, Switzerland, 2023; pp. 344–353. [Google Scholar]
  88. Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature selection: A data perspective. ACM Comput. Surv. (CSUR) 2017, 50, 1–45. [Google Scholar] [CrossRef]
  89. Wang, G.; Lauri, F.; El Hassani, A.H. Feature selection by mRMR method for heart disease diagnosis. IEEE Access 2022, 10, 100786–100796. [Google Scholar] [CrossRef]
  90. Li, B.Q.; Hu, L.L.; Chen, L.; Feng, K.Y.; Cai, Y.D.; Chou, K.C. Prediction of protein domain with mRMR feature selection and analysis. PLoS ONE 2012, 7, e39308. [Google Scholar] [CrossRef]
  91. Elssied, N.O.F.; Ibrahim, O.; Osman, A.H. A novel feature selection based on one-way anova f-test for e-mail spam classification. Res. J. Appl. Sci. Eng. Technol. 2014, 7, 625–638. [Google Scholar] [CrossRef]
  92. Dhanya, R.; Paul, I.R.; Akula, S.S.; Sivakumar, M.; Nair, J.J. F-test feature selection in Stacking ensemble model for breast cancer prediction. Procedia Comput. Sci. 2020, 171, 1561–1570. [Google Scholar] [CrossRef]
  93. Wang, X.; Tao, Y.; Zheng, K. Feature selection methods in the framework of mRMR. In Proceedings of the 2018 Eighth International Conference on Instrumentation & Measurement, Computer, Communication and Control (IMCCC), Harbin, China, 19–21 July 2018; IEEE: New York, NY, USA, 2018; pp. 1490–1495. [Google Scholar]
  94. Howley, T.; Madden, M.G.; O’Connell, M.L.; Ryder, A.G. The effect of principal component analysis on machine learning accuracy with high dimensional spectral data. In Proceedings of the International Conference on Innovative Techniques and Applications of Artificial Intelligence, Cambridge, UK, 25 December 2005; Springer: London, UK, 2005; pp. 209–222. [Google Scholar]
  95. Ozdemir, S.; Susarla, D. Feature Engineering Made Easy: Identify Unique Features from Your Dataset in Order to Build Powerful Machine Learning Systems; Packt Publishing Ltd.: Birmingham, UK, 2018. [Google Scholar]
  96. Alkaya, A.; Eker, İ. Variance sensitive adaptive threshold-based PCA method for fault detection with experimental application. ISA Trans. 2011, 50, 287–302. [Google Scholar] [CrossRef]
  97. Jolliffe, I.T. Choosing a subset of principal components or variables. In Principal Component Analysis; Springer: Berlin/Heidelberg, Germany, 2002; pp. 111–149. [Google Scholar]
  98. de Melo Battisti, F.; de Carvalho, T.B.A. Threshold feature selection PCA. In Proceedings of the Symposium on Knowledge Discovery, Mining and Learning (KDMiLe), Campinas, Brasil, 28 November–1 December 2022; SBC: Hong Kong, China, 2022; pp. 50–57. [Google Scholar]
  99. Holland, S.M. Principal Components Analysis (PCA); GA 30602-2501; Department of Geology, University of Georgia: Athens, GA, USA, 2008. [Google Scholar]
  100. Murphy, A.H. Probabilistic weather forecasting. In Probability, Statistics, and Decision Making in the Atmospheric Sciences; CRC Press: Boca Raton, FL, USA, 2019; pp. 337–377. [Google Scholar]
  101. Wu, J.; Chen, X.Y.; Zhang, H.; Xiong, L.D.; Lei, H.; Deng, S.H. Hyperparameter optimization for machine learning models based on Bayesian optimization. J. Electron. Sci. Technol. 2019, 17, 26–40. [Google Scholar]
  102. Rasmussen, C.E. Gaussian processes in machine learning. In Summer School on Machine Learning; Springer: Berlin/Heidelberg, Germany, 2003; pp. 63–71. [Google Scholar]
  103. Williams, C.K.; Rasmussen, C.E. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006; Volume 2, p. 4. [Google Scholar]
  104. Gorbach, N.S.; Bian, A.A.; Fischer, B.; Bauer, S.; Buhmann, J.M. Model selection for gaussian process regression. In Proceedings of the Pattern Recognition: 39th German Conference, GCPR 2017, Basel, Switzerland, 12–15 September 2017; Springer International Publishing: Cham, Switzerland, 2017; pp. 306–318. [Google Scholar]
  105. Abdessalem, A.B.; Dervilis, N.; Wagg, D.J.; Worden, K. Automatic kernel selection for gaussian processes regression with approximate bayesian computation and sequential monte carlo. Front. Built Environ. 2017, 3, 52. [Google Scholar] [CrossRef]
  106. Stuke, A.; Rinke, P.; Todorović, M. Efficient hyperparameter tuning for kernel ridge regression with Bayesian optimization. Mach. Learn. Sci. Technol. 2021, 2, 035022. [Google Scholar] [CrossRef]
  107. Malakouti, S.M.; Menhaj, M.B.; Suratgar, A.A. Applying Grid Search, Random Search, Bayesian Optimization, Genetic Algorithm, and Particle Swarm Optimization to fine-tune the hyperparameters of the ensemble of ML models enhances its predictive accuracy for mud loss. arXiv 2024. [Google Scholar] [CrossRef]
  108. Alibrahim, H.; Ludwig, S.A. Hyperparameter optimization: Comparing genetic algorithm against grid search and bayesian optimization. In Proceedings of the 2021 IEEE Congress on Evolutionary Computation (CEC), Krakow, Poland, 28 June–1 July 2021; IEEE: New York, NY, USA, 2021; pp. 1551–1559. [Google Scholar]
  109. Wang, X.; Hong, L.J.; Jiang, Z.; Shen, H. Gaussian process-based random search for continuous optimization via simulation. Oper. Res. 2025, 73, 385–407. [Google Scholar] [CrossRef]
  110. Palm, N.; Landerer, M.; Palm, H. Gaussian process regression based multi-objective bayesian optimization for power system design. Sustainability 2022, 14, 12777. [Google Scholar] [CrossRef]
  111. Heaton, M.J.; Datta, A.; Finley, A.O.; Furrer, R.; Guinness, J.; Guhaniyogi, R.; Gerber, F.; Gramacy, R.B.; Hammerling, D.; Katzfuss, M.; et al. A case study competition among methods for analyzing large spatial data. J. Agric. Biol. Environ. Stat. 2018, 24, 398–425. [Google Scholar] [CrossRef]
  112. Sharma, N.; Juneja, A. Combining of random forest estimates using LSboost for stock market index prediction. In Proceedings of the 2017 2nd International Conference for Convergence in Technology (I2CT), Mumbai, India, 7–9 April 2017; IEEE: New York, NY, USA, 2017; pp. 1199–1202. [Google Scholar]
  113. Osman, H.; Ghafari, M.; Nierstrasz, O. Hyperparameter optimization to improve bug prediction accuracy. In Proceedings of the 2017 IEEE Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE), Klagenfurt, Austria, 21 February 2017; IEEE: New York, NY, USA, 2017; pp. 33–38. [Google Scholar]
  114. Sunkad, Z.A. Feature selection and hyperparameter optimization of SVM for human activity recognition. In Proceedings of the 2016 3rd International Conference on Soft Computing & Machine Intelligence (ISCMI), Dubai, United Arab Emirates, 23–25 November 2016; IEEE: New York, NY, USA, 2016; pp. 104–109. [Google Scholar]
  115. Mantovani, R.G.; Rossi, A.L.; Vanschoren, J.; Bischl, B.; De Carvalho, A.C. Effectiveness of random search in SVM hyper-parameter tuning. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015; IEEE: New York, NY, USA, 2015; pp. 1–8. [Google Scholar]
  116. Kalita, D.J.; Singh, V.P.; Kumar, V. A survey on SVM hyper-parameters optimization techniques. In Proceedings of the Social Networking and Computational Intelligence: Proceedings of SCI-2018, Bhopal, India, 5–6 October 2018; Springer: Singapore, 2018; pp. 243–256. [Google Scholar]
  117. Governor Moore Signs Executive Order Doubling Maryland’s Energy Conservation Goal Through State Government Greenhouse Gas Reductions; Press Releases—News; Office of Governor Wes Moore: Annapolis, MD, USA, 2023.
  118. Kang, H.; An, J.; Kim, H.; Ji, C.; Hong, T.; Lee, S. Changes in energy consumption according to building use type under COVID-19 pandemic in South Korea. Renew. Sustain. Energy Rev. 2021, 148, 111294. [Google Scholar] [CrossRef]
  119. Mokhtari, R.; Jahangir, M.H. The effect of occupant distribution on energy consumption and COVID-19 infection in buildings: A case study of university building. Build. Environ. 2021, 190, 107561. [Google Scholar] [CrossRef]
  120. Cai, S.; Gou, Z. Impact of COVID-19 on the energy consumption of commercial buildings: A case study in Singapore. Energy Built Environ. 2024, 5, 364–373. [Google Scholar] [CrossRef]
  121. EPA. Fact Sheet: Social Cost of Carbon; United States Environmental Protection Agency (EPA): Washington, DC, USA, 2013.
Figure 1. Research methodology and workflow.
Figure 2. Building types and their number as part of the training dataset.
Figure 3. Feature importance scores for energy consumption prediction using MRMR and F-test algorithms.
Figure 4. Feature importance scores for GHG emission prediction using MRMR and F-test algorithms.
Figure 5. Scree plot showing how many principal components (PCs) to retain in the principal component analysis (PCA).
Figure 6. Probability distribution fits for cooling degree days for the Baltimore weather station.
Figure 7. HDD and CDD forecasts for the Baltimore weather station until 2040.
Figure 8. Actual vs. predicted response for energy consumption for 2023.
Figure 9. Actual vs. predicted response for GHG emissions for 2022.
Figure 10. R-squared comparison for the developed models.
Figure 11. Energy consumption prediction for the building portfolio.
Figure 12. Fossil fuel CO2e emissions prediction for the building portfolio.
Figure 13. Annual carbon non-compliance fees for the building portfolio (2030–2040).
Figure 14. Non-compliance fees incurred by the building portfolio (2030–2040).
Table 1. Building performance standards across different states.
State | Emissions Reduction Goals
California [16] | 57% by 2030; net-zero by 2045
Colorado [17] | 26% by 2025; 50% by 2030; 90% by 2050
District of Columbia [18] | 45% by 2030; 95% by 2050
Maryland [19] | 60% by 2031; net-zero by 2045
New York [20] | 40% by 2030; 85% by 2050
Washington [21] | 58% by 2030; net-zero by 2050
Table 2. Input features used for developing machine learning Models 1 and 2.
Input Features | Energy Consumption (Model 1) | GHG Emission (Model 2)
Building Type | |
Square Footage | |
Age | |
Maximum Occupancy based on ASHRAE Standard 62.1 Levels [85] | |
Cooling Degree Days (2018–2040) | |
Heating Degree Days (2018–2040) | |
Utility Use% (2018–2040) | |
Energy Consumption (2018–2040) | N/A |
Table 3. Accuracy Metrics for Machine Learning Model 1 for Energy Consumption Forecasts.
Model | Training Time (s) | R² | NMAE (kWh) [kBtu] | NRMSE (kWh) [kBtu]
SVM | 121 | 0.82 | 0.227 [0.774] | 0.326 [1.112]
GPR | 608 | 0.91 | 0.057 [0.194] | 0.108 [0.368]
Ensemble (LSBoost) | 606 | 0.87 | 0.121 [0.413] | 0.201 [0.686]
Table 4. Accuracy Metrics for Machine Learning Model 2 for GHG Emissions Forecasts.
Model | Training Time (s) | R² | NMAE (Metric Tons) | NRMSE (Metric Tons)
SVM | 113 | 0.93 | 0.092 | 0.124
GPR | 609 | 0.90 | 0.139 | 0.248
Ensemble (LSBoost) | 601 | 0.96 | 0.067 | 0.112
Table 5. Hyperparameter tuning to identify best models and kernel functions (GPR).
Kernel | Tuning Method | R² | NRMSE (kWh) [kBtu] | Best Hyperparameters
RBF | Grid search | 0.82 | 0.257 [0.877] | Sigma = 0.0017
Matern 5/2 | Bayesian optimization | 0.91 | 0.108 [0.368] | Sigma = 0.0012
Rational quadratic | Random search | 0.89 | 0.117 [0.399] | Sigma = 0.0047
Exponential | Grid search | 0.86 | 0.201 [0.686] | Sigma = 0.0025
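
As a companion to Table 5, the short sketch below compares candidate GPR kernels on a held-out split by R² and normalized RMSE. It uses scikit-learn rather than the study's toolchain, and the synthetic arrays X and y stand in for the portfolio data, so the printed numbers are illustrative only.

```python
# Sketch of GPR kernel comparison (cf. Table 5). Data are synthetic placeholders.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 5))                               # placeholder features
y = X @ rng.uniform(size=5) + 0.05 * rng.normal(size=400)    # placeholder target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

kernels = {
    "RBF": RBF(),
    "Matern 5/2": Matern(nu=2.5),
    "Rational quadratic": RationalQuadratic(),
    "Exponential": Matern(nu=0.5),     # Matern with nu = 0.5 is the exponential kernel
}
for name, kernel in kernels.items():
    gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-3, normalize_y=True).fit(X_tr, y_tr)
    pred = gpr.predict(X_te)
    nrmse = np.sqrt(mean_squared_error(y_te, pred)) / (y_te.max() - y_te.min())
    print(f"{name:>20s}  R2={r2_score(y_te, pred):.2f}  NRMSE={nrmse:.3f}")
```
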
Table 6. Hyperparameter tuning to identify best models (LSBoost).
Number of Learners | Tuning Method | R² | NRMSE (Metric Tons of CO2e) | Learning Rate
15 | Grid Search | 0.93 | 0.119 | 0.4642
18 | Bayesian Optimization | 0.96 | 0.112 | 0.3758
63 | Random Search | 0.89 | 0.291 | 0.0813
Table 7. Hyperparameter tuning to identify best models and kernel functions (SVM).
Kernel | Tuning Method | R² | NRMSE (kWh) [kBtu] | Best Hyperparameters
RBF | Grid search | 0.75 | 0.412 [1.406] | C = 9, ε = 0.13
Linear | Random search | 0.82 | 0.326 [1.112] | C = 2, ε = 0.01
Polynomial | Bayesian optimization | 0.79 | 0.381 [1.299] | C = 4, ε = 0.05
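
In the same spirit, the following sketch shows one way to run an SVM search over kernel, C, and ε such as the one summarized in Table 7. The use of scikit-learn's GridSearchCV, the synthetic data, and the candidate ranges (which merely echo the values in the table) are assumptions for illustration, not the study's actual tuning setup.

```python
# Sketch of an SVM hyperparameter search (cf. Table 7): kernel, C, and epsilon
# tuned by cross-validated grid search on synthetic placeholder data.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 5))                              # placeholder features
y = X @ rng.uniform(size=5) + 0.1 * rng.normal(size=300)    # placeholder target

pipe = make_pipeline(StandardScaler(), SVR())
param_grid = {
    "svr__kernel": ["linear", "rbf", "poly"],
    "svr__C": [1, 2, 4, 9],
    "svr__epsilon": [0.01, 0.05, 0.13],
}
search = GridSearchCV(pipe, param_grid, scoring="r2", cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```
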