Article

Predicting the Energy Consumption in Chillers: A Comparative Study of Supervised Machine Learning Regression Models

by Mohamed Salah Benkhalfallah 1,2,*, Sofia Kouah 1,2,* and Saad Harous 3,*

1 Department of Mathematics and Computer Science, University of Oum El Bouaghi, Oum El Bouaghi 04000, Algeria
2 Artificial Intelligence and Autonomous Things Laboratory, University of Oum El Bouaghi, Oum El Bouaghi 04000, Algeria
3 College of Computing and Informatics, University of Sharjah, Sharjah 27272, United Arab Emirates
* Authors to whom correspondence should be addressed.
Energies 2025, 18(14), 3672; https://doi.org/10.3390/en18143672
Submission received: 27 April 2025 / Revised: 19 June 2025 / Accepted: 27 June 2025 / Published: 11 July 2025

Abstract

Optimization of energy consumption in urban infrastructures is essential to achieve sustainability and reduce environmental impacts. In particular, accurate regression-based forecasting of the energy consumption in various sectors plays a key role in informed decision-making, efficiency improvements, and resource allocation. This paper examines the application of artificial intelligence and supervised machine learning techniques to modeling and predicting the energy consumption patterns in the smart grid sector of a commercial building located in Singapore. By evaluating the performance of several regression algorithms using various metrics, this study identifies the most effective method for analyzing sectoral energy consumption. The results show that the Regression Tree Ensemble algorithm outperforms the other techniques, achieving an accuracy of 97.00%, followed by Random Forest Regression (96.20%) and Gradient Boosted Regression Trees (95.50%). These results underline the potential of machine learning models to foster intelligent energy management and promote sustainable energy practices in smart cities.

1. Introduction

Energy has transformed human life owing to its essential role in all aspects of our lives. It is central to investment, innovation, mobility, and evolving trends across various sectors, contributing to growth, technological advancement, and the broader pursuit of global intelligent prosperity [1]. Information technology has undergone rapid and remarkable evolution in recent decades [2], driven in particular by the emergence and maturation of innovative paradigms such as digital infrastructures, interconnected networks, the Internet of Things (IoT), electronic governance systems, and large-scale data analytics platforms [3]. Among the most transformative advancements are artificial intelligence (AI) and machine learning (ML), which have become instrumental in enhancing the efficiency, adaptability, and productivity of contemporary energy systems. These technologies empower energy experts to process and interpret vast volumes of heterogeneous energy data, providing comprehensive insights into consumption behaviors, generation capabilities, and storage dynamics [4]. Such analytical capabilities enable the design of more resilient and optimal energy systems while also facilitating accurate forecasting of future energy demands. A body of research [5,6,7,8,9,10,11,12] has successfully leveraged various AI methodologies to forecast energy consumption, thereby supporting strategic energy planning, optimizing operational performance, and ensuring the stability of systems operating under diverse environmental and infrastructural conditions.
The present study aims to advance the performance, relevance, and predictive capability of chiller energy consumption models by employing an enhanced suite of assessment metrics and a diverse array of supervised ML algorithms. These include Linear Regression (LR), Gradient Boosted Regression Trees (GBRTs), Random Forest Regression (RFR), Simple Regression Tree (SRT), Polynomial Regression (PR), and Regression Tree Ensemble (RTE), each selected for its demonstrated proficiency in extracting patterns from labeled datasets and producing highly accurate predictions in regression contexts. These models offer significant advantages in energy forecasting applications: interpretability, which supports transparent and explainable decision-making; scalability, which enables integration across various building sizes and infrastructures; and robustness, which ensures model resilience under varying operational conditions. Such characteristics are vital for promoting regulatory compliance and operational stability within intelligent energy management frameworks. Moreover, forecasting the energy consumption in commercial buildings offers numerous benefits: it reduces monitoring times and operational costs for both utilities and consumers; accelerates anomaly detection and resolution; simplifies the reporting process; and ultimately contributes to the realization of intelligent, responsive, and sustainable energy management systems [13].
This paper is organized as follows: Section 2 presents an examination of the relevant scientific literature, Section 3 delineates the methodology applied, Section 4 describes the specific tools and platforms employed, Section 5 discusses the results, and finally, this paper closes with a conclusion and future work directions.

2. The Literature Review

A profusion of scholarly research endeavors has been dedicated to investigating the domain of cooling energy, addressing a wide spectrum of challenges, resolutions, and prospects within the field of intelligent energy management. This section presents a comprehensive overview of the existing literature on forecasting the energy consumption for intelligent and optimized energy management. To identify the most impactful and relevant contributions, a systematic bibliographic search was conducted, covering peer-reviewed publications from July 2018 to February 2025. The search encompassed leading scientific databases, including IEEE Xplore, ScienceDirect, SpringerLink, and Google Scholar, and employed targeted keywords such as energy consumption, energy management, ML, supervised learning, regression, predictive models, and data mining. Studies were selected based on strict inclusion criteria: the application of ML algorithms to real-world energy datasets, the implementation of rigorous validation protocols, and the comparison of proposed approaches with conventional methods. The exclusion criteria eliminated purely theoretical studies without empirical validation, those lacking proper documentation, and those relying exclusively on synthetic data. This rigorous selection process ensured the compilation of a representative and high-quality corpus of literature that captured the recent advancements in data-driven energy forecasting, with direct implications for the design and implementation of intelligent energy management systems.
The research [14] investigated the intricate evolution of the energy consumption trends in the United States, propelled by multifaceted drivers such as demographic shifts, technological advancements, and evolving consumer behaviors. This study advances the proposition that ML algorithms substantially enhance the forecasting precision across residential, commercial, and industrial energy sectors. Drawing on comprehensive datasets, this study evaluates diverse ML models, demonstrating the superior predictive reliability of logistic regression within the given context. This research further critiques the limitations of the conventional statistical methods in modeling nonlinearities and behavioral complexities, advocating for a transition toward data-driven, adaptive forecasting frameworks. By positioning accurate energy demand predictions as a foundational element for sustainable development and climate resilience, this study underscores the strategic value of ML in optimizing energy resource management and informing policy interventions. The study [15] investigates the application of advanced ML algorithms to optimizing the energy consumption forecasting for office building HVAC and lighting systems, components that collectively represent a significant share of total energy use, with HVAC alone accounting for up to 40%. Four regression-based models were evaluated: Extra Trees Regressor (ETR), Voting Hybrid Regression (VHR), Multi-Layer Perceptron Regression (MLPR), and K-Nearest Neighbors Regression (K-NN). Leveraging two years of real-world operational data, the ETR model achieved the highest predictive performance, with an R2 of 0.9943 and an RMSE of 0.4352, indicating exceptional accuracy. These findings underscore the effectiveness of ensemble-based methods, particularly boosted regression trees, in modeling complex, nonlinear patterns and managing data variability, thereby offering robust, interpretable, and scalable solutions for sustainable energy management in commercial office environments. The authors of [16] proposed a hybrid forecasting framework that synergistically integrated advanced ML algorithms with metaheuristic optimization techniques to enhance the predictive accuracy of electricity consumption models. Specifically, the framework combines gradient boosting methods with three optimization algorithms. The proposed models were empirically validated using real-world electricity consumption datasets from Turkey. Among the various configurations, the XGBoost-SSA hybrid demonstrates a superior performance, attaining the highest coefficient of determination (R²) and the lowest forecasting error metrics. This study distinguishes itself through its methodological robustness, high predictive fidelity, and tangible applicability to real-world energy management scenarios. Nonetheless, limitations persist regarding the model’s cross-regional generalizability and the absence of benchmarking against deep-learning-based time series models. The study [17] explores the application of ML algorithms to predicting the energy consumption in U.S. residential buildings. It uses the Residential Energy Consumption Survey (RECS) dataset to develop predictive models for the Energy Use Intensity (EUI) in apartments and single-family houses, employing tree-based algorithms such as LightGBM, CatBoost, and XGBoost. 
This study also incorporates SHAP (SHapley Additive exPlanations) to analyze the importance and interactions of household features in determining energy consumption, providing valuable insights for energy-efficient building design and retrofitting strategies. The results highlight key factors, including building size, heating methods, climate conditions, and building age, as significant contributors to energy use. By offering personalized energy saving strategies for different building types, this study contributes to the growing body of knowledge on energy efficiency and the integration of ML into building energy management. U. Ali et al. [18] proposed a scalable and robust framework for forecasting the energy consumption across urban residential building stocks, leveraging ensemble-based ML techniques. This approach integrates data acquisition, archetype development, parametric simulation, and predictive modeling to address the key limitations of conventional urban energy modeling, most notably the limited availability and heterogeneity of large-scale building data. Applied to Ireland’s residential building stock, this framework synthesizes a dataset representing one million dwellings characterized by 19 critical parameters. By disaggregating the end-use energy demands, such as heating, lighting, domestic hot water, photovoltaic generation, and appliance loads, this model enhances the resolution and interpretability. Ensemble learning methods achieve a predictive accuracy of 91%, markedly surpassing the traditional modeling techniques (76%). This study offers a data-driven, policy-relevant tool to support urban planners and decision-makers in evaluating retrofitting strategies and advancing sustainable energy transitions at scale. Sijun Xu et al. [19] studied the effects of individual factors on power savings and thermal management. They summarized the main factors in various cooling systems for reducing the power consumption, realized data management, and described the corresponding research, as well as the optimization methods. They also investigated data center cooling systems and described their principles, which take three main forms: air cooling systems, liquid cooling systems, and free cooling systems. Moreover, the power usage effectiveness (PUE) values and simultaneous cooling loads for different systems are provided. The study [20] presents a robust and interpretable machine learning framework for accurately forecasting the energy consumption for residential heating. The proposed model employs a stacking ensemble architecture comprising LightGBM, Random Forest, and XGBoost, with the hyperparameters optimized via Particle Swarm Optimization (PSO) and preceded by dimensionality reduction through Self-Organizing Maps (SOMs). Achieving a predictive accuracy of 95.4%, the model demonstrates strong generalizability and performance. Beyond prediction, the integration of SHapley Additive exPlanations (SHAP) and causal inference techniques enables both interpretability and the identification of underlying cause–effect relationships. Notably, variations in air and water pipe temperatures were found to significantly impact the energy usage. This methodological framework not only enhances the precision and transparency of energy demand modeling but also offers actionable insights for building managers to implement efficient, cost-effective heating strategies, particularly during high-demand winter periods.
S. Kapp [21] proposes a hybrid modeling framework for forecasting the energy consumption in industrial buildings by integrating domain-specific physical system knowledge with ML techniques. Utilizing data from 45 manufacturing facilities, the model incorporates a comprehensive set of features, including environmental variables (e.g., air enthalpy, solar radiation), support system metrics (e.g., motors, steam usage, compressed air), and operational parameters (e.g., production throughput, workforce levels, and facility size). A linear regression model applied to a transformed feature space outperformed a conventional support vector machine, achieving a superior predictive accuracy while maintaining model interpretability. This physically informed, data-driven approach offers a scalable and practical solution for uncovering energy saving opportunities within complex industrial systems. Mohd Herwan Sulaiman and Zuriani Mustaffa [22] proposed the Barnacles Mating Optimizer (BMO) to solve the optimal chiller loading (OCL) problem by reducing the total energy consumption while considering certain limitations in the multi-cooling system. To show the effectiveness of the BMO, it was tested on three different cooling systems (6-unit, 4-unit, and 3-unit chiller systems) and its results were compared with those of other modern optimization algorithms, where it was recognized that it could provide competitive results and was effective in achieving the lowest energy consumption to solve the OCL problem. The authors of [23] demonstrated that existing Thermal Energy Storage cooling facilities could be one of the most cost-effective resources for achieving state and government carbon neutrality goals, applied model predictive control (MPC), and evaluated the site performance of a campus-wide TES cooling facility. It aims to self-consume electricity generated on site, reduce carbon emissions into the grid, and reduce utility bills. The performance of MPC was evaluated against a carefully selected baseline period. The results of the MPC showed a reduction in the excess PV capacity by approximately 25%, greenhouse gas emissions by 10%, and the peak electricity demand by 10%. In [24], the authors proposed an analysis of the statistical relationship between the energy performance and life cycle costs (LCCs) of a cooling plant operating in medium- and large-scale application scenarios, with an evaluation of its impact under the same heat demand conditions. A case study of a Cuban hotel with 138 sets of differently arranged cooling stations was selected. The results indicated that the design of the overall chiller and the distribution of the cooling capacity between chillers have a significant impact on the energy consumption of the cooling plant, with Spearman’s Rho and Kendall’s Tau correlation indices of 0.625 and 0.559. Considering the LCCs, only the distribution of the cooling capacity between the chillers had an effect, with a Kendall Tau correlation index of 0.289. For the total cooling capacity studied, the statistical test applied indicated that this design variable did not affect the performance of the cooling plant. The study [25] assessed the efficacy of data-driven methodologies for predicting and forecasting the chiller power consumption within HVAC systems using real-time operational data from an academic building in Taiwan.
A comparative analysis was conducted between a conventional thermodynamic linear regression model and a Multi-Layer Perceptron (MLP) neural network for consumption predictions, as well as among three deep learning architectures, MLP, a one-dimensional Convolutional Neural Network (1D-CNN), and Long Short-Term Memory (LSTM), for minute-ahead forecasting. The MLP model demonstrated a superior performance over that of the traditional thermodynamic approaches, yielding an R² of 0.971. For short-term forecasting, the LSTM model outperformed its counterparts with an R² of 0.994, underscoring its capability to capture the temporal dependencies in high-resolution energy data. Beyond the predictive accuracy, this study highlights real-world applications of these models, including proactive maintenance scheduling and intelligent switching of the energy sources based on the anticipated load, thereby advancing the energy efficiency and cost optimization in smart building operations. Jee-Heon Kim et al. [26] conducted a study on developing an energy consumption model for the refrigerant in an HVAC system using an ML algorithm based on artificial neural networks to find the optimal conditions. The developed model was also evaluated for its accuracy. It was improved in terms of various input parameters, as the model was able to predict the power consumption with 99.07% accuracy based on eight input variables. In addition, a standard reference building was designed to generate operating data for the refrigeration system during extended cooling periods (warm-weather months). Table 1 provides a synthesized comparison of the reviewed studies, emphasizing their key findings and contributions.
The previous literature has focused mainly on ensuring efficient energy use and managing its data; improving energy performance, savings, and production; reducing the total power consumption; and lowering utility bills. Various energy systems were scrutinized, elucidating their underlying principles and operational mechanisms. Furthermore, these works analyzed, compared, and evaluated results showing that optimal accuracy can be achieved using a range of developed models, different ML algorithms, and various methodological techniques. However, it is worth noting that none of the aforementioned works applied a broad set of supervised machine learning regression techniques within a single study, even though Pedro C. Albuquerque et al. [29] demonstrated the superior predictive capabilities of supervised regression models in energy forecasting, attributing their efficacy to their ability to exploit labeled datasets to generate highly accurate and operationally actionable predictions. In light of this observation, our research uses several supervised ML regression algorithms within a unified experimental framework to analyze, compare, and evaluate their predictive results while achieving greater accuracy and performance.

3. Methodology

This section describes the conventional design methodology applied in this study, which is divided into three sequential and interdependent phases. The first phase deals with the details of the data selection process, ensuring relevance and representativeness in the context of chiller energy modeling. Then, the second phase explains the pre-processing of the data to enhance their analytical quality and compatibility with ML algorithms. Finally, the third phase comprehensively details different machine learning approaches used to model chiller energy and facilitates the prediction of potential outcomes. This phase encompasses model training, evaluation, and a comparative performance assessment. Figure 1 presents the adopted methodology.

3.1. Data Selection

This study uses a dataset for a commercial building located in Singapore [30], covering the period from 18 August 2019 at 00:00 to 1 June 2020 at 13:00. The dataset is stored in a CSV file with 9 feature sets and 1 target feature. The features include Timestamp, Chilled Water Rate (L/s), Cooling Water Temperature (°C), Building Load (RT), Total Energy (kWh), Temperature (°F), Dew Point (°F), Humidity (%), Wind Speed (mph), Pressure (in), Hour of Day (h), and Day of Week. Table 2 provides a brief description of the dataset’s features.
Figure 2 and Figure 3 illustrate the variable energy consumption of the chillers, with a marked concentration within the 76–150 kWh range. The [101–125] kWh interval alone accounts for 47.35% of the records, indicating a standard operating range for the system under typical load conditions. Extreme values below 75 kWh or above 225 kWh are rare, reflecting atypical or exceptional load conditions. These findings suggest a relatively stable consumption profile, with potential for optimization beyond 150 kWh.

3.2. Data Visualization

The information from the dataset [30] was translated into a visual context and represented in a scatter plot (Figure 4) to improve the interpretability of complex data on the chiller’s energy consumption. This visualization facilitates the identification of underlying trends, anomalies, and correlations that may be obscured in raw numerical datasets. By transforming intricate patterns into an intuitive graphical representation, the scatter plot bridges technical analysis with actionable insights. As a result, it empowers decision-makers to make more informed, effective, and timely decisions, such as detecting abnormal energy usage, optimizing the operational strategies, and preemptively addressing inefficiencies with greater speed and precision.

3.3. Pre-Processing Data

It is recommended that inconsistent, incoherent, missing, noisy, and contradictory data be analyzed and processed before applying various supervised ML regression techniques to the dataset. Such pre-processing guarantees resilient, robust, and precise outcomes. In this case, however, the dataset was carefully processed by the authors before publication. To provide a glimpse into the dataset’s structure, Table 3 displays a sample of rows; the complete dataset comprises 13,615 rows.

3.3.1. Data Splitting

This step can increase scalability, minimize potential conflicts, and enhance the performance by splitting the dataset into two non-overlapping sets: 80% for the training set and 20% for the test set [31]. This split is instrumental in expediting computational procedures and harnessing the power of parallel processing [2]. It allows for more efficient memory usage, aids in model evaluation without compromising unseen data, and supports enhanced experimentation, enabling various model iterations and hyperparameter tuning.
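To make this step concrete, the following minimal Python sketch reproduces an equivalent 80/20 split with scikit-learn; the file name and the target column name are illustrative assumptions rather than the exact configuration of the KNIME Partitioning Node used in this study.

```python
# Illustrative sketch of the 80/20 split (assumed file and column names; the study
# itself performs this step with the KNIME Partitioning Node).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("chiller-energy-data.csv")        # dataset described in Section 3.1
X = df.drop(columns=["Total Energy (kWh)"])        # feature set (assumed target column name)
y = df["Total Energy (kWh)"]                       # target: chiller energy consumption

# 80% training / 20% test; a fixed seed keeps the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```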

3.3.2. The Correlation Matrix

Within the realm of data science, one of the most common statistical measures is the correlation coefficient, since it can be used to detect dependencies in data and identify potential causal relationships. These correlations can be used to determine the importance of the features in ML tasks [32] because the relationship between the feature set and the target feature may serve as a reference for the significance of the feature [33]. To achieve high precision in ML tasks, Karl Pearson’s linear correlation coefficient was employed in this work, as indicated in Equation (1) [34], because it is efficient, accurate, and significant. It is based on the covariance method and is widely used.
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^{2} \sum (Y - \bar{Y})^{2}}}  (1)
where
  • X̄: The mean of the X variable;
  • Ȳ: The mean of the Y variable.
We conduct a comprehensive examination of the data to discern potential correlations among the features. As presented in Figure 5 and Table 4, it appears that there is no significant correlation that can be taken into account.
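For illustration, Equation (1) can be computed directly in Python, or obtained for all feature pairs at once with pandas; the sketch below assumes NumPy arrays and a pandas DataFrame named df, and is not the KNIME Rank Correlation Node itself.

```python
# Pearson's correlation coefficient written out from Equation (1) (illustrative sketch).
import numpy as np

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# The feature/target correlation matrix shown in Figure 5 can be approximated with:
# corr_matrix = df.corr(method="pearson", numeric_only=True)
```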

3.4. An Overview of the Applied Models

The pursuit of intelligent and cost-effective energy management in the future requires the development of prediction models that encapsulate chillers’ energy consumption patterns, drawing upon the available data. For this endeavor, an array of advanced regression techniques has been employed to forge robust and forward-looking models, including LR, GBRT, RFR, SRT, PR, and RTE, for modeling the chiller energy consumption and forecasting potential outcomes.
In the realm of supervised ML, one method employed to rigorously assess the performance of ML algorithms is the use of discrete training and test sets. This method involves the segregation of the data into distinct groups: the first group undertakes the responsibility of training the ML model, rigorously exposing it to diverse datasets and patterns, while the second group focuses on scrutinizing and evaluating the model’s accuracy and predictive capabilities. This bifurcation facilitates a robust evaluation framework, enabling comprehensive assessments of the model’s efficacy, its generalizability, and its adeptness in discerning patterns and making accurate predictions. Next, we briefly describe the different regression techniques considered in this work.

3.4.1. Linear Regression (LR)

Linear regression is one of the simplest and most popular supervised ML algorithms. It operates on the principle of establishing a linear relationship between the input features and the corresponding target variable. It endeavors to identify the best-fitting line that minimizes the discrepancy between the predicted values and the actual values. This algorithm leverages mathematical computations to estimate the coefficients that govern the linear equation, allowing for precise predictions of the target variable based on the provided input features. Linear regression finds extensive utility in diverse domains, including but not limited to predictive modeling, trend analyses, and correlation assessments, making it a foundational and widely employed tool in data analysis and ML tasks [35].
Its parameters and properties are as follows:
  • The applied linear regression equation: Y = a + bX, where
    • X is the explanatory variable;
    • Y is the dependent variable;
    • b is the slope of the line;
    • a is the intercept (the value of Y when X = 0).
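As an illustration of the equation above, the slope and intercept of a simple least-squares fit can be computed in closed form; the sketch below assumes NumPy and a single explanatory variable, whereas the LR model in this study is fitted on the full feature set.

```python
# Closed-form ordinary least squares for Y = a + bX (single-feature sketch).
import numpy as np

def fit_simple_linear(x: np.ndarray, y: np.ndarray) -> tuple:
    b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()  # slope
    a = y.mean() - b * x.mean()                                                # intercept
    return a, b

# For the multivariate case, sklearn.linear_model.LinearRegression estimates the
# coefficients in the same least-squares sense.
```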

3.4.2. Gradient Boosted Regression Trees (GBRTs)

Gradient Boosted Regression Trees or Gradient Boosted Machines (GBMs) are developed based on the Decision Tree (DT) model and derived from the Ensemble Learning Boosting algorithm. GBRT represents a flexible, non-parametric statistical learning technique that ranks among the most potent ML models for predictive analyses, making it a widely adopted method in ML applications. Its distinctive framework enhances the stability, precision, and computational efficiency of the predictions, rendering it the preferred choice for diverse applications demanding optimal accuracy and robustness. The amalgamation of decision trees through boosting mechanisms empowers GBRT to excel in capturing complex relationships and delivering an exceptional predictive performance, establishing its prowess as a prominent tool in the ML paradigm [36].
Its parameters and properties are as follows:
  • Tree options: The number of levels (tree depth) is 4;
  • Boost options: The number of models is 100, the learning rate is 0.1, and the alpha is 0.95;
  • The method for handling missing values is XGBoost.
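A rough scikit-learn equivalent of this configuration is sketched below; the KNIME Gradient Boosted Trees learner is not identical, and in scikit-learn the alpha parameter only takes effect with the 'huber' or 'quantile' losses, so this is an approximation rather than the exact model used.

```python
# Approximate GBRT configuration (100 trees, depth 4, learning rate 0.1, alpha 0.95).
# X_train, y_train, X_test come from the split sketched in Section 3.3.1.
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(
    n_estimators=100,   # "the number of models is 100"
    max_depth=4,        # "the number of levels (tree depth) is 4"
    learning_rate=0.1,
    loss="huber",       # assumption so that alpha=0.95 is meaningful in scikit-learn
    alpha=0.95,
)
gbrt.fit(X_train, y_train)
y_pred_gbrt = gbrt.predict(X_test)
```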

3.4.3. Random Forest Regression (RFR)

Random Forest Regression is a supervised learning algorithm which is one of the most widely used algorithms for regression problems due to its simplicity and high accuracy. This technique operates by combining numerous decision trees, employing a voting mechanism to synthesize their predictions. Typically trained through a bagging methodology, Random Forest Regression unifies the forecasts derived from multiple ML models, endowing the resulting predictions with heightened accuracy and resilience, surpassing the performance of a solitary model, thereby solidifying its standing as an indispensable tool in the realm of ML [37].
Its parameters and properties are as follows:
  • Forest options: The number of models is 100.
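A minimal scikit-learn sketch of this setting, assuming the training/test split from Section 3.3.1, is given below; the KNIME Random Forest learner may differ in implementation details.

```python
# Random Forest Regression with 100 trees (illustrative sketch).
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=100, random_state=42)
rfr.fit(X_train, y_train)          # X_train, y_train from Section 3.3.1
y_pred_rfr = rfr.predict(X_test)
```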

3.4.4. Simple Regression Tree (SRT)

Simple Regression Tree (SRT) represents a distinct approach characterized by its unique attributes. In contrast to other modeling techniques, it embraces a completely different paradigm by eschewing the imposition of any specific functional form, thereby fostering unfettered exploration of intricate covariate interactions. By design, SRT leverages the construction of a variable tree, implicitly enabling seamless interactions between the covariates. This distinctive approach endows it with unparalleled flexibility, which increases its diverse predictive capabilities. By accommodating complex relationships and capturing nonlinear patterns, it enables an improved predictive performance, reinforcing its position as a powerful and indispensable tool in modeling production functions and related fields [38].
Its parameters and properties are as follows:
  • Tree options: The number of levels (tree depth) is 100;
  • The method for handling missing values is Surrogate.
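The sketch below shows a comparable single regression tree in scikit-learn; note that scikit-learn does not implement surrogate splits, so the missing-value handling of the KNIME node is not reproduced.

```python
# A single (deep) regression tree, approximating the SRT configuration.
from sklearn.tree import DecisionTreeRegressor

srt = DecisionTreeRegressor(max_depth=100, random_state=42)
srt.fit(X_train, y_train)          # X_train, y_train from Section 3.3.1
y_pred_srt = srt.predict(X_test)
```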

3.4.5. Polynomial Regression (PR)

Polynomial regression is a sophisticated extension of standard linear regression, possessing the capacity to apprehend intricate nonlinear relationships among variables by fitting a nonlinear regression line. Unlike the constraints imposed by simple linear regression, polynomial regression makes it possible to capture the subtle complexities inherent in the data, where linear models may fall short. By incorporating the Nth-degree polynomial of the predictor variable, this algorithm navigates beyond the confines of linearity, adeptly embracing the intricacies and subtleties that lie within the dataset. This enhanced flexibility and adaptability enable polynomial regression to unearth intricate patterns and hidden dynamics, making it an indispensable tool for modeling scenarios where linear regression models fail to adequately capture the underlying complexity [39].
Its parameters and properties are as follows:
  • The maximum polynomial degree is 3.
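An equivalent degree-3 polynomial regression can be expressed as a scikit-learn pipeline, as sketched below under the same assumptions as the previous examples.

```python
# Degree-3 polynomial regression as a feature expansion followed by least squares.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

pr = make_pipeline(PolynomialFeatures(degree=3, include_bias=False), LinearRegression())
pr.fit(X_train, y_train)           # X_train, y_train from Section 3.3.1
y_pred_pr = pr.predict(X_test)
```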

3.4.6. Regression Tree Ensemble (RTE)

The Regression Tree Ensemble is a powerful supervised ML algorithm that combines multiple regression trees to improve the predictive performance. It falls under the ensemble learning category, where the decisions of multiple individual trees are aggregated to generate the final prediction. Each tree in the ensemble is trained on a random subset of the data and considers a subset of features at each split, which helps prevent overfitting and enhances the model’s ability to generalize. During prediction, each tree contributes its prediction, and the final prediction is obtained by combining these individual predictions, typically through averaging. This ensemble approach enables the model to capture complex relationships, handle nonlinearities, and provide reliable predictions. The Regression Tree Ensemble is widely used in various applications, particularly in regression tasks to accurately estimate continuous target variables [40].
Its parameters and properties are as follows:
  • Tree options: It uses mid-point splits for numeric attributes;
  • The number of models is 100;
  • Attribute sampling: Sample (square root);
  • Attribute selection: It uses a different set of attributes for each tree node.
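These settings correspond closely to a random-forest-style ensemble with square-root attribute sampling at each split; the scikit-learn sketch below is therefore only an approximation of the KNIME Tree Ensemble (Regression) learner.

```python
# Tree ensemble with 100 trees and square-root feature sampling per split (approximation).
from sklearn.ensemble import RandomForestRegressor

rte = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=42)
rte.fit(X_train, y_train)          # X_train, y_train from Section 3.3.1
y_pred_rte = rte.predict(X_test)
```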

3.5. Evaluation Metrics

3.5.1. The Statistical Indicators Used

Several assessment measures were used, as detailed in Table 5, such as R², mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), mean signed difference (MSD), mean absolute percentage error (MAPE), and adjusted R² [41]. By employing these advanced assessment metrics, this study aims to achieve a nuanced and comprehensive appraisal of the predictive capabilities and efficacy exhibited by the supervised ML algorithms.
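For reference, the sketch below gathers these indicators into a single helper function; it mirrors the KNIME Numeric Scorer output only approximately (for example, the sign convention of the mean signed difference and the MAPE scaling are assumptions), and the number of predictors is passed explicitly for the adjusted R².

```python
# Illustrative computation of the statistical indicators listed in Table 5.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_scores(y_true, y_pred, n_features):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R2": r2,
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MSD": float(np.mean(y_pred - y_true)),                      # sign convention assumed
        "MAPE": float(np.mean(np.abs((y_true - y_pred) / y_true))),  # expressed as a fraction
        "Adjusted R2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
    }
```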

3.5.2. The Cross-Validation Strategy

To guarantee a rigorous, reliable, and generalizable evaluation of the performance of the ML models applied, we adopted a cross-validation strategy. Specifically, we used k-fold cross-validation with k = 10. This method consists of dividing the training set into ten subsets of equivalent size. At each iteration, one subset is used for validation, while the other nine are used to train the model. The process is repeated ten times, with each subset acting successively as a validation set. This approach reduces the variance associated with a single slice of the data, limits the risk of overfitting, and provides a more reliable estimate of the model’s performance on unseen data [42].
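The procedure can be summarized by the short scikit-learn sketch below, applied here to the RTE model defined in Section 3.4.6 as an example; the shuffling seed is an assumption.

```python
# 10-fold cross-validation of the R² score on the training set (illustrative sketch).
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(rte, X_train, y_train, scoring="r2", cv=cv)
print(f"CV R2: {scores.mean():.3f} +/- {scores.std():.3f}")
```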

4. The Tools and Platforms Used

This section elucidates the implementation environment, encompassing the specific devices and tools instrumental in predicting the chillers’ energy consumption through the application of ML techniques. Pertinent details are encapsulated within Table 6 and Figure 6.
Table 6. Some of the tools and equipment.

Component                  | Description
Processor                  | Intel(R) Core(TM) i5-6300U CPU @ 2.40 GHz
RAM                        | 8.00 GB
System type                | 64-bit operating system, x64 processor
Operating system           | Windows 10
Implementation environment | KNIME
Programming language       | Python

Figure 6. KNIME workflow.
To enhance clarity, Figure 7 illustrates the specific workflows implemented for each respective regression technique: LR, GBRT, RFR, SRT, PR, and RTE.
  • CSV Reader Node: This reads a CSV file that contains the datasets; this node can access a variety of file systems, which corresponds to the section of this study on the data selection.
  • Table View Node: This allows you to specify the number of different rows and columns to display in the table; this corresponds to the section of this study that provides a glimpse of the data structure.
  • Rank Correlation Node: This determines the strength of the relationship between the selected attributes and the target attributes through a correlation matrix; this corresponds to the “The Correlation Matrix” section of this study.
  • Partitioning Node: In this node, the complete dataset is split into two portions: the training and test data; this corresponds to the Data Splitting section of this study.
  • Numeric Scorer Node: This computes certain statistics between the numeric column’s values and the predicted values. It computes the R² value, mean absolute error, mean squared error, root mean squared error, mean signed difference, mean absolute percentage error, and adjusted R². The computed values can be inspected in the node view and/or processed further using the output table.
  • Column Filter Node: This filters the columns, and you can decide which columns to retain and which to exclude.
  • Line Plot Node: This allows you to view a sample to obtain the desired line plot, including the columns for x and y axes.

5. Results and Discussion

This section outlines the experimental framework and the outcomes of forecasting the chiller energy consumption for a commercial building located in Singapore. Six distinct supervised ML models were employed: LR, GBRT, RFR, SRT, PR, and RTE. The primary objective was to identify the most robust and reliable predictive model through rigorous comparative performance evaluation.
Table 7 presents the quantitative results of the evaluation metrics produced by the different supervised ML models, providing a comprehensive comparative analysis, from which the following was found:
  • The RFR and RTE models exhibit the highest R² values, showcasing strong predictive capabilities.
  • Both RFR and RTE demonstrate notably lower MAEs, indicating their superior prediction accuracy compared to that of the other models.
  • The RFR and RTE models display lower MSE and RMSE values, indicating minimized errors and better precision in prediction.
  • The RFR and RTE models demonstrate minimal prediction discrepancies, confirming a high degree of alignment between the observed and forecasted energy consumption values.
  • With the lowest MAPE among all of the tested models, the superior forecasting accuracy of the RFR and RTE models is further affirmed.
  • Finally, the highest adjusted R² values attained by RFR and RTE suggest not only strong model fits but also robustness against overfitting, enhancing their applicability in real-world energy forecasting contexts.
These results collectively affirm the efficacy of ensemble-based learning methods, particularly RFR and RTE, in delivering accurate, stable, and interpretable predictions for chillers’ energy consumption within smart building environments.
Table 7. Model statistics (prediction of the chiller energy consumption).

Metric      | LR      | GBRT   | RFR    | SRT     | PR     | RTE
R²          | 0.860   | 0.955  | 0.962  | 0.779   | 0.898  | 0.970
MAE         | 8.974   | 4.005  | 3.603  | 9.249   | 7.277  | 3.601
MSE         | 124.372 | 40.250 | 34.197 | 196.725 | 90.848 | 33.872
RMSE        | 11.152  | 6.344  | 5.848  | 14.026  | 9.531  | 5.820
MSD         | −0.130  | −0.348 | −0.197 | −4.630  | −0.272 | −0.204
MAPE        | 0.072   | 0.032  | 0.029  | 0.069   | 0.058  | 0.029
Adjusted R² | 0.860   | 0.955  | 0.962  | 0.779   | 0.898  | 0.970
The most compelling results in this study were achieved through the application of supervised ML algorithms, as detailed in Table 7 and illustrated in Figure 8 and Figure 9. Among the models evaluated, the Regression Tree Ensemble demonstrated a superior performance, attaining a predictive accuracy of 97.00% in estimating the chillers’ energy consumption rates. This was closely followed by the Random Forest Regression model with 96.20% and the Gradient Boosted Regression Trees model with 95.50%. These outcomes underscore the efficacy and robustness of the RTE model, positioning it as the most accurate and reliable technique for forecasting chillers’ energy demand in this context.
Figure 9 provides a comparative visualization of the models based on their R² values, offering an intuitive understanding of each model’s degree of fit with the dataset. This facilitates an informed selection of the most suitable predictive model by highlighting the extent to which each algorithm captures the variance in the target variable. In this regard, the RTE model emerges as a highly efficient and context-appropriate tool for predicting the energy consumption in commercial buildings, particularly within the urban landscape of Singapore.
Figure 9. Graphical representation of R² values for supervised ML models.

The Cross-Validation Results

The results of the 10-fold cross-validation procedure are presented in Table 8 and illustrated in Figure 10 and Figure 11. For each model, the initial R² score, the cross-validation R² score, and the corresponding standard deviation are presented. This comprehensive presentation enables a more rigorous and objective assessment of the predictive robustness and stability of each model.
Table 8. Comparative results.

Model | Initial R² (%) | Cross-Validation R² (%) | Gap
LR    | 86.00          | 84.20                   | −1.80
GBRT  | 95.50          | 93.60                   | −1.90
RFR   | 96.20          | 94.50                   | −1.70
SRT   | 77.90          | 73.50                   | −4.40
PR    | 89.80          | 86.90                   | −2.90
RTE   | 97.00          | 95.30                   | −1.70
Overall, it is observed that the cross-validation R² scores were slightly lower than the initial scores computed using a single train/test split, which is an expected outcome in a realistic evaluation scenario. The small size of this gap indicates the models’ ability to generalize their learning beyond the specific training data. Notably, models such as RTE, RFR, and GBRT maintain a high performance even after cross-validation, underscoring their robustness, effectiveness, and stability in predictive tasks.
Figure 10. A graphical representation of the comparison between the initial R² scores and the cross-validation R².
Figure 11. A graphical representation of the difference in the R² performance between the initial scores and the cross-validation.
The regression rates for chiller energy consumption and the predictive ratios for the six supervised ML models used in this study are shown in Figure 12, Figure 13, Figure 14, Figure 15, Figure 16 and Figure 17. These visual representations encapsulate the intricate nuances of the models’ performance. These figures serve as a valuable resource for assessing and comparing the predictive capabilities of the models, shedding light on their respective strengths and weaknesses.
To contextualize model effectiveness, Table 9 presents a comparative analysis with existing studies, all conducted using the same chiller energy dataset.
Table 9. Comparison of RFR and XGB evaluations.

Work                   | RMSE (RFR) | RMSE (XGB) | MAE (RFR) | MAE (XGB)
This study             | 5.848      | 6.344      | 3.603     | 4.005
Daniel Kan et al. [12] | 10.250     | 9.770      | 7.100     | 6.960
Figure 14. A graphical representation of the regression rate for Random Forest Regression and its predictive ratio.
Figure 15. A graphical representation of the regression rate for a Simple Regression Tree and its predictive ratio.
Figure 16. A graphical representation of the regression rate for polynomial regression and its predictive ratio.
Figure 17. A graphical representation of the regression rate for Regression Tree Ensemble and its predictive ratio.
The RTE model consistently outperforms alternative approaches reported in the literature, thereby reinforcing its validity and highlighting its significant contribution to the field of intelligent energy management and building-level energy forecasting.

6. Conclusions and Future Work

Heating and cooling systems represent a significant component of the total energy consumption in both residential and commercial infrastructure. The integration of ML into this domain has profoundly transformed intelligent energy management, enabling enhancements in energy efficiency, operational safety, system longevity, and real-time distribution monitoring, ultimately contributing to notable reductions in energy usage and associated costs. In this study, a robust methodology was proposed to forecast the chiller energy consumption rate for a commercial building located in Singapore. Leveraging a diverse ensemble of supervised ML models, highly satisfactory prediction scores were obtained. Among the array of models employed, the Regression Tree Ensemble (RTE) model achieved the best fit, with a model concordance of 97.00%. These findings highlight the efficacy of the proposed approach in estimating chillers’ energy consumption rates, thereby fostering informed decision-making and paving the way for optimized intelligent energy management practices in commercial building settings. Despite substantial research efforts to enhance power management, optimize its performance, and address efficient heat dissipation within cooling systems, the realm of intelligent energy management still presents notable research gaps that demand further investigation and exploration. Future research initiatives should aim to delve deeper into areas such as advanced energy storage technologies, adaptive control systems, optimal resource allocation algorithms, and comprehensive energy management frameworks that holistically integrate diverse energy sources [43]. Additionally, exploring emerging technologies, including blockchain, IoT, and AI, holds immense potential for addressing existing challenges and revolutionizing the landscape of intelligent energy management.

Author Contributions

Conceptualization, M.S.B. and S.K.; Methodology, M.S.B., S.K. and S.H.; Formal analysis, M.S.B.; Writing—original draft, M.S.B.; Writing—review & editing, S.K. and S.H.; Supervision, S.K. and S.H.; Visualization, M.S.B.; Project administration, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets generated and/or analyzed as part of this study are available in the [Kaggle] repository at https://www.kaggle.com/datasets/chillerenergy/chiller-energy-data (accessed on 5 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Derkenbaeva, E.; Vega, S.; Hofstede, G.; Van Leeuwen, E. Positive energy districts: Mainstreaming energy transition in urban areas. Renew. Sustain. Energy Rev. 2022, 153, 111782. [Google Scholar] [CrossRef]
  2. Benkhalfallah, F.; Laouar, M.; Benkhalfallah, M. Empowering Education: Harnessing Artificial Intelligence for Adaptive E-Learning Excellence. In Proceedings of the International Conference On Artificial Intelligence And Its Applications In The Age of Digital Transformation, Nouakchott, Mauritania, 23–25 April 2024; pp. 41–55. [Google Scholar]
  3. Shahzad, M.; Qu, Y.; Rehman, S.; Zafar, A. Adoption of green innovation technology to accelerate sustainable development among manufacturing industry. J. Innov. Knowl. 2022, 7, 100231. [Google Scholar] [CrossRef]
  4. Ullah, Z.; Al-Turjman, F.; Mostarda, L.; Gagliardi, R. Applications of artificial intelligence and machine learning in smart cities. Comput. Commun. 2020, 154, 313–323. [Google Scholar] [CrossRef]
  5. Chou, J.; Tran, D. Forecasting energy consumption time series using machine learning techniques based on usage patterns of residential householders. Energy 2018, 165, 709–726. [Google Scholar] [CrossRef]
  6. Somu, N.; MR, G.; Ramamritham, K. A deep learning framework for building energy consumption forecast. Renew. Sustain. Energy Rev. 2021, 137, 110591. [Google Scholar] [CrossRef]
  7. Shapi, M.; Ramli, N.; Awalin, L. Energy consumption prediction by using machine learning for smart building: Case study in Malaysia. Dev. Built Environ. 2021, 5, 100037. [Google Scholar] [CrossRef]
  8. Moon, J.; Park, J.; Hwang, E.; Jun, S. Forecasting power consumption for higher educational institutions based on machine learning. J. Supercomput. 2018, 74, 3778–3800. [Google Scholar] [CrossRef]
  9. Shin, S.; Woo, H. Energy Consumption Forecasting in Korea Using Machine Learning Algorithms. Energies 2022, 15, 4880. [Google Scholar] [CrossRef]
  10. Bourhnane, S.; Abid, M.; Lghoul, R.; Zine-Dine, K.; Elkamoun, N.; Benhaddou, D. Machine learning for energy consumption prediction and scheduling in smart buildings. SN Appl. Sci. 2020, 2, 297. [Google Scholar] [CrossRef]
  11. Benkhalfallah, M.; Kouah, S.; Benkhalfallah, F. Enhancing Advanced Time-Series Forecasting of Electric Energy Consumption Based on RNN Augmented with LSTM Techniques. In Proceedings of the International Conference On Artificial Intelligence And Its Applications In The Age Of Digital Transformation, Nouakchott, Mauritania, 23–25 April 2024; pp. 34–46. [Google Scholar]
  12. Daniel, K.; Jeffrey, L.; Totally, M. Chiller Energy Data Analysis. 2023. Available online: https://www.kaggle.com/code/zerokan/chiller-energy-data-analysis (accessed on 5 January 2025).
  13. Benkhalfallah, M.; Kouah, S.; Ammi, M. Smart Energy Management Systems. In Proceedings of the Novel & Intelligent Digital Systems Conferences, Athens, Greece, 28–29 September 2023; pp. 1–8. [Google Scholar]
  14. Hossain, S.; Hasanuzzaman, M.; Hossain, M.; Amjad, M.; Shovon, M.; Hossain, M.; Rahman, M. Forecasting Energy Consumption Trends with Machine Learning Models for Improved Accuracy and Resource Management in the USA. J. Bus. Manag. Stud. 2025, 7, 200–217. [Google Scholar] [CrossRef]
  15. Zaki, A.; Zayed, M.; Bargal, M.; Saif, A.; Chen, H.; Rehman, S.; Alhems, L.; El-deen, E. Environmental and energy performance analyses of HVAC systems in office buildings using boosted ensembled regression trees: Machine learning strategy for energy saving of air conditioning and lighting facilities. Process Saf. Environ. Prot. 2025, 198, 107214. [Google Scholar] [CrossRef]
  16. Li, X.; Wang, Z.; Yang, C.; Bozkurt, A. An advanced framework for net electricity consumption prediction: Incorporating novel machine learning models and optimization algorithms. Energy 2024, 296, 131259. [Google Scholar] [CrossRef]
  17. Cui, X.; Lee, M.; Koo, C.; Hong, T. Energy consumption prediction and household feature analysis for different residential building types using machine learning and SHAP: Toward energy-efficient buildings. Energy Build. 2024, 309, 113997. [Google Scholar] [CrossRef]
  18. Ali, U.; Bano, S.; Shamsi, M.; Sood, D.; Hoare, C.; Zuo, W.; Hewitt, N.; O’Donnell, J. Urban building energy performance prediction and retrofit analysis using data-driven machine learning approach. Energy Build. 2024, 303, 113768. [Google Scholar] [CrossRef]
  19. Xu, S.; Zhang, H.; Wang, Z. Thermal Management and Energy Consumption in Air, Liquid, and Free Cooling Systems for Data Centers: A Review. Energies 2023, 16, 1279. [Google Scholar] [CrossRef]
  20. Dinmohammadi, F.; Han, Y.; Shafiee, M. Predicting energy consumption in residential buildings using advanced machine learning algorithms. Energies 2023, 16, 3748. [Google Scholar] [CrossRef]
  21. Kapp, S.; Choi, J.; Hong, T. Predicting industrial building energy consumption with statistical and machine-learning models informed by physical system parameters. Renew. Sustain. Energy Rev. 2023, 172, 113045. [Google Scholar] [CrossRef]
  22. Sulaiman, M.; Mustaffa, Z. Optimal chiller loading solution for energy conservation using Barnacles Mating Optimizer algorithm. Results Control Optim. 2022, 7, 100109. [Google Scholar] [CrossRef]
  23. Kim, D.; Wang, Z.; Brugger, J.; Blum, D.; Wetter, M.; Hong, T.; Piette, M. Site demonstration and performance evaluation of MPC for a large chiller plant with TES for renewable energy integration and grid decarbonization. Appl. Energy 2022, 321, 119343. [Google Scholar] [CrossRef]
  24. Torres, Y.; Gullo, P.; Herrera, H.; Toro, M.; Guerra, M.; Ortega, J.; Speerforck, A. Statistical Analysis of Design Variables in a Chiller Plant and Their Influence on Energy Consumption and Life Cycle Cost. Sustainability 2022, 14, 10175. [Google Scholar] [CrossRef]
  25. Chaerun Nisa, E.; Kuan, Y. Comparative assessment to predict and forecast water-cooled chiller power consumption using machine learning and deep learning algorithms. Sustainability 2021, 13, 744. [Google Scholar] [CrossRef]
  26. Kim, J.; Seong, N.; Choi, W. Modeling and optimizing a chiller system using a machine learning algorithm. Energies 2019, 12, 2860. [Google Scholar] [CrossRef]
  27. U.S. EPA Air Markets Program Data. 2017. Available online: https://campd.epa.gov/campd/ (accessed on 28 January 2025).
  28. Building Energy Codes Program. 2023. Available online: https://www.energycodes.gov/development/commercial/prototype_models (accessed on 28 January 2025).
  29. Albuquerque, P.; Cajueiro, D.; Rossi, M. Machine learning models for forecasting power electricity consumption using a high dimensional dataset. Expert Syst. Appl. 2022, 187, 115917. [Google Scholar] [CrossRef]
  30. Chiller Energy Data. 2021. Available online: https://www.kaggle.com/datasets/chillerenergy/chiller-energy-data (accessed on 28 January 2025).
  31. Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. Ijcai 1995, 14, 1137–1145. [Google Scholar]
  32. Benkhalfallah, F.; Laouar, M. Predicting student exam scores: Exploring the most effective regression technique. In Proceedings of the 2023 International Conference On Networking and Advanced Systems (ICNAS), Algiers, Algeria, 21–23 October 2023; pp. 1–9. [Google Scholar]
  33. Esmailoghli, M.; Quiané-Ruiz, J.; Abedjan, Z. COCOA: COrrelation COefficient-Aware Data Augmentation. In EDBT; University of Konstanz: Konstanz, Germany, 2021; pp. 331–336. [Google Scholar]
  34. Chattamvelli, R. Pearson’s Correlation. In Correlation In Engineering and The Applied Sciences: Applications In R; Springer Nature: Berlin/Heidelberg, Germany, 2024; pp. 55–76. [Google Scholar]
  35. Maulud, D.; Abdulazeez, A. A review on linear regression comprehensive in machine learning. J. Appl. Sci. Technol. Trends 2020, 1, 140–147. [Google Scholar] [CrossRef]
  36. Cui, P.; Dai, C.; Zhang, J.; Li, T. Assessing the Effects of Urban Morphology Parameters on PM2.5 Distribution in Northeast China Based on Gradient Boosted Regression Trees Method. Sustainability 2022, 14, 2618. [Google Scholar] [CrossRef]
  37. Ding, W.; Qie, X. Prediction of Air Pollutant Concentrations via RANDOM Forest Regressor Coupled with Uncertainty Analysis—A Case Study in Ningxia. Atmosphere 2022, 13, 960. [Google Scholar] [CrossRef]
  38. Schiltz, F.; Masci, C.; Agasisti, T.; Horn, D. Using regression tree ensembles to model interaction effects: A graphical approach. Appl. Econ. 2018, 50, 6341–6354. [Google Scholar] [CrossRef]
  39. Mısır, O.; Akar, M. Efficiency and core loss map estimation with machine learning based multivariate polynomial regression model. Mathematics 2022, 10, 3691. [Google Scholar] [CrossRef]
  40. Pachauri, N.; Ahn, C. Regression tree ensemble learning-based prediction of the heating and cooling loads of residential buildings. Build. Simul. 2022, 15, 2003–2017. [Google Scholar] [CrossRef]
  41. Chicco, D.; Warrens, M.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef] [PubMed]
  42. Gorriz, J.; Segovia, F.; Ramirez, J.; Ortiz, A.; Suckling, J. Is K-fold cross validation the best model selection method for Machine Learning? arXiv 2024, arXiv:2401.16407. [Google Scholar]
  43. Benkhalfallah, M.; Kouah, S. Towards A Greener Future: The Power of Renewables in Intelligent Energy Management. In Proceedings of the First National Conference On New Educational Technologies And Informatics (NCNETI 2023), Guelma, Algeria, 3–4 October 2023; pp. 100–111. [Google Scholar]
Figure 1. Methodology applied.
Figure 2. Chiller energy consumption bar chart.
Figure 3. Chiller energy consumption pie chart.
Figure 4. Data visualization.
Figure 5. Correlation matrix between the feature set and the target feature, visualized using colors.
Figure 7. Workflows implemented for each respective regression technique.
Figure 8. Evaluation metrics for the supervised ML models.
Figure 12. A graphical representation of the regression rate for Linear Regression and its predictive ratio.
Figure 13. A graphical representation of the regression rate for Gradient Boosted Regression Trees and their predictive ratio.
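Figures 7 and 8 refer to the per-technique training workflows and their evaluation. As an illustration only, the minimal scikit-learn sketch below fits several of the regressor families compared in this study on synthetic data and reports held-out R²; the data, the default hyperparameters, and the bagged-tree reading of the Regression Tree Ensemble are assumptions, not the authors' actual workflow or tooling.

```python
# Minimal sketch of a train/evaluate workflow for several supervised regressors.
# Synthetic data and default hyperparameters are used purely for illustration.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (
    BaggingRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)

# Placeholder data: 9 features, one continuous target (stands in for kWh).
X, y = make_regression(n_samples=1000, n_features=9, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "LR": LinearRegression(),
    "GBRT": GradientBoostingRegressor(random_state=42),
    "RFR": RandomForestRegressor(random_state=42),
    "SRT": DecisionTreeRegressor(random_state=42),
    "PR": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "RTE": BaggingRegressor(random_state=42),  # bagged decision trees: one possible RTE reading
}

for name, model in models.items():
    r2 = model.fit(X_train, y_train).score(X_test, y_test)  # R^2 on the held-out split
    print(f"{name}: R^2 = {r2:.3f}")
```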
Table 1. Comparison and summary of the studies.

| Authors | Study | Data Availability | Method/Approach | Key Findings | Strengths | Limitations |
|---|---|---|---|---|---|---|
| Saddam Hossain et al. [14] (2025) | Develop and evaluate ML algorithms to accurately predict energy consumption trends in the United States | U.S. Energy Information Administration (EIA) and the Department of Energy (DOE) | Neural networks, regression analysis, ensemble approaches; specific algorithms compared: logistic regression, Random Forest, XGBoost | Enhanced forecasting accuracy for energy demand across various sectors, including residential, commercial, and industrial | Comprehensive dataset; application of advanced methodologies; comparative analysis of algorithms; actionable insights; focus on real-world applications; consideration of multiple factors | Data bias and generalizability; complexity of long-term forecasting; dependency on historical data; challenges in real-time forecasting; limited scope of the algorithms; framework limitations |
| A. M. Zaki et al. [15] (2025) | An ML strategy for energy savings in air conditioning and lighting systems | Not applicable | Boosted ensemble regression trees | ML enables efficient prediction of HVAC and lighting energy performance, improving sustainability | Strong use of ensemble learning and smart sensor integration | Lack of a data validation protocol |
| Xuetao Li et al. [16] (2024) | A hybrid forecasting framework that integrates advanced ML algorithms with metaheuristic optimization techniques | Not applicable | ML models: CatBoost and XGBoost; optimization algorithms: SSA, PPSO, GWO | Enhanced accuracy and efficiency of electrical load forecasts for efficient energy management | Integration of advanced ML with optimization; high forecasting accuracy; real-world relevance; robust evaluation metrics | Dataset-specific findings; lack of deep temporal modeling; limited diversity in ML models; optimization overhead |
| Xue Cui et al. [17] (2024) | Predicting the energy consumption in residential buildings and analyzing the impact of house features on energy use | U.S. Energy Information Administration (EIA) | ML algorithms: LightGBM, CatBoost, XGBoost; the SHAP method | Improved energy efficiency of residential buildings | Comprehensive data source; advanced ML models; interpretability with SHAP; personalized energy insights | Limited scope of building types; simplified features; overfitting risk; lack of real-world validation |
| U. Ali et al. [18] (2024) | Predicting the energy performance of urban buildings and analyzing renovation processes | Not applicable | Data-driven ML with archetype modeling, end-use disaggregation, and ensemble learning | The ML model achieved 91% accuracy in urban-scale prediction; helpful for retrofit planning | Large synthetic dataset; practical policy applications | Lack of real-world testbed validation; dependency on archetypes |
| Sijun Xu et al. [19] (2023) | Thermal management and energy consumption in cooling systems for data centers | Not applicable | Cooling system analysis and optimization | PUE values and load analysis for different cooling techniques | Comprehensive review of thermal management; highlights the importance of data center efficiency | Review nature limits new insights; broad rather than implementation-specific |
| F. Dinmohammadi et al. [20] (2023) | Using advanced ML algorithms to predict the energy consumption in residential buildings | Not applicable | PSO-optimized Random Forest stacking ensemble; SOM + SHAP | The stacking model reached 95.4% accuracy; SHAP and causal inference improved interpretation | Multimodal ML interpretability; high performance | Limited generalization; based on a specific case context |
| S. Kapp et al. [21] (2023) | Predicting the energy consumption in industrial buildings using statistical and ML models based on the system's physical parameters | Not applicable | A linear regressor in a transformed feature space; physical-parameter-informed features | The linear model outperformed SVM; physical parameters enhanced prediction accuracy | Combines physics and data science; interpretable model | Unknown external validity |
| Mohd Herwan Sulaiman and Zuriani Mustaffa [22] (2022) | Optimal chiller loading using the BMO algorithm | Not applicable | An evolutionary optimization algorithm (BMO) | Reduces energy consumption in multi-cooling systems; achieves the optimal load | Novel optimization method; energy-saving focus; innovative approach | Requires validation; generalizability unclear |
| D. Kim et al. [23] (2022) | MPC performance evaluation for a TES cooling plant | Air market program data [27] | Model predictive control (MPC) | Reduces PV overcapacity, GHG emissions, and peak load | Real-world demonstration; relevance to grid decarbonization; robust control | Complex implementation; site-specific; long-term scalability questions |
| Y. D. Torres et al. [24] (2022) | Statistical analysis of chiller station design variables | Not applicable | Statistical analysis | Design and distribution affect energy use significantly | Life cycle cost included; informs design optimization; sustainability focus | May miss dynamic behavior; dataset limitations |
| E. Chaerun Nisa and Y.-D. Kuan [25] (2021) | Benchmarking the prediction and forecasting of water chiller energy consumption using ML and deep learning algorithms | Not applicable | MLP, CNN, LSTM | LSTM performed best for short-term forecasting ($R^2 = 0.994$); MLP was best for static predictions | Comprehensive model comparison with real building data | Only tested on a single building in Taiwan |
| Jee-Heon Kim et al. [26] (2019) | Modeling and optimizing a chiller system | Building Energy Codes Program [28] | Artificial neural networks | A 99.07% prediction accuracy for HVAC systems | Advanced ML modeling; performance enhancement; generalizable framework | High data needs; overfitting risks; real-world applications not explored |
Table 2. A description of the dataset features.

| № Feature | Feature | Feature Type | Feature Description |
|---|---|---|---|
| 01 | Local Time (Timezone: GMT + 8 h) | Small date time | Timestamp: the hour of the day and the day of the week |
| 02 | Chilled Water Rate (L/sec) | Numerical | Chilled water flow rate in liters per second |
| 03 | Cooling Water Temperature (°C) | Numerical | Cooling water temperature in degrees Celsius |
| 04 | Building Load (RT) | Numerical | The cooling load of the building in refrigeration tons |
| 05 | Outside Temperature (°F) | Numerical | The outdoor temperature in degrees Fahrenheit |
| 06 | Dew Point (°F) | Numerical | The dew point in degrees Fahrenheit |
| 07 | Humidity (%) | Numerical | The relative humidity as a percentage |
| 08 | Wind Speed (mph) | Numerical | The wind speed in miles per hour |
| 09 | Pressure (in) | Numerical | The barometric pressure in inches of mercury |
| 10 | Chiller Energy Consumption (kWh) | Numerical | The chiller's total energy consumption in kilowatt-hours |
Table 3. A few rows of the dataset for all features and their values.

| Local Time | Chilled Water Rate | Cooling Water Temperature | Building Load | Outside Temperature | Dew Point | Humidity | Wind Speed | Pressure | Chiller Energy Consumption |
|---|---|---|---|---|---|---|---|---|---|
| 8/18/2019 0:00 | 85.6 | 31.4 | 479.6 | 82 | 75 | 79 | 13 | 29.83 | 116.2 |
| 9/4/2019 6:00 | 92.7 | 31.3 | 468.3 | 81 | 73 | 79 | 2 | 29.77 | 112 |
| 10/26/2019 11:30 | 106.7 | 33.2 | 566.7 | 91 | 73 | 55 | 3 | 29.83 | 125.3 |
| 11/23/2019 15:00 | 102.8 | 33.5 | 612.6 | 84 | 79 | 84 | 9 | 29.77 | 143.9 |
| 12/21/2019 19:30 | 87.8 | 31 | 469.7 | 79 | 75 | 89 | 5 | 29.8 | 104.6 |
| 2/3/2020 21:00 | 97.4 | 31.1 | 487.5 | 81 | 75 | 84 | 7 | 29.86 | 98.1 |
| 3/12/2020 23:30 | 90.9 | 31.5 | 457.9 | 82 | 75 | 79 | 5 | 29.89 | 101.9 |
| 4/25/2020 3:00 | 82.7 | 30.6 | 393.1 | 79 | 77 | 94 | 2 | 29.77 | 98.1 |
| 6/1/2020 13:00 | 108.7 | 33.3 | 569.7 | 82 | 75 | 79 | 5 | 29.86 | 129 |
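For readers who wish to work with data laid out as in Tables 2 and 3, the sketch below shows one way to load such a file with pandas and coerce the feature types. The file name chiller_data.csv and the exact column labels are assumptions for illustration, not artifacts released with this paper.

```python
# Illustrative sketch only: loads a chiller dataset shaped like Tables 2 and 3.
# The file name and column labels are assumptions mirroring Table 2.
import pandas as pd

df = pd.read_csv(
    "chiller_data.csv",
    parse_dates=["Local Time"],   # GMT+8 timestamps, e.g. "8/18/2019 0:00"
)

# Numerical features listed in Table 2 (units kept in the column names).
numeric_cols = [
    "Chilled Water Rate (L/sec)",
    "Cooling Water Temperature (°C)",
    "Building Load (RT)",
    "Outside Temperature (°F)",
    "Dew Point (°F)",
    "Humidity (%)",
    "Wind Speed (mph)",
    "Pressure (in)",
    "Chiller Energy Consumption (kWh)",
]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
print(df.head())
```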
Table 4. Correlation matrix between the feature set and the target feature.

| Row ID | Local Time | Chilled Water Rate (L/s) | Cooling Water Temp. (°C) | Building Load (RT) | Outside Temp. (°F) | Dew Point (°F) | Humidity (%) | Wind Speed (mph) | Pressure (in) | Chiller Energy (kWh) |
|---|---|---|---|---|---|---|---|---|---|---|
| Local Time | 1.0 | −1.0183 | 0.0410 | −1.0133 | 0.1471 | 0.0985 | −1.0914 | 0.0669 | −1.0177 | 1.1518 |
| Chilled Water Rate | −1.0183 | 1.0 | 0.3394 | 0.7269 | 0.4237 | −1.1428 | −1.4382 | 0.3539 | −1.0246 | 0.6389 |
| Cooling Water Temp. | 0.0410 | 0.3394 | 1.0 | 0.3631 | 0.3461 | 0.1023 | −1.2497 | 0.1808 | −1.0692 | 0.4464 |
| Building Load | −1.0133 | 0.7269 | 0.3631 | 1.0 | 0.4190 | −1.1287 | −1.4214 | 0.3367 | −1.0486 | 0.7469 |
| Outside Temp. | 0.1471 | 0.4237 | 0.3461 | 0.4190 | 1.0 | −1.0600 | −1.7064 | 0.3816 | −1.1201 | 0.4411 |
| Dew Point | 0.0985 | −1.1438 | 0.1023 | −1.1287 | −1.0600 | 1.0 | 0.3114 | −1.1941 | −1.0104 | −1.0655 |
| Humidity | −1.0914 | −1.4382 | −1.2497 | −1.4214 | −1.7064 | 0.3114 | 1.0 | −1.4304 | 0.0837 | −1.4039 |
| Wind Speed | 0.0669 | 0.3539 | 0.1808 | 0.3367 | 0.3816 | −1.1941 | −1.4304 | 1.0 | −1.0065 | 0.3064 |
| Pressure | −1.0177 | −1.0246 | −1.0692 | −1.0486 | −1.1201 | −1.0104 | 0.0837 | −1.0065 | 1.0 | −1.0873 |
| Chiller Energy | 1.1518 | 0.6389 | 0.4464 | 0.7469 | 0.4411 | −1.0655 | −1.4039 | 0.3064 | −1.0873 | 1.0 |
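A correlation matrix like Table 4 can be produced with the Pearson coefficient directly from the feature table. The sketch below reuses the DataFrame `df` from the loading sketch above; encoding the timestamp as seconds since the epoch is an assumption made only so that the time column can be correlated.

```python
# Sketch: Pearson correlations between all features and the target.
import pandas as pd

numeric_df = df.copy()
numeric_df["Local Time"] = numeric_df["Local Time"].map(pd.Timestamp.timestamp)

corr = numeric_df.corr(method="pearson")          # full correlation matrix, as in Table 4
target = "Chiller Energy Consumption (kWh)"
print(corr[target].sort_values(ascending=False))  # features ranked by correlation with the target
```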
Table 5. Evaluation metrics.

| Metric | Derivation |
|---|---|
| $R^2$ | $1 - \dfrac{\sum_{i=1}^{n}(\text{True values}_i - \text{Predicted values}_i)^2}{\sum_{i=1}^{n}(\text{True values}_i - \overline{\text{True values}})^2}$ |
| MAE | $\dfrac{1}{n}\sum_{i=1}^{n}\lvert \text{True values}_i - \text{Predicted values}_i \rvert$ |
| MSE | $\dfrac{1}{n}\sum_{i=1}^{n}(\text{True values}_i - \text{Predicted values}_i)^2$ |
| RMSE | $\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(\text{True values}_i - \text{Predicted values}_i)^2}$ |
| MSD | $\dfrac{1}{n}\sum_{i=1}^{n}(\text{Predicted values}_i - \text{True values}_i)$ |
| MAPE | $\dfrac{100\%}{n}\sum_{i=1}^{n}\left\lvert \dfrac{\text{True values}_i - \text{Predicted values}_i}{\text{True values}_i} \right\rvert$ |
| Adjusted $R^2$ | $1 - \dfrac{(1 - R^2)(n - 1)}{n - p - 1}$ |
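As a worked illustration of the Table 5 derivations (not the paper's actual evaluation), the NumPy sketch below computes each metric on synthetic true/predicted pairs; the arrays, the noise level, and the choice of $p = 9$ predictors are assumptions.

```python
# Sketch: computing the Table 5 metrics with NumPy on placeholder predictions.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.uniform(90, 150, size=200)            # synthetic "true" kWh values
y_pred = y_true + rng.normal(0, 5, size=200)       # hypothetical model predictions
n, p = y_true.size, 9                              # p = assumed number of predictors

residuals = y_true - y_pred
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot
mae = np.mean(np.abs(residuals))
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)
msd = np.mean(y_pred - y_true)                     # mean signed deviation (bias)
mape = 100 * np.mean(np.abs(residuals / y_true))   # percentage error
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"R2={r2:.3f}  MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
print(f"MSD={msd:.2f}  MAPE={mape:.2f}%  Adjusted R2={adj_r2:.3f}")
```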
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
