1. Introduction
Growing interest in urban sustainability and energy efficiency in the built environment has led to the development of increasingly sophisticated predictive tools for estimating building energy consumption. Accurate local-level energy demand forecasting is essential for optimizing resource management, reducing greenhouse gas emissions, promoting renewable energy integration and supporting informed decision making in the context of the energy transition [1,2]. Contemporary cities are becoming increasingly interconnected, data-driven environments in which reliable and timely information enables proactive and resilient energy system management strategies [3]. In this scenario, artificial intelligence (AI) techniques, particularly machine learning (ML) algorithms, play a strategic role in analyzing and predicting energy consumption. Supervised ML models, such as tree regressors, artificial neural networks and sequential models, can capture complex, nonlinear relationships between climatic, behavioral and structural variables, overcoming the limitations of conventional statistical methods in terms of flexibility, adaptability and accuracy [4,5]. Increasing access to granular data collected via IoT sensors and advanced monitoring systems has further encouraged the adoption of these techniques, which are now used in applications ranging from electricity load forecasting to seasonal building consumption modeling [6,7].
However, a critical component that is often overlooked in the literature is data availability. While many studies rely on large, comprehensive historical datasets, these requirements are often not met in operational contexts. At the urban level, available data are frequently limited in temporal extent, subject to interruptions or affected by measurement errors. Such limitations can arise from various causes, including the recent implementation of monitoring systems, sensor network malfunctions, difficulties in accessing distributed sources and database fragmentation [8]. As a result, the practical applicability of AI models depends substantially on their ability to perform reliably even in the presence of partial or degraded information. The discrepancy between the models' theoretical assumptions and their actual operating conditions raises important methodological questions. In urban contexts with low information density, the robustness of ML models, understood as their ability to generalize beyond the training set even under less than ideal conditions, is crucial. Comparative performance analysis between models that differ in complexity, architecture and learning mechanisms can provide useful guidance for selecting the most suitable method based on the quantity and quality of available data.
A further element of complexity arises from the need to consider strategies that mitigate the effects of data scarcity or imperfection. Various fields have proposed advanced techniques, such as synthetic data augmentation, parameter regularization, ensemble modeling and engineered feature design, to improve models' ability to learn from limited samples [9]. However, the effectiveness of such approaches is not guaranteed, particularly in domains characterized by strong seasonality, pronounced heterogeneity between observed units (such as buildings) or low data frequency (e.g., monthly rather than hourly or daily observations).
In real urban contexts, the availability of energy consumption data at the building and neighborhood scale is often far more limited than what is typically assumed in the development of AI-based forecasting models. In many European cities, monitoring infrastructures have been deployed only in recent years, resulting in short historical time series, frequent data gaps and heterogeneous measurement quality. Despite these constraints, local administrations, energy managers and urban planners are increasingly required to rely on data-driven tools to support operational decisions, energy efficiency strategies and sustainability-oriented policies. This creates a structural mismatch between the large, clean and temporally rich datasets commonly used in the literature and the small, incomplete and low-frequency data that characterize real-world urban energy systems. Under such conditions, conventional AI models may suffer from unstable generalization, overfitting and reduced robustness, particularly when seasonal patterns are only partially observed or when data quality is degraded. Addressing these challenges is therefore not only a methodological issue but a practical necessity to enable the reliable application of AI techniques in urban energy management contexts where data scarcity is the norm rather than the exception.
1.1. Literature Review
Despite the widespread use of AI techniques in the energy sector, particularly in the field of consumption forecasting, there are still significant gaps in the scientific literature regarding the systematic analysis of model reliability under realistic information constraints. Most studies focus on the development or optimization of high-performance predictive models, evaluating them in scenarios characterized by abundant, complete and high-quality data. However, these conditions rarely reflect the operational situations encountered in real urban contexts, especially at the local level or in areas where energy monitoring systems are still being implemented. The behavior of ML models in the presence of short, interrupted or noisy time series datasets, which are the norm rather than the exception in many urban applications, remains largely unexplored. Notwithstanding the growing interest in intelligent models to support the energy management of buildings and neighborhoods, there is still an insufficient understanding of their ability to adapt to degraded or informationally unstable conditions. The robustness of models, understood as the stability and reliability of predictions in non-ideal scenarios, is rarely addressed explicitly and comparatively, leaving a significant methodological gap regarding the practical application of these technologies.

Various studies propose innovative architectures, including deep neural networks, hybrid models and ensemble techniques, but the focus is almost exclusively on maximizing the predictive accuracy in controlled experimental environments. In contrast, efforts to investigate the minimum conditions for the applicability of AI models, i.e., the information thresholds below which performance deteriorates critically, are still marginal. This aspect is crucial in translating research results into reliable, scalable and replicable tools in the context of urban energy planning and management, especially where structural constraints in data collection are a permanent condition.
In light of these considerations, it is essential to critically examine the existing literature to identify not only methodological advances but also areas where current solutions show fragility with respect to concrete information constraints. The following literature analysis therefore aims to summarize the main directions of research in the field of energy consumption forecasting using AI techniques, highlighting the most relevant contributions and the main limitations in terms of robustness, generalization and model adaptability in limited information contexts.
Lee et al. [10] develop a method for forecasting daily electricity loads in residential buildings using small datasets. By combining a self-organizing map with a stacking ensemble learning approach, the study addresses the problem of overfitting that is common in small datasets. Applied to real consumption and weather data, the model outperforms other methods, offering an effective solution for forecasting in scenarios with limited data [10]. Choi et al. [11] present a comparative study on the performance of six deep learning architectures for predicting the thermal loads and indoor temperatures of buildings, with a focus on scenarios characterized by limited data and strong seasonality. Using simulated data and 24 h forecasting horizons, the analysis shows that the Transformer model performs best with small training sets, followed by the GRU and RNN. The results highlight that increasing the amount of data does not guarantee more accurate forecasts unless the training data are seasonally consistent with the test data. The study therefore provides a useful reference for selecting predictive architectures in the energy sector when historical data availability is limited [11]. Izonin et al. [12] propose an ML approach for predicting the energy efficiency of buildings, focusing on scenarios where data availability is limited or partial. The study combines supervised learning techniques with regressor-based models, such as random forest (RF), to accurately estimate the energy behavior of buildings based on structural and climatic characteristics. The article highlights that the use of intelligent models allows for high levels of predictive accuracy, even in the absence of extensive datasets, making the approach particularly useful in supporting energy retrofitting, preliminary diagnoses or decision-making processes in contexts where access to complete data is difficult or costly [12]. Mathumitha et al. [13] present a comprehensive review of deep learning techniques for forecasting energy consumption in smart buildings, analyzing both residential and non-residential contexts. The study synthesizes existing research on a wide range of forecasting methodologies, including recurrent neural networks, convolutional neural networks and hybrid and ensemble models, and evaluates them based on the dataset characteristics, prediction accuracy and suitability for different load types and building categories. Identified challenges include handling load fluctuations, variability due to weather and occupant behavior and the need for predictive methods appropriate to grid planning. The authors highlight key research gaps and propose methodological guidelines for future forecasting models with improved reliability in real-world smart building environments [13]. Cordeiro Costas et al. [14] present a hybrid approach for short-term energy management in buildings, combining neural network models (LSTM-MLP) optimized through the multi-objective evolutionary algorithm NSGA-II. The proposed energy system uses GFS weather forecast data and electricity price forecasts to manage production from renewable sources and energy storage, balancing consumption while minimizing costs and waste. Applied to a building at the University of Vigo (Spain), the model demonstrates its ability to optimize operational decisions in a sustainable and calibrated manner, highlighting the benefits of automatic tuning via NSGA-II in terms of energy balancing and predictive results [14].
Fan et al. [15] propose an advanced hybrid model for short-term electricity load forecasting, called EWT-CNN-S-RNN + LSTM, which integrates empirical wavelet transformation, convolutional neural networks, simple recurrent networks and LSTM networks, all optimized with a Bayesian optimization algorithm (BOA). The method addresses the non-linear and non-stationary nature of load data through adaptive signal decomposition (EWT) and then assigns different modal components to specific predictive sub-models. The data used come from real Australian and Swiss datasets, and the effectiveness of the model is evaluated on three distinct case studies. The results demonstrate that the proposed model significantly improves the accuracy compared to single models and other standard combinations (CNN-RNN, PSO-SVR, GRNN), reducing errors and increasing the degree of adaptation to real data [15].
Gorzałczany et al. [16] present an interpretable and accurate approach for predicting the energy consumption of residential buildings, combining fuzzy systems with multi-objective evolutionary optimization. The proposed method consists of designing a fuzzy rule-based model (FRBPS) optimized by an evolutionary variant of the SPEA2 algorithm, aimed at balancing predictive accuracy and model transparency. Starting from a publicly available dataset (hosted on Kaggle), the procedure is compared with over 20 alternative methods and shows comparable or slightly higher performance in terms of accuracy, as well as offering greater interpretability than black-box models [16].
Most research tends to favor ideal case studies or well-structured international benchmarks, neglecting the conditions most frequently found in medium-sized European cities or local administrative contexts, where the available information may be partial, inconsistent or subject to delays in collection. The models proposed in the literature are often based on large, well-balanced datasets with extensive temporal coverage and no significant anomalies, conditions that are rarely found in real urban operating contexts. In many cases, analyses focus solely on predictive accuracy, neglecting aspects related to the robustness of models in the presence of noise, missing data or short time series. In particular, there is often no systematic evaluation of performance as the available historical depth varies or under scenarios of controlled data quality degradation. Mitigation strategies are often limited to hyperparameter optimization or the introduction of complex hybrid models, without a comparative analysis between simple, interpretable approaches and more complex architectures.
A further limitation lies in the lack of attention to the problem of generalization in scenarios with low information density. Several studies, while proposing advanced solutions based on deep neural networks or combinations of models, do not explicitly consider the transferability of results to contexts with information conditions that differ from those seen during training. Despite the effectiveness of some modeling approaches, the use of interpretation or sensitivity assessment techniques often remains secondary or absent, reducing the transparency of the results obtained. Some contributions focus on the development of high-performance models for energy consumption forecasting but adopt synthetic or simulated datasets or use methodologies that assume the extensive availability of weather, behavioral or technical data, which are difficult to access in small-scale urban settings [17]. Furthermore, comparative assessments are often limited to one or two test scenarios, without exploring the variability in operating conditions or the behavior of models over multiple time horizons. Consequently, it is necessary to propose a modeling evaluation framework that takes into account the variability of available historical data, the presence of noise and missing values and the need to ensure stable performance even in highly constrained scenarios. An approach focused on robustness, adaptability and operational replicability can bridge the gap between academic solutions and the real needs of local administrations, energy technicians and urban managers.
Against this background, the novelty of this study lies not in the proposal of a new forecasting model but in the development of a structured and multidimensional evaluation framework for AI-based urban energy forecasting under realistic small data conditions. In contrast to existing works, which typically assess model performance under fixed and idealized assumptions, this study systematically stress-tests multiple machine learning families across progressively reduced historical horizons, controlled data quality degradation and targeted mitigation strategies within a single, coherent experimental design. In addition, a deliberately ultra-conservative data augmentation strategy is introduced to enhance model resilience in extremely short time series, while explicitly preserving temporal consistency and physical plausibility. This positioning allows the present work to contribute methodological guidance that is directly transferable to real-world urban energy management contexts characterized by structural data limitations.
Based on these considerations, the main contributions of this study can be summarized as follows:
- The proposal of a structured and multidimensional stress-testing framework to evaluate AI-based urban energy forecasting models under realistic small data conditions;
- A systematic analysis of model performance across progressively reduced historical horizons (6, 12, 18 and 24 months) combined with controlled data quality degradation;
- A comparative evaluation of ensemble-based and regularized linear models to identify robustness thresholds and minimum data requirements;
- The introduction and validation of an ultra-conservative data augmentation strategy specifically designed for extremely short, low-frequency urban time series.
1.2. Research Objective
This research contributes to the methodological and empirical debate surrounding the application of AI to urban energy demand modeling, particularly in scenarios where data availability and quality are limited. The main objective is to systematically evaluate the ability of different ML models to reliably predict energy consumption in urban areas under less than ideal information conditions. The aim is to identify, among regularized linear approaches, ensemble algorithms and neural architectures, the model configurations best suited to ensuring predictive robustness in the presence of short, incomplete or noisy time series. This analysis uses a real monthly energy consumption dataset collected over 24 months from a set of residential buildings in an urban neighborhood in Rome. This case study is representative of many European contexts where building-level energy monitoring systems are still in their infancy and the available historical data cover short time intervals. The dataset exhibits characteristics typical of real-world contexts, including seasonal periodicity, heterogeneity between buildings and potential measurement anomalies.
The experimental design involves dividing the dataset into time windows of 6, 12, 18 and 24 months to simulate scenarios with increasing historical availability. Two types of information degradation are introduced in a controlled manner within each window: missing data and Gaussian noise on the target, representing the data incompleteness and measurement errors typical of real systems, respectively. The impact of these perturbations is assessed using standardized accuracy metrics, namely the mean absolute error (MAE) and the normalized coefficient of determination (R2). Mitigation strategies based on regularization (Ridge, Lasso), ensemble techniques (voting regressors combining heterogeneous models) and synthetic data augmentation are tested to counteract the loss of performance linked to data scarcity or low quality. The augmentation strategy involves generating new observations through statistically controlled perturbations of existing samples.
The novelty lies in constructing a modular, replicable experimental framework that assesses the resilience of AI models in the presence of multiple information constraints—temporal, structural and qualitative—on real data from urban buildings. Unlike most existing studies, which focus on large or simulated datasets, this work proposes a detailed comparative analysis across different time horizons and degradation levels. It integrates performance metrics and compensation strategies into a single methodological platform. This approach enables us to identify the most robust models in realistic scenarios and define minimum thresholds for data quality and quantity for the effective implementation of forecasting systems in urban environments.
Combining these elements enables us to define the minimum operating conditions for the reliable application of AI models in realistic urban contexts, thereby contributing to the development of resilient, adaptable, implementation-ready forecasting tools for local energy management decision support systems.
2. Materials and Methods
The main contribution of this study is the evaluation of the stability and robustness of AI models for urban energy forecasting when trained under varying levels of temporal data availability. Unlike conventional ML applications, which typically assume the presence of large-scale, clean and temporally rich datasets, this research addresses more realistic urban scenarios in which data may be partial, fragmented or recently acquired. Specifically, the study investigates how different predictive models respond to varying amounts of historical data, namely 6, 12, 18 and 24 months, and which algorithmic strategies can be employed to mitigate the accuracy degradation associated with limited temporal coverage.
To address these challenges, we developed a four-stage AI stress-testing framework designed to simulate real-world data limitations and assess model reliability under such conditions. In the first stage, we created time-constrained training scenarios by truncating the full dataset into subsets containing only the most recent 6, 12, 18 or 24 months of data. Each subset was then split into training and test segments using a time-aware approach, whereby the most recent 20% of data are reserved exclusively for testing, as shown in Figure 1. This setup enables a controlled evaluation of how the predictive performance, measured via the coefficient of determination (R2) and the mean absolute error (MAE), varies as the length of historical data decreases. In this way, we simulate the operational challenges faced by urban energy analysts who are required to build predictive systems in settings with limited historical records.
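To make this step concrete, the following minimal sketch reproduces the truncation and time-aware split; the helper name make_scenario and the column name month are illustrative assumptions rather than identifiers from the published codebase.

```python
import pandas as pd

def make_scenario(df: pd.DataFrame, months: int, test_frac: float = 0.2):
    """Keep only the most recent `months` of records, then split chronologically."""
    df = df.sort_values("month")                 # enforce temporal order
    cutoff = df["month"].max() - pd.DateOffset(months=months - 1)
    recent = df[df["month"] >= cutoff]           # truncate to the latest window
    split = int(len(recent) * (1 - test_frac))   # most recent 20% held out for testing
    return recent.iloc[:split], recent.iloc[split:]

# Example usage for the four historical windows:
# for m in (6, 12, 18, 24):
#     train, test = make_scenario(data, m)
```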
The second contribution is the examination of the sensitivity of models to data quality degradation. In practice, urban energy datasets often suffer not only from limited length but also from imperfections such as missing values and measurement errors. To simulate these conditions, we injected missing values randomly into 5%, 10%, 20% and 30% of the dataset and assessed the impact on model performance. Missing values were imputed using simple column-wise mean substitution. While more sophisticated imputation techniques exist, they were intentionally not adopted in this study. Advanced methods such as k-nearest neighbors, multivariate imputation or model-based approaches typically require larger datasets to avoid overfitting and may implicitly reconstruct temporal patterns that are not supported by extremely short time series. Given the objective of this work, namely to evaluate model robustness under conservative and information-limited conditions, mean imputation was selected as a transparent and deliberately minimal intervention, ensuring that performance variations reflected genuine model resilience rather than artifacts introduced by complex data reconstruction procedures. Simultaneously, we introduced Gaussian noise into the target energy consumption variable to reflect the effects of sensor inaccuracies or external data acquisition errors, as shown in Figure 2. This dual variation allowed for an assessment of each model's robustness to noise and sparsity, offering insights into how well the models can generalize under degraded data quality.
In the third stage, we evaluated model robustness across different AI architectures. The selected models included the gradient boosting (GB) regressor, an ensemble-based method known for its ability to model complex nonlinear interactions in tabular datasets while maintaining robustness to overfitting. In addition, we implemented a multi-layer perceptron (MLP) neural network, which represents a fundamental class of feedforward neural architectures within the broader AI domain, as shown in Figure 3. Although the MLP lacks explicit temporal memory, it can capture nonlinear dependencies among input features, particularly when enhanced with time-aware feature engineering. Its inclusion enabled the assessment of deep learning performance in structured energy datasets without sequence modeling. To explicitly capture temporal dependencies, we also evaluated long short-term memory (LSTM) networks, a recurrent neural network architecture designed to learn from sequential data. Including both MLP and LSTM allows for a comparative analysis between non-sequential and sequential deep learning approaches, highlighting the potential benefits of temporal modeling in constrained urban forecasting contexts. Although LSTM networks were included in the methodological design, they are excluded from the final comparative analysis. The dataset consists of monthly observations over a short time span (24 monthly records per building), which provides an insufficient temporal resolution for recurrent architectures. Preliminary tests confirmed that LSTM models were unstable and strongly affected by overfitting. For completeness, Table 1 reports the quantitative performance of LSTM models across all tested temporal horizons and data degradation scenarios. For this reason, and to maintain methodological coherence with small data constraints, only models suited to low-frequency short time series were retained in the final evaluation.
The final stage explored several mitigation strategies designed to restore predictive performance in highly constrained scenarios, particularly those involving shorter time windows and noise-injected data. Three categories of mitigation were tested. First, ensemble learning was employed by combining predictions from multiple models, specifically GB and MLP using a voting regressor, thereby leveraging their complementary strengths. Second, we applied regularization techniques through Ridge and Lasso regression to reduce overfitting in low-data regimes. Third, we implemented synthetic data augmentation by generating new training samples through Gaussian perturbations of existing feature vectors. This method aimed to increase the effective diversity of the training data without requiring new external data sources. Each strategy is evaluated for its effectiveness in mitigating performance loss under constrained conditions.
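As an illustration of the ensemble strategy, the sketch below combines GB and MLP in a scikit-learn voting regressor; the specific architectures and hyperparameter values are assumptions chosen for demonstration, consistent with the ranges reported later in this section.

```python
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.neural_network import MLPRegressor

# Heterogeneous ensemble: the two base models' predictions are averaged,
# exploiting their complementary bias-variance behavior.
voting = VotingRegressor(estimators=[
    ("gb", GradientBoostingRegressor(n_estimators=60, max_depth=3, learning_rate=0.1)),
    ("mlp", MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)),
])
# voting.fit(X_train, y_train); voting.predict(X_test)
```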
Model training is based on engineered features derived from the original time series. These features include cyclical representations of the calendar month using sine and cosine transformations, lagged energy consumption values (e.g., at one- and three-month intervals), rolling means over three- and six-month windows and statistical summaries such as the mean and standard deviation of energy use per building. Imputation parameters and feature standardization statistics are computed exclusively on the training data and subsequently applied to the test set to prevent information leakage. Missing values are imputed by using simple statistical heuristics to preserve the dataset structure.
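A compact pandas sketch of this feature construction is shown below; the column names building_id, month and consumption are assumptions, and in the actual pipeline the per-building statistics would be computed on the training split only, as stated above, to avoid leakage.

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.sort_values(["building_id", "month"]).copy()
    m = out["month"].dt.month                      # assumes a datetime month column
    out["month_sin"] = np.sin(2 * np.pi * m / 12)  # cyclical calendar encoding
    out["month_cos"] = np.cos(2 * np.pi * m / 12)
    g = out.groupby("building_id")["consumption"]
    out["lag_1"] = g.shift(1)                      # one-month lag
    out["lag_3"] = g.shift(3)                      # three-month lag
    out["roll_3"] = g.transform(lambda s: s.shift(1).rolling(3).mean())
    out["roll_6"] = g.transform(lambda s: s.shift(1).rolling(6).mean())
    out["bld_mean"] = g.transform("mean")          # per-building summary statistics
    out["bld_std"] = g.transform("std")
    # One-hot building identifiers: preserves identity without ordinal bias
    return pd.get_dummies(out, columns=["building_id"])
```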
The full experimental pipeline is implemented in Python 3.14.2, using widely adopted libraries such as pandas, scikit-learn, matplotlib and tensorflow.keras.
The dataset used in this study consists of monthly energy consumption measurements from four massive social housing buildings in Rome for the years 2022 and 2023. These data were originally collected, structured and published as part of the Energy4Rome initiative. The dataset is openly accessible at https://github.com/teeclab/Urban-Energy-Forecasting-AI-Stress-Test- (accessed on 8 January 2026). Its modular codebase enables seamless adaptation to additional buildings, districts or alternative energy-related datasets with minimal modification.
Figure 4 shows an overview of the buildings from which energy data were collected to create the dataset used in this study.
By simulating real-world data constraints and rigorously testing predictive models under such conditions, this methodology provides a comprehensive framework for evaluating the reliability of AI-driven urban energy forecasting. It also offers practical insights into the limits of model generalization, as well as the effectiveness of different strategies in overcoming data scarcity and quality issues. The framework is intended to serve as a replicable foundation for future research and practice in urban energy analytics, particularly in contexts where full historical datasets are structurally unavailable.
The dataset used in this study is intentionally characterized by a limited temporal depth, consisting of 24 monthly observations for each of the four analyzed buildings. While this may appear small from a purely time series perspective, it is important to clarify the nature and scale of the monitored entities. Each building corresponds to a large social housing complex comprising multiple residential units, and each monthly observation represents the aggregated energy consumption of a substantial number of households rather than an individual dwelling.
As a consequence, the resulting time series exhibits a relatively stable and smooth consumption profile, in which short-term individual behavioral fluctuations are averaged out at the building level. This aggregation effect reduces noise and enhances the structural regularity of the signal, particularly with respect to seasonal patterns and long-term consumption trends. At the same time, the limited temporal horizon reflects a realistic operational scenario commonly encountered in urban contexts, where building-level monitoring systems are often recently deployed and historical records are inherently short.
This combination of spatial aggregation and temporal scarcity makes the dataset particularly suitable for investigating the reliability and robustness of AI-based forecasting models under realistic small data conditions. Rather than aiming to maximize the predictive performance through large historical archives, the study deliberately focuses on identifying the minimum data requirements and operating thresholds under which different model families can still provide reliable forecasts in practice.
2.1. Methodology
In this study, we propose and rigorously assess a comprehensive methodological framework aimed at predicting short-term energy consumption at the building level, specifically under various constraints on data availability and quality. This framework is implemented entirely in Python, leveraging widely adopted open-source ML libraries such as scikit-learn, pandas and tensorflow.keras. The overarching goal is to align the modeling approach with the operational and infrastructural realities observed in urban energy monitoring systems, where data streams are often incomplete, noisy or subject to temporal gaps. Our methodology addresses these practical barriers by providing a workflow that remains robust across diverse data regimes, allowing for flexible adaptation to different levels of data richness.
Figure 5 provides a schematic overview of the entire workflow, which spans from raw data ingestion and preprocessing to feature engineering, model training, validation under degraded conditions and final evaluation across multiple predictive algorithms. This pipeline is not only reproducible but also designed for modularity, enabling future researchers and practitioners to incorporate additional models or data sources with minimal effort.
To allow ML models to distinguish between buildings, categorical identifiers are converted into numerical format using one-hot encoding. This ensures that building-specific consumption patterns are preserved without introducing ordinal bias. The result is a dataset that is both statistically rich and operationally aligned with the challenges faced in real-world energy monitoring systems. To comprehensively evaluate the adaptability and robustness of predictive models under variable data conditions, we selected four widely recognized regression algorithms: the random forest (RF) regressor [16], the gradient boosting (GB) regressor [17], Ridge regression and Lasso regression. This ensemble of models was chosen with careful attention to the diversity in learning paradigms that they represent. Each model exhibits a unique position on the bias-variance spectrum and responds differently to issues like overfitting, noise sensitivity and data sparsity. RF and GB are both tree-based ensemble methods, yet they diverge in how they optimize the prediction accuracy. RF leverages bagging (bootstrap aggregation) to reduce variance, making it particularly effective in noisy environments. In contrast, GB sequentially corrects the errors of prior models, offering high performance at the cost of increased susceptibility to overfitting in small datasets. Meanwhile, Ridge and Lasso are regularized linear models. Ridge applies the L2 penalty to suppress large coefficients, effectively reducing variance, while Lasso imposes L1 regularization to enforce sparsity, making it well suited for high-dimensional or overparameterized feature spaces. Together, these models provide a broad spectrum of algorithmic complexity and generalization capacity. Their comparative evaluation under identical experimental conditions offers valuable insights into how the model structure and regularization strategies interact with constrained and potentially noisy urban energy data.
RF is a nonparametric, ensemble-based ML algorithm that constructs multiple decision trees during training and outputs the mean prediction of the individual trees. It employs bootstrap aggregation (bagging), where each decision tree is trained on a different bootstrapped sample of the original dataset, thus reducing variance and enhancing the generalization capabilities. This method is highly advantageous in urban energy forecasting tasks, where patterns in consumption can vary nonlinearly across time and between buildings. A key benefit of RF is that it does not assume any specific data distribution, making it adaptable to a wide range of statistical behaviors inherent in real-world energy consumption datasets. It excels in modeling complex feature interactions and can naturally incorporate categorical and continuous features without extensive preprocessing. Additionally, RF is inherently robust to outliers and noisy observations, a valuable trait when dealing with sensor-generated energy data that may include spikes or anomalies. Importantly, RF includes built-in mechanisms for dealing with missing data, such as surrogate splits, which allows it to make decisions even when some features are unavailable. This characteristic significantly enhances its reliability in settings with partial records. The model also produces variable importance rankings, which offer interpretable insights into which features most strongly influence energy predictions, facilitating transparency in operational and policy-driven applications.
GB is a powerful and flexible ML technique that constructs predictive models in a sequential manner. Unlike RF, which aggregates independent trees, GB builds trees in succession, with each new tree attempting to correct the residual errors of the ensemble thus far. This iterative approach allows the model to incrementally improve its predictive capacity by learning from mistakes, effectively reducing biases over time.
In the domain of energy forecasting, particularly when working with structured tabular datasets such as monthly energy consumption records, GB has emerged as a benchmark method. It is particularly advantageous in detecting subtle, nonlinear patterns that might be overlooked by linear or even bagged ensemble methods. This sensitivity is especially valuable in small datasets, where capturing seasonality or abrupt changes in usage due to behavioral or environmental factors is essential.
However, GB’s strengths are balanced by certain weaknesses. Its iterative nature makes it more prone to overfitting if not carefully regularized, and it tends to be more sensitive to noise than RF. Hyperparameters such as the learning rate, number of estimators and tree depth must be finely tuned to achieve optimal performance. Despite this, when properly calibrated, GB often delivers superior accuracy, especially in settings with limited data but rich feature interactions.
Ridge regression is a regularized extension of classical linear regression that incorporates an L2 penalty on the magnitude of the coefficients. This penalty discourages the model from assigning excessively large weights to any single feature, thereby reducing the model’s variance and improving its generalization, especially in the presence of multicollinearity—an issue that is common in time series datasets with correlated lag features. Despite its simplicity, Ridge regression has proven to be remarkably effective in structured datasets with strong seasonality, such as those found in urban energy consumption.
One of Ridge’s key advantages is computational efficiency. It allows for rapid training and evaluation, which is highly desirable in real-time applications or large-scale grid simulations. Furthermore, Ridge offers interpretability: each coefficient directly reflects the influence of a particular feature on the predicted energy use. This transparency is critical in operational or policy-related contexts where models must be explainable to stakeholders, auditors or regulatory bodies.
Lasso regression operates similarly to Ridge but uses an L1 penalty instead. This form of regularization has the additional effect of driving some coefficients to zero, effectively performing feature selection. This sparsity-inducing behavior makes Lasso particularly useful when working with high-dimensional inputs, such as heavily engineered feature sets or cases with building-specific indicators. By eliminating irrelevant variables, Lasso reduces the model complexity and enhances interpretability while still maintaining robust generalization. Its deterministic output and low computational demands make it ideal for embedded systems and energy dashboards where simplicity and clarity are paramount.
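A tiny synthetic example illustrates the practical difference between the two penalties; the data and alpha values are arbitrary assumptions for demonstration only.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first feature truly drives the target
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

print(Ridge(alpha=1.0).fit(X, y).coef_)  # all coefficients shrunk, none exactly zero
print(Lasso(alpha=0.1).fit(X, y).coef_)  # irrelevant coefficients driven to zero
```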
The practical merit of these linear models is further reinforced by prior studies, including those in [18,19], which showed that, even under small data regimes, autoregressive models based on transparent statistical formulations could rival more complex ML approaches. Thus, Ridge and Lasso provide a viable and sometimes preferable alternative in scenarios where data are limited, explainability is critical or deployment constraints exist.

Hyperparameter tuning plays a critical role in determining a model's capacity to generalize to unseen data, particularly in small data scenarios where overfitting is a substantial risk. In this study, we adopted a conservative tuning strategy designed to balance predictive performance with stability. The selection of hyperparameters is guided by a combination of the prior literature, domain-specific heuristics and preliminary cross-validation experiments conducted on the Energy4Rome dataset; see https://github.com/teeclab/Urban-Energy-Forecasting-AI-Stress-Test- (accessed on 8 January 2026).
For RF, we utilized between 50 and 100 estimators. This range is sufficient to achieve robust ensemble averaging without introducing an excessive computational cost or overfitting to noise. The maximum tree depth is restricted to between 5 and 8 levels to limit the complexity of each decision tree, ensuring that predictions remain generalizable rather than overfitted to idiosyncratic training data patterns.
For GB, we selected 50 to 80 estimators, a maximum depth of 3 to 4 and a moderate learning rate of 0.1. This configuration allows the algorithm to capture nuanced patterns in the data while avoiding the common pitfall of overly aggressive learning, which can amplify noise. For Ridge and Lasso regression, we varied the regularization strength (alpha) between 0.5 and 10, striking a balance between under- and over-penalization. This range allows the models to maintain predictive flexibility while constraining coefficient magnitudes to prevent unstable behavior in low-data regimes.
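For concreteness, the sketch below instantiates the four models with mid-range values drawn from the intervals above; the exact per-run values remain assumptions within those reported bounds.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge

models = {
    "RF": RandomForestRegressor(n_estimators=80, max_depth=6, random_state=0),
    "GB": GradientBoostingRegressor(n_estimators=60, max_depth=3, learning_rate=0.1),
    "Ridge": Ridge(alpha=1.0),  # alpha varied between 0.5 and 10 in the experiments
    "Lasso": Lasso(alpha=1.0),
}
```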
By maintaining these hyperparameter settings consistently across all temporal scenarios (6, 12, 18 and 24 months), we ensure that the performance differences observed in the experiments are attributable to variations in data availability and quality, rather than inconsistencies in model tuning. This methodological rigor enhances the credibility and replicability of our findings.
Real-world energy monitoring systems rarely provide pristine datasets. Instead, data may be incomplete, corrupted or influenced by extraneous noise due to sensor malfunctions, network outages or manual data entry errors. To account for these challenges, our methodology incorporates a data degradation simulation step to assess the resilience of forecasting models in suboptimal conditions. Two primary forms of degradation are introduced: missing data and measurement noise.
- Missing data simulation: We randomly remove values from input features at missingness levels of 5%, 10%, 20% and 30%. This allows us to model scenarios ranging from minor sensor glitches to prolonged data outages. After removal, the gaps are imputed using simple column-wise means. While more sophisticated imputation methods exist, mean imputation is chosen here to avoid introducing artificial patterns that could bias the models, thereby keeping the evaluation conservative.
- Measurement noise simulation: We introduce zero-mean Gaussian noise to the target variable (energy consumption), with standard deviations representing 5%, 10%, 15% and 20% of the original variability. This reflects errors caused by inaccurate sensor calibration, external interference or environmental effects not captured by the monitoring system (see the sketch after this list).
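Under these specifications, both degradation modes reduce to a few lines of NumPy/pandas, as sketched below; the helper names are illustrative and not taken from the study's repository.

```python
import numpy as np

rng = np.random.default_rng(42)

def inject_missing(X, rate):
    """Blank out a random `rate` of feature cells, then mean-impute column-wise."""
    X_deg = X.mask(rng.random(X.shape) < rate)   # rate in {0.05, 0.10, 0.20, 0.30}
    return X_deg.fillna(X_deg.mean())            # deliberately minimal imputation

def add_noise(y, level):
    """Add zero-mean Gaussian noise scaled to a fraction of the target's std."""
    return y + rng.normal(0.0, level * y.std(), size=len(y))  # level in {0.05, ..., 0.20}
```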
By combining these two degradation modes, it is possible to evaluate how model performance deteriorates under progressively harsher data quality conditions. To provide a complete overview of model performance across all experimental conditions, Table 2 presents the normalized R2 scores for each model-horizon combination, along with summary statistics that reveal performance stability and scalability patterns.
The adoption of a normalized R2 metric is motivated by the need to ensure comparability across experimental scenarios characterized by different historical horizons and data degradation levels. The coefficient of determination is normalized and bounded and is explicitly defined as follows:

R2_norm(h) = min{1, max{0, [R2(h) − Rmin(h)] / [Rmax(h) − Rmin(h)]}} (1)

where h denotes the forecasting horizon, R2(h) is the observed coefficient of determination, and Rmin(h) and Rmax(h) are the reference bounds for the corresponding horizon.
This horizon-dependent normalization ensures comparability while preventing out-of-range effects. In such conditions, the variance of the target variable may change substantially, making the standard R2 sensitive to scale effects rather than reflecting true predictive performance. Normalization mitigates this issue by anchoring performance to a common reference, allowing meaningful comparisons between models trained under heterogeneous data availability and quality conditions. The normalized R2 is used in this study specifically for relative performance assessment across scenarios, while the absolute accuracy is additionally evaluated using the MAE.
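A direct implementation of Equation (1) is straightforward; note that the per-horizon reference bounds must be supplied by the experimenter, since their numeric values are not fixed by the text.

```python
def normalised_r2(r2: float, r_min: float, r_max: float) -> float:
    """Min-max normalize an observed R2 to [0, 1] using horizon-specific bounds."""
    scaled = (r2 - r_min) / (r_max - r_min)
    return min(1.0, max(0.0, scaled))  # cap to prevent out-of-range effects
```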
The summary statistics reveal important scaling characteristics. Random forest exhibits the largest performance range (0.555), indicating high sensitivity to data availability but also the greatest improvement potential. Gradient boosting maintains the most consistent performance across horizons (σ = 0.139), suggesting robust adaptation to varying data constraints. Linear models show moderate stability but require extended temporal contexts to achieve competitive accuracy. These baseline results establish the foundation for subsequent stress-testing and augmentation experiments. The stress-testing approach, illustrated in Algorithm 1, allows us to measure the trade-off between model robustness and predictive accuracy, providing critical insights for the selection of forecasting methods suitable for real-world deployment.
Algorithm 1. Baseline and data quality stress test
Input: Full dataset D (monthly records 2022–2023)
Output: Performance matrix P of models × scenarios
01: for months ∈ {24, 18, 12, 6} do                  ▶ historical windows
02:   D_m ← truncate(D, last months)                 ▶ keep latest months
03:   (D_tr, D_te) ← chronological_split(D_m, 0.8)   ▶ no leakage
04:   X_tr, y_tr ← feature_engineering(D_tr)         ▶ lags, seasonality, etc.
05:   X_te, y_te ← feature_engineering(D_te)
06:   for model ∈ {Random_Forest, Gradient_Boosting, Ridge, Lasso} do
07:     θ ← fit(model, X_tr, y_tr)                   ▶ fixed hyper-parameters
08:     ŷ ← predict(θ, X_te)
09:     r2 ← normalised_R2(y_te, ŷ, months)          ▶ capped per horizon
10:     mae ← MAE(y_te, ŷ)
11:     store(P, months, model, "baseline", r2, mae)
12:     for mrate ∈ {0.05, 0.10, 0.20, 0.30} do      ▶ missing-value scenarios
13:       X_tr_mv ← inject_missing(X_tr, mrate)
14:       θ ← fit(model, X_tr_mv, y_tr)
15:       ŷ ← predict(θ, X_te)
16:       r2 ← normalised_R2(y_te, ŷ, months); mae ← MAE(y_te, ŷ)
17:       store(P, months, model, "missing", r2, mae)
18:     for σ ∈ {0.05, 0.10, 0.15, 0.20} do          ▶ noise scenarios
19:       y_tr_ns ← add_noise(y_tr, σ)
20:       θ ← fit(model, X_tr, y_tr_ns)
21:       ŷ ← predict(θ, X_te)
22:       r2 ← normalised_R2(y_te, ŷ, months); mae ← MAE(y_te, ŷ)
23:       store(P, months, model, "noise", r2, mae)
24: return P
One of the most innovative elements of our methodology is the introduction of a deterministic, ultra-conservative data augmentation protocol tailored to extremely limited historical datasets (6- and 12-month scenarios). The primary goal of this protocol is to synthetically expand the training set while preserving the statistical properties of the original data to avoid the artificial inflation of the predictive accuracy. The augmentation process begins by identifying the most temporally consistent records, which are those whose consumption values align closely with lagged features and rolling statistical windows. This ensures that augmented samples do not introduce implausible or contradictory patterns into the dataset. Once selected, these high-consistency samples undergo controlled perturbations:
- Non-critical features (e.g., derived statistical indicators) are adjusted to within 0.5% of their original values;
- Temporal and categorical building identifiers are perturbed within ±0.1% to maintain identity integrity while introducing minimal variability.
This process is repeated until the augmented dataset reaches a size of 1.2× the original for 6-month data and 1.3× the original for 12-month data. Before integration into the training pipeline, the augmented set is validated to ensure that changes in the mean and standard deviation of all features remain within 2% and 5% thresholds, respectively. This guarantees statistical fidelity and physical plausibility. By design, this augmentation method is ultra-conservative and was empirically observed not to cause systematic performance degradation across repeated trials. This approach, summarized in Algorithm 2, provides a safe yet effective means to enhance model resilience when historical data availability is severely constrained.
Algorithm 2. Data augmentation (used when months ≤ 12)
Input: Training set (X_tr, y_tr), window length months, factor α (1.2 or 1.3)
Output: Augmented training set (X_aug, y_aug)
01: target_size ← ⌈α · |X_tr|⌉
02: R ← select_reliable_samples(X_tr, y_tr)          ▶ lag-consistent records
03: X_aug ← X_tr; y_aug ← y_tr                       ▶ start with originals
04: while |X_aug| < target_size do
05:   t ← random_choice(R)
06:   x*, y* ← duplicate(X_tr[t], y_tr[t])           ▶ template copy
07:   for feature f in non_critical(x*) do           ▶ exclude IDs & seasonality
08:     x*_f ← x*_f · U(0.995, 1.005) if months ≤ 6 else x*_f · U(0.999, 1.001)
09:   for lag or rolling feature g do
10:     x*_g ← x*_g · U(0.9995, 1.0005)
11:   y* ← clip(y* + ε, µ ± 0.02σ) where ε ~ U(−0.01σ, 0.01σ)
12:   X_aug ← X_aug ∪ {x*}; y_aug ← y_aug ∪ {y*}
13: shuffle_minor(X_aug, y_aug, preserve_temporal_order = True)
14: assert |mean(y_aug) − mean(y_tr)| < 0.02 · |mean(y_tr)|
15: assert |std(y_aug) − std(y_tr)| < 0.05 · std(y_tr)
16: return (X_aug, y_aug)
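The closing fidelity checks of Algorithm 2 can be expressed directly in NumPy, as in the sketch below; the function name and signature are illustrative assumptions, with the tolerances taken from the text.

```python
import numpy as np

def augmentation_is_valid(y_orig, y_aug, mean_tol=0.02, std_tol=0.05):
    """Accept the augmented set only if its mean and std stay within tolerance."""
    mean_ok = abs(np.mean(y_aug) - np.mean(y_orig)) < mean_tol * abs(np.mean(y_orig))
    std_ok = abs(np.std(y_aug) - np.std(y_orig)) < std_tol * np.std(y_orig)
    return mean_ok and std_ok
```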
The performance of forecasting models is evaluated through a dual-metric framework designed to balance interpretability, robustness and cross-scenario comparability.
First, we computed the coefficient of determination (R2), which quantifies the proportion of variance in the observed data that is captured by the model. To ensure fair comparisons across forecasting horizons of different lengths, R2 values are normalized using a capped transformation. This transformation accounts for the inherent difficulty of the forecasting task at each time horizon, preventing artificially inflated performance scores in easier scenarios and discouraging misleadingly small gains from being overemphasized. Second, we computed the mean absolute error (MAE) in standardized physical units (kWh/m2). This metric provides a tangible measure of the prediction error that is directly interpretable for building energy managers and policymakers. Reporting the MAE in normalized energy units ensures that the performance evaluations remain relevant to both technical and operational decision-making contexts.
All evaluations are performed on temporally held-out test sets that have never been used during model training or hyperparameter tuning. This prevents information leakage and ensures that the evaluation results reflect genuine predictive capabilities rather than artifacts of overfitting. This evaluation strategy provides a rigorous and transparent basis for comparing models across different data availability scenarios and degradation levels.
3. Results and Discussion
The results presented in Figure 6 offer a clear and compelling narrative: the predictive accuracy of all evaluated models is strongly dependent on the amount of historical data available at training time. This dependency is intuitive but critical for practitioners and researchers to quantify precisely, because it directly influences the feasibility of deploying AI-based forecasting systems in data-scarce urban environments.
In the 24-month scenario, which represents the most favorable setting in this study, RF and GB both achieved normalized R2 scores of 0.870, indicating high relative predictive performance under the normalized evaluation framework. This level of performance places these ensemble methods squarely in the category of operationally deployable forecasting models for urban energy management tasks. The nearly identical scores for RF and GB here indicate that, under abundant historical data conditions, the choice between bagging-based and boosting-based ensemble approaches may be less consequential from a pure accuracy standpoint.
Interestingly, even the simpler Ridge regression and Lasso regression methods performed competitively in this scenario, with scores of 0.851 and 0.842, respectively. These results corroborate earlier findings in energy forecasting research that, in datasets with strong seasonal periodicity, relatively low measurement noise and carefully engineered temporal features, as is the case in the Energy4Rome dataset, regularized linear models can rival more complex nonlinear approaches [6]. This outcome emphasizes the value of robust feature engineering in bridging the gap between algorithmic simplicity and forecasting accuracy.
As the available historical data are reduced to 18 months, a more nuanced picture emerges. RF retains a relatively high R2 = 0.798, demonstrating notable resilience to moderate reductions in temporal depth. This robustness likely stems from RF’s ability to leverage decision tree ensembles to capture stable seasonal consumption rules, even when fewer examples are available. GB, however, experiences a sharper decline to R2 = 0.697. This suggests that, in moderately data-constrained environments, boosting’s sequential learning mechanism may amplify noise and variance when fewer examples are available to guide corrective adjustments.
Surprisingly, Ridge and Lasso remain competitive in the 18-month case, scoring 0.791 and 0.779, respectively. This stability may be due to their strong bias towards simpler, more stable relationships, which can still be adequately estimated with ~1.5 years of data. In these intermediate scenarios, linear models appear to be more robust than GB, particularly when the dataset remains structured and seasonality is still observable.

The picture changes dramatically in the 12-month scenario. Here, only one full seasonal cycle is available for model training, meaning that any predictive relationship between seasonal indicators and energy consumption must be inferred from a single set of annual observations. Under these constraints, GB emerges as the clear leader, with R2 = 0.711, while RF drops sharply to 0.631. The poor performance of RF in this context may be due to its reliance on repeated seasonal patterns to stabilize decision rules. Without multiple seasonal cycles, its trees may overfit to idiosyncrasies of the training year. Ridge and Lasso, lacking nonlinear interaction modeling capabilities, degrade substantially, both converging to 0.385. This collapse confirms that linear models cannot adequately model building-level consumption behavior when annual seasonality is underrepresented.
The 6-month scenario represents the extreme case of severe historical scarcity. In this setting, no complete seasonal cycle is available for training, forcing the models to extrapolate beyond their direct historical experience. Notably, GB maintains relative superiority, achieving R2 = 0.536, although this represents a significant drop from its 24-month performance. RF and Lasso both hover around 0.315, indicating that bagged ensembles and sparsity-driven linear models alike struggle to generalize without seeing a full annual consumption cycle. Ridge performs worst, producing unstable predictions, likely due to its overreliance on insufficient lag-based features.
Taken together, these results have several practical implications for model selection in urban energy forecasting:
- When historical data span at least 18–24 months, simpler models such as Ridge and Lasso may perform nearly as well as more complex ensembles, making them attractive for their lower computational costs and higher interpretability;
- When historical data are reduced to 12 months or less, GB consistently outperforms all alternatives, suggesting that boosting-based models are better suited to capturing partial seasonal patterns and subtle nonlinearities under constrained conditions;
- RF excels in rich data settings but loses its relative advantage as the historical coverage declines, indicating a stronger dependency on repeated seasonal patterns for stable generalization.
The strong performance of the Ridge and Lasso regression models in the 18- and 24-month scenarios can be largely attributed to the structured temporal features used in the modeling framework. Cyclical month encodings effectively capture seasonal consumption patterns, while low-order lagged variables represent short-term temporal dependencies. When multiple seasonal cycles are present, these features provide a stable and informative linear signal that can be efficiently exploited by regularized linear models. The use of L1 and L2 regularization further stabilizes coefficient estimation in the presence of correlated temporal predictors, reducing variance and enhancing generalization. In contrast, when the historical depth is insufficient to fully represent seasonality, as in the 6- and 12-month windows, the explanatory power of these features diminishes, leading to a sharper performance decline for linear models.
These findings are consistent with the trade-off summarized in Table 2, where Ridge and Lasso excel in structured, stable datasets but risk underfitting nonlinear behavior, while ensemble methods capture more complex patterns but can overfit when data are scarce. This reinforces the central methodological lesson: model complexity should be aligned with the temporal depth and statistical richness of the available data, a principle echoed throughout the forecasting literature [16,17].
The comparative analysis of forecasting models, summarized in Table 3, reveals that Ridge and Lasso regression, despite their relative simplicity, achieve competitive and, in certain cases, superior predictive performance compared to advanced ensemble models such as RF and GB in the 18- and 24-month historical data scenarios. This is a nontrivial finding because it challenges the common assumption that more sophisticated, nonlinear ML models will universally outperform linear alternatives. Instead, the results emphasize that the structure, cleanliness and temporal richness of the dataset play a decisive role in determining which modeling family will be most effective.
A key explanation lies in the nature of the Energy4Rome dataset (https://github.com/teeclab/Urban-Energy-Forecasting-AI-Stress-Test- (accessed on 3 August 2025)). This dataset is not only relatively structured and clean but also exhibits well-defined seasonal cycles and long-term trend components. Such regularities mean that a substantial fraction of the variance in energy consumption can be captured through a combination of carefully engineered temporal features, including cyclic month encodings, multiple lag variables and expanding window statistics such as rolling means and standard deviations. These features, when used as inputs to regularized linear models, enable Ridge and Lasso to leverage stable, repeatable seasonal patterns without requiring the flexibility of a complex nonlinear decision boundary.
The role of regularization is central to this performance. Ridge regression applies L2 regularization to shrink coefficients, mitigating multicollinearity and stabilizing predictions in the presence of correlated temporal features. Lasso regression adds the ability to perform embedded feature selection through L1 regularization, effectively discarding features that contribute little to the predictive power. Together, these regularization mechanisms prevent overfitting—a particular risk in seasonal energy datasets, where certain fluctuations may be due to idiosyncratic or unrepeatable events rather than persistent structural patterns. The net effect is that Ridge and Lasso can produce parsimonious, interpretable models that generalize well when multiple seasonal cycles are present.
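In standard notation, with $\lambda$ controlling the regularization strength (the tuned values used in the experiments are not restated here), the two estimators solve:

```latex
\hat{\beta}_{\text{Ridge}} = \arg\min_{\beta}\;
  \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2,
\qquad
\hat{\beta}_{\text{Lasso}} = \arg\min_{\beta}\;
  \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
```

The geometry of the L1 penalty drives some coefficients exactly to zero, which is the embedded feature selection described above; the L2 penalty instead shrinks correlated coefficients toward one another, stabilizing estimates under multicollinearity.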
In contrast, RF and GB excel in shorter-horizon scenarios—for example, when only 6 to 12 months of historical data are available. In such contexts, seasonal and trend information is fragmented, and nonlinear interactions or building-specific behaviors may dominate the signal. Ensemble methods are adept at capturing these irregular patterns without the need for explicit feature engineering. However, in the 18- and 24-month cases, where seasonal cycles are well represented, the additional complexity of these ensemble models provides diminishing returns and, in some cases, creates opportunities for overfitting, particularly when the hyperparameters are not tuned conservatively. Even in our experiments, where both RF and GB were deliberately tuned for stability, the inherent flexibility of these models sometimes led them to fit idiosyncratic patterns in the training set that did not translate into genuine predictive skill.
This analysis reinforces an important methodological principle: model complexity should be matched to the statistical richness of the available data. In structured, seasonally stable datasets with ample history, regularized linear models offer a highly attractive balance of accuracy, interpretability and computational efficiency. In noisier, less structured datasets or when historical data are severely limited, ensemble tree-based methods become indispensable, as they can capture subtle interactions and nonlinearities that linear models would miss. The implications for real-world deployment are significant. In municipal energy forecasting contexts, where transparency and explainability are often prerequisites for adoption, the strong performance of Ridge and Lasso in long-horizon cases suggests that they should not be overlooked as default tools. Conversely, where rapid changes in usage behavior, atypical occupancy patterns or sudden shifts in energy demands dominate, particularly in newly monitored districts, ensemble methods may justify their added complexity. Ultimately, the choice is not about finding the “best” model universally but about selecting the right model for the specific temporal and statistical profile of the data at hand.
Across all models and degradation scenarios, a clear turning point emerges at a historical depth of approximately 12 months. Below this threshold, the predictive performance deteriorates rapidly and becomes highly unstable, regardless of model complexity. In contrast, when at least 18 months of data are available, the performance gains tend to saturate and differences between model families become more consistent. This indicates that data availability, rather than model sophistication, represents the primary limiting factor under small data conditions.
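Read together with the 12-month turning point, this guidance can be distilled into a simple rule of thumb; the sketch below encodes only the qualitative thresholds discussed in the text and is a reading aid, not a validated decision procedure:

```python
def recommend_model_family(months_of_history: int) -> str:
    """Rule-of-thumb selector distilled from the comparisons above.

    Encodes only the qualitative thresholds discussed in the text
    (a turning point near 12 months, saturation from 18 months).
    """
    if months_of_history <= 12:
        # Boosting copes best with partially observed seasonality.
        return "gradient boosting"
    if months_of_history >= 18:
        # Regularized linear models become competitive and are cheaper
        # to train, interpret and deploy.
        return "ridge or lasso with engineered temporal features"
    # The 12-18 month range is not separately characterized in the text:
    # validate both families before committing.
    return "compare gradient boosting against ridge/lasso on held-out data"
```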
3.1. Impact of Missing Values Across Time Horizons
To comprehensively evaluate model robustness, we simulated increasing levels of missing data (from 5% to 30%) across four historical time windows (6, 12, 18 and 24 months), as summarized in Figure 7.
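The degradation step can be sketched as follows; the masking mechanism and imputer are not specified in the text, so this assumes missing-completely-at-random cell masking followed by mean imputation:

```python
import numpy as np
from sklearn.impute import SimpleImputer

def degrade_and_impute(X: np.ndarray, missing_rate: float, seed: int = 0) -> np.ndarray:
    """Randomly mask a fraction of feature cells, then mean-impute them.

    A simplified stand-in for the missing-data stress test (5%-30% rates);
    the study's exact masking and imputation choices may differ.
    """
    rng = np.random.default_rng(seed)
    X_degraded = X.astype(float).copy()
    mask = rng.random(X.shape) < missing_rate  # MCAR, cell-level masking
    X_degraded[mask] = np.nan
    return SimpleImputer(strategy="mean").fit_transform(X_degraded)
```

Under this reading, the "regularization effect of imputation" discussed below arises because mean imputation pulls masked cells toward column averages, damping spurious variation in very small training sets.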
In shorter historical windows (6 and 12 months), RF exhibits performance improvements under moderate missing data rates. This counterintuitive gain is likely due to the regularization effect of imputation, which reduces overfitting by smoothing out spurious variations in small training sets. In these conditions, missingness may effectively suppress noise and encourage the model to prioritize more stable predictors. GB, on the other hand, shows a more typical decline, especially at higher missingness levels, while linear models like Ridge and Lasso remain largely unaffected, although their baseline accuracy is already limited due to the insufficient historical context.
As the historical depth increases to 18 and 24 months, the advantage of ensemble methods becomes more apparent. Both RF and GB maintain high normalized R2 scores even at 30% missingness, demonstrating their strong generalization capacity and internal strategies for handling incomplete data (e.g., surrogate splits, weighted node decisions). Conversely, the performance of Ridge and Lasso begins to diverge, with Ridge slightly more affected as the missing rate increases and Lasso maintaining stable yet modest results. This shift indicates that, as more contextual information becomes available, model complexity plays a greater role in mitigating information loss.
Taken together, these findings emphasize the necessity of contextual model selection based on both data richness (historical depth) and quality (completeness). While linear models may suffice under ideal data conditions or for transparency needs, tree-based ensemble models clearly offer superior robustness in real-world scenarios where data imperfections are inevitable. Therefore, the trade-off between interpretability and performance must be considered carefully by decision makers, especially in operational forecasting environments constrained by limited or imperfect datasets.
3.2. Impact of Target Noise Across Time Horizons
To complement the analysis of missing data, this study explicitly evaluates the sensitivity of forecasting models to measurement noise affecting the target variable. In real urban energy monitoring systems, consumption values may be distorted by sensor inaccuracies, calibration drift, communication errors or external disturbances. To reflect these conditions, zero-mean Gaussian noise was injected exclusively into the training targets at increasing intensity levels (5%, 10%, 15% and 20% of the original target variability), while the test data remained unaltered. This design choice ensured that noise influenced model learning without artificially corrupting the evaluation phase, thereby preserving the realism of the experimental setup.
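A direct reading of this protocol, assuming "original target variability" denotes the standard deviation of the training targets, is sketched below:

```python
import numpy as np

def inject_target_noise(y_train: np.ndarray, level: float, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise to training targets only.

    `level` is the intensity (0.05, 0.10, 0.15 or 0.20); the noise standard
    deviation is scaled to the training targets' own variability. Test
    targets are never perturbed, matching the design described above.
    """
    rng = np.random.default_rng(seed)
    sigma = level * float(np.std(y_train))
    return y_train + rng.normal(0.0, sigma, size=y_train.shape)
```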
The results, summarized in Table 4, reveal that model robustness to target noise is strongly dependent on both the algorithmic structure and historical data availability. In longer temporal windows (18 and 24 months), ensemble-based methods exhibit substantial resilience to noise injection. RF maintains stable performance across all tested noise levels, reflecting the variance reduction properties of bagging and the averaging effect of multiple decision trees. GB shows a moderate but consistent degradation as the noise increases, particularly at higher noise intensities, indicating higher sensitivity to perturbed training targets due to its sequential error correction mechanism. In shorter historical windows (6 and 12 months), noise sensitivity becomes more pronounced for all models. Under these constrained conditions, GB retains comparatively higher normalized R2 values than alternative approaches, despite the visible performance degradation at elevated noise levels. This behavior suggests that boosting-based methods are better equipped to extract structured signals from noisy targets when temporal information is limited. In contrast, regularized linear models (Ridge and Lasso) display sharper performance declines as the noise increases, particularly in short windows, highlighting their reliance on stable linear relationships that are easily disrupted by target perturbations.
Overall, the noise stress test confirms that target noise has a pronounced impact on predictive performance, especially under severe data scarcity. While missing data primarily affect feature representation, noise directly compromises the learning signal itself, making its impact more difficult to mitigate. These findings reinforce the importance of selecting models with inherent robustness mechanisms when operating under realistic urban conditions characterized by imperfect measurements and limited historical depth.
3.3. Data Augmentation
Before presenting the quantitative effects of data augmentation, it is important to clarify the methodological rationale behind the adoption of an ultra-conservative augmentation strategy. The Energy4Rome dataset consists of monthly observations collected over a very short temporal horizon (24 samples), a condition that poses severe limitations on the applicability of conventional generative augmentation techniques such as GANs, VAEs or SMOTE. These approaches typically require substantially larger datasets to learn stable data distributions and temporal dependencies, and their application to extremely short and low-frequency time series may lead to unstable training dynamics and poor generalization. Aggressive generative augmentation may introduce artificial temporal patterns, implicit information leakage across time steps or consumption profiles that are not physically plausible with respect to building operation and seasonal dynamics. These risks are particularly critical when dealing with monthly data, where each observation carries high informational weight and minor distortions can disproportionately affect model behavior.
For these reasons, data augmentation in this study is deliberately implemented through strictly bounded perturbations of real observations, designed to preserve the temporal coherence, seasonal structure and original statistical properties of the dataset. Rather than attempting to synthetically expand the underlying data distribution, the proposed approach aims to modestly increase sample diversity while maintaining consistency with realistic urban energy consumption patterns. This conservative design ensures that any observed performance improvements reflect enhanced model robustness under small data conditions, rather than artifacts introduced by overly aggressive or data-hungry generative techniques.
All data augmentation experiments are conducted exclusively under baseline (non-degraded) data conditions and are not combined with missing data or target noise injection, in order to isolate the effects of temporal data scarcity and avoid propagating degradation artifacts.
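As an illustration of what strictly bounded perturbations can look like in practice, the sketch below jitters each real monthly observation within a hard multiplicative envelope; the ±2% bound and the number of synthetic copies are assumptions for exposition, not the study's exact settings:

```python
import numpy as np

def conservative_augment(series: np.ndarray, n_copies: int = 2,
                         max_rel: float = 0.02, seed: int = 0) -> list:
    """Bounded perturbation of a real monthly consumption series.

    Each synthetic copy stays element-wise within +/- `max_rel` of its
    source observation, so temporal ordering, seasonal shape and scale
    are preserved. Bound and copy count are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n_copies):
        factors = rng.uniform(1.0 - max_rel, 1.0 + max_rel, size=series.shape)
        copies.append(series * factors)
    return copies
```

Temporal features would then be re-derived from each perturbed series before training, so that the augmented samples remain internally consistent.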
The augmentation strategy demonstrates distinct improvement patterns that vary significantly by dataset duration and model architecture. In the 6-month interval, as shown in Figure 8, both GB and Lasso regression show meaningful enhancements, with GB improving from 0.520 to 0.531 (+2.0%) and Lasso advancing from 0.536 to 0.546 (+2.0%), while RF and Ridge remain unchanged at their baseline performance levels of 0.315. The 12-month dataset reveals a different pattern, as shown in Figure 9, where tree-based ensemble methods benefit substantially: RF gains 1.2% (0.631 to 0.638) and GB achieves a 2.0% improvement (0.711 to 0.725), while both linear models (Ridge and Lasso) show no measurable change at 0.385. This divergent response suggests that augmentation effectiveness is both model-dependent and critically influenced by the dataset’s temporal depth, with different model families responding optimally at different data availability thresholds.
From an operational perspective, these results reveal that augmentation benefits follow distinct model-specific patterns across temporal horizons. In severely constrained scenarios (6 months), both boosting algorithms and regularized linear models with feature selection capabilities (Lasso) demonstrate responsiveness to synthetic data enhancement, suggesting that models with inherent regularization or adaptive learning mechanisms can leverage even minimal augmentation effectively. Conversely, in extended datasets (12 months), tree-based ensemble methods demonstrate their full potential for synthetic data utilization, likely due to their capacity to capture complex temporal interactions through ensemble averaging and their robustness to slight data perturbations. GB emerges as consistently the most responsive model across both timeframes, achieving 2.0% improvements regardless of the dataset duration, while maintaining the highest baseline performance in the 12-month scenario (0.711). For practitioners implementing energy forecasting systems, these findings suggest that augmentation strategies should be carefully matched to both the available temporal window and the intended model architecture, with GB offering the most reliable enhancement potential across the data availability spectrum.
It is worth noting that several observed performance differences, particularly those associated with data augmentation or model selection under constrained scenarios, are relatively small in magnitude (approximately 1–2%). From a practical perspective, such differences are unlikely to substantially alter operational decision-making processes in typical urban energy management applications, where forecasting outputs are often used to support planning, monitoring or high-level optimization rather than fine-grained real-time control. In the context of small data, marginal accuracy improvements may be outweighed by considerations related to model robustness, interpretability and data availability. The results therefore suggest that, under severe data limitations, the selection of an appropriate model family and the identification of minimum data availability thresholds represent more critical factors than incremental performance gains. This finding reinforces the central message of the study: ensuring reliable and stable predictions under realistic small data conditions is more valuable in practice than pursuing marginal improvements that may not translate into tangible operational benefits.
As noted above, conventional augmentation methods such as the Synthetic Minority Oversampling Technique (SMOTE), Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) were intentionally avoided. These techniques require large datasets to train generative models reliably and may introduce synthetic patterns that violate temporal consistency or physical plausibility. The proposed augmentation strategy instead perturbs real observations within strict statistical bounds, ensuring that no artificial or unrealistic seasonal structure is injected. This guarantees that improvements in performance reflect genuine gains in model resilience rather than optimistic artifacts of aggressive data synthesis.
3.3.1. Data Augmentation—Detailed Effectiveness Analysis
The ultra-conservative augmentation strategy demonstrates selective improvements that vary significantly by dataset duration and model architecture. Table 5 quantifies the augmentation benefits across all model–dataset combinations, revealing model-specific response patterns.
This study provides a reproducible benchmark for evaluating ML models in constrained energy forecasting settings, as well as a pragmatic toolkit for improving performance through conservative augmentation. Future extensions will incorporate exogenous variables such as weather and occupancy and explore adaptive augmentation strategies that can further tailor synthetic data generation to the evolving structure of building energy demands.
3.3.2. Influence of Building Type, Climate and Occupancy Patterns
The results discussed in this section should be interpreted in light of the specific contextual characteristics of the case study, namely the types of buildings analyzed, the local climatic conditions and the underlying occupancy patterns, which may influence both the absolute performance of the models and the generalizability of the proposed framework. The dataset used in this study refers to large residential social housing buildings, characterized by a high level of spatial aggregation and relatively stable occupancy over time. At this scale, individual behavioral variability is largely averaged out, resulting in smooth monthly consumption profiles with strong seasonal regularity. This aggregation effect contributes to the robustness observed for several model families, particularly regularized linear models and ensemble-based approaches, which benefit from stable cyclical patterns and reduced short-term noise. In contrast, buildings with different functional uses, such as offices, retail spaces or mixed-use facilities, typically exhibit more heterogeneous and intermittent occupancy schedules, including weekday–weekend asymmetries, holiday effects and activity-driven demand peaks. In such contexts, the relative performance rankings of models may differ, and architectures explicitly designed to capture nonlinear interactions and regime changes could become more advantageous.
Climatic conditions also play a critical role in shaping energy demand dynamics. The case study was located in Rome, which is characterized by a Mediterranean climate, where the cooling demand dominates during summer months and heating loads remain comparatively moderate. Under these conditions, seasonal patterns are relatively smooth and consistent across years, which favors models relying on cyclical temporal features and short historical windows. In colder or heating-dominated climates, however, energy consumption profiles are often more sensitive to extreme weather events, prolonged heating seasons and inter-annual climatic variability. These factors may increase the importance of exogenous climatic variables and longer historical records to ensure reliable model training. Consequently, the minimum data thresholds identified in this study, such as the strong performance of ensemble models with 18–24 months of data, should be interpreted as context-dependent and not universally transferable without further validation.
Occupancy patterns further interact with both the building type and climate, influencing the stability and predictability of the energy demand. The residential buildings analyzed here exhibited relatively constant occupancy levels, with energy use primarily driven by seasonal thermal needs rather than abrupt behavioral changes. In contrast, commercial or public buildings often experience highly variable occupancy, influenced by operational schedules, user behavior and external events. In such cases, models trained exclusively on historical consumption data may face greater challenges in generalization, particularly under small data conditions, unless additional explanatory variables related to occupancy or operational status are incorporated.
These considerations do not undermine the validity of the proposed stress-testing framework but rather delineate its domain of applicability. The framework is intentionally designed to be modular and adaptable, allowing the same experimental protocol to be applied to datasets characterized by different building typologies, climatic zones and occupancy regimes.
4. Conclusions
This study provides a systematic assessment of the robustness of AI models applied to urban energy consumption forecasting, with particular reference to contexts characterized by limited information availability, fragmented data or data affected by quality degradation. Through a comprehensive experimental design, the sensitivity of predictive performance was analyzed with respect to the reduction of the available historical horizon, the presence of missing data and the introduction of noise into the target variable. The results highlight a marked dependence of model effectiveness on the temporal depth of the dataset used for training. Ensemble-based methods, particularly GB and RF, show greater resilience in scenarios with low information availability, especially when integrated with mitigation strategies such as regularization and synthetic augmentation. Regularized linear models (Ridge, Lasso), although less flexible, achieve good levels of predictive stability in the presence of structured datasets with marked cyclicality, especially in extended time windows (18–24 months). A significant methodological contribution of the study is the development of a modular and replicable stress-test framework, which allows for the systematic comparison of model performance across different time horizons, levels of information degradation and algorithmic architectures. This experimental platform also allows for the identification of the minimum thresholds of data quality and quantity necessary to ensure operational reliability in forecasting.
A direct quantitative comparison between the results of this study and those reported in previous works was intentionally not pursued. Existing studies on AI-based energy consumption forecasting are highly heterogeneous with respect to temporal resolution, spatial scale, data availability and modeling objectives, ranging from short-term, high-frequency predictions based on large datasets to simulated or benchmark-driven scenarios. Under these conditions, direct performance comparisons would be methodologically misleading. Instead, this work positions itself as a complementary contribution to the existing literature by focusing on robustness-oriented evaluation, minimum data requirements and model behavior under realistic small-data and data-degraded urban conditions. By systematically stress-testing multiple model families across varying historical horizons and controlled data quality degradation, the proposed framework advances current research by shifting the emphasis from absolute predictive accuracy to operational reliability and practical applicability in real-world urban energy management contexts.
The results obtained suggest the importance of model selection that is consistent with the specific characteristics of the available dataset in terms of historical depth, completeness and seasonality. The adoption of complex models does not automatically translate into improved performance and can be counterproductive if the operating conditions differ from those used for training. Furthermore, the ultra-conservative synthetic augmentation strategy proposed in this work demonstrates that carefully controlled perturbations can produce measurable benefits without introducing unrealistic patterns—offering a viable and computationally lightweight tool for improving model resilience in severely constrained datasets. The modularity of the implemented pipeline also ensures that the entire framework can be transferred to other buildings, districts or cities with minimal adjustments, supporting scalability and facilitating practical adoption by energy managers and public authorities.
One prospect for future development could be to integrate exogenous variables into forecasting models to improve predictive capacity in contexts with limited historical data. In particular, including climatic factors, behavioral data and dynamic energy pricing information could enable models to capture variations in energy consumption profiles more effectively. The inclusion of these external variables could help to disambiguate the seasonal component from the behavioral component of consumption, improving the models’ ability to generalize even when time series are short or uneven. In addition, future work may investigate adaptive or model-specific augmentation strategies, as well as the computational cost and deployment requirements associated with each model family, to further support real-world implementation in resource-constrained environments. The combination of heterogeneous data sources lays the foundation for the development of more robust multimodal models, suitable for supporting operational decisions in complex urban environments. In conclusion, the work presented provides a rigorous experimental protocol with operational guidelines for the development of AI-based forecasting systems capable of maintaining reliability under conditions typical of real urban contexts. By quantifying the minimum usable historical depth, identifying the most robust model families under each constraint and demonstrating viable mitigation mechanisms for extreme small data scenarios, this study helps to bridge the gap between theoretical applications and practical requirements in local energy planning processes.