Short-Term Energy Forecasting to Improve the Estimation of Demand Response Baselines in Residential Neighborhoods: Deep Learning vs. Machine Learning

Abstract: Promoting flexible energy demand through response programs in residential neighborhoods would play a vital role in addressing the issues associated with an increasing share of distributed solar systems and in balancing supply and demand in energy networks. However, accurately identifying baseline-related energy measurements when activating demand response events remains challenging. In response, this study presents a deep learning-based, data-driven framework to improve short-term estimates of demand response baselines during the activation of response events. This framework includes bidirectional long short-term memory (BiLSTM), long short-term memory (LSTM), gated recurrent unit (GRU), convolutional neural network (CNN), deep neural network (DNN), and recurrent neural network (RNN) models. Their performance is evaluated by considering different aggregation levels of the demand response baseline profile for 337 dwellings in the city of La Rochelle, France, over different time horizons, not exceeding 24 h. It is also compared with fifteen traditional statistical and machine learning methods in terms of forecasting accuracy. The results demonstrate that the deep learning-based models, compared with the others, significantly succeeded in minimizing the gap between the actual and forecasted values of demand response baselines at all aggregation levels of dwelling units over the considered time horizons. The BiLSTM models, followed by GRU and LSTM, consistently demonstrated the lowest mean absolute percentage error (MAPE) in most comparison experiments, with values of up to 9.08%, 8.71%, and 9.42%, respectively. Among the traditional statistical and machine learning models, extreme gradient boosting (XGBoost) was among the best, with a MAPE of up to 11.56%, but it could not achieve the same level of forecasting accuracy in all comparison experiments.
Such high performance reveals the potential of the proposed deep learning approach and highlights its importance for improving short-term estimates of future baselines when implementing demand response programs in residential neighborhood contexts.


Introduction
The European residential sector accounts for approximately 75% of European buildings and is solely responsible for over 25% of the final energy demand in the European Union (EU), making it the second-largest consumer after transport [1]. With this in mind, increasing the energy efficiency of residential and non-residential buildings is one of the main objectives of EU strategies to achieve the ambitious endeavor of decarbonizing European buildings [2]. Given that residential buildings are significant contributors to global carbon emissions, there is great interest in creating a low-carbon residential sector in Europe [3]. The widespread deployment of advanced demand-side management strategies through demand response programs, enabling flexible energy demand in European residential buildings, is seen as a promising direction to maximize energy efficiency while meeting comfort requirements. Demand response programs focus significantly on the increased integration of low-carbon energy generation systems, such as distributed solar photovoltaic systems [4], and on modifying the natural energy usage of residential buildings in response to fluctuations in supply and demand when the reliability and security of energy networks are compromised [5]. At the same time, end-use customers become participants in demand response programs by modulating their natural consumption patterns according to electricity prices or through corresponding payment incentives in response to control signals issued by energy network operators or aggregators [6,7].
Thus, various types of demand response scenarios/programs, including load-shifting, valley-filling, peak-clipping, and the shaping of flexible loads, have been introduced [6,8] to accurately operate and manage the available local energy resources considering outdoor weather conditions, consumption patterns, and network security. In parallel, demand response strategies that reduce a building's energy demand during stressful times for the energy network are seen as feasible approaches to harness flexible energy demand without the need for substantial investment [9]. Despite the transition towards greater energy efficiency in residential buildings, driven by the widespread adoption of demand response programs, this transition remains incomplete, and significant improvements are still ongoing. In particular, there is a strong need to understand the baseline demand trends (i.e., the so-called demand response baselines) for energy in residential buildings on the side of end-use customers/occupants and their interaction with the energy network. This requires an accurate estimation of demand response baselines, i.e., the energy that would be consumed by end-use customers in the absence of demand response programs [10]. Demand response baselines serve as a fundamental reference point for measuring, optimizing, and assessing the potential reduction in energy demand during response events [11]. At the same time, demand response baselines in buildings are highly fluctuating and non-linear due to the nature of consumption, which depends on occupancy, the culture of a particular building, the working schedule of each building, and outdoor weather conditions [12]. Thus, developing a data-driven learning framework to characterize baseline demand patterns and provide accurate estimates of demand response baselines, enabling the calculation of energy reductions in the context of residential buildings, is crucial [13,14].
In response, energy demand forecasts are a pivotal component of demand response programs for investigating the effectiveness of demand response scenarios and maximizing their benefits. Specifically, accurate forecasts of energy demand in the short term and very short term can be employed to address different types of challenges at both the building level and the energy network level [15]. Common building-level challenges that can be addressed include tracking progress in energy efficiency improvements and defining abnormal behaviors and deviations in expected energy usage patterns, which enable the detection of potential energy losses, breakdowns, and inefficiencies within a building's systems [16,17]. Energy network-level challenges include short-term optimal scheduling and identification of the optimal energy flow to meet expected demand [18], facilitating the increased integration of low-carbon energy sources into the energy network [19], and demand response flexibility optimization by tracking improvement in energy reductions [20]. In demand response contexts, accurate short-term forecasts of demand response baselines would be utilized by aggregators (intermediaries between end-use consumers and energy utility suppliers) to support the fair compensation of households participating in demand response programs [6]. Accurate demand response baselines can also serve as essential information for resource planners and energy system operators interested in implementing demand response programs with high effectiveness [14]. However, developing an accurate and reliable data-driven framework that can bridge this gap remains a difficult task [21]. Therefore, this study aims to address this research need by developing a short-term forecasting framework for demand response baselines based on time-series data. This is essential for any optimal demand response strategy, particularly in large-scale residential neighborhoods that would interact with intermittent renewables (i.e., low-carbon energy sources, such as distributed solar photovoltaic systems).

Literature Review
Over the past few years, several research studies have been devoted to accurately estimating demand response baselines at both the individual customer level and the aggregate level, as summarized in Table 1. Various methods have been employed, which can be categorized into statistical and traditional machine learning methods. Concerning statistical methods, Ghasemi et al. [22] and Wijaya et al. [23] introduced the averaging-based method (XofY method), which is based on historical datasets, to investigate its effectiveness in providing accurate estimates of the demand response baselines for 32 industrial and 782 residential customers in Iran and Switzerland, respectively. Despite the importance of this work, the main limitation of these methods is that they tend to provide a simplified and less accurate representation of historical energy demand data and may not capture the nuances and variations in demand patterns, which can lead to sub-optimal estimates of demand response baselines. Similarly, Zhang et al. [24] and Wang et al. [25] proposed using the residential consumption of non-participants in demand response programs to estimate the baselines of demand response participants. The problem with such a practical approach is that there must be reference buildings with similar characteristics that do not participate in demand response actions. Furthermore, this approach becomes problematic under frequent demand response actions. In the same context, the authors in [10,26–30] presented statistical regression with external inputs, such as weather variables, as predictive factors to perform predictions of demand response baselines. The results revealed that statistical regression models have significant potential to provide accurate estimates of demand response baselines. However, a drawback of statistical regression-based models is the inadequately quantified uncertainty in predicting energy demand baselines, owing to their inability to capture the non-linear relationships between energy demand and relevant influencing factors, such as consumer behavior and ambient weather conditions [31]. As a potential procedure to overcome the problems associated with statistical methods, several researchers have proposed diverse combinations of traditional machine learning methods to construct accurate demand response baselines. In this context, Chen et al. [14] proposed a support vector regression (SVR) method to estimate demand response baselines for office buildings, using factors such as weather and building working schedules as inputs to the SVR models. Similarly, Li et al. [32] proposed an SVR method to estimate customer demand response baselines in the presence of integrated distributed photovoltaic systems. Srivastav et al. [33] and Zhang et al. [34] proposed the development of predictive models based on the Gaussian mixture regression (GMR) method to characterize the demand response baselines of building clusters. However, the GMR method has difficulties in processing time-series data and requires complex algorithmic models to calculate demand response baselines. Bampoulas et al. [35] compared the performance of random forest (RF), multilayer neural network (MNN), SVR, and extreme gradient boosting (XGBoost) methods in providing accurate estimates of residential energy demand response baselines. Similarly, Sha et al. [36] developed six types of predictive models based on multiple linear regression (MLR), SVR, RF, CatBoost, light gradient boosting machine (LightGBM), and artificial neural network (ANN) methods to improve the calculation of demand response baselines for commercial buildings over the next 24 h. Tao et al. [37] proposed a graph convolutional network (GCN) method to improve the estimation of aggregated demand response baselines, and its performance was compared with that of SVR, MLR, and averaging methods. This is in addition to the other methods proposed in [38–40] to improve the estimation of demand response baselines in buildings.
Notwithstanding the effectiveness of some data-driven machine learning methods in estimating demand response baselines, as mentioned above, these methods require substantial improvement through the consideration of more external factors, such as occupancy and indoor environmental conditions, which can be difficult to acquire in the context of large-scale neighborhoods and district buildings. In addition, implementation strategies for demand response in buildings necessitate high-resolution forecasts (from hourly to daily), leading to the need to develop accurate forecasting models [12,41]. As demand patterns fluctuate randomly, inaccurate estimates of demand response baselines can lead to significant errors when aggregated to determine the total energy reductions caused by the activation of response events [37]. In the face of such challenges, deep learning methods have brought the issue of reliable and accurate estimates in short-term energy forecasting studies in the building sector back into the spotlight and have received considerable attention in recent years. Researchers have pointed out the great potential of these methods in providing accurate results for building energy demand forecasting [42,43]. Accordingly, this study introduces the deep learning approach as a potential candidate to improve the accuracy of residential demand response baseline estimates over a short-term forecast horizon. The aim is to develop a robust and reliable deep learning-based, data-driven framework and to evaluate its performance, considering different residential energy demand profiles in neighborhood buildings, in order to provide accurate estimates of aggregated demand response baselines over multiple forecast horizons, not exceeding 24 h. To the authors' knowledge, no previous studies have applied bidirectional long short-term memory (BiLSTM) or gated recurrent unit (GRU) networks to estimate aggregated demand response baselines in a neighborhood context.

Contribution of the Study
In light of the study's objective and considering the strengths and weaknesses identified in the literature, the main contributions of this work are as follows:
• A data-driven framework is proposed to identify the most effective deep learning methods for providing accurate estimates of residential demand response baselines over various time horizons, not exceeding 24 h. This provides novel insight for a deeper understanding of the forecasting characteristics exhibited by different data-driven models.
• The change in model performance during the evaluation phase is verified by considering the demand response baseline profile at different aggregation levels of residential units and with other input features. This investigation is essential for understanding the different behaviors of forecasting models and the importance of each input feature.
• The performance of the deep learning models is compared with that of the traditional statistical and machine learning models developed in this work, considering both the type of forecasting model and the expected margin of error. This comparison helps to identify the strengths and limitations of each model and method in the context of short-term forecasts of residential demand response baselines.
The rest of this paper is organized into four main sections as follows: Section 3 presents the deep learning methods proposed in this work. Section 4 describes the methodology developed from the previously mentioned methods. Next, the findings obtained from this work are presented and discussed in Section 5, and finally, the main conclusions and potential future developments are drawn in Section 6.

Proposed Forecasting Methods
Basic concepts of the deep learning methods proposed in this work are outlined as follows.

Deep Neural Networks
Deep neural networks (DNNs) are among the most widely used and popular neural network architectures in energy demand forecasting, commonly known as multilayer perceptron models owing to their multiple hidden layers. They are widely regarded as a robust and effective tool for solving complex problems, including classification and forecasting tasks, due to their capacity to learn and represent intricate non-linear relationships between input and output data [44]. To achieve this, the basic structure of a DNN consists of three types of successive layers: input layers, hidden layers, and output layers, as shown in Figure 1a. The back-propagation algorithm is used to train DNNs [45]. As shown in Figure 1a, the input layer receives the input signals µ(t − τ_1), µ(t − τ_2), µ(t − τ_3), . . ., µ(t − τ_n), where τ_1, τ_2, τ_3, . . ., τ_n are constants. The summation of the control signals and the system's outputs at time t is represented by u(t). The weights connecting the input layer to the hidden layer are w^1_11, w^1_12, . . ., w^1_1n for the first neuron, w^1_21, w^1_22, . . ., w^1_2n for the second neuron, and w^1_31, w^1_32, . . ., w^1_3n for the third neuron. More generally, the weights associated with hidden neuron q are denoted w^1_q1, w^1_q2, . . ., w^1_qn, where q is the number of hidden neurons. The weights connecting the hidden layer to the output layer are w^2_1, w^2_2, . . ., w^2_q [46,47]. These weights and connections between the different layers enable the DNN to learn and make reliable and accurate predictions from the input data. Thus, DNNs have gained significant attention in time-series forecasting of building energy demand, with residential buildings receiving a substantial part of this attention [48,49].
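As an illustration of the feed-forward pass described above, the following is a minimal numpy sketch of a single-hidden-layer perceptron mapping n lagged inputs to one forecast value. The layer sizes, random weights, and tanh activation are illustrative assumptions, not the configuration used in this study.

```python
import numpy as np

def dnn_forward(x, W1, b1, W2, b2):
    """One forward pass of a single-hidden-layer perceptron.

    x  : (n,) vector of lagged inputs mu(t - tau_1) ... mu(t - tau_n)
    W1 : (q, n) input-to-hidden weights (the w^1_ij above)
    W2 : (1, q) hidden-to-output weights (the w^2_j above)
    """
    h = np.tanh(W1 @ x + b1)   # hidden-layer activations
    return W2 @ h + b2         # forecast u(t)

rng = np.random.default_rng(0)
n, q = 4, 3                    # illustrative layer sizes
x = rng.normal(size=n)
y = dnn_forward(x, rng.normal(size=(q, n)), np.zeros(q),
                rng.normal(size=(1, q)), np.zeros(1))
print(y.shape)                 # (1,)
```

In training, the weights W1 and W2 would be fitted by back-propagation rather than drawn at random as here.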

Convolutional Neural Networks
Convolutional neural networks (CNNs) are a unique class of advanced neural network methods that hierarchically perform convolutional operations on input time-series data. In recent years, CNNs have received increasing attention in building energy demand forecasting [48,50] due to their ability to capture time-series dependencies. The typical structure of a CNN consists of five layers, namely the input, convolutional, pooling, fully connected, and output layers [51]. CNNs are characterized by their capability to process and transform time-series datasets using three building blocks. (1) The convolutional layer implements two types of operations. (a) The first are convolutional operations, which require two components, called the kernel and the time-series data. The kernel implements convolution by moving from the beginning to the end of the series (i.e., in one direction), and the dot product between the kernel and the corresponding part of the series is computed. (b) The second is a non-linear activation applied to the output of the convolutional operations. The other building blocks are (2) a pooling layer, which maintains stability and prevents overfitting of the model, and (3) a fully connected layer, which performs the same duties as in conventional neural networks [52].
In this work, a one-dimensional convolutional neural network (Conv1D) was utilized to extract features from the time-series data of demand response baselines in residential buildings. This network applies sliding convolutional operations along a sequence of one-dimensional time-series data [53]. The proposed Conv1D network consists of five foundational layers: the convolutional layer, the pooling layer, the fully connected layer, the dropout layer, and the ReLU activation layer. The performance of the CNN depends on the parameters of these layers, which include the number of filters per layer, the filter size, the padding, the stride, and the batch size. Figure 1b displays the CNN architecture used in this work.
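The sliding dot-product step described above can be sketched in a few lines of numpy: the kernel moves along the series in one direction, the dot product is taken at each position, and a ReLU non-linearity is applied. The kernel values and toy demand series are illustrative.

```python
import numpy as np

def conv1d(series, kernel, stride=1):
    """Valid 1-D convolution (cross-correlation) followed by ReLU."""
    k = len(kernel)
    out = [series[i:i + k] @ kernel
           for i in range(0, len(series) - k + 1, stride)]
    return np.maximum(np.array(out), 0.0)   # ReLU activation

demand = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 2.0])  # toy demand series
kernel = np.array([0.5, 0.0, -0.5])                # simple difference filter
print(conv1d(demand, kernel))                      # [0. 0. 1. 0.]
```

A full Conv1D layer would learn many such kernels (filters) jointly and stack pooling and fully connected layers on top of the resulting feature maps.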

Recurrent Neural Networks
Recurrent neural networks (RNNs) are an advanced class of neural networks designed to overcome the limitations of traditional neural networks in accounting for temporal correlations and dependencies in data sequences [54]. They are also distinguished from other deep learning architectures by their recurrent connections, which enable them to memorize information from previous outputs and incorporate it into the computation of the current result [55]. Thus, the impact of recurrent neural network models has been remarkable in many disciplines, including energy demand forecasting, where the temporal order of the dataset is a fundamental feature in model design. RNNs are typically composed of standard recurrent connection cells, hidden states, and input and output layers. The input nodes of an RNN have no incoming connections, and the output nodes have no outgoing connections, while the hidden-state nodes have both incoming and outgoing connections [52]. At each time step, the RNN updates the information in its memory according to Equation (1):

h_t = f_c(W_h h_{t−1} + W_x x_t + b_h) (1)

where h_t is the current hidden state at time t; f_c is the activation function, typically the hyperbolic tangent (tanh) or the rectified linear unit (ReLU); W_h is the weight matrix for the recurrent connections; W_x is the weight matrix for the inputs; b_h is a bias vector; h_{t−1} is the previous hidden state at time t − 1; and x_t is the input at time t. The architecture of the RNN is exhibited in Figure 2a. Each node represents a neuron for a single time step. W1 represents the connection weights for the inputs, W2 the self-connection weights of each neuron, and W3 the connection weights for the outputs. The input data sequence is processed sequentially within the network according to time steps, and the weight coefficients are reused in a recycling fashion. The training process of an RNN includes a forward pass and a backward pass. The forward pass of an RNN mirrors that of a single-hidden-layer multilayer perceptron, except that the hidden layer in an RNN receives activations from both the current external input and the hidden-layer activations of the previous time step [56]. The process of computing weight derivatives for an RNN during the backward pass is referred to as "backpropagation through time". The advantage of RNN models in time-series forecasting is that they can predict not only the next time step but also multiple future time steps, making them versatile for different forecasting horizons.
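The recurrent update of Equation (1) can be sketched directly in numpy (f_c = tanh, bias omitted; the weight shapes, scaling, and five-step unroll are illustrative assumptions):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x):
    """One recurrent update: h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t)

rng = np.random.default_rng(1)
hidden, inputs = 3, 2
W_h = rng.normal(size=(hidden, hidden)) * 0.1   # recurrent weights
W_x = rng.normal(size=(hidden, inputs)) * 0.1   # input weights
h = np.zeros(hidden)                            # initial hidden state
for x_t in rng.normal(size=(5, inputs)):        # unroll over 5 time steps
    h = rnn_step(h, x_t, W_h, W_x)
print(h.shape)                                  # (3,)
```

Note how the same W_h and W_x are reused at every time step, which is the weight "recycling" mentioned above.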

Long Short-Term Memory
Long short-term memory (LSTM) is an upgraded variant of the RNN architecture designed to overcome the vanishing gradient and gradient explosion problems in long-sequence training. For this reason, LSTMs can be trained on the time-series dataset to make predictions of the future energy demand of buildings, each time utilizing the historical dataset processed by the LSTM cells. Typically, the LSTM architecture is composed of frequently interconnected subnetworks called memory blocks, as shown in Figure 2b. Each block consists of a forget gate, an input gate, an output gate, and one or more self-connected memory cells [57]. During the training of LSTM models, the LSTM gates facilitate the long-term storage and retrieval of information within the memory cells, effectively addressing the problem of vanishing gradients [58]. For instance, if the input gate remains closed (i.e., with an activation close to 0), the cell's activation persists and is not overwritten by new inputs entering the network. Consequently, this information can be retained and made accessible to the network at a later point in the sequence simply by opening the output gate.
Significantly, the values of the forget gate and input gate are influenced by both the previous hidden state and the current input. The operation of these units in the LSTM is parameterized as follows:

f_t = σ(W_f [h_{t−1}, x_t] + b_f) (2)
I_t = σ(W_I [h_{t−1}, x_t] + b_I) (3)
G_t = tanh(W_G [h_{t−1}, x_t] + b_G) (4)
C_t = f_t ⊙ C_{t−1} + I_t ⊙ G_t (5)
O_t = σ(W_O [h_{t−1}, x_t] + b_O) (6)
h_t = O_t ⊙ tanh(C_t) (7)

where I_t, f_t, G_t, and O_t are the input, forget, update, and output gate activations at time t. The notations σ and tanh represent non-linear activation functions that take values in the ranges [0, 1] and [−1, 1], respectively. W_* and b_* are the weight matrices and bias vectors specific to each gate, C_t is the cell state at time t, and ⊙ denotes the element-wise product.
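The gate operations described above can be sketched as a single LSTM cell step in numpy, under the standard formulation in which each gate is computed from the concatenation [h_{t−1}, x_t]. All dimensions and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, C_prev, x_t, W, b):
    """One LSTM cell step: returns the new hidden and cell states."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])   # forget gate
    I_t = sigmoid(W["I"] @ z + b["I"])   # input gate
    G_t = np.tanh(W["G"] @ z + b["G"])   # candidate update
    C_t = f_t * C_prev + I_t * G_t       # new cell state
    O_t = sigmoid(W["O"] @ z + b["O"])   # output gate
    h_t = O_t * np.tanh(C_t)             # new hidden state
    return h_t, C_t

rng = np.random.default_rng(2)
hidden, inputs = 4, 2
W = {k: rng.normal(size=(hidden, hidden + inputs)) * 0.1 for k in "fIGO"}
b = {k: np.zeros(hidden) for k in "fIGO"}
h, C = np.zeros(hidden), np.zeros(hidden)
h, C = lstm_step(h, C, rng.normal(size=inputs), W, b)
print(h.shape, C.shape)                  # (4,) (4,)
```

The cell state C_t is the additive memory path that lets gradients flow over long sequences, which is what mitigates the vanishing-gradient problem.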

Bidirectional Long Short-Term Memory
Bidirectional long short-term memory (BiLSTM) is an extension of the LSTM network that captures dependencies in both the past and future contexts of a given data point. With respect to prediction based on time-series data, the multiple sequences of energy loads are highly time-dependent, and loads at any point in time are significantly correlated with loads at the previous and subsequent points in time, requiring a deeper temporal feature extractor [59]. Compared to the unidirectional state transmission in LSTMs, a BiLSTM consists of two LSTM layers, namely the forward LSTM and the backward LSTM, as shown in Figure 3a, and the output is jointly identified by the states of these two LSTMs [12,60]. This configuration achieves bidirectional time-series feature extraction, which can fully exploit the temporal correlations of energy load sequences. By fusing the two LSTM layers, the output is computed as follows:

h^f_t = LSTM(x_t, h^f_{t−1}) (8)
h^b_t = LSTM(x_t, h^b_{t+1}) (9)
y_t = W_f h^f_t + W_b h^b_t + b_y (10)

Here, LSTM represents the LSTM function, while h^f_t and h^b_t are the hidden states of the forward and backward layers at time t, W_f and W_b are their output weight matrices, b_y is a bias vector, and y_t is the fused output.
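The bidirectional idea can be sketched as follows: the same recurrent cell is run over the load sequence once forward and once backward, and the two hidden-state sequences are fused (here by concatenation) into the output. For brevity, a simple tanh cell stands in for a full LSTM cell, and all sizes and weights are illustrative.

```python
import numpy as np

def run_direction(xs, W_h, W_x):
    """Run a simple tanh recurrent cell over a sequence; return all states."""
    h, states = np.zeros(W_h.shape[0]), []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t)
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(3)
hidden, inputs, T = 3, 2, 6
W_h = rng.normal(size=(hidden, hidden)) * 0.1
W_x = rng.normal(size=(hidden, inputs)) * 0.1
xs = rng.normal(size=(T, inputs))                 # toy load sequence
fwd = run_direction(xs, W_h, W_x)                 # forward pass
bwd = run_direction(xs[::-1], W_h, W_x)[::-1]     # backward pass, realigned
y = np.concatenate([fwd, bwd], axis=1)            # fused bidirectional output
print(y.shape)                                    # (6, 6)
```

Each output row thus sees both the past (via the forward states) and the future (via the backward states) of the sequence at that time step.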

Gated Recurrent Units
Gated recurrent units (GRUs) are a simplified alternative to LSTMs, designed to alleviate the computational burden associated with the large number of parameters in the LSTM network [61]. Compared with LSTMs, the GRU architecture has only two gates, namely the update gate and the reset gate (see Figure 3b) [62]. Typically, the update gate dictates what to discard from the memory of the previous unit, and the reset gate guides the fusion of the new input with the last memory [63]. Based on both the previous output h_{t−1} and the current input x_t, the functioning principle of the GRU cell is as follows:

Z_t = σ(W_Z [h_{t−1}, x_t] + b_Z) (11)
r_t = σ(W_r [h_{t−1}, x_t] + b_r) (12)
∼h_t = tanh(W_h [r_t ⊙ h_{t−1}, x_t] + b_h) (13)
h_t = Z_t ⊙ h_{t−1} + (1 − Z_t) ⊙ ∼h_t (14)

As exhibited in Equations (11)-(14), Z_t, r_t, h_t, and ∼h_t are the update gate, reset gate, output, and candidate (shared-memory) vector at time t, respectively, while ⊙ is an element-wise product. As demonstrated in Equations (11)-(14), the shared memory ∼h_t passes through the time steps to encode new information while discarding memories that are no longer relevant in terms of timing. Hence, the shared memory stores significant information over an extended period. The reset gate r_t defines how much of the information in h_{t−1} should be retained: a larger value of r_t (close to 1) denotes that more of h_{t−1} is retained in ∼h_t. The update gate Z_t then defines how much of h_{t−1} should be discarded: a bigger Z_t indicates that more of h_{t−1} is carried over to h_t, whereas a smaller Z_t means that a significant amount of h_{t−1} is ignored.
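A single GRU cell step can be sketched in numpy under one common formulation, in which the new hidden state interpolates between the previous state and the candidate according to the update gate. Weight shapes and values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t, W, b):
    """One GRU cell step: returns the new hidden state h_t."""
    z = np.concatenate([h_prev, x_t])
    Z_t = sigmoid(W["Z"] @ z + b["Z"])            # update gate
    r_t = sigmoid(W["r"] @ z + b["r"])            # reset gate
    zc = np.concatenate([r_t * h_prev, x_t])      # reset-scaled memory
    h_cand = np.tanh(W["h"] @ zc + b["h"])        # candidate state
    return Z_t * h_prev + (1.0 - Z_t) * h_cand    # interpolated new state

rng = np.random.default_rng(4)
hidden, inputs = 4, 2
W = {k: rng.normal(size=(hidden, hidden + inputs)) * 0.1 for k in "Zrh"}
b = {k: np.zeros(hidden) for k in "Zrh"}
h = gru_step(np.zeros(hidden), rng.normal(size=inputs), W, b)
print(h.shape)
```

With only two gates and no separate cell state, the GRU has noticeably fewer parameters than an LSTM of the same hidden size, which is the computational saving mentioned above.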

Methodology
As mentioned above, the primary objective of this study is to investigate the potential of deep learning methods in providing accurate estimates of residential demand response baselines. This includes data preprocessing, input feature selection, hyperparameter selection, the development of baseline models, forecasting, and evaluation. Therefore, the methodology of this study consists of the following steps:

• Obtaining a representative profile of the baseline residential energy demand (i.e., demand response baselines) in building clusters, which vary in terms of household space size and occupant behavior, in the absence of response events.

• Finally, the trained models and the specified input features are used to predict the future demand response baselines of residential units over various time horizons, not exceeding 24 h, and the results are utilized to assess the performance of each forecasting model.
The methodology can also be divided into the steps described as data acquisition (i.e., residential building and energy demand data inventory) and data preprocessing and input feature selection, as shown in Figure 4.The subsequent sections provide further details on the model training process, which involves the utilization of a rolling window, the hyperparameter selection, and performance evaluations.
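The rolling-window training-sample construction mentioned above can be sketched as follows: each sample pairs the previous `window` demand readings with the next-step value. The window length and the toy series are illustrative assumptions, not the actual configuration of this study.

```python
import numpy as np

def make_windows(series, window):
    """Build (X, y) pairs where each row of X holds `window` past readings
    and y holds the value to be forecast at the next step."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

demand = np.arange(10.0)            # stand-in for a demand baseline series
X, y = make_windows(demand, window=3)
print(X.shape, y.shape)             # (7, 3) (7,)
print(X[0], y[0])                   # [0. 1. 2.] 3.0
```

For multi-step horizons (up to 24 h here), the same construction is extended so that y holds a vector of future values, or the model is applied recursively one step at a time.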

Residential Building and Energy Demand Data Inventory
To investigate the performance of the proposed forecasting methods, 337 dwellings in the Atlantech district of La Rochelle City, France, were selected as a typical case study. In order to secure training data for developing the forecasting models, the baseline energy demand profile of the 337 dwelling units was generated by simulating the real dwellings in DIMOSIM (District MOdeller and SIMulator), a Python-based urban building energy simulation platform (more details in [66,67]), and the numeric datasets were saved in comma-separated values (CSV) files. The reason for this approach is that measured data for the residential energy demand of all dwelling units, which must represent real demand response baselines in the absence of demand response events, are not currently available. Therefore, the DIMOSIM models were used to generate meticulously prepared datasets that accurately represent the behavior of the existing dwellings exposed to the typical climatic conditions of Atlantech in the city of La Rochelle.
These datasets cover all aspects of the dwelling units, with a particular focus on space heating, lighting, and electric appliances to represent the smart-meter readings of the dwellings. They consist of 674 columns representing the most important information on the thermal behavior of the dwellings and the different energy demand patterns during the heating season. In terms of size, the dataset contains a considerable amount of information due to both the five-month simulation duration (from November to the end of March) and the 10 min sampling interval. In this study, energy demand simulations for the dwelling units during the heating season were carried out because of the significance of demand response programs in reducing non-essential electricity consumption during peak heating hours while at the same time providing economic benefits and promoting energy efficiency. The results of the DIMOSIM models were also compared with external references (for more details, see Ref. [68]), with DIMOSIM showing good agreement with the other tools in terms of energy production. As depicted in Figure 5, the black solid lines represent the average demand response baseline of all dwelling units, while the red and blue solid lines are the metered heating loads and the lighting and appliance loads during the heating months. In terms of dwelling characteristics, the floor area of the dwelling units ranges from 38 m² up to 225 m², with an average size of 71 m². The annual energy demand of each dwelling unit is approximately 105.85 kWh/m². For space heating, air-to-water heat pumps are installed in each dwelling unit. In addition, each heat pump is equipped with its own thermostat, enabling control over individual zones. The heat pumps are variable speed, and their coefficient of performance (COP) is estimated from a polynomial regression of the nominal performance coefficient, giving the thermal power output as a function of both the radiator temperature (i.e., the sink) and the ambient temperature (i.e., the source). In terms of population and occupancy characteristics, each dwelling unit is occupied by either a couple with or without children or a single adult living alone. The occupants may be employed, unemployed, students, or retired, as shown in Table 2.
Table 2. Occupancy characteristics of the targeted dwelling units in Atlantech in the city of La Rochelle.

Data Preprocessing
Since the baseline energy demand of the dwelling units involves separate profiles for space heating, lighting, and electric appliances in each dwelling, the initial step of data preprocessing was to appropriately integrate the demand response baseline datasets of all dwelling units to create a unified dataset for analysis and model training. In the first stage, the data for space heating, lighting, and electric appliances are aggregated for each dwelling. The next step is to determine the total and hourly energy demand for all dwelling units, as shown in Figure 6. On the other hand, the measured outdoor weather data, collected at 1 h intervals in this study, are not aligned with the generated dataset of demand response baselines, which has a 10 min sampling interval. Therefore, the data are reorganized using time series resampling and indexing techniques to form unified readings. The reorganized dataset includes the demand response baseline data, a total of 52,555 samples. This step is of great importance in facilitating the estimation of the correlation between the demand response baselines of the dwelling units and the other related input features, as explained in Section 4.3. In this context, new features (e.g., hour of the day or day of the week) are derived using a time-based (temporal) feature engineering technique. Feature engineering is used in this work due to its importance in the development of forecasting models based on time series data, as it directly affects the performance and accuracy of these models.

The final datasets, which include the hourly demand response baselines, weather factors, and other input features, were then obtained. These datasets were not subjected to normalization to the [0, 1] range, ensuring that all models work with the original data. They then go through the input selection process to determine the final inputs to the forecasting models and to understand how the future baseline energy demand of residential buildings correlates with the possible input features, as explained in Section 4.3.
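The resampling and temporal feature engineering steps above can be sketched with pandas; the column names and synthetic values below are illustrative stand-ins, not the study's actual dataset.

```python
# Hypothetical sketch: aligning a 10 min demand series with 1 h weather data
# by resampling, then deriving time-based (temporal) features from the index.
import numpy as np
import pandas as pd

# Synthetic stand-in for the aggregated 10 min demand response baseline.
idx = pd.date_range("2023-11-01", periods=6 * 24 * 7, freq="10min")
df = pd.DataFrame(
    {"demand_kw": np.random.default_rng(0).uniform(50, 300, len(idx))},
    index=idx,
)

# Resample the 10 min readings to hourly means so they match the weather data.
hourly = df.resample("60min").mean()

# Time-based feature engineering: derive calendar features from the index.
hourly["hour_of_day"] = hourly.index.hour
hourly["day_of_week"] = hourly.index.dayofweek
hourly["day_of_month"] = hourly.index.day
hourly["month_of_year"] = hourly.index.month

print(hourly.shape)  # one row per hour: demand plus four calendar features
```

The same indexed frame can then be joined with the hourly weather readings before feature selection.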

Selection of the Input Features
Using the feature engineering technique results in datasets that have multiple attributes associated with the considered demand response baseline profile. Since there are multiple associated characteristics, it is essential to use an appropriate method to determine the importance of each of them to the demand response baselines of the dwelling units. Along with the measured outdoor weather factors (e.g., the outdoor temperature and direct solar radiation), the impact of sixteen (16) independent input features is considered. Nine (9) input features capture previous energy demand patterns, encompassing the past 1 h, the past 2 h, the past 3 h, the past 24 h, the past 48 h, the past 72 h, and so on. The other seven (7) input features relate to the working schedule factors of the dwelling, such as the hour of the day, the day of the week, the day of the month, the day of the year, the week of the month, and the month of the year. In this context, both the Pearson correlation coefficient (PCC) and the Shapley additive explanation (SHAP) techniques are used to minimize redundant or useless input features and to identify the most important variables for the forecasting models.
PCC was used to estimate the correlation coefficient (R) between the average energy demand of all dwelling units in the district and each input feature, as described in Equation (15).

$$R = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}\sum_{i=1}^{n}(y_i - \bar{y})^{2}}} \quad (15)$$

where $x_i$ and $y_i$ represent the actual values, $\bar{x}$ and $\bar{y}$ denote the average values, and $n$ is the number of observations. The correlation coefficient of the PCC takes a value between +1 and −1, and the highest absolute value of the PCC indicates a parameter closely related to the future energy demand of dwelling units. Figure 7a displays the PCC values between the baseline residential energy demand (i.e., the demand response baselines) and the most meaningful input features, in descending order, when considering the whole evaluation. As shown in Figure 7a, the factors associated with previous energy demand patterns show strong positive correlations with the demand response baselines of the dwelling units.
Weather-related outdoor temperatures also show strong negative correlations, while the other factors have relatively different correlation values. However, the factors with low PCC values were also selected as input features for the forecasting models because deep learning-based data-driven models have a high ability to detect non-linear and linear relationships between the energy demand baselines of the dwelling units and other relevant influencing factors. On this basis, retaining features with small PCC values ensures that low-degree correlations are not neglected in model training.
As mentioned above, SHAP is also used to identify the potential contribution of each input feature to the forecasting model. Figure 7b shows the contributions of each input feature in descending order. It is important to note the importance of the factors related to past energy demand patterns, the outdoor temperature, and the hour of the day in forecasting baseline values of residential demand response over very short-term periods, not exceeding 24 h. Together with the PCCs, these findings demonstrate the importance of including input features related to previous energy demand patterns for accurate forecasting of demand response baselines. The selection of the three (3) historical energy demand-related input features is based on a comprehensive analysis of the factors that significantly affect energy demand forecasting. These features have been chosen to capture the diverse aspects of historical energy consumption patterns, ensuring a robust and accurate forecasting model. Accordingly, nine factors out of all input features were selected as the final input features for all forecasting models, as described in Table 3.
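As an illustrative sketch of the PCC-based ranking (not the study's actual data or exact feature set), the correlations between a target series and a few candidate features can be computed and sorted by absolute value with pandas:

```python
# Hypothetical feature ranking by Pearson correlation with the target;
# the feature names and the synthetic relationships are illustrative only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
demand = rng.uniform(100, 400, n)  # stand-in for the average demand baseline

df = pd.DataFrame({
    "demand_kw": demand,
    "demand_past_1h": demand + rng.normal(0, 10, n),        # strong positive lag
    "outdoor_temp_c": -0.05 * demand + rng.normal(0, 5, n),  # negative correlation
    "day_of_month": rng.integers(1, 31, n).astype(float),    # weak correlation
})

# PCC (Equation (15)) between the target and each candidate, ranked by |R|.
pcc = df.drop(columns="demand_kw").corrwith(df["demand_kw"])
ranked = pcc.reindex(pcc.abs().sort_values(ascending=False).index)
print(ranked)
```

Even the weakly correlated features can be kept as inputs, per the reasoning above, since the deep learning models can still exploit non-linear relationships that the PCC does not capture.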

Forecast Model Development
The development of the deep learning-based forecasting models, explained in Section 3, involves two fundamental steps: hyperparameter tuning and post-training. It is important to note that the performance of these models is compared with that of classic and ensemble models based on the traditional statistical and machine learning methods used in the previous literature [69–72]. The classic models include support vector regression (SVR), autoregressive integrated moving average (ARIMA), multiple linear regression (MLR), Lasso regression (Lasso), ridge regression (Ridge), polynomial regression (PolyR), Bayesian regression (Bayesian), kernel ridge regression (KernelR), and stochastic gradient descent regression (SGDReg) algorithms. The tree-based ensemble machine learning models include extreme gradient boosting (XGBoost), light gradient-boosting machine (LightGBM), gradient boosting (GB), random forest (RF), adaptive boosting (AdaBoost), Bagging, and categorical gradient boosting (CatBoost) algorithms. Both the classic and ensemble algorithms are among the most popular traditional statistical and machine learning methods in building energy analysis applications (see Refs. [69–71]) and have become widely used data-driven methods for predicting, benchmarking, and mapping baseline energy demand in buildings [72].
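As a hedged sketch, the classic and ensemble baselines listed above map onto scikit-learn estimators roughly as follows; the hyperparameter values are placeholders rather than the tuned ones, and XGBoost, LightGBM, CatBoost, and ARIMA live in separate packages:

```python
# Illustrative registry of the classic and ensemble baseline models using
# scikit-learn estimators; parameter values are placeholders, not tuned.
from sklearn.svm import SVR
from sklearn.linear_model import (LinearRegression, Lasso, Ridge,
                                  BayesianRidge, SGDRegressor)
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              AdaBoostRegressor, BaggingRegressor)

models = {
    "SVR": SVR(kernel="rbf", gamma="scale"),
    "MLR": LinearRegression(),
    "Lasso": Lasso(alpha=0.01),
    "Ridge": Ridge(alpha=1.0),
    "PolyR": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "Bayesian": BayesianRidge(),
    "KernelR": KernelRidge(alpha=1.0),
    "SGDReg": SGDRegressor(max_iter=1000),
    "GB": GradientBoostingRegressor(n_estimators=200, max_depth=3, subsample=0.8),
    "RF": RandomForestRegressor(n_estimators=200),
    "AdaBoost": AdaBoostRegressor(n_estimators=100),
    "Bagging": BaggingRegressor(n_estimators=50),
    # XGBoost, LightGBM, and CatBoost come from the xgboost, lightgbm, and
    # catboost packages; ARIMA comes from statsmodels.
}
print(sorted(models))
```

A registry like this makes it straightforward to train and evaluate every baseline in the same loop with identical input features.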

Hyperparameter Tuning
Tuning the appropriate hyperparameters is a critical step in the development of accurate data-driven forecasting models and can have a considerable impact on convergence speed and generalizability. In this work, the controlled-variable method, relying on empirical expertise, is utilized in the experiments to optimize the choice of hyperparameters. To determine the optimal architecture of the proposed models, this work referred to references [73–75] and identified the practicable scopes and the optimization scope for the hyperparameters, as shown in Table 4. In terms of deep learning models, forecasting based on BiLSTM and CNN is presented here as a typical example. First, with the default BiLSTM parameters (activation function = ReLU, optimizer = Adam, loss function = RMSE, batch size = 32, verbose = 2, learning rate = 0.0001, epochs = 30) and the default CNN parameters (activation function = ReLU, optimizer = Adam, loss function = RMSE, filters = 64, kernel size = 2, pool size = 2, learning rate = 0.0001, verbose = 2, epochs = 30), the number of hidden units in each hidden layer was fixed. Second, the learning rate, epochs, batch size, filters, and kernel size were adjusted. For example, the RMSE value of the BiLSTM models decreases when the learning rate is set to 0.01 and the number of epochs is set to 200 or 300. For the CNN models, the RMSE value decreases when the learning rate is set to 0.001 and the number of epochs is set to 300 or 500. Third, in cases where overfitting tended to occur, dropout was implemented based on empirical expertise. Dropout is a useful regularization technique that randomly deactivates a proportion of neurons during each training iteration; this helps to prevent the co-adaptation of neurons, making the network more robust and reducing overfitting. Finally, to minimize redundancy in the process of defining the hyperparameters for optimization, parameters of the same model type in the same case were kept consistent, and adjustments were only made to those parameters that prevented overfitting. For both the classic and ensemble models, additional hyperparameters were employed to optimize performance and prevent overfitting, as shown in Table 4. One of these hyperparameters is max_depth, which regulates the maximum depth of each tree in the ensemble, thus limiting the complexity of individual trees. The second is n_estimators, which controls the size of the ensemble in terms of the number of base learners. The third is subsample, which determines the proportion of randomly sampled training data utilized to grow each tree within the ensemble. Another is the kernel coefficient of SVR, which controls the influence of individual training samples on the decision boundary. All models developed in this work (as described in Section 4.4.2) were implemented in the Python programming language using the Scikit-learn [76] and TensorFlow [77] frameworks. The experimental hardware configuration included an Intel(R) Xeon(R) Bronze 3106 CPU, a 64-bit operating system, and 64 GB of RAM.
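A minimal sketch of searching over the tree-ensemble hyperparameters named above (max_depth, n_estimators, subsample) with chronology-preserving cross-validation; the grid values and the synthetic data are illustrative, and the study's own controlled-variable procedure may differ:

```python
# Hypothetical hyperparameter search for a boosted-tree baseline; the grid
# and data are placeholders, not the study's tuned configuration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 9))                 # nine input features, as in Table 3
y = X[:, 0] * 3.0 + rng.normal(0, 0.1, 300)   # synthetic target

param_grid = {
    "max_depth": [2, 3],        # limits the complexity of individual trees
    "n_estimators": [50, 100],  # size of the ensemble (number of base learners)
    "subsample": [0.8, 1.0],    # fraction of training samples grown per tree
}

# TimeSeriesSplit keeps folds chronological, so no future data leaks backward.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

The same pattern extends to the SVR kernel coefficient by putting `gamma` in the grid.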

Training and Validation
Once the final input features have been specified, as explained in Section 4.3, it is crucial to develop robust and accurate models for predicting the future energy demand (demand response baselines) of dwellings over multiple time horizons, not exceeding 24 h. To achieve this goal, the datasets were divided into training and validation sets, comprising 80% and 20% of the data, respectively. Using the training dataset, each model is trained to learn the patterns and trends of energy demand, as well as the relationships between the data series for all dwelling units and the different dwelling unit aggregation levels, rather than the specific pattern of a particular dwelling unit. The goal is to assess the performance of each model in the context of all dwelling units and the different aggregation levels of dwelling units when both traditional machine learning and deep learning models are considered. Each model is trained to predict the demand response baselines of dwelling units, starting at 0:00 am and ending at 11:00 pm (23:00) of each day, over the entire period under consideration (five months). The trained model, together with its learned information, is later used to make predictions about the future energy demand of these dwelling units.
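The lagged-input construction and the chronological 80/20 split described above can be sketched in NumPy; the lag set mirrors a subset of the past-demand features discussed in Section 4.3, and the hourly series is a synthetic stand-in:

```python
# Hypothetical supervised-learning setup: build lagged features from an
# hourly series and split chronologically (80% train / 20% validation).
import numpy as np

def make_supervised(series, lags=(1, 2, 3, 24, 48, 72)):
    """Build lagged input features and aligned targets from an hourly series."""
    max_lag = max(lags)
    X = np.column_stack(
        [series[max_lag - lag:len(series) - lag] for lag in lags]
    )
    y = series[max_lag:]
    return X, y

rng = np.random.default_rng(3)
series = rng.uniform(50, 300, 1000)   # stand-in hourly demand baseline, kW
X, y = make_supervised(series)

# Chronological split: the validation period lies strictly after training.
split = int(0.8 * len(X))
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]
print(X_train.shape, X_val.shape)
```

Keeping the split chronological matters here because a random shuffle would leak future demand patterns into the training set.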
With the optimal hyperparameters, each of the traditional classic, ensemble, and deep learning models is trained, and its performance is evaluated using the three performance indicators described in Section 4.5. More specifically, using the training dataset, each model is trained with all nine input features previously defined in Section 4.3. Then, each model is saved at the optimal time for each trained dataset, and its performance is evaluated and compared with that of its counterparts. This performance evaluation step is always based on the optimal hyperparameters for each trained model. The purpose of this step is to obtain accurate forecasting models. Subsequently, these forecasting models (i.e., pre-trained models) are used to periodically predict the future demand response baselines of dwelling units over various time horizons: 6 h (00:00 am to 06:00 am), 12 h (00:00 am to 12:00 pm), 18 h (00:00 am to 06:00 pm), and 24 h (00:00 am to 11:00 pm), as shown in Figure 8. In the forecasting step, each model only includes the nine input features defined previously in Section 4.3, while the demand response baseline profile of the dwelling units is excluded. The aim is to determine the ability of each model to predict the future energy demand (demand response baseline) of dwellings by considering only those factors related to previous energy demand patterns, dwelling working schedules, and outdoor weather conditions. Ultimately, the performance of these forecasting models is evaluated and validated using the validation dataset. In addition, the discrepancy between the actual and forecasted values of the demand response baselines was used to express the overall performance of the deep learning and traditional machine learning-based forecasting models by considering the performance indicators described in Section 4.5.
To further investigate the forecasting behavior of the proposed deep learning and machine learning models, this work is extended to incorporate various aggregation levels of the demand response baseline profile (i.e., 200, 150, 100, 50, 30, 20, and 10 dwelling units). Each time, these models are trained, validated, and then evaluated and compared at each aggregation level of the demand response baseline profile of the dwelling units. The motivation is to better understand the forecasting behavior of these models when fluctuations or changes occur in the demand response baseline profile of the dwelling units. This can be performed by forecasting the energy demand profile of dwelling units [48,78] over a short-term horizon. All forecasting models, whether based on deep learning or classic and ensemble machine learning methods, used the same input features defined in Section 4.3 and were compared in terms of forecasting accuracy, as discussed in Section 5.

Performance Assessment
The primary goal of adopting deep learning models, as well as classic and ensemble models, is to minimize the gap between the actual aggregated demand response baselines and their forecasted counterparts for the next day. Therefore, the forecasting accuracy of the aggregated demand response baselines over various time horizons (6 h, 12 h, 18 h, and 24 h ahead) was measured utilizing three performance metrics. The most commonly used statistical metric for determining forecast accuracy in the literature is the mean absolute percentage error (MAPE) [78,79]. However, MAPE has a limitation when the actual values of the demand response baselines are zero, which can occur when forecasting demand response baselines. MAPE might take extreme values when the actual values are close to zero, making it less reliable. To avoid this limitation, this study also uses the mean absolute error (MAE) and the root mean squared error (RMSE), which do not have this limitation. The three metrics are formulated as follows.
$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_{\mathrm{actual},i} - y_{\mathrm{forecast},i}}{y_{\mathrm{actual},i}}\right| \times 100 \quad (16)$$

where $y_{\mathrm{actual},i}$ is the actual aggregated demand response baseline at hour $i$, $y_{\mathrm{forecast},i}$ is the forecasted aggregated demand response baseline at hour $i$, and $N$ is the number of hours in the dataset. The MAE reflects the magnitude of the deviation between the forecasted and actual values by utilizing the absolute error, while the RMSE refers to the standard deviation of the residuals between the forecasted and actual values of the demand response baselines of the dwelling units. In contrast, the MAPE measures the forecasting accuracy between the forecasted and actual values of the demand response baselines of the dwelling units, expressed as a relative percentage of the forecasting errors. Both RMSE and MAE are scale-dependent metrics and describe the forecasting errors at their original scale. MAPE is a scale-independent metric because the denominator of its equation includes the actual values, making it suitable for comparing performance with other studies. Lower values of these metrics mean that the dispersion is more similar between the actual and forecasted demand response baselines. In this study, the MAPE was used as the primary performance measure, with the MAE and RMSE used only as tie-breakers when the MAPE did not show a significant difference between the forecasting models.
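The three metrics can be implemented directly in NumPy, assuming (as noted above for MAPE) that the actual values are non-zero:

```python
# Straightforward NumPy implementations of the three performance metrics.
import numpy as np

def mape(y_actual, y_forecast):
    """Mean absolute percentage error; assumes no zero actual values."""
    y_actual, y_forecast = np.asarray(y_actual), np.asarray(y_forecast)
    return np.mean(np.abs((y_actual - y_forecast) / y_actual)) * 100

def mae(y_actual, y_forecast):
    """Mean absolute error, in the original scale (kW)."""
    return np.mean(np.abs(np.asarray(y_actual) - np.asarray(y_forecast)))

def rmse(y_actual, y_forecast):
    """Root mean squared error, in the original scale (kW)."""
    return np.sqrt(np.mean((np.asarray(y_actual) - np.asarray(y_forecast)) ** 2))

actual = [100.0, 200.0, 400.0]    # illustrative hourly baselines, kW
forecast = [110.0, 190.0, 380.0]
print(mape(actual, forecast), mae(actual, forecast), rmse(actual, forecast))
```

Because MAPE divides by the actual values, the same absolute error of 10 kW counts for less at 400 kW than at 100 kW, which is exactly why MAE and RMSE are kept as scale-dependent companions.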

Results and Discussion
In this section, the performance of the forecasting models developed based on deep learning algorithms is analyzed (Section 5.1) and compared with that of the models based on classic and ensemble machine learning algorithms (Sections 5.2 and 5.3) over various time horizons, as explained in the previous sections. Section 5.4 shows the forecasting behaviors of all developed models over different aggregation levels of the demand response baseline profile of the dwelling units. Note that the performance results represent multiple testing outcomes for the developed predictive models, reflecting different test periods. Section 5.5 provides an example assessing the ability of the deep learning and the classic and ensemble machine learning methods to forecast the energy reductions resulting from the activation of demand response events for 3 h on peak days of the heating season.

Performance of Deep Learning Models
Table 5 lists the MAPE, RMSE, and MAE values for the forecasting performance of the proposed deep learning methods over various time horizons, compared to their classic and ensemble counterparts, as a function of the input features and each forecasting method. Given the performance results, the authors evaluated the effectiveness of each model over multiple test periods, surpassing five consecutive periods. As shown in Table 5, the BiLSTM-based forecasting models considerably reduced the gap between the measured demand response baselines of the dwelling units and their forecasted counterparts over the different forecast horizons, leading to better performance. As a result, the BiLSTM models outperformed their deep learning counterparts, including traditional artificial neural networks (ANNs), with MAPE values of 9.08% (6 h ahead), 11.14% (12 h ahead), 11.11% (18 h ahead), and 11.59% (24 h ahead), respectively. The errors represented by RMSE and MAE were also lower than those of ANN, DNN, CNN, and RNN in all cases of forecasting the demand response baselines of the dwelling units. The best conditions yielded an RMSE of 7.07 kW and an MAE of 5.41 kW when forecasting 6 h demand response baselines. This performance is attributed to the bidirectional processing (in both forward and backward directions) of BiLSTM, which allows the neural network to efficiently learn and capture information from both past and future states. The primary objective of BiLSTM is to acquire further knowledge of a given context by capturing it from more than one perspective and then concatenating the two outputs into a single contextual representation. Compared to the other deep learning models, GRU showed considerable improvements in forecast accuracy over all time horizons considered, performing better than LSTM, RNN, DNN, and CNN. The GRU forecasts achieved MAPE values of 8.92% and 11.22% for the 6 h and 18 h ahead forecasts, respectively. The LSTM models performed reasonably well in terms of forecasting accuracy, with MAPE values of up to 9.23% (6 h) and 12.03% (24 h). However, these improvements in forecasting accuracy were slightly lower than those achieved by BiLSTM. For all four time horizons, BiLSTM performed better than GRU and LSTM, indicating that BiLSTM has a higher learning potential over the next 24 h due to its bidirectional learning capability and produces fewer errors than GRU and LSTM. By contrast, this potential was lowest for the traditional ANN-based forecasting model, which, due to its shallow structure, was not able to effectively learn the time series data and produce accurate predictions of the demand response baselines of the dwelling units over the four time horizons.
In Figures 9 and 10, the four time horizons are illustrated to provide an intuitive representation of the forecasts, each encompassing a single test period. Figure 9 compares the efficiency of the deep learning methods in representing the closeness between the curves of the forecasted and actual values of the demand response baselines of the dwelling units over the four time horizons. As shown in Figure 9, these methods were able to produce a similar profile of demand response baselines for the dwelling units, with observable differences in the efficiency of each. Figure 10 compares the proposed deep learning methods in terms of the magnitude of the error at each hour, in kW, at the different time horizons, and illustrates how the best model reduces the magnitude of the error when producing estimates of the demand response baselines for the 6 h, 12 h, 18 h, and 24 h ahead forecasts. The magnitude of the forecast error decreased to some extent as the length of the input interval decreased. For each model, the 6 h input produced the lowest forecast error. However, in terms of consistency, the BiLSTM, followed by the GRU and LSTM models, outperformed the others in all cases. As accuracy is very important in demand response program applications, the BiLSTM, GRU, and LSTM models would be preferred over the ANN and DNN models for short-term baseline energy forecasting in residential buildings. Therefore, BiLSTM, GRU, and LSTM, along with RNN and CNN as alternative methods, should be among the deep learning methods to consider when developing baseline demand response forecasting models for dwelling units.


Comparison of Deep Learning and Ensemble Models
As shown in Table 5, the performance of seven ensemble methods, based on bagging and boosting techniques, was assessed over the various time horizons. Among these seven methods, XGBoost showed the best performance, with the lowest values of MAPE, RMSE, and MAE, when forecasting the demand response baselines of dwelling units over the four time horizons. As a result, its error magnitude was lower than that of its counterparts, with values of (8.14 kW, 13.95 kW, 14.39 kW, and 15.78 kW) and (20.84 kW, 24.62 kW, 24.79 kW, and 25.73 kW) of MAE and RMSE over the 6 h, 12 h, 18 h, and 24 h ahead forecasts, respectively. Other ensemble methods, such as GB, LightGBM, and RF, showed reasonable performance with slightly lower error variability. This is attributed to two main reasons: (1) the strong non-linear mapping, generalization, and parallelization potential of XGBoost, which is derived from its boosted decision tree-based architecture [80], and (2) the limited ability of the other models to effectively learn from a given time series dataset and generalize their outcomes, which negatively affected forecast accuracy despite the hyperparameter adjustments.
Compared to deep learning methods, none of the ensemble models could achieve the performance observed in the deep learning-based forecasting models. This difference is primarily due to the essential structural characteristics of these approaches. As algorithms without inherent memory, the ensemble methods are unable to capture and preserve past information. As a result, they exhibit suboptimal performance in scenarios where the input time-series information is intricate and a shorter output interval is required. As shown in Figure 11, ensemble-based forecasting models can produce the general pattern of demand response baselines for dwelling units. However, the magnitude of the forecast errors remains considerable, as depicted in Figure 12. For example, the total error magnitude of the CatBoost models was 29.88 kW, 30.29 kW, and 31.61 kW of RMSE, whereas it was 15.12 kW, 17.79 kW, and 18.27 kW of RMSE for the traditional artificial neural network (ANN) among the deep learning models for the 12 h, 18 h, and 24 h ahead forecasts, respectively.
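For reference, the three accuracy metrics used throughout the comparison can be computed as below; this is a self-contained sketch with illustrative values, not figures taken from the paper's dataset:

```python
import numpy as np

def mae(y, yhat):
    """Mean absolute error, in the units of the data (kW here)."""
    return float(np.mean(np.abs(y - yhat)))

def rmse(y, yhat):
    """Root mean square error; penalizes large deviations more than MAE."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mape(y, yhat):
    """Mean absolute percentage error; assumes y contains no zeros."""
    return float(np.mean(np.abs((y - yhat) / y)) * 100)

# Illustrative actual vs. forecasted baseline values (kW).
actual = np.array([100.0, 120.0, 90.0, 110.0])
forecast = np.array([104.0, 114.0, 93.0, 105.0])
print(mae(actual, forecast), rmse(actual, forecast), mape(actual, forecast))
```

Since RMSE squares the residuals before averaging, it is always at least as large as MAE on the same data, which is why the two quadruples of XGBoost error values differ in scale.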
In short, in terms of forecasting accuracy, deep learning-based baseline models exhibited superiority in reducing the error magnitude between forecasted and measured values of demand response baselines, thus demonstrating their advantage in short-term forecasts of demand response baselines for dwelling units. However, the XGBoost method should be one of the alternative forecasting methods considered, along with deep learning methods, when estimating demand response baselines in residential neighborhood contexts over various time horizons not exceeding 24 h.


Comparison of Deep Learning and Classic Models
Compared to deep learning methods, the nine (9) classic methods considered in this study showed no observable improvements in forecast accuracy over the four time horizons considered. Although methods such as SVR exhibited reasonable performance, they were far from achieving the performance obtained by the deep learning-based forecasting models. As shown in Table 4, the best performance conditions for the SVR resulted in an RMSE of 16.16 kW and an MAE of 14.19 kW for the 6 h ahead forecasts. In addition, the RMSE and MAE values for ARIMA and Lasso were (26.97 kW and 24.85 kW) and (21.32 kW and 21.02 kW) at the 12 h and 24 h ahead forecasts, respectively. Among the classic models, KernelR and SGDReg performed the worst compared with the others, including SVR, ARIMA, and Lasso, with error magnitudes of up to 33.21 kW and 28.27 kW (RMSE and MAE) and 28.35 kW and 24.51 kW (RMSE and MAE), respectively. The KernelR and SGDReg models often suffered from instability when using relatively diverse and sparse data samples (because they have different very short-term and short-term forecast horizons), resulting in considerable differences in forecast accuracy.
Figure 13 shows the demand response baseline profiles produced by the classic-approach-based forecasting models. It can be seen that the classic models can produce the general pattern of demand response baselines for dwelling units. However, when the classic models are used, larger deviations from the measured values of the demand response baselines are likely, due to the presence of more non-linear values in the input features, leading to error magnitudes of up to 36.88% (KernelR) and 41.29% (SGDReg) of MAPE. One drawback of the classic methods' linear nature is that it prevents high-quality forecasts from being obtained with the original input features; they therefore fail to achieve performance levels similar to those of deep learning models at the same time horizons, owing to their inability to capture complex non-linear relationships in the demand response baseline datasets. In situations where the underlying relationships are highly non-linear or involve interactions between input features, linear models may not perform as well as more sophisticated non-linear models.
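The point about linear models missing non-linear structure can be illustrated with a small experiment: a purely linear fit of a synthetic daily-cycle demand series is compared against a fit that encodes the cycle explicitly. All data here are invented for illustration and do not come from the study's dataset:

```python
import numpy as np

# Synthetic demand with a daily sinusoidal cycle plus an evening step (kW).
hours = np.arange(48)
demand = 50 + 20 * np.sin(2 * np.pi * hours / 24) + 5 * (hours % 24 > 17)

# Purely linear model of demand on the hour index (trend only).
coef = np.polyfit(hours, demand, deg=1)
linear_pred = np.polyval(coef, hours)

# Linear model on *periodic* features, which can represent the daily cycle.
X = np.column_stack([np.sin(2 * np.pi * hours / 24),
                     np.cos(2 * np.pi * hours / 24),
                     np.ones(len(hours))])
beta, *_ = np.linalg.lstsq(X, demand, rcond=None)
periodic_pred = X @ beta

def mape(y, yhat):
    return float(np.mean(np.abs((y - yhat) / y)) * 100)

print(mape(demand, linear_pred))    # large: the trend misses the daily cycle
print(mape(demand, periodic_pred))  # much smaller once the cycle is encoded
```

The gap between the two MAPE values mirrors the gap reported above between KernelR/SGDReg and the non-linear models: when the relationship between inputs and the baseline is not linear in the raw features, a linear fit leaves a large structural residual.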

Performance at Different Aggregation Levels
To better understand the forecasting behavior of each method, different demand response baseline profiles were used at different levels of dwelling unit aggregation, namely, 200, 150, 100, 50, 20, and 10, as mentioned above. Each method uses the same time horizons and input features, allowing analysis of the difference in forecast performance between models with different levels of forecastability. Figures 15-17 show the overall performance, in terms of MAPE values, of the deep learning, classic, and ensemble models over the different time horizons. As depicted in Figure 15, the variations in the demand response baseline profiles indicate that deep learning models typically avoid overestimating future demand response baselines for the different dwelling unit aggregations. The deep learning-based forecasting models achieved higher accuracy than the other models, with errors as low as 8.71%, 10.59%, 11.53%, and 13.05% of MAPE for the 6 h, 12 h, 18 h, and 24 h ahead forecasts, respectively. In this respect, the BiLSTM-based forecasting models demonstrated superior performance, followed by GRU, LSTM, and RNN, over the different time horizons (see Appendices A and B). This forecasting behavior is due to the robust learning capabilities of these methods in mapping the energy data of dwelling units, resulting in lower errors than their counterparts from the other deep learning methods.
Intuitively, the forecasting behavior of the XGBoost, GB, LightGBM, and SVR models was better in terms of forecasting accuracy than that of the other classic and ensemble models, as these models learn better from the different datasets of demand response baseline profiles (see Tables A1-A6 in Appendix B). However, these models could not outperform the deep learning models in terms of forecasting accuracy. This result was expected due to (1) the stability of deep learning models when dealing with sparse datasets and (2) their ability to better learn and efficiently handle the complexity of sequential (time series) data during training. Concerning the values of the accuracy metrics, using the demand response baseline profile for different aggregations of dwelling units, for example, 200 dwelling units, results in (10.10%, 10.59%, 11.53%, 13.05%) of MAPE with BiLSTM, in contrast to (19.73%, 18.53%, 20.03%, 19.91%) and (11.05%, 14.12%, 15.45%, 16.02%) with the SVR and XGBoost models for the 6 h, 12 h, 18 h, and 24 h ahead forecasts, respectively.
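The aggregation levels themselves can be formed by summing unit-level profiles into a neighborhood baseline, as sketched below on synthetic placeholder data; the actual study uses measured profiles of the 337 dwellings:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic hourly demand for 200 dwelling units over one day (kW per unit).
units = rng.uniform(0.2, 2.0, size=(200, 24))

# Aggregated demand response baseline at each aggregation level studied.
levels = [200, 150, 100, 50, 20, 10]
baselines = {n: units[:n].sum(axis=0) for n in levels}

for n in levels:
    print(n, round(float(baselines[n].mean()), 2))  # mean aggregated load, kW
```

Larger aggregations smooth out individual-unit variability (relative fluctuations shrink as units are summed), which is one intuitive reason the forecastability of the baseline profile changes with the aggregation level.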
(..., 77.48%, 80.41%, 71.25%) when using demand response baseline profiles for 200 aggregated dwelling units over the 6 h, 12 h, 18 h, and 24 h ahead forecasts, respectively. The considerable variations in error for each method are due to changes in the data of the demand response baseline profiles. Consequently, the margin of forecasting behavior between these models and the deep learning models is more apparent in such sparse datasets of demand response baselines because, for example, several classic and ensemble models, including the Bagging and RF models, suffer more from instability on sparse datasets.
Furthermore, the RMSE and MAE values of the deep learning models were lower than those of the classic and ensemble models, demonstrating better forecasting performance, as shown in Figures A1-A4 in Appendix A and Tables A1-A6 in Appendix B. Therefore, it can be said that the proposed deep learning methods consistently outperform the comparative classic and ensemble methods for forecasting demand response baselines at different aggregation levels of dwelling units, indicating accurate forecasting behaviors. The significant improvement in MAE and RMSE values primarily demonstrates the ability of the deep learning methods to effectively learn the complexity of time series datasets, which enables the correct capture of the data points of demand response baselines for different aggregations of dwelling units, as shown in Tables A1-A6. The deep learning models can adequately capture the dynamic stochastic nature of the aggregated demand response baselines, caused by the outdoor weather conditions and the energy demand behavior of the occupants, represented by the working schedules (calendar factor) of the dwellings. As a result, the gap between the measured and forecasted aggregated demand response baselines was minimized to some extent, and good forecasting accuracy levels were achieved.


Example for Demand Response Forecasts
In order to integrate the forecasting models developed so far for demand response baselines into residential energy management systems, it is necessary to evaluate their performance not only on the baseline demand patterns but also during the activation of demand response events (i.e., to assess their ability to capture energy reductions when response events are triggered). Energy reductions are calculated as the difference in energy values between the baselines and the energy profile while the response events are triggered. To do so, datasets of the energy demand profile of all dwelling units, with demand response events activated for 3 h per day (i.e., from 6:00 pm to 9:00 pm) during the coldest days, are utilized (see Figure A5 in Appendix A). The proposed deep learning models are also used to provide an accurate estimate of the energy reductions resulting from demand response events during these three hours, and their performance is then evaluated against the classic and ensemble models using the performance metrics. Table 6 presents the performance results of the deep learning, ensemble, and classic models as a function of the given time horizon and input features, for each proposed forecasting method. As expected, the deep learning-based forecasting models demonstrated better performance in capturing the energy reductions of the dwelling units due to demand response events over a 3 h time horizon. The LSTM models, followed by ANN, GRU, and BiLSTM, showed a significant ability to improve forecasting accuracy by minimizing the gap between forecasted and measured values, down to 8.02%, 9.56%, 9.59%, and 9.89% of MAPE, respectively. Table 6 also shows that the accuracy of the other deep learning-based forecasting models is acceptable. The highest error magnitude of CNN is 7.55 kW of RMSE and 7.15 kW of MAE, making it the least accurate of the deep learning models. However, the forecast outcomes are all close to the actual reductions in energy demand. In 
contrast, all the classic and ensemble models, including the SGDReg, KernelR, ARIMA, and LightGBM models, failed to improve forecasting accuracy, with errors of up to 46.32%, 34.24%, 36.94%, and 23.15% of MAPE, respectively. Both the RMSE and MAE values were also significant, around (29.33 kW, 25.96 kW, 26.12 kW, 16.35 kW) and (26.68 kW, 23.72 kW, 22.18 kW, 15.65 kW), respectively. XGBoost, by contrast, performed the best among its classic and ensemble counterparts, with an error of 13.12% of MAPE, 9.58 kW of RMSE, and 8.01 kW of MAE. However, it was not able to achieve the same level of accuracy as the deep learning models when forecasting the energy reduction of dwelling units over the next three hours (3 h ahead forecast). In short, a notable observation was the satisfactory performance of the deep learning models, compared to the classic and ensemble models, in providing accurate estimates of energy reductions over a 3 h ahead horizon. At the same time, comparing the actual profile of energy demand reductions with the forecasted profiles of each method showed that deep learning-based forecasting models can provide more accurate profiles of the energy demand reductions resulting from demand response events over short-term and very short-term horizons. This result is due to the intrinsic characteristics of deep learning structures. XGBoost can also be used as an alternative method, along with the deep learning methods, to support demand response programs in providing accurate estimates of the energy reductions of buildings during the activation of residential neighborhood-level response events.
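The energy-reduction calculation described above (baseline minus measured demand during the 6:00-9:00 pm event window) can be sketched as follows, with illustrative numbers only:

```python
import numpy as np

hours = np.arange(24)
# Forecasted demand response baseline, kW (placeholder daily shape).
baseline = 40 + 10 * np.sin(2 * np.pi * (hours - 6) / 24)

# Measured demand: identical to the baseline except during the event,
# when demand is curtailed by an assumed 12 kW.
measured = baseline.copy()
event = (hours >= 18) & (hours < 21)   # response event, 6:00-9:00 pm
measured[event] -= 12.0

# Energy reduction = baseline minus measured demand over the event window.
reduction = (baseline - measured)[event]
print(reduction)        # 12 kW in each of the three event hours
print(reduction.sum())  # total curtailed energy over 3 h (kWh at 1 h steps)
```

Because the reduction is a difference against a *forecasted* baseline, any baseline forecasting error translates directly into an error in the estimated reduction, which is why baseline accuracy matters for settling demand response events.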

Conclusions
The paper presents the development of a deep learning-based, data-driven framework to provide accurate and reliable estimates of demand response baselines in a residential neighborhood context over short-term and very short-term time horizons. Several predictive models based on a deep learning approach, including ANN, DNN, CNN, RNN, LSTM, GRU, and BiLSTM, were developed to predict the future demand response baselines of 337 dwelling units and explore the influence of different levels of aggregation (200, 150, 100, 50, 20, and 10 dwelling units) of the demand response baseline profiles on forecast accuracy. At the same time, all these methods were compared with fifteen different classic and ensemble methods to verify their potential to provide accurate and reliable estimates of demand response baselines over a time horizon not exceeding 24 h. The classic methods included MLR, Lasso, Ridge, PolyR, Bayesian, KernelR, SGDReg, SVR, and ARIMA, while the ensemble methods included XGBoost, LightGBM, GB, RF, Bagging, CatBoost, and AdaBoost.
In all these methods, firstly, the PCC technique is used to select the most significant variables (input features) influencing the energy demand baselines of dwelling units. Secondly, SHAP is used to identify the potential contribution of each input feature to the predictive model. This not only effectively reduces the dimensionality of the input parameters but also improves the model's running speed while ensuring the incorporation of scientifically and rationally chosen input features. Finally, the controlled-variable method, relying on empirical expertise, is utilized in the experiments to determine the best combination of hyperparameters for building a robust demand response baseline model. Several demand response baseline models were developed, and their performance was then analyzed based on MAE, RMSE, and MAPE measured on energy demand baseline datasets at the different dwelling aggregation levels to identify the most accurate models over multiple forecast horizons.
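The PCC-based feature selection step can be sketched as follows. The features, their relationships, and the 0.3 cutoff below are hypothetical illustrations; the paper's actual feature set and threshold may differ:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Hypothetical candidate features; only temperature and hour-of-day truly
# drive the synthetic baseline constructed below.
temperature = rng.normal(10, 5, n)
hour = rng.integers(0, 24, n).astype(float)
noise_feature = rng.normal(0, 1, n)
baseline = 60 - 1.5 * temperature + 0.8 * hour + rng.normal(0, 2, n)

features = {"temperature": temperature, "hour": hour, "noise": noise_feature}

# Keep features whose |Pearson correlation| with the target exceeds a cutoff.
scores = {name: abs(float(np.corrcoef(x, baseline)[0, 1]))
          for name, x in features.items()}
kept = [name for name, r in scores.items() if r > 0.3]
print(sorted(kept))  # temperature and hour pass; the pure-noise feature does not
```

Note that PCC only measures *linear* association, which is why the paper pairs it with SHAP: a feature with a weak linear correlation can still contribute non-linearly to a deep model.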
The results showed that deep learning-based forecasting models, in comparison with the others, could significantly minimize the gap between the actual and forecasted values of demand response baselines at all the different dwelling unit aggregation levels over the time horizons considered. The ANN, DNN, CNN, RNN, LSTM, GRU, and BiLSTM models consistently showed the smallest MAE, RMSE, and MAPE in all comparison experiments, with values as low as (6.49 kW, 5.89 kW, 6.92 kW, 7.85 kW, 5.47 kW, 5.81 kW, 5.41 kW), (7.30 kW, 7.17 kW, 7.35 kW, 8.03 kW, 8.93 kW, 7.78 kW, 7.07 kW), and (10.18%, 8.86%, 9.15%, 9.92%, 9.23%, 8.92%, 9.08%), respectively. The BiLSTM models, followed by the GRU and LSTM models, had the highest forecasting accuracies, as demonstrated by their superiority in most demand response baseline forecasting experiments. Compared to the classic and ensemble models, the XGBoost-based models were among the best for demand response baseline forecasts at the different dwelling aggregation levels over the four time horizons considered. Meanwhile, KernelR, SGDReg, ARIMA, CatBoost, and AdaBoost were among the worst models for forecasting the demand response baselines of dwellings. The classic and ensemble models could not achieve the same level of forecast accuracy in any of the comparative experiments over the time horizons considered. The optimal combination of hyperparameters, such as the numbers of hidden layers and hidden units, was sufficient to characterize the different underlying patterns of demand response baselines for dwelling units in the datasets. In some cases, the ANN, DNN, and CNN models suffered from instability but were able to self-regulate and achieve high performance in reliably and accurately forecasting residential demand response baselines. This is due to the training techniques associated with the deep learning approach, including the use of ReLU as the activation function and the dropout method for model regularization, which help to improve the forecasting performance of the 
neural network.
This work contributes to the body of knowledge on two levels. First, this research not only presents a performance comparison of each proposed method but also highlights the importance of employing advanced neural network models to improve the short-term and very short-term estimation of demand response baselines, as well as of the energy reductions resulting from implementing demand response programs in residential neighborhood contexts. The MAE, RMSE, and MAPE values clearly show that the structure of a method and its other features can significantly influence the production of accurate and reliable demand response baselines. With this comparison, the prediction of residential demand response baselines can be expanded and associated with other issues to promote the efficiency of implementing demand response programs. Second, this work provides important insights into the domain of advanced deep learning-based energy reduction estimation. The results demonstrated that a neural network model with optimal hyperparameters can serve as a useful tool for enhancing demand response programs by providing accurate and reliable estimates of baseline values in residential neighborhoods. In future work, the author will focus on advanced hybrid neural network techniques and address the associated limitations of real datasets representing the real-world complexity of occupant behaviors. In addition, other features related to energy prices, building typology (such as the thermal insulation of dwellings), and occupant behavior can be incorporated, along with concurrent prediction intervals, to investigate the effect of uncertainties on the forecasting processes and improve forecast accuracy in the context of residential neighborhoods.

where W denotes the weight matrices of the forward and backward LSTMs for the computed output.
• Pre-processing to understand how the demand response baselines of residential buildings correlate with potential input features, obtaining an input feature selection based on the Pearson Correlation Coefficient (PCC) technique [64,65].
• Training the models on a dataset composed of all the input features processed during the feature engineering stage and those determined as significant inputs by PCC in the input feature selection stage, with the error measured by performance indicators in the validation and evaluation stages.

Figure 4 .
Figure 4. Research methodology used in this work, including simulations and model development.

Figure 5 .
Figure 5. Hourly baseline energy demand behaviors for dwelling units over the days of the five heating months, as the energy demand baselines are the average for space heating, appliances, and lighting.

Figure 6 .
Figure 6. Energy demand distributions (demand response baseline profile) of dwelling units during the five heating months: (a) hourly energy demand and (b) total daily energy demand.

Figure 7 .
Figure 7. Example of the input feature importance estimated by the PCC technique and determined by SHAP tests using values of demand response baselines for all dwelling units.

Figure 8 .
Figure 8. Process of training and testing deep learning and traditional machine learning models using the demand response baseline values of dwelling units.

Figure 9 .
Figure 9. Comparison of forecasted and actual demand response baselines for dwelling units based on the performance of deep learning models over various time horizons, considering a single test period.

Figure 10 .
Figure 10.Comparison of forecast errors for deep learning models at each hour and over v time horizons, considering a single test period.

Figure 10 .
Figure 10.Comparison of forecast errors for deep learning models at each hour and over various time horizons, considering a single test period.

Figure 11. Comparison of forecasted and actual demand response baselines for dwelling units based on the performance of tree-based ensemble models over various time horizons, considering a single test period.

Figure 12. Comparison of forecast errors for ensemble models at each hour and over various time horizons, considering a single test period.

Figure 13. Comparison of forecasted and actual demand response baselines for dwelling units based on the performance of classic models over various time horizons, considering a single test period.

Figure 14 shows a comparison of the forecasted and measured value curves for one test period, where the maximum error amplitudes of the classic models were between (−40.52 and 12.73 kW), (−40.52 and 16.22 kW), (−40.52 and 36.77 kW), and (−40.52 and 34.26 kW) for forecasts at 6 h, 12 h, 18 h, and 24 h, respectively. At the same time, the maximum error magnitude of the deep learning models was between (−9.07 and 3.16 kW), (−12.91 and 9.31 kW), (−11.73 and 10.88 kW), and (−9.59 and 11.76 kW) at 6 h, 12 h, 18 h, and 24 h ahead forecasts, respectively. Two observations in this context can be made, as follows: (1) Deep learning methods demonstrated the best performance, highlighting their advantages in short-term residential energy forecasting. (2) Classic methods with regularization terms exhibited larger error fluctuations when forecasting demand response baseline values compared to their deep learning counterparts. Therefore, it can be concluded that the classic methods failed to significantly reduce the forecasting error when producing accurate estimates of demand response baselines for dwelling units.

Figure 14. Comparison of forecast errors for classic models at each hour and over various time horizons, considering a single test period.
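Error amplitudes such as those quoted for Figures 13 and 14 can be extracted from the residual series of a test period as sketched below; the sign convention (forecast minus actual) and the toy values are assumptions for illustration only.

```python
import numpy as np

def error_amplitude(actual, forecast):
    """Return the (most negative, most positive) forecast error over a
    test period, in kW. Error is defined here as forecast minus actual."""
    err = np.asarray(forecast, float) - np.asarray(actual, float)
    return float(err.min()), float(err.max())

# Illustrative values only (not the paper's data), in kW
actual   = [52.0, 60.5, 58.2, 49.9]
forecast = [50.1, 62.3, 55.0, 51.0]
lo, hi = error_amplitude(actual, forecast)  # lo ≈ -3.2, hi ≈ 1.8
```

The pair `(lo, hi)` corresponds to the bracketed amplitude ranges reported per horizon, so a wider interval indicates larger error fluctuations for that model class.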

Figure 15. Comparison of MAPE values for deep learning models considering different aggregation levels of the demand response baseline profile of the dwelling units over four time horizons: (a) 6 h, (b) 12 h, (c) 18 h, and (d) 24 h ahead forecasts.

Figure 16. Comparison of MAPE values for ensemble models considering different aggregation levels of the demand response baseline profile of the dwelling units over four time horizons: (a) 6 h, (b) 12 h, (c) 18 h, and (d) 24 h ahead forecasts.

Figure 17. Comparison of MAPE values for classic models considering different aggregation levels of the demand response baseline profile of the dwelling units over four time horizons: (a) 6 h, (b) 12 h, (c) 18 h, and (d) 24 h ahead forecasts.
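For reference, the MAPE used in these comparisons, together with the RMSE and MAE reported in the Appendix figures, can be computed as below; the toy values are illustrative only.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error (%); assumes no zero actual values."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs((a - f) / a)) * 100)

def rmse(actual, forecast):
    """Root mean square error, in the units of the series (kW here)."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.sqrt(np.mean((a - f) ** 2)))

def mae(actual, forecast):
    """Mean absolute error, in the units of the series (kW here)."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs(a - f)))

actual   = [100.0, 80.0, 120.0]
forecast = [ 90.0, 88.0, 120.0]
# mape -> (10/100 + 8/80 + 0/120) / 3 * 100 ≈ 6.67%
```

Because MAPE normalizes by the actual demand, it allows comparison across aggregation levels with very different absolute loads, while RMSE and MAE retain the kW scale of each level.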

Funding: This research received no external funding.

Figure A1. RMSE values for forecasting models at the aggregation level of dwelling units for the demand response baseline profile over four time horizons.

Figure A2. RMSE values for forecasting models at the aggregation level of 10 dwelling units for the demand response baseline profile over four time horizons.

Figure A3. MAE values for forecasting models at the aggregation level of 200 dwelling units for the demand response baseline profile over four time horizons.

Figure A4. MAE values for forecasting models at the aggregation level of 10 dwelling units for the demand response baseline profile over four time horizons.

Table 1. Distribution of the reviewed studies according to method, study scale, data type, and time horizon for estimation.

Table 2. Occupancy characteristics of the targeted dwelling units in Atlantech in the city of La Rochelle.

Table 3. Input features selected as predictors for forecasting models.

Table 4. Cont. Notes: Case I and Case II refer to forecasting models with demand response baseline profiles for all dwelling units and for different aggregation levels of dwelling units, respectively. Adam and ReLU are the optimization technique and activation function used for the deep learning models.

Table 5. Average performance of deep learning, classic, and ensemble models in forecasting aggregated demand response baselines for dwelling units over various time horizons.

Table 6. Performance of different forecasting models in evaluating energy reductions due to demand response events for all dwelling units during peak heating days over a 3 h ahead horizon.

Table A4. Average performance of forecasting models at the aggregation level of 50 dwelling units for the profile of demand response baselines over four time horizons.

Table A5. Average performance of forecasting models at the aggregation level of 20 dwelling units for the profile of demand response baselines over four time horizons.

Table A6. Average performance of forecasting models at the aggregation level of 10 dwelling units for the profile of demand response baselines over four time horizons.