4.1. Datasets’ Description
Two publicly available datasets, Smart House and Mexican House, were employed to analyze household energy consumption patterns. These datasets contained high-resolution time series data capturing diverse variations in energy demand across different household conditions.
4.1.1. Smart House Dataset
The Smart House dataset consists of power consumption measurements collected from a single-family home equipped with multiple IoT sensors. Data were recorded at a resolution of one minute, covering both individual appliance usage and overall household energy demand. The dataset includes the following features:
Timestamp: time of measurement (in minute intervals).
Total power consumption: aggregated household power usage in kilowatts (kW).
Appliance-specific consumption: power consumption per device (e.g., refrigerator, HVAC system, washing machine).
Environmental factors: ambient temperature and humidity, which influence energy demand.
4.1.2. Mexican House Dataset
The Mexican House dataset provides detailed energy consumption records from a residential household located in northeastern Mexico [24]. Data were collected every minute over a period of 14 months, from 5 November 2022 to 5 January 2024. In total, the dataset comprises 605,260 samples, each containing 19 variables related to energy consumption and environmental conditions.
This dataset was specifically designed for domestic energy consumption forecasting and behavior analysis, addressing a gap in the existing literature where such datasets for Mexico remained scarce.
The dataset's temporal consistency is important for time series forecasting, as it preserves the natural daily and weekly cycles in energy use without artificial interruptions. In datasets affected by daylight saving time, the one-hour shift can introduce abrupt changes in consumption patterns that do not reflect actual user behavior, and such discontinuities can interfere with the model's ability to learn true seasonal or temporal patterns. By avoiding these shifts, the Mexican House dataset provides a continuous and stable time series, helping models more accurately capture regular usage trends and behavioral dynamics.
The dataset is stored in CSV format, with each row representing a timestamped observation. The primary attributes include the following:
Timestamp: time of measurement (recorded every minute).
Total power usage: measured in watts (W), representing the household's power consumption.
Temperature and humidity: recorded indoor and outdoor environmental conditions.
Solar power generation: the amount of energy produced by rooftop solar panels.
Additionally, the dataset contains other electrical and meteorological variables that contribute to understanding energy consumption patterns.
This dataset serves as a valuable resource for training and evaluating predictive models aimed at improving household energy management and efficiency.
To ensure consistency in the analysis, both datasets were resampled to match common time intervals, allowing direct comparison across different forecasting models.
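As an illustration of this preprocessing step, a minimal sketch using pandas is given below; the file name, the name of the timestamp column, and the choice of 1 min and 15 min target intervals are assumptions made for illustration rather than the exact pipeline used here.

import pandas as pd

# Illustrative file and column names; the timestamp column becomes the index.
df = pd.read_csv("mexican_house.csv", parse_dates=["timestamp"], index_col="timestamp")
numeric = df.select_dtypes("number")          # keep only numeric measurements

# Align both datasets on a common grid, e.g., regular 1 min and 15 min intervals.
df_1min = numeric.resample("1min").mean()     # regularized 1 min series
df_15min = numeric.resample("15min").mean()   # aggregated 15 min series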
The two household datasets include not only energy consumption data but also weather data for the corresponding regions. The specific variables (factors) are outlined in Table 2 and Table 3. In addition, comprehensive statistical information (count, minimum, maximum, mean, and standard deviation) is provided to enhance the understanding of the datasets. These statistics, presented in Table 4 and Table 5, offer a more detailed characterization of the data.
An outlier analysis was conducted for both the Mexican and Smart House datasets. The results did not indicate a significant presence of outliers, suggesting that the data reflected a consistent and realistic energy consumption behavior over time.
4.1.3. Feature Importance Analysis
In order to ensure optimal feature selection, we conducted a Gini-based feature importance analysis using tree-based models. This method evaluates the contribution of each feature by measuring the reduction in Gini impurity when the feature is used for splitting across the ensemble of trees.
The Gini impurity for a node $t$ is defined as
$$G(t) = 1 - \sum_{i=1}^{C} p_i(t)^{2},$$
where $p_i(t)$ represents the proportion of samples belonging to class $i$ at node $t$, and $C$ is the total number of classes.
A feature is considered important if its usage in the tree structure significantly reduces the overall Gini impurity across nodes, indicating better separation between classes or improved predictive power.
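A minimal sketch of such an analysis is shown below, assuming the Smart House data are available as a pandas DataFrame and using scikit-learn's random forest; its impurity-based importances play the role of the Gini-based scores described above (for regression trees the impurity is the variance reduction rather than the classification Gini index). The file name, the next-step target definition, and the forest hyperparameters are illustrative assumptions, not the exact configuration used in our experiments.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Load the Smart House measurements (file name is illustrative).
df = pd.read_csv("smart_house.csv", parse_dates=["time"], index_col="time")

# Use the current numeric readings to predict the next-minute aggregate consumption.
features = df.select_dtypes("number")
target = df["use [kW]"].shift(-1)
features, target = features.iloc[:-1], target.iloc[:-1]   # drop the row with no target

# Impurity-based feature importances from a tree ensemble.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(features, target)

importances = pd.Series(forest.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False).head(10))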
For the Smart House dataset, the resulting feature importance scores are presented in Figure 7. It was observed that use [kW] was by far the most influential feature, with an importance score of approximately 0.25, followed by dewPoint (around 0.09), pressure (around 0.07), and windBearing (around 0.07). In contrast, features such as Solar [kW] and House overall [kW] contributed very little to the model. Based on these results, it appeared sufficient to retain only the power consumption feature, i.e., use [kW], for further analysis, as it dominated the predictive capability.
Similarly, for the second dataset, corresponding to the Mexican Household dataset and illustrated in Figure 8, a comparable pattern was observed. The feature active_power emerged as the dominant variable with an importance score of approximately 0.42, which is equivalent to the use [kW] feature from the first dataset. Other important features included temp (approximately 0.12), current (around 0.09), and power_factor (around 0.08).
Given these observations across both datasets, only the most dominant feature—use [kW] for the Smart House and active_power for the Mexican Household—was retained for the subsequent feasibility study. This decision was justified by their overwhelming importance relative to the other features.
4.5. Influence of Data Split and Window Size on Prediction Accuracy
To ensure the effectiveness of the proposed solution across different data splits for the Mexican dataset, three train/test splits were considered: 80–20%, 70–30%, and 60–40%. Comparable R² scores were obtained by the three methods for these splits, which indicates that the influence of seasonality was limited when using the proposed double LSTM approach.
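A sketch of this evaluation protocol is given below; the synthetic signal and the persistence forecast are placeholders for the measured active_power series and the double LSTM pipeline, used only to illustrate the chronological (unshuffled) splitting and the per-split R² computation.

import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the active_power signal (one week at 1 min resolution).
t = np.arange(7 * 24 * 60)
series = 300 + 100 * np.sin(2 * np.pi * t / 1440) + rng.normal(0, 10, t.size)

for train_frac in (0.8, 0.7, 0.6):
    cut = int(len(series) * train_frac)      # chronological split, no shuffling
    train, test = series[:cut], series[cut:]
    # A naive persistence forecast stands in here for the double LSTM pipeline.
    preds = np.roll(series, 1)[cut:]
    print(f"{int(train_frac * 100)}-{int(round((1 - train_frac) * 100))} split: "
          f"R2 = {r2_score(test, preds):.3f}")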
The effect of varying window sizes on prediction accuracy is a critical aspect when modeling power consumption, particularly in time series with fine temporal granularity.
Figure 9 and Figure 10 illustrate the predicted versus actual power consumption for a smart household dataset, using binomial smoothing filters applied with multiple window sizes at 1 min and 15 min granularities, respectively.
In Figure 9, where the data were sampled every minute, smaller window sizes preserved short-term fluctuations and transient appliance events but could retain high-frequency noise, resulting in slight overfitting and increased prediction variance. Conversely, larger windows led to smoother predictions by averaging over more data points, which effectively reduced noise and highlighted broader consumption trends. However, excessive smoothing could also dampen the sharp transitions in load demand, such as sudden spikes caused by high-power appliances, leading to underestimation during peak periods.
Figure 10, based on the same dataset but aggregated at a 15 min resolution, demonstrates a markedly different behavior. At this coarser granularity, the signal was inherently smoother, and the benefit of additional smoothing via large window sizes became less pronounced. The predictions across different window sizes were generally closer to each other, and larger windows did not excessively distort the temporal dynamics. This suggests that for coarser-grained data, larger windows can be used without significant loss of important consumption features, improving model generalization.
Comparatively, the 1 min dataset required careful window size selection to balance signal fidelity and noise reduction, as inappropriate choices could obscure important load characteristics. On the other hand, in the 15 min setting, the model was more robust to changes in window size, as the primary variability had already been attenuated through temporal aggregation.
These observations support the notion that the optimal window size for filtering depends strongly on the data resolution: high-resolution datasets benefit from moderate smoothing to suppress volatility, whereas lower-resolution datasets can tolerate and even benefit from larger smoothing windows without significant degradation in predictive performance.
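To make the filtering step concrete, the following sketch applies a uniform moving-average filter and a binomial-weighted filter for several window sizes and splits the signal into a low-frequency component and its high-frequency residual. The synthetic signal and the particular window sizes are illustrative assumptions.

import numpy as np
from scipy.special import comb

def uniform_filter(x, w):
    """Moving average with a flat window of length w."""
    kernel = np.ones(w) / w
    return np.convolve(x, kernel, mode="same")

def binomial_filter(x, w):
    """Smoothing with normalized binomial (Pascal's triangle) weights of length w."""
    kernel = comb(w - 1, np.arange(w))
    return np.convolve(x, kernel / kernel.sum(), mode="same")

# Synthetic stand-in for a 1 min consumption signal (two days).
rng = np.random.default_rng(1)
t = np.arange(2 * 24 * 60)
signal = 1.0 + 0.5 * np.sin(2 * np.pi * t / 1440) + rng.normal(0, 0.1, t.size)

for w in (3, 5, 9):
    for name, filt in (("uniform", uniform_filter), ("binomial", binomial_filter)):
        low = filt(signal, w)        # low-frequency component
        high = signal - low          # high-frequency residual
        print(f"{name}, w={w}: residual std = {high.std():.4f}")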
4.7. Discussion
The experimental results confirmed the effectiveness of frequency decomposition in improving the accuracy of short-term load forecasting. Across both the Smart House and Mexican Household datasets, models trained on decomposed signals consistently outperformed those reported in prior studies. Notably, the use of windowed uniform and binomial filters significantly enhanced model performance, especially when smaller window sizes were applied.
To further highlight the importance of signal decomposition, we conducted additional experiments by running the models without applying any decomposition.
Table 14 presents the performance comparison obtained when forecasting power consumption using a CNN-LSTM model, a GRU model, a single LSTM model, the DeepAR model, the BiLSTM model, the DATE-TM approach [17], and our proposed method. The results, shown for the Mexican Household dataset, clearly demonstrate that our method significantly outperformed the baseline models, including DATE-TM. This highlights the crucial role of frequency decomposition in enhancing model performance.
Furthermore, despite their simplicity, the proposed convolution-based methods yielded highly competitive results, outperforming baseline models in both accuracy and efficiency.
To assess the statistical significance of our frequency decomposition approach, we employed the Diebold–Mariano (DM) test to compare forecasting performance between our method and a standard single LSTM model applied to the non-decomposed signal. The Diebold–Mariano test evaluates the null hypothesis of equal predictive accuracy between two forecast methods by examining their corresponding loss differentials. The DM test statistic is calculated as follows:
$$DM = \frac{\bar{d}}{\sqrt{\hat{\sigma}^{2}_{\bar{d}}}},$$
where
$\bar{d} = \frac{1}{T}\sum_{t=1}^{T} d_t$ is the sample mean of the loss differential series;
$d_t = L(e_{1,t}) - L(e_{2,t})$ represents the difference between the loss functions at time $t$;
$L(\cdot)$ is the loss function (typically the squared or absolute error);
$e_{1,t}$ and $e_{2,t}$ are the forecast errors from the two models at time $t$;
$\hat{\sigma}^{2}_{\bar{d}}$ is a consistent estimate of the variance of $\bar{d}$.
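For completeness, the statistic can be computed directly from the two forecast-error series. The sketch below assumes one-step-ahead forecasts, a squared-error loss, and a normal approximation for the p-value; these choices are illustrative rather than the exact test configuration used above.

import numpy as np
from scipy import stats

def diebold_mariano(e1, e2, loss=np.square):
    """DM statistic for one-step-ahead forecasts (squared-error loss by default)."""
    d = loss(e1) - loss(e2)                       # loss differential series
    d_bar = d.mean()
    var_d_bar = d.var(ddof=1) / len(d)            # variance of the mean differential
    dm = d_bar / np.sqrt(var_d_bar)
    p_value = 2 * (1 - stats.norm.cdf(abs(dm)))   # two-sided normal approximation
    return dm, p_value

# e_single and e_dual would be the forecast-error vectors of the single LSTM and
# the proposed dual-frequency model on the same test set, e.g.:
# dm, p = diebold_mariano(e_single, e_dual)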
Our analysis yielded a DM test statistic of 43.57 with a p-value of 10⁻⁷, which is substantially below the conventional significance threshold of 0.05. This result provided strong evidence to reject the null hypothesis that both models had equal predictive accuracy. The magnitude of the DM statistic (43.57) indicated a large effect size, demonstrating that the forecast errors from the single LSTM model were consistently and significantly larger than those from our frequency decomposition method. This statistical evidence confirmed that the separate modeling of frequency components provided a substantial and measurable improvement in forecasting accuracy. The extremely low p-value suggested that the probability of observing such performance differences by random chance was negligible, thus validating the fundamental efficacy of our decomposition-based approach in comparison to traditional single-model techniques.
Compared to the standard single LSTM, each LSTM network in the proposed double LSTM approach had the same complexity, but the training time was longer because both models needed to be trained sequentially. Therefore, while the single model required 127.8 MB of memory for 5135.17 s of training time across 20 epochs, the dual model required 128.41 MB of memory for 9879.61 s of training time, approximately double the time with the same memory requirement. During prediction, the standard LSTM’s inference time was 13 s, half of the 27 s required by the two models to process the 3783 time samples in the test dataset. This equated to 3.4 and 7.1 ms per sample, respectively.
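The per-sample figures follow directly from dividing the total prediction time by the 3783 test samples. A simple way to obtain such measurements is sketched below, with model and X_test as placeholders for any fitted forecaster and its prepared test windows.

import time

def time_inference(model, X_test):
    """Return the total inference time (s) and the average time per sample (ms)."""
    start = time.perf_counter()
    model.predict(X_test)
    elapsed = time.perf_counter() - start
    return elapsed, 1000 * elapsed / len(X_test)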
To evaluate the effectiveness of the proposed dual-frequency LSTM model against established methodologies, we conducted a comparative analysis with a decomposition-based approach: STL (Seasonal-Trend decomposition using Loess) followed by a standard LSTM implementation.
Table 15 summarizes the performance metrics of both methods on the same test dataset.
The quantitative results demonstrate that our dual-frequency approach substantially outperformed the STL + LSTM baseline across all evaluation metrics. Specifically, our method reduced the Mean Absolute Error by 58.2% and the RMSE by 65.3%. These substantial performance differentials provided compelling evidence for both the predictive accuracy and methodological robustness of our proposed architecture.
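For reference, the decomposition step of this baseline can be reproduced with the STL implementation in statsmodels. The sketch below assumes a 15 min series with daily seasonality (period = 96); these settings are illustrative rather than the exact baseline configuration.

import pandas as pd
from statsmodels.tsa.seasonal import STL

def stl_components(power: pd.Series, period: int = 96):
    """Decompose a consumption series into trend, seasonal, and residual parts."""
    result = STL(power, period=period, robust=True).fit()
    return result.trend, result.seasonal, result.resid

# In the STL + LSTM baseline, each component would then be forecast separately
# and the component forecasts recombined by summation.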
For the Smart House dataset at 1 min granularity, the combination of decomposition and LSTM led to an R² score of 0.997 with the uniform filter and 0.995 with the binomial filter for a window size of three. These values reflected a near-perfect fit and represented a substantial improvement compared to the R² score of 0.863 found in the existing literature [17]. Similar improvements were observed in terms of MSE and MAE, demonstrating the model's capacity to track rapid consumption dynamics with minimal error. These results are depicted in Table 16.
The Mexican Household dataset exhibited a similar trend, particularly at higher resolutions. A window size of three with uniform filtering yielded an R² of 0.994 and a low RMSE of 13.278, markedly outperforming the RMSE value of 82.488 previously reported in the literature [17]. Even at the 15 min interval, which typically smooths out temporal fluctuations, the proposed approach maintained high predictive accuracy, with R² values exceeding 0.99 for both filters.
To rigorously validate our approach and address concerns of potential overfitting, we implemented a time series cross-validation framework. Unlike standard k-fold cross-validation, which can lead to data leakage in time series problems, we employed a temporal cross-validation strategy using TimeSeriesSplit with five folds. This approach ensures that all training data strictly precede test data in each fold, maintaining the temporal integrity essential for forecasting tasks. For fold $i$, the data are partitioned as
$$\mathcal{D}_{\text{train}}^{(i)} = \{x_1, \ldots, x_{n_i}\}, \qquad \mathcal{D}_{\text{test}}^{(i)} = \{x_{n_i+1}, \ldots, x_{n_i+h}\},$$
where
$\mathcal{D}_{\text{train}}^{(i)}$ represents the training set for fold $i$;
$\mathcal{D}_{\text{test}}^{(i)}$ represents the test set for fold $i$;
and $n_i$ and $h$ denote, respectively, the index of the last training sample and the length of the test block in fold $i$.
For each fold, we independently trained both the low-frequency and high-frequency models on their respective decomposed signals using the designated training set. Performance metrics were then computed on the held-out test data, with final predictions formed by recombining outputs from both frequency components.
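A sketch of this validation loop using scikit-learn's TimeSeriesSplit is given below; the fit_predict callable is a placeholder standing in for training an LSTM on one frequency component and returning its forecast, not our exact implementation.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import r2_score

def temporal_cv(low_freq, high_freq, target, fit_predict, n_splits=5):
    """Evaluate the dual-frequency pipeline with expanding-window folds."""
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(target):
        pred_low = fit_predict(low_freq[train_idx], len(test_idx))
        pred_high = fit_predict(high_freq[train_idx], len(test_idx))
        recombined = pred_low + pred_high            # sum the component forecasts
        scores.append(r2_score(target[test_idx], recombined))
    return np.mean(scores), np.std(scores)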
Table 17 presents the mean and standard deviation of performance metrics across all five folds.
The temporal cross-validation results demonstrated remarkable consistency across all folds, with the standard deviation of the R² scores being merely 0.003. This low variance in performance metrics across different temporal splits strongly countered concerns of overfitting or data leakage. Furthermore, even the lower bound of the 95% confidence interval for R² (0.989) substantially exceeded the performance reported for state-of-the-art methods in the literature.
The proposed approach showed significant improvements over the state of the art. For the Smart House dataset, the MAE decreased by approximately 87.1%, and the RMSE dropped by 89.4%. The R² score increased from 0.863 to 0.997, reflecting a 15.3% improvement in variance explanation. These high values are explained by the fact that the decomposition and filtering steps simplify the input signal by removing noise and isolating meaningful patterns. This helps the model learn more effectively and make more accurate predictions. Moreover, the results were validated through careful parameter tuning and multiple data splits to ensure that the gains were reliable and not tied to a specific subset of the data.
Similarly, for the Mexican Household dataset, the MAE decreased by 66.5%, and the RMSE dropped drastically by 83.9%. The R² score improved from 0.878 to 0.994, representing a 13.2% relative enhancement. These improvements clearly demonstrated the robustness and superior accuracy of the proposed decomposition and filtering strategy when paired with the LSTM model, especially for high-resolution forecasting scenarios.

To further validate this, we compared the proposed method with a standard LSTM model trained on the raw, non-decomposed signal. The baseline model yielded an MAE of 37.624, RMSE of 120.070, and R² of 0.526. In contrast, our decomposition-based approach achieved an MAE of approximately 9.626, RMSE of 13.278, and R² of 0.994. These results highlight the importance of frequency decomposition: by isolating different frequency components and modeling them separately, the model captures more fine-grained temporal structures, leading to significantly higher predictive performance.

Additionally, a Gated Recurrent Unit (GRU) model was evaluated and produced results comparable to our LSTM-based approach. However, the LSTM model consistently outperformed GRU with slightly lower prediction errors and a higher coefficient of determination. Furthermore, the GRU model required approximately 5% more memory, making it less suitable for deployment in resource-constrained environments. The choice of a simple LSTM model thus offers a favorable balance between predictive accuracy and computational efficiency, aligning with the practical constraints of real-world smart grid applications.
At the 15 min resolution, the proposed method showed significant improvements over existing results. For the Smart House dataset, the coefficient of determination increased from 0.758 to 0.992, indicating a much better fit. For the Mexican Household dataset, the Mean Absolute Error and Root-Mean-Square Error dropped by 99.1% and 99.2%, respectively, with the R² value rising from 0.771 to 0.991, confirming the robustness of the proposed approach across different data types. These results are depicted in Table 18.
A critical finding from our investigation was the inverse relationship between filter window size and prediction accuracy. Smaller window sizes (in particular, a window size of three) consistently yielded superior results across all experimental configurations. This suggests that preserving fine-grained signal characteristics through minimal smoothing is crucial for accurate energy forecasting. Excessive smoothing with larger windows appears to eliminate valuable predictive information about consumption dynamics, even when applied within a frequency decomposition framework.
When comparing filtering approaches, the binomial filter demonstrated slight advantages over the uniform filter in several configurations. As shown in Table 6, Table 8, Table 10, and Table 12, the RMSE values clearly decreased for the binomial filter compared to the uniform filter. This aligns with our theoretical understanding of the binomial filter's properties: its graduated weighting scheme appears better suited to preserving significant signal transitions while still providing effective noise reduction. However, the performance difference between filtering methods was less pronounced than the impact of window size, indicating that window size selection is the more critical parameter in decomposition-based forecasting.
These findings have significant implications for energy forecasting applications. The optimal approach for high-accuracy household consumption prediction appears to involve a small decomposition window (a window size of three), preferably using a binomial filter, with separate LSTM models trained on the resulting frequency components. This configuration preserves essential signal characteristics while enabling specialized prediction of different temporal patterns, resulting in remarkably accurate forecasts across diverse household environments and temporal resolutions.
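A minimal sketch of how such a configuration can be assembled is given below, assuming Keras LSTMs with illustrative layer sizes, a 60-step input window, and a binomial filter of width three; this is not the exact architecture or training configuration reported above.

import numpy as np
from scipy.special import comb
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

def binomial_smooth(x, w=3):
    """Low-pass filter the signal with normalized binomial weights of length w."""
    k = comb(w - 1, np.arange(w))
    return np.convolve(x, k / k.sum(), mode="same")

def to_windows(x, lag=60):
    """Sliding windows of `lag` past values mapped to the next value."""
    X = np.stack([x[i:i + lag] for i in range(len(x) - lag)])
    return X[..., None], x[lag:]

def make_lstm(lag=60):
    model = Sequential([LSTM(64, input_shape=(lag, 1)), Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    return model

def fit_dual(signal, lag=60, epochs=20):
    low = binomial_smooth(signal)            # low-frequency component
    high = signal - low                      # high-frequency residual
    models = []
    for component in (low, high):
        X, y = to_windows(component, lag)
        m = make_lstm(lag)
        m.fit(X, y, epochs=epochs, batch_size=256, verbose=0)
        models.append(m)
    # At prediction time, each model forecasts its own component from the most
    # recent `lag` values, and the two forecasts are summed.
    return models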
The consistency of these results across two independent datasets with different characteristics—the Smart House dataset with numerous appliance-specific measurements and the Mexican Household dataset with different climate conditions and usage patterns—suggests that the benefits of frequency decomposition are generalizable rather than dataset-specific. Further analysis revealed that the decomposition approach performed well regardless of the underlying variations in usage patterns or environmental conditions. For both datasets, we observed similar improvements in key performance metrics such as Mean Absolute Error (MAE) and Root-Mean-Square Error (RMSE), indicating the robustness of the model. This suggests that the model’s performance is not overly reliant on specific features or data distributions but rather on the general ability of frequency decomposition to isolate and model relevant temporal patterns. These findings underscore the model’s adaptability across different real-world scenarios, providing confidence in its potential for broader applications beyond the datasets considered here.