Article

Hybrid Spatio-Temporal Deep Learning Models for Multi-Task Forecasting in Renewable Energy Systems

by Gulnaz Tolegenova 1,2,*, Alma Zakirova 3, Maksat Kalimoldayev 2,4 and Zhanar Akhayeva 5,*
1 Department of Computer and Software Engineering, L.N. Gumilyov Eurasian National University, Astana 010000, Kazakhstan
2 Higher School of Information Technology and Engineering, Astana International University, Astana 010000, Kazakhstan
3 Department of Computer Science, L.N. Gumilyov Eurasian National University, Astana 010000, Kazakhstan
4 Institute of Information and Computational Technologies, Almaty 050040, Kazakhstan
5 Department of Information Systems, L.N. Gumilyov Eurasian National University, Astana 010000, Kazakhstan
* Authors to whom correspondence should be addressed.
Computers 2026, 15(3), 183; https://doi.org/10.3390/computers15030183
Submission received: 20 January 2026 / Revised: 1 March 2026 / Accepted: 6 March 2026 / Published: 11 March 2026

Abstract

Short-term forecasting of solar and wind power generation is critical for smart grid management but challenging due to non-stationarity and extreme generation events. This study addresses a multi-task learning problem: regression-based forecasting of power output and binary detection of extreme events defined by a quantile-based threshold (q = 0.90). A hybrid spatio-temporal model, DP-STH++, is proposed, implementing parallel causal fusion of LSTM, GRU, a causal Conv1D stack, and a lightweight causal transformer. The architecture employs regression and classification heads, while an uncertainty-weighted mechanism stabilizes multitask optimization in the regression tasks; extreme event detection performance is evaluated using AUC. Training and evaluation follow a leakage-safe protocol with chronological data processing, calendar feature integration, time-aware splitting, and training-only estimation of scaling parameters and extreme thresholds. Experimental results obtained with a one-hour forecasting horizon and a 24 h context window demonstrate that DP-STH++ achieves the best regression performance on the hold-out set (RMSE = 257.18, MAE = 174.86–287.90, MASE = 0.2438, R2 = 0.9440) and the highest extreme event detection accuracy (AUC = 0.9896), ranking 1st among all compared architectures. In time-series cross-validation, the model retains the leading position with a mean MASE = 0.3883 and AUC = 0.9709. The advantages are particularly pronounced for wind power forecasting, where DP-STH++ simultaneously minimizes regression errors and maximizes AUC = 0.9880–0.9908.

Graphical Abstract

1. Introduction

The integration of renewable energy sources into urban smart infrastructure is increasing. Although solar and wind generation are technologically and environmentally advantageous, they are highly susceptible to weather conditions, exhibiting pronounced non-stationarity and sharp intraday output fluctuations. These characteristics complicate power system operational management and increase the demand for the accuracy of short-term forecasts used for load balancing and power reserve allocation [1,2,3].
Recent time-series studies show that strong linear baselines and transformer approaches can deliver competitive quality when the experiment is set up correctly [4,5,6,7,8,9,10,11]. For energy applications, particularly wind and solar generation tasks, hybrid and multitask approaches are relevant because they allow the simultaneous consideration of non-stationarity, seasonal-daily patterns, and extreme modes [12,13,14,15,16,17]. Recent reviews emphasize fair comparison across architectural classes and reproducible evaluation protocols in RES forecasting tasks [6,17,18]. Accordingly, the task of this work is to construct and verify a unified short-term solar/wind forecasting pipeline with simultaneous regression estimation and extreme-event detection under a comparable experimental protocol.
This study frames short-term forecasting as a time-oriented inference task, predicting “from the past to the future” with a horizon of H = 1 and a context window of L = 24. Initial observations were first standardized to a uniform timescale by parsing, sorting, and removing duplicate timestamps, followed by the creation of calendar attributes: hour, day of week, month, and season. Subsequently, a unified feature set and target variables were constructed for both the solar and wind subsystems. Strict chronological splitting was applied using an 80/20 hold-out and walk-forward cross-validation. Finally, the target series is shifted by horizon H, retaining only valid indices [19,20].
A key practical consideration is the need to account for not only the average power levels but also rare, extreme operating modes. In this study, this was addressed by formalizing a separate binary classification task to identify these extremes. Extremes are defined using an EXT_Q quantile threshold, calculated solely on the training data (q = 0.90 in our experiments) and then applied without modification to the hold-out/folds. This results in a unified multi-task protocol encompassing two regressions (solar and wind) and two classifications of extremes [21].
Another critical methodological requirement is preventing data leakage. All transformation statistics, including the feature and target variable scaling parameters, were configured exclusively on the training data and then applied to the validation/test sets without recalculation. This ensures the reproducibility and temporal correctness of the evaluation [19].
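As a minimal sketch of this leakage-safe convention (all data and column choices here are hypothetical, not the study's dataset), scaling statistics are estimated once on the chronological training segment and then applied, frozen, to the hold-out:

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(loc=100.0, scale=20.0, size=1000)   # hypothetical hourly target series

# Chronological 80/20 split without shuffling.
split = int(len(y) * 0.8)
y_train, y_test = y[:split], y[split:]

# Leakage-safe scaling: statistics come from TRAIN only ...
mu, sigma = y_train.mean(), y_train.std()
y_train_s = (y_train - mu) / sigma
# ... and are applied unchanged to the hold-out, never re-estimated.
y_test_s = (y_test - mu) / sigma
```

The test segment is deliberately never touched when computing `mu` and `sigma`; its standardized values therefore need not have zero mean or unit variance.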
The objective of this study was to develop and evaluate a hybrid causal DP-STH++ architecture for the joint forecasting of solar and wind generation and the detection of extreme events within a single model. The architecture employs the parallel fusion of four representation branches (LSTM, GRU, causal Conv1D, and a lightweight causal transformer). The aggregated representations were then combined and fed into multi-head outputs for regression and classification. An uncertainty-weighted balancing mechanism (LossScaleLayer) is used in the regression component to adaptively balance solar and wind regression losses [5,7,8,9,10], where prior studies have shown that uncertainty-based weighting stabilizes multi-task optimization in non-stationary time series. The experimental evaluation compares the proposed architecture against baseline multi-task learning (MTL) architectures (LSTM, GRU, their combinations, CNN, STT), as well as a seasonal naive benchmark (lag = 24), allowing us to assess the contributions of both the architectural design and the rigorous time-correct processing and validation protocol [20].
In contrast to several existing hybrid models that typically combine two architectural families (e.g., TCN + Transformer or graph-based models), DP-STH++ integrates four parallel causal branches with joint regression and extreme-event classification heads under a unified chronological evaluation protocol. The novelty lies in the parallel causal fusion mechanism and strict train-only threshold computation, rather than in the use of individual standard components.
The main scientific contributions of this study are as follows:
  • A leak-resistant multitask learning framework is proposed for the joint forecasting of solar and wind generation.
  • A hybrid spatiotemporal architecture was developed, integrating recurrent, convolutional, and transformer components within a causal framework.
  • The explicit modeling of extreme generation modes was implemented as a separate classification task.
  • A comprehensive evaluation is conducted to assess accuracy and stability against representative baselines under the adopted protocol.
The field of time-series forecasting has seen rapid development in recent years, with several studies questioning the necessity of complex transformer architectures when simpler linear models or properly configured baselines perform comparably or better under fair evaluation conditions [4,5,8,9,10]. At the same time, energy-specific applications benefit significantly from hybrid and multi-task frameworks that jointly model regression and classification tasks (e.g., extreme event detection) while handling pronounced non-stationarity and sharp fluctuations typical for solar and wind generation [6,7,10,11,12].
Recent reviews published in 2024–2025 provide updated overviews of deep learning approaches for solar and wind forecasting, confirm the relevance of hybrid architectures, and stress the methodological importance of leakage-free experimental protocols and reproducible comparisons [13,14,15].
In light of these developments, the current work builds upon the latest methodological insights. Table 1 summarizes a curated set of recent publications (2022–2025) that directly inform the design choices, baseline selection, evaluation protocol, and interpretation of results in this study.
Recent developments in Kolmogorov–Arnold Networks (KAN) and related basis-function-based architectures propose adaptive nonlinear modeling through spline-based functional representations. These models have demonstrated strong approximation capabilities in structured regression tasks and have recently attracted attention in time-series forecasting research.
In the revised experimental section, a KAN-inspired proxy baseline is included to ensure empirical positioning relative to nonlinear basis-function families under the same leakage-safe protocol and identical evaluation settings. While KAN-type approaches emphasize flexible functional approximation, the proposed DP-STH++ architecture focuses on complementary temporal representations within a unified multitask regression–classification framework.

2. Materials and Methods

2.1. Data Source and Study Design

This study addresses the challenge of short-term (hourly) forecasting of both solar and wind generation within a unified framework, focusing on two target series: kWh_solar_power_solar and kWh_wind_power_wind. The empirical database consists of a single chronologically ordered time series with 8752 observations and a consistent input dimension for both forecasting tasks. The input feature vector comprises meteorological indicators relevant to solar and wind power generation and calendar variables (hour, day of the week, month, and season).
The experimental design employed a time-series simulation approach, maintaining strict chronological order without data mixing. A chronological 80/20 hold-out split without shuffling was applied, training models on the initial 80% of the data and testing on the final 20%; where necessary, walk-forward cross-validation (TimeSeriesSplit) was additionally implemented.
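The two validation scenarios can be sketched as follows, assuming scikit-learn's `TimeSeriesSplit` (the series length matches the 8752 observations reported above; the index array stands in for the actual data):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n = 8752                                  # observations, as in the dataset
idx = np.arange(n)

# Primary scenario: chronological 80/20 hold-out, no shuffling.
split = int(n * 0.8)
train_idx, test_idx = idx[:split], idx[split:]

# Additional scenario: walk-forward cross-validation.
tscv = TimeSeriesSplit(n_splits=5)
for tr, te in tscv.split(idx):
    # every validation fold lies strictly after its training fold
    assert tr.max() < te.min()
```

In both scenarios the evaluation window always lies in the future of the training window, which is what makes the protocol time-correct.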

2.2. Leakage-Safe Data Preprocessing

The data preparation pipeline consisted of the following sequential operations (Figure 1).
The data preparation and processing pipeline consists of the following steps.
  • Temporal Axis Alignment: Timestamps were parsed, data were sorted chronologically, and duplicate entries were removed.
  • Calendar Feature Generation: Calendar-based features, including hour, day of the week, month, and categorical season (subsequently encoded), are generated.
  • Working Dataset Construction: A working dataset comprising the target series and a complete set of input features (targets + features_all) was assembled. Rows containing missing values in the selected columns were then removed.
  • Time-Based Splitting and Horizon Adjustment: A time-based split (without shuffling) was performed using an 80/20 hold-out or walk-forward cross-validation approach. Target variables are shifted to a specified horizon (H), and only valid indices (rows with available target values at the horizon) are retained in the dataset.
  • Training-Only Scaling: Scaling parameters for features and target variables are estimated exclusively on the training dataset (fit x_scaler, y_scaler on TRAIN). These parameters were then applied to the hold-out or cross-validation folds without re-evaluation.
  • Windowing: For sequential architectures, sliding windows of length L (L = 24 in the experiments) were created. The first L − 1 timestamps were discarded for the supervised samples. Optionally, a feature vector (X_last) representing the “flat” representation of the current state can be formed from the last window step.
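The windowing step above can be sketched as follows (series length, feature dimension, and the random data are illustrative placeholders; L = 24 and H = 1 match the experiments):

```python
import numpy as np

L, H = 24, 1                      # context window and forecast horizon
T, D = 200, 6                     # hypothetical series length and feature dim
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D))
y = rng.normal(size=T)

# Sliding windows: each sample sees L past steps and predicts H steps ahead.
X_seq, y_h = [], []
for t in range(L - 1, T - H):
    X_seq.append(X[t - L + 1 : t + 1])    # window ending at time t (causal)
    y_h.append(y[t + H])                  # target shifted by the horizon
X_seq, y_h = np.stack(X_seq), np.array(y_h)
```

The first L − 1 timestamps yield no supervised sample, and the last H timestamps have no valid target, which is exactly the "valid indices only" rule stated above.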

2.3. Definition of Extreme Events

Extreme events were formalized using a quantile threshold calculated solely from the training sample. For each target series, a quantile level (q = EXT_Q) was set independently (q = 0.90 in the presented experiments). The resulting threshold (thr) is then fixed and applied to the holdout or cross-validation sets without recalculation. A binary extreme label was assigned by comparing the observation to the threshold on the corresponding scale of the target variable, forming output y_ext for subsequent model evaluation using the AUC metric.
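A hedged sketch of this train-only thresholding (the gamma-distributed series is a synthetic stand-in for a generation target):

```python
import numpy as np

rng = np.random.default_rng(1)
y_train = rng.gamma(shape=2.0, scale=50.0, size=800)   # hypothetical generation
y_test = rng.gamma(shape=2.0, scale=50.0, size=200)

EXT_Q = 0.90
# Threshold from TRAIN only; fixed afterwards, never recomputed on test folds.
thr = np.quantile(y_train, EXT_Q)

# Binary extreme labels on both partitions use the same frozen threshold.
y_ext_train = (y_train > thr).astype(int)
y_ext_test = (y_test > thr).astype(int)
```

By construction, roughly 10% of the training observations are labeled extreme; the test-set positive rate then depends on how the hold-out distribution relates to the training one.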

2.4. Model Architecture

In this study, we implemented and compared a set of multitask deep learning architectures, including MTL_LSTM, MTL_GRU, MTL_LSTM_GRU, STT, MTL_CNN, MTL_LSTM_GRU_STT, and the proposed hybrid model DP-STH++. To ensure a fair comparison, all models used a consistent input format, a sequential window $X_{\mathrm{seq}} \in \mathbb{R}^{L \times D}$ (with L = 24 in the experiments), and a unified output format: two regression heads for predicting solar and wind generation and two classification heads for detecting extreme modes.
The proposed DP-STH++ architecture is constructed as a parallel causal fusion of four spatiotemporal branches, each generating its own representation of the window $X_{\mathrm{seq}}$:
  • Branch 1 (RNN): A causal LSTM recurrent encoder extracts short- and medium-term dependencies while maintaining causality.
  • Branch 2 (RNN): A causal GRU serves as an alternative recurrent encoder, complementing the LSTM in terms of inductive properties.
  • Branch 3 (TCN): A causal Conv1D stack extracts local and multiscale temporal patterns.
  • Branch 4 (STT): A lightweight causal transformer with a causal attention mechanism that accounts for a longer-range context within the window.
The output of each branch is aggregated using a global pooling operator (resulting in a fixed-dimensional vector), followed by feature fusion:
$$z = \mathrm{Concat}(z_1, z_2, z_3, z_4),$$
where $z_b$ is the pooled representation of branch $b \in \{1, 2, 3, 4\}$. Subsequently, the vector $z$ is processed through a sequence of BatchNormalization, a Dense layer with 64 units and ReLU activation, and a dropout layer with a rate of 0.20 before being fed to the output heads. This architecture decouples the extraction of temporal features (within branches) from the unified multitasking component (shared blocks and heads), ensuring compatibility with fundamental MTL architectures.
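The pooling-and-fusion step can be illustrated with a shape-level sketch; the four branch encoders are replaced here by random stand-ins, and the branch widths are illustrative, not the paper's actual layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
B, L, D = 32, 24, 6               # batch, window length, feature dim (hypothetical)
x = rng.normal(size=(B, L, D))

# Stand-ins for the four branch encoders (LSTM, GRU, causal Conv1D, causal
# transformer): each maps the window to a per-step representation of width u.
units = [64, 64, 32, 32]
branch_outputs = [rng.normal(size=(B, L, u)) for u in units]

# Global average pooling over time gives one fixed-size vector per branch ...
pooled = [h.mean(axis=1) for h in branch_outputs]
# ... and feature fusion concatenates them: z = Concat(z1, z2, z3, z4).
z = np.concatenate(pooled, axis=-1)
```

Whatever the branch widths, pooling removes the time axis, so the fused vector `z` has a fixed dimension equal to the sum of the branch widths, ready for the shared Dense/Dropout block and the output heads.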

Architectural Rationale and Complementarity of Branches

The DP-STH++ architecture is not intended as a fundamentally new neural network class. Instead, it represents a structured architectural decomposition designed for the joint solution of two related tasks: (i) short-term regression forecasting of solar and wind generation and (ii) detection of rare extreme events under a leakage-safe protocol.
The design follows a complementarity principle, where each parallel branch captures a distinct aspect of temporal dynamics (see Figure 2):
  • LSTM branch—models long-term temporal inertia, daily periodicity, and smoothing effects typical for renewable generation series. This branch emphasizes stable representation of medium- and long-range dependencies.
  • GRU branch—provides a more compact recurrent representation with fewer parameters, which improves learning stability under limited data regimes. In our configuration, GRU serves as a complementary recurrent encoder focusing on faster adaptation to local dynamic changes.
  • TCN (causal Conv1D) branch—extracts short local temporal motifs and abrupt transitions (spikes and drops). Convolutional filters highlight multiscale local structures that recurrent encoders may smooth out.
  • Lightweight causal Transformer branch—performs contextual aggregation across the observation window while preserving strict causality (no access to future information). It models heterogeneous feature interactions and non-uniform temporal dependencies within the window.
Thus, the parallel composition is not an arbitrary engineering combination but a decomposition of temporal structure into complementary representations: memory/inertia (RNNs), local motifs (TCN), and contextual interactions (Transformer).
The contribution of DP-STH++ is therefore architectural–algorithmic and experimentally validated, rather than based on the invention of new layers. The goal is to construct a reproducible multitask framework that balances regression accuracy and extreme-event classification under realistic data constraints.

2.5. Multi-Task Learning and Loss Functions

Training was conducted in a multitask setting that combined two regression and two classification tasks. For solar and wind power generation regression, an uncertainty-weighted quadratic error with trainable scale parameters (implemented via LossScaleLayer (UW-Reg)) was employed. This approach enables the automatic balancing of contributions from the two regression tasks, thereby eliminating the need for manual coefficient selection.
Let the true values at the horizon be $y_i(t)$ and the model predictions be $\hat{y}_i(t)$. The regression loss is defined as
$$\mathcal{L}_i^{\mathrm{reg}} = \frac{1}{N} \sum_{t=1}^{N} \big(y_i(t) - \hat{y}_i(t)\big)^2, \quad i \in \{s, w\}.$$
Then the uncertainty-weighted part takes the form
$$\mathcal{L}_{\mathrm{UW}}^{\mathrm{reg}} = \sum_{i \in \{s, w\}} \frac{1}{2\sigma_i^2} \mathcal{L}_i^{\mathrm{reg}} + \log \sigma_i,$$
where $\sigma_i > 0$ are trainable uncertainty scales for tasks $i \in \{s, w\}$.
Extreme modes are specified by binary labels $y_s^{\mathrm{ext}}, y_w^{\mathrm{ext}} \in \{0, 1\}$, formed according to fixed thresholds $\mathrm{thr}_s, \mathrm{thr}_w$ calculated on the training part at level $q = \mathrm{EXT\_Q}$ (in the experiments, $q = 0.90$). Probabilistic predictions $\hat{p}_s^{\mathrm{ext}}, \hat{p}_w^{\mathrm{ext}} \in (0, 1)$, obtained from sigmoid heads, were used for classification. The classification losses are given by the focal binary cross-entropy:
$$\mathcal{L}_i^{\mathrm{cls}} = \frac{1}{N} \sum_{t=1}^{N} \mathrm{FocalBCE}\big(y_i^{\mathrm{ext}}(t), \hat{p}_i^{\mathrm{ext}}(t)\big), \quad i \in \{s, w\}.$$
The total loss function is
$$\mathcal{L} = \mathcal{L}_{\mathrm{UW}}^{\mathrm{reg}} + \sum_{i \in \{s, w\}} \mathcal{L}_i^{\mathrm{cls}}.$$
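As a hedged numerical sketch of the uncertainty-weighted regression term (synthetic residuals; the scales $\sigma_s, \sigma_w$ are fixed numbers here, whereas in the model they are trainable parameters of LossScaleLayer):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
# Hypothetical targets for the solar (s) and wind (w) regression heads.
y = {k: rng.normal(size=N) for k in ("s", "w")}
# Predictions with different noise levels per task.
y_hat = {k: y[k] + rng.normal(scale=sc, size=N)
         for k, sc in (("s", 0.3), ("w", 0.6))}

# Per-task mean squared errors L_i^reg.
L_reg = {k: np.mean((y[k] - y_hat[k]) ** 2) for k in ("s", "w")}

# Uncertainty-weighted combination: sum_i L_i / (2 sigma_i^2) + log sigma_i.
sigma = {"s": 0.5, "w": 1.0}
L_uw = sum(L_reg[k] / (2 * sigma[k] ** 2) + np.log(sigma[k]) for k in ("s", "w"))
```

A larger $\sigma_i$ down-weights the noisier task's squared error while the $\log \sigma_i$ term penalizes inflating the scale indefinitely, which is what removes the need for hand-tuned loss coefficients.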

2.6. Training Procedure and Evaluation Protocol

All the models were trained using a consistent time-based validation protocol. The primary scenario employed an 80/20 hold-out split without shuffling, where training was conducted on the initial segment of the time series and evaluation was performed on the final segment. Furthermore, walk-forward cross-validation (TimeSeriesSplit) was implemented to assess the stability of the estimates across consecutive time folds.
The target values were shifted to the specified horizon $H$ (in the presented experiments, $H = 1$), and only valid window indices were utilized. All preprocessing parameters (including x_scaler, y_scaler, and the extreme thresholds) were computed exclusively on the training partition and subsequently applied to the hold-out/CV folds without re-estimation.
The performance of the regression forecasts was evaluated using RMSE, MAE, MAPE, $R^2$, and EVS; the effectiveness of extreme-event detection was assessed using ROC-AUC. Component-wise regression and classification metrics for the Solar and Wind targets are reported separately in Table 2.
To avoid ambiguity in metric interpretation, scale-dependent error measures (RMSE, MAE, MAPE) are computed and reported only after a consistent scale definition. When normalization is applied during training, final evaluation metrics are calculated after inverse transformation to the selected reporting scale.
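A sketch of this convention (synthetic data, standard-score normalization assumed): the model operates on the normalized scale, but scale-dependent errors are computed after mapping predictions back to the reporting scale.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.uniform(0, 5000, size=300)            # hypothetical power in raw units
mu, sd = y_true.mean(), y_true.std()

# The model works on the normalized scale ...
y_true_s = (y_true - mu) / sd
y_pred_s = y_true_s + rng.normal(scale=0.1, size=300)

# ... but RMSE/MAE are reported after inverse transformation.
y_pred = y_pred_s * sd + mu
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
mae = float(np.mean(np.abs(y_true - y_pred)))
```

Computing RMSE/MAE on the normalized residuals instead would yield numbers on a different scale, which is exactly the ambiguity the protocol above rules out.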
Results expressed in different physical scales (“raw” and “kW”) are not directly comparable for scale-dependent metrics. Therefore, such results are presented in separate tables. Cross-model interpretation across different scales relies exclusively on scale-independent metrics (R2, EVS, AUC) under the same data-splitting protocol.

3. Results

3.1. Hold-Out Evaluation Results

Before presenting the results, it should be noted that MAPE may become numerically unstable in renewable energy forecasting due to zero or near-zero generation values (e.g., nighttime solar production or low wind regimes). Division by small denominators can lead to inflated percentage errors. Therefore, in this study, MAPE is not used as a primary ranking metric. Model comparison and conclusions are primarily based on MASE and consistent absolute metrics (RMSE, MAE), while MAPE is reported only for completeness.
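The instability of MAPE near zero, and the robustness of MASE with a seasonal naive scale (lag = 24), can be demonstrated on a synthetic daily solar-like profile (the series and the constant 5-unit error are illustrative):

```python
import numpy as np

def mape(y, f):
    return float(np.mean(np.abs((y - f) / y)) * 100)

def mase(y, f, m=24):
    # Scale by the in-sample MAE of a seasonal naive forecast (lag = m).
    scale = np.mean(np.abs(y[m:] - y[:-m]))
    return float(np.mean(np.abs(y - f)) / scale)

# Hypothetical profile: near-zero at night, midday peak, slowly growing amplitude.
t = np.arange(24 * 10)
y = np.clip(np.sin((t % 24 - 6) / 12 * np.pi), 0, None) * (800 + 40 * (t // 24)) + 1e-3
f = y + 5.0                                        # small constant forecast error
```

Although the absolute error is a uniform 5 units, the near-zero nighttime denominators inflate MAPE by orders of magnitude, while MASE remains a small, interpretable ratio against the seasonal naive benchmark.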
Table 2 presents the quality metrics for solar generation (Solar) forecasting on an 80/20 hold-out set. The proposed DP-STH++ model exhibited superior regression performance compared to the other considered architectures. It achieved a minimum error of RMSE = 524.27 and MAE = 287.90, along with maximum values of R2 = 0.7556 and EVS = 0.7556, indicating a substantial proportion of explained variability and strong alignment of predictions with the observed dynamics. DP-STH++ also demonstrated the lowest MAPE value; however, this metric is reported for completeness and is not used for primary ranking due to its instability near zero values.
Concurrently, MTL_GRU demonstrated the best performance in solar extreme detection on the hold-out set (AUC = 0.9746), whereas DP-STH++ achieved an AUC of 0.9547. This suggests that the primary advantage of DP-STH++ for Solar energy lies in its accuracy in continuous prediction rather than in maximizing the AUC classification metric.
Table 3 presents the results for wind generation (Wind). Here, DP-STH++ leads on the key regression metrics (RMSE = 241.77, MAE = 174.86) as well as on the agreement measures $R^2$ = 0.9527 and EVS = 0.9529. In detecting wind extremes, DP-STH++ also achieves the highest accuracy (AUC = 0.9908), outperforming its closest competitor, MTL_LSTM_GRU_STT (0.9902).
It is important to note that the MAPE values for Wind are notably high (thousands and tens of thousands of percent) across all models. This is a common artifact of the MAPE when dealing with periods of near-zero power generation. Consequently, RMSE, MAE, R2, EVS, and MASE offer more reliable performance indicators for Wind, with MAPE serving only as a supplementary descriptive metric.
Component-wise regression and classification metrics for Solar and Wind are reported separately to avoid cross-target generalization. Although trained within a single unified protocol, the targets exhibit different statistical properties, resulting in different absolute metric values. This confirms the necessity of target-specific interpretation. The detailed component-wise results for both targets are summarized in Table 4.

3.2. Cross-Validation Results

Table 5 presents the final model stability and quality indicators obtained during cross-validation for the solar channel. These indicators include absolute errors (RMSE and MAE), relative error (MAPE), consistency and explained variance ($R^2$, EVS), and the quality of extreme event detection (AUC). This table facilitates the evaluation of both the average accuracy level and the dispersion of results across folds, which is crucial for assessing reproducibility.
The data presented in the table allow for several key observations.
  • Model Leadership on a Single Scale (kW) Based on Regression Metrics: Among the MTL variants operating on the kW scale, MTL_GRU exhibited the most balanced performance profile. It achieved the lowest average RMSE and MAE within this group, coupled with the highest R2 and EVS values. This suggests that, in this specific setting, the GRU encoder offers superior generalization capabilities compared to other “pure” recurrent MTL baselines.
  • Error Stability and Characteristics: The standard deviations reflect the sensitivity of each model to the data partitioning. For instance, STT displayed relatively moderate variability; however, it did not surpass MTL_GRU in terms of absolute error reduction. Conversely, MTL_CNN exhibited the greatest instability and poorest average accuracy (with a confidence interval that included negative values), indicating a potential risk of diminished performance across different folds.
  • Extreme Event Prediction (AUC) as a Distinct Quality Metric: Transformer/hybrid solutions excelled in terms of AUC. MTL_LSTM_GRU_STT demonstrated the highest average AUC, and STT also performed strongly within the distribution. This aligns with the understanding that attention mechanisms are often more effective in extracting subtle patterns crucial for classifying rare events, even if regression errors are not minimized.
  • To avoid ambiguity caused by mixed reporting scales, cross-validation results are presented separately for kW and raw scales. Scale-dependent metrics (RMSE, MAE, MAPE) are compared only within identical physical units, while R2, EVS, and AUC remain cross-scale interpretable under the same splitting protocol.
Table 6 presents a comparative analysis of the models under walk-forward cross-validation conditions for the wind channel. This analysis evaluated the regression accuracy (RMSE, MAE), relative errors (MAPE), forecast consistency with observations (EVS), and accuracy of extreme mode recognition via a probability head (AUC). This presentation format is valuable because an algorithm may demonstrate acceptable average error but exhibit instability across different folds, a characteristic reflected in the width of the 90% confidence interval.
The key conclusions drawn from this table are as follows:
  • Regression Quality and Stability: DP-STH++ and MTL_GRU exhibited the strongest wind regression profiles. DP-STH++ demonstrated the lowest average RMSE and MAE among the solutions presented, coupled with high nRMSE and EVS values, indicating a strong reproduction of wind generation variability. MTL_GRU also presents robust regression metrics, accompanied by a relatively narrow confidence interval for nRMSE, suggesting a consistent performance across different folds in this experiment.
  • Extreme Value Detection Quality: DP-STH++ achieved the highest average AUC (0.9555), demonstrating its ability to effectively differentiate between peak and non-peak wind generation values. MTL_LSTM_GRU_STT also exhibited a high AUC (0.9249), confirming the utility of hybridization/attention mechanisms for the classification component, even with some variability in the regression errors.
  • Behavior of Relative Metrics on Wind (MAPE): The MAPE metric for the wind channel exhibited high values and wide intervals across many rows. This is a common characteristic of relative metrics when dealing with small or near-zero values, where the denominator can destabilize an estimate. Therefore, when interpreting wind results, emphasis should be placed on RMSE/MAE and nRMSE/EVS, with MAPE considered an auxiliary indicator sensitive to low power levels.
  • Weak and Inadequate Baseline Benchmarks for Wind: Seasonal Naive (lag = 24) demonstrated negative nRMSE and EVS, indicating that it reproduces wind dynamics worse than a trivial constant line at the average level in the corresponding divisions. Similarly, MTL_CNN and STT exhibited decreased consistency (wide intervals and low average nRMSE/EVS), suggesting insufficient stability of these configurations within the current training protocol and data volume.
Table 7 compares MAPE and MASE values specifically in low-generation regimes (actual generation below 5% of the observed maximum). MAPE shows substantial inflation and instability, particularly for wind power and nighttime solar periods, justifying the use of MASE as the primary robust metric for model ranking and conclusions.
In summary, the results indicate that a combination of stable regression quality and reliable extreme value detection is crucial for wind-generation modeling. This combination is most evident in DP-STH++, whereas other architectures either lack stability across folds or do not provide sufficient accuracy in terms of basic error metrics.

3.2.1. Extreme-Event Threshold Sensitivity Analysis

The definition of extreme events in the main experiments is based on a quantile threshold q = 0.90 computed on the training subset and applied to the test data. To address concerns regarding the arbitrariness of this choice, a sensitivity analysis was conducted for q ∈ {0.85, 0.90, 0.95}.
Under q = 0.90, the proportion of positive extreme events on the test set is approximately 3.7% for Solar and 6.6% for Wind. This corresponds to rare but sufficiently represented events, enabling statistically stable estimation of PR-based metrics. The quantitative results of the threshold sensitivity analysis are presented in Table 8.
The results indicate that:
  • For Solar, q = 0.95 leads to excessive sparsity, significantly degrading PR-AUC and F1.
  • For Wind, q = 0.90 provides the best balance between Precision and Recall.
  • Lower thresholds (q = 0.85) increase event frequency but reduce extremeness.
These findings demonstrate that q = 0.90 represents a balanced operating point rather than an arbitrary choice. At the same time, the optimal quantile may depend on the risk tolerance and application context.
Increasing the quantile threshold increases event sparsity. While AUC remains relatively stable, PR-AUC and F1 degrade for Solar at q = 0.95 due to extreme class imbalance. The threshold q = 0.90 provides a balanced operating point across targets.
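The train-only sensitivity sweep described in this subsection can be sketched as follows (synthetic gamma-distributed generation values stand in for the real series; only the sparsity effect is illustrated, not the PR-AUC/F1 results of Table 8):

```python
import numpy as np

rng = np.random.default_rng(3)
y_train = rng.gamma(2.0, 50.0, size=7000)   # hypothetical training generation
y_test = rng.gamma(2.0, 50.0, size=1750)    # hypothetical hold-out generation

# Higher quantiles make the positive (extreme) class sparser on the test set.
rates = {}
for q in (0.85, 0.90, 0.95):
    thr = np.quantile(y_train, q)   # train-only threshold, frozen for the test set
    rates[q] = float((y_test > thr).mean())
```

The monotone drop in positive rate with increasing q is what drives the PR-AUC/F1 degradation at q = 0.95 reported above: the metrics themselves become harder to estimate stably as positives become rarer.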

3.2.2. Extreme-Event Detection Performance (Imbalance-Aware Evaluation)

Given the class imbalance inherent in extreme-event detection, ROC-AUC alone may be insufficient for comprehensive evaluation. For q = 0.90, the positive class proportion in the test set is approximately 3.7% for Solar and 6.6% for Wind. Under such an imbalance, PR-AUC, Precision, Recall, F1-score, and confusion matrices provide a more informative assessment. The sensitivity of extreme-event detection metrics to the quantile threshold is illustrated in Figure 3.
Table 9 reports extended classification metrics for DP-STH++ and representative baselines.
The results show that for Solar, DP-STH++ operates in a high-recall regime (Recall = 0.9375), minimizing missed extreme events at the cost of lower Precision. In contrast, DLinear_proxy achieves higher Precision but lower Recall.
For Wind, both models demonstrate balanced detection performance, with DLinear_proxy slightly outperforming in PR-AUC and F1, indicating fewer classification errors under the current threshold.
Figure 4 presents the confusion matrices for Solar and Wind at q = 0.90.
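The imbalance-aware evaluation used in this subsection can be sketched with scikit-learn (the labels and scores below are synthetic, tuned only to mimic the ~6.6% Wind positive rate; they do not reproduce the paper's numbers):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)

rng = np.random.default_rng(0)
n = 1750
y_ext = (rng.random(n) < 0.066).astype(int)        # ~6.6% positives, as for Wind
# Hypothetical probabilistic head: informative but imperfect scores.
p_hat = np.clip(0.6 * y_ext + rng.normal(0.2, 0.15, n), 0, 1)
y_pred = (p_hat >= 0.5).astype(int)

metrics = {
    "roc_auc": roc_auc_score(y_ext, p_hat),
    "pr_auc": average_precision_score(y_ext, p_hat),   # PR-AUC for imbalance
    "precision": precision_score(y_ext, y_pred),
    "recall": recall_score(y_ext, y_pred),
    "f1": f1_score(y_ext, y_pred),
}
cm = confusion_matrix(y_ext, y_pred)               # rows: true, cols: predicted
```

With only a few percent positives, ROC-AUC can stay high while precision drops, which is why PR-AUC, F1, and the confusion matrix are reported alongside it.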

3.3. Distributional Analysis

Figure 5 illustrates the temporal dynamics of solar generation, with three distinct series displayed on the raw scale: actual values for the training interval (“Training (true)”), actual values for the holdout interval (“Holdout (true)”), and the model forecast (“Forecast”).
This visualization enables a qualitative assessment of the model’s ability to preserve the daily signal structure and transfer learned patterns from the training period to the subsequent period without temporal discontinuity.
The graph depicts a typical daily profile of solar power generation, characterized by near-zero values during the night and peaks during the day. The forecast curve for the holdout interval generally reproduces:
  • The phase and duration of the daytime generation window (the temporal position of the rise and fall relative to the actual curve).
  • The amplitude of the main peaks and the overall shape of the daytime “dome”, which are particularly important for RMSE/MAE-based evaluation in tasks with pronounced daily cyclicality.
  • Behavior near zero levels, where relative distortions can occur due to small denominators (a common source of instability in MAPE and similar metrics during nighttime segments).
However, local discrepancies in the height and/or steepness of the fronts are noticeable in some peak areas, indicating the ongoing challenge of accurately reproducing extreme and rapidly changing modes, even with a correctly captured seasonal structure. Overall, visual inspection confirms that DP-STH++ retains both the shape of the daily pattern and the generation levels on the lagged segment, which aligns with the quantitative performance metrics presented previously.
Figure 6 illustrates the dynamics of wind generation on the raw scale, segmented into three trajectories: actual values on the training section (“Training (true)”), actual values on the holdout section (“Holdout (true)”), and the model forecast (“Forecast”).
In contrast to the solar series, the wind process exhibits distinct non-stationarity, characterized by sharp fronts, short pulses, and episodes of power reduction to low values. Consequently, this visualization is particularly useful for assessing forecast quality based on pattern fidelity rather than solely relying on integral metrics.
The forecast curve for the holdout interval demonstrates general consistency with the observed data in key aspects.
  • Accurate reproduction of the temporal position of major rises and falls.
  • Maintenance of large peak levels and their ordering, which is critical for practical reserve-management scenarios.
  • Adequate reflection of transitions to low values, although local shifts and smoothing occur in some areas.
The remaining discrepancies were concentrated in areas with the most abrupt changes. In some instances, the model smooths out steep fronts or slightly underestimates/overestimates the amplitude of short-term spikes, which is a common characteristic when forecasting wind series with a fixed context window. Overall, the graph confirms that DP-STH++ effectively transfers patterns from the training segment to the deferred segment, maintaining stable forecasts, even with high variability in wind generation.
In addition to scalar metrics, temporal alignment between predicted and observed profiles (Figure 7) confirms that the model preserves dynamic behavior across peak and low-generation intervals.

3.4. Summary of Quantitative Results

The quantitative results are summarized below.
  • DP-STH++ demonstrated the highest-quality regression prediction on the hold-out data (80/20 split) at H = 1 and L = 24. For solar power forecasting, it achieved minimum errors (RMSE = 524.27, MAE = 287.90) and maximum consistency (R2 = 0.7556, EVS = 0.7556). For wind power forecasting, it recorded the best values (RMSE = 241.77, MAE = 174.86, R2 = 0.9527, EVS = 0.9529).
  • The multi-task approach provides practically significant detection of extreme modes. DP-STH++ exhibited the maximum AUC of 0.9908 on the hold-out data for wind generation. For solar generation, the model’s AUC is 0.9547, confirming high detection quality, although MTL_GRU achieves the highest solar AUC (0.9746).
  • Temporal cross-validation estimates confirmed the reproducibility of the results. They also demonstrated that comparisons of absolute errors between models reported on different scales (kW vs. raw) require careful consideration; scale-invariant indicators (R2, EVS, and AUC) remain directly comparable.
  • The advantage of DP-STH++ is most evident for wind generation, where it achieves both high regression accuracy and the best AUC. For solar generation, the model’s advantage is primarily manifested in a reduction in absolute errors (RMSE/MAE) at a competitive but not maximum AUC level.
Sensitivity to Forecast Horizon and Window Length.
To provide a detailed numerical comparison, model sensitivity was evaluated under varying forecast horizons H ∈ {1, 3, 6} (with L = 24 fixed) and varying window lengths L ∈ {12, 24, 48} (with H = 1 fixed), while keeping all other parameters unchanged.
The comparative results of the models are presented in Table 10.
Performance degrades monotonically as H increases: RMSE and MASE rise, while AUC decreases. The sensitivity of the model to different window lengths is summarized in Table 11.
Within the tested range, the best performance is observed at L = 12, while larger windows lead to gradual degradation under the current dataset size.
Figure 8 provides quantitative evidence of model sensitivity to forecast horizon and context length under a fixed leakage-safe protocol. The monotonic increase in RMSE_mean with growing H confirms expected degradation, while window variation reveals data-regime-dependent behavior.
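The H/L sensitivity sweep reduces to regenerating the supervised windows for each setting. A minimal sketch of the causal windowing (illustrative Python on a univariate toy series; the real pipeline uses 12 input features) is:

```python
import numpy as np

def make_windows(series, L, H):
    """Supervised (X, y) pairs: X = last L values, y = the value H steps ahead (causal)."""
    X, y = [], []
    for t in range(L, len(series) - H + 1):
        X.append(series[t - L:t])        # context window ends strictly before the target
        y.append(series[t + H - 1])      # target H steps beyond the window
    return np.asarray(X), np.asarray(y)

series = np.arange(100.0)                # toy stand-in for an hourly power series
for H in (1, 3, 6):                      # horizons tested with L = 24 fixed
    X, y = make_windows(series, L=24, H=H)
    print(H, X.shape, y.shape)
for L in (12, 24, 48):                   # window lengths tested with H = 1 fixed
    X, y = make_windows(series, L=L, H=1)
    print(L, X.shape)
```

Note that larger H and larger L both shrink the number of usable samples, one reason longer windows can underperform on a dataset of limited size.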

3.5. Ablation Study of Architectural Branch Contributions

To quantify the contribution of each architectural branch, we conducted a systematic ablation study by removing one component at a time (LSTM, GRU, TCN, Transformer) while keeping all other elements and training settings unchanged.
Table 12 reports regression (RMSE, MASE) and classification (AUC, PR-AUC, F1) metrics averaged across targets.
The results demonstrate that removing the LSTM branch leads to the largest increase in RMSE (321.43), confirming the importance of long-term memory modeling for renewable energy forecasting.
Interestingly, some reduced configurations (e.g., without GRU or Transformer) achieve lower regression errors under the current dataset size. This suggests possible over-parameterization or gradient interaction effects in multitask training when model capacity increases.
These findings indicate that DP-STH++ should be interpreted as a configurable architectural family rather than a fixed, universally optimal configuration. The ablation results provide measurable evidence of branch contributions and clarify the practical trade-offs between model complexity, stability, and accuracy under limited data conditions.

3.6. Extended Benchmark Comparison

To address concerns regarding the completeness of baseline comparison, we extended the experimental section with additional benchmark families representing diverse architectural paradigms. These include linear decomposition models (DLinear family), patch-based transformer-style models (PatchTST proxy), nonlinear basis-function architectures (KAN proxy), and residual deep architectures (N-BEATS proxy).
All benchmark models were evaluated under an identical experimental protocol: leakage-safe preprocessing, identical chronological splits (80/20 hold-out), window length L = 24, forecast horizon H = 1, and identical evaluation metrics. This ensures that the comparison is methodologically consistent and reproducible.
All proxy benchmark models reported in Table 13 were trained under the same experimental protocol as the proposed architecture and the main baselines. Specifically, an identical chronological 80/20 hold-out split was used, with context window L = 24 and the same 12 input features. Input scaling was fitted exclusively on the training subset and applied to the test subset without leakage. For regression evaluation, predictions were produced in the raw scale, consistent with the primary protocol. Extreme-event thresholds (quantile-based) were computed on the training set and then applied to the hold-out data.
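The train-only fitting of scaling parameters described above can be summarized as follows (illustrative NumPy sketch with random data standing in for the 12-feature matrix; not the authors' code):

```python
import numpy as np

def fit_scaler(train):
    """Estimate standardization parameters on the training split ONLY."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # guard against constant features
    return mu, sigma

def transform(x, mu, sigma):
    return (x - mu) / sigma

rng = np.random.default_rng(1)
data = rng.normal(size=(8752, 12))             # ~one year of hourly rows, 12 features
split = int(0.8 * len(data))                   # chronological 80/20 split, no shuffling
train, test = data[:split], data[split:]
mu, sigma = fit_scaler(train)                  # test statistics never influence scaling
train_z, test_z = transform(train, mu, sigma), transform(test, mu, sigma)
```

The same pattern applies to the quantile thresholds: both are estimated on the first 80% of the timeline and frozen before touching the hold-out rows.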
The purpose of these proxy models is not to redefine the overall ranking but to position the proposed architecture within diverse methodological families and to provide additional diagnostic insight under identical evaluation conditions.
The results indicate that strong linear baselines remain highly competitive in short-horizon forecasting, achieving low regression error in certain settings. Nonlinear basis-function models demonstrate strong classification performance in extreme-event detection.
The proposed DP-STH++ model does not dominate across all individual metrics; however, it provides a balanced trade-off between regression accuracy and extreme-event detection within a unified multitask framework. These findings suggest that increased architectural complexity does not automatically guarantee superiority, particularly under limited data regimes, and that model selection should consider both performance and task objectives.

3.7. Robustness and Generalization Analysis

The dataset used in this study consists of 8752 hourly observations (approximately one year), which represents a limited data regime for hybrid deep learning architectures. To avoid overstated claims and assess result stability, additional robustness analyses were conducted.
Rolling TimeSeries cross-validation (TS-CV) was applied to evaluate temporal transferability across sequential folds. In addition, a train-fraction sensitivity analysis was performed to examine performance behavior under reduced training data proportions. Finally, bootstrap-based statistical testing was conducted for key regression comparisons.
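The rolling TS-CV scheme can be sketched with expanding-window folds (illustrative Python; the fold count and sizing rule are our assumptions, chosen only to show that every validation block strictly follows its training data):

```python
def time_series_folds(n, n_folds=5):
    """Expanding-window folds: each validation block follows all of its training data."""
    fold_size = n // (n_folds + 1)
    folds = []
    for k in range(1, n_folds + 1):
        train_idx = range(0, k * fold_size)                       # grows each fold
        val_idx = range(k * fold_size, min((k + 1) * fold_size, n))
        folds.append((train_idx, val_idx))
    return folds

for train_idx, val_idx in time_series_folds(8752, n_folds=5):
    # chronological order is preserved: no validation point precedes training data
    print(len(train_idx), len(val_idx))
```

Unlike shuffled k-fold, no fold ever trains on data that postdates its validation block, which is what makes the scheme compatible with the leakage-safe protocol.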
The aggregated robustness results are summarized in Table 14.
The robustness analysis confirms that conclusions are restricted to the evaluated dataset and protocol. Strong linear baselines remain competitive under limited data conditions, highlighting the importance of extended multi-dataset validation in future work.

3.7.1. Consistency Audit and Reproducibility Checks

Given the presence of multiple evaluation regimes (hold-out, rolling TimeSeries cross-validation, ablation averages, and sensitivity analyses), numerical differences between tables may arise due to distinct splits and aggregation procedures.
To ensure internal consistency, all summary tables were regenerated from a single aggregated evaluation source. No manual editing of final metric values was performed. Automated checks verified:
  • absence of duplicate model–target rows,
  • reproducibility of reported summary metrics from raw evaluation outputs,
  • alignment between protocol definitions and table captions.
Differences in RMSE or related metrics across tables, therefore, reflect differences in evaluation protocol (e.g., fixed hold-out vs. rolling cross-validation vs. averaged ablation results) rather than reporting inconsistencies.
These measures ensure reproducibility and internal coherence of the reported experimental results.

3.7.2. Statistical Significance Analysis

To avoid conclusions based solely on mean metrics, paired significance testing was conducted on identical test timestamps. We report bootstrap 95% confidence intervals for mean error differences and paired permutation test p-values. DP-STH++ improves over MTL_CNN significantly for both Solar and Wind (p < 0.001). In comparison with DLinear_proxy, differences are not significant for Solar (p > 0.05), while DLinear_proxy is significantly better for Wind (p < 0.001). The results of the paired statistical significance tests are summarized in Table 15.
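The paired testing procedure admits a compact sketch (illustrative NumPy on synthetic per-timestamp errors; the resample counts are illustrative choices, and the sign-flip permutation is one standard variant of a paired test):

```python
import numpy as np

def paired_tests(err_a, err_b, n_boot=2000, n_perm=2000, seed=0):
    """Bootstrap 95% CI for the mean error difference plus a paired permutation p-value."""
    rng = np.random.default_rng(seed)
    d = err_a - err_b                                        # per-timestamp paired differences
    idx = rng.integers(0, len(d), size=(n_boot, len(d)))     # resample timestamps with replacement
    boot_means = d[idx].mean(axis=1)
    ci = np.percentile(boot_means, [2.5, 97.5])
    signs = rng.choice([-1.0, 1.0], size=(n_perm, len(d)))   # null: differences are sign-symmetric
    perm_means = (signs * d).mean(axis=1)
    p = np.mean(np.abs(perm_means) >= abs(d.mean()))
    return ci, p

rng = np.random.default_rng(42)
base = np.abs(rng.normal(size=500))       # synthetic baseline absolute errors
err_model = base * 0.8                    # synthetic: model errors uniformly 20% lower
ci, p = paired_tests(err_model, base)
print(ci, p)                              # CI entirely below zero, very small p
```

A confidence interval that excludes zero together with a small permutation p-value corresponds to the "significant" entries reported in Table 15.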

3.7.3. Computational Cost Analysis

DP-STH++ has the largest parameter count and computational complexity among the evaluated deep models, resulting in higher training time and inference latency. This trade-off is explicitly acknowledged: DP-STH++ targets improved predictive quality and extreme-event detection rather than computational efficiency. The comparative computational cost profile of the evaluated models is presented in Figure 9.

4. Discussion

The results demonstrate the efficacy of the DP-STH++ hybrid spatio-temporal architecture as a solution for multi-task short-term forecasting of hourly solar and wind generation under conditions of significant non-stationarity and the presence of infrequent extreme events. By employing a unified, leakage-safe protocol and consistent experimental parameters (H = 1, L = 24), DP-STH++ exhibited superior regression performance for both the solar and wind channels on the hold-out sample. Specifically, for solar generation, the minimum RMSE and MAE values were observed concurrently with the maximum R2/EVS. For wind generation, optimal values were attained across all key regression metrics, coupled with maximum accuracy in extreme value detection, as measured by the AUC. This performance profile indicates that the proposed architecture effectively integrates the capacity to reproduce continuous power dynamics and to reliably identify peak generation modes.
The architectural advantage of DP-STH++ stems from the complementarity of parallel causal branches. Recurrent components (LSTM and GRU) facilitate the stable modeling of short- and medium-term dependencies, whereas the TCN branch enhances the extraction of local and multi-scale patterns in a causally consistent manner. The lightweight causal transformer increases the sensitivity to a more extended context. The aggregation of representations via pooling and subsequent concatenation in the fusion block allows for the summation of diverse inductive biases without compromising causality, which is a crucial feature when processing time-series data characterized by the simultaneous presence of regular cycles and short-term anomalous spikes.
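The fusion-by-concatenation step described above can be illustrated schematically; in the sketch below, random linear projections stand in for the trained LSTM/GRU/TCN/transformer encoders, and all dimensions are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(7)
L, n_feat = 24, 12                       # context window and input features
x = rng.normal(size=(L, n_feat))         # one input window

# Stand-ins for the branch encoders (LSTM / GRU / causal Conv1D / transformer):
# each maps the window to a (L, d_branch) sequence of hidden states.
branch_dims = {"lstm": 32, "gru": 32, "tcn": 16, "transformer": 16}
states = {name: x @ rng.normal(size=(n_feat, d)) for name, d in branch_dims.items()}

# Fusion: pool each branch over time, then concatenate the pooled vectors.
pooled = [s.mean(axis=0) for s in states.values()]
fused = np.concatenate(pooled)           # shape: (32 + 32 + 16 + 16,) = (96,)

# Separate task heads operate on the shared fused representation.
w_reg = rng.normal(size=(fused.size, 2))  # regression: solar and wind power
w_cls = rng.normal(size=(fused.size, 2))  # classification: extreme-event logits
power_pred = fused @ w_reg
extreme_prob = 1.0 / (1.0 + np.exp(-(fused @ w_cls)))
print(fused.shape, power_pred.shape, extreme_prob.shape)
```

The point of the sketch is structural: each branch contributes its own inductive bias to a shared vector, and the regression and classification heads read from that single fused representation.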
A comparative analysis of the models revealed a distinction between the solar and wind channels regarding the nature of the achievable gains. For solar generation, DP-STH++ yielded the most significant improvement in regression accuracy on the holdout set, whereas the simpler MTL_GRU configuration excelled in AUC. This suggests a practical trade-off: maximizing the quality of continuous forecasting and maximizing purely classification metrics of extremes are not always equally achievable by a single architecture, particularly given the pronounced nighttime intervals with near-zero values. For wind generation, the advantage of DP-STH++ was most pronounced, simultaneously improving the regression errors and providing the maximum AUC, which is consistent with the higher stochasticity of the wind process and the reduced effectiveness of purely seasonal heuristics.
The influence of metric selection requires careful consideration. For wind generation, the MAPE values are often inflated and exhibit high variability owing to periods of low power output, leading to division by near-zero values. Consequently, the assessment of model performance for wind generation should prioritize the RMSE/MAE and consistency metrics (R2, EVS), as well as the AUC for extreme operating modes. MAPE should be considered an auxiliary metric because of its potential instability.
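The contrast between MAPE and a scale-free alternative such as MASE can be made concrete (illustrative NumPy with a synthetic day-night profile; the seasonal-naive denominator uses lag m = 24, matching the hourly daily cycle):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error; unstable when y_true approaches zero."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def mase(y_true, y_pred, y_train, m=24):
    """MASE with a seasonal-naive (lag-m) denominator estimated on training data."""
    naive_err = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / naive_err

# Near-zero night-time values blow up MAPE while MASE stays interpretable:
rng = np.random.default_rng(3)
day = np.r_[np.full(12, 0.1), np.linspace(100.0, 800.0, 12)]  # 12 night + 12 day hours
y_train = np.tile(day, 30) + np.abs(rng.normal(scale=30.0, size=720))
y_true = day
y_pred = day + 5.0                      # constant 5-unit error at every hour
print(mape(y_true, y_pred))             # huge: dominated by the 0.1-valued night hours
print(mase(y_true, y_pred, y_train))    # scale-free and well below 1
```

The same constant error yields a MAPE in the thousands of percent but a small MASE, which is why the wind-channel conclusions rely on RMSE/MAE, MASE, R2/EVS, and AUC rather than MAPE.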
The cross-validation results indicate that the model stability is significantly influenced by the architectural class. Certain solutions exhibited a wide range of indicator values across folds, with R2 values falling into low or even negative ranges (e.g., for some STT/MTL_CNN configurations in the wind task). In contrast, DP-STH++ maintains competitive scale-invariant indicators (R2/EVS/AUC), which are crucial for ensuring the reproducibility of the findings across different time divisions. It is important to note that the tables include rows calculated on different scales (kW vs. raw values); therefore, direct comparisons of absolute RMSE/MAE are only valid within the same scale, whereas R2/EVS/AUC remain comparable regardless of the scale.
A key methodological aspect is the rigorous implementation of a leakage-safe experimental design. All preprocessing parameters and extreme thresholds were determined solely from the training dataset and subsequently fixed for the hold-out and cross-validation folds. This approach minimizes the risk of overestimating the model performance and allows us to attribute the advantages of DP-STH++ to its architecture and training procedure, rather than to artifacts resulting from information leakage.
The limitations of this study should be interpreted with caution. First, the results were obtained using a single combined dataset that reflected the characteristics of a specific set of observations. The transferability to other climatic zones and generation modes requires independent validation. Second, a forecast horizon of H = 1 with a window of L = 24 was considered, and alternative horizons may alter the balance between the local and distant dependencies. Third, the use of relative metrics (MAPE) for series with frequent small values limits the interpretability of these metrics, necessitating that the conclusions be supported by absolute error and consistency metrics.
Results are obtained on a single dataset with verified seasonal coverage (Figure 10), limiting direct generalization to other climates or regions. The fixed one-hour horizon and 24 h context may not capture all dynamics at longer scales. MAPE instability in low-generation periods motivates reliance on robust metrics (MASE, AUC). Future validation on diverse datasets is recommended.
In addition to predictive performance, computational efficiency is essential for real-world deployment. Figure 11 compares training time, inference latency, parameter count, and FLOPs across the evaluated models. DP-STH++ has a higher computational cost than simpler architectures due to its hybrid design but remains lighter than full transformer models while achieving superior or competitive predictive performance and improved extreme-event detection. This reflects a practical trade-off between forecasting accuracy and computational cost.
Overall, the results suggest that, for the setup under consideration, practical advantages arise from a combination of (i) parallel hybridization of causal encoders, (ii) multi-task optimization of regression and extrema prediction, and (iii) a strict leakage-exclusion protocol. Although certain simpler architectures (e.g., MTL_GRU) occasionally show marginally higher AUC for solar generation, DP-STH++ consistently ranks 1st across both hold-out and time-series cross-validation protocols when evaluated by robust aggregated metrics. On hold-out, it achieves RMSE = 257.18, MASE = 0.2438 and AUC = 0.9896; in CV, the model retains rank 1 with mean MASE = 0.3883. This confirms that the proposed hybrid causal fusion of LSTM, GRU, Conv1D and lightweight transformer provides stable advantages, particularly valuable under the high non-stationarity conditions typical of wind power series. Thus, within the rigorous leakage-safe protocol applied, the superiority of DP-STH++ over the compared baselines is consistently supported.

5. Conclusions

This study introduces and experimentally validates DP-STH++, a multi-task hybrid model designed for the short-term forecasting of hourly solar and wind generation at a forecast horizon of H = 1 and a context window of L = 24. The proposed approach integrates the causal branches of LSTM, GRU, TCN, and a lightweight causal transformer, employing a common representation fusion and four output heads: two for regression (solar/wind) and two for classification, aimed at detecting extreme modes defined by the 90th percentile of the training sample.
Experimental results on a hold-out set (80/20 split) demonstrate that DP-STH++ achieves superior regression performance for both target series. Specifically, for solar generation, the model attained minimum RMSE and MAE values, alongside maximum R2/EVS, signifying enhanced accuracy in capturing daily dynamics and a greater proportion of explained variability. For wind generation, DP-STH++ exhibited the best regression performance and maximum AUC for extreme value detection, underscoring the effectiveness of the architecture in a more non-stationary and volatile environment. Concurrently, MTL_GRU leads in AUC for Solar, highlighting the distinction between continuous forecast optimization and maximization of classification metrics at peak values.
In summary, DP-STH++, a hybrid spatio-temporal multitask model, achieves leading performance among the evaluated baselines under the considered experimental protocol for both solar and wind power generation (H = 1, L = 24). On the hold-out set, DP-STH++ ranks 1st among the compared models with RMSE = 257.18, MASE = 0.2438, R2 = 0.9440 and AUC = 0.9896. The model maintains leadership in rigorous time-series cross-validation (mean MASE = 0.3883, rank 1), confirming the robustness of the obtained results. The most significant gains are observed for the wind channel (RMSE = 258.85, MASE = 0.1631, AUC = 0.9880–0.9908), where the hybrid causal architecture effectively handles high non-stationarity and sharp peaks. These results, obtained under a strict leakage-safe protocol with train-only parameter estimation, support the practical applicability of DP-STH++ as a unified solution for accurate baseline forecasting and reliable extreme event detection in renewable energy systems.
The results are obtained on a single dataset (8752 hourly observations), which limits direct generalization. Future validation on additional multi-year and multi-region datasets is required.
The temporal cross-validation results corroborated the reproducibility of the obtained estimates and revealed that stability was significantly influenced by the architecture class. Seasonally naive forecasting proves unsuitable for wind series (as indicated by negative consistency indicators), whereas certain STT/MTL_CNN configurations exhibit instability across folds. In contrast, DP-STH++ maintains competitive scale-invariant quality indicators and demonstrates a practically significant ability to identify extreme modes, particularly for wind generation.
The scientific and methodological significance of this study lies in the implementation of DP-STH++ within a strictly leak-resistant pipeline. All preprocessing parameters and extreme thresholds were evaluated exclusively on the training data and subsequently fixed for application to the hold-out set and cross-validation folds. This rigorous protocol enhances the accuracy of comparisons and the reproducibility of the results when evaluating multitask models for time-series forecasting.
From a practical standpoint, DP-STH++ can be used as a unified tool that simultaneously delivers accurate short-term power forecasts and risk-oriented signaling of extreme modes. This capability aligns with the operational management requirements of power systems with a high proportion of renewable energy sources, where both minimizing base forecast errors and timely detection of potentially critical generation peaks/troughs are essential.

Author Contributions

Conceptualization, M.K. and A.Z.; Methodology, G.T.; Software, G.T.; Validation, G.T. and M.K.; Formal analysis, G.T.; Investigation, G.T.; Resources, Z.A.; Data curation, G.T. and Z.A.; Writing—original draft preparation, G.T. and Z.A.; Writing—review and editing, G.T. and Z.A.; Visualization, G.T.; Supervision, A.Z. and M.K.; Funding acquisition, G.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available in Mendeley Data at https://doi.org/10.17632/gxc6j5btrx.1 (accessed on 12 February 2026). The dataset, titled “Wind and Solar Power Generation Dataset” (v1), was published on 10 October 2024 by Yue Liu and is distributed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Acknowledgments

The authors used ChatGPT 5.2 (OpenAI) and Grammarly Premium to improve the clarity and readability of the manuscript. All AI-assisted content was carefully reviewed and edited by the authors, who take full responsibility for the final version of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MTL – Multi-Task Learning
RNN – Recurrent Neural Network
LSTM – Long Short-Term Memory
GRU – Gated Recurrent Unit
CNN – Convolutional Neural Network
TCN – Temporal Convolutional Network
STT – Spatio-Temporal Transformer
AUC – Area Under the Curve
RMSE – Root Mean Squared Error
MAE – Mean Absolute Error
EVS – Explained Variance Score
EXT_Q – Quantile threshold for extreme event definition

References

  1. Weron, R. Electricity Price Forecasting: A Review of the State-of-the-Art with a Look into the Future. Int. J. Forecast. 2014, 30, 1030–1081.
  2. Hong, T.; Fan, S. Probabilistic Electric Load Forecasting: A Tutorial Review. Int. J. Forecast. 2016, 32, 914–938.
  3. Voyant, C.; Notton, G.; Kalogirou, S.; Nivet, M.-L.; Paoli, C.; Motte, F.; Fouilloy, A. Machine Learning Methods for Solar Radiation Forecasting: A Review. Renew. Energy 2017, 105, 569–582.
  4. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are Transformers Effective for Time Series Forecasting? arXiv 2022, arXiv:2205.13504.
  5. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-Term Series Forecasting. arXiv 2022, arXiv:2201.12740.
  6. Su, L.; Zuo, X.; Li, R.; Wang, X.; Zhao, H.; Huang, B. A Systematic Review for Transformer-Based Long-Term Series Forecasting. Artif. Intell. Rev. 2025, 58, 80.
  7. Woo, G.; Liu, C.; Sahoo, D.; Kumar, A.; Hoi, S. ETSformer: Exponential Smoothing Transformers for Time-Series Forecasting. arXiv 2022, arXiv:2202.01381.
  8. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series Is Worth 64 Words: Long-Term Forecasting with Transformers. arXiv 2022, arXiv:2211.14730.
  9. Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in Time Series: A Survey. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023; pp. 6778–6786.
  10. Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. arXiv 2023, arXiv:2310.06625.
  11. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov–Arnold Networks. arXiv 2024, arXiv:2404.19756.
  12. Tang, Y.; Yang, K.; Zhang, S.; Zhang, Z. Wind Power Forecasting: A Hybrid Forecasting Model and Multi-Task Learning-Based Framework. Energy 2023, 278, 127864.
  13. Wei, J.; Wu, X.; Yang, T.; Jiao, R. Ultra-Short-Term Forecasting of Wind Power Based on Multi-Task Learning and LSTM. Int. J. Electr. Power Energy Syst. 2023, 149, 109073.
  14. Wang, S.; Sun, Y.; Zhang, W.; Chung, C.Y.; Srinivasan, D. Very Short-Term Wind Power Forecasting Considering Static Data: An Improved Transformer Model. Energy 2024, 312, 133577.
  15. Salman, D.; Direkoglu, C.; Kusaf, M.; Fahrioglu, M. Hybrid Deep Learning Models for Time Series Forecasting of Solar Power. Neural Comput. Appl. 2024, 36, 9095–9112.
  16. Huan, J.; Deng, L.; Zhu, Y.; Jiang, S.; Qi, F. Short-to-Medium-Term Wind Power Forecasting through Enhanced Transformer and Improved EMD Integration. Energies 2024, 17, 2395.
  17. Shringi, S.; Saini, L.M.; Aggarwal, S.K. A Review of Data-Driven Deep Learning Models for Solar and Wind Energy Forecasting. Renew. Energy Focus 2025, 55, 100739.
  18. Gupta, M.; Arya, A.; Varshney, U.; Mittal, J.; Tomar, A. A Review of PV Power Forecasting Using Machine Learning Techniques. Prog. Eng. Sci. 2025, 2, 100058.
  19. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. The M4 Competition: Results, Findings, Conclusion and Way Forward. Int. J. Forecast. 2018, 34, 802–808.
  20. Lim, B.; Zohren, S. Time-Series Forecasting with Deep Learning: A Survey. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2021, 379, 20200209.
  21. Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding Deep Learning (Still) Requires Rethinking Generalization. Commun. ACM 2021, 64, 107–115.
Figure 1. Data Preparation and Processing Pipeline. Source: Own calculations based on experimental results.
Figure 2. Hybrid multi-branch causal architecture DP-STH++ for the joint forecasting of solar and wind power generation and the detection of extreme events. Gray blocks represent the feature extraction branches (LSTM, GRU, causal Conv1D, and light causal Transformer). Blue blocks denote regression heads for solar and wind generation forecasting, while red blocks correspond to classification heads for extreme event detection. Source: Own calculations based on experimental results.
Figure 3. Sensitivity of extreme-event detection metrics (AUC, PR-AUC, and F1) to quantile threshold q for Solar and Wind targets. Source: Own calculations based on experimental results.
Figure 4. Confusion matrices for extreme-event detection at q = 0.90 for Solar and Wind (DP-STH++ and DLinear_proxy). Positive class corresponds to quantile-defined extreme events. Source: Own calculations based on experimental results.
Figure 5. Comparison of actual and forecast solar generation for the DP-STH++ model on training and deferred intervals (H = 1, L = 24; time stamps of target windows). Source: Own calculations based on experimental results.
Figure 6. Comparison of actual and forecast wind generation for the DP-STH++ model on training and deferred intervals (H = 1, L = 24; time stamps of target windows). Source: Own calculations based on experimental results.
Figure 7. Actual and forecast series DP-STH++. Source: Own calculations based on experimental results.
Figure 8. Parametric sensitivity of RMSE_mean to forecast horizon (H) and context window length (L). Source: Own calculations based on experimental results.
Figure 9. Comparative computational cost profile: number of parameters, estimated FLOPs per sample, training time per epoch, and inference latency for all evaluated models. Source: Own calculations based on experimental results.
Figure 10. Seasonal structure of the sample and average generation. Source: Own calculations based on experimental results.
Figure 11. Diagnostics of computational cost. Source: Own calculations based on experimental results.
Table 1. Related Work Summary (2022–2025).
| Reference | Architectural Focus | Application/Task | Key Contribution | Relevance to This Study |
|---|---|---|---|---|
| Zeng et al. [4] | Linear baseline (DLinear) | Time series forecasting | Demonstrates strong linear baselines under fair evaluation | Motivates rigorous comparison with simple models |
| Zhou et al. [5] | Frequency-enhanced Transformer (FEDformer) | Long-term TSF | Frequency decomposition for non-stationary series | Supports decomposition ideas for energy data |
| Su et al. [6] | Transformer review | Long-term TSF | Evaluation best practices | Supports leakage-free protocol |
| Woo et al. [7] | ETS-Transformer | TSF | Trend/seasonality separation via exponential smoothing | Guides interpretation of seasonal structure |
| Nie et al. [8] | PatchTST | Long-term forecasting | Patch-based Transformer processing | Used as modern transformer reference |
| Wen et al. [9] | Survey (Transformers in TS) | General TSF | Architecture taxonomy and evaluation criteria | Defines comparison principles |
| Liu et al. [10] | iTransformer | TSF | Variate-centric transformer modeling | Supports modern transformer lines |
| Liu et al. [11] | KAN | Nonlinear modeling | Spline-based functional approximation | Motivates nonlinear basis baseline |
| Tang et al. [12] | Hybrid + MTL | Wind power | Multitask hybrid wind forecasting | Justifies hybrid MTL for WPF |
| Wei et al. [13] | MTL + LSTM | Ultra-short-term wind | Joint task learning | Supports multitask short-term setup |
| Wang et al. [14] | Improved Transformer | Very short-term wind | Static feature integration | Supports feature-rich WPF |
| Salman et al. [15] | Hybrid DL models | Solar forecasting | CNN/LSTM/Transformer hybridization | Confirms hybrid benefit for PV |
| Huan et al. [16] | Transformer + EMD | Wind forecasting | Decomposition + Transformer | Supports non-stationary modeling |
| Shringi et al. [17] | Review (Solar/Wind DL) | RES forecasting | Deep learning overview | Confirms hybrid relevance |
| Gupta et al. [18] | Review (PV ML) | PV forecasting | ML/DL status update | Contextualizes forecasting trends |
Source: Authors' compilation based on the cited literature.
Table 2. Hold-out comparison of models on the Solar task (H = 1, L = 24).
| Model | RMSE | MAE | MAPE | R2 | EVS | AUC |
|---|---|---|---|---|---|---|
| MTL_LSTM | 683.381 | 401.431 | 747.145 | 0.5847 | 0.6458 | 0.9692 |
| MTL_GRU | 659.721 | 417.619 | 803.721 | 0.613 | 0.6702 | 0.9746 |
| MTL_LSTM_GRU | 698.835 | 437.48 | 830.268 | 0.5658 | 0.6474 | 0.969 |
| STT | 605.6 | 395.856 | 472.174 | 0.6739 | 0.6815 | 0.9654 |
| MTL_LSTM_GRU_STT | 700.222 | 430.233 | 636.075 | 0.564 | 0.6138 | 0.9554 |
| Seasonal Naive (lag = 24) | 837.41 | 390.761 | 454.253 | 0.3765 | 0.3765 | 0.9139 |
| MTL_CNN | 1227.349 | 976.358 | 1227.408 | −0.3394 | 0.2487 | 0.9562 |
| DP-STH++ | 524.266 | 287.904 | 213.81 | 0.7556 | 0.7556 | 0.9547 |
Note: the best values in each column are highlighted in bold (minimum for RMSE/MAE/MAPE; maximum for R2/EVS/AUC). Source: Own calculations based on experimental results.
Table 3. Hold-out comparison of models on the Wind task (80/20 split, H = 1, L = 24).
| Model | RMSE | MAE | MAPE | R2 | EVS | AUC |
|---|---|---|---|---|---|---|
| MTL_LSTM | 326.093 | 249.951 | 2965.786 | 0.914 | 0.9146 | 0.9845 |
| MTL_GRU | 308.415 | 236.114 | 3731.247 | 0.923 | 0.924 | 0.9837 |
| MTL_LSTM_GRU | 353.6 | 283.348 | 2944.039 | 0.8988 | 0.8991 | 0.9801 |
| STT | 494.004 | 377.678 | 8389.498 | 0.8025 | 0.8069 | 0.9725 |
| MTL_LSTM_GRU_STT | 281.384 | 213.445 | 3801.351 | 0.9359 | 0.9385 | 0.9902 |
| Seasonal Naive (lag = 24) | 1358.379 | 1016.758 | 30,869.579 | −0.493 | −0.4929 | 0.6396 |
| MTL_CNN | 588.344 | 456.046 | 9826.748 | 0.7199 | 0.7209 | 0.9701 |
| DP-STH++ | 241.769 | 174.863 | 3666.379 | 0.9527 | 0.9529 | 0.9908 |
Note: the best values in each column are highlighted in bold (minimum for RMSE/MAE/MAPE; maximum for R2/EVS/AUC). Source: Own calculations based on experimental results.
Table 4. Component metrics for Solar and Wind targets.
| Model | Target | RMSE | MASE | AUC | R2 |
|---|---|---|---|---|---|
| DP-STH++ | Solar | 255.5193 | 0.3244 | 0.9912 | 0.9421 |
| MTL_CNN | Solar | 326.2171 | 0.4745 | 0.9907 | 0.9056 |
| STT | Solar | 333.1434 | 0.5080 | 0.9956 | 0.9015 |
| MTL_LSTM_GRU_STT | Solar | 347.2721 | 0.5398 | 0.9957 | 0.8930 |
| MTL_GRU | Solar | 360.3811 | 0.6100 | 0.9858 | 0.8847 |
| MTL_LSTM_GRU | Solar | 372.9083 | 0.6376 | 0.9943 | 0.8766 |
| MTL_LSTM | Solar | 380.7784 | 0.6018 | 0.9881 | 0.8713 |
| Seasonal_Naive_lag24 | Solar | 838.6086 | 0.8449 | 0.9136 | 0.3759 |
| MTL_LSTM_GRU | Wind | 252.2227 | 0.1677 | 0.9883 | 0.9487 |
| DP-STH++ | Wind | 258.8494 | 0.1631 | 0.9880 | 0.9459 |
| MTL_GRU | Wind | 269.7237 | 0.1951 | 0.9897 | 0.9413 |
| STT | Wind | 281.6051 | 0.1834 | 0.9877 | 0.9360 |
| MTL_LSTM_GRU_STT | Wind | 290.6404 | 0.2098 | 0.9895 | 0.9318 |
| MTL_CNN | Wind | 311.4053 | 0.2144 | 0.9884 | 0.9217 |
| MTL_LSTM | Wind | 326.7746 | 0.2414 | 0.9866 | 0.9138 |
| Seasonal_Naive_lag24 | Wind | 1358.335 | 0.9290 | 0.6387 | −0.4890 |
Source: Own calculations based on experimental results.
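The MASE values in Tables 2–4 are benchmarked against the seasonal naive baseline (lag = 24). A minimal sketch of how these two quantities relate, assuming an hourly univariate series; the function names are illustrative, not from the paper's code:

```python
import numpy as np

def seasonal_naive(history, n_steps, lag=24):
    """Lag-`lag` persistence forecast: each step repeats the value
    observed one seasonal cycle (24 h for hourly data) earlier."""
    return np.array([history[len(history) - lag + (t % lag)] for t in range(n_steps)])

def mase(y_true, y_pred, y_train, m=24):
    """Mean Absolute Scaled Error: the forecast MAE divided by the
    in-sample MAE of the lag-m seasonal naive forecast on the training series."""
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / scale
```

By construction, MASE below 1 means the model beats the seasonal naive scale on average, which is why Seasonal_Naive_lag24 itself scores close to 1 (0.8449 and 0.9290) while DP-STH++ reaches 0.3244 (Solar) and 0.1631 (Wind).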
Table 5. Cross-validation results for the Solar task: mean ± std with 90% CI; AUC reported as the mean across folds.
| Model | RMSE (mean ± std; CI90) | MAE (mean ± std; CI90) | MAPE (mean ± std; CI90) | R2 (mean ± std; CI90) | EVS (mean ± std; CI90) | AUC (mean) |
|---|---|---|---|---|---|---|
| MTL_LSTM | 901.12 ± 182.35; [766.97–1035.26] | 652.39 ± 200.82; [504.66–800.13] | 671.50 ± 425.46; [358.51–984.50] | 0.5519 ± 0.1566; [0.4367–0.6671] | 0.5676 ± 0.1635; [0.4474–0.6879] | 0.8827 |
| MTL_GRU | 816.10 ± 184.38; [680.45–951.74] | 577.47 ± 173.38; [449.92–705.02] | 551.42 ± 410.48; [249.44–853.40] | 0.6178 ± 0.1928; [0.4759–0.7596] | 0.6321 ± 0.1974; [0.4869–0.7773] | 0.9153 |
| MTL_LSTM_GRU | 866.42 ± 186.47; [729.24–1003.59] | 609.54 ± 175.08; [480.74–738.35] | 581.70 ± 394.03; [291.83–871.57] | 0.5737 ± 0.2040; [0.4236–0.7238] | 0.5940 ± 0.2026; [0.4450–0.7430] | 0.8739 |
| STT | 896.06 ± 214.96; [737.92–1054.20] | 652.44 ± 166.03; [530.30–774.59] | 687.17 ± 386.99; [402.48–971.87] | 0.5696 ± 0.1207; [0.4808–0.6584] | 0.6486 ± 0.0944; [0.5792–0.7180] | 0.9234 |
| MTL_CNN | 1030.43 ± 398.85; [737.01–1323.85] | 849.66 ± 396.97; [557.62–1141.70] | 1076.56 ± 793.76; [492.61–1660.50] | 0.3457 ± 0.5046; [−0.0256–0.7169] | 0.6465 ± 0.1361; [0.5464–0.7466] | 0.8743 |
Source: Own calculations based on experimental results.
Table 6. Cross-validation results for the Wind task: mean ± std with 90% CI; AUC reported as the mean across folds.
| Model | RMSE (mean ± std; CI90) | MAE (mean ± std; CI90) | MAPE (mean ± std; CI90) | R2 (mean ± std; CI90) | EVS (mean ± std; CI90) | AUC (mean) |
|---|---|---|---|---|---|---|
| MTL_LSTM_GRU_STT | 905.44 ± 340.54; [654.92–1155.97] | 688.68 ± 362.46; [422.03–955.33] | 410.67 ± 175.05; [281.89–539.45] | 0.5405 ± 0.2769; [0.3368–0.7442] | 0.5846 ± 0.2779; [0.3802–0.7891] | 0.9259 |
| Seasonal Naive (lag = 24) | 879.39 ± 65.99; [830.85–927.94] | 478.44 ± 17.58; [465.51–491.37] | 165.24 ± 56.03; [124.02–206.46] | 0.5738 ± 0.0839; [0.5120–0.6355] | 0.5738 ± 0.0839; [0.5121–0.6355] | 0.8827 |
| DP-STH++ | 901.74 ± 209.44; [747.67–1055.82] | 545.76 ± 138.99; [443.51–648.02] | 227.63 ± 188.44; [89.00–366.27] | 0.5474 ± 0.1857; [0.4109–0.6840] | 0.6025 ± 0.1456; [0.4954–0.7097] | 0.8815 |
Source: Own calculations based on experimental results.
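The folds behind Tables 5 and 6 are chronological. A leakage-safe expanding-window splitter can be sketched as follows; the fold count and minimum training fraction here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def expanding_window_splits(n_samples, n_folds=5, min_train_frac=0.4):
    """Yield (train_idx, test_idx) pairs with contiguous, chronologically
    ordered test blocks; the training window only ever grows forward,
    so no future observation can leak into training."""
    start = int(n_samples * min_train_frac)
    fold_size = (n_samples - start) // n_folds
    for k in range(n_folds):
        train_end = start + k * fold_size
        yield np.arange(train_end), np.arange(train_end, train_end + fold_size)
```

Scaling parameters and the extreme-event threshold would be re-estimated inside each training window, mirroring the training-only estimation used on the hold-out split.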
Table 7. Comparison of MAPE and MASE under low generation conditions.
| Model | MAPE | MASE |
|---|---|---|
| DLinear_proxy | 2.780 × 10⁹ | 0.2402 |
| KAN_proxy | 2.640 × 10⁹ | 0.2499 |
| NBEATS_proxy | 2.152 × 10⁹ | 0.3689 |
| PatchTST_proxy | 6.610 × 10⁹ | 0.5271 |
| Seasonal_Naive_lag24 | 1.641 × 10⁸ | 0.8870 |
| DP-STH++_proxy | 5.910 × 10⁹ | 1.0619 |
Source: Own calculations based on experimental results.
Table 8. AUC, PR-AUC, Precision, Recall, and F1 for different quantile thresholds q.
| Target | q | AUC | PR-AUC | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| Solar | 0.85 | 0.990377 | 0.904844 | 0.860000 | 0.767857 | 0.811321 |
| Solar | 0.90 | 0.994343 | 0.875996 | 0.600000 | 0.937500 | 0.731707 |
| Solar | 0.95 | 0.987339 | 0.342348 | 0.140000 | 1.000000 | 0.245614 |
| Wind | 0.85 | 0.978285 | 0.918525 | 0.974359 | 0.629834 | 0.765101 |
| Wind | 0.90 | 0.988255 | 0.937505 | 0.837607 | 0.852174 | 0.844828 |
| Wind | 0.95 | 0.996583 | 0.917066 | 0.478632 | 0.982456 | 0.643678 |
Source: Own calculations based on experimental results.
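The thresholds in Table 8 follow the leakage-safe protocol: the q-quantile is estimated on training targets only and then applied to evaluation data. A sketch of that labeling step plus a rank-based ROC AUC (function names are illustrative; tied scores are ignored for brevity):

```python
import numpy as np

def extreme_labels(y_train, y_eval, q=0.90):
    """Binary extreme-event labels: the threshold is the q-quantile of
    the TRAINING targets only, so nothing leaks from the evaluation set."""
    thr = np.quantile(y_train, q)
    return (y_eval > thr).astype(int), thr

def roc_auc(labels, scores):
    """Mann-Whitney form of ROC AUC: the probability that a randomly
    chosen positive is scored above a randomly chosen negative."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return float(ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / float(n_pos * n_neg)
```

Raising q shrinks the positive class (Table 9 reports positive rates of roughly 3.7% and 6.6% at q = 0.90), which is why PR-AUC and F1 drop at q = 0.95 even while AUC stays high.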
Table 9. Extended classification metrics for extreme-event detection at q = 0.90.
| Model | Target | Positive_Rate_Test | AUC | PR-AUC | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| DP-STH++ | Solar | 0.036655 | 0.994343 | 0.875996 | 0.600000 | 0.937500 | 0.731707 |
| DLinear_proxy | Solar | 0.036655 | 0.995996 | 0.897604 | 0.865385 | 0.703125 | 0.775862 |
| DP-STH++ | Wind | 0.065865 | 0.988255 | 0.937505 | 0.837607 | 0.852174 | 0.844828 |
| DLinear_proxy | Wind | 0.065865 | 0.994045 | 0.953754 | 0.887931 | 0.895652 | 0.891775 |
Source: Own calculations based on experimental results.
Table 10. Sensitivity to forecast horizon (L = 24 fixed).
| H | RMSE_Mean ↓ | MASE_Mean ↓ | AUC_Mean ↑ |
|---|---|---|---|
| 1 | 1056.8492 | 1.0619 | 0.8707 |
| 3 | 1138.2860 | 1.1413 | 0.8183 |
| 6 | 1206.1249 | 1.2107 | 0.7638 |
Source: Own calculations based on experimental results.
Table 11. Sensitivity to window length (H = 1 fixed).
| L | RMSE_Mean ↓ | MASE_Mean ↓ | AUC_Mean ↑ |
|---|---|---|---|
| 12 | 995.1426 | 1.0374 | 0.9067 |
| 24 | 1056.8492 | 1.0619 | 0.8707 |
| 48 | 1141.5528 | 1.1565 | 0.8379 |
Source: Own calculations based on experimental results.
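Tables 10 and 11 vary the forecast horizon H and the context window L. The underlying supervised windows can be sketched as follows; this is a simplified univariate version, and the function name is illustrative:

```python
import numpy as np

def make_windows(series, L=24, H=1):
    """Build (context, target) pairs: each input holds L consecutive
    values and the target lies H steps past the end of the context."""
    n = len(series) - L - H + 1
    X = np.stack([series[t:t + L] for t in range(n)])
    y = np.array([series[t + L + H - 1] for t in range(n)])
    return X, y
```

Each extra step of H pushes the target further beyond the observed context, consistent with the monotone RMSE_Mean and MASE_Mean growth across horizons in Table 10.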
Table 12. Ablation study of DP-STH++ branch contributions.
| Model | Branches | RMSE_Mean ↓ | MASE_Mean ↓ | AUC_Mean ↑ | PR-AUC_Mean ↑ | F1_Mean ↑ |
|---|---|---|---|---|---|---|
| DP_full | LSTM+GRU+TCN+Trf | 299.47 | 0.3419 | 0.9916 | 0.8865 | 0.7658 |
| DP_no_LSTM | GRU+TCN+Trf | 321.43 | 0.3377 | 0.9901 | 0.8940 | 0.8011 |
| DP_no_GRU | LSTM+TCN+Trf | 260.73 | 0.2534 | 0.9923 | 0.8936 | 0.7833 |
| DP_no_TCN | LSTM+GRU+Trf | 274.17 | 0.2728 | 0.9924 | 0.9203 | 0.7996 |
| DP_no_Transformer | LSTM+GRU+TCN | 264.12 | 0.2555 | 0.9934 | 0.9115 | 0.8167 |
Source: Own calculations based on experimental results.
Table 13. Combined hold-out benchmark summary.
| Model | Role | RMSE ↓ | MASE ↓ | AUC ↑ | PR-AUC ↑ | F1 ↑ |
|---|---|---|---|---|---|---|
| PatchTST_proxy | Patch-based family proxy | 461.13 | 0.527 | 0.9848 | 0.8552 | 0.5981 |
| DLinear_proxy | Strong linear baseline | 228.49 | 0.240 | 0.9950 | 0.9257 | 0.8338 |
| KAN_proxy | Nonlinear basis baseline | 238.58 | 0.250 | 0.9944 | 0.9315 | 0.8203 |
| NBEATS_proxy | Deep baseline (MLP residual family) | 393.13 | 0.383 | 0.9929 | 0.8737 | 0.8116 |
| DP-STH++ | Proposed model | 260.68 | 0.250 | 0.9913 | 0.9068 | 0.7883 |
Source: Own calculations based on experimental results.
Table 14. Robustness and generalization summary.
| Robustness Check | What Is Evaluated | Key Observation |
|---|---|---|
| Time-series CV | Temporal transferability across folds | On rolling CV, strong simple baselines may exhibit slightly higher stability; this is explicitly acknowledged as a limitation and a subject for future multi-dataset validation. |
| Train-fraction sensitivity | Dependence on training data volume | Performance degrades predictably as the train fraction decreases; trends remain consistent across models. |
| Statistical significance | Significance of performance differences | Improvements of DP-STH++ over several deep baselines are statistically supported; comparisons with strong linear baselines are interpreted cautiously, without universal superiority claims. |
Source: Own calculations based on experimental results.
Table 15. Paired statistical significance tests (bootstrap CI + permutation test).
| Comparison | Target | ΔMSE (comp − DP) | 95% CI | p-Value |
|---|---|---|---|---|
| DP-STH++ vs. MTL_CNN | Solar | +63,060.75 | [46,901.17; 80,189.64] | 0.000000 |
| DP-STH++ vs. MTL_CNN | Wind | +26,866.24 | [19,473.15; 34,191.39] | 0.000000 |
| DP-STH++ vs. DLinear_proxy | Solar | −5479.19 | [−16,930.09; 4266.03] | 0.295333 |
| DP-STH++ vs. DLinear_proxy | Wind | −24,481.13 | [−31,775.70; −18,450.60] | 0.000000 |
Source: Own calculations based on experimental results.
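The CIs and p-values in Table 15 come from a paired bootstrap and a permutation test on per-sample error differences. One common way to implement such a paired ΔMSE comparison is sketched below; the resampling sizes and the sign-flip permutation scheme are assumptions, not the paper's exact settings:

```python
import numpy as np

def paired_mse_test(err_comp, err_dp, n_boot=2000, n_perm=2000, seed=0):
    """Paired comparison of two models' absolute errors on the same test
    points: bootstrap percentile CI for delta-MSE = MSE(comp) - MSE(DP),
    plus a sign-flip permutation test for its significance."""
    rng = np.random.default_rng(seed)
    d = err_comp**2 - err_dp**2            # per-sample difference; > 0 favors DP
    delta = d.mean()
    # Bootstrap 95% CI: resample the paired differences with replacement.
    idx = rng.integers(0, len(d), size=(n_boot, len(d)))
    ci = np.percentile(d[idx].mean(axis=1), [2.5, 97.5])
    # Permutation: under H0 the sign of each paired difference is exchangeable.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, len(d)))
    p = (np.abs((signs * d).mean(axis=1)) >= abs(delta)).mean()
    return delta, ci, p
```

Read against Table 15: a CI excluding zero with p near zero (e.g., DP-STH++ vs. MTL_CNN) supports a real difference, while the Solar comparison with DLinear_proxy (CI spanning zero, p = 0.295) does not.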