Hybrid Spatio-Temporal Deep Learning Models for Multi-Task Forecasting in Renewable Energy Systems

Tolegenova, Gulnaz; Zakirova, Alma; Kalimoldayev, Maksat; Akhayeva, Zhanar

doi:10.3390/computers15030183

Open AccessArticle

Hybrid Spatio-Temporal Deep Learning Models for Multi-Task Forecasting in Renewable Energy Systems

by

Gulnaz Tolegenova

^1,2,*

,

Alma Zakirova

³,

Maksat Kalimoldayev

^2,4 and

Zhanar Akhayeva

^5,*

¹

Department of Computer and Software Engineering, L.N. Gumilyov Eurasian National University, Astana 010000, Kazakhstan

²

Higher School of Information Technology and Engineering, Astana International University, Astana 010000, Kazakhstan

³

Department of Computer Science, L.N. Gumilyov Eurasian National University, Astana 010000, Kazakhstan

⁴

Institute of Information and Computational Technologies, Almaty 050040, Kazakhstan

⁵

Department of Information Systems, L.N. Gumilyov Eurasian National University, Astana 010000, Kazakhstan

^*

Authors to whom correspondence should be addressed.

Computers 2026, 15(3), 183; https://doi.org/10.3390/computers15030183

Submission received: 20 January 2026 / Revised: 1 March 2026 / Accepted: 6 March 2026 / Published: 11 March 2026

(This article belongs to the Special Issue AI Applications for Smart Grid Energy Management and Industrial Electrical Systems)

Download

Browse Figures

Versions Notes

Abstract

Short-term forecasting of solar and wind power generation is critical for smart grid management but challenging due to non-stationarity and extreme generation events. This study addresses a multi-task learning problem: regression-based forecasting of power output and binary detection of extreme events defined by a quantile-based threshold (q = 0.90). A hybrid spatio-temporal model, DP-STH++, is proposed, implementing parallel causal fusion of LSTM, GRU, a causal Conv1D stack, and a lightweight causal transformer. The architecture employs regression and classification heads, while an uncertainty-weighted mechanism stabilizes multitask optimization in the regression tasks; extreme event detection performance is evaluated using AUC. Training and evaluation follow a leakage-safe protocol with chronological data processing, calendar feature integration, time-aware splitting, and training-only estimation of scaling parameters and extreme thresholds. Experimental results obtained with a one-hour forecasting horizon and a 24 h context window demonstrate that DP-STH++ achieves the best regression performance on the hold-out set (RMSE = 257.18, MAE = 174.86–287.90, MASE = 0.2438, R² = 0.9440) and the highest extreme event detection accuracy (AUC = 0.9896), ranking 1st among all compared architectures. In time-series cross-validation, the model retains the leading position with a mean MASE = 0.3883 and AUC = 0.9709. The advantages are particularly pronounced for wind power forecasting, where DP-STH++ simultaneously minimizes regression errors and maximizes AUC = 0.9880–0.9908.

Keywords:

renewable energy; generation forecasting; multi-task learning; hybrid models; causal architectures; time series; LSTM; GRU; temporal convolutional networks; causal transformer; extreme events; AUC; data leakage; time-series cross-validation; smart grids

Graphical Abstract

1. Introduction

The integration of renewable energy sources into urban smart infrastructure is increasing. Although solar and wind generation are technologically and environmentally advantageous, they are highly susceptible to weather conditions, exhibiting pronounced non-stationarity and sharp intraday output fluctuations. These characteristics complicate power system operational management and increase the demand for the accuracy of short-term forecasts used for load balancing and power reserve allocation [1,2,3].

Recent time series studies show that strong linear baselines and transformer approaches can deliver competitive quality when the experiment is set up correctly [4,5,6,7,8,9,10,11]. For energy applications, particularly in wind and solar generation tasks, hybrid and multitask approaches are relevant because they allow the simultaneous consideration of non-stationarity, seasonal-daily patterns, and extreme modes [12,13,14,15,16,17]. Recent reviews emphasize fair comparison across architectural classes and reproducible evaluation protocols in RES forecasting tasks [6,17,18]. Based on this, the task of this work is formulated as the construction and verification of a unified short-term Solar/Wind forecasting loop with simultaneous regression estimation and detection of extreme events in a comparable experimental protocol.

This study frames short-term forecasting as a time-oriented inference task, predicting “from the past to the future” with a horizon of H = 1 and a context window of L = 24. Initial observations were first standardized to a uniform timescale by parsing, sorting, and removing duplicate timestamps, followed by the creation of calendar attributes: hour, day of week, month, and season. Subsequently, a unified feature set and target variables were constructed for both the solar and wind subsystems. Strict chronological splitting was applied using an 80/20 hold-out and walk-forward cross-validation. Finally, the target series is shifted by horizon H, retaining only valid indices [19,20].

A key practical consideration is the need to account for not only the average power levels but also rare, extreme operating modes. In this study, this was addressed by formalizing a separate binary classification task to identify these extremes. Extremes are defined using an EXT_Q quantile threshold, calculated solely on the training data (q = 0.90 in our experiments) and then applied without modification to the hold-out/folds. This results in a unified multi-task protocol encompassing two regressions (solar and wind) and two classifications of extremes [21].

Another critical methodological requirement is preventing data leakage. All transformation statistics, including the feature and target variable scaling parameters, were configured exclusively on the training data and then applied to the validation/test sets without recalculation. This ensures the reproducibility and temporal correctness of the evaluation [19].

The objective of this study was to develop and evaluate a hybrid causal DP-STH++ architecture for the joint forecasting of solar and wind generation and the detection of extreme events within a single model. The architecture employs the parallel fusion of four representation branches (LSTM, GRU, causal Conv1D, and a lightweight causal transformer). The aggregated representations were then combined and fed into multi-head outputs for regression and classification. An uncertainty-weighted balancing mechanism (LossScaleLayer) is used in the regression component to adaptively balance solar and wind regression losses [5,7,8,9,10], where prior studies have shown that uncertainty-based weighting stabilizes multi-task optimization in non-stationary time series. The experimental evaluation compares the proposed architecture against baseline multi-task learning (MTL) architectures (LSTM, GRU, their combinations, CNN, STT), as well as a seasonal naive benchmark (lag = 24), allowing us to assess the contributions of both the architectural design and the rigorous time-correct processing and validation protocol [20].

In contrast to several existing hybrid models that typically combine two architectural families (e.g., TCN + Transformer or graph-based models), DP-STH++ integrates four parallel causal branches with joint regression and extreme-event classification heads under a unified chronological evaluation protocol. The novelty lies in the parallel causal fusion mechanism and strict train-only threshold computation, rather than in the use of individual standard components.

The main scientific contributions of this study are as follows:

A leak-resistant multitask learning framework is proposed for the joint forecasting of solar and wind generation.
A hybrid spatiotemporal architecture was developed, integrating recurrent, convolutional, and transformer components within a causal framework.
The explicit modeling of extreme generation modes was implemented as a separate classification task.
A comprehensive evaluation is conducted to assess accuracy and stability against representative baselines under the adopted protocol.

The field of time-series forecasting has seen rapid development in recent years, with several studies questioning the necessity of complex transformer architectures when simpler linear models or properly configured baselines perform comparably or better under fair evaluation conditions [4,5,8,9,10]. At the same time, energy-specific applications benefit significantly from hybrid and multi-task frameworks that jointly model regression and classification tasks (e.g., extreme event detection) while handling pronounced non-stationarity and sharp fluctuations typical for solar and wind generation [6,7,10,11,12].

Recent reviews published in 2024–2025 provide updated overviews of deep learning approaches for solar and wind forecasting, confirm the relevance of hybrid architectures, and stress the methodological importance of leakage-free experimental protocols and reproducible comparisons [13,14,15].

In light of these developments, the current work builds upon the latest methodological insights. Table 1 summarizes a curated set of recent publications (2022–2025) that directly inform the design choices, baseline selection, evaluation protocol, and interpretation of results in this study.

Recent developments in Kolmogorov–Arnold Networks (KAN) and related basis-function-based architectures propose adaptive nonlinear modeling through spline-based functional representations. These models have demonstrated strong approximation capabilities in structured regression tasks and have recently attracted attention in time-series forecasting research.

In the revised experimental section, a KAN-inspired proxy baseline is included to ensure empirical positioning relative to nonlinear basis-function families under the same leakage-safe protocol and identical evaluation settings. While KAN-type approaches emphasize flexible functional approximation, the proposed DP-STH++ architecture focuses on complementary temporal representations within a unified multitask regression–classification framework.

2. Materials and Methods

2.1. Data Source and Study Design

This study addresses the challenge of short-term (hourly) forecasting of both solar and wind generation within a unified framework, focusing on two target series: kWh_solar_power_solar and kWh_wind_power_wind. The empirical database consists of a single chronologically ordered time series with 8752 observations and a consistent input dimension for both forecasting tasks. The input feature vector comprises meteorological indicators relevant to solar and wind power generation and calendar variables (hour, day of the week, month, and season).

The experimental design employed a time-series simulation approach, maintaining chronological order without data mixing. An 80/20 hold-out split was used, training models on the initial 80% of the data and testing on the final 20%. Where necessary, walk-forward cross-validation (TimeSeriesSplit) was implemented. A chronological 80/20 hold-out split without shuffling was applied. Walk-forward cross-validation (TimeSeriesSplit) was additionally implemented.

2.2. Leakage-Safe Data Preprocessing

The data preparation pipeline consisted of the following sequential operations (Figure 1).

The data preparation and processing pipeline consists of the following steps.

Temporal Axis Alignment: Timestamps were parsed, data were sorted chronologically, and duplicate entries were removed.
Calendar Feature Generation: Calendar-based features, including hour, day of the week, month, and categorical season (subsequently encoded), are generated.
Working Dataset Construction: A working dataset comprising the target series and a complete set of input features (targets + features_all) was assembled. Rows containing missing values in the selected columns were then removed.
Time-Based Splitting and Horizon Adjustment: A time-based split (without shuffling) was performed using an 80/20 hold-out or walk-forward cross-validation approach. Target variables are shifted to a specified horizon (H), and only valid indices (rows with available target values at the horizon) are retained in the dataset.
Training-Only Scaling: Scaling parameters for features and target variables are estimated exclusively on the training dataset (fit x_scaler, y_scaler on TRAIN). These parameters were then applied to the hold-out or cross-validation folds without re-evaluation.
Windowing: For sequential architectures, sliding windows of length L (L = 24 in the experiments) were created. The first L − 1 timestamps were discarded for the supervised samples. Optionally, a feature vector (X_last) representing the “flat” representation of the current state can be formed from the last window step.

2.3. Definition of Extreme Events

Extreme events were formalized using a quantile threshold calculated solely from the training sample. For each target series, a quantile level (q = EXT_Q) was set independently (q = 0.90 in the presented experiments). The resulting threshold (thr) is then fixed and applied to the holdout or cross-validation sets without recalculation. A binary extreme label was assigned by comparing the observation to the threshold on the corresponding scale of the target variable, forming output y_ext for subsequent model evaluation using the AUC metric.

2.4. Model Architecture

In this study, we implemented and compared a set of multitasking deep learning architectures, including MTL_LSTM, MTL_GRU, MTL_LSTM_GRU, STT, MTL_CNN, MTL_LSTM_GRU_STT, and the proposed hybrid model DP-STH++. To ensure a fair comparison, all models utilized a consistent input format in the form of a sequential window X_seq ∈ R^{(L × D)} (where L = 24 in the experiments) and a unified output format: two regression heads for predicting solar and wind generation and two classification heads for detecting extreme modes.

The proposed DP-STH++ architecture is constructed as a parallel causal fusion of four spatiotemporal branches, each generating its own representation of the window X_seq:

Branch 1 (RNN): A causal LSTM recurrent encoder extracts short- and medium-term dependencies while maintaining causality.
Branch 2 (RNN): A causal GRU serves as an alternative recurrent encoder, complementing the LSTM in terms of inductive properties.
Branch 3 (TCN): A causal Conv1D stack extracts local and multiscale temporal patterns.
Branch 4 (STT): A lightweight causal transformer with a causal attention mechanism that accounts for a longer-range context within the window.

The output of each branch is aggregated using a global pooling operator (resulting in a fixed-dimensional vector), followed by feature fusion:

z = Concat (z^{(1)}, z^{(2)}, z^{(3)}, z^{(4)}),

where

z^{(b)}

is the pooled representation of branch

b \in {1,2, 3,4}

. Subsequently, the vector z is processed through a sequence of BatchNormalization, a Dense layer with 64 units and ReLU activation, and a dropout layer with a rate of 0.20 before being fed to the output heads. This architecture decouples the extraction of temporal features (within branches) from the unified multitasking component (shared blocks and heads), ensuring compatibility with fundamental MTL architectures.

Architectural Rationale and Complementarity of Branches

The DP-STH++ architecture is not intended as a fundamentally new neural network class. Instead, it represents a structured architectural decomposition designed for the joint solution of two related tasks: (i) short-term regression forecasting of solar and wind generation and (ii) detection of rare extreme events under a leakage-safe protocol.

The design follows a complementarity principle, where each parallel branch captures a distinct aspect of temporal dynamics (see Figure 2):

LSTM branch—models long-term temporal inertia, daily periodicity, and smoothing effects typical for renewable generation series. This branch emphasizes stable representation of medium- and long-range dependencies.
GRU branch—provides a more compact recurrent representation with fewer parameters, which improves learning stability under limited data regimes. In our configuration, GRU serves as a complementary recurrent encoder focusing on faster adaptation to local dynamic changes.
TCN (causal Conv1D) branch—extracts short local temporal motifs and abrupt transitions (spikes and drops). Convolutional filters highlight multiscale local structures that recurrent encoders may smooth out.
Lightweight causal Transformer branch—performs contextual aggregation across the observation window while preserving strict causality (no access to future information). It models heterogeneous feature interactions and non-uniform temporal dependencies within the window.

Thus, the parallel composition is not an arbitrary engineering combination but a decomposition of temporal structure into complementary representations: memory/inertia (RNNs), local motifs (TCN), and contextual interactions (Transformer).

The contribution of DP-STH++ is therefore architectural–algorithmic and experimentally validated, rather than based on the invention of new layers. The goal is to construct a reproducible multitask framework that balances regression accuracy and extreme-event classification under realistic data constraints.

2.5. Multi-Task Learning and Loss Functions

Training was conducted in a multitask setting that combined two regression and two classification tasks. For solar and wind power generation regression, an uncertainty-weighted quadratic error with trainable scale parameters (implemented via LossScaleLayer (UW-Reg)) was employed. This approach enables the automatic balancing of contributions from the two regression tasks, thereby eliminating the need for manual coefficient selection.

Let the true values on the horizon be, and the model predictions be—The regression loss is defined as:

L_{i}^{reg} = \frac{1}{N} \sum_{t = 1}^{N} {(y_{i}^{(t)}− {\hat{y}}_{i}^{(t)})}^{2}, i \in \{s, w\}

Then, the uncertainty-weighted part takes the form:

L_{UW}^{reg} = \sum_{i \in {s, w}} (\frac{1}{2 σ_{i}^{2}} L_{i}^{reg} + l o g σ_{i}),

where

σ_{i} > 0

—trainable parameters (uncertainty scales) for task

i = s, w

.

Extreme modes are specified by binary labels

y_{s}^{ext}, y_{w}^{ext} \in {0,1}

, formed according to fixed thresholds

t h r_{s}, t h r_{w}

, calculated on the training part at level

q = EXT_Q

(in the experiments

q = 0.90

). Probabilistic predictions

{\hat{p}}_{s}^{ext}, {\hat{p}}_{w}^{ext} \in (0,1)

, obtained using sigmoid heads, were used for classification. The classification losses are specified by the focal binary cross-entropy as follows:

L_{i}^{cls} = \frac{1}{N} \sum_{t = 1}^{N} FocalBCE, (y_{i}^{ext (t)}, {\hat{p}}_{i}^{ext (t)}), i \in {s, w} .

The total loss function is expressed as follows:

L = L_{UW}^{reg} + \sum_{i \in {s, w}} L_{i}^{cls} .

2.6. Training Procedure and Evaluation Protocol

All the models were trained using a consistent time-based validation protocol. The primary scenario employed an 80/20 hold-out split without shuffling, where training was conducted on the initial segment of the time series and evaluation was performed on the final segment. Furthermore, walk-forward cross-validation (TimeSeriesSplit) was implemented to assess the stability of the estimates across consecutive time folds.

The target values were shifted to the specified horizon

H

(in the presented experiment

H = 1

), and only valid window indices were utilized. All preprocessing parameters (including

x_s c a l e r

,

y_s c a l e r

and thresholds) were computed exclusively on the training partition and subsequently applied to the hold-out/CV-folds without re-evaluation. All preprocessing parameters were computed exclusively on the training partition and applied to validation/test sets without recalculation.

The performance of the regression forecasts was evaluated using the following metrics: RMSE, MAE, MAPE,

R^{2}

and EVS. The effectiveness of the extreme value detection was assessed using the AUC ROC. Regression performance was evaluated using RMSE, MAE, MAPE, R², and EVS. Extreme event detection was evaluated using AUC-ROC. Component-wise regression and classification metrics for Solar and Wind targets are reported separately in Table 2.

To avoid ambiguity in metric interpretation, scale-dependent error measures (RMSE, MAE, MAPE) are computed and reported only after a consistent scale definition. When normalization is applied during training, final evaluation metrics are calculated after inverse transformation to the selected reporting scale.

Results expressed in different physical scales (“raw” and “kW”) are not directly comparable for scale-dependent metrics. Therefore, such results are presented in separate tables. Cross-model interpretation across different scales relies exclusively on scale-independent metrics (R², EVS, AUC) under the same data-splitting protocol.

3. Results

3.1. Hold-Out Evaluation Results

Before presenting the results, it should be noted that MAPE may become numerically unstable in renewable energy forecasting due to zero or near-zero generation values (e.g., nighttime solar production or low wind regimes). Division by small denominators can lead to inflated percentage errors. Therefore, in this study, MAPE is not used as a primary ranking metric. Model comparison and conclusions are primarily based on MASE and consistent absolute metrics (RMSE, MAE), while MAPE is reported only for completeness.

Table 2 presents the quality metrics for solar generation (Solar) forecasting on an 80/20 hold-out set. The proposed DP-STH++ model exhibited superior regression performance compared to the other considered architectures. It achieved a minimum error of RMSE = 524.27 and MAE = 287.90, along with maximum values of R² = 0.7556 and EVS = 0.7556, indicating a substantial proportion of explained variability and strong alignment of predictions with the observed dynamics. DP-STH++ also demonstrated the lowest MAPE value; however, this metric is reported for completeness and is not used for primary ranking due to its instability near zero values.

Concurrently, MTL_GRU demonstrated the best performance in solar extreme detection on the hold-out set (AUC = 0.9746), whereas DP-STH++ achieved an AUC of 0.9547. This suggests that the primary advantage of DP-STH++ for Solar energy lies in its accuracy in continuous prediction rather than in maximizing the AUC classification metric.

Table 3 presents the results for wind generation (Wind). Here, DP-STH++ confidently leads in terms of key regression metrics: RMSE = 241.77, MAE = 174.86, as well as in terms of consistency

R^{2} = 0.9527

and EVS = 0.9529. In the task of detecting wind extremes, DP-STH++ also demonstrates maximum accuracy (AUC = 0.9908), outperforming its closest competitor, MTL_LSTM_GRU_STT (0.9902).

It is important to note that the MAPE values for Wind are notably high (thousands and tens of thousands of percent) across all models. This is a common artifact of the MAPE when dealing with periods of near-zero power generation. Consequently, RMSE, MAE, R², EVS, and MASE offer more reliable performance indicators for Wind, with MAPE serving only as a supplementary descriptive metric.

Component-wise regression and classification metrics for Solar and Wind are reported separately to avoid cross-target generalization. Although trained within a single unified protocol, the targets exhibit different statistical properties, resulting in different absolute metric values. This confirms the necessity of target-specific interpretation. The detailed component-wise results for both targets are summarized in Table 4.

3.2. Cross-Validation Results

Table 5 presents the final model stability and quality indicators obtained during the cross-validation for the solar channel. These indicators include absolute errors (RMSE and MAE), relative error (MAPE), consistency and explained variance (

R^{2}

, EVS), and the quality of extreme event detection (AUC). This table facilitates the evaluation of both the average accuracy level and the dispersion of results across folds, which is crucial for assessing reproducibility.

The data presented in the table allow for several key observations.

Model Leadership on a Single Scale (kW) Based on Regression Metrics: Among the MTL variants operating on the kW scale, MTL_GRU exhibited the most balanced performance profile. It achieved the lowest average RMSE and MAE within this group, coupled with the highest R² and EVS values. This suggests that, in this specific setting, the GRU encoder offers superior generalization capabilities compared to other “pure” recurrent MTL baselines.
Error Stability and Characteristics: The standard deviations reflect the sensitivity of each model to the data partitioning. For instance, STT displayed relatively moderate variability; however, it did not surpass MTL_GRU in terms of absolute error reduction. Conversely, MTL_CNN exhibited the greatest instability and poorest average accuracy (with a confidence interval that included negative values), indicating a potential risk of diminished performance across different folds.
Extreme Event Prediction (AUC) as a Distinct Quality Metric: Transformer/hybrid solutions excelled in terms of AUC. MTL_LSTM_GRU_STT demonstrated the highest average AUC, and STT also performed strongly within the distribution. This aligns with the understanding that attention mechanisms are often more effective in extracting subtle patterns crucial for classifying rare events, even if regression errors are not minimized.
To avoid ambiguity caused by mixed reporting scales, cross-validation results are presented separately for kW and raw scales. Scale-dependent metrics (RMSE, MAE, MAPE) are compared only within identical physical units, while R², EVS, and AUC remain cross-scale interpretable under the same splitting protocol.

Table 6 presents a comparative analysis of the models under walk-forward cross-validation conditions for the wind channel. This analysis evaluated the regression accuracy (RMSE, MAE), relative errors (MAPE), forecast consistency with observations (EVS), and accuracy of extreme mode recognition via a probability head (AUC). This presentation format is valuable because an algorithm may demonstrate acceptable average error but exhibit instability across different folds, a characteristic reflected in the width of the 90% confidence interval.

The key conclusions drawn from this table are as follows:

Regression Quality and Stability: DP-STH++ and MTL_GRU exhibited the strongest wind regression profiles. DP-STH++ demonstrated the lowest average RMSE and MAE among the solutions presented, coupled with high nRMSE and EVS values, indicating a strong reproduction of wind generation variability. MTL_GRU also presents robust regression metrics, accompanied by a relatively narrow confidence interval for nRMSE, suggesting a consistent performance across different folds in this experiment.
Extreme Value Detection Quality: DP-STH++ achieved the highest average AUC (0.9555), demonstrating its ability to effectively differentiate between peak and non-peak wind generation values. MTL_LSTM_GRU_STT also exhibited a high AUC (0.9249), confirming the utility of hybridization/attention mechanisms for the classification component, even with some variability in the regression errors.
Behavior of Relative Metrics on Wind (MAPE): The MAPE metric for the wind channel exhibited high values and wide intervals across many rows. This is a common characteristic of relative metrics when dealing with small or near-zero values, where the denominator can destabilize an estimate. Therefore, when interpreting wind results, emphasis should be placed on RMSE/MAE and nRMSE/EVS, with MAPE considered an auxiliary indicator sensitive to low power levels.
Weak and Inadequate Baseline Benchmarks for Wind: Seasonal Naive (lag = 24) demonstrated negative nRMSE and EVS, indicating that it reproduces wind dynamics worse than a trivial constant line at the average level in the corresponding divisions. Similarly, MTL_CNN and STT exhibited decreased consistency (wide intervals and low average nRMSE/EVS), suggesting insufficient stability of these configurations within the current training protocol and data volume.

Table 7 Comparison of MAPE and MASE values specifically in low-generation regimes (actual generation <5% of observed maximum). MAPE shows significant inflation and instability, particularly for wind power and nighttime solar periods, justifying the use of MASE as the primary robust metric for model ranking and conclusions.

In summary, the results indicate that a combination of stable regression quality and reliable extreme value detection is crucial for wind-generation modeling. This combination is most evident in DP-STH++, whereas other architectures either lack stability across folds or do not provide sufficient accuracy in terms of basic error metrics.

3.2.1. Extreme-Event Threshold Sensitivity Analysis

The definition of extreme events in the main experiments is based on a quantile threshold q = 0.90 computed on the training subset and applied to the test data. To address concerns regarding the arbitrariness of this choice, a sensitivity analysis was conducted for q ∈ {0.85, 0.90, 0.95}.

Under q = 0.90, the proportion of positive extreme events on the test set is approximately 3.7% for Solar and 6.6% for Wind. This corresponds to rare but sufficiently represented events, enabling statistically stable estimation of PR-based metrics. The quantitative results of the threshold sensitivity analysis are presented in Table 8.

The results indicate that:

For Solar, q = 0.95 leads to excessive sparsity, significantly degrading PR-AUC and F1.
For Wind, q = 0.90 provides the best balance between Precision and Recall.
Lower thresholds (q = 0.85) increase event frequency but reduce extremeness.

These findings demonstrate that q = 0.90 represents a balanced operating point rather than an arbitrary choice. At the same time, the optimal quantile may depend on the risk tolerance and application context.

Increasing the quantile threshold increases event sparsity. While AUC remains relatively stable, PR-AUC and F1 degrade for Solar at q = 0.95 due to extreme class imbalance. The threshold q = 0.90 provides a balanced operating point across targets.

3.2.2. Extreme-Event Detection Performance (Imbalance-Aware Evaluation)

Given the class imbalance inherent in extreme-event detection, ROC-AUC alone may be insufficient for comprehensive evaluation. For q = 0.90, the positive class proportion in the test set is approximately 3.7% for Solar and 6.6% for Wind. Under such an imbalance, PR-AUC, Precision, Recall, F1-score, and confusion matrices provide a more informative assessment. The sensitivity of extreme-event detection metrics to the quantile threshold is illustrated in Figure 3.

Table 9 reports extended classification metrics for DP-STH++ and representative baselines.

The results show that for Solar, DP-STH++ operates in a high-recall regime (Recall = 0.9375), minimizing missed extreme events at the cost of lower Precision. In contrast, DLinear_proxy achieves higher Precision but lower Recall.

For Wind, both models demonstrate balanced detection performance, with DLinear_proxy slightly outperforming in PR-AUC and F1, indicating fewer classification errors under the current threshold.

Figure 4 presents the confusion matrices for Solar and Wind at q = 0.90.

3.3. Distributional Analysis

Figure 5 illustrates the temporal dynamics of solar generation, with three distinct series displayed on the raw scale: actual values for the training interval (“Training (true)”), actual values for the holdout interval (“Holdout (true)”), and the model forecast (“Forecast”).

This visualization enables a qualitative assessment of the model’s ability to preserve the daily signal structure and transfer learned patterns from the training period to the subsequent period without temporal discontinuity.

The graph depicts a typical daily profile of solar power generation, characterized by near-zero values during the night and peaks during the day. The forecast curve for the holdout interval generally reproduces:

The phase and duration of the daytime generation window (the temporal position of the rise and fall relative to the actual curve).
The amplitude of the main peaks and the overall shape of the daytime “dome” are particularly important for evaluating performance using RMSE/MAE in tasks exhibiting pronounced daily cyclicality.
Behavior near zero levels, where relative distortions can occur due to small denominators (a common source of instability in MAPE and similar metrics during nighttime segments).

However, local discrepancies in the height and/or steepness of the fronts are noticeable in some peak areas, indicating the ongoing challenge of accurately reproducing extreme and rapidly changing modes, even with a correctly captured seasonal structure. Overall, visual inspection confirms that DP-STH++ retains both the shape of the daily pattern and the generation levels on the lagged segment, which aligns with the quantitative performance metrics presented previously.

Figure 6 illustrates the dynamics of wind generation on the raw scale, segmented into three trajectories: actual values on the training section (“Training (true)”), actual values on the holdout section (“Holdout (true)”), and the model forecast (“Forecast”)

In contrast to the solar series, the wind process exhibits distinct non-stationarity, characterized by sharp fronts, short pulses, and episodes of power reduction to low values. Consequently, this visualization is particularly useful for assessing forecast quality based on pattern fidelity rather than solely relying on integral metrics.

The forecast curve for the holdout interval demonstrates general consistency with the observed data in key aspects.

Accurate reproduction of the temporal position of major rises and falls.
The maintenance of large peak levels and their sequence is critical for practical reserve management scenarios.
The transitions to low values were adequately reflected, although local shifts and smoothing were possible in some areas.

The remaining discrepancies were concentrated in areas with the most abrupt changes. In some instances, the model smooths out steep fronts or slightly underestimates/overestimates the amplitude of short-term spikes, which is a common characteristic when forecasting wind series with a fixed context window. Overall, the graph confirms that DP-STH++ effectively transfers patterns from the training segment to the deferred segment, maintaining stable forecasts, even with high variability in wind generation.

In addition to scalar metrics, temporal alignment between predicted and observed profiles (Figure 7) confirms that the model preserves dynamic behavior across peak and low-generation intervals.

3.4. Summary of Quantitative Results

The quantitative results are summarized below.

DP-STH++ demonstrated the highest-quality regression prediction on the hold-out data (80/20 split) at H = 1 and L = 24. For solar power forecasting, it achieved minimum errors (RMSE = 524.27, MAE = 287.90) and maximum consistency (R² = 0.7556, EVS = 0.7556). For wind power forecasting, it recorded the best values (RMSE = 241.77, MAE = 174.86, R² = 0.9527, EVS = 0.9529).
The multitasking approach provides a practically significant detection of the extreme modes. DP-STH++ exhibited a maximum AUC of 0.9908 on the hold-out data for wind generation. For solar generation, the model’s AUC is 0.9547, confirming high detection quality, although MTL_GRU achieves the highest AUC for solar generation (0.9746).
Temporal cross-validation estimates confirmed the reproducibility of the quality. They also demonstrated that comparisons of absolute errors between models on different scales (kW vs. raw) require careful consideration. Scale-invariant indicators (R², EVS, and AUC) retained interpretable comparability.
The advantage of DP-STH++ is most evident for wind generation, where it achieves both high regression accuracy and the best AUC. For solar generation, the model’s advantage is primarily manifested in a reduction in absolute errors (RMSE/MAE) at a competitive but not maximum AUC level.

Sensitivity to Forecast Horizon and Window Length.

To provide a detailed numerical comparison, model sensitivity was evaluated under varying forecast horizons

H \in \{1, 3, 6\}

(with

L = 24

fixed) and varying window lengths

L \in {12, 24, 48}

(with

H = 1

fixed), while keeping all other parameters unchanged.

The comparative results of the models are presented in Table 10.

Performance degrades monotonically as H increases: RMSE and MASE rise, while AUC decreases. The sensitivity of the model to different window lengths is summarized in Table 11.

Within the tested range, the best performance is observed at L = 12, while larger windows lead to gradual degradation under the current dataset size.

Figure 8 provides quantitative evidence of model sensitivity to forecast horizon and context length under a fixed leakage-safe protocol. The monotonic increase in RMSE_mean with growing H confirms expected degradation, while window variation reveals data-regime-dependent behavior.

3.5. Ablation Study of Architectural Branch Contributions

To quantify the contribution of each architectural branch, we conducted a systematic ablation study by removing one component at a time (LSTM, GRU, TCN, Transformer) while keeping all other elements and training settings unchanged.

Table 12 reports regression (RMSE, MASE) and classification (AUC, PR-AUC, F1) metrics averaged across targets.

The results demonstrate that removing the LSTM branch leads to the largest increase in RMSE (321.43), confirming the importance of long-term memory modeling for renewable energy forecasting.

Interestingly, some reduced configurations (e.g., without GRU or Transformer) achieve lower regression errors under the current dataset size. This suggests possible over-parameterization or gradient interaction effects in multitask training when model capacity increases.

These findings indicate that DP-STH++ should be interpreted as a configurable architectural family rather than a fixed, universally optimal configuration. The ablation results provide measurable evidence of branch contributions and clarify the practical trade-offs between model complexity, stability, and accuracy under limited data conditions.

3.6. Extended Benchmark Comparison

To address concerns regarding the completeness of baseline comparison, we extended the experimental section with additional benchmark families representing diverse architectural paradigms. These include linear decomposition models (DLinear family), patch-based transformer-style models (PatchTST proxy), nonlinear basis-function architectures (KAN proxy), and residual deep architectures (N-BEATS proxy).

All benchmark models were evaluated under an identical experimental protocol: leakage-safe preprocessing, identical chronological splits (80/20 hold-out), window length L = 24, forecast horizon H = 1, and identical evaluation metrics. This ensures that the comparison is methodologically consistent and reproducible.

All proxy benchmark models reported in Table 13 were trained under the same experimental protocol as the proposed architecture and the main baselines. Specifically, an identical chronological 80/20 hold-out split was used, with context window L = 24 and the same 12 input features. Input scaling was fitted exclusively on the training subset and applied to the test subset without leakage. For regression evaluation, predictions were produced in the raw scale, consistent with the primary protocol. Extreme-event thresholds (quantile-based) were computed on the training set and then applied to the hold-out data.

The purpose of these proxy models is not to redefine the overall ranking but to position the proposed architecture within diverse methodological families and to provide additional diagnostic insight under identical evaluation conditions.

The results indicate that strong linear baselines remain highly competitive in short-horizon forecasting, achieving low regression error in certain settings. Nonlinear basis-function models demonstrate strong classification performance in extreme-event detection.

The proposed DP-STH++ model does not dominate across all individual metrics; however, it provides a balanced trade-off between regression accuracy and extreme-event detection within a unified multitask framework. These findings suggest that increased architectural complexity does not automatically guarantee superiority, particularly under limited data regimes, and that model selection should consider both performance and task objectives.

3.7. Robustness and Generalization Analysis

The dataset used in this study consists of 8752 hourly observations (approximately one year), which represents a limited data regime for hybrid deep learning architectures. To avoid overstated claims and assess result stability, additional robustness analyses were conducted.

Rolling TimeSeries cross-validation (TS-CV) was applied to evaluate temporal transferability across sequential folds. In addition, a train-fraction sensitivity analysis was performed to examine performance behavior under reduced training data proportions. Finally, bootstrap-based statistical testing was conducted for key regression comparisons.

The aggregated robustness results are summarized in Table 14.

The robustness analysis confirms that conclusions are restricted to the evaluated dataset and protocol. Strong linear baselines remain competitive under limited data conditions, highlighting the importance of extended multi-dataset validation in future work.

3.7.1. Consistency Audit and Reproducibility Checks

Given the presence of multiple evaluation regimes (hold-out, rolling TimeSeries cross-validation, ablation averages, and sensitivity analyses), numerical differences between tables may arise due to distinct splits and aggregation procedures.

To ensure internal consistency, all summary tables were regenerated from a single aggregated evaluation source. No manual editing of final metric values was performed. Automated checks verified:

absence of duplicate model–target rows,
reproducibility of reported summary metrics from raw evaluation outputs,
alignment between protocol definitions and table captions.

Differences in RMSE or related metrics across tables, therefore, reflect differences in evaluation protocol (e.g., fixed hold-out vs. rolling cross-validation vs. averaged ablation results) rather than reporting inconsistencies.

These measures ensure reproducibility and internal coherence of the reported experimental results.

3.7.2. Statistical Significance Analysis

To avoid conclusions based solely on mean metrics, paired significance testing was conducted on identical test timestamps. We report bootstrap 95% confidence intervals for mean error differences and paired permutation test p-values. DP-STH++ improves over MTL_CNN significantly for both Solar and Wind (p < 0.001). In comparison with DLinear_proxy, differences are not significant for Solar (p > 0.05), while DLinear_proxy is significantly better for Wind (p < 0.001). The results of the paired statistical significance tests are summarized in Table 15.

3.7.3. Computational Cost Analysis

DP-STH++ has the largest parameter count and computational complexity among the evaluated deep models, resulting in higher training time and inference latency. This trade-off is explicitly acknowledged: DP-STH++ targets improved predictive quality and extreme-event detection rather than computational efficiency. The comparative computational cost profile of the evaluated models is presented in Figure 9.

4. Discussion

The results demonstrate the efficacy of the DP-STH++ hybrid spatiotemporal architecture as a solution for multitasking short-term forecasting of hourly solar and wind generation under conditions of significant non-stationarity and the presence of infrequent extreme events. By employing a unified, leakage-safe protocol and consistent experimental parameters (H = 1, L = 24), DP-STH++ exhibited superior regression performance for both the solar and wind channels on the hold-out sample. Specifically, for solar generation, the minimum RMSE and MAE values were observed concurrently with the maximum R²/EVS. For wind generation, optimal values were attained across all key regression metrics, coupled with maximum accuracy in extreme value detection, as measured by the AUC. This performance profile indicates that the proposed architecture effectively integrates the capacity to reproduce continuous power dynamics and reliably identifies peak generation modes.

The architectural advantage of DP-STH++ stems from the complementarity of parallel causal branches. Recurrent components (LSTM and GRU) facilitate the stable modeling of short- and medium-term dependencies, whereas the TCN branch enhances the extraction of local and multi-scale patterns in a causally consistent manner. The lightweight causal transformer increases the sensitivity to a more extended context. The aggregation of representations via pooling and subsequent concatenation in the fusion block allows for the summation of diverse inductive biases without compromising causality, which is a crucial feature when processing time-series data characterized by the simultaneous presence of regular cycles and short-term anomalous spikes.

A comparative analysis of the models revealed a distinction between the solar and wind channels regarding the nature of the achievable gains. For solar generation, DP-STH++ yielded the most significant improvement in regression accuracy on the holdout set, whereas the simpler MTL_GRU configuration excelled in AUC. This suggests a practical trade-off: maximizing the quality of continuous forecasting and maximizing purely classification metrics of extremes are not always equally achievable by a single architecture, particularly given the pronounced nighttime intervals with near-zero values. For wind generation, the advantage of DP-STH++ was most pronounced, simultaneously improving the regression errors and providing the maximum AUC, which is consistent with the higher stochasticity of the wind process and the reduced effectiveness of purely seasonal heuristics.

The influence of metric selection requires careful consideration. For wind generation, the MAPE values are often inflated and exhibit high variability owing to periods of low power output, leading to division by near-zero values. Consequently, the assessment of model performance for wind generation should prioritize the RMSE/MAE and consistency metrics (R², EVS), as well as the AUC for extreme operating modes. MAPE should be considered an auxiliary metric because of its potential instability.

The cross-validation results indicate that the model stability is significantly influenced by the architectural class. Certain solutions exhibited a wide range of indicator values across folds, with R² values falling into low or even negative ranges (e.g., for some STT/MTL_CNN configurations in the wind task). In contrast, DP-STH++ maintains competitive scale-invariant indicators (R²/EVS/AUC), which are crucial for ensuring the reproducibility of the findings across different time divisions. It is important to note that the tables include rows calculated on different scales (kW vs. raw values); therefore, direct comparisons of absolute RMSE/MAE are only valid within the same scale, whereas R²/EVS/AUC remain comparable regardless of the scale.

A key methodological aspect is the rigorous implementation of a leakage-safe experimental design. All preprocessing parameters and extreme thresholds were determined solely from the training dataset and subsequently fixed for the hold-out and cross-validation folds. This approach minimizes the risk of overestimating the model performance and allows us to attribute the advantages of DP-STH++ to its architecture and training procedure, rather than to artifacts resulting from information leakage.

The limitations of this study should be interpreted with caution. First, the results were obtained using a single combined dataset that reflected the characteristics of a specific set of observations. The transferability to other climatic zones and generation modes requires independent validation. Second, a forecast horizon of H = 1 with a window of L = 24 was considered, and alternative horizons may alter the balance between the local and distant dependencies. Third, the use of relative metrics (MAPE) for series with frequent small values limits the interpretability of these metrics, necessitating that the conclusions be supported by absolute error and consistency metrics.

Results are obtained on a single dataset with verified seasonal coverage (Figure 10), limiting direct generalization to other climates or regions. The fixed one-hour horizon and 24 h context may not capture all dynamics at longer scales. MAPE instability in low-generation periods motivates reliance on robust metrics (MASE, AUC). Future validation on diverse datasets is recommended.

In addition to predictive performance, computational efficiency is important essential for real-world deployment. Figure 11 compares training time, inferencelatency, parameter count, and FLOPs across the evaluated models. DP-STH++ has higher computational cost than simpler architectures due to its hybrid design, but remains lighter than full transformer models while achieving superior or competitive predictive performance and improved extreme-event detection. This reflects a practical trade-off between forecasting accuracy and computational cost.

Overall, the results suggest that, for the setup under consideration, practical advantages arise from a combination of (i) parallel hybridization of causal encoders, (ii) multitask optimization of regression and extrema prediction, and (iii) a strict leak exclusion protocol. Although certain simpler architectures (e.g., MTL_GRU) occasionally show marginally higher AUC for solar generation, DP-STH++ consistently ranks 1st across both hold-out and time-series cross-validation protocols when evaluated by robust aggregated metrics. On hold-out, it achieves RMSE = 257.18, MASE = 0.2438 and AUC = 0.9896; in CV, the model retains rank 1 with mean MASE = 0.3883. This confirms that the proposed hybrid causal fusion of LSTM, GRU, Conv1D and lightweight transformer provides stable advantages, particularly valuable under high non-stationarity conditions typical for wind power series. Thus, within the rigorous leakage-safe protocol applied, the superiority of DP-STH++ over the compared baselines is consistently supported.

5. Conclusions

This study introduces and experimentally validates DP-STH++, a multi-task hybrid model designed for the short-term forecasting of hourly solar and wind generation at a forecast horizon of H = 1 and a context window of L = 24. The proposed approach integrates the causal branches of LSTM, GRU, TCN, and a lightweight causal transformer, employing a common representation fusion and four output heads: two for regression (solar/wind) and two for classification, aimed at detecting extreme modes defined by the 90th percentile of the training sample.

Experimental results on a hold-out set (80/20 split) demonstrate that DP-STH++ achieves superior regression performance for both target series. Specifically, for solar generation, the model attained minimum RMSE and MAE values, alongside maximum R²/EVS, signifying enhanced accuracy in capturing daily dynamics and a greater proportion of explained variability. For wind generation, DP-STH++ exhibited the best regression performance and maximum AUC for extreme value detection, underscoring the effectiveness of the architecture in a more non-stationary and volatile environment. Concurrently, MTL_GRU leads in AUC for Solar, highlighting the distinction between continuous forecast optimization and maximization of classification metrics at peak values.

This study introduces DP-STH++, a hybrid spatio-temporal multitask model that achieves leading performance among the evaluated baselines under the considered experimental protocol for both solar and wind power generation (H = 1, L = 24). On the hold-out set, DP-STH++ ranks 1st among the compared models with RMSE = 257.18, MASE = 0.2438, R² = 0.9440 and AUC = 0.9896. The model maintains leadership in rigorous time-series cross-validation (mean MASE = 0.3883, rank 1), confirming the robustness of the obtained results. The most significant gains are observed for the wind channel (RMSE = 258.85, MASE = 0.1631, AUC = 0.9880–0.9908), where the hybrid causal architecture effectively handles high non-stationarity and sharp peaks. These results, obtained under a strict leakage-safe protocol with train-only parameter estimation, support the practical applicability of DP-STH++ as a unified solution for accurate baseline forecasting and reliable extreme event detection in renewable energy systems.

The results are obtained on a single dataset (8752 hourly observations), which limits direct generalization. Future validation on additional multi-year and multi-region datasets is required.

The temporal cross-validation results corroborated the reproducibility of the obtained estimates and revealed that stability was significantly influenced by the architecture class. Seasonally naive forecasting proves unsuitable for wind series (as indicated by negative consistency indicators), whereas certain STT/MTL_CNN configurations exhibit instability across folds. In contrast, DP-STH++ maintains competitive scale-invariant quality indicators and demonstrates a practically significant ability to identify extreme modes, particularly for wind generation.

The scientific and methodological significance of this study lies in the implementation of DP-STH++ within a strictly leak-resistant pipeline. All preprocessing parameters and extreme thresholds were evaluated exclusively on the training data and subsequently fixed for application to the hold-out set and cross-validation folds. This rigorous protocol enhances the accuracy of comparisons and the reproducibility of the results when evaluating multitask models for time-series forecasting.

From a practical standpoint, DP-STH++ can be used as a unified tool that simultaneously delivers accurate short-term power forecasts and risk-oriented signaling of extreme modes. This capability aligns with the operational management requirements of power systems with a high proportion of renewable energy sources, where both minimizing base forecast errors and timely detection of potentially critical generation peaks/troughs are essential.

Author Contributions

Conceptualization, M.K. and A.Z.; Methodology, G.T.; Software, G.T.; Validation, G.T. and M.K.; Formal analysis, G.T.; Investigation, G.T.; Resources, Z.A.; Data curation, G.T. and Z.A.; Writing—original draft preparation, G.T. and Z.A.; Writing—review and editing, G.T. and Z.A.; Visualization, G.T.; Supervision, A.Z. and M.K.; Funding acquisition, G.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available in Mendeley Data at https://doi.org/10.17632/gxc6j5btrx.1 (accessed on 12 February 2026). The dataset, titled “Wind and Solar Power Generation Dataset” (v1), was published on 10 October 2024 by Yue Liu and is distributed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Acknowledgments

The authors used ChatGPT 5.2 (OpenAI) and Grammarly Premium to improve the clarity and readability of the manuscript. All AI-assisted content was carefully reviewed and edited by the authors, who take full responsibility for the final version of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MTL	Multi-Task Learning
RNN	Recurrent Neural Network
LSTM	Long Short-Term Memory
GRU	Gated Recurrent Unit
CNN	Convolutional Neural Network
TCN	Temporal Convolutional Network
STT	Spatio-Temporal Transformer
AUC	Area Under the Curve
RMSE	Root Mean Squared Error
MAE	Mean Absolute Error
EVS	Explained Variance Score
EXT_Q	Quantile threshold for extreme event definition

References

Weron, R. Electricity Price Forecasting: A Review of the State-of-the-Art with a Look into the Future. Int. J. Forecast. 2014, 30, 1030–1081. [Google Scholar] [CrossRef]
Hong, T.; Fan, S. Probabilistic Electric Load Forecasting: A Tutorial Review. Int. J. Forecast. 2016, 32, 914–938. [Google Scholar] [CrossRef]
Voyant, C.; Notton, G.; Kalogirou, S.; Nivet, M.-L.; Paoli, C.; Motte, F.; Fouilloy, A. Machine Learning Methods for Solar Radiation Forecasting: A Review. Renew. Energy 2017, 105, 569–582. [Google Scholar] [CrossRef]
Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are Transformers Effective for Time Series Forecasting? arXiv 2022, arXiv:2205.13504. [Google Scholar] [CrossRef]
Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-Term Series Forecasting. arXiv 2022, arXiv:2201.12740. [Google Scholar] [CrossRef]
Su, L.; Zuo, X.; Li, R.; Wang, X.; Zhao, H.; Huang, B. A Systematic Review for Transformer-Based Long-Term Series Forecasting. Artif. Intell. Rev. 2025, 58, 80. [Google Scholar] [CrossRef]
Woo, G.; Liu, C.; Sahoo, D.; Kumar, A.; Hoi, S. ETSformer: Exponential Smoothing Transformers for Time-Series Forecasting. arXiv 2022, arXiv:2202.01381. [Google Scholar] [CrossRef]
Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series Is Worth 64 Words: Long-Term Forecasting with Transformers. arXiv 2022, arXiv:2211.14730. [Google Scholar]
Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in Time Series: A Survey. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, Macao, China, 19-25 August 2023; International Joint Conferences on Artificial Intelligence Organization: Macau, China, 2023; pp. 6778–6786. [Google Scholar]
Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. iTransformer: Inverted Transformers are Effective for Time Series Forecasting. arXiv 2023, arXiv:2310.06625. [Google Scholar]
Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2404.19756. [Google Scholar] [PubMed]
Tang, Y.; Yang, K.; Zhang, S.; Zhang, Z. Wind Power Forecasting: A Hybrid Forecasting Model and Multi-Task Learning-Based Framework. Energy 2023, 278, 127864. [Google Scholar] [CrossRef]
Wei, J.; Wu, X.; Yang, T.; Jiao, R. Ultra-Short-Term Forecasting of Wind Power Based on Multi-Task Learning and LSTM. Int. J. Electr. Power Energy Syst. 2023, 149, 109073. [Google Scholar] [CrossRef]
Wang, S.; Sun, Y.; Zhang, W.; Chung, C.Y.; Srinivasan, D. Very Short-Term Wind Power Forecasting Considering Static Data: An Improved Transformer Model. Energy 2024, 312, 133577. [Google Scholar] [CrossRef]
Salman, D.; Direkoglu, C.; Kusaf, M.; Fahrioglu, M. Hybrid Deep Learning Models for Time Series Forecasting of Solar Power. Neural Comput. Appl. 2024, 36, 9095–9112. [Google Scholar] [CrossRef]
Huan, J.; Deng, L.; Zhu, Y.; Jiang, S.; Qi, F. Short-to-Medium-Term Wind Power Forecasting through Enhanced Transformer and Improved EMD Integration. Energies 2024, 17, 2395. [Google Scholar] [CrossRef]
Shringi, S.; Saini, L.M.; Aggarwal, S.K. A Review of Data-Driven Deep Learning Models for Solar and Wind Energy Forecasting. Renew. Energy Focus 2025, 55, 100739. [Google Scholar] [CrossRef]
Gupta, M.; Arya, A.; Varshney, U.; Mittal, J.; Tomar, A. A Review of PV Power Forecasting Using Machine Learning Techniques. Prog. Eng. Sci. 2025, 2, 100058. [Google Scholar] [CrossRef]
Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. The M4 Competition: Results, Findings, Conclusion and Way Forward. Int. J. Forecast. 2018, 34, 802–808. [Google Scholar] [CrossRef]
Lim, B.; Zohren, S. Time-Series Forecasting with Deep Learning: A Survey. Philos. Trans. R. Soc. Math. Phys. Eng. Sci. 2021, 379, 20200209. [Google Scholar] [CrossRef] [PubMed]
Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding Deep Learning (Still) Requires Rethinking Generalization. Commun. ACM 2021, 64, 107–115. [Google Scholar] [CrossRef]

Figure 1. Data Preparation and Processing Pipeline. Source: Own calculations based on experimental results.

Figure 2. Hybrid multi-branch causal architecture DP-STH++ for the joint forecasting of solar and wind power generation and the detection of extreme events. Gray blocks represent the feature extraction branches (LSTM, GRU, causal Conv1D, and light causal Transformer). Blue blocks denote regression heads for solar and wind generation forecasting, while red blocks correspond to classification heads for extreme event detection. Source: Own calculations based on experimental results.

Figure 3. Sensitivity of extreme-event detection metrics (AUC, PR-AUC, and F1) to quantile threshold q for Solar and Wind targets. Source: Own calculations based on experimental results.

Figure 4. Confusion matrices for extreme-event detection at q = 0.90 for Solar and Wind (DP-STH++ and DLinear_proxy). Positive class corresponds to quantile-defined extreme events. Source: Own calculations based on experimental results.

Figure 5. Comparison of actual and forecast solar generation for the DP-STH++ model on training and deferred intervals (H = 1, L = 24; time stamps of target windows). Source: Own calculations based on experimental results.

Figure 6. Comparison of actual and forecast wind generation for the DP-STH++ model on training and deferred intervals (H = 1, L = 24; time stamps of target windows). Source: Own calculations based on experimental results.

Figure 7. Actual and forecast series DP-STH++. Source: Own calculations based on experimental results.

Figure 8. Parametric sensitivity of RMSE_mean to forecast horizon (H) and context window length (L). Source: Own calculations based on experimental results.

Figure 9. Comparative computational cost profile: number of parameters, estimated FLOPs per sample, training time per epoch, and inference latency for all evaluated models. Source: Own calculations based on experimental results.

Figure 10. Seasonal structure of the sample and average generation. Source: Own calculations based on experimental results.

Figure 11. Diagnosis of computational cost. Source: Own calculations based on experimental results.

Table 1. Related Work Summary (2022–2025).

Reference	Architectural Focus	Application/Task	Key Contribution	Relevance to This Study
Zeng et al. [4]	Linear baseline (DLinear)	Time series forecasting	Demonstrates strong linear baselines under fair evaluation	Motivates rigorous comparison with simple models
Zhou et al. [5]	Frequency-enhanced Transformer (FEDformer)	Long-term TSF	Frequency decomposition for non-stationary series	Supports decomposition ideas for energy data
Su et al. [6]	Transformer review	Long-term TSF	Evaluation best practices	Supports leakage-free protocol
Woo et al. [7]	ETS-Transformer	TSF	Trend/seasonality separation via exponential smoothing	Guides interpretation of seasonal structure
Nie et al. [8]	PatchTST	Long-term forecasting	Patch-based Transformer processing	Used as modern transformer reference
Wen et al. [9]	Survey (Transformers in TS)	General TSF	Architecture taxonomy and evaluation criteria	Defines comparison principles
Liu et al. [10]	iTransformer	TSF	Variate-centric transformer modeling	Supports modern transformer lines
Liu et al. [11]	KAN	Nonlinear modeling	Spline-based functional approximation	Motivates nonlinear basis baseline
Tang et al. [12]	Hybrid + MTL	Wind power	Multitask hybrid wind forecasting	Justifies hybrid MTL for WPF
Wei et al. [13]	MTL + LSTM	Ultra-short-term wind	Joint task learning	Supports multitask short-term setup
Wang et al. [14]	Improved Transformer	Very short-term wind	Static feature integration	Supports feature-rich WPF
Salman et al. [15]	Hybrid DL models	Solar forecasting	CNN/LSTM/Transformer hybridization	Confirms hybrid benefit for PV
Huan et al. [16]	Transformer + EMD	Wind forecasting	Decomposition + Transformer	Supports non-stationary modeling
Shringi et al. [17]	Review (Solar/Wind DL)	RES forecasting	Deep learning overview	Confirms hybrid relevance
Gupta et al. [18]	Review (PV ML)	PV forecasting	ML/DL status update	Contextualizes forecasting trends

Source: Own calculations based on experimental results.

Table 2. Holdout comparison of models on the Solar task

H = 1, L = 24

Table 2. Holdout comparison of models on the Solar task

H = 1, L = 24

Model	RMSE	MAE	MAPE	R²	EVS	AUC
MTL_LSTM	683.381	401.431	747.145	0.5847	0.6458	0.9692
MTL_GRU	659.721	417.619	803.721	0.613	0.6702	0.9746
MTL_LSTM_GRU	698.835	437.48	830.268	0.5658	0.6474	0.969
STT	605.6	395.856	472.174	0.6739	0.6815	0.9654
MTL_LSTM_GRU_STT	700.222	430.233	636.075	0.564	0.6138	0.9554
Seasonal Naive (lag = 24)	837.41	390.761	454.253	0.3765	0.3765	0.9139
MTL_CNN	1227.349	976.358	1227.408	−0.3394	0.2487	0.9562
DP-STH++	524.266	287.904	213.81	0.7556	0.7556	0.9547

Note: (The best values are highlighted in bold: RMSE/MAE/MAPE—minimum; /EVS/AUC—maximum). Source: Own calculations based on experimental results.

Table 3. Holdout comparison of models on the task “Wind” (80/20),

H = 1, L = 24

.

Table 3. Holdout comparison of models on the task “Wind” (80/20),

H = 1, L = 24

.

Model	RMSE	MAE	MAPE	R²	EVS	AUC
MTL_LSTM	326.093	249.951	2965.786	0.914	0.9146	0.9845
MTL_GRU	308.415	236.114	3731.247	0.923	0.924	0.9837
MTL_LSTM_GRU	353.6	283.348	2944.039	0.8988	0.8991	0.9801
STT	494.004	377.678	8389.498	0.8025	0.8069	0.9725
MTL_LSTM_GRU_STT	281.384	213.445	3801.351	0.9359	0.9385	0.9902
Seasonal Naive (lag = 24)	1358.379	1016.758	30,869.579	−0.493	−0.4929	0.6396
MTL_CNN	588.344	456.046	9826.748	0.7199	0.7209	0.9701
DP-STH++	241.769	174.863	3666.379	0.9527	0.9529	0.9908

Note: (the best values are highlighted in bold: RMSE/MAE/MAPE—minimum;

R^{2}

/EVS/AUC—maximum). Source: Own calculations based on experimental results.

Table 4. Component metrics for Solar and Wind targets.

Model	Target	RMSE	MASE	AUC	R²
DP-STH++	Solar	255.5193	0.3244	0.9912	0.9421
MTL_CNN	Solar	326.2171	0.4745	0.9907	0.9056
STT	Solar	333.1434	0.5080	0.9956	0.9015
MTL_LSTM_GRU_STT	Solar	347.2721	0.5398	0.9957	0.8930
MTL_GRU	Solar	360.3811	0.6100	0.9858	0.8847
MTL_LSTM_GRU	Solar	372.9083	0.6376	0.9943	0.8766
MTL_LSTM	Solar	380.7784	0.6018	0.9881	0.8713
Seasonal_Naive_lag24	Solar	838.6086	0.8449	0.9136	0.3759
MTL_LSTM_GRU	Wind	252.2227	0.1677	0.9883	0.9487
DP-STH++	Wind	258.8494	0.1631	0.9880	0.9459
MTL_GRU	Wind	269.7237	0.1951	0.9897	0.9413
STT	Wind	281.6051	0.1834	0.9877	0.9360
MTL_LSTM_GRU_STT	Wind	290.6404	0.2098	0.9895	0.9318
MTL_CNN	Wind	311.4053	0.2144	0.9884	0.9217
MTL_LSTM	Wind	326.7746	0.2414	0.9866	0.9138
Seasonal_Naive_lag24	Wind	1358.335	0.9290	0.6387	−0.4890

Source: Own calculations based on experimental results.

Table 5. Cross-validation for the task “Solar”: mean ± std and 90% CI; AUC—mean.

Model	RMSE (Mean ± Std; CI90)	MAE (Mean ± Std; CI90)	MAPE (Mean ± Std; CI90)	R² (Mean ± Std; CI90)	EVS (Mean ± Std; CI90)	AUC (Mean)
MTL_LSTM	901.12 ± 182.35; [766.97–1035.26]	652.39 ± 200.82; [504.66–800.13]	671.50 ± 425.46; [358.51–984.50]	0.5519 ± 0.1566; [0.4367–0.6671]	0.5676 ± 0.1635; [0.4474–0.6879]	0.8827
MTL_GRU	816.10 ± 184.38; [680.45–951.74]	577.47 ± 173.38; [449.92–705.02]	551.42 ± 410.48; [249.44–853.40]	0.6178 ± 0.1928; [0.4759–0.7596]	0.6321 ± 0.1974; [0.4869–0.7773]	0.9153
MTL_LSTM_GRU	866.42 ± 186.47; [729.24–1003.59]	609.54 ± 175.08; [480.74–738.35]	581.70 ± 394.03; [291.83–871.57]	0.5737 ± 0.2040; [0.4236–0.7238]	0.5940 ± 0.2026; [0.4450–0.7430]	0.8739
STT	896.06 ± 214.96; [737.92–1054.20]	652.44 ± 166.03; [530.30–774.59]	687.17 ± 386.99; [402.48–971.87]	0.5696 ± 0.1207; [0.4808–0.6584]	0.6486 ± 0.0944; [0.5792–0.7180]	0.9234
MTL_CNN	1030.43 ± 398.85; [737.01–1323.85]	849.66 ± 396.97; [557.62–1141.70]	1076.56 ± 793.76; [492.61–1660.50]	0.3457 ± 0.5046; [−0.0256–0.7169]	0.6465 ± 0.1361; [0.5464–0.7466]	0.8743

Source: Own calculations based on experimental results.

Table 6. Cross-validation for the task “Wind”: mean ± std and 90% CI; AUC—mean.

Model	RMSE (Mean ± Std; CI90)	MAE (Mean ± Std; CI90)	MAPE (Mean ± Std; CI90)	R² (Mean ± Std; CI90)	EVS (Mean ± Std; CI90)	AUC (Mean)
MTL_LSTM_GRU_STT	905.44 ± 340.54; [654.92–1155.97]	688.68 ± 362.46; [422.03–955.33]	410.67 ± 175.05; [281.89–539.45]	0.5405 ± 0.2769; [0.3368–0.7442]	0.5846 ± 0.2779; [0.3802–0.7891]	0.9259
Seasonal Naive (lag = 24)	879.39 ± 65.99; [830.85–927.94]	478.44 ± 17.58; [465.51–491.37]	165.24 ± 56.03; [124.02–206.46]	0.5738 ± 0.0839; [0.5120–0.6355]	0.5738 ± 0.0839; [0.5121–0.6355]	0.8827
DP-STH++	901.74 ± 209.44; [747.67–1055.82]	545.76 ± 138.99; [443.51–648.02]	227.63 ± 188.44; [89.00–366.27]	0.5474 ± 0.1857; [0.4109–0.6840]	0.6025 ± 0.1456; [0.4954–0.7097]	0.8815

Source: Own calculations based on experimental results.

Table 7. Comparison of MAPE and MASE under low generation conditions.

Model	MAPE	MASE
DLinear_proxy	2.780 × 10⁹	0.2402
KAN_proxy	2.640 × 10⁹	0.2499
NBEATS_proxy	2.152 × 10⁹	0.3689
PatchTST_proxy	6.610 × 10⁹	0.5271
Seasonal_Naive_lag24	1.641 × 10⁸	0.8870
DP-STH++_proxy	5.910 × 10⁹	1.0619

Source: Own calculations based on experimental results.

Table 8. Reports AUC, PR-AUC, Precision, Recall, and F1 for different quantile thresholds.

Target	q	AUC	PR-AUC	Precision	Recall	F1
Solar	0.85	0.990377	0.904844	0.860000	0.767857	0.811321
Solar	0.90	0.994343	0.875996	0.600000	0.937500	0.731707
Solar	0.95	0.987339	0.342348	0.140000	1.000000	0.245614
Wind	0.85	0.978285	0.918525	0.974359	0.629834	0.765101
Wind	0.90	0.988255	0.937505	0.837607	0.852174	0.844828
Wind	0.95	0.996583	0.917066	0.478632	0.982456	0.643678

Source: Own calculations based on experimental results.

Table 9. Extended classification metrics for extreme-event detection at q = 0.90.

Model	Target	Positive_Rate_Test	AUC	PR-AUC	Precision	Recall	F1
DP-STH++	Solar	0.036655	0.994343	0.875996	0.600000	0.937500	0.731707
DLinear_proxy	Solar	0.036655	0.995996	0.897604	0.865385	0.703125	0.775862
DP-STH++	Wind	0.065865	0.988255	0.937505	0.837607	0.852174	0.844828
DLinear_proxy	Wind	0.065865	0.994045	0.953754	0.887931	0.895652	0.891775

Source: Own calculations based on experimental results.

Table 10. Sensitivity to forecast horizon (L = 24 fixed).

H	RMSE_Mean ↓	MASE_Mean ↓	AUC_Mean ↑
1	1056.8492	1.0619	0.8707
3	1138.2860	1.1413	0.8183
6	1206.1249	1.2107	0.7638

Source: Own calculations based on experimental results.

Table 11. Sensitivity to window length (H = 1 fixed).

L	RMSE_Mean ↓	MASE_Mean ↓	AUC_Mean ↑
12	995.1426	1.0374	0.9067
24	1056.8492	1.0619	0.8707
48	1141.5528	1.1565	0.8379

Source: Own calculations based on experimental results.

Table 12. Ablation study of DP-STH++ branch contributions.

Model	Branches	RMSE_Mean ↓	MASE_Mean ↓	AUC_Mean ↑	PR-AUC_Mean ↑	F1_Mean ↑
DP_full	LSTM+GRU+TCN+Trf	299.47	0.3419	0.9916	0.8865	0.7658
DP_no_LSTM	GRU+TCN+Trf	321.43	0.3377	0.9901	0.8940	0.8011
DP_no_GRU	LSTM+TCN+Trf	260.73	0.2534	0.9923	0.8936	0.7833
DP_no_TCN	LSTM+GRU+Trf	274.17	0.2728	0.9924	0.9203	0.7996
DP_no_Transformer	LSTM+GRU+TCN	264.12	0.2555	0.9934	0.9115	0.8167

Source: Own calculations based on experimental results.

Table 13. Presents the combined hold-out benchmark summary.

Model	Role	RMSE ↓	MASE ↓	AUC ↑	PR-AUC ↑	F1 ↑
PatchTST_proxy	Patch-based family proxy	461.13	0.527	0.9848	0.8552	0.5981
DLinear_proxy	Strong linear baseline	228.49	0.240	0.9950	0.9257	0.8338
KAN_proxy	Nonlinear basis baseline	238.58	0.250	0.9944	0.9315	0.8203
NBEATS_proxy	deep baseline (MLP residual family)	393.13	0.383	0.9929	0.8737	0.8116
DP-STH++	Proposed model	260.68	0.250	0.9913	0.9068	0.7883

Source: Own calculations based on experimental results.

Table 14. Robustness and generalization summary.

Robustness Check	What Is Evaluated	Key Observation
TimeSeries CV	Temporal transferability across folds	On a rolling CV, strong, simple baselines may exhibit slightly higher stability; this is explicitly acknowledged as a limitation and subject for future multi-dataset validation.
Train-fraction sensitivity	Dependence on training data volume	Performance degrades predictably as the train fraction decreases; trends remain consistent across models.
Statistical significance	Significance of performance differences	Improvements of DP-STH++ over several deep baselines are statistically supported; comparisons with strong linear baselines are interpreted cautiously without universal superiority claims.

Source: Own calculations based on experimental results.

Table 15. Paired statistical significance tests (bootstrap CI + permutation test).

Comparison	Target	ΔMSE (comp − DP)	95% CI	p-Value
DP-STH++ vs. MTL_CNN	Solar	+63,060.75	[46,901.17; 80,189.64]	0.000000
DP-STH++ vs. MTL_CNN	Wind	+26,866.24	[19,473.15; 34,191.39]	0.000000
DP-STH++ vs. DLinear_proxy	Solar	−5479.19	[−16,930.09; 4266.03]	0.295333
DP-STH++ vs. DLinear_proxy	Wind	−24,481.13	[−31,775.70; −18,450.60]	0.000000

Source: Own calculations based on experimental results.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tolegenova, G.; Zakirova, A.; Kalimoldayev, M.; Akhayeva, Z. Hybrid Spatio-Temporal Deep Learning Models for Multi-Task Forecasting in Renewable Energy Systems. Computers 2026, 15, 183. https://doi.org/10.3390/computers15030183

AMA Style

Tolegenova G, Zakirova A, Kalimoldayev M, Akhayeva Z. Hybrid Spatio-Temporal Deep Learning Models for Multi-Task Forecasting in Renewable Energy Systems. Computers. 2026; 15(3):183. https://doi.org/10.3390/computers15030183

Chicago/Turabian Style

Tolegenova, Gulnaz, Alma Zakirova, Maksat Kalimoldayev, and Zhanar Akhayeva. 2026. "Hybrid Spatio-Temporal Deep Learning Models for Multi-Task Forecasting in Renewable Energy Systems" Computers 15, no. 3: 183. https://doi.org/10.3390/computers15030183

APA Style

Tolegenova, G., Zakirova, A., Kalimoldayev, M., & Akhayeva, Z. (2026). Hybrid Spatio-Temporal Deep Learning Models for Multi-Task Forecasting in Renewable Energy Systems. Computers, 15(3), 183. https://doi.org/10.3390/computers15030183

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Hybrid Spatio-Temporal Deep Learning Models for Multi-Task Forecasting in Renewable Energy Systems

Abstract

1. Introduction

2. Materials and Methods

2.1. Data Source and Study Design

2.2. Leakage-Safe Data Preprocessing

2.3. Definition of Extreme Events

2.4. Model Architecture

Architectural Rationale and Complementarity of Branches

2.5. Multi-Task Learning and Loss Functions

2.6. Training Procedure and Evaluation Protocol

3. Results

3.1. Hold-Out Evaluation Results

3.2. Cross-Validation Results

3.2.1. Extreme-Event Threshold Sensitivity Analysis

3.2.2. Extreme-Event Detection Performance (Imbalance-Aware Evaluation)

3.3. Distributional Analysis

3.4. Summary of Quantitative Results

3.5. Ablation Study of Architectural Branch Contributions

3.6. Extended Benchmark Comparison

3.7. Robustness and Generalization Analysis

3.7.1. Consistency Audit and Reproducibility Checks

3.7.2. Statistical Significance Analysis

3.7.3. Computational Cost Analysis

4. Discussion

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI