1. Introduction
The integration of renewable energy sources into urban smart infrastructure is increasing. Although solar and wind generation are technologically and environmentally advantageous, they are highly susceptible to weather conditions, exhibiting pronounced non-stationarity and sharp intraday output fluctuations. These characteristics complicate power system operational management and raise the accuracy requirements for the short-term forecasts used for load balancing and power reserve allocation [1,2,3].
Recent time series studies show that strong linear baselines and transformer approaches can deliver competitive quality when the experiment is set up correctly [4,5,6,7,8,9,10,11]. For energy applications, particularly in wind and solar generation tasks, hybrid and multitask approaches are relevant because they allow the simultaneous consideration of non-stationarity, seasonal-daily patterns, and extreme modes [12,13,14,15,16,17]. Recent reviews emphasize fair comparison across architectural classes and reproducible evaluation protocols in renewable energy source (RES) forecasting tasks [6,17,18]. Based on this, the task of this work is formulated as the construction and verification of a unified short-term Solar/Wind forecasting pipeline with simultaneous regression estimation and detection of extreme events under a comparable experimental protocol.
This study frames short-term forecasting as a time-oriented inference task, predicting “from the past to the future” with a horizon of H = 1 and a context window of L = 24. Initial observations were first standardized to a uniform timescale by parsing, sorting, and removing duplicate timestamps, followed by the creation of calendar attributes: hour, day of week, month, and season. Subsequently, a unified feature set and target variables were constructed for both the solar and wind subsystems. Strict chronological splitting was applied using an 80/20 hold-out and walk-forward cross-validation. Finally, the target series was shifted by horizon H, retaining only valid indices [19,20].
A key practical consideration is the need to account for not only the average power levels but also rare, extreme operating modes. In this study, this was addressed by formalizing a separate binary classification task to identify these extremes. Extremes are defined using an EXT_Q quantile threshold, calculated solely on the training data (q = 0.90 in our experiments) and then applied without modification to the hold-out set and cross-validation folds. This results in a unified multi-task protocol encompassing two regressions (solar and wind) and two classifications of extremes [21].
Another critical methodological requirement is preventing data leakage. All transformation statistics, including the feature and target variable scaling parameters, were configured exclusively on the training data and then applied to the validation/test sets without recalculation. This ensures the reproducibility and temporal correctness of the evaluation [19].
The objective of this study was to develop and evaluate a hybrid causal DP-STH++ architecture for the joint forecasting of solar and wind generation and the detection of extreme events within a single model. The architecture employs the parallel fusion of four representation branches (LSTM, GRU, causal Conv1D, and a lightweight causal transformer). The aggregated representations were then combined and fed into multi-head outputs for regression and classification. An uncertainty-weighted balancing mechanism (LossScaleLayer) is used in the regression component to adaptively balance solar and wind regression losses [5,7,8,9,10]; prior studies have shown that uncertainty-based weighting stabilizes multi-task optimization in non-stationary time series. The experimental evaluation compares the proposed architecture against baseline multi-task learning (MTL) architectures (LSTM, GRU, their combinations, CNN, STT), as well as a seasonal naive benchmark (lag = 24), allowing us to assess the contributions of both the architectural design and the rigorous time-correct processing and validation protocol [20].
In contrast to several existing hybrid models that typically combine two architectural families (e.g., TCN + Transformer or graph-based models), DP-STH++ integrates four parallel causal branches with joint regression and extreme-event classification heads under a unified chronological evaluation protocol. The novelty lies in the parallel causal fusion mechanism and strict train-only threshold computation, rather than in the use of individual standard components.
The main scientific contributions of this study are as follows:
A leak-resistant multitask learning framework is proposed for the joint forecasting of solar and wind generation.
A hybrid spatiotemporal architecture is developed, integrating recurrent, convolutional, and transformer components within a causal framework.
Explicit modeling of extreme generation modes is implemented as a separate classification task.
A comprehensive evaluation is conducted to assess accuracy and stability against representative baselines under the adopted protocol.
The field of time-series forecasting has seen rapid development in recent years, with several studies questioning the necessity of complex transformer architectures when simpler linear models or properly configured baselines perform comparably or better under fair evaluation conditions [4,5,8,9,10]. At the same time, energy-specific applications benefit significantly from hybrid and multi-task frameworks that jointly model regression and classification tasks (e.g., extreme event detection) while handling the pronounced non-stationarity and sharp fluctuations typical of solar and wind generation [6,7,10,11,12].
Recent reviews published in 2024–2025 provide updated overviews of deep learning approaches for solar and wind forecasting, confirm the relevance of hybrid architectures, and stress the methodological importance of leakage-free experimental protocols and reproducible comparisons [13,14,15].
In light of these developments, the current work builds upon the latest methodological insights.
Table 1 summarizes a curated set of recent publications (2022–2025) that directly inform the design choices, baseline selection, evaluation protocol, and interpretation of results in this study.
Recent developments in Kolmogorov–Arnold Networks (KAN) and related basis-function-based architectures propose adaptive nonlinear modeling through spline-based functional representations. These models have demonstrated strong approximation capabilities in structured regression tasks and have recently attracted attention in time-series forecasting research.
In the experimental section, a KAN-inspired proxy baseline is included to ensure empirical positioning relative to nonlinear basis-function families under the same leakage-safe protocol and identical evaluation settings. While KAN-type approaches emphasize flexible functional approximation, the proposed DP-STH++ architecture focuses on complementary temporal representations within a unified multitask regression–classification framework.
2. Materials and Methods
2.1. Data Source and Study Design
This study addresses the challenge of short-term (hourly) forecasting of both solar and wind generation within a unified framework, focusing on two target series: kWh_solar_power_solar and kWh_wind_power_wind. The empirical database consists of a single chronologically ordered time series with 8752 observations and a consistent input dimension for both forecasting tasks. The input feature vector comprises meteorological indicators relevant to solar and wind power generation and calendar variables (hour, day of the week, month, and season).
The experimental design preserved chronological order throughout, with no shuffling of observations. A chronological 80/20 hold-out split was used, training models on the initial 80% of the data and testing on the final 20%. Walk-forward cross-validation (TimeSeriesSplit) was additionally implemented.
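The splitting scheme described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper names `holdout_split` and `walk_forward_folds` are hypothetical, the number of folds (`n_splits=5`) is an assumed value not stated in the text, and the fold layout simply mirrors the expanding-window behavior of scikit-learn's `TimeSeriesSplit`.

```python
import numpy as np

def holdout_split(n_samples, train_frac=0.8):
    """Chronological split without shuffling: first 80% train, last 20% test."""
    cut = int(n_samples * train_frac)
    idx = np.arange(n_samples)
    return idx[:cut], idx[cut:]

def walk_forward_folds(n_samples, n_splits=5):
    """Yield expanding-window (train_idx, val_idx) pairs.

    Each validation block lies strictly after its training block,
    so no future observation is ever used for fitting.
    """
    fold = n_samples // (n_splits + 1)
    idx = np.arange(n_samples)
    for k in range(1, n_splits + 1):
        train_end = n_samples - (n_splits - k + 1) * fold
        yield idx[:train_end], idx[train_end:train_end + fold]

# 8752 hourly observations, as reported for the dataset.
train_idx, test_idx = holdout_split(8752)
folds = list(walk_forward_folds(8752, n_splits=5))
```

Because every validation index is larger than every training index in the same fold, any statistic fitted on the training part is guaranteed to be free of information from the evaluated period.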
2.2. Leakage-Safe Data Preprocessing
The data preparation pipeline consisted of the following sequential operations (Figure 1):
Temporal Axis Alignment: Timestamps were parsed, data were sorted chronologically, and duplicate entries were removed.
Calendar Feature Generation: Calendar-based features, including hour, day of the week, month, and categorical season (subsequently encoded), are generated.
Working Dataset Construction: A working dataset comprising the target series and a complete set of input features (targets + features_all) was assembled. Rows containing missing values in the selected columns were then removed.
Time-Based Splitting and Horizon Adjustment: A time-based split (without shuffling) was performed using an 80/20 hold-out or walk-forward cross-validation approach. Target variables are shifted to a specified horizon (H), and only valid indices (rows with available target values at the horizon) are retained in the dataset.
Training-Only Scaling: Scaling parameters for features and target variables were estimated exclusively on the training dataset (fit x_scaler, y_scaler on TRAIN). These parameters were then applied to the hold-out or cross-validation folds without re-estimation.
Windowing: For sequential architectures, sliding windows of length L (L = 24 in the experiments) were created. The first L − 1 timestamps were discarded for the supervised samples. Optionally, a feature vector (X_last) representing the “flat” representation of the current state can be formed from the last window step.
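The last three steps of the pipeline (horizon shift, training-only scaling, and windowing) can be sketched as below. This is an illustrative example on synthetic data, not the authors' implementation: the function names are hypothetical, and z-score standardization is assumed because the text names `x_scaler` and `y_scaler` without specifying the scaler type.

```python
import numpy as np

def make_supervised(features, target, horizon=1):
    """Pair features at time t with the target at t + horizon; drop rows without a target."""
    return features[:-horizon], target[horizon:]

def fit_standardizer(x_train):
    """Estimate mean/std on TRAIN only (assumed z-score scaling)."""
    mean = x_train.mean(axis=0)
    std = np.maximum(x_train.std(axis=0), 1e-8)  # floor avoids division by zero
    return mean, std

def make_windows(x, y, length=24):
    """Sliding windows of `length` steps; the first length - 1 rows yield no sample."""
    n = len(x) - length + 1
    x_seq = np.stack([x[i:i + length] for i in range(n)])
    return x_seq, y[length - 1:]

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))   # synthetic stand-in for features_all
target = rng.normal(size=200)          # synthetic stand-in for one target series

X, y = make_supervised(features, target, horizon=1)
cut = int(len(X) * 0.8)                # chronological 80/20 split

mean, std = fit_standardizer(X[:cut])  # statistics from TRAIN only
X_train = (X[:cut] - mean) / std
X_test = (X[cut:] - mean) / std        # reuse TRAIN statistics, no refit

X_seq_train, y_seq_train = make_windows(X_train, y[:cut], length=24)
X_last = X_seq_train[:, -1, :]         # optional flat view: last step of each window
```

Fitting the scaler before the split, or refitting it on the hold-out part, would let test-period statistics influence training, which is exactly the leakage the protocol is designed to rule out.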
2.3. Definition of Extreme Events
Extreme events were formalized using a quantile threshold calculated solely from the training sample. For each target series, a quantile level (q = EXT_Q) was set independently (q = 0.90 in the presented experiments). The resulting threshold (thr) is then fixed and applied to the holdout or cross-validation sets without recalculation. A binary extreme label was assigned by comparing the observation to the threshold on the corresponding scale of the target variable, forming output y_ext for subsequent model evaluation using the AUC metric.
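A minimal sketch of the train-only threshold rule is given below. The helper names are hypothetical, the data are synthetic, and the use of a non-strict comparison (`>=`) is an assumption, since the text does not say whether observations equal to the threshold count as extreme.

```python
import numpy as np

EXT_Q = 0.90  # quantile level used in the presented experiments

def fit_extreme_threshold(y_train, q=EXT_Q):
    """Compute the threshold from the TRAIN part only."""
    return float(np.quantile(y_train, q))

def label_extremes(y, thr):
    """Binary extreme label on the same scale as the threshold (assumed >=)."""
    return (y >= thr).astype(np.int64)

rng = np.random.default_rng(42)
y = rng.gamma(shape=2.0, scale=100.0, size=1000)  # synthetic generation series
y_train, y_test = y[:800], y[800:]

thr = fit_extreme_threshold(y_train)       # fixed after this point
y_ext_train = label_extremes(y_train, thr)
y_ext_test = label_extremes(y_test, thr)   # same threshold, no recalculation
```

By construction roughly 10% of training observations are labeled extreme, whereas the positive rate on the hold-out part depends on how the later period compares with the training period.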
2.4. Model Architecture
In this study, we implemented and compared a set of multitask deep learning architectures, including MTL_LSTM, MTL_GRU, MTL_LSTM_GRU, STT, MTL_CNN, MTL_LSTM_GRU_STT, and the proposed hybrid model DP-STH++. To ensure a fair comparison, all models used a consistent input format in the form of a sequential window $X_{\mathrm{seq}} \in \mathbb{R}^{L \times D}$ (where L = 24 in the experiments and D is the number of input features) and a unified output format: two regression heads for predicting solar and wind generation and two classification heads for detecting extreme modes.
The proposed DP-STH++ architecture is constructed as a parallel causal fusion of four spatiotemporal branches, each generating its own representation of the window $X_{\mathrm{seq}}$:
Branch 1 (RNN): A causal LSTM recurrent encoder extracts short- and medium-term dependencies while maintaining causality.
Branch 2 (RNN): A causal GRU serves as an alternative recurrent encoder, complementing the LSTM in terms of inductive properties.
Branch 3 (TCN): A causal Conv1D stack extracts local and multiscale temporal patterns.
Branch 4 (STT): A lightweight causal transformer with a causal attention mechanism that accounts for a longer-range context within the window.
The output of each branch is aggregated using a global pooling operator (resulting in a fixed-dimensional vector), followed by feature fusion:
$$z = \mathrm{Fuse}\left(h_{1}, h_{2}, h_{3}, h_{4}\right),$$
where $h_{b}$ is the pooled representation of branch $b$. Subsequently, the vector $z$ is processed through a sequence of BatchNormalization, a Dense layer with 64 units and ReLU activation, and a dropout layer with a rate of 0.20 before being fed to the output heads. This architecture decouples the extraction of temporal features (within branches) from the unified multitask component (shared blocks and heads), ensuring compatibility with the baseline MTL architectures.
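The data flow from pooled branch outputs to the four heads can be sketched as a plain NumPy forward pass. This is only a shape-level illustration with random, untrained weights. Several choices are assumptions not confirmed by the text: global average pooling, concatenation as the fusion operator, 32 units per branch, and a normalization step without learned scale and shift.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, window, units, hidden = 8, 24, 32, 64  # `units` per branch is an assumed value

# Random stand-ins for the four branch outputs, each of shape (batch, window, units).
branch_outputs = [rng.normal(size=(batch, window, units)) for _ in range(4)]

def fusion_head(branches, w_dense, b_dense, w_out, b_out, eps=1e-5):
    # Global average pooling over the time axis: one vector per branch (assumed pooling type).
    pooled = [h.mean(axis=1) for h in branches]
    # Assumed fusion operator: concatenation along the feature axis.
    z = np.concatenate(pooled, axis=-1)
    # Batch-norm-style standardization (no learned scale/shift in this sketch).
    z = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)
    # Dense(64) + ReLU; dropout (rate 0.20) acts only during training and is omitted here.
    shared = np.maximum(z @ w_dense + b_dense, 0.0)
    logits = shared @ w_out + b_out
    y_reg = logits[:, :2]                         # solar and wind regression heads
    p_ext = 1.0 / (1.0 + np.exp(-logits[:, 2:]))  # solar and wind extreme-event heads
    return y_reg, p_ext

w_dense = rng.normal(scale=0.1, size=(4 * units, hidden))
b_dense = np.zeros(hidden)
w_out = rng.normal(scale=0.1, size=(hidden, 4))
b_out = np.zeros(4)

y_reg, p_ext = fusion_head(branch_outputs, w_dense, b_dense, w_out, b_out)
```

Keeping the shared block and heads identical across models means that only the branch encoders differ between DP-STH++ and the baseline MTL architectures.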
Architectural Rationale and Complementarity of Branches
The DP-STH++ architecture is not intended as a fundamentally new neural network class. Instead, it represents a structured architectural decomposition designed for the joint solution of two related tasks: (i) short-term regression forecasting of solar and wind generation and (ii) detection of rare extreme events under a leakage-safe protocol.
The design follows a complementarity principle, where each parallel branch captures a distinct aspect of temporal dynamics (see Figure 2):
LSTM branch—models long-term temporal inertia, daily periodicity, and smoothing effects typical for renewable generation series. This branch emphasizes stable representation of medium- and long-range dependencies.
GRU branch—provides a more compact recurrent representation with fewer parameters, which improves learning stability under limited data regimes. In our configuration, GRU serves as a complementary recurrent encoder focusing on faster adaptation to local dynamic changes.
TCN (causal Conv1D) branch—extracts short local temporal motifs and abrupt transitions (spikes and drops). Convolutional filters highlight multiscale local structures that recurrent encoders may smooth out.
Lightweight causal Transformer branch—performs contextual aggregation across the observation window while preserving strict causality (no access to future information). It models heterogeneous feature interactions and non-uniform temporal dependencies within the window.
Thus, the parallel composition is not an arbitrary engineering combination but a decomposition of temporal structure into complementary representations: memory/inertia (RNNs), local motifs (TCN), and contextual interactions (Transformer).
The contribution of DP-STH++ is therefore architectural–algorithmic and experimentally validated, rather than based on the invention of new layers. The goal is to construct a reproducible multitask framework that balances regression accuracy and extreme-event classification under realistic data constraints.
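The strict causality required of the convolutional and Transformer branches (no access to future information) can be illustrated with a small NumPy sketch. It is a simplified stand-in rather than the model's layers: the convolution is single-channel, the attention has one head and no learned projections, and both function names are hypothetical.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1D convolution whose output at step t depends only on x[t-K+1..t].

    Causality comes from left zero-padding with K - 1 values.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

def causal_self_attention(x):
    """Single-head self-attention over x of shape (L, D) with a causal mask."""
    length, dim = x.shape
    scores = x @ x.T / np.sqrt(dim)
    future = np.triu(np.ones((length, length), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # block attention to later steps
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(24, 3))      # one window with L = 24 steps
x_future = x.copy()
x_future[13:] += 5.0              # perturb only steps 13..23

kernel = rng.normal(size=3)
conv_a = causal_conv1d(x[:, 0], kernel)
conv_b = causal_conv1d(x_future[:, 0], kernel)
attn_a = causal_self_attention(x)
attn_b = causal_self_attention(x_future)
```

Outputs for steps 0 to 12 are identical in both runs, which is the property a causal branch must satisfy: changing later inputs cannot alter earlier representations.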
2.5. Multi-Task Learning and Loss Functions
Training was conducted in a multitask setting that combined two regression and two classification tasks. For solar and wind power generation regression, an uncertainty-weighted quadratic error with trainable scale parameters (implemented via LossScaleLayer (UW-Reg)) was employed. This approach enables the automatic balancing of contributions from the two regression tasks, thereby eliminating the need for manual coefficient selection.
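Let the true values on the horizon be $y_{k,t}$ and the model predictions be $\hat{y}_{k,t}$ for task $k \in \{\mathrm{solar}, \mathrm{wind}\}$. The regression loss is defined as:
$$\mathcal{L}^{(k)}_{\mathrm{reg}} = \frac{1}{N} \sum_{t=1}^{N} \left(y_{k,t} - \hat{y}_{k,t}\right)^{2}.$$
Then, the uncertainty-weighted part takes the form:
$$\mathcal{L}_{\mathrm{UW}} = \sum_{k} \left( \frac{1}{2\sigma_{k}^{2}}\, \mathcal{L}^{(k)}_{\mathrm{reg}} + \log \sigma_{k} \right),$$
where $\sigma_{k}$ are trainable parameters (uncertainty scales) for task $k$.
Extreme modes are specified by binary labels $y^{\mathrm{ext}}_{k,t} = \mathbb{1}\left[y_{k,t} \geq \mathrm{thr}_{k}\right]$, formed according to fixed thresholds $\mathrm{thr}_{k}$, calculated on the training part at level $q$ (in the experiments $q = 0.90$). Probabilistic predictions $p_{k,t}$, obtained using sigmoid heads, were used for classification. The classification losses are specified by the focal binary cross-entropy as follows:
$$\mathcal{L}^{(k)}_{\mathrm{cls}} = -\frac{1}{N} \sum_{t=1}^{N} \left[ \alpha \left(1 - p_{k,t}\right)^{\gamma} y^{\mathrm{ext}}_{k,t} \log p_{k,t} + (1 - \alpha)\, p_{k,t}^{\gamma} \left(1 - y^{\mathrm{ext}}_{k,t}\right) \log\left(1 - p_{k,t}\right) \right],$$
with focusing parameter $\gamma$ and class-balance weight $\alpha$.
The total loss function is expressed as follows:
$$\mathcal{L} = \mathcal{L}_{\mathrm{UW}} + \sum_{k} \mathcal{L}^{(k)}_{\mathrm{cls}}.$$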
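The two loss components can be sketched numerically as follows. The code assumes the common homoscedastic-uncertainty weighting, in which each task loss is scaled by 1/(2σ²) and regularized by log σ, and the standard focal binary cross-entropy. The exact parameterization inside LossScaleLayer and the focal hyperparameters are not given in the text, so `gamma=2.0` and `alpha=0.25` are assumed defaults and the function names are hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def uncertainty_weighted_loss(task_losses, log_sigmas):
    """Sum over tasks of L_k / (2 * sigma_k^2) + log(sigma_k).

    Parameterizing by log(sigma_k) keeps sigma_k positive during training.
    """
    total = 0.0
    for loss_k, s_k in zip(task_losses, log_sigmas):
        total += loss_k / (2.0 * np.exp(2.0 * s_k)) + s_k
    return float(total)

def focal_bce(y_true, p, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal binary cross-entropy; gamma and alpha are assumed values."""
    p = np.clip(p, eps, 1.0 - eps)
    pos = -alpha * (1.0 - p) ** gamma * y_true * np.log(p)
    neg = -(1.0 - alpha) * p ** gamma * (1.0 - y_true) * np.log(1.0 - p)
    return float(np.mean(pos + neg))

# Toy values: two regression losses and one classification batch.
uw = uncertainty_weighted_loss([2.0, 4.0], log_sigmas=[0.0, 0.0])

y_ext = np.array([1.0, 0.0, 1.0, 0.0])
p_ext = np.array([0.9, 0.1, 0.6, 0.4])
cls = focal_bce(y_ext, p_ext)
total = uw + cls
```

With γ = 0 and α = 0.5 the focal term reduces to half the ordinary binary cross-entropy, and increasing γ shrinks the contribution of confidently classified samples, which is useful when extreme events are rare.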
2.6. Training Procedure and Evaluation Protocol
All the models were trained using a consistent time-based validation protocol. The primary scenario employed an 80/20 hold-out split without shuffling, where training was conducted on the initial segment of the time series and evaluation was performed on the final segment. Furthermore, walk-forward cross-validation (TimeSeriesSplit) was implemented to assess the stability of the estimates across consecutive time folds.
The target values were shifted by the specified horizon (H = 1 in the presented experiments), and only valid window indices were used. All preprocessing parameters (including x_scaler, y_scaler, and the extreme-event thresholds) were computed exclusively on the training partition and subsequently applied to the hold-out set and CV folds without recalculation.
Regression performance was evaluated using RMSE, MAE, MAPE, R2, and EVS. Extreme event detection was evaluated using AUC-ROC. Component-wise regression and classification metrics for Solar and Wind targets are reported separately in Table 2.
To avoid ambiguity in metric interpretation, scale-dependent error measures (RMSE, MAE, MAPE) are computed and reported only after a consistent scale definition. When normalization is applied during training, final evaluation metrics are calculated after inverse transformation to the selected reporting scale.
Results expressed in different physical scales (“raw” and “kW”) are not directly comparable for scale-dependent metrics. Therefore, such results are presented in separate tables. Cross-model interpretation across different scales relies exclusively on scale-independent metrics (R2, EVS, AUC) under the same data-splitting protocol.
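The distinction between scale-dependent and scale-independent metrics can be checked with a short sketch. The data are synthetic, the helper names are hypothetical, and the factor 1000 is an arbitrary unit conversion chosen only for illustration.

```python
import numpy as np

def inverse_standardize(y_scaled, mean, std):
    """Map predictions from the training scale back to the reporting scale."""
    return y_scaled * std + mean

def regression_metrics(y_true, y_pred):
    err = y_true - y_pred
    var_y = np.var(y_true)
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "R2": float(1.0 - np.mean(err ** 2) / var_y),
        "EVS": float(1.0 - np.var(err) / var_y),
    }

rng = np.random.default_rng(1)
y_raw = rng.gamma(2.0, 500.0, size=500)
mean, std = y_raw.mean(), y_raw.std()  # in practice these come from TRAIN only

# Synthetic predictions produced on the standardized scale.
y_scaled_pred = (y_raw - mean) / std + rng.normal(scale=0.3, size=500)
y_pred_raw = inverse_standardize(y_scaled_pred, mean, std)

m_raw = regression_metrics(y_raw, y_pred_raw)
m_other = regression_metrics(y_raw / 1000.0, y_pred_raw / 1000.0)
```

RMSE and MAE change by the conversion factor, whereas R2 and EVS are unchanged. EVS is never smaller than R2, and the two coincide when the mean forecast error is zero.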
3. Results
3.1. Hold-Out Evaluation Results
Before presenting the results, it should be noted that MAPE may become numerically unstable in renewable energy forecasting due to zero or near-zero generation values (e.g., nighttime solar production or low wind regimes). Division by small denominators can lead to inflated percentage errors. Therefore, in this study, MAPE is not used as a primary ranking metric. Model comparison and conclusions are primarily based on MASE and consistent absolute metrics (RMSE, MAE), while MAPE is reported only for completeness.
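The instability of MAPE near zero can be reproduced on a synthetic solar-like series. The sketch assumes the usual MASE definition, in which the mean absolute error is scaled by the in-sample error of a seasonal naive forecast with a 24 h period. The text does not spell out the MASE variant, so that choice and the helper names are assumptions.

```python
import numpy as np

def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error; eps only guards against exact zeros."""
    denom = np.maximum(np.abs(y_true), eps)
    return float(100.0 * np.mean(np.abs(y_true - y_pred) / denom))

def mase(y_true, y_pred, y_train, season=24):
    """MAE scaled by the in-sample MAE of a seasonal naive forecast (assumed variant)."""
    scale = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return float(np.mean(np.abs(y_true - y_pred)) / scale)

rng = np.random.default_rng(0)
hours = np.arange(24 * 30)
daylight = np.maximum(0.0, np.sin((hours % 24 - 6) / 12 * np.pi))
# Daytime dome with random cloud attenuation plus a tiny night-time residual.
y = 1000.0 * daylight * rng.uniform(0.6, 1.0, hours.size) + rng.uniform(0.0, 0.05, hours.size)

y_train, y_test = y[:576], y[576:]
y_pred = y_test + 5.0          # constant absolute error of 5 units at every hour

mape_val = mape(y_test, y_pred)
mase_val = mase(y_test, y_pred, y_train)
```

The absolute error is the same at every hour, yet the percentage error is dominated by night-time points where the denominator is close to zero, while MASE stays small and comparable across regimes.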
Table 2 presents the quality metrics for solar generation (Solar) forecasting on an 80/20 hold-out set. The proposed DP-STH++ model exhibited superior regression performance compared to the other considered architectures. It achieved a minimum error of RMSE = 524.27 and MAE = 287.90, along with maximum values of R2 = 0.7556 and EVS = 0.7556, indicating a substantial proportion of explained variability and strong alignment of predictions with the observed dynamics. DP-STH++ also demonstrated the lowest MAPE value; however, this metric is reported for completeness and is not used for primary ranking due to its instability near zero values.
Concurrently, MTL_GRU demonstrated the best performance in solar extreme detection on the hold-out set (AUC = 0.9746), whereas DP-STH++ achieved an AUC of 0.9547. This suggests that the primary advantage of DP-STH++ for Solar energy lies in its accuracy in continuous prediction rather than in maximizing the AUC classification metric.
Table 3 presents the results for wind generation (Wind). Here, DP-STH++ leads on the key regression metrics (RMSE = 241.77, MAE = 174.86) as well as on the consistency metrics (R2 = 0.9527 and EVS = 0.9529). In the task of detecting wind extremes, DP-STH++ also achieves the highest AUC (0.9908), narrowly ahead of its closest competitor, MTL_LSTM_GRU_STT (0.9902).
It is important to note that the MAPE values for Wind are notably high (thousands and tens of thousands of percent) across all models. This is a common artifact of the MAPE when dealing with periods of near-zero power generation. Consequently, RMSE, MAE, R2, EVS, and MASE offer more reliable performance indicators for Wind, with MAPE serving only as a supplementary descriptive metric.
Component-wise regression and classification metrics for Solar and Wind are reported separately to avoid cross-target generalization. Although trained within a single unified protocol, the targets exhibit different statistical properties, resulting in different absolute metric values. This confirms the necessity of target-specific interpretation. The detailed component-wise results for both targets are summarized in Table 4.
3.2. Cross-Validation Results
Table 5 presents the final model stability and quality indicators obtained during the cross-validation for the solar channel. These indicators include absolute errors (RMSE and MAE), relative error (MAPE), consistency and explained variance (R2, EVS), and the quality of extreme event detection (AUC). This table facilitates the evaluation of both the average accuracy level and the dispersion of results across folds, which is crucial for assessing reproducibility.
The data presented in the table allow for several key observations.
Model Leadership on a Single Scale (kW) Based on Regression Metrics: Among the MTL variants operating on the kW scale, MTL_GRU exhibited the most balanced performance profile. It achieved the lowest average RMSE and MAE within this group, coupled with the highest R2 and EVS values. This suggests that, in this specific setting, the GRU encoder offers superior generalization capabilities compared to other “pure” recurrent MTL baselines.
Error Stability and Characteristics: The standard deviations reflect the sensitivity of each model to the data partitioning. For instance, STT displayed relatively moderate variability; however, it did not surpass MTL_GRU in terms of absolute error reduction. Conversely, MTL_CNN exhibited the greatest instability and poorest average accuracy (with a confidence interval that included negative values), indicating a potential risk of diminished performance across different folds.
Extreme Event Prediction (AUC) as a Distinct Quality Metric: Transformer/hybrid solutions excelled in terms of AUC. MTL_LSTM_GRU_STT demonstrated the highest average AUC, and STT also performed strongly within the distribution. This aligns with the understanding that attention mechanisms are often more effective in extracting subtle patterns crucial for classifying rare events, even if regression errors are not minimized.
To avoid ambiguity caused by mixed reporting scales, cross-validation results are presented separately for kW and raw scales. Scale-dependent metrics (RMSE, MAE, MAPE) are compared only within identical physical units, while R2, EVS, and AUC remain cross-scale interpretable under the same splitting protocol.
Table 6 presents a comparative analysis of the models under walk-forward cross-validation conditions for the wind channel. This analysis evaluated the regression accuracy (RMSE, MAE), relative errors (MAPE), forecast consistency with observations (EVS), and accuracy of extreme mode recognition via a probability head (AUC). This presentation format is valuable because an algorithm may demonstrate acceptable average error but exhibit instability across different folds, a characteristic reflected in the width of the 90% confidence interval.
The key conclusions drawn from this table are as follows:
Regression Quality and Stability: DP-STH++ and MTL_GRU exhibited the strongest wind regression profiles. DP-STH++ demonstrated the lowest average RMSE and MAE among the solutions presented, coupled with high R2 and EVS values, indicating a strong reproduction of wind generation variability. MTL_GRU also presents robust regression metrics, accompanied by a relatively narrow confidence interval for R2, suggesting a consistent performance across different folds in this experiment.
Extreme Value Detection Quality: DP-STH++ achieved the highest average AUC (0.9555), demonstrating its ability to effectively differentiate between peak and non-peak wind generation values. MTL_LSTM_GRU_STT also exhibited a high AUC (0.9249), confirming the utility of hybridization/attention mechanisms for the classification component, even with some variability in the regression errors.
Behavior of Relative Metrics on Wind (MAPE): The MAPE metric for the wind channel exhibited high values and wide intervals across many rows. This is a common characteristic of relative metrics when dealing with small or near-zero values, where the denominator can destabilize an estimate. Therefore, when interpreting wind results, emphasis should be placed on RMSE/MAE and R2/EVS, with MAPE considered an auxiliary indicator sensitive to low power levels.
Weak and Inadequate Baseline Benchmarks for Wind: Seasonal Naive (lag = 24) demonstrated negative R2 and EVS, indicating that it reproduces wind dynamics worse than a trivial constant forecast at the mean level of the corresponding folds. Similarly, MTL_CNN and STT exhibited decreased consistency (wide intervals and low average R2/EVS), suggesting insufficient stability of these configurations within the current training protocol and data volume.
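Why a seasonal naive forecast with lag = 24 can fall below a constant mean-level forecast is easy to see on synthetic data. The sketch below contrasts a series with a strong daily cycle against a persistent but non-periodic series. Both series and the helper names are illustrative assumptions and are unrelated to the study's dataset.

```python
import numpy as np

def seasonal_naive(y, lag=24):
    """Forecast y[t] with y[t - lag]; returns aligned (y_true, y_pred)."""
    return y[lag:], y[:-lag]

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

rng = np.random.default_rng(0)
n = 24 * 100
hours = np.arange(n)

# Series with a clear daily cycle (solar-like behavior).
periodic = np.sin(2 * np.pi * hours / 24) + rng.normal(scale=0.1, size=n)

# Persistent series without a daily cycle (wind-like behavior), modeled as AR(1).
noise = rng.normal(size=n)
persistent = np.zeros(n)
for t in range(1, n):
    persistent[t] = 0.9 * persistent[t - 1] + noise[t]

r2_periodic = r2_score(*seasonal_naive(periodic))
r2_persistent = r2_score(*seasonal_naive(persistent))
```

When values 24 h apart are almost uncorrelated, the squared error of the lagged forecast approaches twice the series variance, so the coefficient of determination becomes negative even though the same baseline works well for a strongly periodic target.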
Table 7 compares MAPE and MASE values specifically in low-generation regimes (actual generation below 5% of the observed maximum). MAPE shows significant inflation and instability, particularly for wind power and nighttime solar periods, justifying the use of MASE as the primary robust metric for model ranking and conclusions.
In summary, the results indicate that a combination of stable regression quality and reliable extreme value detection is crucial for wind-generation modeling. This combination is most evident in DP-STH++, whereas other architectures either lack stability across folds or do not provide sufficient accuracy in terms of basic error metrics.
3.2.1. Extreme-Event Threshold Sensitivity Analysis
The definition of extreme events in the main experiments is based on a quantile threshold q = 0.90 computed on the training subset and applied to the test data. To address concerns regarding the arbitrariness of this choice, a sensitivity analysis was conducted for q ∈ {0.85, 0.90, 0.95}.
Under q = 0.90, the proportion of positive extreme events on the test set is approximately 3.7% for Solar and 6.6% for Wind. This corresponds to rare but sufficiently represented events, enabling statistically stable estimation of PR-based metrics. The quantitative results of the threshold sensitivity analysis are presented in Table 8.
The results indicate that:
For Solar, q = 0.95 leads to excessive sparsity, significantly degrading PR-AUC and F1.
For Wind, q = 0.90 provides the best balance between Precision and Recall.
Lower thresholds (q = 0.85) increase event frequency but reduce extremeness.
These findings demonstrate that q = 0.90 represents a balanced operating point rather than an arbitrary choice. At the same time, the optimal quantile may depend on the risk tolerance and application context.
Increasing the quantile threshold increases event sparsity. While AUC remains relatively stable, PR-AUC and F1 degrade for Solar at q = 0.95 due to extreme class imbalance. The threshold q = 0.90 provides a balanced operating point across targets.
3.2.2. Extreme-Event Detection Performance (Imbalance-Aware Evaluation)
Given the class imbalance inherent in extreme-event detection, ROC-AUC alone may be insufficient for comprehensive evaluation. For q = 0.90, the positive class proportion in the test set is approximately 3.7% for Solar and 6.6% for Wind. Under such an imbalance, PR-AUC, Precision, Recall, F1-score, and confusion matrices provide a more informative assessment. The sensitivity of extreme-event detection metrics to the quantile threshold is illustrated in Figure 3.
Table 9 reports extended classification metrics for DP-STH++ and representative baselines.
The results show that for Solar, DP-STH++ operates in a high-recall regime (Recall = 0.9375), minimizing missed extreme events at the cost of lower Precision. In contrast, DLinear_proxy achieves higher Precision but lower Recall.
For Wind, both models demonstrate balanced detection performance, with DLinear_proxy slightly outperforming in PR-AUC and F1, indicating fewer classification errors under the current threshold.
Figure 4 presents the confusion matrices for Solar and Wind at q = 0.90.
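The threshold-dependent metrics discussed above follow directly from confusion-matrix counts. The sketch below uses a small hand-made example. The decision threshold of 0.5 on the predicted probability is an assumption, because the text does not state the operating threshold of the sigmoid heads, and the helper names are hypothetical.

```python
import numpy as np

def confusion_counts(y_true, y_prob, decision_threshold=0.5):
    """Return (tp, fp, fn, tn) for an assumed probability cut-off."""
    y_hat = (y_prob >= decision_threshold).astype(int)
    tp = int(np.sum((y_hat == 1) & (y_true == 1)))
    fp = int(np.sum((y_hat == 1) & (y_true == 0)))
    fn = int(np.sum((y_hat == 0) & (y_true == 1)))
    tn = int(np.sum((y_hat == 0) & (y_true == 0)))
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Ten toy samples: three true extremes, seven normal observations.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.3, 0.7, 0.6, 0.2, 0.1, 0.1, 0.05, 0.4])

tp, fp, fn, tn = confusion_counts(y_true, y_prob)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

Lowering the decision threshold moves a detector toward the high-recall regime (fewer missed extremes, more false alarms), which is the trade-off visible when comparing Precision and Recall between models.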
3.3. Distributional Analysis
Figure 5 illustrates the temporal dynamics of solar generation, with three distinct series displayed on the raw scale: actual values for the training interval (“Training (true)”), actual values for the holdout interval (“Holdout (true)”), and the model forecast (“Forecast”).
This visualization enables a qualitative assessment of the model’s ability to preserve the daily signal structure and transfer learned patterns from the training period to the subsequent period without temporal discontinuity.
The graph depicts a typical daily profile of solar power generation, characterized by near-zero values during the night and peaks during the day. The forecast curve for the holdout interval generally reproduces:
The phase and duration of the daytime generation window (the temporal position of the rise and fall relative to the actual curve).
The amplitude of the main peaks and the overall shape of the daytime “dome,” which are particularly important for evaluating performance using RMSE/MAE in tasks exhibiting pronounced daily cyclicality.
Behavior near zero levels, where relative distortions can occur due to small denominators (a common source of instability in MAPE and similar metrics during nighttime segments).
However, local discrepancies in the height and/or steepness of the fronts are noticeable in some peak areas, indicating the ongoing challenge of accurately reproducing extreme and rapidly changing modes, even with a correctly captured seasonal structure. Overall, visual inspection confirms that DP-STH++ retains both the shape of the daily pattern and the generation levels on the hold-out segment, which aligns with the quantitative performance metrics presented previously.
Figure 6 illustrates the dynamics of wind generation on the raw scale, segmented into three trajectories: actual values on the training section (“Training (true)”), actual values on the holdout section (“Holdout (true)”), and the model forecast (“Forecast”).
In contrast to the solar series, the wind process exhibits distinct non-stationarity, characterized by sharp fronts, short pulses, and episodes of power reduction to low values. Consequently, this visualization is particularly useful for assessing forecast quality based on pattern fidelity rather than solely relying on integral metrics.
The forecast curve for the holdout interval is generally consistent with the observed data in the following key aspects:
The temporal position of major rises and falls is reproduced accurately.
Large peak levels and their sequence are maintained, which is critical for practical reserve management scenarios.
The transitions to low values are adequately reflected, although local shifts and smoothing occur in some areas.
The remaining discrepancies were concentrated in areas with the most abrupt changes. In some instances, the model smooths out steep fronts or slightly underestimates/overestimates the amplitude of short-term spikes, which is a common characteristic when forecasting wind series with a fixed context window. Overall, the graph confirms that DP-STH++ effectively transfers patterns from the training segment to the hold-out segment, maintaining stable forecasts, even with high variability in wind generation.
In addition to scalar metrics, temporal alignment between predicted and observed profiles (Figure 7) confirms that the model preserves dynamic behavior across peak and low-generation intervals.
3.4. Summary of Quantitative Results
The quantitative results are summarized below.
DP-STH++ demonstrated the highest-quality regression prediction on the hold-out data (80/20 split) at H = 1 and L = 24. For solar power forecasting, it achieved minimum errors (RMSE = 524.27, MAE = 287.90) and maximum consistency (R2 = 0.7556, EVS = 0.7556). For wind power forecasting, it recorded the best values (RMSE = 241.77, MAE = 174.86, R2 = 0.9527, EVS = 0.9529).
The multitask approach provides practically significant detection of extreme modes. DP-STH++ achieved the highest AUC (0.9908) on the hold-out data for wind generation. For solar generation, the model’s AUC is 0.9547, confirming high detection quality, although MTL_GRU achieves the highest AUC for solar generation (0.9746).
Temporal cross-validation estimates confirmed that this quality is reproducible. They also showed that comparing absolute errors between models evaluated on different scales (kW vs. raw) requires care, whereas scale-invariant indicators (R2, EVS, and AUC) remain directly comparable.
The advantage of DP-STH++ is most evident for wind generation, where it achieves both high regression accuracy and the best AUC. For solar generation, the model’s advantage is primarily manifested in a reduction in absolute errors (RMSE/MAE) at a competitive but not maximum AUC level.
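The four regression metrics reported above can be sketched as follows. This is a minimal illustration using the standard textbook definitions, which are assumed (not confirmed) to match the ones used in the experiments; the function name and toy values are hypothetical. Under these definitions, R2 and EVS differ only when the forecast carries a constant bias, so nearly identical R2 and EVS values indicate a negligible mean error.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R2 and explained variance score (EVS).

    Assumes equal-length sequences and a non-constant y_true.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    mean_err = sum(errors) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e ** 2 for e in errors)
    # EVS removes the mean error first, so a constant bias does not lower it.
    ss_res_centered = sum((e - mean_err) ** 2 for e in errors)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
        "evs": 1.0 - ss_res_centered / ss_tot,
    }

# Toy values (not data from the study)
y_true = [0.0, 100.0, 200.0, 300.0]
unbiased = regression_metrics(y_true, [10.0, 90.0, 210.0, 290.0])
biased = regression_metrics(y_true, [50.0, 150.0, 250.0, 350.0])
print(unbiased)  # R2 and EVS coincide when the mean error is zero
print(biased)    # a constant offset lowers R2 but leaves EVS at 1.0
```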
Sensitivity to Forecast Horizon and Window Length.
To provide a detailed numerical comparison, model sensitivity was evaluated under varying forecast horizons H (with L fixed) and varying window lengths L (with H fixed), while keeping all other parameters unchanged.
The comparative results of the models are presented in
Table 10.
Performance degrades monotonically as
H increases: RMSE and MASE rise, while AUC decreases. The sensitivity of the model to different window lengths is summarized in
Table 11.
Within the tested range, the best performance is observed at L = 12, while larger windows lead to gradual degradation under the current dataset size.
Figure 8 provides quantitative evidence of model sensitivity to forecast horizon and context length under a fixed leakage-safe protocol. The monotonic increase in RMSE_mean with growing
H confirms expected degradation, while window variation reveals data-regime-dependent behavior.
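The role of H and L in such a sweep can be illustrated with a minimal windowing sketch. The helper `make_windows` and the toy series are hypothetical, and the (L, H) pairs in the loop are illustrative rather than the exact grid evaluated in the study. Each sample pairs a context of L past values with the value H steps after the last context point, so the target always lies strictly in the future of its context.

```python
def make_windows(series, window_length, horizon):
    """Build (context, target) pairs for one-step or multi-step-ahead forecasting.

    context: `window_length` consecutive past values
    target:  the value `horizon` steps after the last context point
    """
    pairs = []
    last_start = len(series) - window_length - horizon
    for start in range(last_start + 1):
        end = start + window_length            # exclusive end of the context
        context = series[start:end]
        target = series[end + horizon - 1]     # strictly after the context
        pairs.append((context, target))
    return pairs

# Toy stand-in for an hourly series; values equal their own index.
series = list(range(100))
for L, H in [(24, 1), (12, 1), (24, 3)]:
    pairs = make_windows(series, L, H)
    print(f"L={L:>2} H={H}: {len(pairs)} samples, first target index {pairs[0][1]}")
```

Larger H or L reduces the number of usable samples, which is one reason sensitivity results can depend on dataset size.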
3.5. Ablation Study of Architectural Branch Contributions
To quantify the contribution of each architectural branch, we conducted a systematic ablation study by removing one component at a time (LSTM, GRU, TCN, Transformer) while keeping all other elements and training settings unchanged.
Table 12 reports regression (RMSE, MASE) and classification (AUC, PR-AUC, F1) metrics averaged across targets.
The results demonstrate that removing the LSTM branch leads to the largest increase in RMSE (321.43), confirming the importance of long-term memory modeling for renewable energy forecasting.
Interestingly, some reduced configurations (e.g., without GRU or Transformer) achieve lower regression errors under the current dataset size. This suggests possible over-parameterization or gradient interaction effects in multitask training when model capacity increases.
These findings indicate that DP-STH++ should be interpreted as a configurable architectural family rather than a fixed, universally optimal configuration. The ablation results provide measurable evidence of branch contributions and clarify the practical trade-offs between model complexity, stability, and accuracy under limited data conditions.
3.6. Extended Benchmark Comparison
To make the baseline comparison more complete, we extended the experiments with additional benchmark families representing diverse architectural paradigms. These include linear decomposition models (DLinear family), patch-based transformer-style models (PatchTST proxy), nonlinear basis-function architectures (KAN proxy), and residual deep architectures (N-BEATS proxy).
All benchmark models were evaluated under an identical experimental protocol: leakage-safe preprocessing, identical chronological splits (80/20 hold-out), window length L = 24, forecast horizon H = 1, and identical evaluation metrics. This ensures that the comparison is methodologically consistent and reproducible.
All proxy benchmark models reported in
Table 13 were trained under the same experimental protocol as the proposed architecture and the main baselines. Specifically, an identical chronological 80/20 hold-out split was used, with context window
L = 24 and the same 12 input features. Input scaling was fitted exclusively on the training subset and applied to the test subset without leakage. For regression evaluation, predictions were produced in the raw scale, consistent with the primary protocol. Extreme-event thresholds (quantile-based) were computed on the training set and then applied to the hold-out data.
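The leakage-safe steps described above can be sketched as follows. This is an illustration under stated assumptions: z-score standardization is assumed because the scaler type is not named here, the 90th-percentile threshold follows the definition given in the Conclusions, and the synthetic `power` series and all helper names are hypothetical.

```python
import random
import statistics

def chronological_split(values, train_fraction=0.8):
    """Split without shuffling: the first part is train, the rest is hold-out."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]

def fit_standardizer(train):
    """Estimate scaling statistics from the training part only."""
    mean = statistics.fmean(train)
    std = statistics.pstdev(train) or 1.0  # guard against a constant series
    return mean, std

def quantile_threshold(train, q=0.9):
    """Linear-interpolation quantile computed on the training part only."""
    ordered = sorted(train)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

random.seed(0)
power = [max(0.0, random.gauss(500.0, 200.0)) for _ in range(1000)]  # synthetic data

train, test = chronological_split(power)
mean, std = fit_standardizer(train)
threshold = quantile_threshold(train, q=0.9)

train_scaled = [(x - mean) / std for x in train]
test_scaled = [(x - mean) / std for x in test]      # reuses train statistics
test_extreme = [int(x > threshold) for x in test]   # labels use the train threshold
```

Because neither the scaler nor the threshold ever sees hold-out values, the share of extreme labels in the hold-out part may differ from 10%, which is expected under this protocol.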
The purpose of these proxy models is not to redefine the overall ranking but to position the proposed architecture within diverse methodological families and to provide additional diagnostic insight under identical evaluation conditions.
The results indicate that strong linear baselines remain highly competitive in short-horizon forecasting, achieving low regression error in certain settings. Nonlinear basis-function models demonstrate strong classification performance in extreme-event detection.
The proposed DP-STH++ model does not dominate across all individual metrics; however, it provides a balanced trade-off between regression accuracy and extreme-event detection within a unified multitask framework. These findings suggest that increased architectural complexity does not automatically guarantee superiority, particularly under limited data regimes, and that model selection should consider both performance and task objectives.
3.7. Robustness and Generalization Analysis
The dataset used in this study consists of 8752 hourly observations (approximately one year), which represents a limited data regime for hybrid deep learning architectures. To avoid overstated claims and assess result stability, additional robustness analyses were conducted.
Rolling TimeSeries cross-validation (TS-CV) was applied to evaluate temporal transferability across sequential folds. In addition, a train-fraction sensitivity analysis was performed to examine performance behavior under reduced training data proportions. Finally, bootstrap-based statistical testing was conducted for key regression comparisons.
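A rolling time-series split of this kind can be sketched as below. The expanding-window scheme and the fold count of five are assumptions made for illustration, since the exact fold configuration is not stated here; the function name is hypothetical. The key property is that every test block lies strictly after its training block.

```python
def rolling_time_series_splits(n_samples, n_folds=5):
    """Yield (train_indices, test_indices) with an expanding training window.

    Each test block has equal length and starts where its training block ends.
    """
    test_size = n_samples // (n_folds + 1)
    for fold in range(n_folds):
        train_end = n_samples - (n_folds - fold) * test_size
        test_end = train_end + test_size
        yield range(0, train_end), range(train_end, test_end)

# 8752 matches the number of hourly observations in the dataset.
for fold, (train_idx, test_idx) in enumerate(rolling_time_series_splits(8752, n_folds=5)):
    print(f"fold {fold}: train [0, {train_idx.stop}) test [{test_idx.start}, {test_idx.stop})")
```

Preprocessing parameters and extreme-event thresholds would be re-estimated on the training indices of each fold, in line with the leakage-safe protocol.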
The aggregated robustness results are summarized in
Table 14.
The robustness analysis confirms that conclusions are restricted to the evaluated dataset and protocol. Strong linear baselines remain competitive under limited data conditions, highlighting the importance of extended multi-dataset validation in future work.
3.7.1. Consistency Audit and Reproducibility Checks
Given the presence of multiple evaluation regimes (hold-out, rolling TimeSeries cross-validation, ablation averages, and sensitivity analyses), numerical differences between tables may arise due to distinct splits and aggregation procedures.
To ensure internal consistency, all summary tables were regenerated from a single aggregated evaluation source. No manual editing of final metric values was performed. Automated checks verified:
absence of duplicate model–target rows,
reproducibility of reported summary metrics from raw evaluation outputs,
alignment between protocol definitions and table captions.
Differences in RMSE or related metrics across tables, therefore, reflect differences in evaluation protocol (e.g., fixed hold-out vs. rolling cross-validation vs. averaged ablation results) rather than reporting inconsistencies.
These measures ensure reproducibility and internal coherence of the reported experimental results.
3.7.2. Statistical Significance Analysis
To avoid conclusions based solely on mean metrics, paired significance testing was conducted on identical test timestamps. We report bootstrap 95% confidence intervals for mean error differences and paired permutation test p-values. DP-STH++ improves over MTL_CNN significantly for both Solar and Wind (p < 0.001). In comparison with DLinear_proxy, differences are not significant for Solar (p > 0.05), while DLinear_proxy is significantly better for Wind (p < 0.001). The results of the paired statistical significance tests are summarized in Table 15.
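The two procedures named above can be sketched as follows. This is a minimal illustration, not the exact implementation behind Table 15: the per-timestamp error measure (absolute error here), the resample counts, the percentile bootstrap, and the sign-flip form of the permutation test are all assumptions, and the error series are synthetic.

```python
import random
import statistics

def paired_bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def paired_permutation_pvalue(diffs, n_permutations=2000, seed=0):
    """Two-sided sign-flip test of H0: the mean paired difference is zero."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(diffs))
    hits = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of zero.
    return (hits + 1) / (n_permutations + 1)

# Synthetic per-timestamp absolute errors for two hypothetical models.
rng = random.Random(42)
errors_a = [abs(rng.gauss(0.0, 1.0)) for _ in range(300)]
errors_b = [abs(e + 0.5 + rng.gauss(0.0, 0.2)) for e in errors_a]  # B is worse by about 0.5
diffs = [a - b for a, b in zip(errors_a, errors_b)]

ci_low, ci_high = paired_bootstrap_ci(diffs)
p_value = paired_permutation_pvalue(diffs)
print(f"mean diff CI: [{ci_low:.3f}, {ci_high:.3f}], p = {p_value:.4f}")
```

Pairing on identical timestamps matters because both models face the same difficult hours, so the test compares the models rather than the intervals.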
3.7.3. Computational Cost Analysis
DP-STH++ has the largest parameter count and computational complexity among the evaluated deep models, resulting in higher training time and inference latency. This trade-off is explicitly acknowledged: DP-STH++ targets improved predictive quality and extreme-event detection rather than computational efficiency. The comparative computational cost profile of the evaluated models is presented in
Figure 9.
4. Discussion
The results demonstrate the efficacy of the DP-STH++ hybrid spatiotemporal architecture for multitask short-term forecasting of hourly solar and wind generation under conditions of significant non-stationarity and infrequent extreme events. By employing a unified, leakage-safe protocol and consistent experimental parameters (H = 1, L = 24), DP-STH++ exhibited superior regression performance for both the solar and wind channels on the hold-out sample. Specifically, for solar generation, the minimum RMSE and MAE values were observed concurrently with the maximum R2/EVS. For wind generation, optimal values were attained across all key regression metrics, coupled with maximum accuracy in extreme value detection, as measured by the AUC. This performance profile indicates that the proposed architecture effectively integrates the capacity to reproduce continuous power dynamics and reliably identifies peak generation modes.
The architectural advantage of DP-STH++ stems from the complementarity of parallel causal branches. Recurrent components (LSTM and GRU) facilitate the stable modeling of short- and medium-term dependencies, whereas the TCN branch enhances the extraction of local and multi-scale patterns in a causally consistent manner. The lightweight causal transformer increases the sensitivity to a more extended context. The aggregation of representations via pooling and subsequent concatenation in the fusion block allows for the summation of diverse inductive biases without compromising causality, which is a crucial feature when processing time-series data characterized by the simultaneous presence of regular cycles and short-term anomalous spikes.
A comparative analysis of the models revealed a distinction between the solar and wind channels regarding the nature of the achievable gains. For solar generation, DP-STH++ yielded the most significant improvement in regression accuracy on the hold-out set, whereas the simpler MTL_GRU configuration excelled in AUC. This suggests a practical trade-off: maximizing the quality of continuous forecasting and maximizing purely classification metrics of extremes are not always equally achievable by a single architecture, particularly given the pronounced nighttime intervals with near-zero values. For wind generation, the advantage of DP-STH++ was most pronounced, simultaneously improving the regression errors and providing the maximum AUC, which is consistent with the higher stochasticity of the wind process and the reduced effectiveness of purely seasonal heuristics.
The influence of metric selection requires careful consideration. For wind generation, the MAPE values are often inflated and exhibit high variability owing to periods of low power output, leading to division by near-zero values. Consequently, the assessment of model performance for wind generation should prioritize the RMSE/MAE and consistency metrics (R2, EVS), as well as the AUC for extreme operating modes. MAPE should be considered an auxiliary metric because of its potential instability.
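The instability of MAPE near zero can be seen in a small numerical sketch. The functions below use common textbook definitions of MAPE and MASE, assumed here for illustration; the `eps` guard, the naive-forecast lag `season`, and all toy values are hypothetical. Both forecasts have the same absolute error of 5 at every point, yet a single near-zero observation inflates MAPE by orders of magnitude while MASE is unchanged.

```python
def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error; `eps` only guards against division by zero."""
    ratios = [abs(t - p) / max(abs(t), eps) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)

def mase(y_true, y_pred, y_train, season=1):
    """MAE divided by the in-sample MAE of a naive forecast on the training series."""
    naive_errors = [abs(y_train[i] - y_train[i - season])
                    for i in range(season, len(y_train))]
    naive_mae = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / naive_mae

y_train = [100.0, 120.0, 90.0, 110.0, 130.0]

# Case A: all observations well above zero.
true_a, pred_a = [100.0, 120.0, 110.0], [105.0, 115.0, 115.0]
# Case B: same absolute errors, but one observation is close to zero.
true_b, pred_b = [100.0, 120.0, 0.5], [105.0, 115.0, 5.5]

mape_a, mape_b = mape(true_a, pred_a), mape(true_b, pred_b)
mase_a, mase_b = mase(true_a, pred_a, y_train), mase(true_b, pred_b, y_train)
print(f"MAPE: {mape_a:.2f}% vs {mape_b:.2f}%")
print(f"MASE: {mase_a:.4f} vs {mase_b:.4f}")
```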
The cross-validation results indicate that the model stability is significantly influenced by the architectural class. Certain solutions exhibited a wide range of indicator values across folds, with R2 values falling into low or even negative ranges (e.g., for some STT/MTL_CNN configurations in the wind task). In contrast, DP-STH++ maintains competitive scale-invariant indicators (R2/EVS/AUC), which are crucial for ensuring the reproducibility of the findings across different time divisions. It is important to note that the tables include rows calculated on different scales (kW vs. raw values); therefore, direct comparisons of absolute RMSE/MAE are only valid within the same scale, whereas R2/EVS/AUC remain comparable regardless of the scale.
A key methodological aspect is the rigorous implementation of a leakage-safe experimental design. All preprocessing parameters and extreme thresholds were determined solely from the training dataset and subsequently fixed for the hold-out and cross-validation folds. This approach minimizes the risk of overestimating the model performance and allows us to attribute the advantages of DP-STH++ to its architecture and training procedure, rather than to artifacts resulting from information leakage.
These results should be interpreted with caution in light of several limitations. First, the results were obtained using a single combined dataset that reflected the characteristics of a specific set of observations. The transferability to other climatic zones and generation modes requires independent validation. Second, a forecast horizon of H = 1 with a window of L = 24 was considered, and alternative horizons may alter the balance between the local and distant dependencies. Third, the use of relative metrics (MAPE) for series with frequent small values limits the interpretability of these metrics, necessitating that the conclusions be supported by absolute error and consistency metrics.
Results are obtained on a single dataset with verified seasonal coverage (Figure 10), limiting direct generalization to other climates or regions. The fixed one-hour horizon and 24 h context may not capture all dynamics at longer scales. MAPE instability in low-generation periods motivates reliance on robust metrics (MASE, AUC). Future validation on diverse datasets is recommended.
In addition to predictive performance, computational efficiency is essential for real-world deployment.
Figure 11 compares training time, inference latency, parameter count, and FLOPs across the evaluated models. DP-STH++ has higher computational cost than simpler architectures due to its hybrid design, but remains lighter than full transformer models while achieving superior or competitive predictive performance and improved extreme-event detection. This reflects a practical trade-off between forecasting accuracy and computational cost.
Overall, the results suggest that, for the setup under consideration, practical advantages arise from a combination of (i) parallel hybridization of causal encoders, (ii) multitask optimization of regression and extrema prediction, and (iii) a strict leak exclusion protocol. Although certain simpler architectures (e.g., MTL_GRU) occasionally show marginally higher AUC for solar generation, DP-STH++ consistently ranks 1st across both hold-out and time-series cross-validation protocols when evaluated by robust aggregated metrics. On hold-out, it achieves RMSE = 257.18, MASE = 0.2438 and AUC = 0.9896; in CV, the model retains rank 1 with mean MASE = 0.3883. This confirms that the proposed hybrid causal fusion of LSTM, GRU, Conv1D and lightweight transformer provides stable advantages, particularly valuable under high non-stationarity conditions typical for wind power series. Thus, within the rigorous leakage-safe protocol applied, the superiority of DP-STH++ over the compared baselines is consistently supported.
5. Conclusions
This study introduces and experimentally validates DP-STH++, a multi-task hybrid model designed for the short-term forecasting of hourly solar and wind generation at a forecast horizon of H = 1 and a context window of L = 24. The proposed approach integrates the causal branches of LSTM, GRU, TCN, and a lightweight causal transformer, employing a common representation fusion and four output heads: two for regression (solar/wind) and two for classification, aimed at detecting extreme modes defined by the 90th percentile of the training sample.
Experimental results on a hold-out set (80/20 split) demonstrate that DP-STH++ achieves superior regression performance for both target series. Specifically, for solar generation, the model attained minimum RMSE and MAE values, alongside maximum R2/EVS, signifying enhanced accuracy in capturing daily dynamics and a greater proportion of explained variability. For wind generation, DP-STH++ exhibited the best regression performance and maximum AUC for extreme value detection, underscoring the effectiveness of the architecture in a more non-stationary and volatile environment. Concurrently, MTL_GRU leads in AUC for Solar, highlighting the distinction between continuous forecast optimization and maximization of classification metrics at peak values.
This study introduces DP-STH++, a hybrid spatio-temporal multitask model that achieves leading performance among the evaluated baselines under the considered experimental protocol for both solar and wind power generation (H = 1, L = 24). On the hold-out set, DP-STH++ ranks 1st among the compared models with RMSE = 257.18, MASE = 0.2438, R2 = 0.9440 and AUC = 0.9896. The model maintains leadership in rigorous time-series cross-validation (mean MASE = 0.3883, rank 1), confirming the robustness of the obtained results. The most significant gains are observed for the wind channel (RMSE = 258.85, MASE = 0.1631, AUC = 0.9880–0.9908), where the hybrid causal architecture effectively handles high non-stationarity and sharp peaks. These results, obtained under a strict leakage-safe protocol with train-only parameter estimation, support the practical applicability of DP-STH++ as a unified solution for accurate baseline forecasting and reliable extreme event detection in renewable energy systems.
The results are obtained on a single dataset (8752 hourly observations), which limits direct generalization. Future validation on additional multi-year and multi-region datasets is required.
The temporal cross-validation results corroborated the reproducibility of the obtained estimates and revealed that stability was significantly influenced by the architecture class. Seasonally naive forecasting proves unsuitable for wind series (as indicated by negative consistency indicators), whereas certain STT/MTL_CNN configurations exhibit instability across folds. In contrast, DP-STH++ maintains competitive scale-invariant quality indicators and demonstrates a practically significant ability to identify extreme modes, particularly for wind generation.
The scientific and methodological significance of this study lies in the implementation of DP-STH++ within a strictly leak-resistant pipeline. All preprocessing parameters and extreme thresholds were evaluated exclusively on the training data and subsequently fixed for application to the hold-out set and cross-validation folds. This rigorous protocol enhances the accuracy of comparisons and the reproducibility of the results when evaluating multitask models for time-series forecasting.
From a practical standpoint, DP-STH++ can be used as a unified tool that simultaneously delivers accurate short-term power forecasts and risk-oriented signaling of extreme modes. This capability aligns with the operational management requirements of power systems with a high proportion of renewable energy sources, where both minimizing base forecast errors and timely detection of potentially critical generation peaks/troughs are essential.