4.1. Experimental Setups
4.1.1. Datasets
To validate the predictive performance of the GLSTM model, this study utilized the NREL WTDS dataset (5 min resolution, comprising data from 100 wind farms between 2018 and 2022; see Table 1 for details) and the China National Photovoltaic Monitoring Center dataset (15 min resolution, comprising data from 50 power stations between 2019 and 2023) [19,20]. The raw data were preprocessed by filling missing values using interpolation, and Min-Max normalization was applied to scale the data to the [0, 1] range [21]. The dataset was then partitioned into training, validation, and test sets using a time-ordered split with an 80–10–10% ratio, ensuring that each subset covered various weather patterns and seasonal variations [22]. The partitioning is shown in Table 2.
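As a concrete illustration of the preprocessing pipeline described above, the following minimal Python sketch fills missing values by interpolation, applies Min-Max scaling, and performs the time-ordered 80–10–10% split. The DataFrame layout and function name are illustrative assumptions, not taken from the paper.

```python
# A minimal preprocessing sketch, assuming the raw data are loaded into a
# pandas DataFrame indexed by timestamp with one power column per station.
import pandas as pd

def preprocess_and_split(df: pd.DataFrame):
    # Fill missing values by time-based linear interpolation.
    df = df.interpolate(method="time")

    # Min-Max normalization to the [0, 1] range, per column (as described in
    # the text; in practice the scaler statistics could be fit on the
    # training split only to avoid leakage).
    df = (df - df.min()) / (df.max() - df.min())

    # Time-ordered 80-10-10% split: no shuffling, so the test set lies
    # strictly after the training data in time.
    n = len(df)
    train = df.iloc[: int(0.8 * n)]
    val = df.iloc[int(0.8 * n) : int(0.9 * n)]
    test = df.iloc[int(0.9 * n) :]
    return train, val, test
```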
4.1.2. Experimental Environment and Parameter Settings
Table 3 presents the experimental parameter settings required for this study, providing a detailed overview of the key parameters and their configurations used during the experiments to ensure the accuracy and reproducibility of the results.
4.1.3. Baseline Methods
To demonstrate the effectiveness of the proposed method, it was compared with the following popular baselines:
MLP: a classic feedforward neural network with multiple hidden layers that captures relationships between input features through layer-wise non-linear mappings; commonly used for tabular data and simple time series problems.
LSTM: a recurrent neural network variant designed to address the vanishing gradient problem on long sequences; it excels at capturing long-term dependencies in sequential data and is widely used for time series forecasting.
GCN: a neural network for graph-structured data that captures spatial dependencies between nodes through graph convolution operations; widely applied to learning and prediction tasks on graph data.
ST-GCN: combines spatial graph convolution with temporal convolution to model both spatial and temporal dependencies in spatiotemporal data; commonly used in spatiotemporal forecasting tasks such as traffic flow prediction and video analysis.
Transformer: a model based on the self-attention mechanism that captures global dependencies in input sequences; especially adept at handling long sequences and widely used in natural language processing and time series forecasting.
Using the control variable method, suitable hyperparameters were selected for each baseline through multiple experiments, as detailed below:
MLP: hidden layers [256, 128, 64]; activation function GELU; dropout rate 0.5.
LSTM: 3 stacked layers; 256 hidden units; sequence processing via a sliding window of length T = 24 h.
GCN: fixed adjacency matrix based on station geographic distances (threshold = 50 km; see the sketch after this list); 3 layers; mean-pooling aggregation.
ST-GCN: 3 × 3 spatiotemporal graph convolution kernel (time × space); standard 1D temporal convolution; cross-layer skip-connection concatenation.
Transformer: 8 attention heads; learnable positional encoding; 4 decoder layers.
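As a point of reference for the GCN configuration, the following minimal sketch shows one way the fixed distance-based adjacency matrix (threshold = 50 km) could be constructed; the haversine helper and the (lat, lon) coordinate layout are illustrative assumptions, not the paper's actual code.

```python
# A sketch of a fixed, distance-thresholded adjacency: stations within
# 50 km of each other are connected.
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points, in kilometres.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def build_adjacency(coords: np.ndarray, threshold_km: float = 50.0) -> np.ndarray:
    # coords: (N, 2) array of (lat, lon) per station.
    n = len(coords)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if haversine_km(*coords[i], *coords[j]) <= threshold_km:
                adj[i, j] = adj[j, i] = 1.0  # symmetric, unweighted edge
    return adj
```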
4.2. Evaluation Metrics
In research on short-term renewable energy power forecasting, RMSE and MAE are commonly used evaluation metrics for measuring the accuracy and performance of forecasting models [23].
The RMSE is a commonly used metric for measuring the difference between a model's predicted values and the actual observed values. It accounts for both the magnitude and the fluctuation of the error, providing a balanced evaluation. The smaller the RMSE, the smaller the model's prediction error and the higher its accuracy:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N \cdot H} \sum_{i=1}^{N} \sum_{h=1}^{H} \left( P_{i,h} - \hat{P}_{i,h} \right)^{2}}$$
where
$P_{i,h}$: the actual power value of the $i$-th power station at the $h$-th hour;
$\hat{P}_{i,h}$: the corresponding predicted power value;
$N$: the total number of power stations;
$H$: the forecast horizon (hours).
The RMSE is more sensitive to larger errors, effectively reflecting extreme deviations in the prediction results. It has the same dimension as the target variable (power), making it easier to interpret intuitively. Minimizing the RMSE is equivalent to maximizing the global accuracy of the prediction results.
The MAE is the average of the absolute differences between the predicted values and the actual values. Unlike the RMSE, the MAE does not excessively penalize large errors.
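With the same notation as in the RMSE above, the MAE is defined as
$$\mathrm{MAE} = \frac{1}{N \cdot H} \sum_{i=1}^{N} \sum_{h=1}^{H} \left| P_{i,h} - \hat{P}_{i,h} \right|$$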
The MAE is not sensitive to outliers and provides a more stable reflection of the overall performance of a forecasting model. It directly represents the average absolute prediction error, making it easier for non-technical personnel to understand. Minimizing the MAE tends to yield the optimal solution at the median, making it suitable for scenarios with non-Gaussian error distributions.
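For concreteness, the following minimal numpy sketch computes both metrics exactly as defined above; the function and array names and the (N, H) layout (stations × forecast hours) are illustrative.

```python
# Minimal implementations of the two metrics defined above.
# y_true, y_pred: arrays of shape (N, H), stations x forecast hours.
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Root mean squared error: penalizes large deviations more heavily.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Mean absolute error: robust to outliers, linear penalty.
    return float(np.mean(np.abs(y_true - y_pred)))
```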
4.3. Experimental Results and Analysis
4.3.1. Comparison of Benchmark Model Performance
To comprehensively evaluate the effectiveness of the hybrid model proposed in this paper, it was first compared with existing mainstream benchmark models. The experimental results are shown in Table 4.
In the local analysis, the proposed model performed best in 1 h prediction, with an RMSE of 3.8% and an MAE of 2.3%, significantly outperforming the other models; it was followed by LSTM (RMSE 4.8%, MAE 3.0%) and ST-GCN (RMSE 4.5%, MAE 2.5%), while MLP performed worst (RMSE 5.2%, MAE 3.2%). For 6 h prediction, the proposed model maintained its superior performance, with an RMSE of 3.8%; LSTM (7.5%) and ST-GCN (7.2%) followed, while Transformer had a larger error, with an RMSE of 6.8% (2.6 percentage points higher). In terms of the MAE, the proposed model again performed best (3.5%), with LSTM (4.3%) and ST-GCN (4.1%) relatively close behind. For 24 h prediction, the errors of all models grew as the horizon lengthened, but the proposed model still achieved the smallest RMSE (7.0%) and MAE (3.9%). LSTM (RMSE 8.2%, MAE 4.8%) and ST-GCN (RMSE 8.0%, MAE 4.5%) exhibited higher errors, with the gap especially noticeable in the MAE; Transformer and GCN showed larger errors still, GCN's RMSE of 10.5% being significantly higher than that of the other models.
In the overall analysis, the proposed model performed excellently across all time spans, particularly in 1 h prediction, where both the RMSE and MAE were the lowest. Its training time of 50 min was significantly shorter than that of the Transformer model (90 min) without any notable loss of accuracy. Although LSTM performed well in 1 h prediction, its errors grew as the prediction horizon lengthened, with the RMSE reaching 8.2% for 24 h prediction. ST-GCN and Transformer performed better over longer time spans, but their longer training times and higher prediction errors may reduce efficiency in practical applications. The MLP model performed worst, especially for longer-term predictions, with larger RMSE and MAE than the other models, indicating a weaker ability to model temporal data.
4.3.2. Ablation Study on Component Contributions
To clearly understand the contribution of each module to the overall performance, particularly in short-term power forecasting tasks, this study conducted an ablation analysis [24,25], as detailed in Table 5.
Local Analysis: Experiment 1 showed that LSTM alone performed best in 1 h prediction (RMSE 5.0%), but its error grew as the horizon lengthened, indicating that, while LSTM captured temporal dependencies, it failed to model spatial information, degrading its long-term forecasting accuracy; its training time of 20 min was relatively fast. In Experiment 2, GAT alone yielded a higher RMSE, especially in 1 h prediction (7.5%), indicating that neglecting temporal sequence modeling led to poor prediction performance; its training time of 40 min exceeded LSTM's owing to the cost of processing graph structures. In Experiment 3, the fixed adjacency matrix produced a poorer RMSE, particularly in the 6 h and 12 h predictions, with slightly higher values than the baseline model (5.4% vs. 4.2%), showing that a fixed adjacency matrix cannot flexibly adapt to data changes; its training time of 30 min was shorter than the baseline's. The baseline model (LSTM + GAT with a dynamic adjacency matrix) achieved a lower RMSE in the 1 h, 6 h, and 12 h predictions and performed best in short-term forecasting (4.2%), demonstrating that the dynamic adjacency matrix effectively captured both spatial and temporal dependencies; although its training time of 60 min was the longest, the higher prediction accuracy justified the additional cost.
Overall Analysis: Experiments 1 and 2 showed that when LSTM or GAT was used individually, the lack of spatiotemporal dependency modeling reduced the prediction accuracy. The baseline model (LSTM + GAT with dynamic adjacency matrix) captured spatiotemporal dependencies through flexible adjustment of the adjacency matrix and demonstrated the best prediction accuracy. In Experiment 3, the fixed adjacency matrix reduced the computation time (shorter training time), but the prediction accuracy dropped, validating the importance of the dynamic adjacency matrix. Although the baseline model had the longest training time, the high accuracy justified the additional overhead.
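The paper does not list the layer equations behind the dynamic adjacency, so the following PyTorch sketch illustrates the standard single-head GAT attention computation that realises the idea: edge weights are recomputed from the current node features at every forward pass rather than fixed in advance. The class name, shapes, and masking scheme are illustrative assumptions.

```python
# A minimal single-head sketch of dynamic, attention-derived adjacency.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGraphAttention(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # feature projection
        self.a_src = nn.Linear(out_dim, 1, bias=False)    # source score term
        self.a_dst = nn.Linear(out_dim, 1, bias=False)    # destination score term

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node features; mask: (N, N) boolean, True where an
        # edge is allowed (include self-loops so every row has an entry).
        h = self.W(x)                                     # (N, out_dim)
        scores = self.a_src(h) + self.a_dst(h).T          # (N, N) pairwise logits
        scores = F.leaky_relu(scores, negative_slope=0.2)
        scores = scores.masked_fill(~mask, float("-inf")) # keep allowed edges only
        attn = torch.softmax(scores, dim=-1)              # the "dynamic adjacency"
        return attn @ h                                   # weighted aggregation
```

Each row of `attn` is a softmax-normalized set of edge weights that changes with the inputs; freezing it to a precomputed 0/1 matrix recovers the fixed-adjacency variant of Experiment 3.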
4.3.3. Model Robustness Testing
To test the robustness of the proposed model, this study designed experiments under extreme weather and dynamic environmental conditions [26]. The details are shown in Table 6.
Local Analysis: In Experiment 1, under extreme weather conditions, the model's prediction error (RMSE of 7.5%) was relatively high. Extreme weather caused abnormal changes in the data, increasing the uncertainty of the model's predictions; despite the shorter training time (20 min), extreme weather still degraded prediction accuracy.
In Experiment 2, under different regional climate conditions, the model adapted to various climatic environments. Although the training time was slightly longer (30 min), the prediction error (RMSE of 6.8%) remained relatively large owing to differences in data features and climate patterns between regions; the challenge of cross-regional adaptability lies mainly in differing climate patterns and uneven data distribution across regions.
In Experiment 3, when the data were corrupted by noise of varying intensity, the model's prediction error rose to an RMSE of 8.2%, with a training time of 40 min, reflecting the interference of noise: the stronger the noise, the more the model's robustness was taxed, potentially leading to overfitting or reduced prediction accuracy.
In Experiment 4, with missing data, particularly random missingness, periodic missingness, and data-interval gaps, the model's RMSE was 8.0%. Missing data weakened training effectiveness and led to less accurate predictions; although the training time was 35 min, the missing data still reduced the model's performance.
Experiment 5 combined extreme weather, noise, and missing data in a comprehensive test. With a training time of 50 min, the model's RMSE increased to 9.5%, showing that when multiple challenging factors act simultaneously, the model's robustness is severely tested and the prediction error rises significantly.
In Experiment 6, which combined regional differences with noise, the training time was 45 min and the prediction error was a relatively high 8.3%. This indicates that when cross-regional adaptability and noise coincide, accuracy still suffers, especially in regions with shifting data distributions, leading to less stable predictions.
In Experiment 7, by introducing noise and missing data while applying model optimization strategies, the RMSE fell to 7.0%, a significant improvement over the other stressed settings. The training time was 60 min, indicating that optimization strategies can effectively enhance robustness and prediction accuracy when dealing with incomplete or noisy data.
Overall Analysis: As the complexity of the test conditions increased (extreme weather, cross-regional adaptability, data noise, and missing data), the model's prediction error generally grew. In particular, when extreme weather was combined with data noise and missing values, the model performed worst, with the highest RMSE of 9.5%, indicating that the combination of multiple challenging factors significantly reduced prediction accuracy. Nevertheless, Experiment 7, with its model optimization strategies, demonstrated a noticeable improvement, the RMSE dropping to 7.0%, showing that optimization can effectively enhance robustness and accuracy. In comparison, the combination of cross-regional adaptability and data noise performed relatively well but still fell short of the ideal prediction accuracy. Overall, the model demonstrated strong robustness under multiple complex conditions such as extreme weather, data noise, and missing values, and the optimization strategy significantly improved prediction accuracy. These results provide strong support for further optimizing the model, especially for the data quality issues that may arise in practical applications.
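The paper does not specify how the noise and missingness were injected, so the following is a hedged sketch of two common perturbations consistent with the description above: additive Gaussian noise on the normalized data, and randomly masked entries that are re-filled by interpolation. The function names and default parameters are illustrative.

```python
# A sketch of the robustness perturbations described above.
import numpy as np
import pandas as pd

def add_gaussian_noise(df: pd.DataFrame, sigma: float = 0.05) -> pd.DataFrame:
    # Additive zero-mean Gaussian noise; sigma controls the noise intensity.
    return df + np.random.normal(0.0, sigma, size=df.shape)

def drop_random_entries(df: pd.DataFrame, frac: float = 0.1) -> pd.DataFrame:
    # Mask a random fraction of entries as missing, then linearly interpolate,
    # mirroring the missing-value handling of the preprocessing step.
    masked = df.mask(np.random.rand(*df.shape) < frac)
    return masked.interpolate()
```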
4.3.4. Interpretability Study
To explore the decision-making logic of the model during forecasting and enhance the interpretability of the proposed hybrid model, the outputs of GAT and LSTM were combined to analyze how the model integrates temporal and spatial features into the final prediction. The details are shown in Table 7.
Local Analysis:
In Experiment 1, the weights of the temporal features were analyzed to determine their contribution when the model handles temporal dependencies. The training time for this experiment was 60 min, and the model achieved an RMSE of 6.5%. By analyzing temporal dependencies in the time series, the model effectively captured the importance of temporal features in making predictions. However, the improvement over the baseline model, in which no feature weight analysis was performed, was marginal. This suggests that, while temporal features play a significant role, the model may already account for them efficiently without explicit weight analysis, and further tuning could enhance its performance in more dynamic temporal environments.
In Experiment 2, the spatial dependencies of the model were analyzed using SHAP (Shapley additive explanations) and the GAT (graph attention network) attention mechanism. The model achieved an RMSE of 7.2%, slightly higher than in the temporal feature analysis. This indicates that, while the model can successfully capture spatial features, processing spatial dependencies is more complex than processing temporal ones. The increased RMSE can be attributed to challenges such as data distribution, regional differences, and spatial correlations that are harder to model. SHAP and the GAT attention weights revealed the key spatial features influencing the predictions, but the inherent complexity of spatial relationships calls for further refinement of feature extraction and spatial representation.
Experiment 3 investigated the interaction effects between temporal and spatial features using a combination of SHAP, LIME (local interpretable model-agnostic explanations), and spatiotemporal interaction analysis. The training time was 70 min, with an RMSE of 7.0%. The results demonstrated that interaction effects between spatiotemporal features significantly impact prediction accuracy: although the model could handle spatial and temporal features individually, combining them in a meaningful way enhanced its ability to capture complex relationships in the data. While this experiment was more complex than analyzing either feature type separately, it provided deeper insight into how the model balances and processes both simultaneously, paving the way for better handling of multivariate and multimodal data; the RMSE shows, however, that further optimization is needed to reduce the prediction error of such integrated approaches.
Experiment 4 focused on model transparency through the use of SHAP and LIME. The goal was to explain the model’s decision-making process and understand the contribution of each feature to the prediction. With a training time of 75 min and an RMSE of 7.4%, this experiment provided clearer insights into the inner workings of the model. While the transparency analysis helped clarify the role of individual features, the complexity of the model itself introduced additional challenges. The increased RMSE suggested that the extra computational load from these transparency techniques may have slightly diminished the model’s ability to make accurate predictions. However, this experiment significantly contributed to the interpretability of the model, which was crucial for understanding and trusting the model’s predictions in real-world applications.
In Experiment 5, the focus was on identifying the sources of prediction errors using SHAP, LIME, and error tracking methods. The model had a training time of 80 min and an RMSE of 8.1%. This experiment underlined the importance of error analysis in improving the model. By examining the sources of prediction errors, it became evident that the errors were influenced by various factors, including the complexity of the data and interfering variables not fully captured by the model. The RMSE increase highlighted the challenges faced when tracking errors, as additional steps in error analysis can introduce noise and increase uncertainty in the model’s predictions. Nonetheless, this experiment demonstrated the potential to identify and mitigate error sources, which could lead to more accurate predictions in future iterations.
The final experiment aimed to improve the model's interpretability through SHAP, LIME, and optimization strategies. It had the longest training time, 90 min, and achieved an RMSE of 6.8%, a notable improvement in prediction accuracy over the preceding experiments. By optimizing for interpretability, the model became more adept at handling the complexity of the data and the interactions between features. The RMSE reduction indicated that the optimization strategies benefited both the model's interpretability and its overall performance, suggesting that enhancing interpretability through optimization can lead to more efficient feature processing and better predictions, even with complex and noisy data.
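As a pointer to how the model-agnostic attributions in these experiments could be produced, the following minimal sketch wraps a fitted model's prediction function with the shap library's KernelExplainer; the wrapper and array names are illustrative assumptions, not the paper's actual pipeline.

```python
# A hedged sketch of SHAP attribution for a fitted forecasting model.
# predict_fn maps a (n_samples, n_features) array to (n_samples,) outputs;
# X_background and X_explain hold reference and query data, e.g. flattened
# sliding-window inputs.
import numpy as np
import shap  # pip install shap

def explain_predictions(predict_fn, X_background: np.ndarray, X_explain: np.ndarray):
    # KernelExplainer is model-agnostic: it perturbs features around the
    # background set and estimates per-feature Shapley values.
    explainer = shap.KernelExplainer(predict_fn, X_background)
    return explainer.shap_values(X_explain)
```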
Overall Analysis: Upon reviewing the experimental results, it became clear that, as the complexity of the interpretability analysis methods increased, both the training time and the model's RMSE tended to rise. This trend was particularly evident in the experiments involving spatiotemporal interaction analysis, model transparency analysis, and error source identification: the additional computation these methods required introduced some overhead, resulting in slightly higher prediction errors. Nevertheless, the interpretability optimization experiment demonstrated that, by strategically enhancing model transparency and optimizing interpretability, the RMSE could be significantly reduced, indicating that optimization strategies improved not only interpretability but also predictive accuracy.
While the experiments aimed at enhancing interpretability introduced additional complexity, they provided valuable insights into the inner workings of the model and how it processes different feature types. In particular, the experiments analyzing spatiotemporal feature interactions and feature weights greatly improved the model's transparency and reliability on complex tasks. The ability to interpret the model's decision-making process helped in understanding the key factors affecting the predictions, boosting confidence in the model's output, especially in real-world applications where interpretability is critical for trust and deployment.