Long-Term and Short-Term Photovoltaic Power Generation Forecasting Using a Multi-Scale Fusion MHA-BiLSTM Model

Li, Mengkun; Sun, Letian; Sun, Yitian

doi:10.3390/en19020363

Open Access Article


by Mengkun Li ^1,2, Letian Sun ^1 and Yitian Sun ^1,*

^1 Beijing Centre for a Holistic Approach to National Security Studies, Beijing 100089, China

^2 School of Management, Capital Normal University, Beijing 100089, China

^* Author to whom correspondence should be addressed.

Energies 2026, 19(2), 363; https://doi.org/10.3390/en19020363

Submission received: 27 November 2025 / Revised: 28 December 2025 / Accepted: 9 January 2026 / Published: 12 January 2026

(This article belongs to the Section A2: Solar Energy and Photovoltaic Systems)


Abstract

As the proportion of photovoltaic (PV) power generation continues to increase in power systems, high-precision PV power forecasting has become a critical challenge for smart grid scheduling. Traditional forecasting methods often struggle with accuracy and error propagation, particularly when handling short-term fluctuations and long-term trends. To address these issues, this paper proposes a multi-time scale forecasting model, MHA-BiLSTM, based on Bidirectional Long Short-Term Memory (BiLSTM) and Multi-Head Attention (MHA). The model combines the short-term dependency modeling ability of BiLSTM with the long-term trend capturing ability of the multi-head attention mechanism, effectively addressing both short-term (within 6 h) and long-term (up to 72 h) dependencies in PV power data. The experimental results on a simulated PV dataset demonstrate that the MHA-BiLSTM model outperforms traditional models such as LSTM, BiLSTM, and Transformer in multiple evaluation metrics (e.g., MSE, RMSE, R²), particularly showing stronger robustness and generalization ability in long-term forecasting tasks. The results prove that MHA-BiLSTM effectively improves the accuracy of both short-term and long-term PV power predictions, providing valuable support for future microgrid scheduling, energy storage optimization, and the development of smart energy systems.

Keywords:

photovoltaic power generation forecasting; time series prediction; long short-term memory networks; hybrid deep learning models

1. Introduction

With the large-scale penetration of renewable energy, photovoltaic (PV) power generation has become a crucial force in the global energy transition owing to its clean and sustainable nature [1,2]. However, PV power output is influenced by many factors and exhibits both short-term fluctuations and long-term trends [3,4,5]. Noise and intermittency in the data further make accurate forecasting a core challenge in power system scheduling and energy management [6]. Despite extensive recent research aimed at improving forecasting performance, handling the complexity and variability of PV data in practical applications remains difficult [7].

Existing PV power forecasting methods, such as KNN-LSTM, CNN-LSTM, and other hybrid models, improve on traditional statistical methods to some extent, but they remain limited in extracting multi-time-scale information. These models are usually trained for a single time scale, which makes it difficult to capture short-term fluctuations and long-term trends simultaneously [8]. KNN-LSTM, for example, incorporates the K-nearest neighbors algorithm to improve the prediction of short-term fluctuations, which lets it respond effectively to sudden short-term changes [9]; when modeling long-term trends, however, it tends to accumulate errors. CNN-LSTM combines convolutional neural networks (CNN) with LSTM: the CNN extracts local features and is well suited to capturing spatial patterns in PV data, while the LSTM handles long-term dependencies in the time series [10,11]. Even so, this hybrid structure is still limited in capturing the dynamic relationships across different time scales in PV power data [12,13,14].

To overcome these challenges, several fusion models have been proposed in recent years, combining different network architectures or mechanisms to enhance the ability to model the complex dependencies in PV power generation data. For example, the CNN-BiLSTM model integrates CNN and bidirectional long short-term memory networks (BiLSTM) [15,16,17]. CNN efficiently extracts local features, particularly in processing complex spatial patterns in PV data, while BiLSTM captures both forward and backward temporal dependencies through its bidirectional structure, complementing each other in short-term fluctuations and long-term trend modeling [18]. This model achieves a good balance between short-term and long-term forecasting, effectively extracting both local features and temporal dependencies of PV power.

Another advanced fusion model is the Wavelet-CNN-LSTM model, which combines wavelet transform, CNN, and LSTM. Wavelet transform decomposes the time series into different frequency bands, enabling effective extraction of multi-level fluctuation features [19]. CNN further extracts local features, and the LSTM layer is responsible for modeling long-term dependencies. Through this multi-level fusion, the Wavelet-CNN-LSTM model provides more accurate predictions under the multi-time-scale variations in PV data, especially when handling fluctuations at different frequencies.

Although these deep learning fusion models have made significant progress in short-term forecasting, few studies have proposed a unified framework that can simultaneously handle short-term (e.g., 6 h) and long-term (e.g., 72 h) forecasting tasks while effectively integrating information from different time scales. This research gap limits the applicability of existing models in practical power system scheduling, energy storage optimization, and demand response.

To fill this gap, this paper proposes a unified PV power forecasting framework that performs both short-term and long-term predictions within a single model architecture. The core innovations of this framework include:

(1) Development of a unified forecasting framework that eliminates the need for model switching or phase-by-phase training, enabling simultaneous consideration of short-term fluctuations and long-term trends, thereby improving forecasting efficiency and practical application value.
(2) Design of a multi-scale fusion multi-head attention (MHA) structure that effectively captures temporal dependencies across different time scales, enhancing the model’s ability to represent multi-frequency signals and improving prediction accuracy.

With this framework, this paper employs the multi-scale attention mechanism to adaptively weight key temporal features of PV power, effectively addressing issues such as sensitivity to noisy data and performance degradation in long-term forecasting. The experimental results demonstrate that, compared to traditional models such as KNN-LSTM, CNN-LSTM, and other baseline models, the proposed method outperforms them in several key forecasting metrics, validating its practical applicability and reliability when relying solely on PV power data.

2. Model Construction

2.1. Design of Fusion Model Based on BiLSTM + Multi-Time Scale Multi-Head Attention

Traditional PV power forecasting methods often focus on a single time scale and ignore the multi-time-scale fluctuations and trends present in PV power generation data. To overcome this problem, this paper proposes a multi-time-scale PV power forecasting model, the MHA-BiLSTM model. By combining Bidirectional Long Short-Term Memory (BiLSTM) with the Multi-Head Attention (MHA) mechanism, the model aims to handle short-term and long-term dependencies simultaneously and to improve prediction accuracy. Specifically, BiLSTM serves as the feature extraction layer: it uses forward and backward contextual information within a fixed historical window to capture the local morphology and short-term fluctuations of PV power curves. The multi-head attention mechanism adaptively adjusts the attention weights assigned to information at different time scales and processes short-term disturbances and long-term trends through multiple parallel attention heads, which lets the model respond flexibly to patterns at different time scales. Combining the two is intended to improve both the capture of short-term fluctuations and the accuracy of long-term trend prediction, and to mitigate the error accumulation, local information loss, and weak long-term dependency modeling of traditional methods (Figure 1).

Figure 1 shows the overall framework of the MHA-BiLSTM model. The input historical PV power generation data is encoded by the BiLSTM layer to extract short-term dependency information. Subsequently, the output of BiLSTM is assigned to two attention modules: the short-term multi-head attention module focuses on short-term disturbances, while the long-term multi-head attention module pays attention to long-term trends. Finally, the outputs of the two modules are fused and fed into the dual-output prediction head to generate short-term (6-h) and long-term (72-h) prediction results, respectively. The BiLSTM module fully utilizes the context of historical information through bidirectional encoding, which is particularly suitable for capturing short-term fluctuations in PV power curves (such as rapid changes in cloud cover, temperature fluctuations, etc.). Compared with traditional LSTM, BiLSTM improves the accuracy of these local morphologies through the integration of bidirectional information flow. The multi-head attention mechanism assigns different attention weights to different features at different time scales through multiple parallel attention heads, helping the model effectively capture long-term trend changes (such as sunshine changes, weather periodicity, etc.). Each attention head is responsible for focusing on different types of temporal features, enabling the model to understand and predict changes in PV power from multiple perspectives. Combining BiLSTM and multi-head attention mechanism not only retains the advantages of BiLSTM in short-term dependency modeling but also fully utilizes the ability of multi-head attention mechanism in long-term trend capture, thereby constructing a powerful multi-time scale fusion forecasting model.

2.1.1. Bidirectional Long Short-Term Memory Neural Network (BiLSTM)

In the proposed MHA-BiLSTM framework, BiLSTM serves as the core temporal encoder for horizontal feature extraction, consisting of two LSTM chains that process the historical sequence in the forward and backward directions. This bidirectional encoding enables the model to capture local morphological patterns and bidirectional temporal dependencies within a fixed historical observation window, which is beneficial for representing short-term fluctuations and contextual correlations in PV generation data. The input at time step $t$ is denoted as $x_t \in \mathbb{R}^{d_x}$, formed by the multivariate PV features (e.g., irradiance, temperature, DC current, DC voltage, and power) in the historical window. Let the historical window length be $T$, where $t = 1, 2, \dots, T$. The forward and backward hidden states are computed as follows:

$$\overrightarrow{h}_t = \mathrm{LSTM}_f(x_t, \overrightarrow{h}_{t-1}) \tag{1}$$

$$\overleftarrow{h}_t = \mathrm{LSTM}_b(x_t, \overleftarrow{h}_{t+1}) \tag{2}$$

where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ denote the forward and backward hidden representations at time step $t$, respectively, and $\mathrm{LSTM}_f(\cdot)$ and $\mathrm{LSTM}_b(\cdot)$ represent the forward and backward LSTM transition functions. The BiLSTM output at time step $t$ is obtained by concatenating the two directional states:

$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t] \in \mathbb{R}^{d_h} \tag{3}$$

where $[\cdot; \cdot]$ denotes vector concatenation and $d_h$ is the resulting hidden dimension. For sequence representation and subsequent attention-based fusion, the BiLSTM outputs $\{h_t\}_{t=1}^{T}$ are forwarded to the short-term and long-term attention modules described in Section 2.1.2. In this study, the BiLSTM encoder adopts a stacked two-layer structure with hidden size 64 to balance modeling capacity and computational efficiency.
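The structure of Equations (1) to (3) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: a toy tanh recurrent cell (`toy_cell`, a hypothetical stand-in) replaces the LSTM transition functions, the weights are arbitrary constants, and the hidden size of 2 is chosen only to keep the example small. Only the forward pass, the backward pass, and the concatenation follow the text.

```python
import math

def toy_cell(x, h_prev, w_x=0.5, w_h=0.3):
    """Hypothetical stand-in for an LSTM transition (NOT a real LSTM cell)."""
    x_mean = sum(x) / len(x)
    return [math.tanh(w_x * x_mean + w_h * h) for h in h_prev]

def bidirectional_encode(xs, hidden_size=2):
    """Return h_t = [forward h_t ; backward h_t] for every step of the window xs."""
    T = len(xs)
    zero = [0.0] * hidden_size

    # Eq. (1): forward chain, t = 1..T, each state depends on h_{t-1}.
    fwd, h = [], zero
    for t in range(T):
        h = toy_cell(xs[t], h)
        fwd.append(h)

    # Eq. (2): backward chain, t = T..1, each state depends on h_{t+1}.
    bwd, h = [None] * T, zero
    for t in reversed(range(T)):
        h = toy_cell(xs[t], h)
        bwd[t] = h

    # Eq. (3): concatenate the two directional states at each step.
    return [fwd[t] + bwd[t] for t in range(T)]

# Three time steps with d_x = 2 features each (illustrative values).
window = [[0.1, 0.2], [0.4, 0.3], [0.9, 0.7]]
states = bidirectional_encode(window)
```

In this sketch the forward half of the first state depends only on the first input, while its backward half depends on every input in the window, which is the property the concatenation in Equation (3) is meant to provide.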

As shown in Figure 2:

As shown in Figure 3, the BiLSTM processes the input sequence in both the forward and backward directions, so each hidden state captures context on both sides of a time step. Because the input is restricted to the fixed historical window, the "future" context seen by the backward pass still consists only of observations that precede the forecast origin. During the prediction phase the model therefore relies solely on historical data and uses no information from the forecast horizon, which preserves temporal causality.

2.1.2. Multi-Time Scale Multi-Head Attention

A multi-head attention layer is added on top of the BiLSTM output to weight the encoded features, allowing the model to adjust more finely how much attention it pays to different parts of the time series. Four attention heads were chosen. Preliminary experiments showed that this configuration balances the ability to capture dependencies across different time scales against computational cost: using more heads (e.g., 8) brought only marginal performance gains while significantly increasing computation. The multi-head design itself follows Vaswani et al., who introduced it for sequence modeling tasks. The structure is shown in Figure 4.

For both the short-term and long-term attention modules shown in Figure 4, a multi-head attention mechanism with $H$ heads is adopted. In this study, the number of attention heads is set to $H = 4$, which provides a balance between representation capacity and computational efficiency. The hidden state output from the preceding recurrent layer has a dimension of $d_{\text{model}} = 64$ and is linearly projected into query (Q), key (K), and value (V) spaces. The 64-dimensional representation is divided evenly across the attention heads, giving a per-head dimension of $d_k = d_v = 64/H = 16$. Each attention head independently performs scaled dot-product attention, and the outputs of all heads are concatenated and linearly transformed to form the final attention output.
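The dimension bookkeeping in this paragraph can be checked with a short sketch. It is illustrative only: `split_heads` and `concat_heads` are hypothetical helper names, and slicing a placeholder vector stands in for the learned per-head projections, which are not reproduced here.

```python
D_MODEL = 64   # hidden dimension stated in the text
NUM_HEADS = 4  # H

def split_heads(vec, num_heads):
    """Split one d_model-dimensional vector into num_heads equal slices."""
    if len(vec) % num_heads != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    d_k = len(vec) // num_heads
    return [vec[i * d_k:(i + 1) * d_k] for i in range(num_heads)]

def concat_heads(heads):
    """Concatenate per-head vectors back into one d_model-dimensional vector."""
    return [value for head in heads for value in head]

hidden = [float(i) for i in range(D_MODEL)]  # placeholder hidden state
heads = split_heads(hidden, NUM_HEADS)
d_k = len(heads[0])                          # 64 / 4 = 16
restored = concat_heads(heads)
```

Splitting and concatenating are exact inverses, so the concatenated head outputs have the same 64-dimensional shape as the input before the final linear transformation is applied.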

① Linear transformation:

$$Q = h_t W^Q, \quad K = h_t W^K, \quad V = h_t W^V \tag{4}$$

The output sequence of the BiLSTM layer generates the query (Q), key (K), and value (V) matrices through three different linear transformations. These transformations are defined by the weight matrices $W^Q$, $W^K$, and $W^V$, which are model parameters learned during training.

② Calculation of attention scores:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{5}$$

The attention scores are calculated by taking the dot product between the query (Q) and key (K) to obtain a similarity matrix, which is then divided by $\sqrt{d_k}$. Next, the softmax function is applied to normalize the results, obtaining the attention weights. Finally, the attention weights are multiplied by the value (V) to produce the weighted output.
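Equation (5) can be written out directly for small list-of-lists matrices. The following is a minimal sketch using only the Python standard library; the function names and the toy values of Q, K, and V are illustrative assumptions, not taken from the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for list-of-lists matrices.

    Q: T_q x d_k, K: T_k x d_k, V: T_k x d_v.
    Returns (output, weights) with shapes T_q x d_v and T_q x T_k.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    weights = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    # Weighted sum of the value vectors for each query.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return output, weights

# Toy example: 3 time steps, d_k = d_v = 2 (illustrative values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` sums to 1, and a query that is equally similar to all keys receives the plain average of the value vectors.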

③ Multi-head parallel computing:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \dots, \mathrm{head}_H) W^O \tag{6}$$

In multi-head attention, the above process is repeated once per head. Each head can learn different features from the sequence. Finally, the outputs of all heads are concatenated and a linear transformation $W^O$ is applied to generate the final output. This step helps fuse the information captured by different heads and enhances the model’s representation ability. Each head is calculated as follows:

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \tag{7}$$

Each multi-head attention layer is followed by a feed-forward network, which consists of two linear transformations and a nonlinear activation function:

$$x_{\mathrm{FFN}} = \max(0, x_{\mathrm{MHA}} W_1 + b_1) W_2 + b_2 \tag{8}$$

Here, $x_{\mathrm{MHA}}$ represents the output of the multi-head attention layer, $W_1$ and $W_2$ are weight matrices, $b_1$ and $b_2$ are bias terms, and ReLU, $\max(0, x)$, is the activation function. This network layer enhances the model’s ability to handle nonlinearity, enabling it to capture more complex patterns.
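The feed-forward step in Equation (8) can be sketched for a single vector as follows. The dimensions (2 to 3 to 2) and all weight values are illustrative assumptions chosen so the result is easy to verify by hand; they are not trained parameters from the paper.

```python
def linear(x, W, b):
    """Compute x W + b for a vector x (length n), matrix W (n x m), bias b (length m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def feed_forward(x_mha, W1, b1, W2, b2):
    """Equation (8): max(0, x W1 + b1) W2 + b2."""
    hidden = [max(0.0, v) for v in linear(x_mha, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)

# Toy dimensions 2 -> 3 -> 2 with hand-picked weights.
x_mha = [1.0, -2.0]
W1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.5, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.1, -0.1]
x_ffn = feed_forward(x_mha, W1, b1, W2, b2)
```

Here the first linear layer yields (1, -1.5, -3), ReLU zeroes the two negative entries, and the second layer maps the result to approximately (1.1, -0.1).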

$$\mathrm{Output} = \mathrm{Linear}(x_{\mathrm{FFN}}) \tag{9}$$

In the MHA-BiLSTM model, after the BiLSTM and multi-head attention layers, the data is finally passed through a linear layer to generate the final prediction output. This layer converts high-dimensional features into target prediction values.

Each attention head in the short-term multi-head attention module focuses on different local features, helping the model identify and capture short-term changes in PV power. In this module, the model automatically identifies short-term fluctuations and assigns higher weights to these features, thereby optimizing short-term predictions.

Each attention head in the long-term multi-head attention module focuses on long-term trends, such as changes in sunshine and weather. By weighting different time scales, the model can automatically distinguish and optimize the prediction of long-term trends, making the final long-term prediction more accurate.


3. Dataset and Model Validation

3.1. Simulation System Architecture Design

As shown in Figure 5, a Simulink-based simulation platform was developed to generate synthetic operational data, with both structural alignment and physical interpretability to mirror real-world photovoltaic (PV) systems. The simulation system was constructed using actual engineering data from a distribution network in Shandong Province, ensuring that the simulated system reflects realistic operational dynamics. The system consists of a PV array, driven by irradiance and temperature inputs, followed by a Boost converter and a grid-connected inverter. The architecture and configuration of the system closely resemble those of operational PV systems, enabling the generation of synthetic PV power output data that is representative of typical real-world conditions. Specifically, the simulation adhered to the IEC 61724-1 standard for photovoltaic system performance monitoring, which provides guidelines for the collection, analysis, and presentation of photovoltaic performance data [20]. This ensures that the generated output data accurately reflects the performance and behavior of PV systems in operational settings, making the data suitable for use in PV power forecasting applications, with a strong correspondence to real-world scenarios.

3.2. Dataset and Preprocessing

The dataset used in the experiment consists of 8 consecutive months of simulated PV power generation data, generated from the Simulink-based simulation model mentioned earlier. The data, recorded at 15 min intervals, includes key features such as irradiance, temperature, DC current, DC voltage, and actual power generation, as shown in Table 1. The data frequency was set to match typical power system scheduling cycles, ensuring a balance between prediction accuracy and computational efficiency.
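At a 15 min sampling interval, the 6 h and 72 h horizons correspond to 24 and 288 time steps. The sketch below shows this arithmetic and one possible way to cut a series into training samples. Several points are assumptions rather than details from the paper: the history length of 96 steps, the use of a single power series instead of all features, the choice of full trajectories as targets, and the helper name `make_samples`.

```python
STEP_MINUTES = 15                       # sampling interval stated in the text
SHORT_STEPS = 6 * 60 // STEP_MINUTES    # 6 h  -> 24 steps
LONG_STEPS = 72 * 60 // STEP_MINUTES    # 72 h -> 288 steps

def make_samples(power, history_len, short_steps=SHORT_STEPS, long_steps=LONG_STEPS):
    """Build (history, short_target, long_target) triples from a power series.

    history_len is an assumed hyperparameter; the source does not state it.
    """
    samples = []
    last_start = len(power) - history_len - long_steps
    for start in range(last_start + 1):
        split = start + history_len
        history = power[start:split]
        short_target = power[split:split + short_steps]
        long_target = power[split:split + long_steps]
        samples.append((history, short_target, long_target))
    return samples

# Synthetic placeholder series: 400 readings (about 4.2 days at 15 min resolution).
series = [float(i) for i in range(400)]
samples = make_samples(series, history_len=96)  # 96 steps = 24 h of history (assumed)
```

Every sample keeps the targets strictly after the history window, so no value from the forecast horizon leaks into the model input.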

To ensure the representativeness and validity of the generated data, a comprehensive statistical analysis was conducted. Key statistical properties, including the mean, variance, skewness, and kurtosis, were compared with typical real-world photovoltaic power generation characteristics. The results confirmed that the simulated data followed the expected daily and seasonal variability patterns. The mean power generation during peak hours was consistent with real-world systems, peaking at approximately 85–100 kW around midday. Variability was relatively high during these periods, with a standard deviation of 10–15 kW, reflecting typical fluctuations due to cloud cover and sunlight variation. The data showed a positive skew, which is typical for PV power generation: output is zero or low for much of each day, and high values form a long right tail around midday. The kurtosis of the data was also high, reflecting frequent sharp increases in power due to rapid changes in irradiance, as observed in the ramping events in the power generation curves.

The simulation was modeled with realistic boundary conditions to ensure that the synthetic data accurately reflects real-world scenarios. The maximum output power was capped at 100 kW, typical for a large-scale PV system, and the system efficiency was assumed to be 85%, based on the typical performance of PV inverters and system components. Environmental conditions were modeled using historical climate data from Shandong Province, with irradiance values following typical daily cycles, peaking around midday and falling to zero at night, while temperature fluctuations were also simulated based on historical trends. Grid constraints, such as voltage and frequency limits imposed by the power grid, were applied to ensure the simulated data adhered to real-world operational limitations.

The raw data underwent several preprocessing steps to ensure its quality and suitability for forecasting. First, the timestamps were converted to the datetime type to maintain correct chronological order. Missing values were filled by linear interpolation to preserve the continuity of the time series. Outliers in the power data, such as spikes or drops beyond a 10% threshold, were identified and corrected using statistical methods such as Z-score analysis. Finally, all features were normalized with MinMaxScaler to the [0, 1] range so that features with different units contribute on a comparable scale, which stabilizes the learning process.
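Two of these steps, linear interpolation of missing values and min-max scaling, can be sketched in plain Python. This is an illustration under assumptions, not the authors' pipeline: the helper names are hypothetical, missing readings are represented as `None`, and the outlier correction is left out because the text does not specify it in enough detail to reproduce.

```python
def interpolate_missing(values):
    """Fill None gaps by linear interpolation between the nearest known neighbours.

    Leading or trailing gaps are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series contains no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled

def min_max_scale(values):
    """Map values linearly onto [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

raw_power = [0.0, None, 20.0, 50.0, None, None, 80.0, 100.0]  # kW, illustrative
filled = interpolate_missing(raw_power)
scaled = min_max_scale(filled)
```

In practice the minimum and maximum would normally be computed on the training split only and then reused for the validation and test splits, so that no test-set statistics influence training.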

The dataset was split into three subsets: 80% of the data was used for training, 10% for validation, and 10% for testing. The training set was used to fit the model, while the validation set was used to tune hyperparameters and monitor the model during training. The test set was held out and used only to evaluate the final performance after training was completed.
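A minimal sketch of an 80/10/10 split is given below. It assumes the split is chronological (no shuffling), which is common for time series but is not stated in the text, and it approximates 8 months as 243 days purely to produce a realistic sample count; `chronological_split` is a hypothetical helper name.

```python
def chronological_split(samples, train_frac=0.8, val_frac=0.1):
    """Split an ordered sequence into train/validation/test without shuffling."""
    n = len(samples)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return samples[:train_end], samples[train_end:val_end], samples[val_end:]

# Assumed size: about 243 days * 96 readings per day at 15 min resolution.
readings = list(range(243 * 96))
train, val, test = chronological_split(readings)
```

With this ordering, every validation sample is later in time than every training sample, and every test sample is later than both.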

3.3. Experimental Setup, Evaluation Metrics, and Implementation Details

This section introduces the indicator system used to evaluate model performance, the hyperparameter settings and calibration process of the proposed model and comparative models, as well as the experimental environment and basic configuration instructions.

3.3.1. Experimental Setup and Evaluation Metrics

The following three indicators were uniformly used to evaluate model performance in the experiment:

① MSE (Mean Squared Error):

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{10}$$

where $n$ represents the number of samples, $y_i$ is the $i$-th true value, and $\hat{y}_i$ is the $i$-th predicted value. MSE measures the average of the squared prediction errors over all data points, reflecting the overall accuracy of the model’s predictions. A smaller value indicates more accurate predictions; a larger value indicates larger prediction errors.

② RMSE (Root Mean Squared Error):

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} \tag{11}$$

RMSE is the square root of the MSE, so it is expressed in the same unit as the original data and can be compared directly with actual observations. A smaller RMSE indicates a more accurate prediction, and a larger RMSE indicates a larger prediction error.

③ R² (Coefficient of Determination):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{12}$$

where $n$ is the number of samples, i.e., the total number of data points, $y_i$ is the $i$-th true value, $\hat{y}_i$ is the $i$-th predicted value, and $\bar{y}$ is the mean of the true values. R² reflects the model’s ability to explain the variability in the data. A value closer to 1 indicates that the model can better fit the data; a value closer to 0 indicates that the model’s prediction effect is poor and can hardly explain the variability in the data.
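The three metrics in Equations (10) to (12) can be computed with a few lines of standard-library Python. The sample values below are illustrative and are not results from the paper.

```python
import math

def mse(y_true, y_pred):
    """Equation (10): mean of squared errors."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Equation (11): square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def r2(y_true, y_pred):
    """Equation (12): 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0, 40.0]   # kW, illustrative
y_pred = [12.0, 18.0, 33.0, 37.0]
```

For these values the errors are 2, -2, 3, and -3 kW, giving an MSE of 6.5, an RMSE of about 2.55 kW, and an R² of 0.948. A model that always predicts the mean of the true values scores exactly 0, and a model that does worse than that scores below 0.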

3.3.2. Model Hyperparameter Setting and Calibration Process

To ensure the fairness and interpretability of the comparison results, hyperparameter configuration and calibration were performed for the BiLSTM + multi-time scale multi-head attention fusion model, as well as for various comparative models, including TCN, unidirectional LSTM, Transformer, BiLSTM, CNN-LSTM, and KNN-LSTM. All models were trained and validated using the same dataset, input features, historical time window length, and short-term and long-term prediction step settings. This uniform configuration ensured that the model comparison was unbiased and meaningful.

For the recurrent models, the hidden layer dimension was set to 64, and a two-layer BiLSTM structure was selected after testing configurations with one, two, and three stacked layers. A single-layer BiLSTM was insufficient for capturing complex temporal dependencies, especially in long-term forecasting tasks. Deeper models, such as those with three layers, provided marginal improvements but led to higher computational costs and an increased risk of overfitting. The two-layer BiLSTM was therefore chosen as a balance between performance and computational efficiency. Training used the MSE loss function and the Adam optimizer with a learning rate of 0.001, a batch size of 64, and 200 epochs, settings that also supported training stability and predictive accuracy.

Following the initial configuration, a sensitivity analysis was conducted to evaluate the impact of varying key hyperparameters, specifically the number of attention heads and the hidden layer dimension. When the number of attention heads increased from 4 to 8, a slight improvement in RMSE was observed; however, the computational cost also increased, indicating a trade-off between performance and efficiency. In contrast, reducing the hidden layer size from 64 to 32 resulted in a noticeable increase in error, particularly in long-term predictions, highlighting the model’s sensitivity to the size of its hidden layers. These findings underline the importance of selecting hyperparameters that support both short-term fluctuations and long-term trends without compromising efficiency.

In addition to these configurations, a weight analysis was performed to understand the contribution of different features to the model’s predictions. The analysis showed that irradiance and temperature had the largest weights in predicting short-term fluctuations, while DC voltage and current played a more significant role in modeling long-term trends. This supports the intuitive understanding that irradiance directly influences short-term power generation, while DC voltage and current are better suited for capturing long-term operational patterns in PV systems.

Furthermore, the proposed model underwent ablation experiments to assess the contributions of each component, particularly the BiLSTM and multi-time scale attention structures. Variants of the model were tested with different configurations, including removing the attention mechanism and merging short- and long-term attention heads into a single time-scale attention module. These experiments demonstrated the effectiveness of BiLSTM’s bidirectional encoding and the multi-head attention mechanism in capturing both local morphological features and long-term trends, which are crucial for accurate forecasting. The results from these ablation experiments provide strong evidence for the necessity of the BiLSTM encoding and multi-time scale attention design, which significantly enhances the model’s robustness and predictive accuracy.

3.3.3. Experimental Environment and Preparation

To ensure the fairness of the comparative experiments, the baseline models (LSTM, BiLSTM, CNN-LSTM, TCN, and Transformer) used the same input feature set, historical window length, short-term and long-term prediction step sizes, optimizer type, initial learning rate, batch size, and number of training epochs as the proposed model, differing only in network structure. The hidden layer dimensions of all models were kept at the same level to avoid bias caused by differences in parameter scale. Hyperparameters were not simply set from experience; instead, a strategy combining coarse search and fine adjustment was adopted. First, a grid search was performed over several candidate values of key hyperparameters, such as the number of hidden units, the number of attention heads, and the learning rate, with the minimum validation MSE as the main criterion. Then, a small number of key parameters were fine-tuned within the resulting range, taking into account short-term and long-term prediction performance as well as model complexity, to determine the final configuration. In addition, a weighted joint loss function was constructed for the dual short-term and long-term outputs. Different loss weight combinations were tried on the validation set, and the scheme that best balanced short-term disturbance response and long-term trend fitting was selected, realizing joint optimization of the multi-time-scale prediction tasks.
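One simple form such a weighted joint loss could take is a convex combination of the short-horizon and long-horizon MSE. The sketch below is an assumption for illustration: the text does not give the exact formula or the weights used, and `alpha` and `joint_loss` are hypothetical names.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def joint_loss(short_true, short_pred, long_true, long_pred, alpha=0.5):
    """Weighted sum of short- and long-horizon MSE.

    alpha is a hypothetical weight in [0, 1]; the source does not report the value used.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    short_term = mse(short_true, short_pred)
    long_term = mse(long_true, long_pred)
    return alpha * short_term + (1.0 - alpha) * long_term

# Illustrative normalized targets and predictions.
loss = joint_loss([0.2, 0.4], [0.1, 0.5], [0.3, 0.6, 0.9], [0.3, 0.3, 0.9], alpha=0.7)
```

Sweeping `alpha` over a small grid and comparing the two validation errors is one way to carry out the weight selection described above.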

4. Experimental Results and Analysis

The model was trained with Mean Squared Error (MSE) as the loss function for a maximum of 200 epochs, with an early stopping strategy applied to prevent overfitting: the validation loss is monitored during training, and training is terminated if no improvement is observed for 20 consecutive epochs. The maximum of 200 epochs leaves enough iterations for convergence, while early stopping halts training automatically once the model reaches a stable state. The patience of 20 epochs was determined through preliminary experiments in which various early stopping criteria were tested; a smaller patience stopped training prematurely and led to underfitting, while a larger patience allowed overfitting. Figure 6 shows the trend of the loss function during training. In the early epochs the loss decreases significantly, indicating that the model is learning effective feature representations. In the later epochs the decrease levels off, indicating that the model has converged and that further training brings limited improvement, which supports the use of early stopping. As shown in Figure 6, the validation loss converges well before the maximum number of epochs is reached, indicating that the chosen training configuration is reasonable.
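The early stopping rule (a maximum of 200 epochs and a patience of 20 epochs on the validation loss) can be sketched as follows. The function replays a given list of validation losses instead of training a model, and both the function name and the synthetic loss curve are illustrative assumptions.

```python
def early_stopping_replay(val_losses, max_epochs=200, patience=20):
    """Replay per-epoch validation losses and report where training would stop.

    Returns (stop_epoch, best_epoch, best_loss), with epochs counted from 1.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    stop_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        stop_epoch = epoch
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return stop_epoch, best_epoch, best_loss

# Synthetic validation curve: improves for 50 epochs, then plateaus at a worse value.
curve = [1.0 / epoch for epoch in range(1, 51)] + [0.05] * 150
stop_epoch, best_epoch, best_loss = early_stopping_replay(curve)
```

On this synthetic curve the best loss occurs at epoch 50 and training stops at epoch 70, after 20 epochs without improvement. In a real training loop the model weights from the best epoch would typically be restored at that point.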

Figure 6 presents the training loss curve during model training. Initially, the loss decreases steadily, indicating that the model is effectively learning. However, two anomalies are observed during the training process. The first anomaly occurs around the 10th epoch, where the loss increases sharply from approximately 0.02 to over 0.14. This sudden rise is likely due to temporary instability in the optimization process, which can occur when the learning rate is too high or when the model explores new regions of the parameter space. This fluctuation was short-lived, and after a few epochs, the loss resumed its steady decrease, reflecting the model’s ability to recover and continue learning. A second anomaly appears around the 120th epoch, where the loss experiences a brief increase. This could suggest the onset of overfitting, but as with the earlier fluctuation, the loss quickly returned to its decreasing trend. Despite these short-term fluctuations, the overall trend of the training loss curve demonstrates that the model is converging steadily, with the loss approaching its optimal value as training progresses. This indicates that the model is learning effectively and reaching a stable state.

4.1. Short-Term Performance Prediction Analysis and Prediction Visualization

In the short-term PV power forecasting task, this paper evaluated the performance of the proposed BiLSTM + multi-time scale multi-head attention (MHA-BiLSTM) model in terms of prediction accuracy and stability by comparing it with traditional models such as LSTM, BiLSTM, CNN-LSTM, TCN, and Transformer. To ensure the fairness of the comparative experiments, all models were trained and evaluated using the same input features, historical window length, short-term prediction step size, and hyperparameter configuration.

From the short-term prediction results, the proposed MHA-BiLSTM model performed best among all compared models, achieving the lowest Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) and the highest Coefficient of Determination (R²) on the test set, as shown in Table 2. Compared with the traditional LSTM model, the MSE of MHA-BiLSTM fell from 0.0035 to 0.0026 (about 26%), the RMSE fell from 0.0591 to 0.0509 (about 14%), and the R² rose from 0.9412 to 0.9627 (about 2.3%). When facing short-term fluctuations in PV power, the MHA-BiLSTM model captured local fluctuations more accurately and responded well to rapid changes.
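For reference, the three evaluation metrics named above can be computed as in the following sketch. It uses the standard definitions of MSE, RMSE, and R²; the function name and the dictionary layout are illustrative choices, and the paper's own evaluation code is not shown in the source.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Return MSE, RMSE and R² for paired true and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares and total sum of squares around the mean.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when y_true is constant")
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}
```

Because RMSE is the square root of MSE, the two columns in Table 2 rank the models identically; R² additionally normalizes the error by the variance of the true power series.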

In addition to point-forecast accuracy metrics, the uncertainty of the model's predictions was evaluated by calculating the prediction interval at a 95% confidence level. For the MHA-BiLSTM model, the 95% prediction interval was found to be $[\hat{y} - 0.098,\ \hat{y} + 0.098]$. This interval represents the range around each prediction $\hat{y}$ within which the true value is expected to fall 95% of the time, providing a measure of uncertainty. It reflects the variability of the prediction errors due to factors such as noise and data fluctuations. The width of the interval is derived from the spread of the model's errors, so a narrower interval indicates greater confidence in the prediction's accuracy.
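One common way to obtain such an interval from the spread of the errors is sketched below. It assumes the residuals are approximately normal with zero mean, so that the half-width is 1.96 times their standard deviation; the paper does not state how its interval was constructed, so this construction and the function names are assumptions.

```python
import statistics
from typing import Sequence

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def interval_half_width(y_true: Sequence[float], y_pred: Sequence[float], z: float = Z_95) -> float:
    """Half-width of a symmetric prediction interval around each forecast."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    # Population standard deviation of the residuals (assumed roughly normal).
    return z * statistics.pstdev(residuals)


def empirical_coverage(y_true: Sequence[float], y_pred: Sequence[float], half_width: float) -> float:
    """Fraction of true values that fall inside [y_pred - h, y_pred + h]."""
    inside = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= half_width)
    return inside / len(y_true)
```

The dispersion values reported in Tables 2 and 3 are close to 1.96 times the corresponding RMSE (for example, 1.96 × 0.0591 ≈ 0.1158 for LSTM in Table 2), which is consistent with this construction but does not confirm it. Checking `empirical_coverage` on held-out data is a simple way to test whether the normality assumption is adequate.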

Table 2 shows that the short-term prediction accuracy of the MHA-BiLSTM model is better than that of all baseline models. In terms of RMSE, the reduction ranges from about 0.4% relative to KNN-LSTM (0.0511 versus 0.0509) to about 14% relative to LSTM (0.0591 versus 0.0509), indicating that the combination of bidirectional LSTM and the multi-scale attention mechanism is advantageous in capturing short-term fluctuations.

Figure 7 shows the prediction accuracy of each model over a short-term window (06:00–12:00). The prediction curves of all models are close to the true values, but MHA-BiLSTM performs best and accurately captures the fluctuations in power. In the rising and falling stages from morning to noon in particular, its predictions almost coincide with the true values. In contrast, although LSTM and BiLSTM track the general trend, they deviate slightly in the detailed changes around noon. Transformer and CNN-LSTM lag slightly during the rapid power rise in the morning and do not fully follow the true values in the falling stage.

4.2. Long-Term Performance Prediction Analysis and Prediction Visualization

For the long-term PV power forecasting task, the proposed MHA-BiLSTM model also showed excellent performance. In the long-term prediction scenario, the prediction duration reaches 72 h, and the periodic and trend characteristics of PV output become particularly important. Traditional LSTM models are easily disturbed by noise in such tasks, leading to the gradual increase in long-term prediction errors. However, the MHA-BiLSTM model effectively alleviates the problem of error propagation in long-term prediction through the multi-time scale attention mechanism.

The experimental results show that MHA-BiLSTM performs well in long-term prediction tasks, with MSE and RMSE lower than those of all compared models, as shown in Table 3. Compared with the Transformer and BiLSTM baselines, MHA-BiLSTM reduces the long-term MSE by about 28% and 30%, reduces the RMSE by about 15% and 16%, and raises the R² by 0.0221 and 0.0255, respectively. In capturing long-term trends and periodic changes in particular, the MHA-BiLSTM model is more robust and accurate, and can effectively predict the changes in PV power over the next few days.

Similarly, the 95% prediction interval of long-term prediction errors was found to be ±0.1371, which is higher than the short-term prediction interval but still indicates a reasonably low level of uncertainty. This suggests that while long-term predictions are more prone to fluctuations due to external factors such as weather patterns, the model still maintains a good level of accuracy.

Figure 8 compares the prediction results of each model with the true values over a 48 h period (from May 31 to June 2). In long-term prediction, all models follow the periodic changes in PV power well. During periods of sharp rise and fall in power, MHA-BiLSTM still outperforms the other models: its prediction curve follows the true values most closely, with the smallest error. In contrast, LSTM and BiLSTM show relatively larger errors, especially a slight lag as the power falls from peak to trough. Transformer and CNN-LSTM follow the overall trend well, but respond slightly more slowly to abrupt change points than MHA-BiLSTM.

4.3. Ablation Experiment Results’ Analysis

To systematically verify the contribution of each component in the proposed MHA-BiLSTM model to the prediction performance, this paper designed comprehensive ablation experiments. By gradually removing or replacing the core modules of the model, three comparative variants were constructed:

(1) MHA-BiLSTM (complete model);
(2) BiLSTM (without attention);
(3) MHA-LSTM (without bidirectional mechanisms).

4.3.1. Short-Term Prediction Ablation Results

Table 4 shows the performance of each ablation variant in the short-term prediction task. The complete MHA-BiLSTM model achieved the best results on all indicators (MSE = 0.0026, RMSE = 0.0509, R² = 0.9627), verifying the effectiveness of its structural design. Two key observations follow. First, the attention mechanism yields a significant gain: compared with the complete model, the BiLSTM variant without attention shows a 46.2% increase in MSE (from 0.0026 to 0.0038) and a decrease of 2.43 percentage points in R², indicating that the multi-head attention mechanism plays an important role in capturing key moments of short-term fluctuations. Second, bidirectional encoding is clearly necessary: the MHA-LSTM variant (unidirectional LSTM + attention) performs worse still, with an MSE 57.7% higher than that of the complete model (0.0041 versus 0.0026), indicating that the bidirectional contextual encoding of BiLSTM is indispensable for short-term fluctuation modeling.
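The relative changes quoted above follow directly from the Table 4 values, as the short sketch below shows. The MSE and R² figures are copied from Table 4; the helper names are illustrative and not part of the paper.

```python
from typing import Dict

# MSE and R² values copied from Table 4 (short-term ablation results).
TABLE_4: Dict[str, Dict[str, float]] = {
    "MHA-BiLSTM": {"mse": 0.0026, "r2": 0.9627},
    "BiLSTM (No Attention)": {"mse": 0.0038, "r2": 0.9384},
    "MHA-LSTM": {"mse": 0.0041, "r2": 0.9302},
}


def relative_increase_pct(reference: float, value: float) -> float:
    """Percentage increase of `value` over `reference`."""
    return (value - reference) / reference * 100.0


def r2_drop_points(reference_r2: float, value_r2: float) -> float:
    """Drop in R² expressed in percentage points."""
    return (reference_r2 - value_r2) * 100.0


def compare_to_full_model(
    table: Dict[str, Dict[str, float]], full_name: str = "MHA-BiLSTM"
) -> Dict[str, Dict[str, float]]:
    """Compare every ablation variant against the complete model."""
    full = table[full_name]
    return {
        name: {
            "mse_increase_pct": relative_increase_pct(full["mse"], row["mse"]),
            "r2_drop_points": r2_drop_points(full["r2"], row["r2"]),
        }
        for name, row in table.items()
        if name != full_name
    }
```

Calling `compare_to_full_model(TABLE_4)` gives MSE increases of about 46.2% (no attention) and 57.7% (unidirectional), and an R² drop of 2.43 percentage points for the variant without attention.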

4.3.2. Long-Term Prediction Ablation Results

The ablation results for the long-term prediction task (Table 5) further confirm the superiority of the complete model. MHA-BiLSTM maintains the lowest error (MSE = 0.0049, RMSE = 0.0700) and the highest goodness of fit (R² = 0.9365) in long-term prediction. The key findings fall into three areas. In terms of error propagation control, the MSE of the complete model is 19.7% lower than that of the variant without attention (0.0049 versus 0.0061), indicating that the multi-time scale attention mechanism effectively suppresses error accumulation in long-term prediction. In terms of the long-term benefit of bidirectional encoding, the MHA-LSTM variant performs worse in long-term prediction (MSE = 0.0067), further supporting the positive effect of BiLSTM on long-term trend modeling. In terms of module synergy, the variant with only attention performs the worst in long-term tasks (MSE = 0.0073), highlighting that BiLSTM and the attention mechanism need to work together in multi-time scale prediction.

4.3.3. Summary of Ablation Experiment Results

The ablation experiments verify the design rationale of the MHA-BiLSTM model from both structural and functional perspectives. First, in terms of structural integrity, the complete model performs best in all prediction tasks, showing that combining BiLSTM with multi-time scale attention produces a clear synergy. Second, in terms of functional complementarity, BiLSTM is good at capturing local morphology and bidirectional temporal dependencies, while multi-time scale attention adaptively weights key time points, so the two modules complement each other. Third, in terms of multi-scale necessity, the results show that a single module struggles to handle short-term fluctuations and long-term trends at the same time, which supports the multi-scale fusion design. In summary, the ablation experiments support the effectiveness and structural soundness of the MHA-BiLSTM model for multi-time scale PV power prediction.

5. Conclusions

This study proposes an MHA-BiLSTM model that integrates the Bidirectional Long Short-Term Memory Network and the multi-time scale multi-head attention mechanism to address the multi-time scale prediction problem of PV power generation. The model demonstrates superior performance in both short-term and long-term prediction tasks, significantly improving forecasting accuracy compared to mainstream baseline models. The improvements in short-term predictions are expected to benefit real-time operational decision-making, such as energy dispatch and grid management, where accurate, timely forecasts are crucial for optimizing operational efficiency. The enhanced long-term forecasting capabilities can contribute to strategic planning and resource allocation, enabling better decision-making in the context of grid expansion and energy storage optimization.

Although the model excels in terms of prediction accuracy, recent studies have emphasized that more accurate predictions do not always lead to better downstream operational performance. Specifically, one study on value-oriented renewable energy prediction suggests that predictions should not only focus on accuracy but also on the practical value they provide to downstream tasks, such as minimizing operational costs in energy dispatch [21]. Another study on closed-loop prediction optimization introduces the idea that integrating predictive models with downstream optimization processes can lead to better economic performance, as it accounts for the feedback from operational decisions and adjusts predictions accordingly [22]. These perspectives suggest that future research could further improve the proposed model by incorporating value-oriented and closed-loop optimization frameworks, allowing predictions to be optimized not just for accuracy, but for their practical value in real-world applications.

Looking ahead, incorporating these concepts could enhance the model’s applicability in practical engineering fields such as smart grid scheduling and energy storage optimization, where both short-term operational accuracy and long-term planning precision are critical. Future studies could explore combining value-oriented prediction with the proposed model to ensure that the predictions not only improve accuracy but also bring tangible benefits in terms of reduced operational costs and enhanced decision-making efficiency.

Author Contributions

M.L.: conceptualization, writing—original draft, funding acquisition, software and resources; L.S.: investigation, formal analysis, project administration, supervision, data curation, writing—review and editing; Y.S.: methodology, visualization and validation. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Central Government Guidance on Local Science and Technology Development Initiatives, China, grant number ZY23CG29.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to the confidentiality restrictions associated with the electrical data, which are subject to privacy regulations and proprietary protection.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Masson, G.; de l’Epine, M.; Kaizuka, I. Trends in Photovoltaic Applications 2024; International Energy Agency Photovoltaic Power Systems Programme (IEA-PVPS): Paris, France, 2024.
2. Ali, A.O.; Elgohr, A.T.; El-Mahdy, M.H.; Zohir, H.M.; Emam, A.Z.; Mostafa, M.G.; Al-Razgan, M.; Kasem, H.M.; Elhadidy, M.S. Advancements in Photovoltaic Technology: A Comprehensive Review of Recent Advances and Future Prospects. Energy Convers. Manag. X 2025, 26, 100952.
3. Yuan, L.; Wang, X.; Sun, Y.; Liu, X.; Dong, Z.Y. Multistep Photovoltaic Power Forecasting Based on Multi-Timescale Fluctuation Aggregation Attention Mechanism and Contrastive Learning. Int. J. Electr. Power Energy Syst. 2025, 164, 110389.
4. Sardarabadi, A.; Heydarian Ardakani, A.; Matrone, S.; Ogliari, E.; Shirazi, E. Multi-Temporal PV Power Prediction Using Long Short-Term Memory and Wavelet Packet Decomposition. Energy AI 2025, 21, 100540.
5. Wang, X.; Li, Z.; Fu, C.; Liu, X.; Yang, W.; Huang, X.; Yang, L.; Wu, J.; Zhao, Z. Short-Term Photovoltaic Power Probabilistic Forecasting Based on Temporal Decomposition and Vine Copula. Sustainability 2024, 16, 8542.
6. Sharma, M.; Kothari, D.; Mohan, S. Enhancing the Performance of Power System Scheduling through Robust Solar Power Forecasting in Presence of Noise and Intermittency. Energy Rep. 2025, 11, 215–227.
7. Zhao, H.; Li, Y.; Zhang, H.; Yang, J. The Impact of Data Complexity and Variability on PV Power Forecasting Performance: A Critical Review and Future Directions. Energy Rep. 2024, 10, 12–25.
8. Zhang, X.; Li, J.; Zhao, Y.; Zhang, Y.; Liu, Z.; Liu, X. Multi-Time-Scale Hybrid Models for Photovoltaic Power Forecasting: Challenges and Future Directions. Energy Rep. 2024, 10, 450–462.
9. Ali, A.; El-Mahdy, M.; Zohir, H.; Karam, M.; Mostafa, M.; Mosaad, H.; Elhadidy, M. A Hybrid KNN-LSTM Model for Solar Power Forecasting: Capturing Short-Term Fluctuations. Energy Rep. 2024, 10, 250–260.
10. Zhang, L.; Zhou, X.; Li, Y.; Wang, Z.; Shi, Y.; Guo, X. A Hybrid CNN-LSTM Model for Long-Term Solar Power Prediction: Extracting Spatial-Temporal Features. Energy Rep. 2024, 10, 560–572.
11. Abumohsen, M.; Owda, A.Y.; Owda, M.; Abumihsan, A. Hybrid machine learning model combining CNN-LSTM-RF for time series forecasting of solar power generation. J. Oper. Res. Soc. 2024, 9, 100636.
12. Zhang, H.; Tang, Z.; Deng, Y.; Wang, K. Photovoltaic Power Output Forecasting Based on CNN-LSTM Multi-Scale Method. J. Phys. Conf. Ser. 2024, 2836, 012013.
13. Zhou, N.; Shang, B.; Xu, M.; Peng, L.; Feng, G. Enhancing photovoltaic power prediction using a CNN-LSTM-attention hybrid model with Bayesian hyperparameter optimization. Glob. Energy Interconnect. 2024, 7, 667–681.
14. Zhang, L.; Ma, X.; Liu, Y.; Chen, Q. A multiscale CNN-BiLSTM model for PV forecasting improves multi-time-scale accuracy versus traditional CNN-LSTM. J. Mod. Power Syst. Clean Energy 2024.
15. Shi, S.; Wang, Z.; Li, J.; Zhang, Y.; Liu, W.; Zhang, L. Multi-Scale Fusion CNN-BiLSTM Model for Solar Power Forecasting. Energy Rep. 2025, 11, 2064–2077.
16. Dai, Y.; Wang, Y.; Chen, Y.; Wu, J.; Chao, J. Combining meteorological and power information of station-measurement and model-prediction with the hybrid CNN-Transformer and CNN-BiLSTM for ultra-short-term photovoltaic power forecasting. Int. J. Electr. Power Energy Syst. 2025, 171, 111009.
17. Zhang, Y.; Ren, X.; Zhang, F.; Liu, Y.; Li, J. A Deep Learning-Based Dual-Scale Hybrid Model for Ultra-Short-Term Photovoltaic Power Forecasting (HS_CNN-A_BiLSTM-A). Sustainability 2024, 16, 7340.
18. Wu, Z.; Yang, W.; Zhang, H.; Liu, Y.; Huang, Y. Hybrid Deep Learning Models for Solar Power Forecasting: Challenges in Multi-Scale Temporal Dependencies. Energy Rep. 2025, 11, 2064–2077.
19. Zhang, Z.; Li, Z.; Huang, X.; Zhang, Q.; Liu, F. Wavelet-CNN-LSTM Hybrid Model for Photovoltaic Power Forecasting: Capturing Multi-Scale Temporal Features. Energy Rep. 2025, 11, 1129–1139.
20. IEC 61724-1:2021; Photovoltaic System Performance—Part 1: Monitoring. International Electrotechnical Commission (IEC): Geneva, Switzerland, 2021.
21. Zhang, Y.; Jia, M.; Wen, H.; Bian, Y.; Shi, Y. Toward Value-Oriented Renewable Energy Forecasting: An Iterative Learning Approach. IEEE Trans. Smart Grid 2025, 16, 1962–1974.
22. Chen, X.; Yang, Y.; Liu, Y.; Wu, L. Feature-Driven Economic Improvement for Network-Constrained Unit Commitment: A Closed-Loop Predict-and-Optimize Framework. IEEE Trans. Power Syst. 2021, 37, 3104–3118.

Figure 1. MHA-BiLSTM Model Prediction Flow Chart.

Figure 2. BiLSTM Structure Diagram.

Figure 3. Temporal Causality in BiLSTM Model.

Figure 4. Multi-Time Scale Attention Module Structure Diagram.

Figure 5. Partial Structural Diagram of Photovoltaic Power Generation Simulation System.

Figure 6. Change in Loss Function During Model Training.

Figure 7. Short-Term Photovoltaic Power Prediction Visualization Comparison.

Figure 8. Photovoltaic Power Prediction Comparison over 48 h.

Table 1. Simulation Dataset Field Description Table.

Field Name	Meaning	Unit	Data Type
timestamp	Time stamp	-	datetime
I_DC1	DC current	A (ampere)	float
V_DC	DC voltage	V (volt)	float
Pv_guangzhao	Light intensity	W/m² (watt per square meter)	float
power	Actual power generation	kW (kilowatt)	float

Table 2. Experimental Results of Different Models for Short-Term Prediction.

Model	Time Scale	MSE	RMSE	R²	Prediction Error Dispersion
LSTM	Short-term	0.0035	0.0591	0.9412	±0.1158
Transformer	Short-term	0.0031	0.0557	0.9483	±0.1092
TCN	Short-term	0.0033	0.0574	0.9461	±0.1125
BiLSTM	Short-term	0.0032	0.0563	0.9470	±0.1103
CNN-LSTM	Short-term	0.0030	0.0550	0.9500	±0.1078
KNN-LSTM	Short-term	0.0028	0.0511	0.9624	±0.1002
MHA-BiLSTM	Short-term	0.0026	0.0509	0.9627	±0.0980

Table 3. Experimental Results of Different Models for Long-Term Prediction.

Model	Time Scale	MSE	RMSE	R²	Prediction Error Dispersion
LSTM	Long-term	0.0079	0.0889	0.9026	±0.1741
Transformer	Long-term	0.0068	0.0825	0.9144	±0.1610
TCN	Long-term	0.0063	0.0794	0.9192	±0.1555
BiLSTM	Long-term	0.0070	0.0835	0.9110	±0.1631
CNN-LSTM	Long-term	0.0065	0.0805	0.9150	±0.1589
KNN-LSTM	Long-term	0.0058	0.0762	0.9225	±0.1494
MHA-BiLSTM	Long-term	0.0049	0.0700	0.9365	±0.1371

Table 4. Short-term Prediction Ablation Experiment Results.

Model	BiLSTM	Multi-Head Attention	Short-Term MSE	Short-Term RMSE	Short-Term R²
MHA-BiLSTM	√	√	0.0026	0.0509	0.9627
BiLSTM (No Attention)	√	-	0.0038	0.0617	0.9384
MHA-LSTM (without bidirectional mechanisms)	-	√	0.0041	0.0640	0.9302

Note: √ indicates that the corresponding module is included; - indicates that the module is not included.

Table 5. Long-term Prediction Ablation Experiment Results.

Model	BiLSTM	Multi-Head Attention	Long-Term MSE	Long-Term RMSE	Long-Term R²
MHA-BiLSTM	√	√	0.0049	0.0700	0.9365
BiLSTM (No Attention)	√	-	0.0061	0.0781	0.9235
MHA-LSTM (without bidirectional mechanisms)	-	√	0.0067	0.0820	0.9157

Note: √ indicates that the corresponding module is included; - indicates that the module is not included.

