1. Introduction
With the accelerating transition of the global energy structure toward low-carbon and clean sources, the development and utilization of renewable energy has become a core strategy to address the energy crisis and environmental challenges [1, 2]. Photovoltaic (PV) power generation, featuring wide resource distribution, short construction periods and low operation and maintenance costs, has emerged as one of the most important forms of clean energy supply in modern power systems. As PV installed capacity keeps growing and grid integration expands, the inherent intermittency, randomness and volatility of PV power generation significantly affect the power balance, dispatching operation, security and stability of power grids, considerably increasing the difficulty of grid regulation and accommodation. High-precision PV power prediction is therefore a critical support for smoothing PV power fluctuations and enhancing grid controllability and new-energy accommodation capability, and it has important engineering value for ensuring the stable and efficient operation of power systems with high-penetration PV integration.
Current PV power prediction methods are mainly categorized into physical methods and statistical methods [3]. Physical methods rely on the photoelectric conversion mechanism of PV modules and on atmospheric physical processes, and establish mathematical models that map illumination, temperature and other factors to output power. However, such methods impose high requirements on the accuracy of model parameters and meteorological data, involve complicated modeling procedures, and are prone to systematic errors caused by environmental uncertainties. Statistical methods do not depend on explicit physical mechanisms; they mine the nonlinear correlation between historical meteorological data and power generation data, and implement learning and prediction via data-driven models. They are characterized by simple modeling, strong adaptability and low implementation costs. Mainstream statistical prediction methods include regression analysis, time series analysis, support vector machines and neural networks. With excellent data fitting and generalization performance, these methods have become the mainstream research direction in the field of PV power prediction.
In recent years, the rapid advancement of artificial intelligence has enabled deep learning to surpass the constraints of conventional shallow machine learning models. Deep learning architectures, typified by Convolutional Neural Networks (CNN) [4] and Recurrent Neural Networks (RNN) [5], have gained widespread adoption in short-term PV power forecasting. Although RNNs are inherently suited to time-series processing, they suffer from the long-term dependency problem. To mitigate this issue, Long Short-Term Memory (LSTM) networks incorporate gating mechanisms on top of the RNN structure, enabling selective retention of historical information and effectively alleviating long-term dependency [6]. In Ref. [7], a CNN-LSTM hybrid framework was presented, in which CNN is employed to extract deep nonlinear features and invariant patterns from the input data, followed by LSTM for sequential prediction. Ref. [8] introduced a combined model using a Temporal Convolutional Network (TCN) and LSTM; benefiting from parallel computing capability, TCN avoids typical drawbacks of recurrent structures, mitigates the gradient-related issues of LSTM, and shortens the prolonged training time caused by sequential computation. As a lightweight gated recurrent alternative to LSTM, the Gated Recurrent Unit (GRU) [9] has been extensively utilized in PV power forecasting owing to its fewer trainable parameters and higher computational efficiency. A CNN-GRU-based short-term PV power forecasting model was proposed in [10], which improves on the limited prediction accuracy of standalone GRU models in PV-related applications. However, most existing CNN-GRU forecasting models still have limitations in adaptive feature weighting and critical information extraction. A standard CNN cannot automatically distinguish and enhance important feature channels, leading to insufficient utilization of the effective information in PV input data.
To boost the prediction accuracy of hybrid forecasting models, researchers have increasingly integrated modal decomposition techniques and metaheuristic optimization algorithms into model pipelines. In Ref. [11], a unified PV power prediction framework was developed by combining LSTM with Empirical Mode Decomposition (EMD), Kernel Principal Component Analysis (KPCA), and the Sparrow Search Algorithm (SSA). EMD is adopted to decompose environmental parameter time series into feature components with diverse time scales. In Ref. [12], a short-term PV power prediction approach was proposed by fusing optimized Variational Mode Decomposition (VMD) with LSTM, where the optimized VMD is capable of decomposing complex fluctuating components of PV power into relatively independent subseries. A WOA-GRNN-based prediction model was presented in [13], in which the Whale Optimization Algorithm (WOA) is utilized to optimize the key parameters of the Generalized Regression Neural Network (GRNN), effectively improving prediction accuracy and stability.
The inherent intermittency and volatility of PV power generation pose severe challenges to achieving high-precision prediction. Although decomposition-based techniques have been widely employed for the preprocessing of PV power time series, conventional decomposition methods suffer from distinct limitations. EMD is plagued by severe mode mixing and significant end effects. While VMD offers a more solid theoretical basis and mitigates the deficiencies of EMD, it still requires manual tuning of the mode number K and penalty factor α, which frequently leads to over-decomposition or under-decomposition. These methods rely on data extrema and fixed parameters, resulting in passive decomposition that fails to adapt to the complex fluctuation characteristics of PV power. Accordingly, an adaptive frequency decomposition (AFD) approach is adopted in this work to dynamically separate low-frequency and high-frequency components, which effectively alleviates mode mixing and reduces parameter dependence, making it more suitable for PV power series with diverse fluctuation patterns [14].
Existing PV power forecasting studies mostly focus on single-station prediction, which fails to fully exploit the spatial correlation between adjacent PV stations in the distribution station area. The output of a single PV station is easily affected by local shading, equipment faults and other factors, resulting in strong randomness and volatility that limit prediction accuracy. A spatial data aggregation strategy based on geographical distance is therefore introduced in this paper to utilize spatial correlations and stabilize the input data characteristics.
Compared with hybrid forecasting methods, individual deep learning models show limited capacity in capturing spatial-temporal features. In this paper, a CNN-GRU-SE hybrid structure is constructed to extract deep spatial features, model long-term temporal dependencies, and emphasize critical features through the Squeeze-and-Excitation (SE) attention mechanism. However, the hyperparameters of hybrid learning models are usually set manually, which is inefficient and rarely yields optimal values. Thus, the multi-objective improved beluga whale optimization (MIBWO) algorithm [15] is adopted to optimize the hyperparameters of the hybrid method, reducing human interference, improving model adaptability, and further enhancing the prediction accuracy.
The main contributions are as follows:
- (1) To better select meteorological features, the Pearson correlation coefficient [16, 17] and the entropy weight method [18, 19] are employed.
- (2) To aggregate the power curves of geographically distributed stations, the Haversine formula [20, 21] is employed to calculate the geographical distances between different stations. The resulting aggregation mitigates the volatility differences among PV power stations and indirectly reduces prediction errors.
- (3) Compared with EMD [11] and VMD [12], the AFD method suppresses the mode mixing effect and reduces parameter dependency, improving the prediction accuracy of PV power.
- (4) To further improve the prediction accuracy, SE attention is used to adaptively enhance the weight of important features and suppress useless information. The MIBWO algorithm is employed to optimize the cutoff frequency of the AFD method and the hyperparameters of the hybrid CNN-GRU-SE method.
The remainder of this paper is organized as follows: Section 2 describes the process of meteorological feature selection using the Pearson correlation coefficient and entropy weight method, as well as data aggregation. The improved beluga whale optimization algorithm is introduced in Section 3. The adaptive frequency decomposition method, the construction of the overall prediction model, and the SHAP analysis are presented in Section 4. Simulation results are provided in Section 5. Conclusions are drawn in Section 6.
2. Data Correlation Analysis and Aggregation
2.1. Meteorological Feature Selection
The magnitude of PV power output is closely related to meteorological factors. This paper adopts a combined linear and nonlinear approach to comprehensively analyze the impact of meteorological features on PV output.
The Pearson correlation coefficient is employed to calculate the linear correlation between each meteorological factor and PV power output. It measures the linear correlation between two variables, $X$ and $Y$, and is defined as the ratio of their covariance to the product of their standard deviations [16, 17]. This method is applicable to continuous variables and assumes that both variables follow a normal distribution. The value of the Pearson correlation coefficient lies in the range [−1, 1]. Its calculation formula is presented in (1).
$$r = \frac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^{2}}\,\sqrt{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^{2}}} \tag{1}$$
where $X_i$ and $Y_i$ are the $i$-th observations of $X$ and $Y$; $\bar{X}$ and $\bar{Y}$ are the means of $X$ and $Y$; $n$ is the sample size; and $r$ is the Pearson correlation coefficient.
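As a minimal sketch of the covariance-over-standard-deviations definition above, the following Python snippet computes the Pearson coefficient for two short series. The function name `pearson_r` and the toy irradiance and power values are hypothetical illustrations, not data from the station cluster analyzed in this paper.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson coefficient: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of X from its mean
    dy = y - y.mean()  # deviations of Y from its mean
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# Hypothetical toy samples (illustrative only)
irradiance = [0.0, 120.0, 480.0, 760.0, 910.0]
power = [0.0, 0.9, 3.8, 6.1, 7.0]
r = pearson_r(irradiance, power)
```

The result can be cross-checked against `np.corrcoef`, which implements the same quantity.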
To further assess objectively the importance of the selected meteorological features for PV power prediction, this paper adopts the entropy weight method for feature weighting. Based on the degree of variation in each feature value, this method calculates the objective weights of the features using information entropy [18]. Assuming there are $m$ samples and $n$ evaluation indicators, the original data matrix is $X = (x_{ij})_{m \times n}$ [19]. The formulas of the entropy weight method are as follows:
where (2) and (3) represent two forms of standardization: (2) corresponds to positive indicators, and (3) corresponds to negative indicators. The standardized data satisfy $y_{ij} \in [0, 1]$; $p_{ij}$ denotes the proportion of the $i$-th sample under the $j$-th indicator; $S_j$ represents the information entropy of the $j$-th indicator; and $T_j$ is the weight of the $j$-th indicator, where $1 - S_j$ is referred to as the information redundancy or divergence coefficient.
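The entropy weight procedure described above can be sketched as follows. This is an illustrative implementation under the common textbook form of the method (min-max standardization, column-wise proportions, entropy normalized by ln m, weights proportional to 1 − S_j); the function name `entropy_weights`, the handling of constant columns, and the toy matrix are assumptions rather than details taken from this paper.

```python
import numpy as np

def entropy_weights(X, positive=None):
    """Entropy weights for an (m samples x n indicators) matrix."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    if positive is None:
        positive = np.ones(n, dtype=bool)  # assume all indicators are positive
    positive = np.asarray(positive, dtype=bool)

    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    # Min-max standardization: positive vs. negative indicators
    Y = np.where(positive, (X - lo) / span, (hi - X) / span)

    col_sum = Y.sum(axis=0)
    # Proportion of sample i under indicator j; constant columns become uniform
    P = np.where(col_sum > 0, Y / np.where(col_sum > 0, col_sum, 1.0), 1.0 / m)

    # Information entropy S_j, treating 0 * ln(0) as 0
    plogp = np.zeros_like(P)
    mask = P > 0
    plogp[mask] = P[mask] * np.log(P[mask])
    S = -plogp.sum(axis=0) / np.log(m)

    D = 1.0 - S  # divergence coefficient
    total = D.sum()
    return D / total if total > 0 else np.full(n, 1.0 / n)

# Hypothetical toy matrix: column 1 varies far more unevenly than column 0
weights = entropy_weights([[1, 0], [2, 0], [3, 0], [4, 10]])
```

In this toy case the second indicator receives the larger weight because its values are concentrated in a single sample, which gives it lower entropy.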
This paper selects actual data from a cluster of PV power stations in a certain region for correlation analysis. The influencing factors include global irradiance, diffuse irradiance, air temperature, air pressure, wind direction and speed, as well as humidity.
The correlation coefficients between each meteorological parameter and PV power are shown in Figure 1.
Figure 1 shows that global irradiance exhibits the closest relationship with PV power generation, with a correlation coefficient of 0.96. Diffuse irradiance is also highly correlated with power generation and has a significant influence on it. Air temperature, humidity, and wind speed show a certain degree of correlation with power generation, while atmospheric pressure and wind direction demonstrate a weaker correlation.
From the left panel of Figure 2, it can be observed that humidity exhibits the highest proportion of mutual information. This is attributed to the inclusion of nighttime periods in the dataset, during which both irradiance and PV power output are zero, diluting the relationship between irradiance and PV power. Overall, the weights assigned by the entropy weight method to features such as temperature, wind direction, and atmospheric pressure, in relation to PV power, are low, indicating that their nonlinear relationships are weak. Although the entropy weight of temperature is not high, temperature is retained because of its linear correlation with PV power.
Therefore, this paper selects global irradiance, diffuse irradiance, temperature, humidity, and wind speed as the meteorological input features for the prediction model.
2.2. Transformer Area-Level Data Aggregation and Intra-Transformer Area Data Aggregation
Since PV power generation is influenced by factors such as terrain, component installation tilt angle, and cloud movement, the output of neighboring power stations exhibits spatially similar variation trends. It is considered feasible to aggregate and analyze data from nearby PV power stations within a certain range.
Data aggregation involves calculating the weighted average of meteorological features and PV power generation from different power stations within a specified range, followed by summation to produce a new dataset. This process is fundamentally a form of dataset preprocessing.
The specific method for data aggregation employs the Haversine formula to calculate the distances between power stations within the region [20]. A distance decay function is then applied to convert geographical distances into correlation coefficients [21]. Finally, a weighted average summation is performed based on the correlation coefficients between the power stations. The detailed calculation formulas are as follows:
where the latitude and longitude of power stations $i$ and $j$ are expressed in radians; $R$ is the mean radius of the Earth (typically taken as 6371 km); $d$ is the geographical distance between the two power stations (in km); and the distance-based correlation coefficient between power stations $i$ and $j$ is obtained from $d$ through the distance decay function.
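The distance step can be illustrated with a short Python sketch. The Haversine distance is standard; the exponential decay `exp(-d / scale_km)`, the 10 km scale, the normalization of the weights, and the station coordinates are assumptions made only for illustration, since the decay function used in this paper is not reproduced here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; all angles in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def distance_correlation(d_km, scale_km=10.0):
    """ASSUMED exponential distance decay (hypothetical form and scale)."""
    return math.exp(-d_km / scale_km)

# Hypothetical station coordinates in degrees; index 0 is the representative station
stations_deg = [(30.00, 120.00), (30.02, 120.01), (30.05, 119.97)]
stations = [(math.radians(la), math.radians(lo)) for la, lo in stations_deg]

ref = stations[0]
coeffs = [distance_correlation(haversine_km(*ref, *s)) for s in stations]
# ASSUMED normalization so that the aggregation weights sum to one
agg_weights = [c / sum(coeffs) for c in coeffs]
```

With this choice the representative station has coefficient 1 (distance zero), and stations farther away contribute progressively less to the weighted average.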
After calculating the distance-based correlation coefficients between the power stations, a representative power station located near the geographical center of the region is selected. The self-correlation coefficient of the representative power station is 1, while the correlation coefficients between the other power stations and the representative power station are calculated using (12). The formula for calculating the aggregated feature weights of power stations within the region is as follows:
Due to the limited geographical scope of a transformer area, it can be assumed that the same meteorological characteristics apply throughout the area. Unlike transformer area-level data aggregation, intra-transformer area data aggregation involves averaging the instantaneous active power and positive active power of multiple loads within the same transformer area at the same moment. This average value is then used to represent the instantaneous and positive active power of the entire transformer area.
3. Multi-Objective Improved Beluga Whale Optimization Algorithm
The BWO algorithm is inspired by the collective behavior of beluga whales and their mechanisms of information sharing among individuals. It is characterized by its simple structure, ease of implementation, and high stability. However, there remains room for improvement in both its convergence speed and solution accuracy.
3.1. Population Initialization Based on Chaotic Mapping with Opposite Solutions
The BWO algorithm exhibits a certain search capability during its initial phase. However, because the initial individuals are generated randomly, they tend to cluster together, which leaves the algorithm susceptible to becoming trapped in local optima from which it cannot escape. To seek the global optimum and expand the search space, population diversity can be enhanced at the initialization stage by introducing number sequences with irregular properties, which increases the randomness of individuals in the early iterations. The piecewise linear chaotic map (PWLCM) is therefore employed to improve the randomness and ergodicity of the initial individuals and thus the capacity to escape local optima. The PWLCM chaotic mapping is presented as
where $m(t)$ represents the state value of the chaotic sequence at the $t$-th iteration, and $m(t + 1)$ denotes the updated state value at the $(t + 1)$-th iteration. The parameter $n$ is a critical segmentation coefficient that controls the piecewise structure of the map. In this paper, $n$ is set to 0.4, which satisfies the constraint $0 < n < 0.5$ and ensures that the mapping is fully chaotic. The initial state $m(0)$ is randomly generated within the interval (0, 1), excluding the segment breakpoints $n$, 0.5, and $1 - n$ to guarantee the ergodicity of the chaotic sequence. By utilizing PWLCM for population initialization, the search space of the algorithm is expanded, and the randomness and ergodicity of the initial beluga individuals are significantly improved.
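A minimal sketch of chaotic initialization is given below, assuming the standard form of the PWLCM (slope 1/n on [0, n), slope 1/(0.5 − n) on [n, 0.5), mirrored about 0.5). The helper names `pwlcm_step` and `pwlcm_population`, the seeding, and the example bounds are hypothetical; the paper's own initialization may differ in detail.

```python
import numpy as np

def pwlcm_step(m, n=0.4):
    """One PWLCM iteration; the map is symmetric about 0.5."""
    if m >= 0.5:
        m = 1.0 - m  # fold the upper half onto the lower half
    if m < n:
        return m / n
    return (m - n) / (0.5 - n)

def pwlcm_population(pop_size, dim, lower, upper, n=0.4, seed=0):
    """Chaotic initial population scaled into [lower, upper]."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    m = rng.uniform(0.0, 1.0)  # random m(0); hitting a breakpoint exactly is very unlikely
    values = np.empty(pop_size * dim)
    for idx in range(values.size):
        m = pwlcm_step(m, n)
        values[idx] = m
    chaos = values.reshape(pop_size, dim)
    return lower + chaos * (upper - lower)

# Hypothetical search bounds for two decision variables
population = pwlcm_population(pop_size=20, dim=2, lower=[0.0, 16.0], upper=[1.0, 128.0])
```

Each row of `population` is one candidate solution; the chaotic sequence replaces the uniform random draws that a plain initialization would use.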
Quasi-oppositional learning enables probabilistic updates of individual positions during the algorithm's iterative process. By leveraging the rich information provided by opposite individuals, it not only further enhances population randomness but also effectively improves the algorithm's convergence performance. To maintain global search capability while enhancing algorithmic performance, quasi-oppositional learning is integrated to optimize the strategy. The update of beluga individual positions using the quasi-oppositional learning strategy is shown as
where the quantities involved are the opposite solution of the current individual, the center value of the upper and lower bounds, and the new opposite solution generated by the quasi-oppositional learning strategy.
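The following sketch shows one common form of quasi-oppositional learning, in which the opposite solution is `lower + upper - x` and the quasi-opposite solution is drawn uniformly between the center of the bounds and that opposite solution. This form is assumed here for illustration; the function name `quasi_opposite` and the example values are hypothetical.

```python
import numpy as np

def quasi_opposite(x, lower, upper, rng):
    """Quasi-opposite point: uniform sample between the bound center and the opposite solution."""
    x = np.asarray(x, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    opposite = lower + upper - x          # opposite solution of x
    center = (lower + upper) / 2.0        # center of the search bounds
    lo = np.minimum(center, opposite)
    hi = np.maximum(center, opposite)
    return rng.uniform(lo, hi)            # element-wise uniform draw

rng = np.random.default_rng(1)
x = np.array([0.1, 100.0])
lower = np.array([0.0, 16.0])
upper = np.array([1.0, 128.0])
x_qo = quasi_opposite(x, lower, upper, rng)
```

A typical use is to evaluate both `x` and `x_qo` and keep whichever has the better fitness.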
3.2. Dynamic Constrained Local Perturbation Search Mechanism
The transition of the BWO algorithm from exploration to exploitation is determined by a balance factor. Complex optimization problems often have multiple local optima, and a linearly decreasing balance factor may hinder the algorithm's ability to escape them, whereas an enhanced global search capability is required during the early stages. A nonlinear convergence factor is therefore introduced to improve the balance factor. This nonlinear convergence factor is shown as
where $k$ is a positive constant that regulates how quickly the nonlinear convergence factor grows or decays. In the early iterations, a larger convergence factor gives the algorithm robust global exploration capability and guarantees a wide search range. As the iteration count increases and the solution approaches the optimum, the balance factor decreases rapidly, which enhances the local optimization capability.
Considering the complexity of the PV power prediction optimization problem, multiple comparative experiments verify that a nonlinear convergence factor with $k = 0.6$ achieves a suitable dynamic balance between broad global search in the early stage and rapid local optimization in the later stage. This setting enables the algorithm to explore the solution space sufficiently in the early iterations and prevents premature convergence. It also allows fast convergence to the optimal solution in the late iterations, striking a balance between search efficiency and optimization accuracy. For both the univariate optimization of the cutoff frequency $f_c$ in AFD and the multivariate joint optimization of the CNN-GRU-SE hyperparameters (learning rate, number of GRU units, number of convolution kernels, etc.), the maximum number of iterations $T_{max}$ is set to 50.
3.3. Differentiated Population Optimization Strategy
In the later iterations of the BWO algorithm, beluga individuals carry the optimal fitness value forward into the next iteration cycle during the whale fall stage. Population diversity diminishes as all individuals converge towards the optimal solution. After each iteration, the beluga population becomes unevenly distributed, which raises the likelihood of the algorithm being trapped in local optima and degrades convergence precision.
In the search and predation phase of the whale optimization algorithm, the current position update is based on changes in coefficient $A$. If coefficient $A$ exceeds the specified range, the current position of the whale individual is randomly updated via distance $D$. This expression is shown as
Inspired by the whale optimization algorithm, and to prevent the BWO algorithm from falling into premature convergence while enhancing its convergence accuracy on multimodal functions, a differentiated population evolution strategy is proposed. It is assumed that weaker individuals in the beluga population perish during activities such as swimming and foraging. At this point, the fittest individual in the population inspects the location of the deceased individual. Since the death position of the beluga is random, the inspection step size of the optimal individual is set as $C_3$, expressed as
Because the predation strategy of the whale optimization algorithm can enhance both the optimization capability and the convergence speed of the algorithm, it is introduced into the position update of belugas in the whale fall phase of the BWO algorithm as
where $r_8$ is a random number in (0, 1). The probability of whale fall $W_f$ is determined based on the balance factor $B_f$. If a whale fall occurs, the current optimal individual's position is updated through the differentiated population evolution strategy. This essentially introduces perturbations to the current optimal value, preventing the population from becoming trapped in local optima.
3.4. Optimization Problem Definition
In this paper, the MIBWO algorithm undertakes two core optimization tasks: the optimization of the AFD cutoff frequency and the hyperparameter optimization of the CNN-GRU-SE model. The specific definitions are as follows:
- (1) Optimization of AFD Cutoff Frequency
The AFD realizes the adaptive decomposition of PV power sequences via the cutoff frequency $f_c$, and the selection of the cutoff frequency directly affects the decomposition performance and subsequent prediction accuracy. The cutoff frequency $f_c$ is taken as the decision variable, and the objective function is constructed to minimize the prediction error of the decomposed sequences. The optimization problem can be formulated as
where $y_{\mathrm{pred},t}(f_c)$ denotes the predicted value corresponding to cutoff frequency $f_c$, $y_{\mathrm{true},t}$ denotes the true value, $N$ is the number of samples, and $[f_{\min}, f_{\max}]$ represents the search range of the cutoff frequency.
- (2) Hyperparameter Optimization of the CNN-GRU-SE Model
Model hyperparameters directly affect feature extraction capability and prediction accuracy. The number of CNN kernels $n_{cnn}$, the number of hidden layer neurons in GRU $n_{gru}$, the learning rate $\eta$, and the batch size $b$ are taken as the decision variables $s = [n_{cnn}, n_{gru}, \eta, b]$. With the minimization of model prediction error as the objective function, the optimization problem can be formulated as
where $\Theta$ denotes the search space of hyperparameters. In this paper, the MIBWO algorithm is adopted to solve the above two optimization tasks jointly, so as to achieve the global optimization of the cutoff frequency and model hyperparameters and further improve the prediction accuracy.
4. Model Construction and Related Principles
4.1. Adaptive Frequency Decomposition
To tackle the intrinsic complexity of time-series data, decomposition techniques are widely employed to partition raw data into more structured sub-components, which improves predictive accuracy. In the adaptive frequency decomposition (AFD) approach used here, the time series is first transformed into the frequency domain through the fast Fourier transform (FFT). A dynamic spectral filter then automatically separates low-frequency trend components from high-frequency seasonal components based on the spectral properties of the dataset, and the inverse fast Fourier transform (IFFT) is applied to the separated components to reconstruct the trend sequence and the seasonal sequence in the time domain. Because the frequency intervals are determined adaptively from the spectral signature of each dataset, the entanglement of high and low frequencies is effectively prevented, and predictive precision and generalization performance are enhanced across diverse datasets. Frequency-domain transforms of this kind are widely used in time-series signal processing.
First, the FFT is applied to the input sequence to transform it from the time domain into the frequency domain. This captures the key frequency components associated with trend and seasonal characteristics and provides the basis for the subsequent decomposition and modeling. The Discrete Fourier Transform (DFT) is mathematically formulated as
where $L$ is the sequence length, taken as an integer power of 2; $x(t)$ is the sampled value at time step $t$ within the input sequence $X$; and $X(k)$ is the Fourier transform result.
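For illustration, the DFT in its standard form, $X(k) = \sum_{t=0}^{L-1} x(t)\, e^{-j 2\pi k t / L}$, can be evaluated directly and compared with NumPy's FFT. The direct summation below is only a sketch for checking the definition (the function name `dft` and the random test sequence are hypothetical); in practice the FFT computes the same values far more efficiently.

```python
import numpy as np

def dft(x):
    """Direct O(L^2) evaluation of X(k) = sum_t x(t) * exp(-2j*pi*k*t/L)."""
    x = np.asarray(x, dtype=complex)
    L = x.shape[0]
    t = np.arange(L)
    k = t.reshape(-1, 1)
    basis = np.exp(-2j * np.pi * k * t / L)  # (L, L) matrix of complex exponentials
    return basis @ x

# Hypothetical sequence of length L = 8 (a power of two)
rng = np.random.default_rng(0)
sequence = rng.normal(size=8)
X_direct = dft(sequence)
X_fft = np.fft.fft(sequence)
```

Both arrays agree to floating-point precision, which confirms that `np.fft.fft` follows the same sign and scaling convention as the formula above.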
Dynamic spectral filtering is employed to separate high-frequency spectral components from low-frequency ones. Conventional decomposition techniques typically partition the spectrum with fixed frequency intervals, which restricts their adaptability to the properties of different time series. The adaptive frequency-domain filtering approach introduced here instead partitions low-frequency and high-frequency components dynamically, based on the spectral properties of the input sequence. Aligning the decomposition with the measured spectrum allows trend (low-frequency) components to be distinguished accurately from seasonal (high-frequency) components.
For an input time series, the FFT is applied to obtain the corresponding frequency-domain representation. The squared amplitude of the complex spectrum (i.e., the power spectrum) is computed and averaged across all samples and channels to yield the global frequency power profile, which is formulated as
where $X_f^{(b,d)}(f)$ is the complex spectral component associated with the $d$-th feature of the $b$-th sample at frequency $f$.
The cumulative distribution function (CDF) of the power spectrum is used to derive a cutoff frequency $f_c$: spectral components below this threshold are primarily associated with the trend component of the input signal, and spectral components above it are primarily linked to seasonal or short-term variations. The power spectrum $P(f)$ of the input dataset is precomputed to characterize the energy associated with each frequency component, with frequency values ordered from $f = 0$ (DC component) to the maximum frequency $f_{\max}$. The cumulative energy $E(f_i)$ is computed and normalized to a percentage metric $R(f_i)$, shown as
In (26), $R(f_i)$ represents the share of the overall signal energy held by the frequency components from the minimum frequency up to $f_i$. An energy ratio threshold $\theta$ is established, which specifies the share of the overall signal energy to be covered by the low-frequency components, so that the cutoff frequency is derived adaptively for each dataset. The power spectrum is calculated for each dataset, and the cutoff is identified as the first frequency at which the cumulative energy share reaches or exceeds $\theta$.
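The cutoff selection described above can be sketched in a few lines of NumPy. This is an illustrative version that assumes a one-sided spectrum (`rfft`), a nonzero input, and an input array of shape (samples, length, features); the function name `energy_cutoff` and the synthetic two-tone signal are hypothetical.

```python
import numpy as np

def energy_cutoff(x, theta=0.9):
    """Return (f_c, R) for x of shape (samples, length, features)."""
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x, axis=1)            # one-sided spectrum along time
    power = np.abs(spectrum) ** 2                # power spectrum
    profile = power.mean(axis=(0, 2))            # average over samples and channels
    ratio = np.cumsum(profile) / profile.sum()   # cumulative energy share R(f_i)
    # First index where R(f_i) >= theta (ratio is non-decreasing)
    index = min(int(np.searchsorted(ratio, theta)), ratio.size - 1)
    freqs = np.fft.rfftfreq(x.shape[1])          # frequencies in cycles per sample
    return float(freqs[index]), ratio

# Hypothetical signal: strong slow oscillation plus weak fast oscillation
t = np.arange(64)
signal = 3.0 * np.sin(2 * np.pi * 2 * t / 64) + 0.5 * np.sin(2 * np.pi * 20 * t / 64)
fc, ratio = energy_cutoff(signal.reshape(1, 64, 1), theta=0.9)
```

Here the slow component at 2 cycles per 64 samples carries most of the energy, so the cumulative share first exceeds θ at that frequency and it is returned as the cutoff.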
Since selecting the energy ratio threshold empirically is rather complex, this paper adopts the MIBWO algorithm to determine the energy ratio threshold and thereby optimize the cutoff frequency $f_c$. After extensive experimental optimization, it is observed that a $\theta$ value of approximately 0.9 ensures effective separation of high-frequency and low-frequency components in the AFD.
Then, a Gaussian high-pass filter is constructed, with its frequency weighting function formulated as
where $\sigma$ is an adjustable smoothing parameter governing the sharpness of the filtering operation. The low-frequency and high-frequency components are obtained after filtering and are transformed back to the time domain through the IFFT, yielding the trend component and the seasonal component. In this paper, $\sigma$ is set to 0.05, which is determined through extensive comparative experiments to balance noise suppression and detail retention for the PV power series.
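The filtering and reconstruction step might look like the sketch below. The exact Gaussian weighting function of this paper is not reproduced; the form used here, `1 - exp(-(max(f - fc, 0))^2 / (2 * sigma^2))`, is an assumption chosen so that frequencies below the cutoff pass entirely to the trend and the transition above the cutoff has width governed by σ. The function name `afd_split` and the test signal are hypothetical.

```python
import numpy as np

def afd_split(x, fc, sigma=0.05):
    """Split a 1-D series into (trend, seasonal) with an ASSUMED Gaussian high-pass weight."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    freqs = np.fft.rfftfreq(n)                    # cycles per sample, 0 .. 0.5
    spectrum = np.fft.rfft(x)
    excess = np.maximum(freqs - fc, 0.0)          # distance above the cutoff
    high_w = 1.0 - np.exp(-excess ** 2 / (2.0 * sigma ** 2))  # hypothetical weighting
    seasonal = np.fft.irfft(spectrum * high_w, n=n)           # high-frequency part
    trend = np.fft.irfft(spectrum * (1.0 - high_w), n=n)      # low-frequency part
    return trend, seasonal

# Hypothetical series: offset + slow trend-like wave + fast fluctuation
t = np.arange(128)
series = 5.0 + 2.0 * np.sin(2 * np.pi * t / 128) + 0.3 * np.sin(2 * np.pi * 30 * t / 128)
trend, seasonal = afd_split(series, fc=2 / 128, sigma=0.05)
```

Because the two weights sum to one at every frequency, the trend and seasonal components add back to the original series exactly, up to floating-point error.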
4.2. Convolutional Neural Network
CNN is a deep learning model designed to process data with grid-like structures. A typical CNN consists of convolutional layers, pooling layers, and fully connected layers, each with distinct functions. Convolutional layers use multiple kernels that slide over the input sequence, performing convolution operations on local regions to extract features; the parameters of the same kernel are shared across the entire sequence, thereby effectively capturing repetitive patterns. Pooling layers are usually placed after convolutional layers, compressing feature lengths through downsampling (such as max pooling or average pooling) to reduce computational complexity and extract features with scale invariance. In this paper, CNN is employed to perform local feature extraction and deep information mining on the fused multi-dimensional input features. The formulas of the convolutional layer and fully connected layer are presented as
where $y_{t,j}$ denotes the $j$-th feature value of the output feature map at time step $t$; $x_{t+i-1}$ represents the element in the input data corresponding to the region covered by the convolution kernel; $k$ is the convolution kernel size; $w_{i,j}$ is the convolution kernel weight; $b_j$ is the bias term; $W$ is the weight matrix; $x$ is the input vector; $b$ is the bias term; and $y$ is the output.
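Using the symbols defined above, a one-dimensional convolution of the form $y_{t,j} = \sum_{i=1}^{k} w_{i,j}\, x_{t+i-1} + b_j$ and a fully connected layer $y = Wx + b$ can be sketched in NumPy. The sketch assumes a univariate input, no padding, stride 1, and no activation function; the function names and numeric values are hypothetical.

```python
import numpy as np

def conv1d_valid(x, w, b):
    """y[t, j] = sum_i w[i, j] * x[t + i] + b[j] (zero-based indices, no padding)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)   # shape (k, number of kernels)
    b = np.asarray(b, dtype=float)   # shape (number of kernels,)
    k = w.shape[0]
    # Stack every length-k window of the input sequence
    windows = np.stack([x[t:t + k] for t in range(x.shape[0] - k + 1)])
    return windows @ w + b

def dense(x, W, b):
    """Fully connected layer y = W x + b."""
    return np.asarray(W, dtype=float) @ np.asarray(x, dtype=float) + np.asarray(b, dtype=float)

# Hypothetical example: one kernel of size 2 that sums adjacent samples
feature_map = conv1d_valid([1.0, 2.0, 3.0, 4.0], w=[[1.0], [1.0]], b=[0.5])
output = dense(feature_map[:, 0], W=[[1.0, 0.0, -1.0]], b=[0.0])
```

The same kernel weights are reused at every time step, which is the parameter sharing described in the text.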
4.3. Gated Recurrent Unit
The GRU is a variant of the RNN built from gated control units. It adopts a streamlined structure with only two gating mechanisms, the reset gate and the update gate, and merges the cell state and hidden state of the LSTM architecture into a single hidden state. This design reduces computational complexity while preserving the ability to capture temporal dependencies. The corresponding mathematical formulations are presented as follows, with the network architecture depicted in Figure 3.
where $x_t$ is the input at time step $t$; $W_z$, $W_r$, $W_h$, $U_z$, $U_r$, and $U_h$ are weight matrices; $h_t$ and $h_{t-1}$ are the hidden states at time steps $t$ and $t - 1$, respectively; tanh is the hyperbolic tangent function; $\tilde{h}_t$ is the intermediate memory state; $\sigma$ denotes the Sigmoid activation function; ⨂ denotes element-wise multiplication; and $z_t$ and $r_t$ are the outputs of the update gate and reset gate at time $t$, respectively.
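A single GRU time step can be written out with the weight matrices listed above. This sketch follows one widely used convention, $h_t = (1 - z_t) \otimes h_{t-1} + z_t \otimes \tilde{h}_t$, and omits bias terms; some formulations swap the roles of $z_t$ and $1 - z_t$, so the convention is an assumption. The function names and dimensions are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update (biases omitted; update convention is an assumption)."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev))   # intermediate memory state
    return (1.0 - z_t) * h_prev + z_t * h_tilde         # new hidden state

# Hypothetical sizes: 3 input features, 2 hidden units, all-zero weights
x_t = np.array([0.2, -0.1, 0.4])
h_prev = np.array([1.0, -2.0])
W0 = np.zeros((2, 3))
U0 = np.zeros((2, 2))
h_t = gru_step(x_t, h_prev, W0, U0, W0, U0, W0, U0)
```

With all-zero weights both gates equal 0.5 and the intermediate state is zero, so the new hidden state is simply half of the previous one, which makes the role of the update gate easy to see.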
4.4. Squeeze-And-Excitation
The Squeeze-and-Excitation (SE) module is an attention mechanism-based module designed to enhance the ability of CNN to capture data features [
22,
23]. In this paper, the SE module is adopted to weight and strengthen the deep features extracted by the convolutional layers. The workflow of the SE module is as follows:
The squeeze operation begins with global average pooling, which aggregates each channel’s 2D feature map into a compact, channel-wise descriptive vector.
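$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j), \quad c = 1, 2, \ldots, C \tag{34}$$
where $z_c$ is the global descriptor scalar for the $c$-th channel; $X_c$ is the feature map of that channel; $i$ and $j$ index the spatial row and column dimensions, respectively; $H$ and $W$ denote the height and width of the feature map; and $C$ is the total number of channels.
The excitation stage is subsequently performed, in which channel weights are learned using two fully connected layers and non-linear activation functions.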
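$$s = \sigma(W_2\, \delta(W_1 z))$$
where $z$ is the channel description vector derived from (34), $s$ denotes the corresponding channel weight vector, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are the weight matrices of the two fully connected layers, $r$ refers to the channel compression ratio, $\delta(\cdot)$ represents the ReLU activation function, and $\sigma(\cdot)$ stands for the Sigmoid activation function.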
The recalibration operation is implemented to remap the learned channel weights onto the original feature maps.
$$\tilde{X}_c = s_c \cdot X_c$$
where $\tilde{X}_c$ denotes the recalibrated feature of the $c$-th channel, $X_c$ represents the original feature map, and $s_c$ is the corresponding channel weight.
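The three SE stages (squeeze, excitation, recalibration) can be illustrated with a small plain-Python sketch. The function name `se_block`, the toy feature maps, and the weight values are assumptions chosen so the arithmetic is easy to follow; bias terms in the fully connected layers are omitted.

```python
import math


def se_block(X, W1, W2):
    """Squeeze-and-Excitation on X, a list of C channels (each an H x W grid).

    W1 has shape (C/r, C) and W2 has shape (C, C/r).
    Returns the descriptors z, the channel weights s, and the recalibrated maps.
    """
    # Squeeze: global average pooling per channel
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in X]
    # Excitation: FC -> ReLU -> FC -> Sigmoid
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in W1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in W2]
    # Recalibration: scale every element of channel c by its weight s[c]
    X_tilde = [[[sc * v for v in row] for row in ch] for sc, ch in zip(s, X)]
    return z, s, X_tilde


# Toy example: C = 2 channels, H = 1, W = 2, compression ratio r = 2.
X = [[[1.0, 3.0]], [[2.0, 2.0]]]
W1 = [[0.5, 0.5]]        # (C/r) x C
W2 = [[0.0], [1.0]]      # C x (C/r)

z, s, X_tilde = se_block(X, W1, W2)
print(z, s, X_tilde)
```

In this toy case the first channel receives weight 0.5 and the second a weight close to 0.88, so the second channel is emphasized relative to the first after recalibration.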
4.5. MIBWO-AFD-CNN-GRU-SE Hybrid Prediction Model
The prediction flowchart of the proposed method is shown in Figure 4, and the steps are as follows:
Step 1: The Pearson correlation coefficient and the entropy weight method are used to analyze the relationship between meteorological factors and PV power, and the meteorological factors that are highly correlated with PV power are selected as features. The distance correlation coefficient between power stations is then calculated to aggregate each feature sequence and the PV power sequence. After preprocessing, the dataset is fed into the model.
Step 2: Adaptive frequency decomposition is applied to the PV power sequence, and the MIBWO algorithm is used to optimize the cut-off frequencies. The resulting high-frequency and low-frequency components are combined with the original feature set to form a new feature set.
Step 3: The new feature set serves as the input to the combined forecasting model. The CNN extracts the key characteristics of the power-related feature set, which also reduces its dimensionality. The SE attention mechanism then recalibrates the feature values of each channel, enhancing important features while suppressing less significant ones, thus improving the model’s expressive capacity and generalization ability. The MIBWO algorithm is adopted to optimize the model’s hyperparameters, including the learning rate, the number of GRU neurons, the number of convolution kernels, and the number of neurons in the SE fully connected layer.
Step 4: After the model training is completed, predictions are made on the test set, and the predicted results are subsequently output.
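The correlation-based screening in Step 1 can be sketched as follows. This minimal example covers only the Pearson part of that step; the entropy weight method and the distance-correlation aggregation are not shown. The function names, the toy data, and the threshold of 0.5 are hypothetical and are not values reported in this paper.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if std_a == 0.0 or std_b == 0.0:
        return 0.0  # a constant sequence carries no correlation information
    return cov / (std_a * std_b)


def select_features(features, power, threshold=0.5):
    """Keep features whose absolute correlation with power reaches the threshold."""
    scores = {name: pearson(values, power) for name, values in features.items()}
    selected = [name for name, r in scores.items() if abs(r) >= threshold]
    return selected, scores


# Toy data (hypothetical): five samples of PV power and two candidate features.
power = [0.0, 1.0, 2.0, 3.0, 4.0]
features = {
    "irradiance": [0.0, 10.0, 20.0, 30.0, 40.0],
    "humidity": [3.0, 1.0, 4.0, 1.0, 5.0],
}

selected, scores = select_features(features, power)
print(selected, scores)
```

Using the absolute value of the coefficient keeps strongly negative relationships as well as strongly positive ones.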
4.6. SHAP Feature Analysis
The SHAP (Shapley Additive Explanations) method is one of the most advanced and widely used interpretability tools for machine learning models [24]. Based on the Shapley value principle from cooperative game theory, SHAP explains individual prediction outcomes by allocating contribution values to each feature, while also enabling quantitative analysis of global feature importance, feature marginal effects, and interaction mechanisms [25].
The core idea of SHAP is to treat each feature as a “player” in a cooperative game, where the model’s prediction result is the total payout to be fairly distributed among all features.
Without loss of generality, let the feature set of the sample be $F = \{x_1, x_2, \ldots, x_n\}$ and the prediction function of the model be $f$. Then $f(X)$ is the prediction for the full feature set $X$, and $f_S(X)$ is the prediction when only the feature subset $S \subseteq F$ is retained.
The contribution of feature subset $S$ relative to the empty set $\varnothing$ is defined as
$$v(S) = f_S(X) - f_{\varnothing}(X)$$
where $f_{\varnothing}(X)$ is the baseline prediction (usually the average prediction of the model on the training set).
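For any feature $x_i \in F$, its SHAP value $\phi_i(X)$ (i.e., the marginal contribution to the prediction result) is calculated by the Shapley value formula, which averages the marginal contribution of $x_i$ across all possible feature coalitions:
$$\phi_i(X) = \sum_{S \subseteq F \setminus \{x_i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \left[ f_{S \cup \{x_i\}}(X) - f_S(X) \right]$$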
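To make the coalition averaging concrete, the sketch below computes exact Shapley values by enumerating every feature subset. It is a brute-force illustration for a handful of features, not the approximation scheme used by SHAP software; the function `shapley_values` and the toy model `toy_f_S` (a baseline of 10 with an interaction between features `a` and `b`) are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(f_S, features):
    """Exact Shapley values by enumerating all coalitions.

    f_S(subset) returns the prediction when only the features in the
    frozenset `subset` are retained. Cost grows as 2^n, so this is only
    practical for small n.
    """
    n = len(features)
    phi = {}
    for xi in features:
        others = [x for x in features if x != xi]
        total = 0.0
        for size in range(n):
            # Weight |S|! (n - |S| - 1)! / n! for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                S = frozenset(combo)
                total += weight * (f_S(S | {xi}) - f_S(S))
        phi[xi] = total
    return phi


def toy_f_S(S):
    """Hypothetical model: baseline 10, main effects for a and b, plus an interaction."""
    value = 10.0
    if "a" in S:
        value += 2.0
    if "b" in S:
        value += 3.0
    if "a" in S and "b" in S:
        value += 4.0
    return value


features = ["a", "b", "c"]
phi = shapley_values(toy_f_S, features)
print(phi)
```

The interaction term of 4 is split equally between `a` and `b`, the unused feature `c` receives zero, and the values sum to the difference between the full prediction and the baseline.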
4.7. Model Evaluation Metrics
To quantify the discrepancy between the forecasted power profile and the measured power profile, three error metrics are used to evaluate model performance: Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). The corresponding mathematical formulations are as follows:
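$$MAE = \frac{1}{T} \sum_{i=1}^{T} \left| X_i - Y_i \right|$$
$$MSE = \frac{1}{T} \sum_{i=1}^{T} \left( X_i - Y_i \right)^2$$
$$RMSE = \sqrt{\frac{1}{T} \sum_{i=1}^{T} \left( X_i - Y_i \right)^2}$$
where $X_i$ is the predicted value at the $i$-th data point, $Y_i$ is the actual value at the $i$-th data point, and $T$ is the total duration of PV output in the scenario, i.e., the total number of output data points.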
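The three metrics can be computed directly from paired predicted and measured values, as in the minimal sketch below. The function names and the toy values are assumptions for illustration and are not results from this paper.

```python
import math


def mae(pred, actual):
    """Mean Absolute Error."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)


def mse(pred, actual):
    """Mean Squared Error."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred)


def rmse(pred, actual):
    """Root Mean Squared Error, the square root of the MSE."""
    return math.sqrt(mse(pred, actual))


# Toy values (hypothetical): four predicted and measured power points.
predicted = [2.0, 4.0, 6.0, 8.0]
measured = [1.0, 4.0, 8.0, 8.0]

print(mae(predicted, measured), mse(predicted, measured), rmse(predicted, measured))
```

Because the MSE and RMSE square each error, the single error of 2 in this example contributes four times as much as the error of 1, whereas the MAE weights errors in proportion to their size.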