ForecastNet Wind Power Prediction Based on Spatio-Temporal Distribution

Abstract: The integration of large-scale wind power into the power grid threatens the stable operation of the power system. Traditional wind power prediction is based on time series and does not consider the variability between wind turbines in different locations. This paper proposes a wind power probability density prediction method based on a time-variant deep feed-forward neural network (ForecastNet) that considers the spatio-temporal distribution. First, the outliers in the wind turbine data are detected with the isolation forest algorithm and repaired through Lagrange interpolation. Then, based on the graph attention mechanism, the features of the proximity node information of the individual wind turbines in the wind farm are extracted and the input feature matrix is constructed. Finally, the wind power probability density prediction results are obtained using the ForecastNet model with three different hidden layer variants. The experimental results show that the ForecastNet model with a hidden layer as a dense network based on the attention mechanism (ADFN) predicts better. The average width of the prediction intervals at achieved confidence levels for all interval coverage is reduced by 34.19%, 35.41%, and 35.17%, respectively, when compared to the model with the hidden layer as a multilayer perceptron. For different categories of wind turbines, ADFN also achieves relatively narrow interval average widths of 368.37 kW, 315.87 kW, and 299.13 kW, respectively.


Introduction
With energy and environmental issues becoming more and more prominent [1], global energy is moving towards low-carbon, clean, safe, and efficient development [2]. In recent years, the wind energy industry has been developing rapidly around the world. According to statistics, the global newly installed capacity of wind power reached 93.6 GW in 2021. However, wind power generation is random and fluctuating due to climate conditions. Therefore, when large-scale wind power is integrated into the grid, it can jeopardize the safe and stable operation of the grid [3,4]. Wind power prediction helps to optimize the grid integration of wind power, reduce operating costs, and promote the development of energy systems.
Existing wind power prediction methods are categorized by prediction horizon into ultra-short-term prediction [5,6], short-term prediction [7,8], and mid-to-long-term prediction [9,10]. Commonly used wind power prediction models include physical [11–14] and statistical [15–18] methods. Physical methods utilize weather forecast data and related geographic information, and the models are generally more complex. Moreover, the quality of the weather forecast data affects the prediction accuracy to a certain extent. Physical methods are often used for the mid-to-long-term prediction of new wind farms or wind farms with incomplete data. Statistical methods use learning algorithms to analyze historical data and establish the intrinsic connections between historical data. They are commonly used in short-term and ultra-short-term prediction.
Of the many forecasting methods, deep learning [19] algorithms are the most widely used, mainly including time series methods [20] and artificial intelligence algorithms [21]. Common artificial intelligence algorithms include random forests [22], support vector machines [23], extreme learning machines [24], and artificial neural networks [25]. Long short-term memory neural networks [26,27] are widely used in wind power prediction due to their excellent properties in analyzing time-series data. Convolutional neural networks (CNN) [28], which have achieved good performance in image processing, have also been applied to wind power prediction. Zhou et al. [29] proposed a combination of RANSAC (random sample consensus) noise screening and the Seq2Seq-Attention-BiGRU model to enhance prediction accuracy. Zhang et al. [30] proposed a novel hybrid prediction method involving individual prediction model training, model ensemble, and error correction considering temporal correlation. Chen et al. [31] proposed a variational mode decomposition-gated recurrent unit network prediction model to enhance the accuracy of ultra-short-term wind power forecasting. Although these studies have achieved more accurate wind power prediction results, they did not investigate the spatial characteristics of the wind power data.
Most wind power prediction studies predict the sum of the power of all turbines in the region, taking into account the temporal characteristics. However, the location and contextual information of wind turbines are neglected, and the variability of turbine power output at different spatial locations is not considered. Studies show that considering both temporal and spatial characteristics of wind turbine data can help improve the accuracy of wind power prediction. Yu et al. [32] proposed a spatiotemporal wind power forecasting model, which utilized a GAT (graph attention network), GRU (gated recurrent unit), and GAT-TCN (temporal convolutional network) as the main prediction methods. Zhang et al. [33] proposed a wind power prediction method based on spatiotemporal correlations considering the influences of wind speed, wind direction, and temperature. Wang et al. [34] proposed a short-term wind power forecasting method based on feature clustering and correlation analysis, improving forecasting accuracy through data feature clustering, variable correlation analysis, and building forecasting models.
Building on existing studies, this paper analyzes the spatial distribution and dynamic contextual information of wind turbines. In order to facilitate the study of the spatial characteristics of wind power data and better analyze the location information, this paper predicts wind power from the perspective of wind turbines rather than wind farms. The dataset used in this paper includes data such as the angle of the wind received by each turbine, the environmental temperatures around the different wind turbines, the internal temperatures of the turbine nacelles, and the orientation of each turbine nacelle. These data are used to model the spatial correlation between the wind turbines. Predictions for each wind turbine are obtained considering the spatial and temporal distribution of the turbines, and then the predictions for the different turbines are summed to obtain the final prediction. Specifically, this paper makes the following contributions:
• We detect outliers based on the isolation forest algorithm and repair them using the Lagrange interpolation method.
• We extract the information about the neighboring nodes of individual turbines in a wind farm based on the graph attention mechanism and construct the input feature matrix.
• We use a time-variant deep feed-forward neural network (ForecastNet) model to obtain the wind power probability density prediction results based on three different hidden layer variants.
The rest of the paper is structured as follows: Section 2 presents the theoretical knowledge of the method proposed in this paper. Section 3 presents the experiments using the method proposed in this paper and analyzes the experimental results. Section 4 provides the conclusions of this paper.

Wind Power Anomaly Data Processing
Wind farm data mainly include the spatial distribution of wind turbines, as well as dynamic background factors such as the temperature, weather, and internal state of turbines. Wind power data will inevitably contain anomalies and missing information introduced during collection, transmission, processing, and storage. These anomalous data can reduce the efficiency of model training and the accuracy of the prediction [35]. The anomalous data need to be handled without interfering too much with the original data.
The isolation forest algorithm quickly separates the outliers and vacancies in the data from the normal data. Each sample is scored according to the length of its path in the trees to determine its degree of abnormality: the higher the score, the higher the degree of abnormality of the data. By setting a certain proportion of abnormal data, the algorithm can quickly distinguish between abnormal and normal data. The anomaly score of the isolation forest algorithm is calculated as shown in Equation (1).
$$S(x) = 2^{-\frac{E(h(x))}{c(x)}} \tag{1}$$
where S(x) is the outlier score of sample x, with a value range of [0,1]. The larger the value, the more likely the sample is to be labeled as an outlier. h(x) is the path length of the sample in the tree, h(x) = ln(x) + ζ, and ζ is Euler's constant. E(h(x)) is the mean value of the path length of the sample x in the tree. c(x) is the average search path length of the binary tree constituted by a dataset containing x samples. The Lagrange interpolation method is used to repair the abnormal data detected by the isolation forest, as shown in Equation (2).
$$L(x) = \sum_{j=0}^{n} y_j \prod_{i=0,\, i \neq j}^{n} \frac{x - x_i}{x_j - x_i} \tag{2}$$
where $x_j$ corresponds to wind speed and $y_j$ corresponds to wind power. Figure 1 shows the wind speed-power plot for the detection of anomalous data. Figure 1a shows the results before detection and repair, and Figure 1b shows the results after detection and repair. The orange dots in Figure 1a and Figure 1b are the anomalous data and the repaired data, respectively, and the blue dots are the normal data.
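The scoring and repair steps can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the normalizing term follows the standard isolation forest definition c(n) = 2H(n-1) - 2(n-1)/n with H(i) ≈ ln(i) + ζ, and the mean path length is assumed to be supplied by an already fitted forest (in practice, a library such as scikit-learn's `IsolationForest` would handle detection).

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant (zeta in the text)


def average_path_length(n: int) -> float:
    """c(n): average path length of an unsuccessful binary-tree search over n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """S = 2 ** (-E(h) / c(n)); values near 1 suggest an outlier."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange polynomial through the points (xs[j], ys[j]) at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        basis = 1.0
        for i, xi in enumerate(xs):
            if i != j:
                basis *= (x - xi) / (xj - xi)
        total += yj * basis
    return total


# Hypothetical usage: repair the power value at a wind speed of 2.5 m/s
# from three neighbouring normal samples (wind speed, power).
repaired = lagrange_interpolate([1.0, 2.0, 3.0], [1.0, 4.0, 9.0], 2.5)
```

Only a few neighbouring normal samples should be used as interpolation nodes, since high-degree Lagrange polynomials can oscillate strongly between nodes.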

Feature Clustering of Wind Power Data
Wind power output is closely related to factors in the weather environment. Generally, the feature with the strongest correlation with wind power is wind speed. Figure 2 shows the Pearson correlation heat map of wind power data features. From Figure 2, it can be seen that the features that have the greatest influence on wind turbine power output are wind speed (Wspd), the angle between the wind direction and the position of the turbine nacelle (Wdir), and the pitch angle of blades 1-3 (Pab1-Pab3).
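As an illustration of how the entries of such a heat map are obtained, the following minimal sketch computes the Pearson correlation coefficient between each feature column and the power column. The function names and the sample values are hypothetical and are not taken from the dataset used in this paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def correlation_with_target(features, target):
    """Map each feature name to its correlation with the target series."""
    return {name: pearson(column, target) for name, column in features.items()}


# Hypothetical sample: wind speed tracks power, the second column does not.
power = [100.0, 300.0, 500.0, 700.0]
features = {"Wspd": [3.0, 5.0, 7.0, 9.0], "Unrelated": [1.0, -1.0, -1.0, 1.0]}
ranking = correlation_with_target(features, power)
```

Features whose absolute correlation with power is largest are the natural candidates for the clustering and prediction inputs.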
Appl. Sci. 2024, 14, x FOR PEER REVIEW
In order to improve the intrinsic connection between wind power data and better explore the spatial and temporal distribution state of wind power, the K-sums algorithm is used to segmentally cluster wind turbines at different locations [36]. Based on the features related to wind speed, wind turbines distributed in different spatial locations are differentiated. In this paper, the wind direction factor is added on the basis of wind speed. The speed and direction of natural wind change constantly over time, and the atmosphere has a certain inertia and incompressible mobility over a small time-scale range, which leads to periodic changes in wind speed and direction within a certain range. This is manifested by the angle between the wind direction and the wind turbine oscillating back and forth within a certain range [37].
The wind turbines are clustered into two classes based on wind speed characteristics, defined as the class that receives higher wind speeds and the class that receives lower wind speeds. In fact, due to geographic location and environmental factors, low output units still exist in the high wind speed range. For this reason, the high wind speed category is clustered a second time using wind direction characteristics such as Wdir and Pab as input variables, and finally, three types of wind turbines are obtained. Type 0 is for high wind speeds and high output units, type 1 is for high wind speeds and low output units, and type 2 is for low wind speeds and low output units. The results of the K-sums algorithm clustering, visualized on the layout of the wind farm, are shown in Figure 3.
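The two-stage procedure can be sketched as follows. This is only an illustration under stated assumptions: plain Lloyd's k-means is swapped in for the K-sums algorithm used in this paper, each turbine is summarized by hypothetical per-turbine averages (mean wind speed, a tuple of direction-related features, mean power), and the high-output subgroup in the second stage is identified by its mean power.

```python
def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means on tuples with a deterministic initialization."""
    ordered = sorted(points)
    # Spread the initial centers over the sorted data.
    centers = [ordered[(len(ordered) - 1) * c // max(k - 1, 1)] for c in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def two_stage_types(wspd, direction_feats, power):
    """Return 0 (high speed, high output), 1 (high speed, low output), or 2 (low speed)."""
    # Stage 1: split turbines by mean wind speed.
    labels1, centers1 = kmeans([(w,) for w in wspd], 2)
    high = max(range(2), key=lambda c: centers1[c][0])
    high_idx = [i for i, lab in enumerate(labels1) if lab == high]
    types = [2] * len(wspd)
    # Stage 2: split the high-speed class using direction-related features.
    labels2, _ = kmeans([direction_feats[i] for i in high_idx], 2)
    mean_power = []
    for c in range(2):
        members = [power[i] for i, lab in zip(high_idx, labels2) if lab == c]
        mean_power.append(sum(members) / len(members) if members else float("-inf"))
    high_out = max(range(2), key=lambda c: mean_power[c])
    for i, lab in zip(high_idx, labels2):
        types[i] = 0 if lab == high_out else 1
    return types


# Hypothetical per-turbine summaries for six turbines.
wspd = [8.0, 8.2, 7.9, 8.1, 3.0, 3.2]
direction_feats = [(1.0, 0.5), (1.2, 0.4), (20.0, 30.0), (21.0, 29.0), (5.0, 5.0), (6.0, 6.0)]
power = [1200.0, 1250.0, 300.0, 320.0, 100.0, 90.0]
turbine_types = two_stage_types(wspd, direction_feats, power)
```

Clustering in two stages keeps the wind speed split interpretable and applies the direction-related features only where they matter, within the high wind speed class.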
The spatial categorization of the three types of turbines is relatively concentrated, and the existence of some outliers is due to the different topography and ground roughness at different turbine locations. The gap between the turbines in the upper right corner of Figure 3 corresponds to a railroad.

Wind Power Prediction Model

Euclidean Distance and Differential Distance
In order to obtain the spatio-temporal distribution characteristics of wind turbines, the spatial distance perception function $G_d$ is constructed based on the Euclidean distance, and the differential distance perception function $G_s$ is constructed based on the differential similarity. $G_d$ can reflect the explicit neighboring relationship between wind turbines, and $G_s$ can reflect the invisible neighboring relationship. We calculate the Euclidean distance between two nodes, take the K nodes with the closest distance as neighboring nodes, and form the set $N_d(i)$ to obtain the spatial matrix $A(i, j)$. The expression of $A(i, j)$ is shown in Equation (3). We calculate the differential similarity $Sim(i, j)$ between two nodes, which is calculated as in Equation (4), and the closest K nodes are taken as the differential proximity nodes to form the set $N_s(i)$.
where $x_{i,w} \in \mathbb{R}^{T \times 1}$ denotes the wind speed sequence of the i-th wind turbine. For each wind turbine, both $N_d(i)$ and $N_s(i)$ are combined to aggregate wind speed information from the two kinds of neighboring turbines, which is merged into the input features to improve the performance of the wind power prediction.
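The construction of the two neighbor sets can be sketched as follows. The sketch rests on two assumptions that are not confirmed by the text above: the spatial matrix is taken to be A(i, j) = 1 if j belongs to the neighbor set of i and 0 otherwise, and the differential similarity is taken to be the cosine similarity of the first differences of the wind speed sequences. All function names are hypothetical.

```python
import math


def k_nearest(i, coords, k):
    """Indices of the k turbines closest to turbine i in Euclidean distance."""
    dists = [(math.dist(coords[i], c), j) for j, c in enumerate(coords) if j != i]
    return [j for _, j in sorted(dists)[:k]]


def diff_similarity(a, b):
    """Assumed similarity: cosine similarity of first-differenced sequences."""
    da = [y - x for x, y in zip(a, a[1:])]
    db = [y - x for x, y in zip(b, b[1:])]
    dot = sum(p * q for p, q in zip(da, db))
    norm_a = math.sqrt(sum(p * p for p in da))
    norm_b = math.sqrt(sum(q * q for q in db))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def k_most_similar(i, series, k):
    """Indices of the k turbines whose wind speed dynamics best match turbine i."""
    sims = [(-diff_similarity(series[i], s), j) for j, s in enumerate(series) if j != i]
    return [j for _, j in sorted(sims)[:k]]


def adjacency(neighbor_sets):
    """Assumed binary matrix: entry (i, j) is 1 when j is a neighbor of i."""
    n = len(neighbor_sets)
    return [[1 if j in neighbor_sets[i] else 0 for j in range(n)] for i in range(n)]


# Hypothetical layout and wind speed sequences for four turbines.
coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
series = [[1, 2, 3, 4], [10, 9, 8, 7], [5, 6, 7, 8], [2, 2, 3, 3]]
spatial_sets = [k_nearest(i, coords, 1) for i in range(len(coords))]
similar_sets = [k_most_similar(i, series, 1) for i in range(len(series))]
A = adjacency(spatial_sets)
```

In this toy example, turbine 0 is spatially closest to turbine 1 but follows the same wind speed trend as turbine 2, which is the kind of invisible neighboring relationship the second set is meant to capture.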

Graph Attention
In order to establish spatial and temporal correlations between wind turbines in different geographic locations, an attention-based spatio-temporal graph network [38] is introduced into the time series prediction model. In order to improve the training efficiency and prediction accuracy of the model, the feature information needs to be filtered, and the information with low importance is ignored [39].
Graph attention networks process information by calculating a weight for each piece of information and aggregating the information according to these weights. Specifically, the attention score of the information is calculated using a query vector (Query), a key vector (Key), and a value vector (Value) [40], as given in Equation (5), where Q is the feature vector of the current node, K is the feature vector of the neighboring node, and V is the feature vector of the neighboring node after mapping by the weight W. An attention score is obtained by performing an inner product calculation on the feature vector of a node itself and the feature vectors of the neighboring nodes, and, after a normalization operation, the scores are used to weight the node feature vectors. Specifically:
1. Attention score calculation
The attention scores between the central target node and its neighbor nodes are calculated. The formula for calculating the attention on node i based on the features of node j is shown in Equation (6), where $e_{ij}$ is the attention score between node i and node j, a represents the correlation calculation function between nodes, $h_i$ and $h_j$ are the feature vectors of nodes i and j, and W is the weight.
$$e_{ij} = a(W h_i, W h_j) \tag{6}$$
2. Activation
The weight scores are activated using the activation function. The calculation formula is shown in Equation (7), where || is the vector vertical splicing operation, which splices the mapped column vectors; LeakyReLU is the activation function; a is the vector to be learnt; and $a^{T}$ is the transpose of vector a.
$$e_{ij} = \mathrm{LeakyReLU}\left(a^{T}\left[W h_i \,\|\, W h_j\right]\right) \tag{7}$$
3. Weight normalization
The sum of all the weights should be 1; therefore, the attention score is normalized. The weights are normalized over the attention values of all neighboring nodes of node i using the softmax function. The calculation formula is shown in Equation (8), where $N(i)$ denotes the set of neighboring nodes of node i.
$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in N(i)} \exp(e_{ik})} \tag{8}$$
4. Information aggregation
The graph attention layer adds the node's feature information and the neighbor nodes' feature information according to the weight coefficients, performs feature extraction to form a new node feature representation, and outputs the new node feature as a result. The calculation formula is shown in Equation (9), where $h_i'$ is the new node feature, σ(•) is the activation function, and $\alpha_{ij}$ is the normalized attention score between node i and node j.
$$h_i' = \sigma\left(\sum_{j \in N(i)} \alpha_{ij} W h_j\right) \tag{9}$$
The features of the new node obtained according to the parameter W will be different in dimension from the features of the original node, and the parameter W will map the information of the original node to the new space.
5. Multi-head attention mechanisms
The multi-head attention mechanism introduces multiple attention mechanisms to aggregate information, and each attention mechanism is able to focus on different features to enhance the expression ability of the attention layer [41]. The multi-head attention mechanism splices the outputs of the multiple attention mechanisms into column vectors to obtain the final new node features, which are calculated using Equation (10), where || denotes the vector splicing operation, M is the number of attention mechanisms, $\alpha_{ij}^{k}$ represents the normalized value of the attention score calculated by the k-th attention mechanism, and $W^{k}$ is the weight matrix of the corresponding linear transformation.
$$h_i' = \Big\Vert_{k=1}^{M} \sigma\left(\sum_{j \in N(i)} \alpha_{ij}^{k} W^{k} h_j\right) \tag{10}$$
Using the graph attention mechanism to process the spatial feature information, combined with the multi-head attention mechanism, different attention score weights are calculated for the target node and its neighboring nodes [42]. This helps improve the model's representation of the spatial dimension and reduces the risk of overfitting.
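The steps of the graph attention layer described above (scoring, LeakyReLU activation, softmax normalization, aggregation, and multi-head splicing) can be sketched in plain Python. This is a minimal illustration rather than the implementation used in this paper: the function names, the LeakyReLU slope of 0.2, and the choice of tanh as the default output activation are assumptions, and every node is assumed to have a non-empty neighbor list.

```python
import math


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(H, neighbors, W, a, sigma=math.tanh):
    """Single attention head: H holds node features, neighbors[i] lists N(i)."""
    Wh = [matvec(W, h) for h in H]
    out = []
    for i, nbrs in enumerate(neighbors):
        # Score each neighbor: LeakyReLU(a^T [W h_i || W h_j]).
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, Wh[i] + Wh[j]))) for j in nbrs]
        # Softmax over the neighbors (shifted by the maximum for stability).
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Weighted aggregation of the transformed neighbor features.
        agg = [sum(al * Wh[j][d] for al, j in zip(alphas, nbrs)) for d in range(len(W))]
        out.append([sigma(v) for v in agg])
    return out


def multi_head(H, neighbors, heads, sigma=math.tanh):
    """Concatenate the outputs of several heads, each given as a (W, a) pair."""
    outs = [gat_layer(H, neighbors, W, a, sigma) for W, a in heads]
    return [sum((o[i] for o in outs), []) for i in range(len(H))]


# Hypothetical three-node graph; a zero attention vector gives uniform weights.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[0, 1, 2], [1], [2, 0]]
identity_W = [[1.0, 0.0], [0.0, 1.0]]
zero_a = [0.0, 0.0, 0.0, 0.0]
single = gat_layer(H, neighbors, identity_W, zero_a, sigma=lambda v: v)
double = multi_head(H, neighbors, [(identity_W, zero_a)] * 2, sigma=lambda v: v)
```

With a zero attention vector the layer reduces to averaging over the neighborhood, which is a convenient sanity check before the weights are learned.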

ForecastNet Model
ForecastNet [43] is a multi-layer feed-forward neural network model that is commonly used to perform multi-step time series forecasting. Its network structure is shown in Figure 4, where each neuron can receive signals from the previous layer of neurons, resulting in a multilayer structure. This neural network structure is commonly used for processing time series data because it captures trends and periodicities in time series, as well as other complex dynamic features.
ForecastNet has time-varying and interleaved output properties among model neurons. The former alleviates the problem of gradient vanishing during the training of recurrent neural network (RNN) and CNN models. The latter alleviates the problem of gradient explosion and gradient vanishing that occurs during neural network training. The root cause of gradient explosion and gradient vanishing is the repeated use of the chain rule in neural network gradient computation. The interleaved output properties of ForecastNet are shown in Figure 5, where the upper row represents the hidden layer network, the lower row represents the output layer, $a^{[l]}$ is the output vector of the l-th layer, $W^{[l]}$ is the weight matrix of the l-th layer, and $b^{[l]}$ is the bias parameter matrix. The interleaving breaks the chain of multiplications in the chain rule into the sum of multiple terms. The accumulation process is more stable than the multiplication process, which can significantly reduce the depth of the network while alleviating the gradient explosion and gradient vanishing problems.
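The time-varying and interleaved structure can be sketched as a forward pass. This is an illustrative sketch under stated assumptions, not the model used in this paper: each forecast step is given its own independently initialized dense hidden cell and linear output, and each cell after the first is assumed to receive the original input, the previous hidden state, and the previous output. All names, layer sizes, and the random initialization are hypothetical.

```python
import math
import random


def dense(weights, bias, inputs, activation):
    """Fully connected layer: activation(W x + b), computed row by row."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def relu(v):
    return max(0.0, v)


def identity(v):
    return v


def init_layer(n_out, n_in, rng):
    """Random weights and zero biases for a layer with n_in inputs and n_out outputs."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_cells(n_in, n_hidden, n_steps, seed=0):
    """One cell per forecast step, each with its own parameters (time-variant)."""
    rng = random.Random(seed)
    cells = []
    for t in range(n_steps):
        # The first cell sees only x; later cells also see the previous
        # hidden state and the previous output (interleaved outputs).
        hidden_in = n_in if t == 0 else n_in + n_hidden + 1
        cells.append({
            "hidden": init_layer(n_hidden, hidden_in, rng),
            "out": init_layer(1, n_hidden, rng),
        })
    return cells


def forecast(cells, x):
    """Run the cells in sequence and collect one output per forecast step."""
    outputs, prev_hidden, prev_out = [], [], []
    for cell in cells:
        hidden_input = list(x) + prev_hidden + prev_out
        hidden = dense(*cell["hidden"], hidden_input, relu)
        out = dense(*cell["out"], hidden, identity)
        outputs.append(out[0])
        prev_hidden, prev_out = hidden, out
    return outputs


# Hypothetical usage: 4 input features, 3 hidden units, 12 forecast steps.
cells = build_cells(n_in=4, n_hidden=3, n_steps=12)
predictions = forecast(cells, [0.1, 0.2, 0.3, 0.4])
```

Because every step has its own output layer, a training loss can be attached to each output directly, which is what turns the long product of gradient factors into a sum of shorter terms.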
The input layer of ForecastNet is univariate or a set of multivariate inputs; the hidden layers are different forms of feed-forward neural networks, such as common back propagation (BP) networks, radial basis function (RBF) networks, etc. The architecture of each hidden layer can be heterogeneous or identical, and the parameters of each hidden layer are independent of each other, which is used to simulate the dynamic characteristics of the time series. Different variants of the ForecastNet model can be obtained by using different feed-forward networks in the hidden layer:
ForecastNet model with a multilayer perceptron (MLP) as the hidden layer (MLPFN) A multilayer perceptron is a special form of a fully connected neural network.Its main difference from a fully connected network is its hidden layer.The hidden layer can improve the MLP's expressive ability for the network, thus improving its ability to solve complex prediction or classification problems.Figure 6 shows a schematic diagram of the ForecastNet hidden layer using an MLP structure, where dense is the fully connected layer and h represents the number of neuron nodes in each hidden layer.The hidden layer and the output layer of the MLP network are fully connected layers.Each layer has 24 ReLU neuron units, where the neuron nodes are fully connected.The input layer of ForecastNet is univariate or a set of multivariate inputs; the layers are different forms of feed-forward neural networks, such as commo propagation (BP) networks, radial basis function (RBF) networks, etc.The architec each hidden layer can be heterogeneous or identical, and the parameters of each layer are independent of each other, which are used to simulate the d characteristics of the time series.Different variants of the ForecastNet model obtained by using different feed-forward networks in the hidden layer: 6. ForecastNet model with a multilayer perceptron (MLP) as the hidden layer (M A multilayer perceptron is a special form of a fully connected neural netw main difference from a fully connected network is its hidden layer.The hidden la improve the MLP's expressive ability for the network, thus improving its ability t complex prediction or classification problems.Figure 6 shows a schematic diagram ForecastNet hidden layer using an MLP structure, where dense is the fully connecte and h represents the number of neuron nodes in each hidden layer.The hidden la the output layer of the MLP network are fully connected layers.Each layer has 2 neuron units, where the neuron nodes are fully connected.

7.
ForecastNet model with CNN as the hidden layer (CNNFN) CNN is a kind of artificial neural network.Its structure is mainly composed of three parts: the convolution layer, pooling layer, and dense layer, as shown in Figure 7.The main role of the convolution layer is to extract the features, where f is the number of convolution kernels and k is the size of the convolution kernel.The pooling layer is used for down sampling, where s is the filling of the pooling layer and p is the step size of the pooling layer.The dense layer is mainly used for feature classification, where h is the number of hidden layers of the dense layer, which is composed of 24 ReLU neurons.
The input layer of ForecastNet is univariate or a set of multivariate inputs; the layers are different forms of feed-forward neural networks, such as commo propagation (BP) networks, radial basis function (RBF) networks, etc.The archite each hidden layer can be heterogeneous or identical, and the parameters of each layer are independent of each other, which are used to simulate the d characteristics of the time series.Different variants of the ForecastNet model obtained by using different feed-forward networks in the hidden layer: 6. ForecastNet model with a multilayer perceptron (MLP) as the hidden layer (M A multilayer perceptron is a special form of a fully connected neural netw main difference from a fully connected network is its hidden layer.The hidden la improve the MLP's expressive ability for the network, thus improving its ability t complex prediction or classification problems.Figure 6 shows a schematic diagram ForecastNet hidden layer using an MLP structure, where dense is the fully connecte and h represents the number of neuron nodes in each hidden layer.The hidden la the output layer of the MLP network are fully connected layers.Each layer has 2 neuron units, where the neuron nodes are fully connected.

ForecastNet model with CNN as the hidden layer (CNNFN)
CNN is a kind of artificial neural network.Its structure is mainly composed parts: the convolution layer, pooling layer, and dense layer, as shown in Figure main role of the convolution layer is to extract the features, where f is the num convolution kernels and k is the size of the convolution kernel.The pooling layer for down sampling, where s is the filling of the pooling layer and p is the step siz pooling layer.The dense layer is mainly used for feature classification, where number of hidden layers of the dense layer, which is composed of 24 ReLU neuro

8.
ForecastNet model with the hidden layer as a dense network based on the attention mechanism (ADFN) The model is shown in Figure 8.
In this paper, the ForecastNet model is used as the basis for wind power spatio-temporal distribution prediction research by using an MLP, a CNN, and a dense network based on the attention mechanism in its hidden layer. Each output in the output layer is a prediction result: a linear output can be used to obtain a deterministic prediction result, or a Gaussian mixture density can be used to establish a probability density output.

ForecastNet Methodology
The output power differs from turbine to turbine, and building a separate model for each wind turbine is both labor-intensive and requires a tedious and complicated training process. Therefore, the prediction is based on the average wind power. Specifically, the higher the active power (Patv) of a wind turbine, the closer its output is to the theoretical output, which means that it is more appropriate to select the turbines with high wind speeds in the k-means clustering. The data from the N wind turbines with the highest output and high spatio-temporal correlation are selected as the historical data for training the model. Predictions are made using these data as the output of other turbines, and the average of the predictions is used as the prediction for the other turbines in the wind plant.
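The selection and averaging step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names and sample values are hypothetical, turbines are ranked by mean Patv only, and the spatio-temporal correlation criterion is left out.

```python
from statistics import mean


def select_top_turbines(patv_history, n):
    """Return the ids of the n turbines with the highest mean active power."""
    ranked = sorted(patv_history,
                    key=lambda tid: mean(patv_history[tid]),
                    reverse=True)
    return ranked[:n]


def farm_prediction(per_turbine_predictions):
    """Average several per-turbine forecasts time step by time step."""
    return [mean(step) for step in zip(*per_turbine_predictions)]


# Hypothetical Patv histories (kW), keyed by turbine id.
patv_history = {
    "T1": [900.0, 950.0, 1000.0],
    "T2": [300.0, 320.0, 310.0],
    "T3": [1200.0, 1100.0, 1150.0],
    "T4": [700.0, 720.0, 680.0],
}

top = select_top_turbines(patv_history, n=2)
averaged = farm_prediction([[100.0, 200.0], [300.0, 400.0]])
```

In the paper, the value of N is chosen from the deterministic prediction results rather than fixed in advance.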
The new node data obtained from the data processed by the graph attention network mentioned in Section 2.2.2 contains not only the wind turbine feature data and historical wind power data of its own node but also the feature data of the neighboring nodes. These new node data are used as inputs to the ForecastNet model to build a prediction model and perform multi-step predictions. Each step predicts the data for the next 12 time points, and the prediction results obtained from the previous step are merged with the historical features before the next prediction step. Together, they are used as inputs to the next model. This allows for more accurate learning of wind power trends over time while also fully considering the correlation of wind power data between inputs and outputs and between outputs at different moments. Following the multi-step prediction steps shown in Figure 9, a multi-step prediction of wind power is performed to obtain the prediction results at multiple future points in time. Adding a linear output model to the ForecastNet output layer allows for obtaining deterministic prediction results, and adding a Gaussian mixture density module [44] allows for obtaining probability density prediction results. The specific process framework diagram is shown in Figure 10.
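The rolling scheme described above, in which each 12-point block of predictions is merged back into the input window, can be sketched in plain Python. The names `rolling_forecast` and `persistence_block` are hypothetical, and the stand-in predictor simply repeats a recent average; a trained ForecastNet model would take its place.

```python
def rolling_forecast(history, predict_block, horizon, block=12):
    """Forecast `horizon` points in blocks, feeding each block back as input."""
    window = list(history)
    forecast = []
    while len(forecast) < horizon:
        step = predict_block(window, block)        # next `block` values
        forecast.extend(step)
        # Merge the new predictions with the history: drop the oldest
        # points and append the predicted ones, keeping the window length.
        window = window[len(step):] + list(step)
    return forecast[:horizon]


def persistence_block(window, block):
    """Stand-in predictor (assumption): repeat the mean of the last 3 points."""
    level = sum(window[-3:]) / 3
    return [level] * block


# Hypothetical history of 24 observations; forecast 30 future points.
history = [float(i) for i in range(24)]
out = rolling_forecast(history, persistence_block, horizon=30)
```

The horizon of 30 points is arbitrary here; the number of points covering two days depends on the sampling interval of the dataset.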

Evaluation Metrics
Common error evaluation metrics such as root mean squared error (RMSE) and mean absolute error (MAE) are only applicable to deterministic prediction results and cannot be applied to probabilistic prediction results. In interval probabilistic prediction, the prediction interval coverage is a major metric for evaluating uncertain predictions. The formula is shown in Equation (11):
$$R_{cover} = \frac{c}{v} \times 100\% \quad (11)$$
where $R_{cover}$ is the interval coverage, c is the number of true values in the test set that fall within the prediction interval, and v is the total number of true values in the test set.

Interval coverage indicates the proportion of real values included in the prediction interval. A higher coverage indicates a more accurate prediction. Under a given confidence level, the coverage of the prediction interval cannot be lower than the confidence level; otherwise, the prediction result is unreliable. In wind power prediction, interval coverage is one of the important metrics used to assess the prediction accuracy. However, relying solely on the interval coverage is insufficient to evaluate the quality of the interval predictions; the interval width metric is also essential. The interval average width metric describes the average width of the prediction interval and is usually used to assess the accuracy and effectiveness of interval prediction. For a given interval coverage, a smaller average width indicates that the model provides a more precise range in the prediction process. Conversely, a model whose interval coverage is high but whose average interval width is too large may still have certain defects. The calculation of the interval average width metric is shown in Equation (12), where $Y_{max}$ is the maximum value of wind power in the test set at the prediction time t, $Y_{min}$ is the minimum value of wind power in the test set at the prediction time t, $U_t$ is the upper boundary of the prediction interval, and $L_t$ is the lower boundary of the prediction interval. The larger the average interval width, the wider the power interval obtained from the prediction and the less valid information it can provide. Combining the two metrics, interval average width and interval coverage, to jointly judge the uncertain prediction results is of practical significance.
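The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names and sample values are hypothetical, coverage is returned as a fraction c / v, and dividing the mean width by Y_max - Y_min is an assumption based on the symbols defined for Equation (12). The raw mean width, in the units of the data, is shown alongside it.

```python
def interval_coverage(y_true, lower, upper):
    """Share of true values that fall inside [L_t, U_t], i.e. c / v."""
    c = sum(1 for y, lo, up in zip(y_true, lower, upper) if lo <= y <= up)
    return c / len(y_true)


def average_width(lower, upper):
    """Mean interval width in the units of the data (e.g. kW)."""
    return sum(up - lo for lo, up in zip(lower, upper)) / len(lower)


def normalized_average_width(y_true, lower, upper):
    """Mean width divided by the observed range (assumed normalization)."""
    return average_width(lower, upper) / (max(y_true) - min(y_true))


# Hypothetical test-set values and interval bounds (kW).
y_true = [10.0, 20.0, 30.0, 40.0]
lower = [5.0, 18.0, 31.0, 35.0]
upper = [15.0, 25.0, 36.0, 45.0]

coverage = interval_coverage(y_true, lower, upper)   # 3 of 4 values covered
width = average_width(lower, upper)
norm_width = normalized_average_width(y_true, lower, upper)
```

A model is preferred when its coverage meets the confidence level and, among such models, its average width is smallest.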

Results and Discussion
In this paper, we use the open-source 2022 wind farm dataset from Longyuan Power Group Co. (Beijing, China) to verify the effectiveness of the proposed method.
The proximity nodes are identified based on the spatial perception function and differential distance perception function. The data of the neighboring turbine nodes are aggregated based on the graph attention model to improve the model effect. Predictions are made using the ForecastNet model with different hidden layers, predicting data at 12 future time points at a time. Rolling predictions are used to determine the power generation from wind turbines in the next two days. The deterministic prediction results are obtained using linear output at the output layer, and the interval prediction results of wind power are obtained using a Gaussian mixture density network for the probability density output.
Wind energy fluctuates over time, and there are differences in the variability at different locations in the same wind farm. A spatio-temporal graph is constructed based on the physical spatial location of individual wind turbines, and the data of the model are reconstructed based on the established relationships between nodes and edges. The strong temporal correlation of the time series model is taken into account while also fully considering the characteristics of the wind turbines in terms of their physical location in space.

Analysis of Deterministic Prediction Results
Deterministic prediction values can be obtained by using linear output in the output layer of ForecastNet. The RMSE and MAE metric values are calculated separately for the ForecastNet model and the other three prediction models: support vector regression (SVR), K-nearest neighbor (KNN), and light gradient-boosting machine (LightGBM). Tables 1 and 2 show the RMSE and MAE metrics of the prediction results when the top 1-5 proximate wind turbines with the highest power outputs are selected. Figure 11 demonstrates the trend in RMSE evaluation metrics for selecting different numbers of wind turbines with the highest power output (THPO) in the multi-step prediction process.
From Figure 11, it can be concluded that ForecastNet obtains similar error distributions when predicting proximity wind turbines. This indicates that the ForecastNet model achieves relative feature aggregation in dealing with the proximity of wind turbines, obtaining similar error outputs. The RMSE within the short-term prediction is controlled to be less than 60, and its prediction trend is relatively good and stable. However, after 200 predictions, a large error fluctuation occurs, and then the error fluctuation tends to be more stable. The reason for this may be that the wind speed and other characteristics fluctuated more during this prediction. Overall, most of the prediction steps have good error performance, and from Figure 11 and Tables 1 and 2, it can be seen that better prediction results can be achieved by selecting the four wind turbines with the highest power outputs for prediction.
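For reference, the two deterministic error metrics used in Tables 1 and 2 can be computed as in the sketch below. The function names and sample values are hypothetical and serve only to show the standard definitions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical measured and predicted wind power (kW).
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 330.0, 400.0]

error_rmse = rmse(y_true, y_pred)   # sqrt(275)
error_mae = mae(y_true, y_pred)     # 12.5
```

Because RMSE squares the errors, a single large deviation, such as the fluctuation after 200 predictions, raises it more than it raises MAE.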

Analysis of Probability Density Prediction Results
To verify the prediction effect of ForecastNet on the uncertainty of the spatio-temporal distribution of wind power, the hidden layer of ForecastNet is modified. Different hidden layer structures, namely an MLP, a CNN, and an attention-based fully connected network, are chosen to analyze their effects on the prediction. A rolling forecast is used to predict the wind power: each step predicts the data of the next 12 time points, and the multi-step prediction yields the distribution of wind power over the next 2 days. The prediction results of different prediction steps in the multi-step prediction process were randomly selected, and the prediction intervals at different confidence levels (80%, 85%, and 90%) are shown in Figure 12.
As can be seen from Figure 12, the prediction intervals of each probabilistic prediction model accurately cover most of the real values, and the trends in the upper and lower limits of the prediction intervals are also consistent with the trend in the wind power. The wind power prediction intervals obtained at different confidence levels are different, and the larger the confidence level, the wider the given prediction interval. The specific prediction evaluation indexes are shown in Table 3.
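The relationship between confidence level and interval width can be illustrated with a one-dimensional Gaussian mixture, the form of output produced by a mixture density layer. The sketch below is an assumption-laden example rather than the paper's procedure: it uses central (equal-tailed) intervals obtained by inverting the mixture CDF with bisection, and the mixture parameters are hypothetical.

```python
import math


def mixture_cdf(x, weights, means, stds):
    """CDF of a 1-D Gaussian mixture evaluated at x."""
    return sum(
        w * 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))
        for w, m, s in zip(weights, means, stds)
    )


def mixture_quantile(q, weights, means, stds, tol=1e-6):
    """Invert the mixture CDF by bisection on a wide bracket."""
    lo = min(m - 10.0 * s for m, s in zip(means, stds))
    hi = max(m + 10.0 * s for m, s in zip(means, stds))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, weights, means, stds) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def central_interval(confidence, weights, means, stds):
    """Equal-tailed prediction interval at the given confidence level."""
    alpha = 1.0 - confidence
    return (mixture_quantile(alpha / 2.0, weights, means, stds),
            mixture_quantile(1.0 - alpha / 2.0, weights, means, stds))


# Hypothetical mixture for one prediction time (kW).
weights, means, stds = [0.6, 0.4], [500.0, 650.0], [40.0, 60.0]
widths = []
for level in (0.80, 0.85, 0.90):
    low, high = central_interval(level, weights, means, stds)
    widths.append(high - low)   # grows with the confidence level
```

Since the quantile function is monotone, raising the confidence level can only widen the interval, which matches the behavior seen in Figure 12.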
Different variants can be obtained by changing the hidden layer structure of ForecastNet. As can be seen from Table 3, the highest coverage of prediction intervals at different confidence levels is achieved by the CNNFN model, with interval coverages of 100%, 98.95%, and 95.32% at confidence levels of 90%, 85%, and 80%, respectively. However, interval coverage that is too high may cause the average width of the intervals to rise. The average interval width of the CNNFN model is also the largest among the three models, at 72.11 kW, 67.75 kW, and 63.31 kW at the 90%, 85%, and 80% confidence levels, respectively. For a given interval coverage, the wider the average interval, the less practical significance the interval prediction has.
Both the ADFN model and the MLPFN model obtain relatively good prediction results, and their prediction interval coverages reach the pre-specified confidence levels. At confidence levels of 90%, 85%, and 80%, the interval coverage of the MLPFN model is higher than that of the ADFN model by 4.11%, 4.28%, and 3.59%, respectively. However, the average interval width of the ADFN model is narrower, reduced by 34.19%, 35.41%, and 35.17%, respectively, compared with that of the MLPFN model, achieving better prediction results. Combining the above analyses, it can be concluded that the prediction effects of the ADFN and MLPFN models are better than that of the CNNFN model. Based on the attention mechanism, the ADFN model achieves a narrower interval width while its interval coverage meets the confidence level, and it is able to better fit the trend in wind power. The CNNFN model, on the other hand, obtains 100% interval coverage at the higher confidence level, making the prediction intervals less meaningful in practice.
In addition, predictions are made for the three different classes of wind turbines obtained from the previous clustering. Figures 13-15 show the probability density predictions of the ADFN model over the next two days. The more common probabilistic prediction models selected for comparison at the 90% confidence level are the LSTM model (a combination of a long short-term memory network and a Gaussian mixture density model) and the QRGBM model (a gradient booster using quantile output). Table 4 shows the corresponding prediction results.
As can be seen from Table 4, when uncertainty prediction is made for turbines in different clusters at the 90% confidence level, the turbines in the high wind speed and high output units and the turbines in the high wind speed and low output units achieve the given confidence level. The highest interval coverage in the prediction results is obtained by QRGBM, with 99.50%, 88.13%, and 97.01%, respectively. However, the interval width of QRGBM is wider than that of the other models, reaching 452.58 kW, 358.46 kW, and 329.57 kW, respectively. An overly wide interval leads to less valid information in the prediction results. The LSTM and ADFN models have similar interval coverages. For type 0 (high wind speed and high output units), the ADFN model obtains a slightly narrower average interval width; for the other types its intervals are wider, with the average width increasing by 5.23% and 4.34% compared to the LSTM model. With its attention mechanism, ADFN gains a small increase in interval coverage at the cost of a modest increase in average interval width, and achieves better prediction results overall. For type 2 (low wind speed and low output units), the interval coverages of all three models do not reach the confidence level, and the prediction effects are not good.
Furthermore, this paper randomly selects probability density functions of the ADFN prediction model for different wind turbines and different prediction time points from the multi-step prediction, as shown in Figure 16.
The straight line perpendicular to the x-coordinate axis in Figure 16 is the true value at that prediction moment, and the curve is the probability density distribution of the corresponding model. From Figure 16, it can be seen that the probability density curve is complete and smooth, and there are no missing, very high, or very low values. The curve is also not too broad or too narrow, indicating that the prediction effect of the algorithm is appropriate. Most of the real values in the prediction results of the ADFN model fall near the highest probability points of the probability density curve, such as when t = 167, t = 110, t = 100, and t = 0. This indicates that the algorithm predicts with high accuracy. When the actual wind power value is in the vicinity of the crest of the probability density curve, the real value of wind power falls in the high probability interval given by the interval prediction, and the prediction error at these moments is small. When t = 20, the real value of wind power deviates from the center of the probability curve, and when t = 30, the deviation is even larger. This indicates that the prediction error at these moments is large, and the prediction interval may not even cover the real value of wind power. In the probability density curve results obtained from the test set, the prediction results are reliable if most of the actual values are close to the center range of the curve and only a small number of actual values deviate farther. If most of the actual values deviate from, or are even far away from, the peak of the probability density curve, the obtained probability density prediction results are unreliable. In summary, the ADFN prediction model is reliable in predicting the probability density of different wind turbines.

Conclusions
This paper focuses on predicting the spatio-temporal distribution of wind power from different turbines in wind farms and proposes a ForecastNet prediction model based on the spatio-temporal distribution. In order to simplify the prediction, the data of N wind turbines with the highest output power in the wind farm are selected to train the model, where the determination of the N value is based on the deterministic prediction results. Then, the uncertainty prediction is performed on the wind power data. The prediction results of ForecastNet variant models based on different hidden layers are compared with those of common prediction models. The results show that the ADFN model has better prediction results when considering the reliability index and acuity index. In addition, the prediction model in this paper has the following features:
• The model uses Euclidean distance and difference distance combined with a graph attention network to aggregate spatio-temporal information on input features.
• In order to avoid the gradient problem when the neural network model predicts wind power in the short term, the time-varying characteristics and interleaved output characteristics of ForecastNet are used to improve the prediction effect.
• The model can predict the probability density curve of wind power at future moments, which can provide more effective information for grid decision makers.
and ζ is Euler's constant. $E(h(x))$ is the mean value of the path length of the sample x in the tree. $c(x)$ is the average search path length of the binary tree constituted by a dataset containing x samples. The Lagrange interpolation method is used to repair the abnormal data detected by the isolated forest, as shown in Equation (2).
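The quantities named here can be sketched in Python. The code follows the standard isolation forest formulation, which is assumed (not confirmed by this excerpt) to match Equations (1) and (2): the average path length is c(n) = 2H(n-1) - 2(n-1)/n with H(i) approximated by ln(i) plus Euler's constant, and the anomaly score is 2 raised to -E(h(x))/c(n). The function names and the sample power values are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant (written as zeta in the text)


def c(n):
    """Average path length of an unsuccessful binary-tree search over n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """Score in (0, 1]; values close to 1 suggest an outlier."""
    return 2.0 ** (-mean_path_length / c(n))


def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Hypothetical repair: the reading at time index 3 was flagged as an outlier,
# so it is re-estimated from the neighboring valid readings (kW).
times = [1.0, 2.0, 4.0, 5.0]
power = [410.0, 440.0, 455.0, 450.0]
repaired = lagrange_interpolate(times, power, 3.0)
```

Only a few neighboring points are used for the repair, since high-degree interpolating polynomials can oscillate between samples.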

Figure 1. Detection and repair of wind power outliers.

Figure 10. A wind power prediction framework based on spatio-temporally distributed ForecastNet.

Figure 11. Multi-step prediction error curves for different numbers of THPO.

Figure 12. Comparison of wind power prediction intervals for different ForecastNet models.

Figure 13. Comparison of interval prediction results for different models for turbine #1.

Figure 14. Comparison of interval prediction results for different models for turbine #96.

Figure 15. Comparison of interval prediction results for different models for turbine #128.

Figure 16. Probability density function of wind power at different moments.

Table 1. RMSE metrics for wind turbine prediction results.

Table 2. MAE metrics for wind turbine prediction results.

Table 3. Comparison of evaluation indexes for the different models.

Table 4. Comparison of forecast indicators for different wind turbines.