M2GSNet: Multi-Modal Multi-Task Graph Spatiotemporal Network for Ultra-Short-Term Wind Farm Cluster Power Prediction

Abstract: Ultra-short-term wind power prediction is of great importance for the integration of renewable energy. It is the foundation of probabilistic prediction, and even a slight increase in prediction accuracy can significantly improve the safe and economic operation of power systems. However, due to the complex spatiotemporal relationships and the intrinsic nonlinearity, randomness and intermittence of wind, predicting the power of a regional wind farm cluster and of each wind farm remains a challenge. In this paper, a framework based on graph neural networks and numerical weather prediction (NWP) is proposed for ultra-short-term wind power prediction. First, the adjacent matrix of the wind farms, which are regarded as the vertexes of a graph, is defined based on geographical distance. Second, two graph neural networks are designed to extract the spatiotemporal features of historical wind power and NWP information separately. Then, these features are fused based on multi-modal learning. Third, to enhance the efficiency of the prediction method, a multi-task learning method is adopted to extract the common features of the regional wind farm cluster and output the prediction of each wind farm at the same time. Case studies on a wind farm cluster located in Northeast China verify that the accuracy of regional wind farm cluster power prediction is improved and that the time consumption increases slowly as the number of wind farms grows. The results indicate that this method has great potential for use in large-scale wind farm clusters.


Introduction
Renewable energy, especially wind energy, has become the key to alleviating the energy problem. The installed capacity of wind power is also increasing year by year and most wind farms are integrated into grids in the form of large-scale clusters. Due to the fluctuation and intermittence of wind, wind power not only provides clean energy, but also brings severe challenges to the safe and stable operation of power systems. Accurate prediction of wind speed and wind power is a fundamental requirement and basic task to ensure the grid connection of wind power [1].
There is a tremendous amount of research on ultra-short-term wind power prediction, which can be divided into two types of methods: physical methods and statistical learning methods. The physical method models the wind behavior according to the equations of atmospheric motion and can simulate the nonlinear characteristics of the wind process. However, the parameter of the physical power and wind speed [20][21][22]. The downscale method obtains higher-resolution wind speed or wind power from lower-resolution NWP or prediction results. The downscaled NWP wind speed can provide more precise information for wind power prediction [23].
In fact, building a comprehensive wind farm prediction model is difficult on two levels. The first is how to effectively use the complex spatial-temporal relationship among the historical wind power data and NWP data of different wind farms in order to increase the accuracy of prediction. The second is how to obtain the output of every single wind farm and of the whole region efficiently, especially when the number of wind farms is large.
To address these two difficulties, we propose a hybrid prediction framework based on deep learning for wind power prediction in a region, called the Multi-modal Multi-task Graph Spatiotemporal NETwork (M2GSNet). The main contributions are as follows:
(1) We design a spatiotemporal graph convolutional network that can extract the spatiotemporal features of the historical wind power and NWP data of wind farms in a given region. To the best of our knowledge, we are the first to employ a spectral graph neural network for ultra-short-term wind farm cluster power prediction. Compared to previous wind power prediction methods, it takes the global geographical location into consideration and makes better use of the historical wind power and NWP information of wind farms in a region. It can reduce the normalized root mean square error (RMSE) in the fourth hour by 1.75%.
(2) We also design multi-task learning for the wind power prediction of all the wind farms. This enhances the learning efficiency by combining similar learning tasks and sharing the weights of some neural network layers. The power of every single wind farm and of the whole region can be predicted efficiently in one model. The time consumption of forecasting 20 wind farms is only 4.1 times that of forecasting one wind farm. There is also great potential to expand the method to a region that contains hundreds of wind farms.
The rest of this paper is organized as follows. Section 2 analyzes the availability of NWP and formulates the problem of wind power prediction on the graph. Section 3 provides temporal and spatial dependency modeling based on graph convolution. Here, the feature of historical wind power data and NWP data of different wind farms can be extracted. Based on these features, Section 4 proposes a multi-modal multi-task graph convolutional network for wind power prediction. The experimental results are reported in Section 5. The conclusion is made in Section 6.

Availability Analysis of NWP
The wind farm converts wind energy into electrical power, and the energy conversion process can be depicted as follows [24]:
$$P = \frac{1}{2} C_p \rho A v^3 \quad (1)$$
where $C_p$ represents the wind power conversion parameter and $\rho$ is the air density. $A$ is the area through which the wind is flowing and $v$ is the wind speed. Equation (1) shows that the wind power is proportional to the cube of the wind speed. Therefore, when comparing the effectiveness of numerical weather prediction, the cube of the wind speed is also chosen. Generally, the height of the wind turbines is 50 m to 150 m. The NWP wind speed from the meteorology department is available at heights of 10 m, 30 m, 100 m and 170 m. Therefore, the wind speed at the height of 100 m is selected for comparison and analysis since this altitude is more effective for the wind power conversion.
To analyze the availability of the NWP, two scenarios in 2019 are selected from a wind farm (no. 1 wind farm) in our testing wind farm cluster. The time intervals of the scenarios are from 2019-03-26 08:15:00 AM to 2019-03-31 08:15:00 AM and from 2019-04-05 08:15:00 AM to 2019-04-10 08:15:00 AM, respectively. Both the wind speed and the cube of the wind speed are compared in those scenarios. As the wind power and wind speed are in different ranges, both of them are normalized to make the comparison more illustrative in Figure 1.
To illustrate the similarity of NWP and wind power, we use the dynamic time warping (DTW) distance to assess the discrepancy between them. The classical sequence similarity measure is the Euclidean distance. However, when there is a time shift in the sequence, there will be large errors between two sequences even when they are similar. DTW is a commonly used method to assess the similarity of two time series and is especially suitable when the lengths of the two time series are different or there is a time shift between them [25]. The main idea of DTW is to convert the time series similarity computation into a shortest path problem and solve it with dynamic programming. The computation of the DTW distance between $L_1$ and $L_2$ is given in Equations (2) and (3):
$$\mathrm{dtw}(i, j) = d(i, j) + \min\{\mathrm{dtw}(i-1, j),\ \mathrm{dtw}(i, j-1),\ \mathrm{dtw}(i-1, j-1)\} \quad (2)$$
$$d(i, j) = |L_1(i) - L_2(j)| \quad (3)$$
where $\mathrm{dtw} \in R^{m \times n}$ is the distance matrix of the two sequences, and $d(i, j)$ is the distance between the $i$-th element of $L_1$ and the $j$-th element of $L_2$, which can be measured by the absolute value. $m$ and $n$ are the lengths of $L_1$ and $L_2$, respectively, and $\mathrm{dtw}(m, n)$ is the DTW distance between $L_1$ and $L_2$.
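The dynamic programming idea behind DTW can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `dtw_distance` and `min_max_normalize` are hypothetical, the local distance is the absolute difference, and min-max scaling is only one possible reading of "normalized".

```python
from math import inf


def min_max_normalize(values):
    """Scale a sequence to [0, 1]; min-max scaling is an assumption made here."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def dtw_distance(seq1, seq2):
    """DTW distance between two numeric sequences with |a - b| as local cost."""
    m, n = len(seq1), len(seq2)
    # acc[i][j] is the accumulated cost of aligning seq1[:i] with seq2[:j].
    acc = [[inf] * (n + 1) for _ in range(m + 1)]
    acc[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(seq1[i - 1] - seq2[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[m][n]
```

Because the alignment path may repeat elements, a sequence and a time-shifted copy of it can have a DTW distance of zero even though their point-by-point differences are large, which is the property the comparison between NWP wind speed and wind power relies on.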
According to the definition of DTW distance, in the first scenario, the distance between wind power and the 100 m NWP wind speed is 140.8926, and the distance between wind power and the cube of the 100 m NWP wind speed is 80.2233. In the second scenario, the distance between wind power and the 100 m NWP wind speed is 50.8905, and the distance between wind power and the cube of the 100 m NWP wind speed is 26.6638. From the computation results, we can see that, due to the amplitude and shape errors of the NWP, there are periods when the discrepancy between the NWP wind speed and wind power is relatively large, such as the time interval between 2019-03-28 and 2019-03-29, so the DTW distance cannot be reduced to 0. However, the cube of the wind speed represents the tendency of the wind power more precisely than the wind speed itself. Therefore, using the cubic NWP wind speed as the input of the model, rather than the NWP wind speed, is beneficial for the prediction modeling.

Power Prediction of Wind Farms Cluster
Wind power forecasting is a classic time-series prediction problem: given the previous M wind power observations and NWP data of length N, it gives the most likely output in the next H time steps. It can be depicted as:
$$\hat{P}_{t+1}, \ldots, \hat{P}_{t+H} = f\left(P_{t-M+1}, \ldots, P_t;\ V_{\tau_1}, \ldots, V_{\tau_N}\right) \quad (4)$$
where $P_t \in R^{n \times 1}$ is the vector of the wind power of the wind farms in a region, $V_t \in R^{n \times k}$ is the matrix of NWP and $\tau_1, \ldots, \tau_N$ denote the N time steps covered by the NWP input. $n$ is the number of wind farms and $k$ is the number of NWP variables. The prediction task is to find the most appropriate $f$ by using a machine learning or deep learning method. Since ultra-short-term wind power prediction is supposed to output the wind power in the next 4 h at an interval of 15 min, the time step $H$ is generally 16. There exists a complex spatial-temporal correlation among the wind farms, which is shown in Figure 2, and it is necessary to consider this coupling relationship when designing the function. However, even though some methods can take the correlation into consideration [2,17], they neglect the geographical information of different wind farms.
In Figure 2, we also notice that the historical wind power and NWP data of the wind farms constitute a typical spatiotemporal graph data structure. The prediction function can be described as:
$$\hat{P}_{t+1}, \ldots, \hat{P}_{t+H} = f\left(P_{t-M+1}, \ldots, P_t;\ V_{\tau_1}, \ldots, V_{\tau_N};\ G\right) \quad (5)$$
where $G$ is the graph constructed by the wind farms in the cluster. We design a multi-modal multi-task spatiotemporal graph convolutional network for the approximation of the prediction function.

The Adjacent Matrix
The adjacent matrix, which reflects the geographical dispersion of the wind farms, is very important for the spatial-temporal dependency modeling, and there are many methods to construct it. As the wind farms are located in a region and the spatial dispersion is mostly based on distance, we define the adjacent matrix based on the thresholded Gaussian kernel distance function of the wind farms:
$$A_{ij} = \begin{cases} \exp\left(-\dfrac{dist(i, j)^2}{std^2}\right), & dist(i, j) \le \varepsilon \\ 0, & \text{otherwise} \end{cases} \quad (6)$$
where $dist(i, j)$ is the geographical distance between wind farm $i$ and wind farm $j$, $std$ is the standard deviation of the distances between the $n$ wind farms and $\varepsilon$ is the threshold. Here, we choose half of the mean distance as the threshold. If the distance between two wind farms exceeds the threshold, we consider that there is no connection between them, which guarantees the sparsity of the adjacent matrix.
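A distance-based adjacent matrix of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `gaussian_threshold_adjacency` is hypothetical, the kernel uses the squared distance divided by the squared standard deviation, pairs farther apart than the threshold are treated as unconnected, and the diagonal is left at zero.

```python
from math import exp, sqrt


def gaussian_threshold_adjacency(dist, eps=None):
    """Build a sparse adjacency matrix from a symmetric distance matrix.

    dist[i][j] is the geographical distance between wind farms i and j.
    eps defaults to half of the mean pairwise distance, as in the text.
    """
    n = len(dist)
    pairs = [dist[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(pairs) / len(pairs)
    std = sqrt(sum((d - mean) ** 2 for d in pairs) / len(pairs))
    scale = std ** 2 if std > 0 else 1.0  # guard against identical distances
    if eps is None:
        eps = 0.5 * mean
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumption: the diagonal stays zero (no self-connections here)
            # and only pairs within the threshold are connected.
            if i != j and dist[i][j] <= eps:
                adj[i][j] = exp(-(dist[i][j] ** 2) / scale)
    return adj
```

With three farms where two are 1 km apart and the third is 10 km from both, only the close pair receives a non-zero weight, so the matrix stays sparse as the cluster grows.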

The Graph Convolutional Neural Network
Convolutional networks have achieved great success in image recognition owing to their ability to extract spatial features, and many studies use them for the spatial dependency modeling of wind farms. However, this kind of method needs to arrange the wind farms in a specific way and ignores the geographic location relationship among the wind farms. To deal with this, it is necessary to extend the spatial dependency modeling method and develop convolutional neural networks for graph data. There has been some research [26] on using graph neural networks for short-term wind speed prediction, and its effectiveness has been verified. We can also use the graph convolutional neural network for spatial feature extraction in ultra-short-term wind farm cluster power prediction.
There are two methods to develop the graph neural network, namely, the spatial domain method and the spectral domain method. The spectral method is based on the graph Fourier transformation and has a relatively solid theoretical foundation, which makes it more suitable for wind power prediction.
We use the first-order approximation of the Chebyshev spectral filter proposed by Kipf in the graph convolution layer [27,28], and the propagation between layers is as follows:
$$X^{(l+1)} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X^{(l)} W\right) \quad (7)$$
$$\hat{A} = A + I_n \quad (8)$$
In these equations, $A \in R^{n \times n}$ is the adjacency matrix formed from the locations of the wind farms, $I_n$ is the $n \times n$ identity matrix and $n$ is the number of wind farms. $\hat{D} \in R^{n \times n}$ is the degree matrix of $\hat{A}$. $X^{(l)} \in R^{n \times d}$ is the feature of layer $l$ and $X^{(l+1)} \in R^{n \times h}$ is the updated feature of layer $l + 1$. $d$ and $h$ are the input and output feature dimensions of each node; in this case, the features are the time series data of each wind farm. $W \in R^{d \times h}$ is the learnable convolutional kernel parameter and $\sigma$ is the activation function. Through the matrix product, the features of the wind farms are correlated with each other.
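Appl. Sci. 2020, 10, 7915
For the input layer, we use the historical wind power of length M or the NWP series of length N as the feature of each wind farm. According to Equation (7), the first layer of the graph convolutional network outputs a matrix with the following elements:
$$X^{(1)}(i, j) = \sigma\left(\sum_{k=1}^{n} \sum_{m=1}^{d} \left[\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}\right](i, k)\, X^{(0)}(k, m)\, W(m, j)\right) \quad (9)$$
In the equation above, taking $X^{(0)}$ as the historical wind power or as the NWP series gives $X_p^{(1)}(i, j)$ and $X_{nwp}^{(1)}(i, j)$, the spatiotemporal features of the historical wind power and NWP, where $i = 1, 2, \ldots, n$ indexes the wind farms and $j$ indexes the output feature columns. It can be noticed that the spatial correlation is adjusted by $\hat{A}$ and the temporal correlation is mapped by $W$.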
In this way, the temporal feature of the wind farms is utilized in the network. Through this design, we obtain a graph neural network that is suitable for our wind power prediction, and we call it the Graph Convolution Module. The structure of the graph convolutional network is illustrated in the accompanying figure. Graph convolution extends the convolution from the traditional Euclidean space to general graph data by carrying out the convolution in the spectral domain. In practice, we should decide the depth of the GCN and the adjacent matrix. In our wind power prediction problem, we use two layers of GCN for the spatial dependency modeling, because too many layers make the feature embeddings of different nodes indistinguishable, although more layers increase the size of the receptive field [27].
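The propagation through a two-layer Graph Convolution Module, using the self-loop adjacency A + I_n and its symmetric degree normalization, can be sketched in plain Python. This is a minimal illustration rather than the authors' PyTorch model: the helper names are hypothetical, ReLU is assumed for the activation (the text leaves it generic), and the weights are passed in rather than learned.

```python
from math import sqrt


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, where D is the degree matrix of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # >= 1 because of the self-loops
    return [[a_hat[i][j] / sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def relu(x):
    """Element-wise ReLU; the choice of activation is an assumption."""
    return [[max(0.0, v) for v in row] for row in x]


def gcn_layer(a_norm, features, weight):
    """One propagation step: activation(A_norm @ X @ W)."""
    return relu(matmul(matmul(a_norm, features), weight))


def gcn_two_layer(adj, features, w1, w2):
    """Two stacked GCN layers sharing the same normalized adjacency."""
    a_norm = normalize_adjacency(adj)
    return gcn_layer(a_norm, gcn_layer(a_norm, features, w1), w2)
```

Each row of `features` is the time series of one wind farm, so the left product mixes information across neighbouring farms while the right product with the weight matrix maps the temporal dimension.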

Multi-Modal Learning
Historical wind power and NWP contain different types of information about the wind power to be predicted and we need to fuse the spatiotemporal feature of them in the network. Multi-modal learning is a kind of technique which can process and combine information from different sources [29].
Feature fusion aims to integrate information of different types and sources to get a consistent and common model output, which is a basic problem in the multi-modal learning field. There are three commonly used methods for the feature fusion.

Bilinear Fusion Method
The calculation formula is
$$Y(s, o) = \sum_{a=1}^{d_p} \sum_{c=1}^{d_{nwp}} X_p(s, a)\, W_m(a, c, o)\, X_{nwp}(s, c) + b(o) \quad (10)$$
where $X_p \in R^{n \times d_p}$ and $X_{nwp} \in R^{n \times d_{nwp}}$ are the spatiotemporal features of historical wind power and NWP. The tensor $W_m \in R^{d_p \times d_{nwp} \times d_{out}}$ is the parameter of the bilinear transformation and $b$ is the bias of the bilinear fusion. Suppose $X_p$ is a matrix with dimension (128, 20) and $X_{nwp}$ is a matrix with dimension (128, 30). When the dimension of $W_m$ is (20, 30, 40), the dimension of the fused feature is (128, 40).

Nonlinear Weighted Fusion
The nonlinear weighted fusion is as follows:
$$Y = \tanh\left(X_p W_P + X_{nwp} U_{nwp}\right) \quad (11)$$
where $W_P \in R^{d_p \times d_{out}}$ and $U_{nwp} \in R^{d_{nwp} \times d_{out}}$ are the weight parameters of the spatiotemporal features in the fusion.

Concatenate
When the correlation of the features is weak, direct feature concatenation can be used for feature fusion as follows:
$$Y = \left[X_p,\ X_{nwp}\right] \quad (12)$$
where $[\cdot, \cdot]$ denotes concatenation along the feature dimension.
In the wind power prediction problem, the fused feature $Y \in R^{n \times d_{out}}$ is the input of the task-specific layer of each wind farm, which constitutes the multi-task learning part.
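The three fusion options can be compared side by side in a short sketch. The function names are hypothetical, the weights are supplied by the caller instead of being trained, and the use of tanh in the weighted variant is an assumption; the shapes follow the dimensions quoted in the text.

```python
from math import tanh


def bilinear_fusion(x_p, x_nwp, w_m, b):
    """Bilinear fusion: out[s][o] = sum_a sum_c x_p[s][a] * w_m[a][c][o] * x_nwp[s][c] + b[o]."""
    fused = []
    for p_row, v_row in zip(x_p, x_nwp):
        row = []
        for o in range(len(b)):
            total = b[o]
            for a, p in enumerate(p_row):
                for c, v in enumerate(v_row):
                    total += p * w_m[a][c][o] * v
            row.append(total)
        fused.append(row)
    return fused


def weighted_fusion(x_p, x_nwp, w_p, u_nwp):
    """Nonlinear weighted fusion: tanh(x_p @ w_p + x_nwp @ u_nwp)."""
    d_out = len(w_p[0])
    fused = []
    for p_row, v_row in zip(x_p, x_nwp):
        fused.append([
            tanh(sum(p * w_p[a][o] for a, p in enumerate(p_row))
                 + sum(v * u_nwp[c][o] for c, v in enumerate(v_row)))
            for o in range(d_out)
        ])
    return fused


def concat_fusion(x_p, x_nwp):
    """Concatenate the two feature matrices along the feature dimension."""
    return [list(p_row) + list(v_row) for p_row, v_row in zip(x_p, x_nwp)]
```

Concatenation has no parameters of its own, whereas the bilinear variant needs d_p × d_nwp × d_out weights, which is worth keeping in mind when the two reach similar accuracy.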

Multi-Task Learning
Multi-task learning (MTL) has led to great success in many areas of deep learning, from natural language processing to speech recognition [30]. Traditionally, the wind power prediction model is designed for every single wind farm separately or predicts the wind power of a region directly. These models are then fine-tuned and tweaked until their performance reaches an acceptable level. However, when more than one loss function is optimized and the tasks are similar to each other, the auxiliary tasks may help improve the accuracy of the main task. Multi-task learning can improve the generalization ability by leveraging the domain-specific information in the training signals of related tasks, and it can make better use of the entangled features [31]. There is also research on using multi-modal and multi-task learning for short-term wind power prediction [32]. However, in that work, multi-task learning is used for different prediction steps rather than different wind farms. We use one of the most commonly used multi-task learning methods, called hard parameter sharing. It shares some hidden layers and has several task-specific output layers, as can be seen in Figure 4.
In our network, we share the parameters of the graph neural network and the multi-modal learning described in the previous parts and use multi-task learning for the wind power prediction of each wind farm. A task-specific layer is designed for each wind farm, and the loss function of the neural network is the sum of the losses of all wind farms:
$$\hat{p}_i = Y_i W_{oi} \quad (13)$$
$$L = \sum_{i=1}^{n} L_i \quad (14)$$
where $\hat{p}_i \in R^{1 \times H}$, $i = 1, 2, \ldots, n$, is the predicted wind power of the $i$-th wind farm. $Y_i \in R^{1 \times d_{out}}$ is the $i$-th row of the multi-modal feature fusion matrix and $W_{oi}$ is the parameter of the fully connected layers for the power prediction. The number of task-specific layers is equal to the number of wind farms. $L$ is the loss function of the M2GSNet and it is the sum of the loss functions $L_i$ of the individual wind farms, each of which can be calculated from the absolute difference between the predicted power and the true power. The parameters of the fully connected layers are learned in the training process and play the role of weighting the fused features. Therefore, the spatial-temporal feature is taken into consideration in the procedure above.
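Hard parameter sharing with one output head per wind farm can be illustrated with a small sketch. It is not the authors' implementation: the function names are hypothetical, each head is assumed to be a single linear layer without bias, and the per-farm loss is assumed to be the mean absolute error over the horizon.

```python
def predict_all_farms(fused, heads):
    """Apply the i-th task-specific head to the i-th row of the fused features.

    fused: n rows of length d_out (shared representation, one row per farm).
    heads: n weight matrices of shape d_out x H (one per farm).
    """
    predictions = []
    for y_i, w_oi in zip(fused, heads):
        horizon = len(w_oi[0])
        predictions.append([
            sum(y_i[k] * w_oi[k][h] for k in range(len(y_i)))
            for h in range(horizon)
        ])
    return predictions


def summed_absolute_loss(predictions, targets):
    """Total loss: sum over farms of the mean absolute error of each farm."""
    return sum(
        sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)
        for pred, true in zip(predictions, targets)
    )


def regional_power(predictions):
    """Regional forecast: add the farm-level forecasts step by step."""
    return [sum(step) for step in zip(*predictions)]
```

Only the heads grow with the number of wind farms while the shared layers are evaluated once, which is consistent with the slow growth in time consumption reported for larger clusters.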

GCN Model for Wind Power Prediction
According to the preceding analysis, the NWP contains the future meteorological information and correlates well with the power. Therefore, multi-modal learning is used to combine the temporal and spatial characteristics of the historical wind power and NWP, and multi-task learning is used for the prediction of each wind farm. The proposed multi-modal multi-task graph spatiotemporal network (M2GSNet) model for wind power prediction is shown in Figure 5.
From Figure 5, we can see that the whole structure is an encoder-decoder framework. The historical wind power and NWP data are encoded into a latent spatial and temporal feature space by using the graph convolutional network and multi-modal learning. The features are decoded into the wind power of different wind farms by using multi-task learning and fully connected layers. The model consists of the following modules.
(1) GCN Module: The GCN part includes two GCN modules for the historical wind power and NWP data, respectively. Each GCN module includes two layers of standard GCN and the adjacent matrix is based on distance. It is used to extract the spatial feature of the wind farms.
(2) Multi-Modal Learning Module: It is used to concatenate the spatial and temporal features of NWP and historical wind power. It makes M2GSNet an effective hybrid prediction method by combining the advantages of model-based prediction and data-driven prediction.
(3) Multi-Task Learning Module: The fully connected layer is used to map the fused spatial-temporal feature of each wind farm into the wind power. Each wind farm has a specific layer.
The historical wind power and NWP data are the input of the model. The output is the ultra-short-term power sequence of each wind farm. The regional wind power can be calculated by adding together the power of each wind farm.

Case Study
The proposed method is tested on the measurement data of a wind farm cluster in Northeast China. The experiments run on a Linux server cluster (CPU: Intel Xeon (R) CPU E5-2650 v4 @ 2.10 GHz, GPU: NVIDIA Tesla P100) with the deep learning framework PyTorch (1.4.0), using GPU acceleration to speed up the training process.

Data Set and Test Description
For the test system, only the historical wind power data and the NWP data are provided. The wind speed data are not included. The historical wind power comes from field measurements and the NWP comes from the meteorology station. The NWP wind speeds at the heights of 170, 100, 30 and 10 m are used for the analysis. According to the analysis in Section 2.1, the cubic NWP wind speed reflects the tendency of wind power more effectively, so we use the cubic NWP wind speed rather than the NWP wind speed as the input of the model. The locations of the wind farms in the cluster are shown in the accompanying figure.
The root mean square error (RMSE) and mean absolute error (MAE) are selected as the evaluation metrics to assess the performance of the model on the testing set:
$$RMSE_t = \sqrt{\frac{1}{K} \sum_{i=1}^{K} \left(x_{ti} - \hat{x}_{ti}\right)^2} \quad (15)$$
$$MAE_t = \frac{1}{K} \sum_{i=1}^{K} \left|x_{ti} - \hat{x}_{ti}\right| \quad (16)$$
where $x_{ti}$ and $\hat{x}_{ti}$ are the normalized true value and normalized predicted value in prediction scenario $i$ at prediction time step $t$, and $K$ is the number of prediction scenarios in the test set. To represent the prediction error of individual scenarios, we design another index that is evaluated within a single scenario. In each scenario, there are 16 time steps, so the maximum value of the time step is 16. The index is used to assess the similarity between the true and predicted curves of a given scenario, and it reflects the average deviation between the true value and the predicted value over the time steps.
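Computing RMSE and MAE separately for each prediction time step, averaged over the test scenarios, can be sketched as below. The function names and the `[scenario][time step]` data layout are assumptions made for illustration; the inputs are expected to be normalized already.

```python
from math import sqrt


def rmse_per_step(true, pred):
    """RMSE at each time step; true[i][t] is scenario i at time step t."""
    k = len(true)
    steps = len(true[0])
    return [
        sqrt(sum((true[i][t] - pred[i][t]) ** 2 for i in range(k)) / k)
        for t in range(steps)
    ]


def mae_per_step(true, pred):
    """MAE at each time step, averaged over the K scenarios."""
    k = len(true)
    steps = len(true[0])
    return [
        sum(abs(true[i][t] - pred[i][t]) for i in range(k)) / k
        for t in range(steps)
    ]
```

With 16 time steps at 15 min intervals, the last entry of each returned list corresponds to the fourth hour, which is the horizon used for model selection in this study.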
The sensitivity of some hyper-parameters, such as the learning rate and the hidden state number, which are very important for the training process [33], is taken into account. However, it is impractical to run a grid search over the whole parameter space, so the hyper-parameters are determined by a grid search combined with human experience. The learning rate is chosen from the set (0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4). The hidden state numbers of the graph convolutional parts for historical wind power and wind speed are both chosen from the set (10, 20, 30, 40, 50, 60, 70, 80, 90, 100). The input lengths of the historical wind power and cubic NWP wind speed are chosen from the set (20, 30, 40, 50, 60). We select the optimal values according to the prediction error in the fourth hour: the parameter combination with the lowest prediction error is taken as optimal. After adjusting the structure and parameters of the model, the parameters of the final model are as follows. In the M2GSNet model, the input characteristic matrix $P_{t-M+1:t}$ is the power measurement information of each node on the graph, and the NWP input is a (20 × 80) matrix. The hidden state number is 60 for the wind power GCN module and 40 for the NWP GCN module. The adjacent matrix in the graph convolution network is calculated from the distances between wind farms. The dimensions of the variables are labeled in Figure 5. The prediction error is calculated as the RMSE after normalization, following the calculation method described above.
The model is trained for 200 epochs with a batch size of 256. The optimizer is Adadelta [34] and the learning rate is 0.1. Five-fold cross-validation is used for verification.

Baseline Model
M2GSNet is our proposed model and it has three features. First, it utilizes the feature of the cubic NWP wind speed by using multi-modal learning. Second, it adopts the spatiotemporal model for geographical information extraction. Third, it uses multi-task learning to predict the power of each wind farm. To illustrate the accuracy improvement brought by each feature, we design the baseline models and other GCN models for an ablation study.
(1) MLP [6]: This is the multilayer perceptron model for regression and the hidden state number is 800. The input is the historical wind power of each wind farm over 40 time steps, and the output is the wind power of the same wind farm over 16 time steps. The sum of the wind power of each wind farm is the regional wind power.
(2) LSTM [9]: This includes two LSTM layers and the dropout rate is 0.25. The input is the historical wind power of each wind farm over 40 time steps, and the output is the wind power of the same wind farm over 16 time steps.
(3) ELM [35]: This uses the classic ELM model parameters. The input is the historical wind power of each wind farm over 40 time steps, and the output is the wind power of the same wind farm over 16 time steps. The sum of the wind power of each wind farm is the regional wind power.
(4) LSTNet [10]: This uses the standard LSTNet structure and parameters and can take the spatiotemporal relationship of wind farms into consideration. However, it cannot make use of the geographical information of the wind farms when extracting their spatial feature. The input only includes the historical wind power of each wind farm. The output is the wind power of each wind farm over 16 time steps.
(5) LSTNet_NWP: This uses the same structure and parameters as LSTNet. However, it also uses the cubic NWP wind speed as input and concatenates the spatial-temporal features of historical wind power and NWP. The output is the wind power of each wind farm over 16 time steps.
We also compare different M2GSNet models for the ablation study, and the characteristics of each model are shown in Table 1. Here, M2GSNet denotes the full model that uses the information of the cubic NWP wind speed. The w/o CW model is the GCN model that uses only the raw NWP data without the cubic NWP wind speed. The w/o AD model is the GCN model that uses the cubic NWP wind speed but defines the graph from the wind speed time series correlation. The w/o W model is the GCN model that does not use the NWP. The w/o MT1 model is the GCN model that does not use multi-task learning and predicts the wind power of the region directly. The w/o MT2 model is the GCN model that does not use multi-task learning and predicts the wind power of each wind farm separately; it uses the same model structure, but a model is trained individually for each wind farm. The training hyper-parameters are the same as described above and only the model structure is different.

The Prediction Results for Regional Wind Power
The prediction results of several structures of the M2GSNet are listed in Table 2. Table 2 shows that M2GSNet performs best. In addition, methods that take the NWP into account outperform those that do not include NWP data. The prediction error of M2GSNet is more than 2 percent lower than that of LSTM in the fourth hour, which corresponds to a reduction of more than 50 MW in prediction error for the whole cluster, a significant gain for the operation center of the power grid. LSTNet is a deep learning method that takes the spatiotemporal relationship of the wind farms in the cluster into account; it is an improved version of the spatiotemporal prediction model in [17]. Table 2 shows that LSTNet is indeed better than MLP, LSTM and ELM, which do not consider the spatiotemporal relationship. However, M2GSNet is better than LSTNet because it can extract features from the geographical location information.
Within the M2GSNet family, using multi-task learning to predict the wind power of each wind farm gives better results than predicting the regional wind power directly (w/o MT1) or predicting the wind power of each wind farm separately and summing the results (w/o MT2), which demonstrates the effectiveness of multi-task learning. In addition, using the cubic NWP wind speed in the multi-modal learning gives better results than using the NWP wind speed directly.
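The benefit of the cubic NWP wind speed is consistent with the standard textbook relation between wind speed and the power captured by a turbine, quoted here for context rather than taken from the experiments:

```latex
P = \tfrac{1}{2}\,\rho\,A\,C_p\,v^{3}
```

Here $\rho$ is the air density, $A$ is the rotor swept area, $C_p$ is the power coefficient and $v$ is the wind speed. Between the cut-in and rated wind speeds, the output power therefore grows roughly with $v^{3}$, so a cubic wind speed input is closer to linear in the prediction target than the raw wind speed.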

The Prediction Results of Each Wind Farm
The M2GSNet is not only convenient for predicting the regional wind power; it can also output the detailed power prediction of each wind farm. From Figure 7, we can see that the prediction errors of some individual wind farms are much higher than those of others. However, due to the "smoothing effect" of the wind farm cluster, the prediction error of the cluster is much smaller than that of the individual wind farms. This result is very meaningful for the power grid dispatching center. In addition, according to the mean, maximum and minimum values, M2GSNet performs much better than the other methods in the statistical sense, especially in the 3rd and 4th hours. However, due to the NWP feature fusion, the prediction error of M2GSNet in the 1st hour is slightly higher than that of the methods that do not consider the NWP. This suggests designing a mechanism to select models dynamically. For example, for ultra-short-term prediction within 1 h, we can choose the model with the lowest RMSE.
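The dynamic model selection suggested above could be sketched as a per-horizon choice of the model with the lowest validation RMSE. The function name `select_model_per_horizon`, the model labels and the RMSE values below are made up for illustration and are not results from Table 2.

```python
import numpy as np

def select_model_per_horizon(rmse_by_model):
    """Return, for each horizon step, the name of the model with the lowest RMSE.

    rmse_by_model: dict mapping model name -> sequence of validation RMSEs,
    one value per horizon step (all sequences must have equal length).
    """
    names = list(rmse_by_model)
    table = np.array([rmse_by_model[n] for n in names], dtype=float)
    best = table.argmin(axis=0)  # index of the best model at each step
    return [names[i] for i in best]

# Made-up RMSE values for four hourly horizons, for illustration only.
rmse = {
    "no_nwp_model": [0.040, 0.080, 0.120, 0.150],
    "nwp_model": [0.045, 0.075, 0.100, 0.120],
}
choice = select_model_per_horizon(rmse)
```

With these illustrative numbers, the model without NWP is selected for the 1st hour and the NWP-based model for the later hours, which mirrors the behaviour described in the text.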

Ablation Study
(1) The Comparison of Different Concatenate Methods
The fusion of the features of historical wind power and NWP is very important for wind power prediction, and there are three commonly used feature fusion methods. The prediction results of the three methods are shown in Figure 8.
From Figure 8, it is clear that the bilinear and concatenate fusion methods are better than the nonlinear Tanh method. The prediction error of the concatenate method is slightly lower than that of the bilinear method, especially in the interval of 0.5 h to 3.5 h. Considering that the bilinear method is more complex and has lower training efficiency, the concatenate method is chosen as the feature fusion method in our network.
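The three fusion methods compared above can be sketched in NumPy as follows. The exact layer definitions are not given in this section, so the forms below (plain concatenation, a bilinear map with a three-way weight tensor, and a Tanh of summed linear projections) are common textbook variants assumed for illustration; all function names, weights and sizes are hypothetical.

```python
import numpy as np

def fuse_concatenate(h_power, h_nwp):
    """Concatenate the two feature vectors along the feature axis."""
    return np.concatenate([h_power, h_nwp], axis=-1)

def fuse_bilinear(h_power, h_nwp, weight):
    """Bilinear fusion: out[b, k] = sum_ij h_power[b, i] * weight[k, i, j] * h_nwp[b, j]."""
    return np.einsum("bi,kij,bj->bk", h_power, weight, h_nwp)

def fuse_tanh(h_power, h_nwp, w_power, w_nwp):
    """Nonlinear fusion: tanh of the sum of two linear projections."""
    return np.tanh(h_power @ w_power + h_nwp @ w_nwp)

# Hypothetical sizes: batch of 4, 8 power features, 6 NWP features, 5 outputs.
rng = np.random.default_rng(0)
h_p, h_n = rng.normal(size=(4, 8)), rng.normal(size=(4, 6))
w_bil = rng.normal(size=(5, 8, 6))
out_cat = fuse_concatenate(h_p, h_n)
out_bil = fuse_bilinear(h_p, h_n, w_bil)
out_tanh = fuse_tanh(h_p, h_n, rng.normal(size=(8, 5)), rng.normal(size=(6, 5)))
```

In this sketch, the concatenation itself has no trainable weights, while the bilinear map needs 5 x 8 x 6 = 240 weights against 70 for the Tanh variant, which is in line with the remark that the bilinear method is more complex to train.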

(2) The Comparison of Training Time Consumption under Different Wind Farm Numbers
The training time of the M2GSNet is crucial because it determines whether the model can be used in large-scale renewable energy clusters that include hundreds or even thousands of small wind farms. Therefore, we compare the training time of M2GSNet under different numbers of wind farms. The results are shown in Figure 9. According to the results, the training time for one wind farm is 66 min, so if each wind farm uses its own model, the total training time is more than 1200 min in this case. However, when multi-task learning is used, the training time falls to 271 min, which saves considerable training time and resources. Notably, as the number of wind farms increases, the growth rate of the training time decreases. Therefore, the advantages of multi-task learning become more pronounced as more wind farms are considered.
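The saving implied by these figures can be checked with a few lines of arithmetic. The sketch uses only the three numbers quoted in the text; the farm-count bound additionally assumes that every per-farm model takes the same 66 min, which is an assumption made here for illustration.

```python
import math

SINGLE_FARM_MIN = 66       # training time for one wind farm, from the text
SEPARATE_TOTAL_MIN = 1200  # lower bound quoted for training every farm separately
MULTI_TASK_MIN = 271       # training time with multi-task learning, from the text

# Lower bound on the speed-up of multi-task over per-farm training.
min_speedup = SEPARATE_TOTAL_MIN / MULTI_TASK_MIN

# Smallest farm count for which per-farm training exceeds the quoted bound,
# assuming every farm needs the same 66 min (an assumption, see above).
min_farms = math.floor(SEPARATE_TOTAL_MIN / SINGLE_FARM_MIN) + 1

print(f"speed-up >= {min_speedup:.2f}x, farms >= {min_farms}")
```

Under these assumptions, multi-task training is at least about 4.4 times faster than training one model per wind farm.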

The Remarkable Error Analysis in Test Set
The prediction error of the M2GSNet is analyzed in Figure 10. In this figure, the M2GSNet is compared with the MLP and LSTM, since they are the most commonly used machine learning and deep learning methods. The left column of the figure shows the 4th-hour prediction results of the three methods on the test set, compared with the true values. The right column shows where the scenarios with larger prediction errors are located. The prediction error analysis is important because it shows which kinds of scenarios are difficult to predict, so that methods can be designed to deal with them in the future.
For each time step, we use different colors and sizes to represent the prediction errors: the darker the color and the smaller the size, the smaller the prediction error. Since the $\mathrm{Ind}_i$ of most scenarios is smaller than 0.15, we classify the prediction errors into four categories. If the prediction error according to $\mathrm{Ind}_i$ in Equation (19) is smaller than 0.05 p.u., it belongs to the first category, which contains the scenarios that are predicted rather accurately. If the prediction error is between 0.05 and 0.1, it belongs to the second category, and the color of this category in the figure is at 5. If the prediction error is between 0.1 and 0.15, it belongs to the third category, and the color of those scenarios is at 10. If the prediction error is higher than 0.15, it belongs to the fourth category, and the color of those scenarios is at 20. In this way, the colors and sizes reflect how well each method predicts the same scenario. The ratios of the different categories are given in Table 3. From Figure 10 and Table 3, it is clear that M2GSNet performs better, since it has fewer points in the high prediction error categories. However, Figure 10 also shows that most of the light-colored, large points are located at the turning points of the wind fluctuation process, which means that these turning points are hard to predict and often lead to higher prediction errors.
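The four-category classification and the category ratios of the kind reported in Table 3 can be sketched as follows. The thresholds 0.05, 0.1 and 0.15 come from the text; the function names, the treatment of values lying exactly on a boundary, the assumption that the error index is non-negative, and the sample values are all hypothetical.

```python
import numpy as np

# Category boundaries in p.u., taken from the text: 0.05, 0.10, 0.15.
EDGES = np.array([0.05, 0.10, 0.15])

def categorize_errors(ind):
    """Map each error index to a category 1..4 (1 = most accurate).

    A value exactly on a boundary goes to the higher category; the text
    does not specify this, so it is an assumption of this sketch.
    """
    # np.digitize returns 0..3 for the four intervals; shift to 1..4.
    return np.digitize(np.asarray(ind, dtype=float), EDGES) + 1

def category_ratios(ind):
    """Fraction of scenarios falling in each of the four categories."""
    counts = np.bincount(categorize_errors(ind), minlength=5)[1:]
    return counts / counts.sum()

# Made-up error indices for illustration only.
sample = [0.01, 0.04, 0.07, 0.12, 0.20, 0.03]
cats = categorize_errors(sample)
ratios = category_ratios(sample)
```

Applying `category_ratios` to the error indices of each method on the test set would give one row of ratios per method, which is the form of comparison described for Table 3.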

Conclusions
In this paper, we propose a spatiotemporal deep learning network for ultra-short-term wind power prediction. From the case study, we can draw the following conclusions:
(1) Adding numerical weather prediction through multi-modal learning, especially the third power of the wind speed as auxiliary information, can improve the accuracy of forecasts.
(2) The spatiotemporal graph neural network can extract the spatial-temporal feature of the wind farms effectively and is helpful in improving the accuracy of predictions compared to the other methods.
(3) Multi-task learning improves the prediction accuracy and also reduces the training time compared with additive methods, in which each wind farm is predicted by its own model and the results are summed.
In follow-up work, we will consider designing a comprehensive method that classifies the wind process in advance and defines a dynamic graph according to the spatiotemporal relationship among the wind farms, to further increase the accuracy.