SFINet: Shuffle-and-Fusion Interaction Networks for Wind Power Forecasting

Abstract: Wind energy is one of the most important renewable energy sources in the world. Accurate wind power prediction is of great significance for achieving reliable and economical power system operation and control. For this purpose, this paper focuses on wind power prediction based on a newly proposed shuffle-and-fusion interaction network (SFINet). First, a channel shuffle is employed to promote the interaction between timing features. Second, an attention block is proposed to fuse the original features and shuffled features to further increase the model's sequential modeling capability. Finally, the developed shuffle-and-fusion interaction network model is tested using real-world wind power production data. The verification results show that the proposed SFINet model achieves better performance than other baseline methods, and it can be easily implemented in the field without requiring additional hardware or software.


Introduction
With increasing global warming threats, the United Nations has called for the reduction of carbon dioxide emissions and has set out the goals of reducing greenhouse gas emissions by 45 percent by 2030 and reaching net zero emissions by 2050 [1]. In line with the United Nations' goals, developed and most developing countries have started to take action to develop realistic plans toward the reduction of carbon dioxide emissions. For instance, the Chinese government announced its goals to reach peak carbon dioxide emissions before 2030 and strive to achieve carbon neutrality before 2060 at the 75th session of the United Nations General Assembly (UNGA 75) in September 2020 [2]. In 2019, the total carbon dioxide emissions in China were estimated at 10.5 billion tons, of which the carbon emissions from energy consumption were about 9.8 billion tons, accounting for around 87% of the total [3]. With rapid economic and social development, the transition to a green and low-carbon society is accelerating, and the transition of the country's energy structure brooks no delay. Cleanliness is an important direction for carbon emission reduction in energy production, and the way to accomplish the "dual carbon" task is to develop green and low-carbon renewable energy. During the 14th Five-Year Plan period, coal consumption in China will continue to decline. The plan forecasts that the installed capacity of renewable energy and nuclear power will reach 1200 GW by 2030, of which wind power will account for 500 GW in China [4].
As the installed capacity of wind power continues to grow and wind power is connected to the grid on a large scale, the overall grid performance is affected by the output power of wind farms due to the intermittency of wind [5]. In order to ensure the safe and stable operation of the power system, the power grid needs to prepare a sufficient spinning reserve capacity. However, increasing the reserve capacity increases the operating cost of wind power. Therefore, accurate wind power forecasting (WPF) is required to provide a basis for developing a grid dispatch schedule; it also helps to greatly reduce the operating costs of wind farms and improve the competitiveness of wind energy in the overall energy market [6][7][8].
Historically, there have been various wind power forecasting (WPF) methods, which can be divided into four categories: physical, statistical, hybrid, and deep-learning methods. A summary of these four categories of methods in terms of their features and limitations in application is given in [9]. The physical method is based on a mesoscale weather model or a numerical weather prediction (NWP) system. NWP encompasses a variety of mathematical models of geographic and meteorological information [10,11]. Although this method works well for short- or medium-term forecasts of more than 3 h, it is difficult to collect all relevant geographic or meteorological data [12][13][14], so it has limitations in application. The physical forecasting method is generally used for siting new wind farms, not for wind turbine power production prediction.
The statistical prediction method uses the historical data collected by the SCADA system to establish a linear/non-linear relationship between relevant index data and power in order to predict the output power of a wind farm. Statistical prediction methods can be categorized as conventional statistical methods and methods based on artificial neural networks (ANNs). The conventional statistical methods are limited by the demand for non-linear expression in wind power forecasting (WPF), while methods based on artificial neural networks can effectively represent the many non-linear relationships and complex characteristics among wind speed, temperature, and other parameters in power generation. Therefore, statistical prediction methods based on ANNs have become widely applied. Ref. [15] proposed a shallow model for wind speed forecasting (WSF) based on artificial neural networks, which is more accurate than physical or traditional statistical methods.
The hybrid method integrates the physical and statistical models to improve forecast performance by preserving the advantages of each approach [16][17][18], but the hybrid models may not have the capability to achieve stable prediction, as their complex learning architecture may cause low efficiency in training and even underfitting [9].
With the development of deep learning techniques in recent years, and because wind power prediction possesses a natural time-series character, some deep neural network (DNN)-based time series forecasting methods have been developed and used for wind power estimation, such as methods based on recurrent neural networks (RNNs) [19], long short-term memory (LSTM) [20], Transformer [21], temporal convolutional networks (TCNs) [22], sample convolution and interaction networks (SCINets) [23], etc. These form the deep learning-based methods for wind power forecasting. This direction is promising in terms of new model development for time series forecasting; however, none of the existing methods can claim to be perfect in time series prediction, as performance depends on the available data and data quality.
SCINet is a novel framework proposed very recently by Liu et al. [23] that has been applied to time series forecasting problems. It performs sample convolution and interaction at multiple resolutions for time-series modeling. Although good prediction results can be achieved by SCINet models, the SCINet framework has shortcomings that affect the prediction accuracy. One of them is that the strict binary tree structure of SCINet causes information blockage as the number of network levels increases. To address this issue, this paper proposes a novel framework, the shuffle-and-fusion interaction network (SFINet), to avoid the information blockage of SCINet sequence channels and develops an improved algorithm for wind power forecasting. The developed SFINet-based models have proven effective for supporting the economic dispatch of energy production and reliable operation of the power system, providing an opportunity to reduce the operating costs of wind farms. The main contributions are as follows: (1) considering the sequence interaction modeling of time series tasks, we introduce the shuffle operation to increase the dependence between sequences; (2) in order to further promote interactive learning, we propose feature fusion based on a time series attention mechanism; and (3) the developed models are applied to wind power forecasting using real wind power production data collected from a wind farm in China, verifying the outperformance of our models in comparison with other baseline approaches.

Channel Interleave Operation
To the best of our knowledge, the first real use of a channel interleaving operation was in IGCNets [24], and channel interleaving in the form of a shuffle was proposed in ShuffleNet [25], which aimed to break the information blockage between group convolutions. Subsequently, the channel shuffle has been widely utilized in basic backbone networks [26,27], with applications in semantic segmentation, multi-person pose estimation, image processing, and other tasks [28][29][30]. However, the channel shuffle is mostly used on the basis of grouped convolutions and lightweight models. In this paper, we apply it to the construction of the sequence channels for time series forecasting to improve the interaction capabilities of different time series features.

Attention Mechanisms
Attention is essentially a tool to filter and focus on important information from a large number of available processing resources, while ignoring non-important information [31,32]. It is usually combined with threshold functions, such as softmax and sigmoid, or sequential techniques [33,34]. In both computer vision and sequence tasks, it has shown superior performance [35,36]. In these applications, the attention mechanism usually acts on one or more top layers to further reshape the higher-level characteristics. Attention mechanisms have provided benefits in many applications, e.g., image classification [37], object detection [38], multi-modal tasks [39], few-shot learning [40], and machine translation [21].
The most common forms of attention are channel attention [37,41], spatial attention [37,42], temporal attention [43], and branch attention [44]. Channel attention adaptively recalibrates the weight of each channel and can be viewed as an object selection process, thus determining what to pay attention to. Hu et al. [41] introduced a lightweight attention operation with a Squeeze-and-Excitation block to model channel-wise relationships. Spatial attention can be seen as an adaptive spatial region selection mechanism that determines where to pay attention. Dai et al. [42] proposed deformable convolutional networks (deformable ConvNets) to be invariant to geometric transformations, attending to the important regions in a different manner. Self-attention [21] is also used as a spatial attention mechanism to capture global information. Temporal attention is a dynamic time-selection mechanism; Li et al. [43] proposed a global-local temporal representation (GLTR) to exploit multi-scale temporal cues in a video sequence. In a multi-branch structure, branch attention is used for branch selection; Ref. [44] proposed an automatic selection operation called selective kernel (SK) convolution, implemented using three operations: split, fuse, and select.
The above-mentioned attention methods are often combined in application. Chen et al. [45] dynamically modulated the sentence generation context in multi-layer feature maps using encoded channel attention and spatial attention. Ref. [46] identified the spatial saliency associated with image pixels and executed temporal intensity filtering and predictive coding to filter spatiotemporal redundancies from images.
On the basis of the above overview, we propose a feature fusion method based on time series channel attention, aiming to enhance the model's long-term forecasting ability in wind power forecasting.

Deep Learning-Based Wind Power Forecasting
The wind speed and power indicators collected through the wind turbine SCADA system are all time series data. Time series forecasting can estimate their future development based on indicators or events. At the same time, there are complex nonlinear relationships among the other indicator data related to power. From previously published research works, it has been realized that deep learning-based time series forecasting has higher forecasting accuracy than the traditional methods [47], so the deep learning-based time series forecasting (TSF) method has been widely utilized.
The recurrent neural network (RNN)-based TSF methods given in [48,49] compactly summarize the past information in an internal memory used for prediction, where the memory state is recursively updated with new inputs at each time step, as shown in Figure 1a below. Ref. [50] proposed a long short-term memory recurrent neural network (LSTM-RNN) to predict the wind power from 1 to 24 h. The Transformer relies on the attention mechanism to model the global dependency between input and output and overcomes the non-parallelizability of RNN-based methods, so it is gradually replacing the RNN model in almost all sequence modeling tasks; various Transformer-based TSF methods were presented in [51], as shown in Figure 1b. The multi-head self-attention mechanism has been used to extract the spatial correlation between wind farms [52]. Models based on convolutional neural networks (CNNs), such as temporal convolutional networks (TCNs), are also used in time series forecasting [53,54]. The TCN stacks a series of causal convolutional layers to make full use of convolution's parallel operation, efficiently modeling the dependency relationships between multiple sequence features, as shown in Figure 1c. Long-term prediction of wind power with a mean absolute percentage error of 10% was achieved in [9] using a TCN.
Ref. [23] proposed a new neural network structure named SCINet, as shown in Figure 1d, specifically designed for time series forecasting. It lifts the constraints of TCNs, namely the causal convolutional layers and the requirement that the network input and output lengths be the same, and achieves very good performance in TSF tasks. To the best of our knowledge, however, SCINet has not been applied to the field of wind power forecasting (WPF). At the same time, the binary tree structure of SCINet causes information blockage as the number of network levels increases. For this reason, we propose the shuffle-and-fusion interaction network (SFINet) to overcome this issue.

SFINet: Shuffle-and-Fusion Interaction in Convolution Networks
As mentioned above, SCINet follows a strictly binary tree structure, and the time-series feature information no longer has the opportunity for interaction after passing through the parent node of the binary tree. Although there is an interactive learning operation in the SCI-Block that can fuse information between time series, this interactive process only exists at the parent node, which means that subsequent layers of different depths can only inherit the first interactive learning performed by the parent node on the most primitive timing input. As the number of tree layers deepens, this information is transmitted ever more weakly. We believe this property is very unfavorable for capturing the dependencies between long sequences.
We use Figure 2 to illustrate this property of the original SCINet structure, where the most basic unit is the SCI-Block, as shown in Figure 2a. An SCI-Block contains interactive learning modules responsible for the interaction between two different timing features. SCINet is composed of basic SCI-Blocks arranged in a strictly binary tree structure, as shown in Figure 2b, and each SCI-Block always splits its input features evenly in the timing dimension; finally, SCINets are stacked to form a Stacked SCINet. We name the input feature of an SCI-Block the SSF (split-sequence feature); the input to the k-th SCI-Block of the l-th layer can then be defined as SSF(l, k), where l = 1, 2, ..., L and k = 1, 2, ..., 2^(l−1), so the successive layers contain 2^0, 2^1, ..., 2^(L−1) blocks, respectively. Due to the even split in the timing dimension, for an input sequence of known length S, we have L ∈ [1, n] with 2^n ≤ S.
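The split indexing above can be sketched in a few lines (a minimal illustration, assuming, as in SCINet, that each block splits its input into even- and odd-indexed sub-sequences; the helper names are ours, not from the paper):

```python
import numpy as np

def split_sequence(x):
    """Split a time series into its even- and odd-indexed sub-sequences
    along the timing dimension, as an SCI-Block does."""
    return x[0::2], x[1::2]

def build_ssf_tree(x, levels):
    """Return SSF(l, k) for l = 1..levels; layer l holds 2**(l-1) inputs."""
    tree = {1: [x]}                      # SSF(1, 1) is the raw input
    for l in range(2, levels + 1):
        layer = []
        for parent in tree[l - 1]:
            even, odd = split_sequence(parent)
            layer.extend([even, odd])
        tree[l] = layer
    return tree

S = 16                                   # input length, so levels <= log2(S)
x = np.arange(S)
tree = build_ssf_tree(x, levels=3)
print(len(tree[3]))                      # layer 3 has 2**2 = 4 split features
print(tree[3][0])                        # each has length S / 2**(3-1) = 4
```

Each layer halves the sequence length, which is why the level count L is bounded by log2 of the input length S.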

Obviously, in the SCINet structure, for SSF(l, k) with l > 1 and k ∈ {1, 2, ..., 2^(l−1)}, only part of the output from the upper layer will interact with the other outputs in this layer. As shown in Figure 2b, each of SSF(3, 1), SSF(3, 2), SSF(3, 3), and SSF(3, 4) is input into its corresponding SCI-Block for further reasoning, and they are mutually isolated; their interactive learning only exists in the first layer of SCI-Blocks. Hence, this property blocks the information flow between sequential channels and weakens the representation.

Shuffle Split-Sequence Features
If an SCI-Block is allowed to obtain its input from different split-sequence features, the split-sequence features that characterize different time dimensions will interact better. Therefore, the shuffle operation was introduced on the basis of SCINet. Specifically, owing to the structural characteristics of SCINet, the input split-sequence features of each layer are naturally presented in groups. First, the channels of each group can be divided into sub-groups, and then the different sub-groups can be evenly allocated to each group as the input to the next layer. This is a sequential operation, as shown in Figure 3a.
The above operations can be efficiently implemented through a channel shuffle operation, as shown in Figure 3b. Using 2^l groups to form the new sequence features, the output channel has 2^(2l) sub-groups. First, reshape the output channel to size (2^l, 2^l), then transpose it and flatten it back as the input of the next layer. Channel shuffling is also differentiable, which means that it can be embedded into network structures for end-to-end training. The shuffle operation makes it possible to build more powerful structures with sequential interactive learning.
On this basis, the shuffle operation was embedded in the SCINet structure, and the SFINet structure was designed as shown in Figure 3c. The shuffle operation acts on the outputs of all the SCI-Blocks (leaf nodes) in each layer. There is only one shuffle operation per layer, regardless of the number of SCI-Blocks in that layer.
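The reshape-transpose-flatten routine described above can be sketched as follows (a minimal NumPy illustration for features shaped (channels, length); the function name and the toy group layout are our assumptions):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle channels: reshape to (groups, channels_per_group), transpose
    the group axes, and flatten back. x has shape (channels, length)."""
    c, length = x.shape
    assert c % groups == 0, "channels must divide evenly into groups"
    x = x.reshape(groups, c // groups, length)
    x = x.transpose(1, 0, 2)             # interleave sub-groups across groups
    return x.reshape(c, length)

# 4 channels in 2 groups: after shuffling, each new group mixes both old groups
x = np.arange(8).reshape(4, 2)           # channels 0,1 form group 0; 2,3 group 1
y = channel_shuffle(x, groups=2)
print(y[:, 0])                           # channels reorder to 0, 2, 1, 3
```

Because the routine is just a reshape and transpose, it is differentiable and adds no learnable parameters, consistent with the end-to-end training claim above.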

Fusion with Channel Attention
Taking into account the natural temporal order of the features, this paper does not directly replace the original SSF with the shuffled SSF, but merges the two. Here, a simple element-wise addition at corresponding positions was used to fuse the two parts; that is, to preserve the sequential relationship of the original SSF in the time series while ensuring the communication of features across the different slice groups.
Before feature fusion, an attention operation was further performed on the SSF, implemented by the attention block shown in Figure 4a.
First of all, in order to ensure that the features of each dimension on the timing channel are fully utilized and can participate in subsequent predictions, we propose squeezing the global spatial information into a channel descriptor. Exploiting such information is prevalent in feature engineering work, and we opted for the simplest approach: global average pooling to generate channel-wise statistics. Formally, a statistic z ∈ R^C is generated by shrinking SSF through the spatial dimension W, where the c-th element of z is calculated by

z_c = F_sq(SSF_c) = (1/W) Σ_{i=1}^{W} SSF_c(i)

To make use of the information aggregated in the squeeze operation, we followed it with a second operation that aims to fully capture channel-wise dependencies: a one-dimensional convolution was performed on z, producing an output E of the original channel length, which was finally activated with a Sigmoid function. This decoupling releases the dependencies between different channels and outputs the final attention coefficients S = [s_1, s_2, ..., s_C]. The final output X = [x_1, x_2, ..., x_C] was obtained by rescaling the transformation output SSF with these activations, such that the c-th element of X was calculated by

x_c = F_scale(s_c, SSF_c) = s_c · SSF_c

where F_scale(s_c, SSF_c) refers to the channel-wise multiplication between the feature map SSF_c ∈ R^W and the scalar s_c. Finally, we embedded the Attention Block into SCINet, as shown in Figure 4b; the output of the Attention Block is added to and fused with the original SSF and sent to the next layer of inference.
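The squeeze, channel-wise convolution, Sigmoid activation, and rescaling steps can be sketched as follows (a hedged NumPy illustration: the kernel values, kernel size, and random input are our assumptions, not trained parameters from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_block(ssf, kernel):
    """Sketch of the attention block: squeeze each channel by global average
    pooling, capture channel-wise dependencies with a 1-D convolution of the
    same output length, activate with Sigmoid, and rescale channel-wise."""
    z = ssf.mean(axis=1)                          # squeeze: (C, W) -> (C,)
    e = np.convolve(z, kernel, mode="same")       # 1-D conv across channels
    s = sigmoid(e)                                # attention coefficients in (0, 1)
    return ssf * s[:, None]                       # x_c = s_c * SSF_c

ssf = np.random.randn(8, 16)                      # 8 channels, window length 16
out = attention_block(ssf, kernel=np.array([0.25, 0.5, 0.25]))
print(out.shape)                                  # (8, 16), same as the input
```

Because the Sigmoid output lies in (0, 1), the block only rescales each channel; the output keeps the input shape, so it can be added element-wise to the original SSF as described above.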

SFINet Architecture
The output of each layer of SCINet is connected to the shuffle operation and the attention block. Hence, a new network structure, SFINet (the shuffle-and-fusion interaction network), is proposed. In this structure, the different timing information features, after passing through each layer, are shuffled and further processed by the attention block. They are then fused with the previous features before entering the next layer of reasoning, thus breaking the information blockage of the original structure. These operations not only guarantee the capture of short-term dependencies in the time series but also further improve the ability to build long-term dependencies.
It should be noted that, when level = 1, SFINet has the same structure as the original SCINet, because the original input is directly connected to the output after one layer of reasoning and does not pass through the shuffle operation and attention block modules; in practice, however, most tasks require a structure with two or more levels.
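The per-layer data flow described above can be sketched as follows (the shuffle and attention operations here are toy placeholders, not the actual modules):

```python
import numpy as np

def sfinet_layer_flow(ssf, shuffle_op, attention_op):
    """One SFINet step after a layer's SCI-Blocks: shuffle the split-sequence
    features, apply the attention block, then fuse with the original features
    by element-wise addition before the next layer of reasoning."""
    shuffled = shuffle_op(ssf)
    attended = attention_op(shuffled)
    return ssf + attended                 # fuse original and processed features

# toy placeholders: reverse the channel order; scale by a fixed 0.5 "attention"
ssf = np.arange(6.0).reshape(3, 2)        # 3 sequence channels, length 2
out = sfinet_layer_flow(ssf, lambda x: x[::-1], lambda x: 0.5 * x)
print(out.shape)                          # (3, 2): same shape enters next layer
```

The additive fusion keeps the output shape equal to the input shape, so the next layer of SCI-Blocks can consume it unchanged.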

Wind Power Forecast
This section presents the test of the proposed SFINet for wind power forecasting. Section 4.1 introduces the data sets included in this study; five data sets were utilized, two of real wind power data collected from a wind farm and three published benchmark data sets. Section 4.2 introduces the detailed power prediction and other experimental settings. Section 4.3 shows the performance and usability of the proposed method, as well as the testing results. The effectiveness of SFINet was verified by comparing it with various other methods, including SCINet.

Data Set Selection
We empirically test the established models using five data sets: two were collected from a wind farm, and the other three were selected from published benchmark data sets.
The data collected from a wind farm represent the operation data of two independent wind turbines, each with a rated power of 1.5 MW, over one year with a sampling frequency of once per 10 min, i.e., 10 min data. The two data sets were marked as WPm1 and WPm2, respectively. The data for each sampling point include 12 variables; see Table 1 below. These variables are the most relevant to wind power generation, including the wind speed, generator output power, pitch angle, nacelle position (or yaw angle), wind direction (vane direction), etc. They were selected by referring to previously published research articles on wind power forecasting and through discussions with the wind farm operation manager and engineers. There were two wind turbine operation modes: Mode 1 was marked as 20 when there was power output to the grid, and Mode 2 was marked as 6 when there was no power output or the wind turbine stopped running. Other values represent the wind turbine running in a transition status between the two operation modes; each such value was calculated from the time spent in Mode 1 and the time spent in Mode 2 within a 10 min time step. Similarly, there were two wind turbine braking modes: under braking and no braking. When the turbine was under braking, the variable Turbine Brake Level was assigned a value of 51, and 0 if there was no braking. Other values of this variable represent braking in a transition status between the two braking modes, calculated from the time spent braking and the time spent not braking, averaged over a 10 min time step.
The ratio of the training data set to the validation set and the test set was 5:2:3. See Table 2 for detailed information. These two data sets were used to verify the effectiveness of the proposed method in wind power forecasting.
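A chronological 5:2:3 split (no shuffling, so that training data always precede validation and test data in time) could be sketched as follows; the function name and index-range convention are illustrative assumptions.

```python
def chrono_split(n, ratios=(5, 2, 3)):
    # Split n chronologically ordered samples into train/val/test
    # index ranges according to the given ratio (5:2:3 here).
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = (0, n_train)
    val = (n_train, n_train + n_val)
    test = (n_train + n_val, n)
    return train, val, test

chrono_split(52560)  # one year of 10 min samples (365 * 144)
```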
Electricity Transformer Temperature (ETT) data were collected and used in [55]. The ETT data cover 2 years of data collected from two separate counties in China. They were split into two data sets, marked as ETTh1 and ETTh2, respectively, with a sampling frequency of once per hour. The ETTm1 data set consists of 15 min data, i.e., the sampling frequency was once per 15 min. Each data point consists of the target value of "oil temperature" (°C) and six power load features; see Table 3 below. The ratio of the training data set to the validation set and the test set was 3:1:1. See Table 2 for detailed information. The ETT data sets were used to demonstrate the general validity of the proposed method.

Experiment Implementation
In order to evaluate the performance of the proposed method in different aspects, a variety of tasks were defined based on the wind power data set, including prediction tasks of different horizons and univariate or multivariate predictions.
In terms of horizon, similar to the ETT data, with a fixed sampling time, different output sequence lengths characterize different prediction times and also reflect the difficulty of the task. The prediction lengths for WPm1 and WPm2 were set to 6, 12, 24, and 48, corresponding to prediction times of 1 h, 2 h, 4 h, and 8 h. In terms of variables, both multivariate and univariate forms were used for evaluation. Univariate prediction takes the value of the generated wind power per second (A1gr Gen Power for Process_1sec), and multivariate prediction takes the values of all variables.
For the ETT data, the prediction lengths for ETTh1 and ETTh2 were 24, 48, 96, 288, and 720, corresponding to prediction times of 24 h, 48 h, 96 h, 288 h, and 720 h, respectively. The prediction lengths for ETTm1 were 24, 48, 96, 288, and 672, corresponding to prediction times of 6 h, 12 h, 24 h, 72 h, and 168 h, respectively. Univariate prediction takes the value of oil temperature, and multivariate prediction takes the values of all variables.
(1) Evaluation index. The Mean Absolute Error (MAE) and Weighted Mean Absolute Percentage Error (WMAPE) were used as evaluation criteria. Because some variables in the task may have negative values (the data sets in the format of Tables 1 and 2), an optimized WMAPE was calculated as follows, where τ is the total number of data points and x̄ is the mean of the data in the sample.
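Since the formula itself is not reproduced above, the following is one plausible reading of the "optimized" WMAPE: the denominator is replaced by τ times the magnitude of the sample mean, so that variables taking negative values cannot cancel the denominator toward zero. This is a hedged sketch based on the surrounding definitions (τ, x̄), not the paper's exact equation.

```python
def wmape(y_true, y_pred):
    # WMAPE variant: sum of absolute errors normalized by tau * |mean(y_true)|,
    # guarding against series whose raw values can be negative.
    tau = len(y_true)
    mean_abs = abs(sum(y_true) / tau)  # |x_bar|
    abs_err = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return abs_err / (tau * mean_abs)
```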
The relative improvement of performance (RIP) with MAE and the absolute improvement of performance (AIP) with WMAPE were used for comparison. They were calculated as follows, where MAE_o and WMAPE_o are obtained using other competitive methods for prediction, and MAE and WMAPE are obtained using the newly proposed method based on the SFINet model. In order to evaluate the performance of our proposed algorithm in wind turbine power prediction, we conducted experiments based on the proposed WP data sets and compared the prediction results with those given by the SCINet models. At the same time, in order to further verify the general applicability of the algorithm, we also verified its performance using the ETT data sets and compared it with other methods, including SCINet.
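Under the definitions above, RIP and AIP could be computed as sketched below. This assumes WMAPE values are expressed as fractions and that AIP is a simple difference in percentage points; both conventions are assumptions, since the exact formulas are not reproduced in the text.

```python
def rip(mae_other, mae_ours):
    # Relative improvement of performance with MAE, in percent:
    # how much lower our MAE is, relative to the competing method's MAE.
    return (mae_other - mae_ours) / mae_other * 100.0

def aip(wmape_other, wmape_ours):
    # Absolute improvement of performance with WMAPE, in percentage points.
    return (wmape_other - wmape_ours) * 100.0
```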
Our tasks do not specify the look-back window corresponding to a certain prediction sequence horizon. In the wind power forecasting task, the original wind power data contain singular values. In order to eliminate the singular values shown in Table 1, we chose to skip them when constructing the training, validation, and testing samples for the different tasks: if there was a singular value in a pair, that piece of data was discarded. In this way, the numbers of samples for training, validation, and testing for the different tasks were produced, as shown in Table 4. Table 4 also shows the input data time length for a certain prediction length; for example, if the prediction length is 6 h, the input data cover a time length of 128 h. The singular value issue did not appear in the ETT data sets, and these data could be used normally. For the ETT tasks, the settings are shown in Table 5. All data were normalized. In terms of the loss function and optimizer, we followed the same settings as for the SCINet model given in [23].
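The singular-value filtering described above can be sketched as a sliding-window sample builder that drops any input/target pair touching a flagged value. The names and the exact sliding scheme are illustrative; the paper's preprocessing may differ in detail.

```python
def make_samples(series, lookback, horizon, is_singular):
    # Slide a (lookback + horizon) window over the series and keep only
    # the (input, target) pairs that contain no singular values.
    samples = []
    for start in range(len(series) - lookback - horizon + 1):
        window = series[start:start + lookback + horizon]
        if any(is_singular(v) for v in window):
            continue  # discard the pair, as described in the text
        samples.append((window[:lookback], window[lookback:]))
    return samples

# A singular reading (-99 here) removes every pair whose window touches it:
make_samples([1, 2, -99, 4, 5, 6, 7], 2, 1, lambda v: v == -99)
```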

Prediction Experiment and Result Analysis
(1) Wind power forecasting. Applying SFINet to the wind power data sets, the forecasting performance obtained is shown in Table 6. It can be seen from Table 6 that the prediction results of the proposed SFINet model were generally better than those of the SCINet model. The MAE of the multivariate and univariate prediction results on the WPm1 data set was reduced by up to 10.07% and 6.90%, respectively, and by up to 9.20% and 6.87%, respectively, on the WPm2 data set. The forecast results evaluated using WMAPE are shown in Table 7. From the results given in Table 7, the superiority of the SFINet model was verified for univariate and multivariate wind turbine power prediction. The WMAPE on the WPm1 data set was reduced by up to 5.79% in the multivariate evaluation and 4.93% in the univariate evaluation, and the WMAPE on the WPm2 data set was reduced by up to 4.86% and 4.35%, respectively. The performance improvement trend of the SFINet model under different horizon tasks was similar to the results obtained with the MAE metric.
We then performed a qualitative analysis of the prediction results on the wind power data sets by selecting a piece of wind turbine power data with a sampling length of 200 from the WPm1 and WPm2 data sets, as shown in Figure 5a,b. It can be seen that the forecasted wind power exhibited large variations and violent fluctuations, with no obvious laws to follow. The prediction results of the SFINet model fit the actual power curve better. At the same time, at the peak and valley points of the power changes, the prediction results of the SFINet model were generally better than those of the SCINet model (see also Figure 5c, obtained using the published data set). (2) Generalization study. The ETT data sets given in [55] have been used to evaluate the performance of time series forecasting tasks, e.g., [23,55]. In this paper, we used the same data sets to evaluate the performance of time series forecasting by different approaches; the results of multivariate and univariate prediction are shown in Tables 8 and 9.
As shown in Table 8, in multivariate prediction, the Transformer-based methods other than Reformer [56], such as LogTrans [52] and Informer [55], outperformed the RNN-based methods, such as LSTMa [5]; TCN [22] further outperformed the Transformer-based methods; and, compared with these methods, the SCINet model achieved better performance, because its downsample-convolve-interact architecture enables multi-resolution analysis, which facilitates extracting temporal relation features with enhanced predictability. Overall, in all subtasks on the ETT data, the prediction performance of SFINet was consistently better than that of SCINet; see the relative improvement of performance (RIP) shown in green.
Compared with multivariate prediction, the performance of the methods under discussion gradually improved for univariate prediction. N-Beats [57] outperforms the above methods, and SCINet is superior to the other baseline methods. However, the performance of SFINet in time series forecasting is even better than that of SCINet.
Specifically, for the two different tasks on the ETTh1 data set, the MAE was improved by at least 6.33% and 0.76%, respectively, while on the ETTh2 and ETTm1 data sets, the prediction performance was improved by at least 6.94% and 1.76%, and 3.06% and 1.14%, respectively. As the horizon increased, the improvement in MAE showed an increasing trend. These results further confirm the effectiveness and universality of the algorithm proposed in this paper. To further compare SFINet and SCINet in time series forecasting, we performed prediction on the ETT data sets and evaluated the performance using the WMAPE metric; see Table 9. SFINet again achieved better performance than SCINet. More specifically, for the multivariate and univariate tasks on the ETTh1 data set, the WMAPE was improved by at least 2.88% and 1.04%, respectively, while on the ETTh2 and ETTm1 data sets, the performance was improved by at least 1.11% and 1.47%, and 1.13% and 0.43%, respectively.
Finally, the prediction results on the ETTh1 data set for a horizon of 168 are shown in Figure 5c as an illustrative example. The predicted results fit the actual values very well, and the degree of fitting was better than that of the SCINet model.

Conclusions
In this paper, we proposed a new framework named SFINet: shuffle-and-fusion interaction network. The SFINet model includes a shuffle operation and a feature fusion function based on the attention mechanism. The shuffle operation is succinctly embedded between the adjacent layers of SFINet, which increases the interaction among the different time series features of the model. At the same time, in order to more effectively integrate the features of the different parts, a feature fusion function based on the attention mechanism was proposed to enhance the feature interaction capabilities of the different parts of the model. The developed SFINet models were tested using real wind power generation data, and it was verified that they provide better performance than the network algorithm based on SCINet. To verify the universality of the proposed framework, we also evaluated the performance of the SFINet models on the ETT data sets, which are open, published data sets. The proposed model presented forecast performance better than that of the seven other types of algorithms in comparison. In future work, we will focus on further improving the SFINet structure, with more applications to demonstrate the capability of SFINet models in time series forecasting.
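For illustration, a ShuffleNet-style channel shuffle over a (batch, channels, time) array can be sketched with NumPy as below. This is a hypothetical stand-in for the shuffle operation described above, not SFINet's exact implementation; the group count and array layout are assumptions.

```python
import numpy as np

def channel_shuffle(x, groups):
    # Interleave channels across groups: reshape to (b, groups, c//groups, t),
    # swap the group axis and the per-group axis, then flatten back to (b, c, t).
    b, c, t = x.shape
    assert c % groups == 0, "channel count must be divisible by group count"
    return (x.reshape(b, groups, c // groups, t)
             .transpose(0, 2, 1, 3)
             .reshape(b, c, t))

x = np.arange(4).reshape(1, 4, 1)  # channels [0, 1, 2, 3]
channel_shuffle(x, 2)              # channels become [0, 2, 1, 3]
```

The permutation mixes channels that would otherwise stay within their own group, which is the interaction effect the shuffle operation is meant to promote.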

Figure 2. The overall architecture of the Sample Convolution and Interaction Network (SCINet) [23].

Figure 5. The prediction comparison between SFINet and SCINet for different datasets.

Table 1. Illustration of the wind power data set including 12 variables.

Table 2. The overall information of the 5 datasets.

Table 3. Illustration of the ETT data set including 7 variables.

Table 4. Wind power forecasting task setting and number of samples.

Table 5. ETT task setting and number of samples.

Table 6. Forecasting results evaluated in MAE on the wind power data sets.

Table 7. Forecasting results evaluated in WMAPE on the wind power data sets.

Table 8. Forecasting results evaluated by MAE on the ETT datasets. The best results are in bold and the second-best results are underlined. RIP denotes the relative improvement of performance of the proposed method over the second-best results.

Table 9. Forecasting results evaluated by WMAPE on the ETT datasets.