Wind-Speed Multi-Step Forecasting Based on Variational Mode Decomposition, Temporal Convolutional Network, and Transformer Model

Abstract: Reliable and accurate wind-speed forecasts significantly affect the efficiency of wind power utilization and the safety of power systems. To improve the performance of transformer models in short-term wind-speed forecasting, a multi-step prediction model based on variational mode decomposition (VMD), a temporal convolutional network (TCN), and a transformer is proposed. First, the Dung Beetle Optimizer (DBO) is used to optimize VMD for decomposing non-stationary wind-speed series data. Next, the TCN extracts features from the input sequences. Finally, the processed data are fed into the transformer model for prediction. The model is validated by comparison with six other prediction models across three datasets, demonstrating superior accuracy in short-term wind-speed forecasting. Across the three datasets, the developed model achieves an average improvement of 52.1% in $R^2$. To the best of our knowledge, this places our model at the leading edge of wind-speed prediction for 8 h and 12 h forecasts, with MSEs of 1.003 and 0.895, MAEs of 0.754 and 0.665, and RMSEs of 1.001 and 0.946, respectively. This research therefore contributes a new framework and demonstrates the utility of the transformer in predicting short-term wind speed.


Introduction
Wind power is a crucial component of renewable energy and, thanks to its eco-friendly properties, one of the most viable alternatives to traditional fossil fuels. It can help decrease reliance on fossil fuels and mitigate environmental pollution [1]. The Global Wind Energy Council has documented a significant rise in worldwide wind energy capacity, reaching 906 gigawatts (GW), an annual increase of 9%. The year 2023 was expected to be a milestone, with projections indicating that it would be the first year to see more than 100 GW of new capacity added across the globe. Their estimates also predict an expansion of 1221 GW in new capacity from 2023 to 2030 [2]. Accurate predictions of wind speed are essential for the effective management of wind energy generation [3]. Generally, precise forecasts of wind speed can enhance the efficiency of wind resource utilization and reduce the effects of wind energy variability on the stability of the electrical grid, facilitating cost-effective and efficient wind farm operations [4]. Therefore, accurate wind-speed forecasting is increasingly important for reducing the costs and risks linked to power supply systems [5].

Numerous scholars have sought to build models that yield precise deterministic forecasts of wind speed. These models fall into four groups: physical, statistical, artificial intelligence (AI)-based, and hybrid models [6,7]. Among these, numerical weather prediction models, such as the weather research and forecasting model [8], are the most prominent physical models. They predict wind speeds using intricate mathematical equations that factor in meteorological variables such as humidity and temperature [9], and they are particularly effective for medium-to-long-range forecasts of wind speed [10]. Statistical models, such as auto-regressive moving average [11], auto-regressive integrated moving average [12], and vector auto-regression [13], differ from physical models by relying solely on historical wind-speed data. These models are adept at capturing the linear variability of wind speeds and excel at short-term forecasting [14]. AI-based models primarily tackle the nonlinear dynamics of wind speed, incorporating simple neural networks (for instance, the back-propagation neural network [15], Elman neural network [16], and multilayer perceptron [17]), along with support vector machines [18] and extreme learning machines [19]. Studies indicate that while deep learning offers suboptimal interpretability, it yields commendable predictive outcomes [20]. Many deep learning methods have been employed for wind-speed forecasting, such as deep belief networks [21], convolutional neural networks (CNNs) [22], long short-term memory (LSTM) networks [23], gated recurrent units (GRUs) [24], and temporal convolutional networks (TCNs) [25]. TCN-based approaches [26] use convolutional kernels that move across the time dimension to detect temporal changes. Zhang et al. [27] proposed an integrated model that blends VMD, the Sparrow Search Algorithm, and bidirectional GRU and that leverages TCNs. Various studies have observed that deep learning models often outperform both classical machine learning and statistical models in nonlinear prediction and feature extraction [28]. Many scholars agree that no single model can fully capture the intricate variations in wind speed, which has led to the creation of diverse hybrid models [8]. Zhang et al. [29] developed a hybrid model that merges noise-reduction techniques, optimization strategies, statistical approaches, and deep learning. Neshat et al. [30] introduced a hybrid model with a deep learning-based evolutionary approach, featuring a bidirectional LSTM, an efficient hierarchical evolutionary decomposition technique, and an enhanced generalized normal distribution optimization method.
The transformer model has achieved remarkable success in fields such as computer vision and natural language processing, and it is pivotal in bridging the gaps between diverse research domains. In time series forecasting, transformer-based models have gained prominence due to their multi-head self-attention (MHSA) mechanism. Both the transformer and its adaptations have been proposed for time sequence forecasting tasks [31]. The transformer model has also become a prominent tool in wind-speed prediction. For instance, Wu et al. [32] introduced an EEMD-Transformer-based hybrid model for predicting wind speeds. Zhou et al. [33] presented the informer, a model designed for long sequence time forecasts, characterized by a ProbSparse self-attention mechanism for improved time complexity and memory efficiency. Yang et al. [34] developed a causal inference-enhanced informer methodology employing an advanced variant of the informer model, specifically adapted for long-term time series analysis. Bommidi et al. [35] developed a composite approach that combines the predictive strength of the transformer model with ICEEMDAN decomposition to improve wind-speed prediction accuracy. Huang et al. [36] presented a hybrid forecasting model for short-term power load that decomposes power load data into subsequences of varying complexity, employs a BPNN for the less complex subsequences and transformers for the more intricate ones, and combines the forecasts into a unified prediction. Wang et al. [37] used the transformer as a core component of a convolutional transformer-based truncated Gaussian density framework, offering both precise wind-speed predictions and reliable probabilistic forecasts. Zeng et al. [38] introduced the DLinear model, which explores how various design elements of long-sequence time forecast models affect their capability to extract temporal relationships. Nie et al. [39] presented a transformer-based framework for multivariate time series forecasts and self-supervised representation learning. This framework, termed the channel-independent Patch Time Series Transformer (PatchTST), markedly improves long-term forecasting precision.
Within the hybrid modeling framework, original wind-speed data are segmented into subseries with distinct frequencies, each subseries is analyzed individually using a specialized model, and the forecasts are combined to produce the final prediction [40]. For instance, Li et al. [41] employed the VMD technique to separate wind-speed data into intrinsic mode functions (IMFs) of varying frequencies, with each IMF analyzed through a bidirectional LSTM model. Similarly, Wu et al. [42] used VMD to segment wind speed and integrated these segments with multiple meteorological variables to construct an interpretable deep-learning model. Geng et al. [43] proposed a prediction framework to enhance short-term power load forecasting accuracy, using a particle swarm optimization (PSO)-enhanced VMD in conjunction with a TCN incorporating an attention mechanism. Zhang et al. [44] proposed a hybrid deep learning model for wind-speed forecasting that combines CNN, bidirectional LSTM, an enhanced sine cosine algorithm, and EMD based on time-varying filtering to improve prediction accuracy. Moreover, Altan et al. [45] presented a predictive model that combines ICEEMDAN decomposition and LSTM, employing grey wolf optimization to fine-tune the weighted coefficients of each IMF for enhanced forecasting precision.
The literature review highlights several gaps in the field of wind-speed prediction. First, wind-speed prediction studies based on transformers are relatively scarce compared to those based on other deep learning models, so the potential of transformer-based models in this domain deserves further in-depth exploration. Second, most transformer-based wind-speed prediction models are designed for long-term forecasting; models for medium-term, short-term, and ultra-short-term prediction are notably scarce and need to be developed. Third, few transformer-based wind-speed prediction models integrate data decomposition algorithms and other models, indicating a need to further explore transformer-based hybrid forecasting models. In response to these challenges and needs, this paper introduces a hybrid wind-speed prediction model named DBO-VMD-TCN-Transformer, which integrates Dung Beetle Optimizer (DBO) algorithm-enhanced VMD, TCN, and transformer technologies. The contributions of the study are as follows:

• The model uses the DBO algorithm to autonomously determine the most effective decomposition parameters for VMD. This approach significantly reduces signal loss during the decomposition phase and enhances the overall performance of VMD.

• A hybrid forecasting model that combines TCN with transformers is introduced. TCN is employed to extract original wind-speed features, which are then fed into the transformer for multi-step short-term wind-speed prediction.

Flow Chart of the Proposed Model
A novel composite forecasting approach is presented, illustrated in Figure 1, which integrates the advantages of DBO-enhanced VMD, TCN, and transformer technologies and is referred to as the DBO-VMD-TCN-Transformer. The approach has three phases. In the first phase, the gathered wind-speed data are partitioned into training, validation, and test groups; the DBO algorithm then determines the optimal parameters for VMD automatically, and the wind-speed data are decomposed into various IMFs. In the second phase, the decomposed data are fed into the TCN model to extract features from the high-resolution wind-speed data. These features are subsequently used for multi-step, short-term prediction through a transformer model. The TCN-Transformer architecture is devised to capture the complex relationships between historical inputs and forecasted outcomes. The final phase presents and analyzes the empirical results obtained from three distinct datasets, assessing the framework's effectiveness and stability via four principal performance metrics (MSE, MAE, RMSE, and $R^2$) in conjunction with the Diebold-Mariano (DM) test.

Variational Mode Decomposition
VMD is a contemporary signal-processing technique that has been increasingly adopted for its effectiveness. It identifies the optimal central frequency and minimizes the bandwidth of each mode, thereby isolating intrinsic mode functions and segmenting the frequency domain [46]. Compared with empirical mode decomposition and wavelet analysis, VMD offers enhanced signal reconstruction capabilities and superior noise immunity. The algorithm decomposes a signal into K stable sub-signals occupying distinct frequency bands, each characterized by oscillatory components with their own frequencies and amplitudes. The decomposition is posed as a constrained variational problem: the sum of the estimated bandwidths of the modes is minimized subject to the constraint that the modes sum to the original signal. The formal definition of VMD is given by Equation (1).
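For reference, the constrained problem that Equation (1) refers to is usually written as below. This is the standard VMD formulation from the original VMD literature, with $f(t)$ assumed to denote the input signal, $*$ convolution, and $K$ the number of modes; the paper's own typesetting may differ.

```latex
\min_{\{u_k\},\{\omega_k\}}
\left\{ \sum_{k=1}^{K}
\left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right]
e^{-j\omega_k t} \right\|_2^2 \right\}
\quad \text{s.t.} \quad \sum_{k=1}^{K} u_k(t) = f(t)
\tag{1}
```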
In the formulation, the $k$th mode is denoted by $u_k$, the central frequency of this mode is denoted by $\omega_k$, and $\delta(t)$ is the Dirac distribution.
To solve the original constrained variational problem, a penalty coefficient α and a Lagrange multiplier λ are introduced. This converts the constrained problem into an unconstrained one and yields the augmented Lagrangian in Equation (2).
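As a reference sketch under the same assumptions (standard VMD notation, $f(t)$ the input signal), the augmented Lagrangian that Equation (2) refers to combines the penalized bandwidth term, a quadratic reconstruction term, and the multiplier term:

```latex
L\left(\{u_k\},\{\omega_k\},\lambda\right) =
\alpha \sum_{k=1}^{K}
\left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right]
e^{-j\omega_k t} \right\|_2^2
+ \left\| f(t) - \sum_{k=1}^{K} u_k(t) \right\|_2^2
+ \left\langle \lambda(t),\; f(t) - \sum_{k=1}^{K} u_k(t) \right\rangle
\tag{2}
```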
To obtain the optimal solution, the mode spectra $\hat{u}_k$, the central frequencies $\omega_k$, and the Lagrange multiplier $\hat{\lambda}$ are initialized, and the iteration counter n is set to 0. An iterative process then begins in which n is increased by one with each pass. At every pass, $\hat{u}_k$, $\omega_k$, and $\hat{\lambda}$ are updated based on the latest computations.
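In the standard algorithm, the adjustments made at each pass are the following frequency-domain updates. They are given here as a reference sketch: the hat notation for Fourier transforms, the multiplier step size $\tau$, and the use of the most recent estimate of every other mode in the sum over $i \neq k$ are assumptions taken from the original VMD formulation, not from this text.

```latex
\hat{u}_k^{\,n+1}(\omega) =
\frac{\hat{f}(\omega) - \sum_{i \neq k} \hat{u}_i(\omega) + \hat{\lambda}^{n}(\omega)/2}
     {1 + 2\alpha \left( \omega - \omega_k^{n} \right)^2}

\omega_k^{\,n+1} =
\frac{\int_0^{\infty} \omega \left| \hat{u}_k^{\,n+1}(\omega) \right|^2 d\omega}
     {\int_0^{\infty} \left| \hat{u}_k^{\,n+1}(\omega) \right|^2 d\omega}

\hat{\lambda}^{\,n+1}(\omega) =
\hat{\lambda}^{n}(\omega) + \tau \left( \hat{f}(\omega) - \sum_{k=1}^{K} \hat{u}_k^{\,n+1}(\omega) \right)
```

In that formulation, iteration usually stops once the relative change of the modes between two passes falls below a chosen tolerance.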

Dung Beetle Optimization
The DBO algorithm was introduced by Xue and Shen in 2023 [47]. The basic algorithm updates the positions of the population by mimicking four natural behaviors observed in dung beetles: rolling, spawning, foraging, and stealing.
During the rolling process, dung beetles shape dung into balls and roll them away quickly to minimize competition from fellow beetles. The beetles use environmental light to determine their direction of movement, aiming to roll the dung ball in as straight a line as possible. Equation (6) gives the position update of a rolling dung beetle, where t is the current iteration count and $x_i(t)$ represents the location of the $i$th dung beetle at iteration t. The coefficient $\alpha$ indicates whether the beetle keeps to or deviates from its path: it is randomly assigned 1 for no change in direction and −1 for a change in direction. $k \in (0, 0.2]$ is the imperfection factor, set to 0.1, and b is a constant within [0, 1], set to 0.3 in the implementation. $X^w$ is the global worst position. $\Delta x$ mimics the effect of sunlight, where a higher $\Delta x$ indicates a greater distance from the light source. In the absence of light or on uneven terrain, dung beetles cannot determine their direction of movement. Under such conditions, they climb onto the dung ball and perform a dance, a behavior that helps them decide the direction of subsequent movement. The position update based on this dance is given in Equation (7).
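As a reference, the rolling and dancing updates of the original DBO paper [47], to which Equations (6) and (7) correspond, take the following form. The symbol names follow that paper and are assumed to match the ones used here; $\theta \in [0, \pi]$ is the deflection angle of the dance.

```latex
x_i(t+1) = x_i(t) + \alpha \, k \, x_i(t-1) + b \, \Delta x,
\qquad \Delta x = \left| x_i(t) - X^{w} \right|
\tag{6}

x_i(t+1) = x_i(t) + \tan(\theta) \left| x_i(t) - x_i(t-1) \right|
\tag{7}
```

In [47], the position is left unchanged when $\theta$ equals 0, $\pi/2$, or $\pi$.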
In the spawning process, dung beetles choose secure locations for egg-laying. Mirroring this behavior, a boundary selection strategy is used to represent these areas, where $Lb^*$ and $Ub^*$ are the lower and upper limits, respectively, of the spawning area, $X^*$ is the current local optimal position, $R = 1 - t/T_{\max}$, and $T_{\max}$ is the maximum iteration count. Once the spawning area has been identified, the dung beetle spawns within that zone. Because the spawning area changes continuously with the iteration count, the search keeps following the region containing the current optimal solution while avoiding entrapment in local optima. The position update of a spawning dung beetle is given in Equation (9). Here, $b_1$ and $b_2$ are random vectors of size 1 × Dim, where Dim is the dimensionality of the optimization problem.
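For reference, the spawning boundaries and the brood-ball update in the original DBO paper [47] are written as below. This is a sketch under the assumption that the notation matches: $Lb$ and $Ub$ are the lower and upper bounds of the search space, and $B_i(t)$ is the position of the $i$th brood ball at iteration $t$.

```latex
Lb^{*} = \max\left( X^{*} (1 - R),\; Lb \right),
\qquad
Ub^{*} = \min\left( X^{*} (1 + R),\; Ub \right)
\tag{8}

B_i(t+1) = X^{*} + b_1 \left( B_i(t) - Lb^{*} \right) + b_2 \left( B_i(t) - Ub^{*} \right)
\tag{9}
```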
In the foraging process, dung beetles likewise select a secure area, as they do for egg-laying. This optimal foraging area is defined by Equation (10).
In this context, $X^b$ is the globally optimal position, whereas $Lb^b$ and $Ub^b$ are the lower and upper limits of the optimal foraging area. $Lb$ and $Ub$, on the other hand, are the lower and upper bounds of the optimization problem. Each act of foraging updates the position of the dung beetle, as detailed in Equation (11). Here, $C_1$ is a normally distributed random number, and $C_2$ is a random vector within [0, 1] of size 1 × Dim.
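As a reference sketch, the foraging area and the foraging update in the original DBO paper [47] have the same structure as the spawning rules, with the global best $X^b$ in place of the local best. The notation is assumed to match the definitions above.

```latex
Lb^{b} = \max\left( X^{b} (1 - R),\; Lb \right),
\qquad
Ub^{b} = \min\left( X^{b} (1 + R),\; Ub \right)
\tag{10}

x_i(t+1) = x_i(t) + C_1 \left( x_i(t) - Lb^{b} \right) + C_2 \left( x_i(t) - Ub^{b} \right)
\tag{11}
```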
During the stealing process, certain dung beetles steal dung balls from other beetles. The globally optimal position $X^b$ is taken as the location of the contested dung balls. The position update of a stealing dung beetle is given in Equation (12). Here, $S$ is a constant set to 0.5 in the study, $g$ is a random factor, and Dim is the dimensionality of the problem.
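For reference, the stealing update in the original DBO paper [47] is written as below, where $g$ is a normally distributed random vector of size 1 × Dim and $X^{*}$ is the current local optimal position. This is a sketch assumed to correspond to Equation (12).

```latex
x_i(t+1) = X^{b} + S \, g \left( \left| x_i(t) - X^{*} \right| + \left| x_i(t) - X^{b} \right| \right)
\tag{12}
```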

Temporal Convolutional Network
TCNs are derived from the CNN architecture and use causally structured one-dimensional convolutional layers whose inputs and outputs have the same length. This design allows historical and spatial information to be processed simultaneously. Moreover, the ability of CNNs to run operations in parallel significantly reduces processing time. Compared with long short-term memory (LSTM) networks, TCNs have a simpler and more coherent structure, train and converge more efficiently, and can learn from historical data as recurrent neural networks (RNNs) do without leaking future information. Additionally, TCNs are more stable with respect to exploding or vanishing gradients and use less memory, which makes them a practical option for specific analytical tasks.
The architecture of the network is depicted in Figure 2, which shows that the TCN [48] consists of three key components: causal convolution, dilated convolution, and residual connections. Causal convolution ensures that the model's predictions are based solely on past and present inputs, never on future inputs, in line with the natural causality of a time sequence. As shown in the left portion of Figure 2, the output at a given time point t incorporates data only from time t and preceding time points, layer by layer. The reach of a causal convolution is limited by the size of its kernel, so many linearly stacked layers would be needed to capture long-range dependencies. To address this limitation, TCNs employ dilated convolution. Dilated convolutions require padding on one side of the input layer (left or right, depending on the convolution direction), commonly achieved through zero-padding. This approach gives a broader receptive field without increasing the number of layers, thereby capturing wider temporal relationships without raising computational complexity or the number of parameters. The formal definition of dilated convolution is given by Equation (13), where $*$ denotes the convolution operation, $f$ represents the convolution kernel, $d$ represents the dilation factor, $k$ signifies the filter size, and $x_{s - d \cdot i}$ indicates the sequence element used by the dilated convolution. Typically, the dilation factor $d$ grows exponentially with the depth of the network. Increasing either the dilation factor $d$ or the kernel size $k$ expands the receptive field of the TCN. Unlike standard convolutions, dilated convolutions sample the input at intervals, expanding the receptive field with a sampling rate determined by the dilation factor $d$. As the number of layers increases, problems such as vanishing gradients must be addressed, which motivates the use of residual connections. Residual connections, particularly those using 1 × 1 convolution blocks, pass information across layers and keep the dimensions of the inputs and outputs consistent. In the equation that represents these connections, $x$ denotes the input, $F(x)$ is the convolutional layer's output, and the activation function is ReLU.
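As a reference, the dilated convolution and the residual connection are written as follows in the generic TCN formulation. The form and the numbering are assumed to match Equation (13) and the residual equation that follows it; $F(s)$ is the output at sequence position $s$ and $\mathrm{ReLU}$ is the activation.

```latex
F(s) = (x *_d f)(s) = \sum_{i=0}^{k-1} f(i) \, x_{s - d \cdot i}
\tag{13}

o = \mathrm{ReLU}\left( x + F(x) \right)
\tag{14}
```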
As shown in the right section of Figure 2, the residual module consists of a dilated causal convolution followed by weight normalization, a ReLU activation, and a Dropout layer to prevent overfitting. This configuration is repeated across four stages, resulting in an eight-layer structure. Throughout, residual connections using 1 × 1 convolution blocks keep the output dimensions consistent.
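The dilated causal convolution described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' PyTorch implementation: the function names, the kernel values, the kernel size of 3, and the dilations 1, 2, 4, 8 for the four stages are all assumptions made for the example.

```python
def dilated_causal_conv(x, kernel, dilation):
    """Compute F(s) = sum_i kernel[i] * x[s - dilation * i] for every position s.

    Positions before the start of the series are treated as zeros, which is
    equivalent to left zero-padding and keeps the filter causal.
    """
    out = []
    for s in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            j = s - dilation * i
            if j >= 0:  # skip taps that would fall before the series start
                acc += w * x[j]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations, convs_per_block=2):
    """Number of time steps one output can see in a stack of residual blocks.

    Each dilated convolution adds (kernel_size - 1) * dilation steps.
    """
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# With dilation 2, each output averages x[s] and x[s - 2].
y = dilated_causal_conv(series, kernel=[0.5, 0.5], dilation=2)

# Hypothetical four-stage configuration: two convolutions per stage.
rf = receptive_field(kernel_size=3, dilations=[1, 2, 4, 8])
```

With these assumed settings, eight convolutional layers already cover 61 time steps, which illustrates why exponentially growing dilations reach long histories with few layers.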

Transformer
Transformers have achieved remarkable success in areas such as natural language processing and image recognition, overcoming limitations inherent in RNN- and CNN-based forecasting models. CNNs often require many layers to achieve a large receptive field, while RNNs must step through long time sequences to make predictions. The self-attention mechanism of transformers addresses these issues by giving direct access to every sequence element, which allows a deeper exploration of the complex correlations within the feature data. Moreover, their capacity for parallel processing significantly reduces training time, allowing models to be trained on larger datasets than LSTM networks.
Figure 3 illustrates the structure of the transformer network. The architecture comprises two key elements: an encoder and a decoder [49]. The encoder transforms the input into a rich, high-dimensional representation that encapsulates contextual nuances, whereas the decoder is dedicated to feature reconstruction [50]. Input embedding and position encoding are applied before the data proceed to the encoder and decoder layers. Input embedding combines various features into a unified representation, and position encoding ensures that the temporal attributes associated with each data point are retained. In the position-encoding formulation, $\omega_k = 1/10000^{2k/d}$ and $p$ denotes the position index. The MHSA mechanism lets the model compute several linear transformations concurrently through different attention heads and then combine them to obtain more comprehensive feature information, thereby enhancing the efficacy of the self-attention layer. MHSA is a pivotal feature of the transformer: it processes the input data in parallel, a capability that sets it apart from sequential time-sequence models like LSTM and TCN.
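The position encoding mentioned above is, in the standard transformer, the sinusoidal scheme $PE(p, 2k) = \sin(\omega_k p)$ and $PE(p, 2k+1) = \cos(\omega_k p)$. The sketch below assumes this standard form and an even embedding dimension `d_model`; the function name and the example sizes are hypothetical.

```python
import math


def positional_encoding(length, d_model):
    """Return a length x d_model table of sinusoidal position encodings.

    Assumes d_model is even: even columns hold sin(omega_k * p) and odd
    columns hold cos(omega_k * p), with omega_k = 1 / 10000 ** (2k / d_model).
    """
    table = [[0.0] * d_model for _ in range(length)]
    for p in range(length):
        for k in range(d_model // 2):
            omega = 1.0 / (10000 ** (2 * k / d_model))
            table[p][2 * k] = math.sin(omega * p)
            table[p][2 * k + 1] = math.cos(omega * p)
    return table


# Example: a 24-step look-back window embedded in 8 dimensions.
pe = positional_encoding(length=24, d_model=8)
```

Each row is added to the embedded input at the same time step, so identical wind-speed values at different positions receive different representations.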
As shown in Figure 3, within the MHSA framework the input $X$ is converted into $h$ distinct sets of query, key, and value matrices, denoted Q (Query), K (Key), and V (Value). The corresponding relations are given in Equation (17), where $Q_h$ denotes the query matrix, $K_h$ the key matrix, and $V_h$ the value matrix, with $W_h^Q$, $W_h^K$, and $W_h^V$ being the learnable parameters of the linear transformations. The MHSA divides the input into several independent feature spaces, which helps the model learn a broader spectrum of feature information [51]. Scaled dot-product attention is then applied to generate a series of output vectors. Here, $O_h$ is the result of the scaled dot-product attention mechanism, with $\sqrt{d_k}$ acting as the scaling factor for the attention weights. The outputs $O_h$ are subsequently concatenated and passed through a linear projection to yield the final output.
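For reference, these three steps take the following form in the standard transformer formulation. The equation numbers after (17) and the symbol $H$ for the number of heads are assumptions; $d_k$ is the dimension of the key vectors.

```latex
Q_h = X W_h^{Q}, \qquad K_h = X W_h^{K}, \qquad V_h = X W_h^{V}
\tag{17}

O_h = \mathrm{softmax}\!\left( \frac{Q_h K_h^{\top}}{\sqrt{d_k}} \right) V_h
\tag{18}

\mathrm{MHSA}(X) = \mathrm{Concat}\left( O_1, \ldots, O_H \right) W^{O}
\tag{19}
```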
where $W^O$ represents the learnable parameter of the MHSA mechanism, which is critical for encoding and aggregating the information at each point of the sequence.
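The MHSA computation can be sketched with plain Python lists. This is a minimal illustration under stated assumptions, not the model's actual implementation: the helper names, the identity projection matrices, and the averaging output projection `w_o` are hypothetical choices made so that the result is easy to check.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, heads, w_o):
    """heads is a list of (W_q, W_k, W_v) triples; w_o is the output projection."""
    outputs = []
    for w_q, w_k, w_v in heads:
        o_h, _ = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        outputs.append(o_h)
    # Concatenate the head outputs feature-wise for every time step.
    concat = [sum((o[t] for o in outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three time steps, two features
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [(identity, identity, identity), (identity, identity, identity)]
w_o = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]  # averages the two heads
out = multi_head_attention(x, heads, w_o)
single, weights = attention(x, x, x)
```

Each row of `weights` sums to one, so every output step is a weighted average of the value vectors of all steps, which is how a single layer relates distant points of the sequence.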

Evaluation Indicators and Experimental Environment
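Four frequently used indicators are employed to assess the experimental models: mean square error (MSE), mean absolute error (MAE), root-mean-square error (RMSE), and the R-squared ($R^2$) score. They are defined as follows, where $m$ is the total number of samples, $\hat{y}_i$ denotes the forecasted values, $y_i$ the observed values, and $\bar{y}$ the average of $y_i$:

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2$$

$$\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| y_i - \hat{y}_i \right|$$

$$\mathrm{RMSE} = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 }$$

$$R^2 = 1 - \frac{\sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{m} \left( y_i - \bar{y} \right)^2}$$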
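The four indicators can be computed directly from paired observations and forecasts. The sketch below uses the standard definitions; the function name and the toy values are illustrative only and are not taken from the experiments.

```python
import math


def evaluate(y_true, y_pred):
    """Return MSE, MAE, RMSE, and R^2 for paired observations and forecasts."""
    m = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    sse = sum(e * e for e in errors)           # sum of squared errors
    mse = sse / m
    mae = sum(abs(e) for e in errors) / m
    rmse = math.sqrt(mse)
    mean = sum(y_true) / m
    ss_tot = sum((yt - mean) ** 2 for yt in y_true)  # total sum of squares
    r2 = 1.0 - sse / ss_tot
    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "R2": r2}


# Toy example with four observations.
scores = evaluate([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
```

Lower MSE, MAE, and RMSE indicate better forecasts, whereas $R^2$ is better the closer it is to 1.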
The experiments were performed on a system running Windows 10, using the PyTorch framework with Python 3.9. The hardware comprised an Intel Core CPU T7700, 32 GB of RAM, and an NVIDIA Tesla M10 GPU. To ensure the fairness of the experiments, the parameters were kept as consistent as possible across all models. The forecast periods considered are 12, 24, and 48 steps, and the look-back periods are set at 24, 48, and 96 steps. Batch size and number of epochs were fixed at 128 and 100, respectively, with early stopping (patience = 3) and Dropout (dropout rate = 0.15) used to counteract overfitting.
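The look-back and forecast periods determine how training samples are cut from a series. The following sketch shows one common way to do this with a sliding window; the function name, the stride of one step, and the pairing of a 24-step look-back with a 12-step horizon are assumptions for illustration.

```python
def make_windows(series, look_back, horizon):
    """Slice a series into (input, target) pairs for multi-step forecasting.

    Each input holds `look_back` consecutive values and each target holds
    the `horizon` values that immediately follow it.
    """
    samples = []
    for start in range(len(series) - look_back - horizon + 1):
        x = series[start:start + look_back]
        y = series[start + look_back:start + look_back + horizon]
        samples.append((x, y))
    return samples


# Toy series of 100 points standing in for a wind-speed record.
series = list(range(100))
pairs = make_windows(series, look_back=24, horizon=12)
```

A series of N points yields N - look_back - horizon + 1 samples, so longer horizons leave fewer training pairs from the same record.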

Datasets Description
To assess the performance of the proposed model, three wind-speed datasets were selected, each from a different geographical location and with a different resolution. The first dataset originates from the National Renewable Energy Laboratory Wind Technology Center and is publicly available. The tower is located at 39°54′38.4 [...] displays the fluctuation curves of the data, while Table 1 gives an overview of the statistical characteristics of each dataset. Outliers were removed using the commonly employed quartile method, and missing values were filled through cubic spline interpolation.

Experiment I
To assess the accuracy and stability of the developed model, we evaluated it against several benchmarks: DBO-VMD-TCN, DBO-VMD-SVR, DBO-VMD-DLinear, DBO-VMD-PatchTST, DBO-VMD-Informer, and DBO-VMD-Transformer. The errors of these models on the three datasets are compared in Tables 2-4, where the optimal outcomes are shown in bold. Furthermore, Figures 5-7 and Figure 8 display forecast curves and columnar stacked charts, respectively, highlighting the forecasting capabilities of the compared models over the three datasets. Tables 2-4 show that the DBO-VMD-TCN-Transformer model outperforms the other forecasting models in prediction accuracy. The new hybrid model, which builds upon the TCN and transformer framework, performs better across various error metrics. Notably, it shows significant improvements in MAE, MSE, RMSE, and $R^2$ over the other models, especially in multi-step forecasting. For instance, in the 48-step prediction, the TCN-Transformer exhibited the highest $R^2$, at 0.938, 0.907, and 0.922 for Datasets A, B, and C, respectively, whereas the $R^2$ values for the TCN model were 0.429, 0.558, and 0.729. Likewise, in the 48-step forecast, the TCN-Transformer recorded the lowest MAE values, 0.523, 0.754, and 0.665 for Datasets A, B, and C, respectively, whereas the MAE values for the SVR model were 0.973, 1.941, and 2.078. The multi-step forecast curves in Figures 5-7 show that the model developed in this research forecasts more accurately than the alternative models. By integrating the transformer model with the TCN, the approach achieves a closer fit to the observed curve. The columnar stacked charts in Figure 8 display the marked advantage of the combined forecasting model over the other models in overall performance. In Figure 8, the legends omit the common prefix 'DBO-VMD-' of the model names. This advantage is evident across four key metrics, MSE, MAE, RMSE, and $R^2$, each showing notable improvement. For example, in the 24-step prediction for Dataset B, the DBO-VMD-TCN model recorded MSE, MAE, and RMSE values of 4.815, 1.637, and 2.194, respectively. By contrast, the DBO-VMD-TCN-Transformer model recorded values of 0.723, 0.640, and 0.850, respectively. This corresponds to improvements of 85.0%, 60.9%, and 61.3% for these metrics. In the 48-step forecast for Dataset C, the DBO-VMD-TCN model's values were 3.129, 1.280, and 1.769, while our model achieved 0.895, 0.665, and 0.946, representing improvements of 71.4%, 48.0%, and 46.5%, respectively. Consequently, the hybrid approach introduced in this study achieves the most effective outcomes, combining the transformer's robust forecasting capabilities with the TCN's feature extraction. This combination uncovers underlying correlations within extensive time series data and markedly raises the hybrid model's forecasting precision.
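The percentage improvements quoted above are relative reductions in each error metric. The short check below reproduces the 24-step Dataset B figures from the values given in the text; the function name is hypothetical.

```python
def improvement(baseline, proposed):
    """Relative reduction of an error metric, in percent of the baseline."""
    return (baseline - proposed) / baseline * 100.0


# 24-step prediction, Dataset B (values quoted in the text).
tcn = {"MSE": 4.815, "MAE": 1.637, "RMSE": 2.194}
tcn_transformer = {"MSE": 0.723, "MAE": 0.640, "RMSE": 0.850}

gains = {
    name: round(improvement(tcn[name], tcn_transformer[name]), 1)
    for name in tcn
}
```

For $R^2$, where higher is better, the improvement would instead be measured as a relative increase.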

Experiment II
The predictive capabilities of the transformer, PatchTST, and informer were compared under the same data decomposition method combined with the TCN. Tables 5-7 provide a detailed comparison of the error metrics for these models across the three datasets, with the optimal values highlighted in bold. Furthermore, Figure 9 shows 3D histograms of the prediction abilities of the three models on the three datasets.
As shown in Tables 5-7, the transformer model outperforms the informer and PatchTST models in multi-step prediction across the datasets. For the MSE, MAE, and RMSE indicators, where lower values are preferable, the transformer model yields markedly lower values than the PatchTST and informer models. For instance, in the 48-step prediction for Dataset B, the MAE values recorded for the three models were 2.091, 0.861, and 0.754, with the transformer achieving the lowest value (0.754). These findings show a notably better performance for the transformer model and a somewhat inferior result for the informer model. The transformer model therefore provides a considerable improvement, which raises the overall accuracy of the model and gives a more precise reflection of the observed values. The 3D histograms in Figure 9 show that the transformer model yields superior outcomes compared to the PatchTST and informer models across Datasets A, B, and C. In the figure, the legends omit the common prefix 'DBO-VMD-TCN' of the model names. For instance, in the 24-step prediction over the three datasets, the transformer model improves on the PatchTST model by an average of 36.8% in $R^2$, with similar trends observed in Datasets A, B, and C. Moreover, in the 48-step prediction over the three datasets, the transformer model shows an average improvement of 15.1% across three metrics over the informer, with average improvements of 22.3%, 12.3%, and 10.6% in MSE, MAE, and RMSE, respectively.

Experiment III
To assess the effectiveness of the DBO-VMD method in decomposing wind-speed series data, the DBO-VMD model was compared with three alternatives: no VMD, non-optimized VMD, and VMD optimized by PSO. For the optimized VMD, the penalty factor was searched within the range [500, 3000], and K was restricted to integer values from 3 to 10. For the non-optimized VMD, K was empirically set to 7. Both the DBO and PSO algorithms were configured with two variables, ten individuals, and a maximum of thirty iterations. DBO was then used to optimize the VMD parameters, and Figure 10 shows that the optimal number of IMFs was determined to be 8. The time-domain waveforms of the modal components obtained through DBO-VMD decomposition are shown in the left portion of Figure 10. The right portion of Figure 10 shows that each mode occupies a distinct part of the frequency spectrum, which effectively prevents mode mixing. Table 8 shows that employing VMD yields substantial performance improvements across all four metrics compared to forecasting without VMD. For instance, in a 24-step forecast, the MSE, MAE, RMSE, and R² values for the VMD method were 0.789, 0.637, 0.888, and 0.932, respectively, whereas the values obtained without VMD were significantly less favorable. These findings demonstrate that VMD is an effective data decomposition method for enhancing predictive accuracy. Table 8 also shows that the DBO-VMD-TCN-Transformer model achieved outstanding evaluation metrics across the three multi-step forecasts. Compared with the PSO-VMD hybrid model, the DBO-VMD hybrid model improved MSE, MAE, RMSE, and R² by 34.8%, 21.2%, 19.3%, and 4.7%, respectively. These indicators highlight the superior performance of the combined model that uses DBO to optimize the VMD parameters over the model that uses PSO.
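The search-space constraints described above (a continuous penalty factor in [500, 3000], an integer K from 3 to 10, ten individuals, thirty iterations) can be illustrated as follows. This is only a sketch of how candidates are kept feasible: plain random search stands in for DBO and PSO, and `toy_fitness` is a hypothetical placeholder whose minimum is placed arbitrarily. A real run would decompose the series with VMD for each candidate and score the resulting modes.

```python
import random
from typing import Optional, Tuple

ALPHA_RANGE = (500.0, 3000.0)  # penalty factor bounds, from the text
K_RANGE = (3, 10)              # number of modes K, integers only, from the text
POP_SIZE, MAX_ITER = 10, 30    # ten individuals, thirty iterations, from the text


def clip_candidate(alpha: float, k: float) -> Tuple[float, int]:
    """Project a raw candidate onto the feasible region (K rounded to an integer)."""
    alpha = min(max(alpha, ALPHA_RANGE[0]), ALPHA_RANGE[1])
    k_int = min(max(int(round(k)), K_RANGE[0]), K_RANGE[1])
    return alpha, k_int


def toy_fitness(alpha: float, k: int) -> float:
    """Hypothetical stand-in objective (lower is better); NOT the VMD objective."""
    return (alpha - 1500.0) ** 2 / 1e6 + (k - 5) ** 2


def random_search(seed: int = 0) -> Tuple[float, int, float]:
    """Random search used here only as a stand-in for DBO or PSO."""
    rng = random.Random(seed)
    best: Optional[Tuple[float, int, float]] = None
    for _ in range(MAX_ITER):
        for _ in range(POP_SIZE):
            alpha, k = clip_candidate(
                rng.uniform(*ALPHA_RANGE),
                rng.uniform(K_RANGE[0] - 0.5, K_RANGE[1] + 0.5),
            )
            score = toy_fitness(alpha, k)
            if best is None or score < best[2]:
                best = (alpha, k, score)
    assert best is not None
    return best


best_alpha, best_k, best_score = random_search()
print(f"best alpha={best_alpha:.1f}, K={best_k}, fitness={best_score:.4f}")
```

Rounding and clipping K inside `clip_candidate` is one common way to handle an integer variable in a continuous optimizer; the section does not say how the authors handled it.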

Diebold-Mariano Test
The Diebold-Mariano (DM) test is comparable to a t-test: it compares the average losses produced by two predictive models to determine whether they are statistically indistinguishable. When the time series data show autocorrelation, the DM test adjusts its estimate of the standard deviation of the loss differential to take that autocorrelation into account. This capability makes the DM test especially suitable for evaluating time series forecasting models. The null hypothesis (H0) of the test is that there is no significant difference in the predictive accuracy of the two models being compared. As detailed in Table 9, differences in predictive accuracy are observed at the 5% significance level: H0 is rejected for every comparison model, indicating a distinct difference in predictive capability between the DBO-VMD-TCN-Transformer model and its counterparts. All p-values in Table 9 are below 0.05 and all DM values are negative, which supports the conclusion that the combined prediction model introduced in this research significantly outperforms the benchmark models in forecasting accuracy.
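The mechanics of the test can be sketched as follows. This minimal version makes several assumptions that the section does not confirm: squared-error loss, a long-run variance estimated from autocovariances up to lag h-1 (the usual choice for h-step forecasts), and a standard normal approximation for the p-value. With the proposed model passed as `pred_a`, a negative statistic means it has the lower average loss.

```python
import math
from typing import Sequence, Tuple


def dm_test(
    actual: Sequence[float],
    pred_a: Sequence[float],
    pred_b: Sequence[float],
    h: int = 1,
) -> Tuple[float, float]:
    """Return (DM statistic, two-sided p-value) under squared-error loss."""
    # Loss differential: negative values favour model A.
    d = [(y - a) ** 2 - (y - b) ** 2 for y, a, b in zip(actual, pred_a, pred_b)]
    n = len(d)
    mean_d = sum(d) / n

    def autocov(lag: int) -> float:
        return sum((d[t] - mean_d) * (d[t - lag] - mean_d) for t in range(lag, n)) / n

    # Autocorrelation adjustment: lag-0 variance plus autocovariances up to lag h-1.
    long_run_var = autocov(0) + 2.0 * sum(autocov(lag) for lag in range(1, h))
    if long_run_var <= 0:
        raise ValueError("non-positive long-run variance; try a different lag window")

    stat = mean_d / math.sqrt(long_run_var / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))  # two-sided, standard normal
    return stat, p_value


# Synthetic illustration only: model A has small errors, model B has large ones.
y = [float(i % 7) for i in range(200)]
model_a = [v + 0.1 * math.sin(i) for i, v in enumerate(y)]
model_b = [v + 1.0 * math.cos(1.7 * i) for i, v in enumerate(y)]
stat, p = dm_test(y, model_a, model_b)
print(f"DM = {stat:.2f}, p = {p:.3g}")
```

Rejecting H0 at the 5% level corresponds to `p < 0.05`, and the sign of the statistic indicates which model is the more accurate one.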

Discussion
The preceding experimental results indicate that, compared to other models, the proposed model exhibits significant advantages across three datasets and various step lengths. This is attributed to its hybrid structure, which is capable of handling high-resolution wind-speed fluctuation information. The superiority of the DBO-VMD-TCN-Transformer model can be summarized as follows:
VMD Preprocessing: As demonstrated by the experiments in Section 3.5, there is a noticeable difference in the forecasting results with and without VMD preprocessing. Optimization through DBO further enhances the effectiveness of VMD preprocessing. VMD preprocessing mitigates the non-stationarity of the original wind speeds; therefore, all experimental comparisons in this paper are based on VMD-preprocessed data.
TCN Module: As observed in Section 3.3, standalone TCN predictions perform the worst. However, hybrid models that combine TCNs with transformer-like structures outperform those without TCN integration. The TCN module excels in extracting temporal features from high-resolution wind speeds, thereby enhancing the performance of the hybrid forecasting models.
Transformer Module: As demonstrated in Section 3.4, hybrid models equipped with transformer modules yield better forecasting results than non-transformer hybrid models. Furthermore, transformer-based hybrid models surpass those integrating informer and PatchTST models. The transformer module effectively captures complex dependencies between input data and forecast outputs, achieving optimal predictive performance.
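To make the temporal feature extraction of the TCN module concrete, the sketch below shows the operation at its core, a causal dilated convolution, for a single channel. It is illustrative only: the kernel sizes, dilation rates, and residual structure of the TCN used in this paper are not given in this section, and the function name is hypothetical.

```python
from typing import List, Sequence


def causal_dilated_conv(x: Sequence[float], weights: Sequence[float], dilation: int = 1) -> List[float]:
    """Compute y[t] = sum_i weights[i] * x[t - i * dilation].

    Samples before the start of the series are treated as zero, so each
    output depends only on the current and past inputs (causality).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - i * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


series = [1.0, 2.0, 3.0, 4.0]
print(causal_dilated_conv(series, weights=[1.0, 1.0], dilation=2))  # [1.0, 2.0, 4.0, 6.0]
```

Stacking such layers with increasing dilation widens the receptive field quickly, which is how a TCN can summarize long stretches of high-resolution wind-speed history before the sequence is passed on.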

Conclusions
Addressing the need for enhanced accuracy in wind-speed forecasting and the scarcity of research on short-term wind-speed prediction utilizing the transformer architecture, this study introduces a hybrid wind-speed prediction model that integrates the transformer model, VMD, and TCNs. This model aims to leverage the strengths of each component to enhance accuracy and efficiency in predicting wind speeds across various time horizons. By integrating the transformer's ability to handle complex dependencies with the accuracy of VMD for wind-speed decomposition and the efficiency of TCNs for temporal analysis, the proposed model seeks to fill the gaps in current short-term wind-speed forecasting methodologies and extend the application of transformer-based models to a wider range of forecasting scenarios. The efficacy of the introduced model was validated and assessed using three real-world datasets. Experiments conducted with these datasets revealed the following: (1) Compared to six benchmark models, the proposed model exhibits superior performance, showing an average improvement of 54.2% in MSE, MAE, and RMSE performance, and a 52.1% increase in R² performance. (2) The transformer model demonstrates enhanced capabilities in short-term forecasting compared to the PatchTST and informer models. On average, its performance in the MSE, MAE, and RMSE metrics improved by 40.2%, while the R² score increased by 20.8%. (3)

Figure 1. Flowchart of the developed model.
34″ N and 105°14′5.28″W, with its base at an elevation of 1855 m above mean sea level.The data measurement height is 80 m, with a value resolution of 1 min.In this paper, 44,640 records from December 2020 are utilized, denoted as Dataset A. The second dataset originates from a wind farm in Wuwei City, Gansu Province, featuring a data measurement height of 70 m and a resolution of 10 min.This study utilizes 26,214 records from April to September 2019, referred to as Dataset B. The third dataset is sourced from a wind farm in Jiuquan City, Gansu Province, maintaining the same measurement height of 70 m but with a resolution of 15 min.It includes 58,368 records from January to December 2018, denoted as Dataset C. The forecast intervals are defined as 12, 24, and 48 steps, corresponding to actual prediction durations of 12, 24, and 48 minutes for Dataset A; 2, 4, and 8 hours for Dataset B; and 3, 6, and 12 hours for Dataset C, respectively.Each dataset was divided into three segments: 70% allocated for training, 10% for validation, and the remaining 20% for testing purposes.
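The step-to-duration mapping and the 70/10/20 partition described above can be checked with a few lines of arithmetic. The sketch assumes a chronological split (earliest records for training, latest for testing), which is usual for time series forecasting but is not stated explicitly here; the function names are illustrative.

```python
from typing import Tuple


def chronological_split(n_records: int, train_pct: int = 70, val_pct: int = 10) -> Tuple[int, int, int]:
    """Sizes of the train/validation/test segments; the test set takes the remainder."""
    n_train = n_records * train_pct // 100
    n_val = n_records * val_pct // 100
    return n_train, n_val, n_records - n_train - n_val


def horizon_minutes(steps: int, resolution_min: int) -> int:
    """Forecast horizon in minutes for a given number of steps."""
    return steps * resolution_min


# (number of records, sampling resolution in minutes), as stated in the text
datasets = {"A": (44_640, 1), "B": (26_214, 10), "C": (58_368, 15)}
for name, (n_records, resolution) in datasets.items():
    sizes = chronological_split(n_records)
    horizons = [horizon_minutes(s, resolution) for s in (12, 24, 48)]
    print(f"Dataset {name}: train/val/test = {sizes}, horizons (min) = {horizons}")
```

For Dataset B, for example, 48 steps at a 10-minute resolution give 480 minutes, which matches the 8-hour horizon quoted above.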

Figure 5. The forecasting results of Dataset A.

Figure 6. The forecasting results of Dataset B.

Figure 7. The forecasting results of Dataset C.

Figure 8. Columnar stacked chart of various models.

Figure 11 presents radar charts that compare the performance indicators of models with and without VMD decomposition, alongside the application of different optimization algorithms. In the legend, 'Transformer' is abbreviated as 'Tr'. For the key indicators MSE, MAE, and RMSE, lower scores are preferable, and the chart shows that the DBO-VMD-TCN-Transformer model attains the minimum values across these measures. For the 1-R² metric, where values nearer to 0 denote greater precision, the DBO-VMD-TCN-Transformer model is likewise the closest to this optimum. This finding highlights the efficacy of the DBO-VMD approach in enhancing the accuracy and fit of wind-speed predictions.
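The four indicators compared in the radar charts can be computed as in the following sketch, which uses the standard textbook definitions of MSE, MAE, RMSE, and R². These are assumed to match the definitions used in this paper; note that RMSE is simply the square root of MSE, and that 1-R² equals the ratio of the residual sum of squares to the total sum of squares.

```python
import math
from typing import Dict, Sequence


def regression_metrics(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Standard error metrics for a point forecast (lower is better, except R2)."""
    n = len(actual)
    errors = [y - p for y, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)
    mean_y = sum(actual) / n
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    return {
        "MSE": mse,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(mse),
        "R2": r2,
        "1-R2": 1.0 - r2,  # plotted so that values nearer 0 mean a better fit
    }


# Synthetic illustration only; these are not values from the paper.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(metrics)
```

Plotting 1-R² instead of R² puts all four indicators on a common "smaller is better" footing, which is what makes a single radar chart readable.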

Table 1. The statistical information.

Table 2. The performance for Dataset A.
Note: Values in bold indicate the best value.

Table 3. The performance for Dataset B.
Note: Values in bold indicate the best value.

Table 4. The performance for Dataset C.

Table 5. The performance of three models for Dataset A.

Table 6. The performance of three models for Dataset B.
Note: Values in bold indicate the best value.

Table 7. The performance of three models for Dataset C.

Table 8. The performance of different optimization algorithms.
Note: Values in bold indicate the best value.

Table 9. The DM test results.