A Comparative Study of Day-Ahead Wind Power Forecasting Models for a Single Wind Farm Under Strict Chronological Splitting and Unified Hyperparameter Tuning Conditions

Liu, Jiacheng; Lu, Yihang; Zou, Guoping

doi:10.3390/en19122784

by Jiacheng Liu ¹, Yihang Lu ²,* and Guoping Zou

¹ School of Mechanical and Electrical Engineering, China Jiliang University, Hangzhou 310018, China

² The College of Electrical Engineering, Zhejiang University of Water Resources and Electric Power, Hangzhou 310018, China

* Author to whom correspondence should be addressed.

Energies 2026, 19(12), 2784; https://doi.org/10.3390/en19122784 (registering DOI)

Submission received: 29 April 2026 / Revised: 25 May 2026 / Accepted: 8 June 2026 / Published: 10 June 2026

(This article belongs to the Section F5: Artificial Intelligence and Smart Energy)


Abstract

Short-term wind power forecasting is a key enabling technology for wind farm operation optimization, power grid dispatch, and electricity market decision-making. However, existing studies often lack unified standards in data partitioning, input feature construction, and hyperparameter tuning, making fair and reproducible comparisons across models difficult to achieve. To address this issue, this study focuses on day-ahead wind power forecasting for a single wind farm and establishes a benchmarking framework with strict chronological splitting, a shared feature information set, and a consistent hyperparameter tuning budget. Within this framework, six representative models, namely Ridge, XGBoost, LightGBM, DLinear, Transformer, and PatchTST, are systematically evaluated. A two-level evaluation protocol combining a fixed hold-out split and expanding-window rolling validation is adopted to compare model performance from multiple perspectives, including overall accuracy, sensitivity to hyperparameter tuning, robustness across rolling windows, and performance under typical operating scenarios. The results show that model rankings are not fully consistent between the hold-out evaluation and the rolling-validation setting. Under the fixed hold-out split, LightGBM achieved the lowest NRMSE of 10.2326%, while Transformer obtained the lowest NMAE of 6.9944%. In contrast, under the 8-fold expanding-window rolling validation, Transformer achieved the lowest average NRMSE of 8.1684%, followed by LightGBM with 8.7344%. These results indicate that the best performance on a single test split does not necessarily imply the strongest robustness across multiple time windows. In addition, strong tree-based models remain highly competitive in this single-wind-farm forecasting task, whereas more complex deep temporal models do not always deliver stable advantages. 
Meanwhile, the performance gains brought by hyperparameter optimization vary substantially across models, suggesting that conclusions drawn from default-parameter comparisons are of limited reliability. These findings demonstrate that systematic benchmarking under strict temporal constraints and fair tuning conditions is essential for clarifying the comparative performance, robustness, and engineering applicability of different model families. The study can therefore provide practical guidance for model selection and deployment in short-term wind power forecasting for single wind farms.

Keywords: day-ahead wind power forecasting; benchmarking; fair model comparison; rolling validation; hyperparameter optimization; single wind farm

1. Introduction

Against the backdrop of intensifying global climate change, the ongoing transition toward low-carbon energy systems, and the rapid growth of renewable energy installations, wind power has become one of the most representative forms of renewable energy in modern power systems. Compared with traditional fossil fuels, wind energy has the advantages of abundant resources, clean and low-carbon characteristics, and increasingly mature development technologies. Therefore, it has been widely developed and utilized worldwide, and its penetration in power systems has continued to increase. Guo et al. [1] reviewed short-term wind power forecasting methods based on machine learning and highlighted the growing role of data-driven forecasting in wind power applications. Liu et al. [2] summarized wind power prediction models, applications, and challenges from a machine learning perspective. Tawn and Browell [3] reviewed very short-term wind and solar power forecasting, emphasizing the importance of forecasting for variable renewable generation. As grid-connected wind power capacity continues to expand, wind power is no longer merely a supplementary power source in conventional grids, but has gradually evolved into an important variable affecting system dispatch, security assessment, and electricity market transactions. Mei et al. [4] provided an overview of short-term wind power forecasting methods based on multi-scale decomposition and multi-model deep learning fusion, indicating that accurate short-term forecasting is an important technical support for managing the operational impacts of high wind power penetration. Therefore, improving the security, economic efficiency, and operational flexibility of power systems with high wind power penetration has become a key research issue in renewable energy accommodation and power system optimization.

However, wind power is inherently driven by complex meteorological processes and is characterized by significant randomness, volatility, and time-varying behavior, which leads to considerable uncertainty in practical operation. Forecasting errors affect not only the operational optimization of wind farms themselves but also propagate to reserve allocation, market bidding, system balancing costs, and dispatch security margins. Tawn and Browell [3] further pointed out that, in practical operating scenarios, model usability, temporal stability, and the ability to characterize uncertainty are also important dimensions for evaluating forecasting methods. In their review of deep learning-based forecasting methods, Wang et al. [5] summarized the key challenges of wind power forecasting under complex operating conditions into three categories: data uncertainty, incomplete feature representation, and complex nonlinear relationships. Yang et al. [6] further emphasized that current wind power forecasting research should not only focus on predictive accuracy, but also pay greater attention to data quality, reproducibility, interpretability, and the standardization of the evaluation process itself. These observations indicate that developing short-term wind power forecasting models that are accurate, robust, and practically applicable remains a key enabling technology for wind farm operation optimization, power grid dispatch, and market decision-making.

To address this challenge, short-term wind power forecasting methods have broadly evolved from traditional statistical models to machine learning models, and further to deep learning and hybrid enhanced frameworks. Review studies generally indicate that early wind power forecasting research relied mainly on historical wind speed and power series, together with a limited set of meteorological variables, and produced forecasts through time-series analysis, regression modeling, and empirical statistical methods. These methods have clear structures and are relatively easy to implement, and they remain applicable to scenarios with limited sample sizes or relatively stable operating conditions. Tuncar et al. [7] reviewed short-term wind power generation forecasting methods in light of recent technological trends and summarized the role of conventional forecasting methods in this field. Guo et al. [1] reviewed machine learning-based short-term wind power forecasting methods and discussed the transition from conventional forecasting approaches to data-driven models. These reviews indicate that traditional statistical methods played an important role in early research, but that their ability to represent strong nonlinearity, multi-scale fluctuations, and complex meteorological coupling effects is often limited. Dou et al. [8] also noted that statistical methods and machine learning methods each have their own applicability boundaries: the former offer model transparency and low complexity, but their ability to capture complex patterns is usually limited. Overall, the foundational role of traditional statistical methods is clear, as are their limits in representing complex nonlinear relationships and exploiting multi-source information, which provided a practical motivation for the subsequent development of machine learning and deep learning methods.

With the development of data-driven methods, machine learning has gradually become an important technical route for short-term wind power forecasting. Compared with traditional statistical models, machine learning methods no longer rely on explicit linear assumptions, but instead place greater emphasis on learning the nonlinear mapping between inputs and outputs from historical power data, meteorological variables, and temporal features. In this process, tree-based models have shown particularly persistent and stable competitiveness. Alkesaiberi et al. [9] demonstrated in a comparative study that appropriate data-driven models can significantly improve wind power forecasting accuracy. Carrera and Kim [10] conducted a comparative analysis of machine learning techniques for wind power generation prediction using data from Guatemala, showing that model performance should be interpreted in relation to the specific dataset and forecasting task. Abdelsattar et al. [11] evaluated machine learning and deep learning models for predicting wind turbine power output from environmental factors, further indicating that the relative performance of different model families depends on the input variables and practical forecasting conditions. These studies suggest that model superiority in wind power forecasting is task-dependent rather than universal. Xiong et al. [12] showed that hyperparameter optimization can substantially unlock the predictive potential of XGBoost. Liao et al. [13] reported that, after incorporating reanalysis meteorological data, LightGBM (LGBM) achieves both high accuracy and low computational cost in short-term wind power forecasting. Zheng and Wu [14] further improved the performance of XGBoost in short-term forecasting by introducing weather similarity analysis and feature engineering. 
From the perspective of different forecasting horizons, Ekinci and Ozturk [15] systematically compared the performance of multiple machine learning algorithms for wind farm generation forecasting, and pointed out that model performance depends not only on the algorithm itself, but also closely on the forecasting horizon and data scaling strategy. Kopyt et al. [16] compared multiple GBDT models under the same wind farm and unified temporal resolution, showing that comparing tree-based models under a consistent evaluation protocol is itself a research issue that deserves careful attention. Overall, under tabular-input settings such as SCADA and NWP data, methods such as XGBoost and LGBM remain strong baselines that cannot be overlooked in wind power forecasting tasks.

With the continued development of machine learning, deep learning methods have further propelled wind power forecasting research toward more complex temporal modeling. Wang et al. [5] pointed out that the core advantage of deep learning in wind power forecasting lies in its ability to automatically learn multivariate coupling relationships, temporal dependency structures, and high-dimensional feature representations. Existing studies have developed a diverse technical spectrum ranging from recurrent networks and convolutional networks to attention mechanisms and Transformer-based architectures. For example, Zhen et al. [17] developed and compared a hybrid deep learning model for wind power forecasting by considering temporal-spatial feature extraction. Wu et al. [18] investigated short-term wind power prediction by fusing multiple spatial and temporal correlation features. These studies indicate that deep learning models have been increasingly used to capture complex temporal and spatial dependencies in wind power forecasting. Bispo Junior et al. [19] further compared the performance of Transformer, Informer, Flowformer, and Flashformer in short-term wind power forecasting, and pointed out that temporal encoding schemes and the choice of attention mechanisms can significantly affect a model’s ability to characterize wind power fluctuations and temporal dependencies. The Informer model proposed by Zhou et al. [20] reduced the computational complexity of long-sequence modeling. Xu et al. [21] proposed a heterogeneous seq2seq PatchTST-GRU framework that further combines NWP refinement to mitigate error accumulation and improve the utilization of future meteorological information in multi-step wind power forecasting. Meanwhile, reflections on the effectiveness of deep architectures have also gradually increased. Takara et al. [22] comprehensively compared several advanced deep learning models in multi-step wind power forecasting and pointed out that, even within the deep learning framework itself, the relative merits of different architectures still vary significantly with the forecasting target, training strategy, and uncertainty modeling approach. Manzano et al. [23] similarly noted in a systematic review that deep neural networks have evolved into diverse methodological routes, yet the relative strengths of different models remain highly dependent on task settings, data scale, and experimental procedures. From the perspective of interpretability, Niu et al. [24] further argued that deep models should not merely pursue deeper or more complex architectures, but should also pay attention to how key temporal and variable information is extracted. These studies indicate that, although deep learning has substantially expanded the modeling capacity of wind power forecasting, its advantages still need to be validated under unified conditions rather than being concluded solely from results obtained in single experimental settings.

Beyond model architecture itself, input-side enhancement, ensemble optimization, and physics-guided modeling have gradually become important directions for improving wind power forecasting performance. Hu et al. [25] pointed out that errors in numerical weather prediction can directly affect subsequent wind power forecasting performance, and therefore correcting NWP bias at the input stage is of substantial value. Liu et al. [26] further proposed that forecasting accuracy depends not only on the predictor itself, but also closely on the reliability of meteorological inputs and the ability to capture fluctuation characteristics. Wang et al. [27] also showed that correcting NWP wind speed before constructing a hybrid forecasting model can further improve short-term forecasting performance. At the same time, studies on ensemble learning and multi-source data fusion have continued to increase. By comparing machine learning, deep learning, and ensemble strategies, Rajaperumal et al. [28] showed that appropriate model fusion can further reduce forecasting errors in some scenarios. Dmitrijevs et al. [29] demonstrated from the perspective of multiple data sources that different input sources can have a significant influence on short-term forecasting accuracy. Wang et al. [30] and Tian et al. [31] further introduced ideas such as mixed-frequency modeling, interpretable base-model selection, and transfer learning into the design of wind power forecasting systems, showing that current research is moving from single predictors toward more complex system-level frameworks. On the other hand, the integration of physical information has also become an important pathway for improving purely data-driven models. Karniadakis et al. [32] pointed out that physics-informed machine learning can organically combine data with mathematical models, thereby alleviating the excessive dependence of purely data-driven methods on large samples. Daw et al. [33] further emphasized that prior physical knowledge can help improve model generalization. In the wind power domain, Duan et al. [34] explored physics-guided forecasting by incorporating physical priors into data-driven prediction. Gao et al. [35] investigated probabilistic distribution constraints for wind power forecasting, providing a distribution-level perspective for model regularization. Liu et al. [36] studied forecasting under extreme operating conditions with reinforcement-learning-related strategies. Zheng et al. [37] developed a physics-integrated prediction framework, further demonstrating the value of combining mechanism information with data-driven modeling. These studies show that physics-guided forecasting has become an important route for improving the plausibility and robustness of purely data-driven models.

A comprehensive review of the existing literature shows that the field of short-term wind power forecasting has developed a relatively rich methodological landscape, covering multiple technical routes such as traditional statistical approaches, tree-based models, deep temporal models, hybrid ensemble frameworks, and physics-guided methods. However, with regard to the question of the true performance boundaries of different models under a unified experimental protocol, existing studies still lack sufficiently systematic answers. Xu et al. [38] explicitly pointed out in their cross-dataset benchmarking study that a large proportion of neural-network-based wind power forecasting research is still conducted independently on limited datasets, and that a unified comparison framework capable of reflecting model robustness and generalization ability is urgently needed. Yang et al. [6] also emphasized that evaluation practices, reproducibility, and experimental standardization in wind power forecasting should themselves be treated as important concerns, rather than focusing solely on the numerical performance achieved on a single test setting. This issue is particularly prominent in day-ahead forecasting tasks for a single wind farm. On the one hand, there are substantial differences in data partitioning strategies across studies, with random splitting, fixed chronological splitting, and rolling validation being used in parallel, which can easily cause model rankings to be influenced by the evaluation protocol itself. On the other hand, input feature construction, preprocessing strategies, and the extent of hyperparameter tuning are often not standardized, making it difficult to compare so-called optimal models fairly under the same criteria. 
Therefore, rather than continuing to propose increasingly complex new models, conducting systematic benchmarking under strict temporal constraints and a unified hyperparameter tuning budget also carries clear research value and practical engineering significance.

Based on the above discussion, this study focuses on the comparative evaluation of forecasting models under strict chronological splitting and fair hyperparameter tuning conditions. To this end, a unified benchmarking framework is established to systematically evaluate six representative models, namely Ridge, XGBoost, LGBM, DLinear, Transformer, and PatchTST. On the basis of a shared information set, unified evaluation metrics, and a unified hyperparameter tuning budget, this study combines two evaluation protocols, namely a fixed hold-out split and expanding-window rolling validation, and conducts the analysis from four perspectives: overall accuracy, sensitivity to hyperparameter tuning, robustness across rolling windows, and performance under typical operating scenarios. Specifically, this study aims to answer the following questions: under the conditions of a single wind farm, limited sample size, and a shared information set, which model family achieves the best overall performance? Is the best result on a single fixed test set necessarily aligned with the strongest robustness across multiple time windows? Do more complex temporal architectures necessarily outperform lighter tree-based models or linear baselines in their respective modeling settings? Through this investigation, the present work seeks to provide more comparable and reproducible empirical evidence for model selection, result interpretation, and engineering deployment in short-term wind power forecasting for single wind farms.

Based on the above research background and practical motivation, the main contributions of this study can be summarized as follows:

(1): A unified comparison framework for day-ahead wind power forecasting in a single wind farm is established. Under strict chronological splitting, a shared feature information set, and a unified hyperparameter tuning budget, linear models, tree-based models, and deep temporal models are systematically evaluated under consistent evaluation criteria;
(2): A two-level evaluation protocol combining a fixed hold-out split and 8-fold expanding-window rolling validation is adopted for standard result reporting and robustness assessment, respectively;
(3): By comparing default and tuned settings, the sensitivity of different models to hyperparameter optimization is quantitatively analyzed, thereby improving the reliability of the model comparison conclusions;
(4): Through scenario-based analysis involving power intervals, ramping conditions, and intraday periods, the applicability boundaries of different models and their engineering implications are further revealed.

2. Data and Methods

2.1. Research Framework and Experimental Protocol

To systematically compare the performance of different model families in day-ahead wind power forecasting for a single wind farm under strict chronological constraints and a unified experimental protocol, this study establishes a unified benchmarking framework covering data preparation, model training, hyperparameter optimization, and result evaluation. Before model training, the raw SCADA and NWP data were preprocessed through time alignment, data cleaning, and sample construction. The overall research framework and technical route are illustrated in Figure 1.

First, the raw wind power data and meteorological data were subjected to time alignment and data cleaning to construct a shared feature set, and day-ahead forecasting samples were generated on a daily basis. Subsequently, six representative models were trained and evaluated, including one conventional linear baseline, two strong tree-based models, and three deep temporal models. To ensure a fair comparison, all models were trained under a unified experimental environment and evaluated using a consistent metric system. Although all models share the same feature source and information set, their input organization strategies are not completely identical: the tabular models perform point-wise regression, whereas the deep temporal models take sequence inputs and produce sequence outputs.

Before introducing the experimental protocol, the main data-splitting concepts used in this study are defined as follows. Chronological splitting refers to partitioning the dataset strictly according to temporal order, where earlier samples are used for model training and later samples are used for validation and testing. This strategy differs from random splitting and is adopted to avoid future information leakage in time-series forecasting. A fixed hold-out split denotes a single static chronological division of the dataset into training, validation, and test subsets, which is used for hyperparameter optimization and final test-set evaluation. Expanding-window rolling validation refers to a multi-fold temporal validation strategy in which the training window is progressively expanded, while the validation and test windows move forward along the time axis. This protocol is used to evaluate whether the model can maintain stable performance across different temporal windows.

Based on these definitions, this study adopts an evaluation scheme that combines a fixed hold-out split with expanding-window rolling validation. First, in the fixed hold-out setting, the dataset was divided chronologically into training, validation, and test sets at a ratio of 70%/10%/20%, which were used for model training, hyperparameter selection, and standard test result reporting, respectively. To ensure a fair comparison, all models were optimized using Optuna under a unified tuning budget, with the validation metric serving as the criterion for hyperparameter selection. Therefore, the results under the hold-out setting reflect the predictive performance of different model families under a shared information set and consistent tuning conditions, rather than a strict like-for-like comparison under exactly the same input formulation.
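The 70%/10%/20% chronological division described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `chronological_split` and the sample count of 350 days are hypothetical, and the sketch assumes the daily samples are already sorted by date.

```python
def chronological_split(n_samples, train_frac=0.70, val_frac=0.10):
    """Split sample indices 0..n_samples-1 into train/val/test in temporal order.

    Assumes index order equals chronological order, so no shuffling is done
    and every validation/test index is later than every training index.
    """
    if not (0 < train_frac < 1 and 0 <= val_frac < 1 and train_frac + val_frac < 1):
        raise ValueError("fractions must leave a non-empty test set")
    # round() avoids off-by-one errors from floating-point products such as 350 * 0.7
    train_end = round(n_samples * train_frac)
    val_end = round(n_samples * (train_frac + val_frac))
    return (
        range(0, train_end),          # earliest 70%: model fitting
        range(train_end, val_end),    # next 10%: hyperparameter selection
        range(val_end, n_samples),    # latest 20%: final test reporting
    )


# Hypothetical example: 350 complete daily samples
train_idx, val_idx, test_idx = chronological_split(350)
```

Because the split is defined on positions rather than on random draws, the same boundaries apply to every model, which is what makes the hold-out results comparable across model families.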

Furthermore, to reduce the influence of chance effects associated with a single test split, this study employs an 8-fold expanding-window rolling validation to further assess model performance. Specifically, the rolling validation starts with an initial training window of 120 days, which is expanded progressively with each fold. Each fold consists of a 14-day validation window and a 14-day test window, with the gap set to 0. The lengths of the validation and test windows were kept fixed across all folds to ensure that the evaluation results were calculated from comparable sample sizes and identical forecasting horizons. If the validation or test window length were also increased, the fold-level error statistics would be affected by different numbers of evaluation samples, making the robustness comparison less consistent. In contrast, only the training window was progressively expanded to simulate the practical forecasting situation in which more historical data become available over time. Therefore, this expanding-window design evaluates whether each model can maintain stable performance as the available training history increases and the test period moves forward chronologically. The optimal hyperparameter configuration obtained from the fixed-split experiment was directly used as a frozen setting in the rolling validation, in order to evaluate the generalization ability and stability of each model across different time windows. The time spans of all folds are illustrated in Figure 2.
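The fold layout described above can be generated programmatically. The sketch below is illustrative only: the text fixes the initial training window (120 days), the validation and test lengths (14 days each), the gap (0), and the number of folds (8), but it does not state how far the windows advance between folds. The `step` parameter is therefore an assumption, set here to 28 days so that evaluation windows of successive folds do not overlap. The names `Fold` and `expanding_window_folds` and the total of 366 days are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fold:
    train: range  # day indices used for fitting (always starts at day 0)
    val: range    # 14-day validation window
    test: range   # 14-day test window


def expanding_window_folds(n_days, n_folds=8, initial_train=120,
                           val_len=14, test_len=14, gap=0, step=None):
    """Build expanding-window folds over day indices 0..n_days-1."""
    # ASSUMPTION: the stride between folds is not given in the text;
    # default to val_len + test_len so evaluation windows never overlap.
    step = val_len + test_len if step is None else step
    folds = []
    for k in range(n_folds):
        train_end = initial_train + k * step   # training window grows each fold
        val_start = train_end + gap            # gap = 0 means val follows train directly
        val_end = val_start + val_len
        test_end = val_end + test_len
        if test_end > n_days:
            raise ValueError(f"fold {k} needs {test_end} days, only {n_days} available")
        folds.append(Fold(range(0, train_end),
                          range(val_start, val_end),
                          range(val_end, test_end)))
    return folds


# Hypothetical example: one leap year of daily samples
folds = expanding_window_folds(366)
```

Only `train` changes length from fold to fold; `val` and `test` keep 14 days each, so fold-level error statistics are computed from the same number of evaluation samples.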

On this basis, additional analyses were conducted, including comparisons between default and tuned settings, as well as scenario-based error analysis, so as to more comprehensively identify the applicability boundaries of different models in day-ahead wind power forecasting for a single wind farm. To avoid data leakage, in both the fixed-split experiment and each rolling-validation fold, the scaler was fitted using only the training set and then applied to the corresponding validation and test sets. All experiments were implemented in Python 3.11. The main computational libraries included NumPy, pandas, scikit-learn, XGBoost, LightGBM, PyTorch, and Optuna. Model training and hyperparameter optimization were conducted using the above Python-based environment, and the experimental figures were generated from the corresponding prediction results.
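The leakage rule stated above (fit the scaler on the training set only, then apply it to validation and test data) can be illustrated with a small standard-library sketch. The text does not specify which scaler was used, so z-score standardization is an assumption here, and the helper names `fit_standardizer` and `apply_standardizer` are hypothetical.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_column):
    """Estimate scaling parameters from training data only."""
    mean = fmean(train_column)
    std = pstdev(train_column)  # population standard deviation
    # Guard against constant columns to avoid division by zero
    return mean, (std if std > 0 else 1.0)


def apply_standardizer(column, params):
    """Apply previously fitted parameters; never re-estimates them."""
    mean, std = params
    return [(x - mean) / std for x in column]


# Toy values standing in for one feature column
train_col = [2.0, 4.0, 6.0]
test_col = [10.0]

params = fit_standardizer(train_col)               # statistics come from train only
train_scaled = apply_standardizer(train_col, params)
test_scaled = apply_standardizer(test_col, params)  # reuses the training statistics
```

In the rolling setting, the same two calls would be repeated inside each fold, so that the parameters are re-estimated from that fold's training window and nothing from its validation or test window enters the statistics.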

It should be emphasized that the focus of this study is to conduct a systematic comparison of different model families under a shared information set, a unified hyperparameter tuning budget, and strict chronological constraints, so as to obtain benchmarking conclusions that are more comparable and reproducible. Accordingly, the results should be interpreted as a comparison of different model families under their respective native modeling paradigms, rather than as a pure architecture-level comparison under a completely identical input organization scheme.

2.2. Data Sources and Task Definition

The data used in this study were obtained from the Sotavento wind farm in Galicia, Spain (43.354377° N, 7.881213° W). The wind farm consists of 24 wind turbines, with an installed capacity of 17,560 kW. Historical operational data of the wind farm are available from its official website, including the aggregated wind speed, wind direction, and total power output of the 24 turbines in 2016 at a 10 min resolution. Therefore, each variable contains 144 sampling points per day. This wind farm provides the real-world study object and data source for the day-ahead wind power forecasting task considered in this study.

Meanwhile, NWP data provided by the MeteoGalicia numerical weather prediction system were introduced as exogenous meteorological inputs in this study. The original variables include wind speed, wind direction, temperature, humidity, and pressure, with a temporal resolution of 1 h. Based on the above multi-source data, a day-ahead wind power forecasting task for a single wind farm was constructed. Specifically, each day was treated as the basic sample unit: after unified time alignment and feature construction, the available exogenous meteorological features and temporal encodings within the 24 h of the target forecasting day were used to predict the aggregated wind power sequence of the wind farm at 144 time steps of 10 min each. It should be noted that the unified benchmarking framework in this study does not incorporate historical observed power or lagged SCADA variables as autoregressive inputs. Therefore, the historical observation window length was set to H = 0. Under this setting, all models share the same information set. Consequently, performance differences arise from the model architecture itself and from the way in which this shared information is organized for different model families, rather than from differences in the amount of available information.

At the model-input level, the original datasets have different sampling frequencies, with SCADA data recorded at 10 min resolution and NWP data available at 1 h resolution. To ensure temporal consistency, the NWP meteorological variables were resampled to the 10 min SCADA time grid before feature construction. Specifically, for each pair of adjacent hourly NWP records, the corresponding hourly interval was divided into six 10 min timestamps. For example, the interval between 10:00 and 11:00 was represented by 10:00, 10:10, 10:20, 10:30, 10:40, and 10:50. For continuous meteorological variables, such as wind speed, temperature, pressure, and air density, linear interpolation was applied between the two adjacent hourly NWP values to obtain the intermediate 10 min values:

$$x(t_h + k\,\Delta t) = \left(1 - \frac{k}{6}\right) x(t_h) + \frac{k}{6}\, x(t_{h+1}), \quad k = 0, 1, \dots, 5, \quad \Delta t = 10\ \text{min}.$$

In this way, each hourly NWP interval was converted into six 10 min meteorological feature records. For missing anchor points at the beginning of the sequence caused by timestamp misalignment, backward filling was applied to ensure temporal completeness. For wind direction, direct linear interpolation was not used because wind direction is a circular variable and may have discontinuity around 0° and 360°. Therefore, wind direction was first transformed into sine and cosine components, and these two components were then interpolated separately on the 10 min time grid. The interpolated sine and cosine components were used as direction-related features in the model input. This resampling process only changes the temporal resolution of the available NWP forecast series and does not introduce future observed SCADA information.
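The resampling procedure above can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the authors' code: the function names are hypothetical, inputs are plain lists of hourly values, and wind direction is assumed to be given in degrees.

```python
import math


def interpolate_hourly_to_10min(hourly):
    """Linear interpolation between adjacent hourly values.

    For each pair (x(t_h), x(t_{h+1})) emits six values at k = 0..5 using
    weights (1 - k/6) and k/6, so n hourly values give 6 * (n - 1) records.
    """
    out = []
    for x0, x1 in zip(hourly, hourly[1:]):
        for k in range(6):
            w = k / 6
            out.append((1 - w) * x0 + w * x1)
    return out


def interpolate_direction(hourly_deg):
    """Interpolate wind direction via its sine and cosine components.

    Interpolating the components separately avoids the artificial jump
    that direct interpolation produces across the 0/360 degree boundary.
    """
    radians = [math.radians(d) for d in hourly_deg]
    sin_part = interpolate_hourly_to_10min([math.sin(r) for r in radians])
    cos_part = interpolate_hourly_to_10min([math.cos(r) for r in radians])
    return sin_part, cos_part


# Continuous variable: 0 -> 6 -> 12 over two hours
speed_10min = interpolate_hourly_to_10min([0.0, 6.0, 12.0])

# Direction crossing north: 350 degrees -> 10 degrees
sin_10min, cos_10min = interpolate_direction([350.0, 10.0])
```

In the direction example, the midpoint (k = 3) has a sine component near zero and a positive cosine component, which corresponds to a direction close to north. Interpolating the raw angles 350 and 10 linearly would instead give 180 degrees at the midpoint, which is the failure the sine-cosine transformation is meant to avoid.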

The final input features include wind speed magnitude, sine and cosine encodings of wind direction, air density, temperature, mean sea level pressure, an intraday position feature, and annual cyclical encodings. When air density was not directly available, it was calculated from pressure and temperature. The actual power output was used as the ground-truth prediction target. During sample construction, only daily samples containing all 144 ten-minute observations and no missing values in either the target variable or the input features were retained.
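The temporal features and the complete-day filter can be illustrated as follows. The exact form of the intraday and annual encodings is not specified in the text, so the scalar day fraction and the sine-cosine pair used here are assumptions, as are all function names.

```python
import math

STEPS_PER_DAY = 144  # 24 h at 10 min resolution

def intraday_position(step):
    """Assumed form: fraction of the day elapsed at 10 min step index 0..143."""
    return step / STEPS_PER_DAY

def annual_encoding(day_index, days_in_year=365):
    """Assumed form: sine/cosine pair of the position within the year."""
    angle = 2 * math.pi * day_index / days_in_year
    return math.sin(angle), math.cos(angle)

def is_complete_day(rows):
    """Keep a day only if it has 144 rows and no missing value (None or NaN)."""
    if len(rows) != STEPS_PER_DAY:
        return False
    return all(v is not None and not math.isnan(v) for row in rows for v in row)
```

Each row is assumed to hold the input features and the target for one time step, so a single missing entry discards the whole day.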

To more clearly characterize the data properties of the study object and the challenges of the forecasting task, a visual analysis of the wind speed–power relationship was further conducted, as shown in Figure 3. The results reveal the significant nonlinearity and heteroscedasticity inherent in this task, and also provide a data-level basis for the subsequent model comparison and result interpretation.

As shown in Figure 3, the scatter relationship between wind speed and actual power exhibits a typical nonlinear power-curve pattern. In the low-wind-speed region, power output remains generally low while the samples are highly dispersed, indicating that the system is more susceptible to disturbances under low-output conditions. As wind speed increases, power enters a rapid ramping region, where the output becomes highly sensitive to changes in wind speed. In the high-wind-speed region, power gradually approaches saturation, although a certain degree of dispersion can still be observed. These observations indicate that the wind speed–power relationship is characterized not only by clear piecewise nonlinearity, but also by pronounced heteroscedasticity. For model comparison, this result suggests that models relying solely on simple linear assumptions are unlikely to maintain stable accuracy across the full range of operating conditions, whereas models with stronger nonlinear fitting capability are more likely to exhibit better adaptability under complex scenarios.

2.3. Comparative Models

To ensure that the comparison framework is both compact and representative, the comparative models were selected according to three principles: methodological coverage, practical relevance, and computational diversity. First, the selected models should cover the main forecasting paradigms commonly used in wind power forecasting, including linear statistical learning, tree-based nonlinear machine learning, and deep temporal modeling. Second, the models should have clear practical relevance and should be widely used or frequently adopted as strong baselines in energy time-series forecasting tasks. Third, the selected models should reflect different levels of structural and computational complexity, so that the benchmark can provide not only an accuracy comparison but also useful evidence for engineering model selection.

Following these principles, six representative models were selected. Ridge was used as a simple and interpretable linear baseline to indicate the performance level achievable by regularized linear mapping. XGBoost and LGBM were selected to represent gradient-boosting decision-tree models, which are widely used for structured tabular forecasting tasks involving SCADA, NWP, and temporal features. DLinear was included as a lightweight temporal deep-learning baseline based on sequence decomposition. Transformer was selected as a standard attention-based sequence model for evaluating the applicability of global temporal dependency modeling. PatchTST was included as a more recent patch-based long-sequence forecasting architecture to examine whether a more complex temporal structure can bring additional gains under the single-wind-farm setting. Therefore, the purpose of this model set is not to exhaustively evaluate all possible forecasting algorithms, but to provide a compact and representative benchmark across major model families under a shared information set and a unified experimental protocol.

For notational consistency, the input sample of a single day is denoted as $X = [x_{1}, x_{2}, \dots, x_{T}]^{\top} \in \mathbb{R}^{T \times d}$, where $T$ denotes the input sequence length and $d$ denotes the feature dimension at each time step. The prediction target is the wind power sequence over the future $T$ time steps, denoted as $\hat{Y} = [\hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{T}]^{\top} \in \mathbb{R}^{T}$. It should be noted that, although all models are compared under the common setting of a single-day forecasting task with 144 time steps, their input organization strategies are not identical. For tabular models such as Ridge, XGBoost, and LGBM, the input at time step $t$ is denoted as $z_{t} = x_{t} \in \mathbb{R}^{d}$, and the corresponding power value $y_{t}$ is predicted through point-wise regression. The final daily prediction sequence of length $T$ is then reconstructed from these point-wise outputs. In contrast, DLinear, Transformer, and PatchTST directly take the entire multivariate sequence $X$ as input and output the future $T$-step prediction sequence. This treatment is consistent with the native modeling biases of different model families and also matches the implementation adopted in this study. Therefore, the comparison in this work should be understood as a comparison across model families under a shared information set but different native input organizations, rather than as a strict pure-capacity comparison under exactly the same input formulation.
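The two input organizations can be contrasted with a small sketch. The helper names and the toy feature values are hypothetical; the point is only that the same daily array yields 144 point-wise samples for the tabular models and a single sequence sample for the temporal models.

```python
def to_pointwise(day_X, day_y):
    """Tabular route: one (z_t, y_t) pair per time step."""
    return [(list(x_t), y_t) for x_t, y_t in zip(day_X, day_y)]

def to_sequence(day_X, day_y):
    """Sequence route: the whole day is a single (X, Y) sample."""
    return [list(x_t) for x_t in day_X], list(day_y)

T, d = 144, 3                                      # d = 3 is an arbitrary toy value
day_X = [[float(t), 0.0, 1.0] for t in range(T)]   # toy features
day_y = [0.5] * T                                  # toy power values
rows = to_pointwise(day_X, day_y)   # 144 samples with d features each
X, Y = to_sequence(day_X, day_y)    # 1 sample: T x d input, T-step target
```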

2.3.1. Ridge Regression

Ridge regression is a linear model that extends ordinary linear regression by introducing an $L_{2}$ regularization term. In this study, Ridge is trained on tabular samples expanded in a point-wise manner. Therefore, its single-step prediction can be written as

$$\hat{y}_{t} = w^{\top} z_{t} + b \tag{1}$$

where $z_{t}$ denotes the input feature vector at time step $t$, $w$ is the regression coefficient vector, and $b$ is the bias term. After reorganizing the predictions for all time steps within a day, the daily prediction sequence can be expressed as

$$\hat{Y} = [\hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{T}]^{\top} \tag{2}$$

Ridge suppresses overfitting by jointly constraining the fitting error and the parameter magnitude. Its optimization objective is given by

$$L_{\mathrm{Ridge}} = \sum_{i=1}^{N} (y_{i} - \hat{y}_{i})^{2} + \lambda \lVert w \rVert_{2}^{2} \tag{3}$$

where $N$ denotes the number of training samples, and $\lambda$ is the regularization coefficient. Compared with ordinary least squares regression, Ridge generally exhibits better numerical stability when correlations exist among input features. Therefore, it is adopted in this study as a traditional and interpretable linear baseline. Its primary role is not to pursue the best predictive accuracy, but to provide a reference for the performance attainable by linear mapping under unified input features. Owing to its simple structure, strong interpretability, and low training cost, Ridge helps determine whether stronger nonlinear models or more complex temporal architectures are truly necessary for the current day-ahead wind power forecasting task in a single wind farm.
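The shrinkage effect of the penalty in Equation (3) can be seen in a single-feature sketch. It assumes one input feature and an unpenalized intercept, for which the minimizer has a simple closed form; the function name and data are illustrative only.

```python
def fit_ridge_1d(z, y, lam):
    """Closed-form Ridge fit for one feature with an unpenalized intercept."""
    n = len(z)
    z_bar, y_bar = sum(z) / n, sum(y) / n
    s_zy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    w = s_zy / (s_zz + lam)   # lam = 0 recovers ordinary least squares
    b = y_bar - w * z_bar
    return w, b

z = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]                 # toy data lying exactly on y = 2 z + 1
w0, b0 = fit_ridge_1d(z, y, lam=0.0)
w1, b1 = fit_ridge_1d(z, y, lam=5.0)     # larger penalty, smaller slope
```

With the penalty switched off the exact line is recovered, while a positive penalty pulls the slope toward zero and shifts the intercept toward the mean of the targets.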

2.3.2. XGBoost

XGBoost, short for eXtreme Gradient Boosting, is an ensemble learning method based on gradient-boosted decision trees. Its core idea is to construct a strong predictor by sequentially adding regression trees, where each newly added tree attempts to correct the residual errors of the previous ensemble. In addition, XGBoost introduces regularization terms into the objective function to control tree complexity and reduce overfitting. Therefore, it is particularly suitable for nonlinear regression problems with structured tabular inputs. In this study, XGBoost is adopted as a representative tree-based machine learning model. The input features for XGBoost are organized in a point-wise tabular form, including NWP-derived meteorological variables and temporal encodings at each 10 min time step. Under this setting, XGBoost is used to learn the nonlinear mapping from the structured SCADA–NWP feature set to wind power output. Its role in the benchmark is to represent the boosting-tree route and to provide a strong nonlinear machine learning baseline against both linear regression and deep temporal models. The general modeling structure of XGBoost used in this study is illustrated in Figure 4.

For time step $t$, the boosting process can be expressed as

$$\hat{y}_{t}^{(0)} = 0 \tag{4}$$

$$\hat{y}_{t}^{(m)} = \hat{y}_{t}^{(m-1)} + f_{m}(z_{t}), \quad f_{m} \in \mathcal{F} \tag{5}$$

where $m = 1, 2, \dots, M$ denotes the boosting iteration, $f_{m}(\cdot)$ denotes the $m$-th regression tree, and $\mathcal{F}$ represents the function space of tree models.

Therefore, the final prediction can be written as

$$\hat{y}_{t} = \sum_{m=1}^{M} f_{m}(z_{t}) \tag{6}$$

Its optimization objective is given by

$$L = \sum_{i=1}^{N} l(y_{i}, \hat{y}_{i}) + \sum_{m=1}^{M} \Omega(f_{m}) \tag{7}$$

where $l(\cdot)$ denotes the loss function, and $\Omega(\cdot)$ denotes the regularization term used to control tree complexity.

A commonly used complexity penalty in XGBoost is formulated as

$$\Omega(f_{m}) = \gamma T_{m} + \frac{1}{2} \lambda \sum_{j=1}^{T_{m}} w_{mj}^{2} \tag{8}$$

where $T_{m}$ is the number of leaf nodes in the $m$-th tree, $w_{mj}$ is the weight of the $j$-th leaf node in the $m$-th tree, and $\gamma$ and $\lambda$ control the penalty strength on the tree structure and leaf weights, respectively.

Overall, the advantage of XGBoost lies in its ability to maintain strong nonlinear fitting capability while balancing generalization performance through regularization and the boosting mechanism. Therefore, it is adopted in this study as one of the representative strong tree-based models to examine the performance of the boosting-tree route in day-ahead wind power forecasting for a single wind farm.
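The additive structure of Equations (4) to (6) can be illustrated with a toy booster built from single-split trees. This is a simplified sketch under squared loss, not the XGBoost library: the shrinkage factor `eta` is an added assumption (it can be read as folded into each $f_{m}$), and the wind speed and per-unit power values are made up.

```python
def fit_stump(z, residual, lam):
    """Best single-split tree on a 1-D feature; leaf weight = sum(res) / (n + lam)."""
    best = None
    for thr in sorted(set(z))[:-1]:
        left = [r for zi, r in zip(z, residual) if zi <= thr]
        right = [r for zi, r in zip(z, residual) if zi > thr]
        wl = sum(left) / (len(left) + lam)
        wr = sum(right) / (len(right) + lam)
        sse = sum((r - wl) ** 2 for r in left) + sum((r - wr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, wl, wr)
    _, thr, wl, wr = best
    return lambda v: wl if v <= thr else wr

def boost(z, y, n_trees=20, lam=1.0, eta=0.5):
    """Start from a zero prediction and add one shrunk stump per iteration."""
    trees, pred = [], [0.0] * len(y)
    for _ in range(n_trees):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(z, residual, lam)
        trees.append(tree)
        pred = [pi + eta * tree(zi) for pi, zi in zip(pred, z)]
    return lambda v: sum(eta * t(v) for t in trees)

def mse(model, z, y):
    return sum((yi - model(zi)) ** 2 for zi, yi in zip(z, y)) / len(y)

z = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]   # toy wind speeds (m/s)
y = [0.0, 0.1, 0.4, 0.8, 1.0, 1.0]     # toy per-unit power
model = boost(z, y)
```

Each stump fits the residual left by the current ensemble, and the `lam` term shrinks leaf weights in the same spirit as the leaf-weight penalty in Equation (8).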

2.3.3. LGBM

LGBM, which also belongs to the gradient boosting tree framework, retains the boosting principle of XGBoost while further improving training efficiency through histogram-based splitting and a leaf-wise growth strategy. For the single-wind-farm forecasting task considered in this study, which is characterized by a moderate data scale and clearly structured features, LGBM not only maintains strong fitting capability but also exhibits clear advantages in computational efficiency and complexity control. To facilitate understanding of its implementation logic, Figure 5 illustrates the overall structure of LGBM. The model first bins continuous features, then searches for the optimal split according to split gain, and gradually expands the tree structure in a leaf-wise manner, finally producing the boosting ensemble output.

For time step $t$, the prediction process of LGBM can be expressed as

$$\hat{y}_{t}^{(0)} = 0 \tag{9}$$

$$\hat{y}_{t}^{(m)} = \hat{y}_{t}^{(m-1)} + g_{m}(z_{t}), \quad g_{m} \in \mathcal{G} \tag{10}$$

where $g_{m}(\cdot)$ denotes the $m$-th decision tree, and $\mathcal{G}$ represents the function space of tree models.

Accordingly, the final prediction can be written as

$$\hat{y}_{t} = \sum_{m=1}^{M} g_{m}(z_{t}) \tag{11}$$

Its optimization objective remains consistent with the boosting framework and can be written as

$$L = \sum_{i=1}^{N} l(y_{i}, \hat{y}_{i}) + \sum_{m=1}^{M} \Omega(g_{m}) \tag{12}$$

To evaluate the quality of a candidate split, LGBM commonly adopts the split-gain criterion

$$\mathrm{Gain} = \frac{1}{2} \left( \frac{G_{L}^{2}}{H_{L} + \lambda} + \frac{G_{R}^{2}}{H_{R} + \lambda} - \frac{(G_{L} + G_{R})^{2}}{H_{L} + H_{R} + \lambda} \right) - \gamma \tag{13}$$

where $G_{L}$ and $G_{R}$ denote the sums of first-order gradients for the left and right child nodes, respectively, $H_{L}$ and $H_{R}$ denote the sums of second-order gradients for the left and right child nodes, respectively, and $\gamma$ and $\lambda$ are regularization parameters.
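Equation (13) translates directly into code. The sketch below assumes squared loss, for which the first-order gradient is the prediction minus the target and the second-order gradient is 1. It scans raw thresholds on one feature, whereas a histogram-based implementation would scan bin boundaries; the function names and data are illustrative.

```python
def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Equation (13): gain of splitting a node into left and right children."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

def best_split(z, y, pred, lam=1.0, gamma=0.0):
    """Scan thresholds on a 1-D feature under squared loss (g = pred - y, h = 1)."""
    grads = [p - t for p, t in zip(pred, y)]
    best_thr, best_gain = None, float("-inf")
    for thr in sorted(set(z))[:-1]:
        gl = sum(g for zi, g in zip(z, grads) if zi <= thr)
        hl = sum(1 for zi in z if zi <= thr)
        gr, hr = sum(grads) - gl, len(z) - hl
        gain = split_gain(gl, hl, gr, hr, lam, gamma)
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain

z = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]   # toy wind speeds (m/s)
y = [0.0, 0.1, 0.4, 0.8, 1.0, 1.0]     # toy per-unit power
thr, gain = best_split(z, y, pred=[0.0] * len(y))
```

A split is only worthwhile when its gain is positive, so raising `gamma` prunes weak splits and raising `lam` damps the scores of small leaves.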

From the perspective of model characteristics, the key advantage of LGBM lies not merely in being faster than XGBoost, but in its ability to balance predictive accuracy and computational efficiency relatively stably in tabular-feature tasks. For this reason, it is adopted in this study as another representative strong tree-based model to examine the performance of a more efficient boosting route in day-ahead wind power forecasting for a single wind farm.

2.3.4. DLinear

DLinear, short for Decomposition-Linear, is a lightweight deep time-series forecasting model based on series decomposition. Its main idea is to decompose the input sequence into trend and seasonal components, and then apply separate linear mappings to these components for forecasting. Different from attention-based models such as Transformer and PatchTST, DLinear does not rely on complex self-attention structures. Instead, it uses a simple decomposition-and-linear-mapping strategy to capture the dominant temporal patterns in the sequence. In this study, DLinear is included as a representative lightweight temporal deep-learning baseline. Unlike XGBoost and LGBM, which use point-wise tabular inputs, DLinear directly takes the complete multivariate daily sequence as input and outputs the corresponding wind power sequence. Its role in the benchmark is to examine whether a simple decomposition-based temporal model can provide competitive performance under the single-wind-farm, limited-sample, and exogenous-feature-based forecasting setting. To more clearly illustrate this process, Figure 6 presents the DLinear framework used in this study: the input sequence is first decomposed by a moving-average operation, after which linear transformations are applied to the trend and seasonal components, respectively, and the target power prediction is finally obtained through feature projection.

Given the input sequence $X \in \mathbb{R}^{T \times d}$, the decomposition process can be expressed as

$$X_{\mathrm{trend}} = \mathrm{MA}(X) \tag{14}$$

$$X_{\mathrm{season}} = X - X_{\mathrm{trend}} \tag{15}$$

where $\mathrm{MA}(\cdot)$ denotes the moving-average operation used to extract the low-frequency trend component. On this basis, linear mappings along the temporal dimension are separately applied to the seasonal and trend components:

$$H_{\mathrm{season}} = W_{s} X_{\mathrm{season}} + b_{s} \tag{16}$$

$$H_{\mathrm{trend}} = W_{t} X_{\mathrm{trend}} + b_{t} \tag{17}$$

where $W_{s}$ and $W_{t}$ denote the linear mapping parameters corresponding to the seasonal and trend components, respectively, and $b_{s}$ and $b_{t}$ are the bias terms. The intermediate representation is then obtained by summing the two components:

$$H = H_{\mathrm{season}} + H_{\mathrm{trend}} \tag{18}$$

Since the multivariate-input version of DLinear is adopted in this study, a feature projection layer is further used to obtain the target power output:

$$\hat{y}_{t} = w_{p}^{\top} h_{t} + b_{p}, \quad t = 1, 2, \dots, T \tag{19}$$

where $h_{t} \in \mathbb{R}^{d}$ denotes the feature vector at time step $t$ in the intermediate representation $H$, and $w_{p}$ and $b_{p}$ denote the weight and bias of the output projection layer, respectively. Accordingly, the prediction sequence over the whole horizon can be written as

$$\hat{Y} = [\hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{T}]^{\top} \tag{20}$$

DLinear is included in the comparison not because of its structural complexity, but precisely because of its simplicity. It is used to answer a key question: under the conditions of a single wind farm, limited sample size, and unified inputs, is lightweight temporal modeling already sufficient to achieve competitive forecasting performance? In other words, DLinear serves in this study as a lightweight baseline among deep temporal models.
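The decomposition step in Equations (14) and (15) can be sketched for a single feature channel. The centered window with repeated edge values is an assumption about the padding, and the window size would be a tuned hyperparameter in practice; the learned linear mappings of Equations (16) to (19) are not reproduced here.

```python
def moving_average(series, window):
    """Centered moving average with edge values repeated as padding (odd window)."""
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(series))]

def decompose(series, window=5):
    """Split a series into a smooth trend and the remaining seasonal part."""
    trend = moving_average(series, window)
    season = [x - t for x, t in zip(series, trend)]
    return trend, season

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # toy linear ramp
trend, season = decompose(series, window=3)
```

By construction the two components add back to the original series, so no information is lost before the linear layers are applied.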

2.3.5. Transformer

Transformer is a sequence modeling method based on the self-attention mechanism. It can capture dependencies among different temporal positions in the input sequence through global dependency modeling, while avoiding the efficiency limitations and gradient-related issues commonly encountered by traditional recurrent structures in long-sequence modeling. In this study, the Transformer adopts an encoder-only architecture: the input sequence is first projected through a linear embedding layer and combined with positional encoding, then fed into multiple encoder layers, and finally mapped by a linear output layer to obtain point-wise prediction results. Figure 7 illustrates the overall structure of this process. As can be seen, the model places greater emphasis on exploiting the global attention mechanism to capture long-range temporal dependencies.

Given the input sequence $X \in \mathbb{R}^{T \times d}$, its initial embedding representation can be written as

$$H^{(0)} = X W_{e} + P \tag{21}$$

where $W_{e}$ is the input projection matrix, and $P$ is the positional encoding matrix.

In the $l$-th encoder layer, the query, key, and value matrices are defined as

$$Q^{(l)} = H^{(l-1)} W_{Q}^{(l)} \tag{22}$$

$$K^{(l)} = H^{(l-1)} W_{K}^{(l)} \tag{23}$$

$$V^{(l)} = H^{(l-1)} W_{V}^{(l)} \tag{24}$$

The self-attention operation is then given by

$$\mathrm{Attn}(Q^{(l)}, K^{(l)}, V^{(l)}) = \mathrm{Softmax}\left( \frac{Q^{(l)} (K^{(l)})^{\top}}{\sqrt{d_{k}}} \right) V^{(l)} \tag{25}$$

where $d_{k}$ denotes the dimension of the key vectors. Furthermore, the layer outputs after multi-head attention and the feed-forward network can be written as

$$\tilde{H}^{(l)} = \mathrm{MHA}(H^{(l-1)}) \tag{26}$$

$$H^{(l)} = \mathrm{FFN}(\tilde{H}^{(l)}), \quad l = 1, 2, \dots, L \tag{27}$$

where $\mathrm{MHA}(\cdot)$ denotes the multi-head attention module and $\mathrm{FFN}(\cdot)$ denotes the feed-forward neural network. Finally, the model obtains the future $T$-step prediction through the output mapping:

$$\hat{y}_{t} = w_{o}^{\top} h_{t}^{(L)} + b_{o}, \quad t = 1, 2, \dots, T \tag{28}$$

In wind power forecasting scenarios, Transformer has been widely used in complex time-series prediction tasks because it can explicitly model long-range dependencies. In this study, it is adopted as a standard attention-based model to evaluate the applicability of mainstream deep sequence-modeling approaches in day-ahead forecasting for a single wind farm.
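The scaled dot-product attention of Equation (25) can be written out for small matrices stored as nested lists. This is a single-head sketch without the learned projections of Equations (22) to (24); the function names and the toy inputs are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Equation (25): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# With identical keys every position receives equal weight,
# so each output row is the average of the value rows.
out = attention([[1.0, 0.0], [0.0, 1.0]],
                [[0.0, 0.0], [0.0, 0.0]],
                [[1.0, 2.0], [3.0, 4.0]])
```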

2.3.6. PatchTST

PatchTST is a time-series forecasting model developed on the basis of the Transformer. Its core improvement lies in dividing the continuous time series into a series of local patches and treating each patch as a token input to the model, thereby enhancing local pattern modeling and reducing the computational burden of long-sequence processing. Compared with the standard Transformer, which performs modeling directly at the point-wise time-step level, PatchTST places greater emphasis on extracting local segment-level patterns. Figure 8 illustrates its overall structure: the model first partitions the continuous sequence into patches, then performs patch embedding and positional encoding, after which the embedded patches are fed into a Transformer encoder to extract high-level features, and finally mapped to future power prediction results.

Let the input sequence be $X \in \mathbb{R}^{T \times d}$, with patch length $p$ and stride $s$. Then, the total number of patches is given by

$$N_{p} = \left\lfloor \frac{T - p}{s} \right\rfloor + 1 \tag{29}$$

For the $c$-th feature channel, the $j$-th patch can be written as

$$P_{c,j} = X_{(j-1)s+1 \,:\, (j-1)s+p,\; c} \in \mathbb{R}^{p} \tag{30}$$

After flattening each patch and applying a linear projection, its embedding representation is obtained as

$$E_{c,j} = W_{p} P_{c,j} + p_{j} \tag{31}$$

where $W_{p}$ denotes the patch embedding matrix and $p_{j}$ denotes the positional encoding. Subsequently, the patch tokens are fed into the Transformer encoder for feature extraction:

$$H_{c} = \mathrm{TransformerEncoder}([E_{c,1}, E_{c,2}, \dots, E_{c,N_{p}}]) \tag{32}$$

After encoding, the patch representations are mapped back to a time-series representation of length $T$ through the prediction head:

$$R_{c} = \mathrm{Head}(H_{c}) \in \mathbb{R}^{T} \tag{33}$$

By concatenating the outputs from all channels, we obtain

$$R = [R_{1}, R_{2}, \dots, R_{d}] \in \mathbb{R}^{T \times d} \tag{34}$$

Finally, the target power prediction is obtained through a feature projection layer:

$$\hat{y}_{t} = w_{f}^{\top} r_{t} + b_{f}, \quad t = 1, 2, \dots, T \tag{35}$$

where $r_{t} \in \mathbb{R}^{d}$ denotes the multi-channel representation corresponding to time step $t$, and $w_{f}$ and $b_{f}$ are the parameters of the output projection layer. Accordingly, the final prediction sequence can be written as

$$\hat{Y} = [\hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{T}]^{\top} \tag{36}$$

Compared with the standard Transformer, PatchTST represents a more modern and structurally more sophisticated long-sequence modeling architecture. It is included in this study primarily to examine whether a more complex patch-based long-sequence structure can bring additional gains under the conditions of a single wind farm, limited sample size, and multivariate meteorological inputs, and whether such gains are sufficient to offset the increased structural complexity.
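The patching step of Equations (29) and (30) is easy to check numerically. In the sketch below the patch length of 16 and stride of 8 are hypothetical values chosen for illustration (both are tuned in the study), and no end padding is applied.

```python
def num_patches(T, p, s):
    """Equation (29): number of patches of length p taken with stride s."""
    return (T - p) // s + 1

def make_patches(channel, p, s):
    """Equation (30) for one feature channel, using 0-based Python slicing."""
    n = num_patches(len(channel), p, s)
    return [channel[j * s: j * s + p] for j in range(n)]

channel = list(range(144))              # one daily channel, T = 144
patches = make_patches(channel, p=16, s=8)
```

For these values the 144-step day is reduced to 17 overlapping tokens per channel, which is the source of the shorter attention sequence relative to point-wise tokenization.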

In summary, the comparative models selected in this study cover the major methodological spectrum from regularized linear mapping and nonlinear learning with tree-based models to modern deep temporal modeling. Ridge provides the traditional linear baseline, XGBoost and LGBM represent the strong tree-model route, while DLinear, Transformer, and PatchTST correspond to lightweight temporal modeling, standard attention-based modeling, and more complex patch-based temporal modeling architectures, respectively. Comparing these models under a shared feature information set, a unified experimental protocol, and a unified hyperparameter tuning budget helps identify the applicability boundaries of different inductive biases in day-ahead wind power forecasting for a single wind farm, and provides a consistent basis for the subsequent result analysis. It should again be noted that the comparison is conducted across different model families in their respective native modeling forms, rather than under a completely identical input organization scheme.

2.4. Optuna-Based Hyperparameter Optimization Strategy

To ensure a fair comparison of different models under a unified computational budget, this study employs Optuna to perform automated hyperparameter optimization for all comparative models. Optuna is a hyperparameter optimization framework based on sequential trials and adaptive sampling, which can efficiently explore complex parameter spaces under a limited number of trials. Considering the substantial differences among models in structural complexity, parameter types, and search-space scale, comparisons based solely on default settings tend to underestimate the potential of some models and may even lead to biased model rankings. Therefore, unified hyperparameter tuning is regarded in this study as an essential component of the benchmarking design rather than an auxiliary step.

Let $\theta_{m}$ denote the hyperparameter vector of model $m$, and let $\Theta_{m}$ denote its search space. The optimization objective on the validation set is denoted by $J_{m}(\theta_{m})$. Then, the hyperparameter optimization problem in this study can be formally expressed as

$$\theta_{m}^{*} = \arg\min_{\theta_{m} \in \Theta_{m}} J_{m}(\theta_{m}) \tag{37}$$

where $\theta_{m}^{*}$ denotes the optimal hyperparameter combination obtained for model $m$ within the given search space. Considering that this study mainly focuses on comparing the overall predictive accuracy of different models under a unified experimental protocol, the normalized root mean square error on the validation set is adopted as the primary optimization objective, that is,

$$J_{m}(\theta_{m}) = \mathrm{NRMSE}_{\mathrm{val}}(f_{m}(\cdot\,; \theta_{m})) \tag{38}$$

where $f_{m}(\cdot\,; \theta_{m})$ denotes model $m$ parameterized by $\theta_{m}$, and $\mathrm{NRMSE}_{\mathrm{val}}$ denotes the normalized root mean square error on the validation set. If the number of validation samples is $N_{\mathrm{val}}$, the true power is $y_{i}$, the predicted power is $\hat{y}_{i}$, and the rated capacity of the wind farm is $P_{\mathrm{rated}}$, then

$$\mathrm{NRMSE}_{\mathrm{val}} = \frac{\sqrt{\frac{1}{N_{\mathrm{val}}} \sum_{i=1}^{N_{\mathrm{val}}} (y_{i} - \hat{y}_{i})^{2}}}{P_{\mathrm{rated}}} \times 100\% \tag{39}$$

During the search process, let $\theta_{m}^{(t)}$ denote the hyperparameter combination corresponding to the $t$-th trial, and let $T$ denote the total number of trials. Here, $t$ and $T$ index trials rather than time steps. Then, the optimal hyperparameter configuration can be further written as

$$\theta_{m}^{*} = \arg\min_{t \in \{1, 2, \dots, T\}} J_{m}(\theta_{m}^{(t)}) \tag{40}$$

This indicates that Optuna performs multiple rounds of trials and evaluation over the hyperparameter space, and ultimately selects the parameter combination that minimizes the validation objective as the final model configuration.
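The selection rule in Equation (40) can be made concrete with a small loop. This sketch deliberately swaps in plain random sampling for Optuna's adaptive sampler, so it illustrates only the arg-min over trials and not the library's search strategy. The search space, the toy objective, and all names are hypothetical.

```python
import random

def search(objective, sample, n_trials, seed=0):
    """Equation (40): evaluate n_trials configurations and keep the arg min."""
    rng = random.Random(seed)
    best_theta, best_value = None, float("inf")
    for _ in range(n_trials):
        theta = sample(rng)
        value = objective(theta)
        if value < best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value

def sample_ridge(rng):
    """Hypothetical Ridge space: log-uniform strength and an intercept flag."""
    return {"alpha": 10 ** rng.uniform(-3, 3),
            "fit_intercept": rng.choice([True, False])}

def toy_objective(theta):
    """Stand-in for NRMSE_val; a real objective would train and score the model."""
    return abs(theta["alpha"] - 1.0) + (0.0 if theta["fit_intercept"] else 0.5)

best_theta, best_value = search(toy_objective, sample_ridge, n_trials=50)
```

Giving every model the same `n_trials` is one simple way to express the comparable budget constraint described in this section.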

For different models, the search spaces were defined according to their structural characteristics. For Ridge, the main search variables include the regularization strength α (the coefficient λ in Equation (3)) and whether to include the intercept term. For XGBoost and LGBM, the search mainly covers the learning rate, number of trees, tree depth or leaf-complexity parameters, subsampling ratio, column sampling ratio, and regularization parameters. For DLinear, the main search variables include the decomposition window size, number of training epochs, batch size, learning rate, weight decay, and early-stopping patience. For Transformer, the search is further extended to hidden dimension, number of attention heads, number of encoder layers, feed-forward layer width, dropout, and training-related parameters. For PatchTST, structural parameters such as patch length and stride are additionally optimized on this basis. It should be noted that different models retain their native objective-function settings. For example, XGBoost uses the squared-error objective, whereas LGBM adopts the L₁ regression objective while using L₂-based metrics for validation. Although the parameter types and dimensionalities vary across models, this study consistently follows a unified principle: hyperparameter search is conducted on the same validation set, with the same optimization objective, and under comparable budget constraints, so as to minimize the influence of differences in tuning effort on the model comparison conclusions.

In the hold-out split experiment, candidate models were first trained on the training set, and Optuna was then used to search for the optimal hyperparameter configuration on the validation set. Subsequently, the final model was retrained using the optimal hyperparameters, and standard results were reported on the test set. Furthermore, in the expanding-window rolling validation, large-scale hyperparameter search was not repeated within each fold. Instead, the optimal hyperparameters obtained from the hold-out split experiment were adopted as frozen configurations to examine the generalization stability of different models across multiple time windows. Through this design, the study not only ensures the consistency of the tuning process but also avoids additional computational overhead and comparison bias caused by repeated search.
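The expanding-window protocol can be sketched as an index generator over chronologically ordered days. The total of 360 days, the initial training length of 120 days, and the 30-day test blocks are hypothetical values chosen so that eight folds fit exactly; the actual fold sizes are not stated in this section.

```python
def expanding_window_folds(n_days, n_folds, initial_train, test_size):
    """Return (train_days, test_days) index ranges in chronological order."""
    folds = []
    for k in range(n_folds):
        train_end = initial_train + k * test_size
        test_end = train_end + test_size
        if test_end > n_days:
            break  # not enough data left for a full test block
        folds.append((range(0, train_end), range(train_end, test_end)))
    return folds

folds = expanding_window_folds(n_days=360, n_folds=8,
                               initial_train=120, test_size=30)
for train_days, test_days in folds:
    # The frozen hyperparameters from the hold-out search would be reused here:
    # refit on train_days, then score on test_days.
    pass
```

Every training range starts at day 0 and ends immediately before its test block, so no fold is trained on data that postdates its evaluation window.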

It should be emphasized that the purpose of retaining both default and tuned results is not to increase the number of experiments, but to quantitatively analyze the differences in model sensitivity to hyperparameter optimization. If a model performs only moderately under default settings but improves substantially after unified-budget optimization, this indicates that its performance conclusions depend strongly on parameter configuration. In contrast, if a model already performs robustly under default settings, this suggests that its structure itself has relatively strong adaptability. Therefore, Optuna-based tuning is used not only to improve the performance of individual models but also as an important component of the fair benchmarking design in this study.

2.5. Evaluation Metrics

To comprehensively evaluate the performance of different models in the day-ahead wind power forecasting task for a single wind farm, this study adopts absolute-error-based metrics and their normalized forms, including the mean absolute error (MAE), root mean square error (RMSE), and the normalized metrics NMAE and NRMSE with respect to the rated capacity of the wind farm.

Let $y_{i}$ denote the true power, $\hat{y}_{i}$ the predicted power, $N$ the number of samples, and $P_{\mathrm{rated}}$ the rated capacity of the wind farm. The evaluation metrics are defined as follows:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert y_{i} - \hat{y}_{i} \rvert \tag{41}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_{i} - \hat{y}_{i})^{2}} \tag{42}$$

$$\mathrm{NMAE} = \frac{\mathrm{MAE}}{P_{\mathrm{rated}}} \times 100\% \tag{43}$$

$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{P_{\mathrm{rated}}} \times 100\% \tag{44}$$

In this study, the rated capacity of the wind farm is set to 17,560 kW. Therefore, NMAE and NRMSE normalize the errors of different models to the same scale, which facilitates horizontal comparison.
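Equations (41) to (44) can be implemented directly. The sketch below uses the rated capacity of 17,560 kW given in the text; the function names and the two-point example are illustrative.

```python
import math

P_RATED_KW = 17560.0  # rated capacity stated in the text

def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def nmae(y, y_hat, p_rated=P_RATED_KW):
    """Equation (43), in percent of rated capacity."""
    return mae(y, y_hat) / p_rated * 100.0

def nrmse(y, y_hat, p_rated=P_RATED_KW):
    """Equation (44), in percent of rated capacity."""
    return rmse(y, y_hat) / p_rated * 100.0

# Toy case: both errors equal 1756 kW, i.e. 10% of rated capacity.
y_true = [0.0, 5000.0]
y_pred = [1756.0, 3244.0]
```

When all absolute errors are equal, as in this toy case, MAE and RMSE coincide; RMSE exceeds MAE as soon as the errors differ, which is why the two normalized metrics can rank models differently.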

In practical use, NMAE and NRMSE are taken as the primary accuracy metrics because they can reflect the overall prediction error while eliminating the influence of installed-capacity scale on the metric magnitude. MAE and RMSE are used as supplementary metrics to provide a more intuitive description of the absolute error level.

In the hold-out split experiment, the above metrics are used to report the standard test-set results. In the expanding-window rolling validation, the mean and standard deviation across folds are further calculated, and the temporal robustness of each model is evaluated together with the average ranking. In the scenario-based analysis, NMAE and NRMSE are used as the core metrics, while MAE and RMSE are employed as supplementary indicators to further analyze the error characteristics of different models.
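The fold-level aggregation can be sketched as follows. Whether the study uses the sample or the population standard deviation is not stated, so the sample form is an assumption here; ties are not handled, and the model labels and scores are placeholders rather than results from the study.

```python
import statistics

def summarize(fold_scores):
    """Mean and sample standard deviation of a model's per-fold NRMSE."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)

def average_rank(scores_by_model):
    """Rank models within each fold (1 = lowest error), then average the ranks."""
    models = list(scores_by_model)
    n_folds = len(next(iter(scores_by_model.values())))
    totals = {m: 0 for m in models}
    for k in range(n_folds):
        ordered = sorted(models, key=lambda m: scores_by_model[m][k])
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_folds for m in models}

# Placeholder scores for two hypothetical models over three folds.
scores = {"A": [8.0, 9.0, 10.0], "B": [8.5, 8.5, 11.0]}
ranks = average_rank(scores)
```

The mean captures average accuracy, the standard deviation captures variability across time windows, and the average rank is less sensitive to a single unusually difficult fold.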

3. Results and Discussion

Under the unified research framework and experimental protocol described above, the predictive performance of the six representative models is systematically examined from four perspectives: the test results under the hold-out split, the comparison between default and tuned results, the performance under expanding-window rolling validation, and the differences across typical operating scenarios.

3.1. Hold-Out Split Results

Table 1 presents the main evaluation results of the models on the test set under the hold-out split. It can be observed that the models exhibit a relatively clear performance stratification under this evaluation protocol. LGBM achieves the best overall performance, with the lowest NRMSE of 10.2326%. Transformer attains the lowest NMAE of 6.9944%, but its NRMSE is 10.6688%, which is slightly higher than that of LGBM. DLinear and XGBoost are at an intermediate level, with NRMSE values of 11.1095% and 11.3381%, respectively. Ridge, as the traditional linear baseline, reaches an NRMSE of 12.8054%, lagging significantly behind the nonlinear models. Although PatchTST outperforms Ridge after hyperparameter tuning, its overall performance still does not reach the level of the leading models.

These results suggest that, within the present single-wind-farm benchmarking setting, the more complex deep architectures did not outperform the best tree-based model in NRMSE under the hold-out evaluation, although Transformer attained a lower NMAE than LGBM and XGBoost trailed both Transformer and DLinear. The best-performing model therefore does not necessarily correspond to the highest structural complexity, but depends more on the degree of fit between the model family, its native input organization, and the task characteristics. One possible explanation is that, under the present limited-data setting and shared input design, the task may behave more like a structured nonlinear mapping problem than a strongly autoregressive sequence-learning problem. Under this setting, LGBM may be relatively well matched to the structured nonlinear relationships in the input features. By contrast, the potential advantages of more complex deep temporal architectures may be less fully expressed when the data scale is limited and historical power lags are excluded. This may partly account for the relatively limited gains of PatchTST in the present benchmark.

3.2. Comparison Between Default and Tuned Results

To further examine the effect of hyperparameter optimization on the performance of different models, Figure 9 presents the comparison of NMAE and NRMSE for each model under default and tuned conditions. It can be seen that different models exhibit markedly different sensitivity to hyperparameter optimization, and the performance gains brought by tuning are unevenly distributed across models.

Overall, LGBM achieves only a modest improvement after tuning, with its NRMSE decreasing from 10.2922% to 10.2326%. This indicates that the default configuration of this model is already highly competitive, leaving relatively limited room for further optimization. By contrast, XGBoost and DLinear exhibit more pronounced improvements after tuning. Specifically, the NRMSE of XGBoost decreases from 12.0151% to 11.3381%, while that of DLinear decreases from 11.5784% to 11.1095%, suggesting that these two models are more sensitive to hyperparameter settings and that appropriate tuning can further unlock their predictive potential. PatchTST shows the largest improvement, with its NRMSE dropping from 15.0175% to 11.9398%, suggesting that its performance may be noticeably underestimated when assessed only under the default configuration. The behavior of Transformer is less clear-cut: although its NMAE improves after tuning, its NRMSE does not improve accordingly. This suggests that hyperparameter optimization does not always affect different evaluation metrics in the same direction, and that a change in one metric cannot simply be interpreted as an across-the-board improvement.
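The tuning gains quoted above can be put on a common scale with a few lines of arithmetic. The sketch below uses only the default and tuned NRMSE values stated in this section (Ridge and Transformer are omitted because their default values are not quoted here), and the helper name `tuning_gain` is hypothetical. It gives a reduction of roughly 0.06 percentage points for LGBM against roughly 3.08 for PatchTST.

```python
# NRMSE (%) under default and tuned settings, as quoted in Section 3.2.
NRMSE_DEFAULT_TUNED = {
    "LGBM": (10.2922, 10.2326),
    "XGBoost": (12.0151, 11.3381),
    "DLinear": (11.5784, 11.1095),
    "PatchTST": (15.0175, 11.9398),
}


def tuning_gain(default, tuned):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = default - tuned
    relative = 100.0 * absolute / default
    return absolute, relative


def all_gains(table=NRMSE_DEFAULT_TUNED):
    """Compute the tuning gain for every model in the table."""
    return {model: tuning_gain(d, t) for model, (d, t) in table.items()}


if __name__ == "__main__":
    ranked = sorted(all_gains().items(), key=lambda kv: kv[1][0], reverse=True)
    for model, (absolute, relative) in ranked:
        print(f"{model:9s} -{absolute:.4f} pp  ({relative:.2f}% of default NRMSE)")
```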

These results indicate that, if model comparison is conducted only under default settings, the actual capability of different models may not be fully reflected, and model rankings may even become biased. Therefore, fair hyperparameter tuning under a unified budget and consistent validation criteria should not be regarded as an auxiliary step, but rather as an indispensable component of benchmarking comparison. This difference in tuning sensitivity is also consistent with the characteristics of the compared models. LGBM already performs strongly under its default configuration, indicating that its built-in parameterization is relatively well aligned with this type of structured forecasting task. By contrast, models such as PatchTST and Transformer involve more structural and training-related hyperparameters, and their final performance is therefore more sensitive to hyperparameter selection.

3.3. Performance Under Expanding-Window Rolling Validation

To further compare the error distributions and stability of different models under expanding-window rolling validation, Figure 10 and Figure 11 present the boxplots of NMAE and NRMSE, respectively, across different folds for all models. As can be seen, the error distributions differ substantially across models under the rolling setting, and their performance rankings are not fully consistent with those obtained under the hold-out split. Overall, Transformer, LGBM, and XGBoost exhibit relatively lower error distributions, among which Transformer performs the best. By contrast, PatchTST and DLinear show generally higher error distributions and wider variation ranges, reflecting weaker stability across multiple time windows. Although Ridge remains limited in predictive accuracy, it still retains its reference value as a traditional baseline model.

Combined with the average NRMSE statistics, Transformer achieves the lowest mean NRMSE under the rolling setting, at 8.1684%. The corresponding values for LGBM and XGBoost are 8.7344% and 8.9159%, respectively, while Ridge records 10.4702%. PatchTST and DLinear show relatively poorer performance, with mean NRMSE values of 12.2008% and 12.5682%, respectively. These results indicate that the best performance on a single fixed test set does not necessarily correspond to the strongest robustness across multiple time windows. Specifically, LGBM performs better under the hold-out split, whereas Transformer shows lower average errors and better stability across rolling windows in the present validation setting. This difference indicates that the two evaluation protocols emphasize different aspects of model quality. The hold-out setting mainly reflects how well a model fits one specific temporal segment, whereas rolling validation provides a stronger test of whether the model can maintain stable performance when the evaluation window moves over time. In this sense, the present results may reflect that LGBM is relatively well aligned with structured relationships within a fixed segment, whereas Transformer may be better aligned with patterns that remain more stable across varying windows. Given that real-world wind farm operation is subject to changing meteorological conditions and potential distribution shifts, model superiority should not be assessed solely on a single test set, but also on robustness across multiple time windows.
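To make the rolling protocol concrete, the following sketch generates chronological folds in which the training window expands while the validation and test windows keep a fixed length. The window sizes, the step between folds (taken equal to the test length), and the function name are assumptions made for illustration; the actual fold boundaries of this study are those shown in Figure 2.

```python
def expanding_window_folds(n_samples, n_folds, val_size, test_size, step=None):
    """Yield (train, val, test) index ranges in chronological order.

    The training range always starts at index 0 and grows by `step` samples
    per fold; validation and test ranges have fixed lengths and follow it.
    """
    step = test_size if step is None else step
    first_train_end = n_samples - val_size - test_size - (n_folds - 1) * step
    if first_train_end <= 0:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = first_train_end + k * step
        val_end = train_end + val_size
        test_end = val_end + test_size
        yield (
            range(0, train_end),
            range(train_end, val_end),
            range(val_end, test_end),
        )


if __name__ == "__main__":
    for i, (train, val, test) in enumerate(
        expanding_window_folds(n_samples=1000, n_folds=8, val_size=50, test_size=50),
        start=1,
    ):
        print(f"fold {i}: train={len(train)} val={val[0]}..{val[-1]} "
              f"test={test[0]}..{test[-1]}")
```

Every fold keeps training data strictly earlier than validation data, and validation data strictly earlier than test data, so no future information enters model fitting or tuning.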

3.4. Analysis of Results Under Typical Operating Scenarios

Building on the overall evaluation under the hold-out split, this study further performs scenario-based error analysis on the test set to reveal the performance differences of different models across typical operating conditions. Using the optimal hyperparameter configurations obtained by Optuna on the validation set, the analysis is conducted from three perspectives: power intervals, ramp events, and intraday periods. Specifically, with the rated capacity of the wind farm, $P_{\mathrm{rated}}$, as the reference, the samples are divided into three intervals: low-power ($P < 0.2\,P_{\mathrm{rated}}$), medium-power ($0.2\,P_{\mathrm{rated}} \le P \le 0.8\,P_{\mathrm{rated}}$), and high-power ($P \ge 0.8\,P_{\mathrm{rated}}$). Ramp events are identified according to the power variation between two adjacent 10 min time steps, defined as $\Delta P_{t} = P_{t} - P_{t-1}$: when $\Delta P_{t} > 0.05\,P_{\mathrm{rated}}$, the sample is classified as Ramp Up; when $\Delta P_{t} < -0.05\,P_{\mathrm{rated}}$, it is classified as Ramp Down; otherwise, it is classified as Non-Ramp. In addition, one day is divided into four periods: Morning (06:00–11:59), Afternoon (12:00–17:59), Evening (18:00–23:59), and Night (00:00–05:59). The comparative error results under these scenarios are presented in Figure 12, Figure 13 and Figure 14.
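The scenario definitions above translate directly into simple labeling rules. The sketch below is one possible implementation, with hypothetical function names. Two details are assumptions of this example rather than statements of the study: a sample exactly at 0.8 of rated capacity is assigned to the high-power interval (the stated medium and high intervals both include that boundary), and the first sample of a series, which has no predecessor, is labeled Non-Ramp.

```python
def power_interval(p, p_rated):
    """Label one power value as 'low', 'medium' or 'high'."""
    if p < 0.2 * p_rated:
        return "low"
    if p < 0.8 * p_rated:  # assumption: exactly 0.8 * p_rated counts as 'high'
        return "medium"
    return "high"


def ramp_types(power, p_rated, threshold=0.05):
    """Label each step of a power series by the change from the previous step."""
    limit = threshold * p_rated
    labels = []
    previous = None
    for current in power:
        # Assumption: the first sample has no predecessor, so its change is 0.
        delta = 0.0 if previous is None else current - previous
        if delta > limit:
            labels.append("Ramp Up")
        elif delta < -limit:
            labels.append("Ramp Down")
        else:
            labels.append("Non-Ramp")
        previous = current
    return labels


def day_period(hour):
    """Map an hour of day (0-23) to Night, Morning, Afternoon or Evening."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    return ("Night", "Morning", "Afternoon", "Evening")[hour // 6]


if __name__ == "__main__":
    series = [10.0, 50.0, 90.0, 88.0, 40.0]  # illustrative values, p_rated = 100
    print([power_interval(p, 100.0) for p in series])
    print(ramp_types(series, 100.0))
    print([day_period(h) for h in (0, 6, 12, 18)])
```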

As shown in Figure 12, the models exhibit clear performance differences across power intervals. In the low-power scenario, Transformer yields lower NMAE and NRMSE than LGBM, suggesting a relative advantage under low-output conditions in the present test set. In contrast, LGBM shows lower errors in the medium- and high-power intervals, indicating that it performs more favorably in those regimes within the current scenario-based analysis. These results indicate that model superiority is not consistent across power levels, and that a single overall metric cannot fully characterize scenario-specific applicability. A possible explanation is that the data characteristics differ substantially across power intervals. In the low-power region, the output is more dispersed, and the variations are relatively subtle, which may make Transformer’s sequence-context modeling more beneficial. By contrast, the medium- and high-power regions are more strongly associated with pronounced nonlinear response patterns in the wind speed–power relationship, under which LGBM may be better suited to approximating piecewise nonlinear mappings.

As shown in Figure 13, model performance also differs substantially across ramp event types. Overall, LGBM tends to yield lower errors than Transformer in both Ramp Up and Ramp Down scenarios, suggesting a relative advantage during rapid power transitions in the present test set. In contrast, Transformer attains lower average errors in the Non-Ramp scenario, indicating relatively better performance under smoother operating conditions. These results suggest that model performance varies across scenarios rather than being uniformly dominated by a single model. This scenario-dependent behavior is also consistent with the intrinsic difficulty of ramp forecasting. Ramp events involve sharper local transitions and stronger short-term nonstationarity, which place greater demands on a model’s ability to respond to abrupt nonlinear changes. In such cases, LGBM may benefit from its flexible partitioning of the feature space, whereas Transformer is more likely to perform well when the power trajectory is smoother and short-term variations are less abrupt.

As shown in Figure 14, the error levels of different models also vary across intraday periods. Overall, LGBM and Transformer remain in the leading group, although their relative advantages are not entirely consistent across different periods of the day. This suggests that model performance is influenced not only by power levels and ramp behavior, but also by the inflow conditions and operating characteristics associated with different time periods. The result further indicates that the performance of wind power forecasting models is highly context-dependent and that model superiority should not be judged in a generalized manner without considering specific scenarios.

Overall, the scenario-based analysis further shows that model selection should be aligned with specific application requirements. For applications that place greater emphasis on error control in medium- and high-power intervals as well as ramping processes, LGBM may be a preferable option under the present benchmark. In contrast, for applications that focus more on average error performance during low-power and relatively stable operating stages, Transformer may merit consideration. These findings further suggest that model comparison in day-ahead wind power forecasting for a single wind farm should not rely solely on overall average metrics, but should instead incorporate the error characteristics under different typical operating conditions.
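A scenario-based comparison of this kind amounts to computing the error metrics separately for each group of samples. The following sketch illustrates the idea under the assumption that NMAE and NRMSE are normalized by the rated capacity and expressed in percent; the function name and the toy numbers are hypothetical and are not data from this study.

```python
from collections import defaultdict
from math import sqrt


def scenario_errors(y_true, y_pred, labels, p_rated):
    """Return {label: (NMAE %, NRMSE %)} computed separately per scenario.

    Assumes both metrics are normalized by the rated capacity `p_rated`.
    """
    errors_by_label = defaultdict(list)
    for actual, predicted, label in zip(y_true, y_pred, labels):
        errors_by_label[label].append(predicted - actual)

    result = {}
    for label, errors in errors_by_label.items():
        mae = sum(abs(e) for e in errors) / len(errors)
        rmse = sqrt(sum(e * e for e in errors) / len(errors))
        result[label] = (100.0 * mae / p_rated, 100.0 * rmse / p_rated)
    return result


if __name__ == "__main__":
    # Toy example with made-up values.
    actual = [0.0, 0.0, 100.0, 100.0]
    predicted = [10.0, -10.0, 100.0, 120.0]
    scenario = ["low", "low", "high", "high"]
    for label, (nmae, nrmse) in scenario_errors(actual, predicted, scenario, 100.0).items():
        print(f"{label:5s} NMAE={nmae:.2f}%  NRMSE={nrmse:.2f}%")
```

Because scenario sizes can differ widely, it is worth reporting the number of samples per group alongside these metrics when comparing models.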

3.5. Computational Requirement Analysis

In addition to forecasting accuracy and temporal robustness, computational requirement is also an important factor for practical model deployment. Therefore, the computational characteristics of the compared models were analyzed in terms of final training time, test-set inference time, relative computational requirement, and deployment complexity. The detailed numerical results are provided in Appendix B, Table A7.

Overall, Ridge has the lowest computational requirement because of its simple linear structure and small number of tunable parameters. Of the two tree-based models, LGBM requires less final training time than XGBoost (0.563 s versus 2.060 s) and also achieves faster test-set inference (about 0.010 s versus 0.027 s) in the present experimental setting. Combined with its best hold-out NRMSE, this indicates that LGBM provides a favorable balance between forecasting accuracy, computational efficiency, and deployment simplicity. Therefore, it can be regarded as a strong engineering baseline for single-wind-farm day-ahead forecasting.

By contrast, DLinear, Transformer, and PatchTST require iterative neural-network training and are more sensitive to training-related hyperparameters, such as learning rate, batch size, number of epochs, dropout, and early stopping. Transformer and PatchTST show higher final training costs than the other models because of their attention-based temporal modeling structures. Although Transformer achieves better robustness under expanding-window rolling validation, its higher computational requirement should be considered in practical deployment. PatchTST has a more complex patch-based architecture, but its accuracy advantage is not clearly demonstrated under the current single-wind-farm setting. These results further indicate that model selection should jointly consider forecasting accuracy, temporal robustness, computational requirement, and deployment complexity.
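Table A7 reports test-set inference time as a mean with a spread. The measurement procedure is not described in this section, so the sketch below shows only one common way to obtain such a figure: repeating the prediction call and summarizing the wall-clock durations. The function name, the number of repeats, and the use of the sample standard deviation are assumptions of this example.

```python
import statistics
import time


def time_inference(predict, inputs, repeats=10):
    """Return (mean, sample std) of wall-clock seconds over repeated calls.

    `predict` is any callable taking `inputs`; the first call is included,
    so one-off warm-up effects are part of the measurement.
    """
    if repeats < 2:
        raise ValueError("repeats must be at least 2 to compute a spread")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(inputs)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


if __name__ == "__main__":
    # Stand-in for a fitted model's predict method.
    def dummy_predict(rows):
        return [sum(row) for row in rows]

    data = [[float(i), float(i + 1)] for i in range(10_000)]
    mean_s, std_s = time_inference(dummy_predict, data, repeats=10)
    print(f"inference time: {mean_s:.4f} ± {std_s:.4f} s")
```

Timings obtained this way depend on the hardware and system load, which is why they are best read as relative indications across models rather than absolute costs.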

4. Conclusions

This study demonstrates that, for day-ahead wind power forecasting at a single wind farm, model superiority cannot be determined only by a single test-set result or by model structural complexity. The results are consistent with the initial hypothesis of this work: under strict chronological splitting and unified hyperparameter tuning conditions, a fair benchmarking framework can reveal differences in model applicability that may be hidden by default-parameter comparisons or by a single fixed test split. Therefore, the developed benchmarking framework meets the original expectation of providing a more reliable basis for model comparison and engineering model selection.

The main findings can be summarized as follows. First, under the fixed hold-out split, LGBM achieved the lowest NRMSE of 10.2326%, while Transformer obtained the lowest NMAE of 6.9944%, indicating that strong tree-based models remain highly competitive when the input information mainly consists of structured NWP-derived meteorological and temporal features. Second, under the expanding-window rolling validation, Transformer achieved the lowest average NRMSE of 8.1684%, showing that the best model on a single fixed test interval is not necessarily the most robust model across multiple temporal windows. Third, the scenario-based analysis indicates that no single model is uniformly optimal across all operating states, and model selection should therefore consider the target application scenario rather than relying only on overall average accuracy.

These findings have practical implications for wind power forecasting applications. For the current single-wind-farm day-ahead forecasting practice, LGBM can be regarded as a strong engineering baseline when overall accuracy, computational efficiency, and deployment simplicity are prioritized. Transformer may be more suitable when temporal robustness is the main concern and sufficient computational resources are available. More broadly, the proposed benchmarking framework can support future industrial model screening by jointly considering forecasting accuracy, temporal robustness, scenario-dependent behavior, tuning sensitivity, and computational feasibility.

This study also has several limitations. First, the experiments were conducted on a single wind farm, and the generalizability of the conclusions across different wind farms, terrains, and climate regimes still requires further verification. Second, historical observed power and lagged SCADA variables were not used as autoregressive inputs, so the conclusions mainly apply to the adopted exogenous-feature-based forecasting setting. Third, this work focuses on deterministic point forecasting, while probabilistic forecasting, uncertainty quantification, and statistical significance testing were not fully incorporated. Fourth, although basic computational profiling was added, it was conducted under a single hardware environment and mainly focused on final training time and test-set inference time. Future work should extend the benchmark to multi-wind-farm and cross-regional scenarios, incorporate autoregressive and uncertainty-aware forecasting settings, and further record complete hyperparameter-optimization wall-clock time, memory consumption, and energy cost under different deployment environments.

Author Contributions

Conceptualization, J.L.; methodology, J.L.; software, J.L. and G.Z.; validation, J.L.; formal analysis, J.L.; investigation, J.L.; data curation, J.L.; writing – original draft preparation, J.L.; writing – review and editing, J.L. and Y.L.; supervision, Y.L. and G.Z.; project administration, Y.L. and G.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw wind farm operational data used in this study were obtained from the publicly available Sotavento wind farm website, and the NWP data were obtained from the Meteogalicia numerical weather prediction system. The processed datasets and code generated during this study are available from the corresponding author upon reasonable request. The processed datasets and code have not been deposited in a public repository because they contain intermediate preprocessing outputs and experiment-specific file structures that require additional documentation for reuse.

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

Abbreviations
Term	Description
SCADA	Supervisory Control and Data Acquisition
NWP	Numerical Weather Prediction
MAE	Mean Absolute Error
RMSE	Root Mean Square Error
NMAE	Normalized Mean Absolute Error
NRMSE	Normalized Root Mean Square Error
HPO	Hyperparameter Optimization
GBDT	Gradient Boosting Decision Tree
Optuna	Automated hyperparameter optimization framework
Model names
Term	Description
Ridge	Ridge regression
XGBoost	Extreme Gradient Boosting
LGBM	Light Gradient Boosting Machine
DLinear	Decomposition-Linear model
Transformer	Attention-based sequence forecasting model
PatchTST	Patch-based Time Series Transformer
Symbols
Symbol	Description
P_r	Rated capacity of the wind farm
y_t	Actual wind power at time step t
${\hat{y}}_{t}$	Predicted wind power at time step t
N	Number of samples
T	Number of forecasting time steps
x_t	Input feature vector at time step t
X	Input sequence
Y	Target wind power sequence
θ	Model parameter or hyperparameter vector
α	Regularization coefficient
Ω(⋅)	Regularization term or penalty function
$F$	Function space of tree models

Appendix A

Appendix A reports the final hyperparameter settings of the compared models. The detailed configurations of Ridge, XGBoost, LGBM, DLinear, Transformer, and PatchTST are listed in Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6, respectively. These settings correspond to the final hyperparameter configurations determined in the model development process and used in the experiments presented in the main text.

Table A1. Hyperparameter Settings of the Ridge Model.

Parameter	Value
alpha	949.1476728951529
fit_intercept	True
random_state	42

Table A2. Hyperparameter Settings of the XGBoost Model.

Parameter	Value
objective	reg:squarederror
n_estimators	2000
learning_rate	0.02142387495644906
max_depth	3
min_child_weight	3.79884089544096
subsample	0.7300733288106989
colsample_bytree	0.8918424713352255
reg_alpha	0.05522729957780637
reg_lambda	1.475659183168782
gamma	0.9444298503238986
random_state	42
n_jobs	-1

Table A3. Hyperparameter Settings of the LGBM Model.

Parameter	Value
objective	regression_l1
n_estimators	2700
learning_rate	0.029693988282159984
num_leaves	47
max_depth	6
min_child_samples	45
subsample	0.6526817485886734
subsample_freq	1
colsample_bytree	0.9351448536945536
reg_alpha	0.821113302724886
reg_lambda	0.10935892559063892
random_state	42
verbose	-1
n_jobs	-1
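For readers who wish to reuse the settings in Table A3, the sketch below collects them in a dictionary whose keys follow the parameter names of LightGBM's scikit-learn interface. Whether the study used that interface is an assumption of this example, and the helper `build_lgbm` is hypothetical. The model is constructed lazily so that the dictionary can be inspected without the `lightgbm` package installed.

```python
# Final LGBM settings copied from Table A3.
LGBM_PARAMS = {
    "objective": "regression_l1",
    "n_estimators": 2700,
    "learning_rate": 0.029693988282159984,
    "num_leaves": 47,
    "max_depth": 6,
    "min_child_samples": 45,
    "subsample": 0.6526817485886734,
    "subsample_freq": 1,
    "colsample_bytree": 0.9351448536945536,
    "reg_alpha": 0.821113302724886,
    "reg_lambda": 0.10935892559063892,
    "random_state": 42,
    "verbose": -1,
    "n_jobs": -1,
}


def build_lgbm(params=LGBM_PARAMS):
    """Create an unfitted regressor (assumes the optional lightgbm package)."""
    from lightgbm import LGBMRegressor  # lazy import: only needed when called

    return LGBMRegressor(**params)


if __name__ == "__main__":
    # A depth-6 tree can hold at most 2**6 = 64 leaves, so num_leaves = 47 fits.
    print(LGBM_PARAMS["num_leaves"] <= 2 ** LGBM_PARAMS["max_depth"])
```

The `regression_l1` objective minimizes absolute error during training.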

Table A4. Hyperparameter Settings of the DLinear Model.

Parameter	Value
kernel_size	37
epochs	100
batch_size	16
lr	0.0032691242922590204
weight_decay	0.000411759959226288
patience	15

Table A5. Hyperparameter Settings of the Transformer Model.

Parameter	Value
d_model	64
nhead	4
n_layers	2
dim_ff	256
dropout	0.1294521467686548
epochs	90
batch_size	16
lr	0.0029940214978875306
weight_decay	0.00018817995066160116
patience	15

Table A6. Hyperparameter Settings of the PatchTST Model.

Parameter	Value
patch_len	8
stride	4
d_model	128
n_heads	2
n_layers	3
dim_ff	128
dropout	0.16135258446029016
epochs	120
batch_size	32
lr	0.001973200925804034
weight_decay	0.0012092881991241204
patience	10

Appendix B

Table A7. Computational requirements of the compared models.

Model	Final Training Time	Test Inference Time	Relative Requirement	Deployment Complexity
Ridge	0.0047 s	0.0004 ± 0.0001 s	Very low	Very low
XGBoost	2.060 s	0.0273 ± 0.0005 s	Medium	Medium
LGBM	0.563 s	0.0101 ± 0.0009 s	Low–medium	Low
DLinear	5.629 s	0.0079 ± 0.0008 s	Medium	Medium
Transformer	8.894 s	0.0161 ± 0.0011 s	High	High
PatchTST	8.945 s	0.0218 ± 0.0004 s	High	High

Figure 1. Overall research framework and technical route.

Figure 2. Time spans of the folds in the 8-fold expanding-window rolling validation with fixed validation and test windows.

Figure 3. Scatter plot of wind speed versus actual power.

Figure 4. Schematic diagram of the XGBoost model.

Figure 5. Schematic diagram of the LGBM model.

Figure 6. Schematic diagram of the DLinear model. The blue curves denote the input and forecast time series, the orange curves and blocks denote the trend component and its corresponding linear layer, and the green curves and blocks denote the seasonal component and its corresponding linear layer.

Figure 7. Schematic diagram of the Transformer model.

Figure 8. Schematic diagram of the PatchTST model. The colored blocks indicate the main processing stages, including input series segmentation, patch embedding and positional encoding, Transformer encoder-based feature extraction, prediction head mapping, and final power output generation.

Figure 9. Performance comparison of different models under default and tuned conditions.

Figure 10. NMAE error distribution of different models under rolling validation. The yellow line inside each box denotes the median value.

Figure 11. NRMSE error distribution of different models under rolling validation. The yellow line inside each box denotes the median value.

Figure 12. Prediction error comparison of different models under different power intervals.

Figure 13. Prediction error comparison of different models under different ramp event types.

Figure 14. Prediction error comparison of different models under different intraday periods.

Table 1. Performance comparison of different models on the test set under the hold-out split.

Metric	Ridge	LGBM	XGBoost	Transformer	DLinear	PatchTST
Model category	Statistical	Machine learning	Machine learning	Deep learning	Deep learning	Deep learning
MAE	1781.2952	1230.3816	1464.7739	1228.2209	1330.8196	1523.1902
NMAE (%)	10.1441	7.0067	8.3415	6.9944	7.5787	8.6742
RMSE	2248.6301	1796.8383	1990.9658	1873.0940	1950.8299	2096.6273
NRMSE (%)	12.8054	10.2326	11.3381	10.6688	11.1095	11.9398

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
