Learning Residual Distributions with Diffusion Models for Probabilistic Wind Power Forecasting

Chen, Fuhao; Gao, Linyue

doi:10.3390/en18164226


by Fuhao Chen and Linyue Gao *

Department of Mechanical Engineering, University of Colorado Denver, Denver, CO 80204, USA

* Author to whom correspondence should be addressed.

Energies 2025, 18(16), 4226; https://doi.org/10.3390/en18164226

Submission received: 30 June 2025 / Revised: 31 July 2025 / Accepted: 7 August 2025 / Published: 8 August 2025

(This article belongs to the Section A3: Wind, Wave and Tidal Energy)


Abstract

Accurate and uncertainty-aware wind power forecasting is essential for reliable and cost-effective power system operations. This paper presents a novel probabilistic forecasting framework based on diffusion probabilistic models. We adopted a two-stage modeling strategy—a deterministic predictor first generates baseline forecasts, and a conditional diffusion model then learns the distribution of residual errors. Such a two-stage decoupling strategy improves learning efficiency and sharpens uncertainty estimation. We employed the elucidated diffusion model (EDM) to enable flexible noise control and enhance calibration, stability, and expressiveness. For the generative backbone, we introduced a time-series-specific diffusion Transformer (TimeDiT) that incorporates modular conditioning to separately fuse numerical weather prediction (NWP) inputs, noise, and temporal features. The proposed method was evaluated using the public database from ten wind farms in the Global Energy Forecasting Competition 2014 (GEFCom2014). We further compared our approach with two popular baseline models, i.e., a distribution parameter regression model and a generative adversarial network (GAN)-based model. Results showed that our method consistently achieves superior performance in both deterministic metrics and probabilistic accuracy, offering better forecast calibration and sharper distributions.

Keywords:

wind power forecasting; probabilistic forecasting; uncertainty-aware forecasting; diffusion model; residual modeling

1. Introduction

Wind power forecasting is essential for maintaining the reliability and economic efficiency of power systems with high shares of renewable energy [1]. It supports key operations such as grid stability, market participation, and reserve scheduling [2]. While deterministic forecasting methods provide single-point estimates of future wind power output, they often fail to account for the inherent uncertainty in wind behavior and limitations of numerical weather prediction (NWP) models [3]. This lack of uncertainty representation restricts their value in risk-sensitive applications like real-time dispatch or energy trading [4].

Probabilistic forecasting addresses this limitation by offering a full distribution of potential outcomes rather than a single estimate [5]. This enables power system operators and market participants to assess forecast confidence, quantify operational risks, and make more robust decisions under uncertainty [6]. In practice, probabilistic forecasts are particularly valuable in applications such as reserve allocation and energy bidding, where understanding both expected outcomes and associated uncertainties is crucial [7]. As the share of wind energy increases, the need for such uncertainty-aware forecasting becomes even more critical to ensure secure and cost-effective system operation [8].

Probabilistic wind power forecasting (PWPF) can be categorized by prediction horizons, with short-term forecasting—typically covering the next 24 h—being the most directly applicable to real-time operations [9]. Short-term PWPF often relies on NWP data as major inputs to model the relationship between weather conditions and future power output [10]. Due to the dynamic variability in atmospheric conditions and the evolving state of the power system, modeling this relationship remains a complex yet impactful research problem. Moreover, short-term PWPF holds distinct engineering value by balancing forecast accuracy and uncertainty quantification, aligning well with operational timescales for power system planning and decision-making [11]. Given its high practical relevance and technical challenges, short-term PWPF forms the core focus of this study.

In response to the limitations of traditional statistical methods (e.g., quantile regression [5], kernel density estimation [12], and Gaussian process regression [13]), machine learning (ML)-based approaches have become the dominant paradigm in PWPF. These data-driven methods are capable of modeling nonlinear relationships, capturing high-dimensional spatiotemporal features from NWP inputs, and scaling to large datasets in operational environments. Within this modern framework, three major model categories have emerged: (1) distribution-based approaches that predict parameters of assumed probability distributions [14], (2) generative adversarial network (GAN)-based methods that approximate the conditional distribution via adversarial training [15], and (3) diffusion probabilistic models that generate forecast samples through a learned reverse stochastic process [16]. These categories reflect different trade-offs in modeling assumptions, generative flexibility, and computational behavior, as compared in Table 1.

The first category (i.e., distribution-based methods) assumes that future wind power outputs follow a predefined probability distribution—commonly Gaussian, Beta, Gamma, Weibull, or a mixture of Gaussians. A neural network is trained to predict the parameters (e.g., mean and variance) of the chosen distribution based on historical and NWP-derived features. Sampling from the resulting distribution yields the probabilistic forecast. While this approach is conceptually simple and computationally efficient, its performance is inherently limited by the correctness of the distributional assumption. Moreover, such models often struggle to represent complex, multimodal, or skewed uncertainty patterns that frequently arise in real-world wind power scenarios.

The second category (GAN-based methods [17]) avoids explicit distributional assumptions by learning to model the entire conditional data distribution. A generator takes both known inputs and injected noise to produce plausible future power sequences, while a discriminator distinguishes real from generated sequences. Through adversarial training, the generator learns to produce outputs that closely resemble the true data distribution. Once trained, the generator enables probabilistic forecasting by sampling multiple trajectories from different noise inputs. Despite their theoretical flexibility, GAN-based models suffer from practical challenges, including mode collapse, unstable optimization, and difficulties in capturing temporal consistency and calibrated uncertainty, limiting their robustness in high-stakes forecasting tasks.

The third and increasingly prominent category involves diffusion probabilistic models, which offer a new framework for generative modeling by learning to reverse a multi-step stochastic corruption process [18]. In this setup, a forward process gradually adds noise to training data, while a neural network is trained to learn the reverse denoising process. For wind power forecasting, diffusion models generate future trajectories conditioned on past observations and NWP features by progressively transforming noise into realistic outputs. Unlike previous approaches, diffusion models do not rely on restrictive distributional assumptions and naturally support flexible, multimodal generation. They also exhibit stable training dynamics and produce well-calibrated outputs across forecast horizons [19]. These advantages make diffusion models especially suited for representing the diverse and uncertain nature of short-term wind power evolution.

Diffusion-based generative modeling has progressed rapidly in recent years. The denoising diffusion probabilistic model (DDPM) [19] brought diffusion models into the spotlight by framing data generation as a progressive denoising process. Although effective, the DDPM requires hundreds or even thousands of sampling steps due to its discrete-time formulation, resulting in significant computational overhead. Concurrently, score-based models [20] adopted Langevin dynamics to guide the denoising process using learned gradients of the data distribution. While these models improved sample quality, they remained bound to inefficient discrete-time sampling. To alleviate this issue, subsequent work formulated diffusion in the continuous-time domain using stochastic differential equations (SDEs) [18], enabling the integration of high-order ordinary differential equation (ODE)/SDE solvers to accelerate inference without compromising generation quality. Building on this, the flow-matching [21] framework proposed to model the transformation from prior to data distribution as a continuous velocity field, bypassing score estimation altogether and offering a new perspective rooted in probability transport.

A major milestone was the introduction of the elucidated diffusion model (EDM) [22], which systematically unified the above approaches under a general theoretical framework. The EDM demonstrated that prior methods could be interpreted through a common lens based on noise scaling and training objectives, thereby clarifying their implicit connections. On top of this unified view, the EDM proposed a simplified and highly effective parameterization of the diffusion trajectory, yielding improved stability, faster convergence, and better sample fidelity. It has since become a leading approach in diffusion-based modeling, delivering state-of-the-art results across multiple domains while significantly reducing sampling steps [23]. This makes the EDM not only theoretically elegant but also practically advantageous for real-world applications that demand both accuracy and efficiency—such as probabilistic forecasting in energy systems.

This study introduces ResD-PWPF, a novel residual diffusion-based probabilistic wind power forecasting framework. Built upon EDM, ResD-PWPF brings key improvements over existing approaches in terms of modeling formulation, diffusion strategy, and architectural design. Specifically, our approach differs from prior work in three main aspects:

(1) Modeling strategy: Rather than directly predicting the probability distribution of future wind power values, we adopt a two-stage approach that decouples the forecasting task into deterministic prediction and probabilistic error modeling. A deterministic model first produces the baseline forecast, and a conditional diffusion model is then used to estimate the distribution of the residual error between the prediction and the ground truth. This strategy allows the diffusion model to focus solely on modeling uncertainty, leading to more efficient learning, sharper uncertainty characterization, and improved interpretability [24].

(2) Diffusion framework: We adopt the EDM [22] in place of the more commonly used DDPM [19]. The EDM offers continuous noise scale control, improved conditioning via $σ$-parameterization, and a more stable training process. These features enhance sampling quality, enable better calibration of forecast distributions, and allow for more expressive modeling of complex uncertainty structures, all of which are particularly beneficial for short-term wind power forecasting.

(3) Architecture design: For the conditional generation network, we adopt the time diffusion Transformer (TimeDiT) [25,26], a recent architecture designed for time-series generative modeling. Unlike conventional methods that fuse known features, noise, and timestep embeddings via simple concatenation, TimeDiT introduces a modular conditioning mechanism. Specifically, known inputs, noise vectors, and temporal encodings are separately processed through modulation networks, which inject scale and shift parameters into different Transformer layers. This structured integration improves the network’s ability to hierarchically fuse diverse information sources, leading to more accurate and calibrated probabilistic forecasts.

To validate the effectiveness of the proposed ResD-PWPF, we perform extensive experiments on open-source data from ten wind farms provided by the Global Energy Forecasting Competition 2014 [27] (GEFCom2014). The proposed ResD-PWPF is compared against two baseline approaches: a parameter-prediction model based on predefined distributional assumptions, and a GAN-based generative model. For the deterministic component of the forecast, we evaluate mean value prediction accuracy using root mean square error (RMSE) and mean absolute error (MAE). For probabilistic forecasts, we use the continuous ranked probability score (CRPS), which measures the quality of the predicted distribution by jointly assessing calibration (how well the predicted probabilities reflect observed frequencies) and sharpness (the concentration of the distribution). To further examine whether the observed performance differences between models are statistically meaningful, we apply the Wilcoxon signed-rank test (WSRT), a non-parametric test that evaluates paired differences without assuming a specific distribution. Experimental results show that the proposed ResD-PWPF consistently outperforms both baselines across all evaluation metrics, confirming its superior ability to capture both the central tendency and the uncertainty of future wind power outputs.

The remainder of this paper is structured as follows. Section 2 outlines the overall workflow of the proposed ResD-PWPF framework. Section 3 elaborates on the EDM-based diffusion formulation tailored for time-series applications. Section 4 describes the architecture of the TimeDiT model and its design principles. Section 5 presents the experimental setup and case study results based on the GEFCom2014 dataset. Section 6 concludes the paper and discusses future perspectives for advancing the field.

2. Workflow of the ResD-PWPF

The overall workflow of the proposed ResD-PWPF framework is illustrated in Figure 1, which consists of three steps to generate the final PWPF results.

Step 1: A deterministic power forecasting model is first applied to generate point forecasts based on NWP inputs. This process is formulated as Equation (1), where $F_{D P}$ denotes the deterministic predictor, $n w p \in R^{S \times C_{n w p}}$ represents the NWP input sequence ( $S$ represents the overall length of the sequence, while $C_{n w p}$ indicates the dimensionality of the features), and ${\tilde{y}}_{D P} \in R^{S \times 1}$ denotes the resulting deterministic wind power forecasting results.

{\tilde{y}}_{D P} = F_{D P} (n w p)

(1)

Step 2: This step models the error uncertainty in the deterministic predictions. Specifically, given both the NWP inputs and the deterministic forecasts, a residual modeling module learns the distribution of forecasting errors at each time step. As shown in Equation (2), $F_{P P}$ refers to the probabilistic residual prediction model, while ${\tilde{μ}}_{P P} \in R^{S \times 1}$ and ${\tilde{σ}}_{P P} \in R^{S \times 1}$ represent the mean and standard deviation of the residual distribution, respectively.

[{\tilde{μ}}_{P P}, {\tilde{σ}}_{P P}] = F_{P P} (n w p, {\tilde{y}}_{D P})

(2)

Step 3: The deterministic forecasts are combined with the estimated residual distributions to obtain the full probabilistic forecasts. The average value of the resulting predictive distribution ${\tilde{y}}_{P P} \in R^{S \times 1}$ is given by Equation (3), and its standard deviation directly corresponds to the residual uncertainty estimated in Step 2.

{\tilde{y}}_{P P} = {\tilde{y}}_{D P} + {\tilde{μ}}_{P P}

(3)
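Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the residual model is queried by drawing an ensemble of residual trajectories (as described in Section 3.2), and the function name `combine_forecast` and all numbers are hypothetical.

```python
import statistics

def combine_forecast(y_dp, residual_samples):
    """Combine a deterministic forecast with sampled residual trajectories.

    y_dp: list of S deterministic forecasts (Equation (1)).
    residual_samples: list of M residual trajectories, each of length S.
    Returns (y_pp, sigma_pp): predictive mean and standard deviation per step.
    """
    columns = list(zip(*residual_samples))                  # regroup samples by time step
    mu_pp = [statistics.fmean(col) for col in columns]      # residual mean, Equation (2)
    sigma_pp = [statistics.pstdev(col) for col in columns]  # residual spread, Equation (2)
    y_pp = [y + m for y, m in zip(y_dp, mu_pp)]             # Equation (3)
    return y_pp, sigma_pp

# Toy inputs (made up for illustration): S = 3 time steps, M = 4 residual samples.
y_dp = [0.40, 0.55, 0.60]
residual_samples = [
    [0.02, -0.01, 0.00],
    [0.04, -0.03, 0.00],
    [0.02, -0.01, 0.00],
    [0.04, -0.03, 0.00],
]
y_pp, sigma_pp = combine_forecast(y_dp, residual_samples)
```

In this toy case the residual mean shifts the first forecast up and the second one down, while the third step has zero residual spread and therefore keeps the deterministic value unchanged.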

As shown in the workflow above, Step 3 simply combines the deterministic forecasts and the estimated residual distributions to produce the final probabilistic prediction. Therefore, the core modeling efforts of this study are concentrated in the first two stages, corresponding to deterministic modeling and uncertainty (residual) modeling, respectively. In this study, in Step 1—deterministic modeling—we adopt one of the most widely used and effective strategies for short-term wind power forecasting: using a Transformer [28] encoder to map NWP sequences to wind power series. The model is configured with a hidden dimension of 128 and consists of two encoder layers. Each layer uses the GELU activation function and employs 8 self-attention heads. During training, the Adam optimizer is used with mean squared error (MSE) as the loss function. Step 2 is centered on two key aspects: the training and inference strategy of the diffusion model (i.e., EDM), and the architectural design of the model itself (i.e., TimeDiT), which are discussed in detail in Section 3 and Section 4.

The proposed two-stage decomposition strategy—comprising an initial point prediction model followed by residual modeling—is theoretically justified even when the residuals are not purely stochastic. In real-world forecasting tasks, especially in complex systems such as wind power, the initial model may fail to capture all temporal dependencies, nonlinearities, or regime-specific behaviors, leaving behind structured and learnable residual patterns. In such cases, the residual component is not merely noise but also contains additional predictive signals.

This decomposition aligns with the principle of functional approximation, where a complex mapping is incrementally approximated by simpler components. The first-stage model captures the dominant signal, while the second-stage model acts as a corrective mechanism to refine prediction accuracy. Furthermore, from a statistical perspective, the two-stage approach can be interpreted through bias-variance decomposition [24]: the first stage aims to reduce bias, and the residual stage reduces variance by modeling finer deviations. This layered refinement strategy enhances both expressiveness and generalization, especially when modeling uncertainty or adapting to distributional shifts.

3. Training and Inference Formulation Using EDM

Figure 2 illustrates the basic principle of applying diffusion models to time-series forecasting. The framework consists of two main stages: the forward diffusion process and the reverse denoising process. In the forward process, the original time series $x_{0}$ is gradually perturbed by adding Gaussian noise, eventually becoming a noisy version $x_{T}$ that approximately follows an isotropic Gaussian distribution. In the reverse process, we start by sampling an initial noisy sequence ${\tilde{x}}_{T}$ from the Gaussian distribution. This sequence is then progressively denoised using a trained diffusion model to generate a predicted sequence ${\tilde{x}}_{0}$. In this study, $x_{0}$ specifically represents the residual between the deterministic forecast and the actual wind power output.

Since the initial noise ${\tilde{x}}_{T}$ is sampled stochastically, each denoising trajectory can lead to a different outcome ${\tilde{x}}_{0}$. By generating multiple predictions with different noise samples, we can estimate the distribution of future sequences, thereby achieving probabilistic time-series forecasting.

A central focus in the design of diffusion models lies in formulating an effective forward noise schedule and enhancing the model’s ability to learn the reverse denoising process. In this work, we adopt the EDM [22] as the core framework. Compared to conventional diffusion models, the EDM offers a more flexible noise scaling mechanism, enabling stable training across a wide range of noise levels. It also improves the precision and expressiveness of the output generated. Furthermore, the EDM demonstrates superior sampling efficiency and numerical robustness, making it particularly well-suited for PWPF.

3.1. Modeling Training in EDM

During training, the EDM abandons the discrete time-step-based noise scheduling of the traditional DDPM and instead samples a continuous noise level, enabling a more flexible and generalized forward diffusion strategy. Given a clean data sample $x_{0}$, the noisy input $x_{σ}$ is generated as in Equation (4) [22]. Here, $σ$ represents the noise scale, sampled from a log-uniform distribution that spans a broad range from low to high noise, and $ϵ$ is standard Gaussian noise with the same shape as $x_{0}$.

x_{σ} = x_{0} + σ \cdot ϵ

(4)
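The forward perturbation in Equation (4) can be sketched as follows. This is only an illustration: the bounds of the log-uniform distribution are not stated for training, so the sketch borrows 0.002 and 80 as an assumption, and the function names and the toy residual sequence are hypothetical.

```python
import math
import random

def sample_sigma(rng, sigma_min=0.002, sigma_max=80.0):
    """Draw a noise level whose logarithm is uniform on [log sigma_min, log sigma_max].

    The default bounds are an assumption made for this sketch.
    """
    return math.exp(rng.uniform(math.log(sigma_min), math.log(sigma_max)))

def add_noise(x0, sigma, rng):
    """Equation (4): x_sigma = x_0 + sigma * eps, with eps drawn from N(0, I)."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in x0]

rng = random.Random(0)
x0 = [0.03, -0.02, 0.00, 0.01]       # toy residual sequence (made-up values)
sigma = sample_sigma(rng)            # one noise level per training sample
x_sigma = add_noise(x0, sigma, rng)  # noisy input passed to the denoiser
```

Sampling in log space gives small and large noise levels comparable coverage, which a uniform draw on the raw scale would not.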

The EDM aims to train a denoising model $F_{θ}$ that predicts the original sequence ${\tilde{x}}_{0}$ from the current noisy sequence $x_{σ}$ and the corresponding noise level $σ$, as given in Equation (5) [22].

{\tilde{x}}_{0} = F_{θ} (x_{σ}, σ)

(5)

During model training, this study employs the MSE as a loss function to encourage the predicted original sequence ${\tilde{x}}_{0}$ to be as close as possible to the ground truth $x_{0}$. The model parameters are optimized using the Adam optimizer to ensure efficient and stable convergence.

To enhance prediction stability across varying noise levels, the EDM does not directly use the raw output of the neural network as the final prediction. Instead, it introduces three scale coefficients during training and inference [22]: the skip scale factor $c_{s k i p} (σ)$, the input scale factor $c_{i n} (σ)$, and the output scale factor $c_{o u t} (σ)$, which are used to rescale the input and output before combining them. Assuming the original neural network is denoted as $f_{θ}$, the expression for $F_{θ}$ is given in Equation (6). The three scaling coefficients are computed as shown in Equations (7)–(9), where $σ_{d a t a}$ represents the standard deviation of the dataset. To further stabilize training across different noise levels, the EDM also incorporates a loss weighting coefficient $λ (σ)$, which adjusts the contribution of each training sample based on its noise scale. The weight coefficient is defined as in Equation (10).

F_{θ} (x_{σ}, σ) = c_{s k i p} (σ) \cdot x_{σ} + c_{o u t} (σ) \cdot f_{θ} [c_{i n} (σ) \cdot x_{σ}, σ]

(6)

c_{s k i p} (σ) = \frac{{σ_{d a t a}}^{2}}{σ^{2} + {σ_{d a t a}}^{2}}

(7)

c_{i n} (σ) = \frac{1}{\sqrt{σ^{2} + {σ_{d a t a}}^{2}}}

(8)

c_{o u t} (σ) = \frac{σ \cdot σ_{d a t a}}{\sqrt{σ^{2} + {σ_{d a t a}}^{2}}}

(9)

λ (σ) = \frac{σ^{2} + {σ_{d a t a}}^{2}}{{(σ \cdot σ_{d a t a})}^{2}}

(10)
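Equations (6)–(10) translate directly into code. The sketch below is illustrative only: the raw network is replaced by a hypothetical stand-in that outputs zeros, the value of `sigma_data` is an arbitrary choice, and the helper names are not from the paper.

```python
import math

def c_skip(sigma, sigma_data):
    return sigma_data**2 / (sigma**2 + sigma_data**2)                # Equation (7)

def c_in(sigma, sigma_data):
    return 1.0 / math.sqrt(sigma**2 + sigma_data**2)                 # Equation (8)

def c_out(sigma, sigma_data):
    return sigma * sigma_data / math.sqrt(sigma**2 + sigma_data**2)  # Equation (9)

def loss_weight(sigma, sigma_data):
    return (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2    # Equation (10)

def denoise(f_theta, x_sigma, sigma, sigma_data):
    """Equation (6): wrap the raw network f_theta into the denoiser F_theta."""
    scaled = [c_in(sigma, sigma_data) * x for x in x_sigma]
    raw = f_theta(scaled, sigma)
    return [c_skip(sigma, sigma_data) * x + c_out(sigma, sigma_data) * r
            for x, r in zip(x_sigma, raw)]

def weighted_mse(x0_hat, x0, sigma, sigma_data):
    """MSE between the prediction and the clean sequence, weighted by lambda(sigma)."""
    mse = sum((a - b) ** 2 for a, b in zip(x0_hat, x0)) / len(x0)
    return loss_weight(sigma, sigma_data) * mse

# Hypothetical stand-in for the trained network: always outputs zeros.
zero_net = lambda scaled, sigma: [0.0] * len(scaled)
x_hat = denoise(zero_net, [0.2, -0.1], sigma=0.5, sigma_data=0.5)
```

Two properties follow from the formulas themselves: $c_{i n} (σ)$ rescales the noisy input to unit variance when the data have standard deviation $σ_{d a t a}$, and $λ (σ) \cdot {c_{o u t} (σ)}^{2} = 1$, so the weighted loss measured on the raw network output has the same scale at every noise level.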

3.2. Sample Inference in EDM

In the inference phase, the EDM supports two types of sampling strategies: ODE inference and SDE inference [22]. In the context of the current PWPF task, the performance difference between these two approaches is negligible. Therefore, this study adopts ODE-based inference as the prediction strategy, specifically implemented using Heun’s second-order method; the corresponding pseudocode is presented in Table 2 [22]. Here, $N$ denotes the number of sampling steps, which is set to 10 in this case. The noise level $σ_{i}$ used at each sampling step is determined according to Equation (11), where the parameters $σ_{m a x}$, $σ_{m i n}$, and $ρ$ are set to 80, 0.002, and 7, respectively.

σ_{i} = {[{σ_{m a x}}^{\frac{1}{ρ}} + \frac{N - i}{N - 1} ({σ_{m i n}}^{\frac{1}{ρ}} - {σ_{m a x}}^{\frac{1}{ρ}})]}^{ρ}

(11)
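As a quick check of Equation (11), the schedule can be evaluated with the stated settings. This is a small sketch with a hypothetical function name; under the indexing of Equation (11), $i = 1$ gives $σ_{m i n}$ and $i = N$ gives $σ_{m a x}$.

```python
def edm_sigma_schedule(n_steps=10, sigma_max=80.0, sigma_min=0.002, rho=7.0):
    """Evaluate Equation (11) for i = 1, ..., N and return [sigma_1, ..., sigma_N]."""
    hi = sigma_max ** (1.0 / rho)
    lo = sigma_min ** (1.0 / rho)
    return [
        (hi + (n_steps - i) / (n_steps - 1) * (lo - hi)) ** rho
        for i in range(1, n_steps + 1)
    ]

sigmas = edm_sigma_schedule()
for i, s in enumerate(sigmas, start=1):
    print(f"sigma_{i} = {s:.4g}")
```

Because the interpolation is linear in $σ^{1 / ρ}$ rather than in $σ$, a large $ρ$ places most of the steps at low noise levels, where the fine details of the trajectory are resolved.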

To generate probabilistic forecasts, the EDM begins by sampling an initial random noise sequence from a standard Gaussian distribution. This noise is then transformed through the ODE-based denoising process, ultimately producing a single trajectory representing the forecast error curve for wind power at each time step. By repeating this process with multiple independently sampled noise inputs, the model generates an ensemble of error trajectories. Statistical aggregation of these trajectories—specifically, computing the mean and standard deviation at each time point—yields a time-dependent probabilistic distribution of wind power forecasting errors.
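The sampling loop described above can be sketched as below. It is not a reproduction of Table 2: it follows the general EDM Heun scheme under stated assumptions, namely that the initial noise is scaled by the largest noise level, that a final level $σ_{0} = 0$ is appended, and that the trained network is replaced by a hypothetical toy denoiser returning a fixed target trajectory.

```python
import random

def edm_sigma_schedule(n_steps=10, sigma_max=80.0, sigma_min=0.002, rho=7.0):
    """Equation (11): returns [sigma_1, ..., sigma_N] in ascending order."""
    hi, lo = sigma_max ** (1.0 / rho), sigma_min ** (1.0 / rho)
    return [(hi + (n_steps - i) / (n_steps - 1) * (lo - hi)) ** rho
            for i in range(1, n_steps + 1)]

def heun_sample(denoiser, length, sigmas, rng):
    """Generate one trajectory with Heun's second-order ODE solver."""
    levels = [0.0] + list(sigmas)                 # assumed final level sigma_0 = 0
    # Assumption: initial noise is scaled by the largest noise level (EDM convention).
    x = [levels[-1] * rng.gauss(0.0, 1.0) for _ in range(length)]
    for i in range(len(levels) - 1, 0, -1):
        s_cur, s_next = levels[i], levels[i - 1]
        den = denoiser(x, s_cur)
        d = [(xi - di) / s_cur for xi, di in zip(x, den)]             # ODE slope at s_cur
        x_euler = [xi + (s_next - s_cur) * di for xi, di in zip(x, d)]
        if s_next > 0.0:
            den2 = denoiser(x_euler, s_next)
            d2 = [(xe - di) / s_next for xe, di in zip(x_euler, den2)]
            # Second-order correction: average the slopes at both ends of the step.
            x = [xi + (s_next - s_cur) * 0.5 * (a + b)
                 for xi, a, b in zip(x, d, d2)]
        else:
            x = x_euler                            # last step falls back to Euler
    return x

# Hypothetical toy denoiser: always predicts the same residual trajectory.
target = [0.05, -0.02, 0.01]
toy_denoiser = lambda x, sigma: list(target)

rng = random.Random(0)
sigmas = edm_sigma_schedule()
ensemble = [heun_sample(toy_denoiser, len(target), sigmas, rng) for _ in range(5)]
```

With this toy denoiser every trajectory collapses onto `target`, which makes the loop easy to verify. With a trained conditional denoiser, different initial noise draws would yield different trajectories, and their spread is what the per-time-step aggregation summarizes.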

While Heun’s second-order method provides a good trade-off between stability and computational cost, it is not necessarily optimal across all settings. First-order solvers like Euler [18] are simpler but often require more steps to maintain fidelity, whereas higher-order methods such as fourth-order Runge–Kutta [18] or DPM-Solver [29] can achieve comparable quality with fewer evaluations, albeit at higher per-step cost and implementation complexity. The effectiveness of different solvers can also vary depending on the noise schedule, data distribution, and conditioning structure. A more comprehensive evaluation of solver selection, including adaptive step-size control or learned solver strategies, remains a promising direction for future work.

To better justify the adoption of the EDM in this study for PWPF, a structured comparative analysis between the DDPM and the EDM is presented, with a focus on four key aspects: time domain formulation, forward diffusion process, optimization objective, and reverse denoising dynamics (see Appendix A for details).

4. Architecture of TimeDiT

The network architecture of TimeDiT [26] is illustrated in Figure 3. The overall design is based on the Transformer encoder framework, with several tailored modifications to support multi-source probabilistic modeling. The model takes as input three types of information: the current noise level $σ$, the noise-corrupted sequence $x_{σ}$, and the conditional context $C$ at prediction time. These components are encoded separately using different feature encoders suited to their nature. Specifically, the noise level $σ$ is encoded using the sinusoidal timestep encoding adopted from the DDPM [19], the sequence $x_{σ}$ is processed through a one-dimensional convolutional neural network, and the conditional context $C$ is encoded via a linear projection layer. In this study, the conditional input $C$ is formed by concatenating the NWP sequence $n w p$ with the deterministic forecasting result ${\tilde{y}}_{D P}$ along the feature dimension.

After encoding, the features of $σ$ and $x_{σ}$ are elementwise added to positional encodings and then passed into the encoder. The encoder consists of two identical stacked blocks. A key architectural distinction between TimeDiT and the original Transformer encoder lies in its use of conditional modulation. Prior to each encoder block, the conditional embedding is passed through a multi-layer perceptron (MLP) to generate two kinds of modulation coefficients: one used for scaling and one used for shifting.

The modulation procedure for a single encoder layer is detailed as follows. Let $h_{0}$ denote the input tensor to an encoder layer and $C$ denote the corresponding conditional vector. A lightweight MLP first transforms $C$ into six parameter vectors: $c_{s c a l e}^{1}$, $c_{s h i f t}^{1}$, $c_{s c a l e}^{2}$, $c_{s c a l e}^{3}$, $c_{s h i f t}^{3}$, and $c_{s c a l e}^{4}$, each having the same dimensionality as $C$. These parameters are then used to inject conditional information through a series of scale and shift operations embedded within the encoder layer.

First, $h_{0}$ undergoes layer normalization to produce $h_{1}$, which is then modulated using a scale-and-shift transformation, as given in Equation (12).

{\tilde{h}}_{1} = c_{s c a l e}^{1} ⊙ h_{1} + c_{s h i f t}^{1}

(12)

The modulated tensor ${\tilde{h}}_{1}$ is passed through a multi-head self-attention block to obtain $h_{2}$, followed by a scaling operation, as shown in Equation (13).

{\tilde{h}}_{2} = c_{s c a l e}^{2} ⊙ h_{2}

(13)

After residual connection and layer normalization, the output becomes $h_{3}$, which is then subjected to a second scale-and-shift modulation, as shown in Equation (14).

{\tilde{h}}_{3} = c_{s c a l e}^{3} ⊙ h_{3} + c_{s h i f t}^{3}

(14)

Subsequently, ${\tilde{h}}_{3}$ is passed through a feedforward network to produce $h_{4}$, followed by the final scaling step, as shown in Equation (15).

{\tilde{h}}_{4} = c_{s c a l e}^{4} ⊙ h_{4}

(15)

where ${\tilde{h}}_{4}$ is either used as input to the next encoder layer or transformed via a linear projection to forecast the error output. In the above equations, the symbol $⊙$ represents the Hadamard product, applied component-wise between tensors of identical dimensions. Overall, TimeDiT demonstrates a highly modular and effective design for multi-source information fusion. By decoupling the encoding of different input types and injecting conditional knowledge directly into the model’s core computations via modulation, it enables flexible and fine-grained control over the prediction process while preserving the architectural integrity of the Transformer backbone.
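The modulation sequence in Equations (12)–(15) can be sketched for a single encoder layer as follows. Several details are assumptions made for illustration: the six modulation tensors are supplied directly rather than produced by an MLP, the residual connection is taken to add $h_{0}$ to ${\tilde{h}}_{2}$, layer normalization has no learnable parameters, and the attention and feedforward blocks are passed in as placeholder callables.

```python
import math

def layer_norm(h, eps=1e-5):
    """Normalize each row (time step) to zero mean and unit variance."""
    out = []
    for row in h:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def hadamard(a, b):
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def modulated_encoder_layer(h0, mod, attention, feed_forward):
    """One conditionally modulated encoder layer (sketch of Equations (12)-(15))."""
    h1 = layer_norm(h0)
    h1_mod = add(hadamard(mod["scale1"], h1), mod["shift1"])  # Equation (12)
    h2 = attention(h1_mod)
    h2_mod = hadamard(mod["scale2"], h2)                      # Equation (13)
    h3 = layer_norm(add(h0, h2_mod))                          # residual (assumed h0 + h2_mod), then norm
    h3_mod = add(hadamard(mod["scale3"], h3), mod["shift3"])  # Equation (14)
    h4 = feed_forward(h3_mod)
    return hadamard(mod["scale4"], h4)                        # Equation (15)

# Toy input with S = 2 time steps and 3 features (made-up values).
h0 = [[1.0, 2.0, 3.0], [0.0, 1.0, -1.0]]
ones = [[1.0] * 3 for _ in h0]
zeros = [[0.0] * 3 for _ in h0]
identity = lambda h: h  # placeholder for the attention and feedforward blocks

mod = {"scale1": ones, "shift1": zeros, "scale2": zeros,
       "scale3": ones, "shift3": zeros, "scale4": ones}
out = modulated_encoder_layer(h0, mod, identity, identity)
```

Setting `scale2` to zeros switches off the attention branch, so in this configuration the layer reduces to a plain layer normalization of `h0`. This shows how the condition-derived scale factors act as gates on each sub-block.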

5. Case Study

5.1. Dataset Description

To validate the effectiveness of the proposed ResD-PWPF framework, we conduct extensive experiments on open-source data from ten wind farms provided by the GEFCom2014 [27]. For clarity and consistency throughout the study, these wind farms are sequentially labeled WF1 to WF10. The wind power data for each site is pre-normalized, with values scaled between 0 and 1. The NWP inputs include four features: the U and V wind components at 10 m and 100 m above ground level. The forecast horizon spans from 1:00 to 24:00 of the following day. For each NWP feature, min–max normalization is applied based on its historical range. The dataset has an hourly resolution and covers a two-year period from 1 January 2012 to 31 December 2013. We split the data chronologically along the time axis into training, validation, and test sets with a ratio of 7:1:2. During model training, early stopping is employed based on performance on the validation set to prevent overfitting and determine the optimal stopping point.

5.2. Evaluation Metrics for PWPF

To evaluate the quality of PWPF, this study adopts CRPS, a widely used metric for assessing the accuracy of predictive distributions, defined as in Equation (16).

\mathrm{CRPS} (F, y) = \int_{- \infty}^{+ \infty} {(F (z) - \mathbb{1} \{z \geq y\})}^{2} d z

(16)

where $F (z)$ denotes the cumulative distribution function (CDF) of the forecast, $y$ is the observed ground-truth value, and $\mathbb{1} \{z \geq y\}$ is an indicator function that equals 1 when $z \geq y$ and 0 otherwise. Intuitively, CRPS measures the squared difference between the predicted CDF and the empirical step function at the true value, integrated over the entire range of outcomes. It can be viewed as a probabilistic extension of the mean absolute error (MAE), since it reduces to the absolute error when the forecast distribution collapses to a single point, and it captures both the sharpness and calibration of the predictive distribution.

When the forecast is assumed to follow a Gaussian distribution with mean $μ$ and standard deviation $σ$, the CRPS has a closed-form expression, as in Equation (17), where $Φ (\cdot)$ and $φ (\cdot)$ represent the cumulative distribution function and probability density function of the standard normal distribution, respectively.

\mathrm{CRPS} (μ, σ, y) = σ [\frac{y - μ}{σ} (2 Φ (\frac{y - μ}{σ}) - 1) + 2 φ (\frac{y - μ}{σ}) - \frac{1}{\sqrt{π}}]

(17)
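The closed form in Equation (17) can be checked numerically against the integral definition in Equation (16). The sketch below uses only the Python standard library. The function names, the integration range, and the test values are illustrative choices, and the midpoint rule is used simply as a convenient numerical check.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a Gaussian forecast, Equation (17)."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * normal_cdf(z) - 1.0)
                    + 2.0 * normal_pdf(z)
                    - 1.0 / math.sqrt(math.pi))

def crps_numeric(mu, sigma, y, half_width=10.0, n=20000):
    """Midpoint-rule approximation of the integral in Equation (16)."""
    lo = min(mu, y) - half_width * sigma
    hi = max(mu, y) + half_width * sigma
    dz = (hi - lo) / n
    total = 0.0
    for k in range(n):
        z = lo + (k + 0.5) * dz
        indicator = 1.0 if z >= y else 0.0
        total += (normal_cdf((z - mu) / sigma) - indicator) ** 2 * dz
    return total

closed = crps_gaussian(0.40, 0.10, 0.45)   # forecast N(0.40, 0.10^2), observation 0.45
numeric = crps_numeric(0.40, 0.10, 0.45)
near_point = crps_gaussian(0.40, 1e-6, 0.50)  # nearly deterministic forecast
```

The last line illustrates the link to MAE: as the forecast standard deviation shrinks toward zero, the CRPS approaches the absolute error between the forecast mean and the observation.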

In addition to CRPS, we also evaluate the accuracy of the mean prediction using two deterministic metrics: MAE and RMSE. These metrics are widely used in forecasting literature and are not elaborated on here due to their well-established definitions. Together, CRPS, MAE, and RMSE provide a comprehensive assessment of both the distributional quality and the central tendency accuracy of the proposed probabilistic forecasting model.

In addition to the standard error-based metrics (MAE, RMSE, and CRPS), which evaluate average prediction accuracy, we use the WSRT to assess the statistical significance of performance differences between models. Unlike a comparison of averages, which can be sensitive to outliers, this non-parametric test evaluates whether one method consistently outperforms another across paired samples, without assuming normality. Specifically, we collect daily metric values for each wind farm in the test set and apply the WSRT to the paired errors of the proposed method and each baseline. Since lower metric values indicate better performance, the null hypothesis is that the baseline method is not worse than the proposed method. A small

p

-value (e.g., <0.05) provides strong evidence against this hypothesis, indicating that the proposed method significantly outperforms the baseline across the test set.
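The one-sided test described above can be sketched as follows. This is a minimal standard-library version that assumes the large-sample normal approximation without tie or continuity corrections; the function name `wilcoxon_less` and the daily CRPS values are hypothetical. In practice, a library routine such as `scipy.stats.wilcoxon` with `alternative="less"` would normally be used instead.

```python
import math


def wilcoxon_less(proposed, baseline):
    """One-sided Wilcoxon signed-rank test; H1: proposed errors tend to be lower.

    Zero differences are dropped and the p-value uses the normal approximation
    (assumptions of this sketch). Returns (W_plus, p_value).
    """
    diffs = [p - b for p, b in zip(proposed, baseline) if p != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank |d| in ascending order, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # W_plus sums the ranks of days on which the proposed method was worse.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / std
    p_value = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # P(Z <= z)
    return w_plus, p_value


# Hypothetical daily CRPS values for 30 test days (not taken from the paper).
baseline_crps = [0.09 + 0.002 * k for k in range(30)]
proposed_crps = [b - 0.001 * (k + 1) for k, b in enumerate(baseline_crps)]
w_plus, p_value = wilcoxon_less(proposed_crps, baseline_crps)
```

In this toy case the proposed method is better on every day, so the positive-rank sum is zero and the p-value is far below 0.05.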

5.3. Baseline Comparison Methods

To validate the effectiveness of the proposed method, we compare it against two representative probabilistic forecasting baselines: DeepAR and GAN-BERT.

DeepAR is an autoregressive forecasting model that estimates the parameters of a predefined probability distribution at each time step. It is built on a long short-term memory architecture and optimized using the negative log-likelihood loss. While widely used in time-series forecasting, it assumes a fixed distribution form and may struggle to model complex uncertainty patterns.
GAN-BERT adopts a generative adversarial training framework, where both the generator and discriminator are based on Transformer encoders. The generator takes NWP sequences and Gaussian noise as input and outputs wind power predictions. The discriminator receives NWP features along with either real or generated power sequences and learns to distinguish between them. This adversarial setup encourages the generator to produce realistic and diverse probabilistic forecasts.

5.4. Results and Discussion

Table 3 presents the forecasting performance of three probabilistic models (DeepAR, GAN-BERT, and the proposed ResD-PWPF) on ten wind farms (WF1~WF10) using three evaluation metrics: MAE, RMSE, and CRPS. ResD-PWPF achieves the lowest average value on all three metrics. Per site, it is the best model at every wind farm for MAE and RMSE, while DeepAR attains a slightly lower CRPS at WF2 and WF3.

In terms of MAE, ResD-PWPF achieves the lowest average error of 0.1204, compared to 0.1271 for GAN-BERT and 0.1297 for DeepAR. Similar trends are observed for RMSE, where ResD-PWPF reports the lowest average of 0.1529, while GAN-BERT and DeepAR reach 0.1599 and 0.1626, respectively. These results indicate that ResD-PWPF provides more accurate mean predictions across all sites. Most notably, ResD-PWPF achieves the best performance in CRPS, a metric that captures both the calibration and the sharpness of probabilistic forecasts. With an average CRPS of 0.0885, ResD-PWPF outperforms GAN-BERT (0.1203) and DeepAR (0.0933) by a significant margin, suggesting superior ability to represent forecast uncertainty. This advantage is especially evident at sites such as WF7 and WF9, where ResD-PWPF shows substantial reductions in CRPS.

Table 4 presents the results of the WSRT comparing the proposed ResD-PWPF method against two baselines—DeepAR and GAN-BERT—across 10 wind farms, using three evaluation metrics: MAE, RMSE, and CRPS. Each cell shows the p-value computed from daily prediction results over the test set. Underlined entries indicate

p

-values greater than 0.05, suggesting that the difference between ResD-PWPF and the baseline is not statistically significant at the 95% confidence level.

The results show that ResD-PWPF outperforms both DeepAR and GAN-BERT with high statistical significance in most comparisons. For the CRPS metric, all p-values against GAN-BERT are below 0.05 and often reach extremely small magnitudes (e.g., 1 × 10⁻²²); against DeepAR, the CRPS differences are significant at eight of the ten sites, the exceptions being WF2 and WF3. For MAE and RMSE, statistical superiority is likewise observed in most cases, with only a small number of exceptions (e.g., WF2 against DeepAR), possibly due to site-specific variability or noise. These findings indicate that ResD-PWPF provides more accurate and better-calibrated probabilistic forecasts, with improvements that are statistically significant at most sites.

5.5. Scenario Generalizability Analysis

To evaluate the generalization and robustness of the proposed ResD-PWPF framework under varying operational conditions, four representative wind power output scenarios are selected: high output, low output, medium output, and high fluctuation. The probabilistic forecasting results under each scenario are visualized in Figure 4 using multiple confidence intervals (CIs) and the corresponding mean forecasts.

Across all scenarios, the proposed method demonstrates strong generalization ability, with predictive distributions that closely follow the ground truth trajectories. In the high-output and medium-output scenarios (Figure 4a,b), the forecast mean values remain consistently close to the observed curves, and the uncertainty bands tightly capture the variations. This suggests that the model effectively learns from historical NWP features and produces well-calibrated forecasts in steady operating conditions. In the low-output scenario in Figure 4c, the method maintains narrow prediction intervals around low values, avoiding overestimation while preserving calibration. The ability to predict near-zero output while keeping confidence intervals consistent reflects good robustness to sparse information and weak wind signals. In the high-fluctuation scenario in Figure 4d, the model still provides smooth, continuous probabilistic bands that encompass most ground truth values. However, it can be observed that the ground truth occasionally falls near or just outside the 70%~90% confidence intervals, indicating a slight underestimation of uncertainty during rapidly changing conditions. This highlights a potential limitation of the model in extreme weather or abrupt transitions, where uncertainty quantification may benefit from incorporating higher-order temporal features or more dynamic noise modeling. The proposed method exhibits strong generalization across different output levels and resilience to noise and fluctuations, while showing room for improvement in capturing highly volatile transitions.

5.6. Computational Configuration and Efficiency

The experiments were conducted on a MacBook Pro equipped with an Apple M3 Max chip. All models were implemented using the PyTorch 2.7.1 framework, and training and inference were accelerated using the on-device Graphics Processing Unit (GPU). In the deterministic modeling stage, early stopping was applied to dynamically determine whether to continue or terminate training for each wind farm. Across 10 wind farms, the average training time for the deterministic model was approximately 20 min. During inference, the deterministic model required less than 0.1 s to generate a 24 h advance power forecast.

In the probabilistic modeling stage, model training was conducted with a batch size of 64 for a total of 50,000 epochs. The TimeDiT model consists of two encoder layers with a hidden dimension of 128 and 8 attention heads, resulting in a total of 73,065 trainable parameters. For a single wind farm, the average training time was approximately 106 min. During probabilistic inference, the number of diffusion steps

N

was set to 10, and 100 samples of initial random noise were generated to form an ensemble forecast. The time required for a single 24 h advance probabilistic power forecast for one wind farm was approximately 5 s. Once the models are trained and deployed, the proposed method can produce a probabilistic power forecast for a single wind farm in under 6 s. This satisfies the time constraints of real-world engineering applications and demonstrates the practical feasibility of the approach.
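To make the ensemble step concrete, the following sketch adds sampled residual trajectories to a deterministic baseline and summarizes each time step by its mean and a central interval. It assumes power normalized to [0, 1] and uses Gaussian draws as a stand-in for the diffusion model's residual samples; the names `empirical_quantile` and `ensemble_forecast` are hypothetical.

```python
import random


def empirical_quantile(sorted_vals, q):
    """Linear-interpolation quantile of an ascending list, 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def ensemble_forecast(baseline, residual_samples, coverage=0.9):
    """Combine a baseline with sampled residuals and summarize each time step.

    baseline: list of T deterministic forecasts.
    residual_samples: list of S residual trajectories, each of length T.
    Returns (mean, lower, upper) lists for the central `coverage` interval.
    """
    alpha = (1.0 - coverage) / 2.0
    mean, lower, upper = [], [], []
    for t, base in enumerate(baseline):
        # Clipping to [0, 1] assumes normalized power (an assumption of this sketch).
        members = sorted(min(1.0, max(0.0, base + r[t])) for r in residual_samples)
        mean.append(sum(members) / len(members))
        lower.append(empirical_quantile(members, alpha))
        upper.append(empirical_quantile(members, 1.0 - alpha))
    return mean, lower, upper


# Toy stand-in for diffusion output: 100 residual trajectories over 24 steps.
rng = random.Random(0)
baseline = [0.5] * 24
residuals = [[rng.gauss(0.0, 0.05) for _ in range(24)] for _ in range(100)]
mean, lower, upper = ensemble_forecast(baseline, residuals, coverage=0.9)
```

Nested intervals at several coverage levels, as drawn in Figure 4, would follow from calling the same function with different `coverage` values.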

6. Conclusions

This study presents ResD-PWPF, a novel residual diffusion-based PWPF framework that integrates deterministic prediction with conditional diffusion modeling. A baseline predictor first produces the mean wind power forecast, and a conditional diffusion model is then employed to learn the distribution of residual errors conditioned on both the forecast and the external features. To enable flexible and expressive modeling of uncertainty, we build the diffusion process upon the EDM, which provides continuous noise-level control, enhanced conditioning mechanisms, and improved training stability over conventional diffusion frameworks. For the denoising network, we introduce TimeDiT, a diffusion Transformer-based architecture tailored for time-series generation. TimeDiT incorporates modular feature encoders and conditional modulation through scale-and-shift transformations. Together, these design choices enable accurate, uncertainty-aware wind power forecasting under diverse operational scenarios.

Extensive experiments were conducted on ten wind farms from the GEFCom2014 dataset. The results demonstrate that ResD-PWPF consistently outperforms strong baselines such as DeepAR and GAN-BERT in terms of both point prediction metrics (MAE, RMSE) and probabilistic calibration (CRPS). Further case studies across four typical wind power output scenarios—high output, low output, medium output, and high fluctuation—validate the model’s strong generalization ability and robustness under diverse operating conditions. The method effectively captures uncertainty and remains well calibrated in most cases, although slight underestimation may occur in highly volatile scenarios. Such underestimation tends to arise when the wind power time series exhibits abrupt, large-magnitude transitions that fall outside the statistical patterns seen during training. In such cases, the baseline deterministic predictor may fail to provide sufficiently accurate forecasts, limiting the effectiveness of residual-based diffusion modeling. Moreover, the conditional diffusion model, while expressive, may struggle to represent multi-modal or long-tailed residual distributions under sparse data regimes or rare extreme weather events. We also caution readers regarding the use of a fixed conditioning structure: When external features such as NWP inputs are misaligned, delayed, or contain systematic biases, the model’s probabilistic outputs can become poorly calibrated. This issue is particularly pronounced under rapidly evolving meteorological conditions or sensor degradation and may be mitigated through strategies such as dynamic feature selection, hierarchical conditioning, or adaptive residual modeling.

For future work, several avenues can be explored to further improve performance and extend applicability. Incorporating spatial correlations among wind farms and leveraging multi-site joint modeling may enhance forecast consistency across different geographic regions. Additionally, integrating online learning or real-time recalibration mechanisms could improve the model's responsiveness to sudden environmental changes.

Author Contributions

L.G.—conceptualization, methodology, supervision, project administration, funding acquisition, writing—review and editing; F.C.—software, formal analysis, investigation, writing—original draft preparation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Gao’s ORAU Ralph E. Powe Junior Faculty Enhancement Award and the U.S. National Science Foundation (NSF CAREER #2443363).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

CDF	Cumulative distribution function
CI	Confidence interval
CRPS	Continuous ranked probability score
DDPM	Denoising diffusion probabilistic model
EDM	Elucidated diffusion model
GAN	Generative adversarial network
GEFCom2014	Global Energy Forecasting Competition 2014
GPU	Graphics Processing Unit
MAE	Mean absolute error
MLP	Multi-layer perceptron
MSE	Mean squared error
NWP	Numerical weather prediction
ODE	Ordinary differential equation
PWPF	Probabilistic wind power forecasting
RMSE	Root mean square error
SDE	Stochastic differential equation
TimeDiT	Time diffusion Transformer
WSRT	Wilcoxon signed-rank test

Appendix A. Mathematical Comparative Analysis of DDPM and EDM

The DDPM represents one of the most prominent and widely adopted diffusion frameworks in recent literature [19].

Appendix A.1. Time Domain Formulation

The DDPM operates in a discrete time domain, with a predefined number of diffusion steps

t \in \{1, 2, \dots, T\}

, often in the range of 1000 or more. Each step corresponds to a predefined noise level governed by a variance schedule

β_{t}

[19]. The EDM, in contrast, defines the diffusion process in continuous time, where the time variable

t

is in one-to-one correspondence with the noise scale

σ

. During training, the time variable

t

is sampled from a log-normal distribution, as defined in Equation (A1).

\ln (t) ~ N (P_{mean}, {P_{std}}^{2})

(A1)

where

P_{mean} = -1.2

and

P_{std} = 1.2

are hyperparameters controlling the center and spread of the log-scale noise distribution.
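Sampling from Equation (A1) amounts to exponentiating a Gaussian draw. A minimal sketch with the quoted hyperparameters is given below; the function name `sample_noise_level` is hypothetical.

```python
import math
import random

P_MEAN, P_STD = -1.2, 1.2  # values quoted with Equation (A1)


def sample_noise_level(rng: random.Random) -> float:
    """Draw t (equal to the noise scale sigma) with ln(t) ~ N(P_MEAN, P_STD^2)."""
    return math.exp(rng.gauss(P_MEAN, P_STD))


rng = random.Random(0)
levels = [sample_noise_level(rng) for _ in range(20000)]
log_mean = sum(math.log(s) for s in levels) / len(levels)
median = sorted(levels)[len(levels) // 2]
```

The median noise level is exp(P_mean), roughly 0.30, so training concentrates on low-to-moderate noise while the heavy right tail still covers large noise scales.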

By adopting a continuous-time formulation, the EDM offers several principled advantages over discrete-time diffusion models such as the DDPM. First, continuous noise scheduling enables smooth interpolation across noise levels, which eliminates the need for manually designed stepwise schedules and supports learning- or data-adaptive scheduling. Second, the continuous formulation naturally aligns with probability flow ODE or reverse-time SDE, allowing the use of high-order numerical solvers (e.g., fourth-order Runge–Kutta and DPM-Solver) to integrate the reverse trajectory with improved accuracy and stability. This leads to a significant reduction in sampling steps without compromising generation quality. Moreover, continuous-time models provide a stronger mathematical foundation. In the limit, as the step size approaches zero, discrete-time diffusion processes converge to their continuous counterparts, which are governed by well-defined SDE. This connection allows theoretical analysis of the sampling dynamics and generalization behavior using tools from SDE theory, such as Fokker–Planck equations and score-based interpretations. A more systematic exposition of this convergence can be found in prior work [18], where the DDPM formulation is shown to approximate a continuous-time Langevin process. In short, continuous noise scheduling is not only practically beneficial in terms of sampling speed and quality, but also mathematically grounded, enabling flexible, stable, and theoretically analyzable generative processes.

Appendix A.2. Forward Diffusion Process

In DDPM, the forward process is defined as a fixed-length Markov chain of

T

discrete time steps. At each step, a small amount of Gaussian noise is added to the data with Equation (A2) [19].

q (x_{t}| x_{t - 1}) = N (x_{t}; \sqrt{1 - β_{t}} x_{t - 1}, β_{t} I), t = 1, 2, \dots, T

(A2)

where

{\{β_{t}\}}_{t = 1}^{T}

is a predefined variance schedule, typically increasing over time. The marginal distribution

q (x_{t}| x_{0})

can be derived analytically using Equation (A3) [19].

q (x_{t}| x_{0}) = N (x_{t}; \sqrt{{\bar{α}}_{t}} x_{0}, (1 - {\bar{α}}_{t}) I), t = 1, 2, \dots, T

(A3)

where

{\bar{α}}_{t} = \prod_{s = 1}^{t} (1 - β_{s})

denotes the accumulated noise decay [19]. This formulation enables direct sampling of noisy inputs from clean data and forms the basis of the denoising training objective. The EDM generalizes the DDPM by defining the forward process in continuous time using one SDE, as shown in Equation (A4) [22].

d x_{t} = \sqrt{2 t} d w_{t}

(A4)

where

w_{t}

is the standard Wiener process. The solution to this SDE can be expressed in closed form, as given in Equation (A5) [22].

q (x_{t}| x_{0}) = N (x_{t}; x_{0}, t^{2} I)

(A5)

This formulation represents the continuous analogue of the DDPM and can be interpreted as a time-warped Gaussian process that gradually perturbs the data towards isotropic Gaussian noise. Compared to the DDPM’s fixed-step noise schedule, the EDM adopts a continuous noise injection process, which enables smoother transitions, finer control over noise levels, and better alignment with advanced solvers. This not only improves training efficiency but also allows for higher sample quality with fewer generation steps.
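The two closed-form marginals, Equations (A3) and (A5), can be contrasted in a few lines. The sketch below assumes a linear variance schedule from 1e-4 to 0.02 over 1000 steps, a common DDPM choice that is used here for illustration only; all function names are hypothetical.

```python
import math
import random


def ddpm_alpha_bar(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s), t = 1..T."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def ddpm_perturb(x0, alpha_bar_t, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) x0, (1 - alpha_bar_t) I), Equation (A3)."""
    scale, std = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [scale * v + std * rng.gauss(0.0, 1.0) for v in x0]


def edm_perturb(x0, t, rng):
    """Sample x_t ~ N(x0, t^2 I), Equation (A5); the signal is not rescaled."""
    return [v + t * rng.gauss(0.0, 1.0) for v in x0]


# Assumed linear schedule (illustrative, not taken from the paper).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar = ddpm_alpha_bar(betas)

rng = random.Random(0)
x0 = [0.2, 0.4, 0.6]
x_ddpm = ddpm_perturb(x0, alpha_bar[-1], rng)  # nearly pure N(0, I) noise
x_edm = edm_perturb(x0, 0.5, rng)              # x0 plus noise of std 0.5
```

In the DDPM the signal is shrunk toward zero as noise accumulates, whereas in the EDM the clean sample is kept at full scale and only the noise standard deviation t grows.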

Appendix A.3. Model Optimization Objective

In the DDPM, training is framed as minimizing the variational bound on the negative log-likelihood of the data distribution. However, Ref. [19] showed that this objective can be simplified to a denoising score-matching loss, where the model learns to predict the noise component

ϵ

added during the forward process, as in Equation (A6).

L_{DDPM} = E_{t, x_{0}, ϵ} {‖ϵ - ϵ_{θ} (x_{t}, t)‖}^{2}

(A6)

In contrast to the DDPM, which predicts the added noise, the EDM employs a training objective that directly estimates the clean data sample

x_{0}

from a noisy input

x_{t}

. The neural network

F_{θ} (x_{σ}, σ)

is optimized to reconstruct

x_{0}

using a weighted mean squared error loss, as in Equation (A7) [22].

L_{EDM} = E_{t, x_{0}, σ} \{λ (t) \cdot {‖x_{0} - F_{θ} (x_{σ}, σ)‖}^{2}\}

(A7)

The weighting function

λ (t)

is chosen to compensate for the scale-dependent magnitude of the residual between the noisy and clean samples, the specific expression of which is shown in Equation (10). Compared to the DDPM, the EDM eliminates the need to predict the additive noise

ϵ

and is better aligned with the reconstruction goal of diffusion models. Moreover, the direct regression of

x_{0}

simplifies the reverse-time formulation and facilitates the use of advanced ODE/SDE solvers during sampling.
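The difference between the two objectives can be seen in a per-sample sketch. The weight below is the one proposed in the EDM paper [22], with σ_data = 0.5; that it coincides with Equation (10) of this article is an assumption, and the function names are hypothetical.

```python
def edm_weight(sigma, sigma_data=0.5):
    """Loss weight from the EDM paper [22]; assumed to match Equation (10)."""
    return (sigma ** 2 + sigma_data ** 2) / (sigma * sigma_data) ** 2


def ddpm_loss(eps, eps_pred):
    """Equation (A6) for one sample: squared error on the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def edm_loss(x0, x0_pred, sigma, weight=edm_weight):
    """Equation (A7) for one sample: weighted squared error on the clean signal."""
    return weight(sigma) * sum((a - b) ** 2 for a, b in zip(x0, x0_pred))


# The same reconstruction error is penalized more heavily at low noise levels.
loss_low_noise = edm_loss([1.0, 0.0], [0.0, 0.0], sigma=0.5)
loss_high_noise = edm_loss([1.0, 0.0], [0.0, 0.0], sigma=5.0)
```

With this weight, an identical squared error costs more at σ = 0.5 than at σ = 5, reflecting that accurate reconstruction is expected when little noise has been added.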

Appendix A.4. Reverse Denoising Process

In the DDPM, the reverse process is implemented via a parameterized Gaussian distribution, as shown in Equation (A8).

p_{θ} (x_{t - 1}| x_{t}) = N (x_{t - 1}; μ_{θ} (x_{t}, t), β_{t} I)

(A8)

where the mean value is given by Equation (A9).

μ_{θ} (x_{t}, t) = \frac{1}{\sqrt{1 - β_{t}}} (x_{t} - \frac{β_{t}}{\sqrt{1 - {\bar{α}}_{t}}} \cdot ϵ_{θ} (x_{t}, t))

(A9)

Sampling proceeds sequentially from

x_{T} ~ N (0, I)

back to

x_{0}

, requiring hundreds to thousands of steps to achieve high-fidelity results. In contrast, the EDM defines the reverse dynamics using either a deterministic probability flow ODE or a stochastic reverse-time SDE. In the ODE formulation, the trajectory of

x_{t}

is governed by Equation (A10).

d x_{t} = \frac{1}{t} (x_{t} - F_{θ} (x_{t}, σ)) \cdot d t

(A10)

The ODE formulation in the EDM is solver-friendly and facilitates high-quality sample generation with as few as 10–20 function evaluations. In the reverse sampling process, the EDM demonstrates clear advantages over the DDPM by directly predicting the clean data

x_{0}

and modeling the reverse path as a continuous trajectory. This formulation enables the use of high-order ODE solvers, resulting in more efficient and stable generation. Consequently, the EDM achieves high-quality samples with significantly fewer steps than the DDPM, while also maintaining a simpler and more interpretable denoising process.

References

Shan, C.; Liu, S.; Peng, S.; Huang, Z.; Zuo, Y.; Zhang, W.; Xiao, J. A Wind Power Forecasting Method Based on Lightweight Representation Learning and Multivariate Feature Mixing. Energies 2025, 18, 2902.
Li, F.; Wang, H.; Wang, D.; Liu, D.; Sun, K. A Review of Wind Power Prediction Methods Based on Multi-Time Scales. Energies 2025, 18, 1713.
Liu, Z.; Guo, H.; Zhang, Y.; Zuo, Z. A Comprehensive Review of Wind Power Prediction Based on Machine Learning: Models, Applications, and Challenges. Energies 2025, 18, 350.
Ekinci, G.; Ozturk, H.K. Forecasting Wind Farm Production in the Short, Medium, and Long Terms Using Various Machine Learning Algorithms. Energies 2025, 18, 1125.
Zhang, L.; Xie, L.; Han, Q.; Wang, Z.; Huang, C. Probability Density Forecasting of Wind Speed Based on Quantile Regression and Kernel Density Estimation. Energies 2020, 13, 6125.
Li, G.; Lin, C.; Li, Y. Probabilistic Forecasting of Provincial Regional Wind Power Considering Spatio-Temporal Features. Energies 2025, 18, 652.
Jeon, Y.-N.; Ko, J. Forecast-Aided Converter-Based Control for Optimal Microgrid Operation in Industrial Energy Management System (EMS): A Case Study in Vietnam. Energies 2025, 18, 3202.
Zhang, J.; Zhang, R.; Zhao, Y.; Qiu, J.; Bu, S.; Zhu, Y.; Li, G. Deterministic and Probabilistic Prediction of Wind Power Based on a Hybrid Intelligent Model. Energies 2023, 16, 4237.
Chen, F.; Yan, J.; Liu, Y.; Yan, Y.; Tjernberg, L.B. A Novel Meta-Learning Approach for Few-Shot Short-Term Wind Power Forecasting. Appl. Energy 2024, 362, 122838.
Chen, F.; Yan, J.; Tjernberg, L.B.; Song, D.; Yan, Y.; Liu, Y. Medium-Term Wind Power Forecasting Based on Dynamic Self-Attention Mechanism. In Proceedings of the 2023 IEEE Belgrade PowerTech, Belgrade, Serbia, 25 June 2023; pp. 1–5.
Liu, Z.; Chen, J.; Dong, H.; Wang, Z. MSVMD-Informer: A Multi-Variate Multi-Scale Method to Wind Power Prediction. Energies 2025, 18, 1571.
Dong, W.; Sun, H.; Tan, J.; Li, Z.; Zhang, J.; Yang, H. Regional Wind Power Probabilistic Forecasting Based on an Improved Kernel Density Estimation, Regular Vine Copulas, and Ensemble Learning. Energy 2022, 238, 122045.
Cai, H.; Jia, X.; Feng, J.; Li, W.; Hsu, Y.-M.; Lee, J. Gaussian Process Regression for Numerical Wind Speed Prediction Enhancement. Renew. Energy 2020, 146, 2112–2123.
Zhang, H.; Liu, Y.; Yan, J.; Han, S.; Li, L.; Long, Q. Improved Deep Mixture Density Network for Regional Wind Power Probabilistic Forecasting. IEEE Trans. Power Syst. 2020, 35, 2549–2560.
Yuan, R.; Wang, B.; Mao, Z.; Watada, J. Multi-Objective Wind Power Scenario Forecasting Based on PG-GAN. Energy 2021, 226, 120379.
Liu, J.; Zang, H.; Cheng, L.; Ding, T.; Wei, Z.; Sun, G. Generative Probabilistic Forecasting of Wind Power: A Denoising-Diffusion-Based Nonstationary Signal Modeling Approach. Energy 2025, 317, 134576.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Montreal, QC, Canada, 2014; pp. 2672–2680.
Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-Based Generative Modeling through Stochastic Differential Equations. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021.
Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual, 6–12 December 2020; pp. 6840–6851.
Song, Y.; Ermon, S. Improved Techniques for Training Score-Based Generative Models. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual, 6–12 December 2020; Curran Associates, Inc.: Red Hook, NY, USA, 2020; p. 1043.
Lipman, Y.; Chen, R.T.Q.; Ben Hamu, H.; Nickel, M.; Le, M. Flow Matching for Generative Modeling. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
Karras, T.; Aittala, M.; Aila, T.; Laine, S. Elucidating the Design Space of Diffusion-Based Generative Models. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 4–10 December 2022; pp. 15733–15745.
Song, Y.; Dhariwal, P.; Chen, M.; Sutskever, I. Consistency Models. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, HI, USA, 18–24 July 2023.
Mardani, M.; Brenowitz, N.; Cohen, Y.; Pathak, J.; Chen, C.-Y.; Liu, C.-C.; Vahdat, A.; Nabian, M.A.; Ge, T.; Subramaniam, A.; et al. Residual Corrective Diffusion Modeling for Km-Scale Atmospheric Downscaling. Commun. Earth Environ. 2025, 6, 124.
Peebles, W.; Xie, S. Scalable Diffusion Models with Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2023), Vancouver, BC, Canada, 2–6 October 2023; pp. 4172–4182.
Cao, D.; Ye, W.; Zhang, Y.; Liu, Y. TimeDiT: General-Purpose Diffusion Transformers for Time Series Foundation Model. arXiv 2024, arXiv:2409.02322.
Hong, T.; Pinson, P.; Fan, S.; Zareipour, H.; Troccoli, A.; Hyndman, R.J. Probabilistic Energy Forecasting: Global Energy Forecasting Competition 2014 and Beyond. Int. J. Forecast. 2016, 32, 896–913.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010.
Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; Zhu, J. DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. In Proceedings of the Advances in Neural Information Processing Systems, NeurIPS 2022, Virtual, 28 November–9 December 2022; Volume 35.

Figure 1. Workflow of the proposed ResD-PWPF framework. The process consists of three main steps: Step 1 uses NWP information to generate a deterministic baseline forecast, Step 2 models the residual uncertainty using a conditional diffusion model, and Step 3 combines the deterministic forecast and sampled residuals to produce the final PWPF with calibrated uncertainty intervals.

Figure 2. Illustration of diffusion models for time-series forecasting.

Figure 3. Schematic diagram of the TimeDiT architecture.

Figure 4. Forecast performance of ResD-PWPF under 4 typical wind power output scenarios. (a) High-output scenario. (b) Medium-output scenario. (c) Low-output scenario. (d) High-fluctuation scenario.

Table 1. Comparison of three typical categories of PWPF methods.

Category	Distribution-Based	GAN-Based	Diffusion-Based
Distribution assumption	Explicit (e.g., Gaussian, Beta)	N/A	N/A
Expressiveness	Low (limited by fixed form)	Moderate (theoretically flexible)	High (naturally multimodal)
Training stability	High (easy to train)	Low (unstable training)	High (stable convergence)
Uncertainty modeling	Weak (fail with skewed or multimodal patterns)	Moderate (usually poorly calibrated)	High (well-calibrated outputs)
Sampling efficiency	High (closed form)	High (single pass)	Moderate (multi-step)
Limitations	Rigid assumptions, limited flexibility	Mode collapse, temporal inconsistency	Slower sampling, higher inference cost

Table 2. Pseudocode of ODE-based inference implemented with Heun’s second-order method.

1.	Sample ${\tilde{x}}_{N} ~ N (0, {σ_{max}}^{2} I)$
2.	For $i \in \{N, N - 1, \dots, 1\}$ do
3.	    $d_{i} \leftarrow \frac{1}{σ_{i}} [{\tilde{x}}_{i} - F_{θ} ({\tilde{x}}_{i}, σ_{i})]$
4.	    ${\tilde{x}}_{i - 1} \leftarrow {\tilde{x}}_{i} + (σ_{i - 1} - σ_{i}) \cdot d_{i}$
5.	    If $i \neq 1$ then
6.	        ${\tilde{d}}_{i - 1} \leftarrow \frac{1}{σ_{i - 1}} [{\tilde{x}}_{i - 1} - F_{θ} ({\tilde{x}}_{i - 1}, σ_{i - 1})]$
7.	        ${\tilde{x}}_{i - 1} \leftarrow {\tilde{x}}_{i} + \frac{(σ_{i - 1} - σ_{i})}{2} \cdot (d_{i} + {\tilde{d}}_{i - 1})$
8.	Return ${\tilde{x}}_{0}$
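The pseudocode in Table 2 can be turned into a short runnable sketch. The noise schedule below uses the EDM paper's default values (σ_min = 0.002, σ_max = 80, ρ = 7), which are assumptions here rather than settings confirmed for this study, and the constant denoiser is a toy stand-in for the trained network; the function names are hypothetical.

```python
import random


def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Decreasing noise levels sigma_N..sigma_1 (EDM defaults), then sigma_0 = 0."""
    hi, lo = sigma_max ** (1.0 / rho), sigma_min ** (1.0 / rho)
    levels = [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]
    return levels + [0.0]


def heun_sample(denoise, x_start, sigmas):
    """Integrate dx/dsigma = (x - F(x, sigma)) / sigma with Heun's method.

    denoise(x, sigma) returns the predicted clean sample F_theta(x, sigma).
    """
    x = list(x_start)
    for sigma_cur, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        pred = denoise(x, sigma_cur)
        d_cur = [(xi - pi) / sigma_cur for xi, pi in zip(x, pred)]
        step = sigma_next - sigma_cur
        x_next = [xi + step * di for xi, di in zip(x, d_cur)]  # Euler step
        if sigma_next > 0:  # second-order correction, skipped on the final step
            pred_next = denoise(x_next, sigma_next)
            d_next = [(xi - pi) / sigma_next for xi, pi in zip(x_next, pred_next)]
            x_next = [xi + 0.5 * step * (a + b) for xi, a, b in zip(x, d_cur, d_next)]
        x = x_next
    return x


# Toy check: a denoiser that always returns the same residual vector.
rng = random.Random(0)
sigmas = karras_sigmas(10)
target = [0.1, -0.05, 0.0]
x_start = [sigmas[0] * rng.gauss(0.0, 1.0) for _ in target]
sample = heun_sample(lambda x, s: target, x_start, sigmas)
```

With N = 10 noise levels the loop calls the denoiser 19 times (two per step except the last), and for this constant denoiser the trajectory ends at the target vector up to rounding error.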

Table 3. Performance comparison of three PWPF methods across 10 wind farms.

Sites	MAE			RMSE			CRPS
Sites	DeepAR	GAN-BERT	ResD-PWPF	DeepAR	GAN-BERT	ResD-PWPF	DeepAR	GAN-BERT	ResD-PWPF
WF1	0.1355	0.1255	0.1246	0.1672	0.1589	0.1584	0.0959	0.1253	0.0916
WF2	0.1238	0.1262	0.1221	0.1502	0.1533	0.1494	0.0919	0.1214	0.0937
WF3	0.1253	0.1278	0.1209	0.1547	0.1582	0.1522	0.0902	0.1165	0.0920
WF4	0.1256	0.1192	0.1121	0.1644	0.1599	0.1507	0.0904	0.1117	0.0816
WF5	0.1311	0.1327	0.1190	0.1668	0.1675	0.1546	0.0939	0.1230	0.0865
WF6	0.1355	0.1320	0.1242	0.1729	0.1671	0.1586	0.0990	0.1278	0.0914
WF7	0.1104	0.1016	0.0961	0.1380	0.1270	0.1210	0.0778	0.0941	0.0696
WF8	0.1339	0.1333	0.1296	0.1676	0.1645	0.1603	0.0965	0.1252	0.0938
WF9	0.1190	0.1157	0.1067	0.1525	0.1472	0.1385	0.0862	0.1108	0.0782
WF10	0.1570	0.1565	0.1482	0.1922	0.1951	0.1851	0.1113	0.1473	0.1064
AVERAGE	0.1297	0.1271	0.1204	0.1626	0.1599	0.1529	0.0933	0.1203	0.0885

Table 4. Statistical comparison of three PWPF methods across 10 wind farms.

Sites	MAE		RMSE		CRPS
Sites	ResD-PWPF > DeepAR	ResD-PWPF > GAN-BERT	ResD-PWPF > DeepAR	ResD-PWPF > GAN-BERT	ResD-PWPF > DeepAR	ResD-PWPF > GAN-BERT
WF1	$3 \times 10^{- 6}$	$4 \times 10^{- 2}$	$8 \times 10^{- 4}$	$1 \times 10^{- 1}$	$2 \times 10^{- 4}$	$1 \times 10^{- 17}$
WF2	$3 \times 10^{- 1}$	$9 \times 10^{- 3}$	$5 \times 10^{- 1}$	$3 \times 10^{- 2}$	$6 \times 10^{- 1}$	$1 \times 10^{- 22}$
WF3	$1 \times 10^{- 2}$	$3 \times 10^{- 3}$	$8 \times 10^{- 2}$	$3 \times 10^{- 3}$	$7 \times 10^{- 1}$	$4 \times 10^{- 17}$
WF4	$4 \times 10^{- 9}$	$7 \times 10^{- 5}$	$3 \times 10^{- 7}$	$6 \times 10^{- 5}$	$1 \times 10^{- 8}$	$1 \times 10^{- 22}$
WF5	$6 \times 10^{- 6}$	$4 \times 10^{- 9}$	$8 \times 10^{- 5}$	$7 \times 10^{- 7}$	$9 \times 10^{- 6}$	$2 \times 10^{- 23}$
WF6	$5 \times 10^{- 6}$	$5 \times 10^{- 5}$	$4 \times 10^{- 6}$	$2 \times 10^{- 4}$	$4 \times 10^{- 6}$	$2 \times 10^{- 23}$
WF7	$1 \times 10^{- 11}$	$1 \times 10^{- 3}$	$1 \times 10^{- 10}$	$3 \times 10^{- 3}$	$6 \times 10^{- 9}$	$1 \times 10^{- 20}$
WF8	$3 \times 10^{- 3}$	$1 \times 10^{- 3}$	$7 \times 10^{- 4}$	$5 \times 10^{- 3}$	$8 \times 10^{- 3}$	$6 \times 10^{- 23}$
WF9	$3 \times 10^{- 6}$	$3 \times 10^{- 5}$	$2 \times 10^{- 5}$	$7 \times 10^{- 4}$	$8 \times 10^{- 6}$	$2 \times 10^{- 23}$
WF10	$5 \times 10^{- 3}$	$3 \times 10^{- 4}$	$3 \times 10^{- 2}$	$2 \times 10^{- 4}$	$2 \times 10^{- 2}$	$9 \times 10^{- 24}$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

