Skewed and Mixture of Gaussian Distributions for Ensemble Postprocessing

: The implementation of statistical postprocessing of ensemble forecasts is increasingly developed among national weather services. The so-called Ensemble Model Output Statistics (EMOS) method, which consists of generating a given distribution whose parameters depend on the raw ensemble, leads to signiﬁcant improvements in forecast performance for a low computational cost, and so is particularly appealing for reduced performance computing architectures. However, the choice of a parametric distribution has to be sufﬁciently consistent so as not to lose information on predictability such as multimodalities or asymmetries. Different distributions are applied to the postprocessing of the European Centre for Medium-range Weather Forecast (ECMWF) ensemble forecast of surface temperature. More precisely, a mixture of Gaussian and skewed normal distributions are tried from 3- up to 360-h lead time forecasts, with different estimation methods. For this work, analytical formulas of the continuous ranked probability score have been derived and appropriate link functions are used to prevent overﬁtting. The mixture models outperform single parametric distributions, especially for the longest lead times. This statement is valid judging both overall performance and tolerance to misspeciﬁcation.


Introduction
In numerical weather prediction (NWP), ensemble forecasts aim to capture the expected uncertainty associated with a weather forecast [1]. In practice, multiple forecasts are produced with different inputs such as several parametrizations, several model physics, or slightly perturbed initial conditions. Well-known examples of such forecasts are the densitybased or probability weather forecasts [2,3]. However, for surface variables, ensembles tend to be biased and misdispersed because some error sources such as insufficient resolution, discretized physical equations, and processes cannot be handled [4]. For instance, ensemble bias can be seen as the bias coming from the dynamical core of the deterministic forecast used to run ensembles, ensemble wrong dispersion can come from a sampling error unable to catch the underlying distribution of the atmospheric state. Ensemble forecasts thus need to be statistically postprocessed.
In recent years, numerous techniques of statistical ensemble postprocessing have been set up. Exhaustive reviews of these techniques are available in Vannitsem et al. [5,6]. The rise of artificial intelligence in weather and climate sciences [7] have permitted the development of promising data-driven techniques, see e.g., Taillardat et al. [8], Rasp and Lerch [9], Scheuerer et al. [10], Veldkamp et al. [11], and Grönquist et al. [12]. In practice, a compromise on operational methods has to be made between data storage issues, reduced computational resources with respect to NWP models, and tuning and implementation of the algorithms [13,14]. As a result, lighter and easier methods to implement than data-driven ones are still up to date.
Hence, this paper deals with the widely used technique called nonhomogeneous regression or ensemble model output statistics (EMOS) [15]. This type of distribution regression approach has been extended to numerous variables' responses and distributions such as Gaussian distributions [15,16], truncated normal [17], log-normal [18], gamma [19], or generalized extreme values [20,21]. Jordan et al. [22] review most of the distributions commonly used in ensemble model output statistics. Extensions and refinements of this technique have been proposed to improve the classical framework like the addition or selection of covariates [23], a more realistic selection of the training sample [24], or semilocal versions of the algorithm [25,26]. This study focuses on local implementation of EMOS using only the raw ensemble empirical statistics as potential covariates.
The seminal work of Gneiting et al. [15] considered Gaussian distributions for the EMOS postprocessing of ensemble forecasts of surface temperature. One can wonder whether this assumption remains valid among forecast horizons, location specificities, or unstable atmospheric states. For example, Gebetsberger et al. [27] used logistic and skewed logistic distributions for site-specific ensemble forecasts. Moreover, Baran and Lerch [28,29] investigated the mixture and the combination of distributions for statistical postprocessing in the spirit of the work of Möller and Groß [30], which is particularly appealing if some or all the members of the raw ensemble are not exchangeable, and offers a simpler framework than Bayesian model averaging [31]. An example of ensemble prediction system is the 51member European Centre for Medium-Range Weather Forecasts (ECMWF) ensemble [32]. This ensemble includes one control (unperturbed) member and 50 exchangeable ones. In Gneiting [33], a distinction is made between the control forecast and the mean in a single parametric density. It could be of interest to fit a mixture of distributions accounting for potential skewness and/or multimodalities of the physically based ensemble forecasts in order not to lose information on predictability, especially for the longest lead times.
This article proposes a new distribution for the EMOS, the skew normal distribution, generalizing the Gaussian distribution with a parameter accounting for asymmetries in the predictive distribution. Analytical formulas of popular probabilistic scoring rules are derived and this distribution is combined with the Gaussian distribution and provides flexibility and skillful predictive distributions. Mixture and single parametric Gaussian and skew normal distributions are compared on the ECMWF ensemble forecast of surface temperature, using different estimation methods. Section 2 contains a description of the data used for this study and exhibits the EMOS models used. Section 3 presents the results for temperature postprocessing and Section 4 discusses these results and concludes.

European Centre for Medium-Range Weather Forecasts Ensemble
The ECMWF ensemble consists of one control member and 50 exchangeable members of 2-meter temperature from 3-up to 360-h lead times twice a day. Only the initialization time 00:00 UTC is presented and used in this study. The grid of the ensemble is 0.25°regular and the members are bilinearly interpolated to 2056 locations in Western Europe where observations are available, see Figure 1. Data span 3 years from 1 August 2017 to 31 July 2020. The results presented in Section 3 are based on a leave-one-year-out cross-validation strategy.

Ensemble Model Output Statistics
A basic example of the EMOS framework for a Gaussian response is recalled in Section 2.2.1. The EMOS for the skew normal distribution is presented in Section 2.2.2. The mixture models are presented in Section 2.2.3 and the tuning, link functions, and estimation methods are summarized in Section 2.2.4. In the following, φ and Φ are the probability density function (PDF) and the cumulative distribution function (CDF) of the standard normal distribution, respectively.

Gaussian EMOS
The work of Gneiting et al. [15] introduced the EMOS for Gaussian distributions. Here, since only the control run is distinguishable from the other runs, the predictive distribution is where CTRL denotes the control member, ENS the empirical ensemble mean, and σ 2 ENS the empirical ensemble variance. The coefficients a CTRL , a ENS , b 0 , b ENS are constrained to be non-negative. The value of the coefficients is determined by M-estimation using the logarithmic scoring rule (LogS, equivalent to a maximum likelihood estimation) [34] or the continuous ranked probability score (CRPS) [35][36][37]. In the following, the EMOS algorithms using a single Gaussian density are called NORM-CRPS and NORM-LOGS with respect to the scoring rule employed for the estimation of the coefficients.

The Skew Normal Distribution for EMOS
The predictive density of the skew normal distribution is In the following, we use K(x; m, s, α) for the CDF of the skew normal distribution. Note that if α = 0, we have a Gaussian distribution of mean m and variance s 2 . For m = 0 and s = 1, the Figure 2 shows the role of α in the skew normal distribution. The skew normal distribution has a mean and variance, respectively, of where δ = α √ 1+α 2 is determined from the skewness γ: The extreme values for α = δ √ 1−δ 2 are reached when δ 2 is close to 1, which is equivalent to |γ| = 0.99527. Therefore, the calculation of α is only possible if |γ| is reasonably small. Pewsey [38] reports these issues for the skew normal distribution, where no closed-form expression of the maximum likelihood estimator of (m, s, α) is available (unless α = 0).
For this study, the representation of the single parametric skew normal distribution is as follows: where γ ENS is the empirical ensemble skewness. As for the Gaussian framework, the coefficients a CTRL , a ENS , b 0 , b ENS are constrained to be non-negative. A formula for the LogS is not hard to compute. Denoting (; α) = (; 0, 1, α) for k and K, a closed-form expression of the CRPS is available: and In the following, the EMOS algorithms using a single skew normal density are called SKN-LOGS and SKN-CRPS.

Mixture Distributions for EMOS
Following the work of Baran and Lerch [28], Möller and Groß [30], we can use a weighted mixture of distributions to exploit the differences between distinguishable components of the ensemble. As mentioned in Baran and Lerch [29], this is different to a combination of predictive distributions where single parametric distributions are determined and then pooled in a two-step procedure where the weights are estimated after. This topic is beyond the scope of the paper, but the interested reader in probabilistic aggregation can refer to Baran and Lerch [29], Ranjan and Gneiting [39], Gneiting et al. [40], Zamo et al. [41]. In this study, two mixtures are proposed. The first one is a mixture of two Gaussian distributions (called MIXNRM-LOGS and MIXNRM-CRPS in the following). The predictive density is where µ CTRL and µ ENS are affine functions of CTRL and ENS, respectively. σ CTRL is a non-negative real and σ ENS is an affine function (with non-negative coefficients) of the ensemble standard deviation σ ENS . The second mixture proposed relies on a skew normal fit for the distribution of the control run (where the skewness depends on the ensemble skewness) and a Gaussian distribution for the ensemble mean; in terms of predictive density, this yields where µ ENS is an affine function of ENS, s is a non-negative real, and σ ENS is an affine function (with non-negative coefficients) of the ensemble standard deviation σ ENS . α is computed using Equations (7)- (9). m CTRL is calculated with the Equation (5), where, in this case, a ENS = 0. Due to the central limit theorem, a Gaussian distribution for the ensemble mean is chosen, the skew normal distribution is fitted using CTRL and ensemble empirical skewness. Grimit et al. [42] provided the closed-form expression of the CRPS for a mixture of Gaussian distributions, and Baran and Lerch [29] provided a formula for a combination of two predictive distributions H(t) = wF(t) + (1 − w)G(t): The last integral term does not have a closed-form for the mixture proposed in (13). As a result, only the mixture MIXSKN-LOGS is considered here.

Training, Tuning, and Link Functions
Recently, Lang et al. [43] examined the training strategies for nonhomogeneous regression. A trade-off is made between time adaptation and the data over multiple years to obtain unbiased and stable regression estimates. Here, the choice is made to adjust the coefficients seasonally using all of the same seasons available in the data set. More precisely, each year is divided into four seasons of three months, and the coefficients estimated for a season with two given years are applied to the third year to compute the predictive distributions.
Both maximum likelihood (LogS) estimation and CRPS minimization are used when possible. Gebetsberger et al. [44], Yuen and Stoev [45] notice that these estimations yield similar results when the response distribution is well specified, which is natural since one can guess that the minimum expected values of the scores are reached; thus, the ideal forecast is made, according to the definition of proper scoring rules. Using various estimation methods permits detecting model misspecification implicitly. Practically, one must avoid overfitting. For example, in case the coefficients are estimated day by day, Scheuerer [46] proposes to stop the optimization early provided that starting values are known to be accurate. The question of numerical optimization is a crucial one when dealing with a huge number of models where a manual check of the coefficients is hard to do. Gebetsberger et al. [47] report some solutions to avoid numerical blow up, for example, the use of the quadratic link function for non-negative coefficients [15]. Moreover, the LogS and the CRPS are not sensitive to the same errors in numerical optimization and Gneiting et al. [15] states that a LogS strategy can lead to overdispersed forecasts and CRPS can lead to sharper but lesser calibrated forecasts [44].
Hence, a universal link function is proposed here, for a parameter θ, based on the logistic function: The function l(; A, B) is defined on R so no constrained estimation is imposed. In addition, the link function is bounded on (A, B) and the range of the coefficients can be controlled. Last but not least, the derivatives of l(; A, B) are mild and easy to compute. The spirit here is very similar to the use of sigmoid functions in neural networks, leading to faster convergence with backpropagation [48].
For instance, in the three mixture EMOSs exposed, all three use l(; 0.05, 0.95) to represent ω. A summary of the link functions employed according to the methods used is available in Table 1.

Results
All results presented in the following come from 3-fold cross-validation based on the 3 years of data at hand. For each station, lead time, and season, three regression models are estimated using two seasons while the remaining season is used for validation. In the following, the raw ensemble is denoted by RAW. For the postprocessed methods, quantiles of order 1 52 , 2 52 , · · · , 51 52 are computed and used in the verification process.

Probabilistic Calibration
Probabilistic calibration is commonly evaluated with probability integral transform (PIT) or rank histograms (RH) [49][50][51][52]. Subject to calibration, the PIT statistic follows a uniform distribution, its mean is 1 2 , and its normalized variance is 1 [3,53].  . PIT mean and PIT variance for the 2056 weather stations of the benchmark. Subject to calibration, the PIT statistic follows a uniform distribution, its mean is 1 2 , and its normalized variance is 1.
The PIT mean shows a clear improvement of forecast bias regardless of the method considered. The RAW ensemble is cold-biased, and the diurnal effect on forecast performance is reduced. The PIT variance of the RAW indicates an underdispersed forecast, with a reduction of the underdispersion across lead times. It may be imputable to the inflation of the spread of the ensemble. Most of the LogS-learned algorithms show a very slight overdispersion, especially for earlier lead times. NORM-CRPS and SKN-LOGS have an increase of the PIT variance across lead times, until the RAW PIT variance. Moreover, SKN-CRPS clearly shows underdispersed forecasts, with its best dispersion from Day 4 to Day 6. The stability of the behavior of MIXNRM indicates that a fit of Gaussian mixture is a good specification for ensemble forecasts of temperature. NORM-CRPS probably indicates that a normal fit, for longer lead times, is unable to catch enough variability from the RAW distribution. The bad dispersion of SKN-CRPS may come from a target distribution, which is not tailored for a single parametric EMOS for temperature or for an erratic management of forecast variance by the CRPS itself-the LogS being more sensitive to dispersion issues. At last, it is interesting to see that SKN and NORM have different predictive patterns even if the Gaussian distribution is a special case of the skew normal distribution. Figure 4 represents the average CRPS for each technique along lead times. The CRPS are computed using the 51-quantile forecasts issued from the predictive distributions and following Hersbach [36], Ferro [54], Zamo and Naveau [55]. Single parametric EMOS have a performance similar to the RAW ensemble at Day 10, and the CRPSS (skill score-not shown here) with respect to the raw ensemble are below 0 after Day 10. The EMOS method is not to blame; however, these results can be explained by the lack of information on the RAW ensemble (not enough covariates) and predictive distributions that are not flexible enough. The results until Day 10 are in compliance with Hemri et al. [56]. The three MIX models are very close together and have the best CRPS averages, especially from Day 6, showing the importance of a mixture representation of ensemble forecasts of temperature; in this case, their CRPSS are at 0.4 near Day 1, and fall down to 0.2 at Day 15. Notice that the SKN-CRPS seems to be the worst postprocessed forecast but one found that misdispersion (here, sharper but lesser calibrated forecasts) is hidden by the CRPS, which confirms the finding of Gebetsberger et al. [44]. The ECMWF use a new headline index based on the percentage of "large errors", i.e., forecasts, where CRPS > 5 • C at Day 5 [57]. This idea is taken here and exposed in Figure  5. Figure 5 is very similar to Figure 4. There is a clear improvement of the large errors, especially with the mixture models and after Day 10. However, it is noteworthy that for the single parametric EMOS, the LogS-learned methods commit less large errors than with a CRPS optimization. This behavior is quite logical since CRPS provides sharper forecasts; however, here, it is too confident with respect to the forecast horizon. In order to understand the differences between optimization strategies, an assessment of forecast quality using the LogS is presented in Figure 6. The choice is made to work with the ensemble members instead of the predictive distribution to be fair with the RAW ensemble. A discretization of the LogS is made in the spirit of Roulston and Smith [34] proposal. Let f 1 , · · · , f 51 be the ordered (and rounded to the closest tenth) values of the predictive distribution, the LogS( f , y) (in nat) proposed is (16) where the lower the better. Here, one must be careful regarding the sensitivity of the inference of the score above. Indeed, some choices can interfere with our results here. The interested reader on this topic can consult Weijs et al. [58], Tödter and Ahrens [59], Siegert et al. [60]. First, one can assume that the ranking of the forecasts is very similar to those in Figure 4; in particular, for the best method at each lead time. Using proper scoring rules in verification ensures that the best (but a priori unknown) forecasts are rewarded on average. Here, the fact that the same forecast (MIXNRM-LOGS) seems to be dominant both in CRPS and LogS advocates that this forecast is the best subject for the information provided by the RAW ensemble (i.e., the score has only an uncertainty component remaining). Yet, a difference between MIXNRM-LOGS and MIXNRM-CRPS is noticeable until Day 3, and the SKN-CRPS method is still the worst among postprocessed forecasts. In contrast to Figure 3, the LogS seems to penalize underdispersed forecasts, as in the optimization process. Last but not least, the RAW ensemble seems to gain predictive performance until Day 6. This is not the case in Figure 4. This can be due to the initial ensemble perturbations of the RAW ensemble [61], probably better tailored to medium-range or extended-range forecasts.
The differences between MIXNRM-LOGS, MIXNRM-CRPS, and MIXSKN-LOGS are not very significant in terms of scoring rules. Therefore, the final choice of a predictive distribution can be made using a simple tool inspired by the so-called weather roulette [62]. Figure 7 shows the frequency where the CRPS of the postprocessed forecast is better than the RAW forecast. This tool demonstrates that the MIXNRM-CRPS wins 1% more than the other methods against the RAW ensemble for the longest lead times. Figure 7. Proportion of the cases where the CRPS of the postprocessed forecasts are better than the RAW forecast. This tool demonstrates that the MIXNRM-CRPS wins 1% more than the other methods against the RAW ensemble for the longest lead times.

Discussion and Conclusions
Overall, all the postprocessed forecasts show an improvement compared to the RAW ensemble. The single parametric distributions are effective until Day 10. The mixture distributions are the most skillful forecasts, regardless of the forecast horizon. This statement is supported through two different proper scoring rules, emphasizing and penalizing different types of errors. We are unaware of the connections of the tool of Figure 7 with some tests of predictive performance, economic value of the forecasts, or scoring rules, and such a figure should be drawn for LogS. In the future, very simple ideas such as "Your forecast is better than the RAW one X times over Y" would be in favor of a broader communication of ensemble information for decision making [63].
The choice of a "perfect" parametric distribution is probably a never-ending story. The mixture distributions are appealing solutions to overcome this issue. Here, the skew normal distribution is probably not tailored for a single parametric use and/or for surface temperature ensemble forecasts. Skewed distributions propose new solutions in order to combine nonexchangeable forecasts with distributions centered on ensemble means. In this study, the skew normal distributions are applied on the control forecast but one may consider other ensemble forecasts to combine. Moreover, more and more meteorological variables are postprocessed with numerous distributions, see e.g., Schulz et al. [64], Baran et al. [65]; in this context, the more distributions one tries, the better the results will be. For future work, one may try to implement EMOS schemes where the choice of a parametric distribution is related on forecasted weather regimes.
In all cases, the choice of appropriate parameter estimation schemes and link functions appears to be as important as the choice of target distributions. It is not shown here, but one finds that the proportion of Normality tests rejected across lead times is clearly anticorrelated to the RAW LogS, and that tests relying on Kolmogorov-Smirnov statistic are less powerful than Jarque-Bera or Shapiro-Wilk ones. The Kolmogorov-Smirnov statistic is based on a CDF distance such as the CRPS. One can wonder whether this behavior is to associate with the better sensitivity of the CRPS than the LogS to tolerate misspecification but it upholds the work of Leutbecher and Haiden [66] on a Gaussian approximation of the CRPS.
This work is also an advocate for the use of several scoring rules in forecast verification. Some scores, even if they are proper, are sensitive to different types of forecast errors, and differences among them can pinpoint peculiarities in forecasts. Indeed, minimizing an expected score may be less important than reducing large (and costly) errors.