Random Noise vs State-of-the-Art Probabilistic Forecasting Methods: A Case Study on CRPS-Sum Discrimination Ability

Recent developments in machine learning have enabled the development of complex multivariate probabilistic forecasting models. It is therefore pivotal to have precise evaluation methods to gauge the performance and predictive power of these complex methods. Several evaluation metrics have been proposed in the past (such as the Energy Score, the Dawid-Sebastiani score, and the variogram score); however, they cannot reliably measure the performance of a probabilistic forecaster. Recently, CRPS-Sum has gained prominence as a reliable metric for multivariate probabilistic forecasting. This paper presents a systematic evaluation of CRPS-Sum to understand its discrimination ability. We show that the statistical properties of the target data affect the discrimination ability of CRPS-Sum. Furthermore, we highlight that the CRPS-Sum calculation overlooks the performance of the model on each dimension. These flaws can lead to an incorrect assessment of model performance. Finally, with experiments on a real-world dataset, we demonstrate that the shortcomings of CRPS-Sum provide a misleading indication of the performance of a probabilistic forecasting method. We show that a dummy model, whose forecasts resemble random noise, can easily attain a better CRPS-Sum than a state-of-the-art method.


Introduction
In the last decades, probabilistic forecasting has drawn a lot of attention in the scientific community, which has led to the fast-paced development of new methods as well as applications in a wide variety of domains including renewable energies [1,2,3], weather forecasting [4,5,6], seismic hazard prediction [7,8], and health care [9]. Due to the recent upsurge in Deep Learning, several approaches have been proposed which are based on or inspired by these powerful neural networks. To name some of these techniques, there are DeepAR [10], Low-Rank Gaussian Copula Processes [11], Conditioned Normalizing Flows [12], the Normalizing Kalman Filter [13], the Denoising Diffusion Model [14], and Conditional Generative Adversarial Networks [15,16]. In many proposed solutions, we do not have direct access to the probability distribution of the model over future values. Hence, we acquire the forecast through Monte Carlo sampling. Once these models are available, it is important to have evaluation methods which can be used to gauge their performance. The assessment of probabilistic models poses a special challenge when we only have access to forecast samples from the probabilistic model.
Garthwaite et al. [17] coined the concept of scoring rules for summarizing the quality of a probabilistic forecaster with a numerical score [18]. A scoring rule is expected to make a careful assessment and be honest [17]. Gneiting et al. [18] proposed the Continuous Ranked Probability Score (CRPS) for univariate and the Energy Score (ES) for multivariate time series as strictly proper scoring rules. While CRPS presents a robust and reliable assessment method for univariate time series forecasting, the discrimination ability of ES diminishes in higher dimensionalities [19]. Several other multivariate metrics [19,20] were proposed to address probabilistic forecaster assessment in higher dimensions; however, each of them has a flaw that makes it unsuitable for the assessment task. For instance, the variogram score [20] is a proper scoring rule which reflects misalignment in correlation very well, but it lacks the strictness property. The Dawid-Sebastiani score [21] employs only the first two moments of the distribution for evaluation, which is not sufficient for many applications. A thorough analysis of these metrics is provided in [22].
Recently, Salinas et al. [11] suggested CRPS-Sum as a new proper multivariate scoring rule. This scoring rule has been well received in the scientific community [11,12,14,13]. However, the properties of CRPS-Sum have not been studied so far.
In this paper, our goal is to discuss the discrimination ability of CRPS-Sum. We have conducted several experiments on artificial and real datasets to investigate the power of CRPS-Sum to quantify the performance of a probabilistic forecaster. Based on the experiments' results, we point out the loopholes in this metric and discuss how CRPS-Sum can mislead us in the interpretation of a model's performance.

Problem Specification
The forecasting task deals with predicting the future given historical information of a time series. A time series can have multiple dimensions. The notation x_t^i indicates the value of a time series at time-step t in the i-th dimension. If a time series has only one dimension, it is called a univariate time series; otherwise, it is a multivariate time series.
In the forecasting task, given x_{0:T} as historical information, we are interested in predicting the values x_{T+1:T+h}, where h stands for the forecast horizon. In probabilistic forecasting, the target is to acquire the range of possible outcomes with their corresponding probabilities. In more concrete terms, we aim to model the following conditional probability distribution:

P_model(x_{T+1:T+h} | x_{0:T})     (1)

For the assessment of a probabilistic forecasting model, the goal is to measure how well the model is aligned with the probability distribution of the data. In other words, we want to calculate the divergence between P_model and P_data.

Evaluation Metrics for Probabilistic Forecasting Models
Our first challenge in assessing a probabilistic model is that, in real-world scenarios, we do not have access to the distribution of the true generative process, i.e. P_data. We only have access to observations from P_data. Scoring rules provide a general framework for evaluating the alignment of P_data with P_model. A scoring rule is any real-valued function which provides a numerical score based on the predictive distribution (i.e. P_model) and a set of observations X.
A scoring rule can be defined with positive or negative orientation. In this paper, we consider the negative orientation, since it can be interpreted as the model error and, as a result, is more popular in the scientific community. Hence, a lower score indicates a better probabilistic model.
A scoring rule S is proper if the following inequality holds:

E_{x ~ P_data}[S(P_data, x)] ≤ E_{x ~ P_data}[S(P_model, x)]     (3)

A scoring rule is strictly proper if the equality in equation 3 holds if and only if P_model = P_data [18]. Therefore, only a model which is perfectly aligned with the data generative process can acquire the lowest score under a strictly proper scoring rule. Various realizations of scoring rules have been proposed to evaluate the performance of a probabilistic forecaster. Below, we review three scoring rules which are commonly used for the assessment of a probabilistic forecasting model.

Continuous Ranked Probability Score (CRPS)
CRPS is a univariate strictly proper scoring rule which measures the compatibility of a cumulative distribution function F with an observation x ∈ R as

CRPS(F, x) = ∫_R (F(y) − 1{x ≤ y})^2 dy     (4)

where 1{x ≤ y} is the indicator function, which is one if x ≤ y and zero otherwise.
The predictive distributions are often expressed in terms of samples, possibly obtained through Monte Carlo sampling [18]. Fortunately, there are several methods to estimate CRPS given only samples from the predictive distribution. The precision of these approximation methods depends on the number of samples we use for estimation. Below, we list the most commonly used techniques.
Empirical CDF: In this technique, we approximate the CDF of the predictive model using its samples:

F̂(y) = (1/n) Σ_{i=1}^{n} 1{X_i ≤ y}     (5)

Then, we can use F̂(y) in conjunction with Equation 4 to approximate CRPS.
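As a rough sketch of this recipe (not the authors' implementation; the function name and integration grid are illustrative assumptions), the empirical CDF can be plugged into Equation 4 and integrated numerically:

```python
import numpy as np

def crps_empirical_cdf(samples, x, grid_size=10_000):
    """Approximate CRPS(F, x) by numerically integrating the squared
    difference between the empirical CDF of the samples and the step
    function 1{x <= y} (Equation 4)."""
    samples = np.sort(np.asarray(samples, dtype=float))
    # Grid covering both the samples and the observation; outside this
    # range the integrand is zero because both functions agree (0 or 1).
    lo = min(samples[0], x) - 1.0
    hi = max(samples[-1], x) + 1.0
    y = np.linspace(lo, hi, grid_size)
    ecdf = np.searchsorted(samples, y, side="right") / len(samples)
    step = (y >= x).astype(float)
    dy = y[1] - y[0]
    return float(np.sum((ecdf - step) ** 2) * dy)
```

For F = N(0, 1) and x = 0, the estimate should approach the analytic value of roughly 0.2337 as the number of samples grows.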
Quantile-based: The pinball loss, or quantile loss, at a quantile level α ∈ [0, 1] with a predicted α-th quantile q is defined as

Λ_α(q, x) = (α − 1{x < q})(x − q)     (6)

The CRPS has an intuitive definition as the pinball loss integrated over all quantile levels α ∈ [0, 1]:

CRPS(F, x) = 2 ∫_0^1 Λ_α(F^{-1}(α), x) dα     (7)

where F^{-1} represents the quantile function. In practice, we approximate the quantiles based on the samples we have. Therefore, equation 7 can be approximated as a summation over N quantiles. The precision of our approximation depends on the number of quantiles as well as the number of samples we have.
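A minimal sketch of this quantile-based approximation (names are illustrative; equation 7 is discretized on a midpoint grid of quantile levels):

```python
import numpy as np

def crps_quantile(samples, x, num_quantiles=20):
    """Approximate CRPS(F, x) as twice the average pinball loss over a
    midpoint grid of N quantile levels (discretized equation 7)."""
    samples = np.asarray(samples, dtype=float)
    # Midpoint levels alpha_j = (j - 0.5) / N for j = 1..N.
    alphas = (np.arange(1, num_quantiles + 1) - 0.5) / num_quantiles
    q = np.quantile(samples, alphas)            # estimated F^{-1}(alpha_j)
    pinball = (alphas - (x < q)) * (x - q)      # pinball loss per level
    return float(2.0 * pinball.mean())
```

With enough samples and quantile levels, the estimate again approaches the analytic value for a standard Gaussian at x = 0.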
Sample Estimation: Using Lemma 2.2 of [23] or identity 17 of [24], we can approximate CRPS by

CRPS(F, x) = E|X − x| − (1/2) E|X − X'|     (8)

where X and X' are independent copies of a random variable with distribution function F and finite first moment [18].

To investigate the significance of the sample size on the accuracy of the different approximation methods, we ran a simple experiment. In this experiment, we assumed that the probabilistic model follows a Gaussian distribution with µ = 0 and σ = 1. Then, we approximated CRPS(F, x) for x = 0 with various sample sizes in the range [200, 5000]. Since we know the distribution of the probabilistic model, we can calculate the value of CRPS analytically, i.e. CRPS(F, x) ≈ 0.2337.
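The sample estimator, together with a sanity check against the analytic value used in the experiment above, can be sketched as follows (a rough illustration, not the exact experimental setup):

```python
import numpy as np

def crps_sample_estimate(samples, x):
    """Approximate CRPS(F, x) = E|X - x| - 0.5 * E|X - X'| using all
    sample pairs for the second expectation."""
    s = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(s - x))
    term2 = 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))
    return float(term1 - term2)

# Sanity check: for F = N(0, 1) and x = 0 the analytic value is
# approximately 0.2337.
rng = np.random.default_rng(0)
estimate = crps_sample_estimate(rng.standard_normal(2000), 0.0)
```

The all-pairs form has quadratic cost in the number of samples; splitting the samples into two independent halves is a cheaper alternative at the price of higher variance.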
From Figures 1a and 1b, we can see that the empirical CDF method and the sample estimation method converge efficiently to the close vicinity of the true value. However, the empirical CDF method has less variance than sample estimation. The method based on pinball loss depends on both the sample size and the number of quantiles. Figure 1c portrays how these two factors affect the approximation. We can see that with more than 20 quantiles, the pinball loss method can produce a very good approximation using only a few samples (circa 500 samples).

Figure 1: The effect of sample size on the precision of the CRPS approximation using different methods. All approximation methods provide a close estimation; however, the sample estimation method has more variance.

Energy Score (ES)
The Energy Score (ES) is a strictly proper scoring rule for multivariate time series. For an m-dimensional observation x ∈ R^m and a predictive cumulative distribution function F, the Energy Score [18] is defined as

ES(F, x) = E‖X − x‖^β − (1/2) E‖X − X'‖^β     (9)

where ‖·‖ denotes the Euclidean distance, β ∈ (0, 2), and X and X' are independent copies of a random vector with distribution function F. We can see that CRPS is a special case of ES where β = 1 and m = 1. While ES is a strictly proper scoring rule for all choices of β, the standard choice in applications is β = 1 [18].
ES provides an assessment method for probabilistic forecasting models which works well on multivariate time series. Unfortunately, ES suffers from the curse of dimensionality [19], and its discrimination power decreases with an increasing number of data dimensions. Still, the performance of ES in lower dimensionalities complies with the expected behavior of an honest and careful assessor. Hence, we can use its behavior in lower dimensionalities as a reference for comparison with newly suggested assessment methods.
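A sample-based ES estimator can be sketched in the same spirit as the CRPS sample estimation technique described earlier (the function name and all-pairs form are illustrative assumptions):

```python
import numpy as np

def energy_score(samples, x, beta=1.0):
    """Sample-based Energy Score for an m-dimensional observation.
    samples has shape (n, m); x has shape (m,)."""
    s = np.asarray(samples, dtype=float)
    x = np.asarray(x, dtype=float)
    # E||X - x||^beta, estimated over the forecast samples.
    term1 = np.mean(np.linalg.norm(s - x, axis=1) ** beta)
    # 0.5 * E||X - X'||^beta, estimated over all sample pairs.
    pairwise = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
    term2 = 0.5 * np.mean(pairwise ** beta)
    return float(term1 - term2)
```

For m = 1 and β = 1 this reduces to the sample estimator of CRPS, which offers an easy consistency check against the Gaussian example above.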

CRPS-Sum
To address the limitation of ES on multidimensional data, Salinas et al. [11] introduced CRPS-Sum for evaluating a multivariate probabilistic forecasting model. CRPS-Sum is a proper scoring rule, but it is not a strictly proper one. CRPS-Sum extends CRPS to multivariate time series with a simple modification. It is defined as

CRPS-Sum = E_t[ CRPS(F_sum^{-1}, Σ_i x_t^i) ]     (10)

where F_sum^{-1} is calculated by summing samples across dimensions and then sorting to obtain quantiles. Equation 10 calculates CRPS with the quantile-based method (equation 7). In a more general sense, one can calculate CRPS-Sum by summing both the samples and the observations across the dimensions. This way, we acquire a univariate vector of samples and observations. Then, we can apply any of the aforementioned approximation methods to calculate CRPS-Sum.
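Following this more general recipe (sum both samples and observations across dimensions, then apply any univariate CRPS estimator), a sketch using the sample estimator might look like this; the array shapes and names are assumptions, not the authors' implementation:

```python
import numpy as np

def crps_sum(forecast_samples, target):
    """CRPS-Sum: sum across dimensions, then average the univariate
    CRPS (sample estimator) over time-steps.
    forecast_samples: (n_samples, T, m); target: (T, m)."""
    summed_forecast = forecast_samples.sum(axis=-1)   # (n_samples, T)
    summed_target = target.sum(axis=-1)               # (T,)
    scores = []
    for t in range(summed_target.shape[0]):
        s = summed_forecast[:, t]
        term1 = np.mean(np.abs(s - summed_target[t]))
        term2 = 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))
        scores.append(term1 - term2)
    return float(np.mean(scores))
```

Note that any forecast whose across-dimension sums match the summed target receives a perfect score, regardless of per-dimension quality.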

Investigating CRPS-Sum properties
CRPS-Sum has been widely welcomed by the scientific community, and many researchers have used it to report the performance of their models [11,12,14,13]. However, the capabilities of CRPS-Sum have not been investigated thoroughly, unlike the vast body of studies dedicated to the properties of ES and CRPS [18,19,22]. To evaluate the discrimination ability of CRPS-Sum, we conducted several experiments on toy datasets and outline the results in this section.

CRPS-Sum Sensitivity Study
In this study, we inspect the sensitivity of CRPS-Sum with respect to changes in the covariance matrix. This study extends the sensitivity studies previously conducted by [19,22] for various scoring rules including CRPS and ES. For an easier interpretation of the scoring rule's response to changes in the model or data, we define the relative change of the scoring rule, ∆_rel.
We run our experiment N times, where CS_i denotes the CRPS-Sum obtained from the i-th experiment. We define

CS̄ = (1/N) Σ_{i=1}^{N} CS_i

as the mean value of CRPS-Sum over the N experiments. Furthermore, let CS* signify the CRPS-Sum of a model which is identical to the true data distribution. Now, the relative change [19] in CRPS-Sum is defined as

∆_rel = (CS̄ − CS*) / CS*

This metric frames the relative change in CRPS-Sum of a forecasting model across our experiments as the difference between the predicted and actual density of the stochastic process. The main idea is to determine the sensitivity of the score with respect to some biased non-optimal forecast in a relative manner.
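The relative-change metric itself is a one-liner; the sketch below also includes an illustrative helper for drawing correlated data under the bivariate-Gaussian setting of this study (both names are assumptions for illustration):

```python
import numpy as np

def delta_rel(model_scores, optimal_score):
    """Relative change: (mean of CS_i over N runs - CS*) / CS*, where
    CS* is the score of the model matching the data distribution."""
    return (np.mean(model_scores) - optimal_score) / optimal_score

def sample_bivariate(rho, n, rng):
    """Draw n points from a zero-mean bivariate Gaussian with unit
    variances and correlation rho (the model family of this study)."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    return rng.multivariate_normal(np.zeros(2), cov, size=n)
```

A full sensitivity study would score each candidate model's forecasts against data drawn with a fixed ρ and plot ∆_rel over the grid of model correlations.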
Furthermore, we specify a family of forecasting models f which follow the same distribution as the data, except that the off-diagonal element of the covariance matrix ranges over [−1, −0.9, −0.8, ..., 0.8, 0.9, 1]. In our study, we sample n = 2^14 windows of size w = 2^9, as suggested in [22]. Figure 2 illustrates the relative change in CRPS-Sum and ES with respect to changes in the correlation ρ of the data generating process, as a function of the correlation coefficient of the family of models we study. We can observe that the behavior of ES is unbiased with regard to ρ and its curve is symmetric. This is the expected behavior of a scoring rule in this scenario. In contrast, the response of CRPS-Sum to changes in ρ is not symmetric. It is more sensitive to the changes when the correlation ρ of the data is negative, and it is almost indifferent to the changes when the correlation ρ of the data is positive.
Hence, the sensitivity of CRPS-Sum to changes in covariance depends on the dependency structure of the true data. In real-world scenarios, where we do not have access to the covariance matrix of the data generative process, we cannot reliably interpret CRPS-Sum or compare various models based on it.

The Effect of Summation in CRPS-Sum
To calculate CRPS-Sum, we first sum the time series over the dimensions [11]. Although this aggregation lets us turn a multivariate time series into a univariate one, we lose important information concerning the performance of the model on each dimension. Furthermore, the values of dimensions which are negatively correlated negate each other, and consequently those dimensions will not be represented in the aggregated time series.
For instance, assume we have a multivariate time series x with two dimensions. Our data follow a bivariate Gaussian distribution with

µ = (0, 0)^T and Σ = [[1, −1], [−1, 1]]

Hence, the following relation holds between the dimensions:

x_t^2 = −x_t^1

By summing over the dimensions, we have:

x_t^1 + x_t^2 = 0

Clearly, after summation we acquire a constant zero signal, and all the information regarding the variability of the original time series is lost.
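This cancellation is easy to reproduce numerically (a tiny illustration, not taken from the paper):

```python
import numpy as np

# Perfectly anticorrelated dimensions: x^2_t = -x^1_t at every time-step.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(500)
x = np.stack([x1, -x1], axis=1)   # shape (500, 2)

# Summing across dimensions yields a constant-zero series, so all
# variability of the original time series is lost.
summed = x.sum(axis=1)
```

Since a + (−a) is exactly zero in floating-point arithmetic, the aggregated series is identically zero, not merely close to it.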
To acquire information regarding the performance of the model on each dimension, we can calculate CRPS first. Once the CRPS is validated, we can calculate CRPS-Sum to check how well the model has learned the relationships between the dimensions; even at this point, we should not forget the flaws of CRPS-Sum which we have witnessed, e.g. sensitivity toward the data covariance and loss of information during summation. Unfortunately, the importance of CRPS is ignored in most of the recent papers in the probabilistic forecasting domain. In these papers, CRPS is either not reported at all [13,14], or the argument about the performance of the model is made solely based on CRPS-Sum [12,11].
Considering the flaws of CRPS-Sum, this trend can put the assessment results of these recent models into jeopardy.

A Closer Look at CRPS-Sum in Practice
In the previous section, we discussed the properties of CRPS-Sum and indicated its flaws in hypothetical scenarios using toy data settings. In this section, we aim to investigate the capabilities of CRPS-Sum on a real dataset. To do so, we conduct an experiment by constructing simple models which are based on random noise and investigating their performance using CRPS-Sum. In our experiment, we employed the exchange-rate dataset [25]. The exchange-rate dataset is a multivariate time series dataset which contains the daily exchange rates of eight countries, namely Australia, Britain, Canada, Switzerland, China, Japan, New Zealand, and Singapore, collected between 1990 and 2016. This dataset has few dimensions, which lets us use ES alongside CRPS and CRPS-Sum. Additionally, it is easier to perform a qualitative assessment in lower dimensionalities. We used the dataset in the same setting as proposed in [11]. We also utilize the GP-Copula method from [11] as the baseline, since its authors have reported results in CRPS-Sum.
Our first model is a dummy univariate model which follows a Gaussian distribution. The mean of the Gaussian distribution is µ = µ_last, where µ_last is the mean over the dimensions of the last values in the input vector, i.e.

µ_last = (1/m) Σ_{i=1}^{m} x_T^i
We used σ = 10^−4 as the standard deviation of the dummy univariate model in our experiments; however, the results do not depend on the value of σ (more discussion on the σ value can be found in Appendix B). We use this model to generate the forecast for every dimension.
For the second model, we employ a multivariate Gaussian distribution to build a dummy multivariate forecaster. The mean of the i-th dimension of the multivariate Gaussian distribution is the value of the last time-step in the input window, i.e. µ_i = x_T^i. The covariance matrix is zero everywhere except on its main diagonal, which is filled with 10^−4. In other words, we extend the last observation of the input window as the prediction and apply a small perturbation from a Gaussian distribution. Table 1 presents the CRPS-Sum, CRPS, and ES of the two dummy models and the result of the GP-Copula model from [11] on the exchange-rate dataset. While the CRPS-Sum suggests that the dummy univariate model is much better than GP-Copula, the CRPS and ES indicate that the performance of the dummy univariate model is worse. On the other hand, the quantitative results for the dummy multivariate model are quite surprising. All three assessment methods indicate that the dummy multivariate model has a superior performance in comparison to GP-Copula. To provide further explanation for this unexpected result, we analyze the performance of these models qualitatively. Figure 3a depicts the forecasts from GP-Copula for the first dimension of the exchange-rate dataset (the rest of the dimensions are visualized in Appendix A) and Figure 3b presents samples from the dummy multivariate model.
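The two dummy forecasters can be sketched as follows. Function names and array shapes are illustrative assumptions; in the paper the univariate dummy uses σ = 10^−4 as its standard deviation, while the multivariate dummy fills the diagonal of the covariance matrix (i.e. the variances) with 10^−4:

```python
import numpy as np

def dummy_univariate_forecast(history, horizon, n_samples, sigma=1e-4, rng=None):
    """Gaussian noise around mu_last, the mean over dimensions of the
    last input time-step; the same draws are reused for every dimension."""
    rng = rng if rng is not None else np.random.default_rng()
    m = history.shape[1]
    mu_last = history[-1].mean()                      # mean over dimensions
    draws = rng.normal(mu_last, sigma, size=(n_samples, horizon))
    return np.repeat(draws[:, :, None], m, axis=2)    # (n_samples, horizon, m)

def dummy_multivariate_forecast(history, horizon, n_samples, var=1e-4, rng=None):
    """Persistence of the last observation per dimension, perturbed with
    independent Gaussian noise (diagonal covariance filled with `var`)."""
    rng = rng if rng is not None else np.random.default_rng()
    m = history.shape[1]
    base = np.tile(history[-1], (horizon, 1))          # (horizon, m)
    noise = rng.normal(0.0, np.sqrt(var), size=(n_samples, horizon, m))
    return base[None, :, :] + noise
```

Despite containing no learned structure at all, forecasters of this shape are sufficient to reproduce the scores discussed in Table 1.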
This experiment shows us that the border between a dummy model and a genuine model can become very blurry if we rely solely on CRPS-Sum. Furthermore, we learn that CRPS and visualization can help us acquire a better understanding of a model's performance. Still, we do not have a reliable assessment method for investigating the dependency structure of a model in higher dimensionalities.

Conclusion
In this paper, we reviewed various existing methods for the assessment of probabilistic forecasting models and discussed their advantages and disadvantages. While CRPS is only applicable to univariate models and ES suffers from the curse of dimensionality, CRPS-Sum was introduced to help us assess multivariate probabilistic forecasting models. Unlike for CRPS and ES, the properties of CRPS-Sum have not been studied in the past. Our sensitivity study illustrates that the behavior of CRPS-Sum is not symmetric with regard to the covariance of the data distribution. CRPS-Sum is more sensitive to changes in the covariance of the model when the covariance of the data is negative. This is an undesirable behavior and makes the interpretation of results difficult.

Figure: Forecasts of GP-Copula [11] for the exchange-rate dataset test set. The dataset has 8 dimensions and the test set consists of five batches with 30 time-steps (for more information refer to [11]). Each sub-figure corresponds to one of the data dimensions, presented in original order from top to bottom. We used 400 samples for the visualization of each forecast batch.

Appendix B The Standard Deviation of Dummy Models
For our discussions on the performance of the dummy models, we used σ = 10^−4 to define the Gaussian distribution. Nevertheless, as shown in Figure 5 and Figure 6, we can acquire consistent results with σ ≤ 10^−3. Furthermore, we can see that the values of our scoring rules converge when σ ≤ 10^−5.