Article

Probability Forecast Combination via Entropy Regularized Wasserstein Distance

Ryan Cumings-Menon 1 and Minchul Shin 2
1 The US Census Bureau, 4600 Silver Hill Rd, Suitland-Silver Hill, MD 20746, USA
2 Federal Reserve Bank of Philadelphia, Ten Independence Mall, Philadelphia, PA 19106, USA
* Author to whom correspondence should be addressed.
Entropy 2020, 22(9), 929; https://doi.org/10.3390/e22090929
Submission received: 9 July 2020 / Revised: 9 August 2020 / Accepted: 22 August 2020 / Published: 25 August 2020
(This article belongs to the Special Issue Information Theory, Forecasting, and Hypothesis Testing)

Abstract

We propose probability and density forecast combination methods that are defined using the entropy regularized Wasserstein distance. First, we provide a theoretical characterization of the combined density forecast based on the regularized Wasserstein distance under a Gaussian assumption. More specifically, we show that the regularized Wasserstein barycenter between multivariate Gaussian input densities is multivariate Gaussian, and provide a simple way to compute its mean and variance–covariance matrix. Second, we show how this type of regularization can improve the predictive power of the resulting combined density. Third, we provide a method for choosing the tuning parameter that governs the strength of regularization. Lastly, we apply our proposed method to density forecasting of the U.S. inflation rate, and illustrate how the entropy regularization can improve the quality of the predictive density relative to its unregularized counterpart.

1. Introduction

In this paper, we study a class of density forecast combination methods based on a Wasserstein metric. In the univariate case, an equally weighted centroid defined by a Wasserstein metric corresponds to quantile averaging, or a Vincentized center, in which the quantiles of the forecast densities are averaged. The resulting combined density tends to be narrower than the linear opinion rule [1,2,3], which may or may not be desirable, depending on the context.
We propose to use the entropy regularized Wasserstein metric to construct a combined density forecast. Like its unregularized counterpart, this combined probability/density can be defined by an optimization problem, but the optimization problem in this case includes an additional regularization term that penalizes densities with low entropy, which ensures the combined density forecast is smooth. One advantage of this approach is that the entropy regularized Wasserstein barycenter can be found in a much more computationally efficient manner than its unregularized counterpart when the input densities are multi-dimensional [4].
While computational efficiency is the most commonly cited reason for using entropy regularization, this paper demonstrates that there is an additional advantage of regularization when it comes to the density combination problem. It provides a way to tune the degree of dispersion of the combined density forecast. To the best of our knowledge, this regularized metric has not been explored in the context of the density forecasting combination problem.
As a part of our discussion, we provide a theoretical characterization of the regularized Wasserstein distance under a Gaussian assumption. More specifically, we show that the regularized Wasserstein barycenter between two multivariate Gaussian inputs is multivariate Gaussian. Our proof complements Theorem 1 of [5], which characterizes the regularized Wasserstein barycenter among an arbitrary number of univariate normal densities. In addition, our result provides a simple recursive equation that is guaranteed to converge to the variance–covariance matrix of the barycenter.
We proceed as follows. Section 2 formulates a density forecast combination problem with a general metric. Several existing aggregation methods in the literature can be formulated with the choice of a specific metric within this unified framework. After discussing these existing approaches, we introduce our proposal of using the entropy regularized Wasserstein barycenter. Section 3 provides theoretical results that describe the impact of entropy regularization on the combined density under a Gaussian assumption and discusses how this helps improve the quality of the combined density prediction. Section 4 discusses how to set the strength of the entropy regularization in practice and shows that our proposed selection rule achieves a certain notion of optimality. Section 5 provides an empirical exercise that illustrates how entropy regularization improves the quality of density prediction of the U.S. inflation rate relative to the unregularized combined density forecast. Section 6 concludes the article.

2. Regularized Wasserstein Barycenter for Density Forecast Combination

This section introduces the density combination problem; see, for example, [6]. We assume that agent $i \in \{1, \ldots, N\}$ at time $t \in \mathbb{N}_+$ provides a forecast of the density function $p_{it}: \mathbb{R}^d \to \mathbb{R}_+$, with distribution function denoted by $P_{it}$, of the random variable $y_{t+h}$ with $h \in \mathbb{N}_+$. We are interested in aggregating the information contained in the $N$ agents’ forecasts to generate a better predictive distribution for $y_{t+h}$.
Throughout the paper, we shall focus on density combinations that can be viewed as a type of average over probability densities. Specifically, those that can be defined as
$$\bar{p}_t = \arg\min_{p_t \in \mathcal{P}} \sum_{i=1}^{N} D(p_{it}, p_t), \qquad (1)$$
where $D(p_i, p_j)$ is a measure of the discrepancy between the densities $p_i$ and $p_j$. When $D(\cdot)$ satisfies the usual properties of a distance metric, which is the case when $D(\cdot)$ is defined as the Euclidean or an unregularized Wasserstein metric, $\bar{p}_t$ is known as a Fréchet mean, a generalization of the average for real numbers. We will refer to $\bar{p}_t$ as a barycenter to also encompass the more general case in which $D(\cdot)$ is not a metric. As described in Equation (1), we restrict our attention to the case in which $\bar{p}_t$ is a density forecast with each input density having equal weight, which is known to perform quite well as a combination forecast [7].
A specific choice of the metric $D(p_i, p_j)$ leads to a specific combined density $\bar{p}_t$. Before introducing our proposed definition of $D(\cdot)$, the entropy regularized Wasserstein metric, the next two subsections introduce choices of $D(p_i, p_j)$ that lead to well-known density forecast combination methods.

2.1. Equal-Weighted Linear Opinion Rule

As a starting point, let us consider $D(p_i, p_j) := \|p_i - p_j\|_2^2$. Then, Equation (1) becomes
$$\bar{p}_t = \arg\min_{p_t \in \mathcal{P}} \sum_{i=1}^{N} \|p_{it} - p_t\|_2^2, \qquad (2)$$
which results in the following solution:
$$\bar{p}_t = \frac{1}{N} \sum_{i=1}^{N} p_{it}. \qquad (3)$$
This can be derived using the first-order condition with respect to $p_t$, which is $\sum_{i=1}^{N} (p_{it} - \bar{p}_t) = 0$.
This solution is known as the linear opinion rule with equal weighting. It is the prototypical aggregation method both in the forecasting literature and in practice; see, for example, [1]. This is a particularly tractable density combination method, as it is equivalent to a mixture density and is therefore simple to compute. However, one disadvantage is that it does not preserve the shape of the individual forecast densities. For example, when combining two uni-modal densities with well-separated means, the resulting solution is generally bi-modal, as the following sketch illustrates.
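The sketch below (with illustrative numbers, not taken from the paper) verifies this point for two Gaussian inputs:

```python
# A minimal sketch of the equal-weight linear opinion rule in Equation (3):
# the combined density is the pointwise average of the input densities.
import numpy as np
from scipy.stats import norm

y = np.linspace(-6.0, 6.0, 1001)                 # evaluation grid
inputs = [norm(-2.0, 1.0), norm(2.0, 1.0)]       # two uni-modal forecast densities

pool = np.mean([p.pdf(y) for p in inputs], axis=0)   # p_bar = (1/N) sum_i p_it

# With well-separated means the pool is bi-modal: two interior local maxima.
is_peak = (pool[1:-1] > pool[:-2]) & (pool[1:-1] > pool[2:])
print("number of modes:", is_peak.sum())             # prints 2 for this example
```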

2.2. Quantile Aggregation and the Wasserstein Barycenter

In this section we consider the case in which $D(\cdot)$ is defined as the $p$-Wasserstein metric,
$$W_p(p_{it}, p_{jt}) = \left( \inf_{\varphi \in \Omega(p_{it}, p_{jt})} \int \|z_i - z_j\|^p \, d\varphi(z_i, z_j) \right)^{1/p}, \qquad (4)$$
where $\Omega(p_{it}, p_{jt})$ is the set of all joint distributions $\varphi(z_i, z_j)$ that have marginal densities given by $p_{it}$ and $p_{jt}$, respectively. Formally, we write
$$\Omega(p_{it}, p_{jt}) = \left\{ \varphi: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^1_+ \;\middle|\; \forall A \subseteq \mathbb{R}^d, \; \varphi(A, \mathbb{R}^d) = p_{it}(A) \text{ and } \varphi(\mathbb{R}^d, A) = p_{jt}(A) \right\}. \qquad (5)$$
In other words, each $\varphi \in \Omega(p_{it}, p_{jt})$ is a coupling between the distributions $p_{it}$ and $p_{jt}$. In the optimal transport literature, the minimizer of (4) is also known as the optimal transport plan. This is because, for any $A, B \subseteq \mathbb{R}^d$, $\varphi(A, B)$ can be interpreted as the amount of mass that is moved from $A$ to $B$ in order to minimize $\mathbb{E}\,\|z_i - z_j\|^p$, where $z_i \sim p_{it}$ and $z_j \sim p_{jt}$. For more detail on the field of optimal transport, see [8,9].
A special case of this Wasserstein barycenter has a close relation to a recently proposed probability/density forecast combination method in the forecasting literature. More specifically, suppose that the input densities are univariate, and $D(\cdot)$ is defined as the squared 2-Wasserstein metric, $D(\cdot) := W_2^2(\cdot)$; in this case, we have
$$\bar{P}_t^{-1}(\tau) = \frac{1}{N} \sum_{i=1}^{N} P_{it}^{-1}(\tau), \quad \text{for all } \tau \in (0, 1), \qquad (6)$$
where $P_{it}^{-1}(\cdot)$ and $\bar{P}_t^{-1}(\cdot)$ are the quantile functions of agent $i$ and of the combination method, respectively. This forecast aggregation rule is also known as “quantile aggregation” or the “Vincentized distribution” [2,3,10]. We prefer the representation of Equation (1) because, unlike quantile aggregation, this definition can be easily extended to higher dimensional densities or mixed data types (e.g., when some inputs are continuous and others are discrete). A minimal numerical sketch follows.
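```python
# A sketch of quantile aggregation (Equation (6)): average the quantile
# functions of the inputs. Numbers are illustrative, not from the paper.
import numpy as np
from scipy.stats import norm

taus = np.linspace(0.001, 0.999, 999)        # interior quantile levels
inputs = [norm(-2.0, 1.0), norm(2.0, 1.0)]

q_bar = np.mean([p.ppf(taus) for p in inputs], axis=0)  # averaged quantile functions

# For two equal-variance Gaussians this recovers N(0, 1): the barycenter
# keeps the Gaussian (uni-modal) shape, unlike the bi-modal linear pool.
print("combined median:", q_bar[len(taus) // 2])        # approximately 0.0
```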
The Wasserstein barycenter is known to preserve the shape of the input densities, such as log-concavity [11]. For example, [12] show that the Wasserstein barycenter of the inputs $N(\mu_1, S_1)$ and $N(\mu_2, S_2)$ is $N((\mu_1 + \mu_2)/2, S)$, where $S$ is the solution of
$$S = \frac{1}{2}\left( S^{1/2} S_1 S^{1/2} \right)^{1/2} + \frac{1}{2}\left( S^{1/2} S_2 S^{1/2} \right)^{1/2}; \qquad (7)$$
see also [13]. This differs from the linear opinion rule, which in the univariate case leads to a mixture of two normal densities with mean $(\mu_1 + \mu_2)/2$ and variance $\frac{\sigma_1^2 + \sigma_2^2}{2} + \frac{(\mu_1 - \mu_2)^2}{4}$, and which, in contrast, can be expected to be bi-modal whenever $\mu_1 \neq \mu_2$.
Another difference between these two aggregation methods is that the variance of the Wasserstein barycenter is smaller than that of the combined density resulting from the linear opinion rule. As shown in [2], this holds for a more general class of input densities in the univariate case. Of course, a narrow (i.e., sharp) predictive density can be good or bad depending on the underlying distribution of the target variable, so it may be desirable to have the ability to flexibly adjust the dispersion of the combined density.
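As an aside, the covariance $S$ in Equation (7) can be computed with a standard fixed-point scheme from the optimal transport literature. The sketch below, with illustrative inputs and not code from the paper, is one such implementation: at a fixed point of the iteration, $S$ equals the weighted sum of the matrices $(S^{1/2} S_i S^{1/2})^{1/2}$, which is exactly Equation (7).

```python
# A sketch of a fixed-point scheme whose limit solves Equation (7).
import numpy as np
from scipy.linalg import sqrtm, inv

def w2_barycenter_cov(covs, weights, iters=100):
    S = np.eye(covs[0].shape[0])                    # any positive definite start
    for _ in range(iters):
        R = np.real(sqrtm(S))                       # S^{1/2}
        M = sum(w * np.real(sqrtm(R @ C @ R)) for w, C in zip(weights, covs))
        Rinv = inv(R)
        S = Rinv @ M @ M @ Rinv                     # S <- S^{-1/2} M^2 S^{-1/2}
    return S

S1 = np.array([[2.0, 0.5], [0.5, 1.0]])             # illustrative inputs
S2 = np.array([[1.0, -0.3], [-0.3, 1.5]])
S = w2_barycenter_cov([S1, S2], [0.5, 0.5])

R = np.real(sqrtm(S))                               # check the Equation (7) residual
print(np.abs(S - 0.5 * np.real(sqrtm(R @ S1 @ R)) - 0.5 * np.real(sqrtm(R @ S2 @ R))).max())
```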

2.3. Regularized Wasserstein Barycenter

Now, we turn to our proposal. In this paper, we use a regularized Wasserstein distance [14,15] to combine individual probability forecasts. The regularization term used in this approximation of the Wasserstein metric is given by the negative differential entropy, which, when $\varphi$ is an absolutely continuous measure, we define as $h(\varphi) = \int_{\mathbb{R}^d \times \mathbb{R}^d} \log\left( \frac{d\varphi}{d\lambda} \right) d\varphi$, where $\lambda$ is the Lebesgue measure, and as infinity otherwise. We use $h(\varphi)$ to define the regularized Wasserstein metric as
$$W_{p,\gamma}(p_{it}, p_{jt}) = \left( \inf_{\varphi \in \Omega(p_{it}, p_{jt})} \int \|z_i - z_j\|^p \, d\varphi(z_i, z_j) + \gamma\, h(\varphi) \right)^{1/p}, \qquad (8)$$
where $\gamma > 0$ controls the strength of regularization. Note that $\varphi$ is constrained by the same two marginal restrictions as in its unregularized counterpart, as described in the definition of $\Omega(p_{it}, p_{jt})$. This form of regularization was originally introduced by [14] in order to estimate the Wasserstein metric in a computationally efficient manner using the iterative proportional fitting procedure (IPFP) provided by [16].
When $\gamma = 0$, there is no regularization, so we have $W_{p,0}(p_{it}, p_{jt}) = W_p(p_{it}, p_{jt})$. One can also show that the optimal coupling, say $\varphi_\gamma^\star$, satisfies $\lim_{\gamma \to 0^+} \varphi_\gamma^\star = \varphi_0^\star$ when $\varphi_0^\star$ is uniquely defined; otherwise, this limiting value is given by the element of the set of optimal unregularized couplings with maximum entropy [15]. Higher values of $\gamma$ place more weight on the second term in the objective function, which results in optimal couplings that are smoother and more dispersed than their unregularized counterparts.
Defining $D(p_{it}, p_t)$ by $W_{2,\gamma}^2(p_{it}, p_t)$ results in the combined density
$$\bar{p}_t = \arg\min_{p_t \in \mathcal{P}} \sum_{i=1}^{N} W_{2,\gamma}^2(p_{it}, p_t), \qquad (9)$$
which is known as the regularized Wasserstein barycenter. The authors of [4] provided a generalization of the IPFP procedure to find this barycenter that is more computationally efficient than in the unregularized case; a minimal pairwise sketch of the underlying iterations is given below. While computational efficiency is the most commonly cited reason for using entropy regularization, as we will see in later sections, our motivation for regularization is not entirely computational.
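For concreteness, the following sketch runs the IPFP/Sinkhorn iterations of [14,16] for the pairwise problem in Equation (8) with $p = 2$, after discretizing two densities on a common grid; all numerical values are illustrative.

```python
# A discrete sketch of IPFP/Sinkhorn for the entropy regularized transport
# problem in Equation (8) with p = 2; grids and densities are illustrative.
import numpy as np
from scipy.stats import norm

z = np.linspace(-5.0, 5.0, 200)
a = norm(-1.0, 1.0).pdf(z); a /= a.sum()     # discretized p_it
b = norm(1.0, 1.0).pdf(z); b /= b.sum()      # discretized p_jt
gamma = 0.1

C = (z[:, None] - z[None, :]) ** 2           # squared-distance cost matrix
K = np.exp(-C / gamma)                       # Gibbs kernel
u = np.ones_like(a)
for _ in range(500):                         # alternate the two marginal fits
    v = b / (K.T @ u)
    u = a / (K @ v)

plan = u[:, None] * K * v[None, :]           # optimal regularized coupling
print("transport cost term:", (plan * C).sum())
```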
For the rest of the paper, we study this regularized Wasserstein barycenter, i.e., $\bar{p}_t$ defined in Equation (1) with the metric (8). First, we present analytical results under a parametric assumption that broaden our understanding of the role of the regularization in forecast density combination. Then, we discuss how one can empirically choose the strength of the regularization to achieve a certain notion of optimality.

3. Analytical Results: The Impact of Entropy Regularization

In this section we provide analytical results that describe the impact of entropy regularization on the shape of the barycenter. To better compare this barycenter with its unregularized counterpart in the Gaussian case, as defined above, we focus on the regularized barycenter when $p_1$ and $p_2$ are $d$-dimensional multivariate Gaussians ($d \geq 1$). The regularized Wasserstein barycenter in this case is defined as
$$\bar{p} \in \arg\min_{q} \; W_\gamma^2(p_1, q) + W_\gamma^2(p_2, q). \qquad (10)$$
The following theorem completely characterizes the resulting barycenter in this case. As in the unregularized case, the theorem shows that regularization does not impact the mean of the barycenter; however, it does have an impact on its variance–covariance matrix.
Theorem 1.
Let $p_1$ and $p_2$ be Gaussian density functions with means $\mu_1, \mu_2 \in \mathbb{R}^d$ and variance matrices $S_1, S_2 \in \mathbb{R}^{d \times d}$. The regularized Wasserstein barycenter between $p_1$ and $p_2$ is given by the density function of $N(\mu_B, S_B)$, where $\mu_B \in \mathbb{R}^d$ and $S_B \in \mathbb{R}^{d \times d}$ are defined by
$$\mu_B := (\mu_1 + \mu_2)/2,$$
$$S_B := \left( V/\gamma + I \right)^{-1}\left( V/2 + I\gamma/2 + S_2 \right)\left( V/\gamma + I \right)^{-1} = \left( -V/\gamma + I \right)^{-1}\left( -V/2 + I\gamma/2 + S_1 \right)\left( -V/\gamma + I \right)^{-1},$$
where $V \in \mathbb{R}^{d \times d}$ is the unique symmetric matrix that satisfies these equalities and $-I\gamma < V < I\gamma$.
Also, the iterates of the following series converge to $V$ when $V^{(0)} := 0_{d \times d}$:
$$V^{(k+1)} = S_2 - S_1 + S_1\left( S_1 + I\gamma/2 - V^{(k)}/2 \right)^{-1} S_1 - S_2\left( S_2 + I\gamma/2 + V^{(k)}/2 \right)^{-1} S_2.$$
The proof of this result is included in Appendix A. We prove a slightly more general version of the theorem in which the objective function in Equation (10) is a weighted average of $W_\gamma^2(p_1, q)$ and $W_\gamma^2(p_2, q)$. The proof first derives a system of equations that characterizes the barycenter in the case in which the regularized barycenter is Gaussian. Afterward, a fixed point theorem provided by [17] for mappings on partially ordered sets is used to show that this system has a unique solution, and this, along with the convexity of Equation (10), implies the regularized barycenter is Gaussian. A numerical sketch of the recursion is given below.
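The recursion in Theorem 1 is straightforward to implement; the sketch below (with illustrative inputs, not code from the paper) iterates it from $V^{(0)} = 0$ and recovers $S_B$, and the scalar case reproduces Example 1 below.

```python
# A sketch of the recursion in Theorem 1 for the equal-weight regularized
# barycenter of two multivariate Gaussians; inputs below are illustrative.
import numpy as np
from numpy.linalg import inv

def regularized_barycenter_cov(S1, S2, gamma, iters=200):
    d = S1.shape[0]
    I = np.eye(d)
    V = np.zeros((d, d))                       # V^(0) = 0, as in Theorem 1
    for _ in range(iters):
        V = (S2 - S1
             + S1 @ inv(S1 + I * gamma / 2 - V / 2) @ S1
             - S2 @ inv(S2 + I * gamma / 2 + V / 2) @ S2)
    A = inv(V / gamma + I)
    return A @ (V / 2 + I * gamma / 2 + S2) @ A    # S_B from Theorem 1

# Scalar sanity check: equal input variances sigma^2 = 1 and gamma = 0.5
# give S_B = sigma^2 + gamma/2 = 1.25, matching Example 1 below.
print(regularized_barycenter_cov(np.eye(1), np.eye(1), gamma=0.5))
```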
Now, we discuss our theoretical results and their implications for the density forecast combination problem.
Remark 1.
(on location). Regularization does not affect the mean of the resulting barycenter, which is a property that may not hold in the more general setting without a normality assumption. For example, suppose the domain of $p$ is $[0, 1]$ and $\mathbb{E}_{x \sim p}(x) \neq 1/2$, and consider the barycenter between $p$ and itself. For any fixed density function $q$, the optimal coupling of the optimization problem that defines $W_\gamma^2(p, q)$ converges to $d\varphi(z_1, z_2)/d\lambda = q(z_1)p(z_2)$ as $\gamma \to \infty$, as this is the coupling with maximum entropy that has marginals given by $q$ and $p$; see, for example, [15]. However, the negative entropy of $d\varphi(z_1, z_2)/d\lambda = p(z_2)$ is less than or equal to that of $d\varphi(z_1, z_2)/d\lambda = q(z_1)p(z_2)$ for any such fixed density $q$. We can also ensure these couplings are feasible by defining $q$ to be a uniform density function, so we have $\lim_{\gamma \to \infty} q = 1$. This implies that $\lim_{\gamma \to \infty} \mathbb{E}_{x \sim q}(x) = 1/2$, regardless of $\mathbb{E}_{x \sim p}(x)$. Since the unregularized barycenter is given by $q = p$, and $\mathbb{E}_{x \sim p}(x) \neq 1/2$, the regularization parameter does impact the mean of the barycenter in this case.
Remark 2.
(on dispersion). Regularization tends to smooth the resulting barycenter, leading to a more dispersed combined density. To understand this point, let us consider the simple example below.
Example 1.
Consider a case with univariate $p_{it} = N(\mu_{it}, \sigma^2)$ and $N = 2$. Then, the original Wasserstein barycenter (quantile averaging) is $\bar{p}_t = N((\mu_{1t} + \mu_{2t})/2, \sigma^2)$. On the other hand, the regularized Wasserstein barycenter is $\bar{p}_t(\gamma) = N((\mu_{1t} + \mu_{2t})/2, \sigma^2 + \gamma/2)$.
As this case exemplifies, the strength of the regularization controls the dispersion of the combined density: the heavier the regularization, the more dispersed (or smoother) the density we obtain. This result highlights that the entropy regularization offers extra flexibility to control the dispersion of the combined density. In the next section, we propose a data-driven way to select the value of $\gamma$, the strength of the regularization.
Remark 3.
The normality assumption that we made to obtain the closed-form solution for the barycenter is not needed in practice. The regularized barycenter of probability/density forecasts is well-defined and computationally tractable in a broader context: one can have multiple inputs, non-Gaussian densities, and discrete, continuous, or mixed distributions. This includes many interesting and empirically relevant situations in economic forecasting, such as macroeconomic and financial forecasting. The efficient computation of the regularized Wasserstein distance and barycenter with non-Gaussian input densities is still an active area of research; there is a large literature on computing the regularized barycenter in practice, see for example [4,18,19,20,21,22,23].
Remark 4.
During the review process for this paper, we became aware of a similar result that was proved independently of ours by [5]. There are two primary differences between these results. First, our result provides the regularized barycenter between two multivariate normal densities, while Theorem 1 of [5] provides the barycenter between an arbitrary number of univariate normal densities. Second, our result also provides a recursive formula to compute the variance–covariance matrix of the barycenter that is guaranteed to converge to the desired solution. We thank one of the referees for pointing out these relevant papers.
There have also been a number of recent results on a few related barycenters, including those that are modified to avoid the increase in the dispersion of the barycenter caused by regularization using one of the following two techniques. First, a Kullback–Leibler divergence penalty term can be used, with a reference measure given by the product of the input densities, rather than differential entropy. Second, a technique known as debiasing can also be used. For example, the remaining results in [5], as well as the results provided by [24,25], characterize these types of regularized Wasserstein barycenters between Gaussian densities. In contrast to the barycenter we consider, which can be viewed as the original discrete entropy regularized Wasserstein barycenter in the limit as the number of bins diverges, increasing the regularization parameter of these alternative barycenters either decreases or does not change the variance of the barycenter.

4. On Choosing the Strength of the Regularization

This section discusses how to choose the strength of the penalization. Our empirical strategy is to select the value of $\gamma$ that most accurately fits the observed data. To economize on notation, we restrict our discussion to the 1-step-ahead prediction (i.e., $h = 1$). To do so, we regard the regularized barycenter computed at time $t$, $\bar{p}_t$, as a predictive likelihood for $y_{t+1}$. This predictive likelihood interpretation of the barycenter can be formally justified by a principal-agent framework similar to the one developed by [26]. Suppose we have collected the regularized barycenters and the realized values of the target variable from the initial period (1) to the present ($t$). We write this collection as $I_t$. Then, we can define a maximum likelihood estimator for $\gamma$ at $t$ with $I_t$ as
$$\hat{\gamma}_{1:t}^{mle} \in \arg\max_{\gamma \geq 0} \sum_{\tau=1}^{t-1} \log \bar{p}_\tau(y_{\tau+1}; \gamma), \qquad (11)$$
and the combined density prediction for $y_{t+1}$ at time $t$ is
$$\hat{p}(y_{t+1} \mid I_t) = \bar{p}_t(y_{t+1}; \hat{\gamma}_{1:t}^{mle}).$$
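In practice, Equation (11) can be solved by a simple grid search over the historical log predictive scores. The sketch below illustrates this, where `log_bary_pdf` is a hypothetical function (not from the paper) returning $\log \bar{p}_\tau(y; \gamma)$, and the grid is illustrative.

```python
# A sketch of the selection rule in Equation (11): choose gamma to maximize
# the historical sum of log predictive scores of the regularized barycenter.
# `log_bary_pdf(tau, y, g)` is a hypothetical callable supplied by the user.
import numpy as np

def gamma_mle(realized, log_bary_pdf, grid=np.linspace(0.0, 10.0, 101)):
    # realized[tau] is the outcome y_{tau+1} scored against p_bar_tau
    scores = [sum(log_bary_pdf(tau, y, g) for tau, y in enumerate(realized))
              for g in grid]
    return grid[int(np.argmax(scores))]
```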
There is a notion in which this combined density with $\hat{\gamma}$ is optimal. Suppose that $y_t \overset{i.i.d.}{\sim} p^*(y)$, and assume that the forecasters report a sequence of predictive densities $p_i(y)$ for $y_t$, $t = 1, 2, \ldots, T$ and $i = 1, 2, \ldots, N$. These forecasts are reported before the realization of $y_t$, and the barycenter $\bar{p}(y; \gamma)$ is defined by the $p_i(y)$’s and $\gamma > 0$. Then, the following can be shown under regularity conditions:
$$\frac{1}{T} \sum_{t=1}^{T} \log \bar{p}(y_t; \gamma) \;\overset{p}{\to}\; \int \log\left( \bar{p}(y; \gamma) \right) p^*(y)\, dy \quad \text{as } T \to \infty,$$
for $\gamma \in \Gamma \subset \mathbb{R}_+$. In turn, a maximizer of the left-hand-side term also converges to the maximizer of the right-hand-side term, which is a minimizer of
$$KL\left( \bar{p}(y; \gamma),\, p^*(y) \right) = -\int \log\left( \bar{p}(y; \gamma) \right) p^*(y)\, dy + \int \log\left( p^*(y) \right) p^*(y)\, dy.$$
Therefore, $\hat{\gamma}$ converges to the pseudo-true parameter that minimizes the Kullback–Leibler (KL) divergence from the regularized barycenter to the true data generating process. In other words, we find the $\gamma$ that makes the resulting barycenter as close as possible to the true data generating process in the limit. This asymptotic thought experiment can be justified under quite general conditions, allowing for a range of serial dependence in $y_t$ as well as a flexible form of the regularized Wasserstein barycenter implied by the $p_{i,t-1}(y_t)$’s. We can operationalize this by recognizing that $\bar{p}_{t-1}(y; \gamma)$ can be viewed as a predictive likelihood for $y_t$ formed at time $t-1$; then, quasi-MLE theory can be invoked, e.g., [27,28]. We provide a simple example in which the true data generating process follows an autoregressive (AR) process.
Example 2.
Suppose that forecasters 1 and 2 use a mean-zero Gaussian AR(1) process to construct their density predictions. The two forecasts differ only in the mean reversion parameter. That is, the means of the predictive distributions of forecasters 1 and 2 are $\mu_{1t} = \rho_1 y_{t-1}$ and $\mu_{2t} = \rho_2 y_{t-1}$, respectively. Based on our theory in the previous section, the barycenter is $\bar{p}_{t-1}(y; \gamma) = N(\bar{\mu}_t, \sigma^2 + \gamma/2)$, where $\bar{\mu}_t = (\mu_{1t} + \mu_{2t})/2$, and the log density of the regularized barycenter at $\tau$ for $y_{\tau+1}$ is
$$\log\left( \bar{p}_\tau(y_{\tau+1}; \gamma) \right) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log\left( \sigma^2 + \gamma/2 \right) - \frac{1}{2}\,\frac{\left( y_{\tau+1} - \bar{\mu}_{\tau+1} \right)^2}{\sigma^2 + \gamma/2},$$
and the ML estimator for $\gamma$ at time $t$ is
$$\hat{\gamma}_{1:t}^{mle} \in \arg\max_{\gamma \geq 0} \sum_{\tau=1}^{t-1} \left[ -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log\left( \sigma^2 + \gamma/2 \right) - \frac{1}{2}\,\frac{\left( y_{\tau+1} - \bar{\mu}_{\tau+1} \right)^2}{\sigma^2 + \gamma/2} \right],$$
which leads to
$$\hat{\gamma}_{1:t}^{mle} = 2 \times \max\left( \frac{1}{t-1} \sum_{\tau=1}^{t-1} \left( y_{\tau+1} - \bar{\mu}_{\tau+1} \right)^2 - \sigma^2, \; 0 \right).$$
Now, suppose that the actual data generating process is
$$y_t = \rho^* y_{t-1} + v_t, \quad v_t \overset{i.i.d.}{\sim} N(0, \sigma^{*2}).$$
When the simple average of the two forecasters’ autoregressive parameters equals $\rho^*$, the ML estimate for $\gamma$ depends on the true conditional variance, $\sigma^{*2}$, and the forecasters’ conditional variance. If the sample variance of the forecast errors is larger than that of the forecasters, then $\gamma$ is chosen so that the resulting regularized barycenter has the same variance as the sample variance. On the other hand, if the sample variance is smaller than that of the forecasters, then $\gamma$ is set to 0. Note that there is an asymmetry in adjusting the variance of the barycenter. This is natural in that the regularization only makes the resulting density smoother. In practice, this may not be a problem if the practitioner’s concern is the combined density being too sharp (e.g., relative to the linear opinion rule).
Note that $\hat{\gamma}_{1:t}^{mle}$ converges in probability to $\gamma^\dagger := 2\max(\sigma^{*2} - \sigma^2, 0)$. The KL divergence between $\bar{p}(y_{t+1}; \gamma)$ and the true conditional density of $y_{t+1}$ at $t$ is minimized at $\gamma = \gamma^\dagger$. This confirms that our selection rule for $\gamma$ aims to fit the data well by shaping the regularized barycenter to be as close as possible to the data generating process.
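A quick simulation (with illustrative parameter values, not those of the paper) confirms the closed form above:

```python
# A numerical check of Example 2: with (rho1 + rho2)/2 = rho*, the ML
# estimator inflates the barycenter variance toward the error variance.
import numpy as np

rng = np.random.default_rng(0)
rho_star, sigma_star = 0.5, 1.5         # true AR(1) parameters (illustrative)
rho1, rho2, sigma = 0.3, 0.7, 1.0       # forecasters' parameters

T = 50_000
y = np.zeros(T)
for t in range(1, T):
    y[t] = rho_star * y[t - 1] + sigma_star * rng.standard_normal()

mu_bar = 0.5 * (rho1 + rho2) * y[:-1]   # barycenter mean for y_{t+1}
gamma_hat = 2.0 * max(np.mean((y[1:] - mu_bar) ** 2) - sigma ** 2, 0.0)

# Close to the pseudo-true value 2 * (sigma_star**2 - sigma**2) = 2.5.
print(gamma_hat)
```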

5. Empirical Illustration

In this section, we illustrate our proposed method using macroeconomic data for the U.S. We consider 14 hypothetical forecasters who produce their own 1-step-ahead forecasts of the U.S. inflation rate based on the following vector autoregression (VAR) with three variables:
$$Y_t = \Phi_0 + \sum_{i=1}^{4} \Phi_i Y_{t-i} + e_t, \quad e_t \overset{i.i.d.}{\sim} N(0, \Sigma),$$
where $Y_t$ is a $3 \times 1$ vector that consists of three quarterly macroeconomic variables, $\Phi_0$ is a $3 \times 1$ vector, and $\Phi_1, \Phi_2, \Phi_3, \Phi_4$ are $3 \times 3$ matrices. The first two elements of $Y_t$ are common to all 14 forecasters: the annualized quarter-over-quarter inflation rate and the real GDP growth rate. The forecasters differ in the third element of $Y_t$: we assign each forecaster a different macroeconomic variable from the FRED-QD database of [29]. A detailed description of the variables used in this exercise is in Table 1.
We compute each forecaster’s 1-step-ahead predictive distribution for the inflation rate at time $t$ as $\pi_{t+1|t} \sim N\left( [\mu_{t+1|t}]_{(1,1)}, [\Sigma_{t+1|t}]_{(1,1)} \right)$, where $[x]_{(i,j)}$ denotes the $(i,j)$ element of the vector/matrix $x$. These forecasters assume that the 1-step-ahead predictive distribution of $Y_{t+1}$ at $t$ is Gaussian, and they use their best guess about the predictive mean and variance to construct the predictive distribution. More specifically, they set these two moments as
$$\mu_{t+1|t} = \hat{\Phi}_{0,t} + \sum_{p=1}^{4} \hat{\Phi}_{p,t} Y_{t-p+1}, \quad \text{and} \quad \Sigma_{t+1|t} = \hat{\Sigma}_t,$$
where $(\hat{\Phi}_{0,t}, \hat{\Phi}_{1,t}, \hat{\Phi}_{2,t}, \hat{\Phi}_{3,t}, \hat{\Phi}_{4,t}, \hat{\Sigma}_t)$ is the posterior mean of $p(\Phi_0, \Phi_1, \Phi_2, \Phi_3, \Phi_4, \Sigma \mid Y_{t:(t-R+1)})$ with a flat prior. We set $R = 80$, meaning that the forecasters use the most recent 20 years of quarterly data to construct the predictive distribution. A sketch of this estimation step is given below.
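Under a flat prior, the posterior means of the VAR coefficients coincide with the OLS estimates, so each hypothetical forecaster could be implemented along the following lines; `data` is an assumed $(R \times 3)$ array of [inflation, GDP growth, own third variable], and this is a sketch rather than the authors’ code.

```python
# A sketch of one forecaster's VAR(4) predictive moments; with a flat prior,
# the posterior means of the coefficients equal the OLS estimates.
import numpy as np

def var4_predictive_moments(data):
    R, n = data.shape                          # R = 80 quarters, n = 3 variables
    X = np.hstack([np.ones((R - 4, 1))] +
                  [data[4 - p: R - p] for p in range(1, 5)])   # lags 1..4
    Y = data[4:]
    B = np.linalg.lstsq(X, Y, rcond=None)[0]   # OLS coefficient estimates
    resid = Y - X @ B
    Sigma_hat = resid.T @ resid / (R - 4 - X.shape[1])
    x_next = np.hstack([1.0] + [data[R - p] for p in range(1, 5)])
    return x_next @ B, Sigma_hat               # predictive mean and covariance

# The forecaster's inflation density is then N(mu[0], Sigma_hat[0, 0]).
```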
We let the forecasters generate their 1-step-ahead predictive distributions for the inflation rate from 2001Q1 to 2018Q4. This leaves us with 72 quarters for the forecast evaluation sample. At each point in time, we also combine these 14 predictive densities based on the regularized Wasserstein barycenter with 20 different values of the regularization parameter $\gamma$ on $[0.3, 10]$. As we explained in the previous section, a larger value of this parameter implies a stronger regularization, and the resulting combined predictive density becomes smoother, with a larger variance. We also compute the combined density with $\gamma = 0$, which leads to “quantile aggregation” or the “Vincentized distribution”. Our computation of the regularized barycenter is based on the algorithm developed and proposed by [19]. The MATLAB toolbox that implements this algorithm is available from https://github.com/gpeyre/2015-SIGGRAPH-convolutional-ot.
We evaluate each forecaster, as well as the forecast aggregation methods, by the sum of log predictive scores, that is, the logarithm of the predictive density evaluated at the realized value, summed over the evaluation sample. These results are presented in Figure 1. The left panel presents the sum of the log scores for the individual forecasters, sorted by their performance. There is a sizeable difference in their historical performance. The solid line represents the performance of quantile aggregation, which aggregates all forecasters in the pool. As found by other research papers, e.g., [2,3], the quantile aggregation method generates a decent predictive distribution, which performs slightly better than the ex post fourth-best forecaster.
The right panel of Figure 1 shows the historical performance of our proposed approach for various choices of the regularization parameter $\gamma$. For a wide range of values of $\gamma$, the regularized barycenter performs better than the quantile aggregation; it even does better than the best individual forecaster. This is interesting because we cannot identify the best forecaster a priori.
The optimal value of $\gamma$ defined in Equation (11) at the end of the evaluation sample would be the value of $\gamma$ that corresponds to the peak of the curve, which is about $\hat{\gamma}_{2018Q4} \approx 1.3$. If we were to use this value from the beginning of the evaluation sample, then the mean difference in the log predictive score between the regularized Wasserstein barycenter and the quantile aggregation would have been 0.12, with a heteroscedasticity and autocorrelation consistent (HAC) standard error of 0.07. This implies that the difference between the peak of the curve and the solid line is statistically significant at the 10% significance level.
To make the $\gamma$ selection fully adaptive, we also compute the optimal $\gamma$ sequentially from the beginning to the end of the evaluation sample. That is, we set the predictive density for $y_{t+1}$ as the regularized barycenter with the value of $\gamma$ that maximizes the objective function defined in Equation (11) using only the information available from the beginning of the sample up to $t$. In this way, we do not use any future information when choosing the value of $\gamma$. Even in this case, the regularized Wasserstein barycenter performs better than the best individual forecaster and the quantile aggregation. The sum of the log predictive scores is −93.09, and the mean difference in the log predictive score relative to the quantile aggregation is 0.11, with an HAC standard error of 0.06. This suggests that the regularized Wasserstein barycenter with the adaptively chosen (e.g., estimated online) $\gamma$ performs statistically better than its unregularized counterpart, the quantile aggregation, at the 10% significance level. This superior predictive performance of the regularized Wasserstein barycenter relative to the quantile aggregation remains unchanged even when we split the evaluation sample into two: the mean difference in the log predictive score is 0.13 and 0.09 for the first and second halves of the evaluation sample, respectively.
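For completeness, the HAC standard error of a mean log-score difference can be computed with a standard Newey–West estimator. The sketch below is our own illustration, with an assumed Bartlett-kernel truncation lag, for a length-72 series of per-quarter score differences `d`.

```python
# A sketch of a Newey-West (HAC) standard error for the mean difference in
# log predictive scores between two combination methods.
import numpy as np

def hac_mean_test(d, lags=4):
    d = np.asarray(d, dtype=float)
    T = len(d)
    dbar = d.mean()
    u = d - dbar
    s = u @ u / T                            # lag-0 variance term
    for l in range(1, lags + 1):             # Bartlett-kernel weights
        w = 1.0 - l / (lags + 1)
        s += 2.0 * w * (u[l:] @ u[:-l]) / T
    se = np.sqrt(s / T)                      # HAC s.e. of the sample mean
    return dbar, se, dbar / se               # mean, s.e., t-statistic
```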

6. Concluding Remarks

This paper proposes to use the entropy regularized Wasserstein barycenter to combine several probability and density forecasts. The entropy regularization smooths the resulting combined forecast, and it offers a flexible way to adjust the dispersion of the predictive density when needed. We study the effect of the regularization on the combined density forecast and provide an exact relationship between the strength of the regularization and the variance–covariance matrix of the combined density when the input densities are Gaussian. We then provide a way to select the strength of regularization by choosing the regularized barycenter that most closely matches the data. We apply our proposed methodology to density forecasting of the U.S. inflation rate and show how the entropy regularization can improve the quality of the density forecast relative to its unregularized counterpart.
In this article, we restrict the weight of each input density in the final combined density to be pre-determined (i.e., equal weighting). This choice was intentional, to focus on studying the role of entropy regularization. In practice, however, it is possible that a subset of the input densities is superior to the others, and one may wish to put different weights on each input density. Alternatively, it may be desirable to include only a subset of the input densities in the combined density and set the other weights to zero; see, for example, [30]. For those cases, it would be fruitful to develop a data-dependent method that chooses both the regularization strength and the weights simultaneously, which is a topic for future research.

Author Contributions

The authors contributed equally to this paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

We thank Frank Diebold, Roger Koenker, Frank Schorfheide, R. Miyauchi Lee, and two anonymous referees for their insightful comments.

Conflicts of Interest

The authors declare no conflict of interest.

Disclaimer

The views expressed in this paper are solely those of the authors and do not necessarily reflect the views of the Federal Reserve Bank of Philadelphia, the Federal Reserve System, or the Census Bureau. Any errors or omissions are the responsibility of the authors. There are no sensitive data in this paper.

Appendix A

The authors of [17] provide the following fixed point theorem, which we will use in the proof of Theorem 1.
Lemma A1 (Ran and Reurings, 2004).
Let $T$ be a partially ordered set such that every pair $x, y \in T$ has a lower bound and an upper bound. Furthermore, let $d$ be a metric on $T$ such that $(T, d)$ is a complete metric space. If $F: T \to T$ is a continuous, monotone (i.e., either order-preserving or order-reversing) map from $T$ into $T$ such that
$$\exists\, c \in (0, 1): \; d(F(x), F(y)) \leq c\, d(x, y), \quad \forall x \geq y,$$
and
$$\exists\, x_0 \in T: \; F(x_0) \geq x_0 \;\text{ or }\; F(x_0) \leq x_0,$$
then $F$ has a unique fixed point $x^\star \in T$. Also, for all $x \in T$,
$$\lim_{n \to \infty} F^n(x) = x^\star.$$
The following result follows from Lemma A1.
Lemma A2.
Suppose $\lambda \in (0, 1)$, $T \subset \mathbb{R}^{d \times d}$ is the set of symmetric matrices with all eigenvalues in the range $\left( -\frac{\gamma}{2\lambda}, \frac{\gamma}{2(1-\lambda)} \right)$, and $S_1, S_2 \in \mathbb{R}^{d \times d}$ are positive definite matrices. Then there is a unique $V^\star \in T$ such that $F(V^\star) = V^\star$, where
$$F(V) := S_2 - S_1 + S_1\left( S_1 + I\gamma/2 - V(1-\lambda) \right)^{-1} S_1 - S_2\left( S_2 + I\gamma/2 + V\lambda \right)^{-1} S_2.$$
Also, for any $V \in T$, $\lim_{n \to \infty} F^n(V) = V^\star$.
Proof. 
Suppose $A, B \in T$ and $A \geq B$. First we will establish that $F(\cdot)$ is order-preserving, which is equivalent to $F(A) \geq F(B)$. Note that
$$S_1\left[ \left( S_1 + I\gamma/2 - A(1-\lambda) \right)^{-1} - \left( S_1 + I\gamma/2 - B(1-\lambda) \right)^{-1} \right] S_1 \geq 0 \;\Leftrightarrow\; \left( S_1 + I\gamma/2 - A(1-\lambda) \right)^{-1} \geq \left( S_1 + I\gamma/2 - B(1-\lambda) \right)^{-1} \;\Leftrightarrow\; -A \leq -B \;\Leftrightarrow\; A \geq B.$$
Similar logic implies that, for all such $A, B \in T$,
$$S_2\left[ \left( S_2 + I\gamma/2 + B\lambda \right)^{-1} - \left( S_2 + I\gamma/2 + A\lambda \right)^{-1} \right] S_2 \geq 0 \;\Leftrightarrow\; A \geq B,$$
and since $F(A) - F(B)$ is the sum of both of these order-preserving expressions, $F(\cdot)$ is also order-preserving.
Clearly our bounds on the eigenvalues imply that $F(V)$ is continuous for all $V \in T$. To show that $F$ is a mapping from $T$ into $T$, note that matrix symmetry is preserved under addition and inversion, so $F(V)$ is symmetric for all $V \in T$. Also, note that
$$F(-I\gamma/(2\lambda)) = -S_1 + S_1\left( S_1 + I\gamma/(2\lambda) \right)^{-1} S_1 > -I\gamma/(2\lambda) \;\Leftrightarrow\; -S_1^{1/2}\left( I - \left( I + S_1^{-1}\gamma/(2\lambda) \right)^{-1} \right) S_1^{1/2} > -I\gamma/(2\lambda) \;\Leftrightarrow\; S_1^{1/2}\left( I - \left( I + S_1^{-1}\gamma/(2\lambda) \right)^{-1} \right) S_1^{1/2} < I\gamma/(2\lambda) \;\Leftrightarrow\; \left( I + S_1 2\lambda/\gamma \right)^{-1} < S_1^{-1}\gamma/(2\lambda) \;\Leftrightarrow\; I > 0.$$
Similar logic can be used to show that $F(I\gamma/(2(1-\lambda))) < I\gamma/(2(1-\lambda))$. This also implies the final requirement of Lemma A1.
The only remaining requirement of Lemma A1 is the penultimate one, which we will establish for $A, B \in T$ such that $A \geq B$, using the metric $d(A, B) = \mathrm{Tr}(A - B)$. Also, let $\alpha := \{1, -1\}$, $\beta := \{\lambda - 1, \lambda\}$, and let $\|C\|$ denote the spectral norm of $C \in \mathbb{R}^{d \times d}$. We will use the property $\mathrm{Tr}(CD) \leq \|C\|\,\mathrm{Tr}(D)$, where $C, D \in \mathbb{R}^{d \times d}$ and $C, D \geq 0$; see, for example, [17]. Note that
$$\begin{aligned} \mathrm{Tr}(F(A) - F(B)) &= \sum_i \alpha_i\, \mathrm{Tr}\left( S_i\left[ \left( S_i + I\gamma/2 + A\beta_i \right)^{-1} - \left( S_i + I\gamma/2 + B\beta_i \right)^{-1} \right] S_i \right) \\ &= \sum_i \alpha_i \beta_i\, \mathrm{Tr}\left( S_i \left( S_i + I\gamma/2 + A\beta_i \right)^{-1} (B - A) \left( S_i + I\gamma/2 + B\beta_i \right)^{-1} S_i \right) \\ &= \sum_i \alpha_i \beta_i\, \mathrm{Tr}\left( \left( S_i + I\gamma/2 + B\beta_i \right)^{-1} S_i S_i \left( S_i + I\gamma/2 + A\beta_i \right)^{-1} (B - A) \right) \\ &\leq \sum_i \alpha_i \beta_i \left\| \left( S_i + I\gamma/2 + B\beta_i \right)^{-1} S_i S_i \left( S_i + I\gamma/2 + A\beta_i \right)^{-1} \right\| \mathrm{Tr}(B - A) \\ &< c\, \mathrm{Tr}(B - A) \sum_i \alpha_i \beta_i = c\, \mathrm{Tr}(A - B), \end{aligned}$$
where $c \in (0, 1)$. The second inequality follows from the matrix $S_i\left( S_i + I\gamma/2 + A\beta_i \right)^{-1}$ (respectively, $S_i\left( S_i + I\gamma/2 + B\beta_i \right)^{-1}$) being similar to a symmetric matrix with eigenvalues contained in $(0, 1)$, because $A \in T$ ($B \in T$) implies $I\gamma/2 + A\beta_i > 0$ ($I\gamma/2 + B\beta_i > 0$). □
Next we will establish Theorem 1, which is restated below in a slightly more general form: the objective function in Equation (10) is replaced by a weighted average of $W_\gamma^2(p_1, q)$ and $W_\gamma^2(p_2, q)$.
Theorem A1.
Let $\lambda \in (0, 1)$ and let $p_1$ and $p_2$ be Gaussian density functions with means $\mu_1, \mu_2 \in \mathbb{R}^d$ and variance matrices $S_1, S_2 \in \mathbb{R}^{d \times d}$. The regularized Wasserstein barycenter between $p_1$ and $p_2$ is given by the density function of $N(\mu_B, S_B)$, where $\mu_B \in \mathbb{R}^d$ and $S_B \in \mathbb{R}^{d \times d}$ are defined by
$$\mu_B := \lambda \mu_1 + (1-\lambda)\mu_2,$$
$$S_B := \left( V 2\lambda/\gamma + I \right)^{-1}\left( V\lambda + I\gamma/2 + S_2 \right)\left( V 2\lambda/\gamma + I \right)^{-1} = \left( V 2(\lambda-1)/\gamma + I \right)^{-1}\left( V(\lambda-1) + I\gamma/2 + S_1 \right)\left( V 2(\lambda-1)/\gamma + I \right)^{-1},$$
where $V \in \mathbb{R}^{d \times d}$ is the unique symmetric matrix that satisfies these equalities and $-I\gamma/(2\lambda) < V < I\gamma/(2(1-\lambda))$.
Also, the iterates of the following series converge to $V$ when $V^{(0)} := 0_{d \times d}$:
$$V^{(k+1)} = S_2 - S_1 + S_1\left( S_1 + I\gamma/2 - V^{(k)}(1-\lambda) \right)^{-1} S_1 - S_2\left( S_2 + I\gamma/2 + V^{(k)}\lambda \right)^{-1} S_2.$$
Proof. 
Let $\phi: \mathbb{R}^d \to \mathbb{R}$ be defined as $\phi(z) := \exp\left( -\|z\|_2^2/\gamma \right)$, and, for a given function $f: \mathbb{R}^d \to \mathbb{R}$, denote the convolution of $f(z)$ and $\phi(z)$ by $(f * \phi)(z) := \int_{\mathbb{R}^d} f(t)\phi(z - t)\, dt$. We also write $\phi_{\gamma/2}$ for the Gaussian function in $\mathcal{G}$ (defined below) that is proportional to $\phi$ and has covariance matrix $I\gamma/2$. When there is little risk of confusion, we will omit the input $z \in \mathbb{R}^d$ of functions supported on $\mathbb{R}^d$ in the remainder of the proof.
We will characterize the barycenter using the fact that it is the minimizer of the following optimization problem:
$$\min_{q} \; \lambda W_\gamma^2(q, p_1) + (1-\lambda) W_\gamma^2(q, p_2). \qquad \text{(A1)}$$
To do so, note that the optimal coupling corresponding to $W_\gamma^2(q, p_i)$ can be defined by instead solving the dual of (8), which is
$$w_i, u_i = \arg\max_{w_i, u_i} \; \mathbb{E}_{p_i}\left[ \log(w_i) \right] + \mathbb{E}_q\left[ \log(u_i) \right] - \gamma \int_{\mathbb{R}^d \times \mathbb{R}^d} w_i(z_1) u_i(z_2) \exp\left( -\|z_1 - z_2\|^2/\gamma \right) dz_1\, dz_2, \qquad \text{(A2)}$$
and the optimal coupling can be defined in terms of the dual variables as $d\varphi_i(z_1, z_2)/d\lambda = u_i(z_1)\,\phi(z_1 - z_2)\, w_i(z_2)$. The first order conditions of (A2) are
$$p_i = w_i\, (u_i * \phi), \qquad \text{(A3)}$$
$$q = u_i\, (w_i * \phi). \qquad \text{(A4)}$$
Also, since the objective function of (A2) is differentiable, an application of the envelope theorem implies
$$\frac{\delta\, W_\gamma^2(q, p_i)}{\delta q} = \log(u_i).$$
Thus, the optimum of (A1) can be characterized by the following functional derivative being zero:
$$\frac{\delta}{\delta q}\left[ \lambda W_\gamma^2(q, p_1) + (1-\lambda) W_\gamma^2(q, p_2) \right] = 0 \;\Leftrightarrow\; \lambda \log(u_1) + (1-\lambda)\log(u_2) = 0.$$
After combining this equality with (A3) and (A4), we have that the barycenter can be characterized by the system
$$p_1 = w_1\, (u_1 * \phi_{\gamma/2}), \quad p_2 = w_2\, (u_2 * \phi_{\gamma/2}), \quad q = u_1\, (w_1 * \phi_{\gamma/2}) = u_2\, (w_2 * \phi_{\gamma/2}), \quad \text{and} \quad 1 = u_1^{\lambda} u_2^{1-\lambda}.$$
This system can be reduced to two equalities after noting that $p_i = w_i(u_i * \phi_{\gamma/2})$ and $q = u_i(w_i * \phi_{\gamma/2})$ imply
$$q = u_i \left( \frac{p_i}{u_i * \phi_{\gamma/2}} * \phi_{\gamma/2} \right).$$
After combining both equalities, and noting $u_1 = u_2^{(\lambda-1)/\lambda}$, we have
$$q = u_2^{(\lambda-1)/\lambda} \left( \frac{p_1}{u_2^{(\lambda-1)/\lambda} * \phi_{\gamma/2}} * \phi_{\gamma/2} \right) = u_2 \left( \frac{p_2}{u_2 * \phi_{\gamma/2}} * \phi_{\gamma/2} \right). \qquad \text{(A5)}$$
Let $\mathcal{G}$ be defined as the set of functions $g: \mathbb{R}^d \to \mathbb{R}^1_+$ of the form
$$g(z) = a \exp\left( -(z - \mu_g)' V_g^{-1} (z - \mu_g)/2 \right),$$
where $\mu_g \in \mathbb{R}^d$, $V_g \in \mathbb{R}^{d \times d}$ is a symmetric and invertible matrix, and $a \in \mathbb{R}^1_{++}$. It will also be convenient to let $C: \mathcal{G} \to \mathbb{R}^{d \times d}$ be defined so that $C(g) = V_g$ and $M: \mathcal{G} \to \mathbb{R}^d$ be defined so that $M(g) = \mu_g$. It is well known that if $g, h \in \mathcal{G}$ are Gaussian density functions, then $g^b, cg, gh, g * h \in \mathcal{G}$, where $b, c \in \mathbb{R}^1$ and $b \neq 0$, and it is also straightforward to show
$$C(g^b) = V_g/b, \quad C(cg) = V_g, \quad C(gh) = \left( V_g^{-1} + V_h^{-1} \right)^{-1}, \quad \text{and} \quad C(g * h) = V_g + V_h.$$
Likewise, in the case of $M(\cdot)$, we will also use the properties
$$M(g^b) = \mu_g, \quad M(cg) = \mu_g, \quad M(gh) = C(gh)\left( V_g^{-1}\mu_g + V_h^{-1}\mu_h \right), \quad \text{and} \quad M(g * h) = \mu_g + \mu_h.$$
Note that $V_g^{-1} + V_h^{-1} > 0$ is the necessary and sufficient condition for $gh$ to be well defined, and it is straightforward to verify that the properties above also hold over all pairs of $g, h \in \mathcal{G}$ when this is the case; for the case of normal density functions, see, for example, [31].
Next, we will suppose that $u_2$ is in $\mathcal{G}$, which, due to (A5), also implies $q, u_1, w_1, w_2 \in \mathcal{G}$, and then show that there exists a unique $u_2 \in \mathcal{G}$ that satisfies (A5). Since (A1) is a strictly convex optimization problem, when a solution to (A1) exists, it can be characterized uniquely by its first-order conditions. Note that, for any pair $u_i, w_i$ that solves (A2), we have that $u_i a, w_i/a$, where $a \in \mathbb{R}^1_{++}$, are also solutions. We avoid complications from this issue by placing the additional restriction on these dual variables that $w_i(0) = 1$, as this ensures strict convexity over this set of dual functions. To see that this is also without loss of generality, note that rescaling the dual variables to $u_i a, w_i/a$ would not impact the objective function in (A2), because $\int_{\mathbb{R}^d} q(z)\, dz = \int_{\mathbb{R}^d} p_i(z)\, dz = 1$. Also, $a$ would not impact the first order conditions (A3) and (A4), so it would also not have an impact on $q$. Thus, after providing a $u_2 \in \mathcal{G}$ that solves (A5), we will have also shown that this solution is unique even when not restricted to $\mathcal{G}$.
Since $\phi$, $p_1$, and $p_2$ are elements of $\mathcal{G}$, and $\mathcal{G}$ is closed under multiplication, division, convolution, and exponentiation to the (non-zero) power $(\lambda-1)/\lambda$, if $u_2 \in \mathcal{G}$ then the functions on both sides of the equality (A5) will also be elements of $\mathcal{G}$. Let $U_i := C(u_i)$ and $\mu_u := M(u_2)$. As noted above, the convolutions in (A5) are only well defined if the following matrix inequalities hold, so we will also require the solution to satisfy these inequalities:
$$I 2/\gamma + U_i^{-1} > 0 \quad \text{and} \quad I 2/\gamma + U_i^{-1}(\lambda-1)/\lambda > 0,$$
which hold if and only if
$$-I 2/\gamma < U_2^{-1} < I 2\lambda/(\gamma(1-\lambda)). \qquad \text{(A6)}$$
It is straightforward to verify that these inequalities are identical to the ones that ensure the optimal coupling is integrable, as this coupling is given by $d\varphi_i(z_1, z_2)/d\lambda = u_i(z_1)\,\phi(z_1 - z_2)\, w_i(z_2)$. Thus, Fubini’s theorem implies that they are also sufficient conditions for $q$ to be integrable.
We can find $S_B$ by applying $C(\cdot)$ to (A5), which implies
$$S_B^{-1} = U_2^{-1} + \left( \left( S_2^{-1} - \left( U_2 + I\gamma/2 \right)^{-1} \right)^{-1} + I\gamma/2 \right)^{-1} \qquad \text{(A7)}$$
$$= U_2^{-1}(\lambda-1)/\lambda + \left( \left( S_1^{-1} - \left( U_2 \lambda/(\lambda-1) + I\gamma/2 \right)^{-1} \right)^{-1} + I\gamma/2 \right)^{-1}. \qquad \text{(A8)}$$
Let $b_i \in \{\lambda/(\lambda-1), 1\}$. After three applications of the matrix inversion lemma and simplifying, we have, for each $i \in \{1, 2\}$,
$$\begin{aligned} S_B^{-1} - U_2^{-1}/b_i &= \left( \left( S_i^{-1} - \left( U_2 b_i + I\gamma/2 \right)^{-1} \right)^{-1} + I\gamma/2 \right)^{-1} \\ &= \left( \left( S_i^{-1} - I 2/\gamma + 4/\gamma^2 \left( U_2^{-1}/b_i + I 2/\gamma \right)^{-1} \right)^{-1} + I\gamma/2 \right)^{-1} \\ &= I 2/\gamma - 4/\gamma^2 \left( S_i^{-1} + 4/\gamma^2 \left( U_2^{-1}/b_i + I 2/\gamma \right)^{-1} \right)^{-1} \\ &= I 2/\gamma - 4/\gamma^2\, S_i + 4/\gamma^2\, S_i \left( \gamma^2/4\, U_2^{-1}/b_i + I\gamma/2 + S_i \right)^{-1} S_i. \end{aligned} \qquad \text{(A9)}$$
This, along with Equations (A7) and (A8), implies that $U_2$ can be characterized by
$$\gamma^2/4\, U_2^{-1} - S_2 + S_2\left( \gamma^2/4\, U_2^{-1} + I\gamma/2 + S_2 \right)^{-1} S_2 = \gamma^2/4\, U_2^{-1}(\lambda-1)/\lambda - S_1 + S_1\left( \gamma^2/4\, U_2^{-1}(\lambda-1)/\lambda + I\gamma/2 + S_1 \right)^{-1} S_1.$$
After defining $V$ as $\gamma^2/(4\lambda)\, U_2^{-1}$, this implies
$$V = S_2 - S_1 + S_1\left( S_1 + I\gamma/2 - V(1-\lambda) \right)^{-1} S_1 - S_2\left( S_2 + I\gamma/2 + V\lambda \right)^{-1} S_2.$$
Note that our requirement that $U_2^{-1}$ satisfy (A6) can be written in terms of $V$ as $-I\gamma/(2\lambda) < V < I\gamma/(2(1-\lambda))$, and Lemma A2 implies that there is a unique solution that satisfies these conditions.
The functional form for $S_B$ in the statement of the theorem follows from an alternative ordering of the matrix inversion lemma. Specifically, starting from (A9),
$$\begin{aligned} S_B^{-1} - U_2^{-1}/b_i &= I 2/\gamma - 4/\gamma^2 \left( S_i^{-1} + 4/\gamma^2 \left( U_2^{-1}/b_i + I 2/\gamma \right)^{-1} \right)^{-1} \\ &= -U_2^{-1}/b_i + 4/\gamma^2 \left( \gamma^2/4\, U_2^{-1}/b_i + I\gamma/2 \right)\left( \gamma^2/4\, U_2^{-1}/b_i + I\gamma/2 + S_i \right)^{-1}\left( \gamma^2/4\, U_2^{-1}/b_i + I\gamma/2 \right) \\ &= -U_2^{-1}/b_i + \left( 2\lambda/(\gamma b_i)\, V + I \right)\left( \lambda/b_i\, V + \gamma/2\, I + S_i \right)^{-1}\left( 2\lambda/(\gamma b_i)\, V + I \right). \end{aligned}$$
Thus,
$$S_B = \left( V 2\lambda/\gamma + I \right)^{-1}\left( S_2 + V\lambda + I\gamma/2 \right)\left( V 2\lambda/\gamma + I \right)^{-1} = \left( V 2(\lambda-1)/\gamma + I \right)^{-1}\left( S_1 + V(\lambda-1) + I\gamma/2 \right)\left( V 2(\lambda-1)/\gamma + I \right)^{-1}.$$
After applying $M(\cdot)$ to both sides of (A5), we have that $M\left( u_2^{1/b_i} \left( \frac{p_i}{u_2^{1/b_i} * \phi_{\gamma/2}} * \phi_{\gamma/2} \right) \right)$ equals
$$S_B\left( U_2^{-1}\mu_u/b_i + \left( S_B^{-1} - U_2^{-1}/b_i \right)\left( S_i^{-1} - \left( U_2 b_i + I\gamma/2 \right)^{-1} \right)^{-1}\left( S_i^{-1}\mu_i - \left( U_2 b_i + I\gamma/2 \right)^{-1}\mu_u \right) \right). \qquad \text{(A10)}$$
To simplify this expression, we will first establish three intermediate equalities. First, Equations (A7) and (A8) imply
$$\begin{aligned} \left( S_B^{-1} - U_2^{-1}/b_i \right)&\left( S_i^{-1} - \left( U_2 b_i + I\gamma/2 \right)^{-1} \right)^{-1}\left( U_2 b_i + I\gamma/2 \right)^{-1} \\ &= \left( U_2 b_i + \gamma/2 \left( U_2 b_i + I\gamma/2 \right) S_i^{-1} \right)^{-1} \\ &= \left( I + \gamma/2 \left( I + \gamma/(2 b_i)\, U_2^{-1} \right) S_i^{-1} \right)^{-1} U_2^{-1}/b_i. \end{aligned} \qquad \text{(A11)}$$
Second, (A11) in turn implies
$$\left( S_B^{-1} - U_2^{-1}/b_i \right)\left( S_i^{-1} - \left( U_2 b_i + I\gamma/2 \right)^{-1} \right)^{-1} S_i^{-1} = \left( I + \gamma/2 \left( I + \gamma/(2 b_i)\, U_2^{-1} \right) S_i^{-1} \right)^{-1}\left( I + \gamma/(2 b_i)\, U_2^{-1} \right) S_i^{-1}. \qquad \text{(A12)}$$
Third, after an application of the matrix inversion lemma to (A7) and (A8),
$$S_B^{-1} = U_2^{-1}/b_i + \left( \left( S_i^{-1} - \left( U_2 b_i + I\gamma/2 \right)^{-1} \right)^{-1} + I\gamma/2 \right)^{-1} \qquad \text{(A13)}$$
$$= U_2^{-1}/b_i + I 2/\gamma - 4/\gamma^2\left( I 2/\gamma + S_i^{-1} - \left( U_2 b_i + I\gamma/2 \right)^{-1} \right)^{-1}, \qquad \text{(A14)}$$
which implies
$$\begin{aligned} S_B &= \left( U_2^{-1}/b_i + I 2/\gamma - 4/\gamma^2 \left[ \left( U_2 b_i + I\gamma/2 \right)\left( I 2/\gamma + S_i^{-1} \right) - I \right]^{-1}\left( U_2 b_i + I\gamma/2 \right) \right)^{-1} \\ &= \left( U_2^{-1}/b_i + I 2/\gamma - 4/\gamma^2 \left[ I 2/\gamma + \left( I + U_2^{-1}\gamma/(2 b_i) \right) S_i^{-1} \right]^{-1} \left( U_2^{-1}/b_i \right)\left( U_2 b_i + I\gamma/2 \right) \right)^{-1}. \end{aligned}$$
Thus,
$$S_B = \left( U_2^{-1}/b_i + I 2/\gamma \right)^{-1}\left( I - \left( I + \gamma/2 \left( I + \gamma/(2 b_i)\, U_2^{-1} \right) S_i^{-1} \right)^{-1} \right)^{-1}. \qquad \text{(A15)}$$
We will start with the coefficient on $\mu_u$ in (A10). The equalities (A11) and (A15) imply that this term is equal to
$$\begin{aligned} S_B&\left( U_2^{-1}/b_i - \left( S_B^{-1} - U_2^{-1}/b_i \right)\left( S_i^{-1} - \left( U_2 b_i + I\gamma/2 \right)^{-1} \right)^{-1}\left( U_2 b_i + I\gamma/2 \right)^{-1} \right)\mu_u \\ &= \left( U_2^{-1}/b_i + I 2/\gamma \right)^{-1}\left( I - \left( I + \gamma/2 \left( I + \gamma/(2 b_i)\, U_2^{-1} \right) S_i^{-1} \right)^{-1} \right)^{-1} \\ &\quad \times \left( I - \left( I + \gamma/2 \left( I + \gamma/(2 b_i)\, U_2^{-1} \right) S_i^{-1} \right)^{-1} \right) U_2^{-1}/b_i\, \mu_u \\ &= \left( U_2^{-1}/b_i + I 2/\gamma \right)^{-1} U_2^{-1}/b_i\, \mu_u = \left( I + U_2\, 2 b_i/\gamma \right)^{-1}\mu_u. \end{aligned}$$
The equalities (A12) and (A15) imply that the coefficient on $\mu_i$ in (A10) can be written as
$$\begin{aligned} S_B&\left( S_B^{-1} - U_2^{-1}/b_i \right)\left( S_i^{-1} - \left( U_2 b_i + I\gamma/2 \right)^{-1} \right)^{-1} S_i^{-1}\mu_i \\ &= \left( U_2^{-1}/b_i + I 2/\gamma \right)^{-1}\left( I - \left( I + S_i^{-1}\gamma/2 + U_2^{-1} S_i^{-1}\gamma^2/(4 b_i) \right)^{-1} \right)^{-1} \\ &\quad \times \left( I + \gamma/2 \left( I + \gamma/(2 b_i)\, U_2^{-1} \right) S_i^{-1} \right)^{-1}\left( I + \gamma/(2 b_i)\, U_2^{-1} \right) S_i^{-1}\mu_i \\ &= \left( U_2^{-1}/b_i + I 2/\gamma \right)^{-1}\left( \gamma/2 \left( I + \gamma/(2 b_i)\, U_2^{-1} \right) S_i^{-1} \right)^{-1}\left( I + \gamma/(2 b_i)\, U_2^{-1} \right) S_i^{-1}\mu_i \\ &= \left( U_2^{-1}\gamma/(2 b_i) + I \right)^{-1}\mu_i. \end{aligned}$$
After combining these terms, we can define (A10) as the solution to
$$\begin{aligned} \mu_q &= \left( I + U_2\, 2 b_i/\gamma \right)^{-1}\mu_u + \left( U_2^{-1}\gamma/(2 b_i) + I \right)^{-1}\mu_i \\ \Rightarrow\; &\left( I + U_2\, 2 b_i/\gamma \right)\left( \mu_q - \left( U_2^{-1}\gamma/(2 b_i) + I \right)^{-1}\mu_i \right) = \mu_u \\ \Rightarrow\; &\left( I + U_2\, 2 b_1/\gamma \right)\left( \mu_q - \left( U_2^{-1}\gamma/(2 b_1) + I \right)^{-1}\mu_1 \right) = \left( I + U_2\, 2/\gamma \right)\left( \mu_q - \left( U_2^{-1}\gamma/2 + I \right)^{-1}\mu_2 \right). \end{aligned}$$
Since the matrix inversion lemma also implies
$$\left( U_2^{-1}\gamma/(2 b_i) + I \right)^{-1} = I - \left( U_2\, 2 b_i/\gamma + I \right)^{-1},$$
we have
$$\begin{aligned} &\left( I + U_2\, 2/\gamma \right)\mu_q - U_2\, 2/\gamma\, \mu_2 = \left( I + U_2\, 2 b_1/\gamma \right)\mu_q - U_2\, 2 b_1/\gamma\, \mu_1 \\ \Rightarrow\; &(1 - b_1)\,\mu_q = \mu_2 - b_1\mu_1 \\ \Rightarrow\; &\left( 1 + \lambda/(1-\lambda) \right)\mu_q = \mu_2 + \lambda/(1-\lambda)\,\mu_1 \\ \Rightarrow\; &\mu_q = \mu_2(1-\lambda) + \lambda\,\mu_1. \end{aligned}$$
 □

References

  1. Geweke, J.; Amisano, G. Optimal prediction pools. J. Econom. 2011, 164, 130–141.
  2. Lichtendahl, K.C.; Grushka-Cockayne, Y.; Winkler, R. Is it better to average probabilities or quantiles? Manag. Sci. 2013, 59, 1594–1611.
  3. Busetti, F. Quantile aggregation of density forecasts. Oxf. Bull. Econ. Stat. 2017, 79, 495–512.
  4. Benamou, J.; Carlier, G.; Cuturi, M.; Nenna, L.; Peyré, G. Iterative Bregman projections for regularized transportation problems. SIAM J. Sci. Comput. 2015, 37, 1111–1138.
  5. Janati, H.; Cuturi, M.; Gramfort, A. Debiased Sinkhorn barycenters. arXiv 2020, arXiv:2006.02575.
  6. Timmermann, A. Forecast combinations. In Handbook of Economic Forecasting; Elsevier: Amsterdam, The Netherlands, 2006; Volume 1, pp. 135–196.
  7. Clemen, R. Combining forecasts: A review and annotated bibliography. Int. J. Forecast. 1989, 5, 559–583.
  8. Villani, C. Topics in Optimal Transportation; American Mathematical Society: Providence, RI, USA, 2003; Volume 58.
  9. Galichon, A. Optimal Transport Methods in Economics; Princeton University Press: Princeton, NJ, USA, 2018.
  10. Ratcliff, R. Group reaction time distributions and an analysis of distribution statistics. Psychol. Bull. 1979, 86, 446–461.
  11. Genest, C. Vincentization revisited. Ann. Stat. 1992, 20, 1137–1142.
  12. Agueh, M.; Carlier, G. Barycenters in the Wasserstein space. SIAM J. Math. Anal. 2011, 43, 904–924.
  13. Knott, M.; Smith, C.S. On a generalization of cyclic monotonicity and distances among random vectors. Linear Algebra Appl. 1994, 199, 363–371.
  14. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013.
  15. Peyré, G.; Cuturi, M. Computational optimal transport: With applications to data science. Found. Trends Mach. Learn. 2019, 11, 355–607.
  16. Sinkhorn, R. Diagonal equivalence to matrices with prescribed row and column sums. Am. Math. Mon. 1967, 74, 402–405.
  17. Ran, A.C.; Reurings, M.C. A fixed point theorem in partially ordered sets and some applications to matrix equations. Proc. Am. Math. Soc. 2004, 132, 1435–1443.
  18. Cuturi, M.; Doucet, A. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014.
  19. Solomon, J.; De Goes, F.; Peyré, G.; Cuturi, M.; Butscher, A.; Nguyen, A.; Du, T.; Guibas, L. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Trans. Graph. 2015, 34, 66.
  20. Dvurechensky, P.; Dvinskikh, D.; Gasnikov, A.; Uribe, C.; Nedic, A. Decentralize and randomize: Faster algorithm for Wasserstein barycenters. In Proceedings of the Annual Conference on Neural Information Processing Systems 2018, Montreal, QC, Canada, 3–8 December 2018.
  21. Kroshnin, A.; Tupitsa, N.; Dvinskikh, D.; Dvurechensky, P.; Gasnikov, A.; Uribe, C. On the complexity of approximating Wasserstein barycenters. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019.
  22. Lin, T.; Ho, N.; Chen, X.; Cuturi, M.; Jordan, M.I. Fixed-support Wasserstein barycenters: Computational hardness and fast algorithm. arXiv 2020, arXiv:2002.04783v4.
  23. Lin, T.; Ho, N.; Cuturi, M.; Jordan, M.I. On the complexity of approximating multimarginal optimal transport. arXiv 2020, arXiv:1910.00152v2.
  24. Janati, H.; Muzellec, B.; Peyré, G.; Cuturi, M. Entropic optimal transport between (unbalanced) Gaussian measures has a closed form. arXiv 2020, arXiv:2006.02572.
  25. Mallasto, A.; Gerolin, A.; Minh, H.Q. Entropy-regularized 2-Wasserstein distance between Gaussian measures. arXiv 2020, arXiv:2006.03416.
  26. Del Negro, M.; Hasegawa, R.; Schorfheide, F. Dynamic prediction pools: An investigation of financial frictions and forecasting performance. J. Econom. 2016, 192, 391–405.
  27. White, H. Maximum likelihood estimation of misspecified models. Econometrica 1982, 50, 1–25.
  28. Bollerslev, T.; Wooldridge, J.M. Quasi-maximum likelihood estimation and inference in dynamic models with time-varying covariances. Econom. Rev. 1992, 11, 143–172.
  29. McCracken, M.; Ng, S. FRED-QD: A Quarterly Database for Macroeconomic Research; Working Paper No. 26872; National Bureau of Economic Research: Cambridge, MA, USA, 2020.
  30. Diebold, F.X.; Shin, M. Machine learning for regularized survey forecast combination: Partially-egalitarian LASSO and its derivatives. Int. J. Forecast. 2019, 35, 1679–1691.
  31. Bromiley, P. Products and convolutions of Gaussian probability density functions. Tina-Vision Memo 2003, 3, 1.
Figure 1. Sum of log predictive score for U.S. inflation rate (2000Q1–2018Q4). Left panel: individual forecasters, sorted by performance, with quantile aggregation shown as a solid line. Right panel: the regularized Wasserstein barycenter across values of the regularization parameter $\gamma$.
Table 1. Variables used in empirical exercises.
| $Y^{(i)} = [Y_1, Y_2, Y_3^{(i)}]$ | Used by | Variable Description | FRED-QD Mnemonic |
| --- | --- | --- | --- |
| Variable 1 ($Y_1$) | All | Inflation rate | GDPCTPI |
| Variable 2 ($Y_2$) | All | Real GDP growth rate | GDPC1 |
| Variable 3 ($Y_3^{(i)}$) | Forecaster 1 | Real Personal Consumption Expenditures | PCECC96 |
| | Forecaster 2 | Industrial Production Index | INDPRO |
| | Forecaster 3 | All Employees: Total Nonfarm | PAYEMS |
| | Forecaster 4 | Housing Starts: Total Privately Owned Housing Units Started | HOUST |
| | Forecaster 5 | Real Manufacturing and Trade Industries Sales | CMRMTSPLx |
| | Forecaster 6 | Real Crude Oil Prices: West Texas Intermediate (WTI) | OILPRICEx |
| | Forecaster 7 | Real Average Hourly Earnings: Manufacturing | CES3000000008x |
| | Forecaster 8 | 10-Year Treasury Constant Maturity Minus 3-Month Treasury Bill | GS10TB3Mx |
| | Forecaster 9 | Real Commercial and Industrial Loans | BUSLOANSx |
| | Forecaster 10 | Real Total Assets of Households and Nonprofit Organizations | TABSHNOx |
| | Forecaster 11 | U.S. / U.K. Foreign Exchange Rate | EXUSUKx |
| | Forecaster 12 | Consumer Sentiment (University of Michigan) | UMCSENTx |
| | Forecaster 13 | S&P’s Common Stock Price Index: Composite | S&P 500 |
| | Forecaster 14 | Real Disposable Business Income | CNCFx |

Note: All variables are obtained from the FRED-QD database [29]. The inflation rate is computed as the log difference of the GDP deflator (GDPCTPI). The real GDP growth rate is computed as the log difference of real GDP (GDPC1). All other variables are transformed following [29]. We use the 2019-11 vintage of the data. Each forecaster constructs a predictive distribution using their own vector autoregression with three variables $Y^{(i)} = [Y_1, Y_2, Y_3^{(i)}]$, where $i = 1, 2, \ldots, 14$.
