Entropy-Regularized Optimal Transport on Multivariate Normal and q-normal Distributions

Distances and divergences between probability measures play a central role in statistics, machine learning, and many other related fields. The Wasserstein distance has received much attention in recent years because of its distinctions from other distances and divergences. Although computing the Wasserstein distance is costly, entropy-regularized optimal transport was proposed as a computationally efficient approximation of it. The purpose of this study is to understand the theoretical aspects of entropy-regularized optimal transport. In this paper, we focus on entropy-regularized optimal transport on multivariate normal distributions and q-normal distributions. We obtain the explicit form of the entropy-regularized optimal transport cost on multivariate normal and q-normal distributions; this provides a perspective for understanding the effect of entropy regularization, which was previously known only experimentally. Furthermore, we obtain the entropy-regularized Kantorovich estimator for probability measures that satisfy certain conditions. We also demonstrate how the Wasserstein distance, optimal coupling, geometric structure, and statistical efficiency are affected by entropy regularization in some experiments. In particular, our results on the explicit form of the optimal coupling of the Tsallis entropy-regularized optimal transport on multivariate q-normal distributions and on the entropy-regularized Kantorovich estimator are novel and constitute a first step towards understanding more general settings.


Introduction
Comparing probability measures is a fundamental problem in statistics and machine learning. A classical way to compare probability measures is the Kullback-Leibler divergence. Let M be a measurable space and µ, ν be probability measures on M; then, the Kullback-Leibler divergence is defined as:

KL(µ ‖ ν) := ∫_M log( dµ/dν (x) ) dµ(x).

The Wasserstein distance [1], also known as the earth mover distance [2], is another way of comparing probability measures. It is a metric on the space of probability measures derived from the theory of mass transportation between two probability measures. Informally, optimal transport theory considers an optimal transport plan between two probability measures under a cost function, and the Wasserstein distance is defined by the minimum total transport cost. A significant difference between the Wasserstein distance and the Kullback-Leibler divergence is that the former can reflect the metric structure, whereas the latter cannot. The Wasserstein distance can be written as:

W_p(µ, ν) := ( inf_{π ∈ Π(µ,ν)} ∫_{M×M} d(x, y)^p dπ(x, y) )^{1/p},

where d(·, ·) is a distance function on a measurable metric space M and Π(µ, ν) denotes the set of probability measures on M × M whose marginal measures correspond to µ and ν. In recent years, the application of optimal transport and the Wasserstein distance has been studied in many fields such as statistics, machine learning, and image processing. For example, Reference [3] generated the interpolation of various three-dimensional (3D) objects using the Wasserstein barycenter. In the field of word embedding in natural language processing, Reference [4] embedded each word as an elliptical distribution, and the Wasserstein distance was applied between the elliptical distributions. There are many studies on the applications of optimal transport to deep learning, including [5][6][7]. Moreover, Reference [8] analyzed the denoising autoencoder [9] with gradient flow in the Wasserstein space.
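To make the contrast concrete, the following sketch compares the two quantities for one-dimensional normal distributions using their well-known closed forms (the function names are ours, introduced for illustration):

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    # KL(N(m1, s1^2) || N(m2, s2^2)), closed form for 1-D normals
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def w2_gauss(m1, s1, m2, s2):
    # 2-Wasserstein distance between 1-D normals: sqrt((m1-m2)^2 + (s1-s2)^2)
    return np.sqrt((m1 - m2)**2 + (s1 - s2)**2)

# Two unit-variance normals whose means are 3 apart: the Wasserstein
# distance equals the ground distance between the means, while the KL
# divergence grows quadratically with the separation.
print(w2_gauss(0, 1, 3, 1))   # 3.0
print(kl_gauss(0, 1, 3, 1))   # 4.5
```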
In the application of the Wasserstein distance, it is often considered in a discrete setting where µ and ν are discrete probability measures. Then, obtaining the Wasserstein distance between µ and ν can be formulated as a linear programming problem. In general, however, it is computationally intensive to solve such linear problems and obtain the optimal coupling of two probability measures. For such a situation, a novel numerical method, entropy regularization, was proposed by [10]:

C_λ(µ, ν) := inf_{π ∈ Π(µ,ν)} ∫_{R^n×R^n} c(x, y)π(x, y) dx dy − λ Ent(π).   (1)

This is a relaxed formulation of the original optimal transport of a cost function c(·, ·), in which the negative Shannon entropy −Ent(·) is used as a regularizer. For a small λ, C_λ(µ, ν) can approximate the p-th power of the Wasserstein distance between two discrete probability measures, and it can be computed efficiently by using Sinkhorn's algorithm [11].
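A minimal sketch of Sinkhorn's matrix-scaling iterations for discrete measures; the toy support points, weights, and regularization value below are arbitrary choices for illustration:

```python
import numpy as np

def sinkhorn(mu, nu, C, lam, n_iter=5000):
    """Entropy-regularized optimal transport between discrete measures mu, nu
    with cost matrix C, computed by Sinkhorn's alternating scaling updates."""
    K = np.exp(-C / lam)               # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    P = u[:, None] * K * v[None, :]    # approximate optimal coupling
    return P, float(np.sum(P * C))     # coupling and its transport cost term

# Toy example on the real line with squared-distance cost.
x = np.array([0.0, 1.0, 2.0])
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
C = (x[:, None] - x[None, :]) ** 2
P, cost = sinkhorn(mu, nu, C, lam=0.5)
# The marginals of P match mu and nu up to numerical tolerance.
```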
More recently, many studies have been published on improving the computational efficiency. According to [12], the most computationally efficient algorithm at this moment to solve the linear problem for the Wasserstein distance is the Lee-Sidford linear solver [13], which runs in O(n^{2.5}). Reference [14] proved that a complexity bound for the Sinkhorn algorithm is Õ(n² ε^{−2}), where ε is the desired absolute performance guarantee. After [10] appeared, various algorithms have been proposed. For example, Reference [15] adopted stochastic optimization schemes for solving the optimal transport. The Greenkhorn algorithm [16] is the greedy variant of the Sinkhorn algorithm, and Reference [12] proposed its acceleration. Many other approaches, such as adapting a variety of standard optimization algorithms to approximate the optimal transport problem, can be found in [12,17-19]. Several specialized Newton-type algorithms [20,21] achieve the complexity bound Õ(n² ε^{−1}) [22,23], which is the best one in terms of computational complexity at the present moment.
Moreover, entropy-regularized optimal transport has another advantage. Because of the differentiability of the entropy-regularized optimal transport and the simple structure of Sinkhorn's algorithm, we can easily compute the gradient of the entropy-regularized optimal transport cost and optimize the parameter of a parametrized probability distribution by using numerical differentiation or automatic differentiation. Then, we can define a differentiable loss function that can be applied to various supervised learning methods [24]. Entropy-regularized optimal transport can be used to approximate not only the Wasserstein distance, but also its optimal coupling as a mapping function. Reference [25] adopted the optimal coupling of the entropy-regularized optimal transport as a mapping function from one domain to another.
Despite the empirical success of the entropy-regularized optimal transport, its theoretical aspect is less understood. Reference [26] studied the expected Wasserstein distance between a probability measure and its empirical version. Similarly, Reference [27] showed the consistency of the entropy-regularized optimal transport cost between two empirical distributions. Reference [28] showed that minimizing the entropy-regularized optimal transport cost between empirical distributions is equivalent to a type of maximum likelihood estimator. Reference [29] considered Wasserstein generative adversarial networks with an entropy regularization. Reference [30] constructed information geometry from the convexity of the entropy-regularized optimal transport cost.
The underlying motivation of this study is to produce an analytical solution to the entropy-regularized optimal transport problem between continuous probability measures so that we can gain insight into the effects of entropy regularization both theoretically and experimentally. In our study, we generalized the Wasserstein distance between two multivariate normal distributions by entropy regularization. We derived the explicit form of the entropy-regularized optimal transport cost and its optimal coupling, which can be used to analyze the effect of entropy regularization directly. In general, the nonregularized Wasserstein distance between two probability measures and its optimal coupling cannot be expressed in a closed form; however, Reference [31] proved the explicit formula for multivariate normal distributions. Theorem 1 is a generalization of the result of [31]: we obtain an explicit form of the entropy-regularized optimal transport between two multivariate normal distributions. Furthermore, by adopting the Tsallis entropy [32] as the entropy regularization instead of the Shannon entropy, our theorem can be generalized to multivariate q-normal distributions.
Some readers may find it strange to study the entropy-regularized optimal transport for multivariate normal distributions, where the exact (nonregularized) optimal transport has been obtained explicitly. However, we think it is worth studying from several perspectives:

• Normal distributions are the simplest and best-studied probability distributions, and thus, it is useful to examine the regularization theoretically in order to infer results for other distributions. In particular, we will partly answer the questions "How much do entropy constraints affect the results?" and "What does it mean to constrain by the entropy?" for the simplest cases. Furthermore, as a first step in constructing a theory for more general probability distributions, in Section 4, we propose a generalization to multivariate q-normal distributions.

• Because normal distributions are the limit distributions in asymptotic theories using the central limit theorem, studying normal distributions is necessary for the asymptotic theory of regularized Wasserstein distances and estimators computed by them. Moreover, it was proposed to use the entropy-regularized Wasserstein distance to compute a lower bound of the generalization error for a variational autoencoder [29]. The study of the asymptotic behavior of such bounds is one of the expected applications of our results.

• Though this has not yet been proven theoretically, we suspect that entropy regularization is efficient not only for computational reasons, such as the use of the Sinkhorn algorithm, but also in the sense of efficiency in statistical inference. Such a phenomenon can be found in some existing studies, including [33]. This statistical efficiency is confirmed by some experiments in Section 6.
The remainder of this paper is organized as follows. First, we review some definitions of optimal transport and entropy regularization in Section 2. Then, in Section 3, we provide an explicit form of the entropy-regularized optimal transport cost and its optimal coupling between two multivariate normal distributions. We also extend this result to q-normal distributions for Tsallis entropy regularization in Section 4. In Section 5, we obtain the entropy-regularized Kantorovich estimator of probability measures on R n with a finite second moment that are absolutely continuous with respect to the Lebesgue measure in Theorem 3. We emphasize that Theorem 3 is not limited to the case of multivariate normal distribution, but can handle a wider range of probability measures. We analyze how entropy regularization affects the optimal result experimentally in certain sections.
We note that within half a year of publishing the preprint version of this paper, we found closely related results [34,35]. Janati et al. [34] proved the same result as Theorem 1 based on solving the fixed-point equation behind Sinkhorn's algorithm. Their results include the unbalanced optimal transport between unbalanced multivariate normal distributions. They also studied the convexity and differentiability of the objective function of the entropy-regularized optimal transport. In [35], the same closed form as Theorem 1 was proven by ingeniously using the Schrödinger system. Although there are some overlaps, our paper has significant novelty in the following respects. Our proof is more direct than theirs and can be extended directly to the proof for the Tsallis entropy-regularized optimal transport between multivariate q-normal distributions provided in Section 4. Furthermore, Corollaries 1 and 2 are novel and important results for evaluating how much the entropy regularization affects the estimation results, if at all. We also obtain the entropy-regularized Kantorovich estimator in Theorem 3.

Preliminary
In this section, we review some definitions of optimal transport and entropy-regularized optimal transport. These definitions follow [1,36]. In this section, we write (M, Σ) for a set M together with a σ-algebra Σ on M, and P(X) for the set of all probability measures on a measurable space X.
Definition 1 (Pushforward measure). Given measurable spaces (M_1, Σ_1) and (M_2, Σ_2), a measure µ : Σ_1 → [0, +∞], and a measurable mapping ϕ : M_1 → M_2, the pushforward measure of µ by ϕ is defined by:

ϕ_*µ(B) := µ(ϕ^{−1}(B)) for all B ∈ Σ_2.

Definition 2 (Optimal transport map). Consider a measurable space (M, Σ), and let c : M × M → R_+ denote a cost function. Given µ, ν ∈ P(M), we call ϕ : M → M the optimal transport map if ϕ_*µ = ν and ϕ realizes the infimum of:

inf_{ϕ : ϕ_*µ = ν} ∫_M c(x, ϕ(x)) dµ(x).

This problem was originally formalized by [37]. However, the optimal transport map does not always exist. Then, Kantorovich considered a relaxation of this problem in [38].

Definition 3 (Coupling). Given µ, ν ∈ P(M), a coupling of µ and ν is a probability measure π on M × M that satisfies:

π(A × M) = µ(A), π(M × B) = ν(B) for all A, B ∈ Σ.

Definition 4 (Kantorovich problem). The Kantorovich problem is defined as finding a coupling π of µ and ν that realizes the infimum of:

inf_{π ∈ Π(µ,ν)} ∫_{M×M} c(x, y) dπ(x, y).

Hereafter, let Π(µ, ν) be the set of all couplings of µ and ν. When we adopt a distance function as the cost function, we can define the Wasserstein distance.
Definition 5 (Wasserstein distance). Given p ≥ 1, a measurable metric space (M, Σ, d), and µ, ν ∈ P(M) with a finite p-th moment, the p-Wasserstein distance between µ and ν is defined as:

W_p(µ, ν) := ( inf_{π ∈ Π(µ,ν)} ∫_{M×M} d(x, y)^p dπ(x, y) )^{1/p}.

Now, we review the definition of entropy-regularized optimal transport on R^n.

Definition 6 (Entropy-regularized optimal transport). Let µ, ν ∈ P(R^n), λ > 0, and let π(x, y) be the density function of a coupling of µ and ν, whose reference measure is the Lebesgue measure. We define the entropy-regularized optimal transport cost as:

C_λ(µ, ν) := inf_{π ∈ Π(µ,ν)} ∫_{R^n×R^n} c(x, y)π(x, y) dx dy − λ Ent(π),

where Ent(·) denotes the Shannon entropy of a probability measure:

Ent(π) := − ∫_{R^n×R^n} π(x, y) log π(x, y) dx dy.

There is another variation of entropy-regularized optimal transport defined by the relative entropy instead of the Shannon entropy:

C̃_λ(µ, ν) := inf_{π ∈ Π(µ,ν)} ∫_{R^n×R^n} c(x, y) dπ(x, y) + λ KL(π ‖ µ ⊗ ν).

This is definable even when Π(µ, ν) includes a coupling that is not absolutely continuous with respect to the Lebesgue measure. We note that when both µ and ν are absolutely continuous, the infimum is attained by the same π for C_λ and C̃_λ, and it depends only on µ and ν. In the following part of the paper, we assume the absolute continuity of µ, ν, and π with respect to the Lebesgue measure for well-defined entropy regularization.

Entropy-Regularized Optimal Transport between Multivariate Normal Distributions
In this section, we provide a rigorous solution of entropy-regularized optimal transport between two multivariate normal distributions. Throughout this section, we adopt the squared Euclidean distance ‖x − y‖² as the cost function. To prove our theorem, we start by expressing C_λ using mean vectors and covariance matrices. The following lemma is a known result; see, for example, [31].

Lemma 1. Let X ∼ P, Y ∼ Q be two random variables on R^n with means µ_1, µ_2 and covariance matrices Σ_1, Σ_2, respectively. If π(x, y) is a coupling of P and Q, we have:

∫_{R^n×R^n} ‖x − y‖² π(x, y) dx dy = ‖µ_1 − µ_2‖² + tr(Σ_1) + tr(Σ_2) − 2 tr(Cov(X, Y)).   (12)

Proof. Without loss of generality, we can assume X and Y are centralized, because subtracting the means changes the left-hand side only by the constant ‖µ_1 − µ_2‖². For centralized X and Y, we have:

E[‖X − Y‖²] = E[‖X‖²] + E[‖Y‖²] − 2E[⟨X, Y⟩] = tr(Σ_1) + tr(Σ_2) − 2 tr(Cov(X, Y)).

By adding ‖µ_1 − µ_2‖², we obtain (12).
Lemma 1 shows that ∫_{R^n×R^n} ‖x − y‖² π(x, y) dx dy can be parameterized by the covariance matrices Σ_1, Σ_2, and Cov(X, Y). Because Σ_1 and Σ_2 are fixed, the infinite-dimensional optimization over the coupling π reduces to a finite-dimensional optimization over the covariance matrix Cov(X, Y).
We prepare the following lemma to prove Theorem 1.

Lemma 2.
Under a fixed mean and covariance matrix, the probability measure that maximizes the entropy is a multivariate normal distribution.
Lemma 2 is a particular case of the principle of maximum entropy [39], and the proof can be found in [40] Theorem 3.1.
Theorem 1. Let P = N(µ_1, Σ_1) and Q = N(µ_2, Σ_2) be n-variate normal distributions. The optimal coupling π of P and Q of the entropy-regularized optimal transport:

inf_{π ∈ Π(P,Q)} ∫_{R^n×R^n} ‖x − y‖² π(x, y) dx dy − 4λ Ent(π)   (*)

is expressed as the 2n-variate normal distribution:

π = N( (µ_1^T, µ_2^T)^T, ((Σ_1, Σ_λ), (Σ_λ^T, Σ_2)) ),

where:

Σ_λ = Σ_1^{1/2}( (Σ_1^{1/2} Σ_2 Σ_1^{1/2} + λ² I)^{1/2} − λI )Σ_1^{−1/2}.

Furthermore, writing D_λ := (Σ_1^{1/2} Σ_2 Σ_1^{1/2} + λ² I)^{1/2}, the optimal cost can be written as:

C_{4λ}(P, Q) = ‖µ_1 − µ_2‖² + tr(Σ_1) + tr(Σ_2) − 2 tr(D_λ) + 2nλ − 2λ( 2n log(2πe) + n log(2λ) + log det(D_λ − λI) ),   (18)

and the relative entropy version can be written as:

C̃_{4λ}(P, Q) = ‖µ_1 − µ_2‖² + tr(Σ_1) + tr(Σ_2) − 2 tr(D_λ) + 2nλ + 2λ log det( (D_λ + λI)/(2λ) ).   (19)

We note that we use the regularization parameter 4λ in (*) for the sake of simplicity.
Proof. Although the first half of the proof can be derived directly from Lemma 2, we provide a proof of this theorem by Lagrange calculus, which will be used later for the extension to q-normal distributions. We define an optimization problem that is equivalent to the entropy-regularized optimal transport as follows:

inf_π ∫_{R^n×R^n} ‖x − y‖² π(x, y) dx dy − 4λ Ent(π)
subject to ∫ π(x, y) dy = p(x) for ∀x ∈ R^n, ∫ π(x, y) dx = q(y) for ∀y ∈ R^n.   (21)

Here, p(x) and q(y) are the probability density functions of P and Q, respectively. Let α(x), β(y) be the Lagrange multipliers that correspond to the above two constraints. The Lagrangian function of (21) is defined as:

L(π, α, β) = ∫∫ ‖x − y‖² π dx dy − 4λ Ent(π) − ∫ α(x)( ∫ π dy − p(x) ) dx − ∫ β(y)( ∫ π dx − q(y) ) dy.   (22)

Taking the functional derivative of (22) with respect to π and applying the fundamental lemma of the calculus of variations, we have:

π(x, y) = exp( (α(x) + β(y) − ‖x − y‖²) / (4λ) − 1 ).   (24)

Here, α(x), β(y) are determined from the constraints (21). We can assume that π is a 2n-variate normal distribution, because for a fixed covariance matrix Cov(X, Y), −Ent(π) takes its infimum when the coupling π is a multivariate normal distribution by Lemma 2. Therefore, we can express π by using z = (x^T, y^T)^T and a covariance matrix Σ := Cov(X, Y) as:

π(z) ∝ exp( −(1/2) z^T A^{−1} z ),  A = ((Σ_1, Σ), (Σ^T, Σ_2)).   (28)

According to the block matrix inversion formula [41], the coefficient of the cross term x^T y in the exponent of (28) is Σ_1^{−1}Σ(Σ_2 − Σ^T Σ_1^{−1} Σ)^{−1}; comparing it with the coefficient (1/(2λ))I arising from −‖x − y‖²/(4λ) in (24), we obtain:

Σ_1^{−1}Σ(Σ_2 − Σ^T Σ_1^{−1} Σ)^{−1} = (1/(2λ)) I,  i.e.,  Σ^T Σ_1^{−1} Σ + 2λ Σ_1^{−1} Σ = Σ_2.   (30)

Here, Σ_1^{−1}Σ = Σ^T Σ_1^{−1} holds, because A is a symmetric matrix. Multiplying (30) by Σ_1^{1/2} from both sides and setting K := Σ_1^{−1/2} Σ Σ_1^{1/2}, which is symmetric, we obtain:

K² + 2λK = Σ_1^{1/2} Σ_2 Σ_1^{1/2}.

Completing the square of the above equation, we obtain:

(K + λI)² = Σ_1^{1/2} Σ_2 Σ_1^{1/2} + λ² I,   (31)

so K + λI = Q(Σ_1^{1/2} Σ_2 Σ_1^{1/2} + λ² I)^{1/2} for some orthogonal matrix Q. Because K + λI and (Σ_1^{1/2} Σ_2 Σ_1^{1/2} + λ² I)^{1/2} are both symmetric positive definite, we can conclude that Q is the identity matrix by the uniqueness of the polar decomposition. Finally, we obtain:

Σ_λ = Σ_1^{1/2}( (Σ_1^{1/2} Σ_2 Σ_1^{1/2} + λ² I)^{1/2} − λI )Σ_1^{−1/2}.

We obtain (18) by the direct calculation of C_λ using Lemma 1 with this Σ_λ.
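The closed form derived above can be checked numerically; the following NumPy sketch (the helper names `psd_sqrt` and `sigma_lambda` are ours) verifies the stationarity condition from the proof and the decorrelation of the coupling for large λ:

```python
import numpy as np

def psd_sqrt(A):
    # Symmetric square root of a positive definite matrix via eigendecomposition
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(w)) @ V.T

def sigma_lambda(S1, S2, lam):
    # Cross-covariance of the entropy-regularized optimal coupling,
    # as derived in the proof of Theorem 1
    n = len(S1)
    r1 = psd_sqrt(S1)
    D = psd_sqrt(r1 @ S2 @ r1 + lam**2 * np.eye(n))
    return r1 @ (D - lam * np.eye(n)) @ np.linalg.inv(r1)

S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.5, -0.3], [-0.3, 2.0]])
lam = 0.7
S = sigma_lambda(S1, S2, lam)
# Stationarity condition from the proof: S^T S1^{-1} S + 2*lam*S1^{-1} S = S2
lhs = S.T @ np.linalg.inv(S1) @ S + 2 * lam * np.linalg.inv(S1) @ S
assert np.allclose(lhs, S2)
# Large lam decorrelates the coupling: Sigma_lambda tends to 0.
assert np.linalg.norm(sigma_lambda(S1, S2, 1e3)) < 1e-2
```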
The following corollary helps us to understand the properties of Σ λ .
Corollary 1. Under the same settings as in Theorem 1, we have:

lim_{λ↓0} Σ_λ = Σ_1^{1/2}(Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2} Σ_1^{−1/2},   lim_{λ→∞} Σ_λ = 0,

and:

tr(Σ_λ) = tr( (Σ_1^{1/2} Σ_2 Σ_1^{1/2} + λ² I)^{1/2} ) − nλ,

which is a monotonically decreasing function of the regularization parameter λ.
We show how entropy regularization behaves in two simple experiments. We calculate the entropy-regularized optimal transport cost between N((0, 0)^T, ((1, 0), (0, 1))) and N((0, 0)^T, ((2, −1), (−1, 2))) in the original version and the relative entropy version in Figure 1. We separate the entropy-regularized optimal transport cost into the transport cost term and the regularization term and display both of them. It is reasonable that as λ ↓ 0, Σ_λ converges to Σ_1^{1/2}(Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2} Σ_1^{−1/2}, which is equal to the optimal coupling of the nonregularized optimal transport, and as λ → ∞, Σ_λ converges to 0. This is a special case of Corollary 1. The larger λ becomes, the less correlated the optimal coupling is. We visualize this behavior by computing the optimal couplings of two one-dimensional normal distributions in Figure 2. The left panel of Figure 1 shows the original version. The transport cost is always positive, and the entropy regularization term can take both signs in general; then, the sign of the total cost depends on their balance. We note that the transport cost as a function of λ is bounded, whereas the entropy regularization term is not. The boundedness of the optimal cost is deduced from (1) and Corollary 1, and the unboundedness of the entropy regularization term is due to the regularization parameter λ multiplied by the entropy. The right panel shows the relative entropy version. It always takes a non-negative value. Furthermore, because the total cost is bounded by the value for the independent joint distribution (which is always a feasible coupling), both the transport cost and the relative entropy regularization term are also bounded. Nevertheless, the larger the regularization parameter λ, the greater the influence of entropy regularization on the total cost.
It is known that a specific Riemannian metric can be defined on the space of multivariate normal distributions, which induces the Wasserstein distance [42]. To understand the effect of entropy regularization, we illustrate how entropy regularization deforms this geometric structure in Figure 3. Here, we generate 100 two-variate normal distributions {N(0, Σ_{r,k})}, r, k ∈ {1, 2, . . . , 10}, where the covariance matrices {Σ_{r,k}} are defined in terms of r and k. To visualize the geometric structure of these two-variate normal distributions, we compute the relative entropy-regularized optimal transport cost C̃_λ between each pair of two-variate normal distributions. Then, we apply multidimensional scaling [43] to embed them into a plane (see Figure 3). We can see that entropy regularization deforms the geometric structure of the space of multivariate normal distributions. The deformation for distributions close to the isotropic normal distribution is more sensitive to the change in λ. The following corollary states that if we allow orthogonal transformations of two multivariate normal distributions with fixed covariance matrices, then the minimum and maximum of C_λ are attained when Σ_1 and Σ_2 are diagonalizable by the same orthogonal matrix or, equivalently, when the ellipsoidal contours of the two density functions are aligned with the same orthogonal axes.

Corollary 2.
With the same settings as in Theorem 1, fix µ_1, µ_2, Σ_1, and all eigenvalues of Σ_2. Suppose Σ_1 is diagonalized as Σ_1 = Γ^T Λ_1^↓ Γ, where Λ_1^↓ is the diagonal matrix of the eigenvalues of Σ_1 in descending order and Γ is an orthogonal matrix, and let Λ_2^↓ and Λ_2^↑ be the diagonal matrices of the eigenvalues of Σ_2 in descending and ascending order, respectively. Then, C_λ is minimized at Σ_2 = Γ^T Λ_2^↓ Γ and maximized at Σ_2 = Γ^T Λ_2^↑ Γ. Therefore, neither the minimizer, nor the maximizer depends on the choice of λ.
Note that a special case of Corollary 2 for the ordinary Wasserstein metric (λ = 0) has been studied in the context of fidelity and the Bures distance in quantum information theory. See Lemma 3 of [47]. Their proof is not directly applicable to our generalized result; thus, we used another approach to prove it.

Extension to Tsallis Entropy Regularization
In this section, we consider a generalization of entropy-regularized optimal transport. We now focus on the Tsallis entropy [32], which is a generalization of the Shannon entropy and appears in nonequilibrium statistical mechanics. We show that the optimal coupling of Tsallis entropy-regularized optimal transport between two q-normal distributions is also a q-normal distribution. We start by recalling the definition of the q-exponential function and q-logarithmic function based on [32].

Definition 7.
Let q be a real parameter, and let u > 0. The q-logarithmic function is defined as:

ln_q(u) := (u^{1−q} − 1)/(1 − q)  (q ≠ 1),

and the q-exponential function is defined as:

exp_q(u) := (1 + (1 − q)u)_+^{1/(1−q)}  (q ≠ 1),

where (·)_+ := max(·, 0); both reduce to the ordinary logarithm and exponential as q → 1.

Definition 8. Let q < 1 or 1 < q < 1 + 2/n; an n-variate q-normal distribution is defined by two parameters, µ ∈ R^n and a positive definite matrix Σ, and its density function is:

p(x) = (1/C_q(Σ)) exp_q( −(x − µ)^T Σ^{−1} (x − µ) ),

where C_q(Σ) is a normalizing constant. µ and Σ are called the location vector and scale matrix, respectively.
In the following, we write the multivariate q-normal distribution as N_q(µ, Σ). We note that the properties of the q-normal distribution change in accordance with q. The q-normal distribution has an unbounded support for 1 < q < 1 + 2/n and a bounded support for q < 1. The second moment exists for q < 1 + 2/(n + 2), and the covariance becomes Σ/(2 + (n + 2)(1 − q)). We remark that, for 1 < q < 1 + 2/(n + 2), each n-variate (1 + 2/(ν + n))-normal distribution is equivalent to an n-variate t-distribution with ν degrees of freedom, and the q-normal distribution converges to an n-variate normal distribution as q ↓ 1.
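The q-exponential and q-logarithm defined above are easy to check numerically; a minimal sketch (the function names are ours):

```python
import numpy as np

def log_q(u, q):
    # q-logarithm: ln_q(u) = (u^{1-q} - 1) / (1 - q), recovering ln(u) as q -> 1
    if q == 1.0:
        return np.log(u)
    return (u**(1 - q) - 1) / (1 - q)

def exp_q(u, q):
    # q-exponential: [1 + (1-q)u]_+^{1/(1-q)}, recovering exp(u) as q -> 1
    if q == 1.0:
        return np.exp(u)
    return np.maximum(1 + (1 - q) * u, 0.0)**(1 / (1 - q))

# exp_q and log_q are inverse to each other on their domain:
u = 2.5
assert np.isclose(exp_q(log_q(u, 1.5), 1.5), u)
# As q -> 1, they approach the ordinary exponential and logarithm:
assert np.isclose(exp_q(1.0, 1.0001), np.e, atol=1e-3)
```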

Definition 9.
Let p be a probability density function. The Tsallis entropy is defined as:

T_q(p) := (1/(q − 1)) ( 1 − ∫ p(x)^q dx ).

Then, the Tsallis entropy-regularized optimal transport is defined as:

inf_π ∫_{R^n×R^n} ‖x − y‖² π(x, y) dx dy − λ T_q(π)
subject to ∫ π(x, y) dy = p(x) for ∀x ∈ R^n, ∫ π(x, y) dx = q(y) for ∀y ∈ R^n.

The following lemma is a generalization of the maximum entropy principle for the Shannon entropy shown in Section 2 of [48].

Lemma 3. Let P be a centered n-dimensional probability measure with a fixed covariance matrix Σ; the maximizer of the Rényi α-entropy:

H_α(p) := (1/(1 − α)) log ∫ p(x)^α dx,

under this constraint is N_{2−α}(0, ((n + 2)α − n)Σ) for n/(n + 2) < α < 1.
We note that the maximizers of the Rényi α-entropy and the Tsallis entropy with q = α coincide; thus, the above lemma also holds for the Tsallis entropy. This is mentioned, for example, in Section 9 of [49].
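As a numerical illustration of the relation between the Tsallis and Shannon entropies, the following sketch (grid-based integration; the helper name is ours) checks that T_q approaches the Shannon entropy of a standard normal distribution as q → 1:

```python
import numpy as np

def tsallis_entropy(p_vals, dx, q):
    # T_q(p) = (1 - \int p^q dx) / (q - 1), approximated on a uniform grid
    return (1.0 - np.sum(p_vals**q) * dx) / (q - 1.0)

x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density
shannon = 0.5 * np.log(2 * np.pi * np.e)     # Shannon entropy of N(0,1), ~1.4189
# The Tsallis entropy converges to the Shannon entropy as q -> 1:
assert abs(tsallis_entropy(p, dx, 1.001) - shannon) < 1e-2
```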
To prove Theorem 2, we use the following property of multivariate t-distributions, which is summarized in Chapter 1 of [50].

Lemma 4. Let X = (X_1^T, X_2^T)^T be a random vector following an n-variate t-distribution with ν degrees of freedom. Considering the corresponding partition of the mean vector µ and scale matrix Σ:

µ = (µ_1^T, µ_2^T)^T,   Σ = ((Σ_11, Σ_12), (Σ_21, Σ_22)),   (54)

X_1 follows a p-variate t-distribution with ν degrees of freedom, mean vector µ_1, and scale matrix Σ_11, where p is the dimension of X_1.
Recalling the correspondence q = 1 + 2/(ν + n) between the parameter of the multivariate q-normal distribution and the degrees of freedom of the multivariate t-distribution, we can obtain the following corollary.
Corollary 3. Let X be a random vector following an n-variate q-normal distribution for 1 < q < 1 + 2/(n + 2). Consider a partition of the mean vector µ and scale matrix Σ in the same way as in (54). Then, X_1 follows a p-variate (1 + 2(q − 1)/(2 − (n − p)(q − 1)))-normal distribution with mean vector µ_1 and scale matrix Σ_11, where p is the dimension of X_1.

Entropy-Regularized Kantorovich Estimator
Many estimators are defined by minimizing a divergence or distance ρ between probability measures, that is, arg min_µ ρ(µ, ν) for a fixed ν. When ρ is the Kullback-Leibler divergence, the estimator corresponds to the maximum likelihood estimator. When ρ is the Wasserstein distance, the estimator is called the minimum Kantorovich estimator, according to [36]. In this section, we consider a probability measure Q* that minimizes C_λ(P, Q) for a fixed P over P_2(R^n), the set of all probability measures on R^n with a finite second moment that are absolutely continuous with respect to the Lebesgue measure. In other words, we define the entropy-regularized Kantorovich estimator arg min_{Q ∈ P_2(R^n)} C_λ(P, Q). The entropy-regularized Kantorovich estimator for discrete probability measures was studied in [33], Theorem 2. We obtain the entropy-regularized Kantorovich estimator for continuous probability measures in the following theorem:

Theorem 3. For a fixed P ∈ P_2(R^n) with density function p, the minimizer Q* = arg min_{Q ∈ P_2(R^n)} C_λ(P, Q) exists, and its density function can be written as:

q*(y) = (p ∗ φ_λ)(y),

where φ_λ(x) is the density function of N(0, (λ/2)I), and ∗ denotes the convolution operator.
We use the dual problem of the entropy-regularized optimal transport to prove Theorem 3 (for details, see Proposition 2.1 of [15] or Section 3 of [51]).

Lemma 5.
The dual problem of the entropy-regularized optimal transport can be written as:

A_λ(P, Q) = sup_{α ∈ L¹(P), β ∈ L¹(Q)} ∫ α(x) dP(x) + ∫ β(y) dQ(y) − λ ∫∫ exp( (α(x) + β(y) − c(x, y))/λ − 1 ) dx dy.

Moreover, A_λ(P, Q) = C_λ(P, Q) holds.

Now, we prove Theorem 3.
Proof. Let Q* be the minimizer of min_Q C_λ(P, Q) with density function q*. Applying Lemma 5, there exist α* ∈ L¹(P) and β* ∈ L¹(Q*) that attain A_λ(P, Q*). Now, A_λ(P, Q*) is the minimum value of A_λ over Q, such that the variation δA_λ(P, Q*) with respect to q* is always zero. Then:

δA_λ(P, Q*) = ∫ β*(y) δq*(y) dy = 0 ⇒ β* ≡ 0,   (71)

holds, and the optimal coupling of P and Q* can be written as:

π(x, y) = exp( (α*(x) − ‖x − y‖²)/λ − 1 ).   (72)

Moreover, we can obtain a closed form of α*(x) from the constraint ∫ π(x, y) dy = p(x):

exp( α*(x)/λ − 1 )(πλ)^{n/2} = p(x),  i.e.,  α*(x) = λ( 1 + log p(x) − (n/2) log(πλ) ).

Then, by integrating π(x, y) over x, we can obtain:

q*(y) = ∫ π(x, y) dx = ∫ p(x)(πλ)^{−n/2} exp( −‖x − y‖²/λ ) dx = (p ∗ φ_λ)(y),

where φ_λ is the density function of N(0, (λ/2)I). Therefore, we conclude that the probability measure Q that minimizes C_λ(P, Q) has density p ∗ φ_λ.
It should be noted that when P in Theorem 3 is a multivariate normal distribution, Q* and P are simultaneously diagonalizable as a direct consequence of the theorem. This is consistent with the result of Corollary 2(1) for minimization when all eigenvalues are fixed.

We can see that the entropy-regularized Kantorovich estimator is the given measure convolved with an isotropic multivariate normal distribution scaled by the regularization parameter λ. This is similar to the idea of prior distributions in the context of Bayesian inference. Applying Theorem 3, the entropy-regularized Kantorovich estimator of the multivariate normal distribution N(µ, Σ) is N(µ, Σ + (λ/2)I).
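A quick sampling check of this observation for a one-dimensional P (the parameter values below are arbitrary): convolving N(µ, σ²) with N(0, λ/2) should produce N(µ, σ² + λ/2).

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.8
mu, var = 1.0, 2.0          # parameters of a 1-D normal P = N(mu, var)
# The entropy-regularized Kantorovich estimator is P convolved with
# N(0, lam/2); for a normal P this gives N(mu, var + lam/2).
x = rng.normal(mu, np.sqrt(var), 200000)          # samples from P
noise = rng.normal(0.0, np.sqrt(lam / 2), 200000)  # convolution kernel samples
y = x + noise               # samples from the convolution
assert abs(np.mean(y) - mu) < 0.02
assert abs(np.var(y) - (var + lam / 2)) < 0.05
```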

Numerical Experiments
In this section, we introduce experiments that show the statistical efficiency of entropy regularization in Gaussian settings. We consider two different setups, estimating covariance matrices (Section 6.1) and the entropy-regularized Wasserstein barycenter (Section 6.2). To obtain the entropy-regularized Wasserstein barycenter, we adopt the Newton-Schulz method and a manifold optimization method, which are explained in Sections 6.3 and 6.4, respectively.

Estimation of Covariance Matrices
We provide a covariance estimation method based on entropy-regularized optimal transport. Let P = N(µ, Σ) be an n-variate normal distribution. We define an entropy-regularized Kantorovich estimator P̂_λ as in Theorem 3. We generate samples from N(µ, Σ) and estimate the mean and covariance matrix. We compare the maximum likelihood estimator P̂_MLE = N(µ̂_MLE, Σ̂_MLE) and P̂_λ with respect to the prediction errors: KL(P, P̂_MLE), KL(P, P̂_λ).
In our experiment, the dimension n is set to 5, 15, 30, and the sample size is set to 60, 120. The experiment proceeds as follows.

1. Generate samples from N(µ, Σ).
2. Compute the maximum likelihood estimates µ̂ and Σ̂ from the samples.
3. Compute the prediction error between Σ and the entropy-regularized minimum Kantorovich estimator of Σ̂.
4. Repeat the above steps 1000 times and obtain a confidence interval of the prediction error.

Table 1 shows the average prediction error of the MLE and the entropy-regularized Kantorovich estimator of covariance matrices from 60 samples from an n-variate normal distribution with the 95% confidence interval. We can see that the prediction error is smaller than that of the maximum likelihood estimator under adequately small λ for n = 15, 30, but not for n = 5. Moreover, the decrease in the prediction error is larger for n = 30 than for n = 15, which indicates that the entropy regularization is effective in a high dimension. On the other hand, Table 2 shows in all cases that the decreases in the prediction error are more moderate than in Table 1. We can see that this is due to the increase in the sample size. Then, we can conclude that the entropy regularization is effective in a high-dimensional setting with a small sample size.
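The experiment above can be sketched as follows, using the estimator of Theorem 3 for normal distributions (the MLE covariance shifted by (λ/2)I); the dimension, sample size, λ, and number of repetitions below are illustrative, not the paper's exact setup:

```python
import numpy as np

def kl_gauss(S_true, S_est):
    # KL(N(0, S_true) || N(0, S_est)) for covariance matrices
    n = len(S_true)
    Si = np.linalg.inv(S_est)
    return 0.5 * (np.trace(Si @ S_true) - n
                  + np.log(np.linalg.det(S_est) / np.linalg.det(S_true)))

rng = np.random.default_rng(1)
n, m, lam = 30, 60, 0.1
S_true = np.eye(n)
err_mle, err_reg = [], []
for _ in range(100):
    X = rng.normal(size=(m, n))         # m samples from N(0, I_n)
    S_mle = X.T @ X / m                 # maximum likelihood covariance estimate
    err_mle.append(kl_gauss(S_true, S_mle))
    # Entropy-regularized Kantorovich estimator of a normal distribution:
    # shift the estimated covariance by lam/2 * I (Theorem 3).
    err_reg.append(kl_gauss(S_true, S_mle + lam / 2 * np.eye(n)))
# In this high-dimensional, small-sample regime, the regularized estimator
# attains a smaller average prediction error than the MLE.
assert np.mean(err_reg) < np.mean(err_mle)
```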

Estimation of the Wasserstein Barycenter
A barycenter with respect to the Wasserstein distance is definable [52] and is widely used for image interpolation and 3D object interpolation tasks with entropy regularization [3,33].
Let {Q_i}_{i=1}^m be a set of probability measures in P(R^n). The barycenter with respect to C_λ (entropy-regularized Wasserstein barycenter) is defined as:

arg min_{P ∈ P(R^n)} (1/m) Σ_{i=1}^m C_λ(P, Q_i).   (78)

Now, we restrict P and {Q_i}_{i=1}^m to be multivariate normal distributions and apply our theorem to illustrate the effect of entropy regularization.
The experiment proceeds as follows. The dimensionality and the sample size were set the same as in the experiments in Section 6.1.

1. Generate m sets of samples from N(µ, Σ).
2. Estimate the mean and covariance matrix from each set of samples.
3. Obtain the barycenter of the m estimated normal distributions.
4. Compute the prediction error between Σ and the barycenter obtained in step 3.
5. Repeat the above steps 100 times and obtain a confidence interval of the prediction error.
We show the results for several values of the regularization parameter λ in Tables 3 and 4. A decrease in the prediction error can be seen in Table 3 for n = 30, as in Tables 1 and 2. However, because the computation of the entropy-regularized Wasserstein barycenter uses more data than that of the minimum Kantorovich estimator, the decrease in the prediction error is milder. The entropy-regularized Kantorovich estimator is a special case of the entropy-regularized Wasserstein barycenter (78) for m = 1. Our experiments show that the appropriate range of λ that decreases the prediction error depends on m and becomes narrower as m increases. In addition, we note that there is a small decrease in the prediction error in Table 4 for n = 30.

We use a gradient descent method to compute the entropy-regularized barycenter. Applying the gradient descent method to a loss function defined by the Wasserstein distance was proposed in [4]. This idea is extendable to entropy-regularized optimal transport. The detailed algorithm is shown below. Because C_λ(P, Q) is a function of a positive definite matrix, we used a manifold gradient descent algorithm on the manifold of positive definite matrices.
We review the manifold gradient descent algorithm used in our numerical experiment. Let Sym⁺(n) be the manifold of n-dimensional positive definite matrices. We require a gradient operator and an inner product on Sym⁺(n) for the gradient descent algorithm. In this paper, we use the following inner product from [44], Chapter 6. For a fixed X ∈ Sym⁺(n), we define an inner product on the tangent space at X as:

g_X(Y, Z) = tr( Y X^{−1} Z X^{−1} ),   (79)

where Y, Z are symmetric matrices. Equation (79) is the best choice in terms of the convergence speed, according to [53]. Let f : Sym⁺(n) → R be a differentiable matrix function. Then, the gradient of f induced by (79) is:

grad f(X) = X ∇f(X) X,   (80)

where ∇f(X) denotes the Euclidean gradient. We consider the updating step after obtaining the gradient of f. grad f(X) is an element of the tangent space, and we have to map it back to Sym⁺(n). This map is called a retraction. It is known that the Riemannian metric g_X leads to the following retraction:

exp_X(ξ) = X Exp( X^{−1} ξ ),

where Exp is the matrix exponential. Then, the corresponding gradient descent method becomes as shown in Algorithm 1.
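A minimal sketch of one such gradient step, assuming the metric gradient X∇f(X)X and the retraction X Exp(X^{−1}ξ) described above, applied to a toy objective (the function and variable names are ours, not the paper's Algorithm 1):

```python
import numpy as np
from scipy.linalg import expm

def spd_gradient_step(X, euclid_grad, lr):
    """One Riemannian gradient step on the SPD manifold: the metric gradient
    is X (df/dX) X and the retraction is X |-> X Exp(X^{-1} xi)."""
    riem_grad = X @ euclid_grad @ X
    return X @ expm(-lr * np.linalg.inv(X) @ riem_grad)

# Toy objective f(X) = tr(S X) - log det X, minimized at X = S^{-1};
# its Euclidean gradient is S - X^{-1}.
S = np.array([[2.0, 0.3], [0.3, 1.0]])
X = np.eye(2)
for _ in range(200):
    X = spd_gradient_step(X, S - np.linalg.inv(X), lr=0.1)
assert np.allclose(X, np.linalg.inv(S), atol=1e-4)
```

The iterates stay symmetric positive definite by construction, which is precisely why the retraction is preferred over a plain additive update.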

Approximate the Matrix Square Root
To compute the gradient of the square root of a matrix in the objective function, we approximate it using the Newton-Schulz method [54], which can be implemented by matrix operations, as shown in Algorithm 2. It is amenable to automatic differentiation, so that we can easily apply the gradient descent method in our algorithm.
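A sketch of the coupled Newton-Schulz iteration (our own implementation for illustration, not a transcription of the paper's Algorithm 2):

```python
import numpy as np

def newton_schulz_sqrt(A, n_iter=30):
    """Approximate the square root of an SPD matrix A by the coupled
    Newton-Schulz iteration; it uses only matrix products, which makes it
    friendly to automatic differentiation."""
    n = len(A)
    norm = np.linalg.norm(A)          # rescale so the iteration converges
    Y = A / norm
    Z = np.eye(n)
    for _ in range(n_iter):
        T = 0.5 * (3.0 * np.eye(n) - Z @ Y)
        Y = Y @ T                     # Y converges to (A/norm)^{1/2}
        Z = T @ Z                     # Z converges to (A/norm)^{-1/2}
    return Y * np.sqrt(norm)

A = np.array([[4.0, 1.0], [1.0, 3.0]])
R = newton_schulz_sqrt(A)
assert np.allclose(R @ R, A, atol=1e-6)
```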

Conclusions and Future Work
In this paper, we studied entropy-regularized optimal transport and derived several results. We summarize them as follows and add notes on future work.

• We obtained the explicit form of the entropy-regularized optimal transport between two multivariate normal distributions and derived Corollaries 1 and 2, which clarify the properties of the optimal coupling. Furthermore, we demonstrated experimentally how entropy regularization affects the Wasserstein distance, the optimal coupling, and the geometric structure of multivariate normal distributions. Overall, the properties of the optimal coupling were revealed both theoretically and experimentally. We expect that the explicit formula can serve as a replacement for the existing methodology using the (nonregularized) Wasserstein distance between normal distributions (for example, [4,5]).

• Theorem 2 derives the explicit form of the optimal coupling of the Tsallis entropy-regularized optimal transport between multivariate q-normal distributions. This optimal coupling is itself a multivariate q-normal distribution, and the obtained result is analogous to that for the normal distribution. We believe that this result can be extended to other elliptical distribution families.

• The entropy-regularized Kantorovich estimator of a probability measure in P_2(R^n) is the convolution of its own density function with an isotropic multivariate normal density. Our experiments show that both the entropy-regularized Kantorovich estimator and the Wasserstein barycenter of multivariate normal distributions outperform the maximum likelihood estimator in terms of the prediction error for adequately selected λ in a high-dimensional, small-sample setting. As future work, we want to show the efficiency of entropy regularization using real data.

Data Availability Statement: All the data used are artificial and generated by pseudo-random numbers.

Conflicts of Interest:
The authors declare no conflict of interest.