Article

Towards Analysis of Covariance Descriptors via Bures–Wasserstein Distance

1 Department of Mathematics and Statistics, Auburn University, Auburn, AL 36849, USA
2 Institute of Statistics and Data Science, National Taiwan University, Taipei 106319, Taiwan
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(13), 2157; https://doi.org/10.3390/math13132157
Submission received: 6 June 2025 / Revised: 27 June 2025 / Accepted: 27 June 2025 / Published: 1 July 2025

Abstract

A brain–computer interface (BCI) provides a direct communication pathway between the human brain and external devices, enabling users to control them through thought. It records brain signals and classifies them into specific commands for external devices. Among various classifiers used in BCI, those directly classifying covariance matrices using Riemannian geometry find broad applications not only in BCI, but also in diverse fields such as computer vision, natural language processing, domain adaptation, and remote sensing. However, the existing Riemannian-based methods exhibit limitations, including time-intensive computations, susceptibility to disturbances, and convergence challenges in scenarios involving high-dimensional matrices. In this paper, we tackle these issues by introducing the Bures–Wasserstein (BW) distance for covariance matrix analysis and demonstrating its advantages in BCI applications. Both theoretical and computational aspects of the BW distance are investigated, along with algorithms for Fréchet mean (or barycenter) estimation using the BW distance. Extensive simulations are conducted to evaluate the effectiveness, efficiency, and robustness of the BW distance and barycenter. Additionally, by integrating the BW barycenter into the Minimum Distance to Riemannian Mean classifier, we showcase its superior classification performance through evaluations on five real datasets.

1. Introduction

A brain–computer interface (BCI) builds a bridge between the human brain and external devices by translating brain signals into instructions for external devices to execute the user’s imagined actions. The core of a BCI system is the classifier that maps the brain signals to one of the commands for the external devices. These brain signals are typically captured through multi-channel electroencephalography (EEG) [1,2], functional magnetic resonance imaging (fMRI) [3,4], and other neuroimaging techniques, which makes BCI data mostly spatial and temporal in nature. To capture the spatial and temporal patterns of BCI data, covariance matrices (also called covariance descriptors) are extensively employed, and BCI classifiers are trained to directly classify covariance matrices in the Riemannian-geometry-based framework [5,6,7]. In addition to BCI, covariance matrices find widespread application in various domains for representing complex data structures, such as computer vision [8,9,10], natural language processing [11,12], domain adaptation [13,14], remote sensing [15,16], and many others [17,18,19].
Earlier research has frequently treated covariance matrices as positive definite (PD) matrices and analyzed them on the Riemannian manifold $P_n$ of $n \times n$ PD matrices using metrics such as the Affine-Invariant (AI) Riemannian metric, the Log-Euclidean metric, etc. Among these, the AI metric is particularly popular due to its favorable mathematical properties, so we provide a brief introduction to the key concepts related to the AI distance here. For two PD matrices A and B, the AI distance is
$$d_{AI}(A,B) = \big\| \log(A^{-1/2} B A^{-1/2}) \big\|_F = \Big( \sum_{i=1}^n \log^2 \lambda_i(A^{-1}B) \Big)^{1/2},$$
where $\|\cdot\|_F$ denotes the Frobenius norm and $\lambda_i(A^{-1}B)$, $i = 1, \dots, n$, are the eigenvalues of $A^{-1}B$. Any two points on $P_n$ are joined by the following geodesic:
$$A \,\sharp_t\, B = A^{1/2} (A^{-1/2} B A^{-1/2})^t A^{1/2}, \quad t \in [0,1].$$
Each point $A \,\sharp_t\, B$ is also called the t-weighted geometric mean of A and B. In particular, the mid-point of the geodesic is denoted as
$$A \,\sharp\, B = A \,\sharp_{1/2}\, B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}.$$
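For concreteness, the following is a minimal sketch (not taken from the paper's released code) of how the AI distance and the AI geodesic above can be computed numerically; the generalized eigenvalue route follows formula (1).

```python
# Sketch of the AI distance (1) and AI geodesic (2) for symmetric PD matrices.
import numpy as np
from scipy.linalg import eigh, fractional_matrix_power

def ai_distance(A, B):
    # d_AI(A, B) = ( sum_i log^2 lambda_i(A^{-1} B) )^{1/2}, via generalized eigenvalues
    lam = eigh(B, A, eigvals_only=True)          # eigenvalues of A^{-1} B
    return np.sqrt(np.sum(np.log(lam) ** 2))

def ai_geodesic(A, B, t):
    # A #_t B = A^{1/2} (A^{-1/2} B A^{-1/2})^t A^{1/2}
    A_half = fractional_matrix_power(A, 0.5)
    A_neg_half = fractional_matrix_power(A, -0.5)
    return A_half @ fractional_matrix_power(A_neg_half @ B @ A_neg_half, t) @ A_half

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5)); A = X @ X.T + 5 * np.eye(5)
Y = rng.standard_normal((5, 5)); B = Y @ Y.T + 5 * np.eye(5)
# the geodesic mid-point sits halfway between A and B in AI distance
print(np.isclose(ai_distance(A, ai_geodesic(A, B, 0.5)), 0.5 * ai_distance(A, B)))
```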
An essential quantity for analyzing a set of matrices is the Fréchet mean, also known as the barycenter, which measures the central tendency of a set of matrices on the manifold. The barycenter not only allows the quantification of the spread-out pattern of matrices in terms of variance, but also facilitates the construction of classifiers for matrices. Various statistical and computational methodologies utilizing the AI metric have been developed [5,6,7,20,21,22,23,24] for the analysis of PD matrices (e.g., covariance matrices), including tasks such as barycenter estimation and matrix classification. However, computing the AI distance involves the inverse of a matrix and its eigenvalues, as shown in (1), which can be time-consuming and numerically unstable, especially for large matrices. Moreover, rather than strictly adhering to positive definiteness, a covariance matrix is in fact positive semi-definite (PSD). As the dimension of the matrix increases, the likelihood of encountering very small eigenvalues increases significantly, rendering the use of the AI distance impractical. To address this issue, researchers typically regularize the input covariance matrices using techniques such as the one proposed by Ledoit and Wolf [25]. However, in the context of real-time BCI devices, computational efficiency is crucial: the computation of the AI distance is time-consuming, and additional processing time is required for matrix regularization.
To address the aforementioned limitations, this paper introduces the Bures–Wasserstein (BW) distance, an efficient and geometrically meaningful metric for the analysis of PSD matrices. Unlike the AI metric, the definitions of the BW distance, the BW geodesic, and the BW mean in (4)–(5) extend to the closure of $P_n$, i.e., $\overline{P}_n$, the set of $n \times n$ PSD matrices, by viewing $\overline{P}_n = M_n / U_n$ since $\overline{GL}_n = M_n$:
$$d_{BW}(A,B) = \Big[ \operatorname{tr}(A+B) - 2\operatorname{tr}\big( (A^{1/2} B A^{1/2})^{1/2} \big) \Big]^{1/2} = \Big[ \operatorname{tr}(A+B) - 2\operatorname{tr}\big( (AB)^{1/2} \big) \Big]^{1/2},$$
$$A \,\diamond_t\, B = (1-t)^2 A + t^2 B + t(1-t)\big[ (AB)^{1/2} + (BA)^{1/2} \big], \quad t \in [0,1].$$
The Wasserstein mean of A and B is denoted as $A \,\diamond\, B = A \,\diamond_{1/2}\, B$. The BW metric was first introduced by Rao as a Riemannian metric on the space of probability measures [26]. In the case of zero-mean multivariate Gaussian distributions, the BW distance measures the cost of optimally transporting one distribution into another, where the cost reflects how much effort it takes to shift and reshape their covariance matrices. Compared to the AI distance and other Riemannian metrics, the BW distance has a clear probabilistic interpretation and computational efficiency, and yields interpretable interpolation paths and barycenters. These advantages make it widely applicable in fields such as signal processing, brain connectivity analysis, optimal transport theory [27,28], optimization, and quantum information theory [29,30,31].
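A minimal sketch of the BW distance (4) and BW geodesic (5), assuming real symmetric PSD inputs and using an eigendecomposition-based matrix square root, is given below; this is an illustrative implementation, not the authors' released code.

```python
# Sketch of the BW distance (4) and BW geodesic (5).
import numpy as np
from scipy.linalg import sqrtm

def psd_sqrt(A):
    # symmetric square root via eigendecomposition; tiny negative eigenvalues clipped to 0
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def bw_distance(A, B):
    # d_BW(A, B) = [ tr(A + B) - 2 tr((A^{1/2} B A^{1/2})^{1/2}) ]^{1/2}
    rA = psd_sqrt(A)
    cross = psd_sqrt(rA @ B @ rA)
    return np.sqrt(max(np.trace(A) + np.trace(B) - 2.0 * np.trace(cross), 0.0))

def bw_geodesic(A, B, t):
    # A <>_t B = (1-t)^2 A + t^2 B + t(1-t) [ (AB)^{1/2} + (BA)^{1/2} ]
    cross = np.real(sqrtm(A @ B))                # principal square root of AB
    return (1 - t) ** 2 * A + t ** 2 * B + t * (1 - t) * (cross + cross.T)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6)); A = X @ X.T
Y = rng.standard_normal((6, 6)); B = Y @ Y.T
# consistency with Theorem 6 below: d_BW(A, A <>_t B) = t * d_BW(A, B) for t in [0, 1]
print(np.isclose(bw_distance(A, bw_geodesic(A, B, 0.3)), 0.3 * bw_distance(A, B)))
```

No matrix inverse or log-eigenvalue computation is needed, which is the source of the efficiency and stability advantages discussed below.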
In this paper, we advocate for the direct analysis of covariance matrices on the complete metric space $\overline{P}_n$ of PSD matrices coupled with the BW distance. We first establish the mathematical foundation for the BW distance by studying its mathematical properties and deriving the retraction maps on $\overline{P}_n$. Subsequently, we propose three algorithms to estimate the BW barycenter of a set of matrices. Extensive simulations are performed to assess the efficacy and robustness of both the BW distance and the BW barycenter. By integrating the BW distance into the Minimum Distance to Riemannian Mean classifier, we further validate the superiority of the BW distance over the AI distance using five real datasets.
The subsequent sections are structured as follows. In Section 2, we establish the mathematical properties of BW distance and the retraction maps, and propose three algorithms for estimating the barycenter of matrices. In Section 3, we present the simulation results, discussing the efficiency and robustness of BW distance and barycenter estimation, and providing a comprehensive comparison with the widely adopted AI distance. In Section 4, the classification comparison on five real datasets is discussed. In Section 5, we summarize our contributions and conclude this paper.

2. Methodology

In this section, we first study the mathematical properties of the BW metric, and then introduce three algorithms to estimate the barycenter of PSD matrices on $\overline{P}_n$ under the BW metric. With the estimated barycenter, we further illustrate how to incorporate the barycenter in handling classification problems.

2.1. Mathematical Properties of the BW Metric

The mathematical properties of the AI metric on $P_n$ and its applications to data science have drawn extensive investigation. In comparison, the mathematical properties of the BW metric and its applications have received less attention (e.g., [32,33,34,35,36,37,38,39,40,41]). Here, we discuss some properties of the BW metric on $\overline{P}_n$ to enrich our understanding of the metric and the barycenter, and compare them with the properties of the AI metric to highlight the differences between the two metrics.
Let $M_n$ (resp. $GL_n$, $U_n$) denote the set of $n \times n$ real general (resp. nonsingular, orthogonal) matrices. Given $X \in M_n$, let $|X| = (X^*X)^{1/2}$ be the PSD part in the polar decomposition of X. Two $n \times n$ Hermitian matrices A and B satisfy the Löwner order
$$A \le B \iff B - A \text{ is PSD}.$$
Denote the t-arithmetic mean
$$A \,\nabla_t\, B = (1-t)A + tB.$$
Let $\|X\|_F = \sqrt{\operatorname{tr}(X^*X)}$ be the Frobenius norm of $X \in M_n$. The following are basic properties of the AI mean and the AI geodesic of two PD matrices [32,42,43].
Theorem 1.
Let $A, B, C, D \in P_n$ and let $s, u, t \in [0,1]$. The following are satisfied.
1. $A \,\sharp_t\, B = B \,\sharp_{1-t}\, A$.
2. $(A \,\sharp_t\, B)^{-1} = A^{-1} \,\sharp_t\, B^{-1}$.
3. $(aA) \,\sharp_t\, (bB) = a^{1-t} b^t (A \,\sharp_t\, B)$ for any $a, b > 0$.
4. $\det(A \,\sharp_t\, B) = (\det A)^{1-t} (\det B)^t$.
5. $M (A \,\sharp_t\, B) M^* = (M A M^*) \,\sharp_t\, (M B M^*)$ for nonsingular M.
6. $(A \,\sharp_s\, B) \,\sharp_t\, (A \,\sharp_u\, B) = A \,\sharp_{(1-t)s + tu}\, B$.
7. $A \,\sharp\, (B A^{-1} B) = B$.
8. In the Löwner order, if $A \le C$ and $B \le D$, then $A \,\sharp_t\, B \le C \,\sharp_t\, D$.
9. $(A^{-1} \,\nabla_t\, B^{-1})^{-1} \le A \,\sharp_t\, B \le A \,\nabla_t\, B$.
Some of the above properties can be extended to the barycenter of multiple PD matrices using the AI metric. Moreover, the following result holds.
Theorem 2.
Let $C, A_1, \dots, A_m \in P_n$, $t, \omega_1, \dots, \omega_m \in [0,1]$, and $\sum_{j=1}^m \omega_j = 1$. The following Löwner order relation holds:
$$C \,\sharp_t\, \Big( \sum_{j=1}^m \omega_j A_j \Big) \;\ge\; \sum_{j=1}^m \omega_j \, (C \,\sharp_t\, A_j).$$
Proof. 
The Löwner order inequality (8) obviously holds for $t = 0, 1$. Now assume that $t \in (0,1)$. By (2), the relation (8) is equivalent to
$$\Big( C^{-1/2} \sum_{j=1}^m \omega_j A_j \, C^{-1/2} \Big)^t = \Big( \sum_{j=1}^m \omega_j \, C^{-1/2} A_j C^{-1/2} \Big)^t \;\ge\; \sum_{j=1}^m \omega_j \big( C^{-1/2} A_j C^{-1/2} \big)^t,$$
which can be derived from the concavity of the function $A \mapsto A^t$ for $t \in (0,1)$ (see [32], Theorem 4.2.3).    □
Some basic properties of the BW metric are studied in [36,38,39], which can be stated on $\overline{P}_n$ as follows.
Theorem 3.
Let $A, B \in \overline{P}_n$ and let $s, t, u \in [0,1]$. The following are satisfied.
1. $A \,\diamond_t\, B = B \,\diamond_{1-t}\, A$.
2. When $A, B \in P_n$, $(A \,\diamond_t\, B)^{-1} = A^{-1} \,\diamond_t\, B^{-1}$ if and only if $A = B$.
3. $(aA) \,\diamond_t\, (aB) = a (A \,\diamond_t\, B)$ for any $a > 0$.
4. $(A \,\diamond_s\, B) \,\diamond_t\, (A \,\diamond_u\, B) = A \,\diamond_{(1-t)s + tu}\, B$.
5. $\det(A \,\diamond_t\, B) \ge (\det A)^{1-t} (\det B)^t$.
6. In the Löwner order, $A \,\diamond_t\, B \le A \,\nabla_t\, B$.
In the case $A, B \in P_n$, besides the expression (5), the BW geodesic can be expressed through the AI geodesic as follows (see [36], Lemma 2.4):
$$A \,\diamond_t\, B = \big[ I \,\nabla_t\, (A^{-1} \,\sharp\, B) \big] \, A \, \big[ I \,\nabla_t\, (A^{-1} \,\sharp\, B) \big].$$
The AI geodesic and the BW geodesic satisfy order relations. Given an $n \times n$ Hermitian matrix A, let
$$\lambda(A) = (\lambda_1(A), \dots, \lambda_n(A))$$
denote the n-tuple of eigenvalues of A in descending order. The weak log-majorization order between two matrices $A, B \in P_n$ is defined as
$$A \prec_{w\log} B \iff \prod_{i=1}^k \lambda_i(A) \le \prod_{i=1}^k \lambda_i(B), \quad k = 1, 2, \dots, n.$$
It is known that for all $t \in [0,1]$, $A \,\sharp_t\, B$ is weakly log-majorized by $A \,\diamond_t\, B$ [44].
A fundamental difference between the AI metric, given by (1) and (2), and the BW metric, defined by (4) and (5), is their scaling behavior: for any scalar $c > 0$, matrices $A, B \in P_n$, and $t \in [0,1]$,
$$d_{AI}(cA, cB) = d_{AI}(A,B), \qquad (cA) \,\sharp_t\, (cB) = c\,(A \,\sharp_t\, B),$$
$$d_{BW}(cA, cB) = c^{1/2}\, d_{BW}(A,B), \qquad (cA) \,\diamond_t\, (cB) = c\,(A \,\diamond_t\, B).$$
The relations in (13) indicate that while the AI geodesic between cA and cB is scaled by the factor c, the AI distance remains invariant under this scaling. Such properties, especially when $c \gg 1$ or $c \ll 1$, may not align well with real-world scenarios involving covariance matrices. In contrast, the relations in (14) suggest that the BW metric provides a more realistic representation when scaling is involved. Moreover, the factor $c^{1/2}$ in the BW distance between cA and cB hints that $d_{BW}(A,B)$ may be a normed-space distance between $A^{1/2}$ and $B^{1/2}$. Indeed, Bhatia, Jain, and Lim showed ([33], Theorem 1) that for $A, B \in P_n$:
$$d_{BW}(A,B) = \inf_{U \in U_n} \big\| A^{1/2} - B^{1/2} U \big\|_F.$$
We extend the formula (5) for $A \,\diamond_t\, B$ to all $A, B \in \overline{P}_n$ and $t \in \mathbb{R}$ in the following discussion.
Theorem 4.
For $A, B \in \overline{P}_n$ and $t \in \mathbb{R}$,
1. $A \,\diamond_t\, B = B \,\diamond_{1-t}\, A = |(1-t) A^{1/2} + t U^* B^{1/2}|^2$, where U is a certain orthogonal matrix occurring in a polar decomposition of $B^{1/2} A^{1/2}$, i.e., $B^{1/2} A^{1/2} = U |B^{1/2} A^{1/2}| = U (A^{1/2} B A^{1/2})^{1/2}$.
2. $|(A \,\diamond_t\, B)^{1/2} A^{1/2}| = |(1-t) A + t |B^{1/2} A^{1/2}||$. In particular, if $(1-t) A + t |B^{1/2} A^{1/2}|$ is PSD (such as when $t \in [0,1]$), then $|(A \,\diamond_t\, B)^{1/2} A^{1/2}| = (1-t) A + t |B^{1/2} A^{1/2}|$.
Theorem 5.
Let $A, B \in \overline{P}_n$.
1. If $r, t \in \mathbb{R}$ and $(1-t) A + t |B^{1/2} A^{1/2}| \in \overline{P}_n$, then
$$A \,\diamond_r\, (A \,\diamond_t\, B) = A \,\diamond_{rt}\, B.$$
2. If $r, s, t \in \mathbb{R}$ satisfy $(1-x) A + x |B^{1/2} A^{1/2}| \in \overline{P}_n$ for $x \in \{s, t\}$, then
$$(A \,\diamond_s\, B) \,\diamond_r\, (A \,\diamond_t\, B) = A \,\diamond_{(1-r)s + rt}\, B.$$
  • Let U be an orthogonal matrix in the polar decomposition of $B^{1/2} A^{1/2}$. When $A, B \in P_n$, Bhatia, Jain, and Lim showed (see [33], Theorem 1) that $d_{BW}(A,B) = \| A^{1/2} - B^{1/2} U \|_F$, which can be extended to the following property of the BW distance.
Theorem 6.
For $A, B \in \overline{P}_n$ and $t \in \mathbb{R}$, if $(1-t) A + t |B^{1/2} A^{1/2}|$ is PSD (e.g., when $t \in [0,1]$), then
$$d_{BW}(A, A \,\diamond_t\, B) = |t|\, d_{BW}(A, B).$$
  • The tangent space at a PD matrix $A \in P_n$ can be identified with the space $H_n$ of $n \times n$ real symmetric matrices. The logarithm function projects a neighborhood of A onto a neighborhood of 0 in $H_n$ such that for $B \in P_n$: $\log_A B = \frac{d}{dt}\big|_{t=0} (A \,\diamond_t\, B)$. The inverse of the logarithm function is the exponential function, which maps a neighborhood of 0 in $H_n$ to a neighborhood of A in $P_n$. Both functions, illustrated in Figure 1A, can be extended to the boundary of $\overline{P}_n$ under mild conditions (for a PSD matrix A, the domain of the exponential function at A only covers part of $H_n$; see Theorem 8). With the logarithm and exponential functions, many methods that work in the Euclidean tangent space become applicable on the manifold. Therefore, the following derivation of the log and exp functions is imperative for further analysis.
Lemma 1.
For $A, B \in \overline{P}_n$, $X \in H_n$ such that $\exp_A X$ is well-defined, $U \in U_n$, and $t \in \mathbb{R}$,
$$d_{BW}(U^* A U, U^* B U) = d_{BW}(A, B), \qquad (U^* A U) \,\diamond_t\, (U^* B U) = U^* (A \,\diamond_t\, B) U,$$
$$\log_{U^* A U}(U^* B U) = U^* (\log_A B) U, \qquad \exp_{U^* A U}(U^* X U) = U^* (\exp_A X) U.$$
  • The spectral decomposition implies that every $A \in \overline{P}_n$ can be written as $A = U \Lambda U^*$, in which $U \in U_n$ is real orthogonal and $\Lambda$ is nonnegative diagonal with the eigenvalues of A as its diagonal entries. Lemma A1 in Appendix A suggests that the BW metric in the vicinity of A can be effectively transformed to correspond with that around the diagonal matrix $\Lambda$. This transformation plays a pivotal role in simplifying the computations that follow.
The log function on $P_n$ under the BW metric has been explored in the existing literature [33,37,40]. Here, we add the singular case and an approximation of $\log_A(A + tX)$ when $tX$ is near 0, as follows.
Theorem 7.
For any $A, B \in \overline{P}_n$, $X \in H_n$, and $t \in \mathbb{R}$ sufficiently close to 0, we have
$$\log_A(B) = (AB)^{1/2} + (BA)^{1/2} - 2A$$
$$\qquad\;\; = A (A^{-1} \,\sharp\, B) + (A^{-1} \,\sharp\, B) A - 2A \quad (\text{if } A \in P_n),$$
$$\log_A(A + tX) = tX + O(t^2).$$
  • The exponential map under the BW metric has been described in [35,41]. In this work, we present a succinct formulation of the exponential map, explicitly delineating its precise domain. We also provide an approximation of $\exp_A(X)$ when X is a Hermitian matrix near 0. For matrices $A = [a_{ij}],\, B = [b_{ij}] \in M_n$, let $A \circ B = [a_{ij} b_{ij}] \in M_n$ denote the Hadamard product of A and B.
Lemma 2.
Let $A = \operatorname{diag}(\lambda_1, \dots, \lambda_n) \in P_n$ be a positive diagonal matrix. Denote $W := \big[ \tfrac{1}{\lambda_i + \lambda_j} \big]_{n \times n}$. Then, for every Hermitian matrix $X \in H_n$ such that $I_n + W \circ X$ is PSD,
$$\exp_A(X) = A + X + (W \circ X) A (W \circ X).$$
In particular, for $t \in \mathbb{R}$ sufficiently close to 0, $\exp_A(tX) = A + tX + O(t^2)$.
  • Similarly, the exponential function and its approximation derived in Lemma A2 for the case of A being a positive diagonal matrix can be extended to the general case.
Theorem 8.
Suppose $A \in P_n$ has the spectral decomposition $A = U \Lambda U^*$, where U is an orthogonal matrix and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. Denote $W = \big[ \tfrac{1}{\lambda_i + \lambda_j} \big]_{n \times n}$. Then, for every Hermitian matrix $X \in H_n$ such that $I_n + W \circ X_U$ is PSD, where $X_U := U^* X U$, we have
$$\exp_A(X) = A + X + U (W \circ X_U) \Lambda (W \circ X_U) U^*.$$
In particular, when $t \in \mathbb{R}$ is sufficiently close to zero, we have
$$\exp_A(tX) = A + tX + O(t^2).$$
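The log and exp maps of Theorems 7 and 8 translate directly into code. The following is a minimal sketch (my own implementation, offered as an illustration of the formulas rather than the authors' code), valid on $P_n$ and, for the exp map, whenever $I_n + W \circ X_U$ is PSD.

```python
# Sketch of the BW logarithm (Theorem 7) and exponential (Theorem 8) at a PD base point A.
import numpy as np
from scipy.linalg import sqrtm

def bw_log(A, B):
    # log_A(B) = (AB)^{1/2} + (BA)^{1/2} - 2A
    cross = np.real(sqrtm(A @ B))
    return cross + cross.T - 2.0 * A

def bw_exp(A, X):
    # exp_A(X) = A + X + U (W o X_U) Lambda (W o X_U) U^T, with A = U Lambda U^T,
    # W = [1/(lambda_i + lambda_j)], X_U = U^T X U; valid when I + W o X_U is PSD
    lam, U = np.linalg.eigh(A)
    W = 1.0 / (lam[:, None] + lam[None, :])
    X_U = U.T @ X @ U
    WX = W * X_U                                  # Hadamard product W o X_U
    return A + X + U @ ((WX * lam[None, :]) @ WX) @ U.T

rng = np.random.default_rng(1)
X0 = rng.standard_normal((4, 4)); A = X0 @ X0.T + 4 * np.eye(4)
Y0 = rng.standard_normal((4, 4)); B = Y0 @ Y0.T + 4 * np.eye(4)
# exp and log should invert each other on P_n (up to numerical error)
print(np.allclose(bw_exp(A, bw_log(A, B)), B, atol=1e-6))
```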

2.2. Barycenter Estimation with BW Distance

When characterizing the distribution of a set of PSD matrices, two crucial properties to consider are the central tendency and the variation. Given a set of matrices $A_1, \dots, A_m \in \overline{P}_n$, their centroid can be estimated using the Fréchet mean with the BW distance:
$$\bar{A}(A_1, \dots, A_m) = \arg\min_{X \in \overline{P}_n} \sum_{i=1}^m d_{BW}^2(A_i, X).$$
The Fréchet mean is also known as the Karcher mean and the barycenter in the literature [33,45]. With the barycenter, we can further quantify the dispersion of the matrices around their center via the Fréchet variance:
$$\sigma^2 = \frac{1}{m} \sum_{i=1}^m d_{BW}^2(A_i, \bar{A}).$$
To estimate the barycenter, we propose three methods: the Inductive Mean Algorithm, the Projection Mean Algorithm, and the Cheap Mean Algorithm. For the AI metric, we always assume that $A_1, \dots, A_m \in P_n$ in the following algorithms. For the BW metric, we choose $A_1, \dots, A_m \in \overline{P}_n$ in the Inductive Mean Algorithm and the Projection Mean Algorithm. The error tolerance in the following algorithms is denoted as $\varepsilon$.
  • The Inductive Mean Algorithm (Algorithm 1) estimates the barycenter of $A_1, \dots, A_m \in \overline{P}_n$ by moving along the geodesics connecting points on $\overline{P}_n$, as illustrated in Figure 1B with four data points as an example. The initial barycenter is set to the first data point, $S^{(1)} := A_1$, and the barycenter is then updated as the mid-point of the geodesic connecting $S^{(1)}$ and $A_2$, i.e., $S^{(2)} := S^{(1)} \,\diamond_{1/2}\, A_2$. In the iteration process, given $S^{(k-1)}$, the updated barycenter $S^{(k)}$ is the $\tfrac{1}{k}$-th point along the geodesic connecting $S^{(k-1)}$ and $A_k$, with $A_k := A_{k-m}$ for all $k > m$.
Algorithm 1: Inductive Mean Algorithm
[Pseudocode for Algorithm 1 appears as an image in the published article.]
By design, the points in $\{S^{(k)}\}_{k \in \mathbb{N}}$ are confined within the compact region bounded by the geodesics connecting $A_1, \dots, A_m$, and the convergence point of $\{S^{(k)}\}_{k \in \mathbb{N}}$ is unique. Therefore, the Inductive Mean Algorithm is valid for estimating the barycenter of PSD matrices. With the AI metric, the convergence of the algorithm to the corresponding Fréchet mean was proved in ([46], Theorem 5.2). With the BW metric, the computational process of the algorithm is straightforward and not sensitive to the number of matrices, as long as the maximal distance between any two matrices in $A_1, \dots, A_m$ is bounded. The convergence rate is uniform but relatively slow compared to the following two algorithms.
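A minimal sketch of this procedure under the BW metric is given below; it is my reading of the description above (the published pseudocode is an image), run for a fixed number of sweeps through the data since convergence is slow.

```python
# Sketch of the Inductive Mean Algorithm with the BW geodesic.
import numpy as np
from scipy.linalg import sqrtm

def bw_geodesic(A, B, t):
    # A <>_t B = (1-t)^2 A + t^2 B + t(1-t) [ (AB)^{1/2} + (BA)^{1/2} ]
    cross = np.real(sqrtm(A @ B))
    return (1 - t) ** 2 * A + t ** 2 * B + t * (1 - t) * (cross + cross.T)

def inductive_mean(mats, n_sweeps=50):
    # S^(1) = A_1; S^(k) = S^(k-1) <>_{1/k} A_k, cycling through the inputs (A_k := A_{k-m})
    m = len(mats)
    S = mats[0].copy()
    for k in range(2, n_sweeps * m + 1):
        S = bw_geodesic(S, mats[(k - 1) % m], 1.0 / k)
    return S
```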
  • The Projection Mean Algorithm (Algorithm 2) exploits the fact that the arithmetic center of the projections of $A_1, \dots, A_m$ onto the tangent space at the barycenter C is exactly the projection of C (i.e., $0 \in H_n$). Inspired by this, the iteration starts by projecting the data points from $\overline{P}_n$ onto the tangent space at the arithmetic center of the original data. Then, the arithmetic mean of the tangent vectors is computed and projected back to $\overline{P}_n$ as the updated barycenter. The iteration stops when the distance between the estimated barycenters obtained in two consecutive iterations is less than the preset error tolerance $\varepsilon$. The details of the algorithm are summarized below.
Algorithm 2: Projection Mean Algorithm
[Pseudocode for Algorithm 2 appears as an image in the published article.]
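A minimal sketch of the Projection Mean iteration is given below, written in the equivalent fixed-point form established in Theorem 9 further down (assuming PD iterates, which holds when at least one input is PD); it is an illustrative implementation rather than the authors' released code.

```python
# Sketch of the Projection Mean Algorithm via the fixed-point update of Theorem 9:
# S_{k+1} = S_k^{-1/2} [ (1/m) sum_j (S_k^{1/2} A_j S_k^{1/2})^{1/2} ]^2 S_k^{-1/2}.
import numpy as np

def psd_sqrt(A):
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def psd_inv_sqrt(A, eps=1e-12):
    w, V = np.linalg.eigh(A)
    return (V * (1.0 / np.sqrt(np.maximum(w, eps)))) @ V.T

def projection_mean(mats, tol=1e-6, max_iter=200):
    S = sum(mats) / len(mats)                    # S^(0): the arithmetic mean
    for _ in range(max_iter):
        rS, irS = psd_sqrt(S), psd_inv_sqrt(S)
        T = sum(psd_sqrt(rS @ A @ rS) for A in mats) / len(mats)
        S_new = irS @ T @ T @ irS
        if np.linalg.norm(S_new - S, 'fro') < tol:
            return S_new
        S = S_new
    return S
```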
The convergence of the Projection Mean Algorithm has been proved for the Wasserstein metric between probability measures ([47], Theorem 4.2) using a fixed-point argument. Here, we derive order relations on the sequence $\{S^{(k)}\}_{k \in \mathbb{N}}$ and give a new proof of the convergence of the Projection Mean Algorithm for the BW metric.
It is well known that the Wasserstein barycenter $S := \bar{A}(A_1, \dots, A_m)$ is the unique matrix in $P_n$ that satisfies the equation
$$I = \frac{1}{m} \sum_{j=1}^m \big[ S^{-1} \,\sharp\, A_j \big].$$
We have the following property of the Projection Mean Algorithm.
Theorem 9.
For $A_1, \dots, A_m \in \overline{P}_n$ with at least one $A_j \in P_n$, the sequence $\{S^{(k)}\}_{k \in \mathbb{N}}$ obtained from the Projection Mean Algorithm in the BW metric satisfies
$$(S^{(k)})^{-1} \,\sharp\, S^{(k+1)} = \frac{1}{m} \sum_{j=1}^m \big[ (S^{(k)})^{-1} \,\sharp\, A_j \big].$$
Equivalently,
$$\big( S^{(k)} S^{(k+1)} \big)^{1/2} = \frac{1}{m} \sum_{j=1}^m \big( S^{(k)} A_j \big)^{1/2}.$$
Proof. 
We have $S^{(k)} \in P_n$ by induction. The Projection Mean Algorithm derives $S^{(k+1)}$ from $S^{(k)}$ through the relation
$$\log_{S^{(k)}} S^{(k+1)} = \frac{1}{m} \sum_{j=1}^m \log_{S^{(k)}} A_j.$$
Using formula (17), we have
$$S^{(k)} \big[ (S^{(k)})^{-1} \,\sharp\, S^{(k+1)} \big] + \big[ (S^{(k)})^{-1} \,\sharp\, S^{(k+1)} \big] S^{(k)} = S^{(k)} \Big( \frac{1}{m} \sum_{j=1}^m (S^{(k)})^{-1} \,\sharp\, A_j \Big) + \Big( \frac{1}{m} \sum_{j=1}^m (S^{(k)})^{-1} \,\sharp\, A_j \Big) S^{(k)}.$$
The matrix
$$C := \big[ (S^{(k)})^{-1} \,\sharp\, S^{(k+1)} \big] - \frac{1}{m} \sum_{j=1}^m \big[ (S^{(k)})^{-1} \,\sharp\, A_j \big]$$
is Hermitian. The preceding equality shows that
$$S^{(k)} C + C S^{(k)} = 0,$$
so $S^{(k)} C$ is skew-Hermitian. Moreover, $S^{(k)} C$ is similar to the Hermitian matrix $(S^{(k)})^{1/2} C (S^{(k)})^{1/2}$, whose eigenvalues are all real. Hence, $S^{(k)} C = 0$ and $C = 0$. We obtain (21).
By ([32], Proposition 4.1.9), $A \,\sharp\, B = A (A^{-1} B)^{1/2}$ for $A, B \in P_n$. Applying it to (21), we get (22).    □
Theorem 10.
The sequence $\{S^{(k)}\}_{k \in \mathbb{N}}$ obtained from the Projection Mean Algorithm on $A_1, \dots, A_m \in \overline{P}_n$ satisfies the Löwner order relations
$$\frac{1}{m} \sum_{j=1}^m A_j = S^{(0)} \;\ge\; S^{(k+1)}, \quad k \in \mathbb{N}.$$
Proof. 
For $n \times n$ Hermitian matrices $B_1, \dots, B_m$, we have the Löwner order
$$\sum_{1 \le i < j \le m} (B_i - B_j)^2 \;\ge\; 0,$$
which is equivalent to
$$\frac{1}{m} \sum_{j=1}^m B_j^2 \;\ge\; \Big( \frac{1}{m} \sum_{j=1}^m B_j \Big)^2.$$
Relation (21) can be expressed as
$$\big[ (S^{(k)})^{1/2} S^{(k+1)} (S^{(k)})^{1/2} \big]^{1/2} = \frac{1}{m} \sum_{j=1}^m \big[ (S^{(k)})^{1/2} A_j (S^{(k)})^{1/2} \big]^{1/2}.$$
So, for $k \in \mathbb{N}$ we have
$$(S^{(k)})^{1/2} \Big( \frac{1}{m} \sum_{j=1}^m A_j \Big) (S^{(k)})^{1/2} = \frac{1}{m} \sum_{j=1}^m (S^{(k)})^{1/2} A_j (S^{(k)})^{1/2} \;\ge\; \Big( \frac{1}{m} \sum_{j=1}^m \big[ (S^{(k)})^{1/2} A_j (S^{(k)})^{1/2} \big]^{1/2} \Big)^2 = (S^{(k)})^{1/2} S^{(k+1)} (S^{(k)})^{1/2}.$$
Therefore, $\frac{1}{m} \sum_{j=1}^m A_j \ge S^{(k+1)}$ for $k \in \mathbb{N}$.    □
  • The Cheap Mean Algorithm [48] (Algorithm 3) also uses the log and exp functions to move matrices between $\overline{P}_n$ and the tangent space. Different from the Projection Mean Algorithm, which only updates the estimated barycenter, the Cheap Mean Algorithm updates the estimated barycenter as well as the original matrices. Moreover, unlike the Projection Mean Algorithm, which only works in the tangent space at the estimated barycenter $S^{(k)}$ during the iterations, the Cheap Mean Algorithm works in the tangent space at each data matrix. Specifically, for each original matrix $A_k$, which also serves as the initialization $A_k^{(0)}$, all matrices are projected onto the tangent space at $A_k$; the arithmetic mean of the tangent vectors is then projected back to $\overline{P}_n$ and serves as the update of $A_k$, denoted as $A_k^{(1)}$. Every data matrix is updated during the iteration process, and the iteration stops when the change in the arithmetic mean of all updated matrices between consecutive iterations is less than $\varepsilon$. The iterative process is outlined as follows:
Algorithm 3: Cheap Mean Algorithm
[Pseudocode for Algorithm 3 appears as an image in the published article.]
When $m = 2$, the Cheap Mean Algorithm produces the true Fréchet mean. However, when $m > 2$, the Cheap Mean is only an approximation of the Fréchet mean [48], a trade-off compensated by the low computational cost of this algorithm [48,49]. Nevertheless, as shown in the following experimental results, the computational advantage of the Cheap Mean approach only materializes when the number of matrices is small.
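A minimal sketch of the Cheap Mean iteration under the BW metric is given below; it follows my reading of the description above and of [48], reusing the log/exp formulas of Theorems 7 and 8 (it assumes the tangent vectors stay within the domain of the exp map, which holds for small perturbations).

```python
# Sketch of the Cheap Mean Algorithm with the BW log/exp maps.
import numpy as np
from scipy.linalg import sqrtm

def bw_log(A, B):
    cross = np.real(sqrtm(A @ B))
    return cross + cross.T - 2.0 * A

def bw_exp(A, X):
    lam, U = np.linalg.eigh(A)
    W = 1.0 / (lam[:, None] + lam[None, :])
    X_U = U.T @ X @ U
    WX = W * X_U
    return A + X + U @ ((WX * lam[None, :]) @ WX) @ U.T

def cheap_mean(mats, tol=1e-6, max_iter=100):
    cur = [A.copy() for A in mats]
    prev_center = sum(cur) / len(cur)
    for _ in range(max_iter):
        # at each point, average the log-images of all points and map back with exp
        tangents = [sum(bw_log(Ak, Aj) for Aj in cur) / len(cur) for Ak in cur]
        cur = [bw_exp(Ak, Tk) for Ak, Tk in zip(cur, tangents)]
        center = sum(cur) / len(cur)
        if np.linalg.norm(center - prev_center, 'fro') < tol:
            break
        prev_center = center
    return sum(cur) / len(cur)
```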

2.3. Classification with BW Barycenter

The barycenter serves not only as a statistical metric for measuring the central tendency of data, but is also extensively applied in the development of classifiers in various domains, including BCI research. Therefore, we construct a simple classifier known as the Minimum Distance to Riemannian Mean (MDRM) [50] as follows.
Suppose there are $m_k$ matrices belonging to the k-th class, denoted as $X_1^{(k)}, \dots, X_{m_k}^{(k)}$. Let $\hat{\mu}_k$ denote the estimated barycenter of the k-th class. The MDRM method compares the distance between a new matrix $X_{new}$ and the barycenter of each class, and assigns $X_{new}$ to the class whose barycenter is at the minimum distance:
$$k^* = \arg\min_k d(X_{new}, \hat{\mu}_k),$$
where $d(\cdot,\cdot)$ denotes the chosen distance metric (e.g., AI or BW). It is crucial to note that the choice of metric gives rise to a distinct barycenter, and therefore $\hat{\mu}_k$ is contingent on the selected metric. Consequently, the classifier, when paired with distinct metrics, demonstrates varying classification capabilities. With five real datasets, we explore the efficacy of the BW metric with this simple classifier and compare it with the widely used AI metric.
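The MDRM rule is short enough to sketch directly; the sketch below is generic in the distance and barycenter functions (e.g., the `bw_distance` and `projection_mean` helpers sketched earlier, which are illustrative names rather than the authors' API).

```python
# Sketch of a Minimum Distance to Riemannian Mean (MDRM) classifier.
class MDRM:
    def __init__(self, distance, barycenter):
        self.distance = distance        # e.g. bw_distance (or an AI distance)
        self.barycenter = barycenter    # e.g. projection_mean
        self.means_ = {}

    def fit(self, covs, labels):
        # one barycenter per class
        for k in set(labels):
            members = [C for C, y in zip(covs, labels) if y == k]
            self.means_[k] = self.barycenter(members)
        return self

    def predict(self, covs):
        # assign each matrix to the class of the nearest barycenter
        return [min(self.means_, key=lambda k: self.distance(C, self.means_[k]))
                for C in covs]

# usage (assuming the earlier sketches):
# clf = MDRM(distance=bw_distance, barycenter=projection_mean).fit(train_covs, train_labels)
# predictions = clf.predict(test_covs)
```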

3. Simulation Results

Different metrics give rise to distinct barycenters. Although the true BW barycenter remains elusive, as there is no probability distribution defined on the manifold coupled with the BW metric yet, we can still conduct extensive simulations to thoroughly investigate the efficiency and robustness of the estimation of the BW barycenter. In this section, we investigate both the efficiency and the robustness of the BW metric and barycenter and compare them with the AI metric and barycenter (1) via extensive simulations. Given that the AI metric is applicable only to positive definite matrices, we augment the randomly generated PSD matrices with a small diagonal matrix ($O(10^{-3}) \times I$) in order to ensure positive definiteness in our simulations. Table 1 provides a summary of the parameter settings employed in the following simulations.

3.1. Robustness of BW Distance

When matrices are affected by slight perturbations, a robust distance measure is expected to remain largely unaffected, i.e., to exhibit only small changes in distance. In this simulation, we quantify the robustness of a distance measure as the alteration in distance values when the matrices are affected by slight perturbations. Our goal is to assess the robustness of the BW distance and investigate the factors that affect it.
We begin by randomly generating 100 pairs of PSD matrices, denoted as $(A_i, B_i)$, $i = 1, \dots, 100$, where the dimension n is set to 10 and p is set to 0.2. For each pair $(A_i, B_i)$, we randomly generate 100 pairs of Hermitian perturbation matrices, denoted as $(E_{ij}^A, E_{ij}^B)$, $j = 1, \dots, 100$. These perturbation matrices have spectral norms scaled to match the smallest eigenvalues of $(A_i, B_i)$, which are $O(10^{-3})$. The contaminated matrices $(\tilde{A}_{ij}, \tilde{B}_{ij})$ are then obtained by adding the perturbation matrices to $(A_i, B_i)$, resulting in $\tilde{A}_{ij} = A_i + E_{ij}^A$, $\tilde{B}_{ij} = B_i + E_{ij}^B$. The robustness of the distance measures is assessed by quantifying the difference in distances between the two matrices with and without perturbation. As the AI and BW distances are on different scales, we normalize the differences by the respective distance between the unperturbed matrices as follows:
$$\Delta_{ij}^{AI} = \frac{d_{AI}(A_i, B_i) - d_{AI}(\tilde{A}_{ij}, \tilde{B}_{ij})}{d_{AI}(A_i, B_i)}, \qquad \Delta_{ij}^{BW} = \frac{d_{BW}(A_i, B_i) - d_{BW}(\tilde{A}_{ij}, \tilde{B}_{ij})}{d_{BW}(A_i, B_i)}.$$
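A minimal sketch of one replicate of this perturbation experiment is given below. It is illustrative only: the paper's exact random generator (including the role of the parameter p, which I treat here as the fraction of near-zero eigenvalues — an assumption) is summarized in Table 1 and not reproduced by this code.

```python
# Sketch of one robustness replicate: a random PSD pair, a small symmetric perturbation
# with spectral norm `scale`, and the relative change in BW distance.
import numpy as np
from scipy.linalg import sqrtm

def bw_distance(A, B):
    val = np.trace(A) + np.trace(B) - 2.0 * np.trace(np.real(sqrtm(A @ B)))
    return np.sqrt(max(val, 0.0))

def random_psd(n, p, rng):
    # assumption: p is the fraction of (near-)zero eigenvalues of the generated matrix
    r = max(1, int(round((1 - p) * n)))
    X = rng.standard_normal((n, r))
    return X @ X.T

def perturb(A, rng, scale):
    E = rng.standard_normal(A.shape); E = (E + E.T) / 2.0
    return A + scale * E / np.linalg.norm(E, 2)     # spectral norm of the perturbation = scale

rng = np.random.default_rng(0)
n, p, scale = 10, 0.2, 1e-3                          # illustrative parameter values
A, B = random_psd(n, p, rng), random_psd(n, p, rng)
A_t, B_t = perturb(A, rng, scale), perturb(B, rng, scale)
delta_bw = (bw_distance(A, B) - bw_distance(A_t, B_t)) / bw_distance(A, B)
print(delta_bw)
```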
Figure 2 presents the empirical distributions of $\Delta_{ij}^{AI}$ and $\Delta_{ij}^{BW}$ with varying n in (A) and varying p in (B). Overall, $\Delta^{BW}$ consistently exhibits significantly smaller values than $\Delta^{AI}$, indicating the superior robustness of the BW distance in the presence of perturbations. Moreover, both the BW and AI distances are influenced by the matrix dimension (all p-values < 0.01 for the Kruskal–Wallis and Dunn tests), with higher dimensions resulting in reduced robustness. However, p has less impact on the robustness of both distances (significant Kruskal–Wallis test, but not all p-values are significant for the Dunn tests at the 0.01 significance level). Additionally, the BW distance demands considerably less computational time than the AI distance, as is evident from comparing Equations (1) and (4).

3.2. Robustness of BW Barycenter for Two Matrices

The true BW barycenter of two matrices is the mid-point of the geodesic, so there is no need to employ an algorithm for barycenter estimation in this case. Hence, we evaluate the robustness of the BW barycenter for a pair of matrices separately. Robustness here refers to whether the barycenter is affected when the two matrices are contaminated by small perturbations. With this simulation, we aim to assess the robustness of the BW barycenter and investigate the factors that affect it.
Following the same data simulation process outlined previously, we randomly generate 100 pairs of unperturbed matrices $(A_i, B_i)$, $i = 1, \dots, 100$. For each pair $(A_i, B_i)$, 100 perturbed matrices $(\tilde{A}_{ij}, \tilde{B}_{ij})$, $j = 1, \dots, 100$, are generated. Then, the BW and AI barycenters of the unperturbed and perturbed matrices are computed:
$$M_i^{AI} = A_i \,\sharp\, B_i, \qquad M_i^{BW} = A_i \,\diamond\, B_i,$$
$$\tilde{M}_{ij}^{AI} = \tilde{A}_{ij} \,\sharp\, \tilde{B}_{ij}, \qquad \tilde{M}_{ij}^{BW} = \tilde{A}_{ij} \,\diamond\, \tilde{B}_{ij}.$$
The robustness of the barycenter is quantified as the Frobenius distance between the barycenters of the matrices with and without perturbations, denoted as $d_F(M_i, \tilde{M}_{ij})$ for the AI and BW barycenters, respectively. A smaller distance $d_F(M_i, \tilde{M}_{ij})$ indicates greater robustness of the barycenter. Figure 3 presents the distribution of $\{d_F(M_i, \tilde{M}_{ij})\}_{j = 1, \dots, 100}$ for three representative pairs of matrices. Case 1 represents the most prevalent scenario (73%), where $d_F(M_i^{BW}, \tilde{M}_{ij}^{BW})$ is generally smaller than $d_F(M_i^{AI}, \tilde{M}_{ij}^{AI})$. Case 2 shows another common scenario (25%), where $d_F(M_i^{BW}, \tilde{M}_{ij}^{BW})$ is significantly smaller than $d_F(M_i^{AI}, \tilde{M}_{ij}^{AI})$. Both Cases 1 and 2 signify instances where the BW barycenter exhibits greater robustness than the AI barycenter. In contrast, Case 3, the least common scenario (2%), showcases situations where the AI barycenter displays superior robustness to the BW barycenter, albeit with a noteworthy degree of overlap between the two distributions.
For each pair of matrices, the average of $\{d_F(M_i, \tilde{M}_{ij})\}$, denoted as $\overline{d_F(M_i, \tilde{M}_{ij})}$, is calculated for AI and BW, respectively, to represent the average deviation of the barycenter when the matrices are affected by perturbations. To compare which barycenter is less affected by the perturbation, i.e., which has the smaller $\overline{d_F(M_i, \tilde{M}_{ij})}$, we record the relative difference $\delta_i$ as follows:
$$\overline{d_F(M_i, \tilde{M}_{ij})} = \frac{1}{100} \sum_{j=1}^{100} d_F(M_i, \tilde{M}_{ij}), \qquad \delta_i = \frac{\overline{d_F(M_i^{AI}, \tilde{M}_{ij}^{AI})} - \overline{d_F(M_i^{BW}, \tilde{M}_{ij}^{BW})}}{\overline{d_F(M_i^{BW}, \tilde{M}_{ij}^{BW})}}.$$
A positive $\delta_i$ signifies that the BW barycenter exhibits greater robustness than the AI barycenter. Figure 4A,B presents the distribution of $\{\delta\}$ with varying n and p, respectively. The line plot in the top right corner shows the median of each $\{\delta\}$ distribution. Overall, a substantial portion of the $\delta$ values fall on the positive side, irrespective of n and p, while occurrences of negative $\delta$ are rare. This pattern indicates that the BW barycenter tends to be more robust than the AI barycenter.
With an increase in the matrix dimension n, the distribution of $\delta_i$ becomes more tightly clustered around 0, accompanied by a consistent decrease in the median. In simpler terms, the robustness advantage of the BW barycenter over the AI barycenter is more pronounced for lower-dimensional matrices (significant Kruskal–Wallis test, and the majority of pairwise Dunn-test comparisons are significant at the 0.05 significance level). Unlike n, the distribution of $\delta$ exhibits similar shapes and comparable medians across varying p (non-significant Kruskal–Wallis test). Consequently, it appears that p exerts minimal influence on the difference in robustness between the BW and AI barycenters.

3.3. Properties of BW Barycenter for More than Two Matrices

Now we consider a collection of PSD matrices $A_1, \dots, A_m$, $m > 2$, and estimate their barycenter using the proposed algorithms. Our goal is to evaluate the efficiency, accuracy, and robustness of the BW barycenter estimation.

3.3.1. Efficiency of Barycenter Estimation with BW Distance

The three algorithms are applied to multiple sets of randomly generated PSD matrices to estimate both the BW and AI barycenters. Figure 5 presents the average runtime (in seconds) on a logarithmic scale for various numbers and dimensions of matrices. For each combination of matrix dimension n and number of matrices m, 100 sets of matrices are generated.
Both the Projection and Cheap Mean Algorithms require more time to execute as n or m increases, whereas the Inductive Mean Algorithm is primarily affected only by n. The BW Projection Algorithm (short for the Projection Algorithm with the BW distance) proves to be the most efficient regardless of n or m, followed by the AI Projection Algorithm in second place. The gap between the two diminishes as n increases. It is also worth emphasizing that the AI Projection Algorithm can experience convergence difficulties, especially when n substantially exceeds m [49]. For evidence, see Section 4.

3.3.2. Accuracy of Barycenter Estimation with BW Distance

The barycenter of a set of matrices, although unknown, is defined as the point that minimizes the sum of squared BW distances from each matrix to that point. This makes the sum of squared distances (SSD) a suitable metric for measuring the accuracy of barycenter estimation, where a smaller SSD indicates a more accurate estimate of the barycenter.
Sets of PSD matrices are randomly generated, with p set to 0.2 and varying m and n. The three algorithms are employed to estimate the barycenters using the same stopping criterion $\varepsilon = 10^{-3}$. Denote the barycenters estimated via the Inductive, Projection, and Cheap Mean Algorithms as $M_I$, $M_P$, and $M_C$, respectively. For each barycenter M, the corresponding SSD is calculated as $SSD = \sum_{i=1}^m d_{BW}^2(A_i, M)$. Figure 6A shows the distribution of the three SSD values across 100 iterations with n equal to 10 and m equal to 100. $SSD_P$ is significantly smaller than the other two. For easy comparison, we present the difference between $SSD_I$ and $SSD_P$ in Figure 6B and the difference between $SSD_C$ and $SSD_P$ in Figure 6C.
Overall, both $SSD_I - SSD_P$ and $SSD_C - SSD_P$ are positive, indicating that the Projection Mean Algorithm consistently proves to be the most accurate, regardless of n and m. $SSD_C - SSD_P$ generally exhibits a larger scale than $SSD_I - SSD_P$, implying that the Inductive Mean Algorithm tends to outperform the Cheap Mean Algorithm, except for low-dimensional matrices ($n = 5$). Additionally, the Cheap Mean Algorithm exhibits reduced accuracy as the number and dimension of the matrices increase.

3.3.3. Robustness of Barycenter Estimation with BW Distance

The robustness of barycenter estimation refers to how the estimated barycenter is affected by perturbations introduced to the matrices. To assess this, we use the Frobenius distance between the two barycenters estimated with and without perturbations as the evaluation metric, under the assumption that a smaller distance indicates a more robust barycenter. Given that the Projection Algorithm is the most accurate and efficient, it is employed for estimating the barycenters in this simulation for both the AI and BW metrics. In the cases where the AI Projection Algorithm encounters convergence issues, we resort to the AI Inductive Algorithm for barycenter estimation.
In each iteration i ($i = 1, \dots, 100$), a set of PSD matrices $\{A_{i1}, \dots, A_{im}\}$ is randomly generated, and the contaminated matrices $\{\tilde{A}_{ij1}, \dots, \tilde{A}_{ijm}\}$, $j = 1, \dots, 100$, are derived by adding randomly generated Hermitian perturbation matrices $E_{ijk}$ to $A_{ik}$. Let $M_i^{BW}$ and $M_i^{AI}$ denote the estimated barycenters of the unperturbed matrices $\{A_{i1}, \dots, A_{im}\}$, and let $\tilde{M}_{ij}^{BW}$ and $\tilde{M}_{ij}^{AI}$ represent the barycenters of the perturbed matrices $\{\tilde{A}_{ij1}, \dots, \tilde{A}_{ijm}\}$. The robustness of the barycenter is then assessed via $d_F(M_i, \tilde{M}_{ij})$. Figure 7A presents the distribution of $\{d_F(M_i^{BW}, \tilde{M}_{ij}^{BW})\}$ and Figure 7B presents the distribution of $\{d_F(M_i^{AI}, \tilde{M}_{ij}^{AI}) - d_F(M_i^{BW}, \tilde{M}_{ij}^{BW})\}$. It is observed that as n or m increases, the BW barycenters exhibit significantly greater robustness than the AI barycenters (confirmed by a significant Kruskal–Wallis test and pairwise Wilcoxon rank-sum tests).

4. Real Data Results

In addition to simulations, we also investigate the effectiveness of employing the BW distance and BW barycenter in real-world applications. One scalp EEG dataset collected in a lab and four publicly accessible BCI datasets are utilized to evaluate the improvement in classification performance achieved through the use of the BW distance.
Two datasets from BCI competition III [51,52,53] and two from BCI competition IV [54,55,56] are utilized in order to ensure a comprehensive representation of scenarios commonly encountered in EEG-based BCI applications. In BCI III, dataset IIIa comprises 60-channel EEG recordings from three subjects who performed motor imagery tasks involving four classes (left hand, right hand, foot, and tongue). Dataset IVa consists of 118-channel EEG recordings from five subjects engaged in two motor imagery tasks (right hand and foot). In BCI IV, dataset IIa includes 22-channel EEG recordings from nine subjects performing four motor imagery tasks (left hand, right hand, feet, and tongue), while dataset IIb contains 3-channel EEG recordings from nine subjects engaged in two tasks (left hand and right hand). For comparability across all datasets, the two datasets with four classes were narrowed down to two classes, i.e., right hand and left hand, so that the classification on all datasets is framed as a binary task. Besides the BCI data, one scalp EEG dataset collected from the human spatial cognition laboratory at the University of Arizona is also incorporated. In total, 16 participants were recruited to perform predesigned tasks. Written informed consent was obtained from the participants. The study was approved by the Institutional Review Board at the University of Arizona and was conducted in accordance with relevant guidelines and regulations. The EEG data consist of 64-channel scalp recordings with binary labels. Further information regarding the experimental design can be found in [2,57]. Details regarding the five datasets are summarized in Table 2. In general, the matrix dimensions across these five datasets range from 3 to 118, and the number of matrices varies from 10 to 200. This provides a broad and inclusive representation of the cases encountered in EEG-based BCI data analysis.
Let $S \in \mathbb{R}^{n \times T}$ represent the BCI (or EEG) data of one trial, with n denoting the number of electrode channels and T indicating the duration of the trial. Following the preprocessing procedures outlined in previous works [5,6,58,59,60], the raw BCI signals undergo bandpass filtering within the 8–30 Hz range using a fifth-order Butterworth filter. Similarly, the lab EEG data undergo a 1–30 Hz bandpass filter and independent component analysis [57]. Subsequently, the covariance matrix of the filtered signals during each trial is computed, denoted as $X_{n \times n}$. Then, an MDRM classifier is trained to directly classify the matrix $X_{n \times n}$ into one of the classes. Using six-fold cross-validation repeated 10 times, Table 2 presents the average classification accuracy for the five datasets. The barycenters in the MDRM classifiers are estimated using the Projection Algorithm with the AI and BW metrics, respectively. Since covariance regularization is widely used in EEG preprocessing, we also apply the Ledoit–Wolf shrinkage method [61] (LWF) to explore the effect of regularization coupled with each of the two metrics.
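A minimal sketch of this trial-to-covariance preprocessing is given below (filter settings follow the text; the sampling rate and trial length in the example are illustrative, and the resulting covariance matrices would then be fed to the MDRM sketch given earlier).

```python
# Sketch: band-pass filter each trial and map it to an n x n sample covariance matrix.
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(trial, fs, low=8.0, high=30.0, order=5):
    # fifth-order Butterworth band-pass filter, applied along the time axis
    b, a = butter(order, [low, high], btype='bandpass', fs=fs)
    return filtfilt(b, a, trial, axis=-1)

def trial_covariance(trial, fs):
    filtered = bandpass(trial, fs)
    X = filtered - filtered.mean(axis=1, keepdims=True)
    return X @ X.T / (X.shape[1] - 1)

# Example: 22-channel trials of 3 s sampled at 250 Hz (illustrative values only)
rng = np.random.default_rng(0)
trials = [rng.standard_normal((22, 750)) for _ in range(20)]
covs = [trial_covariance(S, fs=250) for S in trials]
```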
Overall, it is evident that the classifiers with the BW barycenter yield higher accuracy when LWF is not applied. For BCI III IIIa, BCI III IVa, and the lab data, applying LWF decreases the accuracy of BW MDRM (short for the MDRM classifier with the BW barycenter) and increases the accuracy of AI MDRM. However, the increased accuracy of AI MDRM with LWF still falls short of BW MDRM without LWF. This, coupled with the extra cost of introducing LWF as an additional preprocessing step, suggests that preserving and modeling the positive semidefiniteness of the matrices can be a superior option to regularization. On the other hand, AI and BW MDRM both produce decreased accuracy for BCI IV IIb, while both produce increased accuracy for BCI IV IIa. Although these two cases exhibit completely opposite changes, the performance of AI MDRM did not exceed that of its BW counterpart.
Moreover, the AI metric in MDRM encountered convergence difficulties in our real-data experiments that the BW metric did not. To monitor the convergence of the Projection Algorithm with the AI and BW metrics, we recorded the sum of squared distances (SSD) after each iteration. The ‘foot’ class of three subjects (av, aw, and ay) from the BCI III IVa dataset, which share the same matrix dimension ($n = 118$) but have different numbers of matrices ($m = 84$ for av, 56 for aw, and 28 for ay), was selected as a representative case for demonstration. The change in SSD is shown in Figure 8. In addition to the fact that the BW metric yielded significantly lower SSD than the AI metric, the pronounced oscillatory behavior exhibited under the AI metric suggests that the mean estimate may be bouncing around within a neighborhood, whereas fast convergence is achieved with the BW metric on the same data.

5. Conclusions

In this paper, we establish the mathematical framework for the BW distance by delving into its properties and the retraction maps on $\overline{P}_n$. Given a set of PSD matrices, we propose three algorithms to estimate their barycenter, which characterizes the central tendency of a set of matrices and is widely used in the construction of classifiers. Extensive simulation experiments reveal that the BW distance and barycenter exhibit superior efficiency and robustness compared to the AI distance, especially when dealing with high-dimensional matrices, which are frequently encountered in high-dimensional data analysis. Moreover, validation on five real datasets further confirms the superiority of the BW distance and BW barycenter: the BW distance demonstrates higher classification accuracy and substantially reduced computational cost, especially for high-dimensional matrices. Therefore, when analyzing high-dimensional data, we highly recommend employing the BW distance along with the Projection Mean Algorithm for barycenter estimation. This approach eliminates the need for matrix regularization, substantially reduces computational time, and can achieve comparable or even superior classification performance compared to the AI metric. While we validate the proposed method using BCI data, it is adaptable to various other fields, including computer vision, remote sensing, and biomedical imaging.

Author Contributions

Conceptualization, J.Z. and H.H.; methodology, J.Z. and H.H.; software, Y.L.; validation, Y.L. and Y.Y.; formal analysis, Y.L. and Y.Y.; investigation, J.Z.; resources, J.Z.; data curation, J.Z.; writing—original draft preparation, J.Z., H.H., Y.L., S.-C.L. and Y.Y.; writing—review and editing, J.Z., H.H., Y.L., S.-C.L. and Y.Y.; visualization, Y.L. and Y.Y.; supervision, J.Z.; project administration, J.Z.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation under Grant No. 2153492.

Data Availability Statement

The BCI III datasets are available at https://www.bbci.de/competition/iii/ (accessed on 26 February 2023) and the BCI IV datasets are available at https://www.bbci.de/competition/iv/ (accessed on 26 February 2023). The EEG lab data are available upon request. The code for the BW barycenter estimation algorithms and MDRM has been published at https://github.com/CarlosyxLi/Towards-Analysis-of-Covariance-Matrices-through-Bures-Wasserstein-Distance (accessed on 26 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Theorem A1.
For $A, B \in \overline{P}_n$ and $t \in \mathbb{R}$,
$$A \,\diamond_t\, B = B \,\diamond_{1-t}\, A = |(1-t) A^{1/2} + t U^* B^{1/2}|^2,$$
in which U is a certain orthogonal matrix occurring in a polar decomposition of $B^{1/2} A^{1/2}$:
$$B^{1/2} A^{1/2} = U |B^{1/2} A^{1/2}| = U (A^{1/2} B A^{1/2})^{1/2},$$
or equivalently, $A^{1/2} B^{1/2} = U^* |A^{1/2} B^{1/2}|$. Moreover,
$$|(A \,\diamond_t\, B)^{1/2} A^{1/2}| = \big| (1-t) A + t |B^{1/2} A^{1/2}| \big|.$$
In particular, if $(1-t) A + t |B^{1/2} A^{1/2}|$ is PSD (such as when $t \in [0,1]$), then
$$|(A \,\diamond_t\, B)^{1/2} A^{1/2}| = (1-t) A + t |B^{1/2} A^{1/2}|.$$
Proof. 
If $A^{1/2}$ is nonsingular, i.e., $A \in P_n$, then for any orthogonal matrix U in the polar decomposition (A2), we have
$$A^{1/2} U^* B^{1/2} = A^{1/2} (U^* B^{1/2} A^{1/2}) A^{-1/2} = A^{1/2} (A^{1/2} B A^{1/2})^{1/2} A^{-1/2} = (A^{1/2} A^{1/2} B A^{1/2} A^{-1/2})^{1/2} = (AB)^{1/2}.$$
If $A^{1/2}$ is singular, we may find a sequence of positive definite matrices $\{A_i\}_{i=1}^{\infty} \subset P_n$ such that $\lim_{i \to \infty} A_i = A$. Let $B^{1/2} A_i^{1/2} = U_i |B^{1/2} A_i^{1/2}|$ be the polar decomposition for $i = 1, 2, 3, \dots$ Since the orthogonal group of degree n is compact, the sequence $\{U_i\}_{i=1}^{\infty}$ has a subsequence $\{U_{i_t}\}_{t=1}^{\infty}$ convergent to an orthogonal matrix $U := \lim_{t \to \infty} U_{i_t}$. Then
$$A^{1/2} U^* B^{1/2} = \lim_{t \to \infty} A_{i_t}^{1/2} U_{i_t}^* B^{1/2} = \lim_{t \to \infty} (A_{i_t} B)^{1/2} = (AB)^{1/2}.$$
In both cases,
$$B^{1/2} U A^{1/2} = (A^{1/2} U^* B^{1/2})^* = (BA)^{1/2}.$$
So, the geodesic from A to B can be expressed as
$$A \,\diamond_t\, B = B \,\diamond_{1-t}\, A = (1-t)^2 A + t^2 B + t(1-t) \big[ (AB)^{1/2} + (BA)^{1/2} \big] = |(1-t) A^{1/2} + t U^* B^{1/2}|^2.$$
We get (A1). (Note that when both A and B are singular, some orthogonal matrices U that satisfy (A2) may not satisfy (A1).) Moreover, since $||X| Y| = (Y^* |X|^2 Y)^{1/2} = |XY|$ for all $X, Y \in M_n$, we have
$$|(A \,\diamond_t\, B)^{1/2} A^{1/2}| = \big| |(1-t) A^{1/2} + t U^* B^{1/2}| \, A^{1/2} \big| = \big| [(1-t) A^{1/2} + t U^* B^{1/2}] A^{1/2} \big| = \big| (1-t) A + t |B^{1/2} A^{1/2}| \big|.$$
So, (A3) and (A4) hold. The proof is completed. □
Theorem A2.
Let $A, B \in \overline{P}_n$.
1. If $r, t \in \mathbb{R}$ and $(1-t) A + t |B^{1/2} A^{1/2}| \in \overline{P}_n$, then
$$A \,\diamond_r\, (A \,\diamond_t\, B) = A \,\diamond_{rt}\, B.$$
2. If $r, s, t \in \mathbb{R}$ satisfy $(1-x) A + x |B^{1/2} A^{1/2}| \in \overline{P}_n$ for $x \in \{s, t\}$, then
$$(A \,\diamond_s\, B) \,\diamond_r\, (A \,\diamond_t\, B) = A \,\diamond_{(1-r)s + rt}\, B.$$
Proof. 
(A7) is a special case of (A8), obtained by choosing $s = 0$. We will show that (A7) also implies (A8) after proving (A7).
Suppose $r, t \in \mathbb{R}$ are such that $(1-t) A + t |B^{1/2} A^{1/2}| \in \overline{P}_n$. Assume first that $A \in P_n$ (the case of singular A can be handled by continuous extension). By (A3) and (A4),
$$\big| [A \,\diamond_r\, (A \,\diamond_t\, B)]^{1/2} A^{1/2} \big| = \big| (1-r) A + r |(A \,\diamond_t\, B)^{1/2} A^{1/2}| \big| = \big| (1-r) A + r [(1-t) A + t |B^{1/2} A^{1/2}|] \big| = \big| (1-rt) A + rt |B^{1/2} A^{1/2}| \big| = \big| (A \,\diamond_{rt}\, B)^{1/2} A^{1/2} \big|.$$
Hence, there is an orthogonal matrix V such that
$$[A \,\diamond_r\, (A \,\diamond_t\, B)]^{1/2} A^{1/2} = V (A \,\diamond_{rt}\, B)^{1/2} A^{1/2}.$$
By assumption, A is nonsingular, so that $[A \,\diamond_r\, (A \,\diamond_t\, B)]^{1/2} = V (A \,\diamond_{rt}\, B)^{1/2}$. We get (A7).
Now suppose $r, s, t \in \mathbb{R}$ satisfy $(1-x) A + x |B^{1/2} A^{1/2}| \in \overline{P}_n$ for $x \in \{s, t\}$. Again, we assume first that $A \in P_n$ and then extend the result to singular A by continuity. Suppose $t \le s$ without loss of generality. Then, by (A7),
$$(A \,\diamond_s\, B) \,\diamond_r\, (A \,\diamond_t\, B) = (A \,\diamond_s\, B) \,\diamond_r\, [A \,\diamond_{t/s}\, (A \,\diamond_s\, B)] = (A \,\diamond_s\, B) \,\diamond_r\, [(A \,\diamond_s\, B) \,\diamond_{1 - t/s}\, A] = (A \,\diamond_s\, B) \,\diamond_{\frac{rs - rt}{s}}\, A = A \,\diamond_{1 - \frac{rs - rt}{s}}\, (A \,\diamond_s\, B) = A \,\diamond_{(1-r)s + rt}\, B.$$
We get (A8). □
Let U be the orthogonal matrix in the polar decomposition of $B^{1/2} A^{1/2}$ as in (A2). Bhatia, Jain, and Lim showed ([33], Theorem 1) that
$$d_{BW}(A, B) = \| A^{1/2} - B^{1/2} U \|_F,$$
which implies the following property of the BW distance.
Theorem A3.
For $A, B \in \overline{P}_n$ and $t \in \mathbb{R}$, if $(1-t) A + t |B^{1/2} A^{1/2}|$ is PSD (e.g., when $t \in [0,1]$), then
$$d_{BW}(A, A \,\diamond_t\, B) = |t|\, d_{BW}(A, B).$$
Proof. 
We prove the case of nonsingular A; the singular case follows by continuity. Let U and V be the orthogonal matrices in the polar decompositions of $B^{1/2} A^{1/2}$ and $(A \,\diamond_t\, B)^{1/2} A^{1/2}$, respectively:
$$B^{1/2} A^{1/2} = U |B^{1/2} A^{1/2}|, \qquad (A \,\diamond_t\, B)^{1/2} A^{1/2} = V |(A \,\diamond_t\, B)^{1/2} A^{1/2}| = V [(1-t) A + t |B^{1/2} A^{1/2}|].$$
Taking Hermitian transposes of the above equalities, we get
$$A^{1/2} B^{1/2} = |B^{1/2} A^{1/2}| U^*, \qquad A^{1/2} (A \,\diamond_t\, B)^{1/2} = [(1-t) A + t |B^{1/2} A^{1/2}|] V^*.$$
Therefore,
$$(A \,\diamond_t\, B)^{1/2} V = A^{-1/2} [(1-t) A + t |B^{1/2} A^{1/2}|] = (1-t) A^{1/2} + t B^{1/2} U.$$
By ([33], Theorem 1),
$$d_{BW}(A, A \,\diamond_t\, B) = \| A^{1/2} - (A \,\diamond_t\, B)^{1/2} V \|_F = \| t A^{1/2} - t B^{1/2} U \|_F = |t|\, d_{BW}(A, B).$$
We get (A11). □
Lemma A1.
For $A, B \in \overline{P}_n$, $X \in H_n$ such that $\exp_A X$ is well-defined, $U \in U_n$, and $t \in \mathbb{R}$,
$$d_{BW}(U^* A U, U^* B U) = d_{BW}(A, B),$$
$$(U^* A U) \,\diamond_t\, (U^* B U) = U^* (A \,\diamond_t\, B) U,$$
$$\log_{U^* A U}(U^* B U) = U^* (\log_A B) U,$$
$$\exp_{U^* A U}(U^* X U) = U^* (\exp_A X) U.$$
Proof. 
The proofs of (A12) and (A13) are straightforward from (4) and (5). Taking $\frac{d}{dt}\big|_{t=0}$ on both sides of (A13), we get (A14). Then, (A15) follows. □
According to the spectral decomposition, every $A \in \overline{P}_n$ can be written as $A = U \Lambda U^*$ for a nonnegative diagonal matrix $\Lambda$ and an orthogonal matrix $U \in U_n$. Lemma A1 implies that we can transform the BW metric around A to that around the diagonal matrix $\Lambda$ and simplify the computation.
The log function on $P_n$ under the BW metric has been described in [33,37,40]. We add an approximation of $\log_A(A + tX)$ when $tX$ is near 0, as follows.
Theorem A4.
For any $A, B \in P_n$, $X \in H_n$, and $t \in \mathbb{R}$ sufficiently close to 0, we have
$$\log_A(B) = (AB)^{1/2} + (BA)^{1/2} - 2A,$$
$$\log_A(A + tX) = tX + O(t^2).$$
Proof. 
By (5), the tangent vector of the geodesic $A \,\diamond_t\, B$ at A is
$$\log_A(B) = \frac{d}{dt}\Big|_{t=0} (A \,\diamond_t\, B) = (AB)^{1/2} + (BA)^{1/2} - 2A.$$
For $X \in H_n$ and $t \in \mathbb{R}$ sufficiently close to 0, we have $A + tX \in P_n$, so that
$$\log_A(A + tX) = [A(A + tX)]^{1/2} + [(A + tX)A]^{1/2} - 2A.$$
Moreover,
$$[A(A + tX)]^{1/2} = [A^{1/2} (A^2 + t A^{1/2} X A^{1/2}) A^{-1/2}]^{1/2} = A^{1/2} (A^2 + t A^{1/2} X A^{1/2})^{1/2} A^{-1/2},$$
$$[(A + tX)A]^{1/2} = [A^{-1/2} (A^2 + t A^{1/2} X A^{1/2}) A^{1/2}]^{1/2} = A^{-1/2} (A^2 + t A^{1/2} X A^{1/2})^{1/2} A^{1/2}.$$
Let $C := (A^2 + t A^{1/2} X A^{1/2})^{1/2}$. Then
$$\log_A(A + tX) = A^{1/2} C A^{-1/2} + A^{-1/2} C A^{1/2} - 2A.$$
When t is close to 0, C can be expressed as a power series in t:
$$C = (A^2 + t A^{1/2} X A^{1/2})^{1/2} = A + tZ + O(t^2).$$
Taking squares, we have
$$A^2 + t A^{1/2} X A^{1/2} = [A + tZ + O(t^2)]^2 = A^2 + t(AZ + ZA) + O(t^2).$$
Therefore, we get the Lyapunov equation $AZ + ZA = A^{1/2} X A^{1/2}$. By (A18) and (A19),
$$\log_A(A + tX) = A^{1/2}(A + tZ)A^{-1/2} + A^{-1/2}(A + tZ)A^{1/2} - 2A + O(t^2) = t A^{-1/2}(AZ + ZA)A^{-1/2} + O(t^2) = tX + O(t^2).$$
The exponential map has been studied in [35,41]. Here, we give a concise form of the exponential map and provide its exact domain. We also provide an approximation of $\exp_A(X)$ when X is a Hermitian matrix near 0. Let $A \circ B$ denote the Hadamard product of matrices A and B of the same size.
Lemma A2.
Let $A = \operatorname{diag}(\lambda_1, \dots, \lambda_n) \in P_n$ be a positive diagonal matrix. Denote $W := \big[ \tfrac{1}{\lambda_i + \lambda_j} \big]_{n \times n}$. Then, for every Hermitian matrix $X \in H_n$ such that $I_n + W \circ X$ is PSD,
$$\exp_A(X) = A + X + (W \circ X) A (W \circ X).$$
In particular, for $t \in \mathbb{R}$ sufficiently close to 0,
$$\exp_A(tX) = A + tX + O(t^2).$$
Proof. 
First, let $B = \exp_A(X)$ and
$$C := (A^{1/2} B A^{1/2})^{1/2}.$$
Then, (A16) and the assumption $A = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ imply that
$$2A + X = (AB)^{1/2} + (BA)^{1/2} = A^{1/2} (A^{1/2} B A^{1/2})^{1/2} A^{-1/2} + A^{-1/2} (A^{1/2} B A^{1/2})^{1/2} A^{1/2} = A^{1/2} C A^{-1/2} + A^{-1/2} C A^{1/2} = \Big[ \frac{\lambda_i + \lambda_j}{\lambda_i^{1/2} \lambda_j^{1/2}} \Big]_{n \times n} \circ C,$$
so that
$$C = \Big[ \frac{\lambda_i^{1/2} \lambda_j^{1/2}}{\lambda_i + \lambda_j} \Big]_{n \times n} \circ (2A + X) = A + \Big[ \frac{\lambda_i^{1/2} \lambda_j^{1/2}}{\lambda_i + \lambda_j} \Big]_{n \times n} \circ X.$$
By (A23), C must be PSD, which is equivalent to $A^{-1/2} C A^{-1/2} = I_n + W \circ X$ being PSD. In such a case,
$$B = \exp_A(X) = A^{-1/2} C^2 A^{-1/2} = A^{-1/2} \Big( A + \Big[ \frac{\lambda_i^{1/2} \lambda_j^{1/2}}{\lambda_i + \lambda_j} \Big]_{n \times n} \circ X \Big)^2 A^{-1/2} = A + X + (W \circ X) A (W \circ X).$$
The expression (A21) is obtained. Replacing X by tX for small t, (A22) follows. □
Similarly, the exponential function and its approximation derived in Lemma A2 for the case of A being a positive diagonal matrix can be extended to the general case.
Theorem A5.
Suppose $A \in P_n$ has the spectral decomposition $A = U \Lambda U^*$, where U is an orthogonal matrix and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. Denote $W = \big[ \tfrac{1}{\lambda_i + \lambda_j} \big]_{n \times n}$. Then, for every Hermitian matrix $X \in H_n$ such that $I_n + W \circ X_U$ is PSD, where $X_U := U^* X U$, we have
$$\exp_A(X) = A + X + U (W \circ X_U) \Lambda (W \circ X_U) U^*.$$
In particular, when $t \in \mathbb{R}$ is sufficiently close to zero, we have
$$\exp_A(tX) = A + tX + O(t^2).$$
Proof. 
By Lemma A2, under the assumption on X,
$$\exp_A X = \exp_{U \Lambda U^*} X = U \exp_{\Lambda}(U^* X U) U^* = U [\Lambda + X_U + (W \circ X_U) \Lambda (W \circ X_U)] U^* = A + X + U [(W \circ X_U) \Lambda (W \circ X_U)] U^*.$$
Replacing X by tX, we get (A26). □

References

1. Schirrmeister, R.T.; Springenberg, J.T.; Fiederer, L.D.J.; Glasstetter, M.; Eggensperger, K.; Tangermann, M.; Hutter, F.; Burgard, W.; Ball, T. Deep learning with convolutional neural networks for EEG decoding and visualization. Hum. Brain Mapp. 2017, 38, 5391–5420.
2. Zheng, J.; Liang, M.; Sinha, S.; Ge, L.; Yu, W.; Ekstrom, A.; Hsieh, F. Time-frequency analysis of scalp EEG with Hilbert-Huang transform and deep learning. IEEE J. Biomed. Health Inform. 2021, 26, 1549–1559.
3. Qiu, A.; Lee, A.; Tan, M.; Chung, M.K. Manifold learning on brain functional networks in aging. Med. Image Anal. 2015, 20, 52–60.
4. Varoquaux, G.; Baronnet, F.; Kleinschmidt, A.; Fillard, P.; Thirion, B. Detection of brain functional-connectivity difference in post-stroke patients using group-level covariance modeling. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2010: 13th International Conference, Beijing, China, 20–24 September 2010; Proceedings, Part I 13. Springer: Berlin/Heidelberg, Germany, 2010; pp. 200–208.
5. Barachant, A.; Bonnet, S.; Congedo, M.; Jutten, C. Riemannian Geometry Applied to BCI Classification. In Proceedings of the Latent Variable Analysis and Signal Separation, St. Malo, France, 27–30 September 2010; Vigneron, V., Zarzoso, V., Moreau, E., Gribonval, R., Vincent, E., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 629–636.
6. Barachant, A.; Bonnet, S.; Congedo, M.; Jutten, C. Classification of Covariance Matrices Using a Riemannian-Based Kernel for BCI Applications. Neurocomput. 2013, 112, 172–178.
7. Miah, A.S.M.; Islam, M.R.; Molla, M.K.I. EEG classification for MI-BCI using CSP with averaging covariance matrices: An experimental study. In Proceedings of the 2019 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2), Rajshahi, Bangladesh, 1–12 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5.
8. Chen, K.X.; Ren, J.Y.; Wu, X.J.; Kittler, J. Covariance descriptors on a Gaussian manifold and their application to image set classification. Pattern Recognit. 2020, 107, 107463.
9. Porikli, F.; Tuzel, O.; Meer, P. Covariance tracking using model update based on Lie algebra. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), New York, NY, USA, 17–22 June 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 1, pp. 728–735.
10. Sivalingam, R.; Boley, D.; Morellas, V.; Papanikolopoulos, N. Tensor sparse coding for region covariances. In Proceedings of the Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; Proceedings, Part IV 11. Springer: Berlin/Heidelberg, Germany, 2010; pp. 722–735.
11. Jagarlamudi, J.; Udupa, R.; Daumé III, H.; Bhole, A. Improving bilingual projections via sparse covariance matrices. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, 27–29 July 2011; pp. 930–940.
12. Zhang, W.; Fung, P. Discriminatively trained sparse inverse covariance matrices for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 873–882.
13. Cui, Z.; Li, W.; Xu, D.; Shan, S.; Chen, X.; Li, X. Flowing on Riemannian Manifold: Domain Adaptation by Shifting Covariance. IEEE Trans. Cybern. 2014, 44, 2264–2273.
14. Zhang, Z.; Wang, M.; Huang, Y.; Nehorai, A. Aligning infinite-dimensional covariance matrices in reproducing kernel Hilbert spaces for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3437–3445.
15. He, N.; Fang, L.; Li, S.; Plaza, A.; Plaza, J. Remote sensing scene classification using multilayer stacked covariance pooling. IEEE Trans. Geosci. Remote Sens. 2018, 56, 6899–6910.
16. Eklundh, L.; Singh, A. A comparative analysis of standardised and unstandardised principal components analysis in remote sensing. Int. J. Remote Sens. 1993, 14, 1359–1370.
17. Yang, D.; Gu, C.; Dong, Z.; Jirutitijaroen, P.; Chen, N.; Walsh, W.M. Solar irradiance forecasting using spatial-temporal covariance structures and time-forward kriging. Renew. Energy 2013, 60, 235–245.
18. Meyer, K. Factor-analytic models for genotype × environment type problems and structured covariance matrices. Genet. Sel. Evol. 2009, 41, 21.
19. Huang, Z.; Wang, R.; Shan, S.; Chen, X. Face recognition on large-scale video in the wild with hybrid Euclidean-and-Riemannian metric learning. Pattern Recognit. 2015, 48, 3113–3124.
20. Arsigny, V.; Fillard, P.; Pennec, X.; Ayache, N. Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM J. Matrix Anal. Appl. 2007, 29, 328–347.
21. Jayasumana, S.; Hartley, R.; Salzmann, M.; Li, H.; Harandi, M. Kernel methods on the Riemannian manifold of symmetric positive definite matrices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 73–80.
22. Huang, Z.; Wang, R.; Shan, S.; Li, X.; Chen, X. Log-Euclidean metric learning on symmetric positive definite manifold with application to image set classification. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 720–729.
23. Lin, Z. Riemannian geometry of symmetric positive definite matrices via Cholesky decomposition. SIAM J. Matrix Anal. Appl. 2019, 40, 1353–1370.
24. Pennec, X. Manifold-valued image processing with SPD matrices. In Riemannian Geometric Statistics in Medical Image Analysis; Elsevier: Amsterdam, The Netherlands, 2020; pp. 75–134.
25. Ledoit, O.; Wolf, M. A well-conditioned estimator for large-dimensional covariance matrices. J. Multivar. Anal. 2004, 88, 365–411.
26. Kotz, S.; Johnson, N.L. (Eds.) Breakthroughs in Statistics: Methodology and Distribution; Springer Series in Statistics; Perspectives in Statistics; Springer: New York, NY, USA, 1992; Volume II, pp. xxii+600.
27. Villani, C. Topics in Optimal Transportation; Graduate Studies in Mathematics; American Mathematical Society: Providence, RI, USA, 2003; Volume 58, pp. xvi+370.
28. Villani, C. Optimal Transport: Old and New; Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]; Springer: Berlin/Heidelberg, Germany, 2009; Volume 338, pp. xxii+973.
29. Hayashi, M. Quantum Information Theory: Mathematical Foundation, 2nd ed.; Graduate Texts in Physics; Springer: Berlin/Heidelberg, Germany, 2017; pp. xli+636.
30. Bengtsson, I.; Życzkowski, K. Geometry of Quantum States: An Introduction to Quantum Entanglement, 2nd ed.; Cambridge University Press: Cambridge, UK, 2017; pp. xv+619.
31. Oostrum, J.v. Bures-Wasserstein geometry for positive-definite Hermitian matrices and their trace-one subset. Inf. Geom. 2022, 5, 405–425.
32. Bhatia, R. Positive Definite Matrices; Princeton Series in Applied Mathematics; Princeton University Press: Princeton, NJ, USA, 2007; pp. x+254.
33. Bhatia, R.; Jain, T.; Lim, Y. On the Bures–Wasserstein distance between positive definite matrices. Expo. Math. 2019, 37, 165–191.
34. Bhatia, R.; Jain, T.; Lim, Y. Inequalities for the Wasserstein mean of positive definite matrices. Linear Algebra Its Appl. 2019, 576, 108–123.
35. Thanwerdas, Y.; Pennec, X. O(n)-invariant Riemannian metrics on SPD matrices. Linear Algebra Its Appl. 2023, 661, 163–201.
36. Hwang, J.; Kim, S. Two-variable Wasserstein means of positive definite operators. Mediterr. J. Math. 2022, 19, 110.
37. Thanwerdas, Y. Riemannian and Stratified Geometries of Covariance and Correlation Matrices. Ph.D. Thesis, Université Côte d'Azur, Nice, France, 2022.
38. Kim, S.; Lee, H. Inequalities of the Wasserstein mean with other matrix means. Ann. Funct. Anal. 2020, 11, 194–207.
39. Hwang, J.; Kim, S. Bounds for the Wasserstein mean with applications to the Lie-Trotter mean. J. Math. Anal. Appl. 2019, 475, 1744–1753.
40. Massart, E.; Absil, P.A. Quotient geometry with simple geodesics for the manifold of fixed-rank positive-semidefinite matrices. SIAM J. Matrix Anal. Appl. 2020, 41, 171–198.
41. Malagò, L.; Montrucchio, L.; Pistone, G. Wasserstein Riemannian geometry of Gaussian densities. Inf. Geom. 2018, 1, 137–179.
42. Kubo, F.; Ando, T. Means of positive linear operators. Math. Ann. 1980, 246, 205–224.
43. Lee, H.; Lim, Y. Metric and spectral geometric means on symmetric cones. Kyungpook Math. J. 2007, 47, 133–150.
44. Gan, L.; Huang, H. Order relations of the Wasserstein mean and the spectral geometric mean. Electron. J. Linear Algebra 2024, 40, 491–505.
45. Yger, F.; Berar, M.; Lotte, F. Riemannian Approaches in Brain-Computer Interfaces: A Review. IEEE Trans. Neural Syst. Rehabil. Eng. 2017, 25, 1753–1762.
46. Lim, Y.; Pálfia, M. Weighted inductive means. Linear Algebra Appl. 2014, 453, 59–83.
47. Álvarez Esteban, P.C.; del Barrio, E.; Cuesta-Albertos, J.; Matrán, C. A fixed-point approach to barycenters in Wasserstein space. J. Math. Anal. Appl. 2016, 441, 744–762.
48. Bini, D.A.; Iannazzo, B. A note on computing matrix geometric means. Adv. Comput. Math. 2011, 35, 175–192.
49. Jeuris, B.; Vandebril, R.; Vandereycken, B. A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electron. Trans. Numer. Anal. 2012, 39, 379–402.
50. Barachant, A.; Bonnet, S.; Congedo, M.; Jutten, C. Multiclass Brain–Computer Interface Classification by Riemannian Geometry. IEEE Trans. Biomed. Eng. 2012, 59, 920–928.
51. Blankertz, B.; Muller, K.R.; Krusienski, D.J.; Schalk, G.; Wolpaw, J.R.; Schlogl, A.; Pfurtscheller, G.; Millan, J.R.; Schroder, M.; Birbaumer, N. The BCI competition III: Validating alternative approaches to actual BCI problems. IEEE Trans. Neural Syst. Rehabil. Eng. 2006, 14, 153–159.
52. Schlögl, A. GDF-A general dataformat for biosignals. arXiv 2006, arXiv:cs/0608052.
53. Dornhege, G.; Blankertz, B.; Curio, G.; Muller, K.R. Boosting bit rates in noninvasive EEG single-trial classifications by feature combination and multiclass paradigms. IEEE Trans. Biomed. Eng. 2004, 51, 993–1002.
54. Tangermann, M.; Müller, K.R.; Aertsen, A.; Birbaumer, N.; Braun, C.; Brunner, C.; Leeb, R.; Mehring, C.; Miller, K.J.; Mueller-Putz, G.; et al. Review of the BCI competition IV. Front. Neurosci. 2012, 6, 55.
55. Brunner, C.; Leeb, R.; Müller-Putz, G.; Schlögl, A.; Pfurtscheller, G. BCI Competition 2008–Graz Data Set A; IEEE DataPort: Piscataway, NJ, USA, 2008; Volume 16, pp. 1–6.
56. Leeb, R.; Brunner, C.; Müller-Putz, G.; Schlögl, A.; Pfurtscheller, G. BCI Competition 2008–Graz Data Set B; Graz University of Technology: Graz, Austria, 2008; Volume 16, pp. 1–6.
57. Liang, M.; Starrett, M.J.; Ekstrom, A.D. Dissociation of frontal-midline delta-theta and posterior alpha oscillations: A mobile EEG study. Psychophysiology 2018, 55, e13090.
58. Lotte, F.; Guan, C. Regularizing common spatial patterns to improve BCI designs: Unified theory and new algorithms. IEEE Trans. Biomed. Eng. 2010, 58, 355–362.
59. Barachant, A.; Bonnet, S.; Congedo, M.; Jutten, C. BCI Signal Classification using a Riemannian-based kernel. In Proceedings of the 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2012), Bruges, Belgium, 25–27 April 2012; pp. 97–102.
60. Congedo, M.; Barachant, A.; Bhatia, R. Riemannian geometry for EEG-based brain–computer interfaces; a primer and a review. Brain-Comput. Interfaces 2017, 4, 155–174.
61. Ledoit, O.; Wolf, M. Honey, I shrunk the sample covariance matrix. J. Portf. Manag. 2004, 30, 110–119.
Figure 1. (A) Manifold and its tangent space at point $A \in \overline{\mathcal{P}}_n$. (B) Illustration of the Inductive Algorithm.
Figure 2. Empirical distributions of $\Delta_{ij}^{AI}$ and $\Delta_{ij}^{BW}$ under different matrix dimensions (A) and proportions of zero eigenvalues (B).
Figure 3. Empirical distributions of $d_F(M_i^{BW}, \tilde{M}_{ij}^{BW})$ and $d_F(M_i^{AI}, \tilde{M}_{ij}^{AI})$ for three representative pairs of matrices.
Figure 4. Distributions of $\delta_i$ with (A) varying matrix dimension $n$ ($p = 0.2$), (B) varying proportion of zero eigenvalues $p$ ($n = 20$).
Figure 5. Comparison of the efficiency of barycenter estimation algorithms coupled with the BW and AI distances.
Figure 6. Accuracy of BW barycenter estimation. (A) Distribution of $\{SSD\}$, with $n$ being 10 and $m$ being 100. (B) Distribution of $\{SSD_I - SSD_P\}$ with varying $n$ and $m$. (C) Distribution of $\{SSD_C - SSD_P\}$ with varying $n$ and $m$.
Figure 7. Robustness of BW barycenter estimation. (A) Distribution of $\{d_F(M^{BW}, \tilde{M}^{BW})\}$, with $n$ being 20 and varying $m$. (B) Distribution of $\{d_F(M^{AI}, \tilde{M}^{AI}) - d_F(M^{BW}, \tilde{M}^{BW})\}$ with varying $n$ and $m$. All distributions are significantly different from each other (all $p$-values < 0.05).
Figure 8. Change in the sum of squared distances obtained by the Projection Mean Algorithm with the AI and BW metrics on the BCI III IVa dataset.
Table 1. Simulation Parameter Settings.
Simulation Parameter | Values
n: Dimension of Matrices | [5, 10, 20, 30, 50, 100]
m: Number of Matrices | [5, 10, 20, 30, 50, 100]
p: Proportion of zero eigenvalues | [0.1, 0.2, 0.4, 0.6, 0.8]
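For context, one plausible way to generate matrices matching the settings of Table 1 (dimension $n$ with a proportion $p$ of zero eigenvalues) is sketched below; this is an assumed sampling scheme for illustration, not necessarily the exact procedure used in the simulations.

```python
# Illustrative sketch: an n x n PSD matrix with approximately a proportion p of zero eigenvalues.
import numpy as np

def random_psd(n, p, rng):
    """Random PSD matrix of size n with floor(p * n) zero eigenvalues (rank n - floor(p * n))."""
    k = n - int(np.floor(p * n))          # number of strictly positive eigenvalues
    G = rng.standard_normal((n, k))
    return G @ G.T                        # rank k, hence n - k zero eigenvalues

rng = np.random.default_rng(42)
S = random_psd(n=20, p=0.2, rng=rng)
eigvals = np.linalg.eigvalsh(S)
print((eigvals < 1e-10).sum(), "zero eigenvalues out of", len(eigvals))
```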
Table 2. The configurations and classifications of BCI datasets.
Dataset | BCI III IIIa | BCI III IVa | BCI IV IIa | BCI IV IIb | Lab Data
Number of subjects | 3 | 5 | 9 | 9 | 16
Number of channels | 60 | 118 | 22 | 3 | 64
Trials per class | 30/45 | 28–224 | 72 | 60/70/80 | 9–17
Sampling rate | 250 Hz | 1000 Hz | 250 Hz | 250 Hz | 500 Hz
Filter Band | 8–30 Hz | 8–30 Hz | 8–30 Hz | 8–30 Hz | 1–30 Hz
Classification Performance
Accuracy (BW) | 0.68 | 0.72 | 0.70 | 0.66 | 0.99
Accuracy (BW with LWF) | 0.59 | 0.65 | 0.76 | 0.58 | 0.89
Accuracy (AI) | 0.58 | 0.62 | 0.66 | 0.66 | 0.88
Accuracy (AI with LWF) | 0.63 | 0.71 | 0.75 | 0.58 | 0.90