Computational Information Geometry for Binary Classification of High-Dimensional Random Tensors

Evaluating the performance of Bayesian classification in a high-dimensional random tensor is a fundamental problem, usually difficult and under-studied. In this work, we consider two Signal to Noise Ratio (SNR)-based binary classification problems of interest. Under the alternative hypothesis, i.e., for a non-zero SNR, the observed signals are either a noisy rank-R tensor admitting a Q-order Canonical Polyadic Decomposition (CPD) with large factors of size Nq×R, i.e., for 1≤q≤Q, where R,Nq→∞ with R1/q/Nq converge towards a finite constant or a noisy tensor admitting TucKer Decomposition (TKD) of multilinear (M1,…,MQ)-rank with large factors of size Nq×Mq, i.e., for 1≤q≤Q, where Nq,Mq→∞ with Mq/Nq converge towards a finite constant. The classification of the random entries (coefficients) of the core tensor in the CPD/TKD is hard to study since the exact derivation of the minimal Bayes’ error probability is mathematically intractable. To circumvent this difficulty, the Chernoff Upper Bound (CUB) for larger SNR and the Fisher information at low SNR are derived and studied, based on information geometry theory. The tightest CUB is reached for the value minimizing the error exponent, denoted by s⋆. In general, due to the asymmetry of the s-divergence, the Bhattacharyya Upper Bound (BUB) (that is, the Chernoff Information calculated at s⋆=1/2) cannot solve this problem effectively. As a consequence, we rely on a costly numerical optimization strategy to find s⋆. However, thanks to powerful random matrix theory tools, a simple analytical expression of s⋆ is provided with respect to the Signal to Noise Ratio (SNR) in the two schemes considered. This work shows that the BUB is the tightest bound at low SNRs. However, for higher SNRs, the latest property is no longer true.


State-of-the-Art and Problem Statement
Evaluating the performance limit for the "Gaussian information plus noise" binary classification problem is a challenging research topic, see for instance [1][2][3][4][5][6][7]. Given a binary hypothesis problem, the Bayes' decision rule is based on the principle of the largest posterior probability. Specifically, with M q /N q converging towards a finite constant. A standard approach for zero-mean independent Gaussian core and noise tensors, is to define the Signal to Noise Ratio by SNR = σ 2 s /σ 2 where σ 2 s and σ 2 are the variances of the vectorized core and noise tensors, respectively. So, the binary classification can be described in the following way: Under the null hypothesis H 0 , SNR = 0, meaning that the observed tensor contains only noise. Conversely, the alternative hypothesis H 1 is based on SNR = 0, meaning that there exists a multilinear signal of interest. First note that there exists a lack of contribution dealing with classification performance for tensors. Since the exact derivation of the error probability is intractable, the performance of the classification of the core tensor random entries is hard to evaluate. To circumvent this audible difficulty, based on computational information geometry theory, we consider the Chernoff Upper Bound (CUB), and the Fisher information in the context of massive measurement vectors. The error exponent can be minimized at s , which corresponds to the reachable tightest CUB. In general, due to the asymmetry of the s-divergence, the Bhattacharyya Upper Bound (BUB)-Chernoff Information calculated at s = 1/2-cannot solve this problem effectively. As a consequence, we rely on a costly numerical optimization strategy to find s . However, with respect to different Signal to Noise Ratios (SNR), we provide simple analytical expressions of s , thanks to the so-called Random Matrix Theory (RMT). For low SNR, analytical expressions of the Fisher information are given. Note that the analysis of the Fisher information in the context of the RMT has been only studied in recent contributions [37][38][39] for parameter estimation. For larger SNR, analytic and simple expression of the CUB for the CPD and the TKD are provided.
We note that Random Matrix Theory (RMT) has attracted both mathematicians and physicists since they were first introduced in mathematical statistics by Wishart in 1928 [40]. When Wigner [41] introduced the concept of statistical distribution of nuclear energy levels, the subject has started to earn prominence. However, it took until 1955 before Wigner [42] introduced ensembles of random matrices. Since then, many important results in RMT were developed and analyzed, see for instance [43][44][45][46] and the references therein. In the last two decades, research on RMT has been constantly published.
Finally, let us underline that many arguments of this paper differ from the works presented in [47,48]. In [47], we tackled the problem of detection using Chernoff Upper Bound in data of type matrix in the double asymptotic regime. In [48], we established the detection problem in tensor data by analyzing the Chernoff Upper Bound. In [48], we assumed that the tensor follows the Canonical Polyadic Decomposition (CPD), we gave some analysis of Chernoff Upper Bound when the rank of the tensor is much smaller than the dimensions of the tensor. Since [47,48] are conference papers, some proofs have been omitted due to limited space. Therefore, this full paper may share the ideas in [47,48] on Information Geometry (s-divergence, Chernoff Upper Bound, Fisher Information, etc.), but completes [48] in a more general asymptotic regime. Moreover, in this work, we give new analysis in both scenarios (SNR small and large) whereas [48] did not, and the important and difficult new tensor scenario of the Tucker decomposition is considered. This is in our view the main difference because the CPD is a particular case of the more general decomposition of TucKer. Indeed, in the CPD, the core tensor is assumed to be diagonal.

Paper Organisation
The organization of the paper is as follows: In the second section, we introduce some definitions, tensor models, and the Marchenko-Pastur distribution from random matrix theory. The third section is devoted to present Chernoff Information for binary hypothesis test. The fourth section gives the main results on Fisher Information and the Chernoff bound. The numerical simulation results are given in the fifth section. We conclude our work by giving some perspectives in the Section 6. Finally, several proofs of the paper can be found in the appendix.

Algebra of Tensors and Random Matrix Theory (RMT)
In this section, we introduce some useful definitions from tensor algebra and from the spectral theory of large random matrices.

Preliminary Definitions
Definition 1. The Kronecker product of matrices X and Y of size I × J and K × N, respectively is given by We have rank{X ⊗ Y} = rank{X}×rank{Y}.
Definition 3. The q-mode product denoted by × q between a tensor X ∈ R M 1 ×...×M Q and a matrix U ∈ R K×M q is denoted by Definition 4. The q-mode unfolding matrix of size M q × ∏ Q k=1,k =q M k denoted by X (q) = unfold q (X ) of a tensor X ∈ R M 1 ×...×M q is defined according to

Canonical Polyadic Decomposition (CPD)
The rank-R CPD of order Q is defined according to where • is the outer product [25], φ (q) r ∈ R N q ×1 and s r is a real scalar. An equivalent formulation using the q-mode product defined in Definition 3 is where S is the R × · · · × R diagonal core tensor with [S] r,...,r = s r and Φ (q) = [φ (q) R ] is the q-th factor matrix of size N q × R.
The q-mode unfolding matrix for tensor X is given by where S = diag(s) with s = [s 1 , ..., s R ] T and stands for the Khatri-Rao product [25].

Tucker Decomposition (TKD)
The Tucker tensor model of order Q is defined according to .., Q and s m 1 m 2 ...m Q is a real scalar. The q-mode product of X is similar to CPD case, however the q-mode unfolding matrix for tensor X is slightly different M q ] ∈ R N q ×M q and ⊗ stands for Kronecker product. See Figure 1. Following the definitions, we note that the CPD and TKD scenarios imply that vector x in Equation (11) is related either to the structured linear system Φ = Φ (Q) ... Φ (q+1) Φ (q−1) ... Φ (1)

The Marchenko-Pastur Distribution
The Marchenko-Pastur distribution was introduced half a century ago [45] in 1967, and plays a key role in a number of high-dimensional signal processing problems. To help the reader, in this section, we introduce some fundamental results concerning large empirical covariance matrices. Let (v n ) n=1,...,N a sequence of i.i.d zero mean Gaussian random M-dimensional vectors for which E(v n v T n ) = σ 2 I M . We consider the empirical covariance matrix which can be also written as . W N is thus a Gaussian matrix with independent identically distributed N (0, σ 2 N ) entries. When N → +∞ while M remains fixed, matrix W N W T N converges towards σ 2 I M in the spectral norm sense. In the high dimensional asymptotic regime defined by it is well understood that W N W T N − σ 2 I M does not converge towards 0. In particular, the empirical of the eigenvaluesλ 1,N ≥ ... ≥λ M,N of W N W T N does not converge towards the Dirac measure at point λ = σ 2 . More precisely, we denote by ν c,σ 2 the Marchenko-Pastur distribution of parameters (c, σ 2 ) defined as the probability measure Then, the following result holds.

Theorem 1 ([45]
). The empirical eigenvalue value distributionν N converges weakly almost surely towards ν c,σ 2 when both M and N converge towards +∞ in such a way that c N = M N converges towards c > 0. Moreover, it holds thatλ We also observe that Theorem 1 remains valid if W N is not necessarily a Gaussian matrix whose i.i.d. elements have a finite fourth order moment (see e.g., [43]). Theorem 1 means that when ratio M N is not small enough, the eigenvalues of the empirical spatial covariance matrix of a temporally and spatially white noise tend to spread out around the variance of the noise, and that almost surely, for N large enough, all the eigenvalues are located in a neighbourhood of interval [λ − , λ + ]. See Figures 2 and 3.

Formulation Based on a SNR-Type Criterion
We denote by SNR = σ 2 s /σ 2 and p i (·) = p(·|H i ) with i ∈ {0, 1}. The binary classification of the random signal based on the equi-probable binary hypothesis test, s, is is the alternative hypothesis (H 1 ) data-space. Following the above expression, the log-likelihood ratio test Λ(y N ) and the binary classification threshold τ are given by where det(·) and log(·) are respectively the determinant and the natural logarithm.

The Expected Log-likelihood Ratio in Geometry Perspective
We note that the estimated hypothesisĤ is associated to p(y N Ĥ ) = N (0, Σ). Therefore, the expected log-likelihood ratio is defined by is the Kullback-Leibler Divergence (KLD) [10]. The expected log-likelihood ratio test admits to a simple geometric characterization based on the difference of two KLDs [8]. However, it is often difficult to evaluate the performance of the test via the minimal Bayes' error probability P (N) e , since its expression cannot be determined analytically in closed-form [3,8].
The minimal Bayes' error probability conditionally to vector y N is defined as

CUB
According to [24], the relation between the Chernoff Upper Bound and the (average) minimal Bayes' error probability P where the (Chernoff) s-divergence for s ∈ (0, 1) is given bỹ is the moment generating function (mgf) of variable X. The error exponent, denoted byμ(s), is given by the Chernoff information which is an asymptotic characterization on the exponentially decay of the minimal Bayes' error probability. The error exponent is derived thanks to the Stein's lemma according to [13] − lim =μ(s).
As parameter s ∈ (0, 1) is free, the CUB can be tightened by minimizing this parameter: Finally, using Equations (5) and (7), the Chernoff Upper Bound (CUB) is obtained. Instead of solving Equation (7), the Bhattacharyya Upper Bound (BUB) is calculated by Equation (5) and by fixing s = 1/2 . Therefore we have the following relation of order: Lemma 1. The log-moment generating function given by Equation (6) for test of Equation (4) is given bỹ From now on, to simplify the presentation and the numerical results later on, we denote by for all s ∈ [0, 1], the opposites of the log-moment generating function and its limit.

Fisher Information
In the small deviation regime, we assume that δSNR is a small deviation of the SNR. The new binary hypothesis test is where Σ(x) = x × ΦΦ T + I. The s-divergence in the small SNR deviation scenario is written as Lemma 2. The s-divergence in the small deviation regime can be approximated according to where the Fisher information [3] is given by According to Lemma 2, the optimal s-value at low SNR is s At contrary, the optimal s-value for larger SNR is given by the following lemma.
Proof. See Appendix C.

Formulation of the Observation Vector as a Structured Linear Model
The measurement tensor follows a noisy Q-order tensor of size N 1 × . . . × N Q can be expressed as where N is the noise tensor whose entries are assumed to be centered i.i.d. Gaussian, i.e., [N ] n 1 ,...,n Q ∼ N (0, σ 2 ) and the core tensor X follows either CPD or TKD given by Section 2.1.2 and Section 2.1.3, respectively. The vectorization of Equation (10) is given by where n = vec(N (1) ) and x = vec(X (1) ). Note that Y (1) , N (1) and X (1) are respectively the first unfolding matrices given by Definition 4 of tensors Y, N and X ,

1.
When tensor X follows a Q-order CPD with a canonical rank of M, we have When tensor X follows a Q-order TKD of multilinear rank of {M 1 , . . . , M Q }, we have

The CPD Case
We recall that in the CPD case, matrix Φ = Φ (Q) . . . Φ (1) and (Φ (q) ) q=1,...,Q are matrices of size N q × R. In the following, we assume that matrices Φ (q) q=1,...,Q are random matrices with Gaussian N (0, 1 N q ) variate entries. We evaluate the behavior of µ N (s) N when (N q ) q=1,...,Q converge towards +∞ at the same rate and that R N converges towards a non zero limit.

Result 1.
In the asymptotic regime where N 1 , . . . , N Q converge towards +∞ at the same rate and where R → +∞ in such a way that c R = R N converges towards a finite constant c > 0, it holds that with a.s standing for "almost sure convergence" and with u(x) = 1 Proof. See Appendix D.

Small SNR Deviation Scenario
In this section, we assume that SNR is small. Under this regime, we have the following result:

Result 2.
In the small SNR scenario, the Fisher information for CPD is given as Proof. Using Lemma 2, we can notice that converges a.s towards the second moment of the Marchenko-Pastur distribution which is 1 + c (see for instance [43]).
Note that µ 1 2 is the error exponent related to the Bhattacharyya divergence.

Large SNR Deviation Scenario
Result 3. In case of large SNR, the minimizer of Chernoff Information is given by Proof. It is straightforward to notice that The last equality can be obtained as in [50]. Using Lemma 3, we get immediately Equation (14).

Remark 3.
It is interesting to note that for c → 0 or 1, the optimal s-value follows the same approximated relation given by as long as SNR exp [1] or equivalently a SNR in dB much larger than 4 dB.
Proof. It is straightforward to note that Using Equation (14) and condition SNR exp [1], the desired result is proved.

Approximated Analytical Expressions for c 1 and Any SNR
In the case of low rank CPD where its rank R is supposed to be small compared to N, it is realistic to assume c 1 since R N.

Result 4.
Under this regime, the error exponent can be approximated as follows:

Proof. See Appendix E.
It is easy to notice that the second-order derivative of µ(s) is strictly positive. Therefore, µ(s) is a strictly convex function over interval (0, 1). As a consequence, µ(s) admits at most one global minimum. We denote by s , the global minimizer and obtained by zeroing the first-order derivative of the error exponent. This optimal value is expressed as The two following scenarios can be considered: • At low SNR, we denote by µ(s ), the error exponent associated with the tightest CUB, coincides with the error exponent associated with the BUB. To see this, when c 1, we derive the second-order approximation of the optimal value s in Equation (15) Result 1 and the above approximation allow us to get the best error exponent at low SNR and c 1, • Contrarily, when SNR → ∞, s → 1. As a consequence, the optimal error exponent in this regime is not the BUB anymore. Assuming that log SNR SNR → 0, Equation (15) in Result 4 provides the following approximation of the optimal error exponent for large SNR (1 − log SNR + log log(1 + SNR)) .

Result 5.
In the asymptotic regime where M q < N q , 1 ≤ q ≤ Q and M q , N q converge towards +∞ at the same rate such that where ν c q are Marchenko-Pastur distributions of parameters (c q , 1) defined as in Equation (1).

Remark 4.
We can notice that for Q = 1, the result 5 is similar to result 1. However, when Q ≥ 2, the integrals in Equation (16) are not tractable in a closed-form expression. For instance, let Q = 2, we consider the integral We can notice that this integral is characterized by elliptic integral (see e.g., [51]). As a consequence, it cannot be expressed in closed-form. However, numerical computations can be exploited to solve efficiently the minimization problem of Equation (7).

Large SNR Deviation Scenario
Result 6. In case of large SNR, the minimizer of Chernoff Information for TKD is given by Proof. We have that Using Lemma 3, we get immediately Equation (17).

Small SNR Deviation Scenario
Under this regime, we have the following results

Result 7.
For small SNR deviation, the Chernoff information for the TKD is given by Proof. Using Lemma 2, we can notice that Each term in the product converges a.s towards the second moment of Marchenko-Pastur distributions ν c q which are 1 + c q and M N converges to ∏ Q q=1 c q . This proves the desired result.

Remark 5.
Contrary to the Remark 3, it is interesting to note that for c 1 = c 2 = ... = c Q = c and c → 0 or 1, the optimal s-value follows different approximated relation given by which does not depend on Q, and In practice, when c is close to 1, we have to carefully check if Q is in the neighbourhood of log(SNR). As we can see that, when log SNR − Q < 0 or 0 < log SNR − Q < 1, following the above approximation, s ∈ [0, 1].

Numerical Illustrations
In this section, we consider cubic tensors of order Q = 3 with N 1 = 10, N 2 = 20, N 3 = 30, R = 3000 following a CPD and M 1 = 100, M 2 = 120, M 3 = 140, N 1 = N 2 = N 3 = 200 for the TKD, respectively.  Firstly, for the CPD model, in Figure 4, parameter s is drawn with respect to the SNR in dB. The parameter s is obtained thanks to three different methods. The first one is based on the brute force/exhaustive computation of the CUB by minimizing the expression in Equation (8) with Φ = Φ . This approach has a very high computational cost especially in our asymptotic regime (for a standard computer with Intel Xeon E5-2630 2.3 GHz and 32 GB RAM, it requires 183 h to establish 10,000 simulations). The second approach is based on the numerical optimization of the closed-form expression of µ(s) given in Result 4. In this scenario, the drawback in terms of the computational cost is largely mitigated since it consists of a minimization of a univariate regular function. Finally, under the hypothesis that SNR is large, typically >30 dB, the optimal s-value, s , is derived by an analytic expression given by Equation (15). We can check that the proposed semi-analytic and analytic expressions are in good agreement with the brute-force method for a lowest computational cost. It turns out that the mean square relative errors are in mean of order −40 dB. We can conclude that the estimator s is a consistent estimator of s .
In Figure 5, we draw various s-divergences: µ 1 2 , µ(s ), 1 N µ N 1 2 , 1 N µ N (ŝ). We can observe the good agreement with the proposed theoretical results. The s-divergence obtained by fixing s = 1 2 is accurate only at small SNR but degrades when SNR grows large.
For the TKD scenario, we follow the same methodology as above for CPD, Figures 7 and 8 all agree with the analysis provided in Section 4.3.   For TKD scenario, the mean square relative error is in mean of order −40 dB. So, we check numerically the consistency of the estimator of the optimal s-value.
We can also notice that the convergence of µ N (s) N towards its deterministic equivalent µ(s) in the case TKD is faster than in the case CPD, since the dimension of matrix Φ ⊗ is 200, 200, 200 × 100, 120, 140 (N = 200 3 ) which is much larger than the dimension 6000 × 3000 of Φ (N = 6000).

Conclusions
In this work, we derived and studied the limit performance in terms of minimal Bayes' error probability for the binary classification of high-dimensional random tensors using both the tools of Information Geometry (IG) and of Random Matrix Theory (RMT). The main results on Chernoff Bounds and Fisher Information are illustrated by Monte-Carlo simulations that corroborated our theoretical analysis.
For future work, we would like to study the rate of convergence and the fluctuation of the statistics µ N (s) N andŝ.
Author Contributions: Gia-Thuy Pham, Rémy Boyer and Frank Nielsen contributed to the research results presented in this paper. Gia-Thuy Pham and Rémy Boyer performed the numerical experiments. All authors have read and approved the final manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proof of Lemma 1
The s-divergence in Equation (6) for the following binary hypothesis test is given by [15]:μ Using the expressions of the covariance matrices Σ 0 and Σ 1 , the numerator in Equation (A1) is given by N log σ 2 + log det SNR × (1 − s)ΦΦ T + I and the two terms at its numerator are log[det Σ 0 ] s = sN log σ 2 and Using the above expressions,mu N (s) is given by Equation (8).
Using the above expression, the s-divergence is given by Now, using Equation (8), and the following approximation: where the Fisher information for y|δSNR ∼ N (0, Σ(δSNR)) is given by [3]:

Appendix C. Proof of Theorem 3
The first step of the proof is based on the derivation of an alternative expression of µ s (SNR) given by Equation (A1) involving the inverse of the covariance matrices Σ 0 and Σ 1 . Specifically, we have The second step is to derive a closed-form expression in the high SNR regime using the following the approximation (see [52] for instance): As sΦΦ † is a rank-K projector matrix scaled by factor s > 0, its eigen-spectrum is given by s, . . . , s K , 0, . . . , 0 N−K . In addition, as the rank-N identity matrix and the scaled projector sΦΦ † can be diagonalized in the same orthonormal basis matrix, the n-th eigenvalue of the inverse of matrix I N − sΦΦ † is given by with s ∈ (0, 1). Using the above property, we obtain In addition, we have Finally, thanks to Equation (A2), we have Finally, to obtain s in Equation (9), we solve ∂µ s (SNR) ∂s = 0.

Appendix D. Proof of Result 1
The asymptotic behavior of µ N (s) N when N q → +∞ for each q = 1, . . . , Q, R → +∞ in such a way that R 1/q N q converge towards a non zero constant for each q = 1, . . . , Q can be obtained thanks to large random matrix theory. We suppose that N 1 , . . . , N Q converge towards +∞ at the same rate (i.e., N q N p converge towards a non zero constant for each (p, q)), and c R = R N converges towards a constant c > 0. Under this regime, the empirical eigenvalue distribution of covariance matrix Φ (Φ ) T is known to converge towards the so-called Marcenko-Pastur distribution. By Section 2.2, we recall that the Marcenko-Pastur distribution ν c (dλ) is defined as λ−z the Stieltjes transform of ν c . We have that t c (z) satisfies the equation When z ∈ R − * , i.e., z = −ρ, with ρ > 0, it is well known that t c (ρ) is given by It was established for the first time in [45] that if X represents a K × P random matrix with zero mean and 1 K variance i.i.d. entries, and if (λ k ) k=1,...,K represent the eigenvalues of XX T arranged in decreasing order, then 1 K ∑ K k=1 δ(λ − λ k ), the empirical eigenvalue distribution of XX T converges weakly almost surely towards ν c , under the regime K → +∞, P → +∞, P K → c. In addition, we have the following property, for each continuous function f (λ) Practically, when K and P are large enough, the histogram of the eigenvalues of each realization of XX T accumulates around the graph of the probability density of ν c .
log SNR × λ  We obtain easily Result 5.