Computing the Partial Correlation of ICA Models for Non-Gaussian Graph Signal Processing

Conventional partial correlation coefficients (PCC) were extended to the non-Gaussian case, in particular to independent component analysis (ICA) models of the observed multivariate samples. Accordingly, the usual methods that define the pairwise connections of a graph from the precision matrix were correspondingly extended. The basic concept involves replacing the implicit linear estimation of conventional PCC with a nonlinear estimation (the conditional mean) under an ICA model. The correlation between a given pair of nodes induced by the remaining nodes is thus better removed, and the specific connectivity weights can be better estimated. Synthetic and real data examples illustrate the approach in a graph signal processing context.


Background
The partial correlation coefficient (PCC) [1] is a classical concept that is relevant in a variety of statistical signal processing problems. Essentially, the PCC measures the correlation between two random variables conditioned on other observed random variables. Interest in it has recently increased due to the emergence of graph signal processing (GSP) [2][3][4]. One key aspect of GSP is defining the graph connectivity. Although this can be done by considering the natural interactions of the context where the graph signal is defined (e.g., temporal or spatial proximity between two nodes), it is desirable to develop formal statistical methods; that is, given a set of multivariate samples where every sample component is assigned to a node of the graph, the graph connectivity that best describes the implicit dependences between any two nodes can be learned. PCCs are appropriate candidates to define the connectivity, as the effect of the rest of the nodes is removed from the pairwise correlation. Actually, the PCC is formally interpreted as the correlation between the residuals obtained after optimal estimation of the values of the two involved nodes from the rest of the nodes, where optimality is in the sense of minimum linear mean square error (LMSE). Fortunately, it is not necessary to perform an explicit estimation, as the PCCs can be computed from the so-called precision matrix (the inverse of the covariance matrix). Thus, many efforts have focused on estimating the precision matrix, both in GSP [5,6] and in statistics [7][8][9][10][11][12]. However, the minimum LMSE estimator is optimal only if Gaussianity can be assumed. Accordingly, we can say that methods based on the precision matrix are suboptimal in non-Gaussian scenarios. The concept of PCC is extended to a Gaussian mixture model (GMM) in [13]. Apart from this work, and to our knowledge, there have been no other attempts to consider non-Gaussian models in graph connectivity learning.

New Contributions and Paper Organization
In this work we consider the partial correlation computation under a non-Gaussian model, in particular, a model of independent component analysis (ICA). ICA [14][15][16][17] is a consolidated technique which has found a myriad of applications in statistical signal processing (e.g., blind source separation [18][19][20][21][22][23][24][25]) and pattern recognition (see [26][27][28][29][30][31] and references therein). From the perspective of this work, ICA is a model which incorporates non-Gaussianity through some independent variables (sources), which are linearly mixed to create the observed samples. This makes it highly versatile and allows for the modeling of non-Gaussian multivariate densities.
In the next section we define a new partial correlation coefficient: ICA-PCC. The basic concept is to replace the implicit linear estimation of conventional PCC with a nonlinear estimation (the conditional mean) assuming an underlying ICA model. A general formula is then presented to compute the residual covariance matrix from which the ICA-PCCs are computed. An essential part of this formula is a diagonal matrix whose entries are the mean square errors of estimating the sources of the ICA model. Then, in Section 3, a practical method is presented to estimate such a matrix from the ICA model parameters. Finally, Section 4 includes some simulations to illustrate the improved estimation of the partial correlation by ICA-PCC in non-Gaussian scenarios. A real data example with highly non-Gaussian multichannel EEG signals is also included to quantify changes in brain connectivity between normal and abnormal states of a patient during sleep.

Statement of the Problem
Let x = [x_1 . . . x_N]^T be the observation vector, having covariance matrix E[xx^T] = C_xx. We assume that x obeys an ICA model,

x = U s, \qquad (1)

where s = [s_1 . . . s_N]^T is a vector of independent sources and U is a square and invertible mixing matrix (W = U^{-1} is the de-mixing matrix). The sources are considered standardized (zero mean and unit variance), but otherwise they may have different non-Gaussian marginal densities, which factorize the joint probability density function (pdf): p(s) = p(s_1) · . . . · p(s_N). Notice that

C_{xx} = E[U s s^T U^T] = U U^T. \qquad (2)

Every component of x is assigned to a node of a graph G{V, E, A}, where V is the set of N nodes, E is the set of edges connecting the nodes, and A is the adjacency matrix. The generic element a_nm is the weight (assumed real and nonnegative) corresponding to the edge connecting node m to node n. We will consider undirected graphs, so a_nm = a_mn. The problem is to learn A from an available set of observation vectors. PCCs are reasonable candidates, as they can measure the correlation between two nodes while removing the effect of the rest of the nodes. Moreover, PCCs can be computed from the precision matrix Q = C_xx^{-1}:

\rho_{nm}^{PCC} = \frac{-q_{nm}}{\sqrt{q_{nn}\, q_{mm}}}, \qquad (3)

where q_nm is the nm element of matrix Q and ρ^PCC_nm is the PCC of nodes n and m. Equation (3) could be used for any underlying joint probability density p(x); however, it is optimal only in the Gaussian case. This is because the formal definition of ρ^PCC_nm is given by

e_n = x_n - E[x_n | x_{-nm}], \quad e_m = x_m - E[x_m | x_{-nm}], \qquad (4)

\rho_{nm} = \frac{E[e_n e_m]}{\sqrt{E[e_n^2]\, E[e_m^2]}}, \qquad (5)

where x_{-nm} denotes the observation vector with components x_n and x_m removed, and the conditional means E[x_n | x_{-nm}], E[x_m | x_{-nm}] depend on the specific model assumed for p(x). In this paper we consider the ICA model (1). The corresponding partial correlation coefficient is then called ICA-PCC, and it is represented by ρ^{ICA-PCC}_nm to be specific with respect to the general definition (5).
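For reference, the conventional precision-matrix route of Equation (3) can be sketched in a few lines of Python (a minimal illustration for comparison purposes, not the proposed method; the function name is ours):

```python
import numpy as np

def pcc_from_precision(C):
    """Classical PCCs from the precision matrix Q = C^{-1}:
    rho_nm = -q_nm / sqrt(q_nn * q_mm)  (Equation (3))."""
    Q = np.linalg.inv(C)
    d = np.sqrt(np.diag(Q))
    R = -Q / np.outer(d, d)
    np.fill_diagonal(R, 1.0)  # diagonal set to 1 by convention
    return R
```

For a diagonal covariance (independent components) all off-diagonal PCCs are zero, and for N = 2 (empty conditioning set) the PCC reduces to the ordinary correlation coefficient, as expected.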

A General Formula for the Residual Covariance
Let us define the vector x_nm = [x_n x_m]^T. We can express x in the form

x = T_{-nm}\, x_{-nm} + T_{nm}\, x_{nm}, \qquad (6)

where T_{-nm} is a matrix of dimension (N × (N − 2)) obtained from the (N × N) identity matrix by dropping the n-th and m-th columns. Similarly, T_{nm} is a matrix of dimension (N × 2) obtained from the (N × N) identity matrix by dropping all but the n-th and m-th columns. Let us also define the residual vector e_nm = [e_n e_m]^T, with e_n = x_n − E[x_n | x_{-nm}] and e_m = x_m − E[x_m | x_{-nm}]. Notice that the conditional mean is an unbiased estimator; hence, the residuals are zero mean and the residual covariance matrix will be

C_{e_{nm} e_{nm}} = E[e_{nm} e_{nm}^T]. \qquad (7)

We want to compute the residual covariance matrix so that (5) can be applied. We assume an ICA model. First notice that e_nm = x_nm − E[x_nm | x_{-nm}], but considering (1) and (6), we may write

s = W x = W T_{-nm}\, x_{-nm} + W T_{nm}\, x_{nm}. \qquad (8)

Then, we can solve for x_nm and take the conditional mean:

E[x_{nm} | x_{-nm}] = (W T_{nm})^{+} \big( E[s | x_{-nm}] - W T_{-nm}\, x_{-nm} \big), \qquad (9)

where (·)^+ is the Moore–Penrose (left) pseudoinverse. Thus, in (9) we are expressing the conditional mean of x_nm in terms of the conditional mean of the sources and the ICA model parameters. This allows us to derive the following general formula, which in spite of its simplicity requires a rather tedious derivation that can be found in Appendix A:

C_{e_{nm} e_{nm}} = (W T_{nm})^{+}\, M_{nm}\, \big((W T_{nm})^{+}\big)^T, \qquad (10)

where M_nm is an (N × N) diagonal matrix having in its main diagonal the MSEs of optimally estimating the sources from x_{-nm}, that is,

mse_{nmi} = E\big[ (s_i - E[s_i | x_{-nm}])^2 \big]. \qquad (11)

Moreover, from (A7) (see Appendix A) we know that mse_nmi = 1 − var[E[s_i | x_{-nm}]]; hence, considering that, by definition, mse_nmi and var[·] are nonnegative quantities, we conclude that 0 ≤ mse_nmi ≤ 1.
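Equations (6)–(11) translate directly into a small numerical sketch (Python; the helper names `selector_matrices`, `residual_covariance`, and `ica_pcc` are ours, and W and M_nm are assumed to be already available):

```python
import numpy as np

def selector_matrices(N, n, m):
    """T_nm keeps columns n, m of I_N; T_{-nm} drops them (0-based n, m)."""
    I = np.eye(N)
    drop = [i for i in range(N) if i not in (n, m)]
    return I[:, [n, m]], I[:, drop]

def residual_covariance(W, M_nm, n, m):
    """C_{e_nm} = (W T_nm)^+ M_nm ((W T_nm)^+)^T  -- Equation (10)."""
    N = W.shape[0]
    T_nm, _ = selector_matrices(N, n, m)
    P = np.linalg.pinv(W @ T_nm)          # (2 x N) left pseudoinverse
    return P @ M_nm @ P.T

def ica_pcc(W, M_nm, n, m):
    """ICA-PCC of nodes n and m via (5) applied to (10)."""
    Ce = residual_covariance(W, M_nm, n, m)
    return Ce[0, 1] / np.sqrt(Ce[0, 0] * Ce[1, 1])
```

As a sanity check, with W = I and M_nm = I (i.e., x = s, with each source independent of x_{-nm}), the residual covariance is the 2 × 2 identity and the ICA-PCC is zero, as expected for independent nodes.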
Notice that (10) is a combination of the contributions of every source to the residual covariance matrix. This can be better seen by expressing (10) in the alternative form

C_{e_{nm} e_{nm}} = \sum_{i=1}^{N} mse_{nmi}\; u^{+}_{nmi}\, u^{+T}_{nmi}, \qquad (12)

where u^+_nmi is the i-th column of (W T_nm)^+. Notice that W T_nm is an (N × 2) matrix formed by the n-th and m-th columns of W, i.e., by the coefficients that define the (de-mixing) contributions of x_n and x_m to s. Thus, (W T_nm)^+ is a (2 × N) matrix, u^+_nmi is a (2 × 1) vector, and u^+_nmi u^{+T}_nmi is a (2 × 2) matrix that can be interpreted as the contribution of source s_i to C_{e_nm e_nm}. This contribution is weighted by mse_nmi. Thus, mse_nmi = 0 indicates that source s_i is perfectly estimated from x_{-nm}; hence, s_i does not contribute to the partial correlation between x_n and x_m. At the other extreme, mse_nmi = 1 indicates that s_i is independent of x_{-nm}, so it has maximum contribution to the partial correlation between x_n and x_m.

Estimating the ICA Partial Correlation Coefficients
So, according to (5) and (7), we have to estimate C_{e_nm e_nm}. Considering (10), we need estimates of W and M_nm. Estimates of W, the ICA model parameters, can be obtained using a variety of algorithms [14][15][16][17][26][27][28][29][30][31]; so, in the following, we concentrate on the estimation of M_nm, i.e., of the conditional means

E[s_i | x_{-nm}], \quad i = 1 \ldots N. \qquad (13)

We will consider a particular form of the Wiener structure, which was proposed in [32], namely:

E[s_i | x_{-nm}] \approx E[s_i | \hat{s}^{l}_i], \qquad (14)

where ŝ^l_i is the LMSE estimator of s_i from x_{-nm} (we drop the dependence on nm to ease the notation), and the one-dimensional conditional mean can be approximated by the expansion [32,33]

E[s_i | \hat{s}^{l}_i] \approx E[s_i] + E[s_i \hat{s}^{l}_{in}]\, \hat{s}^{l}_{in} + \tfrac{1}{2} E\big[s_i ((\hat{s}^{l}_{in})^2 - 1)\big]\, \big((\hat{s}^{l}_{in})^2 - 1\big) + \ldots, \qquad (15)

where ŝ^l_{in} = ŝ^l_i / \sqrt{var[\hat{s}^l_i]} is a standardized Gaussian random variable (this is justified in [32] by using the central limit theorem). Let us approximate (15) by its first terms:

E[s_i | \hat{s}^{l}_i] \approx E[s_i] + E[s_i \hat{s}^{l}_{in}]\, \hat{s}^{l}_{in} + \tfrac{1}{2} E\big[s_i ((\hat{s}^{l}_{in})^2 - 1)\big]\, \big((\hat{s}^{l}_{in})^2 - 1\big). \qquad (16)

Taking into account that E[s_i ŝ^l_i] = E[ŝ^l_i ŝ^l_i] = var[ŝ^l_i] (due to the orthogonality between the estimation error and the linear estimate), the second term in (16) reduces to E[s_i ŝ^l_{in}] ŝ^l_{in} = ŝ^l_i. As any LMSE estimator is unbiased, we know that E[ŝ^l_i] = E[s_i] = 0. Then, we can express the conditional mean in (16) as the combination of a linear term ŝ^l_i plus a nonlinear term ŝ^nl_i:

E[s_i | x_{-nm}] \approx \hat{s}^{l}_i + \hat{s}^{nl}_i, \qquad \hat{s}^{nl}_i = \tfrac{1}{2} E\big[s_i ((\hat{s}^{l}_{in})^2 - 1)\big]\, \big((\hat{s}^{l}_{in})^2 - 1\big). \qquad (17)

Let us now compactly express the estimation of s from x_{-nm} in vector form:

E[s | x_{-nm}] \approx \hat{s}^{l} + \hat{s}^{nl}. \qquad (19)

We can write

M^{l}_{nm} = \mathrm{diag}\big( E[(s - \hat{s}^{l})(s - \hat{s}^{l})^T] \big) = I - \mathrm{diag}\big( E[\hat{s}^{l} \hat{s}^{lT}] \big), \qquad (20)

where M^l_nm is a diagonal matrix whose elements are the MSEs corresponding to the linear estimation of s_i from x_{-nm}, that is, mse^l_nmi = E[(s_i − ŝ^l_i)^2] = 1 − var[ŝ^l_i]. In (20), we have considered the orthogonality between the error vector and the estimate vector. Moreover, ŝ^l is the minimum LMSE estimate, so it can be obtained from the Wiener–Hopf equation:

\hat{s}^{l} = C_{s x_{-nm}}\, C^{-1}_{x_{-nm} x_{-nm}}\, x_{-nm}. \qquad (21)

Taking into account that

C_{s x_{-nm}} = E[s\, x_{-nm}^T] = U^T T_{-nm}, \qquad C_{x_{-nm} x_{-nm}} = T^T_{-nm}\, C_{xx}\, T_{-nm}, \qquad (22)

we can finally express M^l_nm in terms of the ICA model parameters:

M^{l}_{nm} = I - \mathrm{diag}\big( U^T T_{-nm} (T^T_{-nm} C_{xx} T_{-nm})^{-1} T^T_{-nm} U \big). \qquad (23)

Let us now consider the nonlinear term in (19). First, notice that ŝ^nl_i can be interpreted as a linear estimate of s_i from (ŝ^l_{in})^2, because, assuming that ŝ^l_{in} is a standardized Gaussian random variable, (ŝ^l_{in})^2 is χ^2 distributed, having a mean equal to 1 and a variance equal to 2.
Hence, we can apply orthogonality again: ŝ^nl_i is the LMSE estimate of s_i from (ŝ^l_{in})^2 − 1, so E[(s_i − ŝ^nl_i) ŝ^nl_i] = 0 and

E[(\hat{s}^{nl}_i)^2] = E[s_i \hat{s}^{nl}_i]. \qquad (24)

Consequently, we have

mse_{nmi} = E\big[(s_i - \hat{s}^{l}_i - \hat{s}^{nl}_i)^2\big] = mse^{l}_{nmi} - 2 E[(s_i - \hat{s}^{l}_i)\hat{s}^{nl}_i] + E[(\hat{s}^{nl}_i)^2], \qquad E[(s_i - \hat{s}^{l}_i)\hat{s}^{nl}_i] = E[s_i \hat{s}^{nl}_i] - E[\hat{s}^{l}_i \hat{s}^{nl}_i]. \qquad (25)

The second term in (25) is zero, because

Entropy 2019, 21, 22

E[\hat{s}^{l}_i \hat{s}^{nl}_i] \propto E\big[\hat{s}^{l}_{in} ((\hat{s}^{l}_{in})^2 - 1)\big] = E[(\hat{s}^{l}_{in})^3] - E[\hat{s}^{l}_{in}] = 0, \qquad (26)

because we assume that ŝ^l_{in} is Gaussian, so its odd moments are zero. Regarding the first term in (25), and taking into account that var[ŝ^l_i] = 1 − mse^l_nmi, we obtain

E[s_i \hat{s}^{nl}_i] = \tfrac{1}{2} \big( E[s_i ((\hat{s}^{l}_{in})^2 - 1)] \big)^2 = \frac{\big( E[s_i (\hat{s}^{l}_i)^2] \big)^2}{2\, (1 - mse^{l}_{nmi})^2}, \qquad (27)(28)

and, considering (21) and (23), (28) can be expressed in terms of the ICA model parameters:

E[s_i \hat{s}^{nl}_i] = \frac{\big( E[s_i (c_i^T x_{-nm})^2] \big)^2}{2\, (1 - mse^{l}_{nmi})^2}, \qquad c_i^T = \text{$i$-th row of } C_{s x_{-nm}} C^{-1}_{x_{-nm} x_{-nm}}. \qquad (29)

So, in conclusion, combining (24) and (25), we can express the matrix M_nm as

M_{nm} = M^{l}_{nm} - \mathrm{diag}\big( E[s\, \hat{s}^{nl\,T}] \big), \qquad (30)

where M^l_nm and diag(E[s ŝ^{nl T}]) can be obtained from (23) and (29), respectively, using estimates Ŵ of the model parameters and a sample mean to evaluate the expectation required in (29). Algorithm 1 below describes the estimation procedure.

Equation (30) provides an interesting decomposition of mse_nmi. Let mse^l_nmi denote the entries of the diagonal matrix M^l_nm; then mse_nmi can be expressed as mse^l_nmi minus a nonnegative term (see Equation (29)), so that mse_nmi ≤ mse^l_nmi. The condition mse_nmi = mse^l_nmi ⇔ M_nm = M^l_nm holds in the Gaussian case, because then E[s_i (ŝ^l_{in})^2] (see Equation (27)) becomes zero (it is an odd higher-order moment of a multivariate Gaussian variable). In such a case, E[s | x_{-nm}] = ŝ^l becomes a linear function of x_{-nm}, and so does E[x | x_{-nm}] in (9). Hence, the second term in (30) is responsible for the improved removal of the influence of x_{-nm} in the estimation of the partial correlation between x_n and x_m in the non-Gaussian case. Moreover, we should expect similar results for ICA-PCC and PCC in the Gaussian case.
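Since the body of Algorithm 1 is not reproduced here, the following Python sketch illustrates one way the estimation of M_nm could be implemented: the linear MSEs come from sample-covariance Wiener estimation (equivalent to the closed form above when the sample covariances are exact), and the nonlinear correction is evaluated with sample means. The function name, the zero-mean assumption on X, and the clipping to [0, 1] are our assumptions.

```python
import numpy as np

def estimate_M_nm(X, W, n, m):
    """Estimate the diagonal MSE matrix M_nm from zero-mean samples X
    (P x N) and a de-mixing estimate W (0-based node indices n, m).
    M_nm = M^l_nm - diag(E[s_i * shat_nl_i])."""
    S = X @ W.T                                 # recovered sources, P x N
    keep = [i for i in range(X.shape[1]) if i not in (n, m)]
    Xr = X[:, keep]                             # x_{-nm}, P x (N-2)
    # LMSE (Wiener) estimate of s from x_{-nm}
    Csx = S.T @ Xr / len(X)                     # sample C_{s x_{-nm}}
    Cxx = Xr.T @ Xr / len(X)                    # sample C_{x_{-nm} x_{-nm}}
    G = Csx @ np.linalg.inv(Cxx)                # Wiener gains
    Sl = Xr @ G.T                               # linear estimates shat^l
    mse_l = 1.0 - np.einsum('ij,ij->i', G, Csx) # diagonal of M^l_nm
    # nonlinear correction from the standardized square (chi^2) term
    var_l = np.maximum(Sl.var(axis=0), 1e-12)
    Z2 = Sl**2 / var_l - 1.0                    # (shat^l_in)^2 - 1, zero mean
    b = np.mean(S * Z2, axis=0)                 # E[s_i ((shat^l_in)^2 - 1)]
    correction = b**2 / 2.0                     # E[s_i * shat^nl_i] >= 0
    return np.diag(np.clip(mse_l - correction, 0.0, 1.0))
```

With non-Gaussian sources the correction term is strictly positive for most entries, so the diagonal of the returned matrix lies below the purely linear MSEs, as the decomposition above predicts.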

Synthetic Data Experiments
In this experiment, we evaluated the influence of the training set size on the estimation of ρ^{ICA-PCC}_nm, and compared the quality of the estimate with that obtained from the precision matrix.
To this aim, we generated synthetic data corresponding to three different ICA models. In the first one, the sources s_i were independent and identically distributed (i.i.d.) random variables having a unit-variance zero-mean uniform pdf. This corresponds to an example of a sub-Gaussian distribution, as the excess kurtosis is negative, κ − κ_G = −1.2, where κ is the kurtosis and κ_G = 3 is the kurtosis of a Gaussian pdf. In the second model, the sources s_i were i.i.d. random variables having a unit-variance zero-mean Laplacian pdf. This is an example of a super-Gaussian distribution, as the excess kurtosis is positive, κ − κ_G = 3. In the third model, some sources are uniform and the rest are Laplacian. Finally, we also considered the Gaussian case by generating sources having a standard Gaussian pdf. Figure 1 shows the errors corresponding to the estimation of ρ^{ICA-PCC}_nm for the four models. Every curve is an average of 10 curves corresponding to 10 different runs. In every run, an ICA matrix U = W^{-1} was randomly selected; every entry was obtained by sampling a standard Gaussian pdf. Then, a varying number of training vectors x was generated from source vectors s having independent components sampled from the mentioned marginal pdfs: sub-Gaussian (Figure 1a), super-Gaussian (Figure 1b), mixed sub/super-Gaussian (Figure 1c), and Gaussian (Figure 1d).

The error was computed as

\in^{ICA\text{-}PCC} = \frac{2}{N(N-1)} \sum_{n < m} \big|\, |\rho^{ICA\text{-}PCC}_{nm}| - |\hat{\rho}^{ICA\text{-}PCC}_{nm}| \,\big|,

and averaged over the 10 runs for every training set size. Notice that 0 ≤ ∈^{ICA-PCC} ≤ 1. The reference value ρ^{ICA-PCC}_nm was obtained from Algorithm 1 using the true matrix W and an extremely large number of instances for the sample mean required to compute the expectation in (29). On the other hand, ρ̂^{ICA-PCC}_nm was computed from Algorithm 1 using estimates of W obtained with the corresponding finite training set. We used the Extended Infomax algorithm described in [34] and the JADE algorithm [35]. Extended Infomax is an extension of the Infomax algorithm [36] designed to deal with mixed sub/super-Gaussian sources. It is representative of algorithms that iteratively optimize a defined cost function, like FastICA. JADE is based on matrix computation and diagonalization, so it is not sensitive to initialization or optimization-path problems. The same finite training set was also used to evaluate the expectation in (29). In all cases we considered N = 20. For comparison, we also computed the error ∈^{PCC}, which corresponds to the PCCs obtained from empirical estimates Q̂ of the precision matrix as indicated in (3): ρ̂^{PCC}_nm = −q̂_nm / \sqrt{q̂_nn q̂_mm}. Notice that it also holds that 0 ≤ ∈^{PCC} ≤ 1.

Several conclusions may be drawn from Figure 1. First, we can see that in the non-Gaussian cases (a), (b), and (c), PCC cannot decrease the error with increased training set size. This demonstrates the model mismatch due to the implicit Gaussianity of PCC. In these three cases, the ICA-PCC methods improve on PCC after a sufficient number of training samples and maintain a decreased error as the training set size increases. The minimum training set size required to improve on PCC depends on the case and on the ICA-PCC method. Thus, this minimum number is smaller in Figure 1a (sub-Gaussian) than in Figure 1b (super-Gaussian), and has an intermediate value in Figure 1c (mixed sub/super-Gaussian). However, notice that the non-decreasing error of PCC is much higher in Figure 1a,c, which suggests that the ICA-PCC improvement begins at a smaller training set size.
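The sub- and super-Gaussian sources used above are easy to reproduce; the following sketch (ours, with an arbitrary sample size) draws unit-variance uniform and Laplacian samples and checks their excess kurtosis against the values quoted in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
P = 200_000  # sample size (our choice)

# unit-variance, zero-mean sources of the synthetic models
s_uniform = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), P)  # sub-Gaussian
s_laplace = rng.laplace(0.0, 1.0 / np.sqrt(2.0), P)      # super-Gaussian

def excess_kurtosis(s):
    """Empirical kappa - kappa_G, with kappa_G = 3 for a Gaussian."""
    z = (s - s.mean()) / s.std()
    return np.mean(z**4) - 3.0

# excess kurtosis approaches -1.2 (uniform) and +3 (Laplacian) as P grows
```

Uniform on (−√3, √3) and Laplacian with scale 1/√2 both have unit variance, matching the standardized-source assumption of the ICA model.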
On the other hand, this minimum value is smaller for JADE than for Extended Infomax. This is due to the different nature of the two algorithms: JADE requires a matrix diagonalization, while Extended Infomax requires iterative learning. However, the computational complexity of JADE is much larger, especially as N increases. Regarding convergence for large training set sizes, we can see that the error level of the mixed case is clearly above the others. This is because, in general, all ICA algorithms have more difficulty estimating the model in the mixed case. Actually, Extended Infomax was conceived in an effort to deal with the mixed case by incorporating a procedure to estimate the class (sub- or super-Gaussian) of every source. This explains the smaller error of Extended Infomax in Figure 1c with respect to JADE after a given training set size. For the Gaussian case, PCC yields a very small error, which decreases with increasing training set size. In this case, ICA-PCC is worse than PCC, although its error is reasonably small. Remember that, in the Gaussian case, we expected similar results for both methods; however, the estimation paths followed are different: in PCC the precision matrix is directly estimated, while in ICA-PCC the matrices W and M_nm are estimated via Algorithm 1. This could explain the separation observed in the error curves of Figure 1d. Finally, most ICA algorithms decompose the estimation of W into two steps: first, estimate a decorrelation (whitening) matrix, and then a rotation matrix. When the independent components are Gaussian, any rotation matrix is valid, as all of them are compatible with Gaussianity. However, in the non-Gaussian cases the rotation matrix must be properly estimated for the corresponding non-Gaussian model. This can be interpreted as if a smaller number of model parameters (the entries of the decorrelation matrix) actually had to be estimated in the Gaussian case, which explains the faster convergence of the curves in Figure 1d.

A Real Data Application
We applied the proposed method to quantify the significance of changes in brain connectivity during the sleep of patients having disorders like apnea or epilepsy [37]. These disorders are characterized by regular arousals, which are stages of abnormally degraded sleep. The frequency of arousals in a given period of time is related to the seriousness of the pathology. However, the intensity of the arousals may also be relevant for an appropriate diagnosis. Assuming that an arousal is associated with changes in brain connectivity [38], a measure related to the change magnitude may be useful to quantify the significance of the pathology. To this aim, the patient was monitored during sleep through 19 channels of EEG recordings. Every signal channel was segmented into intervals of 2 s; a given feature was computed in every interval and averaged over epochs of 26 s. Each epoch was manually or automatically [22] labelled with one of two possible states: normal sleep (state 0) or arousal (state 1). Then, associated with every epoch, an observation vector x was built with one feature extracted from all the channels (the same type of feature for all of them); thus, N = 19. In this experiment, a total of 2000 epochs were available for every state. Given these data sets, an average measure related to brain connectivity was computed to quantify the importance of brain changes between the two states.
There are many possible definitions of overall connectivity; here, we considered the so-called algebraic connectivity [39], which can be computed as the second-smallest eigenvalue λ_2 of the graph Laplacian matrix [40] L = D − A, where D is a diagonal matrix with entries d_nn = ∑_{m≠n} a_nm and A is the adjacency matrix with entries a_nm ≥ 0. The Laplacian matrix is positive semidefinite with its smallest eigenvalue λ_1 equal to zero, so λ_2 ≥ 0. Moreover, it is demonstrated in [39] that λ_2 = N for a complete graph (a graph with a_nm = 1 ∀ n ≠ m). It is also demonstrated in [39] that

\lambda_2 \leq \frac{N}{N-1} \min_n d_{nn}. \qquad (33)

Hence, assuming that 0 ≤ a_nm ≤ 1 (as it will be in our case), the greatest upper bound for λ_2 in (33) corresponds to the complete graph (a_nm = 1 ∀ n ≠ m ⇒ d_nn = N − 1 ∀ n); therefore, 0 ≤ λ_2 ≤ N. Consequently, we propose a normalized version of λ_2 to measure the connectivity:

\varsigma = \frac{\lambda_2}{N}, \qquad 0 \leq \varsigma \leq 1. \qquad (34)

The lower bound ς = 0 corresponds to a disconnected graph, as it implies a multiplicity greater than 1 of the smallest eigenvalue. The upper bound corresponds to a complete graph, which is the one having maximum connectivity under the constraint 0 ≤ a_nm ≤ 1. We obtained connectivity estimates for every state (0 or 1) and method (ICA-PCC or PCC): ς^{ICA-PCC}_0, ς^{ICA-PCC}_1, ς^{PCC}_0, ς^{PCC}_1. These were obtained from (34) with N = 19, after computing the second-smallest eigenvalue of the Laplacian matrix, considering that the entries of the associated adjacency matrix are the respective magnitudes of the partial correlation estimates obtained from the training set of state 0 or 1:

a_{nm} = |\hat{\rho}_{nm}|. \qquad (35)

Two different features were considered separately. The first is "amplitude" (Amp), which is the maximum amplitude in the corresponding 2 s interval; the second is the "alfa-slow-index" (Asi), which is the ratio of the power in the alpha band (8.0–11 Hz) to the combined power in the delta (0.5–3.5 Hz) and theta (3.5–8.0 Hz) bands. Tables 1 and 2 show the results corresponding to the Amp and Asi features, respectively, for 6 different patients.
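The normalized measure (34) is straightforward to compute; a minimal Python sketch (the function name is ours):

```python
import numpy as np

def normalized_connectivity(A):
    """Normalized algebraic connectivity: sigma = lambda_2 / N, where
    lambda_2 is the second-smallest eigenvalue of L = D - A
    (A symmetric, nonnegative, zero diagonal, entries in [0, 1])."""
    N = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A       # graph Laplacian
    eigvals = np.sort(np.linalg.eigvalsh(L))
    return eigvals[1] / N
```

A complete graph with unit weights attains the maximum ς = 1, while any disconnected graph (smallest Laplacian eigenvalue with multiplicity greater than 1) yields ς = 0, matching the bounds discussed above.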
Together with the normalized connectivity, we included the connectivity variation between states, defined as Δ^{ICA-PCC} = ς^{ICA-PCC}_1 − ς^{ICA-PCC}_0 and, analogously, Δ^{PCC} = ς^{PCC}_1 − ς^{PCC}_0. We also included a kurtosis estimate for every patient and state. This estimate was obtained as the mean of the 19 empirical kurtosis values separately calculated for every component of vector x, i.e., the empirical kurtosis of the marginal distributions of x. Notice that the estimated kurtosis is clearly above the Gaussian reference κ_G = 3, so the Gaussianity assumption does not hold in this case. We can see in Tables 1 and 2 that the PCC method yields very small values of connectivity for all subjects and states; therefore, it is not sensitive to possible changes between states. However, ICA-PCC provides larger values of connectivity and significant changes between states. Figures 2 and 3 show the estimated adjacency matrices corresponding to the different subjects, methods, and states. We can see that the PCC magnitudes are, in general, much lower than the ICA-PCC magnitudes; therefore, PCC has more difficulty revealing the interrelations between the different EEG channels due to brain activity. This may be explained in terms of the residuals e_n and e_m. Notice from (12) that

E[e_n^2] = \sum_i (v^{+}_{1nmi})^2\, mse_{nmi}, \qquad E[e_m^2] = \sum_i (v^{+}_{2nmi})^2\, mse_{nmi}, \qquad (36)

where v^+_{1nmi} and v^+_{2nmi} are the elements of the vectors v^+_{1nm} and v^+_{2nm}, these being, respectively, the first and second rows of (W T_nm)^+. We showed in Section 3 that PCC should be similar to ICA-PCC when mse_nmi = mse^l_nmi, but for non-Gaussian observations mse_nmi < mse^l_nmi, so it is deduced from (36) that

\sum_i (v^{+}_{1nmi})^2\, mse^{l}_{nmi} \;\geq\; \sum_i (v^{+}_{1nmi})^2\, mse_{nmi}, \qquad (37)

where equality holds in the Gaussian case. So, PCC provides overestimated residuals, in which the actual partial correlation between x_n and x_m may be hidden. This masking effect should increase with the degree of non-Gaussianity of the observations. In our experiment, the features are highly non-Gaussian, as demonstrated by the kurtosis values of Tables 1 and 2.
So, when using PCC, the "true" residuals are swamped by rather uncorrelated residual components, which yields too low an estimate of the actual interrelation between the different EEG channels.

Conclusions and Extensions
Partial correlations may be used to define the weights of an undirected graph for subsequent graph signal processing. Conventionally, partial correlations are obtained from the precision matrix, but this is optimal only under the Gaussianity assumption. Hence, we have proposed a new method for computing the partial correlation, assuming a non-Gaussian model (ICA). The latter is a versatile model which suits a diversity of non-Gaussian pdfs.
The proposed method requires the computation of the ICA model parameters, which can be done using any of the many existing algorithms. Two different ICA methods have been considered in the synthetic examples, which may be regarded as representative of two different kinds of approaches to estimating the ICA model parameters. Both yield similar performance. Computing the mean square errors corresponding to the optimal estimation of the sources is also required. To this end, we have proposed a second-order approximation of the conditional mean. Higher orders could be tried at the price of increased complexity.
We have verified, both by simulations and by real data experiments that the new method better captures the pairwise and overall connectivity of the graph compared to the precision matrix in non-Gaussian scenarios. The results could be extended to larger values of N but the training set sizes should be correspondingly increased to keep the quality of the model parameter estimates.
Future extensions of this work can be devised. Some kind of regularization is desirable to emphasize the relevant information provided by the graph connectivity and/or to establish more natural relations between the connected nodes. Thus, sparsity is a common requirement of graph learning (see [41] as a representative example). Considering Equation (10), sparsity could be imposed by selecting only those sources that significantly contribute to the partial correlation between x_n and x_m, i.e., by soft or hard thresholding on mse_nmi. On the other hand, smoothness regularization could be tried in a manner similar to the approach proposed in [42] for the Gaussian case. To this aim, it could be considered that the representation matrix U can be factorized into a correlation matrix multiplied by a rotation (unitary) matrix [43]. Understanding how this rotation relates to the graph connectivity may allow the definition of cost functions that include some smoothness-related terms. Other structural constraints [44] could also be compatible with the non-Gaussian model.
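As a hint of how the source-selection idea could look, here is a hypothetical hard-thresholding sketch over the source-wise decomposition of the residual covariance (the function name, the threshold value, and the input conventions are our assumptions, not part of the paper):

```python
import numpy as np

def sparsified_residual_cov(mse, U_plus, tau=0.5):
    """Accumulate the per-source contributions of the residual covariance,
    keeping only sources whose MSE after conditioning exceeds tau
    (hypothetical hard thresholding).
    mse: length-N vector of mse_nmi; U_plus: (2 x N) matrix (W T_nm)^+."""
    C = np.zeros((2, 2))
    for i in range(len(mse)):
        if mse[i] > tau:  # source i still contributes after conditioning
            C += mse[i] * np.outer(U_plus[:, i], U_plus[:, i])
    return C
```

Soft thresholding (shrinking each mse_nmi toward zero instead of dropping it) would be an equally plausible variant; either way, the resulting adjacency matrix is sparsified at the level of source contributions rather than by directly zeroing small partial correlations.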

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Derivation of the General Formula
Due to the orthogonality between the error and the conditional mean, we can write

E\big[ e_{nm}\, (E[x_{nm} | x_{-nm}])^T \big] = 0. \qquad (A1)

Let us define

f = s - E[s | x_{-nm}]. \qquad (A2)

So, from (9), e_nm = (W T_nm)^+ f, and hence

C_{e_{nm} e_{nm}} = (W T_{nm})^{+}\, E[f f^T]\, \big((W T_{nm})^{+}\big)^T, \qquad E[f f^T] = E[s s^T] - E\big[s\, E[s | x_{-nm}]^T\big] - E\big[E[s | x_{-nm}]\, s^T\big] + E\big[E[s | x_{-nm}]\, E[s | x_{-nm}]^T\big]. \qquad (A3)

We have to compute the 4 terms of (A3).

- where we have taken into account that the conditional mean is an unbiased estimator. Then, we know that E[f f^T](i,