Robust Bilinear Probabilistic Principal Component Analysis

Abstract: Principal component analysis (PCA) is one of the most popular tools in multivariate exploratory data analysis. Its probabilistic version (PPCA), based on the maximum likelihood procedure, provides a probabilistic way to perform dimension reduction. Recently, the bilinear PPCA (BPPCA) model, which assumes that the noise terms follow matrix variate Gaussian distributions, has been introduced to deal directly with two-dimensional (2-D) data, preserving the matrix structure of 2-D data such as images and avoiding the curse of dimensionality. However, the Gaussian assumption is often violated in real-life applications, where data sets may contain outliers. In order to make BPPCA robust to outliers, in this paper we propose a robust BPPCA model under the assumption of matrix variate t distributions for the noise terms. The alternating expectation conditional maximization (AECM) algorithm is used to estimate the model parameters. Numerical examples on several synthetic and publicly available data sets are presented to demonstrate the superiority of our proposed model in feature extraction, classification and outlier detection.


Introduction
High-dimensional data are increasingly collected for a variety of applications in the real world. However, high-dimensional data are often not distributed uniformly in their ambient space; instead, the interesting structure inside the data often lies in a low-dimensional subspace [1]. One of the fundamental challenges in pattern recognition, machine learning and statistics is how to find a low-dimensional representation of high-dimensional observed data [2,3]. Principal component analysis (PCA) [4] is arguably the most well-known dimension reduction method for high-dimensional data analysis. It aims to find the first few principal eigenvectors, corresponding to the largest eigenvalues of the covariance matrix, and then projects the high-dimensional data onto the low-dimensional subspace spanned by these principal eigenvectors to achieve dimensionality reduction.
The traditional PCA is concerned with vectorial data, i.e., 1-D data. For 2-D image sample matrices, it is usual to first convert the 2-D image matrices into 1-D image vectors. This transformation leads to high-dimensional image sample vectors and a large covariance matrix, and thus suffers from the difficulty of accurately evaluating the principal eigenvectors of the large-scale covariance matrix. Furthermore, such vectorization of 2-D data destroys the natural matrix structure and ignores potentially valuable information about the spatial relationships within 2-D data. Therefore, two-dimensional PCA (2DPCA) type algorithms [5][6][7] have been proposed to compute principal component weight matrices directly from the 2-D image training sample matrices instead of using vectorization.
These conventional PCA and 2DPCA algorithms are both derived and interpreted in the standard algebraic framework, and thus lack the capability of handling statistical inference or missing data. To remedy these drawbacks, a probabilistic PCA model (PPCA) was proposed by Tipping and Bishop in [8]; it is formulated by placing Gaussian distributions on the observations together with extra latent variables, and it has been successfully applied in many machine learning tasks [9]. Following PPCA, a probabilistic second-order PCA, called PSOPCA, was developed in [10] to directly model 2-D image matrices based on the so-called matrix variate Gaussian distributions.
Throughout this paper, $\mathbb{R}^{n\times m}$ is the set of all $n\times m$ real matrices, $I_n$ and $0_{m\times n}$ are the $n\times n$ identity matrix and the $m\times n$ zero matrix, respectively. The superscript $\cdot^T$ denotes the transpose, and $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the 2-norm and the Frobenius norm of a matrix, respectively. Denoted by $\mathcal{N}_{d_c,d_r}(M,\Omega_c,\Omega_r)$ is the matrix variate Gaussian distribution [11] with mean matrix $M\in\mathbb{R}^{d_c\times d_r}$, column covariance $\Omega_c\in\mathbb{R}^{d_c\times d_c}$ and row covariance $\Omega_r\in\mathbb{R}^{d_r\times d_r}$. A random matrix $X\in\mathbb{R}^{d_c\times d_r}$ is said to follow the matrix variate Gaussian distribution $\mathcal{N}_{d_c,d_r}(M,\Omega_c,\Omega_r)$ if

$$\mathrm{vec}(X)\sim\mathcal{N}_{d_cd_r}\big(\mathrm{vec}(M),\ \Omega_r\otimes\Omega_c\big), \qquad (1)$$

where $\mathrm{vec}(\cdot)$ is the vectorization of a matrix obtained by stacking the columns of the matrix on top of one another. That means that the probability density function (pdf) of $X$ is

$$p(X)=(2\pi)^{-\frac{d_cd_r}{2}}|\Omega_r\otimes\Omega_c|^{-\frac{1}{2}}\exp\Big\{-\tfrac12[\mathrm{vec}(X-M)]^T(\Omega_r\otimes\Omega_c)^{-1}\mathrm{vec}(X-M)\Big\}
=(2\pi)^{-\frac{d_cd_r}{2}}|\Omega_c|^{-\frac{d_r}{2}}|\Omega_r|^{-\frac{d_c}{2}}\,\mathrm{etr}\Big\{-\tfrac12\,\Omega_c^{-1}(X-M)\Omega_r^{-1}(X-M)^T\Big\}, \qquad (2)$$

where $\otimes$ is the Kronecker product of two matrices, $\mathrm{etr}(\cdot)=\exp(\mathrm{tr}(\cdot))$ and $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. The last equality of (2) holds because $|\Omega_r\otimes\Omega_c|^{-\frac12}=|\Omega_r|^{-\frac{d_c}{2}}|\Omega_c|^{-\frac{d_r}{2}}$ and $[\mathrm{vec}(X-M)]^T(\Omega_r\otimes\Omega_c)^{-1}\mathrm{vec}(X-M)=\mathrm{tr}\big(\Omega_c^{-1}(X-M)\Omega_r^{-1}(X-M)^T\big)$.

The PSOPCA model [10] considers the following two-sided latent matrix variable model

$$X=CZR^T+W+\Upsilon, \qquad (3)$$

where $C\in\mathbb{R}^{d_c\times q_c}$ and $R\in\mathbb{R}^{d_r\times q_r}$ are the column and row factor loading matrices, respectively, $W\in\mathbb{R}^{d_c\times d_r}$ and $\Upsilon\in\mathbb{R}^{d_c\times d_r}$ are the mean and error matrices, respectively, and $Z\in\mathbb{R}^{q_c\times q_r}$ is the latent core variable of $X$. The PSOPCA model is further extended to the bilinear probabilistic principal component analysis (BPPCA) model in [12] for better establishing the relationship with the 2DPCA algorithm [6], which is defined as

$$X=C\big(ZR^T+E_r\big)+E_cR^T+E+W. \qquad (4)$$

In contrast to the PSOPCA model (3), the column and row noise matrices $E_c\in\mathbb{R}^{d_c\times q_r}$ and $E_r\in\mathbb{R}^{q_c\times d_r}$, with different noise variances $\sigma_c^2$ and $\sigma_r^2$, respectively, are included in the BPPCA model, and $E\in\mathbb{R}^{d_c\times d_r}$ is the common noise matrix. The model (4) improves the flexibility in capturing data uncertainty, and makes the marginal distribution $p(X)$ a matrix variate Gaussian. In particular, we can see that if $E_r$ and $E_c$ are removed and $\sigma_c=\sigma_r$, then (4) reduces to the PSOPCA model.
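The Kronecker-product identity behind the last equality of (2) is easy to check numerically. The following Python snippet (our own illustration, not part of the original paper; all dimensions and covariance matrices are arbitrary choices) evaluates the matrix variate Gaussian log-density in its trace form and compares it with the log-density of vec(X) under N(vec(M), Ω_r ⊗ Ω_c); the two printed values agree up to floating-point rounding.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
dc, dr = 3, 4                       # illustrative dimensions

# Arbitrary mean and positive-definite column/row covariances.
M = rng.standard_normal((dc, dr))
A = rng.standard_normal((dc, dc)); Omega_c = A @ A.T + dc * np.eye(dc)
B = rng.standard_normal((dr, dr)); Omega_r = B @ B.T + dr * np.eye(dr)

def matrix_normal_logpdf(X, M, Omega_c, Omega_r):
    """Trace form of the matrix variate Gaussian log-density, as in (2)."""
    dc, dr = X.shape
    D = X - M
    quad = np.trace(np.linalg.solve(Omega_c, D) @ np.linalg.solve(Omega_r, D.T))
    logdet = dr * np.linalg.slogdet(Omega_c)[1] + dc * np.linalg.slogdet(Omega_r)[1]
    return -0.5 * (dc * dr * np.log(2 * np.pi) + logdet + quad)

X = rng.standard_normal((dc, dr))
# Vectorized form: vec(X) ~ N(vec(M), Omega_r kron Omega_c), with column-major vec.
lp_vec = multivariate_normal.logpdf(X.flatten(order="F"), M.flatten(order="F"),
                                    np.kron(Omega_r, Omega_c))
print(matrix_normal_logpdf(X, M, Omega_c, Omega_r), lp_vec)  # the two values agree
```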
All of the above-mentioned probabilistic models assume that the noise terms follow Gaussian distributions. It is well known that Gaussian noise models suffer from a serious drawback when dealing with anomalous observations. Thus, the probabilistic PCA models based on Gaussian distributions are not robust to outliers. To build probabilistic models that are insensitive to outliers, one prefers heavy-tailed distributions, such as the Student t distribution or the centered Laplacian distribution associated with the ℓ1-norm. Using the t distribution or the centered Laplacian distribution instead of the Gaussian distribution in the PPCA model [8] results in the tPPCA [13,14] and probabilistic L1-PPCA [15] algorithms, respectively. Similarly, a robust version of PSOPCA, called L1-2DPPCA, was introduced in [16] based on the Laplacian distribution combined with variational EM-type algorithms to learn the parameters. However, it is difficult to generalize a robust version of the BPPCA algorithm based on the Laplacian distribution. The reason is that if the error term Υ in the PSOPCA model follows a Laplacian distribution, then the conditional distribution X|Z is also Laplacian, but this property does not hold in the BPPCA model. Fortunately, the same goal can be achieved by using the t distribution. In fact, the Gaussian distribution is a special case of the t distribution. Compared to the Gaussian distribution, the t distribution has significantly heavier tails and contains one more free parameter. Recently, several robust probabilistic models under the assumption of the t distribution have been developed successfully by a number of researchers in [17][18][19][20][21][22]. Motivated by these facts, we will continue this effort and develop a robust BPPCA model based on matrix variate t distributions to handle 2-D data sets in the presence of outliers.
The remainder of the paper is organized as follows. Section 2 introduces some notations and a matrix variate t distribution which are essential to our later development. The robust BPPCA model and its associated parameter estimation based on the AECM algorithm are given and analyzed in detail in Section 3. Section 4 presents numerical examples showing the behavior of our proposed model and supporting our analysis. Finally, conclusions are drawn in Section 5.

Preliminaries
Let $X\in\mathbb{R}^{d_c\times d_r}$, $M\in\mathbb{R}^{d_c\times d_r}$, $\Omega_c\in\mathbb{R}^{d_c\times d_c}$ and $\Omega_r\in\mathbb{R}^{d_r\times d_r}$, and let the probability density function of a random variable $\mu$ having a Gamma distribution with parameters $\alpha$ and $\beta$, i.e., $\mu\sim\mathrm{Ga}(\alpha,\beta)$, be

$$p(\mu\mid\alpha,\beta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\mu^{\alpha-1}e^{-\beta\mu},\qquad\mu>0, \qquad (5)$$

where $\Gamma(\cdot)$ is the Gamma function, i.e.,

$$\Gamma(\alpha)=\int_0^{\infty}t^{\alpha-1}e^{-t}\,dt. \qquad (6)$$

Analogously to the process of tPPCA in [14], we derive the matrix variate t distribution in this paper by considering

$$X\mid\mu\sim\mathcal{N}_{d_c,d_r}(M,\mu_c\Omega_c,\mu_r\Omega_r)\quad\text{with}\quad\mu_c\mu_r=\mu^{-1}, \qquad (7)$$

$$\mu\sim\mathrm{Ga}\Big(\frac{\nu}{2},\frac{\nu}{2}\Big); \qquad (8)$$

then the random matrix $X$ is said to follow the matrix variate t distribution with degrees of freedom $\nu$, and is denoted by

$$X\sim t_{d_c,d_r}(M,\Omega_c,\Omega_r,\nu),\quad\text{with pdf}\quad
p(X)=\frac{\Gamma\big(\frac{\nu+d_cd_r}{2}\big)}{\Gamma\big(\frac{\nu}{2}\big)(\nu\pi)^{\frac{d_cd_r}{2}}}\,|\Omega_c|^{-\frac{d_r}{2}}|\Omega_r|^{-\frac{d_c}{2}}\Big[1+\frac{\mathrm{tr}\big(\Omega_c^{-1}(X-M)\Omega_r^{-1}(X-M)^T\big)}{\nu}\Big]^{-\frac{\nu+d_cd_r}{2}}. \qquad (9)$$

In particular, if $d_c=1$ or $d_r=1$, then the matrix variate t distribution degenerates to the multivariate t distribution. As for the classical multivariate t distribution, another useful perspective on the matrix variate t distribution, which is critical to our later developments, is to treat $\mu$ as a latent variable; then by (7) and (8), the conditional distribution of $X\mid\mu$ is a matrix variate Gaussian distribution, i.e.,

$$X\mid\mu\sim\mathcal{N}_{d_c,d_r}(M,\mu_c\Omega_c,\mu_r\Omega_r),\qquad\mu\sim\mathrm{Ga}\Big(\frac{\nu}{2},\frac{\nu}{2}\Big), \qquad (10)$$

where $\mu_c\mu_r=\mu^{-1}$. Notice that, despite the non-uniqueness of the factorization $\mu^{-1}=\mu_c\mu_r$, $\mathcal{N}_{d_c,d_r}(M,\mu_c\Omega_c,\mu_r\Omega_r)$ in (10) always yields the same pdf of $X$.
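As a sanity check of the scale-mixture construction (7)–(10), the sketch below (an illustration under the stated assumptions, with arbitrarily chosen M, Ω_c, Ω_r and ν; not the authors' code) draws X by first sampling µ ~ Ga(ν/2, ν/2) and then a matrix variate Gaussian scaled by µ^{-1/2}. For ν > 2 the empirical covariance of vec(X) should then be close to ν/(ν−2)·(Ω_r ⊗ Ω_c).

```python
import numpy as np

rng = np.random.default_rng(1)
dc, dr, nu = 2, 3, 8.0              # illustrative sizes and degrees of freedom
M = np.zeros((dc, dr))
Omega_c = np.array([[2.0, 0.5], [0.5, 1.0]])
Omega_r = np.diag([1.0, 2.0, 3.0])
Lc, Lr = np.linalg.cholesky(Omega_c), np.linalg.cholesky(Omega_r)

def sample_matrix_t(n):
    # mu ~ Ga(nu/2, nu/2) (rate parameterization), and given mu,
    # X = M + mu^{-1/2} Lc G Lr^T, i.e. vec(X)|mu ~ N(vec(M), mu^{-1} Omega_r kron Omega_c),
    # which corresponds to the factorization mu_c = mu_r = mu^{-1/2} in (10).
    mu = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)
    G = rng.standard_normal((n, dc, dr))
    return M + (Lc @ G @ Lr.T) / np.sqrt(mu)[:, None, None]

X = sample_matrix_t(200_000)
V = X.transpose(0, 2, 1).reshape(len(X), dc * dr)     # vec(X_i), column-major
err = np.cov(V, rowvar=False) - nu / (nu - 2) * np.kron(Omega_r, Omega_c)
print(np.abs(err).max())   # small for a large sample (finite variance requires nu > 2)
```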

The Model
In this section, we develop a robust model by replacing the matrix variate Gaussian distribution in BPPCA with the matrix variate t distribution with degrees of freedom $\nu$ defined in (9), in order to deal with 2-D data sets. Specifically, the proposed robust bilinear probabilistic principal component analysis model (RBPPCA for short) is defined as

$$X=C\big(ZR^T+E_r\big)+E_cR^T+E+W. \qquad (11)$$

As in BPPCA [12], in the RBPPCA model (11), $E_c\in\mathbb{R}^{d_c\times q_r}$, $E_r\in\mathbb{R}^{q_c\times d_r}$ and $E\in\mathbb{R}^{d_c\times d_r}$ are the column, row and common noise matrices, respectively, $Z\in\mathbb{R}^{q_c\times q_r}$ is the latent matrix, and these are assumed to be independent of each other; the mean matrix, and the column and row factor loading matrices, are $W\in\mathbb{R}^{d_c\times d_r}$, $C\in\mathbb{R}^{d_c\times q_c}$ and $R\in\mathbb{R}^{d_r\times q_r}$, respectively. Similarly to BPPCA [12], the parameters $C$, $R$, $\sigma_c$ and $\sigma_r$ cannot be uniquely identified, but the subspaces of interest spanned by the columns of $C$ and $R$ are unique. The reader is referred to ([12], Appendix B) for details. The difference from BPPCA is that the noise matrices $E_c$, $E_r$ and $E$ and the latent matrix variate $Z$ in the RBPPCA model (11) are assumed to follow matrix variate t distributions by (10), i.e., conditionally on a common latent scale $\mu\sim\mathrm{Ga}(\nu/2,\nu/2)$ with $\mu_c\mu_r=\mu^{-1}$,

$$Z\mid\mu\sim\mathcal{N}_{q_c,q_r}(0,\mu_cI_{q_c},\mu_rI_{q_r}),\quad
E_r\mid\mu\sim\mathcal{N}_{q_c,d_r}(0,\mu_cI_{q_c},\mu_r\sigma_r^2I_{d_r}),\quad
E_c\mid\mu\sim\mathcal{N}_{d_c,q_r}(0,\mu_c\sigma_c^2I_{d_c},\mu_rI_{q_r}),\quad
E\mid\mu\sim\mathcal{N}_{d_c,d_r}(0,\mu_c\sigma_c^2I_{d_c},\mu_r\sigma_r^2I_{d_r}).$$

It follows by (11) that, conditionally on $\mu$, $\mathrm{vec}(X-W)\mid\mu\sim\mathcal{N}_{d_cd_r}\big(0,\mu^{-1}\Sigma_r\otimes\Sigma_c\big)$, where

$$\Sigma_c=CC^T+\sigma_c^2I_{d_c},\qquad\Sigma_r=RR^T+\sigma_r^2I_{d_r}. \qquad (12)$$

That means the random matrix $X$ follows the matrix variate t distribution, i.e., $X\sim t_{d_c,d_r}(W,\Sigma_c,\Sigma_r,\nu)$. In addition, as shown in [23], the conditional distribution $\mu\mid X$, which is also required in our later estimation of the model parameters, is a Gamma distribution, i.e.,

$$\mu\mid X\sim\mathrm{Ga}\Big(\frac{\nu+d_cd_r}{2},\frac{\nu+\rho}{2}\Big),\qquad\text{where}\qquad \rho=\mathrm{tr}\big(\Sigma_c^{-1}(X-W)\Sigma_r^{-1}(X-W)^T\big). \qquad (13)$$

By introducing two latent matrix variates $Y^r\in\mathbb{R}^{q_c\times d_r}$ and $\bar Y^r\in\mathbb{R}^{d_c\times d_r}$, the RBPPCA model (11) can be rewritten as

$$Y^r=ZR^T+E_r,\qquad X=CY^r+\bar Y^r+W,\qquad \bar Y^r=E_cR^T+E, \qquad (14)$$

where $Y^r$ and $\bar Y^r$ are the row projected intermediate and residual matrices, respectively. By (11), we have the conditional distributions

$$X\mid Y^r,\mu\sim\mathcal{N}_{d_c,d_r}\big(CY^r+W,\ \mu_c\sigma_c^2I_{d_c},\ \mu_r\Sigma_r\big), \qquad (15a)$$
$$Z\mid\mu\sim\mathcal{N}_{q_c,q_r}\big(0,\ \mu_cI_{q_c},\ \mu_rI_{q_r}\big), \qquad (15b)$$
$$Y^r\mid Z,\mu\sim\mathcal{N}_{q_c,d_r}\big(ZR^T,\ \mu_cI_{q_c},\ \mu_r\sigma_r^2I_{d_r}\big), \qquad (15c)$$
$$Y^r\mid\mu\sim\mathcal{N}_{q_c,d_r}\big(0,\ \mu_cI_{q_c},\ \mu_r\Sigma_r\big), \qquad (15d)$$

where $\Sigma_r$ is given by (12). In addition, by using (15c) and the Bayes rule, the conditional distributions $Z\mid Y^r,\mu$ and $Y^r\mid X,\mu$ can be calculated as

$$Z\mid Y^r,\mu\sim\mathcal{N}_{q_c,q_r}\big(Y^rRM_r^{-1},\ \mu_cI_{q_c},\ \mu_r\sigma_r^2M_r^{-1}\big), \qquad (16a)$$
$$Y^r\mid X,\mu\sim\mathcal{N}_{q_c,d_r}\big(M_c^{-1}C^T(X-W),\ \mu_c\sigma_c^2M_c^{-1},\ \mu_r\Sigma_r\big), \qquad (16b)$$

where

$$M_c=C^TC+\sigma_c^2I_{q_c},\qquad M_r=R^TR+\sigma_r^2I_{q_r}. \qquad (17)$$

In (14), the bilinear projection in the RBPPCA model is split into two stages by first projecting the latent matrix $Z$ in the row direction to obtain $Y^r$, and then projecting $Y^r$ in the column direction to finally generate $X$. Similarly, we can also consider the decomposition of the bilinear projection which first projects in the column direction and then in the row direction, rewriting (11) as

$$Y^c=CZ+E_c,\qquad X=Y^cR^T+\bar Y^c+W,\qquad \bar Y^c=CE_r+E, \qquad (18)$$

where $Y^c\in\mathbb{R}^{d_c\times q_r}$ and $\bar Y^c\in\mathbb{R}^{d_c\times d_r}$ are the column projected intermediate and residual matrices, respectively, with the conditional distributions

$$X\mid Y^c,\mu\sim\mathcal{N}_{d_c,d_r}\big(Y^cR^T+W,\ \mu_c\Sigma_c,\ \mu_r\sigma_r^2I_{d_r}\big), \qquad (19a)$$
$$Y^c\mid\mu\sim\mathcal{N}_{d_c,q_r}\big(0,\ \mu_c\Sigma_c,\ \mu_rI_{q_r}\big), \qquad (19b)$$

where $\Sigma_c$ is given by (12).
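To make the generative structure of (11) concrete, the following sketch (hypothetical sizes and parameter values; not the authors' code) samples data from the RBPPCA model using a single Gamma scale per observation, and checks that the marginal covariance of vec(X − W) matches ν/(ν−2)·(Σ_r ⊗ Σ_c), as implied by (12) and (13).

```python
import numpy as np

rng = np.random.default_rng(2)
dc, dr, qc, qr, nu = 8, 6, 3, 2, 8.0          # hypothetical sizes
sigma_c, sigma_r = 0.5, 0.7
C = rng.standard_normal((dc, qc))
R = rng.standard_normal((dr, qr))
W = rng.standard_normal((dc, dr))

def sample_rbppca(n):
    """Draw X_1..X_n from model (11): all latent terms share one Gamma scale mu_n
    (split as mu_c = mu_r = mu_n^{-1/2}), so each X_n is matrix variate t."""
    mu = rng.gamma(nu / 2, 2 / nu, size=n)
    s = 1.0 / np.sqrt(mu)[:, None, None]
    Z  = s * rng.standard_normal((n, qc, qr))
    Er = s * sigma_r * rng.standard_normal((n, qc, dr))
    Ec = s * sigma_c * rng.standard_normal((n, dc, qr))
    E  = s * sigma_c * sigma_r * rng.standard_normal((n, dc, dr))
    return C @ (Z @ R.T + Er) + Ec @ R.T + E + W

X = sample_rbppca(100_000)
Sigma_c = C @ C.T + sigma_c**2 * np.eye(dc)    # as in (12)
Sigma_r = R @ R.T + sigma_r**2 * np.eye(dr)
V = (X - W).transpose(0, 2, 1).reshape(len(X), dc * dr)
err = np.cov(V, rowvar=False) - nu / (nu - 2) * np.kron(Sigma_r, Sigma_c)
print(np.abs(err).max() / np.abs(np.kron(Sigma_r, Sigma_c)).max())  # small relative error
```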

Estimation of the Parameters
In the model (11), the parameter set to be estimated is $\theta=\{\nu,\sigma_c,\sigma_r,W,C,R\}$. In this subsection, we introduce how to estimate these parameters by using the alternating expectation conditional maximization (AECM) algorithm. The AECM algorithm [12,24,25] is a two-stage iterative optimization technique for finding maximum likelihood solutions. To apply the AECM algorithm, we divide the parameter set $\theta$ into the two subsets

$$\theta_c=\{\nu,\sigma_c,W,C\}\qquad\text{and}\qquad\theta_r=\{\nu,\sigma_r,W,R\}.$$

In the first stage, we consider the AECM algorithm for the model (14) to compute $\theta_c$. Let $\{X_n\}_{n=1}^N$ with $X_n\in\mathbb{R}^{d_c\times d_r}$ be a set of 2-D sample observations. The latent variables' data $\{Y^r_n,\mu_n\}_{n=1}^N$ are treated as "missing data", and the "complete" data log-likelihood is

$$L_{com,c}(\theta_c)=\sum_{n=1}^N\ln\big[p(X_n\mid Y^r_n,\mu_n)\,p(Y^r_n\mid\mu_n)\,p(\mu_n)\big].$$

It follows by (5), (15a) and (15d) that, up to terms that do not depend on $\theta_c$,

$$L_{com,c}(\theta_c)=\sum_{n=1}^N\Big[-\frac{d_cd_r}{2}\ln\sigma_c^2-\frac{\mu_n}{2\sigma_c^2}\,\mathrm{tr}\big(\Sigma_r^{-1}(X_n-CY^r_n-W)^T(X_n-CY^r_n-W)\big)+\frac{\nu}{2}\ln\frac{\nu}{2}-\ln\Gamma\Big(\frac{\nu}{2}\Big)+\Big(\frac{\nu}{2}-1\Big)\ln\mu_n-\frac{\nu}{2}\mu_n\Big]+\mathrm{const}.$$

In the E-step, given the parameter set $\theta^{(i)}=\{\nu^{(i)},\sigma_c^{(i)},\sigma_r^{(i)},W^{(i)},C^{(i)},R^{(i)}\}$ obtained from the $i$-th iteration, we compute the expectation of $L_{com,c}(\theta_c)$ with respect to the conditional distribution of $Y^r_n,\mu_n\mid X_n$, i.e.,

$$Q_c(\theta_c)=\sum_{n=1}^N\Big[-\frac{d_cd_r}{2}\ln\sigma_c^2-\frac{1}{2\sigma_c^2}\,\mathrm{E}\Big(\mu_n\,\mathrm{tr}\big(\Sigma_r^{-1}(X_n-CY^r_n-W)^T(X_n-CY^r_n-W)\big)\,\Big|\,X_n\Big)+\frac{\nu}{2}\ln\frac{\nu}{2}-\ln\Gamma\Big(\frac{\nu}{2}\Big)+\Big(\frac{\nu}{2}-1\Big)\mathrm{E}(\ln\mu_n\mid X_n)-\frac{\nu}{2}\mathrm{E}(\mu_n\mid X_n)\Big]+\mathrm{constant}, \qquad (20)$$

where the constant contains those terms that do not involve the parameters in the set $\theta_c$.
We denote

$$\rho^{(i)}_n=\mathrm{tr}\big((\Sigma_c^{(i)})^{-1}(X_n-W^{(i)})(\Sigma_r^{(i)})^{-1}(X_n-W^{(i)})^T\big),$$

where $\Sigma_c^{(i)}=C^{(i)}(C^{(i)})^T+(\sigma_c^{(i)})^2I_{d_c}$ and $\Sigma_r^{(i)}=R^{(i)}(R^{(i)})^T+(\sigma_r^{(i)})^2I_{d_r}$. It is noted by (13) and (16b) that, given the parameter set $\theta^{(i)}$,

$$\mu_n\mid X_n\sim\mathrm{Ga}\Big(\frac{\nu^{(i)}+d_cd_r}{2},\frac{\nu^{(i)}+\rho^{(i)}_n}{2}\Big),\qquad
Y^r_n\mid X_n,\mu_n\sim\mathcal{N}_{q_c,d_r}\big((M_c^{(i)})^{-1}(C^{(i)})^T(X_n-W^{(i)}),\ \mu_{c,n}(\sigma_c^{(i)})^2(M_c^{(i)})^{-1},\ \mu_{r,n}\Sigma_r^{(i)}\big).$$

Then, it is easy to obtain that

$$\mathrm{E}^{(i)}_{\mu_n}:=\mathrm{E}(\mu_n\mid X_n)=\frac{\nu^{(i)}+d_cd_r}{\nu^{(i)}+\rho^{(i)}_n},\qquad
\mathrm{E}^{(i)}_{Y^r_n}:=\mathrm{E}(Y^r_n\mid X_n)=(M_c^{(i)})^{-1}(C^{(i)})^T(X_n-W^{(i)}). \qquad (21)$$

In addition, based on the conditional distributions of $\mu_n\mid X_n$ and $Y^r_n\mid X_n,\mu_n$, the conditional expectations required in (20) include

$$\mathrm{E}(\ln\mu_n\mid X_n)=\psi\Big(\frac{\nu^{(i)}+d_cd_r}{2}\Big)-\ln\frac{\nu^{(i)}+\rho^{(i)}_n}{2}$$

by [13], where $\psi(\cdot)$ is the digamma function; the remaining conditional expectations are detailed in Appendix A.
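A small sketch of these E-step quantities (written by us for illustration; function and variable names are our own): it computes ρ_n via the trace form and the conditional expectations E(µ_n|X_n) and E(ln µ_n|X_n). Outliers receive a large ρ_n and hence a small weight E(µ_n|X_n), which is the source of the robustness.

```python
import numpy as np
from scipy.special import digamma

def e_step_weights(X, W, Sigma_c, Sigma_r, nu):
    """For each sample X_n (X has shape (N, dc, dr)) compute
       rho_n          = tr(Sigma_c^{-1}(X_n - W) Sigma_r^{-1}(X_n - W)^T),
       E(mu_n|X_n)    = (nu + dc*dr) / (nu + rho_n),
       E(ln mu_n|X_n) = digamma((nu + dc*dr)/2) - ln((nu + rho_n)/2)."""
    dc, dr = W.shape
    D = X - W
    T = np.linalg.inv(Sigma_c) @ D                     # Sigma_c^{-1}(X_n - W)
    S = np.linalg.inv(Sigma_r) @ D.transpose(0, 2, 1)  # Sigma_r^{-1}(X_n - W)^T
    rho = np.einsum('nij,nji->n', T, S)
    e_mu = (nu + dc * dr) / (nu + rho)
    e_lnmu = digamma((nu + dc * dr) / 2) - np.log((nu + rho) / 2)
    return rho, e_mu, e_lnmu
```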
In the subsequent conditional maximization (CM) step of the first stage, given the conditional expectations computed above, we maximize $Q_c(\theta_c)$ with respect to $\theta_c$. Therefore, by successively solving the equations

$$\frac{\partial Q_c(\theta_c)}{\partial W}=0,\qquad\frac{\partial Q_c(\theta_c)}{\partial C}=0,\qquad\frac{\partial Q_c(\theta_c)}{\partial\sigma_c^2}=0, \qquad (22)$$

we can iteratively update the parameters $W$, $C$ and $\sigma_c$. Specifically, by the first two equations of (22), iterative updates of $W$ and $C$ can be obtained by

$$W^{(i)}=\frac{\sum_{n=1}^N\mathrm{E}^{(i)}_{\mu_n}\big(X_n-C^{(i)}\mathrm{E}^{(i)}_{Y^r_n}\big)}{\sum_{n=1}^N\mathrm{E}^{(i)}_{\mu_n}},\qquad
C^{(i+1)}=\Big[\sum_{n=1}^N\mathrm{E}^{(i)}_{\mu_n}(X_n-W^{(i)})(\Sigma_r^{(i)})^{-1}(\mathrm{E}^{(i)}_{Y^r_n})^T\Big]\Big[\sum_{n=1}^N\Big(\mathrm{E}^{(i)}_{\mu_n}\mathrm{E}^{(i)}_{Y^r_n}(\Sigma_r^{(i)})^{-1}(\mathrm{E}^{(i)}_{Y^r_n})^T+d_r(\sigma_c^{(i)})^2(M_c^{(i)})^{-1}\Big)\Big]^{-1}. \qquad (23)$$

Similarly, based on the last equation of (22),

$$(\sigma_c^{(i+1)})^2=\frac{1}{Nd_cd_r}\sum_{n=1}^N\mathrm{E}\Big[\mu_n\,\mathrm{tr}\big((\Sigma_r^{(i)})^{-1}(X_n-C^{(i+1)}Y^r_n-W^{(i)})^T(X_n-C^{(i+1)}Y^r_n-W^{(i)})\big)\,\Big|\,X_n\Big] \qquad (24a)$$
$$=\frac{1}{Nd_cd_r}\sum_{n=1}^N\mathrm{E}^{(i)}_{\mu_n}\,\mathrm{tr}\big((\Sigma_r^{(i)})^{-1}(X_n-W^{(i)})^T\big(X_n-W^{(i)}-C^{(i+1)}\mathrm{E}^{(i)}_{Y^r_n}\big)\big). \qquad (24b)$$

The last equality (24b) holds because of (22) and (24a). Finally, we update $\nu^{(i)}$ by maximizing the scalar nonlinear function $Q_c(\theta_c)$ defined in (20) with respect to $\nu$, which can be solved numerically by most scientific computation software packages [26,27], to obtain $\hat\nu^{(i)}$.
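For the update of ν, the ν-dependent part of Q_c(θ_c) is a smooth scalar function, so any bounded one-dimensional optimizer can be used. The sketch below (our own illustration; the search bounds are arbitrary) maximizes this part given the E-step quantities E(µ_n|X_n) and E(ln µ_n|X_n).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

def update_nu(e_mu, e_lnmu, bounds=(0.1, 200.0)):
    """Maximize the nu-dependent part of the expected complete-data log-likelihood,
       g(nu) = sum_n [ (nu/2) ln(nu/2) - ln Gamma(nu/2)
                       + (nu/2 - 1) E(ln mu_n|X_n) - (nu/2) E(mu_n|X_n) ],
    by a bounded one-dimensional search (any scalar optimizer would do)."""
    e_mu, e_lnmu = np.asarray(e_mu), np.asarray(e_lnmu)

    def neg_g(nu):
        return -np.sum((nu / 2) * np.log(nu / 2) - gammaln(nu / 2)
                       + (nu / 2 - 1) * e_lnmu - (nu / 2) * e_mu)

    return minimize_scalar(neg_g, bounds=bounds, method="bounded").x
```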
In the second stage, the AECM algorithm is applied to the model (18) to update $\theta_r$. In this case, we consider the latent variables' data $\{Y^c_n,\mu_n\}_{n=1}^N$ as "missing data". Then, by (19b), the "complete" data log-likelihood is

$$L_{com,r}(\theta_r)=\sum_{n=1}^N\ln\big[p(X_n\mid Y^c_n,\mu_n)\,p(Y^c_n\mid\mu_n)\,p(\mu_n)\big].$$

Similarly, in the E-step of the second stage, given the updated parameter set $\tilde\theta^{(i)}=\{\hat\nu^{(i)},\sigma_c^{(i+1)},W^{(i)},C^{(i+1)},\sigma_r^{(i)},R^{(i)}\}$, where $\sigma_c^{(i+1)}$, $W^{(i)}$ and $C^{(i+1)}$ are calculated from the first stage, we compute the expectation of $L_{com,r}(\theta_r)$ with respect to the conditional distribution of $Y^c_n,\mu_n\mid X_n$, denoted by $Q_r(\theta_r)$. Based on (19) and the current parameter set $\tilde\theta^{(i)}$, we define

$$\tilde\rho^{(i)}_n=\mathrm{tr}\big((\tilde\Sigma_c^{(i)})^{-1}(X_n-W^{(i)})(\Sigma_r^{(i)})^{-1}(X_n-W^{(i)})^T\big),\qquad
\mathrm{E}^{(i)}_{\mu_n}=\frac{\hat\nu^{(i)}+d_cd_r}{\hat\nu^{(i)}+\tilde\rho^{(i)}_n},\qquad
\mathrm{E}^{(i)}_{Y^c_n}=(X_n-W^{(i)})R^{(i)}(M_r^{(i)})^{-1}, \qquad (25)$$

where $\tilde\Sigma_c^{(i)}=C^{(i+1)}(C^{(i+1)})^T+(\sigma_c^{(i+1)})^2I_{d_c}$ and $M_r^{(i)}=(R^{(i)})^TR^{(i)}+(\sigma_r^{(i)})^2I_{q_r}$. We then have, up to a constant, an expression (27) for $Q_r(\theta_r)$ that is analogous to (20); see Appendix A for the derivation of (27). At last, in the CM-step of the second stage, based on $\tilde\theta^{(i)}$ and analogously to (23) and (24), we maximize $Q_r(\theta_r)$ with respect to $\theta_r$ to update $W^{(i+1)}$, $R^{(i+1)}$ and $\sigma_r^{(i+1)}$ in (28), and then solve the scalar nonlinear maximization problem (26) on $\nu$ to get $\nu^{(i+1)}$. We summarize the procedure of this subsection in Algorithm 1.

A few remarks regarding Algorithm 1 are in order: (1) In Algorithm 1, it is not necessary to explicitly compute $\Sigma_c^{(i)}$ and $\Sigma_r^{(i)}$. The reason is that the calculation of $(\Sigma_c^{(i)})^{-1}$ and $(\Sigma_r^{(i)})^{-1}$ can be performed more efficiently by using

$$(\Sigma_c^{(i)})^{-1}=\frac{1}{(\sigma_c^{(i)})^2}\Big(I_{d_c}-C^{(i)}(M_c^{(i)})^{-1}(C^{(i)})^T\Big),\qquad
(\Sigma_r^{(i)})^{-1}=\frac{1}{(\sigma_r^{(i)})^2}\Big(I_{d_r}-R^{(i)}(M_r^{(i)})^{-1}(R^{(i)})^T\Big).$$

In each iteration of Algorithm 1, the most expensive computational cost is $O(Nd_cd_r(d_c+d_r))$, arising from the formation of $\rho^{(i)}_n$ for $1\le n\le N$. Owing to the introduction of the new latent variables $\mu_n$, RBPPCA is slightly more expensive than BPPCA, whose per-iteration cost is $O(Nd_cd_r(q_c+q_r))$. However, it will be shown in our numerical examples that the RBPPCA algorithm is less sensitive to outliers.
(2) Compared with the AECM algorithm of BPPCA in [12], which uses centered data and estimates {σ_c, C} and {σ_r, R} based on the models X = CY^r + Ȳ^r and X = Y^cR^T + Ȳ^c, respectively, two more parameters, ν and W, need to be computed in the AECM iteration of RBPPCA. Notice that both models (14) and (18) contain the parameters ν and W. Thus, we split the parameter set θ into θ_c = {ν, σ_c, W, C} and θ_r = {ν, σ_r, W, R}, which naturally leads to the parameters ν and W being calculated twice in each loop of Algorithm 1. Although other partitions of the set θ, such as {ν, σ_c, C} and {σ_r, W, R}, are also available for the estimation of the parameters, we prefer θ_c and θ_r, because updating ν and W one more time in each iteration comes at only a small additional computational cost. (3) As stated in Section 3.3 of [24], any AECM sequence increases $L(\theta)=\sum_{n=1}^N\ln(p(X_n\mid\theta))$ at each iteration and converges to a stationary point of $L(\theta)$. Notice that the convergence results of the AECM algorithm proved in Section 3.3 of [24] do not depend on the distributions of the data sets. Therefore, in addition to setting a limit iter_max on the maximum number of steps, we use the following relative change of the log-likelihood as the stopping criterion:

$$\frac{|L(\theta^{(i+1)})-L(\theta^{(i)})|}{|L(\theta^{(i)})|}<\varepsilon,$$

where ε is a specified tolerance, which by default is set to $10^{-5}$ in our numerical examples. (4) Based on the computed results of Algorithm 1, and similarly to PPCA [8] and BPPCA [12], it is known that

$$\hat Z=\mathrm{E}(Z\mid X)=M_c^{-1}C^T(X-W)RM_r^{-1}$$

can be considered as the compressed representation of X. Hence, we can reconstruct X as

$$\hat X=C\hat ZR^T+W. \qquad (30)$$

Algorithm 1 (AECM algorithm for RBPPCA).
Input: data {X_n}, orders q_c and q_r, initial parameters θ^(0), tolerance ε and maximum number of steps iter_max.
for i = 0, 1, 2, ... until the stopping criterion in remark (3) is satisfied or iter_max is reached do
  Compute the conditional expectations of the first stage by (21).
  Compute W^(i), C^(i+1) and σ_c^(i+1) by (23) and (24).
  Solve the scalar nonlinear maximization problem (20) to obtain ν̂^(i).
  Compute the conditional expectations of the second stage by (25).
  Compute W^(i+1), R^(i+1) and σ_r^(i+1) by (28).
  Solve the scalar nonlinear maximization problem (26) to obtain ν^(i+1).
end for
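Remark (1) above notes that (Σ_c^{(i)})^{-1} and (Σ_r^{(i)})^{-1} need not be formed by direct d_c × d_c or d_r × d_r inversion. A minimal sketch of the corresponding Woodbury-type computation for Σ_c = CC^T + σ_c²I_{d_c} (our own illustration with hypothetical sizes, not the paper's implementation) is given below; Σ_r is handled in the same way with R and σ_r².

```python
import numpy as np

def inv_sigma_c(C, sigma2_c):
    """Sigma_c^{-1} = (C C^T + sigma2_c I_{dc})^{-1} via the Woodbury identity,
    which only needs a q_c x q_c solve:
      Sigma_c^{-1} = (I_{dc} - C (C^T C + sigma2_c I_{qc})^{-1} C^T) / sigma2_c."""
    dc, qc = C.shape
    Mc = C.T @ C + sigma2_c * np.eye(qc)
    return (np.eye(dc) - C @ np.linalg.solve(Mc, C.T)) / sigma2_c

# Quick check against the direct inverse (hypothetical sizes).
rng = np.random.default_rng(3)
C = rng.standard_normal((50, 4))
print(np.allclose(inv_sigma_c(C, 0.3), np.linalg.inv(C @ C.T + 0.3 * np.eye(50))))
```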

Numerical Examples
In this section, we conduct several numerical examples based on synthetic problems and three real-world data sets to demonstrate the effectiveness of our proposed RBPPCA algorithm. All experiments were run using MATLAB (2016a), with machine epsilon 2.22 × 10^-16, on a Windows 10 (64-bit) laptop with an Intel Core i7-8750H CPU (2.20 GHz) and 8 GB memory. Each random experiment was repeated 20 times independently, and the average numerical results are reported.

Example 1 (Experiments on synthetic data)

In this example, we only compare our method with the BPPCA algorithm [12] to illustrate the significant improvement of the RBPPCA algorithm. We take N data matrices X_n for n = 1, ..., N with N = 200 and d_c = d_r = 64, N_g of which are generated by X_n = CZ_nR^T + W + CE^r_n + E^c_nR^T + E_n, n = 1, ..., N_g, where C, R and W are synthesized by simple MATLAB commands, and Z_n, E^r_n, E^c_n and E_n are sampled from the matrix variate normal distributions N_{q_c,q_r}(0_{q_c×q_r}, I_{q_c}, I_{q_r}), N_{q_c,d_r}(0_{q_c×d_r}, I_{q_c}, σ_r²I_{d_r}), N_{d_c,q_r}(0_{d_c×q_r}, σ_c²I_{d_c}, I_{q_r}), and N_{d_c,d_r}(0_{d_c×d_r}, σ_c²I_{d_c}, σ_r²I_{d_r}) with σ_c = σ_r = 1, respectively. The other N_o data matrices, i.e., X_n for n = N_g + 1, ..., N, are regarded as outliers, each entry of which is sampled from the uniform distribution over the range 0 to 10. In order to assess the quality of the computed approximations C^(i) and R^(i), we calculate the arc length distance between the two subspaces span(R ⊗ C) and span(R^(i) ⊗ C^(i)), which is used in [12] and defined as

$$\mathrm{dist}\big(\mathrm{span}(R\otimes C),\,\mathrm{span}(R^{(i)}\otimes C^{(i)})\big)=\arccos\big(\sigma_{\min}(Q^TQ^{(i)})\big), \qquad (31)$$

to monitor the numerical performance of the RBPPCA and BPPCA methods, where span(R ⊗ C) and span(R^(i) ⊗ C^(i)) are the column spaces of R ⊗ C and R^(i) ⊗ C^(i), respectively, and Q and Q^(i) are orthonormal basis matrices of span(R ⊗ C) and span(R^(i) ⊗ C^(i)), respectively. In fact, by ([28], Definition 4.2.1), the result of (31) is the largest canonical angle between the estimated subspace span(R^(i) ⊗ C^(i)) and the true span(R ⊗ C).
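One way to evaluate the subspace distance (31), assuming it is the largest canonical angle obtained from the singular values of Q^T Q^(i) (our reading of the definition), is sketched below; the function and argument names are illustrative.

```python
import numpy as np

def arc_length_distance(C_true, R_true, C_est, R_est):
    """Largest canonical angle between span(R_true kron C_true) and
    span(R_est kron C_est): take orthonormal bases Q, Qi via QR and use
    arccos of the smallest singular value of Q^T Qi."""
    Q, _ = np.linalg.qr(np.kron(R_true, C_true))
    Qi, _ = np.linalg.qr(np.kron(R_est, C_est))
    s = np.clip(np.linalg.svd(Q.T @ Qi, compute_uv=False), -1.0, 1.0)
    return np.arccos(s.min())
```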
In this test, we start with C^(0) = rand(d_c, q_c), R^(0) = rand(d_r, q_r), σ_c^(0) = 1, σ_r^(0) = 1 and ν^(0) = 1, and then consider the effect of the ratio of outliers, i.e., N_o/N, which is varied from 0 to 30% in increments of 10% in this example. In these cases, the estimated values of ν all equal one. If we use other initial values ν^(0) here, the computed ν also converges to one as the iterations proceed. The corresponding numerical results for the arc length distances are plotted in Figure 1. Figure 1 shows that the RBPPCA and BPPCA methods exhibit nearly the same convergence behavior when the data matrices contain no outliers.
As the ratio of outliers increases to 30%, it is reasonable that more iterations are required for the RBPPCA method to achieve satisfactory accuracy. In contrast, the arc length distances of the BPPCA method remain at approximately 1.5 whenever the data include outliers, which shows that the proposed RBPPCA method is much more robust to outliers.
In this example, we also test the impact of the initializations of C^(0) and R^(0), and of the sample size N, respectively, based on the synthetic data having 10% outliers. Three different types of initializations of C^(0) and R^(0) are set as follows: (1) initialization 1: C^(0) = rand(d_c, q_c) and R^(0) = rand(d_r, q_r); (2) initialization 2: C^(0) = randn(d_c, q_c) and R^(0) = randn(d_r, q_r); (3) initialization 3: fixed deterministic matrices whose entries are built from trigonometric functions (sin, cos, tan, cot, sec); all other parameters are fixed to be the same. The results associated with the different initializations are shown in Figure 2. Inspection of the plot illustrates that our RBPPCA method appears to be insensitive to the initialization, since the convergence behaviors of the RBPPCA method under the different initializations do not differ significantly.

Example 2 (Experiments on the Yale and Yale B face databases)

We run the BPPCA and RBPPCA algorithms on the original data set and on the corrupted data set, respectively, with the order of Z in (11) being q_c = q_r = 8, and consider the reconstructed images X̂_n defined in (30) based on the computed results. A comparison of the reconstructed images based on the original data set and on the corrupted data set with two corrupted images for each individual is shown in Figure 8. In Figure 8, the first, second and third columns are the original images, the reconstructed images of RBPPCA, and the reconstructed images of BPPCA for the Yale (left) and Yale B (right) databases, respectively, and the images of the first, second and third rows are based on the original data set, the corrupted data set with 30 × 30 rectangles of noise, and the corrupted data set with 50 × 50 rectangles of noise, respectively. BPPCA and RBPPCA perform almost identically in reconstructing images from the original data set. However, for the corrupted data sets, RBPPCA produces better reconstructed images than BPPCA, because BPPCA also tries to explain the noise.
We also compare the average reconstruction errors η = Σ_{n=1}^N ||X_n − X̂_n||_F / N and the recognition accuracy rates of the RBPPCA and BPPCA algorithms in the cases where each person has two corrupted images and four corrupted images, respectively; the recognition accuracy rates are calculated with the nearest neighbor classifier (1-NN), which is employed for classification. The average reconstruction errors and recognition accuracy rates versus the order of Z are plotted in Figures 9 and 10, respectively. As expected, the average reconstruction errors decrease, while the recognition accuracy rates rise, as the order of Z increases. In these cases, our proposed RBPPCA algorithm outperforms the BPPCA algorithm both in reducing the average reconstruction errors and in enhancing the recognition accuracy. In addition, another advantage of robust probabilistic algorithms based on t distributions is outlier detection. By [14], we can compute ρ_n = [vec(X_n − W)]^T (Σ_r ⊗ Σ_c)^{-1} vec(X_n − W) as a criterion for outlier detection. Figure 11 is the scatter chart of ρ_n for all the images in the training set of the Yale and Yale B databases with two corrupted images per person, respectively. It can be seen that the values of ρ_n split into three groups. Notice that these three groups correspond to the images with no noise, with a 30 × 30 rectangle of noise, and with a 50 × 50 rectangle of noise, respectively. Hence, the comparison of ρ_n provides a method for identifying the outliers.
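A sketch of this outlier-scoring step (our own illustration, with the mean denoted W as in the model): it evaluates ρ_n through the equivalent trace form, so the d_cd_r × d_cd_r Kronecker product never has to be formed, and flags samples whose score lies far above the bulk; the thresholding rule shown is illustrative only.

```python
import numpy as np

def outlier_scores(X, W, Sigma_c, Sigma_r):
    """rho_n = vec(X_n - W)^T (Sigma_r kron Sigma_c)^{-1} vec(X_n - W), evaluated
    through the equivalent trace form tr(Sigma_c^{-1}(X_n - W) Sigma_r^{-1}(X_n - W)^T)."""
    D = X - W
    T = np.linalg.inv(Sigma_c) @ D
    S = np.linalg.inv(Sigma_r) @ D.transpose(0, 2, 1)
    return np.einsum('nij,nji->n', T, S)

def flag_outliers(rho, k=5.0):
    """Flag samples whose score exceeds the median by k median absolute deviations."""
    med = np.median(rho)
    mad = np.median(np.abs(rho - med)) + 1e-12
    return rho > med + k * mad
```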

Example 3 (Experiments on the MNIST dataset).
In Example 2, it is shown that RBPPCA is superior to BPPCA when the data sets contain outliers. In this example, we compare the RBPPCA algorithm with the tPPCA [14] and L1-PPCA [15] algorithms based on handwritten digit images from the MNIST database (available from http://yann.lecun.com/exdb/mnist, accessed on 22 September 2021), in which each image has 28 × 28 pixels. We choose 59 images of the digit 4 as the training data set, and randomly select nine of them to be corrupted as outliers. The images are corrupted by adding noise drawn from a uniform distribution on the interval [0, 510], after which all images are normalized to the range [0, 1]. The normalized corrupted images of the digit 4 are shown in Figure 12. The RBPPCA, tPPCA [14] and L1-PPCA [15] algorithms are run for 100 iterations on the original data set of the digit 4 and on the corrupted data set, respectively. Figure 13 presents the average reconstruction errors of the RBPPCA, tPPCA and L1-PPCA algorithms, where the feature number is the order of Z for the RBPPCA algorithm and the dimension of the low-dimensional representation for the tPPCA and L1-PPCA algorithms. It is observed that the performance of our RBPPCA algorithm is superior to that of the other algorithms. The reconstruction behaviors of the different algorithms based on the original data set of the digit 4 and on the corrupted data set with q_c = q_r = 1 are shown in Figure 14. In Figure 14, the first column is the original images, and the second, third and fourth columns are the reconstructed images of the RBPPCA, tPPCA and L1-PPCA algorithms, respectively. As shown in Figure 14, compared with the tPPCA and L1-PPCA algorithms, the RBPPCA algorithm produces better reconstructions in this case.

Conclusions
To remedy the problem that the data are assumed to follow a matrix variate Gaussian distribution, which is sensitive to outliers, in this paper we proposed a robust BPPCA algorithm (RBPPCA), i.e., Algorithm 1, by replacing the matrix variate Gaussian distribution with the matrix variate t distribution for the noise. Compared with BPPCA, owing to the significantly heavier tails of the matrix variate t distribution, our proposed RBPPCA method, combined with the AECM algorithm for estimating the parameters, can deal with 2-D data sets in the presence of outliers. Numerical examples based on a synthetic data set and two publicly available real data sets, Yale and Yale B, are presented to show that Algorithm 1 is far superior to the BPPCA algorithm in computational accuracy, reconstruction performance, average reconstruction errors, recognition accuracy rates, and outlier detection. It is also shown by numerical examples based on the MNIST database that our RBPPCA method outperforms the tPPCA and L1-PPCA algorithms.

Appendix A

For the E-step of the first stage, since the conditional mean of $Y^r_n\mid X_n,\mu_n$ in (16b) does not depend on $\mu_n$, we have

$$\mathrm{E}\big(\mu_nY^r_n\mid X_n\big)=\mathrm{E}^{(i)}_{\mu_n}\mathrm{E}^{(i)}_{Y^r_n},$$

where $\mathrm{E}^{(i)}_{\mu_n}$ and $\mathrm{E}^{(i)}_{Y^r_n}$ are given by (21). In addition, for symmetric matrices $A\in\mathbb{R}^{d_r\times d_r}$ and $B\in\mathbb{R}^{q_c\times q_c}$, we have

$$\mathrm{E}\big[\mathrm{tr}\big(\mu_nA(Y^r_n)^TBY^r_n\big)\mid X_n\big]
=\int_{Y^r_n}\int_{\mu_n}\mathrm{tr}\big(\mu_nA(Y^r_n)^TBY^r_n\big)\,p(Y^r_n\mid X_n,\mu_n)\,p(\mu_n\mid X_n)\,d\mu_n\,dY^r_n$$
$$=\int_{\mu_n}\mu_n\,p(\mu_n\mid X_n)\int_{Y^r_n}\mathrm{tr}\big(A(Y^r_n)^TBY^r_n\big)\,p(Y^r_n\mid X_n,\mu_n)\,dY^r_n\,d\mu_n$$
$$=\mathrm{E}^{(i)}_{\mu_n}\,\mathrm{tr}\big(A(\mathrm{E}^{(i)}_{Y^r_n})^TB\,\mathrm{E}^{(i)}_{Y^r_n}\big)+(\sigma_c^{(i)})^2\,\mathrm{tr}\big(A\Sigma_r^{(i)}\big)\,\mathrm{tr}\big(B(M_c^{(i)})^{-1}\big).$$

The analogous conditional expectations needed in the second stage are obtained in the same way, where $\mathrm{E}^{(i)}_{\mu_n}$ and $\mathrm{E}^{(i)}_{Y^c_n}$ are defined in (25); the corresponding last equality (A7) holds for the same reason as (A3).