Abstract
Principal component analysis (PCA) is one of the most popular tools in multivariate exploratory data analysis. Its probabilistic version (PPCA), based on the maximum likelihood procedure, provides a probabilistic way to implement dimension reduction. Recently, the bilinear PPCA (BPPCA) model, which assumes that the noise terms follow matrix variate Gaussian distributions, has been introduced to deal directly with two-dimensional (2-D) data, such as images, preserving their matrix structure and avoiding the curse of dimensionality. However, Gaussian distributions are not always adequate in real-life applications, whose data sets may contain outliers. In order to make BPPCA robust to outliers, in this paper, we propose a robust BPPCA model that assumes matrix variate t distributions for the noise terms. The alternating expectation conditional maximization (AECM) algorithm is used to estimate the model parameters. Numerical examples on several synthetic and publicly available data sets are presented to demonstrate the superiority of our proposed model in feature extraction, classification and outlier detection.
1. Introduction
High-dimensional data are increasingly collected in a variety of real-world applications. However, such data are often not distributed uniformly in their ambient space; instead, the interesting structure inside the data often lies in a low-dimensional subspace [1]. One of the fundamental challenges in pattern recognition, machine learning and statistics is how to find a low-dimensional representation of high-dimensional observed data [2,3]. Principal component analysis (PCA) [4] is arguably the most well-known dimension reduction method for high-dimensional data analysis. It finds the first few principal eigenvectors, corresponding to the largest eigenvalues of the covariance matrix, and then projects the high-dimensional data onto the low-dimensional subspace spanned by these principal eigenvectors to achieve dimensionality reduction.
The traditional PCA is concerned with vectorial, i.e., 1-D, data. For 2-D image sample matrices, it is usual to first convert the 2-D image matrices into 1-D image vectors. This transformation leads to high-dimensional image sample vectors and a large covariance matrix, and thus suffers from the difficulty of accurately evaluating the principal eigenvectors of a large-scale covariance matrix. Furthermore, such vectorization of 2-D data destroys the natural matrix structure and ignores potentially valuable information about the spatial relationships within 2-D data. Therefore, two-dimensional PCA (2DPCA) type algorithms [5,6,7] have been proposed to compute principal component weight matrices directly from the 2-D image training sample matrices instead of using vectorization.
These conventional PCA and 2DPCA algorithms are both derived and interpreted in the standard algebraic framework, so they lack the capability to handle statistical inference or missing data. To remedy these drawbacks, a probabilistic PCA (PPCA) model was proposed by Tipping and Bishop in [8]; it assumes Gaussian distributions on the observations through extra latent variables and has been successfully applied in many machine learning tasks [9]. Following PPCA, a probabilistic second-order PCA, called PSOPCA, was developed in [10] to model 2-D image matrices directly based on the so-called matrix variate Gaussian distributions.
Throughout this paper, $\mathbb{R}^{m\times n}$ is the set of all $m\times n$ real matrices, and $I_n$ and $0$ are the identity matrix and zero matrix, respectively. The superscript "$\top$" means transpose, and $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the 2-norm and Frobenius norm of a matrix, respectively. Denoted by $\mathcal{N}_{m\times n}(M,\Sigma_c,\Sigma_r)$ is the matrix variate Gaussian distribution [11] with the mean matrix $M\in\mathbb{R}^{m\times n}$, column covariance $\Sigma_c\in\mathbb{R}^{m\times m}$ and row covariance $\Sigma_r\in\mathbb{R}^{n\times n}$. A random matrix $X\in\mathbb{R}^{m\times n}$ is said to follow the matrix variate Gaussian distribution, i.e., $X\sim\mathcal{N}_{m\times n}(M,\Sigma_c,\Sigma_r)$, if
$$\operatorname{vec}(X)\sim\mathcal{N}_{mn}\big(\operatorname{vec}(M),\,\Sigma_r\otimes\Sigma_c\big),$$
where $\operatorname{vec}(\cdot)$ is the vectorization of a matrix obtained by stacking the columns of the matrix on top of one another. That means that the probability density function (pdf) of X is
$$p(X)=(2\pi)^{-\frac{mn}{2}}|\Sigma_r\otimes\Sigma_c|^{-\frac12}\exp\Big\{-\tfrac12\operatorname{vec}(X-M)^{\top}(\Sigma_r\otimes\Sigma_c)^{-1}\operatorname{vec}(X-M)\Big\}=(2\pi)^{-\frac{mn}{2}}|\Sigma_c|^{-\frac{n}{2}}|\Sigma_r|^{-\frac{m}{2}}\exp\Big\{-\tfrac12\operatorname{tr}\big[\Sigma_c^{-1}(X-M)\Sigma_r^{-1}(X-M)^{\top}\big]\Big\}, \tag{2}$$
where "⊗" is the Kronecker product of two matrices, and $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. The last equality of (2) holds because of $|\Sigma_r\otimes\Sigma_c|=|\Sigma_r|^{m}|\Sigma_c|^{n}$ and
$$\operatorname{vec}(X-M)^{\top}(\Sigma_r\otimes\Sigma_c)^{-1}\operatorname{vec}(X-M)=\operatorname{tr}\big[\Sigma_c^{-1}(X-M)\Sigma_r^{-1}(X-M)^{\top}\big].$$
See ([11], Theorem 1.2.21) and ([11], Theorem 1.2.22) for more details.
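As a quick numerical illustration of the vectorization identity behind (2), the following MATLAB snippet (with arbitrary matrix sizes and symbols chosen by us for illustration) checks that vec(CZR⊤) = (R ⊗ C)vec(Z).

```matlab
% Minimal check of vec(C*Z*R') = kron(R, C)*vec(Z); sizes and names are illustrative.
C = randn(4, 2);  Z = randn(2, 3);  R = randn(5, 3);
lhs = reshape(C*Z*R', [], 1);            % vec(C*Z*R')
rhs = kron(R, C) * reshape(Z, [], 1);    % kron(R, C)*vec(Z)
fprintf('max abs difference: %g\n', max(abs(lhs - rhs)));
```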
PSOPCA in [10] considers the following two-sided latent matrix variable model
where and are the column and row factor loading matrices, respectively, and are the mean and error matrices, respectively, and is the latent core variable of X. The PSOPCA model is further extended to the bilinear probabilistic principal component analysis (BPPCA) model in [12] to better establish the relationship with the 2DPCA algorithm [6]; the BPPCA model is defined as
In contrast to the PSOPCA model (3), the column and row noise matrices and , with different noise variances and , respectively, are included in the BPPCA model, and denotes the common noise matrix. The model (4) improves the flexibility in capturing data uncertainty and makes the marginal distribution of X a matrix variate Gaussian. In particular, we can see that if and are removed and , then (4) reduces to the PSOPCA model.
All of the above-mentioned probabilistic models assume that the noise terms follow Gaussian distributions. It is well known that Gaussian noise leads to serious drawbacks when dealing with anomalous observations. Thus, probabilistic PCA models based on Gaussian distributions are not robust to outliers. To make probabilistic models insensitive to outliers, one prefers heavy-tailed distributions, such as the Student t distribution or the centered Laplacian distribution with the -norm. Using the t distribution or the centered Laplacian distribution instead of the Gaussian distribution in the PPCA model [8] results in the tPPCA [13,14] and probabilistic L1-PPCA [15] algorithms, respectively. Similarly, a robust version of PSOPCA, called L1-2DPPCA, is introduced in [16] based on the Laplacian distribution combined with variational EM-type algorithms to learn the parameters. However, it is difficult to generalize a robust version of the BPPCA algorithm based on the Laplacian distribution. The reason is that if the error term in the PSOPCA model follows a Laplacian distribution, then the conditional distribution is also a Laplacian distribution, but this does not hold in the BPPCA model. Fortunately, the same goal can be achieved by using the t distribution. In fact, the Gaussian distribution is a special case of the t distribution. Compared to the Gaussian distribution, the t distribution has significantly heavier tails and contains one more free parameter. Recently, several robust probabilistic models under the assumption of the t distribution have been developed successfully by a number of researchers [17,18,19,20,21,22]. Motivated by these facts, we continue this effort and develop a robust BPPCA model based on matrix variate t distributions to handle 2-D data sets in the presence of outliers.
The remainder of the paper is organized as follows. Section 2 introduces some notation and the matrix variate t distribution, which are essential to our later development. The robust BPPCA model and the associated parameter estimation based on the AECM algorithm are given and analyzed in detail in Section 3. Section 4 presents numerical examples that show the behavior of our proposed model and support our analysis. Finally, conclusions are drawn in Section 5.
2. Preliminaries
Let $\alpha>0$ and $\beta>0$, and let the probability density function of a random variable $u$ having a Gamma distribution with parameters $\alpha$ and $\beta$, i.e., $u\sim\mathrm{Gamma}(\alpha,\beta)$, be
$$p(u)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,u^{\alpha-1}e^{-\beta u},\qquad u>0,$$
where $\Gamma(\cdot)$ is the Gamma function, i.e., $\Gamma(\alpha)=\int_{0}^{\infty}t^{\alpha-1}e^{-t}\,dt$.
Analogously to the process of tPPCA in [14], we derive the matrix variate t distribution in this paper by considering
where . Let . We have
Let . Then, and . Therefore, (7) can be rewritten as
In this paper, if the pdf of the random matrix X is
$$p(X)=\frac{\Gamma\!\big(\tfrac{v+mn}{2}\big)}{\Gamma\!\big(\tfrac{v}{2}\big)(v\pi)^{\frac{mn}{2}}}\,|\Sigma_c|^{-\frac{n}{2}}|\Sigma_r|^{-\frac{m}{2}}\Big[1+\tfrac{1}{v}\operatorname{tr}\big(\Sigma_c^{-1}(X-M)\Sigma_r^{-1}(X-M)^{\top}\big)\Big]^{-\frac{v+mn}{2}}, \tag{9}$$
then the random matrix X is said to follow the matrix variate t distribution with degrees of freedom $v$, and this is denoted by
In particular, if $m=1$ or $n=1$, then the matrix variate t distribution degenerates to the multivariate t distribution. As with the classical multivariate t distribution, another useful perspective on the matrix variate t distribution, which is critical to our later developments, is to treat the scale $u$ as a latent variable; then the conditional distribution of X given $u$ is a matrix variate Gaussian distribution by (8), i.e.,
where . Notice that, despite the non-uniqueness in the factorization , in (10) always returns the same pdf of X.
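To make the latent-scale view above concrete, the following MATLAB sketch draws one sample from a matrix variate t distribution by first drawing a Gamma-distributed scale and then a matrix variate Gaussian conditional on it; the Gamma(v/2, v/2) shape-rate parametrization and the symbol names are assumptions made for illustration, not a verbatim transcription of (9) and (10).

```matlab
% Hedged sketch: sample X from a matrix variate t via the Gaussian scale mixture view.
v  = 5;                       % degrees of freedom
M  = zeros(4, 3);             % mean matrix
Sc = eye(4);  Sr = eye(3);    % column and row covariance matrices (assumed names)
u  = gamrnd(v/2, 2/v);        % gamrnd takes shape and SCALE, so rate v/2 => scale 2/v
E  = sqrtm(Sc) * randn(4, 3) * sqrtm(Sr);   % E ~ N(0, Sc, Sr)
X  = M + E / sqrt(u);         % X | u ~ N(M, Sc/u, Sr): covariance inflated when u < 1
```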
3. Robust BPPCA
3.1. The Model
In this section, we develop a robust model by replacing the matrix variate Gaussian distribution in BPPCA with the matrix variate t distribution with degrees of freedom defined in (9) to deal with 2-D data sets. Specifically, the proposed robust bilinear probabilistic principal component analysis model (RBPPCA for short) is defined as
As in BPPCA [12], in the RBPPCA model (11), , and are the column, row and common noise matrices, respectively, is the latent matrix, and these are assumed to be independent of each other; the mean matrix and the column and row factor loading matrices are , and , respectively. Similarly to BPPCA [12], the parameters C, R, and cannot be uniquely identified, but the subspaces of interest spanned by the columns of C and R are unique. The reader is referred to ([12], Appendix B) for details. The difference from BPPCA is that the noise matrices , and and the latent matrix variate Z in the RBPPCA model (11) are assumed to follow matrix variate t distributions via (10), i.e.,
It follows by (11) that
Consequently, where
That means the random matrix X follows the matrix variate t distribution, i.e., . In addition, as shown in [23], the conditional distribution , which is also required in our later estimation of the model parameters, is a Gamma distribution, i.e.,
where .
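Since the latent scale given an observation follows a Gamma distribution, its posterior mean acts as a robustness weight in the later E-step: typical observations receive weights near one, while outliers are down-weighted. The sketch below uses the standard t-model form E[u | X] = (v + mn)/(v + δ), with δ the matrix Mahalanobis distance; this is an assumed standard expression in the spirit of [23], since the exact symbols of (13) are not reproduced here.

```matlab
% Hedged sketch of the posterior-mean weight of the latent scale for one m-by-n
% observation X; a small weight flags a likely outlier (assumed standard t-model form).
m = 4;  n = 3;  v = 5;
M = zeros(m, n);  Sc = eye(m);  Sr = eye(n);
X = M + randn(m, n);
delta  = trace((Sc \ (X - M)) * (Sr \ (X - M)'));  % tr(Sc^{-1}(X-M) Sr^{-1}(X-M)')
weight = (v + m*n) / (v + delta);                  % close to 1 for typical X
```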
By introducing two latent matrix variates and , the RBPPCA model (11) can be rewritten as
where and are the row projected intermediate and residual matrices, respectively. By (11), we have the conditional distributions
where is given by (12). In addition, by using (15c) and Bayes' rule, the conditional distributions and can be calculated as
where
In (14), the bilinear projection in the RBPPCA model is split into two stages by first projecting the latent matrix Z in the row direction to obtain , which is then projected in the column direction to finally generate X. Similarly, we can also consider the decomposition of the bilinear projection by first projecting in the column direction and then in the row direction, rewriting (11) as
where and with and . Furthermore,
3.2. Estimation of the Parameters
In the model (11), the parameter set to be estimated is . We introduce in this subsection how to compute the parameters by using the alternating expectation conditional maximization (AECM) algorithm. The AECM algorithm [12,24,25] is a two-stage iterative optimization technique for finding maximum likelihood solutions. To apply the AECM algorithm, we divide the parameter set into two subsets and .
In the first stage, we consider the AECM algorithm for the model (14) to compute . Let with be a set of 2-D sample observations. The latent variables’ data are treated as “missing data”, and the “complete” data log-likelihood is
In the E-step, given the parameter set obtained from the i-th iteration, we compute the expectation of with respect to the conditional distribution , i.e.,
where the constant contains those terms that do not involve the parameters in the set . We denote and for convenience. It is noted by (13) and (16b) that, given the parameter set , the conditional distributions of and are known. That is, for ,
where , , , and with . Then, it is easy to obtain that
In addition, based on the conditional distributions of and in (20), the conditional expectations are given by [13], where is the digamma function, and
which is detailed in Appendix A.
In the subsequent conditional maximization (CM) step of the first stage, given the condition , we maximize with respect to . It follows by (20) that
Therefore, by successively solving the equations , and , we can iteratively update the parameters W, C and . Specifically, by , we have
That means
Thus, an iterative updating of W can be obtained by
Similarly, based on and , we have
The last equality (24b) holds because of (22) and (24a). Finally, we update by maximizing the scalar nonlinear function defined in (20) with respect to v, which can be done numerically by most scientific computing software packages [26,27], to obtain .
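As noted above, the update of the degrees of freedom reduces to maximizing a scalar function, which standard numerical routines handle directly. The pattern is sketched below with MATLAB's fminbnd; the objective handle Qv is a hypothetical placeholder for the relevant part of the expected log-likelihood, not the paper's exact expression.

```matlab
% Generic pattern for the scalar CM-step on the degrees of freedom v:
% maximize Qv(v) over a bounded interval by minimizing its negative.
Qv = @(v) -0.5*(v - 3).^2;                        % hypothetical placeholder objective
[vnew, negQ] = fminbnd(@(v) -Qv(v), 1e-3, 1e3);   % search v in (0.001, 1000)
fprintf('updated v = %.4f\n', vnew);
```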
In the second stage, the AECM algorithm is used for the model (18) to update . In such a case, we consider the latent variables’ data as “missing data”. Then, by (19b), the “complete” data log-likelihood is
Similarly, in E-step of the second stage, given the updated parameter set
where , , and are calculated from the first stage, we compute the expectation of with respect to the conditional distribution , denoted by . Based on (19) and the current parameter set , we define
We have, up to a constant,
where and
See Appendix A for the derivation of (27). Finally, in the CM-step of the second stage, based on and similarly to (23) and (24), we maximize with respect to to update
and then solve the scalar nonlinear maximization problem (26) on to get .
We summarize what we do in this subsection in Algorithm 1. A few remarks regarding Algorithm 1 are in order:
- (1)
- In Algorithm 1, it is not necessary to explicitly compute and . The reason is that the calculation of and can be performed more efficiently by using . In each iteration of Algorithm 1, the most expensive computation is the formation of with . Owing to the introduction of the new latent variable , RBPPCA is slightly more expensive than BPPCA, which has a computational cost of . However, it will be shown in our numerical examples that the RBPPCA algorithm is less sensitive to outliers.
- (2)
- Compared with the AECM algorithm of BPPCA in [12], which uses the centered data and estimates and based on the models and , respectively, two more parameters, and W, need to be computed in the AECM iteration of RBPPCA. Notice that both models (14) and (18) contain the parameters and W. Thus, we split the parameter set into and , which naturally leads to the parameters and W being updated twice in each loop of Algorithm 1. Although other partitions of the set , such as and , are also available for the estimation of the parameters, we prefer and , because updating and W one more time in each iteration adds only a little computational cost.
- (3)
- As stated in Section 3.3 of [24], any AECM sequence increases at each iteration and converges to a stationary point of . Notice that the convergence results of the AECM algorithm proved in Section 3.3 of [24] do not depend on the distributions of the data sets. Therefore, in addition to setting a limit on the maximum number of steps , we use the following relative change of the log-likelihood as the stopping criterion, i.e., where is a specified tolerance, which by default is set to in our numerical examples (a brief sketch of this stopping test is given after Algorithm 1 below).
- (4)
- Based on the computed results of Algorithm 1, and similarly to PPCA [8] and BPPCA [12], it is known that can be considered as the compressed representation of X. Hence, we can reconstruct X as
Algorithm 1 Robust bilinear probabilistic PCA algorithm (RBPPCA).
Input: Initialization , and sample matrices . Compute .
Output: the converged {}.
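Regarding the stopping rule in remark (3), the skeleton below sketches how the relative change of the log-likelihood can drive termination of the AECM loop; aecm_update and compute_loglik are hypothetical helpers standing in for the two-stage updates of Section 3.2, and the tolerance and iteration limit are placeholder values rather than the paper's defaults.

```matlab
function params = rbppca_fit(data, params0)
% Hypothetical driver skeleton: iterate AECM sweeps until the relative change of
% the log-likelihood, as in (29), falls below a placeholder tolerance.
tol = 1e-6;  maxIter = 500;                 % placeholder values, not the paper's defaults
params = params0;  L_prev = -Inf;
for it = 1:maxIter
    params = aecm_update(params, data);     % hypothetical: one full two-stage AECM sweep
    L_curr = compute_loglik(params, data);  % hypothetical: observed-data log-likelihood
    if it > 1 && abs(L_curr - L_prev) / abs(L_prev) < tol
        break;                              % relative change small enough: stop
    end
    L_prev = L_curr;
end
end
```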
4. Numerical Examples
In this section, we conduct several numerical examples based on synthetic problems and three real-world data sets to demonstrate the effectiveness of our proposed RBPPCA algorithm. All experiments were run using MATLAB (2016a), with machine epsilon , on a Windows 10 (64-bit) laptop with an Intel Core i7-8750H CPU (2.20 GHz) and 8 GB of memory. Each random experiment was repeated 20 times independently, and the average numerical results are reported.
Example 1
(Experiments on synthetic data). In this example, we compare our method only with the BPPCA algorithm [12] to illustrate the significant improvement of the RBPPCA algorithm. We take N data matrices for with and , of which are generated by
where C, R and W are simply synthesized by MATLAB as
, , and are sampled from the matrix variate normal distributions , , , and with , respectively. The other data matrices, i.e., for , are regarded as outliers, each entry of which is sampled from the uniform distribution over the range 0 to 10. In order to demonstrate the quality of the computed approximations and , we calculate the arc length distance between the two subspaces and , which is used in [12] and defined as
to monitor the numerical performance of the RBPPCA and BPPCA methods, where and are the column spaces of and , respectively, and Q and are orthonormal basis matrices of and , respectively. In fact, by ([28], Definition 4.2.1), the quantity computed in (31) is the largest canonical angle between the estimated subspace and the true .
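To make the distance in (31) concrete, the following MATLAB snippet computes the largest canonical angle between the column spaces of a true loading matrix and an estimate; the names C and Chat and the random test data are ours, chosen only for illustration.

```matlab
% Largest canonical (principal) angle between span(C) and span(Chat), i.e., the
% arc length distance used to monitor RBPPCA and BPPCA (illustrative data).
C    = randn(20, 3);            % "true" loading matrix (illustrative)
Chat = C + 0.05*randn(20, 3);   % perturbed estimate (illustrative)
Q    = orth(C);                 % orthonormal basis of span(C)
Qhat = orth(Chat);              % orthonormal basis of span(Chat)
s    = svd(Q' * Qhat);          % cosines of the canonical angles
dist = acos(min(1, min(s)));    % largest canonical angle, in radians
% MATLAB's built-in subspace(C, Chat) returns the same angle.
```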
In this test, we start with , , , and , and then consider the effect of the ratio of outliers, i.e., , which is varied from 0 to 30% in steps of 10% in this example. In these cases, the estimated values of are all . If other initial values are used here, the computed also converges to one as the iterations increase. The corresponding numerical results for the arc length distances are plotted in Figure 1. Figure 1 shows that the RBPPCA and BPPCA methods exhibit almost the same convergence behavior when the data matrices contain no outliers.
Figure 1.
Convergence behaviors of RBPPCA and BPPCA with the ratio of outliers being 0%, 10%, 20% and 30%, respectively.
As the ratio of outliers increases to 30%, it is reasonable that more iterations are required for the RBPPCA method to achieve a satisfactory accuracy. Unlike the BPPCA method, the presented RBPPCA method is robust to outliers: the arc length distances of the BPPCA method remain at approximately whenever the data include outliers.
In this example, we also test the impact of the initializations on and , and the sample size N, respectively, based on the synthetic data having 10% outliers. Three different types of initializations of and are set as follows:
- (1)
- initialization 1: and ;
- (2)
- initialization 2: and ;
- (3)
- initialization 3:
and the other parameters are fixed to be the same. The results associated with the different initializations are shown in Figure 2. Inspection of the plot shows that our RBPPCA method appears to be insensitive to the initialization, since the convergence behaviors of the RBPPCA method under the different initializations do not differ significantly. Figure 3 presents the required CPU time in seconds and the arc length distances of the BPPCA and RBPPCA methods after 25 iterations with respect to the number of samples, where the sample size N varies from 200 to 5000 with a stride length of 200. These graphs convey the fact that, as the sample size N increases, both the BPPCA and RBPPCA methods require more CPU time for 25 iterations, and the BPPCA method requires less time than RBPPCA, as we stated in the remarks on Algorithm 1. However, a larger N does not improve the arc length distances of the BPPCA method.
Figure 2.
Convergence behaviors of RBPPCA for three different types of initialization.
Figure 3.
CPU time in seconds (left) and arc length distances (right) of BPPCA and RBPPCA with the number of iterations being 25 with the sample size N from 200 to 5000.
Example 2
(Experiments on face image databases). In this example, all of the experiments are based on two publicly available face image databases: the Yale face database (available from http://vision.ucsd.edu/content/yale-face-database, accessed on 22 September 2021) and the Yale B face database (available from http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html, accessed on 22 September 2021). The Yale face database includes different face orientations and facial expressions, with and without glasses, and the Yale B face database is collected under different illumination conditions. We select 165 face images of 15 individuals from each database. Each person has 11 images, and these images are cropped to pixels. For each individual, eight images are randomly selected as the training set, and the rest form the test set. Then, we randomly select two and four images, respectively, from the eight training images of each individual to be corrupted as outliers. Half of the corrupted images are generated by replacing part of the original image with a rectangle of noise, and the other half are corrupted with a rectangle of noise. Within each rectangle, the pixel values come from a uniform distribution on the interval . Then, we rescale the images from the range [0, 255] to the range [0, 1]. Some original and corrupted images of the Yale and Yale B databases are shown in Figure 4, Figure 5, Figure 6 and Figure 7, respectively. The iterations of BPPCA and RBPPCA are stopped when the corresponding relative changes of the log-likelihood defined in (29) are smaller than .
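The corruption procedure can be sketched as follows; the file name, rectangle size and placement below are illustrative assumptions, since the paper's exact rectangle sizes are not reproduced here.

```matlab
% Hedged sketch: corrupt one face image with a rectangular patch of uniform noise.
img = double(imread('yale_subject01.pgm')) / 255;   % hypothetical file, rescaled to [0, 1]
[h, w] = size(img);
ph = 20;  pw = 20;                                  % illustrative patch height and width
r0 = randi(h - ph + 1);  c0 = randi(w - pw + 1);    % random top-left corner of the patch
img(r0:r0+ph-1, c0:c0+pw-1) = rand(ph, pw);         % uniform noise inside the rectangle
```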
Figure 4.
One set of processed samples from the Yale face database.
Figure 5.
Some generated outlier face images in the training set from the Yale face database.
Figure 6.
Eleven images of an individual from the Yale B face database.
Figure 7.
Some corrupted face images in the training set of the Yale B face database.
We run the BPPCA and RBPPCA algorithms on the original data set and the corrupted data set, respectively, with the order of Z in (11) being , and consider the reconstructed images defined in (30) based on the computed results. A comparison of the reconstructed images based on the original data set and the corrupted data set with two corrupted images for each individual is shown in Figure 8. In Figure 8, the first, second and third columns are the original images, the reconstructed images of RBPPCA, and the reconstructed images of BPPCA for the Yale (left) and Yale B (right) databases, respectively, and the images of the first, second and third rows are based on the original data set, the corrupted data set with rectangles of noise, and the corrupted data set with rectangles of noise, respectively. BPPCA and RBPPCA perform almost the same in reconstructing images from the original data set. However, for the corrupted data sets, RBPPCA produces better reconstructed images than BPPCA, because BPPCA tries to explain the noise information as well.
Figure 8.
Original images (the first column), reconstructed images by RBPPCA (the second column), and reconstructed images by BPPCA (the third column) of the Yale (left) and Yale B (right) databases, respectively.
We also compare the average reconstruction errors and the recognition accuracy rates of the RBPPCA and BPPCA algorithms in the cases where each person has two corrupted images and four corrupted images, respectively, where the recognition accuracy rates are calculated using the nearest neighbor (1-NN) classifier. The average reconstruction errors and recognition accuracy rates versus the order of Z are plotted in Figure 9 and Figure 10, respectively. As expected, the average reconstruction errors decrease, while the recognition accuracy rates rise, as the order of Z increases. In these cases, our proposed RBPPCA algorithm outperforms the BPPCA algorithm in reducing the average reconstruction errors and enhancing the recognition accuracy. In addition, another advantage of robust probabilistic algorithms based on t distributions is outlier detection. Following [14], we can compute as a criterion for outlier detection. Figure 11 is the scatter chart of for all the images in the training set of the Yale and Yale B databases with two corrupted images per person, respectively. The quantity can be seen to split into three groups. Notice that these three groups correspond to the images with no noise, a rectangle of noise, and a rectangle of noise, respectively. Hence, comparing provides a method for identifying outliers.
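The outlier score used in Figure 11 is, for each training image, the posterior mean of the latent scale (cf. the weight sketched after (13)). The snippet below, with hypothetical fitted parameters, computes this score for every image and plots the values, so that the corrupted images separate from the clean ones.

```matlab
% Hedged sketch: per-image latent-scale weights as outlier scores.
% Xs is an a-by-b-by-N array of training images; M, Sc, Sr and v are assumed to be
% the mean, covariances and degrees of freedom fitted by Algorithm 1.
[a, b, N] = size(Xs);
score = zeros(N, 1);
for i = 1:N
    D        = Xs(:, :, i) - M;
    delta    = trace((Sc \ D) * (Sr \ D'));    % matrix Mahalanobis distance
    score(i) = (v + a*b) / (v + delta);        % small score suggests an outlier
end
scatter((1:N)', score, 'filled');
xlabel('image index');  ylabel('latent-scale weight');
```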
Figure 9.
Average reconstruction errors of the BPPCA and RBPPCA algorithms vs. the order of Z for the Yale and Yale B databases with 2 and 4 corrupted images of each individual, respectively.
Figure 10.
Recognition accuracy rates of the BPPCA and RBPPCA algorithms vs. the order of Z for the Yale and Yale B databases with 2 and 4 corrupted images of each individual, respectively.
Figure 11.
of each image in the Yale (left) and Yale B (right) databases, respectively, with 2 corrupted images of each individual.
Example 3
(Experiments on the MNIST dataset). In Example 2, it was shown that RBPPCA is superior to BPPCA when the data sets contain outliers. In this example, we compare the RBPPCA algorithm with the tPPCA [14] and L1-PPCA [15] algorithms based on handwritten digit images from the MNIST database (available from http://yann.lecun.com/exdb/mnist, accessed on 22 September 2021), in which each image has pixels. We choose 59 images of the digit 4 as the training data set, and randomly select nine of them to be corrupted as outliers. The images are corrupted by adding noise from a uniform distribution on the interval , and then all images are normalized to the range . The normalized corrupted images of the digit 4 are shown in Figure 12. The RBPPCA, tPPCA [14] and L1-PPCA [15] algorithms are run for 100 iterations on the original data set of the digit 4 and on the corrupted data set, respectively.
Figure 12.
The normalized corrupted images of the digit 4.
Figure 13 presents the average reconstruction errors of the RBPPCA, tPPCA and L1-PPCA algorithms, where the feature number is the order of Z for the RBPPCA algorithm and the dimension of the low-dimensional representation for the tPPCA and L1-PPCA algorithms. It is observed that the performance of our RBPPCA algorithm is superior to that of the other algorithms. The reconstruction results of the different algorithms based on the original data set of the digit 4 and on the corrupted data set with are shown in Figure 14. In Figure 14, the first column shows the original images, and the second, third and fourth columns show the images reconstructed by the RBPPCA, tPPCA and L1-PPCA algorithms, respectively. As shown in Figure 14, compared to the tPPCA and L1-PPCA algorithms, the RBPPCA algorithm achieves better reconstruction outcomes in this case.
Figure 13.
Average reconstruction errors of the RBPPCA, tPPCA and L1-PPCA algorithms vs. feature number on the MNIST database.
Figure 14.
Original images of the digit 4 (the first column), images reconstructed by RBPPCA (the second column), images reconstructed by tPPCA (the third column), and images reconstructed by L1-PPCA (the fourth column). The images shown in the first and second rows are based on the original and corrupted image data sets, respectively.
5. Conclusions
To remedy the problem that data are assumed to follow a matrix variate Gaussian distribution, which is sensitive to outliers, in this paper we proposed a robust BPPCA algorithm (RBPPCA), i.e., Algorithm 1, by replacing the matrix variate Gaussian distribution with the matrix variate t distribution for the noise. Compared to BPPCA, owing to the significantly heavier tails of the matrix variate t distribution, our proposed RBPPCA method, combined with the AECM algorithm for estimating the parameters, can deal with 2-D data sets in the presence of outliers. Numerical examples based on synthetic data and two publicly available real data sets, Yale and Yale B, are presented to demonstrate that Algorithm 1 is far superior to the BPPCA algorithm in computational accuracy, reconstruction performance, average reconstruction errors, recognition accuracy rates, and outlier detection. It is also shown by numerical examples based on the MNIST database that our RBPPCA method outperforms the tPPCA and L1-PPCA algorithms.
Author Contributions
Writing—original draft, Y.L.; Writing—review and editing, Z.T. All authors have read and agreed to the published version of the manuscript.
Funding
This work is supported in part by the research fund for distinguished young scholars of Fujian Agriculture and Forestry University No. xjq201727, and the science and technology innovation special fund project of Fujian Agriculture and Forestry University No. CXZX2020105A.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
We first consider the conditional expectation of with respect to the conditional distribution , which will be applied in our derivation of (22). That is
where and are given by (21). In addition, for symmetric matrices and , we have
Notice that (22) can be rewritten as
Then, we give the calculation results for the terms on the right-hand side of (A3). It is clear that
References
- Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
- Burges, C.J.C. Dimension Reduction: A Guided Tour; Now Publishers Inc.: Noord-Brabant, The Netherlands, 2010. [Google Scholar]
- Ma, Y.; Zhu, L. A review on dimension reduction. Int. Stat. Rev. 2013, 81, 134–150. [Google Scholar]
- Jolliffe, I.T. Principal Component Analysis; Springer: Berlin/Heidelberg, Germany, 2002. [Google Scholar]
- Yang, J.; Zhang, D.; Frangi, A.F.; Yang, J. Two-dimensional PCA: A new approach to appearance-based face representation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 131–137. [Google Scholar]
- Ye, J. Generalized low rank approximations of matrices. Mach. Learn. 2005, 61, 167–191. [Google Scholar]
- Zhang, D.; Zhou, Z.H. (2D) 2PCA: Two-directional two-dimensional PCA for efficient face representation and recognition. Neurocomputing 2005, 69, 224–231. [Google Scholar]
- Tipping, M.E.; Bishop, C.M. Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B Stat. Methodol. 1999, 61, 611–622. [Google Scholar]
- Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
- Yu, S.; Bi, J.; Ye, J. Matrix-variate and higher-order probabilistic projections. Data Min. Knowl. Discov. 2011, 22, 372–392. [Google Scholar]
- Gupta, A.K.; Nagar, D.K. Matrix Variate Distributions; CRC Press: Boca Raton, FL, USA, 2018; Volume 104. [Google Scholar]
- Zhao, J.; Philip, L.H.; Kwok, J.T. Bilinear probabilistic principal component analysis. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 492–503. [Google Scholar]
- Zhao, J.; Jiang, Q. Probabilistic PCA for t distributions. Neurocomputing 2006, 69, 2217–2226. [Google Scholar]
- Chen, T.; Martin, E.; Montague, G. Robust probabilistic PCA with missing data and contribution analysis for outlier detection. Comput. Stat. Data Anal. 2009, 53, 3706–3716. [Google Scholar]
- Gao, J. Robust L1 principal component analysis and its Bayesian variational inference. Neural Comput. 2008, 20, 555–572. [Google Scholar] [CrossRef] [PubMed]
- Ju, F.; Sun, Y.; Gao, J.; Hu, Y.; Yin, B. Image outlier detection and feature extraction via L1-norm-based 2D probabilistic PCA. IEEE Trans. Image Process. 2015, 24, 4834–4846. [Google Scholar] [CrossRef] [PubMed]
- Galimberti, G.; Soffritti, G. A multivariate linear regression analysis using finite mixtures of t distributions. Comput. Stat. Data Anal. 2014, 71, 138–150. [Google Scholar] [CrossRef]
- Morris, K.; McNicholas, P.D.; Scrucca, L. Dimension reduction for model-based clustering via mixtures of multivariate t-distributions. Adv. Data Anal. Classif. 2013, 7, 321–338. [Google Scholar] [CrossRef]
- Pesevski, A.; Franczak, B.C.; McNicholas, P.D. Subspace clustering with the multivariate-t distribution. Pattern Recognit. Lett. 2018, 112, 297–302. [Google Scholar] [CrossRef] [Green Version]
- Teklehaymanot, F.K.; Muma, M.; Zoubir, A.M. Robust Bayesian cluster enumeration based on the t distribution. Signal Process. 2021, 182, 107870. [Google Scholar] [CrossRef]
- Wei, X.; Yang, Z. The infinite Student’s t-factor mixture analyzer for robust clustering and classification. Pattern Recognit. 2012, 45, 4346–4357. [Google Scholar] [CrossRef]
- Zhou, X.; Tan, C. Maximum likelihood estimation of Tobit factor analysis for multivariate t-distribution. Commun. Stat. Simul. Comput. 2009, 39, 1–16. [Google Scholar] [CrossRef]
- Lange, K.L.; Little, R.J.A.; Taylor, J.M.G. Robust statistical modeling using the t distribution. J. Am. Stat. Assoc. 1989, 84, 881–896. [Google Scholar] [CrossRef] [Green Version]
- Meng, X.L.; Van Dyk, D. The EM algorithm—An old folk-song sung to a fast new tune. J. R. Stat. Soc. Ser. B Stat. Methodol. 1997, 59, 511–567. [Google Scholar] [CrossRef]
- Zhou, Y.; Lu, H.; Cheung, Y. Bilinear probabilistic canonical correlation analysis via hybrid concatenations. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 5–9 October 2017; Volume 31. [Google Scholar]
- Coleman, T.; Branch, M.A.; Grace, A. Optimization Toolbox. In For Use with MATLAB. User's Guide for MATLAB 5, Version 2, Release II; 1999; Available online: https://www.dpipe.tsukuba.ac.jp/~naito/optim_tb.pdf (accessed on 20 September 2021).
- Schrage, L. LINGO User’s Guide; LINDO System Inc.: Chicago, IL, USA, 2006. [Google Scholar]
- Stewart, G.W. Matrix Algorithms: Volume II: Eigensystems; SIAM: Philadelphia, PA, USA, 2001. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).