Survey on Probabilistic Models of Low-Rank Matrix Factorizations

: Low-rank matrix factorizations such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF) are a large class of methods for pursuing the low-rank approximation of a given data matrix. The conventional factorization models are based on the assumption that the data matrices are contaminated stochastically by some type of noise. Thus the point estimations of low-rank components can be obtained by Maximum Likelihood (ML) estimation or Maximum a posteriori (MAP). In the past decade, a variety of probabilistic models of low-rank matrix factorizations have emerged. The most signiﬁcant difference between low-rank matrix factorizations and their corresponding probabilistic models is that the latter treat the low-rank components as random variables. This paper makes a survey of the probabilistic models of low-rank matrix factorizations. Firstly, we review some probability distributions commonly-used in probabilistic models of low-rank matrix factorizations and introduce the conjugate priors of some probability distributions to simplify the Bayesian inference. Then we provide two main inference methods for probabilistic low-rank matrix factorizations, i.e., Gibbs sampling and variational Bayesian inference. Next, we classify roughly the important probabilistic models of low-rank matrix factorizations into several categories and review them respectively. The categories are performed via different matrix factorizations formulations, which mainly include PCA, matrix factorizations, robust PCA, NMF and tensor factorizations. Finally, we discuss the research issues needed to be studied in the future.


Introduction
In many practical applications, a commonly-used assumption is that the dataset approximately lies in a low-dimensional linear subspace.Low-rank matrix factorizations (or decompositions, the two terms are used interchangeably) are just a type of method for recovering the low-rank structure, removing noise and completing missing values.Principal Component Analysis (PCA) [1], a traditional matrix factorization method, copes effectively with the situation that the dataset is contaminated by Gaussian noise.If the mean vector is set to be a zero vector, then PCA is transformed into Singular Value Decomposition (SVD) [2].Classical PCA approximates the low-rank representation according to eigen decomposition of the covariance matrix of a dataset to be investigated.When there are outliers or large sparse errors, the original version of PCA does not work well for obtaining the low-rank representations.In this case, many robust methods of PCA have been extensively studied such as L1-PCA [3], L1-norm maximization PCA [4], L21-norm maximization PCA [5] and so on.
In this paper, low-rank matrix factorizations refer to a general formulation of matrix factorizations and they mainly consist of PCA, matrix factorizations, robust PCA [6], Non-negative Matrix Factorization (NMF) [7] and tensor decompositions [8].As a matter of fact, matrix factorizations are usually a special case of PCA and they directly factorize a data matrix into the product of two low-rank matrices without consideration of the mean vector.Matrix completion aims to complete all missing values according to the approximate low-rank structure and it is mathematically described as a nuclear norm minimization problem [9].To this end, we regard matrix completion as a special case of matrix factorizations.In addition, robust PCA decomposes the data matrix into the superposition of a low-rank matrix and a sparse component, and recovers both the low-rank and the sparse matrices via principal component pursuit [6].
Low-rank matrix factorizations suppose that the low-rank components are corrupted by certain random noise and the low-rank matrices are deterministic unknown parameters.Maximum Likelihood (ML) estimation and Maximum a posteriori (MAP) are two most popular methodologies used in estimating the low-rank matrices.A prominent advantage of the aforementioned point estimate methods is that they are simple and easy to implement.However, we can not obtain the probability distributions of the low-rank matrices that are pre-requisite in exploring the generative models.Probabilistic models of low-rank matrix factorizations can approximate the true probability distributions of the low-rank matrices and they have attracted a great deal of attention in the past two decades.These models have been widely applied in the fields of signal and image processing, computer vision, machine learning and so on.
Tipping and Bishop [10] originally presented probabilistic PCA in which the latent variables are assumed to be a unit isotropic Gaussian distribution.Subsequently, several other probabilistic models of PCA were proposed successively, such as Bayesian PCA [11], Robust L1 PCA [12], Bayesian robust PCA [13] and so on.As a special case of probabilistic models of PCA, Bayesian matrix factorization [14] placed zero-mean spherical Gaussian priors on two low-rank matrices and it was applied to collaborative filtering.Probabilistic models of matrix factorizations also include probabilistic matrix factorization [15], Bayesian probabilistic matrix factorization [16], robust Bayesian matrix factorization [17], probabilistic robust matrix factorization [18], sparse Bayesian matrix completion [19], Bayesian robust matrix factorization [20] and Bayesian L1-norm low-rank matrix factorizations [21].
As to robust PCA, we take the small dense noise into account.In other words, the data matrix is decomposed into the sum of a low-rank matrix, a sparse noise matrix and a dense Gaussian noise matrix.Hence, the corresponding probabilistic models are robust to outliers and large sparse noise, and they are mainly composed of Bayesian robust PCA [22], variational Bayesian robust PCA [23] and sparse Bayesian robust PCA [19].As another type of low-rank matrix factorizations, NMF decomposes a non-negative data matrix into the product of two non-negative low-rank matrices.Compared with PCA, NMF is a technique which learns holistic, not parts-based representations [7].Different probabilistic models of NMF were proposed in [24][25][26][27][28]. Recently, probabilistic models of low-rank matrix factorizations are also extended to the case of tensor decompositions.Tucker decomposition and CANDECOMP/PARAFAC (CP) decomposition are two most important tensor decompositions.By generalizing the subspace approximation, some new low rank tensor decompositions have emerged, such as the hierarchical Tucker (HT) format [29,30], the tensor tree structure [31] and the tensor train (TT) format [32].Among them, the TT format is a special case of the HT and the tensor tree structure [33].The probabilistic models of the Tucker were presented in [34][35][36] and that of the CP were developed in [37][38][39][40][41][42][43][44][45][46][47][48].
For probabilistic models of low-rank matrix factorizations, there are three most frequently used statistical approaches for inferring the corresponding probability distributions or parameters, i.e., Expectation Maximization (EM) [49][50][51][52][53][54], Gibbs sampling(or a Gibbs sampler) [54][55][56][57][58][59][60][61][62] and variational Bayesian (VB) inference [54,[62][63][64][65][66][67][68][69][70][71].EM is an iterative algorithm with guaranteed the local convergence for ML estimation and does not require explicit evaluation of the likelihood function [70].Although it has many advantages over ML, EM tends to be limited in applications because of its serious requirements for the posterior of the hidden variables and can not be used to solve complex Bayesian models [70].However, VB inference relaxes some limitations of the EM algorithm and ameliorates its shortcomings.As a means of VB, Gibbs sampling is another method used to infer the probability distributions of all parameters and hidden variables.
This paper provides a survey on probabilistic models of low-rank matrix factorizations.Firstly, we review the significant probability distributions commonly-used in probabilistic low-rank matrix factorizations and introduce the conjugate priors that are essential to Bayesian statistics.Then we present two most important methods for inferring the probability distributions, that is, Gibbs sampling and VB inference.Next, we roughly divide the available low-rank matrix factorization models into five categories: PCA, matrix factorizations, robust PCA, NMF and tensor decompositions.For each category, we provide a detailed overview of the corresponding probabilistic models.
A central task for probabilistic low-rank matrix factorizations is to predict the missing or incomplete data.For the sake of concise descriptions, we do not consider the missing entries in all models except the sparse Bayesian model of matrix completion.The remainder of this paper is listed as below.Section 2 introduces the commonly-used probability distributions and the conjugate priors.Section 3 presents two frequently used inferring methods: Gibbs sampling and VB inference.Probabilistic models of PCA and matrix factorizations are reviewed respectively in Sections 4 and 5.Sections 6 and 7 survey probabilistic models of robust PCA and NMF, respectively.Section 8 provides other probabilistic models of low-rank matrix factorizations and probabilistic tensor factorizations.The conclusions and future research directions are drawn in Section 9.
Notation: Let R be the set of real numbers and R + the set of non-negative real numbers.We denote scalars with italic letters (e.g., x), vectors with bold letters (e.g., x), matrices with bold capital letters (e.g., X) and sets with italic capital letters (e.g., X).Given a matrix X, its ith row, jth column and (i, j)th element are expressed as x i• , x •j and X ij , respectively.If X is square, let Tr(X) and |X| be the trace and the determinant of X, respectively.

Probability Distributions and Conjugate Priors
This section will introduce some probability distributions commonly adopted in probabilistic models of low-rank matrix factorizations and discuss the conjugate priors for algebraic and intuitive convenience in Bayesian statistics.

Probability Distributions
We consider some significant probability distributions that will be needed in the following sections.Given a univariate random variable x or a random vector x, we denote its probability density/mass function by p(x) or p(x).Let E(x) be the expectation of x and Cov(x, x) the covariance matrix of x.
Several probability distributions such as Gamma distribution and Beta distribution deal with the gamma function defined by We summarize the probability distributions used frequently in probabilistic models of low-rank matrix factorizations, as shown in Table 1.In this table, we list the notation for each probability distribution, the probability density/mass function, the expectation and the variance/covariance respectively.For the Wishart distribution W (Λ|W, v), the term B(W, v) is given by where the positive integer v ≥ d − 1 is named the degrees of freedom.For the generalized inverse Gaussian distribution GIG(x|p, a, b), K p (c) is a modified Bessel function of the second kind.For brevity of notation, we stipulate a few simple representations of a random variable or vector.For instance, if x follows a multivariate Gaussian distribution with mean µ and covariance matrix Σ, we can write this distribution as x ∼ N (x|µ, Σ), or x ∼ N (µ, Σ), or p(x) = N (x|µ, Σ).In addition, some probability distributions can be extended to the multivariate case under identical conditions.The following cites an example: if random variables x i ∼ St(µ i , λ i , v i ) and x 1 , x 2 , . . ., x N are independent, then we have x = (x 1 , x 2 , . . ., x N ) T ∼ St(µ, λ, v) with the probability density function p There exist close relationships among probability distributions listed in Table 1, as shown partially in Figure 1.Refer to [54,62,66] for more probability distributions.Moreover, some continuous random variables can be reformulated as Gaussian scale mixtures with other distributions from a hierarchical viewpoint.Now, we give two most frequently used examples as below.
Example 1.For given parameter µ, if the conditional probability distribution of x is p(x|λ) = N (x|µ, λ −1 ) and the prior distribution of λ is p(λ) = IGam(λ|1, 1), then we have p The probability density function of x is derived as: where L{•} is the Laplace transform and L −1 {•} is the inverse Laplace transform.Hence, we get Example 2. For given parameters µ and τ, if the conditional probability distribution of x is p(x|u) = N x|µ, (τu) −1 and the prior distribution of u is p(u|v) = Gam(u|v/2, v/2), then it holds that p(x|v) = St(x|µ, τ, v), where v is a fixed parameter.
The derivation process for p(x|v) is described as follows: Relationships among several probability distributions.

Conjugate Priors
Let x be a random vector with the parameter vector z and , ,..., N X = x x x a collection of N observed samples.In the presence of latent variables, they are also absorbed into z .For given z , the conditional probability density/mass function of x is denoted by ( | ) p x z .Thus, we can construct the likelihood function: As for variational Bayesian methods, the parameter vector z is usually assumed to be stochastic.Here, the prior distribution of z is expressed as ( ) p z .
To simplify Bayesian analysis, we hope that the posterior distribution ( | ) p X z is in the same functional form as the prior ( ) p z .Under this circumstance, the prior and the posterior are called conjugate distributions and the prior is also called a conjugate prior for the likelihood function ( | ) L X z [54,66].In the following, we provide three most commonly-used examples of conjugate priors.
Example 3. Assume that random variable x obeys the Bernoulli distribution with parameter μ .We have the likelihood function for x : where the observations {0,1} i x ∈ .In consideration of the form of ( | ) L X μ , we stipulate the prior distribution of μ as the Beta distribution with parameters a and b :

Conjugate Priors
Let x be a random vector with the parameter vector z and X = {x 1 , x 2 , . . . ,x N } a collection of N observed samples.In the presence of latent variables, they are also absorbed into z.For given z, the conditional probability density/mass function of x is denoted by p(x|z).Thus, we can construct the likelihood function: As for variational Bayesian methods, the parameter vector z is usually assumed to be stochastic.Here, the prior distribution of z is expressed as p(z).
To simplify Bayesian analysis, we hope that the posterior distribution p(z|X) is in the same functional form as the prior p(z).Under this circumstance, the prior and the posterior are called conjugate distributions and the prior is also called a conjugate prior for the likelihood function L(z|X) [54,66].In the following, we provide three most commonly-used examples of conjugate priors.
Example 3. Assume that random variable x obeys the Bernoulli distribution with parameter µ.We have the likelihood function for x: where the observations x i ∈ {0, 1}.In consideration of the form of L(µ|X), we stipulate the prior distribution of µ as the Beta distribution with parameters a and b: At this moment, we get the posterior distribution of µ via the Bayes' rule: Because we have p(µ|X) ∼ Beta(a The conclusion means that the Beta distribution is the conjugate prior for the Bernoulli likelihood. Example 4. Assume that random variable x obeys a univariate Gaussian distribution N (µ, λ −1 ), where λ is also named the precision.The likelihood function of x for given µ is We further suppose that the prior distribution of λ is the Gamma distribution with parameters a and b.Let p(λ|µ, X) be the posterior distribution of λ.Then we have p(λ|µ, X) ∝ p(λ)p(X, µ|λ). ( Because p(λ)p(X, µ|λ) we get p(λ|µ, X) ∼ Gam(a ).Therefore, the conjugate prior of the precision for a Gaussian likelihood is a Gamma distribution.
Example 5. Assume that random vector x obeys a d-dimensional Gaussian distribution N (µ, Λ −1 ), where Λ, the inverse of covariance matrix, is called the precision matrix.We consider the case that both µ and Λ are unknown.Thus, the likelihood function for x is The prior distribution of (µ, Λ) is given by a Gaussian-Wishart distribution N W (µ, Λ|µ 0 , β, W, v) with the joint probability density function: where µ 0 , β, W and v are the fixed hyperparameters.Hence, it holds that Denote x = ∑ N i=1 x i /N and X = (x 1 , x 2 , . . ., x N ).To obtain the above probability density function, we first derive the following formula: Then we get Because p(µ, Λ|X) ∝ p(µ, Λ|µ 0 , β, W, v)p(X|µ, Λ), we have This example shows that the conjugate prior of (µ, Λ) for a multivariate Gaussian N (x|µ, Λ) is a Gaussian-Wishart distribution.
There are some other conclusions on the conjugate priors.For instance, given Gaussian likelihood, the conjugate prior for the mean is another Gaussian distribution, the conjugate prior for the precision matrix is a Wishart distribution.All probability distributions discussed in Section 2.1 belong to a broad class of distributions, that is, the exponential family.It is shown that the exponential family distributions have conjugate priors [54,66,72].

Gibbs Sampling and Variational Bayesian Inference
Due to the existence of the latent variables or unknown parameters, computing the posterior distribution is analytically intractable in general.In this section, we will provide two approximation inference methods: Gibbs sampling and variational Bayesian inference.
Gibbs sampling or Gibbs sampler, a simple MCMC algorithm, is especially applicable for approximating a sequence of observations by a specified probability distribution when direct sampling is intractable.In addition, the basic version of Gibbs sampling is also a special case of the Metropolis-Hastings algorithm [54,62].In detailed implementation, Gibbs sampling adopts the strategy of sampling from a conditional probability distribution instead of marginalizing the joint probability distribution by integrating over other variables.In other words, Gibbs sampling generates alternatively an instance from its corresponding conditional probability distribution by fixing other variables.
Let z 1 , z 2 , . . ., z M be M blocks of random variables and set z = (z 1 , z 2 , . . ., z M ).The joint probability distribution of z is written as p(z) = p(z 1 , z 2 , . . ., z M ).In this subsection, we only consider the case that it is difficult to directly sampling from the joint probability distribution p(z).To this end, we can sample each component z i from marginal distribution in order.The marginal distribution of z i can be theoretically obtained by the following formulation: Generally speaking, the integrating term in the above formula is not tractable.A wise sampling strategy is that we generate z i according to the conditional probability distribution p(z i |z\z i ), where z\z i = (z 1 , . . ., z i−1 , z i+1 , . . ., z M ).By using Bayes' rule, the relationship between the conditional probability distribution and the joint probability distribution is given as follows: The sampling procedure is repeated by cycling through all variables.We summarize the outline of Gibbs sampling as below.
In the above sampling procedure, T is the maximum number of iterations.Therefore, we have T To alleviate the influence of random initializations, we can ignore the samples within the burn-in period.Although Gibbs sampling is commonly efficient in obtaining marginal distributions from conditional distributions, it can be highly inefficient for some cases such as sampling from Mexican hat like distribution [53].
The Metropolis algorithm is an instance of MCMC algorithms and walks randomly with an acceptance/rejection rule until the convergence is achieved.As the generalization of Metropolis algorithm, the Metropolis-Hastings algorithm modifies the jumping rules and its convergence speed is improved [66].Broadly speaking, the advanced Gibbs sampling algorithm can be regarded as a special case of the Metropolis-Hastings algorithm.The Gibbs sampler is applicable for models that are conditionally conjugate, while the Metropolis algorithm can be used for not conditionally conjugate models.Hence, we can combine both the Gibbs sampler and the Metropolis algorithm to sample from complex distributions [66].

Variational Bayesian Inference
Another widely used class of approximating the marginal probability distribution p(z i ) is the variational Bayesian (VB) inference [54].We still use the notation X = (x 1 , x 2 , . . ., x N ) to represent a matrix composed of N observed datum and z = (z 1 , z 2 , . . ., z M ) to represent a vector constructed by M blocks of latent variables.In VB methods, both X and z are assumed to be stochastic.
Given the prior distribution p(z) and the data likelihood p(X|z), the probability distribution of data matrix X, also called the evidence, can be calculated as p(X) = p(z)p(X|z)dz.Then we derive the conditional probability distribution of z via Bayes' rule: However, the integrating term p(z)p(X|z)dz is analytically intractable under normal conditions.Now, we harness VB methods to approximate the posterior distribution p(z|X) as well as p(X).
Another perspective for Equation ( 23) is that, The above inequality is obtained by Jensen's inequality.The negative lower bound −L(q) is called the free energy [88,89].
The KL divergence KL(q||p) can be regarded as a metric for evaluating the approximation performance of the prior distribution q(z) over the posterior distribution p(z|X) [54].In the light of Equation ( 23), minimizing KL(q||p) is equivalent to maximizing L(q).What's more, the lower bound can be further derived as below: L(q) = q(z) ln p(X|z)dz + q(z) ln p(z)dz − q(z) ln q(z)dz = q(z) ln p(X|z)dz + q(z) ln p(z) q(z) dz = E q(z) ln p(X|z) − KL(q||p).

(25)
The above Equation means that ln p(X) can also be regarded as the expectation of the log likelihood function ln p(X|z) with respect to z.
To reduce the difficulty of approximating the trial distribution q(z), we partition all latent variables into M disjoint block variables {z 1 , z 2 , . . . ,z M } and assume that z 1 , z 2 , . . . ,z M are mutually independent.Based on the above assumption, we have Afterwards, we utilize the mean field theory to obtain the approximation of q(z).Concretely speaking, we update each q(z i ) in turn.Let q(z i ) be unknown and q(z j )(j = i) be fixed.Under this circumstance, we can get the optimal q(z i ) by solving the following variational maximization problem: Plugging Equation ( 26) into the lower bound of ln p(X), we have where "const" is a term independent of z i and p(X, z i ) is a new probability distribution whose log is defined by ln p(X, Therefore, p(X, z i ) is the optimal solution of problem (27) obtained by minimizing the KL divergence between q(z i ) and p(X, z i ).
An advantage of the VB methods is that they are immune to over fitting.But they also have some shortcomings.For example, the probability distributions derived via VB have always less probability mass in the wings than the true solution and this systematic bias may break applications.A VB method approximates a full posterior distribution by maximizing the corresponding lower bound on the marginal likelihood and it can only handle a smaller dataset.In contrast, the stochastic variational inference optimizes a subset at each iteration.This batch inference algorithm is scalable and outperforms traditional variational inference methods [90,91].

Comparisons between Gibbs Sampling and Variational Bayesian Inference
Evaluating the posterior distribution p(z|X) is a central task in probabilistic models of low-rank matrix factorizations.In many practical applications, it is often infeasible to compute the posterior distribution or the expectations with respect to this distribution.Gibbs sampling and VB inference are two dominant methods for approximating the posterior distribution.The former is a stochastic approximation and the latter is a deterministic technique.
The fatal shortcoming of Expectation Maximization (EM) algorithm is that the posterior distribution of the latent variables should be given in advance.Both Gibbs sampling and VB inference can ameliorate the shortcoming of EM algorithm.Gibbs sampling is easy to implement due to the fact it adopts a Monte Carlo procedure.Another advantage of Gibbs sampling is that it can generate exact results [54].But this method is often suitable for small-scale problems because it costs a large amount of computation.Compared with Gibbs sampling, VB inference does not generate exact results and has small computation complexity.Therefore, their strengths and weaknesses are complementary rather than competitive.

Probabilistic Models of Principal Component Analysis
Principal Component Analysis (PCA) is a special type of low-rank matrix factorizations.This section first introduces the classical PCA and then reviews its probabilistic models.

Principal Component Analysis
The classical PCA converts a set of samples with possibly correlated variables into another set of samples with linearly uncorrelated variables via an orthogonal transformation [1].Based on this, PCA is an effective technique widely used in performing dimensionality reduction and extracting features.
Let X = {x 1 , x 2 , . . . ,x N } be a collection of N samples with d dimensions and I r (r < d) an r-by-r identity matrix.Given a projection transformation matrix W ∈ R d×r , the PCA model can be expressed as where the mean vector x = ∑ N i=1 x i /N, W satisfies W T W = I r , z i is a representation coefficient vector with r dimensions and ε 1 , ε 2 , . . ., ε N ∼ N (0,σ 2 I d ) are independent and identically distributed noise vectors.Denote X = (x 1 , x 2 , . . . ,x N ), Z = (z 1 , z 2 , . . . ,z N ), X = (x, x, . . . ,x) ∈ R d×N and E = (ε 1 , ε 2 , . . . ,ε N ).Then PCA can be rewritten as the following matrix factorization formulation: According to the maximum likelihood estimation, the optimal W and Z can be obtained by solving the minimization problem: where • F is the Frobenious norm of a matrix.Although this optimization problem is not convex, we can obtain the optimal transform matrix W * by stacking r singular vectors corresponding to the first r largest singular values of the sample covariance matrix . Let z * i be the optimal low-dimensional representation of x i .Then it holds that z * i = W * T (x i − x).The PCA technique only supposes that the dataset is contaminated by isotropic Gaussian noise.The advantage of PCA is that it is very simple and effective in achieving the point estimations of W and Z.But we can not obtain their probability distributions.In fact, the probability distributions of parameters are more useful and valuable than the point estimations in exploiting the intrinsic essence and investigating the procedure of data generation.Probabilistic models of PCA are just a class of methods for inferring the probability distributions of parameters.

Probabilistic Principal Component Analysis
In Equation (30), the low-dimensional representation z i is an unknown and deterministic parameter vector.In contrast, the original probabilistic PCA [10] regards z i as a stochastic vector.This probability model provides a general form of decomposing a d-dimensional input sample x: where the latent random vector z ∼ N (0, I r ), the noise ε ∼ N (0, σ 2 I d ) and µ is the mean vector.
In the following, a Maximum Likelihood (ML) method is proposed to obtain the point estimations of W, µ and σ 2 .It is obvious that Given the observed sample set X = {x 1 , x 2 , . . . ,x N }, the log-likelihood function is The optimal W, µ, σ 2 can be obtained by maximizing L(W, µ, σ 2 |X).By letting the partial derivative of L(W, µ, σ 2 |X) with respect to µ be the zero vector, we can easily get the maximum likelihood estimation of µ: µ ML = x.Hence, the optimal W and σ 2 can be achieved by the stationary points of L(W, µ ML , σ 2 |X) with respect to W and σ 2 .The aforementioned method is slightly complex when computing W and σ 2 .For this purpose, an Expectation Maximization (EM) algorithm was also presented to solve probabilistic PCA.If W, µ and σ 2 are given, then the joint probability distribution of z and x can be derived as follows: The posterior distribution of z for given x is Because we have where is the mean of z.Based on the above analysis, we give the complete-data log-likelihood function: where "const" is a term independent of W, µ and σ 2 .
In the expectation step, we take expectation on L C (W, µ, σ 2 |X) with respect to z: According to Equation (39), we have In the maximization step, we first obtain the maximum likelihood estimation of µ: µ ML = x by setting the partial derivative of E z L C (W, µ, σ 2 |X) with respect to µ be a zero vector.Similarly, we can also obtain the optimal estimations of W and σ 2 by solving the equation set: Moreover, a mixture model for probabilistic PCA was proposed in [92].Khan et al. replaced the Gaussian noise with Gaussian process and incorporated the information of preference pairs into the collaborative filtering [93].

Bayesian Principal Component Analysis
In probabilistic PCA, both z and ε obey Gaussian distributions, and both W and µ are non-stochastic parameters.Now, we further treat w •1 , w •2 , . . ., w •r and µ as independent random variables, that is, At this moment, the corresponding probabilistic model of PCA is called Bayesian PCA [11].Set α = (α 1 , α 2 , . . ., α r ) T and τ = σ −2 .Then we give the prior distributions of α and τ as follows: where a 0 , b 0 , c 0 , d 0 are four given hyperparameters.
The true joint probability distribution of complete data is given by We suppose that the trial distribution of p(X, W, α, Z, µ, τ) has the following form: By making use of VB inference, we can obtain the trial probability distributions of W, α, Z, µ and τ respectively.In detailed implementation, the hyperparameters a 0 , b 0 , c 0 , d 0 , β can be set to be small positive numbers to obtain the broad priors.Unlike other approximation methods, the proposed method maximizes a rigorous lower bound.

Robust L1 Principal Component Analysis
The probabilistic model of robust L1 PCA [12] regards both µ and W as deterministic parameters and µ is set to a zero vector without loss of the generality.We still suppose that z obeys a spherical multivariate Gaussian distribution, that is, z ∼ N (0, I r ).To improve the robustness, the noise ε i is assumed to follow a multivariate Laplace distribution: where • 1 is the L1 norm of a vector.Due to the fact that the Laplace distribution has heavy tail, the proposed model in [12] is more robust against data outliers.
We use the hierarchical model to deal with the Laplace distribution.The probability density distribution of a univariate Laplace distribution Lap(0, σ) can be rewritten as Hence, we can set ε i ∼ N 0, (σ 2 /β i )I d and β i ∼ IGam(1, 1/2).Let ρ = 1/σ 2 and give its prior distribution: where a and b are two given hyperparameters.
We take Z, β and ρ as latent variables and W as the hyperparameters.For fixed W, the true joint probability distribution of all latent variables is p(Z, β, ρ, X|W) = p(ρ)p(β)p(Z)p(X|W, Z). (49) The trial joint distribution of Z, β and ρ is chosen as q(Z, β, ρ) = q(Z)q(β)q(ρ).By applying the VB inference, the probability distributions of Z, β and ρ are approximated respectively.What's more, W is updated by minimizing the robust reconstruction error of all samples.

Bayesian Robust Factor Analysis
In previous probabilistic models of PCA, the noise ε obeys the same Gaussian distributions.However, different features maybe have different noise levels in practical applications.Now, we set ε ∼ N (0, diag(τ)), where τ = (τ 1 , τ 2 , . . ., τ d ) T and diag(τ) is a diagonal matrix.Probabilistic Factor Analysis (or PCA) [13] further assumes that ).In other words, different W ij or µ i have different variances and the variances of W ij and µ i also have a coupling relationship.Given W, z, µ, τ, the conditional probability distribution of x is written as Meanwhile, the prior distributions for τ i , α j and β are given as follows: where {a 0 , b 0 , c 0 , d 0 , e 0 , f 0 } is the set of hyperparameters.The robust version of Bayesian factor analysis [13] considers the Student's t-distributions instead of Gaussian noise corruptions due to the fact that the heavy tail of Student's t-distributions makes it more robust to outliers or large sparse noise.Let ε ∼ St(0, τ. −1 , v), where τ. −1 = (1/τ 1 , 1/τ 2 , . . ., 1/τ N ) T and v = (v 1 , v 2 , . . . ,v N ) T .In this case, the probability distribution of x for given W, z, µ, τ, v is We can represent hierarchically the above Student's t-distributions.Firstly, the conditional probability distribution of x k can be expressed as: where u k is a d-dimensional column vector, ". * " is the Hadamard product (also known as entrywise product).Then we give the prior of u k for fixed d-dimensional hyper parameter vector v as below: Another Bayesian robust factor analysis is on the basis of the Laplace distribution.At this moment, we assume that ε ∼ Lap(0, τ. −1/2 ), where τ.
. In this case, we have The Laplace distribution generally leads to adverse effects on inferring the probability distributions of other random variables.Here, we still employ the hierarchical method, that is, Under this circumstance, we have p(x|W, z, µ, τ) = p(x, u|W, z, µ, τ)du = p(x|W, z, µ, τ, u)p(u)du where we have p(x|W, z, µ, τ) For the aforementioned two probabilistic models of robust factor analysis, the VB inference was proposed to approximate the posterior distributions [13].In practice, the probability distribution of the noise should be chosen based on the application.The probabilistic models of PCA are compared in Table 2.

Probabilistic Models of Matrix Factorizations
Matrix factorizations are a type of methods to approximating the data matrix by the product of two low-rank matrices.They can be regarded as a special case of PCA without considering the mean.This section will discuss the existing probabilistic models of matrix factorizations.

Matrix Factorizations
For given data matrix X, its low-rank factorization model is written as where W ∈ R d×r , Z ∈ R r×N , E is the noise matrix and r < d.We assume that E ij ∼ N (0, σ 2 ) are independent and identically distributed.We can get the optimal W and Z according to the maximum likelihood estimation.More specifically, we need to solve the following minimization problem: min The closed-form solution to problem ( 60) can be obtained by the Singular Value Decomposition (SVD).
To enhance the robustness to outliers or large sparse noise, we now assume that E ij ∼ Lap(0, σ).For the moment, we solve the following optimization problem: min where • 1 is the L1-norm of a matrix (i.e., the sum of the absolute value of all elements).This problem is also called L1-norm PCA and the corresponding optimization methods were proposed in [3,4].Srebro and Jaakkola considered the weighted low-rank approximations problems and provided an EM algorithm [94].

Probabilistic Matrix Factorization
We still consider Gaussian noise corruptions, that is, E ij ∼ N (0, σ 2 ).Furthermore, the zero-mean spherical Gaussian priors are respectively imposed on each row of W and each column of Z: Probabilistic matrix factorization (PMF) [15] regards both σ 2 w and σ 2 z as two deterministic parameters.The point estimations of W, Z can be obtained by maximizing the posterior distribution with the following form: If the priors are respectively placed on σ 2 w and σ 2 z , we can obtain the generalized model of probabilistic matrix factorization [15].In this case, the likelihood function is derived as By maximizing the above posterior distribution, we can obtain the point estimations of parameters {W, Z} and hyperparameters σ 2 w , σ 2 z .Furthermore, two derivatives of PMF are also presented, i.e., PMF with a adaptive prior and PMF with constraining user-specific feature vectors.
In [89], a fully-observed variational Bayesian matrix factorization, an extension of PMF, was discussed.Meanwhile, it is shown that this new probabilistic matrix factorization can weaken the decomposability assumption and obtain the exact global analytic solution for rectangular cases.

Variational Bayesian Approach to Probabilistic Matrix Factorization
In PMF, W ik are independent and identically distributed and so are Z kj .Variational Bayesian PMF [14] assumes the entries from different columns of W or Z T have different variances, that is, For given hyperparameters {σ 2 , σ 2 W , ρ 2 Z }, we get the joint probability distribution: where We assume that the trial joint distribution of W and Z is decomposable, that is, q(W, Z) = q(W)q(Z).Using VB method, we can update alternatively q(W) and q(Z).The variances σ 2 , σ 2 W , ρ 2 Z can be determined by maximizing the following lower bound: The experimental results in Netflix Prize competition show that the Variational Bayesian approach has superiority over MAP and greedy residual fitting.

Bayesian Probabilistic Matrix Factorizations Using Markov Chain Monte Carlo
Variational Bayesian approach to PMF only discusses the case that W ik (or Z kj ) are independent and identically distributed and their means are zeros.However, Bayesian PMF [16] assumes that w i• (or z •j ) are independent and identically distributed and their mean vectors are not zero vectors.Concretely speaking, we stipulate that Let Θ W = {µ W , Λ W }, Θ Z = {µ Z , Λ Z }.We further suppose the prior distributions of Θ W and Θ Z are Gaussian-Wishart priors: where Θ 0 = {µ 0 , v 0 , W 0 }, β 0 is a hyper parameter.We can initialize the parameters Θ 0 as follows: In theory, the predictive probability distribution of X * ij can be obtained by marginalizing over model parameters and hyperparameters: However, the above integral is analytically intractable due to the fact that it is very difficult to determine the posterior distribution.Based on this, Gibbs sampling, one of the simplest Markov chain Monte Carlo, was proposed to approximate the predictive distribution p(X * ij |X, Θ 0 ).It is noted that MCMC methods for large-scale problems require especial care for efficient proposals and may be very slow if the sample correlation is too long.

Sparse Bayesian Matrix Completion
We consider the case that some elements of data matrix X are missing and the observed index set is denoted by Ω. Matrix completion assumes that X is approximately low-rank and its goal is to recover all missing elements from observed elements.
For noise matrix E, we assume that E ij ∼ N (0, β −1 ).In sparse Bayesian matrix completion [19], the Gaussian distributions are imposed on two low-rank matrices, that is, Moreover, the priors of γ i are given by γ i ∼ Gam(a, b) and the prior of β is assigned the noninformative Jeffrey's prior: Then the joint probability distribution is Let q(W, Z, γ, β) be the approximated distribution of p(W, Z, γ, β).The approximated procedure can be achieved by VB inference.It is demonstrated that the proposed method achieves a better prediction error in recommendation systems.

Robust Bayesian Matrix Factorization
In previous probabilistic models of matrix factorizations, there is no relationship among the variances of W ik , Z kj , E ij .Now, we reconsider the probability distributions of W ik , Z kj , E ij .The noise E ij is chosen as the heteroscedastic Gaussian scale mixture distribution: and the probability distributions of w i• and z •j are given by: We also impose Gamma distribution priors on α i and β j : where {a 0 , b 0 , c 0 , d 0 } is a given set of hyper-parameters.To reduce this problem's complexity, we restrict Λ W and Λ Z to be diagonal matrices, that is, Let θ = {τ, Λ W , Λ Z , a 0 , b 0 , c 0 , d 0 }.For the fixed parameters θ, the joint probability distribution is We consider two types of approximated posteriors of p(W, Z, α, β|X, θ), one is and another has the form q(W, Z, α, β) = q(W, α)q(Z, β).
For the above two cases, we can obtain respectively the trial probability distribution q(W, Z, α, β) by VB method [17].
In addition, a structured variational approximation was also proposed in [17].We assume that the variational posteriors of W and Z obey Gaussian distributions: The free energy function is defined by the negative lower bound: By directly minimizing the free energy function with respect to w i• , S i , z •j , R j , we can obtain the optimal w i• , S i , z •j , R j .The variational posteriors of scale variables α and β can also be recognized as the generalized inverse Gaussians.
The parameters θ can be estimated by type II maximum likelihood or empirical Bayes.In other words, we update the parameters by minimizing directly the negative lower bound.Robust Bayesian matrix factorization shows that the heavy-tailed distributions are useful to incorporate robustness information to the probabilistic models.

Probabilistic Robust Matrix Factorization
The model of probabilistic robust matrix factorization [18] considers the sparse noise corruptions.In this model, the Gaussian distributions are also imposed on W ik and Z kj : and the Laplace noise is placed on E ij , that is, E ij ∼ Lap(0, λ).
From Bayes' rule, we have We construct the hierarchical model for the Laplace distribution: We regard T as a latent variable matrix and denote θ = {W, Z}, θ = Ŵ, Ẑ , where θ is the current estimation of θ.An EM algorithm was proposed to inferring W and Z [18].To this end, we construct the so-called Q-function: The posterior of complete-data is p(Z| Ŵ, X, T) = p(Z,X| Ŵ,T) p(X| Ŵ,T) ∝ p(Z, X| Ŵ, T) ∝ p(X|Z, Ŵ, T)p(Z) (86) Hence, its log is where "const" is a term independent of Z.
To compute the expectations of T, we derive the conditional probability distribution of T as follows: Hence, T −1 ij follows an inverse Gaussian distribution and its posterior expectation is given by Thus, we get Q(Z| θ).
To obtain the update of Z, we maximize the function Q(Z| θ) with respect to z •j .By setting we can get the closed-form solution of z •j .The update rule for W is similar to that of Z.The proposed probabilistic model is robust again outliers and missing data and equivalent to robust PCA under mild conditions [18].

Bayesian Robust Matrix Factorization
Another robust probabilistic model of matrix factorizations is Bayesian robust matrix factorization [20].Gaussian distributions are still imposed on W and Z, that is, We further assume both (µ W , Λ W ) and (µ Z , Λ Z ) follow Gaussian-Wishart distributions: where W 0 , v 0 , µ 0 , β 0 are the hyperparameters.
To enhance the robustness, we suppose the noise is the mixture of a Laplace distribution and a GIG distribution.Concretely speaking, the noise term E ij ∼ Lap(0, η ij ) and the prior of η ij is given by GIG(p, a, b), where p, a, b are three hyperparameters.Hence, E ij ∼ Lap(0, η ij ) can be represented by two steps: Gibbs sampling is proposed to infer the posterior distributions.For this purpose, we need to derive the posterior distributions of all random variables.Due to the fact that the derivation process of {µ Z , Λ Z , Z} is similar to that of {µ W , Λ W , W}, the following only considers approximating the posterior distributions of (µ W , Λ W , W) for brevity.Firstly, the posterior distribution of (µ W , Λ W ) is a Gaussian-Wishart distribution because that Then, we compute the conditional probability distribution of w i. : According to the above Equation, p(w i• |X, Z, µ W , Λ W , T) is also Gaussian.Next, for given X ij , w i• , z •j , η ij , the probability distribution of T ij is derived as below: Hence, it holds η ij |T ij , p, a, b ∼ GIG(p + 1, T ij + a, b).Bayesian robust matrix factorization incorporates spatial or temporal proximity in computer vision applications and batch algorithms are proposed to infer parameters.

Bayesian Model for L1-Norm Low-Rank Matrix Factorizations
For low-rank matrix factorizations, L1-norm minimization is more robust than L2-norm minimization in the presence of outliers or non-Gaussian noises.Based on this, we assume that the noise E ij follows the Laplace distribution: E ij ∼ Lap(0, √ λ/2).Since the Laplace noise is inconvenient for Bayesian inference, a hierarchical Bayesian model was formulated in [21].Concretely speaking, a two-level hierarchical prior is imposed on the Laplace prior: The generative models of W and Z are constructed as follows: In addition, Gamma priors are placed on the precision parameters of the above Gaussian distributions: The trial posterior distribution for {W, Z, τ W , τ Z , T} is specified as: And the joint probability distribution is expressed as VB inference was adopted to approximate the full posterior distribution [21].Furthermore, varying precision parameters are also considered for different rows of W or different columns of Z.All parameters are automatically tuned to adapt to the data, and the proposed method is applied in computer vision problems to validate its efficiency and robustness.
In Table 3, we list all probabilistic models of matrix factorizations discussed in this section, and compare their probability distributions, priors and solving strategy.

Probabilistic Models of Robust PCA
Compared with traditional PCA, robust PCA is more robust to outliers or large sparse noise.Stable robust PCA [95], a stable version of robust PCA, decomposes the data matrix into the sum of a low-rank matrix, a sparse noise matrix and a dense noise matrix.The low-rank matrix obtained by solving a relaxed principal component pursuit is simultaneously stable to small noise and robust to gross sparse errors.This section will review probabilistic models of robust PCA.

Bayesian Robust PCA
In [22], a stable robust PCA is modeled as: where D = diag(d  101) is equivalent to Equation (59).If all columns of B. * S are same, then the stable robust PCA becomes to be the PCA model (31).
Two methods were proposed in [22], that is, Gibbs sampling and VB inference.For the second method, the joint probability distribution is

Variational Bayes Approach to Robust PCA
In Equation ( 101), if both D and Λ are set to be identical matrices, then we have another form of stable robust PCA: Essentially, both Equations ( 101) and ( 103) are equivalent.Formally, Equation ( 103) can also be transformed into another formula: We assume that w Let the means of S ij and E ij be zeros and their precision be τ S and τ E respectively.The priors of τ S and τ E are given by A naïve VB approach was proposed in [23].The trial distribution is stipulated as q(W, Z, B, τ S , τ E ) = q(W)q(Z)q(B)q(τ S )q(τ E ).
The likelihood function is constructed as The prior distribution is represented by We first construct a function: G(q) = E q [log L] − D KL (q||π).To simplify this problem, we let To find the updates for W, Z, we can maximize the function G with respect to µ wi , Σ wi , µ zi , Σ zi respectively.The main advantage of the proposed approach is that it can incorporate additional prior information and cope with missing data.

Sparse Bayesian Robust PCA
In Equation ( 101), we replace B. * S by S for the sake of simplicity.Thus, we have a model of sparse Bayesian robust PCA [19].Assume that w , where k = 1, 2, . . ., r, i = 1, 2, . . ., d, j = 1, 2, . . ., N. The priors of γ k are given by γ k ∼ Gam(a, b).What's more, we assign Jeffrey's priors to α ij and β: The joint distribution is expressed as VB inference was used to approximate the posterior distributions of all variables matrices [19].
Experimental results in video background/foreground separation show that the proposed method is more effective than MAP and Gibbs sampling.

Probabilistic Models of Non-Negative Matrix Factorization
Non-negative matrix factorization (NMF) decomposes a non-negative data matrix into the product of two non-negative low-rank matrices.Mathematically, we can formulate NMF as follows: where X ij , W ik , Z kj are non-negative.Multiplicative algorithms [96] are often used to obtain the point estimations of both W and Z.This section will introduce probabilistic models of NMF.

Probabilistic Non-Negative Matrix Factorization
Equation ( 112) can be rewritten as θ 2 , . . . ,θ r }.Probabilistic non-negative matrix factorization [24] introduces a generative model: The probability distributions of C k,ij can be assumed to follow Gaussian or Poisson distributions because they are closed under summation.This assumption means that we can get easily the probability distributions of X ij .Four algorithms were proposed in [24], that is, multiplicative, EM, Gibbs sampling and VB algorithms.

Bayesian Inference for Nonnegative Matrix Factorization
For arbitrary k ∈ {1, 2, . . . ,r}, Bayesian non-negative matrix factorization [25] introduces variables S k = S ikj |i = 1, 2, . . ., d, j = 1, 2, . . ., N as latent sources.The hierarchical model of X ij is given by In view of the fact that a Gamma distribution is the conjugate prior to Poisson distribution, the hierarchical priors of W ik and Z kj are proposed as follows ik and S = {S 1 , S 2 , . . . ,S r }.For given Θ and X, the posterior distribution is expressed as p(W, Z, S|X, Θ).We assume the trial distribution of p(W, Z, S|X, Θ) is factorable: q(W, Z, S) = q(W)q(Z)q(S).Both VB inference and Gibbs sampling were proposed to infer the probability distributions of all variables [25].
The above Bayesian nonnegative matrix factorization is not a matrix factorization approach to latent Dirichlet allocation.To this end, another Bayesian extension of the nonnegative matrix factorization algorithm was proposed in [27].What's more, Paisley et al. also provided a correlated nonnegative matrix factorization based on the correlated topic model [27].The stochastic variational inference algorithms were presented to solve the proposed two models.

Bayesian Nonparametric Matrix Factorization
Gamma process nonnegative matrix factorization (GaP-NMF) was developed in [26].This Bayesian nonparametric matrix factorization considers the case that the number of sources r is unknown.Let non-negative hidden variable θ k be the overall gain of the k-th source and L a large number of sources.We assume that The posterior distribution is expressed as p(W, Z, θ|X, a, b, α, c) for given hyperparameters a, b, α, c.The trial distribution of θ, W, Z is assumed to be factorable, that is, q(W, Z, θ) = q(W)q(Z)q(θ).The flexible generalized inverse-Gaussian distributions are imposed on W ik , Z kj and θ k respectively, The lower bound of the marginal likelihood is computed as By maximizing the lower bound of log p(X, a, b, α, c), we can yield an approximation distribution of q(W, Z, θ).GaP-NMF is applied in recorded music and the number of latent sources is discovered automatically.

Beta-Gamma Non-Negative Matrix Factorization
For the moment, we do not factorize the data matrix X into the product of two low-rank matrices.Different from previous probabilistic models of NMF, we assume that X ij is generated from a Beta distribution: Beta(A ij , B ij ).For two matrices A and B, we jointly factorize them as where C ∈ R d×r + , D ∈ R d×r + , H ∈ R r×N + .In Beta-Gamma non-negative matrix factorization [28], the generative model is given by D ik H kj ), C ik ∼ Gam(µ 0 , α 0 ), D ik ∼ Gam(υ 0 , β 0 ), H kj ∼ Gam(ρ 0 , ς 0 ) (120) Variational inference framework was adopted and a new lower-bound was proposed to approximate the objective function to derive an analytically tractable approximate solution of the posterior distribution.Beta-Gamma non-negative matrix factorization is used in source separation, collaborative filtering and cancer epigenomics analysis.

Other Probabilistic Models of Low-Rank Matrix/Tensor Factorizations
Besides the probabilistic models discussed in foregoing sections, there are many other types of probabilistic low-rank matrix factorization models.A successful application of probabilistic low-rank matrix factorization is the collaborative filtering in recommendation systems.In collaborative filtering, there are several other Poisson models in which the observations are usually modeled with a Poisson distribution, and these models mainly include [97][98][99][100][101][102][103][104][105].As a matter of fact, the Poisson factorization roots in the nonnegative matrix factorization and takes advantage of the sparse essence of user behavior data and scales [103].For some probabilistic models with respect to collaborative filtering, the Poisson distribution is changed into other probability distributions and this change deals with logistic function [106][107][108], Heaviside step function [107,109], Gaussian cumulative density function [110] and so on.In addition, side information on the a low-dimensional latent presentations is integrated into probabilistic low-rank matrix factorization models [111][112][113], and the case that the data is missing not at random is taken into consideration [109,114,115].
It is worthy to pay attention to other applications of probabilistic low-rank matrix factorization models.For instance, [116] developed a probabilistic model for low-rank subspace clustering.In [88], a sparse additive matrix factorization was proposed by a Bayesian regularization effect and the corresponding model was applied into a foreground/background video separation problem.
Recently, probabilistic low-rank matrix factorizations have been extended into the case of tensor decompositions (factorizations).Tucker decomposition and CP decomposition are two popular tensor decomposition approaches.The probabilistic Tucker decomposition models mainly include probabilistic Tucker decomposition [34], exponential family tensor factorization [35] and InfTucker model [36].Probabilistic Tucker decomposition was closely related to probabilistic PCA.In [35], an integration method was proposed to model heterogeneously attributed data tensors.InfTucker, a tensor-variate latent nonparametric Bayesian model, conducted Tucker decomposition in an infinite feature space.
More probabilistic models of tensor factorizations focus on CP tensor decomposition model.For example, Ermis and Cemgil investigated variational inference for probabilistic latent tensor factorization [37].Based on hierarchical dirichlet process, a Bayesian probabilistic model for unsupervised tensor factorization was proposed [38].In [39], a novel probabilistic tensor factorization was proposed by extending probabilistic matrix factorization.A probabilistic latent tensor factorization was proposed in [40] to address the task of link pattern prediction.Based on the Polya-Gamma augmentation strategy and online expectation maximization algorithm, [41] proposed a scalable probabilistic tensor factorization framework.As the generalization of Poisson matrix factorization, Poisson tensor factorization was presented in [42].In [43], a Bayesian tensor factorization models was proposed to infer the latent group structures from dynamic pairwise interaction patterns.In [44], a Bayesian non-negative tensor factorization model was presented for count-valued tensor data and scalable inference algorithms were developed.A scalable Bayesian framework for low-rank CP decomposition was presented and it can analyses both continuous and binary datasets [45].A zero-truncated Poisson tensor factorization for binary tensors was proposed in [46].A Bayesian robust tensor factorization [47] was proposed and it is the extension of probabilistic stable robust PCA.And in [48], the CP factorization was formulated by a hierarchical probabilistic model.

Conclusions and Future Work
In this paper, we have made a survey on probabilistic models of low-rank matrix factorizations and the related works.To classify the main probabilistic models, we divide low-rank matrix factorizations into several groups such as PCA, matrix factorizations, robust PCA, non-negative matrix factorization and so on.For each category, we list representative probabilistic models, describe the probability distributions of all random matrices or latent variables, present the corresponding inference methods and compare their similarity and difference.Besides, we further provide an overview of probabilistic models of low-rank tensor factorizations and discuss other probabilistic matrix factorizations models.
Although probabilistic low-rank matrix/tensor factorizations have made some progresses, we still face some challenges in theories and applications.Future research may concern the following aspects:

•
Scalable algorithms to infer the probability distributions and parameters.Although both Gibbs sampling and variational Bayesian inference have their own advantages, they need large computation cost for real large-scale problems.A promising future direction is to design scalable algorithms.

•
Constructing new probabilistic models of low-rank matrix factorizations.It is necessary to develop other probabilistic models according to the actual situation.For example, we can consider different types of sparse noise and different probability distributions (including the prior distributions) of low-rank components or latent variables.

•
Probabilistic models of non-negative tensor factorizations.There is not much research on this type of probabilistic models.Compared with probabilistic models of tensor factorizations, the probabilistic non-negative tensor factorizations models are more complex and difficult in inferring the posterior distributions.

•
Probabilistic TT format.In contrast to both CP and Tucker decompositions, the TT format provides stable representations and is formally free from the curse of dimensionality.Hence, probabilistic model of the TT format would be an interesting research issue.
, Z, D, Λ, B, S, p, τ, π, v, γ|X).Bayesian robust PCA is robust to different noise levels without changing model hyperparameters and exploits additional structure of the matrix in video applications.

Table 3 .
Probabilistic models of matrix factorizations with X = WZ + E.