From Bilinear Regression to Inductive Matrix Completion: A Quasi-Bayesian Analysis

In this paper, we study the problem of bilinear regression, a type of statistical modeling that deals with multiple variables and multiple responses. One of the main difficulties that arise in this problem is the presence of missing data in the response matrix, a problem known as inductive matrix completion. To address these issues, we propose a novel approach that combines elements of Bayesian statistics with a quasi-likelihood method. Our proposed method starts by addressing the problem of bilinear regression using a quasi-Bayesian approach. The quasi-likelihood method that we employ in this step allows us to handle the complex relationships between the variables in a more robust way. Next, we adapt our approach to the context of inductive matrix completion. We make use of a low-rankness assumption and leverage the powerful PAC-Bayes bound technique to provide statistical properties for our proposed estimators and for the quasi-posteriors. To compute the estimators, we propose a Langevin Monte Carlo method to obtain approximate solutions to the problem of inductive matrix completion in a computationally efficient manner. To demonstrate the effectiveness of our proposed methods, we conduct a series of numerical studies. These studies allow us to evaluate the performance of our estimators under different conditions and provide a clear illustration of the strengths and limitations of our approach.


Introduction
In this paper, we investigate the bilinear regression model, a statistical method that assumes a linear relationship between a set of multiple response variables and two sets of covariates. This model, also known as the growth curve model or generalized multivariate analysis model, is commonly used for analyzing longitudinal data, as shown in previous studies such as [1][2][3][4][5][6]. However, these studies only cover the scenario in which the response matrix is fully observed.
Recently, the bilinear regression model with incomplete response has been introduced and studied as the so-called inductive matrix completion, which is a generalization of the matrix completion problem [7,8]. This problem has attracted significant attention in various fields, such as drug repositioning [9], collaborative filtering [10], and genomics [7]. Inductive matrix completion is a challenging problem that arises when some of the entries in the response matrix are missing, which makes it difficult to infer the underlying relationship between the variables.
In this work, we explore the problem of bilinear regression and inductive matrix completion under a low-rank constraint on the coefficient matrix. Most existing approaches for these problems are frequentist methods, such as maximum likelihood estimation [2] or penalized optimization [7]. These methods are effective in providing point estimates for the parameters of the model but lack the ability to provide a full probabilistic characterization of the uncertainty. Recently, Bayesian approaches have been considered for these problems. For example, the paper [11] proposed a Bayesian approach for bilinear regression, and a Bayesian method was proposed for inductive matrix completion in the work [9]. However, unlike frequentist approaches, the statistical properties of the Bayesian approach for these models have not been fully explored yet.
The aim of this paper is to address an existing gap in the understanding of the bilinear regression and inductive matrix completion problems. To achieve this goal, we propose a novel approach that combines elements of Bayesian statistics with a quasi-likelihood method. Specifically, we start by addressing the problem of bilinear regression using a quasi-Bayesian approach, where a quasi-likelihood is employed. We then generalize this approach to the problem of inductive matrix completion. To ensure that our method is adaptive to the rank of the coefficient matrix, we use a spectral scaled Student prior distribution, which allows us to prove that the posterior mean satisfies a tight oracle inequality. This result demonstrates that our method is able to accurately estimate the parameters of the model, even when the rank of the coefficient matrix is unknown. Additionally, we also prove the contraction properties of the posteriors, which further enhances the performance of our method.
The proposed method in this paper, the quasi-Bayesian approach, is an extension of the traditional Bayesian approach and is becoming increasingly popular in statistics and machine learning as a technique for generalized Bayesian inference, as noted in studies such as [12][13][14]. This approach allows for more flexibility in the modeling assumptions by replacing the likelihood function with a more general notion of risk or quasi-likelihood.
To provide theoretical guarantees for our proposed quasi-posteriors, we make use of the PAC-Bayesian technique [15][16][17]. This technique provides bounds on the generalization error of a learned estimator, and has been widely used in the literature as described in recent reviews and introductions, such as [18,19]. The PAC-Bayes bounds have been successfully applied in the context of matrix estimation problems as shown in studies such as [20][21][22][23]. Interestingly, by using the PAC-Bayesian technique for inductive matrix completion, we do not need to make any assumptions about the distribution of the missing entries in the response matrix. This is in contrast with previous works on matrix completion, such as [24][25][26][27][28][29], which typically require assumptions about the missing data. This makes our method more versatile and applicable to a wider range of problems.
The proposed method in this paper makes use of a spectral scaled Student prior, which is a specific choice of prior distribution. This choice is motivated by recent works in which it has been shown to lead to optimal rates in a variety of problems, including high-dimensional regression [30], image denoising [31] and reduced rank regression [23]. Although this prior is not conjugate to our problems, it allows for the convenient implementation of gradient-based sampling methods, which makes it computationally efficient.
To compute the proposed estimators and sample from the quasi-posterior, we employ a Langevin Monte Carlo (LMC) method. This method is a widely used algorithm for approximating complex distributions and allows for efficient computation of the proposed estimators. The LMC method allows us to obtain approximate solutions to the problems of bilinear regression and inductive matrix completion in a computationally efficient manner. Furthermore, we use numerical studies to demonstrate the effectiveness of our proposed methods. These studies enable us to evaluate the performance of our estimators in various scenarios and provide insight into the capabilities and limitations of our approach. By conducting a thorough evaluation, we can gain a better understanding of how our method performs in different settings and make any necessary adjustments to improve its performance. We also compare our method with the ordinary least squares method. This comparison allows us to demonstrate the superiority of our method over the traditional approach in certain scenarios, such as when the response matrix contains missing data. Overall, the numerical studies serve as a valuable tool for assessing the effectiveness of our proposed method and provide a clear illustration of its strengths and weaknesses, as well as its improvement over the traditional method.
The remainder of this paper is organized as follows: In Section 2, we present the problem of bilinear regression and introduce the low-rank promoting prior distribution that we use to address this problem. In Section 3, we extend our approach to the problem of inductive matrix completion. In Section 4, we discuss the Langevin Monte Carlo method used for the computation of the estimators and present numerical studies to demonstrate the effectiveness of our proposed methods. Finally, we conclude our work and provide a summary of our findings in Section 5. The technical proofs are provided in Appendix A for the interested reader.

Notation 1. Let R^{n1×n2} denote the set of n1 × n2 matrices with real entries. For A ∈ R^{n1×n2}, its transpose is denoted by A⊤ ∈ R^{n2×n1}. For any A ∈ R^{n1×n2} and I = (i, j) ∈ {1, ..., n1} × {1, ..., n2}, we write A_I = A_{(i,j)} = A_{i,j} for the entry of A in the i-th row and j-th column. The matrix in R^{n1×n2} with all entries equal to 0 is denoted by 0_{n1×n2}. For a matrix B ∈ R^{n1×n1}, Tr(B) denotes its trace and I_{n1} denotes the identity matrix in R^{n1×n1}. For A ∈ R^{n1×n2}, we define its sup-norm ∥A∥_∞ = max_{i,j} |A_{i,j}|, its Frobenius norm ∥A∥_F = (Σ_{i,j} A²_{i,j})^{1/2}, and rank(A) denotes its rank.

Model
Let Y ∈ R^{n×q} consist of n independent response vectors, let X ∈ R^{n×p} be a given between-individuals design matrix, and let Z ∈ R^{k×q} be a known within-individuals design matrix.
Consider the following bilinear regression model:

Y = XM*Z + E,    (1)

where M* ∈ R^{p×k} is the unknown parameter matrix. The random noise matrix E is assumed to have zero mean, E(E) = 0. The main assumption here is the low-rank restriction on the model parameter, rank(M*) < min(p, k). The model presented in Equation (1) is a bilinear regression model, which can be seen as a generalization of the reduced rank regression problem. In the case where k = q and Z = I_q, the model simplifies to the traditional reduced rank regression problem, which has been well studied in the literature, such as [32,33]. However, in this paper, we consider the more general case where the matrix Z contains additional explanatory variables.
The low-rank assumption in this model can be interpreted as indicating the presence of a latent process that affects the response data, not only through the "between-individuals" structure of the model but also through the "within-individuals" structure. This model is often referred to as the growth curve model or the generalized multivariate analysis model (GMANOVA) and has been studied in depth in the literature, such as [2].
From Assumption 1, it is not reliable to return predictions XMZ with entries outside of the interval [−C, C]. However, for computational reasons, it is extremely convenient to employ an unbounded prior for M. Therefore, we propose to use unbounded distributions for M but to use, as a predictor, a truncated version of XMZ rather than XMZ itself. For a matrix A, let Π_C(A) be the orthogonal projection of A onto the set of matrices with entries bounded by C:

Π_C(A)_{i,j} = max(−C, min(C, A_{i,j})).

Note that Π_C(A) is simply obtained by replacing the entries of A larger than C by C, and the entries smaller than −C by −C. For a matrix M ∈ R^{p×k}, we denote by r(M) the empirical risk of M,

r(M) = (1/(nq)) Σ_{i=1}^{n} Σ_{j=1}^{q} (Y_{i,j} − (XMZ)_{i,j})²,

and its expectation is denoted by R(M) = E[r(M)]. The focus of our work in this paper is on the predictive aspects of the model; that is, a matrix M predicts almost as well as M* if R(M) − R(M*) is small. Under the assumption that E_{i,j} has a finite variance, using the Pythagorean theorem, we have, for any M,

R(M) − R(M*) = (1/(nq)) ∥XMZ − XM*Z∥²_F,

which means that our results can also be interpreted in terms of the Frobenius norm. Let π be a prior distribution on R^{p×k} (see Section 2.2). For any λ > 0, we define the quasi-posterior

ρ_λ(dM) ∝ exp(−λ r(M)) π(dM).
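As a concrete illustration, the truncation Π_C, the empirical risk r(M), and the resulting unnormalized log quasi-posterior can be sketched in a few lines. This is a minimal Python sketch under the squared-loss reading of r(M); the function names are ours, not the paper's:

```python
import numpy as np

def project_C(A, C):
    """Pi_C: entrywise projection of A onto [-C, C] (clipping)."""
    return np.clip(A, -C, C)

def empirical_risk(M, X, Y, Z):
    """r(M): mean squared residual over all n*q entries of Y - X M Z."""
    resid = Y - X @ M @ Z
    return np.mean(resid ** 2)

def log_quasi_posterior(M, X, Y, Z, lam, log_prior):
    """Unnormalized log-density of rho_lambda(dM) ∝ exp(-lam * r(M)) pi(dM)."""
    return -lam * empirical_risk(M, X, Y, Z) + log_prior(M)
```

Any log-prior (for instance, the spectral scaled Student prior of Section 2.2) can be plugged in through the `log_prior` argument.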
It is worth noting that for the specific choice λ = nq/(2σ²), the posterior distribution obtained corresponds to the case where the noise terms E_{i,j} are assumed to be Gaussian with mean 0 and variance σ². However, our theoretical results hold under a more general class of noise distributions. It is known that a small enough λ is sufficient when the model is misspecified [14]. Additionally, even in the case of Gaussian noise, in high-dimensional settings, a smaller value of λ than nq/(2σ²) leads to better adaptation properties [30,34]. The precise choice of λ in our method will be further discussed below.
We consider the following truncated posterior mean of XMZ, given by

XMZ_λ = ∫ Π_C(XMZ) ρ_λ(dM).

It is worth noting that, in the simulation experiments, with reasonable values of C, the Monte Carlo algorithm never samples matrices M such that Π_C(XMZ) ≠ XMZ. In other words, the boundedness constraint has very little impact in practice; it is mainly necessary for the technical proofs. If one is interested in obtaining an estimator of M* instead of an estimator of XM*Z, when X⊤X and ZZ⊤ are invertible, one can consider the estimator M_λ = (X⊤X)⁻¹X⊤ XMZ_λ Z⊤(ZZ⊤)⁻¹ and note that X M_λ Z = XMZ_λ. This estimator can be used to obtain an estimator of M* whenever the inverses of X⊤X and ZZ⊤ are computationally feasible. The quasi-posterior distribution investigated in this paper is often referred to as the "Gibbs posterior" in the PAC-Bayes approach; this terminology is used in the literature, such as [17-19,35,36]. Additionally, the estimator M_λ is sometimes referred to as the Gibbs estimator or the exponentially weighted aggregate (EWA) in the literature, such as [34,37,38].
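Assuming X⊤X and ZZ⊤ are invertible, mapping an estimate of XM*Z back to a parameter estimate amounts to two linear solves; a sketch (the helper name is ours):

```python
import numpy as np

def recover_M(XMZ_hat, X, Z):
    """M_lambda = (X^T X)^{-1} X^T (XMZ_hat) Z^T (Z Z^T)^{-1},
    assuming X^T X and Z Z^T are invertible."""
    left = np.linalg.solve(X.T @ X, X.T @ XMZ_hat)      # (X^T X)^{-1} X^T XMZ_hat
    # Right-multiply by (Z Z^T)^{-1}; Z Z^T is symmetric, so a solve suffices.
    return np.linalg.solve(Z @ Z.T, (left @ Z.T).T).T
```

When XMZ_hat is exactly of the form X M Z, this recovers M exactly (X of full column rank, Z of full row rank).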

Prior Specification
We consider, in this paper, the following spectral scaled Student prior distribution, with parameter τ > 0:

π(M) ∝ det(τ² I_p + MM⊤)^{−(p+k+2)/2}.

This prior distribution is designed to promote low-rankness by placing more probability mass on matrices with small singular values, which leads to sparse solutions. This prior has a similar form to the one used in related works such as [39,40], and it is known to lead to good performance in different problems, such as high-dimensional regression and image denoising. It can be verified that

π(M) ∝ ∏_{j=1}^{min(p,k)} (τ² + s_j(M)²)^{−(p+k+2)/2},

where s_j(M) denotes the j-th largest singular value of M. The expression above is a scaled Student distribution evaluated at s_j(M), which can be seen as a way to induce approximate sparsity on the s_j(M) [30]. The log-sum function Σ_{j=1}^{min(p,k)} log(τ² + s_j(M)²), used by [39,40], is also known to enforce approximate sparsity on the singular values of M. This means that under this prior, most of the s_j(M) are close to 0, which implies that M is well approximated by a low-rank matrix. Therefore, this prior has the ability to promote the low-rankness of M.
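A sketch of the log-prior through the singular values, together with its gradient via the determinant identity d/dM log det(τ²I + MM⊤) = 2(τ²I + MM⊤)⁻¹M. We keep the exponent as a generic parameter `alpha` rather than committing to a particular normalization; the two expressions below differ only by a constant, so the gradients agree:

```python
import numpy as np

def log_prior(M, tau, alpha):
    """log pi(M) up to a constant: -alpha * sum_j log(tau^2 + s_j(M)^2)."""
    s = np.linalg.svd(M, compute_uv=False)
    return -alpha * np.sum(np.log(tau ** 2 + s ** 2))

def grad_log_prior(M, tau, alpha):
    """Gradient of log_prior, using
    d/dM log det(tau^2 I + M M^T) = 2 (tau^2 I + M M^T)^{-1} M,
    which matches the singular-value form up to an additive constant."""
    p = M.shape[0]
    return -2.0 * alpha * np.linalg.solve(tau ** 2 * np.eye(p) + M @ M.T, M)
```

The closed-form gradient is what makes gradient-based samplers such as LMC convenient with this prior.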
As previously stated, this prior is not conjugate to the problem at hand. However, it is particularly convenient to implement gradient-based sampling algorithms, such as the Langevin Monte Carlo method, which will be discussed in more detail in Section 4. This is because the gradient of the log-posterior can be computed efficiently, and it allows for an efficient implementation of the LMC algorithm.

Theoretical Results
We make the following sub-exponential assumption on the noise.

Assumption 2.
The entries E_{i,j} of E are independent. There exist two known constants σ > 0 and ξ > 0 such that the entries E_{i,j} satisfy a Bernstein-type (sub-exponential) moment condition with these constants; see Lemma A1. Let C_1 be a constant depending only on C, σ and ξ, and let τ* be the corresponding prior scale. The statistical properties of the mean estimator are given in the following theorem, where we propose a non-asymptotic analysis for our mean estimator.

Theorem 1. Let Assumptions 1 and 2 be satisfied. Fix the parameter τ = τ* in the prior. Fix δ > 0 and define λ* := nq min(1/(2C²), δ/[C_1(1 + δ)]). Then, for any ε ∈ (0, 1), we have, with probability at least 1 − ε on the sample,

The choice λ = λ* is determined by optimizing an upper bound on the risk R (as shown in the proof of this theorem). However, it is important to note that this choice may not necessarily be the best in practice, even though it gives a good estimate of the order of magnitude for λ. To ensure optimal performance, the user can use cross-validation to properly adjust the temperature parameter. Additionally, it is worth noting that rank(M) = 0 is not excluded in the above formula. If rank(M) = 0, then M = 0 and we interpret 0 log(1 + 0/0) as 0.
The proof of this theorem is based on the PAC-Bayes theory and is provided in Appendix A. By taking M = M*, we can obtain an upper bound on the infimum, leading to the following result.

Corollary 1.
Under the same assumptions and the same τ, λ * as in Theorem 1, let r * = rank(M * ). Then, for any ε ∈ (0, 1), we have, with probability at least 1 − ε on the sample, Theorem 1 provides an understanding of the statistical properties of the posterior mean. However, it is also important to understand the contraction properties of the quasi-posterior distribution. In the following theorem, we aim to provide a result that demonstrates this aspect of the proposed method.

Theorem 2.
Under the assumptions for Theorem 1, let ε n be any sequence in (0, 1) such that ε n → 0 when n → ∞. Define Then, The proof of this theorem is provided in Appendix A.

Model and Method
In the context of inductive matrix completion, given two side information matrices X and Z, we assume that only a random subset Ω of the entries of Y in model (1) is observed. More precisely, we assume that we observe m independent and identically distributed random pairs (I_1, Y_1), ..., (I_m, Y_m) given by

Y_i = (XM*Z)_{I_i} + E_i,

where M* ∈ R^{p×k} is the unknown parameter matrix, expected to be low-rank, and the observation sample size is assumed to satisfy m < nq. The noise variables E_i are assumed to be independent with E(E_i) = 0. The variables I_i are independent copies of a random variable I with distribution Π on the set {1, ..., n} × {1, ..., q}; we write Π_{x,y} := Π(I = (x, y)). Our goal in this paper is to investigate the problem of bilinear regression and also to address the case where the response matrix contains missing data, a problem known as inductive matrix completion. In particular, when p = n, k = q and X = I_n, Z = I_q are the identity matrices, the problem reduces to the traditional matrix completion problem, which has been well studied in the literature [26]. Similarly, when k = q and Z = I_q is the identity matrix, the problem becomes the reduced rank regression problem with incomplete response, which has also been studied in recent works, such as [23,41]. However, in the context of inductive matrix completion, we focus on the more general case where X and Z contain additional explanatory variables; in other words, we consider the side information from the n users and the q items in our model [8].
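The observation scheme above — i.i.d. index draws with replacement, then a noisy entry of XM*Z — can be sketched as follows. The helper is ours and assumes, for simplicity, a uniform sampling distribution Π:

```python
import numpy as np

def sample_observations(X, M_star, Z, m, rng, noise_sd=1.0):
    """Draw m i.i.d. pairs (I_i, Y_i): I_i uniform on the n x q index set
    (sampling WITH replacement), Y_i = (X M* Z)_{I_i} + E_i."""
    signal = X @ M_star @ Z                  # the full n x q mean matrix
    n, q = signal.shape
    rows = rng.integers(0, n, size=m)
    cols = rng.integers(0, q, size=m)
    y = signal[rows, cols] + noise_sd * rng.standard_normal(m)
    return list(zip(rows, cols)), y
```

A non-uniform Π would only change how `rows` and `cols` are drawn.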
It has been acknowledged that there are two different ways to model the observed values of Y, either by including or excluding the possibility of observing the same entry multiple times. Previous studies have examined both of these methods, such as the examination of matrix completion without replacement in [25] and with replacement in [26]. Both methods have practical uses and use similar techniques for estimation. This particular study focuses on the scenario where the variables I i are independently and identically distributed, meaning that it is possible to observe the same entry multiple times. Additionally, it is important to note that, according to the findings presented in Section 6 of [42], the results of this study can also be applied to the scenario of sampling without replacement, as long as the sampling is performed uniformly and there is no observation noise.
We now adapt the quasi-Bayesian approach for bilinear regression from Section 2 to the context of inductive matrix completion. For a probability distribution P on {1, ..., n_1} × {1, ..., n_2}, we generalize the Frobenius norm by

∥A∥²_{F,P} = Σ_{(i,j)} P(i, j) A²_{i,j}.

For a matrix M ∈ R^{p×k}, we denote the empirical risk of M, r(M), and its expected risk, R(M), respectively, as

r(M) = (1/m) Σ_{i=1}^{m} (Y_i − (XMZ)_{I_i})²,    R(M) = E[r(M)].

As in Section 2, we will focus on the predictive aspects of the model; that is, a matrix M predicts almost as well as M* if R(M) − R(M*) is small. Under the assumption that E_i has a finite variance, based on the Pythagorean theorem, we have, for any M,

R(M) − R(M*) = ∥XMZ − XM*Z∥²_{F,Π},

which means that our results can also be interpreted in terms of an estimation of M* with respect to a generalized Frobenius norm.
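The generalized Frobenius norm and the empirical risk over the m observed pairs can be sketched as (names are ours):

```python
import numpy as np

def gen_frobenius_sq(A, P):
    """||A||_{F,P}^2 = sum_{i,j} P[i,j] * A[i,j]^2 for a probability matrix P."""
    return np.sum(P * A ** 2)

def empirical_risk_mc(M, X, Z, idx, y):
    """r(M) = (1/m) sum_i (Y_i - (X M Z)_{I_i})^2 over the observed pairs."""
    pred = X @ M @ Z
    rows, cols = zip(*idx)
    return np.mean((y - pred[np.array(rows), np.array(cols)]) ** 2)
```

For the uniform distribution P(i, j) = 1/(n1 n2), the generalized norm reduces to the ordinary squared Frobenius norm divided by n1 n2.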
Our choice of λ will be specified below. The truncated posterior mean of XMZ is given by

XMZ_λ = ∫ Π_C(XMZ) ρ_λ(dM),    where ρ_λ(dM) ∝ exp(−λ r(M)) π(dM).

For the same technical reasons as in the context of bilinear regression, this truncation has very little impact in practice for reasonable values of C.

Theoretical Results
In this section, we derive the statistical properties of the quasi-posterior ρ_λ and of the mean estimator XMZ_λ in the context of inductive matrix completion. Let us first state our assumptions on this model.
Assumptions 1 and 3 are both standard; they have been used in [41] for theoretical analysis of reduced rank regression and in [26] for trace regression and matrix completion.
Theorem 3. Let Assumptions 1 and 3 be satisfied. Fix the parameter τ = τ* in the prior. Fix δ > 0 and define λ* analogously to Theorem 1. Then, for any ε ∈ (0, 1), we have, with probability at least 1 − ε on the sample,

Similar to the context of bilinear regression, the choices λ = λ* and τ = τ* come from the optimization of an upper bound on the risk R (in the proof of this theorem). Therefore, these choices may not necessarily be the best in practice, even though they give a good order of magnitude for tuning these parameters. The user can use cross-validation to tune them properly in practice. Note again that rank(M) = 0 is not excluded in the above formula; if rank(M) = 0, then M = 0 and we interpret 0 log(1 + 0/0) as 0. The proof of this theorem is provided in Appendix A. In particular, we can upper bound the infimum over M by taking M = M*, which leads to the following result.

Corollary 2. Under the same assumptions as in Theorem 3, let r = rank(M*). Then, for any ε ∈ (0, 1), we have, with probability at least 1 − ε on the sample,
and in particular, if the sampling distribution Π is uniform,

Remark 1.
Up to a logarithmic term, our error rate r(k + p)/m is similar to the best rate known to date, derived in [8].
While Theorem 3 concerns the finite-sample convergence rate of the posterior mean, it is actually possible to prove that the quasi-posterior ρ_λ contracts around M* at the same rate.

Theorem 4.
Under the same assumptions as in Theorem 3, and with the same definitions of τ and λ*, let ε_m be any sequence in (0, 1) such that ε_m → 0 as m → ∞. Define Then, The proof of this theorem is provided in Appendix A.

Langevin Monte Carlo Implementation
In this section, we propose to sample from the quasi-posteriors of Sections 2 and 3 by a suitable version of the Langevin Monte Carlo (LMC) algorithm, a gradient-based sampling method. We propose to use a constant step-size unadjusted LMC algorithm; see [43] for more details. The algorithm is defined by an initial matrix M_0 and the recursion

M_{k+1} = M_k + h∇ log ρ_λ(M_k) + (2h)^{1/2} N_k,    (8)

where h > 0 is the step size, ρ_λ is the quasi-posterior, and N_0, N_1, ... are independent random matrices with independent and identically distributed standard Gaussian entries. We provide pseudo-code for LMC in Algorithm A1. For small values of the step size h, the output of Algorithm A1 is very close to the integral (3) of interest. However, if h is not small enough, the sum can explode [44]. In such cases, we consider including a Metropolis-Hastings correction in the algorithm. Another possible choice is to take a smaller h and restart the algorithm; although this slows the algorithm down, we keep some control over its execution time. The Metropolis-Hastings approach, on the other hand, ensures convergence to the desired distribution; however, the algorithm is greatly slowed down by the additional acceptance/rejection step at each iteration. Next, we describe the Metropolis-Hastings correction to the LMC algorithm. It guarantees convergence to the quasi-posterior, and it also provides a useful way of choosing h. More precisely, we consider the update rule in (8) as a proposal for a new candidate:

M̃_{k+1} = M_k + h∇ log ρ_λ(M_k) + (2h)^{1/2} N_k.

Note that the matrix M̃_{k+1} is normally distributed with mean M_k + h∇ log ρ_λ(M_k) and covariance matrix equal to 2h times the identity. This proposal is then accepted or rejected according to the Metropolis-Hastings algorithm, where the proposal is accepted with probability

min{ 1, [ρ_λ(M̃_{k+1}) q(M̃_{k+1}, M_k)] / [ρ_λ(M_k) q(M_k, M̃_{k+1})] },

where q(x, x′) is the transition probability density from x to x′. The details of the Metropolis-adjusted Langevin algorithm (denoted by MALA) are presented in Algorithm A2.
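A minimal sketch of the two samplers for a generic differentiable log-density and its gradient (in our setting, log ρ_λ(M) = −λ r(M) + log π(M) up to a constant; the helper names are ours):

```python
import numpy as np

def lmc(grad_log_post, M0, h, n_iter, rng):
    """Unadjusted Langevin: M <- M + h * grad log rho(M) + sqrt(2h) * N."""
    M = M0.copy()
    samples = []
    for _ in range(n_iter):
        M = M + h * grad_log_post(M) + np.sqrt(2.0 * h) * rng.standard_normal(M.shape)
        samples.append(M.copy())
    return samples

def mala(log_post, grad_log_post, M0, h, n_iter, rng):
    """Metropolis-adjusted Langevin: the LMC move is used as a proposal,
    then accepted or rejected so the chain targets rho exactly."""
    M = M0.copy()
    lp = log_post(M)
    samples, accepted = [], 0
    for _ in range(n_iter):
        mean_fwd = M + h * grad_log_post(M)
        prop = mean_fwd + np.sqrt(2.0 * h) * rng.standard_normal(M.shape)
        lp_prop = log_post(prop)
        mean_back = prop + h * grad_log_post(prop)
        # log q(.,.) for Gaussian proposals with covariance 2h * identity
        log_q_fwd = -np.sum((prop - mean_fwd) ** 2) / (4.0 * h)
        log_q_back = -np.sum((M - mean_back) ** 2) / (4.0 * h)
        if np.log(rng.uniform()) < lp_prop - lp + log_q_back - log_q_fwd:
            M, lp = prop, lp_prop
            accepted += 1
        samples.append(M.copy())
    return samples, accepted / n_iter
```

For the quasi-posterior, `grad_log_post` would combine the gradient of −λr(M) with the gradient of the log-prior; the sketch is written against any target so it can be sanity-checked on a standard Gaussian.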
Compared to the random-walk Metropolis-Hastings, MALA usually proposes moves into regions of higher probability, which are then more likely to be accepted. We note that the step-size h for MALA is chosen such that the acceptance rate is approximately 0.5 following [45], while the step-size for LMC in the same setting should be smaller than the one for MALA [46].

Simulation Studies for Bilinear Regression
We perform some numerical studies on simulated data to assess the performance of our proposed algorithms. All simulations were conducted using the R statistical software [47].
For fixed data dimensions q = 10 and k = 20, we vary n = 100 and n = 1000 to examine the effect of the sample size, whereas the dimension of the coefficient matrix is varied by taking p = 10 and p = 100. The entries of the design matrices X and Z are independently simulated from the standard Gaussian N(0, 1). Then, given a matrix M*, we simulate the response matrix Y from model (1), where the entries of the noise matrix E are independently and identically sampled from N(0, 1). We consider the following setups for the true coefficient matrix:
• Model I: The true coefficient matrix M* is a rank-2 matrix generated as M* = B_1 B_2⊤, where B_1 ∈ R^{p×2}, B_2 ∈ R^{k×2}, and all entries of B_1 and B_2 are independently and identically sampled from N(0, 1).
• Model II: An approximately low-rank setup is studied. This series of simulations is similar to Model I, except that the true coefficient matrix is no longer of rank 2 but can be well approximated by a rank-2 matrix: M* = B_1 B_2⊤ + U, where U is a matrix whose entries are independently and identically sampled from N(0, 0.1).
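The two data-generating setups above can be sketched as follows. We read "N(0, 0.1)" for the entries of U as variance 0.1; this is an assumption, and `u_sd` can be changed if it instead denotes the standard deviation:

```python
import numpy as np

def make_data(n, p, k, q, model="I", rng=None, noise_sd=1.0, u_sd=np.sqrt(0.1)):
    """Simulate from Y = X M* Z + E.
    Model I:  M* = B1 B2^T, exactly rank 2.
    Model II: M* = B1 B2^T + U, approximately rank 2."""
    if rng is None:
        rng = np.random.default_rng()
    X = rng.standard_normal((n, p))
    Z = rng.standard_normal((k, q))
    B1 = rng.standard_normal((p, 2))
    B2 = rng.standard_normal((k, 2))
    M_star = B1 @ B2.T                      # exactly rank 2
    if model == "II":
        M_star = M_star + u_sd * rng.standard_normal((p, k))
    Y = X @ M_star @ Z + noise_sd * rng.standard_normal((n, q))
    return X, Z, M_star, Y
```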
We compare our approaches, denoted by LMC and MALA, against the (generalized) ordinary least squares estimator [2], denoted by OLS. The OLS estimator is defined as follows:

M_OLS = X† Y Z†,

where A† denotes the Moore-Penrose inverse of a matrix A. We fixed λ = nq and τ = 1, and the LMC and MALA methods are initialized at the OLS estimator and run for 10,000 iterations, where the first 1000 steps are removed as a burn-in period.
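A sketch of a generalized OLS baseline using Moore-Penrose pseudo-inverses (our reading of the definition, M_hat = X†YZ†; this form is exact when X has full column rank and Z has full row rank):

```python
import numpy as np

def ols_bilinear(X, Y, Z):
    """Generalized OLS sketch: M_hat = pinv(X) @ Y @ pinv(Z)."""
    return np.linalg.pinv(X) @ Y @ np.linalg.pinv(Z)
```

In the noiseless case Y = X M Z with full-rank designs, this recovers M exactly, which makes it a convenient initializer for the samplers.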
The evaluations are performed using the mean squared estimation error (Est), the normalized (relative) mean squared error (Nmse), and the prediction error (Pred), defined as

Est = ∥M̂ − M*∥²_F,    Nmse = ∥M̂ − M*∥²_F / ∥M*∥²_F,    Pred = ∥X M̂ Z − X M* Z∥²_F,

where M̂ here is the estimator obtained from LMC, MALA or OLS. We report the averages and the standard deviations of these errors over 100 data replications. The results of our study are presented in Tables 1 and 2. From the tables, it can be observed that our proposed methods perform similarly to the OLS method. However, the estimator obtained from the MALA algorithm often yields smaller prediction errors, particularly in high-dimensional settings. This advantage is even more pronounced when the method is applied in the context of inductive matrix completion, as discussed in the next subsection.
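A sketch of the three criteria as we read them (squared Frobenius norms, with Nmse normalized by ∥M*∥²_F; the exact normalization is an assumption):

```python
import numpy as np

def evaluate(M_hat, M_star, X, Z):
    """Return (Est, Nmse, Pred): estimation error, its normalized version,
    and the prediction error in the fitted mean X M Z."""
    est = np.sum((M_hat - M_star) ** 2)          # ||M_hat - M*||_F^2
    nmse = est / np.sum(M_star ** 2)             # normalized by ||M*||_F^2
    pred = np.sum((X @ (M_hat - M_star) @ Z) ** 2)
    return est, nmse, pred
```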

Simulation Studies for Inductive Matrix Completion
The simulation settings for inductive matrix completion are similar to the settings for bilinear regression, Section 4.2. However, after obtaining the response matrix Y, we remove uniformly at random κ = 10% and κ = 30% of the entries of Y. Here, κ denotes the missing rate. We denote the response matrix with missing entries by Y miss .
As in the context of inductive matrix completion, we only observe the response matrix with missing entries, Y_miss, and thus we cannot construct the OLS estimator as in the case of bilinear regression. For this purpose, we first impute the missing entries of Y_miss using the R package softImpute [48], where the rank supplied to softImpute is set to the true rank of M*. We denote the resulting imputed matrix by Y_imp.
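softImpute itself is an R package; as a rough stand-in, an iterative rank-constrained SVD imputation can be sketched in Python. This is not the softImpute algorithm (which uses soft-thresholded SVDs) but a simpler "hard impute" variant with the same interface idea:

```python
import numpy as np

def svd_impute(Y_miss, mask, rank, n_iter=200):
    """Iteratively fill missing entries with a rank-constrained SVD fit.
    mask is True where the entry is observed; Y_miss holds those values."""
    Y = np.where(mask, Y_miss, 0.0)          # start by zero-filling
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        fit = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        Y = np.where(mask, Y_miss, fit)      # keep observed entries fixed
    return Y
```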
We compare our approaches, denoted by LMC and MALA, against the (imputed and generalized) ordinary least squares estimator, denoted by OLS_imp, which is defined as follows:

M_OLS_imp = X† Y_imp Z†,

where A† denotes the Moore-Penrose inverse of a matrix A. The LMC and MALA methods are initialized at the OLS_imp estimator and run for 10,000 iterations, where the first 1000 steps are removed as a burn-in period.
As previously discussed in Section 4.2, we report the averages and standard deviations of the mean squared estimation error (Est), the normalized (relative) mean squared error (Nmse), and the prediction error (Pred) over 100 data replications.
The results are detailed in Tables 3 and 4. It is evident from these tables that the results obtained from our MALA method surpass those of the other methods in terms of prediction error in most of the settings considered. This advantage becomes more pronounced as the missing rate in the response matrix increases. Additionally, it is worth noting that our MALA method is robust and performs well in the approximate low-rank setting (model II), while the OLS and LMC methods do not.

Discussion and Conclusions
In this paper, we focus on the problem of bilinear regression and its extension, the problem of inductive matrix completion, where the response matrix contains missing data. We propose a novel approach that combines elements of Bayesian statistics with a quasi-likelihood method. Our proposed method first addresses the problem of bilinear regression using a quasi-Bayesian approach and then adapts this approach to the problem of inductive matrix completion. By making use of a low-rankness assumption and leveraging the powerful PAC-Bayes bound technique, we provide statistical properties for our proposed estimators and for the quasi-posteriors.
Our proposed method includes an efficient gradient-based sampling algorithm that is designed to sample from the (quasi) posterior distribution. This algorithm allows for the approximate computation of mean estimators. These methods, referred to as LMC and MALA, were tested in various simulation studies and were found to perform well when compared to the ordinary least squares method. The ability to accurately sample from the (quasi) posterior distribution and compute mean estimators makes these methods a valuable tool for data analysis and modeling.
There are still some unresolved issues that require further investigation. One of these is the presence of missing data in the covariate matrices X and Z, which can have a significant impact on the analysis and may lead to biased results. Another area that needs further exploration is the assumption of independent and identically distributed data; in some cases this assumption may not hold, and alternative models that allow for a dispersion matrix may be needed. These are potential topics for future research.

Acknowledgments: The author would like to thank the anonymous referees for their useful comments.

Conflicts of Interest:
The author declares no conflicts of interest.

Appendix A. Proofs
The main technique for our proofs is the oracle-type PAC-Bayes bounds, in the spirit of [35]. We start with a few preliminary lemmas.

Appendix A.1. Preliminary Lemmas
First, we state a version of Bernstein's inequality from Proposition 2.9 page 24 in [49].
Lemma A2 (Donsker and Varadhan's variational formula). Let µ ∈ P(Θ). For any measurable, bounded function h : Θ → R, we have

log ∫ e^{h(θ)} µ(dθ) = sup_{ρ ∈ P(Θ)} [ ∫ h(θ) ρ(dθ) − K(ρ, µ) ].

Moreover, the supremum with respect to ρ on the right-hand side is attained by the Gibbs measure µ_h, defined by its density with respect to µ,

(dµ_h/dµ)(θ) = e^{h(θ)} / ∫ e^{h} dµ.

These two lemmas are the only tools we need to prove Theorems 1 and 2. Their proofs are quite similar, with a few differences. For the sake of simplicity, we state the common parts of the proofs as a separate result in Lemma A3. Note that the proof of this lemma will use Lemmas A1 and A2.
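Donsker and Varadhan's formula is easy to check numerically on a finite space Θ = {1, ..., d}: the Gibbs measure attains the supremum, and every other ρ falls below it. A sketch (names are ours):

```python
import numpy as np

def dv_lhs(h, mu):
    """log ∫ e^h dmu on a finite space (mu a probability vector)."""
    return np.log(np.sum(mu * np.exp(h)))

def gibbs_measure(h, mu):
    """mu_h ∝ e^h mu, the maximizer in Donsker-Varadhan."""
    w = mu * np.exp(h)
    return w / w.sum()

def dv_objective(rho, h, mu):
    """∫ h drho - K(rho, mu), with K the Kullback-Leibler divergence."""
    return np.sum(rho * h) - np.sum(rho * np.log(rho / mu))
```

Plugging the Gibbs measure into the objective returns exactly the left-hand side, while any other ρ gives a strictly smaller value; this is the mechanism behind the PAC-Bayes bounds used in the proofs.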
Let us now prove (A4). Here again, we start with an application of Lemma A1, but this time with U_i := −T_i (we keep v := v(M, M*), w := C² and ζ := λ/(nq)). We obtain, for any λ ∈ (0, nq/C²),

Rearranging terms, using the definition of β in (A2), and multiplying both sides by ε/2, we obtain

We integrate with respect to π and use Fubini's theorem to obtain

Here, we use a different argument from the proof of the first inequality: we apply Lemma A2 to the integral, and this directly gives (A4).
Finally, in both proofs, we will often use distributions ρ ∈ P(R^{p×k}) defined as translations of the prior π. We introduce the following notation.
Definition A1. For any matrix M ∈ R^{p×k}, we define ρ_M ∈ P(R^{p×k}) by

The following technical lemmas from [31] will be useful in the proofs.
Appendix A.2. Proof of Theorem 1
Proof of Theorem 1. An application of Jensen's inequality to inequality (A3), followed by the standard Chernoff trick to transform an exponential moment inequality into a deviation inequality, yields a first probability bound. Using (2), we obtain a further upper bound, where we used Jensen's inequality in the second line and the definition of the estimator from the second to the third line. Plugging this into our probability bound (A5) and dividing both sides by α, we obtain the stated inequality under the additional condition that λ is such that α > 0, which we assume from now on (note that this is satisfied by λ*). Using Lemma A2, we can rewrite this as (A6).
We now consider the consequences of the second inequality in Lemma A3, that is, (A4). With Chernoff's trick and a little rearranging of terms, we obtain an inequality that can be rewritten as: ∀ρ ∈ P(R^{p×k}), with probability at least 1 − ε/2, (A7) holds. Combining (A7) and (A6) with a union bound argument gives the following bound, with probability at least 1 − ε, (A8). Note that, for any (i, j), (XM*Z)_{i,j} ∈ [−C, C] implies that |(Π_C(XMZ))_{i,j} − (XM*Z)_{i,j}| ≤ |(XMZ)_{i,j} − (XM*Z)_{i,j}|.
The end of the proof consists in making the right-hand side of the inequality more explicit. To do so, we restrict the infimum above to the distributions given by Definition A1; Dalalyan's lemma is extremely useful here. First, Lemma A5 provides an upper bound on K(ρ_M̂, π). Moreover, the second term in the right-hand side is null because π is centered, and the remaining term is controlled using elementary properties of the Frobenius norm and Lemma A4. We can now plug this (and Lemma A5) back into (A8).
We now make the constants explicit. First, if λ ≤ nq/(2C²), then 2nq(1 − C²λ/(nq)) ≥ nq, which yields the first bound on β/α. Then, λ ≤ nqδ/[C₁(1 + δ)] leads to β/α ≤ 1 + δ.
Note that λ* = nq min(1/(2C²), δ/[C₁(1 + δ)]) satisfies these two conditions, so from now on, λ = λ*. Combining the above estimates, with probability at least 1 − ε, the stated choice of τ² gives the bound of Theorem 1.
Appendix A.3. Proof of Theorem 2
Proof of Theorem 2. We also start with an application of Lemma A3, and focus on (A3), applied with ε := ε_n. Using Chernoff's trick, this gives a deviation bound. Using the definition of ρ̂_λ, for M ∈ A_n we obtain the corresponding estimate. Now, let us define the event B_n. Using (A4), we obtain a bound on its probability. We will now prove that, if λ is such that α > 0, the desired inclusion holds. To do so, assume that we are on the set B_n, and let M ∈ A_n. Then, rewriting the resulting inequality in terms of norms, and upper-bounding the right-hand side exactly as in the proof of Theorem 1, we obtain M ∈ M_n.
Then, for any ε ∈ (0, 1) and λ ∈ (0, m/C²), the inequalities (A10) and (A11) hold.
Proof of Lemma A6. The inequality (A10) is proved in the same way as in the proof of Lemma A3; that is, we apply Lemma A1 to independent random variables V_i, i = 1, . . . , m. The proof of the inequality (A11) proceeds similarly, applying Lemma A1 to the independent random variables −V_i, i = 1, . . . , m.
Proof of Theorem 3. The proof is similar to that of Theorem 1 up to (A6); noting that, using (6), we obtain (A12). We now consider the consequences of inequality (A11) in Lemma A6. With Chernoff's trick and a little rearranging of terms, we obtain: ∀ρ ∈ P(R^{p×k}), with probability at least 1 − ε/2, (A13) holds. Combining (A13) and (A12) with a union bound argument gives the bound (A14); noting that, for any (i, j), (XM*Z)_{i,j} ∈ [−C, C] implies |(Π_C(XMZ))_{i,j} − (XM*Z)_{i,j}| ≤ |(XMZ)_{i,j} − (XM*Z)_{i,j}|, we obtain the corresponding bound for the clipped estimator.
We now make the right-hand side of the inequality more explicit. To do so, we restrict the infimum above to the distributions given by Definition A1; Dalalyan's lemma is again extremely useful. First, Lemma A5 provides an upper bound on K(ρ_M̂, π). Moreover, the second term in the right-hand side is null because π is centered, and thus the term ∫ ‖XMZ − XM*Z‖²_{F,Π} ρ_M̂(dM) is controlled using elementary properties of the Frobenius norm and Lemma A4. We can now plug this (and Lemma A5) back into (A14).
We now make the constants explicit. First, if λ ≤ m/(2C²), then 2m(1 − C²λ/m) ≥ m, which yields the first bound on β/α. Then, λ ≤ mδ/[C₁(1 + δ)] leads to β/α ≤ 1 + δ. Note that λ* = m min(1/(2C²), δ/[C₁(1 + δ)]) satisfies these two conditions, so from now on, λ = λ*. In particular, with probability at least 1 − ε, the choice τ² = C₁(k + p)/(mkp‖X‖²_F‖Z‖²_F) gives the stated bound on the clipped estimator.
Proof. The proof proceeds completely analogously to the proof of Theorem 2 in Appendix A.3.
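The clipping step used in both proofs, namely that entrywise projection onto [−C, C] can only decrease the distance to a target already lying in [−C, C], is immediate from convexity and easy to confirm numerically; a small sketch with arbitrary synthetic values:

```python
import numpy as np

rng = np.random.default_rng(1)
C = 1.0
target = rng.uniform(-C, C, size=1000)   # entries of XM*Z, assumed in [-C, C]
pred = rng.normal(scale=3.0, size=1000)  # entries of XMZ, unbounded

clipped = np.clip(pred, -C, C)           # entrywise projection Π_C

# |Π_C(XMZ) - XM*Z| ≤ |XMZ - XM*Z| entrywise.
assert np.all(np.abs(clipped - target) <= np.abs(pred - target))
```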

Appendix B. Comments on Algorithm Implementation
For the case of inductive matrix completion, the logarithm of the density of the quasi-posterior is, up to an additive constant,

−λ Σ_i (Y_i − (Π_C(XMZ))_i)² − ((p + m + 2)/2) log det(τ² I_m + M Mᵀ).
Let us now differentiate this expression in M. Note that the term (Y_i − (Π_C(XMZ))_i)² does not actually depend on M locally if (XMZ)_i ∉ [−C, C]; in this case its differential with respect to M is 0_{p×m}. Otherwise, (Y_i − (Π_C(XMZ))_i)² = (Y_i − (XMZ)_i)². In order to differentiate the term (XMZ)_i, let us introduce a notation for the entries of I_i: I_i = (a_i, b_i). Then ∇(XMZ)_i = D, where the matrix D ∈ R^{p×m} satisfies D_{x,y} = 1_{{x=b_i}} X_{a_i,y}; the gradient of the log quasi-posterior follows. The above calculation still requires a p × p matrix inversion at each iteration; for very large p, this may be expensive and can slow down the algorithm. Therefore, we can replace this matrix inversion by an accurate approximation obtained through convex optimization. Note that the matrix B := (τ²I_m + MMᵀ)⁻¹M is the solution to the following convex optimization problem: min_B ‖I_p − MᵀB‖²_F + τ²‖B‖²_F. The solution of this optimization problem can be conveniently obtained using the package 'glmnet' [50] (with the family option 'mgaussian'). This avoids performing matrix inversion or other costly calculations. However, we note that the LMC algorithm is then used with approximate gradient evaluations; a theoretical assessment of this approach can be found in [51].
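The claimed identity between the matrix inverse and the ridge-type convex problem can be checked numerically; the sketch below uses plain numpy least squares rather than 'glmnet', with dimensions chosen for a self-contained check (the transpose convention follows B = (τ²I + MMᵀ)⁻¹M):

```python
import numpy as np

rng = np.random.default_rng(2)
p, k, tau2 = 6, 3, 0.5
M = rng.normal(size=(p, k))

# Closed form: B = (tau^2 I + M M^T)^{-1} M.
B_inv = np.linalg.solve(tau2 * np.eye(p) + M @ M.T, M)

# Same B as the solution of  min_B ||I - M^T B||_F^2 + tau^2 ||B||_F^2,
# written as an augmented least-squares (ridge regression) problem.
A = np.vstack([M.T, np.sqrt(tau2) * np.eye(p)])  # augmented design
Y = np.vstack([np.eye(k), np.zeros((p, k))])     # augmented targets
B_ridge = np.linalg.lstsq(A, Y, rcond=None)[0]

assert np.allclose(B_inv, B_ridge)
```

The normal equations of the augmented problem are (MMᵀ + τ²I)B = M, which is exactly the closed form above, so no explicit inversion is required.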