Abstract
In this paper, we conduct a theoretical examination of a low-rank matrix single-index model. This model has recently been introduced in the field of biostatistics, but its theoretical properties for jointly estimating the link function and the coefficient matrix have not yet been fully explored. In this paper, we make use of the PAC-Bayesian bounds technique to provide a thorough theoretical understanding of the joint estimation of the link function and the coefficient matrix. This allows us to gain a deeper insight into the properties of this model and its potential applications in different fields.
MSC:
62G05; 62C20
1. Introduction
In this study, we investigate a particular type of single-index model, where the response variable, denoted by Y, is a real number and the covariate matrix, represented by X, is a matrix of real numbers with dimensions of . The model is defined in Equation (1) as
In this equation, represents the inner product between matrices X and , where is an unknown coefficient matrix with dimensions of . The link function is an unknown univariate measurable function. The noise term, represented by , is assumed to have a mean of 0 and is independent of the covariate X.
In line with the recent research presented in [1,2], we make the assumption that the coefficient matrix is a symmetric, low-rank matrix with . Additionally, in order to ensure the uniqueness of the model, we impose the condition that the Frobenius norm of is equal to 1, i.e., .
Previous studies have been conducted on a similar model to the one presented in this paper, where the unknown coefficient matrix is assumed to have sparse elements. In particular, the work of [1] in the field of biostatistics has been used to examine the correlation between a response variable and the functional connectivity associated with a certain brain region. Additionally, recent research by [2] has focused on developing methods for estimating the unknown low-rank matrix by using implicit regularization techniques.
The model discussed in this paper can be thought of as a nonparametric version of the trace regression model that has been previously proposed in the literature, specifically in the works in [3,4,5]. This trace regression model utilizes the identity function as the link function, and encompasses a diverse array of statistical models, including but not limited to reduced rank regression, matrix completion, and linear regression.
The single-index model is a versatile extension of the linear model, which offers a natural interpretation. This model only changes in the direction of the parameter (vector/matrix), and the nature of this change is depicted by the link function . This has been the subject of extensive research in the literature, with various studies exploring its applications and extensions in various fields. Examples of such works include [1,6,7,8,9,10,11,12,13,14]. These studies have demonstrated the versatility and utility of the single-index model in a wide range of contexts, making it a valuable tool for researchers in various fields.
Definition 1.
Let denote the set of all symmetric matrix such that .
Given the covariates , the response variables are i.i.d. generated from model (1). We define the expected risk for any measurable and as
and denote the empirical counterpart of by
In this research, we examine the forecasting abilities of the model. More specifically, we consider a pair to have comparable predictive performance to if the difference between and is minimal.
Our approach in this work is built on the PAC-Bayesian bound technique, which is a powerful tool for obtaining oracle inequalities bounds [15]. Similar to Bayesian analysis, one important aspect of a PAC-Bayesian bound is specifying a prior distribution over the parameter space. In our approach, we adopt the prior distribution for the link function from the reference [11], while the prior distribution for the matrix parameter B is inspired by the eigen decomposition of the matrix. The specifics of our approach and the details of the prior distributions we chose are discussed in the next section. The use of the PAC-Bayesian bound technique in combination with carefully chosen prior distributions allows us to obtain reliable and accurate estimates of the unknown parameters in our model.
2. Main Result
2.1. Method
We make an additional assumption in our model (1) that , and the following conditional moment assumptions on the noise are assumed.
Assumption 1.
We assume that there exist two constants and , such that for all integers ,
Remark 1.
The assumption stated above implies that the noise term in our model follows a subexponential distribution. This class of distributions includes, for example, Gaussian noise or bounded noise, as discussed in [16]. In simpler terms, this means that the noise term in our model is characterized by a rate of decay that is slower than that of an exponential distribution. This assumption is critical for the application of our approach, as it allows us to obtain accurate and reliable estimates of the unknown parameters under a wide range of noise conditions. This is an important consideration, as the presence of noise can have a significant impact on the accuracy of the estimates obtained from our model. By assuming that the noise follows a subexponential distribution, we can be confident that our estimates are robust to the presence of noise.
In addition to the assumptions stated previously, it is also necessary to assume that the covariate matrix X is almost surely bounded by a constant. Additionally, the unknown link function is also assumed to be bounded by some known positive constant. To make this more precise, we use the notation to represent its supremum norm and to denote its functional supremum norm over the interval . Based on these definitions, we make the following assumption:
Assumption 2.
We assume that a.s. and , such that
In order to present the technical proofs in the clearest and simplest manner, we did not attempt to find the best constant used in the proofs. Specifically, the condition that is just convenient for the proofs in nature, and it could be eliminated by using in the proofs.
The link function is approximately estimated through a given specific countable set of measurable functions (dictionary) . For this purpose, the set of finite linear combinations of functions from the dictionary is utilized, and we denote this vector space by . We assume that each element in the dictionary is defined on the interval and takes values within the range .
Assumption 3.
For the sake of simplicity, we assume that the basic functions are differentiable and there exists some constant , such that
An example of such a collection of functions is the system of non-normalized trigonometric functions, where
satisfy this assumption. This assumption on the dictionary functions enables us to approximate the unknown link function with a finite linear combination of these functions.
Our approach is inspired by the work of [11], where the authors explored the PAC-Bayesian approach in [15] for a sparse-vector single-index model. The method needs to first specify a distribution on , similar to the prior distribution in Bayesian analysis. This prior distribution in our framework should enforce the characteristics of the underlying link function and the parameter matrix. In this work, we consider the following prior distribution:
in other words, it means that the prior distribution of the index matrix and the prior distribution over the link functions are assumed to be independent.
In this study, the matrix B is treated as a symmetric matrix and can be expressed in its eigen-decomposition form . The matrix U is an orthogonal matrix with (identity matrix of dimension ), and the diagonal matrix holds the corresponding eigenvalues . To enforce that , the sum of the squares of the eigenvalues must equal 1, as and . Additionally, the requirement of low-rankness on B means that most of the eigenvalues are close to zero, with only a few being significantly larger.
With the goal of obtaining an appropriate low-rank-promoting prior for B, we propose the following approach. We simulate an orthogonal matrix V and simulate from a Dirichlet distribution . Put
To obtain an approximate low-rank matrix, we take all parameters of the Dirichlet distribution to be very close to 0, for example, by setting . It is worth noting that a typical drawing of the Dirichlet distribution leads to one of the s being close to 1 and the others being close to 0. For more detailed discussions on how to choose the parameters for the Dirichlet distribution, one can refer to [17].
Now, we present a prior distribution on . We opted to use the prior introduced in [11]. With any integer M that , let us put
Now, we define the image of by the function
Remark 2.
Corollary 1 (below) provides a discussion regarding the approximation of Sobolev spaces (see [18] by the set ), which become more accurate as M increases.
Now, a prior distribution is defined on the set . This is performed by considering the image of the uniform measure on obtained through the function . We consider the following choice for the prior distribution on
The reason for choosing rather than C in the above definition of the prior distribution support is essentially for technical proof. This is to ensure that as soon as the underlying link function belongs to , there then exists a small ball around it that is contained in . One could safely replace it by , where is any positive sequence vanishing sufficiently slowly as .
Remark 3.
The integer M can be viewed as a measure of the “dimension” of the function f—the larger the M, the more complex the function—and the prior ν adapts again to the sparsity idea by penalizing large-dimensional functions f. The coefficient , which appears in (2), shows that more complex models have a geometrically decreasing influence. Inspired from the practical results in [11], the value 10 is a random choice. This choice could be in general changed by another positive constant, but it requires more technical attention.
2.2. The Proposed Estimator
Definition 2.
The Gibbs posterior distribution over is defined as
Now, we define an estimator as follows. Let be a tuning parameter, or sometime called the inverse temperature parameter. Let be an estimator of . It is simply achieved by a random draw from , the Gibbs posterior distribution above.
2.3. Theoretical Results
As almost surely, it is noted that for all ,
(Pythagoras theorem).
Definition 3.
For any positive integer , we set
Remark 4.
It is noted here that the infimum is defined on for each value of M. However, the prior distribution is defined on a slightly larger set, that is, .
Let us define
The theoretical results in this work mainly come from the following theorem, the proof of which is provided in Section 3. It should be noted that throughout the paper, the phrase “with probability ” refers to the probability calculated with respect to both the distribution of the data and the conditional Gibbs distribution .
Theorem 1.
Assume that Assumptions 1 and 2 hold, with
We have that, for all , with a probability of at least ,
where is a constant depending only on .
Remark 5.
As in practice, the value of w and are not known, and the theoretical value of λ cannot be used. However, it provides a good order to tune this parameter, for example, using cross-validation.
Remark 6.
Theorem 1 can be interpreted in a straightforward manner. Essentially, it states that if there exists a “small” M and is small, such that the difference between and is minimal, then the difference between and will also be small in the order of . On the other hand, if neither of these conditions are met, then the rate or (or either) will start to dominate, thus resulting in a decrease in the general quality of the convergence rate.
We can obtain a good convergence rate as soon as a low-rank assumption is considered. This is typically achievable when is already low-rank or can be well approximated by a low-rank matrix. In the case that is sufficiently regular, we can obtain a good approximation with a “small” M.
As shown in [11], when belongs to a Sobolev space, we can derive a more specific nonparametric rate for the above theorem. For example, assume that is the system of trigonometric functions and in addition that the link function is in the following Sobolev ellipsoid space [18],
where is an unknown regularity parameter. In this context, the approximation set is in the following form:
It should be noted that the results presented in this paper are in the so-called adaptive setting, where the regularity parameter k is not assumed to be known. However, in order to obtain these results, it is necessary to make an additional assumption.
Assumption 4.
We assume that the probability density of the random variable is defined on , and it is upper-bounded by a constant .
Corollary 1.
Assume that Theorem 1 and additional Assumption 4 hold. Moreover, assume that is in the Sobolev ellipsoid space , where the regularity parameter is unknown. The tuning parameter λ is as in (3). We have that for all with a probability of at least ,
where is a constant depending only on L, C, σ, .
The proof for Corollary 1 follows a similar approach to that of Corollary 4 in [11], and thus, it is not included in this paper.
Remark 7.
From an asymptotic point of view, that d is fixed and , the leading rate on the right-hand side in the above Corollary is . This is known to be the minimax rate of convergence up to a factor over a Sobolev class; see [18]. On the other hand, in a nonasymptotic setting where n is “small”, we obtain the estimation rate , which was also obtained by [2], and it is minimax optimal up to a logarithmic term, as in [3].
From Theorem 1, it is actually possible to derive that the Gibbs posterior contracts around at the optimal rate.
Theorem 2.
Under the same assumptions for Theorem 1 and the same definition for λ, let be any sequence in , such that when . Define
Then,
3. Proofs
For the sake of simplicity in the proofs, we put
We have that for each ,
The following lemma, Lemma 1, is a Bernstein-type inequality [16] that is useful for our proofs. We denote by the positive part of a random variable Z.
Lemma 1.
Let be independent real-valued random variables. It is assumed that there exist two constants that for all integers , We have that with ,
Let be a measurable space and and be two probability measures on . Denote by the Kullback–Leibler divergence of with respect to . Lemma 2 is a classical result, and its proof can be found, for example, in [15], (page 4).
Lemma 2.
Let be a measurable space. For any probability measure ν on and any measurable function , such that , we have
where κ is a probability measure on and . In addition, when g is upper-bounded on the support of ν, the supremum in (5) is obtained by the Gibbs distribution g, given by
Lemma 3.
We assume that Assumption 1 is satisfied. Put and take and put
With and any distribution , we have that
Proof.
Fix and . We start by using Lemma 1 with the following random variables:
Note that are independent, and we have that
where we set and
Now, for all integers k greater than 3, we have that
In the last inequality, we used the fact that . We obtain that
with .
Thus, for any , taking , we apply Lemma 1 to obtain
Therefore, we obtain, with the given in (6),
Next, integrating with respect to and consequently using Fubini’s theorem, we obtain
The proof for (8) is similar. More precisely, we apply Lemma 1 with . We obtain, for any ,
By rearranging terms, using definition of in (6), and multiplying both sides by , we obtain
Integrating with respect to and using Fubini’s theorem, we obtain
Now, Lemma 2 is applied to the integral, and this directly yields (8). □
Proof of Theorem 1.
Recall that stands for the distribution of the sample ; the Equation (7) can be written conveniently as
Now, we use the standard Chernoff trick to transform an exponential moment inequality into a deviation inequality, i.e., using . We obtain, with a probability of at least for any and any distribution ,
It is noted that we have
thus, we obtain, with a probability larger than ,
Now, using Lemma 2, it yields that with a probability larger than ,
Now, from (8) with an application of the standard Chernoff trick, we obtain, with a probability larger than for any and any distribution ,
Combining (9) and (10) with a union bound argument gives the bound, with a probability larger than ,
The final steps of the proof involve making the right-hand side of the inequality more explicit. To achieve this, we limit the infimum bound to a specific distribution. This allows us to have a more concrete understanding of the result and to explicitly obtain the error rate.
Put and let with small . Take
For any positive integer and any , let the probability measure be defined by
with
We denote for ,
Inequality (11) leads to
To finish the proof, we have to control the different terms in (12). Note first that
By technical Lemma 4, we know that
Additionally, by technical Lemma 10 in [11], we have that
Bringing together all the parts, it arrives at
Finally, it remains to control the term To this aim, we write
Computation of C by Fubini’s theorem:
Using the triangle inequality, we obtain that for and ,
Since , and thus as a consequence, as soon as . This shows that the set
is contained in the support of . In particular, this implies that is centered at and, consequently,
This proves that .
Control of A: Clearly,
Control of B: We have
Using Lemma 6 from [19], we have that
Thus,
Control of E: We have that
Control of D: Finally,
As we have that
it leads to
and therefore,
Thus, taking and assembling all the components, we obtain that
where is a positive constant function of C, , and . Combining this inequality with (12) and (13) yields, with a probability larger than ,
Finally, choosing it yields that there exists a constant depending only on with a probability of at least , such that
This concludes the proof of Theorem 1. □
Lemma 4.
Let with small . Take
Then,
where is a universal constant.
Proof.
We have that
The first log term
Note the following for the above calculation: firstly, the distribution of the orthogonal vector is approximated by the uniform distribution on the sphere [20], and secondly, the probability is greater or equal to the volume of the (d-1)-“circle” with radius over the surface area of the d-“unit sphere”.
It is noted that if (beta distribution), then has the pdf as where is the beta function. The second log term in the Kullback–Leibler term with is
The interval of integration contains at least an interval of length . Thus, we obtain
for some absolute numerical constant that does not depend on or d. □
Proof of Theorem 2.
We also apply Lemma 3, and focus on (7), applied to , that is
Using Chernoff’s inequality, this leads to
where
From the definition of , for , we obtain that
Now, put
Using (8), we have that
We now prove that if is such that ,
and, together with,
leads to
To obtain that, assume that we are on the set , and let . Then,
that is,
We upper-bound the right-hand side similarly as in the proof of Theorem 1, which leads to . □
4. Conclusions
In this paper, we conduct a theoretical study of a low-rank matrix single-index model. The model is used to estimate the link function and the coefficient matrix jointly. We leverage the PAC-Bayesian bounds technique to gain a deeper insight into the properties of this model and its potential applications. The study extends previous work in the field by considering a low-rank matrix, rather than a sparse vector, as the coefficient matrix. We also provide a detailed explanation of the choice of prior distributions for the link function and the coefficient matrix, which allows to obtain accurate and reliable estimates of the unknown parameters. Overall, this study provides a thorough theoretical understanding of the low-rank matrix single-index model.
The focus of future research would center on executing the proposed approach. There are various possible avenues to explore. One of the promising approaches is to use the reversible jump Markov chain Monte Carlo method, which was successfully applied in the past to address the sparse vector single-index model, as documented in [11].
Funding
This research was funded by Norwegian Research Council grant number 309960 through the Centre for Geophysical Forecasting at NTNU.
Data Availability Statement
No new data were created or analyzed in this study.
Acknowledgments
The author is grateful to two anonymous reviewers for their expert analysis and helpful suggestions.
Conflicts of Interest
The author declares no conflict of interest.
References
- Weaver, C.; Xiao, L.; Lindquist, M.A. Single-index models with functional connectivity network predictors. Biostatistics 2021, 24, 52–67. [Google Scholar] [CrossRef] [PubMed]
- Fan, J.; Yang, Z.; Yu, M. Understanding Implicit Regularization in Over-Parameterized Single Index Model. J. Am. Stat. Assoc. 2022, 1–14. [Google Scholar] [CrossRef]
- Rohde, A.; Tsybakov, A.B. Estimation of high-dimensional low-rank matrices. Ann. Stat. 2011, 39, 887–930. [Google Scholar] [CrossRef]
- Koltchinskii, V.; Lounici, K.; Tsybakov, A.B. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat. 2011, 39, 2302–2329. [Google Scholar] [CrossRef]
- Zhao, J.; Niu, L.; Zhan, S. Trace regression model with simultaneously low rank and row (column) sparse parameter. Comput. Stat. Data Anal. 2017, 116, 1–18. [Google Scholar] [CrossRef]
- Nelder, J.A.; Wedderburn, R.W. Generalized linear models. J. R. Stat. Soc. Ser. A Gen. 1972, 135, 370–384. [Google Scholar] [CrossRef]
- Hardle, W.; Hall, P.; Ichimura, H. Optimal smoothing in single-index models. Ann. Stat. 1993, 21, 157–178. [Google Scholar] [CrossRef]
- Ichimura, H. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. J. Econom. 1993, 58, 71–120. [Google Scholar] [CrossRef]
- Jiang, B.; Liu, J.S. Variable selection for general index models via sliced inverse regression. Ann. Stat. 2014, 42, 1751–1786. [Google Scholar] [CrossRef]
- Kong, E.; Xia, Y. Variable selection for the single-index model. Biometrika 2007, 94, 217–229. [Google Scholar] [CrossRef]
- Alquier, P.; Biau, G. Sparse Single-Index Model. JMLR 2013, 14, 243–280. [Google Scholar]
- Putra, I.; Dana, I.M. Study of Optimal Portfolio Performance Comparison: Single Index Model and Markowitz Model on LQ45 Stocks in Indonesia Stock Exchange. Am. J. Humanit. Soc. Sci. Res. 2020, 3, 237–244. [Google Scholar]
- Pananjady, A.; Foster, D.P. Single-index models in the high signal regime. IEEE Trans. Inf. Theory 2021, 67, 4092–4124. [Google Scholar] [CrossRef]
- Ganti, R.S.; Balzano, L.; Willett, R. Matrix completion under monotonic single index models. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
- Catoni, O. Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning; Institute of Mathematical Statistics Lecture Notes—Monograph Series 56; Institute of Mathematical Statistics: Beachwood, OH, USA, 2007; Volume 5544465. [Google Scholar]
- Boucheron, S.; Lugosi, G.; Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence; Oxford University Press: Oxford, UK, 2013. [Google Scholar]
- Wallach, H.; Mimno, D.; McCallum, A. Rethinking LDA: Why priors matter. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 7–10 December 2009; Volume 22. [Google Scholar]
- Tsybakov, A.B. Introduction to Nonparametric Estimation; Springer Series in Statistics; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar] [CrossRef]
- Mai, T.T.; Alquier, P. Pseudo-Bayesian quantum tomography with rank-adaptation. J. Stat. Plan. Inference 2017, 184, 62–76. [Google Scholar] [CrossRef]
- Goldstein, S.; Lebowitz, J.L.; Tumulka, R.; Zanghî, N. Any orthonormal basis in high dimension is uniformly distributed over the sphere. Ann. L’Institut Henri Poincaré Probab. Stat. 2017, 53, 701–717. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).