On a Low-Rank Matrix Single-Index Model

The Tien Mai

doi:10.3390/math11092065

Department of Mathematical Sciences, Norwegian University of Science and Technology, 7034 Trondheim, Norway

Mathematics2023, 11(9), 2065;https://doi.org/10.3390/math11092065

This article belongs to the Special Issue New Advances in High-Dimensional and Non-asymptotic Statistics

Version Notes

Order Reprints

Abstract

In this paper, we conduct a theoretical examination of a low-rank matrix single-index model. This model has recently been introduced in the field of biostatistics, but its theoretical properties for jointly estimating the link function and the coefficient matrix have not yet been fully explored. In this paper, we make use of the PAC-Bayesian bounds technique to provide a thorough theoretical understanding of the joint estimation of the link function and the coefficient matrix. This allows us to gain a deeper insight into the properties of this model and its potential applications in different fields.

Keywords:

low-rank matrix; single-index model; PAC-Bayes bounds; optimal rate; oracle inequality

MSC:

62G05; 62C20

1. Introduction

In this study, we investigate a particular type of single-index model, where the response variable, denoted by Y, is a real number and the covariate matrix, represented by X, is a matrix of real numbers with dimensions of

d \times d

. The model is defined in Equation (1) as

Y = f^{⋆} (⟨X, B^{⋆}⟩) + ϵ

(1)

In this equation,

⟨X, B^{⋆}⟩ = trace (X^{⊤} B^{⋆})

represents the inner product between matrices X and

B^{⋆}

, where

B^{⋆}

is an unknown coefficient matrix with dimensions of

d \times d

. The link function

f^{⋆}

is an unknown univariate measurable function. The noise term, represented by

ϵ

, is assumed to have a mean of 0 and is independent of the covariate X.

In line with the recent research presented in [1,2], we make the assumption that the coefficient matrix

B^{⋆}

is a symmetric, low-rank matrix with

rank (B^{⋆}) < d

. Additionally, in order to ensure the uniqueness of the model, we impose the condition that the Frobenius norm of

B^{⋆}

is equal to 1, i.e.,

∥ B^{⋆} ∥_{F} = 1

.

Previous studies have been conducted on a similar model to the one presented in this paper, where the unknown coefficient matrix

B^{⋆}

is assumed to have sparse elements. In particular, the work of [1] in the field of biostatistics has been used to examine the correlation between a response variable and the functional connectivity associated with a certain brain region. Additionally, recent research by [2] has focused on developing methods for estimating the unknown low-rank matrix

B^{⋆}

by using implicit regularization techniques.

The model discussed in this paper can be thought of as a nonparametric version of the trace regression model that has been previously proposed in the literature, specifically in the works in [3,4,5]. This trace regression model utilizes the identity function as the link function, and encompasses a diverse array of statistical models, including but not limited to reduced rank regression, matrix completion, and linear regression.

The single-index model is a versatile extension of the linear model, which offers a natural interpretation. This model only changes in the direction of the parameter (vector/matrix), and the nature of this change is depicted by the link function

f^{⋆}

. This has been the subject of extensive research in the literature, with various studies exploring its applications and extensions in various fields. Examples of such works include [1,6,7,8,9,10,11,12,13,14]. These studies have demonstrated the versatility and utility of the single-index model in a wide range of contexts, making it a valuable tool for researchers in various fields.

Definition 1.

Let

S_{1}^{d}

denote the set of all symmetric matrix

B \in R^{d \times d}

such that

{∥ B ∥}_{F} = 1

.

Given the covariates

{X_{i}}_{i = 1}^{n}

, the response variables

{Y_{i}}_{i = 1}^{n}

are i.i.d. generated from model (1). We define the expected risk for any measurable

f : R \to R

and

B \in S_{1}^{d}

as

R (B, f) = E [(Y - f {(⟨X, B⟩)}^{2})]

and denote the empirical counterpart of

R (f, B)

by

r_{n} (B, f) = \frac{1}{n} \sum_{i = 1}^{n} (Y_{i} - f {(⟨X_{i}, B⟩)}^{2}) .

In this research, we examine the forecasting abilities of the model. More specifically, we consider a pair

(f, B)

to have comparable predictive performance to

(f^{⋆}, B^{⋆})

if the difference between

R (B, f)

and

R (B^{⋆}, f^{⋆})

is minimal.

Our approach in this work is built on the PAC-Bayesian bound technique, which is a powerful tool for obtaining oracle inequalities bounds [15]. Similar to Bayesian analysis, one important aspect of a PAC-Bayesian bound is specifying a prior distribution over the parameter space. In our approach, we adopt the prior distribution for the link function from the reference [11], while the prior distribution for the matrix parameter B is inspired by the eigen decomposition of the matrix. The specifics of our approach and the details of the prior distributions we chose are discussed in the next section. The use of the PAC-Bayesian bound technique in combination with carefully chosen prior distributions allows us to obtain reliable and accurate estimates of the unknown parameters in our model.

2. Main Result

2.1. Method

We make an additional assumption in our model (1) that

E [ϵ | X] = 0

, and the following conditional moment assumptions on the noise

ϵ

are assumed.

Assumption 1.

We assume that there exist two constants

σ > 0

and

L > 0

, such that for all integers

s \geq 2

,

{E [| ϵ |}^{s} | X] \leq \frac{s!}{2} σ^{2} L^{s - 2} .

Remark 1.

The assumption stated above implies that the noise term in our model follows a subexponential distribution. This class of distributions includes, for example, Gaussian noise or bounded noise, as discussed in [16]. In simpler terms, this means that the noise term in our model is characterized by a rate of decay that is slower than that of an exponential distribution. This assumption is critical for the application of our approach, as it allows us to obtain accurate and reliable estimates of the unknown parameters under a wide range of noise conditions. This is an important consideration, as the presence of noise can have a significant impact on the accuracy of the estimates obtained from our model. By assuming that the noise follows a subexponential distribution, we can be confident that our estimates are robust to the presence of noise.

In addition to the assumptions stated previously, it is also necessary to assume that the covariate matrix X is almost surely bounded by a constant. Additionally, the unknown link function

f^{⋆}

is also assumed to be bounded by some known positive constant. To make this more precise, we use the notation

{∥ X ∥}_{\infty}

to represent its supremum norm and

∥ f^{⋆} ∥_{\infty}

to denote its functional supremum norm over the interval

[- 1, 1]

. Based on these definitions, we make the following assumption:

Assumption 2.

We assume that

{∥ X ∥}_{\infty} \leq 1

a.s. and

\exists C \geq 1

, such that

∥ f^{⋆} ∥_{\infty} \leq C .

In order to present the technical proofs in the clearest and simplest manner, we did not attempt to find the best constant used in the proofs. Specifically, the condition that

C \geq 1

is just convenient for the proofs in nature, and it could be eliminated by using

max [C, 1]

in the proofs.

The link function

f^{⋆}

is approximately estimated through a given specific countable set of measurable functions (dictionary)

{φ_{k}}_{k = 1}^{\infty}

. For this purpose, the set of finite linear combinations of functions from the dictionary is utilized, and we denote this vector space by

F

. We assume that each element

φ_{k}

in the dictionary is defined on the interval

[- 1, 1]

and takes values within the range

[- 1, 1]

.

Assumption 3.

For the sake of simplicity, we assume that the basic functions are differentiable and there exists some constant

C_{ϕ} > 0

, such that

∥ φ_{k}^{'} ∥_{\infty} \leq k C_{ϕ} .

An example of such a collection of functions is the system of non-normalized trigonometric functions, where

φ_{1} (t) = 1, φ_{2 k} (t) = cos (π k t), φ_{2 k + 1} (t) = sin (π k t), k = 1, 2, \dots

satisfy this assumption. This assumption on the dictionary functions enables us to approximate the unknown link function

f^{⋆}

with a finite linear combination of these functions.

Our approach is inspired by the work of [11], where the authors explored the PAC-Bayesian approach in [15] for a sparse-vector single-index model. The method needs to first specify a distribution

π

on

S_{1}^{d} \times F

, similar to the prior distribution in Bayesian analysis. This prior distribution in our framework should enforce the characteristics of the underlying link function and the parameter matrix. In this work, we consider the following prior distribution:

d π (B, f) = d μ (B) d ν (f),

in other words, it means that the prior distribution of the index matrix and the prior distribution over the link functions are assumed to be independent.

In this study, the matrix B is treated as a symmetric matrix and can be expressed in its eigen-decomposition form

B = U Λ U^{⊤}

. The matrix U is an orthogonal matrix with

U U^{⊤} = U U^{- 1} = I_{d}

(identity matrix of dimension

d \times d

), and the diagonal matrix

Λ

holds the corresponding eigenvalues

λ_{1}, \dots, λ_{d}

. To enforce that

{∥ B ∥}_{F} = 1

, the sum of the squares of the eigenvalues

λ_{i}

must equal 1, as

{∥ B ∥}_{F} = \sqrt{trace (B^{2})}

and

trace (B^{2}) = \sum_{j = 1}^{d} λ_{i}^{2}

. Additionally, the requirement of low-rankness on B means that most of the eigenvalues

λ_{1}, \dots, λ_{d}

are close to zero, with only a few being significantly larger.

With the goal of obtaining an appropriate low-rank-promoting prior for B, we propose the following approach. We simulate an orthogonal matrix V and simulate

(γ_{1}, \dots, γ_{d})

from a Dirichlet distribution

D i r (α_{1}, \dots, α_{d})

. Put

B = V diag (γ_{1}^{1 / 2}, \dots, γ_{d}^{1 / 2}) V^{⊤} .

To obtain an approximate low-rank matrix, we take all parameters of the Dirichlet distribution to be very close to 0, for example, by setting

α_{1} = \dots = α_{d} = 1 / d

. It is worth noting that a typical drawing of the Dirichlet distribution leads to one of the

γ_{i}

s being close to 1 and the others being close to 0. For more detailed discussions on how to choose the parameters for the Dirichlet distribution, one can refer to [17].

Now, we present a prior distribution on

F

. We opted to use the prior introduced in [11]. With any integer M that

0 < M \leq n

, let us put

B_{M} (c_{Λ}) = \{(β_{1}, \dots, β_{M}) \in R^{M} : \sum_{s = 1}^{M} s | β_{s} | \leq c_{Λ} and β_{M} \neq 0\}, \forall c_{Λ} > 0 .

Now, we define

F_{M} (c_{Λ}) \subset F

the image of

B_{M} (c_{Λ})

by the function

\begin{matrix} G_{M} & : R^{M} & \to & F \\ (β_{1}, \dots, β_{M}) & \mapsto & \sum_{j = 1}^{M} β_{j} φ_{j} . \end{matrix}

Remark 2.

Corollary 1 (below) provides a discussion regarding the approximation of Sobolev spaces (see [18] by the set

F_{M} (c_{Λ})

), which become more accurate as M increases.

Now, a prior distribution

ν_{M} (d f)

is defined on the set

F_{M} (C + 1)

. This is performed by considering the image of the uniform measure on

B_{M} (C + 1)

obtained through the function

G_{M}

. We consider the following choice for the prior distribution

ν

on

F

d ν (f) = \frac{\sum_{M = 1}^{n} 10^{- M} ν_{M} (d f)}{1 - {(\frac{1}{10})}^{n}} .

(2)

The reason for choosing

C + 1

rather than C in the above definition of the prior distribution support is essentially for technical proof. This is to ensure that as soon as the underlying link function

f^{⋆}

belongs to

F_{n} (C)

, there then exists a small ball around it that is contained in

F_{n} (C + 1)

. One could safely replace it by

C + a_{n}

, where

{a_{n}}_{n = 1}^{\infty}

is any positive sequence vanishing sufficiently slowly as

n \to \infty

.

Remark 3.

The integer M can be viewed as a measure of the “dimension” of the function f—the larger the M, the more complex the function—and the prior ν adapts again to the sparsity idea by penalizing large-dimensional functions f. The coefficient

10^{- M}

, which appears in (2), shows that more complex models have a geometrically decreasing influence. Inspired from the practical results in [11], the value 10 is a random choice. This choice could be in general changed by another positive constant, but it requires more technical attention.

2.2. The Proposed Estimator

Definition 2.

The Gibbs posterior distribution over

S_{1}^{d} \times F_{n} (C + 1)

is defined as

{\hat{ρ}}_{λ} (B, f) = \frac{exp [- λ r_{n} (B, f)] d π (B, f)}{\int exp [- λ r_{n} (B, f)] d π (B, f)} .

Now, we define an estimator as follows. Let

λ > 0

be a tuning parameter, or sometime called the inverse temperature parameter. Let

({\hat{B}}_{λ}, {\hat{f}}_{λ})

be an estimator of

(B^{⋆}, f^{⋆})

. It is simply achieved by a random draw from

{\hat{ρ}}_{λ}

, the Gibbs posterior distribution above.

2.3. Theoretical Results

As

E [Y | X] = f^{⋆} (⟨X, B^{⋆}⟩)

almost surely, it is noted that for all

(B, f) \in S_{1}^{d} \times F_{n} (C + 1)

,

\begin{matrix} R (B, f) - R (B^{⋆}, f^{⋆}) & = E {[Y - f (⟨X, B⟩)]}^{2} - E {[Y - f^{⋆} (⟨X, B^{⋆}⟩)]}^{2} \\ = E {[f (⟨X, B⟩) - f^{⋆} (⟨X, B^{⋆}⟩)]}^{2} \end{matrix}

(Pythagoras theorem).

Definition 3.

For any positive integer

M \leq n

, we set

(B_{M}^{⋆}, f_{M}^{⋆}) \in arg min_{(B, f) \in S_{1}^{d} \times F_{M} (C)} R (B, f) .

Remark 4.

It is noted here that the infimum

f_{M}^{⋆}

is defined on

F_{M} (C)

for each value of M. However, the prior distribution is defined on a slightly larger set, that is,

F_{M} (C + 1)

.

Let us define

w : = 64 (C + 1) max [L, C + 1], C_{1} : = 8 [{(C + 1)}^{2} + σ^{2}] .

The theoretical results in this work mainly come from the following theorem, the proof of which is provided in Section 3. It should be noted that throughout the paper, the phrase “with probability

1 - δ

” refers to the probability calculated with respect to both the distribution

P^{\otimes n}

of the data and the conditional Gibbs distribution

{\hat{ρ}}_{λ}

.

Theorem 1.

Assume that Assumptions 1 and 2 hold, with

λ = \frac{n}{w + 2 C_{1}} .

(3)

We have that, for all

δ \in (0, 1)

, with a probability of at least

1 - δ

,

\begin{matrix} R ({\hat{B}}_{λ}, {\hat{f}}_{λ}) - R (B^{⋆}, f^{⋆}) \leq C inf_{1 \leq M \leq n} & {R (B_{M}^{⋆}, f_{M}^{⋆}) - R (B^{⋆}, f^{⋆}) + \\ \frac{log (n) (M + d rank (B^{⋆}) + d log (d)) + log (\frac{2}{δ})}{n}}, \end{matrix}

where

C > 0

is a constant depending only on

L, σ, C, C_{ϕ}

.

Remark 5.

As in practice, the value of w and

C_{1}

are not known, and the theoretical value of λ cannot be used. However, it provides a good order to tune this parameter, for example, using cross-validation.

Remark 6.

Theorem 1 can be interpreted in a straightforward manner. Essentially, it states that if there exists a “small” M and

rank (B^{⋆})

is small, such that the difference between

R (B_{M}^{⋆}, f_{M}^{⋆})

and

R (B^{⋆}, f^{⋆})

is minimal, then the difference between

R ({\hat{B}}_{λ}, {\hat{f}}_{λ})

and

R (B^{⋆}, f^{⋆})

will also be small in the order of

log (n) / n

. On the other hand, if neither of these conditions are met, then the rate

M log (n) / n

or

rank (B^{⋆}) d log (n) / n

(or either) will start to dominate, thus resulting in a decrease in the general quality of the convergence rate.

We can obtain a good convergence rate as soon as a low-rank assumption is considered. This is typically achievable when

B^{⋆}

is already low-rank or can be well approximated by a low-rank matrix. In the case that

f^{⋆}

is sufficiently regular, we can obtain a good approximation with a “small” M.

As shown in [11], when

f^{⋆}

belongs to a Sobolev space, we can derive a more specific nonparametric rate for the above theorem. For example, assume that

{φ_{k}}_{k = 1}^{\infty}

is the system of trigonometric functions and in addition that the link function

f^{⋆}

is in the following Sobolev ellipsoid space [18],

W (k, \frac{6 C^{2}}{π^{2}}) = \{f \in L_{2} ([- 1, 1]) : f = \sum_{j = 1}^{\infty} β_{j} φ_{j} and \sum_{j = 1}^{\infty} j^{2 k} β_{j}^{2} \leq \frac{6 C^{2}}{π^{2}}\}

where

k \geq 2

is an unknown regularity parameter. In this context, the approximation set

F_{M} (C + 1)

is in the following form:

\begin{matrix} F_{M} (C + 1) = \{f \in L_{2} ([- 1, 1]) : f = \sum_{s = 1}^{M} β_{s} φ_{s}, \sum_{s = 1}^{M} s | β_{s} | \leq C + 1 and β_{M} \neq 0\} . \end{matrix}

It should be noted that the results presented in this paper are in the so-called adaptive setting, where the regularity parameter k is not assumed to be known. However, in order to obtain these results, it is necessary to make an additional assumption.

Assumption 4.

We assume that the probability density of the random variable

⟨X, B^{⋆}⟩

is defined on

[- 1, 1]

, and it is upper-bounded by a constant

A > 0

.

Corollary 1.

Assume that Theorem 1 and additional Assumption 4 hold. Moreover, assume that

f^{⋆}

is in the Sobolev ellipsoid space

W (k, 6 C^{2} / π^{2})

, where the regularity parameter

k \geq 2

is unknown. The tuning parameter λ is as in (3). We have that for all

δ \in (0, 1)

with a probability of at least

1 - δ

,

\begin{matrix} R ({\hat{B}}_{λ}, {\hat{f}}_{λ}) - R (B^{⋆}, f^{⋆}) \leq \\ C^{'} \{{(\frac{log (n)}{n})}^{\frac{2 k}{2 k + 1}} + \frac{log (n) (d rank (B^{⋆}) + d log d) + log (\frac{2}{δ})}{n}\}, \end{matrix}

(4)

where

C^{'} > 0

is a constant depending only on L, C, σ,

C_{ϕ}, A

.

The proof for Corollary 1 follows a similar approach to that of Corollary 4 in [11], and thus, it is not included in this paper.

Remark 7.

From an asymptotic point of view, that d is fixed and

n \to \infty

, the leading rate on the right-hand side in the above Corollary is

{(log (n) / n)}^{\frac{2 k}{2 k + 1}}

. This is known to be the minimax rate of convergence up to a

log (n)

factor over a Sobolev class; see [18]. On the other hand, in a nonasymptotic setting where n is “small”, we obtain the estimation rate

rank (B^{⋆}) d log (n) / n

, which was also obtained by [2], and it is minimax optimal up to a logarithmic term, as in [3].

From Theorem 1, it is actually possible to derive that the Gibbs posterior

{\hat{ρ}}_{λ}

contracts around

(B^{⋆}, f^{⋆})

at the optimal rate.

Theorem 2.

Under the same assumptions for Theorem 1 and the same definition for λ, let

ε_{n}

be any sequence in

(0, 1)

, such that

ε_{n} \to 0

when

n \to \infty

. Define

\begin{matrix} E_{n} = {(B, f) \in S_{1}^{d} \times F_{n} (C + 1) : R (B, f) - R (B^{⋆}, f^{⋆}) \\ \leq C inf_{1 \leq M \leq n} {R (B_{M}^{⋆}, f_{M}^{⋆}) - R (B^{⋆}, f^{⋆}) + \\ \frac{log (n) (M + rank (B^{⋆}) d + d log d) + log (\frac{2}{ε_{n}})}{n}} . \end{matrix}

Then,

E [P_{(B, f) \sim {\hat{ρ}}_{λ}} ((B, f) \in E_{n})] \geq 1 - ε_{n} \underset{n \to \infty}{\to} 1 .

3. Proofs

For the sake of simplicity in the proofs, we put

R^{⋆} : = R (B^{⋆}, f^{⋆}), r_{n}^{⋆} : = r_{n} (B^{⋆}, f^{⋆}) .

We have that for each

f = \sum_{j = 1}^{M} β_{j} φ_{j} \in F_{M} (C + 1)

,

{∥ f ∥}_{\infty} \leq \sum_{j = 1}^{M} | β_{j} | \leq C + 1 .

The following lemma, Lemma 1, is a Bernstein-type inequality [16] that is useful for our proofs. We denote by

{(Z)}_{+}

the positive part of a random variable Z.

Lemma 1.

Let

Z_{1}, \dots, Z_{n}

be independent real-valued random variables. It is assumed that there exist two constants

v > 0, w > 0

that for all integers

r \geq 2

,

\sum_{s = 1}^{n} E [{(Z_{s})}_{+}^{r}] \leq \frac{r!}{2} v w^{r - 2} .

We have that with

ζ \in (0, 1 / w)

,

E e^{ζ \sum_{s = 1}^{n} (Z_{s} - E Z_{s})} \leq e^{\frac{v ζ^{2}}{2 (1 - w ζ)}} .

Let

(A, A)

be a measurable space and

γ_{1}

and

γ_{2}

be two probability measures on

(A, A)

. Denote by

K (γ_{1}, γ_{2})

the Kullback–Leibler divergence of

γ_{1}

with respect to

γ_{2}

. Lemma 2 is a classical result, and its proof can be found, for example, in [15], (page 4).

Lemma 2.

Let

(A, A)

be a measurable space. For any probability measure ν on

(A, A)

and any measurable function

g : A \to R

, such that

\int (exp \circ g) d ν < \infty

, we have

log \int (exp \circ g) d ν = sup_{κ} (\int g d κ - K (κ, ν)),

(5)

where κ is a probability measure on

(A, A)

and

\infty - \infty = - \infty

. In addition, when g is upper-bounded on the support of ν, the supremum in (5) is obtained by the Gibbs distribution g, given by

\frac{d ρ}{d ν} (a) = \frac{exp (g (a))}{\int (exp \circ g) d ν}, a \in A .

Lemma 3.

We assume that Assumption 1 is satisfied. Put

w = 16 (C + 1) max [L, 2 (C + 1)], C_{1} : = 8 [{(C + 1)}^{2} + σ^{2}]

and take

λ \in (0, \frac{n}{w + C_{1}})

and put

α = (λ - \frac{λ^{2} C_{1}}{2 n (1 - \frac{C_{2} λ}{n})}) and β = (λ + \frac{λ^{2} C_{1}}{2 n (1 - \frac{C_{2} λ}{n})}) .

(6)

With

δ \in (0, 1)

and any distribution

{\hat{ρ}}_{λ} ≪ π

, we have that

\begin{matrix} E \int exp [α (R (B, f) - R^{⋆}) + λ (- r_{n} (B, f) + r_{n}^{⋆}) - log (\frac{d {\hat{ρ}}_{λ}}{d π} (B, f)) - \\ log \frac{2}{δ}] d {\hat{ρ}}_{λ} (B, f) \leq δ / 2, \end{matrix}

(7)

\begin{matrix} E sup_{ρ} exp [β (- \int R (B, f) d ρ - R^{⋆}) + λ (\int r_{n} (B, f) d ρ - r_{n}^{⋆}) - \\ K (ρ, π) - log \frac{2}{δ}] \leq δ / 2, \end{matrix}

(8)

Proof.

Fix

B \in S_{1}^{d}

and

f \in F_{n} (C + 1)

. We start by using Lemma 1 with the following random variables:

T_{i} = - {(Y_{i} - f (⟨X_{i}, B⟩))}^{2} + {(Y_{i} - f^{⋆} (⟨X_{i}, B^{⋆}⟩))}^{2}, i = 1, \dots, n .

Note that

T_{i}, i = 1, \dots, n

are independent, and we have that

\begin{matrix} \sum_{i = 1}^{n} E T_{i}^{2} & = \sum_{i = 1}^{n} E \{{[2 Y_{i} - f (⟨X_{i}, B⟩) - f^{⋆} (⟨X_{i}, B^{⋆}⟩)]}^{2} {[f (⟨X_{i}, B⟩) - f^{⋆} (⟨X_{i}, B^{⋆}⟩)]}^{2}\} \\ = \sum_{i = 1}^{n} E \{{[2 ϵ_{i} + f^{⋆} (⟨X_{i}, B^{⋆}⟩) - f (⟨X_{i}, B⟩)]}^{2} {[f (⟨X_{i}, B⟩) - f^{⋆} (⟨X_{i}, B^{⋆}⟩)]}^{2}\} \\ \leq \sum_{i = 1}^{n} E \{[8 ϵ_{i}^{2} + 8 {(C + 1)}^{2}] {[f (⟨X_{i}, B⟩) - f^{⋆} (⟨X_{i}, B^{⋆}⟩)]}^{2}\} . \\ \leq 8 [{(C + 1)}^{2} + σ^{2}] \sum_{i = 1}^{n} E {[f (⟨X_{i}, B⟩) - f^{⋆} (⟨X_{i}, B^{⋆}⟩)]}^{2} : = v, \end{matrix}

where we set

C_{1} : = 8 [{(C + 1)}^{2} + σ^{2}];

and

v = n C_{1} [R (B, f) - R^{⋆}] .

Now, for all integers k greater than 3, we have that

\begin{matrix} \sum_{i = 1}^{n} E [{(T_{i})}_{+}^{k}] \\ \leq \sum_{i = 1}^{n} E \{| 2 Y_{i} - f (⟨X_{i}, B⟩) - f^{⋆} (⟨X_{i}, B^{⋆}⟩) |^{k} {| f (⟨X_{i}, B⟩) - f^{⋆} (⟨X_{i}, B^{⋆}⟩) |}^{k}\} \\ = \sum_{i = 1}^{n} E \{| 2 ϵ_{i} + f^{⋆} (⟨X_{i}, B^{⋆}⟩) - f (⟨X_{i}, B⟩) |^{k} {| f (⟨X_{i}, B⟩) - f^{⋆} (⟨X_{i}, B^{⋆}⟩) |}^{k}\} \\ \leq 2^{k - 1} \sum_{i = 1}^{n} E \{[2^{k} {| ϵ_{i} |}^{k} + 2^{k} {(C + 1)}^{k}] 2^{k - 2} {(C + 1)}^{k - 2} {| f (⟨X_{i}, B⟩) - f^{⋆} (⟨X_{i}, B^{⋆}⟩) |}^{2}\} . \end{matrix}

In the last inequality, we used the fact that

{| q + w |}^{k} \leq 2^{k - 1} {(| q |}^{k} + {| w |}^{k})

. We obtain that

\begin{matrix} \sum_{i = 1}^{n} E [{(T_{i})}_{+}^{k}] & \leq \sum_{i = 1}^{n} [2^{2 k - 2} k! σ^{2} L^{k - 2} + 2^{2 k - 1} {(C + 1)}^{k}] 2^{k - 2} {(C + 1)}^{k - 2} [R (B, f) - R^{⋆}] \\ = v \times \frac{[2^{2 k - 2} k! σ^{2} L^{k - 2} + 2^{2 k - 1} {(C + 1)}^{k}] 2^{k - 2} {(C + 1)}^{k - 2}}{[2 {(C + 1)}^{2} + 4 σ^{2}]} \\ \leq v \times \frac{k! 8^{k - 2} max [L^{k - 2}, 2^{k - 2} {(C + 1)}^{k - 2}] 2^{k - 2} {(C + 1)}^{k - 2}}{2} : = \frac{k!}{2} v w^{k - 2}, \end{matrix}

with

w = 64 (C + 1) max [L, C + 1]

.

Thus, for any

λ \in (0, n / w)

, taking

ζ = λ / n

, we apply Lemma 1 to obtain

\begin{matrix} E exp [λ (R (B, f) - R^{⋆} - r_{n} (B, f) + r_{n}^{⋆})] & \leq exp (\frac{v λ^{2}}{2 n^{2} (1 - \frac{w λ}{n})}) \\ = exp (\frac{C_{1} [R (B, f) - R^{⋆}] λ^{2}}{2 n (1 - \frac{w λ}{n})}) . \end{matrix}

Therefore, we obtain, with the

α

given in (6),

\begin{matrix} E e^{α (R (B, f) - R^{⋆}) + λ (- r_{n} (B, f) + r_{n}^{⋆}) - log (\frac{2}{δ})} \leq δ / 2 . \end{matrix}

Next, integrating with respect to

π

and consequently using Fubini’s theorem, we obtain

\begin{matrix} E \int exp [α (R (B, f) - R^{⋆}) + λ (- r_{n} (B, f) + r_{n}^{⋆}) - log (2 / δ)] d π (B, f) \leq δ / 2 . \end{matrix}

To obtain (7), it is noted that for any measurable function h,

\begin{matrix} \int exp [h (B, f)] d π = \int exp [h (B, f) - log \frac{d {\hat{ρ}}_{λ}}{d π} (B, f)] d {\hat{ρ}}_{λ} . \end{matrix}

The proof for (8) is similar. More precisely, we apply Lemma 1 with

T_{i} = {(Y_{i} - f (⟨X, B⟩))}^{2} - {(Y_{i} - f^{⋆} (⟨X, B^{⋆}⟩))}^{2}

. We obtain, for any

λ \in (0, n / w)

,

\begin{matrix} E exp [λ (r_{n} (B, f) + r_{n}^{⋆} - R (B, f) + R^{⋆})] \leq exp (\frac{v λ^{2}}{2 n^{2} (1 - \frac{w λ}{n})}) . \end{matrix}

By rearranging terms, using definition of

β

in (6), and multiplying both sides by

δ / 2

, we obtain

\begin{matrix} E exp [β (- R (B, f) + R^{⋆}) + λ (r_{n} (B, f) - r_{n}^{⋆}) - log \frac{2}{δ}] \leq δ / 2 . \end{matrix}

Integrating with respect to

π

and using Fubini’s theorem, we obtain

\begin{matrix} E \int exp [β (- R (B, f) + R^{⋆}) + λ (r_{n} (B, f) - r_{n}^{⋆}) - log \frac{2}{δ}] d π \leq δ / 2 . \end{matrix}

Now, Lemma 2 is applied to the integral, and this directly yields (8). □

Proof of Theorem 1.

Recall that

P^{\otimes n}

stands for the distribution of the sample

D_{n}

; the Equation (7) can be written conveniently as

\begin{matrix} E_{D_{n} \sim P^{\otimes n}} E_{(\hat{B}, \hat{f}) \sim {\hat{ρ}}_{λ}} exp [α (R (\hat{B}, \hat{f}) - R^{⋆}) + λ (- r_{n} (\hat{B}, \hat{f}) + r_{n}^{⋆}) - \\ log (\frac{d {\hat{ρ}}_{λ}}{d π} (\hat{B}, \hat{f})) - log \frac{2}{δ}] \leq δ / 2, \end{matrix}

Now, we use the standard Chernoff trick to transform an exponential moment inequality into a deviation inequality, i.e., using

exp (λ x) \geq 1_{R_{+}} (x)

. We obtain, with a probability of at least

1 - δ / 2

for any

δ \in (0, 1)

and any distribution

{\hat{ρ}}_{λ}

,

\begin{matrix} R (\hat{B}, \hat{f}) - R^{⋆} \leq \frac{λ}{α} (r_{n} (\hat{B}, \hat{f}) - r_{n}^{⋆} + \frac{log (\frac{d {\hat{ρ}}_{λ}}{d π} (\hat{B}, \hat{f})) + log (\frac{2}{δ})}{λ}) . \end{matrix}

It is noted that we have

\begin{matrix} log (\frac{d {\hat{ρ}}_{λ}}{d π} (\hat{B}, \hat{f})) & = log (\frac{exp (- λ r_{n} (\hat{B}, \hat{f}))}{\int exp (- λ r_{n} (B, f)) d π}) \\ = - λ r_{n} (\hat{B}, \hat{f}) - log \int e^{- λ r_{n} (B, f)} d π; \end{matrix}

thus, we obtain, with a probability larger than

1 - δ / 2

,

\begin{matrix} R (\hat{B}, \hat{f}) - R^{⋆} \leq \frac{1}{α} (log \int exp (- λ r_{n} (B, f)) d π - λ r_{n}^{⋆} + log (\frac{2}{δ})) . \end{matrix}

Now, using Lemma 2, it yields that with a probability larger than

1 - δ / 2

,

\begin{matrix} R (\hat{B}, \hat{f}) - R^{⋆} \leq \frac{λ}{α} (\int r_{n} (B, f) d {\hat{ρ}}_{λ} - r_{n}^{⋆} + \frac{K ({\hat{ρ}}_{λ}, π) + log (\frac{2}{δ})}{λ}) . \end{matrix}

(9)

Now, from (8) with an application of the standard Chernoff trick, we obtain, with a probability larger than

1 - δ / 2

for any

δ \in (0, 1)

and any distribution

{\hat{ρ}}_{λ} ≪ π

,

\begin{matrix} \int r_{n} (B, f) d {\hat{ρ}}_{λ} - r_{n}^{⋆} \leq \frac{β}{λ} (\int R (B, f) d {\hat{ρ}}_{λ} - R^{⋆}) + \frac{K ({\hat{ρ}}_{λ}, π) + log (\frac{2}{δ})}{λ} . \end{matrix}

(10)

Combining (9) and (10) with a union bound argument gives the bound, with a probability larger than

1 - δ

,

\begin{matrix} R (\hat{B}, \hat{f}) - R^{⋆} \leq inf_{ρ} \{\frac{β}{α} (\int R (B, f) d ρ - R^{⋆}) + 2 \frac{K (ρ, π) + log (\frac{2}{δ})}{α}\} . \end{matrix}

(11)

The final steps of the proof involve making the right-hand side of the inequality more explicit. To achieve this, we limit the infimum bound to a specific distribution. This allows us to have a more concrete understanding of the result and to explicitly obtain the error rate.

Put

B^{⋆} = U Λ U^{⊤}

and let

r = # {i : Λ_{i} > ε},

with small

ε \in (0, 1)

. Take

d ρ_{η}^{1} \propto 1 (\forall i : | v_{i} - Λ_{i} | \leq ε; \forall i = 1, \dots, r : ∥ u_{i} - U_{i} ∥_{F} \leq η) π (d u, d v)

For any positive integer

M \leq n

and any

η, γ \in (0, 1 / n)

, let the probability measure

ρ_{M, η, γ}

be defined by

d ρ_{M, η, γ} (B, f) = d ρ_{η}^{1} (B) d ρ_{M, γ}^{2} (f),

with

ρ_{M, γ}^{2} (f) \propto 1_{[∥ f - f_{M}^{⋆} ∥_{M} \leq γ]} ν_{M} (f) .

We denote for

f = \sum_{s = 1}^{M} β_{s} φ_{s} \in F_{M} (C + 1)

,

{∥ f ∥}_{M} = \sum_{j = 1}^{M} j | β_{j} | .

Inequality (11) leads to

\begin{matrix} R (\hat{B}, \hat{f}) - R^{⋆} \leq inf_{1 \leq M \leq n} inf_{η, γ > 0} {\frac{β}{α} (\int R (B, f) d ρ_{M, η, γ} (B, f) - R^{⋆}) + \\ 2 \frac{K (ρ_{M, η, γ}, π) + log (\frac{2}{δ})}{α}} . \end{matrix}

(12)

To finish the proof, we have to control the different terms in (12). Note first that

\begin{matrix} K (ρ_{M, η, γ}, π) & = K (ρ_{η}^{1} \otimes ρ_{M, γ}^{2}, μ \otimes ν_{M}) \\ = K (ρ_{η}^{1}, μ) + K (ρ_{M, γ}^{2}, ν_{M}) + log \frac{1 - {(1 / 10)}^{n}}{10^{- M}} . \end{matrix}

By technical Lemma 4, we know that

K (ρ_{η}^{1}, μ) \leq r d log (16 / η) + C_{D_{1}} d log d (1 + log (2 / ε)) .

Additionally, by technical Lemma 10 in [11], we have that

K (ρ_{M, γ}^{2}, ν_{M}) = M log (\frac{C + 1}{γ}) .

Bringing together all the parts, it arrives at

K (ρ_{M, η, γ}, π) \leq r d log (1 / c) + C_{D_{1}} d log d (1 + log (2 / δ)) + M log (\frac{C + 1}{γ}) + log \frac{1}{10^{- M}} .

(13)

Finally, it remains to control the term

\int R (B, f) d ρ_{M, η, γ} (B, f) .

To this aim, we write

\begin{matrix} \int R (B, f) d ρ_{M, η, γ} (B, f) \\ = \int E [{(Y - f (⟨X, B⟩))}^{2}] d ρ_{M, η, γ} (B, f) \\ = \int E [(Y - f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) + f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B_{M}^{⋆}⟩) + \\ f (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B⟩))^{2}] d ρ_{M, η, γ} (B, f) \\ = R (B_{M}^{⋆}, f_{M}^{⋆}) + \int E [{(f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B_{M}^{⋆}⟩))}^{2} + {(f (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B⟩))}^{2} \\ + 2 (Y - f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩)) (f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B_{M}^{⋆}⟩)) + \\ 2 (Y - f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩)) (f (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B⟩)) \\ + 2 (f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B_{M}^{⋆}⟩)) (f (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B⟩))] d ρ_{M, η, γ} (B, f) \\ : = R (B_{M}^{⋆}, f_{M}^{⋆}) + A + B + C + D + E . \end{matrix}

Computation of C by Fubini’s theorem:

\begin{matrix} C \\ = & E [\int 2 (Y - f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩)) (f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B_{M}^{⋆}⟩)) d ρ_{M, η, γ} (B, f)] \\ = & E \{\int [2 (Y - f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩)) \int (f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B_{M}^{⋆}⟩)) d ρ_{M, γ}^{2} (f)] d ρ_{η}^{1} (B)\} . \end{matrix}

Using the triangle inequality, we obtain that for

f = \sum_{s = 1}^{M} β_{s} φ_{s}

and

f_{M}^{⋆} = \sum_{s = 1}^{M} {(β_{M}^{⋆})}_{s} φ_{s}

,

\sum_{j = 1}^{M} j | β_{j} | \leq \sum_{j = 1}^{M} j | β_{j} - {(β_{M}^{⋆})}_{j} | + \sum_{j = 1}^{M} j | {(β_{M}^{⋆})}_{j} | .

Since

f_{M}^{⋆} \in F_{M} (C)

, and thus

\sum_{s = 1}^{M} s | {(β_{M}^{⋆})}_{s} | \leq C,

as a consequence,

\sum_{s = 1}^{M} s | β_{s} | \leq C + 1

as soon as

∥ f - f_{M}^{⋆} ∥_{M} \leq 1

. This shows that the set

\{f = \sum_{j = 1}^{M} β_{j} φ_{j} : {∥ f - f_{M}^{⋆} ∥}_{M} \leq γ\}

is contained in the support of

ν_{M}

. In particular, this implies that

ρ_{M, γ}^{2}

is centered at

f_{M}^{⋆}

and, consequently,

\int (f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B_{M}^{⋆}⟩)) d ρ_{M, γ}^{2} (f) = 0 .

This proves that

C = 0

.

Control of A: Clearly,

A \leq \int sup_{y \in R} {((f_{M}^{⋆} (y) - f (y))}^{2} d ρ_{M, γ}^{2} (f) \leq γ^{2} .

Control of B: We have

\begin{matrix} B = \int E [{(f (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B⟩))}^{2}] d ρ_{M, η, γ} (B, f) \\ \leq \int E [{(C_{ϕ} (C + 1) (B_{M}^{⋆} - B) X)}^{2}] d ρ_{η}^{1} (B) (using the mean value theorem) \\ \leq C_{ϕ}^{2} {(C + 1)}^{2} E [{∥ X ∥}_{\infty}^{2}] \int {∥ B_{M}^{⋆} - B ∥}_{F}^{2} d ρ_{η}^{1} (B) (by Assumption 4) . \end{matrix}

Using Lemma 6 from [19], we have that

\int ∥ B_{M}^{⋆} {- B ∥}_{F}^{2} d ρ_{η}^{1} (B) \leq {(3 d c + 2 r η)}^{2} .

Thus,

B \leq C_{ϕ}^{2} {(C + 1)}^{2} {(3 d c + 2 r η)}^{2} .

Control of E: We have that

\begin{matrix} | E | & \leq 2 \int E [| f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B_{M}^{⋆}⟩) | | f (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B⟩) |] d ρ_{M, η, γ} (B, f) \\ \leq 2 \int E [| f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B_{M}^{⋆}⟩) | C_{ϕ} (C + 1) | (B_{M}^{⋆} - B) X |] d ρ_{M, η, γ} (B, f) \\ \leq 2 {(\int E {(f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B_{M}^{⋆}⟩))}^{2} d ρ_{M, η, γ} (B, f))}^{\frac{1}{2}} \\ {(\int E {(C_{ϕ} (C + 1) (B_{M}^{⋆} - B) X)}^{2} d ρ_{M, η, γ} (B, f))}^{\frac{1}{2}} \\ \leq 2 {(γ^{2})}^{\frac{1}{2}} {(C_{ϕ}^{2} {(C + 1)}^{2} {(3 d c + 2 r η)}^{2})}^{\frac{1}{2}} = 2 C_{ϕ} (C + 1) γ (3 d ε + 2 r η) . \end{matrix}

Control of D: Finally,

\begin{matrix} D & = 2 \int E [(Y - f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩)) (f (⟨X, B_{M}^{⋆}⟩) - f (⟨X, B⟩))] d ρ_{M, η, γ} (B, f) \\ = 2 \int E [(Y - f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩)) (f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f_{M}^{⋆} (⟨X, B⟩))] d ρ_{η}^{1} (B) \\ (since \int f d ρ_{M, γ}^{2} (f) = f_{M}^{⋆}) \\ = 2 E [(Y - f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩)) \int (f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f_{M}^{⋆} (⟨X, B⟩)) d ρ_{η}^{1} (B)] \\ \leq 2 \sqrt{E [{(Y - f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩))}^{2}]} \sqrt{E {[\int (f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f_{M}^{⋆} (⟨X, B⟩)) d ρ_{η}^{1} (B)]}^{2}} \\ = 2 \sqrt{R (B_{M}^{⋆}, f_{M}^{⋆})} \sqrt{E {[\int (f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f_{M}^{⋆} (⟨X, B⟩)) d ρ_{η}^{1} (B)]}^{2}} . \end{matrix}

As we have that

\begin{matrix} | f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f_{M}^{⋆} (⟨X, B⟩) | \leq C_{ϕ} (C + 1) | ⟨(B_{M}^{⋆} - B) X⟩ | \leq C_{ϕ} (C + 1) {∥ B_{M}^{⋆} - B ∥}_{F}, \end{matrix}

it leads to

\begin{matrix} {[\int (f_{M}^{⋆} (⟨X, B_{M}^{⋆}⟩) - f_{M}^{⋆} (⟨X, B⟩)) d ρ_{η}^{1} (B)]}^{2} & \leq C_{ϕ}^{2} {(C + 1)}^{2} {[\int ∥ B_{M}^{⋆} {- B ∥}_{F} d ρ_{η}^{1} (B)]}^{2} \\ \leq C_{ϕ}^{2} {(C + 1)}^{2} {(3 d c + 2 r η)}^{2}, \end{matrix}

and therefore,

\begin{matrix} D \leq 2 C_{ϕ} (C + 1) (3 d c + 2 r η) \sqrt{R (0, 0) / 2} \leq \sqrt{2} C_{ϕ} (C + 1) (3 d ε + 2 r η) \sqrt{C^{2} + σ^{2}} . \end{matrix}

Thus, taking

η = γ = ε = 1 / n

and assembling all the components, we obtain that

A + B + C + D + E \leq \frac{C_{1}}{n},

where

C_{1}

is a positive constant function of C,

σ

, and

C_{ϕ}

. Combining this inequality with (12) and (13) yields, with a probability larger than

1 - δ

,

\begin{matrix} R ({\hat{B}}_{λ}, {\hat{f}}_{λ}) - R^{⋆} & \leq inf_{1 \leq M \leq n} {\frac{β}{α} (R (B_{M}^{⋆}, f_{M}^{⋆}) - R^{⋆} + \frac{C_{1}}{n}) \\ + 2 \frac{M log ((C + 1) 10 n) + r d log (16 n) + C_{D_{1}} d log d log (2 n e) + log (\frac{2}{δ})}{λ}} . \end{matrix}

Finally, choosing

λ = \frac{n}{w + 2 C_{1}},

it yields that there exists a constant

C_{2} > 0

depending only on

L, σ, C, C_{ϕ}

with a probability of at least

1 - δ

, such that

\begin{matrix} R ({\hat{B}}_{λ}, {\hat{f}}_{λ}) - R^{⋆} & \leq C_{2} inf_{1 \leq M \leq n} {R (B_{M}^{⋆}, f_{M}^{⋆}) - R^{⋆} + \\ \frac{M log (10 C n) + r d log (16 n) + C_{3} d log d log (2 n e) + log (\frac{2}{δ})}{n}} . \end{matrix}

This concludes the proof of Theorem 1. □

Lemma 4.

Let

r = # {i : Λ_{i} > ε}

with small

ε \in [0, 1)

. Take

d ρ_{η}^{1} \propto 1 (\forall i : | v_{i} - Λ_{i} | \leq ε; \forall i = 1, \dots, r : ∥ u_{i} - U_{i} ∥_{F} \leq η) μ (d u, d v)

Then,

K (ρ_{η}^{1}, μ) \leq r d log (16 / η) + C_{3} d log d log (2 e / ε)

where

C_{3}

is a universal constant.

Proof.

We have that

\begin{matrix} K (ρ_{η}^{1}, μ) = & log \frac{1}{μ ({u, v : \forall i : | v_{i} - Λ_{i} | \leq ε; \forall i = 1, r : ∥ u_{i} - U_{i} ∥_{F} \leq η})} \\ = & log \frac{1}{μ (\{\forall i = 1, r : ∥ u_{i .} - U_{i .} ∥_{F} \leq η\})} + log \frac{1}{μ ({\forall i : | v_{i} - Λ_{i} | \leq ε})} . \end{matrix}

The first log term

\begin{matrix} π (\{\forall i = 1, r : ∥ u_{i .} - U_{i .} ∥_{F} \leq η\}) & \geq \prod_{i = 1}^{r} [\frac{π^{(d - 1) / 2} {(η / 2)}^{d - 1}}{Γ (\frac{d - 1}{2} + 1)} / \frac{2 π^{(d + 1) / 2}}{Γ (\frac{d + 1}{2})}] \\ \geq {[\frac{η^{d - 1}}{2^{d} π}]}^{r} \geq \frac{η^{r (d - 1)}}{2^{4 r d}} . \end{matrix}

Note the following for the above calculation: firstly, the distribution of the orthogonal vector is approximated by the uniform distribution on the sphere [20], and secondly, the probability is greater or equal to the volume of the (d-1)-“circle” with radius

c / 2

over the surface area of the d-“unit sphere”.

It is noted that if

γ \sim B e t a (a, b)

(beta distribution), then

γ^{1 / 2}

has the pdf as

f (γ) = 2 \frac{γ^{2 a - 1} {(1 - γ^{2})}^{b - 1}}{B e (a, b)}, 0 < γ < 1

where

B e (a, b)

is the beta function. The second log term in the Kullback–Leibler term with

a = α_{i}, b = \sum_{i = 1}^{d} α_{i} - α_{i}, α_{i} = 1 / d

is

\begin{matrix} π ({\forall i : | v_{i} - Λ_{i} | \leq ε}) & = \prod_{i = 1}^{d} \int_{max (Λ_{i} - ε, 0)}^{min (Λ_{i} + ε, 1)} \frac{v_{i}^{2 a - 1} {(1 - v_{i}^{2})}^{b - 1}}{2 B e (a, b)} d v_{i} \\ \geq \prod_{i = 1}^{d} \int_{0}^{ε} \frac{v_{i}^{2 a - 1} {(1 - v_{i}^{2})}^{b - 1}}{2 B e (a, b)} d v_{i} \geq C_{3} {(ε / 2 d)}^{d} e^{- d log d} . \end{matrix}

The interval of integration contains at least an interval of length

ε

. Thus, we obtain

\begin{matrix} K (ρ_{η}^{1}, μ) \leq log \frac{2^{4 r d}}{η^{r (d - 1)}} + log (\frac{{(2 d)}^{d} e^{d log d}}{C_{3} ε^{d}}) \leq r d log (\frac{16}{η}) + C_{3} d log d log (\frac{e 2}{ε}) \end{matrix}

for some absolute numerical constant

C_{3}

that does not depend on

r, n

or d. □

Proof of Theorem 2.

We also apply Lemma 3, and focus on (7), applied to

δ : = ε_{n}

, that is

\begin{matrix} E \int exp [α (R (B, f) - R^{⋆}) + λ (- r_{n} (B, f) + r_{n}^{⋆}) - log (\frac{d {\hat{ρ}}_{λ}}{d π} (B, f)) - \\ log \frac{2}{ε_{n}}] d {\hat{ρ}}_{λ} (B, f) \leq ε_{n} / 2 \end{matrix}

Using Chernoff’s inequality, this leads to

E [P_{(B, f) \sim {\hat{ρ}}_{λ}} ((B, f) \in A_{n})] \geq 1 - \frac{ε_{n}}{2}

where

A_{n} = \{(B, f) : α (R (B, f) - R^{⋆}) + λ (- r_{n} (B, f) + r_{n}^{⋆}) \leq log [\frac{d {\hat{ρ}}_{λ}}{d π} (B, f)] + log \frac{2}{ε_{n}}\} .

From the definition of

{\hat{ρ}}_{λ}

, for

(B, f) \in A_{n}

, we obtain that

\begin{matrix} α (R (B, f) - R^{⋆}) & \leq λ (r_{n} (B, f) - r_{n}^{⋆}) + log [\frac{d {\hat{ρ}}_{λ}}{d π} (B, f)] + log \frac{2}{ε_{n}} \\ \leq - log \int exp [- λ r_{n} (B, f)] π (d (B, f)) - λ r_{n}^{⋆} + log \frac{2}{ε_{n}} \\ = λ (\int r_{n} (B, f) {\hat{ρ}}_{λ} (d (B, f)) - r_{n}^{⋆}) + K ({\hat{ρ}}_{λ}, π) + log \frac{2}{ε_{n}} \\ = inf_{ρ} \{λ (\int r_{n} (B, f) ρ (d (B, f)) - r_{n}^{⋆}) + K (ρ, π) + log \frac{2}{ε_{n}}\} . \end{matrix}

Now, put

B_{n} : = \{\forall ρ, β (- \int R (B, f) d ρ + R^{⋆}) + λ (\int r_{n} d ρ - r_{n}^{⋆}) \leq K (ρ, π) + log \frac{2}{ε_{n}}\} .

Using (8), we have that

E [1_{B_{n}}] \geq 1 - \frac{ε_{n}}{2} .

We now prove that if

λ

is such that

α > 0

,

E [P_{(B, f) \sim {\hat{ρ}}_{λ}} ((B, f) \in E_{n})] \geq E [P_{(B, f) \sim {\hat{ρ}}_{λ}} ((B, f) \in A_{n}) 1_{B_{n}}]

and, together with,

\begin{matrix} E [P_{(B, f) \sim {\hat{ρ}}_{λ}} ((B, f) \in A_{n}) 1_{B_{n}}] & = E [(1 - P_{(B, f) \sim {\hat{ρ}}_{λ}} ((B, f) \notin A_{n})) (1 - 1_{B_{n}^{c}})] \\ \geq E [1 - P_{(B, f) \sim {\hat{ρ}}_{λ}} ((B, f) \notin A_{n}) - 1_{B_{n}^{c}}] \\ \geq 1 - ε_{n} \end{matrix}

leads to

E [P_{(B, f) \sim {\hat{ρ}}_{λ}} ((B, f) \in E_{n})] \geq 1 - ε_{n} .

To obtain that, assume that we are on the set

B_{n}

, and let

(B, f) \in A_{n}

. Then,

\begin{matrix} α (R (B, f) - R^{⋆}) & \leq inf_{ρ} \{λ (\int r_{n} (B, f) ρ (d (B, f)) - r_{n}^{⋆}) + K (ρ, π) + log \frac{2}{ε_{n}}\} \\ \leq inf_{ρ} \{β (\int R (B, f) ρ (d (B, f)) - R^{⋆}) + 2 K (ρ, π) + 2 log \frac{2}{ε_{n}}\} \end{matrix}

that is,

R (B, f) - R^{⋆} \leq inf_{ρ} \frac{β [\int R d ρ - R^{⋆}] + 2 [K (ρ, π) + log \frac{2}{ε}]}{α}

We upper-bound the right-hand side similarly as in the proof of Theorem 1, which leads to

(B, f) \in E_{n}

. □

4. Conclusions

In this paper, we conduct a theoretical study of a low-rank matrix single-index model. The model is used to estimate the link function and the coefficient matrix jointly. We leverage the PAC-Bayesian bounds technique to gain a deeper insight into the properties of this model and its potential applications. The study extends previous work in the field by considering a low-rank matrix, rather than a sparse vector, as the coefficient matrix. We also provide a detailed explanation of the choice of prior distributions for the link function and the coefficient matrix, which allows to obtain accurate and reliable estimates of the unknown parameters. Overall, this study provides a thorough theoretical understanding of the low-rank matrix single-index model.

The focus of future research would center on executing the proposed approach. There are various possible avenues to explore. One of the promising approaches is to use the reversible jump Markov chain Monte Carlo method, which was successfully applied in the past to address the sparse vector single-index model, as documented in [11].

Funding

This research was funded by Norwegian Research Council grant number 309960 through the Centre for Geophysical Forecasting at NTNU.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

The author is grateful to two anonymous reviewers for their expert analysis and helpful suggestions.

Conflicts of Interest

The author declares no conflict of interest.

References

Weaver, C.; Xiao, L.; Lindquist, M.A. Single-index models with functional connectivity network predictors. Biostatistics 2021, 24, 52–67. [Google Scholar] [CrossRef] [PubMed]
Fan, J.; Yang, Z.; Yu, M. Understanding Implicit Regularization in Over-Parameterized Single Index Model. J. Am. Stat. Assoc. 2022, 1–14. [Google Scholar] [CrossRef]
Rohde, A.; Tsybakov, A.B. Estimation of high-dimensional low-rank matrices. Ann. Stat. 2011, 39, 887–930. [Google Scholar] [CrossRef]
Koltchinskii, V.; Lounici, K.; Tsybakov, A.B. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat. 2011, 39, 2302–2329. [Google Scholar] [CrossRef]
Zhao, J.; Niu, L.; Zhan, S. Trace regression model with simultaneously low rank and row (column) sparse parameter. Comput. Stat. Data Anal. 2017, 116, 1–18. [Google Scholar] [CrossRef]
Nelder, J.A.; Wedderburn, R.W. Generalized linear models. J. R. Stat. Soc. Ser. A Gen. 1972, 135, 370–384. [Google Scholar] [CrossRef]
Hardle, W.; Hall, P.; Ichimura, H. Optimal smoothing in single-index models. Ann. Stat. 1993, 21, 157–178. [Google Scholar] [CrossRef]
Ichimura, H. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. J. Econom. 1993, 58, 71–120. [Google Scholar] [CrossRef]
Jiang, B.; Liu, J.S. Variable selection for general index models via sliced inverse regression. Ann. Stat. 2014, 42, 1751–1786. [Google Scholar] [CrossRef]
Kong, E.; Xia, Y. Variable selection for the single-index model. Biometrika 2007, 94, 217–229. [Google Scholar] [CrossRef]
Alquier, P.; Biau, G. Sparse Single-Index Model. JMLR 2013, 14, 243–280. [Google Scholar]
Putra, I.; Dana, I.M. Study of Optimal Portfolio Performance Comparison: Single Index Model and Markowitz Model on LQ45 Stocks in Indonesia Stock Exchange. Am. J. Humanit. Soc. Sci. Res. 2020, 3, 237–244. [Google Scholar]
Pananjady, A.; Foster, D.P. Single-index models in the high signal regime. IEEE Trans. Inf. Theory 2021, 67, 4092–4124. [Google Scholar] [CrossRef]
Ganti, R.S.; Balzano, L.; Willett, R. Matrix completion under monotonic single index models. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
Catoni, O. Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning; Institute of Mathematical Statistics Lecture Notes—Monograph Series 56; Institute of Mathematical Statistics: Beachwood, OH, USA, 2007; Volume 5544465. [Google Scholar]
Boucheron, S.; Lugosi, G.; Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence; Oxford University Press: Oxford, UK, 2013. [Google Scholar]
Wallach, H.; Mimno, D.; McCallum, A. Rethinking LDA: Why priors matter. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 7–10 December 2009; Volume 22. [Google Scholar]
Tsybakov, A.B. Introduction to Nonparametric Estimation; Springer Series in Statistics; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar] [CrossRef]
Mai, T.T.; Alquier, P. Pseudo-Bayesian quantum tomography with rank-adaptation. J. Stat. Plan. Inference 2017, 184, 62–76. [Google Scholar] [CrossRef]
Goldstein, S.; Lebowitz, J.L.; Tumulka, R.; Zanghî, N. Any orthonormal basis in high dimension is uniformly distributed over the sphere. Ann. L’Institut Henri Poincaré Probab. Stat. 2017, 53, 701–717. [Google Scholar] [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

On a Low-Rank Matrix Single-Index Model

Abstract

1. Introduction

2. Main Result

2.1. Method

2.2. The Proposed Estimator

2.3. Theoretical Results

3. Proofs

4. Conclusions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics