Heterogeneous Overdispersed Count Data Regressions via Double-Penalized Estimations

Li, Shaomin; Wei, Haoyu; Lei, Xiaoyu

doi:10.3390/math10101700


by Shaomin Li ¹, Haoyu Wei ²,* and Xiaoyu Lei ³

¹ Center for Statistics and Data Science, Beijing Normal University, Zhuhai 516087, China

² Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA

³ Department of Statistics, University of Chicago, Chicago, IL 60637, USA

* Author to whom correspondence should be addressed.

Mathematics 2022, 10(10), 1700; https://doi.org/10.3390/math10101700

Submission received: 21 April 2022 / Revised: 6 May 2022 / Accepted: 11 May 2022 / Published: 16 May 2022

(This article belongs to the Special Issue New Advances in High-Dimensional and Non-asymptotic Statistics)


Abstract: Recently, high-dimensional negative binomial regression (NBR) for count data has been widely used in many scientific fields. However, most studies assume that the dispersion parameter is constant, which may not hold in practice. This paper studies variable selection and dispersion estimation for heterogeneous NBR models, which model the dispersion parameter as a function of covariates. Specifically, we propose a double regression and apply a double $ℓ_{1}$-penalty to both regressions. Under the restricted eigenvalue conditions, we prove, for the first time, oracle inequalities for the lasso estimators of the two partial regression coefficients, using concentration inequalities for empirical processes. Furthermore, the consistency and convergence rates of the estimators, derived from the oracle inequalities, provide the theoretical guarantees for further statistical inference. Finally, both simulations and a real data analysis demonstrate that the new methods are effective.

Keywords:

negative binomial regressions; heterogeneous count data regression; estimation of dispersion parameter; oracle inequalities

MSC:

62E17; 62E20; 62F07

1. Introduction

In many scientific fields, such as biomedical science, ecology, and economics, experimental and observational studies often yield count data, a type of data in which the observations can take only non-negative integer values. Poisson regression models are commonly used for count data. However, they require the restrictive assumption that the variance equals the mean. For many count data, the variance is larger than the mean [1], which is called overdispersion. Because the Poisson regression model is invalid under overdispersion, a more general and flexible regression model, the negative binomial regression, has attracted considerable research attention and become popular in analyzing count data [2,3,4].

With the advance of modern data collection techniques, high-dimensional data are becoming increasingly common in scientific studies. Widely used estimators for high-dimensional parameters include the lasso [5], the SCAD [6], the elastic net [7], the adaptive lasso [8], and so on. Recently, there has been much research on the high-dimensional NBR model, such as [9,10,11,12,13,14]. All of these works assumed that the dispersion parameter is constant. In practice, however, not all models satisfy this assumption. If the dispersion parameter is wrongly assumed to be constant, the estimation of the mean regression performs poorly, as shown in the simulation in Section 4.1; hence the need to model the dispersion parameter as a function of some covariates. The heterogeneous negative binomial regression (HNBR) extends the NBR by observation-specific parameterization of the dispersion parameter [3]. The HNBR is a valuable tool for assessing the source of overdispersion. It belongs to the double-generalized linear models (DGLMs) or vector-generalized linear models (VGLMs), which are very useful in fitting more complex and potentially realistic models [15,16,17,18]. However, it appears that there is no study on selecting the explanatory variables for the dispersion in the HNBR model.

In this paper, we study variable selection and dispersion estimation for heterogeneous NBR models. To the best of our knowledge and based on the literature, this study is the first to do so. Specifically, we propose a double regression to estimate the coefficients of the NB dispersion and the NBR simultaneously. Because of the high dimension of the covariates, we apply a double $ℓ_{1}$ penalty to both regressions. The two tuning parameters we set are different because the first-order conditions for estimating the regression coefficients are entirely different from those for estimating the dispersion parameters. We construct an algorithm to perform variable selection and dispersion estimation simultaneously. Similar studies on high-dimensional NBR models include [19], which assumed that the dispersion parameter is constant. Their method requires an iterative algorithm that estimates the mean regression and the dispersion alternately and runs a lasso in each iteration. If many iterations are needed, such an algorithm wastes computing resources.

The rest of the paper is organized as follows. Section 2 introduces the heterogeneous overdispersed count data model and defines the double $ℓ_{1}$-penalized estimators for the mean and dispersion regressions. Then we use a technique called the stochastic Lipschitz condition to derive the asymptotic results in Section 3. Simulation studies and a real data application are given in Section 4. Finally, Section 5 concludes the article with a discussion. All proofs and technical details are provided in Appendix A.

2. Double $ℓ_{1}$-Penalized NBR

2.1. Heterogeneous Overdispersed Count Data Regressions

Suppose we have n count responses $Y_{i}$ and p-dimensional covariates $X_{i} = (x_{i 1}, \dots, x_{i p})$, $i \in [n] := \{1, 2, \dots, n\}$. For the Poisson regression models, the response obeys the Poisson distribution

P (Y_{i} = y_{i} | λ_{i}) = \frac{λ_{i}^{y_{i}}}{y_{i}!} e^{- λ_{i}}, i \in [n],

with $λ_{i} = E (Y_{i})$. We require that the positive parameter $λ_{i}$ is related to a linear combination of the p covariates. A plausible assumption for the link function is $η (λ_{i}) = log (λ_{i}) = X_{i}^{⊤} β$. It is worth noting that

E (Y_{i} | X_{i}) = var (Y_{i} | X_{i}) = exp (X_{i}^{⊤} β) > 0 .

The traditional negative binomial regression assumes that the count response follows the NB distribution with overdispersion:

P (Y_{i} = y_{i} | X_{i}) =: f (y_{i}; k, μ_{i}) = \frac{Γ (k + y_{i})}{Γ (k) y_{i}!} {(\frac{μ_{i}}{k + μ_{i}})}^{y_{i}} {(\frac{k}{k + μ_{i}})}^{k}, i \in [n],

(1)

with $E (Y_{i} | X_{i}) = μ_{i} = exp (β^{⊤} X_{i})$, where k is an unknown parameter quantifying the overdispersion level. When $k \to \infty$, we have $var (Y_{i} | X_{i}) = μ_{i} + \frac{μ_{i}^{2}}{k} \to μ_{i} = E (Y_{i} | X_{i})$, which recovers the Poisson regression for the mean parameter $μ_{i}$. Thus, the Poisson regression is a limiting case of the negative binomial regression when the dispersion parameter k tends to infinity.
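The variance formula and the Poisson limit can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names, the values `mu = 3` and `k = 2`, and the truncation point of the sums are illustrative assumptions and do not come from the paper.

```python
import math

def nb_logpmf(y, mu, k):
    """Log-probability of Equation (1): NB with mean mu and dispersion k."""
    return (math.lgamma(k + y) - math.lgamma(k) - math.lgamma(y + 1)
            + y * math.log(mu / (k + mu)) + k * math.log(k / (k + mu)))

def poisson_logpmf(y, lam):
    """Log-probability of the Poisson distribution with mean lam."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def nb_mean_var(mu, k, y_max=2000):
    """Mean and variance by direct summation over 0..y_max (the truncation is an assumption)."""
    probs = [math.exp(nb_logpmf(y, mu, k)) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

mu, k = 3.0, 2.0                      # illustrative values, not from the paper
mean, var = nb_mean_var(mu, k)
print(mean, var, mu + mu ** 2 / k)    # the variance should match mu + mu^2 / k

# For a very large k, the NB probabilities are close to the Poisson ones.
poisson_gap = max(
    abs(math.exp(nb_logpmf(y, mu, 1e6)) - math.exp(poisson_logpmf(y, mu)))
    for y in range(30)
)
print(poisson_gap)
```

Working on the log scale with `math.lgamma` avoids overflow of the gamma function for large counts.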

In the heterogeneous negative binomial regression, k is given an observation-specific parameterization, i.e., $k = k (X_{i})$. More specifically, we assume in this paper that

μ (x) = exp {θ^{(1) ⊤} x}, k (x) = exp {θ^{(2) ⊤} x} .

For notation simplicity, we denote

P f : = E f (X_{i}, Y_{i}), P_{n} f : = \frac{1}{n} \sum_{i = 1}^{n} f (X_{i}, Y_{i}), G_{n} f : = \sqrt{n} (P_{n} - P) f,

for any measurable and integrable function f.

Let $θ = {(θ^{(1) ⊤}, θ^{(2) ⊤})}^{⊤} \in R^{2 p}$. The log-likelihood is

\begin{matrix} n ℓ (θ) & = log \prod_{i = 1}^{n} f (y_{i}; k_{i}, μ_{i}) = \sum_{i = 1}^{n} log \{\frac{Γ (k_{i} + Y_{i})}{Γ (k_{i}) Y_{i}!} {(\frac{μ_{i}}{k_{i} + μ_{i}})}^{Y_{i}} {(\frac{k_{i}}{k_{i} + μ_{i}})}^{k_{i}}\} \\ & = \sum_{i = 1}^{n} [- log \frac{Γ (exp {X_{i}^{⊤} θ^{(2)}})}{Γ (Y_{i} + exp {X_{i}^{⊤} θ^{(2)}})} + Y_{i} X_{i}^{⊤} (θ^{(1)} - θ^{(2)}) \\ & \quad - [Y_{i} + exp {X_{i}^{⊤} θ^{(2)}}] log (1 + exp {X_{i}^{⊤} (θ^{(1)} - θ^{(2)})}) - log Y_{i}!] . \end{matrix}

We use the negative log-likelihood as the loss function

γ

, and define

\begin{matrix} γ (θ) & : = - log f (y | x, θ) + log y! . \end{matrix}

Denote $\partial_{j} := \frac{\partial}{\partial θ^{(j)}}$, $j = 1, 2$. The score function for $θ^{(1)}$ is

\partial_{1} ℓ (θ) = - P_{n} \partial_{1} γ (θ) = \frac{1}{n} \sum_{i = 1}^{n} (Y_{i} - e^{X_{i}^{⊤} θ^{(1)}}) \frac{e^{X_{i}^{⊤} θ^{(2)}} X_{i}}{e^{X_{i}^{⊤} θ^{(1)}} + e^{X_{i}^{⊤} θ^{(2)}}} .

Furthermore, for fixed $θ^{(1)}$, the score function for $θ^{(2)}$ is

\partial_{2} ℓ (θ) = - P_{n} \partial_{2} γ (θ) = - \frac{1}{n} \sum_{i = 1}^{n} \{[log (1 + e^{X_{i}^{⊤} (θ^{(1)} - θ^{(2)})}) - \sum_{j = 0}^{Y_{i} - 1} \frac{1}{j + e^{X_{i}^{⊤} θ^{(2)}}}] + \frac{Y_{i} - e^{X_{i}^{⊤} θ^{(1)}}}{e^{X_{i}^{⊤} θ^{(1)}} + e^{X_{i}^{⊤} θ^{(2)}}}\} e^{X_{i}^{⊤} θ^{(2)}} X_{i} .

It is easy to verify that both scores have mean zero at the true parameter value, which we denote by $θ^{*}$ from now on:

P \partial_{1} ℓ (θ^{*}) = P \partial_{2} ℓ (θ^{*}) = 0 .
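The derivatives of the loss with respect to the two linear predictors, and the fact that they have mean zero under the model, can be verified numerically. The sketch below writes the loss in terms of $s_{1} = θ^{(1) ⊤} x$ and $s_{2} = θ^{(2) ⊤} x$, as in Lemma A1; the function names, the test point $(0.5, 0.3)$, and the truncation of the expectation at 400 are assumptions made for illustration.

```python
import math

def loss(s1, s2, y):
    """Loss gamma(s, y): negative NB log-likelihood without the log y! term,
    with mu = exp(s1) and k = exp(s2)."""
    k = math.exp(s2)
    return (math.lgamma(k) - math.lgamma(y + k) - y * (s1 - s2)
            + (y + k) * math.log1p(math.exp(s1 - s2)))

def grad(s1, s2, y):
    """Analytic derivatives of the loss with respect to s1 and s2."""
    mu, k = math.exp(s1), math.exp(s2)
    d1 = -(y - mu) * k / (mu + k)
    harmonic = sum(1.0 / (j + k) for j in range(y))   # equals psi(y + k) - psi(k)
    d2 = k * (math.log1p(mu / k) - harmonic + (y - mu) / (mu + k))
    return d1, d2

def numeric_grad(s1, s2, y, h=1e-6):
    """Central finite differences, used only as a cross-check."""
    d1 = (loss(s1 + h, s2, y) - loss(s1 - h, s2, y)) / (2 * h)
    d2 = (loss(s1, s2 + h, y) - loss(s1, s2 - h, y)) / (2 * h)
    return d1, d2

s1, s2 = 0.5, 0.3   # illustrative linear predictors, not from the paper

# Largest discrepancy between analytic and numeric derivatives over y = 0..14.
fd_error = max(
    abs(a - b)
    for y in range(15)
    for a, b in zip(grad(s1, s2, y), numeric_grad(s1, s2, y))
)

def pmf(y):
    """NB probability recovered from the loss: f = exp(-gamma - log y!)."""
    return math.exp(-loss(s1, s2, y) - math.lgamma(y + 1))

# Model expectations of the two derivatives (sums truncated at 400, an assumption).
mean_d1 = sum(pmf(y) * grad(s1, s2, y)[0] for y in range(400))
mean_d2 = sum(pmf(y) * grad(s1, s2, y)[1] for y in range(400))
print(fd_error, mean_d1, mean_d2)
```

Multiplying each derivative by the covariate vector gives the contribution of one observation to the corresponding score.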

2.2. Heterogeneous Overdispersed NBR via Double $ℓ_{1}$ Penalty

The weighted lasso estimator under our circumstance is defined as

{\hat{θ}}_{n} = \underset{θ \in Θ}{argmin} (P_{n} γ (θ) + λ {∥ θ ∥}_{ω, 1}),

(2)

where $λ > 0$ is the tuning parameter and the weighted norm is defined by

λ {∥ θ ∥}_{ω, 1} = λ_{1} {∥ θ^{(1)} ∥}_{1} + λ_{2} {∥ θ^{(2)} ∥}_{1} = λ (ω_{1} {∥ θ^{(1)} ∥}_{1} + ω_{2} {∥ θ^{(2)} ∥}_{1}),

where $ω = {(ω_{1}, ω_{2})}^{⊤} = {(λ_{1} / λ, λ_{2} / λ)}^{⊤} \in [0, 1] \times [0, 1]$ is the weight and ${∥ \cdot ∥}_{1}$ denotes the $ℓ_{1}$-norm. This technique is also used in [20]. Equation (2) is a weighted double $ℓ_{1}$-penalized problem, which is a kind of convex penalty optimization; when $λ_{1} = λ_{2}$, it becomes a single-penalized problem. In this paper, we use different $λ_{1}$ and $λ_{2}$, as the first-order conditions for estimating the regression coefficients are entirely different from those for estimating the dispersion parameters, and take $λ = λ_{1} \lor λ_{2}$.

Because the weighted group lasso estimator ${\hat{θ}}_{n}$ has no closed-form solution, we need to use iterative methods such as quasi-Newton or coordinate descent methods. We use BIC to choose the tuning parameters $λ_{1}$ and $λ_{2}$:

BIC (λ_{1}, λ_{2}) = - 2 ℓ ({\hat{θ}}_{n}) + \frac{log n}{n} k,

where k is the number of nonzero estimated coefficients. To illustrate the algorithm explicitly, we rewrite $γ (θ)$ as $γ (θ^{(1) ⊤} x, θ^{(2) ⊤} x)$ and define $θ^{(3)} = (λ_{2} / λ_{1}) θ^{(2)}$, $θ^{†} = {(θ^{(1) ⊤}, θ^{(3) ⊤})}^{⊤}$. Converting $θ^{(2)}$ into $θ^{(3)}$ turns the double $ℓ_{1}$-penalized problem into a single-penalized one, which can be solved with R packages such as “lbfgs”. The algorithm is formally given in Algorithm 1.
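The reparameterization can be checked on a toy example. The sketch below evaluates the double-penalized objective at $(θ^{(1)}, θ^{(2)})$ and the single-penalized objective at $(θ^{(1)}, θ^{(3)})$ with the rescaled covariates $x^{*} = (λ_{1} / λ_{2}) x$ in the dispersion part, and confirms that the two agree. The data, parameter values, and function names are made up for illustration; this is not the authors' implementation and it does not solve the optimization problem.

```python
import math

def loss(s1, s2, y):
    """Loss gamma(s1, s2): negative NB log-likelihood without the log y! term."""
    k = math.exp(s2)
    return (math.lgamma(k) - math.lgamma(y + k) - y * (s1 - s2)
            + (y + k) * math.log1p(math.exp(s1 - s2)))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def l1(v):
    return sum(abs(t) for t in v)

def double_objective(theta1, theta2, X, Y, lam1, lam2):
    """P_n gamma + lam1 * ||theta1||_1 + lam2 * ||theta2||_1."""
    fit = sum(loss(dot(theta1, x), dot(theta2, x), y) for x, y in zip(X, Y)) / len(Y)
    return fit + lam1 * l1(theta1) + lam2 * l1(theta2)

def single_objective(theta1, theta3, X, Y, lam1, lam2):
    """Same fit with x* = (lam1 / lam2) x in the dispersion part and one penalty level lam1."""
    scale = lam1 / lam2
    fit = sum(
        loss(dot(theta1, x), dot(theta3, [scale * xi for xi in x]), y)
        for x, y in zip(X, Y)
    ) / len(Y)
    return fit + lam1 * (l1(theta1) + l1(theta3))

# Toy data and parameters (illustrative only).
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 0.2]]
Y = [2, 0, 5]
theta1, theta2 = [0.3, -0.2], [0.1, 0.4]
lam1, lam2 = 0.05, 0.02

theta3 = [lam2 / lam1 * t for t in theta2]            # theta^(3) = (lam2 / lam1) theta^(2)
obj_double = double_objective(theta1, theta2, X, Y, lam1, lam2)
obj_single = single_objective(theta1, theta3, X, Y, lam1, lam2)
theta2_back = [lam1 / lam2 * t for t in theta3]       # inverting the definition recovers theta^(2)
print(obj_double, obj_single)
```

Because the two objectives coincide for every parameter value, a minimizer of the single-penalized problem maps back to a minimizer of the double-penalized one by multiplying the dispersion block by $λ_{1} / λ_{2}$.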

Algorithm 1 Double $ℓ_{1}$-Penalized Optimization

Input: the set of tuning parameters $Λ = \{(λ_{1, i}, λ_{2, i})\}_{i = 1}^{m}$
Output: the estimate ${\hat{θ}}_{n}$
for $i = 1, \dots, m$, do
    let $x^{*} = \frac{λ_{1, i}}{λ_{2, i}} x$;
    solve ${\hat{θ}}^{†} = {({\hat{θ}}^{(1) ⊤}, {\hat{θ}}^{(3) ⊤})}^{⊤} = {argmin}_{θ^{†} \in Θ} (P_{n} γ (θ^{(1) ⊤} x, θ^{(3) ⊤} x^{*}) + λ_{1, i} {∥ θ^{†} ∥}_{1})$;
    obtain the estimate ${\hat{θ}}_{n, i} = {({\hat{θ}}^{(1) ⊤}, \frac{λ_{1, i}}{λ_{2, i}} {\hat{θ}}^{(3) ⊤})}^{⊤}$;
    compute $BIC (λ_{1, i}, λ_{2, i}) = - 2 ℓ ({\hat{θ}}_{n, i}) + \frac{log n}{n} k_{i}$;
end for
find $i_{opt} = {argmin}_{i = 1, \dots, m} BIC (λ_{1, i}, λ_{2, i})$;
return ${\hat{θ}}_{n, i_{opt}}$
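The selection step at the end of Algorithm 1 can be sketched as follows. The example assumes that the penalized fits have already been computed by some solver; the candidate fits, their log-likelihood values, and the function names are invented for illustration. Here `loglik` stands for $ℓ ({\hat{θ}}_{n, i})$, the log-likelihood divided by n as defined in Section 2.1.

```python
import math

def bic(loglik, theta, n, tol=1e-8):
    """BIC = -2 * loglik + (log n / n) * k, with k the number of nonzero coefficients."""
    k = sum(1 for t in theta if abs(t) > tol)
    return -2.0 * loglik + math.log(n) / n * k

def select_by_bic(fits, n):
    """Return the index of the fit with the smallest BIC and all BIC values."""
    scores = [bic(f["loglik"], f["theta"], n) for f in fits]
    i_opt = min(range(len(fits)), key=scores.__getitem__)
    return i_opt, scores

# Hypothetical output of the inner loop for three pairs (lam1, lam2); the numbers are made up.
n = 100
fits = [
    {"lam": (0.01, 0.01), "loglik": -1.90, "theta": [1.0, 1.9, -0.9, 0.1, -0.9, 0.4, 0.8, 0.05]},
    {"lam": (0.05, 0.05), "loglik": -1.91, "theta": [0.9, 1.8, -0.8, 0.0, -0.8, 0.3, 0.7, 0.0]},
    {"lam": (0.50, 0.50), "loglik": -2.40, "theta": [0.2, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]},
]
i_opt, scores = select_by_bic(fits, n)
print(i_opt, fits[i_opt]["lam"], scores)
```

In this made-up grid the middle fit wins: it loses little likelihood relative to the densest fit while using two fewer nonzero coefficients.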

The proposed algorithm can perform variable selection and dispersion estimation simultaneously. Similar studies on high-dimensional NBR models include [19], which assumed that the dispersion parameter is constant. However, their method requires an iterative algorithm that estimates the mean regression and the dispersion alternately and runs a lasso in each iteration. If many iterations are needed, such an algorithm wastes computing resources.

3. Main Results

3.1. Stochastic Lipschitz Conditions

We write the maximum of $Y_{i}$ in a sample of size n as $M_{y, n}$, i.e., $M_{y, n} = {max}_{i \in [n]} Y_{i}$; then the sample space for $\{Y_{i}\}_{i = 1}^{n}$ is $Y := \{y \in N, y \leq M_{y, n}\}$. Note that ${lim}_{n \to \infty} P (M_{y, n} = \infty) = 1$; what we need to tackle is actually an unbounded empirical process. However, for $z := (\begin{matrix} x^{⊤} & 0 \\ 0 & x^{⊤} \end{matrix}) \in R^{2 \times 2 p}$, we can assume that the value space $S$ for $s := z θ$ is bounded and satisfies

S := \{s = {(s_{1}, s_{2})}^{⊤} \in R^{2}, - \infty < m_{s, n} \leq s_{j} \leq | s_{j} | \leq M_{s, n} < \infty, j = 1, 2\} .

As we can see, the most significant difference between this article and the conventional literature on lasso estimators is that we use $s = z θ$ rather than $θ$ as the argument when analyzing the properties of the loss function $γ$. This is not the traditional approach. At first glance, this combination may complicate the subsequent analysis, because the KKT condition, which is critical for traditional convex penalized problems, is stated in terms of $\frac{\partial}{\partial θ} γ$. However, this article takes a different approach, the stochastic Lipschitz conditions introduced in the event $A$ of Proposition 1 in [14], to solve the $ℓ_{1}$-penalization problem. Define the local stochastic Lipschitz constant by

Lip (f; θ^{*}) := sup_{θ \in Θ \setminus \{θ^{*}\}} |\frac{\sqrt{n} G_{n} (f (θ) - f (θ^{*}))}{{∥ θ - θ^{*} ∥}_{1}}| .

The most apparent advantage of the stochastic Lipschitz conditions over the KKT condition is that they can easily handle several parameters that enter the model in different places but are subject to the same penalty, which is why we do not need to derive the KKT condition in this paper.

To establish the stochastic Lipschitz conditions for this unbounded counting process, another assumption, called strong midpoint log-convexity with some positive constant $γ$, should be satisfied. It states that, for the sample $Y := {(Y_{1}, \dots, Y_{n})}^{⊤} \in Z^{n}$ of n independent NB responses, the negative log joint density $ψ (y) := - log p_{Y} (y)$ satisfies

ψ (x) + ψ (y) - ψ (⌈\frac{1}{2} x + \frac{1}{2} y⌉) - ψ (⌊\frac{1}{2} x + \frac{1}{2} y⌋) \geq \frac{γ}{4} {∥ x - y ∥}_{2}^{2}, \forall x, y \in Z^{n} .

This assumption ensures that the suprema of the multiplier empirical processes of n independent responses exhibit sub-exponential concentration, which can alternatively be checked by the tail inequality for the suprema of empirical processes corresponding to classes of unbounded functions [21].

Theorem 1.

Suppose ${max}_{i \in [n], 1 \leq k \leq p} | X_{i k} | \leq M_{x} < \infty$, the parameter space Θ is convex, and its diameter $D_{Θ} < \infty$. If $\{Y_{i}\}_{i = 1}^{n}$ and $\{Z_{i} θ\}_{i \in [n], θ \in Θ}$ lie in the value spaces $Y$ and $S$ defined previously, respectively, then for any $θ \in Θ$,

\begin{matrix} Lip (γ; θ^{*}) & = sup_{θ \in Θ \setminus \{θ^{*}\}} |\frac{\sqrt{n} G_{n} (γ (θ) - γ (θ^{*}))}{{∥ θ - θ^{*} ∥}_{1}}| \\ & \leq \sqrt{n} M_{q} := (A_{1} \sqrt{log (2 p / q_{2})} + A_{2} \sqrt{log p} + A_{3} \sqrt{log (p / q_{3})}) \sqrt{max_{1 \leq k \leq p} \sum_{i = 1}^{n} X_{i k}^{2}} \\ & \quad + B \sqrt{log (2 p / q_{1})} \sqrt{{(max_{1 \leq k \leq p} \sum_{i = 1}^{n} X_{i k}^{4})}^{1 / 2}} \lor C log (2 p / q_{1}) + D log (p / q_{3}), \end{matrix}

with probability at least $1 - q_{0}$, where $q_{1}, q_{2}, q_{3} \in (0, 1)$ satisfy $q_{1} + q_{2} + q_{3} = q_{0}$, and the constants are as follows:

\begin{matrix} A_{1} = \sqrt{2} F_{1}, A_{2} = 32 \sqrt{2} M_{x} F_{2} D_{Θ}, A_{3} = \sqrt{2} (2 (F_{1} + M_{y, n}) \lor F_{2} M_{x} D_{Θ}), \\ B = 6 \sqrt{2 (w^{(1)} \lor w^{(2)}) {(\sum_{i = 1}^{n} a {(μ_{i}, k_{i})}^{4})}^{1 / 2}}, C = 12 M_{x} (w^{(1)} \lor w^{(2)}) max_{1 \leq i \leq n} a (μ_{i}, k_{i}), \\ D = 8 (2 (F_{1} + M_{y, n}) \lor F_{2} M_{x} D_{Θ}) M_{x}, w^{(1)} = \frac{e^{M_{s, n}}}{e^{m_{s, n}} + e^{M_{s, n}}}, \\ w^{(2)} = \frac{e + e^{M_{s, n} - m_{s, n}}}{1 + e^{m_{s, n} - M_{s, n}}} + \frac{1}{1 + e^{m_{s, n} - M_{s, n}}}, \end{matrix}

where $M_{y, n} = {max}_{i \in [n]} Y_{i}$ is the supremum of the empirical process.

It is worth noting that $M_{y, n}$ in Theorem 1 is random; hence, the bound above is not deterministic. Fortunately, $M_{y, n}$ can be bounded using the strong midpoint log-convexity condition, as stated in Lemma A3. Combining Theorem 1 with Lemma A3 gives the following result.

Theorem 2.

Assume the conditions are the same as those in Theorem 1. Then the stochastic Lipschitz constant has a nonrandom upper bound:

\begin{matrix} Lip (γ; θ^{*}) & \leq \sqrt{n} M_{q}^{'} := (A_{1} \sqrt{log (2 p / q_{2})} + A_{2} \sqrt{log p} + 2 A_{3}^{'} (log (2 n / q_{4}) + \sqrt{log (n p / q_{3})})) \sqrt{max_{1 \leq k \leq p} \sum_{i = 1}^{n} X_{i k}^{2}} \\ & \quad + B \sqrt{log (2 p / q_{1})} \sqrt{{(max_{1 \leq k \leq p} \sum_{i = 1}^{n} X_{i k}^{4})}^{1 / 2}} \lor C log (2 p / q_{1}) + D log (p / q_{3}), \end{matrix}

with probability at least $1 - q_{0}$, where $q_{1}, q_{2}, q_{3}, q_{4} \in (0, 1)$ satisfy $q_{1} + q_{2} + q_{3} + q_{4} = q_{0}$, and

A_{3}^{'} = 2 \sqrt{2} (F_{1} + (2 γ max_{i \in [n]} [a (μ_{i}, k_{i}) - \frac{μ_{i}}{log 2}]) \lor F_{2} M_{x} D_{Θ}) .

Theorem 1 gives us a view of the loss function that goes well beyond what the KKT conditions provide. However, the stochastic Lipschitz condition above does not compare the estimated and true values directly. We can resolve this issue by using an eigenvalue condition on the design matrix consisting of $X_{i}$. Because the design matrix X is fixed, the eigenvalue condition in the next section is reasonable. It is worth noting that this inequality is an oracle inequality because it involves an unknown empirical process on the right side.

3.2. $ℓ_{2}$-Estimation Error Oracle Inequalities RE Conditions

As we said previously, although we use stochastic Lipschitz conditions instead of KKT conditions, the restricted eigenvalue conditions (RE conditions) are still required. We denote by $v_{J}$ the vector in $R^{p}$ with the same coordinates as v on J and zero coordinates on the complement $J^{c}$ of J, and $spt (v) = \{j : v_{j} \neq 0\}$. We assume that the minimum in (2) is always attained in the following setting, although the minimizer may not be unique. In general, to bound $\hat{θ} - θ^{*}$ in terms of the $ℓ_{2}$ norm, some conditions on the design matrix $X \in R^{n \times p}$ are needed. Here, we utilize the restricted eigenvalue condition introduced in [22], which says that for some $1 \leq s \leq p$ and $K > 0$,

κ (s, K) = min \{\frac{{∥ X v ∥}_{2}}{\sqrt{n} {∥ v_{J} ∥}_{2}} : 1 \leq | J | \leq s, v \in R^{p} \setminus \{0\}, {∥ v_{J^{c}} ∥}_{1} \leq K {∥ v_{J} ∥}_{1}\} > 0 .

(3)

It should be noted that omitting the weight $ω$ and the sparse restricted set ${∥ v_{J^{c}} ∥}_{1} \leq K {∥ v_{J} ∥}_{1}$ leads to $v^{⊤} [\frac{1}{n} X^{⊤} X] v / v^{⊤} v \geq κ^{2} (s, K)$. This would mean that the smallest eigenvalue of the sample covariance matrix $\frac{1}{n} X^{⊤} X$ is positive, which is impossible when $p > n$ because $\frac{1}{n} X^{⊤} X$ is then not of full rank. To avoid this problem, ref. [22] considers the restricted eigenvalue condition under the sparse restricted set ${∥ v_{J^{c}} ∥}_{1} \leq K {∥ v_{J} ∥}_{1}$, which is a key requirement in sparse high-dimensional estimation. The restricted eigenvalue condition stems from restricted strong convexity, which enforces a strong convexity condition on the negative log-likelihood function of linear models over a certain sparse restricted set.
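A toy example may help to see the role of the restricted set. The sketch below evaluates the ratio appearing in (3) for a hand-made design with $n = 2 < p = 3$ and shows that the null vector of X satisfies the cone constraint for an index set of size two but not for index sets of size one when $K = 1.5$. The matrix, the vectors, and the function names are assumptions chosen for illustration only.

```python
import math

def in_cone(v, J, K):
    """Check the restricted-set constraint ||v_{J^c}||_1 <= K * ||v_J||_1."""
    on = sum(abs(v[j]) for j in J)
    off = sum(abs(vj) for j, vj in enumerate(v) if j not in J)
    return off <= K * on

def re_ratio(X, v, J):
    """Ratio ||X v||_2 / (sqrt(n) * ||v_J||_2) from the definition of kappa(s, K)."""
    n = len(X)
    Xv = [sum(xij * vj for xij, vj in zip(row, v)) for row in X]
    num = math.sqrt(sum(t * t for t in Xv))
    den = math.sqrt(n) * math.sqrt(sum(v[j] ** 2 for j in J))
    return num / den

# Toy design with n = 2 rows and p = 3 columns, so X^T X / n cannot have full rank.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
v_null = [1.0, 1.0, -1.0]          # X v_null = 0
K = 1.5

singletons_excluded = all(not in_cone(v_null, {j}, K) for j in range(3))
print(singletons_excluded)                     # the null direction is outside every |J| = 1 cone
print(in_cone(v_null, {0, 1}, K))              # but inside the cone for J = {0, 1}
print(re_ratio(X, v_null, {0, 1}))             # where the ratio vanishes
print(re_ratio(X, [1.0, 0.0, 0.0], {0}))       # a sparse direction gives a positive ratio
```

In this toy design the only direction annihilated by X is excluded when s = 1, whereas allowing s = 2 lets it into the restricted set, so the ratio can reach zero there.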

Due to the double penalty, besides the RE condition, we also require another condition similar to it, the so-called l-restricted isometry constant defined in [23]:

σ_{X, l}^{2} = max \{{∥ X v ∥}_{2}^{2} / {∥ v ∥}_{2}^{2} : v \in R^{p}, 1 \leq spt (v) \leq l\} \in (0, \infty),

which essentially requires that the eigenvalues of the sample covariance matrix restricted to vectors with cardinality at most l (l should be no more than n) behave approximately as in the low-dimensional case.

With the RE condition, the l-restricted isometry constant, and the two theorems established above, the lasso estimator in (2) enjoys good consistency properties.

Lemma 1

(see Lemma 3.1 in [23]). Suppose $T_{0}$ is a set of cardinality S. For a vector $h \in R^{p}$, let $T_{1}$ be the positions of the S largest entries (in absolute value) of h outside of $T_{0}$. Put $T_{01} = T_{0} \cup T_{1}$; then

{∥ h ∥}_{2}^{2} \leq {∥ h_{T_{01}} ∥}_{2}^{2} + S^{- 1} {∥ h_{T_{0}^{c}} ∥}_{1}^{2} .
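The inequality of Lemma 1 can be checked on random vectors. The sketch below is an illustration only: the dimension 30, the cardinality 4, the number of trials, and the function name are arbitrary assumptions.

```python
import random

def lemma_gap(h, T0):
    """Right-hand side minus left-hand side of the inequality in Lemma 1."""
    S = len(T0)
    # Indices outside T0, ordered by decreasing absolute value of h.
    outside = sorted((j for j in range(len(h)) if j not in T0),
                     key=lambda j: abs(h[j]), reverse=True)
    T01 = set(T0) | set(outside[:S])          # T0 together with the S largest outside positions
    lhs = sum(x * x for x in h)
    rhs = sum(h[j] ** 2 for j in T01) + sum(abs(h[j]) for j in outside) ** 2 / S
    return rhs - lhs

rng = random.Random(0)
worst_gap = min(
    lemma_gap([rng.gauss(0.0, 1.0) for _ in range(30)], set(rng.sample(range(30), 4)))
    for _ in range(1000)
)
print(worst_gap)   # should be nonnegative if the inequality holds in every trial
```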

Theorem 3.

Suppose the condition is the same as that in Theorem A1. Furthermore, assume

p_{1} = spt (θ^{(* 1)}) \lor spt (θ^{(* 2)}) \leq p / 2

, and there exists some

K > 1

,

κ : = κ (2 p_{1}, K) > 0

. Let

λ = \frac{(K + 1) M_{q}}{n (K - 1)}

, then using this λ in (2), with probability at least

1 - q

,

{∥ \hat{θ} - θ^{*} ∥}_{2}^{2} \leq \frac{8 p_{1} M_{q}^{2'} K^{2}}{κ^{4} n^{2} C_{γ}^{2} {(K - 1)}^{2}} [2 + K^{2} + \frac{2 (1 + 2 p_{1} K^{2}) (n κ^{2} + 2 σ_{X, p_{1}}^{2})}{n κ^{2}}],

where

M_{q}

,

C_{γ}

are defined in Theorems 1 and A1, respectively.

Remark 1.

Compared to the single lasso problem, in which we only have one unknown vectorized parameter, the oracle inequality in Theorem 3 has an extra term

\frac{2 (1 + 2 p_{1} K^{2}) (n κ^{2} + 2 σ_{X, p_{1}}^{2})}{n κ^{2}}

.

Remark 2.

From Theorem 3, we know that the

ℓ_{2}

convergence rate is minimax optimal, as studied in [14].

Remark 3.

In this study, we use the lasso estimators of two partial regression coefficients because it is one of the most popular techniques for high-dimensional data. It is worth mentioning that the algorithms and theoretical results could be similarly generalized to other shrinkage estimators, such as the elastic net [7], the adaptive lasso [8], and so on.

4. Numerical Studies

4.1. Simulations

In this section, we evaluate the finite sample performance of the proposed method. The response is generated from the negative binomial regression model (1) with

μ (x) = exp {θ^{(1) ⊤} x}, and k (x) = exp {θ^{(2) ⊤} x},

where $θ^{(1)}$ and $θ^{(2)}$ are two p-dimensional parameters. The explanatory variables are generated from the multivariate normal distribution with mean vector 0 and $Cov (x_{i}, x_{j}) = ρ^{| i - j |}$, where $ρ = 0, 0.5$. The following two examples show the performance of the proposed estimator for the low-dimensional heterogeneous negative binomial regression and the variable selection in the high-dimensional case, respectively. The R package “lbfgs” is required to solve the optimization problem.
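A data-generating step matching this design can be sketched as follows. This is not the authors' R code: it is a minimal Python illustration in which the covariates are built as a stationary AR(1) sequence (which gives the covariance $ρ^{| i - j |}$) and the NB response is drawn through the gamma-Poisson mixture. The function names, the seed, and the simple Poisson sampler are assumptions.

```python
import math
import random

def ar1_covariates(p, rho, rng):
    """One draw of x ~ N(0, Sigma) with Sigma_ij = rho^|i-j|, built as a stationary AR(1) sequence."""
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(p - 1):
        x.append(rho * x[-1] + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0))
    return x

def poisson_draw(lam, rng):
    """Poisson(lam) draw: count unit-rate exponential arrivals in [0, lam]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def nb_draw(mu, k, rng):
    """NB draw with mean mu and dispersion k via the gamma-Poisson mixture."""
    lam = rng.gammavariate(k, mu / k)   # shape k, scale mu / k, so E[lam] = mu
    return poisson_draw(lam, rng)

def simulate(n, theta1, theta2, rho, seed=0):
    """Generate n pairs (x, y) with mu(x) = exp(theta1'x) and k(x) = exp(theta2'x)."""
    rng = random.Random(seed)
    X, Y = [], []
    for _ in range(n):
        x = ar1_covariates(len(theta1), rho, rng)
        mu = math.exp(sum(a * b for a, b in zip(theta1, x)))
        k = math.exp(sum(a * b for a, b in zip(theta2, x)))
        X.append(x)
        Y.append(nb_draw(mu, k, rng))
    return X, Y

# Parameter values of the low-dimensional setting; n = 100 and rho = 0.5.
X, Y = simulate(100, [1.0, 2.0, -1.0], [-1.0, 0.5, 1.0], rho=0.5)
print(len(X), len(Y), max(Y))
```

The arrival-counting Poisson sampler takes time proportional to the drawn intensity, which is acceptable for a sketch but would be replaced by a library sampler in a larger study.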

Example 1

(Low dimension). We set $p = 3$ and $n = 100, 200, 400$. The true parameters are $θ^{(1)} = (1, 2, - 1)$ and $θ^{(2)} = (- 1, 0.5, 1)$, and their maximum likelihood estimators are denoted as ${\hat{θ}}^{(1)}$ and ${\hat{θ}}^{(2)}$, respectively. We compare the estimator ${\hat{θ}}^{(1)}$ with ${\hat{θ}}^{(1) *}$, which ignores the heterogeneity of the overdispersion and treats $k (x)$ as a constant. Table 1 displays the average squared estimation errors ${∥ \hat{θ} - θ ∥}_{2}^{2}$ based on 200 repetitions.

We can make the following observations from the table. First, the performance of all three estimators improves as n increases. Second, the estimator ${\hat{θ}}^{(1)}$, which estimates the parameter in the mean function $μ (x)$, performs better than ${\hat{θ}}^{(2)}$, which estimates the parameter in the overdispersion function $k (x)$. Last but most important, ${\hat{θ}}^{(1) *}$ performs much worse than ${\hat{θ}}^{(1)}$. For example, the average squared estimation error of ${\hat{θ}}^{(1) *}$ is about 5 times that of ${\hat{θ}}^{(1)}$ when $n = 100$, and about 10 times that of ${\hat{θ}}^{(1)}$ when $n = 400$. The comparison between ${\hat{θ}}^{(1)}$ and ${\hat{θ}}^{(1) *}$ indicates the necessity of considering the heterogeneity of the overdispersion.

Example 2

(High dimension). The sample sizes are chosen to be $n = 100, 200, 400$, with dimension $p \in (25, 50, 150)$, $(50, 100, 250)$ and $(100, 200, 500)$, respectively. We set $θ^{(1)} = (1, 2, - 1, 0, \dots, 0)$ and $θ^{(2)} = (- 1, 0.5, 1, 0, \dots, 0)$. The unknown tuning parameters $(λ_{1}, λ_{2})$ for the penalty functions are chosen by the BIC criterion in the simulation. Results over 200 repetitions are reported. We compare the variable selection performance of the proposed method with that of the previous method, which ignores the heterogeneity of the overdispersion and treats $k (x)$ as a constant. For each case, Table 2 reports the number of repetitions in which each important explanatory variable is selected in the final model, as well as the average number of unimportant explanatory variables selected.

We see from the table that our method performs much better than the previous method that treats $k (x)$ as a constant. Specifically, our method selects the important variables correctly more often than the previous method, and it is less likely to select unimportant variables. Furthermore, the variable selection procedure improves as the sample size n increases. When $n = 400$, the important explanatory variables in $μ (x)$ and $k (x)$ are correctly selected in almost every repetition. When the dimension p increases, the procedure may select more unimportant explanatory variables, but the average numbers remain below $1.3$. Both the important and the unimportant variables in $k (x)$ are less likely to be selected than those in $μ (x)$, especially when the sample size is small.

4.2. A Real Data Example

In this section, we apply the proposed method to the dataset of German health care demand. The data were employed in [24] and can be downloaded from http://qed.econ.queensu.ca/jae/2003-v18.4/riphahn-wambach-million/, accessed on 1 January 2022. The data contain 27,326 observations on 25 variables, including 2 dependent variables, Docvis (number of doctor visits in the last three months) and Hospvis (number of hospital visits in the last calendar year). For conciseness, we focus on Docvis in this study. We build the HNBR model based on the proposed variable selection procedure and use the standard NBR model for comparison. Define the fitting errors (FE) as

n^{- 1} \sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})

, where $y_{i}$ denotes the observed value of Docvis, ${\hat{y}}_{i}$ is the predicted value, and n is the sample size. As the data were observed during 1984–1988, 1991, and 1994, we carry out the analysis separately for each observed year. Table 3 displays the variable selection results and fitting errors.

We have the following findings from the table. First, the important variables in the NBR model are the same as those in the HNBR model in each year, and the estimates are close. Second, the selected variables in $μ (x)$ are almost the same every year, namely Age, Hsat (health satisfaction), Handper (degree of handicap), and Educ (years of schooling). Moreover, some of these variables also play an essential role in $k (x)$, and $k (x)$ contains no variables other than these. Finally, the fitting errors of the HNBR are smaller than those of the NBR. All of these illustrate the advantage of our method.

5. Conclusions and Future Study

We study high-dimensional heterogeneous overdispersed count data via negative binomial regression models and propose a double $ℓ_{1}$-regularized method for simultaneous variable selection and dispersion estimation. Under the restricted eigenvalue conditions, we prove the oracle inequalities with lasso estimators of two partial regression coefficients for the first time, using concentration inequalities of empirical processes. Furthermore, we derive the consistency and convergence rate for the estimators, which are the theoretical guarantees for further statistical inference. Simulation studies and a real example from the German health care demand data indicate that the proposed method works satisfactorily.

There are some limitations of this study. First, we assume that the responses are independent in this work. However, the NB responses are temporally dependent in time-series data [25]. Thus, weak dependence conditions, including $ρ$-mixing and m-dependent types, could be considered in the future. Second, this study pays little attention to statistical inference, such as testing for heterogeneity,

H_{0} : θ^{(2)} = 0 vs. H_{1} : θ^{(2)} \neq 0 .

Such hypothesis testing problems can be addressed via the debiased lasso estimator; see [26] and references therein. This will be part of our future research. Another possible direction is false discovery rate (FDR) control, which aims to identify a small number of statistically significant nonzero results after obtaining the sparse penalized estimates of the HNBR; see [27,28].

Author Contributions

Conceptualization, S.L. and H.W.; methodology, H.W.; software, S.L.; validation, S.L., H.W. and X.L.; data curation, S.L.; writing—original draft preparation, S.L., H.W. and X.L.; writing—review and editing, S.L. and H.W.; supervision, S.L. and H.W.; funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 12101056.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in Section 4.2 can be downloaded from http://qed.econ.queensu.ca/jae/2003-v18.4/riphahn-wambach-million/, accessed on 1 January 2022.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs

The first step is to establish properties of the loss function. In the analysis, bounded quantities are easier to handle than unbounded ones. Let $\partial_{j}$ denote the first partial derivative with respect to $s_{j}$. The boundedness of y and s gives a nice property of the loss function $γ (s, y) = γ (z θ, y)$.

Lemma A1.

We have

\partial_{1} γ (s, y) = - \frac{y e^{s_{2}}}{e^{s_{1}} + e^{s_{2}}} + \frac{e^{s_{1} + s_{2}}}{e^{s_{1}} + e^{s_{2}}}, \partial_{2} γ (s, y) = ν (s, y) + \frac{y e^{s_{2}}}{e^{s_{1}} + e^{s_{2}}}

where

ν (s, y) = e^{s_{2}} (ψ (e^{s_{2}}) - ψ (y + e^{s_{2}})) + e^{s_{2}} log (1 + e^{s_{1} - s_{2}}) - \frac{e^{s_{1} + s_{2}}}{e^{s_{1}} + e^{s_{2}}}

satisfying

sup_{s \in S, y \in Y} | ν (s, y) | \leq F_{1}, sup_{s \neq t \in S, y \in Y} \frac{| ν (s, y) - ν (t, y) |}{{∥ s - t ∥}_{\infty}} \leq F_{2},

with

F_{1} = M_{y, n} (1 + e^{- m_{s, n}}) + e^{M_{s, n}} + \frac{e^{2 M_{s, n}}}{2 e^{m_{s, n}}}

, and

\begin{matrix} F_{2} & = 2 [|e^{M_{s, n}} (1 + log (M_{y, n} + e^{M_{s, n}}) - \frac{1}{2 (M_{y, n} + e^{M_{s, n}})})| \lor |1 - m_{s, n} e^{m_{s, n}}|] \\ + [\frac{e^{2 M_{s, n}}}{e^{m_{s, n}}} + \frac{2 e^{2 M_{s, n}}}{e^{m_{s, n}} + e^{M_{s, n}}}] + \frac{3}{2} e^{M_{s, n}} . \end{matrix}

Proof.

We will use the properties of the psi function, the logarithmic derivative of the gamma function, to prove this lemma. Write $ψ (x) = Γ^{'} (x) / Γ (x)$. Binet's formula (see p. 18 of [29]),

ψ (x) = log x - \int_{0}^{\infty} φ (t) e^{- t x} d t,

where $φ (t) = 1 / (1 - e^{- t}) - 1 / t$ is strictly increasing on $(0, \infty)$, gives

0 < ψ^{'} (x) = \frac{1}{x} + \int_{0}^{\infty} t φ (t) e^{- t x} d t \leq \frac{1}{x} + \int_{0}^{\infty} t e^{- t x} d t = \frac{1}{x} + \frac{1}{x^{2}} .

Hence, for any $s \in S, y \in Y$ (so that $y \geq 0$), we have

\begin{matrix} | ν (s, y) | & = |e^{s_{2}} (ψ (e^{s_{2}}) - ψ (y + e^{s_{2}})) + e^{s_{2}} log (1 + e^{s_{1} - s_{2}}) - \frac{e^{s_{1} + s_{2}}}{e^{s_{1}} + e^{s_{2}}}| \\ & \leq e^{s_{2}} y (\frac{1}{e^{s_{2}}} + \frac{1}{e^{2 s_{2}}}) + e^{s_{2}} e^{s_{1} - s_{2}} + \frac{e^{s_{1} + s_{2}}}{e^{s_{1}} + e^{s_{2}}} \leq M_{y, n} (1 + e^{- m_{s, n}}) + e^{M_{s, n}} + \frac{e^{2 M_{s, n}}}{2 e^{m_{s, n}}} . \end{matrix}

This verifies the first inequality in the lemma. On the other hand, by using the fact that (see (2.2) in [30])

\frac{1}{2 x} < log (x) - ψ (x) < \frac{1}{x}, x > 0,

the functions

f_{1} (x) = e^{x} ψ (e^{x})

and

f_{2} (x) = e^{x} ψ (y + e^{x})

satisfy

\begin{matrix} f_{1}^{'} (x) = e^{x} ψ (e^{x}) + e^{2 x} ψ^{'} (e^{x}) \leq e^{x} (x - \frac{1}{2 e^{x}}) + e^{2 x} (\frac{1}{e^{x}} + \frac{1}{e^{2 x}}) = (x + 1) e^{x} + \frac{1}{2}, \\ f_{1}^{'} (x) = e^{x} ψ (e^{x}) + e^{2 x} ψ^{'} (e^{x}) \geq e^{x} ψ (e^{x}) \geq x e^{x} - 1, \\ f_{2}^{'} (x) = e^{x} ψ (y + e^{x}) + e^{2 x} ψ^{'} (y + e^{x}) \leq e^{x} (1 + log (y + e^{x}) - \frac{1}{2 (y + e^{x})}) + 1, \\ f_{2}^{'} (x) \geq e^{x} (log (y + e^{x}) - \frac{1}{y + e^{x}}) \geq x e^{x} - 1, \end{matrix}

and for any

s \neq t \in S

,

y \in Y

, we conclude that

| ψ (e^{s_{2}}) e^{s_{2}} - ψ (e^{t_{2}}) e^{t_{2}} | = | f_{1} (s_{2}) - f_{1} (t_{2}) | \leq (|(M_{s, n} + 1) e^{M_{s, n}} + 1 / 2| \lor |1 - m_{s, n} e^{m_{s, n}}|) {∥ s - t ∥}_{\infty},

and

\begin{matrix} | ψ (y + e^{s_{2}}) e^{s_{2}} - ψ (y + e^{t_{2}}) e^{t_{2}} | = | f_{2} (s_{2}) - f_{2} (t_{2}) | \\ \leq [|e^{M_{s, n}} (1 + log (M_{y, n} + e^{M_{s, n}}) - \frac{1}{2 (M_{y, n} + e^{M_{s, n}})})| \lor |1 - m_{s, n} e^{m_{s, n}}|] {∥ s - t ∥}_{\infty} . \end{matrix}

In addition, using the mean value theorem again, we also have

\begin{matrix} |e^{s_{2}} log (1 + e^{s_{1} - s_{2}}) - e^{t_{2}} log (1 + e^{t_{1} - t_{2}})| \\ \leq & log (1 + e^{s_{1} - s_{2}}) | e^{s_{2}} - e^{t_{2}} | + e^{t_{2}} |log (1 + e^{s_{1} - s_{2}}) - log (1 + e^{t_{1} - t_{2}})| \\ \leq & e^{2 M_{s, n} - m_{s, n}} | s_{2} - t_{2} | + e^{M_{s, n}} \frac{1}{1 + e^{- (M_{s, n} - m_{s, n})}} | (s_{1} - s_{2}) - (t_{1} - t_{2}) | \\ \leq & [\frac{e^{2 M_{s, n}}}{e^{m_{s, n}}} + \frac{2 e^{2 M_{s, n}}}{e^{m_{s, n}} + e^{M_{s, n}}}] {∥ s - t ∥}_{\infty}, \end{matrix}

and

\begin{matrix} |\frac{e^{s_{1} + s_{2}}}{e^{s_{1}} + e^{s_{2}}} - \frac{e^{t_{1} + t_{2}}}{e^{t_{1}} + e^{t_{2}}}| & \leq e^{s_{1}} |\frac{1}{1 + e^{s_{1} - s_{2}}} - \frac{1}{1 + e^{t_{1} - t_{2}}}| + \frac{1}{1 + e^{t_{1} - t_{2}}} | e^{s_{1}} - e^{t_{1}} | \\ \leq e^{M_{s, n}} \frac{1}{4} | (s_{1} - s_{2}) - (t_{1} - t_{2}) | + 1 \times e^{M_{s, n}} | s_{1} - t_{1} | \leq \frac{3}{2} e^{M_{s, n}} {∥ s - t ∥}_{\infty}, \end{matrix}

where the fact used is that

f_{3} (x) = 1 / (1 + e^{x})

satisfies

| f_{3}^{'} (x) | = 1 / (e^{x} + e^{- x} + 2) \leq 1 / 4

. Because

\begin{matrix} | \partial_{2} γ (s, y) - \partial_{2} γ (t, y) | & \leq |ψ (e^{s_{2}}) e^{s_{2}} - ψ (e^{t_{2}}) e^{t_{2}}| + |ψ (y + e^{s_{2}}) e^{s_{2}} - ψ (y + e^{t_{2}}) e^{t_{2}}| \\ + |e^{s_{2}} log (1 + e^{s_{1} - s_{2}}) - e^{t_{2}} log (1 + e^{t_{1} - t_{2}})| + |\frac{e^{s_{1} + s_{2}}}{e^{s_{1}} + e^{s_{2}}} - \frac{e^{t_{1} + t_{2}}}{e^{t_{1}} + e^{t_{2}}}|, \end{matrix}

we can conclude the second inequality in the lemma. □
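The derivative formulas of Lemma A1 can be sanity-checked numerically. The sketch below assumes that γ(s, y) is the negative binomial negative log-likelihood with μ = e^{s_1} and k = e^{s_2} (the loss is defined earlier in the paper, not in this appendix), and it uses a hand-written digamma approximation (recurrence plus asymptotic series) so that only the standard library is needed. The function names are illustrative and not from the paper.

```python
import math

def digamma(x):
    """psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    r = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - r * (1 / 12 - r * (1 / 120 - r / 252))

def trigamma(x):
    """psi'(x) for x > 0, same shift-then-expand strategy."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)    # psi'(x) = psi'(x + 1) + 1/x^2
        x += 1.0
    r = 1.0 / (x * x)
    return acc + 1.0 / x + 0.5 * r + (1 / 6 - r * (1 / 30 - r / 42)) * r / x

def nb_loss(s1, s2, y):
    """Assumed loss: NB negative log-likelihood with mu = e^{s1}, k = e^{s2}."""
    mu, k = math.exp(s1), math.exp(s2)
    return (math.lgamma(k) + math.lgamma(y + 1) - math.lgamma(y + k)
            - y * math.log(mu / (mu + k)) - k * math.log(k / (mu + k)))

def nu(s1, s2, y):
    """The non-linear part nu(s, y) from Lemma A1."""
    mu, k = math.exp(s1), math.exp(s2)
    return (k * (digamma(k) - digamma(y + k)) + k * math.log1p(mu / k)
            - mu * k / (mu + k))

def d1_loss(s1, s2, y):
    """First partial derivative as stated in Lemma A1."""
    mu, k = math.exp(s1), math.exp(s2)
    return (-y * k + mu * k) / (mu + k)

def d2_loss(s1, s2, y):
    """Second partial derivative as stated in Lemma A1."""
    mu, k = math.exp(s1), math.exp(s2)
    return nu(s1, s2, y) + y * k / (mu + k)

def central_diff(f, x, h=1e-5):
    """Two-sided finite difference used as an independent reference."""
    return (f(x + h) - f(x - h)) / (2 * h)
```

Comparing `d1_loss` and `d2_loss` with `central_diff` applied to `nb_loss` at a few points gives a quick check of the signs in the lemma; the two digamma inequalities used in the proof can be checked the same way with `digamma` and `trigamma`.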

The Lemma separates the partial derivative of

γ

into two parts: the first part is linear in the response variable y (namely

- y e^{s_{2}} / (e^{s_{1}} + e^{s_{2}})

,

e^{s_{1} + s_{2}} / (e^{s_{1}} + e^{s_{2}})

, and

y e^{s_{2}} / (e^{s_{1}} + e^{s_{2}})

), and the second part consists of more complicated non-linear functions of y. The first part is relatively easy to analyze because the following concentration inequality controls the deviation of a weighted sum of negative binomial variables. This concentration inequality is a special case of a result for weighted sums of independent random variables, and it can be proved from the sub-exponential concentration results in Proposition 4.2 of [31].

Lemma A2.

Suppose

{Y_{i}}_{i = 1}^{n}

are independently distributed as

NB (μ_{i}, k_{i})

. Then, for any nonrandom weights

w = {(w_{1}, \dots, w_{n})}^{⊤} \in R^{n}

independent of

{Y_{i}}_{i = 1}^{n}

and

t \geq 0

,

P (| \sum_{i = 1}^{n} w_{i} (Y_{i} - E Y_{i}) | \geq t) \leq 2 exp \{- \frac{1}{4} (\frac{t^{2}}{2 \sum_{i = 1}^{n} w_{i}^{2} a^{2} (μ_{i}, k_{i})} \land \frac{t}{max_{1 \leq i \leq n} | w_{i} | a (μ_{i}, k_{i})})\},

where

q_{i} : = \frac{μ_{i}}{k_{i} + μ_{i}} \in (0, 1)

and

a (μ, k) : = {[log \frac{1 - (1 - q) / \sqrt[k]{2}}{q}]}^{- 1} + \frac{μ}{log 2} .

Proof.

We will use the sub-exponential norm. The moment-generating function (MGF) for

Y_{i}

is

E e^{s Y_{i}} = {(\frac{1 - q_{i}}{1 - q_{i} e^{s}})}^{k_{i}} .

Then, by letting

E exp (| Y_{i} | / t) \leq 2

, we have

2 \geq E exp (| Y_{i} | / t) = E exp (Y_{i} / t) = {(\frac{1 - q_{i}}{1 - q_{i} e^{1 / t}})}^{k_{i}},

which implies the sub-exponential norm for

Y_{i}

is

{∥ Y_{i} ∥}_{ψ_{1}} = inf {t > 0 : E exp (| Y_{i} | / t) \leq 2} = {[log \frac{1 - (1 - q_{i}) / \sqrt[k_{i}]{2}}{q_{i}}]}^{- 1} .

Using the definition of

a_{i}

, from Proposition 4.2 in [31], we can immediately obtain the result in the Lemma. □
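As a hedged illustration of the proof of Lemma A2, the following sketch evaluates the closed-form sub-exponential norm, the constant a(μ, k), and the right-hand side of the tail bound. It uses the parameterization q = μ / (k + μ) from the lemma; the function names are hypothetical, and the weights are assumed not to be all zero.

```python
import math

def nb_mgf(s, mu, k):
    """MGF of NB(mu, k) at s; infinite once q * e^s reaches 1."""
    q = mu / (k + mu)
    if q * math.exp(s) >= 1.0:
        return math.inf
    return ((1.0 - q) / (1.0 - q * math.exp(s))) ** k

def nb_psi1_norm(mu, k):
    """Closed form of the sub-exponential norm derived in the proof."""
    q = mu / (k + mu)
    return 1.0 / math.log((1.0 - (1.0 - q) / 2.0 ** (1.0 / k)) / q)

def a_const(mu, k):
    """The constant a(mu, k) = ||Y||_{psi_1} + mu / log 2 from Lemma A2."""
    return nb_psi1_norm(mu, k) + mu / math.log(2.0)

def weighted_tail_bound(t, w, mus, ks):
    """Right-hand side of the concentration inequality in Lemma A2."""
    a = [a_const(m, k) for m, k in zip(mus, ks)]
    var_term = t * t / (2.0 * sum((wi * ai) ** 2 for wi, ai in zip(w, a)))
    lin_term = t / max(abs(wi) * ai for wi, ai in zip(w, a))
    return 2.0 * math.exp(-0.25 * min(var_term, lin_term))
```

Evaluating `nb_mgf(1 / nb_psi1_norm(mu, k), mu, k)` should return 2 up to rounding, which is exactly the defining equation of the norm.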

It should be noted that

a_{i} = a (μ_{i}, k_{i})

naturally has a lower and upper bound for any

i \in [n]

because

μ_{i}

and

k_{i}

are both bounded between

e^{m_{s, n}}

and

e^{M_{s, n}} .

Note that

Y_{i}

is an unbounded random variable; the next step is to find a probabilistic bound for

M_{y, n} = {max}_{i \in [n]} Y_{i}

. We will cite an important lemma for this type of problem. We say a distribution

P_{γ}

is strongly discrete log-concave with

γ > 0

if its density is strongly midpoint log-convex with the same

γ > 0

.

Lemma A3

(Concentration for strongly log-concave discrete distributions). Let

P_{γ}

be any strongly log-concave discrete distribution indexed by

γ > 0

on

Z^{n}

. Then, for any function

f : R^{n} \to R

that is L-Lipschitz with respect to Euclidean norm, we have for

X \sim P_{γ}

,

P_{P_{γ}} (| f (X) - E f (X) | \geq t) \leq 2 exp \{- \frac{γ t^{2}}{4 L^{2}}\}

for any

t > 0

.

Lemma A4.

The maximum of the responses

M_{y, n} = {max}_{i \in [n]} Y_{i}

has the concentration

P \{M_{y, n} - (2 max_{i \in [n]} [a (μ_{i}, k_{i}) - \frac{μ_{i}}{log 2}] [log (2 n) + \sqrt{2 log (2 n)}] + max_{i \in [n]} μ_{i}) > t\} \leq e^{- γ t^{2} / 4}

for any

t > 0

.

Proof.

For the upper bound of expectation, we first note that

Y_{i} - E Y_{i} \sim subE (2 ∥ Y_{i} ∥_{ψ_{1}})

with

∥ Y_{i} ∥_{ψ_{1}}

calculated in Lemma A2; then we have

Y_{i} - E Y_{i} \sim sub Γ (4 ∥ Y_{i} ∥_{ψ_{1}}^{2}, 2 ∥ Y_{i} ∥_{ψ_{1}})

by Example 5.3 in [31], which further gives

\begin{matrix} E M_{y, n} & \leq E max_{i \in [n]} (Y_{i} - E Y_{i}) + max_{i \in [n]} E Y_{i} \\ \leq {(2 \cdot max_{i \in [n]} 4 {∥ Y_{i} ∥}_{ψ_{1}}^{2} \cdot log (2 n))}^{\frac{1}{2}} + max_{i \in [n]} 2 {∥ Y_{i} ∥}_{ψ_{1}} \cdot log (2 n) + max_{i \in [n]} μ_{i} \\ = 2 max_{i \in [n]} {∥ Y_{i} ∥}_{ψ_{1}} [log (2 n) + \sqrt{2 log (2 n)}] + max_{i \in [n]} μ_{i}, \end{matrix}

where the second ≤ is by Corollary 7.3 in [31] and the bound in the lemma comes from the explicit expression in Lemma A2.

To apply Lemma A3, it remains to verify that

Y : = {(Y_{1}, \dots, Y_{n})}^{⊤} \in Z^{n}

belongs to some strongly log-concave discrete distribution

P_{γ}

with a specific

γ > 0

after we take

f : (x_{1}, \dots, x_{n}) \mapsto {max}_{i \in [n]} x_{i}

which is 1-Lipschitz. By the definition, the derivative of log-density for

y : = {(y_{1}, \dots, y_{n})}^{⊤}

is

ψ^{'} (y_{i}) : = {\frac{\partial log p (y)}{\partial y}|}_{y_{i}} = log \frac{Γ (k_{i} + y_{i})}{Γ (1 + y_{i})} - y_{i} log (k_{i} + μ_{i}),

then the Taylor expansion gives

\begin{matrix} ψ (y) = ψ (⌈\frac{1}{2} x + \frac{1}{2} y⌉) + \frac{1}{2} ψ^{'} (⌈\frac{1}{2} x + \frac{1}{2} y⌉) (y - x) + \frac{1}{8} {(y - x)}^{2} ψ^{″} (a_{1}), \\ ψ (x) = ψ (⌊\frac{1}{2} x + \frac{1}{2} y⌋) + \frac{1}{2} ψ^{'} (⌊\frac{1}{2} x + \frac{1}{2} y⌋) (x - y) + \frac{1}{8} {(y - x)}^{2} ψ^{″} (a_{2}) \end{matrix}

where

a_{1} = t_{1} y + (1 - t_{1}) (x + y) / 2

,

a_{2} = t_{2} x + (1 - t_{2}) (x + y) / 2

with

t_{1}, t_{2} \in [0, 1]

. Define the difference function

Δ (x, y) : = \frac{x - y}{4} [ψ^{'} (⌊\frac{1}{2} x + \frac{1}{2} y⌋) - ψ^{'} (⌈\frac{1}{2} x + \frac{1}{2} y⌉)] + \frac{ψ^{″} (a_{1}) + ψ^{″} (a_{2})}{16} {(y - x)}^{2},

the Taylor expansions above immediately imply

Δ (x, y) \geq {| x - y |}^{2} \{\frac{ψ^{″} (a_{1}) + ψ^{″} (a_{2})}{16} - sup_{x \neq y; x, y \in Z^{n}} \frac{|[ψ^{'} (⌊ (x + y) / 2 ⌋) - ψ^{'} (⌈ (x + y) / 2 ⌉)]|}{4 | x - y |}\} .

Let

\begin{matrix} C_{ψ} & : = sup_{x \neq y; x, y \in Z^{n}} \frac{|[ψ^{'} (⌊ (x + y) / 2 ⌋) - ψ^{'} (⌈ (x + y) / 2 ⌉)]|}{4 | x - y |} \\ = sup_{x \neq y; x, y \in Z^{n}} |[log \frac{Γ (k_{i} + ⌊ (x + y) / 2 ⌋) Γ (⌈ (x + y) / 2 ⌉ + 1)}{Γ (k_{i} + ⌈ (x + y) / 2 ⌉) Γ (⌊ (x + y) / 2 ⌋ + 1)} - \frac{(⌊ (x + y) / 2 ⌋ - ⌈ (x + y) / 2 ⌉)}{{log}^{- 1} (k_{i} + μ_{i})}]| / 4 | x - y |, \end{matrix}

and it is not hard to see

C_{ψ} \approx \frac{|log (k_{i} + μ_{i})|}{4}

or 0. Besides,

\begin{matrix} ψ^{″} (y) & : = {\frac{\partial^{2} log p (y)}{\partial y^{2}}|}_{y = y_{i}} = \frac{d}{d y_{i}} log \frac{Γ (k_{i} + y_{i})}{Γ (y_{i} + 1)} = \sum_{m = 1}^{\infty} (\frac{1}{m + 1} - \frac{1}{m + k_{i} + y_{i}}) - \sum_{m = 1}^{\infty} (\frac{1}{m + 1} - \frac{1}{m + y_{i} + 1}) \\ = \sum_{m = 1}^{\infty} (\frac{1}{m + y_{i} + 1} - \frac{1}{m + k_{i} + y_{i}}) \geq inf_{y_{i} \in Z} \sum_{m = 1}^{\infty} (\frac{1}{m + y_{i} + 1} - \frac{1}{m + k_{i} + y_{i}}) = C_{ψ^{″}} . \end{matrix}

Now, we have obtained

Δ (x, y) \geq {| x - y |}^{2} \{\frac{ψ^{″} (a_{1}) + ψ^{″} (a_{2})}{16} - C_{ψ}\} \geq {| x - y |}^{2} (\frac{C_{ψ^{″}}}{8} - C_{ψ})

which gives

γ = : \frac{C_{ψ^{″}}}{8} - C_{ψ} > 0

from the strong log-concave assumption for

Y

, if

C_{ψ} \approx \frac{| log (k_{i} + μ_{i}) |}{4}

is small. Hence, we can conclude from Lemma A3 and the upper bound of

E M_{y, n}

\begin{matrix} P \{M_{y, n} - (2 max_{i \in [n]} [a (μ_{i}, k_{i}) - \frac{μ_{i}}{log 2}] [log (2 n) + \sqrt{2 log (2 n)}] + max_{i \in [n]} μ_{i}) > t\} \\ \leq & P (M_{y, n} - E M_{y, n} > t) \leq e^{- γ t^{2} / 4} \end{matrix}

which is exactly the result in the lemma. □
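To get a feel for the size of the expectation bound used in this proof, the sketch below computes 2 max_i ∥Y_i∥_{ψ_1} [log(2n) + √(2 log(2n))] + max_i μ_i and compares it with a small simulation. The negative binomial sampler (inverse CDF with the pmf recursion) and all function names are assumptions made only for this illustration.

```python
import math
import random

def nb_psi1_norm(mu, k):
    """Sub-exponential norm of NB(mu, k), as computed in Lemma A2."""
    q = mu / (k + mu)
    return 1.0 / math.log((1.0 - (1.0 - q) / 2.0 ** (1.0 / k)) / q)

def expected_max_bound(mus, ks):
    """Upper bound on E max_i Y_i from the proof of Lemma A4."""
    n = len(mus)
    c = max(nb_psi1_norm(m, k) for m, k in zip(mus, ks))
    log2n = math.log(2 * n)
    return 2.0 * c * (log2n + math.sqrt(2.0 * log2n)) + max(mus)

def sample_nb(mu, k, rng):
    """Inverse-CDF draw from NB(mu, k) using p(y+1) = p(y) q (y+k)/(y+1)."""
    q = mu / (k + mu)
    u = rng.random()
    y, p = 0, (1.0 - q) ** k
    cdf = p
    while cdf < u and y < 100000:   # cap guards against rounding stalls
        p *= q * (y + k) / (y + 1)
        y += 1
        cdf += p
    return y

def empirical_mean_max(mus, ks, reps=200, seed=0):
    """Monte Carlo estimate of E max_i Y_i with independent Y_i."""
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        total += max(sample_nb(m, k, rng) for m, k in zip(mus, ks))
    return total / reps
```

For example, with n = 50 responses, μ_i = 3 and k_i = 2, the simulated mean of the maximum should fall well below `expected_max_bound`, which is expected because the bound is designed for generality rather than tightness.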

Remark A1.

For

M_{y, n}

, its distribution is of sub-Gumbel type, which has rarely been studied in the literature. Another way to deal with it is the extreme value theory (EVT) technique. We note that for any

t \in R

\begin{matrix} P (M_{y, n} - E M_{y, n} > t) & = 1 - \prod_{i = 1}^{n} P (Y_{i} \leq t + E M_{y, n}) \\ = 1 - \prod_{i = 1}^{n} [1 - exp \{- \frac{1}{4} (\frac{{(t + E M_{y, n} - μ_{i})}^{2}}{2 a_{i}^{2}} \land \frac{t + E M_{y, n} - μ_{i}}{a_{i}})\}] . \end{matrix}

If

Y_{i}

are i.i.d., then, in the asymptotic sense,

\begin{matrix} P (M_{y, n} - E M_{y, n} > t) & = 1 - {[1 - P (Y_{1} > t + E M_{y, n})]}^{n} \\ \sim 1 - exp \{- n P (Y_{1} > t + E M_{y, n})\} + o (1) \\ \sim 1 - exp \{- n exp (- \frac{1}{4} \frac{{(t + E M_{y, n} - μ_{1})}^{2}}{2 a_{1}^{2}} \land \frac{t + E M_{y, n} - μ_{1}}{a_{1}})\} . \end{matrix}

Unfortunately, this technique cannot be used in the above lemma because: (i) we need a non-asymptotic inequality instead of an asymptotic expression as

n \to \infty

and (ii)

{Y_{i}}

is not an i.i.d. sequence, so EVT cannot easily be applied in this setting. Hence, we adopt a discrete technique which has been used in [32] and fully illustrated in [14].

The stochastic Lipschitz conditions are established by using the properties of

\partial_{1} γ (s, y)

and

\partial_{2} γ (s, y)

. As noted above, each is divided into two parts. The linear parts can be handled by the concentration inequality for NB variables given in Lemma A2, but the non-linear part

ν (s, y)

needs some more advanced tools regarding the empirical process. They are given as the following lemmas.

Lemma A5

(Equation (3.12) in [33]). Suppose

X_{1} (ω), \dots, X_{n} (ω) \in R

are zero-mean independent stochastic processes indexed by

ω \in Ω

. Assume there exist

M_{0}

and

S_{0}

satisfying

| X_{i} (ω) | \leq M_{0}

and

\sum_{i = 1}^{n} var (X_{i} (ω)) \leq S_{0}^{2}

for all

ω \in Ω

. Denote

S_{n} = {sup}_{ω \in Ω} | \sum_{i = 1}^{n} X_{i} (ω) |

, then for any

t > 0

,

P (S_{n} \geq 2 E S_{n} + S_{0} \sqrt{2 t} + 4 M_{0} t) \leq e^{- t} .

A map

ϕ : R \to R

is called a contraction if

| ϕ (s) - ϕ (t) | \leq | s - t |

for all

s, t \in R

. In addition, in the following lemmas,

ε_{1}, \dots, ε_{n}

are always i.i.d. Rademacher variables.

Lemma A6

(Theorem 2.2 in [34]). Let

T \subseteq V^{n}

be a bounded set and

f_{1}, \dots, f_{n}

be functions

V \to R

such that

f_{i}

is

(M_{i}, ℓ_{\infty})

-Lipschitz with

f_{i} (0) = 0

. For

j = 1, \dots, k \in N

, let

T_{j} = {(t_{1 j}, \dots, t_{n j}) : (t_{1}, \dots, t_{n}) \in T} \subseteq R^{n}

. Then,

E sup_{t \in T} | \sum_{i = 1}^{n} ε_{i} f_{i} (t_{i}) | \leq β_{k} \sum_{j = 1}^{k} E sup_{s \in T_{j}} | \sum_{i = 1}^{n} ε_{i} M_{i} s_{i} |,

where

β_{k}

is a universal constant that can be set no greater than

3^{k} + 3^{k - 1} - 2^{k}

.

Lemma A7

(Theorem 4.12 in [35]). Let

F : R_{+} \to R_{+}

be convex and increasing. Let further

ϕ_{i} : R \to R, i \leq n

be contractions such that

ϕ_{i} (0) = 0

. Then, for any bounded subset

T

in

R^{n}

,

E F (\frac{1}{2} sup_{T} | \sum_{i = 1}^{n} ε_{i} ϕ_{i} (t_{i}) |) \leq E F (sup_{T} | \sum_{i = 1}^{n} ε_{i} t_{i} |) .
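For a small finite set T, both sides of the contraction inequality in Lemma A7 can be computed exactly by enumerating all 2^n sign vectors. The sketch below does this for the special case F(x) = x and the contraction ϕ_i = tanh, which satisfies tanh(0) = 0; the set T and the helper name are hypothetical choices for illustration.

```python
import itertools
import math

def rademacher_sup(T, transform=None):
    """Exact E sup_{t in T} |sum_i eps_i g(t_i)| over all sign vectors.

    T is a list of equal-length tuples; g is `transform`, or the
    identity when `transform` is None.
    """
    n = len(T[0])
    g = transform if transform is not None else (lambda v: v)
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        total += max(abs(sum(e * g(v) for e, v in zip(eps, t))) for t in T)
    return total / 2 ** n

# A small bounded subset of R^4 (arbitrary illustrative values).
T_EXAMPLE = [
    (1.0, -2.0, 0.5, 3.0),
    (0.2, 0.1, -1.5, 0.7),
    (-2.5, 1.0, 1.0, -0.3),
]

# Left and right sides of Lemma A7 with F(x) = x and phi_i = tanh.
LHS = 0.5 * rademacher_sup(T_EXAMPLE, math.tanh)
RHS = rademacher_sup(T_EXAMPLE)
```

Here `LHS <= RHS` is guaranteed by the lemma. Enumeration is only feasible for small n, which is why the sketch uses n = 4.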

Lemma A8

(Lemma 5.2 in [36]). Let

A

be some finite subset of

R^{n}

, let

R = {sup}_{a \in A} {[\sum_{i = 1}^{n} a_{i}^{2}]}^{1 / 2}

, then

E [sup_{a \in A} \sum_{i = 1}^{n} ε_{i} a_{i}] \leq R \sqrt{2 log (card (A))} .
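The finite-class bound of Lemma A8 can also be checked by exact enumeration when n is small. The following sketch is illustrative only: the set A and the function names are assumptions, and the expectation is computed exactly by averaging over all 2^n Rademacher sign vectors.

```python
import itertools
import math

def expected_sup(A):
    """Exact E sup_{a in A} sum_i eps_i a_i over all sign vectors."""
    n = len(A[0])
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        total += max(sum(e * v for e, v in zip(eps, a)) for a in A)
    return total / 2 ** n

def massart_bound(A):
    """R * sqrt(2 log card(A)) with R the largest Euclidean norm in A."""
    R = max(math.sqrt(sum(v * v for v in a)) for a in A)
    return R * math.sqrt(2.0 * math.log(len(A)))

# Arbitrary finite subset of R^5 used only for illustration.
A_EXAMPLE = [
    (1.0, 0.0, -1.0, 2.0, 0.5),
    (0.3, -0.7, 0.2, 0.0, 1.1),
    (-1.2, 0.4, 0.9, -0.6, 0.0),
    (0.0, 2.0, 0.0, -1.0, 1.0),
]
```

In the proof of Theorem 1 the lemma is applied to the columns of the design matrix, together with their negatives to handle the absolute value, which is where the factor √(2 log p) comes from.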

With the assistance of these powerful tools, we can establish the stochastic Lipschitz condition as follows, which is one of the most important points in this article for establishing the oracle inequality of the

ℓ_{2}

distance between the estimated value

\hat{θ}

and the real value

θ^{*}

.

The proof of Theorem 1.

Denote

c_{i} = Z_{i} θ^{*}

. For

θ \in Θ

, denote

t_{i} = Z_{i} (θ - θ^{*}) = Z_{i} θ - c_{i}

. We also define the map

{\bar{π}}_{j} : {(x_{1}, \dots, x_{p})}^{⊤} \mapsto {(x_{1}, \dots, x_{j}, 0, \dots, 0)}^{⊤}

and the function

φ_{i j} (s) = \{\begin{matrix} \frac{γ (c_{i} + {\bar{π}}_{j} s, Y_{i}) - γ (c_{i} + {\bar{π}}_{j - 1} s, Y_{i})}{s_{j}} - \partial_{j} γ (c_{i}, Y_{i}), & if s_{j} \neq 0; \\ \partial_{j} γ (c_{i} + {\bar{π}}_{j - 1} s, Y_{i}) - \partial_{j} γ (c_{i}, Y_{i}), & if s_{j} = 0 . \end{matrix}

Thus,

φ_{i j} : R^{2} \to R

is a real-valued function for

i = 1, \dots, n, j = 1, 2

. Then, it is easy to check that

γ (Z_{i} θ, Y_{i}) - γ (Z_{i} θ^{*}, Y_{i}) = \sum_{j = 1}^{2} (\partial_{j} γ (c_{i}, Y_{i}) + φ_{i j} (t_{i})) t_{i j},

and

n P_{n} (γ (θ) - γ (θ^{*})) = \sum_{i = 1}^{n} \sum_{j = 1}^{2} (\partial_{j} γ (c_{i}, Y_{i}) + φ_{i j} (t_{i})) X_{i}^{⊤} (θ^{(j)} - θ^{* (j)})

in turn. It gives

\begin{matrix} \sqrt{n} G_{n} (γ (θ) - γ (θ^{*})) = & \sum_{i = 1}^{n} \sum_{j = 1}^{2} (\partial_{j} γ (c_{i}, Y_{i}) - E \partial_{j} γ (c_{i}, Y_{i})) X_{i}^{⊤} (θ^{(j)} - θ^{* (j)}) \\ + \sum_{i = 1}^{n} \sum_{j = 1}^{2} (φ_{i j} (t_{i}) - E φ_{i j} (t_{i})) X_{i}^{⊤} (θ^{(j)} - θ^{* (j)}) . \end{matrix}
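The telescoping identity behind this decomposition, γ(c_i + t_i, Y_i) − γ(c_i, Y_i) = Σ_j (∂_j γ(c_i, Y_i) + φ_{ij}(t_i)) t_{ij}, can be verified numerically. The sketch below assumes that γ is the negative binomial negative log-likelihood with μ = e^{s_1}, k = e^{s_2}, and approximates the partial derivatives by central differences; all names are illustrative.

```python
import math

def nb_loss(s, y):
    """Assumed loss: NB negative log-likelihood, mu = e^{s[0]}, k = e^{s[1]}."""
    mu, k = math.exp(s[0]), math.exp(s[1])
    return (math.lgamma(k) + math.lgamma(y + 1) - math.lgamma(y + k)
            - y * math.log(mu / (mu + k)) - k * math.log(k / (mu + k)))

def partial(j, s, y, h=1e-6):
    """Central-difference approximation of the j-th partial (j = 0 or 1)."""
    up = list(s)
    dn = list(s)
    up[j] += h
    dn[j] -= h
    return (nb_loss(up, y) - nb_loss(dn, y)) / (2 * h)

def pi_bar(j, s):
    """Keep the first j coordinates of s and zero out the rest."""
    return tuple(s[i] if i < j else 0.0 for i in range(2))

def add(a, b):
    return tuple(x + z for x, z in zip(a, b))

def phi(j, c, s, y):
    """phi_{ij}(s) for j in {1, 2}, following the two-case definition."""
    base = partial(j - 1, c, y)
    if s[j - 1] != 0.0:
        num = nb_loss(add(c, pi_bar(j, s)), y) - nb_loss(add(c, pi_bar(j - 1, s)), y)
        return num / s[j - 1] - base
    return partial(j - 1, add(c, pi_bar(j - 1, s)), y) - base

def telescoped(c, t, y):
    """Right-hand side: sum_j (d_j gamma(c, y) + phi_j(t)) * t_j."""
    return sum((partial(j - 1, c, y) + phi(j, c, t, y)) * t[j - 1] for j in (1, 2))
```

Because the derivative term cancels inside each summand, the identity holds exactly whichever derivative approximation is used, so `telescoped(c, t, y)` agrees with `nb_loss(add(c, t), y) - nb_loss(c, y)` up to rounding.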

First, we would like to give the explicit formula for

φ_{i 1}

and obtain an upper bound as well as a Lipschitz parameter for

φ_{i 2}

. Denote

h_{i} (\cdot) = γ (\cdot, Y_{i})

, then

φ_{i j} (s) = \int_{0}^{1} (\partial_{j} h_{i} (c_{i} + {\bar{π}}_{j - 1} s + s_{j} u e_{j}) - \partial_{j} h_{i} (c_{i})) d u,

where

e_{j}

is the j-th basis vector of

R^{2}

. Hence, for

j = 1

,

\begin{matrix} φ_{i 1} (s) & = - Y_{i} \int_{0}^{1} [\frac{e^{s_{2}}}{e^{s_{1} + u} + e^{s_{2}}} - \frac{e^{s_{2}}}{e^{s_{1}} + e^{s_{2}}}] d u + \int_{0}^{1} [\frac{e^{s_{1} + s_{2} + u}}{e^{s_{1} + u} + e^{s_{2}}} - \frac{e^{s_{1} + s_{2}}}{e^{s_{1}} + e^{s_{2}}}] d u \\ = [log \frac{e^{s_{1} + 1} + e^{s_{2}}}{e^{s_{1}} + e^{s_{2}}} - \frac{e^{s_{1}}}{e^{s_{1}} + e^{s_{2}}}] Y_{i} + C_{1} (s), \end{matrix}

in which

C_{1} (s)

is a function that depends only on s and is free of Y and the index i. Using Lemma A1, for

j = 2

, write

F_{3} = F_{1} + M_{y, n}

,

| φ_{i 2} (s) | \leq \int_{0}^{1} | \partial_{2} h_{i} (c_{i} + {\bar{π}}_{1} s + s_{2} u e_{2}) - \partial_{2} h_{i} (c_{i}) | d u \leq 2 F_{3},

and

\begin{matrix} | φ_{i 2} (s) - φ_{i 2} (t) | & \leq \int_{0}^{1} | \partial_{2} h_{i} (c_{i} + {\bar{π}}_{1} s + s_{2} u e_{2}) - \partial_{2} h_{i} (c_{i} + {\bar{π}}_{1} t + t_{2} u e_{2}) | d u \\ \leq \int_{0}^{1} F_{2} ∥ {\bar{π}}_{1} (s - t) + (s_{2} - t_{2}) u e_{2} ∥_{\infty} d u \leq F_{2} {∥ s - t ∥}_{\infty} . \end{matrix}

This implies

φ_{i 2}

is

(F_{2}, ℓ_{\infty})

-Lipschitz. In particular, letting

s = Z_{i} (θ - θ^{*})

and

t = 0

,

|φ_{i 2} (Z_{i} (θ - θ^{*}))| \leq F_{2} {∥ Z_{i} (θ - θ^{*}) ∥}_{\infty} \leq F_{2} M_{x} D_{Θ} .

Hence, we obtain an upper bound for

φ_{i 2}

that

|φ_{i 2} (Z_{i} (θ - θ^{*}))| \leq 2 F_{3} \lor F_{2} M_{x} D_{Θ} : = M_{1}

(A1)

Now, for

k = 1, \dots, p

, define

ξ_{i k} (θ) : = (φ_{i 2} (t_{i}) - E φ_{i 2} (t_{i})) X_{i k}, S_{k} = sup_{θ \in Θ} |\sum_{i = 1}^{n} ξ_{i k} (θ)| .

Then, we can approach the final conclusion in the theorem by

\begin{matrix} sup_{θ \in Θ / {θ^{*}}} |\frac{\sqrt{n} G_{n} (γ (θ) - γ (θ^{*}))}{∥ θ - θ^{*} ∥_{1}}| & \leq max_{1 \leq k \leq p} |\sum_{i = 1}^{n} (\partial_{1} γ (c_{i}, Y_{i}) - E \partial_{1} γ (c_{i}, Y_{i})) X_{i k}| \\ + sup_{θ \in Θ / {θ^{*}}} max_{1 \leq k \leq p} |\sum_{i = 1}^{n} (φ_{i 1} (t_{i}) - E φ_{i 1} (t_{i})) X_{i k}| \\ + max_{1 \leq k \leq p} |\sum_{i = 1}^{n} \frac{e^{c_{i 2}}}{e^{c_{i 1}} + e^{c_{i 2}}} (Y_{i} - E Y_{i}) X_{i k}| \\ + max_{1 \leq k \leq p} |\sum_{i = 1}^{n} (ν (c_{i}, Y_{i}) - E ν (c_{i}, Y_{i})) X_{i k}| + sup_{θ \in Θ / {θ^{*}}} max_{1 \leq k \leq p} S_{k} . \end{matrix}

(A2)

We will tackle (A2) term by term.

(i). The first three terms in (A2):

We will use concentration inequality to deal with these terms. For any

1 \leq k \leq p

and

t \geq 0

, by Lemma A2 and the Cauchy–Schwarz inequality,

\begin{matrix} P (| \sum_{i = 1}^{n} (\partial_{1} γ (c_{i}, Y_{i}) - E \partial_{1} γ (c_{i}, Y_{i})) X_{i k} | \geq t) = P (| \sum_{i = 1}^{n} \frac{e^{c_{i 2}}}{e^{c_{i 1}} + e^{c_{i 2}}} X_{i k} (Y_{i} - E Y_{i}) | \geq t) \\ \leq 2 exp \{- \frac{1}{4} (\frac{t^{2}}{2 \sum_{i = 1}^{n} {(w_{i}^{(1)})}^{2} X_{i k}^{2} a_{i}^{2}} \land \frac{t}{max_{1 \leq i \leq n} | w_{i}^{(1)} X_{i k} | a_{i}})\} \\ \leq 2 exp \{- \frac{1}{4} (\frac{t^{2}}{2 \sqrt{\sum_{i = 1}^{n} {(w_{i}^{(1)})}^{4} a_{i}^{4}} max_{1 \leq k \leq p} \sqrt{\sum_{i = 1}^{n} X_{i k}^{4}}} \land \frac{t}{M_{x} max_{1 \leq i \leq n} | w_{i}^{(1)} | a_{i}})\}, \end{matrix}

where

w_{i}^{(1)} = e^{c_{i 2}} / (e^{c_{i 1}} + e^{c_{i 2}})

and

a_{i} = a (μ_{i}, k_{i})

is defined in Lemma A2; both are deterministic and free of

θ

and the index k. Hence,

\begin{matrix} P (max_{1 \leq k \leq p} | \sum_{i = 1}^{n} (\partial_{1} γ (c_{i}, Y_{i}) - E \partial_{1} γ (c_{i}, Y_{i})) X_{i k} | \geq t) \\ \leq 2 p exp \{- \frac{1}{4} (\frac{t^{2}}{2 \sqrt{\sum_{i = 1}^{n} {(w_{i}^{(1)})}^{4} a_{i}^{4}} max_{1 \leq k \leq p} \sqrt{\sum_{i = 1}^{n} X_{i k}^{4}}} \land \frac{t}{M_{x} max_{1 \leq i \leq n} | w_{i}^{(1)} | a_{i}})\} . \end{matrix}

By letting the right side of the above display be

q_{1} \in (0, 1)

, we can obtain

\begin{matrix} P (max_{1 \leq k \leq p} | \sum_{i = 1}^{n} (\partial_{1} γ (c_{i}, Y_{i}) - E \partial_{1} γ (c_{i}, Y_{i})) X_{i k} | \\ \geq 2 \sqrt{2 {(\sum_{i = 1}^{n} {(w_{i}^{(1)})}^{4} a_{i}^{4})}^{1 / 2} {({max}_{1 \leq k \leq p} \sum_{i = 1}^{n} X_{i k}^{4})}^{1 / 2} log (2 p / q_{1})} \lor 4 M_{x} max_{1 \leq i \leq n} | w_{i}^{(1)} | a_{i} log (2 p / q_{1})) \leq q_{1} . \end{matrix}
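The step from the tail bound to the threshold above is an inversion of a Bernstein-type inequality. Writing V for the variance proxy and B for the scale term, the tail 2p exp{−¼ (t²/(2V) ∧ t/B)} is at most q once t ≥ 2√(2V log(2p/q)) ∨ 4B log(2p/q). The sketch below checks this, together with the Cauchy–Schwarz step that produces V; the symbols V, B and the function names are introduced here only for illustration.

```python
import math

def bernstein_tail(t, V, B, p):
    """Union bound 2p * exp(-(1/4) * min(t^2 / (2V), t / B))."""
    return 2 * p * math.exp(-0.25 * min(t * t / (2 * V), t / B))

def bernstein_threshold(q, V, B, p):
    """Smallest t of the displayed form that makes the tail at most q."""
    L = math.log(2 * p / q)
    return max(2 * math.sqrt(2 * V * L), 4 * B * L)

def variance_proxies(w, a, x_col):
    """Exact sum of (w_i a_i x_i)^2 and its Cauchy-Schwarz upper bound."""
    exact = sum((wi * ai * xi) ** 2 for wi, ai, xi in zip(w, a, x_col))
    cs = (math.sqrt(sum((wi * ai) ** 4 for wi, ai in zip(w, a)))
          * math.sqrt(sum(xi ** 4 for xi in x_col)))
    return exact, cs
```

For instance, `bernstein_tail(bernstein_threshold(q, V, B, p), V, B, p)` should not exceed `q` for any positive V and B, since both branches of the minimum are then at least 4 log(2p/q).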

In exactly the same way, for any

q_{3} \in (0, 1)

, we obtain for the third term

\begin{matrix} P (max_{1 \leq k \leq p} | \sum_{i = 1}^{n} \frac{e^{c_{i 2}}}{e^{c_{i 1}} + e^{c_{i 2}}} (Y_{i} - E Y_{i}) X_{i k} | \\ \geq 2 \sqrt{2 {(\sum_{i = 1}^{n} {(w_{i}^{(1)})}^{4} a_{i}^{4})}^{1 / 2} {(max_{1 \leq k \leq p} \sum_{i = 1}^{n} X_{i k}^{4})}^{1 / 2} log (2 p / q_{3})} \lor 4 M_{x} max_{1 \leq i \leq n} | w_{i}^{(1)} | a_{i} log (2 p / q_{3})) \leq q_{3} . \end{matrix}

The situation is slightly different for the second term. Indeed,

\begin{matrix} P (| \sum_{i = 1}^{n} (φ_{i 1} (t_{i}) - E φ_{i 1} (t_{i})) X_{i k} | \geq t) & = P (|\sum_{i = 1}^{n} [log \frac{e^{t_{i 1} + 1} + e^{t_{i 2}}}{e^{t_{i 1}} + e^{t_{i 2}}} - \frac{e^{t_{i 1}}}{e^{t_{i 1}} + e^{t_{i 2}}}] X_{i k} (Y_{i} - E Y_{i})| \geq t) \\ : = P (| \sum_{i = 1}^{n} w_{i}^{(2)} (θ) X_{i k} (Y_{i} - E Y_{i}) | \geq t) \end{matrix}

Because

t_{i}

is a function of

θ

, and so are the weights

w_{i}^{(2)} (θ)

, we cannot use exactly the same method as before. However, because

Θ

is convex, we have

{t_{i}}_{i = 1}^{n} \subseteq S

. Then, we only need to note that

| w_{i}^{(2)} (θ) | = |log \frac{e^{t_{i 1} + 1} + e^{t_{i 2}}}{e^{t_{i 1}} + e^{t_{i 2}}} - \frac{e^{t_{i 1}}}{e^{t_{i 1}} + e^{t_{i 2}}}| \leq log \frac{e + e^{M_{s, n} - m_{s, n}}}{1 + e^{m_{s, n} - M_{s, n}}} + \frac{1}{1 + e^{m_{s, n} - M_{s, n}}} : = w^{(2)},

which gives

\begin{matrix} P (max_{1 \leq k \leq p} | \sum_{i = 1}^{n} (φ_{i 1} (t_{i}) - E φ_{i 1} (t_{i})) X_{i k} | \\ \geq 2 \sqrt{2 \sqrt{n} {(w^{(2)})}^{2} {(\sum_{i = 1}^{n} a_{i}^{4})}^{1 / 2} {(max_{1 \leq k \leq p} \sum_{i = 1}^{n} X_{i k}^{4})}^{1 / 2} log (2 p / q_{2})} \lor 4 M_{x} w^{(2)} max_{1 \leq i \leq n} | a_{i} | log (2 p / q_{2})) \leq q_{2} . \end{matrix}

for any

θ \in Θ

and

q_{2} \in (0, 1)

.

(ii). The fourth term in (A2):

From Lemma A1, we know that

| ν (c_{i}, Y_{i}) | \leq F_{1}

. Thus, by Hoeffding's inequality (see Corollary 2.1 (b) in [31]), for any

t \geq 0

and

1 \leq k \leq p

,

\begin{matrix} P (| \sum_{i = 1}^{n} (ν (c_{i}, Y_{i}) - E ν (c_{i}, Y_{i})) X_{i k} | \geq t) & \leq 2 exp \{- \frac{t^{2}}{2 F_{1}^{2} \sum_{i = 1}^{n} X_{i k}^{2}}\} \leq 2 exp \{- \frac{t^{2}}{2 F_{1}^{2} max_{1 \leq k \leq p} \sum_{i = 1}^{n} X_{i k}^{2}}\} . \end{matrix}

For arbitrary

q_{4} \in (0, 1)

, letting

t = F_{1} \sqrt{2 log (2 p / q_{4}) {max}_{1 \leq k \leq p} \sum_{i = 1}^{n} X_{i k}^{2}}

, we obtain

P (max_{1 \leq k \leq p} | \sum_{i = 1}^{n} (ν (c_{i}, Y_{i}) - E ν (c_{i}, Y_{i})) X_{i k} | \geq F_{1} \sqrt{2 log (2 p / q_{4}) max_{1 \leq k \leq p} \sum_{i = 1}^{n} X_{i k}^{2}}) \leq q_{4} .
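The choice of t in this step simply sets the union bound over the p columns equal to q_4. A minimal sketch, assuming a design matrix stored as a list of rows and using hypothetical function names:

```python
import math

def max_column_energy(X):
    """max_k sum_i X_ik^2 for a design matrix given as a list of rows."""
    p = len(X[0])
    return max(sum(row[k] ** 2 for row in X) for k in range(p))

def hoeffding_threshold(q, F1, X):
    """t = F1 * sqrt(2 log(2p/q) * max_k sum_i X_ik^2)."""
    p = len(X[0])
    return F1 * math.sqrt(2 * math.log(2 * p / q) * max_column_energy(X))

def union_tail(t, F1, X):
    """2p * exp(-t^2 / (2 F1^2 max_k sum_i X_ik^2)), the bound being inverted."""
    p = len(X[0])
    return 2 * p * math.exp(-t * t / (2 * F1 ** 2 * max_column_energy(X)))
```

Plugging `hoeffding_threshold(q, F1, X)` into `union_tail` returns `q` up to rounding, which is the content of the display above.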

(iii). The last term in (A2):

For any

i = 1, \dots, n

and

k = 1, \dots, p

, by (A1),

| ξ_{i k} (θ) | \leq 2 M_{1} M_{x} : = M_{0}

. In addition, for any

θ \in Θ

, (A1) also implies

\sum_{i = 1}^{n} var (ξ_{i k} (θ)) \leq \sum_{i = 1}^{n} E {(φ_{i 2} (t_{i}) X_{i k})}^{2} \leq M_{1}^{2} \sum_{i = 1}^{n} X_{i k}^{2} \leq M_{1}^{2} max_{1 \leq k \leq p} \sum_{i = 1}^{n} X_{i k}^{2} : = S_{0}^{2} .

Therefore, from Lemma A5, it follows that

P (S_{k} \geq 2 E S_{k} + S_{0} \sqrt{2 t} + 4 M_{0} t) \leq e^{- t} .

(A3)

Thus, the last task is giving an upper bound for

E S_{k}

. Note that

E ξ_{i k} (θ) = 0

, by symmetrization,

\begin{matrix} E S_{k} = E sup_{θ \in Θ} | \sum_{i = 1}^{n} (φ_{i 2} (t_{i}) - E φ_{i 2} (t_{i})) X_{i k} | \leq 2 E sup_{θ \in Θ} | \sum_{i = 1}^{n} ε_{i} φ_{i 2} (t_{i}) X_{i k} | = 2 E sup_{t \in T} | \sum_{i = 1}^{n} ε_{i} φ_{i 2} (t_{i}) X_{i k} |, \end{matrix}

where

T = {t_{i} = Z_{i} (θ - θ^{*}) : θ \in Θ, i = 1, \dots, n}

, and

ε_{1}, \dots, ε_{n}

are i.i.d. Rademacher variables independent of

Y_{1}, \dots, Y_{n}

. Here, using the fact

φ_{i 2} (\cdot) X_{i k}

is

(M_{x} F_{2}, ℓ_{\infty})

-Lipschitz and Lemmas A6–A8,

\begin{matrix} E sup_{t \in T} | \sum_{i = 1}^{n} ε_{i} φ_{i 2} (t_{i}) X_{i k} | & \leq 8 M_{x} F_{2} \sum_{j = 1}^{2} E sup_{t \in T} | \sum_{i = 1}^{n} ε_{i} t_{i j} | = 8 M_{x} F_{2} \sum_{j = 1}^{2} E sup_{θ \in Θ} | \sum_{i = 1}^{n} ε_{i} X_{i}^{⊤} (θ^{(j)} - θ^{* (j)}) | \\ \leq 16 M_{x} F_{2} D_{Θ} E max_{1 \leq k \leq p} | \sum_{i = 1}^{n} ε_{i} X_{i k} | \leq 16 \sqrt{2 log p} M_{x} F_{2} D_{Θ} \sqrt{max_{1 \leq k \leq p} \sum_{i = 1}^{n} X_{i k}^{2}} . \end{matrix}

Then, by (A3),

P (S_{k} \geq 32 \sqrt{2 log p} M_{x} F_{2} D_{Θ} \sqrt{max_{1 \leq k \leq p} \sum_{i = 1}^{n} X_{i k}^{2}} + M_{1} \sqrt{2 t max_{1 \leq k \leq p} \sum_{i = 1}^{n} X_{i k}^{2}} + 8 M_{1} M_{x} t) \leq e^{- t} .

Note that the right side of the inequality is free of

θ

. Letting

t = log (p / q_{5})

in the above inequality and using the same union-bound technique as before, we obtain the uniform bound for this term. The Theorem is proved by letting

q_{2} = q_{3} = q_{1}

,

q_{4} = q_{2}

,

q_{5} = q_{3}

, and

| w_{i}^{(1)} | \leq w^{(1)}

. □

The lower bound of the likelihood-based divergence

Recall the standard steps for establishing the oracle inequality for a lasso estimator are (see [37] for example):

I. To avoid ill behavior of the Hessian, impose the restricted eigenvalue condition or another analogous condition on the design matrix.
II. Choose the tuning parameter based on a high-probability event, i.e., the KKT conditions.
III. Using the restricted eigenvalue assumption and the choice of tuning parameter, derive the oracle inequalities from the optimality of the lasso estimator, the minimizer of the unknown expected risk function, and some basic inequalities. There are three sub-steps:
(i)
Under the KKT conditions, show that the error vector $\hat{θ} - θ^{*}$ is in some restricted set with structure sparsity, and check that $\hat{θ} - θ^{*}$ is in a big compact set;
(ii)
Show that the likelihood-based divergence of $\hat{θ}$ and $θ^{*}$ can be lower bounded by some quadratic distance between $\hat{θ}$ and $θ^{*}$ ;
(iii)
By some elementary inequalities and (ii), show that $∥ \hat{θ} - θ^{*} ∥_{1}$ is in a smaller compact set with a radius of optimal rate (proportional to $λ$ ).

Under our approach, the high-probability KKT condition is replaced by the stochastic Lipschitz condition, while the other steps remain the same. For most models in the canonical exponential family, step III.(ii) is quite trivial; see Lemma 1 in [38] for example. Nonetheless, it is worth noting that our loss function is not in the canonical exponential family, so there is no existing discussion of the lower bound of the likelihood-based divergence of

\hat{θ}

and

θ^{*}

in our setting. The following theorem establishes such a lower bound.

Theorem A1.

Suppose the conditions are the same as those in Theorem 1. Denote the true parameters of

Y_{i}

by

μ^{*}

and

k^{*}

. If

{Z_{i} θ}_{i = 1, \dots, n, θ \in Θ} \subseteq S \cap {s \in R^{2} : 2 s_{1} + (1 + s_{2} (1 - k^{*}) k^{* μ^{*}}) μ^{*} \leq \frac{s_{1} + μ^{*}}{2 s_{2}^{2}}}

and

μ^{*} \geq 1

, then

E γ (Z_{i} θ, Y_{i}) - E γ (Z_{i} θ^{*}, Y_{i}) \geq C_{γ} {∥ Z_{i} (θ - θ^{*}) ∥}_{2}^{2},

where

C_{γ}

is a positive constant and its exact definition is in the proof.

Proof.

For simplicity, we drop the index i. By the definition and the notation in Theorem 1,

\begin{matrix} E γ (Z θ, Y) - E γ (Z θ^{*}, Y) & = D_{KL} (s, c), \end{matrix}

where

D_{KL}

is the Kullback–Leibler divergence from the

Y_{i}

’s density

f (y | Z θ)

to

f (y | Z θ^{*})

, i.e.,

D_{KL} (s, c) : = \int f (y | c) log \frac{f (y | c)}{f (y | s)} d y .

By the identifiability of the negative binomial distribution, we have

D_{KL} (s, c) \geq 0

with equality if and only if

s = c

. Using the Taylor theorem,

\begin{matrix} D_{KL} (s, c) & = D_{KL} (c, c) + {\frac{\partial}{\partial s} D_{KL} (s, c)|}_{s = c} + \frac{1}{2} {(s - c)}^{⊤} {[\frac{\partial^{2}}{\partial s \partial s^{⊤}} D_{KL} (s, c)]}_{s = c + ρ (s - c)} (s - c) \\ = \frac{1}{2} {(s - c)}^{⊤} {[\frac{\partial^{2}}{\partial s \partial s^{⊤}} D_{KL} (s, c)]}_{s = c + ρ (s - c)} (s - c) \\ \geq \frac{1}{2} inf_{ρ \in [0, 1]} λ_{m i n} {[\frac{\partial^{2}}{\partial s \partial s^{⊤}} D_{KL} (s, c)]}_{s = c + ρ (s - c)} {∥ s - c ∥}_{2}^{2} \end{matrix}

where

ρ \in [0, 1]

and

λ_{m i n} (M)

is the smallest eigenvalue of the matrix

M

. Thus, it is enough to show that

{[\frac{\partial^{2}}{\partial s \partial s^{⊤}} D_{KL} (s, c)]}_{s = c + ρ (s - c)}

is strictly positive definite for any

ρ \in [0, 1]

. First, calculate directly,

\begin{matrix} \frac{\partial^{2}}{\partial s \partial s^{⊤}} D_{KL} (s, c) & = \int f (y | c) [\frac{\partial^{2}}{\partial s \partial s^{⊤}} γ (s, y)] d y \\ = \int f (y | c) [\begin{matrix} \frac{e^{s_{1} + s_{2}}}{{(e^{s_{1}} + e^{s_{2}})}^{2}} (e^{s_{2}} + y) & \frac{e^{s_{1} + s_{2}}}{{(e^{s_{1}} + e^{s_{2}})}^{2}} (e^{s_{1}} - y) \\ \frac{e^{s_{1} + s_{2}}}{{(e^{s_{1}} + e^{s_{2}})}^{2}} (e^{s_{1}} - y) & \partial_{2} ν (s, y) + \frac{e^{s_{1} + s_{2}}}{{(e^{s_{1}} + e^{s_{2}})}^{2}} y \end{matrix}] d y = : [\begin{matrix} a_{11} + b & a_{12} - b \\ a_{21} - b & a_{22} + b \end{matrix}], \end{matrix}

where

a_{11} = \frac{e^{s_{1} + 2 s_{2}}}{{(e^{s_{1}} + e^{s_{2}})}^{2}}, a_{12} = \frac{e^{2 s_{1} + s_{2}}}{{(e^{s_{1}} + e^{s_{2}})}^{2}}, b = \frac{e^{s_{1} + 2 s_{2}}}{{(e^{s_{1}} + e^{s_{2}})}^{2}} E Y,

and

\begin{matrix} a_{22} = E \partial_{2} ν (s, Y) & = e^{s_{2}} [ψ (e^{s_{2}}) + e^{s_{2}} ψ^{'} (e^{s_{2}}) + log (1 + e^{s_{1} - s_{2}}) - \frac{e^{s_{1}}}{e^{s_{1}} + e^{s_{2}}} - {(\frac{e^{s_{1}}}{e^{s_{1}} + e^{s_{2}}})}^{2}] \\ - e^{s_{2}} [E ψ (Y + e^{s_{2}}) + e^{s_{2}} E ψ^{'} (Y + e^{s_{2}})] . \end{matrix}

For a

2 \times 2

matrix

M

, it is strictly positive definite if and only if

tr (M) > 0

and

det (M) > 0

. Denote

μ = e^{s_{1}}, k = e^{s_{2}}

, and

μ^{*} = e^{c_{1}}, k^{*} = e^{c_{2}}

are true parameters for Y. Then,

\begin{matrix} tr [\frac{\partial^{2}}{\partial s \partial s^{⊤}} D_{KL} (s, c)] & = \frac{μ k^{2}}{{(μ + k)}^{2}} + 2 \frac{μ k^{2}}{{(μ + k)}^{2}} μ^{*} - k [\frac{μ}{μ + k} + {(\frac{μ}{μ + k})}^{2}] \\ + k [log (1 + μ / k) + (ψ (k) - E ψ (Y + k)) + k (ψ^{'} (k) - E ψ^{'} (Y + k))] \\ = \frac{2 (μ^{*} - 1) μ k^{2}}{{(μ + k)}^{2}} + k [log (1 + μ / k) + g_{1} (k) + k g_{2} (k)] \\ \geq k [log (1 + μ / k) + g_{1} (k) + k g_{2} (k)] . \end{matrix}

(A4)

Now, we are going to deal with

g_{1} (k) = ψ (k) - E ψ (Y + k)

and

g_{2} (k) = ψ^{'} (k) - E ψ^{'} (Y + k)

. For

ψ (x)

,

0 > ψ^{″} (x) = - \frac{1}{x^{2}} - \int_{0}^{\infty} t^{2} φ (t) e^{- t x} d t \geq - \frac{1}{x^{2}} - \frac{2}{x^{3}} .

Therefore,

ψ (\cdot)

is concave. Using Jensen's inequality and the mean value theorem,

\begin{matrix} g_{1} (k) = ψ (k) - E ψ (Y + k) \geq ψ (k) - ψ (E Y + k) \geq - (\frac{1}{k} + \frac{1}{k^{2}}) E Y = - μ^{*} (\frac{1}{k} + \frac{1}{k^{2}}) . \end{matrix}

Similarly, for

g_{2} (k)

, by using the fact that

E (1 / Y) = (1 - k^{*}) k^{* μ^{*}} μ^{*}

and the assumption,

\begin{matrix} g_{2} (k) & = E [ψ^{'} (k) - ψ^{'} (Y + k)] \geq E [Y (\frac{1}{{(ξ (Y) + k)}^{2}} + \frac{2}{{(ξ (Y) + k)}^{3}})] \\ \geq E [\frac{Y}{{(Y + k)}^{2}}] + 2 E [\frac{Y}{{(Y + k)}^{3}}] \geq {[E \frac{{(Y + k)}^{2}}{Y}]}^{- 1} + 2 {[E \frac{{(Y + k)}^{3}}{Y}]}^{- 1} \\ = \frac{1}{2 k + (1 + k^{2} (1 - k^{*}) k^{* μ^{*}}) μ^{*}} + \frac{2}{k^{* 2} (k^{*} + μ^{*}) / μ^{* 2} + μ^{* 2} + 3 k μ^{*} + 3 k^{2} + k^{3} (1 - k^{*}) k^{* μ^{*}} μ^{*}} \\ \geq (μ + μ^{*}) (\frac{1}{2 k^{2}} + \frac{1}{k^{3}}) . \end{matrix}

where

ξ (Y)

lies between 0 and Y. Combining the lower bounds for

g_{1}

and

g_{2}

, together with the fact that

log (1 + x) \geq x - x^{2} / 2

for

x \geq 0

, we conclude that

tr [\frac{\partial^{2}}{\partial s \partial s^{⊤}} D_{KL} (s, c)] > 0

. Similarly, we can also prove

det [\frac{\partial^{2}}{\partial s \partial s^{⊤}} D_{KL} (s, c)] > 0

, so the theorem holds. □
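The trace and determinant criterion can be illustrated numerically at the true parameter, where the Hessian of the Kullback–Leibler divergence is the Fisher information. The sketch below assumes the negative binomial parameterization μ = e^{s_1}, k = e^{s_2}, truncates the sum over y at a large cutoff, and approximates second derivatives by central differences. The helper names, the cutoff, and the step size are choices made for this illustration only.

```python
import math

def nb_logpmf(y, s1, s2):
    """Log pmf of NB(mu, k) with mu = e^{s1}, k = e^{s2}."""
    mu, k = math.exp(s1), math.exp(s2)
    return (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
            + y * math.log(mu / (mu + k)) + k * math.log(k / (mu + k)))

def kl(s, c, y_max=500):
    """D_KL(s, c) = sum_y f(y|c) log(f(y|c) / f(y|s)), truncated at y_max."""
    total = 0.0
    for y in range(y_max + 1):
        lp_c = nb_logpmf(y, c[0], c[1])
        total += math.exp(lp_c) * (lp_c - nb_logpmf(y, s[0], s[1]))
    return total

def kl_hessian(point, c, h=1e-3):
    """Central-difference Hessian of s -> D_KL(s, c) at `point`."""
    s1, s2 = point
    f = lambda a, b: kl((a, b), c)
    f0 = f(s1, s2)
    h11 = (f(s1 + h, s2) - 2 * f0 + f(s1 - h, s2)) / h ** 2
    h22 = (f(s1, s2 + h) - 2 * f0 + f(s1, s2 - h)) / h ** 2
    h12 = (f(s1 + h, s2 + h) - f(s1 + h, s2 - h)
           - f(s1 - h, s2 + h) + f(s1 - h, s2 - h)) / (4 * h ** 2)
    return h11, h12, h22

def is_positive_definite(h11, h12, h22):
    """Symmetric 2x2 criterion used in the proof: tr > 0 and det > 0."""
    return (h11 + h22 > 0) and (h11 * h22 - h12 ** 2 > 0)
```

With μ* = 3 and k* = 2, the (1, 1) entry at s = c should be close to μ* k* / (μ* + k*) = 1.2, in agreement with the first diagonal entry of the Hessian displayed in the proof when E Y = μ* is substituted.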

The proof of Theorem 3.

The proof follows the idea in [22]. First, by the definition of

\hat{θ}

,

\begin{matrix} P (γ (\hat{θ}) - γ (θ^{*})) & \leq P (γ (\hat{θ}) - γ (θ^{*})) + (P_{n} γ (θ^{*}) + λ {∥ θ^{*} ∥}_{ω, 1}) - (P_{n} γ (\hat{θ}) + λ {∥ \hat{θ} ∥}_{ω, 1}) \\ \leq \frac{1}{\sqrt{n}} G_{n} (γ (θ^{*}) - γ (\hat{θ})) + λ (∥ θ^{*} ∥_{ω, 1} - {∥ \hat{θ} ∥}_{ω, 1}) . \end{matrix}

From Theorem A1, we also have

P (γ (\hat{θ}) - γ (θ^{*})) \geq \frac{C_{γ}}{n} \sum_{i = 1}^{n} ∥ Z_{i} (\hat{θ} - θ^{*}) ∥_{2}^{2} = \frac{C_{γ}}{n} \sum_{j = 1}^{2} {∥ X ({\hat{θ}}^{(j)} - θ^{* (j)}) ∥}_{2}^{2} .

Then, by Theorem 1 and the definition of

λ

,

\begin{matrix} C_{γ} \sum_{j = 1}^{2} {∥ X ({\hat{θ}}^{(j)} - θ^{* (j)}) ∥}_{2}^{2} & \leq \sqrt{n} G_{n} (γ (θ^{*}) - γ (\hat{θ})) + n λ (∥ θ^{*} ∥_{ω, 1} - {∥ \hat{θ} ∥}_{ω, 1}) \\ \leq M_{q} {∥ θ^{*} - \hat{θ} ∥}_{1} + (1 + 1 / a) M_{q} (∥ θ^{*} ∥_{ω, 1} - {∥ \hat{θ} ∥}_{ω, 1}) \\ = M_{q} \sum_{j = 1}^{2} [∥ {\hat{θ}}^{(j)} - θ^{* (j)} ∥_{1} + (1 + 1 / a) ω_{j} (∥ θ^{* (j)} ∥_{1} - {∥ {\hat{θ}}^{(j)} ∥}_{1})] \end{matrix}

holds with probability at least

1 - q

, where

a = (K - 1) / 2

. Now, let

J_{1}, J_{2} \subseteq \{1, \dots, p\}

be any sets with

J_{j} \supseteq spt (θ^{* (j)})

. It is easy to check that

\begin{matrix} ∥ {\hat{θ}}^{(j)} - θ^{* (j)} ∥_{1} & + (1 + 1 / a) ω_{j} (∥ θ^{* (j)} ∥_{1} - {∥ {\hat{θ}}^{(j)} ∥}_{1}) \\ = ∥ {\hat{θ}}_{J_{j}}^{(j)} - θ^{* (j)} ∥_{1} + {∥ {\hat{θ}}_{J_{j}^{c}}^{(j)} ∥}_{1} + (1 + 1 / a) ω_{j} (∥ θ^{* (j)} ∥_{1} - ∥ {\hat{θ}}_{J_{j}}^{(j)} ∥_{1} - {∥ {\hat{θ}}_{J_{j}^{c}}^{(j)} ∥}_{1}) \\ \leq (K / a) ∥ {\hat{θ}}_{J_{j}}^{(j)} - θ^{* (j)} ∥_{1} - (1 / a) {∥ {\hat{θ}}_{J_{j}^{c}}^{(j)} ∥}_{1} . \end{matrix}

by the fact that

ω_{j} \in [0, 1]

. This gives that, with probability at least

1 - q

,

\sum_{j = 1}^{2} {∥ X ({\hat{θ}}^{(j)} - θ^{* (j)}) ∥}_{2}^{2} \leq \frac{M_{q}}{a C_{γ}} \sum_{j = 1}^{2} (K ∥ {\hat{θ}}_{J_{j}}^{(j)} - θ^{* (j)} ∥_{1} - {∥ {\hat{θ}}_{J_{j}^{c}}^{(j)} ∥}_{1}) .

(A5)

Let

A_{1}, A_{2} \subseteq \{1, \dots, p\}

satisfy

spt (θ^{* (j)}) \subseteq A_{j}

and

card (A_{j}) = p_{1}

, and let

B_{j}

be the union of

A_{j}

and the indices of the

p_{1}

largest entries (in absolute value) of

{\hat{θ}}^{(j)}

outside

A_{j}

. Then,

A_{j}

and

B_{j}

can both be used as the index set in (A5). In addition, Lemma 1 gives

∥ {\hat{θ}}_{B_{j}^{c}}^{(j)} ∥_{2}^{2} \leq p_{1}^{- 1} {∥ {\hat{θ}}_{A_{j}^{c}}^{(j)} ∥}_{1}^{2} .

Moreover, from the definition of

A_{j}

and

B_{j}

, we know that

∥ {\hat{θ}}_{A_{j}^{c}}^{(j)} ∥_{1} \geq {∥ {\hat{θ}}_{B_{j}^{c}}^{(j)} ∥}_{1}

and

∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥_{1} \leq {∥ {\hat{θ}}_{B_{j}}^{(j)} - θ^{* (j)} ∥}_{1}

.
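The three facts about A_j and B_j used above (the tail bound from Lemma 1 and the two monotonicity relations) can be sanity-checked on random vectors. This is a hedged sketch, not the authors' code: it assumes B_j adds the indices of the p_1 largest entries in absolute value outside A_j, as in the construction of [22], and the function names are hypothetical.

```python
import random


def l1_norm(values):
    return sum(abs(v) for v in values)


def l2_norm_sq(values):
    return sum(v * v for v in values)


def enlarge_support(theta_hat, A, p1):
    """Return B = A plus the indices of the p1 largest |theta_hat| outside A."""
    outside = [i for i in range(len(theta_hat)) if i not in A]
    largest = sorted(outside, key=lambda i: abs(theta_hat[i]), reverse=True)[:p1]
    return set(A) | set(largest)


def check_support_facts(p, p1, seed):
    """Check the tail bound and the two monotonicity facts on one random draw."""
    rng = random.Random(seed)
    A = set(rng.sample(range(p), p1))            # contains the support of theta_star
    theta_star = [rng.gauss(0, 1) if i in A else 0.0 for i in range(p)]
    theta_hat = [rng.gauss(0, 1) for _ in range(p)]
    B = enlarge_support(theta_hat, A, p1)

    off_A = [theta_hat[i] for i in range(p) if i not in A]
    off_B = [theta_hat[i] for i in range(p) if i not in B]
    err_A = l1_norm(theta_hat[i] - theta_star[i] for i in A)
    err_B = l1_norm(theta_hat[i] - theta_star[i] for i in B)

    tol = 1e-12  # guards against floating-point rounding only
    return (l2_norm_sq(off_B) <= l1_norm(off_A) ** 2 / p1 + tol
            and l1_norm(off_B) <= l1_norm(off_A) + tol
            and err_A <= err_B + tol)


if __name__ == "__main__":
    print(all(check_support_facts(50, 5, seed) for seed in range(100)))
```

A random check of this kind does not replace Lemma 1; it only helps confirm that the sets are built the way the inequalities require.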

Unlike the single-lasso problem, here we need to define

I : = \{j = 1, 2 : K ∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥_{1} \geq ∥ {\hat{θ}}_{A_{j}^{c}}^{(j)} ∥_{1}\}

, and consider

j \in I

and

j \notin I

separately. Obviously,

I \neq \emptyset

, since otherwise the right-hand side of (A5) would be negative while its left-hand side is nonnegative. For

j \in I

, we have

K ∥ {\hat{θ}}_{B_{j}}^{(j)} - θ^{* (j)} ∥_{1} - ∥ {\hat{θ}}_{B_{j}^{c}}^{(j)} ∥_{1} \geq K ∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥_{1} - {∥ {\hat{θ}}_{A_{j}^{c}}^{(j)} ∥}_{1} \geq 0 .

Then, by the restricted eigenvalue condition,

n κ^{2} ∥ {\hat{θ}}_{J_{j}}^{(j)} - θ^{* (j)} ∥_{2}^{2} \leq {∥ X ({\hat{θ}}^{(j)} - θ^{* (j)}) ∥}_{2}^{2}

holds for

J_{j} = A_{j}

or

J_{j} = B_{j}

. Note that from (A5),

\sum_{j \in I} {∥ X ({\hat{θ}}^{(j)} - θ^{* (j)}) ∥}_{2}^{2} \leq \frac{M_{q}}{a C_{γ}} \sum_{j \in I} (K ∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥_{1} - {∥ {\hat{θ}}_{A_{j}^{c}}^{(j)} ∥}_{1}) \leq \frac{M_{q}}{a C_{γ}} \sum_{j \in I} (K ∥ {\hat{θ}}_{B_{j}}^{(j)} - θ^{* (j)} ∥_{1} - {∥ {\hat{θ}}_{B_{j}^{c}}^{(j)} ∥}_{1}),

then, by the Cauchy–Schwarz inequality,

\begin{matrix} n κ^{2} \sum_{j \in I} {∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥}_{2}^{2} & \leq \sum_{j \in I} ∥ X ({\hat{θ}}^{(j)} - θ^{* (j)}) ∥_{2}^{2} \leq \frac{M_{q} K}{a C_{γ}} \sum_{j \in I} {∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥}_{1} \\ \leq \frac{M_{q} K \sqrt{p_{1}}}{a C_{γ}} \sum_{j \in I} {∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥}_{2} \leq \frac{M_{q} K \sqrt{2 p_{1}}}{a C_{γ}} {[\sum_{j \in I} {∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥}_{2}^{2}]}^{1 / 2} . \end{matrix}

This gives

\sum_{j \in I} ∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥_{2}^{2} \leq \frac{2 p_{1} M_{q}^{2} K^{2}}{a^{2} κ^{4} n^{2} C_{γ}^{2}}, \sum_{j \in I} {∥ {\hat{θ}}_{B_{j}}^{(j)} - θ^{* (j)} ∥}_{2}^{2} \leq \frac{4 p_{1} M_{q}^{2} K^{2}}{a^{2} κ^{4} n^{2} C_{γ}^{2}},

where we use the fact that

card (B_{j}) = 2 p_{1}

. Furthermore, because

\begin{matrix} \sum_{j \in I} ∥ {\hat{θ}}_{B_{j}^{c}}^{(j)} ∥_{2}^{2} \leq \sum_{j \in I} p_{1}^{- 1} ∥ {\hat{θ}}_{A_{j}^{c}}^{(j)} ∥_{1}^{2} \leq \frac{K^{2}}{p_{1}} \sum_{j \in I} ∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥_{1}^{2} \leq K^{2} \sum_{j \in I} ∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥_{2}^{2}, \end{matrix}

we can conclude that

\begin{matrix} \sum_{j \in I} {∥ {\hat{θ}}^{(j)} - θ^{* (j)} ∥}_{2}^{2} & = \sum_{j \in I} (∥ {\hat{θ}}_{B_{j}}^{(j)} - θ^{* (j)} ∥_{2}^{2} + {∥ {\hat{θ}}_{B_{j}^{c}}^{(j)} ∥}_{2}^{2}) \\ \leq \sum_{j \in I} (∥ {\hat{θ}}_{B_{j}}^{(j)} - θ^{* (j)} ∥_{2}^{2} + K^{2} {∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥}_{2}^{2}) \leq \frac{2 p_{1} M_{q}^{2} (2 + K^{2}) K^{2}}{a^{2} κ^{4} n^{2} C_{γ}^{2}} . \end{matrix}

(A6)
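The algebra leading to (A6) can be spelled out as follows. The shorthand S_A and c is introduced only for this remark and does not appear in the original proof.

```latex
% Shorthand (introduced here for illustration only)
S_A := \sum_{j \in I} \bigl\| \hat{\theta}^{(j)}_{A_j} - \theta^{*(j)} \bigr\|_2^2 ,
\qquad
c := \frac{M_q K}{a C_\gamma} .

% The Cauchy--Schwarz chain reads n \kappa^2 S_A \le c \sqrt{2 p_1}\, S_A^{1/2}; dividing by S_A^{1/2} gives
S_A^{1/2} \le \frac{c \sqrt{2 p_1}}{n \kappa^2}
\quad \Longrightarrow \quad
S_A \le \frac{2 p_1 M_q^2 K^2}{a^2 \kappa^4 n^2 C_\gamma^2} .

% Adding the bound over B_j to K^2 times the bound over A_j gives the constant in (A6)
\frac{4 p_1 M_q^2 K^2}{a^2 \kappa^4 n^2 C_\gamma^2}
+ K^2 \cdot \frac{2 p_1 M_q^2 K^2}{a^2 \kappa^4 n^2 C_\gamma^2}
= \frac{2 p_1 M_q^2 (2 + K^2) K^2}{a^2 \kappa^4 n^2 C_\gamma^2} .
```

The division step is harmless when S_A = 0, since the bound then holds trivially.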

Now, we tackle the case

j \notin I

. For

j \notin I

,

K ∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥_{1} < {∥ {\hat{θ}}_{A_{j}^{c}}^{(j)} ∥}_{1}

. Again from (A5), we have

\sum_{j \notin I} ∥ X ({\hat{θ}}^{(j)} - θ^{* (j)}) ∥_{2}^{2} \leq \frac{M_{q} K}{a C_{γ}} \sum_{j \in I} {∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥}_{1}

and

0 \leq \sum_{j \notin I} (∥ {\hat{θ}}_{A_{j}^{c}}^{(j)} ∥_{1} - K {∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥}_{1}) \leq K \sum_{j \in I} {∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥}_{1} .

Indeed, if the first inequality failed, then (A5) would give

\sum_{j \in I} {∥ X ({\hat{θ}}^{(j)} - θ^{* (j)}) ∥}_{2}^{2} \leq \frac{M_{q}}{a C_{γ}} [\sum_{j \notin I} (K ∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥_{1} - {∥ {\hat{θ}}_{A_{j}^{c}}^{(j)} ∥}_{1}) - \sum_{j \in I} {∥ {\hat{θ}}_{A_{j}^{c}}^{(j)} ∥}_{1}] < 0,

and if the second failed, then (A5) would give

\sum_{j = 1}^{2} ∥ X ({\hat{θ}}^{(j)} - θ^{* (j)}) ∥_{2}^{2} \leq - \frac{M_{q}}{a C_{γ}} \sum_{j \in I} {∥ {\hat{θ}}_{A_{j}^{c}}^{(j)} ∥}_{1} < 0 .

Both conclusions are impossible because the left-hand sides are nonnegative. Once again, by the Cauchy–Schwarz inequality,

\sum_{j \in I} ∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥_{1} \leq \sqrt{p_{1}} \sum_{j \in I} {∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥}_{2} \leq \sqrt{2 p_{1}} {[\sum_{j \in I} {∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥}_{2}^{2}]}^{1 / 2} \leq \frac{2 p_{1} M_{q} K}{a κ^{2} n C_{γ}} .

Denote

Δ_{j} : = ∥ {\hat{θ}}_{A_{j}^{c}}^{(j)} ∥_{1} - K {∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥}_{1}

. Then, for

j \notin I

,

Δ_{j} > 0

, and

\sum_{j \notin I} Δ_{j} \leq K \sum_{j \in I} {∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥}_{1} \leq \frac{2 p_{1} M_{q} K^{2}}{a κ^{2} n C_{γ}} .

For any

j \notin I

, define

{\tilde{θ}}^{(j)} = {\hat{θ}}^{(j)} + \frac{Δ_{j}}{p_{1} K} \sum_{k \in A_{j}} sgn ({\hat{θ}}_{k}^{(j)} - θ_{k}^{* (j)}) e_{k} .

Then, for

k \in A_{j}

,

| {\tilde{θ}}_{k}^{(j)} - θ_{k}^{* (j)} | = | {\hat{θ}}_{k}^{(j)} - θ_{k}^{* (j)} | + \frac{Δ_{j}}{p_{1} K},

while for

k \notin A_{j}

,

{\tilde{θ}}_{k}^{(j)} = {\hat{θ}}_{k}^{(j)}

. Therefore,

K ∥ {\tilde{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥_{1} = K [∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥_{1} + \sum_{k \in A_{j}} \frac{Δ_{j}}{p_{1} K}] = ∥ {\hat{θ}}_{A_{j}^{c}}^{(j)} ∥_{1} = {∥ {\tilde{θ}}_{A_{j}^{c}}^{(j)} ∥}_{1},

and consequently

∥ {\tilde{θ}}_{B_{j}^{c}}^{(j)} ∥_{1} \leq K {∥ {\tilde{θ}}_{B_{j}}^{(j)} - θ^{* (j)} ∥}_{1}

. Once again, by the restricted eigenvalue condition,

∥ X ({\tilde{θ}}^{(j)} - θ^{* (j)}) ∥_{2}^{2} \geq n κ^{2} ∥ {\tilde{θ}}_{B_{j}}^{(j)} - θ^{* (j)} ∥_{2}^{2} \geq n κ^{2} {∥ {\tilde{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥}_{2}^{2} .

(A7)
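The shifted vector constructed above can be illustrated with a small numerical sketch. It assumes, as a convention not stated in the text, that sgn(0) = +1 so that every coordinate in A_j moves away from the true value; the function name `shift_estimate` and the toy numbers are hypothetical.

```python
def sgn(x: float) -> float:
    # Assumed convention: sgn(0) = +1, so each |theta_tilde_k - theta_star_k| grows by the step.
    return 1.0 if x >= 0 else -1.0


def shift_estimate(theta_hat, theta_star, A, K):
    """Build theta_tilde = theta_hat + Delta/(p1*K) * sum_{k in A} sgn(theta_hat_k - theta_star_k) e_k."""
    p1 = len(A)
    err_A = sum(abs(theta_hat[k] - theta_star[k]) for k in A)
    off_A = sum(abs(theta_hat[k]) for k in range(len(theta_hat)) if k not in A)
    delta = off_A - K * err_A                      # Delta_j in the text
    if delta <= 0:
        raise ValueError("the construction is only needed when Delta_j > 0")
    step = delta / (p1 * K)
    theta_tilde = list(theta_hat)
    for k in A:                                    # coordinates outside A are left unchanged
        theta_tilde[k] += step * sgn(theta_hat[k] - theta_star[k])
    return theta_tilde, delta


if __name__ == "__main__":
    theta_star = [1.0, -2.0, 0.0, 0.0, 0.0]
    theta_hat = [1.1, -1.8, 0.5, -0.7, 0.3]
    A, K = {0, 1}, 3.0
    theta_tilde, delta = shift_estimate(theta_hat, theta_star, A, K)
    lhs = K * sum(abs(theta_tilde[k] - theta_star[k]) for k in A)
    rhs = sum(abs(theta_tilde[k]) for k in range(5) if k not in A)
    print(theta_tilde, delta, lhs, rhs)
```

In this toy example the two printed quantities `lhs` and `rhs` agree up to rounding, which is the identity that places the shifted vector on the boundary of the cone required by the restricted eigenvalue condition.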

On the other hand, since for any

s, t \in R^{m}

the inequalities

{∥ s + t ∥}_{2}^{2} \leq 2 ({∥ s ∥}_{2}^{2} + {∥ t ∥}_{2}^{2})

and

{∥ s ∥}_{2} \leq {∥ s ∥}_{1} \leq \sqrt{m} {∥ s ∥}_{2}

hold, we conclude

\begin{matrix} \sum_{j \notin I} {∥ X ({\tilde{θ}}^{(j)} - θ^{* (j)}) ∥}_{2}^{2} & \leq 2 \sum_{j \notin I} (∥ X ({\hat{θ}}^{(j)} - θ^{* (j)}) ∥_{2}^{2} + {∥ X ({\hat{θ}}^{(j)} - {\tilde{θ}}^{(j)}) ∥}_{2}^{2}) \\ \leq \frac{2 M_{q} K}{a C_{γ}} \sum_{j \in I} ∥ {\hat{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥_{1} + 2 \sum_{j \notin I} {∥ X ({\hat{θ}}^{(j)} - {\tilde{θ}}^{(j)}) ∥}_{2}^{2} \\ \leq \frac{4 p_{1} M_{q}^{2} K^{2}}{n a^{2} κ^{2} C_{γ}^{2}} + 2 \sum_{j \notin I} {∥ X ({\hat{θ}}^{(j)} - {\tilde{θ}}^{(j)}) ∥}_{2}^{2} . \end{matrix}

(A8)
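The two elementary norm inequalities used to obtain (A8) are easy to verify numerically. The snippet below is a minimal sketch with hypothetical helper names; it checks the inequalities on a single pair of vectors and is not part of the original argument.

```python
import math


def norm1(v):
    return sum(abs(x) for x in v)


def norm2(v):
    return math.sqrt(sum(x * x for x in v))


def check_norm_facts(s, t, tol=1e-12):
    """Check ||s+t||_2^2 <= 2(||s||_2^2 + ||t||_2^2) and ||s||_2 <= ||s||_1 <= sqrt(m) ||s||_2."""
    m = len(s)
    total = [a + b for a, b in zip(s, t)]
    first = norm2(total) ** 2 <= 2 * (norm2(s) ** 2 + norm2(t) ** 2) + tol
    second = norm2(s) <= norm1(s) + tol and norm1(s) <= math.sqrt(m) * norm2(s) + tol
    return first and second


if __name__ == "__main__":
    print(check_norm_facts([1.0, -2.0, 0.5], [0.3, 0.3, -4.0]))
```

The tolerance `tol` only absorbs floating-point rounding; equality in the second chain occurs for vectors with one nonzero entry (left) or with entries of equal magnitude (right).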

Next, we will use the definition of the

p_{1}

-restricted isometry constant

σ_{X, p_{1}}^{2}

. Because

card (spt ({\tilde{θ}}^{(j)} - {\hat{θ}}^{(j)})) \leq card (A_{j}) = p_{1}

, we have

\begin{matrix} \sum_{j \notin I} {∥ X ({\hat{θ}}^{(j)} - {\tilde{θ}}^{(j)}) ∥}_{2}^{2} & \leq σ_{X, p_{1}}^{2} \sum_{j \notin I} {∥ {\hat{θ}}^{(j)} - {\tilde{θ}}^{(j)} ∥}_{2}^{2} \\ = σ_{X, p_{1}}^{2} \sum_{j \notin I} \sum_{k \in A_{j}} {(\frac{Δ_{j}}{p_{1} K})}^{2} = \frac{σ_{X, p_{1}}^{2}}{p_{1} K^{2}} \sum_{j \notin I} Δ_{j}^{2} \\ \leq \frac{σ_{X, p_{1}}^{2}}{p_{1} K^{2}} {(\sum_{j \notin I} Δ_{j})}^{2} \leq \frac{4 p_{1} σ_{X, p_{1}}^{2} M_{q}^{2} K^{2}}{a^{2} κ^{4} n^{2} C_{γ}^{2}} . \end{matrix}

The above inequality together with (A7) and (A8) gives

\sum_{j \notin I} ∥ {\tilde{θ}}_{A_{j}}^{(j)} - θ^{* (j)} ∥_{2}^{2} \leq \sum_{j \notin I} {∥ {\tilde{θ}}_{B_{j}}^{(j)} - θ^{* (j)} ∥}_{2}^{2} \leq \frac{4 p_{1} (n κ^{2} + 2 σ_{X, p_{1}}^{2}) M_{q}^{2} K^{2}}{a^{2} C_{γ}^{2} n^{3} κ^{6}} .

Finally, because

∥ {\tilde{θ}}_{B_{j}^{c}}^{(j)} ∥_{2}^{2} \leq ∥ {\tilde{θ}}_{B_{j}^{c}}^{(j)} ∥_{1}^{2} \leq K^{2} ∥ {\tilde{θ}}_{B_{j}}^{(j)} - θ^{* (j)} ∥_{1}^{2} \leq 2 p_{1} K^{2} {∥ {\tilde{θ}}_{B_{j}}^{(j)} - θ^{* (j)} ∥}_{2}^{2},

we obtain that

\begin{matrix} \sum_{j \notin I} {∥ {\hat{θ}}^{(j)} - θ^{* (j)} ∥}_{2}^{2} & \leq \sum_{j \notin I} {∥ {\tilde{θ}}^{(j)} - θ^{* (j)} ∥}_{2}^{2} = \sum_{j \notin I} (∥ {\tilde{θ}}_{B_{j}}^{(j)} - θ^{* (j)} ∥_{2}^{2} + {∥ {\tilde{θ}}_{B_{j}^{c}}^{(j)} ∥}_{2}^{2}) \\ \leq (1 + 2 p_{1} K^{2}) \sum_{j \notin I} {∥ {\tilde{θ}}_{B_{j}}^{(j)} - θ^{* (j)} ∥}_{2}^{2} \leq \frac{4 p_{1} (1 + 2 p_{1} K^{2}) (n κ^{2} + 2 σ_{X, p_{1}}^{2}) M_{q}^{2} K^{2}}{a^{2} C_{γ}^{2} n^{3} κ^{6}} . \end{matrix}

(A9)

Combining (A6) and (A9) gives the claimed bound. □

References

1. Dai, H.; Bao, Y.; Bao, M. Maximum likelihood estimate for the dispersion parameter of the negative binomial distribution. Stat. Probab. Lett. 2013, 83, 21–27.
2. Allison, P.D.; Waterman, R.P. Fixed–effects negative binomial regression models. Sociol. Methodol. 2002, 32, 247–265.
3. Hilbe, J.M. Negative Binomial Regression; Cambridge University Press: Cambridge, UK, 2011.
4. Weißbach, R.; Radloff, L. Consistency for the negative binomial regression with fixed covariate. Metrika 2020, 83, 627–641.
5. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288.
6. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
7. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2005, 67, 301–320.
8. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
9. Qiu, Y.; Chen, S.X.; Nettleton, D. Detecting rare and faint signals via thresholding maximum likelihood estimators. Ann. Stat. 2018, 46, 895–923.
10. Xie, F.; Xiao, Z. Consistency of l1 penalized negative binomial regressions. Stat. Probab. Lett. 2020, 165, 108816.
11. Li, Y.; Rahman, T.; Ma, T.; Tang, L.; Tseng, G.C. A sparse negative binomial mixture model for clustering RNA-seq count data. Biostatistics 2021, kxab025.
12. Jankowiak, M. Fast Bayesian Variable Selection in Binomial and Negative Binomial Regression. arXiv 2021, arXiv:2106.14981.
13. Lisawadi, S.; Ahmed, S.; Reangsephet, O. Post estimation and prediction strategies in negative binomial regression model. Int. J. Model. Simul. 2021, 41, 463–477.
14. Zhang, H.; Jia, J. Elastic-net Regularized High-dimensional Negative Binomial Regression: Consistency and Weak Signals Detection. Stat. Sin. 2022, 32, 181–207.
15. Xu, D.; Zhang, Z.; Wu, L. Variable selection in high-dimensional double generalized linear models. Stat. Pap. 2014, 55, 327–347.
16. Yee, T.W. Vector Generalized Linear and Additive Models: With an Implementation in R; Springer: Berlin/Heidelberg, Germany, 2015.
17. Nguelifack, B.M.; Kemajou-Brown, I. Robust rank-based variable selection in double generalized linear models with diverging number of parameters under adaptive Lasso. J. Stat. Comput. Simul. 2019, 89, 2051–2072.
18. Cavalaro, L.L.; Pereira, G.H. A procedure for variable selection in double generalized linear models. J. Stat. Comput. Simul. 2022, 1–18.
19. Wang, Z.; Ma, S.; Zappitelli, M.; Parikh, C.; Wang, C.Y.; Devarajan, P. Penalized count data regression with application to hospital stay after pediatric cardiac surgery. Stat. Methods Med. Res. 2016, 25, 2685–2703.
20. Huang, H.; Zhang, H.; Li, B. Weighted Lasso estimates for sparse logistic regression: Non-asymptotic properties with measurement errors. Acta Math. Sci. 2021, 41, 207–230.
21. Adamczak, R. A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electron. J. Probab. 2008, 13, 1000–1034.
22. Bickel, P.J.; Ritov, Y.; Tsybakov, A.B. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 2009, 37, 1705–1732.
23. Candes, E.; Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat. 2007, 35, 2313–2351.
24. Riphahn, R.T.; Wambach, A.; Million, A. Incentive effects in the demand for health care: A bivariate panel count data estimation. J. Appl. Econom. 2003, 18, 387–405.
25. Yang, X.; Song, S.; Zhang, H. Law of iterated logarithm and model selection consistency for generalized linear models with independent and dependent responses. Front. Math. China 2021, 16, 825–856.
26. Shi, C.; Song, R.; Chen, Z.; Li, R. Linear hypothesis testing for high dimensional generalized linear models. Ann. Stat. 2019, 47, 2671.
27. Xie, F.; Lederer, J. Aggregating Knockoffs for False Discovery Rate Control with an Application to Gut Microbiome Data. Entropy 2021, 23, 230.
28. Cui, C.; Jia, J.; Xiao, Y.; Zhang, H. Directional FDR Control for Sub-Gaussian Sparse GLMs. arXiv 2021, arXiv:2105.00393.
29. Bateman, H. Higher Transcendental Functions [Volumes i–iii]; McGraw-Hill Book Company: New York, NY, USA, 1953; Volume 1.
30. Alzer, H. On some inequalities for the gamma and psi functions. Math. Comput. 1997, 66, 373–389.
31. Zhang, H.; Chen, S.X. Concentration inequalities for statistical inference. Commun. Math. Res. 2021, 37, 1–85.
32. Moriguchi, S.; Murota, K.; Tamura, A.; Tardella, F. Discrete midpoint convexity. Math. Oper. Res. 2020, 45, 99–128.
33. Sen, B. A Gentle Introduction to Empirical Process Theory and Applications; Columbia University: New York, NY, USA, 2018.
34. Chi, Z. Stochastic Lipschitz continuity for high dimensional Lasso with multiple linear covariate structures or hidden linear covariates. arXiv 2010, arXiv:1011.1384.
35. Ledoux, M.; Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
36. Massart, P. Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse Math. 2000, 9, 245–303.
37. Xiao, Y.; Yan, T.; Zhang, H.; Zhang, Y. Oracle inequalities for weighted group lasso in high-dimensional misspecified Cox models. J. Inequalities Appl. 2020, 2020, 1–33.
38. Abramovich, F.; Grinshtein, V. Model selection and minimax estimation in generalized linear models. IEEE Trans. Inf. Theory 2016, 62, 3721–3730.

Table 1. The average squared estimation errors of the estimators.

n	$ρ = 0$			$ρ = 0.5$
n	${\hat{θ}}^{(1) *}$	${\hat{θ}}^{(1)}$	${\hat{θ}}^{(2)}$	${\hat{θ}}^{(1) *}$	${\hat{θ}}^{(1)}$	${\hat{θ}}^{(2)}$
100	0.1597	0.0335	0.72414	0.1809	0.0397	0.68904
200	0.0862	0.01	0.22149	0.0837	0.0169	0.33048
400	0.05	0.0047	0.08847	0.0619	0.0067	0.15066

Table 2. The results of variable selection.

		Previous Method				Proposed Method
		$μ (x)$				$μ (x)$				$k (x)$
n	p	$θ_{1}^{(1)}$	$θ_{2}^{(1)}$	$θ_{3}^{(1)}$	Other $θ^{(1)}$ s	$θ_{1}^{(1)}$	$θ_{2}^{(1)}$	$θ_{3}^{(1)}$	Other $θ^{(1)}$ s	$θ_{1}^{(2)}$	$θ_{2}^{(2)}$	$θ_{3}^{(2)}$	Other $θ^{(2)}$ s
$ρ = 0$
100	25	173	198	171	2.33	192	200	190	0.37	180	184	180	0.32
	50	164	197	147	2.885	196	200	193	0.52	182	180	188	0.41
	150	136	182	111	2.725	194	194	192	1.02	188	182	186	0.41
200	50	196	200	192	1.435	200	200	200	0.59	200	190	198	0.53
	100	193	200	193	2.05	200	200	200	0.91	196	186	196	0.69
	250	162	198	155	1.5	199	199	198	1.18	198	198	198	0.69
400	100	200	200	200	0.605	200	200	200	0.4	200	198	200	0.55
	200	200	200	199	0.88	200	200	200	0.6	200	200	200	0.51
	500	197	200	198	1.29	200	200	200	1.21	200	200	200	0.61
$ρ = 0.5$
100	25	183	199	179	2.3	194	198	194	0.41	179	184	180	0.35
	50	172	197	150	2.66	196	196	190	0.63	178	182	180	0.42
	150	134	191	99	2.32	194	196	192	1.01	180	184	182	0.43
200	50	195	200	197	1.48	200	200	198	0.38	196	183	190	0.32
	100	189	200	179	1.52	199	200	198	0.53	194	186	194	0.44
	250	178	200	154	1.39	196	198	196	1.1	196	196	194	0.55
400	100	200	200	200	0.435	200	200	200	0.28	200	199	194	0.34
	200	200	200	199	0.675	200	200	198	0.47	200	198	196	0.36
	500	199	200	194	1.12	200	200	198	1.07	200	198	196	0.56

Table 3. The variable selection results and the fitting errors (FE) of NBR and HNBR models. The variable Others = {Married, Haupts, Reals, Fachhs, Abitur, Univ, Working, Bluec, Whitec, Self, Beamt, Public, Addon}. Because these variables are not selected in any year, we put them in “Others” for brevity.

Variables	1984			1985			1986			1987
	NBR	HNBR		NBR	HNBR		NBR	HNBR		NBR	HNBR
		$μ (x)$	$k (x)$		$μ (x)$	$k (x)$		$μ (x)$	$k (x)$		$μ (x)$	$k (x)$
Female	0	0	0	0	0	0	0	0	0	0	0	0
Age	−0.013	−0.013	−0.012	−0.009	−0.01	−0.007	−0.006	−0.006	−0.013	−0.002	−0.001	−0.018
Hsat	−0.205	−0.2	−0.025	−0.244	−0.237	0	−0.188	−0.195	−0.045	−0.158	−0.153	−0.043
Handdum	0	0	0	0	0	0	0	0	0	0	0	0
Handper	0.005	0.005	0.004	0.007	0.006	0.007	0.007	0.007	0	0.007	0.007	0.01
Hhninc	0	0	0	0	0	0	0	0	0	0	0	0
Hhkids	0	0	0	0	0	0	0	0	0	0	0	0
Educ	0	0	−0.027	0	0	−0.064	−0.035	−0.038	0	−0.095	−0.106	−0.003
Others	0	0	0	0	0	0	0	0	0	0	0	0
FE	0.798	0.602		2.203	1.874		0.735	0.581		1.314	1.027
Variables	1988			1991			1994
Variables	NBR	HNBR		NBR	HNBR		NBR	HNBR
		$μ (x)$	$k (x)$		$μ (x)$	$k (x)$		$μ (x)$	$k (x)$
Female	0	0	0	0	0	0	0	0	0
Age	−0.015	−0.014	−0.012	−0.022	−0.019	−0.003	−0.005	−0.004	−0.011
Hsat	−0.191	−0.187	−0.015	−0.112	−0.132	−0.049	−0.226	−0.224	−0.06
Handdum	0	0	0	0	0	0	0	0	0
Handper	0.011	0.009	0.006	0.014	0.013	0	0.007	0.008	0.004
Hhninc	0	0	0	0	0	0	0	0	0
Hhkids	0	0	0	0	0	0	0	0	0
Educ	−0.016	−0.023	−0.002	−0.074	−0.068	0	−0.064	−0.069	0
Others	0	0	0	0	0	0	0	0	0
FE	1.144	0.912		1.007	0.787		0.713	0.58

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, S.; Wei, H.; Lei, X. Heterogeneous Overdispersed Count Data Regressions via Double-Penalized Estimations. Mathematics 2022, 10, 1700. https://doi.org/10.3390/math10101700

