Abstract
Recently, high-dimensional negative binomial regression (NBR) for count data has been widely used in many scientific fields. However, most studies assume the dispersion parameter to be a constant, which may not hold in practice. This paper studies variable selection and dispersion estimation for heterogeneous NBR models, which model the dispersion parameter as a function. Specifically, we propose a double regression and apply a double ℓ1-penalty to both regressions. Under restricted eigenvalue conditions, we prove oracle inequalities for the lasso estimators of the two partial regression coefficients for the first time, using concentration inequalities of empirical processes. Furthermore, the consistency and convergence rates derived from the oracle inequalities provide theoretical guarantees for further statistical inference. Finally, both simulations and a real data analysis demonstrate that the new methods are effective.
Keywords:
negative binomial regressions; heterogeneous count data regression; estimation of dispersion parameter; oracle inequalities
MSC:
62E17; 62E20; 62F07
1. Introduction
In many scientific fields, such as biomedical science, ecology, and economics, experimental and observational studies often yield count data, a type of data in which the observations take only non-negative integer values. Poisson regression models are commonly used for count data. However, they rely on the restrictive assumption that the variance equals the mean. For many count data, the variance is often larger than the mean [1], a phenomenon called overdispersion. Because the Poisson regression model is invalid under overdispersion, a more general and flexible regression model, the negative binomial regression, has attracted considerable research attention and become popular for analyzing count data [2,3,4].
With the advance of modern data collection techniques, high-dimensional data are becoming increasingly common in scientific studies. Widely used estimators for high-dimensional parameters include the lasso [5], the SCAD [6], the elastic net [7], the adaptive lasso [8], and so on. Recently, there has been much research on the high-dimensional NBR model, such as [9,10,11,12,13,14]. All of these works assume the dispersion parameter to be a constant. In practice, however, not all models satisfy this assumption. If the dispersion parameter is wrongly assumed to be a constant, the estimation of the mean regression performs poorly, as shown in the simulation in Section 4.1; hence the need to model the dispersion parameter as a function of some covariates. The heterogeneous negative binomial regression (HNBR) extends the NBR by an observation-specific parameterization of the dispersion parameter [3]. The HNBR is a valuable tool for assessing the source of overdispersion. It belongs to the double generalized linear models (DGLMs) or vector generalized linear models (VGLMs), which are very useful for fitting more complex and potentially more realistic models [15,16,17,18]. However, it appears that there is no study on selecting the dispersion explanatory variables in the HNBR model.
In this paper, we study variable selection and dispersion estimation for heterogeneous NBR models. To the best of our knowledge, this is the first such study in the literature. Specifically, we propose a double regression to estimate the coefficients of the NB dispersion and the NBR simultaneously. Because of the high dimension of the covariates, we apply a double penalty to both regressions. The two tuning parameters are set to be different because the first-order conditions for estimating the regression coefficients are entirely different from those for estimating the dispersion parameters. We construct an algorithm that performs variable selection and dispersion estimation simultaneously. Similar studies on high-dimensional NBR models include [19], which assumed the dispersion parameter to be a constant. Their method requires an iterative algorithm that estimates the mean regression and the dispersion alternately and implements a lasso in each iteration. If there are many iterations, such an algorithm wastes computing resources.
The rest of the paper is organized as follows. Section 2 introduces the heterogeneous overdispersed count data model and defines the double ℓ1-penalized estimators for the mean and dispersion regressions. We then use a technique called the stochastic Lipschitz condition to derive the asymptotic results in Section 3. Simulation studies and a real data application are given in Section 4. Finally, Section 5 concludes the article with a discussion. All proofs and technical details are provided in Appendix A.
2. Double ℓ1-Penalized NBR
2.1. Heterogeneous Overdispersed Count Data Regressions
Suppose we have n count responses and p-dimensional covariates , . For the Poisson regression models, the response obeys the Poisson distribution
with , we require that the positive parameter is related to a linear combination of p covariates. A plausible assumption for the link function is . It is worth noting that
For the traditional negative binomial regression, it assumes that the count data response obeys the NB distribution with overdispersion:
with and k is an unknown quantification of the overdispersion level. When , we have , the Poisson regression for the mean parameter . Thus, the Poisson regression is a limiting case of the negative binomial regression when the dispersion parameter k tends to infinity.
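As an illustration (not from the original text), the overdispersion relation Var(Y) = μ + μ²/k and the Poisson limit can be checked numerically under the common NB2 parameterization; the specific values of μ and k below are assumptions for the example:

```python
import numpy as np
from scipy import stats

# NB2 parameterization: Y ~ NB with mean mu and dispersion k, so that
# Var(Y) = mu + mu^2 / k > mu (overdispersion). scipy's nbinom(n, p)
# matches this with n = k and p = k / (k + mu).
mu, k = 4.0, 2.0
nb = stats.nbinom(n=k, p=k / (k + mu))

assert abs(nb.mean() - mu) < 1e-9
assert abs(nb.var() - (mu + mu**2 / k)) < 1e-9  # variance 12 > mean 4

# Poisson limit: as k grows, the variance approaches the mean.
nb_large_k = stats.nbinom(n=1e6, p=1e6 / (1e6 + mu))
assert abs(nb_large_k.var() - mu) < 1e-3
```

The check with a very large dispersion parameter mirrors the limiting-case statement above: the NB variance collapses onto the Poisson variance.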
In the heterogeneous negative binomial regression, k is proposed as a specific parameterization, i.e., . More specifically, we assume in this paper that
For notation simplicity, we denote
for any measurable and integrable function f.
Let , the log-likelihood is
We use the negative log-likelihood as the loss function , and define
Denote , , the score function for is
Furthermore, fix , the score function for is
It is easy to verify that
Thus, from now, we will suppose the true value of parameter is .
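Since the displayed formulas are not reproduced here, the following sketch gives one concrete form of the per-observation negative log-likelihood under the NB2 parameterization with log links for both the mean and the dispersion; the link choices are assumptions for illustration, cross-checked against scipy's NB log-pmf:

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

def hnbr_negloglik(beta, gamma, X, y):
    """Average negative log-likelihood of the heterogeneous NBR, assuming
    mu_i = exp(x_i' beta) and k_i = exp(x_i' gamma) (NB2 with log links)."""
    mu = np.exp(X @ beta)  # mean regression
    k = np.exp(X @ gamma)  # dispersion regression
    ll = (gammaln(y + k) - gammaln(k) - gammaln(y + 1)
          + k * np.log(k / (k + mu)) + y * np.log(mu / (k + mu)))
    return -np.mean(ll)

# Cross-check the formula against scipy's NB log-pmf on toy data.
X = np.array([[1.0, 0.5], [1.0, -0.3], [1.0, 0.1]])
y = np.array([2, 0, 5])
beta, gamma = np.array([0.2, 0.1]), np.array([0.1, -0.2])
mu, k = np.exp(X @ beta), np.exp(X @ gamma)
ref = -np.mean(stats.nbinom.logpmf(y, n=k, p=k / (k + mu)))
assert abs(hnbr_negloglik(beta, gamma, X, y) - ref) < 1e-10
```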
2.2. Heterogeneous Overdispersed NBR via Double Penalty
The weighted lasso estimator under our circumstance is defined as
where is the tuning parameter and the weighted norm is defined by
and is the weight, and denotes the ℓ1-norm. This technique is also used in [20]. Equation (2) is a weighted double ℓ1-penalized problem, which is a kind of convex penalized optimization; when , it reduces to a single-penalized problem. In this paper, we use different and , since the first-order conditions for estimating the regression coefficients are entirely different from those for estimating the dispersion parameters, and take .
Because the weighted lasso estimator has no closed-form solution, we need to use iterative methods such as quasi-Newton or coordinate descent methods. We use the BIC to choose the tuning parameters and .
where k is the number of nonzero estimated coefficients. To illustrate the algorithm explicitly, we rewrite as and define , . Converting into turns the double ℓ1-penalized problem into a single-penalized one, which can be solved through R packages such as “lbfgs”. The algorithm is formally given in Algorithm 1.
Algorithm 1 Double ℓ1-Penalized Optimization
Input: the set of tuning parameters
Output: the estimate
for , do
  let ;
  solve ;
  obtain the estimate ;
  compute ;
end for
find ;
return
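The paper solves the optimization via the R package “lbfgs”; purely as an illustrative sketch (not the authors' implementation), Algorithm 1's loop can be mimicked with a simple proximal-gradient (ISTA) solver for the stacked double ℓ1-penalized problem. The NB2 log-link likelihood, step size, iteration count, and tuning grids are all assumptions:

```python
import numpy as np
from scipy.special import gammaln, digamma

def soft(z, t):
    """Soft-thresholding, the proximal map of the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def nll_and_grad(theta, X, y):
    """Average negative log-likelihood of the heterogeneous NBR (NB2 with
    log links, an assumed form) and its gradient w.r.t. theta = (beta, gamma)."""
    p = X.shape[1]
    beta, gamma = theta[:p], theta[p:]
    mu, k = np.exp(X @ beta), np.exp(X @ gamma)
    ll = (gammaln(y + k) - gammaln(k) - gammaln(y + 1)
          + k * np.log(k / (k + mu)) + y * np.log(mu / (k + mu)))
    dmu = y / mu - (k + y) / (k + mu)                         # d ll_i / d mu_i
    dk = (digamma(y + k) - digamma(k)
          + np.log(k / (k + mu)) + 1.0 - (k + y) / (k + mu))  # d ll_i / d k_i
    grad = -np.concatenate([X.T @ (dmu * mu), X.T @ (dk * k)]) / len(y)
    return -np.mean(ll), grad

def fit_double_l1(X, y, lam_beta, lam_gamma, step=1e-2, iters=300):
    """ISTA for the double l1-penalized problem, with different tuning
    parameters for the mean block and the dispersion block."""
    p = X.shape[1]
    theta = np.zeros(2 * p)
    thr = step * np.r_[np.full(p, lam_beta), np.full(p, lam_gamma)]
    for _ in range(iters):
        _, g = nll_and_grad(theta, X, y)
        theta = soft(theta - step * g, thr)
    return theta

def select_by_bic(X, y, lam_pairs):
    """Algorithm 1: fit for each tuning-parameter pair, pick the BIC winner."""
    best = None
    for lam_b, lam_g in lam_pairs:
        theta = fit_double_l1(X, y, lam_b, lam_g)
        nll, _ = nll_and_grad(theta, X, y)
        bic = 2 * len(y) * nll + np.count_nonzero(theta) * np.log(len(y))
        if best is None or bic < best[0]:
            best = (bic, (lam_b, lam_g), theta)
    return best[1], best[2]
```

Stacking (β, γ) into one vector while keeping separate threshold levels per block mirrors the conversion of the double penalized problem into a single penalized one described above.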
The proposed algorithm can perform variable selection and dispersion estimation simultaneously. In contrast, as noted in Section 1, the method of [19], which assumed the dispersion parameter to be a constant, requires an iterative algorithm that alternates between estimating the mean regression and the dispersion and implements a lasso in each iteration, which wastes computing resources when many iterations are needed.
3. Main Results
3.1. Stochastic Lipschitz Conditions
We write the maximum of from the sample of size n as ; then the sample space for is , i.e., . Note that ; what we need to tackle is actually an unbounded empirical process. However, for , we can assume that the value space for is bounded and satisfies
As we can see, the most significant difference between this article and the conventional literature on lasso estimators is that we use rather than as the explanatory variable when analyzing the properties of the loss function . This is not the traditional approach. At first glance, the combination may complicate the subsequent analysis because the KKT condition involves , which is critical for the traditional convex penalty problem. However, this article takes a different approach, the stochastic Lipschitz conditions introduced in Proposition 1 of [14], to solve the ℓ1-penalization problem. Define the local stochastic Lipschitz constant by
The most apparent advantage of the stochastic Lipschitz conditions over the KKT condition is that they can easily handle several parameters that appear in different parts of the model and require the same type of penalty, which is why we do not need to derive the KKT condition in this paper.
To establish the stochastic Lipschitz conditions for this unbounded counting process, another assumption, called strong midpoint log-convexity for some positive , should be satisfied. It states that the negative log-density of the joint distribution of the n independent NB responses satisfies
This assumption ensures that the suprema of the multiplier empirical processes of n independent responses exhibit sub-exponential concentration, which can alternatively be checked via the tail inequality for the suprema of empirical processes corresponding to classes of unbounded functions ([21]).
Theorem 1.
Suppose , the parameter space Θ is convex, and its diameter is . If and are both in the value space and defined as above, then for any ,
with probability at least , where satisfy , and the constants are as follows:
where is the suprema empirical process.
It is worth noting that in Theorem 1 is a random process; hence, the bound above is not deterministic. Fortunately, can be bounded using the strongly midpoint log-convex condition, as we state in Lemma A3. Theorem 1 combined with Lemma A3 yields the following further result.
Theorem 2.
Assume the conditions are the same as those in Theorem 1; then the stochastic Lipschitz constant has a nonrandom upper bound:
with probability at least , where satisfy , and
Theorem 1 gives us a view of the loss function that goes well beyond the KKT conditions. However, the stochastic Lipschitz condition above does not compare the estimated and true values directly. We can resolve this issue by using an eigenvalue condition on the design matrix consisting of . Because the design matrix X is fixed, the eigenvalue condition in the next section is reasonable. It is worth noting that this inequality is an oracle inequality because it involves an unknown empirical process on the right-hand side.
3.2. ℓ1-Estimation Error Oracle Inequalities under RE Conditions
As noted previously, although we use stochastic Lipschitz conditions instead of KKT conditions, the restricted eigenvalue (RE) conditions are still required. We denote by the vector in with the same coordinates as v on J and zero coordinates on the complement of J, and . We assume that the minimum in (2) can always be attained in the following setting, although it may not be unique. In general, to bound , some conditions on the design matrix are needed to obtain a bound in terms of the norm of . Here, we utilize the restricted eigenvalue condition introduced in [22], which says that for some and ,
It should be noted that omitting the weight and the sparse restricted set leads to . Thus, it means that the smallest eigenvalue of the sample covariance matrix is positive, which is impossible when because is not of full rank. To avoid this problem, ref. [22] considered the restricted eigenvalue condition under the sparse restricted set as a sensible relaxation in sparse high-dimensional estimation. The restricted eigenvalue stems from restricted strong convexity, which enforces a strong convexity condition on the negative log-likelihood function of linear models over a certain sparse restricted set.
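For reference, in its standard unweighted form (as in [22]; the weighted variant used here replaces the ℓ1 norm by its weighted analogue), the RE condition can be written as:

```latex
\kappa(s, c_0) \;=\; \min_{\substack{J \subseteq \{1, \dots, p\} \\ |J| \le s}}
\;\min_{\substack{\Delta \ne 0 \\ \|\Delta_{J^c}\|_1 \le c_0 \|\Delta_J\|_1}}
\frac{\|X\Delta\|_2}{\sqrt{n}\,\|\Delta_J\|_2} \;>\; 0 .
```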
Due to the double penalty, besides the RE condition, we also require another condition similar to the RE condition, the so-called l-restricted isometry constant defined in [23], as follows
which essentially requires that the eigenvalues of the sample covariance matrix, restricted to vectors with cardinality at most l (l should be no more than n), behave approximately as in the low-dimensional case.
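For reference, the l-restricted isometry condition of [23] in its standard form reads (the normalization by n matches the sample covariance wording above and is otherwise an assumption):

```latex
(1 - \delta_l)\,\|v\|_2^2 \;\le\; \frac{1}{n}\,\|Xv\|_2^2 \;\le\; (1 + \delta_l)\,\|v\|_2^2
\qquad \text{for all } v \in \mathbb{R}^p \text{ with } \|v\|_0 \le l ,
```

where δ_l ∈ [0, 1) is the l-restricted isometry constant.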
With the RE condition, the l-restricted isometry constant, and the two theorems established above, the lasso estimator in (2) enjoys a good consistency property.
Lemma 1
(see Lemma 3.1 in [23]). Suppose is a set of cardinality S. For a vector , we let be the S largest positions of h outside of . Put , then
Theorem 3.
Suppose the condition is the same as that in Theorem A1. Furthermore, assume , and there exists some , . Let , then using this λ in (2), with probability at least ,
where , are defined in Theorems 1 and A1, respectively.
Remark 1.
Compared to the single lasso problem, in which we only have one unknown vectorized parameter, the oracle inequality in Theorem 3 has an extra term .
Remark 2.
From Theorem 3, we know that the convergence rate is minimax optimal, as studied in [14].
Remark 3.
In this study, we use the lasso estimators of two partial regression coefficients because it is one of the most popular techniques for high-dimensional data. It is worth mentioning that the algorithms and theoretical results could be similarly generalized to other shrinkage estimators, such as the elastic net [7], the adaptive lasso [8], and so on.
4. Numerical Studies
4.1. Simulations
In this section, we evaluate the finite sample performance of the proposed method. The response is generated from the negative binomial regression model (1) with
where and are two p-dimensional parameters. The explanatory variables are generated from the multivariate normal distributions with mean vector 0 and , where . The following two examples show the performance of the proposed estimator for the low-dimensional heterogeneous negative binomial regression and the variable selection in the high-dimensional case, respectively. The R package “lbfgs” is required to solve the optimization problem.
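The simulation design can be sketched as follows; the specific ρ, coefficient values, and NB sampling route are assumptions filling in for the formulas not reproduced in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho = 200, 10, 0.5  # sample size, dimension, AR(1) correlation (assumed)

# Covariates from N(0, Sigma) with Sigma_ij = rho^|i-j|.
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

beta = np.r_[0.5, -0.5, np.zeros(p - 2)]  # sparse mean coefficients (assumed)
gamma = np.r_[0.3, np.zeros(p - 1)]       # sparse dispersion coefficients (assumed)
mu, k = np.exp(X @ beta), np.exp(X @ gamma)

# numpy's negative_binomial(n, p) has mean n(1-p)/p, so setting n = k and
# p = k / (k + mu) yields responses with mean mu and dispersion k.
y = rng.negative_binomial(k, k / (k + mu))
assert y.shape == (n,) and (y >= 0).all()
```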
Example 1
(Low dimension). We set and . The true parameters are and , and their maximum likelihood estimators are denoted as and , respectively. We compare the estimator with , which ignores the heterogeneity of the overdispersion and treats as a constant. Table 1 displays the average squared estimation errors based on 200 repetitions.
Table 1.
The average squared estimation errors of the estimators.
We can make the following observations from the table. First, the performance of all three estimators improves as n increases. Second, the estimator , which estimates the parameter in the mean function , performs better than , which estimates the parameter in the overdispersion function . Last, and most importantly, performs much worse than . For example, the average squared estimation error of is about 5 times that of when , and 10 times when . The comparison between and indicates the necessity of accounting for the heterogeneity of the overdispersion.
Example 2
(High dimension). The sample sizes are chosen to be , with dimensions , and , respectively. We set and . The unknown tuning parameters for the penalty functions are chosen by the BIC criterion in the simulation. Results over 200 repetitions are reported. We compare the variable selection performance of the proposed method with that of the previous method, which ignores the heterogeneity of the overdispersion and treats as a constant. For each case, Table 2 reports the number of repetitions in which each important explanatory variable is selected in the final model, as well as the average number of unimportant explanatory variables selected.
Table 2.
The results of variable selection.
We see from the table that our method performs much better than the previous method that treats as a constant. Specifically, our method correctly selects the important variables more often than the previous method, and it is less likely to select unimportant variables. Furthermore, the variable selection procedure performs better as the sample size n increases. When , the important explanatory variables in and are correctly selected in almost every repetition. When the dimension p increases, the procedure may select more unimportant explanatory variables, but the average numbers are less than . The important variables in are less likely to be selected than those in , especially when the sample size is small, and the same holds for the unimportant variables.
4.2. A Real Data Example
In this section, we apply the proposed method to a dataset on German health care demand. The data were employed in [24] and can be downloaded at http://qed.econ.queensu.ca/jae/2003-v18.4/riphahn-wambach-million/, accessed on 1 January 2022. The data contain 27,326 observations on 25 variables, including 2 dependent variables, Docvis (number of doctor visits in the last three months) and Hospvis (number of hospital visits in the last calendar year). For conciseness, we focus on Docvis in this study. We build the HNBR model based on the proposed variable selection procedure and use the standard NBR model as a comparison. Define the fitting errors (FE) as , where denotes the raw data of Docvis, is the predicted value, and n is the sample size. As the data were observed during 1984–1988, 1991, and 1994, we perform the analysis for each observed year. Table 3 displays the variable selection results and fitting errors.
Table 3.
The variable selection results and the fitting errors (FE) of NBR and HNBR models. The variable Others = {Married, Haupts, Reals, Fachhs, Abitur, Univ, Working, Bluec, Whitec, Self, Beamt, Public, Addon}. Because these variables are not selected in any year, we put them in “Others” for brevity.
We have the following findings from the table. First, the important variables in the NBR and HNBR models are the same in each year, and the estimates are close. Second, the selected variables in are almost the same every year, namely Age, Hsat (health satisfaction), Handper (degree of handicap), and Educ (years of schooling). Moreover, some of these variables still play an essential role in , and contains no variables other than these. Finally, the fitting errors of the HNBR are smaller than those of the NBR. All of these illustrate the advantage of our method.
5. Conclusions and Future Study
We study high-dimensional heterogeneous overdispersed count data via negative binomial regression models and propose a double ℓ1-regularized method for simultaneous variable selection and dispersion estimation. Under restricted eigenvalue conditions, we prove oracle inequalities for the lasso estimators of the two partial regression coefficients for the first time, using concentration inequalities of empirical processes. Furthermore, we derive the consistency and convergence rates for the estimators, which provide theoretical guarantees for further statistical inference. Simulation studies and a real example from the German health care demand data indicate that the proposed method works satisfactorily.
There are some limitations of this study. First, we assume that the responses are independent in this work. However, NB responses are temporally dependent in time-series data [25]. Thus, weak dependence conditions, including -mixing and m-dependent types, could be considered in the future. Second, this study pays little attention to statistical inference, such as testing heterogeneity. Issues concerning hypothesis testing can be addressed via the debiased lasso estimator; see [26] and references therein. This will comprise our future research work. Another possible study is false discovery rate (FDR) control, which aims to identify a small number of statistically significantly nonzero results after obtaining the sparse penalized estimation of the HNBR; see [27,28].
Author Contributions
Conceptualization, S.L. and H.W.; methodology, H.W.; software, S.L.; validation, S.L., H.W. and X.L.; data curation, S.L.; writing—original draft preparation, S.L., H.W. and X.L.; writing—review and editing, S.L. and H.W.; supervision, S.L. and H.W.; funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China, grant number 12101056.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data used in Section 4.2 can be downloaded at http://qed.econ.queensu.ca/jae/2003-v18.4/riphahn-wambach-million/, accessed on 1 January 2022.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Proofs
The first step is to establish a property of the loss function. From mathematical analysis, we prefer bounded quantities to unbounded ones. Denote by the first partial derivative with respect to . The boundedness of y and s gives a nice property for the loss function .
Lemma A1.
We have
where satisfying
with , and
Proof.
We will use the properties of the psi function, the logarithmic derivative of the gamma function, to prove this lemma. Write . For any , using Binet's formula (see p. 18 of [29])
where is strictly increasing on , it gives
and , we have
Then, the first inequality in the lemma has been verified. On the other hand, by using the fact that (see (2.2) in [30])
for the function and ,
and for any , , we conclude that
and
In addition, using the mean value theorem again, we also have
and
where the fact used is that satisfies . Because
we can conclude the second inequality in the lemma. □
The lemma separates the partial derivative of into two parts: the first part is linear in the response variable y (say , , and ), and the second part consists of other, more complicated (nonlinear) functions of y. The first part is relatively easy to analyze because the following concentration inequality measures the dispersion of a weighted sum of negative binomial variables. This concentration inequality is a special case of that for the weighted sum of a series of random variables, which can be proved by the sub-exponential concentration results in Proposition 4.2 of [31].
Lemma A2.
Suppose are independently distributed as . Then, for any nonrandom weights independent of and ,
where and
Proof.
We will use the sub-exponential norm. The moment-generating function (MGF) for is
Then, by letting , we have
which implies the sub-exponential norm for is
Using the definition of , from Proposition 4.2 in [31], we can immediately obtain the result in the Lemma. □
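For completeness (the displayed formulas are not reproduced above), under the parameterization with mean μ and dispersion k used in Section 2, the NB moment-generating function has the standard form:

```latex
\mathbb{E}\, e^{tY} \;=\; \left( \frac{k}{k + \mu} \right)^{k}
\left( 1 - \frac{\mu}{k + \mu}\, e^{t} \right)^{-k},
\qquad e^{t} < \frac{k + \mu}{\mu},
```

from which the sub-exponential norm bound in the proof follows.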
It should be noted that naturally has a lower and upper bound for any because and are both bounded between and
Note that is an unbounded random variable; the next step is to find a probabilistic bound for . We will cite an important lemma for this type of problem. We say a distribution is strongly discrete log-concave with if its density is strongly midpoint log-convex with the same .
Lemma A3
(Concentration for strongly log-concave discrete distributions). Let be any strongly log-concave discrete distribution indexed by on . Then, for any function that is L-Lipschitz with respect to the Euclidean norm, we have, for ,
for any .
Lemma A4.
The maximum of the response has the concentration
for any .
Proof.
For the upper bound of the expectation, we first note that with was calculated in Lemma A2; then we have by Example 5.3 in [31], which further gives
where the second ≤ is by Corollary 7.3 in [31] and the bound in the lemma comes from the explicit expression in Lemma A2.
By applying Lemma A3, it remains to verify that belongs to some strongly log-concave discrete distribution with the specified after we take , which is 1-Lipschitz. By definition, the derivative of the log-density for is
then the Taylor expansion gives
where , with . Define the difference function
the Taylor expansion above immediately implies
Let
and it is not hard to see or 0. Besides,
Now, we have obtained
which gives from the strong log-concave assumption for , if is small. Hence, we can conclude from Lemma A3 and the upper bound of
which is exactly the result in the lemma. □
Remark A1.
For , it is distributed as sub-Gumbel, which has rarely been studied. Another way to deal with uses the extreme value theory (EVT) technique; we note that for any
If is i.i.d., then in asymptotic sense,
Unfortunately, this technique cannot be used in the above lemma because: (i) we need a non-asymptotic inequality instead of a vague asymptotic expression with ; and (ii) is not an i.i.d. series, so EVT cannot easily be applied in this particular setting. Hence, we adopt a discretization technique which has been used in [32] and is fully illustrated in [14].
The stochastic Lipschitz conditions are established by using the properties of and . As mentioned above, they are divided into two parts. The linear parts can be handled by the concentration inequality for NB variables given in Lemma A2, but the nonlinear part requires more advanced tools from empirical process theory. These are given in the following lemmas.
Lemma A5
((3.12) in [33]). Suppose are zero-mean independent stochastic processes indexed by , and there exist and satisfying and for all . Denote ; then for any ,
A map is called a contraction if for all . In addition, in the following lemmas, are always i.i.d. Rademacher variables.
Lemma A6
(Theorem 2.2 in [34]). Let be a bounded set and be functions such that is -Lipschitz with . For , let . Then,
where is a universal constant that can be set no greater than .
Lemma A7
(Theorem 4.12 in [35]). Let be convex and increasing. Let further be contractions such that . Then, for any bounded subset in ,
Lemma A8
(Lemma 5.2 in [36]). Let be some finite subset of , let , then
With the assistance of these powerful tools, we can establish the stochastic Lipschitz condition as follows, which is one of the most important steps in this article for establishing the oracle inequality for the distance between the estimated value and the true value .
The proof of Theorem 1.
Denote . For , denote . We also define the map and the function
Thus, is a real-valued function for . Then, it is easy to check that
and in turn. It gives
First, we would like to give the explicit formula for and obtain an upper bound as well as a Lipschitz parameter for . Denote , then
where is the j-th basis vector of . Hence, for ,
in which is a function only related to s and free of Y and the index i. Using Lemma A1, for , write ,
and
This implies is Lipschitz. In particular, letting and ,
Hence, we obtain an upper bound for that
Now, for , define
Then, we can approach the final conclusion in the theorem by
We will tackle (A2) term by term.
(i). The first three terms in (A2):
We will use concentration inequalities to deal with these terms. For any and , by Lemma A2 and the Cauchy–Schwarz inequality,
where and is defined in Lemma A2; they are both deterministic and free of and the index k. Hence,
By letting the right side of the above display be , we can obtain
In exactly the same way, we can obtain, for any , regarding the third term,
The situation is slightly different for the second term. Indeed,
Because is a function of , as are the weights , we cannot use exactly the same method as before. However, because is convex, we have . Then, it only remains to note that
which gives
for any and .
(ii). The fourth term in (A2):
From Lemma A1, we know that . Thus, simply by Hoeffding inequality (see Corollary 2.1 (b) in [31]), for any and ,
For arbitrary , let , we obtain
(iii). The last term in (A2):
Therefore, from Lemma A5, it follows that
Thus, the last task is to give an upper bound for . Note that ; by symmetrization,
where , and are i.i.d. Rademacher variables independent of . Here, using the fact is -Lipschitz and Lemmas A6–A8,
Then, by (A3),
Note that the right-hand side of the inequality is free of ; let in the above inequality and use the same technique as before to obtain the uniform bound. The theorem is proved by letting , , , and . □
The lower bound of the likelihood-based divergence
Recall the standard steps for establishing the oracle inequality for a lasso estimator are (see [37] for example):
- To avoid the ill behavior of Hessian, propose the restricted eigenvalue condition or other analogous conditions about the design matrix.
- Find the tuning parameter based on the high-probability event, i.e., the KKT conditions.
- According to some restricted eigenvalue assumptions and tuning parameter selection, derive the oracle inequalities via the definition of the lasso optimality and the minimizer under unknown expected risk function and some basic inequalities. There are three sub-steps:
- (i)
- Under the KKT conditions, show that the error vector is in some restricted set with structure sparsity, and check that is in a big compact set;
- (ii)
- Show that the likelihood-based divergence of and can be lower bounded by some quadratic distance between and ;
- (iii)
- By some elementary inequalities and (ii), show that is in a smaller compact set with a radius of optimal rate (proportional to ).
Under our approach, the KKT condition holding with high probability is replaced by the stochastic Lipschitz condition, while the other steps remain the same. For most models belonging to the canonical exponential family, step (ii) is quite trivial; see Lemma 1 in [38] for example. Nonetheless, it is worth noting that our loss function is not in the canonical exponential family, so there is no existing discussion of the lower bound of the likelihood-based divergence of and in our setting. We use the following theorem to clarify this point.
Theorem A1.
Suppose the conditions are the same as those in Theorem 1. Denote the true parameter for by and . If and , then
where is a positive constant and its exact definition is in the proof.
Proof.
For simplicity, we drop the index i. By the definition and the notation in Theorem 1,
where is the Kullback–Leibler divergence from the ’s density to , i.e.,
Due to the identifiability of the negative binomial distribution, we have with equality if and only if . Using Taylor's theorem,
where and is the smallest eigenvalue of the matrix . Thus, it is enough to show that is strictly positive definite for any . First, we calculate directly,
where and
For a matrix , it is strictly positive definite if and only if and . Denote ; and are the true parameters for Y. Then,
Now, we are going to deal with and . For ,
Therefore, is concave. Using Jensen's inequality and the mean value theorem,
Similarly, for , by using the fact that and the assumption,
where lies between 0 and Y. The lower bounds for and , together with the fact that for , imply that . Similarly, we can also prove , so the theorem holds. □
The proof of Theorem 3.
The proof follows the idea in [22]. First, by the definition of ,
From Theorem A1, we also have
Then, by Theorem 1 and the definition of ,
holds with probability at least , where . Now, let be any sets with . It is easy to check
by the fact . It gives that with probability at least ,
Let satisfy and , and let be the union of and the indices of the largest . Then, and also guarantee (A5). In addition, from Lemma 1, they also give
In addition, from the definition of and , we know that and .
Unlike the single lasso problem, here we need to define , and consider and separately. Obviously, , or (A5) cannot hold. For , we have
Then, by the restricted eigenvalue condition,
holds for or . Note that from (A5),
then by the Cauchy–Schwarz inequality,
It gives
where we use the fact . Furthermore, because
we can conclude that
Indeed, if the two inequalities above were reversed, then for the first one, one would find that
and
Once again, by the Cauchy–Schwarz inequality,
Denote . Then, for , , and
For any , define
Then, for ,
while for , . Therefore,
and consequently . Once again, by the restricted eigenvalue condition,
On the other hand, noting that for any the inequalities and hold, we conclude
Next, we use the definition of the -restricted isometry constant . Because , we have
Finally, because
we obtain that
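For completeness, the restricted isometry constant invoked in the last step is defined in [23]; in generic notation (X the design matrix, s the sparsity level, all placeholders for the paper's symbols), \(\delta_s\) is the smallest \(\delta \ge 0\) such that

```latex
\[
(1-\delta)\,\|c\|_{2}^{2}
  \;\le\; \|X_{T}\,c\|_{2}^{2}
  \;\le\; (1+\delta)\,\|c\|_{2}^{2}
\qquad
\text{for all } T \text{ with } |T| \le s
\text{ and all } c \in \mathbb{R}^{|T|},
\]
% where X_T denotes the submatrix of X whose columns are indexed by T.
```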
References
- Dai, H.; Bao, Y.; Bao, M. Maximum likelihood estimate for the dispersion parameter of the negative binomial distribution. Stat. Probab. Lett. 2013, 83, 21–27.
- Allison, P.D.; Waterman, R.P. Fixed-effects negative binomial regression models. Sociol. Methodol. 2002, 32, 247–265.
- Hilbe, J.M. Negative Binomial Regression; Cambridge University Press: Cambridge, UK, 2011.
- Weißbach, R.; Radloff, L. Consistency for the negative binomial regression with fixed covariate. Metrika 2020, 83, 627–641.
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288.
- Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
- Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2005, 67, 301–320.
- Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
- Qiu, Y.; Chen, S.X.; Nettleton, D. Detecting rare and faint signals via thresholding maximum likelihood estimators. Ann. Stat. 2018, 46, 895–923.
- Xie, F.; Xiao, Z. Consistency of l1 penalized negative binomial regressions. Stat. Probab. Lett. 2020, 165, 108816.
- Li, Y.; Rahman, T.; Ma, T.; Tang, L.; Tseng, G.C. A sparse negative binomial mixture model for clustering RNA-seq count data. Biostatistics 2021, kxab025.
- Jankowiak, M. Fast Bayesian Variable Selection in Binomial and Negative Binomial Regression. arXiv 2021, arXiv:2106.14981.
- Lisawadi, S.; Ahmed, S.; Reangsephet, O. Post estimation and prediction strategies in negative binomial regression model. Int. J. Model. Simul. 2021, 41, 463–477.
- Zhang, H.; Jia, J. Elastic-net Regularized High-dimensional Negative Binomial Regression: Consistency and Weak Signals Detection. Stat. Sin. 2022, 32, 181–207.
- Xu, D.; Zhang, Z.; Wu, L. Variable selection in high-dimensional double generalized linear models. Stat. Pap. 2014, 55, 327–347.
- Yee, T.W. Vector Generalized Linear and Additive Models: With an Implementation in R; Springer: Berlin/Heidelberg, Germany, 2015.
- Nguelifack, B.M.; Kemajou-Brown, I. Robust rank-based variable selection in double generalized linear models with diverging number of parameters under adaptive Lasso. J. Stat. Comput. Simul. 2019, 89, 2051–2072.
- Cavalaro, L.L.; Pereira, G.H. A procedure for variable selection in double generalized linear models. J. Stat. Comput. Simul. 2022, 1–18.
- Wang, Z.; Ma, S.; Zappitelli, M.; Parikh, C.; Wang, C.Y.; Devarajan, P. Penalized count data regression with application to hospital stay after pediatric cardiac surgery. Stat. Methods Med. Res. 2016, 25, 2685–2703.
- Huang, H.; Zhang, H.; Li, B. Weighted Lasso estimates for sparse logistic regression: Non-asymptotic properties with measurement errors. Acta Math. Sci. 2021, 41, 207–230.
- Adamczak, R. A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electron. J. Probab. 2008, 13, 1000–1034.
- Bickel, P.J.; Ritov, Y.; Tsybakov, A.B. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 2009, 37, 1705–1732.
- Candes, E.; Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat. 2007, 35, 2313–2351.
- Riphahn, R.T.; Wambach, A.; Million, A. Incentive effects in the demand for health care: A bivariate panel count data estimation. J. Appl. Econom. 2003, 18, 387–405.
- Yang, X.; Song, S.; Zhang, H. Law of iterated logarithm and model selection consistency for generalized linear models with independent and dependent responses. Front. Math. China 2021, 16, 825–856.
- Shi, C.; Song, R.; Chen, Z.; Li, R. Linear hypothesis testing for high dimensional generalized linear models. Ann. Stat. 2019, 47, 2671.
- Xie, F.; Lederer, J. Aggregating Knockoffs for False Discovery Rate Control with an Application to Gut Microbiome Data. Entropy 2021, 23, 230.
- Cui, C.; Jia, J.; Xiao, Y.; Zhang, H. Directional FDR Control for Sub-Gaussian Sparse GLMs. arXiv 2021, arXiv:2105.00393.
- Bateman, H. Higher Transcendental Functions; McGraw-Hill Book Company: New York, NY, USA, 1953; Volume 1.
- Alzer, H. On some inequalities for the gamma and psi functions. Math. Comput. 1997, 66, 373–389.
- Zhang, H.; Chen, S.X. Concentration inequalities for statistical inference. Commun. Math. Res. 2021, 37, 1–85.
- Moriguchi, S.; Murota, K.; Tamura, A.; Tardella, F. Discrete midpoint convexity. Math. Oper. Res. 2020, 45, 99–128.
- Sen, B. A Gentle Introduction to Empirical Process Theory and Applications; Columbia University: New York, NY, USA, 2018.
- Chi, Z. Stochastic Lipschitz continuity for high dimensional Lasso with multiple linear covariate structures or hidden linear covariates. arXiv 2010, arXiv:1011.1384.
- Ledoux, M.; Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
- Massart, P. Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse Math. 2000, 9, 245–303.
- Xiao, Y.; Yan, T.; Zhang, H.; Zhang, Y. Oracle inequalities for weighted group lasso in high-dimensional misspecified Cox models. J. Inequalities Appl. 2020, 2020, 1–33.
- Abramovich, F.; Grinshtein, V. Model selection and minimax estimation in generalized linear models. IEEE Trans. Inf. Theory 2016, 62, 3721–3730.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).