Asymptotics of Subsampling for Generalized Linear Regression Models under Unbounded Design

Optimal subsampling is a statistical methodology for generalized linear models (GLMs) that enables fast inference about parameter estimation in massive-data regression. The existing literature considers only bounded covariates. In this paper, we first obtain the asymptotic normality of the subsampling M-estimator based on the Fisher information matrix. We then study the asymptotic properties of subsampling estimators for unbounded GLMs with nonnatural links, covering both conditional and unconditional asymptotic properties.


Introduction
In recent years, the amount of information that people need to process has increased dramatically, and it is very challenging to analyze massive data directly. The divide-and-conquer strategy can mitigate the challenge of directly processing such big data [1], but it still consumes considerable computing resources. As a computationally cheaper alternative, subsampling is valuable when computing resources are limited.
To reduce the computational burden, subsampling strategies for big data have received increasing attention in recent years. Ref. [2] proposes simple necessary and sufficient conditions for a convolved subsampling estimator to produce a normal limit that matches the target of bootstrap estimation; Ref. [3] provides optimal distributed subsampling for maximum quasi-likelihood estimators with massive data; Ref. [4] studies adaptive optimal subsampling algorithms; and Ref. [5] describes a subdata selection method based on leverage scores, which conducts linear model selection on a small subdata set.
The GLM is a class of statistical models with a wide range of applications; see, e.g., [6][7][8]. Many subsampling studies are based on GLMs, such as [3,9,10]. However, the covariates of the subsampled GLMs in this literature are assumed to be bounded. In some big data problems the covariates are not strictly bounded, such as the number of clicks on a web page, which can grow without limit. This calls for an extension of the existing theory to unbounded designs. To fill this gap, this paper studies the asymptotic properties of subsampled GLMs with unbounded covariates based on empirical process and martingale techniques.
Our three contributions are as follows: (1) we describe the asymptotic property of the subsampled M-estimator using the Fisher information matrix; (2) we give the conditional consistency and asymptotic normality of the subsampling estimator for unbounded GLMs; (3) we provide the unconditional consistency and asymptotic normality of the subsampling estimator for unbounded GLMs.
The rest of the paper is organized as follows. Section 2 introduces the basic concepts of GLMs and the subsampling M-estimation problem. Section 3 presents the asymptotic properties of subsampling estimators for unbounded GLMs. Section 4 gives the conclusion and discussion, as well as future research directions. All technical proofs are collected in Appendix A.

Preliminaries
This section introduces the subsampling M-estimation problem and GLMs.

Subsampling M-Estimation
Let {l(β; Z) ∈ R | Z ∈ Z} be a set of loss functions with parameter β in a finite-dimensional convex set Θ ⊂ R^p, and let U = {1, 2, . . . , N} be the index set of the full large dataset with σ-algebra F_N = σ(Z_1, . . . , Z_N), where for each i ∈ U the random data point Z_i ∈ Z (some probability space) is observed. The empirical risk L_N : Θ → R is given by

L_N(β) = (1/N) ∑_{i∈U} l(β; Z_i).

The goal is to obtain the solution β̂_N that minimizes the risk, namely

β̂_N = arg min_{β∈Θ} L_N(β). (1)

To solve Equation (1), we need β̂_N to satisfy ∇L_N(β̂_N) = (1/N) ∑_{i∈U} ∇l(β̂_N; Z_i) = 0, and we let Σ_N := ∇²L_N(β̂_N). This is an M-estimation problem; see [11]. To solve the large-scale estimation problem in Equation (1) quickly, we propose subsampling M-estimation. Consider an index set S = {i_1, i_2, . . . , i_n} drawn with replacement from U according to sampling probabilities {π_i}_{i∈U} with ∑_{i∈U} π_i = 1. The subsampling M-estimation problem is to obtain the solution β̂_n satisfying ∇L_n(β̂_n) = 0 with the inverse-probability-weighted risk

L_n(β) = (1/(nN)) ∑_{i=1}^{n} l(β; Z*_i)/π*_i,

where Z*_i is the i-th subsampled data point (drawn with replacement) and π*_i is the subsampling probability of Z*_i. For example, if Z*_1 = Z_1, then π*_1 = π_1; if Z*_2 = Z_1, then π*_2 = π_1. Denote by a_i the number of times the i-th data point is subsampled, so that ∑_{i∈U} a_i = n. The risk L_n(β) is constructed by the inverse probability weighting technique so that E[L_n(β) | F_N] = L_N(β); see [12]. Details on the properties of conditional expectation can be found in [13].
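The key unbiasedness property E[L_n(β) | F_N] = L_N(β) of the inverse-probability-weighted risk can be checked numerically. The following sketch uses a toy squared-error loss and a hypothetical sampling distribution (all names and the choice of π_i are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy full dataset with squared-error loss l(beta; (x, y)) = (y - beta * x)^2.
N, n = 1000, 100
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(size=N)

def full_risk(beta):
    # L_N(beta) = (1/N) * sum_i l(beta; Z_i)
    return np.mean((y - beta * x) ** 2)

# Hypothetical nonuniform sampling probabilities, proportional to |x| + 0.1.
pi = np.abs(x) + 0.1
pi /= pi.sum()

def subsampled_risk(beta, idx):
    # L_n(beta) = (1/(n*N)) * sum_j l(beta; Z*_j) / pi*_j
    return np.sum((y[idx] - beta * x[idx]) ** 2 / pi[idx]) / (n * N)

# E[L_n(beta) | F_N] = L_N(beta): average over many with-replacement subsamples.
beta0 = 1.5
draws = [subsampled_risk(beta0, rng.choice(N, size=n, p=pi)) for _ in range(2000)]
print(np.mean(draws), full_risk(beta0))  # should be close
```

The average of the weighted subsample risks matches the full-data risk, regardless of how unevenly the π_i are chosen, which is exactly why the weighting by 1/(nNπ*_i) is used.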

Generalized Linear Models
Let the random variable Y follow a distribution from the natural exponential family P_α with density

f_α(y) = c(y) exp{αy − b(α)} (2)

with respect to a σ-finite measure ν, where α is often referred to as the canonical parameter belonging to its natural space Λ. Here ν(·) is the Lebesgue measure for continuous distributions (normal, Gamma) or the counting measure for discrete distributions (binomial, Poisson, negative binomial), and c(y) is free of α.
Let (X_1, Y_1), . . . , (X_N, Y_N) be N independent sample data pairs. Here X_i ∈ R^p is the covariate vector, and we assume that the response Y_i follows a natural exponential family distribution with parameter α_i ∈ Λ. The covariates X_i := (x_i1, . . . , x_ip)^T (i = 1, 2, . . . , N) are assumed to be deterministic.
The conditional expectation of Y_i given X_i is modeled as a function of β^T X_i through the link relation α_i = ψ(β^T X_i). The mean μ_i := E(Y_i) is the quantity most often considered in regression.
If α_i = β^T X_i, then ψ(β^T X_i) = β^T X_i is called the canonical (or natural) link function, and the corresponding model is the canonical (or natural) GLM; see page 32 in [14]. The assumption α_i = β^T X_i is sometimes too strong in practice, while nonnatural-link GLMs allow more flexible choices of the link function. We can instead assume that α_i and β^T X_i are related by a nonnatural link function ψ(·) within the exponential family. The nonnatural GLM [15] is then defined by

f(y_i) = c(y_i) exp{ψ(β^T X_i) y_i − b(ψ(β^T X_i))}, i = 1, 2, . . . , N. (3)

A classic result for the exponential family (3) is that E(Y_i) = b'(α_i) and Var(Y_i) = b''(α_i) with α_i = ψ(β^T X_i); see page 280 in [16].
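As a concrete instance of these moment identities (a worked example added here for illustration, not part of the original derivation), take the Poisson distribution written in natural-exponential-family form:

```latex
% Poisson(\lambda) in natural-exponential-family form
%   f_\alpha(y) = c(y)\exp\{\alpha y - b(\alpha)\},
% with \alpha = \log\lambda, \quad b(\alpha) = e^{\alpha}, \quad c(y) = 1/y!.
% The classic identities then give
\mathbb{E}(Y) = b'(\alpha) = e^{\alpha} = \lambda,
\qquad
\operatorname{Var}(Y) = b''(\alpha) = e^{\alpha} = \lambda .
```

The familiar mean-variance equality of the Poisson distribution thus falls out of the cumulant function b(·) directly.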

Subsampling M-Estimation Problem
In this part, we first examine the term Σ_N^{-1} ∇L_n(β̂_N). Define an independent random vector sequence {ζ_j}_{j=1}^N and its subsampled version {ζ*_j}_{j=1}^n, where each vector ζ takes the value Σ_N^{-1} ∇l(β̂_N; Z_i)/(N π_i) with probability π_i, i ∈ U. From the definition of ∇L_N(β), we have E(ζ | F_N) = Σ_N^{-1} ∇L_N(β̂_N) = 0 and Var(ζ | F_N) = n V_M(β̂_N; n). We then have the following asymptotic property of the subsampled M-estimator.

Theorem 1. Suppose that the risk function L_N(β) is twice differentiable and λ-strongly convex over Θ, that is, ∇²L_N(β) ≥ λI for all β ∈ Θ, where ≥ denotes the positive semidefinite ordering, and that the sampling-based moment condition holds. Then, as n → ∞, conditional on F_N,

V_M^{-1/2}(β̂_N; n) (β̂_n − β̂_N) →d N(0, I), (5)

where →d denotes convergence in distribution.
Theorem 1 shows that the subsampling M-estimation scheme is theoretically feasible under mild conditions. In addition, the existence of the estimator is guaranteed via the Fisher information matrix.
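The conditional variance appearing in Theorem 1 can be illustrated numerically. The following sketch uses a toy mean-estimation loss l(β; z) = (z − β)²/2, for which Σ_N = 1 and the weighted score equation has a closed-form solution; the sampling probabilities are a hypothetical choice, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Full data and full-data M-estimator for the mean: l(beta; z) = (z - beta)^2 / 2.
N, n = 2000, 200
z = rng.normal(size=N)
beta_N = z.mean()                        # full-data estimator (Sigma_N = 1 here)

# Hypothetical nonuniform subsampling probabilities.
pi = np.abs(z - beta_N) + 0.2
pi /= pi.sum()

# Plug-in conditional variance: Var(beta_n - beta_N | F_N) ~ V_c / n,
# with V_c = sum_i grad_i^2 / (N^2 pi_i) and grad_i = beta_N - z_i.
V_c = np.sum((beta_N - z) ** 2 / pi) / N**2
predicted_var = V_c / n

# Monte Carlo over repeated subsamples: for squared error, the weighted score
# equation is solved exactly by an inverse-probability-weighted mean.
reps = 3000
est = np.empty(reps)
for r in range(reps):
    idx = rng.choice(N, size=n, p=pi)
    w = 1.0 / pi[idx]
    est[r] = np.sum(w * z[idx]) / np.sum(w)

print(est.var(), predicted_var)  # should agree up to Monte Carlo error
```

The empirical spread of the subsampled estimator around β̂_N matches the variance predicted by the conditional CLT, which is what Theorem 1 asserts after standardization.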

Conditional Asymptotic Properties of Subsampled GLMs with Unbounded Covariates
The exponential family is very versatile, containing many common light-tailed distributions such as the binomial, Poisson, negative binomial, normal and Gamma distributions. Together with their attendant convexity properties, which lead to a finite-variance property for the log-density, these families underlie a large number of popular and effective statistical models. It is precisely because these distributions are so common that we study the subsampling problem for GLMs.
Following the loss framework of Section 2.1, we set l(β; Z_i) to be the negative log-likelihood of the density (2); then minimizing the loss function is equivalent to maximizing the likelihood function. For simplicity, we assume that c(y) = 1 and work with the nonnatural link α_i = ψ(β^T X_i); we also use this idea in Section 3.3. More generally, we consider a wider class, called quasi-GLMs rather than GLMs, which only assumes that the mean relation

E(Y_i) = μ(β^T X_i) (4)

holds for a certain function μ(·). Strong consistency and asymptotic normality of the quasi maximum likelihood estimate in GLMs with bounded covariates are proved in [17]. For unbounded covariates, adopting the subsampled estimation of GLMs in [9], we calculate the inverse-probability-weighted estimator of β by solving the estimating equation based on the subsampled index set S,

∑_{i=1}^{n} {Y*_i − μ(β^T X*_i)} ψ'(β^T X*_i) X*_i / π*_i = 0. (6)

Equation (6) defines a quasi-GLM, since only the mean relation (4) is specified instead of the full distribution function.
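A weighted estimating equation of this type can be solved by Fisher scoring. The sketch below uses Poisson responses (b(α) = e^α) with a hypothetical nonnatural softplus link ψ(η) = log(1 + e^η), so that μ(η) = b'(ψ(η)) = 1 + e^η; the link choice, data sizes, and function names are illustrative assumptions, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(2)

# Poisson responses with nonnatural link alpha = psi(eta) = log(1 + exp(eta)),
# hence mean mu = b'(psi(eta)) = exp(psi(eta)) = 1 + exp(eta).
N, n, p = 5000, 500, 2
X = rng.normal(size=(N, p))
beta_true = np.array([0.5, -0.3])
Y = rng.poisson(1.0 + np.exp(X @ beta_true))

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))  # psi'(eta)

def fisher_scoring(Xs, Ys, w, iters=50):
    """Solve the weighted score sum_i w_i (Y_i - mu_i) psi'(eta_i) X_i = 0."""
    beta = np.zeros(p)
    for _ in range(iters):
        eta = Xs @ beta
        mu = 1.0 + np.exp(eta)
        sp = sigmoid(eta)                          # psi'(eta)
        score = Xs.T @ (w * (Ys - mu) * sp)
        # Fisher information: b''(psi)psi'^2 = (1+e^eta) sigmoid^2 = e^eta sigmoid.
        H = (Xs * (w * np.exp(eta) * sp)[:, None]).T @ Xs
        beta = beta + np.linalg.solve(H, score)
    return beta

beta_full = fisher_scoring(X, Y, np.ones(N))

# Subsample estimator; uniform pi_i = 1/N, so the weights 1/(N pi_i) = 1 cancel.
idx = rng.choice(N, size=n)
beta_sub = fisher_scoring(X[idx], Y[idx], np.ones(n))

print(beta_full, beta_sub)
```

With nonuniform probabilities, one would pass `w = 1.0 / pi[idx]` instead of unit weights; the score then matches Equation (6) up to a constant factor that does not change the root.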
Let β̂_n be the estimator of the true parameter β_0 in the subsampled quasi-GLM, and let β̂_N be the estimator of β_0 in the quasi-GLM with full data. For the unbounded quasi-GLM with full data, β̂_N is asymptotically unbiased for β_0; see [18]. Next, we focus on the asymptotic properties of β̂_n, as shown in the following theorems.
Theorem 2. Consider Equations (4) and (6), where ψ(·) is three times continuously differentiable with every derivative bounded, and b(·) is twice continuously differentiable with every derivative bounded, and suppose the stated regularity assumptions hold. Then β̂_n is consistent for β̂_N, i.e., β̂_n − β̂_N = o_{P|F_N}(1), where o_{P|F_N}(1) means o(1) conditional on F_N in probability.

Theorem 3.
Under the conditions in Theorem 2, as N → ∞ and n → ∞, conditional on F_N in probability, the suitably normalized difference β̂_n − β̂_N converges in distribution to a multivariate normal limit.

In this part, we establish the asymptotic properties without the moment condition on the covariates {X_i}_{i=1}^N used in [9], which means the X_i's may be unbounded. Here we only provide the theoretical asymptotic results; furthermore, the subsampling probability can be derived from the A-optimality criterion as in [10].
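The A-optimality-style subsampling probabilities mentioned above typically take π_i proportional to the norm of the curvature-adjusted score contribution. The sketch below follows this general recipe only; the exact criterion of [10] is not reproduced, and the residuals and curvature matrix are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in quantities: per-observation score contributions and a curvature matrix.
N, p = 1000, 3
X = rng.normal(size=(N, p))
resid = rng.normal(size=N)              # stand-in for Y_i - mu_i at a pilot estimate

grad = X * resid[:, None]               # per-observation score contributions
Sigma_N = X.T @ X / N                   # stand-in curvature (Hessian) matrix

# A-optimality-style weights: pi_i proportional to || Sigma_N^{-1} grad_i ||.
norms = np.linalg.norm(grad @ np.linalg.inv(Sigma_N), axis=1)
pi = norms / norms.sum()                # normalized sampling distribution

print(pi[:5])
```

Observations with large leverage-like score contributions receive larger sampling probabilities, which is the intuition behind minimizing the trace of the asymptotic variance (the A-optimality objective).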

Unconditional Asymptotic Properties of Subsampled GLMs with Unbounded Covariates
In real engineering, measuring some response variables is very expensive, for example in superconductor data or deep-space exploration data. The accuracy of estimating the target parameters under measurement constraints on the responses is therefore an important issue. Ref. [19] established the unconditional asymptotic properties of parameter estimation in bounded GLMs with canonical links, but the case of unbounded GLMs with nonnatural links has not yet been discussed.
In this section, we continue to use the notation of Section 3.2. Through the theory of empirical processes [11], we obtain the unconditional consistency of β̂_n in the following theorem.

Theorem 4. Assume, among the stated regularity conditions, that b(·) in (3) is twice continuously differentiable and every derivative of it has a positive minimum. Then β̂_n is consistent for the true parameter β_0.
Theorem 4 directly obtains the unconditional consistency of the subsampling estimator with respect to the true parameters under the unbounded assumption.
To prove the asymptotic normality of β̂_n with respect to β_0, we briefly revisit the subsampled score function of Section 3.2. We then apply a multivariate martingale central limit theorem (Lemma 4 in [19]), an extension of Theorem A.1 in [20], to show the asymptotic normality of β̂_n. Let {F_{N,i}}_{i=0}^{n} be a filtration adapted to the sampling, F_{N,0} ⊂ F_{N,1} ⊂ . . ., where σ(*_i) is the σ-algebra generated by the i-th sampling step. The subsample size n is assumed to increase with N. With this filtration, we define the martingale M̄ from the centered subsampled score contributions. The following theorem shows the asymptotic normality of the estimator β̂_n.
Theorem 5. Assume the stated regularity conditions, where X_ik denotes the k-th element of the vector X_i and X_ij its j-th element. Then the suitably normalized β̂_n − β_0 converges in distribution to a multivariate normal limit.

Here, we establish the unconditional asymptotic properties of the subsampling estimator for unbounded GLMs. The condition n/N = o(1) ensures that small-scale subsamples also achieve the expected performance, which greatly reduces the computational cost. We also present the theoretical asymptotic results, which lead to the subsampling probability given by the A-optimality criterion in [10].
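The martingale-difference structure underlying this argument can be illustrated numerically: conditional on F_N, each with-replacement sampling step contributes an inverse-probability-weighted term with conditional mean zero, so the partial sums form a martingale with respect to the sampling filtration. A toy sketch (all data and probabilities hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

# Full data and a hypothetical sampling distribution.
N, n = 1000, 400
z = rng.exponential(size=N)
pi = (z + 0.5) / (z + 0.5).sum()

gbar = z.mean()
idx = rng.choice(N, size=n, p=pi)
d = z[idx] / (N * pi[idx]) - gbar       # increments d_j given F_N

# Conditional mean-zero check:
# E[d_j | F_N] = sum_i pi_i * z_i / (N pi_i) - gbar = mean(z) - gbar = 0.
cond_mean = np.sum(pi * (z / (N * pi))) - gbar
print(cond_mean, d.mean())
```

Because each increment has conditional mean exactly zero, the partial sums of the d_j are a martingale adapted to the sampling filtration, which is precisely the structure that the martingale CLT (Lemma 4 in [19]) exploits.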

Conclusions and Future Work
In this paper, we derive the asymptotic normality of the subsampling M-estimator via the Fisher information. For unbounded GLMs with nonnatural link functions, we obtain both the conditional and unconditional asymptotic properties of the subsampling estimator.
For future study, it would be meaningful to apply the sub-Weibull concentration inequalities of [21] to make nonasymptotic inferences. Importance sampling is not ideal, since it tends to assign high sampling probabilities to already observed samples; hence, more effective subsampling methods for GLMs, such as the Markov subsampling of [22], deserve consideration. Moreover, high-dimensional methods [23,24] for subsampling need further study.

Acknowledgments:
We would like to thank Huiming Zhang for helpful discussions on large sample theory.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Technical Details
Lemma A1 (Theorem 4.17 in [16]). Let X_1, . . . , X_N be i.i.d. from a p.d.f. f_β w.r.t. a σ-finite measure ν on (R, B_R), where β ∈ Θ and Θ is an open set in R^p. Suppose that for every x in the range of X_1, f_β(x) is twice continuously differentiable in β and satisfies: (1) the usual interchange of differentiation and integration; (2) positive definiteness of the Fisher information matrix; (3) for any given β ∈ Θ, there exist a positive number C_β and a positive function h_β dominating the second-order derivatives for all x in the range of X_1, where || · || is the Euclidean norm and ||A|| = [tr(A^T A)]^{1/2} for any matrix A. Then there exists a sequence of estimators β̂_N (based on {X_i, i ∈ U}) such that P(s_a(β̂_N) = 0) → 1, where s_a(γ) = ∂ log L̃_N(γ)/∂γ, L̃_N(γ) is the likelihood function of the full data, and β_0 is the true parameter. Meanwhile, there exists a sequence of estimators β̂_n (based on {X_i, i ∈ S}) such that P(s_s(β̂_n) = 0) → 1, where s_s(γ) = ∂ log L̃_n(γ)/∂γ, L̃_n(γ) is the likelihood function of the subsampled data, and β_0 is the true parameter.
Let a i be the number of i-th subsampled data such that ∑ i∈U a i = n.
Proof. The claim follows directly from the definition of a_i.

Proposition A1. Under the conditions of Lemma A1, and assuming that β̂_N based on {X_i, i ∈ U} is an estimator of β and β̂_n based on {X_i, i ∈ S} is also an estimator of β, we have

β̂_n − β̂_N + Σ_N^{-1} ∇L_n(β̂_N) = o_{P|F_N}(1). (A6)

Proof. Taking a Taylor series expansion of ∇L_n(β̂_n) around β̂_N, and using the definition of a_i, the stated expansion follows.
Remark A1. The last equation in the proof ensures that β̂_n − β̂_N + Σ_N^{-1} ∇L_n(β̂_N) is a higher-order infinitesimal in conditional probability given F_N; the symbol o_{P|F_N}(1) in Equation (A6) denotes exactly such a higher-order infinitesimal.
Proof of Theorem 1. For every constant γ̄ > 0, the conditional Lindeberg condition can be verified. Then, by the Lindeberg-Feller central limit theorem (Proposition 2.27 of [11]), conditional on F_N, the normalized score converges in distribution to a standard normal limit. Combining this with Proposition A1, Equation (5) holds, which completes the proof.
Proof of Theorem 2. One first needs to show convexity (i.e., uniqueness and the existence of a maximum), given the existence of the estimators from [25]. Following Theorem 4.17 in [16], it suffices to control the maxima over γ ∈ G(C_0). Write ∇s_n(γ) = R_n(γ) − M_n(γ) (A9) and I_1(γ) = −E(∇s_n(γ)) = M_n(γ). Thus, one only needs to prove Equations (A11) and (A12), with the maxima taken over γ ∈ G(C_0), for any C_0 > 0. From the definition of M_n(γ) and the property of the trace on page 288 of [16], the left-hand side of Equation (A11) can be bounded; by condition (A.4), it then suffices to prove that γ^T X*_i − β̂_N^T X*_i converges to 0, so that Equation (A11) holds. Writing R_n(γ) = U_n(γ) + V_n(γ) + W_n(β̂_N) and arguing as in the proof of Equation (A11), the remaining maximum is bounded by the product of two factors.

Equation (A13) can be bounded, where the penultimate equality applies Lemma A2. Equation (A14) can be bounded by the same argument as for Equation (A11), via the Lagrange mean value theorem. Combining the bounds of Equations (A13) and (A14), one obtains the required estimate. Under the definition of W_n(β̂_N) and E(e*_i | F_N) = 0, together with Theorem 1.14(ii) in [16], Equation (A12) holds, and the proof is completed.
Proof of Theorem 4. Here, one needs to prove the consistency of β̂_n with respect to β_0, given the existence of β̂_n; see [27].
Proof. One derives each entry of the matrix via (∇s_n(β))_kj = (R_n(β))_kj − (M_n(β))_kj. By the definition of Φ, and using the fact that, conditional on the N data points, the n with-replacement sampling steps are independent and identically distributed, one obtains the first equality.

Proof. By Taylor's expansion, with Σ(β̃_n) = ∇²s_n(β̃_n), where β̃_n lies between β_0 and β̂_n, and using assumptions (C.3), (C.4) and (C.5) in Theorem 5, the remainder terms are negligible. Therefore, by Lemma A3, the desired convergence holds, and the proof is completed.
Proof. The M̄_i's are F_{N,i}-measurable by the definition of M̄_i and of the filtration {F_{N,i}}_{i=1}^{n}. A direct computation then gives the conditional means. By the definition of a martingale difference sequence on page 230 of [29], the proof is completed.
Under the definitions of T, M̄ and Q, it is obvious that Var(T) = Var(M̄) + Var(Q). Therefore, I − B_N is equivalent to the positive definite matrix Var(Q). The proof is completed.
By Lemma A5, conditions (F.1) and (F.2) of Lemma A7 are satisfied, so it remains to verify the third condition of Lemma A7. By the central limit theorem, for any t ∈ R^p, let t̄ = Var(T)^{-1/2} t and X̄ = iQ. Since the properties of complex multivariate normal distributions are equivalent to those of real multivariate normal distributions (page 222 of [30]) and EQ = 0, applying Lemma A8 completes the proof.