Robust Inference after Random Projections via Hellinger Distance for Location-Scale Family

Big data and streaming data are encountered in a variety of contemporary applications in business and industry. In such cases, it is common to use random projections to reduce the dimension of the data, yielding compressed data. These data, however, possess various anomalies, such as heterogeneity, outliers, and round-off errors, which are hard to detect due to volume and processing challenges. This paper describes a new robust and efficient methodology, based on the Hellinger distance, for analyzing the compressed data. Using large sample methods and numerical experiments, it is demonstrated that routine use of robust estimation procedures is feasible. The role of double limits in understanding efficiency and robustness is brought out, which is of independent interest.


Introduction
Streaming data are commonly encountered in several business and industrial applications, leading to so-called Big Data. These are commonly characterized using four V's: velocity, volume, variety, and veracity. Velocity refers to the speed of data processing, while volume refers to the amount of data. Variety refers to the various types of data, while veracity refers to uncertainty and imprecision in the data. It is believed that veracity is due to data inconsistencies, incompleteness, and approximations. Whatever the real cause, it is hard to identify and pre-process data for veracity in a big data setting. The issues are even more complicated when the data are streaming.
A consequence of data veracity is that the statistical assumptions used for analytics tend to be inaccurate. Specifically, considerations such as model misspecification, statistical efficiency, robustness, and uncertainty assessment, which are a standard part of the statistical toolkit, cannot be routinely addressed due to storage limitations. Statistical methods that address the twin problems of volume and veracity simultaneously would enhance the value of big data. While the health care and financial industries would be the prime beneficiaries of this technology, the methods can be routinely applied in a variety of problems that use big data for decision making.
We consider a collection of n (of the order of at least 10^6) observations, assumed to be independent and identically distributed (i.i.d.), from a probability distribution f(·) belonging to a location-scale family; that is, with density f(x|θ) = σ^{-1} f((x − µ)/σ), where θ = (µ, σ^2). We denote by Θ the parameter space and, without loss of generality, take it to be compact, since otherwise it can be re-parametrized in such a way that the resulting parameter space is compact (see [1]).
The purpose of this paper is to describe a methodology for joint robust and efficient estimation of µ and σ^2 that takes into account (i) storage issues, (ii) potential model misspecifications, and (iii) the presence of aberrant outliers. These issues, which are more likely to occur when dealing with massive amounts of data, can lead to inaccurate inference and misleading conclusions if not appropriately accounted for in the methodological development. On the other hand, incorporating them into the existing methodology may not be feasible due to the computational burden.
Hellinger distance-based methods have long been used to handle the dual issues of robustness and statistical efficiency. Since the work of [1,2], statistical methods that invoke alternative objective functions, which converge to the objective function under the posited model, have been developed and shown to possess efficiency and robustness. However, their routine use in the context of big data problems is not feasible due to the complexity of the computations and other statistical challenges. Recently, a class of algorithms, referred to as Divide and Conquer, has been developed to address some of these issues in the context of likelihood. These algorithms consist of distributing the data across multiple processors and, in the context of the problem under consideration, estimating the parameters from each processor separately and then combining them to obtain an overall estimate. The algorithm assumes the availability of several processors, with substantial processing power, to solve the complex problem at hand. Since robust procedures involve complex iterative computations, invoking an increased demand for several high-speed processors and enhanced memory, routine use of available analytical methods in a big data setting is challenging. The maximum likelihood method of estimation in the context of location-scale families of distributions has received much attention in the literature ([3][4][5][6][7]). It is well-known that the maximum likelihood estimators (MLE) of location-scale families may not exist unless the defining function f(·) satisfies certain regularity conditions. Hence, it is natural to ask whether other methods of estimation, such as the minimum Hellinger distance estimator (MHDE), can be used under weaker regularity conditions. This manuscript provides a first step towards addressing this question. Random projections and sparse random projections are being increasingly used to "compress data" and then use the resulting compressed data for inference.
The methodology, primarily developed by computer scientists, is increasingly gaining attention in the statistical community and has been investigated in a variety of recent works ([8][9][10][11][12]). In this manuscript, we describe a Hellinger distance-based methodology for robust and efficient estimation after the use of random projections for compressing i.i.d. data belonging to the location-scale family. The proposed method consists of reducing the dimension of the data to ease computations while simultaneously maintaining robustness and efficiency when the posited model is correct. While primarily developed to handle big and streaming data, the approach can also be used to handle privacy issues in a variety of applications [13].
The rest of the paper is organized as follows: Section 2 provides background on minimum Hellinger distance estimation; Section 3 is concerned with the development of Hellinger distance-based methods for compressed data obtained after using random projections; additionally, it contains the main results and their proofs. Section 4 contains results of the numerical experiments and also describes an algorithm for implementation of the proposed methods. Section 5 contains a real data example from financial analytics. Section 6 is concerned with discussions and extensions. Section 7 contains some concluding remarks.

Background on Minimum Hellinger Distance Estimation
Ref. [1] proposed minimum Hellinger distance (MHD) estimation for i.i.d. observations and established that MHD estimators (MHDE) are simultaneously robust and first-order efficient under the true model. Other researchers have investigated related estimators, for example, [14][15][16][17][18][19][20]. These authors establish that when the model is correct, the MHDE is asymptotically equivalent to the maximum likelihood estimator (MLE) in a variety of independent and dependent data settings. For a comprehensive discussion of minimum divergence theory see [21].
We begin by recalling that the Hellinger distance between two probability densities is the L_2 distance between the square roots of the densities. Specifically, let, for p ≥ 1, || · ||_p denote the L_p norm defined by ||f||_p = ( ∫ |f(x)|^p dx )^{1/p}. The squared Hellinger distance between the densities f(·) and g(·) is then given by HD^2(f(·), g(·)) = || f^{1/2}(·) − g^{1/2}(·) ||_2^2.
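As a concrete illustration, the squared Hellinger distance can be approximated on a grid; the following is a minimal sketch (the Gaussian densities, grid, and parameter values are our own illustrative choices, not from the paper):

```python
import numpy as np

def hellinger_sq(f_vals, g_vals, dx):
    """Squared Hellinger distance: the squared L2 distance between the
    square roots of two densities evaluated on a common grid."""
    return np.sum((np.sqrt(f_vals) - np.sqrt(g_vals)) ** 2) * dx

def norm_pdf(t, mu, sig):
    return np.exp(-0.5 * ((t - mu) / sig) ** 2) / (sig * np.sqrt(2.0 * np.pi))

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

# Identical densities: distance 0.
d_same = hellinger_sq(norm_pdf(x, 0.0, 1.0), norm_pdf(x, 0.0, 1.0), dx)
# N(0,1) versus N(2,1): closed form is 2(1 - exp(-(mu1-mu2)^2 / (8 sigma^2))).
d_diff = hellinger_sq(norm_pdf(x, 0.0, 1.0), norm_pdf(x, 2.0, 1.0), dx)
```

The distance is bounded above by 2, which is the source of the robustness: a contaminating observation can perturb the objective function only by a bounded amount.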
To study the robustness of the MHDE, ref. [1] showed that, to assess the robustness of a functional with respect to the gross-error model, it is necessary to examine the α-influence curve rather than the influence curve, except when the influence curve provides a uniform approximation to the α-influence curve. Specifically, the α-influence function (IF_α(θ, z)) is defined as follows: for θ ∈ Θ, let f_{α,θ,z}(·) = (1 − α) f(·|θ) + α η_z(·), where η_z denotes the uniform density on the interval (z − ε, z + ε), ε > 0 is small, α ∈ (0, 1), and z ∈ IR; the α-influence function is then defined to be IF_α(θ, z) = α^{-1} [ T(f_{α,θ,z}) − T(f(·|θ)) ], where T(f_{α,θ,z}) is the functional for the model with density f_{α,θ,z}(·). Equation (2) represents a complete description of the behavior of the estimator in the presence of contamination, up to the shape of the contaminating density. If IF_α(θ, z) is a bounded function of z such that lim_{z→∞} IF_α(θ, z) = 0 for every θ ∈ Θ, then the functional T is robust at f(·|θ) against 100α% contamination by gross errors at an arbitrarily large value z. The influence function can be obtained by letting α → 0. Under standard regularity conditions, the minimum divergence estimators (MDE) are first-order efficient and have the same influence function as the MLE under the model, which is often unbounded. Hence the robustness of these estimators cannot be explained through their influence functions. In contrast, the α-influence functions of these estimators are often bounded, continuous functions of the contaminating point. Finally, this approach often leads to high breakdown points in parametric estimation. Other explanations can be found in [22,23].
Ref. [1] showed that the MHDE of location has a breakdown point equal to 50%. Roughly speaking, the breakdown point is the smallest fraction of data that, when strategically placed, can cause an estimator to take arbitrary values. Ref. [24] obtained breakdown results for MHDE of multivariate location and covariance. They showed that the affine-invariant MHDE for multivariate location and covariance has a breakdown point of at least 25%. Ref. [18] showed that the MHDE has 50% breakdown in some discrete models.

Hellinger Distance Methodology for Compressed Data
In this section, we describe the Hellinger distance-based methodology as applied to the compressed data. Since we are seeking to model streaming independent and identically distributed data, we denote by J the number of observations in a fixed time interval (for instance, every ten minutes, every half-hour, or every three hours). Let B denote the total number of time intervals. Alternatively, B could also represent the number of sources from which the data are collected. Then, the incoming data can be expressed as {X_jl, 1 ≤ j ≤ J; 1 ≤ l ≤ B}. Throughout this paper, we assume that the density of X_jl belongs to a location-scale family and is given by f(x|θ*) = σ*^{-1} f((x − µ*)/σ*), where θ* = (µ*, σ*^2). A typical example is a data store receiving data from multiple sources, for instance financial or healthcare organizations, where information from multiple sources across several hours is used to monitor events of interest, such as cumulative usage of certain financial instruments or drugs.

Random Projections
Let R_l = (r_ijl) be an S × J matrix, where S is the number of compressed observations in each time interval, S ≪ J, and the r_ijl's are independent and identically distributed random variables, assumed to be independent of {X_jl, j = 1, 2, · · · , J; 1 ≤ l ≤ B}. Let Ỹ_il = ∑_{j=1}^{J} r_ijl X_jl and set Ỹ_l = (Ỹ_1l, · · · , Ỹ_Sl)′; in matrix form, this can be expressed as Ỹ_l = R_l X_l. The matrix R_l is referred to as the sensing matrix, and {Ỹ_il, i = 1, 2, · · · , S; l = 1, 2, · · · , B} is referred to as the compressed data. The total number of compressed observations m = SB is much smaller than the number of original observations n = JB. We notice here that the R_l's are independent and identically distributed random matrices of order S × J. Referring to each time interval or source as a group, Table 1 is a tabular representation of the compressed data. Table 1. Illustration of the Data Reduction Mechanism. Here r*_il = (r_{i·l}, ω_il).
In the random projections literature, the distribution of r_ijl is typically taken to be Gaussian, but other distributions, such as the Rademacher distribution, the exponential distribution, and extreme value distributions, are also used (for instance, see [25]). In this paper, we do not make any strong distributional assumptions on r_ijl. We only assume that E[r_ijl] = 1 and Var[r_ijl] = γ_0^2, where E[·] represents the expectation of a random variable and Var[·] represents its variance. Additionally, we denote the density of r_ijl by q(·).
We next return to the storage issue. When S = 1 and r_ijl ≡ 1, Ỹ_il is a sum of J random variables. In this case, one retains (stores) only the sum of the J observations, and robust estimates of θ* are sought using the sum of observations. In other situations, that is, when the r_ijl are not degenerate at 1, the distribution of Ỹ_il is complicated. Indeed, even if the r_ijl are assumed to be normally distributed, the marginal distribution of Ỹ_il is complicated. The conditional distribution of Ỹ_il (given the r_ijl) is a weighted sum of location-scale distributions and does not have a useful closed-form expression. Hence, in general, the MLE method is not feasible for these problems. We denote ω_il^2 = ∑_{j=1}^{J} r_ijl^2 and work with the random variables Y_il ≡ ω_il^{-1} Ỹ_il. We denote the true density of Y_il by h_J(·|θ*, γ_0). Also, when γ_0 = 0 (which implies r_ijl ≡ 1), we denote the true density of Y_il by h*_J(·|θ*) to emphasize that the true density is a convolution of J independent and identically distributed random variables.
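The compression and normalization steps above can be sketched as follows; the sizes J, S, B, the parameter values, and the Gaussian choice for the sensing entries are illustrative assumptions, not the paper's simulation settings:

```python
import numpy as np

rng = np.random.default_rng(0)
J, S, B = 1000, 4, 50                 # illustrative sizes with S much smaller than J
mu, sigma, gamma0 = 1.0, 2.0, 0.1     # illustrative parameters

X = rng.normal(mu, sigma, size=(B, J))          # original data, one row per group

Y = np.empty((B, S))                  # normalized compressed data Y_il
omega = np.empty((B, S))              # omega_il = sqrt(sum_j r_ijl^2)
for l in range(B):
    R = rng.normal(1.0, gamma0, size=(S, J))    # sensing matrix R_l with E[r]=1, Var[r]=gamma0^2
    Y_tilde = R @ X[l]                          # compressed observations: Y~_l = R_l X_l
    omega[l] = np.sqrt((R ** 2).sum(axis=1))
    Y[l] = Y_tilde / omega[l]                   # normalized: Y_il = Y~_il / omega_il

# Only the S*B = 200 normalized values (plus the row sums of R_l and omega)
# need to be stored, instead of the J*B = 50,000 original observations.
```

With E[r_ijl] = 1, each Ỹ_il concentrates around Jµ, so after dividing by ω_il (which is close to √J for small γ_0) the normalized values concentrate around √J µ.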

Hellinger Distance Method for Compressed Data
In this section, we describe the Hellinger distance-based method for estimating the parameters of the location-scale family using the compressed data. As described in the last section, let {X_jl, j = 1, 2, · · · , J; l = 1, 2, · · · , B} be a doubly indexed collection of independent and identically distributed random variables with true density f(·|θ*). Our goal is to estimate θ* = (µ*, σ*^2) using the compressed data {Y_il, i = 1, 2, · · · , S; l = 1, 2, · · · , B}. We re-emphasize here that the density of Y_il additionally depends on γ_0, the variance of the sensing random variables r_ijl.
To formulate the Hellinger distance estimation method, let G be a class of densities metrized by the L_1 distance, and let {h_J(·|θ, γ_0); θ ∈ Θ} be a parametric family of densities. The Hellinger distance functional T is a measurable mapping from G to Θ, defined as follows: T(g) = arg min_{θ∈Θ} || h_J^{1/2}(·|θ, γ_0) − g^{1/2}(·) ||_2, g ∈ G. When g(·) = h_J(·|θ*, γ_0), then under additional (identifiability) assumptions θ*_g(γ_0) ≡ T(g) = θ*(γ_0). Since minimizing the Hellinger distance is equivalent to maximizing the affinity, it follows that T(g) = arg max_{θ∈Θ} ∫ h_J^{1/2}(y|θ, γ_0) g^{1/2}(y) dy. To obtain the Hellinger distance estimator of the true unknown parameter θ*, we naturally choose the parametric family h_J(·|θ, γ_0) to be the density of Y_il and g(·) to be a non-parametric L_1-consistent estimator g_B(·) of h_J(·|θ, γ_0). Thus, the MHDE is given by θ̂_B(γ_0) = T(g_B). In the notation above, we emphasize the dependence of the estimator on the variance of the projecting random variables. We notice here that the solution to (1) may not be unique; in such cases, we choose one of the solutions in a measurable manner.
The choice of density estimate typically employed in the literature is the kernel density estimate. However, in the setting of the compressed data investigated here, there are S observations per group. These S observations are, conditioned on the r_ijl, independent; however, they are marginally dependent (if S > 1). In the case when S > 1, we propose the following formula for g_B(·). First, for 1 ≤ i ≤ S, we consider the estimator g_{i,B}(y) = (B c_B)^{-1} ∑_{l=1}^{B} K( (y − Y_il)/c_B ), where K(·) is a kernel and c_B is a bandwidth. With this choice, the MHDE is given by, for 1 ≤ i ≤ S, θ̂_{i,B}(γ_0) = T(g_{i,B}). The above estimate of the density chooses the i-th observation from each group and obtains the kernel density estimator using the B independent and identically distributed compressed observations. This is one choice for the estimator. Of course, one could alternatively obtain S^B different estimators by choosing different combinations of observations from each group.
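A minimal sketch of this density estimate, using a Gaussian kernel and a synthetic stand-in for the B compressed observations (both our own assumptions for illustration), is:

```python
import numpy as np

def kde(y_grid, obs, c_B):
    """Gaussian-kernel density estimate g_{i,B} built from the B i.i.d.
    compressed observations Y_i1, ..., Y_iB (one per group)."""
    u = (y_grid[:, None] - obs[None, :]) / c_B
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(obs) * c_B * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(1)
B = 500
obs = rng.normal(0.0, 1.0, size=B)   # stand-in for the observations Y_i1, ..., Y_iB
c_B = B ** (-1 / 5)                  # bandwidth satisfying c_B -> 0 and B*c_B -> infinity

grid = np.linspace(-5.0, 5.0, 2001)
g_hat = kde(grid, obs, c_B)          # estimate of the density of the compressed data
```

The bandwidth exponent -1/5 is one conventional choice consistent with the conditions c_B → 0 and B c_B → ∞ stated below.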
It is well-known that the estimator g_{i,B}(·) is almost surely L_1-consistent for h_J(·|θ*, γ_0) as long as c_B → 0 and B c_B → ∞ as B → ∞. Hence, under additional regularity and identifiability conditions, and further conditions on the bandwidth c_B, the existence, uniqueness, consistency, and asymptotic normality of θ̂_{i,B}(γ_0), for fixed γ_0, follow from existing results in the literature.
When γ_0 = 0 and r_ijl ≡ 1, as explained previously, the true density is a J-fold convolution of f(·|θ*). It is therefore natural to ask the following question: if one lets γ_0 → 0, will the asymptotic results converge to what one would obtain by taking γ_0 = 0? We refer to this property as a continuity property in γ_0 of the procedure. Furthermore, it is natural to wonder whether these asymptotic properties can be established uniformly in γ_0. If that is the case, then one can also allow γ_0 to depend on B. This idea has an intuitive appeal, since one can choose the parameters of the sensing random variables to achieve an optimal inferential scheme. We address some of these issues in the next subsection.
Finally, we emphasize here that while we do not require S > 1, in applications involving streaming data and privacy problems S tends to be greater than one. In problems where the variance of the sensing variables is large, one can obtain an overall estimator by averaging θ̂_{i,B}(γ_0) over the choices 1 ≤ i ≤ S; that is, θ̄_B(γ_0) = S^{-1} ∑_{i=1}^{S} θ̂_{i,B}(γ_0). The averaging improves the accuracy of the estimator in small compressed samples (data not presented). For this reason, we provide results for this general case, even though our simulation and theoretical results demonstrate that for some problems considered in this paper, S can be taken to be one. We now turn to our main results, which are presented in the next subsection. Figure 1 provides an overview of our work.

Main Results
In this section, we state our main results concerning the asymptotic properties of the MHDE based on the compressed data Y_il. We emphasize here that we only store {(Y_il, r_{i·l}, ω_il^2) : i = 1, 2, · · · , S; l = 1, 2, · · · , B}. Specifically, we establish the continuity property in γ_0 of the proposed methods by establishing the existence of the iterated limits. This provides a first step towards establishing the double limit. The first proposition is well-known and is concerned with the existence and uniqueness of the MHDE for the location-scale family defined in (4) using compressed data.
Proof. The proof follows from Theorem 2.2 of [20] since, without loss of generality, Θ is taken to be compact and the density function h J (·|θ, γ 0 ) is continuous in θ.

Consistency:
We next turn our attention to consistency. As explained previously, under regularity conditions, for each fixed γ_0, the MHDE θ̂_{i,B}(γ_0) is consistent for θ*(γ_0). The next result says that, under additional conditions, the consistency property of the MHDE is continuous in γ_0.

Proposition 2.
Let h_J(·|θ, γ_0) be a continuous probability density function satisfying the conditions of Proposition 1, and assume that condition (6) below holds. Then, with probability one (wp1), the iterated limits also exist and equal θ*. Proof. Without loss of generality, let Θ be compact, since otherwise it can be embedded into a compact set as described in [1]. Since f(·) is continuous in θ and q(·) is continuous in γ_0, it follows that h_J(·|θ, γ_0) is continuous in θ and γ_0. Hence, by Theorem 1 of [1], for every fixed γ_0 ≥ 0 and 1 ≤ i ≤ S, θ̂_{i,B}(γ_0) converges to θ*(γ_0) with probability one as B → ∞. Thus, to verify the convergence of θ*(γ_0) to θ* as γ_0 → 0, we first establish, using (6), the convergence of the corresponding Hellinger distances. The existence of the iterated limit, first as B → ∞ and then as γ_0 → 0, then follows using the compactness of Θ and the identifiability of the model. As for the other iterated limit, notice that, for each B, h_J(·|θ, γ_0) converges in L_1 to h*_J(·|θ) with probability one as γ_0 converges to 0. The result then follows again by an application of Theorem 1 of [20].

Remark 1.
Verification of condition (6) appears to be involved even in the case of standard Gaussian random variables and standard Gaussian sensing random variables. Indeed, in this case, the density h_J(·|θ, γ_0) is a J-fold convolution of a Bessel function of the second kind. It may be possible to verify condition (6) using the properties of these functions and the compactness of the parameter space Θ. However, if one is focused only on weak consistency, it is an immediate consequence of Theorems 1 and 2 below, and condition (6) is not required. Finally, it is worth mentioning here that the convergence in (6), without uniformity over Θ, is a consequence of the convergence in probability of r_ijl to 1 and Glick's theorem.

Asymptotic limit distribution:
We now proceed to investigate the limit distribution of θ̂_{i,B}(γ_0) as B → ∞ and γ_0 → 0. It is well-known that, for fixed γ_0 ≥ 0, after centering and scaling, θ̂_{i,B}(γ_0) has a limiting Gaussian distribution under appropriate regularity conditions (see, for instance, [20]). However, to evaluate the iterated limits as γ_0 → 0 and B → ∞, additional refinements of the techniques in [20] are required. To this end, we start with additional notation. Let s_J(·|θ, γ_0) = h_J^{1/2}(·|θ, γ_0), and denote the score function by u_J(·|θ, γ_0) = ∇ log h_J(·|θ, γ_0), where ∇ is the gradient with respect to θ. Also, the Fisher information I(θ(γ_0)) is given by I(θ(γ_0)) = ∫ u_J(y|θ, γ_0) u_J(y|θ, γ_0)′ h_J(y|θ, γ_0) dy. We emphasize here that we suppress the index i in the above notation, since the compressed observations Y_il are identically distributed across i. The iterated limit distribution involves additional regularity conditions, which are stated in the Appendix. The first step towards this aim is a representation formula which expresses the quantity of interest, viz. B^{1/2}(θ̂_{i,B}(γ_0) − θ*(γ_0)), as a sum of two terms, one involving sums of compressed i.i.d. random variables and the other involving remainder terms that converge to 0 at a specific rate. This expression will appear in different guises in the rest of the manuscript and will play a critical role in the proofs.

Representation Formula
Before we state the lemma, we first provide two crucial assumptions that allow differentiating the objective function and interchanging differentiation and integration: (D1) Assume that s_J(·|θ, γ_0) is differentiable in θ. (D2) Assume further that ||∇s_J(·|θ, γ_0)||_2 is continuous and bounded.

Remark 2.
In the rest of the manuscript, we will refer to A 2B (γ 0 ) as the remainder term in the representation formula.
We now turn to the first main result of the manuscript, namely a central limit theorem for θ̂_{i,B}(γ_0), first as B → ∞ and then as γ_0 → 0. As a first step, we note that the Fisher information of the density h*_J(·|θ) is given by I(θ) = ∫ v_J(y|θ) v_J(y|θ)′ h*_J(y|θ) dy, where v_J(·|θ) denotes the corresponding score function. Next, we state the assumptions needed in the proof of Theorem 1. We separate these conditions into (i) model assumptions, (ii) kernel assumptions, (iii) regularity conditions, and (iv) conditions that allow comparison of the original data and the compressed data.

Model assumptions on
(D2') Assume further that ||∇t_J(·|θ)||_2 is continuous and bounded.
Kernel assumptions
(B1) K(·) is symmetric about 0, has compact support, and is bounded in L_2. We denote the support of K(·) by Supp(K).

Regularity conditions
where Supp(K) is the support of the kernel density K(·) and ∆ is a generic random variable with density h_J(·|θ*, γ_0).
(M6) The score function has a regular central behavior relative to the smoothing constants.
(M7) The density functions are smooth in an L_2 sense.

Assumptions comparing models for original and compressed data
Theorem 1. Assume that conditions (B1)-(B2), (M1)-(M7), (M1')-(M3'), and (O1)-(O2) hold. Then, for every 1 ≤ i ≤ S, B^{1/2}(θ̂_{i,B}(γ_0) − θ*(γ_0)) converges in distribution to G, first as B → ∞ and then as γ_0 → 0, where G is a bivariate Gaussian random variable with mean 0 and variance I^{-1}(θ*), and I(θ) is defined in (15).
Before we embark on the proof of Theorem 1, we first discuss the assumptions. Assumptions (B1) and (B2) are standard assumptions on the kernel and the bandwidth and are typically employed when investigating the asymptotic behavior of divergence-based estimators (see, for instance, [1]). Assumptions (M1)-(M7) and (M1')-(M3') are regularity conditions concerned essentially with the L_2 continuity and boundedness of the scores and their derivatives. Assumptions (O1)-(O2) allow for comparison of u_J(·|θ, γ_0) and v_J(·|θ). Returning to the proof of Theorem 1, using the representation formula, we will first show that A_1B(γ_0) converges in distribution to G, and then prove that lim_{γ_0→0} lim_{B→∞} A_2B(γ_0) = 0 in probability. We start with the following proposition.
where G is given in Theorem 1.
We divide the proof of Proposition 3 into two lemmas. In the first lemma, we establish the convergence of the quadratic terms D_1B and D_2B; in the second lemma, we show that, first letting B → ∞ and then allowing γ_0 → 0, the limiting distribution of the linear term is Gaussian. We start with the first part. Proof. Using the representation formula in Lemma 1, first fix γ_0 > 0. We begin with D_1B(θ̂_{i,B}(γ_0)), which by algebra can be expressed as the sum of two terms, D^(1)_1B and D^(2)_1B. By the Cauchy-Schwarz inequality and assumption (M2), it follows that there exists a constant 0 < C < ∞ bounding D^(2)_1B, and the convergence of D^(2)_1B to zero follows from the L_1 convergence of g_{i,B}(·). Again, by the Cauchy-Schwarz inequality and assumption (M2), it follows that D^(1)_1B converges to (1/2) I(θ*(γ_0)) as B → ∞. Turning to D_2B(θ̂_{i,B}(γ_0)), by a similar argument, using the Cauchy-Schwarz inequality and assumption (M3), it follows that D_2B(θ̂_{i,B}(γ_0)) → −(1/4) I(θ*(γ_0)). Thus, to complete the proof, it is enough to show that the difference terms in (16) converge to zero as γ_0 → 0. The first term can be expressed as the sum of J_11(γ_0) and J_12(γ_0), where J_11(γ_0) converges to zero by the Cauchy-Schwarz inequality and assumption (O2), and J_12(γ_0) converges to zero by the Cauchy-Schwarz inequality, assumption (M2'), and Scheffé's theorem. Next, we consider the second term of (16), J_2(γ_0). We will show that lim_{γ_0→0} J_2(γ_0) = 0. By algebra, it can be expressed as the sum of J_21(γ_0) and J_22(γ_0), where J_21(γ_0) converges to zero by the Cauchy-Schwarz inequality and assumption (O1), and J_22(γ_0) converges to zero by the Cauchy-Schwarz inequality, assumption (M3'), and Scheffé's theorem. Therefore the lemma holds. Turning to the linear term, since the Y_il's are i.i.d. across l, using the Cauchy-Schwarz inequality and assumption (B1), we can show that there exists 0 < C < ∞ bounding the bias term, which converges to zero as B → ∞ by assumption (M7). Also, the limiting distribution of 4B^{1/2} T_B(γ_0) is N(0, I(θ*(γ_0))) as B → ∞. Now let γ_0 → 0. It is enough to show that, as γ_0 → 0, the density of N(0, I(θ*(γ_0))) converges to the density of N(0, I(θ*)).
To this end, it suffices to show that lim γ 0 →0 I(θ * (γ 0 )) = I(θ * ). However, this is established in Lemma 2. Combining the results, the lemma follows.
Proof of Proposition 3. The proof of Proposition 3 follows immediately by combining Lemmas 2 and 3.
We now turn to establishing that the remainder term in the representation formula converges to zero.
We first deal with R_1B(γ_0), which can be expressed as the sum of R^(1)_1B(γ_0) and R^(2)_1B(γ_0). We begin with R^(2)_1B. Let ε > 0 be arbitrary but fixed. Then, by Markov's inequality, the tail probability is bounded by a moment of the kernel density estimator. Now, since the Y_il's are independent and identically distributed across l, the moment factorizes as in (19). Plugging (19) into (18) and interchanging the order of integration (using Tonelli's theorem), we obtain a bound of the form C times a vanishing term, where C is a universal constant, and the last convergence follows from conditions (M5)-(M6). We now deal with R^(1)_1B. To this end, we need to calculate E[g_{i,B}(y)]. Using a change of variables, a two-step Taylor approximation, and assumption (B1), we get (20). Now, plugging (20) into (17) and using conditions (M3) and (M6), we get (21). Convergence of (21) to 0 now follows from condition (M6). We next deal with R_2B(γ_0). To this end, by writing out the square term of d_J(·|θ*(γ_0)), we obtain (22). We will show that the RHS of (22) converges to 0 as B → ∞. We begin with the first term. Please note that, by the Cauchy-Schwarz inequality, it converges to 0 by (M4).
As for the second term, note that, a.s., it is bounded by the Cauchy-Schwarz inequality. Now, taking the expectation and using the Cauchy-Schwarz inequality again, one can show that the resulting bound converges to 0; this follows from condition (M4). Finally, by another application of the Cauchy-Schwarz inequality, the third term is bounded, and the convergence of the RHS of the above inequality to zero follows from (M4). Now the lemma follows.

Proof of Theorem 1. Recall that
where A_1B(γ_0) and A_2B(γ_0) are given in (9). Proposition 3 shows that lim_{γ_0→0} lim_{B→∞} A_1B(γ_0) = N(0, I^{-1}(θ*)) in distribution, while Lemma 4 shows that lim_{γ_0→0} lim_{B→∞} A_2B(γ_0) = 0 in probability. The result follows from Slutsky's theorem. We next show that, by interchanging the limits, namely first allowing γ_0 to converge to 0 and then letting B → ∞, the limit distribution of θ̂_{i,B}(γ_0) is Gaussian with the same covariance matrix as in Theorem 1. We begin with the additional assumptions required in the proof of the theorem.
where Supp(K) is the support of the kernel density K(·) and ∆ is a generic random variable with density h * J (·|θ * ).
(M6') The score function has a regular central behavior relative to the smoothing constants, i.e., (M7') The density functions are smooth in an L 2 sense; i.e.,

Assumptions comparing models for original and compressed data
where the expectation is with respect to the distribution K(·). (V3) Assume that, for all θ ∈ Θ, ∫_IR ||∇h*_J(y|θ)|| dy < ∞.
Theorem 2. Assume that conditions (V1)-(V4) and the corresponding primed regularity conditions hold. Then, for every 1 ≤ i ≤ S, B^{1/2}(θ̂_{i,B}(γ_0) − θ*(γ_0)) converges in distribution to G, first as γ_0 → 0 and then as B → ∞, where G is a bivariate Gaussian random variable with mean 0 and variance I^{-1}(θ*).
We notice that in Theorem 2 above we use conditions (V2)-(V4), which are regularity conditions on the scores of the J-fold convolution of f(·), while (V1) facilitates comparison of the scores of the densities of the compressed data with those of the J-fold convolution. As before, we will first establish (a): A_1B(γ_0) converges in distribution to G, first letting γ_0 → 0 and then B → ∞; and then (b): lim_{B→∞} lim_{γ_0→0} A_2B(γ_0) = 0 in probability. We start with the proof of (a).
We divide the proof of Proposition 4 into two lemmas. In the first lemma, we establish the convergence of the quadratic terms; in the second, we establish the limiting distribution of the linear term. Proof. First fix B. Recall the representation formula. By algebra, D_1B(θ̂_{i,B}(γ_0)) can be expressed as the sum of H^(1)_1B, H^(2)_1B, H^(3)_1B, H^(4)_1B, and H^(5)_1B, where g*_B(·) is given in (7). First consider H^(1)_1B. It converges to zero as γ_0 → 0 by the Cauchy-Schwarz inequality and assumption (O2). Next, we consider H^(2)_1B. Notice that, by the Cauchy-Schwarz inequality and the boundedness of v_J(y|θ*) t_J(y|θ*) in L_2, there exists a constant C bounding it; it then suffices to show that g_{i,B}(·) converges to g*_B(·). Next, by the Cauchy-Schwarz inequality, the boundedness of v_J(y|θ*) t_J(y|θ*) in L_2, and Scheffé's theorem, the term ∫_IR v_J(y|θ*) h_J^{1/2}(y|θ*, γ_0) [s_J(y|θ*, γ_0) − t_J(y|θ*)] dy converges to zero as γ_0 → 0. Next, we consider H^(4)_1B; it converges to zero by the Cauchy-Schwarz inequality and assumption (M2'). Thus (24) holds. Now let B → ∞; the remaining term converges to zero by the Cauchy-Schwarz inequality and the L_1 convergence of g*_B(·) to h*_J(·|θ*). Therefore, lim_{B→∞} lim_{γ_0→0} D_1B(θ̂_{i,B}(γ_0)) = (1/2) I(θ*). We now turn to showing that lim_{B→∞} lim_{γ_0→0} D_2B(θ̂_{i,B}(γ_0)) = −(1/4) I(θ*). First fix B and express D_2B(θ̂_{i,B}(γ_0)) as a corresponding sum of terms H^(k)_2B. First consider H^(1)_2B; it converges to zero as γ_0 → 0 by the Cauchy-Schwarz inequality and assumption (O1). The remaining terms are handled analogously. Proof. First fix B. We will show the convergences (29) and (30) as γ_0 → 0; namely, the RHS of (29) converges to zero as γ_0 → 0, and the RHS of (30) converges to zero in probability as γ_0 → 0. First consider the RHS of (29); it converges to zero as γ_0 → 0 by assumption (V1). Next, consider the RHS of (30). By assumption (V2), it follows that, as γ_0 → 0, (30) converges to zero in probability. Now letting B → ∞, the last convergence follows by assumption (M7'). Hence, using the central limit theorem for independent and identically distributed random variables, it follows that the limiting distribution of B^{1/2} A_1B(γ_0) is Gaussian; moreover, lim_{B→∞} lim_{γ_0→0} A_2B(γ_0) = 0 in probability.

Proof of Theorem 2. Proposition 4 shows that, first letting γ_0 → 0 and then B → ∞, A_1B(γ_0) converges in distribution to G, while Lemma 7 shows that lim_{B→∞} lim_{γ_0→0} A_2B(γ_0) = 0 in probability. The theorem follows from Slutsky's theorem. Remark 3. The above two theorems (Theorems 1 and 2) do not immediately imply that the double limit exists. This requires stronger conditions and more delicate calculations and will be considered elsewhere.
Our next result is concerned with the behavior of the α-influence function when γ0 → 0 first and then |z| → ∞ or α → 0. The following three additional assumptions will be used in the proof of part (ii) of Theorem 4.
In the next section, we describe the implementation details and provide several simulation results in support of our methodology.

Implementation and Numerical Results
In this section, we apply the proposed MHD-based methods to estimate the unknown parameters θ = (µ, σ²) using the compressed data. We set J = 10,000 and B = 100. All simulations are based on 5000 replications. We consider the Gaussian kernel and the Epanechnikov kernel for the nonparametric density estimation. The Gaussian kernel is given by K(u) = (2π)^{−1/2} exp(−u²/2), and the Epanechnikov kernel is given by K(u) = (3/4)(1 − u²) for |u| ≤ 1 and zero otherwise. We generate X and the uncontaminated compressed data Y in the following way:
Step 1. Generate X_l with entries X_jl independent and identically distributed as N(µ, σ²).
Step 2. Generate the sensing matrix R_l.
Step 3. Generate the uncontaminated Y_l by calculating Y_l = R_l X_l.
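The generation scheme above can be sketched as follows. The distribution of the Gaussian sensing entries (mean 1/J, small variance) is an illustrative assumption chosen so that E[Y_il] is close to µ; the paper's exact scaling of R_l is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
J, B, S = 10_000, 100, 1      # block size, number of blocks, compressed dimension
mu, sigma = 2.0, 1.0          # true parameter values used in the simulations

# Step 1: B blocks of J raw observations each, X_jl ~ N(mu, sigma^2)
X = rng.normal(mu, sigma, size=(B, J))

# Steps 2-3: per-block sensing matrix R_l and compressed data Y_l = R_l X_l;
# the entry distribution below (mean 1/J) is an assumption of this sketch
Y = np.empty((B, S))
for l in range(B):
    R_l = rng.normal(1.0 / J, 0.1 / np.sqrt(J), size=(S, J))
    Y[l] = R_l @ X[l]
```

With this scaling, each compressed observation Y_il concentrates around µ, so the B × S array Y plays the role of the stored compressed data in the simulations.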

Objective Function
In practice, we store the compressed data (Y_l, r_{·l}, ω_l) for all 1 ≤ l ≤ B. Hence, if X_jl follows a normal distribution with mean µ and variance σ², the marginal density of the compressed data, viz., Y_il, is complicated and does not have a closed-form expression. However, for large J, using the local limit theorem, its density can be approximated by a Gaussian density with mean √J µ. Note that with this transformation, E[U_il] = 0 and Var[U_il] = σ². The kernel density estimate of the unknown true density is then constructed from the transformed data. The difference between the classical kernel density estimate and the one proposed here is that we include the unknown parameter µ in the kernel. Additionally, this allows one to incorporate (r_{·l}, ω_l) into the kernel. Consequently, only the scale parameter σ is part of the parametric model. Using the local limit theorem, we approximate the true parametric model by φ(·|σ), where φ(·|σ) is the density of N(0, σ²), yielding the objective function and the estimator θ̂_B(γ0). It is clear that θ̂_B(γ0) is a consistent estimator of θ*. In the next subsection, we use the quasi-Newton method with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update to estimate θ. The quasi-Newton method is appealing since (i) it replaces the complicated calculation of the Hessian matrix with an approximation that is easier to compute (∆_k(θ), given in the next subsection), and (ii) it gives a more flexible step size t (compared to the Newton-Raphson method), ensuring that it does not "jump" too far at any step and hence guaranteeing convergence of the estimating equation. The BFGS update (H_k) is a popular method for approximating the Hessian matrix via gradient evaluations. The step size t is determined using the backtracking line search described in Algorithm 2. The algorithms are given in detail in the next subsection. Our analysis also includes the case where S ≡ 1 and r_ijl ≡ 1.
In this case, as explained previously, one obtains a significant reduction in storage and computational complexity. Finally, we emphasize that the density estimate contains µ and is not parameter-free, as is typical in classical MHDE analysis. In the next subsection, we describe an algorithm to implement our method.
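A rough sketch of this objective (for S = 1, with the (r_{·l}, ω_l) weights omitted for brevity) computes the squared Hellinger distance between the kernel density estimate of the centred data U_l = Y_l − µ and the N(0, σ²) model on a grid; the grid construction and the simplified kernel centring are assumptions of this sketch, not the paper's exact construction.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def mhd_objective(theta, Y, c_B):
    """Squared Hellinger distance between the kernel density estimate of the
    centred data U_l = Y_l - mu and the parametric model N(0, sigma^2).
    Centring by mu inside the kernel mirrors the parameter-dependent kernel
    of the text; the (r_l, omega_l) weights are omitted in this sketch."""
    mu, sigma = theta
    U = Y - mu                  # under the model, E[U] = 0 and Var[U] = sigma^2
    grid = np.linspace(U.min() - 3 * c_B, U.max() + 3 * c_B, 512)
    g_hat = gaussian_kernel((grid[:, None] - U[None, :]) / c_B).mean(axis=1) / c_B
    model = gaussian_kernel(grid / sigma) / sigma        # N(0, sigma^2) density
    integrand = np.sqrt(g_hat * model)
    # trapezoidal rule for the Hellinger affinity integral
    affinity = float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(grid)) / 2.0)
    return 2.0 - 2.0 * affinity

rng = np.random.default_rng(1)
Y = rng.normal(2.0, 1.0, size=100)   # stand-in compressed data
```

Minimizing mhd_objective over (µ, σ) then plays the role of the estimating step; at the true parameter values the objective is near zero, and it grows as θ moves away.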

Algorithm
As explained previously, we use the quasi-Newton algorithm with the BFGS update to obtain θ̂_MHDE. To describe this method, consider the objective function (suppressing i) Ψ(θ), which is twice continuously differentiable. Let the initial value of θ be θ^(0) = (µ^(0), σ^(0)) and H_0 = I, where I is the identity matrix.
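The iteration just described can be sketched generically as below: a quasi-Newton loop with the BFGS inverse-Hessian update and a backtracking (Armijo) line search in the spirit of Algorithms 1-2. The Armijo constants, the curvature safeguard, and the stopping rule are standard defaults assumed for this sketch, not values from the paper.

```python
import numpy as np

def backtracking(psi, g, theta, d, t=1.0, beta=0.5, alpha=1e-4):
    """Backtracking line search enforcing the Armijo decrease condition."""
    while t > 1e-12 and psi(theta + t * d) > psi(theta) + alpha * t * float(g @ d):
        t *= beta
    return t

def quasi_newton_bfgs(psi, grad, theta0, tol=1e-8, max_iter=200):
    """Quasi-Newton minimisation of psi with the BFGS inverse-Hessian
    update H_k, starting from H_0 = I as in the text."""
    theta = np.asarray(theta0, dtype=float)
    H = np.eye(theta.size)
    g = grad(theta)
    for _ in range(max_iter):
        d = -H @ g                       # search direction from the current H_k
        t = backtracking(psi, g, theta, d)
        s = t * d
        theta_new = theta + s
        g_new = grad(theta_new)
        if np.linalg.norm(g_new) < tol:
            return theta_new
        y = g_new - g
        sy = float(s @ y)
        if sy > 1e-12:                   # curvature condition; skip update otherwise
            rho = 1.0 / sy
            I = np.eye(theta.size)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        theta, g = theta_new, g_new
    return theta
```

In the application, psi would be the Hellinger-distance objective Ψ(θ) and grad its (numerical or analytic) gradient; the skeleton itself is problem-agnostic.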

Initial Values
The initial value for θ is based on the sample mean and sample standard deviation of the compressed data; an alternative choice of the initial value for σ is also considered. In all the tables below, we report the average (Ave), standard deviation (StD), and mean squared error (MSE) to assess the performance of the proposed methods.

Analyses Without Contamination
In Tables 2-5, we set the true values µ = 2 and σ = 1 and take the kernel to be the Gaussian kernel. In Table 2, we compare the estimates of the parameters as the dimension of the compressed data S increases, allowing S to take values in the set {1, 2, 5, 10}. Also, we let the number of groups be B = 100, the bandwidth c_B = 0.3, and γ0 = 0.1. In addition, in Table 2, S* = 1 means that S = 1 with γ0 ≡ 0. From Table 2 we observe that as S increases, the estimates for µ and σ remain stable. The case S* = 1 is interesting, since even by storing only the sum we are able to obtain point estimates that are close to the true values. In Table 3, we choose S = 1, B = 100, and c_B = 0.3 and compare the estimates as γ0 changes from 0.01 to 1.00. We can see that as γ0 increases, the estimate for µ remains stable, whereas the bias, standard deviation, and MSE for σ increase. In Table 4, we fix S = 1, B = 100, and γ0 = 0.1 and allow the bandwidth c_B to increase. Also, c*_B = 0.30 means that the bandwidth is chosen as 0.30 with γ0 ≡ 0. Notice that in this case, when c_B = 0.9, B^{1/2} c_B = 9 while B^{1/2} c_B² = 8.1, which is not small as required by assumption (B2). We notice again that as c_B decreases, the estimates of µ and σ are close to the true values with small MSE and StD. In Table 5, we let S = 1, c_B = 0.3, and γ0 = 0.1 and let the number of groups B increase. This table shows that as B increases, the estimate performs better in terms of bias, standard deviation, and MSE. In Table 6, we set γ0 ≡ 0 and keep the other settings the same as in Table 5. Again, as B increases, the estimate performs better in terms of bias, standard deviation, and MSE; furthermore, the standard deviation and MSE are slightly smaller than the results in Table 5. We next move on to investigating the effect of other sensing variables. In Table 7, we use the Gamma model to generate the additive matrix R_l.
Specifically, the mean of the Gamma random variable is set as α0 β0 = 1, and the variance α0 β0² is chosen from the set {0, 0.01², 0.01, 0.25, 1.00}, which are also the variances in Table 3.
From Table 7, notice that using a Gamma sensing variable yields results similar to those for a Gaussian sensing variable. Our next example considers the case when the mean of the sensing variable is not equal to one and the sensing variable has a discrete distribution. Specifically, we use Bernoulli sensing variables with parameter p. Moreover, we fix S = 1 and let pJ = S, so that p = 1/J; hence, as J increases, the variance decreases. Notice that in this case the mean of the sensing variable is p instead of 1. In addition, E[Y_il] = µ and Var[Y_il] = σ² + µ²(1 − 1/J), and we set the initial values accordingly. Additionally, we take B = 100, c_B = 0.30, and S = 1. Table 8 shows that the MHD method also performs well with a Bernoulli sensing variable, although the bias of σ̂ and the standard deviation and mean squared error of both estimates are larger than those obtained using Gaussian and Gamma sensing variables.

Robustness and Model Misspecification
In this section, we provide a numerical assessment of the robustness of the proposed methodology. To this end, the data are generated from a mixture in which η(x) is a contaminating component with mixing proportion α ∈ [0, 1). We generate the contaminated reduced data Y in the following way:
Step 3. Generate the uncontaminated Y_l by calculating Y_l = R_l X_l.

Step 4. Generate the contaminated Y_il^c, where Y_il^c = Y_il + η(x) with probability α and Y_il^c = Y_il with probability 1 − α.
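Step 4 can be sketched as below, assuming η(x) acts as a fixed shift η added to each selected entry (the constant-shift form is an assumption of this sketch).

```python
import numpy as np

def contaminate(Y, eta, alpha, rng):
    """Step 4: with probability alpha replace Y_il by Y_il + eta,
    otherwise keep Y_il (contamination within blocks)."""
    mask = rng.random(np.shape(Y)) < alpha
    return np.where(mask, np.asarray(Y) + eta, Y)

rng = np.random.default_rng(0)
Y = rng.normal(2.0, 1.0, size=(100, 1))     # stand-in compressed data
Yc = contaminate(Y, 1000.0, 0.2, rng)       # 20% contamination, eta = 1000
```

The estimator is then run on Yc instead of Y, which is how the contamination levels in Tables 9 and 10 are produced.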
In the above description, the contamination with outliers is within blocks. A conceptual issue that one encounters is the meaning of outliers in this setting: a data point that is an outlier in the original data set may not remain an outlier in the reduced data, and vice versa. Hence, concepts such as the breakdown point and the influence function need to be carefully studied. The tables below present one version of the robustness exhibited by the proposed method. In Tables 9 and 10, we set J = 10⁴, B = 100, S = 1, γ0 = 0.1, c_B = 0.3, and η = 1000. In addition, α* = 0 means that α = 0 with γ0 ≡ 0. From these tables we observe that even under 50% contamination the estimate of the mean remains stable; however, the estimate of the variance is affected at high levels of contamination (beyond 30%). An interesting and important issue is to investigate the role of γ0 in the breakdown point of the estimator.
Finally, we investigate the bias in the MHDE as a function of the value of the outlier. The graphs in Figure 2 describe the changes in the MHDE as the outlier value η increases. Here we set S = 1, B = 100, and γ0 = 0.1. In addition, we let α = 0.2 and let η take values in {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}. We can see that as η increases, both µ̂ and σ̂ increase up to η = 500 and then decrease, although µ̂ does not change much. This phenomenon occurs because when the outlier value is small (i.e., closer to the observations), it may not be treated as an "outlier" by the MHD method; however, once the outlier values move far enough from the other values, the estimates of µ and σ remain stable.

Example
In this section, we describe an analysis of data from financial analytics using the proposed methods. The data are from a bank (a cash and credit card issuer) in Taiwan, and the targets of the analyses were the bank's credit card holders. The research focused on the case of customers' default payments. The data set (see [27] for details) contains 180,000 observations and includes information on twenty-five variables, such as default payments, demographic factors, credit data, history of payment, and billing statements of credit card clients from April 2005 to September 2005. Ref. [28] studies machine learning methods for evaluating the probability of default. Here, we work with the first three months of data, containing 90,000 observations concerning bill payments. For our analyses, we remove zero and negative payments from the data set and perform a logarithmic transformation of the bill payments. Since the log-transformed data were multi-modal and exhibited features of a mixture of normal distributions, we work with the log-transformed data with values in the range (6.1, 13). Next, we applied the Box-Cox transformation to the log-transformed data. This transformation identifies the best transformation yielding an approximately normal distribution (which belongs to the location-scale family). Specifically, letting L denote the log-transformed data in the range (6.1, 13), the data after the Box-Cox transformation are given by X = (L² − 1)/19.9091. The histogram of X is given in Figure 3. The number of observations at the end of data processing was 70,000. Our goal is to estimate the average bill payment during the first three months, and we apply the proposed method for this purpose. In this analysis, we assume that the target model for X is Gaussian and split the data randomly into B = 100 blocks, yielding J = 700 observations per block.
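The preprocessing pipeline just described (drop non-positive payments, log-transform, restrict to the range (6.1, 13), then apply X = (L² − 1)/19.9091) can be sketched as:

```python
import numpy as np

def preprocess(payments):
    """Log-transform positive bill payments, keep log-values in (6.1, 13),
    then apply the Box-Cox transform X = (L^2 - 1) / 19.9091, i.e. the
    lambda = 2 transform with the normalising constant used in the text."""
    p = np.asarray(payments, dtype=float)
    p = p[p > 0]                         # remove zero and negative payments
    L = np.log(p)
    L = L[(L > 6.1) & (L < 13.0)]        # retain the range used in the analysis
    return (L ** 2 - 1.0) / 19.9091

X = preprocess([np.exp(7.0), -5.0, 0.0, np.exp(12.0), 1.0])
```

The resulting array X is what is split into the B = 100 blocks for the compressed-data analysis.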
In Table 11, we choose the bandwidth c_B = 0.30. Also, S* = 1 represents the case where S = 1 and γ0 ≡ 0; in all other settings, we keep γ0 = 0.1. We observe that all estimates are similar as S changes. Next, we study the robustness of the MHDE for these data by investigating the relative bias and studying the influence function. Specifically, we first reduce the dimension from J = 700 to S = 1 for each of the B = 100 blocks and obtain the compressed data Y; next, we generate the contaminated reduced data Y_il^c as in Step 4 of Section 4.5. Also, we set α = 0.20 and γ0 = 0.20; the kernel is taken to be the Epanechnikov density with bandwidth c_B = 0.30, and η(x) is assumed to take values in {50, 100, 200, 300, 500, 800, 1000} (note that the approximate mean of Y is around 3600). Let T_MHD be the Hellinger distance functional; we use its influence function to assess robustness. The graphs in Figure 4 illustrate how the influence function changes as the outlier values increase. We observe that for both estimates (µ̂ and σ̂), the influence function first increases and then decreases rapidly. From η(x) = 300 onward, the influence functions remain stable and close to zero, which clearly indicates that the MHDE is stable. Additional analyses: The histogram in Figure 3 suggests that a mixture of normal distributions may fit the log- and Box-Cox-transformed data better than a single normal distribution. For this reason, we calculated the Hellinger distance between a four-component mixture (chosen using the BIC criterion) and the normal distribution, which was determined to be approximately 0.0237. Thus, the normal distribution (which belongs to the location-scale family) can be viewed as a misspecified target distribution; admittedly, one loses information about the components of the mixture distribution due to model misspecification.
However, since our goal was to estimate the overall mean and variance, the proposed estimate seems to possess the properties described in the manuscript.
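A simple empirical analogue of the influence-function analysis above can be sketched as follows: perturb a fraction α of the compressed data by η and rescale the change in the estimate by α. The estimator plugged in below (the sample median, contrasted with the sample mean) is a stand-in for illustration only, not the MHDE itself.

```python
import numpy as np

def sensitivity_curve(estimator, Y, eta, alpha, rng):
    """Rescaled change in an estimate when a fraction alpha of the
    compressed data is shifted by eta (an empirical alpha-influence)."""
    mask = rng.random(Y.shape) < alpha
    Yc = np.where(mask, Y + eta, Y)
    return (estimator(Yc) - estimator(Y)) / alpha

rng = np.random.default_rng(0)
Y = rng.normal(0.0, 1.0, size=10_000)
# a robust estimator (median, as a stand-in) reacts far less to a large
# eta than the non-robust sample mean does
sc_median = sensitivity_curve(np.median, Y, 1000.0, 0.2, np.random.default_rng(1))
sc_mean = sensitivity_curve(np.mean, Y, 1000.0, 0.2, np.random.default_rng(1))
```

The bounded curve for the robust estimator versus the curve of order η for the mean mirrors the qualitative behaviour seen in Figure 4.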

Discussion and Extensions
The results in this manuscript focus on the iterated-limit theory for the MHDE of compressed data obtained from a location-scale family. Two pertinent questions arise: (i) is it easy to extend this theory to the MHDE of compressed data arising from non-location-scale families of distributions? and (ii) is it possible to extend the theory from iterated limits to a double limit? Turning to (i), we note that the heuristic for considering the location-scale family comes from the fact that the first and second moments are consistently estimable for partially observed random walks (see [29,30]). This is related to the size of J, which can be of exponential order; for such large J, other moments may not be consistently estimable. Hence, the entire theory goes through as long as one considers parametric models f(·|θ), where θ = W(µ, σ²) for a known function W(·, ·). A case in point is the Gamma distribution, which can be re-parametrized in terms of the first two moments.
As for (ii), it is well known that the existence and equality of iterated limits for real sequences does not imply the existence of the double limit unless additional uniformity of convergence holds (see [31], for instance). Extension of this notion to distributional convergence requires additional assumptions and is investigated in a different manuscript, wherein more general divergences are also considered.

Concluding Remarks
In this paper, we proposed a Hellinger distance-based method to obtain robust estimates of the mean and variance in a location-scale model using compressed data. Our theoretical investigations and extensive simulations demonstrate the usefulness of the methodology, which can hence be applied in a variety of scientific settings. Several theoretical and practical questions concerning robustness in a big data setting remain. For instance, the variability in the R matrix and its effect on outliers are important issues that need further investigation. Furthermore, statistical properties such as uniform consistency and uniform asymptotic normality under different choices of the distribution of R would be useful. These are under investigation by the authors.
Author Contributions: The problem was conceived by E.A., A.N.V. and G.D. L.L. is a student of A.N.V., and worked on theoretical and simulation details with inputs from all members at different stages.

Funding:
The authors thank George Mason University Libraries for support with the article processing fees; Ahmed's research is supported by a grant from NSERC.