Gini Index Estimation within Pre-Specified Error Bound: Application to Indian Household Survey Data

The Gini index, a widely used measure of economic inequality, is computed from data whose designs involve clustering and stratification, generally known as complex household surveys. Under a complex household survey design, we develop two novel procedures for estimating the Gini index with a pre-specified error bound and confidence level. The two proposed approaches are based on sequential analysis, which is known to be economical in the sense of obtaining an optimal cluster size that reduces the project cost (that is, the total sampling cost) while achieving the pre-specified error bound and confidence level under reasonable assumptions. Some large-sample properties of the proposed procedures are examined without assuming any specific distribution. Empirical illustrations of both procedures are provided using the consumption expenditure data obtained by the National Sample Survey (NSS) Organization in India.


Introduction
Economic measures based on the income levels of the residents of a specific region play an important role in the social, economic and socio-economic sciences. They are used to quantify both the actual balance of the economy and the wealth and poverty of the people. One of the most prominent candidates is the (normalized) Gini index, which quantifies the economic inequality of a region, state, country or the world. Here, the random variable $X$ denotes the income level, $F(x)$ its cumulative distribution function, and $\mu = E(X)$ its expected value. If $G_F = 0$, then the economic system has maximal equality (e.g., everyone has the same income), while $G_F = 1$ represents perfect inequality (e.g., one individual has everything while the rest have nothing). For example, according to the Organization for Economic Cooperation and Development (2017), the Gini indices of the USA, Germany and South Africa were $G_F = 0.39$, $0.29$ and $0.62$ in 2017, respectively. These values suggest income inequality in these regions. Roughly speaking, income levels were more balanced (equal) in Germany than they were in the USA and in South Africa. Thus, the Gini index serves as a measure of economic balance that allows comparison across regions, which makes it an important measure in economics and in the social and political sciences. The estimation of the Gini index $G_F$ of a country or a region, however, is a rather challenging task, because income is usually measured at the household level and thus in a clustered and stratified way. In most countries (e.g., United States, European Union, India and others), complex household surveys are conducted annually, and their data can be used for the estimation of the Gini index as given in (1) (see Bhattacharya 2005, 2007).
Reporting only a point estimate of $G_F$, as most available resources do, is, however, rather unsatisfactory, because a point estimate alone conveys neither the variability in the sample nor the sample/cluster sizes on which it is based. Therefore, reporting $100(1 - \alpha)\%$ confidence intervals for $G_F$ alongside point estimates is much more informative for drawing both descriptive and comparative conclusions. Binder and Kovacevic (1995) and Bhattacharya (2007) proposed point estimators of the Gini index as well as of their standard errors in such complex survey designs (see Section 2 for details), which can be used for the computation of $100(1 - \alpha)\%$ confidence intervals for $G_F$. Furthermore, Peng (2011) proposed an empirical likelihood-based approach to construct such confidence intervals (as well as the confidence interval for the difference of two Gini indices). Clearly, for a given confidence level, a narrower confidence interval is more informative about the parameter of interest. Therefore, the aim of the present article is to develop confidence intervals for the Gini index $G_F$ in complex survey designs that control both the nominal confidence level $(1 - \alpha)$ and the confidence interval width. To guarantee that these criteria are fulfilled, the optimal number of clusters is computed using an innovative 'learn-as-you-go' or sequential procedure. We refer the readers to Ghosh and Sen (1991); Ghosh et al. (1997); Chattopadhyay and Kelley (2017); Kelley et al. (2018) and others for more on the sequential analysis literature.
The first known application of sequential analysis in surveys was by Mahalanobis (1940), who described the design and implementation of the method (in a different context) for estimating the acreage of the jute crop in the whole state of Bengal in undivided India. This was even before the seminal works of Stein (1945, 1949) in the area of sequential analysis. Kanninen (1993); Greene (1998); Arcidiacono and Jones (2003); Aguirregabiria and Mira (2007) and many others contributed to the application of sequential analysis in economics, data analysis, medicine, and other areas. Recently, Chattopadhyay and De (2016) and De and Chattopadhyay (2017) developed a sequential procedure for inference problems related to the Gini index under independent and identically distributed (i.i.d.) conditions, but that methodology cannot be used to find a sufficiently narrow $100(1 - \alpha)\%$ confidence interval for the population Gini index under a complex household survey design. We propose a two-stage procedure and a purely sequential procedure to estimate the minimum number of clusters required to obtain a sufficiently narrow confidence interval in a distribution-free setting. Both the two-stage and purely sequential procedures are applied to the 64th round of household survey data collected in India. Further, a simulation study is carried out on observations from the Indian household survey data and from known income distributions to explore the properties of the procedures.
The remainder of this paper is organized as follows. Section 2 describes the sampling framework of the complex survey design considered in this work. In Section 3, we formulate the problem of finding a sufficiently narrow confidence interval for the Gini index and explain why a procedure with a fixed cluster size is not applicable. In Section 4, we develop the purely sequential and the two-stage procedures, followed by a discussion of their characteristics in Section 5. An application of both procedures to real and synthetic data sets can be found in Section 6, while Section 7 describes an extension of the problem to the multivariate setup. We discuss the advantages and drawbacks of the proposed procedures in Section 8, and provide concluding comments in Section 9.

Survey Design and Point Estimation
In this section, the complex household survey design is described along with the notation used. Assume that the population is divided into $S$ strata, indexed by $s = 1, 2, \ldots, S$, and that the $s$th stratum is divided into $H_s$ clusters, indexed by $c_s = 1, \ldots, H_s$. The $c_s$th cluster in stratum $s$ contains a group of $M_{sc_s}$ households, the $h$th of which has $\nu_{sc_sh}$ individuals or members, $h = 1, 2, \ldots, M_{sc_s}$. Therefore, the total number of clusters in the population is $H = \sum_{s=1}^{S} H_s$. The number of households in stratum $s$ is $M_s = \sum_{c_s=1}^{H_s} M_{sc_s}$, and the total number of households in the population is $M = \sum_{s=1}^{S} M_s = \sum_{s=1}^{S} \sum_{c_s=1}^{H_s} M_{sc_s}$. For estimation purposes in such complex survey designs, a sample of $n_s$ clusters is selected from the $s$th stratum by simple random sampling with replacement. A simple random sample of $k$ households is then drawn (without replacement) from each of the selected clusters. Let the total number of clusters selected from the population be denoted by

$$n = \sum_{s=1}^{S} n_s \quad \text{with} \quad n_s = a_s n \quad \text{and} \quad a_s = \frac{H_s}{H}. \tag{2}$$

Thus, the total number of households in the sample is $kn = k \sum_{s=1}^{S} n_s$. For the $h$th household in the $c_s$th cluster of the $s$th stratum, the observed data (that is, the household monthly income, monthly expenditure, per capita income or others) are denoted by $x_{sc_sh}$. In the presence of stratification and clustering, households are assigned different weights $W_{sc_sh}$, because the probability of inclusion in the sample varies. The weight assigned to a selected household is computed as the inverse of its probability of inclusion in the sample (see Binder and Kovacevic 1995; Horvitz and Thompson 1952; Lee and Forthofer 2006). If researchers wish to increase (or decrease) the representation of a subgroup of the population that is of interest, they can employ oversampling (or undersampling) procedures and use appropriate weighting techniques.
Wells (1998) discussed several weighting methods for such cases. For our survey framework, weights are assigned to the data ($x_{sc_sh}$) with respect to the number of observations in the population. The weight attached to all $\nu_{sc_sh}$ members of the $h$th household belonging to the $c_s$th sampled cluster from the $s$th stratum is as given by Bhattacharya (2007). It should be noted that the computation of the sampling weights changes depending on the sampling design and also on whether the analysis is done at the district, household or individual level (Bhattacharya 2007). If the cluster size is large, sampling with or without replacement results in similar values for the weights. Moreover, Bhattacharya (2005, 2007) noted that sampling with or without replacement does not affect the asymptotic results of this work, as in most practical situations the number of clusters per stratum is large.
Under the above framework, let $W = \sum_{s, c_s, h} W_{sc_sh}$ denote the total of the per-household weights associated with the survey, and define the normalized weights $w_{sc_sh} = W^{-1} W_{sc_sh}$. These normalized weights are used to estimate the average income $\mu = E(X)$ and its cumulative distribution function $F(x)$ by the weighted estimators $\hat{\mu}$ and $\hat{F}(x)$, respectively, in order to take the relative household sizes into account. Then, under fairly mild conditions on the numbers of clusters, a consistent estimator of the Gini index $G_F$ given in (1) is given by the estimator $\hat{G}_n$ in (3). It follows that the estimated Gini index is essentially a ratio of two weighted averages of the income levels (see Bhattacharya 2007). In the next section, we discuss the construction of confidence intervals with bounded width.
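To illustrate how normalized weights enter a plug-in Gini estimate, the following Python sketch may help. It assumes the common form $\hat{G} = (2/\hat{\mu}) \sum w\, x\, \hat{F}(x) - 1$ with a mid-point weighted empirical distribution function; this form, the function name `weighted_gini`, and the toy inputs are illustrative assumptions and are not necessarily identical to the estimator in (3).

```python
def weighted_gini(values, weights):
    """Plug-in Gini estimate from household values and survey weights.

    Assumption: G = 2 * sum(w * x * F_mid(x)) / mu - 1, where F_mid is the
    weighted empirical CDF evaluated at the mid-point of each observation's
    own weight. This is a standard textbook form, not necessarily the
    paper's estimator (3).
    """
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Normalize the raw weights W to w = W / sum(W) and sort by value.
    pairs = sorted((x, w / total) for x, w in zip(values, weights))
    mu = sum(w * x for x, w in pairs)  # weighted mean of the values
    cum = 0.0   # cumulative normalized weight below the current observation
    acc = 0.0   # running sum of w * x * F_mid(x)
    for x, w in pairs:
        f_mid = cum + w / 2.0
        acc += w * x * f_mid
        cum += w
    return 2.0 * acc / mu - 1.0


# Toy inputs (made up): one household holds everything, equal weights.
g_unequal = weighted_gini([0, 0, 0, 1], [1, 1, 1, 1])
# Toy inputs (made up): all households identical.
g_equal = weighted_gini([1, 1, 1, 1], [2, 2, 2, 2])
```

Because only the normalized weights enter the computation, multiplying all raw weights by a constant leaves the estimate unchanged.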

Bounded Width Confidence Intervals
In order to derive bounded width confidence intervals, the (asymptotic) distribution of the empirical Gini index $\hat{G}_n$ must be established. Bhattacharya (2007) showed that if $E|X|^s < \infty$, and if $n_s \to \infty$ for each stratum $s = 1, \ldots, S$ at the same rate, then

$$\sqrt{n}\,(\hat{G}_n - G_F) \xrightarrow{d} N(0, \xi^2). \tag{4}$$

Here, $\xi^2$ denotes the (asymptotic) variance of $\sqrt{n}\,\hat{G}_n$. Due to its quite involved representation, we refer to Bhattacharya (2007) for the specific variance formula. The asymptotic distribution can now be used to compute $100(1 - \alpha)\%$ confidence intervals for the population Gini index whose width does not exceed a pre-specified value $\omega$, that is,

$$\left(\hat{G}_n - z_{\alpha/2}\frac{\xi}{\sqrt{n}},\ \hat{G}_n + z_{\alpha/2}\frac{\xi}{\sqrt{n}}\right) \quad \text{with width} \quad L = 2 z_{\alpha/2}\frac{\xi}{\sqrt{n}} \le \omega.$$

Here, $z_{\alpha/2}$ is the $100(1 - \alpha/2)$th percentile of the standard normal distribution $N(0, 1)$. Thus, the actual task is the computation of the $n$ that guarantees that the width of the confidence interval is bounded by $\omega$, i.e.,

$$n \ge \frac{4 z_{\alpha/2}^2\, \xi^2}{\omega^2} \equiv C. \tag{5}$$

Hence, $C$ denotes the optimal total number of clusters from all strata needed such that $L \le \omega$. Therefore, the optimal number of clusters required to be sampled from the $s$th stratum ($s = 1, 2, \ldots, S$) is $C_s = C a_s$. Here, the term optimal is used in the sense of the minimum number of clusters needed to meet the requirements, not in the sense of optimal allocation used in sample survey methods (see Cochran 1997). If $C$ is known, one can find the sufficiently narrow confidence interval that satisfies (5). However, without knowing the underlying distribution of the income (or assets or expenditure), the value of $\xi^2$ is unknown in practical scenarios. Thus, the optimal cluster size from all the $S$ strata, $C$, is also unknown. We note that a supposed value (or a previous survey estimate) of $\xi^2$ may be used to obtain the value of $C$. However, a potential problem is that the supposed value of $\xi^2$ may differ from the actual value. Moreover, using previous survey estimates is in many situations not advised, as they may not be applicable to the current population.
This is because socio-economic conditions may have changed, for example through a change in the distribution of income or expenditure brought about by new economic policies or situations. Due to all these factors, the resulting value of $C$ may differ widely from the value that would be obtained if $\xi^2$ were known, and (5) is then not guaranteed to hold. The (asymptotic) variance $\xi^2$ of the estimated Gini index must therefore be estimated in an appropriate way. Consistent estimators are discussed below.
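As a small worked illustration of the fixed-width requirement: under the normal approximation, an interval of the form $\hat{G}_n \pm z_{\alpha/2}\,\xi/\sqrt{n}$ has width $2 z_{\alpha/2}\,\xi/\sqrt{n}$, so requiring a width of at most $\omega$ gives $n \ge (2 z_{\alpha/2}/\omega)^2 \xi^2$. The sketch below evaluates this bound and splits it across strata in proportion to $a_s = H_s/H$. The function names and all numerical inputs are hypothetical; in particular, `xi_sq=0.25` is a made-up value, not a survey estimate.

```python
from statistics import NormalDist


def optimal_total_clusters(xi_sq, omega, alpha):
    """Real-valued bound (2 * z / omega)^2 * xi^2 on the number of clusters."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 normal quantile
    return (2.0 * z / omega) ** 2 * xi_sq


def allocate_proportionally(total, clusters_per_stratum):
    """Split a total across strata in proportion a_s = H_s / H."""
    H = sum(clusters_per_stratum)
    return [total * h / H for h in clusters_per_stratum]


# Illustrative inputs only (assumed values).
C = optimal_total_clusters(xi_sq=0.25, omega=0.02, alpha=0.05)
C_s = allocate_proportionally(C, [120, 60, 20])
```

Halving $\omega$ multiplies the bound by four, which is why a supposed value of $\xi^2$ that is too small can leave the interval noticeably wider than intended.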

Estimation of $\xi^2$
Several articles published in statistics and economics journals have proposed different estimators of the asymptotic variance of the estimator of the Gini index under different sampling schemes. Zitikis and Gastwirth (2002) proposed explicit formulas for the asymptotic variance of a general class of the Gini index (i.e., the S-Gini index) for simple random sampling with observations coming from the Exponential and Pareto distributions. We refer to Langel and Tillé (2013) for a discussion of several techniques used in estimating the asymptotic variance of the Gini index for various sampling designs. Under the current framework, Binder and Kovacevic (1995) proposed an estimator of $\xi^2$, given in (6), that uses the empirical variance of cluster-level values $u_{sc_s}$. Here, $\bar{u}_s = n_s^{-1} \sum_{c_s=1}^{n_s} u_{sc_s}$ denotes the empirical mean of the $u_{sc_s}$, and $A(x_{sc_sh}) = \hat{F}(x_{sc_sh}) - \frac{\hat{G}_n + 1}{2}$; the remaining quantities entering (6) are weighted placements and averages of the income values obtained from the $n$ clusters. It should be noted that Bhattacharya (2007) proposed an alternative estimator of $\xi^2$, given in (7), which involves the total number of observations and the ordered observations, $x_{(g)}$ being the $g$th ordered observation (among all $x_{sc_sh}$).
However, Hoque and Clarke (2015) showed that the estimators in (6) and (7) are numerically the same, i.e., $V_{n,1}^2 = V_{n,2}^2$. We therefore choose $V_{n,1}^2$ as a consistent estimator of $\xi^2$ and drop the second subscript, without loss of generality (i.e., we use $V_n^2$ as the estimator of $\xi^2$). Having found a consistent estimator of the (asymptotic) variance $\xi^2$, it follows that the optimal number of clusters $C$ defined in (5), which leads to the bounded width confidence interval, can now be estimated from the data. In order to do so, different sequential methodologies are discussed in the next section.

Sequential Methodology
In this section, sequential methodologies, namely a purely sequential and a two-stage approach, are discussed for finding a sufficiently narrow confidence interval. First, the purely sequential procedure is introduced.

Purely Sequential Procedure
The purely sequential procedure is based on consecutive sampling until a stopping rule is met, which ensures that the width of the confidence interval is smaller than or equal to the given bound. The sampling process begins with a pilot sample, the size of which is specified in Section 4.3. Recall that computing a bounded width confidence interval requires at least $C_s$ clusters from the $s$th stratum ($s = 1, 2, \ldots, S$). Therefore, choose a pilot cluster size of $t_s$ from each stratum $s$, which results in a total of $t = \sum_{s=1}^{S} t_s$ clusters in the pilot stage. Within each selected cluster, $k$ households are randomly selected (without replacement). Now, collect the pilot observations $x_{s11}, \ldots, x_{s1k}, \ldots, x_{st_s1}, \ldots, x_{st_sk}$ in each stratum $s = 1, \ldots, S$. The estimator $V_n^2$ of $\xi^2$ is then computed to examine the following stopping rule:

$$N = \text{the smallest integer } n\ (\ge t) \text{ such that } n \ge \hat{C} = \left(\frac{2 z_{\alpha/2}}{\omega}\right)^2 \left(V_n^2 + \frac{1}{n}\right). \tag{8}$$

If the condition in the stopping rule is not satisfied, the surveyor collects data from $m$ ($\ge 1$) additional clusters, with $k$ randomly chosen households each, from every stratum that has $n_s \le \hat{C}_s$. Then $\xi^2$ is re-estimated based on all the observations collected up to that stage and the stopping condition is checked again. This process is repeated until the condition in the stopping rule is satisfied. It should be noted that $m$ ($\ge 1$) can be any integer that is feasible for the survey.
The term $1/n$ in (8) is a correction term incorporated to avoid early stopping of the sequential procedure, as $V_n^2$ (the estimator of $\xi^2$) may be very small in the early stages. Without this term, the stopping rule in (8) can be satisfied for very small sample sizes due to sampling error. In general, any null sequence, e.g., $1/n^{\gamma}$, where $\gamma$ ($> 0$) is a fixed number, can be used as a correction term, because it does not affect the consistency of the variance estimator (see Mukhopadhyay and De Silva 2009, p. 260, for more details). The use of a correction term can be seen in several articles, e.g., Chattopadhyay and De (2016), Chattopadhyay and Kelley (2017), and Kelley et al. (2019). The final cluster size $N$ comprises $N_s$ clusters from each stratum $s$, where $N_s = N a_s$, for $s = 1, 2, \ldots, S$.
Based on the sampled data $x_{sc_sh}$ and their corresponding standardized weights $w_{sc_sh}$, where $s = 1, \ldots, S$, $c_s = 1, \ldots, N_s$, and $h = 1, \ldots, k$, the $100(1 - \alpha)\%$ bounded width confidence interval for the Gini index $G_F$ is given by

$$\left(\hat{G}_N - z_{\alpha/2}\frac{V_N}{\sqrt{N}},\ \hat{G}_N + z_{\alpha/2}\frac{V_N}{\sqrt{N}}\right). \tag{9}$$

The purely sequential procedure may be computationally cumbersome because of the consecutive sampling and the repeated computation of the variance estimator. Therefore, a less computationally intensive method, a two-stage procedure, is examined in the next section.
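The sampling loop described above can be sketched in a few lines of Python. The sketch assumes that the stopping rule has the form "stop at the first $n \ge t$ with $n \ge (2 z_{\alpha/2}/\omega)^2 (V_n^2 + 1/n)$", which is consistent with the $1/n$ correction term discussed above but is stated here as an assumption. For brevity it pools all strata and adds $m$ clusters per step, whereas the procedure in the text adds clusters stratum by stratum. The callbacks `sample_more` and `estimate_variance` are hypothetical placeholders for the survey sampling step and the design-based variance estimator.

```python
from statistics import NormalDist


def purely_sequential(sample_more, estimate_variance, pilot, omega, alpha,
                      m=1, max_clusters=None):
    """Add m clusters at a time until n >= (2z/omega)^2 * (V_n^2 + 1/n).

    sample_more(j)            -> list of j newly sampled clusters (hypothetical)
    estimate_variance(sample) -> V_n^2 from all clusters so far (hypothetical)
    max_clusters              -> optional cap when the frame is exhausted
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    clusters = list(sample_more(pilot))          # pilot stage
    while True:
        n = len(clusters)
        v_sq = estimate_variance(clusters)
        c_hat = (2.0 * z / omega) ** 2 * (v_sq + 1.0 / n)  # 1/n guards against early stopping
        out_of_clusters = max_clusters is not None and n >= max_clusters
        if n >= c_hat or out_of_clusters:
            return clusters, c_hat
        step = m if max_clusters is None else min(m, max_clusters - n)
        clusters.extend(sample_more(step))


# Toy run: clusters are dummy labels and the variance estimate is held fixed
# at an assumed value of 0.01, purely to show the mechanics of the loop.
_labels = iter(range(10**6))
clusters, c_hat = purely_sequential(
    sample_more=lambda j: [next(_labels) for _ in range(j)],
    estimate_variance=lambda sample: 0.01,
    pilot=10, omega=0.1, alpha=0.05,
)
```

With a real estimator the value of $V_n^2$ changes at every step, so the stopping point is random; the optional `max_clusters` cap mirrors the situation in which a region runs out of clusters before the rule is satisfied.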

Two-Stage Procedure
Unlike the purely sequential procedure, the two-stage procedure consists of only two stages. The first stage is the pilot stage, in which a pilot sample of $t_s$ clusters (with $\sum_{s=1}^{S} t_s = t$) is selected from each stratum $s$. Based on the sample from the pilot stage, $\xi^2$ is estimated as in (6). Then, the total final cluster size from all strata is estimated as in (10), where $Q^*$ is the (unbounded) optimal cluster size and $\lceil \cdot \rceil$ is the ceiling function, that is, $\lceil x \rceil$ is the smallest integer that is greater than or equal to $x$. Thus, the estimated number of clusters to be sampled from the $s$th stratum is given by $Q_s = [Q a_s]$, with $a_s$ as defined in (2) and $[\cdot]$ being the nearest integer function. So, in the second stage, observations from $k$ households are collected from each of $Q_s - t_s$ additional clusters in each stratum $s$. Using the combined data from the two stages, the estimator of $\xi^2$ is updated and the approximate $100(1 - \alpha)\%$ confidence interval for the Gini index is given by (11). We note that the final cluster size using either the two-stage procedure or the purely sequential procedure can be shown to be always finite. In addition, the numbers of clusters per stratum are mutually dependent, as they all depend on the same stopping rule. In the next subsection, we derive the pilot cluster size formula.
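The bookkeeping of the two-stage procedure can be illustrated with the following Python sketch. It assumes that the unbounded size is $Q^* = \max\{t, \lceil (2 z_{\alpha/2}/\omega)^2 V_t^2 \rceil\}$, that $Q = \min\{Q^*, H\}$, and that the per-stratum sizes use nearest-integer rounding of $Q a_s$. The exact form of the size formula in the text, including whether it carries a correction term, is not reproduced here, so these choices, the function name and the inputs are assumptions for illustration only.

```python
import math
from statistics import NormalDist


def two_stage_sizes(v_sq_pilot, pilot_per_stratum, clusters_per_stratum,
                    omega, alpha):
    """Return (Q_star, Q, Q_s, extra clusters per stratum) under the assumed rule."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    t = sum(pilot_per_stratum)
    H = sum(clusters_per_stratum)
    # Unbounded size from the pilot variance estimate (assumed form).
    q_star = max(t, math.ceil((2.0 * z / omega) ** 2 * v_sq_pilot))
    # Cap by the number of clusters actually available.
    q = min(q_star, H)
    # Proportional allocation a_s = H_s / H with nearest-integer rounding.
    q_s = [math.floor(q * h / H + 0.5) for h in clusters_per_stratum]
    # Second-stage clusters still to be collected in each stratum.
    extra = [max(qs - ts, 0) for qs, ts in zip(q_s, pilot_per_stratum)]
    return q_star, q, q_s, extra


# Made-up pilot variance and stratum sizes, for illustration only.
result_large = two_stage_sizes(0.01, [2, 2], [30, 10], omega=0.1, alpha=0.05)
result_small = two_stage_sizes(0.01, [2, 2], [6, 2], omega=0.1, alpha=0.05)
```

In the second call the frame holds only eight clusters, so the capped size falls below the unbounded one, which is the situation in which the target width may not be attainable.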

Pilot Cluster Size
Using (8) and proceeding along the lines of Chattopadhyay and De (2016), we have

$$N \ge \left(\frac{2 z_{\alpha/2}}{\omega}\right)^2 \left(V_N^2 + \frac{1}{N}\right) \ge \left(\frac{2 z_{\alpha/2}}{\omega}\right)^2 \frac{1}{N}, \quad \text{so that} \quad N \ge \frac{2 z_{\alpha/2}}{\omega}.$$

Thus, the total number of sampled clusters is at least $2 z_{\alpha/2}/\omega$. The maximum number of clusters from the $s$th stratum is $H_s$, and the minimum number of clusters needed to estimate $\xi^2$ is 2. Considering all the constraints in (8), the number of clusters recommended to be sampled from the $s$th stratum at the pilot stage is given in (13). We note that this ensures that the minimum cluster size is met and that the total available number of clusters is not exceeded.
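One way to combine the three constraints named above (a total of at least $2 z_{\alpha/2}/\omega$ clusters, at most $H_s$ clusters in stratum $s$, and at least 2 clusters per stratum) is sketched below. The specific combination used here, $t_s = \max\{2, \min\{H_s, \lceil a_s \cdot 2 z_{\alpha/2}/\omega \rceil\}\}$ with $a_s = H_s/H$, is an assumption made for illustration and should not be read as the pilot size formula of the text; the function name `pilot_sizes` is likewise hypothetical.

```python
import math
from statistics import NormalDist


def pilot_sizes(clusters_per_stratum, omega, alpha, minimum=2):
    """Pilot clusters per stratum under an assumed clamp-and-allocate rule.

    Assumes every stratum has at least `minimum` clusters available.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    lower_bound_total = 2.0 * z / omega          # total clusters is at least this
    H = sum(clusters_per_stratum)
    sizes = []
    for h in clusters_per_stratum:
        share = math.ceil(lower_bound_total * h / H)  # proportional share a_s
        sizes.append(max(minimum, min(h, share)))     # clamp to [minimum, H_s]
    return sizes


# Made-up stratum sizes, for illustration only.
t_s = pilot_sizes([100, 300, 2], omega=0.02, alpha=0.05)
t_s_small = pilot_sizes([10, 10], omega=0.02, alpha=0.05)
```

In the second call the proportional share exceeds what each stratum contains, so the pilot stage already uses every available cluster.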

Characteristics of the Procedures and Simulation Study
The purely sequential procedure and the two-stage procedure for constructing a sufficiently narrow confidence interval for the Gini index, unlike fixed cluster size procedures, require cluster sizes that are obtained from the data. So, the respective cluster sizes $N$ and $Q$ are random in nature. In the following subsection, we look at the characteristics of the random cluster sizes $N$ and $Q$.

Characteristics
The following theorem provides some asymptotic properties (as ω → 0) of the final cluster sizes of the above procedures with sufficiently large H.
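n ] exists and $H_s$ (fixed) are sufficiently large for all $s = 1, \ldots, S$, then as $\omega \to 0$: (i) $N/C \to 1$ in probability; (ii) $Q/C \to 1$ in probability; and (iii) the confidence interval obtained by the purely sequential procedure has width at most $\omega$.
Proof of Theorem 1.
(i) The definition of the stopping rule $N$ associated with the purely sequential procedure in (8) yields the inequality (14). Since $N \to \infty$ as $\omega \downarrow 0$ and $V_n^2 \to \xi^2$ in probability as $n \to \infty$, applying Theorem 2.1 of Gut (2009) gives $V_N^2 \to \xi^2$ in probability as $\omega \downarrow 0$. Hence, dividing all sides of (14) by $C$ and letting $\omega \downarrow 0$, we prove $N/C \to 1$ in probability as $\omega \downarrow 0$.
(ii) The definition of the final cluster size $Q$ related to the two-stage procedure in (10) yields the inequality (15). Furthermore, $t \Pr(Q = t)/C \le t/C \to 0$ as $\omega \downarrow 0$. Now, $V_t^2 \to \xi^2$ in probability as $\omega \downarrow 0$. Hence, dividing all sides of (15) by $C$ and letting $\omega \downarrow 0$, we prove $Q/C \to 1$ in probability as $\omega \downarrow 0$.
(iii) Using the stopping rule $N$ in (8), we have, for all $N$, $N \ge (2 z_{\alpha/2}/\omega)^2 V_N^2$, that is, $2 z_{\alpha/2} V_N/\sqrt{N} \le \omega$.
Parts (i) and (ii) of the theorem show that the final cluster sizes obtained from the purely sequential and the two-stage procedures are consistent estimators of the optimal cluster size $C$ that would be used if $\xi^2$ were known. Part (iii) of the theorem shows that a sufficiently narrow confidence interval (that is, one of length less than or equal to $\omega$) is obtained by the purely sequential procedure. The same result cannot be proven for the two-stage procedure.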

Simulation Study
We now use a detailed simulation study, presented in the Supplement, to illustrate and compare the properties of the purely sequential and the two-stage procedures in constructing a $100(1 - \alpha)\%$ confidence interval for the Gini index under a complex survey whose width is less than $\omega$. We present two different simulation studies with 5000 simulation runs: (a) a simulation using the NSS survey data as the population and (b) a Monte Carlo simulation in which the observations are drawn from three different populations generated from three different distributions, namely the Pareto, Gamma and Lognormal distributions. The two simulation studies were performed in RStudio (RStudio Team 2018, version 1.2.1335) and the code is available upon request.
To begin with, we describe the simulation procedure for the purely sequential methodology. From the given populations, $t_s$ ($s = 1, 2, \ldots, S$) clusters are randomly sampled from the $s$th stratum without replacement. Four households are then selected from each cluster using simple random sampling without replacement, and these households from all $t$ clusters constitute the pilot sample. From the collected pilot sample, the asymptotic variance $\xi^2$ of the Gini index is estimated using (6), and from (8), the optimal number of clusters $C$ is estimated. The stopping rule is checked and, if it is satisfied, sampling is terminated. On the other hand, if the stopping rule is not satisfied, the strata whose number of selected clusters is less than the expected number, that is, $\{s : t_s < \hat{C}_s\}$, are identified and $m$ additional clusters are randomly selected without replacement. Here, $m$ is chosen to be either 1, 10 or 20. In each of the $m$ selected clusters, four households are randomly selected without replacement. At this stage, with the total number of sampled clusters being $n$ (say), the value of $V_n^2$ is updated and the stopping rule is checked. If the rule is met, sampling is stopped; otherwise the strata without enough clusters are identified again and $m$ additional clusters are collected from each of them. This process is continued until the stopping rule is met. At that point, based on the $N$ (say) clusters sampled from all strata, the $100(1 - \alpha)\%$ confidence interval for the Gini index is constructed as given in (9).
Unlike the purely sequential procedure described above, the two-stage procedure has only two stages. The simulation algorithm for the two-stage procedure is as follows. From a given population, $t_s$ clusters are randomly selected without replacement from the $s$th stratum, and four households are randomly sampled without replacement from each of the selected clusters. The monthly per capita expenditures $x_{sc_sh}$ of the selected households, with their respective weights $W_{sc_sh}$, are used to estimate the asymptotic variance of the Gini index (from (6)). This is followed by using (10) to obtain the optimal number of clusters $Q$ needed to achieve the desired confidence level and width. If $Q > t$, an additional $Q_s - t_s$ clusters are randomly selected without replacement from each stratum $s$. In each of the additional clusters, four households are also randomly selected without replacement. Finally, the monthly per capita expenditures of all households from the $Q$ clusters are used to construct the $100(1 - \alpha)\%$ confidence interval for the Gini index as stated in (11).
From the simulations, we find that the coverage probabilities of the confidence intervals for both the purely sequential procedure and the two-stage procedure are close to the desired confidence level provided that the cluster size (in all strata) is large, which is also a basic requirement for the asymptotic normality in (4). However, the two-stage procedure, unlike the purely sequential procedure, may result in confidence intervals that are wider than the pre-specified value of $\omega$. For details, see Tables S21-S24 of the supplementary material. This outcome is not surprising, since the two-stage procedure determines the final cluster size from the pilot sample only, which is usually small, so that the variability of the variance estimator $V_n^2$ is higher. The optimal cluster sizes obtained by the purely sequential procedure are smaller than those obtained by the two-stage procedure. The newly developed methods are applied to real data in the next section.

Gini Index Estimation in India
We now apply the sequential procedures to construct bounded width confidence intervals for the Gini index in India using the monthly per capita expenditures obtained via the 64th Round National Sample Survey (NSS) (a stratified multi-stage survey conducted between July 2007 and June 2008). In 2008, the country was divided into 28 states and seven union territories, each of which was subdivided into districts. Within each district, two basic sectors were formed; all rural areas constituted the rural sector while all urban areas constituted the urban sector. For the urban areas in a district, however, separate basic strata were formed for each town that had a population of at least 10 lakhs (1 lakh is 100,000). The remaining areas were grouped as another basic stratum (National Sample Survey Office 2007). For the rural sector, the sampling frame was made up of villages, while for the urban sector, it was towns/blocks. Census villages and the Urban Frame Survey blocks were the first stage units (FSUs) in the rural and urban sectors, respectively. From each stratum, FSUs are selected from the rural sector with probability proportional to size with replacement and from the urban sector by simple random sampling without replacement. Within the FSUs, the households in each sector were considered the smallest unit of grouping, also referred to as the ultimate stage units. Households were selected by simple random sampling without replacement and various information about the households was recorded during the survey. This information includes the demographics, household size, expenditure on education, food and clothing, the corresponding weights, etc. A detailed description of the NSS data can be found online at National Sample Survey Office (2015).
The "Stratum" variable in the 64th NSS data set is used to stratify the states/sectors, while the "FSUno" (First Stage Unit Number) variable is used to cluster the households within each stratum. We discuss the results obtained from applying the proposed sequential methodologies to the data collected from two of the most populous states in India, namely Uttar Pradesh and West Bengal. The report includes the results for each whole state as well as for the rural and urban sectors of each state. Here, all the households in each cluster were considered, since we are sampling from a survey that already has only a small number of households per cluster. However, the weight per household is adjusted at each sampling stage to reflect the actual weight that would be used during a survey.
In applying the sequential methodologies, the pilot cluster sizes $t_s$ for each stratum $s$ are computed using (13). At the outset, $t_s$ clusters are selected from stratum $s$ for $s = 1, \ldots, S$, where $t_s$ is the same for both the purely sequential procedure and the two-stage methodology. We apply each of the procedures considering the survey data as our population.

Application of Purely Sequential Procedure (PSP)
The proposed purely sequential procedure, with observations from one cluster collected at each stage after the pilot stage, is applied to the NSS 64th round data. The results for different combinations of pre-specified width (ω ∈ {0.020, 0.025}) and confidence level (1 − α, α ∈ {0.05, 0.10}) can be found in Tables 1-4. The first column of the tables indicates the region on which we applied our procedure. The PSP was applied on the entire data from Uttar Pradesh (denoted as All) and then separately applied on the rural and urban sectors of Utter Pradesh (denoted as Rural and Urban respectively). The same process was also repeated for West Bengal. The second column of the tables shows the estimated Gini index (Ĝ H ) and its standard error (se Ĝ H ) using the entire number of clusters (H) available in the data set for that region (i.e., all of the state, rural sector of the state, or the urban sector of the state). In the third column is the total number of clusters (H) available in the data set for that region. The fourth column shows the value ofĈ when the procedure ended,Ĉ being the estimated optimal cluster size as in (8). The fifth column of the tables shows the collected cluster size N using the stopping rule in (8) and the pilot cluster size t. The values ofĜ N and se(Ĝ N ) in the sixth column are the estimated Gini index and its standard error respectively based on N clusters. The next two columns are respectively the lower and upper limits of the confidence intervals obtained with the stopping rule in (8). The ninth column is w N which is the estimated width of the confidence interval. The last column Pr(N s <Ĉ s ) shows the proportion of strata that had their collected cluster size N s from the purely sequential procedure being less than their estimated optimal cluster sizeĈ s (N s is the final number of clusters selected from stratum s whileĈ s is the estimated optimal number of clusters to be sampled from stratum s). 
In Tables 1-4, it can be seen that, when the maximum available (to be drawn from) cluster size (H s ) per stratum are large, the purely sequential procedure is able to achieve desired precision, i.e., a narrow confidence interval, (w N ≤ ω) for the Gini index with relatively fewer number of clusters sampled while maintaining the desired confidence level. This is shown in the results where N < H for all of Uttar Pradesh and West Bengal, as well as their individual rural sectors. The same cannot be said about their urban sectors as they do not have enough maximum available clusters from the onset. Thus, the procedure did not reach the optimal cluster size but stopped when there were no more clusters remaining to be sampled.
The results also show that, aside the fact that the entire urban regions did not have enough clusters (N = H < C), each of the strata in the regions also do not have enough clusters (that is, Pr(N s <Ĉ s ) = 1) to obtain a narrow confidence interval width. However, in the other regions (i.e., All and Rural for Uttar Pradesh and West Bengal), even thoughĈ < N < H, some strata had N s <Ĉ s . This is because some strata have more than enough clusters while others do not and that offsets each other at the end. For example, it can be seen from Table 1 that in the rural sector of Uttar Pradesh, 40% of the strata did not have enough clusters even though, at the end, the confidence interval was 0.0186 wide which was less than the desired width of 0.02.
Next, the results will be compared with those of the two-stage procedure discussed in Section 4.2.

Application of Two-Stage Procedure
First, the estimator $V_n^2$ of $\xi^2$ is obtained from the pilot stage and then the final cluster size $Q^*$ is computed. $Q^*$ is then adjusted to account for the limited availability of clusters per stratum in the NSS data, giving the number of clusters $Q$ that can actually be sampled (see (10)). Here, $Q$ is distributed over the $S$ strata as $Q_s$ for stratum $s$, rounded off where $Q_s$ is not an integer. The sum of the $Q_s$ gives the actual number of clusters, $\tilde{Q} = \sum_{s=1}^{S} Q_s$, that are sampled from all strata. Using $\tilde{Q}$ clusters, the Gini index and $\xi^2$ are re-estimated (or updated) and a $100(1 - \alpha)\%$ confidence interval is constructed according to (11).
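The chain from the pilot-stage variance estimate to the number of clusters actually sampled can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's equations (10) and (11): the formula used for the optimal size is an assumed Stein-type form, the allocation across strata is assumed to be proportional to the available clusters, and the function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist


def two_stage_sizes(v2_pilot, omega, alpha, pilot, available):
    """Return (q_star, q, q_s, q_total) for a two-stage procedure.

    v2_pilot  : pilot-stage variance estimate (assumed given)
    available : list of available clusters per stratum
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    total_available = sum(available)
    # Assumed Stein-type form for the unbounded optimal size.
    q_star = max(pilot, math.ceil((2 * z / omega) ** 2 * v2_pilot))
    # Cap at the number of clusters that exist in the data.
    q = min(q_star, total_available)
    # Assumed proportional allocation, rounded to integers per stratum.
    q_s = [min(h, round(q * h / total_available)) for h in available]
    return q_star, q, q_s, sum(q_s)


strata = [40, 25, 10]
print(two_stage_sizes(0.09, 0.20, 0.10, 10, strata))
print(two_stage_sizes(0.09, 0.02, 0.10, 10, strata))
```

In the first call the optimal size fits within the available clusters, and rounding makes the realized total differ slightly from the capped size. In the second call the optimal size exceeds what is available, so every available cluster is used.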
Similar to the application of the purely sequential procedure, the two-stage procedure is applied to the NSS 64th round data for different combinations of pre-specified precision ($\omega$) and accuracy ($1 - \alpha$), with the results shown in Tables 5-8. The second column of the tables gives the total number of clusters $H$ in the unit (i.e., the whole state, rural sector, or urban sector) of the NSS data. The third column displays the estimated optimal number of clusters ($Q^*$) required to achieve the desired precision and accuracy. Below $Q^*$ is the pilot number of clusters $t$. The next column shows the estimated optimal cluster size $Q$, which takes into account the total number of clusters available in the data, because the number of clusters is finite and limited. Furthermore, $\tilde{Q}$ is the actual number of clusters that can be sampled from all strata, given that only an integer number of clusters can be sampled from each stratum (i.e., rounding off where the number of clusters to be sampled from a stratum is not an integer). Using (3) and (6), the Gini index estimate, $\hat{G}_H$, for the unit is computed using all $H$ clusters, with its standard error $\mathrm{se}(\hat{G}_H)$; these are shown in the fifth column. The selected clusters are used to estimate the Gini index, denoted $\hat{G}_{\tilde{Q}}$, with its standard error $\mathrm{se}(\hat{G}_{\tilde{Q}})$, in the sixth column. In the seventh and eighth columns, Lower CI and Upper CI are the lower and upper limits of the $100(1 - \alpha)\%$ confidence interval of the Gini index using $\tilde{Q}$ clusters, respectively. The last column shows the length of the confidence interval, $w_{\tilde{Q}}$. It must be noted that $Q^*$ is unbounded, whereas $Q$ and $\tilde{Q}$ cannot exceed $H$. $\tilde{Q}$ can be less than, equal to, or greater than $Q$, depending on the rounding off. $Q^*$ will be equal to $Q$ if and only if $Q^*$ is less than or equal to $H$.
Table 5. Application results for the two-stage procedure on NSS 64th round data for $\alpha = 0.1$ and $\omega = 0.02$.

Uttar Pradesh, All: $H = 1262$
From Tables 5-8, it can be observed that in all cases, except the urban sectors of both states, the confidence interval widths were less than $\omega$. These results were achieved because the optimal number of clusters required ($Q^*$), according to the two-stage procedure, was less than the number available ($H$). On the other hand, in both Uttar Pradesh and West Bengal, the estimated optimal cluster size $Q^*$ for the urban sector exceeded the available number of clusters $H$ in the data. As a consequence, the confidence interval widths for the Gini index in the urban sectors were larger than the pre-specified bound, that is, $w_{\tilde{Q}} > \omega$.

Extension: Narrow Confidence Region
The methodology presented in this article for the Gini index can be extended to a multi-parameter setup in which we would like to make inference about a vector of parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ for $p \ge 2$. This situation arises when we are interested in making joint inference on a number of welfare-related measures computed from socio-economic survey data (e.g., the household consumer expenditure survey conducted by the National Sample Survey, India). Thus, instead of a sufficiently narrow confidence interval, we would like to construct a narrow confidence region for a vector of parameters. Let the vector of estimators be defined as $T_n = (T_{1n}, \ldots, T_{pn})$ based on the data on $n$ households collected using a complex household survey. We extend our proposed methodology for constructing the narrow confidence region in the spirit of Mukhopadhyay and De Silva (2009, pp. 284-89). We propose the following confidence region for $\theta_F$: Using the regularity conditions by Bhattacharya (2005), we have, with $\Sigma$ being a positive definite matrix and $\chi^2_p$ being a chi-squared distribution with $p$ degrees of freedom. If $\Sigma$ is a positive definite matrix, then there exist an orthogonal matrix $P$ and a diagonal matrix $\Delta$ such that $P'\Sigma P = \Delta$. The diagonal elements of $\Delta$ are the eigenvalues of $\Sigma$. If the positive eigenvalues of $\Sigma$ are $\lambda_1, \ldots, \lambda_p$, then $\Delta = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$. Furthermore, let $(PT_n - P\theta) = (Y_1, \ldots, Y_p)$ and let $\lambda_{(p)}$ be the maximum of the $p$ eigenvalues of $\Sigma$. So, we have Thus, using (16), we say, With $\chi^2_{\alpha;p}$ being the $100(1 - \alpha)$th percentile of $\chi^2_p$, we claim that the coverage probability of the confidence region is more than Here, $C$ is the required optimal cluster size that should be used provided the covariance matrix ($\Sigma$) is known. If $\Sigma$ is known in advance, one could simply collect observations from $C_s$ clusters in stratum $s$, $s = 1, \ldots, S$, for each of the $S$ strata.
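The reasoning that leads to a required size depending on the largest eigenvalue can be sketched with the standard fixed-size confidence region argument. The display below is an illustrative reconstruction under stated assumptions, not the article's own equations: it assumes a spherical region of radius $d$ (the symbol $d$ is introduced here only for illustration), writes $n$ for the sample-size index that drives the asymptotics, and assumes that $\sqrt{n}(T_n - \theta)$ converges in distribution to $N_p(0, \Sigma)$. It uses the inequality $z'z \le \lambda_{(p)}\, z'\Sigma^{-1}z$, which holds for every vector $z$ because the smallest eigenvalue of $\Sigma^{-1}$ is $1/\lambda_{(p)}$.

$$
\begin{aligned}
\Pr\{(T_n - \theta)'(T_n - \theta) \le d^2\}
&= \Pr\{n (T_n - \theta)'(T_n - \theta) \le n d^2\} \\
&\ge \Pr\{n (T_n - \theta)'\Sigma^{-1}(T_n - \theta) \le n d^2 / \lambda_{(p)}\} \\
&\approx \Pr\{\chi^2_p \le n d^2 / \lambda_{(p)}\} \\
&\ge 1 - \alpha \quad \text{whenever } n \ge \chi^2_{\alpha;p}\, \lambda_{(p)} / d^2 .
\end{aligned}
$$

Under these assumptions, the bound suggests a required size of $\chi^2_{\alpha;p}\, \lambda_{(p)} / d^2$, which depends on $\Sigma$ only through its largest eigenvalue.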
Since $\Sigma$ is not known in practice, we can estimate it using a consistent estimator ($V_n$, say), which can be obtained using the jackknife method. The consistency of the jackknife estimator follows from Sen (1988). Thus, using the jackknife estimator, we may propose either a two-stage or a sequential procedure. Results similar to those for the procedures described earlier are expected to hold under appropriate regularity conditions.

Discussion
At the outset, we would like to caution readers not to confuse two-stage sampling with the two-stage procedure from the sequential sampling literature discussed in Section 4.2. For the two-stage procedure, we refer to Chattopadhyay and Mukhopadhyay (2013), Stein (1945) and others. Two-stage sampling (e.g., see Fuller (2009)) is a sampling technique in which a sample of clusters is selected and, within those selected clusters, a sample of units is selected, assuming the units to be independent of one another, with a selection rule that depends only on the cluster. Under this two-stage sampling, Fuller (2009) discussed the use of the Horvitz-Thompson estimator to estimate the population total and mean and their respective variances. In addition, Fuller (2009) elaborated on the use of Horvitz-Thompson estimators and their (asymptotic) variances for functions of means and complex estimators, in general, under the assumption that the population distribution has a finite fourth moment. However, in the asymptotic framework of Fuller (2009), it was assumed that observations are independently and identically distributed (iid), which is a stronger assumption than that of the framework of Bhattacharya (2007), also used in this work. Furthermore, Fuller (2009) also discussed the classical optimal sample allocation problem under the two-stage sampling technique for estimating the mean per element in a population. In his discussion, he assumed an equal number of units to be sampled from each cluster, an equal total number of units in each cluster, and known population variances of the cluster size and the sampling units. Under these assumptions, Fuller (2009) obtained the optimal number of units to be sampled per cluster by minimizing the variance of the mean per element subject to a cost constraint.
Our work is different from the survey procedures discussed in Fuller (2009). As indicated earlier, it is based on the survey framework used in Bhattacharya (2007). In order to obtain a sufficiently narrow confidence interval for the Gini index, we are interested in estimating the unknown optimum number of clusters in each stratum, with the number of strata fixed in advance. Apart from the survey framework, the optimal cluster size in our work depends on the data, unlike the procedures discussed in Fuller (2009). The total cluster size (as well as the cluster size per stratum) is a random variable that depends on a stopping criterion. This procedure also makes the estimated cluster sizes mutually dependent, as they are all estimated based on the same stopping rule. Thus, the method discussed in Fuller (2009), or any other existing work, cannot be applied to find such a confidence interval.
We believe this is the first work to develop sufficiently narrow confidence intervals for an economic inequality index based on complex household surveys. We now discuss some limitations of our proposed procedures: (a) the procedures depend on the pre-specified number of households in each cluster; (b) the sequential procedure depends on the pre-specified $m$; (c) the procedures consider the large cluster size scenario; and (d) the procedures do not consider the sampling cost and/or a fixed budget.
To begin with, the purely sequential procedure requires observations from an additional $m$ clusters, after the first stage, every time the condition in the stopping rule is not met. Thus, the value of $m$ needs to be fixed. In some situations, it is as easy to collect observations from more than one cluster as it is to collect observations from a single cluster at every stage. So, the value of $m$ should be decided according to convenience and economic considerations. The purely sequential procedure remains applicable for any choice of $m$; however, the larger the value of $m$, the fewer the stages and the higher the chance of overestimating the optimal number of clusters. On the other hand, the smaller the value of $m$, the more stages are needed and the higher the chance of stopping accurately at the optimal number of clusters. Thus, when choosing $m$, there is a trade-off between the number of stages and stopping accurately at the optimal cluster size.
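The trade-off can be made concrete with a small arithmetic sketch. It is an idealization, not the procedure itself: it assumes the optimal number of clusters is known and fixed, whereas in the actual procedure the optimal size is re-estimated at every stage. The function name and the numbers used are hypothetical.

```python
import math


def stages_and_overshoot(optimal, pilot, batch):
    """Stages needed after the pilot, and clusters sampled beyond the
    optimal size, when `batch` clusters are added per stage and sampling
    stops at the first size that is at least `optimal`."""
    extra_stages = max(0, math.ceil((optimal - pilot) / batch))
    final_size = pilot + extra_stages * batch
    return extra_stages, final_size - optimal


for batch in (1, 10, 50):
    print(batch, stages_and_overshoot(137, 30, batch))
```

With a pilot of 30 clusters and an optimal size of 137, adding one cluster per stage takes 107 further stages and stops exactly at the optimal size, whereas adding 50 clusters per stage takes only 3 stages but samples 43 clusters more than needed.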
Furthermore, our proposed procedures are based on the central limit theorem (when the cluster sizes per stratum are large). If the number of clusters is small, neither the confidence interval for the Gini index based on Bhattacharya (2005, 2007) (the fixed-cluster size method) nor a narrow confidence interval based on our proposed procedures can be constructed. When the number of available clusters ($H_s$) is small for a few strata, the sequence of sampling distributions of the empirical Gini indices may not reach the asymptotic limiting normal distribution. In a situation where limiting normality cannot be reached, our proposed procedures should not be applied. If one of our proposed procedures is applied without enough clusters in a few strata, one may not achieve the desired confidence interval for the population Gini index. This scenario was encountered in the application section of this work, for both the purely sequential and two-stage procedures, when there were not enough available clusters in the urban sectors, which resulted in confidence intervals that were wider than desired.
Lastly, a very important question raised by Bhattacharya (2005) concerned developing a survey design that takes economic factors into account. Both of our proposed procedures can be extended to include cost factors, whereby optimization is done at several levels to construct a narrow confidence interval or confidence region under cost constraints. However, we do not explore that possibility in this article. A related issue is that a country usually allocates a budget to its survey agency to carry out the survey. Under such budget constraints, the funding agency is unlikely to hand out more money willingly if the stopping rule is not met with the available amount. Without question, the issue of budget constraints is important. Here, we do not discuss the estimation of cluster sizes under a fixed budget. We feel that our current work is a first step towards addressing this important issue, in the sense of achieving a sufficiently narrow confidence interval or region, and that the procedures may yield different outcomes under cost constraints. We believe our work will lead to further research on this topic.

Conclusions
Working within the asymptotic framework for complex survey data developed by Bhattacharya (2005, 2007), we have developed purely sequential and two-stage procedures for constructing sufficiently narrow confidence intervals for the Gini index, which is one of the most popular measures of economic inequality. Our procedures may be applied to surveys in which stratified clustered sample data are drawn from a large number of clusters per stratum, which is a reasonable assumption to make. Moreover, our procedures may also be applied to special cases of multi-stage survey designs, including cases without stratification (i.e., $S = 1$) and those that have independent observations within clusters (intraclass correlation is zero).
There is no doubt that the two-stage procedure is practically more feasible under this survey design than the purely sequential procedure. The confidence intervals of both procedures yielded coverage probabilities close to the desired confidence coefficient; however, the purely sequential procedure produces confidence intervals whose widths are always less than the desired bound $\omega$. The two-stage procedure is also known to over-estimate the optimal cluster size compared with the purely sequential procedure (Mukhopadhyay and De Silva 2009), and this property can be seen in the results from the simulation (in the supplementary material) and the application to the NSS data. Furthermore, the estimated optimal cluster sizes have smaller standard errors under the purely sequential procedure than under the two-stage procedure.