Gini Index Estimation within Pre-Specified Error Bound: Application to Indian Household Survey Data

Bilson Darku, Francis; Konietschke, Frank; Chattopadhyay, Bhargab

doi:10.3390/econometrics8020026


by Francis Bilson Darku ^1,†, Frank Konietschke ^2,3 and Bhargab Chattopadhyay ^4,*

¹ Mendoza College of Business, University of Notre Dame, Notre Dame, IN 46556, USA

² Institute of Biometry and Clinical Epidemiology, Charité—Universitätsmedizin Berlin, 10117 Berlin, Germany

³ Berlin Institute of Health, Anna-Louisa-Karsch-Straße 2, 10178 Berlin, Germany

⁴ Department of Decision Sciences and Information Systems, Indian Institute of Management Visakhapatnam, Visakhapatnam, Andhra Pradesh 530003, India

^* Author to whom correspondence should be addressed.

^† This work is part of the final dissertation of Francis Bilson Darku that was submitted to the Department of Mathematical Sciences at The University of Texas at Dallas.

Econometrics 2020, 8(2), 26; https://doi.org/10.3390/econometrics8020026

Submission received: 23 July 2019 / Revised: 11 June 2020 / Accepted: 12 June 2020 / Published: 18 June 2020


Abstract

The Gini index, a widely used measure of economic inequality, is computed using data whose designs involve clustering and stratification, generally known as complex household surveys. Under a complex household survey design, we develop two novel procedures for estimating the Gini index with a pre-specified error bound and confidence level. The two proposed approaches are based on the concept of sequential analysis, which is known to be economical in the sense of obtaining an optimal cluster size that reduces project cost (that is, total sampling cost) while achieving the pre-specified error bound and confidence level under reasonable assumptions. Some large-sample properties of the proposed procedures are examined without assuming any specific distribution. Empirical illustrations of both procedures are provided using the consumption expenditure data obtained by the National Sample Survey (NSS) Organization in India.

Keywords: complex household survey; confidence interval; income distribution; inequality; sequential analysis

1. Introduction

Economic measures based on the income levels of the residents of a specific region play an important role in social, economic and socio-economic sciences. They are used to quantify both the overall balance of the economy and the wealth and poverty of the people. One of the most prominent candidates is the (normalized) Gini index,

G_{F} = G_{F} (X) = \frac{2}{μ} \int_{0}^{\infty} x F (x) d F (x) - 1, μ = E (X),

(1)

which quantifies the economic inequality of a region, state, country or the world. Here, the random variable X denotes the income level, $F (x)$ its cumulative distribution function, and $μ = E (X)$ its expected value. If $G_{F} = 0$, then the economic system has maximal equality (e.g., everyone has the same income), while $G_{F} = 1$ represents perfect inequality (e.g., one individual has everything while the rest have nothing). For example, according to the Organization for Economic Cooperation and Development (2017), the Gini indices of the USA, Germany and South Africa were $G_{F} = 0.39, 0.29, 0.62$ in 2017, respectively. These values suggest income inequality in these regions. Therefore, the Gini index serves as a measure of economic balance that allows comparison across regions. Roughly speaking, income levels were more balanced (equal) in Germany than they were in the USA and in South Africa. Thus, the Gini index serves as an important measure in economics, social and political sciences. The estimation of the Gini index $G_{F}$ of a country or a region, however, is a rather challenging task, because income is usually measured at the household level and thus in a clustered and stratified way. In most countries (e.g., United States, European Union, India and others), complex household surveys are conducted annually, the data of which can be used for the estimation of the Gini index as given in (1); see Bhattacharya (2005, 2007).

Reporting only a point estimator of $G_{F}$, as most available resources do, is, however, rather unsatisfactory, because it conveys neither the variability in the sample nor the sample/cluster sizes behind the estimate. Therefore, computing $100 (1 - α) \%$ confidence intervals for $G_{F}$ in addition to point estimators is much more informative for making both descriptive and comparative conclusions. Binder and Kovacevic (1995) and Bhattacharya (2007) proposed point estimators of the Gini index as well as of their standard errors in such complex survey designs (see Section 2 for details), which can be used for the computation of $100 (1 - α) \%$ confidence intervals for $G_{F}$. Furthermore, Peng (2011) proposed an empirical likelihood-based approach to construct such confidence intervals (as well as the confidence interval for the difference of two Gini indices). Clearly, for a desired confidence level, a narrower confidence interval is more informative about the parameter of interest. Therefore, it is the aim of the present article to develop confidence intervals for the Gini index $G_{F}$ in complex survey designs that control both the nominal confidence level $(1 - α)$ and the confidence interval width. To guarantee that these criteria are fulfilled, the optimal number of clusters will be computed using an innovative ‘learn-as-you-go’ or sequential procedure. We refer the readers to Ghosh and Sen (1991); Ghosh et al. (1997); Chattopadhyay and Kelley (2017); Kelley et al. (2018) and others for more on the sequential analysis literature.

The first known application of sequential analysis in surveys was by Mahalanobis (1940), who described the design and implementation of the method (in a different context) for estimating the acreage of the jute crop in the whole state of Bengal in undivided India. This was even before the seminal works of Stein (1945, 1949) in the area of sequential analysis. Kanninen (1993); Greene (1998); Arcidiacono and Jones (2003); Aguirregabiria and Mira (2007) and many others contributed to the application of sequential analysis in the fields of economics, data analysis, medicine, and other areas. Recently, Chattopadhyay and De (2016) and De and Chattopadhyay (2017) developed a sequential procedure for inference problems related to the Gini index under independent and identically distributed (i.i.d.) conditions, but the proposed methodology cannot be used for finding a sufficiently narrow $100 (1 - α) \%$ confidence interval for the population Gini index under a complex household survey design. We propose a two-stage procedure and a purely sequential procedure to find an estimate of the minimum number of clusters required to obtain a sufficiently narrow confidence interval under a distribution-free scenario. Both the two-stage and purely sequential procedures are applied to the 64th round of household survey data collected in India. Further, a simulation study is carried out on observations from the Indian household survey data and from known income distributions to explore the properties of the procedures.

The remainder of this paper is organized as follows: Section 2 describes the sampling framework of the complex survey design that is considered in this work. In Section 3, we formulate the problem of finding a sufficiently narrow confidence interval for the Gini index and explain why a procedure with fixed cluster size is not applicable. In Section 4, we develop the purely sequential, as well as the two-stage, procedure, followed by a discussion of the characteristics of our procedures in Section 5. Furthermore, an application of both of our procedures to real and synthetic data sets can be found in Section 6, while Section 7 describes an extension of the problem to the multivariate setup. We discuss the advantages and drawbacks of the proposed procedures in Section 8, and provide concluding comments in Section 9.

2. Survey Design and Point Estimation

In this section, the complex household survey design and the notation used are described. Assume that the population is divided into S strata, indexed by $s = 1, 2, \dots, S$, and that the $s^{th}$ stratum is divided into $H_{s}$ clusters, indexed by $c_{s} = 1, \dots, H_{s}$. The $c_{s}^{th}$ cluster in stratum s contains a group of $M_{s c_{s}}$ households, and the $h^{th}$ household has $ν_{s c_{s} h}$ individuals or members, $h = 1, 2, \dots, M_{s c_{s}}$. Therefore, the total number of clusters in the population is $H = \sum_{s = 1}^{S} H_{s}$. The number of households in stratum s is $M_{s} = \sum_{c_{s} = 1}^{H_{s}} M_{s c_{s}}$ and the total number of households in the population is denoted by $M = \sum_{s = 1}^{S} M_{s} = \sum_{s = 1}^{S} \sum_{c_{s} = 1}^{H_{s}} M_{s c_{s}}$.

For estimation purposes in such complex survey designs, a sample of $n_{s}$ clusters is selected from the $s^{th}$ stratum by simple random sampling with replacement. A simple random sample of k households is then drawn (without replacement) from each of the selected clusters. Let the total number of clusters selected from the population be denoted by

n = \sum_{s = 1}^{S} n_{s} \quad \text{with} \quad n_{s} = a_{s} n \quad \text{and} \quad a_{s} = \frac{H_{s}}{H} .

(2)

Thus, the total number of households in the sample will be $k n = k \sum_{s = 1}^{S} n_{s}$. For the $h^{th}$ household in the $c_{s}^{th}$ cluster from the $s^{th}$ stratum, the observed data (that is, the household monthly income, monthly expenditure, per capita income or others) are denoted as $x_{s c_{s} h}$. With the presence of stratification and clustering, the households are assigned different weights $W_{s c_{s} h}$ because the probability of inclusion in the sample varies. The weight assigned to a selected household is computed as the inverse of the probability of inclusion of the household in the sample (see Binder and Kovacevic 1995; Horvitz and Thompson 1952; Lee and Forthofer 2006). If researchers wish to increase (or decrease) the representation of a subgroup of the population that is of interest, they can employ oversampling (or undersampling) procedures and use appropriate weighting techniques. Wells (1998) discussed several weighting methods for such cases. For our survey framework, weights are assigned to the data ($x_{s c_{s} h}$) with respect to the number of observations in the population. The weight attached to all the $ν_{s c_{s} h}$ members of the $h^{th}$ household belonging to the $c_{s}^{th}$ sampled cluster from the $s^{th}$ stratum, as given by Bhattacharya (2007), is

W_{s c_{s} h} = \frac{M_{s c_{s}} H_{s}}{k n_{s}} ν_{s c_{s} h} .

It should be noted that the computation of the sampling weights will change depending on the sampling design and also on whether the analysis is being done at the district, household or individual level (Bhattacharya 2007). If the cluster size is large, sampling with or without replacement will result in similar values for the weights. Moreover, Bhattacharya (2005, 2007) noted that sampling with or without replacement does not affect the asymptotic results of this work, as in most practical situations the number of clusters per stratum is large.
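As a small illustration of the weight formula above, the following Python sketch evaluates it for a single sampled household. The function name, argument names and numbers are hypothetical and chosen only for illustration; `M_sc` stands for the number of households $M_{s c_{s}}$ in the sampled cluster.

```python
def household_weight(M_sc, H_s, k, n_s, nu_sch):
    """Sampling weight W_{s c_s h} = (M_{s c_s} * H_s) / (k * n_s) * nu_{s c_s h}."""
    if k <= 0 or n_s <= 0:
        raise ValueError("k and n_s must be positive")
    return (M_sc * H_s) / (k * n_s) * nu_sch


# Made-up example: a cluster of 120 households in a stratum with 50 clusters,
# 10 sampled clusters, 4 sampled households per cluster, household of 5 members.
weight = household_weight(M_sc=120, H_s=50, k=4, n_s=10, nu_sch=5)
```

Larger clusters, larger strata and larger households all increase the weight, while sampling more clusters or more households per cluster decreases it.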

Under the above framework, let

W = \sum_{s = 1}^{S} \sum_{c_{s} = 1}^{n_{s}} \sum_{h = 1}^{k} W_{s c_{s} h}

denote the total of per-household weights associated with the survey, and define the normalized weights $w_{s c_{s} h} = W^{- 1} W_{s c_{s} h}$, which are used to estimate the average income $μ = E (X)$ and its cumulative distribution function $F (x)$ by

\hat{μ} = \sum_{s = 1}^{S} \sum_{c_{s} = 1}^{n_{s}} \sum_{h = 1}^{k} w_{s c_{s} h} x_{s c_{s} h}

and

\hat{F} (x) = \sum_{i = 1}^{S} \sum_{j = 1}^{n_{i}} \sum_{l = 1}^{k} w_{i j l} 1 (x_{i j l} \leq x),

in order to take the relative household sizes into account. Then, under fairly mild conditions on the numbers of clusters, a consistent estimator of the Gini index $G_{F}$ given in (1) is

{\hat{G}}_{n} = 1 - \frac{2}{\hat{μ}} \sum_{s = 1}^{S} \sum_{c_{s} = 1}^{n_{s}} \sum_{h = 1}^{k} w_{s c_{s} h} x_{s c_{s} h} (1 - \hat{F} (x_{s c_{s} h})) .

(3)

It follows that the estimated Gini index is essentially a ratio of two weighted averages of the income levels (see Bhattacharya 2007). In the next section, we discuss the construction of confidence intervals with bounded width.
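The estimator in (3) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, the observations are assumed to be flattened into one list over strata, clusters and households, and $\hat{F}$ is evaluated by a direct (quadratic-time) sum for readability.

```python
def weighted_gini(x, W):
    """Estimate the Gini index via (3) from incomes x and raw weights W."""
    if len(x) != len(W) or not x:
        raise ValueError("x and W must be non-empty and of equal length")
    total = sum(W)
    w = [Wi / total for Wi in W]                     # normalized weights
    mu_hat = sum(wi * xi for wi, xi in zip(w, x))    # weighted mean income

    def F_hat(v):
        # Weighted empirical distribution function evaluated at v.
        return sum(wi for wi, xi in zip(w, x) if xi <= v)

    s = sum(wi * xi * (1.0 - F_hat(xi)) for wi, xi in zip(w, x))
    return 1.0 - 2.0 * s / mu_hat


g = weighted_gini([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0])
```

With equal weights and no ties, (3) exceeds the usual sample Gini coefficient by exactly $1/n$ (here 0.5 rather than 0.25 for four observations), which is negligible for the sample sizes of a household survey but visible in toy examples.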

3. Bounded Width Confidence Intervals

In order to derive bounded width confidence intervals, the (asymptotic) distribution of the empirical Gini index ${\hat{G}}_{n}$ is needed. It has been shown by Bhattacharya (2007) that if $E (| X | | s) < \infty$, and if $n_{s} \to \infty$ for each stratum $s = 1, \dots, S$ at the same rate, then

\sqrt{n} ({\hat{G}}_{n} - G_{F}) \overset{D}{⟶} N (0, ξ^{2}) .

(4)

Here, $ξ^{2}$ denotes the (asymptotic) variance of $\sqrt{n} {\hat{G}}_{n}$. Because its representation is quite involved, we refer to Bhattacharya (2007) for the specific variance formula. The asymptotic distribution can now be used to compute $100 (1 - α) \%$ confidence intervals for the population Gini index whose width does not exceed a pre-specified value $ω$, that is,

Pr ({\hat{G}}_{n} - z_{α / 2} \frac{ξ}{\sqrt{n}} < G_{F} < {\hat{G}}_{n} + z_{α / 2} \frac{ξ}{\sqrt{n}}) \geq 1 - α,

and

L = 2 z_{α / 2} \frac{ξ}{\sqrt{n}} \leq ω .

Here, $z_{α / 2}$ is the $100 (1 - α / 2)$th percentile of the standard normal distribution $N (0, 1)$. Thus, the task is to compute the number of clusters n that guarantees that the width of the confidence interval is bounded by $ω$, i.e.,

\frac{ω \sqrt{n}}{2 ξ} \geq z_{α / 2} \Rightarrow n \geq \frac{4 z_{α / 2}^{2} ξ^{2}}{ω^{2}} = C .

(5)

Hence, C denotes the optimal total number of clusters from all strata needed such that $L \leq ω$. Therefore, the optimal number of clusters required to be sampled from the $s^{th}$ stratum $(s = 1, 2, \dots, S)$ is $C_{s} = C a_{s}$. Here, the term optimal is used in the sense of the minimum number of clusters needed to meet the requirements and not in the sense of optimal allocation used in sample survey methods (see Cochran 1977). If C is known, one can find the sufficiently narrow confidence interval

({\hat{G}}_{C} - z_{α / 2} \frac{ξ}{\sqrt{C}}, {\hat{G}}_{C} + z_{α / 2} \frac{ξ}{\sqrt{C}}),

that satisfies (5). However, without knowing the underlying distribution of the income (or assets or expenditure), the value of $ξ^{2}$ is unknown in practical scenarios. Thus, the optimal cluster size from all the S strata, C, is also unknown. We note that a supposed value (or previous survey estimate) of $ξ^{2}$ may be used to obtain the value of C. However, a potential problem is that the supposed value of $ξ^{2}$ may differ from the actual value. Moreover, using previous survey estimates is in many situations not advised, as they may not be applicable to the current population. This is because socio-economic conditions may have changed, for instance through a change in the distribution of income or expenditure as a result of changes in economic policies or situations. Due to all these factors, the resulting value of C may differ widely from what it would have been if $ξ^{2}$ were known, and it will not guarantee that (5) is satisfied. The (asymptotic) variance $ξ^{2}$ of the estimated Gini index must therefore be estimated in an appropriate way. Consistent estimators are discussed below.
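To make (5) concrete, the following Python sketch computes C and the per-stratum values $C_{s} = C a_{s}$ for a supposed value of $ξ^{2}$. The function name and the numerical inputs are hypothetical; the normal quantile is taken from the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist


def optimal_clusters(xi2, omega, alpha, H_s):
    """Return C = 4 z^2 xi^2 / omega^2 from (5) and the list C_s = C * a_s."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    C = 4 * z**2 * xi2 / omega**2
    H = sum(H_s)
    C_s = [C * h / H for h in H_s]            # a_s = H_s / H as in (2)
    return C, C_s


# Made-up inputs: xi^2 = 0.25, width bound 0.05, 95% confidence, two strata.
C, C_s = optimal_clusters(xi2=0.25, omega=0.05, alpha=0.05, H_s=[100, 300])
```

For these inputs C is roughly 1537 clusters; because C scales with $1 / ω^{2}$, halving the width bound quadruples the required number of clusters, and a misjudged $ξ^{2}$ changes C proportionally.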

Estimation of $ξ^{2}$

Several articles published in statistics and economics journals have proposed different estimators of the asymptotic variance of the estimator of the Gini index under different sampling schemes. Zitikis and Gastwirth (2002) proposed explicit formulas for the asymptotic variance of a general class of the Gini index (i.e., the S-Gini index) for simple random sampling with observations coming from the Exponential and Pareto distributions. We refer to Langel and Tillé (2013) for a discussion of several techniques used in estimating the asymptotic variance of the Gini index for various sampling designs. Under the current framework, Binder and Kovacevic (1995) proposed an estimator of $ξ^{2}$ using the empirical variance

V_{n, 1}^{2} = \sum_{s = 1}^{S} \frac{n_{s}}{n_{s} - 1} \sum_{c_{s} = 1}^{n_{s}} {(u_{s c_{s}} - {\bar{u}}_{s})}^{2}

(6)

of the values

u_{s c_{s}} = \frac{2}{\hat{μ}} \sum_{h = 1}^{k} w_{s c_{s} h} [A (x_{s c_{s} h}) x_{s c_{s} h} + B (x_{s c_{s} h}) - \frac{\hat{μ}}{2} ({\hat{G}}_{n} + 1)] .

Here, ${\bar{u}}_{s} = n_{s}^{- 1} \sum_{c_{s} = 1}^{n_{s}} u_{s c_{s}}$ denotes the empirical mean of the $u_{s c_{s}}$, and

\begin{matrix} A (x_{s c_{s} h}) & = \hat{F} (x_{s c_{s} h}) - \frac{{\hat{G}}_{n} + 1}{2}, \text{ and} \\ B (x_{s c_{s} h}) & = \sum_{a = 1}^{S} \sum_{b = 1}^{n_{a}} \sum_{c = 1}^{k} w_{a b c} x_{a b c} 1 (x_{a b c} \geq x_{s c_{s} h}), \end{matrix}

are weighted placements and averages of the income values obtained from n clusters, respectively. It should be noted that Bhattacharya (2007) proposed an alternative estimator of $ξ^{2}$, which is given by

\begin{matrix} V_{n, 2}^{2} & = \sum_{s = 1}^{S} \sum_{c_{s} = 1}^{n_{s}} \sum_{h = 1}^{k} w_{s c_{s} h}^{2} {\hat{ψ}}_{s c_{s} h}^{2} + \sum_{s = 1}^{S} \sum_{c_{s} = 1}^{n_{s}} \sum_{h = 1}^{k} \sum_{h^{'} \neq h} w_{s c_{s} h} {\hat{ψ}}_{s c_{s} h} w_{s c_{s} h^{'}} {\hat{ψ}}_{s c_{s} h^{'}} \\ - \sum_{s = 1}^{S} \frac{1}{n_{s}} {(\sum_{c_{s} = 1}^{n_{s}} \sum_{h = 1}^{k} w_{s c_{s} h} {\hat{ψ}}_{s c_{s} h})}^{2}, \end{matrix}

(7)

where

\begin{matrix} {\hat{ψ}}_{s c_{s} h} & = - \frac{2}{\hat{μ}} \sum_{g = 1}^{k n} w_{g} [x_{s c_{s} h} 1 (x_{s c_{s} h} \leq x_{(g)}) + x_{(g)} (\hat{F} (x_{(g)}) - 1 (x_{s c_{s} h} \leq x_{(g)}))] \\ & \quad + \frac{2}{{\hat{μ}}^{2}} \sum_{g = 1}^{k n} [\{\sum_{a = 1}^{S} \sum_{b = 1}^{n_{a}} \sum_{c = 1}^{k} w_{a b c} x_{a b c} 1 (x_{a b c} \leq x_{(g)})\} x_{s c_{s} h}], \\ k n & = k \sum_{s = 1}^{S} n_{s} \text{ is the total number of observations, and} \\ x_{(g)} & \text{ is the } g \text{th ordered observation (among all } x_{s c_{s} h} \text{)} . \end{matrix}

However, Hoque and Clarke (2015) showed that the estimators in (6) and (7) are numerically the same, i.e., $V_{n, 1}^{2} = V_{n, 2}^{2}$. We therefore choose $V_{n, 1}^{2}$ as a consistent estimator of $ξ^{2}$ and drop the second subscript, without loss of generality (i.e., we use $V_{n}^{2}$ as the estimator of $ξ^{2}$). Having found a consistent estimator of the (asymptotic) variance $ξ^{2}$, it follows that the optimal number of clusters C defined in (5) that leads to the bounded width confidence interval can now be estimated from the data. To this end, different sequential methodologies are discussed in the next section.
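The estimator in (6) can be transcribed into Python along the following lines. This is only a sketch under stated assumptions: the record layout `(stratum, cluster, x, w)` and the function name are hypothetical, the weights `w` are assumed to be already normalized, and the formula is implemented exactly as printed in (6), without any further rescaling.

```python
from collections import defaultdict


def variance_estimate(records):
    """V_{n,1}^2 as printed in (6).

    records: list of (stratum, cluster, x, w) tuples, one per sampled
    household, where w is the normalized weight (all w sum to 1).
    """
    mu = sum(w * x for _, _, x, w in records)

    def F(v):  # weighted empirical distribution function
        return sum(w for _, _, x, w in records if x <= v)

    def B(v):  # weighted sum of the incomes at or above v
        return sum(w * x for _, _, x, w in records if x >= v)

    G = 1.0 - 2.0 / mu * sum(w * x * (1.0 - F(x)) for _, _, x, w in records)

    # Cluster-level values u_{s c_s}.
    u = defaultdict(float)
    for s, c, x, w in records:
        A = F(x) - (G + 1.0) / 2.0
        u[(s, c)] += 2.0 / mu * w * (A * x + B(x) - mu / 2.0 * (G + 1.0))

    by_stratum = defaultdict(list)
    for (s, _), value in u.items():
        by_stratum[s].append(value)

    v2 = 0.0
    for values in by_stratum.values():
        n_s = len(values)
        if n_s < 2:
            raise ValueError("each stratum needs at least two sampled clusters")
        u_bar = sum(values) / n_s
        v2 += n_s / (n_s - 1) * sum((v - u_bar) ** 2 for v in values)
    return v2
```

The factor $n_{s} / (n_{s} - 1)$ is the reason at least two clusters per stratum are needed, and a stratum whose sampled clusters are identical contributes nothing to the estimate.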

4. Sequential Methodology

In this section, different sequential methodologies including two-stage and purely sequential approaches will be discussed to find the sufficiently narrow confidence interval. First, purely sequential methods will be introduced.

4.1. Purely Sequential Procedure

The purely sequential confidence interval computation is based on consecutive sampling until a certain stopping rule is met, which ensures that the width of the confidence interval is smaller than or equal to the given bound. This sampling process begins with a pilot sample, the size of which will be specified in Section 4.3. Recall that computing a bounded width confidence interval requires at least $C_{s}$ clusters from the $s^{th}$ stratum ($s = 1, 2, \dots, S$). Therefore, choose a pilot cluster size of $t_{s}$ from each stratum s, which results in a total number of clusters in the pilot stage of $t = \sum_{s = 1}^{S} t_{s}$. Within each selected cluster, k households are randomly selected (without replacement). Now, collect pilot observations $x_{s 11}, \dots, x_{s 1 k}, \dots, x_{s t_{s} 1}, \dots, x_{s t_{s} k}$ on each stratum $s = 1, \dots, S$. The estimator $V_{n}^{2}$ of $ξ^{2}$ is then computed to examine the following stopping rule:

\begin{matrix} N = N_{ω} (\leq H) \text{ is the smallest integer } n (\geq t) \text{ such that} \\ n \geq \frac{4 z_{α / 2}^{2}}{ω^{2}} (V_{n}^{2} + \frac{1}{n}) = \hat{C} \text{ and } n_{s} \geq {\hat{C}}_{s} = \hat{C} a_{s}, \text{ for all } s . \end{matrix}

(8)

If the condition in the stopping rule is not satisfied, the surveyor collects data from $m^{'} (\geq 1)$ additional clusters, with k randomly chosen households in each, from every stratum that has $n_{s} < {\hat{C}}_{s}$. Then $ξ^{2}$ is re-estimated based on all the observations collected up to that stage and the stopping condition is checked again. This process is repeated until the condition in the stopping rule is satisfied. It should be noted that $m^{'} (\geq 1)$ can be any integer that is appropriate, suitable or feasible for the survey.
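The sampling loop governed by the stopping rule (8) can be sketched as follows. This is a schematic illustration, not the authors' implementation: `draw` and `estimate_v2` are hypothetical callbacks that stand for the field work (sampling further clusters from a stratum) and for the variance estimator $V_{n}^{2}$, respectively.

```python
from statistics import NormalDist


def purely_sequential(draw, estimate_v2, t_s, H_s, omega, alpha=0.05, m=1):
    """Sketch of sampling under the stopping rule (8).

    draw(s, count)      -> list of `count` newly sampled clusters from stratum s
    estimate_v2(sample) -> V_n^2 computed from all clusters collected so far
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    H = sum(H_s)
    a_s = [h / H for h in H_s]
    # Pilot stage: t_s clusters from each stratum.
    sample = [list(draw(s, t)) for s, t in enumerate(t_s)]
    while True:
        n_s = [len(clusters) for clusters in sample]
        n = sum(n_s)
        c_hat = 4 * z**2 / omega**2 * (estimate_v2(sample) + 1 / n)
        # Strata that are still short of C_hat * a_s and not yet exhausted.
        # Since the a_s sum to one, n_s >= C_hat * a_s for all s implies n >= C_hat.
        short = [s for s in range(len(H_s))
                 if n_s[s] < c_hat * a_s[s] and n_s[s] < H_s[s]]
        if not short:
            return sample, c_hat
        for s in short:
            sample[s].extend(draw(s, min(m, H_s[s] - n_s[s])))
```

In this sketch the loop also ends when every deficient stratum has been exhausted, which reflects the constraint $N \leq H$; a larger `m` reduces the number of times the variance has to be re-estimated at the price of possibly overshooting the required number of clusters.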

The term $1 / n$ in (8) is a correction term incorporated to avoid early stopping of the sequential procedure, as $V_{n}^{2}$ (the estimator of $ξ^{2}$) may be very small in the early stages. Without this term, the stopping rule in (8) can be satisfied for very small sample sizes due to sampling error. In general, any null sequence, e.g., $1 / n^{γ}$, where $γ (> 0)$ is a fixed number, can be used as a correction term, because it does not affect the consistency of the variance estimator (see Mukhopadhyay and De Silva 2009, p. 260, for more details). The use of a correction term can be seen in several articles, e.g., Chattopadhyay and De (2016), Chattopadhyay and Kelley (2017), and Kelley et al. (2019). The final cluster size N comprises $N_{s}$ clusters from each stratum s, where

N_{s} = N a_{s}, for s = 1, 2, \dots, S .

Based on the sampled data $x_{s c_{s} h}$ and their corresponding normalized weights $w_{s c_{s} h}$, where $s = 1, \dots, S$, $c_{s} = 1, \dots, N_{s}$, and $h = 1, \dots, k$, the $100 (1 - α) \%$ bounded width confidence interval for the Gini index $G_{F}$ is given by

({\hat{G}}_{N} - z_{α / 2} \frac{V_{N}}{\sqrt{N}}, {\hat{G}}_{N} + z_{α / 2} \frac{V_{N}}{\sqrt{N}}) .

(9)
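As a small numerical illustration of (9), the following Python sketch computes the interval from a point estimate, a variance estimate and the final number of clusters. The function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist


def gini_confidence_interval(G_hat, V2, N, alpha=0.05):
    """Interval (9): G_hat -/+ z_{alpha/2} * V_N / sqrt(N)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(V2 / N)
    return G_hat - half_width, G_hat + half_width


# Made-up values: G_hat = 0.35, V_N^2 = 0.25 and N = 400 clusters.
lower, upper = gini_confidence_interval(0.35, 0.25, 400, alpha=0.05)
width = upper - lower
```

For these values the width is about 0.098, so a bound of $ω = 0.1$ would be met, whereas the same variance with N = 300 clusters would give a width of about 0.113.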

The purely sequential procedure may be numerically cumbersome due to the consecutive sampling and repeated computations of the variance estimators. Therefore, a less numerically intensive method—a two-stage procedure—will be examined in the next section.

4.2. Two-Stage Procedure

Unlike the purely sequential procedure, the two-stage procedure comprises only two stages. The first stage is called the pilot stage, wherein a sample is drawn from the population. That is, first a pilot sample of $t_{s}$ clusters (with $\sum_{s = 1}^{S} t_{s} = t$) is selected from each stratum s. Based on the sample from the pilot stage, $ξ^{2}$ is estimated as in (6). Then, the total final cluster size from all strata can be estimated as

Q = min \{H, max \{t, ⌈\frac{4 z_{α / 2}^{2}}{ω^{2}} V_{t}^{2}⌉\}\} = min \{H, Q^{*}\}

(10)

where $Q^{*}$ is the (unbounded) optimal cluster size and $⌈ \cdot ⌉$ is the ceiling function, that is, $⌈ x ⌉$ is the smallest integer that is greater than or equal to x. Thus, the estimated number of clusters to be sampled from the $s^{th}$ stratum is given by

Q_{s} = min \{H_{s}, [Q a_{s}]\},

with $a_{s}$ as defined in (2) and $[\cdot]$ being the nearest integer function. So, in the second stage, observations from k households will be collected from each of $Q_{s} - t_{s}$ additional clusters in each stratum s. Using the combined data from the two stages, the estimator of $ξ^{2}$ is updated and the approximate $100 (1 - α) \%$ confidence interval for the Gini index is given by

({\hat{G}}_{Q} - z_{α / 2} \frac{V_{Q}}{\sqrt{Q}}, {\hat{G}}_{Q} + z_{α / 2} \frac{V_{Q}}{\sqrt{Q}}) .

(11)

We note that the final cluster size using either the two-stage procedure or the purely sequential procedure can be shown to be always finite. In addition, the numbers of clusters per stratum are mutually dependent, as they all depend on the same stopping rule. In the next subsection, we derive the pilot cluster size formula.
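The second-stage sample sizes implied by (10) and the allocation rule for $Q_{s}$ can be sketched in Python as follows. The function name and inputs are hypothetical, and the nearest integer function $[\cdot]$ is implemented as round-half-up, which is an assumption (Python's built-in `round` rounds halves to the nearest even integer).

```python
import math
from statistics import NormalDist


def two_stage_sizes(V2_t, t_s, H_s, omega, alpha=0.05):
    """Return Q from (10), the per-stratum Q_s, and the extra clusters per stratum."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    H, t = sum(H_s), sum(t_s)
    Q_star = max(t, math.ceil(4 * z**2 * V2_t / omega**2))
    Q = min(H, Q_star)
    # Nearest-integer allocation Q * a_s (round half up), capped at H_s.
    Q_s = [min(h, math.floor(Q * h / H + 0.5)) for h in H_s]
    extra = [max(0, q - ts) for q, ts in zip(Q_s, t_s)]
    return Q, Q_s, extra


# Made-up pilot result: V_t^2 = 0.25 from t = 10 + 30 pilot clusters.
Q, Q_s, extra = two_stage_sizes(0.25, t_s=[10, 30], H_s=[1000, 3000], omega=0.1)
```

Here Q = 385, allocated as 96 and 289 clusters, so 86 and 259 additional clusters would be sampled in the second stage. Because everything hinges on the single pilot estimate $V_{t}^{2}$, an unluckily small pilot variance leads directly to too few clusters.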

4.3. Pilot Cluster Size

Using (8) and proceeding along the lines of Chattopadhyay and De (2016), we have

n \geq \frac{4 z_{α / 2}^{2}}{ω^{2}} (V_{n}^{2} + \frac{1}{n}) \geq \frac{4 z_{α / 2}^{2}}{ω^{2}} \frac{1}{n} \Rightarrow n^{2} \geq \frac{4 z_{α / 2}^{2}}{ω^{2}} \Rightarrow n \geq \frac{2 z_{α / 2}}{ω} .

(12)

Thus, the total number of sampled clusters is at least $2 z_{α / 2} / ω$. The maximum number of clusters from the $s^{th}$ stratum is $H_{s}$, and the minimum number of clusters needed to estimate $ξ^{2}$ is 2. Considering all the constraints in (8), the number of clusters recommended to be sampled from the $s^{th}$ stratum at the pilot stage is

t_{s} = min \{H_{s}, max \{2, ⌈\frac{2 a_{s} z_{α / 2}}{ω}⌉\}\} .

(13)

We note that this ensures that the minimum cluster size is met as well as the total possible cluster size is not exceeded.
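The pilot sizes in (13) can be computed with a few lines of Python. The function name and the stratum sizes below are hypothetical and serve only to illustrate the formula.

```python
import math
from statistics import NormalDist


def pilot_sizes(H_s, omega, alpha=0.05):
    """Pilot clusters per stratum from (13): min{H_s, max{2, ceil(2 a_s z / omega)}}."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    H = sum(H_s)
    return [min(h, max(2, math.ceil(2 * (h / H) * z / omega))) for h in H_s]


# Made-up design: two strata with 100 and 300 clusters, width bound 0.05.
t_s = pilot_sizes([100, 300], omega=0.05, alpha=0.05)
```

For this design the pilot sizes are 20 and 59 clusters. With a very loose bound such as $ω = 2$, the lower limit of two clusters per stratum becomes the binding constraint.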

5. Characteristics of the Procedures and Simulation Study

The purely sequential procedure and the two-stage procedure for constructing a sufficiently narrow confidence interval for the Gini index—unlike fixed cluster size procedures—require cluster sizes which are obtained from data. So, the respective cluster sizes N and Q are random in nature. In the following subsection, we will look at the characteristics of the random cluster sizes viz. N and Q.

5.1. Characteristics

The following theorem provides some asymptotic properties (as $ω \to 0$) of the final cluster sizes of the above procedures with sufficiently large H.

Theorem 1.

If the parent distribution(s) is (are) such that $E [V_{n}^{2}]$ exists and the (fixed) $H_{s}$ are sufficiently large for all $s = 1, \dots, S$, then as $ω \to 0$,

(i): $\frac{N}{C} \to 1$ in probability,
(ii): $\frac{Q}{C} \to 1$ in probability, and
(iii): $\frac{2 z_{α / 2} V_{N}}{\sqrt{N}} \leq ω$ .

Proof of Theorem 1.

(i): The definition of stopping rule N associated with the purely sequential procedure in (8) yields

${(\frac{2 z_{α / 2}}{ω})}^{2} V_{N}^{2} \leq N \leq t 1 (N = t) + {(\frac{2 z_{α / 2}}{ω})}^{2} (V_{N - 1}^{2} + \frac{1}{N - 1}) .$

(14)

Since $N \to \infty$ as $ω ↓ 0$ and $V_{n}^{2} \to ξ^{2}$ in probability as $n \to \infty$ , by applying Theorem 2.1 of Gut (2009), $V_{N}^{2} \to ξ^{2}$ in probability.
Furthermore, $t Pr (N = t) / C \leq t / C \to 0$ as $ω ↓ 0$ . Hence, dividing all sides of (14) by C and letting $ω ↓ 0$ , we prove $N / C \to 1$ in probability as $ω ↓ 0$ .
(ii): The definition of final cluster size Q related to the two-stage procedure in (10) yields

${(\frac{2 z_{α / 2}}{ω})}^{2} V_{t}^{2} \leq Q \leq t 1 (Q = t) + {(\frac{2 z_{α / 2}}{ω})}^{2} (V_{t}^{2} + \frac{1}{t}) .$

(15)

Furthermore, $t Pr (Q = t) / C \leq t / C \to 0$ as $ω ↓ 0$ . Now, $V_{t}^{2} \to ξ^{2}$ in probability as $ω ↓ 0$ . Hence, dividing all sides of (15) by C and letting $ω ↓ 0$ , we prove $Q / C \to 1$ in probability as $ω ↓ 0$ .
(iii): Using stopping rule N in (8) we have, for all N,

${(\frac{2 z_{α / 2}}{ω})}^{2} V_{N}^{2} \leq N \Rightarrow \frac{4 z_{α / 2}^{2}}{N} V_{N}^{2} \leq ω^{2} \Rightarrow 2 z_{α / 2} \frac{V_{N}}{\sqrt{N}} \leq ω .$

□

Parts (i) and (ii) of the theorem show that the final cluster sizes obtained from the purely sequential and the two-stage procedures are consistent estimators of the optimal cluster size C that would be used if $ξ^{2}$ were known. Part (iii) of the theorem shows that a sufficiently narrow confidence interval (that is, of length less than or equal to $ω$) will be obtained by the purely sequential procedure. The same result cannot be proven for the two-stage procedure.

5.2. Simulation Study

We now use a detailed simulation study, presented in the Supplement, to illustrate and compare the properties of our purely sequential and two-stage procedures in constructing a $100 (1 - α) \%$ confidence interval for the Gini index under a complex survey whose width is less than $ω$. We present two different simulation studies with 5000 simulation runs: (a) a simulation using the NSS survey data as the population and (b) a Monte Carlo simulation in which the observations are drawn from three different populations generated from three different distributions, namely the Pareto, Gamma and Lognormal distributions. The two simulation studies were performed in RStudio (RStudio Team 2018, version 1.2.1335) and the code is available upon request.

To begin with, we describe the simulation procedure for the purely sequential methodology. From the given populations, $t_{s}$ $(s = 1, 2, \dots, S)$ clusters are randomly sampled from the $s^{th}$ stratum without replacement. Four households are then selected from each cluster using simple random sampling without replacement, and these households from all t clusters constitute the pilot sample. From the collected pilot sample, the asymptotic variance of the Gini index $ξ^{2}$ is estimated using (6), and from (8), the optimal number of clusters C is estimated. The stopping rule is checked and, if it is satisfied, sampling is terminated. If the stopping rule is not satisfied, the strata whose number of selected clusters is less than the required number, that is $\{s : t_{s} < {\hat{C}}_{s}\}$, are identified and an additional $m^{'}$ clusters are randomly selected without replacement from each of them. Here, $m^{'}$ is chosen to be either 1, 10 or 20. In each of the selected $m^{'}$ clusters, four households are randomly selected without replacement. At this stage, with the total number of sampled clusters being n (say), the value of $V_{n}^{2}$ is updated and the stopping rule is checked. If the rule is met, sampling is stopped; otherwise the strata without enough clusters are identified again and an additional $m^{'}$ clusters are collected from each of them. This process is continued until the stopping rule is met. At that point, based on the N (say) clusters sampled from all strata, the $100 (1 - α) \%$ confidence interval for the Gini index is constructed as given in (9).
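The sampling step of such a simulation (clusters without replacement within each stratum, then four households without replacement within each selected cluster) can be sketched in Python as below. The simulations reported here were run in R; this fragment is only an illustration with hypothetical names, a made-up synthetic population of lognormal incomes, and a fixed cluster size of 12 households.

```python
import random


def make_population(H_s, M, rng, draw_income):
    """population[s][c] is the list of M household incomes in cluster c of stratum s."""
    return [[[draw_income(rng) for _ in range(M)] for _ in range(h)] for h in H_s]


def draw_pilot(population, t_s, k, rng):
    """Sample t_s clusters per stratum, then k households per cluster, without replacement."""
    pilot = []
    for stratum, t in zip(population, t_s):
        chosen = rng.sample(range(len(stratum)), t)                    # cluster indices
        pilot.append({c: rng.sample(stratum[c], k) for c in chosen})   # k incomes each
    return pilot


rng = random.Random(2020)  # fixed seed so that the sketch is reproducible
population = make_population([30, 50], M=12, rng=rng,
                             draw_income=lambda r: r.lognormvariate(0.0, 1.0))
pilot = draw_pilot(population, t_s=[5, 8], k=4, rng=rng)
```

Replacing `lognormvariate` by `paretovariate` or `gammavariate` from the same module would give the other two income distributions; later sampling rounds would additionally have to exclude clusters that were already selected.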

Unlike the purely sequential procedure described above, the two-stage procedure has only two stages. The simulation algorithm for the two-stage procedure is as follows. From a given population, $t_{s}$ clusters are randomly selected without replacement from the $s^{th}$ stratum and four households are randomly sampled from each of the selected clusters without replacement. The per capita monthly expenditures $x_{s c_{s} h}$ of the selected households, with their respective weights $W_{s c_{s} h}$, are used to estimate the asymptotic variance of the Gini index (from (6)). This is followed by using (10) to obtain the optimal number of clusters Q needed to achieve the desired confidence level and width. If $Q > t$, an additional $Q_{s} - t_{s}$ clusters are randomly selected without replacement from each stratum s. In each of the additional clusters, four households are also randomly selected without replacement. Finally, the per capita monthly expenditures of all households from the Q clusters are used to construct the $100 (1 - α) \%$ confidence interval for the Gini index as stated in (11).

From the simulations, we find that the coverage probability for the confidence intervals for both purely sequential procedure and the two-stage procedure are approximately close to the desired confidence level provided that the cluster size (in all strata) is large, which is also a basic criterion while proving the asymptotic normality in (4). However, the width of the confidence intervals for the two stage procedure, unlike the purely sequential procedure, may result in confidence intervals of width larger than the pre-specified value of

ω

. For details, one may look at Tables S21–S24 of the supplementary material. This outcome is not surprising since the two-stage procedure is based on only the pilot sample which is usually taken to be small. So, the variability of the variance estimator

V_{n}^{2}

is higher. The optimal cluster sizes obtained by the purely sequential procedure are smaller than those obtained by the two-stage procedure. The newly developed methods are applied to real data in the next section.

6. Gini Index Estimation in India

We now apply the sequential procedures to construct bounded-width confidence intervals for the Gini index in India using the per capita monthly expenditures obtained via the 64th Round National Sample Survey (NSS), a stratified multi-stage survey conducted between July 2007 and June 2008. In 2008, the country was divided into 28 states and seven union territories, each of which was subdivided into districts. Within each district, two basic sectors were formed; all rural areas constituted the rural sector while all urban areas constituted the urban sector. However, for the urban areas in a district, separate basic strata were formed for each town that had a population of at least 10 lakhs (1 lakh is 100,000). The remaining areas were grouped as another basic stratum (National Sample Survey Office 2007). For the rural sector, the sampling frame was made up of villages, while for the urban sector, it was made up of towns/blocks.^1

Census villages and Urban Frame Survey blocks were the first stage units (FSUs) in the rural and urban sectors, respectively. From each stratum, FSUs were selected in the rural sector with probability proportional to size with replacement and in the urban sector by simple random sampling without replacement. Within each FSU, households were considered the smallest unit of grouping, also referred to as the ultimate stage units. Households were selected by simple random sampling without replacement, and various items of information about the households were recorded during the survey, including demographics, household size, expenditure on education, food and clothing, and the corresponding weights. A detailed description of the NSS data can be found online at National Sample Survey Office (2015).
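The two first-stage selection schemes described above can be illustrated with a minimal Python sketch. It is not the NSS selection program: the function names `select_fsus` and `select_households`, the size measures, and the sample sizes are all hypothetical and serve only to contrast probability proportional to size (PPS) sampling with replacement against simple random sampling without replacement.

```python
import random

def select_fsus(fsu_sizes, n_select, sector, rng):
    """Return indices of selected first stage units (FSUs) in one stratum.

    fsu_sizes: a size measure per FSU (hypothetical input, e.g. population).
    sector: "rural" -> PPS with replacement, "urban" -> SRS without replacement.
    """
    indices = list(range(len(fsu_sizes)))
    if sector == "rural":
        # Probability proportional to size, with replacement: an FSU may repeat.
        return rng.choices(indices, weights=fsu_sizes, k=n_select)
    if sector == "urban":
        # Simple random sampling without replacement: all indices are distinct.
        return rng.sample(indices, n_select)
    raise ValueError("sector must be 'rural' or 'urban'")

def select_households(household_ids, n_select, rng):
    """Simple random sample, without replacement, of households in one FSU."""
    return rng.sample(household_ids, min(n_select, len(household_ids)))

rng = random.Random(2008)
sizes = [120, 450, 80, 300, 950, 60]          # made-up FSU size measures
rural_fsus = select_fsus(sizes, 4, "rural", rng)
urban_fsus = select_fsus(sizes, 4, "urban", rng)
households = select_households(list(range(1, 41)), 4, rng)
```

Under PPS with replacement the same FSU can appear more than once in `rural_fsus`, whereas `urban_fsus` and `households` never contain repeats.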

The “Stratum” variable in the 64th NSS data set is used to stratify the states/sectors, while the “FSUno” (First Stage Unit Number) variable is used to cluster the households within each stratum. We discuss the results obtained by applying the proposed sequential methodologies to the data collected from two of the most populous states in India, namely Uttar Pradesh and West Bengal. Results are reported for each state as a whole as well as for its rural and urban sectors. Here, all the households in each cluster were considered since we are sampling from a survey that already has a small number of households per cluster. However, the weight per household is adjusted at each sampling stage to reflect the actual weight that would be used during a survey.

In applying the sequential methodologies, the pilot cluster sizes

t_{s}

for each stratum s are computed using (13). At the outset,

t_{s}

clusters are selected from stratum s for

s = 1, \dots, S

. Here,

t_{s}

is the same for both the purely sequential procedure and the two-stage methodology. We apply each of the procedures considering the survey data as our population.

6.1. Application of Purely Sequential Procedure (PSP)

The proposed purely sequential procedure, with observations from one cluster collected at each stage after the pilot stage, is applied to the NSS 64th round data. The results for different combinations of pre-specified width (

ω \in \{0.020, 0.025\}

) and confidence level (

1 - α, α \in \{0.05, 0.10\}

) can be found in Table 1, Table 2, Table 3 and Table 4. The first column of the tables indicates the region to which we applied our procedure. The PSP was applied to the entire data from Uttar Pradesh (denoted as All) and then separately to the rural and urban sectors of Uttar Pradesh (denoted as Rural and Urban, respectively). The same process was repeated for West Bengal. The second column of the tables shows the estimated Gini index (

{\hat{G}}_{H}

) and its standard error (

s e ({\hat{G}}_{H})

) using the entire number of clusters (H) available in the data set for that region (i.e., all of the state, rural sector of the state, or the urban sector of the state). In the third column is the total number of clusters

(H)

available in the data set for that region. The fourth column shows the value of

\hat{C}

when the procedure ended,

\hat{C}

being the estimated optimal cluster size as in (8). The fifth column of the tables shows the collected cluster size N using the stopping rule in (8) and the pilot cluster size t. The values of

{\hat{G}}_{N}

and

s e ({\hat{G}}_{N})

in the sixth column are the estimated Gini index and its standard error respectively based on N clusters. The next two columns are respectively the lower and upper limits of the confidence intervals obtained with the stopping rule in (8). The ninth column is

w_{N}

which is the estimated width of the confidence interval. The last column

Pr (N_{s} < {\hat{C}}_{s})

shows the proportion of strata that had their collected cluster size

N_{s}

from the purely sequential procedure being less than their estimated optimal cluster size

{\hat{C}}_{s}

(

N_{s}

is the final number of clusters selected from stratum s while

{\hat{C}}_{s}

is the estimated optimal number of clusters to be sampled from stratum s).

In Table 1, Table 2, Table 3 and Table 4, it can be seen that, when the number of clusters available for sampling per stratum (

H_{s}

) is large, the purely sequential procedure is able to achieve the desired precision, i.e., a narrow confidence interval (

w_{N} \leq ω

), for the Gini index with relatively few clusters sampled while maintaining the desired confidence level. This is shown in the results where

N < H

for all of Uttar Pradesh and West Bengal, as well as their individual rural sectors. The same cannot be said about their urban sectors, as they do not have enough available clusters from the outset. Thus, the procedure did not reach the optimal cluster size but stopped when there were no more clusters remaining to be sampled.
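The stopping behaviour described above, where sampling ends either when the width target is met or when the available clusters are exhausted, can be sketched in a few lines of Python. This is only a simplified illustration under stated assumptions and not the procedure defined by (8): it ignores stratification and survey weights, uses the unweighted plug-in Gini index, replaces the variance estimator in (6) with a delete-one-cluster jackknife, and assumes a stopping rule of the form n ≥ 4 z² V²/ω². The population is simulated, and all function names are hypothetical.

```python
import random
from statistics import NormalDist

def gini(values):
    """Unweighted plug-in Gini index of a list of non-negative numbers."""
    xs = sorted(values)
    n = len(xs)
    ranked = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * ranked / (n * sum(xs)) - (n + 1.0) / n

def jackknife_asymptotic_variance(clusters):
    """Delete-one-cluster jackknife estimate of n * Var(Gini estimate)."""
    n = len(clusters)
    leave_one_out = []
    for i in range(n):
        pooled = [x for j, c in enumerate(clusters) if j != i for x in c]
        leave_one_out.append(gini(pooled))
    mean = sum(leave_one_out) / n
    var_jk = (n - 1) / n * sum((g - mean) ** 2 for g in leave_one_out)
    return n * var_jk

def purely_sequential(population, pilot, step, omega, alpha, rng):
    """Add `step` clusters per stage until the target is met or clusters run out."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    order = list(range(len(population)))
    rng.shuffle(order)                       # random order = sampling without replacement
    n = min(pilot, len(order))
    while True:
        sample = [population[i] for i in order[:n]]
        v2 = jackknife_asymptotic_variance(sample)
        needed = 4 * z * z * v2 / omega ** 2  # assumed form of the estimated optimal size
        if n >= needed or n == len(order):    # stop: target met or no clusters left
            g = gini([x for c in sample for x in c])
            half = z * (v2 / n) ** 0.5
            return {"n": n, "needed": needed, "gini": g,
                    "lower": g - half, "upper": g + half, "width": 2 * half}
        n = min(n + step, len(order))

rng = random.Random(64)
# Hypothetical population: 300 clusters of 4 households, lognormal expenditures.
population = [[rng.lognormvariate(7.0, 0.5) for _ in range(4)] for _ in range(300)]
result = purely_sequential(population, pilot=30, step=1, omega=0.05, alpha=0.10, rng=rng)
```

By construction, the returned interval is no wider than `omega` unless the loop ended because every cluster in the population had been used, which mirrors the situation reported for the urban sectors.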

The results also show that, aside from the fact that the entire urban regions did not have enough clusters (

N = H < C

), each stratum in those regions also lacks enough clusters (that is,

Pr (N_{s} < {\hat{C}}_{s}) = 1

) to obtain a sufficiently narrow confidence interval. However, in the other regions (i.e., All and Rural for Uttar Pradesh and West Bengal), even though

\hat{C} < N < H

, some strata had

N_{s} < {\hat{C}}_{s}

. This is because some strata have more than enough clusters while others do not, and these offset each other in the end. For example, it can be seen from Table 1 that in the rural sector of Uttar Pradesh, 40% of the strata did not have enough clusters even though, in the end, the confidence interval was 0.0186 wide, which is less than the desired width of 0.02.
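As a rough arithmetic check of the rural Uttar Pradesh row of Table 1, the following sketch rebuilds the interval from the reported estimate and standard error. It assumes the interval has the usual normal-approximation form, estimate ± z·se, which is our reading of the tables rather than a restatement of (8); the function name `normal_interval` is hypothetical.

```python
from statistics import NormalDist

def normal_interval(estimate, std_error, alpha):
    """Lower limit, upper limit and width of a normal-approximation interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * std_error
    return estimate - half, estimate + half, 2 * half

# Reported values for Rural, Uttar Pradesh (Table 1): estimate 0.2024, se 0.0057.
lower, upper, width = normal_interval(0.2024, 0.0057, alpha=0.10)
print(round(lower, 4), round(upper, 4), round(width, 4))
```

Because the standard error in the table is rounded to four decimals, the recomputed limits and width agree with the reported 0.1931, 0.2117 and 0.0186 only to about the third decimal place.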

Next, the results are compared with those of the two-stage procedure discussed in Section 4.2.

6.2. Application of Two-Stage Procedure

First, the estimator

V_{n}^{2}

of

ξ^{2}

is obtained from the pilot stage and then the final cluster size

Q^{*}

is computed.

Q^{*}

is then adjusted to account for the limited availability of clusters per stratum in the NSS data to obtain the possible number of clusters Q that can be sampled (see (10)). Here, Q is distributed over S strata as

Q_{s}

for stratum s; rounding off where

Q_{s}

is not an integer. The sum of

Q_{s}

gives the actual number of clusters,

\tilde{Q} = \sum_{s = 1}^{S} Q_{s}

, that are sampled from all strata. Using

\tilde{Q}

clusters, the Gini index and

ξ^{2}

are re-estimated (or updated) and a

100 (1 - α) %

confidence interval is constructed according to (11).
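The chain from the unconstrained size to the number of clusters actually sampled can be sketched as follows. This is a hedged illustration, not the rule in (10): it assumes the unconstrained size takes the form max(t, ⌈4 z² V²/ω²⌉), that the cap is the total number of available clusters, and that clusters are allocated to strata in proportion to their available counts before rounding. The function name `two_stage_sizes` and all inputs are hypothetical.

```python
import math
from statistics import NormalDist

def two_stage_sizes(v2_pilot, omega, alpha, pilot_total, available_per_stratum):
    """Return (Q_star, Q, per-stratum Q_s, Q_tilde) under the stated assumptions."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    total_available = sum(available_per_stratum)
    # Assumed unconstrained target: interval width 2*z*sqrt(v2/Q) <= omega.
    q_star = max(pilot_total, math.ceil(4 * z * z * v2_pilot / omega ** 2))
    # Cap by the number of clusters the data set can supply.
    q = min(q_star, total_available)
    # Assumed allocation: proportional to available clusters, rounded per stratum
    # and never more than the stratum holds.
    q_s = [min(h, round(q * h / total_available)) for h in available_per_stratum]
    return q_star, q, q_s, sum(q_s)

available = [33, 45, 61, 111]                    # made-up clusters per stratum
enough = two_stage_sizes(0.03, 0.04, 0.10, 20, available)
too_few = two_stage_sizes(0.03, 0.02, 0.10, 20, available)
```

In the first call the rounded allocation sums to one cluster more than Q, and in the second the unconstrained size exceeds what is available, so every cluster is used; both behaviours match the relationships among the three quantities described in this section.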

Similar to the application of the purely sequential procedure, the two-stage procedure is applied to the NSS 64th round data for different combinations of pre-specified precision (

ω

) and accuracy (

1 - α

) with the results shown in Table 5, Table 6, Table 7 and Table 8. The second column of the tables indicates the total number of clusters H in the unit (i.e., the whole state, rural sector, or urban sector) of the NSS data. The third column displays the estimated optimal number of clusters (

Q^{*}

) that are required in order to achieve the desired precision and accuracy. Below

Q^{*}

is the pilot number of clusters t. The next column shows the estimated optimal cluster size Q, taking into account the total number of clusters available in the data, because the number of clusters is finite and limited. Furthermore,

\tilde{Q}

is the actual number of clusters that can be sampled from all strata, given that only an integer number of clusters can be sampled from each stratum (i.e., rounding where the number of clusters to be sampled from a stratum is not an integer). Using (3) and (6), the Gini index estimate,

{\hat{G}}_{H}

, for the unit is computed using all H clusters with its standard error as

s e ({\hat{G}}_{H})

and these are shown in the fifth column. The selected clusters are used to estimate the Gini index and it is denoted as

{\hat{G}}_{\tilde{Q}}

, with its standard error as

s e ({\hat{G}}_{\tilde{Q}})

, in the sixth column. In the seventh and eighth columns, Lower CI and Upper CI are the lower and upper limits of the

100 (1 - α) %

confidence interval of the Gini index using

\tilde{Q}

clusters, respectively. The last column shows the length of the confidence interval,

w_{\tilde{Q}}

. It must be noted that

Q^{*}

is unbounded while on the other hand, Q and

\tilde{Q}

cannot exceed H.

\tilde{Q}

can be less than, equal to, or greater than Q depending on the rounding off.

Q^{*}

will be equal to Q if and only if

Q^{*}

is less than or equal to H.

From Table 5, Table 6, Table 7 and Table 8, it can be observed that in all cases, except the urban sectors for both states, the confidence interval widths were less than

ω

. These results were achieved because the optimal number of clusters required

(Q^{*})

, according to the two-stage procedure, was less than the number available

(H)

. On the other hand, in both Uttar Pradesh and West Bengal, the estimated optimal cluster sizes

Q^{*}

for the urban sector exceeded the available number of clusters H in the data. As a consequence of this, the confidence interval widths for the Gini index in the urban sectors were larger than the pre-specified bound, that is

w_{\tilde{Q}} > ω

.

7. Extension: Narrow Confidence Region

The methodology presented in this article for the Gini Index parameter can be extended to a multi-parameter setup in which we would like to make an inference about a vector of parameters

θ = {(θ_{1}, θ_{2}, \dots, θ_{p})}^{⊤}

for

p \geq 2

. This situation arises when we are interested in making joint inference related to a number of welfare related measures computed from socio-economic survey data (e.g., household consumer expenditure survey conducted by National Sample Survey, India). Thus, instead of a sufficiently narrow confidence interval, we would like to construct a narrow confidence region for a vector of parameters. Let the vector of estimators be defined as

T_{n} = {(T_{1 n}, \dots, T_{p n})}^{⊤}

based on the data on n households collected using a complex household survey. We extend our proposed methodology for constructing the narrow confidence region in the spirit of Mukhopadhyay and De Silva (2009, pp. 284–89). We propose the following confidence region for

θ_{F}

:

ℜ_{n} = \{θ \in R^{p} : {(T_{n} - θ)}^{⊤} (T_{n} - θ) \leq ω^{2}\} .

Using the regularity conditions by Bhattacharya (2005), we have,

\sqrt{n} (T_{n} - θ) \overset{D}{\to} N (0, Σ), \quad \text{i.e.,} \quad n {(T_{n} - θ)}^{⊤} Σ^{- 1} (T_{n} - θ) \overset{a}{\sim} χ_{p}^{2},

with

Σ

being a positive definite matrix and

χ_{p}^{2}

being a chi-squared distribution with p degrees of freedom. If

Σ

is a positive definite matrix, then there exist an orthogonal matrix

P

and a diagonal matrix

Δ

such that

P Σ P^{⊤} = Δ

. The diagonal elements of

Δ

are the eigenvalues of

Σ

. If the positive eigenvalues of

Σ

are

λ_{1}, \dots, λ_{p}

then

Δ = diag (λ_{1}, \dots, λ_{p})

. Furthermore, let

({PT}_{n} - P θ) = {(Y_{1}, \dots, Y_{p})}^{⊤}

and let

λ_{(p)}

be the maximum of the p eigenvalues of

Σ

. So, we have

\begin{matrix} {(T_{n} - θ)}^{⊤} Σ^{- 1} (T_{n} - θ) & = {(T_{n} - θ)}^{⊤} P^{⊤} Δ^{- 1} P (T_{n} - θ) \\  & = {({PT}_{n} - P θ)}^{⊤} Δ^{- 1} ({PT}_{n} - P θ) = \sum_{i = 1}^{p} \frac{Y_{i}^{2}}{λ_{i}}, \\ λ_{(p)} {(T_{n} - θ)}^{⊤} Σ^{- 1} (T_{n} - θ) & \geq \sum_{i = 1}^{p} Y_{i}^{2} = {({PT}_{n} - P θ)}^{⊤} ({PT}_{n} - P θ) \\  & = {(T_{n} - θ)}^{⊤} (T_{n} - θ) . \end{matrix}

(16)

Thus, using (16), we say,

\begin{matrix} Pr (θ \in ℜ_{n}) & = Pr [{(T_{n} - θ)}^{⊤} (T_{n} - θ) \leq ω^{2}] \\  & \geq Pr [λ_{(p)} {(T_{n} - θ)}^{⊤} Σ^{- 1} (T_{n} - θ) \leq ω^{2}] \\  & = Pr [{(T_{n} - θ)}^{⊤} Σ^{- 1} (T_{n} - θ) \leq \frac{ω^{2}}{λ_{(p)}}] . \end{matrix}

With

χ_{α; p}^{2}

denoting the

100 (1 - α)

th percentile of

χ_{p}^{2}

, we claim that the coverage probability of the confidence region

ℜ_{n}

is approximately at least

(1 - α)

if

\frac{n ω^{2}}{λ_{(p)}} \geq χ_{α; p}^{2}, \quad \text{i.e.,} \quad n \geq \frac{χ_{α; p}^{2} λ_{(p)}}{ω^{2}} = C .

Here, C is the required optimal cluster size that should be used provided the covariance matrix (

Σ

) is known. If the parameter

Σ

is known in advance, one could simply collect observations from the required number of clusters

C_{s}, s = 1, \dots, S

, in each of the S strata. Since

Σ

is not known in practice, we can estimate

Σ

, using a consistent estimator (

V_{n}

, say) which can be obtained using the jackknife method. The consistency of the jackknife estimator follows from Sen (1988). Thus, using the jackknife estimator, we may propose either a two-stage or a purely sequential procedure. Results similar to those for the procedures described earlier are expected to hold under appropriate regularity conditions.
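For p = 2 the required size C can be evaluated without special libraries, because the upper α point of a chi-squared distribution with 2 degrees of freedom equals −2 ln α and the largest eigenvalue of a symmetric 2 × 2 matrix has a closed form. The sketch below uses a made-up covariance matrix and hypothetical function names; it treats Σ as known, which the text notes is not the case in practice.

```python
import math

def max_eigenvalue_2x2(a, b, d):
    """Largest eigenvalue of the symmetric matrix [[a, b], [b, d]]."""
    return 0.5 * ((a + d) + math.sqrt((a - d) ** 2 + 4 * b * b))

def region_cluster_size(sigma, omega, alpha):
    """C = chi2_{alpha;2} * lambda_max / omega^2 for a 2x2 covariance matrix."""
    (a, b), (_, d) = sigma
    chi2_upper = -2.0 * math.log(alpha)    # upper-alpha point, 2 degrees of freedom
    return chi2_upper * max_eigenvalue_2x2(a, b, d) / omega ** 2

sigma = [[0.04, 0.01], [0.01, 0.02]]       # hypothetical covariance matrix
C = region_cluster_size(sigma, omega=0.05, alpha=0.05)
required = math.ceil(C)                    # smallest integer n with n >= C
```

The bound is driven by the largest eigenvalue only, so it is conservative whenever the remaining eigenvalues are much smaller.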

8. Discussion

At the outset, we would like to caution readers not to confuse two-stage sampling with the two-stage procedure from the sequential sampling literature discussed in Section 4.2. For the two-stage procedure, we refer to Chattopadhyay and Mukhopadhyay (2013), Stein (1945) and others. Two-stage sampling (e.g., see Fuller (2009)) is a sampling technique in which a sample of clusters is selected and, within those selected clusters, a sample of units is selected, assuming the units to be independent of one another and the selection rule to depend only on the cluster. Under two-stage sampling, Fuller (2009) discussed the use of the Horvitz–Thompson estimator to estimate the population total and mean and their respective variances. In addition, Fuller (2009) elaborated on the use of Horvitz–Thompson estimators and their (asymptotic) variances for functions of means and complex estimators, in general, under the assumption that the population distribution has a finite fourth moment. However, in the asymptotic framework of Fuller (2009), it was assumed that observations are independently and identically distributed (iid), which is a stronger assumption than that of the framework of Bhattacharya (2007), also used in this work. Furthermore, Fuller (2009) also discussed the classical optimal sample allocation problem under the two-stage sampling technique for estimating the mean per element in a population. In his discussion, he assumed an equal number of units to be sampled from each cluster, an equal total number of units in each cluster, and known population variances of the cluster size and the sampling units. Under these assumptions, Fuller (2009) obtained the optimal number of units to be sampled per cluster by minimizing the variance of the mean per element subject to a cost constraint.

Our work differs from the survey procedures discussed in Fuller (2009). As indicated earlier, it is based on the survey framework used in Bhattacharya (2007). To obtain a sufficiently narrow confidence interval for the Gini index, we estimate the unknown optimal number of clusters in each stratum, with the number of strata fixed in advance. Apart from the survey framework, the optimal cluster size in our work depends on the data, unlike in the procedures discussed in Fuller (2009). The total cluster size (as well as the cluster size per stratum) is a random variable that depends on a stopping criterion. This also makes the estimated cluster sizes mutually dependent, as they are all estimated from the same stopping rule. Thus, the method discussed in Fuller (2009), or any other existing work, cannot be applied to find such a confidence interval.

We believe this is the first work to develop sufficiently narrow confidence intervals for an economic inequality index based on complex household survey data. We now discuss some limitations of our proposed procedures: (a) both procedures depend on the pre-specified number of households in each cluster; (b) the purely sequential procedure depends on the pre-specified

m^{'}

; (c) both procedures are designed for the large cluster size scenario; and (d) neither procedure accounts for the sampling cost or a fixed budget.

To begin with, the purely sequential procedure requires observations from additional

m^{'}

clusters, after the first stage, every time the condition in the stopping rule is not met. Thus, there is a need to fix the value of

m^{'}

. In some situations, it is as easy to collect observations from more than one cluster as it is to collect observations from a single cluster at every stage. So, as per convenience, the value of

m^{'}

should be decided based on economic considerations. In fact, the purely sequential procedure itself is not affected by the choice of

m^{'}

; however, the larger the value of

m^{'}

, the fewer the stages and the higher the chance of overestimating the optimal number of clusters. On the other hand, the smaller the value of

m^{'}

, the greater the number of stages and the higher the chance of stopping accurately at the optimal number of clusters. Thus, there is a trade-off between the number of stages and stopping accurately at the optimal cluster size when choosing

m^{'}

.
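The trade-off can be made concrete with a small idealized calculation. The sketch below pretends the optimal cluster size is known and fixed, whereas in the actual procedure it is re-estimated at every stage, and counts how many stages are needed and by how much the final size overshoots for several stage sizes. The function name `stages_and_overshoot` is hypothetical, and the values 505 and 198 are borrowed from the rural Uttar Pradesh row of Table 1 purely for illustration.

```python
import math

def stages_and_overshoot(optimal, pilot, step):
    """Stages after the pilot, final size and overshoot when adding `step` clusters per stage."""
    if pilot >= optimal:
        return 0, pilot, pilot - optimal
    stages = math.ceil((optimal - pilot) / step)
    final = pilot + stages * step
    return stages, final, final - optimal

for step in (1, 10, 50):
    print(step, stages_and_overshoot(optimal=505, pilot=198, step=step))
```

In this idealized setting the overshoot is always smaller than the stage size, while the number of stages shrinks roughly in proportion to it.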

Furthermore, our proposed procedures are based on the central limit theorem (when the cluster sizes per stratum are large). If the number of clusters is small, neither the confidence interval of Bhattacharya (2005, 2007) (the fixed-cluster-size method) nor the narrow confidence interval from our proposed procedures can be constructed for the Gini index. When the number of available clusters (

H_{s}

) is small in a few strata, the sequence of sampling distributions of the empirical Gini indices may not reach the limiting normal distribution. When limiting normality cannot be reached, our proposed procedures should not be applied. If one of our proposed procedures is applied without enough clusters in a few strata, one may not achieve the desired confidence interval for the population Gini index. This scenario was encountered in the application section of this work, for both the purely sequential and two-stage procedures, when there were not enough available clusters in the urban sectors, which resulted in confidence intervals that were wider than desired.

Lastly, a very important question raised by Bhattacharya (2005) concerns developing a survey design that takes economic factors into account. Both of our proposed procedures can be extended to include cost factors, whereby optimization is done at several levels to construct a narrow confidence interval or confidence region under cost constraints. However, we do not explore that possibility in this article. A related issue is that a country usually allocates a budget to its survey agency to carry out the survey. Under such budget constraints, the funding agency is unlikely to hand out more money willingly if the stopping rule is not met with the available amount. Without question, the issue of budget constraints is important. Here, we do not discuss the estimation of cluster sizes under a fixed budget. We feel that our current work is a first step towards addressing the important issue of achieving a sufficiently narrow confidence interval or region, and that it may yield different outcomes under cost constraints. We believe our work will lead to further research on this topic.

9. Conclusions

Working within the asymptotic framework for complex survey data developed by Bhattacharya (2005, 2007), we have developed purely sequential and two-stage procedures for constructing sufficiently narrow confidence intervals for the Gini index, which is one of the most popular measures of economic inequality. Our procedures may be applied to surveys in which stratified clustered sample data are drawn from a large number of clusters per stratum, which is a reasonable assumption to make. Moreover, our procedures may also be applied to special cases of multi-stage survey designs, including cases without stratification (i.e.,

S = 1

), and those that have independent observations within clusters (intraclass correlation is zero).

Without doubt, the two-stage procedure is practically more feasible under this survey design than the purely sequential procedure. The confidence intervals of both procedures yielded coverage probabilities close to the desired confidence coefficient; however, the purely sequential procedure produces confidence intervals whose widths are always less than the desired bound

ω

. The two-stage procedure is also known to overestimate the optimal cluster size compared to the purely sequential procedure (Mukhopadhyay and De Silva 2009), and this property can be seen in the results from the simulation (in the supplementary material) and the application to the NSS data. Furthermore, the estimated optimal cluster sizes have smaller standard errors under the purely sequential procedure than under the two-stage procedure.

Supplementary Materials

The following are available online at https://www.mdpi.com/2225-1146/8/2/26/s1.

Author Contributions

Conceptualization, B.C.; methodology, F.B.D. and B.C.; software, F.B.D.; validation, F.B.D., F.K. and B.C.; formal analysis, F.B.D., F.K. and B.C.; investigation, F.B.D. and B.C.; writing–original draft preparation, F.B.D., F.K. and B.C.; writing–review and editing, F.B.D., F.K. and B.C.; visualization, F.B.D., F.K. and B.C.; supervision, B.C.; project administration, B.C.; funding acquisition, B.C. All authors have read and agreed to the published version of the manuscript.

Funding

The research work of Bhargab Chattopadhyay was part of the project sanctioned by Science and Engineering Research Board, Government of India (ECR/2017/001213).

Acknowledgments

This author is also grateful to the Ministry of Statistics and Program Implementation, Government of India, for permitting the use of the household data related to the consumer expenditure for the year 2007–2008 (Round 64, Schedule 1.0).

Conflicts of Interest

The authors declare no conflict of interest. The funding agency had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

Aguirregabiria, Victor, and Pedro Mira. 2007. Sequential estimation of dynamic discrete games. Econometrica 75: 1–53.
Arcidiacono, Peter, and John Bailey Jones. 2003. Finite mixture distributions, sequential likelihood and the EM algorithm. Econometrica 71: 933–46.
Bhattacharya, Debopam. 2005. Asymptotic inference from multi-stage samples. Journal of Econometrics 126: 145–71.
Bhattacharya, Debopam. 2007. Inference on inequality from household survey data. Journal of Econometrics 137: 674–707.
Binder, David A., and Milorad S. Kovacevic. 1995. Estimating some measures of income inequality from survey data: An application of the estimating equations approach. Survey Methodology 21: 137–46.
Chattopadhyay, Bhargab, and Shyamal Krishna De. 2016. Estimation of Gini index within pre-specified error bound. Econometrics 4: 30.
Chattopadhyay, Bhargab, and Ken Kelley. 2017. Estimating the standardized mean difference with minimum risk: Maximizing accuracy and minimizing cost with sequential estimation. Psychological Methods 22: 94–113.
Chattopadhyay, Bhargab, and Nitis Mukhopadhyay. 2013. Two-stage fixed-width confidence intervals for a normal mean in the presence of suspect outliers. Sequential Analysis 32: 134–57.
Cochran, William G. 1997. Sampling Techniques, 3rd ed. Hoboken: John Wiley & Sons.
De, Shyamal K., and Bhargab Chattopadhyay. 2017. Minimum risk point estimation of Gini index. Sankhya B 79: 247–277.
Fuller, Wayne A. 2009. Sampling Statistics. Hoboken: Wiley.
Ghosh, Bhaskar Kumar, and Pranab Kumar Sen. 1991. Handbook of Sequential Analysis. New York: CRC Press, vol. 118.
Ghosh, Malay, Nitis Mukhopadhyay, and Pranab K. Sen. 1997. Sequential Estimation. New York: John Wiley & Sons, Inc.
Greene, William H. 1998. Gender economics courses in liberal arts colleges: Further results. The Journal of Economic Education 29: 291–300.
Gut, Allan. 2009. Stopped Random Walks. New York: Springer.
Hoque, Ahmed Anisul, and Judith Anne Clarke. 2015. On variance estimation for a Gini coefficient estimator obtained from complex survey data. Communications in Statistics: Case Studies, Data Analysis and Applications 1: 39–58.
Horvitz, D. G., and D. J. Thompson. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47: 663–85.
Kanninen, Barbara J. 1993. Design of sequential experiments for contingent valuation studies. Journal of Environmental Economics and Management 25: S1–S11.
Kelley, Ken, Francis Bilson Darku, and Bhargab Chattopadhyay. 2018. Accuracy in parameter estimation for a general class of effect sizes: A sequential approach. Psychological Methods 23: 226–43.
Kelley, Ken, Francis Bilson Darku, and Bhargab Chattopadhyay. 2019. Sequential accuracy in parameter estimation for population correlation coefficients. Psychological Methods 24: 492–515.
Langel, Matti, and Yves Tillé. 2013. Variance estimation of the Gini index: Revisiting a result several times published. Journal of the Royal Statistical Society: Series A (Statistics in Society) 176: 521–40.
Lee, Eun, and Ronald Forthofer. 2006. Analyzing Complex Survey Data. New York: SAGE Publications, Inc.
Mahalanobis, Prasanta Chandra. 1940. A sample survey of the acreage under jute in Bengal. Sankhyā: The Indian Journal of Statistics 4: 511–530.
Mukhopadhyay, Nitis, and Basil M. De Silva. 2009. Sequential Methods and Their Applications. Boca Raton: CRC Press.
National Sample Survey Office. 2007. Note on Estimation Procedure of NSS 64th Round. Available online: http://catalog.ihsn.org/index.php/catalog/1906/download/35538 (accessed on 21 July 2019).
National Sample Survey Office. 2015. India—Household Consumer Expenditure Survey: 64th Round, Schedule 1.0, July 2007–June 2008. Available online: http://www.icssrdataservice.in/datarepository/index.php/catalog/4/study-description (accessed on 21 July 2019).
Organization for Economic Cooperation and Development. 2017. Income Inequality. Available online: https://data.oecd.org/inequality/income-inequality.htm (accessed on 21 July 2019).
Peng, Liang. 2011. Empirical likelihood methods for the Gini index. Australian & New Zealand Journal of Statistics 53: 131–39.
RStudio Team. 2018. RStudio: Integrated Development Environment for R. Boston: RStudio, Inc.
Sen, Pranab Kumar. 1988. Functional jackknifing: Rationality and general asymptotics. The Annals of Statistics 16: 450–69.
Stein, Charles. 1945. A two-sample test for a linear hypothesis whose power is independent of the variance. The Annals of Mathematical Statistics 16: 243–58.
Stein, Ch. 1949. Some problems in sequential estimation. Econometrica 17: 77–78.
Wells, J. 1998. Applications: Oversampling through households or other clusters: Comparisons of methods for weighting the oversampled elements. Australian & New Zealand Journal of Statistics 40: 269–78.
Zitikis, Ričardas, and Joseph L. Gastwirth. 2002. The asymptotic distribution of the S-Gini index. Australian & New Zealand Journal of Statistics 44: 439–46.

1

The survey excluded “(i) Leh (Ladakh) and Kargil districts of Jammu and Kashmir (for central sample), (ii) interior villages of Nagaland situated beyond 5 km of the bus route and (ii) villages of Andaman and Nicobar Islands which remain inaccessible throughout the year.” (National Sample Survey Office 2007).

Table 1. Application results for PSP on NSS 64th round data for

α = 0.1, ω = 0.02

.

Region	${\hat{G}}_{H}$	H	$\hat{C}$	N	${\hat{G}}_{N}$	Lower CI	Upper CI	$w_{N}$	$Pr (N_{s} < {\hat{C}}_{s})$
	$se ({\hat{G}}_{H})$			$(t)$	$se ({\hat{G}}_{N})$
Uttar Pradesh
All	0.2163	1262	622	672	0.2116	0.2023	0.2209	0.0186	0.2138
	(0.0042)			(321)	(0.0057)
Rural	0.1997	903	505	523	0.2024	0.1931	0.2117	0.0186	0.4
	(0.0041)			(198)	(0.0057)
Urban	0.2229	359	903	359	0.2229	0.2077	0.2381	0.0304	1.0
	(0.0092)			(180)	(0.0092)
West Bengal
All	0.2320	878	587	593	0.2334	0.2239	0.2430	0.0191	0.1282
	(0.0051)			(190)	(0.0058)
Rural	0.1812	551	450	450	0.1816	0.1723	0.1909	0.0186	0.2353
	(0.0048)			(172)	(0.0057)
Urban	0.2609	327	612	327	0.2609	0.2482	0.2736	0.0254	1.0
	(0.0077)			(185)	(0.0077)

Table 2. Application results for PSP on NSS 64th round data for

α = 0.05, ω = 0.02

.

Region	${\hat{G}}_{H}$	H	$\hat{C}$	N	${\hat{G}}_{N}$	Lower CI	Upper CI	$w_{N}$	$Pr (N_{s} < {\hat{C}}_{s})$
	$se ({\hat{G}}_{H})$			$(t)$	$se ({\hat{G}}_{N})$
Uttar Pradesh
All	0.2163	1262	834	878	0.2117	0.2022	0.2212	0.0190	0.2138
	(0.0042)			(333)	(0.0048)
Rural	0.1997	903	643	667	0.2024	0.1930	0.2117	0.0187	0.4
	(0.0041)			(226)	(0.0048)
Urban	0.2229	359	1282	359	0.2229	0.2048	0.2410	0.0362	1.0
	(0.0092)			(254)	(0.0092)
West Bengal
All	0.2320	878	906	878	0.2320	0.2221	0.2419	0.0198	1.0
	(0.0051)			(223)	(0.0051)
Rural	0.1812	551	552	551	0.1812	0.1719	0.1906	0.0187	1.0
	(0.0048)			(203)	(0.0048)
Urban	0.2609	327	869	327	0.2609	0.2458	0.2761	0.0303	1.0
	(0.0077)			(207)	(0.0077)

Table 3. Application results for PSP on NSS 64th round data for

α = 0.1, ω = 0.025

.

Region	${\hat{G}}_{H}$	H	$\hat{C}$	N	${\hat{G}}_{N}$	Lower CI	Upper CI	$w_{N}$	$Pr (N_{s} < {\hat{C}}_{s})$
	$se ({\hat{G}}_{H})$			$(t)$	$se ({\hat{G}}_{N})$
Uttar Pradesh
All	0.2163	1262	401	540	0.2138	0.2035	0.2242	0.0207	0.0
	(0.0042)			(302)	(0.0063)
Rural	0.1997	903	386	400	0.2014	0.1899	0.2130	0.0231	0.1714
	(0.0041)			(168)	(0.0070)
Urban	0.2229	359	578	359	0.2229	0.2077	0.2381	0.0304	1.0
	(0.0092)			(168)	(0.0092)
West Bengal
All	0.2320	878	324	319	0.2288	0.2175	0.2401	0.0226	0.1795
	(0.0051)			(158)	(0.0069)
Rural	0.1812	551	276	289	0.1829	0.1721	0.1937	0.0216	0.2353
	(0.0048)			(138)	(0.0066)
Urban	0.2609	327	392	327	0.2609	0.2482	0.2736	0.0254	1.0
	(0.0077)			(142)	(0.0077)

Table 4. Application results for PSP on NSS 64th round data for α = 0.05, ω = 0.025.

Region	${\hat{G}}_{H}$	H	$\hat{C}$	N	${\hat{G}}_{N}$	Lower CI	Upper CI	$w_{N}$	$Pr (N_{s} < {\hat{C}}_{s})$
	$se ({\hat{G}}_{H})$			$(t)$	$se ({\hat{G}}_{N})$
Uttar Pradesh
All	0.2163	1262	572	653	0.2123	0.2010	0.2236	0.0226	0.2138
	(0.0042)			(728)	(0.0058)
Rural	0.1997	903	496	510	0.2010	0.1893	0.2128	0.0234	0.1714
	(0.0041)			(197)	(0.0060)
Urban	0.2229	359	821	359	0.2229	0.2048	0.2410	0.0362	1.0
	(0.0092)			(177)	(0.0092)
West Bengal
All	0.2320	878	517	519	0.2318	0.2199	0.2437	0.0238	0.1538
	(0.0051)			(186)	(0.0061)
Rural	0.1812	551	351	352	0.1815	0.1703	0.1927	0.0223	0.2353
	(0.0048)			(163)	(0.0057)
Urban	0.2609	327	556	327	0.2609	0.2458	0.2761	0.0303	1.0
	(0.0077)			(162)	(0.0077)

Table 5. Application results for the two-stage procedure on NSS 64th round data for α = 0.1 and ω = 0.02.

Region	H	$Q^{*}$	$\tilde{Q}$	${\hat{G}}_{H}$	${\hat{G}}_{\tilde{Q}}$	Lower CI	Upper CI	$w_{\tilde{Q}}$
		$(t)$	$(Q)$	$(se ({\hat{G}}_{H}))$	$(se ({\hat{G}}_{\tilde{Q}}))$
Uttar Pradesh
All	1262	1146	1171	0.2163	0.2137	0.2072	0.2202	0.0131
		(321)	(1146)	(0.0042)	(0.0040)
Rural	903	398	406	0.1997	0.2027	0.1940	0.2114	0.0174
		(198)	(398)	(0.0041)	(0.0053)
Urban	359	1177	359	0.2229	0.2229	0.2077	0.2381	0.0304
		(180)	(359)	(0.0092)	(0.0092)
West Bengal
All	878	624	626	0.2320	0.2307	0.2216	0.2398	0.0182
		(190)	(624)	(0.0051)	(0.0055)
Rural	551	422	420	0.1812	0.1785	0.1707	0.1862	0.0155
		(173)	(422)	(0.0048)	(0.0047)
Urban	327	857	327	0.2609	0.2609	0.2482	0.2736	0.0254
		(185)	(327)	(0.0077)	(0.0077)

Table 6. Application results for the two-stage procedure on NSS 64th round data for α = 0.05 and ω = 0.02.

Region	H	$Q^{*}$	$\tilde{Q}$	${\hat{G}}_{H}$	${\hat{G}}_{\tilde{Q}}$	Lower CI	Upper CI	$w_{\tilde{Q}}$
		$(t)$	$(Q)$	$(se ({\hat{G}}_{H}))$	$(se ({\hat{G}}_{\tilde{Q}}))$
Uttar Pradesh
All	1262	1665	1262	0.2163	0.2163	0.2081	0.2245	0.0164
		(333)	(1262)	(0.0042)	(0.0042)
Rural	903	593	595	0.1997	0.2000	0.1914	0.2085	0.0171
		(226)	(593)	(0.0041)	(0.0044)
Urban	359	1712	359	0.2229	0.2229	0.2048	0.2410	0.0362
		(254)	(359)	(0.0092)	(0.0092)
West Bengal
All	878	874	878	0.2320	0.2320	0.2221	0.2419	0.0198
		(223)	(874)	(0.0051)	(0.0051)
Rural	551	535	534	0.1812	0.1814	0.1719	0.1910	0.0191
		(203)	(535)	(0.0048)	(0.0049)
Urban	327	1110	327	0.2609	0.2609	0.2458	0.2761	0.0303
		(207)	(327)	(0.0077)	(0.0077)

Table 7. Application results for the two-stage procedure on NSS 64th round data for α = 0.1 and ω = 0.025.

Region	H	$Q^{*}$	$\tilde{Q}$	${\hat{G}}_{H}$	${\hat{G}}_{\tilde{Q}}$	Lower CI	Upper CI	$w_{\tilde{Q}}$
		$(t)$	$(Q)$	$(se ({\hat{G}}_{H}))$	$(se ({\hat{G}}_{\tilde{Q}}))$
Uttar Pradesh
All	1262	688	680	0.2163	0.2104	0.2023	0.2185	0.0162
		(302)	(688)	(0.0042)	(0.0049)
Rural	903	299	308	0.1997	0.2026	0.1927	0.2126	0.0199
		(168)	(299)	(0.0041)	(0.0061)
Urban	359	1087	359	0.2229	0.2229	0.2077	0.2381	0.0304
		(168)	(359)	(0.0092)	(0.0092)
West Bengal
All	878	396	396	0.2320	0.2293	0.2171	0.2414	0.0243
		(158)	(396)	(0.0051)	(0.0074)
Rural	551	275	275	0.1812	0.1750	0.1660	0.1840	0.0180
		(138)	(275)	(0.0048)	(0.0055)
Urban	327	582	327	0.2609	0.2609	0.2482	0.2736	0.0254
		(142)	(327)	(0.0077)	(0.0077)

Table 8. Application results for the two-stage procedure on NSS 64th round data for α = 0.05 and ω = 0.025.

Region	H	$Q^{*}$	$\tilde{Q}$	${\hat{G}}_{H}$	${\hat{G}}_{\tilde{Q}}$	Lower CI	Upper CI	$w_{\tilde{Q}}$
		$(t)$	$(Q)$	$(se ({\hat{G}}_{H}))$	$(se ({\hat{G}}_{\tilde{Q}}))$
Uttar Pradesh
All	1262	976	947	0.2163	0.2124	0.2041	0.2207	0.0166
		(302)	(946)	(0.0042)	(0.0042)
Rural	903	364	353	0.1997	0.2032	0.1922	0.2142	0.0220
		(197)	(364)	(0.0041)	(0.0056)
Urban	359	1081	359	0.2229	0.2229	0.2048	0.2410	0.0362
		(177)	(359)	(0.0092)	(0.0092)
West Bengal
All	878	607	608	0.2320	0.2315	0.2204	0.2427	0.0224
		(186)	(607)	(0.0051)	(0.0057)
Rural	551	391	392	0.1812	0.1759	0.1670	0.1849	0.0178
		(163)	(391)	(0.0048)	(0.0045)
Urban	327	754	327	0.2609	0.2609	0.2458	0.2761	0.0303
		(162)	(327)	(0.0077)	(0.0077)

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Bilson Darku, F.; Konietschke, F.; Chattopadhyay, B. Gini Index Estimation within Pre-Specified Error Bound: Application to Indian Household Survey Data. Econometrics 2020, 8, 26. https://doi.org/10.3390/econometrics8020026

