Estimation of Gini Index within Pre-Specified Error Bound

Chattopadhyay, Bhargab; De, Shyamal Krishna

doi:10.3390/econometrics4030030


by Bhargab Chattopadhyay ¹,* and Shyamal Krishna De ²

¹ Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX 75080, USA

² School of Mathematical Sciences, National Institute of Science Education and Research, Jatni 752050, Odisha, India

* Author to whom correspondence should be addressed.

Econometrics 2016, 4(3), 30; https://doi.org/10.3390/econometrics4030030

Submission received: 19 December 2015 / Revised: 25 April 2016 / Accepted: 3 June 2016 / Published: 24 June 2016


Abstract:

Gini index is a widely used measure of economic inequality. This article develops a theory and methodology for constructing a confidence interval for Gini index with a specified confidence coefficient and a specified width without assuming any specific distribution of the data. Fixed sample size methods cannot simultaneously achieve both a specified confidence coefficient and a fixed width. We develop a purely sequential procedure for interval estimation of Gini index with a specified confidence coefficient and a specified margin of error. Optimality properties of the proposed method, namely first-order asymptotic efficiency and asymptotic consistency, are proved under mild moment assumptions on the distribution of the data.

Keywords: distribution-free method; fixed width confidence interval; Gini index; sample size planning; U-statistics

JEL: C130; C140; C400; C440

1. Introduction

Economic inequality arises due to inequality in the distribution of income and assets among individuals or groups within a society or region or even between countries. Economic inequality is usually measured to evaluate the effects of economic policies at the micro or macro level. Several inequality indexes that measure the economic inequality are proposed in the economics literature. Among those indexes, Gini inequality index is the most widely used measure. The most celebrated Gini index, as given in [1], is

G_{F} (X) = \frac{Δ}{2 μ}, \quad \text{where } Δ = E |X_{1} - X_{2}|, \; μ = E (X),

(1)

and $X_{1}$ and $X_{2}$ are two i.i.d. copies of a nonnegative random variable X with distribution function F. The Gini index compares every individual’s income with every other individual’s income. If there are n randomly selected individuals with incomes given by $X_{1}, \dots, X_{n}$, then the estimator of the celebrated Gini index is

\begin{matrix} {\hat{G}}_{n} = \frac{{\hat{Δ}}_{n}}{2 {\bar{X}}_{n}}, \end{matrix}

(2)

where ${\bar{X}}_{n}$ is the sample mean and ${\hat{Δ}}_{n}$ is the sample Gini’s mean difference, defined as

\begin{matrix} {\bar{X}}_{n} = \frac{1}{n} \sum_{i = 1}^{n} X_{i} \quad \text{and} \quad {\hat{Δ}}_{n} = {\binom{n}{2}}^{- 1} \sum_{1 \leq i_{1} < i_{2} \leq n} |X_{i_{1}} - X_{i_{2}}| . \end{matrix}

(3)

The estimator in (2) is undefined if ${\bar{X}}_{n} = 0$. We ignore this special case.
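The estimator in (2) and (3) can be computed directly from its definition. The following is a minimal sketch, not the authors' code; the function names and the toy income vector are illustrative assumptions.

```python
from itertools import combinations


def gini_mean_difference(x):
    """Sample Gini's mean difference (3): average |x_i - x_j| over all pairs i < j."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(abs(a - b) for a, b in combinations(x, 2))
    return total / (n * (n - 1) / 2)


def sample_gini(x):
    """Sample Gini index (2): mean difference divided by twice the sample mean."""
    mean = sum(x) / len(x)
    if mean == 0:
        raise ValueError("the estimator is undefined when the sample mean is zero")
    return gini_mean_difference(x) / (2 * mean)


incomes = [10.0, 20.0, 30.0, 40.0]  # toy data, not taken from the paper
print(sample_gini(incomes))
```

The pairwise sum costs O(n²) operations, which is acceptable for the sample sizes considered in this article.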

Inference for inequality measures, including Gini index, has been an area of research interest among many economists in recent years. For the existing literature on inference problems related to inequality index, we refer to [2,3,4,5,6,7]. Even though there exist innovative methods for constructing confidence intervals for $G_{F}$ (e.g., see [7]), due to large standard errors of estimated Gini index as mentioned in [8], we may not get a short confidence interval for Gini index. We know that the confidence interval varies from sample to sample and so does its width. Wider confidence intervals provide less precise information about the true value of the parameter of interest. Since it is desirable to construct shorter confidence intervals, we rather fix the length of the confidence interval, or in other words, the margin of error while achieving the confidence coefficient $(1 - α)$ for some specified α in $(0, 1)$. This problem is known as the fixed-width confidence interval estimation problem.

No fixed sample size procedure can provide a solution to the fixed-width confidence interval estimation problem (e.g., see [9]). Thus, one must resort to sampling in stages to construct a $100 (1 - α) \%$ confidence interval for $G_{F}$ with a pre-specified width. This problem falls in the domain of sequential analysis. For details about the general theory of fixed-width confidence interval estimation, we refer to [10,11]. Sequential analysis is concerned with studies where sample sizes are not fixed in advance. Instead, the sequential estimation procedure depends on collecting observations until an a-priori specified criterion or stopping rule is satisfied.

We know that Gini’s mean difference is a U-statistic with a symmetric kernel of degree 2 and the sample mean is a U-statistic with a symmetric kernel of degree 1 (e.g., see [12]). Under a distribution-free scenario, [7] used the central limit theorem for U-statistics to construct a confidence interval for Gini index. However, this approach cannot be used to obtain a fixed-width confidence interval for Gini index. In this article, we solve the problem of obtaining a fixed-width confidence interval for Gini index using a purely sequential procedure with a stopping rule based on several U-statistics. Apart from being unbiased estimators, U-statistics are also reverse martingales with respect to some non-increasing filtration as proven in [13]. For more literature on reverse martingales, we refer to classical textbooks on probability theory and stochastic processes such as [14,15]. We exploit the reverse martingale property of U-statistics to derive attractive asymptotic properties of our proposed estimation procedure.

In the next section, we formally state the fixed-width confidence interval estimation problem and explain why a fixed-sample-size procedure cannot be used. In Section 3, a purely sequential procedure is proposed to construct a $100 (1 - α) \%$ fixed-width confidence interval for the unknown population Gini index, and the implementation and characteristics of the sequential procedure are discussed as well. Section 4 presents a simulation study that validates the theoretical results related to our procedure. We conclude this article with some remarks in Section 5.

2. Problem Statement and Optimal Sample Size

Consider n randomly selected individuals from some population of interest with incomes denoted by $X_{1}, X_{2}, \dots, X_{n}$. Suppose these are nonnegative independent and identically distributed random variables drawn from an unknown distribution function F whose support is $(0, \infty)$. A strongly consistent estimator of the population Gini index $G_{F}$ is ${\hat{G}}_{n}$ given in (2). For pre-specified α in $(0, 1)$, the goal of this paper is to develop the theory for constructing a $100 (1 - α) \%$ fixed-width confidence interval for $G_{F}$. Formally, we would like to construct a confidence interval $J_{n} = ({\hat{G}}_{n} - d, {\hat{G}}_{n} + d)$ such that

\begin{matrix} P ({\hat{G}}_{n} - d < G_{F} < {\hat{G}}_{n} + d) \geq 1 - α, \end{matrix}

(4)

for some prefixed margin of error $d > 0$. Using [7], we have

\begin{matrix} \sqrt{n} ({\hat{G}}_{n} - G_{F}) \to N (0, ξ^{2}) \quad \text{as } n \to \infty, \end{matrix}

(5)

where $ξ^{2}$ is the asymptotic variance given by

\begin{matrix} ξ^{2} = \frac{Δ^{2}}{4 μ^{4}} σ^{2} - \frac{Δ τ}{μ^{3}} + \frac{Δ^{2}}{μ^{2}} + \frac{σ_{1}^{2}}{μ^{2}} . \end{matrix}

(6)

Here, $σ^{2} = V (X)$ is the population variance, and

\begin{matrix} τ = E (X_{1} |X_{1} - X_{2}|) \quad \text{and} \quad σ_{1}^{2} = V [E (|X_{1} - X_{2}| \mid X_{1})] . \end{matrix}

Schroder and Yitzhaki [16] proposed a way to determine a reasonable sample size for the distribution of ${\hat{G}}_{n}$ to be close to normal. In this paper, we verify via a simulation study that for moderate sample sizes (see Section 4) the distribution of the sample Gini index is approximately normal. Based on the asymptotic normality of ${\hat{G}}_{n}$, we observe that the coverage probability is

P ({\hat{G}}_{n} - d < G_{F} < {\hat{G}}_{n} + d) \approx 2 Φ (\frac{d \sqrt{n}}{ξ}) - 1,

where Φ is the distribution function of a standard normal random variable. In order to have a $100 (1 - α) \%$ confidence interval, the sample size n must satisfy

\begin{matrix} 2 Φ (\frac{d \sqrt{n}}{ξ}) - 1 \geq 1 - α . \end{matrix}

(7)

Solving (7) for n, we obtain $n \geq d^{- 2} z_{α / 2}^{2} ξ^{2}$, where $z_{α / 2}$ is the upper $(α / 2)$th quantile of the standard normal distribution. Provided ξ is known, the optimal (minimal) sample size required to construct a fixed-width confidence interval for Gini index with approximately $(1 - α)$ coverage probability is

\begin{matrix} C = ⌈ d^{- 2} z_{α / 2}^{2} ξ^{2} ⌉, \end{matrix}

(8)

where $⌈ w ⌉$ denotes the smallest integer greater than or equal to w.
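To make (7) and (8) concrete, the sketch below evaluates the optimal sample size and the normal-approximation coverage for a known ξ². The function names are illustrative assumptions; the inputs ξ² = 0.0468, d = 0.02 and α = 0.1 are taken from the gamma setting reported in Section 4.

```python
from math import ceil, sqrt
from statistics import NormalDist


def optimal_sample_size(xi_sq, d, alpha):
    """Optimal fixed sample size C in (8) when the asymptotic variance xi_sq is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper (alpha/2) quantile z_{alpha/2}
    return ceil((z / d) ** 2 * xi_sq)


def approx_coverage(n, xi_sq, d):
    """Normal approximation 2*Phi(d*sqrt(n)/xi) - 1 to the coverage probability."""
    return 2 * NormalDist().cdf(d * sqrt(n) / sqrt(xi_sq)) - 1


C = optimal_sample_size(xi_sq=0.0468, d=0.02, alpha=0.10)
print(C, approx_coverage(C, 0.0468, 0.02), approx_coverage(C - 1, 0.0468, 0.02))
```

By construction, the approximate coverage reaches 1 - α at n = C and falls just short of it at n = C - 1, which is the sense in which C is minimal.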

The optimal fixed sample size C is unknown since the true value of ξ is unknown in practice. If C were known, one would just draw C observations independently from the population of interest and compute $({\hat{G}}_{C} - d, {\hat{G}}_{C} + d)$, which would satisfy (4) approximately. Since C is unknown, one must draw samples in at least two stages in order to achieve the desired coverage probability at least approximately. In the first stage, one must estimate C by estimating ξ, and then in the subsequent stages one should collect samples until the current sample size is greater than or equal to the estimated optimal sample size. In this article, we propose a purely sequential sampling procedure to estimate the optimal sample size C and ensure that the fixed-width confidence interval based on the final sample size attains the desired $(1 - α)$ coverage probability at least asymptotically.

3. The Sequential Estimation Procedure

In sequential estimation procedures, the parameter estimates are updated as the data is observed. In the first step, a small sample, called the pilot sample, is observed to gather preliminary information about the parameter of interest. Then, in each successive step, one or more additional observations are collected and the estimates of the parameters are updated. After each and every step a decision is taken whether to continue or to terminate the sampling process. This decision is based on a pre-defined stopping rule.

From (8) we note that the optimal sample size needed to find a fixed-width confidence interval depends on the unknown parameter $ξ^{2}$. So, let us first find a good estimator of $ξ^{2}$. Following [7,17], we consider the following strongly consistent estimator of $ξ^{2}$ based on U-statistics. Let us define a U-statistic, for each $j = 1, 2, \dots, n$,

\begin{matrix} {\hat{Δ}}_{n}^{(j)} = {\binom{n - 1}{2}}^{- 1} \sum_{T_{j}} |X_{i_{1}} - X_{i_{2}}|, \end{matrix}

(9)

where $T_{j} = \{(i_{1}, i_{2}) : 1 \leq i_{1} < i_{2} \leq n \text{ and } i_{1}, i_{2} \neq j\}$. Define $W_{j n} = n {\hat{Δ}}_{n} - (n - 2) {\hat{Δ}}_{n}^{(j)}$ for $j = 1, \dots, n$, and ${\bar{W}}_{n} = n^{- 1} \sum_{j = 1}^{n} W_{j n}$. According to [17], a strongly consistent estimator of $4 σ_{1}^{2}$ is

s_{w n}^{2} = {(n - 1)}^{- 1} \sum_{j = 1}^{n} {(W_{j n} - {\bar{W}}_{n})}^{2} .

Using [7],

\begin{matrix} {\hat{τ}}_{n} = \frac{2}{n (n - 1)} \sum_{1 \leq i_{1} < i_{2} \leq n} \frac{1}{2} (X_{i_{1}} + X_{i_{2}}) |X_{i_{1}} - X_{i_{2}}| \end{matrix}

(10)

is an estimator of τ. Let $S_{n}^{2}$ be the sample variance. Thus, the estimator of $ξ^{2}$ is

V_{n}^{2} = \max (0, \frac{{\hat{Δ}}_{n}^{2} S_{n}^{2}}{4 {\bar{X}}_{n}^{4}} - \frac{{\hat{Δ}}_{n}}{{\bar{X}}_{n}^{3}} {\hat{τ}}_{n} + \frac{{\hat{Δ}}_{n}^{2}}{{\bar{X}}_{n}^{2}} + \frac{s_{w n}^{2}}{4 {\bar{X}}_{n}^{2}})

(11)

similar to [7]. Note that ${\hat{τ}}_{n}, S_{n}^{2}, {\hat{Δ}}_{n}$ are U-statistics of degree 2 (e.g., [7,18]) and the sample mean ${\bar{X}}_{n}$ is a U-statistic of degree 1. Using the continuous mapping theorem [17] and Theorem 3.2.1 in [11], we observe that $V_{n}^{2}$ is a strongly consistent estimator of $ξ^{2}$, which will be used in our proposed sequential procedure to estimate C.
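The estimator in (11) combines several pairwise statistics, so a direct transcription may help in reading the definitions. The sketch below follows (9) to (11) literally, including the leave-one-out statistic in (9); it is an illustration under the assumption that the sample is a Python list of positive numbers, and all function names are hypothetical.

```python
from itertools import combinations


def mean_abs_diff(x):
    """Gini's mean difference (3): average |x_i - x_j| over all pairs i < j."""
    n = len(x)
    return sum(abs(a - b) for a, b in combinations(x, 2)) / (n * (n - 1) / 2)


def pseudo_values(x):
    """W_jn = n*Delta_n - (n-2)*Delta_n^(j), with Delta_n^(j) computed without x_j as in (9)."""
    n = len(x)
    delta = mean_abs_diff(x)
    return [n * delta - (n - 2) * mean_abs_diff(x[:j] + x[j + 1:]) for j in range(n)]


def variance_estimate(x):
    """V_n^2 in (11): plug-in estimate of the asymptotic variance xi^2."""
    n = len(x)
    if n < 4:
        raise ValueError("use at least four observations")
    mean = sum(x) / n
    s2 = sum((v - mean) ** 2 for v in x) / (n - 1)          # sample variance S_n^2
    delta = mean_abs_diff(x)
    tau = sum(0.5 * (a + b) * abs(a - b)                    # tau_hat in (10)
              for a, b in combinations(x, 2)) / (n * (n - 1) / 2)
    w = pseudo_values(x)
    w_bar = sum(w) / n
    s2_w = sum((wj - w_bar) ** 2 for wj in w) / (n - 1)     # estimates 4*sigma_1^2
    v2 = (delta ** 2 * s2 / (4 * mean ** 4) - delta * tau / mean ** 3
          + delta ** 2 / mean ** 2 + s2_w / (4 * mean ** 2))
    return max(0.0, v2)


print(variance_estimate([1.0, 2.0, 4.0, 7.0, 11.0]))  # toy data, not from the paper
```

The literal leave-one-out computation costs O(n³). Counting the pairs that contain observation j shows that $W_{j n}$ equals $2 (n - 1)^{- 1} \sum_{i \neq j} |X_{i} - X_{j}|$, which brings the cost down to O(n²); this rearrangement is not stated in the text and is offered here only as an implementation shortcut.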

Several plug-in estimators of the asymptotic variance parameter of Gini index are proposed in the economics and statistics literature. To find the details about several plug-in estimators of the asymptotic variance of Gini index under different sampling schemes, we refer to [4,19]. The proposed plug-in estimator in [4] is simpler than $V_{n}^{2}$. However, it is not known whether the estimator enjoys the almost sure convergence property which is very important for us as we need this property to prove the asymptotic optimality properties of the proposed sequential procedure. Moreover, with the high-end computing facilities available these days, $V_{n}^{2}$ can be computed in seconds.

Using $V_{n}^{2}$ as the estimator of $ξ^{2}$, we define the stopping rule $N_{d}$, for every $d > 0$, as

N_{d} = \text{the smallest integer } n (\geq m) \text{ such that } n \geq {(\frac{z_{α / 2}}{d})}^{2} (V_{n}^{2} + n^{- 1}) .

(12)

Here, m is called the initial or pilot sample size, and the term $n^{- 1}$ is known as a correction term. Note that $V_{n}$ can be very close to zero with positive probability. Without the correction term, the inequality in (12) may be satisfied for very small n, terminating the sampling process too early. Thus the correction term $n^{- 1}$ ensures that the sampling process for estimating the optimal sample size does not stop too early. For details about the correction term, we refer to [11].

From (12), we note that $N_{d} \geq {(\frac{z_{α / 2}}{d})}^{2} N_{d}^{- 1}$, i.e., the final sample size must be at least $z_{α / 2} / d$. Therefore, we consider the pilot sample size to be $m = \max \{4, ⌈ z_{α / 2} / d ⌉\}$. This technique of choosing the pilot sample size can also be found in [10].

Recall that the optimal sample size required to achieve a $100 (1 - α) \%$ confidence interval for Gini index is C, which is unknown in practice. The stopping variable $N_{d}$ defined in (12) serves as an estimator of C. Below, we develop a purely sequential procedure to estimate the optimal sample size C.

Implementation and Characteristics

We propose the following purely sequential procedure to estimate the optimal sample size C.

Stage 1: Compute the pilot sample size $m = \max \{4, ⌈ z_{α / 2} / d ⌉\}$ and draw a random sample of size m from the population of interest. Based on this pilot sample of size m, obtain an estimate of $ξ^{2}$ by finding $V_{m}^{2}$ as given in (11) and check whether $m \geq {(z_{α / 2} / d)}^{2} (V_{m}^{2} + m^{- 1})$. If $m < {(z_{α / 2} / d)}^{2} (V_{m}^{2} + m^{- 1})$ then go to the next step. Otherwise, set the final sample size $N_{d} = m$.

Stage 2: Draw an additional observation independent of the pilot sample and update the estimate of $ξ^{2}$ by computing $V_{m + 1}^{2}$. Check if $m + 1 \geq {(z_{α / 2} / d)}^{2} (V_{m + 1}^{2} + {(m + 1)}^{- 1})$. If $m + 1 < {(z_{α / 2} / d)}^{2} (V_{m + 1}^{2} + {(m + 1)}^{- 1})$ then go to the next step. Otherwise, stop sampling and report the final sample size as $N_{d} = m + 1$.

The process of collecting observations one by one is continued until there are $N_{d}$ observations such that $N_{d} \geq {(z_{α / 2} / d)}^{2} (V_{N_{d}}^{2} + {N_{d}}^{- 1})$. At this stage, we stop sampling and report the final sample size as $N_{d}$.

Based on the above algorithm, the sampling process will stop at some stage. This is proved in Lemma A1, which states that if observations are collected using (12), then under appropriate conditions $P (N_{d} < \infty) = 1$. This is a very important property of any sequential procedure since it mathematically ensures that the sampling will be terminated eventually.
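The stages above can be sketched as a loop that draws one observation at a time and re-checks the stopping rule (12). This is a minimal illustration, not the authors' implementation: the names `draw`, `max_n` and the seed are assumptions, the gamma sampler merely mimics one simulation setting of Section 4 without truncation, and $W_{j n}$ is computed through the rearrangement $W_{j n} = 2 (n - 1)^{- 1} \sum_{i \neq j} |X_{i} - X_{j}|$, which follows from (3) and (9) by counting the pairs that contain j.

```python
import random
from math import ceil
from statistics import NormalDist


def pilot_size(alpha, d):
    """Pilot sample size m = max{4, ceil(z_{alpha/2} / d)}."""
    return max(4, ceil(NormalDist().inv_cdf(1 - alpha / 2) / d))


def gini_and_variance(x):
    """Return (G_hat_n, V_n^2) following (2) and (11) in one pass over all pairs."""
    n = len(x)
    mean = sum(x) / n
    s2 = sum((v - mean) ** 2 for v in x) / (n - 1)
    pairs = n * (n - 1) / 2
    delta = tau = 0.0
    row = [0.0] * n  # row[j] accumulates the sum over i of |x_i - x_j|
    for i in range(n):
        for j in range(i + 1, n):
            diff = abs(x[i] - x[j])
            delta += diff
            tau += 0.5 * (x[i] + x[j]) * diff
            row[i] += diff
            row[j] += diff
    delta /= pairs
    tau /= pairs
    w = [2.0 * r / (n - 1) for r in row]  # pseudo-values W_jn (rearranged form)
    w_bar = sum(w) / n
    s2_w = sum((wj - w_bar) ** 2 for wj in w) / (n - 1)
    v2 = (delta ** 2 * s2 / (4 * mean ** 4) - delta * tau / mean ** 3
          + delta ** 2 / mean ** 2 + s2_w / (4 * mean ** 2))
    return delta / (2 * mean), max(0.0, v2)


def sequential_gini_interval(draw, d, alpha, max_n=100_000):
    """Sample one observation at a time until (12) holds; return (N_d, lower, upper)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    sample = [draw() for _ in range(pilot_size(alpha, d))]  # Stage 1: pilot sample
    while True:
        n = len(sample)
        g_hat, v2 = gini_and_variance(sample)
        if n >= (z / d) ** 2 * (v2 + 1.0 / n):  # stopping rule with correction term
            return n, g_hat - d, g_hat + d
        if n >= max_n:  # safety guard added for this sketch only
            raise RuntimeError("stopping rule not met within max_n observations")
        sample.append(draw())  # Stage 2 onwards: one more observation


rng = random.Random(2016)  # arbitrary seed
n_final, lower, upper = sequential_gini_interval(
    lambda: rng.gammavariate(2.649, 1 / 0.84), d=0.05, alpha=0.10)
print(n_final, round(lower, 4), round(upper, 4))
```

Recomputing all pairwise differences at every step keeps the sketch short; an implementation meant for large samples would update the pairwise sums incrementally as each observation arrives.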

Next, we establish some desirable asymptotic properties of our proposed sequential procedure. First, we prove that the final sample size $N_{d}$ required by our sampling strategy is close to the optimal sample size C at least asymptotically. We also prove the asymptotic efficiency property of the sequential procedure, which ensures that, on average, we collect only the minimum number of samples needed to achieve a given accuracy of estimation. Second, we show that the fixed-width confidence interval $({\hat{G}}_{N_{d}} - d, {\hat{G}}_{N_{d}} + d)$ contains the true value of Gini index $G_{F}$ with probability approximately $1 - α$. We formally state these results in Theorems 1 and 2.

Theorem 1.

If the parent distribution F of X is such that $E [X^{4}]$ exists, then the stopping rule in (12) yields the following asymptotic optimality properties:

(i): $N_{d} / C \overset{a . s .}{\to} 1$ as $d ↓ 0$ .
(ii): $P ({\hat{G}}_{N_{d}} - d < G_{F} < {\hat{G}}_{N_{d}} + d) \to 1 - α$ as $d ↓ 0$ .

Theorem 2.

If the parent distribution F of X has support $(t, \infty)$ with $t > 0$ and $E [X^{4}]$ exists, then the stopping rule in (12) yields

\begin{matrix} E (N_{d} / C) \to 1 \text{ as } d ↓ 0 . \end{matrix}

(13)

Theorems 1 and 2 are proved in the appendix. Part (i) of Theorem 1 implies that the ratio of the final sample size of our procedure to the optimal sample size C converges to 1 almost surely. Part (ii) of Theorem 1 implies that the coverage probability produced by the fixed-width confidence interval $({\hat{G}}_{N_{d}} - d, {\hat{G}}_{N_{d}} + d)$ attains the desired level $1 - α$ asymptotically. Theorem 2 implies that the ratio of the average final sample size of our procedure to the optimal sample size C converges to 1 asymptotically.

4. Simulation Study

In this section, we validate the asymptotic properties of our method stated in Theorems 1 and 2 through a Monte Carlo study. To implement the sequential procedure, we fix $d (= 0.01, 0.02)$ and $α (= 0.1, 0.05)$. Using the pilot sample size formula $m = \max \{4, ⌈ z_{α / 2} / d ⌉\}$, the pilot sample size considered here is 165. Then, we implement the sequential procedure described in Subsection 3.1 and estimate the average sample size ($\bar{N}$), the maximum sample size ($\max (N)$), the standard error ($s (\bar{N})$) of $\bar{N}$, the coverage probability (p), and its standard error ($s_{p}$) based on 2000 replications by drawing random samples from a gamma (shape = 2.649, rate = 0.84) distribution truncated at $t = 0.001$ ($G_{F} = 0.3308$, $ξ^{2} = 0.0468$), a log-normal distribution (mean = 2.185, sd = 0.562) truncated at $t = 0.001$ ($G_{F} = 0.3089$, $ξ^{2} = 0.0532$), and a Pareto (20,000, 5) distribution.
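As a sanity check on the population values quoted above, the Gini index has well-known closed forms for these three families. The sketch below evaluates them under two assumptions that are not stated in the text: that truncation at t = 0.001 has a negligible effect on $G_{F}$, and that Pareto (20,000, 5) denotes a Pareto Type I distribution with scale 20,000 and shape 5. The function names are illustrative.

```python
from math import erf, exp, lgamma, pi, sqrt


def gini_gamma(shape):
    """Gini index of a gamma distribution; it depends on the shape parameter only."""
    return exp(lgamma(shape + 0.5) - lgamma(shape + 1.0)) / sqrt(pi)


def gini_lognormal(sigma):
    """Gini index of a log-normal distribution: 2*Phi(sigma/sqrt(2)) - 1 = erf(sigma/2)."""
    return erf(sigma / 2.0)


def gini_pareto(shape):
    """Gini index of a Pareto Type I distribution with shape > 1 (free of the scale)."""
    return 1.0 / (2.0 * shape - 1.0)


print(gini_gamma(2.649), gini_lognormal(0.562), gini_pareto(5))
```

Under these assumptions the gamma and log-normal values agree with the reported $G_{F} = 0.3308$ and $G_{F} = 0.3089$ to the four decimals shown.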

Table 1 summarizes the numerical results obtained from the simulation study. The parameters of the log-normal and gamma distributions are the same as those used by [20]. From the fourth column of Table 1, we find that the ratio of the average final sample size to C is close to 1. Moreover, column 6 of Table 1 illustrates that the attained coverage probability is very close to the desired level of 90%. Thus, we find that the simulation results validate all theoretical results mentioned in the previous section, and the performance of the procedure is satisfactory for the above mentioned distributions.

To verify whether the distribution of the sample Gini index is close to normal, we do the following simulation study. We draw samples of sizes $n = 470$, 509, and 245 from gamma (shape = 2.649, rate = 0.84), log-normal (mean = 2.185, sd = 0.562), and Pareto (20,000, 5) distributions respectively and compute the sample Gini indexes. Please note that the parameters of the log-normal and gamma distributions are the same as in [20]. The sample size n is chosen according to the smallest $\bar{N}$ in Table 1 for different scenarios. We compute 200 replications of the sample Gini index (using set.seed(123) in the “R” language) and observe that the Shapiro-Wilk test for normality returns p-values 0.1938, 0.5066, and 0.3984 for the gamma (shape = 2.649, rate = 0.84), log-normal (mean = 2.185, sd = 0.562), and Pareto (20,000, 5) distributions respectively. Thus, for the above scenarios, the distribution of the sample Gini index is consistent with normality at moderate sample sizes.

5. Discussion and Concluding Remarks

Gini index is a widely used measure of economic inequality. In order to evaluate the economic policies adopted by a government, it is important to estimate Gini index at any specific time period. If the income data for all households in the region of interest is not available, one needs to estimate Gini index by drawing a sample of households from that region. Typically, large income surveys are associated with different sampling schemes. To review several sampling techniques, we refer to [19,21,22,23,24]. The sampling technique chosen to collect data usually depends on the socio-economic diversity and size of the country or region. For regions or smaller countries with less socio-economic diversity, the simple random sampling technique can be used to collect income or expenditure data to estimate Gini index. Several research articles (e.g., [4,5,7,25,26]) are devoted to drawing statistical inference on inequality indexes which are computed from household income or expenditure by means of simple random sampling from the population of interest. In this paper, we also use the simple random sampling technique to collect income or expenditure data in order to estimate Gini index accurately. Even though the sequential methodology is introduced under the i.i.d. framework, which may be considered a practical limitation of our work, sequential methodologies may be adapted to different sampling schemes (e.g., see [27]). In the following Subsection 5.1, we discuss the possibility of extending our work to stratified sampling.

Without assuming any specific distribution of the data, we show that the ratio of the final sample size and the optimal sample size approaches 1. We also show that the confidence interval constructed using our proposed sequential method attains the required coverage probability asymptotically. Thus, based on these results, we conclude that the proposed sequential estimation strategy can efficiently construct a $100 (1 - α) \%$ fixed-width confidence interval for Gini index. In this article, we consider that after the pilot sample, one additional observation is collected in each step. If instead a group of $r (\geq 1)$ observations is collected in each step after the pilot sample stage, the same properties will hold. The proofs will be similar to the ones in the Appendix.

The theory of the sequential procedure revolves around the idea of “learn-as-you-go”. In our proposed method of estimation, the final sample size is not fixed in advance, and the observations are collected until the estimated optimal sample size is reached, yielding the required $100 (1 - α) \%$ fixed-width confidence interval for Gini index. We hope that a number of sequential procedures to estimate income inequalities will be developed following this article, whereupon the idea of a sequential procedure can be applied to other sampling schemes used in economic surveys, taking cost considerations into account as well.

Possible Extension to Stratified Sampling

The sequential procedure that is proposed under the i.i.d. framework may be extended to a non-i.i.d. framework where stratified sampling is used. Suppose we divide the population into S strata. In the population, stratum s contains a mass of $H_{s}$ households. The total number of households in the population is $H = \sum_{s = 1}^{S} H_{s}$. The density of the household income or expenditure X in the $s^{th}$ stratum is denoted by $d F (x | s)$.

Now, a sample of $n_{s}$ households (indexed by $h_{s}$) is drawn by simple random sampling with replacement from every stratum so that the total number of households in the sample is $n = \sum_{s = 1}^{S} n_{s}$ and $n_{s} = n a_{s}$ with $\sum_{s = 1}^{S} a_{s} = 1$, where $a_{s} = H_{s} / H$. Let $x_{s h_{s}}$ be the total income of the $h_{s}^{th}$ household belonging to the $s^{th}$ stratum. If $w_{s h_{s}} = \frac{H_{s}}{n_{s}}$ is the weight of the $h_{s}^{th}$ household in the $s^{th}$ stratum, then following [21,22], Gini index can be estimated by the estimator

\begin{matrix} \hat{G} = 1 - \frac{2}{\hat{μ}} \sum_{s = 1}^{S} \sum_{h_{s} = 1}^{n_{s}} w_{s h_{s}} x_{s h_{s}} (1 - \hat{F} (x_{s h_{s}})), \end{matrix}

(14)

where

\begin{matrix} \hat{μ} = \sum_{s = 1}^{S} \sum_{h_{s} = 1}^{n_{s}} w_{s h_{s}} x_{s h_{s}} \quad \text{and} \quad \hat{F} (x_{s h_{s}}) = \sum_{i = 1}^{S} \sum_{j = 1}^{n_{i}} w_{i j} I [x_{i j} \leq x_{s h_{s}}] / \sum_{i = 1}^{S} \sum_{j = 1}^{n_{i}} w_{i j} . \end{matrix}

(15)

Now, following [21,22] we have

\begin{matrix} \sqrt{n} (\hat{G} - G_{F}) \overset{d}{\to} N (0, V^{*}), \end{matrix}

(16)

where $V^{*}$ is the asymptotic variance given in [22], modified to take into account stratified sampling only. Now,

\begin{matrix} P [|\hat{G} - G_{F}| \leq d] \to 2 Φ (\sqrt{\frac{n}{V^{*}}} d) - 1 \end{matrix}

(17)

The confidence coefficient will be approximately $1 - α$ provided $\sqrt{n} d / \sqrt{V^{*}} \geq z_{α / 2}$. In order to have a fixed-width confidence interval, we need the sample size n to satisfy

\begin{matrix} n \geq d^{- 2} z_{α / 2}^{2} V^{*} \equiv C, \text{ say} . \end{matrix}

(18)

Since C is unknown, it must be estimated in the first stage, and sampling must continue until the sample size n is at least as large as the corresponding estimated value of C. Note that C is the optimum (i.e., minimum) number of households to be sampled to achieve the $(1 - α)$ confidence level if $V^{*}$ were known. The optimal number of households to be sampled in the $s^{th}$ stratum ($s = 1, \dots, S$) will be $C_{s} = C a_{s}$, which is also unknown since C is unknown. Bhattacharya [22] proposed an estimator of the asymptotic variance $V^{*}$ of the Gini index under a complex household survey, which can be used in Equation (18). Then the stopping rule developed in Equation (12) may be modified, taking into account the stratification and the finite sampling scenario, to find an estimate of the optimum number of households needed for a fixed-width confidence interval for Gini index under stratified sampling. However, we do not intend to explore this possibility in this article, and we believe that this could be a good topic of future research.

Acknowledgments

We thank the three anonymous referees and the editor whose insightful comments helped us improve the paper. We remain deeply indebted to Professor Gautam Tripathi and Professor Nitis Mukhopadhyay for their comments and suggestions.

Author Contributions

The authors contributed equally.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Lemma A1.

Under the assumption that $ξ < \infty$, for any $d > 0$, the stopping time $N_{d}$ is finite, that is, $P (N_{d} < \infty) = 1$.

Proof.

Lemma A1 follows from (12) and the facts that $V_{n}^{2}$ is a strongly consistent estimator of $ξ^{2}$ and that $N_{d} \to \infty$ almost surely as $d ↓ 0$. ☐

Lemma A2.

The value of sample Gini index lies between 0 and 1.

Proof.

Let $Y_{1}, \dots, Y_{n}$ be the ordered incomes of n persons where $Y_{1}$ represents the income of the poorest person and $Y_{n}$ represents the income of the richest person. Using [28], Gini index can be rewritten as

\begin{matrix} 0 \leq {\hat{G}}_{n} = \frac{2 \sum_{i = 1}^{n} i Y_{i}}{n \sum_{i = 1}^{n} Y_{i}} - \frac{n + 1}{n} \leq \frac{2 n \sum_{i = 1}^{n} Y_{i}}{n \sum_{i = 1}^{n} Y_{i}} - \frac{n + 1}{n} = \frac{n - 1}{n} \leq 1 . \end{matrix}

This proves the lemma. ☐

Appendix A.1 Proof of Theorem 1

(i): The definition of stopping rule $N_{d}$ in (12) yields

$\begin{matrix} {(\frac{z_{α / 2}}{d})}^{2} V_{N_{d}}^{2} \leq N_{d} \leq m I (N_{d} = m) + {(\frac{z_{α / 2}}{d})}^{2} (V_{N_{d} - 1}^{2} + {(N_{d} - 1)}^{- 1}) . \end{matrix}$

(A1)

Since $N_{d} \to \infty$ a.s. as $d ↓ 0$ and $V_{n} \to ξ$ a.s. as $n \to \infty$ , by Theorem 2.1 of [29], $V_{N_{d}}^{2} \to ξ^{2}$ a.s. Hence, dividing all sides of (A1) by C and letting $d ↓ 0$ , we prove $N_{d} / C \to 1$ a.s. as $d ↓ 0$ .
(ii): In order to show that our procedure satisfies the asymptotic consistency property, we will derive an Anscombe-type random central limit theorem for Gini index. This requires the existence of usual central limit theorem of Gini index and uniform continuity in probability (u.c.i.p.) condition. For details about the u.c.i.p. condition, we refer to [17,30,31,32] etc.

First of all, let us define $n_{1} = (1 - ρ) C$ and $n_{2} = (1 + ρ) C$ for $0 < ρ < 1$. Now, we know from [7] that $Y_{n} = {(\sqrt{n} ({\hat{Δ}}_{n} - Δ), \sqrt{n} ({\bar{X}}_{n} - μ))}^{'} \overset{L}{\to} N_{2} (0, Σ)$, where

Σ = \left( \begin{matrix} 4 σ_{1}^{2} & 2 (τ - μ Δ) \\ 2 (τ - μ Δ) & σ^{2} \end{matrix} \right) .

First, let us prove that $Y_{N_{d}} \overset{L}{\to} N_{2} (0, Σ)$. Define $D^{'} = (a_{0}, a_{1})$ for arbitrary real constants $a_{0}$ and $a_{1}$. Note that $D^{'} Y_{N_{d}} = D^{'} Y_{C} + (D^{'} Y_{N_{d}} - D^{'} Y_{C})$. Thus, it is enough to show that $(D^{'} Y_{N_{d}} - D^{'} Y_{C}) \overset{P}{\to} 0$ as $d ↓ 0$. We can write

\begin{matrix} (D^{'} Y_{N_{d}} - D^{'} Y_{C}) = & a_{0} \sqrt{N_{d}} ({\hat{Δ}}_{N_{d}} - {\hat{Δ}}_{C}) + a_{1} \sqrt{N_{d}} ({\bar{X}}_{N_{d}} - {\bar{X}}_{C}) \\ & + (\sqrt{N_{d} / C} - 1) D^{'} Y_{C} . \end{matrix}

(A2)

Fix some $ϵ > 0$ and note that

\begin{array}{l} P \{| a_{0} \sqrt{N_{d}} ({\hat{Δ}}_{N_{d}} - {\hat{Δ}}_{C}) + a_{1} \sqrt{N_{d}} ({\bar{X}}_{N_{d}} - {\bar{X}}_{C}) | > ϵ\} \\ \leq P \{| a_{0} \sqrt{N_{d}} ({\hat{Δ}}_{N_{d}} - {\hat{Δ}}_{C}) + a_{1} \sqrt{N_{d}} ({\bar{X}}_{N_{d}} - {\bar{X}}_{C}) | > ϵ, | N_{d} - C | < ρ C\} \\ + P [| N_{d} - C | \geq ρ C] \\ \leq P \{\max_{n_{1} < n < n_{2}} \sqrt{n} | {\hat{Δ}}_{n} - {\hat{Δ}}_{C} | > \frac{ϵ}{2 | a_{0} |}\} + P \{\max_{n_{1} < n < n_{2}} \sqrt{n} | {\bar{X}}_{n} - {\bar{X}}_{C} | > \frac{ϵ}{2 | a_{1} |}\} \\ + P [| N_{d} - C | \geq ρ C] \end{array}

Here, ${\hat{Δ}}_{n}$ and ${\bar{X}}_{n}$ are both U-statistics which satisfy Anscombe’s u.c.i.p. condition (e.g., see [17]). Using the u.c.i.p. condition and the fact that $N_{d} / C \overset{a . s .}{\to} 1$, we conclude that for given $ϵ > 0$ and $η > 0$, there exists $d_{0} > 0$ such that

P \{| a_{0} \sqrt{N_{d}} ({\hat{Δ}}_{N_{d}} - {\hat{Δ}}_{C}) + a_{1} \sqrt{N_{d}} ({\bar{X}}_{N_{d}} - {\bar{X}}_{C}) | > ϵ\} < η \quad \text{for all } d \leq d_{0} .

This implies $a_{0} \sqrt{N_{d}} ({\hat{Δ}}_{N_{d}} - {\hat{Δ}}_{C}) + a_{1} \sqrt{N_{d}} ({\bar{X}}_{N_{d}} - {\bar{X}}_{C}) \overset{P}{\to} 0$ as $d ↓ 0$. Also, note that $(\sqrt{N_{d} / C} - 1) D^{'} Y_{C} \overset{P}{\to} 0$ as $d ↓ 0$ since $N_{d} / C \to 1$ almost surely and $D^{'} Y_{C} \overset{L}{\to} N (0, D^{'} Σ D)$. Thus, from (A2), we conclude $(D^{'} Y_{N_{d}} - D^{'} Y_{C}) \overset{P}{\to} 0$, that is, $Y_{N_{d}} \overset{L}{\to} N_{2} (0, Σ)$. Now, define $G (u, v) = \frac{u}{2 v}$, if $v \neq 0$. Using Taylor’s expansion, we can write

\begin{matrix} \sqrt{N_{d}} (G ({\hat{Δ}}_{N_{d}}, {\bar{X}}_{N_{d}}) - G (Δ, μ)) = \sqrt{N_{d}} (\frac{{\hat{Δ}}_{N_{d}} - Δ}{2 μ} - \frac{Δ}{2 μ^{2}} ({\bar{X}}_{N_{d}} - μ) + R_{N_{d}}), \end{matrix}

(A3)

where $R_{N_{d}} = - 2 ({\hat{Δ}}_{N_{d}} - Δ) ({\bar{X}}_{N_{d}} - μ) / b^{2} + 4 a {({\bar{X}}_{N_{d}} - μ)}^{2} / b^{3}$, $a = Δ + p ({\hat{Δ}}_{N_{d}} - Δ)$, $b = 2 μ + p (2 {\bar{X}}_{N_{d}} - 2 μ)$, and $p \in (0, 1)$. Rewriting (A3) in the vector-matrix form, we get

\begin{matrix} \sqrt{N_{d}} (G ({\hat{Δ}}_{N_{d}}, {\bar{X}}_{N_{d}}) - G (Δ, μ)) = D^{'} Y_{N_{d}} + \sqrt{N_{d}} R_{N_{d}}, \end{matrix}

(A4)

where $D^{'} = (\frac{1}{2 μ}, \frac{- Δ}{2 μ^{2}})$. Note that $\sqrt{N_{d}} ({\bar{X}}_{N_{d}} - μ)$ converges in distribution to a normal distribution by Anscombe’s CLT and both $({\hat{Δ}}_{N_{d}} - Δ)$ and $({\bar{X}}_{N_{d}} - μ)$ converge to 0 almost surely. This yields $\sqrt{N_{d}} R_{N_{d}} \overset{P}{\to} 0$ as $d ↓ 0$. Hence, $\sqrt{N_{d}} ({\hat{G}}_{N_{d}} - G_{F}) \overset{L}{\to} N (0, D^{'} Σ D)$ as $d ↓ 0$. This completes the proof of Theorem 1. ☐
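The final step leaves two points implicit: that the limiting variance $D^{'} Σ D$ coincides with $ξ^{2}$ in (6), and how the limit law translates into coverage $1 - α$. The following is a sketch of the first point, using only Σ and $D^{'} = (\frac{1}{2 μ}, \frac{- Δ}{2 μ^{2}})$ as defined in this proof and assuming that $σ^{2}$ denotes $V (X)$; it is offered as a reading aid rather than as part of the original argument.

```latex
\begin{aligned}
D' \Sigma D
&= \frac{1}{4\mu^{2}} \cdot 4\sigma_{1}^{2}
   + 2 \cdot \frac{1}{2\mu} \cdot \left(-\frac{\Delta}{2\mu^{2}}\right) \cdot 2(\tau - \mu\Delta)
   + \frac{\Delta^{2}}{4\mu^{4}} \, \sigma^{2} \\
&= \frac{\sigma_{1}^{2}}{\mu^{2}}
   - \frac{\Delta(\tau - \mu\Delta)}{\mu^{3}}
   + \frac{\Delta^{2}\sigma^{2}}{4\mu^{4}} \\
&= \frac{\Delta^{2}}{4\mu^{4}} \, \sigma^{2}
   - \frac{\Delta\tau}{\mu^{3}}
   + \frac{\Delta^{2}}{\mu^{2}}
   + \frac{\sigma_{1}^{2}}{\mu^{2}}
 = \xi^{2}.
\end{aligned}
```

For the second point, (8) gives $d \sqrt{C} / ξ \to z_{α / 2}$ as $d ↓ 0$, and part (i) gives $N_{d} / C \to 1$ almost surely, so $d \sqrt{N_{d}} / ξ \to z_{α / 2}$ almost surely. Combined with the limit law above, $P (|{\hat{G}}_{N_{d}} - G_{F}| < d) = P (\sqrt{N_{d}} |{\hat{G}}_{N_{d}} - G_{F}| / ξ < d \sqrt{N_{d}} / ξ)$ should then converge to $2 Φ (z_{α / 2}) - 1 = 1 - α$, which is the statement of part (ii).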

Appendix A.2 Proof of Theorem 2

In this subsection, we prove a lemma that is essential to establish Theorem 2. Note from (12) that

N_{d} \geq \frac{z_{α / 2}^{2}}{d^{2}} N_{d}^{- 1}

, i.e.,

N_{d} \geq \frac{z_{α / 2}}{d} (= m)

with probability 1. Suppose

\mathbf{X}^{(n)} = (X_{(1)}, \dots, X_{(n)})

denotes the n-dimensional vector of order statistics from the sample

X_{1}, \dots, X_{n}

, and

\mathcal{F}_{n}

is the σ-algebra generated by

(\mathbf{X}^{(n)}, X_{n + 1}, X_{n + 2}, \dots) .

By [13],

\{{\bar{X}}_{n}, \mathcal{F}_{n}\}

,

\{S_{n}^{2}, \mathcal{F}_{n}\}

,

\{{\hat{τ}}_{n}, \mathcal{F}_{n}\}

,

\{{\hat{Δ}}_{n}, \mathcal{F}_{n}\}

, and their convex functions are all reverse submartingales. Using reverse submartingale properties, let us prove the following lemma along the lines of [33].

Lemma A3.

If

E (X_{1}^{2 p})

is finite for some

p > 1

, then

E [sup_{n \geq m} V_{n}^{2}] < \infty

for

m \geq 4

.

Proof.

To prove Lemma A3, it is enough to show that:

E [sup_{n \geq m} s_{w n}^{2} {\bar{X}}_{n}^{- 2}]

,

E [sup_{n \geq m} |\frac{{\hat{Δ}}_{n}}{{\bar{X}}_{n}^{3}} {\hat{τ}}_{n}|]

,

E [sup_{n \geq m} \frac{{\hat{Δ}}_{n}^{2}}{{\bar{X}}_{n}^{2}}]

, and

E [sup_{n \geq m} \frac{{\hat{Δ}}_{n}^{2}}{{\bar{X}}_{n}^{4}} S_{n}^{2}]

are finite.

We note that

0 \leq \frac{{\hat{Δ}}_{n}}{2 {\bar{X}}_{n}} \leq 1

. So, it is enough to show that

E [sup_{n \geq m} s_{w n}^{2} {\bar{X}}_{n}^{- 2}]

,

E [sup_{n \geq m} \frac{{\hat{τ}}_{n}}{{\bar{X}}_{n}^{2}}]

and

E [sup_{n \geq m} \frac{S_{n}^{2}}{{\bar{X}}_{n}^{2}}]

are finite. Following [34] (p. 338), we have

E [sup_{n \geq m} s_{w n}^{2}] < \infty

if

E [X_{1}^{α}] < \infty

for

α > 2

and

m \geq 4

. Therefore,

\begin{matrix} E (sup_{n \geq m} s_{w n}^{2} {\bar{X}}_{n}^{- 2}) \leq t^{- 2} E (sup_{n \geq m} s_{w n}^{2}) < \infty . \end{matrix}

(A5)

We note that

{\hat{τ}}_{n}

and

S_{n}^{2}

are U-statistics. Using Lemma 9.2.4 of [35], for

p > 1

,

E (sup_{n \geq m} {|{\hat{τ}}_{n}|}^{p}) \leq {(\frac{p}{p - 1})}^{p} E ({|{\hat{τ}}_{m}|}^{p}) and E (sup_{n \geq m} {|S_{n}^{2}|}^{p}) \leq {(\frac{p}{p - 1})}^{p} E ({|S_{m}^{2}|}^{p}) .

Since

E (X_{1}^{2 p})

is finite for some

p > 1

,

E ({|{\hat{τ}}_{m}|}^{p})

and

E ({|S_{m}^{2}|}^{p})

are finite, which yields

E (sup_{n \geq m} |\frac{{\hat{τ}}_{n}}{{\bar{X}}_{n}^{2}}|) \leq \{t^{- 2} E (sup_{n \geq m} |{\hat{τ}}_{n}|)\} < \infty,

and

E (sup_{n \geq m} |\frac{S_{n}^{2}}{{\bar{X}}_{n}^{2}}|) \leq \{t^{- 2} E (sup_{n \geq m} |S_{n}^{2}|)\} < \infty .

This completes the proof of Lemma A3. ☐
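In the proof of Lemma A3, the maximal inequality controls p-th moments of the suprema, while the final two displays use first moments. A sketch of the connecting step, assuming only Jensen's inequality for p > 1 (this intermediate step is not spelled out in the original):

```latex
\begin{aligned}
E\Big(\sup_{n \geq m} |\hat{\tau}_{n}|\Big)
  &\leq \Big\{ E\Big(\sup_{n \geq m} |\hat{\tau}_{n}|^{p}\Big) \Big\}^{1/p} < \infty, \\
E\Big(\sup_{n \geq m} |S_{n}^{2}|\Big)
  &\leq \Big\{ E\Big(\sup_{n \geq m} |S_{n}^{2}|^{p}\Big) \Big\}^{1/p} < \infty .
\end{aligned}
```

Here the supremum of the p-th powers equals the p-th power of the supremum because x ↦ x^{p} is increasing on [0, ∞).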

Below, we prove Theorem 2 by using Lemma A3.

Since

N_{d} \geq m

a.s., dividing (A1) by C yields

\begin{matrix} N_{d} / C - m I (N_{d} = m) / C \leq \frac{1}{ξ^{2}} (sup_{d > 0} V_{N_{d} - 1}^{2} + {(m - 1)}^{- 1}) almost surely . \end{matrix}

(A6)

Since

E (sup_{d > 0} V_{N_{d} - 1}^{2}) < \infty

by Lemma A3 and

N_{d} / C \to 1

a.s. as

d ↓ 0

, by the dominated convergence theorem, we conclude that

lim_{d ↓ 0} E (N_{d} / C) = 1

.

This completes the proof of Theorem 2. ☐
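The deterministic lower bound N_{d} ≥ z_{α/2} / d noted at the start of this subsection is easy to evaluate numerically. The snippet below is a minimal sketch using only the Python standard library; the function name `pilot_lower_bound` and the `floor` parameter are hypothetical, and combining the bound with a minimum of 4 (echoing the condition m ≥ 4 in Lemma A3) is an assumption of this sketch rather than something stated in the proof.

```python
import math
from statistics import NormalDist


def pilot_lower_bound(d: float, alpha: float, floor: int = 4) -> int:
    """Smallest integer n with n >= z_{alpha/2} / d, never below `floor`.

    `floor` is a hypothetical parameter; its default of 4 echoes the
    condition m >= 4 in Lemma A3 and is an assumption of this sketch.
    """
    if not (d > 0 and 0 < alpha < 1):
        raise ValueError("need d > 0 and 0 < alpha < 1")
    # Upper alpha/2 quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return max(floor, math.ceil(z / d))


if __name__ == "__main__":
    # The two (d, alpha) settings that appear in Table 1.
    for d, alpha in [(0.01, 0.10), (0.02, 0.05)]:
        print(d, alpha, pilot_lower_bound(d, alpha))
```

For the two settings used in Table 1 this gives 165 and 98, both well below the corresponding average stopping times reported there.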

References

1. B.C. Arnold. “Inequality measures for multivariate distributions.” Metron 63 (2005): 317–327.
2. R. Andres, and C. Samuel. Inference on Income Inequality and Tax Progressivity Indices: U-Statistics and Bootstrap Methods. ECINEQ working paper 2005-9; Palma, Spain: ECINEQ, 2005.
3. J.A. Bishop, J.P. Formby, and B. Zheng. “Statistical inference and the Sen index of poverty.” Int. Econ. Rev. 38 (1997): 381–387.
4. R. Davidson. “Reliable inference for the Gini index.” J. Econom. 150 (2009): 30–40.
5. J.L. Gastwirth. “The estimation of the Lorenz curve and Gini index.” Rev. Econ. Stat. 54 (1972): 306–316.
6. P. Palmitesta, P. Corrado, and S. Cosimo. “Confidence interval estimation for inequality indices of the Gini family.” Comput. Econ. 16 (2000): 137–147.
7. K. Xu. “U-statistics and their asymptotic results for some inequality and poverty measures.” Econom. Rev. 26 (2007): 567–577.
8. E. Maasoumi. “Empirical analysis of inequality and welfare.” In Handbook of Applied Microeconomics. Edited by S. Schmidt and H. Pesaran. Malden, MA, USA: Blackwell Publishers Inc., 1997.
9. G.B. Dantzig. “On the non-existence of tests of “Student’s” hypothesis having power functions independent of σ.” Ann. Math. Stat. 11 (1940): 186–192.
10. N. Mukhopadhyay, and B.M. de Silva. Sequential Methods and Their Applications. Boca Raton, FL, USA: CRC Press, 2009.
11. P.K. Sen. Sequential Nonparametrics: Invariance Principles and Statistical Inference. New York, NY, USA: Wiley, 1981.
12. W. Hoeffding. “A class of statistics with asymptotically normal distribution.” Ann. Math. Stat. 19 (1948): 293–325.
13. A.J. Lee. U-Statistics: Theory and Practice. New York, NY, USA: CRC Press, 1990.
14. M. Loève. Probability Theory. Princeton, NJ, USA: Van Nostrand, 1963.
15. J.L. Doob. Stochastic Processes. New York, NY, USA: Wiley, 1953.
16. C. Schröder, and S. Yitzhaki. Reasonable Sample Sizes for Convergence to Normality. No. 714, SOEP Papers on Multidisciplinary Panel Data Research. 2014. Available online: http://papers.ssrn.com/sol3/papers.cfm?abstractid=2539096 (accessed on 5 March 2016).
17. R. Sproule. “A Sequential Fixed-Width Confidence Interval for the Mean of a U-Statistic.” Ph.D. Thesis, University of North Carolina, Chapel Hill, NC, USA, 1969.
18. B. Chattopadhyay, and N. Mukhopadhyay. “Two-stage fixed-width confidence intervals for a normal mean in the presence of suspect outliers.” Seq. Anal. 32 (2013): 134–157.
19. M. Langel, and Y. Tillé. “Variance estimation of the Gini index: Revisiting a result several times published.” J. R. Stat. Soc. Ser. A Stat. Soc. 176 (2013): 521–540.
20. M.R. Ransom, and J.S. Cramer. “Income distribution functions with disturbances.” Eur. Econ. Rev. 22 (1983): 363–372.
21. D. Bhattacharya. “Asymptotic inference from multi-stage samples.” J. Econom. 126 (2005): 145–171.
22. D. Bhattacharya. “Inference on inequality from household survey data.” J. Econom. 137 (2007): 674–707.
23. D.A. Binder, and M.S. Kovacevic. “Estimating some measures of income inequality from survey data: An application of the estimating equations approach.” Surv. Methodol. 21 (1995): 137–146.
24. W.G. Cochran. Sampling Techniques. New York, NY, USA: Wiley & Sons, 1977.
25. C.M. Beach, and R. Davidson. “Distribution-free statistical inference with Lorenz curves and income shares.” Rev. Econ. Stud. 50 (1983): 723–735.
26. R. Davidson, and J. Duclos. “Statistical inference for stochastic dominance and for the measurement of poverty and inequality.” Econometrica 68 (2000): 1435–1464.
27. S. Zacks. Stage-Wise Adaptive Designs. New York, NY, USA: Wiley, 2009.
28. C. Damgaard, and J. Weiner. “Describing inequality in plant size or fecundity.” Ecology 81 (2000): 1139–1142.
29. A. Gut. Stopped Random Walks: Limit Theorems and Applications. New York, NY, USA: Springer, 2009.
30. F.J. Anscombe. “Sequential estimation.” J. R. Stat. Soc. Ser. B 15 (1953): 1–29.
31. E. Isogai. “Asymptotic consistency of fixed-width sequential confidence intervals for a multiple regression function.” Ann. Inst. Stat. Math. 38 (1986): 69–83.
32. N. Mukhopadhyay, and B. Chattopadhyay. “A tribute to Frank Anscombe and random central limit theorem from 1952.” Seq. Anal. 31 (2012): 265–277.
33. S.K. De, and B. Chattopadhyay. “Minimum Risk Point Estimation of Gini Index.” Available online: http://arxiv.org/abs/1503.08148 (accessed on 27 March 2015).
34. P.K. Sen, and M. Ghosh. “Sequential point estimation of estimable parameters based on U-statistics.” Sankhyā Indian J. Stat. Ser. A 43 (1981): 331–344.
35. M. Ghosh, N. Mukhopadhyay, and P.K. Sen. Sequential Estimation. New York, NY, USA: Wiley, 1997.

**Table 1.** Performance of the proposed sequential procedure when the data are drawn from gamma, log-normal, and Pareto distributions.
d	α	Distribution	$\bar{N}$	$s(\bar{N})$	C	$\bar{N} / C$	p	$s_{p}$
0.01	0.1	Gamma	1283.7450	1.7561	1267	1.0132	0.9090	0.0064
0.02	0.05	Gamma	469.1650	1.0485	450	1.0426	0.9535	0.0047
0.01	0.1	Log-normal	1435.3640	4.091604	1440	0.9968	0.8965	0.0068
0.02	0.05	Log-normal	509.0020	2.0538	511	0.9961	0.9430	0.0052
0.01	0.1	Pareto	654.5364	4.2151	686	0.9541	0.9018	0.0063
0.02	0.05	Pareto	244.3330	2.0099	244	1.0014	0.9470	0.0050
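As a quick consistency check on Table 1, the sketch below recomputes the ratio $\bar{N} / C$ from the reported $\bar{N}$ and C, and expresses the gap between the empirical coverage p and the nominal level 1 − α in units of $s_{p}$. The names `TABLE_1` and `check_rows` are illustrative, and the numbers are copied from the table rather than regenerated by simulation.

```python
# Rows of Table 1: (d, alpha, distribution, N_bar, C, reported N_bar/C, p, s_p).
TABLE_1 = [
    (0.01, 0.10, "Gamma", 1283.7450, 1267, 1.0132, 0.9090, 0.0064),
    (0.02, 0.05, "Gamma", 469.1650, 450, 1.0426, 0.9535, 0.0047),
    (0.01, 0.10, "Log-normal", 1435.3640, 1440, 0.9968, 0.8965, 0.0068),
    (0.02, 0.05, "Log-normal", 509.0020, 511, 0.9961, 0.9430, 0.0052),
    (0.01, 0.10, "Pareto", 654.5364, 686, 0.9541, 0.9018, 0.0063),
    (0.02, 0.05, "Pareto", 244.3330, 244, 1.0014, 0.9470, 0.0050),
]


def check_rows(rows):
    """Return (distribution, d, recomputed ratio, reported ratio, coverage gap in s_p units)."""
    results = []
    for d, alpha, dist, n_bar, c, reported, p, s_p in rows:
        ratio = n_bar / c                   # recomputed N_bar / C
        gap = (p - (1.0 - alpha)) / s_p     # (p - nominal coverage) / s_p
        results.append((dist, d, ratio, reported, gap))
    return results


if __name__ == "__main__":
    for dist, d, ratio, reported, gap in check_rows(TABLE_1):
        print(f"{dist:<11} d={d:<5} ratio={ratio:.4f} reported={reported:.4f} gap={gap:+.2f}")
```

With the values above, every recomputed ratio agrees with the reported column to four decimal places, and each empirical coverage lies within two $s_{p}$ of its nominal level.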

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license ( http://creativecommons.org/licenses/by/4.0/).

