Estimation of Gini Index within Pre-Specified Error Bound

The Gini index is a widely used measure of economic inequality. This article develops a general theory for constructing a confidence interval for the Gini index with a specified confidence coefficient and a specified width. Fixed sample size methods cannot simultaneously achieve both the specified confidence coefficient and the specified width. We develop a purely sequential procedure for interval estimation of the Gini index with a specified confidence coefficient and a fixed margin of error. Optimality properties of the proposed method, namely first-order asymptotic efficiency and asymptotic consistency, are proved. All theoretical results are derived without assuming any specific distribution of the data.


Introduction
Economic inequality arises from the unequal distribution of income and assets among individuals or groups within a society or region, or even between countries. It is usually measured to evaluate the effects of economic policies at the micro or macro level. The economics literature offers several indexes of economic inequality, among which the Gini inequality index is the most widely used. The Gini index, as given in Arnold (2005), is

$$G_F = \frac{\Delta}{2\mu}, \qquad (1)$$

where $\Delta = E|X_1 - X_2|$, $\mu = E(X)$, and $X_1$ and $X_2$ are two i.i.d. copies of a non-negative random variable $X$. The Gini index compares every individual's income with every other individual's income. If there are $n$ randomly selected individuals with incomes given by $X_1, \ldots, X_n$, then the estimator of the Gini index is

$$\hat G_n = \frac{\hat\Delta_n}{2\bar X_n}, \qquad (2)$$

where $\bar X_n$ is the sample mean and $\hat\Delta_n$ is the sample Gini's mean difference defined as

$$\hat\Delta_n = \binom{n}{2}^{-1} \sum_{1 \le i_1 < i_2 \le n} |X_{i_1} - X_{i_2}|.$$

The Gini index is undefined if $\bar X_n = 0$. We ignore this special case.
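As an illustration, the estimator can be computed directly from its definition as the sample Gini's mean difference divided by twice the sample mean. The following is a minimal sketch, assuming the mean difference is the average of $|X_i - X_j|$ over all unordered pairs; the function name `sample_gini` is hypothetical.

```python
from itertools import combinations


def sample_gini(incomes):
    """Sample Gini index: Gini's mean difference divided by twice the mean."""
    n = len(incomes)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_income = sum(incomes) / n
    if mean_income == 0:
        raise ValueError("Gini index is undefined when the sample mean is 0")
    # Average absolute difference over all n(n-1)/2 unordered pairs.
    pair_sum = sum(abs(a - b) for a, b in combinations(incomes, 2))
    delta_hat = 2.0 * pair_sum / (n * (n - 1))
    return delta_hat / (2.0 * mean_income)
```

The pairwise sum costs O(n^2) operations, which is acceptable for the small samples a survey would draw; a sorted-data formula is cheaper for large n.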
For continuous evaluation of the economic policies implemented by a government, computing the Gini index for the whole country or a region is very important. One source of income or expenditure data for all households in a region is census data, which is typically collected every 10 years. As a result, a Gini index computed from census data is available only once every 10 years.² Unless the household income data is updated annually, an estimate of the Gini index for intermediate periods between two censuses cannot be obtained. Some developed countries conduct household surveys annually. However, many countries cannot afford, or do not conduct, annual household surveys. For those countries, it is useful to draw a relatively small number of households to estimate the population Gini index. To estimate the Gini index for a region, simple random sampling of households may be used. For the existing literature on the use of simple random sampling to estimate inequality indexes, we refer to Gastwirth (1972), Bishop et al (1997), Xu (2007) and Davidson (2009).
² For some countries, the reported Gini indexes are based on data that are more than 15 years old (see the World Bank website). Examples include Belize, Algeria, and Botswana, whose Gini indexes are based on data collected in 1999, 1995, and 1994, respectively.
In the economics literature, there exist innovative methods for constructing confidence intervals for $G_F$ (e.g., see Xu (2007)). However, such a confidence interval varies from sample to sample, and so does its width. Wider confidence intervals provide less precise information about the true value of the parameter of interest. Since shorter confidence intervals are desirable, we instead fix the length of the confidence interval, or in other words the margin of error, while achieving the same confidence coefficient. Thus we want to construct a 100(1 − α)% fixed-width confidence interval for $G_F$. This problem is known as the fixed-width confidence interval estimation problem.
No fixed sample size procedure can provide a solution to the fixed-width confidence interval estimation problem (e.g., see Dantzig (1940)). This problem falls in the domain of sequential analysis. For details about the general theory of fixed-width confidence interval estimation, we refer interested readers to Sen (1981) and Mukhopadhyay and De Silva (2009). Sequential analysis is concerned with studies where, unlike in fixed-sample-size procedures, sample sizes are not fixed in advance. Instead, the sequential estimation procedure collects observations until an a priori specified criterion, or stopping rule, is satisfied.
We know that Gini's mean difference is a U-statistic with a symmetric kernel of degree 2 and the sample mean is a U-statistic with a symmetric kernel of degree 1 (e.g., see Hoeffding (1948)). Under a distribution-free scenario, Xu (2007) used the central limit theorem for U-statistics to obtain a confidence interval for the Gini index. However, this cannot be used to obtain a fixed-width confidence interval for the Gini index. In this article, we solve the problem of obtaining a fixed-width confidence interval for the Gini index using a purely sequential procedure with a stopping rule based on several U-statistics. Apart from being unbiased estimators, U-statistics are also reverse martingales with respect to some non-increasing filtration, as proven in Lee (1990). For more literature on reverse martingales, we refer to classical textbooks on probability theory and stochastic processes such as Loève (1963), Doob (1953), and others. We exploit the reverse martingale property of U-statistics to derive attractive asymptotic properties of our proposed estimation procedure.
In the next section, we formally state the fixed-width confidence interval estimation problem and explain why a fixed-sample-size procedure cannot be used. In section 3, a purely sequential procedure is proposed to construct a 100(1 − α)% fixed-width confidence interval for the unknown population Gini index, and the implementation and characteristics of the sequential procedure are discussed as well. Section 4 presents a simulation study that validates the theoretical results related to our procedure. We conclude this article with some remarks in section 5.

Problem Statement and Optimal Sample Size
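Consider $n$ randomly selected individuals from some population of interest with incomes denoted by $X_1, X_2, \ldots, X_n$. These are non-negative random variables. A strongly consistent estimator of the population Gini index $G_F$ is $\hat G_n$ given in (2). For fixed $\alpha \in (0, 1)$, the goal of this paper is to develop the theory for constructing a $100(1 - \alpha)\%$ fixed-width confidence interval for $G_F$. Formally, we would like to construct a confidence interval $J_n = (\hat G_n - d, \hat G_n + d)$ for some prefixed margin of error $d > 0$. Using Xu (2007), we have

$$\sqrt{n}\,(\hat G_n - G_F) \xrightarrow{d} N(0, \xi^2),$$

where $\xi^2$ is the asymptotic variance given by

$$\xi^2 = \frac{\Delta^2 \sigma^2}{4\mu^4} - \frac{\Delta(\tau - \mu\Delta)}{\mu^3} + \frac{\sigma_1^2}{\mu^2}.$$

Here, $\sigma^2 = V(X)$, $\tau = E(X_1 |X_1 - X_2|)$, and $\sigma_1^2 = V\{E(|X_1 - X_2| \mid X_1)\}$. Based on the asymptotic normality of $\hat G_n$, we observe that the coverage probability is

$$P(\hat G_n - d < G_F < \hat G_n + d) \approx 2\Phi\!\left(\frac{d\sqrt{n}}{\xi}\right) - 1,$$

where $\Phi$ is the distribution function of a standard normal random variable. In order to have coverage probability of at least $1 - \alpha$ approximately, we need

$$2\Phi\!\left(\frac{d\sqrt{n}}{\xi}\right) - 1 \ge 1 - \alpha. \qquad (7)$$

Solving (7) for $n$, we obtain $n \ge d^{-2} z_{\alpha/2}^2 \xi^2$, where $z_{\alpha/2}$ is the upper $(\alpha/2)$th quantile of the standard normal distribution. Thus, the optimal (minimal) sample size required to construct a fixed-width confidence interval for the Gini index with approximately $(1 - \alpha)$ coverage probability is

$$C = \frac{z_{\alpha/2}^2\, \xi^2}{d^2}, \qquad (8)$$

provided $\xi$ is known.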
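To give a sense of scale, the optimal sample size $C = (z_{\alpha/2}/d)^2 \xi^2$ can be evaluated for a hypothetical known value of $\xi^2$. The sketch below uses the Python standard library; the function name `optimal_sample_size` and the value $\xi^2 = 1/12$ in the usage note are illustrative assumptions, not quantities taken from this article.

```python
from statistics import NormalDist


def optimal_sample_size(xi2, d, alpha):
    """Return C = (z_{alpha/2} / d)^2 * xi^2 for a known asymptotic variance xi^2."""
    if d <= 0 or not 0 < alpha < 1:
        raise ValueError("need d > 0 and 0 < alpha < 1")
    # Upper alpha/2 quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (z / d) ** 2 * xi2
```

For example, `optimal_sample_size(1 / 12, 0.01, 0.1)` is roughly 2255, and halving the margin of error $d$ multiplies the required sample size by four.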
The optimal fixed sample size $C$ is unknown since the true value of $\xi$ is unknown in practice. If $C$ were known, one would just draw $C$ observations independently from the population of interest and compute $(\hat G_C - d, \hat G_C + d)$, which would satisfy (4) approximately.
Since $C$ is unknown, one must draw samples in at least two stages in order to achieve the desired coverage probability at least approximately. In the first stage, one must estimate $C$ by estimating $\xi$, and then in the subsequent stages one should collect samples until the current sample size is greater than or equal to the estimated optimal sample size. In this article, we propose a sequential sampling procedure to estimate the optimal sample size $C$ and ensure that the fixed-width confidence interval based on the final sample size attains the desired $(1 - \alpha)$ coverage probability.

The Sequential Estimation Procedure
In sequential estimation procedures, the parameter estimates are updated as the data is observed. In the first step, a small sample, called the pilot sample, is observed to gather preliminary information about the parameter of interest. Then, in each successive step, one or more additional observations are collected and the estimates of the parameters are updated. After each and every step a decision is taken whether to continue or to terminate the sampling process. This decision is based on a pre-defined stopping rule.
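From (8) we note that the optimal sample size needed to find a fixed-width confidence interval depends on the unknown parameter $\xi^2$. So, let us first find a good estimator of $\xi^2$. Following Xu (2007) and Sproule (1969), we consider the following strongly consistent estimator of $\xi^2$ based on U-statistics. Let us define a U-statistic, for each $j = 1, 2, \ldots, n$,

$$\hat\Delta_n^{(j)} = \binom{n-1}{2}^{-1} \sum_{T_j} |X_{i_1} - X_{i_2}|,$$

where $T_j = \{(i_1, i_2) : 1 \le i_1 < i_2 \le n \text{ and } i_1, i_2 \ne j\}$. Let $W_{jn} = n\hat\Delta_n - (n-1)\hat\Delta_n^{(j)}$ for $j = 1, \ldots, n$, and $\bar W_n = n^{-1} \sum_{j=1}^{n} W_{jn}$. According to Sproule (1969), a strongly consistent estimator of $4\sigma_1^2$ is

$$s_{wn}^2 = (n-1)^{-1} \sum_{j=1}^{n} (W_{jn} - \bar W_n)^2.$$

Using Xu (2007),

$$\hat\tau_n = \binom{n}{2}^{-1} \sum_{1 \le i_1 < i_2 \le n} \tfrac{1}{2}(X_{i_1} + X_{i_2})\,|X_{i_1} - X_{i_2}|$$

is an estimator of $\tau$. Let $S_n^2$ be the sample variance. Thus, the estimator of $\xi^2$ is

$$V_n^2 = \frac{\hat\Delta_n^2 S_n^2}{4\bar X_n^4} - \frac{\hat\Delta_n(\hat\tau_n - \bar X_n \hat\Delta_n)}{\bar X_n^3} + \frac{s_{wn}^2}{4\bar X_n^2}, \qquad (11)$$

similar to Xu (2007). Using Sproule (1969) and theorem 3.2.1 of Sen (1981), we conclude that $V_n^2$ is a strongly consistent estimator of $\xi^2$. Based on this estimator of $\xi^2$, we define the stopping rule $N_d$, for every $d > 0$, as

$$N_d = \text{the smallest integer } n \ge m \text{ such that } n \ge \left(\frac{z_{\alpha/2}}{d}\right)^2 \left(V_n^2 + n^{-1}\right). \qquad (12)$$

Here, $m$ is called the initial or pilot sample size, and the term $n^{-1}$ is known as a correction term. Note that $V_n$ can be very close to zero with positive probability. Without the correction term, the inequality in (12) may be satisfied for very small $n$, terminating the sampling process too early. Thus the correction term $n^{-1}$ ensures that the sampling process for estimating the optimal sample size does not stop too early. For details about the correction term, we refer to Sen (1981).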
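A plug-in estimator of $\xi^2$ of this kind can be sketched in a few lines of Python. The sketch rests on stated assumptions rather than on a verified transcription of (11): it takes the jackknife pseudo-values $W_{jn} = n\hat\Delta_n - (n-1)\hat\Delta_n^{(j)}$ for Sproule's estimator of $4\sigma_1^2$, the U-statistic with kernel $\tfrac12(x_1 + x_2)|x_1 - x_2|$ for $\tau$, and the delta-method combination of the three variance components. The function name `xi2_estimate` is hypothetical.

```python
from itertools import combinations
from statistics import mean, variance


def xi2_estimate(x):
    """Plug-in estimate V_n^2 of the asymptotic variance xi^2 (needs n >= 3)."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least three observations")
    xbar = mean(x)
    s2 = variance(x)  # sample variance S_n^2 (divisor n - 1)
    n_pairs = n * (n - 1) / 2
    total_abs = 0.0       # sum of |x_i - x_j| over all unordered pairs
    total_tau = 0.0       # sum of 0.5 * (x_i + x_j) * |x_i - x_j| over pairs
    row_abs = [0.0] * n   # row_abs[j] = sum over i != j of |x_j - x_i|
    for i, j in combinations(range(n), 2):
        diff = abs(x[i] - x[j])
        total_abs += diff
        total_tau += 0.5 * (x[i] + x[j]) * diff
        row_abs[i] += diff
        row_abs[j] += diff
    delta = total_abs / n_pairs
    tau = total_tau / n_pairs
    # Leave-one-out mean differences and jackknife pseudo-values W_jn.
    loo_pairs = (n - 1) * (n - 2) / 2
    w = [n * delta - (n - 1) * (total_abs - row_abs[j]) / loo_pairs
         for j in range(n)]
    s2_w = variance(w)  # assumed estimator of 4 * sigma_1^2
    return (delta ** 2 * s2 / (4 * xbar ** 4)
            - delta * (tau - xbar * delta) / xbar ** 3
            + s2_w / (4 * xbar ** 2))
```

For the three observations 1, 2, 3 the components can be worked out by hand ($\hat\Delta_n = 4/3$, $S_n^2 = 1$, $\hat\tau_n = 8/3$, $s_{wn}^2 = 4/3$), giving $V_n^2 = 1/36 + 1/12 = 1/9$ under these assumptions.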
From (12), we note that $N_d \ge (z_{\alpha/2}/d)^2 N_d^{-1}$, i.e., the final sample size must be at least $z_{\alpha/2}/d$. Therefore, we take the pilot sample size to be $m = \max\{4, \lceil z_{\alpha/2}/d \rceil\}$. This technique for choosing the pilot sample size can also be found in Mukhopadhyay and De Silva (2009).
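The pilot sample size is simple to compute. The sketch below assumes that $z_{\alpha/2}/d$ is rounded up to the next integer, which is consistent with the pilot size of 165 reported for $d = 0.01$ and $\alpha = 0.1$ in the simulation study; the function name `pilot_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def pilot_sample_size(d, alpha):
    """Pilot sample size m = max(4, ceil(z_{alpha/2} / d))."""
    if d <= 0 or not 0 < alpha < 1:
        raise ValueError("need d > 0 and 0 < alpha < 1")
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 normal quantile
    return max(4, math.ceil(z / d))
```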
Recall that the optimal sample size required to achieve 100(1 − α)% confidence interval for Gini index is C which is unknown in practice. The stopping variable N d defined in (12) serves as an estimator of C. Below, we develop a purely sequential procedure to estimate the optimal sample size C.

Implementation and Characteristics
We propose the following purely sequential estimation procedure to estimate $C$:

Stage 1: Compute the pilot sample size $m = \max\{4, \lceil z_{\alpha/2}/d \rceil\}$ and draw a random sample of size $m$ from the population of interest. Based on this pilot sample of size $m$, obtain an estimate of $\xi^2$ by finding $V_m^2$ as given in (11) and check whether $m \ge (z_{\alpha/2}/d)^2 (V_m^2 + m^{-1})$. If $m < (z_{\alpha/2}/d)^2 (V_m^2 + m^{-1})$, then go to the next step. Otherwise, set the final sample size $N_d = m$.

Stage 2: Draw an additional observation independent of the pilot sample and update the estimate of $\xi^2$ by computing $V_{m+1}^2$. Check whether $m + 1 \ge (z_{\alpha/2}/d)^2 (V_{m+1}^2 + (m+1)^{-1})$. If $m + 1 < (z_{\alpha/2}/d)^2 (V_{m+1}^2 + (m+1)^{-1})$, then go to the next step. Otherwise, stop further sampling and report the final sample size as $N_d = m + 1$.

This process of collecting one observation in each stage after stage 1 is continued until there are $N_d$ observations such that $N_d \ge (z_{\alpha/2}/d)^2 (V_{N_d}^2 + N_d^{-1})$. At this stage, we stop sampling and report the final sample size as $N_d$.
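The stages above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not a definitive implementation: `draw` is a hypothetical zero-argument callable that returns one new independent observation, `estimate_xi2` is a hypothetical callable that maps the current sample to an estimate of $\xi^2$, and `max_n` is a safety cap added here so that the loop always terminates.

```python
import math
from statistics import NormalDist


def sequential_sample_size(draw, estimate_xi2, d, alpha, max_n=1_000_000):
    """Run the purely sequential stopping rule; return (N_d, final sample)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    m = max(4, math.ceil(z / d))          # pilot sample size
    sample = [draw() for _ in range(m)]   # stage 1: pilot sample
    while True:
        n = len(sample)
        # Stop as soon as n >= (z/d)^2 * (V_n^2 + 1/n); 1/n is the correction term.
        if n >= (z / d) ** 2 * (estimate_xi2(sample) + 1.0 / n):
            return n, sample
        if n >= max_n:
            raise RuntimeError("stopping rule not satisfied within max_n draws")
        sample.append(draw())             # later stages: one more observation
```

Once the loop stops at $N_d$, the reported interval is the sample Gini index of the final sample plus or minus $d$. Recomputing a pairwise estimator from scratch at every step costs O(n^2) per step, so an implementation meant for large $N_d$ would update the required sums incrementally.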
Based on the above algorithm, the sampling process will stop at some stage. This is proved in Lemma 1 which states that if observations are collected using (12), under appropriate conditions, P (N d < ∞) = 1. This is a very important property of any sequential procedure since it mathematically ensures that the sampling will be terminated eventually.
Next, we establish some desirable asymptotic properties of our proposed sequential procedure. First, we prove that the final sample size $N_d$ required by our sampling strategy is close to the optimal sample size $C$, at least asymptotically. This property is known as the asymptotic efficiency property of a sequential procedure; it ensures that, on average, we collect only the minimum number of samples needed to achieve a given accuracy of estimation. Second, we show that the fixed-width confidence interval $(\hat G_{N_d} - d, \hat G_{N_d} + d)$ contains the true value of the Gini index $G_F$ with probability nearly $1 - \alpha$. We formally state these results in theorems 1 and 2.
Theorem 1. If the parent distribution $F$ is such that $E[X^4]$ and $E[X^{-\beta}]$ exist for $\beta > 4$,⁵ then the stopping rule in (12) yields the following asymptotic optimality properties:
(i) $N_d / C \to 1$ almost surely as $d \downarrow 0$;
(ii) $E(N_d / C) \to 1$ as $d \downarrow 0$.

Theorem 2. If the parent distribution $F$ is such that $E[X^4]$ exists, then the stopping rule in (12) yields

$$P(\hat G_{N_d} - d < G_F < \hat G_{N_d} + d) \to 1 - \alpha \quad \text{as } d \downarrow 0.$$

Theorems 1 and 2 are proved in the appendix. Part (i) of theorem 1 implies that the ratio of the final sample size of our procedure to the optimal sample size $C$ converges asymptotically to 1. Part (ii) of theorem 1 implies that the ratio of the average final sample size of our procedure to $C$ converges asymptotically to 1. This property is called the first-order asymptotic efficiency property, as can be found in Mukhopadhyay and De Silva (2009).
Theorem 2 implies that the coverage probability produced by the fixed-width confidence interval $(\hat G_{N_d} - d, \hat G_{N_d} + d)$ attains the desired level $1 - \alpha$ asymptotically. This property is called asymptotic consistency. Thus, we prove that the proposed purely sequential procedure enjoys both the asymptotic efficiency property and the asymptotic consistency property.
⁵ If for a certain distribution function, negative moments do not exist, then theorem 1 will hold, if

Simulation Study
In this section, we validate the asymptotic properties of our method stated in theorems 1 and 2 through a Monte Carlo study. To implement the sequential procedure, we fix $d = 0.01$ and $\alpha = 0.1$. Using the pilot sample size formula $m = \max\{4, \lceil z_{\alpha/2}/d \rceil\}$, the pilot sample size considered here is 165. Then, we implement the sequential procedure described in section 3.1 and estimate the average sample size ($\bar N$), the maximum sample size ($\max(N)$), the standard error ($s(\bar N)$) of $\bar N$, the coverage probability ($p$), and its standard error ($s_p$) based on 2000 replications by drawing random samples from the gamma distribution (shape = 2.649, rate = 0.84), the log-normal distribution (mean = 2.185, sd = 0.562), and Pareto (20000, 5). Table 1 summarizes the numerical results obtained from the simulation study. The parameters of the log-normal and gamma distributions are the same as those used by Ransom and Cramer (1983).
From the fourth column of table 1, we find that the ratio of the average final sample size and C is close to 1. Moreover, column 6 of table 1 illustrates that the attained coverage probability is very close to the desired level of 90%. Thus, we find that the simulation results validate all theoretical results mentioned in the previous section, and the performance of the procedure is satisfactory for the above mentioned distributions.
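Estimating the coverage probability in such a study requires the true Gini index of each parent distribution. The sketch below uses standard closed-form expressions that are not stated in this article: $\Gamma(k + 1/2) / (\sqrt{\pi}\,\Gamma(k + 1))$ for a gamma distribution with shape $k$, $\operatorname{erf}(\sigma/2)$ for a log-normal distribution with log-scale standard deviation $\sigma$, and $1/(2a - 1)$ for a Pareto distribution with shape $a > 1$. Reading "Pareto (20000, 5)" as scale 20000 and shape 5, and the log-normal "sd" as the log-scale standard deviation, are assumptions; the function names are hypothetical.

```python
import math


def gini_gamma(shape):
    """True Gini index of a gamma distribution (does not depend on the rate)."""
    return math.gamma(shape + 0.5) / (math.sqrt(math.pi) * math.gamma(shape + 1))


def gini_lognormal(sigma):
    """True Gini index of a log-normal distribution with log-scale sd sigma."""
    return math.erf(sigma / 2)


def gini_pareto(shape):
    """True Gini index of a Pareto (type I) distribution with shape > 1."""
    if shape <= 1:
        raise ValueError("shape must exceed 1 for the mean to exist")
    return 1.0 / (2.0 * shape - 1.0)
```

Because the Gini index is scale invariant, the rate of the gamma distribution, the log-scale mean of the log-normal distribution, and the scale of the Pareto distribution do not enter these formulas.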

Concluding Remarks
The Gini index is a widely used measure of economic inequality. In order to evaluate the economic policies adopted by a government, it is important to estimate the Gini index at any specific time period. If the income data for all households in the region of interest is not available, one should estimate the Gini index by drawing a simple random sample of households from that region. This article develops a purely sequential procedure that provides a 100(1 − α)% fixed-width confidence interval for the Gini index. Without assuming any specific distribution for the data, we show that the ratio of the final sample size to the optimal sample size approaches 1. We also show that the confidence interval constructed using our proposed sequential method attains the required coverage probability asymptotically. Thus, based on these results, we conclude that the proposed sequential estimation strategy can efficiently construct a 100(1 − α)% fixed-width confidence interval for the Gini index. In this article, we assume that after the pilot sample, one additional observation is collected in each step. If instead a group of r (≥ 1, say) observations is collected in each step after the pilot sample stage, the same properties will hold. The proofs will be similar to the ones in the Appendix.
Apart from economics, there are other fields where researchers report the Gini index. For instance, in the social sciences and economics, the Gini index is used to measure inequality in education (see Thomas et al (2001)). In ecology, the Gini index is used as a measure of biodiversity (e.g., see Wittebolle et al (2009)). Asada (2005) uses the Gini index as a measure of the inequality of health-related quality of life in a population. Shi and Sethu (2003) use the Gini index to evaluate the fairness achieved by internet routers in scheduling packet transmissions from different flows of traffic. The possible application of the Gini index in so many fields, such as sociology, health science, ecology, engineering, and chemistry, motivates us to develop the theory for constructing a fixed-width confidence interval for the Gini index.

Appendix
Lemma 1. Under the assumption that $\xi < \infty$, for any $d > 0$, the stopping time $N_d$ is finite, that is, $P(N_d < \infty) = 1$.
Proof. Lemma 1 follows from (12), the fact that $V_n^2$ is a strongly consistent estimator of $\xi^2$, and the fact that $N_d \to \infty$ almost surely as $d \downarrow 0$.
Lemma 2. The value of sample Gini index lies between 0 and 1.
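Proof. Let $Y_1 \le \cdots \le Y_n$ be the ordered incomes of $n$ persons, where $Y_1$ represents the income of the poorest person and $Y_n$ represents the income of the richest person. Using Damgaard and Weiner (2000), the sample Gini index can be rewritten as

$$\hat G_n = \frac{\sum_{i=1}^{n} (2i - n - 1)\, Y_i}{(n-1) \sum_{i=1}^{n} Y_i}.$$

The numerator equals $\sum_{i < j} (Y_j - Y_i) \ge 0$, so $\hat G_n \ge 0$. Since $2i - n - 1 \le n - 1$ and $Y_i \ge 0$ for every $i$, the numerator is at most $(n-1) \sum_{i=1}^{n} Y_i$, so $\hat G_n \le 1$. This proves the lemma.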
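The ordered-income representation is easy to check numerically. The sketch below assumes the form $\hat G_n = \sum_i (2i - n - 1) Y_i / \{(n-1) \sum_i Y_i\}$, which follows from counting how often each order statistic appears as the larger or the smaller member of a pair; the function name `gini_ordered` is hypothetical.

```python
def gini_ordered(incomes):
    """Sample Gini index computed from the ordered incomes Y_1 <= ... <= Y_n."""
    y = sorted(incomes)
    n = len(y)
    total = sum(y)
    if n < 2 or total == 0:
        raise ValueError("need n >= 2 and a positive total income")
    # Y_i is the larger member of (i - 1) pairs and the smaller member of
    # (n - i) pairs, which gives the weight 2i - n - 1.
    weighted = sum((2 * i - n - 1) * yi for i, yi in enumerate(y, start=1))
    return weighted / ((n - 1) * total)
```

Sorting dominates the cost, so this form needs O(n log n) operations instead of the O(n^2) required by summing over all pairs, and the bounds of lemma 2 can be read off from the weights.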

Proof of Theorem 1
In this subsection, we prove some lemmas that are essential to establish theorem 1 and theorem 2. First, we introduce some notation. Note from (12) . . . Here, $X^{(n)} = (X_{(1)}, \ldots, X_{(n)})$ denotes the $n$-dimensional vector of order statistics from the sample $X_1, \ldots, X_n$, and $\mathcal{F}_n$ is the $\sigma$-algebra generated by $(X^{(n)}, X_{n+1}, X_{n+2}, \ldots)$. By Lee (1990), $\{\bar X_n, \mathcal{F}_n\}$, $\{S_n^2, \mathcal{F}_n\}$, $\{\hat\tau_n, \mathcal{F}_n\}$, $\{\hat\Delta_n, \mathcal{F}_n\}$, and their convex functions are all reverse submartingales. Using reverse submartingale properties, let us prove the following lemmas.
Lemma 3. Let $\bar X_n$ be the sample mean based on non-negative i.i.d. observations $X_1, \ldots, X_n$. Then, if $E(X^{-s}) < \infty$, for $s > r$ and $r \ge 1$,
Proof. For $\alpha \ge 1$, we have where $\beta > 1$. The last inequality is obtained by applying the maximal inequality for reverse submartingales (see Lee (1990)). Let $s = r\beta$. Now, it is enough to show that if $E(X^{-s}) < \infty$, then $E(\bar X_n^{-s}) < \infty$. Note that $\bar X_n \ge \left(\prod_{i=1}^{n} X_i\right)^{1/n}$ as the observations are non-negative and The last equality is due to the i.i.d. property of the observations. We know that $\{E(|X|^p)\}^{1/p}$ is a nondecreasing function of $p$ for $p > 0$. Applying this result with $p = 1/n \le 1$ in (16) are finite. Following Sen and Ghosh (1981, p. 338) and $m \ge 4$. By lemma 3, $E \sup$ We note that $\hat\tau_n$ and $S_n^2$ are U-statistics. Using lemma 9.2.4 of Ghosh et al (1997), if $E(X_1^4)$ and $E(X_1^{-\beta})$ exist for $\beta > 4$. This completes the proof of lemma 4.
Below, we prove theorem 1 by using lemmas 3 and 4.
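(i) The definition of the stopping rule $N_d$ in (12) yields

$$\left(\frac{z_{\alpha/2}}{d}\right)^2 V_{N_d}^2 \le N_d \le m\, I(N_d = m) + \left[\left(\frac{z_{\alpha/2}}{d}\right)^2 \left(V_{N_d - 1}^2 + (N_d - 1)^{-1}\right) + 1\right] I(N_d > m). \qquad (18)$$

Since $N_d \to \infty$ a.s. as $d \downarrow 0$ and $V_n \to \xi$ a.s. as $n \to \infty$, by theorem 2.1 of Gut (2009), $V_{N_d} \to \xi$ a.s. as $d \downarrow 0$. Hence, dividing all sides of (18) by $C$ and letting $d \downarrow 0$, we prove $N_d / C \to 1$ a.s. as $d \downarrow 0$.
(ii) Since $N_d / C$ is dominated by a random variable whose expectation is finite by lemma 4, and $N_d / C \to 1$ a.s. as $d \downarrow 0$, by the dominated convergence theorem we conclude that $\lim_{d \downarrow 0} E(N_d / C) = 1$. This completes the proof of theorem 1.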

Proof of Theorem 2
In order to show that our procedure satisfies the asymptotic consistency property, we will derive an Anscombe-type random central limit theorem for the Gini index. This requires the usual central limit theorem for the Gini index and the uniform continuity in probability (u.c.i.p.) condition. For details about the u.c.i.p. condition, we refer to Anscombe (1953), Sproule (1969), Isogai (1986), and Mukhopadhyay and Chattopadhyay (2012).