Abstract
To gain insight into various phenomena of interest, cumulative distribution functions (CDFs) can be used to analyze survey data. The purpose of this study was to present an efficient ratio-cum-exponential estimator for estimating a population CDF using auxiliary information under two scenarios of non-response. Expressions for the bias and mean squared error (MSE) were derived up to the first order of approximation. The proposed estimator was compared, theoretically and empirically, with the modified estimators. Under the specific conditions derived, the proposed estimator was found to be better than the modified estimators based on the percent-relative efficiency (PRE) and MSE criteria.
Keywords:
auxiliary information; exponential estimator; sub-sampling of non-respondents; cumulative distribution function; non-response
MSC:
62D05; 62G05
1. Introduction
It is a well-accepted fact in survey sampling that, under certain conditions, auxiliary information can provide precise estimates of population parameters such as the mean, median, standard deviation, total, quantiles, and the cumulative distribution function (CDF). If a strong linear correlation is observed between the study variable Y and the auxiliary variable X, researchers often use traditional estimators, such as the ratio, product, and regression estimators, to estimate the population mean. The literature includes a significant amount of work on estimating different population parameters; for example, see [1,2,3,4,5,6,7]. These studies propose improved ratio-, product-, and regression-type estimators for estimating the mean and variance of a population using auxiliary variables.
Non-response is an issue that cannot be avoided in complex sample surveys, and it is especially common in surveys that involve human responses. Language problems, inaccurate return addresses, a lack of information, and the sensitivity of the survey question(s), among many other reasons, can play a part in causing this issue. For example, an individual may be reluctant to provide salary information. Non-response is more prevalent in postal surveys than in special canvasser surveys. The term non-response refers to the inability to measure some of the units in a sample survey. Non-response can reduce the accuracy of an estimator and increase its bias.
To cope with the non-response problem, various researchers have proposed several measures, such as the weighting adjustment approach suggested by Oh and Scheuren [8]; the imputation techniques provided by Kalton [9] and Kalton and Maligalig [10]; and the approach of sub-sampling non-respondents recommended by Hansen and Hurwitz [11]. Various researchers have attempted to reduce the bias and to improve the efficiency of estimators of a population mean in the presence of non-response. Some significant references on estimating the population mean utilizing auxiliary variables in the presence of non-response include [12,13,14,15,16,17].
Although there is extensive literature on estimating different population parameters, estimation of a population CDF based on auxiliary information has received less attention. The CDF is becoming increasingly important in survey sampling because statisticians are frequently interested in the proportion of population units for which the study variable falls below or above a given value. For example, policymakers may want to know whether the percentage of educated women in Pakistan is greater than or equal to 50%, or the proportion of individuals with a weekly income of 100 USD or more in a developing country. Similarly, a psychiatrist may be interested in knowing what proportion of children spend one or more hours a day on their phones. Several studies have revealed that spending more than an hour a day on phones or smart devices has a significant relation with psychological problems among children, such as anxiety, loneliness, and depression.
Hence, it has become necessary to estimate the finite population CDF. Accordingly, Singh et al. [18], Muñoz et al. [19], Yaqub and Shabbir [20,21], Hussain et al. [22], and Hussain et al. [23] have worked on estimating the population CDF using auxiliary information.
2. Sampling Design and Notations
2.1. Notations for the CDF under SRS
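Consider a finite population of N distinct units, and let $y_i$ and $x_i$ be the values of the research variable Y and the auxiliary variable X, respectively, on the $i$th unit ($i = 1, 2, \ldots, N$). For every pair of thresholds $t_y$ and $t_x$, where $(t_y, t_x) \in \mathbb{R}^2$, the population CDFs of Y and X are defined, respectively, by
$$F_Y(t_y) = \frac{1}{N}\sum_{i=1}^{N} I(y_i \le t_y), \qquad F_X(t_x) = \frac{1}{N}\sum_{i=1}^{N} I(x_i \le t_x),$$
where I(.) is an indicator variable. Each CDF is therefore an average of Bernoulli distributed indicator variables.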
Theorem 1.
In SRS, = is a hypergeometrically distributed variable with mean and variance for Y given, respectively, by
where we have the following:
the number of units in the population that belong to and ;
the number of units in the population that belong to and ;
the number of units in the population with and ; and
the number of units in the population that belong to and .
Theorem 1 can be proved easily along the lines of García et al. [24].
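As an illustration of the definitions above, the following minimal Python sketch computes a CDF value as the mean of indicator variables and checks, by enumerating every SRSWOR sample from a toy population, that the sample CDF is unbiased and has the hypergeometric-type variance. The function names, the toy population, and the threshold are hypothetical and are not taken from the paper; the variance formula used is the standard finite-population result for a sample proportion, (N − n)/(N − 1) · F(1 − F)/n.

```python
from itertools import combinations


def cdf_at(values, t):
    """Proportion of values <= t, i.e., the mean of the indicators I(v <= t)."""
    return sum(1 for v in values if v <= t) / len(values)


def srswor_cdf_variance(F, N, n):
    """Variance of the sample CDF at a fixed threshold under SRSWOR.

    F is the population CDF value at the threshold, N the population size,
    and n the sample size (standard hypergeometric result, assumed here).
    """
    return (N - n) / (N - 1) * F * (1 - F) / n


def enumerate_srswor(values, t, n):
    """Exact mean and variance of the sample CDF over all SRSWOR samples of size n."""
    estimates = [cdf_at(sample, t) for sample in combinations(values, n)]
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return mean, var


if __name__ == "__main__":
    population = [3, 7, 1, 9, 4, 6]  # hypothetical toy population
    F = cdf_at(population, 4)
    mean, var = enumerate_srswor(population, 4, 3)
    print(F, mean, var, srswor_cdf_variance(F, len(population), 3))
```

Exhaustive enumeration is only feasible for very small populations; it is used here purely to make the unbiasedness and variance statements concrete.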
Lemma 1.
For a large sample size is defined as
Let us consider that and are the population variances of and , respectively.
Let be the population covariance between and , then we have the following:
is the population coefficient of variation of , and
;
is the population coefficient of variation of , and
;
is the phi-coefficient of correlation between and .
2.2. Notation for the CDF with Non-Response under an SRS Design
Consider the case where a finite population of N units is divided into two groups: a respondents' group of $N_1$ units and a non-respondents' group of $N_2$ units, where $N = N_1 + N_2$. A sample of size ℓ is drawn from the target population using SRSWOR, and it is further assumed that only $\ell_1$ out of ℓ units respond, while $\ell_2 = \ell - \ell_1$ units do not. A sub-sample, also referred to as the 2nd phase sample, of $q = \ell_2/k$ units, where $k > 1$, is then taken from the $\ell_2$ non-respondents for interviewing. This way of obtaining responses from non-respondents is also referred to as the canvasser method. Hence, the total number of responses is $\ell_1 + q$, collected from ℓ units, and only $\ell_2 - q$ units are left as non-respondents who are not selected in the 2nd phase sample. Following Hansen and Hurwitz [11], a population CDF in the presence of non-response can be defined as follows:
Similarly, let
where and In addition, we have the following:
is the population CDF of for the response group;
is the population CDF of for the non-response group;
is the population CDF of for the response group;
is the population CDF of for the non-response group.
Yaqub and Shabbir [20] briefly studied the unbiased estimator of the population CDF of the research variable when there was non-response in the sample.
Let the sample CDF be an unbiased estimator of the population CDF , based on ℓ units in the presence of non-response.
By using the Hansen and Hurwitz [11] approach, is defined as
where and . In addition, we have the following:
denotes the sample CDF based on responding units out of ℓ units;
denotes the sample CDF based on q responding units out of non-response units.
Theorem 2.
The mean and variance of are defined as follows:
- is an unbiased estimator of , i.e., ;
- The variance of is defined by
where and . Theorem 2 can be proved along the lines of [20].
Similarly, for the supplemental variable X, the estimator is defined as
In addition, we have the following:
denotes the sample CDF based on responding units out of ℓ units;
denotes the sample CDF based on q responding units out of non-response units.
Lemma 2.
Along the lines of Theorem 2, the mean and variance of are defined as follows:
- is an unbiased estimator of , i.e., ;
- The variance of is defined by
- The covariance between and is given by
(For the proof see [20]).
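The sub-sampling scheme described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard Hansen and Hurwitz [11] weighting (ℓ1 · F1 + ℓ2 · F2) / ℓ applied to indicator variables, and the standard variance expression λ·S² + θ·S²(2) with λ = 1/ℓ − 1/N and θ = W2(k − 1)/ℓ. All function and parameter names are hypothetical.

```python
def cdf_at(values, t):
    """Proportion of values <= t (mean of the indicators I(v <= t))."""
    return sum(1 for v in values if v <= t) / len(values)


def hansen_hurwitz_cdf(respondents, followed_up, n_nonrespondents, t):
    """Hansen-Hurwitz-type estimate of the population CDF at threshold t.

    respondents      : values observed on the l1 first-phase respondents
    followed_up      : values observed on the q sub-sampled non-respondents
    n_nonrespondents : l2, the number of first-phase non-respondents
    """
    l1 = len(respondents)
    l2 = n_nonrespondents
    if l1 + l2 == 0:
        raise ValueError("the first-phase sample is empty")
    if l2 > 0 and not followed_up:
        raise ValueError("a follow-up sub-sample is needed when l2 > 0")
    F1 = cdf_at(respondents, t) if l1 > 0 else 0.0
    F2 = cdf_at(followed_up, t) if l2 > 0 else 0.0
    # Weight each group by its share of the first-phase sample.
    return (l1 * F1 + l2 * F2) / (l1 + l2)


def hansen_hurwitz_variance(N, l, W2, k, S2, S2_nonresp):
    """Standard Hansen-Hurwitz variance: lambda * S2 + theta * S2_nonresp.

    W2 is the population share of non-respondents, k > 1 the inverse
    sub-sampling rate, S2 the population variance of the indicator, and
    S2_nonresp its variance within the non-response group.
    """
    lam = 1.0 / l - 1.0 / N
    theta = W2 * (k - 1.0) / l
    return lam * S2 + theta * S2_nonresp


if __name__ == "__main__":
    # Hypothetical data: 4 respondents, 6 non-respondents, 3 of them followed up.
    print(hansen_hurwitz_cdf([1, 2, 3, 4], [2, 5, 8], 6, t=3))
```

The same function can be applied to the auxiliary variable X when X is also subject to non-response.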
In addition, let us define the following:
is the population variance of for the response group;
is the population variance of for the non-response group;
is the population variance of for the response group;
is the population variance of for the non-response group;
is the population coefficient of variation of for the response group;
is the population coefficient of variation of for the response group;
is the population coefficient of variation of for the non-response group;
is the population coefficient of variation of for the non-response group;
is the population covariance between and for the response group;
is the population covariance between and for the non-response group.
The following relative error terms are taken into account to determine the biases and MSEs of the existing and proposed estimators:
such that for , where is mathematical expectation. Utilizing approximation up to the first order we have the following:
Two scenarios are considered in the presence of non-response:
- Scenario I refers to non-response on both the study and auxiliary variables;
- Scenario II refers to non-response on the study variable only.
3. Some Modified Estimators for the CDF under Non-Response
3.1. Modified Estimators in Scenario I
In this section, some existing estimators for population mean estimation are modified for the estimation of a population CDF using SRS under Scenario I, i.e., non-response is present in both the study and the auxiliary variables. Furthermore, the biases and MSEs of the modified estimators are derived to the first order of approximation.
1. The Cochran [25] ratio estimator is modified for , and given by
2. Singh et al. [26] proposed an exponential estimator under non-response on both the study and auxiliary variables along the lines of Bahl and Tuteja [27]. The modified form of [26] for estimating the CDF is given by
The bias and MSE of Equation (6) to the first order of approximation are given as
3. The modified regression estimator for is provided as
where is the regression coefficient. Moreover, Equation (8) is an unbiased estimator of .
In addition, at the optimum value , the minimum variance of is given as
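The three modified estimators listed above can be sketched as plain functions. Because the displayed equations are not available here, the forms below are assumptions: they are the textbook ratio, Bahl and Tuteja-type exponential, and regression (difference) forms, applied to CDF estimates. The names `F_y_hat`, `F_x_hat`, `F_x`, and `b` are hypothetical placeholders for the estimated CDF of Y, the estimated CDF of X, the known population CDF of X, and the regression coefficient.

```python
import math


def ratio_estimator(F_y_hat, F_x_hat, F_x):
    """Classical ratio form: scale the estimate of Y by F_x / F_x_hat."""
    return F_y_hat * F_x / F_x_hat


def exponential_estimator(F_y_hat, F_x_hat, F_x):
    """Bahl-Tuteja-type exponential ratio form."""
    return F_y_hat * math.exp((F_x - F_x_hat) / (F_x + F_x_hat))


def regression_estimator(F_y_hat, F_x_hat, F_x, b):
    """Regression (difference) form with a given coefficient b."""
    return F_y_hat + b * (F_x - F_x_hat)


if __name__ == "__main__":
    # Hypothetical inputs: the auxiliary CDF is under-estimated by the sample.
    args = (0.5, 0.4, 0.5)
    print(ratio_estimator(*args))
    print(exponential_estimator(*args))
    print(regression_estimator(*args, b=0.8))
```

Under Scenario I, `F_x_hat` would be the estimate affected by non-response on X; under Scenario II it would be the estimate based on all sampled units. When `F_x_hat` equals `F_x`, all three functions return `F_y_hat` unchanged.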
3.2. Modified Estimators in Scenario II
In this section, some of the existing estimators used to estimate the mean of a population are modified for the estimation of the population CDF under Scenario II, i.e., non-response is present only on the study variable. Furthermore, the biases and MSEs of these modified estimators are obtained to the first order of approximation. Let be the estimator of the population CDF under Scenario II.
1. The Cochran [25] ratio estimator is modified for under Scenario II, and given as
2. The exponential estimator of Singh et al. [26] is modified for estimating and is provided as
The bias and MSE of Equation (12) up to the first order of approximation are given as
3. The modified regression estimator for in Scenario II is provided as
where is the regression coefficient. Moreover, Equation (14) is an unbiased estimator of . In addition, at the optimum value , the minimum variance of is given as
or
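The minimum variance of a regression-type estimator, in either scenario, follows from a short generic argument. The derivation below is a sketch in notation assumed here for illustration only (the symbols $T$, $\hat{F}_Y$, $\hat{F}_X$, $V_Y$, $V_X$, $C_{YX}$, and $\rho$ are placeholders rather than the paper's own): $\hat{F}_Y$ and $\hat{F}_X$ denote unbiased estimators of the CDFs of Y and X, with variances $V_Y$ and $V_X$ and covariance $C_{YX}$, and $b$ is treated as a fixed constant.

```latex
\begin{align*}
T(b) &= \hat{F}_Y + b\,(F_X - \hat{F}_X), \qquad E[T(b)] = F_Y,\\
\operatorname{Var}[T(b)] &= V_Y - 2b\,C_{YX} + b^{2}\,V_X,\\
\frac{d}{db}\operatorname{Var}[T(b)] = 0 \;&\Longrightarrow\; b_{\mathrm{opt}} = \frac{C_{YX}}{V_X},\\
\operatorname{Var}_{\min}[T] &= V_Y - \frac{C_{YX}^{2}}{V_X} = V_Y\,(1 - \rho^{2}), \qquad \rho = \frac{C_{YX}}{\sqrt{V_Y V_X}}.
\end{align*}
```

The two scenarios differ only in which variance and covariance expressions are substituted for $V_X$ and $C_{YX}$, since the auxiliary estimate is affected by non-response in Scenario I but not in Scenario II.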
4. Proposed Estimators
4.1. The Proposed Estimator in Scenario I
Following [17], an estimator for estimating a population CDF of the study variable under Scenario I is defined as
where is unknown and needs to be estimated such that the MSE is minimum.
Theorem 3.
Proof.
Equation (16) is expressed in error terms as
Expanding Equation (18) up to the first order of approximation yields
Keeping the terms up to the second power and extending the above equation, we get the following:
and
Squaring (20) and applying the expectation yields the MSE of ; we obtain
The MSE of (16) is obtained as
Hence the theorem is proved. □
Theorem 4.
The minimum MSE of is given as follows:
Proof.
Differentiating Equation (17) with respect to and simplifying it to obtain the optimal value of for minimal , we get
Substituting into (17) we obtain the minimal MSE of , such that
Hence the theorem is proved. □
4.2. Proposed Estimator in Scenario II
Motivated by [28], to estimate the population CDF in the presence of non-response under Scenario II, an estimator is proposed as
where is an unknown constant that needs to be determined such that the MSE is minimum.
Theorem 5.
The and are given, respectively, by the following:
Proof.
In error terms Equation (24) can be expressed as
Expanding Equation (26) to the first order of approximation yields
Keeping the terms up to the second power and extending the above equation, we get
and
Squaring (28), applying the expectation, and simplifying, we obtain
The MSE of is determined as
Hence the theorem is proved. □
Theorem 6.
The minimum MSE of is given as
Proof.
Differentiating Equation (25) with respect to , equating it to zero, and simplifying it to obtain the optimal value of for the minimum MSE, we get
Substituting into (25) we obtain the minimum MSE of as
Hence the theorem is proved. □
5. Efficiency Comparisons
The MSEs of the modified and proposed estimators are compared in this section.
5.1. Efficiency Comparisons under Scenario I
The proposed estimator under Scenario I is more efficient if we have the following:
5.2. Efficiency Comparisons under Scenario II
The proposed estimator under Scenario II is more efficient compared to existing modified estimators if we have the following:
6. Numerical Study
An empirical study is presented to evaluate the performance of the proposed estimators and some of the existing estimators using three different populations. The summary statistics for these populations are shown in Table 1, Table 2, and Table 3, respectively.
Table 1.
Summary statistics for Population I.
Table 2.
Summary statistics for Population II.
Table 3.
Summary statistics for Population III.
MSEs of the modified estimators and the proposed estimators are given in Table 4 and Table 5, with respect to Scenario I and Scenario II, respectively.
Table 4.
MSEs under Scenario I.
Table 5.
MSEs under Scenario II.
The proposed estimators and the modified estimators of a population CDF were compared to the variance of under both scenarios in terms of their percent-relative efficiencies (PREs) by using the following formula:
The PREs of the proposed estimators and the modified estimators are given in Table 6 and Table 7 for Scenario I and Scenario II, respectively.
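The PRE computation can be sketched as follows. This assumes the usual definition, in which the variance of the baseline unbiased estimator is divided by the MSE of the estimator under comparison and multiplied by 100; the function name and the numbers in the usage example are hypothetical and are not results from the paper.

```python
def percent_relative_efficiency(baseline_variance, estimator_mse):
    """PRE = 100 * Var(baseline estimator) / MSE(estimator being compared).

    Values above 100 indicate that the compared estimator is more
    efficient than the baseline.
    """
    if estimator_mse <= 0:
        raise ValueError("estimator_mse must be positive")
    return 100.0 * baseline_variance / estimator_mse


if __name__ == "__main__":
    # Hypothetical values: an MSE half the baseline variance gives a PRE near 200.
    print(percent_relative_efficiency(0.004, 0.002))
```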
Table 6.
PREs of estimators under Scenario I.
Table 7.
PREs of the estimators under Scenario II.
- It was observed that the PREs corresponding to the estimators , , , and declined.
- The PREs corresponding to the proposed estimators, and , and the modified regression estimators, and , showed that the proposed estimators were efficient estimators under both scenarios of non-response.
7. Conclusions
This study proposed an improved class of estimators for the estimation of a finite population CDF under two different scenarios of non-response using SRS. From theoretical and empirical comparisons, the proposed estimators were found to perform better, based on the larger PRE and smaller MSE criteria. Therefore, our study suggests the use of the proposed estimators for estimating the CDF in the presence of non-response. Limitations of this study are provided in Appendix A.
Author Contributions
Conceptualization, A.K. and A.S.; methodology, A.K. and A.S.; software, A.K.; formal analysis, A.K.; investigation, A.K. and A.S., F.S.A.-D. and M.M.A.A.; data curation, A.K.; writing—original draft preparation, A.K.; writing—review and editing, A.K. and A.S.; supervision, A.S.; funding acquisition, F.S.A.-D. and M.M.A.A. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data utilized to support the numerical conclusions of this study are available within the article. The data can also be retrieved by searching the provided data sources.
Acknowledgments
The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through the Large Groups Program, under the grant number (RGP.2/23/44). This study was supported via funding from Prince Sattam bin Abdulaziz University, under project number (PSAU/2023/R/1444).
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| CDF | cumulative distribution function; |
| SRS | simple random sampling; |
| SRSWOR | simple random sampling without replacement; |
| MSE | mean square error; |
| PRE | percent-relative efficiency. |
Appendix A. Limitations
The proposed class of estimators was designed to perform well under certain conditions. When these conditions are not met, the proposed estimators may not outperform the competing estimators. For example, we have the following:
(1) The performance of the estimator relies on the relationship between the study variable and the auxiliary variable. If this relationship is weak, the proposed estimator may not perform well.
(2) The proposed estimator assumes that the underlying distribution of the study variable, as well as that of the auxiliary variable, has a certain shape, such as exponential or normal. In situations where there are large gaps in the data, the distribution is highly skewed, or the risk behavior of the distribution is non-additive, the proposed estimators may not perform well.
(3) The choice of an auxiliary variable can have a significant impact on the performance of the estimator. The auxiliary variable should be strongly related to the variable of interest and should be easy to measure for all units in the sample.
(4) The proposed estimator is expected to perform well when the sample size is large enough to provide sufficient precision in the estimates.
References
1. Shabbir, J.; Gupta, S. Estimation of finite population mean in simple and stratified random sampling using two auxiliary variables. Commun. Stat.-Theory Methods 2017, 46, 10135–10148.
2. Haq, A.; Khan, M.; Hussain, Z. A new estimator of finite population mean based on the dual use of the auxiliary information. Commun. Stat.-Theory Methods 2017, 46, 4425–4436.
3. Kaur, H.; Brar, S.S.; Sharma, M. Efficient ratio type estimators of population variance through linear transformation in simple and stratified random sampling. Int. J. Stat. Reliab. Eng. 2018, 4, 144–153.
4. Ahmad, S.; Shabbir, J. Use of extreme values to estimate finite population mean under pps sampling scheme. J. Reliab. Stat. Stud. 2018, 43, 99–112.
5. Ahmad, S.; Hussain, S.; Yasmeen, U.; Aamir, M.; Shabbir, J.; El-Morshedy, M.; Al-Bossly, A.; Ahmad, Z. A simulation study: Using dual ancillary variable to estimate population mean under stratified random sampling. PLoS ONE 2022, 17, e0275875.
6. Niaz, I.; Sanaullah, A.; Saleem, I.; Shabbir, J. An improved efficient class of estimators for the population variance. Concurr. Comput. Pract. Exp. 2022, 34, e6620.
7. Shabbir, J.; Onyango, R. Use of an efficient unbiased estimator for finite population mean. PLoS ONE 2022, 17, e0270277.
8. Oh, H.; Scheuren, F. Weighting adjustments for unit non-response. Incomplete Data Sample Surv. 1983, 2, 143–184.
9. Kalton, G. Handling wave nonresponse in panel surveys. J. Off. Stat. 1986, 2, 303–314.
10. Kalton, G.; Maligalig, D.S. A comparison of methods of weighting adjustment for nonresponse. In Proceedings of the 1991 Annual Research Conference, Arlington, TX, USA, 17–20 March 1991; US Bureau of the Census: Washington, DC, USA, 1991; Volume 409428.
11. Hansen, M.H.; Hurwitz, W.N. The problem of non-response in sample surveys. J. Am. Stat. Assoc. 1946, 41, 517–529.
12. Sanaullah, A.; Noor-ul Amin, M.; Hanif, M. Generalized exponential-type ratio-cum-ratio and product-cum-product estimators for population mean in the presence of non-response under stratified two-phase random sampling. Pak. J. Stat. 2015, 31, 71–94.
13. Ahmed, S.; Shabbir, J.; Gupta, S. Use of scrambled response model in estimating the finite population mean in presence of non response when coefficient of variation is known. Commun. Stat.-Theory Methods 2017, 46, 8435–8449.
14. Saleem, I.; Sanaullah, A.; Hanif, M. A Generalized class of estimators for estimating population mean in the presence of non-Response. J. Stat. Theory Appl. 2018, 17, 616–626.
15. Makhdum, M.; Sanaullah, A.; Hanif, M. A modified regression-cum-ratio estimator of population mean of a sensitive variable in the presence of non-response in simple random sampling. J. Stat. Manag. Syst. 2020, 23, 495–510.
16. Bhushan, S.; Pandey, A.P. An efficient estimation procedure for the population Mean under non-response. Statistica 2020, 79, 363–378.
17. Ünal, C.; Kadilar, C. Exponential type estimator for the population mean in the presence of non-response. J. Stat. Manag. Syst. 2020, 23, 603–615.
18. Singh, H.P.; Singh, S.; Kozak, M. A family of estimators of finite-population distribution function using auxiliary information. Acta Appl. Math. 2008, 104, 115–130.
19. Muñoz, J.; Arcos, A.; Álvarez, E.; Rueda, M. New ratio and difference estimators of the finite population distribution function. Math. Comput. Simul. 2014, 102, 51–61.
20. Yaqub, M.; Shabbir, J. Estimation of population distribution function in the presence of non-response. Hacet. J. Math. Stat. 2018, 47, 471–511.
21. Yaqub, M.; Shabbir, J. Estimation of population distribution function involving measurement error in the presence of non response. Commun. Stat.-Theory Methods 2020, 49, 2540–2559.
22. Hussain, S.; Ahmad, S.; Saleem, M.; Akhtar, S. Finite population distribution function estimation with dual use of auxiliary information under simple and stratified random sampling. PLoS ONE 2020, 15, e0239098.
23. Hussain, S.; Akhtar, S.; El-Morshedy, M. Modified estimators of finite population distribution function based on dual use of auxiliary information under stratified random sampling. Sci. Prog. 2022, 105, 00368504221128486.
24. García, M.R.; Cebrián, A.A.; Rodríguez, E. Quantile interval estimation in finite population using a multivariate ratio estimator. Metrika 1998, 47, 203–213.
25. Cochran, W. The estimation of the yields of cereal experiments by sampling for the ratio of grain to total produce. J. Agric. Sci. 1940, 30, 262–275.
26. Singh, R.; Kumar, M.; Chaudhary, M.K.; Smarandache, F. Estimation of Mean in Presence of Non Response Using Exponential Estimator; Infinite Study: Dubai, United Arab Emirates, 2009.
27. Bahl, S.; Tuteja, R.K. Ratio and product type exponential estimators. J. Inf. Optim. Sci. 1991, 12, 159–164.
28. Singh, G.; Usman, M. Ratio-to-product exponential-type estimators under non-response. Jordan J. Math. Stat. (JJMS) 2019, 12, 593–616.
29. Singh, S. Advanced Sampling Theory with Applications: How Michael ‘Selected’ Amy, Volume 1; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2003.
30. Gujarati, D.N. Basic Econometrics; Tata McGraw-Hill Education: New York, NY, USA, 2009.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).