Calibration Estimation of Cumulative Distribution Function Using Robust Measures

Abstract: Outliers are observations that differ significantly from the other observations in a dataset. Datasets containing such observations are asymmetric in nature. The estimation of the cumulative distribution function (CDF) is an important statistical problem that is commonly discussed for symmetric datasets. However, the estimation of the CDF for asymmetric datasets is not a much-explored topic. In this article, we use calibration methodology with auxiliary information to modify the traditional stratification weight, and hence obtain efficient estimates of the CDF using robust measures, i.e., the mid-range and tri-mean, under different distance functions. A simulation study is carried out to assess the performance of the proposed and existing estimators using asymmetric real-life datasets.


Introduction
Finding the percentage of values of the study variable Y that are less than or equal to a specific value is important, and this leads to the problem of estimating the countable population CDF. In some cases, estimating the CDF itself is necessary. For instance, a soil scientist would be curious to discover how many people in a developing nation are living in poverty. Users of sample survey data frequently need to calculate the population CDF or, equivalently, the percentage of population elements whose values $y_i$ are less than or equal to a certain value $t_y$. For instance, we might be interested in the percentage of agricultural land where pesticide poisoning effects are less than zero or the percentage of filtration facilities where the arsenic present in potable water is less than zero. Such a percentage is a specific value of the population's CDF.
Formally, for a population of $M$ units, the CDF of $Y$ at $t_y$ is

$$F_Y(t_y) = \frac{1}{M}\sum_{i=1}^{M} I(y_i \le t_y),$$

where $I(y_i \le t_y) = 1$ for $y_i \le t_y$ and $I(y_i \le t_y) = 0$ for $y_i > t_y$. In surveys, we can frequently only measure the study variable for the items in a sample; hence, the typical estimation methods of the CDF depend solely on the choice of the sampling design and the sampled percentage of the population. $F_Y(t_y)$ can be estimated, for example under simple random sampling, by the sample proportion

$$\hat{F}_Y(t_y) = \frac{1}{n}\sum_{i \in s} I(y_i \le t_y),$$

where $s$ denotes the sample and $n$ its size.

Many researchers have estimated the CDF using data from one or more additional variables. Reference [1] first proposed a method for estimating the countable population CDF. Reference [2] obtained ratio and difference estimation methods for a population CDF under a general sample design using supplementary population variables. They demonstrated the benefits of the design-based estimation method over the model-based estimation method in the case of model misspecification, especially for large samples. Reference [3] developed a traditional as well as a prediction technique for estimating the CDF from survey data. Reference [4] proposed an estimator for the finite population CDF using the model-calibration pseudo-empirical likelihood technique. Reference [5] considered the issue of estimating the CDF and quantiles for a countable population using supplementary data. Reference [6] developed a generalized family of estimation methods for estimating the CDF using auxiliary variables. Reference [7] developed an efficient approach for the estimation of process variability by using the exponential technique. Reference [8] developed two new families for the estimation of the countable population CDF in the case of non-response under simple random sampling. They studied two different types of non-response situations: (i) non-response on both the study and supplementary data; and (ii) non-response just on the study data. The developed estimation methods were compared to existing estimation methods, both theoretically and numerically.
Reference [9] developed a new family of estimation methods for the finite CDF using the stratified random sampling (StRS) method. Reference [10] also proposed a generalized class of exponential factor type estimation methods for estimating the countable population CDF with supplementary information in the form of the average and rank of the supplementary information.
In recent years, the calibration estimation method has become an important area of study in survey sampling. By using auxiliary data, the calibration estimation technique increases the accuracy of estimates by adjusting the original design weights. The calibration estimation method is a procedure for adjusting survey sampling weights in order to estimate population means, totals, etc. with the help of supplementary data. The pioneering article on calibration was written by Reference [11]. Reference [12] developed a calibration estimation method for mean estimation. Reference [13] proposed a calibration estimation method for estimating the population mean in StRS with various calibration conditions based on supplementary information. Reference [14] proposed a novel calibration estimation method for the population parameter of the study variable using newly calibrated weights for two supplementary variables under StRS. Reference [15] proposed a distance function and used it to obtain a calibration estimation method of the population mean in StRS. References [16,17] extended the work by utilizing linear moments' characteristics. Reference [18] developed two novel classes of ratio- and regression-type estimation methods of population variation under SRSWOR by integrating knowledge on nonconventional and robust dispersion measures of supplementary data. Reference [19] proposed a new robust calibration estimation method for estimating the population mean under StRS. The methodology of Reference [12], however, has not yet received much attention for CDF estimation.
This article proposes a new calibration estimation method for the population CDF under StRS using new calibration conditions that include robust measures. The use of robust measures makes the calibration estimator of the CDF more efficient. The rest of the article is organized as follows: In Section 2, adapted estimators of the CDF using robust measures are presented. In Section 3, the proposed estimators of the CDF using robust measures are developed. In Section 4, a numerical study is conducted. The article concludes in Section 5.

First Adapted Calibration Estimator of CDF Using Robust Measure
Outliers can be caused by a variety of factors, such as measurement errors, sampling bias, or extreme values. Datasets that contain outliers are asymmetric in nature: they can have a long tail on one side or the other, indicating that there are more extreme values in one direction than in the other. Outliers can have a major impact on statistical analyses, as they can distort summary statistics and lead to misleading conclusions. In this article, we therefore use robust measures, namely the mid-range and tri-mean, to reduce the impact of outliers.
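Let $\vartheta = \{1, 2, \ldots, M\}$ be a finite population of $M$ units, which is divided into $\gamma$ homogeneous strata, where the size of the $\phi$th stratum is $M_\phi$, for $\phi = 1, 2, \ldots, \gamma$, in such a manner that $\sum_{\phi=1}^{\gamma} M_\phi = M$. A sample of size $n_\phi$ is drawn from the $\phi$th stratum.

The first robust measure included in this article is the mid-range ($M_R$), defined as

$$M_R = \frac{X_{(1)} + X_{(M)}}{2},$$

where $X_{(1)}$ is the minimum value in a population of size $M$ and $X_{(M)}$ is the maximum value in a population of size $M$. The next measure included in this article is the tri-mean ($T_M$), which is the weighted average of the population median and two quartiles and is defined as

$$T_M = \frac{Q_1 + 2Q_2 + Q_3}{4},$$

where $Q_2$ is the median and $Q_1$ and $Q_3$ are the first and third quartiles. … denote the population variance of the supplementary variable in the $\phi$th stratum.

Under this StRS, the traditional unbiased estimator of the CDF is given by

$$\hat{F}(t_y) = \sum_{\phi=1}^{\gamma} W_\phi \hat{F}_{y\phi}(t_y),$$

where $W_\phi = M_\phi / M$ is the stratum weight and $\hat{F}_{y\phi}(t_y) = \frac{1}{n_\phi}\sum_{i=1}^{n_\phi} I(y_i \le t_y)$ is the sample CDF estimate of Y in the $\phi$th stratum.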
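The quantities defined above can be illustrated with a minimal Python sketch. The function names are hypothetical and not taken from the article, and the quartile convention (the `inclusive` method of `statistics.quantiles`) is an assumption, since the text does not state which quartile definition is used.

```python
from statistics import quantiles


def mid_range(values):
    """Mid-range: average of the smallest and largest value."""
    return (min(values) + max(values)) / 2


def tri_mean(values):
    """Tri-mean (Q1 + 2*Q2 + Q3) / 4; the 'inclusive' quartile rule is an assumption."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4


def sample_cdf(values, t):
    """Proportion of values less than or equal to t (mean of the indicator I(y <= t))."""
    return sum(1 for v in values if v <= t) / len(values)


def stratified_cdf(samples, stratum_sizes, t):
    """Traditional stratified estimator: sum over strata of W_phi * F_hat_phi(t)."""
    total = sum(stratum_sizes)
    return sum((m / total) * sample_cdf(s, t) for s, m in zip(samples, stratum_sizes))
```

For example, `stratified_cdf([[1, 2, 3, 4], [10, 20]], [30, 10], 2)` weights the two stratum-level sample CDFs by 30/40 and 10/40.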

First Adapted Calibration Estimator of CDF
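Taking motivation from Reference [15], the first adapted estimator is as follows:

$$G_{RM}(A_1) = \sum_{\phi=1}^{\gamma} \Omega^{A_1}_{\phi}\,\hat{F}_{y\phi}(t_y), \qquad (1)$$

where $\hat{F}_{y\phi}(t_y)$ is the sample CDF of the study variable in the $\phi$th stratum. Further, $\Omega^{A_1}_{\phi}$ is the calibrated weight, chosen to minimize the sum of weighted squared deviations of the calibrated weights from the stratum weights $W_\phi$, denoted $\Delta(\Omega^{A_1}_{\phi}, W_\phi)$ (Equation (2)), and to satisfy the calibration constraint

$$\sum_{\phi=1}^{\gamma} \Omega^{A_1}_{\phi}\,\hat{M}_{R\phi(x)} = \sum_{\phi=1}^{\gamma} W_\phi\,M_{R\phi(x)}, \qquad (3)$$

where $\hat{M}_{R\phi(x)}$ and $M_{R\phi(x)}$ are the sample and population mid-range of the supplementary variable in the $\phi$th stratum, and $Q_\phi$ are suitably chosen weights that determine different types of estimation methods. The Lagrange function (Equation (4)) combines the distance function with this constraint through the Lagrange multiplier $\lambda_A$. Setting $\partial \Delta(\Omega^{A_1}_{\phi}, W_\phi) / \partial \Omega^{A_1}_{\phi} = 0$ gives the calibrated weight in terms of $\lambda_A$ (Equation (5)). Substituting Equation (5) in Equation (3) and solving for $\lambda_A$ gives Equation (6). Substituting Equation (6) in Equation (5), we obtain the calibration weight (Equation (7)). Substituting Equation (7) in Equation (1), we obtain the calibrated estimator of the CDF.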

Second Adapted Calibration Estimator of CDF
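Taking motivation from Reference [15], the second adapted estimator is as follows:

$$G_{RM}(A_2) = \sum_{\phi=1}^{\gamma} \Omega^{A_2}_{\phi}\,\hat{F}_{y\phi}(t_y), \qquad (8)$$

where $\hat{F}_{y\phi}(t_y)$ is the sample CDF of the study variable in the $\phi$th stratum. Further, $\Omega^{A_2}_{\phi}$ is the calibrated weight, chosen to minimize the sum of weighted squared deviations of the calibrated weights from the stratum weights $W_\phi$ (Equation (9)), subject to the calibration constraints

$$\sum_{\phi=1}^{\gamma} \Omega^{A_2}_{\phi}\,\hat{M}_{R\phi(x)} = \sum_{\phi=1}^{\gamma} W_\phi\,M_{R\phi(x)}, \qquad (10)$$

$$\sum_{\phi=1}^{\gamma} \Omega^{A_2}_{\phi}\,\hat{T}_{M\phi(x)} = \sum_{\phi=1}^{\gamma} W_\phi\,T_{M\phi(x)}, \qquad (11)$$

where $\hat{M}_{R\phi(x)}$, $M_{R\phi(x)}$, $\hat{T}_{M\phi(x)}$, and $T_{M\phi(x)}$ are the sample and population mid-range and tri-mean of the supplementary variable in the $\phi$th stratum. The Lagrange function (Equation (12)) combines the distance function with both constraints through the Lagrange multipliers $\lambda_{A_1}$ and $\lambda_{A_2}$. Setting its derivative with respect to $\Omega^{A_2}_{\phi}$ to zero (Equation (13)) gives the calibration weight in terms of the multipliers (Equation (14)). Substituting Equation (14) in Equations (10) and (11), respectively, we obtain a system of two linear equations in the multipliers. Solving this system for the lambdas and substituting the solutions into Equation (14) gives the calibrated weights. Writing these weights in Equation (8), we obtain the calibration estimator of the CDF, expressed in terms of beta coefficients.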

First Proposed Calibration Estimator of CDF
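Taking inspiration from the first adapted estimator, we propose the following CDF estimator:

$$G_{RM}(P_1) = \sum_{\phi=1}^{\gamma} \Omega^{P_1}_{\phi}\,\hat{F}_{y\phi}(t_y), \qquad (15)$$

where $\hat{F}_{y\phi}(t_y)$ is the sample CDF of the study variable in the $\phi$th stratum. Further, $\Omega^{P_1}_{\phi}$ is the calibrated weight, obtained by minimizing the chi-square distance function

$$\Delta(\Omega^{P_1}_{\phi}, W_\phi) = \sum_{\phi=1}^{\gamma} \frac{(\Omega^{P_1}_{\phi} - W_\phi)^2}{W_\phi Q_\phi}, \qquad (16)$$

subject to the calibration constraint

$$\sum_{\phi=1}^{\gamma} \Omega^{P_1}_{\phi}\,\hat{M}_{R\phi(x)} = \sum_{\phi=1}^{\gamma} W_\phi\,M_{R\phi(x)}. \qquad (17)$$

Note that $W_\phi = M_\phi / M$ denotes the traditional stratum weight, and $\hat{M}_{R\phi(x)}$ and $M_{R\phi(x)}$ are the sample and population mid-range of the auxiliary variable in the $\phi$th stratum. The Lagrange function is given by

$$\Delta = \sum_{\phi=1}^{\gamma} \frac{(\Omega^{P_1}_{\phi} - W_\phi)^2}{W_\phi Q_\phi} - 2\lambda_P \left( \sum_{\phi=1}^{\gamma} \Omega^{P_1}_{\phi}\,\hat{M}_{R\phi(x)} - \sum_{\phi=1}^{\gamma} W_\phi\,M_{R\phi(x)} \right), \qquad (18)$$

where $\lambda_P$ is the Lagrange multiplier. Setting $\partial \Delta / \partial \Omega^{P_1}_{\phi} = 0$, we obtain

$$\Omega^{P_1}_{\phi} = W_\phi + \lambda_P W_\phi Q_\phi \hat{M}_{R\phi(x)}. \qquad (19)$$

Substituting Equation (19) in Equation (17) and solving for $\lambda_P$, we have

$$\lambda_P = \frac{\sum_{k=1}^{\gamma} W_k \left( M_{Rk(x)} - \hat{M}_{Rk(x)} \right)}{\sum_{k=1}^{\gamma} W_k Q_k \hat{M}^2_{Rk(x)}}. \qquad (20)$$

Substituting Equation (20) in Equation (19), we obtain the calibration weight as

$$\Omega^{P_1}_{\phi} = W_\phi + W_\phi Q_\phi \hat{M}_{R\phi(x)}\, \frac{\sum_{k=1}^{\gamma} W_k \left( M_{Rk(x)} - \hat{M}_{Rk(x)} \right)}{\sum_{k=1}^{\gamma} W_k Q_k \hat{M}^2_{Rk(x)}}. \qquad (21)$$

Substituting Equation (21) in Equation (15), we obtain the calibrated estimator of the CDF, as given below:

$$G_{RM}(P_1) = \sum_{\phi=1}^{\gamma} W_\phi \hat{F}_{y\phi}(t_y) + \frac{\sum_{\phi=1}^{\gamma} W_\phi Q_\phi \hat{M}_{R\phi(x)} \hat{F}_{y\phi}(t_y)}{\sum_{\phi=1}^{\gamma} W_\phi Q_\phi \hat{M}^2_{R\phi(x)}} \sum_{\phi=1}^{\gamma} W_\phi \left( M_{R\phi(x)} - \hat{M}_{R\phi(x)} \right).$$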
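The single-constraint calibration step can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes the chi-square distance has the standard form $\sum_\phi (\Omega_\phi - W_\phi)^2 / (W_\phi Q_\phi)$ and that the constraint equates the calibrated-weight sum of sample mid-ranges to the stratum-weight sum of population mid-ranges. All function and argument names are hypothetical.

```python
def calibrate_one_constraint(W, Q, m_hat, m_pop):
    """Calibrated weights under an assumed chi-square distance and one constraint.

    W      stratum weights W_phi
    Q      tuning weights Q_phi
    m_hat  sample mid-range of the auxiliary variable per stratum
    m_pop  population mid-range of the auxiliary variable per stratum
    """
    # Numerator of the multiplier: sum of W * (population - sample mid-range)
    gap = sum(w * (mp - mh) for w, mp, mh in zip(W, m_pop, m_hat))
    # Denominator of the multiplier: sum of W * Q * (sample mid-range)^2
    denom = sum(w * q * mh ** 2 for w, q, mh in zip(W, Q, m_hat))
    lam = gap / denom
    # Each weight is shifted in proportion to W * Q * sample mid-range
    return [w + lam * w * q * mh for w, q, mh in zip(W, Q, m_hat)]


def calibrated_cdf(omega, F_hat):
    """Calibrated CDF estimate: sum of calibrated weight times stratum sample CDF."""
    return sum(o * f for o, f in zip(omega, F_hat))
```

When the sample mid-ranges equal the population mid-ranges in every stratum, the multiplier is zero and the calibrated weights reduce to the stratum weights, so the estimator coincides with the traditional stratified estimator.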

Second Proposed Calibration Estimator of CDF
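Taking inspiration from the second adapted estimator, we propose the following CDF estimator:

$$G_{RM}(P_2) = \sum_{\phi=1}^{\gamma} \Omega^{P_2}_{\phi}\,\hat{F}_{y\phi}(t_y), \qquad (22)$$

where $\hat{F}_{y\phi}(t_y)$ is the sample CDF of the study variable in the $\phi$th stratum. Further, $\Omega^{P_2}_{\phi}$ is the calibrated weight, obtained by minimizing the chi-square distance function

$$\Delta(\Omega^{P_2}_{\phi}, W_\phi) = \sum_{\phi=1}^{\gamma} \frac{(\Omega^{P_2}_{\phi} - W_\phi)^2}{W_\phi Q_\phi}, \qquad (23)$$

subject to the calibration constraints

$$\sum_{\phi=1}^{\gamma} \Omega^{P_2}_{\phi}\,\hat{M}_{R\phi(x)} = \sum_{\phi=1}^{\gamma} W_\phi\,M_{R\phi(x)}, \qquad (24)$$

$$\sum_{\phi=1}^{\gamma} \Omega^{P_2}_{\phi}\,\hat{T}_{M\phi(x)} = \sum_{\phi=1}^{\gamma} W_\phi\,T_{M\phi(x)}, \qquad (25)$$

where $\hat{M}_{R\phi(x)}$, $M_{R\phi(x)}$, $\hat{T}_{M\phi(x)}$, and $T_{M\phi(x)}$ are the sample and population mid-range and tri-mean of the supplementary variable in the $\phi$th stratum. The Lagrange function is given by

$$\Delta = \sum_{\phi=1}^{\gamma} \frac{(\Omega^{P_2}_{\phi} - W_\phi)^2}{W_\phi Q_\phi} - 2\lambda_{P_1} \left( \sum_{\phi=1}^{\gamma} \Omega^{P_2}_{\phi}\,\hat{M}_{R\phi(x)} - \sum_{\phi=1}^{\gamma} W_\phi\,M_{R\phi(x)} \right) - 2\lambda_{P_2} \left( \sum_{\phi=1}^{\gamma} \Omega^{P_2}_{\phi}\,\hat{T}_{M\phi(x)} - \sum_{\phi=1}^{\gamma} W_\phi\,T_{M\phi(x)} \right), \qquad (26)$$

where $\lambda_{P_1}$ and $\lambda_{P_2}$ are the Lagrange multipliers. Setting

$$\frac{\partial \Delta}{\partial \Omega^{P_2}_{\phi}} = 0, \qquad (27)$$

the calibration weight can be obtained as

$$\Omega^{P_2}_{\phi} = W_\phi + W_\phi Q_\phi \left( \lambda_{P_1} \hat{M}_{R\phi(x)} + \lambda_{P_2} \hat{T}_{M\phi(x)} \right). \qquad (28)$$

Writing, for brevity, $S_{MM} = \sum_\phi W_\phi Q_\phi \hat{M}^2_{R\phi(x)}$, $S_{TT} = \sum_\phi W_\phi Q_\phi \hat{T}^2_{M\phi(x)}$, $S_{MT} = \sum_\phi W_\phi Q_\phi \hat{M}_{R\phi(x)} \hat{T}_{M\phi(x)}$, $d_M = \sum_\phi W_\phi (M_{R\phi(x)} - \hat{M}_{R\phi(x)})$, and $d_T = \sum_\phi W_\phi (T_{M\phi(x)} - \hat{T}_{M\phi(x)})$, and substituting Equation (28) in Equations (24) and (25), respectively, we obtain

$$\lambda_{P_1} S_{MM} + \lambda_{P_2} S_{MT} = d_M, \qquad \lambda_{P_1} S_{MT} + \lambda_{P_2} S_{TT} = d_T.$$

Solving the system of equations for the lambdas, we obtain

$$\lambda_{P_1} = \frac{d_M S_{TT} - d_T S_{MT}}{S_{MM} S_{TT} - S_{MT}^2}, \qquad \lambda_{P_2} = \frac{d_T S_{MM} - d_M S_{MT}}{S_{MM} S_{TT} - S_{MT}^2}.$$

Substituting these values into Equation (28) gives the calibrated weights. Writing these weights in Equation (22), we obtain the calibration estimator of the CDF as

$$G_{RM}(P_2) = \sum_{\phi=1}^{\gamma} W_\phi \hat{F}_{y\phi}(t_y) + \hat{\beta}_1 d_M + \hat{\beta}_2 d_T,$$

where, with $S_{MF} = \sum_\phi W_\phi Q_\phi \hat{M}_{R\phi(x)} \hat{F}_{y\phi}(t_y)$ and $S_{TF} = \sum_\phi W_\phi Q_\phi \hat{T}_{M\phi(x)} \hat{F}_{y\phi}(t_y)$, the betas are given by

$$\hat{\beta}_1 = \frac{S_{TT} S_{MF} - S_{MT} S_{TF}}{S_{MM} S_{TT} - S_{MT}^2}, \qquad \hat{\beta}_2 = \frac{S_{MM} S_{TF} - S_{MT} S_{MF}}{S_{MM} S_{TT} - S_{MT}^2}.$$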
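The two-constraint case reduces to solving a 2 x 2 linear system for the two Lagrange multipliers. The sketch below is illustrative only and rests on the same assumptions as before: a chi-square distance of the standard form $\sum_\phi (\Omega_\phi - W_\phi)^2 / (W_\phi Q_\phi)$ and constraints that match calibrated sums of the sample mid-range and tri-mean to their stratum-weighted population counterparts. The function name and argument names are hypothetical.

```python
def calibrate_two_constraints(W, Q, m_hat, m_pop, t_hat, t_pop):
    """Calibrated weights under an assumed chi-square distance and two constraints
    (mid-range and tri-mean of the auxiliary variable)."""
    a = [w * q for w, q in zip(W, Q)]  # W_phi * Q_phi
    s_mm = sum(ai * m * m for ai, m in zip(a, m_hat))
    s_mt = sum(ai * m * t for ai, m, t in zip(a, m_hat, t_hat))
    s_tt = sum(ai * t * t for ai, t in zip(a, t_hat))
    # Right-hand sides: weighted gaps between population and sample measures
    d_m = sum(w * (mp - mh) for w, mp, mh in zip(W, m_pop, m_hat))
    d_t = sum(w * (tp - th) for w, tp, th in zip(W, t_pop, t_hat))
    det = s_mm * s_tt - s_mt ** 2
    if det == 0:
        # Happens when the two sample measures are proportional across strata
        raise ValueError("constraints are collinear; the system has no unique solution")
    # Cramer's rule for the two multipliers
    lam1 = (d_m * s_tt - d_t * s_mt) / det
    lam2 = (d_t * s_mm - d_m * s_mt) / det
    return [w + ai * (lam1 * m + lam2 * t) for w, ai, m, t in zip(W, a, m_hat, t_hat)]
```

At least two strata with non-proportional sample mid-ranges and tri-means are needed for the system to have a unique solution, which is why the sketch checks the determinant before dividing.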

Numerical Study
To study the performance of the developed calibration estimation methods of CDF using robust measures, we considered four different real-life datasets. Figures 1-16 show that these populations have outliers and are therefore asymmetric in nature. We compared the mean square error (MSE) of the proposed estimators with that of the adapted estimators to evaluate which estimators performed more efficiently. For MSE estimation, we perform the steps of the simulation study as given below:
Step-1: Select a random sample of size $n_\phi$ through StRS from stratum $\phi$;
Step-2: Find the value of the CDF estimate (say) $\hat{\xi}$;
Step-3: Replicate the above steps G = 5000 times to obtain $\hat{\xi}_1, \hat{\xi}_2, \ldots, \hat{\xi}_G$;
Step-4: Compute the MSE as

$$\mathrm{MSE}(\hat{\xi}) = \frac{1}{G} \sum_{g=1}^{G} \left( \hat{\xi}_g - \xi \right)^2,$$

where $\xi$ denotes the population CDF value $F_Y(t_y)$.

The bias, MSEs, and PREs are provided in Tables 1-3, respectively. In the following part, we compare the outcomes of all four populations using the t = 0.25 quantile point.
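The four simulation steps can be sketched in Python as below. This is a hedged illustration rather than the authors' simulation code: it assumes simple random sampling without replacement within each stratum, takes the population CDF at `t` as the target value, and uses the traditional stratified estimator as the plug-in example. The names `simulate_mse` and `traditional_estimator` are hypothetical.

```python
import random


def sample_cdf(values, t):
    """Proportion of values less than or equal to t."""
    return sum(1 for v in values if v <= t) / len(values)


def traditional_estimator(samples, strata, t):
    """Stratified CDF estimate with stratum weights W_phi = M_phi / M."""
    M = sum(len(stratum) for stratum in strata)
    return sum(len(st) / M * sample_cdf(sm, t) for sm, st in zip(samples, strata))


def simulate_mse(strata, n_per_stratum, t, estimator, G=5000, seed=0):
    """Monte Carlo MSE of a CDF estimator over G stratified samples."""
    rng = random.Random(seed)  # seeded for reproducibility
    population = [y for stratum in strata for y in stratum]
    true_cdf = sample_cdf(population, t)  # target value F_Y(t)
    squared_error = 0.0
    for _ in range(G):
        # Step 1: draw a sample without replacement from each stratum
        samples = [rng.sample(stratum, n) for stratum, n in zip(strata, n_per_stratum)]
        # Step 2: compute the CDF estimate; Step 3 is the surrounding loop
        squared_error += (estimator(samples, strata, t) - true_cdf) ** 2
    # Step 4: average squared error over the replications
    return squared_error / G
```

Any estimator with the same call signature, including a calibrated one, can be passed as `estimator` so that competing estimators are evaluated against the same target value.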

To demonstrate the performance of the proposed estimation method in this article, we examine a dataset of apples used in References [16,20].

Population 1
We consider the following variables for population 1: x = The list of apple trees in 1999; y = The amount of apples produced in 1999.
The extreme values of each stratum are clearly shown in the scatter plots in Figures 1-8, and as a result, the data are appropriate for our suggested estimators.

Population 2
We consider the following variables for population 2: x = The amount of apples produced in 1998; y = The amount of apples produced in 1999.

COVID-19 Data (Populations 3 and 4)
To demonstrate the performance of the proposed estimation method in this article, we examine a COVID-19 dataset, used in Reference [21].

Population 3
We consider the following variables for population 3: x = Total cases per million; y = Total deaths per million.
The extreme values of each stratum are clearly shown in the scatter plots in Figures 9-16, and as a result, the data are appropriate for our suggested estimators.

Population 4
We consider the following variables for population 4: x = Total number of cases per million; y = Total number of recoveries per million.

Interpretation
The results in Table 2 indicate that:
• For population 1, the first proposed estimator $G_{RM}(P_1)$ = 2.284482 is better than the first adapted estimator $G_{RM}(A_1)$ = 4.083395, and the second proposed estimator $G_{RM}(P_2)$ = 0.9799003 is better than the second adapted estimator $G_{RM}(A_2)$ = 1.538583 at quantile (t = 0.25);
• For population 2, the first proposed estimator $G_{RM}(P_1)$ = 4.106615 is better than the first adapted estimator $G_{RM}(A_1)$ = 6.378484, and the second proposed estimator $G_{RM}(P_2)$ = 2.021053 is better than the second adapted estimator $G_{RM}(A_2)$ = 3.500668 at quantile (t = 0.25);
• For population 3, the first proposed estimator $G_{RM}(P_1)$ = 0.5599689 is better than the first adapted estimator $G_{RM}(A_1)$ = 0.666069, and the second proposed estimator $G_{RM}(P_2)$ = 0.56593 is better than the second adapted estimator $G_{RM}(A_2)$ = … at quantile (t = 0.25);
• For population 4, the first proposed estimator $G_{RM}(P_1)$ = 43.68103 is better than the first adapted estimator $G_{RM}(A_1)$ = 65.88738, and the second proposed estimator $G_{RM}(P_2)$ = 34.73049 is better than the second adapted estimator $G_{RM}(A_2)$ = 87.86289 at quantile (t = 0.25).
A similar pattern of performance can be observed for the PREs of the suggested estimation methods in Table 3.
Based on these results for all the estimators, we conclude that the proposed estimators have the minimum bias and MSE and the maximum PRE values for all four populations compared to the adapted estimators.

Conclusions
There are a variety of calibration estimation methods that use one or two calibration constraints based on supplementary data. In this article, new, improved calibration estimators of the CDF using robust measures are developed under StRS. To evaluate the effectiveness of the developed calibration estimators against the adapted calibration estimators, we conducted a simulation study using asymmetric real-life datasets. We calculated the bias, MSE, and PREs of the calibration estimators. The results demonstrate that the proposed calibration estimators are more efficient than the adapted calibration estimators for asymmetric datasets. In future studies, the work can be expanded to incorporate different sampling schemes, and new proposals can be compared to existing approaches.

Data Availability Statement: All the dataset information is already available in References [16,20,21].