Efficient Difference and Ratio-Type Imputation Methods under Ranked Set Sampling

It is well known that ranked set sampling (RSS) is more efficient than simple random sampling (SRS). Furthermore, the presence of missing data vitiates the conventional results. Only a minuscule amount of work has been conducted under RSS with missing data. This paper makes a modest attempt to provide some efficient difference- and ratio-type imputation methods in the presence of missing values under RSS. The envisaged imputation methods are demonstrated to provide better results than the existing imputation methods. The theoretical results are supported by a computational analysis using real and hypothetically generated symmetric (Normal) and asymmetric (Gamma and Weibull) populations. The computational results show that the proposed imputation methods outperform the existing imputation methods in terms of percent relative efficiency. Additionally, the impact of skewness and kurtosis on the efficiency of the suggested imputation methods is examined.


Introduction
The most common problem faced by survey statisticians in their daily work is making inferences from data containing missing values. Such problems of missing values in survey sampling may be tackled through the technique of imputation. A wide range of imputation methods have been suggested by various authors. The authors of [1] discussed three noteworthy concepts concerning missing values, namely missing at random (MAR), observed at random (OAR), and parameter distribution (PD). Subsequently, [2][3][4][5][6] suggested different types of imputation methods. The authors of [7] showed that missing at random and missing completely at random (MCAR) are totally different approaches. Many renowned authors [8][9][10][11][12] assumed the MCAR approach in their studies for the imputation of missing values. The authors of [13] introduced some imputation methods which outperformed the imputation methods suggested by [14]. The authors of [15] developed logarithmic-type imputation methods under SRS. The authors of [16] utilized robust measures and suggested compromised imputation-based mean estimators.
In real-life applications, situations may arise where measuring the study variable is difficult or expensive, but the units can be ranked visually or by a cost-free measure. For this situation, ref. [17] envisaged the concept of ranked set sampling (RSS) but did not provide any rigorous mathematical support. The authors of [18] explored the idea of [17] and furnished the essential mathematical foundation for the theory of RSS. In sample surveys, when each group has very few observations, each observation then ...
The methodology of RSS was initiated by [17] and is based on drawing m simple random samples, each of size m, from the parent population. The m units are ranked within each set with respect to the auxiliary variable X. The rank 1 unit is chosen from the first set for the measurement of the auxiliary variable X along with the associated study variable Y. The rank 2 unit is chosen from the second ranked set for the measurement of X along with the associated Y, and the process continues until the rank m unit is chosen from the last set. The above process is referred to as a cycle. The whole procedure is repeated k times, providing a ranked set sample of n = mk units.
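The cycle-based selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `ranked_set_sample`, the toy population, and the seed are all assumptions made for the example, and each set is drawn independently, so a unit may appear in more than one set.

```python
import random

def ranked_set_sample(population, m, k, rng):
    """Draw a ranked set sample of n = m * k (x, y) pairs.

    population: list of (x, y) tuples; ranking uses the auxiliary value x only.
    """
    sample = []
    for _cycle in range(k):
        for rank in range(m):
            # Draw one simple random sample of m units (one "set").
            one_set = rng.sample(population, m)
            # Rank the set on the auxiliary variable X.
            one_set.sort(key=lambda unit: unit[0])
            # From set number (rank + 1), keep the unit holding rank (rank + 1).
            sample.append(one_set[rank])
    return sample

# Hypothetical toy population: X uniform on (1, 10), Y roughly 2X plus noise.
rng = random.Random(1)
population = []
for _ in range(500):
    x = rng.uniform(1.0, 10.0)
    population.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))

rss = ranked_set_sample(population, m=3, k=4, rng=rng)
print(len(rss))  # 12 units, since n = m * k
```

Only X is used for ranking here, which mirrors the situation where Y is costly to measure; the associated Y simply travels with the selected unit.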
In the presence of missing values in a dataset, an alteration of the aforesaid methodology is proposed for estimating the population mean of the study variable using the available auxiliary information. To facilitate ranking, m bivariate random samples, each consisting of m units, are drawn from the parent population. The m units are ranked within each set with respect to the auxiliary variable, as it is hypothesized that the study variable has some missing values. From the first sample, the unit with the smallest X is selected along with the associated Y. From the second sample, the unit with the second smallest X is selected along with the associated Y. The procedure continues in the same manner until, from the mth sample, the unit with the largest X is selected along with the associated Y. In the first cycle, only some of the selected m units provide a response on the study variable, so the number of responding units is smaller than m. The whole procedure is repeated k times, so that responses are obtained from fewer than the n = mk selected units.
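To make the nonresponse step concrete, the following sketch marks the study variable as missing for units that do not respond, while the auxiliary value stays available. It assumes each unit responds independently with the same probability P (an MCAR-style mechanism, as used later in the paper); the function name `apply_nonresponse` is hypothetical.

```python
import random

def apply_nonresponse(sample, P, rng):
    """Return (x, y) pairs where y is None when the unit does not respond.

    Assumption: every unit responds independently with probability P.
    """
    observed = []
    for x, y in sample:
        responds = rng.random() < P
        # X is always retained; Y is observed only for responding units.
        observed.append((x, y if responds else None))
    return observed

rng = random.Random(2)
sample = [(1.2, 3.1), (2.5, 5.2), (4.0, 7.9), (5.5, 11.3)]
print(apply_nonresponse(sample, P=0.6, rng=rng))
```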

Notations
Let $\mu_y = N^{-1}\sum_{i=1}^{N} Y_i$ be the mean of the finite population $\Omega$ of N identifiable units with values $Y_i$, $i \in \Omega$. Let a ranked set sample s of size n = mk be drawn from $\Omega$ to estimate the population mean $\mu_y$. In each cycle, only some of the sampled m units respond. Let P be the probability that the ith respondent belongs to the responding group A and (1 − P) be the probability that the ith respondent belongs to the non-responding group $\bar{A}$, such that $s = A \cup \bar{A}$. The value $Y_i$ is observed for every unit $i \in A$, but for the units $i \in \bar{A}$ the values are missing and need imputation to build the complete structure of the data and draw a valid conclusion. The auxiliary variable X assists in the imputation of missing values. Let $X_i$ be the value of X for the unit i, which is positive and known for all $i \in s$, so that $X_s = \{X_i : i \in s\}$ is known. Let $\bar{X}_{r,rss} = \sum_{i=1}^{m}\sum_{j=1}^{k} X_{(i:i)j}/(mkP)$ and $\bar{Y}_{r,rss} = \sum_{i=1}^{m}\sum_{j=1}^{k} Y_{[i:i]j}/(mkP)$ be the unbiased estimators of the population means $\mu_x$ and $\mu_y$, respectively. Here, $X_{(i:i)j}$ and $Y_{[i:i]j}$ are the ith order statistic and the ith judgement order statistic, respectively, in the ith sample of size m in cycle j for the variables X and Y. For the sake of simplicity, we denote $X_{(i:i)j}$ and $Y_{[i:i]j}$ by $X_{(i)}$ and $Y_{[i]}$, respectively. Let P be the probability of obtaining a response; then $E(r^{-j}) = \{E(r)\}^{-j}$, which provides the variance expressions (1) and (2). The proof of (1) and (2) can be viewed in [20].
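As a small numerical illustration of the respondent-based estimators, the sketch below divides the total over responding units by mkP = nP. Reading the double sum as running over responding units only is an assumption made here (it is what makes the division by mkP unbiased when response is independent of Y); the function name `respondent_mean` is hypothetical.

```python
def respondent_mean(values, responded, P):
    """Sum of the values of responding units divided by n * P.

    Assumption: the sum runs over responding units only, and P is the
    known response probability, so n * P is the expected number of responses.
    """
    n = len(values)
    total = sum(v for v, r in zip(values, responded) if r)
    return total / (n * P)

y_values = [2.0, 4.0, 6.0, 8.0]
responded = [True, False, True, True]
print(respondent_mean(y_values, responded, P=0.75))
```

With three respondents out of four and P = 0.75, the divisor nP equals the realized number of respondents, so the result coincides with the ordinary mean of the observed values.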
To obtain the bias and mean square error (MSE), the following notations and results are used throughout this paper. Let $\bar{Y}_{r,rss} = \mu_y(1 + \epsilon_0)$, $\bar{X}_{r,rss} = \mu_x(1 + \epsilon_1)$, and $\bar{X}_{n,rss} = \mu_x(1 + \epsilon_2)$, where $\epsilon_0$, $\epsilon_1$, and $\epsilon_2$ are the error terms. In the associated expressions, $S_x$ and $S_y$ are the population standard deviations of the auxiliary variable X and the study variable Y, respectively, $C_x$ and $C_y$ are the corresponding population coefficients of variation, and $\rho_{xy}$ is the population correlation coefficient between X and Y. Moreover, we note that the quantities $\mu_{x(i)}$ and $\mu_{y[i]}$ involve order statistics from some particular distributions and can be easily determined from [23].

Mean Imputation Method
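Under mean imputation, each missing value of the study variable is replaced by $\bar{Y}_{r,rss}$, the mean based on the responding units, and the consequent estimator of the population mean is $t_m = \bar{Y}_{r,rss}$.
The imputation methods are categorized into three strategies according to the availability of auxiliary information.
Strategy I: When $\mu_x$ is known and $\bar{X}_{n,rss}$ is used.
Strategy II: When $\mu_x$ is known and $\bar{X}_{r,rss}$ is used.
Strategy III: When $\mu_x$ is unknown and $\bar{X}_{n,rss}$ and $\bar{X}_{r,rss}$ are used.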
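A short sketch of mean imputation may help fix ideas. It assumes that missing values are coded as `None` and uses the ordinary arithmetic mean of the observed values as the fill value; the paper's respondent mean is scaled by mkP instead, so this is an illustration of the idea rather than the exact estimator. The function name `mean_impute` is hypothetical.

```python
def mean_impute(y_values, fill):
    """Replace every missing value (None) by the supplied fill value."""
    return [fill if y is None else y for y in y_values]

y_values = [3.0, None, 5.0, None, 10.0]
observed = [y for y in y_values if y is not None]
# Assumption: the fill value is the plain mean of the responding units.
respondent_mean = sum(observed) / len(observed)

completed = mean_impute(y_values, respondent_mean)
estimate = sum(completed) / len(completed)
print(completed, estimate)
```

Because every imputed value equals the respondent mean, the mean of the completed data reproduces the respondent mean; mean imputation uses no auxiliary information, which is what the three strategies are designed to improve on.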

The Al-Omari and Bouza Imputation Method
To improve the efficiency of the estimators in the presence of missing data, [9,21] suggested some regression-cum-ratio-type estimators under RSS for Strategies I, II and III. Under Strategy III, the estimator is
$$\bar{y}_{KC_3} = \frac{\bar{Y}_{r,rss} + b(\bar{X}_{n,rss} - \bar{X}_{r,rss})}{\bar{X}_{r,rss}}\,\bar{X}_{n,rss},$$
where $b = S_{xy}/S_x^2$ is the regression coefficient of Y on X.
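The Strategy III estimator can be evaluated numerically as sketched below. The sketch reads the estimator as $[\bar{Y}_{r,rss} + b(\bar{X}_{n,rss} - \bar{X}_{r,rss})]\,\bar{X}_{n,rss}/\bar{X}_{r,rss}$, which is an interpretation of the expression in the text. Two further assumptions are made: plain arithmetic means stand in for the RSS means, and b is estimated from the responding units, whereas the paper defines b through the population quantities $S_{xy}$ and $S_x^2$. The function name `kc3_estimate` is hypothetical.

```python
def kc3_estimate(x, y):
    """Regression-cum-ratio estimate using all X values and the observed Y values.

    x: auxiliary values for all n sampled units.
    y: study values, with None for non-responding units.
    """
    resp = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    x_n = sum(x) / len(x)                       # mean of X over all sampled units
    x_r = sum(xi for xi, _ in resp) / len(resp)  # mean of X over respondents
    y_r = sum(yi for _, yi in resp) / len(resp)  # mean of Y over respondents
    # Assumption: b is estimated from respondents (S_xy / S_x^2 on that subset).
    s_xy = sum((xi - x_r) * (yi - y_r) for xi, yi in resp)
    s_xx = sum((xi - x_r) ** 2 for xi, _ in resp)
    b = s_xy / s_xx
    return (y_r + b * (x_n - x_r)) * x_n / x_r

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, None, 8.0, None, 12.0]
print(kc3_estimate(x, y))
```

The regression term shifts the respondent mean of Y toward what the full-sample X values suggest, and the ratio factor then rescales it by $\bar{X}_{n,rss}/\bar{X}_{r,rss}$.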

The Sohail, Shabbir and Ahmed Imputation Methods
Following [20][21][22], we examined the ratio-type estimators of [8] using RSS for the imputation of missing values. These imputation methods are
Strategy I
Proof. A precis of the derivations is given in Appendix B for quick reference.

Corollary 1.
The minimum MSEs of the consequent estimators comprising the suggested imputation methods are given by
Proof. A summary of the derivations and the definition of the parametric function $A_j$ are given in Appendix B.

Computational Study
To enhance the soundness of the efficiency conditions obtained in the previous section, a computational study was designed in three subsections, namely, a numerical analysis based on a real population, a simulation analysis based on an artificially generated population, and a discussion of the computational findings.

Numerical Study
In this subsection, a numerical study is performed and the performance of the proposed imputation methods is compared with that of the existing imputation methods. The numerical analysis was carried out on four real datasets. Population 1 was taken from [24], where the level of apple production was taken as the study variable and the number of apple trees was taken as the auxiliary variable in 69 villages of the South Anatolia region of Turkey in 1999. Population 2 was taken from [25], where the population (in millions) in 1983 was considered as the study variable and the export (in millions of U.S. dollars) was considered as the auxiliary variable. Population 3 was taken from [26], where the amount (in U.S. dollars) of real estate farm loans in different states during 1997 was considered as the study variable and the amount (in U.S. dollars) of non-real estate farm loans in different states during 1997 was considered as the auxiliary variable. Population 4 was taken from [25], where the total number of seats in the municipal council in 1982 was considered as the study variable and the number of conservative seats in the municipal council in 1982 was considered as the auxiliary variable. The necessary values of the parameters for all four populations are reported in Table 1. The percent relative efficiency (PRE) of the proposed imputation methods with respect to the conventional imputation methods was calculated using the following formula:
$$\mathrm{PRE} = \frac{MSE(t_m)}{MSE(T)} \times 100,$$
where $T = t_m$, $t_{s_i}$, i = 1, 2, . . . , 6, and $T_i$, i = 1, 2, . . . , 9. The results of the numerical analysis are summarized in Table 2 and depicted in Figure 1 under strategies I, II and III for each population.
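For readers who want to reproduce PRE values from tabulated MSEs, the computation is a single ratio. The sketch assumes the usual convention that the MSE of the reference (mean imputation) estimator is in the numerator, so values above 100 favour the candidate; the function name and the MSE figures are made up for illustration.

```python
def percent_relative_efficiency(mse_reference, mse_candidate):
    """PRE of a candidate estimator relative to a reference estimator."""
    return 100.0 * mse_reference / mse_candidate

# Hypothetical MSE values, not taken from the paper's tables.
print(percent_relative_efficiency(4.0, 2.5))
```

A PRE of exactly 100 means the candidate is no better than the reference; the reference estimator compared with itself always gives 100.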

Simulation Analysis
To assess the performance of the suggested imputation methods, following [27], simulation experiments were conducted over three parent populations, namely, Normal, Gamma, and Weibull, of size N = 1000 units with variables X and Y constructed from $X^*$ and $Y^*$, which are independent variables of the corresponding parent population. The sampling methodology of Section 2 was used to draw an RSS of size n = 12 units with set size m = 3 (hence k = 4 cycles) from each parent population. Using 20,000 iterations, the PRE of the consequent estimators relative to the conventional mean estimator was computed. The outcomes of the simulation experiments are reported in Tables 3-9 as PRE values for the chosen response probabilities P = 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and the corresponding correlation coefficients $\rho_{xy}$ = 0.6, 0.7, 0.8, 0.9.
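The structure of such a simulation can be sketched as follows. Several pieces are assumptions rather than the paper's specification: the generating model ($X = X^*$, $Y = \rho X^* + \sqrt{1-\rho^2}\,Y^*$ with Normal(10, 2) components), the use of a simple difference-type estimator $\bar{y}_r + b(\mu_x - \bar{x}_r)$ as a stand-in for the proposed methods, plain respondent means in place of the mkP-scaled means, and a reduced iteration count to keep the run short. All function names are hypothetical.

```python
import math
import random

def make_population(N, rho, rng):
    # ASSUMPTION: a common correlated-pair construction, not the paper's own.
    pop = []
    for _ in range(N):
        x_star = rng.gauss(10.0, 2.0)
        y_star = rng.gauss(10.0, 2.0)
        pop.append((x_star, rho * x_star + math.sqrt(1.0 - rho ** 2) * y_star))
    return pop

def draw_rss(pop, m, k, rng):
    # One unit per set: the unit holding rank (rank + 1) on X in set (rank + 1).
    sample = []
    for _ in range(k):
        for rank in range(m):
            one_set = sorted(rng.sample(pop, m), key=lambda u: u[0])
            sample.append(one_set[rank])
    return sample

def simulate_pre(N=1000, m=3, k=4, P=0.6, rho=0.9, iterations=2000, seed=7):
    rng = random.Random(seed)
    pop = make_population(N, rho, rng)
    mu_x = sum(x for x, _ in pop) / N
    mu_y = sum(y for _, y in pop) / N
    s_xy = sum((x - mu_x) * (y - mu_y) for x, y in pop)
    s_xx = sum((x - mu_x) ** 2 for x, _ in pop)
    b = s_xy / s_xx  # population regression coefficient of Y on X
    se_mean = se_diff = 0.0
    used = 0
    for _ in range(iterations):
        sample = draw_rss(pop, m, k, rng)
        resp = [u for u in sample if rng.random() < P]  # MCAR response
        if not resp:
            continue  # skip the rare replicate with no respondents
        x_r = sum(x for x, _ in resp) / len(resp)
        y_r = sum(y for _, y in resp) / len(resp)
        se_mean += (y_r - mu_y) ** 2                       # mean imputation
        se_diff += (y_r + b * (mu_x - x_r) - mu_y) ** 2    # difference-type stand-in
        used += 1
    return 100.0 * (se_mean / used) / (se_diff / used)

pre = simulate_pre()
print(round(pre, 1))
```

Looping `simulate_pre` over the grid of P and $\rho_{xy}$ values listed above would produce a table with the same layout as the reported ones, though the numbers would reflect the stand-in estimator and the assumed generating model.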

Discussion of Computational Findings
After carefully observing the findings reported in Tables 2-9, we discuss the following points:
(i). From the findings of Table 2, the proposed imputation methods $y_{.i_j}$, j = 1, 2, . . . , 9 outperform the mean imputation method, the imputation methods of ref. [21] and the imputation methods of ref. [22] in each real population. Furthermore, the proposed imputation methods $y_{.i_j}$, j = 1, 3, 7, 9 were superior among the proposed imputation methods in population 1, whereas the proposed imputation methods $y_{.i_j}$, j = 4, 5, 6 were superior among the proposed imputation methods in populations 2-4. This is easily observed in Figure 1.
(ii). From the findings of Tables 3-9, the proposed imputation methods $y_{.i_j}$, j = 1, 2, . . . , 9 are also better than the mean imputation method, the imputation methods of ref. [21] and the imputation methods of ref. [22] under both the symmetric and asymmetric populations for different correlation coefficients $\rho_{xy}$, coefficients of skewness $\beta_1$ and coefficients of kurtosis $\beta_2$.
(iii). When the parent population was Normal (symmetric) or Weibull (asymmetric), the proposed ratio-type imputation methods $y_{.i_j}$, j = 4, 5, 6 always performed better than the competitors as well as within the proposed class of imputation methods under strategies I, II and III.
(iv). When the parent population was Gamma (asymmetric), the proposed difference- and ratio-type imputation methods $y_{.i_j}$, j = 1, 2, 3, 7, 8, 9 were equally efficient, outperformed the conventional methods, and performed better than the other proposed imputation methods under strategies I, II and III.
(v). The suggested imputation methods performed better in strategy II compared to strategies I and III in the real and artificially generated populations.
(vi). It can be easily seen that the PRE decreases with the increase in asymmetry and peakedness for asymmetric distributions such as Gamma and Weibull.
(vii). Moreover, the numerical analysis is summarized in Table 2 and Figure 1 under strategies I, II, and III for real populations 1-4. The PRE values of the consequent estimators for the remaining simulation results in Tables 3-7 exhibit the same pattern and can be easily presented as line diagrams, if required.

Conclusions
In this manuscript, we proposed efficient difference- and ratio-type imputation methods for the estimation of the population mean in the presence of missing data. The efficiency conditions have been derived and supported by a computational analysis on some real and hypothetically generated symmetric and asymmetric populations. The computational and theoretical results show that the proposed imputation methods $y_{.i_j}$, j = 1, 2, . . . , 9 outperformed the mean imputation method $y_{.i}$, the imputation methods $\bar{y}_{KC_i}$, i = 1, 2, 3 of ref. [21], and the imputation methods $y_{.i_{s_j}}$, j = 1, 2, . . . , 6 of ref. [22]. In the simulation analysis, we considered one family of symmetric populations, namely, Normal, and two families of asymmetric populations, namely, Gamma and Weibull, to ascertain the effect of the correlation coefficient for a symmetric population and the effect of skewness and kurtosis for asymmetric populations.
It is worth mentioning that among the asymmetric populations, all imputation methods exhibited a decreasing trend in PRE as the coefficients of skewness and kurtosis increased. Nevertheless, in such cases, the proposed estimators fared better than their conventional counterparts. These results are in agreement with the results of [17,28,29], whose authors took skewed distributions and reported that the efficiency of the estimators decreased with an increase in skewness and kurtosis. The same holds for imputation.
Lastly, among the methods compared, the proposed imputation methods currently provide the best available option for the estimation of a population mean in the presence of missing data.
Furthermore, the proposed imputation strategies can be defined using multi-auxiliary information, which our future research will investigate.

Appendix A
The expressions of the MSE, minimum MSE, and the optimum scalar values of the existing resultant estimators are reported below.