Optimizing Finite Population Mean Estimation Using Simulation and Empirical Data

Alghamdi, Abdulaziz S.; Almulhim, Fatimah A.

doi:10.3390/math13101635

Open Access Article

by Abdulaziz S. Alghamdi ¹ and Fatimah A. Almulhim ²,*

¹ Department of Mathematics, College of Science & Arts, King Abdulaziz University, P.O. Box 344, Rabigh 21911, Saudi Arabia

² Department of Mathematical Sciences, College of Science, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia

* Author to whom correspondence should be addressed.

Mathematics 2025, 13(10), 1635; https://doi.org/10.3390/math13101635

Submission received: 13 March 2025 / Revised: 1 May 2025 / Accepted: 15 May 2025 / Published: 16 May 2025

(This article belongs to the Section D1: Probability and Statistics)

Abstract

Two-phase sampling is an effective approach in sample surveys when prior auxiliary information is not available. When two variables are associated, the ranks of the auxiliary variable are also correlated with the study variable, so these ranks can be used to improve the accuracy of the estimators. In this article, we estimate the population mean of the study variable using the extreme values and the ranks of the auxiliary variable. The biases and mean squared errors (MSEs) of the proposed estimators in two-phase sampling are obtained up to the first order of approximation. We verify the theoretical results and assess the performance of the proposed estimators using three datasets and a simulation study, which show that the proposed estimators outperform existing estimators in terms of percent relative efficiency (PRE).

Keywords:

study variable; auxiliary information; minimum/maximum values; ranks; bias; percent relative efficiency

MSC:

62D05

1. Introduction

Utilizing auxiliary data in survey sampling plays a significant role in enhancing the accuracy of estimators. To achieve better relative efficiency, methods such as ratio, regression, and product estimators incorporate supplementary details in addition to the primary study variables. For example, when estimating total household income, auxiliary metrics like the number of household members and overall expenditure can be helpful. Researchers have carried out extensive investigations to develop more effective estimators for various population parameters, including the mean, total, and median. Further insights into these improved estimators and their statistical characteristics can be found in related studies, see [1,2,3,4,5] and references therein.

In the field of sampling theory, the use of auxiliary variables alongside the primary variable of interest is a widely recognized strategy to enhance the efficiency and accuracy of survey designs. This approach takes advantage of the correlation between the auxiliary and main variables to yield more reliable estimations. In many practical situations, the population mean of the variable under study is not known prior to data collection. In such cases, researchers often rely on a dual-phase sampling approach, also known as double sampling. This strategy involves conducting an initial phase to gather basic information, followed by a second phase focused on more detailed data collection. Typically, a large preliminary sample is drawn to collect data on auxiliary variables, and a smaller, more focused sample is then selected either from the initial group or independently, to observe both the auxiliary and study variables. Due to its efficiency and lower cost, dual-phase sampling is a valuable tool in survey research, particularly when prior knowledge of auxiliary data is lacking. A brief review of two-phase sampling was first introduced in [6,7]. Recently, two-phase sampling has gained significant attention because of its cost-effectiveness in screening variables. Several studies have explored different aspects of two-phase sampling, including [8,9,10,11,12,13,14,15,16,17].

Survey data can sometimes include outliers or unusually large or small values, which may affect the reliability of statistical estimates. Such extreme observations can distort the sample mean and lead to biased inferences. To address this, the authors in [18] introduced two estimators based on linear transformations involving the minimum and maximum values of known auxiliary variables. This approach did not receive much attention until it was revisited by the authors of [19], who proposed improved ratio, product, and regression estimators that incorporate extreme value information of auxiliary variables for estimating population means. Building upon this, the authors in [20] extended these ideas to a dual-phase sampling framework to enhance estimation accuracy. The authors in [21] also explored different transformation techniques using extreme values of auxiliary variables to estimate the finite population mean. Further advancement was made by the authors in [22], who suggested methods for estimating the mean under stratified random sampling designs that account for extreme values. Additionally, the authors in [23] introduced a group of estimators of the population variance that use extreme value information to reduce the mean squared error. Most recently, the authors of [24,25,26] proposed several new classes of efficient estimators for the population variance by applying transformation strategies to extreme values. For more details, see [27,28,29,30,31,32] and references therein.

Although the removal of extreme observations from survey data might appear beneficial, retaining them can offer valuable insights, especially when auxiliary variables are involved. Classical estimators often suffer from increased mean squared error (MSE) in the presence of such values, leading to reduced efficiency. Rather than excluding these outliers, this study treats the minimum and maximum values of auxiliary variables as informative features. Inspired by the work of [22,23], we introduce two families of estimators that incorporate these extreme values and the ranks of auxiliary data to enhance the estimation of the finite population mean within a two-phase sampling framework.

Real-life relevance of the proposed estimators

The proposed estimators are not only theoretically efficient but also highly applicable in practical settings where auxiliary information is partially available. In many real-world surveys, it is common to have access to auxiliary variables during a preliminary phase, while collecting detailed data on the study variable may be expensive or time-consuming. In such cases, utilizing the ranks and extreme values of auxiliary variables can significantly enhance the accuracy of population mean estimation. The following examples illustrate typical scenarios where the proposed methodology can be effectively applied:

Agricultural surveys: When estimating average crop yield in a region, data on the number of acres cultivated (auxiliary variable) may be available from administrative records. The ranks (e.g., top 10 largest farms) and extreme values (smallest and largest land holdings) can enhance estimation accuracy, especially in two-phase sampling where detailed crop yield data is expensive to collect.
Public health studies: To estimate the average healthcare expenditure across households, auxiliary data like income levels or family size from census data can be used. Ranks of income groups (e.g., lowest and highest quintiles) and extreme income values can help refine estimates, particularly when full income data is unavailable in the first phase.
Educational statistics: In assessing the average test scores of students across a district, school-level data such as the number of teachers or school enrollment (auxiliary variables) are readily available. Using the ranks of schools based on size or resources, along with the extremes (smallest and largest schools), can improve estimates with limited test score data in the second phase.
Socioeconomic surveys: In household income and expenditure surveys, auxiliary variables like electricity usage or mobile phone ownership can be ranked, and extreme values (e.g., zero usage or very high usage) can serve as indicators to better estimate average household income.

The article is organized as follows: Section 2 outlines the methodology of the study and introduces the notation used throughout the paper. A review of the existing estimators is provided in Section 3. In Section 4, we present a detailed discussion of the newly proposed classes of estimators. Section 5 offers an in-depth mathematical comparison of these estimators. The simulation study, outlined in Section 6, generates six distinct artificial populations using various probability distributions, validating the theoretical results discussed in Section 5. This section also includes numerical examples that demonstrate the practical applications of the theoretical findings. Finally, Section 7 summarizes the main conclusions of the study and proposes directions for future research.

2. Methodology and Notation

Consider a finite population $U = (U_{1}, U_{2}, U_{3}, \dots, U_{N})$ consisting of $N$ units. For the $i$-th unit, let $y_{i}$ be the value of the study variable $Y$, let $x_{i}$ denote the value of the auxiliary variable $X$, and let $w_{i}$ be the value of $W$, the rank of the auxiliary variable.

Let the population data be $X = \{x_{1}, x_{2}, \dots, x_{n}, \dots, x_{m}, \dots, x_{N}\}$. The sample data collected in the first phase are denoted by $X_{1} = \{x_{1}, x_{2}, \dots, x_{m}\}$, while the second-phase sample data are represented by $X_{2} = \{x_{1}, x_{2}, \dots, x_{n}\}$. In this paper, we propose two improved classes of estimators to estimate the finite population mean $\bar{Y}$ of $Y$ in the presence of the auxiliary variable $X$. The two-phase sampling scheme is defined as follows:

1. In the first phase, a simple random sample without replacement of size $m$ (where $m < N$) is drawn from the population to provide an estimate of the population mean $\bar{X}$, which serves as an initial approximation before further analysis.
2. In the second phase, a simple random sample without replacement of $n$ observations (where $n < m$) is selected to observe the variables $y$ and $x$, so that the relationship between them can be measured more precisely.
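The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the second-phase sample is taken as a subsample of the first-phase sample (the text also allows an independent draw), ties in the auxiliary variable receive average ranks, and the helper names `two_phase_sample` and `ranks` are hypothetical.

```python
import random


def two_phase_sample(N, m, n, seed=None):
    """Return (first_phase, second_phase) index lists drawn by SRSWOR.

    Assumption: the second-phase sample is a subsample of the first-phase one.
    """
    if not (0 < n < m < N):
        raise ValueError("need 0 < n < m < N")
    rng = random.Random(seed)
    first = rng.sample(range(N), m)   # phase 1: only x (and its rank w) is observed
    second = rng.sample(first, n)     # phase 2: y and x are both observed
    return first, second


def ranks(values):
    """Rank values from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


first, second = two_phase_sample(N=20, m=8, n=3, seed=1)
w = ranks([10, 30, 20, 20])
```

Passing a fixed `seed` makes the draw reproducible, which is convenient when a simulation compares several estimators on the same samples.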

The population means of the study variable ($\bar{Y}$), the auxiliary variable ($\bar{X}$), and the ranks of the auxiliary variable ($\bar{W}$) are defined as follows:

\bar{Y} = \frac{1}{N} \sum_{i = 1}^{N} Y_{i},

(1)

\bar{X} = \frac{1}{N} \sum_{i = 1}^{N} X_{i},

(2)

and

\bar{W} = \frac{1}{N} \sum_{i = 1}^{N} W_{i} .

(3)

The important variables and notation are listed in Table 1.

Define the population variances under simple random sampling without replacement (SRSWOR) for the study variable (Y), the auxiliary variable (X), and the ranked values of the auxiliary variable (W) as follows:

S_{y}^{2} = \frac{1}{N - 1} \sum_{i = 1}^{N} {(Y_{i} - \bar{Y})}^{2},

(4)

S_{x}^{2} = \frac{1}{N - 1} \sum_{i = 1}^{N} {(X_{i} - \bar{X})}^{2},

(5)

and

S_{w}^{2} = \frac{1}{N - 1} \sum_{i = 1}^{N} {(W_{i} - \bar{W})}^{2},

(6)

respectively. Moreover, the population-level coefficients of variation corresponding to these variables are expressed as follows:

C_{y} = \frac{S_{y}}{\bar{Y}},

(7)

C_{x} = \frac{S_{x}}{\bar{X}},

(8)

and

C_{w} = \frac{S_{w}}{\bar{W}},

(9)

respectively. We also know that the population correlation coefficients between Y and X, Y and W, as well as X and W are given by

ρ_{y x} = \frac{S_{y x}}{S_{y} S_{x}},

(10)

ρ_{y w} = \frac{S_{y w}}{S_{y} S_{w}},

(11)

and

ρ_{x w} = \frac{S_{x w}}{S_{x} S_{w}} .

(12)
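The population quantities in Equations (1) to (12) can be computed directly from the population values. The sketch below is illustrative only: the function name `population_summary` and the toy numbers are made up, and the covariances $S_{y x}$, $S_{y w}$, and $S_{x w}$ are assumed to use the same divisor $N - 1$ as the variances.

```python
import math


def population_summary(y, x, w):
    """Population means, coefficients of variation and correlations, Eqs. (1)-(12)."""
    N = len(y)
    if not (len(x) == N and len(w) == N and N > 1):
        raise ValueError("y, x and w must have the same length N > 1")

    def mean(v):
        return sum(v) / N

    def cov(a, b):
        # Divisor N - 1, matching the variance definitions in Eqs. (4)-(6).
        a_bar, b_bar = mean(a), mean(b)
        return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (N - 1)

    S_y, S_x, S_w = (math.sqrt(cov(v, v)) for v in (y, x, w))
    return {
        "Y_bar": mean(y), "X_bar": mean(x), "W_bar": mean(w),
        "C_y": S_y / mean(y), "C_x": S_x / mean(x), "C_w": S_w / mean(w),
        "rho_yx": cov(y, x) / (S_y * S_x),
        "rho_yw": cov(y, w) / (S_y * S_w),
        "rho_xw": cov(x, w) / (S_x * S_w),
    }


# Toy population with made-up numbers; w holds the ranks of x.
summary = population_summary(y=[2, 4, 6, 8], x=[1, 2, 3, 4], w=[1, 2, 3, 4])
```

In this toy population $y$ is an exact multiple of $x$, so the correlation is 1 and the two coefficients of variation coincide.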

Simple random sampling without replacement is used to estimate the unknown population mean $\bar{Y}$. In the first phase, a random sample of $m$ units is selected from the population. The sample means ${\bar{x}}_{1}$ and ${\bar{w}}_{1}$ for the auxiliary variable $X$ and its ranked version $W$ are computed from this initial sample as follows:

{\bar{x}}_{1} = \frac{1}{m} \sum_{i = 1}^{m} X_{i},

(13)

{\bar{w}}_{1} = \frac{1}{m} \sum_{i = 1}^{m} W_{i},

(14)

while the first phase sample variances are expressed as

s_{x_{1}}^{2} = \frac{1}{m - 1} \sum_{i = 1}^{m} {(X_{i} - {\bar{x}}_{1})}^{2},

(15)

and

s_{w_{1}}^{2} = \frac{1}{m - 1} \sum_{i = 1}^{m} {(W_{i} - {\bar{w}}_{1})}^{2} .

(16)

Additionally, let $\bar{y}$, ${\bar{x}}_{2}$, and ${\bar{w}}_{2}$ denote the sample means of the variables $Y$, $X$, and $W$, respectively, calculated from the second-phase sample of size $n$, such that

\bar{y} = \frac{1}{n} \sum_{i = 1}^{n} Y_{i},

(17)

{\bar{x}}_{2} = \frac{1}{n} \sum_{i = 1}^{n} X_{i},

(18)

and

{\bar{w}}_{2} = \frac{1}{n} \sum_{i = 1}^{n} W_{i} .

(19)

The second-phase sample variances of these variables are defined as

s_{y}^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} {(Y_{i} - \bar{y})}^{2},

(20)

s_{x_{2}}^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} {(X_{i} - {\bar{x}}_{2})}^{2},

(21)

and

s_{w_{2}}^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} {(W_{i} - {\bar{w}}_{2})}^{2} .

(22)

3. Some Existing Estimators

In this section, we evaluate the bias and mean squared error characteristics of existing methods used to estimate the finite population mean and present a comparison with the newly proposed estimator classes.

The usual unbiased estimator of the population mean, $\bar{y} = {\hat{Q}}_{1}$, was proposed in [6] and is given by

{\hat{Q}}_{1} = \frac{1}{n} \sum_{i = 1}^{n} y_{i} .

(23)

The variance of ${\hat{Q}}_{1}$ is given by

\mathrm{Var}({\hat{Q}}_{1}) = θ_{2} Q_{1}^{2} C_{y}^{2},

(24)

where $Q_{1} = \bar{Y}$ is the finite population mean and

θ_{2} = (\frac{N - n}{n N})

is the finite population correction term for the second-phase sample of size $n$.

The authors in [7] suggested the following ratio-type estimator of $Q_{1}$ in two-phase sampling:

{\hat{Q}}_{2} = {\hat{Q}}_{1} (\frac{{\bar{x}}_{2}}{{\bar{x}}_{1}}) .

(25)

The bias and mean squared error (MSE) of ${\hat{Q}}_{2}$ are expressed as follows:

\mathrm{Bias}({\hat{Q}}_{2}) ≅ θ_{3} Q_{1} (C_{x}^{2} - C_{y x}),

(26)

and

\mathrm{MSE}({\hat{Q}}_{2}) ≅ Q_{1}^{2} (θ_{2} C_{y}^{2} + θ_{3} C_{x}^{2} - 2 θ_{3} C_{y x}),

(27)

where

θ_{1} = (\frac{N - m}{m N}), θ_{3} = (\frac{m - n}{m n}),

represent the first phase and inter phase sampling correction terms, respectively.
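The three correction terms are not independent. As a short worked step, using only the definitions of $θ_{1}$, $θ_{2}$, and $θ_{3}$ given above, subtracting the first-phase term from the second-phase term yields the inter-phase term:

```latex
\theta_{2} - \theta_{1}
  = \left(\frac{1}{n} - \frac{1}{N}\right) - \left(\frac{1}{m} - \frac{1}{N}\right)
  = \frac{1}{n} - \frac{1}{m}
  = \frac{m - n}{m n}
  = \theta_{3} .
```

Because $n < m$ in the sampling scheme of Section 2, $θ_{3}$ is strictly positive, and any difference $θ_{2} - θ_{1}$ appearing in the comparisons of Section 5 can be read as $θ_{3}$.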

The classical regression estimator ${\hat{Q}}_{3}$ under two-phase sampling was proposed in [7] and is defined as

{\hat{Q}}_{3} = {\hat{Q}}_{1} + b_{y x} ({\bar{x}}_{2} - {\bar{x}}_{1}),

(28)

where $b_{y x}$ is the sample regression coefficient.

The bias and mean squared error (MSE) of ${\hat{Q}}_{3}$ are expressed as follows:

\mathrm{Bias}({\hat{Q}}_{3}) ≅ - θ_{3} β_{y x} (\frac{Δ_{12}}{S_{y x}} - \frac{Δ_{03}}{S_{x}^{2}}),

(29)

\mathrm{MSE}({\hat{Q}}_{3}) ≅ Q_{1}^{2} C_{y}^{2} (θ_{2} - θ_{3} ρ_{y x}^{2}),

(30)

where

β_{y x} = \frac{S_{y x}}{S_{x}^{2}},

and

Δ_{r t} = \frac{\sum_{i = 1}^{N} {(Y_{i} - \bar{Y})}^{r} {(X_{i} - \bar{X})}^{t}}{N - 1} .

The authors in [13] introduced exponential ratio and product-type estimators as alternative approaches for improved estimation, which are defined as follows:

{\hat{Q}}_{4} = {\hat{Q}}_{1} exp (\frac{{\bar{x}}_{2} - {\bar{x}}_{1}}{{\bar{x}}_{1} + {\bar{x}}_{2}}),

(31)

and

{\hat{Q}}_{5} = {\hat{Q}}_{1} exp (\frac{{\bar{x}}_{1} - {\bar{x}}_{2}}{{\bar{x}}_{1} + {\bar{x}}_{2}}) .

(32)

The mean squared errors of ${\hat{Q}}_{4}$ and ${\hat{Q}}_{5}$ are as follows:

\mathrm{MSE}({\hat{Q}}_{4}) ≅ Q_{1}^{2} [θ_{2} C_{y}^{2} + θ_{3} C_{x}^{2} (\frac{1}{4} - δ)],

(33)

and

\mathrm{MSE}({\hat{Q}}_{5}) ≅ Q_{1}^{2} [θ_{2} C_{y}^{2} + θ_{3} C_{x}^{2} (\frac{1}{4} + δ)],

(34)

where

δ = (\frac{ρ_{y x}}{C_{x}}) C_{y} .

The authors in [14] proposed the following double sampling estimator of $Q_{1}$:

{\hat{Q}}_{6} = {\hat{Q}}_{1} \{t (\frac{{\bar{x}}_{2}}{{\bar{x}}_{1}}) + (1 - t) (\frac{{\bar{x}}_{1}}{{\bar{x}}_{2}})\},

(35)

where

t = \frac{1 + δ}{2} .

The bias and mean squared error (MSE) of ${\hat{Q}}_{6}$ at this value of $t$ are expressed as follows:

\mathrm{Bias}{({\hat{Q}}_{6})}_{\min} ≅ \frac{1}{2} θ_{3} Q_{1} C_{x}^{2} [1 + δ (1 - 2 δ)],

(36)

and

\mathrm{MSE}{({\hat{Q}}_{6})}_{\min} ≅ Q_{1}^{2} C_{y}^{2} [θ_{2} (1 - ρ_{y x}^{2}) + θ_{1} ρ_{y x}^{2}] .

(37)
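The MSE expressions (24), (27), (30), (33), (34), and (37) are easy to evaluate side by side. The sketch below is a hedged illustration rather than the authors' code: it assumes $C_{y x} = ρ_{y x} C_{y} C_{x}$, takes the leading term of (27) as $θ_{2} C_{y}^{2}$ (the form used in Condition (ii) of Section 5), and assumes the usual definition of percent relative efficiency as 100 times the ratio of the baseline variance to the estimator's MSE. All function names and input values are hypothetical.

```python
def thetas(N, m, n):
    """Correction terms theta_1, theta_2, theta_3 as defined in Section 3."""
    theta1 = (N - m) / (m * N)
    theta2 = (N - n) / (n * N)
    theta3 = (m - n) / (m * n)
    return theta1, theta2, theta3


def existing_mse(Q1, Cy, Cx, rho, N, m, n):
    """First-order MSEs of the existing estimators Q1_hat .. Q6_hat."""
    t1, t2, t3 = thetas(N, m, n)
    Cyx = rho * Cy * Cx        # assumption: C_yx = rho_yx * C_y * C_x
    delta = rho * Cy / Cx      # delta as defined after Eq. (34)
    return {
        "Q1": t2 * Q1**2 * Cy**2,                                     # Eq. (24)
        "Q2": Q1**2 * (t2 * Cy**2 + t3 * Cx**2 - 2 * t3 * Cyx),       # Eq. (27)
        "Q3": Q1**2 * Cy**2 * (t2 - t3 * rho**2),                     # Eq. (30)
        "Q4": Q1**2 * (t2 * Cy**2 + t3 * Cx**2 * (0.25 - delta)),     # Eq. (33)
        "Q5": Q1**2 * (t2 * Cy**2 + t3 * Cx**2 * (0.25 + delta)),     # Eq. (34)
        "Q6": Q1**2 * Cy**2 * (t2 * (1 - rho**2) + t1 * rho**2),      # Eq. (37)
    }


def pre(mse, baseline):
    """Percent relative efficiency with respect to a baseline variance (assumed definition)."""
    return 100.0 * baseline / mse


# Made-up inputs, for illustration only.
mse = existing_mse(Q1=10.0, Cy=0.5, Cx=0.4, rho=0.8, N=1000, m=200, n=50)
efficiency = {name: pre(value, mse["Q1"]) for name, value in mse.items()}
```

Since $θ_{2} - θ_{3} = θ_{1}$, expressions (30) and (37) coincide algebraically, so the regression estimator and the optimized estimator of [14] have the same first-order MSE in this sketch.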

4. Suggested Improved Families of Estimators

In this section, we present enhanced estimators for estimating the finite population mean, based on the principles outlined in [22,23]. These estimators utilize the known extreme values of auxiliary variables, along with their ranks, within a two-phase sampling method to enhance the reliability of the results. The mathematical expressions for these estimators are outlined below:

\hat{T} = [k_{1} {\hat{Q}}_{1} {(\frac{{\bar{x}}_{2}}{{\bar{x}}_{1}})}^{t_{1}} + k_{2} {(\frac{{\bar{x}}_{2}}{{\bar{x}}_{1}})}^{t_{2}}] exp [\frac{t_{3} ({\bar{x}}_{2} - {\bar{x}}_{1})}{t_{3} ({\bar{x}}_{1} + {\bar{x}}_{2}) + 2 t_{4}}]

(38)

and

{\hat{T}}_{e} = {\hat{Q}}_{1} exp [k_{3} \{\frac{({\bar{x}}_{2} - {\bar{x}}_{1})}{({\bar{x}}_{1} + {\bar{x}}_{2}) + 2 t_{5}}\}] exp [k_{4} \{\frac{({\bar{w}}_{2} - {\bar{w}}_{1})}{({\bar{w}}_{1} + {\bar{w}}_{2}) + 2 t_{6}}\}],

(39)

where the scalar parameters $(t_{1}, t_{2})$ are restricted to the values $(0, -1, 1)$, while $(k_{1}, k_{2})$ are unknown constants that must be chosen suitably to reduce the bias and the mean squared error. On the other hand, $k_{3} = k_{4} = 2$ are fixed known constants. The auxiliary variable parameters are represented by $t_{3}$ and $t_{4}$, whereas $t_{5} = X_{M} - X_{m}$ and $t_{6} = W_{M} - W_{m}$ are the ranges of the auxiliary variable and of its ranks, computed from their maximum and minimum observations. Additionally, various sub-forms of the proposed estimator-I are derived from Equation (38) and summarized in Table 2.

where

S = exp [\frac{t_{3} ({\bar{x}}_{2} - {\bar{x}}_{1})}{t_{3} ({\bar{x}}_{1} + {\bar{x}}_{2}) + 2 t_{4}}],

and

\frac{X_{M} - X_{m}}{\bar{X}},

represents the normalized range of the auxiliary variable.
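As an illustration of how Equations (38) and (39) would be evaluated from sample summaries, the following sketch computes both estimators. It is not the authors' implementation: the function names are hypothetical, and the denominator of the first exponent in (39) is assumed to be $({\bar{x}}_{1} + {\bar{x}}_{2}) + 2 t_{5}$, the form implied by the error expansion in Equation (45) and the definition of $v_{2}$.

```python
import math


def t_hat(y_bar, x1_bar, x2_bar, k1, k2, t1, t2, t3, t4):
    """Proposed estimator-I, Eq. (38), from first- and second-phase sample means."""
    ratio = x2_bar / x1_bar
    # Exponential factor S shared by both terms of the estimator.
    S = math.exp(t3 * (x2_bar - x1_bar) / (t3 * (x1_bar + x2_bar) + 2 * t4))
    return (k1 * y_bar * ratio**t1 + k2 * ratio**t2) * S


def t_e(y_bar, x1_bar, x2_bar, w1_bar, w2_bar,
        x_min, x_max, w_min, w_max, k3=2.0, k4=2.0):
    """Proposed estimator-II, Eq. (39), with the fixed constants k3 = k4 = 2."""
    t5 = x_max - x_min   # range of the auxiliary variable
    t6 = w_max - w_min   # range of its ranks
    # Assumed denominator: (x1_bar + x2_bar) + 2 * t5, consistent with Eq. (45).
    x_part = k3 * (x2_bar - x1_bar) / ((x1_bar + x2_bar) + 2 * t5)
    w_part = k4 * (w2_bar - w1_bar) / ((w1_bar + w2_bar) + 2 * t6)
    return y_bar * math.exp(x_part) * math.exp(w_part)


# Made-up sample summaries, for illustration only.
estimate = t_e(y_bar=10.0, x1_bar=4.0, x2_bar=5.0, w1_bar=6.0, w2_bar=6.0,
               x_min=1.0, x_max=4.0, w_min=1.0, w_max=12.0)
```

When the two phase means agree for both $X$ and $W$, both exponents vanish and ${\hat{T}}_{e}$ reduces to the plain sample mean ${\hat{Q}}_{1}$, which is a quick sanity check on any implementation.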

4.1. Properties of the Suggested Estimator-I

To analyze the properties of the first proposed class of estimators, we introduce the following error terms. Let

e_{0} = (\frac{{\hat{Q}}_{1} - Q_{1}}{Q_{1}}), e_{1} = (\frac{{\bar{x}}_{1} - \bar{X}}{\bar{X}}), e_{2} = (\frac{{\bar{x}}_{2} - \bar{X}}{\bar{X}}), e_{3} = (\frac{{\bar{w}}_{1} - \bar{W}}{\bar{W}}), e_{4} = (\frac{{\bar{w}}_{2} - \bar{W}}{\bar{W}}),

such that $E (e_{i}) = 0$ for $i = 0, 1, 2, 3, 4$.

Additionally,

E (e_{0}^{2}) = θ_{2} C_{y}^{2}, \quad E (e_{1}^{2}) = θ_{2} C_{x}^{2}, \quad E (e_{2}^{2}) = θ_{1} C_{x}^{2}, \quad E (e_{3}^{2}) = θ_{2} C_{w}^{2}, \quad E (e_{4}^{2}) = θ_{1} C_{w}^{2},

E (e_{0} e_{1}) = θ_{2} C_{y x}, \quad E (e_{0} e_{2}) = θ_{1} C_{y x}, \quad E (e_{0} e_{3}) = θ_{2} C_{y w}, \quad E (e_{0} e_{4}) = θ_{1} C_{y w},

E (e_{1} e_{2}) = θ_{1} C_{x}^{2}, \quad E (e_{1} e_{3}) = θ_{2} C_{x w}, \quad E (e_{1} e_{4}) = θ_{1} C_{x w}, \quad E (e_{2} e_{3}) = θ_{1} C_{x w}, \quad E (e_{2} e_{4}) = θ_{1} C_{x w},

and

E (e_{3} e_{4}) = θ_{1} C_{w}^{2} .

To explore the characteristics of the first proposed estimator, we express Equation (38) in terms of error components.

\begin{matrix} \hat{T} = [k_{1} Q_{1} (1 + e_{0}) {(1 + e_{1})}^{- t_{1}} {(1 + e_{2})}^{t_{1}} + k_{2} {(1 + e_{1})}^{- t_{2}} {(1 + e_{2})}^{t_{2}}] \times \\ exp [\frac{v_{1} (e_{2} - e_{1})}{2} {(1 + \frac{v_{1}}{2} (e_{2} + e_{1}))}^{- 1}], \end{matrix}

where

v_{1} = \frac{t_{3} \bar{X}}{t_{3} \bar{X} + t_{4}} .

By applying a first-order Taylor series expansion, we derive

\begin{matrix} \hat{T} - Q_{1} ≅ - Q_{1} + k_{1} Q_{1} [1 + e_{0} - e_{1} (t_{1} + \frac{v_{1}}{2}) + e_{2} (t_{1} + \frac{v_{1}}{2}) + e_{1}^{2} (\frac{t_{1} v_{1}}{2} + \frac{3 v_{1}^{2}}{8} + \frac{t_{1} (t_{1} + 1)}{2}) \\ + e_{2}^{2} (\frac{t_{1} v_{1}}{2} - \frac{v_{1}^{2}}{8} + \frac{t_{1} (t_{1} - 1)}{2}) - e_{0} e_{1} (t_{1} + \frac{v_{1}}{2}) + e_{0} e_{2} (t_{1} + \frac{v_{1}}{2}) - e_{1} e_{2} {(t_{1} + \frac{v_{1}}{2})}^{2}] \\ + k_{2} [1 - e_{1} (t_{2} + \frac{v_{1}}{2}) + e_{2} (t_{2} + \frac{v_{1}}{2}) + e_{1}^{2} (\frac{t_{2} v_{1}}{2} + \frac{3 v_{1}^{2}}{8} + \frac{t_{2} (t_{2} + 1)}{2}) \\ + e_{2}^{2} (\frac{t_{2} v_{1}}{2} - \frac{v_{1}^{2}}{8} + \frac{t_{2} (t_{2} - 1)}{2}) - e_{1} e_{2} {(t_{2} + \frac{v_{1}}{2})}^{2}] . \end{matrix}

(40)

Using (40), the bias of $\hat{T}$ is given by

\mathrm{Bias}(\hat{T}) ≅ [- Q_{1} + k_{1} Q_{1} A_{3} + k_{2} A_{5}],

(41)

where

\begin{matrix} A_{3} = [1 + θ_{2} \{C_{x}^{2} (\frac{4 t_{1} (t_{1} + 1 + v_{1}) + 3 v_{1}^{2}}{8}) - C_{y x} (\frac{2 t_{1} + v_{1}}{2})\} \\ + θ_{1} \{C_{x}^{2} \{\frac{- 4 t_{1} (t_{1} + v_{1} + 1) - 3 v_{1}^{2}}{2}\} + C_{y x} (\frac{2 t_{1} + v_{1}}{2})\}], \end{matrix}

and

A_{5} = [1 + θ_{2} C_{x}^{2} (\frac{4 t_{2} (t_{2} + v_{1} + 1) + 3 v_{1}^{2}}{8}) + θ_{1} C_{x}^{2} (\frac{- 2 t_{2} (t_{2} + v_{1} - 1) - 3 v_{1}^{2}}{4})] .

Squaring both sides of Equation (40) and then taking the expected value leads to the first-order mean squared error (MSE), which is expressed as follows:

\mathrm{MSE}(\hat{T}) ≅ [Q_{1}^{2} + k_{1}^{2} Q_{1}^{2} A_{1} + k_{2}^{2} A_{2} - 2 k_{1} Q_{1}^{2} A_{3} - 2 k_{2} Q_{1} A_{5} + 2 k_{1} k_{2} Q_{1} A_{4}],

(42)

where

\begin{matrix} A_{1} = [1 + θ_{2} \{C_{y}^{2} + C_{x}^{2} \{{(t_{1} + \frac{v_{1}}{2})}^{2} + (t_{1} v_{1} + \frac{3 v_{1}^{2}}{4} + \frac{t_{1} (t_{1} + 1)}{2})\} - 4 C_{y x} (t_{1} + \frac{v_{1}}{2})\} \\ + θ_{1} \{C_{x}^{2} \{{(t_{1} + \frac{v_{1}}{2})}^{2} + (t_{1} v_{1} - \frac{v_{1}^{2}}{4} + t_{1} (t_{1} - 1)) - 4 (t_{1} + \frac{v_{1}}{2})\} + 4 C_{y x} (t_{1} + \frac{v_{1}}{2})\}], \end{matrix}

\begin{matrix} A_{2} = [1 + θ_{2} C_{x}^{2} \{{(t_{2} + \frac{v_{1}}{2})}^{2} + (t_{2} v_{1} + \frac{3 v_{1}^{2}}{4} + t_{2} (t_{2} + 1))\} + θ_{1} C_{x}^{2} \{{(t_{2} + \frac{v_{1}}{2})}^{2} \\ + (t_{2} v_{1} - \frac{v_{1}^{2}}{4} + t_{2} (t_{2} - 1)) - 4 {(t_{2} + \frac{v_{1}}{2})}^{2}\}], \end{matrix}

and

\begin{matrix} A_{4} = [1 + θ_{2} \{C_{x}^{2} \{(\frac{t_{1} v_{1}}{2} + \frac{3 v_{1}^{2}}{8} + \frac{t_{1} (t_{1} + 1)}{2}) + (t_{1} + \frac{v_{1}}{2}) (t_{2} + \frac{v_{1}}{2}) + (\frac{t_{2} v_{1}}{2} + \frac{3 v_{1}^{2}}{8} + \frac{t_{2} (t_{2} + 1)}{2})\} \\ - C_{y x} (t_{1} + t_{2} + v_{1})\} + θ_{1} \{C_{x}^{2} \{(\frac{t_{1} v_{1}}{2} - \frac{v_{1}^{2}}{8} + \frac{t_{1} (t_{1} - 1)}{2}) - (t_{1} + \frac{v_{1}}{2}) (t_{2} + \frac{v_{1}}{2}) \\ + (\frac{t_{2} v_{1}}{2} - \frac{v_{1}^{2}}{8} + \frac{t_{2} (t_{2} - 1)}{2}) - {(t_{1} + \frac{v_{1}}{2})}^{2} - {(t_{2} + \frac{v_{1}}{2})}^{2}\} + C_{y x} (t_{1} + t_{2} + v_{1})\}] . \end{matrix}

To find the optimal values of $k_{1}$ and $k_{2}$, we minimize Equation (42), resulting in the following expressions:

k_{1 (opt)} = \frac{A_{2} A_{3} - A_{4} A_{5}}{A_{1} A_{2} - A_{4}^{2}},

and

k_{2 (opt)} = \frac{Q_{1} (A_{1} A_{5} - A_{3} A_{4})}{A_{1} A_{2} - A_{4}^{2}} .

Substituting the optimal values of $k_{1}$ and $k_{2}$ into Equations (41) and (42) gives the corresponding bias and the minimum mean squared error (MSE) of $\hat{T}$:

\mathrm{Bias}{(\hat{T})}_{\min} ≅ - Q_{1} [1 - \frac{(A_{1} A_{5}^{2} + A_{2} A_{3}^{2} - 2 A_{3} A_{4} A_{5})}{A_{1} A_{2} - A_{4}^{2}}],

(43)

and

\mathrm{MSE}{(\hat{T})}_{\min} ≅ Q_{1}^{2} [1 - \frac{(A_{1} A_{5}^{2} + A_{2} A_{3}^{2} - 2 A_{3} A_{4} A_{5})}{A_{1} A_{2} - A_{4}^{2}}] .

(44)
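The optimization step can be checked numerically. The sketch below treats $A_{1}, \dots, A_{5}$ as given numbers (the values used are made up, not taken from any dataset in the paper), evaluates the MSE in (42) at the optimal constants, and compares it with the closed form in (44). The function names are hypothetical.

```python
def mse_T(k1, k2, Q1, A1, A2, A3, A4, A5):
    """First-order MSE of T_hat as a function of k1 and k2, Eq. (42)."""
    return (Q1**2 + k1**2 * Q1**2 * A1 + k2**2 * A2
            - 2 * k1 * Q1**2 * A3 - 2 * k2 * Q1 * A5
            + 2 * k1 * k2 * Q1 * A4)


def optimal_k(Q1, A1, A2, A3, A4, A5):
    """Optimal constants k1(opt) and k2(opt) obtained by minimizing Eq. (42)."""
    D = A1 * A2 - A4**2
    k1 = (A2 * A3 - A4 * A5) / D
    k2 = Q1 * (A1 * A5 - A3 * A4) / D
    return k1, k2


def mse_T_min(Q1, A1, A2, A3, A4, A5):
    """Closed-form minimum MSE, Eq. (44)."""
    D = A1 * A2 - A4**2
    R = (A1 * A5**2 + A2 * A3**2 - 2 * A3 * A4 * A5) / D
    return Q1**2 * (1 - R)


# Illustrative (made-up) values of Q1 and A1..A5.
args = dict(Q1=10.0, A1=1.10, A2=1.05, A3=1.02, A4=1.01, A5=1.00)
k1_opt, k2_opt = optimal_k(**args)
at_optimum = mse_T(k1_opt, k2_opt, **args)
closed_form = mse_T_min(**args)
```

The two values agree up to rounding, and moving either constant away from its optimum increases the MSE whenever $A_{1} > 0$ and $A_{1} A_{2} - A_{4}^{2} > 0$, since (42) is then a convex quadratic in $(k_{1}, k_{2})$.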

4.2. Properties of the Suggested Estimator-II

Next, we express Equation (39) in terms of error components to derive the bias and mean squared error (MSE) of the second proposed estimator ${\hat{T}}_{e}$, i.e.,

\begin{matrix} {\hat{T}}_{e} = Q_{1} (1 + e_{0}) exp [k_{3} \{\frac{v_{2} (e_{2} - e_{1})}{2} {(1 + \frac{v_{2}}{2} (e_{1} + e_{2}))}^{- 1}\}] exp [k_{4} \{\frac{v_{3} (e_{4} - e_{3})}{2} \\ {(1 + \frac{v_{3}}{2} (e_{3} + e_{4}))}^{- 1}\}], \end{matrix}

(45)

where

v_{2} = \frac{\bar{X}}{\bar{X} + t_{5}},

and

v_{3} = \frac{\bar{W}}{\bar{W} + t_{6}} .

By applying a first-order Taylor expansion, we derive

\begin{matrix} {\hat{T}}_{e} - Q_{1} ≅ Q_{1} [e_{0} - \frac{k_{3} v_{2}}{2} (e_{1} - e_{2}) - \frac{k_{4} v_{3}}{2} (e_{3} - e_{4}) + (\frac{k_{3} v_{2}^{2}}{4} + \frac{k_{3}^{2} v_{2}^{2}}{8}) e_{1}^{2} - (\frac{k_{3} v_{2}^{2}}{4} - \frac{k_{3}^{2} v_{2}^{2}}{8}) e_{2}^{2} \\ + (\frac{k_{4} v_{3}^{2}}{4} + \frac{k_{4}^{2} v_{3}^{2}}{8}) e_{3}^{2} - (\frac{k_{4} v_{3}^{2}}{4} - \frac{k_{4}^{2} v_{3}^{2}}{8}) e_{4}^{2} - \frac{k_{3} v_{2}}{2} e_{0} e_{1} + \frac{k_{3} v_{2}}{2} e_{0} e_{2} - \frac{k_{4} v_{3}}{2} e_{0} e_{3} \\ + \frac{k_{4} v_{3}}{2} e_{0} e_{4} - \frac{k_{3}^{2} v_{2}^{2}}{2} e_{1} e_{2} + \frac{k_{3} k_{4} v_{2} v_{3}}{4} e_{1} e_{3} - \frac{k_{3} k_{4} v_{2} v_{3}}{4} e_{1} e_{4} - \frac{k_{3} k_{4} v_{2} v_{3}}{4} e_{2} e_{3} \\ + \frac{k_{3} k_{4} v_{2} v_{3}}{4} e_{2} e_{4} - \frac{k_{4}^{2} v_{3}^{2}}{2} e_{3} e_{4}] . \end{matrix}

(46)

Using (46), the bias of ${\hat{T}}_{e}$ is given by

\begin{matrix} \mathrm{Bias}({\hat{T}}_{e}) ≅ θ_{2} Q_{1} [(\frac{k_{3}^{2} v_{2}^{2}}{8} + \frac{k_{3} v_{2}^{2}}{4}) C_{x}^{2} + (\frac{k_{4}^{2} v_{3}^{2}}{8} + \frac{k_{4} v_{3}^{2}}{4}) C_{w}^{2} - \frac{k_{3} v_{2}}{2} C_{y x} - \frac{k_{4} v_{3}}{2} C_{y w} \\ + \frac{k_{3} k_{4} v_{2} v_{3}}{2} C_{x w}] - θ_{1} Q_{1} [(\frac{k_{3}^{2} v_{2}^{2}}{8} + \frac{k_{3} v_{2}^{2}}{4}) C_{x}^{2} + (\frac{k_{4}^{2} v_{3}^{2}}{8} + \frac{k_{4} v_{3}^{2}}{4}) C_{w}^{2} \\ - \frac{k_{3} v_{2}}{2} C_{y x} - \frac{k_{4} v_{3}}{2} C_{y w} + \frac{k_{3} k_{4} v_{2} v_{3}}{2} C_{x w}] . \end{matrix}

(47)

After squaring both sides of (46) and applying the expected value, we obtain a first-order mean squared error (MSE) as shown below:

\begin{matrix} \mathrm{MSE}({\hat{T}}_{e}) ≅ θ_{2} Q_{1}^{2} [C_{y}^{2} + \frac{k_{3}^{2} v_{2}^{2}}{4} C_{x}^{2} + \frac{k_{4}^{2} v_{3}^{2}}{4} C_{w}^{2} - k_{3} v_{2} C_{y x} - k_{4} v_{3} C_{y w} + \frac{k_{3} k_{4} v_{2} v_{3}}{2} C_{x w}] \\ - θ_{1} Q_{1}^{2} [\frac{k_{3}^{2} v_{2}^{2}}{4} C_{x}^{2} + \frac{k_{4}^{2} v_{3}^{2}}{4} C_{w}^{2} - k_{3} v_{2} C_{y x} - k_{4} v_{3} C_{y w} + \frac{k_{3} k_{4} v_{2} v_{3}}{2} C_{x w}] . \end{matrix}

(48)

After substituting the known values of $k_{3}$ and $k_{4}$ into Equations (47) and (48) and simplifying, the bias and MSE of ${\hat{T}}_{e}$ become

\mathrm{Bias}({\hat{T}}_{e}) ≅ θ_{3} Q_{1} [\frac{3}{2} (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2}) - (v_{2} C_{y x} + v_{3} C_{y w} - \frac{1}{2} v_{2} v_{3} C_{x w})],

(49)

and

\mathrm{MSE}({\hat{T}}_{e}) ≅ Q_{1}^{2} [θ_{2} C_{y}^{2} + θ_{3} (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w})] .

(50)
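The passage from (48) to (50) relies on $k_{3} = k_{4} = 2$ and on $θ_{3} = θ_{2} - θ_{1}$. A minimal numerical check of that simplification is sketched below. The function names and all input values are hypothetical and serve only to confirm that the two expressions agree.

```python
def mse_te_general(Q1, Cy, Cx, Cw, Cyx, Cyw, Cxw, v2, v3,
                   theta1, theta2, k3=2.0, k4=2.0):
    """MSE of T_e_hat before simplification, Eq. (48)."""
    aux = (k3**2 * v2**2 / 4 * Cx**2 + k4**2 * v3**2 / 4 * Cw**2
           - k3 * v2 * Cyx - k4 * v3 * Cyw
           + k3 * k4 * v2 * v3 / 2 * Cxw)
    return theta2 * Q1**2 * (Cy**2 + aux) - theta1 * Q1**2 * aux


def mse_te_simplified(Q1, Cy, Cx, Cw, Cyx, Cyw, Cxw, v2, v3, theta2, theta3):
    """MSE of T_e_hat after setting k3 = k4 = 2, Eq. (50)."""
    aux = (v2**2 * Cx**2 + v3**2 * Cw**2
           - 2 * v2 * Cyx - 2 * v3 * Cyw + 2 * v2 * v3 * Cxw)
    return Q1**2 * (theta2 * Cy**2 + theta3 * aux)


# Made-up design and population values, for illustration only.
N, m, n = 1000, 200, 50
theta1 = (N - m) / (m * N)
theta2 = (N - n) / (n * N)
theta3 = (m - n) / (m * n)
common = dict(Q1=10.0, Cy=0.5, Cx=0.4, Cw=0.3,
              Cyx=0.16, Cyw=0.10, Cxw=0.09, v2=0.6, v3=0.7)
general = mse_te_general(theta1=theta1, theta2=theta2, **common)
simplified = mse_te_simplified(theta2=theta2, theta3=theta3, **common)
```

Keeping the general form with `k3` and `k4` as arguments makes it easy to see how the MSE would change for other constants, although the paper fixes both at 2.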

5. Mathematical Comparison

In this section, the suggested estimators $\hat{T}$ and ${\hat{T}}_{e}$ are compared with the existing estimators ${\hat{Q}}_{1}$, ${\hat{Q}}_{2}$, ${\hat{Q}}_{3}$, ${\hat{Q}}_{4}$, ${\hat{Q}}_{5}$, and ${\hat{Q}}_{6}$.

5.1. Suggested Estimator-I

Condition (i): By (24) and (44),

\mathrm{Var}({\hat{Q}}_{1}) > \mathrm{MSE}{(\hat{T})}_{\min} \quad \text{if} \quad θ_{2} C_{y}^{2} + (\frac{A_{1} A_{5}^{2} + A_{2} A_{3}^{2} - 2 A_{3} A_{4} A_{5}}{A_{1} A_{2} - A_{4}^{2}}) > 1 .

Condition (ii): By (27) and (44),

\mathrm{MSE}({\hat{Q}}_{2}) > \mathrm{MSE}{(\hat{T})}_{\min} \quad \text{if} \quad (θ_{2} C_{y}^{2} + θ_{3} C_{x}^{2} - 2 θ_{3} C_{y x}) + (\frac{A_{1} A_{5}^{2} + A_{2} A_{3}^{2} - 2 A_{3} A_{4} A_{5}}{A_{1} A_{2} - A_{4}^{2}}) > 1 .

Condition (iii): By (30) and (44),

\mathrm{MSE}({\hat{Q}}_{3}) > \mathrm{MSE}{(\hat{T})}_{\min} \quad \text{if} \quad C_{y}^{2} (θ_{2} - θ_{3} ρ_{y x}^{2}) + (\frac{A_{1} A_{5}^{2} + A_{2} A_{3}^{2} - 2 A_{3} A_{4} A_{5}}{A_{1} A_{2} - A_{4}^{2}}) > 1 .

Condition (iv): By (33) and (44),

\mathrm{MSE}({\hat{Q}}_{4}) > \mathrm{MSE}{(\hat{T})}_{\min} \quad \text{if} \quad [θ_{2} C_{y}^{2} + θ_{3} C_{x}^{2} (\frac{1}{4} - δ)] + (\frac{A_{1} A_{5}^{2} + A_{2} A_{3}^{2} - 2 A_{3} A_{4} A_{5}}{A_{1} A_{2} - A_{4}^{2}}) > 1 .

Condition (v): By (34) and (44),

\mathrm{MSE}({\hat{Q}}_{5}) > \mathrm{MSE}{(\hat{T})}_{\min} \quad \text{if} \quad [θ_{2} C_{y}^{2} + θ_{3} C_{x}^{2} (\frac{1}{4} + δ)] + (\frac{A_{1} A_{5}^{2} + A_{2} A_{3}^{2} - 2 A_{3} A_{4} A_{5}}{A_{1} A_{2} - A_{4}^{2}}) > 1 .

Condition (vi): By (37) and (44),

\mathrm{MSE}{({\hat{Q}}_{6})}_{\min} > \mathrm{MSE}{(\hat{T})}_{\min} \quad \text{if} \quad C_{y}^{2} [θ_{2} (1 - ρ_{y x}^{2}) + θ_{1} ρ_{y x}^{2}] + (\frac{A_{1} A_{5}^{2} + A_{2} A_{3}^{2} - 2 A_{3} A_{4} A_{5}}{A_{1} A_{2} - A_{4}^{2}}) > 1 .
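Each of Conditions (i) to (vi) follows from the same one-line rearrangement. As a worked step for Condition (i), write $R$ (a shorthand introduced here only for this illustration) for the ratio that appears in Equation (44):

```latex
\begin{aligned}
R &= \frac{A_{1} A_{5}^{2} + A_{2} A_{3}^{2} - 2 A_{3} A_{4} A_{5}}{A_{1} A_{2} - A_{4}^{2}}, \\
\mathrm{Var}(\hat{Q}_{1}) > \mathrm{MSE}(\hat{T})_{\min}
  &\iff \theta_{2} Q_{1}^{2} C_{y}^{2} > Q_{1}^{2} (1 - R)
  \iff \theta_{2} C_{y}^{2} + R > 1 .
\end{aligned}
```

The remaining conditions are obtained in the same way, with $θ_{2} C_{y}^{2}$ replaced by the corresponding MSE divided by $Q_{1}^{2}$.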

5.2. Suggested Estimator-II

Condition (vii): By (24) and (50),

\mathrm{Var}({\hat{Q}}_{1}) > \mathrm{MSE}({\hat{T}}_{e}) \quad \text{if} \quad (θ_{2} - θ_{1}) (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) < 0 .

For $θ_{2} - θ_{1} > 0$, that is, $θ_{2} > θ_{1}$,

(v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) < 0 .

(51)

Similarly, for $θ_{2} - θ_{1} < 0$, that is, $θ_{2} < θ_{1}$,

(v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) > 0 .

(52)

Whenever either condition (51) or (52) is satisfied, the estimator ${\hat{T}}_{e}$ has a lower mean squared error (MSE) than $\mathrm{Var}({\hat{Q}}_{1})$ and is therefore more efficient than ${\hat{Q}}_{1}$.
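The origin of Condition (vii) can be made explicit by subtracting (50) from (24). In the worked step below, $G$ is a shorthand introduced here only for illustration, and $θ_{3} = θ_{2} - θ_{1}$ follows from the definitions of the correction terms in Section 3:

```latex
\begin{aligned}
G &= v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}, \\
\mathrm{Var}(\hat{Q}_{1}) - \mathrm{MSE}(\hat{T}_{e})
  &= \theta_{2} Q_{1}^{2} C_{y}^{2} - Q_{1}^{2} \left[\theta_{2} C_{y}^{2} + \theta_{3} G\right]
  = - \theta_{3} Q_{1}^{2} G
  = - (\theta_{2} - \theta_{1}) Q_{1}^{2} G .
\end{aligned}
```

Under the sampling scheme of Section 2, where $n < m$, the difference $θ_{2} - θ_{1} = (m - n) / (m n)$ is positive, so the gain is positive exactly when $G < 0$, which is condition (51).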

Condition (viii): By (27) and (50),

\mathrm{MSE}({\hat{Q}}_{2}) > \mathrm{MSE}({\hat{T}}_{e}) \quad \text{if} \quad (θ_{2} - θ_{1}) (C_{x}^{2} - 2 C_{y x}) > (θ_{2} - θ_{1}) (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

For $θ_{2} - θ_{1} > 0$, that is, $θ_{2} > θ_{1}$,

(C_{x}^{2} - 2 C_{y x}) > (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

(53)

Similarly, for $θ_{2} - θ_{1} < 0$, that is, $θ_{2} < θ_{1}$,

(C_{x}^{2} - 2 C_{y x}) < (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

(54)

Whenever either condition (53) or (54) is satisfied, the estimator ${\hat{T}}_{e}$ has a lower mean squared error (MSE) than ${\hat{Q}}_{2}$ and is therefore more efficient.

Condition (ix): By (30) and (50),

\mathrm{MSE}({\hat{Q}}_{3}) > \mathrm{MSE}({\hat{T}}_{e}) \quad \text{if} \quad (θ_{2} - θ_{1}) C_{y}^{2} ρ_{y x}^{* 2} < (θ_{2} - θ_{1}) (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

For $θ_{2} - θ_{1} > 0$, that is, $θ_{2} > θ_{1}$,

C_{y}^{2} ρ_{y x}^{* 2} > (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

(55)

Similarly, for $θ_{2} - θ_{1} < 0$, that is, $θ_{2} < θ_{1}$,

C_{y}^{2} ρ_{y x}^{* 2} < (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

(56)

Whenever either condition (55) or (56) is satisfied, the estimator ${\hat{T}}_{e}$ has a lower mean squared error (MSE) than ${\hat{Q}}_{3}$ and is therefore more efficient.

Condition (x): By (33) and (50),

M S E ({\hat{Q}}_{4}) > M S E ({\hat{T}}_{e}) if (θ_{2} - θ_{1}) C_{x}^{2} (\frac{1}{4} - δ) > (θ_{2} - θ_{1}) (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - v_{2} C_{y x} - v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

For

θ_{2} - θ_{1} > 0,

that is,

θ_{2} > θ_{1}

,

C_{x}^{2} (1 - 4 δ) > 4 (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

(57)

Similarly, for

θ_{2} - θ_{1} < 0,

that is,

θ_{2} < θ_{1}

,

C_{x}^{2} (1 - 4 δ) < 4 (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

(58)

Whenever either condition (57) or (58) is satisfied, the estimator ${\hat{T}}_{e}$ attains a lower mean squared error (MSE) than ${\hat{Q}}_{4}$ and is therefore more efficient.

Condition (xi): By (34) and (50),

M S E ({\hat{Q}}_{5}) > M S E ({\hat{T}}_{e}) if (θ_{2} - θ_{1}) C_{x}^{2} (\frac{1}{4} + δ) > (θ_{2} - θ_{1}) (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

For $θ_{2} - θ_{1} > 0$, that is, $θ_{2} > θ_{1}$,

C_{x}^{2} (1 + 4 δ) > 4 (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

(59)

Similarly, for $θ_{2} - θ_{1} < 0$, that is, $θ_{2} < θ_{1}$,

C_{x}^{2} (1 + 4 δ) < 4 (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

(60)

Whenever either condition (59) or (60) is satisfied, the estimator ${\hat{T}}_{e}$ attains a lower mean squared error (MSE) than ${\hat{Q}}_{5}$ and is therefore more efficient.

Condition (xii): By (37) and (50),

M S E {({\hat{Q}}_{6})}_{m i n} > M S E ({\hat{T}}_{e}) if (θ_{2} - θ_{1}) C_{y}^{2} ρ_{y x}^{2} < (θ_{2} - θ_{1}) (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

For $θ_{2} - θ_{1} > 0$, that is, $θ_{2} > θ_{1}$,

C_{y}^{2} ρ_{y x}^{2} > (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

(61)

Similarly, for $θ_{2} - θ_{1} < 0$, that is, $θ_{2} < θ_{1}$,

C_{y}^{2} ρ_{y x}^{2} < (v_{2}^{2} C_{x}^{2} + v_{3}^{2} C_{w}^{2} - 2 v_{2} C_{y x} - 2 v_{3} C_{y w} + 2 v_{2} v_{3} C_{x w}) .

(62)

Whenever either condition (61) or (62) is satisfied, the estimator ${\hat{T}}_{e}$ attains a lower mean squared error (MSE) than ${\hat{Q}}_{6}$ and is therefore more efficient.
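As a rough numerical illustration of how such a condition can be checked, the sketch below evaluates the quadratic form on the right-hand side of conditions (53) to (62) and compares it with the left-hand side of condition (53). It assumes the usual convention $C_{y x} = ρ_{y x} C_{y} C_{x}$ (and likewise for $C_{y w}$ and $C_{x w}$), which is not restated in this section, takes the Pop-I coefficients from Table 3, and uses purely hypothetical values $v_{2} = v_{3} = 0.5$. The function name is illustrative only.

```python
def rhs_quadratic(v2, v3, c_x, c_w, c_yx, c_yw, c_xw):
    """v2^2 Cx^2 + v3^2 Cw^2 - 2 v2 Cyx - 2 v3 Cyw + 2 v2 v3 Cxw."""
    return (v2 ** 2 * c_x ** 2 + v3 ** 2 * c_w ** 2
            - 2 * v2 * c_yx - 2 * v3 * c_yw
            + 2 * v2 * v3 * c_xw)


# Pop-I coefficients of variation and correlations, as listed in Table 3.
c_y, c_x, c_w = 3.40, 1.34, 0.57
rho_yx, rho_yw, rho_xw = 0.39, 0.36, 0.75

# Assumed convention: C_ab = rho_ab * C_a * C_b.
c_yx = rho_yx * c_y * c_x
c_yw = rho_yw * c_y * c_w
c_xw = rho_xw * c_x * c_w

# Hypothetical constants, chosen only to make the example concrete.
v2, v3 = 0.5, 0.5

rhs = rhs_quadratic(v2, v3, c_x, c_w, c_yx, c_yw, c_xw)
lhs_53 = c_x ** 2 - 2 * c_yx          # left-hand side of condition (53)
condition_53_holds = lhs_53 > rhs     # relevant when theta_2 > theta_1
print(f"lhs = {lhs_53:.4f}, rhs = {rhs:.4f}, (53) holds: {condition_53_holds}")
```

The outcome depends entirely on the chosen $v_{2}$ and $v_{3}$, so the printed result says nothing about the estimators themselves; in practice the constants defined earlier in the article would be substituted.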

6. Numerical Comparison

This section presents a comparative evaluation of the percent relative efficiency (PRE) between the proposed estimators and several established ones. The assessment utilizes both simulated datasets and three different real datasets. Through this analysis, we aim to offer a detailed understanding of how the proposed estimators perform in terms of accuracy and consistency across a range of practical applications.

6.1. Simulation Study

In the simulation study, the auxiliary variable X is generated from six populations, each following a different probability distribution, as listed below.

Population 1: $X \sim \text{Log-Normal} (η_{1} = 7, η_{2} = 4),$
Population 2: $X \sim \text{Log-Normal} (η_{3} = 9, η_{4} = 6),$
Population 3: $X \sim \text{Exponential} (μ_{1} = 4),$
Population 4: $X \sim \text{Exponential} (μ_{2} = 9),$
Population 5: $X \sim \text{Gamma} (γ_{1} = 3, γ_{2} = 5),$
Population 6: $X \sim \text{Gamma} (γ_{3} = 7, γ_{4} = 10) .$

In the simulation study, the log-normal, exponential, and gamma distributions were selected to reflect realistic data structures commonly found in survey sampling. These distributions cover a range of positive skewness and variability, allowing for a robust evaluation of the proposed estimators. The log-normal distribution models highly skewed socio-economic variables such as income, the exponential distribution represents time-to-event data, and the gamma distribution provides flexibility to simulate various positively skewed datasets. This selection ensures that the simulation results are applicable to a wide range of practical scenarios where two-phase sampling techniques are employed.

The study variable Y is then generated from X as follows:

Y = r_{y x} \times X + e,

where $r_{y x} = 0.80$ represents the correlation coefficient between the dependent and independent variables, and $e \sim N (0, 1)$ is the error term.
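The population-generation step can be sketched as follows. This is a minimal Python illustration, not the authors' R code. The parameterizations are assumptions, since the text does not state them: the log-normal parameters are read as the mean and standard deviation on the log scale, the exponential parameter as a rate, and the gamma parameters as shape and scale. The names `generate_population` and `populations` are hypothetical.

```python
import random


def generate_population(draw_x, size=1000, r_yx=0.80, seed=1):
    """Draw X from the given sampler and set Y = r_yx * X + e with e ~ N(0, 1)."""
    rng = random.Random(seed)
    xs = [draw_x(rng) for _ in range(size)]
    ys = [r_yx * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


# One sampler per population; parameter meanings are assumed (see lead-in).
populations = {
    "LN(7, 4)": lambda rng: rng.lognormvariate(7, 4),    # log-mean, log-sd
    "LN(9, 6)": lambda rng: rng.lognormvariate(9, 6),
    "Exp(4)": lambda rng: rng.expovariate(4),            # rate
    "Exp(9)": lambda rng: rng.expovariate(9),
    "Gam(3, 5)": lambda rng: rng.gammavariate(3, 5),     # shape, scale
    "Gam(7, 10)": lambda rng: rng.gammavariate(7, 10),
}

x_pop, y_pop = generate_population(populations["Gam(3, 5)"])
```

Passing the sampler as a function keeps the relationship between Y and X in one place, so switching distributions does not touch the model for Y.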

The mean squared errors (MSEs) and percent relative efficiencies (PREs) of the proposed and existing estimators were computed in R (v. 4.4.0) using the following steps.

Step 1: We begin by generating a population consisting of 1000 observations, each obtained from the specified probability distributions mentioned above.
Step 2: A first-phase sample of size m is selected from the population of size N using simple random sampling without replacement (SRSWOR).
Step 3: The second-phase sample consisting of n units is subsequently selected from the initial sample by reapplying the SRSWOR approach.
Step 4: The population total and the minimum and maximum observations of the auxiliary variable, along with their ranks, are determined using the steps described above. We also find the optimal values of the unknown constants of the proposed estimators.
Step 5: Different sample sizes are obtained for each population using SRSWOR.
Step 6: The MSE and PRE values for all estimators discussed in this article are computed for each sample size.
Step 7: After 50,000 repetitions of Steps 5 and 6, the results for artificial populations are presented in Table 4, while Table 5 summarizes the findings for real datasets, all calculated using the formula provided below.

M S E {({\hat{S}}_{j})}_{min} = \frac{\sum_{i = 1}^{50,000} {({\hat{S}}_{j i} - \bar{Y})}^{2}}{50,000}

and

P R E = \frac{V a r ({\hat{Q}}_{1})}{M S E {({\hat{S}}_{j})}_{min}} \times 100,

where

{\hat{S}}_{j} = {\hat{Q}}_{1}, {\hat{Q}}_{2}, {\hat{Q}}_{3}, {\hat{Q}}_{4}, {\hat{Q}}_{5}, {\hat{Q}}_{6}, {\hat{T}}_{e}, {\hat{T}}_{1}, {\hat{T}}_{2}, \dots, {\hat{T}}_{8} .
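Steps 1 to 7 and the two formulas above can be sketched roughly as follows. This is a minimal Python illustration rather than the authors' R code: it compares only the second-phase sample mean (standing in for ${\hat{Q}}_{1}$) with the classical two-phase ratio estimator. The population (gamma with shape 3 and scale 5), the sample sizes m = 300 and n = 100, the reduced replication count, and the function name are all assumptions made for the example.

```python
import random
from statistics import fmean


def simulate_pre(xs, ys, m, n, reps=2000, seed=2025):
    """Monte Carlo MSEs of the sample mean and a ratio estimator, plus the PRE."""
    rng = random.Random(seed)
    y_bar = fmean(ys)                      # true population mean of Y
    units = range(len(ys))
    sse_mean = sse_ratio = 0.0
    for _ in range(reps):
        first = rng.sample(units, m)       # first-phase SRSWOR of size m
        second = rng.sample(first, n)      # second-phase SRSWOR of size n
        x_first = fmean([xs[i] for i in first])
        x_second = fmean([xs[i] for i in second])
        y_second = fmean([ys[i] for i in second])
        sse_mean += (y_second - y_bar) ** 2
        sse_ratio += (y_second * x_first / x_second - y_bar) ** 2
    mse_mean = sse_mean / reps
    mse_ratio = sse_ratio / reps
    # PRE of the ratio estimator relative to the sample mean.
    return mse_mean, mse_ratio, 100.0 * mse_mean / mse_ratio


# Hypothetical population: X ~ Gamma(shape 3, scale 5), Y = 0.80 X + e.
pop_rng = random.Random(1)
x_pop = [pop_rng.gammavariate(3, 5) for _ in range(1000)]
y_pop = [0.80 * x + pop_rng.gauss(0.0, 1.0) for x in x_pop]

mse_mean, mse_ratio, pre_ratio = simulate_pre(x_pop, y_pop, m=300, n=100)
print(f"MSE(mean) = {mse_mean:.4f}, MSE(ratio) = {mse_ratio:.4f}, PRE = {pre_ratio:.1f}")
```

Any other estimator can be added by accumulating its squared error inside the same loop, so that all estimators are evaluated on identical samples.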

6.2. Numerical Examples

We computed the mean squared errors (MSEs) of the estimators on three real datasets to evaluate the effectiveness of the proposed estimators. The datasets are described below, along with summary statistics:

Data 1. (Source: [33], p. 226)
Y: departmental employment levels in 2012;
X: number of factories the departments registered in 2012;
W: ranking of the number of factories the departments registered in 2012.

Data 2. (Source: [33], p. 135)
Y: the total student population in educational institutions in 2012;
X: the total number of schools operated with government funding in 2012;
W: the rank of each area based on the total number of schools funded by the government in 2012.

Data 3. (Source: [1], p. 24)
Y: the cost of food for families directly influenced by their occupations;
X: the total weekly income earned by families reflecting their financial resources during that period;
W: the ranking of families based on their weekly income showing how their earnings compare to each other.

Summary statistics for these datasets are given in Table 3, and the percent relative efficiency (PRE) values of the proposed and existing estimators are compared in Table 5.
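In each dataset, W is the rank of X. A minimal sketch of how such ranks could be computed is given below. The tie-handling rule (average ranks) and the sample values are assumptions, since the article does not say how ties are treated.

```python
from statistics import fmean, stdev


def ranks(values):
    """Return ranks starting at 1 for the smallest value; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


x = [335, 24, 2055, 410, 410, 98]   # hypothetical auxiliary values
w = ranks(x)
print(w)                            # [3.0, 1.0, 6.0, 4.5, 4.5, 2.0]

# With 36 distinct values the ranks are 1..36.
w36 = ranks(list(range(100, 136)))
print(fmean(w36), round(stdev(w36), 2))   # 18.5 10.54
```

For 36 distinct values the mean rank is 18.5 and the standard deviation is about 10.54, which is in line with the $\bar{W}$ and $S_{w}$ entries reported for Pop-I and Pop-II in Table 3.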

6.3. Discussion

Simulation studies and analysis of three real populations were conducted to evaluate the performance of the proposed estimators. The estimators were compared using the percent relative efficiency (PRE) criterion. Table 4 shows the simulation results, including PRE values for both proposed and existing estimators, while Table 5 summarizes the findings from the real populations. The following general conclusions can be drawn from these studies:

In all simulated scenarios and real datasets, the PRE values of all suggested estimators are higher than those of the existing estimators reported in the literature, as shown in Table 4 and Table 5. This indicates that the suggested estimators are more efficient than the existing ones.
In addition, the upward-trending lines in Figure 1 and Figure 2 show that, for both the simulation studies and the real populations, the PRE values of the suggested estimators are consistently higher than those of the existing estimators, which again indicates that the suggested classes of estimators perform better than the existing ones.

Table 4. PRE of different estimators using the artificial populations.

Estimator	$LN (7, 4)$	$LN (9, 6)$	$Exp (4)$	$Exp (9)$	$Gam (3, 5)$	$Gam (7, 10)$
${\hat{Q}}_{1}$	100	100	100	100	100	100
${\hat{Q}}_{2}$	105.839	107.971	108.983	110.287	111.064	114.373
${\hat{Q}}_{3}$	118.317	120.643	113.286	115.735	127.677	129.607
${\hat{Q}}_{4}$	123.695	126.703	122.730	131.398	140.342	146.740
${\hat{Q}}_{5}$	91.973	98.514	87.843	93.665	92.081	97.192
${\hat{Q}}_{6}$	123.787	126.812	130.209	134.947	140.556	145.389
${\hat{T}}_{1}$	508.112	683.541	576.577	616.447	704.186	818.293
${\hat{T}}_{2}$	974.074	1046.103	783.673	830.248	952.000	1035.902
${\hat{T}}_{3}$	423.399	559.205	530.387	569.895	619.075	691.546
${\hat{T}}_{4}$	1295.076	1331.667	807.595	844.140	1028.387	1102.939
${\hat{T}}_{5}$	1057.236	1203.164	784.793	847.426	1068.354	1178.732
${\hat{T}}_{6}$	393.263	583.854	539.394	589.307	673.846	720.480
${\hat{T}}_{7}$	1349.529	1559.195	1082.609	1132.379	1108.818	1266.757
${\hat{T}}_{8}$	1610.804	1793.650	1206.316	1476.295	1310.380	1420.820
${\hat{T}}_{e}$	169.632	175.107	147.969	173.367	181.864	197.643

Table 5. PRE of different estimators using real populations.

Estimator	PRE: Pop-I	PRE: Pop-II	PRE: Pop-III
${\hat{Q}}_{1}$	100	100	100
${\hat{Q}}_{2}$	112.104	101.092	103.252
${\hat{Q}}_{3}$	112.106	102.383	104.870
${\hat{Q}}_{4}$	108.945	102.342	104.642
${\hat{Q}}_{5}$	87.845	94.535	90.714
${\hat{Q}}_{6}$	138.664	114.875	122.651
${\hat{T}}_{1}$	1204.266	1192.485	765.188
${\hat{T}}_{2}$	1625.500	1784.690	1094.834
${\hat{T}}_{3}$	859.653	1048.866	540.278
${\hat{T}}_{4}$	1675.141	1805.001	584.391
${\hat{T}}_{5}$	1303.178	1345.824	909.167
${\hat{T}}_{6}$	768.419	1007.728	870.308
${\hat{T}}_{7}$	2262.137	1843.198	1675.656
${\hat{T}}_{8}$	3283.4	3063.480	2329.250
${\hat{T}}_{e}$	155.795	142.829	132.737

7. Conclusions

This article introduces new classes of efficient estimators for the finite population mean, utilizing the ranks of the auxiliary variable along with the known minimum and maximum values. To assess the properties of these estimators in comparison to existing ones, we derived theoretical conditions in Section 5, demonstrating the improved efficiency of the proposed methods. To verify these conditions, simulation studies and analysis of various empirical datasets were carried out. The results indicate that the proposed estimators consistently outperform the existing ones in terms of percent relative efficiency (PRE), as illustrated in Table 4. These findings are further supported by the empirical results in Table 5, which validate the theoretical conditions established in Section 5.

The simulation results and empirical studies clearly demonstrate that the proposed estimators ${\hat{T}}_{i}$ $(i = e, 1, 2, \dots, 8)$ outperform the other estimators under consideration in terms of efficiency. Among the proposed estimators, ${\hat{T}}_{8}$ stands out as the most effective choice and is therefore strongly recommended for use.

In addition, our study examined the properties of the proposed efficient estimators within a two-phase sampling framework. Moving forward, there is significant potential to develop new estimators based on these findings, aiming to achieve even higher percent relative efficiency (PRE) values. Future research may focus on extending these estimators to more complex sampling designs, such as stratified or multistage sampling, to evaluate their effectiveness in various practical contexts. Furthermore, the use of machine learning methods for adaptive sampling may offer valuable perspectives on improving estimator efficiency. Exploring the application of these estimators in fields like environmental monitoring or healthcare data analysis can also lead to new opportunities for real-world implementation.

Author Contributions

Conceptualization, A.S.A. and F.A.A.; Methodology, A.S.A. and F.A.A.; Software, A.S.A. and F.A.A.; Validation, F.A.A.; Formal analysis, A.S.A. and F.A.A.; Investigation, A.S.A. and F.A.A.; Resources, A.S.A. and F.A.A.; Data curation, A.S.A. and F.A.A.; Writing—original draft, A.S.A.; Writing—review & editing, F.A.A.; Visualization, A.S.A. and F.A.A.; Supervision, F.A.A.; Project administration, F.A.A.; Funding acquisition, F.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R515), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Cochran, W.G. Sampling Techniques; John Wiley and Sons: Hoboken, NJ, USA, 1963.
Khoshnevisan, M.; Singh, R.; Chauhan, P.; Sawan, N. A general family of estimators for estimating population mean using known value of some population parameter(s). Far East J. Theor. Stat. 2007, 22, 181–191.
Rueda, M.M.; Arcos, A.; Martínez-Miranda, M.D.; Román, Y. Some improved estimators of finite population quantile using auxiliary information in sample surveys. Comput. Stat. Data Anal. 2004, 45, 825–848.
Särndal, C.E. Sample survey theory vs. general statistical theory: Estimation of the population mean. Int. Stat. Rev. 1972, 40, 1–12.
Tarima, S.; Pavlov, D. Using auxiliary information in statistical function estimation. ESAIM Probab. Stat. 2006, 10, 11–23.
Neyman, J. Contribution to the theory of sampling human populations. J. Am. Stat. Assoc. 1938, 33, 101–116.
Sukhatme, B.V. Some ratio-type estimators in two-phase sampling. J. Am. Stat. Assoc. 1962, 57, 628–632.
Erinola, A.Y.; Singh, R.V.K.; Audu, A.; James, T. Modified class of estimator for finite population mean under two-phase sampling using regression estimation approach. Asian J. Probab. Stat. 2021, 4, 52–64.
Garg, N.; Srivastava, M. A general class of estimators of a finite population mean using multi-auxiliary information under two stage sampling scheme. J. Reliab. Stat. Stud. 2009, 2, 103–118.
Guha, S.; Chandra, H. Improved estimation of finite population mean in two-phase sampling with subsampling of the nonrespondents. Math. Popul. Stud. 2021, 28, 24–44.
Daraz, U.; Wu, J.; Agustiana, D.; Emam, W. Finite population variance estimation using Monte Carlo simulation and real life application. Symmetry 2025, 17, 84.
Daraz, U.; Agustiana, D.; Wu, J.; Emam, W. Twofold auxiliary information under two-phase sampling: An improved family of double-transformed variance estimators. Axioms 2025, 14, 64.
Singh, H.P.; Vishwakarma, G.K. Modified exponential ratio and product estimators for finite population mean in double sampling. Austrian J. Stat. 2007, 36, 217–225.
Singh, H.P.; Espejo, M.R. Double sampling ratio-product estimator of a finite population mean in sample surveys. J. Appl. Stat. 2007, 34, 71–85.
Vishwakarma, G.K.; Zeeshan, S.M. Generalized ratio-cum-product estimator for finite population mean under two-phase sampling scheme. J. Mod. Appl. Stat. Methods 2020, 19, 1–16.
Zaman, T.; Kadilar, C. New class of exponential estimators for finite population mean in two-phase sampling. Commun. Stat.-Theory Methods 2021, 50, 874–889.
Albalawi, O. Estimation techniques utilizing dual auxiliary variables in stratified two-phase sampling. AIMS Math. 2024, 9, 33139–33160.
Mohanty, S.; Sahoo, J. A note on improving the ratio method of estimation through linear transformation using certain known population parameters. Sankhyā Indian J. Stat. Ser. 1995, 57, 93–102.
Khan, M.; Shabbir, J. Some improved ratio, product, and regression estimators of finite population mean when using minimum and maximum values. Sci. World J. 2013, 2013, 431868.
Khan, M. Improvement in estimating the finite population mean under maximum and minimum values in double sampling scheme. J. Stat. Appl. Probab. Lett. 2015, 2, 115–121.
Walia, G.S.; Kaur, H.; Sharma, M. Ratio type estimator of population mean through efficient linear transformation. Am. J. Math. Stat. 2015, 5, 144–149.
Daraz, U.; Shabbir, J.; Khan, H. Estimation of finite population mean by using minimum and maximum values in stratified random sampling. J. Mod. Appl. Stat. Methods 2018, 17, 1–15.
Daraz, U.; Khan, M. Estimation of variance of the difference-cum-ratio-type exponential estimator in simple random sampling. Res. Math. Stat. 2021, 8, 1899402.
Daraz, U.; Wu, J.; Albalawi, O. Double exponential ratio estimator of a finite population variance under extreme values in simple random sampling. Mathematics 2024, 12, 1737.
Alghamdi, A.S.; Alrweili, H. A comparative study of new ratio-type family of estimators under stratified two-phase sampling. Mathematics 2025, 13, 327.
Alghamdi, A.S.; Alrweili, H. New class of estimators for finite population mean under stratified double phase sampling with simulation and real-life application. Mathematics 2025, 13, 329.
Cekim, H.O.; Cingi, H. Some estimator types for population mean using linear transformation with the help of the minimum and maximum values of the auxiliary variable. Hacet. J. Math. Stat. 2017, 46, 685–694.
Alomair, M.A.; Daraz, U. Dual transformation of auxiliary variables by using outliers in stratified random sampling. Mathematics 2024, 12, 2829.
Daraz, U.; Alomair, M.A.; Albalawi, O.; Al Naim, A.S. New techniques for estimating finite population variance using ranks of auxiliary variable in two-stage sampling. Mathematics 2024, 12, 2741.
Chatterjee, S.; Hadi, A.S. Regression Analysis by Example; John Wiley & Sons: Hoboken, NJ, USA, 2013.
Daraz, U.; Wu, J.; Alomair, M.A.; Aldoghan, L.A. New classes of difference cum-ratio-type exponential estimators for a finite population variance in stratified random sampling. Heliyon 2024, 10, e33402.
Daraz, U.; Alomair, M.A.; Albalawi, O. Variance estimation under some transformation for both symmetric and asymmetric data. Symmetry 2024, 16, 957.
Bureau of Statistics. Punjab Development Statistics; Government of the Punjab: Lahore, Pakistan, 2013.

Figure 1. Percent relative efficiency (PRE) of the proposed and existing estimators based on artificial data. The vertical axis shows the PRE of each estimator, and the horizontal axis shows numerical labels from 1 to 15 that identify the individual estimators. For the estimators and their associated values, see Table 4.

Figure 2. Percent relative efficiency (PRE) of the proposed and existing estimators based on actual populations. The vertical axis shows the PRE of each estimator, and the horizontal axis shows numerical labels from 1 to 15 that identify the individual estimators. For the estimators and their associated values, see Table 3 and Table 5.

Table 1. List of important variable notations under a two-phase sampling design.

Symbol	Meaning	Symbol	Meaning
Y	Study variable	X	Auxiliary variable
W	Rank of auxiliary variable	$\bar{Y}$	Population mean of Y
$\bar{X}$	Population mean of X	$\bar{W}$	Population mean of W
$S_{y}^{2}$	Population variance of Y	$S_{x}^{2}$	Population variance of X
$S_{w}^{2}$	Population variance of W	$C_{y}$	Coefficient of variation of Y
$C_{x}$	Coefficient of variation of X	$C_{w}$	Coefficient of variation of W
$ρ_{y x}$	Correlation between Y and X	$ρ_{y w}$	Correlation between Y and W
$ρ_{x w}$	Correlation between X and W	m	First-phase sample size
n	Second-phase sample size	N	Population size
$θ_{1}, θ_{2}, θ_{3}$	Sampling fractions	$Q_{1}$	Usual unbiased estimator
$Q_{2}, Q_{3}, \dots, Q_{6}$	Existing estimators	$T, T_{e}$	Proposed estimators
$e_{0}, e_{1}, e_{2}, e_{3}, e_{4}$	Error terms

Table 2. Various forms of estimator-I under a two-phase sampling design.

Subsets of $\hat{T}$	$t_{1}$	$t_{2}$	$t_{3}$	$t_{4}$
${\hat{T}}_{1} = [k_{1} {\hat{Q}}_{1} (\frac{{\bar{x}}_{2}}{{\bar{x}}_{1}}) + k_{2} (\frac{{\bar{x}}_{1}}{{\bar{x}}_{2}})] S$	1	−1	1	$(X_{M} - X_{m}) / \bar{X}$
${\hat{T}}_{2} = [k_{1} {\hat{Q}}_{1} (\frac{{\bar{x}}_{1}}{{\bar{x}}_{2}}) + k_{2} (\frac{{\bar{x}}_{2}}{{\bar{x}}_{1}})] S$	−1	1	$ρ_{y x}$	$(X_{M} - X_{m}) / \bar{X}$
${\hat{T}}_{3} = [k_{1} {\hat{Q}}_{1} (\frac{{\bar{x}}_{1}}{{\bar{x}}_{2}}) + k_{2} (\frac{{\bar{x}}_{1}}{{\bar{x}}_{2}})] S$	−1	−1	$(X_{M} - X_{m}) / \bar{X}$	$ρ_{y x}$
${\hat{T}}_{4} = [k_{1} {\hat{Q}}_{1} + k_{2} (\frac{{\bar{x}}_{2}}{{\bar{x}}_{1}})] S$	0	1	$(X_{M} - X_{m}) / \bar{X}$	$β_{2 (x)}$
${\hat{T}}_{5} = [k_{1} {\hat{Q}}_{1} (\frac{{\bar{x}}_{2}}{{\bar{x}}_{1}}) + k_{2} (\frac{{\bar{x}}_{2}}{{\bar{x}}_{1}})] S$	1	1	$(X_{M} - X_{m}) / \bar{X}$	1
${\hat{T}}_{6} = [k_{1} {\hat{Q}}_{1} + k_{2} (\frac{{\bar{x}}_{1}}{{\bar{x}}_{2}})] S$	0	−1	$β_{2 (x)}$	$(X_{M} - X_{m}) / \bar{X}$
${\hat{T}}_{7} = [k_{1} {\hat{Q}}_{1} (\frac{{\bar{x}}_{2}}{{\bar{x}}_{1}}) + k_{2}] S$	1	0	$(X_{M} - X_{m}) / \bar{X}$	$C_{x}$
${\hat{T}}_{8} = [k_{1} {\hat{Q}}_{1} (\frac{{\bar{x}}_{1}}{{\bar{x}}_{2}}) + k_{2}] S$	−1	0	$C_{x}$	$(X_{M} - X_{m}) / \bar{X}$

Table 3. Summary statistics for different real populations.

Parameters	Pop-I	Pop-II	Pop-III	Parameters	Pop-I	Pop-II	Pop-III
N	36	36	33	$\bar{Y}$	52,432	148,718	27.49
m	15	15	15	$\bar{X}$	335.78	1054.39	72.55
n	6	6	6	$\bar{W}$	18.51	18.50	17.00
$X_{M}$	2055	2370	95	$S_{y}$	178,201	182,315	10.13
$X_{m}$	24	39	58	$S_{x}$	451.136	402.61	10.58
$W_{M}$	36	36	33	$S_{w}$	10.53	10.54	9.64
$W_{m}$	1	1	1	$C_{y}$	3.40	1.23	0.37
$C_{x}$	1.34	0.38	0.15	$C_{w}$	0.57	0.56	0.57
$ρ_{y x}$	0.39	0.17	0.25	$ρ_{y w}$	0.36	0.19	0.20
$ρ_{x w}$	0.75	0.94	0.98	$θ_{1}$	0.10	0.10	0.10
$θ_{2}$	0.14	0.14	0.14	$θ_{3}$	0.04	0.04	0.04
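The sampling fractions in Table 3 can be reproduced from N, m, and n under one plausible set of definitions, $θ_{1} = 1/n - 1/m$, $θ_{2} = 1/n - 1/N$, and $θ_{3} = 1/m - 1/N$. These definitions are an assumption here, chosen because they match the tabulated values; they are not restated in this part of the article. The short check below also recomputes $C_{y} = S_{y} / \bar{Y}$ for Pop-I. The function name is illustrative.

```python
def sampling_fractions(N, m, n):
    """Assumed definitions that reproduce the theta values listed in Table 3."""
    theta1 = 1 / n - 1 / m
    theta2 = 1 / n - 1 / N
    theta3 = 1 / m - 1 / N
    return theta1, theta2, theta3


# Population sizes from Table 3; m = 15 and n = 6 in all three populations.
rounded = {}
for name, N in [("Pop-I", 36), ("Pop-II", 36), ("Pop-III", 33)]:
    rounded[name] = tuple(round(t, 2) for t in sampling_fractions(N, m=15, n=6))
    print(name, rounded[name])      # (0.1, 0.14, 0.04) for each population

# Coefficient of variation of Y for Pop-I: S_y / Y-bar.
c_y_pop1 = round(178201 / 52432, 2)
print(c_y_pop1)                     # 3.4
```

Under these assumed definitions, $θ_{2} - θ_{1} = θ_{3} > 0$ for all three populations, so only the $θ_{2} > θ_{1}$ branch of the conditions in Section 5 would apply to these data.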

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

