1. Introduction
Accurate inference on population distribution features is central to survey statistics and numerous applied domains. When unit sizes exhibit pronounced heterogeneity, probability proportional to size (PPS) sampling [1,2] delivers substantial efficiency gains over equal-probability designs by prioritizing information-rich units. Beyond totals or means, reliable inference on the cumulative distribution function (CDF) is crucial for quantile estimation, inequality assessment, risk profiling, and tail probability reporting in economics, health, and environmental monitoring [3,4]. However, standard distribution function estimators often suffer from inflated mean squared error (MSE) and diminished percentage relative efficiency (PRE) when size variability is pronounced [5,6,7,8,9].
Existing research has enriched distribution function estimation by leveraging auxiliary information, primarily under simple random sampling (SRS). Hartley–Ross-type approaches have been extended to handle nonresponse and robustness [10,11,12], while auxiliary-variable calibration has yielded efficient families of CDF estimators [13,14]. Within PPS designs, most attention has focused on mean estimation, where logarithmic, predictive, and regression-type estimators have been proposed to reduce bias, minimize MSE, and improve robustness [15,16,17,18,19]. Recent empirical work has further demonstrated the advantages of PPS-based mean estimators using real-world data, such as the radiation dataset [20]. Extensions to ranked set sampling (RSS) have also shown promise in survey applications, offering lower variances and higher efficiencies than conventional designs [21]. Yet, despite these advances, a clear methodological gap persists in CDF estimation under PPS: no existing approach integrates multiple auxiliary variables with tractable theoretical guarantees and extensive empirical validation. To address this gap, the present study develops a new class of CDF estimators explicitly designed for PPS sampling. At the design stage, one auxiliary variable governs the inclusion probabilities, while at the estimation stage, multiple auxiliary variables (up to three) are incorporated to enhance precision. The proposed estimators are computationally straightforward, admit closed-form first-order approximations for bias and MSE, and can be easily implemented within standard survey workflows.
Parallel to these traditional developments, advances in computational and statistical learning have introduced flexible, data-driven methodologies that align conceptually with PPS-based estimation principles [22,23,24]. For instance, Ref. [25] proposed a regression-based conditional independence test employing adaptive kernels, providing a mechanism for data-driven bandwidth selection that parallels efficiency optimization in PPS estimation. Similarly, Ref. [26] introduced an adaptive tempered reversible-jump algorithm for Bayesian curve fitting, demonstrating that flexible computational structures can enhance estimator stability and convergence. Furthermore, Ref. [27] developed a Gaussian kernel similarity approach for multisource information fusion, emphasizing the role of weighted auxiliary information in improving estimation accuracy. This idea resonates with our proposed dual-auxiliary-variable framework, wherein the inclusion of multiple auxiliary sources systematically reduces estimator bias and MSE. Likewise, Ref. [28] examined belief-based fuzzy and imprecise clustering for arbitrary data distributions, offering valuable insights into managing uncertainty and variability, issues central to complex survey designs.
Complementary to these methodological contributions, Ref. [29] investigated the dimensional efficiency of noncontrastive learning, offering computational perspectives on balancing model complexity and scalability, an aspect particularly relevant for large-scale survey datasets. In the same vein, Ref. [30] extended the Merton option-pricing framework to deposit insurance modeling, demonstrating that statistically rigorous models can address practical problems in economics and finance. Collectively, these studies highlight the convergence between traditional survey estimation and emerging computational paradigms. They underscore the growing emphasis on adaptability, efficiency, and scalability, principles that directly motivate our proposed PPS-based distribution estimation framework. By aligning theoretical efficiency with adaptive computational techniques, the proposed work advances modern survey methodology and integrates it into the broader landscape of computational statistics.
This paper makes four main contributions. First, it introduces a flexible family of PPS-based CDF estimators that combines multi-auxiliary calibration with interpretable association measures. Second, it derives analytical expressions for first-order bias and MSE, identifies conditions for estimator dominance, and provides optimal tuning strategies. Third, it demonstrates computational tractability, requiring only standard survey quantities that are easily implemented in existing pipelines. Fourth, it provides extensive numerical evidence from benchmark populations (fisheries, wine chemistry, and demographic data), where the proposed estimators achieve empirical MSEs as low as (versus for comparators) and PRE gains ranging from to .
The proposed methodology is broadly applicable to domains where PPS sampling is natural and CDF-based inference is decision-critical, such as income distribution analysis, poverty and inequality assessment, disease prevalence, hospital resource allocation, crop yield risk, deforestation monitoring, election polling, and transportation planning. In such contexts, unit sizes (e.g., household size, facility capacity, stand area, or farm acreage) align with PPS designs, while auxiliary data are increasingly accessible from administrative or satellite sources. By jointly leveraging PPS and auxiliary calibration, the proposed estimators offer enhanced accuracy, reduced sample size requirements, and broad practical relevance.
The remainder of the paper is organized as follows: Section 2 presents the proposed class of estimators within the PPS framework. Section 3 reports empirical and simulation results, including efficiency comparisons. Section 4 provides practical guidance and broader implications, while Section 5 concludes the study.
2. Methodology
Let the finite population be denoted by
, where
N is the population size. For the i-th unit, let
denote the value of the study variable of interest, while
represent the auxiliary variables associated with the same unit. We consider probability proportional to size (PPS) sampling, where the selection probability of the i-th unit is defined as
To standardize the PPS design, we introduce transformed variables.
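The design step described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes with-replacement PPS draws, selection probabilities equal to each unit's share of a positive size measure, and a Hansen-Hurwitz-type scaling `v_i / (N * p_i)` for the transformed variables. The function names and the numeric values are hypothetical.

```python
import numpy as np

def pps_probabilities(size):
    """Selection probabilities proportional to a strictly positive size measure."""
    size = np.asarray(size, dtype=float)
    if np.any(size <= 0):
        raise ValueError("size measures must be strictly positive")
    return size / size.sum()

def pps_sample(size, n, rng):
    """Draw n unit indices with replacement, with probability proportional to size."""
    p = pps_probabilities(size)
    return rng.choice(len(p), size=n, replace=True, p=p)

def transform(values, p):
    """Scale values by 1 / (N * p_i); this form of the transformation is an assumption."""
    values = np.asarray(values, dtype=float)
    return values / (len(p) * p)

rng = np.random.default_rng(0)
size = np.array([10.0, 20.0, 30.0, 40.0])  # hypothetical size measures
y = np.array([3.0, 5.0, 9.0, 14.0])        # hypothetical study values
p = pps_probabilities(size)
u = transform(y, p)
idx = pps_sample(size, n=3, rng=rng)
```

Under equal sizes the scaling leaves the values unchanged, since `N * p_i` equals 1 for every unit.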
To characterize the cumulative distribution function (CDF) in the PPS framework, we define indicator functions for the medians of the transformed variables:
where
,
, and
denote the population medians of
,
, and
, respectively.
The corresponding finite population distribution functions and their sample-based estimators are then given as
The variability of these indicator functions across the population is captured through their variances:
The coefficients of variation are therefore defined as
To account for dependence among the transformed indicators, we define population covariances:
The corresponding correlation coefficients are
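The quantities introduced so far (median indicators, distribution function values, variances, coefficients of variation, covariances, and correlations) can be computed as in the following sketch. It assumes the indicator takes the value 1 when a value does not exceed the population median, and it uses an `N - 1` divisor for the variances and covariances; both choices, along with the function names and the toy data, are assumptions made for illustration.

```python
import numpy as np

def median_indicator(values):
    """Return 1.0 where a value is <= the population median, else 0.0."""
    values = np.asarray(values, dtype=float)
    return (values <= np.median(values)).astype(float)

def indicator_moments(a, b):
    """Distribution function values, CVs, covariance and correlation of two median indicators."""
    ia, ib = median_indicator(a), median_indicator(b)
    N = len(ia)
    F_a, F_b = ia.mean(), ib.mean()
    var_a = np.sum((ia - F_a) ** 2) / (N - 1)  # N - 1 divisor is an assumption
    var_b = np.sum((ib - F_b) ** 2) / (N - 1)
    cov_ab = np.sum((ia - F_a) * (ib - F_b)) / (N - 1)
    return {
        "F": (F_a, F_b),
        "cv": (np.sqrt(var_a) / F_a, np.sqrt(var_b) / F_b),
        "cov": cov_ab,
        "rho": cov_ab / np.sqrt(var_a * var_b),
    }

u = [1, 2, 3, 4, 5, 6]  # hypothetical transformed study values
v = [2, 1, 4, 3, 6, 5]  # hypothetical transformed auxiliary values
m = indicator_moments(u, v)
```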
To study the bias and mean squared error (MSE) properties of the estimators of
, we introduce relative error components:
with expectations
, for
.
The error terms are given by
where
and
n is the sample size.
The above framework provides the foundational notations for developing new classes of CDF estimators under PPS sampling. The inclusion of auxiliary information through dual variables () facilitates efficiency gains in finite-sample inference. From an application perspective, these formulations are directly applicable in survey settings where unit sizes are heterogeneous (e.g., household income surveys, enterprise-level trade statistics, or agricultural crop yield studies). The subsequent sections will utilize these expressions to derive bias, MSE, and efficiency results, and to demonstrate their empirical advantages through both simulations and real-world datasets.
Next, we review several standard estimators for the finite population cumulative distribution function (CDF) under probability proportional to size (PPS) sampling. The biases and mean squared errors (MSEs) of these estimators are derived up to the first-order approximation. The existing estimators considered are as follows:
- (1)
Consider the traditional mean estimator for the finite population CDF:
The variance of
is given by
- (2)
The ratio estimator under PPS sampling [31] is defined as
The bias and mean squared error (MSE) of
are given by
and
- (3)
The Bahl and Tuteja ratio-type exponential estimator under PPS sampling [32] is defined as
The bias and mean squared error of this estimator are given by
and
- (4)
The regression estimator for under PPS sampling is given by
where
m is an unknown parameter to be determined.
The minimum variance of
is achieved at the optimal value
and is given by
This can also be expressed as
- (5)
The Rao difference-type estimator [33] for
is defined as
where
and
are unknown constants to be determined.
The bias and mean squared error (MSE) of
, up to the first order of approximation, are given by
and
The optimal values of
and
that minimize the MSE are
At these optimum values, the minimum mean squared error is
- (6)
Grover and Kaur [34] introduced a generalized class of exponential estimators for
, defined as
where
and
are unknown constants.
The bias of
, up to the first order of approximation, is
The mean squared error (MSE) is given by
The optimal values of
and
are
The minimum MSE of
at these optimal values is
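For orientation, the six benchmark estimators listed above can be written as short functions. The sketch below uses the textbook forms of these estimators with distribution function values in place of means. It is an assumption that they carry over unchanged to the PPS-transformed setting, and the argument names (`Fy_hat`, `Fx_hat`, `Fx`, `m`, `k1`, `k2`) and numeric inputs are hypothetical.

```python
import math

def mean_estimator(Fy_hat):
    """Traditional estimator: the sample distribution function value itself."""
    return Fy_hat

def ratio_estimator(Fy_hat, Fx_hat, Fx):
    """Classical ratio form: scale by known / estimated auxiliary value."""
    return Fy_hat * Fx / Fx_hat

def exp_ratio_estimator(Fy_hat, Fx_hat, Fx):
    """Bahl-Tuteja ratio-type exponential form."""
    return Fy_hat * math.exp((Fx - Fx_hat) / (Fx + Fx_hat))

def regression_estimator(Fy_hat, Fx_hat, Fx, m):
    """Regression (difference) form with slope-like constant m."""
    return Fy_hat + m * (Fx - Fx_hat)

def rao_estimator(Fy_hat, Fx_hat, Fx, k1, k2):
    """Rao difference-type form with two constants."""
    return k1 * Fy_hat + k2 * (Fx - Fx_hat)

def grover_kaur_estimator(Fy_hat, Fx_hat, Fx, k1, k2):
    """Grover-Kaur form: Rao-type combination times the exponential ratio factor."""
    return rao_estimator(Fy_hat, Fx_hat, Fx, k1, k2) * math.exp(
        (Fx - Fx_hat) / (Fx + Fx_hat)
    )

# Hypothetical sample and population values, for illustration only.
Fy_hat, Fx_hat, Fx = 0.55, 0.60, 0.50
estimates = {
    "mean": mean_estimator(Fy_hat),
    "ratio": ratio_estimator(Fy_hat, Fx_hat, Fx),
    "exp_ratio": exp_ratio_estimator(Fy_hat, Fx_hat, Fx),
    "regression": regression_estimator(Fy_hat, Fx_hat, Fx, m=0.5),
}
```

Setting `k1 = 1` in the Rao form recovers the regression form, and setting `k1 = 1, k2 = 0` in the Grover-Kaur form recovers the exponential ratio form, which makes the nesting of these classes explicit.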
2.1. Proposed Family of Estimators Under PPS Sampling
In this subsection, we propose a novel family of estimators for the finite population distribution function (DF) within the framework of probability proportional to size (PPS) sampling. This development is motivated by the need to construct efficient estimators that simultaneously exploit auxiliary information and maintain computational tractability, thereby aligning with both theoretical contributions and practical applications in large-scale survey sampling and computational statistics.
We define the proposed class of estimators as
where
denotes the proposed estimator of the population DF at
, while
and
represent transformed auxiliary points with known distribution functions
and
, respectively. The tuning parameters
and
are user-specified constants chosen to optimize estimator performance in terms of efficiency and robustness.
2.1.1. First-Order Approximation
Expanding the estimator using a first-order Taylor series expansion and introducing error terms
and
, we obtain
By retaining terms up to the first order, the deviation of
from
is expressed as
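To illustrate what a first-order approximation discards, the sketch below uses the classical ratio form rather than the proposed estimator. It assumes relative errors of the usual type, `e0 = (Fy_hat - Fy) / Fy` and `e1 = (Fx_hat - Fx) / Fx`, so that the ratio estimator equals `Fy * (1 + e0) / (1 + e1)`. The function names and the error magnitudes are hypothetical.

```python
def ratio_exact(F_y, e0, e1):
    """Ratio-type estimator in terms of relative errors: F_y * (1 + e0) / (1 + e1)."""
    return F_y * (1.0 + e0) / (1.0 + e1)

def ratio_first_order(F_y, e0, e1):
    """First-order Taylor expansion of the same expression: F_y * (1 + e0 - e1)."""
    return F_y * (1.0 + e0 - e1)

F_y = 0.5
gaps = {}
for e in (0.2, 0.1, 0.05):
    # Opposite-signed errors give the largest first-order movement for this form.
    gaps[e] = abs(ratio_exact(F_y, e, -e) - ratio_first_order(F_y, e, -e))
    print(f"|e| = {e:.2f}: truncation gap = {gaps[e]:.6f}")
```

The gap shrinks roughly quadratically as the relative errors shrink, which is why first-order expressions are most trustworthy when the sample is large enough for the relative errors to be small.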
2.1.2. Bias and Mean Squared Error
The bias of
, up to the first order, is given by
Similarly, the mean squared error (MSE) of
is approximated by
The values of
and
that minimize the MSE are obtained as
Substituting these into the MSE expression yields the minimized MSE:
2.1.3. Theoretical and Application Relevance
Theoretically, this proposed family of estimators provides a flexible framework that generalizes ratio-type and exponential-type estimators under PPS sampling. The introduction of auxiliary parameters enables adaptability to different sampling designs and error structures.
From an application standpoint, the estimator is particularly suitable for survey settings where PPS designs are frequently employed, such as household income surveys, business establishment surveys, and agricultural statistics. By incorporating auxiliary information effectively, the estimator reduces bias and MSE, thereby ensuring more reliable estimation of population distribution functions in practice.
3. Results
In this section, we present the findings of our empirical and simulation-based investigations, focusing on efficiency comparisons across multiple estimator families. The results are structured into three complementary parts: efficiency comparison, simulation study, and empirical study.
3.1. Efficiency Comparison
To evaluate the theoretical performance of the proposed estimator under the PPS framework, we compared its minimum mean square error (MSE) with that of the existing competing estimators. The efficiency conditions were established by showing that the proposed estimator consistently achieved a smaller MSE, thereby ensuring superior precision. Specifically, the following results hold:
- 1.
From (
2) and (
20),
provided that
Proof. Simplifying the above expression, we get
where
…
and
So, the dominance of the proposed estimator holds iff
,
, and
. Then
□
- 2.
From (
4) and (
20),
whenever
Proof. For the above statement, the simplified expression is given below:
where
, while
A and
B are defined in the proof of the first statement, and
Now we know that
because D collects the covariance and correction terms. So,
So, the above result shows that
□
- 3.
- 4.
From (
8) and (
20),
provided that
- 5.
From (
10) and (
20),
whenever
- 6.
From (
12) and (
20),
provided that
Proof. Substituting the mean squared errors (MSEs) of both estimators and simplifying, we get
where,
where A, B and C are defined in the proof of the first statement. So, the above statement will hold iff
,
, and
.
□
These theoretical inequalities confirm that, whenever the stated conditions hold, the proposed estimator achieves a lower MSE than its counterparts across multiple established classes, including ratio-type, regression-type, and generalized kernel-based estimators. In other words, under these conditions the proposed estimator is more efficient than each competitor within the PPS framework.
From a practical perspective, such efficiency gains translate directly into more reliable estimation of cumulative distribution functions (CDFs) in real-world survey applications. For instance,
In income distribution studies, a more efficient estimator reduces the required sample size, thereby lowering data collection costs.
In epidemiological surveys, improved precision ensures more accurate prevalence estimates, which are critical for healthcare policy and resource allocation.
In agricultural yield forecasting, efficiency improvements minimize the risk of biased or imprecise yield estimates when unit sizes (e.g., farm sizes) vary widely.
Thus, the theoretical superiority of the proposed estimator directly strengthens its applicability in domains where PPS sampling is indispensable and estimator efficiency is crucial.
3.2. Simulation Study
To evaluate the finite-sample performance of the proposed family of estimators, a Monte Carlo simulation study was conducted using three synthetic populations with distinct mean and covariance structures. These populations were generated from multivariate normal distributions to mimic diverse correlation patterns and size heterogeneity frequently encountered in survey data.
Population I was designed with a simple increasing mean vector
and a moderately correlated covariance matrix, thereby representing a balanced structure with mild heterogeneity.
Population II was constructed with zero means and higher off-diagonal correlations, reflecting stronger dependency among variables and potential multicollinearity effects.
Population III had decreasing negative means
and a denser covariance matrix with stronger correlations, thereby representing more heterogeneous and complex conditions (see
Table 1).
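A Monte Carlo study of this kind can be organized as in the sketch below. It is a simplified stand-in, not the simulation reported here: the mean vector, covariance matrix, population size, sample size, and number of replications are hypothetical, and the estimator shown is the standard Hansen-Hurwitz estimator of a distribution function value under with-replacement PPS rather than any estimator from the proposed family.

```python
import numpy as np

def hansen_hurwitz_cdf(y, p, idx, t):
    """Hansen-Hurwitz estimate of F(t) from a with-replacement PPS sample."""
    N = len(y)
    return np.mean((y[idx] <= t) / (N * p[idx]))

def simulate_mse(y, size, t, n, reps, rng):
    """Empirical MSE, average estimate, and true F(t) over repeated PPS samples."""
    p = size / size.sum()
    truth = np.mean(y <= t)
    estimates = np.empty(reps)
    for r in range(reps):
        idx = rng.choice(len(y), size=n, replace=True, p=p)
        estimates[r] = hansen_hurwitz_cdf(y, p, idx, t)
    return np.mean((estimates - truth) ** 2), estimates.mean(), truth

rng = np.random.default_rng(2024)
mean = np.array([10.0, 12.0, 15.0])            # hypothetical mean vector
cov = np.array([[4.0, 2.0, 1.5],
                [2.0, 4.0, 1.0],
                [1.5, 1.0, 4.0]])              # hypothetical covariance matrix
pop = rng.multivariate_normal(mean, cov, size=1000)
y, size = pop[:, 0], pop[:, 2]                 # study variable and size measure
if np.any(size <= 0):
    raise ValueError("PPS requires strictly positive size measures")
mse, avg, truth = simulate_mse(y, size, np.median(y), n=50, reps=2000, rng=rng)
```

Because normal variates can be negative, the size measure is checked for positivity before it is used; a simulation with means near zero would need a shifted or otherwise transformed size variable.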
Table 2 reports the mean squared errors (MSEs) for the proposed estimators
(
) and for benchmark competitors including the random group (
), bias-transformed ratio (
), regression (
), regression difference (
), and generalized kernel (
) estimators. The results demonstrate that across all three populations, the proposed estimators consistently yield smaller MSEs than the classical alternatives. Particularly,
and
achieve the lowest error magnitudes, highlighting the robustness of the SM-based framework.
To complement the MSE results,
Table 3 presents the percentage relative efficiencies (PREs) of the estimators with respect to the baseline
. Values greater than 100 indicate efficiency gains. The superiority of the proposed estimators is evident, with PREs ranging from 114% to 171% in Population I, 146% to 160% in Population II, and 121% to 158% in Population III. In particular,
and
consistently outperform the benchmarks across different populations, demonstrating notable gains in precision under heterogeneous sampling conditions.
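The PRE values follow directly from the MSEs: each PRE is the baseline MSE divided by the estimator's MSE, multiplied by 100. A minimal sketch, with hypothetical estimator labels and MSE values that are not taken from the tables:

```python
def percentage_relative_efficiency(mse_baseline, mse_estimator):
    """PRE of an estimator relative to a baseline; values above 100 indicate a gain."""
    if mse_baseline <= 0 or mse_estimator <= 0:
        raise ValueError("MSE values must be positive")
    return 100.0 * mse_baseline / mse_estimator

# Hypothetical MSEs, for illustration only.
mse_values = {"baseline": 0.0040, "competitor": 0.0032, "proposed": 0.0025}
pre_values = {
    name: percentage_relative_efficiency(mse_values["baseline"], mse)
    for name, mse in mse_values.items()
}
```

By construction the baseline has a PRE of exactly 100, so the ordering of PREs is the reverse of the ordering of MSEs.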
Overall, the simulation study highlights that the proposed SM-based estimators substantially reduce estimation error and improve efficiency, especially in scenarios with high heterogeneity and strong inter-variable correlations. These gains reaffirm the theoretical advantages of incorporating auxiliary size information into distribution function estimation under PPS sampling.
In addition to the tabulated results, Figure 1 and Figure 2 provide a visual comparison of the performance of the proposed family of estimators relative to the existing benchmarks. Panel (a) presents the mean squared error (MSE) values, while Panel (b) depicts the percentage relative efficiencies (PREs) for all three simulated populations.
In
Figure 1a and
Figure 2a, the proposed estimators
,
, consistently achieve lower MSE values across all the populations when compared to traditional estimators such as
,
, and
. This clearly demonstrates the robustness of the new class of estimators in reducing estimation error even under diverse covariance structures. The improvement is most pronounced for Populations I and III, where heterogeneity in the covariance matrix is higher, highlighting the adaptability of the proposed methods to more complex correlation patterns.
Figure 1b and
Figure 2b further corroborate these findings by showing that the proposed estimators substantially outperform classical benchmarks in terms of PRE. For instance, the estimator
achieves up to 170.92% efficiency relative to the simple expansion estimator
, while
and
also exhibit consistently high efficiency gains across populations. These improvements indicate that the proposed estimators not only reduce bias and variance but also enhance efficiency, making them preferable choices for practical applications in survey inference.
Overall, the visual evidence aligns well with the numerical results reported in
Table 4, reinforcing the conclusion that the proposed family of estimators provides superior performance in terms of both error minimization and efficiency enhancement.
3.3. Empirical Study
To examine the empirical performance of the proposed family of estimators, we consider three distinct populations commonly used in survey sampling research. These populations are drawn from fisheries, enology, and demographic contexts, ensuring diversity in both application domains and correlation structures. The corresponding descriptive statistics are summarized in
Table 5.
Population I (Adapted from [
35]). This population is based on fisheries data, where auxiliary information is naturally available across different years. The variables are defined as follows:
Y = Quantity of fish caught in 1995;
X = Quantity of fish caught in 1992;
Z = Quantity of fish caught by fishermen in 1993; and
Q = Quantity of fish caught in 1994. This dataset is characterized by strong temporal correlations across years, making it well-suited for evaluating auxiliary-variable-based estimators.
Population II ([
36]). This population is constructed from the UCI Wine dataset, with chemical composition attributes serving as the study and auxiliary variables:
Y = Aspartame;
X = Leucine;
Z = Isoleucine;
Q = Valine. This setting captures biochemical correlation structures, with strong relationships among amino acids, providing a different test case from fisheries and demographic data.
Population III (Adapted from [
37]). This population reflects demographic data with institutional auxiliary variables:
Y = Population (in thousands) in 1985;
X = Population in 1975;
Z = Population in 1977;
Q = Total number of seats in the municipal council. This population represents the type of socio-economic datasets where auxiliary information from administrative records enhances survey inference.
As shown in
Table 5, Populations I and II exhibit high correlations between study and auxiliary variables (e.g.,
and
, respectively), whereas Population III shows weaker correlations (e.g.,
). These differences create a proper test bed for assessing how the proposed estimators respond to varying strengths of auxiliary information. The coefficients of variation (
) also highlight structural heterogeneity, especially in Population III, where demographic and institutional measures are less aligned. These variations across populations provide an informative benchmark for evaluating the robustness and efficiency of the proposed estimators (see
Table 6).
The empirical findings provide a comprehensive comparison of the proposed family of distribution function estimators with the existing methods in terms of efficiency, precision, and robustness (see
Table 7).
Table 4 reports the mean squared errors (MSEs) across Populations I–III, while
Table 8 presents the percentage relative efficiencies (PREs). The corresponding graphical representations are presented in
Figure 3, which illustrates both the MSE patterns (panel a) and the efficiency gains (panel b).
From
Table 4, it is evident that the proposed estimators consistently achieve lower MSE values compared to conventional estimators such as
,
, and
. For example,
records the lowest MSEs across all three populations, reaching values of
,
, and
for Populations I, II, and III, respectively. Similarly,
and
perform remarkably well, with substantial reductions in error relative to their classical counterparts. These improvements underscore the adaptability of the proposed estimators in diverse population settings.
The efficiency comparison, summarized in
Table 8, further confirms these findings. The proposed family outperforms baseline estimators, often by a wide margin. For instance,
achieves PREs of
,
, and
across the three populations, highlighting its robustness and superior efficiency relative to the standard
benchmark, which is normalized to 100. Likewise,
and
consistently show high PRE values, demonstrating their efficiency gains across both small and large populations.
These numerical insights are supported by the graphical results in
Figure 3. Panel (a) illustrates the MSE comparisons, where the proposed estimators demonstrate clear superiority over the existing methods, resulting in significant reductions in error levels. Panel (b) illustrates the PRE values, which further highlight the relative efficiency improvements. Notably, the proposed estimators not only outperform traditional regression and ratio-based estimators but also maintain robustness across repeated simulation runs, reinforcing their practical utility in survey sampling applications. Panels (c) and (d) show the patterns and trends in the mean squared errors and percentage relative efficiencies.
Taken together, the results indicate that the proposed family of estimators achieves substantial improvements in both precision and efficiency, offering a versatile and effective alternative to conventional approaches. Their consistently high performance across multiple populations and simulation settings suggests strong potential for application in real-world survey data, especially in contexts where auxiliary information is available and reliable.
4. Discussion
The performance evaluation of the proposed estimator requires not only theoretical justification but also empirical evidence that reflects its robustness in practical applications. To this end, we carried out an extensive simulation study, complemented by an empirical investigation across three distinct populations, to validate the theoretical properties established in the earlier sections. The discussion presented here emphasizes both the statistical implications and the applied relevance of the findings.
From the theoretical perspective, the derivations in
Section 3 demonstrated that the proposed family of estimators consistently achieves a lower minimum mean squared error (MSE) compared to conventional alternatives such as
,
,
,
,
, and
. This advantage was shown to hold under general conditions, thereby establishing a strong foundation for the efficiency gains observed in practice.
On the empirical side, the results provide compelling numerical evidence. As depicted in
Figure 3, the proposed estimators
through
consistently yield substantially lower MSE values across all three populations, thereby confirming the robustness of the theoretical results. The magnitude of improvement is not marginal; rather, it is systematic and pronounced. In particular, the efficiency gains are clearly illustrated in
Figure 3, where the percentage relative efficiency (PRE) values of the proposed estimators exceed those of the existing estimators, ranging from 136% to 328% across populations. Such high PRE values underscore the substantial improvements in precision and efficiency achieved by incorporating auxiliary information into the PPS sampling framework.
From an applied standpoint, these findings are particularly significant. The improved estimation of the population cumulative distribution function (CDF) has direct implications for fields where accurate distributional inference is critical, such as economics, social sciences, official statistics, and biomedical studies. For instance, survey practitioners working with unequal probability designs often face challenges in balancing design efficiency and estimator bias; the proposed family of estimators provides a viable solution that reduces estimation error while maintaining design consistency. Moreover, the graphical comparisons not only enhance the interpretability of the results but also underscore the stability of the proposed estimators across heterogeneous population structures, making them more attractive for practical implementation.
Beyond these theoretical and applied contributions, the study also underscores the role of computational advances in modern survey sampling and inference. The use of simulation-based validation enables the exploration of estimator performance across a wide range of population structures and sampling designs, providing insights that would be difficult to obtain through purely analytical derivations. Furthermore, the integration of auxiliary information into both the design and estimation stages exemplifies how computational statistics can bridge theory and application: leveraging additional covariates enhances estimator efficiency. At the same time, simulation frameworks ensure reproducibility and scalability in practice. This synergy between computational methods, auxiliary information, and theoretical efficiency directly aligns with the goals of contemporary computational statistics, as emphasized in this special issue.
Therefore, the discussion consolidates both theoretical and empirical perspectives: the proposed estimators achieve provable efficiency gains under PPS sampling, supported by extensive numerical validation. The joint evidence from analytical derivations, numerical tables, and graphical results confirms that the proposed family of estimators offers a substantial improvement over the existing approaches, thereby advancing the toolkit available for distribution function estimation in modern computational statistics and its applications.
Looking ahead, the methodological framework introduced here could be further extended to meet the challenges posed by contemporary data environments. In particular, integrating the proposed estimators with big data survey designs, Bayesian computational frameworks, or machine learning-based auxiliary information extraction presents exciting avenues for future research. Such extensions would not only enhance scalability and adaptability in high-dimensional or complex population settings but also strengthen the role of computational statistics in bridging traditional theory with modern data-driven applications. Likewise, the framework can also be extended to other scenarios with different datasets [38,39,40,41,42].