1. Introduction
In multifaceted survey environments, supplementary information gathered through a national census, remote sensing network, or environmental inventory can be of great value at both the design and estimation stages of a survey. It is common to use these external data sources to formulate efficient estimators of major population parameters such as the total or mean. Conventional estimation methods rest on the assumption that the study variable has a functional relationship, usually linear, with the auxiliary variables. Such model-based estimators require the underlying model structure to be specified in advance, which is a difficult task when there are more than a few variables or when the population has a complex structure (Opsomer et al. [1] and Wu and Sitter [2]). This has led to a shift towards nonparametric methods, which are more flexible because they do not presume strongly defined functional forms. The initial work of Dorfman [3] and subsequently of Dorfman and Hall [4] established the principles of incorporating nonparametric models into survey estimation, enabling more flexible approaches that can capture complex relationships in diverse populations.
Compared to parametric methods, nonparametric inference is less sensitive to sampling designs and model assumptions (Nadaraya [5]). The theoretical literature develops efficient estimators within two major frameworks: the design-driven framework, which relies only on the randomisation of the sampling design, and the model-driven framework, which treats the finite population as arising from a superpopulation model. The latter allows prediction for non-sampled units when a relationship is assumed between the survey and auxiliary variables (Dorfman and Hall [4]). An early step in this direction was the work of Nadaraya [5], which introduced Local Polynomial Regression (LPR) as a versatile nonparametric alternative to classical parametric regression estimators. Numerical experiments show that LPR-based estimators attain the lowest MSE (the highest PRE) compared with parametric estimators, particularly in skewed and outlier-contaminated conditions. On this basis, Rueda and Sanchez-Borrego (RSB) [6] applied LPR techniques to probability sampling situations, further verifying their usefulness in model-based predictive contexts. Recent developments in robust regression and nonparametric learning, with a focus on distribution-resistant modelling such as functional-coefficient and quantile-based regression [7], neural-network-driven robustness [8], and adaptive kernel smoothing methods [9,10], highlight the increasing importance of flexible and outlier-resistant estimation strategies for complex data environments.
Local polynomial kernel regression is a versatile tool that can handle both continuous and discrete data, although its usefulness depends heavily on the characteristics of the response variable. In the continuous case, the technique excels because it fits a localised polynomial at each target point, with nearby observations receiving heavier kernel weights. This localised construction provides adaptive smoothing that makes few assumptions about the global shape of the underlying function. For discrete or categorical data, the method must be modified: binary or multiclass outcomes are modelled with local logistic regression, and count data are generally modelled with local versions of generalised linear models (GLMs) with an appropriate link function, such as the Poisson or negative binomial. In such situations, choosing a suitable kernel bandwidth is particularly important, since sparse or unevenly distributed data can have a considerable influence on model stability and predictive performance.
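To make the localised construction concrete, the following is a minimal sketch (not the authors' implementation) of a local polynomial fit with a Gaussian kernel on synthetic data; the bandwidth `h = 0.4` and the noisy sine curve are illustrative assumptions.

```python
import numpy as np

def local_poly_fit(x0, x, y, h, degree=1):
    """Local polynomial estimate of m(x0) with a Gaussian kernel.

    Fits a weighted polynomial in (x - x0); the intercept of the
    local fit is the estimate of the regression function at x0."""
    u = (x - x0) / h
    weights = np.exp(-0.5 * u**2)              # Gaussian kernel weights
    # Design matrix of centred polynomial terms: [1, (x-x0), (x-x0)^2, ...]
    X = np.vander(x - x0, degree + 1, increasing=True)
    W = np.diag(weights)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0]                             # intercept = m_hat(x0)

# Toy data: noisy sine curve
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(0, 0.2, 200)

m_hat = local_poly_fit(np.pi / 2, x, y, h=0.4)  # true value is sin(pi/2) = 1
```

Shrinking `h` lets the fit track sharper local features at the cost of higher variance, which is the bandwidth trade-off discussed above.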
The arithmetic mean remains one of the most prevalent statistical measures, used as a fundamental data summary in fields ranging from the sciences and social sciences to the arts (Zaman [11]). Because it is interpretable and broadly applicable, accurate estimation of the population mean is of critical importance in survey sampling and in a variety of applications (Subzar et al. [12]; Kumar and Siddiqui [13]). In-depth descriptions of mean estimation methods are available in Shahzad et al. [14] and Koc and Koc [15]. Given its central importance, there is an ongoing need for more efficient and reliable ways of estimating the mean. This has motivated the increased use of model-based methods, which incorporate auxiliary information and flexible modelling structures to improve the accuracy of mean estimates.
In this paper, we discuss the literature on model-based nonparametric mean estimation methods, which are widely applied to estimate population parameters under complex sampling designs. In model-based estimation, the relationship between the dependent and independent variables is modelled explicitly, which allows prediction for non-sampled units; the structure of the underlying model forms the basis for parameter determination (Srivastava [16]). Subject to specific assumptions, these models enable the imputation of unobserved values at both the micro and the macro level. When a sampling design has been used to gather the data, this design can be incorporated into the estimation process, just as in design-based methods. RSB [6] formulated an LPR-assisted estimator under simple random sampling and showed that it possesses a number of desirable properties within the model-based paradigm. Such kernel-based and nonparametric predictive estimators are, however, susceptible to outliers, a major drawback, since outliers are common in real-world applications such as environmental or meteorological measurements owing to sensor errors, data anomalies, or extreme events. As a solution, we propose a new predictive mean estimator that combines the robustness of resistant regression methods, which prevent outliers from dominating the fit, with the flexibility of kernel regression, which offers local smoothing without assuming a global parametric form. This hybrid methodology improves the accuracy and consistency of central-tendency estimation in the presence of contaminated data and is thus especially appropriate for real-life situations where data-quality problems are widespread.
With many predictor variables, LPR may be generalised to a multivariate framework, often called multiple local polynomial regression (MLPR), in which a local polynomial surface, rather than a curve, is fitted. Although this generalisation allows complex multidimensional relationships to be modelled, it suffers from the curse of dimensionality: data become sparse in higher-dimensional space, which may cause instability and reduce the reliability of the estimates. In addition, an appropriate bandwidth must be chosen for each covariate; poor choices can over-smooth important signals or under-smooth noisy predictors. These problems are largely alleviated when the model contains a single predictor variable. The relative density of data in one-dimensional space permits more effective and consistent smoothing, makes bandwidth selection more convenient, reduces the likelihood of variance inflation, and makes the regression more robust overall.
Although there is a rich body of literature on both model-based and calibration-type mean estimation techniques, to our knowledge no prior research has constructed calibrated predictive mean estimators under stratified random sampling that apply (i) robust regression to sampled units and (ii) local polynomial regression to non-sampled units, while incorporating dual calibration constraints on auxiliary means and coefficients of variation. Historically, calibration estimators enhance accuracy by using constraints to match sample estimates of auxiliary information to known population characteristics. This paper builds on that research by proposing a new hybrid estimator in which the sampled subset of the population is handled with outlier-resistant regression methods, which remain stable under outlier-induced distortions, and the non-sampled subset is predicted with kernel regression, which provides flexible, data-driven smoothing. Robust regression contributes to the accuracy and stability of the overall predictive mean estimator, especially when the data are contaminated. To operationalise this hybrid process, we adopt a model-based methodology and apply an LPR estimator to the non-sampled units. Kernel bandwidths must also be selected carefully, since they have a pronounced effect on the quality of kernel-based estimators. The resulting procedure not only bridges two effective nonparametric tools but also provides a practical and robust alternative to traditional estimators in stratified sampling designs. The use of a Gaussian kernel function further guarantees stable and smooth estimation behaviour across a variety of data conditions.
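As a hedged illustration of the outlier-resistant ingredient, the sketch below fits a simple linear model by Huber M-estimation via iteratively reweighted least squares (IRLS); the tuning constant `k = 1.345`, the synthetic data, and the injected outliers are assumptions for demonstration only, not the paper's specification.

```python
import numpy as np

def huber_irls(x, y, k=1.345, n_iter=50):
    """Huber M-estimate of a simple linear fit via iteratively
    reweighted least squares (IRLS); large residuals are downweighted."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]    # OLS starting values
    for _ in range(n_iter):
        r = y - X @ beta
        s = np.median(np.abs(r)) / 0.6745 + 1e-12  # robust scale (MAD)
        u = np.abs(r) / s
        w = np.where(u <= k, 1.0, k / u)           # Huber weights
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, X.T @ (w * y))
    return beta

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, 100)
y[:5] += 25.0                                      # inject gross outliers

beta_robust = huber_irls(x, y)                     # close to (2.0, 0.5)
```

Ordinary least squares on the same contaminated data would pull the intercept towards the outliers, whereas the Huber weights cap their influence.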
The accurate measurement and assessment of natural resources, in particular aquaculture and fisheries, have received considerable attention in recent years owing to the need for sustainable management and information-based decision-making. Estimates of the average values of biological parameters such as fish length, weight, and body shape are significant for stock assessment and for economic and operational planning in the fishery industry. In this paper, the performance of predictive mean estimators is compared on a real-life fish-market dataset with diverse morphological characteristics. A simulated dataset on solar ultraviolet (UV) radiation, covering important environmental variables and UV risk categories, is also considered; estimating average environmental UV exposure is necessary for environmental risk assessment and for developing adaptive strategies in aquatic ecosystems. Together, the datasets allow us to investigate powerful, model-oriented mean estimation methods that combine biological and environmental data. The study thereby highlights the value of estimating means through advanced survey sampling and predictive models in complex natural environments, with applications in fishery management, environmental surveillance, and natural resource planning.
The theoretical and methodological foundation for the new class of predictive estimators in StRS is laid out in Section 2, Section 3, Section 4 and Section 5 of this article. Section 2 reviews the existing kernel regression model in the stratified sampling case and briefly describes the LPR estimator as a nonparametric counterpart of the classical linear regression estimator; it also explains how the estimator depends on the smoothing parameters and auxiliary variables, and how it is sensitive to bandwidth selection and to contaminated data. Section 3 addresses the incorporation of robust regression techniques into the nonparametric estimation framework; this combination yields a more resistant kernel regression estimator that suppresses the impact of outliers and heteroscedastic noise, particularly in real-world stratified data. Section 3 further introduces two calibrated forms of the adaptive predictive estimators, which make more efficient use of auxiliary information. Section 4 presents a detailed numerical analysis of artificial and natural populations designed to simulate realistic sampling situations with outliers under a stratified design; the results are evaluated in terms of PRE using three bandwidth selection techniques: fixed, data-driven plug-in (dpik), and biased cross-validation (bcv). Finally, Section 5 delivers the conclusions.
2. Fundamental Estimators
Alshanbari and Anas [17], Alomair et al. [18], and RSB [6] propose a model-driven technique in which the finite population is assumed to be satisfactorily characterised by a predictive model, denoted as $\xi$, such that

$$ y_i = m(w_i) + e_i. $$

In stratified random sampling (StRS), the predictive model is generalised to every stratum $h$ and is written as

$$ y_{hi} = m_h(w_{hi}) + e_{hi}, $$

where the $e_{hi}$ represent independent, identically distributed random errors with zero mean, $E_m(e_{hi}) = 0$, and constant variance $\sigma_h^2$. Further, $m_h(\cdot)$ is a smooth and unknown function of the supplementary variable $w$, and $E_m$ denotes expectation under the model $\xi$.
After a sample is selected, the average of the population in stratum $h$, denoted by $\bar{Y}_h$, can be written as

$$ \bar{Y}_h = f_h \bar{y}_h + (1 - f_h)\,\bar{y}_{\bar{s}_h}. \tag{1} $$

In Equation (1), $\bar{y}_h = n_h^{-1} \sum_{i \in s_h} y_{hi}$ represents the mean of the sampled units $s_h$, and $\bar{y}_{\bar{s}_h} = (N_h - n_h)^{-1} \sum_{j \in \bar{s}_h} y_{hj}$ is the mean of the non-sampled units $\bar{s}_h$. The population and sample counts in the stratum are $N_h$ and $n_h$, respectively, and the sampling fraction is $f_h = n_h / N_h$. $N = \sum_h N_h$ is the total number of elements across the strata.
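The stratum-mean decomposition above is an exact identity, which the following toy numerical check confirms; the gamma-distributed stratum values and the sizes $N_h = 500$, $n_h = 80$ are arbitrary illustrative choices.

```python
import numpy as np

# Numerical check of the stratum-mean decomposition
#   Ybar_h = f_h * (sample mean) + (1 - f_h) * (non-sampled mean)
rng = np.random.default_rng(2)
N_h, n_h = 500, 80                               # toy stratum and sample sizes
y = rng.gamma(shape=2.0, scale=3.0, size=N_h)    # stratum values

sample_idx = rng.choice(N_h, size=n_h, replace=False)
mask = np.zeros(N_h, dtype=bool)
mask[sample_idx] = True

f_h = n_h / N_h                                  # sampling fraction
ybar_s = y[mask].mean()                          # mean of sampled units
ybar_r = y[~mask].mean()                         # mean of non-sampled units

reconstructed = f_h * ybar_s + (1 - f_h) * ybar_r
# equals the full stratum mean y.mean() up to floating-point rounding
```

Only the second term involves unobserved units, which is exactly the component the predictive estimators below must supply.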
It should be noted that the first term of Equation (1) can be calculated directly from the sample. Consequently, the task is to estimate the unknown component $\bar{y}_{\bar{s}_h}$, which refers to the non-sampled units. If the auxiliary variable $w$ were observed for all units, prediction via the regression model would be straightforward: $m_h(w_{hj})$ serves as a proxy for the unobservable $y_{hj}$, since $E_m(y_{hj}) = m_h(w_{hj})$. In a real-world context, however, the true $m_h$ is not known. In response, nonparametric kernel regression methods are used to obtain predictions $\hat{m}_h(w_{hj})$ at each non-sampled point $w_{hj}$, as illustrated by Chambers et al. [19]. Since then, this method has been adapted and generalised by a number of researchers, such as RSB [6], to enhance predictive estimation under more complicated sampling designs.
2.1. Rueda and Sanchez-Borrego Estimator
Based on the fundamental contributions of RSB [6], the traditional model-driven estimator under stratified random sampling (StRS) for the $h$-th stratum takes the form

$$ \hat{\bar{Y}}_h = f_h \bar{y}_h + (1 - f_h) \frac{1}{N_h - n_h} \sum_{j \in \bar{s}_h} \hat{m}_h(w_{hj}). \tag{2} $$

The aggregated estimator $\hat{\bar{Y}}$ of the entire population, combining all strata, is expressed as

$$ \hat{\bar{Y}} = \sum_h \Psi_h \hat{\bar{Y}}_h, $$

where $\Psi_h = N_h / N$ is the weight of a single stratum, with $\sum_h \Psi_h = 1$ over all the strata being considered.
It is worth noting that $\hat{m}_h$, obtained through LPR, is a generalisation of the classical linear regression fit and may be employed across diverse forms of modelling. Following the approach of Ref. [20] and its methodology, RSB [6] utilised a kernel-based $q$-th order LPR estimator to compute the study variable. The kernel function takes the form $K_h(u) = h^{-1} K(u/h)$, in which $K$ is generally adopted as a Gaussian-shaped kernel and $h$ represents the window-width (bandwidth) parameter. For a broader picture of recent developments in kernel-based approaches, readers may turn to Refs. [18,21,22].
Accordingly, the predicted value $\hat{m}_h(w_{hj})$ for a non-sampled unit $j$ is calculated using

$$ \hat{m}_h(w_{hj}) = e_1^{\top} \left( W_{hj}^{\top} K_{hj} W_{hj} \right)^{-1} W_{hj}^{\top} K_{hj}\, y_h, $$

where $e_1 = (1, 0, \ldots, 0)^{\top}$ is a unit vector of length $q + 1$, $y_h$ represents the vector of observed responses in stratum $h$, $K_{hj}$ is the diagonal weight matrix formed using the kernel function, and $W_{hj}$ is the design matrix constructed from the local polynomial terms.
Alshanbari and Anas [17] observed that, under stratified sampling, the base estimator $\hat{\bar{Y}}$ can be improved upon using calibration techniques.
2.2. Alshanbari and Anas Estimator
The inclusion of auxiliary data in estimation processes is commonly accepted as a favourable means of improving the accuracy of mean estimators. The common assumption, as defined in Shahzad et al. [23] and Zaman [24], is that a meaningful association exists between the principal study variable $Y$ and a corresponding auxiliary variable $W$. A solid example is the positive correlation between education and income, in which education is generally considered a causal variable that affects income; several socioeconomic studies have confirmed this association (Leesch and Skopek [25]). Likewise, in the health sciences, a considerable body of empirical evidence shows the positive impact of physical activity on cardiovascular health: according to Kaiser and Oswald [26], more active people tend to have healthier hearts on average. These examples demonstrate that, used properly, auxiliary variables help to refine the mean estimate and increase the trustworthiness of survey outcomes.
Calibration estimation is generally recognised as an effective method of adjusting survey weights by minimising an appropriate distance function; it is also used to incorporate auxiliary or additional information. Many scholars have noted the importance of carrying out calibration within strata in order to maximise the efficiency of population parameter estimates. Construction of the calibration weights generally involves two basic choices: selection of an appropriate distance measure and specification of the calibration constraints. When well matched with the auxiliary variables, these constraints can greatly enhance the accuracy of the estimates of the main study variable. This method was expanded by Refs. [27,28], which introduced several calibration conditions into the general survey sampling paradigm, a concept also examined by Refs. [29,30,31,32]. Despite these developments, few efforts have focused on calibrated mean estimators in the context of stratified random sampling (StRS) within the model-based framework. The study of Alshanbari and Anas [17] is one of them: it presents a new, calibration-supported, model-driven mean estimator under StRS, exploiting the flexibility of calibrated kernel-oriented nonparametric regression techniques.
In the StRS scheme, $N$ and $n$ denote the population size and total sample size, respectively. Let $(\bar{w}_h, \bar{W}_h)$ denote the sample and population averages, and $(cv_{w_h}, CV_{W_h})$ the sample and population coefficients of variation (CVs), of the supplementary variable $W$ in the $h$-th stratum. Likewise, $(\Psi_h, \Psi_h^{*})$ represent the usual stratified weights and their calibrated versions. Based on the foregoing definitions, a random sample $s_h$ of size $n_h$ is selected from a stratum containing $N_h$ population units, with $n_h < N_h$. Under these conditions, the calibrated estimator introduced by Alshanbari and Anas [17] is given as

$$ \hat{\bar{Y}}_{cal} = \sum_h \Psi_h^{*} \hat{\bar{Y}}_h, \tag{3} $$

subject to the constraints

$$ \sum_h \Psi_h^{*} = \sum_h \Psi_h, \tag{4} $$
$$ \sum_h \Psi_h^{*} \bar{w}_h = \sum_h \Psi_h \bar{W}_h, \tag{5} $$
$$ \sum_h \Psi_h^{*} cv_{w_h} = \sum_h \Psi_h CV_{W_h}. \tag{6} $$

The motivation for incorporating loss functions within the calibration framework, as discussed by Ref. [27], is to improve the accuracy of parameter estimation by adjusting the weights of the sampled units. This optimisation reduces a given distance measure, usually between the original design weights and the calibrated weights, subject to a known set of calibration constraints. To operationalise this process, we build a Lagrange-type function (LF) by adding the constraint multipliers $\lambda_1$, $\lambda_2$, $\lambda_3$ to a chi-square-based loss function:

$$ LF = \sum_h \frac{(\Psi_h^{*} - \Psi_h)^2}{\Psi_h} - 2\lambda_1 \Big( \sum_h \Psi_h^{*} - \sum_h \Psi_h \Big) - 2\lambda_2 \Big( \sum_h \Psi_h^{*} \bar{w}_h - \sum_h \Psi_h \bar{W}_h \Big) - 2\lambda_3 \Big( \sum_h \Psi_h^{*} cv_{w_h} - \sum_h \Psi_h CV_{W_h} \Big). \tag{7} $$
Computing $\partial LF / \partial \Psi_h^{*}$ and setting it to zero gives

$$ \Psi_h^{*} = \Psi_h \left( 1 + \lambda_1 + \lambda_2 \bar{w}_h + \lambda_3\, cv_{w_h} \right). \tag{8} $$

Calibrated weights have a number of desirable characteristics: they can reduce bias, minimise variance, and remain coherent with known auxiliary information. The main goal in constructing such weights is to align suitably weighted averages of the supplementary information in the sample with their established population counterparts, thereby increasing the quality of survey estimates. It should be noted, however, that calibrated weights cannot be assumed to be strictly positive. Negative weights may occur, especially when large differences arise between the sample and population distributions, or when certain types of distance function are used in the calibration. The chi-square distance is one distance function that is quite effective in alleviating the occurrence of negative weights, because it penalises large deviations relative to the starting weights, discouraging adjustments that move too far from them. Consequently, the chi-square distance leads to more consistent calibration, providing better balance and minimising the likelihood of extreme or negative weights.
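The chi-square calibration step can be sketched numerically: substituting the weight form $\Psi_h^{*} = \Psi_h(1 + \lambda_1 + \lambda_2 \bar{w}_h + \lambda_3\, cv_{w_h})$ into the three constraints yields a 3×3 linear system in the multipliers. The stratum summaries below are invented toy numbers, not data from the paper.

```python
import numpy as np

# Illustrative chi-square calibration for four strata (toy numbers).
Psi  = np.array([0.30, 0.25, 0.25, 0.20])   # design weights, sum to 1
wbar = np.array([4.8, 6.1, 5.2, 7.4])       # sample means of W per stratum
Wbar = np.array([5.0, 6.0, 5.5, 7.0])       # known population means
cv_s = np.array([0.22, 0.31, 0.27, 0.35])   # sample CVs of W
CV_p = np.array([0.20, 0.30, 0.28, 0.33])   # known population CVs

# 3x3 system in (l1, l2, l3) obtained from the three calibration constraints
A = np.array([
    [Psi.sum(),          (Psi * wbar).sum(),        (Psi * cv_s).sum()],
    [(Psi * wbar).sum(), (Psi * wbar**2).sum(),     (Psi * wbar * cv_s).sum()],
    [(Psi * cv_s).sum(), (Psi * wbar * cv_s).sum(), (Psi * cv_s**2).sum()],
])
b = np.array([0.0,
              (Psi * (Wbar - wbar)).sum(),
              (Psi * (CV_p - cv_s)).sum()])
lam = np.linalg.solve(A, b)

# Calibrated weights reproduce all three constraints exactly
Psi_star = Psi * (1 + lam[0] + lam[1] * wbar + lam[2] * cv_s)
```

With the chi-square distance, the adjustment is linear in the auxiliary summaries, which is why the multipliers admit this closed-form solution.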
Substituting (8) into (4), (5), and (6), respectively, yields a system of three linear equations in the multipliers:

$$ \lambda_1 \sum_h \Psi_h + \lambda_2 \sum_h \Psi_h \bar{w}_h + \lambda_3 \sum_h \Psi_h cv_{w_h} = 0, $$
$$ \lambda_1 \sum_h \Psi_h \bar{w}_h + \lambda_2 \sum_h \Psi_h \bar{w}_h^2 + \lambda_3 \sum_h \Psi_h \bar{w}_h cv_{w_h} = \sum_h \Psi_h \left( \bar{W}_h - \bar{w}_h \right), $$
$$ \lambda_1 \sum_h \Psi_h cv_{w_h} + \lambda_2 \sum_h \Psi_h \bar{w}_h cv_{w_h} + \lambda_3 \sum_h \Psi_h cv_{w_h}^2 = \sum_h \Psi_h \left( CV_{W_h} - cv_{w_h} \right). \tag{9} $$

Solving Equation (9) gives the values of $\lambda_1$, $\lambda_2$, and $\lambda_3$; the intermediate quantities are provided in Appendix A. Substituting these values into (8) and then into (3), we obtain

$$ \hat{\bar{Y}}_{cal} = \sum_h \Psi_h \left( 1 + \hat{\lambda}_1 + \hat{\lambda}_2 \bar{w}_h + \hat{\lambda}_3\, cv_{w_h} \right) \hat{\bar{Y}}_h. \tag{10} $$
It is worth noting that the calibrated estimator $\hat{\bar{Y}}_{cal}$ may be extended to a more general form by choosing different calibration constraints; to keep the exposition focused, we retain the constraints given above. The formulation is, however, adjustable through the inclusion of various known population characteristics of the auxiliary variable, which permits a range of functional forms. For more on such generalisations, readers may consult Refs. [33,34,35].