1. Introduction
To analyze a subset of a given population, a sampling method is used, which may be probabilistic or non-probabilistic. In this paper, we are concerned with systematic sampling, which selects elements from the population at regular intervals. For a positive integer m, the process starts with an initial element g chosen at random from the first m units and then selects every m-th element thereafter, i.e., the sequence g, g + m, g + 2m, …, where 1 ≤ g ≤ m.
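For concreteness, the selection step can be sketched in a few lines of Python (a minimal illustration; the function name and interface are ours, not from the paper):

```python
import random

def systematic_sample(population, m, g=None):
    """Draw a linear systematic sample: pick a random start g in the
    first interval (1..m), then take every m-th element after it."""
    if g is None:
        g = random.randint(1, m)  # random start, 1-based
    return [population[i] for i in range(g - 1, len(population), m)]

# Example: population of 20 units, sampling interval m = 5, start g = 2
units = list(range(1, 21))
print(systematic_sample(units, m=5, g=2))  # [2, 7, 12, 17]
```

With a fixed start g the sample is fully determined, which is why the method is simple to implement and spreads the sample evenly across the frame.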
Systematic sampling is especially helpful when the population is uniformly distributed or logically ordered. For example, in agricultural studies, a researcher may systematically sample the plants in a field to assess crop health. In industrial quality control, every k-th product from an assembly line is inspected periodically to measure defects. This method is favored because it is simple, easy to implement, and yields samples evenly distributed across the entire population. Its cost-effectiveness is one of its main advantages, because it takes less time and requires less effort than simple random sampling. Systematic sampling is also an appropriate method for studying natural populations, such as estimating timber volume in forests (Zinger, [1]). The technique is widely used by research institutions, such as the United Nations Food and Agriculture Organization (FAO) in its 2010 Global Forest Resources Assessment survey.
The uniform systematic sampling approach has received considerable attention, and numerous modifications and improvements have been made to it. Further details on the theoretical foundations and practical application of systematic sampling are provided in existing literature on sampling theory. The primary aim of this approach is to estimate various population characteristics or trends. This study focuses on the estimation of the mean for a particular variable. Moreover, the accuracy of these estimates is improved by incorporating an additional auxiliary variable, particularly when the sample size for the principal variable is limited.
Many researchers have estimated the population mean using data from one or more auxiliary variables under systematic sampling. Initially, Swain [2] provided a ratio-type mean estimator in this context. Kushwaha and Singh [3] proposed nearly unbiased product and ratio estimators for the true mean of a study variable; their methods employ the jackknife technique first proposed by Ref. [4] or a refined form of the jackknife proposed by Singh and Singh [5]. Kushwaha and Kushwaha [6] introduced a group of estimators for the population mean in systematic sampling designs that utilize supporting information. Their study extended the estimator of Swain [2] by examining conditions that enhance its efficiency, considered difference estimators as particular cases, and identified scenarios where the proposed approach outperforms traditional estimation methods. Singh et al. [7,8] developed a generalized class of estimators for the true population mean in systematic sampling based on a linear transformation of a supplementary variable; they derived bias and MSE expressions and an asymptotically optimal estimator. Shahzad [9] extended this concept by including auxiliary attributes. Ali et al. [10] introduced two classes of estimators for cases where the data in a systematic random sampling setting contain outliers in one or both of the key variables: the first class comprises ratio-type robust regression-based estimators, and the second comprises regression-based estimators. Ref. [11] presented a regression-ratio technique for mean estimation using high-breakdown regression to mitigate outlier-induced effects; these estimators were further extended with quantile regression coefficients, which minimize absolute rather than squared deviations. Their efficiency, evaluated in a forestry case study on timber volume, exceeded that of the standard procedure by more than 30 percent. In recent work, Audu et al. [12] developed calibrated estimators for population proportions from diagonal systematic samples.
Outliers have serious effects on the accuracy of population parameter estimates, and frequent efforts have been made to minimize their impact through weighted methods. Such extreme values are sometimes simply discarded, yet their presence tends to cause significant overestimation or underestimation, undermining the reliability of traditional estimators and, in particular, inflating the MSE. This has been addressed previously in [13], where an array of estimation techniques was employed, including ratio, product, and regression-based methods, accompanied by both classical and robust estimators. However, systematic sampling methods that use re-descending M-estimators to handle outlying observations have not yet been explored. This paper addresses this gap by proposing M-type mean estimators with re-descending coefficients that improve estimation accuracy in the presence of anomalies under systematic sampling.
The remainder of this article explores the development and application of re-descending M-estimators for robust mean estimation in systematic sampling designs. Section 2 provides an overview of re-descending M-estimators as a potential solution under outlier-contaminated conditions, together with the adapted estimators. Section 3 explains the theoretical framework behind the proposed estimators under a systematic sampling design and presents the statistical properties of the estimators under study. Section 4 contains two main subsections covering real-world applications and simulation studies: in the real-world applications, the effectiveness of the developed estimators is evaluated on well-known datasets, including the mtcars and Trees datasets, with comparisons against the adapted estimators based on MSE and PRE, and the results are further compared through a simulation study under varying statistical scenarios. Section 4.2 interprets the findings. Finally, Section 6 summarizes the key findings, emphasizes the advantage of re-descending M-estimators in systematic sampling, and points out future research directions.
2. M-Estimator-Based Average Estimation
OLS is a fundamental tool in regression analysis across many applications. However, outliers are common in real-world data, and the method's assumption of a normal error distribution is then unrealistic. Consequently, OLS is unsuitable when anomalies are present in the data (Dunder et al. [14]), even for large datasets. To mitigate this drawback, re-descending regression methods have been developed as robust extensions of OLS. Even in large samples with a skewed logistic error distribution, OLS remains vulnerable to distortion by outliers, whereas robust M-estimators are more effective.

Huber, building on research into the effect of outliers on linear regression models, introduced the M-estimator. This estimator assigns weights close to one to central observations and weights near zero to extreme ones, thereby reducing their influence. Although not always stated explicitly, the M-estimator provides a reasonable method of inference within the framework of likelihood maximization when the normality assumption on the model errors may not hold. Instead of the squared error terms used in OLS, M-estimators minimize a symmetric loss function $\rho$ applied to the residuals:

$$\hat{\beta}_{M} = \arg\min_{\beta} \sum_{i=1}^{n} \rho\!\left(y_{i} - \mathbf{x}_{i}^{\top}\beta\right).$$
These estimators are constructed by adjusting the loss function $\rho$ to ensure a re-descending property (see Refs. [15,16,17,18,19,20,21]). They demonstrate a great ability to mitigate the influence of outliers, thus improving the reliability of statistical inference.
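To make the re-descending property concrete, the sketch below contrasts Huber's weight function (bounded influence, but never exactly zero) with Tukey's bisquare, a classic re-descending function whose weight vanishes beyond its tuning constant. This is an illustration only; the adapted estimators in the following subsections use the specific coefficients of Refs. [19,20,29,30,31], and the tuning constants shown are the conventional defaults.

```python
def huber_weight(e, c=1.345):
    """Huber weight: down-weights large residuals but never rejects them."""
    return 1.0 if abs(e) <= c else c / abs(e)

def bisquare_weight(e, c=4.685):
    """Tukey bisquare weight: re-descending, exactly zero beyond c."""
    if abs(e) > c:
        return 0.0
    return (1.0 - (e / c) ** 2) ** 2

# Central residuals keep weight ~1; extreme residuals are suppressed,
# and the re-descending weight reaches exactly zero.
for e in (0.0, 2.0, 10.0):
    print(e, huber_weight(e), bisquare_weight(e))
```

The key contrast: for a gross outlier (e.g., a residual of 10), Huber still assigns a small positive weight, while the bisquare weight is exactly zero, removing the point's influence entirely.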
The arithmetic mean is a key measure in statistical analysis. When a strong positive linear association exists between the auxiliary and primary study variables in survey sampling, ratio estimation provides an effective approach for estimating the population mean. Ratio estimation was first introduced by Cochran and has since been a major methodological breakthrough, especially in agricultural studies. For a comprehensive discussion of ratio-type estimators, readers may refer to Refs. [22,23,24].
Kadilar and Cingi [25] were the first to use ratio-type estimation incorporating OLS regression coefficients. However, traditional OLS-based estimators are highly sensitive to outliers and prone to delivering severely distorted estimates in their presence. To address this issue, Kadilar et al. [26], Zaman et al. [27], and Koc and Koc [28] introduced robust regression techniques that enhance the stability and accuracy of ratio estimators in simple random sampling. Building on this work, we propose an adapted class of mean estimators based on re-descending coefficients under systematic sampling.
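As a sketch of the estimator family this line of work builds on (assuming the regression-in-ratio form of Kadilar and Cingi [25]; the helper name below is ours), the slope b can come from OLS or from any robust fit:

```python
from statistics import mean

def ratio_regression_estimate(y, x, X_bar, b):
    """Ratio-type estimator of the population mean of y that folds in a
    regression slope b: ((y_bar + b*(X_bar - x_bar)) / x_bar) * X_bar.
    Replacing the OLS slope b with a robust (e.g., re-descending) slope
    yields the outlier-resistant variants discussed in this paper."""
    y_bar, x_bar = mean(y), mean(x)
    return (y_bar + b * (X_bar - x_bar)) / x_bar * X_bar

# Toy sample where y = 2x exactly, so the estimate recovers 2 * X_bar
print(ratio_regression_estimate([2, 4, 6], [1, 2, 3], X_bar=2, b=2))  # 4.0
```

The only change between the classical and robust versions is the source of b, which is exactly where the re-descending coefficients of the following subsections enter.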
2.1. Estimators with [19] Coefficients
To handle datasets containing outliers, Ref. [19] developed an estimator that penalizes large residuals through a weighting function. The estimator is tunable via the parameters c and a, which determine its robustness, efficiency, and flexibility in robust regression modeling. The corresponding objective function (OF), influence function (IF), and weight function (WF) are expressed as follows:
The adapted family of mean estimators under systematic sampling utilizing the re-descending coefficient proposed by Ref. [19] is defined as follows:
Note that
. The MSE of
is derived using a Taylor series expansion:
where:
2.2. Estimators with [29] Coefficients
In the work of Khan et al. [29], a novel estimator was introduced based on a hyperbolic tangent function; it is particularly effective in handling asymmetric heavy-tailed distributions. Khan et al. [29] used the hyperbolic cosine function as a weight function to minimize sensitivity to extreme residuals. The OF, IF, and WF are formulated as follows:
The modified class of mean estimators under systematic sampling utilizing the re-descending M-estimator coefficient from Khan et al. [29] is given by
The MSE of the estimators is given by
2.3. Estimators with [20] Coefficients
Ref. [20] introduced a re-descending-type estimator that employs polynomial-driven weight functions. When residuals exceed a defined threshold h, the weight function smoothly reduces to zero. This method ensures a high breakdown point, making it resistant to leverage points. The OF, IF, and WF are defined as follows:
The modified class of mean estimators using the re-descending M-estimator coefficient proposed by Anekwe and Onyeagu [20] is given by
The MSE of these estimators is given by
2.4. Estimators with [30] Coefficients
Ref. [30] defined a parameterized robustness method for a re-descending estimator. The weight function is deliberately designed so that its robustness and efficiency parameters, m and b, can be varied to achieve a desired balance. The OF, IF, and WF are defined as follows:
The modified class of mean estimators utilizing the re-descending M-estimator coefficient proposed by Raza et al. [30] is expressed as
The MSE of these estimators is formulated as
2.5. Estimators with [31] Coefficients
Raza et al. [31] suggested a re-descending-type estimator based on a higher-order polynomial, which minimizes the effect of anomalies while preserving accuracy in data regions away from the tails. The estimator's OF, IF, and WF are defined as follows:
The modified class of mean estimators using the re-descending M-estimator coefficient proposed by Raza et al. [31] is expressed as
The MSE of these estimators is given by
In the estimators above, the constants take the following values:
In Equation (13), $C_x$ represents the coefficient of variation and $\beta_{2(x)}$ denotes the kurtosis coefficient of $X$.
The tuning constants c, a, h, b, and m control the aggressiveness of the respective re-descending influence functions and thus directly govern the robustness-efficiency trade-off. These constants are not treated as free design parameters in this study; they are chosen according to the original formulations and recommendations in the relevant robust regression references [19,20,29,30,31]. Smaller values of c, a, or h cause more residuals to be classified as extreme and down-weighted (greater robustness at the cost of efficiency), whereas larger values down-weight fewer observations (greater efficiency, but weaker protection under heavy contamination). Similarly, m and b control the smoothness and tail decay of the weight function, determining whether extreme observations are gradually diminished or sharply discarded. In practice, these parameters are chosen from robust scale estimates of the residuals so that central values receive nearly equal weights while extreme values are heavily suppressed.
The importance of robust estimation methods increases in real-life situations where the data are noisy and contaminated [32,33]. As highlighted in the Introduction, several robust estimation methods have been established for approximating the mean in the presence and absence of anomalies, including Huber M-estimators [34], quantile regression [28], and jackknife-based methods [4]. However, these methods have been applied in simple or stratified random sampling; to the best of our knowledge, no existing study has applied re-descending M-estimators to mean estimation under a systematic sampling design in outlier-contaminated environments. This application is the novelty of our work: the development of 20 adapted estimators specifically designed to improve robustness in the systematic sampling case.
3. Proposed Family:
Data points that deviate markedly from the rest of a dataset are called outliers. Their presence can strongly affect the mean as a measure of the center of the distribution. Traditional estimators of the population mean ( ) are typically constructed with OLS regression coefficients. However, these coefficients are heavily influenced by anomalies, making the resulting estimates of the population mean ( ) unreliable. M-estimators provide a robust alternative for data that are non-normal and contaminated by extreme values, and re-descending coefficients are particularly insensitive to outliers (Raza et al. [30,31]). OLS and re-descending regression both model the link between a response variable and a predictor but differ in their optimization criterion. We extend the adapted class to a novel class of robust estimators driven by re-descending coefficients. The developed family of estimators within the systematic random sampling design is
where the coefficient term represents any of the five re-descending coefficients used in this context, and the remaining notations retain their standard interpretation as discussed in previous sections. The MSE corresponding to the developed estimator family is given by

Inserting the corresponding values from Ref. [11] into Equation (15) results in:
It is essential to highlight that our estimation process incorporates five distinct re-descending coefficients. To enhance robustness, these coefficients are determined iteratively so as to dampen extreme values, and the weight functions applied by each estimator are specifically structured to reduce the effect of large residuals. Initially, the OLS technique supplies the starting estimate. Subsequently, an iteratively reweighted least squares algorithm is applied, in which the residual weights are updated at each iteration t according to
The iteration is repeated until the convergence criterion is satisfied, with the tolerance defined as a predefined threshold.
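The update scheme just described can be sketched as follows. This is a minimal illustration assuming Tukey's bisquare as the re-descending weight and MAD-based residual scaling; the actual estimators use the weight functions of Refs. [19,20,29,30,31], and the function name is ours.

```python
import numpy as np

def irls_slope(x, y, c=4.685, tol=1e-8, max_iter=100):
    """Iteratively reweighted least squares for a simple linear fit:
    start from OLS, then down-weight large residuals at each iteration
    with a re-descending (here, Tukey bisquare) weight function."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # initial OLS fit
    for _ in range(max_iter):
        r = y - X @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745  # robust scale (MAD)
        if s == 0:
            s = 1.0
        u = r / (c * s)
        w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)  # bisquare weights
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:        # predefined tolerance
            beta = beta_new
            break
        beta = beta_new
    return beta[1]

# One gross outlier barely moves the robust slope of y = 1 + 2x
x = np.arange(10.0)
y = 1.0 + 2.0 * x
y[-1] = 100.0                                            # contaminate one unit
print(irls_slope(x, y))
```

Because the bisquare weight reaches exactly zero, the contaminated point is effectively removed once the fit stabilizes, which is the behavior the re-descending coefficients are meant to provide.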
For clarity, the five key members of the developed family and their MSE expressions are presented in compact form as follows:
We should also note that the MSE expressions were obtained using a first-order Taylor series approximation. This approach is standard in the analysis of robust estimation schemes: it is simple and yields a sensible analytical description of estimator behavior under mild regularity conditions. The first-order expansion captures the principal linear behavior of the estimator around the true parameter, which is usually sufficient for large samples or moderate deviations. Although higher-order terms may provide more accurate approximations, especially in the presence of extreme outliers or heavy tails, they contribute little to the bias and variance in practical systematic sampling problems. Moreover, including higher-order terms adds considerable algebraic complexity without a guaranteed improvement in accuracy. The first-order approximation therefore offers a reasonable trade-off between analytical tractability and empirical validity, as confirmed by the agreement between the theoretical MSEs and the results of our simulation studies in Section 4.
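For reference, the same first-order device applied to the classical ratio estimator (a standard textbook case, not one of the proposed estimators) proceeds as follows, where $e_0$ and $e_1$ are the relative estimation errors and $\theta$ denotes the design-dependent variance factor:

```latex
% First-order Taylor (delta-method) sketch for the classical ratio
% estimator \bar{y}_R = \bar{y}\,\bar{X}/\bar{x}; standard result shown
% only to illustrate the expansion technique used for the MSEs above.
\begin{align*}
e_0 &= \frac{\bar{y}-\bar{Y}}{\bar{Y}}, \qquad
e_1  = \frac{\bar{x}-\bar{X}}{\bar{X}},\\
\bar{y}_R &= \bar{Y}\,(1+e_0)(1+e_1)^{-1}
  \approx \bar{Y}\left(1 + e_0 - e_1 + e_1^{2} - e_0 e_1\right),\\
\operatorname{MSE}(\bar{y}_R) &\approx
  \theta\,\bar{Y}^{2}\left(C_y^{2} + C_x^{2} - 2\rho_{yx}\,C_y C_x\right).
\end{align*}
```

Retaining only the terms linear in $e_0$ and $e_1$ when squaring gives the familiar first-order MSE; the proposed estimators' MSEs are obtained by the same mechanism with their respective robust coefficients in place of the classical ratio.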
The efficiency conditions of the developed family can be determined by comparing the MSE of the proposed estimators with that of the adapted estimators: