1. Introduction
Sample surveys are designed to provide reliable estimates of the overall means of a finite population and means for large domains or sub-populations (areas). For areas with small sample sizes (called small areas), direct area-specific estimators from the survey data are unreliable, and it is necessary to use model-based methods that link area means to related covariates and random area effects. The resulting model-based estimators can lead to a significant increase in precision relative to direct estimators. Rao and Molina [1], in Chapter 6, provide a detailed account of model-based estimation under area level models. The effective selection of auxiliary variables to be included in the linking model is important for the success of model-based small area estimation (SAE).
A basic area level model, due to Fay and Herriot [2], is widely used for SAE in practice. Suppose that we have $m$ areas with direct estimators $y_i$ of the area means $\theta_i$ ($i = 1, \ldots, m$) and associated candidate covariate vectors $x_i$. The area level model consists of two components: a sampling model given by
$$y_i = \theta_i + e_i \tag{1}$$
and a linking model given by
$$\theta_i = x_i^T \beta + v_i, \tag{2}$$
where $e_i$ denotes sampling errors assumed to be independent $N(0, \psi_i)$ with known sampling variance $\psi_i$, and $v_i$ denotes random area effects independent of $e_i$ that are assumed to be independent and identically distributed (iid) as $N(0, \sigma_v^2)$ with unknown variance $\sigma_v^2$. In practice, the sampling variances are obtained by smoothing the estimators of sampling variances using the Generalized Variance Function (GVF) method [3] and treating the smoothed estimators as the sampling variances $\psi_i$. The linking model (2) has the form of a standard linear regression model, so standard variable selection criteria, such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), could be applied to select variables if the area means $\theta_i$ were known. Because they are not known, Lahiri and Suntornchost [4] estimated the resulting selection criteria using the sampling model (1) and proposed to use the estimated criteria for variable selection (see Section 2.1 for details). We refer the reader to Rao and Molina [1], Chapter 6, for details of empirical best linear unbiased prediction (EBLUP) estimators of area means from the models (1) and (2) for specified covariate vectors $x_i$. The EBLUP estimator of $\theta_i$ is a weighted average of the direct estimator $y_i$ and a synthetic regression estimator $x_i^T \hat{\beta}$, where $\hat{\beta}$ denotes an estimator of the regression parameter vector $\beta$. For a non-sampled area, a direct estimator is not available. Hence, the synthetic estimator $x_i^T \hat{\beta}$ is used to estimate the small area mean, provided the associated $x_i$ is known. Fay and Herriot [2] obtained EBLUP estimates of per-capita income for small places in the USA, using the basic area level model given by (1) and (2).
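The weighted-average form of the estimator described above can be illustrated with a minimal Python sketch. It assumes the standard form of the area level model (direct estimator = area mean + sampling error, area mean = linear predictor + area effect) and, for simplicity, treats the area-effect variance as known, so it computes the BLUP rather than the EBLUP; in practice that variance would be replaced by an estimate. The function name `fh_blup`, the variable names, and all simulated values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fh_blup(y, X, psi, sigma2_v):
    """BLUP of area means under the basic area level model.

    Assumption: sigma2_v (area-effect variance) is treated as known here;
    an EBLUP would plug in an estimate instead.
    """
    # Marginal variance of y_i is sigma2_v + psi_i, which gives the WLS weights.
    w = 1.0 / (sigma2_v + psi)
    XtWX = X.T @ (X * w[:, None])
    XtWy = X.T @ (w * y)
    beta_hat = np.linalg.solve(XtWX, XtWy)
    synthetic = X @ beta_hat                 # synthetic regression estimator
    gamma = sigma2_v / (sigma2_v + psi)      # weight on the direct estimator
    blup = gamma * y + (1.0 - gamma) * synthetic
    return blup, synthetic, gamma, beta_hat

# Hypothetical data: 30 areas, an intercept and one covariate.
rng = np.random.default_rng(0)
m = 30
X = np.column_stack([np.ones(m), rng.normal(size=m)])
beta = np.array([1.0, 2.0])
psi = rng.uniform(0.5, 1.5, size=m)          # "known" sampling variances
sigma2_v = 1.0
theta = X @ beta + rng.normal(scale=np.sqrt(sigma2_v), size=m)
y = theta + rng.normal(scale=np.sqrt(psi))

blup, synthetic, gamma, beta_hat = fh_blup(y, X, psi, sigma2_v)
```

Areas with a large sampling variance receive a small weight `gamma` on the direct estimator, so their estimate leans more heavily on the synthetic component; for a non-sampled area only `synthetic` is available.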
Estimation of means for subareas nested within areas is of considerable interest. Mohadjer et al. [5] studied adult literacy for counties (subareas) sampled from states (areas), using data from the 2003 U.S. National Assessment of Adult Literacy. A two-fold subarea model is used to estimate subarea means $\theta_{ij}$ from $n_i$ subareas $j$ sampled from $m$ areas $i$. A two-fold linking model on the subarea means $\theta_{ij}$ is given by
$$\theta_{ij} = x_{ij}^T \beta + v_i + u_{ij}, \tag{3}$$
where $x_{ij}$ is the vector of covariates associated with $\theta_{ij}$, and $v_i$ is the random area effect, independent of the random subarea effect $u_{ij}$. Furthermore, $v_i \overset{\text{iid}}{\sim} N(0, \sigma_v^2)$ and $u_{ij} \overset{\text{iid}}{\sim} N(0, \sigma_u^2)$. The linking model (3) is combined with the sampling model for the direct estimators $y_{ij}$, which is given by
$$y_{ij} = \theta_{ij} + e_{ij}, \tag{4}$$
where $e_{ij}$ are sampling errors independently distributed as $N(0, \psi_{ij})$ with known sampling variances $\psi_{ij}$, and assumed to be independent of $v_i$ and $u_{ij}$. Torabi and Rao [6] obtained EBLUP estimators of subarea means for sampled subareas as well as non-sampled subareas. An advantage of the two-fold model is that the EBLUP estimator for a non-sampled subarea involves both the synthetic estimator for that subarea and the direct estimators for the sampled subareas within the same area. For a non-sampled subarea within a non-sampled area, a synthetic estimator is used under the two-fold model. For variable selection under the two-fold model, Cai et al. [7] transformed the linking model to a standard regression model and applied variable selection criteria to the reduced model; see Section 2.2 for details.
Three-fold linking models involving sub-subareas (level 3) nested within subareas (level 2), which in turn are nested within areas (level 1), are also of practical interest. For example, such models were used in the Program for the International Assessment of Adult Competencies (PIAAC) to estimate means for sub-subareas (counties) nested within subareas (states), which in turn are nested within areas (census divisions). Details of this application are reported in Krenzke et al. [8] and Ren et al. [9]. A three-fold linking model on the sub-subarea means $\theta_{ijk}$ is given by
$$\theta_{ijk} = x_{ijk}^T \beta + v_i + u_{ij} + w_{ijk}, \tag{5}$$
where $k$ denotes a sub-subarea nested within subarea $j$ nested within area $i$, $x_{ijk}$ is the vector of covariates associated with $\theta_{ijk}$, $v_i$ is the random area effect, $u_{ij}$ is the random subarea effect, and $w_{ijk}$ is the random sub-subarea effect. We assume that all the $L$ areas in the population are included in the sample, but not all the subareas within an area are covered by the sample. Furthermore, not all the sub-subareas within a subarea covered by the sample are included in the sample. We assume that the three random effects in the model (5) are independent, with $v_i \overset{\text{iid}}{\sim} N(0, \sigma_v^2)$, $u_{ij} \overset{\text{iid}}{\sim} N(0, \sigma_u^2)$ and $w_{ijk} \overset{\text{iid}}{\sim} N(0, \sigma_w^2)$. The linking model (5) is combined with the sampling model for the direct estimators $y_{ijk}$ of the means $\theta_{ijk}$ for the sub-subareas in the sample. It is given by
$$y_{ijk} = \theta_{ijk} + e_{ijk}, \tag{6}$$
where the $e_{ijk}$ are sampling errors assumed to be independently distributed as $N(0, \psi_{ijk})$ with known sampling variances $\psi_{ijk}$, and they are assumed to be independent of the random effects $v_i$, $u_{ij}$, and $w_{ijk}$. In practice, the sampling variances are obtained by smoothing the estimated sampling variances, as done in the PIAAC project.
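As a rough illustration of the nested structure just described, the following Python sketch draws sub-subarea means and direct estimators from a three-fold model with independent normal area, subarea and sub-subarea effects plus normal sampling errors. The function name `simulate_three_fold`, the argument names, the layout (3 areas, 2 subareas per area, 4 sub-subareas per subarea) and all numeric values are hypothetical choices for the example, not settings from the paper.

```python
import numpy as np

def simulate_three_fold(X, area, subarea, beta, sd_v, sd_u, sd_w, psi, rng):
    """Draw (theta, y) from a three-fold linking model plus sampling model.

    area, subarea: integer labels, one per sub-subarea; subarea labels are
    assumed to be unique across areas. psi: known sampling variances.
    """
    _, area_idx = np.unique(area, return_inverse=True)
    _, sub_idx = np.unique(subarea, return_inverse=True)
    v = rng.normal(scale=sd_v, size=area_idx.max() + 1)  # area effects
    u = rng.normal(scale=sd_u, size=sub_idx.max() + 1)   # subarea effects
    w = rng.normal(scale=sd_w, size=len(area))           # sub-subarea effects
    theta = X @ beta + v[area_idx] + u[sub_idx] + w      # linking model
    y = theta + rng.normal(scale=np.sqrt(psi))           # sampling model
    return theta, y

# Hypothetical balanced layout.
rng = np.random.default_rng(1)
n_area, n_sub, n_subsub = 3, 2, 4
area = np.repeat(np.arange(n_area), n_sub * n_subsub)
subarea = np.repeat(np.arange(n_area * n_sub), n_subsub)
n = area.size
X = np.column_stack([np.ones(n), rng.normal(size=n)])
psi = np.full(n, 0.25)
beta = np.array([2.0, 1.0])

theta, y = simulate_three_fold(X, area, subarea, beta, 1.0, 0.5, 0.5, psi, rng)
```

Every sub-subarea in the same area shares one draw of the area effect, and every sub-subarea in the same subarea additionally shares one draw of the subarea effect, which is what induces the within-area and within-subarea correlation.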
The survey design may not have the same hierarchical structure as the linking model (5). For example, in the PIAAC project, data from a stratified multistage sample with a different hierarchical structure are used. Given the vector of covariates $x_{ijk}$ after variable selection, EBLUP estimators of the sub-subarea means can be obtained. Because all the areas in the population are included in the sample, the EBLUP estimators for non-sampled sub-subareas, whether within a sampled subarea or within a non-sampled subarea, avoid pure synthetic estimation by virtue of the area effects $v_i$ included in the linking model (5). In the PIAAC study, a hierarchical Bayes (HB) approach was used to estimate the population sub-subarea means. We will report EBLUP estimation for the three-fold model given by (5) and (6) in a separate paper.
The main purpose of this paper is to extend the transformation method of Cai et al. [7] for variable selection to three-fold models given by (5) and (6). We propose two transformation-based methods for variable selection: one is parameter-free and the other is parameter-dependent.
Section 2 reviews some relevant variable selection methods for the area level model and the two-fold subarea model. Variable selection methods for the three-fold model are presented in Section 3. Results of a simulation study on the performance of the proposed methods relative to some naive alternatives, based on one-fold and two-fold models, are presented in Section 4. Some concluding remarks are presented in Section 5.
4. Results of a Simulation Study
This section provides results of a limited simulation study on the performance of the proposed method for variable selection for sub-subarea linking models. The simulation data are generated from the three-fold sub-subarea model given by (
5) and (
6). The number of areas is set to
and the number of subareas sampled from each area
i,
, is set to
. The number of sampled sub-subareas is taken as
for every subarea
j in areas
,
for every subarea in areas
and
for each subarea in areas
. The sampling standard deviation
in the sampling model (
6) is generated from
. The standard deviation of the sub-subarea random effect in the linking model (
5) is set to
. A few settings for the standard deviations of the area-level and subarea-level random effects,
, are used:
,
,
,
,
,
and
. We consider a linking model that has an intercept term with corresponding covariate
and eight other covariates
(
) generated as follows:
with mean
and variance
,
with shape parameter
and rate parameter 2,
,
,
,
with shape parameters
and
,
on the interval
, and
with mean parameter
. The value of the regression parameter vector
is set to
. It corresponds to a true model consisting of the intercept term of value 2 and covariates
,
,
and
. For variable selection, we always include the intercept term when we compare all possible sub-models defined by the inclusion/exclusion of the eight variables
.
We carried out 5000 simulation runs. The covariates were generated first and kept fixed across all runs. Then, we generated the response vectors
,
, from the sub-subarea model given by (
9) and (
10) for each simulation run, using the specified settings.
We report the performance of the proposed method with parameter-free transformation (
) and parameter-dependent transformation (
). For
, the true parameter values are used here, for simplicity. Under estimated parameter values, the performance of
is likely to be inferior. The parameter-free and parameter-dependent methods of Cai et al. [
7] for the two-fold subarea model are used for comparison. To fit a two-fold subarea model to the data with a three-fold structure, the actual sub-subareas are treated as the subareas in the two-fold model. We can treat either (i) the actual subareas or (ii) the actual areas as the areas in the two-fold model. Treatment (i) is a natural choice when there is substantial subarea-level variability. Under treatment (i), where the actual subareas are treated as areas, the parameter-free transformation under the two-fold model is algebraically identical to the parameter-free transformation under the three-fold model. As a result, variable selection based on the parameter-free transformation under treatment (i) leads to the same set of variables as that under the three-fold model. However, it leads to pure synthetic estimates for non-sampled areas (actual subareas). Moreover, computationally, there is no advantage of treatment (i) over the three-fold model because the same transformation is used. On the other hand, the parameter-dependent method applied to treatment (i) may lead to a different set of variables. Therefore, we report the simulation results only for the parameter-dependent method under treatment (i), which is denoted as
. The two-fold parameter-free and parameter-dependent methods under treatment (ii) are denoted as
and
, respectively. Under treatment (ii), pure synthetic estimation is avoided because all areas are sampled. For comparison, we further consider three naive methods designed for the one-fold FH model and the regular linear regression model, including the Lahiri–Suntornchost [
4] method (Naive-LS) and Han’s [
10] cAIC method (Naive-cAIC) for the FH model, as well as an information criterion-based method for the regular linear regression model fitted naively to the data (Naive-LM). For Naive-LS and Naive-cAIC, the actual sub-subareas are treated as the areas in the FH model. For Naive-LM, the sub-subarea level direct estimator
is treated as the response variable of the regular linear regression model.
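To make the all-subsets comparison concrete, here is a minimal Python sketch in the spirit of the Naive-LM baseline: an ordinary linear regression is fitted to the responses for every sub-model that contains the intercept, and the sub-model with the smallest BIC is selected. This is not the transformation-based method proposed in the paper. The function names `bic_ols` and `best_subset_bic`, the Gaussian BIC formula used (up to an additive constant), and the toy data with four candidate covariates are all assumptions made for illustration.

```python
import itertools
import numpy as np

def bic_ols(y, X):
    """Gaussian BIC of an OLS fit, up to an additive constant."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta_hat) ** 2))
    # p regression coefficients plus one error-variance parameter.
    return n * np.log(rss / n) + (p + 1) * np.log(n)

def best_subset_bic(y, X_cand):
    """Search all subsets of candidate covariates; the intercept is always kept."""
    n, q = X_cand.shape
    intercept = np.ones((n, 1))
    best_bic, best_subset = np.inf, ()
    for size in range(q + 1):
        for subset in itertools.combinations(range(q), size):
            X = np.hstack([intercept, X_cand[:, list(subset)]])
            bic = bic_ols(y, X)
            if bic < best_bic:
                best_bic, best_subset = bic, subset
    return best_subset, best_bic

# Hypothetical toy data: covariates 0 and 2 are in the true model.
rng = np.random.default_rng(2)
n, q = 200, 4
X_cand = rng.normal(size=(n, q))
y = 2.0 + 2.0 * X_cand[:, 0] - 3.0 * X_cand[:, 2] + rng.normal(size=n)

selected, bic = best_subset_bic(y, X_cand)
```

With eight candidate covariates, as in the simulation, the same loop would visit 2^8 = 256 sub-models per run. Replacing the BIC penalty `(p + 1) * np.log(n)` with `2 * (p + 1)` would give the corresponding AIC-based search.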
Table 1 summarizes the simulation results for variable selection using BIC.
The proposed parameter-free and parameter-dependent methods perform equally well, with a stable rate of selecting the true model under all settings for the area-level and subarea-level standard deviations. The two-fold parameter-dependent method under treatment (i), which treats the actual subareas as areas in the two-fold model, exhibits similar performance to that of the proposed methods. All the other methods have inferior performance and display a dramatic decay in the rate of selecting the true model when the area-level and subarea-level standard deviations increase. This indicates that in the presence of strong area-level or subarea-level effects, which often occur in practice, the two proposed methods and the two-fold parameter-dependent method under treatment (i) are preferred over the other alternative methods.
The simulation results based on AIC and Naive-cAIC are given in Table 2.
Compared with BIC, AIC gives a significantly lower true-model selection rate under all the methods. As with BIC, the two proposed methods and the two-fold parameter-dependent method under treatment (i) perform equally well, yield stable results across the different settings for the random-effect standard deviations, and perform better than the other methods. The two-fold parameter-free and parameter-dependent methods under treatment (ii) perform slightly better than these three methods under some of the settings but notably worse under the others. Methods Naive-LS, Naive-LM and Naive-cAIC have significantly lower rates of selecting the true model than the other methods.
Table 3 reports simulation results under Mallows’ $C_p$ criterion for variable selection. The results in Table 3 are similar to those reported in Table 2 under AIC, and the same conclusions hold.