1. Introduction
Sample surveys are designed to provide reliable estimates of the overall means of a finite population and of the means for large domains or sub-populations (areas). For areas with small sample sizes (called small areas), direct area-specific estimators from the survey data are unreliable, and it is necessary to use model-based methods that link area means to related covariates and random area effects. The resulting model-based estimators can lead to a significant increase in precision relative to direct estimators. Rao and Molina [1], in Chapter 6, provide a detailed account of model-based estimation under area level models. The effective selection of auxiliary variables to be included in the linking model is important for the success of model-based small area estimation (SAE).
A basic area level model, due to Fay and Herriot [2], is widely used for SAE in practice. Suppose that we have $m$ areas with direct estimators $y_i$ of the area means $\theta_i$ ($i = 1, \ldots, m$) and associated candidate covariate vectors $x_i$. The area level model consists of two components: a sampling model given by
$$y_i = \theta_i + e_i, \tag{1}$$
and a linking model given by
$$\theta_i = x_i^\top \beta + v_i, \tag{2}$$
where $e_i$ denotes sampling errors assumed to be independent $N(0, \psi_i)$ with known sampling variance $\psi_i$, and $v_i$ denotes random area effects independent of $e_i$ that are assumed to be independent and identically distributed (iid) as $N(0, \sigma_v^2)$ with unknown variance $\sigma_v^2$. In practice, the sampling variances are obtained by smoothing the estimators of sampling variances using the Generalized Variance Function (GVF) method [3] and treating the smoothed estimators as the sampling variances $\psi_i$. The linking model (2) has the form of a standard linear regression model, so standard variable selection criteria, such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), could be applied to select variables if the area means $\theta_i$ were known. Lahiri and Suntornchost [4] estimated the resulting selection criteria using the sampling model (1) and proposed using the estimated criteria for variable selection (see Section 2.1 for details). We refer the reader to Rao and Molina [1], Chapter 6, for details of empirical best linear unbiased prediction (EBLUP) estimators of area means from the models (1) and (2) for specified covariate vectors $x_i$. The EBLUP estimator of $\theta_i$ is a weighted average of the direct estimator $y_i$ and a synthetic regression estimator $x_i^\top \hat{\beta}$, where $\hat{\beta}$ denotes an estimator of the regression parameter vector $\beta$. For a non-sampled area, a direct estimator is not available. Hence, the synthetic estimator $x_i^\top \hat{\beta}$ is used to estimate the small area mean, provided the associated $x_i$ is known. Fay and Herriot [2] obtained EBLUP estimates of per-capita income for small places in the USA, using the basic area level model given by (1) and (2).
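The weighted-average structure of the estimator can be illustrated numerically. The sketch below assumes the standard form from the small area literature, in which the weight on the direct estimator is the ratio of the random-effect variance to the sum of that variance and the sampling variance, and the regression parameter is estimated by weighted least squares. For simplicity it treats the random-effect variance as known, so it computes a BLUP rather than a full EBLUP. The function name `fh_blup` and all numeric values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fh_blup(y, X, psi, sigma2_v):
    """BLUP of area means under a basic area level model.

    Assumes the random-effect variance sigma2_v is known; an EBLUP would
    replace it with an estimate.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    psi = np.asarray(psi, dtype=float)
    # Weighted least squares for beta, with weights 1 / (sigma2_v + psi_i).
    w = 1.0 / (sigma2_v + psi)
    beta_hat = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    synthetic = X @ beta_hat
    # Weight on the direct estimator: larger when sampling variance is small.
    gamma = sigma2_v / (sigma2_v + psi)
    return gamma * y + (1.0 - gamma) * synthetic, gamma, beta_hat

# Hypothetical data: 30 areas, intercept plus one covariate.
rng = np.random.default_rng(0)
m = 30
X = np.column_stack([np.ones(m), rng.normal(size=m)])
beta = np.array([2.0, 1.0])
sigma2_v = 1.0
psi = rng.uniform(0.5, 2.0, size=m)                  # known sampling variances
theta = X @ beta + rng.normal(scale=np.sqrt(sigma2_v), size=m)   # linking model
y = theta + rng.normal(scale=np.sqrt(psi))           # sampling model
theta_hat, gamma, beta_hat = fh_blup(y, X, psi, sigma2_v)
```

Each element of `theta_hat` lies between the corresponding direct estimate and synthetic estimate; areas with large sampling variance are pulled more strongly toward the regression fit.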
Estimation of means for subareas nested within areas is of considerable interest. Mohadjer et al. [5] studied adult literacy for counties (subareas) sampled from states (areas), using data from the 2003 U.S. National Assessment of Adult Literacy. A two-fold subarea model is used to estimate subarea means $\theta_{ij}$ from $n_i$ subareas $j$ sampled from $m$ areas $i$. A two-fold linking model on the subarea means $\theta_{ij}$ is given by
$$\theta_{ij} = x_{ij}^\top \beta + v_i + u_{ij}, \tag{3}$$
where $x_{ij}$ is the vector of covariates associated with $\theta_{ij}$, and $v_i$ is a random area effect independent of the random subarea effect $u_{ij}$. Furthermore, $v_i \overset{\text{iid}}{\sim} N(0, \sigma_v^2)$ and $u_{ij} \overset{\text{iid}}{\sim} N(0, \sigma_u^2)$. The linking model (3) is combined with the sampling model for the direct estimators $y_{ij}$, which is given by
$$y_{ij} = \theta_{ij} + e_{ij}, \tag{4}$$
where $e_{ij}$ are sampling errors independently distributed as $N(0, \psi_{ij})$ with known sampling variances $\psi_{ij}$, and assumed to be independent of $v_i$ and $u_{ij}$. Torabi and Rao [6] obtained EBLUP estimators of subarea means for sampled subareas as well as non-sampled subareas. An advantage of the two-fold model is that the EBLUP estimator of a non-sampled subarea involves both the synthetic estimator of $\theta_{ij}$ and the direct estimators for the sampled subareas within the same area. For a non-sampled subarea within a non-sampled area, a synthetic estimator is used under the two-fold model. For variable selection under the two-fold model, Cai et al. [7] transformed the linking model to a standard regression model and applied variable selection criteria to the reduced model; see Section 2.2 for details.
Three-fold linking models involving sub-subareas (level 3) nested within subareas (level 2), which in turn are nested within areas (level 1), are also of practical interest. For example, such models were used in the Program for the International Assessment of Adult Competencies (PIAAC) in the context of estimating means for sub-subareas (counties) nested within subareas (states), which in turn are nested within areas (census divisions). Details of this application are reported in Krenzke et al. [8] and Ren et al. [9]. A three-fold linking model on the sub-subarea means $\theta_{ijk}$ is given by
$$\theta_{ijk} = x_{ijk}^\top \beta + v_i + u_{ij} + w_{ijk}, \tag{5}$$
where $k$ denotes a sub-subarea nested within subarea $j$ nested within area $i$, $x_{ijk}$ is the vector of covariates associated with $\theta_{ijk}$, $v_i$ is the random area effect, $u_{ij}$ is the random subarea effect, and $w_{ijk}$ is the random sub-subarea effect. We assume that all the $L$ areas in the population are included in the sample, but not all the subareas within an area are covered by the sample. Furthermore, not all the sub-subareas within a subarea covered by the sample are included in the sample. We assume that the three random effects in the model (5) are independent, $v_i \overset{\text{iid}}{\sim} N(0, \sigma_v^2)$, $u_{ij} \overset{\text{iid}}{\sim} N(0, \sigma_u^2)$ and $w_{ijk} \overset{\text{iid}}{\sim} N(0, \sigma_w^2)$. The linking model (5) is combined with the sampling model for the direct estimators $y_{ijk}$ of the means $\theta_{ijk}$ for the sub-subareas in the sample. It is given by
$$y_{ijk} = \theta_{ijk} + e_{ijk}, \tag{6}$$
where the $e_{ijk}$ are sampling errors assumed to be independently distributed as $N(0, \psi_{ijk})$ with known sampling variances $\psi_{ijk}$, and they are assumed to be independent of the random effects $v_i$, $u_{ij}$, and $w_{ijk}$. In practice, the sampling variances are obtained by smoothing the estimated sampling variances, as done in the PIAAC project.
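As a rough illustration of the nested structure, the sketch below generates data from a three-fold model of this kind: each sub-subarea mean is a regression term plus an area effect, a subarea effect, and a sub-subarea effect, and each direct estimator is the mean plus a sampling error. The function name `simulate_three_fold`, the label arrays, the sizes, and every numeric value are assumptions made for the example; they are not the settings used in the paper.

```python
import numpy as np

def simulate_three_fold(X, area, subarea, beta, sd_area, sd_sub, sd_subsub, psi, rng):
    """Draw sub-subarea means and direct estimators from a three-fold model.

    area[k] and subarea[k] are integer labels (starting at 0) of the area and
    the globally numbered subarea that contain sub-subarea k.
    """
    n_area = area.max() + 1
    n_sub = subarea.max() + 1
    v = rng.normal(scale=sd_area, size=n_area)         # area effects
    u = rng.normal(scale=sd_sub, size=n_sub)           # subarea effects
    w = rng.normal(scale=sd_subsub, size=len(area))    # sub-subarea effects
    theta = X @ beta + v[area] + u[subarea] + w        # linking model
    y = theta + rng.normal(scale=np.sqrt(psi))         # sampling model
    return theta, y

# Hypothetical layout: 3 areas, 4 subareas per area, 5 sub-subareas per subarea.
n_area, n_sub_per_area, n_ss = 3, 4, 5
subarea = np.repeat(np.arange(n_area * n_sub_per_area), n_ss)
area = subarea // n_sub_per_area

rng = np.random.default_rng(0)
n = len(area)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([2.0, 1.0])
psi = rng.uniform(0.5, 1.5, size=n)                    # known sampling variances
theta, y = simulate_three_fold(X, area, subarea, beta, 1.0, 0.5, 0.5, psi, rng)
```

Sub-subareas in the same subarea share both `v` and `u`, while sub-subareas in different subareas of the same area share only `v`, which is what induces the two levels of correlation.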
The survey design may not have the same hierarchical structure as the linking model (5). For example, in the PIAAC project, data from a stratified multistage sample with a different hierarchical structure are used. Given the vector of covariates $x_{ijk}$ after variable selection, EBLUP estimators of the sub-subarea means can be obtained. Note that the EBLUP estimators for non-sampled sub-subareas within a sampled subarea, as well as those within non-sampled subareas, avoid pure synthetic estimation by virtue of the area effects $v_i$ included in the linking model (5), because all the areas in the population are included in the sample. In the PIAAC study, a hierarchical Bayes (HB) approach was used to estimate the population sub-subarea means. We will report EBLUP estimation for the three-fold model given by (5) and (6) in a separate paper.
The main purpose of this paper is to extend the transformation method of Cai et al. [7] for variable selection to three-fold models given by (5) and (6). We propose two transformation-based methods for variable selection: one is parameter-free and the other is parameter-dependent. Section 2 reviews some relevant variable selection methods for the area level model and the two-fold subarea model. Variable selection methods for the three-fold model are presented in Section 3. Results of a simulation study on the performance of the proposed methods relative to some naive alternatives, based on one-fold and two-fold models, are presented in Section 4. Some concluding remarks are presented in Section 5.
4. Results of a Simulation Study
This section provides results of a limited simulation study on the performance of the proposed methods for variable selection for sub-subarea linking models. The simulation data are generated from the three-fold sub-subarea model given by (5) and (6). The number of areas is set to
 and the number of subareas sampled from each area 
i, 
, is set to 
. The number of sampled sub-subareas is taken as 
 for every subarea 
j in areas 
, 
 for every subarea in areas 
 and 
 for each subarea in areas 
. The sampling standard deviation 
 in the sampling model (
6) is generated from 
. The standard deviation of the sub-subarea random effect in the linking model (
5) is set to 
. A few settings for the standard deviations of the area-level and subarea-level random effects, 
, are used: 
, 
, 
, 
, 
, 
 and 
. We consider a linking model that has an intercept term with corresponding covariate 
 and eight other covariates 
 (
) generated as follows: 
 with mean 
 and variance 
, 
 with shape parameter 
 and rate parameter 2, 
, 
, 
, 
 with shape parameters 
 and 
, 
 on the interval 
, and 
 with mean parameter 
. The value of the regression parameter vector 
 is set to 
. It corresponds to a true model consisting of the intercept term of value 2 and covariates 
, 
, 
 and 
. For variable selection, we always include the intercept term when we compare all possible sub-models defined by the inclusion/exclusion of the eight variables 
.
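The all-subsets comparison described above, in which the intercept is always kept and each of the eight candidate covariates is either included or excluded, amounts to scoring 2^8 = 256 sub-models. The following sketch shows one way to organize such a search with BIC for an ordinary linear regression. It is not the transformation-based method proposed in this paper; the helper names `submodels`, `bic_ols`, and `best_subset`, and the generated data, are assumptions for illustration only.

```python
import itertools
import numpy as np

def submodels(q):
    """Yield column-index tuples for every sub-model that keeps column 0."""
    for size in range(q + 1):
        for subset in itertools.combinations(range(1, q + 1), size):
            yield (0,) + subset

def bic_ols(y, X):
    """BIC of a Gaussian linear model fitted by OLS, up to an additive constant."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta_hat) ** 2))
    return n * np.log(rss / n) + p * np.log(n)

def best_subset(y, X, criterion=bic_ols):
    """Return the non-intercept columns of the sub-model minimizing the criterion."""
    best_score, best_vars = np.inf, ()
    for cols in submodels(X.shape[1] - 1):
        score = criterion(y, X[:, list(cols)])
        if score < best_score:
            best_score, best_vars = score, cols[1:]
    return best_vars, best_score

# Hypothetical data: intercept plus 8 candidates, of which columns 1 and 3 matter.
rng = np.random.default_rng(0)
n, q = 200, 8
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])
beta = np.zeros(q + 1)
beta[[0, 1, 3]] = [2.0, 1.0, -1.5]
y = X @ beta + rng.normal(scale=0.5, size=n)
chosen, score = best_subset(y, X)
```

Passing a different `criterion` function (for example, one implementing AIC) changes the selection rule without touching the enumeration.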
We carried out 5000 simulation runs. The covariates were generated first and kept fixed throughout all simulation runs. Then, we generated the response vectors 
, 
, from the sub-subarea model given by (
9) and (
10) for each simulation run, using the specified settings.
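The design of the study, with covariates generated once and responses regenerated in every run, can be sketched as a Monte Carlo loop that tallies how often a selector recovers the true model. To stay short, the example uses a plain linear regression with three candidate covariates and an OLS-based BIC selector, which is an assumption made for illustration and not the three-fold setup or the proposed methods. The names `select_by_bic` and `selection_rate` and all numeric values are hypothetical.

```python
import itertools
import numpy as np

def select_by_bic(y, X):
    """Return the non-intercept columns of the OLS sub-model with smallest BIC."""
    n, q = len(y), X.shape[1] - 1
    best = (np.inf, ())
    for size in range(q + 1):
        for subset in itertools.combinations(range(1, q + 1), size):
            Xs = X[:, [0, *subset]]                      # intercept always kept
            resid = y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]
            bic = n * np.log(resid @ resid / n) + Xs.shape[1] * np.log(n)
            best = min(best, (bic, subset))
    return best[1]

def selection_rate(X, beta, true_vars, noise_sd, n_runs, rng):
    """Fraction of runs in which exactly the true set of covariates is selected."""
    hits = 0
    for _ in range(n_runs):
        # Only the response is regenerated; X stays fixed across runs.
        y = X @ beta + rng.normal(scale=noise_sd, size=X.shape[0])
        hits += select_by_bic(y, X) == tuple(true_vars)
    return hits / n_runs

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # generated once
beta = np.array([2.0, 1.0, 0.0, 0.5])                        # columns 1 and 3 are active
rate = selection_rate(X, beta, true_vars=(1, 3), noise_sd=1.0, n_runs=200, rng=rng)
```

Holding `X` fixed means that differences in `rate` across settings reflect the error structure and the selector, not run-to-run variation in the covariates.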
We report the performance of the proposed method with parameter-free transformation (
) and parameter-dependent transformation (
). For 
, the true parameter values are used here, for simplicity. Under estimated parameter values, the performance of 
 is likely to be inferior. The parameter-free and parameter-dependent methods of Cai et al. [
7] for the two-fold subarea model are used for comparison. To fit a two-fold subarea model to data with a three-fold structure, the actual sub-subareas are treated as the subareas in the two-fold model, and either (i) the actual subareas or (ii) the actual areas are treated as the areas. Treatment (i) is a natural choice when there is substantial subarea-level variability. Under treatment (i), the parameter-free transformation under the two-fold model is algebraically identical to the parameter-free transformation under the three-fold model, so variable selection based on it leads to the same set of variables as under the three-fold model. However, treatment (i) leads to pure synthetic estimates for non-sampled areas (actual subareas), and it offers no computational advantage over the three-fold model because the same transformation is used. On the other hand, the parameter-dependent method applied to treatment (i) may lead to a different set of variables. Therefore, under treatment (i), we report simulation results only for the parameter-dependent method, which is denoted as 
. The two-fold parameter-free and parameter-dependent methods under treatment (ii) are denoted as 
 and 
, respectively. Under treatment (ii), pure synthetic estimation is avoided because all areas are sampled. For comparison, we further consider three naive methods: the Lahiri–Suntornchost [4] method (Naive-LS) and Han’s [10] cAIC method (Naive-cAIC), both designed for the one-fold FH model, and an information criterion-based method for the regular linear regression model fitted naively to the data (Naive-LM). For Naive-LS and Naive-cAIC, the actual sub-subareas are treated as the areas in the FH model. For Naive-LM, the sub-subarea level direct estimator 
 is treated as the response variable of the regular linear regression model.
Table 1 summarizes the simulation results for variable selection using BIC.
 The proposed  and  perform equally well with a stable rate between  and  in selecting the true model under all settings for . The two-fold method , which treats the actual subareas as areas in the two-fold model, exhibits similar performance to that of the proposed methods. All the other methods have inferior performance and display a dramatic decay in rate of selecting the true model when  and  increase. This indicates that in the presence of strong area-level effect or subarea-level effect, which often happens in practice, ,  and  are preferred over the other alternative methods.
The simulation results based on AIC and Naive-cAIC are given in Table 2.
Compared with BIC, AIC gives a substantially lower true-model selection rate under all the methods. As is the case for BIC, methods ,  and  perform equally well and yield stable results for different  values, and they have better performance than the other methods. Methods  and  have slightly better performance than ,  and  when , , and  but notably inferior performance under the other settings for . Methods Naive-LS, Naive-LM and Naive-cAIC have substantially lower rates of selecting the true model than the other methods.
Table 3 reports simulation results under Mallows’ $C_p$ criterion for variable selection. The results in Table 3 are similar to those reported in Table 2 under AIC, and the same conclusions hold.