Next Article in Journal / Special Issue
Analysis of Household Pulse Survey Public-Use Microdata via Unit-Level Models for Informative Sampling
Previous Article in Journal
A General Description of Growth Trends
Previous Article in Special Issue
Estimating the RMSE of Small Area Estimates without the Tears
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Selection of Auxiliary Variables for Three-Fold Linking Models in Small Area Estimation: A Simple and Effective Method

School of Mathematics and Statistics, Carleton University, Ottawa, ON K1S 5B6, Canada
*
Author to whom correspondence should be addressed.
Stats 2022, 5(1), 128-138; https://doi.org/10.3390/stats5010009
Submission received: 9 January 2022 / Revised: 31 January 2022 / Accepted: 3 February 2022 / Published: 5 February 2022
(This article belongs to the Special Issue Small Area Estimation: Theories, Methods and Applications)

Abstract

:
Model-based estimation of small area means can lead to reliable estimates when the area sample sizes are small. This is accomplished by borrowing strength across related areas using models linking area means to related covariates and random area effects. The effective selection of variables to be included in the linking model is important in small area estimation. The main purpose of this paper is to extend the earlier work on variable selection for area level and two-fold subarea level models to three-fold sub-subarea models linking sub-subarea means to related covariates and random effects at the area, sub-area, and sub-subarea levels. The proposed variable selection method transforms the sub-subarea means to reduce the linking model to a standard regression model and applies commonly used criteria for variable selection, such as AIC and BIC, to the reduced model. The resulting criteria depend on the unknown sub-subarea means, which are then estimated using the sample sub-subarea means. Then, the estimated selection criteria are used for variable selection. Simulation results on the performance of the proposed variable selection method relative to methods based on area level and two-fold subarea level models are also presented.

1. Introduction

Sample surveys are designed to provide reliable estimates of the overall means of a finite population and means for large domains or sub-populations (areas). For areas with small sample sizes (called small areas), direct area-specific estimators from the survey data are unreliable, and it is necessary to use model-based methods based on models linking area means to related covariates and random area effects. Resulting model-based estimators can lead to a significant increase in precision relative to direct estimators. Rao and Molina [1], in Chapter 6, provide a detailed account of model-based estimation under area level models. The effective selection of auxiliary variables to be included in the linking model is important for the success of model-based small area estimation (SAE).
A basic area level model, due to Fay and Herriot [2], is widely used for SAE in practice. Suppose that we have m areas with direct estimators y i of the area means θ i ( i = 1 , , m ) and associated candidate covariate vectors x i . The area level model consists of two components: a sampling model given by
y i = θ i + e i , i = 1 , , m
and a linking model given by
θ i = x i T β + u i , i = 1 , , m ,
where e i denotes sampling errors assumed to be independent N ( 0 , Ψ i ) with known sampling variance Ψ i , and u i denotes random area effects independent of e i that are assumed to be independent and identically distributed (iid) as N ( 0 , σ u 2 ) with unknown variance σ u 2 . In practice, the sampling variances are obtained by smoothing the estimators of sampling variances using the Generalized Variance Function (GVF) method [3] and treating the smoothed estimators as the sampling variances Ψ i . It is clear from (2) that it has a standard linear regression model form, and standard variable selection methods, such as Akaike Information Criterion (AIC) or Bayesian Information Criterions (BIC), can be applied to select variables, provided the area means θ i are known. Lahiri and Suntornchost [4] estimated the resulting selection criteria using the sampling model (1) and proposed to use them for variable selection (see Section 2.1 for details). We refer the reader to Rao and Molina [1], Chapter 6, for details of empirical best linear unbiased prediction (EBLUP) estimators of area means from the models (1) and (2) for specified covariate vectors x i . The EBLUP estimator of θ i is a weighted average of the direct estimator y i and a synthetic regression estimator x i T β ^ , where β ^ denotes an estimator of the regression parameter vector β . For a non-sampled area, direct estimator is not available. Hence, the synthetic estimator x i T β ^ is used to estimate small area mean, provided the associated x i is known. Fay and Herriot [2] obtained EBLUP estimates of per-capita income for small places in the USA, using the basic area level model given by (1) and (2).
Estimation of means for subareas nested within areas is of considerable interest. Mohadjer et al. [5] studied adult literacy for counties (subareas) sampled from states (areas), using data from the 2003 U.S. National Assessment of Adult Literacy. A two-fold subarea model is used to estimate subarea means θ i j from n i subareas j sampled from m areas i. A two-fold linking model on the subarea means θ i j is given by
θ i j = x i j T β + v i + u i j , j = 1 , , n i ; i = 1 , , m ,
where x i j is the vector of covariates associated with θ i j , and v i is random area effect independent of random subarea effect u i j . Furthermore, v i i i d N ( 0 , σ v 2 ) and u i j i i d N ( 0 , σ u 2 ) . The linking model (3) is combined with the sampling model for the direct estimators y i j , and it is given by
y i j = θ i j + e i j , j = 1 , , n i ; i = 1 , , m ,
where e i j are sampling errors independently distributed as N ( 0 , Ψ i j ) with known sampling variances Ψ i j , and assumed to be independent of v i and u i j . Torabi and Rao [6] obtained EBLUP estimators of subarea means for sampled subareas as well as non-sampled subareas. An advantage of the two-fold model is that the EBLUP estimator of a non-sampled subarea involves both the synthetic estimator of θ i j and the direct estimators for the sampled subareas within the same area. For a non-sampled subarea within a non-sampled area, a synthetic estimator is used under the two-fold model. For variable selection under the two-fold model, Cai et al. [7] transformed the linking model to a standard regression model and applied variable selection criteria to the reduced model; see Section 2.2 for details.
Three-fold linking models involving sub-subareas (level 3) nested within subareas (level 2) which in turn are nested within areas (level 1) are also of practical interest. For example, such models were used in the Program for the International Assessment of Adult Competencies (PIAAC) in the context of estimating means for sub-subareas (counties) nested within subareas (states), which in turn are nested within areas (census divisions). Details of this application are reported in Krenzke et al. [8] and Ren et al. [9]. A three-fold linking model on the sub-subarea means θ i j k is given by
θ i j k = x i j k T β + w i + v i j + u i j k , k = 1 , , n i j ; j = 1 , , m i ; i = 1 , , L ,
where k denotes sub-subarea nested within subarea j nested within area i, x i j k is the vector of covariates associated with θ i j k , w i is the random area effect, v i j is the random subarea effect, and u i j k is the random sub-sub area effect. We assume that all the L areas in the population are included in the sample, but not all the subareas within an area are covered by the sample. Furthermore, not all the sub-subareas within a subarea covered by the sample are included in the sample. We assume that the three random effects in the model (5) are independent, w i i i d N ( 0 , σ w 2 ) , v i j i i d N ( 0 , σ v 2 ) and u i j k i i d N ( 0 , σ u 2 ) . The linking model (5) is combined with the sampling model for the direct estimators y i j k of the means θ i j k for the sub-subareas in the sample. It is given by
y i j k = θ i j k + e i j k , k = 1 , , n i j ; j = 1 , , m i ; i = 1 , , L ,
where the e i j k are sampling errors assumed to be independently distributed as N ( 0 , Ψ i j k ) with known sampling variances Ψ i j k , and they are assumed to be independent of the random effects w i , v i j , and u i j k . In practice, the sampling variances are ascertained through smoothing of the estimated sampling variances, as done in the PIAAC project.
The survey design may not have the same hierarchical structure as the linking model (5). For example, in the PIAAC project, data from a stratified multistage sample with a different hierarchical structure are used. Given the vector of covariates x i j k after variable selection, EBLUP estimators of the sub-subarea means can be obtained. It should be noted that the EBLUP estimators for non-sampled sub-subareas within a sampled subarea as well as those within non-sampled subareas avoid pure synthetic estimation by virtue of the area effects w i included in the linking model (5), noting that all the areas in the population are included in the sample. In the PIAAC study, a hierarchical Bayes (HB) approach was used to estimate the population sub-subarea means. We will report EBLUP estimation for the three-fold model, which is given by (5) and (6), in a separate paper.
The main purpose of this paper is to extend the transformation method of Cai et al. [7] for variable selection to three-fold models given by (5) and (6). We propose two transformation-based methods—one is parameter free and the other is parameter-dependent—for variable selection. Section 2 is a review of some relevant variable selection methods for the area level model and the two-fold subarea model. Variable selection methods for the three-fold model are presented in Section 3. Results of a simulation study on the performance of the proposed methods relative to some naive alternatives, based on one-fold and two-fold models, are presented in Section 4. Some concluding remarks are presented in Section 5.

2. Area Level and Subarea Level Linking Models: Methods for Variable Selection

We now provide a brief review of earlier work on variable selection for area level and subarea level linking models related to the method for sub-subarea linking models presented in Section 3.

2.1. Area Level Model

The area level linking model (2) has the standard linear regression model form with unknown θ i as the dependent variable. Lahiri and Suntornchost [4] noted that standard variable selection criteria applied to (2), such as AIC, BIC, and Mallow’s C p , are continuous functions of the unknown error mean sum of squares MSE θ = ( m p ) 1 θ T ( I m P X ) θ , where θ = ( θ 1 θ m ) T , P X = X ( X T X ) 1 X T is the projection matrix with X = ( x 1 x m ) T , I m is the identity matrix of order m, and p is the dimension of β . Then, the unknown MSE θ is replaced by a consistent estimator obtained as
MSE ^ θ = MSE y Ψ ¯ w ,
where MSE y is obtained by substituting y = ( y 1 y m ) T for θ in the above expression for MSE θ and Ψ ¯ w = ( m p ) 1 i = 1 m ( 1 h i i ) Ψ i is a weighted mean of the sampling variance Ψ i with h i i = x i T ( X T X ) 1 x i . The estimator (7) can take negative values, and modifications to (7) leading to positive values were proposed by Lahiri and Suntornchost [4].
As noted earlier, standard variable selection criteria applied to the linking model (2) are simple functions of MSE θ and can be estimated by simply substituting MSE y for MSE θ . For example, BIC applied to linking model (2) can be estimated as
BIC ^ = m log m p MSE ^ θ / m + p log m .
Some other variable selection criteria applicable to the area level model include the conditional AIC (cAIC) proposed by Han [10] and mixed generalized AIC (xGAIC) proposed by Lombardía et al. [11].

2.2. Subarea Level Model

Cai et al. [7] extended the method of Lahiri and Suntornchost [4] to the subarea model given by (3) and (4). In this case, the linking model (3) does not have a standard linear regression model form, because the error terms v i + u i j within areas are correlated. As a result, we need to first transform the linking model (3) to a standard linear regression model form with iid errors. Specifically, we rewrite the subarea model in matrix form as y i = θ i + e i and θ i = X i β + τ i , i = 1 , , m , where y i = ( y i 1 y i n i ) T , X i = ( x i 1 x i n i ) T , θ i = ( θ i 1 θ i n i ) T , e i = ( e i 1 e i n i ) T , and τ i = v i 𝟙 n i + u i with 𝟙 n i being a vector of 1s of length n i and u i = ( u i 1 u i n i ) T . Cai et al. [7] proposed to find a matrix A i for each i = 1 , , m such that the covariance matrix of τ i * : = A i τ i is a diagonal matrix with equal diagonal elements across i = 1 , , m . Then, a transformed two-fold model is obtained:
y i * = θ i * + e i * and θ i * = X i * β + τ i * , i = 1 , , m ,
where y i * = A i y i , e i * = A i e i , θ i * = A i θ i , and X i * = A i X i . Noting that (8) has the standard regression model form on the transformed variables y i * and θ i * , we can apply the method used for the area level model to obtain variable selection criteria.
Cai et al. [7] gave two methods for finding the transformation matrix A i , one being parameter-free and the other relying on estimated model-parameter values. The parameter-free transformation follows the parameter-free transformation method proposed by Li and Lahiri [12] for selecting auxiliary variables under the unit-level nested error regression (NER) model. Observing that the covariance matrix of τ i * is given by
Cov ( τ i * ) = A i Σ i A i T = σ v 2 ( A i 𝟙 n i ) ( A i 𝟙 n i ) T + σ u 2 A i A i T ,
one can choose an matrix A i such that (a) A i 𝟙 n i = 0 , and (b) A i A i T is a diagonal matrix whose diagonal elements are equal for all i = 1 , , m . Cai et al. [7] proposed a numerical procedure to find the A i matrix satisfying the above conditions. As a result of the linear constraint (a), the rank of A i is n i 1 at most, and as a result, the transformed two-fold model loses one data point for each sampled area. The parameter-dependent method used by Cai et al. [7] is the well-known Fuller–Battese transformation [13]. In practice, the parameter-free transformation is more likely to be used because of its simplicity and not requiring the estimates of variance parameters.

3. Sub-Subarea Linking Models: Variable Selection

In this section, we present the proposed method for variable selection under the sub-subarea linking model. We extend the method of Cai et al. [7] for the two-fold model to the three-fold case.
We first express the sub-subarea linking and sampling models given by (5) and (6) as
θ i = X i β + η i , i = 1 , , L ,
and
y i = θ i + e i , i = 1 , , L ,
respectively, where y i = ( y i j 1 y i j 2 y i m i n i j ) T , X i = ( x i j 1 x i j 2 x i m i n i j ) T , θ i = ( θ i j 1 θ i j 2 θ i m i n i j ) T , e i = ( e i j 1 e i j 2 e i m i n i j ) T and
η i = w i 𝟙 n i + Ω i v i + u i
with n i = j = 1 m i n i j , Ω i = diag ( 𝟙 n i 1 , 𝟙 n i 2 , , 𝟙 n i m i ) , v i = ( v i 1 v i m i ) T and u i = ( u i 11 u i 12 u i m i n i m i ) T . Note that η i N ( 0 , Σ i ) , where
Σ i = Cov ( η i ) = σ w 2 𝟙 n i 𝟙 n i T + σ v 2 Ω i Ω i T + σ u 2 I n i .
As in the case of subarea linking model (3), the covariance matrix Σ i of η i in the linking model (9) does not have a diagonal structure. Following the idea of Cai et al. [7], we first transform the three-fold linking model (9) using a linear transformation so that the covariance matrix of the transformed η i has a diagonal structure. For each area i, we obtain a matrix T i that makes the transformed vector η i * : = T i η i have a diagonal covariance matrix with diagonal elements being a positive constant c for all i = 1 , , L . Using the T i , we transform the three-fold sampling model (10) and linking model (9) into
y i * = θ i * + e i * ,
θ i * = X i * β + η i * ,
where y i * = T i y i , θ i * = T i θ i , e i * = T i e i and X i * = T i X i . The transformed linking model (13) is a standard linear regression model with unknown dependent variable θ i * , and it shares the same β parameter with the original linking model (9). Then, we can use a bias-correction method similar to that of Lahiri and Suntornchost [4] to estimate an information criterion for (13) so as to select auxiliary variables. The details of the proposed transformation and bias-correction methods are given in the following subsections.

3.1. Transformation Methods

3.1.1. Parameter-Free Transformation

It is desirable that the transformation matrices T i , i = 1 , , L do not rely on unknown parameter values. To find parameter-free T i , we follow the idea used by Cai et al. [7] for the two-fold subarea model. By (11),
Cov ( η i * ) = T i Cov ( η i ) T i T = σ w 2 ( T i 𝟙 n i ) ( T i 𝟙 n i ) T + σ v 2 ( T i Ω i ) ( T i Ω i ) T + σ u 2 T i T i T .
If T i 𝟙 n i = 0 , T i Ω i = 0 and T i T i T is a diagonal matrix with equal diagonal elements, then Cov ( η i * ) will have the desired diagonal structure. Furthermore, T i Ω i = 0 implies T i 𝟙 n i = 0 because Ω i 𝟙 m i = 𝟙 n i . Therefore, it suffices to find T i such that (i) T i Ω i = 0 , and (ii) T i T i T is a diagonal matrix with equal diagonal elements across i = 1 , , L . Since the above two conditions do not involve any parameter, a matrix T i that satisfies them will be parameter free.
Recall that Ω i = diag ( 𝟙 n i 1 , 𝟙 n i 2 , , 𝟙 n i m i ) , which is a matrix with full column rank m i . Therefore, imposing T i Ω i = 0 on T i introduces m i independent linear constraints on T i , with one constraint for each sub-area j, j = 1 , , m i . To be specific, the constraint for subarea j is T i ω ˜ j = 0 , where ω ˜ j is the jth, j = 1 , , m i , column of Ω i . Consequently, the rank of T i is at most n i m i , and hence, each area i will lose m i data points (or equivalently, each subarea will lose one data point) in the transformation. This is different from the parameter-free transformation for the two-fold subarea model discussed in Section 2.2, where each area loses a single data point in the transformation.
In the following, we provide a numerical algorithm to find T i for area i, i = 1 , , L , that satisfies the above requirements (i) and (ii).
Step 1:
For each subarea j, j = 1 , , m i , of area i, fix a set of n i j 1 linearly independent vectors of length n i j , denoted b i 1 , b i 2 , , b i ( n i j 1 ) , that satisfies b i k T 𝟙 n i j = 0 for k = 1 , , n i j 1 . For a given k, a valid choice for b i k is the vector of length n i j whose kth element is 1, last element is 1 , and all other elements are 0. For example, if n i j = 5 , then we can take b i k , k = 1 , 2 , 3 , 4 , as b i 1 = ( 1 0 0 0 1 ) T , b i 2 = ( 0 1 0 0 1 ) T , b i 3 = ( 0 0 1 0 1 ) T and b i 4 = ( 0 0 0 1 1 ) T . Another possibility is to take b i k to be the vector whose kth element is 1 and all the other elements equal 1 n i j 1 if n i j > 1 .
Step 2:
Apply the Gram–Schmidt process to b i 1 , , b i ( n i j 1 ) to acquire a set of orthogonal vectors a i 1 , a i 2 , , a i ( n i j 1 ) with a i 1 = b i 1 and a i k = b i k l = 1 k 1 Proj a i l ( b i k ) for k = 2 , , n i j 1 , where Proj y ( x ) : = x T y y T y y is the projection of vector x onto the line spanned by vector y. Construct a ( n i j 1 ) by n i j matrix, denoted T i j , from a i 1 , , a i ( n i j 1 ) as T i j = a i 1 a i 1 a i ( n i j 1 ) a i ( n i j 1 ) T , where · is the Euclidean norm.
Step 3:
Repeat Step 1 and Step 2 for all subareas j = 1 , , m i of area i to obtain matrices T i 1 , , T i m i . Take T i = diag ( T i 1 , , T i m i ) .
The T i constructed using the above steps is parameter free. Step 1 generates a set of linear independent vectors b i k , k = 1 , , m i , satisfying b i k T 𝟙 n i j = 0 . The Gram–Schmidt process in Step 2 produces a set of orthonormal vectors a i k , k = 1 , , m i , based on b i k , while carrying over the property a i k T 𝟙 n i j = 0 . Thus, the matrix T i j constructed in Step 2 satisfies T i j 𝟙 n i j = 0 and T i j T i j T = I n i j 1 , which in turn guarantee that the matrix T i defined in Step 3 satisfies the requirements (i) and (ii) with T i T i T = I n i m i . Under this transformation, we get Cov ( η i * ) = σ u 2 I n i m i .
A parameter-free transformation matrix T i that satisfies the constraints (i) and (ii) is not unique. However, we found that different choices of T i yield similar results.

3.1.2. Parameter-Dependent Transformation

It is straightforward to obtain a transformation matrix T i that depends on the model parameter values. Since Cov ( τ i * ) = T i Σ i T i T , where Σ i is given by (11), we can take T i = c Σ i 1 / 2 , where Σ i 1 / 2 is the positive definite square-root matrix of Σ i 1 and c is a non-zero constant. Then, Cov ( τ i * ) is a diagonal matrix whose diagonal elements are all equal to c. Note that Σ i is determined by the variance parameters σ w 2 , σ v 2 and σ u 2 , so this transformation matrix is parameter-dependent. Under the two-fold subarea model, applying this idea and choosing c = σ u yields the Fuller–Battese Transformation [7].
Under the three-fold sub-subarea model, we found that
Σ i 1 = σ u 2 I n i Λ i ξ i 𝟙 n i 𝟙 n i T + ξ i 𝟙 n i 𝟙 n i T Λ i + ξ i Λ i 𝟙 n i 𝟙 n i T ξ i Λ i 𝟙 n i 𝟙 n i T Λ i ,
where Λ i = diag ρ i 1 𝟙 n i 1 𝟙 n i 1 T , , ρ i m i 𝟙 n i m i 𝟙 n i m i T , ρ i j = σ v 2 / ( σ u 2 + n i j σ v 2 ) and ξ i = σ w 2 / σ u 2 + σ w 2 n i k = 1 m i ρ i k n i k 2 . The square-root matrix Σ i 1 / 2 has a complicated expression but can be found easily with a numerical procedure, for example, by applying the spectral decomposition or polar decomposition on Σ i 1 . Taking T i = σ u Σ i 1 / 2 , we get Cov ( τ i * ) = σ u 2 I n i .
In practice, we need to estimate the variance parameters σ w 2 , σ v 2 , and σ u 2 to construct the transformation matrices T i , as in the case of the subarea model given by (8).

3.2. Estimating Variable Selection Criteria: Sub-Subarea Model

After transformation, the linking model (13) takes the matrix form of a regular regression model with unobserved response variable values θ i * . We now use a method similar to that of Cai et al. [7] to estimate information criteria, including AIC, BIC, and Mallows’ C p , for the transformed linking model (13). Then, these information criteria can be used for selecting auxiliary variables under the three-fold sub-subarea model.
The error mean sum of squares of the transformed linking model (13) is given by
MSE θ * = 1 n * p θ * T ( I n * P X * ) θ * ,
where θ * = θ 1 * T θ L * T T , P X * = X * X * T X * 1 X * T with X * = X 1 * T X L * T T , n * is the dimension of θ * , and p is the dimension of β . Since θ * are unobserved, MSE θ * cannot be calculated. Instead, we estimate MSE θ * based on the transformed sampling model (12). Let y * = y 1 * T y L * T T and e * = e 1 * T e L * T T . Put
MSE y * = 1 n * p y * T ( I n * P X * ) y * .
We propose to estimate MSE θ * by
MSE ^ θ * = MSE y * 1 n * p tr ( I n * P X * ) V e * ,
where V e * is the covariance matrix of e * , given by V e * = Cov ( e * ) = T V e T T with T = diag ( T 1 , , T L ) and V e = diag ( Ψ 111 , Ψ 112 , , Ψ L m L n L m L ) . It can be shown, using the same argument as used in the proof of Theorem 1 of Cai et al. [7], that if the sampling variances Ψ i j k are bounded for all i, j and k, and n i j 2 for all i and j, then
MSE ^ θ * = MSE θ * + o p ( 1 )
as the number of areas L .
The term ( n * p ) 1 tr ( I n * P X * ) V e * in (14) can be considered as a bias-correction term, and because of its presence, MSE ^ θ * may take a negative value. A simple truncation or a continuous transformation of MSE ^ θ * as suggested by Lahiri and Suntornchost [4] may be used to obtain a strictly positive estimate of MSE θ * .
Given the above estimator of MSE θ * , estimators of the AIC, BIC and Mallows’ C p for the transformed linking model (13) are readily constructed. The AIC, BIC and Mallows’ C p of a submodel of (13) with p s covariates are given by
AIC ( s ) = n * log n * p s MSE θ * ( s ) / n * + 2 p s ,
BIC ( s ) = n * log n * p s MSE θ * ( s ) / n * + p s log ( n * ) ,
C p ( s ) = n * p s MSE θ * ( s ) / MSE θ * + 2 p s n * ,
respectively, where MSE θ * ( s ) is the MSE from the submodel. Their estimators, denoted AIC ^ ( s ) , BIC ^ ( s ) and C p ^ ( s ) , respectively, are obtained by substituting MSE ^ θ * into the corresponding expressions in (16a) to (16c). Then, variable selection is carried out by choosing one of the above criteria and estimating its values for a set of specified sub-models. The sub-model with the smallest estimated criterion value is chosen as the selected linking model. Noting that the criteria (16a)–(16c) are continuous functions of MSE θ * and the error of the estimator MSE ^ θ * is of o p ( 1 ) , it follows from the continuous mapping theorem [14] (Theorem 2.3) that the error in the estimated variable selection criteria is also o p ( 1 ) and hence negligible when the number of areas L is large.

4. Results of a Simulation Study

This section provides results of a limited simulation study on the performance of the proposed method for variable selection for sub-subarea linking models. The simulation data are generated from the three-fold sub-subarea model given by (5) and (6). The number of areas is set to L = 10 and the number of subareas sampled from each area i, i = 1 , , L , is set to m i = 5 . The number of sampled sub-subareas is taken as n i j = 8 for every subarea j in areas i = 1 , , 5 , n i j = 5 for every subarea in areas i = 6 , 7 , 8 and n i j = 10 for each subarea in areas i = 9 , 10 . The sampling standard deviation Ψ i j k in the sampling model (6) is generated from Unif ( 0.5 , 1.5 ) . The standard deviation of the sub-subarea random effect in the linking model (5) is set to σ u = 2 . A few settings for the standard deviations of the area-level and subarea-level random effects, ( σ w , σ v ) , are used: ( 2 , 2 ) , ( 4 , 3 ) , ( 6 , 3 ) , ( 8 , 4 ) , ( 6 , 6 ) , ( 3 , 6 ) and ( 4 , 8 ) . We consider a linking model that has an intercept term with corresponding covariate x i j k , 1 = 1 and eight other covariates x i j k , l ( l = 2 , , 9 ) generated as follows: log x i j k , 2 N ( 0.3 , 0.5 ) with mean 0.3 and variance 0.5 , x i j k , 3 Gamma ( 1.5 , 2 ) with shape parameter 1.5 and rate parameter 2, x i j k , 4 N ( 0 , 0.8 ) , x i j k , 5 N ( 1 , 1.5 ) , x i j k , 6 Gamma ( 0.6 , 10 ) , x i j k , 7 Beta ( 0.5 , 0.5 ) with shape parameters 0.5 and 0.5 , x i j k , 8 Unif ( 1 , 3 ) on the interval ( 0 , 3 ) , and x i j , 9 Poisson ( 1.5 ) with mean parameter 1.5 . The value of the regression parameter vector β is set to ( 2 , 3 , 0 , 4 , 0 , 8 , 0 , 1 , 0 ) T . It corresponds to a true model consisting of the intercept term of value 2 and covariates x i j , 2 , x i j , 4 , x i j , 6 and x i j , 8 . For variable selection, we always include the intercept term when we compare all possible sub-models defined by the inclusion/exclusion of the eight variables x i j , 2 , , x i j , 9 .
We generated 5000 simulation runs, and the covariates are generated first and kept fixed throughout all simulation runs. Then, we generated the response vectors y i , j = 1 , , L , from the sub-subarea model given by (9) and (10) for each simulation run, using the specified settings.
We report the performance of the proposed method with parameter-free transformation ( 3 F pfree ) and parameter-dependent transformation ( 3 F pdep ). For 3F pdep , the true parameter values are used here, for simplicity. Under estimated parameter values, the performance of 3F pdep is likely to be inferior. The parameter-free and parameter-dependent methods of Cai et al. [7] for the two-fold subarea model are used for comparison. To fit a two-fold subarea model to the data with a three-fold structure, the actual sub-subareas are treated as the subareas in the two-fold model. We can treat either (i) the actual subareas or (ii) the actual areas as the areas in the two-fold model. Treatment (i) is a natural choice when there is substantial subarea-level variability. Under treatment (i), where the actual subareas are treated as areas, the parameter-free transformation under the two-fold model is algebraically identical to the parameter-free transformation under the three-fold model. As a result, variable selection based on the parameter-free transformation under treatment (i) leads to the same set of variables as that under the three-fold model. However, it leads to pure synthetic estimates for non-sampled areas (actual subareas). Moreover, computationally, there is no advantage of treatment (i) over the three-fold model because the same transformation is used. On the other hand, the parameter-dependent method applied to treatment (i) may lead to a different set of variables. Therefore, we report the simulation results only for the parameter-dependent method under treatment (i), which is denoted as 2F-S-SS pdep . The two-fold parameter-free and parameter-dependent methods under treatment (ii) are denoted as 2F-A-SS pfree and 2F-A-SS pdep , respectively. Under treatment (ii), pure synthetic estimation is avoided because all areas are sampled. For comparison, we further consider three naive methods designed for the one-fold FH model and the regular linear regression model, including the Lahiri–Suntornchost [4] method (Naive-LS) and Han’s [10] cAIC method (Naive-cAIC) for the FH model, as well as an information criterion-based method for the regular linear regression model fitted naively to the data (Naive-LM). For Naive-LS and Naive-cAIC, the actual sub-subareas are treated as the areas in the FH model. For Naive-LM, the sub-subarea level direct estimator y i j k is treated as the response variable of the regular linear regression model.
Table 1 summarizes the simulation results for variable selection using BIC.
The proposed 3F pfree and 3F pdep perform equally well with a stable rate between 87 % and 89 % in selecting the true model under all settings for ( σ w , σ v ) . The two-fold method 2F-S-SS pdep , which treats the actual subareas as areas in the two-fold model, exhibits similar performance to that of the proposed methods. All the other methods have inferior performance and display a dramatic decay in rate of selecting the true model when σ w and σ u increase. This indicates that in the presence of strong area-level effect or subarea-level effect, which often happens in practice, 3F pfree , 3F pdep and 2F-S-SS pdep are preferred over the other alternative methods.
The simulation results based on AIC and Naive cAIC are given in Table 2.
Compared with BIC, AIC gives a significantly lower true-model selection rate under all the methods. As the case for BIC, methods 3F pfree , 3F pdep and 2F-S-SS pdep perform equally well and yield stable results for different ( σ w , σ v ) values, and they have better performance than the other methods. Methods 2F-A-SS pfree and 2F-A-SS pdep have slightly better performance than 3F pfree , 3F pdep and 2F-S-SS pdep when ( σ w , σ v ) = ( 2 , 2 ) , ( 4 , 3 ) , and ( 6 , 3 ) but notably inferior performance under the other settings for ( σ w , σ v ) . Methods Naive-LS, Naive-LM and Naive-cAIC have significantly lower rates of selecting the true model than the other methods.
Table 3 reports simulation results under Mallows’ C p criterion for variable selection. The results in Table 3 are similar to those reported in Table 2 under AIC, and the same conclusions hold.

5. Concluding Remarks

A transformation-based method is proposed for selecting covariates under the three-fold sub-subarea model for small area estimation. Two transformations, one being parameter-free and the other being parameter-dependent, are proposed to accompany the variable selection method. Compared to the parameter-free transformation, the parameter-dependent transformation does not induce loss of data points but requires estimated variance parameters in practice. We prefer the parameter-free transformation for its simplicity and not requiring the estimates of variance parameters. The performance of parameter-free and parameter-dependent transformation methods is similar under various simulation settings for variances of the area-level and subarea-level random effects. EBLUP estimation of sub-subarea means for sampled sub-subareas, non-sampled sub-subareas within sampled subareas, and sub-subareas within non-sampled subareas will be studied in detail in a separate paper. Measures of uncertainty of the EBLUP estimators will also be studied.

Author Contributions

Formal analysis, S.C.; Methodology, S.C., J.N.K.R.; Writing-original draft, S.C.; Writing-review & editing, S.C., J.N.K.R.; Conceptualization, J.N.K.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by research grants to Song Cai and J.N.K. Rao from the Natural Sciences and Engineering Research Council of Canada.

Acknowledgments

We thank the reviewers for their useful comments and constructive suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rao, J.N.K.; Molina, I. Small Area Estimation, 2nd ed.; Wiley: Hoboken, NJ, USA, 2015. [Google Scholar]
  2. Fay, R.E.; Herriot, R.A. Estimates of income for small places: An application of james-stein procedures to census data. J. Am. Stat. Assoc. 1979, 74, 269–277. [Google Scholar] [CrossRef]
  3. Wolter, K.M. Introduction to Variance Estimation, 2nd ed.; Springer: New York, NY, USA, 2007. [Google Scholar]
  4. Lahiri, P.; Suntornchost, J. Variable selection for linear mixed models with applications in small area estimation. Sankhyā B 2015, 77, 312–320. [Google Scholar] [CrossRef]
  5. Mohadjer, L.; Rao, J.N.K.; Liu, B.; Krenzke, T.; Van de Kerckhove, W. Hierarchical Bayes small area estimates of adult literacy using unmatched sampling and linking models. J. Indian Soc. Agric. 2012, 66, 55–63. [Google Scholar]
  6. Torabi, M.; Rao, J.N.K. On small area estimation under a sub-area level model. J. Multivar. Anal. 2014, 127, 36–55. [Google Scholar] [CrossRef]
  7. Cai, S.; Rao, J.N.K.; Dumitrescu, L.; Chatrchi, G. Effective transformation-based variable selection under two-fold subarea models in small area estimation. Stat. Transit. New Ser. 2020, 21, 68–83. [Google Scholar] [CrossRef]
  8. Krenzke, T.; Mohadjer, L.; Li, J.; Erciulescu, A.; Fay, R.E.; Ren, W.; VanDeKerckhove, W.; Li, L.; Rao, J.N.K. Program for the International Assessment of Adult Competencies (PIAAC): State and County Estimation Methodology Report; Technical Report; Institute of Education Sciences, National Center for Education Statistics: Washington, DC, USA, 2020. [Google Scholar]
  9. Ren, W.; Li, J.; Erciulescu, A.; Krenzke, T.; Mohadjer, L. A variable selection method for small area estimation modeling of the proficiency of adult competency. In Proceedings of the Survey Research Methods Section, Joint Statistical Meetings of the American Statistical Association, Alexandria, VA, USA, 2–6 August 2020; pp. 924–956. [Google Scholar]
  10. Han, B. Conditional Akaike information criterion in the Fay-Herriot model. Stat. Methodol. 2013, 11, 53–67. [Google Scholar] [CrossRef]
  11. Lombardía, M.J.; López-Vizcaíno, E.; Rueda, C. Mixed generalized akaike information criterion for small area models. J. R. Stat. Soc. Ser. A 2017, 180, 1229–1252. [Google Scholar] [CrossRef]
  12. Li, Y.; Lahiri, P. A simple adaptation of variable selection software for regression models to select variables in nested error regression models. Sankhyā B 2019, 81, 302–317. [Google Scholar] [CrossRef]
  13. Fuller, W.A.; Battese, G.E. Transformations for estimation of linear models with nested-error structure. J. Am. Stat. Assoc. 1973, 68, 626–632. [Google Scholar] [CrossRef]
  14. van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, MA, USA, 1998. [Google Scholar]
Table 1. True model selection rate (%): BIC.
Table 1. True model selection rate (%): BIC.
Method ( σ w , σ v )
( 2 , 2 ) ( 4 , 3 ) ( 6 , 3 ) ( 8 , 4 ) ( 6 , 6 ) ( 3 , 6 ) ( 4 , 8 )
3F pfree 87.12 87.62 87.50 88.18 87.26 87.32 87.02
3F pdep 87.94 88.20 88.46 88.52 88.00 87.96 87.60
2F-S-SS pdep 87.64 87.82 88.16 88.48 87.90 87.86 87.56
2F-A-SS pfree 83.28 63.14 62.62 36.70 8.38 9.22 2.24
2F-A-SS pdep 82.60 60.84 60.66 34.58 7.24 8.48 1.80
Naive-LS 63.62 19.68 8.80 2.56 1.94 4.96 0.78
Naive-LM 60.94 18.32 8.26 2.44 1.84 4.70 0.76
Table 2. True model selection rate (%): AIC and Naive-cAIC.
Table 2. True model selection rate (%): AIC and Naive-cAIC.
Method ( σ w , σ v )
( 2 , 2 ) ( 4 , 3 ) ( 6 , 3 ) ( 8 , 4 ) ( 6 , 6 ) ( 3 , 6 ) ( 4 , 8 )
3F pfree 43.88 42.84 43.76 43.92 43.38 44.02 43.42
3F pdep 44.22 43.18 43.66 44.08 43.36 44.04 43.48
2F-S-SS pdep 44.32 43.14 43.66 43.68 43.22 44.02 43.32
2F-A-SS pfree 47.00 46.60 47.10 42.86 26.46 27.16 14.60
2F-A-SS pdep 47.78 48.48 48.36 43.96 26.36 26.52 14.54
Naive-LS 45.00 31.74 22.48 14.62 14.02 19.84 10.52
Naive-LM 47.02 32.18 22.54 14.56 13.94 19.76 10.40
Naive-cAIC 42.86 26.64 17.28 12.12 11.16 15.62 8.84
Table 3. True model selection rate (%): Mallows’ C p .
Table 3. True model selection rate (%): Mallows’ C p .
Method ( σ w , σ v )
( 2 , 2 ) ( 4 , 3 ) ( 6 , 3 ) ( 8 , 4 ) ( 6 , 6 ) ( 3 , 6 ) ( 4 , 8 )
3F pfree 44.78 43.66 44.84 44.60 44.00 44.84 44.04
3F pdep 45.02 43.90 44.40 44.90 44.16 44.80 44.30
2F-S-SS pdep 44.96 43.74 44.32 44.50 43.76 44.74 44.10
2F-A-SS pfree 47.60 47.28 47.84 43.32 26.34 27.08 14.54
2F-A-SS pdep 48.82 49.08 49.38 44.52 26.54 26.64 14.22
Naive-LS 45.60 32.16 22.60 14.52 13.98 19.80 10.32
Naive-LM 47.66 32.52 22.64 14.54 14.04 19.80 10.24
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Cai, S.; Rao, J.N.K. Selection of Auxiliary Variables for Three-Fold Linking Models in Small Area Estimation: A Simple and Effective Method. Stats 2022, 5, 128-138. https://doi.org/10.3390/stats5010009

AMA Style

Cai S, Rao JNK. Selection of Auxiliary Variables for Three-Fold Linking Models in Small Area Estimation: A Simple and Effective Method. Stats. 2022; 5(1):128-138. https://doi.org/10.3390/stats5010009

Chicago/Turabian Style

Cai, Song, and J.N.K. Rao. 2022. "Selection of Auxiliary Variables for Three-Fold Linking Models in Small Area Estimation: A Simple and Effective Method" Stats 5, no. 1: 128-138. https://doi.org/10.3390/stats5010009

APA Style

Cai, S., & Rao, J. N. K. (2022). Selection of Auxiliary Variables for Three-Fold Linking Models in Small Area Estimation: A Simple and Effective Method. Stats, 5(1), 128-138. https://doi.org/10.3390/stats5010009

Article Metrics

Back to TopTop