Abstract
Many population-based surveys have binary responses from a large number of individuals in each household within small areas. One example is the Nepal Living Standards Survey (NLSS II), in which health status binary data (good versus poor) for each individual from sampled households (sub-areas) are available in the sampled wards (small areas). To make an inference for the finite population proportion of individuals in each household, we use the sub-area logistic regression model with reliable auxiliary information. The contribution of this model is twofold. First, we extend an area-level model to a sub-area level model. Second, because there are numerous sub-areas, standard Markov chain Monte Carlo (MCMC) methods to find the joint posterior density are very time-consuming. Therefore, we provide a sampling-based method, the integrated nested normal approximation (INNA), which permits fast computation. Our main goal is to describe this hierarchical Bayesian logistic regression model and to show that the computation is much faster than the exact MCMC method and also reasonably accurate. The performance of our method is studied by using NLSS II data. Our model can borrow strength from both areas and sub-areas to obtain more efficient and precise estimates. The hierarchical structure of our model captures the variation in the binary data reasonably well.
1. Introduction
The Nepal Living Standards Survey (NLSS) II (see [1]) uses a two-stage stratified sampling design. A random sample of wards (areas) was selected from six strata, and 12 households (sub-areas) were selected from each sampled ward. All individuals in each sampled household were interviewed. One variable of interest is health status, a binary variable. To obtain smoothed estimates of the finite population proportion of individuals in good health in each household, we focus on hierarchical Bayesian (HB) models with sub-area random effects, which provide reliable “indirect” estimates for numerous small areas or sub-areas. Most sample surveys are designed to provide reliable “direct” estimates of quantities of interest for large areas or domains (e.g., state level, national level). However, direct estimates are not reliable for areas or domains for which only small samples or no samples are available; see [2].
In many applications, some areas, e.g., states and wards, are sampled; in each sampled area, a sample of sub-areas, e.g., counties and households, is further selected. Ref. [3] proposed a one-fold hierarchical Bayesian logistic regression model and applied the model to NLSS II data. The main objective there was to make an inference for the finite population proportion of individuals with a specific characteristic in each area. However, the one-fold model ignores the sub-area level structure in the data. As an extension of [3], in this paper we are particularly interested in small area models that can capture the hierarchical structure of the NLSS II data. Although one-fold basic models are very popular and commonly used to produce reliable estimates, they may not respect the hierarchical structure of the data, and consistency between the estimates at different levels may not hold. In particular, many population-based surveys, like NLSS II, use two-stage stratified sampling designs. If we use a one-fold unit-level model to fit such data, the sub-area level effects are ignored. Ref. [4] used posterior predictive p-values to study the case in which the data follow a normal model with a two-stage (three-stage) hierarchical structure while the fitted model has a one-stage (two-stage) hierarchical structure. Ref. [5] discussed the ability to detect a three-stage model when a two-stage model is actually fitted.
Two-fold models are an important extension of basic small area models. Many authors have considered these problems and proposed models of this kind. Much of the literature focuses on continuous data. Ref. [6] proposed a sub-area level model which provides model-based estimates that account for the hierarchical structure of data. Two-fold sub-area level models were studied by [2,7,8,9], and many others. This type of model is an area-level model which extends the Fay-Herriot model (see [10]) to the sub-area level. Two-fold nested error regression models were considered by [11,12]. On the other hand, some of the literature focuses on categorical data. Ref. [13] described an HB model to make an inference about the finite population proportion under two-stage cluster sampling. Ref. [14] extended the Beta-Binomial model to the two-fold model and used Gibbs sampling to obtain the posterior estimates. Ref. [15] showed that the two-fold Beta-Binomial model is preferable to the one-fold one if the data have a hierarchical structure. Ref. [16] extended [15] to accommodate heterogeneous correlations. They used an HB model to make a posterior inference about the finite population proportion of each area, accounting for intracluster correlations. Ref. [17] discussed the sub-area Beta-Binomial model and applied the model to estimate the finite population proportion of healthy individuals in each household covered by the NLSS II, assuming no covariate was available.
Bayesian logistic regression models with random effects are suitable for handling binary data with covariates. Ref. [18] discussed discrimination between the logit and the complementary log-log link functions by using the logistic regression model. Ref. [19] discussed logistic regression for sample survey data (not small area estimation). Ref. [20] showed how to accelerate the Gibbs sampler for a model with latent variables introduced earlier by [21] for Bayesian probit analysis. Ref. [22] discussed the logistic regression model by using the empirical Bayesian approach. Ref. [23] showed how to analyze binary data with covariates to maintain conjugacy for both the logistic and Poisson regression models. The analysis of binary data with covariates under nonignorable nonresponse was discussed by [24]. Ref. [3] proposed a hierarchical Bayesian logistic regression model for binary data in small area estimation. This model is a unit-level model without a sub-area effect. Our two-fold sub-area model is an extension of this logistic regression model. We add a sub-area level random effect to the model, which can capture the hierarchical structure of the sampled data. At the same time, we add more hyper-parameters to the model, which makes the inference more complicated. To address these difficulties, we propose an approximation method called the integrated nested normal approximation (INNA).
The other side of our application is that there are numerous small areas (households and individuals), and MCMC methods, which involve complicated integrals, cannot handle them efficiently. “Big data” are defined in [25] as data that are too big to comfortably process on a single machine. The authors of [25] considered consensus Monte Carlo methods that split the data across several machines. They proposed algorithms that perform distributed approximate Bayesian analyses in order to minimize the communication between computers. Parallel MCMC methods for non-Gaussian posterior distributions were discussed by [26]. Fortunately, in survey sampling, the design generally uses a stratification which is not artificial, and, in this case, consensus Monte Carlo may not be needed, although it may still be a good idea for a large stratum.
The integration involved in Bayesian inference is usually intractable, which is true for our logistic regression model. Approximation techniques are therefore desirable. The procedure we use to approximate the posterior density of the parameters of the logistic regression sub-area model, INNA, resembles the integrated nested Laplace approximation (INLA) originally proposed by [27], but the two methods are different. INLA is a popular algorithm and an alternative to MCMC for big data analysis if the joint posterior density is very complicated. It requires posterior modes, and, for numerous small areas, the computation of modes becomes time-consuming and challenging for the logistic regression model or any generalized linear mixed model. Yet, INLA has found many useful applications, such as in Poisson regression by [28], and in spatial point pattern data by [29]. We note that INLA can be problematic, especially for logistic and Poisson hierarchical regression models, even if the modes can be computed. Ref. [30] attempted to improve INLA using a copula-based correction, which adds complexity to INLA. Our approximation method, INNA, which does not require finding posterior modes, uses a sampling-based procedure accommodated by the multiplication rule of probability. Instead of finding the posterior modes, INNA finds approximate modes in closed form, facilitated by the empirical logistic transform ([31]) and the second-order Taylor series approximation.
On the other hand, two-fold models can capture the heterogeneity between samples within not only areas but also sub-areas. Many model-based estimation techniques for the sampling variances have been considered in the literature, but most of them are for the area-level model: see [32,33,34].
In Section 2, a full description of a sub-area HB logistic regression model is given. In Section 3, we describe the integrated nested normal approximation (INNA) computation method and some theoretical results are provided. The exact MCMC method is presented in Appendix A. The exact method refers to MCMC methods without further approximation. In Section 4, we apply the model to the NLSS II data to provide smoothed estimates of the household proportions of members in good health for both sampled and nonsampled households. Some comparisons between INNA and the exact method are presented. Finally, in Section 5, we make concluding remarks and discuss the future work.
2. Sub-Area Logistic Regression Model
In this section, we discuss the sub-area HB logistic regression model at the unit level. In the NLSS II data, we have binary data (good health versus poor health) for each individual within a household, and these households are within wards. The observations are available at the unit level and so is the reliable auxiliary information. However, the model and method we propose for small areas and sub-areas are not limited to this application to NLSS II data. They can also be applied to other population-based surveys with binary responses which contain small areas and/or sub-areas.
Suppose that there are L small areas (wards) in the finite population and that, within the area, there are sub-areas (households). Within the sub-area, there are individuals. We assume that areas are sampled and a simple random sample of households is taken from the area. All individuals in the sampled households are sampled. Here, we assume the survey weights are the same within all households in each area. Actually, the design is almost self-weighting.
Let denote the binary responses. Let . Let be the number with response 1 and be the total number of people who responded. Let be the vector with p covariates for individuals and an intercept.
We use P to represent the population proportion and p as the sample proportion. Let be the corresponding sample probability of , .
The primary interests are the finite population proportions of the households, which are and the finite population proportions of the areas, which are
In the context of logistic regression, the two-fold hierarchical Bayesian logistic regression model for the sub-area means is
Here, are the sub-area level random effects, which are not in the area-level model in [3]. are the area random effects and are the regression coefficients, with as the variance of the random effects, respectively.
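To make the two-level structure concrete, the following sketch simulates binary responses whose logit is a linear predictor plus an area effect and a sub-area effect. The additive form on the logit scale, the normal random effects, and all names (`simulate_two_fold`, `sd_area`, `sd_sub`) are assumptions made for illustration; the exact parameterization is the one given by the model equation above.

```python
import math
import random

def expit(z):
    """Inverse logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def simulate_two_fold(n_areas, n_sub, n_units, beta, sd_area, sd_sub, seed=0):
    """Simulate (area, sub-area, covariates, response) records.

    Assumed form: logit P(y = 1) = x'beta + area effect + sub-area effect,
    with independent normal random effects (an illustrative choice).
    """
    rng = random.Random(seed)
    records = []
    for i in range(n_areas):
        area_effect = rng.gauss(0.0, sd_area)
        for j in range(n_sub):
            sub_effect = rng.gauss(0.0, sd_sub)
            for _ in range(n_units):
                # First covariate is the intercept; the rest are standardized.
                x = [1.0] + [rng.gauss(0.0, 1.0) for _ in beta[1:]]
                eta = sum(b * v for b, v in zip(beta, x)) + area_effect + sub_effect
                y = 1 if rng.random() < expit(eta) else 0
                records.append((i, j, x, y))
    return records

# Hypothetical sizes: 5 wards, 12 households per ward, 4 members per household.
data = simulate_two_fold(5, 12, 4, beta=[0.5, -0.3], sd_area=0.5, sd_sub=0.8)
```

Members of the same household share `sub_effect` and households in the same ward share `area_effect`, which is what induces the within-household and within-ward correlation that a one-fold model would miss.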
In order to apply our approximation method and make an inference for posterior distribution, we use an equivalent model.
First, we separate into and , where . We set as the mean of , and then we can omit the intercept term from the covariate . Second, we introduce a new parameter, , in order to set and independently and make it easy to make an inference for both of them. We have
The joint posterior density for the parameters is
The posterior density is a non-standard multivariate density, and there are difficulties in fitting it using MCMC methods, more so when are large. This motivates our approximate methods.
3. Integrated Nested Normal Approximation Method
In this section, we discuss the INNA method for the sub-area HB logistic regression model. It is an extension of the INNA method in [3]. The INNA method does not require finding the posterior modes. Because of the large number of sub-areas, finding all posterior modes would be time-consuming, which is why we did not choose the popular INLA method. In detail, we discuss the approximation of the joint posterior density (3).
Notice that the joint posterior density (3) is very complicated and it is the expit part, , that causes the difficulties. In the following, we discuss how to approximate this term to normal density functions by using Laplace approximation, the second-order multivariate Taylor-series approximation and the empirical logistic transform (ELT). This is the key contribution in the paper. Then we use the multiplication rule to approximate the joint posterior density,
where the first three densities on the right-hand side are all multivariate normal densities. Therefore, we can draw samples and make inference through the approximate joint posterior density.
Let denote the density of a vector of parameters . Let denote the gradient vector and H the Hessian matrix at some point .
Lemma 1.
Let be a logconcave density function with the parameter . Then, approximately has a multivariate normal distribution,
Proof.
Simply applying the second-order multivariate Taylor series of at gives
Due to the logconcavity of , the negative of its Hessian matrix is positive-definite, and its inverse can serve as the covariance matrix. Notice that we are not required to use the mode of . That is, we do not need to solve the equation that sets the gradient vector to zero. Therefore, does not have to be that solution; it can be some other point. It is worth noticing that the term, , is a correction to . □
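A one-dimensional sketch may help to see why the expansion point need not be the mode. Expanding a concave log-density to second order at a point gives a normal density whose variance is the inverse of the negative second derivative and whose mean is the expansion point plus a gradient correction. The function name `normal_approx_1d` and the binomial example are assumptions chosen for illustration, not part of the paper.

```python
import math

def normal_approx_1d(grad, hess, theta_star):
    """Second-order Taylor (normal) approximation of a log-concave density.

    grad and hess are the first and second derivatives of the log-density.
    Returns the mean and variance of the approximating normal density when
    the expansion is taken at theta_star, which need not be the mode.
    """
    g = grad(theta_star)
    h = hess(theta_star)
    if h >= 0:
        raise ValueError("log-density must be strictly concave at theta_star")
    var = -1.0 / h                 # inverse of the negative second derivative
    mean = theta_star + var * g    # expansion point plus gradient correction
    return mean, var

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

# Illustrative log-density: binomial log-likelihood in the logit theta.
y, n = 30, 40
grad = lambda t: y - n * expit(t)
hess = lambda t: -n * expit(t) * (1.0 - expit(t))

mean, var = normal_approx_1d(grad, hess, theta_star=1.0)  # 1.0 is not the mode
exact_mode = math.log(y / (n - y))
```

When the expansion point is reasonably close to the mode, the corrected mean lands near the mode after a single step, which is the property the method exploits to avoid iterative mode finding.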
To illustrate the approximation steps, we start with a simpler model with flat priors for and the , according to model (2). That is,
The joint posterior density is
The logarithm of the joint posterior density (or log likelihood) is
Let . In our method, we find a convenient point to expand the log-likelihood in a second-order multivariate Taylor-series expansion.
To begin with, let . We use the empirical logistic transform to get an estimate of , where
First, we discuss how to find the quasi mode of . We plug into the log likelihood function and consider it as a function of only as , and we get
The first derivative of is
Usually we should set equal to zero and find the modes as the maximum likelihood estimator (MLE) of . But here, it is not easy to solve the equation due to the complexity of . We use the first-order Taylor series to approximate it and then simplify so that we can get quasi-modes of .
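The idea of replacing an iterative mode search by a closed-form step can be sketched in a much simpler setting: several binomial counts sharing one common logit. The starting value is the empirical logistic transform, and the score is linearised once around it. This is a minimal illustration under those assumptions, with hypothetical function names; it is not the quasi-mode formula derived in the paper.

```python
import math

def elt(y, n):
    """Empirical logistic transform; finite even when y = 0 or y = n."""
    return math.log((y + 0.5) / (n - y + 0.5))

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def one_step_quasi_mode(ys, ns):
    """Closed-form approximate root of the score for a common logit.

    The score sum(y - n * expit(theta)) is replaced by its first-order
    Taylor expansion around the ELT starting value and solved exactly.
    """
    theta0 = elt(sum(ys), sum(ns))
    p0 = expit(theta0)
    score = sum(y - n * p0 for y, n in zip(ys, ns))
    info = sum(n * p0 * (1.0 - p0) for n in ns)
    return theta0 + score / info

ys, ns = [3, 0, 5], [5, 4, 5]          # hypothetical household counts
quasi_mode = one_step_quasi_mode(ys, ns)
exact_mle = math.log(sum(ys) / (sum(ns) - sum(ys)))
```

Note that the second household has no positive responses, so its raw logit is undefined, while `elt(0, 4)` is finite; this is the practical reason for starting from the ELT.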
The first-order Taylor expansion of equals . Notice that by Taylor series, . Then we can get
We can get the quasi-modes of by solving the equation . That is,
Second, we obtain quasi-modes for the , a refinement of the . Plug into the likelihood function and consider it as function only:
Similarly, after applying Taylor expansion, we get the approximate first derivative of
We can obtain the approximate posterior mode of by solving the equation .
Notice that the term in the denominator may cause trouble if for some i's and j's. Here, we borrow the idea from ELT and make a small adjustment in order to avoid a zero denominator. That is,
Let . Next, we evaluate and H at the quasi-modes can also be obtained as
The partial derivatives can be expressed in terms of response and covariates as
where .
For the convenience of computation, denote and where
Let , where
Lemma 2.
Assuming that the design matrix is full-rank and , the posterior density, in (5), is logconcave.
Proof.
If , there are solutions to the gradient vector set to zero.
Let . Then, A, B and C of the negative Hessian matrix can be written as,
where
It is obvious that D is positive-definite. Thus, to show that is positive-definite, we need to show that its Schur complement of D, , is positive-definite (e.g., see [35]). Let . The Schur complement is
It is now easy to show that
Therefore, is positive-definite, and is logconcave. □
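The Schur complement argument can be checked numerically on a small block matrix. The sketch below assumes a symmetric matrix partitioned as [[A, B], [Bᵀ, D]] with D diagonal, which makes the complement A − B D⁻¹ Bᵀ cheap to form; the block names follow the proof, while the function names and the numbers are illustrative assumptions.

```python
import math

def schur_complement_diag(A, B, d):
    """Return A - B diag(d)^{-1} B^T, where d holds the diagonal of D."""
    p = len(A)
    return [[A[r][c] - sum(B[r][k] * B[c][k] / d[k] for k in range(len(d)))
             for c in range(p)] for r in range(p)]

def is_positive_definite(S):
    """Attempt a Cholesky factorization of a symmetric matrix S.

    The factorization succeeds exactly when S is positive-definite.
    """
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

def block_is_pd(A, B, d):
    """[[A, B], [B^T, diag(d)]] is PD iff diag(d) and its Schur complement are PD."""
    return all(v > 0 for v in d) and is_positive_definite(schur_complement_diag(A, B, d))

A = [[4.0, 1.0], [1.0, 3.0]]
B = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
d = [2.0, 2.0, 2.0]
S = schur_complement_diag(A, B, d)
```

Because D is diagonal, the check costs only a small factorization of the complement, regardless of how many sub-areas contribute to D.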
Finally, according to Lemmas 1 and 2, we can establish the approximation theorem.
Theorem 1.
Assuming that the design matrix is full-rank and , the posterior density, in (5) is approximately a multivariate normal density, and the conditional posterior density of and can also be approximated by multivariate normal distributions.
Proof.
The proof is given in Appendix B. □
Therefore, by Theorem 1, we can approximate the expit term by two multivariate normal densities, and then obtain our approximate two-fold Bayesian logistic regression model.
Recall the posterior density of our two-fold logistic model is
The likelihood function can be approximated by the multivariate normal distribution by Theorem 1. Combining the priors of and given by our Bayesian logistic model with the results in Theorem 1, we obtain our INNA model
where and is a vector of ones.
Using Bayes’ Theorem and the multiplication rule, the posterior density can be approximated as
Therefore, we can get the following key result.
Theorem 2.
Using the multiplication rule, the joint posterior density, in (6), can be approximated by
where the first three densities on the right-hand side are all multivariate normal densities.
Proof.
The proof is given in Appendix C. □
The INNA is actually a random sampler. First, we draw samples for from . The posterior distribution of does not have a standard form. Here, we use the grid method and numerical integration to sample and . Since and , we make a transformation to and so that we get and . Then, the posterior density of is
We need to draw together. The joint density can be rewritten as
We plug each grid point of into and then use numerical integration to get the density of . After we plug in all 100 grid points, we get 100 values of and then draw from them. Next, we plug into and use the grid method to draw . We repeat those steps 10,000 times to get the sample of . Once we get samples for , we transform them back to and , respectively. Second, given , we can simply draw samples of from the approximate multivariate normal distribution . Third, we can draw samples of independently given and data from the approximate normal distribution . Finally, samples of independently given can be obtained from the approximate normal distribution . Notice that the last three steps are very simple, just drawing samples from normal densities. In addition, and are all independent so that we can draw them simultaneously. Therefore, those latter steps permit fast computing.
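The grid method used in the first step can be sketched for a single parameter that has already been mapped to the unit interval. The unnormalized density is evaluated at the midpoints of 100 equal cells, one cell is selected with probability proportional to its height, and the draw is placed uniformly within that cell. The function name `grid_draw` and the example density are assumptions for illustration only.

```python
import random

def grid_draw(unnorm_density, n_grid=100, rng=random):
    """Draw one value on (0, 1) from an unnormalized density by the grid method."""
    width = 1.0 / n_grid
    mids = [(g + 0.5) * width for g in range(n_grid)]
    weights = [unnorm_density(m) for m in mids]
    # Select a cell with probability proportional to the density at its midpoint.
    cell = rng.choices(range(n_grid), weights=weights, k=1)[0]
    # Jitter uniformly inside the selected cell.
    return (cell + rng.random()) * width

# Illustrative target: density proportional to u * (1 - u)^4 on (0, 1).
rng = random.Random(0)
draws = [grid_draw(lambda u: u * (1.0 - u) ** 4, rng=rng) for _ in range(10000)]
grid_mean = sum(draws) / len(draws)
```

No normalizing constant is needed because only the relative heights of the cells matter, and successive draws are independent, so no burn-in or thinning is involved in this step.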
In order to check whether the INNA method provides reasonable results, we apply the exact MCMC logistic regression method to the sub-area model. The idea of the exact method is to obtain the full conditional posterior distributions of all the parameters in the model, and then to draw a large number of samples of each parameter from its full conditional posterior density. Details are given in Appendix A.
There are two differences between these two methods. First, although both methods are sampling-based, the approximate method draws random samples, whereas the exact method uses numerical integration and Markov chains. Second, is used for the INNA method. In the exact method, a Metropolis step is used for the . This is very time-consuming in the exact method. On the other hand, the exact method actually makes use of the INNA method: we use the M-H sampler to draw samples for and , respectively, with the proposal functions and , respectively, taken from the INNA method.
4. Numerical Example
4.1. Nepal Living Standards Survey II
The performance of our method is studied using the Nepal Living Standards Survey (NLSS II), conducted in the years 2003–2004. The main objective of the NLSS II is to track changes in and the progress of national living standards and social indicators of the Nepalese population. It is an integrated survey which covers samples from the whole country and runs throughout the year.
The NLSS II gathers information on a variety of aspects. It has collected data on demographics, housing, education, health, fertility, employment, income, agricultural activity, consumption, and various other areas. The sampling design of NLSS II is two-stage stratified sampling. Nepal is stratified into Primary Sampling Units (PSUs) and, within each PSU, a number of households (sub-areas) are selected. All household members in the sample were interviewed.
In detail, the NLSS II has records for 20,263 individuals from 3912 households (sub-areas) from 326 PSUs (areas) from a population of 60,262 households and about two million Nepalese. A sample of PSUs was selected from the strata using probability proportional to size (PPS) sampling and 12 households were systematically selected from each PSU. The survey is self-weighted and some adjustments were made after conducting the survey for non-responses or missing data. For simplicity, in this paper, we assume all samples have the same weight. Table 1 shows the distribution of all samples by stratum.
Table 1.
Distribution of wards and households in the sample.
We chose four relevant covariates which can influence health status from the same NLSS II survey for our two-fold logistic regression model. They are age, nativity, sex and religion. We created binary variables for nativity (Indigenous = 1, Non-indigenous = 0), religion (Hindu = 1, Non-Hindu = 0) and sex (Male = 1, Female = 0). Table 2 shows the details of these four covariates. In the model fitting, we standardized the age covariate. Older people and children are more vulnerable than younger adults. Indigenous people can have different health statuses from migrants.
Table 2.
The descriptives of 4 covariates.
According to the 2001 census data, only about 0.091% of households and only 0.904% of PSUs were sampled. The NLSS II was designed to provide reliable estimates only at the stratum level, or for areas even larger than the stratum. It cannot give reliable estimates for small areas (PSU or household level) since the sample sizes are too small. Therefore, we need to use statistical models to fit the available data and find reliable estimates for small areas. In our study, we chose the binary variable, health status, from the health section of the questionnaire.
4.2. Numerical Comparison
We used data from NLSS II to illustrate our sub-area logistic regression model. We predicted the household proportions of members in good health for 18,924 households (sampled and non-sampled). The Bayesian bootstrap of [36] was applied to obtain the auxiliary information for the non-sampled units. This analysis was based on 1224 sample households from 102 wards (PSUs) in stratum 6. Our primary purposes were to show that our model can provide good estimates and to compare the approximate method with the exact method when there are random effects at the household level.
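A minimal sketch of the Bayesian bootstrap for imputing covariates of non-sampled units is given below. It draws one set of Dirichlet(1, ..., 1) weights over the observed covariate values and then samples the required number of new values with those probabilities. How the procedure was organized in the actual analysis (for example, by household or by ward) is not stated here, so the function name and the data are purely illustrative assumptions.

```python
import random

def bayesian_bootstrap(observed, n_new, rng):
    """Sample n_new values from observed using Dirichlet(1, ..., 1) weights."""
    # Normalized unit-rate exponentials are a Dirichlet(1, ..., 1) draw.
    gammas = [rng.expovariate(1.0) for _ in observed]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    return rng.choices(observed, weights=probs, k=n_new)

rng = random.Random(2)
sampled_ages = [4, 11, 23, 35, 35, 48, 61, 70]   # hypothetical observed ages
imputed_ages = bayesian_bootstrap(sampled_ages, n_new=20, rng=rng)
```

Repeating the call for each posterior draw propagates the uncertainty about the unobserved covariates into the predicted proportions, rather than fixing one imputed data set.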
We used the Rcpp [37] and RcppArmadillo [38] packages in R [39] to fit the model to this NLSS II dataset with both the approximate (INNA) method and the exact method. For the INNA method, we began with 10,000 iterations and a burn-in of 1000 and we kept only every ninth sample. Finally, 1000 samples were obtained for constructing the posterior distributions of all the parameters. The exact method was very time-consuming, taking about 30 h to finish. However, the INNA approximation method can get samples in 8 min. When there are a large number of areas or sub-areas, the approximation method will yield enormous savings in computing time.
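The sample sizes and timings quoted above can be checked with a few lines of arithmetic. The helper name `kept_indices` is hypothetical; the numbers are those stated in the text.

```python
def kept_indices(n_iter, burn_in, thin):
    """Indices of the iterations retained after burn-in and thinning."""
    return list(range(burn_in, n_iter, thin))

kept = kept_indices(n_iter=10_000, burn_in=1_000, thin=9)
n_kept = len(kept)                    # (10,000 - 1,000) / 9 retained draws

# Ratio of the reported run times: about 30 hours versus 8 minutes.
speed_up = (30 * 60) / 8
```

With these settings the retained sample has 1000 draws, matching the text, and the reported run times correspond to a speed-up factor of 225.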
Convergence diagnostics were conducted. The convergence of the hyperparameters was monitored by the Geweke test of stationarity [40] and the effective sample sizes. The p-values and effective sample sizes are shown in Table 3, indicating good convergence for both methods. Table 3 also shows the posterior means (PMs) and associated posterior standard deviations (PSDs) of the hyperparameters. The PMs are very close between these two methods. The PSDs are slightly larger for the exact method than for the INNA method, but they are reasonably close.
Table 3.
Posterior means (PM), associated posterior standard deviations (PSD), Geweke test p-values and effective sample sizes (ESS) of hyperparameters based on the INNA and exact method.
In Figure 1, Figure 2 and Figure 3, we compare the PMs, PSDs and posterior coefficients of variation (PCVs) at the household level, which is our primary purpose. We can see that the PMs are very close, nearly lying on the 45-degree line through the origin. The PSDs are slightly more spread out, but the points still lie close to the 45-degree line, and so do the PCVs. Overall, these approximations are acceptable in the data analysis. In Figure 4, Figure 5 and Figure 6, we compare the PMs, PSDs and PCVs, respectively, at the ward level. The plots of the PMs are still very good. Notice that the other two plots, of the PSDs and PCVs, are more spread out than those at the household level. Again, though, the approximate method and the exact method are reasonably close.
Figure 1.
Comparison of the INNA method and the exact method using the PMs of the household proportions.
Figure 2.
Comparison of the INNA method and the exact method using the PSDs of the household proportions.
Figure 3.
Comparison of the INNA method and the exact method using the PCVs of the household proportions.
Figure 4.
Comparison of the INNA method and the exact method using the PMs of the ward proportions.
Figure 5.
Comparison of the INNA method and the exact method using the PSDs of the ward proportions.
Figure 6.
Comparison of the INNA method and the exact method using the PCVs of the ward proportions.
We also compare the approximate method with the exact method using the five number summaries (the minimum values, the first quartiles, the median, the third quartiles, and the maximum values) with respect to the PMs, PSDs and PCVs of the finite population proportions at the household level and ward level in Table 4 and Table 5. The PMs from both methods at the household level have larger variations than those at the ward level. The PCVs at the ward level are generally much smaller than at the household level. The summaries of the PMs, PSDs and PCVs within households and wards between the approximate and exact methods are very close.
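The summaries used in this comparison are straightforward to compute from posterior draws. The sketch below shows a five-number summary and the posterior coefficient of variation, taken here as the PSD divided by the PM. The quartile convention (`method="inclusive"`) and the function names are assumptions; the paper does not state which quantile rule was used.

```python
import statistics

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile and maximum."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

def pcv(pm, psd):
    """Posterior coefficient of variation: PSD relative to PM."""
    return psd / pm

# Hypothetical posterior means for five households.
summary = five_number_summary([0.62, 0.71, 0.80, 0.85, 0.93])
example_pcv = pcv(pm=0.80, psd=0.04)
```

Applying the same summary to the INNA and the exact results, household by household or ward by ward, gives the side-by-side rows compared in the tables.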
Table 4.
Comparison of posterior inference about the finite population proportions using the five-number summaries at the household level.
Table 5.
Comparison of posterior inference about the finite population proportions using five-number summaries at the ward level.
We conclude that the approximation method at the household level is reasonable. The approximation is desirable because one can perform the computations in real time.
5. Conclusions and Future Work
The sub-area HB logistic regression model can be applied to analyze a binary response variable. This model is an extension of the HB logistic regression area-level model, which ignores the actual hierarchical structure of the data. For large datasets, it is impractical to use the MCMC method to fit the model. We therefore propose an approximation method, INNA, which saves time significantly because there is no need to compute numerous modes. The numerical example shows that INNA can provide reliable estimates as well. An illustrative example based on the NLSS II is presented in order to compare the approximation method and the exact method. It shows that, when there are a large number of areas and sub-areas, the approximation method can be efficient and can also provide reasonable estimates.
INNA is a method for approximate Bayesian inference based on Laplace’s method, the second-order multivariate Taylor-series approximation and the empirical logistic transform (ELT). It can be applied to all HB logistic regression models, for which it can be a fast and accurate alternative to the Markov chain Monte Carlo methods. The comparison and model results illustrate the performance of the INNA methods based on the sub-area model.
There are several directions for future work on the two-fold small area model. First, in this paper, we assume equal survey weights since the NLSS II is a self-weighted survey. However, after the data are collected, the sampling weights are usually adjusted for various characteristics or for nonresponse. Incorporating those survey weights into the model is therefore important. The NLSS II is a national population-based survey, and we should rescale the sample weights to sum to an equivalent sample size. That is, we consider the adjusted weight as , where as an equivalent sample. Introducing the sampling weights, we can obtain an updated normalized likelihood function. Based on the updated likelihood function and the same prior in the two-fold model, we can carry out a full Bayesian analysis of the updated model and then predict the finite population proportion of the family members with good health in each household.
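One common way to rescale weights so that they sum to an equivalent sample size is sketched below. It uses Kish's effective sample size, (Σw)² / Σw², as the equivalent sample size; this specific choice and the function name are assumptions made for illustration and may differ from the adjusted weight intended above.

```python
def rescale_weights(weights):
    """Rescale weights to sum to Kish's effective sample size (assumed choice)."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    n_eff = s1 * s1 / s2
    adjusted = [n_eff * w / s1 for w in weights]
    return adjusted, n_eff

equal_adj, equal_n = rescale_weights([2.0, 2.0, 2.0, 2.0])
unequal_adj, unequal_n = rescale_weights([1.0, 3.0])
```

With equal weights every adjusted weight is one and the equivalent sample size equals the actual sample size; unequal weights reduce the equivalent sample size, which tempers the contribution of the weighted likelihood.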
Second, we focus on binary data. Actually, there are four options in the health status questionnaire. The Multinomial-Dirichlet model can be used as an extension to polychotomous data. Third, the two-fold sub-area level models can also be extended to three-fold models if the data have an additional hierarchical structure; actually, the NLSS II has this structure (households within wards, wards within districts). Fourth, in our models, we consider parametric priors. Introducing the Dirichlet process as a prior might make our method more robust to the prior specification.
Author Contributions
Conceptualization, B.N. and L.C.; methodology, B.N. and L.C.; software, L.C.; validation, B.N. and L.C.; formal analysis, L.C.; investigation, L.C.; resources, B.N.; writing—original draft preparation, L.C.; writing—review and editing, B.N. and L.C.; visualization, L.C.; supervision, B.N. All authors have read and agreed to the published version of the manuscript.
Funding
Balgobin Nandram was supported by a grant from the Simons Foundation (#353953, Balgobin Nandram).
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Exact Method for Sub-Area Logistic Regression Model
Recall that the joint posterior density for the parameters of our two-fold logistic regression model is
We can see that the form of the joint posterior density is very complicated. It is very time-consuming to draw all the posterior samples with the exact MCMC method. However, the exact method provides reliable estimates of all parameters, so, in order to test the performance of our approximation method, we need to apply the MCMC method to our model and then compare the performance of the two methods. We use the Metropolis-Hastings sampler to draw samples for together and then draw given samples. Finally, we use the M-H method to draw given samples.
In order to draw samples for together, we need to integrate out and . First, we integrate out from the joint posterior density to get
Notice that the integrand is not a simple distribution function, so we use Monte Carlo numerical integration to approximate the integrals. Let . Notice that follows the standard normal distribution. For the standard normal density, 99.7% of the probability mass falls within 3 standard deviations of the mean, which corresponds to the interval (-3, 3). Therefore, we bound the integration domain to (-3, 3) and divide the interval into M equal subintervals . Then we can get an approximate but very accurate joint density
Let , which is the midpoint of each interval . We use the midpoint rule to approximate the definite integrals. We divide the interval into 100 subintervals, and so we use 100 midpoints to get the approximate joint posterior distribution
Similarly, let and . We use the midpoint rule to approximate the definite integral with respect to and then get the posterior density of
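The midpoint-rule step described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the binomial-logit kernel, and its numerical settings (3 successes in 5 trials, linear predictor 0.2, scale 0.8) are assumptions made for the example. It integrates a kernel against the standard normal density over the interval from -3 to 3 using 100 midpoints.

```python
import math

def std_normal_pdf(z):
    # Standard normal density.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def midpoint_integral(g, lower=-3.0, upper=3.0, m=100):
    # Midpoint rule for the integral of g(z) * phi(z) over [lower, upper],
    # using m equal subintervals and their midpoints.
    width = (upper - lower) / m
    total = 0.0
    for k in range(m):
        z_mid = lower + (k + 0.5) * width
        total += g(z_mid) * std_normal_pdf(z_mid)
    return width * total

def binomial_logit_kernel(z, successes=3, trials=5, linear_predictor=0.2, scale=0.8):
    # Hypothetical integrand: a binomial kernel whose logit contains a
    # standardized random effect z (all numerical values are assumptions).
    p = 1.0 / (1.0 + math.exp(-(linear_predictor + scale * z)))
    return p ** successes * (1.0 - p) ** (trials - successes)

# Mass of the standard normal on [-3, 3]; should be close to 0.997.
mass = midpoint_integral(lambda z: 1.0)
# Random effect integrated out of the hypothetical kernel.
marginal = midpoint_integral(binomial_logit_kernel)
```

With a constant integrand the sum recovers the standard normal mass within three standard deviations, which is a quick check that truncating the domain loses very little.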
We propose to draw samples from jointly by applying the M-H sampler. The target function is . We set the proposal function as
where , a chi-square on t degrees of freedom, i.e., a Student’s t. Here t is a tuning constant.
We also use the M-H sampler to draw samples for and , respectively. The proposal functions are and , respectively, from the INNA method.
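A minimal sketch of an M-H sampler with a Student's t proposal is given below, assuming an independence-type proposal in one dimension. The target (a logistic likelihood with 7 successes in 10 trials and a N(0, 10^2) prior), the proposal center and scale, and all function names are hypothetical choices for illustration; the paper's sampler is multivariate and uses its own target and proposal.

```python
import math
import random

def log_target(theta):
    # Hypothetical log-concave target: 7 successes in 10 Bernoulli trials on
    # the logit scale with a N(0, 10^2) prior (all values are assumptions).
    return 7.0 * theta - 10.0 * math.log1p(math.exp(theta)) - theta * theta / 200.0

def log_t_kernel(x, center, scale, df):
    # Log of the Student's t density up to an additive constant.
    u = (x - center) / scale
    return -0.5 * (df + 1.0) * math.log1p(u * u / df)

def draw_t(center, scale, df, rng):
    # A t variate is a standard normal divided by sqrt(chi-square / df).
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return center + scale * z / math.sqrt(chi2 / df)

def independence_mh(n_draws, center, scale, df, seed=1):
    # Independence M-H: accept with probability min(1, w(new) / w(current)),
    # where w is the ratio of target to proposal density.
    rng = random.Random(seed)
    current = center
    log_w = log_target(current) - log_t_kernel(current, center, scale, df)
    draws, accepted = [], 0
    for _ in range(n_draws):
        proposal = draw_t(center, scale, df, rng)
        log_w_new = log_target(proposal) - log_t_kernel(proposal, center, scale, df)
        if rng.random() < math.exp(min(0.0, log_w_new - log_w)):
            current, log_w = proposal, log_w_new
            accepted += 1
        draws.append(current)
    return draws, accepted / n_draws

# df plays the role of the tuning constant t in the text.
draws, acceptance_rate = independence_mh(5000, center=math.log(7 / 3), scale=0.7, df=5)
posterior_mean = sum(draws) / len(draws)
```

Smaller degrees of freedom give heavier proposal tails, which usually protects against the chain getting stuck in the tails of the target at the cost of a somewhat lower acceptance rate.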
The target function to draw is
After we get samples for , we can use the M-H sampler to draw
Appendix B. Proof of Theorem 1
Proof.
By Lemma 2, the posterior density is logconcave. Then according to Lemma 1, the posterior distribution is approximately a multivariate normal distribution.
By Lemma 1, evaluating all quantities at , the mean is
Also, the covariance matrix is
Therefore, by Lemma 1, the approximate joint posterior density of is
Finally, using the property of the multivariate normal density, the conditional posterior density of and can also be approximated by multivariate normal distributions,
where
□
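The normal approximation used in this proof can be illustrated in one dimension. The sketch below assumes that the approximation comes from a second-order Taylor expansion of the log density at an expansion point, so that the mean is the expansion point minus the gradient divided by the Hessian, and the variance is the negative inverse Hessian; the exact statement of Lemma 1 may differ. The binomial-logit log density and the function names are hypothetical.

```python
import math

def log_density_derivatives(theta, successes=7, trials=10):
    # Gradient and Hessian of the hypothetical log density
    # successes * theta - trials * log(1 + exp(theta)).
    p = 1.0 / (1.0 + math.exp(-theta))
    gradient = successes - trials * p
    hessian = -trials * p * (1.0 - p)  # negative everywhere, so log-concave
    return gradient, hessian

def normal_approximation(theta0):
    # Second-order Taylor expansion of the log density at theta0 gives a
    # normal kernel with the mean and variance below.
    gradient, hessian = log_density_derivatives(theta0)
    mean = theta0 - gradient / hessian
    variance = -1.0 / hessian
    return mean, variance

# Expanding at the mode: the gradient vanishes, so the mean is the mode itself.
mode_mean, mode_var = normal_approximation(math.log(7 / 3))
# Expanding at 0: the mean moves one Newton step toward the mode.
step_mean, step_var = normal_approximation(0.0)
```

Log-concavity matters here because it guarantees a negative Hessian, so the implied variance is always positive.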
Appendix C. Proof of Theorem 2
Proof.
First, look at the exponent terms containing in the above approximate posterior density function
where .
Then it can be shown that the is
Notice that is a diagonal matrix. Then, given , all s are independent. This is an important result because parallel computation can be done for , which addresses the computing-time and massive-storage challenges of big data analysis. This result holds for the exact conditional posterior density of the . Since has a multivariate normal distribution, we can integrate out from the joint approximate posterior density , and obtain the joint posterior density of and
Next, we will show that the approximate conditional posterior density of is also a normal distribution and that all s are independent as well. Here we consider each . Let , , and .
Look at the exponent only containing in the
where and .
Then it is easy to see that
Similarly, we can use parallel computing to draw as well since all of them are independent given . Then we can integrate out from the joint approximate posterior density and obtain the joint posterior density of and
Next we assume that the conditional posterior density of has an approximate multivariate normal density,
which is denoted by . The density function is
So the exponent terms are
Consider the exponent terms containing and
We know those two exponent parts are equal, so we have
That is, approximately follows a multivariate normal distribution,
Then we can easily integrate out from the joint density of , and get the posterior density of
□
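The independence result in the proof above is what makes parallel drawing possible: with a diagonal conditional covariance matrix, each component can be drawn from its own univariate normal. The following is a sketch under assumed inputs, not the authors' code; the conditional means and variances, the chunking scheme, and the function names are hypothetical, and Python threads are used only to show the structure (a compiled or multi-process backend would be needed for real speed-ups).

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def draw_chunk(task):
    # Draw one block of independent normals with its own seeded generator.
    means, variances, seed = task
    rng = random.Random(seed)
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(means, variances)]

def draw_independent_normals(means, variances, n_chunks=4, base_seed=2023):
    # A diagonal covariance means the components can be split into blocks
    # and drawn separately; the blocks never need to communicate.
    size = max(1, math.ceil(len(means) / n_chunks))
    tasks = [
        (means[i:i + size], variances[i:i + size], base_seed + k)
        for k, i in enumerate(range(0, len(means), size))
    ]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        chunks = list(pool.map(draw_chunk, tasks))  # map preserves task order
    return [x for chunk in chunks for x in chunk]

# Hypothetical conditional means and variances for 12 sub-area effects.
cond_means = [0.1 * j for j in range(12)]
cond_vars = [0.25] * 12
effects = draw_independent_normals(cond_means, cond_vars)
```

Seeding each block separately keeps the draws reproducible regardless of the order in which the blocks finish.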
References
- Central Bureau of Statistics Thapathali. Nepal Living Standards Survey 2003/04; Statistical Report; Central Bureau of Statistics Thapathali: Kathmandu, Nepal, 2004; Volume 1.
- Rao, J.N.; Molina, I. Small Area Estimation; John Wiley & Sons: Hoboken, NJ, USA, 2015.
- Nandram, B.; Chen, L.; Fu, S.-T.; Manandhar, B. Bayesian Logistic Regression for Small Areas with Numerous Households. Stat. Appl. 2018, 1, 171–205.
- Yan, G.; Sedransk, J. Bayesian Diagnostic Techniques for Detecting Hierarchical Structure. Bayesian Anal. 2007, 2, 735–760.
- Yan, G.; Sedransk, J. A Note on Bayesian Residuals as A Hierarchical Model Diagnostic Technique. Stat. Pap. 2010, 51, 1.
- Fuller, W.A.; Goyeneche, J.J. Estimation of the State Variance Component. Unpublished Manuscript. 1988.
- Torabi, M.; Rao, J.N.K. On Small Area Estimation under a Sub-Area Level Model. J. Multivar. Anal. 2014, 127, 36–55.
- Chen, L.; Nandram, B.; Cruze, N.B. Hierarchical Bayesian Model with Inequality Constraints for US County Estimates. J. Off. Stat. 2022, 38, 709–732.
- Erciulescu, A.L.; Cruze, N.B.; Nandram, B. Model-Based County Level Crop Estimates Incorporating Auxiliary Sources of Information. J. R. Stat. Soc. Ser. A (Stat. Soc.) 2019, 182, 283–303.
- Fay, R.E.; Herriot, R.A. Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data. J. Am. Stat. Assoc. 1979, 74, 269–277.
- Stukel, D.M.; Rao, J.N.K. Estimation of Regression Models with Nested Error Structure and Unequal Error Variances Under Two and Three Stage Cluster Sampling. Stat. Probab. Lett. 1997, 35, 401–407.
- Stukel, D.M.; Rao, J.N.K. On Small-Area Estimation under Two-Fold Nested Error Regression Models. J. Stat. Plan. Inference 1999, 78, 131–147.
- Nandram, B.; Sedransk, J. Bayesian Predictive Inference for a Finite Population Proportion: Two-Stage Cluster Sampling. J. R. Stat. Soc. Ser. B (Methodol.) 1993, 55, 399–408.
- You, Y.; Reiss, P. Hierarchical Bayes Small Area Estimation of Response Rates for an Expenditure Survey. In Proceedings of the Survey Methods Section; Statistical Society of Canada: Ottawa, ON, Canada, 2000; pp. 123–128.
- Nandram, B. Bayesian Predictive Inference of a Proportion Under a Twofold Small-Area Model. J. Off. Stat. 2016, 32, 187–208.
- Lee, D.; Nandram, B.; Kim, D. Bayesian Predictive Inference of a Proportion under a Two-Fold Small Area Model with Heterogeneous Correlations. Surv. Methodol. 2017, 17, 69–92.
- Chen, L.; Nandram, B. A Hierarchical Bayesian Beta-Binomial Model for Sub-areas. In Applied Statistical Methods. ISGES 2020, Springer Proceedings in Mathematics & Statistics; Hanagal, D.D., Latpate, R.V., Chandra, G., Eds.; Springer: Singapore, 2022; pp. 23–40.
- Nandram, B. Discrimination between Complementary Log-log and Logistic Model for Ordinal Data. Commun. Stat. Theory Methods 1989, 18, 2155–2164.
- Roberts, G.; Rao, J.N.K.; Kumar, S. Logistic Regression Analysis of Sample Survey Data. Biometrika 1987, 74, 1–12.
- Nandram, B.; Chen, M.-H. Reparameterizing the Generalized Linear Model to Accelerate Gibbs Sampler Convergence. J. Stat. Comput. Simul. 1996, 54, 129–144.
- Albert, J.H.; Chib, S. Bayesian Analysis of Binary and Polychotomous Response Data. J. Am. Stat. Assoc. 1993, 88, 669–679.
- Farrell, P.J.; MacGibbon, B.; Tomberlin, T.J. Empirical Bayes Small Area Estimation Using Logistic Regression Models and Summary Statistics. J. Bus. Econ. Stat. 1997, 15, 101–108.
- Nandram, B.; Erhardt, E. Fitting Bayesian Two-Stage Generalized Linear Models Using Random Samples via the SIR Algorithm. Sankhya 2010, 66, 733–755.
- Nandram, B.; Choi, J.W. A Bayesian Analysis of Body Mass Index Data from Small Domains Under Nonignorable Nonresponse and Selection. J. Am. Stat. Assoc. 2010, 105, 120–135.
- Scott, S.L.; Blocker, A.W.; Bonassi, F.V.; Chipman, H.A.; George, E.I.; McCulloch, R.E. Bayes and Big Data: The Consensus Monte Carlo Algorithm; Technical Report; Google, Inc.: Mountain View, CA, USA, 2013; pp. 1–22.
- Miroshnikov, A.; Wei, Z.; Conlon, E.M. Parallel Markov chain Monte Carlo for non-Gaussian posterior distributions. Stat 2015, 4, 304–319.
- Rue, H.; Martino, S.; Chopin, N. Approximate Bayesian Inference for Latent Gaussian Models Using Integrated Nested Laplace Approximations. J. R. Stat. Soc. (Ser. B) 2009, 71, 319–392.
- Fong, Y.; Rue, H.; Wakefield, J. Bayesian inference for generalized linear mixed models. Biostatistics 2010, 11, 397–412.
- Illian, J.B.; Sørbye, S.H.; Rue, H. A toolbox for fitting complex spatial point process models using integrated nested Laplace approximation (INLA). Ann. Appl. Stat. 2012, 6, 1499–1530.
- Ferkingstad, E.; Rue, H. Improving the INLA approach for approximate Bayesian inference for latent Gaussian models. Electron. J. Stat. 2015, 9, 2706–2731.
- Cox, D.R.; Snell, E.J. Analysis of Binary Data, 2nd ed.; Chapman and Hall/CRC: London, UK, 1989.
- Wang, J.; Fuller, W.A. The Mean Squared Error of Small Area Predictors Constructed With Estimated Area Variances. J. Am. Stat. Assoc. 2003, 98, 716–723.
- Yan, G.; Sedransk, J. Small Area Estimation Using Area Level Models and Estimated Sampling Variances. Surv. Methodol. 2006, 32, 97–103.
- Erciulescu, A.L.; Berg, E. Small area estimates for the conservation effects assessment project. In Frontiers of Hierarchical Modeling in Observational Studies, Complex Surveys and Big Data: A Conference Honoring Professor Malay Ghosh, College Park, MD, USA; Women in Statistics Conference: Cary, NC, USA, 2014.
- Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
- Rubin, D.B. The Bayesian Bootstrap. Ann. Stat. 1981, 9, 130–134.
- Eddelbuettel, D.; Francois, R. Rcpp: Seamless R and C++ Integration. J. Stat. Softw. 2011, 40, 1–18.
- Eddelbuettel, D.; Sanderson, C. RcppArmadillo: Accelerating R with High-performance C++ Linear Algebra. Comput. Stat. Data Anal. 2014, 71, 1054–1063.
- R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2019.
- Geweke, J. Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments. Bayesian Stat. 1992, 169–193.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).