Abstract
In this paper, we propose an improved Nested Error Regression model in which the random effects for each area are given a prior distribution through the Poisson–Dirichlet Process. Based on this model, we investigate the construction of parameter estimates using the empirical Bayes (EB) approach, combining maximum likelihood estimation (MLE) and a Markov chain Monte Carlo algorithm to estimate the model parameters jointly. The viability of the model is verified by numerical simulation, and the proposed model is applied to an actual small area estimation problem. Compared with the conventional linear model with normal random effects, the proposed model is more accurate for the estimation of complex real-world data, which makes it suitable for a broader range of application contexts.
Keywords:
small area estimation; nested error regression models; Bayesian nonparametric estimation; Poisson–Dirichlet process; MCMC algorithm
MSC:
62G05; 62J05
1. Introduction
The study of small area estimation has received increasing attention due to the growing demand for reliable estimates in government, enterprise, agriculture, commerce, and socioeconomic fields. The issue of how to improve estimation accuracy and obtain reliable estimates of subpopulation parameters is known as small area estimation. The term “small area” usually refers to a small geographic area, such as a county, city, or state, or a small population group that is cross-classified by demographic characteristics, such as a certain age, gender, or ethnic group. Since most sample surveys are designed to estimate certain parameters for the overall population, there are no particular requirements for subgroup sample sizes, so the parameter estimates for subgroups may not be as accurate as desired. Furthermore, if the sample size of a survey is determined to achieve a specified level of accuracy on a large scale, there may not be enough resources to conduct a second survey aimed at achieving a similar level of accuracy on a smaller scale.
The application of indirect estimation, in particular model-based estimation, to small areas has been widely accepted as a way to address this issue. Recognizing the commonality or similarity of several small areas in certain aspects, we can construct reliable small area estimates with the help of well-defined models, thereby borrowing relevant auxiliary information from other variables in the area, from neighboring small areas, or even from other sources, such as population census data. This concept has greatly contributed to the development of the small area estimation problem, and various models and methods have been proposed to estimate small area parameters by borrowing auxiliary information; see, for example, the reviews by Rao [1] and Pfeffermann [2]. Rao and Molina [3] give a detailed and comprehensive description of the models and methods for small area estimation.
Fay and Herriot [4] were the first to propose an area-level model, the Fay–Herriot model, to solve the problem of small area estimation of per capita income in the USA. Battese, Harter, and Fuller [5] were the first to adopt a unit-level model, the Nested Error Regression model, which combines agricultural survey data and satellite data to estimate the average acreage of crops in twelve counties in the state of Iowa in the United States. With the growing demand for small area estimation, a large body of literature has extended these two base models to accommodate different data structures and requirements. For example, Arima et al. [6] provided two extensions of the Fay–Herriot model. The first is a multivariate extension, which jointly models the survey estimates of two or more different but related demographic characteristics; the second builds a functional measurement error model into the original Fay–Herriot model to handle covariates measured with error. Yang and Chen [7] improved the Nested Error Regression model by clustering small areas based on their centers to obtain a new model and estimated the model parameters based on it.
The accuracy and precision of small area estimators depend on the validity of the model. In this context, we pay particular attention to an important underlying assumption, namely the normality of the random effects. This assumption is often made for little reason other than computational convenience and is not necessarily appropriate; moreover, it is difficult to check in practice because it involves unobservable quantities. Therefore, it is necessary to investigate flexible modeling of the random effects. Datta et al. [8] considered the case where random effects may be absent and proposed a bootstrap hypothesis test for determining the presence or absence of random effects in the areas of the model; this matters because including random effects increases the variability of the estimates when the actual data structure does not contain them. Sugasawa and Kubokawa [9] proposed a Nested Error Regression model with uncertain random effects and gave estimates of the corresponding model parameters. Ferrante and Pacei [10] accounted for the asymmetry of the data they examined and relaxed the normality assumption of the Fay–Herriot model by adopting a skewed distribution; they proposed a multivariate skewed small area model and applied it to business statistics of firms. Fabrizi and Trivisano [11] considered two extensions of the Fay–Herriot model, both concerning the distributional assumption on the random effects, namely an exponential power distribution and a skewed power distribution. Chakraborty et al. [12] proposed a mixture model based on the Fay–Herriot model with random effects following a two-component normal mixture. Diallo and Rao [13] addressed skewness in the response variable and suggested replacing the normality assumption for both random effects and errors with a skew normal distribution. Tsujino and Kubokawa [14] investigated a model in which the random effects remain normal but the errors follow a skew normal distribution, and they gave an expression for predicting the random effects.
In small area estimation, parametric mixed models are usually used to achieve estimation. However, parametric models suffer from model misspecification, which may produce unreliable small area estimates. The application of nonparametric and semiparametric models to small area estimation has been partially considered in the literature. Opsomer et al. [15] used a p-spline approach for nonparametric estimation of the regression component in a Nested Error Regression model. Polettini [16] used a Dirichlet Process mixture model to construct the random effects in the Fay–Herriot model.
To further extend the Nested Error Regression model, this paper proposes an improved Nested Error Regression model in which the default normality assumption for the random effects is replaced by a nonparametric specification, namely the Poisson–Dirichlet Process. In Nested Error Regression models, the random effects are typically assumed to be normally distributed, which simplifies the theoretical and computational analysis of the model. However, the normality assumption may not hold in certain cases, leading to inaccurate model estimates. To address this limitation, we propose a nonparametric specification in place of the default normality assumption. Nonparametric methods do not rely on specific parametric distributional forms; instead, they learn the structure or characteristics directly from the data. The Poisson–Dirichlet Process is proposed as a prior distribution for the random effects. This approach differs from the conventional assumption that the random effects in each area follow a fixed distribution; instead, the model allows the distribution of these random effects to adapt to the data. The Poisson–Dirichlet Process is a two-parameter generalization of the Dirichlet Process in which, in addition to the concentration parameter, an additional parameter called the discount parameter is introduced. As with the Dirichlet Process, samples from the Poisson–Dirichlet Process are discrete distributions with the same support as the base distribution. The distribution underlying the Poisson–Dirichlet Process is the Poisson–Dirichlet Distribution introduced by Pitman and Yor [17]. Owing to these properties, we can use it as a prior distribution for the random effects, which are unobservable. There are many studies related to the Dirichlet Process and the Poisson–Dirichlet Process in Bayesian nonparametrics. For example, Al-Labadi et al. [18] proposed a novel methodology for computing varentropy and varextropy by applying Bayesian estimation of the Dirichlet Process. Handa [19] studied the two-parameter Poisson–Dirichlet Process based on point process theory. Favaro et al. [20] developed a two-parameter Poisson–Dirichlet Process model for the problem of the predictive sampling of species or kinds. Performing exact Bayesian inference for nonparametric models is challenging because it is difficult to derive the posterior distribution, which drives us to use the Markov chain Monte Carlo (MCMC) algorithm [21] for approximate inference.
The paper is organized as follows: Section 2 reviews the classical Nested Error Regression model and the Poisson–Dirichlet Process. Section 3 describes the proposed model. In Section 4, the methods corresponding to the estimation of each parameter in the model and the algorithm flow are given. In Section 5 and Section 6, the feasibility of the model and the corresponding parameter estimation methods are investigated by applying them to simulated and real data. The article is briefly summarized in the last section.
2. Theoretical Background
2.1. Nested Error Regression Models
Battese, Harter, and Fuller [5] published a seminal paper that contributed to the popularization of model-based small area estimation in government statistics. Model-based small area estimation requires appropriate regression models that relate area target variables to suitable auxiliary variables obtained from other surveys and administrative records held by government agencies. These authors presented the Nested Error Regression (NER) model, a unit-level model that combines satellite data with farm-level survey observations, to determine the average corn and soybean acreage in twelve counties in Iowa, USA.
Suppose that the population of interest U is partitioned into m mutually independent small areas , where the total size is , and the sample size is . The sequence consists of the target variable and the corresponding covariate for the jth sample unit in the ith small area, where is an index of the small areas, and is a known p-dimensional vector of covariates. Battese et al. [5] proposed the following normal NER model:
where denotes the unknown p-dimensional vector of the regression coefficients, and and denote the area-specific random effect and error, respectively, for the ith small area. It is assumed that the random effects and errors follow normal distributions, and , and that the random effects and errors are independent of each other. We frequently seek to obtain the estimated mean of the target variable for all small areas through the small area model, also known as the small area mean . Assuming that the mean of the covariate is available for each small area, the ith small area mean is defined by the following formula:
As shown below in the hierarchical Bayesian approach based on the model in Equation (1) for predicting , this is accomplished by giving a prior distribution to the unknown model parameter .
The NER model can be described as follows:
- Define a conditional on and , as well as estimator , for , independently;
- Define a conditional on and small area means , for , independently;
- The model parameters are given a prior distribution with a density .
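To make the classical NER setup concrete, the following minimal sketch simulates unit-level data from the normal NER model and forms the small area means from the area-level covariate means. The number of areas, the per-area sample size, the regression coefficients, and the variance components are illustrative values, not settings taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative settings (not taken from the paper)
m, n_i, p = 20, 10, 2            # areas, units per area, covariate dimension
beta = np.array([1.0, 2.0])      # regression coefficients
sigma_v, sigma_e = 0.5, 1.0      # random-effect and error standard deviations

# Unit-level covariates, normal area random effects, and unit-level errors
x = rng.normal(size=(m, n_i, p))
v = rng.normal(0.0, sigma_v, size=m)            # v_i ~ N(0, sigma_v^2)
e = rng.normal(0.0, sigma_e, size=(m, n_i))     # e_ij ~ N(0, sigma_e^2)

# y_ij = x_ij' beta + v_i + e_ij
y = x @ beta + v[:, None] + e

# Small area mean: plug the area-level covariate mean into the model
x_bar = x.mean(axis=1)                           # mean covariate per area
area_mean = x_bar @ beta + v                     # \bar{Y}_i = \bar{X}_i' beta + v_i
print(area_mean[:3])
```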
2.2. Poisson–Dirichlet Process
The Poisson–Dirichlet Process (PDP), also known as the Pitman–Yor Process, is a two-parameter extension of the Dirichlet Process. Like the Dirichlet Process, the Poisson–Dirichlet Process is a distribution over distributions. The distribution underlying the Poisson–Dirichlet Process is the Poisson–Dirichlet Distribution. Assume that there exists a pair of parameters and such that and . Here, we call the discount parameter and the concentration parameter. Let be a sequence of mutually independent random variables, and let obey the following distribution: ; define as follows:
Meanwhile, satisfies . If we arrange in descending order to obtain , then p follows a Poisson–Dirichlet Distribution, denoted as . Having defined the Poisson–Dirichlet Distribution, we can formally define the Poisson–Dirichlet Process. Assume that the base distribution is a probability distribution on the measurable space . Let be a sequence of independently and identically distributed random variables from the base distribution , and assume that ; then, we define the random probability measure G on to be the Poisson–Dirichlet Process with parameters and and base distribution , which is given by the following:
where denotes the Dirac measure degenerate at the point : . This stick-breaking distribution G is also denoted as . The parameters and determine the power law properties of the Poisson–Dirichlet Process. In practical modeling, the Poisson–Dirichlet Process is often more appropriate than the Dirichlet Process because it exhibits the power law behavior observed in natural language; when the discount parameter is zero, it reduces to the Dirichlet Process.
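As an illustration of the stick-breaking construction just described, the following sketch draws a truncated sample from a Poisson–Dirichlet Process. The truncation level, the parameter values, and the standard normal base distribution are illustrative assumptions, and the mapping of the two parameters to a discount and a concentration argument follows the usual Pitman–Yor convention.

```python
import numpy as np

def pdp_stick_breaking(discount, concentration, base_sampler, n_atoms=1000, rng=None):
    """Truncated stick-breaking draw from a Poisson-Dirichlet (Pitman-Yor) Process.

    discount in [0, 1); concentration > -discount.
    base_sampler(size, rng) draws atoms from the base distribution G0.
    """
    rng = np.random.default_rng() if rng is None else rng
    # V_k ~ Beta(1 - discount, concentration + k * discount), k = 1, 2, ...
    k = np.arange(1, n_atoms + 1)
    v = rng.beta(1.0 - discount, concentration + k * discount)
    # pi_k = V_k * prod_{l < k} (1 - V_l)
    weights = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    atoms = base_sampler(n_atoms, rng)
    return weights, atoms

# Example: base distribution G0 = N(0, 1), illustrative parameter values
w, a = pdp_stick_breaking(discount=0.3, concentration=1.0,
                          base_sampler=lambda n, r: r.normal(size=n))
print(w[:5].round(4), a[:5].round(2))
```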
3. Small Area Model with PDP Random Effects
As in the NER model, in the proposed model, we assume that the observations are a set of data associated with independent and identically distributed random effects . We consider replacing the normal random effects prior distribution of the NER model with a Bayesian nonparametric prior; namely, we assume that is independently and identically distributed, thus obeying an unknown probability measure . At this moment, the distribution of the random effects , as an unknown quantity, can be given a Bayesian nonparametric prior. In this section, by assuming and , we introduce the Poisson–Dirichlet Process with parameters and and a base distribution as a prior to assign the distributions of the random effects to the model, and we obtain a unit-level small area model with PDP random effects:
where is a known p-dimensional auxiliary variable associated with the observation , is an unknown p-dimensional vector of regression coefficients, is a random error with a mean of 0 and a variance of , and is a random effect whose magnitude reflects the differences between units in different areas; and are mutually independent, and they can be expressed as follows:
where is the base distribution, is the discount parameter, and is the concentration parameter. The parameters and determine not only the power law properties of the Poisson–Dirichlet Process, but also the probability that a new random effect will be sampled given that already exists. In the proposed model, when a new random effect is drawn, it either comes from one of the existing classes of or takes a new value from . If comes from one of the existing classes of , its probability is positively related to the number of data points contained in that class. The larger the value of the parameter , the greater the probability that will be drawn from the base distribution as a new class. For the random effects variable , let be the elements of that are not identical to each other, let be the number of classes in , and let be the number of elements in the kth class; then the sampling probability of is as follows:
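This sequential sampling rule can be sketched as a Pitman–Yor analogue of the Chinese restaurant process: a new random effect joins an existing class with probability positively related to that class's size, or opens a new class drawn from the base distribution. The function and argument names below are ours, and the parameter values in the usage line are hypothetical.

```python
import numpy as np

def sample_pdp_effects(m, discount, concentration, base_sampler, rng=None):
    """Draw m random effects sequentially via the Pitman-Yor predictive rule.

    A new draw joins existing class k with probability (n_k - discount) / (concentration + n),
    and takes a fresh value from G0 with probability
    (concentration + discount * K) / (concentration + n),
    where n is the number of previous draws and K the number of distinct classes.
    """
    rng = np.random.default_rng() if rng is None else rng
    values, counts, effects = [], [], []       # distinct class values, sizes, and draws
    for n in range(m):
        K = len(values)
        probs = np.array([c - discount for c in counts] +
                         [concentration + discount * K], dtype=float)
        probs /= concentration + n
        idx = rng.choice(K + 1, p=probs)
        if idx == K:                            # open a new class from G0
            values.append(base_sampler(rng))
            counts.append(1)
        else:                                   # join an existing class
            counts[idx] += 1
        effects.append(values[idx])
    return np.array(effects)

# Illustrative usage with a standard normal base distribution
v = sample_pdp_effects(20, 0.3, 1.0, lambda r: r.normal(0.0, 1.0))
```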
The hierarchical structure of our model can be represented as follows.
The NER model with PDP random effects:
- Define a conditional on , , , and ; estimator is given by , for , independently;
- Define a conditional on , , and ; the random effects are given by and , for , independently.
4. Estimation
In small area estimation, we aim to obtain good estimators of the small area mean . Based on the proposed model, we mainly consider the estimation of the model parameters , , , , and , as well as the conditional mean given the data .
In the small area model with PDP random effects, for the ith small area, the random effects that come from can be divided into mutually exclusive classes , and we define , where denotes the number of elements contained in the jth class of the ith small area, and denotes the number of mutually exclusive classes in . Assume that we assign the data points according to the classes of to obtain groups: . The likelihood function of the model parameters , , , , and is
where denotes the marginal expectation with respect to x, f denotes the normal probability density function, and denotes the density function of the base distribution . This likelihood function is too complex to be maximized analytically or numerically, so we consider applying empirical Bayes estimation to estimate the model parameters.
In this section, we consider the application of empirical Bayesian nonparametric methods to study the estimation of the regression coefficients and the error variance for the small area model with PDP random effects, as well as the estimation of the discount parameter , the concentration parameter , and the base distribution for the Poisson–Dirichlet Process, given the known data . We explain the algorithms used to derive these estimates in detail.
4.1. Proposed Approach
4.1.1. Estimation of Regression Coefficients and Error Variance
We first consider the estimation of the regression coefficients and the error variance in the model. Assuming that the random effects are known, by rewriting Equation (5) and defining in the model, we obtain the following:
We consider a matrix representation of the above equation such that , and is a matrix with column rank and error . In this case, , , and ; thus, we again obtain the following:
For the above linear regression model, we use the classical least squares estimation algorithm to obtain the estimates of the regression coefficients and the error variance .
For the regression coefficients , the objective of least squares estimation is to find an estimate of the regression coefficients that minimizes the sum of the squared residuals of all sample observations . The sum of the squared residuals is easily obtained by derivation:
By minimizing the sum of the squared residuals and assuming that exists, an estimate of the regression coefficient can be obtained as follows.
For the estimation of the error variance , we first consider that the error vector is an unobservable vector. Suppose we replace with the least squares estimate of , thus defining the residual vector . It is natural to consider using the residual sum of squares as a measure of the magnitude of . This can be obtained by substituting Equation (16) and into the residual sum of squares :
Furthermore, we compute the expectation of the ; then, we can obtain an unbiased estimate of the error variance :
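Assuming the random effects are known and subtracted from the response, as in the derivation above, the regression coefficients and the error variance reduce to ordinary least squares quantities. The sketch below implements the usual closed forms, with the unbiased variance estimate dividing by n minus p; the data in the usage example are synthetic.

```python
import numpy as np

def ols_beta_sigma2(X, y):
    """Least squares estimate of the regression coefficients and an unbiased
    estimate of the error variance, assuming X has full column rank:
    beta_hat = (X'X)^{-1} X'y;  sigma2_hat = ||y - X beta_hat||^2 / (n - p)."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)
    return beta_hat, sigma2_hat

# Usage on a synthetic working response z = y - v (random effects subtracted)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
z = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.8, size=200)
beta_hat, sigma2_hat = ols_beta_sigma2(X, z)
```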
4.1.2. Estimation of the Base Distribution and Two Parameters of the Poisson–Dirichlet Process
We first discuss the estimation of the base distribution of the Poisson–Dirichlet Process for random effects. Yang and Wu [22] proposed the application of the multivariate kernel density method to estimate the base distribution under the Dirichlet Process prior; Qiu, Yuan, and Zhou [23] considered applying the multivariate kernel density method to estimate the base distribution of the Poisson–Dirichlet Process under a multigroup data structure. We can also apply the multivariate kernel density method to realize the estimation of the base distribution of our model.
Assuming that the random effect is known, we can equivalently obtain , , and . The density function of the base distribution can then be estimated using the following equation:
where is some kernel function with bandwidth . We choose the kernel function as a Gaussian kernel function to realize the estimation.
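A minimal sketch of this Gaussian kernel density step is given below. The distinct random-effect values in the usage example are hypothetical, and the bandwidth rule (Silverman's rule of thumb) is our assumption, since the text does not state a bandwidth choice.

```python
import numpy as np

def gaussian_kde_density(samples, grid, bandwidth=None):
    """Kernel density estimate with a Gaussian kernel, evaluated on `grid`.
    If no bandwidth is supplied, Silverman's rule of thumb is used (an assumption;
    the paper does not specify a bandwidth rule)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if bandwidth is None:
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    u = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth

# Estimate the base-distribution density from hypothetical distinct random-effect values
distinct_v = np.array([-0.8, -0.1, 0.3, 0.9, 1.4])
grid = np.linspace(-3, 3, 200)
g0_hat = gaussian_kde_density(distinct_v, grid)
```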
Next, we discuss the estimation of the two parameters and of the Poisson–Dirichlet Process for the random effects. Similar to Carlton [24], who studied the estimation of the parameters of the Poisson–Dirichlet Process for a single set of data, we apply the maximum likelihood method to estimate the parameters and .
For each small area , define to denote the number of categories containing j individuals in the ith small area, where , and . Denote . Then, the log likelihood functions for parameters and are given below:
where is the given observed data, and . The maximum likelihood estimates of the parameters and are obtained by solving the equations:
Here, we use the numerical method of Newton–Raphson iteration to obtain the above maximum likelihood estimates and .
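As a numerical illustration of this step, the sketch below maximizes the standard Pitman–Yor exchangeable partition probability function over the discount and concentration parameters for a single area's class sizes. A derivative-free optimizer is used here in place of the Newton–Raphson iteration described in the text, and the class sizes are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def pdp_negloglik(params, class_sizes):
    """Negative log-likelihood of (discount, concentration) under the Pitman-Yor
    exchangeable partition probability function, for one area's class sizes."""
    d, a = params
    n, K = int(np.sum(class_sizes)), len(class_sizes)
    if not (0.0 <= d < 1.0) or a <= -d:
        return np.inf                                   # outside the parameter space
    ll = sum(np.log(a + i * d) for i in range(1, K))    # prod_{i=1}^{K-1} (a + i d)
    ll -= sum(np.log(a + j) for j in range(1, n))       # rising factorial (a + 1)_{n-1}
    ll += sum(np.log(l - d)                              # prod_k (1 - d)_{n_k - 1}
              for sz in class_sizes for l in range(1, sz))
    return -ll

# Hypothetical class sizes for one small area (sizes of tied random-effect classes)
sizes = [5, 3, 1, 1]
res = minimize(pdp_negloglik, x0=[0.3, 1.0], args=(sizes,), method="Nelder-Mead")
d_hat, a_hat = res.x
```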
4.2. Algorithms
The random effects are unobservable hidden variables, and we construct the following pseudoestimates:
Given the observed data Y, we can introduce an algorithm that is computed according to the following iterative formula given some initial values of the parameters , , , , and :
where represents the posterior expectation , and is calculated by applying to replace the unknown parameter when estimating the parameter in the th iteration. Then, we can obtain the parameter estimates:
However, computing the above posterior expectation during the iterative process is not easy, so we apply the MCMC algorithm to obtain a numerical solution. In the following, we discuss the computational estimation of the model in three stages, namely the selection of the initial values, the full conditional distribution of the MCMC algorithm, and the sampling and estimation.
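The iterative scheme just described can be sketched as a Monte Carlo EM-type loop: given the current parameter values, draw the hidden random effects, then re-estimate the parameters from the completed data. In the sketch below, the PDP full conditional is replaced by a simple normal-posterior stand-in so that the example stays self-contained; this stand-in is an assumption for illustration only and is not the sampler developed in the following subsections.

```python
import numpy as np

def mcem_loop(y, X, areas, n_iter=50, B=100, rng=None):
    """Skeleton of the iterative scheme: at iteration r, draw the hidden random
    effects given the current parameter estimates, then re-estimate the parameters
    from the completed data. The PDP full conditional is replaced here by a normal
    posterior stand-in (an assumption, not the paper's sampler).

    areas: integer array of length n with labels 0..m-1 mapping units to areas."""
    rng = np.random.default_rng() if rng is None else rng
    m = areas.max() + 1
    n, p = X.shape
    beta = np.zeros(p)
    sigma2, tau2 = 1.0, 1.0                    # error and random-effect variances (stand-in)
    for _ in range(n_iter):
        # "E-step": draw v_i | y, current parameters (normal stand-in for the MCMC step)
        v_draws = np.zeros((B, m))
        resid = y - X @ beta
        for i in range(m):
            r_i = resid[areas == i]
            post_var = 1.0 / (r_i.size / sigma2 + 1.0 / tau2)
            post_mean = post_var * r_i.sum() / sigma2
            v_draws[:, i] = rng.normal(post_mean, np.sqrt(post_var), size=B)
        v_bar = v_draws.mean(axis=0)
        # "M-step": least squares on y - v, then simple variance updates
        z = y - v_bar[areas]
        beta, *_ = np.linalg.lstsq(X, z, rcond=None)
        sigma2 = np.mean((z - X @ beta) ** 2)
        tau2 = np.mean(v_draws ** 2)
    return beta, sigma2, v_bar
```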
4.2.1. Selection of Initial Values
During each iteration, we need to give the initial values of the parameters to achieve the corresponding parameter estimation.
We first consider the initial value of the base distribution , assuming that f is a normal density function, and we obtain the maximum likelihood estimate by solving the equation . This results in a kernel estimate of as the initial value of the base distribution :
For the selection of the two parameters and , in the absence of information, we can choose and to be some random values that satisfy the requirements and . Given , and , we obtain the hidden variable by extracting it from so that we can obtain the initial values and of the regression coefficient and the error variance from the least squares estimation.
4.2.2. Full Conditional Distributions of the MCMC Algorithm
Given the observations and , let denote the residual vector with removed from , let denote the number of mutually exclusive elements in , and let denote the number of elements taking the value in . In order to apply the MCMC algorithm to compute the posterior expectation, we need to discuss the full conditional distribution of the MCMC algorithm.
Theorem 1.
For each , given and , the conditional distribution of is
where , and , thus satisfying the condition ; denotes the posterior distribution of given observation .
Proof of Theorem 1.
The posterior distribution of the Poisson–Dirichlet Process is known to be
The conditional distribution of is obtained as given in :
We let
Then, we obtain
Let , and let , thus satisfying the condition and making ; thus, the conditional distribution of is obtained as follows:
□
4.2.3. Sampling and Estimation
Given the initial values of the parameters and the full conditional distribution of the MCMC algorithm, we present the sampling stage of the iterative computation. For all , with the parameter estimates already obtained in the rth iteration, we consider sampling from the posterior distribution of the hidden variable . The sampling phase is divided into two steps: the first step is Gibbs sampling based on the full conditional distribution of the MCMC algorithm, and the second step is an accelerated step, in which an auxiliary parameter is introduced to sample at the end of each iteration. If the sampling is done directly from the full conditional distribution, two problems may arise: f and may not be conjugate, and the MCMC chain may converge slowly when is relatively large with respect to . The method of introducing auxiliary parameters addresses both of these problems.
To start, we draw from , for , where B stands for the number of repetitions within an iteration and is a sufficiently large number. We then introduce a two-step update with an auxiliary variable based on Equation (34), following the method of Neal [25].
Define the auxiliary variable to denote which category belongs to in , and use to denote the number of occurrences of in . Assuming that , , , and have been obtained, we first consider updating the auxiliary variables.
For the update of , suppose that if is removed; at this point, if . Otherwise, the value of is the same as that of . Then, the new is drawn again, and the distribution of the new is taken as follows:
where w is a normalization parameter designed to satisfy the condition .
Through the sampling stage, we can obtain . Next, we can estimate the parameters based on the obtained hidden variables . We obtain the estimates of each parameter according to the algorithm mentioned earlier; then, we can obtain the th iteration estimates:
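A compact sketch of a Gibbs sweep with auxiliary atoms, in the spirit of the auxiliary-variable method of Neal [25] referenced above, is given below. It updates one area random effect at a time by scoring the existing distinct values together with a few fresh draws from the base distribution. The function signature, the number of auxiliary atoms, and the normal likelihood for the area residuals are our assumptions, and the sketch simplifies Neal's exact algorithm.

```python
import numpy as np

def gibbs_update_effects(resid_by_area, v, discount, concentration, sigma2,
                         base_sampler, n_aux=3, rng=None):
    """One Gibbs sweep over the area random effects using auxiliary atoms
    (a sketch in the spirit of Neal's auxiliary-variable sampler, adapted to
    Pitman-Yor weights; not the paper's exact sampler).

    resid_by_area: list of arrays, residuals y_ij - x_ij' beta for each area i.
    v: current vector of area random effects (one value per area).
    base_sampler(size, rng) draws fresh atoms from the base distribution G0."""
    rng = np.random.default_rng() if rng is None else rng
    v = v.copy()
    for i in range(len(v)):
        others = np.delete(v, i)
        uniq, counts = np.unique(others, return_counts=True)
        K = uniq.size
        # candidates: existing distinct effects plus n_aux fresh atoms from G0
        cand = np.concatenate([uniq, base_sampler(n_aux, rng)])
        # prior weights: (n_k - discount) for existing classes,
        # (concentration + discount * K) split equally over the auxiliary atoms
        w_prior = np.concatenate([counts - discount,
                                  np.full(n_aux, (concentration + discount * K) / n_aux)])
        # log-likelihood of area i's residuals at each candidate effect (normal errors)
        r = resid_by_area[i]
        loglik = -0.5 * ((r[None, :] - cand[:, None]) ** 2).sum(axis=1) / sigma2
        logw = np.log(w_prior) + loglik
        prob = np.exp(logw - logw.max())
        prob /= prob.sum()
        v[i] = rng.choice(cand, p=prob)
    return v

# Illustrative usage with synthetic residuals and a standard normal base distribution
rng = np.random.default_rng(0)
resid_by_area = [rng.normal(loc=mu, scale=1.0, size=8) for mu in (-1.0, 0.0, 1.5)]
v_new = gibbs_update_effects(resid_by_area, np.zeros(3), 0.3, 1.0, 1.0,
                             lambda n, r: r.normal(size=n), rng=rng)
```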
5. Simulation
This section provides simulation results to study the performance of the proposed parameter estimation under a small area model with PDP random effects.
5.1. Model Setup and Simulation Conditions
We first design a finite population containing small areas, and we take a certain number of samples from each small area. For convenience, we set the sample size of each small area to .
Then, we provide the following model
The basic simulation assumptions are as follows:
- Five choices of parameters are: , , , , or , and the base distribution is set to be , , or ;
- The error comes from a normal distribution ;
- The true value of the regression coefficient is set to ;
- The initial parameter values are set to be , , , , or when is , , , , or ;
- The initial random effects are set to ;
- The number of iterations for the MCMC algorithm is set to .
5.2. Simulation Results and Analysis
The simulation results are given in detail in the following figures and tables. Table 1 shows the simulation results for the estimation of the corresponding parameters when varying the values of the parameters and and the base distribution under the PDP prior for the random effects in the proposed model. The first column in Table 1 lists the parameters that were estimated, the second column shows the setting of the base distribution for each case, the third column shows the true values of the corresponding parameters, the fourth and fifth columns show the bias and the MSE, and the sixth column shows the 95% confidence interval. Figure 1 shows the density estimation curves for the parameters and under the different simulated datasets.
Table 1.
Performance of parameter estimation.
Figure 1.
Density estimation curve of and . (a) based on ; (b) based on ; (c) based on ; (d) based on ; (e) based on .
In Table 1, we calculated the bias, MSE, and confidence intervals of and for different types of base distributions. We assumed that the base distribution follows the normal distribution and the t distribution, respectively. Here, we chose t distributions with five and two degrees of freedom as the base distributions for estimation. The results of these estimators obtained under two different types of base distribution conditions were similar. The biases of were inside the interval , and the biases of were inside the interval . The MSEs of were less than 0.04, although the MSEs of were large, and most of the estimated confidence intervals failed to capture the true values of the parameters. Meanwhile, we considered the bias, MSE, and confidence intervals of and with different true values for and under the condition where the base distribution follows a normal distribution or a t distribution. From Table 1, we can see that when the base distribution followed the normal distribution and , the bias and MSE of were larger than those of other cases. The bias and MSE of were the smallest when the base distribution followed and . The density curve of and in Figure 1 also agrees with the estimated results of Table 1.
Table 1 lists the estimation results of the regression coefficients given different values of the parameters and and different base distribution conditions. The biases and MSEs of and were maintained at very low levels, which reflect the accuracy and reliability of the estimates. Figure 2 shows the density estimation curve of and when based on , which coincides with the estimation results in Table 1.
Figure 2.
Density estimation curve of regression coefficients when based on .
Table 2 presents a comparison between the estimates of the regression coefficients and obtained from the proposed model and those obtained from the NER model, as derived from five sets of simulations. Table 2 demonstrates that the estimates of and from both the proposed model and the NER model were very close to the true values. However, for the simulation in which the base distribution followed the normal distribution, the NER model estimates were slightly more accurate than those of the proposed model. Conversely, when the base distribution followed the t distribution, the proposed model estimates were more precise than those of the NER model.
Table 2.
Estimated mean of regression coefficients.
The Total Variation Distance (TVD) is a statistical distance between probability distributions; here, it measures the distance between the true base distribution and its estimate and thus serves as a basis for evaluating the estimation performance. Figure 3 shows the Total Variation Distance between the two distributions at each iteration under the five scenarios. As can be seen from Figure 3, the gap between the estimated distribution and the true distribution was small, with the total variation distance below 0.3 in every iteration.
Figure 3.
Total variation distance between two distributions at each iteration. (a) based on ; (b) based on ; (c) based on ; (d) based on ; (e) based on .
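For reference, the total variation distance between two estimated densities on a common grid can be approximated as half the integral of their absolute difference. The sketch below does this with the trapezoidal rule on two synthetic densities; the grid and the densities are illustrative.

```python
import numpy as np

def total_variation_distance(p, q, grid):
    """Total variation distance between two densities on a common grid:
    TVD = 0.5 * integral |p(x) - q(x)| dx, approximated by the trapezoidal rule."""
    return 0.5 * np.trapz(np.abs(p - q), grid)

# Example: TVD between a standard normal density and a shifted, scaled normal density
grid = np.linspace(-5, 5, 1001)
p = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
q = np.exp(-0.5 * ((grid - 0.2) / 1.1)**2) / (1.1 * np.sqrt(2 * np.pi))
print(total_variation_distance(p, q, grid))
```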
Table 3 shows the simulation results for the estimates of all small area means when and the base distribution is . The first column in Table 3 shows the mean of , and the second column shows the estimated small area mean. As can be seen from Table 3, the estimates of the small area means were accurate and close to the true means, without large deviations.
Table 3.
Results of small area means when based on .
In order to more directly reflect the reliability of the estimation of the small area means, we introduced squared residuals to measure the agreement between the estimated and true values of the small area means, and we present graphs of the squared residuals of the small area means under the five scenarios. As shown in Figure 4, the squared residuals for all small area means were small, with most of them concentrated between 0 and 0.3.
Figure 4.
Squared residuals results of small area means. (a) based on ; (b) based on ; (c) based on ; (d) based on ; (e) based on .
5.3. Simulated Normal Data
In this section, we use simulated data to demonstrate the estimation procedure and to test the strengths and weaknesses of the model. Similar to the simulations in the previous section, and for ease of computation, we assume that there are small areas, and the sample size for each small area is set to . The simulation data were generated from the following model structure:
where the random effects come from a normal distribution , and the errors come from a normal distribution . The true value of the regression coefficient was set to .
Then, we used the proposed model with PDP random effects to obtain the corresponding estimates. Table 4 shows the results of the parameter estimation. The estimates of the fixed effects in Table 4 were very close to the true values, and Table 4 also reports the estimates of the parameters and of the PDP prior for the random effects of the model. Based on the estimates of the parameters and and the estimated base distribution of the random effects, we further obtained the estimates of the means of all the small areas through the model, which are presented in Table 5. Table 5 shows that the estimation of the small area means was reliable, with estimates close to the true means and without major deviations. Figure 5 shows the squared residuals for all the small area means; the squared residuals were small, demonstrating the validity of our model and method.
Table 4.
Performance of parameter estimation for the simulated data.
Table 5.
Results of small area means for the simulated data.
Figure 5.
Squared residuals results of small area means of simulated data.
6. Application
Next, we applied the proposed model to a dataset of income and other sociological variables for the Spanish provinces [26], which is available in the R package sae [27]. This dataset contains 20 regions, 21 variables, and a total of 1050 observation units. We screened and combined these variables to select four: incomedata, age, edu, and sex, representing total income, age, education level, and gender, respectively. Because the central aim of the survey was to understand the income levels of the Spanish provinces, incomedata was used directly as the response variable in the model. The remaining three variables, age, edu, and sex, are considered closely related to income and therefore served as auxiliary variables in the model. We removed any units with missing values in these variables and normalized the data.
The proposed model can be described as follows:
Based on the proposed model, by applying the parameter estimation method in Section 4 and running the provided algorithm for two hundred iterations, we obtained the estimation results for each parameter, which are shown in Table 6. Figure 6 gives the density plot of the estimated base distribution.
Table 6.
Performance of parameter estimation for the real data.
Figure 6.
Density of estimated base distribution of real data.
7. Conclusions
In this paper, we proposed using the Poisson–Dirichlet Process in a Nested Error Regression model to provide prior distributions for the random effects in unit-level data. In the small area model, since the random effects are unobservable hidden variables, we applied the MCMC algorithm to draw the random effects given fixed initial values and constructed the parameter estimates of the prior; we then gave estimates of parameters such as the regression coefficients and the base distribution with the random effects treated as known. Through numerical simulations and an application to example data, we demonstrated the feasibility of the studied model and the practicality of the estimation algorithm.
Our proposed model and its parameter estimation method have significant advantages. First, the Poisson–Dirichlet Process prior can flexibly capture the nonparametric properties of the random effects, thus overcoming the limitations of the traditional normality assumption and improving the adaptability and accuracy of the model. Second, the effective application of the MCMC algorithm ensures the robustness and accuracy of the parameter estimation, especially when dealing with complex data and models. Both the theoretical and simulation results confirm these advantages, making our model and method widely applicable and effective in practice.
Although our proposed model and method performed well in several aspects, there are still some shortcomings and directions worthy of further research. The computational complexity of the models is high, especially when dealing with large-scale datasets, and the computational cost may become a limiting factor for their application. Therefore, developing more efficient computational methods and algorithms is a future research focus. In addition, with the development of data science, combining new statistical techniques and machine learning algorithms to improve and optimize the model is also a worthy research direction.
Author Contributions
Conceptualization, X.Z.; methodology, Q.K. and X.Z.; software, Q.K. and X.Z.; validation, Q.K. and X.Z.; formal analysis, Q.K. and X.Z.; investigation, X.Z.; resources, X.Q.; writing—original draft, Q.K.; writing—review and editing, X.Z.; visualization, X.Q.; supervision, Y.L.; project administration, X.Q. and Y.L.; funding acquisition, X.Q. and Y.L. All authors have read and agreed to the published version of the manuscript.
Funding
This work is supported by the National Natural Science Foundation of China (Grant Nos. 12032016 and 12372277).
Data Availability Statement
The data that support the findings of this study are openly available, which can be downloaded from https://cran.r-project.org/web/packages/sae/index.html (accessed on 15 April 2024).
Acknowledgments
We thank the associate editor and the reviewers for their useful feedback that improved the quality and clarity of this paper.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Rao, J.N. Small Area Estimation; John Wiley & Sons: Hoboken, NJ, USA, 2005; Volume 331. [Google Scholar]
- Pfeffermann, D. New important developments in small area estimation. Stat. Sci. 2013, 28, 40–68. [Google Scholar] [CrossRef]
- Rao, J.N.; Molina, I. Small Area Estimation; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar]
- Fay, R.E., III; Herriot, R.A. Estimates of income for small places: An application of James-Stein procedures to census data. J. Am. Stat. Assoc. 1979, 74, 269–277. [Google Scholar] [CrossRef]
- Battese, G.E.; Harter, R.M.; Fuller, W.A. An error-components model for prediction of county crop areas using survey and satellite data. J. Am. Stat. Assoc. 1988, 83, 28–36. [Google Scholar] [CrossRef]
- Arima, S.; Bell, W.R.; Datta, G.S.; Franco, C.; Liseo, B. Multivariate Fay–Herriot Bayesian estimation of small area means under functional measurement error. J. R. Stat. Soc. Ser. A Stat. Soc. 2017, 180, 1191–1209. [Google Scholar] [CrossRef]
- Yang, Z.; Chen, J. Small area mean estimation after effect clustering. J. Appl. Stat. 2020, 47, 602–623. [Google Scholar] [CrossRef] [PubMed]
- Datta, G.S.; Hall, P.; Mandal, A. Model selection by testing for the presence of small-area effects, and application to area-level data. J. Am. Stat. Assoc. 2011, 106, 362–374. [Google Scholar] [CrossRef]
- Sugasawa, S.; Kubokawa, T. Bayesian estimators in uncertain nested error regression models. J. Multivar. Anal. 2017, 153, 52–63. [Google Scholar] [CrossRef]
- Ferrante, M.R.; Pacei, S. Small domain estimation of business statistics by using multivariate skew normal models. J. R. Stat. Soc. Ser. A Stat. Soc. 2017, 180, 1057–1088. [Google Scholar] [CrossRef]
- Fabrizi, E.; Trivisano, C. Robust linear mixed models for small area estimation. J. Stat. Plan. Inference 2010, 140, 433–443. [Google Scholar] [CrossRef]
- Chakraborty, A.; Datta, G.S.; Mandal, A. A two-component normal mixture alternative to the Fay-Herriot model. Stat. Transit. New Ser. 2016, 17, 67–90. [Google Scholar]
- Diallo, M.S.; Rao, J. Small area estimation of complex parameters under unit-level models with skew-normal errors. Scand. J. Stat. 2018, 45, 1092–1116. [Google Scholar] [CrossRef]
- Tsujino, T.; Kubokawa, T. Empirical Bayes methods in nested error regression models with skew-normal errors. Jpn. J. Stat. Data Sci. 2019, 2, 375–403. [Google Scholar] [CrossRef]
- Opsomer, J.D.; Claeskens, G.; Ranalli, M.G.; Kauermann, G.; Breidt, F.J. Non-parametric small area estimation using penalized spline regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 2008, 70, 265–286. [Google Scholar] [CrossRef]
- Polettini, S. A Generalised Semiparametric Bayesian Fay–Herriot Model for Small Area Estimation Shrinking Both Means and Variances. Bayesian Anal. 2016, 12, 729–752. [Google Scholar] [CrossRef]
- Pitman, J.; Yor, M. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Ann. Probab. 1997, 25, 855–900. [Google Scholar] [CrossRef]
- Al-Labadi, L.; Hamlili, M.; Ly, A. Bayesian Estimation of Variance-Based Information Measures and Their Application to Testing Uniformity. Axioms 2023, 12, 887. [Google Scholar] [CrossRef]
- Handa, K. The two-parameter Poisson–Dirichlet point process. Bernoulli 2009, 15, 1082–1116. [Google Scholar] [CrossRef]
- Favaro, S.; Lijoi, A.; Mena, R.H.; Prünster, I. Bayesian non-parametric inference for species variety with a two-parameter Poisson–Dirichlet process prior. J. R. Stat. Soc. Ser. B Stat. Methodol. 2009, 71, 993–1008. [Google Scholar] [CrossRef]
- Gelman, A.; Carlin, J.B.; Stern, H.S.; Rubin, D.B. Bayesian Data Analysis, 3rd ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2013. [Google Scholar]
- Yang, L.; Wu, X. Estimation of Dirichlet process priors with monotone missing data. J. Nonparametr. Stat. 2013, 25, 787–807. [Google Scholar] [CrossRef]
- Qiu, X.; Yuan, L.; Zhou, X. MCMC sampling estimation of Poisson-Dirichlet process mixture models. Math. Probl. Eng. 2021, 2021, 6618548. [Google Scholar] [CrossRef]
- Carlton, M.A. Applications of the Two-Parameter Poisson-Dirichlet Distribution; University of California: Los Angeles, CA, USA, 1999. [Google Scholar]
- Neal, R.M. Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. 2000, 9, 249–265. [Google Scholar] [CrossRef]
- Molina, I.; Rao, J.N. Small area estimation of poverty indicators. Can. J. Stat. 2010, 38, 369–385. [Google Scholar] [CrossRef]
- Molina, I.; Marhuenda, Y. sae: An R Package for Small Area Estimation. R J. 2015, 7, 81–98. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).