4.1. Subarea-Level Models of the NASS
The subarea-level models were first developed by Fuller and Goyeneche [18] and later studied in a frequentist framework by Torabi and Rao [19] and Rao and Molina [20]. Erciulescu et al. [5] presented a subarea-level model in a Bayesian framework, where the area represents the agricultural statistics district (a group of neighboring counties within a state, hereafter denoted as ASD) and the subarea represents the county, and studied its performance under different scenarios of data availability.
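Let $i = 1, \ldots, m$ be an index for the $m$ ASDs in the state, $j = 1, \ldots, n_i$ be an index for the $n_i$ counties in the $i$th ASD and $n_{ij}$ be the sample size of the $j$th county in the $i$th ASD. The total number of counties within a state is $n = \sum_{i=1}^{m} n_i$, and the state sample size is $\sum_{i=1}^{m} \sum_{j=1}^{n_i} n_{ij}$. The direct estimate in county $j$ within the $i$th ASD is denoted by $\hat{\theta}_{ij}$, and the associated variance estimated from the survey is denoted by $\hat{\sigma}^2_{ij}$. Illustrated for one state, one commodity and one parameter (i.e., yield), the model is
$$\begin{aligned}
\hat{\theta}_{ij} \mid \theta_{ij} &\overset{ind}{\sim} N(\theta_{ij}, \hat{\sigma}^2_{ij}), \\
\theta_{ij} \mid \boldsymbol{\beta}, v_i, \sigma_u^2 &\overset{ind}{\sim} N(\mathbf{x}_{ij}^{\top} \boldsymbol{\beta} + v_i, \sigma_u^2), \\
v_i \mid \sigma_v^2 &\overset{iid}{\sim} N(0, \sigma_v^2),
\end{aligned} \tag{4}$$
where $\Omega = (\boldsymbol{\beta}, \sigma_u^2, \sigma_v^2)$ is a set of nuisance parameters.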
A diffuse prior is adopted for the coefficient vector $\boldsymbol{\beta}$ (i.e., a bivariate normal prior distribution with a fixed and known mean and variance-covariance matrix), specified in terms of $\hat{\boldsymbol{\beta}}$ and $\hat{\Sigma}_{\hat{\beta}}$. Here, $\hat{\boldsymbol{\beta}}$ are the least squares estimates of $\boldsymbol{\beta}$ obtained from fitting a simple linear regression model of the county-level direct estimates from the survey on the auxiliary data $\mathbf{x}_{ij}$, and $\hat{\Sigma}_{\hat{\beta}}$ is the estimated covariance matrix of $\hat{\boldsymbol{\beta}}$. Identical non-informative uniform prior distributions are adopted for the two variance components, $\sigma_u^2$ (county level) and $\sigma_v^2$ (ASD level). For more details on the choice of priors for the variance components in Bayesian models, see the discussion by Browne and Draper [21] and Gelman [22].
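As a minimal sketch of how the least squares inputs to the prior described above could be computed, the following Python snippet fits a simple linear regression of direct estimates on a single auxiliary covariate and returns the coefficient estimates together with their estimated covariance matrix. The function name `ols_prior_inputs` and the toy data are illustrative assumptions; this is not the NASS production code, and any scaling applied to the covariance matrix to make the prior diffuse is deliberately left out.

```python
def ols_prior_inputs(x, y):
    """Least squares fit of y = b0 + b1 * x.

    Returns (beta_hat, cov_beta_hat), where cov_beta_hat is the usual
    estimate s^2 * (X'X)^{-1} for the design matrix with columns [1, x].
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual variance with n - 2 degrees of freedom
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)
    var_b0 = s2 * (1.0 / n + x_bar ** 2 / sxx)
    var_b1 = s2 / sxx
    cov_b0_b1 = -s2 * x_bar / sxx
    return (b0, b1), ((var_b0, cov_b0_b1), (cov_b0_b1, var_b1))


# Toy data (hypothetical): auxiliary covariate and county-level direct estimates
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta_hat, cov_hat = ols_prior_inputs(x, y)
```

In a Bayesian fit, `beta_hat` and `cov_hat` would be passed to the sampler as fixed, known constants rather than estimated again inside the model.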
NASS publishes the yields to the nearest tenth of a bushel per acre. The variances in yield estimated from the 2016 CAPS for corn were smaller than 0.01 bushels per acre for 215 counties. The nearest neighbor imputation technique is applied to fill in the missing data pairs, namely the missing direct estimates and the corresponding missing sampling variances, which are used as inputs to the HB model in Equation (4). However, with this approach, counties with non-zero direct estimates and missing or zero sampling variances are not modeled. Furthermore, the model in Equation (4) cannot produce estimates (or, if produced, the estimates are unreliable) for the 348 counties with valid (positive) direct survey estimates and corresponding sampling variances outside the threshold bounds (L and U).
In more recent research on modeling yields at the county level at NASS, counties with valid direct estimates but sampling variances below some threshold are included in the model. Instead of being excluded from the subarea-level model in Equation (4), their unreliable sampling variances are assumed to be unknown (and equal across all anomalous counties) and are updated using a Bayesian technique. This approach is formally presented by the updated subarea-level model in Equation (5). In what follows, we briefly describe this research.
Let the n counties be separated into two groups. The first group consists of counties with known direct estimates and sampling variances below some threshold (e.g., less than 1 bushel per acre for corn). The second group consists of counties with known direct estimates and sampling variances above the threshold.
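Then, the updated subarea-level model (hereafter referred to as the original model) is given by Equation (5). It has the same structure as the model in Equation (4), except that the sampling variances of the counties in the first group are replaced by a single unknown variance, which joins the set of nuisance parameters.
The priors adopted for the regression coefficients and the variance components are the same as those in Equation (4), and a separate prior is adopted for the unknown constant variance throughout all anomalous counties.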
The limitations of the original model in Equation (5) are twofold. First, the model assumes that all unknown variances are the same. Second, not all unreliable sampling variances are considered by the model: the variances that fall in the extreme right tail of the distribution of sampling variances are also unreliable but are not addressed by Equation (5).
This paper addresses the challenge of improving, more realistically, the unreliable sampling variances for counties with valid (positive) direct survey estimates. By considering all unreliable sampling variances in both tails of the distribution (i.e., outside of the threshold bounds L and U) and relaxing the assumption of a constant (unreliable) variance across small areas, this research overcomes the limitations of the original model shown in Equation (5). One can apply either of the two alternative approaches presented earlier in Section 3 to improve the HB model inputs. As an illustration, our case study shows that the HB model fed with updated survey summaries (based on the two alternative approaches presented in this paper) produced more reliable final yield estimates for US counties than the existing approaches.
4.2. Results
In this section, nationwide results from different estimation procedures that used CAPS data to produce the county-level corn yield estimates for 2016 are presented. We compare the corn yield estimates and the associated measures of uncertainty produced from the following:
A survey;
The original model in Equation (5);
The updated model in Equation (4) using improved sampling variances based on the Bayesian method as the input;
The updated model in Equation (4) using improved sampling variances based on the bootstrap method as the input.
The Markov chain Monte Carlo (MCMC) simulation method was used to fit the Bayesian models using R and JAGS [23]. The JAGS model descriptions used in the R script are shown in Appendix C. All the Bayesian models were fit for each state individually, and there were 37 corn states in the 2016 CAPS. For each model, three chains were run. Each chain contained 10,000 iterations, and the first 2000 iterates were discarded as burn-in to reduce the influence of the starting values. To reduce the autocorrelation among neighboring iterations, the remaining iterations were thinned by taking a systematic sample of one in every eight. This left 1000 MCMC samples per chain for constructing the posterior distributions of the parameters and making inferences for the yield estimates. Convergence diagnostics were conducted to make sure that the MCMC samples were mixing well. Convergence was monitored using trace plots, the multiple potential scale reduction factors ($\hat{R}$ close to one) and the Geweke test of stationarity for each chain (see [24,25]). The Geweke tests for all the parameters in models (2), (4) and (5) were not significant, and the effective sample sizes were all near the actual sample size of 1000 (nearly all of them were exactly 1000).
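The chain bookkeeping described above can be checked with a short Python sketch. It only reproduces the arithmetic (10,000 iterations, a burn-in of 2000 and thinning by eight) and does not run a sampler. The helper name `retained_iterations` is hypothetical, and the choice of keeping the last iteration in each block of eight is an assumption, since the text does not say which one is retained.

```python
def retained_iterations(n_iter, burn_in, thin):
    """Return the 1-based indices of iterations kept after burn-in and thinning.

    Assumption: after discarding the first `burn_in` iterations, every
    `thin`-th remaining iteration is kept, starting at burn_in + thin.
    """
    return list(range(burn_in + thin, n_iter + 1, thin))


n_chains = 3
kept = retained_iterations(n_iter=10_000, burn_in=2_000, thin=8)
samples_per_chain = len(kept)
total_samples = n_chains * samples_per_chain
```

With these settings, `samples_per_chain` equals the 1000 draws per chain reported in the text.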
For the bootstrap method, bootstrap samples of size n were used to construct the empirical distribution of the sampling variances (see Appendix D). Then, as described in Section 3.2, a new set of values was drawn from the empirical distribution to update the unreliable variances for the “anomalous” counties within each state. The updated sampling variances obtained via the bootstrap method satisfied the inequalities in model (3) and provided more reasonable values for the extreme sampling variances, which were then used as inputs to the model in Equation (4).
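To make the resampling idea concrete, here is a generic Python sketch of drawing replacement variances from a bootstrap-based empirical distribution. It is not the procedure of Section 3.2 or Appendix D. It assumes that only variances already inside the bounds are resampled, that the bootstrap samples are pooled into one empirical distribution and that each anomalous county receives one draw from that pool. The function name, the default `n_boot=200` and the toy data are all hypothetical.

```python
import random


def bootstrap_replacements(reliable_vars, n_anomalous, n_boot=200, seed=0):
    """Draw replacement variances from a pooled bootstrap distribution.

    reliable_vars: sampling variances assumed to lie within the bounds [L, U]
    n_anomalous:   number of counties whose variances need replacing
    n_boot:        number of bootstrap samples (hypothetical default)
    """
    rng = random.Random(seed)
    n = len(reliable_vars)
    pooled = []
    for _ in range(n_boot):
        # One bootstrap sample of size n, drawn with replacement
        pooled.extend(rng.choices(reliable_vars, k=n))
    # One draw from the pooled empirical distribution per anomalous county
    return [rng.choice(pooled) for _ in range(n_anomalous)]


# Toy data (hypothetical): variances for counties inside the bounds
reliable = [1.2, 2.5, 3.1, 4.8, 7.9, 12.4]
updated = bootstrap_replacements(reliable, n_anomalous=3)
```

Because every draw is one of the resampled values, the replacements stay inside the range of the reliable variances. Under the stated assumption, this is what keeps them within the bounds L and U.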
We recall here that the 2016 CAPS sample consisted of 37 states comprising 2329 counties with positive yields or production for corn. There were 99 counties with zero sampling variances, 142 counties with positive sampling variances below L and 107 counties with relatively large sampling variances greater than U. In total, there were 348 anomalous counties with sampling variances falling outside the bounds L and U. These bounds, defined earlier in Section 2.1, vary by state.
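The classification of counties by sampling variance can be sketched in a few lines of Python. The function `classify_variance` and its labels are illustrative assumptions. The bounds are passed in as arguments because they vary by state, and the counts are the ones reported above.

```python
def classify_variance(var, lower, upper):
    """Label a sampling variance relative to the state-specific bounds L and U."""
    if var == 0:
        return "zero"
    if var < lower:
        return "below_L"
    if var > upper:
        return "above_U"
    return "within"


# Counts reported in the text for the 2016 CAPS corn data
counts = {"zero": 99, "below_L": 142, "above_U": 107}
total_counties = 2329
anomalous = sum(counts.values())
anomalous_share = anomalous / total_counties
```

The three categories add up to the 348 anomalous counties, which is roughly 15% of the 2329 counties.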
Before comparing the final modeled estimates of the yields generated from all methods discussed in this paper, we briefly show the improvement gained for the sampling variances by applying the two proposed methods to the 348 anomalous counties.
Table 1 shows the five-number summary (i.e., the minimum, the first, second and third quartiles and the maximum) of the survey-estimated variances, the improved sampling variances based on the Bayesian method and the improved sampling variances based on the bootstrap approach at the county level for the anomalous counties. The survey-estimated variances were highly right-skewed, and a large share of them were zero or extremely small. The first quartile was 0.00, the median was 6.25 × 10
, the third quartile was 0.71, and the maximum was 5264.01. In contrast, the variances generated by both the Bayesian and bootstrap methods were less extreme (i.e., they were far from zero), shifted more to the right and more centered than the survey-estimated variances. The minimum and first quartile of the updated variances based on the bootstrap method were smaller than those generated by the Bayesian method (Table 1), whereas the median, third quartile and maximum of the updated variances based on the Bayesian method were smaller than those from the bootstrap method.
The performance of each approach was evaluated based on the relative bias of the final estimates produced from each method with respect to the published estimates. The absolute relative differences (ARDs) between the estimates from any procedure and the published estimates were computed as follows:
$$\mathrm{ARD} = \frac{\left| \hat{\theta} - \theta^{\mathrm{pub}} \right|}{\theta^{\mathrm{pub}}} \times 100\%,$$
where $\hat{\theta}$ is the final modeled estimate or the survey’s direct estimate and $\theta^{\mathrm{pub}}$ is the corresponding published county-level yield estimate.
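As a small illustration of this evaluation, the following Python sketch computes ARDs in percent and their five-number summary. The yield pairs are hypothetical. The quartile convention used here (the `inclusive` method of `statistics.quantiles`) is also an assumption, since the text does not state which convention underlies the tables.

```python
import statistics


def ard(estimate, published):
    """Absolute relative difference, in percent, from the published value."""
    return abs(estimate - published) / published * 100.0


def five_number_summary(values):
    """Minimum, three quartiles (inclusive method) and maximum."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical (estimate, published) yield pairs in bushels per acre
pairs = [(152.0, 160.0), (171.5, 175.0), (98.0, 100.0), (140.0, 125.0), (180.0, 180.0)]
ards = [ard(est, pub) for est, pub in pairs]
summary = five_number_summary(ards)
```

Applying the same two functions to each estimation method in turn gives one row of a comparison table per method.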
Table 2 shows the nationwide results using a five-number summary of the ARDs of the yield estimates produced from all four approaches relative to the published estimates, first for the anomalous counties and then for all available counties. The median ARDs when the Bayesian and bootstrap methods were applied in the anomalous counties were 7.93% and 6.71%, respectively. These were much smaller than the median ARD of the survey’s direct estimates and, to a lesser extent, smaller than the median ARD of the estimates produced from the original model in Equation (5). The estimates based on the Bayesian and bootstrap methods were generally closer to the published estimates in the anomalous counties. Similar relationships can be seen in the third quartile. However, the maximum ARD of the survey’s direct estimates was smaller than that of any other method because some of the direct estimates from the survey were missing, whereas the other methods provided a complete dataset. The median ARD from the bootstrap approach was the smallest in the anomalous counties. Among the model-based methods, the maximum ARD of the estimates produced by the Bayesian method was the smallest in the anomalous counties. In all counties, all modeled estimates were generally closer to the published estimates when compared with the direct survey estimates. Overall, Table 2 reveals an improvement in performance for the HB models in small areas under the two proposed approaches. All five-number summaries of the ARDs of the yield estimates based on the Bayesian and bootstrap methods were smaller than those from the original model.
In addition, the choropleth maps (Figure 3) depict the ARDs for the county-level estimates produced from different methods in selected states, known as the corn belt states because they dominate corn production in the US. The larger the difference between an estimate and the published estimate (relative to the published estimate), the darker the corresponding colored area. Most counties are shown in yellow, indicating that the estimates were close to the published estimates. Counties shown from dark green to blue or purple on the map depicting the estimates based on the survey (upper left corner) had very small sample sizes and unreliable sampling variances for the yield. The corresponding counties in the other maps, which depict the estimates based on the subarea-level models (the original model in Equation (5) and the model in Equation (4) with updated inputs), appeared much lighter. For the areas with small sample sizes, the subarea-level models produced the yield estimates by incorporating other (administrative) data and by “borrowing information” across and within areas and subareas.
The correlation matrix of the published estimates of the yield, the survey’s direct estimates of the yield, the estimates of the yield based on the original model in Equation (5), the estimates of the yield from the model in Equation (4) with improved sampling variances based on the Bayesian method as the input and the estimates of the yield from the model in Equation (4) with improved sampling variances based on the bootstrap method as the input is shown in Table 3. All the correlations were larger than 0.75, indicating high correlation among the final estimates from all methods. The highest correlation with the published yield estimates appeared for the estimates produced from the original model in Equation (5). This was expected, since the published estimates were produced using several sources of information, among which the original model (currently in production) plays a central role. Furthermore, the correlation between the survey and the published estimates was the lowest, since the direct survey estimates did not leverage model-based solutions designed to improve the estimation accuracy. Table 3 also indicates that all model-based estimates agreed more closely with the published estimates than the survey estimates did.
Table 4 shows the five-number summaries of the coefficients of variation (CVs) of the county-level yield estimates from our case study, produced nationwide using the four approaches discussed in this paper. We recall that there were 99 counties in the 2016 CAPS with zero sampling variances for the yield. Hence, the survey CVs for these counties were not valid statistics, and these counties were removed from the CV comparison. Among the anomalous counties, the survey CVs took extreme values close to either zero or one, whereas the CVs of the yield estimates from the original model and the Bayesian and bootstrap methods were more stable. The bootstrap method provided the smallest CVs among the three model-based methods, and all values in the five-number summary from the original model were larger than those from the two alternative methods proposed in this paper. Over all counties, the CVs from the three model-based approaches decreased (i.e., the relative precision increased) when compared with the survey CVs. The original model had the smallest first-quartile CV (2.51%), whereas the smallest median, third quartile and maximum CVs were obtained when the bootstrap method was used. The results demonstrate the tendency of the small area models to improve the precision of the estimates when compared with the survey estimates, especially in areas with small sample sizes (i.e., counties with very large CVs).
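The CV comparison can be illustrated with a short Python sketch. It assumes the standard definition of the CV as the standard error divided by the estimate, expressed in percent, and it returns `None` for non-positive variances or estimates, mirroring the removal of zero-variance counties described above. The function name and the toy values are hypothetical.

```python
import math


def cv_percent(estimate, variance):
    """Coefficient of variation in percent, or None when it is not valid.

    Assumes CV = sqrt(variance) / estimate; counties with a non-positive
    variance or estimate are flagged with None so they can be excluded.
    """
    if estimate <= 0 or variance <= 0:
        return None
    return math.sqrt(variance) / estimate * 100.0


# Hypothetical (yield estimate, variance) pairs; the second has zero variance
counties = [(150.0, 36.0), (120.0, 0.0), (175.0, 49.0)]
cvs = [cv_percent(est, var) for est, var in counties]
valid_cvs = [cv for cv in cvs if cv is not None]
```

For model-based estimates, the same function would take a posterior mean and a posterior variance in place of the direct estimate and the sampling variance.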