Small area models are now being used to produce estimates for farm labor, crop county estimates, and cash rent county estimates. For each of the subarea models, the area and the subarea are defined. For farm labor, the region (see Figure 1) is the area, and the state within region is the subarea. For both crop county estimates and cash rent county estimates, the agricultural statistics district is the area, and the county within the agricultural statistics district is the subarea. An agricultural statistics district is a predefined group of neighboring counties within a state that have similar agriculture. The number of agricultural statistics districts within a state varies from one for small states to 15 for Texas, with a median of nine.
The small area models that the NASS has implemented can all be viewed as extensions to the two-stage FH model [4]. In the first stage, the subarea-level means from the survey are assumed to follow a distribution with mean $\theta$ and sampling variance $\sigma^2$, which is estimated using the survey design and weights. The second stage relates the $\theta$s to the covariates through a regression $\theta = x'\beta + u$, where $u$ represents the prediction error associated with the regression model and is assumed to have mean 0. Thus, the corresponding probability-based surveys discussed in the last section serve as the foundation for the models, and the information from the non-survey data is incorporated through the covariates in the regression. The NASS publishes the coefficient of variation (CV) with its point estimates. For the models developed here, the CV is based on the point estimate and its standard error from the posterior distribution.
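As a rough illustration of the last point, the CV can be computed from posterior draws as the posterior standard deviation divided by the posterior mean. The function name `posterior_cv` and the input layout are assumptions made for this sketch; they are not taken from the NASS production system.

```python
import numpy as np

def posterior_cv(draws):
    """Return (posterior mean, posterior sd, CV in percent) for one estimate.

    `draws` is a 1-D collection of MCMC draws of the quantity of interest
    (hypothetical input layout, assumed for illustration).
    """
    draws = np.asarray(draws, dtype=float)
    mean = draws.mean()
    sd = draws.std(ddof=1)  # sample standard deviation of the draws
    return mean, sd, 100.0 * sd / mean

# Toy draws, invented for the example
mean, sd, cv = posterior_cv([90.0, 100.0, 110.0])
```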
3.1. Small Area Models for Farm Labor Estimates
The NASS Farm Labor Report is published semiannually and provides estimates of the number of workers, average hours worked per week, and average wage rates by worker type at the regional and national levels. For each worker type, three subarea models, one for each variable of interest, are fit. The farm labor region is the area, and the state within region is the subarea. The distribution of the number of workers is highly right-skewed, so a normal subarea-level model is fit to the log-transformed estimates. The distributions of hours worked per week and wage rates, which are also non-negative, are symmetric; thus, normal subarea-level models are fit to these variables on their original scale.
Each model is outlined below, and the modeling details are in Chen et al. [
15]. To estimate the number of workers, let
i =1, 2, …, 18 be an index for the 18 labor regions and let
j = 1, 2, …,
ni be the
jth state in the
ith region. Furthermore, define
k = 1, 2, 3, 4 as an index for the four worker types: (1) field workers, (2) livestock workers, (3) supervisors, and (4) other workers. Let
denote the true number of workers of type
k in state
j and region
i;
; and
and
be, respectively, the direct survey estimate and the associated survey variance of
. The covariates including an intercept are
(see
Table A1 in
Appendix A for a list of the covariates).
The model for the number of workers is then
where
is, by the delta method, the estimate for the sampling variances after log transformation;
is the area-level random effect representing the region-level variability;
and
are, respectively, the least squares estimates of
and the estimated covariance matrix of
; and
represents the positive real numbers. The uniform priors for scale parameters
and
are motivated by Gelman [16] and Browne and Draper [17]. The uniform prior on the positive real line is functionally equivalent to a proper U(0, 1/ε) prior for very small ε.
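Putting the components described above together, a lognormal subarea model of this kind can be sketched as follows. The notation is an assumption made for illustration: $\theta_{ijk}$ is the log-scale mean, $\sigma^2_{\mu k}$ and $\sigma^2_{\nu k}$ are the subarea- and area-level variances, and $c$ is a hypothetical constant that inflates the least squares covariance to make the prior diffuse. The exact specification is the one given in Chen et al. [15].

```latex
\begin{align*}
\log \hat{y}_{ijk} \mid \theta_{ijk}
  &\sim N\!\left(\theta_{ijk},\ \hat{\sigma}^{*2}_{ijk}\right),
  \qquad \hat{\sigma}^{*2}_{ijk} = \hat{\sigma}^{2}_{ijk} \big/ \hat{y}_{ijk}^{2}, \\
\theta_{ijk} \mid \beta_k, \nu_{ik}, \sigma^{2}_{\mu k}
  &\sim N\!\left(x_{ijk}'\beta_k + \nu_{ik},\ \sigma^{2}_{\mu k}\right), \\
\nu_{ik} \mid \sigma^{2}_{\nu k}
  &\sim N\!\left(0,\ \sigma^{2}_{\nu k}\right), \\
\beta_k
  &\sim MVN\!\left(\hat{\beta}_k,\ c\,\hat{\Sigma}_{\hat{\beta}_k}\right), \\
\sigma_{\mu k},\ \sigma_{\nu k}
  &\sim \mathrm{Uniform}\!\left(\mathbb{R}^{+}\right).
\end{align*}
```

The expression for $\hat{\sigma}^{*2}_{ijk}$ is the usual first-order delta method result, $\mathrm{Var}(\log \hat{y}) \approx \mathrm{Var}(\hat{y})/\hat{y}^{2}$.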
After obtaining the posterior distribution of
, the estimators
, where
wk represents the number of workers, follow from back transformation and are used to obtain the posterior means and measures of uncertainty for the number of workers by each worker type. The aggregated regional level posterior summaries for the number of workers by different worker types are obtained based on state-level MCMC samples (see Chen et al. [
15] for details).
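The back transformation and regional aggregation can be illustrated with a short sketch. It assumes that the state-level draws are stored on the log scale in an array with one row per MCMC draw and one column per state in the region; the function name and that layout are hypothetical.

```python
import numpy as np

def regional_worker_summary(log_draws):
    """Summarize the regional number of workers from state-level log-scale draws.

    `log_draws` has shape (n_draws, n_states); this layout is an assumption
    made for the example.
    """
    state_draws = np.exp(np.asarray(log_draws, dtype=float))  # back-transform each draw
    region_draws = state_draws.sum(axis=1)                    # sum the states within each draw
    mean = region_draws.mean()
    sd = region_draws.std(ddof=1)
    return {"mean": mean, "sd": sd, "cv_pct": 100.0 * sd / mean}

# Toy draws for a two-state region, invented for the example
toy = np.log(np.array([[100.0, 200.0], [110.0, 210.0], [120.0, 220.0]]))
summary = regional_worker_summary(toy)
```

Summing within each draw before summarizing carries the joint posterior uncertainty of the states through to the regional total, which summing the state-level posterior means alone would not do.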
Because the distributions of the average hours worked per week and the average wage rate per hour are symmetric, a normal subarea model is applied to each of these response variables.
As in the lognormal subarea model,
is the area-level random effect representing the region-level variability; the coefficients of
have an empirical diffuse prior; and the prior distributions for
and
are noninformative uniform priors (see
Table A1,
Appendix A for a list of the model covariates).
After obtaining the posterior distribution of
, for
, where
hr and
wg represent average hours worked per week and average wage per hour, respectively, the estimators
follow from the identity transformation and are used to obtain the posterior means and measures of uncertainty for
hr and
wg by each worker type. The aggregated regional level posterior summaries for hours and wage rate by different worker types are obtained conditional on the state-level MCMC samples of the number of workers (see Chen et al. [
15]).
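One plausible way to carry out this conditioning is to weight each state's hours (or wage) draw by the corresponding draw of the number of workers. This workers-weighted average is an assumption made for illustration; Chen et al. [15] describe the procedure actually used.

```python
import numpy as np

def regional_rate_draws(rate_draws, worker_draws):
    """Regional draws of an average (hours or wage), weighted by worker draws.

    Both inputs have shape (n_draws, n_states) and are aligned draw by draw
    (hypothetical layout, assumed for the example).
    """
    rate = np.asarray(rate_draws, dtype=float)
    workers = np.asarray(worker_draws, dtype=float)
    # Within each draw: sum(rate * workers) / sum(workers)
    return (rate * workers).sum(axis=1) / workers.sum(axis=1)

# Toy single draw for a two-state region, invented for the example
regional_hours = regional_rate_draws([[40.0, 50.0]], [[100.0, 300.0]])
```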
The detailed model evaluations including model effectiveness, model efficiency, and a comparison between survey estimates and subarea model estimates can be found in Chen et al. [
15]. Furthermore, a 2020 case study illustrates the improvement in the direct estimates for areas with small sample sizes by using auxiliary information and borrowing information across areas and subareas.
3.2. Small Area Models for Crop County Estimates
In the Crop County Estimates program, estimates of the planted and harvested acres, yield, and production are produced for each county in the target population of a specified crop. Production (or yield) can be derived from the yield (or production) and harvested acres as the product of yield and harvested acres (or the ratio of production to harvested acres). Thus, only three models are needed: (1) planted acres, (2) harvested acres, and (3) yield or production. Reflecting the agricultural process, planted acres are modeled first. Harvested acres are modeled next, and the model must reflect two constraints: (1) if the acres planted to the specified crop are zero, then the number of harvested acres is zero; and (2) the number of acres harvested can be no more than the number of planted acres. Because the number of planted acres can vary widely from farm to farm, with a few farms planting many more acres than the majority of the others, production tends to be highly skewed, whereas yield tends to be more normally distributed. Therefore, a yield model was developed, and production was derived from the estimates of the yield and harvested acres. Of course, yield and production must be zero if no acres are planted or harvested.
All crop county estimates must honor two constraints that follow from the available information. First, the state and U.S. estimates of the planted and harvested acres, yield, and production are published before the county estimates of those same quantities, and the state estimates are coherent with the national estimates; that is, the estimated state-level numbers of the planted and harvested acres and production sum to the published U.S. estimates. To maintain coherence in the estimates, the estimates of the planted and harvested acres and production for counties within a state must total the state-level estimates. Ratio benchmarking similar to that of Nandram and Sayit [
18] enforces this coherence. Second, the yield, as the ratio of production to harvested acres, needs to aggregate to the corresponding state-level estimates. The study by Erciulescu et al. [
19] explored the preservation of triplet relationships among the numerator totals, denominator totals, and their ratios for two nested, smaller-than-state geographies.
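The two coherence requirements can be illustrated with a minimal sketch of plain ratio benchmarking: county totals are rescaled by a common factor so that they sum to the published state total, and the state yield is recovered as total production over total harvested acres. The function names are hypothetical, and the sketch shows simple proportional scaling only, not the specific procedures of Nandram and Sayit [18] or Erciulescu et al. [19].

```python
import numpy as np

def ratio_benchmark(county_estimates, state_total):
    """Scale county estimates by a common ratio so they sum to state_total."""
    est = np.asarray(county_estimates, dtype=float)
    total = est.sum()
    if total <= 0:
        raise ValueError("county estimates must sum to a positive value")
    return est * (state_total / total)

def state_yield(county_production, county_harvested):
    """State yield implied by county totals: sum(production) / sum(harvested)."""
    production = np.asarray(county_production, dtype=float)
    harvested = np.asarray(county_harvested, dtype=float)
    return production.sum() / harvested.sum()

# Toy county estimates and state total, invented for the example
benchmarked = ratio_benchmark([10.0, 30.0, 60.0], 120.0)
```

Because yield is a ratio, benchmarking the production and harvested acre totals fixes the implied state yield; averaging county yields directly would not in general reproduce it.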
Erciulescu et al. [
8] suggested a subarea model for planted acres and applied ratio benchmarking. The area is an agricultural statistics district, and a county within a district is the subarea.
Let
be the number of planted acres in county
j = 1, 2, …,
, within agricultural statistics district
i = 1, 2, …,
m. Furthermore, let the county sample size be
and
be the direct survey estimate with the estimated sampling variance
. The total number of counties in a state is
, and the state sample size is
. The county-level auxiliary information is
(see
Table A2,
Appendix A for a list of the model covariates). Further assume that the county-level random effects have independent, normal distributions with mean 0 and variance
and the district-level random effects are independent, normally distributed with mean 0 and variance
. Then,
The prior distribution for the model parameter
is a normal distribution with mean and variance being the least squares estimates of
. With known
and with no district-level effects
, Model (1) reduces to the FH area-level model [
4].
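For orientation, a subarea model with the ingredients described above can be sketched as follows. The symbols are assumptions made for illustration: $\hat{\theta}_{ij}$ and $\hat{\sigma}^2_{ij}$ are the direct estimate and its sampling variance, $x_{ij}$ is the county-level auxiliary information, $v_i$ is the district-level effect, and $\sigma^2_u$ and $\sigma^2_v$ are the county- and district-level variances. Erciulescu et al. [8] give the exact form of Model (1).

```latex
\begin{align*}
\hat{\theta}_{ij} \mid \theta_{ij}
  &\sim N\!\left(\theta_{ij},\ \hat{\sigma}^{2}_{ij}\right), \\
\theta_{ij} \mid \beta, v_i, \sigma^{2}_{u}
  &\sim N\!\left(x_{ij}'\beta + v_i,\ \sigma^{2}_{u}\right), \\
v_i \mid \sigma^{2}_{v}
  &\sim N\!\left(0,\ \sigma^{2}_{v}\right),
  \qquad i = 1, \dots, m .
\end{align*}
```

Setting $v_i = 0$ for all $i$ and treating $\hat{\sigma}^{2}_{ij}$ as known leaves $\hat{\theta}_{ij} = x_{ij}'\beta + u_{ij} + e_{ij}$ with $u_{ij} \sim N(0, \sigma^{2}_{u})$ and $e_{ij} \sim N(0, \hat{\sigma}^{2}_{ij})$, which is the area-level form.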
The number of acres planted to a specified crop within a county is at least as large as the number of acres that the producers reported to either the FSA or RMA as having been planted in that county. Often, the direct survey estimate of planted acres for a county is above this lower bound, but due to sampling variation, this is not always the case (see Figure 4). The NASS has long used expert opinion to ensure that this lower bound was honored. Developing the methodology to enforce this lower bound within model (1) was technically challenging. Nandram et al. [20] and Chen et al. [21] proposed and implemented the constrained model (4) for planted acres.
where
is the vector of the maximum of the acres planted to the specified crop reported to the FSA and RMA for each county
j in agricultural statistics district
i, and
is the prepublished state-level estimate of the planted acres. Ratio benchmarking was applied so that the estimated planted acres, summed over the counties in a state, equaled the state-level estimate of acres planted to the crop. Adding the constraint to the model and applying ratio benchmarking led to estimates that were consistent with the expert opinion used by the members of the Agricultural Statistics Board, which enabled the model to be considered for production.
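The two requirements on the county-level planted acres can be written compactly. The symbols are assumptions made for illustration: $\theta_{ij}$ is the planted acreage of county $j$ in district $i$, $c_{ij}$ is the larger of the FSA and RMA reported acreages for that county, and $a$ is the prepublished state-level estimate.

```latex
\begin{align*}
\theta_{ij} &\ge c_{ij}
  \quad \text{for all } i = 1, \dots, m \text{ and } j = 1, \dots, n_i, \\
\sum_{i=1}^{m} \sum_{j=1}^{n_i} \theta_{ij} &= a .
\end{align*}
```

Both conditions can hold together only if $\sum_{i}\sum_{j} c_{ij} \le a$. Note also that rescaling by a common factor smaller than one can push an estimate that sits close to its bound below $c_{ij}$, which is one reason the lower bound has to be handled together with the benchmarking rather than as an afterthought.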
Erciulescu et al. [
8] proposed and implemented a subarea model for harvested acre estimates analogous to the one for planted acres in (1). In contrast, a subarea model for failed acre estimates was developed where the number of failed acres was equal to the number of planted acres less the number of acres harvested. Through its insurance program, the RMA collects information on failed acres due to drought, storms, or other events. The number of acres reported as having failed within a county provides a lower bound for that county’s number of failed acres, which is the difference between the number of acres planted and the number harvested. Thus, conditioned on the model-based planted acre estimates, the model incorporated a constraint to honor the lower bound of failed acres obtained from the RMA administrative data, and the model-based harvested acre estimates can be derived from the planted acre and failed acre estimates. In such a model setting, the two relationships, (i) between planted acres and harvested acres and (ii) between the model-based failed acres and RMA administrative failed acres, can both be satisfied. In the end, ratio benchmarking was applied so that the estimated harvested acres, summed over the counties in a state, equaled the state-level estimate of acres harvested for the crop.
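The relationships among planted, failed, and harvested acres can be checked with a small sketch. It only verifies the stated bounds and derives harvested acres by subtraction; it does not reproduce the constrained model itself, and the function name and inputs are hypothetical.

```python
import numpy as np

def derive_harvested(planted, failed, rma_failed):
    """Derive harvested acres as planted minus failed, checking both bounds.

    All inputs are county-level arrays of equal length (assumed layout).
    """
    planted = np.asarray(planted, dtype=float)
    failed = np.asarray(failed, dtype=float)
    rma_failed = np.asarray(rma_failed, dtype=float)
    if np.any(failed < rma_failed):
        raise ValueError("failed acres fall below the RMA lower bound")
    if np.any(failed > planted):
        raise ValueError("failed acres exceed planted acres")
    return planted - failed

# Toy county values, invented for the example
harvested = derive_harvested([100.0, 50.0, 0.0], [10.0, 5.0, 0.0], [8.0, 5.0, 0.0])
```

When both checks pass, harvested acres lie between zero and planted acres, and a county with zero planted acres necessarily has zero harvested acres.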
The subarea models for yield are of the same form as model (1), with the direct survey estimate of yield and its associated sampling variance for county j within agricultural statistics district i taking the place of those for planted acres. The National Commodity Crop Productivity Indices (NCCPIs), which measure the quality of the soil for growing non-irrigated crops in climate conditions best suited for corn (NCCPI-corn), wheat (NCCPI-wheat), and cotton (NCCPI-cotton), are incorporated as covariates. The mean and variance of the posterior distribution of the yield are, respectively, the modeled estimate of the yield of county j in agricultural statistics district i and its estimated variability.
It is worth noting that some sampling variances are not stable or are unavailable due to zero or small sample sizes for certain counties, which differ with commodity. Erciulescu et al. [
22] discussed the challenges of missing data when fitting the subarea level model to obtain the crop total estimates for the whole nation. A nearest neighbor imputation method was proposed to impute missing data including the missing sampling variances. In addition, an approach based on Taylor’s approximation and Bayesian modeling was applied to smooth unstable, modeled sampling variances (see [
23]).
Detailed model evaluations in terms of effectiveness and model efficiency have been conducted. For instance, Nandram et al. [
20] showed how to incorporate the area-specific inequality constraints and benchmarking into the Fay–Herriot model using simulated datasets with properties resembling an Illinois corn crop. Chen et al. [
21] examined the performance of the model with inequality constraints and, through a case study, illustrated the improvement in the county-level estimates in terms of accuracy and precision while preserving the required relationships. Erciulescu et al. [
19] discussed the yield model and different methods of applying benchmarking constraints to a triplet (numerator, denominator, ratio) and illustrated results for 2014 for corn and soybeans in Indiana, Iowa, and Illinois. Based on these results, small area models implemented in crop county estimates for total acre and yield estimates provide accurate indirect estimates while improving the precision.
3.3. Small Area Models for Cash Rent County Estimates
The Agricultural Statistics Board began using a univariate area-level model for cash rental rates in 2013 [
24]. The model was based on the average and change in the current and previous years’ cash rental rates for county
i, which are orthogonal under the normality assumption. Information on the total value of agricultural production, the published county-level crop yield estimates, and the NCCPIs were incorporated into the model. Two-stage benchmarking [
25] was used to ensure coherence in the estimates at the county, agricultural statistics district, state, and national levels. However, the two-stage benchmarking led to a few negative estimates. The model did not provide estimates of the total value from the cash rentals or the total land rented, both of which are important for assessing coverage, which is a published metric of quality. Furthermore, the modeling assumption of equality of variances in the two years is not always appropriate, and the survey outliers impact the estimates in two years, not just one. Thus, although the modeled estimates were reviewed by the Agricultural Statistics Board, they were not used as the foundation for publication.
In its review, the CNSTAT panel recommended that the NASS develop a bivariate, unit-level hierarchical Bayesian model to estimate the county-level cash rents that does not depend on the assumption of equal variances in the two survey years [
1]. Erciulescu et al. [
26] partitioned the respondents into three sets: those reporting only in the previous year, those reporting only in the current year, and those reporting in both years. They then developed a unit-level bivariate, hierarchical Bayesian model that incorporated covariates of other available information that differed by state. The two-stage benchmarking was conditioned on the direct survey estimates for rented acres, which could be adjusted in the review process. Accounting for the correlations (counties and operations) from one year to the next in the resulting model led to a level of computational intensity that made it difficult to complete and review results in the available production window. Therefore, this model was not considered further for production.
In 2021, the NASS implemented county-level models for the acres rented and rental rates and derived the total dollars from the cash rents as the product of the two modeled estimates for non-irrigated cropland, irrigated cropland, and permanent pasture. The adopted two-component mixture model of the county-level cash rents has the advantage that the two years of data are used together, while the two correlations (across counties and across operations) need not be modeled because a power prior partly discounts the past data (see [
27,
28]). In addition, the structure of the model can adjust the outliers among the county estimates. Chakraborty et al. [
29] and Goyal et al. [
30] provided a full Bayesian approach to adjusting outliers from this type of model. The basic assumption of the county-level model is that the two years are similar. A discounting factor “
” (see [
27,
28]) associated with the previous year data was introduced in the model to adjust for differences from the current year data. The discounting factor was the same for all counties within the same region. Furthermore, it was assumed that outliers were present but less prevalent than the remaining reported data. Because the variance with outliers should be greater than that without the outliers, a mixture model was used to adjust for outliers and make the estimates more robust.
Let
be the index of counties with responses in year 1 (previous year) and let
be the index of counties with responses in year 2 (current year). That is, there are
counties sampled only in year 1,
sampled in both years, and
sampled only in year 2. Let
be the survey indications and survey sampling variances from year 1 and
be the survey indications and survey sampling variances from year 2. Let
be the known auxiliary information: the corresponding previous year county-level official estimates, the number of positive responses, and NCCPIs (see
Table A3,
Appendix A).
The two-component mixture model was used to estimate the cash rental rates at the county level. The model for year 1 is
and, for year 2, the model is
It was assumed that (1) a proportion of the p counties had estimates that were outliers, (2) the prior was informative with discounting factor , and (3) the variance in the normal data (not outliers) was smaller than the variance with outliers. Here, it is convenient that . Note that the parameters were the same over the counties and years. The prior for was . The county estimates were benchmarked to the state and national estimates using the ratio benchmarking method at the end.
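The two ideas combined in this model, a two-component mixture for outliers and a power prior that discounts the previous year, can be illustrated with a simplified log-likelihood. Everything named here is an assumption made for the sketch: `p` is the outlier share, `var_regular` and `var_outlier` are the variances of the two components, `a` is the discounting factor between 0 and 1, and the sampling variances and covariates of the actual model are left out.

```python
import numpy as np

def normal_logpdf(y, mean, var):
    """Elementwise log density of N(mean, var) evaluated at y."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)

def mixture_loglik(y, mean, var_regular, var_outlier, p):
    """Log-likelihood of a two-component normal mixture with outlier share p."""
    y = np.asarray(y, dtype=float)
    regular = np.log1p(-p) + normal_logpdf(y, mean, var_regular)
    outlier = np.log(p) + normal_logpdf(y, mean, var_outlier)
    # log((1 - p) * f_regular + p * f_outlier), computed stably per county
    return float(np.logaddexp(regular, outlier).sum())

def discounted_loglik(y_prev, y_curr, mean_prev, mean_curr,
                      var_regular, var_outlier, p, a):
    """Current-year log-likelihood plus the previous-year one scaled by a."""
    prev = mixture_loglik(y_prev, mean_prev, var_regular, var_outlier, p)
    curr = mixture_loglik(y_curr, mean_curr, var_regular, var_outlier, p)
    return curr + a * prev  # power prior: previous-year likelihood raised to a
```

With `a = 0` the previous year is ignored entirely, and with `a = 1` the two years are pooled with equal weight; intermediate values discount the previous year partially.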