A Bayesian Hierarchical Spatial Model to Correct for Misreporting in Count Data: Application to State-Level COVID-19 Data in the United States

The COVID-19 pandemic that began at the end of 2019 has caused hundreds of millions of infections and millions of deaths worldwide. COVID-19 posed a threat to human health and profoundly impacted the global economy and people's lifestyles. The United States is one of the countries most severely affected by the disease. Evidence shows that the spread of COVID-19 was significantly underestimated in the early stages, which prevented governments from promptly adopting effective interventions to curb the spread of the disease. This paper adopts a Bayesian hierarchical model to study the under-reporting of COVID-19 at the state level in the United States as of the end of April 2020. The model examines the effects of different covariates on the under-reporting rate and the true incidence rate, and accounts for spatial dependency. In addition to under-reporting (false negatives), we also explore the impact of over-reporting (false positives). Adjusting for misclassification requires additional parameters that are not directly identified by the observed data, so informative priors are required. We discuss prior elicitation and include R functions that convert expert information into the appropriate prior distribution.


Introduction
Coronavirus disease 2019 (COVID-19) is an infectious disease caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). The first known case was reported and confirmed in Wuhan, China, in December 2019. Since then, the disease has spread globally, leading to an ongoing pandemic. According to data from Johns Hopkins University, since the initial spread of the disease in the United States in March 2020, there have been over 762,000 deaths and about 47 million infections as of mid-November 2021, more than any other country.
Misclassification is a common problem in public health count data. It leads both to biased estimates of regression coefficients and to interval estimates that are too narrow when the uncertainty in the model is not fully accounted for. Common approaches for correcting under-reporting include the censored likelihood method [1,2] and hierarchical count frameworks [3-6]. Recently, Stoner et al. [7] extended the model of Winkelmann et al. [3] to correct under-reporting of tuberculosis in Brazil; spatial random effects were also incorporated to account for neighborhood dependency. The Bayesian model they consider is highly flexible and could be applied to a variety of public health scenarios. In our paper, we extend their model and apply it to COVID-19 data. We extend the model in two ways. First, because of the complexity of the model, we replace the Besag-York-Mollié (BYM) model [8] with BYM2 [9,10] and use the Stan programming language of Gelman et al. [11]; we found that our Stan implementation sped up the computation considerably compared to results obtained with the Nimble package developed by de Valpine et al. [12]. Second, since there were likely some false positives (over-reporting) in the COVID-19 data, we extend the model to allow for a background rate of false positives. With both under-reporting and false positives included, the proposed model is overparameterized. Thus, the prior elicitation procedure becomes particularly important, and robustness to the priors should be checked. We include R functions to assist with prior elicitation and consider different prior combinations in our example to show robustness to moderate changes in the prior parameters. This can be viewed as a sensitivity analysis for the model, examining whether inferences are affected by varying degrees of false positives.
In this paper, we adopt a retrospective methodology to explore the impact of under-reporting (false negatives) and over-reporting (false positives) on statistical models of state-level COVID-19 cases in the early stages of the pandemic in the United States. Many studies support the argument that COVID-19 cases were considerably under-estimated early in the pandemic. For example, in a community seroprevalence study in Los Angeles County, the prevalence of antibodies to SARS-CoV-2 was 4.65% (bootstrap 95% confidence interval 2.52-7.07%), indicating that approximately 367,000 (198,890-557,998) adults had SARS-CoV-2 antibodies, substantially greater than the 8430 cumulative confirmed infections in the county as of 10 April 2020 [13]. In other words, the case reporting rate was only 2.30% (1.51-4.24%). Another study of SARS-CoV-2 antibodies by Bendavid et al. [14] implies that by early April, the case reporting rate was approximately 1.85% (1.10-4.00%) in Santa Clara County, CA. Furthermore, Hortacsu et al. [15] reached similar conclusions about substantial early under-reporting.

Our paper is organized as follows. In Section 2, we review the Poisson-logistic model of Winkelmann et al. [3] and the extension allowing for spatial effects by Stoner et al. [7]. We then discuss an extension of the model that allows a sensitivity analysis over a range of false-positive rates. In Section 3, we describe a simulation study demonstrating the advantages of our extended model. Then, in Section 4, we illustrate the model with data from the first two months of the COVID-19 outbreak in the United States. Finally, we provide concluding thoughts in Section 5. The code used to run the most complicated model is in Appendix E.

Poisson-Logistic Model
A few different approaches have been proposed to account for under-reporting in count data. The most popular is the Poisson-logistic (Pogit) model proposed by Winkelmann et al. [3]. The model consists of a binomial component for the observed counts, z, conditional on the underlying unobserved true counts, y, which are assumed to follow a Poisson distribution. Adding spatial effects similar to those of Stoner et al. [7], we consider the following initial model that addresses under-reporting:

z_i | y_i ~ Binomial(y_i, π_i), (1)
y_i ~ Poisson(λ_i), log(λ_i) = log(E_i) + γ_0 + Σ_s γ_s x_{i,s} + φ_i + θ_i, (2)
logit(π_i) = β_0 + Σ_s β_s w_{i,s}, (3)
φ ~ Normal(0, [τ(D_w − W)]^{−1}), (4)

where τ is an overall precision parameter, D_w is an N × N diagonal matrix whose diagonal element, d_ii, equals the number of neighbors of region i, and W is the adjacency matrix with elements w_ij equal to 1 if regions i and j are neighbors (sharing a border) and 0 otherwise. Equation (4) was initially proposed by Besag et al. [8] and is referred to as the BYM model. BYM is popular and commonly used for areal data due to its flexibility; however, fitting the BYM model with Markov chain Monte Carlo (MCMC) methods is difficult due to the non-identifiability of φ and θ, so the sampler often explores extreme regions of the joint parameter space. The BYM2 model is a good solution to this difficulty. BYM2 was discussed in detail by Riebler et al. [9], following the penalized complexity framework proposed by Simpson et al. [10]. Like the BYM model, BYM2 includes both spatial and non-spatial random effects. The difference is that BYM2 places a single scale parameter, σ, on the combined random effects and uses a mixing parameter, ρ, for the proportion of spatial versus non-spatial variation. BYM2 rewrites φ + θ in BYM as

σ( sqrt(ρ/s) φ + sqrt(1 − ρ) θ ),

where σ represents the overall standard deviation, and ρ ∈ (0, 1) controls the proportion of the variation modeled by the ICAR component, φ, which is scaled by a rescaling parameter, s, such that var(φ_i) ≈ 1. The heterogeneous effect θ_i follows a Normal(0, n) distribution, where n is the number of connected subgraphs; Morris et al. [17] proposed determining this variance from the number of connected subgraphs. In our case, we fix it at one because the neighborhood graph is fully connected. For the ICAR model on disconnected graphs, see Freni-Sterrantino et al. [18]. The quantity σ is assigned a Normal(0, 1) prior, consistent with the convention that var(θ_i) ≈ var(φ_i) ≈ 1. Using an undirected graph representation, Morris et al. [17] proposed a novel implementation of BYM2 in the Stan language [11], which substantially reduces memory usage and significantly increases fitting efficiency and speed. Because the scaling factor, s, depends on the spatial structure of the specific dataset, it is provided to the Stan model as data. Following Morris et al. [17], we compute s using the inla.scale.model function in the R package INLA, developed by Lindgren et al. [19]; for more details about INLA, see Rue et al. [20].

Because of the over-parameterization of the Pogit model, we need at least a mildly informative prior on at least one parameter in the logistic regression portion of the model. One approach to developing this prior is to elicit, from an expert or from previous data, the reporting probability at the "average" value of all the centered covariates. We denote this reporting probability by p_0. A beta prior for p_0 is developed using the prior data and/or expert opinion, and a prior for the intercept β_0 is then induced through the logistic relationship. This setting is convenient because we can use estimates from other studies or expert opinion on national or local reporting rates. One can extend this process to obtain a fully informative prior by eliciting beta priors for a range of settings across the covariate space and similarly inducing a joint prior for all the regression parameters. To elicit the beta prior for p_0, we assume that either an expert or reliable prior data are available.
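As an aside on the scaling factor s: although we compute it with inla.scale.model, the underlying calculation can be sketched directly. The following pure-Python sketch (our illustration, not INLA's implementation) computes s as the geometric mean of the marginal variances of the ICAR effect, using the sum-to-zero generalized inverse of Q = D_w − W, on a toy three-region path graph.

```python
# Sketch (not INLA's inla.scale.model itself): compute the BYM2 scaling
# factor s as the geometric mean of the marginal variances of the ICAR
# effect, from the constrained generalized inverse of Q = D_w - W.
import math

def mat_inverse(A):
    """Invert a small dense matrix by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def bym2_scaling_factor(W):
    """Geometric mean of the ICAR marginal variances for adjacency matrix W."""
    n = len(W)
    # Q = D_w - W (graph Laplacian); singular, with null space spanned by 1.
    Q = [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    # Generalized inverse under the sum-to-zero constraint:
    # Q+ = (Q + J/n)^{-1} - J/n, where J is the all-ones matrix.
    A = [[Q[i][j] + 1.0 / n for j in range(n)] for i in range(n)]
    Ainv = mat_inverse(A)
    variances = [Ainv[i][i] - 1.0 / n for i in range(n)]
    return math.exp(sum(math.log(v) for v in variances) / n)

# Toy example: a path graph of three regions, 1 - 2 - 3.
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
s = bym2_scaling_factor(W)
```

For this toy graph the marginal variances are (5/9, 2/9, 5/9), so s = (50/729)^(1/3) ≈ 0.41. In the application, s is computed once from the full state adjacency matrix and passed to Stan as data.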
In the case of expert elicitation, a common approach to determining a beta prior is to elicit a central value (such as the mean, median, or mode) and a percentile. The expert is asked two questions:

1. What value is most likely for the reporting probability?
2. What value would be considered unusually high?
Setting these two elicited values equal to, for example, the mode and the 99th percentile of a beta distribution yields the parameters of the beta distribution. The R function "elicit_beta" provided in Appendix E takes as inputs the mode, quantile, and tail probability and numerically solves for the elicited prior. In this work, we use relatively diffuse normal priors for the other regression parameters: a Normal(0, 1) prior is assigned to logit(ρ), σ, and θ_i, and relatively non-informative Normal(0, 10) priors are used for the γ_s and β_s.
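To illustrate, the following Python sketch mirrors what our R function "elicit_beta" does; the Python version and its interface are our own illustration. Given the elicited mode and a quantile that should only be exceeded with a small probability, it solves numerically for the beta parameters, fixing the mode analytically and bisecting on the shape parameter.

```python
# Illustrative Python analogue of the R "elicit_beta" function (assumed
# interface): given a most-likely value (mode) and a value q that should
# only be exceeded with probability tail_prob, solve for Beta(a, b).
import math

def beta_cdf(x, a, b, n=4000):
    """Beta(a, b) CDF at x via Simpson's rule on the density."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    def pdf(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return math.exp(log_norm + (a - 1) * math.log(t)
                        + (b - 1) * math.log(1 - t))
    h = x / n
    total = pdf(0.0) + pdf(x)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3.0

def elicit_beta(mode, q, tail_prob=0.01):
    """Find (a, b) with the given mode and Pr(p > q) = tail_prob."""
    target = 1.0 - tail_prob
    def cdf_at_q(a):
        b = 1.0 + (a - 1.0) * (1.0 - mode) / mode  # fixes the mode at `mode`
        return beta_cdf(q, a, b)
    lo, hi = 1.0 + 1e-6, 500.0  # larger a => tighter prior => larger CDF at q
    for _ in range(200):        # bisection on the shape parameter a
        mid = 0.5 * (lo + hi)
        if cdf_at_q(mid) < target:
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    return a, 1.0 + (a - 1.0) * (1.0 - mode) / mode

a, b = elicit_beta(mode=0.1, q=0.3, tail_prob=0.0001)
```

With mode 0.1, q = 0.3, and tail probability 0.0001 (the inputs used in Section 4), the solution is close to the beta(7, 55) prior adopted there.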

Model with False Positives
In many real-life applications, including the COVID-19 data of interest here, the counts may contain false positives (over-reporting) in addition to false negatives (under-reporting). Here, we outline the additions to the model that account for both; we extend the model of Bratcher et al. [21]. In the following, the models for y_i and z_i are as defined in Equations (1)-(4).
z_i = X_i + FP_i, X_i | y_i ~ Binomial(y_i, π_i), FP_i ~ Poisson(ψ E_i),

where X_i are the true positives (TP), each of which is reported with probability π_i. The number of occurrences that are missed, y_i − X_i, can be at most y_i. For the false positives, since the number of tests is large and the probability of a false positive is minimal, we use the Poisson approximation to the binomial for these counts. Again, since the number of tests was large, we can also assume the false positives are independent of the true occurrences.
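The quality of the Poisson approximation is easy to check numerically. The following Python sketch compares the Binomial(n, p) and Poisson(np) probability mass functions for an illustrative n = 100,000 tests and per-test false-positive probability p = 6 × 10^(-5); these numbers are our own, not from the data.

```python
# Sketch of the Poisson approximation used for the false positives: with a
# large number of tests n and a small per-test false-positive probability p,
# Binomial(n, p) is closely approximated by Poisson(n * p).
import math

def binom_pmf(k, n, p):
    """Binomial(n, p) probability of exactly k successes."""
    logc = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(logc + k * math.log(p) + (n - k) * math.log(1.0 - p))

def poisson_pmf(k, lam):
    """Poisson(lam) probability of exactly k events."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

n, p = 100_000, 6e-5
lam = n * p  # expected number of false positives, here 6
max_diff = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(0, 30))
```

The largest pointwise difference between the two mass functions is below 10^(-3), so treating the false-positive counts as Poisson loses essentially nothing.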
For the spatial dependency, we assume the same structure as in Section 2.1 for the Poisson rate of the true count, λ_i. For the COVID-19 data we model here, we do not have enough data to model the false-positive rate across covariates in the way we modeled the under-reporting. We therefore assume that it is roughly constant across all covariates and areas and assign a gamma prior to ψ. An informative prior is typically required. Using prior information on the number of false positives per test and the testing rate in the population, a gamma prior can be formed. Alternatively, a mode and a percentile for the rate of false positives can be elicited from an expert and converted into the corresponding gamma distribution. Regardless of how the prior is parameterized, the data provide little information about this particular parameter, so this approach can be viewed as a sensitivity analysis for the overall model, allowing for a range of potential over-reporting along with the under-reporting from the Poisson-logistic model. The function "elicit_gamma" provided in Appendix E determines a prior for a given mode and percentile, similar to the "elicit_beta" function discussed previously.
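As with the beta elicitation, this step can be sketched in Python; the interface and the numbers below are our own illustration, not the R "elicit_gamma" function itself. Given the most likely false-positive rate and a value that should rarely be exceeded, it solves for the gamma shape and rate.

```python
# Illustrative Python analogue of the R "elicit_gamma" function (assumed
# interface): given the most likely rate (mode) and a value q that should
# only be exceeded with probability tail_prob, solve for Gamma(a, b),
# parameterized by shape a and rate b.
import math

def gamma_cdf(x, a, b, n=4000):
    """Gamma(shape a, rate b) CDF at x via Simpson's rule on the density."""
    log_norm = a * math.log(b) - math.lgamma(a)
    def pdf(t):
        if t <= 0.0:
            return 0.0
        return math.exp(log_norm + (a - 1) * math.log(t) - b * t)
    h = x / n
    total = pdf(0.0) + pdf(x)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3.0

def elicit_gamma(mode, q, tail_prob=0.01):
    """Find (a, b) with the given mode and Pr(psi > q) = tail_prob."""
    target = 1.0 - tail_prob
    def cdf_at_q(a):
        return gamma_cdf(q, a, (a - 1.0) / mode)  # rate b fixes the mode
    lo, hi = 1.0 + 1e-6, 500.0  # larger a => tighter prior => larger CDF at q
    for _ in range(200):        # bisection on the shape parameter a
        mid = 0.5 * (lo + hi)
        if cdf_at_q(mid) < target:
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    return a, (a - 1.0) / mode

a, b = elicit_gamma(mode=4.0, q=15.0, tail_prob=0.01)
```

For example, a mode of 4 with Pr(ψ > 15) = 0.01 yields a prior in the same family as the gamma(5, 1) used in Section 4, whose mode is also 4.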

A Simulation Study
We implemented a simulation study to evaluate the performance of the proposed models. To generate the true counts, we set the Poisson regression parameters to γ = (5, −1, 2), and for the under-reporting rates we set β = (−2, 0.5). Thus, we assumed two covariates for the Poisson regression and one covariate for the under-reporting. All covariates were drawn independently from U(−1, 1). Finally, we set the false-positive rate to ψ = 6. One hundred data sets were replicated in this simulation. We use a Normal(0, 10^2) prior for γ_0, γ_1, γ_2, and β_1. Instead of directly placing a prior on β_0, we assigned a Beta(2, 8) prior to p_0 = exp(β_0)/[1 + exp(β_0)] to mimic the application to COVID-19 data in Section 4. For the 100 simulated data sets, we considered five models (explicit forms are listed in Appendix A), ordered by complexity, ranging from models without spatial effects (M1 and M2) up to the full model with spatial effects, under-reporting, and over-reporting (M5). We provide the average posterior bias, mean square error (MSE), and coverage probabilities of 95 percent credible intervals for the five models in Table 1. The model with both under-reporting and over-reporting (M5) performs quite well: the average bias and MSE are small, and the coverage of the 95 percent intervals is close to nominal for all parameters. Models that do not consider spatial dependency (M1 and M2) perform poorly for most parameters, and M2 exhibits fairly large bias and MSE for some parameters. In addition, the coverage of the 95 percent intervals given by M1 and M2 is well below nominal. It is interesting, but not surprising, that when applied to simulated data with both under-reporting and over-reporting, M4 performs similarly to M5, but not quite as well. Specifically, there is slightly more bias in the parameters of the Poisson regression part of the Pogit model, and the interval coverage for γ_2 is notably low. This finding illustrates that ignoring even moderate over-reporting can impact results.

Table 1. Summaries of the simulation for the five models, β_0 = −2, ψ = 6.

[Table 1 body omitted; for each parameter and model, the columns give the average bias, MSE, and 95% interval coverage.]
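The data-generating step of the simulation can be sketched as follows. This is a minimal Python version under our own simplifications: the spatial random effects of the full study are omitted, and the number of areas is illustrative.

```python
# Minimal sketch of one simulated data set, following the settings above:
# gamma = (5, -1, 2), beta = (-2, 0.5), covariates ~ U(-1, 1), psi = 6.
# Spatial random effects are omitted here for brevity.
import math
import random

rng = random.Random(2020)

def rpois(lam):
    """Poisson sampler: Knuth's method, chunked so exp(-lam) never underflows."""
    k = 0
    while lam > 60.0:
        k += rpois(60.0)
        lam -= 60.0
    limit, prod = math.exp(-lam), rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def rbinom(n, p):
    """Binomial sampler as a sum of Bernoulli draws."""
    return sum(rng.random() < p for _ in range(n))

def simulate(n_areas=100, gamma=(5.0, -1.0, 2.0), beta=(-2.0, 0.5), psi=6.0):
    data = []
    for _ in range(n_areas):
        x1, x2, w1 = (rng.uniform(-1, 1) for _ in range(3))
        lam = math.exp(gamma[0] + gamma[1] * x1 + gamma[2] * x2)
        pi = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * w1)))
        y = rpois(lam)       # true count
        tp = rbinom(y, pi)   # reported true positives (under-reporting)
        fp = rpois(psi)      # false positives (over-reporting)
        data.append({"y": y, "z": tp + fp, "x": (x1, x2), "w": w1})
    return data

data = simulate()
```

With these settings the reporting probability lies roughly between 0.08 and 0.18, so the observed counts z are far below the true counts y even after false positives are added.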

Data
We collected state-level cumulative COVID-19 cases of the U.S. from https://covidtracking.com/, accessed on 11 April 2021. Figure 1 shows the state-level map of cumulative confirmed COVID-19 cases per 10,000 people as of 30 April 2020. At that point in the pandemic, 1,071,003 confirmed cases had been reported in the 48 contiguous U.S. states and Washington DC. States in the northeast have a considerably larger number of cases per 10,000 people. In particular, the highest incidence of COVID-19 was reported in New York state (304,372 cases in total, 154 per 10,000 population). According to the global Moran's I statistic (Moran's I = 0.21, p < 0.0001), the state-level COVID-19 counts display significant positive spatial autocorrelation. We also collected state-by-state risk factors from https://www.americashealthrankings.org, accessed on 11 April 2021. It is believed that insufficient COVID-19 testing capacity in the early stage of the pandemic was the key factor leading to under-reporting. Historical testing data by state were also available from https://covidtracking.com, accessed on 11 April 2021. Table 2 presents the description and summary statistics of the variables used in the application. In Table 2, the variable "Pop" divided by 1,000,000 is used as the offset E, "Testing" is the single covariate used for the reporting process, and all the other variables are potential covariates for the Poisson regression. We are interested in actual counts of COVID-19 cases, not deaths. However, early in the pandemic, often only the more severe cases were reported, which is why we included several covariates related to underlying medical conditions, such as "Smoking", "Obesity", and "Air pollution" (strongly related to asthma). Furthermore, "Alcoholism", "Inactivity", and "Drug deaths" are included because we believe these behavioral factors may affect the willingness or awareness to get tested when symptoms are mild.
In addition, free testing was very limited in the first two months; people without insurance would thus be less likely to request a COVID-19 test, as the fees may have been unaffordable, so "Uninsured" is included.
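For reference, the global Moran's I statistic reported above is straightforward to compute from the adjacency matrix. Below is a Python sketch on a toy four-region chain with toy values, not the COVID-19 data; the paper's value of 0.21 comes from the full state adjacency matrix and case rates.

```python
# Sketch of the global Moran's I statistic: I = (n / S0) * sum_ij w_ij
# (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2, with S0 = sum_ij w_ij.
def morans_i(x, W):
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    num = sum(W[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    s0 = sum(W[i][j] for i in range(n) for j in range(n))
    return (n / s0) * (num / den)

# Toy example: a 4-region chain where neighboring values are similar,
# so Moran's I should come out positive.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
rates = [10.0, 12.0, 30.0, 33.0]
I = morans_i(rates, W)  # positive: similar values cluster on the chain
```

For these toy inputs I ≈ 0.39; values near zero would indicate no spatial autocorrelation.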

Priors
The prior distributions for the parameters can be developed through expert opinion or induced from other studies. As mentioned previously, we require at least one mildly informative prior so that the posterior sampling converges appropriately. The conditional means prior approach of Bedrick et al. [22] can be used to obtain a fully informative prior for all regression parameters; here, we use it to obtain an informative prior for the intercept of the logistic regression. We first assume diffuse normal priors for all the regression parameters except β_0:

β_s ~ N(0, 10^2), s = 1, ..., k − 1, (6)
γ_s ~ N(0, 10^2), s = 0, 1, ..., j − 1, (7)

where k and j are the numbers of covariates used for the logistic and Poisson components, respectively. To determine an informative prior for β_0, we refer to the available studies on COVID-19 reporting rates. The references in Section 1 indicate that the actual reporting rate in the early stage of the pandemic was very low, so we assume that the mode of the reporting probability at the average value of the covariates, p_0, is around 0.1 and that Pr(p_0 > 0.3) is small (specifically, Pr(p_0 > 0.3) = 0.0001), which results in a beta(7, 55) prior. Because all covariates are centered, β_0 is interpreted as the log-odds of the reporting rate at the average level of the covariates. Assuming logit(p_0) = β_0, the beta(7, 55) prior on p_0 induces a prior on β_0. Note that a Jacobian adjustment is necessary in the log-likelihood statement in Stan. This approach to prior elicitation could be extended to the other regression parameters by eliciting priors on the probabilities at a range of covariate values spread across the covariate space and using them to induce a prior on all the coefficients, similar to what we did for the intercept alone. We also considered a beta(5, 78) prior, which corresponds to a most likely reporting probability of only 5% that could plausibly be as high as 20%.
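The induced prior on β_0 can be examined by simulation. The following Python sketch, our own illustration, samples p_0 ~ Beta(7, 55) and maps the draws through the logit, confirming the elicited statements.

```python
# Monte Carlo illustration of the induced prior on the logistic intercept:
# draw p0 ~ Beta(7, 55) and map through the logit to get beta_0 draws.
import math
import random

rng = random.Random(1)
a, b = 7.0, 55.0

mode = (a - 1.0) / (a + b - 2.0)  # analytic mode of Beta(7, 55): 0.1

p0 = [rng.betavariate(a, b) for _ in range(100_000)]
beta0 = [math.log(p / (1.0 - p)) for p in p0]  # induced draws of beta_0

tail = sum(p > 0.3 for p in p0) / len(p0)  # empirical Pr(p0 > 0.3)
center = sorted(beta0)[len(beta0) // 2]    # median of the induced beta_0
```

The mode of p_0 is exactly 0.1, the empirical Pr(p_0 > 0.3) is on the order of 10^(-4), and the induced prior for β_0 centers near logit(0.11) ≈ −2.1. Implementing this in Stan by placing the beta prior on the inverse-logit of β_0 requires the Jacobian adjustment mentioned above.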
For the model with false positives, we require an informative prior for ψ. We model the false-positive rate on the same scale as the observed counts, that is, false positives per unit of population rather than per a certain number of tests. This choice allows the model to generalize to situations where the counts do not arise from a fixed number of diagnostic tests. As of April 2020, testing was still relatively limited, so we consider two priors for the false-positive rate: a gamma(5, 1), which corresponds to an expected five false positives per 100,000 people, and a gamma(30, 1), which corresponds to an expected 30 false positives per 100,000 people. The modeling process is implemented in R [23] and Stan; the code is provided in Appendix E.

Results
Our simulation in Section 3, along with the simulation results in Stoner et al. [7], confirms the usefulness of the hierarchical model. Specifically, ignoring the under-reporting and spatial aspects of the data results in biased estimates and underestimation of posterior variability. The results of the analysis of the COVID-19 data are summarized in Table 3. As expected, M1 and M2 exhibit considerably less posterior variability because they ignore the significant spatial dependency. The results of M3 are likely an improvement but, judging from the simulation study, the posterior means are likely biased because under-reporting is not considered. M4 and M5 have very similar estimates of most regression parameters and spatial random effects (see Appendix D); both show that "Testing" is a strong factor positively related to the COVID-19 case reporting rate. The deviance information criterion (DIC) values for M1-M5 are 152,001, 131,129, 633.1, 631.1, and 631.3, respectively. Thus, M3, M4, and M5 provide very similar fits to the data, with a slight preference for M4, which accounts for under-reporting but ignores any potential over-reporting. This result is plausible: the degree of under-reporting is so large that it almost certainly washes out the impact of whatever over-reporting was present in the early days of the pandemic. The best-fitting model, M4, has three significant variables in the Poisson component of the model for the COVID-19 counts: "Alcoholism", "Smoking", and "Inactivity". The effects of "Alcoholism" and "Inactivity" are positive, while the effect of "Smoking" is negative. These results indicate that high alcoholism and inactivity percentages may increase the state-level incidence of COVID-19, while smoking appears protective. The latter is surprising, as smoking is a risk factor for respiratory diseases; this may be an instance of the ecological fallacy, since the data are aggregated at the state level.
Our model can also estimate the central tendency and a credible interval for the reporting rate of each state (Appendix C).
The simulation and application are both implemented in the R and Stan languages. It is worth mentioning that we obtained stable (convergent) results with a burn-in of only 2000 iterations and 4000 total iterations in this application, which is much faster and more efficient than Stoner et al. [7], where 400,000 burn-in iterations and 800,000 total iterations were required to obtain convergent chains for inference (using the Nimble package in R).
To check prior sensitivity, we reran model M5 with all combinations of the priors used for the reporting probability and the false-positive rate, that is, the beta(7, 55) and beta(5, 78) priors that induce an informative prior for the logistic model, crossed with the gamma(5, 1) and gamma(30, 1) priors for the false-positive rate. The results were remarkably robust to these prior choices: no posterior gains or loses statistical significance due to changes in the priors, and the posterior means change very little across the prior combinations. For instance, the posterior mean for the alcoholism coefficient ranged from 0.201 to 0.213 across the four combinations, while the posterior mean for inactivity ranged from 0.413 to 0.427. Summaries for all the coefficients are provided in Appendix B.

Conclusions
Misclassification in the form of under-reporting and over-reporting can bias estimates and potentially lead to wrong decisions. When the decisions concern a worldwide pandemic, it is particularly important that all available tools are applied to bring the best information to policymakers. This paper has used and extended a recently proposed Bayesian model that accounts for under-reporting and spatial effects. Our extensions include allowing for potential over-reporting, a suggested way to incorporate expert prior information, and implementing the model in alternative software, which yields much faster computation.
For the COVID-19 data, we found that the models accounting for misreporting yielded relatively similar inferences, with the main differences in effect sizes; that is, no parameter estimates switched direction, only magnitude. One can easily extend the model by adding a time component when repeated counts are available, allowing estimation and prediction of the trajectory of the count variable in each area. We could also expect more accurate inference if the model were applied to data at a higher spatial resolution, such as county-level COVID-19 cases; this is practical because computation is fast thanks to the efficient implementation of BYM2 in the Stan language.
This study is confined to modeling spatial variation when adjusting for misreporting. Infectious disease data sets are often recorded at high temporal resolution and used to study characteristics of seasonal epidemics and future pandemics, and timely, accurate forecasting of emerging infectious diseases is crucial in public health. Thus, future work could extend the proposed models to spatio-temporal misreporting models that account for spatial and temporal variation simultaneously.
Another limitation is that we did not include spatial effects in the reporting mechanism itself. This is another important extension, one that will require significant work, as additional identifiability constraints will need to be resolved.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Explicit Forms for M1 to M5
Appendix B. Sensitivity Analysis for M5

Table A1. Sensitivity analysis for M5 with different priors on the intercept β_0 and the false-positive rate ψ.