Estimating Unreported COVID-19 Cases with a Time-Varying SIR Regression Model

Background: Unreported infections might impair and mislead policymaking for COVID-19, and the current spread of COVID-19 varies across counties of the United States. To take better countermeasures against COVID-19, it is necessary to estimate, from county-level data, the cases that official counts might miss. We propose time-varying Susceptible-Infected-Recovered (SIR) models with unreported infection rates (UIR) to estimate actual COVID-19 cases in the United States. Methods: Both the SIR model integrated with unreported infection rates (SIRu) with a fixed-time effect and the SIRu with time-varying parameters (tvSIRu) were applied to estimate and compare the values of transmission rate (TR), UIR, and infection fatality rate (IFR) based on US county-level COVID-19 data. Results: Based on the US county-level COVID-19 data from 22 January (T1) to 20 August (T212) in 2020, SIRu was first tested and verified by Ordinary Least Squares (OLS) regression. Further regression of SIRu at the county level showed that the average values of TR, UIR, and IFR were 0.034, 19.5, and 0.51%, respectively. Across states, TR, UIR, and IFR ranged over 0.007–0.157 (mean = 0.048), 7.31–185.6 (mean = 38.89), and 0.04–2.22% (mean = 0.22%), respectively. Among the time-varying TR equations, the power function fitted best, indicating a decline in TR from 227.58 (T1) to 0.022 (T212). The general equation of tvSIRu showed that the UIR was gradually increasing while the IFR was declining; at T212, the estimated UIR was 9.1 (95%CI 5.7–14.0) and the IFR was 0.70% (95%CI 0.52–0.95%). Interpretation: Despite the declining trend in TR and IFR, the UIR of COVID-19 in the United States is still on the rise, although it had been assumed that it would decrease with sufficient testing or improved countermeasures. The US medical system might be heavily affected by severe cases amid a rapid spread of COVID-19.


Introduction
Although COVID-19 was reported several months ago [1], the coronavirus is still raging on a global scale, and is especially surging in the United States, which is one of the most important engines of the global economic network. The pandemic in the United States will have an important impact on the global economy and politics. It is fundamental to make relatively accurate estimates for preventing and controlling the COVID-19 pandemic in the United States [2,3], wherein the transmission rate (TR) and infection fatality rate (IFR) are key indicators [4].
The main obstacle to calculating such indicators is the unreported infection rate (UIR), which might be caused by insufficient testing, under-recording of mild or asymptomatic patients, and a time-lag bias [5,6]. Direct use of IFR values derived from official data might lead to larger errors [7]. Similar research on SARS pointed out that preferential ascertainment of severe cases and delayed reporting of deaths are the two main reasons for case fatality risk (CFR) error [8]. Beyond insufficient early testing, mild and asymptomatic patients might account for most unreported cases. In Brazil, only some moderate and severe infectives in hospitalizations are recorded thus far [9]. On the other hand, the time-lag deviation could be explained by the incubation period of COVID-19, which varies over a wide range [10] and during which transmissibility remains high [11]. The incubation period is also correlated with the age of the infectives, which can directly affect IFR [12]. It was concluded that unreported cases might lead to four kinds of uncertainty in IFR calibration: an unclear denominator, unknown infection time, unknown incubation period, and undiagnosed asymptomatic infections [13].
Characterizing unreported cases has become a popular question in the epidemic modeling of COVID-19. The recent literature attempts to calculate the UIR or the reported rate (RR) based on country-level data [14-16], although relying on data at a single country level might lead to a greater bias [17]. Moreover, the county-level data in the United States on recovered infectives are not released. Thus, the calculation of IFR depends merely on the national aggregate data, which might further amplify the error. A growing number of studies use multinational data [18], county-level data [19], or country-county mixed regional data [20] for analysis, which greatly improves the accuracy of modeling by increasing the dimensionality and quantity of data.
However, previous studies seldom investigated the time effect of UIR, which might affect the accuracy of all indicators. A recent study suggested using a time-varying SIR model to capture the changing transmission rate [21]. Moreover, the incubation period was shown to change in different stages of transmission [22]. Some studies showed that the possible value of the COVID-19 IFR in China should be 2.3% [23], while another study showed that the early COVID-19 IFR in Wuhan might be as high as 20% [24]. Such disputes might also imply a changing trend in IFR.
This study proposes an SIR regression model with an unreported infection rate (SIRu) and an SIRu with time-varying parameters (tvSIRu) to estimate the values of TR, UIR, and IFR, and to assess the impact of the time effect. The US county-level data used in this study come from the open-source data of Johns Hopkins University on GitHub [25]. This study provides the first insights into the time series values of TR, UIR, and IFR of COVID-19, contributing to a deeper understanding of the trend of COVID-19 in the United States.

Data Source
The COVID-19 data used in this article covered 3142 counties in the United States and included the number of daily new infectives, cumulative infectives, and deaths, while the number of recovered infectives remained unreported.
The data ranged from 22 January 2020 to 20 August 2020 and contained 666,104 (3142 × 212) records. As a time-lag order ($t_k$, $t_{k+1}$) was applied in the data analysis, the number of records used for regression was 662,962 (3142 × 211).
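The record counts above follow from pairing each day with the next one inside every county. The following is a minimal sketch of that bookkeeping; the helper name `lagged_pairs` and the toy series are illustrative assumptions, not part of the study's code.

```python
def lagged_pairs(series):
    """Return (value at t_k, value at t_{k+1}) pairs for one county's series."""
    return list(zip(series[:-1], series[1:]))

N_COUNTIES = 3142  # counties in the data set (from the text)
N_DAYS = 212       # 22 January to 20 August 2020, i.e. T1..T212 (from the text)

total_records = N_COUNTIES * N_DAYS
# Each county loses one record, because the last day has no successor.
regression_records = N_COUNTIES * (N_DAYS - 1)

# Hypothetical cumulative counts for a single county over four days.
toy_cumulative = [0, 1, 3, 6]
toy_pairs = lagged_pairs(toy_cumulative)

print(total_records, regression_records, toy_pairs)
```

A series of 212 daily values yields 211 pairs, which is why the regression uses 3142 × 211 records rather than 3142 × 212.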

tvSIRu Model with Fixed UIR
In the classic SIR dynamic model, the number of daily new infectives ($I^{d}_{t_{k+1}}$) at time $t_{k+1}$ could be expressed as a function of the infection rate $\beta$, the number of susceptible persons ($S_{t_k}$), infected persons ($I_{t_k}$), and the total population ($N$) at time $t_k$ (Equation (1)).
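Equation (1) itself is not reproduced in this text. As a hedged illustration, the sketch below uses the standard discrete-time SIR incidence term, new infections = β · S · I / N, which matches the quantities named above; the function name and the toy numbers are assumptions made for illustration.

```python
def sir_new_infections(beta, susceptible, infected, population):
    """New infections over one time step in the standard discrete SIR form.

    ASSUMPTION: the incidence term is beta * S * I / N, which is the usual
    textbook form and may differ in detail from the paper's Equation (1).
    """
    return beta * susceptible * infected / population

# Toy county: 100,000 residents, 1,000 currently infected, nobody removed yet.
population = 100_000
infected = 1_000
susceptible = population - infected

# beta = 0.0339 is the fixed-effect TR estimate reported in the Results.
new_cases = sir_new_infections(0.0339, susceptible, infected, population)
print(f"expected new infections in the next step: {new_cases:.3f}")
```

Because the susceptible share S/N shrinks as the epidemic progresses, the same β yields fewer new infections per infected person later on.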
The SIR model with unreported infection rate (SIRu) is illustrated in Figure 1. As the population of the recovered infectives was not released, two kinds of parameters were added to the SIR model: $\lambda$ for the recovery/death rate (RDR), and two symbols, written here as $\phi$ and $\phi'$, for the unreported infection rate (UIR) of cumulative cases and daily cases, respectively. When considering a fixed UIR with no time effect, the UIR of total cumulative infectives and daily new cases could be considered equivalent, i.e., $\phi = \phi'$. The two explanatory variables in Equation (1), $S_{t_k}$ and $I_{t_k}$, could then be calculated from the reported series, and the SIR model (Equation (1)) could be developed into Equation (6) by substituting Equations (2)-(4).
Through further simplification and operation, Equation (6) could be transformed into Equation (7), which could be taken as the general tvSIRu model. Since the four combined variables, $I^{cr}_{t_k}$, $R^{dr}_{t_k}$, $(I^{cr}_{t_k})^2/N$, and $I^{cr}_{t_k} R^{dr}_{t_k}/N$, could be acquired or calculated from the data released, Equation (7) could be regarded as the primary linear function, Equation (8), with coefficients a, b, c, and d. When considering the fixed-time effect of all three parameters in Equation (7), the corresponding average values (β, λ, ϕ) could be calculated in Equation (9).
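Equations (6)-(9) are not reproduced in this text, so the following sketch rests on assumed definitions rather than on the authors' exact formulas: actual cumulative infections are taken as (1 + ϕ) times the reported cumulative cases, and removed persons as (1 + λ) times the reported deaths. These assumptions agree with how the Results interpret ϕ (unreported cases per reported case) and λ. Under them, the SIR step expands into a linear combination of the four combined variables named above. All function names are hypothetical.

```python
def combined_variables(i_cr, r_dr, n):
    """The four regressors named in the text, for one county-day."""
    return (i_cr, r_dr, i_cr ** 2 / n, i_cr * r_dr / n)

def linear_coefficients(beta, phi, lam):
    """Coefficients a, b, c, d implied by the ASSUMED definitions."""
    a = beta
    b = -beta * (1 + lam) / (1 + phi)
    c = -beta * (1 + phi)
    d = beta * (1 + lam)
    return a, b, c, d

def reported_new_cases_linear(i_cr, r_dr, n, beta, phi, lam):
    """Reported daily cases from the linear form a*x1 + b*x2 + c*x3 + d*x4."""
    coefs = linear_coefficients(beta, phi, lam)
    return sum(k * x for k, x in zip(coefs, combined_variables(i_cr, r_dr, n)))

def reported_new_cases_sir(i_cr, r_dr, n, beta, phi, lam):
    """Same quantity computed directly from the SIR step (assumed definitions)."""
    actual_cumulative = (1 + phi) * i_cr   # reported plus unreported infections
    removed = (1 + lam) * r_dr             # recovered plus dead
    susceptible = n - actual_cumulative
    infected = actual_cumulative - removed
    actual_new = beta * susceptible * infected / n
    return actual_new / (1 + phi)          # share of new cases that is reported

def parameters_from_coefficients(a, b, c):
    """Invert a, b, c back to (beta, phi, lam) under the same assumptions."""
    return a, -c / a - 1, b * c / (a * a) - 1

# Fixed-effect estimates quoted in the Results; the county values are toy numbers.
beta, phi, lam = 0.0339, 19.5, 192.5
linear = reported_new_cases_linear(500, 10, 100_000, beta, phi, lam)
direct = reported_new_cases_sir(500, 10, 100_000, beta, phi, lam)
print(f"linear form: {linear:.4f}, direct SIR step: {direct:.4f}")
```

In this sketch, b and c come out negative, in line with the coefficient signs reported for the OLS fit. The linear form has four coefficients but only three underlying parameters, which is one plausible reason for estimating β, λ, and ϕ with a separate non-linear regression.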

tvSIRu Model with Time-Varying UIR
If the UIR varied over time, the UIRs of the cumulative cases and daily new cases were different and were defined separately (written here as $\phi$ and $\phi'$, respectively). Equation (6) could then be rewritten as Equation (10). To simplify the computation, a new parameter (a modified $\beta$) was introduced in Equation (11), so that Equation (10) could be transformed into a form similar to Equation (7), namely Equation (12). To verify the assumption of time-varying parameters, the coefficients in Equations (7) and (12) could be represented by the initial values and time effect functions. Such functions were substituted into the two models gradually, resulting in several sub-equations with time effects.

OLS and SIRu Regressions
The linear regression derived from the SIRu model showed acceptable fitness, with an adjusted $R^2$ of 0.4813 (n = 662,962) (Table 1). The negative values of coefficients b and c were consistent with the corresponding operation signs in Equation (7). Such results verified the assumption of the SIRu model to a certain extent. The SIRu model with a fixed-time effect in Equation (9) further provided the estimated values of TR, UIR, and RDR (Table 2). The results showed that the average $\beta_0$ value from 22 January to 20 August was 0.0339 (95%CI 0.0338-0.0340), and the $\phi_0$ value was 19.5 (95%CI 19.38-19.54), which implied that, on average, there might be 19.5 undiagnosed cases for each reported infection in US counties. Meanwhile, the $\lambda_0$ value of 192.5 (95%CI ….243) could be interpreted as an IFR value of 0.516%.
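The conversion from RDR to IFR is not spelled out in this text. The RDR and IFR pairs reported in the Results are consistent with IFR = 1 / (1 + λ), so the sketch below uses that relation as an assumption inferred from the numbers, not as a quoted formula. The function name is hypothetical.

```python
def ifr_from_rdr(rdr):
    """Infection fatality rate implied by an RDR value.

    ASSUMPTION: IFR = 1 / (1 + RDR), inferred from the value pairs in the text.
    """
    return 1.0 / (1.0 + rdr)

# RDR values quoted in the text.
fixed_effect_ifr = ifr_from_rdr(192.5)   # fixed-time estimate of lambda_0
t212_ifr = ifr_from_rdr(141.706)         # time-varying estimate at T212
threshold_ifr = ifr_from_rdr(99)         # state-level cut-off "below 99 (IFR > 1%)"

for label, value in [
    ("fixed effect", fixed_effect_ifr),
    ("T212", t212_ifr),
    ("RDR = 99", threshold_ifr),
]:
    print(f"{label}: IFR = {100 * value:.3f}%")
```

Under this assumed relation, the fixed-effect value is about 0.517% (reported as 0.516%), the T212 value is about 0.70%, and an RDR of 99 corresponds to exactly 1%, matching the figures given in the Results.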

SIRu at the State Level
The study further utilized county-level data to compare state-level parameters based on fixed-time effects. Figure 2 shows the adjusted $R^2$ of Equation (8) across all states, most of which were above 0.5. Each state had different TR, UIR, and RDR values in Equation (9), which indicated an obvious spatial heterogeneity in the transmission of COVID-19 (Figure 3). All parameters and statistical descriptions are reported in Appendices A-C.

Figure 2. Fitness of Equation (8) with county-level data. The scaled density curve of adjusted $R^2$ shows that Equation (8) was generally applicable, and its mapping indicated that the potential spatial heterogeneity of the states would affect the results of the SIRu modeling. Among them, the states in the southeast, on the west coast, and in the Great Lakes Region showed higher adaptability.

The fitting results on RDR in some states were not significant, but most significant values were between 200 and 500, which was equivalent to an IFR ranging from 0.2% to 0.5% (Figure 3c). Eight states had values below 99 (IFR > 1%), including Ohio. The Pearson correlation between the three state-level indicators was also tested, showing a positive correlation between UIR and RDR. In other words, the lower the IFR, the higher the UIR (Figure 3d).

tvSIRu Regression at the Country Level
The tvSIRu model with time-varying TR was first tested with three sub-equations of Equation (16), and the AIC of all three was lower than that of the SIRu model with a fixed-time effect (Table 3). Meanwhile, all estimated TR curves displayed a declining trend (Figure 4). The power function showed the best fit, with an extremely high initial value of 227.58 (95%CI 219.89-235.27) decreasing to 0.022 on 20 August. Such a high value might reflect the high contagiousness of COVID-19 in the early stage. The corresponding UIR and RDR were 18.61 (95%CI 18.52-18.69) and 183.34 (95%CI 182.63-184.05), which were slightly lower than the values in Equation (9).

Table 3. Time-varying TR estimated by Equation (16). Forms of g(t): $g(t) = m^t$ (exponential), $g(t) = t^m$ (power), and a periodic function.

Figure 4. Although the initial values of the power function were much higher than those of the exponential function, in the medium term the two tended to converge, while the periodic function suggested that the epidemic was in its third wave.
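To make the power-function trend concrete, the sketch below assumes that TR follows β(t) = β₁ · t^m, with t = 1 at T1, so that β₁ is the initial value. Under that assumption, the two endpoints reported above (227.58 at T1 and 0.022 at T212) determine the exponent m. The fitted exponent from Table 3 is not available in this text, so the value computed here is only the one implied by the rounded endpoints, and the function names are hypothetical.

```python
import math

def implied_power_exponent(tr_start, tr_end, t_end):
    """Exponent m such that tr_start * t_end**m == tr_end (assumed power form)."""
    return math.log(tr_end / tr_start) / math.log(t_end)

def tr_power(tr_start, m, t):
    """Transmission rate at day t under the assumed form tr_start * t**m."""
    return tr_start * t ** m

TR_T1 = 227.58    # initial TR of the power function (from the text)
TR_T212 = 0.022   # TR on 20 August (from the text)

m = implied_power_exponent(TR_T1, TR_T212, 212)
print(f"implied exponent m = {m:.3f}")

# The curve drops steeply at first and flattens afterwards.
for day in (1, 2, 10, 50, 212):
    print(day, tr_power(TR_T1, m, day))
```

The implied exponent is roughly -1.7, which means most of the decline happens within the first days of the series, consistent with the very large initial value and the small late values described in the text.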
When the time effect of RDR was further added in Equation (17), the AIC of the power function decreased slightly (Table 4). The UIR was 19.02 (95%CI 18.93-19.12), which was similar to the value in Equation (9). However, both equations showed decreasing trends in RDR, implying an increase in IFR (Figure 5).

Table 4. Time-varying TR and RDR estimated by Equation (17). Forms of g(t) and h(t): $g(t) = m^t$, $h(t) = k^t$ (exponential); $g(t) = t^m$, $h(t) = t^k$ (power).

Figure 5. If the time effect of UIR was not considered, the fitting results showed that RDR decreased over time, which meant that IFR might be slowly increasing.

The power function also performed better in tvSIRu with all three time-varying parameters estimated by Equation (18), which indicated a gradual increase in both UIR and RDR (Table 5). This trend indicated that the initial UIR and RDR were relatively low (Figure 6). The values of UIR and RDR reached 9.1 (95%CI 5.7-14.0) and 141.706 (95%CI 103.3358-189.9486), respectively, at $T_{212}$ on 20 August. IFR could be calculated as 0.70% (95%CI 0.52-0.95%). Based on the officially released data on 20 August 2020, it might be concluded that about 30% of the whole population was infected.

Table 5. Time-varying UIR and RDR estimated by Equation (18).

Discussion
Few studies have analyzed the time-varying UIR of COVID-19 and its impact on the estimation of TR and IFR. This study estimated the values of UIR, TR, and IFR under both fixed-time and time-varying effects with tvSIRu models, based on county-level data.
In terms of the fixed-time effect, the results showed that from 22 January to 20 August, the average TR and UIR at the country level in the United States were 0.03 and 19.5, respectively, and the RDR was 192.5, which also meant that the IFR was 0.516%. The IFR was slightly lower than the overall IFR of 0.66% estimated in China [17], while the CDC in the United States recommends 0.65% [26].
In a further analysis at the state level, the UIR of all states ranged from 7.32 to 185.66 (mean = 38), and the IFR ranged from 0.037% to 2.20% (mean = 0.21%). A related study on 20 US counties estimated that the range of UIR was 4.32-776.68 (mean = 27.7) and that of IFR was 0.02-1.81% (mean = 0.027%); the range of UIR estimated by the SIRu model was more concentrated, and the IFRs had a similar upper boundary [27]. Another previous study estimated the upper boundary of UIR for four states: Illinois (40.86), Massachusetts (38.28), New Jersey (29.22), and New York (35.17) [19]. For the first three, the values estimated by the SIRu model were similar (41.51, 39.22, and 31.83); only New York had a different value, 7.32. Interestingly, however, that study also pointed out that the UIR estimated by an antibody test in New York State in early May was around 7.6, which might indicate the stability of the SIRu method.
Based on the tvSIRu model, UIR and RDR increased following the power function rather than the exponential function, which was the default setting in previous research [21]. In contrast to the average value of 0.03 in SIRu, the TR estimated by the tvSIRu model decreased from a large value of 227 to 0.022 on 20 August, which was much lower than the fixed value of 0.05-0.06 reported in related research [21]. This might further explain the high contagiousness in the initial stage of COVID-19 transmission. The increasing UIR estimated by the tvSIRu model reached 9.1 (95%CI 5.7-14.0) at $T_{212}$ (20 August), which was very close to the value of 9 estimated in a former study in April [20] and in the latest study in September [28]. The UIR value was also close to the value reported in Brazil (reported rate = 9.2%, UIR = 10.8) [18]. Such similarity in the estimated UIR in different periods might be caused by the fixed-time effect in the former models, which only represented the average values of UIR, as calculated by the SIRu model. The increasing UIR meant that the IFR was on a downward trend. The value of IFR on 20 August was 0.70% (95%CI 0.52-0.95%), which was still close to the value recommended by the CDC [26].
Many studies supposed that the UIR would decrease with the improvement of COVID-19 testing and increased hygiene awareness, but our research showed that UIR in the United States is increasing, which might have a great impact on policy-making for COVID-19 prevention. On the other hand, an empirical TR is often used in contemporary COVID-19 modeling, but the tvSIRu model indicates that the COVID-19 infection rate changed dramatically. The initial value of TR was 246, reflecting that this pandemic was extremely contagious in the early transmission stage in the United States. Previous SIR modeling seldom characterized such a feature, which might lead to large estimation errors. The decreasing TR and IFR and the increasing UIR indicated by the model showed that the epidemic was spreading rapidly in the United States, with a large self-healing population. However, it is noteworthy that a potential increase in severe cases might greatly affect the medical system, and the relevant departments still need to provide more protection to high-risk groups.
As shown in Figure 3, given the potential pattern of spatial correlation, the tvSIRu model could be developed further by integrating models that consider spatial weights, such as the Geographically Weighted Regression (GWR) model [29] or the Spatial Panel Model, to detect the spatiotemporal features of COVID-19 transmission. Meanwhile, the regression used in the tvSIRu models could also be extended by a non-linear method, such as the Artificial Neural Network (ANN) [30].

Conclusions
This article indicates that there might be an increasing number of unrecorded COVID-19 cases in the official U.S. data. The tvSIRu model provides a simple, convenient, and relatively accurate calculation of the unreported parameters of COVID-19 with time effects, based on officially released data. Moreover, this method can be easily adapted to epidemic modeling in other countries.
It must be admitted that if data from a single geographic level are used, the independent variables might display strong collinearity, leading to overfitting. It is therefore necessary to use proper sub-geographical level data to fit the national-level or state-level data. Furthermore, the non-linear model regression was based on the Gauss-Newton iteration, which could be further optimized with machine learning models.
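As an illustration of the Gauss-Newton iteration mentioned above, the sketch below fits a toy power curve y = b0 · t^m to noiseless synthetic data. It is a generic textbook version of the method under stated assumptions (two parameters, no damping, synthetic data), not the study's actual tvSIRu fitting code. The function name and all numbers are hypothetical.

```python
import math

def gauss_newton_power(ts, ys, b0, m, iterations=50, tol=1e-12):
    """Fit y = b0 * t**m by plain Gauss-Newton; returns the estimates (b0, m)."""
    for _ in range(iterations):
        s11 = s12 = s22 = g1 = g2 = 0.0
        for t, y in zip(ts, ys):
            basis = t ** m
            j1 = basis                     # partial derivative w.r.t. b0
            j2 = b0 * basis * math.log(t)  # partial derivative w.r.t. m
            r = y - b0 * basis             # residual at the current estimate
            s11 += j1 * j1
            s12 += j1 * j2
            s22 += j2 * j2
            g1 += j1 * r
            g2 += j2 * r
        det = s11 * s22 - s12 * s12
        if det == 0.0:
            break  # singular normal equations; give up in this simple sketch
        # Solve the 2x2 normal equations (J^T J) delta = J^T r.
        d_b0 = (s22 * g1 - s12 * g2) / det
        d_m = (s11 * g2 - s12 * g1) / det
        b0 += d_b0
        m += d_m
        if abs(d_b0) + abs(d_m) < tol:
            break
    return b0, m

# Synthetic data generated from known parameters (b0 = 5.0, m = -1.5).
ts = [float(t) for t in range(1, 21)]
ys = [5.0 * t ** -1.5 for t in ts]

b0_hat, m_hat = gauss_newton_power(ts, ys, b0=4.0, m=-1.2)
print(b0_hat, m_hat)
```

Plain Gauss-Newton can diverge from a poor starting point or with noisy data, so practical implementations usually add step damping or a trust region. That sensitivity is one reason alternative optimizers may be worth exploring.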