Modeling COVID-19 Incidence by the Renewal Equation after Removal of Administrative Bias and Noise

Simple Summary In the past two years, the COVID-19 incidence curves and reproduction number Rt have been the main metrics used by policy makers and journalists to monitor the spread of this global pandemic. However, these metrics are not always reliable in the short term, because of a combination of delay in detection, administrative delays and random noise. In this article, we present a complete model of COVID-19 incidence, faithfully reconstructing the incidence curve and reproduction number from the renewal equation of the disease and precisely estimating the biases associated with periodic weekly bias, festive day bias and residual noise. Abstract The sanitary crisis of the past two years has focused the public’s attention on quantitative indicators of the spread of the COVID-19 pandemic. The daily reproduction number Rt, defined by the average number of new infections caused by a single infected individual at time t, is one of the best metrics for estimating the epidemic trend. In this paper, we provide a complete observation model for sampled epidemiological incidence signals obtained through periodic administrative measurements. The model is governed by the classic renewal equation using an empirical reproduction kernel, and subject to two perturbations: a time-varying gain with a weekly period and a white observation noise. We estimate this noise model and its parameters by extending a variational inversion of the model recovering its main driving variable Rt. Using Rt, a restored incidence curve, corrected of the weekly and festive day bias, can be deduced through the renewal equation. We verify experimentally on many countries that, once the weekly and festive days bias have been corrected, the difference between the incidence curve and its expected value is well approximated by an exponential distributed white noise multiplied by a power of the magnitude of the restored incidence curve.


Introduction
The renewal equation, first formulated for birth-death processes in a 1907 note of Alfred Lotka [1], establishes a model for epidemic propagation based on the individual infectiousness. The infectiousness of individuals at time t is characterized by the reproduction number R t , defined as the average number of cases generated by an infected person at time t, and by the generation time [2,3] defined as the probability distribution of the time between infection of a primary case and infections in secondary cases. This probability distribution depends on the incubation time (a permanent biological factor) and on the detection time (which we assume stationary). For these reasons, the distribution of the generation time is supposed to be independent of t. In practice, the generation time is replaced by the observable serial interval Φ s which represents the time distribution of the delay of the onset of symptoms between primary and secondary cases. In Figure 1, we show the serial interval obtained in [4] using 689 observed pairs of primary and secondary cases.   Observed number of cases Approximation using a shifted log-normal Figure 1. The serial interval Φ s obtained by [4]. The bars represent the observed number of cases in function of the number of days between the onset of symptoms in primary and secondary cases. The dotted line is its approximation by a scaled and shifted log-normal distribution.
The case renewal Equation [5,6] is a classic equation linking R t , Φ and the incidence i t of new daily cases, where t c is the current time. This equation does not account for several strong perturbations of i t . Government statistics of the observed incidence curve are indeed affected by changes in testing and polling policies and by weekend reporting delays. These recording delays and subsequent rash corrections result in impulse noise, and in a strong weekly periodic bias observable on the observed incidence curve i 0 t . In [3] this bias is corrected by a seven days sliding average and in [7] it is corrected by multiplying i 0 t by a 7-day periodic factor q t . These bias correcting coefficients q t are learned by a variational method that we describe below. Our first purpose in this note is to resolve the festive day problem. We denote by F the set of festive days t, at which the i 0 t curve is strongly affected by the reduction in the number of registered cases. This reduction is compensated by an increase in the number of registered cases the following days. No model has been proposed so far to address this problem, which creates strong impulse noise in any estimation of i t and R t . We tackle this problem by a variational method computing R t , where both i t and R t are considered unknown on festive days and in the next few days. To that purpose, we shall denote by F + the union of festive days and the ones following them affected by the festive day (typically 2 or 3 days after the festive day).
Our second purpose is to provide a noise model for the differenceî t − i r t between the signalî t corrected of the week-end and festive effects, and its restored version i r t using the renewal equation, defined by We provide strong experimental evidence that the relation betweenî t and i r t , can be empirically modeled byî where a > 0 and ε t is a white noise. This leads us to propose a signal processing version of the renewal equation model taking into account noise and bias and justifying a posteriori the variational method. The proposed observation model linking the observed signal i 0 t to the ground truth incidence where q t is a quasi-periodic gain with period 7, ε t is a white noise. The exponent a can be estimated for each country and varies between 0.6 and 0.9. The exceptional set F + is introduced because festive days provoke perturbations of the observation model (4). Specifically, the 7 days period of q t is broken for these groups of days. We shall verify experimentally on 38 countries (and detail the results on USA, France and Germany) that the normalized error ε t is indeed a white noise with a distribution that is well described by an exponential distribution. This a posteriori noise model contradicts the classic a priori stochastic formulation of the renewal equation where the first member i t of Equation (1) is assumed to be a Poisson variable, and the second member of this equation is interpreted as the expectation of this Poisson variable. Using this Poisson model leads maximum likelihood estimation strategies to compute R t [3,[8][9][10]. As we shall see, the Poisson model is not verified. Indeed, as we mentioned, the empirically observed standard deviation of the noise follows a power law with exponent a significantly larger than 0.5, which is incompatible with the Poisson model. The proposed observation model (4) of the pandemic's incidence curve provides a simple framework enabling: 1. a computation of the reproduction number R t ; 2. a correction of the weekend and festive days bias on i t ; 3. a verification that the difference between the observed incidence curve after bias correction and its expected value using the renewal equation is a white noise, the parameters of which can be estimated.
Paper organization: In Section 2, we describe an anterior variational method [7] and point out its main three limitations: its weekly bias correction is strongly periodic, which does not work on long periods; the festive days cause strong perturbations in the inversion, finally no residual noise model is proposed. We therefore modify its variational formulation. In Section 3, we present the results of the statistical analysis of the residual noise on many countries. These examples lead to specify the noise model and to validate a posteriori the proposed inversion model. In Section 4, we discuss the a priori noise models proposed in the literature. Finally, in Section 5, we present the conclusions of this work.
Timely estimates of restored versions of i t and R t are extremely useful to tame a pandemic. The proposed restoration and inversion algorithm can be run through an online demo [11] for every day in every country and U.S. state. The demo plots the objects of this paper, namely the incidence curve i 0 t , its bias corrected versionî t , its fully restored version i r t , finally the main pandemic index, the time-dependent reproduction number R t . Figure 2 illustrates the application of the variational method of Section 2 to USA on 1 February 2022, as displayed by the online demo. Figure 3 compares the results of this inversion method, applied with and without festive day bias correction, obtained for France on 6 January 2022.  [11]. On the left in red, the obtained reproduction number R t and in black its estimate obtained by the classic EpiEstim method. On the right in green, the original incidence curve i t of new cases, in blue the incidence curveî t corrected of the weekend and festive biases, and in red the final reconstructed incidence curve i r t obtained from R t by the application of the renewal equation. Estimate obtained for USA on 1 February 2022. . Incidence curve (in green) of France up to 6 January 2022. In blue, the incidence corrected of the weekly bias, in cyan the incidence corrected of the weekly and festive day. The Christmas holidays introduce a distortion in the weekly bias corrected incidence that is corrected by the festive day bias correction.
We can summarize the main contributions of this paper in the following way: 1. Based on the case renewal equation, we propose a new variational model which estimate: • A time varying reproduction number R t • A restored incidence curve with the weekly and festive day biases corrected. • The weekly seasonality profile of the incidence curve.
2. We verify experimentally, on many countries, that, once the weekly and festive days biases have been corrected, the difference between the incidence curve and its expected value using the renewal equation is well approximated by an exponential distributed white noise multiplied by a power of the magnitude of the restored incidence curve.

The Proposed Variational Model
The EpiInvert method proposed in [7] is a deconvolution + denoising procedure to solve the functional Equation (1) using the Tikhonov-Arsenin [12,13] variational approach. EpiInvert estimates both R t and a restored i t corrected for the weekend bias. To remove the weekend effect, it computes a 7-day periodic multiplicative factor q = (q 0 , q 1 , q 2 , q 3 , q 4 , q 5 , q 6 ). From the observed incidence curve and the serial interval, R t and q are jointly estimated by minimizing where t%7 denotes the remainder of the Euclidean division of t by 7 and median (t−τ,t] (i 0 ) is the median of i 0 t in the interval (t − τ, t] used to normalize the energy with respect to the size of i t (the value of τ is fixed to 21 (3 weeks) in the experiments). The total number of cases is preserved by adding to (5) the constraint on q t : where T is a period of analysis empirically fixed to T = 56 days. The minimization of the above energy yields estimates of R t , q and a restored incidence curve. One limitation of using a 7-day periodic formulation to model the weekend effect is that it does not take into account the variation over time of the seasonal profile. To deal with this issue, we consider q t for t = 0, . . . , t c allowing different correction factors q t for every day but keeping the values q t − q t−7 small which forces q t to be quasi-periodic. A regularity assumption for the seasonality is commonly used in the study of time series as it is the case of the standard Holt-Winters' seasonal method [14].
In addition to the weekend bias, festive days can introduce a strong bias in the incidence values. On a festive day t ∈ F, a sharp decrease in the number of registered incident cases is generally observed. This is compensated by increased incidence numbers in the next few days. Assuming that each festive day, t ∈ F, mainly affects the value of the incidence curve in the festive day and in the next M t days (where M t is an algorithm parameter (by default we fix M t = min{2, t c − t})), we consider the values of i 0 t , i 0 t+1 , . . . , i 0 t+M t as unknown. We denote by F + the union of the festive days t ∈ F and the M t days following them. We set i f t = i 0 t for t / ∈ F + and consider the values (i f t ) t∈F + as unknowns. Then the new proposed inversion functional is The values i f t for t ∈ F + are set free in the minimization. Yet the third term in the functional ensures that the overall number of cases in the affected days remains unchanged. For each t ∈ F, λ t ≥ 0 represents the weight we assign to this constraint for each festive day. We fix, experimentally, In other terms, the value of λ t is adjusted according to the number of days that have passed since the festive day. To keep a smooth seasonality we add to the energy a regularization term where we penalize high values of q t − q t−7 . The parameters w R and w q are regularization weights with default values w R = w q = 5. Their values are proven in [7] to be nearly optimal for COVID-19 incidence curves.
By minimizing this energy we obtain the reproduction number R t , the seasonality q t and i f t , which corresponds to the original incidence i 0 t but with the optimized values in the festive days. The bias corrected incidenceî t defined in model (3) is given byî t = q t i f t . The estimated incidence curve must preserve the number of cases. In the original EpiInvert formulation this constraint is enforced by (6) on its analysis interval (t c − T, T]. In the new formulation, the interval time of analysis is the whole time interval [0, t c ]. Extra conditions are required to keep i 0 t close toî t and i r t . Therefore, to preserve the number of cases we add to the energy (7) the constraints on q t : (8) The first constraint corresponds to a global preservation of the number of cases in the whole period and the second one corresponds to a local preservation of the number of cases every 2 weeks. In particular, the second constraint ensures a good agreement between the epidemiological indicator given by the accumulated number of cases in the last 14 days of the original incidence curve and the estimated ones using the proposed method. This indicator is currently widely used to evaluate the current epidemic transmission.
The minimization of the energy (7) is obtained by alternating steps computing in turn R t , q t , and then i f t (for t ∈ F + ) until convergence. The above constraints are added to the minimization by the Lagrange multiplier technique.

Results
We used the incidence data published in [15] for France, [16] for Germany, [17] for Spain and [18] for the rest of countries. We checked the observation model and its inversion on the 626 daily incidence data from 24 March 2020 to 9 December 2021 for 38 countries and will detail the results for France, Germany, and the USA. In general, for the festive days we fixed M t = 2, so the method estimated the incidence value of the festive day and of the next 2 days. However, not all festive days disturb the incidence in the same way. Parameter M t allows us to adapt the number of days affected. To illustrate this option we set M t = 5 for Thanksgiving in the USA in 2021 because this festive day causes in 2021 a longer perturbation in the number of registered cases. Figures 4, A1 and A2 show the minimization results for the energy (7). They display for each country (i) the original incidence curve i 0 t , (ii) the incidence curve after bias correctionî t , (iii) the restored incidence curve i r t using the renewal Equation (1), (iv) the weekly bias correction factors q t , (v) the reproduction number estimation R t and (vi) the normalized error defined by The power a was obtained through log-log linear regression. Indeed, if |î t − i r t | is proportional to (i r t ) a , then log(|î t − i r t |) ≈ a · log(i r t ) + b, and a and b can be estimated by a linear regression between log(|î t − i r t |) and log(i r t ). Its results are illustrated for 38 countries in Figure 5 and Table A2. The Pearson correlation p-values in this table confirm the linear relation. The estimated exponent a varies between 0.7 and 0.9, and the constant coefficient b varies between −0.11 and −2.6. For the world we have a = 0.76 and b = −1. 16.
We performed a control test on a Brownian motion simulated by starting from 10,000 and sampling i t+1 − i t N (0, 100). The obtained exponent a is negative (a = −1.01) and we have b = 13.4. Both values are far away from the group of coefficients of real incidence curves. The p-value for the control is anyway non significant (0.0844), compared to the extremely small p-values for the real incidence curves. Figure A3 shows the results of the variational inversion method on the Brownian control. For this control, both R t and the weekly seasonality correction coefficients stay very close to 1 as should be expected, with means 1.001 and 1.00002, and standard deviations 1.7% and 0.3% respectively.
Next, we looked for a stochastic model of the normalized error ε t defined by (9). Figures 4, A1 and A2 visually support a stationarity assumption for ε t in France, Germany and USA.
In Figure 6 we show the autocorrelation function for these three countries. For most non-zero shifts, its value stays inside the 95% confidence interval for the stationarity assumption. (This interval is indicated by horizontal blue lines in the plot.) Similar results were obtained on 33 more countries, as illustrated in Figure A5. These results support a white noise assumption for ε t .  Figure 4. From top to bottom: (i) the original incidence curve i 0 t of France, (ii) the incidence curve after bias correctionî t , (iii) the restored incidence curve using the renewal equation i r t , (iv) the weekly bias correction factors q t , (v) the reproduction number estimation R t and (vi) the normalized error where a is the optimal exponent obtained by regression (see Table A2).   Figure 6. For France, Germany and USA, autocorrelation of the normalized error ε t , using the festive day correction, obtained with the R-software functionalities (acf() function). The orange dotted line provides the 95% confidence interval for non-correlation. Similar plots for the same countries and 33 more countries, without using the festive day correction, are displayed in Figure A5.
We finally estimated the parameters of the distribution of ε t assuming an exponential power distribution with density where µ is the location, α the scale and β the shape. These parameters to approximate ε t by an exponential power distribution were estimated by the R-package normalp [19]. In Figure 7, we plot for these three countries the histogram of the distribution of ε t and its approximation by a normal (β = 2) and by the obtained optimal exponential distribution. We display the same result for 33 more countries in Figure A4. For France, Germany and USA, histogram of the normalized error ε t , using the festive day correction, its normal approximation (blue line) and its optimal approximation using an exponential distribution (red line) (we use the R-package normalp to approximate ε t by an exponential distribution). See Figure A4 for the results for the same countries and 33 more countries without using the festive day correction. Table A1 provides the results for all countries. Columns 5 to 8 in the table provide the parameters of the optimal exponential law: location, scale, shape. In all cases the exponent remains close to 1. Figure 8 displays a quantile-quantile plot comparing ε t with the estimated exponential distribution for three countries: France, Germany, USA. The linear fit is excellent, and this goodness of fit is confirmed for 33 more countries in Figure A6.  . Quantile-quantile plot with France, Germany and the USA comparing ε t , using the festive day correction, with the optimal exponential distribution using the R-package normalp.

The Fraser Renewal Equation
In our proposed incidence model, we used the general integral Equation (1), which is a functional equation in R t . Integral equations have been previously used to estimate R t : in [20], the authors estimate R t as the direct deconvolution of a simplified integral equation where i t is expressed in terms of R t and i t in the past, without using the serial interval. A simpler functional equation than (1) was proposed in Fraser [21] (Equation (9)), This equation is derived from the general case renewal Equation (1) by assuming that R t is constant in the serial interval support. It computes the "instantaneous reproduction number" and represents the number of secondary cases arising from an individual showing symptoms at a particular time, assuming that conditions remain identical after that time, in contrast with the case renewal Equation (1). This last equation applied to the incidence curve is coherent if Φ s denotes the serial interval between two cases, which can have negative dates, because an infectious may be detected after the infection cases she caused. Using (11) requires that Φ s only has positive dates. This explains why [22] proposed to estimate the generation time, namely the (always positive) time between two infections, before using it in (11). The advantage of Equation (11) is that R t is estimated at time t from the past incidence values i t−s by a simple division, provided that Φ s = 0 for s < 0:

Deterministic Implementations Using Fraser's Renewal Equation and Other Models
Many papers estimating R t use the deterministic causal renewal Equation (11). This is the case of [23][24][25]. This last paper also involves the Wallinga-Teunis formulation [2], also based on the renewal equation but only allowing a backward estimate of R t (see the discussion in [7]). Some papers such as [26] propose a simplified version of (11). See also [27], who use this equation but estimate the probability distribution Φ s by a maximum entropy method. A few papers use another deterministic model, the Wallinga-Teunis formulation, to compute R t [28], or a SIR model, such as in [29], where the time variable parameter β(t) of the three ODE's of a SIR model is estimated from incidence data in a seven days sliding window.

Stochastic Observation Models for i t and R t
The renewal Equation (11) is often endowed with an a priori stochastic Poisson model as In this stochastic formulation, the first member i t of Equation (11) is assumed to be a Poisson variable, and the second member of this equation is interpreted as the expectation of this Poisson variable. This leads to a maximum likelihood estimation strategy to compute R t (see [3,[8][9][10]30]). This form of the renewal equation is proposed and used in [3] and in the EpiEstim software. It is highly recommended in a recent review [31] signed by representatives from ten different epidemiological labs from several continents. Many papers dedicated to the computation of R t use this model, for example [32][33][34], who also assume that R t is a Poisson variable, and [35] who also assume that R t also is a random variable following a Gamma distribution. In [36], the authors use the stochastic form of the renewal Equation (13) where they call Φ s causal serial interval. Then R t is estimated jointly on all regions of a country by a variational model containing a spatial total variation regularization to ensure that R t is piecewise constant, and the L 1 norm of its time Laplacian to ensure time regularity. The functional also penalizes outliers, typically Sundays and holidays by assuming a sparse structure of such events. See also [37] for an exposition of the application of this method.
In [38], the method Epifilter is introduced as an extension of EpiEstim and of the Wallinga-Teunis formulation. Epifilter has been applied in practical studies such as [39].
The core of Epifilter is again the causal renewal equation in Poisson form (13). Yet, the author proposes a doubly stochastic model, as R t is assumed to follow a recursive discrete Brownian motion of the sort where N (0, 1) and η is a user parameter, that we can interpret as a regularity control on R t . Then (R s ) s≤t is computed from the incidence data (i t ) s≤t by recursive filtering. The method is complemented by Bayesian (backward) recursive smoothing that brings a better estimate on low incidence periods.
Similarly, in [40], a parametric model with a stochastic multiplicative term is proposed for R t where the stochastic term is a Gamma law with prescribed standard deviation. The parameters are estimated in several prefectures in interaction to provide the best fit to incidence data linked to R t through the causal renewal Equation (11).
A few papers assume a negative binomial a priori for the incidence [41]. Nevertheless, the equations given in the paper indicate the adoption of the renewal Equation (11) and put the stochastic process on R t by assuming R t R t−1 GP where GP is a squared exponential kernel. The very same model is used in [42], and is based on the authors' software EpiNow2. Similarly in [43], incidence i t and reproduction number R t are linked through the classic SIR model; a parametric piecewise linear model for R t is estimated by fitting the parameters to real incidence data. Here, the daily incidence data are modeled as a negative binomial, with mean given by the deterministic solution of the SIR equations and unknown dispersion.
In [44], a direct stochastic model is proposed for R t , assuming that its log derivative is where ν is the volatility of the Brownian process to be estimated. Then we have where C is a constant depending on steady transmission characteristics and s(t) is the proportion of the population that is susceptible. The case incidence is then estimated through an SEIR model. We refer to [45] for a still more complex stochastic model for R t , depending on three stochastic parameters.

Conclusions
In [7], we have proven extensively by simulations and experiments on live worldwide COVID-19 incidence data that using the simplified causal renewal Equation (11) incurs in a five days delay in the estimation of R t , compared to the Nishiura renewal Equation (1). This is why we used here this second model.
All of the stochastic models mentioned in Section 4.3 are formulated a priori. To the best of our knowledge, no there has been no a posteriori verification of their noise models on i t or R t . In contrast, we have proposed to learn the noise model from data and to verify a posteriori that the noise model is correct. Our experiments show that the weekly and festive administrative perturbations are more important than the noise. Hence, they must be corrected first to enable a proper noise analysis.
These experiments seem to confirm the validity of the observation model (4). As we saw, this model can be inverted by minimizing the energy (7). This minimization yields three signals: a restored incidence on the festive days, the administrative bias correcting coefficients q t that are quasi-periodic with period 7, and the time varying reproduction number R t , arguably the pandemic's most useful control parameter. Last but not least, the renewal equation deduces a restored incidence i r t by (2) from the bias compensated incidenceî t . The modeling loop was closed by verifying that the normalized error defined by (9) is a white noise. We also found that this noise follows an exponential distribution. This analysis discards the Poisson model for the pandemic's case count i t . A pure case count should be a Poisson noise, but we saw that the main perturbation was an administrative bias which, once compensated, leaves behind a noise with standard deviation proportional to a power larger than 0.5 of the case count i t . Under the Poisson model this standard deviation would have been equal to the square root of i t .
In summary, based on the renewal equation inversion, this work contributes to a better understanding of the dynamic of the registered administrative observation of the incidence curve, its weekly seasonality, the influence of the festive days and the expected noise model in the observation of the incidence curve.

Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets of the incidence curve were analyzed in this study. These data can be found in [15] for France, [16] for Germany, [17] for Spain and [18] for the rest of countries.

Conflicts of Interest:
The authors declare no conflict of interest.  Figure A1. From top to bottom: (i) the original incidence curve of Germany i 0 t , (ii) the incidence curve after bias correctionî t , (iii) the restored incidence curve using the renewal equation i r t , (iv) the weekly bias correction factors q t , (v) the reproduction number estimation R t and (vi) the normalized error where a is the optimal exponent obtained by regression (see Table A2).  Figure A2. From top to bottom: (i) the original incidence curve i 0 t of USA, (ii) the incidence curve after bias correctionî t , (iii) the restored incidence curve using the renewal equation i r t , (iv) the weekly bias correction factors q t , (v) the reproduction number estimation R t and (vi) the normalized error where a is the optimal exponent obtained by regression (see Table A2).  Figure A3. Control test, from top to bottom: (i) the test incidence curve i 0 t which is a Brownian motion, (ii) the test curve after bias correctionî t , (iii) the restored incidence curve using the renewal equation i r t , (iv) the weekly bias correction factors q t , (v) the reproduction number estimation R t and (vi) the relative error (i r t −î t )/i r t . Both R t and the weekly seasonality correction coefficients stay very close to 1, with means 1.001 and 1.00002, and standard deviations 1.7% and 0.3% respectively. Table A1. Table with the mean and standard deviation of ε t and the parameters of the best fit to the exponential distributions for 36 countries. The data of starred countries in the first three rows have undergone the festive bias correction.  Table A2. Coefficients a and b for 38 countries of the log-log linear regression ax + b between restored incidence i r t and the residual |î t − i r t | as displayed in Figure 5. The Pearson correlation p-values given by the stats R package confirm a linear relation. The exponent a varies between 0.7 and 0.9. Stars * indicate countries with festive correction. The pvalues are slightly better with festive correction than without. The last row shows the results on the control curve, simulated as a Brownian process. Its large p-value discards a linear log-log relation, and the estimated values of a and b also stand far away from the estimated values for real incidence curves.   Figure A4. Fit of exponential distributions for 36 countries. In red, the best fitting exponential distribution with shape larger or equal to 1. In black, the best fitting normal law.  Figure A5. Autocorrelation of the normalized error ε t using the R-software functionalities (acf() function) for 36 countries. The dotted lines give the 95% confidence interval for non-correlation.  Figure A6. Quantile-quantile plot with 36 countries comparing ε t (without using the festive day correction) with the optimal exponential distribution using the R-package normalp. In the horizontal axis we show the theoretical quantiles and in the vertical axis, the sample quantiles. Note that the exponential distribution shape parameter β, indicated on the graphs can have values >1.