Real-Time Estimation of R0 for COVID-19 Spread

Abstract: We propose a real-time approximation of R0 in an SIR-type model that applies to the COVID-19 epidemic outbreak. A very useful direct formula expressing R0 is found. Then, various types of approximation are considered, namely finite differences, cubic splines, Piecewise Cubic Hermite interpolation and linear least squares approximation. Preserving the monotonicity of the formula under consideration proves to be of crucial importance; this property is preferred over accuracy, since it maintains a positive R0. Only the linear least squares technique guarantees this, and it is finally proposed here. Tests on real COVID-19 data confirm the usefulness of our approach.


Introduction
The SIR model describes the dynamics of an epidemic. Kermack and McKendrick introduced it in their pioneering work [1]. In this model, we consider a homogeneous population; thus, no age structures or group behaviours are taken into account. The population is divided into three homogeneous compartments: susceptible people S, infectious people I and recovered people R. Since S, I and R change in time t, we represent these numbers as functions S(t), I(t) and R(t). Births and deaths are ignored, considering a constant population size N within time t, i.e., S(t) + I(t) + R(t) = N. The model reads

dS/dt = −β(t) S I / N,
dI/dt = β(t) S I / N − γ(t) I,    (1)
dR/dt = γ(t) I,

where β(t) is the infection rate and γ(t) is the recovery rate.
A very important quantity is the reproduction number, given as R0(t) = β(t)/γ(t). The Ordinary Differential Equations (1) can be solved numerically using Runge–Kutta pairs [5–7]. One of the main difficulties arising is the estimation of the parameters β(t), γ(t) and, consequently, of R0(t).
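As an illustration, the system (1) can be integrated with an off-the-shelf Runge–Kutta pair. The following is a minimal Python sketch, not the paper's code: the constant rates β, γ, the population size and the initial condition are hypothetical, and scipy's solve_ivp stands in for the Runge–Kutta pairs of [5–7].

```python
from scipy.integrate import solve_ivp

def sir_rhs(t, y, beta, gamma, N):
    """Right-hand side of the SIR system (1)."""
    S, I, R = y
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    return [dS, dI, dR]

N = 1_000_000                 # assumed population size
beta, gamma = 0.3, 0.1        # hypothetical constant rates, so R0 = 3
y0 = [N - 100, 100, 0]        # initial susceptible, infectious, recovered

sol = solve_ivp(sir_rhs, (0, 160), y0, args=(beta, gamma, N), method="RK45")
S, I, R = sol.y               # population S + I + R = N is conserved
```

Since the right-hand sides of (1) sum to zero, the total population is conserved along the numerical solution, which is a quick sanity check for any implementation.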
There is ongoing interest in SIR-type models. Bertrand and Pirch [8] presented a least-squares finite element method for the SEIQRD model. Cherniha and Davydovych [9] studied a nonlinear model based on logistic equations, while Keller et al. [10] simulated the spread of an infectious disease across a heterogeneous and continuous landscape. E. Kuhl [11] focused on modeling various stages of the COVID-19 outbreak, and Viguerie et al. [12] simulated COVID-19 via an SEIRD model.
From (1), dividing the equation for S by the equation for R, we get dS/dR = −(β/γ)(S/N) and, consequently, setting s = S/N and r = R/N, we arrive at

R0 = −(1/s) (ds/dr).    (2)

Thus, our concern here is actually a reliable estimation of ds/dr.

Interpolating the Past
We proceed to set t_n as the current time, while s_n = s(t_n) and r_n = r(t_n) are the currently observed numbers of susceptible and recovered. Consequently, t_{n−1} corresponds to the previous time step, usually the previous day, i.e., t_{n−1} = t_n − 1, and all t are considered integers. Past values s_{n−1}, r_{n−1}, s_{n−2}, r_{n−2}, · · · and the population N are usually known, and we seek an approximation of R0(t) through ds/dr, i.e., through the estimation of this derivative at the rightmost (current) point, ds_n/dr_n. In the following, it is convenient to consider s as a function of the increasing variable r. We have various choices.

Backward Finite Differences
The celebrated Taylor series [13] states that

s(r + h) = s(r) + h s′(r) + (h²/2) s″(r) + · · · .    (3)

We are given various values of s at distinct points r_0 < r_1 < · · · < r_n, and name them for simplicity s_0 = s(r_0), s_1 = s(r_1), · · · , etc. Backward finite differences answer the question of approximating the derivatives of s at r_n. Thus, we may derive

ds_n/dr_n = s′_n ≈ (s_n − s_{n−1}) / (r_n − r_{n−1}),    (4)

which is easily verified from (3) using its implicit Euler variant s_n = s(r_{n−1} + h) ≈ s(r_{n−1}) + h s′(r_{n−1} + h) = s_{n−1} + h s′_n, with h = r_n − r_{n−1}. Approximation (4) is said to be of first order of accuracy, since no higher powers of h from (3), e.g., h², h³, · · · , are involved. After substituting −1/s in (2) with the mean value −2/(s_{n−1} + s_n), we obtain the formula

R0 ≈ −(2/(s_{n−1} + s_n)) · (s_n − s_{n−1}) / (r_n − r_{n−1}).    (5)

This latter estimation (5) is of limited value, since it is based on data from only two days and fluctuates intensely. Thus, we may proceed to higher-order differences. A second-order backward finite difference approximation of ds_n/dr_n, with h_1 = r_{n−1} − r_{n−2} and h_2 = r_n − r_{n−1}, produces the formula

R0 ≈ −(3/(s_{n−2} + s_{n−1} + s_n)) · [ s_{n−2} h_2/(h_1(h_1 + h_2)) − s_{n−1}(h_1 + h_2)/(h_1 h_2) + s_n (h_1 + 2h_2)/(h_2(h_1 + h_2)) ],    (6)

which, for the equispaced case h_1 = h_2 = h, reduces to −(3/(s_{n−2} + s_{n−1} + s_n)) · (s_{n−2} − 4s_{n−1} + 3s_n)/(2h). As we did in (5), we substituted −1/s with the mean value of s_{n−2}, s_{n−1} and s_n. We observe that s is clearly a decreasing variable. Then, ds_n/dr_n ≤ 0 and R0 ≥ 0 must hold. In (5), this property is preserved. However, this is not true for (6). Let us check this using a small example. Take the following artificial data

r_{n−2} = 0.001, r_{n−1} = 0.002, r_n = 0.003, s_{n−2} = 0.996, s_{n−1} = 0.9915, s_n = 0.991    (7)

to verify that, according to (6), we get R0 ≈ −1.5108 < 0! This happens because second- and higher-order backward finite differences do not preserve monotonicity [14]. Dealing with higher-order methods is, therefore, of no use.
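These two estimates are easy to reproduce. A short Python sketch with the artificial data (7), using the equispaced form of (6) since the step in r is uniform in this example:

```python
# Artificial data (7)
r = [0.001, 0.002, 0.003]
s = [0.996, 0.9915, 0.991]

h = r[2] - r[1]                   # uniform step in this example
mean2 = (s[1] + s[2]) / 2         # mean value replacing s in (2)
mean3 = (s[0] + s[1] + s[2]) / 3

# First-order estimate (5): uses only the last two days
R0_first = -(1 / mean2) * (s[2] - s[1]) / (r[2] - r[1])

# Second-order estimate (6), equispaced form: uses three days
ds = (s[0] - 4 * s[1] + 3 * s[2]) / (2 * h)
R0_second = -(1 / mean3) * ds

print(R0_first)    # positive, as expected from (5)
print(R0_second)   # ≈ -1.5108: negative, monotonicity is lost
```

The second-order estimate is more accurate in principle, yet it is the one that produces the impossible negative value.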
Notice that using (5), we have R 0 ≈ 0.505 for the last two days (i.e., based on the r-interval [0.002, 0.003]). However, the previous day's estimation was R 0 ≈ 4.539. This oscillation is also unacceptable.

Cubic Splines
Cubic splines [15] are a very interesting tool that can be used in various fields [16,17]. The interval [r_0, r_n] is divided into n subintervals [r_0, r_1], [r_1, r_2], · · · , [r_{n−1}, r_n]. Using cubic splines, we obtain n polynomials of third degree, each one active in one subinterval. Concentrating again on the interval [r_{n−2}, r_n], we may obtain an analogous formula for R0 after using the software found in [18]. The data (7) produce R0 ≈ −0.5036 < 0, which is unacceptable. For this dataset, we obtain two polynomials of the third degree. In the interval [0.002, 0.003], the polynomial approximation of s is

s(r) ≈ −10^6 (r − 0.002)^3 + 3000 (r − 0.002)^2 − 2.5 (r − 0.002) + 0.9915,

with the positive derivative s′(0.003) ≈ 0.5 > 0 at the rightmost point. Through this counterexample, we deduce that cubic splines do not preserve monotonicity either.
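This behaviour is straightforward to reproduce. A Python sketch using scipy's CubicSpline with natural boundary conditions (one common choice of end conditions, which reproduces the values quoted above; [18] may offer others) on the data (7):

```python
import numpy as np
from scipy.interpolate import CubicSpline

r = np.array([0.001, 0.002, 0.003])
s = np.array([0.996, 0.9915, 0.991])

# Natural cubic spline through the three points of (7)
cs = CubicSpline(r, s, bc_type="natural")

# Derivative at the rightmost (current) point
dsdr = cs(r[-1], nu=1)
R0 = -(1 / s.mean()) * dsdr

print(dsdr)   # ≈ 0.5 > 0: the spline is not monotone at r_n
print(R0)     # ≈ -0.5036 < 0, unacceptable
```

Although the data are strictly decreasing, the spline overshoots near the right endpoint, so its derivative there comes out positive.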

Piecewise Cubic Hermite Interpolant
An alternative to cubic splines is Piecewise Cubic Hermite interpolation (PCHIP). In cubic splines, the coincidence of higher derivatives at the nodes is demanded; in PCHIP, this is abandoned in order to achieve monotone cubic polynomials for monotone data. The issue is rather too complicated to explain here; more details can be found in [19] or in the MATLAB function pchip.
For the data (7), we obtain the following polynomial, active in the interval r ∈ [0.002, 0.003]:

s(r) ≈ 10^5 (r − 0.002)^3 + 300 (r − 0.002)^2 − 0.9 (r − 0.002) + 0.9915,    (8)

with derivative s′(0.003) = 0 at the rightmost point. Monotonicity is marginally preserved, but R0 = 0 in this case. Even if we add more points from the past, we will not change the situation. The method always tries to produce a monotone polynomial of the third degree.
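The same experiment can be run with scipy's PchipInterpolator, which follows the MATLAB pchip construction, and it reproduces the zero end derivative:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

r = np.array([0.001, 0.002, 0.003])
s = np.array([0.996, 0.9915, 0.991])

p = PchipInterpolator(r, s)

# PCHIP enforces shape preservation; at the rightmost point the
# one-sided slope estimate has the wrong sign and is clipped to zero.
dsdr = p(r[-1], nu=1)

print(dsdr)   # 0.0, so R0 = 0: monotone, but useless here
```

The clipping that saves monotonicity is exactly what destroys the information we need: the estimated derivative at the current point is forced to zero.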
Thus, it will make only very small changes to (8), regardless of how many points we add from the past. All of the above three types of approximation (Sections 2.1–2.3) may attain higher algebraic orders, i.e., second order or higher. This means that we may obtain better accuracy in the approximation of ds/dr. However, they actually fail, since they do not preserve monotonicity, which is an inherent property of the function s(r) and is also present in the corresponding data. Failure to preserve monotonicity is catastrophic: we may then experience R0 < 0.

Linear Least Squares
Finally, we propose estimating ds_n/dr_n by the slope of the linear least squares approximation of the data in [r_0, r_n]. For the n + 1 points (r_0, s_0), · · · , (r_n, s_n), this slope is

ds_n/dr_n ≈ [(n+1) Σ r_i s_i − (Σ r_i)(Σ s_i)] / [(n+1) Σ r_i² − (Σ r_i)²],    (9)

with the sums taken over i = 0, · · · , n. The denominator of (9) is always positive, since r is ascending. For the special case of using data in the interval [r_{n−2}, r_n], we have

ds_n/dr_n ≈ [3(r_{n−2} s_{n−2} + r_{n−1} s_{n−1} + r_n s_n) − (r_{n−2} + r_{n−1} + r_n)(s_{n−2} + s_{n−1} + s_n)] / [3(r²_{n−2} + r²_{n−1} + r²_n) − (r_{n−2} + r_{n−1} + r_n)²].    (10)

After using (10) with the data (7) in (2), we arrive at an acceptable R0 ≈ 2.523. It seems that (9) furnishes a balanced result in view of the corresponding results found by (5). The reason for the final choice of this method is the preservation of monotonicity, which helps to avoid unpleasant outcomes such as ds/dr > 0. This is true, since the slope of a least squares line applied to decaying data is necessarily negative. The latter does not always happen in cases where, e.g., a parabola passes through these points.
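The three-point formula (10) can be verified directly on the data (7). A brief Python sketch, dividing by the mean of s following the mean-value substitution in (2):

```python
r = [0.001, 0.002, 0.003]
s = [0.996, 0.9915, 0.991]

n = len(r)
Sr, Ss = sum(r), sum(s)
Srs = sum(ri * si for ri, si in zip(r, s))
Srr = sum(ri * ri for ri in r)

# Least squares slope, formula (10)
slope = (n * Srs - Sr * Ss) / (n * Srr - Sr ** 2)

# Substitute the slope into (2), using the mean value of s
R0 = -slope / (Ss / n)

print(slope)   # -2.5 for these data: negative, as it must be
print(R0)      # ≈ 2.5 > 0: monotonicity is preserved
```

For equally spaced r, the least squares slope through three points equals (s_n − s_{n−2})/(r_n − r_{n−2}), which makes the value −2.5 easy to check by hand.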
The question that now arises is the size of the data used, i.e., n = ? A strategy to address this begins with n = 2 and estimates R0 with a value named R0^(2). We continue with n = 3, 4, · · · and estimate R0^(3), R0^(4), · · · , respectively. We stop the iteration whenever two consecutive estimations of R0 differ by less than 1/100, that is, whenever |R0^(k) − R0^(k−1)| < 1/100; we then accept R0^(k) as the value on demand. In case this does not happen, we accept R0^(21) as the final approximation of R0. We implement this procedure as a MATLAB [20] program in Appendix A.
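The Appendix A program is written in MATLAB [20]; the stopping strategy above can be sketched in Python as follows (the function name and defaults are ours, for illustration):

```python
def estimate_R0(r, s, tol=0.01, max_pts=21):
    """Estimate R0 = -(1/s) ds/dr via a linear least squares slope,
    enlarging the data window until two consecutive estimates agree.

    r, s : sequences of recovered/susceptible fractions, newest last.
    """
    prev = None
    for k in range(2, max_pts + 1):      # k = number of latest points used
        rr, ss = r[-k:], s[-k:]
        n = len(rr)
        Sr, Ss = sum(rr), sum(ss)
        Srs = sum(a * b for a, b in zip(rr, ss))
        Srr = sum(a * a for a in rr)
        slope = (n * Srs - Sr * Ss) / (n * Srr - Sr ** 2)
        R0 = -slope / (Ss / n)           # formula (2) with the mean of s
        if prev is not None and abs(R0 - prev) < tol:
            return R0                    # two consecutive estimates agree
        prev = R0
    return prev                          # fall back to the largest window
```

On data where s is exactly linear in r, every window produces the same slope, so the iteration stops almost immediately; noisy data force larger windows.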
A very interesting issue is that using the mean value of the data s in (2) calibrates the result to the correct value of R0. The percentage of susceptible s varies slowly; using the mean value instead of s_n causes only a small difference, which delivers 2–3 more digits of accuracy. Additionally, we mention that this new approach is much easier to compute than our previous method [21]. It also seems to achieve better accuracy.

Tests on Real COVID-19 Data
We will test the new approach using real COVID-19 data. The time-series data of confirmed, recovered, and death cases for various countries were retrieved from [22]. The data are presented in the format shown in Table 1.
The two rightmost columns are summed to form the vector r, after dividing this sum by the country's population. The vector s is formed by the division of the confirmed cases by the population. The population of each country was retrieved from Wikipedia [23]. The data in [22] are not always reliable: data are missing or inaccurately reported for some countries. The method presented here does not apply properly at the outbreak of a disease, when the values r are rather small. Then, we may apply some scaling, e.g., use the vectors S and R. The values in r have to vary somewhat in order to obtain reliable results. Thus, no trustworthy results can be derived from the following Table 2. We do not believe that any serious method can extract something reliable from the above data. By this, we mean that, at the beginning of an outbreak, we may not obtain an explicit picture of the situation from a single country's data. Only a global view may raise some interest in these numbers. Thus, the method at hand applies after the initial development of the pandemic.
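The preprocessing just described amounts to a few vector operations. A brief Python sketch with hypothetical numbers (the arrays and the population value are ours, for illustration only, not actual country data):

```python
import numpy as np

# Toy cumulative time series in the format of Table 1 (hypothetical numbers)
confirmed = np.array([1000, 1500, 2100, 2600])
recovered = np.array([100, 300, 600, 1000])
deaths    = np.array([10, 20, 35, 50])
population = 1_000_000            # from an external source, e.g., [23]

# The two rightmost columns (recovered and deaths) sum to form r
r = (recovered + deaths) / population

# s is formed from the confirmed cases, as described in the text
s = confirmed / population
```

Both vectors are cumulative and hence non-decreasing, which is the property the least squares estimation relies on.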
We present the results for Russia in Figure 1, for Germany in Figure 2, and for Italy in Figure 3. We produced these results using at least the 14 latest data points in order to obtain smoother curves.
We observe that R0 decays for Russia and, after mid-January 2021, stays below 1. The corresponding parameter for Germany stayed below 1 in 2021, but it seems that this was no longer true from the beginning of March. Finally, R0 stayed below 1 for Italy until the end of February and climbed above it in March.
The data for the countries mentioned above seem to be reliable. However, there are cases with misreported data. For France, after a year, only about a quarter of a million people were reported as recovered, while there were 4 million infected! The same problem is apparent in the data for the UK and other countries.
Thus, the new method cannot be applied to corrupted data, as is to be expected of any other method. Least squares may circumvent an outlier or some misreported data, but consistently faulty data are untreatable.

Conclusions
A new formula for directly estimating R0 in the SIR epidemic model is derived here. Using only the percentage values of susceptible s and recovered r on consecutive days, we form a linear least squares approximation of the derivative ds/dr ≤ 0. This approximation is non-positive, i.e., it preserves monotonicity. Then, the new formula stays close enough to R0. For use with real COVID-19 data, we implemented an iterative technique that promises convergence to the actual value of R0. Similar research is planned for other models, such as SIS, SIRD, and SEIR.