New Insights into the Estimation of Reproduction Numbers during an Epidemic

In this paper, we deal with the problem of estimating the reproduction number Rt during an epidemic, as it represents one of the most used indicators to study and control this phenomenon. In particular, we focus on two issues. First, to estimate Rt, we consider the use of positive test case data as an alternative to the first symptoms data, which are typically used. We both theoretically and empirically study the relationship between the two approaches. Second, we modify a method for estimating Rt during an epidemic that is widely used by public institutions in several countries worldwide. Our procedure is not affected by the problems deriving from the hypothesis of Rt local constancy, which is assumed in the standard approach. We illustrate the results obtained by applying the proposed methodologies to real and simulated SARS-CoV-2 datasets. In both cases, we also apply some specific methods to reduce systematic and random errors affecting the data. Our results show that the Rt during an epidemic can be estimated by using the positive test data, and that our estimator outperforms the standard estimator that makes use of the first symptoms data. It is hoped that the techniques proposed here could help in the study and control of epidemics, particularly the current SARS-CoV-2 pandemic.


Introduction
To study and control an epidemic such as the current SARS-CoV-2 pandemic, positive test case data (Y pt ) and first symptoms data (Y f s ) sequences are commonly used. These data are independently gathered and respectively describe the number of positive tests officially registered each day and the number of patients exhibiting symptoms on the first day. The data of the first kind are commonly communicated by the media, while those of the second kind are elaborated to compute the instantaneous reproduction number R t during the evolution of an epidemic [1,2].
Both kinds of data are affected by errors. Positive test data have an intrinsic problem: the result of a test is typically associated with the day of its official registration with the authorities. It would be more appropriate to consider the day of the test. In addition, tests have false negative and false positive outcomes. However, molecular tests are more commonly used, with a typical sensitivity of 90%; these tests are different from the rapid antigenic tests, whose sensitivity largely varies and can be much lower. First symptoms data have two main problems. First, not all positive patients are able to contextualize their symptoms into a time frame, which reduces the sample size. Second, these data are continuously reported to the medical authorities, and the number of first symptoms on each day is updated until a certain time, after which it stabilizes.
The study and estimation of the reproduction numbers during an epidemic is an active field of research. A large number of different methodologies, approaches, and practical applications have been developed.
Fraser [3] proposed to use a slight modification of the classical Kermack-McKendrick transmission model to estimate the R t [4] based on factors that influence susceptibility and infectiousness. In addition, Fraser [3] specifically focused on households as they are considered a fundamental unit of transmission for several infections that are directly transmitted. He then analysed a model of transmission within and between households and developed a method to estimate their reproduction number.
Specific to the SARS-CoV-2 pandemic are in-host disease models, which have largely been analysed due to the relevant role played by co-morbidities and co-treatments (e.g., [5,6]), as well as methodologies which tackle the issue of asymptomatic cases. For example, Zhao et al. [7] performed a combined estimation of the generation interval and incubation period, and then inferred a pre-symptomatic transmission proportion and latent period.
Several proposed methodologies are based on simple or generalized compartmental models that combine deterministic and stochastic components, thus allowing for a consideration of various external variables such as quarantine, self-isolation, social distancing, or infection profiles [8][9][10][11]. Nevertheless, public institutions such as the Istituto Superiore di Sanità (ISS) in Italy do not use such models to estimate R t . They instead use a far simpler data-driven approach that is based on the simple statistics of first symptoms onset data [12], on which we have focused part of our work. This approach is preferred as it does not formulate any hypothesis on the mechanisms of virus transmission. One main problem with compartment models is the assumption of homogeneity: it is very unrealistic, even at the minimum geographical level used in Italy, to assume that each individual can infect every other person. In addition, there are problems related to the number of parameters and their estimation.
The method described by Cori et al. [12] was implemented for the computation of R t , both in the R software [13] and as a Microsoft Excel spreadsheet ( [14] and documentation in SI of [12]), which was named EpiEstim. In [15], an extensive review was performed on articles that used EpiEstim to describe modified approaches. It was reported that the method of [12], possibly modified, had been used to compute R t in more than 280 papers, most of which were for SARS-CoV-2. Even if the data used were concerned with the incidence of first symptoms and not of infection, the R t computed from them by the method in [12] could provide useful quantitative information about the evolution of the SARS-CoV-2 pandemic.
First symptoms data satisfy an equation involving a convolution integral [12]: where Y(·) = Y f s (·), the symbol * indicates the convolution, and f (·) is a proper kernel function representing the probability density function (PDF) of the serial interval, which is the time between the onset of the first symptoms and the infection that had caused them (e.g., [16]). Equation (1) is the basis for the estimation of the instantaneous reproduction number, but we point out that it can be rigorously derived only when R t is constant along time and there are no asymptomatic individuals. Nevertheless, it is commonly used in real applications, either as a definition of R t or as a basic equation to derive an estimator for it. We stress that in practice, the convolution integral in Equation (1) is numerically approximated by a linear combination of data, and this is also performed here in the applications. However, in some theoretical calculations, we will consider the convolution as an integral. For simplicity, we will use the same symbol.
In an ideal condition, first symptoms data are replaced by the infection data [17]. However, it is obvious that data on the last type of events are hard to collect. In fact, the time of first symptoms can be quantified, although with errors, while it is very often not possible to identify the infection time. We notice that Equation (1) becomes an eigenfunction problem if R t is assumed to be constant over time. In principle, the eigenfunctions could be used as a base of functions to describe the R t sequence.
Usually, in the literature, the kernel function f is modelled by a Gamma PDF: where k and θ are the shape and the scale (positive) parameters, respectively, and Γ indicates the Gamma function. Based on a limited set of epidemic data, these parameters were estimated for the SARS-Cov-2 pandemic in Italy by [18] to be 1.87 (shape) and 3.57 (scale), and this is indeed the model currently adopted by the dedicated Italian Governmental Institution to estimate the reproduction number (see https://www.iss.it/coronavirus; for the theoretical specifications, see https://www.iss.it/coronavirus/-/asset_publisher/ 1SRKHcCJJQ7E/content/faq-sul-calcolo-del-rt). We stress that other parameter values are used in different countries (e.g., [19,20]).
In this paper, we focus on one of the most widely used methods to estimate R t during an epidemic [12]. It is developed within a Bayesian statistical framework [21], which uses both the likelihood of the data Y(t) = Y f s (t) for t = 1, . . . , T and a prior model for the temporal R t sequence. As it will be seen in the next section, by following this approach, we can derive the expression for an estimatorR t (t) in a closed form, that is: where Y * f C denotes the convolution discretized by a linear combination of data, and I t is a temporal interval containing time t. Intuition suggests that the choice of I t to be symmetrically centred in t can be adopted to minimize a possible bias of the estimator. To obtain this formula, it is assumed that the instantaneous reproduction number is constant within I t . This hypothesis allows for a reduction of the influence of random errors in measurements during the estimation of R t . Mathematically, this is obtained by the two "averaging" terms appearing both in the numerator and the denominator of Equation (2). However, we stress that this procedure may induce systematic errors. In fact, the slope of the estimated R t is often reduced (in absolute value) with respect to that of the underlying "true" curve. Finally, the constancy over I t is independently assumed from time to time without any guarantee of consistency. The strategy adopted here aims to overcome the problem induced by the "averaging" procedure of the standard approach to estimate R t . Hence, we do not assume R t to be constant over the interval I t . This paper first describes a new procedure that is closely related to the one in [12], which is currently used in Italy to estimate R t . The method is simple, both to understand and to implement, and it has been applied to both simulated and real SARS-CoV-2 Italian data at different time intervals. However, it could be easily adapted to other contexts worldwide. Secondly, we illustrate the theoretical and empirical results we found on the relationship between the R t estimation based on first symptoms and that based on positive test data. In Section 2, we first give some details of the method in [12], and we describe the methodologies developed in our work. In Section 3, we illustrate the results obtained by applying the proposed methodologies to both synthetic and real SARS-CoV-2 Italian data. For comparison, the estimates from the standard method in [12] are included as well. Finally, the results are discussed in Section 4.

Mathematical Models and Methods
In this section, we first provide some details about the method in [12] to derive Equation (2). Secondly, we illustrate how it is possible to reduce the errors of the data, both for first symptoms and positive test sequences. Then, we focus on the estimation of R t based on positive test data as a valid alternative to the currently used first symptoms data. At the end of this section, we describe the proposed modification to the method in [12] to estimate the instantaneous reproduction number R t during an epidemic.

A Standard Method to Estimate R t
In the method by [12], the data Y(t) = Y f s (t) for t = 1, . . . , T are assumed to be independent and identically distributed (i.i.d.) as a Poisson distribution, and therefore the expression for their likelihood in terms of the parameter vector R t is as follows: where we used Equation (1) to approximate the expected value of Y(s). The prior model on the instantaneous reproduction number in the days of the time interval considered is assumed to be the product of the PDFs, which are all equal to a Gamma distribution with shape a = 1 and scale b = 5 (see [12]). Those two probabilistic models are then combined by means of the Bayes theorem to obtain the posterior PDF, that is: Equation (3) factorizes because of the independence of the different components of R t ; therefore, the inference can be performed independently component by component.
To do that, we consider the posterior marginal distribution density function of the variable Q = R t (t) given by the following equation: which is a Gamma distribution with shape Y(t) + a and scale as parameters. The mean estimator is then adopted. We recall that the mean of a Gammadistributed random variable is the product between the shape and the scale. Then, in our case, we obtain the following equation: When the order of magnitude of the measured data is at least in the hundreds, the prior parameters (a, b) in Equation (6) can be neglected. In fact, this situation typically happens in the case of a very large population studied in an active phase of the pandemic, long after its beginning.
The above estimatorR t (t) is, however, affected by random errors induced by those of the first symptoms sequence. The contribution of the denominator in (6) to the errors is lower than that of the numerator because of the presence of the convolution, discretized by a linear combination of data. To further reduce the influence of the errors on R t (t), it is assumed that the instantaneous reproduction number is constant within an interval I t . Then, the marginal distribution density of R t becomes a Gamma PDF with shape a + ∑ s∈I t Y(s) and . With abuse of notation, the corresponding estimatorR t (t) is, in this case, given by the right-hand side of Equation (2). Since summation now also appears in the numerator and twice in the denominator, the influence of random errors is further reduced, which is reflected in a smoother R t curve. However, as pointed out in the Introduction, systematic errors may appear.

Data Error Reduction
Often, in an epidemic such as the current SARS-CoV-2 pandemic, both positive test and first symptoms sequences, besides random fluctuations, contain variations that are approximately described by a weekly oscillating component, with a local minimum on Mondays and a local maximum on Wednesdays/Thursdays. This is more evident for positive test data, as some steps involved to produce them are dependent on the day of the week. Therefore, we process the measured sequences to produce a version of them with reduced periodic and random distortions, as often performed in the literature (e.g., [22,23]). To do that, we model the data as the sum of two components. The first consists of a non-parametric Nadaraya-Watson linear combination of data [24,25]: where Y(1), . . . , Y(T) are the measurements. In the above formulas, K(·) is the kernel function, and the positive parameter γ is the bandwidth [26]. The kernel is typically modelled by a standard Gaussian PDF: and this is also performed here. The second component of the model that we propose for the data is parametric: it consists of a sinusoidal function with a period of 7 days, which is multiplied for each time by an "envelope" function. The specific choice of the latter depends on the data considered (see Section 3.1).
The whole set of model parameters consists of the bandwidth of the non-parametric component and the coefficients of the second one. To select the optimal values for all the parameters, we proceed as follows. For each element of the bandwidth in a finite set of selected values, we initially compute the first component directly from the data with Equation (7). Then, correspondingly, we estimate the parameters of the periodic component by optimizing the fit of the complete model to the data. Among the estimated models, we finally select the one which gives the best fit to the data. At the end of this procedure, when dealing with the standard estimator for R t , we use the data obtained by subtracting the optimal periodic component from the measured data. For simplicity, we hereafter use the same notation Y(1), . . . , Y(T) for this "filtered" sequence. In contrast, to estimate R t through the proposed method, we use the non-parametric componentỸ(1), . . . ,Ỹ(T) of the best model.

Estimation of R t from Positive Test Data
As intuition suggests, there exists a relationship between the positive test and the first symptoms sequences. In fact, a contribution to the incidence Y pt (t) of a positive test registered at time t is given by the integral Y f s (τ) of the first symptoms outcomes that occur at any time τ < t, weighted by the conditional probability density P[Y p t (t) | Y f s (τ)]. Of course, we also have to take into account the contribution to the incidence Y pt (t) of the subjects that are not able to localize their first symptoms outcomes. Assuming that this last contribution is approximately proportional to the total incidence Y pt (t), it follows that By assuming that the mechanisms for sampling, executing tests, and registering the results remain the same along time, the above conditional probability only depends on the temporal difference t − τ. Therefore, the sequence of the positive test is obtained by the convolution of the first symptoms sequence with an appropriate kernel function g(·): We notice that the function g(·) no longer integrates to 1, as it incorporates a constant larger than 1 due to the presence of the subjects that cannot identify the day of their first symptoms.
Let us now replace the expression for the first symptoms sequence in the previous Equation (8) with a convolution integral given in (1), where we explicitly set f := f C , which is the kernel introduced by [18]: Since, in practice, the support of the kernel g(·) is 14 days, we can write the following equation: where we have applied the general version of the mean value theorem, andt is a certain time in the interval (t − 14, t). In general,t is an unknown functiont(t) of time t. However, we can assume that the mechanisms producing the first symptoms outcomes in a subset of infected subjects, and all those that involve testing in this subpopulation, remain identical during a reasonably limited time period. Of course, this is not true in the very first stages of the pandemic. Therefore, in a limited temporal interval, the whole composite phenomenon is "homogeneous" along time. For such kinds of phenomena, given a generic time t, the pointt(t) where R t is calculated in the final expression of Y p t (t) above, must have the same distance from t independently of the absolute location of time t along the temporal axis. It then follows thatt(t) = t − δ, where δ is a constant, and we have the equation below: We notice that Equation (9) is identical to Equation (1), but the R t is shifted by the constant quantity δ. This is intuitive, as the process of the positive test sequence ideally resembles that of the first symptoms, with a delay. The same is true between the latter and the infection sequence. However, we recall that the infection sequence is unknown, and that in fact, the kernel in [18] was estimated by using a very limited sample of primary infection data (in tens or hundreds).
We also want to empirically verify Equations (8) and (9). We then come back to the kernel function g in Equation (8). It is usually modelled by a Gamma distribution density function with shape k and scale θ parameters. However, to verify Equation (8), we have to re-parametrize this model by replacing the scale parameter θ with the mean value µ = k · θ, which has a direct interpretation. We also consider an additional multiplicative parameter in the kernel, which allows us to account for the fact that the day of the first symptoms is only known for a fraction of all the patients that tested positive. The explicit formulation we consider for the kernel function is then given by the following equation: The components of the parameter vector (A, k, µ, t 0 ) of this model are estimated by minimizing the discrepancy between the measured sequence of the positive test incidence and its theoretical expression, which we recall is given by the discrete version of the convolution integral in (8). For the minimization, we use the simulated annealing algorithm with a geometric temperature schedule [27]. The initial value for the temperature is chosen as the empirical mean of an initial random exploration of the parameters state space. The value for the exponent of the geometric law is 0.999.

A New Estimator for the Reproduction Number Sequence
Similar to the standard method in [12], the method proposed here to estimate R t during an epidemic is also based on Equation (1). As we explained before, in the standard approach, when estimating R t on day t, one assumes that its value is constant within an interval I t centred on t. In this way, the estimatorR t on day t will depend on the data in the whole I t (see Equation (2)). This is done in order to increase the regularity of theR t curve. However, in order to reach the same goal with our approach, we first compute the sequence of the non-parametric componentỸ(1), . . . ,Ỹ(T) from the data, as described in Section 2.2. After that, we estimate the R t on day t by using Equation (6), thus obtaining the following equation:R We recall that the convolution in the denominator above is a discrete approximation of the convolution integral given by a linear combination of the sequenceỸ(1), . . . ,Ỹ(T). Of course, we are aware that Equation (6) is theoretically derived under the assumption of the data's independence at different times. However, this assumption is not valid here as we perform the first step in reducing the errors in the data.

Real SARS-CoV-2 Italian Data
We start by considering a first set of SARS-CoV-2 Italian data for the first symptoms and a positive test, measured from 21 September until 20 November 2020. We notice that the date 14 September 2020 corresponds to the beginning of school activity in Italy; this induced a fast growth in the disease incidence, which started about two weeks later [28]. In Figure 1, we show both the first symptoms and the positive test measured data as well as the results of the methods to reduce their errors. As shown in this figure, the processed data are less affected by the presence of a one-week periodicity. The reduction of the errors is stronger for the positive test sequence. This is an expected result, firstly because the corresponding values are typically larger than those of the first symptoms. More importantly, the latter data have fewer factors that induce periodicity in the measurements. For the positive test sequence, the envelope chosen for the periodic component modelling the data (see Section 2.2) is proportional to the first component. This assumption is reasonable as the within-week variation is expected to be proportional to the mean value. This hypothesis is also supported by the empirical evidence given in Figure 2, where the standard deviation of the local fluctuations of the positive test sequence is shown as a function of the estimated first component of the model for Y pt (·). To compute the standard deviation, we use the discrepancies between the measured positive test data and the estimated non-parametric component of their model, in slicing windows of 14 days centred on each point of the time interval. A degree 2 polynomial has been adopted. The optimal values for the parameters are obtained through an iterative least square procedure. We notice that in this case, we end up, in practice, with a linear relationship, which justifies the above choice for the envelope. However, this does not happen for data in another time interval, as we are going to see later on. The same procedure for computing the standard deviation and the same type of quadratic model are also adopted in the case of the first symptoms data (see Figure 3). In general, a quadratic model is adopted for the envelope in terms of the first component of the data model. Since the periodic distortions only appear in a sub-window of the time interval considered, we only applied the correction at that point. In Figure 1, the estimated first component of the data model is shown for both types of data. In addition to systematic errors, those that are random are also strongly reduced. First symptoms incidence As explained before, the sequences of the first symptoms and positive test data can be linked to each other through a convolution relationship based on a suitable kernel (see Equation (8)). This is empirically verified by the results in Figure 4. We stress that this convolution link is a crucial element at the basis of the theoretical relationship between the estimate of the R t from the first symptoms sequence and the sequence corresponding to the positive test curve. In particular, the calculations in Section 2.3 show that the R t sequence, which is based on positive test data, can be obtained from analogous data based on the first symptoms by a translation of a suitable time δ.  This theoretical result is empirically verified on the real SARS-CoV-2 data considered here. By means of the standard approach, we estimated R t, f s from the first symptoms sequence and R t,pt from the positive test sequence. In both cases, we used the data obtained by subtracting the optimal periodic component from the raw measurements (see Section 2.2). We stress that by following the method used to compute and publicly diffuse the official R t values from the Italian ISS, the estimation temporal interval I t was 14 days long, centred on t: {t − 7, . . . , t + 6}. To find the optimal value for δ, we focused on the mean absolute difference between the R t,pt and R t, f s sequences in the time interval considered, which was performed after shifting backward the first sequence of a variable number of days, from 1 to 10. We then selected the δ that could minimize this difference, and we found an optimal value of 6 days. This choice corresponds to a good agreement between the two sequences after shifting, as shown in Figure 5. The value of the mean absolute difference in the time interval considered is 0.02.
We also applied the same analysis to a similar dataset in a different temporal interval, i.e., the period of one month starting on 7 December 2021, when the highly contagious Omicron variant of SARS-CoV-2 largely spread in Italy for the first time [29]. In addition to the differences due to the virus variant, the testing conditions were then quite different. The results we obtained are illustrated in Figure 6, from which we can draw the same conclusion as for the previous dataset considered. The optimal value we found for the shift δ was 4 days. This shorter delay is expected, since we are now considering a period in which the possibility for a person to be tested with reliable results is far easier than before. There is also the possibility of self-testing by means of easily purchasable antigenic tests, and a new, more reliable antigenic test (COI) that gives rapid results. For completeness, in Figure 7    We now turn to the results obtained by the proposed method to evaluate the reproduction number R t during an epidemic for the first real dataset considered. They are illustrated in Figure 8, which also contains the results from the standard approach for comparison. As we can see, the R t curve from the proposed method is more regular than the curve from the standard approach. Around day 30, the differences between the two curves are small. This is not true for the first part of the interval, where the slope of the R t curve is high (until about day 15), because a systematic deviation appears: the estimate of the R t with the proposed approach shows a higher slope than that estimated in the standard case. In contrast to this latter approach, in the proposed one, we first reduced both the systematic and random errors in the data, and then we applied the estimator in (6). As an alternative, one could first apply the same formula to the data after reducing the systematic errors and then perform an arithmetic averaging of R t in the interval I t . The result is shown in Figure 9, where it can be noticed that the proposed method outperforms this last procedure.
Confidence Intervals (CIs) for R t from both the proposed and standard methods are computed as follows. After estimating the two model components from the data, we generated a sample of n simulated signals first by adding these two sequences to each other, and then summing the resulting i.i.d. Gaussian noise with time-varying standard deviation, as shown in Figure 3. For each element of the data sample, we estimated the R t with the use of both the proposed and the standard methods. We stress that in order to make a fair comparison, the data processed by the standard approach were obtained from the simulated signals following a procedure the same as that used for the real data. For each of the two methods, we computed the 95% CI at each time t in the temporal interval considered. Using a sample of (n = 100) independent replications, the maximum and mean values for the semi-amplitude of the CI t were respectively (0.07, 0.1) for the proposed method, and (0.04, 0.07) for the standard one. Since the posterior of R t for the standard method is a Gamma model, we used it to compute CI t . However, in this specific case, the values of the CI t became substantially smaller than those above, and we therefore did not follow this procedure.   Figure 9. The same as in Figure 8, but the modified standard R t curve is now obtained by first applying estimator (6) to the data reduced by the systematic error, and then performing arithmetic averaging in time intervals I t .
Similarly to what had been done before, we further validated the results of the comparison with the standard method by analysing the data in the second time interval, as above. In fact, due to the large spread of the Omicron variant, this period is also characterized by a high slope of the curve, thus allowing us to highlight the differences between the standard approach and our proposed method. The two estimated R t curves are shown in Figure 10. We can see that the R t from our method shows a slightly higher slope and higher peak, correctly retracing the result from the raw first symptoms data, represented in the figure by a dotted line. Finally, the same results as those above have also been obtained from data in different countries. An example is given in Figure 11, where we consider SARS-CoV-2 data in New York City (NYC), measured from 26 October to 25 November 2020. We can also see that in this case, the R t from our proposed method shows a higher slope and a higher peak, occurring after the first two weeks of October. The peak could be due to the fact that at the beginning of October, NYC elementary, middle, and high schools began in-person learning.

Synthetic Epidemic Data
As anticipated before, to further validate the effectiveness of the proposed method from a quantitative/objective point of view, we generated synthetic epidemic data. We first obtained the synthetic data by modelling the incidence of first symptoms events by means of a pseudo-Gamma PDF (see (10)), with parameter vector (13289, 22, 83, 0) plus an extra additive parameter C = 551. This choice was made to obtain a "realistic" synthetic dataset, resembling the real data illustrated in panel a) of Figure 1. Given the regularity of the generated first symptoms events, we computed the "true" R t sequence by using Equation (6). We stress that in general, since the kernel by [18] has, in practice, a support of 28 days, in order to get the value of R t at any time, we need the data relative to the 28 days before it. An i.i.d. sample of the n signal was then simulated by adding an i.i.d. zero mean Gaussian noise to the standard deviation, growing as the same degree 2 polynomial model used for the first set of the real data. Figure 12 shows the mean of the R t curves and its 95% CI estimated by the proposed method (n = 100). The CI t is computed first by following the same procedure as that used for the real data, and then applying the Law of Large Numbers. The same figure shows the "true" R t curve, which is almost always contained in the 95% CI band. Analogously, Figure 13 contains the same result for the standard method. In this case, the "true" R t is not included in the 95% CI. Therefore, this provides additional evidences in favour of our procedure.  Figure 12. Mean of the R t curve estimated by the proposed method from a realistic SARS-CoV-2 first symptoms synthetic data sample (n = 100). The relative 95% CI is shown as a grey shadow. The "true" R t sequence is given as a dotted line.  Figure 13. Mean of the R t curve estimated by the standard method from a realistic SARS-CoV-2 first symptoms synthetic data sample (n = 100). The relative 95% CI is shown as a grey shadow. The "true" R t sequence is given as a dotted line. The mean is obtained from 100 independent realizations of the synthetic data.
In addition to the simulated results just described, we compared the performances of the standard and the proposed methods by using another synthetic dataset. In this case, we started directly from the R t sequence. More precisely, we considered an R t that was constantly equal to 1 until a certain time t = 0, after which it increased linearly with a slope α (ramp). Correspondingly, the sequence of the first symptoms data was explicitly derived by imposing that Equation (1) be fulfilled for the times in the interval considered (−∞, T], with T > 0. Since R t (t) = 1 for t ≤ 0, it is easy to verify that this implies that {Y(t)} t=−∞,...,−1 ≡ Y(0). We then set Y(0) = 1. For any time t = 1, . . . , T, we obtain the following equation (see Supplementary Material): Equation (12) can then be used recursively to compute the values of {Y(t)} t=1,...,T . In panel (a) of Figure 14, the "true" R t sequence is shown. In panel (b) of the same figure, we plot instead the corresponding first symptoms sequence obtained by recursively applying Equation (12). Considering the noiseless case, from Equation (2) with a = 1 b = 0, we can explicitly compute the estimate of R t in the standard case. By looking at the results of both the real SARS-CoV-2 and the synthetic epidemic data of the first type, we notice that the bias is larger at the beginning of the interval where the R t has a high slope. Therefore, we focus on the biasR t (0) − R t (0) at time t = 0. As shown in the Supplementary Material, the standard estimatorR t (0) is given by the following equation: where we used the basic Equation (1), and where R t (t) ≡ R t (0) for t ≤ 0. By inserting in Equation (13) the known values of the R t sequence and those of Y(·) calculated from the system Equation (12), the considered bias is obtained. The corresponding bias for the proposed method is instead derived as follows: where the sequenceỸ(·) is obtained by applying the Nadaraya-Watson estimator (7) to the sequence Y(·), calculated from the system Equation (12). In Figure 15, we plotted the bias as a function of the slope α for both the standard and the proposed methods at day 21 (t = 0). For the proposed method, we used three different (fixed) values of the bandwidth γ appearing in (7). The highest value of γ was close to the one (2.7) estimated from the real SARS-CoV-2 first symptoms Italian data. By looking at the figure, we notice that all the bias curves increase with α. The curve of the bias from the standard approach is above all the curves from the proposed method. Furthermore, by increasing the value of the fixed bandwidth γ, the curve of the proposed method gets closer to that of the standard approach.

Discussion and Conclusions
In this paper, we dealt with estimating the reproduction number R t during an epidemic. We first analysed, both theoretically and empirically, the relationship between the R t estimated from the first symptoms and the positive test data. Second, we modified the standard method by [12], which is widely used to estimate R t in several countries worldwide, including Italy: we did not rely on the hypothesis of the local constancy of R t . To perform both tasks, we also developed a specific method to reduce errors in both the first symptoms and the positive test data.
We have proved, both theoretically and empirically, that the R t curve estimated from the positive test sequence is equal to the curve estimated from the first symptoms after shifting the last one forward by a temporal quantity δ. After a certain time, the value of δ may change, depending on some specific factors. For example, the average time between the first symptoms and a positive test may change under some conditions, e.g., the season. In addition, δ may change with the spatial location. In any case, the value of δ is a-priori unknown, and it must be estimated. Several approaches can be used. One could use a combined set of first symptoms and positive test data. After estimating R t from each of the two kinds of data, δ can be found by minimizing the discrepancy of the two R t curves (after shifting). Alternatively, one could rely on the meaning of this shift, interpreting it as the time delay ∆t between the occurrence of the first symptoms and the positive results of the test. More precisely, from a dataset of patients where this delay is known for each individual, δ can be estimated as the median value of ∆t. For example, in the case of the SARS-CoV-2 data for the Veneto region (Italy) during the first epidemic wave, the median value of the temporal delay is 6 days [30]. This value is equal to the one obtained by applying the first approach to the first real dataset used. We point out that instead of aiming at replacing the first symptoms data with the positive test data, one could use both of them to get two estimates of R t , which would likely enhance the reliability of the results. In addition, estimating R t from positive test data has the advantage of being based on a far larger sample.
We notice that by the standard method, we cannot estimate R t closer than 6 days from the last measurement. This is also true when using positive test data. However, as we have seen, the estimated R t at any day is equal to the R t from the first symptoms process corresponding to δ days before it. For the value of δ estimated here (6 days), it follows that the last day in which we can estimate R t is 12 days before the last measurement. On the other hand, we notice that R t estimates based on first symptoms data are reliable until 16 days before the last measurement. In fact, these data are stable only approximately 10 days before the last measurement. Therefore, by using positive test data, we also gain 4 days, which increase to 6 for the second real dataset, as in this case δ is 4 days.
Regarding the new methodology for estimating R t , the results obtained show that it outperforms the standard approach. In fact, in the regions with a high R t slope, the estimate from the latter approach has a lower slope as compared to the proposed one, while in the remaining part, there is good agreement. This is a real effect, as the pattern appears for both real and "realistic" synthetic first symptoms data. From the ramp synthetic data, we have seen that the bias increases with the slope α of the ramp. However, for the standard estimator, the bias is always larger than that of our method, and its rate of increase with α is greater for the standard method.
In the standard approach, the estimator is the mean of the posterior distribution (in most cases and in practice, the likelihood of the data), and one simultaneously imposes that R t is constant within two weeks. As usual, in this situation, the price for the induced regularity of the R t curve is a reduction of its slope in the region where it is large. Instead, in our method, we reduce the influence of the errors on the data by first smoothing them. Then, the estimate of R t is obtained by independently maximizing the posterior probability at each time. The estimates obtained are not affected by the problem above. This is true for both the first symptoms and the positive test data. It would be interesting to investigate if this happens in general or under certain conditions. By applying our simple methodologies, either the standard method applied in Italy to compute R t for SARS-CoV-2 based on the first symptoms sequence is improved, or a valid alternative (or additional) approach is provided to the one using the first symptoms data when applying the standard method to positive test data. We notice that we also developed more advanced methods, e.g., the Bayesian approach with an a priori probability model on the temporal continuity of R t , obtaining very similar results. However, the former approach is more difficult to understand, and its implementation is more complicated as it involves Markov Chain Monte Carlo simulation to perform statistical inference, which is also time-consuming. In addition, there is the problem of hyper-parameter estimation. We were able to cope with all these issues, but we wonder-why should one use a more complicated approach to get similar results to those derived from a far simpler one?
As a conclusion, the results illustrated in this paper suggest that the reproduction number R t during an epidemic can also be estimated by applying the standard estimator to positive test data, with some advantages. In addition, the new estimator we proposed for R t outperforms the standard one. We hope that the extensive application of these procedures to real situations, including the current SARS-CoV-2 pandemic, will become a common practice that could help in the study and control of epidemics.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/vaccines10111788/s1. Author Contributions: All authors contributed equally to the study's conception and design, statistical and mathematical data analyses, code development, results interpretation, and manuscript writing. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.