Correcting the Actual Reproduction Number: A Simple Method to Estimate R0 from Early Epidemic Growth Data

The basic reproduction number, R0, a summary measure of the transmission potential of an infectious disease, is estimated from early epidemic growth rate, but a likelihood-based method for the estimation has yet to be developed. The present study corrects the concept of the actual reproduction number, offering a simple framework for estimating R0 without assuming exponential growth of cases. The proposed method is applied to the HIV epidemic in European countries, yielding R0 values ranging from 3.60 to 3.74, consistent with those based on the Euler-Lotka equation. The method also permits calculating the expected value of R0 using a spreadsheet.


The Basic Reproduction Number
The basic reproduction number, R 0 , of an infectious disease is the average number of secondary cases generated by a single primary case in a fully susceptible population [1]. R 0 is the most widely used epidemiological measurement of the transmission potential in a given population. Statistical estimation of R 0 has been performed for various infectious diseases [2,3], aiming towards understanding the dynamics of transmission and evolution, and designing effective public health OPEN ACCESS intervention strategies. In particular, R 0 has been used for determining the minimum coverage of immunization, because the threshold condition to prevent a major epidemic in a randomly-mixing population is given by 1-1/R 0 [4]. In addition, R 0 gives an estimate of the so-called final size, i.e., the proportion of the population that will experience infection by the end of an epidemic [5,6].

Statistical Estimation of R 0
Methodological discussions concerning the statistical inference of R 0 are still in progress, and it is recognized that the estimate is very sensitive to dispersal (or underlying epidemiological assumptions) of the progression of a disease [7]. When one estimates R 0 using epidemic data of an emerging (or exotic) disease, the exponential growth rate, r, of infections during the initial phase of the epidemic is used [8,9]. Assuming that the generation time, i.e., the time from infection of a primary case to infection of a secondary case generated by the primary case [10], is known (or separately estimated from other empirical data), the growth rate r is transformed to R 0 using that knowledge (see below). That is, the conventional estimation technique has required two statistical steps, namely, first estimate r and then convert r to R 0 .
The estimation method can be illustrated by employing a simple renewal process which adheres to the original definition of R 0 [1]. Let j(t) be the number of new infections (i.e., incidence) at calendar time t. Supposing that each infected individual on average generates secondary cases at a rate A() at time  since infection (where  is referred to as the "infection-age" hereafter), j(t) is written as: Since R 0 represents the total number of secondary cases that a primary case generates during the entire course of infection, the estimate is given by ( [11]): When j(t) follows an exponential growth path, it is easy to extract the integral of A() from equation (1). Supposing that the incidence grows exponentially at a rate r, we have j(t) = kexp(rt) where k is a constant, and moreover, j(t − ) = kexp(rt)exp(−r). This simplifies (1) to the so-called Euler-Lotka equation: Since the density function of the generation time, g(), represents the frequency of secondary transmissions relative to infection-age , we can write g() as: Replacing A() in the right-hand side of (3) by that of (4), an estimator of R 0 is obtained [9]: The estimation of R 0 is achieved by measuring the exponential growth rate r from the incidence data and also by assuming that g() is known (or separately estimated from empirical observation such as contact tracing data [12,13]). It should be noted that the above mentioned framework has not been given a likelihood-based method for estimating R 0 (i.e., a likelihood function used for fitting a statistical model to data, and providing estimates for R 0 , has been missing). Moreover, equation (5) may not be easily used by non-experts, e.g., an epidemiologist who wishes to estimate R 0 using her/his own data.
The purpose of the present study is to offer an improved framework for estimating R 0 from early epidemic growth data, which may be slightly more tractable among non-experts as compared with the above mentioned estimator (5). A likelihood-based approach is proposed to permit derivation of the uncertainty bounds of R 0 . For an exposition of the proposed method, the incidence data of the HIV epidemic in Europe is explored.

Actual Reproduction Number
In addition to R 0 , a different measurement of the transmission potential using widely available epidemiological data, the actual reproduction number, R a , has been proposed for HIV/AIDS [14]. The concept of R a is much simpler than R 0 in that R a is defined as a product of the mean duration of infectiousness and the ratio of incidence to prevalence [15]. The prevalence p(t) at calendar time t is written as: where () is the survivorship of infectiousness, or probability of being infectious, at infection-age .
Letting () be the transmission rate, which depends primarily on the frequency of contact and infectiousness at infection-age , the above mentioned A(), the rate of secondary transmission per single primary case at , under an assumption of a Kermack and McKendrick type model, is decomposed as ( [16]): The actual reproduction number R a is written as: where D is the mean generation time (or what was previously described as the mean duration of infectiousness [14,15]). If the transmission rate () is constant  (i.e., independent of infection-age), R a = D. Moreover, from equation (2), R 0 is given by: Figure 1. The relative frequency of secondary transmissions of HIV as a function of the time since infection. A step function is employed to approximately model the frequency of secondary transmissions relative to infection-age. For d 1 years shortly after infection, the frequency g 1 is very high. Subsequently, for d 2 years (i.e., during the asymptomatic period), g 2 is persistently low, followed by a time period with high infectiousness g 3 for d 3 years until death or no secondary transmission due to AIDS. Following a statistical study [20], d 1 , R a coincides with R 0 as long as () is constant. Nevertheless, in many instances, the contact frequency and infectiousness (which may be partly reflected, for example, in the viral load distribution of the infected host) vary as a function of infection-age . The variation in () is particularly the case for HIV infection. Thus, although the usefulness of the incidence-to-prevalence ratio and R a has been emphasized to have an application in HIV/AIDS [15], R a tends to yield a biased estimate (if R a is regarded as a proxy for R 0 ), and the estimate of R a is not as robust as that is obtained with equation (5) to objectively interpret the transmission potential [17,18].

Correcting R a
Here, the above mentioned negative aspect of R a is reconsidered by correcting the definition of R a . The disease of interest in the present study is HIV. The frequency of secondary transmissions relative to infection-age  (i.e., the generation time distribution), approximated by a step function, is shown in Figure 1. As has been known for some time [19], the frequency of secondary transmissions is very high shortly after infection (for a duration d 1 = 0.24 years), followed by a long asymptomatic period with a low frequency of secondary transmissions (for d 2 = 8.38 years) [20]. Subsequently, the frequency rises sharply again resulting in a substantial number of secondary cases for a duration d 3 = 0.75 years until death or until the infected individual ceases risky sexual contact due to AIDS [20][21][22]. Assuming that the contact frequency does not vary as a function of infection-age, g 1 , g 2 and g 3 have been estimated at 1.30, 0.05 and 0.36 per year [20]. Here, g() is the density function of the generation time, with a mean of 3.79 years.
Here the concept of R a is corrected. The equation (8) is rewritten as: The numerator represents the number of new infections at calendar time t, while the denominator was originally intended to represent "the total number of infectious individuals" who can potentially be primary cases with an equal chance at time t. Nevertheless, in order that the estimator of the actual reproduction number coincides with that of R 0 , the concept of the denominator is better replaced by "the total number of effective contacts (which can potentially lead to secondary transmissions with an equal probability)". Therefore, the corrected R a is better written as: where g(), in the renewal equation with the Kermack and McKendrick type assumption, is written as the normalized density of secondary transmissions, i.e.,: Replacing g() in the right-hand side of (11) by that of (12), we get: Thus, the estimator of corrected R a in (11) is identical to that of R 0 . In other words, R 0 can be estimated from the incidence data and the generation time without assuming exponential growth of cases during the early phase of an epidemic. It should be noted that the ratio of prevalence to mean generation time p(t)/D in the denominator of the right-hand side of (10) has been replaced by "the total number of effective contacts that have equal potential to generate secondary cases".

Data
Here the epidemic data of HIV/AIDS in three European countries: France, the Western part of Germany (i.e., the former Western Germany) and the United Kingdom (UK) are investigated [23]. Figure 2A shows the yearly incidence in these countries from 1976-2000. During the observation period, a total of 23,243, 13,126 and 11,491 AIDS cases were diagnosed in France, Western Germany and the UK, respectively. Although the time of HIV infection is not directly observable, the HIV incidence has been estimated by employing a back-calculation technique and using the AIDS incidence and the incubation period distribution of AIDS [23]. The present study does not discuss the details of back-calculation, but explanatory guides can be found elsewhere [24][25][26]. Figure 2B shows the enlarged HIV incidence curve during the initial phase of an epidemic. The peak incidence was observed in 1982 for Western Germany and 1983 for France and the UK. In the following, the time period from 1976 up to one year prior to the peak incidence is taken as the early growth phase. For the purpose of an exposition of the proposed method, the HIV incidence is assumed to have been in a single homogeneously-mixing population.

Statistical Analysis
R 0 is estimated using two different methods, one employing the estimator (5) and another using the corrected R a . For the former approach, the exponential growth rate is estimated via a pure birth process [27]. Given that the cumulative incidence from year 0 to t − 1 is observed, the conditional likelihood of observing the cumulative incidence J t cases in year t is proportional to ( [28]): from which the maximum likelihood estimate and the 95% confidence intervals (CI) of r (per year) are obtained. Given a fixed generation time distribution g(), the uncertainty bound of R 0 mirrors the uncertainty in the estimate of r. The translation of r into R 0 via (5) is made by using g() in Figure 1.
The latter method, proposed in the present study, employs the estimator of corrected R a in (11). Since the data are yearly, the equation (11) needs to be rewritten in discrete-time: (15) where the discrete density function of the generation time, g s is assumed to be given by g s = G(s + 1) − G(s), where G(s) is the cumulative distribution function of the generation time of length s, but g 0 is calculated as a normalized yearly average, because d 1 is as short as 0.24 years.
The likelihood of estimating R 0 with (15) is proposed as follows. First, the inverse of both sides of (15) is taken: The numerator of the right-hand side indicates the total number of effective contacts made by potential primary cases in year t that have an equal probability of resulting in a secondary transmission, and the denominator is the number of secondary cases in year t. Given that all the potential contacts made by primary cases (black circles) are known using the incidence data and the generation time distribution, the probability that each potential contact resulted in a secondary transmission is given by 1/R 0 .  1 The intrinsic growth rate during the exponential growth phase; 2 the basic reproduction number estimated using equation (5); 3 the basic reproduction number estimated using equation (17); the 95% confidence intervals are shown in parentheses.
The right-hand side of equation (16) is interpreted as follows. Figure 3A shows a transmission tree, i.e., a representation of who infected whom, where each primary case on average generates two secondary cases. A transmission tree of this kind is usually unobserved unless rigorous contact-tracing with microbiological examination is implemented. Thus, a likelihood-based approach to reconstructing the tree is considered ( Figure 3B) [29][30][31]. Given the total number of effective contacts that have equal potential for resulting in secondary transmission, the probability of a single effective contact resulting in a secondary transmission (or the probability that a secondary case is linked to an effective contact made by a single primary case in year t) is given by 1/R 0 . This is a simple binomial sampling process. In other words, the likelihood function for estimating R 0 is:   (17) where T is the most recent time point of observation within an early (linear) epidemic growth stage. The maximum likelihood estimate of R 0 is obtained by minimizing the negative logarithm of (17), and the 95% CI are derived from the profile likelihood. Table 1 shows the estimates of r and R 0 for HIV in France, Western Germany and the UK. The maximum likelihood estimates of r ranged from 1.15 to 2.15 per year with the smallest estimate in France and the highest in Western Germany. The 95% CI for Western Germany did not overlap with those of the other two countries, reflecting the steep rise in incidence in this region in Figure 2B. Based on the exponential growth assumption in (5), the estimate of R 0 ranged from 3.65-4.08. Again, Western Germany yielded the highest estimate without an overlap of the uncertainty bound with the other two countries. The maximum likelihood estimate of R 0 based on the proposed new method ranged from 3.59 to 3.74. Not only were the qualitative patterns for the expected values of R 0 consistent with those based on an exponential growth assumption, but the 95% CI also broadly overlapped with those based on the other method. In particular, although R 0 in Western Germany using the proposed method is smaller than that based on an exponential growth assumption, the 95% CIs of the two methods overlapped. The 95% CI based on the proposed method appeared to be wider than those based on exponential growth assumption. Since HIV is mainly transmitted via sexual contact, the above mentioned estimate may vary with the mixing pattern and contact frequency (thus, there is no general disease-specific R 0 , especially for HIV/AIDS). At least, compared with a previous estimate of R 0 as ranging from 13.9 to 54.5 in the USA, based on an exponential growth assumption that adopted a mean infectious period of 10 years [32], R 0 in the present study appeared to be much smaller using a precise estimate of the generation time distribution.

Results and Discussion
The present study proposed the use of the corrected actual reproduction number, R a , for statistical inference of R 0 based on incidence and known relative frequency of secondary transmissions (i.e., the generation time distribution) during the early growth phase of an epidemic. The previously available method using (5) forced us to adopt an exponential growth assumption, and moreover, an additional step towards the estimation of r (i.e., the translation of r to R 0 ) was required [9]. The proposed method does not necessarily require an exponential growth assumption and provides a "short-cut" to estimate R 0 from incidence data. The simple likelihood function employing a binomial distribution was also proposed to yield an appropriate uncertainty bound of R 0 . It should be noted that given the knowledge of g s and readily available incidence data, equation (15) permits calculation of the expected value of R 0 without likelihood. Such a calculation can be attained using any spreadsheet.
The usefulness of the actual reproduction number, calculated as a product of the mean generation time and the incidence-to-prevalence ratio, has been previously emphasized in assessing the epidemiological time course of an epidemic [14,15]. However, it appears that R a does not precisely capture the secondary transmission if the transmission rate () varies with infection-age  [18], and moreover, the cohort-and period-reproduction numbers directly derived from the renewal equation have been suggested to be more accurate figures in capturing the underlying transmission dynamics [17,33]. In the present study, replacing the denominator (i.e., prevalence) of R a by the total number of potential contacts, it was shown that the R 0 derived from the renewal equation coincides with the corrected actual reproduction number, R a , and also that the likelihood can be quite easily derived. The corrected R a does not require prevalence data, and uses only the incidence data and the generation time distribution.
Many future tasks remain, however. Most importantly, the estimation of R 0 from early epidemic growth data for a heterogeneously-mixing population is called for. R 0 in the present study can even be interpreted as R 0 for a heterogeneously-mixing population (i.e., the dominant eigenvalue of the nextgeneration matrix), provided that the early growth rate is the same among sub-populations (though it is not the case if the growth rate varies across sub-populations) [34,35]. Analyzing heterogeneous transmission among approximately-aggregated discrete groups, the estimate of the next-generation matrix would give more detailed insights into the epidemic dynamics, including the most important target host for intervention [36]. As discussed in a modeling study in this special issue of the International Journal of Environmental Research and Public Health [37], understanding the implications of sexual partnerships and their variations as a function of calendar time as well as infection-age is also of utmost importance. As the next step for a similar estimation framework, methods for estimating robust R 0 and the next-generation matrix from structured data (i.e., data stratified by age-and/or risk-groups) will be useful. Despite the future challenges, I believe the present study at least satisfies a need to offer a likelihood-based approach to estimate R 0 from early epidemic growth data, while being easily tractable and calculable among general epidemiologists.