The Stochastic Approach for SIR Epidemic Models: Do They Help to Increase Information from Raw Data?

: The recent outbreak of COVID-19 underlined the need for a fast and trustworthy methodology to identify the features of a pandemic, whose early identiﬁcation is of help for designing non-pharmaceutical interventions (including lockdown and social distancing) to limit the progression of the disease. A common approach in this context is the parameter identiﬁcation from deterministic epidemic models, which, unfortunately, cannot take into account the inherent randomness of the epidemic phenomenon, especially in the initial stage; on the other hand, the use of raw data within the framework of a stochastic model is not straightforward. This note investigates the stochastic approach applied to a basic SIR (Susceptible, Infected, Recovered) epidemic model to enhance information from raw data generated in silico. The stochastic model consists of a Continuous-Time Markov Model, describing the epidemic outbreak in terms of stochastic discrete infection and recovery events in a given region, and where independent random paths are associated to different provinces of the same region, which are assumed to share the same set of model parameters. The estimation procedure is based on the building of a loss function that symmetrically weighs ﬁrst-order and second-order moments, differently from the standard approach that considers a highly asymmetrical choice, exploiting only ﬁrst-order moments. Instead, we opt for an innovative symmetrical identiﬁcation approach which exploits both moments. The new approach is speciﬁcally proposed to enhance the statistical information content of the raw epidemiological data.


Introduction
Since the very beginning of COVID-19 diffusion in early 2020, effective countermeasures to limit the spread of the disease have been non-pharmaceutical interventions such as lockdown, social distancing and limitation of economic activities, with an unavoidable worsening of social life and wealth [1][2][3].Most of such unpopular decisions have been taken according to scientific committees striving to understand the real gravity of the diffusion and to forecast how these decisions would impact with the spreading of the pandemic.To this end, it has been soon clear that mathematical models could be of great help to identify the features of the COVID-19 spread as well as to make predictions and to support decisions [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18].Indeed, nowadays there can be found many models, ranging from the basic SIR epidemic model that accounts for Susceptible, Infected and Recovered individuals, to finer, multi-compartmental models accounting for the Exposed, Hospitalized, Vaccinated, etc. [19][20][21].However, most of the existing models share the feature of being deterministic; therefore, they are not able to capture the inherent randomness of the phenomenon under investigation and to properly characterize the uncertainty of future possible scenarios.
In [22], we presented a first investigation on an alternative identification approach based on the Chemical Master Equation (CME) modeling framework, which was used to describe the COVID-19 epidemics in Italy.In that paper, the model parameters of the resulting stochastic SIR model, i.e., the relative infectivity rate and the per capita removal rate, were the target of the estimation approach.The proposed results exploited second-order moments and showed the potential effectiveness of the proposed approach by applying it to real data on the number of notified cases coming from the National Institute of Health of Italy (Istituto Superiore di Sanità, ISS) [23].The CME modeling framework was originally introduced to describe the dynamics of chemical reaction networks (see for instance [24]), but the method is frequently used in biological applications, especially when the dynamics of the considered population is inherently stochastic, such as for epidemics [25], cellular biochemical systems [26] and growing tumor populations [27].
Differently from [22], in this work, we aim at investigating the estimation approach proposed in [22] on a theoretical ground, assuming to exploit in silico data provided by the model simulations.This allows us to stress the effectiveness of the proposed research with respect to different sets of pandemic parameters, not necessarily anchored to the ones related to the COVID-19 disease.According to the idea carried out in [22], we assume a given country affected by the pandemic divided in districts.Each district develops the pandemic according to the same model parameters, therefore providing independent random paths sampled from the same stochastic process: to this end, a Continuous-Time Markov Chain is exploited to deal with the discrete random events providing one more infected or one more recovered [28].Random paths are simulated according to the Stochastic Sampling Algorithm (SSA) [29].Different sets of model parameters provide different epidemic scenarios where to test the proposed methodology.First-and second-order moments are computed from in silico random paths, and they are compared to the theoretical first-and second-order moments computed from the CTMC theory [30].As a matter of fact, an in silico model with known parameters allows one to evaluate more precisely the added value of second-order moments in the estimation.This is not possible in the presence of real data, because the real pandemic parameters would be unknown.For the moment computation, due to the nonlinear fashion of the model propensities (infected are supposed to occur at a rate proportional to both infected and susceptible individuals, as usual), we resort to the linear approximation, which is usually suggested by the conjecture that the pandemic shows very low infected individuals with respect to susceptibles.It is worth noticing that according to standard deterministic epidemic models, data coming from different sub-regions would be put together and therefore treated as averages.
Numerical results show that if only infectious data are available, the use of the secondorder moment beyond the first-order one sensibly improves the parameter estimation, obtaining finite estimation errors when separate estimates of the relative infection and removal rate are attempted.Indeed, the first-order moment expression of the number of infected patients only depends on the difference between the model parameters, definitely making them not individually identifiable from the mean value of infected.
Conversely, when both infectious and removed data are available, the identification of the SIR parameters is always meaningful and benefits from a possible larger number of experimental samples, as happens when more provinces are considered.The removal rate is better estimated by the second-order procedure, differently from the infection rate, whose estimate worsens (on average).However, the model trajectories are scarcely sensitive to this estimation error, since both first and second-order experimental data are accurately reproduced by the second-order fitting.Furthermore, confidence intervals of both parameters are also reduced in this case, which suggest higher reliability of the second-order method in real scenarios.

A Stochastic SIR Model
In the present work, our investigation focuses on the early spread of the epidemics, which is a relatively short time horizon where the system is open-loop (no restriction measures have not been taken yet); in particular, we focus on an identification problem in a situation of scarce data; for this reason, we decided to keep our modeling setting as simple as possible, and we selected the classical SIR model [31] as the more appropriate framework for our study, by neglecting for example the exposition dynamics (see e.g., [32]) and the presence of delayed transmission and lags/dead times in the effects of control actions (see e.g., [33] for a comprehensive review on this topic, not limited to biological systems).
According to the standard SIR modeling choice, the whole population is divided into Susceptibles (the ones prone to get infected), Infected (the ones actually infected and responsible for further infections) and Recovered (the ones healed and no more infectious, or death).The whole population (Susceptibles + Infected + Recovered) is not supposed to vary for birth/death processes different from the ones provoked by the pandemics: its constant amount is denoted by N. Due to the constraint on the whole population, the state of the system evolves, thus, according to a pair of components.In this work, we consider the currently infectious patients I and the removed individuals R, and their dynamics is described by exploiting the formalism of Continuous-Time Markov Models (CTMC) [28].After deriving the evolution of the pair I, R, the dynamical behavior of the number of individuals susceptible to the infection S is straightforwardly obtained by the relation Note that according to the chosen modeling setting, the state variables are actually countable variables, which is a more appropriate representation than the unrealistic continuity assumption made by the classical deterministic SIR formulation.
Going into the mathematical details, the state variation of I and R, as well as of the other derived variables, is due to a pair of discrete stochastic events, namely the infection, E 1 , and recovery (by healing or death), E 2 , events: where w 1 and w 2 are the propensities of the modeled events (i.e., the time derivatives of the transition probabilities, time −1 ), β is the relative infectivity (depending on the infection probability of an infected-susceptible contact and on the contact frequency), and γ is the per capita rate of removal (healing plus death) from the infection.The dynamical equation of the first-and second-order moments cannot be directly written in a closed form from the propensity expressions due to the nonlinear form of w 1 [30].In order to obtain closed forms of higher-order moments, we consider the simplifying assumption S ≈ N, which is actually a realistic assumption when the epidemic is at the beginning of its outbreak or when it is controlled, so that the total cases are sensibly lower than the whole population number.This assumption simplifies the expression of w 1 as w 1 βI, that becomes actually linear with respect to the state variables.The simplified expression of w 1 allows one to write the first-order moment equations as that provide the explicit solutions The ODE system (3), representing the first-order moments dynamics when S/N ≈ 1, gives back the standard structure of the deterministic SIR formulation [31].
Indeed, these first-order Equations ( 3) and ( 4) are usually exploited to identify the relative infectivity rate β and the per capita removal rate γ according to the measured infected or cumulative cases available from local or national Institutes of Health.In [22], we proposed a way to exploit the inherent randomness of the events and the straightforward stochastic SIR model in order to gain information from the available data.The key point is to gather more random paths from a given scenario.To this end, we assume to divide a given country (affected by a pandemic) into m districts.Each district has its own population (N i , i = 1, . . ., m) and faces the same pandemic whose spreading depends on a unique set of model parameters: in other words, each district shares the same pair (β, γ).By keeping the approximation S ≈ N i , i = 1, . . ., m, each district shares as well also the approximated propensities.A further hypothesis is that any district is isolated with any other.This way, the stream of data acquired from the districts is independent random paths, according to which higher-order moments can be computed and exploited to increase the reliability of the estimates.
Therefore, according to [30], due to the approximation S i ≈ N i , we write the secondorder moments in closed forms as that is, after computations: that give the explicit solutions Remark 1.The initial conditions I(0) , R(0) , I 2 (0) , I(0)R(0) , and R 2 (0) have to be considered as unknown to be identified as well as the pair (β, γ).However, in case of the very beginning of the epidemic spread, the initial recovered could be thought of as a deterministic value equal to 0, so that first-and second-order moment equations involving R(0) simplify as follows:

The Identification Procedure
Let us assume that an epidemic is spreading throughout a given country and that the stochastic modeling framework proposed above represents the main features of the disease.Let us also consider the different regions belonging to the country, assuming that their different geographical positions and local cultures may affect the value of the parameter pair (β, γ).However, the epidemiological data coming from different sub-regions, the aforementioned districts that we will call provinces or counties hereafter, for simplicity, can be seen as different realizations of the regional CTMC.In other words, each province initially shares the same pair (β, γ) of the region, and its data can be regarded as a random path of the region where it belongs.Obviously, fluxes of people within provinces are supposed to be neglected.
Differently from [22], where we assumed that the daily number of cumulative cases was the only available data coming from the provinces, here, we assume two different scenarios depending on the information level: (a) We assume knowledge of periodic measurements of currently infectious individuals; (b) We assume knowledge of periodic measurements of currently infectious and removed individuals.
More formally, in this paper, we focus the attention on a single region with m ∈ N provinces, which are tracked for consecutive K samples, and for which the data set of currently infectious individuals {i l,k } K−1 k=0 , l = 1, . . ., m, is always available and the data set of removed individuals {r l,k } K−1 k=0 , l = 1, . . ., m, is available only in the scenario (b) defined above; those data sets denote the measurements of infectious and removed for province l at discrete time t = k∆, k = 0, ..., K − 1, where ∆ is usually a 1-day interval.We assume i l,0 ≥ 1 for any province l, i.e., there is already one infectious in each province at the initial time.This is necessary to obtain meaningful random paths, since the condition I = 0 characterizes the absorbing states of the CTMC, preventing the occurrence of both infections and removal events (since I = 0 implies w 1 = w 2 = 0).We also assume r l,0 = 0 for any province l, which is compatible with the early spreading of the epidemic.
Statistical moments (up to the second order) at each time are readily computed from data as For the following developments, we also define the quantities , which are the explicit moment Expressions ( 4) and ( 7) computed at general time t = k∆ and starting from the initial conditions I 0 , R 0 , I 2 0 , IR 0 , R 2 0 , where we set R 0 = IR 0 = R 2 0 = 0, to make the theoretical moment Expressions ( 4) and ( 7) consistent with the data assumption r l,0 = 0, ∀l.In this scenario, we assume that only the expressions of I k and I 2 k from ( 11) and ( 12) are known.We can define the following running costs, accounting for the Euclidean distance between statistical and theoretical first-and second-order moments of the infectious at each time k = 0, ..., K −1: where we highlighted the dependence of J 1 k and J 2 k on the parameters and on the initial conditions to estimate, i.e., I 0 and I 2 0 .

Scenario (b): Parameter Estimation from Infectious and Removed Data
In this scenario, all the statistical moment expressions in ( 11) and ( 12) are known, so the running costs at all times k will also account for the first-and second-order moments of the removed and for the joint second-order moment of infectious and removed: Note that in spite of the different information set with respect to the scenario (a), the Expressions ( 14) and ( 15) of the running costs in the scenario (b) depend on the same parameters, since only the non-zero moment initial conditions I 0 and I 2 0 need to be estimated.

The Optimization Problem
The goal of the identification step is to minimize the following cost, which is a weighted normalized version of the root mean square error: where α ∈ [0, 1] weighs the relative importance of tracking effectively first-order with respect to second-order moments, and the running costs are defined according to the chosen scenario, i.e., Equation (13) for scenario (a) and Equations ( 14) and (15) for scenario (b).Looking at the family of functions defined by ( 16), the standard approach consists of considering a highly asymmetrical choice, exploiting only first-order moments (α = 1).Instead, we opt for an innovative symmetrical identification approach which exploits both moments (first and second order), choosing α = 0.5 that symmetrically poses in between with respect to the use of them.In other words, the proposed estimation procedure is based on a loss function that symmetrically balances the information content provided by firstand second-order moments.

Results and Discussion
The identification method described in the previous section has been applied to an ideal setting, where data have been generated according to the stochastic CTMC model ( 2).This is a substantial difference with the work [22], where methods have been applied to the real data of the COVID-19 pandemic in Italy, because the goal of the present work is a more rigorous validation of the identification method in a general epidemic scenario.
In the following simulations, we consider a region with a possibly variable number of provinces of random sizes (N i , i = 1, . . ., m), which are independently sampled according to a discrete uniform distribution in the interval [25 . . . 10 4 ; 75 . . . 10 4 ], where we denoted by [a; b] := [a, b] ∩ N an interval on the line of natural numbers.This results in random sizes which are comparable to many Italian provinces.Similarly, the initial infectious are independently sampled according to a discrete uniform distribution in the interval [1; I 0,max ], where we set I 0,max = 10.Each random path associated to a province of the simulated region is built according to the Stochastic Sampling Algorithm (SSA) [29].
We choose an observation interval of K = 14 days, which is a period compatible with the assumption of early-stage epidemics (see also [8]), inducing an approximate exponential increase of the number of infectious, by virtue of the simplifying assumption S ≈ N.
For each of the two scenarios described in the previous section, we will consider two cases: (1) Best fitting limited to the first-order moments, corresponding to the case α = 1 in the index J α (β, γ, I 0 , I 2 0 ) in ( 16); this case (optimization of J 1 (β, γ, I 0 , I 2 0 )) will be shortly referred to as best first-order fit.
(2) Best fitting of a balanced combination of first-and second-order moments corresponding to the case α = 0.5 in the index J α (β, γ, I 0 , I 2 0 ) in ( 16); this case (optimization of J 0.5 (β, γ, I 0 , I 2 0 )) will be shortly referred to as best second-order fit, in the sense that it exploits (in a balanced way) the moments up to the second order.
All the simulations have been performed in the MATLAB ® environment, exploiting the function lsqcurvefit" (Optimization Toolbox).

Scenario (a): Parameter Estimation from Infectious Data
In scenario (a) of parameter estimation from infectious data, we consider M = 100 simulations of a region with m = 5 provinces, with each simulation associated to a point of the grid (β, γ) ∈ {(0.05k, 0.005j) : k, j ∈ [1; 10])}.
The simulation results show that in this scenario, the estimation of (β, γ) obtained by the best second-order fitting procedure (J 0.5 minimization) outperforms the analogous estimation achieved by the best first-order fitting procedure (J 1 minimization).In particular, if reasonable upper bounds of the parameters are not provided to the numerical optimizer, the best first-order procedure returns estimated values that can be arbitrarily large in general.Instead, the estimation of the difference (β − γ) is quite accurate in both cases.
The described behavior can be readily justified by looking at the expression of I(t) and I 2 (t) from ( 4) and (7), respectively: As a matter of fact, the first-order moment expression (18) only depends on the difference (β − γ), which makes the two parameters not individually identifiable by using only this equation.On the other hand, although the second-order moment expression in (19) formally depends on both (β − γ) and (β + γ), the actual value of I 2 (t) is scarcely affected by the sum (β + γ), multiplying the value I(0) , while it is much more sensitive to the difference (β − γ), which affects the growth rate of the exponential terms.This leads to a rather large relative error (more than 30%, on average, in our simulations), even in the best second-order fit of parameter γ.
As an example, the fitting results for the realization related to the "true values" (β, γ) = (0.3, 0.03) are shown in Figure 1.The figure shows the comparison between the sample moments (blue circles), which are obtained by applying Equations ( 11) and (12) to the raw data, and the model predictions (lines), obtained from Equations ( 4) and ( 7), using the parameter values estimated with α = 1 (red dashed line) or α = 0.5 (yellow solid line).As expected, only the best second-order fit (yellow solid line) is able to track both the sequence of first-and second-order data, I k and I 2 k , respectively.In this simulation, the estimates of the parameter pair are ( β1 , γ1 ) = (1.695,1.402) for the best first-order fit and ( β2 , γ2 ) = (0.262, 0.022) for the best second-order fit.In scenario (b) of parameter estimation from infectious and removed data, we consider a more extensive simulation setup, namely M = 100 simulations in two distinct cases of a region with m = 5 and m = 10 provinces, and the same grid of scenario (a).The results are summarized in Table 1, evaluated by means of the mean square error (MSE) of the best first-and second-order fit and by the length of the corresponding estimated 95% confidence intervals (ci).Confidence intervals have been estimated by means of the Matlab function nlparci", exploiting the Jacobian matrix computed numerically by function lsqcurvefit" and evaluated at the estimation point.Overall, the first remark is that the two parameters β and γ are now individually identifiable, since the complete sets of theoretical moments (4), (7) and statistical moments (11) and ( 12) are now exploited.The two sub-cases do not exhibit particular qualitative differences in the estimation when varying the number of provinces m, while from the quantitative viewpoint, there is a net reduction of the MSE in the case m = 10 with respect to the case m = 5, which is probably due to the double number of samples in the case m = 10.
By keeping fixed the number of provinces, the parameter γ is fitted better (lower MSE) by the second-order procedure (J 0.5 minimization) with respect to the first-order one (J 1 minimization).A possible explanation for such behavior is that the second-order moment equation in (6) introduces another dynamics R 2 (in addition to R(t) ) depending on γ only, which is probably useful for an improved estimation of this parameter.
Initially, the parameter β is fitted better, on average, by the first-order procedure.This drawback of the second-order procedure is compensated by a higher robustness of the estimation of both β and γ, expressed in terms of the 95% confidence intervals, whose length is substantially reduced, with a larger reduction in the case of larger m.Notice that a higher confidence level of the estimated parameters is crucial when applying the identification procedure to real settings, where the true parameter values are unknown.
As an example, the fitting results for the realization (β, γ) = (0.5, 0.03) are shown in Figures 2 and 3.As expected and symmetrically to the analogous simulations related to the scenario (a), only the best second-order fit (yellow solid line) is able to track accurately both the sequence of first-and second-order data.

Conclusions
This paper proposes a stochastic approach for the SIR modeling and identification, which is aimed at increasing the information content carried out by raw epidemiological data.The identification method is based on a modeling framework describing new infections and removals as discrete stochastic events and it assumes the epidemiological data related to the provinces of a given region as independent random paths drawn from the same Continuous-Time Markov Chain describing the epidemics spread in a given region.The estimation procedure is based on the building of a loss function that symmetrically weighs first-order and second-order moments, differently from the highly asymmetrical choice of standard approaches that exploit only first-order moments.
The numerical results show that when the infectious data are the only data available, adding the information content of the second-order moment to the first-order one is crucial to sensibly improve the parameter estimation, allowing one to obtain finite estimation errors attempting to separately estimate the relative infection and removal rates.Indeed, the first-order moment expression of the number of infected patients only depends on the difference between the model parameters, definitely making them not individually identifiable from the mean value of infected.
In the scenario where more information is available, namely both data of infectious and removed, the identification of the single parameters is always meaningful and benefits from a possible larger number of experimental samples, as happens when more provinces are considered.The removal rate is better estimated by the second-order procedure, differently from the infection rate, whose estimate becomes worse (on average).However, the model trajectories seem to be scarcely sensitive to this estimation error, since both first and second-order experimental data are accurately reproduced by the second-order fitting.Furthermore, confidence intervals of both parameters are also reduced in this case, which suggest higher reliability of the second-order method in real scenarios.
In conclusion, in view of the preliminary results shown in this paper and highlighting the potential of the approach, further investigation will be devoted to the topic of parameter estimation from real data exploiting information deriving from higher-order moment equations.This will be the object of future work.

Figure 1 .
Figure 1.Scenario (a): estimation from infectious data.Fitting of the first-order moment (top panel) and of the second-order moment (bottom panel) of the infectious individuals from a region including m = 5 provinces, in the case (β, γ) = (0.3, 0.03): CTMC data (blue circles), best first-order fit (red dashed line), best second-order fit (yellow solid line).

Figure 2 .
Figure 2. Scenario (b): estimation from infectious and removed data.Fitting of the first-order moments of the infectious individuals (top panel) and of the removed individuals (bottom panel) from a region including m = 5 provinces, in the case (β, γ) = (0.5, 0.03): CTMC data (blue circles), best first-order fit (red dashed line), best second-order fit (yellow solid line).

Figure 3 .
Figure 3. Fitting of the second-order moments of the infectious-removed population from a region including m = 5 provinces, in the case (β, γ) = (0.5, 0.03): CTMC data (blue circles), best first-order fit (red dashed line), best second-order fit (yellow solid line).

Table 1 .
Scenario (b).Estimation results evaluated in terms of mean square error (MSE) of the best first-order fit and second-order fit and by the length of the corresponding estimated 95% Confidence Intervals (CI), with m = 5, 10 provinces.