2.1. Data on COVID-19 in Italy
Data on daily new cases of COVID-19 for each province have been made available by the Italian Presidency of the Council of Ministers-Department of Civil Protection Agency (CPA), since the very beginning of the outbreak in Italy (Official data are available at
https://github.com/pcm-dpc/COVID-19, accessed on 31 May 2021). No data on deaths or recovered patients were provided at the provincial level. Therefore, we integrated the official data on new cases with data on new provincial deaths derived from press conferences and reports published online by regional authorities or local newspapers. For many Italian provinces, death data were acquired from the daily press conferences and COVID-19 bulletins of the regional authorities.
The regions for which we could not obtain provincial death data from official bulletins or press conferences were Lombardy and Campania. However, for the Cremona province in Lombardy, we obtained data on COVID-19 deaths from local newspapers.
Table 1 contains all the sources that we used to retrieve provincial COVID-19 death data.
Another time series not available from official repositories at the provincial level was the number of recovered people at time $t$. Therefore, the series was estimated using the recovery rate at the regional level, computed as the ratio of recovered people to the total number of cases each day. The regional recovery rate at time $t$ was then multiplied by the total number of cases in each province within the region:
$$R_{p}(t) = \frac{R_{r}(t)}{T_{r}(t)}\, T_{p}(t),$$
where $R$ is the total number of recovered individuals, $T$ represents the total number of cases, and the subscripts $p$ and $r$ denote the province and the region it belongs to. We estimated this number proportionally to the regional one because patient treatment for COVID-19 can be considered more uniform across the provinces of a region (with almost the same recovery rate across provinces within the region) than the number of deaths. Treatment protocols were adopted uniformly in each province within each region.
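The proportional estimate described above can be sketched in a few lines of Python. This is only an illustration: the function name, the argument names, and the handling of days with zero regional cases are assumptions, not part of the original procedure.

```python
def estimate_provincial_recovered(regional_recovered, regional_total_cases,
                                  provincial_total_cases):
    """Estimate the recovered series of a province from regional data.

    Each argument is a list with one cumulative value per day.
    The regional recovery rate R_r(t) / T_r(t) is applied to the
    provincial total cases T_p(t).
    """
    estimates = []
    for r_reg, t_reg, t_prov in zip(regional_recovered,
                                    regional_total_cases,
                                    provincial_total_cases):
        # Assumption: if the region has no cases yet, the rate is set to zero.
        rate = r_reg / t_reg if t_reg > 0 else 0.0
        estimates.append(rate * t_prov)
    return estimates
```

With a regional rate of 50/200 on a given day and 40 total cases in the province, the sketch returns 10 estimated recovered individuals for that day.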
Another data issue that researchers have faced in dealing with COVID-19 forecasting in Italy is that official data often present flaws, mainly related to delays in reporting new cases and deaths, missing data, and negative values in the series of new cases caused by post-event recounts. Menchetti and Noirjean [
21] reported widely on the biases of these official data. Bartoszek et al. [
22] highlighted that statistics reported at a specific spatial level (national, regional, etc.) in Italy do not say much about the dynamics of the disease at lower levels. The problem of unreliable data becomes even more pressing with epidemiological models, both deterministic and stochastic, in which many parameters must be estimated from such data, especially for long-range estimates, which are all the more critical for an outbreak with consequences as dramatic as those the whole world is experiencing. This inevitably results in less robust estimates.
Figure 1 reports on the differences between the sum of provincial COVID-19 cause-specific deaths and the related total regional deaths for each region, for which we have retrieved the data from regional authorities from 10 September 2020 to 28 February 2021. For a few regions, differences are often related to a delay between the publication in press conferences and the reporting in the official CPA data repository. This is an issue also experienced in other countries during the pandemic (see [
23] for the number of death adjustments in the United Kingdom).
Figure 1 clearly shows that, for four regions out of 15, there were some discrepancies with respect to the official CPA data. These are mainly due to recounting and to deaths counted as “from other regions”, i.e., people who died in a region other than that of their residence, as reported in the CPA data.
However, our choice to work with death data published by regional authorities and local newspapers improved the quality of the data: the reporting was more up to date than the regional data published by the CPA, and it avoided the “recounting problem”, which sometimes introduced peaks due to reporting delays into the regional time series that are not present in the provincial time series.
Table 2 and
Table 3 better explain the two problems.
In
Table 2, the numbers of daily deaths for each province in the Marche region are displayed from 1 April 2021 to 4 April 2021. At the bottom of the table, the total deaths from provincial deaths and the deaths reported from the CPA data set are displayed. This is a typical case of a region where there is no “from other regions” issue present, but the daily CPA underreporting of deaths and the huge death recounts done from time to time are of great magnitude. Note, in fact, that on April 1st, 2nd and 4th, deaths were always underreported, whereas on April 3rd, a huge recount from previous days was done. On the other hand,
Table 3 is an example of a reverse issue: differences between the sum of provincial deaths and the number of regional deaths from the CPA data set are always due to the “from other regions” reporting. From these two examples, it is clear that the regional death time series is strongly affected by this recounting or delay in reporting, with unjustified peaks created from time to time, and therefore an approach taking into consideration provincial data acquired from regional bulletins and local newspapers is more appropriate, at least in the case of the Italian COVID-19 data.
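The comparison behind these two tables amounts to subtracting the regional daily deaths from the sum of the provincial daily deaths. The following sketch shows that arithmetic; the function name and the numbers in the usage note are hypothetical and are not taken from Table 2 or Table 3.

```python
def regional_discrepancy(provincial_deaths, regional_deaths):
    """Daily difference between summed provincial deaths and regional deaths.

    provincial_deaths: dict mapping province name -> list of daily deaths.
    regional_deaths: list of daily deaths reported for the whole region.
    A positive value means the regional series under-reports that day;
    a large negative value is the signature of a later recount.
    """
    n_days = len(regional_deaths)
    provincial_sum = [
        sum(series[day] for series in provincial_deaths.values())
        for day in range(n_days)
    ]
    return [p - r for p, r in zip(provincial_sum, regional_deaths)]
```

For instance, with two hypothetical provinces reporting `[3, 2, 1, 4]` and `[1, 0, 2, 1]` deaths and a regional series of `[2, 1, 9, 2]`, the differences are positive on three days and strongly negative on the recount day, while summing to zero over the four days.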
2.2. An Adjusted Time-Dependent SIRD Model
The SIRD model is a compartmental model used in epidemiology to describe the spread of a disease [
6,
24,
25]. The model divides the population into four different groups: susceptible, infected, recovered, and deceased. This kind of design is appropriate when the disease of interest respects the following two assumptions: infected individuals can propagate the infection; recovered individuals receive longstanding immunity. The COVID-19 pandemic respects the first assumption, and some preliminary studies show that recovered individuals receive at least short-term immunity. There are other essential assumptions concerning the population (and therefore the number of susceptible people). First, its size is considered fixed (in our case, we considered Italy, which was affected by movement restrictions, sometimes with people not leaving their region). Second, individuals are identical to one another (i.e., demographic factors or different health conditions are not considered). Finally, we do not consider the effect of the vaccination campaign, as, for the period considered, it was still at the first stages in Italy. In
Figure 2, a schematic of the compartments and flows forming the model is shown.
The SIRD model is based on four variables, $S(t)$, $I(t)$, $R(t)$, and $D(t)$, which are, respectively, the numbers of susceptible people (at the beginning of the period considered for the time series, taken from the Italian National Statistical Institute data warehouse: http://dati.istat.it/, accessed on 30 April 2020), currently infected people, recovered people, and deaths at time $t$. The size of the population, $n$, is given by the sum of these four variables. The model’s parameters are the transmission rate, the recovery rate, and the mortality rate, represented by $\beta$, $\gamma$, and $\mu$, respectively. Being rates, their reciprocals can be interpreted as average times: $1/\beta$ is the average time between effective contagious contacts, and $1/(\gamma + \mu)$ is the average time before removal from the infectious class.
Another important parameter that summarizes the spread of an outbreak is the basic reproduction number, $R_0$, which is computed as the ratio between the transmission rate and the sum of the recovery and mortality rates. Furthermore, $R_0$ represents the expected number of individuals directly infected by one infected individual, in a population where everyone is susceptible to infection. If $R_0$ is less than 1, the epidemic will eventually be controlled. If it is larger than 1, the transmission of the disease will increase in the population. The formula for $R_0$ is given by:
$$R_0 = \frac{\beta}{\gamma + \mu}.$$
Building on Chen et al.’s work [
26], in this paper, a time-dependent model is proposed so that the parameters are free to change over time. This kind of model is chosen because, in Italy and various other countries facing the virus, containment measures have been adopted and tightened over time. In particular, a national lockdown was introduced in Italy on 11 March 2020 and lasted until 4 May 2020. Other pandemic containment measures were taken later on. By allowing the parameters, especially the effective transmission rate, to vary over time, control measures can be partly accounted for in the model. In addition, the recovery and mortality rates are likely to depend on the pressure that hospitals and, in particular, intensive care units are under, which increases sharply at the beginning of a pandemic (i.e., when a high mortality rate is reported) and then relaxes after the health system capacity is enhanced.
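The differential equations governing the standard deterministic SIRD model are the following:
$$\begin{aligned}
\frac{dS(t)}{dt} &= -\frac{\beta\, S(t)\, I(t)}{n},\\
\frac{dI(t)}{dt} &= \frac{\beta\, S(t)\, I(t)}{n} - \gamma\, I(t) - \mu\, I(t),\\
\frac{dR(t)}{dt} &= \gamma\, I(t),\\
\frac{dD(t)}{dt} &= \mu\, I(t),
\end{aligned}$$
subject to the constraint $S(t) + I(t) + R(t) + D(t) = n$, since we are neglecting the effects of new births and of people dying from causes not related to COVID-19. Note that, because of this constraint, one of the previous equations in the SIRD model can be derived from the other ones, and can be omitted.
We consider the day as the unit of time, and we transform the previous system of ordinary differential equations into a discrete-time system of difference equations, using $\Delta t = 1$ and applying a forward finite-difference scheme, which results in the following:
$$\begin{aligned}
S(t+1) - S(t) &= -\frac{\beta(t)\, S(t)\, I(t)}{n},\\
I(t+1) - I(t) &= \frac{\beta(t)\, S(t)\, I(t)}{n} - \gamma(t)\, I(t) - \mu(t)\, I(t),\\
R(t+1) - R(t) &= \gamma(t)\, I(t),\\
D(t+1) - D(t) &= \mu(t)\, I(t).
\end{aligned}$$
From the records of the four variables of interest in a specific province, the evolution of each parameter can be retrieved using the equations above, as follows:
$$\beta(t) = n\, \frac{S(t) - S(t+1)}{S(t)\, I(t)}, \qquad \gamma(t) = \frac{R(t+1) - R(t)}{I(t)}, \qquad \mu(t) = \frac{D(t+1) - D(t)}{I(t)}.$$
The observed time series of $S(t)$, $I(t)$, $R(t)$, and $D(t)$ are then used to estimate the daily values of the transmission rate, recovery rate, and mortality rate and to predict future values. Because of the stochasticity of the time series, the daily estimate of the three driving parameters is itself stochastic, and we then need to smooth or filter out the noise of the estimates in order to obtain robust predictions.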
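As an illustration of how the daily rates can be retrieved from the observed series, the sketch below inverts the forward-difference form of the standard SIRD equations. The function names are hypothetical, and returning NaN on days with no infected or no susceptible individuals is an assumption made here to keep the example well defined.

```python
def estimate_daily_rates(S, I, R, D, n):
    """Recover daily transmission, recovery and mortality rates.

    S, I, R, D are lists of daily compartment sizes and n is the
    population size. The rate for day t uses the change between
    day t and day t + 1, so each output has len(I) - 1 entries.
    """
    nan = float("nan")
    beta, gamma, mu = [], [], []
    for t in range(len(I) - 1):
        if I[t] <= 0 or S[t] <= 0:
            # Assumption: rates are undefined without infected/susceptible people.
            beta.append(nan)
            gamma.append(nan)
            mu.append(nan)
            continue
        beta.append(n * (S[t] - S[t + 1]) / (S[t] * I[t]))
        gamma.append((R[t + 1] - R[t]) / I[t])
        mu.append((D[t + 1] - D[t]) / I[t])
    return beta, gamma, mu


def basic_reproduction_number(beta, gamma, mu):
    """Ratio of the transmission rate to the sum of recovery and mortality rates."""
    return beta / (gamma + mu)
```

The daily estimates obtained this way are noisy, which is why a smoothing step is needed before they are used for prediction.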
We do not assume any specific underlying model for the noise, and, in order to make our predictions more robust, we decided to apply a finite impulse response (FIR) filter. The following equations describe the regression model used in the FIR filter for each parameter $\theta \in \{\beta, \gamma, \mu\}$ and the cost function to be minimized in order to find the optimal coefficients:
$$\hat{\theta}(t) = a_0 + \sum_{j=1}^{J} a_j\, \theta(t-j),$$
$$\min_{a_0, \dots, a_J}\; \sum_{t} \left(\theta(t) - \hat{\theta}(t)\right)^2 + \lambda \sum_{j=1}^{J} a_j^2,$$
where $a_0$ and $a_j$ are the usual intercept and regression coefficient parameters. $J$ was set to 14, since a change in the course of the pandemic is likely to become visible in the data after 14 days, which is also the quarantine period used in Italy.
The FIR filter requires a single hyper-parameter ($J$ in the above formula), which represents the maximum number of lag days to include in the regression. The cost function consists of a regularized least-squares method in which the penalty ($\lambda$) is applied to the sum of squares of the regression coefficients (Ridge regression regularization, based on an $L_2$ norm [27]). A different penalty $\lambda$ has been used in the ridge regression to estimate each parameter, i.e., the transmission, recovery, and death rates. The value of $\lambda$ for each parameter has been obtained using cross-validation. Therefore, the resulting overall model is such that its parameters are time-dependent, and their dynamics are modeled via penalized regressions, whereby the parameters at a given time are regressed on their previous lagged values.
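A minimal sketch of a ridge-regularized autoregression of this kind is given below, using NumPy and the closed-form normal equations. The function names are hypothetical, and leaving the intercept unpenalized is an assumption of this sketch rather than a detail stated in the text.

```python
import numpy as np


def fit_fir_ridge(series, n_lags, penalty):
    """Fit theta(t) ~ a0 + sum_j a_j * theta(t - j) with an L2 penalty.

    Returns an array [a0, a1, ..., a_J]. The series must contain more
    than n_lags observations and penalty must be positive.
    """
    y = np.asarray(series, dtype=float)
    # Row for time t holds the lagged values [y[t-1], ..., y[t-n_lags]].
    lagged = np.array([y[t - n_lags:t][::-1] for t in range(n_lags, len(y))])
    X = np.column_stack([np.ones(len(lagged)), lagged])
    target = y[n_lags:]
    P = penalty * np.eye(n_lags + 1)
    P[0, 0] = 0.0  # assumption: the intercept is not penalized
    return np.linalg.solve(X.T @ X + P, X.T @ target)


def predict_next(series, coef):
    """One-step-ahead prediction from the most recent n_lags values."""
    n_lags = len(coef) - 1
    recent = np.asarray(series[-n_lags:], dtype=float)[::-1]
    return float(coef[0] + coef[1:] @ recent)
```

In this form, one such regression would be fitted separately to each of the three rate series, each with its own penalty value, and multi-step forecasts would be obtained by appending each prediction to the series and predicting again.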
The SIRD models suffer from some drawbacks in periods of fast pandemic spread or contraction (they tend to overestimate when there is a rapid increase of the infections and to underestimate when there is a sudden decrease, see [
28,
29]). In order to account for the effect of the social distancing policies adopted in Italy, the transmission parameters are multiplied day by day by a factor that depends on the national COVID-19 stringency index in that period, as computed according to [
30]. We present model evaluation results with and without considering the stringency index in
Section 3.1.
Once the model has been trained using historical data, future predictions can be made on the parameters and, therefore, estimates for the evolution of S, I, R, and D can be computed using the SIRD model equations.
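The forecasting step can be sketched as a simple forward iteration of the discrete-time SIRD equations, driven by the predicted daily rates. The function name and the tuple layout of the state are assumptions made for illustration.

```python
def forecast_sird(state, betas, gammas, mus, n):
    """Roll the discrete SIRD equations forward with predicted rates.

    state: tuple (S, I, R, D) at the forecast origin.
    betas, gammas, mus: predicted daily rates, one value per forecast day.
    n: population size.
    Returns a list of (S, I, R, D) tuples, one per forecast day.
    """
    S, I, R, D = state
    path = []
    for beta, gamma, mu in zip(betas, gammas, mus):
        new_infected = beta * S * I / n
        new_recovered = gamma * I
        new_deaths = mu * I
        S = S - new_infected
        I = I + new_infected - new_recovered - new_deaths
        R = R + new_recovered
        D = D + new_deaths
        path.append((S, I, R, D))
    return path
```

Because every flow leaving one compartment enters another, the sum of the four compartments stays equal to the population size at every step, which gives a quick sanity check on any forecast.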
2.3. Adjusted Training Process
Our model aims to make predictions about the evolution of the COVID-19 outbreak in Italy at the local level, particularly using historical data on each province. The model presented here differs from [
26], in that the hyperparameters are not considered fixed, but optimized using multiple approaches, and it is composed of three different autoregressions, each based on one of the SIRD model’s parameters. Each of the regressions requires the choice of the penalty value for the regularization process ($\lambda$). In our approach, the regularization parameters were free to vary (within a range from
to
with step
for the powers), and cross-validation was employed to find their optimal values [
31].
However, the deterministic model described above might struggle to produce reliable estimates in contexts where the number of cases or deaths is very low and there is considerable fluctuation or inconsistency in the data, as happens in some provinces where the outbreak is not so intense (see
Figure 3, where the heterogeneity of the outbreak is clearly shown at a provincial level, at least in the first-medium stages of its evolution). In contrast, it seems to give more robust results when data are aggregated at a higher level, as in the entire country’s time series. This happens because the model is based on smoothing the sequential values of the variables that become less precise as the numbers decrease. Therefore, an aggregation approach was used to train the three models, based on the assumption that provinces which ’behaved’ similarly in recent history are more likely to behave similarly in the future.
For a given province and a given forecast origin, three sets containing the most similar provinces with respect to the parameters $\beta$, $\gamma$, and $\mu$ were retrieved. The distance between two series was computed using dynamic time warping (DTW) with the Itakura constraint, which allows for some small temporal shifts between the series [
32]. The similarity was considered only in the last 30 days before the origin, assuming that only the most recent past was of interest. The number of provinces to select for each set was chosen using cross-validation: the first decile, corresponding to 9 provinces, resulted in the minimum prediction error. Such a choice is also reasonable in terms of the problem that we are trying to solve, since it appears to be a good compromise between the overfitting and extreme generalization of the series.
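The neighbor selection can be illustrated with a plain dynamic time warping distance. Note that this sketch deliberately omits the Itakura constraint used in the paper and implements unconstrained DTW instead; the function names and the dictionary layout of the input are also assumptions.

```python
import math


def dtw_distance(a, b):
    """Unconstrained DTW distance between two numeric sequences.

    Simplification: no Itakura parallelogram is applied, so any
    monotone alignment between the two series is allowed.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def most_similar_provinces(target, candidates, k=9, window=30):
    """Return the k candidate names closest to the target series.

    Only the last `window` observations before the forecast origin
    are compared. candidates maps province name -> parameter series.
    """
    reference = target[-window:]
    distances = {
        name: dtw_distance(reference, series[-window:])
        for name, series in candidates.items()
    }
    return sorted(distances, key=distances.get)[:k]
```

The selection would be run three times per province and forecast origin, once for each rate series, so that each of the three regressions is trained on its own set of similar provinces.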
In Figure 4, an example of applying this model to the variables $I$, $R$, and $D$ for the Catania province is shown.
The predicted values for these variables are derived from the predicted parameters $\beta$, $\gamma$, and $\mu$. Together with these estimates, the estimates for the total cases and the new daily cases are also obtained algebraically and displayed in the figures.
Predictions for the parameters and for other provinces, together with other model settings such as the number of lags to be chosen for prediction, can be seen on a dashboard developed for this model, available at
https://ceeds.unimi.it/covid-19-in-italy/ (accessed on 31 May 2021, see also [
33] for a detailed description of this dashboard).
2.4. Bootstrap Prediction Intervals
Because of the numerous issues causing inconsistency in the reported daily data, it is necessary to accompany the point estimates with reasonable confidence intervals. Since the model starts with predicting the disease parameters, the intervals are also computed first on the parameters and later derived for I, R, and D.
In order to build prediction intervals for the model parameters, a block bootstrap algorithm using a stationary version of the time series with blocks of 30 observations was used [
34]. A zero lower bound was imposed on all the parameters to avoid results that contradict the compartmental logic of the SIRD model. The same procedure can be applied to the three regressions of the model, used respectively for $\beta$, $\gamma$, and $\mu$. However, each of the model’s variables depends on the whole set of parameters, so that the obtained error range for a parameter must be combined with the other two parameters’ intervals. Thus, the intervals for the variables are subject to the uncertainty of three different parameters and can be composed in different ways. Combining lower and upper bounds of the parameters can be misleading, since the variables of the SIRD model develop in different directions and, mainly, each variable depends on the past value of $I$, which in turn depends on $\beta$, $\gamma$, and $\mu$. Nevertheless, combinations of interest can be used to describe the epidemic development in particular scenarios. The method proposed here is to use the prediction interval for the parameter of interest and use point estimates for the other two parameters. Thus, the effects of the variability of the parameter can be easily displayed on each variable. Accordingly, when the parameter of interest is
, the following equations are used to compute the prediction intervals for the variables
S,
R,
D and
I, following Equation (
1):
where
and
stand respectively for lower and upper bounds.
Bootstrap intervals for each variable are shown in
Figure 5 for the province of Torino. Note that real values in the prediction windows are always within the confidence bands.
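To give an idea of how block-bootstrap intervals with a zero lower bound can be built for a single parameter, the following sketch resamples blocks of past residuals and adds them to the point forecast. It is a generic moving-block bootstrap with percentile bounds, written under stated assumptions: the residual-based formulation, the function name, the number of replicates, and the percentile rule are illustrative choices and are not claimed to reproduce the exact algorithm used in the paper.

```python
import random


def block_bootstrap_interval(residuals, point_forecast, block_size=30,
                             n_boot=1000, level=0.95, seed=0):
    """Percentile prediction bounds from resampled residual blocks.

    residuals: past (assumed stationary) forecast errors of one parameter.
    point_forecast: list of predicted values over the horizon.
    Simulated values are clipped at zero (zero lower bound).
    """
    max_start = len(residuals) - block_size
    if max_start < 0:
        raise ValueError("need at least block_size residuals")
    rng = random.Random(seed)
    horizon = len(point_forecast)
    blocks_needed = -(-horizon // block_size)  # ceiling division
    simulations = []
    for _ in range(n_boot):
        errors = []
        for _ in range(blocks_needed):
            start = rng.randint(0, max_start)
            errors.extend(residuals[start:start + block_size])
        simulations.append([max(0.0, f + e)
                            for f, e in zip(point_forecast, errors)])
    lo_idx = int((1 - level) / 2 * (n_boot - 1))
    hi_idx = int((1 + level) / 2 * (n_boot - 1))
    lower, upper = [], []
    for h in range(horizon):
        values = sorted(sim[h] for sim in simulations)
        lower.append(values[lo_idx])
        upper.append(values[hi_idx])
    return lower, upper
```

Following the approach described above, the lower and upper paths of one parameter would then be fed into the SIRD equations together with the point estimates of the other two parameters to obtain bands for the variables.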
2.5. Model Extensions
In the previous sections, each province had its specific model trained using a cluster formed by the most similar provinces in terms of the parameters. The distance between provinces was used assuming that provinces with similar behavior in the near past will continue to have that behavior in the future, despite the geographical distance. The spatial structure was not considered, as it was assumed that neighboring provinces did not affect one another.
In practice, however, neighboring provinces are likely to influence one another in many ways. Movements between bordering provinces are more likely to happen than between non-bordering ones. For example, commuters (i.e., potential spreaders) are more likely to work or study in a neighboring province, thus increasing the probability of exporting the virus to (or importing it from) nearby territories. Moreover, during the period of emergency, residents of different provinces shared medical resources, such as hospitals, health care workforce, equipment, and test capacity, thus affecting several different parameters regarding the spread of the epidemic, such as the reported rate of transmission and the reported number of COVID-19-related deaths.
Given these considerations, a spatial model is also implemented. Its results are presented in
Section 3 as a possible alternative to the original strategy for spatial effects. The model performance is not considered in detail, as the focus of this article is the methodology presented in
Section 2.2 and
Section 2.3, while the spatial model only represents one of the possible extensions to the original model that could be developed when less restrictive assumptions are made. Other potential models may consider the hierarchical structure of the national, regional, and provincial levels, or could drop the assumption of independence between the three parameters $\beta$, $\gamma$, and $\mu$.