An Extended SEIR Model with Vaccination for Forecasting the COVID-19 Pandemic in Saudi Arabia Using an Ensemble Kalman Filter

In this paper, an extended SEIR model with a vaccination compartment is proposed to simulate the novel coronavirus disease (COVID-19) spread in Saudi Arabia. The model considers seven stages of infection: susceptible (S), exposed (E), infectious (I), quarantined (Q), recovered (R), deaths (D), and vaccinated (V). Initially, a mathematical analysis is carried out to illustrate the non-negativity, boundedness, epidemic equilibrium, existence, and uniqueness of the endemic equilibrium, and the basic reproduction number of the proposed model. Such numerical models can be, however, subject to various sources of uncertainties, due to an imperfect description of the biological processes governing the disease spread, which may strongly limit their forecasting skills. A data assimilation method, mainly, the ensemble Kalman filter (EnKF), is then used to constrain the model outputs and its parameters with available data. We conduct joint state-parameters estimation experiments assimilating daily data into the proposed model using the EnKF in order to enhance the model’s forecasting skills. Starting from the estimated set of model parameters, we then conduct short-term predictions in order to assess the predicability range of the model. We apply the proposed assimilation system on real data sets from Saudi Arabia. The numerical results demonstrate the capability of the proposed model in achieving accurate prediction of the epidemic development up to two-week time scales. Finally, we investigate the effect of vaccination on the spread of the pandemic.


Introduction
The novel coronavirus disease (COVID-19) first appeared in Wuhan city, and due to its high transmission rate, the virus has been rapidly spreading all over the world. On 8 March 2020, the WHO announced COVID-19 as a global pandemic [1]. As of 1 January 2021, there are over 100 million reported cases, and a death toll exceeding 2 million.
In the absence of a ready-to-use vaccine, and besides medical and biological research, mathematical models can play an important role in understanding and predicting disease transmission. Moreover, it helps to implement appropriate measures and efficient strategies to control the pandemic's spread and mitigate its impact.
Mathematical epidemic models based on the SIR model [2] are widely used to study the spread of a disease in a population [3][4][5][6][7]. Recently, several mathematical models have been developed to study the transmission dynamics of COVID-19 [8][9][10][11]. However, these models suffer from various sources of uncertainties, due to the incomplete description of the biological processes governing the disease spread, and also due to some involved parameters being poorly known. One way to mitigate these uncertainties is to constrain epidemic forecasting models with available data. These data can be combined with the model output to improve its prediction and reduce uncertainties [12][13][14]. This process, known as data assimilation, sequentially adjusts the model state and parameters once data become available to keep the model as close as possible to the real track [15].
The ensemble Kalman filter (EnKF) is a popular data assimilation technique introduced by [16] to tackle large-scale, sequential, and non-linear problems. The EnKF is a Monte Carlo implementation of the traditional Kalman filter [17], operating in cycles of two steps: (i) A forecast step in which a set of state samples, or ensemble members, are integrated forward in time with the model, and (ii) an analysis step to update the forecasted members with available data based on the Bayes rule in a Gaussian framework [15,16]. Because of its affordable computational cost, robust performance, nonintrusive formulation, and ease of implementation, the EnKF have been successfully implemented in many geophysical applications, including weather forecasting [18][19][20], hydrology and reservoir applications [21][22][23][24][25][26], ocean modeling [27][28][29][30][31], and flood modeling [32][33][34].
In epidemiology, previous studies have used data assimilation in the context of both variational and Kalman filter methods [35][36][37][38]. The Reference [39] used the EnKF to estimate the parameters of a stochastic SEIR model. The Reference [40] applied a deterministic variant of the EnKF, the Ensemble Adjustment Kalman Filter (EAKF), in combination with a network of SEIR models, simulating different connected cities. The Reference [41] used an ensemble smoother with a multiple data assimilation (ESMDA) approach for parameter estimation in a SEIR model. The Reference [42] also implemented the ESMDA to effectively estimate the parameters of a more sophisticated SEIR model accounting for age-classes and compartments of the sick, hospitalized, and dead. The Reference [43] used an EnKF approach for parameter estimation and shot-term forecasts with a simple SIR model. The Reference [44] proposed an EnKF for estimating unmeasurable state variables and unknown parameters of a SIRV model, which takes into account the circulation of free coronavirus in the environment. In the context of variational data assimilation methods, the Reference [45] performed parameter estimation and predictions using a SIR model, while [46] proposed a modified SEIR model that distinguishes between symptomatic and asymptomatic, and conducted observation sensitivity experiments to identify suitable observing strategies. However, these studies did not consider the impact of vaccination on the pandemic spread.
This work therefore aims to study the impact of vaccination on the COVID-19 spread. An extended SEIR model comprising of seven compartments-susceptible, exposed, infectious, quarantined, recovered, deaths, and vaccinated-is then proposed. First, we conduct a mathematical analysis to illustrate the non-negativity, boundedness, epidemic equilibrium, existence, and uniqueness of the endemic equilibrium, and the basic reproduction number of the model. A Joint-EnKF is implemented for state-parameters estimation by assimilating COVID-19 data into the model. By considering uncertainties in all model states and poorly known parameters, our ensemble data assimilation approach computes updates of both state and parameter ensembles. The estimated set of model state and parameters is then used to assess the model prediction skill. The proposed approach is validated with the cases reported from the most-infected Gulf country, Saudi Arabia, showing a great ability to provide accurate and consistent forecasts up to two weeks. We finally investigate the effect of vaccination on the spread of the pandemic, showing a notable decrease in the number of cases.
The remainder of this paper is organized as follows. Section 2 presents the extended SEIR model and briefly describes the Joint-EnKF algorithm. Section 3 is devoted to the mathematical analysis of the model. The numerical results are presented and discussed in Section 4. Section 5 concludes the study, with a summary of the main results.

Extended SEIR Model
We extend the SEIR model to seven compartments to simulate the epidemic of COVID-19. Seven state variables are considered within a population, that is, S(t), E(t), I(t), Q(t), R(t), D(t), and V(t), denoting the number of susceptible, exposed (infected, but not yet infectious), infectious (not yet quarantined), quarantined (confirmed and infected), recovered, dead, and vaccinated cases, respectively. The disease transmission flow of the proposed model is sketched in Figure 1. The model is then governed by the following set of nonlinear ordinary differential equations: with non-negative initial conditions S(0) = S 0 , E(0) = E 0 , I(0) = I 0 , Q(0) = Q 0 , R(0) = R 0 , D(0) = D 0 , and V(0) = V 0 . The coefficients Λ, β, α, µ, γ, σ, δ, κ, λ, and ρ represent the new births and new residents per unit of time, transmission rate divided by N, vaccination rate (rate of people who are vaccinated), natural death rate, average latent time, average quarantine time, mortality rate, average days until recovery, and average days until death, respectively. σ is the vaccine inefficacy (0 ≤ σ ≤ 1). So 1 − σ represents the vaccine efficacy. If σ = 0, the vaccine offers 100% protection against the disease. N = S + E + I + Q + R + D + V is the total population size. Finally, system (1) is solved using a fourth-order Runge-Kutta method.

Ensemble Kalman Filter (EnKF)
Consider the following nonlinear data assimilation problem: where x k is the state vector of the model at time t k , M is the dynamical operator (system (1)) that integrates the model state from time t k to time t k+1 , θ is the parameters vector, y k+1 is the observation vector (active, recovered, and death cases) at time t k+1 , H is the observation operator that projects the state space onto the observation space, and η k+1 and k+1 are model and observation noises assumed to follow Gaussian distributions with zero mean and covariances Q k+1 and R k+1 , respectively. At every forecast step, we integrate the ensemble members with the dynamical model (1). The predicted state mean and the associated error covariance matrix are then estimatedx where N e is the number of ensemble members. The Joint-EnKF consists of estimating the augmented state-parameters vector z k+1 = x T k+1 , θ T at each time t k+1 given the observation y k+1 . The prediction step of the Joint-EnKF becomes: whereM is the augmented nonlinear dynamic operator for the joint state-parameters.
With similar computations as in (4), the joint state-parameters error covariance matrix for the augmented ensemble vectors z f ,i k+1 is given bŷ where X θ denotes the parameter ensemble perturbations. The updated joint state and parameters are then computed using the current observation ensemble y i k+1 : whereH = [H 0] is the augmented observational operator with the zero matrix.

Mathematical Analysis of the COVID-19 Model
In this section, we discuss the boundedness, non-negativity, epidemic equilibrium, endemic equilibrium, and basic reproduction number of the COVID-19 model (1).

Non-Negativity of the Model
and V 0 ≥ 0, then the solutions of system (1) remain non-negative for all t > 0.
Proof of Theorem 1. From the first equation of system (1), we have: Integrating Equation (8), we get Hence, S(t) remains non-negative for all t > 0. Similarly, using the remaining equa-  (1), the growth rate of the total population can be written as

Boundedness of the Model
From Equation (10), we have the following equation: Integrating both sides of Equation (11) gives the solution for N(t) as follows Thus, when t → ∞, we find that Finally, Theorem 1 and Equation (13) give Thus, S(t), E(t), I(t), Q(t), R(t), D(t), and V(t) are bounded and Theorem 2 is proved.

Epidemic Equilibrium and Basic Reproduction Number of the Model
The epidemic equilibrium X 0 of system (1) is obtained by setting all the derivatives to zero with I = 0, that yields to: R 0 is computed using the next-generation matrix concept [47]. Let X = (E, I) T . System (1) can be written as where F(X) = (βSI + σβIV, 0) T and W(X) = ((µ + δ)E, −γE + (µ + δ)) T . Their corresponding Jacobian matrices at the disease-free equilibrium are According to [47], the basic reproduction number R 0 is the spectral radius of the next-generation matrix FW −1 . That gives Substituting S 0 and V 0 from Equation (15) into Equation (20) yields the following formula for the basic reproduction number of system (1): used to measure the transmission potential of a disease. It is the number of secondary cases or new infections produced by a single infectious individual in a completely susceptible population [47,48]. An epidemic is expected to increase exponentially if R 0 > 1, and to end if R 0 < 1 [47].
Theorem 3. The disease-free equilibrium X 0 is locally asymptotically stable if R 0 < 1, and unstable if R 0 > 1.

Proof of Theorem 3.
System (1) can be written in a simplified form by eliminating the sixth equation for the death compartment (D). This is due to the fact that the state D only appears in the corresponding differential equation and has no significance on the overall system. The number of deaths can be determined from The Jacobian matrix of the reduced model (1) at the disease-free equilibrium is given by The remaining two eigenvalues are when R 0 < 1, we have λ 5 < 0 and λ 6 < 0. Since all the eigenvalues are negative, the disease-free equilibrium X 0 is locally asymptotically stable. If R 0 > 1, we have λ 5 < 0 and λ 6 > 0, and the disease-free equilibrium X 0 is locally asymptotically unstable [49].

Existence and Uniqueness of the Endemic Equilibrium
In this section, we investigate the existence and uniqueness of the endemic equilibrium when R 0 > 1. Let X * = (S * , E * , I * , Q * , D * , V * ) be the endemic equilibrium of model (1). The endemic equilibrium is obtained by setting all the derivatives in model (1) to zero, giving Solving the first, third, and fifth equations of (22) in terms of I * gives Adding the second and sixth equation in (22) gives Now substituting the expressions of S * , E * , and V * in the second equation of (22), we get the following quadratic equation for I * : where It is clear that a 2 is always positive and a 0 is negative when R 0 > 1. Therefore, there is a unique positive value for I * , and consequently, a unique endemic equilibrium X * when R 0 > 1.

Results and Discussion
The proposed modeling system was applied to forecast the COVID-19 pandemic in Saudi Arabia. Data are available from the Saudi Center for Diseases Prevention and Control [50]. These include the number of deaths, as well as recovered and confirmed cases. The total number of deaths, active cases, and recovered cases were assimilated daily, and model parameters were also updated using the Joint-EnKF.

Assimilation Settings
In this section, we cover the period before the vaccine becomes available. Therefore, state and parameters related to vaccination are excluded from the model at this stage. The state vector x to be estimated is then composed of six elements and the parameters to be estimated become The augmented state vector to be estimated is then: where the parameters are in a logarithmic sense in order to avoid negative values in the model. We set the initial numbers of infected and exposed people to I(0) = E(0) = 3, three times the number of accumulated positive tests (Q 0 = 1) provided by March 2nd [50]. The initial numbers of recovered, and death cases are R(0) = D(0) = 0. The total population and the total number of susceptible cases are N(0) = 34,218,000 and S(0) = N(0) − E(0) − I(0) − Q(0) − R(0) − D(0), respectively. We consider the following two periods: the first period concerns the period of the beginning of the pandemic when the intervention had not been applied using a basic reproduction number greater than one, R 0 = 2.5, which gives β 1 = 8.58 × 10 −9 , while the second period corresponds to the period when the intervention was applied by the government until June 30th. During this period, the numbers of new cases are relatively constant. Therefore, we use a basic reproduction number R 0 = 1, which gives β 2 = 3.43 × 10 −9 . To reflect the uncertainty in the transmission rate, we use a standard deviation of 25%. The initial values of the remaining parameters are listed in Table 1, with a standard deviation of 10%. In addition, we assume the observation-error standard deviations to be 5% of the data value.  [42] In order to initialize the assimilation system, we perturb the initial state using a Gaussian noise as follows: where σ ∼ N (0, 1) and η is set to 10%. Similarly, the parameters' ensemble is generated by perturbing the initial values, given in Table 1, using a log-normal distribution.
To evaluate the accuracy of the assimilation system solution, we compute the Relative Mean Absolute Error (RMAE), defined as follows: where y(j) represents the observed value, x(j) is the model simulation, and n is the sample size of the observed data.

Sensitivity to Ensemble Size
We first study the sensitivity of the filter, to the ensemble size, N e . Computing accurate state and parameter estimates with small ensembles is desirable to avoid an important increase in computational time. We run our assimilation system using seven ensemble sizes: N e = 10, 20, 50, 100, 200, 500, and 1000. Figure 2 presents the performance of the EnKF in terms of RMAE for the different ensemble sizes. As expected, increasing the ensemble size reduces the discrepancy between the estimates and the data. A clear improvement over the case with 20 ensemble members is observed. However, increasing the ensemble size beyond 200 members does not notably enhance the performance of the filter, although it dramatically increases the computational cost.  Figure 3 summarizes the assimilation results for the active, recovered, and confirmed cases obtained by the EnKF using 200 ensemble numbers. All results show a good fit between the predicted and observed values, with generally minor differences. The values of the model parameters estimated by the EnKF during the assimilation process are sketched in Figure 4. The parameters do not appear to be dramatically different from the initial values, except for the reproduction number and the recovery time. We observe a sharp increase in the reproduction number between March 1st and 15th. This number remains relatively high, almost R 0 = 4, during the pre-intervention period. During and after the intervention, the value of R 0 remains close to 1. That suggests that the intervention was very fruitful in controlling the pandemic. The recovery time increases significantly during the first two months, peaking at 16.1, and remains constant after that. We also see a slight decrease in the time until death to 13.75. The other parameters show slight deviations from the proposed initial values.

Mar
Apr

Predictability Experiment
To assess the system relevance for making short-term predictions, we daily performed two-week forecasts, starting from 1st July until 17th December (the day where the vaccination campaign started in Saudi Arabia). The estimated state and parameters during the assimilation process were used to conduct these forecasts. The RMAE of the deaths, and recovered and confirmed cases were then computed and plotted in Figure 5. The model performed very well. The values of the RMAE for the forecasted recovered and confirmed cases are always less than 5%, within the margin of the observation errors. Slightly higher RMAE are observed during the month of July for the death cases, reaching 12%. However, the model achieves a lower RMAE in the next few months, with an average of 3%.   Figure 6 shows the final predictions performed on 3rd December for the first two weeks of December. The predicted numbers of deaths, and recovered and confirmed cases are all in very good agreement with the data. While they perfectly match the data during the first 10 days, they tend to slightly underestimate or overestimate the observed values at the end of the second week. Overall, our numerical results demonstrate the capability of the proposed system in achieving an excellent two-week prediction skill.

Dec 03
Dec

Impact of Vaccination
On 17th December, Saudi Arabia started a three-phase COVID-19 vaccination program, thus becoming the first Arab country to roll out the Pfizer-BioNTech vaccine. The first phase targets people aged 65 and above, as well as people most exposed to the disease. The second phase will include people aged over 50, as well as those with specific chronic diseases, such as asthma and diabetes. The rest of the population will get the vaccine at the third stage. However, the Ministry has identified those who should not get the vaccine. They are women who are pregnant or breastfeeding, women who plan on getting pregnant in the next two months, people who suffer from severe allergic reactions, and those who have been infected by the virus in the past 90 days.
As of 17 January 2021, nearly 295,000 people have been vaccinated against COVID-19, according to the Saudi Health Council [52]. That gives α = 3 × 10 −4 . The Pfizer-BioNtech vaccine offers up to 95% protection against Covid-19 after a second dose [53], which gives σ = 0.05. We start our numerical simulation on December 17th for a period of six months with and without vaccination. Numerical results are presented in Figure 7, where α = 0 corresponds to the model output without vaccination. We notice that the current pace of vaccination does not show a significant decrease in the number of cases. We observe only a reduction of 9.02% in the total confirmed cases, and 7.47% in the total deaths. However, the vaccination rate is expected to increase in the upcoming weeks. Therefore, we consider three more different values of α 6 × 10 −4 , 1.2 × 10 −3 , and 2.4 × 10 −3 . As we can see in Figure 7, intensifying the vaccination campaign starts to significantly decrease the number of cases. For α = 0.0006, we observe a reduction up to 16.08% and 13.47% for the total confirmed cases and deaths, respectively. Finally, a notable reduction is observed for α = 0.0024. Results show a decline by 38.67% and 33.65%, respectively.

Conclusions
This paper presented an extended SEIR model with seven stages of infection, including vaccination, to simulate and predict the development of the novel COVID-19 outbreak. Initially, a mathematical analysis was carried out to illustrate the non-negativity, boundedness, epidemic equilibrium, existence, and uniqueness of the endemic equilibrium, and basic reproduction number of the proposed model. To mitigate the imperfection of SEIR models and the poorly known parameters, we first applied a data assimilation framework based on a Joint-EnKF for state-parameters estimation in order to enhance the model forecast skill. We assimilated real data of the number of deaths, the number of recovered cases, and the number of active cases. In the second step, the updated state and parameters were used to perform predictions two weeks ahead in order to assess the relevance of the system of making short-term predictions of the COVID-19 pandemic. Finally, we investigated the impact of vaccination on the spread of the disease.
The proposed methodology was validated with cases reported from Saudi Arabia. In the assimilation process, the estimated numbers of active, recovered, deaths, and confirmed cases were accurately predicted by the model, with generally minor differences. In the predictability experiments, the model provided encouraging results for short-term predictions, with a relatively small ensemble size. Moreover, model predictions can be updated weekly or every two weeks to accommodate short-and medium-term predictions. Results also show that intensifying the vaccination campaign significantly decreases the number of confirmed cases and deaths. The proposed model can serve as a tool for health authorities to plan, prepare, and take appropriate measures and decisions to control the pandemic.