Analysis and Prediction of COVID-19 Using SIR, SEIQR, and Machine Learning Models: Australia, Italy, and UK Cases

Abstract: The novel coronavirus disease, also known as COVID-19, is a disease outbreak that was first identified in Wuhan, a Central Chinese city. In this report, a short analysis focusing on Australia, Italy, and UK is conducted. The analysis includes confirmed and recovered cases and deaths, the growth rate in Australia compared with that in Italy and UK, and the trend of the disease in different Australian regions. Mathematical approaches based on the susceptible, infected, and recovered (SIR) model and the susceptible, exposed, infected, quarantined, and recovered (SEIQR) model are proposed to predict the epidemiology in the above-mentioned countries. Since the performance of the classic forms of SIR and SEIQR depends on parameter settings, some optimization algorithms, namely Broyden–Fletcher–Goldfarb–Shanno (BFGS), conjugate gradients (CG), limited memory bound constrained BFGS (L-BFGS-B), and Nelder–Mead, are proposed to optimize the parameters and the predictive capabilities of the SIR and SEIQR models. The results of the optimized SIR and SEIQR models were compared with those of two well-known machine learning algorithms, i.e., the Prophet algorithm and logistic function. The results demonstrate the different behaviors of these algorithms in different countries as well as the better performance of the improved SIR and SEIQR models. Moreover, the Prophet algorithm was found to provide better prediction performance than the logistic function, as well as better prediction performance for Italy and UK cases than for Australian cases. Therefore, it seems that the Prophet algorithm is suitable for data with an increasing trend in the context of a pandemic. Optimization of SIR and SEIQR model parameters yielded a significant improvement in the prediction accuracy of the models. Despite the availability of several algorithms for trend predictions in this pandemic, there is no single algorithm that would be optimal for all cases.


Introduction
In December 2019, the Chinese government informed the rest of the world that a virus was rapidly spreading throughout China. A few months later, it had spread to several other countries. The United States Centers for Disease Control and Prevention (CDC) identified a seafood market in Wuhan as the center of the outbreak of the novel coronavirus disease (COVID-19), which is caused by severe acute respiratory syndrome coronavirus 2. The World Health Organization (WHO) reported a case in Thailand on 13 January 2020, the first case to be identified outside China. On 16 January 2020, Japan confirmed its first case of the novel coronavirus, followed by South Korea on 20 January. As of today, most countries around the world have been affected.
Numerous studies have been conducted to predict the spread of the virus in order to seek the best prevention measures. For instance, a simulation model based on mobility data was proposed [1], and a particle swarm optimization (PSO) algorithm was used to estimate the parameters (susceptible, infected, recovered) in the susceptible, infected, and recovered (SIR) model [1,2]. The results indicate that the latter method is precise enough, with a low margin of error compared with analytical methods. Another study [3] calibrated the SIR model to South Africa after considering different scenarios for the reproduction number (R0) for reporting infections and short-term healthcare resource estimations. Meanwhile, daily temperature and relative humidity were both reported to influence the occurrence of COVID-19 in Hubei province and some other provinces [4].
The authors of [5] proposed a heuristic algorithm to model and evaluate the risk of the COVID-19 pandemic in six different countries/states, namely New York, California, the whole of the USA, Iran, Sweden, and the UK.
The first case of COVID-19 in Australia was reported in January 2020. In this paper, we also report on a short analysis focusing on Australia, which continued as a short-term simulation.
The manuscript is organized in several sections. Section 2 presents the research methodology. Sections 3 and 4 introduce the SIR and susceptible, exposed, infected, quarantined, and recovered (SEIQR) models. Section 5 describes the prediction algorithms (the logistic function and Prophet algorithm). Section 6 provides the results, followed by a discussion and concluding remarks in Section 7.

Research Methodology
The study was carried out in several phases. First, data were collected from the World Health Organization (WHO) and Johns Hopkins University, which obtain data from different organizations. After this, the data were analyzed and preprocessed in order to remove any duplicate or missing values. Numerical tests were performed using Python and R and executed on a computer with an Intel® Core i7-4510U processor (2.0 GHz) and 8 GB of DDR3 memory (Supplementary File). The flowchart of the research methodology is provided in Figure 1.

SIR Model
This section introduces the classic form of the SIR model [20,21], which is used to describe the transmission of COVID-19 in Australia, Italy, and UK. The flowchart of the SIR model is presented in Figure 2, where:
• S is the number of susceptible individuals at time t;
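As an illustration of how the classic SIR form can be simulated, the following minimal sketch integrates the model on population fractions with a fixed-step fourth-order Runge-Kutta scheme. Only Equation (1) is taken from the text; the equations for the infected and recovered compartments are assumed to follow the standard SIR form. The function names, the integrator, the step size, and the initial infected fraction are hypothetical choices made for this example, while β = 0.378 and γ = 0.14 are the initial settings quoted later in the paper.

```python
def sir_rhs(s, i, r, beta, gamma):
    """Standard SIR right-hand side on population fractions (s + i + r = 1)."""
    new_infections = beta * i * s      # Equation (1): dS/dt = -beta * I * S
    recoveries = gamma * i             # assumed standard removal term
    return -new_infections, new_infections - recoveries, recoveries


def rk4_step(state, h, beta, gamma):
    """Advance (s, i, r) by one fixed RK4 step of size h (in days)."""
    k1 = sir_rhs(*state, beta, gamma)
    k2 = sir_rhs(*[x + 0.5 * h * k for x, k in zip(state, k1)], beta, gamma)
    k3 = sir_rhs(*[x + 0.5 * h * k for x, k in zip(state, k2)], beta, gamma)
    k4 = sir_rhs(*[x + h * k for x, k in zip(state, k3)], beta, gamma)
    return tuple(
        x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate_sir(beta, gamma, i0, days, steps_per_day=10):
    """Return one (s, i, r) tuple per day, starting with day 0."""
    state = (1.0 - i0, i0, 0.0)
    h = 1.0 / steps_per_day
    trajectory = [state]
    for _ in range(days):
        for _ in range(steps_per_day):
            state = rk4_step(state, h, beta, gamma)
        trajectory.append(state)
    return trajectory


# Illustrative run: i0 is a hypothetical initial infected fraction.
trajectory = simulate_sir(beta=0.378, gamma=0.14, i0=1e-6, days=120)
peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][1])
print("Day with the largest infected fraction:", peak_day)
```

Working on fractions rather than head counts keeps the sketch independent of any particular country; multiplying each compartment by the population recovers absolute numbers.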

SEIQR Model
The SEIQR model, an extended version of SIR [23], models the interaction of people under different conditions: susceptible (S), exposed (E), infected (I), quarantined (Q), and recovered (R). The parameters S, I, and R are the same as those in the SIR model, and E represents the fraction of individuals that have been infected but do not show any signs. The SEIQR model diagram is illustrated in Figure 3. The equations of the SEIQR model are defined as follows:
$$\frac{dP(t)}{dt} = \alpha S(t) \qquad (10)$$
where α represents the protection rate; β is the infection rate and illustrates the inverse of the average latent time; γ shows the rate of recovery (removal); δ represents the inverse of the average quarantine time; λ_0 and λ_1 are coefficients used in the time-dependent cure rate; κ_0 and κ_1 are coefficients used in the time-dependent mortality rate [23]; and {S(t), P(t), E(t), I(t), Q(t), R(t), D(t)} refer to the susceptible, insusceptible, exposed (infected but not yet infectious, in a latent period), infectious (with infectious capacity and not yet quarantined), quarantined (confirmed and infected), recovered, and closed cases [23].
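To make the compartment structure concrete, the sketch below evaluates one possible right-hand side for the seven states {S, P, E, I, Q, R, D}. Only Equation (10), dP/dt = αS, is taken from the text. Every other flow, the functional forms chosen for the time-dependent cure and mortality rates, the parameter names, and all numerical values are assumptions made for illustration, and they may differ from the equations used by the authors and in [23].

```python
import math


def seiqr_derivatives(t, state, p):
    """Assumed SEIQR-type right-hand side; state order is (S, P, E, I, Q, R, D)."""
    s, prot, e, i, q, r, d = state
    n = sum(state)  # total population, constant under these flows

    # Assumed time-dependent rates built from lambda_0, lambda_1, kappa_0, kappa_1.
    cure_rate = p["lambda0"] * (1.0 - math.exp(-p["lambda1"] * t))
    death_rate = p["kappa0"] * math.exp(-p["kappa1"] * t)

    new_exposed = p["infection_rate"] * s * i / n   # S -> E (assumed form)
    protected = p["protection_rate"] * s            # S -> P, Equation (10)
    became_infectious = p["latent_rate"] * e        # E -> I (assumed)
    quarantined = p["quarantine_rate"] * i          # I -> Q (assumed)

    ds = -new_exposed - protected
    dp = protected
    de = new_exposed - became_infectious
    di = became_infectious - quarantined
    dq = quarantined - (cure_rate + death_rate) * q
    dr = cure_rate * q
    dd = death_rate * q
    return (ds, dp, de, di, dq, dr, dd)


# Hypothetical parameter values and state, not fitted to any country.
params = {
    "protection_rate": 0.05, "infection_rate": 1.0, "latent_rate": 0.2,
    "quarantine_rate": 0.1, "lambda0": 0.03, "lambda1": 0.05,
    "kappa0": 0.02, "kappa1": 0.05,
}
state = (990000.0, 0.0, 5000.0, 3000.0, 1500.0, 400.0, 100.0)
derivs = seiqr_derivatives(10.0, state, params)
print("Net population change per day:", sum(derivs))
```

Because every outflow from one compartment is an inflow to another, the derivatives sum to zero, which is a quick consistency check for any variant of the equations.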

Prediction
The machine learning techniques described in this section were used for COVID-19 case predictions in Australia, Italy, and UK. Machine learning is a branch of computer science in which data teach algorithms, and the learning process is performed as supervised, unsupervised, and/or semi-supervised learning forms [24][25][26][27]. In this section, some approaches that were employed to predict cases (confirmed and deaths) of COVID-19 are discussed.

Logistic Function
A logistic function can be defined as follows:
$$f(x) = \frac{L}{1 + e^{-K(x - x_0)}} \qquad (11)$$
where e is Euler's number, x_0 is the sigmoid's midpoint, L is the curve's maximum value, and K is the logistic growth rate of the curve.
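The roles of the three parameters can be checked numerically. The sketch below assumes the standard logistic form f(x) = L / (1 + e^(−K(x − x_0))) with the symbols named above; the parameter values are hypothetical and were not fitted to any of the data sets used in this study.

```python
import math


def logistic(x, L, K, x0):
    """Standard logistic curve: maximum L, growth rate K, midpoint x0."""
    return L / (1.0 + math.exp(-K * (x - x0)))


# Hypothetical parameters: plateau of 7000 cases, midpoint at day 40.
L, K, x0 = 7000.0, 0.2, 40.0
curve = [logistic(day, L, K, x0) for day in range(121)]

print("Value at the midpoint:", logistic(x0, L, K, x0))  # equals L / 2
print("Value on day 120:", round(curve[-1], 1))          # approaches L
```

At x = x_0 the exponent vanishes, so the curve passes through exactly half of its maximum; this is why x_0 is interpreted as the day on which cumulative cases reach half of their final level.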

Time Series Forecasting with the Prophet Algorithm
The Prophet algorithm is an open-source tool developed by Facebook's Data Science team, which is aimed at business forecasting [28]. This algorithm works well with time-series data that have seasonal effects, and it is robust in dealing with missing data [29]. In the Prophet algorithm, the forecast is determined as follows [29]: where y_1, y_2, ..., y_T denote the historical data, and ŷ_{T+h|T} is shorthand for the forecast of y_{T+h} based on the data available up to time T.

Analysis: New Cases
In this subsection, the growth rates of confirmed cases in Australia, Italy, and UK were calculated for every day from 24 April to 23 May 2020. Figure 4 depicts the growth rate of confirmed cases in these countries. As can be seen, the growth rate for Australia remained below 0.5 during times of outbreak and was just above 0.0 at the end of May, while the rates for Italy and UK were generally higher. The growth rate for UK was mostly above 2.0 in April, and then dramatically declined in May. The rate for Italy fluctuated between 0.5 and 1.5 in April and May.

Overall Growth Rate
This section presents the numbers of active cases in these three countries, which were calculated using the following equation:
Active_cases = confirmed_cases − deaths_cases − recovered_cases (13)
From Equation (13), the overall growth rate could be calculated according to Equation (14), in which i refers to the present day. Figure 6 illustrates the overall growth rate for confirmed cases in the studied countries. Negative numbers indicate that people recover faster than the rate at which they get sick, which is good news. The rate for Australia remained almost entirely below zero over the time period, changing from −15 at the end of April to just below −5 at the end of May. For Italy, the rate fluctuated between just above −7.5 and just above 0.0, while the rate for UK remained almost consistently positive over the time horizon (0.0-3.0). Figure 7 illustrates that the number of death cases in Australia is significantly lower than in the other two countries.
With the aim of forecasting, the logistic function defined in Equation (11) was applied to the collected data (time horizon: from the start of the outbreak in each country). As shown in Figures 9-14, the logistic function was fitted to the trend of increasing cases, and the R² score was used as the performance metric for confirmed and death cases. The results are presented in Table 1. The root mean square error (RMSE) was used as another metric to analyze cases, and the results in Table 2 show that the best RMSE value belongs to the Australian cases (confirmed and death).
As can be seen in Figures 15-17 and Table 3 (RMSE values), the classic SIR form was not suitable for predicting the COVID-19 pandemic in these three countries. In order to fit the SIR model to Australia, Italy, and UK, an optimizer was needed to find the unknown parameters (β and γ) in the equation for the reproduction number, R_0 = β/γ, since these parameters could be estimated from data. Prior to the outbreak in these countries, the number of susceptible cases was taken to be equal to the population, owing to the absence of antibodies and vaccines for the disease. At first, R_0 was fixed at the median value of 2.7 (reported by the Australian Government Department of Health), with β = 0.378 and γ = 0.14.
Real data were applied to estimate the values of β and γ, and an optimizer was used to find the best estimates. The optimization algorithms used were the Broyden-Fletcher-Goldfarb-Shanno (BFGS) [30], limited memory bound constrained BFGS (L-BFGS-B) [31], conjugate gradients (CG) [30], and Nelder-Mead algorithms [32]. The parameter settings are provided in Table 3. The flowchart of the improved SIR and SEIQR versions and the parameter settings for the above-mentioned algorithms are given in Figure 18 and Table 4, respectively. Table 5 provides the optimized values obtained by different algorithms (SIR model). The best values for the parameters were found using the Nelder-Mead algorithm (for the SIR model) and the L-BFGS-B algorithm (for the SEIQR model). This method is illustrated in Figure 18. As previously mentioned, before the start of the outbreak, the number of susceptible cases was equal to the populations of these countries, since neither antibodies nor a developed vaccine were available. According to Wikipedia, the populations of Australia, Italy, and UK were 25 × 10^6, 60 × 10^6, and 67 × 10^6, respectively. Table 6 presents the RMSE values obtained by the algorithms (for the SIR and SEIQR models), showing a significant reduction in the values. Figure 19a-c presents the confirmed cases provided by the optimized SEIQR model with the above-mentioned descriptions in the three countries (see Figure 18).
Figures 20-22 show the forecasted values obtained using the Prophet algorithm, where the algorithm fits the cases of Italy and UK but has errors for Australia. Tables 7-9 present the results of the predicted cumulative confirmed cases using the Prophet algorithm in the three countries, where y represents the true values of confirmed cases, ds is time, ŷ is the forecasted values, and ŷ_lower and ŷ_upper are the lower and upper bounds for the forecasted values, respectively. It should be noted that the forecasted values were determined between the cutoff and cutoff + horizon. Tables 7-9 are also cross-validation matrices that are used to find the error values between y and ŷ, from which the RMSE values can be obtained (Figure 23a-c).
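The parameter-estimation step described above, in which β and γ are tuned with BFGS, CG, L-BFGS-B, and Nelder-Mead, can be sketched with the `minimize` function of SciPy, which offers all four methods. This is an illustration under stated assumptions rather than the authors' implementation: the observations are synthetic (generated from known parameters instead of real case counts), the objective is the mean squared error on the infected fraction (its square root is reported as the RMSE), and the fixed-step RK4 integrator, the bounds, and all function names are hypothetical choices.

```python
import numpy as np
from scipy.optimize import minimize


def simulate_infected(beta, gamma, i0, n_days, substeps=4):
    """Daily infected fraction from a fixed-step RK4 integration of SIR."""
    def rhs(y):
        s, i, _ = y
        return np.array([-beta * s * i, beta * s * i - gamma * i, gamma * i])

    y = np.array([1.0 - i0, i0, 0.0])
    h = 1.0 / substeps
    infected = [y[1]]
    for _ in range(n_days - 1):
        for _ in range(substeps):
            k1 = rhs(y)
            k2 = rhs(y + 0.5 * h * k1)
            k3 = rhs(y + 0.5 * h * k2)
            k4 = rhs(y + h * k3)
            y = y + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        infected.append(y[1])
    return np.array(infected)


def mse(params, observed, i0):
    """Mean squared error between simulated and observed infected fractions."""
    beta, gamma = params
    predicted = simulate_infected(beta, gamma, i0, len(observed))
    return float(np.mean((predicted - observed) ** 2))


# Synthetic "observations" from known parameters (not real data).
true_beta, true_gamma, i0 = 0.30, 0.10, 1e-4
observed = simulate_infected(true_beta, true_gamma, i0, 60)

start = [0.378, 0.14]  # initial setting quoted in the text (R_0 = 2.7)
results = {}
for method in ("BFGS", "CG", "L-BFGS-B", "Nelder-Mead"):
    # Bounds are passed only to L-BFGS-B, the bound-constrained method here.
    extra = {"bounds": [(1e-3, 2.0), (1e-3, 1.0)]} if method == "L-BFGS-B" else {}
    fit = minimize(mse, start, args=(observed, i0), method=method, **extra)
    results[method] = {
        "beta": float(fit.x[0]),
        "gamma": float(fit.x[1]),
        "rmse": float(np.sqrt(fit.fun)),
    }

for method, values in results.items():
    print(method, values)
```

Comparing the RMSE reached by each method on the same starting point mirrors the comparison reported in the tables; with real data, the ranking of the methods may of course differ from what this synthetic example produces.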

Discussion and Conclusions
COVID-19, a disease caused by a member of the coronavirus family, has affected the lives of billions of people worldwide. The first section of this paper presented a short analysis of COVID-19, focusing on its effect in Australia, Italy, and UK. Specifically, the analysis gives a comparison of the confirmed cases and death rates between Australia, Italy, and UK and among the different states of Australia. The analysis reveals that Australia is in a generally good position compared with the other two countries. However, the situation in different regions of Australia is rather complicated. For example, New South Wales has the most confirmed cases and death cases, while Northern Territory shows the fewest confirmed and death cases (it is worth mentioning that New South Wales has a larger population).
Mathematical approaches based on SIR and SEIQR were proposed to predict the epidemiology in Australia, Italy, and UK. Since the classic forms of SIR and SEIQR are deterministic, an improved version based on parameter optimization is suggested to improve the prediction. The results were compared with the logistic function and Prophet algorithm, and are summarized as follows:

• The comparison between the classic SIR model and real data showed a significant gap. However, initializing the parameters of the SIR model significantly improved the prediction.
• The classic SIR model worked best for UK but was not suitable for Australia based on RMSE values.
• The logistic function was a good model for UK with an R² score of 0.97, while the scores for Australia and Italy were 0.67 and 0.95, respectively.
• The best RMSE value belonged to the Australian cases (confirmed and deaths).
• Parameter optimization for the SIR and SEIQR models significantly improved their prediction accuracy.
• The improved version of SEIQR exhibited better performance than the SIR model (regarding RMSE values and figures).
• The optimized SEIQR model has better prediction for UK and Italy compared with Australia.
• The best values for the parameters were determined using the Nelder-Mead algorithm for the SIR model and the L-BFGS-B algorithm for the SEIQR model.
• The Prophet algorithm worked better for Italy and UK cases than for Australian cases.
• The logistic function had a better performance for cases in all three countries compared with the Prophet algorithm.
• The improved versions of the SIR and SEIQR models exhibited a better performance than the logistic function, Prophet algorithm, and classic SIR model.
Some studies and research on related viruses have predicted that COVID-19 is dependent on environmental characteristics and will decline with higher temperature, humidity, ultraviolet (UV) light [33], and with spatial colony-growth heterogeneity [34]. Since UV light has been strongly associated with lower COVID-19 growth, projections suggest that, without intervention, COVID-19 will decrease temporarily during summer, rebound by autumn, and peak the subsequent winter. Regarding the above-mentioned discussion, the growth rate appears to be a country-specific characteristic.
The evolution of COVID-19 throughout the world is difficult to predict. Until a reliable vaccine becomes available for all, which may only happen by the end of 2021, governments will have to strike the tough balance between health and other issues, such as economic and social concerns. Although social distancing comes at an economic and psychological price, recent experience in several countries indicates that lifting a majority of restrictions increases the potential for multiple local outbreaks (second and third waves). In the absence of an effective and reliable vaccine, preparedness for this "wave" phenomenon is absolutely required.
One limitation of this study is that the authors did not account for human behavior or control measures in the models. By modeling the maximum growth rate and using a threshold number of cases, we could restrict the analyses to the period during which the disease expanded quickly: between the beginning of community transmission and the implementation of major control measures. This aspect alone is a suggested direction for further study.
Another limitation of this paper is the timing of the analysis, which was based on data retrieved up until May 2020. If we look at the current data, the trend of COVID-19 cases is completely different from the predictions made in this paper. However, the results of this paper show that the growth rate is a country-specific characteristic and depends on time. Some factors relevant to these country specifications are the country's size, population heterogeneity factor, etc. A similar effect of heterogeneous subpopulations is known, for instance, in the spread of bacterial colonies. Furthermore, there exist much more advanced models for the COVID-19 pandemic, whereby the inclusion of hospitalization rates, use of intensive care units (ICUs), and age groups is influential (vaccinations will be included in this list in the near future). All the above-mentioned issues are encouraged for future studies.
In addition, all the forecasting in this paper was addressed without considering the scenario of social distancing and quarantine, which is valuable as a future research direction. While this paper analyzes the improved SIR and SEIQR models, it would be interesting to test other epidemiology models. Moreover, it would be worthwhile to combine mathematical models with other observations, such as policy interventions, human behavior, and constraints, which may yield better prediction performance.

Figure 1. Flowchart of the current research process.

Figure 2. Susceptible, infected, and recovered (SIR) model. The SIR model shows how a disease spreads through a population. The equations of the SIR model are as shown below [22]:
$$\frac{dS}{dt} = -\beta I S \qquad (1)$$

Figure 4. Growth rate (confirmed cases in Australia, Italy, and UK).

Figure 5 presents the growth rate of death cases for the above-mentioned countries, according to daily data from 24 April to 23 May 2020. The growth rate for death cases in Australia fluctuated between 0.0 and 7.0 in April and May and reached 7.0 at the end of April. During the same period, the rate remained almost below 2.0 in Italy, and in UK, the rate was just below 4.0 at the end of April and just above 0.0 at the end of May.

Figure 5. Growth rate (death cases in Australia, Italy, and UK).

Figure 6. Overall growth rate for confirmed cases in Australia, Italy, and UK.

Figure 7. Number of death cases in Australia compared with Italy and UK.

Figure 8a-h shows confirmed cases versus death cases in each Australian state. By 23 May 2020, New South Wales and Northern Territory possessed the most confirmed and the

Figure 9. Prediction of confirmed cases by logistic function (Australia).

Figure 10. Prediction of death cases by logistic function (Australia).

Figure 11. Prediction of confirmed cases by logistic function (UK).

Figure 12. Prediction of death cases by logistic function (UK).

Figure 13. Prediction of confirmed cases by logistic function (Italy).

Figure 14. Prediction of death cases by logistic function (Italy).

Figures 15-17 present the results of the classic SIR model. As previously mentioned, the controlling β parameter indicates the level of disease transmission, and γ is the recovery (removal) rate, indicating how many people could recover in a certain period. First, all parameters were initially added to the SIR model, which was then applied to real data.

Figure 15. Predicted cases in Australia using the susceptible, infected, recovered (SIR) model (blue: real confirmed cases; red: SIR model).

Figure 16. Predicted cases in Italy based on the SIR model (blue: real confirmed cases; red: SIR model).

Figure 17. Predicted cases in UK based on the SIR model (blue: real confirmed cases; red: SIR model).

Figure 18. Flowchart of improved versions of SIR and SEIQR models.

Figure 20. Forecasting by Prophet algorithm for the next year (confirmed cases in Australia).

Figure 21. Forecasting by Prophet algorithm for the next year (confirmed cases in Italy).

Figure 22. Forecasting by Prophet algorithm for the next year (confirmed cases in UK).

Figure 23. Visualization of performance metric for Prophet algorithm (considering RMSE) for (a) United Kingdom, (b) Australia, and (c) Italy.

Table 1. R² score for different cases in the three countries.

Table 2. Root mean square error (RMSE) values for different cases in the three countries.

Table 3. RMSE values obtained by SIR model (before optimization of parameters).

Table 5. Median values of SIR parameters determined by the Department of Health in each country.

Table 6. RMSE values obtained based on the improved SIR model considering a 0.99 confidence interval.

Table 7. Predicted cumulative confirmed cases in Australia (cross-validation matrix).

Table 8. Predicted cumulative confirmed cases in UK (cross-validation matrix).

Table 9. Predicted cumulative confirmed cases in Italy (cross-validation matrix).