COVID-19 Spread Forecasting, Mathematical Methods vs. Machine Learning, Moscow Case

: To predict the spread of the new coronavirus infection COVID-19, the critical values of spread indicators have been determined for deciding on the introduction of restrictive measures using the city of Moscow as an example. A model was developed using classical methods of mathematical modeling based on exponential regression, the accuracy of the forecast was estimated, and the shortcomings of mathematical methods for predicting the spread of infection for more than two weeks. As a solution to the problem of the accuracy of long-term forecasts for more than two weeks, two models based on machine learning methods are proposed: a recurrent neural network with two layers of long short-term memory (LSTM) blocks and a 1-D convolutional neural network with a description of the choice of an optimization algorithm. The forecast accuracy of ML models was evaluated in comparison with the exponential regression model and one another using the example of data on the number of COVID-19 cases in the city of Moscow.


Introduction
In the context of the sanitary and epidemiological situation in Moscow associated with the spread of the new coronavirus infection COVID-19, the task of predicting the dynamics of the spread of infection and the question of when to implement quarantine measures at the regional level remains urgent. Since the start of the pandemic, the development of methods for predicting indicators characterizing the spread of coronavirus infection has remained active. The most widely used model of logistic growth [1][2][3][4] is the coplanar model, susceptible-infectious-recovered (SIR), and its various variations: susceptible-infectious-recovered-susceptible (SIRS), susceptible-exposed-infectious-recovered (SEIR), susceptible-infectious, recovered-deceased (SIRD), susceptible-infectious-recovered-vaccinated (SIRV) [5][6][7][8][9][10][11][12][13][14], and others [15][16][17][18]. However, as the complexity of the model grows, there is a concomitant increase in the number of unknown parameters, which are often difficult to estimate. Coplanar models of the type SIR, SIRS, SEIR, etc., can be used in long-term forecasting, but only for inhomogeneous isolated systems in which the law of mass action is fulfilled and good mixing of the population is observed. Real large systems cannot be considered isolated due to many factors. For example, Moscow and the Moscow agglomeration are characterized by commuting (more than 2 million people per day), including migration from neighboring regions (Tula, Vladimir, and Ka-network architectures. Section 4 summarizes the study's findings on the use of mathematical methods for modeling the long term incidence of COVID-19 infections and machine learning methods. Section 5 formulates the main conclusion of this work.

Determination of Predicted Parameters
To begin the study, it is necessary to identify the possible factors influencing the adoption of measures. As for Moscow, the decrees of the mayor of Moscow do not indicate specific indicators used to make decisions on restrictions on the movement of the population [28,29]. However, the features of COVID-19 are already well known, which is characterized by high contagiousness, a long incubation period, and a rather high probability of serious complications requiring inpatient treatment. All of these factors combine to quickly overwhelm the healthcare system. Therefore, it is natural to take the number of infections per day and the number of hospitalized patients with COVID-19 per day as the baseline indicators of the study. The first parameter reflects the spread of new coronavirus infection, and the second parameter (the number of hospitalized) reflects the load on the bed fund and the healthcare system as a whole. The achievement of a certain critical level by these parameters becomes a significant reason for deciding to impose restrictions.
We conducted a study using the example of Moscow. We considered the time series of cases of coronavirus infection and the number of hospitalized patients in the city of Moscow from 12 March 2020 to 15 October 2021 ( Figure 1) [30]. At the same time, there are no data for the period from 31 December 2020 to 10 January 2021, since the Moscow operational headquarters did not publish statistics on the number of hospitalized individuals with a new coronavirus infection during this period. Denote the number of cases of COVID-19 infection per day by the function f(t), and the number of hospitalized by the function φ(t), where t is a step in time (day). To check whether it is necessary to simultaneously use both indicators, we calculated the Pearson correlation coefficient: where: = ( )-number of cases per day , Φ = ( )-number of hospitalized per day , and Φ -the arithmetic mean of the sample. To eliminate the difference in sample size associated with the lack of data on hospitalizations from 31 December 2020 to 10 December 2020, the corresponding values for the number of infections were removed.
The correlation coefficient for the entire sample size is 0.866735, which characterizes the level of linear relationship as high. However, considering the periods of "waves", we note that the number of hospitalized people grows in proportion to the number of infected individuals. Therefore, in the period from 1 October 2020 to 31 December 2020 (main stage "second" wave of coronavirus), the coefficient was only 0.627 (weak-moderate on the scale of E.P. Golubkov), and during the period of active growth of the third wave (from 9 June 2021 to 22 July 2021), the coefficient was 0.636 (moderate). This discrepancy is explained by the following: in periods between waves of morbidity, the healthcare system can hospitalize a larger number of patients (including those with mild and moderate disease), and during a period of active growth, the healthcare system is forced to refuse hospitalization for patients with mild and moderately mild disease due to insufficient bed capacity, hospitalizing only moderately and seriously ill patients. Due to this fact, during periods of "waves", the correlation between the number of hospitalized and sick people is noticeably weak, as a result of which it is necessary to consider both factors as influencing the decision to introduce measures aimed at improving the sanitary and epidemiological situation.
Since the time series above ( Figure 1) is characterized by strong daily changes, we applied a seven-day moving average, simple moving average (SMA7): where: is the number of cases of hospitalizations at the point − . As a result, we obtained a smoothed time series (Figure 2). We determined the "critical" values of the parameters based on the dates of the introduction of measures aimed at stabilizing the epidemiological situation: 13 November 2020 to 13 June 2021. It is known that the duration of the disease from the moment of onset of symptoms in the case of moderate to severe averages 28 days [31]. Taking this into account, we determined the cumulative number of patients in the period up to the moment = 13.11.2020 at , in the period before = 13.06.2021 at : Similarly, we determines the cumulative number of hospitalized individuals in the period up to the moments -for Φ and -for Φ : At the time of making decisions on the introduction of new restrictive measures, the indicators of the cumulative number of hospitalized Φ and Φ are approximately similar. However, the difference between and is significant. From the smoothed time series of the number of cases ( Figure 2), it can be seen that the growth of incidence rates in the first case is more uniform than in the second, which was more "explosive".
For a more detailed analysis, we used the Rt Coronavirus Spread Index, also known as the effective reproductive number. This indicator reflected the rate of growth of new cases of new coronavirus infection, calculated using the "4 by 4" method: where is the current day. In other words, the number of people infected in the last 4 days (including the current one) is divided by the number of people infected in the previous 4 days. This indicator, used by Rospotrebnadzor as a metric for deciding whether to soften or strengthen restrictive measures, shows how many people manage to infect one infected person before they are isolated. If Rt < 1, Rospotrebnadzor recommends starting to mitigate restrictive measures. The calculation according to the "4 by 4" formula makes it possible to smooth out daily fluctuations in the number of detected cases of COVID-19 infection, occurring due to the different number of tests taken and processed to detect a new coronavirus infection.
The spread of coronavirus was also considered in conjunction with the baseline reproduction rate R0 (baseline reproductive number). R0 is the average number of people who are infected by one carrier in a naive population, that is, a population whose citizens do not have immunity to this disease at all and do not take any measures to protect themselves from it. We can say that at the beginning of the epidemic, Rt coincided with R0. The baseline reproduction rate cannot be measured directly, and its value depends on the chosen model of the infection mechanism and, at the same time, remains constant. Several studies also indicate that the basic reproductive number R0 does not take into account the presence of "super-spreaders" and individual differences in infectivity and is often highly distorted [32][33][34]. Therefore, for research purposes, the R0 level is not of interest.
We considered the dynamics of the index in the first and second moments during the three days before and after the introduction of measures to reduce the incidence (Table 1). With the introduction of restrictive measures in November 2020, the average value of the index was 1.02. When the restrictive measures were introduced in June 2021, the average value of the index was 1.54. Moreover, in June 2021, there was a significant increase in the number of cases of the disease, which had an additional impact on the decision to introduce restrictive measures.

Forecasting Based on the Exponential Model
For further analysis and prognosis, it is required to approximate the values of the number of cases of the disease and the number of hospitalizations. Note that during the period of increasing incidence, the growth is exponential. Thus, the function of the number of cases of the disease can be represented as: where , -some ratios. Whence, after taking the logarithm, we obtain: Suppose = ln ( ) , = ln( ) , = , then the function can be represented as: In this case, the problem is reduced to estimating the parameters of the paired linear regression model using the least-squares method: Then the estimates of the coefficients can be found by the formula: Approximating the data on the smoothed time series from 1 September 2021 to 16 October 2021, we obtain: We compared the data of the time series of smoothed values of infection cases and the resulting function ( Figure 3). In this case, the coefficient of determination was 0.9918, which characterizes the model on the segment as satisfactory. Note that forecasting based on the exponential model allows us to very accurately simulate the initial growth period occurring in realtime.
In Figure 3, it can be seen that ( ) is lagging from the real number of cases per day, but it is the lag of the function ( ) « from the actual current indicators that will increase the forecast accuracy over a short time interval since the values of ( ) will be higher than the increase in morbidity in the long run. Since, at the moment, the average value of the index in the period over the past 7 days is 1.1, it is logical to assume that the critical value of the number of cases of the disease, further denoted as , is more likely to impact the value of (14,3496 people) than the value of , since in June, there was a rapid increase in cases, which is currently not observed. Similarly, we assumed that the critical value of the number of hospitalized individuals, hereinafter denoted as Φ , was more likely to impact the value of Φ (34848 people).
We apply the same approximation method for the values of the function of the number of hospitalized patients ( ), as a result of which: We compared the data of the time series of smoothed values of the number of hospitalized people and the resulting function ( Figure 4).  In this case, the coefficient of determination was 0.9869, which also characterized the model on the segment as quite good.
For forecasting, we defined the confidence interval estimates of the coefficients = , = ln( ) for the period from 1 September 2021 to 17 October 2021 from the exponential model of the approximation functions ( ) and ( ), corresponding to the reliability of 95% according to the formulas: . ; where . ; is a two-sided critical point of the student distribution with 45 degrees of freedom, corresponding to a probability of 0.05, is an unbiased estimate of the standard deviation of the regression equation.
We denoted the values of the point estimates of the coefficients and for the base forecast, the positive forecast, the lower limit of the confidence interval, the negative, and the upper limit confidence interval ( Table 2).   Similarly, we considered the forecast of a seven-day moving average number of hospitalizations based on the values of the function ( ) for the same period ( Figure 6).   In this period, we considered the cumulative number of hospitalizations for the last 28 days and defined it as the "critical" value Φ = 36,000 people:  We evaluated the forecast accuracy for 15 days (from 17 October 2021 to 31 October 2021) of three scenarios with real data on the number of smoothed cases of COVID-19 infection using two metrics: root mean square error (RMSE) and mean absolute percentage error (MAPE) (  The table values concerning a baseline forecast with MAPE 8.596% and RMSE 655.951 showed more accurate forecast results. It can be seen that, until 23 October, the situation developed according to a negative forecast, and the introduction of restrictive measures was announced on 21 October. At the same time, it was reported that the situation was developing according to a negative forecast. However, from 24 October, the real number of cases of infection deviated from the predicted values. Here, the disadvantages of using exponential regression models for the long term are fully manifested: the number of infection cases has a "wavy" appearance, which the exponential model cannot display, as a result of which it becomes necessary to use other forecasting methods.

Determination of the Dataset and Neural Network Architectures
For further research, we turned to machine learning methods. In this case, the task of forecasting time series belongs to the supervised learning class. To form the sample, the data on the number of cases of COVID-19 infection were used from 20 countries: Austria, Belgium, Great Britain, Hungary, Germany, Israel, Iran, Spain, Italy, Canada, the Netherlands, Poland, Russia, Romania, Serbia, USA, France, Czech Republic, Switzerland, Japan, and data from the city of Moscow in the period from 18.03.2020 to 03.12.2021 [28,35,36]. These countries were selected according to two criteria: a population of more than 8 million people and a developed healthcare system. Together, these criteria made it possible to obtain sufficiently accurate and objective data on the number of cases of coronavirus infection.
To reduce the influence of daily fluctuations, a seven-day simple moving average (SMA7) was applied to the entire sample size. For the adequate operation of the neural network to all the values of these systems from the dataset separately, we applied the normalization function ∶ → [0,1]: where is SMA7 infections per day, and are the the maximum and a minimum number of cases of COVID-19 infection in the system, respectively. To normalize the data, the MinMaxScaler class from the scikit-learn library was used. Thus, all values were in the range from 0 to 1 in the corresponding scale. Figure 8 show examples of data from several systems. The dataset length was 626 values, of which 67% (419 values from each system) were intended for the training sample, and the remaining 33% (207 values from each system) were intended for the validation sample. The entire amount of data was 13,146 values.
For training, the networks received batches: 144 days before the forecast period as data based on which the forecast was made, and 48 days for the forecast. From the original dataset, 5775 batches were generated for training and 315 batches for validation. For efficiency, the batchhes were shuffled using the shuffle method from the TensorFlow framework. Examples of the batches are shown in Figure 9. To predict the time series of the number of cases of COVID-19 infection, we used two types of neural networks: long short-time memory (LSTM) and a convolutional neural network. We considered in more detail the architecture of the LSTM recurrent neural network, which is one of the most suitable ML-models for forecasting time series. The main advantage of this architecture is its good adaptability to learning and forecasting in a situation where key events (in our case, the "waves" of the where, is the input vector, is the loss layer, is the input gate layer, and are the state vectors, is the output gate layer, ℎ is the target (output) vector, , are the vectors of parameters (weights) [37].
In this case, the neural network was trained to predict 48 days based on data from the previous 144 days. Thus, the input vector is: where, is the number of cases of COVID-19, SMA7. Since the number of neurons in the output layer is determined by the number of forecast days, this value is 48 pieces. LSTM RNNs work with sequences and require data of the following form as input: [observations, time interval, number of signs]. In our case, the time interval was equal to one (the step is equal to one day). Therefore, we took a closer look at the two hidden layers.
The first hidden layer consisted of 144 LSTM blocks with the activation function in the form of a hyperbolic tangent. The second hidden layer consisted of 48 LSTM blocks and used the rectified linear unit (ReLU) as the activation function. Between the two hidden layers was dropout with a factor of 0.3 to combat overfitting. The total number of parameters for training in this network model was 123,504. In more detail, the proposed architecture with the names of layers from the TensorFlow framework can be seen in Table  4. We also took a closer look at the architecture of the second neural network for predicting cases of COVID-19 infection-the convolutional neural network. The principle of operation of this network is to alternate convolutional layers and subsampling layers (pooling layers).
In this model, the first convolution layer formed 64 output filters in convolution; the size of the one-dimensional convolution window was two. Then there was a downsampling layer with a window size equal to the convolution window size, that is, 2, and a flattened layer. This was followed by two fully connected layers of 50 neurons and 48 output neurons. The total number of trained parameters was 229890. In more detail, the network architecture with the names of the layers from the TensorFlow framework is given in Table 5. The classical indicator was used as the loss functions for training both networks of mean squared error ( ): where, is the true meaning, is the estimated value. We considered in more detail the choice of the optimization algorithm. There are many different methods, but the most preferable is the method of adaptive estimation of the moment, Adam, due to the combination of the accumulation of the motion of the gradient of the loss function and the relatively weak update of weights for typical features.
We considered its work in more detail. First, the algorithm updates the exponential moving averages of the gradient , which is the estimate of the 1st moment (mean): where is a hyperparameter that controls the exponential decay rates of the sliding, while 0 ≪ < 1. ∇ ( ) is the objective function gradient (vector of partial derivatives ). On the recommendation of the authors for machine learning problems = 0.9 [38]. To estimate the frequency of the gradient change, the average non-centered variance is used) : where, -similar the hyperparameter that controls the exponential decay rates of the sliding, and is taken as 0.999 as the most appropriate value for machine learning problems [39,40]. However, these slides are initialized as vectors close to zero (and the values will accumulate for a long time). To correct this problem, the first and second moments and are calculated, corrected for displacement: As a result, the weights are updated according to the rule: where, -learning rate, = 10 -smoothing parameter.

Training and Validation of Neural Network Models
The implementation of the neural network was carried out in the Python programming language in a Jupyter Notebook development environment using the open Tensor-Flow machine learning framework from Google as a basis (mainly a Keras add-on) with the involvement of the matplotlib libraries (for drawing graphs), scikit-learns (for normalization and network validation), and Pandas (for data structuring).
For LSTM RNN training, we determined an epoch size of 100 steps with 64 batches processing at each step and determined the number of epochs to be 24. We also considered the dynamics of the loss function on the training and test samples ( Figure 10). To validate the trained neural network, we used the mean absolute percentage error (MAPE): where, is the sample size, and are the true and predicted value for the day . On the training sample, the MAPE indicator was 7.294%; on the validation sample, it was 8.861%. An example of LSTM RNN operation can be seen in Figure 11. To train CNN, we determined an epoch size of 100 steps by processing 256 batches at each step and determining the number of epochs to be 100. We considered the dynamics of the loss function on the training and test samples ( Figure 12). For the trained model, we also used the MAPE metric as a validation: on the training set, this indicator was 6.203%, on the validation set, it was 7.637%. An example of how CNN works can be seen is represented in Figure 13.

Comparison of Forecast Accuracy
We compared the accuracy of the forecasts of all the constructed models on the same time interval that was considered for the exponential model (from 17 October 2021), extending it to 48 days (until 3 December 2021). To generate the forecast, the LSTM RNN and CNN networks were given data approximately 144 days before the start of the forecast period, i.e., from 26 May 2021 to 16 October 2021. The results of the models can be seen in Figure 14. To assess the accuracy, we used two previously applied metrics: RMSE and MAPE, for each model separately (Table 6). From the data in the table, it can be seen that machine learning models cope with forecasting on an equal time interval much better than the exponential regression model. Among machine learning methods, the CNN model performed more than three times worse on the MAPE score than the RNN LSTM.
We made an additional comparison of CNN and LSTM RNN models. To do this, we developed two models using input data on the cases of COVID-19 infection in Moscow during the period from 7 June 2021 to 28 October 2021 and obtained a forecast from 29 October 2021 to 15 December 2021 ( Figure 15). Similarly, we compared the forecasts for two metrics: RMSE and MAPE (Table 7).  CNN). Consequently, long short-term memory models perform significantly better at predicting time-series COVID-19 infections than convolutional neural network models.

Discussion
The results of the study indicate the sufficient accuracy of analytical methods of regression analysis for the short-term forecasting of the spread of coronavirus infection by the example of the number of cases of COVID-19 infection in the city of Moscow over 14 days. In this case, the MAPE indicator was 8.596%. However, in the case of long-term forecasts with a duration of 48 days, the MAPE error rose to an unacceptable 207.811%. Other methods, indicated in the introduction, also make it possible to obtain a forecast with sufficient accuracy for a period of up to two weeks, such as models using the Feigenbaum logistic equation [19,20]. Using coplanar models for the logistic growth of COVID-19 cases would presumably yield more acceptable results than an exponential growth model but would also fail to account for the possibility of a multimodal view of the time series of COVID-19 cases. The lack of classical mathematical models, including coplanar ones, is because real systems, such as large cities, regions, or countries, are open and heterogeneous, and it is virtually impossible to take into account a large number of stochastic factors and processes in such models. However, one area for future research may be to compare coplanar and spectral analysis models (such as the season-trend fit model and others mentioned in the introduction) with neural network models.
To solve the problem of a drop in the forecast accuracy of models based on classical mathematical methods, the paper proposes two models of neural networks with different architectures: a recurrent one based on LSTM blocks and a convolutional network based on the application of the convolution operation to a one-dimensional vector over the kernel with a window size of 2. According to the training results on a sample of sufficient size (5775 different batches with the minimum recommended for LSTM RNN 1000 different batches), both models show a sufficiently accurate result on their samples. When comparing the forecasts of all the models proposed in the work, neural networks showed a MAPE error of 16.9% for CNN and 5.391% for LSTM RNN, which is much better than the error indicators of the model based on exponential regression. With an additional comparison of artificial neural network models for the period from 29 October 2021 to 15 December on data from the city of Moscow, the LSTM-based model gave a much more accurate result with a MAPE error rate of 7.974% versus 20.094% for CNN. At the same time, the LSTM-based model took into account a faster decline in incidence and a plateau at the level of 3000 SMA7 cases of COVID-19 infection. However, it is worth noting the drawback of such models in the form of a significantly greater need for computing resources. When training the network for one epoch step of 64 different batches of the LSTM model, an average of 594 ms/step was required, while one epoch step of 256 different batches of the CNN model required an average of only 59 ms/step. Thus, on average, processing one batch of an LSTM network requires 38 times more time (8.81 ms/batch) than CNN (0.23 ms/batch). However, since accuracy is the main metric in this study, LSTM RNNs have a significant advantage over CNN and the exponential regression model. Further improvement of the forecast accuracy and optimization of the LSTM-based model requires additional research. It should be noted that with an increase in the sample size in the future and the "study" of new patterns by neural networks, the accuracy of models based on machine learning methods will only grow, which will increase the accuracy of long-term forecasts. A question for future research remains the possibility of augmentation of time series data to artificially increase the sample size.

Conclusions
This paper compares the prediction accuracy between mathematical methods for predicting the number of COVID-19 cases using machine learning methods. To estimate the exponential regression parameters, the least-squares method was used. Two architectures are proposed as machine learning methods: one based on LSTM blocks and one based on CNN. To train the networks, the data on the number of cases of infection with a new coro-navirus infection in 20 countries and the city of Moscow was used. The classical MSE indicator was used as a loss function, and the Adam method was used as an optimization algorithm. According to our research, standard mathematical models such as exponential regression can provide reasonably accurate results within two weeks. Machine learning models can predict the spread of a new coronavirus infection with acceptable accuracy for 48 days, which is more than 3.4 times longer than the forecasting period using classical mathematical methods. The best results were shown by the model of recurrent neural networks with long-term, short-term memory LSTM RNN, the forecasts of which were distinguished by the smallest average absolute percentage error (5-8%); however, such models require much higher computing power. For future work, it is of interest to conduct studies with the improvement of our proposed neural network architectures to obtain more accurate forecasting results and to also study the possibility of obtaining forecasts for more than 48 days.