Short-Term Prediction Methodology of COVID-19 Infection in South Korea

Abstract: The purpose of this study is to predict the short-term trend of the COVID-19 pandemic and give insights into effective response strategies. Based on the basic SIR model, a compartmental method for modeling the course of an epidemic, the short-term infection change ratio, $m^d$, is derived. The number of infected people can be predicted using this ratio. We calculated different $m^d$ values on a weekly basis. Among the combinations of $m^d$ that we tested, the prediction combining $m^d$ based on 1 week and $m^d$ based on 4 weeks was found to be statistically reliable. According to our regression analysis, our approach has an explanatory power of 96%. However, this method can only predict 1 week ahead of the current data. Thus, we use LSTM, a deep learning method applied to time series data, to forecast the trend 4 weeks ahead. The forecasted trends show that the number of infected people in South Korea will reach its peak a week after the writing of this work and start to gradually decline after that.


Introduction
In December 2019, an unknown novel coronavirus, SARS-CoV-2 (COVID-19), was first reported in Wuhan, China. As globalization has increased the frequency of personal and business travel to other countries, COVID-19 has spread rapidly around the world. On 11 March 2020, the World Health Organization (WHO) declared a pandemic [1]. As of 21 August 2021, a total of 211,503,434 confirmed cases of COVID-19 were reported, and the death toll reached 4,426,543 [2].
As the pandemic continues, information about COVID-19 is being collected, especially regarding its nature and characteristics. In particular, the virus is known to change over time, evolving new variants through genetic mutations. Therefore, thorough research is urgently needed to find the most effective measures that can help end the COVID-19 pandemic [3].
Regarding the daily reported cases in South Korea, as presented in Figure 1, the number of cases first peaked and declined around March 2020. This decline was attributed to measures such as school closures, social distancing, and the cancellation of all public gatherings. However, the number increased again in September 2020 and once more in January 2021, driven by factors such as fluctuations in the number of daily inspections (tests), the relaxation and reinforcement of preventive measures, and public perception. More recently, the number of cases started to increase again in June 2021 and has not yet begun to decrease. Therefore, predictions about the evolution of COVID-19 are crucial to adopting or readjusting preventive measures. Both short-term predictions and long-term predictions about the end of the COVID-19 pandemic are very important.
There are three basic approaches to the prediction of the dynamics of an epidemic such as COVID-19: compartmental models, statistical methods, and ML-based methods [4,5]. Compartmental models divide a population into mutually exclusive compartments and use a set of dynamic equations to describe the transitions between compartments [6]. The Susceptible-Infected-Removed (SIR) model [7] is the most common model for modeling epidemics. Statistical methods extract general statistics from data and fit them to a mathematical model that describes the evolution of an epidemic [8,9]. Finally, ML-based methods use machine learning algorithms to analyze historical data and find patterns that predict the number of new infections [4,10,11].
In this study, starting from the classical SIR model, we estimate the model parameters from existing data using statistical techniques and use these estimated parameters to predict short-term infections in South Korea. We also apply a Recurrent Neural Network (RNN) technique for time-series prediction to forecast the parameters themselves, which extends the prediction period.

Data Source
In this study, data from 21 January 2020 to 10 August 2021 are used; these data are provided by the Korea Centers for Disease Control and Prevention (KCDC) [12]. The dataset has four data fields: date, confirmed, death, and released. The number of current infections, calculated as the number of confirmed cases minus the numbers of deaths and released people, is added to the dataset. The trend of the number of infections in South Korea is shown in Figure 1. There have been four big waves of the pandemic in Korea. Although the fourth wave was still in progress at the time of the last data point, the overall number of infected people has been kept low relative to the total population of Korea (51,829,136 people) [13]. However, considering the recent epidemic situation, there is still a possibility of an explosive increase, so continuous and thorough management is required.
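The derived field described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that confirmed, death, and released are cumulative counts; the class name DailyRecord, the function name current_infections, and the sample numbers are hypothetical and are not taken from the KCDC dataset.

```python
from dataclasses import dataclass


@dataclass
class DailyRecord:
    """One row of the dataset (fields assumed to be cumulative counts)."""
    date: str
    confirmed: int
    death: int
    released: int


def current_infections(record: DailyRecord) -> int:
    """Current infections = confirmed - death - released."""
    return record.confirmed - record.death - record.released


# Hypothetical sample row, for illustration only (not real KCDC figures).
sample = DailyRecord(date="2021-08-10", confirmed=1000, death=20, released=800)
sample_infected = current_infections(sample)
print(sample.date, sample_infected)
```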

SIR Model
In this study, the standard Susceptible-Infected-Recovered (SIR) model is considered. In this model, the total population (N) is composed of three categories: Susceptible (S), Infected (I), and Recovered (R). The Susceptible category is the part of the total population that is vulnerable and at risk of becoming infected. The Infected category is the portion infected by the disease, and the Recovered category is the fraction of the total population that has been removed from the data (by death or recovery). The basic structure of the SIR mathematical model is described in Figure 2. The SIR model is expressed as a system of ordinary differential equations modeling the spread of epidemics in a fixed population assuming permanent immunity [14]. These equations are

$$\frac{dS}{dt} = -\frac{\beta S I}{N}, \qquad \frac{dI}{dt} = \frac{\beta S I}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I \tag{1}$$

where $\beta$ represents the effective transmission rate, i.e., the average number of contacts per person over time multiplied by the probability of disease transmission in a contact between a susceptible and an infectious subject. $\gamma$ represents the recovery rate, which is defined as the inverse of the duration of recovery $r_d$ ($\gamma = 1/r_d$). N, S, I, and R are assumed to be functions of t, as they fluctuate with time t as infection occurs.

The Susceptible population changes with the number of infections and the number of vaccinated people. If we let $\alpha(t)$ be the fraction of the population that remains susceptible at time t, after accounting for infected and vaccinated people, then $S(t) = \alpha(t)N$. Consequently, Equation (1) can be rewritten as follows:

$$\frac{dI(t)}{dt} = \left(\beta\alpha(t) - \gamma\right) I(t) = m(t)\, I(t) \tag{2}$$

where $m(t) = \beta\alpha(t) - \gamma$, which indicates the change ratio of the number of infections at time t, since

$$\frac{1}{I(t)}\frac{dI(t)}{dt} = \frac{d \ln I(t)}{dt} = m(t).$$

Here, we assume that $m(t)$ is constant in some interval $(t - d, t + d)$ and $m(t) = m_t^d$. Integrating Equation (2) from t to t + d, we can obtain the following result:

$$\ln I(t + d) = \ln I(t) + m_t^d \times d, \qquad \text{equivalently} \qquad I(t + d) = I(t)\, e^{m_t^d d}.$$
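As a rough numerical check of the constant-m approximation, the following Python sketch integrates the SIR equations with a simple forward-Euler scheme and compares the simulated number of infected people after 7 days with I(0) exp(m × 7), where m = βα − γ. The values of beta and gamma, the initial 100 infections, the step size, and the function name simulate_sir are illustrative assumptions and are not estimates from this study; N is the population figure quoted in the Data Source section.

```python
import math


def simulate_sir(beta, gamma, n, i0, days, steps_per_day=100):
    """Forward-Euler integration of the SIR equations; returns I at each day."""
    s, i, r = float(n - i0), float(i0), 0.0
    dt = 1.0 / steps_per_day
    daily_infected = [i]
    for _day in range(days):
        for _step in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_removals = gamma * i * dt
            s -= new_infections
            i += new_infections - new_removals
            r += new_removals
        daily_infected.append(i)
    return daily_infected


# Hypothetical parameters, chosen only to illustrate the approximation.
BETA, GAMMA, N, I0 = 0.2, 0.1, 51_829_136, 100

infected = simulate_sir(BETA, GAMMA, N, I0, days=14)

# With S(t) = alpha(t) N and alpha close to 1 early on, m = beta * alpha - gamma.
alpha0 = (N - I0) / N
m = BETA * alpha0 - GAMMA
predicted = infected[0] * math.exp(m * 7)

print(f"simulated I(7) = {infected[7]:.2f}, I(0)*exp(7m) = {predicted:.2f}")
```

In this regime the susceptible fraction barely changes over a week, so the two numbers should agree closely; the approximation is expected to degrade as d grows or as α(t) changes quickly.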

Calculation of $m_t^d$
By calculating $m_t^d$, we can predict the rate of infection after d days. If d is small, the assumption that $m_t^d$ is constant is easy to satisfy. If d is large, the uncertainty in the change over the period increases, which can violate the assumption. Thus, the selection of an appropriate d is important. Furthermore, an important factor in the choice of d is the forecast period. We set d on a weekly basis, from 1 week ($m_t^7$) to 4 weeks ($m_t^{28}$), as shown in Figure 3. In Figure 3, all $m^d$ series show a similar pattern, which is also reflected in the correlation coefficients in Table 1. The volatility of $m^7$ and $m^{21}$ is relatively large, while $m^{14}$ and $m^{28}$ are relatively stable. In particular, $m^{28}$ remains stable over the whole interval. The overall trend was very unstable at the beginning but became stable as time passed, and the volatility expands during the large waves. Since the initial trend in $m^d$ is very unstable and stabilizes after the first large wave, we analyze data after May 2020, excluding the initial period of instability, as these data may negatively affect the overall analysis results. Table 1 shows that the correlation between two $m^d$ series decreases as the difference between their d values increases. This means that combining variables with large differences in d is useful for increasing the effectiveness of predictions.
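A minimal Python sketch of this calculation follows. It assumes that the change ratio at time t is estimated as the backward-looking log growth rate, (ln I(t) − ln I(t − d)) / d, which is one reading of the integrated SIR relation rather than a formula stated explicitly in the text. The function names change_ratio and predict_infected and the synthetic series are hypothetical.

```python
import math


def change_ratio(infected, t, d):
    """Estimate the change ratio over the d days ending at index t.

    Assumption: m is the average log growth rate between t - d and t.
    """
    if t - d < 0:
        raise ValueError("not enough history for this d")
    return (math.log(infected[t]) - math.log(infected[t - d])) / d


def predict_infected(infected, t, d):
    """Predict I(t + d) as I(t) * exp(m * d), holding m constant."""
    m = change_ratio(infected, t, d)
    return infected[t] * math.exp(m * d)


# Synthetic series growing at 5% per day (illustration only, not real data).
series = [100.0 * math.exp(0.05 * day) for day in range(60)]

for d in (7, 14, 21, 28):
    m = change_ratio(series, 30, d)
    print(f"d={d:2d}  m={m:.4f}  predicted I(30+d)={predict_infected(series, 30, d):.1f}")
```

On a purely exponential series every d gives the same ratio; on real data the estimates differ, which is what makes the choice of d matter.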

Regression Analysis
Each $m^d$ can be used to predict the number of infected people at time t. However, predicting with a combination of these values can increase the predictive power. Therefore, this study presents an equation for prediction through regression analysis. The full regression model can be expressed as follows:

$$\ln I^p(t) = \beta_0 + \beta_1 \ln I^p_7(t) + \beta_2 \ln I^p_{14}(t) + \beta_3 \ln I^p_{21}(t) + \beta_4 \ln I^p_{28}(t) \tag{6}$$

where

$$\ln I^p_d(t) = \ln I(t - d) + m^d_{t-d} \times d, \qquad d = 7, 14, 21, 28.$$

By applying the backward elimination method, we obtain the final results in Table 2. The regression model is well fitted, with adj-$R^2$ = 0.969. Equation (6) can be reduced to the 1-week and 4-week terms, with the coefficient estimates given in Table 2:

$$\ln I^p(t) = \beta_0 + \beta_1 \ln I^p_7(t) + \beta_4 \ln I^p_{28}(t) \tag{7}$$

Using data from 1 May 2020 to 10 August 2021, Figure 4 shows the actual number of infected people, the predicted value ($I^p$) obtained from Equation (7), and the predicted value ($I^p_7$) obtained using $m^7$ alone. Figure 4 shows that $I^p$ is closer to the true value and less volatile than $I^p_7$. It also follows the same pattern as the actual number of infected people as that number fluctuates. The prediction of these movements shows that it is possible to predict the change in the number of infected people due to the variants. Overall, the proposed regression model fits the actual values well.

However, because of the $I^p_7$ term, the prediction interval of this equation based on current information is as short as 1 week. To establish a strategy to deal with the current COVID-19 situation, a forecast of at least 4 weeks is required. To achieve this, a forecast of $m^7$ for the next 3 weeks must be included. To forecast the value of $m^7$, we adopt the LSTM method, the most popular deep learning model in time series prediction.
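The reduced regression, which regresses the log of the actual number of infected people on the logs of the 1-week and 4-week predictions, can be sketched as an ordinary least-squares fit. The Python below is a dependency-free illustration that solves the normal equations directly; it does not perform backward elimination, and the helper names solve_linear and fit_ols, as well as the synthetic data, are hypothetical. It does not reproduce the coefficients in Table 2.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_ols(features, target):
    """Least-squares fit of target = b0 + sum(b_k * feature_k).

    Returns [b0, b1, ...]; an intercept column is added automatically.
    """
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve_linear(xtx, xty)


# Synthetic stand-ins for ln I_7^p(t) and ln I_28^p(t) (illustration only).
demo_features = [(5.0 + 0.1 * t, 5.0 + 0.2 * ((7 * t) % 11)) for t in range(30)]
# Target built from known coefficients so the fit can be checked.
demo_target = [0.5 + 0.7 * p7 + 0.3 * p28 for p7, p28 in demo_features]

b0, b1, b4 = fit_ols(demo_features, demo_target)
print(f"b0={b0:.3f}, b1={b1:.3f}, b4={b4:.3f}")
```

In practice a statistics package would normally be used instead, since it also reports the p-values that backward elimination relies on.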
The LSTM (Long Short-Term Memory) network consists of three types of layers: an input layer, an output layer, and hidden layers. The input layer and output layer are composed of one layer each, while the number of hidden layers can be set. In this study, the LSTM is run with three hidden layers, using Mean Squared Error as the loss function and Adam as the optimization algorithm. A batch size of 64 is used for training, and the number of epochs is fixed at 200. With this LSTM, we obtain $m^7$ values for the next 21 days. Using these values, we can obtain the predicted values of infected people from the 8th day to the 28th day after the last data point by Equations (5) and (7). The predicted values for the first 7 days after the last data point are based on current data. These values are shown in Figure 5. In Figure 5, we can see that the wave is heading towards a peak and will decline over the following week. A gradual decrease is expected rather than a sharp decrease. This indicates that an active response strategy is needed to reduce the number of infected people. Figure 5 also shows a weekly cycle. This cycle arises from the weekly variation in the number of inspections (tests) and is reflected in the forecast data.
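The 28-day roll-out described above can be sketched in Python as follows. This is one possible reading of the procedure, under several assumptions that are not stated in the text: the change ratio is the backward-looking log growth rate over d days, predictions for days beyond the first week reuse earlier predicted values in place of unobserved ones, and the 21 forecast 1-week ratios are supplied by some external forecaster (an LSTM in this study, a plain list here). The function names and the coefficients b0, b1, b4 are hypothetical placeholders, not the estimates in Table 2.

```python
import math


def change_ratio(log_i, t, d):
    """Assumed estimator: average log growth rate between t - d and t."""
    return (log_i[t] - log_i[t - d]) / d


def rollout(infected, m7_forecast, b0, b1, b4, horizon=28):
    """Predict the next `horizon` days by combining 1-week and 4-week terms.

    infected    : observed series, at least 56 days long
    m7_forecast : forecast 1-week ratios for the 21 days after the last point
    """
    log_i = [math.log(v) for v in infected]
    last = len(log_i) - 1
    path = list(log_i)  # observed values followed by predictions
    for h in range(1, horizon + 1):
        t = last + h
        # 4-week term: uses observed data only, since t - 28 <= last.
        base28 = t - 28
        p28 = log_i[base28] + change_ratio(log_i, base28, 28) * 28
        # 1-week term: observed ratio for days 1-7, forecast ratio afterwards.
        base7 = t - 7
        if base7 <= last:
            m7 = change_ratio(log_i, base7, 7)
        else:
            m7 = m7_forecast[base7 - last - 1]
        p7 = path[base7] + m7 * 7
        path.append(b0 + b1 * p7 + b4 * p28)
    return [math.exp(v) for v in path[last + 1:]]


# Synthetic history growing at 3% per day (illustration only, not real data).
GROWTH = 0.03
history = [100.0 * math.exp(GROWTH * day) for day in range(60)]

# Placeholder coefficients and a flat ratio forecast, both hypothetical.
forecast = rollout(history, [GROWTH] * 21, b0=0.0, b1=0.5, b4=0.5)
print(f"day +1: {forecast[0]:.1f}, day +28: {forecast[-1]:.1f}")
```

With a constant growth rate and coefficients that sum to one, the roll-out simply continues the exponential trend, which makes the sketch easy to sanity-check before real forecasts are plugged in.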

Discussion
This study focuses on short-term prediction as a strategic response to COVID-19 rather than predicting the end of the COVID-19 epidemic. Under the assumption that the short-term infection change ratio ($m^d$) is constant, this study was conducted based on the differential equation of the basic SIR model.
The first part of the study derived a short-term (1 to 4 week) prediction formula (3) for the number of infected people (I) as a function of the change ratio ($m^d$) and the current number of infected people. The change ratio of the number of infected people ($m^d$) induced by the SIR model can be simply calculated from the current data.
For more accurate prediction, a combination of short-term predictions was constructed through regression analysis, and a combination of the $m^d$ values of 1 week and 4 weeks was proposed. This formula (Equation (7)) had an explanatory power (adj-$R^2$) of 96%, indicating that it reflects the actual data well.
Finally, this formula can predict only 1 week ahead based on current data, so predicting the trend of the pandemic a month ahead requires a forecast of the change ratio for the number of infected people ($m^d$). For this forecast, we applied LSTM, the most representative time series analysis method. The forecast of infections in South Korea, obtained by applying the predicted change ratio ($m^d$), showed that the current wave is expected to peak soon and then decline gradually; the absence of a sharp decline indicates that the situation will remain difficult.
To date, COVID-19 has produced several large waves in South Korea, and long-term prediction methods have difficulty predicting these big waves. In contrast, the model presented in this study focuses on forecasting about 4 weeks ahead while reflecting various environmental changes. Thus, our approach will be useful for short-term COVID-19 response strategies.