A Novel βSA Ensemble Model for Forecasting the Number of Confirmed COVID-19 Cases in the US

Abstract: In December 2019, Severe Special Infectious Pneumonia (COVID-19), caused by the novel coronavirus SARS-CoV-2, appeared for the first time, breaking out in Wuhan, China, and the epidemic spread across the world within a very short period of time. According to WHO data, ten million people have been infected, and more than one million people have died; moreover, the economy has also been severely hit. In an outbreak of an epidemic, people are concerned about the final number of infections. Therefore, effectively predicting the number of confirmed cases in the future can provide a reference for decision-makers and help contain the spread of deadly epidemics. In recent years, the α-Sutte indicator method has proved to be an excellent predictor in short-term forecasting; however, the α-Sutte indicator uses fixed static weights. In this study, by adding an error-based dynamic weighting method, a novel β-Sutte indicator is proposed. Combined with ARIMA as an ensemble model (βSA), it is evaluated experimentally on forecasting the daily cumulative number of COVID-19 cases and the number of new cases in the US. The experimental results show that the forecasting accuracy of the proposed βSA is better than that of other methods in terms of MAPE and RMSE. This demonstrates the feasibility of adding error-based dynamic weights to the β-Sutte indicator in the area of forecasting.

Author Contributions: Conceptualization, D.-H.S. and D.C.Y.; Data curation, T.-W.W., M.-H.S. and M.-J.Y.; T.-W.W. and M.-J.Y.; D.-H.S.; Methodology, D.-H.S. and M.-H.S.;


Introduction
In December 2019, Severe Special Infectious Pneumonia (COVID-19), caused by the novel coronavirus SARS-CoV-2, appeared for the first time, breaking out in Wuhan City, Hubei Province, China. To curb the spread of the epidemic, the Chinese government authorities imposed a lockdown on Wuhan; however, the virus had already spread to all continents through global transportation. According to the World Health Organization (WHO), COVID-19 is an emerging disease that is transmitted from human to human and is extremely contagious; it has infected more than ten million people and caused more than one million deaths around the world. Due to the impact of COVID-19 on people's lives, the general public and the government are concerned about how many people will eventually be infected [1]. Many academics have invested in research on COVID-19, for example, to forecast the future daily and monthly numbers of confirmed cases [2][3][4], or to discover related factors that affect the severity of the epidemic and cause death [5,6].
The cumulative number of confirmed COVID-19 cases in the United States ranks first in the world, accounting for 20% of global infections (https://covid19.who.int/, accessed on 10 July 2021). If the number of confirmed cases in the U.S. can be predicted correctly, it may also indicate the future trend of confirmed COVID-19 cases worldwide.
Forecasting methods deal with uncertainties about the future, which is crucial in helping decision-makers to make reasonable decisions and plan activities. In various fields of society, effective and highly accurate forecasts are considered important prerequisites for the effective management of an organization [7].
In recent years, the α-Sutte indicator method [8] has been an excellent predictor in short-term forecasting; however, the α-Sutte indicator uses fixed static weights. Weighted models usually forecast more accurately than their unweighted counterparts. Al-Dahidi, Baraldi, Zio, and Legnani [9] showed that adding dynamic weights to an ensemble model led to better results than the original. Since the α-Sutte indicator uses static weights in forecasting, this study adds dynamic weights to the α-Sutte indicator and combines it with the ARIMA method to form an ensemble forecasting model. The forecasting targets are the daily cumulative number of confirmed COVID-19 cases and the number of new COVID-19 cases in the US; the five worst-hit states with the largest number of cumulative confirmed cases are also included for evaluation. It is hoped that the results of this study can help address the lack of diversity in previous COVID-19-related research.
The rest of the paper is organized as follows. Section 2 presents the improved β-Sutte indicator and the βSA ensemble model based on the α-Sutte indicator. Section 3 describes the experimental process and the model evaluation metrics. Section 4 presents and discusses the one-day-ahead and five-days-ahead forecasting results, followed by the conclusion.

Data Collection
The data for this study were obtained from the Centers for Disease Control and Prevention (CDC) in the United States (https://covid.cdc.gov/covid-data-tracker/ accessed on 10 July 2021). The variables and definitions of the dataset are shown in Table 1. The dataset does not provide the daily cumulative number of confirmed cases or the number of new cases for the U.S. as a whole, and the daily cumulative number of confirmed cases of individual states contains missing values. This study therefore sums the cumulative number of confirmed cases (conf_cases) and the number of new cases (new_case) over all states to obtain the national series. The five worst-hit states by cumulative number of confirmed cases are selected for evaluation.
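The aggregation described above can be sketched as follows. This is a minimal illustration and not the preprocessing actually used in the study: the record layout, the `date` and `state` keys, the toy numbers, and the choice to skip missing (`None`) values are all assumptions; only the field names `conf_cases` and `new_case` come from the dataset description.

```python
from collections import defaultdict


def national_series(rows, field):
    """Sum a state-level field over all states for each date.

    rows: iterable of dicts with hypothetical keys "date" and "state"
    plus the numeric field to aggregate. Missing values (None) are
    skipped, which is an assumption of this sketch.
    """
    totals = defaultdict(int)
    for row in rows:
        value = row.get(field)
        if value is not None:
            totals[row["date"]] += value
    # ISO-formatted date strings sort chronologically as plain text.
    return dict(sorted(totals.items()))


def worst_hit_states(rows, field, top_n=5):
    """Rank states by the largest reported value of a cumulative field."""
    peak = defaultdict(int)
    for row in rows:
        value = row.get(field)
        if value is not None:
            peak[row["state"]] = max(peak[row["state"]], value)
    return sorted(peak, key=peak.get, reverse=True)[:top_n]


# Made-up records for illustration only.
rows = [
    {"date": "2020-07-25", "state": "IL", "conf_cases": 100, "new_case": 10},
    {"date": "2020-07-25", "state": "OH", "conf_cases": 80, "new_case": None},
    {"date": "2020-07-26", "state": "IL", "conf_cases": 112, "new_case": 12},
    {"date": "2020-07-26", "state": "OH", "conf_cases": 85, "new_case": 5},
]
us_cumulative = national_series(rows, "conf_cases")
us_new = national_series(rows, "new_case")
top_states = worst_hit_states(rows, "conf_cases", top_n=1)
```

Skipping missing values understates a cumulative total on days where a state did not report, so a real pipeline might instead carry the last reported value forward.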

The α-Sutte and Proposed β-Sutte Indicator
Definitions and descriptions of the notation used in this study are shown in Table 2.

Table 2. Notation of symbols.
Notation: Definition
d(t): The forecasting value at the t-th day
ω_a(t), ω_b(t), ω_g(t): The dynamic weighting functions

The α-Sutte indicator was proposed in 2017, and it can be used to predict a variety of different time-series data [8]. During the forecasting process, the α-Sutte indicator uses only the previous four data points (γ, β, α, δ) to forecast the next point; therefore, it is flexible enough to be used with any type of data [8]. The α-Sutte indicator is given by Equation (1). As can be seen in Equation (1), the α-Sutte indicator applies a static weight of 1/3 to each of three different error terms to make the final forecast. In this study, a novel forecasting indicator that uses dynamic weighting, the β-Sutte indicator, is proposed.
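For reference, the α-Sutte indicator is usually stated in the literature on the method [8] in roughly the following form. The symbols a_t, Δx, Δy, and Δz follow that literature as we understand it and are assumptions here; they may differ from the notation used in Equation (1) of this study.

```latex
a_t = \frac{\alpha\left(\dfrac{\Delta x}{(\alpha+\delta)/2}\right)
          + \beta\left(\dfrac{\Delta y}{(\beta+\alpha)/2}\right)
          + \gamma\left(\dfrac{\Delta z}{(\gamma+\beta)/2}\right)}{3},
\qquad
\begin{aligned}
\delta &= a_{t-4}, & \alpha &= a_{t-3}, & \beta &= a_{t-2}, & \gamma &= a_{t-1},\\
\Delta x &= \alpha-\delta, & \Delta y &= \beta-\alpha, & \Delta z &= \gamma-\beta .
\end{aligned}
```

Written this way, the division by 3 is the fixed weight of 1/3 shared equally by the three terms, which is the static weighting that the β-Sutte indicator replaces.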
To ensure the clarity of our proposed β-Sutte indicator, a(t), b(t), and g(t) are defined as:

Abdollahi and Ebrahimi [10] compared three ways of assigning weights to three different methods in an ensemble model: average weights, a weighting method based on the error value, and a genetic algorithm. They pointed out that the ensemble using the genetic algorithm performed best, the error-based weighting method second best, and average weights worst. Abdollahi and Ebrahimi [10] believe that a large part of the success of their model depends on the choice of weighting method. Considering calculation time and cost, the authors of this study believe that the error-based weighting method used by Abdollahi and Ebrahimi [10] is an effective way to improve forecasting accuracy without excessive cost; the principle is that a method that produces a higher error is assigned a smaller weight. The dynamic weighting functions are obtained from the average of the estimated errors of a(t), b(t), and g(t) at three different times: days t − 1, t − 2, and t − 3. Therefore, the dynamic weighting functions of our proposed β-Sutte indicator are defined as follows:

Then, our proposed β-Sutte indicator is given by Equation (2):

$$d(t) = \omega_a(t)\cdot a(t) + \omega_b(t)\cdot b(t) + \omega_g(t)\cdot g(t) \qquad (2)$$
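The principle stated above (a component with a higher recent error receives a smaller weight) can be sketched as follows. This is only an illustration under stated assumptions: normalized inverse mean absolute error is one common way to realize the principle and is not necessarily the weighting function used in this study; the function names and all numbers are hypothetical.

```python
def mean_absolute_error(estimates, actuals):
    """Average absolute error over the compared days (here t-3, t-2, t-1)."""
    return sum(abs(e - y) for e, y in zip(estimates, actuals)) / len(actuals)


def error_based_weights(errors, floor=1e-12):
    """Map errors to weights that sum to 1; a larger error gives a smaller weight.

    Assumption of this sketch: weights are proportional to 1 / error.
    `floor` avoids division by zero when a component was exact.
    """
    inverse = [1.0 / max(err, floor) for err in errors]
    total = sum(inverse)
    return [value / total for value in inverse]


def weighted_forecast(components, weights):
    """Combine the component values a(t), b(t), g(t) with their weights."""
    return sum(w * c for w, c in zip(weights, components))


# Illustrative numbers only: actual values on days t-3, t-2, t-1 and the
# values each component would have produced for those days.
actuals = [100.0, 104.0, 108.0]
past_estimates = {
    "a": [101.0, 105.0, 109.0],  # mean absolute error 1
    "b": [102.0, 106.0, 110.0],  # mean absolute error 2
    "g": [104.0, 108.0, 112.0],  # mean absolute error 4
}
errors = [mean_absolute_error(past_estimates[k], actuals) for k in ("a", "b", "g")]
weights = error_based_weights(errors)

# Hypothetical component values for day t, combined with the dynamic weights.
forecast = weighted_forecast([113.0, 114.0, 116.0], weights)
```

With equal errors this scheme falls back to weights of 1/3 each, i.e., the static weighting of the α-Sutte indicator.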

Autoregressive Integrated Moving Average (ARIMA)
The autoregressive integrated moving average (ARIMA) model was introduced by George Box and Gwilym Jenkins in 1976. An ARIMA model is generally written as ARIMA(p, d, q), where p is the order of the autoregressive (AR) process, d is the degree of differencing, and q is the order of the moving average (MA) process.
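To make the roles of p, d, and q concrete, the general ARIMA(p, d, q) model can be written in standard textbook form as below. The symbols are assumptions of this illustration rather than notation from this study: B is the backshift operator (B y_t = y_{t−1}), φ_i and θ_j are the AR and MA coefficients, c is a constant, and e_t is white noise.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^{i}\right)(1 - B)^{d}\, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) e_t
```

For example, d = 1 means the model is fitted to the day-to-day differences y_t − y_{t−1}, which is a natural choice for a steadily growing cumulative case count.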
Abolmaali & Shirzaei [11] have compared the results of different models (SIR Model, linear regression, logistic function, ARIMA) in the prediction of confirmed COVID-19 cases in 2021. Although the linear regression model performs well in short-term prediction, overall, the ARIMA model is still better than other models. Therefore, we chose the ARIMA model to compare with our proposed β-Sutte indicator and βSA ensemble model.

βSA Ensemble Model
The βSA Ensemble model is mainly based on the SutteARIMA prediction method proposed by Ahmar & Del Val [12]. SutteARIMA combines α-Sutte indicator and ARIMA; as such, the prediction result of SutteARIMA is the average of α-Sutte indicator and ARIMA. In terms of prediction results, SutteARIMA is better than ARIMA, but the results are quite close. The proposed βSA ensemble model tries to combine the proposed β-Sutte indicator with ARIMA, and the prediction result of the βSA ensemble model is the average of the two prediction results.
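The averaging step described above can be sketched in a few lines. The function name and the example values are hypothetical; the two input sequences stand for forecasts already produced by the β-Sutte indicator and by ARIMA for the same days.

```python
def beta_sa_forecast(beta_sutte_values, arima_values):
    """Average two forecast sequences element by element."""
    if len(beta_sutte_values) != len(arima_values):
        raise ValueError("both forecast sequences must cover the same days")
    return [(b + a) / 2.0 for b, a in zip(beta_sutte_values, arima_values)]


# Illustrative values only.
ensemble = beta_sa_forecast([100.0, 110.0], [104.0, 108.0])
```

A simple mean gives both members equal influence, so the ensemble error on any given day is never worse than that of the poorer of the two members.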

Data
Data regarding confirmed US COVID-19 cases were obtained from the Centers for Disease Control and Prevention (CDC) in the United States. Because of the large number of vaccines used in the United States after July 2021, only data from 25 July 2020 to 30 June 2021 are used for method evaluation. This study proposes an improved β-Sutte indicator and a βSA ensemble model, both based on the α-Sutte indicator, to forecast the cumulative number of confirmed cases and the daily number of newly confirmed cases of COVID-19 in the U.S. over different forecast periods (one day and five days). They are expected to outperform the α-Sutte indicator and ARIMA on the model evaluation metrics.

Mathematics 2022, 10, 824

Metrics
For the evaluation of the forecasting methods, we applied two forecasting accuracy measures: mean absolute percentage error (MAPE) and root mean square error (RMSE) [13]. For both measures, smaller values are better.
MAPE and RMSE are defined as follows, where $\hat{y}_i$ are the predicted values, $y_i$ are the real values, and $n$ is the number of forecasts:

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right| \qquad (3)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2} \qquad (4)$$

MAPE and RMSE are used to evaluate the quality of the model, and their values range from 0 to infinity. MAPE focuses on percentage errors, while RMSE is more sensitive to the structure of the data (numerical units, outliers). For both metrics, smaller is better, but there is no absolute reference value [14]. The forecasting results were obtained using R with the forecast and SutteForecastR packages.
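The two metrics can be computed as in the following sketch, which uses the standard definitions of MAPE and RMSE. It is written in Python purely for illustration (the study itself used R), and the function names and sample values are hypothetical.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is zero, which can matter for
    daily new-case counts.
    """
    n = len(actual)
    return 100.0 / n * sum(abs((y - p) / y) for y, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean square error, in the same units as the data."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)


# Illustrative values only: each forecast is off by 10 percent.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
example_mape = mape(actual, predicted)
example_rmse = rmse(actual, predicted)
```

In this example both forecasts are off by the same percentage, yet the second one contributes four times as much to the squared error, which illustrates why RMSE depends on the scale of the data while MAPE does not.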

Flow Chart of Experiment
This section will give a detailed description of the experiment process in this study, including data pre-processing, dataset split, weight training, the α-Sutte indicator, the β-Sutte indicator, ARIMA, and βSA. The evaluation flow chart is shown in Figure 1.

The detailed experiment process is as follows:

1. The data set used in this study does not provide the daily cumulative total number of confirmed cases in the U.S. Therefore, this study sums the daily cumulative total number of cases (tot_cases) of all states to calculate the daily cumulative total number of confirmed cases in the U.S. In addition, the five worst-hit states by cumulative number of confirmed cases are selected for evaluation.

2. Split the preprocessed data set into training days from d(t − 7) to d(t − 1) and a testing day d(t). The time window is set to eight days (seven training days plus one testing day), and the sliding step is set to one day.

3. Employ the dynamic weighting method based on the error value, and calculate the training weights ω_a(t), ω_b(t), ω_g(t) according to their error functions ε_a(t), ε_b(t), ε_g(t).

4. Use the sliding window method to obtain the moving dynamic weights; the β-Sutte indicator requires seven training days. Incorporate the obtained dynamic weights into the β-Sutte indicator to predict d(t).

5. Apply the ARIMA method to the same dataset, then average the results of the β-Sutte indicator and ARIMA to obtain the final result of the βSA ensemble model. In addition, the α-Sutte indicator is used for comparison with the other models over different forecast periods (one day ahead and five days ahead).

6. Compare and discuss the results of the different methods (α-Sutte indicator, β-Sutte indicator, ARIMA, and βSA ensemble model) using the evaluation metrics RMSE and MAPE. It is expected that the β-Sutte indicator and the βSA ensemble model perform better than the α-Sutte indicator and ARIMA on the model evaluation metrics.
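The sliding-window procedure in the steps above can be sketched as follows. This is a minimal illustration, assuming that any forecasting method can be represented as a function from a seven-day training window to a one-day-ahead forecast; the names are hypothetical, and `naive_last_value` is only a placeholder and is not one of the four methods evaluated in this study.

```python
def sliding_window_forecasts(series, forecaster, train_size=7):
    """Slide a window one day at a time and forecast the day after it.

    For each testing day t, the window covers days t-train_size .. t-1.
    Returns the forecasts and the matching actual values.
    """
    predictions, actuals = [], []
    for t in range(train_size, len(series)):
        window = series[t - train_size:t]
        predictions.append(forecaster(window))
        actuals.append(series[t])
    return predictions, actuals


def naive_last_value(window):
    """Placeholder forecaster: repeat the most recent observation."""
    return window[-1]


# Illustrative series of ten daily values: days 0..6 form the first
# training window, and days 7, 8, and 9 are forecast in turn.
series = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
predictions, actuals = sliding_window_forecasts(series, naive_last_value)
```

The paired `predictions` and `actuals` lists are what the evaluation metrics are then computed on, once per method and per region.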

One-Day-Ahead Forecasting of the Cumulative Number of Confirmed Cases
One-day-ahead forecasting of the cumulative number of confirmed cases in the US is evaluated in this section. The five worst-hit states (IL, OH, GA, PA, AZ) by cumulative number of confirmed cases are also included for comparison. Overall, four forecasting methods (α-Sutte indicator, β-Sutte indicator, ARIMA, and βSA) are employed, each using a sliding window with seven days of training and one-day-ahead forecasting. The first training period is 25 July 2020 to 31 July 2020; the training interval then slides forward step by step. The testing period is 1 August 2020 to 31 December 2020. The one-day-ahead forecasting results for the daily cumulative number of confirmed cases, measured by MAPE and RMSE, are shown in Table 3.

Five Days Ahead Forecasting of the Cumulative Number of Confirmed Cases
Khan et al. [15] proposed that a flexible framework will help relevant departments formulate policies by predicting new infections of COVID-19 after 5 and 10 days. To further observe the forecasting capability of our proposed β-Sutte indicator and βSA ensemble model, the five-days-ahead prediction of the cumulative number of confirmed cases is also evaluated. Because the actual data for the days ahead cannot be obtained in advance when forecasting, we assume that the dynamic weights in Equation (2) remain fixed for the four days of forecasting that follow d(t). Therefore, the forecasts for the other four days are defined as Equation (5):

$$d(t+j) = \omega_a(t)\cdot a(t+j) + \omega_b(t)\cdot b(t+j) + \omega_g(t)\cdot g(t+j) \qquad (5)$$

where a(t + j), b(t + j), g(t + j) are calculated from the previous forecasting value d(t + j − 1), j = 1, 2, 3, 4. Thus, five-days-ahead forecasting of the cumulative number of confirmed cases in the US and the five worst-hit states (IL, OH, GA, PA, AZ) using the four forecasting methods (α-Sutte indicator, β-Sutte indicator, ARIMA, and βSA) is evaluated. The data set and sliding window setting are the same as in Section 4.1.
The five-days-ahead evaluation results in terms of MAPE and RMSE are shown in Table 4. As an example, one-month forecast trends for the US and Illinois are shown in Figures 8 and 9.
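The recursive scheme of Equation (5), in which the weights stay fixed while each new forecast is fed back as input for the next day, can be sketched as follows. This is an illustration under stated assumptions: `toy_components` is a hypothetical stand-in that extrapolates the last value by each of the three most recent differences and is not the definition of a(t), b(t), g(t) used in this study; the weights and data are made up.

```python
def recursive_forecast(history, component_fn, weights, horizon=5):
    """Forecast `horizon` days ahead with weights held fixed.

    Each forecast is appended to the working history so that the next
    day's components are computed from forecast values, not actual data.
    """
    extended = list(history)
    forecasts = []
    for _ in range(horizon):
        a, b, g = component_fn(extended)
        value = weights[0] * a + weights[1] * b + weights[2] * g
        forecasts.append(value)
        extended.append(value)
    return forecasts


def toy_components(values):
    """Hypothetical components built from the last four points."""
    last = values[-1]
    return (
        last + (values[-3] - values[-4]),
        last + (values[-2] - values[-3]),
        last + (values[-1] - values[-2]),
    )


# Illustrative data: a series growing by 2 per day, with made-up weights.
history = [10.0, 12.0, 14.0, 16.0]
fixed_weights = [0.5, 0.3, 0.2]
five_day = recursive_forecast(history, toy_components, fixed_weights)
```

Because later steps build on earlier forecasts, any error made on the first day propagates through the remaining four, which is consistent with the concern that weights frozen at day t may become less suitable as the horizon grows.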
According to the MAPE and RMSE values in Table 4, the ARIMA method outperforms almost all other methods, and it even outperforms our proposed βSA ensemble model for five-days-ahead predictions in all regions. A possible reason is that the dynamic weights of Equation (2) are held fixed during the last four days of the forecast; they may need to be adjusted in some way rather than fixed. This question is left for future research.

Forecasting of the Daily Number of Newly Confirmed Cases
Since the cumulative number of confirmed cases in the US is almost a linearly increasing function, the capability of our proposed β-Sutte indicator and βSA ensemble model is also tested on a more strongly fluctuating series, namely the daily number of newly confirmed cases. Therefore, the daily number of newly confirmed cases in the US and in the five worst-hit states is adopted, and the four methods (α-Sutte indicator, β-Sutte indicator, ARIMA, and βSA) are used to test forecasting capability. The sliding window setting is the same as in Section 4.1. As the United States invested in a large number of vaccines after July 2021, to avoid the interference of this event, the forecasting period is set from 1 April 2021 to June 2021.
Figures 10-15 show the forecasting run charts of the daily number of newly confirmed cases produced by the four methods (α-Sutte indicator, β-Sutte indicator, ARIMA, and βSA) in each region; the red dots stand for the actual values. The MAPE and RMSE of the daily number of newly confirmed cases for the four forecasting methods are shown in Table 5. Table 5 shows that our proposed βSA ensemble model achieves the best MAPE and RMSE for the US, IL, and GA, whereas the ARIMA method takes the lead for OH, PA, and AZ. ARIMA and the proposed βSA ensemble can therefore be regarded as comparable in forecasting performance.