COVID-19 Peak Time Prediction via a Gradient Boosting Method

Abstract: The outbreak of COVID-19 caught humanity off guard. Peak times differ across countries depending on their characteristics and on the precautions taken by their governments. In this study, we aimed to determine the relative importance of indicators of the spread and to help countries that have not yet peaked estimate their peak times. A gradient boosting method was applied to 82 countries that had reached their peak. The findings indicate that hospital beds per thousand is the main predictor of peak time. Among governmental precautions, restrictions on gatherings and closing public transportation have the highest relative importance. The model can be extended with other indices and alternative machine-learning algorithms.


Introduction
The COVID-19 pandemic has been one of the major health crises humanity has faced in decades. COVID-19 is a novel coronavirus disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which appeared in Wuhan, China, in December 2019 [1]. By the end of July 2020, almost 16.5 million people had been infected worldwide and more than 652 thousand had died from the virus. Yet COVID-19 is neither the first pandemic nor the last. History has seen several pandemics, such as the Black Death, the Spanish flu, smallpox and HIV [2]. In the 21st century, however, global connectivity has allowed COVID-19 to spread across the world in a short time through heavy air traffic, international trade and tourism. Under these circumstances, public health authorities have had to manage several essential issues, such as identifying people at risk, controlling borders, monitoring tourism and surveilling active cases [3].
Researchers from various disciplines have carried out a tremendous amount of work with the shared goal of understanding the pharmaceutical and non-pharmaceutical dimensions of COVID-19. Especially in the first phases of the pandemic, exploring the characteristics of COVID-19 was crucial for identifying the virus and those at risk. Fu and colleagues (2020) examined 43 studies (with 3600 patients) that focused on the clinical characteristics of COVID-19 and found that the most common symptoms among patients are fever, cough and fatigue [4]. Furthermore, older people with comorbidities face the highest risk [5]. According to Johns Hopkins University, the worldwide mortality rate of COVID-19 was approximately 3.97% as of 27 July 2020, ranging from 0.1% to 28.5% across countries [6]. Unfortunately, there is still no cure for the virus, and the medications given to COVID-19 patients in hospitals have not been proven effective.
Although vaccine studies are under way in several countries, no vaccine has yet been publicly announced as ready. Under these circumstances, infection poses a serious threat: if public health strategies are not applied in the community, individuals remain very vulnerable. Which strategies are applied, and when, therefore makes a difference in the pattern of spread [7]. Moreover, strategies should help to flatten and lower the curves of COVID-19 cases and mortality [8]. Governments take precautions by implementing policies such as external border restrictions, school closures, public awareness campaigns, lockdowns, health monitoring and testing [9]. To our knowledge, the Government Response Index [10] and the Government Policy Activity Index [9] are the two indices in the literature that track government policies regarding COVID-19 across countries. These indices are valuable resources for policy makers and scientists seeking to understand the dynamics and effects of implemented policies. In this study, we used some indicators from the Government Response Index alongside attributes reflecting the human resources, economic background and cultural dynamics of each country. These variables were used to predict COVID-19 peak time by examining the curves of COVID-19 cases from the first day of occurrence. The main aims of this study are to examine the importance of the policies and dynamics regarding COVID-19 and to predict the peak time of countries that have not yet reached their peak.
In the next section, the data and methodology are introduced. The third section presents the analysis and results. The implications of the study are then addressed in the discussion section. Finally, the last section covers limitations and further research.

Data Sources
In this study, our data comprise two main dimensions: the first consists of restriction policies, and the second of characteristics of each country related to human capital, economics and habits. The restriction policies were selected from the containment and closure measures (school closing, workplace closing, cancelling public events, restrictions on internal movement, international travel controls, restrictions on gatherings, stay-at-home requirements and closing public transportation) and derived from the Government Response Index [10]. The remaining characteristics (hospital beds per thousand, GDP per capita, population density, handwashing facilities, life expectancy, population aged 65 and older, and median age) were obtained from Coronavirus Source Data [11]. In total, 173 countries were included in the analysis; their distribution by region and income group is shown in Table 1. Peak time is the target variable and was assessed by the authors from the daily COVID-19 case curves. Of the 173 countries, 83 had peaked, while 53 had not yet reached their peak. For the remaining 37 countries, the data were insufficient to judge whether they had peaked. Descriptions and scales of the variables are given in detail, with their sources, in Table A1.

Gradient Boosting Method
In this study, a gradient boosting regression tree method (GBM) was used to estimate COVID-19 peak time and to determine which factors affect it through their relative importance. Compared with a single decision tree, this method offers higher prediction accuracy and better model interpretability. GBM is a generalization of tree boosting that seeks to reduce accuracy and interpretability problems and to provide a precise and effective procedure for data mining [12]. Multiple models are developed sequentially, each emphasizing the training examples that are difficult to predict, which increases prediction accuracy. During the boosting process, examples that previous base models estimated poorly appear in the training data more often than those estimated correctly. Each additional base model is intended to correct the mistakes made by the previous base models [13].
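As a compact illustration of this sequential correction, the stagewise update can be written as below. This is the standard textbook form for a squared-error loss, which is an assumption here because the loss function used in the study is not stated. The symbols F_m (the model after m trees), h_m (the m-th base tree), r_im (the residual of observation i at step m), nu (a learning rate between 0 and 1) and M (the number of trees) are introduced only for this sketch.

```latex
\begin{align}
F_0(x) &= \bar{y}, \\
r_{im} &= y_i - F_{m-1}(x_i), \qquad i = 1, \dots, n, \\
h_m &= \arg\min_{h} \sum_{i=1}^{n} \bigl( r_{im} - h(x_i) \bigr)^2, \\
F_m(x) &= F_{m-1}(x) + \nu \, h_m(x), \qquad m = 1, \dots, M.
\end{align}
```

For a general differentiable loss, the residuals r_im are replaced by the negative gradient of the loss with respect to F_{m-1}(x_i), which is where the name gradient boosting comes from.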
In GBM, the model is built gradually and updated by minimizing the expected value of a loss function over a number of iterations (i.e., the number of trees). A fitted model can achieve an arbitrarily small training error by adding many trees. However, this can cause overfitting, because the model depends too heavily on the training data and generalizes poorly. The number of gradient boosting iterations can be controlled to prevent overfitting [13]. Along with the number of iterations, other parameters that directly affect the performance of the algorithm, such as the learning rate and tree complexity, should also be tuned.
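To make the roles of the number of trees and the learning rate concrete, the following is a minimal sketch of boosting with one-split regression trees (stumps) under squared-error loss. It is not the RapidMiner implementation used in this study; the function names, the synthetic data and the choice of stumps as base models are assumptions made for illustration. The sketch also accumulates each feature's squared-error reduction as a rough relative-importance score.

```python
import random


def fit_stump(X, residuals):
    """Return the one-split tree that best fits the residuals, or None."""
    n = len(residuals)
    total = sum(residuals)
    best = None  # (gain, feature, threshold, left_mean, right_mean)
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += residuals[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # no threshold separates equal values
            right_sum = total - left_sum
            # Reduction in the sum of squared errors achieved by this split.
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
            if best is None or gain > best[0]:
                best = (gain, j, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best


def boost(X, y, n_trees, learning_rate):
    """Fit stumps sequentially to the residuals of the current model."""
    f0 = sum(y) / len(y)  # initial prediction: the mean of the target
    pred = [f0] * len(y)
    stumps = []
    importance = [0.0] * len(X[0])
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break  # no feature can be split
        gain, j, thr, left, right = stump
        importance[j] += gain
        # Shrink each tree's contribution by the learning rate.
        step_left, step_right = learning_rate * left, learning_rate * right
        stumps.append((j, thr, step_left, step_right))
        pred = [p + (step_left if x[j] <= thr else step_right)
                for p, x in zip(pred, X)]
    return f0, stumps, importance


def predict(model, x):
    f0, stumps, _ = model
    return f0 + sum(left if x[j] <= thr else right
                    for j, thr, left, right in stumps)


def training_mse(model, X, y):
    return sum((yi - predict(model, x)) ** 2 for x, yi in zip(X, y)) / len(y)


# Synthetic data (hypothetical): two informative features and one constant one.
random.seed(0)
X = [[random.uniform(0, 10), random.uniform(0, 5), 1.0] for _ in range(60)]
y = [20 + 8 * x[0] - 3 * x[1] + random.gauss(0, 2) for x in X]

few_trees = boost(X, y, n_trees=50, learning_rate=0.01)
many_trees = boost(X, y, n_trees=400, learning_rate=0.01)
print("training MSE, 50 trees: ", training_mse(few_trees, X, y))
print("training MSE, 400 trees:", training_mse(many_trees, X, y))

total_gain = sum(many_trees[2])
print("relative importance:", [g / total_gain for g in many_trees[2]])
```

In this toy setting the training error keeps falling as trees are added, which is why the number of iterations has to be chosen on held-out data rather than on training error. The constant third feature never yields a split and therefore receives zero importance.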

Analysis and Results
First, during data cleaning, one country was removed from the data set because some predictor variables were missing. In light of the considerations above, and after trying combinations of the parameters (tree complexity, learning rate and number of iterations), we decided to use 400 trees with a learning rate of 0.01 to avoid overfitting. Furthermore, because of the limited sample size (n = 82 countries), a leave-one-out strategy was used to validate the GBM models. In each iteration, the algorithm used 81 data points for training and the remaining data point for testing. The academic version of RapidMiner was used to conduct the analysis. According to the results, some variables have strong effects, while others have no importance for the prediction of COVID-19 peak time. The GBM algorithm assigns each predictor variable a relative score indicating its importance in building the trees and ignores some variables completely [12]. In our model, the school closing, workplace closing, cancel public events and restrictions on internal movement variables have no importance for the prediction of COVID-19 peak time. The remaining variables (hospital beds per thousand, close public transport, GDP per capita, population density, handwashing facilities, life expectancy, aged 65 older, median age, international travel controls, restrictions on gatherings, income group and stay at home requirements) are shown in descending order of score in Figure 1. The coefficient of determination is 0.832, meaning that our model explains 83.2% of the variance in COVID-19 peak time. Moreover, the root mean squared error of the model is 23.901.
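The leave-one-out procedure and the two reported metrics can be sketched as follows. This is an illustration only: the study was run in RapidMiner, whose exact metric definitions are not given here, so the usual definitions of the root mean squared error and the coefficient of determination (one minus the ratio of residual to total sum of squares), computed from the pooled held-out predictions, are assumed. The data are synthetic, and a one-variable least-squares line stands in for the boosted model to keep the example short; `leave_one_out`, `fit_line` and `predict_line` are hypothetical helper names.

```python
import math
import random


def leave_one_out(X, y, fit, predict):
    """Hold out each observation once; train on the rest and predict it."""
    held_out = []
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        held_out.append(predict(model, X[i]))
    return held_out


def rmse(y, preds):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, preds)) / len(y))


def r_squared(y, preds):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, preds))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def fit_line(X, y):
    """Stand-in regressor: least-squares line on the first feature."""
    xs = [row[0] for row in X]
    mean_x, mean_y = sum(xs) / len(xs), sum(y) / len(y)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (t - mean_y) for x, t in zip(xs, y)) / sxx
    return slope, mean_y - slope * mean_x


def predict_line(model, x):
    slope, intercept = model
    return intercept + slope * x[0]


# Synthetic data with 82 observations, so each fold trains on 81 points.
random.seed(1)
X = [[random.uniform(0, 10)] for _ in range(82)]
y = [30 + 12 * row[0] + random.gauss(0, 5) for row in X]

preds = leave_one_out(X, y, fit_line, predict_line)
print("held-out predictions:", len(preds))
print("RMSE:", rmse(y, preds))
print("R-squared:", r_squared(y, preds))
```

Any regressor can be passed in through the `fit` and `predict` arguments. With n = 82 the loop refits the model 82 times, which is affordable for a sample of this size.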
The peak times of the non-peaked countries were predicted with the algorithm; the predicted times are given in Figure A1 in Appendix A. According to our results, the longest times to peak are estimated for Nigeria, Indonesia, India and the Philippines. On the other hand, Uruguay, Bosnia and Herzegovina, and Poland are expected to reach their peak day sooner than the other non-peaked countries.

Discussion
Among 82 countries, the relative importance of government policies and country characteristics was examined with respect to COVID-19 peak times. Most country characteristics were more influential than government policies in determining peak times. The highest relative importance was found for hospital beds per thousand, which makes it the main determinant in predicting a country's time to peak. If there are not enough hospital beds, patients cannot be kept where they are supposed to be and continue to spread the virus. Life expectancy, income group, GDP per capita and the percentage of the population aged 65 and older follow with similar scores, while population density and median age have relatively lower importance in identifying COVID-19 peak times. Moreover, contrary to our expectations, handwashing facilities, the only indicator of the cultural dynamics of countries, has only mild importance in estimating peak times. Although hand contact is one of the transmission routes of COVID-19, the mild importance of handwashing facilities suggests that the transmission rate through hand contact might not be as high as scientists have stated. Among restrictions, in line with our expectations, the highest score belongs to restrictions on gatherings, followed by closing public transport, because areas with high human density raise the infection rate through high circulation and close contact between people. Furthermore, stay-at-home requirements, workplace closures and international travel controls have only slight importance for the estimation of peak times. On the other hand, in our model the importance of closing schools, cancelling public events and restrictions on internal movement was found to be zero. A closer look at the data shows that most countries took similarly strict actions in these areas. These results can be used by policy makers and government administrators.

Limitations and Further Research
This study contributes to the evaluation of COVID-19 precautions; however, it has several limitations. First, the COVID-19 literature is changing daily, even hourly. All of humanity is trying to contribute to the prevention and remediation of COVID-19. Therefore, updated data should be used when applying the proposed models. Second, by the nature of this study, the number of countries is limited, and with machine learning techniques a limited number of observations narrows the choice of methods and reduces prediction accuracy. Moreover, the restriction data in particular are scaled perceptually, which might affect the performance of the model; hence different indices might be employed in further research.
Author Contributions: Introduction and data, B.C.; methodology, analysis and results, E.C.; discussion, writing, review and editing, B.C. and E.C. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.