A Feature-Based Analysis for Time-Series Classification of COVID-19 Incidence in Chile: A Case Study

Abstract: The 2019 Coronavirus disease (COVID-19) pandemic is a current challenge for the world’s health systems aiming to control this disease. From an epidemiological point of view, the control of the incidence of this disease requires an understanding of the influence of the variables describing a population. This research aims to predict the COVID-19 incidence in three risk categories using two types of machine learning models, together with an analysis of the relative importance of the available features in predicting the COVID-19 incidence in the Chilean urban commune of Concepción. The classification results indicate that the ConvLSTM (Convolutional Long Short-Term Memory) classifier performed better than the SVM (Support Vector Machine), with results between 93% and 96% in terms of accuracy (ACC) and F-measure (F1) metrics. In addition, when considering each one of the regional and national features as well as the communal features (DEATHS and MOBILITY), it was observed that at the regional level the CRITICAL BED OCCUPANCY and PATIENTS IN ICU features positively contributed to the performance of the classifiers, while at the national level the features that most impacted the performance of the SVM and ConvLSTM were those related to the type of hospitalization of patients and the use of mechanical ventilators.


Introduction
Coronavirus Disease 2019 (COVID-19) is a disease caused by a coronavirus first identified in 2019 in Wuhan, Hubei, China. The disease initially appeared as pneumonia of unknown cause, and its pathogen, which had not previously been identified in humans, was termed Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) [1]. In some cases, the disease progresses to critical illness, including acute respiratory failure, pneumonia, renal failure, and even death [2,3]. In March 2020, when there were more than 118,000 cases in 114 countries worldwide, the World Health Organization (WHO) declared the COVID-19 outbreak a pandemic [4]. During the first half of 2020, Chile was one of the countries most affected by the COVID-19 pandemic, ranking sixth in the number of infected people among more than two hundred countries worldwide [5]. In this complex scenario, the Ministry of Health (MINSAL), as in many other countries, is involved in controlling the pandemic, following the indications of the WHO through testing, traceability, and isolation strategies [6].
COVID-19 is transmitted from person to person by the aerosols emitted by infected people when breathing, talking, sneezing, or coughing. This infectious mechanism, added to the fact that there may be infected asymptomatic people, makes its detection a vital strategy to control the spread of this disease [7]. Like other countries globally, the Chilean government has promoted public health strategies based on self-care, mobility reduction, testing, traceability, and isolation. Self-care measures include the use of masks, hand washing, and social distancing. At the same time, public management actions have been promoted to reduce the mobility of the population, such as quarantines, online classes, teleworking, and curfews. Among the efforts for infection control, there has been an increase in testing, an expansion in the capacity of ICU beds in hospitals, the opening of health residences, free vaccination for the population, an increase in human resources to strengthen health care, traceability of suspected as well as confirmed cases, and greater control to ensure compliance with measures. Such actions are toughened or relaxed, depending on the behavior and evolution of the disease. Among the actions to increase and reinforce testing, the government has implemented community active case-finding interventions in crowded places such as workplaces, shopping malls, and plazas [8]. Active case-finding is a community screening approach to detect active cases of COVID-19 that have not been promptly detected by spontaneous consultation, especially in those people who do not have or recognize symptoms or who, for whatever reason, have not consulted health care centers [9]. In this approach, health teams bring Polymerase Chain Reaction (PCR) testing to strategic points of the community. Reverse Transcription Polymerase Chain Reaction (RT-PCR) is one of the most frequently used laboratory methods to detect and confirm COVID-19 cases in symptomatic and asymptomatic patients. 
This test detects the presence of SARS-CoV-2 Ribonucleic acid (RNA) from a respiratory tract secretion sample using the RT-PCR [10]. In active case finding, it is crucial to determine the community points to be tested to increase the effectiveness of detection. Therefore, the Chilean government has urged researchers to find new methods to predict the progression of COVID-19 including the prediction of incidence and its behavior in the different regions and communes or districts of the country.
This research aims to predict the COVID-19 incidence in three risk categories: low risk (less than 10 new cases per day in 100,000 inhabitants), medium risk (between 10 and 25 new cases per day in 100,000 inhabitants), and critical risk (greater than 25 new cases per day in 100,000 inhabitants). To consider possible nonlinear relationships in the interaction of variables with the prediction of incidence, we propose a machine learning model, together with an analysis of the relative importance of the available features in predicting COVID-19 incidence. Taking the above into account, the main contributions of this study can be summarized as follows:
• The development of a three-level or three-category prediction model of COVID-19 incidence tested in a Chilean case study. In order to achieve this model of incidence levels, the time series problem was transformed into a classification problem. Four classifiers were then compared to predict these levels to select the one with the best performance;
• Support of the prediction of COVID-19 risk levels, considering an automatic analysis of epidemiological indicators (classifier features or variables) linked to the behavior of the disease in the community. This model is the initial attempt to create a national model that can guide the response and intensity of the active search effort to control COVID-19 according to the trends in communal, regional, and national features in Chile expressed in incidence risk levels.
The rest of the paper is organized as follows: Section 3 briefly describes the datasets, the available features, and the methods we used in this research. Section 4 presents the performances of the classification algorithms in terms of Accuracy (ACC), Precision (P), Recall (R), and F-measure (F1) together with an analysis of the relative importance of the available features in predicting the COVID-19 incidence. Section 5 shows an analysis of the results obtained and suggests possible future research to improve the present study.

Related Work
The SARS-CoV-2 coronavirus, which causes the COVID-19 disease, has caused a state of alert in all countries aiming to control this pandemic that has led to more than 2.84 million deaths in the world [11,12]. From an epidemiological point of view, one of the variables to be controlled is the incidence, a concept defined as the occurrence of new cases of the disease, generally measured in a population of 100,000 inhabitants [13,14]. The literature reveals research aimed at predicting the risk of COVID-19 using different epidemiological variables in the form of time series data [15-18]. Epidemiological, statistical, or machine learning models have been used to address the problem. Among the most widely used epidemiological models are the Susceptible, Infectious, or Recovered (SIR) model and its variations [19,20]. For example, Malavikaa et al. used the SIR model to estimate the maximum number of active cases and the peak period of COVID-19 in India [19]. The most commonly used statistical models are data correlation, probabilistic, and autoregressive models such as the Auto-Regressive Integrated Moving Average (ARIMA) [19,21-28]. Roy et al. used an ARIMA model to estimate the prevalence and incidence of COVID-19 in India [28]. Although authors using epidemiological or statistical methods for time series data reported good results, one of the main disadvantages of time series models is that they only perform well if there is a linear relationship between the variables analyzed [16,29-33]. In order to capture possible nonlinear relationships among the variables of interest, researchers have tried to use machine learning models [16,29-31,34-39].
For this purpose, some of the most commonly used algorithms are logistic regression, Support Vector Machine (SVM), Multilayer Perceptron (MLP), and, more recently, deep learning (DL)-based algorithms such as Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN). For example, Singh et al. used an ARIMA model and an SVM model to predict the daily confirmed cases of COVID-19 in the five most affected countries during the study period. The results obtained indicate that the SVM performed better than the ARIMA model in terms of ACC [30]. In another study, Shahid and Zameer used LSTM-based models, Support Vector Regression (SVR), and ARIMA to predict confirmed cases, deaths, and recovered cases in ten of the most affected countries during the study period. The results showed that models based on LSTM have a lower predictive error, measured as Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), than the SVR and ARIMA models [31]. In addition, Shastri et al. showed that the use of the Convolutional LSTM (ConvLSTM) improved the prediction of confirmed cases of COVID-19 in the USA and India, exhibiting performances of over 85% in terms of both ACC and F1 metrics [37].
Finally, one of the critical aspects to consider in an analysis of COVID-19 predictive models is the need to determine the importance of the features involved in the spread and risk of contagion of this disease. In this respect, Wilde et al. demonstrated an association between a high risk of mortality in patients admitted to the Intensive Care Unit (ICU) and increased occupancy of mechanically ventilated beds in England [40]. Other studies showed that mobility has a significant impact on the incidence of disease in the samples analyzed [34,41-45].

Datasets and Pre-Processing
The data for this study were obtained from the official COVID-19 repository of the Ministry of Science, Technology, Knowledge and Innovation (MICITEC), which contains national, regional, and communal data associated with COVID-19. We used the commune of Concepción as a case study for feature importance analysis and incidence classification (refer to Figure 1). This commune is located in one of the most densely populated regions of the country: the Biobío Region [46]. From the official COVID-19 repository, we used databases containing national summaries, regional summaries, and commune data for the period from April 2020 to March 2021.
At the commune level, we used the following features:
• Mobility: This feature was reported as the movements of mobile phones (antenna transitions) connected to Telefonica's national network, in a grouped and anonymous way. The internal mobility is measured as the evolution of travel occurring within the commune, and the external mobility is calculated as the movement from outside the commune into the commune. The mobility index corresponds to the number of trips within a specific commune normalized by the number of commune residents.
We standardized these datasets, i.e., we centered the values around the mean with unit standard deviation [47]. Standardization is one of the feature scaling techniques that aim to make the gradient descent converge faster to minimum values in neural network-based algorithms or make all features contribute equally in the case of distance-based algorithms such as SVM [48]. In addition to describing the variables provided in this section, we will perform a descriptive statistical analysis for a better understanding of the study.
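As an illustration of the standardization step described above, the following minimal sketch computes z-scores for a single feature. The function name `standardize`, the use of the population standard deviation, and the sample mobility values are assumptions made for the example; the source does not state which variant was used, nor whether the statistics were computed on training data only.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant feature has no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical mobility index values, for illustration only.
mobility = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(mobility)
```

After this transformation the feature has zero mean and unit standard deviation, so features measured on different scales (for example, deaths and mobility indexes) become comparable for a distance-based classifier.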

Problem Definition
This article presents an exploratory analysis of the possibility of predicting risk categories or classes of COVID-19 using freely available data in Chile as a way to guide the active search for cases (refer to Figure 2). The classes were defined based on the incidence risk level model proposed by the Harvard Global Health Institute (HGHI) [14] and the availability of case study data. Therefore, machine learning models classify the incidence rate at each time step into low risk (less than 10 new cases per day per 100,000 people), medium risk (10 to 25 new cases per day per 100,000 people), and critical risk (more than 25 new cases per day per 100,000 people).
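The mapping from an incidence rate to the three risk classes can be sketched as follows. The integer codes (0, 1, 2), the function names, and the treatment of the boundary values 10 and 25 as medium risk are assumptions for this example; the text gives the thresholds but not how exact boundary values are assigned.

```python
LOW, MEDIUM, CRITICAL = 0, 1, 2  # assumed integer codes for the three classes


def incidence_per_100k(new_cases, population):
    """Daily new cases scaled to a population of 100,000 inhabitants."""
    return new_cases * 100_000 / population


def risk_class(incidence):
    """Map daily incidence per 100,000 inhabitants to a risk class."""
    if incidence < 10:
        return LOW
    if incidence <= 25:  # boundaries 10 and 25 treated as medium (assumption)
        return MEDIUM
    return CRITICAL
```

For instance, 50 new cases in a hypothetical population of 200,000 give an incidence of 25 per 100,000, which this sketch labels as medium risk.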
Consider a dataset containing a multivariate time series of $n$ training samples $X = \{x_{tr}\}_{t=1}^{n}$, where $x_{tr} \in \mathbb{R}^{s \times f}$ represents a sample of $s$ historical time steps from the current value for each of the $f$ features of the problem. Consider also that each $t$-th sample is assigned a class $y_t \in L = \{0, 1, 2\}$ as a function of the incidence value for a given time step, and that $Y = \{y_t\}_{t=1}^{n}$ represents the set of all labels (refer to Figure 1). The aim of supervised training is to find a decision function $\delta(\cdot)$ that assigns a class $y_t$ to a test sample $x_t \in X_T$ according to $\delta(x_t) : x_t \in X_T \rightarrow y_t \in L$. Finally, consider a baseline model of each classification algorithm trained only with the communal features, whose decision function is denoted as $\delta_b(\cdot)$. This baseline model will be used to evaluate the importance of each of the features in the classification problem.
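The transformation of the time series into classification samples of shape s × f can be sketched as a sliding window. The function `make_windows` and its `horizon` parameter are hypothetical; in particular, the text does not say whether the label belongs to the last time step of the window or to a later one, so the default below (label of the last step) is an assumption.

```python
def make_windows(series, labels, s=4, horizon=0):
    """Build (X, y) from a multivariate series.

    series: list of per-time-step feature vectors (each of length f).
    labels: one class label per time step.
    Each X[k] holds s consecutive rows ending at time step t, and y[k] is
    the label at time step t + horizon (horizon=0 is an assumption).
    """
    X, y = [], []
    for t in range(s - 1, len(series) - horizon):
        X.append(series[t - s + 1 : t + 1])
        y.append(labels[t + horizon])
    return X, y


# Toy series with f = 2 features and 6 time steps, for illustration only.
series = [[i, 10 * i] for i in range(6)]
labels = [0, 0, 1, 1, 2, 2]
X, y = make_windows(series, labels, s=4)
```

With 6 time steps and s = 4, this yields 3 samples, each a 4 × 2 window.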

Classification Algorithms
For classification, we considered two of the most widely used supervised algorithms in time series classification: LSTM-based classifiers and SVM [16,30,31,36-39,49]. In the case of LSTM-based classifiers, we employed models based on simple LSTM, Bidirectional Long Short-Term Memory (BiLSTM), and ConvLSTM with typical parameters (refer to Figure 3 and Table 1). BiLSTM provides additional training by analyzing the data from left to right and right to left [50]. On the other hand, ConvLSTM is a type of LSTM that includes a convolution operation inside the cell [51]. Thus, such models are useful to analyze data sequences with spatio-temporal information [37,52,53]. The decision function of the LSTM-based classifiers relies on the value of the softmax function of the classification layer according to:

$$\delta(x_i) = \arg\max_{k \in \{1, \dots, l\}} \frac{e^{z_k}}{\sum_{j=1}^{l} e^{z_j}} \tag{1}$$

where $z = \{z_1, \dots, z_l\}$ is the intermediate output of the softmax layer for the test data point $x_i$. With regard to SVM, the decision function depends on $n_s$ vectors that help to form the class separation hyperplane (support vectors) with their respective weights $\alpha_j$ and classes $y_j$, and on a kernel function $K(\cdot)$ that makes it possible to transform the input data to another dimension, where $b$ is a scalar value and $\mathrm{sign}(\cdot)$ is the sign function [54-56]:

$$\delta(x_i) = \mathrm{sign}\left(\sum_{j=1}^{n_s} \alpha_j y_j K(x_i, x_j) + b\right) \tag{2}$$

In SVM classifiers, we apply the radial basis function (RBF) kernel to capture the nonlinear patterns generally present in temporal data such as COVID-19. The selected kernel is defined as $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$, with $\gamma$ a hyperparameter that controls the tradeoff between error due to bias and variance in the model, keeping the other parameters as default (refer to Table 1) [33,57]. Finally, in both classifiers we considered $s = 4$ time steps for each of the features of the $i$-th sample [58,59]. In the case of SVM, an average of the four time steps was considered to form an $f$-dimensional feature vector at each time instant.
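The pieces of the two decision functions described above can be illustrated with a short standard-library sketch: the softmax-based class choice, the RBF kernel, and the averaging of the s time steps into one f-dimensional vector for the SVM. All function names are hypothetical, and the kernel uses the squared Euclidean distance, which is the usual RBF definition; this is not the authors' implementation.

```python
import math


def softmax(z):
    """Softmax of a list of scores, shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_decision(z):
    """Index of the class with the highest softmax probability."""
    p = softmax(z)
    return max(range(len(p)), key=p.__getitem__)


def rbf_kernel(a, b, gamma):
    """RBF kernel exp(-gamma * ||a - b||^2) with squared Euclidean distance."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def average_window(window):
    """Collapse an s x f window into an f-dimensional vector (mean over time)."""
    s = len(window)
    return [sum(column) / s for column in zip(*window)]
```

Averaging the window discards the order of the four time steps, which is a deliberate simplification for the SVM input; the LSTM-based models instead consume the full s × f window.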

Evaluation of Algorithms
Owing to constraints in terms of the amount of available data, we considered the technique known as "walk forward validation" to evaluate the performance of the classification algorithms [60]. This technique allows us to perform a progressive evaluation of the performance using temporal data as new time-dependent observations become available (refer to Figure 4). In this case, in each iteration, the classifier's performance at the i-th sample was evaluated by our algorithm according to its features at that time instant. The metrics for evaluating the test samples are the ACC, P, R, and F1:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

$$P = \frac{TP}{TP + FP} \tag{4}$$

$$R = \frac{TP}{TP + FN} \tag{5}$$

$$F1 = \frac{2 \cdot P \cdot R}{P + R} \tag{6}$$

where TP and FP correspond to the true and false positives, while TN and FN correspond to the true and false negatives of classification. Note that P, R, and F1 are calculated for each one of the classes. In addition, our algorithm evaluates the training error of the classifiers in terms of the zero-one-loss metric (L) as follows [61]:

$$L = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(\hat{y}_i \neq y_i) \tag{7}$$

where $\hat{y}_i$ and $y_i$ represent the predictions and actual classes, respectively.
To measure the relative importance of the available features in the prediction of COVID-19 incidence, we proposed an iterative wrapper-style method that considers the following steps (refer to Figure 5). First, our approach creates a baseline model $\delta_b(\cdot)$ that considers only the communal feature subset to obtain an initial performance. Then, each of the regional and national features is added to the model individually (one at a time) to create a feature space. For each feature added to the model, the algorithm measures the performance difference compared to the performance of the baseline model. Finally, the performance differences are sorted from highest to lowest. A larger difference indicates that this variable is more relevant to the problem because it contributes more to the classifier's performance.
Figure 6 shows the boxplot of the commune and regional features.
This graph mainly shows internal mobility indexes that are larger than external ones, outliers in communal deaths (≥3), and a maximum occupancy of critical beds close to 80%. On the other hand, Figure 7 shows the correlation matrix of the communal and regional features, including incidence per 100,000 inhabitants. From this figure, we observed a high correlation between incidence and variables related to ICU patients, a moderate correlation between incidence and deaths (communal), and a low correlation between incidence and mobility.
As mentioned in the previous section, we used the HGHI model as a basis for defining incidence risk levels. This model classifies incidence into four levels: green (less than one new case per day per 100,000 people), yellow (one to nine new cases per day per 100,000 people), orange (ten to twenty-five new cases per day per 100,000 people), and red (more than 25 new cases per day per 100,000 people). The distribution of the incidence data across the levels proposed by Harvard is shown in Figure 8. From this figure, we can observe poor data availability in the green and yellow levels; since this study uses machine learning models, we reclassified the incidence data keeping the thresholds proposed by Harvard but combining the green and yellow categories into a single category named low risk. Then, the machine learning models classify the incidence into three classes: low risk (less than 10 new cases per day in 100,000 inhabitants), medium risk (between 10 and 25 new cases per day in 100,000 inhabitants), and critical risk (greater than 25 new cases per day in 100,000 inhabitants).
Table 2 shows the classification results of the different models implemented, based on an SVM and LSTM. Both types of classifier had a good overall performance in classifying the incidence classes, with results above 93% in all reported metrics. In all cases the F1 values were greater than ACC (≥94%).
In addition, the Precision values were higher than the Recall values, meaning that, for each class treated in turn as the positive class, the classifiers produced fewer false positives than false negatives. In all cases, LSTM-based classifiers performed better than or equal to the SVM. Additionally, we observed that the performance of ConvLSTM was better than that of the rest of the classifiers in all performance metrics.
Table 3 shows the average results of the classifier training error curves (refer to Equation (7)) shown in Figure 9, according to the validation scheme used [62,63]. From Table 3, we can see that in all cases the training error of SVM was lower than that of all LSTM-based classifiers. On the other hand, the largest error was obtained with the ConvLSTM-based classifier. These results indicate that LSTM-based classifiers could be less prone to overfitting than SVM because, although they presented a higher training error, they performed better when classifying test examples. In addition, the training error decreases noticeably as the number of training examples increases.
To measure feature significance in the performance of the classifiers, our approach individually incorporates regional and national features into the set of commune features, as mentioned in Section 3.4. Figure 10 shows the performance difference in terms of the zero-one-loss metric considering the set of commune features as the baseline. We note that, at the regional level, the CRITICAL BED OCCUPANCY feature ranks first in error difference, followed by the
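The feature-importance procedure referred to in this section (a baseline trained on commune features, one regional or national feature added at a time, and candidates ranked by the change in zero-one loss under walk-forward validation) can be sketched as follows. This is only an illustration under stated assumptions: a nearest-centroid classifier stands in for the SVM and LSTM-based models, the difference is taken as baseline loss minus augmented loss, and every function name, the `min_train` parameter, and the toy data are hypothetical.

```python
def zero_one_loss(y_true, y_pred):
    """Fraction of predictions that differ from the actual classes."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_centroids(X, y):
    """Stand-in classifier (assumption): mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in sums[c]] for c in sums}


def predict(centroids, row):
    """Class of the closest centroid; ties go to the smallest label."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[c]))
    return min(sorted(centroids), key=sq_dist)


def walk_forward_loss(X, y, min_train=3):
    """Train on samples [0, i) and test on sample i, for each later i."""
    preds = []
    for i in range(min_train, len(X)):
        model = fit_centroids(X[:i], y[:i])
        preds.append(predict(model, X[i]))
    return zero_one_loss(y[min_train:], preds)


def rank_features(base, extras, y, min_train=3):
    """Rank candidate features by loss reduction relative to the baseline."""
    base_loss = walk_forward_loss(base, y, min_train)
    gains = {}
    for name, column in extras.items():
        augmented = [row + [v] for row, v in zip(base, column)]
        gains[name] = base_loss - walk_forward_loss(augmented, y, min_train)
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: the baseline feature is uninformative, one candidate is not.
y = [0, 1, 0, 1, 0, 1, 0, 1]
base = [[0.0] for _ in y]
extras = {
    "informative": [float(label) for label in y],
    "noise": [0.0 for _ in y],
}
ranking = rank_features(base, extras, y)
```

In this toy setting the informative candidate removes all walk-forward errors, so it is ranked first, while the constant candidate leaves the baseline loss unchanged.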

Conclusions and Future Research
This article presents a model for predicting COVID-19 incidence in three risk categories: low, medium, and critical. Predicting the incidence in categories responds to the need to support the active search strategies carried out in the country for the detection and control of COVID-19. The risk levels of the model can be interpreted following the HGHI framework recommendations as follows: (i) the yellow level means a low transmission rate of the disease and orients decision makers to continue with testing and traceability measures; (ii) the orange level means that the virus transmission speed has increased, so that it is necessary to evaluate measures to increase active search efforts in communities at this risk level; and (iii) the red level means the speed of virus transmission is high, with it therefore being necessary to evaluate more aggressive measures to contain the spread of the disease in the community.
The SVM- and LSTM-based classifiers tested in the case study showed good performance, ranging between 93% and 96% in terms of ACC and F1. On the other hand, the zero-one-loss error decreases as more data are incorporated, with the SVM model having the lowest value. This result is explained by the small amount of data available to train and test the models, since LSTM-based classifiers usually require considerably more data. As demonstrated, these models could be potentially helpful to evaluate the risk of incidence in a community; however, they do not replicate the structure proposed by the original HGHI classification, so that further investigation and the incorporation of more data are recommended.
Taking into consideration the dynamic commune features (DEATHS and MOBILITY) as the baseline in each classifier, the importance of each of the regional and national features (refer to Figure 10) was analyzed. In this regard, it was observed that, in most cases, each of the commune and regional features positively impacted the performance of the classifiers. For example, at the regional level, CRITICAL BED OCCUPANCY and PATIENTS IN ICU rank first in terms of error difference with respect to those of the commune. On the other hand, in the case of national features, a more significant influence was found on the performance of classifiers in the use of features related to the type of hospitalization of patients and the use of mechanical ventilators. These results are consistent with the results of previous studies related to COVID-19 that show an association between COVID-19 and risk factors associated with patients admitted to the Intensive Care Unit (ICU) and occupancy of mechanically ventilated beds [34,40-45,64].
We would like to point out that this study has the following limitations: (i) The data utilized correspond to free reports from MICITEC, so that there is no control over the frequency and quality of the information reported, which resulted in low availability of data for analysis (approximately two weekly reports for the period of study). (ii) The progression of COVID-19 in Chile has not allowed us to have a reasonable amount of data to train a model that considers the low incidence categories proposed by HGHI; in fact, after evaluating several Chilean communes, only Concepción presented a data distribution that allowed us to identify at least three categories. We therefore aim to continue researching as more data become available to confirm this study's results and to extend the model to more communes.
Consequently, in future work, we plan to pursue the following research directions: (i) In this paper, we grouped the incidence data by keeping the thresholds proposed by Harvard but combining the green and yellow categories into a single category that is referred to as low risk. As a future research direction, we propose studying, both technically and epidemiologically, the effect of regrouping the Harvard model categories to obtain a larger number of cases per class (e.g., considering only a binary problem for non-high and high incidence levels with a threshold of 25 cases per 100,000 population, or redefining new thresholds). (ii) We propose developing a software tool that allows decision makers to direct active search efforts using this predictive model, since at the present time they are only directed towards crowded places. In this regard, the idea of exploring new COVID-19 incidence levels according to the availability and distribution of data is also under consideration. (iii) We would like to analyze the personal risk of COVID-19 infection in Chile by examining the commune, regional, and national data of this study, as well as health and socio-demographic data as soon as these data become available to us. (iv) Finally, we consider the possibility of extending the definition of our research problem to combine classification models with epidemiological models that consider the individual prediction of COVID-19 incidence, concerning confirmed cases, deaths, and recovered cases, using MAE and RMSE as statistical loss functions.

Conflicts of Interest:
The authors declare that there is no conflict of interest.