A Model for Rapid Selection and COVID-19 Prediction with Dynamic and Imbalanced Data

: The COVID-19 pandemic is threatening our quality of life and economic sustainability. The rapid spread of COVID-19 around the world requires each country or region to establish appropriate anti-proliferation policies in a timely manner. It is important, in making COVID-19-related health policy decisions, to predict the number of conﬁrmed COVID-19 patients as accurately and quickly as possible. Predictions are already being made using several traditional models such as the susceptible, infected, and recovered (SIR) and susceptible, exposed, infected, and resistant (SEIR) frameworks, but these predictions may not be accurate due to the simplicity of the models, so a prediction model with more diverse input features is needed. However, it is difﬁcult to propose a universal predictive model globally because there are differences in data availability by country and region. Moreover, the training data for predicting conﬁrmed patients is typically an imbalanced dataset consisting mostly of normal data; this imbalance negatively affects the accuracy of prediction. Hence, the purposes of this study are to extract rules for selecting appropriate prediction algorithms and data imbalance resolution methods according to the characteristics of the datasets available for each country or region, and to predict the number of COVID-19 patients based on these algorithms. To this end, a decision tree-type rule was extracted to identify 13 data characteristics and a discrimination algorithm was selected based on those characteristics. With this system, we predicted the COVID-19 situation in four regions: Africa, China, Korea, and the United States. The proposed method has higher prediction accuracy than the random selection method, the ensemble method, or the greedy method of discriminant analysis, and prediction takes very little time.


Introduction
As COVID-19 has recently become a global pandemic, each country is striving to respond appropriately. To this end, accurately predicting trends in the number of confirmed cases of COVID-19 is important in determining quarantine policies.
The factors that affect COVID-19 infection are related to direct contact with the infectious agent, but COVID-19 has a latent time during which it is difficult to identify infection externally. Therefore, in addition to rapid discovery and isolation of symptomatic infectious agents, it is important to predict trends in confirmed cases and to imagine how many infected people will be in a space where infected and healthy people without symptoms are mixed. Therefore, infection models are used, such as the susceptible, infected, and recovered (SIR) model [1]; the susceptible, exposed, infected, and resistant (SEIR) model; the Gaussian mixture model [2]; and regression analysis [3]. These models use epidemiological data such as the number of previously infected people and the total population. Infection trends are also predicted using several parameters such as the latency period and probability of healing. However, these simple models do not reflect various socio-static and economic factors that may profoundly affect the course of the virus. This is why predictions based on economic and social data in addition to epidemiological data in the analysis of COVID-19 trends are necessary.
Other confounding factors make prediction difficult. In particular, COVID-19-related datasets are often highly imbalanced, which negatively affects the accuracy of predictions. To predict well using highly imbalanced datasets, statisticians often use sampling methods, cost-sensitive learning, and feature selection. However, there are no guidelines on which method and which prediction model to use depending on the nature of the dataset. Second, the economic and social variables that seem significant in predicting infectious diseases such as COVID- 19 have not been confirmed. This leads to a proliferation of predictive models and makes it difficult to determine the optimal model to be applied in any given country and situation. Third, the availability of reliable economic and social data differs by country. For example, the degree of exposure to fake information related to quarantine causes confusion and increases the risk of COVID-19 infection. However, reliable exposure data are not available in some countries. Confusingly, the features included in COVID-19 prediction models are inevitably different for each country. Finally, COVID-19 infection is still progressing every day. Therefore, the urgent decision of which sampling method and prediction algorithm to use according to the characteristics of each country's dataset must be made quickly.
Thus, as part of efforts to prevent the spread of COVID-19 infection, a country-specific infection-level prediction model would solve the methodological problem of prompt and optimal prediction with an imbalanced data set consisting of features that are difficult to define in advance. Unfortunately, no meta-model that enables decisions about which characteristics of a certain dataset should be included in which sampling method with which prediction algorithm has yet been introduced. As interest in the application of artificial intelligence (AI), including machine learning, is increasing in the business domain, selecting the optimal discrimination algorithm is gaining attention, but excessive waste of resources (e.g., hardware, money, time, manpower) has occurred, which makes a prompt response difficult [4]. A guideline for selecting a sampling and discrimination algorithm based on the meta-characteristics of the dataset would make it possible to alleviate problems related to COVID-19 by quickly finding a model that can provide good prediction performance.
Therefore, the purpose of this study is to present a model that quickly predicts confirmed cases by recognizing appropriate sampling methods and algorithms according to the data characteristics of each country and region, focusing on class-imbalanced data according to the COVID-19 situation of each country. To do this, we first selected various datasets consisting of numerical and nominal values, including a socioeconomic dataset related to COVID-19, to explore the relationship between data characteristics, sampling method, and algorithm prediction performance, after which we used the results to make a decision. Using the extracted selection guidelines, we then examined how performance improves by selecting and predicting a sampling method and classification algorithm using an available dataset for each region in which COVID-19 has been detected in four countries. Using the proposed model, companies and researchers can perform eco-friendly machine learning, reduce repetitive experiments, and waste fewer resources.

Sustainable Machine Learning
Recently, many information systems have introduced machine learning for personalization and AI. However, developing an AI system through cloud computing and tensor processing consumes a huge amount of power, resulting in an indirect carbon footprint greater than the average carbon dioxide generated by a single car in a year [5]. Figure 1 is a graph representing the amount of carbon generated and the amount of power consumption required to make an AI algorithm. Efforts are therefore needed to reduce the amount of carbon generated while developing AI algorithms.

Metadata Collection for Finding Classification Algorithm Selection Rules
The metadata used in this study were collected through the process shown in Figure 2. The data were provided by the UCI Machine Learning Repository (see Table 1), which provides a variety of benchmark data and is frequently used by data mining researchers to verify research results. A dataset with an imbalanced class (Balance Scale, Car Evaluation, Dermatology, Wine, Contraceptive Method Choice, Glass, New-Thyroid, Hayes Roth) was selected. The selected dataset was divided using 10-fold cross-validation and then the characteristics of the dataset were extracted. For this study, several datasets with different degrees of imbalance were generated using the sampling method. The data were increased by 10% by applying the weight to the oversampling method; finally, oversampling the number of classes to the same level was attempted. For undersampling, a method of reducing the total number of classes by 10% was used. This procedure was not applied to data that did not need weight. Finally, a 10% to 100% sampling method was applied to one dataset, and 10 datasets (excluding the original dataset) were used. The total amount of metadata generated was 28,160 records. In terms of sustainability, machine learning can be classified into greening machine learning and greened machine learning. First, greening machine learning refers to the use of machine learning methods to maintain the planet's sustainability by doing things such as reducing carbon dioxide emissions. Using deep learning improves the accuracy of carbon dioxide emission predictions to facilitate decision-making and reduce carbon dioxide [6,7]. For example, studies have demonstrated the efficacy of machine learning for reducing the amount of repetitive work in experiments on reducing carbon dioxide emissions [8].
Greened machine learning, on the other hand, involves devising a method to reduce power consumption or costs. However, recently developed deep learning algorithms have large numbers of parameters and require considerable computation time to calculate [9]. In particular, running repeated experiments to find a suitable predictive model is costly and causes huge CO 2 emissions. These experiments must be repeated frequently because classification algorithms used in machine learning are not generalized to all data. Accordingly, some researchers have argued for the necessity of a generalization method that would take into account the characteristics of the dataset [10], but so far no solution has been found.
Although various dataset characteristics are being studied, there is a lack of guidelines for the selection of sampling methods or classification algorithms according to the characteristics of the dataset. Efforts have been made to elucidate the causal relationship between classification performance and data characteristics in some studies, but no guidelines for selecting a classification algorithm have been provided [11,12]. In addition, even in studies using sampling methods, there are cases where a simple classification performance comparison is limited [13]. Therefore, in this study, we propose a method of recommending a classification algorithm and a sampling method that considers the characteristics of the dataset and that also reduces the unnecessary resource consumption that is common in repeated experiments. Figure 1 shows the results of calculations regarding the amount of power and carbon dioxide generated for training and hyperparameter tuning of a natural language processing (NLP) model. Examples of machine translation models include the Transformer model, the Embeddings from Language Model (ELMo), the Bidirectional Encoder Representations from Transformers (BERT) model, and the Neural Architecture Search (NAS) model.
The difference between Big and Base arises from the number of parameters: 65 million parameters in the base model and 213 million parameters in the big model. The amounts of carbon emissions that are generated in the process of learning in the BERT and NAS models (BERT = 1438 CO 2 e, NAS = 626,155 CO 2 e) are greater than the carbon emission of one passenger who travels from New York to San Francisco, which is 1984 CO 2 e. Many companies and data scientists are generating a large volume of carbon by repeating experiments with AI algorithms for high performance. If there is a guideline for the production of AI algorithms, it may be motivated by the need to develop AI algorithms in a more eco-friendly manner, while reducing the amount of carbon generated in repeated experiments. In particular, regulations on carbon emissions have emerged as international concerns. Currently, carbon emissions from transportation and factories are actively regulated, but with the advent of the fourth industry, AI developers based on machine learning will also be subject to such regulations due to carbon emissions from graphics processing units (GPUs) and tensor processing units (TPUs). Therefore, AI-based companies will have to make efforts to reduce carbon emissions. Companies may gain benefits from their eco-friendly images that are associated with the efforts to reduce carbon emissions, especially for AI-based companies.

Metadata Collection for Finding Classification Algorithm Selection Rules
The metadata used in this study were collected through the process shown in Figure 2. The data were provided by the UCI Machine Learning Repository (see Table 1), which provides a variety of benchmark data and is frequently used by data mining researchers to verify research results. A dataset with an imbalanced class (Balance Scale, Car Evaluation, Dermatology, Wine, Contraceptive Method Choice, Glass, New-Thyroid, Hayes Roth) was selected. The selected dataset was divided using 10-fold cross-validation and then the characteristics of the dataset were extracted. For this study, several datasets with different degrees of imbalance were generated using the sampling method. The data were increased by 10% by applying the weight to the oversampling method; finally, oversampling the number of classes to the same level was attempted. For undersampling, a method of reducing the total number of classes by 10% was used. This procedure was not applied to data that did not need weight. Finally, a 10% to 100% sampling method was applied to one dataset, and 10 datasets (excluding the original dataset) were used. The total amount of metadata generated was 28,160 records.

Metadata Collection for Finding Classification Algorithm Selection Rules
The metadata used in this study were collected through the process shown in Figure 2. The data were provided by the UCI Machine Learning Repository (see Table 1), which provides a variety of benchmark data and is frequently used by data mining researchers to verify research results. A dataset with an imbalanced class (Balance Scale, Car Evaluation, Dermatology, Wine, Contraceptive Method Choice, Glass, New-Thyroid, Hayes Roth) was selected. The selected dataset was divided using 10-fold cross-validation and then the characteristics of the dataset were extracted. For this study, several datasets with different degrees of imbalance were generated using the sampling method. The data were increased by 10% by applying the weight to the oversampling method; finally, oversampling the number of classes to the same level was attempted. For undersampling, a method of reducing the total number of classes by 10% was used. This procedure was not applied to data that did not need weight. Finally, a 10% to 100% sampling method was applied to one dataset, and 10 datasets (excluding the original dataset) were used. The total amount of metadata generated was 28,160 records.

Dataset Characteristics Used in the Metadata
The following characteristics of datasets can affect the selection of the classification algorithm.

Number of Instances
Number of instances is a representation of the size of the dataset. In general, with more instances, we can have a more accurate classification. However, the larger the dataset, the more noise occurs and the more learning time is required in the machine learning process. Too little noise makes correct classification difficult due to lack of information. According to Brazdil et al., 1994 [14], some classification algorithms can be determined by the number of instances.

Number of Numeric Variables
Numeric variables provide more abundant information than nominal variables. However, compared to nominal variables, numeric variables may have an ambiguous classification boundary, which may cause difficulties in classification. Therefore, fewer numeric variables can negatively affect classification accuracy [15].

Number of Nominal Variables
The number of nominal variables is a factor that positively influences classification accuracy [15]. Moreover, nominal variables have clear boundaries compared to numeric variables, so they can be advantageous in classification problems. However, this factor can be unfavorable for probability problems using methods such as logistic regression and naïve Bayes.

Number of Missing Values
Missing values often appear in COVID-19 databases in various fields. Missing values cause loss of information and are related to the quality of the data. If there are many missing values, information for classification will be insufficient, and many errors will result [16].

Herfindahl-Hirschman Index (HHI)
The HHI is an indicator of market concentration, which is the sum of the squared market share of all operators in the market [17,18]. When comparing the market share to the frequency of each class, the higher the class balance, the lower the HHI value. In addition, higher HHI values are expected to affect discrimination performance negatively because, in general, the greater the data imbalance problem, the lower the accuracy of the decision [19,20].

Number of Variables
Reducing the number of input variables can reduce data complexity, solve multicollinearity problems between variables, and improve prediction, classification, and cluster-ing [21,22]. However, if too few variables are considered, the complexity of the predictive model will improve, but the accuracy will be lower than when considering a large number of variables [23].

Number of Classes
The number of classes represents the dimensions of a class [24,25]. Since the expected probability decreases according to the number of classes, it will be closely related to the accuracy of the class. In general, the greater the number of classes, the lower the discrimination accuracy. Krawczyk, 2016 [26], argued for the need to find an appropriate number of classes to resolve class imbalances.

Entropy
Entropy is a variable that can quantitatively measure the uncertainty of information in a given dataset because it is related to the amount and purity of information [27,28]. The rarer the information, the better, and the more common the data, the more consistent it is. Entropy can be an indicator of whether the data are consistently pure. We measured this variable using the method proposed by Shannon [27].

Silhouette Score
The silhouette score is an index used to verify whether a cluster is correctly formed during clustering analysis. The closer the silhouette score is to 1, the more appropriate the number of clusters is. In addition, the more that clusters are grouped, the more homogeneity there is within the cluster; however, heterogeneity between clusters is better. In this study, k-means clustering was performed, where k was determined as the number of classes in the data, and then silhouette coefficients were measured.

Data Nonlinearity
In general, data nonlinearity negatively affects discrimination performance [29]. Barella et al., 2018 [12], found a correlation between data complexity and discriminant performance, and revealed a negative relationship between the nonlinearity of a linear classifier and the classification performance (G-mean). In this study, a method of measuring the nonlinearity of linear classifiers was used. This method considers both the linearity of the data and the outliers, first classifying errors through a non-linear classifier and then comparing the results classified through the linear kernel function of the support vector machine (SVM) [30].

Hub Score
The hub score is an index that measures the cohesiveness of the data using the concept of network connectivity. The score is determined by configuring a network using the given data and measuring the number of nodes connected to the network; the more connected nodes, the higher the score [30]. The hub score has values ranging from 0 to 1. The more nodes there are connected, the closer the value will be to 1. On the other hand, if the hub score is close to 0, there is a high probability of error due to overlapping with other classes [31].

Feature Overlap
The maximum Fisher's discriminant ratio measure represents the degree of overlap between variables as a ratio. The higher the data overlap, the more ambiguous the division of decision boundaries. Therefore, as the overlap of variables increases, the use of undersampling methods such as Tomek link, edited nearest neighbors (ENN), and condensed nearest neighbors (CNN) may be more advantageous.

Neighborhood
Classes that are labeled incorrectly or that are randomly labeled are placed on different class boundaries. In some cases, it is difficult to maximize the distance between classes in a linear separation problem, and of course, this may negatively affect the discrimination performance [32].

Dimensionality
Dimensionality involves the reduction of the dimensions of the variables of the data used through principal component analysis (PCA), and the expression of the difference between the reduced variables and the original data variables as a ratio. If the dimensions are reduced more than the original data, it means that the variables in the original data are not efficient [33]. Variables that are not efficient create the risk of generating noise. In addition, if there are many unnecessary variables, machine learning may take longer.

Sampling Methods Used in the Metadata
Various classification studies are being conducted using algorithms such as the support vector machine (SVM), naïve Bayes classifier, k-nearest neighbor (k-NN), and decision tree. However, according to the 'no free lunch theory' of Wolpert and Macready, 1997 [34], an algorithm optimized for a specific problem does not exhibit the same performance in other problems. Actual data has various problems such as data imbalance, data errors, and high dimensions [34]. In particular, class imbalance has a significant negative impact on classification performance. Class imbalance means that the properties of the target variable to be classified are unbalanced. This phenomenon occurs in various fields such as fraud detection, medical diagnosis, network intrusion detection, and modern manufacturing plants. Various studies have been conducted using various methods to resolve class imbalances [35].
First, there is a cost-sensitive learning method that mitigates the bias of multiple classes by modifying the existing algorithm. Cost-sensitive learning is a method of reducing classification errors by using a cost matrix for misclassified data, unlike the data extraction method, which operates according to the distribution of classes [36]. In cases of costsensitive learning, it shows good performance because it is applied according to the characteristics of actual data.
Second, data preprocessing is a method of sampling training data to fit the classification algorithm. For example, undersampling balances the class distribution by removing the majority class. The problem with the undersampling method is the loss of information. On the other hand, the oversampling method balances a minority class by replicating it to fit a large number of classes. However, the oversampling method may generate noise in the data, and learning may be prolonged due to the increase in data. The data preprocessing method may be less accurate than the cost-sensitive learning method, but it does not provide a cost metric, so if you have no expertise or experience with the data, it takes a lot of time and money to find an appropriate cost metric [37]. Therefore, it would be more efficient to use the preprocessing method when dealing with a large amount of data or in cases where domain knowledge is lacking.
Looking at existing class imbalance resolution studies, we see that the performance of the sampling method differs according to the data characteristics and that performance varies according to the strategy of the classification algorithm. Therefore, in this study, the oversampling method and the undersampling method were targeted. Random oversampling (ROS), the synthetic minority oversampling technique (SMOTE), and the adaptive synthetic sampling approach for imbalanced learning (ADASYN) were selected as oversampling methods, and random undersampling (RUS), ENN, the Tomek link method, CNN, and the neighborhood cleaning rule (NCL) were selected as undersampling methods. These sampling methods are described in Table 2.

Random Oversampling (ROS)
This is a method of iteratively recovering and extracting data by randomly selecting data until a few classes are equal to the data size of many classes. The random overextraction method has the advantage of being very convenient to use, with almost no loss of information. However, if the sampling rate is unreasonably increased, an overfitting problem may occur because data of a minority class are repeatedly reconstructed and extracted.

Synthetic Minority Oversampling Technique (SMOTE)
This is a method of selecting random data of a minority class and artificially generating new data between k-nearest neighbors [38]. Unlike ROS, which restores and extracts fractional class data, SMOTE has been proposed to avoid the overfitting problem by generating new data.
Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN) This is a method of generating data with consideration of the density distribution of a minority class based on SMOTE [39]. The method is similar to the distribution of the original data because it considers the density distribution of the data.

Random Undersampling (RUS)
This is a sampling method in which the majority class is randomly deleted and its proportion adjusted. RUS has the advantage of being easy to use with large-scale data, which can reduce the cost by reducing the amount of data. However, there is a high possibility of losing important information because the data are arbitrarily reduced.

Condensed Nearest Neighbors (CNN)
CNN is a method of removing data until there are no data concentrated in a majority class, leaving only representative data in the data distribution. The CNN method leaves data with clear boundaries of different classes [40]. Data are stored one-by-one and a suitable dataset is constructed by removing duplicate data.

Edited Nearest Neighbors (ENN)
Unlike CNN, if the value included in set X is misclassified, it can be excluded from X [41].
Tomek link Based on the CNN sampling method, Tomek link is a method of removing internal data near the decision boundary. The method has the effect of removing ambiguous data overlapping with other classes [42]. Therefore, it is regarded as an efficient sampling method for removing abnormal data.

Neighborhood Cleaning Rule (NCL)
NCL is a method that combines condensed nearest neighbors (CNN) and edited nearest neighbors (ENN). It has the effect of clarifying class boundaries by removing data from multiple classes rather than the nearest data, avoiding fractional data [43].

Classification Algorithms Used in the Metadata
In this study, random forest, SVM, naïve Bayes classifier, k-NN, and logistic regression were considered.
First, random forest is an ensemble technique based on the majority voting method that generates several decision tree models and tends to show superior performance compared to other decision tree models.
Second, SVM is a method used to classify data from n-dimensional data using an n-1-dimensional hyperplane. The dividing line between the two classes allows selection of the hyperplane with the largest width of the two classes. This method is called linear classification. In addition, SVM is also capable of nonlinear classification using kernel functions such as polynomial kernel, sigmoid kernel, or radial base function (RBF) kernel [44].
Third, the naïve Bayes classifier is a probabilistic approach that uses Bayes' theorem during supervised learning. Although the naïve Bayes model is relatively simple and the calculation process is not complicated, it is known to exhibit excellent performance [45].
Next, unlike the naïve Bayes classifier, SVM, and the decision tree model, the k-NN method is a lazy learning method that does not use training data and moves only when empirical data are given. This is a method of selecting k neighbors among the nearest neighbors and classifying them into the most common class.
Lastly, logistic regression is a probability model that predicts the likelihood of an event using a linear combination of independent variables.

Classification Performance Measurement
The F-score was used to measure classification performance. Overall accuracy can be useful for measuring algorithm performance in class-balanced data. However, in real datasets, data are often skewed to one side, so overall accuracy alone cannot be a sure indicator of the true performance of the algorithm. Therefore, in this study, the F-score was used as a performance measurement method considering the class-imbalanced situation. The formula for calculating the F-score is presented in Equation (1).

Extraction of Classification Algorithm Recommendation Rules According to Data Characteristics
First, only the classification algorithm and the class imbalance resolution method that showed the best classification performance for each fold were extracted from the collected metadata to identify the class imbalance resolution method and classification algorithm recommendation rule. We selected classification algorithms and class imbalance resolutions that showed the greatest improvements in F-scores and if F-scores were identical, we compared them in the order of G-mean, overall accuracy, and elapsed time. Elapsed time was calculated from algorithms and class imbalance resolutions that consumed the least amount of time. In total, 70 metadata were finally used. The standard of extraction was the difference in classification performance, which was derived by applying the class imbalance resolution method from the classification performance of original data. The most improved ones were extracted. Table 3 shows the results of applying the random forest method to the data characteristics selected in this study to identify the most important factors in selecting a classification algorithm. Mean decrease accuracy refers to the expected decrease in accuracy when no factor is selected, and the mean decrease Gini refers to the decrease in data purity when no factor is selected. First, looking at Table 3, the mean decrease accuracy was 41.594, and the entropy of features was the most important factor in selecting a classification algorithm. The next most important factors were the network and coefficient of determination. Figures 3 and 4 show the decision trees that represent classification algorithm selection rules and sampling method selection rules, respectively. The decision trees were constructed based on the indicators (HHI, mean silhouette score, variable entropy, coefficient of determination) that characterized the datasets. The decision tree indicated that the data characteristics considered in this study were used in almost all rules. In addition, the higher the number of instances, the more divided the nodes, which resulted in the most effective classification algorithm in a class imbalance state. Thus, the data characteristics considered in this study played an important role in selecting suitable classification algorithms.            Table 4 shows the importance of various variables in resolving class imbalances. The main factors involved in selecting the method for resolving these imbalances were variable entropy, network, and the coefficient of determination. Looking at the decision tree for resolving the imbalance problem, we see that most data can be classified based on the coefficient of determination, network, and entropy of variables. The reason for this result is that most class imbalance resolution methods delete or create data based on distance. The coefficient of determination means the extent to which the independent variable explains the dependent variable. In other words, it represents the efficiency of the variable. Network and variable entropy represent noise data. In particular, the ADASYN, SMOTE, and Tomek methods use the strategy of calculating and removing or generating distances between data. Therefore, if the efficiency of the variables constituting the dataset is low, better results can be obtained if the class imbalance is resolved after preprocessing through methods such as dimension reduction and feature selection. In addition, as the rule was created using network, variable entropy, dimensionality, and overlap, it seems that it was determined by the structure and distribution of the data. If the variables constituting the dataset do not have a marginal hyperplane, an error may occur when creating or removing data.

COVID-19 Datasets
In order to check the performance of the optimal COVID-19 prediction model (decision tree type) proposed in this study in terms of the data characteristics and data imbalances, the following datasets were collected.
For this study, the United States, China, Korea, and the African continent were targeted, and the variables secured in the open dataset included humidity, temperature, population, and economic variables (GDP, store sales, etc.). The US dataset was collected by the United States Census, Bureau of Economic Analysis. In the United States Census, the number of men, women, and total population by state, and the number of students enrolled in school for more than three years were collected, and quarterly GDP was extracted from the Bureau of Economic Analysis. For the Korean dataset, population, population movement data, and sales datasets were collected from the Korean Statistical Information Service (Kosis), and temperature, humidity, and wind speed were collected from the Korea Meteorological Agency. To build the China dataset, economic, education, energy, environment, and population data were collected by China Knowledge Resource Integrated (CKNI), and weather data were collected from http://data.sheshiyuanyi.com/ (accessed on 8 January 2021) in China. In the case of China and Korea, data were collected by city, and in the case of the United States, they were collected by state. Africa data were collected by country, and the variables used were life expectancy, GDP, population, and Gini coefficients from World Bank Open Data. The variables included in each country's dataset depended on the circumstances in each country. For a fair comparison, data analysis experts were given 24 h to build a dataset. Each dataset was collected in state units, except in Africa, where they were collected in national units. The number of samples was increased to 10 times that of the original dataset, which was too small. The values of the variables obtained were between the mean and the standard deviation.
As for the dependent variable of each dataset, differences in the number of confirmed cases or deaths between August and September 2020 were calculated. Classes included increase, flat, and decrease. The composition of the datasets is shown in Table 5.  Table 6 shows the characteristics of the COVID-19 datasets in the four regions. Looking at Table 6, we see that the data characteristics differ because the data that could be collected differed for each country. Even if the same data were available, the collection period, the type of data, and the completeness of the data all differed.

Performance Comparison
To verify the superiority of the method proposed in this study, a performance comparison was conducted. For the experiment, the following four methods were compared.
Method 1 (Random): After randomly selecting one of the classification algorithms, the algorithm predicts and discriminates the number of confirmed cases of COVID-19 in the area.
Method 2 (Ensemble): Using five classification algorithms such as k-NN, logistic regression (LR), naïve Bayes (NB), random forest (RF), and SVM, the majority vote of the results obtained is used to predict the number of confirmed cases of COVID-19 in the region.
Method 3 (Greedy): After predicting the number of confirmed cases of COVID-19 in the region using each of the five discrimination algorithms, k-NN, LR, NB, RF, and SVM, the best performance is determined.
Method 4 (Proposed): After analyzing the characteristics of the dataset, the number of confirmed cases of COVID-19 in the area is predicted by selecting a classification algorithm according to the selection method proposed in this study.
The results are reported in Table 7. The proposed method (Proposed) was the best based on the F-score standard, and its standard error was also the lowest, indicating that its performance was very stable. In addition, even in terms of elapsed time, it can be seen that the proposed method produced results more quickly than the other methods (on average 0.740 s faster). In fact, it was more than twice as fast as Method 2 (Random), which ran quite fast. In this study, relatively few samples were used for the experiment, but if the number of confirmed cases of COVID-19 is predicted using a large dataset in the future, the effect of saved execution time will be even greater. Thus, the proposed method is excellent in terms of discrimination performance and time required for discrimination. The method of selecting a method for resolving class imbalances was also compared as follows.
Method 1 (Random): After randomly selecting one of the methods for resolving class imbalances, the algorithm predicts the number of confirmed cases of COVID-19 in the region.
Method 2 (Ensemble): Based on the principle of majority vote and using the results obtained by utilizing the eight class imbalance resolution methods (ADASYN, CNN, ENN, NCL, ROS, RUS, SMOTE, Tomek), we predict and discriminate the number of confirmed cases of COVID-19 in the region.
Method 3 (Greedy): Each of the eight class imbalance resolution methods (ADASYN, CNN, ENN, NCL, ROS, RUS, SMOTE, Tomek) is used to predict the number of confirmed cases of COVID-19 in the region, and the best performance is determined.
Method 4 (Proposed): After analyzing the characteristics of the dataset, we predict the number of confirmed cases of COVID-19 in the area by selecting a class imbalance resolution method according to the method proposed in this study. Table 8 shows the results of the performance comparison between the class imbalance resolution method and other selection methods. The proposed method (Proposed) was almost no different from the Greedy method in terms of F-score, and was a safer method in terms of performance changes (0.173 < 0.418). Furthermore, when using the proposed method, the elapsed time was the least required. Thus, the superiority of the method of selecting the method for resolving the class imbalances proposed in this paper was demonstrated. Table 9 lists classification algorithms derived from the dataset characteristics utilized in this study. In Africa, random forest was recommended, naïve Bayes was recommended for China, k-NN and logistic regression were recommended for Korea, and naïve Bayes and random forest were recommended for the United States. Table 10 shows the recommendation results for class imbalance resolution methods.

Recommendations
ADASYN and SMOTE were recommended for Africa; ROS for China; SMOTE for Korea; and ADASYN, SMOTE, and ROS for the USA. The method recommended in this study derives from the rules trained using the metadata presented in Table 2. The example of classification algorithm selections is as follows (see Figure 3 and Table 6). The number of instances from China is larger than 195.3, so it goes in the right direction. Next, the mean silhouette score goes in the left direction because it is smaller than 0.560. Naive Bayes is selected because the entropy of the variable is greater than 6.635 and the linearity is greater than 0.035.

Contributions
In this study, we have proposed a method to recommend an appropriate classification algorithm and sampling method according to the characteristics of the dataset. This study makes the following theoretical contributions. First, in this study, 13 types of data characteristics were used, and the main factors determining classification algorithms and sampling methods were identified. With consideration of various data characteristics, a selection rule was first proposed.
Even if the characteristics of domains or datasets are different, our method can quickly identify an appropriate classification algorithm or sampling method. Domain knowledge is very important in data analysis. For example, in the microarray field, there are many variables, but relatively few instances, but in the social science field, there are relatively few variables and many instances. In order to overcome such differences arising from the characteristics of the domain, our proposed method generates a rule that determines the optimal classification algorithm and sampling method by converting the data into metadata. This method can reduce resource consumption caused by repeated measurements in the field of data science.
This study has the following practical implications. First, the results of this study can be used to develop a classification algorithm recommendation system according to the characteristics of the dataset. In particular, in the current COVID-19 global pandemic, in situations where it is necessary to predict trends quickly using different datasets by country or region, our method allows for a review of all possible discrimination algorithms or data imbalances in order to find the best alternative. Details can be missed when there is no systematic way of determining the optimal algorithm or sampling method. However, applying the method proposed in this study will facilitate the rapid recommendation of an excellent prediction method regardless of regional characteristics and data availability, thus contributing to preventing the spread of COVID-19.
Second, there are differences in discrimination performance due to parameter setting in the area of machine learning. Because all the characteristics of the data are different, knowing the optimal parameters required according to the data is necessary. Many data scientists perform repeated experiments to determine the optimal parameters and algorithms to solve new problems; this effort is time-consuming and expensive . Many of the parameters of machine learning are related to the characteristics of the data. If the method proposed in this study is used, repetition of experiments will be reduced and the optimal algorithm may be selected in a relatively short time.

Limitations
This study has several limitations. First, not all existing class imbalance solutions were considered. For brevity, in this study, only oversampling and undersampling were considered, and feature selection and cost-sensitive learning were not considered. In future research, we plan to propose a selection algorithm that considers all of these factors.
Second, this study does not consider the data characteristics of unstructured datasets. Although most input features used in predicting confirmed cases of COVID-19 are numerical data, prediction will present few problems using the current method; however, future studies will need to include data characteristics of unstructured data.
Third, the results of this study may be generalized by repeated experimentation with the data of other countries. With any dataset, the proposed method may be applied because metadata can be constructed, as in this study. Furthermore, in this study, deep learning algorithms of artificial neural networks were not included in the algorithm recommendations. However, if the learning data include multimedia such as images, artificial neural networks (ANNs) may be included. This is an area for further study.

Conclusions
Recently, interest has been increasing in using AI methods to respond to global crises in various fields. However, since many iterations must be performed, the process of finding an appropriate algorithm can be a stumbling block in a crisis situation. Therefore, in this study, we proposed a method of identifying decision rules that determine classification algorithms and sampling methods using existing datasets. The method proposed in this study will be useful for machine learning and data mining researchers, practitioners, and machine learning-based system developers. In particular, it will contribute to the improvement of the quality of intelligent information systems using classification algorithms and minimize wasteful, resource-heavy repeated experiments by data scientists. We believe that the proposed method will also contribute to understanding the current status of COVID-19 in all countries of the world and enable policymakers to respond quickly.
Our method increases environmental sustainability by consuming as little power as possible in the process of finding an optimal machine learning algorithm in a trial-anderror manner. Until now, repeated experiments have been necessary to find the optimal algorithm, and in this process, resources are unnecessarily consumed or wasted and carbon emissions are increased. In addition, the number of parameters used in the algorithms in the development of AI continues to increase exponentially, and the consumption of resources also increases accordingly. As the rapidity and diversity of the spread of COVID-19 poses a threat to national defense systems, especially those in developing countries, accurate prediction models are vital to overcoming this crisis. Another problem is that the COVID-19 prediction models in developed countries cannot be reused as-is in developing countries due to data differences; a separate prediction model must be built. This time-consuming, expensive process threatens the lives of citizens of underdeveloped countries. We believe that companies or researchers can utilize eco-friendly machine learning to reduce repetitive experiments and wasted resources using the method proposed in this paper.