Article

Cluster Analysis of US COVID-19 Infected States for Vaccine Distribution

1 Department of Information Management, National Yunlin University of Science and Technology, Douliu 64002, Taiwan
2 Department of Information Management, National Chung Cheng University, Chiayi 621301, Taiwan
3 Department of Electrical and Computer Engineering, Iowa State University, 2520 Osborn Drive, Ames, IA 50011, USA
* Author to whom correspondence should be addressed.
Healthcare 2022, 10(7), 1235; https://doi.org/10.3390/healthcare10071235
Submission received: 6 June 2022 / Revised: 30 June 2022 / Accepted: 1 July 2022 / Published: 2 July 2022
(This article belongs to the Special Issue Global Health in the Time of COVID-19: Law, Policy and Governance)

Abstract

Since December 2019, COVID-19 has raged worldwide. To prevent the spread of infection, many countries proposed epidemic prevention policies and administered vaccines quickly. However, facing a shortage of vaccines, the United States did not put forward effective epidemic prevention policies in time to prevent the infection from expanding, and the epidemic there became more and more serious. Through "The COVID Tracking Project", this study collects medical indicators for each state in the United States from 2020 to 2021 and, after feature selection, clusters the states according to the epidemic's severity. The accuracy of the cluster analysis is then verified through the confusion matrix of a classifier; the results show that the Cascade K-means cluster analysis has the highest accuracy. The three clusters of the cluster analysis results are labeled as high, medium, and low infection levels. Policymakers could thus decide more objectively which states should be prioritized for vaccine allocation during a vaccine shortage to prevent the epidemic from continuing to expand. It is hoped that, if a similar epidemic occurs in the future, relevant policymakers can use the analysis procedure of this study to allocate medical resources for epidemic prevention according to the severity of infection in each state and so prevent the spread of infection.

1. Introduction

Since the emergence of the new coronavirus COVID-19 in December 2019, it has spread through the world at a rapid rate, causing a catastrophe for human public health. This virus not only affects world transportation but also causes irreparable economic and human losses. The United States, which dominates the global economic system, failed to make timely policies for epidemic prevention [1]. Although the number of infections continued to rise, the United States seems to have failed to take effective vaccine distribution and isolation measures [2]. According to [3], who investigated vaccine allocation, it is important for a region to prioritize who receives vaccines. According to [4], reasonable vaccine distribution protocols are needed in regions with unstable resources. The authors conducted a data analysis of more than 100 countries and found that problems with vaccine distribution are a main cause of the spread of influenza. In [5], the authors collected global coronavirus collective-infection data, analyzed it with clustering techniques, and found that virus transmission is related to family and community infection. The authors in [6] gathered data, used feature dimension reduction to extract important information from the data, and finally used classification algorithms to evaluate the outcome.
Clustering partitions a dataset into groups such that the differences within each cluster are small and the differences between clusters are significant; difference is measured by the distance between observations. Academic studies in many areas use clustering to help analyze data and reach conclusions. For example, in [7], the authors detected gas leaks by monitoring mass spectrometer data with cluster analysis. Because of disputes between tourism development and natural landscape protection, various stakeholders were included in a cluster analysis, and the results were divided into four groups ranging from conservative to radical [8]. In [9], the financial strategies of non-financial companies were analyzed by clustering; some strategies suit high-tech industries, while others suit basic industries.
Classification refers to establishing a classification model from known data and their category attributes, which can help predict which label the target data will be assigned. In [10], sonar datasets are processed with the short-time Fourier transform, and the sonar targets are then classified by few-shot learning, which improves classification accuracy. In [11], a support vector machine was used to classify different hand movements from subjects' real-time and non-real-time EMG data, and the results suggested that human muscle patterns are as distinctive as fingerprints or retinas. Dritsas and Trigka [12], using different machine learning techniques to predict stroke, found that ensemble machine learning was the best approach.
In the face of the vaccine shortage during the COVID-19 pandemic, the United States did not put forward effective epidemic prevention policies in time to prevent the infection from expanding further, resulting in an increasingly serious epidemic. This study uses the COVID-19 infection case indicators collected by the COVID Tracking Project in the various states of the United States. Through cluster analysis, we distinguish the severity of infection (low, medium, and high) in each state, so that vaccines can be allocated first to states with high infection rates for priority epidemic prevention measures. The data mining software WEKA is used to conduct the cluster analysis and classification. After the experiment, the clusters are labeled once their classification accuracy has been confirmed with the confusion matrix of the classification verification. We hope that understanding the infection case indicators of each state can help allocate medical resources in the future and provide a reference for decision makers in vaccine distribution.
The structure of this study is as follows. Section 2 is a preliminary description of COVID-19, medical indicators, and vaccine distribution, combined with machine learning processes, which include feature selection, clustering method, and classification verification. Section 3 describes the dataset and the experimental process. Section 4 is the result of grouping analysis and classification verification. Section 5 is the discussion, and Section 6 is the conclusion.

2. Preliminary

2.1. COVID-19

Severe special infectious pneumonia, the novel coronavirus disease (COVID-19) caused by SARS-CoV-2, has seriously affected people's lives worldwide, and many scholars have conducted related studies. A survey [13] of seriously ill patients collected many characteristics of COVID-19 symptoms, found it to be a dangerous virus, and identified potential early indicators of pulmonary fibrosis in survivors. As the epidemic became more serious, some scholars began to explore factors closely related to patient deaths. Zhou et al. [14] used univariate and multivariate logistic regression to explore factors associated with in-hospital mortality. Zheng et al. [15] analyzed the clinical characteristics of severely and non-severely ill COVID-19 patients in 13 articles with a total of 3027 patients to identify risk factors for severe disease or death, so as to predict disease progression effectively, respond with treatment early, and allocate medical resources in a better way.

2.2. Medical Indicators and Vaccine Allocation

According to [16], a survey of Taiwanese medical institutions found that the quality of medical care is highly correlated with its organization and management. The Taiwan Quality Indicator Project (TQIP) collected hospital datasets from 1998 to 2004; in [17], this dataset and a survey in the United States were used for data envelopment analysis, which found that improvements in medical quality services could reduce costs. In [18], the authors analyzed common disease characteristics to improve the quality of national medical care and found that data analysis could help hospitals make better decisions.
According to [19], research on vaccine distribution has studied how to stop disease transmission effectively and found that reducing the number of transmissions effectively reduces the number of deaths. The vaccine allocation study in [20] noted that epidemics affect roughly 5 to 15% of people worldwide each year. To counter this threat, the United States mass produces a variety of vaccines; however, these vaccines have no effect on newly emergent diseases. COVID-19 required the development and production of a completely new vaccine, after which the problem of vaccine distribution had to be considered.

2.3. Feature Selection Techniques

In the clustering process, too many variables often make the problem more complicated; feature selection or feature extraction therefore becomes very important, as it reduces the dimension of the variables and makes the problem simpler. Some commonly used feature selection methods are described below.

2.3.1. Principal Component Analysis (PCA)

According to the discussion of principal component analysis in [21], this technique seeks the projection of the data onto vectors that maximize the variance of the projected data; it computes the covariance matrix Cov(x) of the data and performs a spectral decomposition on it. Algorithm 1 is shown below:
Algorithm 1 PCA-based feature selection
Inputs: X = {x1, x2, …, xD} // D-dimensional training dataset
Outputs: Y = {y1, y2, …, yd} // lower-dimensional d-dimensional feature set, where d ≤ D
1. Do PCA on X for dimensionality reduction
2. Compute the mean x̄ of the input dataset
3. Calculate the covariance matrix Cov(x)
4. Find the spectral decomposition of Cov(x) and the corresponding eigenvectors and eigenvalues E1, E2, …, ED to get the principal components P = (x1′, x2′, …, xn′), a subset of X
The purpose of using PCA to extract features is to generate a new collection of features with reduced dimensionality compared to the original dataset. This converts a D-dimensional dataset into a new, lower d-dimensional dataset where d ≤ D, as shown in Equation (1). Let X be the original dataset and $x_i$ be the individual variables in the dataset:

$X = (x_1, x_2, x_3, \ldots, x_N)$ (1)
In this study, PCA was used to reduce the dimension of the data, and the specific steps were as follows. The first step was to calculate the mean of X with Equation (2), where N is the number of observations:

$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$ (2)
This will help standardize the data and calculate the covariance. Standardization places variables and values of data within a given range to achieve unbiased results.
The next step is computing the covariance matrix. The covariance matrix is used to identify the correlation and dependencies among the features as shown in Equation (3).
$\mathrm{Cov}(x) = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^{T}$ (3)
The last phase is the spectral decomposition of the covariance matrix using eigenvectors $£_1, £_2, \ldots, £_D$ and eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_D$. This gives Y, as shown in Equation (4), where Y is the lower-dimensional set and $y_i$ are its variables:

$Y = (y_1, y_2, y_3, \ldots, y_d)$ (4)

such that Y is the lower d-dimensional dataset holding the principal components, as shown in Equation (5):

$Y = \left(£_1^{T}(x - \bar{x}),\; £_2^{T}(x - \bar{x}),\; £_3^{T}(x - \bar{x}),\; \ldots,\; £_d^{T}(x - \bar{x})\right)^{T}$ (5)

such that for the original dataset X, the new dimensional representation is Y, which contains the principal components.
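As an illustration of this pipeline, the following sketch reduces a toy dataset with scikit-learn's PCA, which performs the centering, covariance, and spectral-decomposition steps of Algorithm 1 internally; the data shape and component count are assumptions for illustration, not the study's actual configuration.

```python
# A minimal PCA sketch following Algorithm 1; X is synthetic stand-in data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(51, 10))            # e.g., 51 states x 10 indicators (assumed)

pca = PCA(n_components=3)                # keep the d = 3 leading components
Y = pca.fit_transform(X)                 # projections onto eigenvectors of Cov(x)

print(Y.shape)                           # (51, 3): the lower d-dimensional dataset
print(pca.explained_variance_ratio_)     # variance captured by each component
```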

2.3.2. Information Gain

Information gain ranks the variables in the dataset by a feature-sorting method. It uses the entropy principle to measure a set of randomly generated variables [22]. Information gain-based feature selection is shown in Algorithm 2:
Algorithm 2 Information gain-based feature selection
Inputs: dataset D
Outputs: selected features FS
1. Start
2. Initialize threshold for gain g_t
3. Initialize feature-gain map G
4. Get attributes from D into A provided c
5. For each attribute a in A
6.   Find gain g
7.   If (g > g_t) then
8.     Add attribute a and g to G
9.   End if
10. End for
11. For each element in G
12.   If feature is found useful then
13.     Update FS with the feature
14.   End if
15. End for
As can be seen in Algorithm 2, the information gain-based approach finds the information gain pertaining to the importance of each feature in the dataset. Equations (6) and (7) compute the entropy of x and y:

$En(x) = -\sum p(x)\log p(x)$ (6)

$En(y) = -\sum p(y)\log p(y)$ (7)

Once the entropies are computed, their difference gives the gain value. In fact, the gain from x on y is the reduction in entropy and is computed using Equation (8):

$IG(y, x) = En(y) - En(y \mid x)$ (8)
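To make Equations (6)-(8) concrete, the small sketch below computes the empirical entropy and the information gain of a discrete feature with respect to a class label; the toy arrays are assumptions for illustration only.

```python
# A hedged sketch of information gain, IG(y, x) = En(y) - En(y|x),
# computed from empirical probabilities of toy discrete data.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))       # En = -sum p log p, Equations (6)/(7)

def information_gain(y, x):
    cond = 0.0                           # expected entropy of y within each x value
    for v in np.unique(x):
        mask = (x == v)
        cond += mask.mean() * entropy(y[mask])
    return entropy(y) - cond             # Equation (8)

x = np.array([0, 0, 1, 1, 1, 0])         # one discrete feature (toy)
y = np.array([0, 0, 1, 1, 0, 0])         # class labels (toy)
print(information_gain(y, x))            # about 0.459 bits
```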

2.3.3. Gain Ratio

Karegowda et al. [23] introduced an extension of information gain that, after computing the gain, branches on each attribute separately and finds the best score. The information gain metric is used to select the test attribute at each node of a decision tree. Let S be a set consisting of s data samples with m distinct classes. The expected information needed to classify a given sample is given by Equation (9).
$I(s) = -\sum_{i=1}^{m} P_i \log_2(P_i)$ (9)
$P_i$ is the probability that an arbitrary sample belongs to class $C_i$ and is estimated by $s_i/s$. Attribute A has v distinct values. Let $s_{ij}$ be the number of samples of class $C_i$ in a subset $S_j$, where $S_j$ contains those samples of S that have value $a_j$ of A. The entropy based on the partitioning into subsets by A is shown in Equation (10):

$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \cdots + s_{mj}}{s}\, I(S_j)$ (10)
The encoding information that would be gained by branching on A is shown in Equation (11):

$\mathrm{Gain}(A) = I(S) - E(A)$ (11)
The gain ratio applies normalization to the information gain using the split information value shown in Equation (12):

$\mathrm{SplitInfo}_A(S) = -\sum_{i=1}^{v} \frac{|S_i|}{|S|} \log_2\!\left(\frac{|S_i|}{|S|}\right)$ (12)
The above value represents the information generated by splitting the training dataset S into v partitions corresponding to v outcomes of a test on the attribute A. Finally, the gain ratio is defined as shown in Equation (13).
$\mathrm{Gain\ Ratio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(S)}$ (13)
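Extending the previous sketch, the gain ratio of Equation (13) divides the gain by the split information of Equation (12); the snippet below reuses the entropy() and information_gain() functions defined in the information gain sketch above, on the same toy data.

```python
# A sketch of the gain ratio; reuses entropy() and information_gain()
# from the information gain sketch above.
import numpy as np

def split_info(x):
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))       # Equation (12)

def gain_ratio(y, x):
    si = split_info(x)
    return information_gain(y, x) / si if si > 0 else 0.0   # Equation (13)

print(gain_ratio(np.array([0, 0, 1, 1, 0, 0]),
                 np.array([0, 0, 1, 1, 1, 0])))             # about 0.459
```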

2.4. Cluster Analysis

Cluster analysis aims to divide samples into different groups so as to maximize homogeneity within each group and heterogeneity between groups. This concept is similar to "intra-group homogeneity and inter-group heterogeneity" in market segmentation. Two commonly used clustering methods are introduced below.

2.4.1. K-Means

According to [24], K-means clustering consists of two independent stages: the first selects k centers, where the value of k is fixed in advance, and the second assigns each data object to the nearest center. Suppose the target object is x and $\bar{x}_i$ is the mean of cluster $C_i$. The criterion function is defined in Equation (14):

$E = \sum_{i=1}^{k} \sum_{x \in C_i} |x - \bar{x}_i|^2$ (14)
E is the sum of squared errors over all objects in the dataset. The distance in the criterion function is the Euclidean distance, which determines the closest cluster center for each data object. The Euclidean distance $d(x, y)$ between one vector $x = (x_1, x_2, \ldots, x_n)$ and another vector $y = (y_1, y_2, \ldots, y_n)$ is calculated as shown in Equation (15):

$d(x, y) = \left[\sum_{i=1}^{n} (x_i - y_i)^2\right]^{1/2}$ (15)
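As a minimal sketch, the two stages above (fixing k, then assigning objects to the nearest center under Euclidean distance) correspond directly to scikit-learn's KMeans; the data below are synthetic stand-ins for the state-level indicators.

```python
# K-means with a pre-fixed k and Euclidean distance, Equations (14)-(15).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(51, 10))            # synthetic stand-in data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)                        # cluster assignment per observation
print(km.inertia_)                       # E: within-cluster sum of squared distances
```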

2.4.2. Cascade K-Means

Another clustering method is based on [25], which uses the Calinski–Harabasz index to evaluate the number of clusters: distances are calculated, and the index decides whether to continue clustering, so as to find the optimal number of clusters. The matrices $WG^{\{k\}}$ are square symmetric matrices of size p × p. Let WG denote their sum over all clusters, as shown in Equation (16):

$WG = \sum_{k=1}^{K} WG^{\{k\}}$ (16)
The matrices $WG^{\{k\}}$ represent a positive semi-definite quadratic form $Q_k$; their eigenvalues and determinant are greater than or equal to 0. The within-cluster dispersion, denoted $WGSS^{\{k\}}$ or $WGSS_k$, is the trace of the scatter matrix $WG^{\{k\}}$, as shown in Equation (17):

$WGSS^{\{k\}} = \mathrm{Tr}\left(WG^{\{k\}}\right) = \sum_{i \in I_k} \left\| M_i^{\{k\}} - G^{\{k\}} \right\|^2$ (17)
The within-cluster dispersion is the sum of the squared distances between the observations $M_i^{\{k\}}$ and the barycenter $G^{\{k\}}$ of the cluster. Finally, the pooled within-cluster sum of squares WGSS is the sum of the within-cluster dispersions over all clusters, as shown in Equation (18):

$WGSS = \sum_{k=1}^{K} WGSS^{\{k\}}$ (18)
The between-group dispersion measures the dispersion of the clusters with respect to each other. It is the weighted sum of the squared distances between each barycenter $G^{\{k\}}$ and the overall barycenter G, the weight being the number $n_k$ of elements in cluster $C_k$, as shown in Equation (19):

$BGSS = \sum_{k=1}^{K} n_k \left\| G^{\{k\}} - G \right\|^2$ (19)
Using the notation of Equations (18) and (19), the Calinski–Harabasz index is shown in Equation (20):

$CH = \frac{BGSS/(K-1)}{WGSS/(N-K)} = \frac{N-K}{K-1} \cdot \frac{BGSS}{WGSS}$ (20)
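A rough approximation of this cascade procedure is to run K-means over a range of k and keep the k that maximizes the Calinski–Harabasz index; the sketch below uses scikit-learn's built-in score on synthetic data, and is an assumption about how WEKA's cascade behaves rather than a reimplementation of it.

```python
# Selecting the number of clusters by the Calinski-Harabasz index, Equation (20).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(2)
X = rng.normal(size=(51, 10))            # synthetic stand-in data

scores = {}
for k in range(2, 11):                   # candidate cluster counts (assumed range)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)  # (BGSS/(K-1)) / (WGSS/(N-K))

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```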

2.5. Classification

Classification is a critical method of data mining. The idea is to learn a classification function or construct a classification model (usually called a classifier) from existing data. The function or model maps data records in the database to one of the given categories and can be applied to data prediction. Classifiers include decision trees, logistic regression, naive Bayes, neural networks, and other algorithms. The two classifiers used in this study are introduced below.

2.5.1. Random Forest

According to [26], the random tree generation technique evolved from decision trees; the tree-generation method can also be used without first selecting a data target. Random forest is an ensemble learning algorithm that combines bagging with random feature sampling. Its training algorithm applies the general bagging technique to tree learning: given a training set X = x1, …, xn and targets Y = y1, …, yn, bagging repeatedly (B times) samples from the training set with replacement and trains a tree model on each sample:
For b = 1, …, B:
  • Sample, with replacement, n training examples from X, Y; call these Xb, Yb.
  • Train a classification or regression tree fb on Xb, Yb.
After training, the prediction for an unseen sample x is obtained by averaging the predictions of all the individual regression trees on x, as in Equation (21):

$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x)$ (21)
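A compact sketch of this bagging scheme with scikit-learn is shown below; RandomForestClassifier trains B bootstrap trees and aggregates their predictions (majority vote for classification, the analogue of the average in Equation (21)). The data and labels are toy assumptions.

```python
# Random forest as bagged trees with random feature sampling.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(51, 10))            # toy features
y = rng.integers(0, 3, size=51)          # toy cluster labels

rf = RandomForestClassifier(n_estimators=100, random_state=0)  # B = 100 trees
rf.fit(X, y)
print(rf.predict(X[:5]))                 # aggregated predictions for 5 samples
```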

2.5.2. Neural Network

According to [27], the neural network is a classic technology inspired by the human brain. The human brain can process different kinds of information because it has different neurons; it includes hundreds of millions of neurons that connect and share information. When a neuron repeatedly connects to other neurons, it can trigger the brain to control the body to complete certain behaviors. This is essentially a process of learning and absorbing knowledge, with new neural connections stimulating the brain to learn new actions. The process is similar to that used in neural networks, where a neuron is defined as one or more nodes. The structure consists of an input layer, a hidden layer, and an output layer, as shown in Equation (22), where σ(·) is the activation or transfer function, N is the number of input neurons, $V_{ij}$ are the weights, $x_j$ are the inputs to the input neurons, and $T_i^{hid}$ are the threshold terms of the hidden neurons:

$H_i = \sigma\!\left(\sum_{j=1}^{N} V_{ij} x_j + T_i^{hid}\right)$ (22)
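A direct numpy rendering of Equation (22) is sketched below; the weights, thresholds, and inputs are arbitrary illustrative values, and σ is taken to be the logistic sigmoid as an assumption.

```python
# Hidden-layer activations H_i = sigma(sum_j V_ij x_j + T_i), Equation (22).
import numpy as np

def hidden_layer(x, V, T):
    z = V @ x + T                        # weighted inputs plus thresholds
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation (assumed sigma)

x = np.array([0.2, -1.0, 0.5])           # N = 3 input neurons (toy)
V = 0.1 * np.ones((4, 3))                # 4 hidden neurons x 3 inputs (toy weights)
T = np.zeros(4)                          # threshold terms
print(hidden_layer(x, V, T))             # 4 hidden activations
```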

3. Material and Methods

3.1. Dataset

The dataset for this study was collected from the various states in the United States (The COVID Tracking Project). Each state's data are assigned a quality grade; a grade of A or above indicates that the information provided by that state is relatively sufficient and complete, and vice versa. There are 44 variables in the dataset, roughly divided into cases, PCR tests, antibody tests, antigen tests, hospitalizations, death outcomes, and state metadata, as shown in Table 1.

3.2. Conceptual Framework

The clustering analysis and classification scenario of this study, shown in Figure 1, consists of data preprocessing, the cluster experiment, and classification verification. Feature selection was performed before the dataset was clustered to identify the essential feature variables. The results of the two clustering methods were statistically compared on these essential variables. Finally, after the clustering analysis, classification methods were used for verification via the confusion matrix.

3.3. A Brief Review of Clustering Techniques

Clustering groups all the data, assigning similar data to the same group. A piece of data belongs to only one group, and each group is called a cluster. Similarity is usually judged by the distance between data points: the closer the distance, the more similar the points are presumed to be, and the denser the neighbors, the more similar they are presumed to be. Clustering techniques are quite diverse. In the 1970s, most published studies used hierarchical algorithms. In addition to generating a tree graph, these algorithms can present the division of relatively important and target clusters [28]. However, once a merge or split decision is executed in a purely hierarchical clustering method, it cannot be undone, which affects the quality of the clustering; moreover, an object cannot move to another cluster [29]. Density-based algorithms play a vital role for nonlinear shapes and structures derived from density; their core concepts are density accessibility and density connectivity. However, most indicators used to evaluate or compare cluster analysis results are not suitable for evaluating density-based clustering [30].
In the past few years, there has been much work on graph-based clustering, and theorists have extensively studied the properties of clustering and the quality measures of various clustering algorithms using the elegant mathematical structures of graph theory [31]. For example, the Markov Cluster Algorithm is a fast and scalable unsupervised clustering algorithm for graphs (also known as networks) based on the simulation of (random) flow in a graph [32]. A comparison of different clustering techniques, including their pros and cons, is shown in Table 2. This study employs a partitioning algorithm, a non-hierarchical approach that evaluates clustering results by constructing various partitions; the criteria are globally optimal or efficient heuristics. K-means is the most commonly used method [33]; it requires the number of clusters to be defined in advance. The dataset in this study does not form a complex real network or shape, so the cluster analysis is performed with the well-established K-means and the modified Cascade K-means.

3.4. Data Preprocessing

3.4.1. Feature Selection

According to [34], the more complex the data, the lower its readability; reducing the data dimension therefore becomes a matter of course, and dimensionality reduction is a method commonly used for feature selection. Uğuz [35] extracted features with dimensionality reduction in different hybrid models using a two-stage approach and obtained better results. Therefore, this study used PCA, IG, and GR to select essential variables and then retained the co-occurring variables. The feature selection method and the detailed WEKA feature selection process are shown in Figure 2. The selected common variables are used for the next step of cluster analysis.
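The co-occurrence step can be sketched as a simple set intersection over the three rankings; the ranked lists below are hypothetical placeholders, not the study's actual top-20% output.

```python
# A sketch of the co-occurrence step: keep the variables ranked in the
# top 20% by all three of PCA, IG, and GR. The lists are hypothetical.
top_pca = ["A19", "A21", "A30", "A31", "A8"]
top_ig  = ["A19", "A2", "A30", "A31", "A13"]
top_gr  = ["A19", "A39", "A31", "A13", "A16"]

common = set(top_pca) & set(top_ig) & set(top_gr)
print(sorted(common))                    # variables retained for clustering
```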

3.4.2. Clustering Analysis and Classification

This study used the 2020 to 2021 dataset from the COVID Tracking Project to conduct the cluster analysis. To avoid the subjectivity of relying on a single clustering method, we compared the results of two clustering methods after data preprocessing: both K-means and Cascade K-means were used in the cluster experiments. Classification verification was performed after the cluster analysis was completed. Previous research found that the random forest (RF) and neural network (NN) methods perform well [36]; we therefore used these two methods to compare the confusion matrices of the classification results and verify the effect of the cluster analysis. The detailed WEKA cluster experiment and classification verification process are shown in Figure 3.
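A simplified end-to-end sketch of this verification loop, assuming synthetic data in place of the COVID Tracking Project indicators, is as follows: cluster the observations, treat the cluster labels as classes, train a classifier, and inspect the confusion matrix.

```python
# Cluster, then verify the clusters with a classifier's confusion matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(4)
X = rng.normal(size=(51, 10))            # stand-in for 51 states x 10 indicators

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
pred = cross_val_predict(RandomForestClassifier(random_state=0), X, labels, cv=5)

print(confusion_matrix(labels, pred))    # rows: cluster label; cols: predicted
print(accuracy_score(labels, pred))      # proxy for clustering quality
```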

4. Results

4.1. Feature Selection and Clustering Analysis Result

The descriptive statistics such as the minimum, maximum, average, and standard deviation of the data fields marked as grade A in The COVID tracking project dataset are shown in Table 3.
We first performed principal component analysis (PCA) on the dataset and extracted the first 20% of the variables from the PCA results. The detailed results are shown in Table 4.
Next, we extracted feature variables with the IG and GR methods. These also take the top 20% of feature variables, which are sorted and compared with the top 20% of the PCA feature variables. The results are shown in Table 5. The feature variables that appear repeatedly across the three rankings in Table 5 were selected for this study. Table 6 lists these cluster feature variables, 10 in total, with their definitions.
According to the critical cluster characteristics in Table 6, the number of confirmed cases, the number of deaths, the number of hospitalizations, the number of PCR tests, and other variables are related. This suggests that the subsequent cluster analysis results will relate positively to the severity of the confirmed outbreak; age, hospitalization, and death are closely related [37].
This study then carried out the cluster analysis. The variables retained after feature selection (FS), listed in Table 6, were analyzed with the K-means and Cascade K-means methods in WEKA. The clustering results for the US states are shown in Table 7.
It can be observed from Table 7 that the results of the two cluster analyses are not the same.
To determine which cluster analysis results were better, we first performed an analysis of variance (ANOVA) between clusters. Table 8 shows the between-cluster ANOVA results for the K-means clustering: 7 of the 10 selected variables are significant. Table 9 shows the between-cluster ANOVA results for the Cascade K-means clustering: all 10 variables are significant. These preliminary results indicate that the Cascade K-means clusters differ significantly on every variable, so its clustering effect is better.
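A minimal sketch of this per-variable ANOVA, assuming synthetic data and stand-in cluster assignments, is shown below.

```python
# One-way F-test per variable across the three clusters.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(5)
X = rng.normal(size=(51, 10))            # 51 states x 10 selected variables (toy)
labels = rng.integers(0, 3, size=51)     # stand-in cluster assignments

for j in range(X.shape[1]):
    groups = [X[labels == g, j] for g in range(3)]
    stat, p = f_oneway(*groups)          # between-cluster ANOVA for variable j
    print(f"variable {j}: p = {p:.3f}")
```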

4.2. Classification Validation

Although the ANOVA results indicate that the clustering effect of Cascade K-means is better, more evidence is needed. In this section, we therefore train and test two classifiers. The accuracy of the confusion matrix produced by each classifier represents the quality of the clustering: the higher the accuracy, the better the clustering effect.
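As a worked example, reading accuracy off a confusion matrix is just the trace over the total; using the K-means/random forest matrix reported in Table 10 below, this gives 42/51 = 82.35%.

```python
# Accuracy from a confusion matrix: correct predictions on the diagonal.
import numpy as np

cm = np.array([[0, 0, 5],
               [0, 1, 4],
               [0, 0, 41]])              # Table 10: rows actual, cols predicted

accuracy = np.trace(cm) / cm.sum()
print(f"{accuracy:.4f}")                 # 0.8235, i.e., 82.35%
```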

4.2.1. Validation of Random Forest

This section verifies the clustering results of K-means through the confusion matrix of the random forest classifier. As shown in Table 7, the K-means results are divided into three groups: the first group has 5 states, the second group has 5 states, and the third group has 41 states. This study uses WEKA's random forest (RF) classifier for training and testing; the confusion matrix of the test results is shown in Table 10. In the first group, all five states were misclassified to the third group, and four states of the second group were misclassified to the third group. The accuracy of random forest classification on the K-means clustering results was 82.35%.
In addition, as can be seen from Table 7, the Cascade K-means results are also divided into three groups: the first with 22 states, the second with 22 states, and the third with 7 states. Again using WEKA's random forest (RF) classifier for training and testing, the confusion matrix of the test results is shown in Table 11. In the first group, only one state was misclassified to the third group and the rest were classified correctly, for a classification accuracy of 98.03%. This again shows that the Cascade K-means cluster analysis results are better than those of K-means.

4.2.2. Validation of Neural Network

Lest the random forest results be too one-sided, in this section we use a neural network classifier to verify the cluster analysis results again. Table 12 is the confusion matrix of the K-means cluster analysis results verified through neural network classification. As can be seen from Table 12, five states of the first group were misclassified to the third group; the second group had three states misclassified; and the third group had two states misclassified to the second group. The classification accuracy for the K-means clustering was 80.39%.
Table 13 is the confusion matrix of the Cascade K-means cluster analysis results verified through neural network classification. It can be seen from Table 13 that, under the neural network verification, the first group and the third group each had three misclassified states, and the classification accuracy was 88.23%. Table 14 compares the classification verification of the two clustering methods: Cascade K-means has better accuracy, precision, and recall. The results again show that the Cascade K-means cluster analysis results are better than those of K-means.
The above results show that the Cascade K-means cluster analysis is the best, so the following discussion is based on its results.

5. Discussion

After classification and verification, this study uses the Cascade K-means cluster analysis results, which have the highest accuracy, as the cluster basis. Figure 4 shows the characteristics of the 10 feature variables across the 3 clusters. The proportions of the variables in the first cluster are all below 20%. The second cluster is mainly characterized by (totalTestResults) and (deathConfirmed). The third cluster is mainly characterized by the hospitalization-related variables (hospitalizedIncrease), (hospitalizedCurrently), and (hospitalizedCumulative). Next, we label the three clusters of the cluster analysis results. The averages of the 10 feature variables of the 3 clusters were used to label the severity of the epidemic infection in each US state as high, medium, or low. Table 15 shows the cluster labeling results.
Figure 5 is a color map of the severity of the outbreak in each US state based on the findings of this study. California, in the lower-left corner, was omitted because the data collected there at the beginning of the outbreak were insufficient and were not rated A-level. The color map shows that the outbreak was more serious in the southeastern United States. At the same time, the United States also had a large outbreak of influenza in 2021, as shown in Figure 6, and the southeastern United States was also the main infection area, which is an interesting finding. In the future, decision makers can refer to these two maps and, after comparing the severities, derive an epidemic prevention strategy to improve future epidemic prevention efficacy.
COVID-related vaccine supplies increased throughout 2021, but it takes months to obtain enough vaccines for everyone who needs them; existing vaccines must be distributed to priority groups until there is a sufficient supply for all [37]. In this study, the clustering variables include the number of confirmed cases, the number of deaths, the number of PCR tests used, and the number of ventilators. In addition to determining possible priorities through the clustering results, other external information can also be used to coordinate vaccine allocation, for example, prioritizing older age groups or front-line workers, or using vaccine distribution models [38,39]. Wingert et al. [40] used machine learning methods to establish a cluster analysis of regional severity, which may provide an alternative perspective for future vaccine allocation: using multivariate analysis of the severity of risk factors to allocate vaccines.

6. Conclusions

COVID-19 has spread and ravaged the world since December 2019, and under a vaccine shortage, how to distribute vaccines is a critical issue. This study uses machine learning techniques combined with the medical indicators of the COVID Tracking Project to perform a cluster analysis of the US states and distinguish the severity of COVID-19 infection. The dataset was collected from 2020 to 2021. After feature selection and clustering, classification methods were used to verify the results; the verification shows that the Cascade K-means clustering is better and its classification accuracy is the highest. This study also labeled the three clusters of US states as high, medium, and low infection so that policymakers can more objectively decide which states should be prioritized for vaccine allocation in the event of a shortage. It is hoped that, if a similar pandemic occurs in the future, relevant policymakers can use the procedure of this study to allocate medical resources according to the severity of infection in each state, preventing the spread of infection and the loss of more lives. The clustering results can also be combined with other external information to shape a more careful vaccine distribution plan, supporting decision makers in a potential future outbreak.

Author Contributions

Conceptualization, P.-L.S. and M.-H.S.; data curation, C.-J.L.; formal analysis, T.-W.W. and C.-J.L.; funding acquisition, D.-H.S. and P.-L.S.; investigation, P.-L.S.; methodology, D.-H.S.; project administration, D.-H.S.; resources, T.-W.W. and M.-H.S.; supervision, D.-H.S.; validation, M.-H.S.; visualization, P.-L.S.; writing—original draft, C.-J.L.; writing—review and editing, D.-H.S., T.-W.W. and M.-H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Taiwan Ministry of Science and Technology (grant MOST 110-2410-H-224-010). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. CDC COVID-19 Response Team; Bialek, S.; Boundy, E.; Bowen, V.; Chow, N.; Cohn, A.; Dowling, N.; Ellington, S.; Gierke, R.; Hall, A.; et al. Severe outcomes among patients with coronavirus disease 2019 (COVID-19)—United States, 12 February–16 March 2020. Morb. Mortal. Wkly. Rep. 2020, 69, 343–346.
  2. Jit, M.; Jombart, T.; Nightingale, E.S.; Endo, A.; Abbott, S.; Edmunds, W.J. Estimating number of cases and spread of coronavirus disease (COVID-19) using critical care admissions, United Kingdom, February to March 2020. Eurosurveillance 2020, 25, 2000632.
  3. Chen, S.I.; Wu, C.Y.; Wu, Y.H.; Hsieh, M.W. Optimizing influenza vaccine policies for controlling 2009-like pandemics and regular outbreaks. PeerJ 2019, 7, e6340.
  4. Kurbucz, M.T. A joint dataset of official COVID-19 reports and the governance, trade and competitiveness indicators of World Bank group platforms. Data Brief 2020, 31, 105881.
  5. Liu, T.; Gong, D.; Xiao, J.; Hu, J.; He, G.; Rong, Z.; Ma, W. Cluster infections play important roles in the rapid evolution of COVID-19 transmission: A systematic review. Int. J. Infect. Dis. 2020, 99, 374–380.
  6. Álvarez, J.D.; Matias-Guiu, J.A.; Cabrera-Martín, M.N.; Risco-Martín, J.L.; Ayala, J.L. An application of machine learning with feature selection to improve diagnosis and classification of neurodegenerative disorders. BMC Bioinform. 2019, 20, 1–12.
  7. Hasegawa, M.; Sakurai, D.; Higashijima, A.; Niiya, I.; Matsushima, K.; Hanada, K.; Kuroda, K. Towards automated gas leak detection through cluster analysis of mass spectrometer data. Fusion Eng. Des. 2022, 180, 113199.
  8. Trelohan, M.; François-Lecompte, A.; Gentric, M. Tourism development or nature protection? Lessons from a cluster analysis based on users of a French nature-based destination. J. Outdoor Recreat. Tour. 2022, 39, 100496.
  9. Dzuba, S.; Krylov, D. Cluster analysis of financial strategies of companies. Mathematics 2021, 9, 3192.
  10. Ghavidel, M.; Azhdari, S.M.H.; Khishe, M.; Kazemirad, M. Sonar data classification by using few-shot learning and concept extraction. Appl. Acoust. 2022, 195, 108856.
  11. Tepe, C.; Demir, M.C. Real-Time Classification of EMG Myo Armband Data Using Support Vector Machine; IRBM: Pomezia, Italy, 2022.
  12. Dritsas, E.; Trigka, M. Stroke risk prediction with machine learning techniques. Sensors 2022, 22, 4670.
  13. Huang, W.; Wu, Q.; Chen, Z.; Xiong, Z.; Wang, K.; Tian, J.; Zhang, S. The potential indicators for pulmonary fibrosis in survivors of severe COVID-19. J. Infect. 2021, 82, e5–e7.
  14. Zhou, F.; Yu, T.; Du, R.; Fan, G.; Liu, Y.; Liu, Z.; Cao, B. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: A retrospective cohort study. Lancet 2020, 395, 1054–1062.
  15. Zheng, Z.; Peng, F.; Xu, B.; Zhao, J.; Liu, H.; Peng, J.; Tang, W. Risk factors of critical & mortal COVID-19 cases: A systematic literature review and meta-analysis. J. Infect. 2020, 81, e16–e25.
  16. Pan, T.; Fang, K. An effective information support system for medical management: Indicator based intelligence system. Int. J. Comput. Appl. 2010, 32, 119–124.
  17. Chang, S.J.; Hsiao, H.C.; Huang, L.H.; Chang, H. Taiwan quality indicator project and hospital productivity growth. Omega 2011, 39, 14–22.
  18. Mainz, J.; Krog, B.R.; Bjørnshave, B.; Bartels, P. Nationwide continuous quality improvement using clinical indicators: The Danish National Indicator Project. Int. J. Qual. Health Care 2004, 16 (Suppl. 1), i45–i50.
  19. Medlock, J.; Galvani, A.P. Optimizing influenza vaccine distribution. Science 2009, 325, 1705–1708.
  20. Enayati, S.; Özaltın, O.Y. Optimal influenza vaccine distribution with equity. Eur. J. Oper. Res. 2020, 283, 714–725.
  21. Abdi, H.; Williams, L.J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459.
  22. Ramesh, G.; Madhavi, K.; Reddy, P.D.K.; Somasekar, J.; Tan, J. Improving the accuracy of heart attack risk prediction based on information gain feature selection technique. Mater. Today Proc. 2022, in press.
  23. Karegowda, A.G.; Manjunath, A.S.; Jayaram, M.A. Comparative study of attribute selection using gain ratio and correlation based feature selection. Int. J. Inf. Technol. Knowl. Manag. 2010, 2, 271–277.
  24. Na, S.; Xumin, L.; Yong, G. Research on k-means clustering algorithm: An improved k-means clustering algorithm. In Proceedings of the 2010 Third International Symposium on Intelligent Information Technology and Security Informatics, Jinggangshan, China, 2–4 April 2010; pp. 63–67.
  25. Desgraupes, B. Clustering indices. Univ. Paris Ouest-Lab Modal'X 2013, 1, 34.
  26. Shi, T.; Horvath, S. Unsupervised learning with random forest predictors. J. Comput. Graph. Stat. 2006, 15, 118–138.
  27. Wang, S.C. Artificial neural network. In Interdisciplinary Computing in Java Programming; Springer: Boston, MA, USA, 2003; pp. 81–100.
  28. Murtagh, F.; Contreras, P. Algorithms for hierarchical clustering: An overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2012, 2, 86–97.
  29. Rani, Y.; Rohil, H. A study of hierarchical clustering algorithm. Int. J. Inf. Comput. Technol. 2013, 3, 1115–1122.
  30. Campello, R.J.; Kröger, P.; Sander, J.; Zimek, A. Density-based clustering. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1343.
  31. Chen, Z.; Ji, H. Graph-based clustering for computational linguistics: A survey. In Proceedings of the TextGraphs-5-2010 Workshop on Graph-Based Methods for Natural Language Processing, Uppsala, Sweden, 16 July 2010; pp. 1–9.
  32. Somasekar, H.; Naveen, K. Text categorization and graphical representation using Improved Markov Clustering. Int. J. 2018, 11, 107–116.
  33. Kameshwaran, K.; Malarvizhi, K. Survey on clustering techniques in data mining. Int. J. Comput. Sci. Inf. Technol. 2014, 5, 2272–2276.
  34. Mitra, P.; Murthy, C.A.; Pal, S.K. Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 301–312.
  35. Uğuz, H. A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowl.-Based Syst. 2011, 24, 1024–1032.
  36. Zekić-Sušac, M.; Has, A.; Knežević, M. Predicting energy cost of public buildings by artificial neural networks, CART, and random forest. Neurocomputing 2021, 439, 223–233.
  37. Reilev, M.; Kristensen, K.B.; Pottegård, A.; Lund, L.C.; Hallas, J.; Ernst, M.T.; Thomsen, R.W. Characteristics and predictors of hospitalization and death in the first 11,122 cases with a positive RT-PCR test for SARS-CoV-2 in Denmark: A nationwide cohort. Int. J. Epidemiol. 2020, 49, 1468–1481.
  38. Swift, M.D.; Sampathkumar, P.; Breeher, L.E.; Ting, H.H.; Virk, A. Mayo Clinic's multidisciplinary approach to COVID-19 vaccine allocation and distribution. NEJM Catal. Innov. Care Deliv. 2021, 2, 1–9.
  39. Bertsimas, D.; Ivanhoe, J.; Jacquillat, A.; Li, M.; Previero, A.; Lami, O.S.; Bouardi, H.T. Optimizing vaccine allocation to combat the COVID-19 pandemic. medRxiv 2020.
  40. Wingert, A.; Pillay, J.; Gates, M.; Guitard, S.; Rahman, S.; Beck, A.; Hartling, L. Risk factors for severity of COVID-19: A rapid review to inform vaccine prioritisation in Canada. BMJ Open 2021, 11, e044684.
Figure 1. Clustering analysis and classification scenario.
Figure 2. WEKA feature selection process.
Figure 3. Clustering and classification validation with WEKA.
Figure 4. Cluster characteristics.
Figure 5. Clustering results from Cascade K-means.
Figure 6. Influenza activity according to the CDC.
Table 1. Dataset.

Item  Variable                          Attribute
1     date                              String
2     state                             String
3     dataQualityGrade                  String
4     positive                          Numerical
5     positiveIncrease                  Numerical
6     probableCases                     Numerical
7     positiveScore                     Numerical
8     positiveCasesViral                Numerical
9     positiveTestsViral                Numerical
10    positiveTestsPeopleAntibody       Numerical
11    positiveTestsAntibody             Numerical
12    positiveTestsPeopleAntigen        Numerical
13    positiveTestsAntigen              Numerical
14    negative                          Numerical
15    negativeTestsViral                Numerical
16    negativeTestsPeopleAntibody       Numerical
17    negativeTestsAntibody             Numerical
18    negativeIncrease                  Numerical
19    pending                           Numerical
20    hospitalized                      Numerical
21    hospitalizedIncrease              Numerical
22    hospitalizedCumulative            Numerical
23    hospitalizedCurrently             Numerical
24    inIcuCumulative                   Numerical
25    inIcuCurrently                    Numerical
26    onVentilatorCumulative            Numerical
27    onVentilatorCurrently             Numerical
28    death                             Numerical
29    deathIncrease                     Numerical
30    deathProbable                     Numerical
31    deathConfirmed                    Numerical
32    recovered                         Numerical
33    totalTestResults                  Numerical
34    totalTestResultsIncrease          Numerical
35    totalTestsViral                   Numerical
36    totalTestsViralIncrease           Numerical
37    totalTestsPeopleViral             Numerical
38    totalTestsPeopleViralIncrease     Numerical
39    totalTestEncountersViral          Numerical
40    totalTestEncountersViralIncrease  Numerical
41    totalTestsAntigen                 Numerical
42    totalTestsPeopleAntigen           Numerical
43    totalTestsAntibody                Numerical
44    totalTestsPeopleAntibody          Numerical
Table 2. Comparison of different clustering techniques.

Hierarchical: based on linkage methods; numerical data. Pros: easy to implement; good for small datasets. Cons: fails on larger sets; hard to find the correct number of clusters.
Density-based: based on density accessibility and density connectivity; numerical data. Pros: finds clusters of arbitrary shapes and sizes. Cons: does not work well with high-dimensional data.
Graph-based: based on graph theory; mixed data. Pros: performs well with complex shapes of data. Cons: can be costly to compute.
Partitioning: based on mean centroid or medoid centroid; numerical data. Pros: easy to implement; robust and easier to understand. Cons: unable to handle noisy data and outliers.
Table 3. Dataset descriptive statistics.

Data Field                         Minimum    Maximum        Mean           Standard Deviation
death                              3.563      20,146.993     3012.726       4063.720
deathConfirmed                     0.000      11,873.819     1612.264       2564.015
deathIncrease                      0.229      139.083        23.802         28.186
deathProbable                      0.000      1557.983       116.001        241.290
hospitalized                       0.000      74,908.536     6795.447       12,375.868
hospitalizedCumulative             0.000      74,908.536     6795.447       12,375.868
hospitalizedCurrently              0.000      5706.697       1016.505       1178.133
hospitalizedIncrease               −0.868     574.361        52.537         93.318
inIcuCumulative                    0.000      4167.848       374.140        935.623
inIcuCurrently                     0.000      1594.500       198.782        310.440
negative                           3193.583   5,676,868.213  1,217,574.982  1,393,361.630
negativeIncrease                   225.347    66,790.590     10,430.280     13,289.475
negativeTestsAntibody              0.000      274,785.535    9939.138       42,390.124
negativeTestsPeopleAntibody        0.000      307,446.436    9400.504       44,786.195
negativeTestsViral                 0.000      5,069,123.190  376,163.876    913,586.361
onVentilatorCumulative             0.000      1210.757       44.743         189.114
onVentilatorCurrently              0.000      383.805        77.040         115.523
positive                           290.354    689,808.865    126,025.819    137,216.541
positiveCasesViral                 0.000      644,108.814    99,837.740     122,787.805
positiveIncrease                   16.833     6633.789       1390.662       1405.567
positiveScore                      0.000      0.000          0.000          0.000
positiveTestsAntibody              0.000      32,553.559     2128.500       6739.150
positiveTestsAntigen               0.000      28,745.608     1748.258       5172.510
positiveTestsPeopleAntibody        0.000      30,843.331     969.265        4476.315
positiveTestsPeopleAntigen         0.000      19,372.812     912.666        3418.733
positiveTestsViral                 0.000      746,688.084    69,991.507     149,301.079
recovered                          0.000      548,376.917    56,375.980     91,104.014
totalTestEncountersViral           0.000      6,252,107.282  436,938.962    1,266,757.976
totalTestEncountersViralIncrease   0.000      70,634.798     4496.100       13,115.623
totalTestResults                   3643.785   6,252,107.282  1,575,543.098  1,650,858.474
totalTestResultsIncrease           250.514    70,634.798     14,555.709     15,895.585
totalTestsAntibody                 0.000      336,182.488    33,358.436     81,366.549
totalTestsAntigen                  0.000      329,705.523    23,679.767     57,553.140
totalTestsPeopleAntibody           0.000      338,396.716    13,807.533     51,122.147
totalTestsPeopleAntigen            0.000      121,896.261    6203.051       21,180.165
totalTestsPeopleViral              0.000      3,939,157.669  424,758.638    718,630.307
totalTestsPeopleViralIncrease      −251.653   31,530.365     3374.921       5665.766
totalTestsViral                    0.000      5,972,478.403  1,244,215.985  1,591,549.099
totalTestsViralIncrease            0.000      52,143.061     11,304.525     14,366.747
Table 4. PCA feature selection results.

Group                          Pc1    Pc2   Pc3   Pc4   Pc5   Pc6   Pc7   Pc8   Pc9   Pc10  Pc11
Variation                      15.89  6.05  4.69  2.64  1.90  1.38  1.11  0.92  0.64  0.54  0.49
Variation percentage           0.42   0.16  0.12  0.07  0.05  0.04  0.03  0.02  0.02  0.01  0.01
Cumulative contribution ratio  0.42   0.58  0.70  0.77  0.82  0.85  0.88  0.91  0.93  0.94  0.96
Table 5. Feature selection sorting.

Rank  PCA   IG    GR    Average Rank
1     A30   A2    A39   2.7302
2     A19   A19   A11   2.7302
3     A21   A30   A13   2.7302
4     A31   A31   A16   2.7302
5     A8    A13   A18   2.7302
6     A15   A21   A19   2.7302
7     A24   A12   A38   2.7302
8     A34   A4    A12   2.7302
9     A14   A8    A9    2.7302
10    A36   A27   A22   2.6814
11    A9    A38   A8    2.4945
12    A33   A39   A3    2.4945
13    A6    A20   A4    2.4945
14    A7    A7    A5    2.3984
15    A23   A6    A6    2.3984
16    A3    A11   A7    2.3269
17    A5    A9    A21   2.3269
18    A39   A36   A20   2.2745
19    A29   A18   A2    1.9183
20    A18   A3    A31   1.9183
Table 6. Clustering variables.

Code   Variable                  Definition
A3     deathConfirmed            Number of confirmed deaths
A6     hospitalized              Number of hospitalizations
A7     hospitalizedCumulative    Cumulative hospitalizations
A8     hospitalizedCurrently     Number of people currently hospitalized
A9     hospitalizedIncrease      New hospitalizations
A18    onVentilatorCurrently     Number of ventilators currently in use
A19    positive                  Number of confirmed cases
A21    positiveIncrease          Number of new diagnoses
A31    totalTestResults          Total number of tests
A39    totalTestsViralIncrease   Number of new PCR tests
Table 7. Clustering results.

FS + K-means clustering:
Cluster 1 (5 states): LA, MO, NC, PA, TN
Cluster 2 (5 states): IL, MA, MI, OH, TX
Cluster 3 (41 states): AK, AL, AR, AZ, CO, CT, DC, DE, FL, GA, GU, HI, IA, ID, IN, KS, KY, MD, ME, MN, MS, MT, ND, NE, NH, NJ, NM, NV, NY, OK, OR, PR, RI, SC, SD, UT, VA, VT, WA, WI, WY

FS + Cascade K-means clustering:
Cluster 1 (22 states): AL, AR, AZ, CO, FL, GA, IN, KY, MA, MD, MN, MS, NJ, NM, NY, OH, OK, SC, TN, UT, VA, WI
Cluster 2 (22 states): AK, CT, DC, DE, GU, HI, IA, ID, KS, ME, MT, ND, NE, NH, NV, OR, PR, RI, SD, VT, WA, WY
Cluster 3 (7 states): IL, LA, MI, MO, NC, PA, TX
Table 8. ANOVA analysis results with K-means clustering.

Source                  Sum of Squares           df   p-Value
deathConfirmed          96,309,899.004           2    0.000 ***
hospitalized            163,810,675.372          2    0.595
hospitalizedCumulative  163,810,675.372          2    0.595
hospitalizedCurrently   20,454,751.196           2    0.000 ***
hospitalizedIncrease    10,452.301               2    0.558
onVentilatorCurrently   171,650.041              2    0.001 ***
positive                297,933,527,997.266      2    0.000 ***
positiveIncrease        33,214,641.966           2    0.000 ***
totalTestResults        56,194,090,472,628.98    2    0.000 ***
totalTestsViral         73,640,463,254,532.75    2    0.000 ***
*** p < 0.001.
Table 9. ANOVA analysis results with Cascade K-means clustering.

Source                  Sum of Squares           df   p-Value
deathConfirmed          92,467,765.941           2    0.000 ***
hospitalized            2,542,735,930.524        2    0.000 ***
hospitalizedCumulative  2,542,735,930.524        2    0.000 ***
hospitalizedCurrently   28,731,039.605           2    0.000 ***
hospitalizedIncrease    146,195.267              2    0.000 ***
onVentilatorCurrently   182,801.937              2    0.000 ***
positive                428,233,021,802.748      2    0.000 ***
positiveIncrease        47,433,754.861           2    0.000 ***
totalTestResults        62,516,522,292,791.39    2    0.000 ***
totalTestsViral         53,777,136,225,256.51    2    0.000 ***
*** p < 0.001.
Table 10. Random forest classification validation for K-means clustering.

Clustering Class   Classified as Cluster1   Cluster2   Cluster3   Total
Cluster1           0                        0          5          5
Cluster2           0                        1          4          5
Cluster3           0                        0          41         41
Table 11. Random forest classification validation for Cascade K-means clustering.

Clustering Class   Classified as Cluster1   Cluster2   Cluster3   Total
Cluster1           21                       0          1          22
Cluster2           0                        22         0          22
Cluster3           0                        0          7          7
Table 12. Neural network classification validation for K-means clustering.

Clustering Class   Classified as Cluster1   Cluster2   Cluster3   Total
Cluster1           0                        0          5          5
Cluster2           2                        2          1          5
Cluster3           0                        2          39         41
Table 13. Neural network classification validation for Cascade K-means clustering.

Clustering Class   Classified as Cluster1   Cluster2   Cluster3   Total
Cluster1           19                       0          3          22
Cluster2           0                        22         0          22
Cluster3           3                        0          4          7
Table 14. Comprehensive comparison of two clustering methods.

Clustering    K-Means             Cascade K-Means
Validation    RF        NN        RF        NN
Accuracy      0.8235    0.8039    0.9803    0.8823
Precision     1.000     0.911     1.000     0.938
Recall        0.824     0.872     0.980     0.938
Table 15. Cluster labeling.

Severity   States
Low        AK, CT, DC, DE, GU, HI, IA, ID, KS, ME, MT, ND, NE, NH, NV, OR, PR, RI, SD, VT, WA, WY
Medium     AL, AR, AZ, CO, FL, GA, IN, KY, MA, MD, MN, MS, NJ, NM, NY, OH, OK, SC, TN, UT, VA, WI
High       IL, LA, MI, MO, NC, PA, TX
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
