Article

Using Medical Data and Clustering Techniques for a Smart Healthcare System

1 Puli Christian Hospital, Puli 54546, Taiwan
2 Department of Multimedia Game Development and Application, HungKuang University, Taichung 43302, Taiwan
3 PhD Program in Strategy and Development of Emerging Industries, National Chi Nan University, Nantou 54561, Taiwan
4 Department of Information Management, National Chi Nan University, Nantou 54561, Taiwan
* Author to whom correspondence should be addressed.
Electronics 2024, 13(1), 140; https://doi.org/10.3390/electronics13010140
Submission received: 15 November 2023 / Revised: 18 December 2023 / Accepted: 27 December 2023 / Published: 28 December 2023

Abstract

With the rapid advancement of information technology, both hardware and software, smart healthcare has become increasingly achievable. The integration of medical data and machine-learning technology is the key to realizing this potential. The quality of medical data influences the results of a smart healthcare system to a great extent. This study aimed to design a smart healthcare system based on clustering techniques and medical data (SHCM) to analyze potential risks and trends in patients in a given time frame. Evidence-based medicine was also employed to explore the results generated by the proposed SHCM system. Thus, similar and different discoveries examined by applying evidence-based medicine could be investigated and integrated into the SHCM to provide personalized smart medical services. In addition, the presented SHCM system analyzes the relationship between health conditions and patients in terms of the clustering results. The findings of this study show the similarities and differences in the clusters obtained between indigenous patients and non-indigenous patients in terms of diseases, time, and numbers. Therefore, the analyzed potential health risks could be further employed in hospital management, such as personalized health education control, personal healthcare, improvement in the utilization of medical resources, and the evaluation of medical expenses.

1. Introduction

Due to the progress and advantages of information technology and data analysis techniques, smart medical care plays an important role in the modern medical field. Machine-learning and data-mining techniques have provided hospital practitioners with more effective and efficient medical solutions in personalized medicine and led to disease predictions, medical efficiency improvement, and medical resource optimization. To identify similarities among patients, grouping patients into clinically meaningful clusters is essential [1]. Healthcare organizations and physicians take advantage of clustering results to analyze similarities among patients. By clustering patients in terms of diseases, risk factors, lifestyles, or other relevant factors, clustering results can help physicians gain insights into patients’ needs and provide personalized treatments.
Previous studies have pointed out the importance of using medical management databases to analyze patient clusters to learn trends of diseases according to clustering results [2]. The clustering technique is one of the most useful methods for analyzing patient similarities for precision medicine [1]. Analyzing a patient’s potential risks and trends requires a lot of patient-related data, which are recorded every time a patient visits a hospital for medical treatment. In the era of big data, electronic records include a large amount of text, such as the clinical narration of doctors’ advice. Thus, the analysis of electronic records has become more complex than before. In addition, due to the high dimensions of input data, the reduction in dimensions or feature selection can improve model efficiency and the performance of clustering tasks. Zelina et al. [3] proposed a natural language processing (NLP) method to investigate the clinician dataset of Czech breast cancer patients. The developed RobeCzech model is a general-purpose Czech transformer language model and is used for the unsupervised extraction, labeling, and clustering of fragments from clinical records. This study indicated the feasibility as well as the possibility of dealing with unstructured Czech clinical records in a non-supervised training manner. Irving et al. [4] employed electronic medical record (EMR) data to enhance the detection and prediction of psychosis risk in South London. In addition to basic patient information, clinical characteristics, symptoms, and substances, the EMR data included NLP predictions. The authors reported that using NLP to cope with EMRs can significantly improve the prognostic accuracy of psychosis risk.
Issues of concern in existing electronic medical records and eHealth systems include technical aspects, managerial factors, and particularly the quality of data in systems [5]. Additionally, as previously pointed out, the quality of the data is essential for healthcare systems [6]. Thus, this study aimed to deal with various data types by applying data preprocessing with data merging, data conversion, data cleaning, data selection, and data normalization. Then, clustering techniques were employed to group patients with similar medical features to improve the data quality for the healthcare system.
This investigation used demographic information, drug items, doctors’ advice, and exam items to perform clustering tasks and then to analyze the results in terms of indigenous people and non-indigenous people. Four clustering methods were used in this study, namely, K-means, hierarchical clustering, autoencoder, and SOM-KM. The clustering performance was evaluated through three indicators: the Calinski–Harabasz index (CH), Davies–Bouldin index (DB), and Silhouette Coefficient (SC). For most cases and indices, K-means outperformed the other methods. Therefore, K-means was used to analyze the clustering results. The rest of this study is organized as follows. Section 2 illustrates the clustering methods and applications in medical data analysis. The presented smart healthcare system based on clustering techniques and big data is introduced in Section 3. Section 4 depicts numerical examples. Finally, conclusions are presented in Section 5.

2. Clustering Techniques and Applications for Medical Data Analysis

Ezugwu et al. [7] and Saxena et al. [8] reported that clustering techniques can be divided into two major categories, namely, hierarchical clustering algorithms and partition clustering algorithms. More clustering categories, including grid clustering, density clustering, and model clustering, were proposed by Chaudhry et al. [9] and Oyewole and Thopil [10]. K-means and hierarchical clustering techniques are the most widely used algorithms in the literature. K-means clustering is one of the partition clustering methods. Applications of clustering approaches in medical data analysis include disease nosology [11], early diagnosis of diseases [12,13], predictions of diseases [14,15], etc. The clustering of diseases is mostly for chronic diseases and severe illnesses, for example, diabetes [13,16,17], heart failure [18,19], cancer [20,21], stroke [22,23], and COVID-19 cases [24,25]. Arora et al. [16] used K-means clustering for the prediction of diabetes. Jasinska-Piadlo et al. [18] employed K-means to cluster emergency readmissions of heart failure patients by using a data-driven approach and domain-leading methods. Heart failure patients usually have various characteristics at the physiological level. The study indicated that the K-means clustering algorithm could identify patients with heart failure very well. Ilbeigipour et al. [25] used SOM (self-organizing map) neural networks and the K-means technique to cluster COVID-19 patients and investigated the relationships between different symptoms of cases. The findings of this study could help health specialists improve their services by considering other important factors in treating COVID-19 patients in different ethnic groups. Table 1 lists a summary of recent clustering approaches for medical data. It can be observed that most studies dealt with a single disease, and K-means was commonly used as a popular clustering technique for analyzing medical data. 
The clustering approaches can be generally classified into two categories: hierarchical clustering algorithms and partition clustering algorithms [7,8,9,10]. This study employed four clustering methods, K-means (KM), hierarchical clustering (HC), the autoencoder with K-means (AEKM), and the self-organizing map with K-means (SOMKM), to analyze medical data.
The K-means method [26] involves dividing a sample dataset into k subsets, forming k clusters, and assigning n data points to these k clusters, with each data point exclusively belonging to one cluster. The K-means algorithm is an iterative process that consists of two primary steps. Initially, it selects k cluster centers, and subsequently, it assigns data points to the nearest center to obtain an initial result. Following this, the centroids of each cluster are updated as new centers, and these two steps are repeated iteratively. The objective of the clustering results is to minimize the distance between data points and their respective cluster centers. The objective function of the K-means algorithm is shown in the following equations. Equation (1) employs the Euclidean distance to ensure that data point $x_i$ is closest to its assigned center, while Equation (2) is used to update the center as the mean value [27,28,29,30].
$$Obj = \sum_{i=1}^{k} \sum_{j=1}^{N} (x_i - x_j)^2 \tag{1}$$
$$X_k = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{2}$$
where $k$ is the number of cluster centers, $N$ is the number of data points in the $i$th cluster, $x_j$ is the cluster mean, and $x_i$ is the $i$th point in the dataset.
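As an illustration of Equations (1) and (2), a minimal NumPy sketch of the two-step K-means iteration (assignment, then centroid update) might look as follows; the data and cluster count are synthetic stand-ins, not the study's patient data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data as a stand-in for the preprocessed patient features.
X = rng.normal(size=(100, 2))
k = 3

# Step 1: pick k initial cluster centers (here, random data points).
centers = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(10):
    # Assignment step (Eq. 1): each point goes to its nearest center.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)
    # Update step (Eq. 2): each center becomes the mean of its points
    # (a center is kept unchanged if its cluster happens to be empty).
    centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centers[j] for j in range(k)])

# Objective of Eq. (1): total squared distance of points to their centers.
obj = sum(((X[assign == j] - centers[j]) ** 2).sum() for j in range(k))
```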
For the K-means clustering algorithm, it is necessary to pre-specify the number of clusters denoted by K. This is an important hyperparameter of the algorithm. To determine the most suitable number of clusters for the experimental data, the Elbow Method is employed in this approach [31].
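The Elbow Method described above can be sketched with scikit-learn by recording the within-cluster sum of squares (the model's inertia) for a range of K values; the dataset here is a synthetic stand-in for one month of preprocessed data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the preprocessed patient-visit matrix.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Elbow Method: fit K-means for a range of K and record the inertia
# (within-cluster sum of squared distances); the K at which the curve's
# decrease flattens out suggests a suitable number of clusters.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(2, 9)}
```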
Hierarchical clustering (HC) constructs a hierarchy of clusters by iteratively merging or dividing clusters based on a distance metric. This method provides a visual representation of the data structure through dendrogram plots. There are two main types of hierarchical clustering: agglomerative and divisive. In our study, we employed the agglomerative clustering approach because our data samples were generated from patient records. This method begins with each sample being treated as an individual cluster and then progressively merges clusters that are close in proximity until a certain termination condition is met. For hierarchical clustering, three essential elements, the similarity distance, merging rules, and termination conditions, need to be considered [32]. The hierarchical clustering process is irreversible, and due to its consideration of each individual data point, it can be computationally time-consuming.
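A minimal sketch of agglomerative clustering with SciPy, assuming Ward linkage as the merging rule and a fixed cluster count as the termination condition (both illustrative choices, not necessarily those used in the study):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic stand-in for the patient-record samples.
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# Agglomerative clustering: Ward linkage repeatedly merges the two closest
# clusters; Z records every merge and can be drawn as a dendrogram.
Z = linkage(X, method="ward")

# Termination condition: cut the hierarchy once three clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
```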
Developed in the 1980s by Hinton and the PDP group [33], the autoencoder is an artificial neural network with an input layer, a hidden layer, and an output layer. The main purpose of the autoencoder is to perform representation learning on the input data so that the output carries the same meaning as the input. Autoencoders have been widely used in feature extraction [29,34,35,36]. An $m$-dimensional dataset is considered as $X = \{X_1, X_2, \ldots, X_m\}$. The compressed data features are generated by the encoder $E$, and following that, the output $X'$ is generated by the decoder $D$, which can be expressed by Equation (3):
$$X' = D(E(X)) \tag{3}$$
The training goal of the autoencoder is to minimize the reconstruction error. The loss function can be expressed as Equation (4):
$$\mathrm{Loss}(X, X') = (X - X')^2 \tag{4}$$
After establishing the autoencoder model, the K-means method is then used because the autoencoder is not a clustering tool [35].
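The autoencoder-then-K-means (AEKM) idea can be sketched as follows. As a stand-in for a dedicated autoencoder framework, an `MLPRegressor` is trained to reproduce its own input, and its hidden-layer activations serve as the compressed codes E(X); the bottleneck size and data are illustrative assumptions, not the study's configuration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the preprocessed 8-feature patient data.
X, _ = make_blobs(n_samples=200, n_features=8, centers=3, random_state=1)
X = MinMaxScaler().fit_transform(X)

# Autoencoder surrogate: an MLP trained to reproduce its own input
# (target = input); the 3-unit hidden layer is the bottleneck code E(X).
ae = MLPRegressor(hidden_layer_sizes=(3,), activation="logistic",
                  max_iter=2000, random_state=1)
ae.fit(X, X)

# Recover the encoder output from the trained first-layer weights.
def encode(data):
    z = data @ ae.coefs_[0] + ae.intercepts_[0]
    return 1.0 / (1.0 + np.exp(-z))  # logistic hidden activation

codes = encode(X)

# The autoencoder itself is not a clustering tool, so K-means is
# applied to the compressed codes.
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(codes)
```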
The self-organizing map (SOM) [37] is a method consisting of a two-dimensional grid used for mapping input data. During the training process, the SOM forms an elastic grid to envelop the distribution of input data, mapping adjacent input data to nearby grid units. SOM training is an iterative process that adjusts the positions of grid units by computing distances and finding the Best-Matching Unit (BMU) with prototype vectors. Furthermore, the SOM’s computational complexity scales linearly with the number of data samples, making it memory-efficient, but scales quadratically with the number of map units. Training large maps can be time-consuming, although it can be expedited with specialized techniques. Apart from the SOM, alternative variants are available, though they may require more complex visualization methods. In summary, the SOM is an effective approach for processing large datasets while preserving the topological characteristics of the input space [38].
SOM training is conducted iteratively. In each training step, a sample vector is randomly chosen from the input dataset, and the distances between this sample vector and all prototype vectors are computed. The Best-Matching Unit (BMU) is the map unit whose prototype vector is closest to the sample vector. Subsequently, the prototype vectors are updated: the BMU and its topological neighbors are moved toward the sample vector in the input space. The update rule for the prototype vector of unit $i$ is expressed in Equation (5):
$$v_i(t+1) = v_i(t) + \alpha(t)\, h_{ij}(t)\, [x(t) - v_i(t)] \tag{5}$$
where
  • $v_i(t+1)$ is the updated prototype vector for unit $i$ at time $t+1$;
  • $v_i(t)$ is the current prototype vector for unit $i$ at time $t$;
  • $\alpha(t)$ is the adaptation coefficient at time $t$;
  • $h_{ij}(t)$ is the neighborhood kernel centered on the winning unit at time $t$;
  • $x(t)$ is the sample vector chosen at time $t$.
The SOM is commonly used for dimensionality reduction and data visualization; it maps high-dimensional data into two- or three-dimensional spaces, providing a significant advantage when dealing with complex data. In this study, we leveraged the strengths of both models by first mapping the data into a two-dimensional representation through a SOM and then performing clustering using K-means.
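A minimal SOM-then-K-means (SOMKM) sketch in NumPy, implementing the BMU search and the update rule of Equation (5) directly; the grid size, learning schedule, and data are illustrative assumptions rather than the study's settings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)

# Synthetic stand-in for 6-dimensional patient features.
X, _ = make_blobs(n_samples=300, n_features=6, centers=3, random_state=0)

# A 5x5 SOM grid: one prototype vector per map unit.
rows, cols, dim = 5, 5, X.shape[1]
protos = rng.normal(size=(rows * cols, dim))
grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)

n_steps = 3000
for t in range(n_steps):
    x = X[rng.integers(len(X))]                        # random sample vector
    bmu = np.argmin(((protos - x) ** 2).sum(axis=1))   # Best-Matching Unit
    alpha = 0.5 * (1 - t / n_steps)                    # adaptation coefficient
    sigma = 2.0 * (1 - t / n_steps) + 0.5              # neighborhood radius
    # Gaussian neighborhood kernel h_ij(t) on the 2-D grid, then Eq. (5).
    h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
    protos += alpha * h[:, None] * (x - protos)

# Map every sample to its BMU's 2-D grid coordinates, then cluster the
# two-dimensional representation with K-means, as described above.
bmus = np.argmin(((X[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2), axis=1)
coords = grid[bmus]
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)
```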

3. The Proposed SHCM System

After reviewing the clustering techniques and medical data analysis, the proposed smart healthcare system based on clustering techniques and medical data (SHCM) is introduced in this section. Figure 1 depicts the structure and procedures of the designed SHCM system. The SHCM contains four parts: data preprocessing, clustering, performance evaluation, and result analysis. The data were collected from the outpatient clinic database of Puli Christian Hospital and then preprocessed. Next, four clustering techniques were employed to perform the grouping tasks. Three measurements, namely, the Calinski–Harabasz index, the Davies–Bouldin index, and the Silhouette Coefficient, were utilized to evaluate the performance of the clustering techniques. Based on these three measurements, K-means generated better results than the other three clustering approaches in most of the 12 months. Therefore, the clustering results provided by the K-means method were used to observe the grouped data and discuss them with medical doctors. Finally, similar and different discoveries investigated by applying evidence-based medicine could be identified and provided for further use in personalized health education and healthcare. In addition, the utilization of medical resources and the evaluation of medical expenses could possibly be improved.

3.1. Data and Data Preprocessing

The data were collected from the outpatient clinic database of Puli Christian Hospital and included structured and unstructured data from patient consultation information. Because of the diversity of data formats and structures, data preprocessing procedures needed to be conducted first. After preprocessing, a total of 63,151 records remained, covering patients who visited the outpatient clinic from 1 January to 31 December 2022. The four major attributes used in the clustering experiments were demographic information, drug items, doctors' advice, and exam items. Figure 2 illustrates the data preprocessing steps with five stages: data merging, data conversion, data cleaning, data selection, and data normalization.
The raw data collected were presented in four major categories: gender and age, doctors' advice, drug descriptions, and exam items. The merged data included structured and unstructured data. In order to obtain numerical values that can be recognized by the clustering model, unstructured data such as text and symbols were converted into numerical forms. Table 2 shows the conversion method for each attribute; the details are as follows. Categorical data included gender and some of the exam items. Doctors' advice, drug descriptions, and some contents of exam items were expressed as text. Thus, bidirectional encoder representations from transformers (BERT) were used to convert text into vectors, and a principal component analysis (PCA) was employed to reduce the high-dimensional converted results to 10 dimensions. Age and most exam items were already in numerical form. Finally, normalization was performed. The MinMaxScaler was used for data normalization in this study and is represented by Equation (6). Table 3 shows the number of attributes and patient visits in 12 months.
$$X_{MinMaxScaler} = \frac{X - X_{min}}{X_{max} - X_{min}} \tag{6}$$
where $X_{MinMaxScaler}$ is the normalized feature, and $X_{max}$ and $X_{min}$ are the maximum and minimum values of the feature $X$.

3.2. Performance Measurements

Three measurements, the Calinski–Harabasz index (CH), the Davies–Bouldin index (DB), and the Silhouette Coefficient (SC), were used in this study to evaluate the performance of the clustering techniques. The Calinski–Harabasz index (CH) [39] assesses the concentration of data in the clustering results by calculating the ratio of the sum of squared distances between clusters (BGSS) to the within-cluster sum of squares (WGSS). It is one of the commonly used evaluation metrics in K-means and hierarchical clustering. The CH index calculation formula under the assumption of N data points divided into K clusters is shown in Equation (7); with this calculation method, the larger the value, the better [40]:
$$CH = \frac{BGSS}{WGSS} \times \frac{N - K}{K - 1} \tag{7}$$
The DB index [41] evaluates the clustering results by considering both the similarity and separation between different clusters. It computes a dispersion measure ($C_i$) for each cluster from the distances between data points within that cluster. For each pair of clusters, the sum $C_i + C_j$ is divided by the separation $S_{ij}$ between the two clusters, and the worst (largest) ratio is taken for each cluster, as expressed in Equation (8). The DB index is established by averaging these worst-case ratios across all clusters, and a smaller DB index value indicates better clustering results [42].
$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{C_i + C_j}{S_{ij}} \tag{8}$$
The Silhouette Coefficient (SC) [43] considers the similarity of each data point to others within its cluster (A) and the dissimilarity to other clusters (B). The within-cluster similarity (A) measures the distance between the data point and other data points within the same cluster. The between-cluster dissimilarity (B) measures the distance between the data point and data points in other clusters. The formula for calculating the Silhouette Coefficient is in Equation (9):
$$SC = \frac{B - A}{\max(A, B)} \tag{9}$$
The SC values range from −1 to 1: a value close to 1 indicates that data points are very similar to others within their assigned cluster and dissimilar to data points in other clusters, while a value close to −1 suggests that data points are more likely to have been assigned to the wrong cluster [44].
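All three measurements are available in scikit-learn, the library used for the implementation in this study; a minimal sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Synthetic stand-in for one month of preprocessed data.
X, _ = make_blobs(n_samples=300, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

ch = calinski_harabasz_score(X, labels)   # larger is better
db = davies_bouldin_score(X, labels)      # smaller is better
sc = silhouette_score(X, labels)          # closer to 1 is better
```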
The clustering and storing of data in this study were implemented in the Anaconda environment based on the Python programming language and the scikit-learn library. In the next section, numerical data are employed to demonstrate the performance of the SHCM system. Then, the numerical results generated by the SHCM system are observed and analyzed, and conclusions are drawn.

4. Numerical Results

4.1. Clustering Performance with Three Measurements

Table 4 indicates the cluster numbers obtained by the four clustering methods over 12 months. Table 5, Table 6 and Table 7 list the three measurements used to evaluate the performance of the clustering techniques: the Calinski–Harabasz index (CH), the Davies–Bouldin index (DB), and the Silhouette Coefficient (SC). The number of suitable clusters falls between 3 and 5, with three clusters appearing most frequently. Table 5 illustrates the CH indexes of the four clustering approaches; a larger CH value means a better clustering result. Table 6 shows the DB indicators of the different clustering methods; a smaller DB value implies a better clustering result. Table 7 depicts the SC values; an SC value close to 1 means a better clustering result. In summary, the K-means method is superior to the other clustering methods for most of the 12 months of data. It has been pointed out that the K-means approach can provide quite satisfactory results compared to other clustering methods [15,30]. Harada et al. [45] reported that the advantages of K-means are its simple principle and high flexibility. Therefore, the clustering results generated by K-means were used for the analysis in this study.

4.2. Preliminary Analysis of ICD-10-CM Codes between Indigenous Patients and Non-Indigenous Patients

This study focuses on the analysis of disease codes among patients at Puli Christian Hospital. Given the hospital’s service to many indigenous populations, patients were initially categorized into indigenous and non-indigenous groups. Subsequently, we collected and analyzed the International Classification of Diseases (ICD-10-CM) codes assigned by doctors and recorded the top ten most frequent disease codes each month for both groups. Table A1 in Appendix A depicts ICD-10-CM codes and the corresponding diseases. Two main trends were observed. Firstly, type 2 diabetes (E11) consistently ranked within the top three for both indigenous and non-indigenous groups. This highlights a substantial demand for medical care relating to metabolic diseases in the Puli regions. Thus, various complications associated with diabetes underline the need for an in-depth understanding of the medical requirements at different stages. Secondly, the ranking of bacterial infectious diseases (Z20) was initially not prominent but escalated to the top position in June. Notably, 2022 was the year of the COVID-19 pandemic. Comparing this trend with Taiwan’s COVID-19 statistical data, a similar pattern emerges. These two disease codes represent chronic and infectious diseases, respectively, and indicate a diversity and complexity of health issues.

4.3. Analysis of ICD-10-CM after Clustering

To gain a deeper understanding of the differences in medical needs between indigenous and non-indigenous patients, this study employed clustering techniques to stratify the patient population. This stratification facilitated an in-depth exploration of the predominant health conditions within each cluster. Additionally, to compare the primary diseases between indigenous and non-indigenous groups, the criteria for listing major diseases included not only the frequency of disease codes but also proportional representations within each cluster. Disease codes were ranked based on occurrences, and the top ten diseases were selected for further analysis. The proportions of these top ten codes within each cluster were calculated and are listed in Table 8, Table 9, Table 10, Table 11, Table 12, Table 13, Table 14 and Table 15.

4.4. Analysis of Major Disease Codes over 12 Months in Indigenous Groups

An overview of the data from the clustering results shows that in January, the disease codes among indigenous patients were categorized into four main groups. The prevalent conditions included hypertension (I10), hypertensive heart disease (I11), chronic viral hepatitis (B18), gastroesophageal reflux disease (K21), and pregnancy-related care (Z34). As the months progressed to February, new codes emerged, such as lipid disorders (E78), chronic ischemic heart disease (I25), asthma (J45), and sleep disorders (G47), indicating shifts in health issues among indigenous populations. By March, the groupings were reduced to three, with an increased frequency of urinary system diseases (N39), suggesting that this is an area of health concern worth further exploration. In April, the grouping included hypertensive heart disease (I11), gastroesophageal reflux disease (K21), and pregnancy-related care (Z34), alongside infectious diseases (Z20) and chronic viral hepatitis (B18), hinting at possible commonalities among these conditions. In May, the number of groups increased to five, with respiratory diseases like chronic obstructive pulmonary disease (J44) and acute rhinitis (J00) becoming more prominent, potentially relating to prevalent diseases at the time. From June to December, the groupings remained consistent at three, with the recurrent appearances of codes like gout (M10) and pneumonia (J18) often in the same group, highlighting the significance of urinary (e.g., N39) and respiratory diseases (e.g., J18, J20).
Throughout the year, hypertension- and heart-disease-related codes (such as I10 and I11) were almost consistently present, while infectious diseases (like Z20) and seasonal illnesses (such as J18) showed an increase in specific months. Particular conditions like gout (M10) and urinary system diseases (e.g., N39) were especially pronounced among the indigenous population, suggesting possible environmental or physiological factors.

4.5. Analysis of Major Disease Codes in Non-Indigenous Groups

According to the clustering results, in January, non-indigenous patients were grouped into four categories focused on digestive system diseases (K21), sleep disorders (G47), urinary system diseases (N39, N20), gout (M10), chronic viral hepatitis (B18), blood disorders (R31), and infectious diseases (Z20). The grouping pattern continued similarly in February. In March and April, the number of groups was reduced to three, combining infectious diseases (Z20), pregnancy-related care (Z34), digestive system diseases (K21), and sleep disorders (G47), along with groups featuring gout (M10) and chronic viral hepatitis (B18) that now included respiratory disease codes (I25, J44, J18). In April, the group with pregnancy-related care (Z34) also showed occurrences of breast cancer (C50). By May, the number of groups increased to five, including groups with respiratory disease codes (J00) and respiratory symptoms (R05, O47). From June to December, the groups were consistently divided into three: a group focusing on infectious diseases (Z20), pregnancy-related care (Z34), and sleep disorders (G47); a group with urinary system diseases (N39), often accompanied by hematuria (R31) and kidney stones (N20); and a group dominated by gout (M10) and chronic viral hepatitis (B18), frequently associated with respiratory diseases (J44, J18) and chronic ischemic heart disease (I25).
Observations throughout the year show that urinary diseases (N class) and infectious diseases (Z20) were almost constantly present, along with the consistent appearance of respiratory diseases (J class), indicating these as the main health concerns affecting the non-indigenous population. The presence of chronic ischemic heart disease (I25) may relate to lifestyle factors in the non-indigenous community.

4.6. Comparative Analysis of Major Disease Codes among Indigenous and Non-Indigenous Patients

Utilizing K-means clustering, which resulted in groups of three, four, and five, an examination of the primary disease codes was conducted. Across eight months, namely, March, April, and July to December, where the data were clustered into three groups, the same disease codes were consistently observed each month in the indigenous population. To facilitate a clearer observation and comparison of similar codes across different groups, the results are documented in Table 16. Consequently, the outcomes were categorized into three classes: class A, predominantly featuring codes I10, I11, K21, and Z34; class B, primarily centered around code N39; and class C, focusing on code M10.
In the non-indigenous groups, similar patterns were observed. In class A, the codes included K21 and G47, sharing K21 with the indigenous group but adding G47 while lacking I10, I11, and Z34. Both indigenous and non-indigenous groups had N39 as the main code in class B. In class C, both groups shared the M10 code, but an additional N40 code was observed in the non-indigenous group. During the same period, when the data were categorized into three groups, a unique situation was noted in June for the indigenous population. They had groups belonging to class A and class C, but not class B. Instead, there was an additional group, classified as class D, characterized by Z20 as the primary code, accompanied by respiratory diseases (U07, J00, J02, J06, R05). In contrast, the non-indigenous population continued to be categorized under the original classes A, B, and C. This variance in June, particularly considering the specific impact of COVID-19 (U07) in 2022, suggests that the indigenous population was more significantly affected during this month, leading to a different clustering trend. After classifying the main diseases based on their similarities, it becomes easier to observe the differing trends between infectious and chronic diseases over the 12 months. Consequently, the subsequent analysis focuses on infectious diseases and chronic diseases.

4.6.1. Impacts of Infectious Diseases

In the cluster analysis conducted in January and February, when the data were segmented into four groups, both indigenous and non-indigenous populations exhibited grouping patterns similar to the three groups observed in other months. This included class A, predominantly characterized by gastroesophageal reflux disease (K21); class B, centered around urinary system infections (N39); and class C, led by gout (M10). However, a category similar to that observed in June for the indigenous population emerged: class D. Unlike in June, the frequency of respiratory disease codes was lower in January and February, with pneumonia (J18) notably being the second most common condition. Given the outbreak of COVID-19 in Taiwan during these months, the appearance of class D could serve as an early warning signal of the epidemic.
In the analysis of May, considering the greater number of assigned groups due to potential overlapping diseases in patients, primary diseases were selected based on a threshold of 40%. In addition to the similar classes A, B, C, and D, an additional class E was identified, which is primarily associated with acute upper respiratory infections (J00). In this group, the specific COVID-19 disease code (U07) exceeded 43% in both indigenous and non-indigenous populations, aligning with the surge in COVID-19 cases in Taiwan in May. This finding indicates the emergence of a new group of patients seeking medical assistance due to the epidemic, in addition to regular medical patients. Among groups in May, 48% of the indigenous population had the U07 code, compared to 43% in the non-indigenous population. The disparity in the proportion of disease codes suggests a greater impact on indigenous communities. Considering lifestyle factors, indigenous communities, often residing in closely knit tribes, have a higher interaction frequency compared to non-indigenous populations. In addition, indigenous people are slower to receive disease information and initiate preventive measures compared to people in urban areas.

4.6.2. Impacts of Type 2 Diabetes

In the analysis of major diseases, type 2 diabetes (code E11), initially ranked in the top three by patient frequency, did not consistently feature as a primary disease once proportionality was adopted. Unlike infectious diseases, which frequently appeared in more than four groups in January, February, and May, chronic diseases were consistently present in all monthly groupings. This illustrates that, after grouping, each cluster maintained a certain proportion of chronic disease codes, differentiated by the accompanying comorbidities. Additionally, since chronic diseases are closely linked to time and disease progression, age was factored into the analysis. Comparing the groups by age revealed that the age distributions appeared to lag one another by roughly ten years. Figure 3, Figure 4 and Figure 5 illustrate the occurrences in March as an example.
The bar chart displays the number of individuals with the E11 code in terms of age. The line graph represents cumulative cases. Observing the cumulative cases, it is evident that the proportion of the E11 code in the indigenous population is mostly higher than in the non-indigenous groups. The bar chart reveals similar trends in both groups, but with different age brackets. The trend for the non-indigenous group appears a decade later than that for the indigenous group, indicating an earlier onset of E11-related health impacts in the indigenous population.
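The decade-shifted age distributions described above can be reproduced in miniature. The ages below are hypothetical, and `bracket_counts` and `peak_bracket` are illustrative helpers of our own, not code from the study; the only assumption carried over is the 10-year bracket width used in the figures.

```python
# Sketch with invented ages: bin E11 patients into 10-year brackets and
# compare the modal bracket between two groups, as in Figures 3-5.
def bracket_counts(ages, width=10, max_age=100):
    counts = [0] * (max_age // width)
    for a in ages:
        counts[min(a // width, len(counts) - 1)] += 1
    return counts

def peak_bracket(ages):
    counts = bracket_counts(ages)
    return counts.index(max(counts)) * 10  # lower bound of the modal bracket

indigenous = [48, 52, 55, 57, 59, 63, 71]      # modal bracket: 50-59
non_indigenous = [55, 61, 64, 66, 68, 72, 81]  # modal bracket: 60-69
lag = peak_bracket(non_indigenous) - peak_bracket(indigenous)
print(lag)  # 10
```

A positive lag, as here, corresponds to the indigenous distribution peaking a decade earlier, i.e., an earlier onset of E11-related health impacts.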

4.6.3. Impacts of Essential Hypertension and Hypertensive Heart Diseases

Observations from the analysis of major diseases reveal distinct patterns within chronic heart diseases of the I class. Among the indigenous population, the prevalent codes include I10 (essential hypertension) and I11 (hypertensive heart diseases), predominantly occurring in disease groups associated with K21 (gastroesophageal reflux disease). In contrast, the non-indigenous population predominantly showed the presence of I25 (chronic ischemic heart disease). The prevalence of the I10 code, which is linked to genetic factors, indicates a heightened need for preventive measures against heart disease risks in the indigenous population.
Upon categorizing the clustering results based on the main diseases, distinct trends were observed between chronic and infectious diseases. Chronic diseases displayed a more consistent distribution across all 12 months. However, infectious diseases exhibited greater variability. A month-by-month analysis was performed to track the dynamics. Therefore, using clustering methods to differentiate between chronic diseases and infectious diseases provides a clearer outline of regional disease patterns and healthcare needs. This approach allows for a better understanding of medical requirements in different areas and facilitates the provision of appropriate medical assistance tailored to the needs of specific subgroups. Additionally, in the context of epidemic prevention and control, this method can enable early prevention and management based on local infection trends.
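One simple way to quantify the contrast between the steady monthly presence of chronic codes and the burstiness of infectious codes is the coefficient of variation of monthly counts. The sketch below uses invented monthly figures, not the study's data, and the 0.1 cutoff is purely illustrative.

```python
# Sketch: a chronic code (e.g., E11) shows a steady monthly count, while an
# infectious code (e.g., U07) is bursty; the coefficient of variation
# (std / mean) of the 12 monthly counts separates the two patterns.
from statistics import mean, pstdev

def coeff_var(monthly_counts):
    m = mean(monthly_counts)
    return pstdev(monthly_counts) / m

e11 = [178, 170, 206, 190, 167, 145, 160, 171, 166, 180, 175, 169]  # steady
u07 = [0, 0, 0, 5, 83, 39, 20, 12, 8, 30, 5, 2]                     # bursty
print(coeff_var(e11) < 0.1 < coeff_var(u07))  # True
```

Such a statistic could flag codes whose monthly dynamics warrant epidemic-style monitoring rather than chronic-disease management.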

5. Conclusions

This study employed clustering techniques to group and then analyze diseases in the indigenous and non-indigenous populations. K-means clustering obtained better results than the other three clustering techniques on all three evaluation measurements. The developed model can learn distances between clusters and further investigate relations among diseases through the features of the clusters. Although the medical conditions of patient groups vary each month, a consistent clustering trend is observed overall. This trend is particularly pronounced when the primary disease is the same, indicating a higher probability of certain disease codes appearing together. This result lays the foundation for a deeper exploration of potential correlations between diseases.
From the perspectives of chronic diseases and bacterial or viral infections, we noted distinct clustering behaviors between the two. The chronic disease group exhibited consistency in the monthly analyses, while the clustering of bacterial or viral infections showed a close correlation with the stages of epidemic development. This was particularly evident in the context of the 2022 epidemic trends in Taiwan, where changes in the number of clusters were highly correlated with different stages of the epidemic.
The unsupervised clustering method helps identify correlations in complex and varied data that are not readily observable. However, the diversity of the data, along with patients' varying medical needs at the time of consultation, poses challenges in data preprocessing, as do the scale and complexity of the data, which include both structured and unstructured fields. In addition to medical treatment, records may reflect return visits or patients with chronic diseases who only receive medicine without treatment. Thus, the presentation of the data should consider not only the identity of the patient but also the timeliness of the patient's visits.
Only data from 2022 were employed in this study; data from other years could be used to examine the feasibility of the proposed SHCM system. In addition, only data collected from Puli Christian Hospital were used; data from other hospitals could be utilized to investigate the generalization ability of the developed system. Finally, deep clustering techniques could be employed for the clustering tasks of the presented system.

Author Contributions

Conceptualization, P.-F.P., W.-C.Y. and H.-P.H.; methodology, J.-P.L. and P.-F.P.; software, J.-P.L., Y.-H.L. and Y.-L.L.; formal analysis, Y.-H.L.; writing—original draft preparation, Y.-H.L., Y.-L.L. and P.-F.P.; writing—review and editing, P.-F.P.; visualization, Y.-H.L. and Y.-L.L.; supervision, P.-F.P. and W.-C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by funding from Puli Christian Hospital/Chi Nan National University Joint Research Program under grant number 112-PuChi-AIR-001.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to the use of a database with data aggregated by age (10-year age groups) and diagnosis categories.

Informed Consent Statement

Informed consent was not required as cohort members were unidentifiable.

Data Availability Statement

The data presented in this study are available on reasonable request from the corresponding author. The data are not publicly available due to privacy restrictions.

Acknowledgments

The authors thank Kai Yen, Bing-Cheng Chiu, and Yan-Song Chang for their assistance with the data analysis.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

In this study, the analysis of electronic medical record (EMR) data involved labeling patients' diseases according to the International Statistical Classification of Diseases and Related Health Problems, 10th Revision, Clinical Modification (ICD-10 CM). The ICD-10, established by the World Health Organization (WHO), categorizes diseases based on their characteristics and represents them using a coding system. This system is crucial for accurately and systematically recording cases, and it plays a significant role in clinical diagnosis, epidemiological research, health management, and data collection. This appendix includes only the disease codes directly relevant to our study. These codes represent the specific disease types involved in our research and are intended to help readers understand the scope and focus of our study. Each code is accompanied by a brief description of the disease, making it accessible to readers who are not specialists in the field.
Table A1. The ICD-10 CM and brief descriptions.
ICD-10 CM | Diseases
A09 | Infectious gastroenteritis and colitis, unspecified
B08 | Other viral infections characterized by skin and mucous membrane lesions, not elsewhere classified
B18 | Chronic viral hepatitis
C50 | Malignant neoplasm of breast
D64 | Other anemias
E11 | Type 2 diabetes mellitus
E78 | Disorders of lipoprotein metabolism and other lipidemias
E86 | Volume depletion
E87 | Other disorders of fluid, electrolyte and acid-base balance
G47 | Sleep disorders
I10 | Essential (primary) hypertension
I11 | Hypertensive heart disease
I20 | Angina pectoris
I25 | Chronic ischemic heart disease
I50 | Heart failure
J00 | Acute nasopharyngitis [common cold]
J01 | Acute sinusitis
J02 | Acute pharyngitis
J03 | Acute tonsillitis
J06 | Acute upper respiratory infections of multiple and unspecified sites
J12 | Viral pneumonia, not elsewhere classified
J18 | Pneumonia, unspecified organism
J20 | Acute bronchitis
J30 | Vasomotor and allergic rhinitis
J44 | Other chronic obstructive pulmonary disease
J45 | Asthma
K21 | Gastroesophageal reflux disease
K25 | Gastric ulcer
K29 | Gastritis and duodenitis
K59 | Other functional intestinal disorders
K92 | Other diseases of digestive system
L03 | Cellulitis and acute lymphangitis
L08 | Other local infections of skin and subcutaneous tissue
M10 | Gout
M19 | Other and unspecified osteoarthritis
M54 | Dorsalgia
N13 | Obstructive and reflux uropathy
N18 | Chronic kidney disease (CKD)
N20 | Calculus of kidney and ureter
N39 | Other disorders of urinary system
N40 | Benign prostatic hyperplasia
O47 | False labor
P59 | Neonatal jaundice from other and unspecified causes
R00 | Abnormalities of heart beat
R05 | Cough
R07 | Pain in throat and chest
R10 | Abdominal and pelvic pain
R11 | Nausea and vomiting
R31 | Hematuria
R35 | Polyuria
R42 | Dizziness and giddiness
R50 | Fever of other and unknown origin
R51 | Headache
R80 | Proteinuria
U07 | Emergency use of U07
Z11 | Encounter for screening for infectious and parasitic diseases
Z20 | Contact with and (suspected) exposure to communicable diseases
Z34 | Encounter for supervision of normal pregnancy
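In code terms, labeling EMR records with these categories reduces to a dictionary lookup. The subset below is drawn from Table A1; the `label` helper is our own illustrative sketch, not part of the study's pipeline.

```python
# Sketch: a minimal ICD-10 CM lookup for labeling records, using a subset of
# the codes listed in Table A1.
ICD10 = {
    "E11": "Type 2 diabetes mellitus",
    "I10": "Essential (primary) hypertension",
    "K21": "Gastroesophageal reflux disease",
    "M10": "Gout",
    "N39": "Other disorders of urinary system",
    "U07": "Emergency use of U07",
}

def label(code):
    """Return the Table A1 description for a code, or a fallback string."""
    return ICD10.get(code, "Unknown code")

print(label("K21"))  # Gastroesophageal reflux disease
```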

References

  1. Parimbelli, E.; Marini, S.; Sacchi, L.; Bellazzi, R. Patient similarity for precision medicine: A systematic review. J. Biomed. Inform. 2018, 83, 87–96. [Google Scholar] [CrossRef] [PubMed]
  2. Lambert, J.; Leutenegger, A.-L.; Jannot, A.-S.; Baudot, A. Tracking clusters of patients over time enables extracting information from medico-administrative databases. J. Biomed. Inform. 2023, 139, 104309. [Google Scholar] [CrossRef] [PubMed]
  3. Zelina, P.; Halámková, J.; Nováček, V. Unsupervised extraction, labelling and clustering of segments from clinical notes. In Proceedings of the 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Las Vegas, NV, USA, 6–8 December 2022; pp. 1362–1368. [Google Scholar]
  4. Irving, J.; Patel, R.; Oliver, D.; Colling, C.; Pritchard, M.; Broadbent, M.; Baldwin, H.; Stahl, D.; Stewart, R.; Fusar-Poli, P. Using natural language processing on electronic health records to enhance detection and prediction of psychosis risk. Schizophr. Bull. 2021, 47, 405–414. [Google Scholar] [CrossRef] [PubMed]
  5. Ebad, S.A. Healthcare software design and implementation—A project failure case. Softw. Pract. Exp. 2020, 50, 1258–1276. [Google Scholar] [CrossRef]
  6. Mashoufi, M.; Ayatollahi, H.; Khorasani-Zavareh, D.; Talebi Azad Boni, T. Data quality in health care: Main concepts and assessment methodologies. Methods Inf. Med. 2023, 62, 005–018. [Google Scholar] [CrossRef] [PubMed]
  7. Ezugwu, A.E.; Ikotun, A.M.; Oyelade, O.O.; Abualigah, L.; Agushaka, J.O.; Eke, C.I.; Akinyelu, A.A. A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Eng. Appl. Artif. Intell. 2022, 110, 104743. [Google Scholar] [CrossRef]
  8. Saxena, A.; Prasad, M.; Gupta, A.; Bharill, N.; Patel, O.P.; Tiwari, A.; Er, M.J.; Ding, W.; Lin, C.-T. A review of clustering techniques and developments. Neurocomputing 2017, 267, 664–681. [Google Scholar] [CrossRef]
  9. Chaudhry, M.; Shafi, I.; Mahnoor, M.; Vargas, D.L.R.; Thompson, E.B.; Ashraf, I. A systematic literature review on identifying patterns using unsupervised clustering algorithms: A data mining perspective. Symmetry 2023, 15, 1679. [Google Scholar] [CrossRef]
  10. Oyewole, G.J.; Thopil, G.A. Data clustering: Application and trends. Artif. Intell. Rev. 2023, 56, 6439–6475. [Google Scholar] [CrossRef]
  11. Santamaría, L.P.; del Valle, E.P.G.; García, G.L.; Zanin, M.; González, A.R.; Ruiz, E.M.; Gallardo, Y.P.; Chan, G.S.H. Analysis of new nosological models from disease similarities using clustering. In Proceedings of the 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), Rochester, MN, USA, 28–30 July 2020; pp. 183–188. [Google Scholar]
  12. Farouk, Y.; Rady, S. Early diagnosis of alzheimer’s disease using unsupervised clustering. Int. J. Intell. Comput. Inf. Sci. 2020, 20, 112–124. [Google Scholar] [CrossRef]
  13. Hassan, M.M.; Mollick, S.; Yasmin, F. An unsupervised cluster-based feature grouping model for early diabetes detection. Healthc. Anal. 2022, 2, 100112. [Google Scholar] [CrossRef]
  14. Antony, L.; Azam, S.; Ignatious, E.; Quadir, R.; Beeravolu, A.R.; Jonkman, M.; De Boer, F. A comprehensive unsupervised framework for chronic kidney disease prediction. IEEE Access 2021, 9, 126481–126501. [Google Scholar] [CrossRef]
  15. Enireddy, V.; Anitha, R.; Vallinayagam, S.; Maridurai, T.; Sathish, T.; Balakrishnan, E. Prediction of human diseases using optimized clustering techniques. Mater. Today Proc. 2021, 46, 4258–4264. [Google Scholar] [CrossRef]
  16. Arora, N.; Singh, A.; Al-Dabagh, M.Z.N.; Maitra, S.K. A novel architecture for diabetes patients’ prediction using k-means clustering and svm. Math. Probl. Eng. 2022, 2022, 4815521. [Google Scholar] [CrossRef]
  17. Parikh, H.M.; Remedios, C.L.; Hampe, C.S.; Balasubramanyam, A.; Fisher-Hoch, S.P.; Choi, Y.J.; Patel, S.; McCormick, J.B.; Redondo, M.J.; Krischer, J.P. Data mining framework for discovering and clustering phenotypes of atypical diabetes. J. Clin. Endocrinol. Metab. 2023, 108, 834–846. [Google Scholar] [CrossRef] [PubMed]
  18. Jasinska-Piadlo, A.; Bond, R.; Biglarbeigi, P.; Brisk, R.; Campbell, P.; Browne, F.; McEneaneny, D. Data-driven versus a domain-led approach to k-means clustering on an open heart failure dataset. Int. J. Data Sci. Anal. 2023, 15, 49–66. [Google Scholar] [CrossRef]
  19. Mpanya, D.; Celik, T.; Klug, E.; Ntsinjana, H. Clustering of heart failure phenotypes in johannesburg using unsupervised machine learning. Appl. Sci. 2023, 13, 1509. [Google Scholar] [CrossRef]
  20. Florensa, D.; Mateo-Fornés, J.; Solsona, F.; Pedrol Aige, T.; Mesas Julió, M.; Piñol, R.; Godoy, P. Use of multiple correspondence analysis and k-means to explore associations between risk factors and likelihood of colorectal cancer: Cross-sectional study. J. Med. Internet Res. 2022, 24, e29056. [Google Scholar] [CrossRef]
  21. Koné, A.P.; Scharf, D.; Tan, A. Multimorbidity and complexity among patients with cancer in ontario: A retrospective cohort study exploring the clustering of 17 chronic conditions with cancer. Cancer Control 2023, 30, 10732748221150393. [Google Scholar] [CrossRef]
  22. Chantraine, F.; Schreiber, C.; Pereira, J.A.C.; Kaps, J.; Dierick, F. Classification of stiff-knee gait kinematic severity after stroke using retrospective k-means clustering algorithm. J. Clin. Med. 2022, 11, 6270. [Google Scholar] [CrossRef]
  23. Yasa, I.; Rusjayanthi, N.; Luthfi, W.B.M. Classification of stroke using k-means and deep learning methods. Lontar Komput. J. Ilm. Teknol. Inf. 2022, 13, 23. [Google Scholar] [CrossRef]
  24. Al-Khafaji, H.M.R.; Jaleel, R.A. Adopting effective hierarchal iomts computing with k-efficient clustering to control and forecast covid-19 cases. Comput. Electr. Eng. 2022, 104, 108472. [Google Scholar] [CrossRef] [PubMed]
  25. Ilbeigipour, S.; Albadvi, A.; Noughabi, E.A. Cluster-based analysis of covid-19 cases using self-organizing map neural network and k-means methods to improve medical decision-making. Inform. Med. Unlocked 2022, 32, 101005. [Google Scholar] [CrossRef] [PubMed]
  26. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 18–21 June 1965 and 27 December 1965–7 January 1966; University of California Press: Oakland, CA, USA, 1967; pp. 281–297. [Google Scholar]
  27. Na, S.; Xumin, L.; Yong, G. Research on k-means clustering algorithm: An improved k-means clustering algorithm. In Proceedings of the 2010 Third International Symposium on Intelligent Information Technology and Security Informatics, Jian, China, 2–4 April 2010; pp. 63–67. [Google Scholar]
  28. Alam, M.S.; Rahman, M.M.; Hossain, M.A.; Islam, M.K.; Ahmed, K.M.; Ahmed, K.T.; Singh, B.C.; Miah, M.S. Automatic human brain tumor detection in mri image using template-based k means and improved fuzzy c means clustering algorithm. Big Data Cogn. Comput. 2019, 3, 27. [Google Scholar] [CrossRef]
  29. Lee, H.; Choi, Y.; Son, B.; Lim, J.; Lee, S.; Kang, J.W.; Kim, K.H.; Kim, E.J.; Yang, C.; Lee, J.-D. Deep autoencoder-powered pattern identification of sleep disturbance using multi-site cross-sectional survey data. Front. Med. 2022, 9, 950327. [Google Scholar] [CrossRef] [PubMed]
  30. Setiawan, K.E.; Kurniawan, A.; Chowanda, A.; Suhartono, D. Clustering models for hospitals in jakarta using fuzzy c-means and k-means. Procedia Comput. Sci. 2023, 216, 356–363. [Google Scholar] [CrossRef] [PubMed]
  31. Yuan, C.; Yang, H. Research on k-value selection method of k-means clustering algorithm. J 2019, 2, 226–235. [Google Scholar] [CrossRef]
  32. Murtagh, F.; Contreras, P. Algorithms for hierarchical clustering: An overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2012, 2, 86–97. [Google Scholar] [CrossRef]
  33. Rumelhart, D.; Hinton, G.; Williams, R. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition; MIT Press: Cambridge, MA, USA, 1986; Chapter 8; Volume 1, pp. 318–362. [Google Scholar]
  34. Baldi, P. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, Bellevue, WA, USA, 2 July 2011; JMLR Workshop and Conference Proceedings. ML Research Press: London, UK, 2012; pp. 37–49. [Google Scholar]
  35. Zhang, L.; Lv, C.; Jin, Y.; Cheng, G.; Fu, Y.; Yuan, D.; Tao, Y.; Guo, Y.; Ni, X.; Shi, T. Deep learning-based multi-omics data integration reveals two prognostic subtypes in high-risk neuroblastoma. Front. Genet. 2018, 9, 477. [Google Scholar] [CrossRef]
  36. Bank, D.; Koenigstein, N.; Giryes, R. Autoencoders. Machine Learning for Data Science Handbook: Data Mining and Knowledge Discovery Handbook; Springer: New York, NY, USA, 2023; pp. 353–374. [Google Scholar]
  37. Kohonen, T. The self-organizing map. Proc. IEEE 1990, 78, 1464–1480. [Google Scholar] [CrossRef]
  38. Vesanto, J.; Alhoniemi, E. Clustering of the self-organizing map. IEEE Trans. Neural Netw. 2000, 11, 586–600. [Google Scholar] [CrossRef] [PubMed]
  39. Caliński, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat.-Theory Methods 1974, 3, 1–27. [Google Scholar] [CrossRef]
  40. Desgraupes, B. Clustering indices. Univ. Paris Ouest-Lab Modal’X 2013, 1, 34. [Google Scholar]
  41. Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227. [Google Scholar] [CrossRef]
  42. Xiao, J.; Lu, J.; Li, X. Davies bouldin index based hierarchical initialization k-means. Intell. Data Anal. 2017, 21, 1327–1338. [Google Scholar] [CrossRef]
  43. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  44. Shahapure, K.R.; Nicholas, C. Cluster quality analysis using silhouette score. In Proceedings of the 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), Sydney, Australia, 6–9 October 2020; pp. 747–748. [Google Scholar]
  45. Harada, D.; Asanoi, H.; Noto, T.; Takagawa, J. Different pathophysiology and outcomes of heart failure with preserved ejection fraction stratified by k-means clustering. Front. Cardiovasc. Med. 2020, 7, 607760. [Google Scholar] [CrossRef]
Figure 1. The proposed smart healthcare system based on clustering techniques and medical data (SHCM).
Figure 2. Data preprocessing steps.
Figure 3. The age distribution of patients with E11 disease code of class A in March.
Figure 4. The age distribution of patients with E11 disease code of class B in March.
Figure 5. The age distribution of patients with E11 disease code of class C in March.
Table 1. Recent clustering methods for medical data.
References | Years | Applications | Methods of Clustering
Santamaría et al. [11] | 2020 | Analysis of new nosological models | DBSCAN *
Farouk and Rady [12] | 2020 | Early diagnosis of Alzheimer's disease | K-means, K-medoids
Hassan et al. [13] | 2022 | As a feature-grouping model for early diabetes detection | K-means
Antony et al. [14] | 2021 | Chronic kidney disease prediction | K-means, DBSCAN *, I-Forest *, Autoencoder
Enireddy et al. [15] | 2021 | Prediction of diseases | K-means, Agglomerative, Fuzzy C-means
Arora et al. [16] | 2022 | As a feature-extracted tool for diabetes patient prediction | K-means
Parikh et al. [17] | 2023 | Discovering and clustering phenotypes of atypical diabetes | K-means
Jasinska-Piadlo et al. [18] | 2023 | Clustering heart failures | K-means
Mpanya et al. [19] | 2023 | Clustering heart failure phenotypes | K-prototype, K-means, Agglomerative, BIRCH *, OPTICS *, DBSCAN *, GMM *
Florensa et al. [20] | 2022 | Exploring associations between risk factors and likelihood of colorectal cancer | K-means
Koné et al. [21] | 2023 | Exploring the clustering of 17 chronic conditions with cancer | K-means
Chantraine et al. [22] | 2022 | Classification of stiff-knee gait kinematic severity after stroke | K-means
Yasa et al. [23] | 2022 | Classification of stroke | K-means
Al-Khafaji and Jaleel [24] | 2022 | Controlling and forecasting COVID-19 cases | K-Efficient (a hybrid of K-medoids and K-means)
Ilbeigipour et al. [25] | 2022 | The analysis of COVID-19 cases | SOM, K-means
Note: * I-Forest = Isolation Forest; BIRCH = Balanced Iterative Reducing and Clustering Hierarchies; OPTICS = Ordering Points to Identify the Clustering Structure; DBSCAN = Density-Based Spatial Clustering of Applications with Noise; GMM = Gaussian Mixture Model.
Table 2. Conversion methods for attributes.
Variables | Attributes | Conversion Methods
X1 | Gender | Labeling
X2 | Age | From birthdays to ages
X3~X12 | Drug items | BERT and PCA
X13~X22 | Doctors' advice | BERT and PCA
X23~Xn | Exam items | Labeling
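The BERT-and-PCA conversion in Table 2 amounts to embedding each free-text field and projecting the embedding onto a few principal components. The sketch below substitutes random 768-dimensional vectors for real BERT outputs and uses our own SVD-based `pca_reduce`; the dimensions (768 in, 10 out) are assumptions for illustration only.

```python
# Sketch of the Table 2 text conversion: each record's free-text attribute is
# embedded (random vectors stand in for BERT here) and reduced with PCA.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 768))  # 100 records x 768-dim "BERT" vectors

def pca_reduce(X, n_components=10):
    Xc = X - X.mean(axis=0)                     # center the data
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T             # project onto top components

reduced = pca_reduce(embeddings)
print(reduced.shape)  # (100, 10)
```

In practice, the 768-dimensional sentence vectors would come from a pretrained BERT encoder; the PCA step keeps the attribute count per text field manageable for clustering.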
Table 3. Numbers of attributes and visits of patients in 12 months.
Datasets | Jan. | Feb. | Mar. | Apr. | May | Jun.
Number of attributes | 456 | 394 | 418 | 447 | 417 | 416
Visits of patients | 4765 | 4405 | 5667 | 5410 | 7593 | 5397
Datasets | Jul. | Aug. | Sep. | Oct. | Nov. | Dec.
Number of attributes | 455 | 447 | 445 | 404 | 444 | 470
Visits of patients | 5136 | 5341 | 5023 | 5233 | 4675 | 4506
Table 4. The numbers of clusters using four methods from January to December.
Methods | Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sep. | Oct. | Nov. | Dec.
KM | 4 | 4 | 3 | 3 | 5 | 3 | 3 | 3 | 3 | 3 | 3 | 3
AEKM | 4 | 4 | 3 | 3 | 5 | 3 | 3 | 3 | 3 | 3 | 3 | 3
SOMKM | 4 | 4 | 3 | 3 | 4 | 3 | 4 | 3 | 4 | 4 | 3 | 3
HC | 4 | 4 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 4
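For reference, the KM baseline compared in Table 4 is standard Lloyd's-algorithm k-means. The sketch below runs it on toy 2-D data rather than the study's 400-plus-attribute records, and its deterministic initialization (evenly spaced points) is our simplification, not the study's setup.

```python
# Minimal k-means (Lloyd's algorithm) sketch on toy 2-D data.
import numpy as np

def kmeans(X, k, iters=50):
    # Deterministic init for reproducibility: k points spread through the data.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
labels, _ = kmeans(X, 2)
print(len(np.unique(labels)))  # 2
```

On the actual visit data, each row would be a patient-visit vector built from the Table 2 attributes, and k would be chosen per month as in Table 4.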
Table 5. The clustering performance in terms of the Calinski–Harabasz index.
Methods | Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sep. | Oct. | Nov. | Dec.
KM | 1310.21 | 1431.42 | 1988.48 | 2107.83 | 1810.97 | 1892.51 | 1914.83 | 2082.41 | 1682.57 | 1731.59 | 1590.45 | 1323.84
AEKM | 257.73 | 268.64 | 217.65 | 399.86 | 319.53 | 653.52 | 709.87 | 900.43 | 347.86 | 336.36 | 384.98 | 163.36
SOMKM | 1281.33 | 1423.37 | 1988.48 | 1012.57 | 1377.30 | 1888.91 | 1432.10 | 1841.56 | 1391.69 | 1416.60 | 1590.45 | 1233.08
HC | 1264.75 | 1368.76 | 1883.29 | 2008.26 | 2060.34 | 1847.02 | 1805.66 | 2035.00 | 1632.01 | 1690.22 | 1512.80 | 1112.01
Table 6. The clustering performance in terms of the Davies–Bouldin index.
Methods | Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sep. | Oct. | Nov. | Dec.
KM | 1.43 | 1.28 | 1.43 | 1.25 | 1.6 | 1.31 | 1.27 | 1.28 | 1.44 | 1.5 | 1.35 | 1.58
AEKM | 4.59 | 3.68 | 5.55 | 6 | 5.33 | 4.02 | 3.12 | 3.74 | 3.58 | 4.34 | 3.11 | 6.18
SOMKM | 1.42 | 1.29 | 1.43 | 2.19 | 1.96 | 1.31 | 1.91 | 1.5 | 1.57 | 1.67 | 1.35 | 1.7
HC | 1.45 | 1.31 | 1.43 | 1.27 | 1.49 | 1.32 | 1.34 | 1.28 | 1.43 | 1.52 | 1.38 | 1.53
Table 7. The clustering performance in terms of the Silhouette Coefficient.
Methods | Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sep. | Oct. | Nov. | Dec.
KM | 0.27 | 0.29 | 0.27 | 0.32 | 0.24 | 0.31 | 0.31 | 0.32 | 0.27 | 0.25 | 0.29 | 0.23
AEKM | 0.01 | 0.03 | 0.02 | 0.07 | 0.01 | 0.13 | 0.16 | 0.13 | 0.04 | 0.03 | 0.04 | 0.01
SOMKM | 0.26 | 0.29 | 0.27 | 0.16 | 0.2 | 0.31 | 0.24 | 0.31 | 0.28 | 0.25 | 0.29 | 0.21
HC | 0.27 | 0.29 | 0.25 | 0.31 | 0.26 | 0.3 | 0.3 | 0.32 | 0.26 | 0.25 | 0.29 | 0.25
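The Calinski–Harabasz index reported in Table 5 is the ratio of between-cluster to within-cluster dispersion, scaled by (n - k)/(k - 1), so higher values indicate tighter, better-separated clusters. The sketch below is our own self-contained implementation on toy data, not the evaluation script used in the study.

```python
# Sketch: compute the Calinski-Harabasz index for a labeled 2-D dataset.
import numpy as np

def calinski_harabasz(X, labels):
    n, k = len(X), len(np.unique(labels))
    overall = X.mean(axis=0)
    between = within = 0.0
    for j in np.unique(labels):
        Xj = X[labels == j]
        cj = Xj.mean(axis=0)
        between += len(Xj) * ((cj - overall) ** 2).sum()  # between-cluster dispersion
        within += ((Xj - cj) ** 2).sum()                  # within-cluster dispersion
    return (between / (k - 1)) / (within / (n - k))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(4, 0.5, (40, 2))])
labels = np.array([0] * 40 + [1] * 40)
print(calinski_harabasz(X, labels) > 100)  # True: well-separated clusters score high
```

The Davies–Bouldin index of Table 6 rewards the same structure in the opposite direction (lower is better), which is why KM scores highest on Table 5 and lowest on Table 6 in most months.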
Table 8. Clustering analysis of indigenous patients in January.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %) | Cluster 4 (Codes / Visits / Group %)
1E119%178E118749%E112413%E116235%Z206594%
2E785%112I104550%E781715%E785146%J183846%
3I104%90E784439%N391750%I112835%J011636%
4J184%82I114353%I101618%I102730%R501445%
5I114%81J182227%N181232%I252145%E1153%
6Z203%69B182255%R501032%M102065%J20518%
7I252%47K212154%I11911%B181743%J45517%
8J012%44Z341860%Z34827%J181620%R10519%
9B182%40I251532%I25715%J011534%I2549%
10K212%39J451448%R10726%K211436%N18411%
11N182%37 N20747% J30417%
12N392%34 M10619% A09421%
13R502%31 J1867% R11431%
14M102%31 E86450%
15Z341%30 Z34413%
Table 9. Clustering analysis of indigenous patients in March.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %)
1E118%206E113015%E119345%E118340%
2E786%154R502545%Z207974%E786341%
3I104%108N392356%E786945%I103633%
4Z204%107E782214%I105551%I113434%
5I114%99I101716%I115354%Z113360%
6J183%80N181425%J184151%J013141%
7J013%76I111212%J013951%M103060%
8B182%56J181215%K212963%Z202826%
9R502%56R101135%Z342882%J182734%
10N182%56I25920%B182646%B182646%
11Z112%55 N182646%
12M102%50
13K212%46
14I252%46
15N392%41
Table 10. Clustering analysis of indigenous patients in May.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %) | Cluster 4 (Codes / Visits / Group %) | Cluster 5 (Codes / Visits / Group %)
1Z2022%660E115835%E118350%Z2024036%N392362%Z2034652%
2E116%167E784643%Z20568%R055645%R502122%R056653%
3R054%125I112734%E784945%J063037%E112113%U074048%
4E784%108B182047%I104061%R502931%I111215%J063948%
5R503%94I101929%I113848%U072530%E781110%R502931%
6U073%83K211442%Z343065%J181019%N181020%J18917%
7J063%81J441461%I252350%J201033%Z34920%J00850%
8I113%79Z20132%B182047%J00850%N40964%Z34715%
9I102%66J181324%N182041%J01739%I25715%J20723%
10J182%54N181224%K211855%N18612%M10724%R10523%
11N182%49I251226%J181630%J30526%I10711%J02563%
12Z342%46M101241% E1142%J18611%
13I252%46 I2549%N20655%
14B181%43 R51420%R10627%
15N391%37 R11427%M54630%
Table 11. Clustering analysis of indigenous patients in June.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %)
1Z2015%309E115840%E118357%Z2024078%
2E117%145E784647%Z205618%R055695%
3E785%97I112740%E784951%J063075%
4I113%67B182050%I104068%R502966%
5I103%59I101932%I113857%U072564%
6R053%59K211444%Z3430100%J181026%
7R502%44J441478%I252359%J201056%
8B182%40Z20134%B182050%J008100%
9J062%40J181333%N182053%J01744%
10J182%39I251231%K211856%N18616%
11I252%39N181232%J181641%J30533%
12U072%39M101255% E1143%
13N182%38Z111148% I25410%
14K212%32K251150% R51424%
15Z341%30R50920% R11440%
Table 12. Clustering analysis of non-indigenous patients in January.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %) | Cluster 4 (Codes / Visits / Group %)
1E118%985E1136437%E1117618%E1141442%Z2041296%
2E786%790E7831640%N3913573%E7834944%J185032%
3I115%680I1130345%E7812115%I1128642%E11313%
4Z203%429I1014938%N188120%I1016943%N18307%
5N183%407N1814235%I118012%N1815438%R502124%
6I103%391K2110951%N207251%I2511448%J011821%
7I252%237B188843%I105915%B1810451%I10144%
8K212%212G478855%N405442%K218942%R101411%
9B182%204I258235%R104434%M107766%K921321%
10N391%184K596846%R503541%N407255%A091319%
11G471%161 I253013% E861324%
12J181%158 Z342628%
13K591%147 R312578%
14N201%140
15N401%130
Table 13. Clustering analysis of non-indigenous patients in March.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %)
1E117%953N3917771%E1140643%E1139341%
2E786%827E1115416%Z2039458%E7835142%
3I115%676E7810913%E7836744%I1128442%
4Z205%674N1810323%I1131346%Z2028042%
5I103%464I117912%I1018440%I1021045%
6N183%450R507550%N1815434%N1819343%
7I252%283I107015%G4711160%I2515655%
8N392%251R105843%K2110449%J4410670%
9K211%213N204738%I259835%J1810553%
10J181%199N403627%K598053%B1810054%
11B181%186M103226%Z347871%N409773%
12G471%186Z343229%J187839%K219545%
13J441%152R313070%J307651%M107258%
14K591%151I252910% N206653%
15R501%151
Table 14. Clustering analysis of non-indigenous patients in May.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %) | Cluster 4 (Codes / Visits / Group %) | Cluster 5 (Codes / Visits / Group %)
1Z2019%2922E1136741%E1135440%Z20102935%E1115117%Z20134346%
2E116%888E7831142%Z2032511%R0524244%N3913172%R0530154%
3E785%740I1128544%E7830842%U0712937%E7811115%U0715343%
4I114%648Z202127%I1127943%J0612045%N189222%J0611142%
5R054%553N1817141%I1014939%R508831%I117812%R507527%
6N183%420I1015641%N1813833%J003854%R507326%J003144%
7I102%382I2510944%K2110255%R111629%I106216%Z342016%
8U072%352N408464%I2510141%J201629%N204844%J201629%
9R502%281B187849%G478754%E11142%N404333%R111324%
10J062%265K217239%Z347964%N18143%R103231%R071116%
11I252%248 B187446% R353067%J031034%
12K211%186 C505891% R511025%
13N391%181 K255554% O471077%
14G471%162 R005462%
15B181%160
Table 15. Clustering analysis of non-indigenous patients in June.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %)
1Z2011%1440Z2086760%Z2055839%E1114017%
2E116%822E1134242%E1134041%N3911166%
3E785%681E7829443%E7828542%E7810215%
4I115%588I1125944%I1125043%N188922%
5N183%399N1814737%I1016545%I117913%
6I103%365I1014540%N1816341%R506929%
7U072%267U0712446%I2511554%I105515%
8R502%238Z3410283%U0710640%U073714%
9I252%214I258037%J449669%N203744%
10N391%168G477753%R509339%R103134%
11B181%148 J187454%N402627%
12K211%147 M107163%R312586%
13G471%145 N407073%
14J061%143 B186745%
15J441%140 K216544%
Table 16. Clusters of diseases.
Months | Clusters of Indigenous Patients (Class A / Class B / Class C / Class D / Class E) | Clusters of Non-Indigenous Patients (Class A / Class B / Class C / Class D / Class E)
Jan.I10, I11, B18, K21, Z34N39M10Z20, E86 K21, G47N39, N20, R31B18, M10, N40Z20
Feb.E78, Z34, I25, K21, J45, G47, M19N39, N20M10, B18, Z11Z20, R07, L08, B08 G47, Z34N39B18, N40, M10Z20
Mar.E11, I10, I11, J18, J01, K21, Z34N39Z11, M10 Z20, G47, K59, Z34, J30N39, R50, R31I25, J44, J18, B18, N40, M10, N20
Apr.Z20, I11, K21, Z34, B18N39, N20J01, N18, Z11 Z20, K21, G47, Z34, C50N39, R35I25, B18, J44, J18, M10, N40, K25, I20
MayE11, E78, I10, I11, Z34, I25, B18, N18, K21N39, Z34, N20E78, B18, K21, J44, M10Z20, R05, U07, J06, J00, J02J00, R05E78, I11, K21, I25, G47, Z34, B18, C50, K25, R00N39, N20, R35E11, E78, I11, N18, I10, I25, N40, B18Z20, R05, U07, J06, J00, O47J00, R05, J06
Jun.E11, E78, I10, I11, Z34, I25, B18, N18, K21 B18, J44, M10, K25Z20, R05, J06, R50, U07, J20, J00 Z20, Z34, G47N39, R31I25, J44, J18, M10, N40
Jul.Z20, E11, E78, I11, I10, Z34, K21, B18N20M10 Z20, Z34, K21, G47, K59, R42, R00N39I25, B18, J44, N40, J18, M10
Aug.Z20, E11, E78, I10, I11, Z34, K21, J01, B18, J20, K59N39J18, I25, M10, J44 Z20, K21, Z34N39, R31I25, B18, J44, N40, M10, J18
Sep.Z20, E78, I11, B18, K21, Z34N39J18, M10 Z20, Z34, K21, K59, G47, R42N39, R31I25, B18, J19, N40, J44, M10
Oct.E11, Z20, I11, K21, Z34, J45N39, R10, N13R50, J18, M10, N18, J20, J01, J06 Z20, Z34, G47N39, N20, R31R50, N18, J18, Z20, I25, B18, U07, J20, M10, N40
Nov.I10, Z34, J12, Z20, P59N39J18, M10, J45, J20 Z34, K21, R42, Z20, G47N39, R31, R80J18, I25, J44, B18, N40
Dec.E11, I10, I11, K21, P59, J45, Z34, J20, J30N39, R10, N20, J03J18, M10 Z34, G47, K59, C50, R42N39, R31, R35I25, J18, J44, M10, N40, I20
SAME CODEI10, I11, K21, Z34N39M10Z20J00K21, G47N39M10, N40Z20J00
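The "SAME CODE" row in Table 16 collects the codes that recur in a given class month after month. One plausible reading of that row (an assumption about the procedure, not stated explicitly in the paper) is a set intersection over the monthly code lists of a class, sketched here on a three-month excerpt of the indigenous Class A column:

```python
# Class A codes of indigenous patients for three months (excerpt of Table 16).
monthly = {
    "Jan": {"I10", "I11", "B18", "K21", "Z34"},
    "Mar": {"E11", "I10", "I11", "J18", "J01", "K21", "Z34"},
    "May": {"E11", "E78", "I10", "I11", "Z34", "I25", "B18", "N18", "K21"},
}

# Codes present in every listed month: a candidate "SAME CODE" entry.
same = set.intersection(*monthly.values())
print(sorted(same))  # ['I10', 'I11', 'K21', 'Z34']
```

For these months the intersection matches the published SAME CODE entry for indigenous Class A (I10, I11, K21, Z34).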
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Yang, W.-C.; Lai, J.-P.; Liu, Y.-H.; Lin, Y.-L.; Hou, H.-P.; Pai, P.-F. Using Medical Data and Clustering Techniques for a Smart Healthcare System. Electronics 2024, 13, 140. https://doi.org/10.3390/electronics13010140

