Article

Using Medical Data and Clustering Techniques for a Smart Healthcare System

1 Puli Christian Hospital, Puli 54546, Taiwan
2 Department of Multimedia Game Development and Application, HungKuang University, Taichung 43302, Taiwan
3 PhD Program in Strategy and Development of Emerging Industries, National Chi Nan University, Nantou 54561, Taiwan
4 Department of Information Management, National Chi Nan University, Nantou 54561, Taiwan
* Author to whom correspondence should be addressed.
Electronics 2024, 13(1), 140; https://doi.org/10.3390/electronics13010140
Submission received: 15 November 2023 / Revised: 18 December 2023 / Accepted: 27 December 2023 / Published: 28 December 2023

Abstract

With the rapid advancement of information technology, both hardware and software, smart healthcare has become increasingly achievable. The integration of medical data and machine-learning technology is the key to realizing this potential. The quality of medical data influences the results of a smart healthcare system to a great extent. This study aimed to design a smart healthcare system based on clustering techniques and medical data (SHCM) to analyze potential risks and trends in patients in a given time frame. Evidence-based medicine was also employed to explore the results generated by the proposed SHCM system. Thus, similar and different discoveries examined by applying evidence-based medicine could be investigated and integrated into the SHCM to provide personalized smart medical services. In addition, the presented SHCM system analyzes the relationship between health conditions and patients in terms of the clustering results. The findings of this study show the similarities and differences in the clusters obtained between indigenous patients and non-indigenous patients in terms of diseases, time, and numbers. Therefore, the analyzed potential health risks could be further employed in hospital management, such as personalized health education control, personal healthcare, improvement in the utilization of medical resources, and the evaluation of medical expenses.

1. Introduction

Due to the progress and advantages of information technology and data analysis techniques, smart medical care plays an important role in the modern medical field. Machine-learning and data-mining techniques have provided hospital practitioners with more effective and efficient medical solutions in personalized medicine and led to disease predictions, medical efficiency improvement, and medical resource optimization. To identify similarities among patients, grouping patients into clinically meaningful clusters is essential [1]. Healthcare organizations and physicians take advantage of clustering results to analyze similarities among patients. By clustering patients in terms of diseases, risk factors, lifestyles, or other relevant factors, clustering results can help physicians gain insights into patients’ needs and provide personalized treatments.
Previous studies have pointed out the importance of using medical management databases to analyze patient clusters to learn trends of diseases according to clustering results [2]. The clustering technique is one of the most useful methods for analyzing patient similarities for precision medicine [1]. Analyzing a patient’s potential risks and trends requires a lot of patient-related data, which are recorded every time a patient visits a hospital for medical treatment. In the era of big data, electronic records include a large amount of text, such as the clinical narration of doctors’ advice. Thus, the analysis of electronic records has become more complex than before. In addition, due to the high dimensions of input data, the reduction in dimensions or feature selection can improve model efficiency and the performance of clustering tasks. Zelina et al. [3] proposed a natural language processing (NLP) method to investigate the clinician dataset of Czech breast cancer patients. The developed RobeCzech model is a general-purpose Czech transformer language model and is used for the unsupervised extraction, labeling, and clustering of fragments from clinical records. This study indicated the feasibility as well as the possibility of dealing with unstructured Czech clinical records in a non-supervised training manner. Irving et al. [4] employed electronic medical record (EMR) data to enhance the detection and prediction of psychosis risk in South London. In addition to basic patient information, clinical characteristics, symptoms, and substances, the EMR data included NLP predictions. The authors reported that using NLP to cope with EMRs can significantly improve the prognostic accuracy of psychosis risk.
Issues of concern in existing electronic medical records and eHealth systems include technical aspects, managerial factors, and particularly the quality of data in systems [5]. Additionally, as previously pointed out, the quality of the data is essential for healthcare systems [6]. Thus, this study aimed to deal with various data types by applying data preprocessing with data merging, data conversion, data cleaning, data selection, and data normalization. Then, clustering techniques were employed to group patients with similar medical features to improve the data quality for the healthcare system.
This investigation used demographic information, drug items, doctors’ advice, and exam items to perform clustering tasks and then to analyze the results in terms of indigenous people and non-indigenous people. Four clustering methods were used in this study, namely, K-means, hierarchical clustering, autoencoder, and SOM-KM. The clustering performance was evaluated through three indicators: the Calinski–Harabasz index (CH), Davies–Bouldin index (DB), and Silhouette Coefficient (SC). For most cases and indices, K-means outperformed the other methods. Therefore, K-means was used to analyze the clustering results. The rest of this study is organized as follows. Section 2 illustrates the clustering methods and applications in medical data analysis. The presented smart healthcare system based on clustering techniques and big data is introduced in Section 3. Section 4 depicts numerical examples. Finally, conclusions are presented in Section 5.

2. Clustering Techniques and Applications for Medical Data Analysis

Ezugwu et al. [7] and Saxena et al. [8] reported that clustering techniques can be divided into two major categories, namely, hierarchical clustering algorithms and partition clustering algorithms. More clustering categories, including grid clustering, density clustering, and model clustering, were proposed by Chaudhry et al. [9] and Oyewole and Thopil [10]. K-means and hierarchical clustering techniques are the most widely used algorithms in the literature. K-means clustering is one of the partition clustering methods. Applications of clustering approaches in medical data analysis include disease nosology [11], early diagnosis of diseases [12,13], predictions of diseases [14,15], etc. The clustering of diseases is mostly for chronic diseases and severe illnesses, for example, diabetes [13,16,17], heart failure [18,19], cancer [20,21], stroke [22,23], and COVID-19 cases [24,25]. Arora et al. [16] used K-means clustering for the prediction of diabetes. Jasinska-Piadlo et al. [18] employed K-means to cluster emergency readmissions of heart failure patients by using a data-driven approach and domain-leading methods. Heart failure patients usually have various characteristics at the physiological level. The study indicated that the K-means clustering algorithm could identify patients with heart failure very well. Ilbeigipour et al. [25] used SOM (self-organizing map) neural networks and the K-means technique to cluster COVID-19 patients and investigated the relationships between different symptoms of cases. The findings of this study could help health specialists improve their services by considering other important factors in treating COVID-19 patients in different ethnic groups. Table 1 lists a summary of recent clustering approaches for medical data. It can be observed that most studies dealt with a single disease, and K-means was commonly used as a popular clustering technique for analyzing medical data. 
The clustering approaches can be generally classified into two categories: hierarchical clustering algorithms and partition clustering algorithms [7,8,9,10]. This study employed four clustering methods, K-means (KM), hierarchical clustering (HC), the autoencoder with K-means (AEKM), and the self-organizing map with K-means (SOMKM), to analyze medical data.
The K-means method [26] involves dividing a sample dataset into k subsets, forming k clusters, and assigning n data points to these k clusters, with each data point exclusively belonging to one cluster. The K-means algorithm is an iterative process that consists of two primary steps. Initially, it selects k cluster centers, and subsequently, it assigns data points to the nearest center to obtain an initial result. Following this, the centroids of each cluster are updated as new centers, and these two steps are repeated iteratively. The objective of the clustering results is to minimize the distance between data points and their respective cluster centers. The objective function of the K-means algorithm is shown in the following equations. Equation (1) employs the Euclidean distance to ensure that data point $x_i$ is closest to its assigned center, while Equation (2) is used to update the center as the mean value [27,28,29,30].
$$Obj = \sum_{i=1}^{k} \sum_{j=1}^{N} (x_i - x_j)^2 \tag{1}$$
$$X_k = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{2}$$
where $k$ is the number of cluster centers, $N$ is the number of data points in the $i$th cluster, $x_j$ is the cluster mean, and $x_i$ is the $i$th point in the dataset.
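As an illustration of Equations (1) and (2), a minimal NumPy sketch of the two-step K-means iteration (assignment, then centroid update) might look as follows; the data and cluster count are synthetic stand-ins, not the study's patient data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data as a stand-in for the preprocessed patient features.
X = rng.normal(size=(100, 2))
k = 3

# Step 1: pick k initial cluster centers (here, random data points).
centers = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(10):
    # Assignment step (Eq. 1): each point goes to its nearest center.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)
    # Update step (Eq. 2): each center becomes the mean of its points
    # (a center is kept unchanged if its cluster happens to be empty).
    centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centers[j] for j in range(k)])

# Objective of Eq. (1): total squared distance of points to their centers.
obj = sum(((X[assign == j] - centers[j]) ** 2).sum() for j in range(k))
```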
For the K-means clustering algorithm, it is necessary to pre-specify the number of clusters denoted by K. This is an important hyperparameter of the algorithm. To determine the most suitable number of clusters for the experimental data, the Elbow Method is employed in this approach [31].
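The Elbow Method described above can be sketched with scikit-learn by recording the within-cluster sum of squares (the model's inertia) for a range of K values; the dataset here is a synthetic stand-in for one month of preprocessed data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the preprocessed patient-visit matrix.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Elbow Method: fit K-means for a range of K and record the inertia
# (within-cluster sum of squared distances); the K at which the curve's
# decrease flattens out suggests a suitable number of clusters.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(2, 9)}
```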
Hierarchical clustering (HC) constructs a hierarchy of clusters by iteratively merging or dividing clusters based on a distance metric. This method provides a visual representation of the data structure through dendrogram plots. There are two main types of hierarchical clustering: agglomerative and divisive. In our study, we employed the agglomerative clustering approach because our data samples were generated from patient records. This method begins with each sample being treated as an individual cluster and then progressively merges clusters that are close in proximity until a certain termination condition is met. For hierarchical clustering, three essential elements, the similarity distance, merging rules, and termination conditions, need to be considered [32]. The hierarchical clustering process is irreversible, and due to its consideration of each individual data point, it can be computationally time-consuming.
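A minimal sketch of agglomerative clustering with SciPy, assuming Ward linkage as the merging rule and a fixed cluster count as the termination condition (both illustrative choices, not necessarily those used in the study):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic stand-in for the patient-record samples.
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# Agglomerative clustering: Ward linkage repeatedly merges the two closest
# clusters; Z records every merge and can be drawn as a dendrogram.
Z = linkage(X, method="ward")

# Termination condition: cut the hierarchy once three clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
```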
Developed in the 1980s by Hinton and the PDP group [33], the autoencoder is an artificial neural network with an input layer, a hidden layer, and an output layer. The main purpose of the autoencoder is to perform representation learning on the input data so that the output carries the same meaning as the input. Autoencoders have been widely used in feature extraction [29,34,35,36]. An $m$-dimensional dataset is considered as $X = \{X_1, X_2, \ldots, X_m\}$. The compressed data features are generated by the encoder $E$, and following that, the output $X'$ is generated by the decoder $D$, which can be expressed by Equation (3):
$$X' = D(E(X)) \tag{3}$$
The training goal of the autoencoder is to minimize the reconstruction error. The loss function can be expressed as Equation (4):
$$\mathrm{Loss}(X, X') = (X - X')^2 \tag{4}$$
After establishing the autoencoder model, the K-means method is then used because the autoencoder is not a clustering tool [35].
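The autoencoder-then-K-means (AEKM) idea can be sketched as follows. As a stand-in for a dedicated autoencoder framework, an `MLPRegressor` is trained to reproduce its own input, and its hidden-layer activations serve as the compressed codes E(X); the bottleneck size and data are illustrative assumptions, not the study's configuration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the preprocessed 8-feature patient data.
X, _ = make_blobs(n_samples=200, n_features=8, centers=3, random_state=1)
X = MinMaxScaler().fit_transform(X)

# Autoencoder surrogate: an MLP trained to reproduce its own input
# (target = input); the 3-unit hidden layer is the bottleneck code E(X).
ae = MLPRegressor(hidden_layer_sizes=(3,), activation="logistic",
                  max_iter=2000, random_state=1)
ae.fit(X, X)

# Recover the encoder output from the trained first-layer weights.
def encode(data):
    z = data @ ae.coefs_[0] + ae.intercepts_[0]
    return 1.0 / (1.0 + np.exp(-z))  # logistic hidden activation

codes = encode(X)

# The autoencoder itself is not a clustering tool, so K-means is
# applied to the compressed codes.
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(codes)
```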
The self-organizing map (SOM) [37] is a method consisting of a two-dimensional grid used for mapping input data. During the training process, the SOM forms an elastic grid to envelop the distribution of input data, mapping adjacent input data to nearby grid units. SOM training is an iterative process that adjusts the positions of grid units by computing distances and finding the Best-Matching Unit (BMU) with prototype vectors. Furthermore, the SOM’s computational complexity scales linearly with the number of data samples, making it memory-efficient, but scales quadratically with the number of map units. Training large maps can be time-consuming, although it can be expedited with specialized techniques. Apart from the SOM, alternative variants are available, though they may require more complex visualization methods. In summary, the SOM is an effective approach for processing large datasets while preserving the topological characteristics of the input space [38].
SOM training is conducted iteratively. In each training step, a sample vector is randomly chosen from the input dataset, and the distances between this sample vector and all prototype vectors are computed. The Best-Matching Unit (BMU) is the map unit whose prototype vector is closest to the sample vector. Subsequently, the prototype vectors are updated: the BMU and its topological neighbors are moved toward the sample vector in the input space. The update rule for the prototype vector of unit $i$ is expressed in Equation (5):
$$v_i(t+1) = v_i(t) + \alpha(t)\, h_{ij}(t)\, [x(t) - v_i(t)] \tag{5}$$
where
  • $v_i(t+1)$ is the updated prototype vector for unit $i$ at time $t+1$;
  • $v_i(t)$ is the current prototype vector for unit $i$ at time $t$;
  • $\alpha(t)$ is the adaptation coefficient at time $t$;
  • $h_{ij}(t)$ is the neighborhood kernel centered on the winning unit at time $t$;
  • $x(t)$ is the sample vector chosen at time $t$.
The SOM is commonly used for dimensionality reduction and data visualization; it maps high-dimensional data into two- or three-dimensional spaces, providing a significant advantage when dealing with complex data. In this study, we leveraged the strengths of both models by first mapping the data into a two-dimensional representation through a SOM and then performing clustering using K-means.
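A minimal SOM-then-K-means (SOMKM) sketch in NumPy, implementing the BMU search and the update rule of Equation (5) directly; the grid size, learning schedule, and data are illustrative assumptions rather than the study's settings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)

# Synthetic stand-in for 6-dimensional patient features.
X, _ = make_blobs(n_samples=300, n_features=6, centers=3, random_state=0)

# A 5x5 SOM grid: one prototype vector per map unit.
rows, cols, dim = 5, 5, X.shape[1]
protos = rng.normal(size=(rows * cols, dim))
grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)

n_steps = 3000
for t in range(n_steps):
    x = X[rng.integers(len(X))]                        # random sample vector
    bmu = np.argmin(((protos - x) ** 2).sum(axis=1))   # Best-Matching Unit
    alpha = 0.5 * (1 - t / n_steps)                    # adaptation coefficient
    sigma = 2.0 * (1 - t / n_steps) + 0.5              # neighborhood radius
    # Gaussian neighborhood kernel h_ij(t) on the 2-D grid, then Eq. (5).
    h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
    protos += alpha * h[:, None] * (x - protos)

# Map every sample to its BMU's 2-D grid coordinates, then cluster the
# two-dimensional representation with K-means, as described above.
bmus = np.argmin(((X[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2), axis=1)
coords = grid[bmus]
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)
```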

3. The Proposed SHCM System

After reviewing the clustering techniques and medical data analysis, the proposed smart healthcare system based on clustering techniques and medical data (SHCM) is introduced in this section. Figure 1 depicts the structure and procedures of the designed SHCM system. The SHCM contains four parts: data preprocessing, clustering, performance evaluation, and result analysis. The data were collected from the outpatient clinic database of Puli Christian Hospital and then preprocessed. Next, four clustering techniques were employed to perform the grouping tasks. Three measurements, namely, the Calinski–Harabasz index, the Davies–Bouldin index, and the Silhouette Coefficient, were utilized to evaluate the performance of the clustering techniques. Based on these three measurements, K-means generated better results than the other three clustering approaches in most of the 12 months. Therefore, the clustering results provided by the K-means method were used to observe the grouped data and discuss them with medical doctors. Finally, similar and different discoveries investigated by applying evidence-based medicine could be identified and provided for further use in personalized health education and healthcare. In addition, the utilization of medical resources and the evaluation of medical expenses could possibly be improved.

3.1. Data and Data Preprocessing

The data were collected from the outpatient clinic database of Puli Christian Hospital and included structured and unstructured data from patient consultation information. Because of the diversity of data formats and structures, data preprocessing procedures needed to be conducted first. After preprocessing, a total of 63,151 records remained, covering patients who visited the outpatient clinic from 1 January to 31 December 2022. The four major attributes used in the clustering experiments were demographic information, drug items, doctors' advice, and exam items. Figure 2 illustrates the data preprocessing steps with five stages: data merging, data conversion, data cleaning, data selection, and data normalization.
The raw data collected were presented in four major categories: gender and age, doctors' advice, drug descriptions, and exam items. The merged data included structured and unstructured data. In order to obtain numerical values that can be recognized by the clustering model, unstructured data such as text and symbols were converted into numerical forms. Table 2 shows the conversion method for each attribute; the details are as follows. Categorical data included gender and some of the exam items. Doctors' advice, drug descriptions, and some contents of exam items were expressed as text. Thus, bidirectional encoder representations from transformers (BERT) were used to convert text into vectors, and a principal component analysis (PCA) was employed to reduce the high-dimensional converted results to 10 dimensions. Age and most exam items were already in numerical form. Finally, normalization was performed. The MinMaxScaler was used for data normalization in this study and is represented by Equation (6). Table 3 shows the number of attributes and patient visits in 12 months.
$$X_{MinMaxScaler} = \frac{X - X_{min}}{X_{max} - X_{min}} \tag{6}$$
where $X_{MinMaxScaler}$ is the normalized feature, and $X_{max}$ and $X_{min}$ are the maximum and minimum values of the feature $X$.

3.2. Performance Measurements

Three measurements, the Calinski–Harabasz index (CH), the Davies–Bouldin index (DB), and the Silhouette Coefficient (SC), were used in this study to evaluate the performance of the clustering techniques. The Calinski–Harabasz index (CH) [39] assesses the concentration of data in the clustering results by calculating the ratio of the sum of squared distances between clusters (BGSS) to the within-cluster sum of squares (WGSS). It is one of the commonly used evaluation metrics in K-means and hierarchical clustering. The CH index calculation formula under the assumption of N data points divided into K clusters is shown in Equation (7); with this calculation method, the larger the value, the better [40]:
$$CH = \frac{BGSS}{WGSS} \times \frac{N - K}{K - 1} \tag{7}$$
The DB index [41] evaluates the clustering results by considering both the similarity and separation between different clusters. It computes a dispersion measure ($C_i$) for each cluster from the distances between data points within that cluster. For each pair of clusters, the sum $C_i + C_j$ is divided by the separation $S_{ij}$ between the two clusters, and the worst (largest) ratio is taken for each cluster, as expressed in Equation (8). The DB index is established by averaging these worst-case ratios across all clusters, and a smaller DB index value indicates better clustering results [42].
$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{C_i + C_j}{S_{ij}} \tag{8}$$
The Silhouette Coefficient (SC) [43] considers the similarity of each data point to others within its cluster (A) and the dissimilarity to other clusters (B). The within-cluster similarity (A) measures the distance between the data point and other data points within the same cluster. The between-cluster dissimilarity (B) measures the distance between the data point and data points in other clusters. The formula for calculating the Silhouette Coefficient is in Equation (9):
$$SC = \frac{B - A}{\max(A, B)} \tag{9}$$
The SC values range from −1 to 1: a value close to 1 indicates that data points are very similar to others within their assigned cluster and dissimilar to data points in other clusters, while a value close to −1 suggests that data points are more likely to have been assigned to the wrong cluster [44].
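All three measurements are available in scikit-learn, the library used for the implementation in this study; a minimal sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Synthetic stand-in for one month of preprocessed data.
X, _ = make_blobs(n_samples=300, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

ch = calinski_harabasz_score(X, labels)   # larger is better
db = davies_bouldin_score(X, labels)      # smaller is better
sc = silhouette_score(X, labels)          # closer to 1 is better
```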
The clustering and storing of data in this study were implemented in the Anaconda environment based on the Python programming language and the scikit-learn library. In the next section, numerical data are employed to demonstrate the performance of the SHCM system. Then, the numerical results generated by the SHCM system are observed and analyzed, and conclusions are drawn.

4. Numerical Results

4.1. Clustering Performance with Three Measurements

Table 4 indicates the cluster numbers obtained by the four clustering methods over 12 months. Table 5, Table 6 and Table 7 list the three measurements used to evaluate the performance of the clustering techniques: the Calinski–Harabasz index (CH), the Davies–Bouldin index (DB), and the Silhouette Coefficient (SC). The number of suitable clusters falls between 3 and 5, with three clusters appearing most frequently. Table 5 illustrates the CH indexes of the four clustering approaches; a larger CH value means a better clustering result. Table 6 shows the DB indicators of the different clustering methods; a smaller DB value implies a better clustering result. Table 7 depicts the SC values; an SC value close to 1 means a better clustering result. In summary, the K-means method is superior to the other clustering methods for most of the 12 months of data. It has been pointed out that the K-means approach can provide quite satisfactory results compared to other clustering methods [15,30]. Harada et al. [45] reported that the advantages of K-means are its simple principle and high flexibility. Therefore, the clustering results generated by K-means were used for the analysis in this study.

4.2. Preliminary Analysis of ICD-10-CM Codes between Indigenous Patients and Non-Indigenous Patients

This study focuses on the analysis of disease codes among patients at Puli Christian Hospital. Given the hospital’s service to many indigenous populations, patients were initially categorized into indigenous and non-indigenous groups. Subsequently, we collected and analyzed the International Classification of Diseases (ICD-10-CM) codes assigned by doctors and recorded the top ten most frequent disease codes each month for both groups. Table A1 in Appendix A depicts ICD-10-CM codes and the corresponding diseases. Two main trends were observed. Firstly, type 2 diabetes (E11) consistently ranked within the top three for both indigenous and non-indigenous groups. This highlights a substantial demand for medical care relating to metabolic diseases in the Puli regions. Thus, various complications associated with diabetes underline the need for an in-depth understanding of the medical requirements at different stages. Secondly, the ranking of bacterial infectious diseases (Z20) was initially not prominent but escalated to the top position in June. Notably, 2022 was the year of the COVID-19 pandemic. Comparing this trend with Taiwan’s COVID-19 statistical data, a similar pattern emerges. These two disease codes represent chronic and infectious diseases, respectively, and indicate a diversity and complexity of health issues.

4.3. Analysis of ICD-10-CM after Clustering

To gain a deeper understanding of the differences in medical needs between indigenous and non-indigenous patients, this study employed clustering techniques to stratify the patient population. This stratification facilitated an in-depth exploration of the predominant health conditions within each cluster. Additionally, to compare the primary diseases between indigenous and non-indigenous groups, the criteria for listing major diseases included not only the frequency of disease codes but also proportional representations within each cluster. Disease codes were ranked based on occurrences, and the top ten diseases were selected for further analysis. The proportions of these top ten codes within each cluster were calculated and are listed in Table 8, Table 9, Table 10, Table 11, Table 12, Table 13, Table 14 and Table 15.

4.4. Analysis of Major Disease Codes over 12 Months in Indigenous Groups

An overview of the data from the clustering results shows that in January, the disease codes among indigenous patients were categorized into four main groups. The prevalent conditions included hypertension (I10), hypertensive heart disease (I11), chronic viral hepatitis (B18), gastroesophageal reflux disease (K21), and pregnancy-related care (Z34). As the months progressed to February, new codes emerged, such as lipid disorders (E78), chronic ischemic heart disease (I25), asthma (J45), and sleep disorders (G47), indicating shifts in health issues among indigenous populations. By March, the groupings were reduced to three, with an increased frequency of urinary system diseases (N39), suggesting that this is an area of health concern worth further exploration. In April, the grouping included hypertensive heart disease (I11), gastroesophageal reflux disease (K21), and pregnancy-related care (Z34), alongside infectious diseases (Z20) and chronic viral hepatitis (B18), hinting at possible commonalities among these conditions. In May, the number of groups increased to five, with respiratory diseases like chronic obstructive pulmonary disease (J44) and acute rhinitis (J00) becoming more prominent, potentially relating to prevalent diseases at the time. From June to December, the groupings remained consistent at three, with the recurrent appearances of codes like gout (M10) and pneumonia (J18) often in the same group, highlighting the significance of urinary (e.g., N39) and respiratory diseases (e.g., J18, J20).
Throughout the year, hypertension- and heart-disease-related codes (such as I10 and I11) were almost consistently present, while infectious diseases (like Z20) and seasonal illnesses (such as J18) showed an increase in specific months. Particular conditions like gout (M10) and urinary system diseases (e.g., N39) were especially pronounced among the indigenous population, suggesting possible environmental or physiological factors.

4.5. Analysis of Major Disease Codes in Non-Indigenous Groups

According to the clustering results, in January, non-indigenous patients were grouped into four categories focused on digestive system diseases (K21), sleep disorders (G47), urinary system diseases (N39, N20), gout (M10), chronic viral hepatitis (B18), blood disorders (R31), and infectious diseases (Z20). The grouping pattern continued similarly in February. In March and April, the number of groups was reduced to three, combining infectious diseases (Z20), pregnancy-related care (Z34), digestive system diseases (K21), and sleep disorders (G47), along with groups featuring gout (M10) and chronic viral hepatitis (B18) that now included respiratory disease codes (I25, J44, J18). In April, the group with pregnancy-related care (Z34) also showed occurrences of breast cancer (C50). By May, the number of groups increased to five, including groups with respiratory disease codes (J00) and respiratory symptoms (R05, O47). From June to December, the groups were consistently divided into three: a group focusing on infectious diseases (Z20), pregnancy-related care (Z34), and sleep disorders (G47); a group with urinary system diseases (N39), often accompanied by hematuria (R31) and kidney stones (N20); and a group dominated by gout (M10) and chronic viral hepatitis (B18), frequently associated with respiratory diseases (J44, J18) and chronic ischemic heart disease (I25).
Observations throughout the year show that urinary diseases (N class) and infectious diseases (Z20) were almost constantly present, along with the consistent appearance of respiratory diseases (J class), indicating these as the main health concerns affecting the non-indigenous population. The presence of chronic ischemic heart disease (I25) may relate to lifestyle factors in the non-indigenous community.

4.6. Comparative Analysis of Major Disease Codes among Indigenous and Non-Indigenous Patients

Utilizing K-means clustering, which resulted in groups of three, four, and five, an examination of the primary disease codes was conducted. Across eight months, namely, March, April, and July to December, where the data were clustered into three groups, the same disease codes were consistently observed each month in the indigenous population. To facilitate a clearer observation and comparison of similar codes across different groups, the results are documented in Table 16. Consequently, the outcomes were categorized into three classes: class A, predominantly featuring codes I10, I11, K21, and Z34; class B, primarily centered around code N39; and class C, focusing on code M10.
In the non-indigenous groups, similar patterns were observed. In class A, the codes included K21 and G47, sharing K21 with the indigenous group but adding G47 while lacking I10, I11, and Z34. Both indigenous and non-indigenous groups had N39 as the main code in class B. In class C, both groups shared the M10 code, but an additional N40 code was observed in the non-indigenous group. During the same period, when the data were categorized into three groups, a unique situation was noted in June for the indigenous population. They had groups belonging to class A and class C, but not class B. Instead, there was an additional group, classified as class D, characterized by Z20 as the primary code, accompanied by respiratory diseases (U07, J00, J02, J06, R05). In contrast, the non-indigenous population continued to be categorized under the original classes A, B, and C. This variance in June, particularly considering the specific impact of COVID-19 (U07) in 2022, suggests that the indigenous population was more significantly affected during this month, leading to a different clustering trend. After classifying the main diseases based on their similarities, it becomes easier to observe the differing trends between infectious and chronic diseases over the 12 months. Consequently, the subsequent analysis focuses on infectious diseases and chronic diseases.

4.6.1. Impacts of Infectious Diseases

In the cluster analysis conducted in January and February, when the data were segmented into four groups, both indigenous and non-indigenous populations exhibited grouping patterns similar to the three groups observed in other months. This included class A, predominantly characterized by gastroesophageal reflux disease (K21); class B, centered around urinary system infections (N39); and class C, led by gout (M10). However, a category similar to that observed in June for the indigenous population emerged: class D. Unlike in June, the frequency of respiratory disease codes was lower in January and February, with pneumonia (J18) notably being the second most common condition. Given the outbreak of COVID-19 in Taiwan during these months, the appearance of class D could serve as an early warning signal of the epidemic.
In the analysis of May, considering the greater number of assigned groups due to potential overlapping diseases in patients, primary diseases were selected based on a threshold of 40%. In addition to the similar classes A, B, C, and D, an additional class E was identified, which is primarily associated with acute upper respiratory infections (J00). In this group, the specific COVID-19 disease code (U07) exceeded 43% in both indigenous and non-indigenous populations, aligning with the surge in COVID-19 cases in Taiwan in May. This finding indicates the emergence of a new group of patients seeking medical assistance due to the epidemic, in addition to regular medical patients. Among groups in May, 48% of the indigenous population had the U07 code, compared to 43% in the non-indigenous population. The disparity in the proportion of disease codes suggests a greater impact on indigenous communities. Considering lifestyle factors, indigenous communities, often residing in closely knit tribes, have a higher interaction frequency compared to non-indigenous populations. In addition, indigenous people are slower to receive disease information and initiate preventive measures compared to people in urban areas.

4.6.2. Impacts of Type 2 Diabetes

In the analysis of major diseases, type 2 diabetes (code E11), initially ranked in the top three by patient frequency, did not consistently feature as a primary disease once proportionality was adopted. Unlike infectious diseases, which frequently appeared in more than four groups in January, February, and May, chronic diseases were consistently present in all monthly groupings. This illustrates that, after grouping, each cluster maintained a certain proportion of chronic disease codes, differentiated by the accompanying comorbidities. Additionally, since chronic diseases are closely linked to time and disease progression, age was factored into the analysis. Comparing the groups by age revealed that the age distributions appeared to lag one another by roughly ten years. Figure 3, Figure 4 and Figure 5 illustrate the occurrences in March as an example.
The bar chart displays the number of individuals with the E11 code in terms of age. The line graph represents cumulative cases. Observing the cumulative cases, it is evident that the proportion of the E11 code in the indigenous population is mostly higher than in the non-indigenous groups. The bar chart reveals similar trends in both groups, but with different age brackets. The trend for the non-indigenous group appears a decade later than that for the indigenous group, indicating an earlier onset of E11-related health impacts in the indigenous population.
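The decade-shifted age distributions described above can be reproduced in miniature. The ages below are hypothetical, and `bracket_counts` and `peak_bracket` are illustrative helpers of our own, not code from the study; the only assumption carried over is the 10-year bracket width used in the figures.

```python
# Sketch with invented ages: bin E11 patients into 10-year brackets and
# compare the modal bracket between two groups, as in Figures 3-5.
def bracket_counts(ages, width=10, max_age=100):
    counts = [0] * (max_age // width)
    for a in ages:
        counts[min(a // width, len(counts) - 1)] += 1
    return counts

def peak_bracket(ages):
    counts = bracket_counts(ages)
    return counts.index(max(counts)) * 10  # lower bound of the modal bracket

indigenous = [48, 52, 55, 57, 59, 63, 71]      # modal bracket: 50-59
non_indigenous = [55, 61, 64, 66, 68, 72, 81]  # modal bracket: 60-69
lag = peak_bracket(non_indigenous) - peak_bracket(indigenous)
print(lag)  # 10
```

A positive lag, as here, corresponds to the indigenous distribution peaking a decade earlier, i.e., an earlier onset of E11-related health impacts.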

4.6.3. Impacts of Essential Hypertension and Hypertensive Heart Diseases

Observations from the analysis of major diseases reveal distinct patterns within chronic heart diseases of the I class. Among the indigenous population, the prevalent codes include I10 (essential hypertension) and I11 (hypertensive heart diseases), predominantly occurring in disease groups associated with K21 (gastroesophageal reflux disease). In contrast, the non-indigenous population predominantly showed the presence of I25 (chronic ischemic heart disease). The prevalence of the I10 code, which is linked to genetic factors, indicates a heightened need for preventive measures against heart disease risks in the indigenous population.
Upon categorizing the clustering results based on the main diseases, distinct trends were observed between chronic and infectious diseases. Chronic diseases displayed a more consistent distribution across all 12 months. However, infectious diseases exhibited greater variability. A month-by-month analysis was performed to track the dynamics. Therefore, using clustering methods to differentiate between chronic diseases and infectious diseases provides a clearer outline of regional disease patterns and healthcare needs. This approach allows for a better understanding of medical requirements in different areas and facilitates the provision of appropriate medical assistance tailored to the needs of specific subgroups. Additionally, in the context of epidemic prevention and control, this method can enable early prevention and management based on local infection trends.
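One simple way to quantify the contrast between the steady monthly presence of chronic codes and the burstiness of infectious codes is the coefficient of variation of monthly counts. The sketch below uses invented monthly figures, not the study's data, and the 0.1 cutoff is purely illustrative.

```python
# Sketch: a chronic code (e.g., E11) shows a steady monthly count, while an
# infectious code (e.g., U07) is bursty; the coefficient of variation
# (std / mean) of the 12 monthly counts separates the two patterns.
from statistics import mean, pstdev

def coeff_var(monthly_counts):
    m = mean(monthly_counts)
    return pstdev(monthly_counts) / m

e11 = [178, 170, 206, 190, 167, 145, 160, 171, 166, 180, 175, 169]  # steady
u07 = [0, 0, 0, 5, 83, 39, 20, 12, 8, 30, 5, 2]                     # bursty
print(coeff_var(e11) < 0.1 < coeff_var(u07))  # True
```

Such a statistic could flag codes whose monthly dynamics warrant epidemic-style monitoring rather than chronic-disease management.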

5. Conclusions

This study employed clustering techniques to group and then analyze diseases in the indigenous and non-indigenous populations. K-means clustering obtained better results than the other three clustering techniques on all three evaluation measurements. The developed model can learn distances between clusters and further investigate relations among diseases through the features of the clusters. Although the medical conditions of patient groups vary each month, a consistent clustering trend is observed overall. This trend is particularly pronounced when the primary disease is the same, indicating a higher probability of certain disease codes appearing together. This result lays the foundation for a deeper exploration of potential correlations between diseases.
From the perspectives of chronic diseases and bacterial or viral infections, we noted distinct clustering behaviors between the two. The chronic disease group exhibited consistency in the monthly analyses, while the clustering of bacterial or viral infections showed a close correlation with the stages of epidemic development. This was particularly evident in the context of the 2022 epidemic trends in Taiwan, where changes in the number of clusters were highly correlated with different stages of the epidemic.
The unsupervised clustering method helps identify correlations in complex and varied data that are not readily observable. However, the diversity of the data, along with patients' varying medical needs at the time of consultation, poses challenges in data preprocessing, as do the scale and complexity of the data, which include both structured and unstructured fields. In addition to medical treatment, records may reflect return visits or patients with chronic diseases who only receive medicine without treatment. Thus, the presentation of the data should consider not only the identity of the patient but also the timeliness of the patient's visits.
Only data from 2022 were employed in this study; data from other years could be used to examine the feasibility of the proposed SHCM system. In addition, only data collected from Puli Christian Hospital were used; data from other hospitals could be utilized to investigate the generalization ability of the developed system. Finally, deep clustering techniques could be employed for the clustering tasks of the presented system.

Author Contributions

Conceptualization, P.-F.P., W.-C.Y. and H.-P.H.; methodology, J.-P.L. and P.-F.P.; software, J.-P.L., Y.-H.L. and Y.-L.L.; formal analysis, Y.-H.L.; writing—original draft preparation, Y.-H.L., Y.-L.L. and P.-F.P.; writing—review and editing, P.-F.P.; visualization, Y.-H.L. and Y.-L.L.; supervision, P.-F.P. and W.-C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by funding from Puli Christian Hospital/Chi Nan National University Joint Research Program under grant number 112-PuChi-AIR-001.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to the use of a database with data aggregated by age (10-year age groups) and diagnosis categories.

Informed Consent Statement

Informed consent was not required as cohort members were unidentifiable.

Data Availability Statement

The data presented in this study are available on reasonable request from the corresponding author. The data are not publicly available due to privacy restrictions.

Acknowledgments

The authors thank Kai Yen, Bing-Cheng Chiu, and Yan-Song Chang for their assistance with the data analysis.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

In this study, the analysis of electronic medical record (EMR) data involved labeling patients' diseases according to the International Statistical Classification of Diseases and Related Health Problems, 10th Revision, Clinical Modification (ICD-10 CM). The ICD-10, established by the World Health Organization (WHO), categorizes diseases based on their characteristics and represents them using a coding system. This system is crucial for accurately and systematically recording cases, and it plays a significant role in clinical diagnosis, epidemiological research, health management, and data collection. This appendix includes only the disease codes directly relevant to our study. These codes represent the specific disease types involved in our research and are intended to help readers understand the scope and focus of our study. Each code is accompanied by a brief description of the disease, making it accessible to readers who are not specialists in the field.
Table A1. The ICD-10 CM and brief descriptions.
ICD-10 CM | Diseases
A09 | Infectious gastroenteritis and colitis, unspecified
B08 | Other viral infections characterized by skin and mucous membrane lesions, not elsewhere classified
B18 | Chronic viral hepatitis
C50 | Malignant neoplasm of breast
D64 | Other anemias
E11 | Type 2 diabetes mellitus
E78 | Disorders of lipoprotein metabolism and other lipidemias
E86 | Volume depletion
E87 | Other disorders of fluid, electrolyte and acid-base balance
G47 | Sleep disorders
I10 | Essential (primary) hypertension
I11 | Hypertensive heart disease
I20 | Angina pectoris
I25 | Chronic ischemic heart disease
I50 | Heart failure
J00 | Acute nasopharyngitis [common cold]
J01 | Acute sinusitis
J02 | Acute pharyngitis
J03 | Acute tonsillitis
J06 | Acute upper respiratory infections of multiple and unspecified sites
J12 | Viral pneumonia, not elsewhere classified
J18 | Pneumonia, unspecified organism
J20 | Acute bronchitis
J30 | Vasomotor and allergic rhinitis
J44 | Other chronic obstructive pulmonary disease
J45 | Asthma
K21 | Gastroesophageal reflux disease
K25 | Gastric ulcer
K29 | Gastritis and duodenitis
K59 | Other functional intestinal disorders
K92 | Other diseases of digestive system
L03 | Cellulitis and acute lymphangitis
L08 | Other local infections of skin and subcutaneous tissue
M10 | Gout
M19 | Other and unspecified osteoarthritis
M54 | Dorsalgia
N13 | Obstructive and reflux uropathy
N18 | Chronic kidney disease (CKD)
N20 | Calculus of kidney and ureter
N39 | Other disorders of urinary system
N40 | Benign prostatic hyperplasia
O47 | False labor
P59 | Neonatal jaundice from other and unspecified causes
R00 | Abnormalities of heart beat
R05 | Cough
R07 | Pain in throat and chest
R10 | Abdominal and pelvic pain
R11 | Nausea and vomiting
R31 | Hematuria
R35 | Polyuria
R42 | Dizziness and giddiness
R50 | Fever of other and unknown origin
R51 | Headache
R80 | Proteinuria
U07 | Emergency use of U07
Z11 | Encounter for screening for infectious and parasitic diseases
Z20 | Contact with and (suspected) exposure to communicable diseases
Z34 | Encounter for supervision of normal pregnancy
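In code terms, labeling EMR records with these categories reduces to a dictionary lookup. The subset below is drawn from Table A1; the `label` helper is our own illustrative sketch, not part of the study's pipeline.

```python
# Sketch: a minimal ICD-10 CM lookup for labeling records, using a subset of
# the codes listed in Table A1.
ICD10 = {
    "E11": "Type 2 diabetes mellitus",
    "I10": "Essential (primary) hypertension",
    "K21": "Gastroesophageal reflux disease",
    "M10": "Gout",
    "N39": "Other disorders of urinary system",
    "U07": "Emergency use of U07",
}

def label(code):
    """Return the Table A1 description for a code, or a fallback string."""
    return ICD10.get(code, "Unknown code")

print(label("K21"))  # Gastroesophageal reflux disease
```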

References

  1. Parimbelli, E.; Marini, S.; Sacchi, L.; Bellazzi, R. Patient similarity for precision medicine: A systematic review. J. Biomed. Inform. 2018, 83, 87–96. [Google Scholar] [CrossRef] [PubMed]
  2. Lambert, J.; Leutenegger, A.-L.; Jannot, A.-S.; Baudot, A. Tracking clusters of patients over time enables extracting information from medico-administrative databases. J. Biomed. Inform. 2023, 139, 104309. [Google Scholar] [CrossRef] [PubMed]
  3. Zelina, P.; Halámková, J.; Nováček, V. Unsupervised extraction, labelling and clustering of segments from clinical notes. In Proceedings of the 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Las Vegas, NV, USA, 6–8 December 2022; pp. 1362–1368. [Google Scholar]
  4. Irving, J.; Patel, R.; Oliver, D.; Colling, C.; Pritchard, M.; Broadbent, M.; Baldwin, H.; Stahl, D.; Stewart, R.; Fusar-Poli, P. Using natural language processing on electronic health records to enhance detection and prediction of psychosis risk. Schizophr. Bull. 2021, 47, 405–414. [Google Scholar] [CrossRef] [PubMed]
  5. Ebad, S.A. Healthcare software design and implementation—A project failure case. Softw. Pract. Exp. 2020, 50, 1258–1276. [Google Scholar] [CrossRef]
  6. Mashoufi, M.; Ayatollahi, H.; Khorasani-Zavareh, D.; Talebi Azad Boni, T. Data quality in health care: Main concepts and assessment methodologies. Methods Inf. Med. 2023, 62, 005–018. [Google Scholar] [CrossRef] [PubMed]
  7. Ezugwu, A.E.; Ikotun, A.M.; Oyelade, O.O.; Abualigah, L.; Agushaka, J.O.; Eke, C.I.; Akinyelu, A.A. A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Eng. Appl. Artif. Intell. 2022, 110, 104743. [Google Scholar] [CrossRef]
  8. Saxena, A.; Prasad, M.; Gupta, A.; Bharill, N.; Patel, O.P.; Tiwari, A.; Er, M.J.; Ding, W.; Lin, C.-T. A review of clustering techniques and developments. Neurocomputing 2017, 267, 664–681. [Google Scholar] [CrossRef]
  9. Chaudhry, M.; Shafi, I.; Mahnoor, M.; Vargas, D.L.R.; Thompson, E.B.; Ashraf, I. A systematic literature review on identifying patterns using unsupervised clustering algorithms: A data mining perspective. Symmetry 2023, 15, 1679. [Google Scholar] [CrossRef]
  10. Oyewole, G.J.; Thopil, G.A. Data clustering: Application and trends. Artif. Intell. Rev. 2023, 56, 6439–6475. [Google Scholar] [CrossRef]
  11. Santamaría, L.P.; del Valle, E.P.G.; García, G.L.; Zanin, M.; González, A.R.; Ruiz, E.M.; Gallardo, Y.P.; Chan, G.S.H. Analysis of new nosological models from disease similarities using clustering. In Proceedings of the 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), Rochester, MN, USA, 28–30 July 2020; pp. 183–188. [Google Scholar]
  12. Farouk, Y.; Rady, S. Early diagnosis of alzheimer’s disease using unsupervised clustering. Int. J. Intell. Comput. Inf. Sci. 2020, 20, 112–124. [Google Scholar] [CrossRef]
  13. Hassan, M.M.; Mollick, S.; Yasmin, F. An unsupervised cluster-based feature grouping model for early diabetes detection. Healthc. Anal. 2022, 2, 100112. [Google Scholar] [CrossRef]
  14. Antony, L.; Azam, S.; Ignatious, E.; Quadir, R.; Beeravolu, A.R.; Jonkman, M.; De Boer, F. A comprehensive unsupervised framework for chronic kidney disease prediction. IEEE Access 2021, 9, 126481–126501. [Google Scholar] [CrossRef]
  15. Enireddy, V.; Anitha, R.; Vallinayagam, S.; Maridurai, T.; Sathish, T.; Balakrishnan, E. Prediction of human diseases using optimized clustering techniques. Mater. Today Proc. 2021, 46, 4258–4264. [Google Scholar] [CrossRef]
  16. Arora, N.; Singh, A.; Al-Dabagh, M.Z.N.; Maitra, S.K. A novel architecture for diabetes patients’ prediction using k-means clustering and svm. Math. Probl. Eng. 2022, 2022, 4815521. [Google Scholar] [CrossRef]
  17. Parikh, H.M.; Remedios, C.L.; Hampe, C.S.; Balasubramanyam, A.; Fisher-Hoch, S.P.; Choi, Y.J.; Patel, S.; McCormick, J.B.; Redondo, M.J.; Krischer, J.P. Data mining framework for discovering and clustering phenotypes of atypical diabetes. J. Clin. Endocrinol. Metab. 2023, 108, 834–846. [Google Scholar] [CrossRef] [PubMed]
  18. Jasinska-Piadlo, A.; Bond, R.; Biglarbeigi, P.; Brisk, R.; Campbell, P.; Browne, F.; McEneaneny, D. Data-driven versus a domain-led approach to k-means clustering on an open heart failure dataset. Int. J. Data Sci. Anal. 2023, 15, 49–66. [Google Scholar] [CrossRef]
  19. Mpanya, D.; Celik, T.; Klug, E.; Ntsinjana, H. Clustering of heart failure phenotypes in johannesburg using unsupervised machine learning. Appl. Sci. 2023, 13, 1509. [Google Scholar] [CrossRef]
  20. Florensa, D.; Mateo-Fornés, J.; Solsona, F.; Pedrol Aige, T.; Mesas Julió, M.; Piñol, R.; Godoy, P. Use of multiple correspondence analysis and k-means to explore associations between risk factors and likelihood of colorectal cancer: Cross-sectional study. J. Med. Internet Res. 2022, 24, e29056. [Google Scholar] [CrossRef]
  21. Koné, A.P.; Scharf, D.; Tan, A. Multimorbidity and complexity among patients with cancer in ontario: A retrospective cohort study exploring the clustering of 17 chronic conditions with cancer. Cancer Control 2023, 30, 10732748221150393. [Google Scholar] [CrossRef]
  22. Chantraine, F.; Schreiber, C.; Pereira, J.A.C.; Kaps, J.; Dierick, F. Classification of stiff-knee gait kinematic severity after stroke using retrospective k-means clustering algorithm. J. Clin. Med. 2022, 11, 6270. [Google Scholar] [CrossRef]
  23. Yasa, I.; Rusjayanthi, N.; Luthfi, W.B.M. Classification of stroke using k-means and deep learning methods. Lontar Komput. J. Ilm. Teknol. Inf. 2022, 13, 23. [Google Scholar] [CrossRef]
  24. Al-Khafaji, H.M.R.; Jaleel, R.A. Adopting effective hierarchal iomts computing with k-efficient clustering to control and forecast covid-19 cases. Comput. Electr. Eng. 2022, 104, 108472. [Google Scholar] [CrossRef] [PubMed]
  25. Ilbeigipour, S.; Albadvi, A.; Noughabi, E.A. Cluster-based analysis of covid-19 cases using self-organizing map neural network and k-means methods to improve medical decision-making. Inform. Med. Unlocked 2022, 32, 101005. [Google Scholar] [CrossRef] [PubMed]
  26. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 18–21 June 1965 and 27 December 1965–7 January 1966; University of California Press: Oakland, CA, USA, 1967; pp. 281–297. [Google Scholar]
  27. Na, S.; Xumin, L.; Yong, G. Research on k-means clustering algorithm: An improved k-means clustering algorithm. In Proceedings of the 2010 Third International Symposium on Intelligent Information Technology and Security Informatics, Jian, China, 2–4 April 2010; pp. 63–67. [Google Scholar]
  28. Alam, M.S.; Rahman, M.M.; Hossain, M.A.; Islam, M.K.; Ahmed, K.M.; Ahmed, K.T.; Singh, B.C.; Miah, M.S. Automatic human brain tumor detection in mri image using template-based k means and improved fuzzy c means clustering algorithm. Big Data Cogn. Comput. 2019, 3, 27. [Google Scholar] [CrossRef]
  29. Lee, H.; Choi, Y.; Son, B.; Lim, J.; Lee, S.; Kang, J.W.; Kim, K.H.; Kim, E.J.; Yang, C.; Lee, J.-D. Deep autoencoder-powered pattern identification of sleep disturbance using multi-site cross-sectional survey data. Front. Med. 2022, 9, 950327. [Google Scholar] [CrossRef] [PubMed]
  30. Setiawan, K.E.; Kurniawan, A.; Chowanda, A.; Suhartono, D. Clustering models for hospitals in jakarta using fuzzy c-means and k-means. Procedia Comput. Sci. 2023, 216, 356–363. [Google Scholar] [CrossRef] [PubMed]
  31. Yuan, C.; Yang, H. Research on k-value selection method of k-means clustering algorithm. J 2019, 2, 226–235. [Google Scholar] [CrossRef]
  32. Murtagh, F.; Contreras, P. Algorithms for hierarchical clustering: An overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2012, 2, 86–97. [Google Scholar] [CrossRef]
  33. Rumelhart, D.; Hinton, G.; Williams, R. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition; MIT Press: Cambridge, MA, USA, 1986; Chapter 8; Volume 1, pp. 318–362. [Google Scholar]
  34. Baldi, P. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, Bellevue, WA, USA, 2 July 2011; JMLR Workshop and Conference Proceedings. ML Research Press: London, UK, 2012; pp. 37–49. [Google Scholar]
  35. Zhang, L.; Lv, C.; Jin, Y.; Cheng, G.; Fu, Y.; Yuan, D.; Tao, Y.; Guo, Y.; Ni, X.; Shi, T. Deep learning-based multi-omics data integration reveals two prognostic subtypes in high-risk neuroblastoma. Front. Genet. 2018, 9, 477. [Google Scholar] [CrossRef]
  36. Bank, D.; Koenigstein, N.; Giryes, R. Autoencoders. Machine Learning for Data Science Handbook: Data Mining and Knowledge Discovery Handbook; Springer: New York, NY, USA, 2023; pp. 353–374. [Google Scholar]
  37. Kohonen, T. The self-organizing map. Proc. IEEE 1990, 78, 1464–1480. [Google Scholar] [CrossRef]
  38. Vesanto, J.; Alhoniemi, E. Clustering of the self-organizing map. IEEE Trans. Neural Netw. 2000, 11, 586–600. [Google Scholar] [CrossRef] [PubMed]
  39. Caliński, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat.-Theory Methods 1974, 3, 1–27. [Google Scholar] [CrossRef]
  40. Desgraupes, B. Clustering indices. Univ. Paris Ouest-Lab Modal’X 2013, 1, 34. [Google Scholar]
  41. Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227. [Google Scholar] [CrossRef]
  42. Xiao, J.; Lu, J.; Li, X. Davies bouldin index based hierarchical initialization k-means. Intell. Data Anal. 2017, 21, 1327–1338. [Google Scholar] [CrossRef]
  43. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  44. Shahapure, K.R.; Nicholas, C. Cluster quality analysis using silhouette score. In Proceedings of the 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), Sydney, Australia, 6–9 October 2020; pp. 747–748. [Google Scholar]
  45. Harada, D.; Asanoi, H.; Noto, T.; Takagawa, J. Different pathophysiology and outcomes of heart failure with preserved ejection fraction stratified by k-means clustering. Front. Cardiovasc. Med. 2020, 7, 607760. [Google Scholar] [CrossRef]
Figure 1. The proposed smart healthcare system based on clustering techniques and medical data (SHCM).
Figure 2. Data preprocessing steps.
Figure 3. The age distribution of patients with E11 disease code of class A in March.
Figure 4. The age distribution of patients with E11 disease code of class B in March.
Figure 5. The age distribution of patients with E11 disease code of class C in March.
Table 1. Recent clustering methods for medical data.
References | Years | Applications | Methods of Clustering
Santamaría et al. [11] | 2020 | Analysis of new nosological models | DBSCAN *
Farouk and Rady [12] | 2020 | Early diagnosis of Alzheimer's disease | K-means, K-medoids
Hassan et al. [13] | 2022 | As a feature-grouping model for early diabetes detection | K-means
Antony et al. [14] | 2021 | Chronic kidney disease prediction | K-means, DBSCAN *, I-Forest *, Autoencoder
Enireddy et al. [15] | 2021 | Prediction of diseases | K-means, Agglomerative, Fuzzy C-means
Arora et al. [16] | 2022 | As a feature-extracted tool for diabetes patient prediction | K-means
Parikh et al. [17] | 2023 | Discovering and clustering phenotypes of atypical diabetes | K-means
Jasinska-Piadlo et al. [18] | 2023 | Clustering heart failures | K-means
Mpanya et al. [19] | 2023 | Clustering heart failure phenotypes | K-prototype, K-means, Agglomerative, BIRCH *, OPTICS *, DBSCAN *, GMM *
Florensa et al. [20] | 2022 | Exploring associations between risk factors and likelihood of colorectal cancer | K-means
Koné et al. [21] | 2023 | Exploring the clustering of 17 chronic conditions with cancer | K-means
Chantraine et al. [22] | 2022 | Classification of stiff-knee gait kinematic severity after stroke | K-means
Yasa et al. [23] | 2022 | Classification of stroke | K-means
Al-Khafaji and Jaleel [24] | 2022 | Controlling and forecasting COVID-19 cases | K-Efficient (a hybrid of K-medoids and K-means)
Ilbeigipour et al. [25] | 2022 | The analysis of COVID-19 cases | SOM, K-means
Note: * I-Forest = Isolation Forest; BIRCH = Balanced Iterative Reducing and Clustering Hierarchies; OPTICS = Ordering Points to Identify the Clustering Structure; DBSCAN = Density-Based Spatial Clustering of Applications with Noise; GMM = Gaussian Mixture Model.
Table 2. Conversion methods for attributes.
Variables | Attributes | Conversion Methods
X1 | Gender | Labeling
X2 | Age | From birthdays to ages
X3~X12 | Drug items | BERT and PCA
X13~X22 | Doctors' advice | BERT and PCA
X23~Xn | Exam items | Labeling
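The BERT-and-PCA conversion in Table 2 amounts to embedding each free-text field and projecting the embedding onto a few principal components. The sketch below substitutes random 768-dimensional vectors for real BERT outputs and uses our own SVD-based `pca_reduce`; the dimensions (768 in, 10 out) are assumptions for illustration only.

```python
# Sketch of the Table 2 text conversion: each record's free-text attribute is
# embedded (random vectors stand in for BERT here) and reduced with PCA.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 768))  # 100 records x 768-dim "BERT" vectors

def pca_reduce(X, n_components=10):
    Xc = X - X.mean(axis=0)                     # center the data
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T             # project onto top components

reduced = pca_reduce(embeddings)
print(reduced.shape)  # (100, 10)
```

In practice, the 768-dimensional sentence vectors would come from a pretrained BERT encoder; the PCA step keeps the attribute count per text field manageable for clustering.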
Table 3. Numbers of attributes and visits of patients in 12 months.
Datasets | Jan. | Feb. | Mar. | Apr. | May | Jun.
Number of attributes | 456 | 394 | 418 | 447 | 417 | 416
Visits of patients | 4765 | 4405 | 5667 | 5410 | 7593 | 5397
Datasets | Jul. | Aug. | Sep. | Oct. | Nov. | Dec.
Number of attributes | 455 | 447 | 445 | 404 | 444 | 470
Visits of patients | 5136 | 5341 | 5023 | 5233 | 4675 | 4506
Table 4. The numbers of clusters using four methods from January to December.
Methods | Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sep. | Oct. | Nov. | Dec.
KM | 4 | 4 | 3 | 3 | 5 | 3 | 3 | 3 | 3 | 3 | 3 | 3
AEKM | 4 | 4 | 3 | 3 | 5 | 3 | 3 | 3 | 3 | 3 | 3 | 3
SOMKM | 4 | 4 | 3 | 3 | 4 | 3 | 4 | 3 | 4 | 4 | 3 | 3
HC | 4 | 4 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 4
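For reference, the KM baseline compared in Table 4 is standard Lloyd's-algorithm k-means. The sketch below runs it on toy 2-D data rather than the study's 400-plus-attribute records, and its deterministic initialization (evenly spaced points) is our simplification, not the study's setup.

```python
# Minimal k-means (Lloyd's algorithm) sketch on toy 2-D data.
import numpy as np

def kmeans(X, k, iters=50):
    # Deterministic init for reproducibility: k points spread through the data.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
labels, _ = kmeans(X, 2)
print(len(np.unique(labels)))  # 2
```

On the actual visit data, each row would be a patient-visit vector built from the Table 2 attributes, and k would be chosen per month as in Table 4.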
Table 5. The clustering performance in terms of the Calinski–Harabasz index.
Methods | Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sep. | Oct. | Nov. | Dec.
KM | 1310.21 | 1431.42 | 1988.48 | 2107.83 | 1810.97 | 1892.51 | 1914.83 | 2082.41 | 1682.57 | 1731.59 | 1590.45 | 1323.84
AEKM | 257.73 | 268.64 | 217.65 | 399.86 | 319.53 | 653.52 | 709.87 | 900.43 | 347.86 | 336.36 | 384.98 | 163.36
SOMKM | 1281.33 | 1423.37 | 1988.48 | 1012.57 | 1377.30 | 1888.91 | 1432.10 | 1841.56 | 1391.69 | 1416.60 | 1590.45 | 1233.08
HC | 1264.75 | 1368.76 | 1883.29 | 2008.26 | 2060.34 | 1847.02 | 1805.66 | 2035.00 | 1632.01 | 1690.22 | 1512.80 | 1112.01
Table 6. The clustering performance in terms of the Davies–Bouldin index.
Methods | Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sep. | Oct. | Nov. | Dec.
KM | 1.43 | 1.28 | 1.43 | 1.25 | 1.6 | 1.31 | 1.27 | 1.28 | 1.44 | 1.5 | 1.35 | 1.58
AEKM | 4.59 | 3.68 | 5.55 | 6 | 5.33 | 4.02 | 3.12 | 3.74 | 3.58 | 4.34 | 3.11 | 6.18
SOMKM | 1.42 | 1.29 | 1.43 | 2.19 | 1.96 | 1.31 | 1.91 | 1.5 | 1.57 | 1.67 | 1.35 | 1.7
HC | 1.45 | 1.31 | 1.43 | 1.27 | 1.49 | 1.32 | 1.34 | 1.28 | 1.43 | 1.52 | 1.38 | 1.53
Table 7. The clustering performance in terms of the Silhouette Coefficient.
Methods | Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sep. | Oct. | Nov. | Dec.
KM | 0.27 | 0.29 | 0.27 | 0.32 | 0.24 | 0.31 | 0.31 | 0.32 | 0.27 | 0.25 | 0.29 | 0.23
AEKM | 0.01 | 0.03 | 0.02 | 0.07 | 0.01 | 0.13 | 0.16 | 0.13 | 0.04 | 0.03 | 0.04 | 0.01
SOMKM | 0.26 | 0.29 | 0.27 | 0.16 | 0.2 | 0.31 | 0.24 | 0.31 | 0.28 | 0.25 | 0.29 | 0.21
HC | 0.27 | 0.29 | 0.25 | 0.31 | 0.26 | 0.3 | 0.3 | 0.32 | 0.26 | 0.25 | 0.29 | 0.25
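The Calinski–Harabasz index reported in Table 5 is the ratio of between-cluster to within-cluster dispersion, scaled by (n - k)/(k - 1), so higher values indicate tighter, better-separated clusters. The sketch below is our own self-contained implementation on toy data, not the evaluation script used in the study.

```python
# Sketch: compute the Calinski-Harabasz index for a labeled 2-D dataset.
import numpy as np

def calinski_harabasz(X, labels):
    n, k = len(X), len(np.unique(labels))
    overall = X.mean(axis=0)
    between = within = 0.0
    for j in np.unique(labels):
        Xj = X[labels == j]
        cj = Xj.mean(axis=0)
        between += len(Xj) * ((cj - overall) ** 2).sum()  # between-cluster dispersion
        within += ((Xj - cj) ** 2).sum()                  # within-cluster dispersion
    return (between / (k - 1)) / (within / (n - k))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(4, 0.5, (40, 2))])
labels = np.array([0] * 40 + [1] * 40)
print(calinski_harabasz(X, labels) > 100)  # True: well-separated clusters score high
```

The Davies–Bouldin index of Table 6 rewards the same structure in the opposite direction (lower is better), which is why KM scores highest on Table 5 and lowest on Table 6 in most months.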
Table 8. Clustering analysis of indigenous patients in January.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %) | Cluster 4 (Codes / Visits / Group %)
1E119%178E118749%E112413%E116235%Z206594%
2E785%112I104550%E781715%E785146%J183846%
3I104%90E784439%N391750%I112835%J011636%
4J184%82I114353%I101618%I102730%R501445%
5I114%81J182227%N181232%I252145%E1153%
6Z203%69B182255%R501032%M102065%J20518%
7I252%47K212154%I11911%B181743%J45517%
8J012%44Z341860%Z34827%J181620%R10519%
9B182%40I251532%I25715%J011534%I2549%
10K212%39J451448%R10726%K211436%N18411%
11N182%37 N20747% J30417%
12N392%34 M10619% A09421%
13R502%31 J1867% R11431%
14M102%31 E86450%
15Z341%30 Z34413%
Table 9. Clustering analysis of indigenous patients in March.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %)
1E118%206E113015%E119345%E118340%
2E786%154R502545%Z207974%E786341%
3I104%108N392356%E786945%I103633%
4Z204%107E782214%I105551%I113434%
5I114%99I101716%I115354%Z113360%
6J183%80N181425%J184151%J013141%
7J013%76I111212%J013951%M103060%
8B182%56J181215%K212963%Z202826%
9R502%56R101135%Z342882%J182734%
10N182%56I25920%B182646%B182646%
11Z112%55 N182646%
12M102%50
13K212%46
14I252%46
15N392%41
Table 10. Clustering analysis of indigenous patients in May.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %) | Cluster 4 (Codes / Visits / Group %) | Cluster 5 (Codes / Visits / Group %)
1Z2022%660E115835%E118350%Z2024036%N392362%Z2034652%
2E116%167E784643%Z20568%R055645%R502122%R056653%
3R054%125I112734%E784945%J063037%E112113%U074048%
4E784%108B182047%I104061%R502931%I111215%J063948%
5R503%94I101929%I113848%U072530%E781110%R502931%
6U073%83K211442%Z343065%J181019%N181020%J18917%
7J063%81J441461%I252350%J201033%Z34920%J00850%
8I113%79Z20132%B182047%J00850%N40964%Z34715%
9I102%66J181324%N182041%J01739%I25715%J20723%
10J182%54N181224%K211855%N18612%M10724%R10523%
11N182%49I251226%J181630%J30526%I10711%J02563%
12Z342%46M101241% E1142%J18611%
13I252%46 I2549%N20655%
14B181%43 R51420%R10627%
15N391%37 R11427%M54630%
Table 11. Clustering analysis of indigenous patients in June.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %)
1Z2015%309E115840%E118357%Z2024078%
2E117%145E784647%Z205618%R055695%
3E785%97I112740%E784951%J063075%
4I113%67B182050%I104068%R502966%
5I103%59I101932%I113857%U072564%
6R053%59K211444%Z3430100%J181026%
7R502%44J441478%I252359%J201056%
8B182%40Z20134%B182050%J008100%
9J062%40J181333%N182053%J01744%
10J182%39I251231%K211856%N18616%
11I252%39N181232%J181641%J30533%
12U072%39M101255% E1143%
13N182%38Z111148% I25410%
14K212%32K251150% R51424%
15Z341%30R50920% R11440%
Table 12. Clustering analysis of non-indigenous patients in January.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %) | Cluster 4 (Codes / Visits / Group %)
1E118%985E1136437%E1117618%E1141442%Z2041296%
2E786%790E7831640%N3913573%E7834944%J185032%
3I115%680I1130345%E7812115%I1128642%E11313%
4Z203%429I1014938%N188120%I1016943%N18307%
5N183%407N1814235%I118012%N1815438%R502124%
6I103%391K2110951%N207251%I2511448%J011821%
7I252%237B188843%I105915%B1810451%I10144%
8K212%212G478855%N405442%K218942%R101411%
9B182%204I258235%R104434%M107766%K921321%
10N391%184K596846%R503541%N407255%A091319%
11G471%161 I253013% E861324%
12J181%158 Z342628%
13K591%147 R312578%
14N201%140
15N401%130
Table 13. Clustering analysis of non-indigenous patients in March.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %)
1E117%953N3917771%E1140643%E1139341%
2E786%827E1115416%Z2039458%E7835142%
3I115%676E7810913%E7836744%I1128442%
4Z205%674N1810323%I1131346%Z2028042%
5I103%464I117912%I1018440%I1021045%
6N183%450R507550%N1815434%N1819343%
7I252%283I107015%G4711160%I2515655%
8N392%251R105843%K2110449%J4410670%
9K211%213N204738%I259835%J1810553%
10J181%199N403627%K598053%B1810054%
11B181%186M103226%Z347871%N409773%
12G471%186Z343229%J187839%K219545%
13J441%152R313070%J307651%M107258%
14K591%151I252910% N206653%
15R501%151
Table 14. Clustering analysis of non-indigenous patients in May.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %) | Cluster 4 (Codes / Visits / Group %) | Cluster 5 (Codes / Visits / Group %)
1Z2019%2922E1136741%E1135440%Z20102935%E1115117%Z20134346%
2E116%888E7831142%Z2032511%R0524244%N3913172%R0530154%
3E785%740I1128544%E7830842%U0712937%E7811115%U0715343%
4I114%648Z202127%I1127943%J0612045%N189222%J0611142%
5R054%553N1817141%I1014939%R508831%I117812%R507527%
6N183%420I1015641%N1813833%J003854%R507326%J003144%
7I102%382I2510944%K2110255%R111629%I106216%Z342016%
8U072%352N408464%I2510141%J201629%N204844%J201629%
9R502%281B187849%G478754%E11142%N404333%R111324%
10J062%265K217239%Z347964%N18143%R103231%R071116%
11I252%248 B187446% R353067%J031034%
12K211%186 C505891% R511025%
13N391%181 K255554% O471077%
14G471%162 R005462%
15B181%160
Table 15. Clustering analysis of non-indigenous patients in June.
Ranks | Total (Codes / Disease % / Visits) | Cluster 1 (Codes / Visits / Group %) | Cluster 2 (Codes / Visits / Group %) | Cluster 3 (Codes / Visits / Group %)
1Z2011%1440Z2086760%Z2055839%E1114017%
2E116%822E1134242%E1134041%N3911166%
3E785%681E7829443%E7828542%E7810215%
4I115%588I1125944%I1125043%N188922%
5N183%399N1814737%I1016545%I117913%
6I103%365I1014540%N1816341%R506929%
7U072%267U0712446%I2511554%I105515%
8R502%238Z3410283%U0710640%U073714%
9I252%214I258037%J449669%N203744%
10N391%168G477753%R509339%R103134%
11B181%148 J187454%N402627%
12K211%147 M107163%R312586%
13G471%145 N407073%
14J061%143 B186745%
15J441%140 K216544%
Table 16. Clusters of diseases.
Months | Clusters of Indigenous Patients (Class A / Class B / Class C / Class D / Class E) | Clusters of Non-Indigenous Patients (Class A / Class B / Class C / Class D / Class E)
Jan.I10, I11, B18, K21, Z34N39M10Z20, E86 K21, G47N39, N20, R31B18, M10, N40Z20
Feb.E78, Z34, I25, K21, J45, G47, M19N39, N20M10, B18, Z11Z20, R07, L08, B08 G47, Z34N39B18, N40, M10Z20
Mar.E11, I10, I11, J18, J01, K21, Z34N39Z11, M10 Z20, G47, K59, Z34, J30N39, R50, R31I25, J44, J18, B18, N40, M10, N20
Apr.Z20, I11, K21, Z34, B18N39, N20J01, N18, Z11 Z20, K21, G47, Z34, C50N39, R35I25, B18, J44, J18, M10, N40, K25, I20
MayE11, E78, I10, I11, Z34, I25, B18, N18, K21N39, Z34, N20E78, B18, K21, J44, M10Z20, R05, U07, J06, J00, J02J00, R05E78, I11, K21, I25, G47, Z34, B18, C50, K25, R00N39, N20, R35E11, E78, I11, N18, I10, I25, N40, B18Z20, R05, U07, J06, J00, O47J00, R05, J06
Jun.E11, E78, I10, I11, Z34, I25, B18, N18, K21 B18, J44, M10, K25Z20, R05, J06, R50, U07, J20, J00 Z20, Z34, G47N39, R31I25, J44, J18, M10, N40
Jul.Z20, E11, E78, I11, I10, Z34, K21, B18N20M10 Z20, Z34, K21, G47, K59, R42, R00N39I25, B18, J44, N40, J18, M10
Aug.Z20, E11, E78, I10, I11, Z34, K21, J01, B18, J20, K59N39J18, I25, M10, J44 Z20, K21, Z34N39, R31I25, B18, J44, N40, M10, J18
Sep.Z20, E78, I11, B18, K21, Z34N39J18, M10 Z20, Z34, K21, K59, G47, R42N39, R31I25, B18, J19, N40, J44, M10
Oct.E11, Z20, I11, K21, Z34, J45N39, R10, N13R50, J18, M10, N18, J20, J01, J06 Z20, Z34, G47N39, N20, R31R50, N18, J18, Z20, I25, B18, U07, J20, M10, N40
Nov.I10, Z34, J12, Z20, P59N39J18, M10, J45, J20 Z34, K21, R42, Z20, G47N39, R31, R80J18, I25, J44, B18, N40
Dec.E11, I10, I11, K21, P59, J45, Z34, J20, J30N39, R10, N20, J03J18, M10 Z34, G47, K59, C50, R42N39, R31, R35I25, J18, J44, M10, N40, I20
SAME CODEI10, I11, K21, Z34N39M10Z20J00K21, G47N39M10, N40Z20J00
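The "SAME CODE" row in Table 16 collects the codes that recur in a given class month after month. One plausible reading of that row (an assumption about the procedure, not stated explicitly in the paper) is a set intersection over the monthly code lists of a class, sketched here on a three-month excerpt of the indigenous Class A column:

```python
# Class A codes of indigenous patients for three months (excerpt of Table 16).
monthly = {
    "Jan": {"I10", "I11", "B18", "K21", "Z34"},
    "Mar": {"E11", "I10", "I11", "J18", "J01", "K21", "Z34"},
    "May": {"E11", "E78", "I10", "I11", "Z34", "I25", "B18", "N18", "K21"},
}

# Codes present in every listed month: a candidate "SAME CODE" entry.
same = set.intersection(*monthly.values())
print(sorted(same))  # ['I10', 'I11', 'K21', 'Z34']
```

For these months the intersection matches the published SAME CODE entry for indigenous Class A (I10, I11, K21, Z34).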
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Yang, W.-C.; Lai, J.-P.; Liu, Y.-H.; Lin, Y.-L.; Hou, H.-P.; Pai, P.-F. Using Medical Data and Clustering Techniques for a Smart Healthcare System. Electronics 2024, 13, 140. https://doi.org/10.3390/electronics13010140

