Article

Scalable Clustering of Complex ECG Health Data: Big Data Clustering Analysis with UMAP and HDBSCAN

by Vladislav Kaverinskiy 1, Illya Chaikovsky 1, Anton Mnevets 2, Tatiana Ryzhenko 1, Mykhailo Bocharov 3 and Kyrylo Malakhov 1,*
1 Glushkov Institute of Cybernetics of the National Academy of Sciences of Ukraine, 03187 Kyiv, Ukraine
2 Department of Electronic Engineering, Igor Sikorsky Kyiv Polytechnic Institute, 03056 Kyiv, Ukraine
3 Department of Moral and Psychological Support of the Activity of the Troops (Forces), National Defense University of Ukraine Named After Ivan Cherniakhovskyi, 03049 Kyiv, Ukraine
* Author to whom correspondence should be addressed.
Computation 2025, 13(6), 144; https://doi.org/10.3390/computation13060144
Submission received: 8 May 2025 / Revised: 25 May 2025 / Accepted: 5 June 2025 / Published: 10 June 2025
(This article belongs to the Special Issue Artificial Intelligence Applications in Public Health: 2nd Edition)

Abstract

This study explores the potential of unsupervised machine learning algorithms to identify latent cardiac risk profiles by analyzing ECG-derived parameters from two general groups: clinically healthy individuals (Norm dataset, n = 14,863) and patients hospitalized with heart failure (patients’ dataset, n = 8220). Each dataset includes 153 ECG and heart rate variability (HRV) features, including both conventional and novel diagnostic parameters obtained using a Universal Scoring System. The study aims to apply unsupervised clustering algorithms to ECG data to detect latent risk profiles related to heart failure, based on distinctive ECG features. The focus is on identifying patterns that correlate with cardiac health risks, potentially aiding in early detection and personalized care. We applied a combination of Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction and Hierarchical Density-Based Spatial Clustering (HDBSCAN) for unsupervised clustering. Models trained on one dataset were applied to the other to explore structural differences and detect latent predispositions to cardiac disorders. Both Euclidean and Manhattan distance metrics were evaluated. Features such as the QRS angle in the frontal plane, Detrended Fluctuation Analysis (DFA), High-Frequency power (HF), and others were analyzed for their ability to distinguish different patient clusters. In the Norm dataset, Euclidean distance clustering identified two main clusters, with Cluster 0 indicating a lower risk of heart failure. Key discriminative features included the “ALPHA QRS ANGLE IN THE FRONTAL PLANE” and DFA. In the patients’ dataset, three clusters emerged, with Cluster 1 identified as potentially high-risk. Manhattan distance clustering provided additional insights, highlighting features like “ST DISLOCATION” and “T AMP NORMALIZED” as significant for distinguishing between clusters. 
The analysis revealed distinct clusters corresponding to varying levels of heart failure risk. In the Norm dataset, two main clusters were identified, one associated with a lower risk profile. In the patients’ dataset, a three-cluster structure emerged, with one subgroup displaying markedly elevated risk indicators such as high-frequency power (HF) and altered QRS angle values. Cross-dataset clustering confirmed consistent feature shifts between the groups. These findings demonstrate the feasibility of ECG-based unsupervised clustering for early risk stratification and offer a non-invasive tool for personalized cardiac monitoring that merits further clinical validation. Future research should aim to validate these results in other populations and integrate these methods into clinical decision-making frameworks.

1. Introduction

The advancement of diagnostic methods, especially instrumental ones (i.e., methods of functional diagnostics), primarily entails a steadily increasing ability to detect ever subtler changes in the function examined by a given method. Such opportunities emerge from progress in the technical tools that measure a given function and, even more, from the development of information technologies; in other words, from the creation of new metrics: numerical parameters with which one can assess aspects of the functioning of various human organs and systems that were previously inaccessible. As a result, firstly, new ways of improving the diagnostic accuracy of a given method within its traditional application scenarios are discovered and, secondly, familiar methods find unconventional uses in new areas.
In this context, an innovative technology for analyzing subtle changes in the ECG was developed, aiming to make any electrocardiogram informative [1]. Such advanced ECG analysis may be in high demand, especially in population health studies dealing with large datasets.
The only way to increase the diagnostic value of ECG examination is the development of proper information technology (IT)—a combination of up-to-date methods and equipment bound in a chain that provides collection, storage, pre-processing, interpretation, conclusion, and dissemination of information [2,3,4,5].
It is true that routine ECG analysis is based on the presence of certain ECG syndromes or phenomena defined within one of the existing ECG analysis algorithms. However, in the majority of cases, no ECG syndrome can be identified during the analysis of an individual electrocardiogram, at least not one that clearly reflects cardiac pathology, i.e., belongs to the “major” category according to, for example, the Minnesota coding system. During routine analysis, one is forced to assign a single class to all these electrocardiograms: electrocardiograms with no major ECG syndrome identified. However, the question arises: are all these electrocardiograms the same in terms of their relative “distance” from the “ideal” electrocardiogram of a healthy human? Obviously, they are not. This “distance” can be greater or smaller depending on the myocardial condition; moreover, there is a reasonable hypothesis that this “distance” reflects the likelihood of serious cardiovascular events occurring. This is where routine analysis of an electrocardiogram is uninformative.
That is why the Universal Scoring System method and software for ECG scaling that can provide the quantitative evaluation of the slightest changes in ECG signal were developed [6].
This approach is based on, first of all, measuring the maximum number of ECG parameters and heart rate variability and, secondly, on positioning each parameter value on a scale between the absolute norm and extreme pathology. In fact, the suggested approach follows a popular Z-scoring ideology, when quantitative, usually point-based assessment of test results is determined via using a special scale containing data about intra-group test results variation.
On the other hand, clustering methods, including k-means, hierarchical clustering, and density-based methods like DBSCAN, have found extensive applications in the medical field. They support diagnostic processes by identifying patient subgroups, predicting disease progression, and segmenting medical images.
For example, clustering techniques have been applied in Alzheimer’s research, particularly to analyze MRI data, cerebrospinal fluid biomarkers, and other clinical features, to differentiate patients likely to progress from mild cognitive impairment to Alzheimer’s disease [7,8]. Such approaches help to classify patients based on subtle distinctions in disease patterns, aiding early diagnosis and targeted treatments. In another example [9,10], Parkinson’s disease research has employed clustering for patient stratification, revealing patterns that improve diagnosis and the understanding of disease heterogeneity.
Clustering also plays a crucial role in image analysis in medicine. Techniques like DBSCAN, which is effective for data with noise and varying densities, can be used in MRI or CT image segmentation, identifying tumor boundaries, and assisting radiologists in spotting abnormalities. This method helps cluster similar regions within images, essential for the accurate and efficient analysis of large datasets.
The novelty of this study lies in the combination of UMAP-based dimensionality reduction and HDBSCAN clustering applied to real-world, large-scale ECG datasets from both healthy individuals and heart failure patients. Unlike many previous approaches that rely on labeled training data, our unsupervised method allows for the discovery of latent risk groups and subtypes without prior diagnostic categorization. One of the key aspects of our methodology is a cross-dataset clustering analysis. This approach involves training a clustering model on one dataset (e.g., healthy individuals) and applying it to another, independent dataset (e.g., heart failure patients), and vice versa. This not only enhances the scalability and adaptability of ECG analysis but also opens the door for identifying preclinical or atypical cardiac conditions based solely on physiological signal patterns.
The current study aims to find clusters in datasets formed from electrocardiogram (ECG) analysis based on the Universal Scoring System approach, primarily to identify parameter values that could be helpful for early diagnosis and even for detecting a predisposition to further heart failure development. The results obtained in this study are not intended to serve as definitive diagnostic criteria but rather offer statistically grounded insights into the distribution and structure of ECG-derived parameters. These findings may assist medical professionals in identifying potential risk patterns and should be further examined and validated in dedicated clinical studies.

2. Related Works

Syndromic ECG analysis is a cardiology diagnostic approach focused on identifying characteristic ECG patterns—such as changes in waveforms, intervals, and rhythms—associated with specific cardiac syndromes, rather than isolated abnormalities. Examples include Brugada Syndrome, indicated by distinctive ST-segment elevation in leads V1–V3; Long QT Syndrome, identified by QT interval prolongation linked to arrhythmias [11]; Wolff-Parkinson-White (WPW) Syndrome, characterized by a short PR interval and delta wave suggesting accessory conduction pathways [12]; and Myocardial Ischemia or Infarction Syndromes, marked by changes in ST-segments and T-waves signaling cardiac injury or ischemia [13]. This approach aids rapid diagnosis, identifies high-risk patients, and facilitates tailored treatments.
Clustering techniques are widely applied in medicine to improve diagnostics, patient segmentation, and personalized treatment. By grouping patients with similar characteristics, clustering enables healthcare professionals to tailor medical interventions based on specific subgroups. For example, clustering can help identify subtypes within diseases such as diabetes or cancer, enhancing targeted treatments. Additionally, it can optimize resource allocation by identifying healthcare needs in specific regions or among certain patient populations [14,15].
Popular clustering algorithms in medical applications include K-Means, hierarchical clustering, and DBSCAN. K-Means is commonly used for its efficiency in partitioning data into K clusters. At the same time, hierarchical clustering provides a tree-like structure, making it useful for visualizing relationships at various levels. DBSCAN, a density-based method, excels at identifying clusters of arbitrary shapes, which is valuable in cases where data is complex and noisy, such as in genetics research [16].
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that preserves both local and global structure in data. UMAP is particularly effective in applications where visualization of complex, high-dimensional data is necessary, as seen in studies of healthcare data where visual clustering can highlight significant patterns in patients’ datasets or physiological signals like ECG data. UMAP’s efficacy for embedding high-dimensional data is supported by its theoretical foundation in Riemannian geometry and algebraic topology, making it popular in biomedical applications and image processing, where it facilitates the creation of meaningful low-dimensional representations [17].
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a well-regarded algorithm for density-based clustering. It is particularly known for its ability to identify clusters of arbitrary shapes within data with noise. DBSCAN defines clusters based on regions of high density, requiring two parameters: Eps (neighborhood radius) and MinPts (minimum number of points within the neighborhood). When a data point meets these density criteria, it forms part of a cluster; otherwise, it may be considered noise. A key advantage of DBSCAN is its robustness against noise and its flexibility in discovering clusters without a predefined shape. However, it can struggle with varying-density data since a single Eps value often does not fit all clusters well [18].
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) extends DBSCAN by introducing a hierarchical approach to overcome DBSCAN’s limitations in handling variable densities [19]. HDBSCAN leverages a technique that hierarchically builds clusters and subsequently condenses them based on stability measures. This allows it to adapt better to diverse density regions, producing more meaningful clusters across datasets with varying densities. In practical terms, HDBSCAN is a more dynamic clustering algorithm, especially useful in applications like anomaly detection and data visualization where clustering precision is essential across different density regions.
For clustering, HDBSCAN was chosen due to its robustness in managing variable-density data, which is critical in biomedical datasets where noise and outliers are common. Unlike DBSCAN, HDBSCAN does not require a fixed density threshold, instead deriving clusters based on local density variations, which enables it to separate clusters with differing densities efficiently. HDBSCAN has shown excellent performance in tasks that demand noise-resilient clustering, such as in ECG and EEG signal analysis, as it can identify subtle cluster boundaries without overfitting noise points. The method leverages a hierarchical clustering approach using mutual reachability distances, allowing clusters to form naturally based on data density rather than arbitrary parameters.
This combination of UMAP for dimensionality reduction and HDBSCAN for density-based clustering has shown promising results in applications requiring fine-grained grouping in high-noise environments, such as in the analysis of complex medical datasets where patient stratification and the detection of rare phenotypes are critical. These tools are validated by prior research and have been integrated into our study to optimize the interpretability and accuracy of clustering in the selected dataset [20].
The field of medical diagnostics has increasingly leveraged machine learning and clustering techniques to analyze complex biomedical data. In particular, clustering methods applied to cardiovascular and psychological health data have shown promising results for early diagnosis and personalized treatment approaches. The authors of [21] explored the clustering of arterial oscillogram (AO) data, focusing on ultra-low-frequency (ULF) indicators correlated with depression levels. By employing UMAP (Uniform Manifold Approximation and Projection) for dimensionality reduction, they identified two clusters representing different severity levels of depression, subsequently developing a high-accuracy classification system based on ULF metrics and products of correlated parameters, achieving accuracy rates up to 97% using nearest neighbor classifiers.
In another work [22], a distributed classification model for real-time healthcare data processing was proposed, using a Random Forest approach integrated with binary classification modules. The model demonstrated high accuracy in classifying multiple levels of depression severity based on arterial pulsation data, achieving effective parallel processing benchmarks for real-time applications [23,24]. These methods align with a broader trend in healthcare data analysis, where dimensionality reduction techniques like UMAP enhance visualization and feature extraction from high-dimensional medical data. Moreover, the use of HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), a density-based clustering technique, enables the handling of data with varying densities, which is common in medical datasets [19].
Ontological approaches further complement machine learning in medical dialogue systems. An ontology-driven framework for rehabilitation support in physical medicine has been introduced, underscoring the importance of adaptive systems that cater to evolving medical knowledge [25]. The MedRehabBot system [26] utilizes neural networks combined with an ontological paradigm to enhance interactions within telerehabilitation services, addressing critical needs in Ukraine’s military and civilian rehabilitation programs [26,27,28,29,30,31,32]. Additional studies by Malakhov and Kaverinskiy highlight the application of automatic ontology generation for natural language dialogue systems, which enhances contextual understanding in systems that support complex inflectional languages [33].

3. Materials and Methods

For the investigation, two datasets were utilized, both obtained from ECG transcripts. The first dataset consists of 14,863 data rows and represents clinically healthy people without any heart disorders (code name: Norm dataset). The second consists of 8220 data rows and represents patients hospitalized in a cardiology clinic due to heart failure (code name: patients’ dataset). Each row corresponds to one person. The Norm dataset includes both men and women across a wide range of ages, from children to the elderly. The patients’ dataset also includes men and women, but only adults and the elderly (from 40 to 93 years).
The datasets analyzed contained 153 features each. To facilitate a clear, visual cluster analysis, dimensionality reduction was required. For this purpose, we applied the UMAP method, using its Python implementation (version 0.5.8). Our approach involved tuning UMAP and HDBSCAN (version 0.8.1) hyperparameters with the primary objective of generating distinct, consolidated clusters that were visually separable while minimizing the number of unclustered points by HDBSCAN (ideally eliminating them). It was also preferable to avoid the formation of numerous small clusters, and instead to produce 2–3 well-defined clusters in the two-dimensional space of the UMAP embeddings, each covering a comparable area and containing a substantial number of data points.
Computational Environment. All data pre-processing, dimensionality reduction, clustering, and statistical tests reported in this study were carried out inside an instance of the Research and Development Workstation Environment (RDWE) running on the Hybrid Cloud Environment for Telerehabilitation (HCET) [26] maintained by the Microprocessor Technology Lab, V.M. Glushkov Institute of Cybernetics. The HCET is a three-layer hybrid cloud platform (IaaS → PaaS → SaaS) hosted on an on-premises HP ProLiant DL380p Gen8 cluster virtualized with KVM/LibVirt and accessed through a WireGuard VPN. The RDWE instance provides a reproducible Python 3.11 stack (NumPy 2.x, SciPy 1.13, scikit-learn 1.5, umap-learn 0.5.5, hdbscan 0.8.33, cuML 25.02, Matplotlib 3.9) orchestrated in JupyterLab; all optimization routines (Powell, Nelder–Mead) and figure generation were executed here. The scripts used for these computations are openly archived on Zenodo.
To minimize unclustered points, we optimized an objective function using the optimization routines available in the SciPy package. The Powell and Nelder–Mead methods were utilized. Other hyperparameters were tuned manually, relying on visual assessment. Basic statistical analyses, including calculations of maximum, minimum, mean values, and standard deviations, were performed for each of the resulting clusters. These cluster characteristics were then compared to identify potential differences that could facilitate class differentiation. The values of the hyperparameters tuned for each of the studied cases are presented in Table 1.
To assess the generalizability and structural divergence between populations, a cross-dataset clustering strategy was employed. Specifically, clustering models (UMAP + HDBSCAN) trained on one dataset (e.g., Norm) were applied to the other dataset (e.g., Patients), and vice versa. This approach allowed us to explore how patterns identified in one group are reflected in another, despite differences in health status and data distribution. Such cross-application serves not only as a form of indirect validation but also reveals latent predispositions that may not be apparent in within-group analysis. Since the study is unsupervised, no explicit training/validation/test split was used; instead, the generalization capability was assessed via model transfer across datasets with distinct population characteristics. This cross-dataset analysis revealed similar clustering patterns but often displayed data redistribution across clusters. The datasets represented both healthy individuals and those with diagnosed heart disorders. Differences between clusters, considering variations in characteristic features, suggested that certain clusters might represent individuals at different levels of susceptibility or stages of the disorder. Specifically, some clusters may indicate individuals more predisposed to the disorder (including those in an early latent stage), while others may correspond to more advanced stages, showing little overlap with clusters derived from the healthy cohort.
Based on these findings, we infer that certain feature levels or combinations could serve as indicators of early or advancing stages of a heart disorder. This research lays the groundwork for developing a diagnostic classification system for early heart disorder detection, which could be valuable in preventive healthcare.
It is important to note that the authors of this study are computer scientists, not medical professionals. The results presented here reflect statistical and data science insights. The interpretation of these findings and the determination of their clinical relevance require expertise from medical professionals. This research aims to provide them with potentially valuable insights.

4. General Statistics of the Initial Datasets

For each of the initial datasets, Norm and Patients, fundamental statistical parameters were determined to gain insights into feature levels and ranges, enabling dataset comparison.
A general statistics table displaying the minimum, maximum, mean, and standard deviation values for features across the datasets is available online at: https://zenodo.org/records/15373634 (accessed on 9 June 2025).
Notably, certain features show no deviation or even remain consistently zero across all cases in both datasets. These features include:
AMP AREAS INDEX lead AVR
DURATION RATIO TP TE JT
DURATION RATIO TP TE JTA
AVR LEAD CODE
These features can be excluded from further consideration without impact, and they were automatically removed during data preparation for the subsequent clustering analysis.
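The automatic removal of such constant features during data preparation can be sketched with scikit-learn's `VarianceThreshold`; the columns below are synthetic placeholders, not the actual ECG features.

```python
# Sketch of dropping constant (zero-variance) features during preparation;
# the columns are illustrative placeholders for the 153 ECG features.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=200),   # informative feature
    np.zeros(200),          # constant zero, e.g. a consistently-zero feature
    np.full(200, 5.0),      # constant non-zero feature
    rng.normal(size=200),   # informative feature
])

selector = VarianceThreshold(threshold=0.0)  # drop features with no variation
X_reduced = selector.fit_transform(X)
kept = selector.get_support()

print(X.shape[1], X_reduced.shape[1], kept.tolist())
```

`get_support()` returns the boolean mask of retained columns, which is convenient for mapping reduced columns back to feature names.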
Applying a formal t-test to compare means reveals that most features display significantly different mean values between the datasets. However, some features did not meet this criterion (excluding those listed above). These features include:
FRACTAL INDEX
VL
ST T FORM INDICATOR lead II
ST DISLOCATION lead III
Q R RATIO lead I
R S RATIO lead I
R P RATIO lead III
R P RATIO lead AVF
R T RATIO lead AVF
GLOBAL P DURATION
MYOCARDIAL INDEX OF STATIONARITY
SIGN OF HEART FAILURE
SELVESTER CODE
While these features did not show significant differences in means, excluding them from cluster analysis is unnecessary since they may show variation within clusters. However, it is essential to note the absence of statistically significant differences in these features’ mean values across datasets.
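The per-feature mean comparison above can be sketched with a two-sample t-test from SciPy. The paper does not state which variant was applied, so Welch's version (no equal-variance assumption) is shown; the data below are synthetic stand-ins, not the study's values.

```python
# Hedged sketch of the formal t-test comparing a feature's mean between the
# Norm and Patients cohorts (Welch's variant; an assumption of this sketch).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
norm_feature = rng.normal(1.00, 0.15, 14863)      # synthetic Norm values
patients_feature = rng.normal(1.20, 0.15, 8220)   # synthetic Patients values

stat, p_value = ttest_ind(norm_feature, patients_feature, equal_var=False)
significant = p_value < 0.05
print(round(float(stat), 2), significant)
```

With sample sizes this large, even small mean shifts become significant, which is exactly why the text supplements the t-test with variance-overlap and ratio-based criteria.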
Although most features show different mean values due to the large sample sizes and narrower confidence intervals, many have broad variances and standard deviations. A stringent test was conducted to identify features with non-overlapping standard deviations across datasets. Only one feature, DFA, met this criterion, exhibiting a higher level in the patients’ dataset, which could indicate its potential as a marker associated with heart disorders.
Another measure of mean difference significance, ignoring variance and focusing on raw values, was used as defined by Equation (1):
k = 2|M1 − M2| / (M1 + M2)    (1)
where M1 and M2 are the mean values of the compared groups.
This ratio reflects the absolute difference relative to the average mean, indicating a potential shift in distribution mode. A critical value of k > 1 was used, signifying that M1/M2 > 3 or M2/M1 > 3. Table 2 presents features meeting this threshold.
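Equation (1) is straightforward to implement directly; the example values below use the ALPHA QRS ANGLE means quoted in the surrounding text, and the boundary equivalence (k > 1 exactly when one positive mean exceeds the other threefold) can be checked numerically.

```python
# Direct implementation of Equation (1). For positive means, k > 1 is
# equivalent to M1/M2 > 3 or M2/M1 > 3.
def k_ratio(m1: float, m2: float) -> float:
    """Relative mean difference: k = 2|M1 - M2| / (M1 + M2)."""
    return 2.0 * abs(m1 - m2) / (m1 + m2)

print(k_ratio(3.0, 1.0))   # exactly at the boundary: k = 1.0
print(k_ratio(11.1, 3.1))  # the ALPHA QRS ANGLE means; k exceeds 1
```

Note that the equivalence to a threefold ratio holds only when both means are positive; for features with means of mixed sign (such as angles), the raw k value should be interpreted with care.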
Despite similar variation ranges and overlapping standard deviations, many features show substantial mean differences. For example, the “ALPHA QRS ANGLE IN THE FRONTAL PLANE” ranges from −179.49 to 179.56 in the PATIENTS’ dataset and −179.76 to 180.0 in Norm, yet the mean values are 3.1 and 11.1, respectively. Both means are shifted positively, but this shift is more pronounced in the Norm dataset.
In summary, there appears to be a slight tendency for higher values of features in the left column of Table 2 to be more typical of healthy individuals, while elevated values in the right column may serve as markers of heart disorders.

5. Results

5.1. Data Clustering Analysis in Low-Dimensional Space

5.1.1. Euclidean Metric

The Euclidean metric was initially selected for dimensionality reduction using UMAP and clustering with HDBSCAN. The Norm dataset served as the training set for the UMAP + HDBSCAN clustering/classification model.
Figure 1 presents a comparison between clusters derived from the Norm dataset (training dataset) and subsequent classification predictions on the patients’ dataset. The predictions are based on clusters established during the training phase, now treated as distinct classes. The UMAP dimensionality reduction was performed using the same pre-trained UMAP reducer.
In the Norm dataset, two primary clusters were identified, each approximately equal in size. These could potentially be further subdivided, but this may add complexity without providing meaningful insights. When projecting the Norm-trained model onto the patients’ dataset, cluster 0 (the right-side cluster) remains recognizable, with parts of it intact. In contrast, cluster 1 (the left-side cluster) is sparsely represented in patients, and one data point is misclassified as “cluster” −1, likely an outlier situated between the clusters.
These findings suggest that cluster 0 in Norm, which is largely absent in patients, may correspond to healthier individuals with a lower risk of developing heart disorders. Conversely, members of cluster 1 could represent a potential risk group.
The average age and men-to-women ratio in the clusters were the following:
in cluster 0: 50.8 years and 42.8/57.2%;
in cluster 1: 51.1 years and 44.4/55.6%.
Thus, no significant imbalance between the clusters is observed in either age or sex; both correspond to the mean values in the Norm dataset.
A reverse experiment was conducted, where the model was trained on the PATIENTS’ dataset and projected onto the Norm dataset. The clustering results for this experiment are displayed in Figure 2.
The clustering configuration appears similar in both cases, with three distinct clusters and a few outliers. Projecting the PATIENTS-trained model onto Norm reveals some structural changes: cluster 1, which contains fewer points in PATIENTS, almost disappears in Norm, with only a few points remaining. Cluster 2 becomes more sparse but retains a large size, while Cluster 0 shows increased population density and a closer proximity to Cluster 2, with some misclassified points between them.
These observations suggest that cluster 1 in PATIENTS may represent the most severe cases, as corresponding feature values are rarely present in the healthier Norm dataset. Cluster 2 likely represents individuals with a moderate risk of heart disorders.
The average age and men-to-women ratio in the clusters were the following:
in cluster 0: 69.2 years and 61.7/38.3%;
in cluster 1: 75.4 years and 73.6/26.4%;
in cluster 2: 70.4 years and 64.4/35.6%.
It can be noticed that cluster 1 appears somewhat older and more male. Nevertheless, the major clusters 0 and 2 have ratios rather close to those in the PATIENTS’ dataset. Even in cluster 1, the minimum age was 40, which is also the minimum for the PATIENTS’ dataset.

5.1.2. Manhattan Metric

A similar study was performed using the Manhattan metric, known for slower growth and more consolidated cluster formation. Figure 3 compares the clusters obtained by training the model on the Norm dataset with the model’s projection onto the PATIENTS’ dataset.
In this case, three clusters were identified in Norm (two large and one smaller). The PATIENTS prediction retains the structure of cluster 1 (the smallest cluster in Norm), indicating similar data characteristics in both datasets. However, cluster 0 appears as a small island, and cluster 2 is present in both cases, though its bottom section seems to diverge in the PATIENTS’ dataset. This suggests that cluster 1 may represent a higher-risk group, given its persistence across both datasets, while cluster 0 likely corresponds to a healthier population with lower heart disorder risk. Cluster 2 may represent individuals with moderate risk but less severe than those in cluster 1.
The average age and men-to-women ratio in the clusters were the following:
in cluster 0: 51.1 years and 44.3/55.7%;
in cluster 1: 57.0 years and 46.1/53.9%;
in cluster 2: 49.6 years and 42.2/57.8%.
So, we do not see any significant age or sex disproportion between the clusters, especially the major ones (0 and 2); the ratios are close to those in the Norm dataset.
The final experiment involved training the model on the PATIENTS’ dataset and then projecting it onto the Norm dataset, using the Manhattan metric. The clustering results are illustrated in Figure 4.
The results reveal similar trends to those observed in Figure 2 with the Euclidean metric. Specifically, cluster 0 (“end of the tail”) nearly vanishes, and the boundaries between clusters 3 and 1 become blurred. Cluster 3 becomes more dispersed, while Cluster 1 gains additional points. A new, smaller cluster 2 also appears, though its significance is unclear. This cluster’s consistency across both datasets suggests it may represent individuals with a specific type of heart disorder risk, though this hypothesis warrants further investigation by subject matter experts.
The average age and men-to-women ratio in the clusters were the following:
in cluster 0: 76.1 years and 76.0/24.0%;
in cluster 1: 69.1 years and 61.5/38.5%;
in cluster 2: 76.0 years and 73.3/26.7%;
in cluster 3: 70.4 years and 64.2/35.8%.
As we can observe, clusters 0 and 2 contain a larger proportion of older men than the major clusters 1 and 3, where the age and sex ratios are closer to the averages for the PATIENTS’ dataset. Although the mean values are shifted in some clusters, all of them cover almost the entire age range present in the dataset.
Overall, this clustering analysis highlights distinctions in the dataset structures, potentially correlating certain cluster configurations with health status and heart disorder risk factors across both Norm and PATIENTS’ datasets.

5.2. Overall Analysis of Data Clustering

5.2.1. Data Distribution Between the Clusters

Table 3 provides a detailed summary of the data distribution across clusters for each scenario analyzed above. Points not assigned to any cluster (labeled −1, i.e., HDBSCAN noise points) have been excluded from this estimation due to their minimal count.
The table indicates that substantial data redistribution occurs between main clusters during prediction. In the first scenario (Euclidean metric, Norm dataset for training), initial clusters are relatively balanced. However, in the prediction for the PATIENTS’ dataset, a significant shift occurs, with cluster 0 accounting for nearly 87% of data points. When the model is trained on the PATIENTS’ dataset and projected to Norm, cluster 0 originally contains only about 23% of data points, with clusters 1 and 2 comprising the remaining 77%. Conversely, in the Norm prediction, cluster 0 contains approximately 80% of data points, suggesting a healthier profile for its members, while cluster 2 appears associated with less favorable health outcomes. Cluster 1 could potentially represent a high-risk group.
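The redistribution described above is simply the per-cluster share of points in the training labels versus the projected labels. A sketch with hypothetical label arrays (in practice these would come from HDBSCAN and its projection onto the second dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical cluster labels for the training set and the projected set;
# -1 marks unassigned (noise) points, excluded as in Table 3.
labels_train = np.array([0, 0, 1, 1, 2, -1, 0, 1])
labels_proj  = np.array([0, 0, 0, 0, 0, 2, 0, -1])

def cluster_shares(labels):
    """Percentage of points per cluster, ignoring the noise label -1."""
    s = pd.Series(labels)
    s = s[s != -1]
    return (s.value_counts(normalize=True)
             .sort_index()
             .mul(100).round(1))

print(cluster_shares(labels_train))
print(cluster_shares(labels_proj))
```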
The Manhattan metric reveals a similar trend. For the Norm to PATIENTS projection, an additional, smaller cluster 1 emerges. This cluster may represent a risk group, as it consistently appears across both datasets, yet its health status remains ambiguous. Cluster 2 may indicate individuals with a predisposition to cardiovascular issues, despite currently being relatively healthy. In the final scenario, clusters 0, 3, and possibly 2 are more prominent in the PATIENTS’ dataset and less so in the Norm, suggesting these clusters may correspond to less favorable health profiles.

5.2.2. Differences in Feature Values Between the Clusters

The next question concerns differences in feature values between the clusters. With this information, we can draw tentative diagnostic conclusions about which combinations of elevated and decreased parameters might indicate the presence of heart failure or at least a tendency toward its development, or, conversely, reveal people less prone to such disorders. All four cases considered above are examined in this section. The first and simplest is the model trained on the Norm dataset using the Euclidean metric, which yields only two clusters. The criterion given by Equation (1) was used to identify features with significant differences in mean values. The parameters with the largest differences between clusters are listed in Table 4.
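Equation (1) itself is not reproduced in this excerpt. As an illustration only, a common criterion of this kind flags a feature when the difference of cluster means is large relative to the pooled standard deviation; the function below is such a sketch, with a placeholder threshold, and is not necessarily the paper’s exact formula.

```python
import numpy as np

def mean_diff_significant(x0, x1, threshold=1.0):
    """Flag a feature whose cluster means differ by more than `threshold`
    pooled standard deviations (illustrative criterion, not necessarily
    the paper's Equation (1))."""
    x0, x1 = np.asarray(x0, float), np.asarray(x1, float)
    pooled_sd = np.sqrt((x0.var(ddof=1) + x1.var(ddof=1)) / 2.0)
    return abs(x0.mean() - x1.mean()) > threshold * pooled_sd

# Example: a clearly shifted feature versus identical samples.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 500)
b = rng.normal(3.0, 1.0, 500)
print(mean_diff_significant(a, b))
print(mean_diff_significant(a, a))
```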
Only the features “ALPHA QRS ANGLE IN THE FRONTAL PLANE” and DFA from this list have greater values in cluster 0; the others are greater in cluster 1.
Thus, an elevated DFA in the Norm dataset might, in most cases, be evidence of a disinclination toward heart failure. It should be noted that both DFA and “ALPHA QRS ANGLE IN THE FRONTAL PLANE” can take positive and negative values in both datasets. The latter is quite close to 0 in cluster 1 but tends toward positive values in cluster 0, which could also be perceived as a lower risk of heart failure development. Many parameters have negative mean values in both clusters; however, in cluster 1 their absolute values are much greater. Elevated absolute values of these features, especially in combination, might be treated as a signal of predisposition to heart disorders even in a person who is currently healthy. This, like the statements below, is of course not a definitive conclusion or diagnosis, but merely information that medical professionals may consider, accepting or rejecting it according to their experience and the results of subsequent specific studies.
The next case is clustering of the PATIENTS’ dataset using the Euclidean metric, which yields three clusters. Let us start with the features that differ sufficiently across all three clusters: only HF and “ALPHA QRS ANGLE IN THE FRONTAL PLANE”. Cluster 1 has an extremely high average HF value of 28,024 and collects the highest values of this parameter. In clusters 0 and 2 the HF values are less extreme but still distinct, with mean values of 291 and 991, respectively. The “ALPHA QRS ANGLE IN THE FRONTAL PLANE”, which can take positive or negative values, behaves differently: in cluster 2 its average value is 0.64, which, given its variation range (±180), can be treated as rather close to 0; in cluster 0 positive values prevail, and in cluster 1 negative ones.
Next, consider the features whose difference is distinctive for only one pair of clusters. There is only one: “Q R RATIO lead I”, whose average value is rather low in cluster 0 and in cluster 2 (0.095 and 0.129, respectively) but rises to 0.301 in cluster 1. The difference between clusters 1 and 2 is also notable but does not reach the set threshold.
A few features distinguish clusters 0 and 2. Besides the already mentioned HF and “ALPHA QRS ANGLE IN THE FRONTAL PLANE”, these are:
ST DISLOCATION lead AVL
Q AMP lead II
S AMP lead II
Q R RATIO lead II
Q AMP lead III
Q AMP lead AVR
R AMP lead AVR
S AMP lead AVR
Q R RATIO lead AVR
R T RATIO lead AVR
R S RATIO lead AVR
Q AMP lead AVF
Q R RATIO lead AVF
If we adopt the hypothesis that cluster 2 is more predisposed to heart failure than cluster 0 (according to the ratio of their members in the Norm and PATIENTS’ datasets), we should note which feature values are raised or lowered in each of them. The parameter “ST DISLOCATION lead AVL” can take either positive or negative values: positive values are typical for cluster 0, negative ones for cluster 2. Notably, in cluster 1, which may include the most severe cases, these values are even more negative. The features “Q AMP lead II”, “S AMP lead II”, “Q AMP lead III”, “Q AMP lead AVR”, “S AMP lead AVR”, and “Q AMP lead AVF” have negative values. Among them, only “S AMP lead AVR” has a greater absolute value in cluster 0, and the difference is substantial (−555.9 versus −33.6); the other parameters are greater in cluster 2. The features “Q R RATIO lead II”, “Q R RATIO lead III”, “Q R RATIO lead AVR”, “R S RATIO lead AVR”, and “Q R RATIO lead AVF” have positive values, and all of their mean values are greater in cluster 2.
Thus, cluster 2, which might be more predisposed to heart failure than cluster 0, is typified by:
raised positive values of HF, “Q R RATIO lead I”, “Q R RATIO lead II”, “Q R RATIO lead III”, “Q R RATIO lead AVR”, “R S RATIO lead AVR”, and “Q R RATIO lead”;
lower negative values of “Q AMP lead II”, “S AMP lead II”, “Q AMP lead III”, “Q AMP lead AVR”, and “Q AMP lead AVF”;
raised (but negative) value of “S AMP lead AVR”;
close to the 0 value of “ALPHA QRS ANGLE IN THE FRONTAL PLANE”.
Cluster 1 stands somewhat apart. It looks like a “tail” of cluster 2 and probably contains more severe cases of heart disorder, since it is almost absent from the classification performed on the Norm dataset, which is assumed to consist of healthy people. It makes more sense to outline the features that distinguish this cluster from cluster 2, of which it forms the “tail”; many features differ between clusters 0 and 1 simply because these clusters are not even close. The following features distinguish cluster 1 from cluster 2, excluding the already mentioned HF and “ALPHA QRS ANGLE IN THE FRONTAL PLANE”, which are distinctive for all three clusters:
SDNN
RMSSD
PNN50
SDSD
IAE
TOTAL POWER
ACTIVITY OF VASOMOTOR CENTERS
VLF
LF
ST DISLOCATION lead I
J40 AMP lead I
This list is entirely different from the one distinguishing clusters 0 and 2. We have already noted that the HF parameter in cluster 1 has an extremely high value. The same applies to the VLF and LF features, whose mean values for this cluster are, respectively, 31,857 and 89,970, compared with 151.3 and 26.3, and 274.5 and 774.9, in the other clusters. TOTAL POWER is also extremely raised in cluster 1, with an average of 149,788 compared to 716.9 and 1995.3 in the other clusters. Less extreme, but still significantly raised in cluster 1, are the mean values of SDNN, RMSSD, PNN50, and SDSD. On the other hand, the parameters IAE and ACTIVITY OF VASOMOTOR CENTERS are considerably lower in cluster 1. The features “ST DISLOCATION lead I” and “J40 AMP lead I” can take negative or positive values: negative values are more likely in cluster 1, positive ones in cluster 2. The “ALPHA QRS ANGLE IN THE FRONTAL PLANE” tends to be negative in cluster 1.
Now let us consider what the use of the Manhattan metric can add to the results of the study. Its application to the Norm dataset yields three clusters. Many parameters have different average values in each of the clusters; these parameters and their average values are presented in Table 5.
All of these features already appeared in Table 4, the analogous clustering case with the Euclidean metric. Thus, involving the Manhattan metric here does not reveal any new distinctive features, but it does recognize three separate clusters. For all the listed features, raised values are observed in cluster 0. This is the cluster that almost disappears in the prediction on the PATIENTS’ dataset, so increased values of these factors may not be typical for people with heart failure. This result agrees well with the Euclidean case (where the analogous cluster was called cluster 1; it also had raised values of these parameters and almost vanished in the prediction on the PATIENTS’ dataset). This case also contains another rather small cluster, cluster 1, which has the smallest mean values of all the listed parameters. However, such people appear in PATIENTS as well. Thus, it may be conjectured that members of cluster 1 could also have a propensity to some heart disorder, but of a different type than those in cluster 2.
The last case to be considered is clustering of the PATIENTS’ dataset using the Manhattan metric. Here the primary interest is cluster 0, the “end of the tail” of cluster 3. It is a true analog of cluster 1 from the similar case with the Euclidean metric: it also has extremely raised values of TOTAL POWER, HF, VLF, and LF. The small cluster 2 might be treated simply as part of cluster 3; it also has increased HF, VLF, and LF, though not as much as cluster 0, to which many of its values are quite close. Nevertheless, it is too small to support serious conclusions about its differences.

6. Discussion

A key methodological innovation of this work is the cross-dataset clustering strategy, in which a model trained on one population (e.g., Norm) is applied to a different one (e.g., Patients), and vice versa. This allows for the identification of structural mismatches and hidden subgroup patterns without the need for explicit supervision or diagnostic labels. Such an approach supports the idea of latent phenotype discovery and has potential applications not only in cardiology but also in broader areas of predictive and preventive medicine.
We also emphasize that, in addition to cluster-level visualization and mean-based comparisons of selected features, the full statistical comparison framework employed in this study includes t-tests, standard deviation analysis, and a relative mode-shift ratio metric designed to detect distributional shifts even in the presence of large within-group variances. Due to their volume, these results have been omitted from the article but are available in the open repositories and upon request.
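As a sketch of the per-feature statistical comparison mentioned here: the snippet below applies Welch's t-test (which does not assume equal variances) to a single feature across two clusters. The data are synthetic stand-ins, and the study's mode-shift ratio metric is not reproduced.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
feature_cluster_a = rng.normal(300.0, 50.0, 400)    # e.g., HF values in one cluster
feature_cluster_b = rng.normal(1000.0, 200.0, 400)  # e.g., HF values in another

# Welch's t-test: equal_var=False avoids the equal-variance assumption,
# appropriate when cluster spreads differ strongly, as for HF here.
t_stat, p_value = stats.ttest_ind(feature_cluster_a, feature_cluster_b,
                                  equal_var=False)
print(f"t = {t_stat:.1f}, p = {p_value:.3g}")
```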
The heterogeneity of both examined groups is noteworthy. While this result was expected for the Patients group, it was somewhat surprising for the Norm group. The presence of a cluster in the Norm group similar to the main cluster in the Patients group emphasizes the heterogeneity of this group, which apparently contains a subgroup of individuals at high risk of developing heart disease in the near future. The fine structure of the patient group is also quite complex, which can be used to determine the prognosis of disease outcome and the effectiveness of therapy. These hypotheses will be tested in future studies. We also note the composition of the list of ECG parameters with the greatest separating ability: it includes basic amplitude-time parameters of the ECG, more complex parameters describing the shape of ECG elements, and some HRV parameters. This confirms our earlier opinion on the usefulness of analyzing the most multilateral matrix of ECG and HRV parameters [34].
The comparative analysis of feature values across clusters, utilizing both Euclidean and Manhattan metrics, has provided meaningful insights into potential indicators of heart failure or predisposition to related conditions. The study highlighted distinct clusters within both the Norm and PATIENTS’ datasets, revealing specific patterns in key features that may reflect varying levels of cardiac risk.
In the Norm dataset, the clustering analysis using the Euclidean metric identified two clusters with marked differences in features like “ALPHA QRS ANGLE IN THE FRONTAL PLANE” and DFA (Detrended Fluctuation Analysis). Cluster 0, characterized by higher values of DFA and “ALPHA QRS ANGLE,” is associated with a decreased likelihood of heart failure. Conversely, features with raised absolute values in Cluster 1 may suggest an elevated risk, as this group often displayed higher magnitudes of certain ECG parameters. Although such clustering does not provide a definitive diagnosis, these identified patterns can assist medical professionals in isolating individuals at potential risk for further monitoring or preventive intervention.
Extending the clustering analysis to the PATIENTS’ dataset with Euclidean distance, a three-cluster model emerged, distinguishing features that showed substantial variation across clusters. Notably, “HF” (High-Frequency power) and “ALPHA QRS ANGLE IN THE FRONTAL PLANE” were the most discriminative features. Cluster 1 exhibited an extreme HF value, potentially indicating a high cardiac risk profile, while other clusters demonstrated more moderate values. This suggests that variations in HF and related ECG parameters may serve as early indicators of adverse cardiac events, with Cluster 1 potentially capturing the most severe cases.
The application of the Manhattan metric on the Norm dataset further refined this analysis, yielding three distinct clusters with unique feature patterns. Table 5 (provided) illustrates the clustering results under this metric, emphasizing parameters like “ST DISLOCATION” and “T AMP NORMALIZED” across multiple leads. These distinctions reinforce the hypothesis that certain ECG measurements, when significantly deviated, correlate with varying cardiac health risks. This observation underscores the potential value of employing multiple metrics in clustering algorithms, as the Manhattan distance identified additional distinctive features that were less apparent under Euclidean distance.
The clustering structure observed in both datasets suggests several clinically relevant findings. First, the appearance of extreme HF values and distinctive dislocation in parameters such as “ST DISLOCATION” and “T AMP NORMALIZED” may signify critical deviations in cardiac function, warranting further exploration. Second, the observed differences in the “ALPHA QRS ANGLE IN THE FRONTAL PLANE” across clusters, particularly its closeness to zero in Cluster 2, may indicate a lower cardiovascular risk when combined with other typical parameters. Third, the consistency of certain features across different clustering metrics and datasets supports their relevance as potential markers for early detection of cardiac issues.
The obtained results and approaches could be extended beyond medicine to other domains, such as materials science. Applying such techniques may improve, and bring to a qualitatively new level, models like those presented in [21,35,36]. Moreover, a current trend in software engineering is reusing software components when building new projects [37,38,39,40,41]. Although the programs developed in this study are mainly exploratory, the mathematical models and software modules created here could be adapted with minimal changes to serve as parts of other systems. This highlights the practical importance of the research conducted in this study.
Future research should aim to validate the observed clusters against clinical outcomes through longitudinal studies. Moreover, integration with real-time ECG monitoring systems and expert annotation pipelines may allow for the development of predictive models that track the transition of individuals between risk clusters over time. The clustering methodology could also be extended to other physiological signals or multimodal data to support broader diagnostic frameworks in cardiology and preventive medicine.

7. Conclusions

This study’s clustering approach demonstrates that ECG parameters, when analyzed through unsupervised methods, can reveal latent structures associated with cardiac health risks. By identifying clusters that correlate with feature patterns typical of heart failure predispositions, this research offers a foundation for further investigations into non-invasive diagnostic methods and personalized medicine. Although these clusters do not serve as definitive diagnostic criteria, they provide a preliminary stratification that could be useful in routine screening for cardiac anomalies. Future research should focus on validating these findings with larger, more diverse datasets and exploring the integration of clustering outcomes with clinical decision-making systems to enhance early diagnosis and prevention in cardiology.
To increase the efficiency of research on the application of unsupervised clustering algorithms to ECG data, we plan to study the applicability of models and methods of transdisciplinary research and hardware support for these algorithms [42,43,44,45].

Author Contributions

Conceptualization, V.K., I.C., and K.M.; methodology, V.K.; software, V.K. and K.M.; validation, K.M. and I.C.; formal analysis, V.K.; investigation, V.K.; resources, I.C. and T.R.; data curation, I.C. and V.K.; writing—original draft preparation, V.K.; writing—review and editing, K.M., I.C., T.R., A.M., and M.B.; visualization, V.K.; supervision, I.C.; project administration, I.C. and M.B.; funding acquisition, I.C. and M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the National Research Foundation of Ukraine (grant No. 2023.04/0094) under the project “Development of Technology for Objective Monitoring of Functional Capabilities and Stress of Military Personnel Based on Miniature Electrocardiographs and Machine Learning”. State registration No. 0125U002047; details available at https://nrat.ukrintei.ua/searchdoc/0125U002047 (accessed on 9 June 2025). This research was also conducted as part of the scientific and technical project “Develop Means of Supporting Virtualization Technologies and Their Use in Computer Engineering and Other Applications”, funded by the National Academy of Sciences of Ukraine. State registration No. 0124U001826; details are available at https://nrat.ukrintei.ua/en/searchdoc/0124U001826 (accessed on 9 June 2025). Both projects are being carried out at the V. M. Glushkov Institute of Cybernetics of the National Academy of Sciences of Ukraine, Kyiv, Ukraine.

Data Availability Statement

The Python scripts and Jupyter notebooks required to reproduce all pre-processing, dimensional-reduction, clustering, and statistical analyses are openly archived on Zenodo: https://zenodo.org/records/15373634 (accessed on 9 June 2025). The underlying ECG dataset contains sensitive health information and cannot be placed in a public repository. De-identified data can be made available to qualified researchers upon reasonable request, contingent on approval by the V.M. Glushkov Institute of Cybernetics Ethics Committee and completion of a standard data-use agreement. Requests for access should be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ECG: Electrocardiographic
DFA: Detrended Fluctuation Analysis
WPW: Wolff-Parkinson-White Syndrome
UMAP: Uniform Manifold Approximation and Projection
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
HDBSCAN: Hierarchical Density-Based Spatial Clustering of Applications with Noise
RDWE: Research and Development Workstation Environment
HCET: Hybrid Cloud Environment for Telerehabilitation

References

  1. Chaikovsky, I. Electrocardiogram scoring beyond the routine analysis: Subtle changes matters. Expert Rev. Med. Devices 2020, 17, 379–382. [Google Scholar] [CrossRef] [PubMed]
  2. Oke, O.A.; Cavus, N. A Systematic Review on the Impact of Artificial Intelligence on Electrocardiograms in Cardiology. Int. J. Med. Inform. 2025, 195, 105753. [Google Scholar] [CrossRef]
  3. Liu, Z.-Y.; Lin, C.-H.; Hsu, Y.-C.; Chen, J.-S.; Chang, P.-C.; Wen, M.-S.; Kuo, C.-F. Universal Representations in Cardiovascular ECG Assessment: A Self-Supervised Learning Approach. Int. J. Med. Inform. 2025, 195, 105742. [Google Scholar] [CrossRef]
  4. Mondal, A.; Manikandan, M.S.; Pachori, R.B. Automatic ECG Signal Quality Assessment Using Convolutional Neural Networks and Derivative ECG Signal for False Alarm Reduction in Wearable Vital Signs Monitoring Devices. Biomed. Signal Process. Control 2025, 108, 107876. [Google Scholar] [CrossRef]
  5. Kuetche, F.; Alexendre, N.; Pascal, N.E.; Thierry, S. Simple, Efficient, and Generalized ECG Signal Quality Assessment Method for Telemedicine Applications. Inform. Med. Unlocked 2023, 42, 101375. [Google Scholar] [CrossRef]
  6. Chaikovsky, I.; Starynska, G.; Budnyk, M. Method of ECG Evaluating Based on Universal Scoring System. US10512412B2, 24 December 2020. [Google Scholar]
  7. Escudero, J.; Ifeachor, E.; Zajicek, J.P.; Green, C.; Shearer, J.; Pearson, S. Machine Learning-Based Method for Personalized and Cost-Effective Detection of Alzheimer’s Disease. IEEE Trans. Biomed. Eng. 2013, 60, 164–168. [Google Scholar] [CrossRef]
  8. Tosto, G.; Bird, T.D.; Tsuang, D.; Bennett, D.A.; Boeve, B.F.; Cruchaga, C.; Faber, K.; Foroud, T.M.; Farlow, M.; Goate, A.M.; et al. Polygenic Risk Scores in Familial Alzheimer Disease. Neurology 2017, 88, 1180–1186. [Google Scholar] [CrossRef]
  9. Fereshtehnejad, S.M.; Zeighami, Y.; Dagher, A.; Postuma, R.B. Clinical Criteria for Subtyping Parkinson’s Disease: Biomarkers and Longitudinal Progression. Brain 2017, 140, 1959–1976. [Google Scholar] [CrossRef]
  10. Krasniqi, E.; Schramm, W.; Reichenbach, A. Data-Driven Stratification of Parkinson’s Disease Patients Based on the Progression of Motor and Cognitive Disease Markers. Ger. Med. Sci. 2021, 17, 4. [Google Scholar] [CrossRef]
  11. Goldenberg, I.; Moss, A.J. Long QT Syndrome. J. Am. Coll. Cardiol. 2008, 51, 2291–2300. [Google Scholar] [CrossRef]
  12. Chhabra, L.; Goyal, A.; Benham, M.D. Wolff-Parkinson-White Syndrome. In StatPearls. Available online: https://pubmed.ncbi.nlm.nih.gov/32119324/ (accessed on 7 August 2023).
  13. Thygesen, K.; Alpert, J.S.; White, H.D.; Jaffe, A.S.; Apple, F.S.; Galvani, M.; Katus, H.A.; Newby, L.K.; Ravkilde, J.; Chaitman, B.; et al. Task Force for the Redefinition of Myocardial Infarction. Eur. Heart J. 2007, 28, 2525–2538. [Google Scholar] [CrossRef] [PubMed]
  14. Xu, R.; Wunsch, D. Survey of Clustering Algorithms. IEEE Trans. Neural Netw. 2005, 16, 645–678. [Google Scholar] [CrossRef] [PubMed]
  15. Esteva, A.; Robicquet, A.; Ramsundar, B.; Kuleshov, V.; DePristo, M.; Chou, K.; Cui, C.; Corrado, G.; Thrun, S. A Guide to Deep Learning in Healthcare. Nat. Med. 2019, 25, 24–29. [Google Scholar] [CrossRef] [PubMed]
  16. Murali, L.; Gopakumar, G.; Viswanathan, D.M. Towards Electronic Health Record-Based Medical Knowledge Graph Construction, Completion, and Applications: A Literature Study. J. Biomed. Inform. 2023, 143, 104403. [Google Scholar] [CrossRef]
  17. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2020. [CrossRef]
  18. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD’96), Portland, OR, USA, 2–4 August 1996; pp. 226–231. Available online: https://file.biolab.si/papers/1996-DBSCAN-KDD.pdf (accessed on 9 June 2025).
  19. Campello, R.J.G.B.; Moulavi, D.; Zimek, A.; Sander, J. Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM Trans. Knowl. Discov. Data 2013, 10, 1–51. [Google Scholar] [CrossRef]
  20. Blanco-Portals, J.; Peiró, F.; Estradé, S. Strategies for EELS Data Analysis: Introducing UMAP and HDBSCAN for Dimensionality Reduction and Clustering. Microsc. Microanal. 2022, 28, 109–122. [Google Scholar] [CrossRef]
  21. Kaverinsky, V.; Vakulenko, D.; Vakulenko, L.; Malakhov, K. Machine Learning Analysis of Arterial Oscillograms for Depression Level Diagnosis in Cardiovascular Health. Complex Syst. Inform. Model. Q. 2024, 40, 94–110. [Google Scholar] [CrossRef]
  22. Reznichenko, S.; Whitaker, J.; Ni, Z.; Zhou, S. Comparing ECG Lead Subsets for Heart Arrhythmia/ECG Pattern Classification: Convolutional Neural Networks and Random Forest. CJC Open 2025, 7, 176–186. [Google Scholar] [CrossRef]
  23. Lanerolle, G.D.; Roberts, E.S.; Haroon, A.; Shetty, A. Chapter 7—Neuropsychiatry and Mental Health. In Quality Assurance Management; Lanerolle, G.D., Roberts, E.S., Haroon, A., Shetty, A., Eds.; Academic Press: Cambridge, MA, USA, 2024; pp. 131–240. [Google Scholar]
  24. Zabihi, F.; Safara, F.; Ahadzadeh, B. An Electrocardiogram Signal Classification Using a Hybrid Machine Learning and Deep Learning Approach. Healthc. Anal. 2024, 6, 100366. [Google Scholar] [CrossRef]
  25. Cascianelli, S.; Masseroli, M. Biological and Medical Ontologies: Introduction. In Encyclopedia of Bioinformatics and Computational Biology; Elsevier: Amsterdam, The Netherlands, 2025; pp. 380–391. ISBN 978-0-323-95503-4. [Google Scholar]
  26. Malakhov, K.S. Innovative Hybrid Cloud Solutions for Physical Medicine and Telerehabilitation Research. Int. J. Telerehabil. 2024, 16, e6635. [Google Scholar] [CrossRef] [PubMed]
  27. Romanchuk, O.; Polianska, O.; Polianskyi, I.; Yasinska, O. Telerehabilitation: Current Opportunities And Problems of Remote Patient Monitoring. Neonatol. Hìr. Perinat. Med. 2024, 14, 183–190. [Google Scholar] [CrossRef]
  28. Vakulenko, D.; Vakulenko, L. Information System Telerehabilitation: Needs, Tasks and Way Optimisation with AI. In Arterial Oscillography: NewCapabilities of the Blood Pressure Monitor with the Oranta-AO Information System; Vakulenko, D., Vakulenko, L., Eds.; Nova Science Publishers: Hauppauge, NY, USA, 2024; pp. 681–707. [Google Scholar]
  29. Vladymyrov, O.A.; Semykopna, T.V.; Vakulenko, D.V.; Syvak, O.V.; Budnyk, M.M. Telerehabilitation Guidelines for Patients with Breast Cancer. Int. J. Telerehabil. 2024, 1–76. [Google Scholar] [CrossRef]
  30. Vakulenko, D.V.; Palagin, O.V.; Sergienko, I.V.; Stetsyuk, P.I. Algorithmization and Optimization Models of Patient-Centric Rehabilitation Programs. Cybern. Syst. Anal. 2024, 60, 736–752. [Google Scholar] [CrossRef]
  31. Vakulenko, D.V.; Vakulenko, L.; Zaspa, H.; Lupenko, S.; Stetsyuk, P.; Stovba, V. Components of Oranta-AO Software Expert System for Innovative Application of Blood Pressure Monitors. J. Reliab. Intell Environ. 2023, 9, 41–56. [Google Scholar] [CrossRef]
  32. Khaustova, O.; Chaban, O.; Sak, L. Indicators of Somatized PTSD in Ukrainian Active Military Personnel Undergoing Rehabilitation after TBI Treatment. Neurosci. Appl. 2024, 3, 105356. [Google Scholar] [CrossRef]
  33. Kaverinsky, V.V.; Malakhov, K.S. Natural Language-Driven Dialogue Systems for Support in Physical Medicine and Rehabilitation. S. Afr. Comput. J. 2023, 35, 119–126. [Google Scholar] [CrossRef]
  34. Chaikovsky, I.; Dziuba, D.; Kryvova, O.; Marushko, K.; Vakulenko, J.; Malakhov, K.; Loskutov, O. Subtle changes on electrocardiogram in severe patients with COVID-19 may be predictors of treatment outcome. Front. Artif. Intell. 2025, 8, 1561079. [Google Scholar] [CrossRef]
  35. Tu, X.; Qin, T.; Ji, X.; Wang, Z.; Chen, J.; Zhang, Z.; Wang, Z.; Wang, W.; Qin, Y.; Zhou, J. DBSCAN Clustering Model for Parameter Inversion Using Laser Cutting Edge Morphology Characteristic in Zr-4 Alloy. Opt. Laser Technol. 2025, 184, 112461. [Google Scholar] [CrossRef]
  36. Zhang, L.; Deng, H. NJmat 2.0: User Instructions of Data-Driven Machine Learning Interface for Materials Science. Comput. Mater. Contin. 2025, 83, 1–11. [Google Scholar] [CrossRef]
Figure 1. Data clustering using UMAP + HDBSCAN, with models trained on Norm using the Euclidean metric: (a) clusters found in Norm dataset; (b) clusters predicted for patients’ dataset.
Figure 2. Data clustering using UMAP + HDBSCAN, with models trained on PATIENTS using the Euclidean metric: (a) clusters found in PATIENTS’ dataset; (b) clusters predicted for Norm dataset.
Figure 3. Data clustering using UMAP + HDBSCAN, with models trained on Norm using the Manhattan metric: (a) clusters found in Norm dataset; (b) clusters predicted for PATIENTS’ dataset.
Figure 4. Data clustering using UMAP + HDBSCAN, with models trained on PATIENTS using the Manhattan metric: (a) clusters found in PATIENTS’ dataset; (b) clusters predicted for Norm dataset.
Table 1. The models’ hyperparameter values.

| Training Dataset | Metric | UMAP: N Neighbors | UMAP: Min. Dist. | HDBSCAN: Min. Cluster Size | HDBSCAN: Leaf Size |
|---|---|---|---|---|---|
| Norm | Euclidean | 20 | 0.15 | 1821 | 337 |
| Patients | Euclidean | 7 | 0.08 | 504 | 35 |
| Norm | Manhattan | 50 | 0.05 | 1821 | 337 |
| Patients | Manhattan | 15 | 0.07 | 40 | 25 |
Table 2. Features with a significant mode difference in the datasets.

| Greater in Norm: Feature Name | Means Ratio M1/M2 | Greater in Patients: Feature Name | Means Ratio M2/M1 |
|---|---|---|---|
| T AMP NORMALIZED lead I | 55.47173 | VLF | 495.061 |
| T AMP NORMALIZED lead II | 29.00111 | LF | 157.83 |
| T AMP NORMALIZED lead AVL | 27.855 | DFA | 53.1785 |
| T AMP NORMALIZED lead AVF | 14.00366 | SYNDROMIC ECG ANALYSIS | 51.0873 |
| J40 AMP lead I | 10.84904 | TOTAL POWER | 38.6041 |
| ST DISLOCATION lead I | 10.80562 | ST DISLOCATION lead AVL | 30.5705 |
| S AMP lead AVR | 9.075496 | Q R RATIO lead AVR | 22.4853 |
| T AMP NORMALIZED lead III | 7.578434 | Q AMP lead AVR | 19.0826 |
| R T RATIO lead AVR | 5.269837 | HF | 9.83694 |
| Q AMP lead I | 4.956455 | ACTIVITY OF VASOMOTOR CENTERS | 9.46108 |
| P AMP lead III | 4.227044 | Q R RATIO lead III | 6.34422 |
| P AMP lead AVF | 4.057967 | Q R RATIO lead AVF | 5.73973 |
| P AMP lead II | 3.900239 | LF HF | 4.99509 |
| T AMP lead AVF | 3.724387 | LFn | 3.60021 |
| T AMP lead II | 3.630603 | SDSD | 3.58783 |
| ALPHA QRS ANGLE IN THE FRONTAL PLANE | 3.578768 | Q R RATIO lead II | 3.54654 |
| S AMP lead AVL | 3.512038 | T SYMMETRY AREAS OF TRIANGLES lead III | 3.02674 |
| P AMP lead I | 3.413082 | | |
| P AREA lead I | 3.283533 | | |
| R AMP lead AVF | 3.210154 | | |
| T AMP lead III | 3.065699 | | |
| T AMP lead I | 3.04724 | | |
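The ratio columns in Table 2 can be derived from per-feature dataset means. A minimal pandas sketch, assuming each dataset is a DataFrame with one column per ECG/HRV feature; the function name `mode_ratio_table` and the cutoff of 3 are illustrative assumptions, not the authors’ code:

```python
import pandas as pd

def mode_ratio_table(norm: pd.DataFrame, patients: pd.DataFrame,
                     threshold: float = 3.0) -> pd.DataFrame:
    """List features whose means differ by more than `threshold` times
    between the two datasets, in either direction."""
    m1 = norm.mean()       # M1: per-feature means over the Norm dataset
    m2 = patients.mean()   # M2: per-feature means over the patients' dataset
    ratio = m1 / m2
    rows = []
    for feature in norm.columns:
        r = ratio[feature]
        if r >= threshold:                 # greater in Norm
            rows.append((feature, "Norm", r))
        elif 0 < r <= 1 / threshold:       # greater in Patients (skip sign flips)
            rows.append((feature, "Patients", 1 / r))
    out = pd.DataFrame(rows, columns=["feature", "greater_in", "means_ratio"])
    return out.sort_values("means_ratio", ascending=False, ignore_index=True)
```

The guard `0 < r` matters for signed features such as the Q- and S-wave amplitudes, where a raw mean ratio can be negative and meaningless as a magnitude comparison.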
Table 3. The data distribution between the clusters, %.

Euclidean metric, training on the Norm dataset:

| Cluster | Initial Clusters in Norm | Clusters Predicted for PATIENTS |
|---|---|---|
| Cluster 0 | 56.982 | 86.994 |
| Cluster 1 | 43.013 | 13.006 |

Euclidean metric, training on the PATIENTS’ dataset:

| Cluster | Initial Clusters in PATIENTS | Clusters Predicted for Norm |
|---|---|---|
| Cluster 0 | 22.961 | 80.020 |
| Cluster 1 | 9.946 | 0.1279 |
| Cluster 2 | 67.093 | 19.852 |

Manhattan metric, training on the Norm dataset:

| Cluster | Initial Clusters in Norm | Clusters Predicted for PATIENTS |
|---|---|---|
| Cluster 0 | 43.040 | 7.483 |
| Cluster 1 | 9.110 | 4.319 |
| Cluster 2 | 47.850 | 88.198 |

Manhattan metric, training on the PATIENTS’ dataset:

| Cluster | Initial Clusters in PATIENTS | Clusters Predicted for Norm |
|---|---|---|
| Cluster 0 | 7.506 | 0.047 |
| Cluster 1 | 22.652 | 78.182 |
| Cluster 2 | 2.257 | 0.895 |
| Cluster 3 | 67.384 | 20.876 |
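The percentages in Table 3 are simple tallies over cluster label arrays such as those HDBSCAN returns (noise points carry label −1, which is why some columns sum to slightly under 100%). A sketch, with `cluster_distribution` as an illustrative helper name:

```python
import numpy as np

def cluster_distribution(labels) -> dict:
    """Percentage of samples per cluster label; HDBSCAN noise is label -1."""
    labels = np.asarray(labels)
    total = labels.size
    return {int(c): 100.0 * int((labels == c).sum()) / total
            for c in np.unique(labels)}
```

Applying it to both the training labels and the labels predicted for the other dataset gives the paired columns of Table 3.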
Table 4. The most distinguishing features between the clusters obtained on the Norm dataset using the Euclidean metric.

| Feature Name | Mean in Cluster 0 | Mean in Cluster 1 |
|---|---|---|
| DFA | 0.047492 | 0.000383 |
| ALPHA QRS ANGLE IN THE FRONTAL PLANE | 19.62716 | −0.20719 |
| S AMP lead AVF | −50.573 | −158.372 |
| ST DISLOCATION lead I | 16.63566 | 57.37463 |
| J40 AMP lead I | 16.63566 | 57.37463 |
| ST DISLOCATION lead II | 19.67757 | 73.12592 |
| T AMP NORMALIZED lead AVF | 9.16226 | 37.29108 |
| R AMP lead III | 150.8576 | 622.8024 |
| R AMP lead AVF | 272.4523 | 1138.58 |
| ST DISLOCATION lead AVL | 11.01877 | 46.47051 |
| R AMP lead II | 434.6667 | 1944.364 |
| T AMP NORMALIZED lead AVL | 5.575907 | 24.95848 |
| ST DISLOCATION lead AVF | 11.37391 | 51.72439 |
| R AMP lead I | 333.44 | 1530.54 |
| T AMP NORMALIZED lead II | 12.68615 | 59.17196 |
| R AMP lead AVL | 144.3115 | 716.4888 |
| S AMP lead III | −105.133 | −540.407 |
| Q AMP lead AVL | −15.881 | −83.5429 |
| P AMP lead III | 43.19233 | 232.7455 |
| T AMP NORMALIZED lead I | 9.767729 | 55.07774 |
| S AMP lead AVR | −466.623 | −2753.23 |
| P AMP lead AVF | 55.76281 | 330.5697 |
| T AMP lead III | 64.73436 | 397.7896 |
| P AMP lead II | 72.24876 | 450.5428 |
| Q AMP lead AVR | −4.62043 | −28.8315 |
| T AREA lead I | 16.5706 | 106.0358 |
| T AMP lead AVF | 128.7192 | 831.6197 |
| T AMP lead II | 208.29 | 1364.565 |
| P AMP lead AVR | 0.967887 | 6.608791 |
| T AMP lead I | 163.3263 | 1116.124 |
| T AMP lead AVL | 69.50201 | 494.6684 |
| P AMP lead I | 44.17025 | 314.9606 |
| Q AMP lead III | −19.1682 | −138.423 |
| Q AMP lead I | −23.761 | −173.752 |
| Q AMP lead AVF | −21.907 | −162.325 |
| P AREA lead I | 2.039669 | 15.25809 |
| P AMP lead AVL | 19.5013 | 147.8871 |
| ST DISLOCATION lead III | 4.187131 | 33.4458 |
| QRS AREA lead I | 13.50508 | 112.1638 |
| Q AMP lead II | −26.8857 | −232.533 |
| R AMP lead AVR | 34.83518 | 332.9271 |
| T AMP lead AVR | 0.469894 | 5.185672 |
Table 5. The most distinguishing features between all three clusters obtained on the Norm dataset using the Manhattan metric.

| Feature Name | Mean in Cluster 0 | Mean in Cluster 1 | Mean in Cluster 2 |
|---|---|---|---|
| ST DISLOCATION lead I | 57.50899 | 3.440177 | 19.00408 |
| T AMP NORMALIZED lead I | 55.0737 | 0.903335 | 11.4335 |
| ST DISLOCATION lead II | 73.12709 | 5.511817 | 22.34336 |
| T AMP NORMALIZED lead II | 59.41575 | 1.066753 | 14.65286 |
| ST DISLOCATION lead AVL | 46.61763 | 1.395126 | 12.69868 |
| T AMP NORMALIZED lead AVL | 24.9708 | 0.591232 | 6.502916 |
| ST DISLOCATION lead AVF | 51.68204 | 3.73486 | 12.84364 |
| T AMP NORMALIZED lead AVF | 37.31894 | 1.110758 | 10.65424 |
| P AMP lead I | 315.2873 | 4.234121 | 51.32719 |
| R AMP lead I | 1530.143 | 119.3567 | 373.882 |
| T AMP lead I | 1116.888 | 34.76588 | 186.5794 |
| P AREA lead I | 15.27466 | 0.250369 | 2.357987 |
| QRS AREA lead I | 112.2364 | 4.344904 | 15.12823 |
| T AREA lead I | 106.1024 | 3.462334 | 18.95599 |
| J40 AMP lead I | 57.50899 | 3.440177 | 19.00408 |
| P AMP lead II | 451.1748 | 8.621123 | 83.58113 |
| R AMP lead II | 1942.194 | 135.209 | 492.7809 |
| T AMP lead II | 1364.989 | 50.32053 | 237.3328 |
| P AMP lead III | 233.1263 | 4.707533 | 50.07002 |
| T AMP lead III | 397.6053 | 13.7969 | 74.41043 |
| P AMP lead AVR | 6.574175 | 0.3161 | 1.119938 |
| R AMP lead AVR | 333.5463 | 9.764402 | 38.88358 |
| T AMP lead AVR | 5.19134 | 0.159527 | 0.521232 |
| P AMP lead AVL | 147.9354 | 1.105613 | 22.8878 |
| T AMP lead AVL | 495.1102 | 9.745199 | 80.24213 |
| P AMP lead AVF | 331.081 | 6.519202 | 64.52348 |
| R AMP lead AVF | 1138.847 | 85.77474 | 307.2656 |
| T AMP lead AVF | 831.7527 | 27.50222 | 147.4743 |