Scalable Clustering of Complex ECG Health Data: Big Data Clustering Analysis with UMAP and HDBSCAN

Kaverinskiy, Vladislav; Chaikovsky, Illya; Mnevets, Anton; Ryzhenko, Tatiana; Bocharov, Mykhailo; Malakhov, Kyrylo

doi:10.3390/computation13060144

Open AccessArticle

Scalable Clustering of Complex ECG Health Data: Big Data Clustering Analysis with UMAP and HDBSCAN

by

Vladislav Kaverinskiy

¹

,

Illya Chaikovsky

¹,

Anton Mnevets

²

,

Tatiana Ryzhenko

¹

,

Mykhailo Bocharov

³

and

Kyrylo Malakhov

^1,*

¹

Glushkov Institute of Cybernetics of the National Academy of Sciences of Ukraine, 03187 Kyiv, Ukraine

²

Department of Electronic Engineering, Igor Sikorsky Kyiv Polytechnic Institute, 03056 Kyiv, Ukraine

³

Department of Moral and Psychological Support of the Activity of the Troops (Forces), National Defense University of Ukraine Named After Ivan Cherniakhovskyi, 03049 Kyiv, Ukraine

^*

Author to whom correspondence should be addressed.

Computation 2025, 13(6), 144; https://doi.org/10.3390/computation13060144

Submission received: 8 May 2025 / Revised: 25 May 2025 / Accepted: 5 June 2025 / Published: 10 June 2025

(This article belongs to the Special Issue Artificial Intelligence Applications in Public Health: 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

This study explores the potential of unsupervised machine learning algorithms to identify latent cardiac risk profiles by analyzing ECG-derived parameters from two general groups: clinically healthy individuals (Norm dataset, n = 14,863) and patients hospitalized with heart failure (patients’ dataset, n = 8220). Each dataset includes 153 ECG and heart rate variability (HRV) features, including both conventional and novel diagnostic parameters obtained using a Universal Scoring System. The study aims to apply unsupervised clustering algorithms to ECG data to detect latent risk profiles related to heart failure, based on distinctive ECG features. The focus is on identifying patterns that correlate with cardiac health risks, potentially aiding in early detection and personalized care. We applied a combination of Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction and Hierarchical Density-Based Spatial Clustering (HDBSCAN) for unsupervised clustering. Models trained on one dataset were applied to the other to explore structural differences and detect latent predispositions to cardiac disorders. Both Euclidean and Manhattan distance metrics were evaluated. Features such as the QRS angle in the frontal plane, Detrended Fluctuation Analysis (DFA), High-Frequency power (HF), and others were analyzed for their ability to distinguish different patient clusters. In the Norm dataset, Euclidean distance clustering identified two main clusters, with Cluster 0 indicating a lower risk of heart failure. Key discriminative features included the “ALPHA QRS ANGLE IN THE FRONTAL PLANE” and DFA. In the patients’ dataset, three clusters emerged, with Cluster 1 identified as potentially high-risk. Manhattan distance clustering provided additional insights, highlighting features like “ST DISLOCATION” and “T AMP NORMALIZED” as significant for distinguishing between clusters. The analysis revealed distinct clusters that correspond to varying levels of heart failure risk. In the Norm dataset, two main clusters were identified, with one associated with a lower risk profile. In the patients’ dataset, a three-cluster structure emerged, with one subgroup displaying markedly elevated risk indicators such as high-frequency power (HF) and altered QRS angle values. Cross-dataset clustering confirmed consistent feature shifts between groups. These findings demonstrate the feasibility of ECG-based unsupervised clustering for early risk stratification. The results offer a non-invasive tool for personalized cardiac monitoring and merit further clinical validation. These findings emphasize the potential for clustering techniques to contribute to early heart failure detection and personalized monitoring. Future research should aim to validate these results in other populations and integrate these methods into clinical decision-making frameworks.

Keywords:

clustering analysis; data science; machine learning; bioinformatics; UMAP; HDBSCAN; heart failure; ECG; subtle changes

1. Introduction

The advancement of diagnostic methods, especially instrumental ones (i.e., methods of functional diagnostics) primarily entails a constant increase in the ability to detect subtler and subtler changes in the function examined by one method or another. Such opportunities emerge due to progress in technical measurement tools of a certain function and even more due to the development of informational technologies; in other words, due to the creation of new metrics—numerical parameters using which one can assess the aspects of the functioning of various human organs and systems that were inaccessible before. As a result, firstly, new ways of improvement of the diagnostic accuracy of a certain method within its traditional application scenarios are discovered and, secondly, familiar methods find unconventional uses in new areas.

In this context, the innovative technology to analyze subtle changes in ECG was developed, aiming to make any electrocardiography informative [1].

In this context, the advanced analysis of ECG might be highly demanded, especially in population health studies, dealing with large datasets.

The only way to increase the diagnostic value of ECG examination is the development of proper information technology (IT)—a combination of up-to-date methods and equipment bound in a chain that provides collection, storage, pre-processing, interpretation, conclusion, and dissemination of information [2,3,4,5].

It is true that routine ECG analysis is based on the presence of certain ECG syndromes or phenomena defined within one of the existing ECG analysis algorithms. However, in the majority of cases, no ECG syndrome can be identified during the analysis of an individual electrocardiogram, at least not one that clearly reflects cardiac pathology i.e., belongs to the “major” category according to the, for example, Minnesota coding system. During the routine analysis, one is forced to assign a single class to all these electrocardiograms—electrocardiograms with no major ECG syndrome identified. However, the question arises—are all these electrocardiograms the same in terms of their relative “distance” to the “ideal” electrocardiogram of a healthy human? Obviously, they are not. This “distance” can be further or closer depending on the myocardial condition; moreover, there is a reasonable hypothesis that this “distance” reflects the likelihood of serious cardiovascular events occurring. This is where routine analysis of an electrocardiogram is uninformative.

That is why the Universal Scoring System method and software for ECG scaling that can provide the quantitative evaluation of the slightest changes in ECG signal were developed [6].

This approach is based on, first of all, measuring the maximum number of ECG parameters and heart rate variability and, secondly, on positioning each parameter value on a scale between the absolute norm and extreme pathology. In fact, the suggested approach follows a popular Z-scoring ideology, when quantitative, usually point-based assessment of test results is determined via using a special scale containing data about intra-group test results variation.

On the other hand, clustering methods, including k-means, hierarchical clustering, and density-based methods like DBSCAN, have found extensive applications in the medical field. They support diagnostic processes by identifying patient subgroups, predicting disease progression, and segmenting medical images.

For example, clustering techniques have been applied in Alzheimer’s research, particularly to analyze MRI data, cerebrospinal fluid biomarkers, and other clinical features, to differentiate patients likely to progress from mild cognitive impairment to Alzheimer’s disease [7,8]. Such approaches help to classify patients based on subtle distinctions in disease patterns, aiding early diagnosis and targeted treatments. In another example [9,10], Parkinson’s disease research has employed clustering for patient stratification, revealing patterns that improve diagnosis and the understanding of disease heterogeneity.

Clustering also plays a crucial role in image analysis in medicine. Techniques like DBSCAN, which is effective for data with noise and varying densities, can be used in MRI or CT image segmentation, identifying tumor boundaries, and assisting radiologists in spotting abnormalities. This method helps cluster similar regions within images, essential for the accurate and efficient analysis of large datasets.

The novelty of this study lies in the combination of UMAP-based dimensionality reduction and HDBSCAN clustering applied to real-world, large-scale ECG datasets from both healthy individuals and heart failure patients. Unlike many previous approaches that rely on labeled training data, our unsupervised method allows for the discovery of latent risk groups and subtypes without prior diagnostic categorization. One of the key aspects of our methodology is a cross-dataset clustering analysis. This approach involves training a clustering model on one dataset (e.g., healthy individuals) and applying it to another, independent dataset (e.g., heart failure patients), and vice versa. This not only enhances the scalability and adaptability of ECG analysis but also opens the door for identifying preclinical or atypical cardiac conditions based solely on physiological signal patterns.

The current study is aimed at cluster findings in the datasets formed grounding on the electrocardiograms (ECG) analysis based on the Universal Scoring System approach primarily to find the parameter values which could be helpful for early diagnosis and even identification of predisposition to further heart failure development. The results obtained here do not pretend to be strong diagnostic criteria straight away. The results obtained in this study are not intended to serve as definitive diagnostic criteria but rather offer statistically grounded insights into the distribution and structure of ECG-derived parameters. These findings may assist medical professionals in identifying potential risk patterns and should be further examined and validated in dedicated clinical studies.

2. Related Works

Syndromic ECG analysis is a cardiology diagnostic approach focused on identifying characteristic ECG patterns—such as changes in waveforms, intervals, and rhythms—associated with specific cardiac syndromes, rather than isolated abnormalities. Examples include Brugada Syndrome, indicated by distinctive ST-segment elevation in leads V1–V3; Long QT Syndrome, identified by QT interval prolongation linked to arrhythmias [11]; Wolff-Parkinson-White (WPW) Syndrome, characterized by a short PR interval and delta wave suggesting accessory conduction pathways [12]; and Myocardial Ischemia or Infarction Syndromes, marked by changes in ST-segments and T-waves signaling cardiac injury or ischemia [13]. This approach aids rapid diagnosis, identifies high-risk patients, and facilitates tailored treatments.

Clustering techniques are widely applied in medicine to improve diagnostics, patient segmentation, and personalized treatment. By grouping patients with similar characteristics, clustering enables healthcare professionals to tailor medical interventions based on specific subgroups. For example, clustering can help identify subtypes within diseases such as diabetes or cancer, enhancing targeted treatments. Additionally, it can optimize resource allocation by identifying healthcare needs in specific regions or among certain patient populations [14,15].

Popular clustering algorithms in medical applications include K-Means, hierarchical clustering, and DBSCAN. K-Means is commonly used for its efficiency in partitioning data into K clusters. At the same time, hierarchical clustering provides a tree-like structure, making it useful for visualizing relationships at various levels. DBSCAN, a density-based method, excels at identifying clusters of arbitrary shapes, which is valuable in cases where data is complex and noisy, such as in genetics research [16].

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that preserves both local and global structure in data. UMAP is particularly effective in applications where visualization of complex, high-dimensional data is necessary, as seen in studies of healthcare data where visual clustering can highlight significant patterns in patients’ datasets or physiological signals like ECG data. UMAP’s efficacy for embedding high-dimensional data is supported by its theoretical foundation in Riemannian geometry and algebraic topology, making it popular in biomedical applications and image processing, where it facilitates the creation of meaningful low-dimensional representations [17].

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a well-regarded algorithm for density-based clustering. It is particularly known for its ability to identify clusters of arbitrary shapes within data with noise. DBSCAN defines clusters based on regions with high intensity, requiring two parameters: Eps (neighborhood radius) and MinPts (minimum number of points within the neighborhood). When a data point meets these density criteria, it forms part of a cluster; otherwise, it may be considered noise. A key advantage of DBSCAN is its robustness against noise and its flexibility in discovering clusters without a predefined shape. However, it can struggle with varying-density data since a single Eps value often does not fit all clusters well [18].

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) extends DBSCAN by introducing a hierarchical approach to overcome DBSCAN’s limitations in handling variable densities [19]. HDBSCAN leverages a technique that hierarchically builds clusters and subsequently condenses them based on stability measures. This allows it to adapt better to diverse density regions, producing more meaningful clusters across datasets with varying densities. In practical terms, HDBSCAN is a more dynamic clustering algorithm, especially useful in applications like anomaly detection and data visualization where clustering precision is essential across different density regions.

For clustering, HDBSCAN was chosen due to its robustness in managing variable-density data, which is critical in biomedical datasets where noise and outliers are common. Unlike DBSCAN, HDBSCAN does not require a fixed density threshold, instead deriving clusters based on local density variations, which enables it to separate clusters with differing densities efficiently. HDBSCAN has shown excellent performance in tasks that demand noise-resilient clustering, such as in ECG and EEG signal analysis, as it can identify subtle cluster boundaries without overfitting noise points. The method leverages a hierarchical clustering approach using mutual reachability distances, allowing clusters to form naturally based on data density rather than arbitrary parameters.

This combination of UMAP for dimensionality reduction and HDBSCAN for density-based clustering has shown promising results in applications requiring fine-grained grouping in high-noise environments, such as in the analysis of complex medical datasets where patient stratification and the detection of rare phenotypes are critical. These tools are validated by prior research and have been integrated into our study to optimize the interpretability and accuracy of clustering in the selected dataset [20].

The field of medical diagnostics has increasingly leveraged machine learning and clustering techniques to analyze complex biomedical data. In particular, clustering methods applied to cardiovascular and psychological health data have shown promising results for early diagnosis and personalized treatment approaches. The authors of [21] explored the clustering of arterial oscillogram (AO) data, focusing on ultra-low-frequency (ULF) indicators correlated with depression levels. By employing UMAP (Uniform Manifold Approximation and Projection) for dimensionality reduction, they identified two clusters representing different severity levels of depression, subsequently developing a high-accuracy classification system based on ULF metrics and products of correlated parameters, achieving accuracy rates up to 97% using nearest neighbor classifiers.

In another work [22] a distributed classification model for real-time healthcare data processing, using a Random Forest approach integrated with binary classification modules. Their model demonstrated high accuracy in classifying multiple levels of depression severity based on arterial pulsation data, achieving effective parallel processing benchmarks for real-time applications [23,24]. These methods align with a broader trend in healthcare data analysis, where dimensionality reduction techniques like UMAP enhance visualization and feature extraction from high-dimensional medical data. Moreover, the use of HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), a density-based clustering technique, enables the handling of data with varying densities, which is common in medical datasets [19].

Ontological approaches further complement machine learning in medical dialogue systems. It has introduced an ontology-driven framework for rehabilitation support in physical medicine, underscoring the importance of adaptive systems that cater to evolving medical knowledge [25]. The MedRehabBot system [26] utilizes neural networks combined with an ontological paradigm to enhance interactions within telerehabilitation services, addressing critical needs in Ukraine’s military and civilian rehabilitation programs [26,27,28,29,30,31,32]. Additional studies by Malakhov and Kaverinskiy highlight the application of automatic ontology generation for natural language dialogue systems, which enhances contextual understanding in systems that support complex inflectional languages [33].

3. Materials and Methods

For the investigation, two datasets were utilized. Both of them included data obtained from ECG transcripts. The first dataset consists of 14,863 data rows and represents clinically healthy people—without any heart disorders (the following code name—Norm dataset). The second one consists of 8220 data rows and represents patients hospitalized in the cardiology clinic due to heart failure (the following code name—patients’ dataset). Each row corresponds to one person. The Norm dataset includes both men and women of a wide range of ages from children to old people. The patients’ dataset also includes men and women, but adults and old only (from 40 to 93 years).

The datasets analyzed contained 153 features each. To facilitate a clear, visual cluster analysis, dimensionality reduction was required. For this purpose, we applied the UMAP method, using its Python implementation (version 0.5.8). Our approach involved tuning UMAP and HDBSCAN (version 0.8.1) hyperparameters with the primary objective of generating distinct, consolidated clusters that were visually separable while minimizing the number of unclustered points by HDBSCAN (ideally eliminating them). It was also preferable to avoid the formation of numerous small clusters, and instead to produce 2–3 well-defined clusters in the two-dimensional space of the UMAP embeddings, each covering a comparable area and containing a substantial number of data points.

Computational Environment. All data pre-processing, dimensionality reduction, clustering, and statistical tests reported in this study were carried out inside an instance of the Research and Development Workstation Environment (RDWE) running on the Hybrid Cloud Environment for Telerehabilitation (HCET) [26] maintained by the Microprocessor Technology Lab, V.M. Glushkov Institute of Cybernetics. The HCET is a three-layer hybrid cloud platform (IaaS → PaaS → SaaS) hosted on an on-premises HP ProLiant DL380p Gen8 cluster virtualized with KVM/LibVirt and accessed through a WireGuard VPN. The RDWE instance provides a reproducible Python 3.11 stack (NumPy 2.x, SciPy 1.13, scikit-learn 1.5, umap-learn 0.5.5, hdbscan 0.8.33, cuML 25.02, Matplotlib 3.9) orchestrated in JupyterLab; all optimization routines (Powell, Nelder–Mead) and figure generation were executed here. The scripts used for these computations are openly archived on Zenodo.

To minimize unclustered points, we optimized an objective function, which was processed through the optimization routines available in the SciPy package. The Powel and Nadler-Mead methods were utilized. Other hyperparameters were tuned manually, relying on visual assessment. Basic statistical analyses, including calculations of maximum, minimum, mean values, and standard deviations, were performed for each of the resulting clusters. These cluster characteristics were then compared to identify potential differences that could facilitate class differentiation. The values of the hyperparameters tuned for each of the studied cases are presented in Table 1.

To assess the generalizability and structural divergence between populations, a cross-dataset clustering strategy was employed. Specifically, clustering models (UMAP + HDBSCAN) trained on one dataset (e.g., Norm) were applied to the other dataset (e.g., Patients), and vice versa. This approach allowed us to explore how patterns identified in one group are reflected in another, despite differences in health status and data distribution. Such cross-application serves not only as a form of indirect validation but also reveals latent predispositions that may not be apparent in within-group analysis. Since the study is unsupervised, no explicit training/validation/test split was used; instead, the generalization capability was assessed via model transfer across datasets with distinct population characteristics. This cross-dataset analysis revealed similar clustering patterns but often displayed data redistribution across clusters. The datasets represented both healthy individuals and those with diagnosed heart disorders. Differences between clusters, considering variations in characteristic features, suggested that certain clusters might represent individuals at different levels of susceptibility or stages of the disorder. Specifically, some clusters may indicate individuals more predisposed to the disorder (including those in an early latent stage), while others may correspond to more advanced stages, showing little overlap with clusters derived from the healthy cohort.

Based on these findings, we infer that certain feature levels or combinations could serve as indicators of early or advancing stages of a heart disorder. This research lays the groundwork for developing a diagnostic classification system for early heart disorder detection, which could be valuable in preventive healthcare.

It is important to note that the authors of this study are computer scientists, not medical professionals. The results presented here reflect statistical and data science insights. The interpretation of these findings and the determination of their clinical relevance require expertise from medical professionals. This research aims to provide them with potentially valuable insights.

4. General Statistics of the Initial Datasets

For each of the initial datasets, Norm and Patients, fundamental statistical parameters were determined to gain insights into feature levels and ranges, enabling dataset comparison.

A general statistics table which displays the minimum, maximum, mean, and standard deviation values for features across the datasets can be found online by the link: https://zenodo.org/records/15373634 (accessed on 9 June 2025).

Notably, certain features show no deviation or even remain consistently zero across all cases in both datasets. These features include:

AMP AREAS INDEX lead AVR

DURATION RATIO TP TE JT

DURATION RATIO TP TE JTA

DURATION RATIO TP TE JT

AVR LEAD CODE

These features can be excluded from further consideration without impact, and they were automatically removed during data preparation for the subsequent clustering analysis.

Applying a formal t-test to compare means reveals that most features display significantly different mean values between the datasets. However, some features did not meet this criterion (excluding those listed above). These features include:

FRACTAL INDEX

VL

ST T FORM INDICATOR lead II

ST DISLOCATION lead III

Q R RATIO lead I

R S RATIO lead I

R P RATIO lead III

R P RATIO lead AVF

R T RATIO lead AVF

GLOBAL P DURATION

MYOCARDIAL INDEX OF STATIONARITY

SIGN OF HEART FAILURE

SELVESTER CODE

While these features did not show significant differences in means, excluding them from cluster analysis is unnecessary since they may show variation within clusters. However, it is essential to note the absence of statistically significant differences in these features’ mean values across datasets.

Although most features show different mean values due to the large sample sizes and narrower confidence intervals, many have broad variances and standard deviations. A stringent test was conducted to identify features with non-overlapping standard deviations across datasets. Only one feature, DFA, met this criterion, exhibiting a higher level in the patients’ dataset, which could indicate its potential as an anxiety marker associated with heart disorders.

Another measure of mean difference significance, ignoring variance and focusing on raw values, was used as defined by Equation (1):

k = 2 \cdot \frac{|M_{1} - M_{2}|}{|M_{1}| + |M_{2}|}

(1)

where M₁ and M₂ are the mean values of the compared groups.

This ratio reflects the absolute difference relative to the average mean, indicating a potential shift in distribution mode. A critical value of k > 1 was used, signifying that M₁/M₂ > 3 or M₂/M₁ > 3. Table 2 presents features meeting this threshold.

Despite similar variation ranges and overlapping standard deviations, many features show substantial mean differences. For example, the “ALPHA QRS ANGLE IN THE FRONTAL PLANE” ranges from −179.49 to 179.56 in the PATIENTS’ dataset and −179.76 to 180.0 in Norm, yet the mean values are 3.1 and 11.1, respectively. Both means are shifted positively, but this shift is more pronounced in the Norm dataset.

In summary, there appears to be a slight tendency for higher values of features in the left column of Table 2 to be more typical of healthy individuals, while elevated values in the right column may serve as markers of heart disorders.

5. Results

5.1. Data Cauterization Analysis in Low-Dimensional Space

5.1.1. Euclidean Metric

The Euclidean metric was initially selected for dimensionality reduction using UMAP and clustering with HDBSCAN. The Norm dataset served as the training set for the UMAP + HDBSCAN clustering/classification model.

Figure 1 presents a comparison between clusters derived from the Norm dataset (training dataset) and subsequent classification predictions on the patients’ dataset. The predictions are based on clusters established during the training phase, now treated as distinct classes. The UMAP dimensionality reduction was performed using the same pre-trained UMAP reducer.

In the Norm dataset, two primary clusters were identified, each approximately equal in size. These could potentially be further subdivided, but this may add complexity without providing meaningful insights. When projecting the Norm-trained model onto the patients’ dataset, cluster 0 (the right-side cluster) remains recognizable, with parts of it intact. In contrast, cluster 1 (the left-side cluster) is sparsely represented in patients, and one data point is misclassified as “cluster” −1, likely an outlier situated between the clusters.

These findings suggest that cluster 0 in Norm, which is largely absent in patients, may correspond to healthier individuals with a lower risk of developing heart disorders. Conversely, members of cluster 1 could represent a potential risk group.

The average age and men-to-women ratio in the clusters were the following:

in cluster 0: 50.8 years and 42.8/57.2%;

in cluster 1: 51.1 years and 44.4/55.6%.

Thus, there is no significant misbalance between the clusters observed either in age or sex. They correspond to the mean values in the Norm dataset.

A reverse experiment was conducted, where the model was trained on the PATIENTS’ dataset and projected onto the Norm dataset. The clustering results for this experiment are displayed in Figure 2.

The clustering configuration appears similar in both cases, with three distinct clusters and a few outliers. Projecting the PATIENTS-trained model onto Norm reveals some structural changes: cluster 1, which contains fewer points in PATIENTS, almost disappears in Norm, with only a few points remaining. Cluster 2 becomes more sparse but retains a large size, while Cluster 0 shows increased population density and a closer proximity to Cluster 2, with some misclassified points between them.

These observations suggest that cluster 1 in PATIENTS may represent the most severe cases, as corresponding feature values are rarely present in the healthier Norm dataset. Cluster 2 likely represents individuals with a moderate risk of heart disorders.

The average age and men-to-women ratio in the clusters were the following:

in cluster 0: 69.2 years and 61.7/38.3%;

in cluster 1: 75.4 years and 73.6/26.4%;

in cluster 2: 70.4 years and 64.4/35.6%.

It can be noticed, that cluster 2 looks somewhat elder and more male. Nevertheless, the major clusters 0 and 2 have rather close ratios close to the ones in the PATIENTS’ dataset. Even in cluster 1 minimum age was 40, which is also the minimum for the PATIENTS’ dataset.

5.1.2. Manhattan Metric

A similar study was performed using the Manhattan metric, known for slower growth and more consolidated cluster formation. Figure 3 compares the clusters obtained by training the model on the Norm dataset with the model’s projection onto the PATIENTS’ dataset.

In this case, three clusters were identified in Norm (two large and one smaller). The PATIENTS prediction retains the structure of cluster 1 (the smallest cluster in Norm), indicating similar data characteristics in both datasets. However, cluster 0 appears as a small island, and cluster 2 is present in both cases, though its bottom section seems to diverge in the PATIENTS’ dataset. This suggests that cluster 1 may represent a higher-risk group, given its persistence across both datasets, while cluster 0 likely corresponds to a healthier population with lower heart disorder risk. Cluster 2 may represent individuals with moderate risk but less severe than those in cluster 1.

The average age and men-to-women ratio in the clusters were the following:

in cluster 0: 51.1 years and 44.3/55.7%;

in cluster 1: 57.0 years and 46.1/53.9%;

in cluster 2: 49.6 years and 42.2/57.8%.

So, we do not see any significant age or gender disproportion between the clusters, especially major ones (0 and 2)—the ratios are close to the ones in the Norm dataset.

The final experiment involved training the model on the PATIENTS’ dataset and then projecting it onto the Norm dataset, using the Manhattan metric. The clustering results are illustrated in Figure 4.

The results reveal similar trends to those observed in Figure 2 with the Euclidean metric. Specifically, cluster 0 (“end of the tail”) nearly vanishes, and the boundaries between clusters 3 and 1 become blurred. Cluster 3 becomes more dispersed, while Cluster 1 gains additional points. A new, smaller cluster 2 also appears, though its significance is unclear. This cluster’s consistency across both datasets suggests it may represent individuals with a specific type of heart disorder risk, though this hypothesis warrants further investigation by subject matter experts.

The average age and men-to-women ratio in the clusters were the following:

in cluster 0: 76.1 years and 76.0/24.0%;

in cluster 1: 69.1 years and 61.5/38.5%;

in cluster 2: 76.0 years and 73.3/26.7%;

in cluster 3: 70.4 years and 64.2/35.8%.

As we can observe, clusters 0 and 2 are more aggregates of old men than 1 and 3 (major ones), where the mentioned ratios are closer to the average values in PATIENTS. Although the average values can be shifted in clusters all of them cover mostly the whole age range presented in the dataset.

Overall, this clustering analysis highlights distinctions in the dataset structures, potentially correlating certain cluster configurations with health status and heart disorder risk factors across both Norm and PATIENTS’ datasets.

5.2. Overall Analysis of Data Clustering

5.2.1. Data Distribution Between the Clusters

Table 3 provides a detailed summary of data distribution across clusters for each scenario analyzed above. The misclassified points (designated as “cluster” −1) have been excluded from this estimation due to their minimal count.

The table indicates that substantial data redistribution occurs between main clusters during prediction. In the first scenario (Euclidean metric, Norm dataset for training), initial clusters are relatively balanced. However, in the prediction for the PATIENTS’ dataset, a significant shift occurs, with cluster 0 accounting for nearly 87% of data points. When the model is trained on the PATIENTS’ dataset and projected to Norm, cluster 0 originally contains only about 23% of data points, with clusters 1 and 2 comprising the remaining 77%. Conversely, in the Norm prediction, cluster 0 contains approximately 80% of data points, suggesting a healthier profile for its members, while cluster 2 appears associated with less favorable health outcomes. Cluster 1 could potentially represent a high-risk group.

The Manhattan metric reveals a similar trend. For the Norm to PATIENTS projection, an additional, smaller cluster 1 emerges. This cluster may represent a risk group, as it consistently appears across both datasets, yet its health status remains ambiguous. Cluster 2 may indicate individuals with a predisposition to cardiovascular issues, despite currently being relatively healthy. In the final scenario, clusters 0, 3, and possibly 2 are more prominent in the PATIENTS’ dataset and less so in the Norm, suggesting these clusters may correspond to less favorable health profiles.

5.2.2. Certain Features Values Differences Between the Clusters

The following question concerns the differences in the features’ values between the clusters. Having this information we can make some diagnostic conclusions, about what combinations of raised and decreased parameters might say about the presence of heart failure or at least a tendency of its development in the future, or maybe revile people less tend to such health disorders. All 4 cases considered above will be regarded in this section. The first and simplest one is the case of the model trained on the Norm dataset using the Euclidean metric, which has only two clusters. The criterion given by Equation (1) has been used to reveal the features with significant differences in mean values. The list of the parameters with the most differences between the clusters is given in Table 4.

Only the values for features “ALPHA QRS ANGLE IN THE FRONTAL PLANE” and DFA from the list are greater in cluster 0, while others are bigger in cluster 1.

Thus raised DFA, which in most cases of the Norm dataset might be evidence of heart failure disinclination. It should be noted that DFA can have both positive and negative values on both datasets, as well as “ALPHA QRS ANGLE IN THE FRONTAL PLANE”. This one in cluster 1 is quite close to 0, but in cluster 0, it tends to have positive values, which also could be perceived as a lower risk of heart failure development. Many parameters have negative mean values in both clusters. However, in cluster 1 they are much greater by the absolute values. All those features having raised absolute values, especially appearing in combination might be threaded as a signal of predisposition to heart disorders even if a person is healthy for the moment. This, and the statements below are of course not a univocal conclusion and a certain diagnosis, but merely information the medical professionals may pay attention to, then accepting or rejecting it according to their experience as well as according to the results of the subsequent specific studies.

The following case is clustering using the PATIENTS’ dataset assuming the Euclidean metric. Here we have three clusters. So, let us start with the features that are different enough in all three clusters. These are only HF and “ALPHA QRS ANGLE IN THE FRONTAL PLANE”. Cluster 1 here has an extremely high average HF value which is 28,024, and it collects the highest values of it. In clusters 0 and 2 the HF is not as much different but distinguish enough having mean values equal to 291 and 991 correspondingly. The “ALPHA QRS ANGLE IN THE FRONTAL PLANE”, which can have positive or negative values has different behavior. In cluster 2 its average value is 0.64, which, assuming its variation range (±180), could be treated as rather close to 0. In the cluster 0 prevail positive values and in the cluster 1—negative ones.

Then we will go to the features in which the difference is distinctive for only one pair of the clusters. There is only one of them: “Q R RATIO lead I”, whose average value in cluster 0 is rather low as well as in luster 2 (0.095 and 0.129 correspondingly); however, in cluster 1 it is raised to 0.301. The difference between clusters 1 and 2 is also notable but still does not fit the set threshold.

There are a few features that distinguish the clusters 0 and 2. Those are besides already mentioned HF and “ALPHA QRS ANGLE IN THE FRONTAL PLANE” are:

ST DISLOCATION lead AVL

Q AMP lead II

S AMP lead II

Q R RATIO lead II

Q AMP lead III

Q AMP lead AVR

R AMP lead AVR

S AMP lead AVR

Q R RATIO lead AVR

R T RATIO lead AVR

R S RATIO lead AVR

Q AMP lead AVF

Q R RATIO lead AVF

If we assume the hypothesis that cluster 2 is more predisposed to heart failure than cluster 0 (according to their members ratio in Norm and PATIENTS’ datasets), we ought to denote the feature values which are raised or declined in each of them. The parameter ST DISLOCATION lead AVL can have either positive or negative values. It is observed that for cluster 0 positive ones are typical, but for cluster 2—negative ones. Notably, in cluster 1, which may include the most severe cases, these feature values are even more negative. The features “Q AMP lead II”, “S AMP lead II”, “Q AMP lead III”, “Q AMP lead AVR”, “S AMP lead AVR”, and “Q AMP lead AVF” have negative values. Among them only “S AMP lead AVR” has a greater (by the absolute) value in the cluster 0 (but it is significant: −555.9 to −33.6). Other parameters are greater in cluster 2. The features “Q R RATIO lead II”, “Q R RATIO lead III”, “Q R RATIO lead AVR”, “R S RATIO lead AVR”, and “Q R RATIO lead AVF” have positive values. All of them have mean values greater in cluster 2.

Thus, cluster 2, which might be more predisposed to heart failure than cluster 0, is typical:

raised positive values of HF, “Q R RATIO lead I”, “Q R RATIO lead II”, “Q R RATIO lead III”, “Q R RATIO lead AVR”, “R S RATIO lead AVR”, and “Q R RATIO lead”;

lower negative values of “Q AMP lead II”, “S AMP lead II”, “Q AMP lead III”, “Q AMP lead AVR”, and “Q AMP lead AVF”;

raised (but negative) value of “S AMP lead AVR”;

close to the 0 value of “ALPHA QRS ANGLE IN THE FRONTAL PLANE”.

Of certain difference is cluster 1. It looks like a “tail” of cluster 2 and probably contains more severe cases of heart disorder, for it is almost not presented in the classification performed on the Norm dataset, which is assumed to consist of healthy people. It seems to make better sense to outline the features that distinguish this cluster from cluster 2, which “tail” it is. There are quite many a feature that are different in clusters 0 and 1 because they are not even close. There is a list of such feature names to distinguish cluster 1 from cluster 2, excluding already mentioned HF and “ALPHA QRS ANGLE IN THE FRONTAL PLANE” distinctive for all three clusters:

SDNN

RMSSD

PNN50

SDSD

IAE

TOTAL POWER

ACTIVITY OF VASOMOTOR CENTERS

VLF

LF

ST DISLOCATION lead I

J40 AMP lead I

You can see that this list is completely not the same as the one to distinguish the clusters 0 and 2. We have already mentioned that the HF parameter in cluster 1 has an extremely high value. The same thing can be said about VLF and LF features, whose mean values for the cluster are, correspondently, 31,857 and 89,970 compared with 151.3, 26.3, and 274.5, 774.9 in the other clusters. Also extremely raised in cluster 1 is parameter TOTAL POWER, whose average value in this cluster is 149,788 compared to 716.9 and 1995.3 in the other clusters. Not as extreme, but also significantly raised in cluster 1 are the mean values of the following features: SDNN, RMSSD, PNN50, and SDSD. On the other hand, the parameters IAE and ACTIVITY OF VASOMOTOR CENTERS in cluster 1 are valuably lower. The features “ST DISLOCATION lead I” and “J40 AMP lead I” can have negative or positive values. So, for cluster 1 negative values are more likely to appear, but positive ones are for cluster 2. The feature “ALPHA QRS ANGLE IN THE FRONTAL PLANE” tends to be negative in cluster 1.

Now let us consider what Manhattan metric usage can add to the results of the study. As we can see, its application for the Norm dataset clustering leads to three clusters. There are many parameters that have different average values in each of the clusters. These parameters and their average values in the clusters are presented in Table 5.

We can see that all the features have been already presented in Table 4—a similar clustering case, but using the Euclidean metric. So, involving the Manhattan metric here cannot reveal any new distinctive features but recognizes three separate clusters. We can observe, that for all the features raised values are observed in cluster 0. This cluster 0 is that one which we almost cannot see for the PATIENTS’ dataset prediction. So, might the increased values of those factors are not typical for people with heart failure? This result is in good agreement with the Euclidean metric case (there it was called cluster 1, which also had raised values of those parameters and almost vanished in prediction on the PATIENTS’ dataset). In the considered case also another rather small cluster 1. This one has the smallest mean values for all the listed parameters. However, such people appear as well in PATIENTS. Thus, it might be guessed, that those from cluster 1 could also have a propensity to some heart disorder, but of a different type than those from cluster 2.

The last case to be considered is employing clustering through the Manhattan metric for the PATIENTS’ dataset. Here the primary interest is cluster 0—“and of the tail” of cluster 3. This is a true analog of cluster 1 from a similar case with the Euclidean metric. It also has extremely raised values of “TOTAL POWER”, HF, VLF, and LF factors. Small cluster 2 might be treated just as part of cluster 3. However, it also has increased parameters HF, VLF, and LF, but not as much as in cluster 0, and with many values, it is quite close to it. Nevertheless, it is too small to make serious conclusions about its differences.

6. Discussion

A key methodological innovation of this work is the cross-dataset clustering strategy, in which a model trained on one population (e.g., Norm) is applied to a different one (e.g., Patients), and vice versa. This allows for the identification of structural mismatches and hidden subgroup patterns without the need for explicit supervision or diagnostic labels. Such an approach supports the idea of latent phenotype discovery and has potential applications not only in cardiology but also in broader areas of predictive and preventive medicine.

We also emphasize that, in addition to cluster-level visualization and mean-based comparisons of selected features, the full statistical comparison framework employed in this study includes t-tests, standard deviation analysis, and a relative mode-shift ratio metric designed to detect distributional shifts even in the presence of large within-group variances. These results, due to their volume have been omitted directly in the article but available in the open repositories and by request.

The heterogeneity of both groups examined is noteworthy. If this result was expected for the Patients group, it was somewhat surprising for the Norms group. The presence of a cluster in the Norms group similar to the main cluster in the Patients group emphasizes the heterogeneity of this group, which apparently contains a subgroup of individuals with a high risk of developing heart disease in the near future. The fine structure of the patient group is also quite complex, which can be used to determine the prognosis of the disease outcome and the effectiveness of therapy. These hypotheses will be tested in future studies. We also note the composition of the list of ECG parameters with the greatest separating ability. This list includes both basic amplitude-time parameters of the ECG and more complex parameters describing the shape of ECG elements, as well as some HRV parameters. This confirms our earlier opinion on the usefulness of analyzing the most multilateral matrix of ECG and HRV parameters [34].

The comparative analysis of feature values across clusters, utilizing both Euclidean and Manhattan metrics, has provided meaningful insights into potential indicators of heart failure or predisposition to related conditions. The study highlighted distinct clusters within both the Norm and PATIENTS’ datasets, revealing specific patterns in key features that may reflect varying levels of cardiac risk.

In the Norm dataset, the clustering analysis using the Euclidean metric identified two clusters with marked differences in features like “ALPHA QRS ANGLE IN THE FRONTAL PLANE” and DFA (Detrended Fluctuation Analysis). Cluster 0, characterized by higher values of DFA and “ALPHA QRS ANGLE,” is associated with a decreased likelihood of heart failure. Conversely, features with raised absolute values in Cluster 1 may suggest an elevated risk, as this group often displayed higher magnitudes of certain ECG parameters. Although such clustering does not provide a definitive diagnosis, these identified patterns can assist medical professionals in isolating individuals at potential risk for further monitoring or preventive intervention.

Extending the clustering analysis to the PATIENTS’ dataset with Euclidean distance, a three-cluster model emerged, distinguishing features that showed substantial variation across clusters. Notably, “HF” (High-Frequency power) and “ALPHA QRS ANGLE IN THE FRONTAL PLANE” were the most discriminative features. Cluster 1 exhibited an extreme HF value, potentially indicating a high cardiac risk profile, while other clusters demonstrated more moderate values. This suggests that variations in HF and related ECG parameters may serve as early indicators of adverse cardiac events, with Cluster 1 potentially capturing the most severe cases.

The application of the Manhattan metric on the Norm dataset further refined this analysis, yielding three distinct clusters with unique feature patterns. Table 5 (provided) illustrates the clustering results under this metric, emphasizing parameters like “ST DISLOCATION” and “T AMP NORMALIZED” across multiple leads. These distinctions reinforce the hypothesis that certain ECG measurements, when significantly deviated, correlate with varying cardiac health risks. This observation underscores the potential value of employing multiple metrics in clustering algorithms, as the Manhattan distance identified additional distinctive features that were less apparent under Euclidean distance.

The clustering structure observed in both datasets suggests several clinically relevant findings. First, the appearance of extreme HF values and distinctive dislocation in parameters such as “ST DISLOCATION” and “T AMP NORMALIZED” may signify critical deviations in cardiac function, warranting further exploration. Second, the observed differences in the “ALPHA QRS ANGLE IN THE FRONTAL PLANE” across clusters, particularly its closeness to zero in Cluster 2, may indicate a lower cardiovascular risk when combined with other typical parameters. Third, the consistency of certain features across different clustering metrics and datasets supports their relevance as potential markers for early detection of cardiac issues.

The obtained results and utilized approaches could be extended not only for medical purposes but also in other realms, like for example materials science. Application of such techniques may give some ameliorating and bring to a qualitatively new level models, like those presented in the works [21,35,36]. Moreover, one current trend in software engineering is finding ways to reuse software components when building new projects [37,38,39,40,41]. Although the programs developed in this study are mainly exploratory, the mathematical models and software modules created here could be adapted with minimal changes to work as parts of other systems. This highlights the practical importance of the research conducted in this study.

Future research should aim to validate the observed clusters against clinical outcomes through longitudinal studies. Moreover, integration with real-time ECG monitoring systems and expert annotation pipelines may allow for the development of predictive models that track the transition of individuals between risk clusters over time. The clustering methodology could also be extended to other physiological signals or multimodal data to support broader diagnostic frameworks in cardiology and preventive medicine.

7. Conclusions

This study’s clustering approach demonstrates that ECG parameters, when analyzed through unsupervised methods, can reveal latent structures associated with cardiac health risks. By identifying clusters that correlate with feature patterns typical of heart failure predispositions, this research offers a foundation for further investigations into non-invasive diagnostic methods and personalized medicine. Although these clusters do not serve as definitive diagnostic criteria, they provide a preliminary stratification that could be useful in routine screening for cardiac anomalies. Future research should focus on validating these findings with larger, more diverse datasets and exploring the integration of clustering outcomes with clinical decision-making systems to enhance early diagnosis and prevention in cardiology.

In order to increase the efficiency of research on the application of unsupervised clustering algorithms to ECG data, it is planned to study the possibility of applying models and methods of transdisciplinary research and their (algorithms) hardware support [42,43,44,45].

Author Contributions

Conceptualization, V.K., I.C., and K.M.; methodology, V.K.; software, V.K. and K.M.; validation, K.M. and I.C.; formal analysis, V.K.; investigation, V.K.; resources, I.C. and T.R.; data curation, I.C. and V.K.; writing—original draft preparation, V.K.; writing—review and editing, K.M., I.C., T.R., A.M., and M.B.; visualization, V.K.; supervision, I.C.; project administration, I.C. and M.B.; funding acquisition, I.C. and M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the National Research Foundation of Ukraine (grant No. 2023.04/0094) under the project “Development of Technology for Objective Monitoring of Functional Capabilities and Stress of Military Personnel Based on Miniature Electrocardiographs and Machine Learning”. State registration No. 0125U002047; details available at https://nrat.ukrintei.ua/searchdoc/0125U002047 (accessed on 9 June 2025). This research was also conducted as part of the scientific and technical project “Develop Means of Supporting Virtualization Technologies and Their Use in Computer Engineering and Other Applications”, funded by the National Academy of Sciences of Ukraine. State registration No. 0124U001826; details are available at https://nrat.ukrintei.ua/en/searchdoc/0124U001826 (accessed on 9 June 2025). Both projects are being carried out at the V. M. Glushkov Institute of Cybernetics of the National Academy of Sciences of Ukraine, Kyiv, Ukraine.

Data Availability Statement

The Python scripts and Jupyter notebooks required to reproduce all pre-processing, dimensional-reduction, clustering, and statistical analyses are openly archived on Zenodo: https://zenodo.org/records/15373634 (accessed on 9 June 2025). The underlying ECG dataset contains sensitive health information and cannot be placed in a public repository. De-identified data can be made available to qualified researchers upon reasonable request, contingent on approval by the V.M. Glushkov Institute of Cybernetics Ethics Committee and completion of a standard data-use agreement. Requests for access should be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ECG	Electrocardiographic
DFA	Detrended Fluctuation Analysis
WPW	Wolff-Parkinson-White Syndrome
UMAP	Uniform Manifold Approximation and Projection
DBSCAN	Density-Based Spatial Clustering of Applications with Noise
HDBSCAN	Hierarchical Density-Based Spatial Clustering of Applications with Noise
RDWE	Research and Development Workstation Environment
HCET	Hybrid Cloud Environment for Telerehabilitation

References

Chaikovsky, I. Electrocardiogram scoring beyond the routine analysis: Subtle changes matters. Expert Rev. Med. Devices 2020, 17, 379–382. [Google Scholar] [CrossRef] [PubMed]
Oke, O.A.; Cavus, N. A Systematic Review on the Impact of Artificial Intelligence on Electrocardiograms in Cardiology. Int. J. Med. Inform. 2025, 195, 105753. [Google Scholar] [CrossRef]
Liu, Z.-Y.; Lin, C.-H.; Hsu, Y.-C.; Chen, J.-S.; Chang, P.-C.; Wen, M.-S.; Kuo, C.-F. Universal Representations in Cardiovascular ECG Assessment: A Self-Supervised Learning Approach. Int. J. Med. Inform. 2025, 195, 105742. [Google Scholar] [CrossRef]
Mondal, A.; Manikandan, M.S.; Pachori, R.B. Automatic ECG Signal Quality Assessment Using Convolutional Neural Networks and Derivative ECG Signal for False Alarm Reduction in Wearable Vital Signs Monitoring Devices. Biomed. Signal Process. Control 2025, 108, 107876. [Google Scholar] [CrossRef]
Kuetche, F.; Alexendre, N.; Pascal, N.E.; Thierry, S. Simple, Efficient, and Generalized ECG Signal Quality Assessment Method for Telemedicine Applications. Inform. Med. Unlocked 2023, 42, 101375. [Google Scholar] [CrossRef]
Chaikovsky, I.; Starynska, G.; Budnyk, M. Method of ECG Evaluating Based on Universal Scoring System. US10512412B2, 24 December 2020. [Google Scholar]
Escudero, J.; Ifeachor, E.; Zajicek, J.P.; Green, C.; Shearer, J.; Pearson, S. Machine Learning-Based Method for Personalized and Cost-Effective Detection of Alzheimer’s Disease. IEEE Trans. Biomed. Eng. 2013, 60, 164–168. [Google Scholar] [CrossRef]
Tosto, G.; Bird, T.D.; Tsuang, D.; Bennett, D.A.; Boeve, B.F.; Cruchaga, C.; Faber, K.; Foroud, T.M.; Farlow, M.; Goate, A.M.; et al. Polygenic Risk Scores in Familial Alzheimer Disease. Neurology 2017, 88, 1180–1186. [Google Scholar] [CrossRef]
Fereshtehnejad, S.M.; Zeighami, Y.; Dagher, A.; Postuma, R.B. Clinical Criteria for Subtyping Parkinson’s Disease: Biomarkers and Longitudinal Progression. Brain 2017, 140, 1959–1976. [Google Scholar] [CrossRef]
Krasniqi, E.; Schramm, W.; Reichenbach, A. Data-Driven Stratification of Parkinson’s Disease Patients Based on the Progression of Motor and Cognitive Disease Markers. Ger. Med. Sci. 2021, 17, 4. [Google Scholar] [CrossRef]
Goldenberg, I.; Moss, A.J. Long QT Syndrome. J. Am. Coll. Cardiol. 2008, 51, 2291–2300. [Google Scholar] [CrossRef]
Chhabra, L.; Goyal, A.; Benham, M.D. Wolff-Parkinson-White Syndrome. StatPearls, 7. Available online: https://pubmed.ncbi.nlm.nih.gov/32119324/ (accessed on 7 August 2023).
Thygesen, K.; Alpert, J.S.; White, H.D.; Jaffe, A.S.; Apple, F.S.; Galvani, M.; Katus, H.A.; Newby, L.K.; Ravkilde, J.; Chaitman, B.; et al. Task Force for the Redefinition of Myocardial Infarction. Eur. Heart J. 2007, 28, 2525–2538. [Google Scholar] [CrossRef] [PubMed]
Xu, R.; Wunsch, D. Survey of Clustering Algorithms. IEEE Trans. Neural Netw. 2005, 16, 645–678. [Google Scholar] [CrossRef] [PubMed]
Esteva, A.; Robicquet, A.; Ramsundar, B.; Kuleshov, V.; DePristo, M.; Chou, K.; Cui, C.; Corrado, G.; Thrun, S. A Guide to Deep Learning in Healthcare. Nat. Med. 2019, 25, 24–29. [Google Scholar] [CrossRef] [PubMed]
Murali, L.; Gopakumar, G.; Viswanathan, D.M. Towards Electronic Health Record-Based Medical Knowledge Graph Construction, Completion, and Applications: A Literature Study. J. Biomed. Inform. 2023, 143, 104403. [Google Scholar] [CrossRef]
McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2020. [CrossRef]
Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD’96), Portland, OR, USA, 2–4 August 1996; pp. 226–231. Available online: https://file.biolab.si/papers/1996-DBSCAN-KDD.pdf (accessed on 9 June 2025).
Campello, R.J.G.B.; Moulavi, D.; Zimek, A.; Sander, J. Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM Trans. Knowl. Discov. Data 2013, 10, 1–51. [Google Scholar] [CrossRef]
Blanco-Portals, J.; Peiró, F.; Estradé, S. Strategies for EELS Data Analysis: Introducing UMAP and HDBSCAN for Dimensionality Reduction and Clustering. Microsc. Microanal. 2022, 28, 109–122. [Google Scholar] [CrossRef]
Kaverinsky, V.; Vakulenko, D.; Vakulenko, L.; Malakhov, K. Machine Learning Analysis of Arterial Oscillograms for Depression Level Diagnosis in Cardiovascular Health. Complex Syst. Inform. Model. Q. 2024, 40, 94–110. [Google Scholar] [CrossRef]
Reznichenko, S.; Whitaker, J.; Ni, Z.; Zhou, S. Comparing ECG Lead Subsets for Heart Arrhythmia/ECG Pattern Classification: Convolutional Neural Networks and Random Forest. CJC Open 2025, 7, 176–186. [Google Scholar] [CrossRef]
Lanerolle, G.D.; Roberts, E.S.; Haroon, A.; Shetty, A. Chapter 7—Neuropsychiatry and Mental Health. In Quality Assurance Management; Lanerolle, G.D., Roberts, E.S., Haroon, A., Shetty, A., Eds.; Academic Press: Cambridge, MA, USA, 2024; pp. 131–240. [Google Scholar]
Zabihi, F.; Safara, F.; Ahadzadeh, B. An Electrocardiogram Signal Classification Using a Hybrid Machine Learning and Deep Learning Approach. Healthc. Anal. 2024, 6, 100366. [Google Scholar] [CrossRef]
Cascianelli, S.; Masseroli, M. Biological and Medical Ontologies: Introduction. In Encyclopedia of Bioinformatics and Computational Biology; Elsevier: Amsterdam, The Netherlands, 2025; pp. 380–391. ISBN 978-0-323-95503-4. [Google Scholar]
Malakhov, K.S. Innovative Hybrid Cloud Solutions for Physical Medicine and Telerehabilitation Research. Int. J. Telerehabil. 2024, 16, e6635. [Google Scholar] [CrossRef] [PubMed]
Romanchuk, O.; Polianska, O.; Polianskyi, I.; Yasinska, O. Telerehabilitation: Current Opportunities And Problems of Remote Patient Monitoring. Neonatol. Hìr. Perinat. Med. 2024, 14, 183–190. [Google Scholar] [CrossRef]
Vakulenko, D.; Vakulenko, L. Information System Telerehabilitation: Needs, Tasks and Way Optimisation with AI. In Arterial Oscillography: NewCapabilities of the Blood Pressure Monitor with the Oranta-AO Information System; Vakulenko, D., Vakulenko, L., Eds.; Nova Science Publishers: Hauppauge, NY, USA, 2024; pp. 681–707. [Google Scholar]
Vladymyrov, O.A.; Semykopna, T.V.; Vakulenko, D.V.; Syvak, O.V.; Budnyk, M.M. Telerehabilitation Guidelines for Patients with Breast Cancer. Int. J. Telerehabil. 2024, 1–76. [Google Scholar] [CrossRef]
Vakulenko, D.V.; Palagin, O.V.; Sergienko, I.V.; Stetsyuk, P.I. Algorithmization and Optimization Models of Patient-Centric Rehabilitation Programs. Cybern. Syst. Anal. 2024, 60, 736–752. [Google Scholar] [CrossRef]
Vakulenko, D.V.; Vakulenko, L.; Zaspa, H.; Lupenko, S.; Stetsyuk, P.; Stovba, V. Components of Oranta-AO Software Expert System for Innovative Application of Blood Pressure Monitors. J. Reliab. Intell Environ. 2023, 9, 41–56. [Google Scholar] [CrossRef]
Khaustova, O.; Chaban, O.; Sak, L. Indicators of Somatized PTSD in Ukrainian Active Military Personnel Undergoing Rehabilitation after TBI Treatment. Neurosci. Appl. 2024, 3, 105356. [Google Scholar] [CrossRef]
Kaverinsky, V.V.; Malakhov, K.S. Natural Language-Driven Dialogue Systems for Support in Physical Medicine and Rehabilitation. S. Afr. Comput. J. 2023, 35, 119–126. [Google Scholar] [CrossRef]
Chaikovsky, I.; Dziuba, D.; Kryvova, O.; Marushko, K.; Vakulenko, J.; Malakhov, K.; Loskutov, O. Subtle changes on electrocardiogram in severe patients with COVID-19 may be predictors of treatment outcome. Front. Artif. Intell. 2025, 8, 1561079. [Google Scholar] [CrossRef]
Tu, X.; Qin, T.; Ji, X.; Wang, Z.; Chen, J.; Zhang, Z.; Wang, Z.; Wang, W.; Qin, Y.; Zhou, J. DBSCAN Clustering Model for Parameter Inversion Using Laser Cutting Edge Morphology Characteristic in Zr-4 Alloy. Opt. Laser Technol. 2025, 184, 112461. [Google Scholar] [CrossRef]
Zhang, L.; Deng, H. NJmat 2.0: User Instructions of Data-Driven Machine Learning Interface for Materials Science. Comput. Mater. Contin. 2025, 83, 1–11. [Google Scholar] [CrossRef]
Chebanyuk, O. An Approach of Text to Model Transformation of Software Models. In Proceedings of the 13th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2018), Funchal, Portugal, 23–24 March 2018; pp. 432–439. [Google Scholar] [CrossRef]
Chebanyuk, O. Software Reuse Approach Based on Review and Analysis of Reuse Risks from Projects Uploaded to GitHub. In Computer Science and Education in Computer Science; Springer: Cham, Switzerland, 2023; Volume 514, pp. 144–155. [Google Scholar] [CrossRef]
Chebanyuk, O. Investigation of Drawbacks of the Software Development Artifacts Reuse Approaches Based on Semantic Analysis. In Advances in Computer Science for Engineering and Education VI; Lecture Notes in Data Engineering and Communication Technologies; Springer: Cham, Switzerland, 2023; Volume 181, pp. 514–523. [Google Scholar] [CrossRef]
Mejía-Granda, C.M.; Fernández-Alemán, J.L.; Carrillo De Gea, J.M.; García-Berná, J.A. A Method and Validation for Auditing E-Health Applications Based on Reusable Software Security Requirements Specifications. Int. J. Med. Inform. 2025, 194, 105699. [Google Scholar] [CrossRef] [PubMed]
Sinaci, A.A.; Gencturk, M.; Teoman, H.A.; Laleci Erturkmen, G.B.; Alvarez-Romero, C.; Martinez-Garcia, A.; Poblador-Plou, B.; Carmona-Pírez, J.; Löbe, M.; Parra-Calderon, C.L. A Data Transformation Methodology to Create Findable, Accessible, Interoperable, and Reusable Health Data: Software Design, Development, and Evaluation Study. J. Med. Internet Res. 2023, 25, e42822. [Google Scholar] [CrossRef] [PubMed]
Kurgaev, A.; Petrenko, M. Processor Structure Design. Cybern. Syst. Anal. 1995, 31, 618–625. [Google Scholar] [CrossRef]
Opanasenko, V.; Palagin, O.; Zavyalov, S. The FPGA-Based Problem-Oriented On-Board Processor. In Proceedings of the 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), Metz, France, 18–21 September 2019; pp. 152–157. [Google Scholar]
Palagin, O.; Opanasenko, V. Reconfigurable-Computing Technology. Cybern. Syst. Anal. 2007, 43, 675–686. [Google Scholar] [CrossRef]
Palagin, O.; Petrenko, M.; Litvin, A.; Boyko, M. Method of Developing an Ontological System with Automatic Formation of a Knowledge Base and User Queries. In Proceedings of the 14th International Scientific and Practical Programming Conference, UkrPROG 2024, Kyiv, Ukraine, 14–15 May 2024; Volume 3806, pp. 372–388. Available online: https://ceur-ws.org/Vol-3806/S_2_Palagin.pdf (accessed on 9 June 2025).

Figure 1. Data clustering using UMAP + HDBSCAN, with models trained on Norm using the Euclidean metric: (a) clusters found in Norm dataset; (b) clusters predicted for patients’ dataset.

Figure 2. Data clustering using UMAP + HDBSCAN, with models trained on PATIENTS using the Euclidean metric: (a) clusters found in PATIENTS’ dataset; (b) clusters predicted for Norm dataset.

Figure 3. Data clustering using UMAP + HDBSCAN, with models trained on Norm using the Manhattan metric: (a) clusters found in Norm dataset; (b) clusters predicted for PATIENTS’ dataset.

Figure 4. Data clustering using UMAP + HDBSCAN, with models trained on PATIENTS using the Manhattan metric: (a) clusters found in PATIENTS’ dataset; (b) clusters predicted for Norm dataset.

Table 1. The model’s hyperparameters values.

Training Dataset	Metric	Hyperparameters Values
		UMAP		HDBSCAN
		N Neighbors	Min. Dist.	Min. Cluster Size	Leaf Size
Norm	Euclidean	20	0.15	182	1337
Patients	Euclidean	7	0.08	50	435
Norm	Manhattan	50	0.05	182	1337
Patients	Manhattan	15	0.07	40	25

Table 2. Features with a significant mode difference in the datasets.

Greater in Norm		Greater in Patients
Feature Name	Means Ratio M₁/M₂	Feature Name	Means Ratio M₂/M₁
T AMP NORMALIZED lead I	55.47173	VLF	495.061
T AMP NORMALIZED lead II	29.00111	LF	157.83
T AMP NORMALIZED lead AVL	27.855	DFA	53.1785
T AMP NORMALIZED lead AVF	14.00366	SYNDROMIC ECG ANALYSIS	51.0873
J40 AMP lead I	10.84904	TOTAL POWER	38.6041
ST DISLOCATION lead I	10.80562	ST DISLOCATION lead AVL	30.5705
S AMP lead AVR	9.075496	Q R RATIO lead AVR	22.4853
T AMP NORMALIZED lead III	7.578434	Q AMP lead AVR	19.0826
R T RATIO lead AVR	5.269837	HF	9.83694
Q AMP lead I	4.956455	ACTIVITY OF VASOMOTOR CENTERS	9.46108
P AMP lead III	4.227044	Q R RATIO lead III	6.34422
P AMP lead AVF	4.057967	Q R RATIO lead AVF	5.73973
P AMP lead II	3.900239	LF HF	4.99509
T AMP lead AVF	3.724387	LFn	3.60021
T AMP lead II	3.630603	SDSD	3.58783
ALPHA QRS ANGLE IN THE FRONTAL PLANE	3.578768	Q R RATIO lead II	3.54654
S AMP lead AVL	3.512038	T SYMMETRY AREAS OF TRIANGLES lead III	3.02674
P AMP lead I	3.413082
P AREA lead I	3.283533
R AMP lead AVF	3.210154
T AMP lead III	3.065699
T AMP lead I	3.04724

Table 3. The data distribution between the clusters, %.

Euclidian Metric, Training on the Norm Dataset
Initial Clusters in Norm				Clusters Prediction for PATIENTS
Cluster 0		Cluster 1		Cluster 0		Cluster 1
56.982		43.013		86.994		13.006
Euclidian metric, training on the PATIENTS’ dataset
Initial clusters in PATIENTS				Clusters prediction for Norm
Cluster 0	Cluster 1		Cluster 2	Cluster 0	Cluster 1		Cluster 2
22.961	9.946		67.093	80.020	0.1279		19.852
Manhattan metric, training on the Norm dataset
Initial clusters in Norm				Clusters prediction for PATIENTS
Cluster 0	Cluster 1		Cluster 2	Cluster 0	Cluster 1		Cluster 2
43.040	9.110	47.850	7.483	4.319	88.198	43.040	9.110
Manhattan metric, training on the PATIENTS’ dataset
Initial clusters in PATIENTS				Clusters prediction for Norm
Cluster 0	Cluster 1	Cluster 2	Cluster 3	Cluster 0	Cluster 1	Cluster 2	Cluster 3
7.506	22.652	2.257	67.384	0.047	78.182	0.895	20.876

Table 4. The most distinguishing features between the clusters obtained on Norm dataset using Euclidean metric.

Feature Name	Mean Values in the Clusters
Feature Name	Cluster 0	Cluster 1
DFA	0.047492	0.000383
ALPHA QRS ANGLE IN THE FRONTAL PLANE	19.62716	−0.20719
S AMP lead AVF	−50.573	−158.372
ST DISLOCATION lead I	16.63566	57.37463
J40 AMP lead I	16.63566	57.37463
ST DISLOCATION lead II	19.67757	73.12592
T AMP NORMALIZED lead AVF	9.16226	37.29108
R AMP lead III	150.8576	622.8024
R AMP lead AVF	272.4523	1138.58
ST DISLOCATION lead AVL	11.01877	46.47051
R AMP lead II	434.6667	1944.364
T AMP NORMALIZED lead AVL	5.575907	24.95848
ST DISLOCATION lead AVF	11.37391	51.72439
R AMP lead I	333.44	1530.54
T AMP NORMALIZED lead II	12.68615	59.17196
R AMP lead AVL	144.3115	716.4888
S AMP lead III	−105.133	−540.407
Q AMP lead AVL	−15.881	−83.5429
P AMP lead III	43.19233	232.7455
T AMP NORMALIZED lead I	9.767729	55.07774
S AMP lead AVR	−466.623	−2753.23
P AMP lead AVF	55.76281	330.5697
T AMP lead III	64.73436	397.7896
P AMP lead II	72.24876	450.5428
Q AMP lead AVR	−4.62043	−28.8315
T AREA lead I	16.5706	106.0358
T AMP lead AVF	128.7192	831.6197
T AMP lead II	208.29	1364.565
P AMP lead AVR	0.967887	6.608791
T AMP lead I	163.3263	1116.124
T AMP lead AVL	69.50201	494.6684
P AMP lead I	44.17025	314.9606
Q AMP lead III	−19.1682	−138.423
Q AMP lead I	−23.761	−173.752
Q AMP lead AVF	−21.907	−162.325
P AREA lead I	2.039669	15.25809
P AMP lead AVL	19.5013	147.8871
ST DISLOCATION lead III	4.187131	33.4458
QRS AREA lead I	13.50508	112.1638
Q AMP lead II	−26.8857	−232.533
R AMP lead AVR	34.83518	332.9271
T AMP lead AVR	0.469894	5.185672
DFA	0.047492	0.000383
ALPHA QRS ANGLE IN THE FRONTAL PLANE	19.62716	−0.20719
S AMP lead AVF	−50.573	−158.372
ST DISLOCATION lead I	16.63566	57.37463
J40 AMP lead I	16.63566	57.37463
ST DISLOCATION lead II	19.67757	73.12592
T AMP NORMALIZED lead AVF	9.16226	37.29108
R AMP lead III	150.8576	622.8024
R AMP lead AVF	272.4523	1138.58
ST DISLOCATION lead AVL	11.01877	46.47051
R AMP lead II	434.6667	1944.364
T AMP NORMALIZED lead AVL	5.575907	24.95848
ST DISLOCATION lead AVF	11.37391	51.72439
R AMP lead I	333.44	1530.54
T AMP NORMALIZED lead II	12.68615	59.17196
R AMP lead AVL	144.3115	716.4888

Table 5. The most distinguishing features between all three clusters obtained on Norm dataset using Manhattan metric.

Feature Name	Mean Values in the Clusters
Feature Name	Cluster 0	Cluster 1	Cluster 2
ST DISLOCATION lead I	57.50899	3.440177	19.00408
T AMP NORMALIZED lead I	55.0737	0.903335	11.4335
ST DISLOCATION lead II	73.12709	5.511817	22.34336
T AMP NORMALIZED lead II	59.41575	1.066753	14.65286
ST DISLOCATION lead AVL	46.61763	1.395126	12.69868
T AMP NORMALIZED lead AVL	24.9708	0.591232	6.502916
ST DISLOCATION lead AVF	51.68204	3.73486	12.84364
T AMP NORMALIZED lead AVF	37.31894	1.110758	10.65424
P AMP lead I	315.2873	4.234121	51.32719
R AMP lead I	1530.143	119.3567	373.882
T AMP lead I	1116.888	34.76588	186.5794
P AREA lead I	15.27466	0.250369	2.357987
QRS AREA lead I	112.2364	4.344904	15.12823
T AREA lead I	106.1024	3.462334	18.95599
J40 AMP lead I	57.50899	3.440177	19.00408
P AMP lead II	451.1748	8.621123	83.58113
R AMP lead II	1942.194	135.209	492.7809
T AMP lead II	1364.989	50.32053	237.3328
P AMP lead III	233.1263	4.707533	50.07002
T AMP lead III	397.6053	13.7969	74.41043
P AMP lead AVR	6.574175	0.3161	1.119938
R AMP lead AVR	333.5463	9.764402	38.88358
T AMP lead AVR	5.19134	0.159527	0.521232
P AMP lead AVL	147.9354	1.105613	22.8878
T AMP lead AVL	495.1102	9.745199	80.24213
P AMP lead AVF	331.081	6.519202	64.52348
R AMP lead AVF	1138.847	85.77474	307.2656
T AMP lead AVF	831.7527	27.50222	147.4743

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kaverinskiy, V.; Chaikovsky, I.; Mnevets, A.; Ryzhenko, T.; Bocharov, M.; Malakhov, K. Scalable Clustering of Complex ECG Health Data: Big Data Clustering Analysis with UMAP and HDBSCAN. Computation 2025, 13, 144. https://doi.org/10.3390/computation13060144

AMA Style

Kaverinskiy V, Chaikovsky I, Mnevets A, Ryzhenko T, Bocharov M, Malakhov K. Scalable Clustering of Complex ECG Health Data: Big Data Clustering Analysis with UMAP and HDBSCAN. Computation. 2025; 13(6):144. https://doi.org/10.3390/computation13060144

Chicago/Turabian Style

Kaverinskiy, Vladislav, Illya Chaikovsky, Anton Mnevets, Tatiana Ryzhenko, Mykhailo Bocharov, and Kyrylo Malakhov. 2025. "Scalable Clustering of Complex ECG Health Data: Big Data Clustering Analysis with UMAP and HDBSCAN" Computation 13, no. 6: 144. https://doi.org/10.3390/computation13060144

APA Style

Kaverinskiy, V., Chaikovsky, I., Mnevets, A., Ryzhenko, T., Bocharov, M., & Malakhov, K. (2025). Scalable Clustering of Complex ECG Health Data: Big Data Clustering Analysis with UMAP and HDBSCAN. Computation, 13(6), 144. https://doi.org/10.3390/computation13060144

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Scalable Clustering of Complex ECG Health Data: Big Data Clustering Analysis with UMAP and HDBSCAN

Abstract

1. Introduction

2. Related Works

3. Materials and Methods

4. General Statistics of the Initial Datasets

5. Results

5.1. Data Cauterization Analysis in Low-Dimensional Space

5.1.1. Euclidean Metric

5.1.2. Manhattan Metric

5.2. Overall Analysis of Data Clustering

5.2.1. Data Distribution Between the Clusters

5.2.2. Certain Features Values Differences Between the Clusters

6. Discussion

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI