Cardiovascular Diseases Diagnosis Using an ECG Multi-Band Non-Linear Machine Learning Framework Analysis

Background: cardiovascular diseases (CVDs), which encompass heart and blood vessel disorders, stand as the leading cause of global mortality. Methods: the present study discriminates between seven well-known CVDs (bundle branch block, cardiomyopathy, myocarditis, myocardial hypertrophy, myocardial infarction, valvular heart disease, and dysrhythmia) and one healthy control group by feeding a set of machine learning (ML) models with 10 non-linear features extracted every 1 s from electrocardiography (ECG) lead signals of a well-known ECG database (PTB diagnostic ECG database), using multi-band analysis performed by the discrete wavelet transform (DWT). The ML models were trained and tested using a leave-one-out cross-validation approach, assessing the individual and combined capabilities of features, per lead and across leads, to distinguish between pairs of study groups, and a comprehensive all vs. all analysis was also conducted. Results: the Accuracy of discrimination ranged between 73% and 100%, the Recall between 68% and 100%, and the AUC between 0.42 and 1. Conclusions: the results suggest that our method is a good tool for distinguishing CVDs, offering significant advantages over other studies that used the same dataset, including a multi-class comparison group (all vs. all), a wider range of binary comparisons, and the use of classical non-linear analysis under ECG multi-band analysis performed by DWT.


Introduction
Heart and blood vessel problems, known as cardiovascular diseases (CVDs), are the leading cause of death worldwide [1]. According to the World Health Organization, 32% of global mortality is attributed to cardiovascular diseases, with the most prevalent being arrhythmias, cardiac arrests, and heart failure. It is estimated that CVDs take about 17.9 million lives every year [2]. Focusing on cardiac pathology, and considering how much work the heart constantly does, it is remarkable that it functions so well for so long in most people. However, it can also experience problems and stop working properly due to risk factors such as cholesterol, high blood pressure, cigarette smoking, diabetes mellitus, and adiposity [3].
The diagnosis of heart disease begins with evaluating the patient's medical history and conducting a physical examination. Afterwards, laboratory tests and/or additional non-invasive and invasive diagnostic exams can be performed [2]. Natriuretic peptides are the most common laboratory tests used to diagnose heart diseases; they can help identify individuals at higher risk of sudden cardiac death in the general population or in patients with coronary artery disease [11]. However, several other non-invasive and invasive tests can be performed: (1) electrocardiogram (ECG) and ambulatory monitoring: the 12-lead ECG is a key diagnostic test for cardiovascular diseases, assessing risk and identifying arrhythmias [11]; the monitoring time is chosen based on symptom frequency, with Holter monitoring for daily arrhythmias, patient-activated ECG for less frequent events, and implantable loop recorders (ILRs) for serious cases [11,15]; (2) stress tests: the heart is monitored during treadmill/bike exercise to assess its response and detect exercise-related disorders such as arrhythmias, ventricular tachycardia, coronary artery disease, and long QT syndrome [11,16]; exercise tests aid in diagnosing long QT syndrome by measuring the QTc interval after 4 min of exercise [16]; (3) imaging tests: essential for assessing heart function and detecting problems such as cardiomyopathies [17]; negative results may indicate primary electrical diseases [4]; (4) electrophysiological study: an exam to diagnose and guide treatment, involving the measurement of cardiac intervals, controlled electrical stimulation, and mapping of heart structures; its effectiveness varies based on the heart condition, the presence of spontaneous ventricular tachycardia, medication use, and the stimulation mode [18]; (5) provocative diagnostic tests: sodium channel blockers, adenosine, or epinephrine are used to detect syndromes, while acetylcholine or ergonovine is used to assess coronary spasm as the cause of ventricular fibrillation [19]; (6) genetic testing: next-generation sequencing has made genetic testing accessible, and comprehensive gene panels reveal variations causing or modifying features in syndromes such as Brugada, long QT, and hypertrophic and dilated cardiomyopathy [20]; (7) cardiac catheterisation: a catheter is inserted into a blood vessel and guided to the heart with X-ray images and dye to check for blockages [11].
In recent years, there has been a notable surge in computational power, driven by advanced hardware, parallel computing, cloud resources, and increased data accessibility. These developments have significantly enhanced the applications of machine learning (ML) in the diagnosis of CVDs [21]. The role of ML in CVDs is pivotal, as it harnesses data from medical tests to improve diagnostics and management, reducing human error, improving efficiency, and enhancing patient outcomes [22]. It contributes to early disease detection, precise risk assessment, advanced image analysis, predictive modelling, tailored treatment plans, remote patient monitoring, and expedited drug discovery [23]. However, ongoing research is essential to further enhance this critical field and save lives. Thus, this study focuses on ML-based ECG signal analysis approaches for discriminating CVDs, and Table 1 presents the state of the art on this topic. The heart, operating as a non-linear system, manifests its electrical activity through the ECG signal [24]. This inherent non-linearity underscores the inadequacy of traditional linear analyses and standard clinical features in comprehensively capturing the intricate dynamics of the ECG signal. The complexity is further underscored by the challenges posed to deep learning tools, as the features they extract may lack an explainable interpretation. Consequently, a deeper comprehension of how these tools reach and compute features becomes imperative for a more robust interpretation of ECG signals. Unlike prevailing state-of-the-art methods for this topic (Table 1), which have typically abstained from incorporating non-linear feature extraction in their methodology, our study aims to explore a non-linear approach to ECG analysis supported by classical ML tools. By doing so, we seek a more comprehensive understanding that embraces the inherent complexities of the heart's electrical behaviour to improve CVD diagnosis. To that end, we defined three objectives for this study. Their fulfilment will provide insights into the predictive power of these non-linear features, both independently and synergistically, contributing to a comprehensive understanding of their impact on CVD discrimination.
Finally, in terms of structure, the paper is divided into five major sections. In Section 2, the applied methodology, including the database, signal processing, and feature extraction, is explained. The study results are presented in Section 3 and discussed in Section 4. Finally, Section 5 draws the study conclusions.

Methodology
This proposed methodology, illustrated in Figure 1, is split into four main parts, described in the subsections below.

[Figure 1: methodology phases, including machine learning classification and statistical analysis.]

Experimental Setup
This study involved the use of two distinct programming languages: MATLAB and Python. MATLAB (version R2022a) was employed to eliminate noise from the ECG signal, extract non-linear features from the ECG data, and compress and structure the data for classification purposes. Python (version 3.9.12) was utilised to develop and implement various ML models and generate a discrimination report based on the obtained results. The choice between the MATLAB and Python programming languages is driven by optimisation needs: MATLAB is particularly proficient in signal processing and feature extraction, with highly optimised toolboxes, whereas Python takes the lead in optimising ML models.
This research was conducted using a MacBook Pro 14 equipped with an M1 Pro chip featuring an 8-core CPU, a 14-core GPU, and 16 GB of RAM.

Database Characterisation
The PTB diagnostic ECG database [43] comprises data from seven distinct cardiovascular disease groups as well as a healthy control group. This dataset consists of a total of 512 ECG records, each containing the 12 conventional leads (I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, and V6) along with the 3 Frank leads (Vx, Vy, and Vz). The ECG data were digitised at a sampling frequency of 1000 Hz.
Table 2 represents the number of ECGs per diagnostic class present in the database.

Artifacts Removal
The raw ECG signals in the database contained artifacts. To ensure signal quality, records showing artifacts were deleted in their entirety. In the beginning, the database had 512 records; after the removal stage, the number of signals available for the subsequent tasks was reduced to 483 ECG records. Table 3 presents the number of ECGs per diagnostic class after the removal.

Signal Normalisation
The ECG signals, x(n), were loaded into MATLAB ® and normalised according to the following equation [44]:

\bar{x}(n) = \frac{x(n)}{\sqrt{\frac{1}{N}\sum_{k=1}^{N} x^{2}(k)}},

where N represents the signal's length. Then its mean value was removed.
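The two-step procedure (amplitude normalisation followed by mean removal) can be sketched as below; `normalise_ecg` is a hypothetical helper name, and the division by the RMS value (which uses the signal length N, as in the text) is an assumed form of the normalisation equation from [44]:

```python
import numpy as np

def normalise_ecg(x):
    """Amplitude-normalise an ECG segment, then remove its mean.

    A minimal sketch: RMS normalisation is assumed here, not taken
    verbatim from reference [44].
    """
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))   # sqrt((1/N) * sum of x^2)
    x_norm = x / rms
    return x_norm - np.mean(x_norm)  # mean removal, as described above

ecg = 3.0 * np.sin(np.linspace(0, 2 * np.pi, 1000)) + 0.5  # toy 1 s segment
y = normalise_ecg(ecg)
print(round(float(np.mean(y)), 6))
```

The output confirms a zero-mean, unit-scale segment, which keeps feature values comparable across leads and patients.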

Multi-Band Decomposition via Wavelet Transform and Features Extraction
The discrete-time wavelet transform (DWT) is a powerful technique used to analyse discrete-time signals with finite energy. It involves breaking down the signal into a set of basis functions composed of a limited number of prototype sequences and their time-shifted variations. This process, as described in Guido's research in 2022 [45], offers significant advantages for analysing signals in the time-frequency domain. By seamlessly transitioning between the time and frequency domains, it enables the localisation of the sources of frequency components in time.
To perform the decomposition and subsequent reconstruction, an octave-band critically decimated filter bank is employed. This approach, pioneered by Malvar in 1992 and further developed by Vetterli in 1995 [46,47], provides an effective framework. When considering only the positive frequencies, each sub-band in the transform is confined to a specific range: the mth detail sub-band, m = 1, ..., S, occupies

\left[\frac{\pi}{2^{m}}, \frac{\pi}{2^{m-1}}\right],

and the remaining approximation sub-band occupies \left[0, \frac{\pi}{2^{S}}\right], where S is the number of levels, S + 1 is the number of sub-bands, and π is the normalised angular frequency equivalent to half the sampling rate.
The DWT employs an analysis scaling function, denoted as φ1(n), and an analysis wavelet function, denoted as ψ1(n), built respectively from h LP (n) and h HP (n), the impulse responses of the analysis filters for the half-band low-pass and high-pass components. Through recursion formulas based on the convolution operation, denoted by the symbol " * ", the analysis filter corresponding to the mth sub-band is obtained, and the mth sub-band signal is computed by filtering the input with it and critically decimating the result.
In this research, the DWT was employed to decompose each ECG segment of 1 s length into sub-bands (x m (n)) up to level three (S = 3). The applied wavelet was Symlet7, which has proved suitable for ECG signal analysis down to decomposition level 3 [48,49]. To ensure consistency with the original sampling rate, the sub-band signals, x m (n), underwent re-sampling using the wavelet interpolation method [50]. After that, 10 non-linear features (see Table 4 for more information) were collected from each 1 s sub-band segment across the total 10 s signal length. Then, the resulting time series per feature and sub-band were compressed over time by 6 distinct statistical functions: average (Avg), standard deviation (Std), 95th percentile (P95), variance (Var), median (Med), and kurtosis (Kur) [49]. At the end of the process, the data matrix, comprising all 10-second feature time series vectors extracted from all sub-bands over time for all patients, underwent normalisation using the z-score method [51].
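The decomposition pipeline can be sketched in Python with PyWavelets (the study itself used MATLAB); the stand-in random signal, the linear re-interpolation of sub-bands (the paper used wavelet interpolation), and the use of signal energy as the single example feature are assumptions for illustration:

```python
import numpy as np
import pywt
from scipy.stats import kurtosis

fs = 1000                                                # PTB sampling rate (Hz)
x = np.random.default_rng(0).standard_normal(10 * fs)    # stand-in for a 10 s lead

# Level-3 DWT with Symlet7 -> S + 1 = 4 sub-bands (A3, D3, D2, D1)
coeffs = pywt.wavedec(x, 'sym7', level=3)

# Re-sample each sub-band back to the original length (linear interpolation
# here is only a stand-in for the paper's wavelet interpolation method)
subbands = [np.interp(np.linspace(0, len(c) - 1, len(x)), np.arange(len(c)), c)
            for c in coeffs]

# One illustrative feature (signal energy) per 1 s window, per sub-band ...
feature_series = np.array([[np.sum(sb[i * fs:(i + 1) * fs] ** 2)
                            for i in range(10)] for sb in subbands])

# ... then compressed over time by the paper's 6 statistics
stats = {'Avg': np.mean, 'Std': np.std,
         'P95': lambda v, axis: np.percentile(v, 95, axis=axis),
         'Var': np.var, 'Med': np.median, 'Kur': kurtosis}
compressed = {name: f(feature_series, axis=1) for name, f in stats.items()}
print(len(subbands), feature_series.shape, len(compressed))
```

With 10 features instead of one, this yields the 10 × 4 × 6 = 240 values per lead described later in Section 2.5.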

Feature Equation Definition
Approximate Entropy (ApEn)
In the ApEn definition, Θ is the Heaviside step function and m is the embedding dimension [52].
ApEn evaluates the likelihood that similar patterns within the data will remain similar when additional data points are included.The lower the ApEn value is, the more regular or predictable the data are, whereas a higher ApEn value suggests greater complexity or irregularity.
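A minimal ApEn sketch under common parameter assumptions (m = 2 and tolerance r = 0.2 × std; the paper does not state its exact choices here):

```python
import numpy as np

def approx_entropy(x, m=2, r=None):
    """Approximate entropy of a 1-D series (illustrative defaults)."""
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * np.std(x)
    def phi(m):
        n = len(x) - m + 1
        emb = np.array([x[i:i + m] for i in range(n)])
        # Chebyshev distance; the Heaviside step counts matches within r
        dist = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=2)
        return np.mean(np.log(np.mean(dist <= r, axis=1)))
    return phi(m) - phi(m + 1)

regular = np.tile([1.0, 2.0], 50)                       # predictable pattern
noisy = np.random.default_rng(1).standard_normal(100)   # irregular series
print(approx_entropy(regular) < approx_entropy(noisy))  # regular -> lower ApEn
```

This matches the interpretation above: the periodic series scores near zero, while the random series scores distinctly higher.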

Correlation Dimension (CorrDim)
where Θ(x) is the Heaviside step function, X i and X j are position vectors on the attractor, l is the distance under consideration, k is the summation offset, and M is the number of vectors reconstructed from x(n) [52].
CorrDim is used to measure self-similarity; higher CorrDim values mean a higher degree of complexity and less similarity.
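A sketch of the correlation-sum computation behind CorrDim; the embedding dimension, delay, and the radii (taken as percentiles of the pairwise distances) are illustrative assumptions:

```python
import numpy as np

def correlation_sums(x, m=2, tau=1, probs=(10, 25, 50)):
    """Correlation sums C(l); CorrDim is the slope of log C(l) vs log l."""
    x = np.asarray(x, dtype=float)
    M = len(x) - (m - 1) * tau                    # number of reconstructed vectors
    emb = np.array([x[i:i + (m - 1) * tau + 1:tau] for i in range(M)])
    d = np.sqrt(((emb[:, None, :] - emb[None, :, :]) ** 2).sum(axis=2))
    pair_d = d[np.triu_indices(M, k=1)]           # all i < j pairs
    radii = np.percentile(pair_d, probs)
    # Heaviside step: fraction of pairs closer than each radius l
    C = np.array([np.mean(pair_d < l) for l in radii])
    return radii, C

x = np.random.default_rng(2).uniform(size=500)    # uncorrelated toy series
radii, C = correlation_sums(x)
slope = np.polyfit(np.log(radii), np.log(C), 1)[0]
print(round(float(slope), 2))
```

For uncorrelated noise in a 2-D embedding the slope approaches the embedding dimension, whereas a low-dimensional deterministic signal would saturate at a smaller value.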
Detrended Fluctuation Analysis (DFA)

F(n) = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left[y(k) - y_{n}(k)\right]^{2}}

where N is the length, y n (k) is the local trend, and y(k) is defined as

y(k) = \sum_{i=1}^{k}\left[x(i) - \bar{x}\right]

with x(i) as the inter-beat interval and \bar{x} as its average [53].
DFA is a technique for measuring the power-law scaling also observed through R/S analysis.
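The DFA definition above translates directly into code; the window sizes are illustrative choices:

```python
import numpy as np

def dfa_alpha(x, scales=(4, 8, 16, 32, 64)):
    """DFA: integrate the mean-removed series into the profile y(k),
    remove a linear local trend y_n(k) in each window of size n, and
    return the scaling exponent from log F(n) vs log n."""
    x = np.asarray(x, dtype=float)
    y = np.cumsum(x - np.mean(x))                 # integrated profile y(k)
    F = []
    for n in scales:
        f2 = []
        for w in range(len(y) // n):
            seg = y[w * n:(w + 1) * n]
            k = np.arange(n)
            trend = np.polyval(np.polyfit(k, seg, 1), k)  # local trend y_n(k)
            f2.append(np.mean((seg - trend) ** 2))
        F.append(np.sqrt(np.mean(f2)))
    return np.polyfit(np.log(scales), np.log(F), 1)[0]

white = np.random.default_rng(3).standard_normal(2000)
alpha = dfa_alpha(white)
print(round(float(alpha), 2))  # uncorrelated noise gives alpha near 0.5
```

Exponents near 0.5 indicate uncorrelated fluctuations; values towards 1 indicate long-range correlations, which is why DFA is informative for inter-beat interval series.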
Energy (En)

E = \sum_{n=1}^{N} x^{2}(n)

En is the capacity of a system to perform work [54].

Higuchi Fractal Dimension (H)
where k is the number of composed sub-series and L(k) is the averaged curve length.
H estimates the fractal dimension of a time series signal [55].
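A compact sketch of Higuchi's algorithm; kmax = 8 is an illustrative choice:

```python
import numpy as np

def higuchi_fd(x, kmax=8):
    """Higuchi fractal dimension: for each scale k, build k sub-series,
    average their normalised curve lengths L(k), and take the slope of
    log L(k) versus log(1/k)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    L = []
    ks = np.arange(1, kmax + 1)
    for k in ks:
        lengths = []
        for m in range(k):
            idx = np.arange(m, N, k)               # the mth sub-series
            norm = (N - 1) / ((len(idx) - 1) * k)  # Higuchi's normalisation
            lengths.append(np.sum(np.abs(np.diff(x[idx]))) * norm / k)
        L.append(np.mean(lengths))
    return np.polyfit(np.log(1.0 / ks), np.log(L), 1)[0]

line = np.linspace(0, 1, 1000)                          # smooth curve: FD ~ 1
noise = np.random.default_rng(4).standard_normal(1000)  # white noise: FD ~ 2
print(round(float(higuchi_fd(line)), 2), round(float(higuchi_fd(noise)), 2))
```

The two extremes bracket the usable range: a smooth deterministic signal sits near 1, a maximally irregular one near 2.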

Hurst Exponent (EH) where q is the order of the moments of the distribution increments, ν is the time resolution, τ is the incorporation time delay of the attractor, and t is the period of a given time series signal X(t) [56].
EH quantifies how chaotic or unpredictable a time series is.
Katz Fractal Dimension (K)
K estimates the fractal dimension through a waveform analysis of a time series [56].
Logarithmic Entropy (LogEn)
LogEn quantifies the average amount of information (in bits) needed to represent each event in the probability distribution. Higher logarithmic entropy values indicate greater unpredictability or randomness in the distribution, while lower values suggest more certainty or order [54].
Lyapunov Exponent (ELya)
ELya evaluates the system's predictability and sensitivity to change.
Shannon Entropy (ShaEn)
ShaEn is measured in bits when the base-2 logarithm (log2) is used. This means that the result quantifies the average number of bits required to represent each outcome in a given probability distribution. Higher entropy values indicate greater uncertainty, unpredictability, or randomness in the distribution, while lower values suggest more order or certainty [54].
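Both entropies can be sketched in a few lines; the log-energy form used for LogEn below is an assumption, as the paper's exact formula is not reproduced here:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

def log_energy_entropy(x):
    """One common 'log-energy' form for LogEn (an assumption here)."""
    x = np.asarray(x, dtype=float)
    e = x[x != 0] ** 2
    return float(np.sum(np.log(e)))

print(shannon_entropy([0.5, 0.5]))    # 1.0 bit: a fair coin flip
print(shannon_entropy([0.9, 0.1]))    # < 1.0: more certainty, lower entropy
```

The fair-coin case yields exactly one bit, the textbook maximum for a two-outcome distribution, matching the interpretation above.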

Data Driven Framework Analysis
2.5.1. Individual Feature Power Analysis over Binary Groups
The evaluation of the discriminating power of each feature distribution between pairs of study groups (all 28 pairwise combinations among VHD, M, MI, MH, HC, Dis, CardMyo, and BBB) was conducted using the XROC classifier [58], a binary classifier working within a leave-one-out cross-validation process, together with the Mann-Whitney test. A total of 3600 features, consisting of 10 non-linear feature time series compressed over time by (×) 6 statistical measures over (×) 4 sub-bands for each of the (×) 15 leads per participant, were individually assessed to measure their potential to differentiate between these groups. The methodology variation used to perform individual feature assessment for discrimination is signalled in Figure 1. It should be noted that the normality and homoscedasticity of each of the time series feature vector distributions were assessed, for distinguishing binary classes, with the Kolmogorov-Smirnov test (MATLAB function kstest) and Levene's test, respectively. The assumptions of parametric tests were not met, so we applied a non-parametric test, the Mann-Whitney test.
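Since XROC is not a publicly available classifier, the sketch below pairs the Mann-Whitney test with a simple leave-one-out midpoint-threshold rule as a stand-in; the feature distributions are synthetic:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(5)
# Synthetic stand-ins for one feature's distribution in two groups
feat_a = rng.normal(0.0, 1.0, 40)   # e.g. HC
feat_b = rng.normal(1.5, 1.0, 25)   # e.g. BBB

# Non-parametric test, applied because the parametric assumptions failed
stat, p = mannwhitneyu(feat_a, feat_b, alternative="two-sided")

# Stand-in for the XROC step: leave-one-out, threshold at class midpoints
values = np.concatenate([feat_a, feat_b])
labels = np.concatenate([np.zeros(len(feat_a)), np.ones(len(feat_b))])
correct = 0
for i in range(len(values)):
    mask = np.ones(len(values), dtype=bool)
    mask[i] = False                                  # hold one sample out
    thr = (values[mask][labels[mask] == 0].mean()
           + values[mask][labels[mask] == 1].mean()) / 2
    correct += (values[i] > thr) == (labels[i] == 1)
accuracy = correct / len(values)
print(p < 0.05, round(accuracy, 2))
```

Note how a midpoint threshold, like the averaging inside XROC described in the Discussion, is sensitive to outliers and class imbalance.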

Combined Features Power Analysis for Groups Discrimination Using Scikit-Learn ML Models
In this case, the models' performance in discriminating between pairs of study groups and in the All vs. All setting was evaluated by feeding 19 selected scikit-learn ML models [59], presented in Table 5, with combined features: 240 features (10 features extracted from (×) 4 sub-bands and compressed by (×) 6 statistics) for the individual-lead case, or 3600 features (240 features per lead × 15 leads) for the combined-leads case, for each group comparison, within a leave-one-out cross-validation procedure. The methodology variation used to perform combined feature assessment for discrimination is signalled in Figure 1. The models' performance evaluation was carried out using 9 metrics: Accuracy, Precision, Recall, F1-Score, AUC, Kappa, MCC, CSI, and Gmean.
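A minimal sketch of the combined-features evaluation with scikit-learn, using one of the listed model families (a linear SVC) inside leave-one-out cross-validation; the 60 × 240 design matrix is synthetic, not taken from the PTB data:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(6)
# Hypothetical design matrix: 60 patients x 240 combined features
# (10 features x 4 sub-bands x 6 statistics) for one binary comparison
X = rng.standard_normal((60, 240))
y = np.repeat([0, 1], 30)
X[y == 1, :10] += 1.5            # inject a synthetic class difference

model = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
preds = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    model.fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])
accuracy = float(np.mean(preds == y))
print(round(accuracy, 2))
```

The same loop, repeated per model and per comparison group, yields the per-metric heatmap summarised in Figure 3.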
The Accuracy represents the number of correctly classified cases with respect to all cases [60] and can be defined as

Accuracy = \frac{TP + TN}{TP + TN + FP + FN},

where TP, TN, FP, and FN are, respectively, the true positives, true negatives, false positives, and false negatives [61].
The Precision, also known as the positive predictive value, shows the proportion of well-classified positive cases to the total cases predicted as positive [62]. The Precision can be defined as

Precision = \frac{TP}{TP + FP}.

The Recall, defined as

Recall = \frac{TP}{TP + FN},

represents the proportion of correctly predicted positive cases with respect to the total number of positive cases [62].
The F1-Score is the harmonic average between the Recall and the Precision [63], and the equation is defined as

F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}.

The Kappa normalises the Accuracy by the possibility of agreement by chance [64] and is defined as

Kappa = \frac{Accuracy - p_{e}}{1 - p_{e}},

where p_e is the expected agreement by chance. The MCC is useful for uneven data [65]; it varies between −1 and 1, with −1 as the worst scenario and 1 as the best. It is defined as

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.

The CSI provides a more nuanced evaluation of a binary classification model's effectiveness by considering both the correct identification of positive instances and the ability to avoid false positives [66]. The CSI equation can be defined as

CSI = \frac{TP}{TP + FN + FP}.

The Gmean is a measure that considers the balance between the performance of all classes; the higher the value, the lower the risk of model over-fitting. It is defined as

Gmean = \sqrt{Recall \times Specificity}, \quad \text{where} \quad Specificity = \frac{TN}{TN + FP}.

The area under the curve (AUC) of the receiver operating characteristic (ROC) curve is a metric that evaluates how well a model can distinguish between positive and negative classes. It achieves this by comparing the rate of TP against the rate of FP at different classification thresholds. The value of AUC ranges between 0 and 1, with a perfect classifier resulting in a value of 1, while a random classifier has an AUC of 0.5. Using the AUC allows for a single-value measure of a model's performance. This is especially useful for comparing models and assessing performance in scenarios where there is an imbalance between classes [67].
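The metric definitions above can be verified numerically from the four confusion counts; `binary_metrics` is a hypothetical helper and the counts are arbitrary example values:

```python
import math

def binary_metrics(TP, TN, FP, FN):
    """Compute the threshold-based metrics from the confusion counts."""
    n = TP + TN + FP + FN
    acc = (TP + TN) / n
    prec = TP / (TP + FP)
    rec = TP / (TP + FN)
    spec = TN / (TN + FP)
    f1 = 2 * prec * rec / (prec + rec)
    csi = TP / (TP + FN + FP)
    gmean = math.sqrt(rec * spec)
    mcc = (TP * TN - FP * FN) / math.sqrt(
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    # Kappa: observed accuracy corrected for chance agreement p_e
    p_e = ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / n ** 2
    kappa = (acc - p_e) / (1 - p_e)
    return dict(Accuracy=acc, Precision=prec, Recall=rec, Specificity=spec,
                F1=f1, CSI=csi, Gmean=gmean, MCC=mcc, Kappa=kappa)

m = binary_metrics(TP=40, TN=45, FP=5, FN=10)
print({k: round(v, 3) for k, v in m.items()})
```

Running this on the example counts makes the relationships concrete, e.g. Kappa falls well below Accuracy once chance agreement is discounted.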

Results
Table 6 displays the individual features' discrimination power that yielded the best results for the statistical and XROC analyses conducted across all 28 binary comparisons (VHD vs. M, VHD vs. MI, VHD vs. MH, VHD vs. HC, VHD vs. Dis, VHD vs. CardMyo, VHD vs. BBB, M vs. MI, M vs. MH, M vs. HC, M vs. Dis, M vs. CardMyo, M vs. BBB, MI vs. MH, MI vs. HC, MI vs. Dis, MI vs. CardMyo, MI vs. BBB, MH vs. HC, MH vs. Dis, MH vs. CardMyo, MH vs. BBB, HC vs. Dis, HC vs. CardMyo, HC vs. BBB, Dis vs. CardMyo, Dis vs. BBB, and CardMyo vs. BBB). Table 7 shows the number of occasions on which a feature distribution was shown to be significant (p < 0.05) for separating binary classes. Figure 2 illustrates the violin plots for the comparison groups where there was a significant difference, as reported in Table 6. The classification results regarding the combined feature power analysis performed by the ML classifiers can be found as a heatmap in Figure 3. The direct comparison between the individual and combined feature power analyses for discrimination is shown in Figure 4.

Discussion
For a more comprehensive discussion, we divided this section into three subsections, according to the two variations of analysis employed in this study (individual and combined feature power analyses for discrimination) and a comparison of our results with state-of-the-art results. While acknowledging that, in medicine, Accuracy may not fully capture the balance between Recall and Specificity, our discussion will primarily focus on Accuracy for checking the models' performance, as it enables a more direct comparison of our results with those achieved by state-of-the-art methods.

Data Driven Analysis-Individual Feature Power Analysis
From a broader perspective, we can observe in Table 6 that the top-performing feature, compressor, and wavelet sub-band were CorrDim, Avg, and the 2nd sub-band (DWT details, 2nd level), respectively. Notably, they were present in 100% of the best results for the comparison groups, encompassing all 28 binary comparison groups.
In addition, 12 of the 28 binary comparisons were shown to be statistically significant (42.86% of all analyses) and, out of the 15 leads utilised in this study, 8 exhibited at least one analysis with statistically significant differences. The most frequently represented lead in the table was V2, appearing in 16% of the cases (4 out of 28 binary groups).
The classes most frequently represented in binary groups with significant differences were the HC and the BBB classes.Both classes were present in 5 out of the 12 comparison groups with significant differences (41.67% of the cases).Additionally, the CardMyo and MH classes were the only ones that did not show significant differences when compared with the BBB class, and the M and Dis classes did not show significant differences when compared with the HC class.
Regarding each binary analysis:
• VHD vs. M yielded a significant p-value of 0.0339, and an Accuracy and Recall of 100%. The feature CorrDim, with the compressor Avg and lead V4, extracted from the 2nd sub-band, provided excellent results. As shown in Figure 2a, the violin plot clearly illustrates a distinct separation between these two classes.
• VHD vs. MI statistical analysis produced a significant result (p-value = 0.0486). The XROC analysis achieved an Accuracy of 98.91% and a Recall of 0% for this binary comparison. The Recall of 0% underscores one of the primary limitations of the dataset, namely its imbalance, with the XROC over-adjusting itself to the predominant class (MI); it also corroborates the difficulty of splitting the groups by the statistical test. In Figure 2b, we can see some outlier values, but the highest density of the data is located close to the median.
• VHD vs. MH statistical analysis revealed a non-significant p-value. The XROC metrics, Accuracy and Recall, demonstrated strong performance for discrimination between groups, with values of 87.50% and 75.00%, respectively.
• VHD vs. HC displayed a significant difference, with a p-value of 0.0301. It achieved an Accuracy of 94.94% and a Recall of 0%. Figure 2c indicates some outliers in the HC class, but the highest data density is close to the median. Despite the good statistical analysis results, once more the XROC over-adjusts itself to the predominant class (HC), achieving a Recall of 0% for discriminating the VHD class. The imbalanced database and the HC class's large number of outliers contribute to these results. The XROC employs an averaging method within its genesis, which assigns significant weight to outliers in the final results.
• VHD vs. Dis resulted in a non-significant p-value, an Accuracy of 80.00%, and a Recall of 100%. Despite being statistically non-significant, the XROC results showed a good performance for discrimination.
• VHD vs. CardMyo resulted in a non-significant p-value. The XROC metrics Accuracy and Recall were 78.94% and 50.00%, respectively.
• VHD vs. BBB statistical analysis yielded a p-value of 0.0308, and the XROC achieved an Accuracy of 84.62% and a Recall of 75.00%. Figure 2d illustrates a higher density of the BBB class's data located close to the median.
• M vs. MI statistical analysis yielded a non-significant p-value. The Accuracy achieved a value of 99.18% and the Recall resulted in 0%.
• M vs. MH statistical analysis exhibited no significant difference. The Accuracy and Recall reached 85.71% and 100%, respectively.
• M vs. HC also showed no significant p-value. Accuracy and Recall achieved values of 96.15% and 0%, respectively. This underscores the difficulty of the XROC classifier in accurately discriminating between unbalanced data sample sizes.
• M vs. Dis also revealed a non-significant p-value, with the Accuracy and Recall showing values of 85.71% and 33.33%, respectively.
• M vs. CardMyo resulted in a non-significant p-value, an Accuracy of 94.44%, and a Recall of 66.67%. Despite being statistically non-significant, the XROC showed good behaviour for discriminating between classes.
• M vs. BBB statistical analysis indicated significant differences, with a p-value of 0.0126. The Accuracy and Recall reached 100%. In Figure 2e, the violin plot displayed an outlier in the BBB class, but the highest data density was slightly below the median. It is worth noting that there was a clear separation between the two classes.
• In Figure 2g, the violin plot exhibited some outliers, but the highest data density was close to the median for both classes.
• MI vs. BBB showed statistical significance, and the Accuracy stood at 97.57%, with a flawless Recall of 100%. Figure 2h shows some outliers in both classes, but the majority of the data were located close to the median.
• MH vs. HC achieved a significant p-value of 0.0078. The Accuracy was 94.94% and the Recall was 0%, which perfectly illustrates the imbalance of the dataset. Figure 2i shows the HC class with the highest density of data close to the median.
• In Figure 2j, the violin plot displayed some outliers in the HC class, but the highest data density was close to the median.
• HC vs. BBB provided a significant p-value of 0.0047, accompanied by an Accuracy of 89.29% and an impressive Recall of 100%. Figure 2k shows the violin plot with some outliers in the HC class, but there was a higher density of data close to the median.
Looking at Table 7, we can see the total number of occasions on which a feature was demonstrated to be significant, per binary group and in total. It should be noted that each originally defined feature generated 360 features per analysis; for more information, check Section 2.5.1. While CorrDim emerged as a standout performer individually, the results emphasise that the other nine features also demonstrated statistical significance in distinguishing between classes, with some showing more moments of significance than CorrDim itself (523 vs. 600). MI vs. HC and HC vs. BBB showed the highest numbers of results with significant differences, 1185 and 1184, respectively. VHD vs. MI, VHD vs. BBB, M vs. BBB, MH vs. HC, and Dis vs. BBB were the binary groups with the lowest number of occasions of significant feature distributions, 237 each.

Data Driven Analysis-Combined Feature Power Analysis
Figure 3 presents the classification metrics report for the comparison groups provided by the 19 scikit-learn ML classifiers with combined features as entries. The heatmap employs a gradient of green shades in its colour scheme, serving to vividly illustrate the method's discrimination capabilities for Accuracy, Recall, Precision, F1-Score, AUC, Kappa, MCC, CSI, and Gmean in each comparative analysis. Lighter shades of green represent lower discriminatory power, while deeper, richer greens signify higher effectiveness. Looking into the results, it can be seen that VHD vs. M, VHD vs. MI, VHD vs. HC, VHD vs. Dis, VHD vs. CardMyo, M vs. MH, M vs. Dis, M vs. CardMyo, M vs. BBB, and MH vs. BBB obtained 100% on all evaluation metrics. Compared with the individual power discrimination results presented in Table 6, the discrimination results have generally increased, and the proportion of binary analyses scoring 100% on all evaluated metrics has risen (from 2/28 to 10/28). Comparing the Accuracy results achieved through combined feature power analysis with those obtained through individual feature power analysis (see Figure 4 for a visual representation of each binary comparison; this figure provides a clear and concise overview, facilitating an easy assessment of the performance differences between the two approaches described in Sections 2.5.1 and 2.5.2), we observe a significant overall improvement across all binary comparisons. Among the 28 comparisons conducted, 17 exhibit enhanced discrimination Accuracy when utilising the combined features analysis. In contrast, in five instances, the Accuracy remained the same as that observed in the individual feature power analysis. There are only six cases where we notice a decrease in Accuracy compared with the individual power analysis (VHD vs. MH, MI vs. MH, MI vs. CardMyo, MH vs. CardMyo, Dis vs. CardMyo, and Dis vs. BBB).
Returning to the analysis of Figure 3, in All vs. All, an Accuracy of 81.16%, a Recall of 72.93%, a Precision of 81.16%, an F1-Score of 76.34%, a Kappa of 0.4018, an MCC of 0.4399, a CSI of 0.6713, a Gmean of 0.7417, and an AUC of 0.5552 were achieved. The leads ensemble combination was the most represented in the table, corresponding to 28% of the total appearances. The classifier with the most frequent appearances was LinSVC, representing 24% of the cases. The binary groups VHD vs. MH, MI vs. HC, MH vs. CardMyo, Dis vs. CardMyo, and Dis vs. BBB exhibited Precision values below 90%. This challenge in correctly classifying can be attributed to the close relationship between CardMyo and MH or Dis. In a clinical context, it is common for patients to present with CardMyo alongside either VHD or Dis [68,69]. This clinical overlap makes accurate differentiation challenging. Understanding and addressing these interconnected conditions is essential for improving classification Accuracy in these scenarios. The MI vs. HC classification, with an 82.84% Precision, presents challenges due to the potential presence of acute MI within the HC class. Additionally, some patients who have recovered from MI may be categorised as HC [70]. These factors contribute to a slightly lower classification performance of the ML models for discrimination within this context.
Moreover, upon assessing the various models and their performance metrics, a notable observation is the impact of the imbalanced dataset, particularly evident in comparison groups involving either the MI or the HC class. In such instances, we observed AUC results ranging from 0.4272 to 0.6667 across all nine comparison groups where at least one of these two classes was present. These findings underscore the substantial challenge of distinguishing between unevenly represented classes. The Gmean further highlights the noteworthy observation that in 71.42% of cases (five out of seven binary comparisons) where the MI class is pitted against another class, the Gmean metric yields a result of 0. However, in the comparisons involving MI against Dis and BBB, a lower risk of over-fitting is evident, with Gmean values of 0.9865 and 0.9891, respectively. The CSI metric reveals that the majority of comparison groups, specifically 17 out of 29, exhibit results surpassing 0.9. This observation underscores a notable challenge in classification, particularly when dealing with classes characterised by higher data abundance. The MCC metric highlights a notable trend, with 31% of the comparison groups (9 out of 29) achieving perfect predictions, each obtaining a maximum value of 1. Notably, the MI class demonstrates the least favourable outcomes, with its highest MCC value capped at 0.3297 when included in a comparison group. The Kappa metric reveals a similar pattern, with 31% of the comparison groups (9 out of 29) achieving perfect agreement, each attaining a maximum value of 1. Additionally, 41.37% of the groups surpass a Kappa value of 0.083.

Study Results vs. State-of-the-Art Results
When we analyse Table 1, it becomes evident that our results closely match or slightly surpass the achievements of the state of the art, offering valuable insights for enhancing robustness. In particular, when considering the eight state-of-the-art studies that used the PTB database, our results are lower only in the binary comparisons of MI vs. HC and HC vs. CardMyo, with differences of less than 13% and 0.76%, respectively. Furthermore, the present study offers significant advantages over other studies, as it includes a multi-class comparison group (All vs. All), a wider variety of binary comparisons, and the application of classical non-linear analysis under ECG multi-band analysis performed by DWT. These particularities allow a high capacity for differentiating each class present in the database, a level of detail not typically found in state-of-the-art articles. Moreover, it is important to underscore that the developed algorithm relies on ECG signals, which present distinctive advantages compared with alternative diagnostic sources such as stress tests, imaging tests, electrophysiological studies, provocative diagnostic tests, genetic tests, and cardiac catheterisation, among others. The affordability, non-invasiveness, widespread clinical use, and user-friendly nature of the ECG make it an optimal choice. Its efficacy not only facilitates the easy adoption of our algorithm globally but also addresses the unique needs of patients unable to leave their hospital beds, highlighting the algorithm's versatility and accessibility in diverse healthcare settings.

Conclusions
For this research, 10 non-linear features (En, ApEn, LogEn, ShaEn, EH, Elay, H, K, CorrDim, and DFA) were extracted from a well-known ECG database (the PTB diagnostic ECG database). From the 15 leads recorded per patient (12 conventional leads and 3 Frank leads), each lead signal underwent a 1-second non-overlapped windowing process over time, extracting a total of 10 non-linear features per window. At the end of the process, each feature time series was compacted into six statistics. The individual and combined discriminative power of the features was assessed for distinguishing between different cardiovascular pathologies in 28 binary comparisons (V HD vs. M, V HD vs. MI, V HD vs. MH, V HD vs. HC, V HD vs. Dis, V HD vs. CardMyo, V HD vs. BBB, M vs. MI, M vs. MH, M vs. HC, M vs. Dis, M vs. CardMyo, M vs. BBB, MI vs. MH, MI vs. HC, MI vs. Dis, MI vs. CardMyo, MI vs. BBB, MH vs. HC, MH vs. Dis, MH vs. CardMyo, MH vs. BBB, HC vs. Dis, HC vs. CardMyo, HC vs. BBB, Dis vs. CardMyo, Dis vs. BBB, and CardMyo vs. BBB) and one multi-class comparison (All vs. All).
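The per-lead windowing and feature-compaction pipeline described above can be sketched as follows. This is a simplified illustration in plain Python: it uses Shannon entropy as a stand-in for the full set of 10 non-linear features, it omits the DWT sub-band decomposition (performed in the study with a Symlet7 wavelet at decomposition level 3, e.g., via PyWavelets), and the particular set of six summary statistics shown is an assumption for illustration, as is every function name.

```python
import math
import statistics

def shannon_entropy(window, bins=16):
    """Shannon entropy of one 1-s window, estimated from an amplitude histogram."""
    lo, hi = min(window), max(window)
    if hi == lo:
        return 0.0  # constant window carries no information
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in window:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def feature_time_series(signal, fs, feature=shannon_entropy):
    """Slide a 1-second non-overlapping window over one lead; one value per window."""
    return [feature(signal[i:i + fs]) for i in range(0, len(signal) - fs + 1, fs)]

def compact(series):
    """Compact a feature time series into six summary statistics (assumed set)."""
    q = statistics.quantiles(series, n=4)  # [Q1, median, Q3]
    return {
        "mean": statistics.mean(series),
        "std": statistics.pstdev(series),
        "median": statistics.median(series),
        "min": min(series),
        "max": max(series),
        "iqr": q[2] - q[0],
    }
```

In the full pipeline, this extraction would run once per DWT sub-band and per lead, and the compacted statistics would form the feature vectors fed to the ML models.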
The Accuracy discrimination results ranged between 81% and 100%. The results demonstrate that the applied method serves as a robust tool for effectively distinguishing cardiovascular diseases (CVDs) through the analysis of ECG signals. The level of detail and discrimination achieved surpasses what is typically observed in state-of-the-art studies using the same dataset. Although our results indicate a strong diagnostic ability of the proposed method, offering medical doctors another avenue towards more confident diagnoses, this study has some limitations. (1) The inherently technical nature of the non-standard clinical features extracted from the ECG signals may hinder complete interpretability from a clinician's standpoint, which could challenge its rapid and widespread integration into clinical practice. (2) The high computation time of the multi-band analysis led us to choose just one wavelet (Symlet7), from the tens available, as the main wavelet, with the level of decomposition set to 3, based on prior work [48,49]. A more meticulous analysis should be carried out in the future to choose the wavelet and decomposition level that best fit each CVD activity. (3) The results should be further validated with a larger and more balanced population to ensure more reliable generalisation, and by splitting the data as a hold-out for classification (e.g., 70% for training and 30% for testing) without employing cross-validation methods. Another possible solution, in future work, would be to subsample the largest classes to reduce the uneven data distribution. (4) Additional CVDs should be studied and evaluated in future work to enhance the discriminative capabilities of our algorithm (e.g., arrhythmias such as premature atrial contraction, premature ventricular contraction, and atrial fibrillation).
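The hold-out evaluation suggested in limitation (3) can be sketched as below: a minimal stratified 70/30 split in plain Python that preserves the per-class proportions of an imbalanced dataset (in practice, scikit-learn's `train_test_split` with its `stratify` parameter would typically be used instead; the function name here is ours).

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_frac=0.30, seed=42):
    """Split sample indices into train/test sets, keeping per-class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)  # ~30% of each class to the test set
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)
```

Because each class is split independently, a minority class such as HC keeps roughly the same share of the test set as it has of the whole dataset, which avoids the degenerate case where a small class is absent from either partition.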
Nevertheless, upon reviewing state-of-the-art works (refer to Table 1), it becomes apparent that many have encountered similar limitations. These constraints predominantly revolved around imbalances in data distribution, as a significant portion of these studies relied on the same database. Limitations in computational time and resources, and a restricted variety of CVD classes, were also commonly shared among these works. This collective set of limitations across the consulted literature underscores the need to address data imbalances and expand the diversity of CVD classes in future research efforts.

Figure 3. Heatmap classification report regarding the combined feature discriminant power analysis: the best Accuracy, Recall, Precision, F1-Score, AUC, Kappa, MCC, CSI, and Gmean results for each comparison group, plus the information on the lead and ML classifier applied for signal analysis.

Figure 4. Direct comparison, using Accuracy, between the individual and combined feature power analyses for binary groups' discrimination performed by the ML models.

Table 1. State-of-the-art literature report on CVD detection, with information about the database, the comparison groups, the features extracted, the classifiers used, limitations, and Accuracy.

Table 2. Number of ECGs per diagnosis class.

Table 3. Number of ECGs per diagnosis class after signal quality assessment and artifact removal.

Table 4. The extracted features with the corresponding equations and definitions.

Table 6. Statistical and XROC results for the individual feature power analysis per binary group, where N.S. means not significant.

Table 7. The total number of occasions on which a feature was shown to be statistically significant (p < 0.05) across all sub-band analyses and leads.
• MI vs. MH analysis resulted in a non-significant p-value. Despite that, the XROC performed well, with a discrimination Accuracy of 98.91% and a Recall of 100%.
• MI vs. HC statistical analysis yielded a p-value of 0.0017. The Accuracy and Recall achieved values of 82.84% and 100%, respectively. In Figure 2f, the violin plot exhibited some outliers, but the highest data density was close to the median for both classes.
• MI vs. Dis analysis was shown to be non-significant. The XROC metrics of Accuracy and Recall displayed strong performance, with values of 97.05% and 100%, respectively.
• MI vs. CardMyo statistical analysis revealed a p-value of 0.0326 alongside impressive classification metrics, with an Accuracy of 96.29% and a perfect Recall of 100%.
• MH vs. Dis analysis exhibited a non-significant p-value, with an Accuracy of 80.00% and a Recall of 50.00%.
• MH vs. CardMyo demonstrated an Accuracy of 84.21% and a Recall of 50.00%, while the MH vs. BBB group yielded an Accuracy of 92.31% and a Recall of 75.00%, with both analyses showing no statistical significance. While statistical significance may be elusive, the consistently high Accuracy values underscore the potential efficacy of the model in discriminating between these conditions.
• HC vs. Dis analysis showed a non-significant difference. The Accuracy and Recall displayed great performance, with values of 87.21% and 100%, respectively.
• HC vs. CardMyo showed a significant difference, with a p-value of 0.0071. The Accuracy and Recall exhibited strong performance, with values of 83.33% and 94.67%, respectively.

• Dis vs. CardMyo comparison analysis yielded a non-significant p-value, with an Accuracy of 73.08% and a Recall of 54.54%.
• Dis vs. BBB comparison analysis showed a p-value of 0.0167, achieving an Accuracy of 80.00% and a Recall of 81.81%. Figure 2l shows a violin plot with a couple of outliers in both classes, but the highest density of data was close to the median.
• CardMyo vs. BBB analysis provided a non-significant p-value, with an Accuracy of 75.00% and a Recall of 73.33%.