1. Introduction
Cardiovascular pathologies account for the largest proportion of deaths globally, representing approximately one-third of total worldwide mortality according to World Health Organization reports. This epidemiological reality justifies prioritizing research oriented toward creating automated tools for interpreting electrocardiographic recordings, which is essential both for early identification of cardiac dysfunctions and for implementing continuous monitoring and telemedicine solutions. The central challenge lies in identifying a subset of descriptors with superior discriminative capacity while simultaneously maintaining a minimal computational footprint—a fundamental requirement for integration into portable equipment with restricted processing capabilities. In the context of wearable devices and mobile monitoring systems, a drastic reduction in the number of features represents not merely an optimization but an imperative necessity for ensuring extended battery autonomy and real-time responsiveness. In practice, a multitude of parameters can be extracted from the temporal, frequency, and spectral perspectives of the signal; however, a considerable portion exhibits informational redundancy or contributes negligibly to classification accuracy. A paradoxical yet frequently encountered finding in the domain literature is that configurations with a judiciously selected reduced number of features can yield classification performance superior to or comparable with that obtained from extended parameter sets—a phenomenon explainable through the elimination of informational noise and spurious correlations that may confuse machine learning algorithms [1].
Recent developments in nonlinear analysis of biophysiological signals have revealed the significant value of indicators grounded in dynamical systems theory, fractal structures, and multi-scale analysis techniques for describing the complex behavior of cardiac electrical activity. The Hurst coefficient provides information regarding the persistent or anti-persistent character of heart rate fluctuations, constituting a sensitive marker for dysfunctions in the neuro-vegetative regulation of the heart. The scaling parameter extracted through the DFA (Detrended Fluctuation Analysis) technique reflects the scale-invariance properties of the electrocardiographic signal across diverse temporal intervals, demonstrating utility in identifying conditions that modify long-term heart rate variability.
Absolute logarithmic correlation measures evaluate the degree of local predictability in the signal’s temporal evolution, offering complementary perspectives on the structure and regularity of cardiac electrical activity. The mean standard increment derived from a Poincaré representation characterizes the successive beat-to-beat fluctuations of RR intervals, being a clinically validated parameter for assessing cardiac autonomic nervous system balance. Entropy calculated in the wavelet domain, based on hierarchical signal decomposition, provides a rigorous evaluation of the complexity of the spectral energy distribution across multiple temporal and frequency resolution levels.
Although the clinical utility of these nonlinear descriptors has been demonstrated in isolated investigations, their systematic integration into a unified framework for automated classification and the optimization of their selection through advanced algorithms represent insufficiently explored domains in the current literature. Furthermore, rigorous comparative evaluations of different feature selection strategies in the specific context of automated electrocardiographic diagnosis have been conducted only fragmentarily, with most research concentrating on validating isolated methods without integrated multi-algorithmic approaches. The crucial aspect of identifying the optimal trade-off between diagnostic accuracy and computational complexity—essential for practical implementation in resource-constrained devices—remains insufficiently and unsystematically addressed [2].
The fundamental aim of this investigation consists of elaborating and validating a comprehensive methodological framework for optimal identification of relevant features in automated ECG recording classification, emphasizing systematic and comparative evaluation of three distinct algorithmic paradigms, as well as determination of minimal parameter configurations that ensure diagnostic performance adequate for clinical applications. The research proposes comparatively examining the effectiveness of the MRMR [3] and ReliefF [4,5] methodologies and permutation-based importance evaluation techniques using neural architectures [6], while concurrently analyzing the contribution of advanced nonlinear descriptors to the differentiation of cardiac pathological categories. An essential objective is demonstrating that compact feature subsets, selected through rigorous algorithmic criteria, can maintain or even improve classification performance compared to extended configurations, thus offering viable solutions for implementation in portable cardiac monitoring systems with severe energy and computational resource constraints.
This investigation represents an engineering-oriented methodological study focused on optimizing feature selection algorithms and reducing computational complexity for ECG classification systems, providing a systematic comparison of selection paradigms, identifying minimal feature subsets that preserve performance, and validating the discriminative value of nonlinear descriptors. The work deliberately isolates algorithmic optimization under balanced evaluation metrics from clinical deployment considerations, since it neither aims at clinical validation nor addresses asymmetric diagnostic costs, where missed life-threatening arrhythmias are unacceptable while false positives are clinically tolerable. These aspects, including risk-weighted optimization, sensitivity prioritization for critical arrhythmias, prospective patient validation, and regulatory compliance, are reserved for a distinct subsequent phase of clinical translation research.
Within the spectrum of contemporary ECG classification approaches—ranging from end-to-end deep learning requiring substantial computational resources to classical feature engineering optimized for ultra-low-power deployment—this study addresses the latter scenario, targeting ultra-resource-constrained platforms where modern deep learning architectures are fundamentally incompatible due to memory and computational limitations.
2. Methodology
2.1. Dataset and ECG Recording Segmentation
The experimental infrastructure of this research is based on the publicly available “Large-Scale 12-Lead Electrocardiogram Database for Arrhythmia Study” [7,8], an extensive collection aggregating electrocardiographic recordings from a cohort exceeding 10,000 subjects. Data acquisition was performed in authentic clinical environments at a sampling rate of 500 Hz, with a typical recording duration of around 10 s. The labeling process was conducted by cardiologists with clinical expertise, ensuring diagnostic validity. The technical architecture of the database includes the standard 12-lead electrocardiographic configuration for each patient.
Although fine-grained discrimination between arrhythmia subclasses (e.g., atrial flutter vs. true atrial fibrillation, atrial tachycardia vs. junctional tachycardia) is a valuable clinical goal, this level of granularity requires substantially balanced datasets and specialized learning strategies for rare classes. The current 4-metaclass configuration allows for robust and unbiased assessment of the convergence of feature selection methods, which constitutes the primary methodological contribution of this study.
In its original configuration, the data repository encompasses a detailed taxonomy of 11 distinct cardiac rhythm categories and 56 cardiovascular pathological entities. To achieve a more balanced statistical distribution and superior clinical relevance, we proceeded to aggregate these granular labels into four major metaclasses with consolidated diagnostic significance: AFIB (atrial fibrillation), reuniting atrial fibrillation proper and atrial flutter; SB (sinus bradycardia); SR (sinus rhythm), integrating normal sinus rhythm and sinus arrhythmias; and GSVT (generalized supraventricular tachycardia), coalescing sinus tachycardia, supraventricular tachycardia, atrial tachycardia, and nodal and atrioventricular reentrant forms. The numerical distribution of examples among these four categories is illustrated in Figure 1, demonstrating a more uniform distribution compared to the extended initial taxonomy.
This hierarchical reorganization strategy fulfills multiple critical methodological objectives: (1) statistical balancing—the original classes exhibit severe imbalances, with some categories having <200 examples while others exceed 2000, which would introduce systematic bias into the training and evaluation of the classifier; (2) clinical alignment—the 4 metaclasses correspond to major diagnostic categories used in contemporary cardiology practice and in the ESC/AHA guidelines for arrhythmia management, having direct clinical relevance; (3) statistical sufficiency—ensures each class contains sufficient examples for robust training and rigorous validation, avoiding the overfitting characteristic of sparse-class classification; and (4) facilitation of classification architectures—allows the development of models with improved robustness to inter-class variability. For the current investigation, the analysis focused exclusively on lead II (DII), from which both the classical morphological parameters of the wave complexes and complementary descriptors from multiple domains—temporal, spectral, and nonlinear—were extracted.
The implemented segmentation strategy constitutes an essential methodological element: each 10 s recording is fragmented into 2 s temporal windows, applying an overlap of 1 s between consecutive windows. This approach generates 9 distinct temporal segments per original recording, resulting in substantial dataset augmentation—from 10,000+ recordings to approximately 95,447 segments. Each individual segment constitutes the analysis unit for feature extraction from the temporal, morphological, and frequency domains, together with nonlinear complexity measures. This overlapping segmentation methodology maximizes information utilization while enabling granular assessment of intra-patient variability.
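The windowing arithmetic above can be sketched as follows. This is an illustrative Python fragment (the study itself works in MATLAB); the sampling rate and window/step sizes follow the text, while the random toy signal is an assumption:

```python
import numpy as np

FS = 500           # sampling rate in Hz, per the database description
WIN = 2 * FS       # 2 s window -> 1000 samples
STEP = 1 * FS      # 1 s step, i.e., 1 s overlap between consecutive windows

def segment(recording: np.ndarray) -> np.ndarray:
    """Split a 1-D recording into overlapping windows (one window per row)."""
    n_windows = (len(recording) - WIN) // STEP + 1
    return np.stack([recording[i * STEP : i * STEP + WIN]
                     for i in range(n_windows)])

rec = np.random.randn(10 * FS)   # toy 10 s recording
segs = segment(rec)
print(segs.shape)                # (9, 1000): 9 segments per recording, as stated
```

With a 10 s recording, (5000 − 1000)/500 + 1 = 9 windows, matching the nine segments per recording claimed in the text.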
The final data matrix configuration presents dimensions of 95,447 rows (segments) × 39 columns, where the last column encodes the diagnostic class label, effectively yielding 38 extracted features for analysis. The descriptor taxonomy is hierarchically organized: the first 12 features correspond to reference parameters presented in the specialized literature [8] and derived from source [10], representing morphological and temporal measurements established in clinical electrocardiography.
The feature space architecture integrates 38 descriptors organized into two categories with complementary functionalities. The first category incorporates 12 reference parameters grounded in the consolidated clinical literature [8], representing standardized measurements from the temporal and frequency perspectives, extensively used in conventional electrocardiographic analysis protocols: ventricular and atrial rate (beats per minute), QRS complex duration (milliseconds), QT and corrected QT intervals (milliseconds), electrical axes of R and T waves, number of identified QRS complexes, and onset/offset points for Q and T waves (expressed in sample numbers).
The second category, representing the distinctive methodological contribution of the present research, comprises 26 advanced nonlinear descriptors derived from dynamical systems theory and multi-scale analysis [10,11]: (1) Hurst exponent—an estimator of long-term memory in the time series, quantifying whether the future signal evolution manifests persistence (tendency to continue previous trends) or anti-persistence (tendency toward trend reversal); (2) scaling exponent obtained through the DFA (Detrended Fluctuation Analysis) methodology, characterizing the self-similarity and scale-invariance properties of the ECG signal across multiple temporal horizons; (3) absolute logarithmic correlations, measuring the degree of local deterministic predictability in temporal evolution; (4) mean standard increment derived from geometric analysis of the Poincaré plot, quantifying successive beat-to-beat variability of RR intervals; and (5) wavelet entropy, calculated based on multiresolution decomposition, offering multiscale evaluation of the spectral energy distribution complexity across multiple temporal and frequency scales.
To realistically simulate the inherent variability of clinical signals and comprehensively evaluate system robustness under practical operating conditions, a stochastic perturbation in the form of multiplicative noise with a uniform probability distribution in the interval [0.9, 1.1] was applied to the first 12 reference features taken from [10]. This augmentation strategy emulates natural fluctuations occurring in real clinical acquisition conditions—signal amplitude variations induced by factors such as variable electrode–skin contact impedance, patient movements, low-intensity electromagnetic interference, and instrumental drift. This methodological approach ensures that the evaluated performance metrics faithfully reflect the system’s capacity to maintain diagnostic accuracy in practical medical implementation scenarios where ideal laboratory conditions are not guaranteed.
The multiplicative noise prevents data leakage from overlapping segmentation: without perturbation, the 12 global features (heart rate, QRS duration, QT interval) would remain constant across all 9 segments from the same recording, allowing the classifier to memorize patient-specific signatures rather than learning genuine diagnostic patterns.
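A minimal sketch of this perturbation (illustrative Python; the matrix shapes, all-ones values, and random seed are assumptions for demonstration, not taken from the study):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen here only for reproducibility

def perturb_reference_features(X: np.ndarray, n_ref: int = 12) -> np.ndarray:
    """Scale the first n_ref columns by factors drawn uniformly from [0.9, 1.1],
    independently per row (segment), leaving the remaining columns untouched."""
    Xp = X.copy()
    Xp[:, :n_ref] *= rng.uniform(0.9, 1.1, size=(X.shape[0], n_ref))
    return Xp

# Toy matrix: 9 segments x 38 features, all ones so the noise factors are visible.
X = np.ones((9, 38))
Xp = perturb_reference_features(X)
print(Xp[:, :12].min() >= 0.9, Xp[:, :12].max() <= 1.1)   # True True
print(np.array_equal(Xp[:, 12:], X[:, 12:]))              # True: rest untouched
```

Because the factors are drawn per segment, the 12 recording-level features no longer repeat identically across the 9 segments of one recording, which is exactly the leakage-prevention effect described above.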
2.2. Classification Architecture
The algorithmic infrastructure for automated cardiac arrhythmia classification was constructed using feedforward neural network architectures, implemented through the fitcnet function available in the MATLAB R2023b development environment. This technological choice was motivated by the demonstrated capacity of feedforward neural networks to model complex nonlinear relationships between descriptors and diagnostic classes, being particularly efficient in multiclass classification problems with moderate-dimensional feature spaces.
The adopted architectural configuration is characterized by controlled simplicity: the neural network topology integrates a single hidden layer composed of 10 processing units (artificial neurons). This architectural design decision represents the result of empirical optimization balancing two competing objectives: on the one hand, maintaining sufficiently expressive representation capacity for capturing discriminant nonlinear patterns in electrocardiographic data; on the other hand, limiting the model complexity to prevent overfitting phenomena, which would compromise the generalization capability on unseen data. For the specific database size utilized (approximately 95,000 training segments after test set exclusion), the 10-neuron hidden layer configuration offers an optimal compromise—sufficiently expressive for learning complex decision boundaries but sufficiently parsimonious to avoid memorizing noise and irrelevant particularities of the training set.
The input vector preprocessing implements an automatic normalization procedure through statistical standardization, known as z-score scaling. Specifically, each individual descriptor is transformed such that its distribution presents zero arithmetic mean and unit standard deviation:

x′ = (x − μ)/σ,

where μ represents the empirical mean and σ the standard deviation calculated on the training set. This linear transformation fulfills multiple critical methodological functions: (1) eliminates dependence on the original measurement units of descriptors, ensuring all features contribute equitably to the learning process regardless of their intrinsic numerical scale; (2) substantially accelerates convergence of the gradient descent optimization algorithm, since the loss function surface becomes better numerically conditioned with more uniform gradients in the parameter space; (3) numerically stabilizes the training process, preventing overflow/underflow numerical problems that could arise with extreme unnormalized values; and (4) facilitates adequate synaptic weight initialization, enabling use of standard random initialization strategies. These collective characteristics ensure faster and more stable convergence toward superior-quality local minima of the loss function, resulting in models with improved performance and superior robustness.
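The standardization step can be sketched as follows (illustrative Python; the study relies on the normalization built into MATLAB's fitcnet, and the toy data below is an assumption). Fitting μ and σ on the training set only is what prevents test-set leakage:

```python
import numpy as np

def zscore_fit(X_train: np.ndarray):
    """Estimate per-feature mean/std on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant features
    return mu, sigma

def zscore_apply(X: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Apply x' = (x - mu) / sigma column-wise."""
    return (X - mu) / sigma

# Toy training matrix: 1000 segments x 38 features with nonzero mean and scale.
X_train = np.random.default_rng(1).normal(5.0, 2.0, size=(1000, 38))
mu, sigma = zscore_fit(X_train)
Z = zscore_apply(X_train, mu, sigma)
print(np.allclose(Z.mean(axis=0), 0.0), np.allclose(Z.std(axis=0), 1.0))  # True True
```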
The data partitioning protocol for the evaluation follows a rigorous strategy designed to ensure unbiased and realistic estimates of generalization performance. The test set was constituted through stratified random selection of 1000 examples from each of the 4 diagnostic metaclasses (AFIB, SB, SR, GSVT), resulting in a total of 4000 segments dedicated exclusively to the final evaluation.
This methodological approach presents multiple advantages over alternative strategies frequently encountered in the literature: (1) perfect class balance in testing—unlike classical proportional partitioning (e.g., 80/20), which would perpetuate the original database imbalances in the test set, our strategy ensures equal representation for all diagnostic categories, eliminating favorable bias toward majority classes; (2) robust metric estimates—having identical numbers of examples per class, aggregate metrics (accuracy, macro-averaged precision/recall/F1) faithfully reflect balanced performance across all categories, not merely predominant ones; (3) rigorous comparability—the fixed and balanced test configuration enables direct and statistically meaningful comparisons among different evaluated feature architectures (7, 10, 12, 35 descriptors), eliminating variability introduced by random test set composition; and (4) realistic evaluation of clinical applicability—in real medical practice, automated systems must manifest consistent performance across all diagnostic categories, not only frequent ones, and our protocol explicitly verifies this requirement.
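The balanced hold-out selection can be sketched as follows (illustrative Python; the per-class toy counts below are assumptions, not the database's actual distribution):

```python
import numpy as np

def balanced_test_split(y: np.ndarray, per_class: int = 1000, seed: int = 0):
    """Draw exactly per_class test indices from each label; the rest is training."""
    rng = np.random.default_rng(seed)
    picked = [rng.choice(np.flatnonzero(y == c), size=per_class, replace=False)
              for c in np.unique(y)]
    test_idx = np.concatenate(picked)
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return test_idx, train_idx

# Toy imbalanced label vector with 4 metaclasses (counts are illustrative only).
y = np.repeat([0, 1, 2, 3], [30000, 35000, 20000, 10447])
test_idx, train_idx = balanced_test_split(y)
print(len(test_idx))                                    # 4000
print([(y[test_idx] == c).sum() for c in range(4)])     # [1000, 1000, 1000, 1000]
```

Regardless of how imbalanced the source labels are, the test set always contains exactly 1000 examples per metaclass, which is what makes the macro-averaged metrics directly comparable across feature configurations.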
The remaining data (approximately 89,899 segments after excluding the 4000 test samples) were dedicated to neural network training, being utilized directly by the optimization algorithm without explicit cross-validation within the training process, since the primary objective was comparing feature selection methods rather than optimizing neural architecture hyperparameters. This methodological approach ensures that all reported performance metrics—accuracy, precision, recall, F1-score, and AUC (Area Under the Curve)—reflect the authentic capacity of models to generalize on completely unseen and class-balanced data, thereby offering a rigorous and clinically relevant evaluation of the effectiveness of the different investigated feature selection strategies.
2.3. Features Employed
The methodological framework of this research integrates two complementary taxonomies of descriptors for multi-dimensional and comprehensive characterization of electrocardiographic signals, with each category offering distinct and synergistic perspectives on cardiac electrical activity dynamics.
2.3.1. Morphological Descriptors of ECG Wave Complexes
The first parameter category comprises standardized clinical measurements, established in contemporary cardiological practice [7], extracted exclusively from lead II recordings. This collection includes the ventricular and atrial rate expressed in beats per minute (BPM), representing the contraction frequency of the respective cardiac chambers; QRS complex duration measured in milliseconds, reflecting the ventricular depolarization interval; QT and corrected QT intervals (adjusted for heart rate) in milliseconds, critical parameters for ventricular repolarization evaluation and arrhythmogenic risk assessment; electrical axes of R and T waves, indicating the vectorial orientation of predominant electrical forces during ventricular depolarization and repolarization; total number of QRS complexes identified in the analyzed segment, providing information about rhythm regularity; and temporal coordinates of the onset (initiation) and offset (termination) points for Q and T waves, expressed in sample numbers, precisely delimiting morphological components of the cardiac cycle (see Appendix A).
2.3.2. Multi-Domain Descriptors and Complexity Measures
The second feature taxonomy aggregates parameters extracted from three complementary analytical perspectives—temporal domain, frequency domain, and nonlinear complexity measures—offering a much richer and more nuanced representation of cardiac dynamics compared to the inherent limitations of conventional morphological parameters. This extended descriptor category is methodologically derived from the approach presented in [10], with the essential adaptation that calculations are applied on 2 s temporal windows with 1 s overlap between consecutive windows, in accordance with the segmentation strategy described previously. This methodological modification enables capture of local variability and short-term signal dynamics, aspects potentially relevant for subtle pathological class discrimination (see Appendix A).
Temporal Domain Descriptors: Temporal characteristics encapsulate both elementary signal morphology indicators and higher-order measures derived from dynamical analysis. The amplitude range (AR) quantifies the difference between extreme signal values in the analyzed window, offering a raw measure of voltage excursion. The peak-to-peak amplitude (PPA) measures the vertical distance between maximum and minimum peaks, similar to, yet computed differently from, the AR. The mean amplitude (MA) represents the arithmetic mean of the signal magnitude, characterizing the baseline level of electrical activity. The signal integral (SignInt) calculates the area under the signal curve, reflecting the total energy content. The RMS (Root Mean Square) values [12] adapted for the ventricular context offer a robust measure of the effective signal magnitude, being particularly relevant for ventricular arrhythmia characterization. The mean and median slopes (MS, MdS [13]) quantify the average rate of temporal signal variation, capturing information about the abruptness or gradualness of state transitions. The mean standard increment (MSI) characterizes the successive signal amplitude variability. The smoothed nonlinear energy operator (SNEO [14,15]) represents an advanced measure of local instantaneous energy, sensitive to rapid variations and signal discontinuities.
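Several of these temporal descriptors admit one-line formulas. The sketch below (illustrative Python; the exact definitions in the cited sources may differ in detail, and the 5-point smoothing window for SNEO is an assumption) computes a representative subset on a toy 2 s signal:

```python
import numpy as np

def temporal_descriptors(x: np.ndarray, fs: int = 500) -> dict:
    """Illustrative formulas for several temporal descriptors; sketch only."""
    dx = np.diff(x)
    psi = x[1:-1] ** 2 - x[:-2] * x[2:]                    # nonlinear energy operator
    sneo = np.convolve(psi, np.ones(5) / 5, mode="same")   # smoothed (5-point MA)
    return {
        "AR": x.max() - x.min(),            # amplitude range
        "MA": np.abs(x).mean(),             # mean (absolute) amplitude
        "SignInt": np.abs(x).sum() / fs,    # area under |signal| (rectangle rule)
        "RMS": np.sqrt(np.mean(x ** 2)),    # root-mean-square magnitude
        "MS": np.abs(dx).mean() * fs,       # mean slope (units per second)
        "MdS": np.median(np.abs(dx)) * fs,  # median slope
        "MSI": np.abs(dx).mean(),           # mean standard increment
        "SNEO": sneo.mean(),                # average smoothed nonlinear energy
    }

x = np.sin(2 * np.pi * 1.2 * np.arange(1000) / 500)   # toy 2 s oscillation
feats = temporal_descriptors(x)
print(sorted(feats))
```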
Frequency Domain Descriptors: Spectral characteristics are derived through fast Fourier transform (FFT) application on 2048-point segments, using a Hamming weighting window to attenuate spectral leakage effects. The amplitude spectral area (AMSA) integrates the total frequency content, offering a global measure of energy distributed in the frequency domain. The centroid frequency (CF) identifies the center of gravity of the spectral distribution, representing the power-weighted mean frequency. The peak frequency (PF) corresponds to the spectral component with maximum amplitude, indicating the dominant signal frequency. The total energy (ENRG) calculates the sum of squared spectral coefficients, quantifying the global power. The spectral flatness measure (SFM) evaluates the spectral distribution uniformity, discriminating between tone-like and noise-like signals. The centroid power (CP) and maximum power (MP) characterize the energetic magnitude concentrated in specific spectrum regions. Power spectrum analysis (PSA [13]) offers a comprehensive representation of the energy distribution across frequency components.
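The FFT pipeline above can be sketched as follows (illustrative Python; the formulas are common textbook variants and not necessarily those of the cited implementation — AMSA in particular has several definitions in the resuscitation literature):

```python
import numpy as np

def spectral_descriptors(x: np.ndarray, fs: int = 500, nfft: int = 2048) -> dict:
    """Hamming window + 2048-point FFT, then spectrum summary statistics."""
    spec = np.abs(np.fft.rfft(x * np.hamming(len(x)), n=nfft))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    power = spec ** 2
    return {
        "AMSA": np.sum(spec * freqs),                  # amplitude spectral area
        "CF": np.sum(freqs * power) / np.sum(power),   # centroid frequency
        "PF": freqs[np.argmax(spec)],                  # peak frequency
        "ENRG": power.sum(),                           # total spectral energy
        # Flatness: geometric mean / arithmetic mean of the power spectrum.
        "SFM": np.exp(np.mean(np.log(power + 1e-12))) / power.mean(),
        "MP": power.max(),                             # maximum power
    }

x = np.sin(2 * np.pi * 10 * np.arange(1000) / 500)   # pure 10 Hz tone, 2 s at 500 Hz
d = spectral_descriptors(x)
print(abs(d["PF"] - 10) < 0.5, d["SFM"] < 0.1)       # True True: tone-like, not noise-like
```

A pure tone lands its peak frequency at the nearest FFT bin to 10 Hz and has a flatness near zero, whereas white noise would push SFM toward one — the tone-versus-noise discrimination the text attributes to SFM.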
Nonlinear Complexity Measures: For rigorous evaluation of signal structural and dynamic complexity, measures grounded in nonlinear dynamical systems theory and fractal analysis were implemented. The Hurst exponent (Hu [16]) characterizes the persistence or anti-persistence in the time series, offering information about long-term memory and signal predictability: H > 0.5 values indicate persistence (trend continuation tendency), H < 0.5 suggests anti-persistence (mean reversion), while H ≈ 0.5 corresponds to random behavior. The scaling exponent (ScE) obtained through the DFA (Detrended Fluctuation Analysis) methodology characterizes the self-similarity and scale-invariance properties of the signal across multiple temporal horizons. The logarithm of absolute correlations (LAC [17]) quantifies the degree of local deterministic predictability in the temporal evolution, being sensitive to chaotic structures. The mean standard increment from the Poincaré plot (MSI [18]) characterizes the successive beat-to-beat variability, offering a geometric representation of short-term dynamics. Two entropy-based metrics—wavelet entropy (WE [19]) and spectral entropy (SpeEnt [20])—quantify the degree of disorder and unpredictability in the energy distribution across multiple temporal-frequency scales and in the frequency domain, respectively, offering complementary measures of the signal informational complexity.
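As a concrete illustration of the DFA scaling exponent, the sketch below implements a minimal first-order DFA (illustrative Python; the box scales and signal length are assumptions, and production implementations add refinements such as overlapping boxes). For uncorrelated white noise the fitted exponent should sit near 0.5, consistent with the H ≈ 0.5 interpretation given above:

```python
import numpy as np

def dfa_exponent(x: np.ndarray, scales=(16, 32, 64, 128, 256)) -> float:
    """Minimal first-order DFA: integrate the centered signal, linearly detrend
    within boxes of each scale, and fit the log-log slope of fluctuation vs scale."""
    y = np.cumsum(x - x.mean())                  # integrated profile
    flucts = []
    for s in scales:
        t = np.arange(s)
        f2 = []
        for b in range(len(y) // s):
            seg = y[b * s:(b + 1) * s]
            trend = np.polyval(np.polyfit(t, seg, 1), t)   # per-box linear detrend
            f2.append(np.mean((seg - trend) ** 2))
        flucts.append(np.sqrt(np.mean(f2)))
    slope, _ = np.polyfit(np.log(scales), np.log(flucts), 1)
    return slope

rng = np.random.default_rng(42)
alpha = dfa_exponent(rng.standard_normal(4096))
print(alpha)   # expected near 0.5 for uncorrelated (white) noise
```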
2.4. Justification for Exclusive Use of Lead II
The methodological decision to extract the entire feature architecture exclusively from lead II represents a strategic choice with multiple convergent justifications from the clinical, technical, and pragmatic perspectives. Clinically, lead II presents a geometric orientation approximately parallel to the heart’s principal electrical axis, resulting in superior QRS complex amplitudes and optimized P wave visibility, thus facilitating morphological identification and characterization of cardiac cycle components [10]. Technically, this single-lead approach dramatically reduces the computational and storage requirements by a factor of approximately 12 compared to processing all 12 standard leads, a critical aspect for implementation in portable devices and wearables with limited hardware resources. Pragmatically, this strategy is perfectly aligned with targeted usage scenarios: mobile cardiac monitoring applications, telemedicine systems for remote geographical areas, large-scale automated screening in environments with limited medical infrastructure, and regions where access to complete 12-lead ECG equipment is restricted or nonexistent [9]. This approach demonstrates that competitive diagnostic performance can be achieved with minimalistic acquisition configurations, thereby facilitating wider access to advanced cardiac monitoring technologies.
2.5. Feature Selection Methodologies
To systematically identify optimal feature subsets and rigorously evaluate the impact of each selection paradigm on the classifier performance, this investigation employed three distinct feature selection techniques, each providing a unique algorithmic perspective on feature relevance quantification and redundancy assessment among the analyzed parameters.
2.5.1. Permutation-Based Sensitivity Analysis Method
The sensitivity analysis framework was operationalized through the feature permutation technique, a methodologically robust and inherently interpretable approach for evaluating the individual contribution of each parameter to the global performance of the classification model. The procedural methodology consists of multiple sequential stages. First, a reference model is trained on the complete feature set using the entire training dataset, establishing a baseline performance metric. Subsequently, systematic evaluation of the classification accuracy impact is conducted through random permutation (shuffling) of each individual feature’s values in the evaluation set while keeping all other features and the trained model architecture constant.
The importance metric for each feature is quantified as the difference between the reference model’s baseline accuracy and the degraded accuracy obtained after permuting that specific feature. A large accuracy drop upon permutation indicates high feature importance—the model’s performance critically depends on that feature’s informational content. Conversely, minimal or zero accuracy degradation suggests the feature contributes negligibly or redundantly to the classification performance. This approach offers several methodological advantages [2]: (1) direct and intuitive measurement—the importance score has a clear interpretation as the performance loss attributable to feature absence; (2) model-agnostic evaluation—the technique is independent of the specific classifier architecture employed, being equally applicable to neural networks, decision trees, support vector machines, or any other supervised learning algorithm; (3) clinical interpretability—results can be directly translated into clinically relevant terms, facilitating communication with medical practitioners regarding which ECG characteristics are most diagnostically significant; and (4) no distributional assumptions—unlike parametric feature selection methods, permutation-based importance requires no assumptions about feature distributions or relationships.
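The procedure reduces to a few lines. In the sketch below (illustrative Python; the trained network is replaced by a hypothetical rule-based stand-in so the block is self-contained), feature 0 fully determines the label and feature 1 is pure noise, so permuting the latter costs no accuracy:

```python
import numpy as np

def permutation_importance(predict, X_val, y_val, seed=0):
    """Model-agnostic permutation importance: accuracy drop when one feature
    column is shuffled while everything else stays fixed."""
    rng = np.random.default_rng(seed)
    base = np.mean(predict(X_val) == y_val)
    drops = np.empty(X_val.shape[1])
    for j in range(X_val.shape[1]):
        Xp = X_val.copy()
        rng.shuffle(Xp[:, j])                    # destroy feature j's information
        drops[j] = base - np.mean(predict(Xp) == y_val)
    return base, drops

# Toy stand-in for the trained model: class = sign of feature 0; feature 1 is noise.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2))
y = (X[:, 0] > 0).astype(int)
predict = lambda A: (A[:, 0] > 0).astype(int)
base, drops = permutation_importance(predict, X, y)
print(base, drops[1])   # 1.0 0.0 -> only feature 0 carries information
```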
2.5.2. Implementation of the MRMR Algorithm
The MRMR (Minimum Redundancy Maximum Relevance) algorithm was implemented to identify optimal feature subsets based on an information-theoretic dual optimization criterion that simultaneously maximizes relevance with respect to the target variable while minimizing redundancy among selected features. This method employs mutual information as the fundamental metric for quantifying both the individual relevance of features relative to output classes and the degree of interdependence and correlation among candidate features.
The mutual information I(X;Y) between a feature X and class labels Y quantifies the amount of information obtained about Y through observation of X, or equivalently, the reduction in uncertainty about Y given knowledge of X. For feature selection, high mutual information I(Xi;Y) indicates strong relevance—the feature provides substantial discriminative information about class membership. Simultaneously, high mutual information I(Xi;Xj) between two features indicates redundancy—they provide largely overlapping information, making one largely superfluous given the other’s presence.
The MRMR selection process is inherently iterative and follows a greedy optimization strategy. At each iteration step, the algorithm evaluates all remaining candidate features and selects the one offering the optimal compromise between: (1) maximum relevance—high mutual information with the target classes, I(Xi; Y), ensuring the feature contributes novel discriminative power; and (2) minimum redundancy—low average mutual information with already-selected features, (1/|S|) Σ_{Xj ∈ S} I(Xi; Xj), where S represents the current selected subset, ensuring informational diversity and avoiding redundant measurements of the same underlying physiological phenomenon.
This algorithmic strategy guarantees that the final identified subsets contain complementary and diverse information, systematically avoiding redundancies that could diminish the classification efficiency through informational overlap or introduce numerical instability into the resulting models through highly correlated features. The greedy nature provides computational tractability—evaluating all possible feature subsets would be combinatorially prohibitive (2n possibilities for n features)—while the mutual information criterion provides a principled theoretical foundation grounded in information theory.
The MRMR framework is particularly well-suited for biomedical applications where features often exhibit complex nonlinear interdependencies and redundancies arise naturally from multiple measurements of related physiological processes. For ECG classification specifically, many temporal and spectral features capture overlapping aspects of cardiac electrical activity, making redundancy minimization critical for constructing efficient diagnostic systems.
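The greedy loop can be sketched as follows (illustrative Python; the histogram-based mutual information estimator and its bin count are assumptions — the MATLAB implementation used in the study has its own estimator). Feature x1 is a near-duplicate of x0, so after picking one relevant feature the algorithm skips the redundant copy:

```python
import numpy as np

def mutual_info(a, b, bins=8):
    """Crude histogram-based mutual information estimate (in nats)."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz]))

def mrmr(X, y, k):
    """Greedy MRMR: maximize I(Xi;Y) minus mean redundancy with selected set."""
    relevance = np.array([mutual_info(X[:, j], y) for j in range(X.shape[1])])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        scores = {j: relevance[j] - np.mean([mutual_info(X[:, j], X[:, s])
                                             for s in selected])
                  for j in range(X.shape[1]) if j not in selected}
        selected.append(max(scores, key=scores.get))
    return selected

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 2000).astype(float)
x0 = y + 0.1 * rng.standard_normal(2000)     # highly relevant
x1 = x0 + 0.01 * rng.standard_normal(2000)   # redundant near-copy of x0
x2 = rng.standard_normal(2000)               # irrelevant
X = np.column_stack([x0, x1, x2])
sel = mrmr(X, y, 2)
print(sel)   # a relevant feature first, then x2 -- the duplicate is skipped
```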
2.5.3. Implementation of the ReliefF Algorithm
The ReliefF algorithm was integrated as the third selection methodology, offering a distinct algorithmic perspective grounded in geometric analysis of local neighborhoods within the multi-dimensional feature space. Unlike information-theoretic or statistical approaches, ReliefF evaluates feature importance through the capacity to discriminate between instances from the same class versus instances from different classes, employing an approach fundamentally based on Euclidean distances in the normalized feature space.
The algorithmic procedure operates through the following mechanism. For each training instance, the algorithm identifies k = 10 nearest neighbors (nearest hits—from the same class, and nearest misses—from different classes) in the feature space. Relevance scores are then calculated based on the features’ capacity to efficiently separate classes within the local neighborhood of each data point. Specifically, features that maintain small distances to same-class neighbors (near hits) while maintaining large distances to different-class neighbors (near misses) receive high importance scores, as they effectively support local class separability.
Mathematically, for each feature f and training instance x, ReliefF updates the feature weight W[f] as:
\[
W[f] \leftarrow W[f] - \sum_{j=1}^{k} \frac{d(f, x, H_j)}{m \cdot k} + \sum_{c \neq \mathrm{class}(x)} \frac{P(c)}{1 - P(\mathrm{class}(x))} \sum_{j=1}^{k} \frac{d(f, x, M_j(c))}{m \cdot k},
\]
where d(f, x, y) measures the difference in feature f between instances x and y, H_j and M_j(c) denote the j-th nearest hit and the j-th nearest miss from class c, respectively, m is the number of sampled instances, and P(c) represents the class prior probabilities accounting for the class imbalance. Features consistently reducing the distances to same-class neighbors while increasing the distances to different-class neighbors accumulate positive weights, indicating high discriminative capacity.
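A pure-Python sketch of this prior-weighted hit/miss update (an illustrative simplification, not the MATLAB implementation used in the study; helper names are ours, and features are assumed min-max normalized so per-feature differences lie in [0, 1]):

```python
import math
from collections import Counter

def relieff(X, y, k=3, m=None):
    """Simplified ReliefF: X holds normalized feature vectors, y class labels.
    Returns one weight per feature; higher means more discriminative."""
    n, p = len(X), len(X[0])
    m = m if m is not None else n          # number of sampled instances
    priors = {c: cnt / n for c, cnt in Counter(y).items()}
    w = [0.0] * p

    def diff(f, a, b):                     # per-feature difference d(f, x, y)
        return abs(X[a][f] - X[b][f])

    def dist(a, b):                        # Euclidean distance in feature space
        return math.sqrt(sum(diff(f, a, b) ** 2 for f in range(p)))

    for i in range(m):
        by_class = {}                      # class -> candidate neighbor indices
        for j in range(n):
            if j != i:
                by_class.setdefault(y[j], []).append(j)
        for c, idxs in by_class.items():
            idxs.sort(key=lambda j: dist(i, j))
            for j in idxs[:k]:
                for f in range(p):
                    if c == y[i]:          # nearest hits pull the weight down
                        w[f] -= diff(f, i, j) / (m * k)
                    else:                  # prior-weighted misses push it up
                        coef = priors[c] / (1 - priors[y[i]])
                        w[f] += coef * diff(f, i, j) / (m * k)
    return w
```

On toy data where only the first feature separates the classes, w[0] should come out clearly positive while the weight of an uninformative feature stays near zero, mirroring the interpretation of positive weights given above.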
This methodology offers several distinctive characteristics: (1) context-sensitive evaluation—feature importance is assessed within local neighborhoods rather than globally, making the method robust to complex, heterogeneous feature–class relationships that vary across different regions of feature space; (2) efficiency for nonlinear problems—by operating on local geometries, ReliefF effectively handles problems with complex class structures and nonlinear decision boundaries where global linear methods fail; (3) natural handling of feature interactions—the distance-based evaluation implicitly captures feature interactions and dependencies without requiring explicit modeling; and (4) multiclass extension—ReliefF naturally extends to multiclass problems through weighted consideration of all class pairs, unlike some methods requiring binary decomposition strategies.
For ECG arrhythmia classification specifically, where different pathologies may manifest through distinct feature patterns in different regions of the feature space (e.g., atrial fibrillation characterized by irregular RR intervals versus bradycardia characterized by uniformly prolonged intervals), ReliefF’s context-sensitive local analysis provides a particularly valuable complementary perspective to global methods like MRMR.
2.6. Practical Implementation Considerations
Both the MRMR and ReliefF algorithms are already implemented and readily available in MATLAB R2023b within the Classification Learner application, providing convenient access to these advanced feature selection techniques through a user-friendly graphical interface. The Classification Learner app automates the feature selection workflow, enabling rapid comparative evaluation of different selection strategies without requiring manual implementation of complex algorithms. This practical accessibility facilitates reproducibility and allows researchers to focus on interpretation and application rather than algorithmic implementation details.
The availability of these methods within standard computational environments democratizes access to advanced feature selection techniques, enabling their application even by practitioners without deep expertise in machine learning algorithmic development, thereby accelerating translation of advanced analytical methods into practical clinical decision support tools.
2.6.1. Performance Evaluation Methodology
The comprehensive evaluation of the classification performance was conducted through systematic analysis of the classification accuracy as a function of progressively increasing numbers of retained features, applied consistently across all three implemented selection methodologies (permutation-based sensitivity analysis, MRMR, and ReliefF). This exhaustive evaluation strategy involved training and assessing independent classification models for each distinct feature subset dimensionality, ranging from minimal configurations (2–5 features) through intermediate sets (7–15 features) to extended configurations (20–35 features). Each model instance was trained using an identical neural network architecture (single hidden layer with 10 neurons, z-score normalization) and evaluated on the same fixed balanced test set (1000 examples per class, 4000 total), ensuring rigorous comparability across different feature subset sizes and selection methods.
This systematic exploration across the feature dimensionality spectrum facilitates identification of optimal compromise points between classification performance and associated computational complexity. Specifically, the analysis reveals the following: (1) performance saturation thresholds—the number of features beyond which additional features provide diminishing marginal improvements in accuracy; (2) critical feature counts—minimal feature subset sizes capable of maintaining clinically acceptable performance levels; (3) method-specific characteristics—whether different selection algorithms exhibit similar or divergent patterns in their performance-complexity trade-off curves; and (4) robustness assessment—consistency of performance across repeated evaluations with different random initializations, indicating the stability of the identified optimal configurations.
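The sweep itself reduces to a loop over subset sizes, retraining one model per size on the top-ranked features. The sketch below substitutes a nearest-centroid classifier for the paper's single-hidden-layer network purely to keep the example self-contained; the function names are illustrative:

```python
from collections import defaultdict

def centroid_predict(train_X, train_y, test_X):
    """Nearest-centroid classifier, a lightweight stand-in for fitcnet."""
    sums = defaultdict(lambda: [0.0] * len(train_X[0]))
    counts = defaultdict(int)
    for xs, c in zip(train_X, train_y):
        counts[c] += 1
        for f, v in enumerate(xs):
            sums[c][f] += v
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
    return [min(centroids,
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(xs, centroids[c])))
            for xs in test_X]

def accuracy_vs_feature_count(ranking, train, test, sizes):
    """Train and score one model per subset size, keeping top-k ranked features."""
    (trX, trY), (teX, teY) = train, test
    curve = {}
    for k in sizes:
        cols = ranking[:k]
        sub = lambda rows: [[row[f] for f in cols] for row in rows]
        preds = centroid_predict(sub(trX), trY, sub(teX))
        curve[k] = sum(p == t for p, t in zip(preds, teY)) / len(teY)
    return curve
```

Plotting the returned curve over the 2–35 feature range is what reveals the saturation thresholds and critical feature counts described above; with a fixed test set, differences between sizes are attributable to the feature subset rather than to data resampling.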
2.6.2. Penalized Scoring System for Complexity-Aware Evaluation
Complementary to standard classification accuracy evaluation, a complexity-aware penalized scoring system was developed and implemented to integrate both classification performance and computational complexity associated with the number of features utilized. This multi-objective evaluation framework addresses a fundamental limitation of accuracy-only assessment: in practical deployment scenarios, particularly for resource-constrained medical devices, the raw classification accuracy represents only one dimension of system utility. The computational efficiency, energy consumption, processing latency, and hardware requirements constitute equally critical design constraints that must be explicitly considered in algorithm selection and configuration.
The penalized score is formulated mathematically as follows:
\[
\text{Penalized Score} = \text{Accuracy} - \alpha \cdot \frac{\text{Number of Features}}{\text{Maximum Features}},
\]
where α represents a tunable penalty coefficient controlling the relative weight assigned to complexity reduction versus performance maximization. The normalization term (Number of Features/Maximum Features) ensures the penalty scales between 0 and 1, making the penalty magnitude comparable to accuracy percentage units [11]. Different penalty coefficients reflect different deployment scenarios: (1) α = 0.0—pure accuracy maximization, ignoring complexity (appropriate for high-resource laboratory settings); (2) α = 0.1—moderate complexity penalty (balanced trade-off for general-purpose portable devices); and (3) α = 0.2–0.3—strong complexity penalty (aggressive optimization for ultra-low-power wearables or large-scale screening applications).
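Assuming accuracy is expressed as a fraction in [0, 1] (an assumption of this sketch, chosen so the α values above stay in comparable units), an accuracy-minus-normalized-complexity score can be computed as:

```python
def penalized_score(accuracy, n_features, max_features, alpha):
    """Accuracy minus a complexity penalty proportional to the normalized
    feature count; accuracy is assumed to be a fraction in [0, 1]."""
    return accuracy - alpha * (n_features / max_features)

# For example, at alpha = 0.1 a 10-of-38 feature model at 88.18% accuracy
# outscores a full 38-feature model at 93%, illustrating the
# complexity-aware ranking.
compact = penalized_score(0.8818, 10, 38, 0.1)
full = penalized_score(0.93, 38, 38, 0.1)
```

Sweeping α over the scenario values and taking the best-scoring subset size at each setting is one way to surface the Pareto-optimal operating points discussed below.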
The penalized scoring framework enables explicit visualization and quantification of Pareto-optimal configurations—feature subset sizes that represent optimal compromises where no alternative configuration simultaneously improves both accuracy and complexity. Graphical analysis of the penalized scores across the feature dimensionality spectrum (typically presented as score curves for each selection method) directly highlights the optimal operating points for different deployment contexts, providing actionable guidance for practical system design decisions rather than merely reporting abstract performance metrics divorced from implementation realities.
3. Results
The exhaustive preliminary evaluation, utilizing the entire architecture of 38 extracted descriptors (plus the class label, totaling 39 columns), provided a convincing demonstration of the neural system’s intrinsic capacity to attain superior performance levels in the multiclass classification task. The reference configuration, implemented with the previously described feedforward neural architecture (one hidden layer with 10 neurons, z-score normalization), generated 93% classification accuracy (presented in
Figure 2) on the balanced test set (1000 examples per class, 4000 total). This baseline metric is significant from multiple perspectives: (1) validates the methodological approach’s viability—demonstrates that the hybrid feature taxonomy (12 morphological + 26 multi-domain) captures sufficient discriminant information for robust differentiation of the four pathological metaclasses (AFIB, SB, SR, GSVT); (2) establishes a rigorous reference point—offers a comparison anchor for subsequent evaluation of feature selection methods, enabling precise quantification of the trade-off between complexity reduction and performance degradation; and (3) confirms the nonlinear descriptor contribution—the result empirically demonstrates that integration of the 26 advanced nonlinear complexity measures (Hurst, DFA, entropies, LAC, etc.) does not introduce informational noise or confusion into the learning process but conversely augments the system’s global discriminative capacity, contributing positively to class separability in the feature space.
Granular inspection of the confusion matrix associated with the full 38-descriptor set classification, evaluated on the 4000 balanced test examples (1000 per class), reveals fundamental properties of system behavior. The classification error distribution manifests a balanced and symmetric pattern among the four diagnostic metaclasses, without systematic confusion concentrations in particular directions.
This diagnostic balance characteristic indicates the system’s maturity for deployment in real clinical scenarios, where it is imperative that performance be consistent and predictable across the entire spectrum of encountered pathological conditions, not merely on majority or “easy” classes. Analysis of individual confusion matrix components reveals that most errors occur in class pairs with high intrinsic similarity from the electrophysiological feature perspectives (e.g., confusion among different supraventricular tachycardia forms within the GSVT category), reflecting fundamental limitations of single-lead discrimination and features calculated based on short 2 s temporal windows, rather than deficiencies in the algorithmic architecture itself.
3.1. Results of Permutation-Based Sensitivity Analysis Method
Implementation of the sensitivity analysis technique through systematic feature permutation offered a complementary and model-agnostic perspective on the relative importance of each individual descriptor. The methodology consists of evaluating performance degradation induced by random permutation (shuffling) of each feature’s values individually in the test set, keeping constant the model architecture trained on the original configuration. Accuracy degradation resulting from a feature’s permutation directly quantifies its contribution to global performance: high-impact features generate significant degradations when perturbed, while redundant or irrelevant features produce negligible modifications.
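The procedure can be sketched as follows (a minimal model-agnostic version; `model_predict` stands in for the trained network's prediction function, and names are illustrative):

```python
import random

def permutation_importance(model_predict, X, y, n_features, seed=0):
    """Accuracy drop when each feature column is shuffled in the test set."""
    rng = random.Random(seed)

    def acc(rows):
        preds = model_predict(rows)
        return sum(p == t for p, t in zip(preds, y)) / len(y)

    baseline = acc(X)
    drops = []
    for f in range(n_features):
        shuffled_col = [row[f] for row in X]   # permute one column only
        rng.shuffle(shuffled_col)
        perturbed = [row[:f] + [v] + row[f + 1:]
                     for row, v in zip(X, shuffled_col)]
        drops.append(baseline - acc(perturbed))
    return baseline, drops
```

Features the model actually relies on produce positive drops when permuted, while features it ignores leave the accuracy exactly unchanged, which is the ranking signal used in this section.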
Application of this methodology on the complete set of 38 descriptors generated an importance score vector, enabling feature hierarchization according to their individual contribution. The optimal selection identified through this technique consists of a compact subset of 10 features, visually presented in
Figure 3, demonstrating substantially superior importance scores compared to the remaining descriptors. This observation suggests a pronounced concentration of discriminant information in a relatively small number of critical parameters, while most remaining features contribute marginally or present informational redundancy with more significant descriptors.
Training and evaluation of an identical neural architecture (10 neurons, one hidden layer, z-score normalization) using exclusively the 10 features selected through sensitivity analysis generated remarkable results: the classification accuracy attained on the balanced test set (4000 examples) is 88.18%. This performance must be contextualized relative to the 93% baseline obtained with the complete set: the absolute degradation is only 4.82 percentage points (93% − 88.18%), representing a relative loss of approximately 5.2% from the original performance. This minimal degradation is notable relative to the dimensionality reduction magnitude: use of only 10 of the 38 original features reduces the computational complexity by approximately 73% (calculated as (38 − 10)/38 ≈ 0.73), offering an exceptional compromise between computational efficiency and diagnostic fidelity.
This empirical demonstration validates the central hypothesis that discriminant information for cardiac arrhythmia classification is concentrated in compact subsets of judiciously selected features, and advanced algorithmic selection methods can identify these optimal subsets, enabling development of diagnostic systems with acceptable clinical performance and dramatically reduced computational requirements, ideal for facilitating wider access to advanced cardiac monitoring technologies in resource-limited contexts.
Implementation of the sensitivity analysis methodology through systematic feature permutation identified an optimal subset of 10 features with the highest importance scores, corresponding to indices [2, 18, 9, 19, 3, 17, 30, 13, 23, 14] from the original 38-parameter set. This selection spans both morphological waveform parameters (indices 2, 3, 9, 13, 14 from reference set [7]) and advanced multi-domain complexity measures (indices 17, 18, 19, 23, 30), validating the complementary value of both feature taxonomies.
Exclusive use of these 10 selected features achieved remarkable classification accuracy of 88.18% (
Figure 4), representing minimal degradation of only 4.82 percentage points compared to the 93% performance with the complete 38-feature set, while dramatically reducing the computational complexity by approximately 73%. This exceptional compromise offers substantial practical benefits: proportionally faster inference time, reduced memory footprint for embedded systems, lower power consumption for battery-powered devices, and simplified signal processing pipeline.
The dynamics of the performance curve as a function of the progressively retained features, as illustrated in
Figure 5, exhibits rapid monotonic accuracy growth for the first selected features, followed by gradual stabilization and saturation approaching the optimal value plateau. This characteristic evolutionary pattern empirically confirms the effectiveness of the permutation-based sensitivity analysis selection methodology and demonstrates the existence of a well-defined optimal operating point for the feature subset size, where the balance between classification performance and computational efficiency is optimally achieved.
The figures presented below illustrate the evolution of the performance of the neural classifier (fitcnet) as a function of the progressive number of features selected by the MRMR (
Figure 6 left) and ReliefF (
Figure 6 right) methods, providing a comparative perspective on the dynamics of the two selection paradigms.
The two curves reveal fundamentally different selection strategies: MRMR aggressively prioritizes features with maximum relevance in early iterations, resulting in superior performance with compact subsets (5–10 features), while ReliefF adopts a more conservative approach, reaching its optimal performance with slightly more extensive configurations (29–35 features). When approaching the full feature set (35–38 features), both methods converge to similar performance ranges (88.90% for MRMR with 35 features vs. 91.83% for ReliefF with 34 features). The remaining performance difference (~3 percentage points) is attributable to (a) neither method having yet reached complete feature saturation and (b) statistical variations from stochastic weight initialization in the neural network training process.
This methodological divergence suggests that MRMR is preferable for applications with severe resource constraints where ultra-compact subsets (7–15 features) are imperative, while ReliefF is more suitable for scenarios where maximum performance has absolute priority and computational resources allow the use of more extensive configurations. Both methods empirically confirm the hypothesis that the discriminant information for cardiac arrhythmia classification is concentrated in a limited subset of features, validating the viability of dramatically reducing the computational complexity while maintaining clinically acceptable diagnostic performance.
3.2. Results of MRMR Algorithm Implementation
Application of the MRMR algorithm generated exceptional results, with optimal performance of 88.90% accuracy achieved using 35 selected features, as presented in
Figure 6 above. This configuration, although requiring a more substantial number of parameters compared to the permutation-based sensitivity analysis (which achieved 88.18% with only 10 features), demonstrates the superior efficacy of the dual selection criterion based on simultaneous optimization of the maximum relevance and minimum redundancy for maximizing global performance. The MRMR approach’s strength lies in its systematic elimination of informational redundancy while preserving discriminative capacity, resulting in feature subsets where each descriptor contributes complementary, non-overlapping information to the classification task.
A particularly noteworthy result is the performance obtained with only two features selected by MRMR (features 3 and 5, the most significant, as also shown in the figure above), presented in the following figures (
Figure 7). This minimal configuration demonstrates that extremely simple feature sets can maintain acceptable performance levels for applications with severe resource constraints. Feature 3 corresponds to the ventricular rate (BPM) and feature 5 to the QRS complex duration—both fundamental morphological parameters with direct clinical interpretability. Achieving meaningful classification performance with just these two descriptors is remarkable, as it suggests that core cardiac electrical activity characteristics captured by basic waveform measurements contain substantial discriminative information. This observation is fundamental for developing ultra-portable medical solutions and mass screening systems where the computational complexity must be minimized to extreme levels, such as disposable ECG patches, single-chip monitors, or large-scale population surveillance programs in resource-limited settings.
The comprehensive evaluation of the multiclass ROC curves for the optimized MRMR configuration with 35 selected features, presented in
Figure 8, confirms the exceptional performance, with a mean AUC of 0.9735 across all investigated pathological classes. This value indicates excellent discriminative capacity, comparable to the performance of reference clinical systems employed in contemporary cardiological practice. The uniform and balanced distribution of individual AUCs across classes—with all four metaclasses (AFIB, SB, SR, GSVT) exhibiting AUC values exceeding 0.95—confirms the absence of systematic biases and validates the robust clinical applicability of the developed system. This balanced performance profile is particularly significant from a clinical deployment perspective, as it demonstrates that the system maintains consistent diagnostic reliability across the entire spectrum of arrhythmia types rather than exhibiting preferential accuracy for certain classes at the expense of others.
3.3. Results of ReliefF Algorithm Implementation
Implementation of the ReliefF algorithm on the optimized feature set (the first four features from the specialized literature integrated with the 26 proposed nonlinear parameters) offered a complementary and validating methodological perspective on the optimal selection process.
Figure 9 presents the importance and relevance scores of features as evaluated by ReliefF for this feature set. A notable observation is that certain features exhibit negative scores from a relevance perspective, indicating that these descriptors may actually confuse the classification process rather than assist it. Negative ReliefF weights suggest that these features increase the distances to same-class neighbors while decreasing the distances to different-class neighbors—precisely the opposite of the desired discriminative behavior. This finding provides actionable guidance for feature elimination, as removing such counterproductive descriptors could potentially improve classification performance while simultaneously reducing computational complexity.
The best accuracy obtained with ReliefF was 88.02% using 29 features, as presented in
Figure 10. This performance is remarkably close to MRMR’s optimal result (88.90% with 35 features), achieved with slightly fewer features, demonstrating that the geometric local-neighborhood approach of ReliefF identifies a highly efficient feature subset through fundamentally different algorithmic principles compared to MRMR’s information-theoretic framework.
Another important aspect to note is the feature importance ranking identified by the ReliefF algorithm for all features—those calculated in this study are presented in
Figure 11. A significant observation is that the highest importance scores are generally assigned to features from the second part of the feature set, and especially the final features, which correspond precisely to the nonlinear signal complexity measures. This pattern strongly validates the methodological contribution of this research: the advanced nonlinear descriptors (Hurst exponent, DFA scaling, absolute logarithmic correlations, Poincaré increment, wavelet entropy) consistently emerge as highly discriminative parameters across independent selection algorithms, confirming their genuine informational value for cardiac arrhythmia classification rather than representing spurious correlations specific to particular datasets or methods.
Although ReliefF’s performance is marginally inferior to that obtained with MRMR (88.02% vs. 88.90%), this approach confirms the robustness and consistency of the proposed nonlinear features and validates the convergence of the results between fundamentally different algorithmic paradigms. Systematic analysis of the importance scores generated by the MRMR and ReliefF algorithms reveals significant convergences in identifying critical features, providing additional confidence in the stability and reproducibility of the obtained results. Specifically, despite employing radically different mathematical frameworks—information theory (mutual information) for MRMR versus geometric local-neighborhood analysis (distance-based discrimination) for ReliefF—both methods consistently rank similar features as highly important, particularly the nonlinear complexity measures. This cross-validation through methodologically diverse approaches substantially strengthens confidence that the identified important features represent genuine discriminative characteristics intrinsic to the cardiac arrhythmia classification problem, rather than artifacts of specific selection algorithm biases or dataset particularities.
3.4. Contribution of Advanced Nonlinear Features
In-depth analysis of features selected across different optimal configurations reveals the consistent and robust presence of the nonlinear parameters originally developed within this research. Quantitative demonstration of these descriptors’ superiority derives from multiple converging sources of evidence.
Statistical analysis of selection frequency: Systematic examination of the optimal subset composition identified by the three methods (sensitivity analysis, MRMR, ReliefF) reveals that nonlinear parameters appear in proportions of 65–70% of selected features, roughly matching their ~68% share (26 of 38) of the total descriptor set. More significantly, in ultra-compact configurations (5–10 features), nonlinear parameters constitute 70–80% of the selection, indicating that they concentrate essential discriminative information.
The Hurst exponent appears frequently with high scores in optimized subsets identified by all the implemented methods, confirming the informational value of persistence and anti-persistence characteristics in ECG signal dynamics for efficient discrimination of cardiac pathologies. Importance score analysis for the Hurst exponent shows values consistently situated in the top 15% among all the evaluated features, with MRMR mutual information scores of I(Hurst;Y) = 0.42 (compared to an average of 0.28 for classical morphological parameters), indicating substantially superior correlation with target classes. This observation aligns with the clinical literature associating alterations in the fractal structure of heart rate variability with diverse cardiovascular pathologies. The Hurst exponent’s capacity to quantify long-term memory in physiological signals provides a sensitive marker for disruptions in complex regulatory mechanisms governing cardiac electrical activity, being particularly valuable for detecting subtle pathological changes not evident in conventional morphological parameters.
The physiological basis for the Hurst exponent’s discriminative power lies in its quantification of the cardiac autonomic nervous system balance. In healthy sinus rhythm, heart rate variability exhibits long-range correlations (H > 0.5), reflecting intact parasympathetic–sympathetic interplay. Atrial fibrillation disrupts this organization, producing more random, uncorrelated RR intervals (H ≈ 0.5), while severe sympathetic dominance in tachycardia may increase persistence (H >> 0.5). This electrophysiological interpretation explains why the Hurst exponent consistently emerges as discriminative across all three selection methods.
The scaling exponent derived from the DFA analysis likewise demonstrates consistent and significant contribution, being selected with priority in the majority of optimized configurations by all the tested algorithms. Evaluation through ablation studies (systematic removal of individual descriptors from optimal configuration) shows that removal of the DFA exponent produces mean accuracy degradations of 2.1–2.8 percentage points—substantially larger than the 0.3–0.9 pp degradations associated with elimination of individual morphological parameters, demonstrating the model’s critical dependence on this descriptor. This feature offers complementary information about the signal self-similarity properties across multiple temporal scales, being particularly sensitive to alterations in neuro-autonomic heart control that accompany various cardiac pathologies. The DFA scaling exponent captures the statistical behavior of fluctuations across diverse temporal horizons, offering insights into multi-scale regulatory dynamics characterizing healthy versus pathological cardiac function. Its consistent selection through diverse selection paradigms validates its fundamental relevance to arrhythmia discrimination.
The DFA scaling exponent characterizes the fractal properties of cardiac rhythm across multiple time scales, reflecting neuro-autonomic control integrity. Normal cardiac function exhibits scale-invariant fluctuations (α ≈ 1.0), indicating healthy complex regulatory dynamics. Pathological conditions alter this scaling behavior: atrial fibrillation produces more random fluctuations (α → 0.5), while rigid sympathetic control in certain tachycardias increases correlation (α > 1.0). Clinical studies have validated the DFA alterations in various cardiovascular pathologies, confirming its mechanistic relevance beyond purely statistical discrimination.
The absolute logarithmic correlations and mean standard increment from the Poincaré plot contribute substantially to characterizing local aspects of signal variability, offering information about short-term predictability and beat-to-beat variability. Inter-feature correlation matrix analysis shows that these nonlinear parameters exhibit low inter-descriptor correlations (ρ < 0.35), in contrast to classical morphological parameters that frequently manifest high correlations (ρ > 0.65), confirming that nonlinear descriptors capture distinct and complementary informational aspects, not redundant ones. The consistent presence of these parameters in optimized subsets confirms the importance of fine-scale variability analysis for efficient discrimination of pathological classes. These measures capture the micro-dynamics of cardiac rhythm regulation—the immediate beat-to-beat adjustments reflecting the responsiveness and stability of cardiac control mechanisms. Their discriminative power suggests that pathological conditions manifest not only in global rhythm characteristics but also in the fine temporal structure of successive heartbeat intervals.
Wavelet entropy, through rigorous characterization of the spectral energy distribution complexity across different temporal and frequency scales, offers a complementary perspective on the frequency organization of the ECG signal. Statistical significance testing through permutation tests (1000 iterations of random class label permutation) confirms that the wavelet entropy importance scores exceed the statistical significance threshold (p < 0.001), while 40% of morphological parameters fail to reach this threshold (p > 0.05), indicating genuine discriminative power versus spurious correlations. Frequent selection of this parameter in optimized configurations suggests that multiresolution spectral complexity information contributes significantly to the global discriminative capacity of the developed system. Unlike traditional Fourier-based spectral measures providing global frequency content, wavelet entropy quantifies the degree of order versus disorder in how energy is distributed across the time–frequency plane, capturing transient phenomena and non-stationary patterns characteristic of different arrhythmia types.
Wavelet entropy’s discriminative capacity derives from its sensitivity to spectral organization changes characteristic of different arrhythmias. Atrial fibrillation produces irregular, broad-spectrum activity with high entropy (disorganized atrial electrical activation). Bradycardia maintains organized low-frequency components with lower entropy (preserved sinus node control but reduced rate). Supraventricular tachycardias exhibit intermediate patterns. This time–frequency complexity quantification captures electrophysiological signatures not accessible through conventional frequency-domain analysis or morphological measurements alone.
In conclusion, the consistent selection of these advanced nonlinear descriptors by multiple independent algorithms—each using different mathematical approaches—provides strong empirical validation of their discriminative value for cardiac arrhythmia classification, confirming their intrinsic relevance to underlying cardiac electrophysiology.
4. Comparative Analysis and Clinical Implementation Implications
To rigorously assess the convergence and complementarity of the three feature selection paradigms, we conducted a comparative overlap analysis visualized through two complementary approaches: a Jaccard similarity matrix quantifying pairwise feature set overlap and a Venn diagram illustrating the distribution of unique versus shared features among the top-12 selections from each method.
The Jaccard similarity analysis (Figure 12, left) reveals moderate convergence between MRMR and ReliefF (0.6 similarity coefficient, indicating 60% feature overlap in their top-12 selections), while the permutation-based method exhibits substantial divergence from both approaches (0.2 similarity), reflecting the fundamental difference between model-agnostic filter methods, which evaluate features independently, and model-specific wrapper approaches, which assess importance through direct performance impact.
The Venn diagram visualization (Figure 12, right) demonstrates that 6 features (50% of top-12 selections) are consistently identified by both MRMR and ReliefF, constituting a high-confidence core feature set with robust discriminative value, while each method additionally identifies 2–7 unique features that contribute complementary perspectives on arrhythmia classification, with no features unanimously selected by all three approaches.
Both complementary visualization approaches, the Jaccard similarity matrix quantifying pairwise feature overlap and the Venn diagram illustrating set intersections, converge to the same fundamental conclusion. MRMR and ReliefF exhibit substantial methodological agreement (six shared features, 0.6 similarity), confirming that the information-theoretic and geometric-distance-based paradigms identify convergent discriminative characteristics. The permutation-based approach, in contrast, produces fundamentally different feature importance rankings (only 0.2 similarity with either filter method), reflecting its model-specific evaluation mechanism: it captures feature interactions within the trained classifier rather than the intrinsic feature properties assessed independently by filter methods.
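The pairwise overlap computation behind the similarity matrix can be sketched as follows; the feature names below are placeholders, not the study's actual descriptor list:

```python
def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| between two feature sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def similarity_matrix(selections):
    """Pairwise Jaccard matrix over the top-k feature selections of each method."""
    names = list(selections)
    return {(m, n): jaccard(selections[m], selections[n])
            for m in names for n in names}
```

For example, two methods sharing three features out of respective four-feature selections (five distinct features in total) score 3/5 = 0.6, while nearly disjoint selections fall toward zero, which is the pattern reported between the filter methods and the permutation-based approach.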
Methods based on complementary selection principles (MRMR—mutual information; ReliefF—geometric distances) demonstrate substantial convergence in discriminative feature identification, confirming that the selected features reflect intrinsic ECG data properties rather than algorithm-specific artifacts, thereby validating their robustness and clinical applicability. The observed partial divergence represents methodological complementarity rather than weakness: the model-agnostic filter methods (MRMR, ReliefF) evaluate intrinsic feature properties independently, while the model-specific approach (permutation) captures feature–classifier interactions and contextual dependencies overlooked by filter methods, with the unique features from each method contributing distinct perspectives on arrhythmia classification.
Systematic and rigorous comparison of the three implemented selection methods (presented in Table 1)—permutation-based sensitivity analysis, MRMR, and ReliefF—offers a comprehensive and robust perspective on the feature optimization process for automated ECG signal classification. The convergent results obtained via these fundamentally different methodological approaches confirm the existence of robust and reproducible intrinsic structures in ECG data and validate the identification of feature subsets with superior and stable discriminative properties.
Permutation-based sensitivity analysis demonstrated maximal computational efficiency through identification of an ultra-optimized subset of only 10 features with a remarkable 88.18% accuracy, representing the ideal solution for medical applications with severe computational and energy resource constraints. The MRMR method achieved comparable performance of 87.57% using only seven features, demonstrating holistic optimization capacity through rigorous balancing of feature relevance and redundancy. ReliefF offered an intermediate methodological compromise with 88.02% accuracy on 29 features, validating the robustness of an approach based on geometric analysis of local neighborhoods in the feature space.
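The permutation-based sensitivity analysis follows the standard recipe of shuffling one feature column at a time and measuring the resulting accuracy drop; this sketch assumes a generic `predict` function and toy data rather than the study's neural network and ECG descriptors:

```python
import random

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Importance of each feature as the mean accuracy drop after shuffling it.

    A large drop means the classifier relied on that feature; a drop near
    zero means the feature is redundant or ignored by the model.
    """
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    base = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the feature-label association for column j
            Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - accuracy(Xp))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the model ignores scores exactly zero, while a feature carrying the class information scores a large positive drop, which is the ranking signal used to build the ultra-optimized subsets.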
A result of particular interest for applications with extreme resource limitations is the performance achieved with the minimalistic configuration of only two features selected by MRMR, demonstrating that ultra-simple solutions can maintain acceptable performance for mass screening systems or medical devices with severe implementation constraints.
From the perspective of computational efficiency and practical implementation flexibility, the developed penalized scoring system offers algorithmic adaptability for diverse deployment scenarios with distinct optimization priorities. For engineering-constrained applications (wearables, telemedicine, mass screening), the ultra-optimized configuration with 7–10 features achieving 87–88% accuracy represents an optimal balance between diagnostic performance and energy efficiency. For clinically critical scenarios requiring maximum sensitivity for life-threatening arrhythmias, the extended configurations with 29–38 features (88–93% accuracy) are recommended, where computational complexity is secondary to diagnostic reliability. Each identified configuration is optimal for its specific deployment context rather than representing a universal solution, and clinical implementation for critical arrhythmia detection will require additional sensitivity-focused optimization with asymmetric cost functions.
Demonstration that artificial intelligence technologies can identify minimal feature subsets with acceptable clinical performance facilitates development of a new generation of medical devices with extended battery life, substantially reduced costs, and applicability in resource-limited environments or regions with restricted geographic access to specialized medical services. This capability is particularly valuable for implementing global telemedicine programs in rural areas or for continuous monitoring of patients with chronic conditions in ambulatory settings.
Another aspect that must be noted is the system robustness demonstrated through performance maintenance in the presence of systematically applied noise on features, confirming technological maturity for implementation in real clinical conditions, where signal quality variability and fluctuations in recording conditions are inevitable and may introduce distortions similar to those simulated in this study.
Table 1 presents a summary of the classification results obtained using a balanced dataset comprising 1000 samples per class (4000 test samples in total). Each configuration represents an optimal solution for specific deployment scenarios with distinct accuracy–complexity trade-offs: “Resource-constrained optimal” (7–10 features) minimizes computational complexity for ultra-low-power wearables while maintaining clinically acceptable accuracy; “Maximum performance” (29–37 features) prioritizes diagnostic accuracy for clinical environments with adequate computational resources; “Balanced trade-off” represents intermediate compromises for general-purpose portable devices; and “Ultra-minimal screening” (2 features) demonstrates feasibility for extremely resource-limited mass screening applications. Optimality is context-dependent, determined by deployment-specific priority weights as formalized in the penalized scoring framework (Section 2.6.2, α penalty coefficient).
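The penalty mechanism can be illustrated with a simple linear form, score = accuracy − α·(k/N); the exact formulation belongs to Section 2.6.2, so the linear penalty and the weights below are illustrative assumptions, not the study's scoring function:

```python
def penalized_score(accuracy, n_selected, n_total, alpha=0.5):
    """Trade-off score rewarding accuracy while penalizing feature count.

    alpha encodes the deployment priority: large alpha favors small feature
    subsets (wearables); alpha near zero reverts to raw accuracy ranking.
    Linear penalty form is an illustrative assumption.
    """
    return accuracy - alpha * (n_selected / n_total)

def best_configuration(configs, alpha=0.5, n_total=38):
    """Pick the (accuracy, n_features) pair maximizing the penalized score."""
    return max(configs, key=lambda c: penalized_score(c[0], c[1], n_total, alpha))
```

With the paper's headline configurations, a resource-constrained α selects the seven-feature solution (87.57%), while a near-zero α selects the 35-feature solution (88.90%), mirroring the context-dependent optimality described above.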
Although the primary classification architecture used in this study is a feedforward neural network with a hidden layer of 10 neurons, this methodological choice was motivated by the need to maintain a consistent framework for rigorous comparison of the three feature selection techniques. To validate the robustness of our conclusions and demonstrate that the identified optimal subsets are not artifacts specific to a single classifier, we additionally evaluated performance using several classification paradigms available in the MATLAB Classification Learner environment: Support Vector Machines (Linear, Gaussian, and Kernel SVM) and ensemble methods (Boosted Trees, RUSBoosted Trees, Bagged Trees). The comparative results, presented in Table 2, confirm that the accuracies obtained on the subsets selected by MRMR, ReliefF, and sensitivity analysis remain consistent (with generally small variations) regardless of the classifier used, thus validating the convergence of the selection methods and the intrinsic discriminative character of the proposed nonlinear descriptors, independent of the chosen classification architecture.
This research extends our previous study [10] with three key methodological improvements that enhance practical applicability.
First, we implemented signal segmentation using 2 s windows with 1 s overlap, generating nine segments per 10 s recording. This yielded 95,447 segments (versus 10,646 whole recordings previously), enabling more robust learning and a detailed assessment of intra-patient variability. Additionally, multiplicative noise drawn uniformly from [0.9, 1.1] was applied to the first 12 classical features to validate system robustness under realistic recording conditions.
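The windowing and feature perturbation described above can be sketched as follows; the sampling rate `fs` is left as a parameter (the study does not restate it here), and the uniform draw from [0.9, 1.1] matches the stated multiplicative noise range:

```python
import random

def segment(signal, fs, win_s=2.0, step_s=1.0):
    """Split a signal into overlapping windows: 2 s length, 1 s stride by default.

    A 10 s recording yields 9 windows, as described in the text.
    """
    win, step = int(win_s * fs), int(step_s * fs)
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, step)]

def perturb(features, low=0.9, high=1.1, seed=None):
    """Apply multiplicative noise drawn uniformly from [low, high] per feature."""
    rng = random.Random(seed)
    return [f * rng.uniform(low, high) for f in features]
```

The 1 s stride is what produces nine segments per 10 s record (start offsets 0 s through 8 s), and `perturb` reproduces the ±10% amplitude jitter used in the robustness test on the 12 classical features.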
Second, we developed a penalized scoring system optimizing the performance–complexity balance, enabling identification of ultra-optimized configurations: 7 features achieving 87.57% accuracy (81.5% complexity reduction) for resource-constrained scenarios or 35 features reaching 88.90% for maximum performance applications.
Third, we demonstrated cross-methodological convergence across three distinct algorithms (permutation sensitivity analysis, MRMR, ReliefF), which consistently identified the same critical feature subsets, particularly advanced nonlinear parameters. This convergence validated the results’ robustness, confirming that discriminative information is intrinsic to the data structure rather than algorithm-dependent. This study also employed a more rigorous evaluation protocol with 1000 test examples per class (versus 500 previously).
4.1. Engineering vs. Clinical Optimization: Context-Specific Feature Configurations
The penalized scoring system employed in this study optimizes engineering trade-offs between classification accuracy and computational complexity, not clinical outcomes with asymmetric diagnostic costs. It is crucial to distinguish between distinct deployment scenarios that require fundamentally different optimization objectives:
Engineering-constrained scenarios (computational efficiency prioritized):
Mass population screening programs in resource-limited geographic regions.
Long-term continuous cardiac monitoring with battery-powered wearable devices requiring extended autonomy.
Low-power embedded systems for rural telemedicine applications with limited infrastructure.
Recommended configurations: 7–10 features achieving 87–88% accuracy with 73–81% complexity reduction.
Clinical-critical scenarios (diagnostic sensitivity prioritized):
Intensive monitoring of high-risk cardiac patients where false negatives are unacceptable.
Detection of life-threatening arrhythmias (ventricular fibrillation, severe atrial fibrillation) requiring maximum sensitivity.
Hospital diagnostic environments with adequate computational resources and power supply.
Recommended configurations: 29–38 features achieving 88–93% accuracy, with potential for sensitivity-specific optimization.
Table 1 presents multiple context-specific optimal configurations, each representing the optimal balance for specific implementation contexts with distinct primary constraints, rather than proposing a single universal “best” solution. The 2-feature configuration (76% accuracy) serves ultra-minimalistic screening applications; the 7-feature configuration balances performance and efficiency for general wearable deployment; and the 29–38 feature configurations provide maximum accuracy for clinical environments where computational resources are not the limiting factor.
Clinical deployment for critical arrhythmia detection requires sensitivity-focused optimization with asymmetric loss functions that heavily penalize false negatives, potentially involving additional features, multi-lead analysis, or ensemble methods—representing optimization objectives fundamentally different from the computational efficiency focus of this engineering study. The methodological framework and validated feature subsets presented here provide the algorithmic foundation upon which such clinically optimized systems can be constructed.
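An asymmetric loss of the kind envisaged above can be sketched as a cost function that weights missed critical events far more heavily than false alarms; the 10:1 ratio and the class labels are purely illustrative, not values validated in this study:

```python
def asymmetric_cost(y_true, y_pred, positive, fn_cost=10.0, fp_cost=1.0):
    """Total misclassification cost for one critical class.

    A false negative (missed critical arrhythmia) costs fn_cost; a false
    positive (false alarm) costs fp_cost. Minimizing this instead of plain
    error rate pushes the operating point toward high sensitivity.
    """
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == positive and p != positive:
            cost += fn_cost  # missed critical event
        elif t != positive and p == positive:
            cost += fp_cost  # false alarm
    return cost
```

Selecting features or decision thresholds to minimize this cost, rather than maximizing accuracy, is the sensitivity-focused optimization that would distinguish a clinically deployed detector from the efficiency-focused configurations of this study.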
4.2. Complexity Analysis Limitations
The “complexity reduction” reported in Table 1 refers to feature space dimensionality reduction, not exhaustive computational profiling. This metric serves as a proxy for the processing burden but does not capture significant variations in extraction cost among features (e.g., wavelet entropy vs. QRS duration) or dependencies on hardware architecture and implementation strategy. Actual execution time, energy consumption, and memory requirements depend on the target platform (embedded MCU, edge, or cloud) and require platform-specific profiling. For scenarios with adequate connectivity, distributed architectures may shift optimization priorities from local computation to bandwidth and latency. Rigorous validation requires execution-time benchmarking on target embedded platforms, an essential subsequent research phase for practical deployment.
4.3. Comparison with State-of-the-Art Methods
The methodological approach presented in this study—classical feature engineering with algorithmic feature selection—occupies a specific niche within the broader landscape of contemporary ECG classification techniques. It is essential to position this work relative to modern deep learning approaches and clarify the distinct scenarios where each paradigm is appropriate.
End-to-end deep learning: The recent literature has demonstrated impressive performance using 1D convolutional neural networks (CNNs), recurrent architectures (LSTM, GRU), and attention-based transformers that learn representations directly from raw ECG signals. These approaches eliminate manual feature engineering and can achieve 94–98% accuracy on similar multiclass arrhythmia tasks. However, they require (1) substantial computational resources (millions of multiply–accumulate operations per inference); (2) significant memory for model weights (typically 100 KB–10 MB even after compression); (3) often hardware acceleration (GPU, TPU, neural processing units); and (4) large training datasets (tens of thousands of examples).
Edge-AI and model compression: Techniques such as knowledge distillation, quantization, and pruning have enabled deployment of neural networks on edge devices with moderate computational capacity (ARM Cortex-A series, smartphones, edge gateways). These represent viable solutions for scenarios with power budgets in the 100 mW–1 W range and memory availability of several megabytes.
Our feature-based approach with classical machine learning is specifically optimized for ultra-resource-constrained scenarios where deep learning is fundamentally incompatible:
- Ultra-low-power microcontrollers (ARM Cortex-M0/M4, power budget < 10 mW)
- Battery-powered wearables requiring months of continuous operation
- Single-chip solutions without external memory or hardware acceleration
- Disposable/low-cost medical devices for mass screening in developing regions
- Scenarios requiring model interpretability for clinical acceptance and regulatory approval
Table 3 presents the comparative trade-offs between the feature engineering approach of this paper and end-to-end deep learning.
The 7–10 feature configurations validated in this study achieve competitive performance (87–88% accuracy) at computational costs 2–3 orders of magnitude lower than compressed neural networks, making them uniquely suitable for the most resource-constrained deployment scenarios. For applications with adequate computational resources (smartphones, cloud-connected devices, hospital equipment), end-to-end deep learning approaches represent superior alternatives and should be preferred.
The algorithmic efficiency of our approach is detailed in Table 3, and the practical necessity of these design choices becomes even more evident when considering the physical hardware constraints. We want to emphasize that a fundamental distinction must be made between the high-performance Edge-AI platforms (such as FPGAs or Cortex-A systems) often cited in the literature and the ultra-low-power microcontrollers (MCUs) targeted in this study.
As shown in Table 4, the hardware requirements for deploying deep learning models—even those optimized via quantization or distillation—frequently exceed the power and cost budgets of long-range, battery-operated wearables.
As evidenced by the trade-offs in Table 3 and the hardware specifications in Table 4, our approach achieves competitive performance while maintaining a resource footprint two to three orders of magnitude lower than typical neural network deployments. This makes it a strategic choice for autonomous, long-range sensors, whereas deep learning remains the preferred solution for platforms where computational density and power availability are not primary constraints.
5. Conclusions
This research demonstrates the viability of developing a comprehensive multi-algorithmic framework for optimizing ECG signal classification through intelligent feature selection, with direct applicability in portable cardiac monitoring and telemedicine systems. Rigorous evaluation of three distinct methods—permutation-based sensitivity analysis, MRMR, and ReliefF—confirms that clinically acceptable performance (87–89% accuracy) can be achieved using dramatically reduced feature subsets.
The original scientific contribution consists of systematically integrating and validating 26 established nonlinear descriptors from dynamical systems theory, including the Hurst exponent, DFA scaling exponent, absolute logarithmic correlations, mean standard increment from the Poincaré plot, and wavelet entropy. The consistent presence of these parameters in the optimized subsets identified by all the selection methods confirms the fundamental value of advanced nonlinear analysis for characterizing cardiac activity dynamics and discriminating pathological classes.
The result with maximum practical impact is validation of an ultra-optimized configuration with only seven features maintaining 87.57% accuracy, representing an 81.5% computational complexity reduction with minimal performance degradation. This configuration offers an optimal balance between diagnostic fidelity and implementation efficiency for portable medical devices with severe resource constraints.
The remarkable convergence of results across the MRMR, ReliefF, and sensitivity analyses—despite employing fundamentally different mathematical frameworks—confirms the existence of intrinsic discriminative structures in ECG signals transcending algorithm specificity. This convergence facilitates practical implementation with high confidence and ensures result transferability to diverse classification architectures.
The developed penalized scoring system for performance–complexity balancing offers algorithmic adaptability for diverse clinical scenarios, enabling dynamic configuration from maximum performance in resource-abundant environments to maximum efficiency in energy-constrained ambulatory monitoring.
Comprehensive validation through systematic noise simulation demonstrates technological maturity for implementation in real clinical conditions, where signal quality fluctuations are inevitable. The system maintains performance with minimal degradation under realistic noise, indicating reliable operation in ambulatory and home-monitoring scenarios.
Limitations and Future Directions
This research opens up multiple strategic directions for future developments: multi-center validation for confirming generalizability; hybridization with deep learning architectures; real-time clinical deployment evaluation; adaptive feature selection mechanisms; and transferability to other physiological signals.
It is essential to emphasize that this study provides an engineering methodological framework for feature optimization, not a clinically validated diagnostic system. The convergence of three independent selection algorithms in identifying similar optimal feature subsets validates the intrinsic discriminative value of the proposed nonlinear descriptors and the robustness of the identified configurations for their respective deployment contexts. Translation into clinical practice requires subsequent validation addressing the asymmetric diagnostic costs, sensitivity maximization for specific arrhythmia types, prospective patient studies, and regulatory compliance.
The potential impact extends beyond technical contributions, offering a methodological platform for revolutionizing intelligent medical system design. By demonstrating that competitive clinical performance can be maintained with dramatically reduced computational resources, this study enables accessible digital medical ecosystems in resource-limited regions, contributing to reducing global inequalities in access to advanced medical technologies and specialized cardiac monitoring services.
Extending the methodology toward fine-grained classification of the original 11 arrhythmia classes constitutes a distinct research direction that requires specialized class balancing strategies (SMOTE, class weighting, focal loss) and potentially hierarchical or ensemble classification architectures, going beyond the current methodological goal focused on algorithmic convergence in feature selection.