1. Introduction
Cardiovascular diseases (CVDs) are among the leading causes of death worldwide, highlighting the importance of early diagnosis to improve patient health and quality of life [1]. Heart sound analysis plays a significant role in the diagnosis and management of CVDs. Heart sounds, produced during cardiac contraction and relaxation, contain crucial information that is essential for detecting abnormalities early. Heart sounds typically range between 20 Hz and 2000 Hz, with most energy concentrated below 100 Hz. Low-frequency noise, such as baseline drift and motion artifacts, can interfere with signal analysis, necessitating appropriate filtering techniques.
Heart sound analysis can be performed using two primary approaches: segmented and non-segmented. Segmented approaches decompose the signal into components such as S1, S2, S3, and S4, enabling detailed examination of each part. For example, a study using the PhysioNet A dataset with a segmented CNN model achieved 97.21% accuracy, 94.78% sensitivity, and 99.65% specificity [2]. Advanced methods such as the Hidden Semi-Markov Models (HSMMs) introduced by Springer et al. [3] have demonstrated robust segmentation performance, even under noise. Deep learning methods, including CNNs, RNNs, and hybrid CNN-LSTM architectures, further improve segmentation accuracy but typically require large, annotated datasets and significant computational resources.
In contrast, non-segmented approaches analyze the entire heart sound signal directly, offering simplicity and faster processing but often at the expense of accuracy. For instance, a non-segmented CNN model achieved 84.15% accuracy on PhysioNet/CinC 2016 [4]. Non-segmented models generally depend on machine learning or deep learning classifiers to process raw signals. Feature extraction methods such as Mel-Frequency Cepstral Coefficients (MFCCs) and Empirical Mode Decomposition (EMD) have been widely used to derive discriminative characteristics. Additionally, modified Empirical Wavelet Transform (EWT) and Normalized Average Shannon Energy (NASE) have been proposed to improve performance under noisy conditions [5].
Recent studies have explored sophisticated segmentation and classification strategies. For example, a convolutional–transformer hybrid architecture achieved 99.7% accuracy without explicit segmentation [6], dynamic programming combined with frequency-domain features and Siamese Neural Networks showed promising results [7], and a multi-scale adaptive segmentation approach with continuous wavelet transforms and a CRNN achieved 98.6% accuracy on PhysioNet [8]. However, these advanced models require high computational resources and large annotated datasets, limiting practical deployment in resource-constrained clinical settings. In addition, time–frequency-domain deep neural networks that couple MGWST/entropy features with deep encoders have demonstrated strong screening performance on MHSDB and competitive results on PhysioNet/CinC 2016 [9]. Complementarily, Stockwell transform-based boundary detection has been proposed for accurate S1/S2 localization using adaptive thresholds on S-transform envelopes, demonstrating reliable segmentation on Michigan PCG subsets [10]. These studies underline the value of richer time–frequency representations and provide useful context for positioning our Shannon–Otsu segmentation with MFCC/EMD features.
Despite recent advances in heart sound classification, such as complex segmentation algorithms and deep learning-based pipelines, systematic evaluations that explicitly assess the impact of segmented versus non-segmented strategies remain scarce. In particular, lightweight, interpretable, and computationally efficient methods are often overlooked in favor of resource-intensive approaches that rely on large, annotated datasets. Automated heart sound segmentation and classification systems may play a vital role in the development of digital health tools, particularly for the early detection of cardiovascular anomalies in resource-constrained or telemedicine settings. Such methods could be integrated into clinical information systems or portable diagnostic devices to support clinical decision-making in primary care environments.
To address this gap, we propose a fully unsupervised and computationally efficient segmentation method that combines Shannon energy envelope analysis with adaptive thresholding based on Otsu’s method. Unlike deep models requiring extensive training and parameter tuning, the proposed approach is suitable for real-world applications with limited computational resources. We evaluate both segmented and non-segmented strategies on two large-scale benchmark datasets, PhysioNet/CinC 2016 and Pascal, using robust feature extraction techniques (MFCCs and EMD) and traditional classifiers (kNN, SVM, RF).
The main contributions of this study are as follows:
- Development of an unsupervised and efficient segmentation technique combining Shannon energy and Otsu-based adaptive thresholding for envelope-based heart sound analysis.
- Systematic comparison of segmented and non-segmented classification pipelines using classical features (MFCCs and EMD) and conventional classifiers (kNN, SVM, RF) across two benchmark datasets.
- In-depth evaluation of different feature–classifier combinations with and without segmentation to provide quantitative insights into the added value of segmentation.
- Implementation of PCA-based dimensionality reduction to enhance computational efficiency while maintaining classification performance.
These contributions aim to provide a practical and interpretable framework for heart sound classification, especially for deployment in resource-constrained clinical environments. A comprehensive summary of related studies employing segmented and non-segmented approaches is provided in Table 1, highlighting the position and scope of our study in the current literature.
2. Material and Methods
The proposed method in this study involved a systematic pipeline of preprocessing, segmentation, feature extraction, dimensionality reduction, and classification steps; Figure 1 illustrates the complete pipeline. Initially, raw heart sound data underwent preprocessing to enhance signal quality. This phase included applying a notch filter to eliminate powerline interference and an elliptic bandpass filter to suppress noise outside the target frequency band. Subsequently, signal amplitudes were normalized to a standard range, ensuring consistent conditions across all recordings and facilitating reliable classification.
After preprocessing, the data were analyzed via two parallel pathways: segmented and non-segmented analysis. In the segmented approach, the envelope of the heart sound signals was first extracted using the Shannon energy method. Envelope extraction highlighted the significant components of the signal while suppressing noise. Subsequently, Otsu thresholding was applied adaptively in a local windowed manner to dynamically determine threshold values, effectively isolating critical cardiac sound components (S1 and S2). Following segmentation, two advanced and widely recognized feature extraction techniques were employed: Empirical Mode Decomposition (EMD) and Mel-Frequency Cepstral Coefficients (MFCCs). In the EMD approach, each segment was decomposed into intrinsic mode functions (IMFs). From each IMF, 11 statistical features were extracted: mean, standard deviation, variance, mode, minimum, maximum, skewness, kurtosis, entropy, energy, and average power. Both mode and maximum were retained despite their correlation, as mode captures repetitive signal patterns, whereas maximum identifies peak amplitudes indicative of potential anomalies. Similarly, MFCC features underwent the same 11 statistical computations, maintaining methodological consistency across both extraction techniques.
To address the class imbalance observed in both segmented and non-segmented datasets, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to the abnormal class after feature extraction but prior to classification. In the segmented configuration, SMOTE produced 173,340 synthetic abnormal segments by interpolating between minority-class feature vectors in the high-dimensional feature space. In the non-segmented configuration, the original dataset contained 2775 normal and 926 abnormal recordings; SMOTE generated 1852 synthetic abnormal samples using k = 5 nearest neighbors, expanding the abnormal class to 2778 instances. This interpolation strategy preserved local feature distributions while mitigating imbalance, ensuring that the classifiers received a balanced training set for more reliable sensitivity to abnormal cases. To minimize potential overfitting to synthetic patterns, a 5-fold cross-validation (CV) procedure was strictly enforced during model evaluation, and performance consistency across folds was used as an indicator of generalization capability.
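For illustration, this balancing step can be sketched in Python with the imbalanced-learn implementation of SMOTE; the array contents below are placeholders, and the authors' own workflow (and the exact resulting counts) may differ.

```python
# Minimal SMOTE sketch with imbalanced-learn; feature values are placeholders.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(2775 + 926, 55))      # placeholder feature matrix
y = np.array([0] * 2775 + [1] * 926)       # non-segmented class counts from the text

# k = 5 nearest neighbors, as described above; new abnormal samples are
# interpolated between minority-class neighbors in feature space.
X_bal, y_bal = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(np.bincount(y_bal))                  # approximately balanced class counts
```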
Classification was then performed using three robust machine learning algorithms: k-Nearest Neighbor (kNN), Support Vector Machine (SVM), and Random Forest (RF). Model evaluation was carried out using a 5-fold cross-validation method, rather than traditional 80–20% splits. This approach involved partitioning the dataset into five subsets, using each subset once as the test set and the remaining subsets for training iteratively. The average performance across these iterations provided an unbiased evaluation of model generalizability. Additionally, Principal Component Analysis (PCA) was applied to the extracted features to reduce dimensionality, preserving 95% of the variance. PCA effectively eliminated redundant and less informative features, enhancing computational efficiency and reducing model complexity during classification.
2.1. Dataset
This study utilizes two datasets for heart sound classification: the PhysioNet/CinC Challenge 2016 dataset and the Pascal dataset. These datasets provide a rich variety of heart sound recordings, enabling comprehensive evaluation of the proposed methods under diverse conditions.
The PhysioNet dataset contains 3240 recordings collected from 764 individuals, including both healthy individuals and patients with various cardiac pathologies. The recordings are labeled as either normal or abnormal and were obtained from six different sources by seven independent research groups. The recordings have a sampling frequency of 2000 Hz and vary in duration, typically ranging from a few seconds to several minutes. This dataset reflects a wide range of real-world conditions, including differences in recording devices and environments [22,23].
The Pascal dataset consists of 461 recordings captured at a sampling frequency of 4000 Hz. The recordings are categorized into five groups: Normal, Normal Noisy, Murmur, Noisy Murmur, and Extrasystole. The recording lengths range from 5 to 10 s, making them suitable for evaluating the performance of segmentation and classification techniques. The dataset also introduces challenges by including both clean and noisy heart sound recordings, providing a realistic benchmark for robust classification [24]. The detailed distribution of samples across categories in both datasets is presented in Table 2. The distribution of samples within these datasets is summarized in the subsequent sections, where the methods for addressing class imbalance and segmenting heart sounds are discussed.
2.2. Preprocessing
In the preprocessing stage, heart sound signals were initially subjected to a notch filtering process to suppress powerline interference centered at 60 Hz. This filter was implemented as a second-order infinite impulse response (IIR) notch filter with a quality factor (Q-factor) of approximately 35, effectively attenuating narrowband noise components without distorting the primary cardiac signal characteristics.
Following notch filtering, an elliptic bandpass filter was applied to isolate the physiological frequency range relevant to heart sounds. The filter was designed with an order of two, a passband ripple of 5 dB, a stopband attenuation of 80 dB, and a frequency range of 20–400 Hz. This configuration ensured the preservation of fundamental and harmonic components while minimizing out-of-band noise and artifacts.
Subsequently, amplitude normalization was performed by scaling each signal based on its maximum absolute amplitude, thereby constraining all signals to a uniform range of [−1, 1]. This normalization procedure enhanced consistency across recordings from different subjects and recording environments, facilitating robust feature extraction and reliable comparative analysis.
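The following SciPy sketch mirrors the stated filter settings; it is illustrative rather than the authors' MATLAB code, and zero-phase filtering via filtfilt is an assumption.

```python
# Preprocessing sketch with SciPy, mirroring the stated settings (60 Hz notch,
# Q = 35; 2nd-order elliptic bandpass, 5 dB ripple, 80 dB stopband, 20-400 Hz;
# peak-amplitude normalization). Zero-phase filtering via filtfilt is an assumption.
import numpy as np
from scipy.signal import ellip, filtfilt, iirnotch

def preprocess(x, fs=2000.0):
    b_n, a_n = iirnotch(w0=60.0, Q=35.0, fs=fs)        # powerline notch
    x = filtfilt(b_n, a_n, x)
    b_e, a_e = ellip(N=2, rp=5, rs=80, Wn=[20, 400],   # elliptic bandpass
                     btype="bandpass", fs=fs)
    x = filtfilt(b_e, a_e, x)
    return x / np.max(np.abs(x))                       # scale to [-1, 1]
```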
2.3. Segmentation Techniques
Segmentation is a method that divides datasets, images, or signals into more meaningful and processable parts to facilitate their interpretation and analysis. In heart sound analysis, segmentation plays a crucial role in isolating the fundamental components of the cardiac cycle, enabling the identification of specific abnormalities. Heart sounds, which can be heard using a stethoscope, provide essential information about the cardiovascular system and play a critical role in disease diagnosis. As shown in Figure 2, heart sound signals consist primarily of two components: S1 and S2. The S1 sound is associated with the closure of the mitral and tricuspid valves [25], marking the beginning of systole, while the S2 sound corresponds to the closure of the aortic and pulmonary valves, indicating the beginning of diastole [26]. While these two heart sounds are clearly distinguishable in healthy individuals, additional sounds such as S3 and S4 may appear under certain pathological conditions.
In this study, heart sound segmentation was performed using the Shannon energy envelope and Otsu thresholding methods to isolate S1 and S2 components. Although no manual or automated segmentation results were available for direct comparison, the effectiveness of the segmentation approach was indirectly assessed through classification performance.
While Shannon energy and Otsu thresholding have individually been widely utilized in biomedical signal processing, their integrated use specifically for phonocardiogram (PCG) segmentation to isolate the fundamental heart sounds (S1 and S2) remains relatively unexplored. Unlike conventional fixed-window or Hidden Semi-Markov Model (HSMM)-based segmentation methods, our proposed Shannon–Otsu segmentation approach is fully unsupervised, adaptive to varying signal conditions, and computationally lightweight, requiring no extensive annotated data or training procedures. To the best of our knowledge, this is the first study to systematically evaluate the impact of this combined Shannon–Otsu thresholding technique on heart sound classification across large and diverse datasets such as PhysioNet/CinC 2016 and Pascal, demonstrating its robustness and effectiveness in significantly improving classification accuracy.
Shannon Energy Method and Otsu Thresholding
In this study, the Shannon energy (SE) envelope is computed from the local spectrum obtained by the S-transform for each time sample. Let $|S(f_k, t_n)|$ denote the S-transform magnitude for frequency bin $f_k$ and time index $t_n$ (with the Gaussian window width $\sigma(f) = 1/|f|$ as in the standard S-transform). Then the Shannon energy of the $n$-th time sample is defined as the column-wise aggregation over frequencies, as given in Equation (1):

$$SSE(t_n) = -\sum_{k} |S(f_k, t_n)|^2 \log\left(|S(f_k, t_n)|^2 + \varepsilon\right) \quad (1)$$

where $\varepsilon$ is a small constant to avoid $\log(0)$. This SSE (Shannon Spectral Energy) envelope emphasizes medium-intensity components and attenuates low-intensity noise, yielding a robust representation for S1/S2 localization, exactly as described in the SSE localization method [28,29,30].
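As a concrete sketch of Equation (1), the snippet below aggregates Shannon energy column-wise over a short-time spectrum. An STFT magnitude stands in for the full S-transform for brevity (an assumption, not the authors' implementation), and the 30 ms window with 15 ms hop follows the short-time settings reported later in this section.

```python
# Shannon spectral energy (SSE) envelope per Equation (1), with an STFT used
# as a simplified stand-in for the S-transform; eps avoids log(0).
import numpy as np
from scipy.signal import stft

def sse_envelope(x, fs=2000.0, win_s=0.030, hop_s=0.015, eps=1e-12):
    nper = int(win_s * fs)                          # 30 ms analysis window
    noverlap = nper - int(hop_s * fs)               # 15 ms hop
    _, t, Z = stft(x, fs=fs, nperseg=nper, noverlap=noverlap)
    P = np.abs(Z) ** 2                              # |S(f_k, t_n)|^2
    env = -np.sum(P * np.log(P + eps), axis=0)      # column-wise aggregation
    return t, env / np.max(env)                     # envelope scaled to [0, 1]
```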
The Otsu method is used to determine an optimal threshold value by analyzing the histogram of the signal. This technique represents the signal data at $L$ different levels, where each level $i$ is indicated by the number of data points $n_i$ at that level, and the total number of data points is $N = \sum_{i=1}^{L} n_i$. It is assumed that the signal is divided into two classes using a threshold at level $t$: Class $C_0$ contains data at levels from 1 to $t$, while Class $C_1$ includes data from levels $t+1$ to $L$ [31]. According to Otsu's theory, the within-class variance is minimized when the between-class variance is maximized; therefore, the optimal threshold is the one that yields the largest separation between the two classes in the histogram. Applying this algorithm to the 1-D energy-envelope histogram provides an amplitude threshold that isolates high-energy cardiac events (S1/S2) while suppressing background/noise [32]. The class probabilities and means are computed as in Equations (2) and (3):

$$\omega_0(t) = \sum_{i=1}^{t} p_i, \qquad \omega_1(t) = \sum_{i=t+1}^{L} p_i, \qquad p_i = \frac{n_i}{N} \quad (2)$$

$$\mu_0(t) = \frac{1}{\omega_0(t)} \sum_{i=1}^{t} i\, p_i, \qquad \mu_1(t) = \frac{1}{\omega_1(t)} \sum_{i=t+1}^{L} i\, p_i \quad (3)$$

The threshold is then selected by maximizing the between-class variance in Equation (4):

$$t^{*} = \arg\max_{1 \le t < L} \; \omega_0(t)\, \omega_1(t) \left[\mu_0(t) - \mu_1(t)\right]^2 \quad (4)$$

where $\omega_0$ and $\omega_1$ denote the class probabilities, and $\mu_0$ and $\mu_1$ denote the corresponding class means.
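A compact NumPy version of Equations (2)–(4), applied to the histogram of a [0, 1]-scaled envelope, might look as follows; the 256-bin histogram is an illustrative choice rather than a reported parameter.

```python
# NumPy version of Equations (2)-(4) on the histogram of a [0, 1]-scaled envelope.
import numpy as np

def otsu_threshold(env, nbins=256):
    hist, edges = np.histogram(env, bins=nbins, range=(0.0, 1.0))
    p = hist / hist.sum()                        # p_i = n_i / N
    w0 = np.cumsum(p)                            # class probabilities, Eq. (2)
    w1 = 1.0 - w0
    mids = 0.5 * (edges[:-1] + edges[1:])
    mu = np.cumsum(p * mids)                     # cumulative first moment
    with np.errstate(divide="ignore", invalid="ignore"):
        mu0 = mu / w0                            # class means, Eq. (3)
        mu1 = (mu[-1] - mu) / w1
        sigma_b = w0 * w1 * (mu0 - mu1) ** 2     # between-class variance, Eq. (4)
    return mids[np.nanargmax(sigma_b)]           # tau maximizing Eq. (4)
```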
Using this threshold value, the high-energy regions of the signal are segmented effectively. To further enhance the quality of the energy signal, a median filter with a window size of $W$ samples is applied, as defined in Equation (5):

$$\tilde{E}(n) = \operatorname{median}\left\{E\left(n - \lfloor W/2 \rfloor\right), \ldots, E\left(n + \lfloor W/2 \rfloor\right)\right\} \quad (5)$$

where $E(n)$ is the energy envelope and $W$ is the window length (here, $W = 75$ samples).
A median filter with a window length of 75 samples was employed to suppress noise artifacts and enhance the clarity of the segmented signal envelope. This filter was chosen instead of a mean (moving average) filter due to its superior ability to suppress impulsive noise while preserving sharp transitions and edges, making it more robust to outlier artifacts commonly observed in heart sound signals. The window length was empirically determined to balance noise suppression and temporal resolution, thereby supporting accurate peak detection in subsequent steps.
The segmentation uses an adaptive threshold (τ), selected by Otsu's method, on the Shannon-energy envelope. Raw signals are first amplitude-normalized to [−1, 1], and the envelope is then linearly scaled to [0, 1]; thus, τ ∈ [0, 1] and is invariant to absolute amplitude. This procedure isolates high-energy cardiac events without any fixed, manually set amplitude threshold. Postprocessing applies a short median filter together with minimum event-duration and minimum inter-event-gap constraints: tightening these constraints (or increasing τ) reduces false positives but may miss low-amplitude S2 events, whereas relaxing them (or decreasing τ) increases recall at the cost of more spurious detections. The reported settings were chosen via 5-fold cross-validation to maximize fold-wise balanced accuracy.
Segmentation was dynamically performed using the Shannon energy envelope combined with Otsu’s adaptive thresholding. Short-time analysis was conducted with a 30 ms window and a 15 ms hop size, improving temporal precision. The previously mentioned 200 ms refers to the approximate maximum duration of a complete heart sound event (S1 or S2) and served only as a reference for interpreting segment lengths rather than as a fixed segmentation window.
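Putting these steps together, a hedged sketch of the postprocessing is shown below. The envelope is assumed to be sampled at the original signal rate, and the minimum-duration and minimum-gap values are illustrative, since the text does not report them.

```python
# Postprocessing sketch: median-filter the envelope (Equation (5)), binarize at
# the Otsu threshold tau, then enforce minimum event-duration and inter-event-gap
# constraints. The 50 ms duration/gap values are illustrative assumptions.
import numpy as np
from scipy.signal import medfilt

def extract_events(env, tau, fs=2000.0, min_dur_s=0.05, min_gap_s=0.05):
    env = medfilt(env, kernel_size=75)           # 75-sample median filter
    mask = (env > tau).astype(int)               # adaptive Otsu threshold
    d = np.diff(np.concatenate(([0], mask, [0])))
    starts, stops = np.flatnonzero(d == 1), np.flatnonzero(d == -1)
    events = []
    for s, e in zip(starts, stops):
        if e - s < min_dur_s * fs:               # drop events that are too short
            continue
        if events and s - events[-1][1] < min_gap_s * fs:
            events[-1] = (events[-1][0], e)      # merge events that are too close
        else:
            events.append((s, e))
    return events                                # (start, stop) sample indices
```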
Following segmentation, peak detection was conducted using MATLAB® 2024a’s findpeaks function. This step was not solely intended for visualization but was crucial for defining precise segment boundaries corresponding to S1 and S2 events. Accurate peak identification ensured that only physiologically meaningful cardiac events were included in the analysis, thereby improving the reliability of feature extraction and classification. By excluding irrelevant or noisy signal portions, this approach enhanced overall model performance and robustness. The minimum peak height was set to 5% of the maximum amplitude of the cleaned signal, and the minimum peak distance was defined as 10% of the sampling frequency. Peaks that did not meet these criteria were automatically discarded without manual correction. Instead, statistical features were extracted over entire segments (S1–S2), mitigating the influence of potential outliers.
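In Python, the equivalent of these findpeaks settings can be expressed with scipy.signal.find_peaks, as sketched below.

```python
# SciPy counterpart of the findpeaks settings above: minimum peak height of 5%
# of the cleaned signal's maximum amplitude, minimum peak distance of 10% of fs.
import numpy as np
from scipy.signal import find_peaks

def detect_peaks(clean, fs=2000.0):
    peaks, _ = find_peaks(
        clean,
        height=0.05 * np.max(np.abs(clean)),   # MinPeakHeight analogue
        distance=int(0.10 * fs),               # MinPeakDistance analogue
    )
    return peaks                               # sample indices of retained peaks
```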
Figure 3 and Figure 4 present representative examples of heart sound recordings from normal and abnormal subjects, respectively, demonstrating the effectiveness of the proposed segmentation and peak detection framework. These figures highlight how the combined use of the Shannon energy envelope and Otsu thresholding facilitates the isolation of primary cardiac events (S1 and S2) under varying signal conditions.
Figure 5 offers a more detailed illustration of the peak selection procedure, emphasizing how amplitude-based and temporal constraints—specifically, minimum peak height and minimum inter-peak distance—contribute to the suppression of false detections. In all examples, the upper panels display the Shannon energy envelopes, while the lower panels show the original heart sound signals overlaid with detected peaks. Peaks identified prior to filtering are marked in blue, whereas those retained after constraint-based refinement are highlighted in red, underscoring the robustness of the proposed method in distinguishing physiologically relevant cardiac events from noise.
Due to the absence of precisely annotated heart sound segments (e.g., S1 and S2 onset and offset times) in the datasets used, quantitative performance metrics such as onset timing error, detection rate, sensitivity, and specificity of the segmentation method could not be calculated explicitly. However, visual inspections and qualitative analysis confirmed that the proposed Shannon energy- and Otsu threshold-based segmentation consistently and accurately isolated primary heart sound components across most recordings. Future research should involve datasets with detailed segment annotations to facilitate rigorous quantitative validation and further assess the robustness of the segmentation approach, particularly for pathological cases that may contain additional heart sound components (e.g., S3 and S4).
2.4. Feature Extraction
The features of heart sound signals obtained from segmentation were analyzed using feature extraction methods to represent them comprehensively in the time, frequency, and time–frequency domains. The primary goal of feature extraction is to describe a signal accurately using a minimal yet informative set of features [33,34]. In this study, two widely used methods were employed for feature extraction: EMD and MFCCs.
The feature extraction phase converts raw heart sound segments (including S1 and S2 components) into structured representations suitable for machine learning classifiers. These segments were obtained through the Shannon energy envelope and Otsu thresholding. In EMD, each segment was analyzed as a single unit without further subdivision, while MFCC analysis used short-term framing [17].
For non-segmented cases, the entire heart sound signal was directly subjected to feature extraction, enabling a comparative analysis of segmented versus non-segmented approaches.
2.4.1. Empirical Mode Decomposition (EMD)
EMD enables adaptive time–frequency analysis by decomposing a signal into intrinsic mode functions (IMFs), each representing an oscillatory mode. The decomposition follows two criteria:
- (i) The number of extrema and zero crossings must be equal or differ by at most one.
- (ii) The mean value of the upper and lower envelopes must be zero at any point [35].
In this study, the number of IMFs was fixed to five to ensure consistent feature vector lengths across all segments. From each IMF, eleven statistical descriptors (mean, variance, standard deviation, mode, minimum, maximum, skewness, kurtosis, entropy, energy, and average power) were calculated, providing a comprehensive characterization of signal variability and structure. This approach ensured consistency and comparability across segments and supported robust classification.
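A sketch of this decomposition step using the PyEMD package is shown below; the library choice and the zero-padding of missing IMFs are assumptions, since the authors' MATLAB implementation is not shown. The eleven descriptors are then computed per IMF, as sketched in Section 2.4.3.

```python
# Decomposition sketch using the PyEMD package (pip install EMD-signal).
import numpy as np
from PyEMD import EMD

def decompose_segment(segment, n_imfs=5):
    """Return a fixed-size (5 x len(segment)) stack of IMFs for one segment."""
    imfs = EMD()(segment, max_imf=n_imfs)       # adaptive sifting into IMFs
    if imfs.shape[0] < n_imfs:                  # pad if sifting stops early so
        pad = np.zeros((n_imfs - imfs.shape[0], segment.size))
        imfs = np.vstack([imfs, pad])           # every segment yields 5 rows
    return imfs[:n_imfs]
```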
2.4.2. Mel-Frequency Cepstral Coefficients (MFCCs)
Heart sounds contain diagnostically significant spectral components that extend beyond the range of human auditory perception, necessitating advanced spectral analysis. MFCCs are widely used in sound classification and phonocardiogram studies for pathology detection [34,35,36,37,38,39]. The MFCCs were computed using the Mel-scale mapping in Equation (6):

$$f_{\text{mel}} = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \quad (6)$$
In this study, MFCC features were extracted using the following pipeline.
Pre-emphasis: A first-order filter enhances mid–high frequencies and improves spectral balance, as in Equation (7):

$$y[n] = x[n] - \alpha\, x[n-1] \quad (7)$$

where $\alpha$ is the pre-emphasis coefficient (typically close to 0.97).
Framing and Windowing: Signals are split into 50 ms frames with a 10 ms hop; each frame is multiplied by a Hamming window to reduce spectral leakage, as in Equation (8):

$$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 \quad (8)$$
Fast Fourier Transform (FFT): For each windowed frame, the discrete Fourier transform (computed via FFT) yields the magnitude or power spectrum, which serves as the input to the Mel filterbank analysis.
Mel Filterbank Processing: The magnitude spectrum is passed through a 20-channel Mel filterbank spanning 10 Hz to 400 Hz to approximate the human auditory system’s frequency resolution.
Logarithmic Compression and Discrete Cosine Transform (DCT): Log energies of the Mel-filter outputs are decorrelated via a type-II discrete cosine transform (DCT), producing 13 MFCCs per frame, as in Equation (9):

$$c_m = \sum_{k=1}^{K} \log(E_k) \cos\left[\frac{\pi m}{K}\left(k - \frac{1}{2}\right)\right], \qquad m = 1, \ldots, 13 \quad (9)$$

Here, $E_k = \sum_{j} H_k(j)\, |X(j)|^2$ is the $k$-th Mel-band energy, $H_k$ is the $k$-th triangular Mel filter applied to the frame's power spectrum, and $K = 20$ is the number of filterbank channels.
Cepstral Liftering: A liftering parameter of 22 is applied to enhance discriminative properties by reducing the influence of higher-order cepstral coefficients.
Finally, the logarithm and DCT steps are applied to obtain the Mel-Frequency Cepstral Coefficients [40,41]. After computing 13 MFCCs per frame, eleven statistical descriptors (mean, variance, standard deviation, mode, minimum, maximum, entropy, skewness, kurtosis, energy, and average power) are derived across all frames in each segment. This standardizes the final feature vectors regardless of segment length and ensures compatibility with the classifiers.
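The pipeline above can be approximated in Python with librosa, as sketched below; librosa is a stand-in for the authors' MATLAB implementation, and the pre-emphasis coefficient (librosa's default of 0.97) is an assumption. The per-segment statistics are then taken across frames, as described in Section 2.4.3.

```python
# MFCC sketch with librosa (assumed stand-in): pre-emphasis, 50 ms frames with
# 10 ms hop, Hamming window, 20 Mel channels over 10-400 Hz, 13 coefficients,
# liftering parameter 22.
import librosa

def mfcc_matrix(segment, fs=2000):
    y = librosa.effects.preemphasis(segment)   # first-order pre-emphasis
    return librosa.feature.mfcc(
        y=y, sr=fs, n_mfcc=13, lifter=22,
        n_fft=int(0.050 * fs), hop_length=int(0.010 * fs),
        window="hamming", n_mels=20, fmin=10, fmax=400,
    )                                           # 13 x n_frames coefficient matrix
```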
The standard Mel scale used in speech processing was adopted, justified by similarities in low-frequency energy distribution and spectral envelope characteristics between speech and heart sounds, making it suitable for extracting perceptually and physiologically relevant features.
2.4.3. Statistical Feature Design
The selection of statistical features in this study was informed by their established effectiveness in biomedical signal processing and their capacity to capture a broad range of characteristics inherent in phonocardiographic signals. Descriptive metrics such as mean, variance, and standard deviation represent central tendency and dispersion, whereas higher-order statistical moments like skewness and kurtosis quantify asymmetry and peakedness in signal distributions. Measures including entropy and energy reflect signal complexity and power content, which are particularly valuable in distinguishing pathological heart sounds.
Although some features such as mode and maximum may exhibit statistical correlation, their inclusion ensures comprehensive coverage of both central and extreme values within the signal. This carefully curated set of statistical descriptors offers a balance between computational efficiency and classification performance. To ensure consistency across domains, the same statistical feature set was applied to both MFCC- and EMD-based representations. The entire process of feature extraction and summarization is illustrated in Figure 6.
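The eleven descriptors can be computed uniformly for any IMF or MFCC coefficient sequence, as in the sketch below; the histogram-based mode estimate and the energy-distribution entropy are reasonable implementation choices rather than definitions confirmed by the text.

```python
# The eleven descriptors, applied identically to each IMF and to each segment's
# MFCC coefficient sequence.
import numpy as np
from scipy.stats import kurtosis, skew

def stats11(x):
    x = np.asarray(x, dtype=float).ravel()
    hist, edges = np.histogram(x, bins=32)            # coarse histogram for mode
    mode_est = 0.5 * (edges[hist.argmax()] + edges[hist.argmax() + 1])
    p = x ** 2 / (np.sum(x ** 2) + 1e-12)             # normalized energy distribution
    entropy = -np.sum(p * np.log(p + 1e-12))
    return np.array([
        x.mean(), x.std(), x.var(), mode_est, x.min(), x.max(),
        skew(x), kurtosis(x), entropy,
        np.sum(x ** 2),                               # energy
        np.mean(x ** 2),                              # average power
    ])
```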
2.5. Classification
In our study, the features extracted from heart sounds were classified using three different machine learning algorithms: k-Nearest Neighbor (kNN), Support Vector Machine (SVM), and Random Forest (RF).
The kNN algorithm performs classification using labeled training data. This algorithm determines the k nearest neighbors for a new instance to be classified and assigns it to the class that is most frequent among the neighbors. Different values of k were tested to achieve optimal performance. Various metrics are used for proximity calculations, with cosine distance being a commonly used metric for measuring similarity based on the angle between two vectors [42]. Since it considers vector orientation, cosine distance is often preferred to understand the angular differences between data and evaluate similarities. Cosine distance is inversely proportional to cosine similarity and is calculated using the formula in Equation (10):

$$d_{\cos}(\mathbf{a}, \mathbf{b}) = 1 - \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\, \|\mathbf{b}\|} \quad (10)$$

where $\mathbf{a}$ and $\mathbf{b}$ are the two vectors being compared; $\mathbf{a} \cdot \mathbf{b}$ is their dot product; and $\|\mathbf{a}\|$ and $\|\mathbf{b}\|$ are the norms of vectors $\mathbf{a}$ and $\mathbf{b}$, respectively.
Support Vector Machine, developed by Vapnik, Guyon, and Boser in 1992, is a machine learning method used to solve classification problems [43]. In this supervised learning algorithm, input data are positioned in an n-dimensional space, with each dimension representing a feature. Classification is performed by finding a hyperplane that distinctly separates the two classes. SVM algorithms include various kernel functions, such as linear, polynomial, and sigmoid functions and the radial basis function (RBF) [44]. Using kernel functions, the data space can be expanded to higher dimensions, forming a complex and curved decision boundary that better separates the dataset. In our study, features were classified using the RBF kernel.
Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (classification) of the individual trees. This approach is robust to overfitting, especially for datasets with a large number of features, and is computationally efficient [45].
To evaluate the classification performance of heart sound analysis, accuracy, sensitivity, specificity, and F1-score are computed. These metrics provide a comprehensive assessment of the model’s ability to correctly classify normal and abnormal heart sounds.
The accuracy metric measures the proportion of correctly classified heart sound signals among all samples and is defined in Equation (11):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (11)$$
where TP (True Positive) represents correctly classified abnormal heart sounds, TN (True Negative) denotes correctly classified normal heart sounds, FP (False Positive) refers to normal heart sounds incorrectly classified as abnormal, and FN (False Negative) corresponds to abnormal heart sounds misclassified as normal.
The sensitivity (recall) metric evaluates the model's ability to correctly detect abnormal heart sounds, ensuring that true cases are identified. It is calculated in Equation (12):

$$\text{Sensitivity} = \frac{TP}{TP + FN} \quad (12)$$
Similarly, specificity measures the ability of the model to correctly classify normal heart sounds, minimizing false alarms, and is formulated in Equation (13):

$$\text{Specificity} = \frac{TN}{TN + FP} \quad (13)$$
To balance sensitivity and precision, the F1-score is also computed. It provides the harmonic mean of precision and recall and is defined in Equation (14):

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN} \quad (14)$$
These metrics are widely used in biomedical signal classification to ensure reliable and clinically relevant evaluations of classification models [46].
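For completeness, Equations (11)–(14) can be computed directly from a binary confusion matrix, as in the following sketch (abnormal is treated as the positive class).

```python
# Equations (11)-(14) computed from a binary confusion matrix; the abnormal
# class (label 1) is treated as positive.
from sklearn.metrics import confusion_matrix

def report_metrics(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)                    # Eq. (11)
    sensitivity = tp / (tp + fn)                                  # Eq. (12)
    specificity = tn / (tn + fp)                                  # Eq. (13)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (14)
    return accuracy, sensitivity, specificity, f1
```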
3. Results
In this study, the classification performance of various machine learning algorithms was evaluated using both segmented and non-segmented heart sound datasets. The preprocessing stage included the application of notch and elliptic filters to eliminate noise and enhance signal clarity. Additionally, normalization was applied to ensure that all signals were scaled within a standardized range, thereby improving consistency across recordings.
Following preprocessing, two different strategies were employed for feature representation: segmentation-based and non-segmented analysis. In the segmentation-based approach, heart sound signals were divided into individual S1–S2 cycles using Shannon energy envelope extraction and Otsu thresholding. The issue of class imbalance was addressed through SMOTE as described in Section 2, and all subsequent analyses were conducted on the resulting balanced feature sets. To reduce feature dimensionality and eliminate redundancy, PCA was applied across all datasets, with the number of principal components determined by retaining 95% of the total variance. The comparative PCA results for both segmented and non-segmented features are illustrated in Figure 7.
While the PCA-based reduction for MFCC features was limited due to their compact representation, EMD features showed a more pronounced reduction, particularly in the segmented configuration. These reduced feature vectors were subsequently used for classification tasks.
Heart sound classification was performed using k-Nearest Neighbor (kNN), Support Vector Machine (SVM), and Random Forest (RF) on both EMD and MFCC feature sets. Various values of k (1, 3, and 5) were tested for kNN, with k = 3 providing the best results. The cosine distance metric was selected due to its effectiveness in capturing angular dissimilarities between feature vectors. For SVM, a radial basis function (RBF) kernel was employed to handle non-linear separability in the feature space. RF classifiers were configured with 100 decision trees, using Gini impurity as the splitting criterion to optimize classification performance.
To ensure a fair and robust evaluation, five-fold cross-validation was used instead of a traditional 80–20 train–test split. The dataset was divided into five equal subsets, with each fold used once as a test set while the remaining four served for training. This procedure provided a more reliable estimate of model performance across different data partitions.
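A condensed sketch of this evaluation setup with scikit-learn is shown below; the feature matrix is a placeholder, and the added standardization step is common practice rather than a detail reported in the text.

```python
# Evaluation sketch with scikit-learn: PCA retaining 95% variance, the three
# classifiers configured as described above, and 5-fold cross-validation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_bal = rng.normal(size=(600, 55))       # placeholder balanced feature matrix
y_bal = rng.integers(0, 2, size=600)     # placeholder binary labels

models = {
    "kNN": KNeighborsClassifier(n_neighbors=3, metric="cosine"),
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100, criterion="gini", random_state=0),
}
for name, clf in models.items():
    # Standardization added here as common practice (not stated in the text).
    pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95), clf)
    scores = cross_val_score(pipe, X_bal, y_bal, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```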
Classification performance was evaluated using accuracy, sensitivity, specificity, and F1-score to assess the influence of segmentation and feature extraction strategies. The overall results are summarized in Table 3, while Figure 8 presents the five-fold training and testing accuracy for each classifier–feature combination. The close alignment between training and testing results across folds indicates the absence of overfitting, further reinforcing the generalizability and robustness of the proposed approach.
Notably, although relatively simple classifiers were used, the incorporation of segmentation and appropriate feature extraction substantially enhanced model performance. For example, the accuracy of the EMD + kNN model increased from 88.40% to 99.97% with segmentation. These findings confirm that the reported high accuracy values are not incidental but arise from a carefully designed and comprehensive pipeline that integrates robust preprocessing, segmentation, and balanced learning strategies.
Furthermore, although the SVM classifier exhibited comparatively lower performance, particularly in the non-segmented case, this was likely due to its sensitivity to the high-dimensional feature space and imbalanced data distribution. Nevertheless, even these results emphasize the importance of segmentation in stabilizing classification outcomes.
To illustrate robustness across folds, Figure 9 presents boxplots together with mean classification performance and 95% confidence intervals (CIs) computed over five-fold cross-validation, providing a clear view of fold-to-fold variability and generalization.
To substantiate the observed gains, we conducted two-sided paired-sample t-tests on fold-wise balanced-accuracy estimates from five-fold cross-validation, treating models as paired within each fold. Table 4 reports mean differences (ΔBA) with Holm–Bonferroni-adjusted p-values. This analysis controlled fold-to-fold variability and limited Type-I error across multiple comparisons, demonstrating that most segmented vs. non-segmented contrasts were statistically significant (adjusted p < 0.05).
Comparative Analysis
The results clearly demonstrate that segmenting heart sounds significantly enhanced classification performance. Using MFCC features with kNN on segmented data resulted in 99.37% accuracy, 100.00% sensitivity, and 98.72% specificity, while EMD features with kNN achieved 99.97% accuracy, 99.98% sensitivity, and 99.96% specificity. In contrast, non-segmented data yielded lower performance, with MFCC features achieving 96.49% accuracy, 97.65% sensitivity, and 95.33% specificity, while EMD features reached 88.40% accuracy, 93.49% sensitivity, and 83.32% specificity.
To directly compare our approach with the existing works listed in Table 1, we explicitly benchmarked our results. For example, Deperlioglu [2] reported 97.21% accuracy, 94.78% sensitivity, and 99.65% specificity using a segmented CNN model on the PhysioNet A dataset, whereas our proposed EMD-kNN model achieved 99.97% accuracy, 99.98% sensitivity, and 99.96% specificity on the PhysioNet/CinC 2016 dataset. Similarly, Narváez et al. [5] achieved 99.25% accuracy on the Pascal dataset using Modified EWT + NASE, while our EMD-RF configuration achieved 99.95% accuracy, 99.96% sensitivity, and 99.93% specificity on the same dataset. In contrast, Maknickas and Maknickas [4] reported only 84.15% accuracy for a non-segmented CNN model on PhysioNet/CinC 2016. These comparisons validate that our segmentation-based pipeline outperforms several previously published methods on the same benchmark datasets.
Feature extraction also plays a crucial role in classification performance. In previous research, MFCCs with a CNN reached 85.3% accuracy on the PhysioNet/CinC 2016 dataset [11], and the Wavelet Scattering Transform with SVM obtained 92.23% accuracy [20]. Our findings indicate that MFCCs and EMD provide superior accuracy, particularly when combined with segmentation. Additionally, applying PCA reduced computational complexity while maintaining classification accuracy.
Among classifiers, kNN and Random Forest consistently outperformed SVM, with the highest performance achieved using kNN with segmented MFCCs (99.37%) and kNN with segmented EMD (99.97%). In contrast, SVM performed significantly lower on non-segmented data (81.88% accuracy for MFCCs, 71.76% for EMD). Other approaches in the literature include a CNN-RNN hybrid model with 98% accuracy [13] and a Decision Tree model with 86.35% accuracy [14]. The combination of SVM, kNN, and RF classifiers in [15] achieved 98.67% accuracy. These comparisons show that carefully designed traditional machine learning pipelines can surpass deep learning approaches when combined with effective segmentation and feature extraction techniques.
Overall, this study provides a comprehensive evaluation across multiple dimensions (segmentation effect, feature selection, classifier type, and dimensionality reduction), supplemented by clear, direct comparisons with existing methods. The results not only demonstrate the effectiveness of the proposed approach but also emphasize its practical viability for heart sound classification in real-world settings.
4. Conclusions and Discussion
This study presents an effective heart sound classification framework that integrates Shannon energy-based envelope extraction with Otsu thresholding for segmentation, followed by MFCC- and EMD-based feature extraction and classification using kNN, SVM, and Random Forest.
To explicitly isolate and quantify the contribution of segmentation, controlled comparisons were performed by keeping the feature extraction method and classifier fixed while varying only the segmentation step. For instance, EMD + kNN accuracy improved significantly from 88.40% (non-segmented) to 99.97% (segmented), and MFCC + kNN improved from 96.49% to 99.37%. These findings confirm that segmentation alone substantially enhances classification performance by effectively isolating physiologically meaningful signal components and reducing irrelevant noise.
To verify that the observed gains were not due to chance, we performed two-sided paired-sample t-tests on fold-wise balanced accuracy from five-fold cross-validation, treating models as paired within each fold. Table 4 reports mean differences (ΔBA) with Holm–Bonferroni-adjusted p-values. The paired design controlled fold-to-fold variability, and the Holm correction limited inflated Type-I error across multiple comparisons. Most segmented vs. non-segmented contrasts and within-feature classifier pairs remained statistically significant after correction (adjusted p < 0.05), indicating that the improvements were systematic rather than random. Consistent with our empirical observations, segmentation reduces between-classifier dispersion while the top models retain a clear performance edge, underscoring practical relevance even when absolute ΔBA values are small.
We adopted kNN as a transparent, nonparametric baseline over compact feature vectors (MFCCs/EMD), with optional PCA to stabilize distances and reduce dimensionality. Model selection (including the choice of k) was performed by cross-validation within each training fold. Although kNN is memory-based, per-fold training sets and vectorized, batched distance computations provide adequate scalability at our data size while preserving interpretability. In our experiments, kNN and Random Forest consistently outperformed SVM, especially with segmented inputs. PCA further improved computational efficiency without materially degrading accuracy. The proposed Shannon–Otsu segmentation adapted well to signal characteristics and was robust across datasets.
To objectively validate the effectiveness and generalizability of the proposed segmentation algorithm, we performed a segmentation sensitivity analysis using ground-truth S1 and S2 annotations in the PhysioNet/CinC 2016 training set. This analysis assessed how reliably the algorithm detects physiologically meaningful cardiac events relative to expert reference annotations, providing quantitative evidence of its practical utility in automated heart sound analysis. With a ±75 ms tolerance window, the proposed segmentation approach achieved a mean sensitivity of 90.68 ± 19.35% for S1 and 88.63 ± 23.76% for S2, with mean absolute errors of 200.39 ± 65.52 ms and 190.27 ± 62.43 ms, respectively, across all training subsets. These results are comparable to those of other unsupervised methods reported in the literature and confirm the reliability of the approach for large-scale heart sound analysis.
A key limitation of this study stems from the unavailability of ground-truth segment annotations (e.g., S1 and S2 boundaries) in the Pascal dataset, which precludes formal quantitative evaluation of segmentation accuracy using metrics such as onset timing error or sensitivity in that cohort. For Pascal, segmentation quality was assessed indirectly via visual inspection and its effect on downstream classification performance. In certain abnormal cases, as shown in Figure 3, S2 peaks may be attenuated or missed due to low-amplitude or transient noise, revealing the sensitivity of energy-based segmentation methods to pathological variations. To address this, future studies should employ datasets with annotated cardiac events to enable rigorous validation and explore more advanced segmentation techniques (e.g., Hidden Semi-Markov Models or deep learning-based approaches). Additionally, hybrid feature extraction strategies and larger, more diverse datasets may improve robustness and generalizability.
Although SMOTE effectively balanced the abnormal class, synthetic over-sampling can sometimes lead models to fit artificial patterns rather than physiological variability. In this study, potential overfitting was mitigated by applying five-fold cross-validation and monitoring the consistency of training and testing accuracies across folds. The absence of a substantial divergence between these values suggests that the classifiers generalized well despite the inclusion of synthetic data. Nevertheless, future work could investigate more advanced or data-driven augmentation such as Borderline-SMOTE, ADASYN, or GAN-based synthesis to further enhance realism and reduce reliance on interpolation-based techniques.
The proposed segmentation–classification pipeline has strong potential for integration into hospital information systems, portable diagnostic platforms, and personal health monitoring tools. Future work will focus on real-time deployment in clinical workflows and on evaluating integration with electronic health record systems to enhance automated cardiac screening and decision support.
In summary, this study highlights the critical role of segmentation in heart sound classification. It demonstrates that near-perfect classification accuracy can be achieved with relatively simple machine learning models and that these gains are statistically reliable after multiple-comparison correction, provided that segmentation is effectively combined with robust preprocessing and carefully designed feature extraction strategies.