1. Introduction
Cardiovascular diseases (CVDs) are among the leading causes of death worldwide, highlighting the importance of early diagnosis to improve patient health and quality of life [1]. Heart sound analysis plays a significant role in the diagnosis and management of CVDs. Heart sounds, produced during cardiac contraction and relaxation, contain crucial information that is essential for detecting abnormalities early. Heart sounds typically range between 20 Hz and 2000 Hz, with most energy concentrated below 100 Hz. Low-frequency noise, such as baseline drift and motion artifacts, can interfere with signal analysis, necessitating appropriate filtering techniques.
Heart sound analysis can be performed using two primary approaches: segmented and non-segmented. Segmented approaches decompose the signal into components such as S1, S2, S3, and S4, enabling detailed examination of each part. For example, a study using the PhysioNet A dataset with a segmented CNN model achieved 97.21% accuracy, 94.78% sensitivity, and 99.65% specificity [2]. Advanced methods such as the Hidden Semi-Markov Models (HSMMs) introduced by Springer et al. [3] have demonstrated robust segmentation performance, even under noise. Deep learning methods, including CNNs, RNNs, and hybrid CNN-LSTM architectures, further improve segmentation accuracy but typically require large, annotated datasets and significant computational resources.
In contrast, non-segmented approaches analyze the entire heart sound signal directly, offering simplicity and faster processing but often at the expense of accuracy. For instance, a non-segmented CNN model achieved 84.15% accuracy on PhysioNet/CinC 2016 [4]. Non-segmented models generally depend on machine learning or deep learning classifiers to process raw signals. Feature extraction methods such as Mel-Frequency Cepstral Coefficients (MFCCs) and Empirical Mode Decomposition (EMD) have been widely used to derive discriminative characteristics. Additionally, modified Empirical Wavelet Transform (EWT) and Normalized Average Shannon Energy (NASE) have been proposed to improve performance under noisy conditions [5].
Recent studies have explored sophisticated segmentation and classification strategies. For example, a convolutional–transformer hybrid architecture achieved 99.7% accuracy without explicit segmentation [6], dynamic programming combined with frequency-domain features and Siamese Neural Networks showed promising results [7], and a multi-scale adaptive segmentation approach with continuous wavelet transforms and a CRNN achieved 98.6% accuracy on PhysioNet [8]. However, these advanced models require high computational resources and large annotated datasets, limiting practical deployment in resource-constrained clinical settings. In addition, time–frequency-domain deep neural networks that couple MGWST/entropy features with deep encoders have demonstrated strong screening performance on MHSDB and competitive results on PhysioNet/CinC 2016 [9]. Complementarily, Stockwell transform-based boundary detection has been proposed for accurate S1/S2 localization using adaptive thresholds on S-transform envelopes, demonstrating reliable segmentation on Michigan PCG subsets [10]. These studies underline the value of richer time–frequency representations and provide useful context for positioning our Shannon–Otsu segmentation with MFCC/EMD features.
Despite recent advances in heart sound classification, such as complex segmentation algorithms and deep learning-based pipelines, systematic evaluations that explicitly assess the impact of segmented versus non-segmented strategies remain scarce. In particular, lightweight, interpretable, and computationally efficient methods are often overlooked in favor of resource-intensive approaches that rely on large, annotated datasets. Automated heart sound segmentation and classification systems may play a vital role in the development of digital health tools, particularly for the early detection of cardiovascular anomalies in resource-constrained or telemedicine settings. Such methods could be integrated into clinical information systems or portable diagnostic devices to support clinical decision-making in primary care environments.
To address this gap, we propose a fully unsupervised and computationally efficient segmentation method that combines Shannon energy envelope analysis with adaptive thresholding based on Otsu’s method. Unlike deep models requiring extensive training and parameter tuning, the proposed approach is suitable for real-world applications with limited computational resources. We evaluate both segmented and non-segmented strategies on two large-scale benchmark datasets, PhysioNet/CinC 2016 and Pascal, using robust feature extraction techniques (MFCCs and EMD) and traditional classifiers (kNN, SVM, RF).
The main contributions of this study are as follows:
- Development of an unsupervised and efficient segmentation technique combining Shannon energy and Otsu-based adaptive thresholding for envelope-based heart sound analysis.
- Systematic comparison of segmented and non-segmented classification pipelines using classical features (MFCCs and EMD) and conventional classifiers (kNN, SVM, RF) across two benchmark datasets.
- In-depth evaluation of different feature–classifier combinations with and without segmentation to provide quantitative insights into the added value of segmentation.
- Implementation of PCA-based dimensionality reduction to enhance computational efficiency while maintaining classification performance.
These contributions aim to provide a practical and interpretable framework for heart sound classification, especially for deployment in resource-constrained clinical environments. A comprehensive summary of related studies employing segmented and non-segmented approaches is provided in Table 1, highlighting the position and scope of our study in the current literature.
2. Material and Methods
The proposed method in this study involved a systematic pipeline of preprocessing, segmentation, feature extraction, dimensionality reduction, and classification steps; Figure 1 illustrates the complete pipeline. Initially, raw heart sound data underwent preprocessing to enhance signal quality. This phase included applying a notch filter to eliminate powerline interference and an elliptic bandpass filter to suppress noise outside the target frequency band. Subsequently, signal amplitudes were normalized to a standard range, ensuring consistent conditions across all recordings and facilitating reliable classification.
After preprocessing, the data were analyzed via two parallel pathways: segmented and non-segmented analysis. In the segmented approach, the envelope of the heart sound signals was first extracted using the Shannon energy method. Envelope extraction highlighted the significant components of the signal while suppressing noise. Subsequently, Otsu thresholding was applied adaptively in a local windowed manner to dynamically determine threshold values, effectively isolating critical cardiac sound components (S1 and S2). Following segmentation, two advanced and widely recognized feature extraction techniques were employed: Empirical Mode Decomposition (EMD) and Mel-Frequency Cepstral Coefficients (MFCCs). In the EMD approach, each segment was decomposed into intrinsic mode functions (IMFs). From each IMF, 11 statistical features were extracted: mean, standard deviation, variance, mode, minimum, maximum, skewness, kurtosis, entropy, energy, and average power. Both mode and maximum were retained despite their correlation, as mode captures repetitive signal patterns, whereas maximum identifies peak amplitudes indicative of potential anomalies. Similarly, MFCC features underwent the same 11 statistical computations, maintaining methodological consistency across both extraction techniques.
To address the class imbalance observed in both segmented and non-segmented datasets, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to the abnormal class after feature extraction but prior to classification. In the segmented configuration, SMOTE produced 173,340 synthetic abnormal segments by interpolating between minority-class feature vectors in the high-dimensional feature space. In the non-segmented configuration, the original dataset contained 2775 normal and 926 abnormal recordings; SMOTE generated 1852 synthetic abnormal samples using k = 5 nearest neighbors, expanding the abnormal class to 2778 instances. This interpolation strategy preserved local feature distributions while mitigating imbalance, ensuring that the classifiers received a balanced training set for more reliable sensitivity to abnormal cases. To minimize potential overfitting to synthetic patterns, a 5-fold cross-validation (CV) procedure was strictly enforced during model evaluation, and performance consistency across folds was used as an indicator of generalization capability.
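For illustration, this balancing step can be sketched in Python with the imbalanced-learn implementation of SMOTE; the array contents below are placeholders, and the authors' own workflow (and the exact resulting counts) may differ.

```python
# Minimal SMOTE sketch with imbalanced-learn; feature values are placeholders.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(2775 + 926, 55))      # placeholder feature matrix
y = np.array([0] * 2775 + [1] * 926)       # non-segmented class counts from the text

# k = 5 nearest neighbors, as described above; new abnormal samples are
# interpolated between minority-class neighbors in feature space.
X_bal, y_bal = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(np.bincount(y_bal))                  # approximately balanced class counts
```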
Classification was then performed using three robust machine learning algorithms: k-Nearest Neighbor (kNN), Support Vector Machine (SVM), and Random Forest (RF). Model evaluation was carried out using a 5-fold cross-validation method, rather than traditional 80–20% splits. This approach involved partitioning the dataset into five subsets, using each subset once as the test set and the remaining subsets for training iteratively. The average performance across these iterations provided an unbiased evaluation of model generalizability. Additionally, Principal Component Analysis (PCA) was applied to the extracted features to reduce dimensionality, preserving 95% of the variance. PCA effectively eliminated redundant and less informative features, enhancing computational efficiency and reducing model complexity during classification.
2.1. Dataset
This study utilizes two datasets for heart sound classification: the PhysioNet/CinC Challenge 2016 dataset and the Pascal dataset. These datasets provide a rich variety of heart sound recordings, enabling comprehensive evaluation of the proposed methods under diverse conditions.
The PhysioNet dataset contains 3240 recordings collected from 764 individuals, including both healthy individuals and patients with various cardiac pathologies. The recordings are labeled as either normal or abnormal and were obtained from six different sources by seven independent research groups. The recordings have a sampling frequency of 2000 Hz and vary in duration, typically ranging from a few seconds to several minutes. This dataset reflects a wide range of real-world conditions, including differences in recording devices and environments [22,23].
The Pascal dataset consists of 461 recordings captured at a sampling frequency of 4000 Hz. The recordings are categorized into five groups: Normal, Normal Noisy, Murmur, Noisy Murmur, and Extrasystole. The recording lengths range from 5 to 10 s, making them suitable for evaluating the performance of segmentation and classification techniques. The dataset also introduces challenges by including both clean and noisy heart sound recordings, providing a realistic benchmark for robust classification [24]. The detailed distribution of samples across categories in both datasets is presented in Table 2. The distribution of samples within these datasets is summarized in the subsequent sections, where the methods for addressing class imbalance and segmenting heart sounds are discussed.
2.2. Preprocessing
In the preprocessing stage, heart sound signals were initially subjected to a notch filtering process to suppress powerline interference centered at 60 Hz. This filter was implemented as a second-order infinite impulse response (IIR) notch filter with a quality factor (Q-factor) of approximately 35, effectively attenuating narrowband noise components without distorting the primary cardiac signal characteristics.
Following notch filtering, an elliptic bandpass filter was applied to isolate the physiological frequency range relevant to heart sounds. The filter was designed with an order of two, a passband ripple of 5 dB, a stopband attenuation of 80 dB, and a frequency range of 20–400 Hz. This configuration ensured the preservation of fundamental and harmonic components while minimizing out-of-band noise and artifacts.
Subsequently, amplitude normalization was performed by scaling each signal based on its maximum absolute amplitude, thereby constraining all signals to a uniform range of [−1, 1]. This normalization procedure enhanced consistency across recordings from different subjects and recording environments, facilitating robust feature extraction and reliable comparative analysis.
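The following SciPy sketch mirrors the stated filter settings; it is illustrative rather than the authors' MATLAB code, and zero-phase filtering via filtfilt is an assumption.

```python
# Preprocessing sketch with SciPy, mirroring the stated settings (60 Hz notch,
# Q = 35; 2nd-order elliptic bandpass, 5 dB ripple, 80 dB stopband, 20-400 Hz;
# peak-amplitude normalization). Zero-phase filtering via filtfilt is an assumption.
import numpy as np
from scipy.signal import ellip, filtfilt, iirnotch

def preprocess(x, fs=2000.0):
    b_n, a_n = iirnotch(w0=60.0, Q=35.0, fs=fs)        # powerline notch
    x = filtfilt(b_n, a_n, x)
    b_e, a_e = ellip(N=2, rp=5, rs=80, Wn=[20, 400],   # elliptic bandpass
                     btype="bandpass", fs=fs)
    x = filtfilt(b_e, a_e, x)
    return x / np.max(np.abs(x))                       # scale to [-1, 1]
```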
2.3. Segmentation Techniques
Segmentation is a method that divides datasets, images, or signals into more meaningful and processable parts to facilitate their interpretation and analysis. In heart sound analysis, segmentation plays a crucial role in isolating the fundamental components of the cardiac cycle, enabling the identification of specific abnormalities. Heart sounds, which can be heard using a stethoscope, provide essential information about the cardiovascular system and play a critical role in disease diagnosis. As shown in Figure 2, heart sound signals consist primarily of two components: S1 and S2. The S1 sound is associated with the closure of the mitral and tricuspid valves [25], marking the beginning of systole, while the S2 sound corresponds to the closure of the aortic and pulmonary valves, indicating the beginning of diastole [26]. While these two heart sounds are clearly distinguishable in healthy individuals, additional sounds such as S3 and S4 may appear under certain pathological conditions.
In this study, heart sound segmentation was performed using the Shannon energy envelope and Otsu thresholding methods to isolate S1 and S2 components. Although no manual or automated segmentation results were available for direct comparison, the effectiveness of the segmentation approach was indirectly assessed through classification performance.
While Shannon energy and Otsu thresholding have individually been widely utilized in biomedical signal processing, their integrated use specifically for phonocardiogram (PCG) segmentation to isolate the fundamental heart sounds (S1 and S2) remains relatively unexplored. Unlike conventional fixed-window or Hidden Semi-Markov Model (HSMM)-based segmentation methods, our proposed Shannon–Otsu segmentation approach is fully unsupervised, adaptive to varying signal conditions, and computationally lightweight, requiring no extensive annotated data or training procedures. To the best of our knowledge, this is the first study to systematically evaluate the impact of this combined Shannon–Otsu thresholding technique on heart sound classification across large and diverse datasets such as PhysioNet/CinC 2016 and Pascal, demonstrating its robustness and effectiveness in significantly improving classification accuracy.
Shannon Energy Method and Otsu Thresholding
In this study, the Shannon energy (SE) envelope is computed from the local spectrum obtained by the S-transform for each time sample. Let $|S(f_k, t_n)|$ denote the S-transform magnitude for frequency bin $f_k$ and time index $t_n$ (with the Gaussian window width $\sigma(f) = 1/|f|$ as in the standard S-transform). Then the Shannon energy of the $n$-th time sample is defined as the column-wise aggregation over frequencies, as given in Equation (1):

$$SSE(t_n) = -\sum_{k} |S(f_k, t_n)|^2 \log\left(|S(f_k, t_n)|^2 + \varepsilon\right) \quad (1)$$

where $\varepsilon$ is a small constant to avoid $\log(0)$. This SSE (Shannon Spectral Energy) envelope emphasizes medium-intensity components and attenuates low-intensity noise, yielding a robust representation for S1/S2 localization, exactly as described in the SSE localization method [28,29,30].
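As a concrete sketch of Equation (1), the snippet below aggregates Shannon energy column-wise over a short-time spectrum. An STFT magnitude stands in for the full S-transform for brevity (an assumption, not the authors' implementation), and the 30 ms window with 15 ms hop follows the short-time settings reported later in this section.

```python
# Shannon spectral energy (SSE) envelope per Equation (1), with an STFT used
# as a simplified stand-in for the S-transform; eps avoids log(0).
import numpy as np
from scipy.signal import stft

def sse_envelope(x, fs=2000.0, win_s=0.030, hop_s=0.015, eps=1e-12):
    nper = int(win_s * fs)                          # 30 ms analysis window
    noverlap = nper - int(hop_s * fs)               # 15 ms hop
    _, t, Z = stft(x, fs=fs, nperseg=nper, noverlap=noverlap)
    P = np.abs(Z) ** 2                              # |S(f_k, t_n)|^2
    env = -np.sum(P * np.log(P + eps), axis=0)      # column-wise aggregation
    return t, env / np.max(env)                     # envelope scaled to [0, 1]
```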
The Otsu method is used to determine an optimal threshold value by analyzing the histogram of the signal. This technique represents the signal data at $L$ different levels, where each level $i$ is indicated by the number of data points $n_i$ at that level, and the total number of data points is $N = \sum_{i=1}^{L} n_i$. It is assumed that the signal is divided into two classes using a threshold at level $t$: Class $C_0$ contains data at levels from 1 to $t$, while Class $C_1$ includes data from levels $t+1$ to $L$ [31]. According to Otsu's theory, the within-class variance is minimized when the between-class variance is maximized; therefore, the optimal threshold is the one that yields the largest separation between the two classes in the histogram. Applying this algorithm to the 1-D energy-envelope histogram provides an amplitude threshold that isolates high-energy cardiac events (S1/S2) while suppressing background/noise [32]. The class probabilities and means are computed as in Equations (2) and (3):

$$\omega_0(t) = \sum_{i=1}^{t} p_i, \qquad \omega_1(t) = \sum_{i=t+1}^{L} p_i, \qquad p_i = \frac{n_i}{N} \quad (2)$$

$$\mu_0(t) = \frac{1}{\omega_0(t)} \sum_{i=1}^{t} i\, p_i, \qquad \mu_1(t) = \frac{1}{\omega_1(t)} \sum_{i=t+1}^{L} i\, p_i \quad (3)$$

The threshold is then selected by maximizing the between-class variance in Equation (4):

$$t^{*} = \arg\max_{1 \le t < L} \; \omega_0(t)\, \omega_1(t) \left[\mu_0(t) - \mu_1(t)\right]^2 \quad (4)$$

where $\omega_0$ and $\omega_1$ denote the class probabilities, and $\mu_0$ and $\mu_1$ denote the corresponding class means.
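A compact NumPy version of Equations (2)–(4), applied to the histogram of a [0, 1]-scaled envelope, might look as follows; the 256-bin histogram is an illustrative choice rather than a reported parameter.

```python
# NumPy version of Equations (2)-(4) on the histogram of a [0, 1]-scaled envelope.
import numpy as np

def otsu_threshold(env, nbins=256):
    hist, edges = np.histogram(env, bins=nbins, range=(0.0, 1.0))
    p = hist / hist.sum()                        # p_i = n_i / N
    w0 = np.cumsum(p)                            # class probabilities, Eq. (2)
    w1 = 1.0 - w0
    mids = 0.5 * (edges[:-1] + edges[1:])
    mu = np.cumsum(p * mids)                     # cumulative first moment
    with np.errstate(divide="ignore", invalid="ignore"):
        mu0 = mu / w0                            # class means, Eq. (3)
        mu1 = (mu[-1] - mu) / w1
        sigma_b = w0 * w1 * (mu0 - mu1) ** 2     # between-class variance, Eq. (4)
    return mids[np.nanargmax(sigma_b)]           # tau maximizing Eq. (4)
```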
Using this threshold value, the high-energy regions of the signal are segmented effectively. To further enhance the quality of the energy signal, a median filter with a window size of $W$ samples is applied, as defined in Equation (5):

$$\tilde{E}(n) = \operatorname{median}\left\{E\left(n - \lfloor W/2 \rfloor\right), \ldots, E\left(n + \lfloor W/2 \rfloor\right)\right\} \quad (5)$$

where $E(n)$ is the energy envelope and $W$ is the window length (here, $W = 75$ samples).
A median filter with a window length of 75 samples was employed to suppress noise artifacts and enhance the clarity of the segmented signal envelope. This filter was chosen instead of a mean (moving average) filter due to its superior ability to suppress impulsive noise while preserving sharp transitions and edges, making it more robust to outlier artifacts commonly observed in heart sound signals. The window length was empirically determined to balance noise suppression and temporal resolution, thereby supporting accurate peak detection in subsequent steps.
The segmentation uses an adaptive threshold (τ), selected by Otsu's method, on the Shannon-energy envelope. Raw signals are first amplitude-normalized to [−1, 1], and the envelope is then linearly scaled to [0, 1]; thus, τ ∈ [0, 1] and is invariant to absolute amplitude. This procedure isolates high-energy cardiac events without any fixed, manually set amplitude threshold. Postprocessing applies a short median filter together with minimum event-duration and minimum inter-event-gap constraints: tightening these constraints (or increasing τ) reduces false positives but may miss low-amplitude S2 events, whereas relaxing them (or decreasing τ) increases recall at the cost of more spurious detections. The reported settings were chosen via 5-fold cross-validation to maximize fold-wise balanced accuracy.
Segmentation was dynamically performed using the Shannon energy envelope combined with Otsu’s adaptive thresholding. Short-time analysis was conducted with a 30 ms window and a 15 ms hop size, improving temporal precision. The previously mentioned 200 ms refers to the approximate maximum duration of a complete heart sound event (S1 or S2) and served only as a reference for interpreting segment lengths rather than as a fixed segmentation window.
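Putting these steps together, a hedged sketch of the postprocessing is shown below. The envelope is assumed to be sampled at the original signal rate, and the minimum-duration and minimum-gap values are illustrative, since the text does not report them.

```python
# Postprocessing sketch: median-filter the envelope (Equation (5)), binarize at
# the Otsu threshold tau, then enforce minimum event-duration and inter-event-gap
# constraints. The 50 ms duration/gap values are illustrative assumptions.
import numpy as np
from scipy.signal import medfilt

def extract_events(env, tau, fs=2000.0, min_dur_s=0.05, min_gap_s=0.05):
    env = medfilt(env, kernel_size=75)           # 75-sample median filter
    mask = (env > tau).astype(int)               # adaptive Otsu threshold
    d = np.diff(np.concatenate(([0], mask, [0])))
    starts, stops = np.flatnonzero(d == 1), np.flatnonzero(d == -1)
    events = []
    for s, e in zip(starts, stops):
        if e - s < min_dur_s * fs:               # drop events that are too short
            continue
        if events and s - events[-1][1] < min_gap_s * fs:
            events[-1] = (events[-1][0], e)      # merge events that are too close
        else:
            events.append((s, e))
    return events                                # (start, stop) sample indices
```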
Following segmentation, peak detection was conducted using MATLAB® 2024a’s findpeaks function. This step was not solely intended for visualization but was crucial for defining precise segment boundaries corresponding to S1 and S2 events. Accurate peak identification ensured that only physiologically meaningful cardiac events were included in the analysis, thereby improving the reliability of feature extraction and classification. By excluding irrelevant or noisy signal portions, this approach enhanced overall model performance and robustness. The minimum peak height was set to 5% of the maximum amplitude of the cleaned signal, and the minimum peak distance was defined as 10% of the sampling frequency. Peaks that did not meet these criteria were automatically discarded without manual correction. Instead, statistical features were extracted over entire segments (S1–S2), mitigating the influence of potential outliers.
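In Python, the equivalent of these findpeaks settings can be expressed with scipy.signal.find_peaks, as sketched below.

```python
# SciPy counterpart of the findpeaks settings above: minimum peak height of 5%
# of the cleaned signal's maximum amplitude, minimum peak distance of 10% of fs.
import numpy as np
from scipy.signal import find_peaks

def detect_peaks(clean, fs=2000.0):
    peaks, _ = find_peaks(
        clean,
        height=0.05 * np.max(np.abs(clean)),   # MinPeakHeight analogue
        distance=int(0.10 * fs),               # MinPeakDistance analogue
    )
    return peaks                               # sample indices of retained peaks
```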
Figure 3 and Figure 4 present representative examples of heart sound recordings from normal and abnormal subjects, respectively, demonstrating the effectiveness of the proposed segmentation and peak detection framework. These figures highlight how the combined use of the Shannon energy envelope and Otsu thresholding facilitates the isolation of primary cardiac events (S1 and S2) under varying signal conditions.
Figure 5 offers a more detailed illustration of the peak selection procedure, emphasizing how amplitude-based and temporal constraints—specifically, minimum peak height and minimum inter-peak distance—contribute to the suppression of false detections. In all examples, the upper panels display the Shannon energy envelopes, while the lower panels show the original heart sound signals overlaid with detected peaks. Peaks identified prior to filtering are marked in blue, whereas those retained after constraint-based refinement are highlighted in red, underscoring the robustness of the proposed method in distinguishing physiologically relevant cardiac events from noise.
Due to the absence of precisely annotated heart sound segments (e.g., S1 and S2 onset and offset times) in the datasets used, quantitative performance metrics such as onset timing error, detection rate, sensitivity, and specificity of the segmentation method could not be calculated explicitly. However, visual inspections and qualitative analysis confirmed that the proposed Shannon energy- and Otsu threshold-based segmentation consistently and accurately isolated primary heart sound components across most recordings. Future research should involve datasets with detailed segment annotations to facilitate rigorous quantitative validation and further assess the robustness of the segmentation approach, particularly for pathological cases that may contain additional heart sound components (e.g., S3 and S4).
2.4. Feature Extraction
The features of heart sound signals obtained from segmentation were analyzed using feature extraction methods to represent them comprehensively in the time, frequency, and time–frequency domains. The primary goal of feature extraction is to describe a signal accurately using a minimal yet informative set of features [33,34]. In this study, two widely used methods were employed for feature extraction: EMD and MFCCs.
The feature extraction phase converts raw heart sound segments (including S1 and S2 components) into structured representations suitable for machine learning classifiers. These segments were obtained through the Shannon energy envelope and Otsu thresholding. In EMD, each segment was analyzed as a single unit without further subdivision, while MFCC analysis used short-term framing [17].
For non-segmented cases, the entire heart sound signal was directly subjected to feature extraction, enabling a comparative analysis of segmented versus non-segmented approaches.
2.4.1. Empirical Mode Decomposition (EMD)
EMD enables adaptive time–frequency analysis by decomposing a signal into intrinsic mode functions (IMFs), each representing an oscillatory mode. The decomposition follows two criteria:
- (i) The number of extrema and zero crossings must be equal or differ by at most one.
- (ii) The mean value of the upper and lower envelopes must be zero at any point [35].
In this study, the number of IMFs was fixed to five to ensure consistent feature vector lengths across all segments. From each IMF, eleven statistical descriptors (mean, variance, standard deviation, mode, minimum, maximum, skewness, kurtosis, entropy, energy, and average power) were calculated, providing a comprehensive characterization of signal variability and structure. This approach ensured consistency and comparability across segments and supported robust classification.
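A sketch of this decomposition step using the PyEMD package is shown below; the library choice and the zero-padding of missing IMFs are assumptions, since the authors' MATLAB implementation is not shown. The eleven descriptors are then computed per IMF, as sketched in Section 2.4.3.

```python
# Decomposition sketch using the PyEMD package (pip install EMD-signal).
import numpy as np
from PyEMD import EMD

def decompose_segment(segment, n_imfs=5):
    """Return a fixed-size (5 x len(segment)) stack of IMFs for one segment."""
    imfs = EMD()(segment, max_imf=n_imfs)       # adaptive sifting into IMFs
    if imfs.shape[0] < n_imfs:                  # pad if sifting stops early so
        pad = np.zeros((n_imfs - imfs.shape[0], segment.size))
        imfs = np.vstack([imfs, pad])           # every segment yields 5 rows
    return imfs[:n_imfs]
```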
2.4.2. Mel-Frequency Cepstral Coefficients (MFCCs)
Heart sounds contain diagnostically significant spectral components that extend beyond the range of human auditory perception, necessitating advanced spectral analysis. MFCCs are widely used in sound classification and phonocardiogram studies for pathology detection [34,35,36,37,38,39]. The MFCCs were computed using the Mel-scale mapping in Equation (6):

$$f_{\text{mel}} = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \quad (6)$$
In this study, MFCC features were extracted using the following pipeline.
Pre-emphasis: A first-order filter enhances mid–high frequencies and improves spectral balance, as in Equation (7):

$$y[n] = x[n] - \alpha\, x[n-1] \quad (7)$$

where $\alpha$ is the pre-emphasis coefficient (typically close to 0.97).
Framing and Windowing: Signals are split into 50 ms frames with a 10 ms hop; each frame is multiplied by a Hamming window to reduce spectral leakage, as in Equation (8):

$$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 \quad (8)$$
Fast Fourier Transform (FFT): For each windowed frame, the discrete Fourier transform (computed via FFT) yields the magnitude or power spectrum, which serves as the input to the Mel filterbank analysis.
Mel Filterbank Processing: The magnitude spectrum is passed through a 20-channel Mel filterbank spanning 10 Hz to 400 Hz to approximate the human auditory system’s frequency resolution.
Logarithmic Compression and Discrete Cosine Transform (DCT): Log energies of the Mel-filter outputs are decorrelated via a type-II discrete cosine transform (DCT), producing 13 MFCCs per frame, as in Equation (9):

$$c_m = \sum_{k=1}^{K} \log(E_k) \cos\left[\frac{\pi m}{K}\left(k - \frac{1}{2}\right)\right], \qquad m = 1, \ldots, 13 \quad (9)$$

Here, $E_k = \sum_{j} H_k(j)\, |X(j)|^2$ is the $k$-th Mel-band energy, $H_k$ is the $k$-th triangular Mel filter applied to the frame's power spectrum, and $K = 20$ is the number of filterbank channels.
Cepstral Liftering: A liftering parameter of 22 is applied to enhance discriminative properties by reducing the influence of higher-order cepstral coefficients.
Finally, the logarithm and DCT steps are applied to obtain the Mel-Frequency Cepstral Coefficients [40,41]. After computing 13 MFCCs per frame, eleven statistical descriptors (mean, variance, standard deviation, mode, minimum, maximum, entropy, skewness, kurtosis, energy, and average power) are derived across all frames in each segment. This standardizes the final feature vectors regardless of segment length and ensures compatibility with the classifiers.
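The pipeline above can be approximated in Python with librosa, as sketched below; librosa is a stand-in for the authors' MATLAB implementation, and the pre-emphasis coefficient (librosa's default of 0.97) is an assumption. The per-segment statistics are then taken across frames, as described in Section 2.4.3.

```python
# MFCC sketch with librosa (assumed stand-in): pre-emphasis, 50 ms frames with
# 10 ms hop, Hamming window, 20 Mel channels over 10-400 Hz, 13 coefficients,
# liftering parameter 22.
import librosa

def mfcc_matrix(segment, fs=2000):
    y = librosa.effects.preemphasis(segment)   # first-order pre-emphasis
    return librosa.feature.mfcc(
        y=y, sr=fs, n_mfcc=13, lifter=22,
        n_fft=int(0.050 * fs), hop_length=int(0.010 * fs),
        window="hamming", n_mels=20, fmin=10, fmax=400,
    )                                           # 13 x n_frames coefficient matrix
```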
The standard Mel scale used in speech processing was adopted, justified by similarities in low-frequency energy distribution and spectral envelope characteristics between speech and heart sounds, making it suitable for extracting perceptually and physiologically relevant features.
2.4.3. Statistical Feature Design
The selection of statistical features in this study was informed by their established effectiveness in biomedical signal processing and their capacity to capture a broad range of characteristics inherent in phonocardiographic signals. Descriptive metrics such as mean, variance, and standard deviation represent central tendency and dispersion, whereas higher-order statistical moments like skewness and kurtosis quantify asymmetry and peakedness in signal distributions. Measures including entropy and energy reflect signal complexity and power content, which are particularly valuable in distinguishing pathological heart sounds.
Although some features such as mode and maximum may exhibit statistical correlation, their inclusion ensures comprehensive coverage of both central and extreme values within the signal. This carefully curated set of statistical descriptors offers a balance between computational efficiency and classification performance. To ensure consistency across domains, the same statistical feature set was applied to both MFCC- and EMD-based representations. The entire process of feature extraction and summarization is illustrated in Figure 6.
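The eleven descriptors can be computed uniformly for any IMF or MFCC coefficient sequence, as in the sketch below; the histogram-based mode estimate and the energy-distribution entropy are reasonable implementation choices rather than definitions confirmed by the text.

```python
# The eleven descriptors, applied identically to each IMF and to each segment's
# MFCC coefficient sequence.
import numpy as np
from scipy.stats import kurtosis, skew

def stats11(x):
    x = np.asarray(x, dtype=float).ravel()
    hist, edges = np.histogram(x, bins=32)            # coarse histogram for mode
    mode_est = 0.5 * (edges[hist.argmax()] + edges[hist.argmax() + 1])
    p = x ** 2 / (np.sum(x ** 2) + 1e-12)             # normalized energy distribution
    entropy = -np.sum(p * np.log(p + 1e-12))
    return np.array([
        x.mean(), x.std(), x.var(), mode_est, x.min(), x.max(),
        skew(x), kurtosis(x), entropy,
        np.sum(x ** 2),                               # energy
        np.mean(x ** 2),                              # average power
    ])
```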
2.5. Classification
In our study, the features extracted from heart sounds were classified using three different machine learning algorithms: k-Nearest Neighbor (kNN), Support Vector Machine (SVM), and Random Forest (RF).
The kNN algorithm performs classification using labeled training data. This algorithm determines the k nearest neighbors for a new instance to be classified and assigns it to the class that is most frequent among the neighbors. Different values of k were tested to achieve optimal performance. Various metrics are used for proximity calculations, with cosine distance being a commonly used metric for measuring similarity based on the angle between two vectors [42]. Since it considers vector orientation, cosine distance is often preferred to understand the angular differences between data and evaluate similarities. Cosine distance is inversely proportional to cosine similarity and is calculated using the formula in Equation (10):

$$d_{\cos}(\mathbf{a}, \mathbf{b}) = 1 - \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\, \|\mathbf{b}\|} \quad (10)$$

where $\mathbf{a}$ and $\mathbf{b}$ are the two vectors being compared; $\mathbf{a} \cdot \mathbf{b}$ is their dot product; and $\|\mathbf{a}\|$ and $\|\mathbf{b}\|$ are the norms of vectors $\mathbf{a}$ and $\mathbf{b}$, respectively.
Support Vector Machine, developed by Vapnik, Guyon, and Boser in 1992, is a machine learning method used to solve classification problems [43]. In this supervised learning algorithm, input data are positioned in an n-dimensional space, with each dimension representing a feature. Classification is performed by finding a hyperplane that distinctly separates the two classes. SVM algorithms include various kernel functions, such as linear, polynomial, and sigmoid functions and the radial basis function (RBF) [44]. Using kernel functions, the data space can be expanded to higher dimensions, forming a complex and curved decision boundary that better separates the dataset. In our study, features were classified using the RBF kernel.
Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (classification) of the individual trees. This approach is robust to overfitting, especially for datasets with a large number of features, and is computationally efficient [45].
To evaluate the classification performance of heart sound analysis, accuracy, sensitivity, specificity, and F1-score are computed. These metrics provide a comprehensive assessment of the model’s ability to correctly classify normal and abnormal heart sounds.
The accuracy metric measures the proportion of correctly classified heart sound signals among all samples and is defined in Equation (11):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (11)$$
where TP (True Positive) represents correctly classified abnormal heart sounds, TN (True Negative) denotes correctly classified normal heart sounds, FP (False Positive) refers to normal heart sounds incorrectly classified as abnormal, and FN (False Negative) corresponds to abnormal heart sounds misclassified as normal.
The sensitivity (recall) metric evaluates the model's ability to correctly detect abnormal heart sounds, ensuring that true cases are identified. It is calculated in Equation (12):

$$\text{Sensitivity} = \frac{TP}{TP + FN} \quad (12)$$
Similarly, specificity measures the ability of the model to correctly classify normal heart sounds, minimizing false alarms, and is formulated in Equation (13):

$$\text{Specificity} = \frac{TN}{TN + FP} \quad (13)$$
To balance sensitivity and precision, the F1-score is also computed. It provides the harmonic mean of precision and recall and is defined in Equation (14):

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN} \quad (14)$$
These metrics are widely used in biomedical signal classification to ensure reliable and clinically relevant evaluations of classification models [46].
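For completeness, Equations (11)–(14) can be computed directly from a binary confusion matrix, as in the following sketch (abnormal is treated as the positive class).

```python
# Equations (11)-(14) computed from a binary confusion matrix; the abnormal
# class (label 1) is treated as positive.
from sklearn.metrics import confusion_matrix

def report_metrics(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)                    # Eq. (11)
    sensitivity = tp / (tp + fn)                                  # Eq. (12)
    specificity = tn / (tn + fp)                                  # Eq. (13)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (14)
    return accuracy, sensitivity, specificity, f1
```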
3. Results
In this study, the classification performance of various machine learning algorithms was evaluated using both segmented and non-segmented heart sound datasets. The preprocessing stage included the application of notch and elliptic filters to eliminate noise and enhance signal clarity. Additionally, normalization was applied to ensure that all signals were scaled within a standardized range, thereby improving consistency across recordings.
Following preprocessing, two different strategies were employed for feature representation: segmentation-based and non-segmented analysis. In the segmentation-based approach, heart sound signals were divided into individual S1–S2 cycles using Shannon energy envelope extraction and Otsu thresholding. The issue of class imbalance was addressed through SMOTE as described in Section 2, and all subsequent analyses were conducted on the resulting balanced feature sets. To reduce feature dimensionality and eliminate redundancy, PCA was applied across all datasets, with the number of principal components determined by retaining 95% of the total variance. The comparative PCA results for both segmented and non-segmented features are illustrated in Figure 7.
While the PCA-based reduction for MFCC features was limited due to their compact representation, EMD features showed a more pronounced reduction, particularly in the segmented configuration. These reduced feature vectors were subsequently used for classification tasks.
Heart sound classification was performed using k-Nearest Neighbor (kNN), Support Vector Machine (SVM), and Random Forest (RF) on both EMD and MFCC feature sets. Various values of k (1, 3, and 5) were tested for kNN, with k = 3 providing the best results. The cosine distance metric was selected due to its effectiveness in capturing angular dissimilarities between feature vectors. For SVM, a radial basis function (RBF) kernel was employed to handle non-linear separability in the feature space. RF classifiers were configured with 100 decision trees, using Gini impurity as the splitting criterion to optimize classification performance.
To ensure a fair and robust evaluation, five-fold cross-validation was used instead of a traditional 80–20 train–test split. The dataset was divided into five equal subsets, with each fold used once as a test set while the remaining four served for training. This procedure provided a more reliable estimate of model performance across different data partitions.
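A condensed sketch of this evaluation setup with scikit-learn is shown below; the feature matrix is a placeholder, and the added standardization step is common practice rather than a detail reported in the text.

```python
# Evaluation sketch with scikit-learn: PCA retaining 95% variance, the three
# classifiers configured as described above, and 5-fold cross-validation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_bal = rng.normal(size=(600, 55))       # placeholder balanced feature matrix
y_bal = rng.integers(0, 2, size=600)     # placeholder binary labels

models = {
    "kNN": KNeighborsClassifier(n_neighbors=3, metric="cosine"),
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100, criterion="gini", random_state=0),
}
for name, clf in models.items():
    # Standardization added here as common practice (not stated in the text).
    pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95), clf)
    scores = cross_val_score(pipe, X_bal, y_bal, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```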
Classification performance was evaluated using accuracy, sensitivity, specificity, and F1-score to assess the influence of segmentation and feature extraction strategies. The overall results are summarized in Table 3, while Figure 8 presents the five-fold training and testing accuracy for each classifier–feature combination. The close alignment between training and testing results across folds indicates the absence of overfitting, further reinforcing the generalizability and robustness of the proposed approach.
Notably, although relatively simple classifiers were used, the incorporation of segmentation and appropriate feature extraction substantially enhanced model performance. For example, the accuracy of the EMD + kNN model increased from 88.40% to 99.97% with segmentation. These findings confirm that the reported high accuracy values are not incidental but arise from a carefully designed and comprehensive pipeline that integrates robust preprocessing, segmentation, and balanced learning strategies.
Furthermore, although the SVM classifier exhibited comparatively lower performance, particularly in the non-segmented case, this was likely due to its sensitivity to the high-dimensional feature space and imbalanced data distribution. Nevertheless, even these results emphasize the importance of segmentation in stabilizing classification outcomes.
To illustrate robustness across folds, Figure 9 presents boxplots together with mean classification performance and 95% confidence intervals (CIs) computed over five-fold cross-validation, providing a clear view of fold-to-fold variability and generalization.
To substantiate the observed gains, we conducted two-sided paired-sample t-tests on fold-wise balanced-accuracy estimates from five-fold cross-validation, treating models as paired within each fold. Table 4 reports mean differences (ΔBA) with Holm–Bonferroni-adjusted p-values. This analysis controlled fold-to-fold variability and limited Type-I error across multiple comparisons, demonstrating that most segmented vs. non-segmented contrasts were statistically significant (adjusted p < 0.05).
Comparative Analysis
The results clearly demonstrate that segmenting heart sounds significantly enhanced classification performance. Using MFCC features with kNN on segmented data resulted in 99.37% accuracy, 100.00% sensitivity, and 98.72% specificity, while EMD features with kNN achieved 99.97% accuracy, 99.98% sensitivity, and 99.96% specificity. In contrast, non-segmented data yielded lower performance, with MFCC features achieving 96.49% accuracy, 97.65% sensitivity, and 95.33% specificity, while EMD features reached 88.40% accuracy, 93.49% sensitivity, and 83.32% specificity.
To directly compare our approach with the existing works listed in Table 1, we explicitly benchmarked our results. For example, Deperlioglu [2] reported 97.21% accuracy, 94.78% sensitivity, and 99.65% specificity using a segmented CNN model on the PhysioNet A dataset, whereas our proposed EMD-kNN model achieved 99.97% accuracy, 99.98% sensitivity, and 99.96% specificity on the PhysioNet/CinC 2016 dataset. Similarly, Narváez et al. [5] achieved 99.25% accuracy on the Pascal dataset using Modified EWT + NASE, while our EMD-RF configuration achieved 99.95% accuracy, 99.96% sensitivity, and 99.93% specificity on the same dataset. In contrast, Maknickas and Maknickas [4] reported only 84.15% accuracy for a non-segmented CNN model on PhysioNet/CinC 2016. These comparisons validate that our segmentation-based pipeline outperforms several previously published methods on the same benchmark datasets.
Feature extraction also plays a crucial role in classification performance. In previous research, MFCCs with a CNN reached 85.3% accuracy on the PhysioNet/CinC 2016 dataset [11], and the Wavelet Scattering Transform with SVM obtained 92.23% accuracy [20]. Our findings indicate that MFCCs and EMD provide superior accuracy, particularly when combined with segmentation. Additionally, applying PCA reduced computational complexity while maintaining classification accuracy.
Among classifiers, kNN and Random Forest consistently outperformed SVM, with the highest performance achieved using kNN with segmented MFCCs (99.37%) and kNN with segmented EMD (99.97%). In contrast, SVM performed significantly lower on non-segmented data (81.88% accuracy for MFCCs, 71.76% for EMD). Other approaches in the literature include a CNN-RNN hybrid model with 98% accuracy [13] and a Decision Tree model with 86.35% accuracy [14]. The combination of SVM, kNN, and RF classifiers in [15] achieved 98.67% accuracy. These comparisons show that carefully designed traditional machine learning pipelines can surpass deep learning approaches when combined with effective segmentation and feature extraction techniques.
Overall, this study provides a comprehensive evaluation across multiple dimensions (segmentation effect, feature selection, classifier type, and dimensionality reduction), supplemented by clear, direct comparisons with existing methods. The results not only demonstrate the effectiveness of the proposed approach but also emphasize its practical viability for heart sound classification in real-world settings.
4. Conclusions and Discussion
This study presents an effective heart sound classification framework that integrates Shannon energy-based envelope extraction with Otsu thresholding for segmentation, followed by MFCC- and EMD-based feature extraction and classification using kNN, SVM, and Random Forest.
To explicitly isolate and quantify the contribution of segmentation, controlled comparisons were performed by keeping the feature extraction method and classifier fixed while varying only the segmentation step. For instance, EMD + kNN accuracy improved significantly from 88.40% (non-segmented) to 99.97% (segmented), and MFCC + kNN improved from 96.49% to 99.37%. These findings confirm that segmentation alone substantially enhances classification performance by effectively isolating physiologically meaningful signal components and reducing irrelevant noise.
To verify that the observed gains were not due to chance, we performed two-sided paired-sample t-tests on fold-wise balanced accuracy from five-fold cross-validation, treating models as paired within each fold. Table 4 reports mean differences (ΔBA) with Holm–Bonferroni-adjusted p-values. The paired design controlled fold-to-fold variability, and the Holm correction limited inflated Type-I error across multiple comparisons. Most segmented vs. non-segmented contrasts and within-feature classifier pairs remained statistically significant after correction (adjusted p < 0.05), indicating that the improvements were systematic rather than random. Consistent with our empirical observations, segmentation reduces between-classifier dispersion while the top models retain a clear performance edge, underscoring practical relevance even when absolute ΔBA values are small.
We adopted kNN as a transparent, nonparametric baseline over compact feature vectors (MFCCs/EMD), with optional PCA to stabilize distances and reduce dimensionality. Model selection (including the choice of k) was performed by cross-validation within each training fold. Although kNN is memory-based, per-fold training sets and vectorized, batched distance computations provide adequate scalability at our data size while preserving interpretability. In our experiments, kNN and Random Forest consistently outperformed SVM, especially with segmented inputs. PCA further improved computational efficiency without materially degrading accuracy. The proposed Shannon–Otsu segmentation adapted well to signal characteristics and was robust across datasets.
To objectively validate the effectiveness and generalizability of the proposed segmentation algorithm, we performed a segmentation sensitivity analysis using ground-truth S1 and S2 annotations in the PhysioNet/CinC 2016 training set. This analysis assessed how reliably the algorithm detects physiologically meaningful cardiac events relative to expert reference annotations, providing quantitative evidence of its practical utility in automated heart sound analysis. With a ±75 ms tolerance window, the proposed segmentation approach achieved a mean sensitivity of 90.68 ± 19.35% for S1 and 88.63 ± 23.76% for S2, with mean absolute errors of 200.39 ± 65.52 ms and 190.27 ± 62.43 ms, respectively, across all training subsets. These results are comparable to those of other unsupervised methods reported in the literature and confirm the reliability of the approach for large-scale heart sound analysis.
A key limitation of this study stems from the unavailability of ground-truth segment annotations (e.g., S1 and S2 boundaries) in the Pascal dataset, which precludes formal quantitative evaluation of segmentation accuracy using metrics such as onset timing error or sensitivity in that cohort. For Pascal, segmentation quality was assessed indirectly via visual inspection and its effect on downstream classification performance. In certain abnormal cases, as shown in Figure 3, S2 peaks may be attenuated or missed due to low-amplitude or transient noise, revealing the sensitivity of energy-based segmentation methods to pathological variations. To address this, future studies should employ datasets with annotated cardiac events to enable rigorous validation and explore more advanced segmentation techniques (e.g., Hidden Semi-Markov Models or deep learning-based approaches). Additionally, hybrid feature extraction strategies and larger, more diverse datasets may improve robustness and generalizability.
Although SMOTE effectively balanced the abnormal class, synthetic over-sampling can sometimes lead models to fit artificial patterns rather than physiological variability. In this study, potential overfitting was mitigated by applying five-fold cross-validation and monitoring the consistency of training and testing accuracies across folds. The absence of a substantial divergence between these values suggests that the classifiers generalized well despite the inclusion of synthetic data. Nevertheless, future work could investigate more advanced or data-driven augmentation such as Borderline-SMOTE, ADASYN, or GAN-based synthesis to further enhance realism and reduce reliance on interpolation-based techniques.
The proposed segmentation–classification pipeline has strong potential for integration into hospital information systems, portable diagnostic platforms, and personal health monitoring tools. Future work will focus on real-time deployment in clinical workflows and on evaluating integration with electronic health record systems to enhance automated cardiac screening and decision support.
In summary, this study highlights the critical role of segmentation in heart sound classification. It demonstrates that near-perfect classification accuracy can be achieved with relatively simple machine learning models and that these gains are statistically reliable after multiple-comparison correction, provided that segmentation is effectively combined with robust preprocessing and carefully designed feature extraction strategies.