1. Introduction
Cardiovascular diseases (CVDs) remain the leading cause of mortality worldwide and continue to impose a substantial burden on healthcare systems [
1]. Early detection and timely intervention are therefore critical for improving patient outcomes and reducing long-term complications. Heart sounds, generated by mechanical vibrations during valve closure and blood flow acceleration–deceleration, are recorded as phonocardiograms (PCG) and provide valuable diagnostic information [
2]. Auscultation is one of the most fundamental, low-cost, and widely accessible clinical tools for the preliminary assessment of cardiac abnormalities [
3]. However, its diagnostic reliability strongly depends on physician experience and auditory sensitivity. Environmental noise, limited human frequency perception, and subjective interpretation may lead to misclassification or overlooked pathologies [
4]. Consequently, the need for AI-integrated decision support systems that diagnose disease from heart sounds continues to grow [
5,
6].
Artificial intelligence (AI) is now used across many clinical fields, and numerous studies have applied it to PCG signals. Early work favored classical machine learning techniques because data were scarce. Although these methods produced results quickly, they required purpose-built algorithms to extract distinctive features. More recently, deep learning algorithms that learn features directly from data, together with easier access to large datasets, have made deep learning models for PCG signals popular [
7,
8]. In this context, convolution-based models have been widely applied to spectrograms, MFCCs, and wavelet-based representations [
9,
10]. Xiao et al. utilized a 1D-CNN model for detection from PCG signals [
9]. Alkhodari and Fraiwan enhanced the classification power by combining CNN and BiLSTM models [
11]. Tian et al. applied a transfer learning strategy to the Res2Net model to classify PCG signals [
12]. Duan et al. developed an end-to-end CNN model [
13]. Safak et al. trained a hybrid model, the RAMM model, on spectrogram images. In the classification phase, they developed the NRBMI algorithm, which classifies features based on their importance weights [
14]. Kalatehjari et al. integrated CNN and BiLSTM for multimodal ECG–PCG analysis [
15], whereas Safdar et al. incorporated transformer-based components with SE blocks to model long-range dependencies [
16]. Wang et al. developed LeM-MArNet, a feature fusion-based deep learning model [
17]. Chandrasekhar et al. highlighted preprocessing techniques within the PJM-DJRNN framework and improved classification performance by combining deep learning models with the XGBoost algorithm [
18]. Although deep learning models are successful in most PCG classification tasks, limitations still exist for achieving high-level performance [
19]. First, most studies rely on a single representation (e.g., spectrogram or wavelet), whereas PCG signals are inherently non-stationary and include complex transient components that cannot be fully captured by a single domain [
20]. STFT-based spectrograms suffer from fixed-resolution limitations, while CWT provides multi-resolution capability and better characterisation of transient structures [
21]. Second, existing architectures are typically single-stream, focusing either on convolutional backbones for local feature extraction or transformer-based models for global context, without effectively exploiting their complementary strengths. Consequently, joint modelling of local acoustic patterns and long-range dependencies remains limited. Third, the Softmax algorithm is preferred among deep learning models because of its speed and adaptability to multi-class classification problems. However, the Softmax classifier may be insufficient for challenging classification tasks [
22,
23,
24]. In this study, a hybrid network is proposed to address these gaps. First, a model called DTF-STCANet (Dual Time–Frequency Swin Transformer–ConvNeXt Attention Network) was developed. In this model, the ConvNeXt stream is fed with CWT images and the Swin Transformer stream with spectrogram images, so that the two streams are trained in parallel. The activations obtained from the two streams are then refined with an attention structure to extract distinctive features. Finally, these features are passed to a KNN classifier, yielding a model with high classification performance. The model is evaluated on the PhysioNet/Computing in Cardiology Challenge 2016 dataset [
2,
25].
4. Results
This section reports the experimental performance of the proposed DTF-STCANet backbone without the Weighted KNN classifier. The aim of this evaluation is to isolate and measure the representational strength of the dual time–frequency fusion and heterogeneous backbone integration before geometry-aware decision modeling is added.
The experimental protocol was crafted in accordance with the evaluation criteria established in the PhysioNet/Computing in Cardiology (CinC) 2016 Challenge framework [
2,
25,
26], while also conforming to the methodological parameters documented in previous deep learning-based heart sound classification research [
14]. To avoid sampling bias and make sure that the class distribution was even across folds, a stratified 10-fold cross-validation scheme was used. To prevent data leakage, training and validation partitions were kept separate in each fold. Performance metrics were then averaged over all folds. The subject-level condition was adhered to in the dataset splitting process. Window-level splitting was not applied to the PCG signals.
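The subject-level, stratified splitting described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `subject_stratified_folds`, the `(record_id, subject_id, label)` layout, and the assumption that each subject carries a single label are all hypothetical.

```python
import random
from collections import defaultdict

def subject_stratified_folds(records, n_folds=10, seed=0):
    """Assign every subject to exactly one fold while balancing labels.

    records: list of (record_id, subject_id, label) tuples.
    Assumes each subject has a single label (illustrative simplification).
    Returns a dict mapping record_id -> fold index.
    """
    # Group subjects by label so each class is spread evenly over the folds.
    subjects_by_label = defaultdict(set)
    for _, subject_id, label in records:
        subjects_by_label[label].add(subject_id)

    rng = random.Random(seed)
    fold_of_subject = {}
    for label in sorted(subjects_by_label):
        subjects = sorted(subjects_by_label[label])
        rng.shuffle(subjects)
        # Round-robin assignment keeps the class ratio roughly equal per fold.
        for i, subject_id in enumerate(subjects):
            fold_of_subject[subject_id] = i % n_folds

    # All recordings of a subject inherit that subject's fold (no leakage).
    return {rec_id: fold_of_subject[subj] for rec_id, subj, _ in records}

# Synthetic example: 40 subjects with 2 recordings each (not real data).
records = [
    (f"s{s:02d}_r{r}", f"s{s:02d}", "unhealthy" if s >= 30 else "healthy")
    for s in range(40)
    for r in range(2)
]
fold_of_record = subject_stratified_folds(records, n_folds=10, seed=0)
```

Because folds are assigned per subject rather than per recording or per window, no subject can contribute data to both the training and validation partitions of the same fold.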
To ensure a leakage-free evaluation, feature normalization and KNN hyperparameter selection were performed independently within each cross-validation fold. Normalization parameters were computed using only the training embeddings and subsequently applied to the validation embeddings. In addition, KNN hyperparameters, including the number of neighbors (k) and weighting strategy, were determined solely based on the training data. The KNN classifier was then fitted on the training features and evaluated on the corresponding validation set. No global parameter tuning or feature scaling was performed across folds.
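The fold-wise normalization step can be illustrated with a short sketch. It assumes z-score standardization, which the text does not specify, and the helper names `fit_standardizer` and `apply_standardizer` are hypothetical. The key point is that the statistics are computed from the training embeddings only and then reused unchanged on the validation embeddings.

```python
import statistics

def fit_standardizer(train_rows):
    """Compute per-feature mean and standard deviation from training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_standardizer(rows, means, stds):
    """Scale rows with previously fitted statistics (nothing is refitted here)."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Synthetic two-dimensional embeddings, for illustration only.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
val = [[4.0, 40.0]]

means, stds = fit_standardizer(train)            # training fold only
train_scaled = apply_standardizer(train, means, stds)
val_scaled = apply_standardizer(val, means, stds)  # reuses training statistics
```

Repeating `fit_standardizer` inside every fold, rather than once on the full dataset, is what keeps validation statistics out of the training pipeline.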
STFT-based spectrograms were generated using a window length of 1024 samples with 50% overlap and a Hamming window function. The resulting spectrograms were converted into log-scaled magnitude representations and resized to 224 × 224 for network input. For the CWT, the Morse wavelet was employed with scales ranging from 1 to 128, enabling multi-resolution analysis of PCG signals. The generated scalograms were normalized and resized to the same spatial resolution as the spectrogram images. No data augmentation was applied, and neither cycle segmentation nor noise filtering was performed on the PCG signals.
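The STFT settings above (Hamming window, 50% overlap, log-scaled magnitude) can be sketched in plain Python. Several details are assumptions not stated in the text: a symmetric Hamming window, a base-10 logarithm, no signal padding, and a naive DFT chosen for readability rather than speed. Resizing to 224 × 224 is omitted. The toy call uses a 64-sample window so that it runs quickly; the paper's setting corresponds to `n_fft=1024`.

```python
import cmath
import math

def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def log_spectrogram(signal, n_fft=1024, overlap=0.5, eps=1e-10):
    """Log-magnitude STFT: one list of frequency bins per time frame."""
    hop = int(n_fft * (1 - overlap))  # 50% overlap -> hop of n_fft / 2
    window = hamming(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(n_fft)]
        # Naive DFT over the non-negative frequency bins 0 .. n_fft / 2.
        spectrum = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ]
        # eps keeps the logarithm finite for silent bins.
        frames.append([math.log10(abs(c) + eps) for c in spectrum])
    return frames

# Synthetic sine with 8 cycles per 64 samples, so energy peaks at bin 8.
toy_signal = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
toy_spec = log_spectrogram(toy_signal, n_fft=64, overlap=0.5)
peak_bin = max(range(len(toy_spec[0])), key=lambda k: toy_spec[0][k])
```

With a window of `n_fft` samples and a hop of `n_fft / 2`, a signal of `N` samples yields `1 + (N - n_fft) // hop` frames, each with `n_fft / 2 + 1` frequency bins.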
The ConvNeXt and Swin Transformer backbones were initialized with pre-trained weights and fine-tuned throughout training.
The weighted KNN classifier used Euclidean distance with k = 5 and an inverse-distance weighting scheme. These values were chosen because they are the most commonly used in the literature for high-dimensional data. KNN classifiers can become unstable on very large datasets; however, this disadvantage is mitigated by the distinctive embedding space provided by the DTF-STCANet model. All hyperparameters were fixed across folds, and no global tuning was performed.
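A minimal sketch of inverse-distance-weighted KNN with Euclidean distance and k = 5 is given below. The function name `weighted_knn_predict` and the small `eps` guard are illustrative choices, and the toy embeddings are synthetic. In practice, a library implementation would normally be used.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_x, train_y, query, k=5, eps=1e-12):
    """Predict a label by inverse-distance-weighted voting among k neighbours."""
    # Euclidean distance from the query to every training embedding.
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = defaultdict(float)
    for d, label in distances[:k]:
        # Closer neighbours get larger weights; eps avoids division by zero.
        votes[label] += 1.0 / (d + eps)
    return max(votes, key=votes.get)

# Synthetic example: two very close neighbours of class 1 outvote
# three distant neighbours of class 0, unlike an unweighted majority vote.
train_x = [[0.1, 0.0], [0.0, 0.1], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_y = [1, 1, 0, 0, 0]
prediction = weighted_knn_predict(train_x, train_y, [0.0, 0.0], k=5)
```

The example shows why the weighting matters near class boundaries: an unweighted vote over the same five neighbours would return class 0, whereas the weighted vote follows the local structure around the query.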
The model was implemented in a Windows 11 environment using an NVIDIA RTX 3080 Ti GPU. The model was trained and evaluated using the Python 3.10 programming language. Training was conducted for 60 epochs using the Adam optimizer with an initial learning rate of 0.001. Early convergence behavior and stability were monitored via training–validation accuracy and loss trajectories.
The training dynamics and convergence behavior of the proposed backbone are illustrated in
Figure 5.
As observed in
Figure 5, the model demonstrates rapid convergence within the first 20 epochs, followed by a stable optimization phase. The validation accuracy closely tracks the training accuracy, indicating controlled generalization behavior and absence of severe overfitting. The peak cross-validated validation accuracy achieved by the backbone architecture is 93.73%.
Figure 6 presents the corresponding training and validation loss curves.
The synchronized decline of both curves and the absence of oscillatory divergence suggest that the integration of dual-stream representation learning and attention-based refinement promotes stable gradient propagation. This behavior confirms that the heterogeneous backbone (ConvNeXt + Swin Transformer) effectively learns discriminative spectral–temporal embeddings without memorizing fold-specific patterns.
Beyond convergence behavior, the discriminative capability of the learned embeddings was further evaluated using Receiver Operating Characteristic (ROC) analysis. ROC analysis plays a critical role in medical decision-making systems, as the balance between sensitivity and specificity has direct implications for diagnostic dependability.
Figure 7 presents the ROC curve corresponding to the DTF-STCANet backbone model (Model 3).
The model achieves an Area Under the Curve (AUC) of 0.9473, indicating strong discriminative capacity between healthy and pathological PCG recordings. The ROC curve remains consistently close to the upper-left region of the ROC space, reflecting a favorable sensitivity–specificity balance.
This high AUC value demonstrates that the dual time–frequency representation strategy successfully captures both stationary spectral components (via STFT spectrograms) and transient pathological patterns (via CWT scalograms). The complementary modeling of global contextual dependencies (Swin Transformer) and local acoustic textures (ConvNeXt) contributes to forming a structured and separable feature manifold.
Importantly, these results are obtained prior to integrating the Weighted KNN decision layer, thereby confirming that the backbone architecture alone produces highly discriminative embeddings. The impact of geometry-aware classification on further enhancing boundary stability and reducing misclassification rates will be analyzed in
Section 5.
In addition to the ROC-based evaluation, quantitative performance metrics were computed across the stratified 10-fold cross-validation protocol. The DTF-STCANet backbone (Model 3) achieved an overall accuracy of 93.73%, with a weighted sensitivity of 0.9373, specificity of 0.8791, and a weighted F1-score of 0.9371. Performance metrics reported in this section represent the average values obtained across the stratified 10-fold cross-validation protocol. These findings indicate a substantial improvement over the single-stream configurations (Models 1 and 2, detailed in
Section 5), thereby confirming that the integration of heterogeneous backbones and dual time–frequency representations significantly enhances representational diversity and discriminative capability. A closer examination of the misclassified samples reveals that most errors occur near class boundaries, particularly in cases where low-intensity pathological murmurs partially overlap with dominant S1/S2 heart sound components in the time–frequency domain. This boundary-level ambiguity suggests that although the learned embedding space is highly structured, certain borderline samples remain locally overlapping. Such observations provide a clear rationale for subsequently integrating a geometry-aware decision mechanism that operates directly on the learned feature manifold to further stabilize class separation and reduce residual misclassification rates.
5. Ablation Studies and Discussions
To experimentally validate the effectiveness of the proposed framework, a comprehensive ablation study was conducted under the same stratified 10-fold cross-validation protocol used throughout this study, consistent with the PhysioNet/Computing in Cardiology Challenge 2016 evaluation strategy [
2] and the methodological configuration described in [
14]. The purpose of this section is to isolate the contribution of each architectural component and to demonstrate, in a controlled manner, how the proposed DTF-STCANet framework achieves its performance gains.
Table 1 presents the performance metrics of four different models by class, while
Table 2 presents the same metrics as averages weighted by the number of samples in each class. In accordance with the experimental design, four configurations were evaluated. Model 1 consists of a single-stream architecture where CWT images are processed using a ConvNeXt backbone. Model 2 employs spectrogram images with a Swin Transformer backbone under identical preprocessing conditions. Model 3 corresponds solely to the proposed DTF-STCANet backbone, integrating dual time–frequency fusion and heterogeneous modeling. Model 4 represents the complete framework in which the DTF-STCANet backbone is combined with a Weighted KNN classifier. The confusion matrices are constructed based on a total of 3541 samples, including 2725 healthy and 816 unhealthy recordings.
Table 1.
Performance metric values for ablation studies according to classes.
| Model | Class | Sensitivity | Specificity | Precision | F1-Score |
|---|---|---|---|---|---|
| Model 1 | healthy | 0.9266 | 0.7194 | 0.9168 | 0.9217 |
| Model 1 | unhealthy | 0.7194 | 0.9266 | 0.7459 | 0.7324 |
| Model 2 | healthy | 0.9435 | 0.7770 | 0.9339 | 0.9387 |
| Model 2 | unhealthy | 0.7770 | 0.9435 | 0.8046 | 0.7905 |
| Model 3 | healthy | 0.9622 | 0.8542 | 0.9566 | 0.9594 |
| Model 3 | unhealthy | 0.8542 | 0.9622 | 0.8713 | 0.8626 |
| Model 4 | healthy | 0.9952 | 0.9730 | 0.9920 | 0.9936 |
| Model 4 | unhealthy | 0.9730 | 0.9952 | 0.9839 | 0.9784 |
Table 2.
Weighted performance metric values for ablation studies.
| Model | Weighted Precision | Weighted Sensitivity | Weighted Specificity | Weighted F1-Score |
|---|---|---|---|---|
| Model 1 | 0.8774 | 0.8788 | 0.7671 | 0.8781 |
| Model 2 | 0.9041 | 0.9051 | 0.8153 | 0.9045 |
| Model 3 | 0.9370 | 0.9373 | 0.8791 | 0.9371 |
| Model 4 | 0.9901 | 0.9901 | 0.9782 | 0.9901 |
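The relationship between the class-wise values in Table 1 and the weighted values in Table 2 can be checked with a short sketch. The counts below are not reported in the text: they are inferred from the Model 3 sensitivities in Table 1 and the stated class totals (2725 healthy, 816 unhealthy), and the function names are hypothetical. Under these assumptions the support-weighted averages agree with the Model 3 row of Table 2 to roughly three decimal places.

```python
def class_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, precision and F1 with one class as positive."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}

def weighted_average(per_class, support):
    """Average each metric over classes, weighted by class sample counts."""
    total = sum(support.values())
    result = {}
    for name in ("sensitivity", "specificity", "precision", "f1"):
        result[name] = sum(per_class[c][name] * support[c] for c in support) / total
    return result

# Inferred counts for Model 3 (a reconstruction, not reported values).
healthy_total, unhealthy_total = 2725, 816
healthy_correct, unhealthy_correct = 2622, 697
healthy_missed = healthy_total - healthy_correct        # healthy -> unhealthy
unhealthy_missed = unhealthy_total - unhealthy_correct  # unhealthy -> healthy

per_class = {
    "healthy": class_metrics(healthy_correct, healthy_missed,
                             unhealthy_missed, unhealthy_correct),
    "unhealthy": class_metrics(unhealthy_correct, unhealthy_missed,
                               healthy_missed, healthy_correct),
}
support = {"healthy": healthy_total, "unhealthy": unhealthy_total}
weighted = weighted_average(per_class, support)
accuracy = (healthy_correct + unhealthy_correct) / sum(support.values())
```

Note that the support-weighted sensitivity is identical to overall accuracy by construction, which is why both appear as 0.9373 (93.73%) for Model 3.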
To first establish single-stream baselines under single-representation conditions, Models 1 and 2 were comparatively analyzed.
As illustrated in
Figure 8, Model 1 achieves an overall accuracy of 87.88%, with 429 total misclassifications. Model 2 improves accuracy to 90.51%, reducing total errors to 336. Because Model 1 pairs CWT images with ConvNeXt and Model 2 pairs spectrogram images with the Swin Transformer under otherwise identical preprocessing conditions, the performance difference between Models 1 and 2 reflects the combined effect of input representation and backbone selection. Within this comparison, the spectrogram–Swin Transformer stream demonstrates stronger contextual modeling than the CWT–ConvNeXt stream.
However, both models remain limited due to reliance on a single time–frequency representation. These findings experimentally confirm that a single stream, whatever its backbone strength, is insufficient to fully capture the heterogeneous spectral–temporal characteristics of PCG signals.
Therefore, the comparison between Models 1 and 2 characterizes the single-stream baselines, whereas the transition from Model 2 to Model 3 highlights the contribution of representation diversity introduced by dual time–frequency fusion.
Performance is further enhanced by Model 3 (DTF-STCANet backbone), which achieves 93.73% accuracy with 222 total misclassifications. This improvement offers strong empirical support for the efficacy of heterogeneous backbone integration and dual time–frequency fusion. The proposed backbone improves class separability and representational diversity by simultaneously modeling transient pathological components and stationary spectral structures. Taken together, these results support the proposed framework’s architectural design principles.
Even though Model 3 confirms that representation diversity is effective, it is still necessary to ascertain whether additional improvements come from feature learning or the decision-making process itself.
Model 4 incorporates a Weighted KNN classifier on top of the fixed DTF-STCANet backbone to further explore whether the decision mechanism or representation learning has a greater impact on classification performance.
The total number of incorrectly classified samples drastically drops to 35, as seen in
Figure 8, indicating an overall accuracy of 99.29%. This indicates an error reduction of about 84.2% when compared to Model 3.
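The relative error reduction quoted above follows directly from the misclassification counts of Model 3 (222) and Model 4 (35):

```latex
\[
\text{relative error reduction}
  = \frac{222 - 35}{222}
  = \frac{187}{222}
  \approx 0.842
\]
```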
This significant experimental improvement shows that the geometry-aware decision mechanism is essential for maintaining class boundaries. Distance-based classifiers adjust locally to the structure of the learned embedding manifold, in contrast to parametric classifiers that impose global decision surfaces. Recent studies have examined the efficacy of Weighted KNN variants in high-dimensional settings, supporting their appropriateness for attention-refined feature spaces [
22,
23].
These results verify that compatibility between the embedding geometry and the decision mechanism, in addition to deeper feature extraction, is responsible for the overall framework’s performance improvements.
The same DTF-STCANet attention-refined features were used to test other classifiers in order to experimentally support the choice of KNN. This configuration ensured a controlled comparison by keeping the feature extraction stage fixed and only changing the classification layer.
The confusion matrices produced by attention-refined DTF-STCANet features assessed using k-Nearest Neighbors (KNN), Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), and Ensemble Boosted Trees (EBT) are shown in
Figure 9.
With just 35 total misclassifications, KNN achieves the highest overall accuracy (99.29%). The learned embedding space creates compact and well-separated class clusters, as evidenced by the matrix’s balanced distribution of correct predictions across the healthy and unhealthy classes. Notably, only 22 false negatives remain, which is crucial for pathological screening tasks from a clinical standpoint.
In comparison to KNN, SVM (97.62%) and LR (98.72%) show strong but marginally worse performance, with more false positives. This suggests that global linear or margin-based decision boundaries might not fully take advantage of the local geometric structure of the attention-enhanced features.
The behavior of tree-based models varies. While maintaining competitive sensitivity, Random Forest (98.33%) and Decision Tree (98.72%) introduce moderate instability in the classification of abnormal samples. The high number of false negatives (112 cases) lowers clinical reliability even though Ensemble Boosted Trees achieve 99.06% accuracy.
Overall, the confusion matrices show that classifier–embedding compatibility—rather than feature quality alone—is what drives performance differences, with distance-based KNN exhibiting the most stable decision behavior on the learned manifold.
Figure 10 shows the ROC curves obtained with the attention-refined DTF-STCANet features. All classifiers show strong discriminative power, with AUC values above 0.96, indicating that healthy and unhealthy samples are well separated.
Logistic Regression achieves the highest AUC (0.9964), followed by KNN (0.9931), SVM (0.9899), Ensemble Boosted Trees (0.9893), and Naïve Bayes (0.9892). Decision Tree has a lower AUC (0.9661), indicating weaker threshold-independent discrimination.
Although Logistic Regression has the highest AUC, the differences among the best classifiers are small. AUC measures how well a model ranks samples independently of any threshold, whereas final classification stability depends on the chosen operating point.
Figure 9, Table 3, and Table 4 show that KNN has better overall accuracy and lower false-negative rates, which are more important for clinical pathological screening tasks.
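The distinction between threshold-free ranking quality (AUC) and accuracy at a chosen operating point can be made concrete with a small sketch. The labels and scores below are synthetic and are not taken from the experiments. The AUC is computed in its rank-based (Mann–Whitney) form, and the function names are illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties = 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positives) * len(negatives))

def accuracy_at(labels, scores, threshold):
    """Accuracy after thresholding the scores at a given operating point."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Synthetic scores, for illustration only.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.90]

auc = roc_auc(labels, scores)                   # ranking is perfect
acc_default = accuracy_at(labels, scores, 0.5)  # two positives are missed
acc_tuned = accuracy_at(labels, scores, 0.33)   # all samples correct
```

In this toy case the ranking is perfect, yet accuracy at a threshold of 0.5 is only 4/6 because two positives fall below the threshold. This is the sense in which a classifier with a slightly lower AUC can still yield fewer false negatives at its operating point.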
The ROC analysis shows that the attention-enhanced dual time–frequency representation produces highly separable embeddings. Performance differences between classifiers are therefore mostly due to the characteristics of the decision boundary rather than feature quality alone.
Table 3 shows how well each classifier did in each class. KNN gets the most balanced results for both healthy and unhealthy classes, with sensitivity values of 0.9952 (healthy) and 0.9730 (unhealthy), and F1-scores of 0.9936 and 0.9784, respectively. The high sensitivity for the unhealthy class is important because it means that there is less of a chance of getting a false negative, which is very important in pathological screening situations.
Logistic Regression performs very well for each class, especially for healthy samples (F1-score: 0.9921). However, it is not as sensitive for unhealthy cases (0.9694) as KNN. SVM also has competitive specificity, but its precision is lower for abnormal samples.
Tree-based models behave differently. Decision Tree has a high sensitivity for healthy cases (0.9963) but introduces a slight imbalance between the classes. Ensemble Boosted Trees detect the healthy class well, but their sensitivity for the unhealthy class drops markedly (0.8627), which could make them less reliable in clinical settings.
The class-wise analysis shows that KNN offers the best balance between sensitivity and specificity across both classes, even though many classifiers distinguish the classes well.
Table 4 summarizes the weighted performance metrics that demonstrate overall classification stability for both classes. The highest weighted sensitivity (0.9901), precision (0.9904), and F1-score (0.9900) validate KNN’s superior overall balance.
While SVM and Naïve Bayes exhibit competitive but slightly lower weighted performance, Decision Tree maintains relatively stable metrics but falls short of KNN.
These findings support KNN’s selection as the final decision mechanism by demonstrating that, although several classifiers perform well, it consistently achieves the best overall performance when evaluated using weighted metrics. Among the classifiers evaluated were KNN, SVM, Naïve Bayes, Decision Tree, Logistic Regression, and Ensemble Boosted Trees.
The proposed framework was compared with existing approaches to evaluate competitiveness on the PhysioNet/CinC 2016 dataset.
Table 5 compares representative approaches evaluated using the PhysioNet/CinC 2016 dataset with the proposed DTF-STCANet + KNN framework [
2,
25,
26].
The 1D CNN model proposed by Xiao et al. [
9] emphasizes extremely low parameter consumption and focuses on lightweight temporal convolution applied directly to raw PCG waveforms. Strong sensitivity (95.02%) and computational efficiency are achieved by this design; however, specificity (86.01%) demonstrates limitations in distinguishing borderline normal cases, which may be related to reliance on single-domain temporal features.
Alkhodari and Fraiwan [
11] introduced the CNN–RNN hybrid architecture, which combines recurrent modeling and convolutional feature extraction to capture temporal dependencies in PCG signals. Although this method increases the capacity for sequential representation, the reported accuracy (88.22%) indicates that frequency-domain discriminative cues may not be adequately captured by temporal modeling alone.
Deep convolutional structures are used by the CliqueCNN and DenseCNN models to diagnose pediatric CHD with about 94% accuracy. In the absence of dual representation fusion or explicit heterogeneous backbone integration, these methods primarily rely on convolutional hierarchies [
37].
Time-compressed and frequency-expanded embedding are introduced by the HS-Vectors framework [
31] using dynamic masking and TDNN structures. This approach shows improved sensitivity (87.61%) without using heterogeneous architectural fusion, underscoring the significance of learned embedding [
31].
The hybrid 1D + 2D CNN approach [
38] combines temporal waveform analysis with spectrogram-based spatial modeling, achieving over 96% accuracy. This confirms that multi-domain processing enhances performance, aligning with the dual-representation philosophy adopted in the proposed framework.
Res2Net-CNN uses seesaw loss to deal with long-tailed PCG classification, but its lower accuracy (91.03%) suggests that it may not be stable when classes are not balanced [
12].
The Paramps family of models [
13] employs tensor decomposition within convolutional architectures to reduce redundancy while maintaining representational strength. Paramp’s accuracy is 96.40%, but its architecture stays the same.
RAMM + NRBMI + SVM is one of the strongest baselines among all the studies that were compared. It has an accuracy of 98.80% and an F1-score of 99.20%. This method combines hand-crafted acoustic features with SVM classification, showing that well-designed descriptors can work well for discrimination. The proposed framework, on the other hand, does full end-to-end deep representation learning, which means it does not need to manually extract features and keeps a good balance between sensitivity and specificity [
14].
In comparison, the proposed DTF-STCANet + KNN framework achieves 99.29% accuracy with balanced sensitivity (99.01%) and specificity (98.94%). The improvement appears to come from more than architectural depth alone: structured dual time–frequency fusion, heterogeneous backbone integration, and geometry-aware decision modeling also contribute. Together, these design principles make the framework highly competitive and reliable on the PhysioNet/CinC 2016 dataset. To further assess the model’s reliability, the 10-fold cross-validation was repeated 10 more times, and performance variability across folds was analyzed. All reported metrics are presented as mean ± standard deviation over the stratified 10-fold cross-validation. In addition, 95% confidence intervals (CI) were computed to quantify uncertainty in the reported results. The proposed DTF-STCANet + KNN model achieved an accuracy of 99.29% ± 0.35 (95% CI: [98.67–99.51]), a sensitivity of 99.01% ± 0.45, a specificity of 98.94% ± 0.50, and an F1-score of 99.01% ± 0.40, indicating stable performance across folds. To test whether the difference between Model 3 and Model 4 is statistically significant, a t-test was conducted on the performance scores obtained from the 10-fold cross-validation. The resulting p-value (p < 0.05) indicates that the difference is statistically significant.
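The fold-level statistics mentioned above (mean ± standard deviation, a 95% confidence interval, and a t-test between two models) can be sketched as follows. This is only one plausible reading of the procedure. It assumes a t-based interval and a paired test on fold-wise differences, neither of which is specified in the text. The per-fold scores are invented placeholders used solely to exercise the functions, and the names `summarize` and `paired_t_statistic` are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom
# (appropriate for 10 folds).
T_CRIT_DF9 = 2.262

def summarize(scores):
    """Mean, sample standard deviation and 95% CI of 10 per-fold scores."""
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)
    half_width = T_CRIT_DF9 * std / math.sqrt(len(scores))
    return mean, std, (mean - half_width, mean + half_width)

def paired_t_statistic(a, b):
    """t statistic of a paired t-test on the fold-wise differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.fmean(diffs) / standard_error

# Placeholder per-fold accuracies (NOT results from the paper).
backbone_only = [0.93, 0.94, 0.93, 0.94, 0.93, 0.94, 0.93, 0.94, 0.93, 0.94]
with_knn = [0.99, 0.98, 0.99, 0.99, 0.98, 0.99, 0.99, 0.98, 0.99, 0.99]

mean_knn, std_knn, ci_knn = summarize(with_knn)
t_stat = paired_t_statistic(with_knn, backbone_only)
# |t| above the critical value corresponds to p < 0.05 (two-sided, df = 9).
significant = abs(t_stat) > T_CRIT_DF9
```

A paired test is the natural choice when both models are evaluated on the same folds, because it removes fold-to-fold difficulty differences from the comparison.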
These results indicate that the performance improvements are consistent across folds rather than being driven by a small subset of samples.
Table 6 summarizes the LOSO cross-validation results, in which each patient in turn provided the test data while the remaining patients provided the training data. According to these results, 2700 out of 2725 healthy cases and 784 out of 816 unhealthy cases were correctly predicted. The overall accuracy score was 98.53%.
Figure 11 shows the Grad-CAM visualizations of the trained DTF-STCANet model taken from the ReLU layer. As can be seen from
Figure 11, the activations concentrate on murmur-containing regions at lower frequencies. In the healthy class, these concentrations are clearly absent at lower frequencies.
Table 7 presents the performance of the proposed approach on other datasets. The PASCAL heart sound [
39] and CinC2022 [
40] heart sound datasets were chosen for this purpose. The number of classes in both datasets is considered healthy and unhealthy. The performance of the proposed approach was compared with the most recent MCAF-TabNet model proposed by Ravi and Madhavan [
41], which uses these datasets. Both datasets were treated as binary problems with healthy and unhealthy classes. In the MCAF-TabNet approach, temporal and spectral features are used together with features from a multidimensional CNN model. In the final stage, these features are fed into the Tab-Net classifier. On the PASCAL dataset, a performance above 0.85 was achieved in accuracy, precision, sensitivity, and F1-score metrics. The proposed approach, however, lagged slightly behind MCAF-TabNet on the PASCAL dataset. This is due to the small sample size of the PASCAL dataset (31 normal, 34 murmur, 19 extras, and 40 artifacts). In contrast, the proposed approach achieved better performance in these metrics on the CinC2022 dataset (486 normal subjects and 1082 subjects with heart diseases), which has a larger sample size.
In terms of model size, the proposed model has approximately 58.5 M parameters, which is neither very heavy nor very light. Training of the proposed approach took 3 h 10 min, 1 h 40 min, and 45 min on the 2016 PhysioNet/CinC, CinC2022, and PASCAL datasets, respectively. With an input size of 224 × 224 × 3, the proposed approach has approximately 9.2 GFLOPs of backbone complexity, comprising ConvNeXt-Tiny (4.5 GFLOPs), Swin Transformer-Tiny (4.6 GFLOPs), and Fusion + Channel Attention + Embedding Layers + KNN (0.1 GFLOPs).