1. Introduction
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, in both developed and developing countries, and cover a variety of cardiac conditions, including heart attack and hypertension [1]. According to the World Health Organization (WHO), 19.8 million people died of CVD worldwide in 2022 [2], which corresponds to approximately one-third of global deaths. Given this high rate, early diagnosis is critical. With the gradual aging of the world population, this number is projected to rise to 24 million by 2030 and 32.3 million by 2050 [3,4]. Technological developments, particularly in artificial intelligence, have driven a major transformation that extends to healthcare [5,6]. Ischemia is characterized by insufficient blood flow to the myocardium and can cause sudden cardiac death. Arrhythmia is an irregular heart rhythm caused by abnormal electrical activity in the heart. Early and accurate diagnosis of these cardiac diseases is essential for improving patients’ quality of life and preventing deaths [4].
The electrocardiogram (ECG) is the most basic method for diagnosing heart rhythm disorders. Because manual analysis of ECGs, which provide important information for the diagnosis of heart diseases, is time-consuming, an accurate method of automatic analysis would be of great value. The ECG records the electrical signals produced by the heart, which effectively assist the diagnosis of ischemia, arrhythmia, and other CVD conditions [7], and any heart rhythm irregularity can change the ECG signal. The recording is based on a standard 12-lead system that measures the electrical potential at ten electrodes placed on the surface of the skin, six on the chest and four on the limbs [8,9]. However, many arrhythmias are difficult to diagnose with a standard resting ECG because it provides only a brief snapshot of the patient’s cardiovascular activity. An intermittent arrhythmia may go unnoticed, and physicians must rely on patients’ self-monitoring and reported symptoms to support their final diagnosis [1]. The detection of arrhythmia using ECG is challenging for several reasons: the typical ECG waveform varies between individuals, the same disease can manifest differently in the waveforms of different patients, two different diseases can have roughly similar effects on the waveform, ECG features are inconsistent, and no universally effective detection algorithm exists for ECG classification [8,10].
The normal frequency range of the signal is 0.05–100 Hz [
1], the amplitude range is between 10 µV and 5 mV, and the normal value is 1 mV. Various heart diseases are analyzed using ECG signals [
11]. Various detection techniques for cardiovascular diseases have largely been presented in recent years. Most of these methods consist of four steps: preprocessing (denoising), dimension reduction, feature selection, and different cardiac arrhythmia identification. The preprocessing stage makes signals suitable for processing, emphasizing the use of filters to remove the existing noise in the ECG signal record. Again, different transformation techniques are used to detect sensitive R peaks and the QRS complex in ECG signals. Using machine learning and deep learning techniques, feature extraction, feature selection, and classification techniques have been presented in ECG beat classification [
12]. In recent years, deep learning-based models have achieved significant success in automatic analysis of ECG signals. In particular, one-dimensional convolutional neural networks (1-D CNNs) [
13]; long-short-term memory (LSTM) networks [
14]; and, more recently, Transformer-based models [
15] have achieved high accuracy rates in arrhythmia classification tasks by effectively capturing sequential correlations in time series. The majority of studies on arrhythmia classification have used the MIT-BIH Arrhythmia Database, an open access and labeled resource. This dataset is widely accepted as a benchmark for model training and literature comparison due to the patient records covering various types of arrhythmia [
16]. However, these supervised learning-based models present difficulties, such as the need for labeled data, that limit their wide-scale applicability. Manual labeling of ECG signals at the beat level is a time-consuming and costly process that requires significant expertise [
15]. This limits the ability of supervised models to easily adapt to different datasets and their generalization capacity. In datasets such as MIT-BIH, critical arrhythmia types such as ventricular premature beats (VBs) are represented in minimal numbers compared to normal beats due to class distribution imbalance. This imbalance leads to low model performance in minority classes [
17]. The main motivation of this study is to make the arrhythmia classification task more flexible and generalizable by reducing the need for labeled data. For this purpose, five clinically significant beat types (N, L, R, V, and A) were first selected from the MIT-BIH Arrhythmia Database; the class imbalance problem was then addressed with the SMOTE (Synthetic Minority Over-sampling Technique) method. In this way, the number of samples was equalized across classes, and the learning performance of the model on small classes was improved. To reduce the need for labeled data, a self-supervised representation learning method, SimCLR (Simple Framework for Contrastive Learning of Visual Representations), was used, and meaningful vector representations were obtained from two different augmented views of each segment. With these representations, a lightweight MLP (Multi-Layer Perceptron) classifier with only two layers achieved 97.46% accuracy. In addition, the performance of the model was evaluated in detail with analyses such as the confusion matrix, ROC-AUC curves, F1-score values, and t-SNE visualizations. This study thus shows that self-supervised learning can be applied effectively to medical time series such as ECG and, with its simple and explainable structure, offers an alternative to the complex supervised models in the literature. Accordingly, expected symmetries in beat-centered segments are made explicit, and invariance in the learned representations is promoted through SimCLR augmentations. Departures from these symmetries, which are typical of arrhythmic beats, are exploited as informative signals that enhance class separability.
The organization of this paper is as follows:
Section 2 presents the literature review of recent studies and methods used on ECG classification.
Section 3 explains the steps of the proposed method in detail; the dataset, segmentation process, class balancing with SMOTE, SimCLR-based representation learning, and MLP classifier are discussed in this section.
Section 4 presents the experimental results of the proposed method; training performance, class-based evaluation metrics, and results supported by visual analysis, such as ROC and t-SNE, are included.
Section 5 compares the method with similar studies in the literature and discusses the general implications of the obtained results.
Finally, Section 6 provides a general summary of the work, highlights the main contributions, and makes recommendations for future work.
2. Related Work
The vast majority of studies on ECG beat classification use publicly available datasets such as the MIT-BIH Arrhythmia Database and develop supervised models based on deep learning. Acharya et al. [
13] reported 94.03% accuracy with a 1D CNN-based model. Similarly, Yildirim [
14] achieved 99.39% accuracy with an LSTM-based deep structure. Rajkumar et al. [
11] proposed a CNN model that achieved 98.21% accuracy on MIT-BIH data. More recently, Transformer architectures have also been adapted to this field [
15]. Although these studies achieve strong results, high accuracy requires large amounts of labeled data, complex architectures, and long training times. While N (normal) beats are generally dominant in MIT-BIH data, the number of samples in classes such as VEB (ventricular ectopic beat) and SVEB (supraventricular ectopic beat) is quite low. This imbalance makes it difficult for the model to learn minority classes. Various approaches have been proposed to solve this problem, including resampling methods such as SMOTE, GAN-based data generation, and class-weighted loss functions. However, most of these methods are tightly coupled to the supervised learning setting [
17,
18,
19]. In recent years, self-supervised learning (SSL) approaches have also been used in signal processing, especially to reduce the dependency on labeled data. Contrastive learning methods such as SimCLR [
20], BYOL [
21], and MoCo [
22], which have become widespread in image-based applications, have recently been adapted for 1-D signals such as ECG. Mehari and Strodthoff [
23] compared SimCLR, BYOL, and SwAV-based representation learning approaches on 12-lead ECG signals and reported their classification performance after both linear evaluation and fine-tuning. In their work, SimCLR was the most successful method under linear evaluation, whereas BYOL-based representations provided higher overall performance in downstream tasks after fine-tuning. Nevertheless, studies in this area are typically conducted with a limited number of classes and low sample diversity, and detailed analyses such as ROC-AUC are often not presented.
This study proposes a novel framework that combines SimCLR-based self-supervised representation learning with segment-based data processing, class balancing with SMOTE, and an MLP classifier, working on five classes. Compared to the existing literature, the contributions are as follows:
SimCLR is applied to the MIT-BIH dataset in a five-class setting.
Balancing the data with SMOTE before contrastive learning is an original contribution.
An accuracy of 97.46% and AUC values of 0.997 or higher in all classes are achieved with only an MLP classifier.
The discriminative performance of the model is detailed visually and numerically with ROC, t-SNE, and confusion matrix analyses.
3. Materials and Methods
In this section, the steps followed in the ECG classification process are explained in detail. The structure of the MIT-BIH Arrhythmia dataset and the classes used in the study are introduced first. The subsequent steps are then explained: segment generation from signals, handling of class imbalance, self-supervised representation learning, and classification. The implementation of the SimCLR-based contrastive learning architecture is emphasized in particular, and the augmentation strategies are detailed. Classification is performed with a Multi-Layer Perceptron (MLP), and the metrics and analysis methods (confusion matrix, ROC-AUC, and t-SNE) used to evaluate the performance of the developed system are presented. The methodology flow summarizing the entire process from signal segmentation to final classification is presented in
Figure 1.
3.1. Dataset Description
The dataset used in this study, the MIT-BIH Arrhythmia Database, is an open access resource developed by the Massachusetts Institute of Technology and Beth Israel Hospital [
16]. As the most widely used dataset in the field of arrhythmia diagnosis, it consists of 48 ECG recordings, each lasting 30 min, collected from a total of 47 different patients. The recordings are sampled at 360 Hz [24] and are mostly obtained over two channels, MLII and V5. In the annotation files that come with each recording, the location and type of each beat are labeled in accordance with the Association for the Advancement of Medical Instrumentation (AAMI) standard. In the experimental study, five beat classes that are both clinically significant and frequently used in the literature were considered: normal sinus beat (N), left bundle branch block beat (L), right bundle branch block beat (R), ventricular premature contraction (V), and atrial premature contraction (A). The limitation to five categories was further justified by three additional reasons: (i) these classes conform to the AAMI EC57 guidelines, which aggregate MIT-BIH annotations into clinically pertinent categories commonly utilized in prior research; (ii) they encompass the most prevalent and clinically significant arrhythmias, guaranteeing sufficient sample sizes for effective training; (iii) the omission of infrequent beat types mitigates severe class imbalance and enhances generalization. Consequently, concentrating on these five classes ensures clinical significance and comparability with previous studies. During data processing, only the MLII channel was utilized, and segmentation was conducted exclusively over this channel [
25].
3.2. Beat Segmentation
The signal data of each ECG record were processed in .csv format, with the annotations in .txt files. Since the MLII lead reflects the electrical activity of the heart most clearly at the ventricular and atrial levels, only the MLII channel was used in this study. This channel has also been preferred as the main signal source in many previous studies [25]. The location of each beat is provided by the “Sample #” column in the annotation files, and fixed-length segments are formed by taking these locations as references. For each beat, a segment of 300 samples (approximately 0.83 s) is formed around the beat center. The window is symmetric, consisting of 150 samples before and 150 samples after the beat. This commonly used structure is intended to capture the full morphology of an ECG beat (P wave, QRS complex, and T wave) [
26]. By centering 300-sample segments on the R-peak, a symmetric context is imposed, encouraging approximate translation invariance to minor temporal shifts.
Beats close to the signal boundaries, for which the window would extend beyond the recording, were excluded. The segments were converted to NumPy arrays to make them suitable for machine learning algorithms, which also simplified the subsequent augmentation and labeling steps. Methods such as sliding-window or fixed-interval segmentation are not positioned according to the beat center and may therefore be insufficient for capturing the morphology of rhythm disorders. This is one of the motivations for choosing beat-centered segmentation in our study.
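The segmentation rule described above can be illustrated with a minimal sketch. It uses plain Python lists rather than NumPy for brevity, and the function name `extract_beats`, its arguments, and the toy signal are hypothetical choices made for illustration; only the 150 + 150 sample window, the five kept labels, and the exclusion of boundary beats come from the text.

```python
from typing import List, Sequence, Tuple

HALF_WINDOW = 150                          # samples kept on each side of the beat
KEPT_LABELS = {"N", "L", "R", "V", "A"}    # the five beat classes used in the study


def extract_beats(
    signal: Sequence[float],
    beat_positions: Sequence[int],
    beat_labels: Sequence[str],
    half_window: int = HALF_WINDOW,
) -> Tuple[List[List[float]], List[str]]:
    """Cut one fixed-length window per annotated beat, centered on the beat."""
    segments: List[List[float]] = []
    labels: List[str] = []
    for pos, label in zip(beat_positions, beat_labels):
        if label not in KEPT_LABELS:
            continue  # beat type outside the five studied classes
        start, stop = pos - half_window, pos + half_window
        if start < 0 or stop > len(signal):
            continue  # window would cross a signal boundary, so the beat is skipped
        segments.append(list(signal[start:stop]))
        labels.append(label)
    return segments, labels


# Toy usage on a synthetic 1000-sample ramp (not real ECG data).
toy_signal = [float(i) for i in range(1000)]
segs, labs = extract_beats(toy_signal, [100, 400, 700, 950], ["N", "V", "F", "A"])
print(len(segs), len(segs[0]), labs)  # 1 300 ['V']
```

In the toy call, the beats at positions 100 and 950 are dropped because their windows leave the signal, and the beat labeled F is dropped because it is not one of the five classes.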
3.3. Class Imbalance Handling with SMOTE
Class imbalance in the dataset used negatively affects the performance of supervised or unsupervised learning methods. It causes the model to generalize poorly, especially in arrhythmia types (such as V and A) with few data [
27]. This imbalance is quite evident in the MIT-BIH dataset. In order to solve this problem, SMOTE (Synthetic Minority Over-sampling Technique) was applied in our study [
28]. Instead of directly copying the examples in the classes with a few data points, SMOTE produces new synthesized data. Thanks to the technique used, the risk of overfitting is reduced and the representativeness of the data is increased. SMOTE is frequently preferred, especially in time series data and ECG classification tasks. Bing et al. [
28] achieved an accuracy exceeding 99% in MIT-BIH data using the combination of SMOTE and focal loss in a study conducted in 2022. In another study by Khan et al. focusing on ECG, it was reported that 98.6% accuracy was achieved after the imbalance resolution with SMOTE [
29]. Based on the studies, it has been stated that SMOTE is a suitable method both to increase the quality of representation in deep learning processes and to ensure that the model learns in a more balanced fashion in classes containing a small number of data [
27,
28,
29,
30]. Complex methods such as GAN or VAE were not preferred due to their computational and application difficulties. In our study, approximately 71,700 segments were obtained for each class after oversampling, as shown in Table 1, which yielded a balanced dataset before SimCLR training.
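The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration and not the implementation used in the study: the function name `smote_oversample`, the neighbour count `k`, the seed, and the toy beats are all assumptions, and a production pipeline would more likely rely on an existing library.

```python
import random
from typing import List, Sequence


def smote_oversample(
    minority: Sequence[Sequence[float]],
    n_synthetic: int,
    k: int = 5,        # assumed neighbour count, not stated in the source
    seed: int = 0,
) -> List[List[float]]:
    """Create synthetic samples by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("SMOTE needs at least two minority samples")

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic: List[List[float]] = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base sample.
        others = [j for j in range(len(minority)) if j != i]
        neighbours = sorted(others, key=lambda j: sq_dist(base, minority[j]))[:k]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the line from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy usage: three 4-sample "beats" of a minority class, expanded by five samples.
minority_beats = [[0.0, 1.0, 0.0, -0.2], [0.1, 0.9, 0.1, -0.1], [0.0, 1.1, -0.1, -0.3]]
new_beats = smote_oversample(minority_beats, n_synthetic=5, k=2)
print(len(new_beats), len(new_beats[0]))  # 5 4
```

Because every synthetic sample lies on a line segment between two real minority samples, new beats stay inside the range spanned by the class rather than being exact copies.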
3.4. SimCLR-Based Representation Learning
In this study, the SimCLR (Simple Framework for Contrastive Learning of Visual Representations) method was used to transform the segments into meaningful and discriminative representations. Among self-supervised learning methods, SimCLR has been shown to produce strong representations from unlabeled data, primarily through its contrastive loss function [
20]. Two differently augmented views of the same example were treated as a positive pair, whereas all other examples in the batch were treated as negatives, thereby encouraging similar samples to be pulled together and dissimilar samples to be pushed apart in the representation space. The architecture comprised an encoder and a projection head. A one-dimensional, three-layer convolutional neural network (1-D CNN) was used as the encoder, with 64, 128, and 256 filters in the first, second, and third layers, respectively; ReLU activations and batch normalization were applied after each layer, and adaptive average pooling was used to obtain a fixed-length feature vector. The encoder output was passed through a two-layer MLP to obtain a 128-dimensional projection used for contrastive learning. The loss function was the Normalized Temperature-scaled Cross-Entropy (NT-Xent):
$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$$
Here, $z_i$ and $z_j$ are the projection vectors obtained from two different augmentations of the same sample, $\mathrm{sim}(z_i, z_j)$ is the cosine similarity between these two vectors, $\tau$ is the temperature hyperparameter, and $2N$ is the total number of augmented samples in the batch. This function is computed for each positive pair, and the batch loss is obtained by averaging the per-pair values. During the augmentation process, three distortion techniques (jittering, mild amplitude scaling, and Gaussian noise) were applied randomly and in combination to each signal. Through jitter and mild amplitude scaling, invariance to physiologically plausible transformations is promoted, while preserving morphology relevant for discrimination. Training used a batch size of 512 for 10 epochs, during which the contrastive loss decreased from 5.40 to 4.98, indicating that positive pairs were embedded closer together while negative pairs were pushed apart. Although supervised CNNs or autoencoder-based alternatives can be adopted, such approaches typically entail high labeling costs or may yield representations with weaker class separation. In contrast, SimCLR produces class-agnostic, interpretable, and balanced representations driven solely by augmentations and signal similarity. Moreover, balancing the dataset with SMOTE provided evenly represented classes, enabling balanced batches and more stable contrastive optimization prior to the downstream classifier.
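To make the NT-Xent loss concrete, the following minimal sketch evaluates it on small hand-made projections. It is written in plain Python for readability rather than speed; the function names and the temperature value `tau=0.5` are assumptions, since the temperature used in the study is not stated.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(
    view_a: List[Sequence[float]],
    view_b: List[Sequence[float]],
    tau: float = 0.5,  # hypothetical temperature
) -> float:
    """NT-Xent loss averaged over all 2N anchors of a batch of N positive pairs."""
    n = len(view_a)
    z = list(view_a) + list(view_b)       # z[i] and z[i + n] form a positive pair
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)             # index of the positive partner
        logits = [cosine(z[i], z[k]) / tau for k in range(2 * n) if k != i]
        positive = cosine(z[i], z[j]) / tau
        log_denominator = math.log(sum(math.exp(value) for value in logits))
        total += log_denominator - positive   # equals -log(exp(pos) / sum of exps)
    return total / (2 * n)


# Two perfectly aligned pairs that are orthogonal to each other give a low loss.
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Identical embeddings for every sample give the uninformative value log(2N - 1).
collapsed = nt_xent([[1.0, 0.0]] * 4, [[1.0, 0.0]] * 4)
print(aligned < collapsed)                 # True
print(round(math.log(2 * 512 - 1), 2))     # 6.93
```

If a batch size of 512 is read as N = 512 positive pairs, which is an assumption, embeddings that carry no information would give a loss of about ln(1023) ≈ 6.93. This offers a rough reference point for the reported loss values.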
3.5. MLP Classifier
After completing SimCLR training, fixed-size representation vectors were obtained for each segment from the encoder network (in eval() mode). These vectors are 256-dimensional embeddings containing the learned representations taken from the output of the encoder layer of SimCLR. These representations were used as input in the classification phase.
A model with a two-layer lightweight Multi-Layer Perceptron (MLP) architecture was selected as the classifier [
31]. The MLP model consists of an input layer that receives 256-dimensional vectors, a hidden layer with 128 neurons, and an output layer with SoftMax activation for five classes. The architecture of the model is shown in
Figure 2:
The classifier was trained on the representations extracted from the SimCLR encoder for the SMOTE-balanced data. Cross-entropy loss, which is suitable for multi-class problems, was used as the loss function, and optimization was performed with the Adam algorithm. Training ran for 10 epochs with a batch size of 256; the accuracy of the model started at 89.8% and reached 97.46% at the end of the 10th epoch. This shows that contrastive representations increase the discrimination power between classes and that high performance can be achieved with a simple classifier. The literature frequently uses more complex classifiers (e.g., LSTM, Transformer, or ensemble structures), but in this study, similar accuracy rates were achieved with a simple MLP architecture, which indicates that the method has both low computational cost and high interpretability.
The selection of a lightweight MLP classifier was intentional. The proposed approach effectively extracts highly discriminative and linearly separable representations via SimCLR, rendering a sophisticated downstream classifier like LSTM, CNN, or Transformer unnecessary. A straightforward two-layer MLP offers adequate non-linearity while maintaining a low parameter count (~0.45 M), rapid inference (0.8 ms/beat), and straightforward deployment on embedded platforms. This architecture mitigates the risk of overfitting and emphasizes that the efficacy of the proposed strategy predominantly resides in the quality of the learnt representations rather than the intricacy of the classifier.
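As a quick check on the size of the classification head, the number of trainable parameters of the 256-128-5 MLP can be counted directly. The sketch below assumes plain fully connected layers with biases and nothing else. It covers only the head, so the ~0.45 M figure quoted above presumably also includes the encoder and projection head, whose kernel sizes are not given in the text.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


hidden = dense_params(256, 128)   # 256-dimensional embedding -> 128 hidden units
output = dense_params(128, 5)     # 128 hidden units -> 5 beat classes
mlp_total = hidden + output
print(hidden, output, mlp_total)  # 32896 645 33541
```

Under these assumptions the head contributes roughly 34 thousand parameters, a small fraction of the reported total.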
4. Experimental Results
In this section, the performance of the proposed system, consisting of SimCLR-based representation learning and an MLP classifier, is evaluated in detail. The learning curves during the training process of the model are examined, and the performance is measured with metrics such as class-based accuracy, precision, recall, and F1-score. In addition, analyses such as confusion matrix, ROC-AUC curves, and t-SNE visualizations are performed to evaluate the discrimination and generalizability of the system. The results show that, thanks to its simple structure, the system is highly applicable and its classification performance is strong.
4.1. Training Performance
The proposed method has a two-stage training process: in the first stage, SimCLR-based self-supervised representation learning was performed, and in the second stage, the MLP classifier was trained in line with these representations.
The contrastive loss used during SimCLR training started from 5.40 in the 1st epoch and decreased to 4.97 at the end of the 10th epoch. This decrease shows that the model successfully learned to transform augmented positive pairs into close representations and negative pairs into distant representations. During the augmentations, the intention was to increase the generalizability of the model by using jittering, scaling, and Gaussian noise methods.
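The way two augmented views of one segment can be produced is sketched below. The source does not define the transforms precisely, so several choices here are assumptions: jittering is taken in its common time-series sense of a small additive perturbation, the separately listed Gaussian noise differs only by a larger standard deviation, and all magnitudes and the application probability `p_apply` are hypothetical values.

```python
import random
from typing import List, Sequence, Tuple


def add_gaussian(segment: Sequence[float], std: float, rng: random.Random) -> List[float]:
    """Add independent zero-mean Gaussian perturbations to every sample."""
    return [x + rng.gauss(0.0, std) for x in segment]


def scale_amplitude(segment: Sequence[float], max_change: float, rng: random.Random) -> List[float]:
    """Multiply the whole segment by one factor drawn near 1."""
    factor = 1.0 + rng.uniform(-max_change, max_change)
    return [x * factor for x in segment]


def augment(segment: Sequence[float], rng: random.Random, p_apply: float = 0.5) -> List[float]:
    """Apply each transform independently with probability p_apply (assumed value)."""
    view = list(segment)
    if rng.random() < p_apply:
        view = add_gaussian(view, std=0.01, rng=rng)          # jittering (assumed magnitude)
    if rng.random() < p_apply:
        view = scale_amplitude(view, max_change=0.1, rng=rng)  # mild amplitude scaling
    if rng.random() < p_apply:
        view = add_gaussian(view, std=0.05, rng=rng)          # Gaussian noise (assumed magnitude)
    return view


def positive_pair(segment: Sequence[float], rng: random.Random) -> Tuple[List[float], List[float]]:
    """Two independently augmented views of the same beat segment."""
    return augment(segment, rng), augment(segment, rng)


# Toy usage: a 300-sample segment with a single spike standing in for the R-peak.
rng = random.Random(0)
beat = [0.0] * 150 + [1.0] + [0.0] * 149
view_1, view_2 = positive_pair(beat, rng)
print(len(view_1), len(view_2))  # 300 300
```

Both views keep the 300-sample length and the position of the spike, so the morphology stays intact while amplitude and noise vary.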
After the representations were obtained, the MLP classifier was trained with these fixed-size feature vectors. The cross-entropy loss value used in MLP training was 355.26 in the first epoch and decreased steadily to 79.09 at the end of the 10th epoch. In parallel, the accuracy value started from 89.7% and increased at the end of each epoch, reaching 97.46% as of the 10th epoch.
No overfitting was observed during the training and validation processes. The accuracy and loss curves were balanced; validation performance developed in parallel with training performance. This reveals that the model can learn generalizable representations not only to training data but also to validation samples in the same distribution.
The balanced dataset obtained after SMOTE was expanded to include 71,700 samples for each class. This dataset was divided into 80% training and 20% test ratios to maintain class balance. Thus, there are 14,340 samples in the test set of each class, and the experimental results are calculated on this test set.
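The per-class test size follows directly from the split ratio, which a small stratified-split sketch can confirm. The function `stratified_split` and its seed are hypothetical, and the label list is a stand-in for the balanced dataset rather than the actual data.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split sample indices so every class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx: List[int] = []
    test_idx: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Stand-in labels: 71,700 samples for each of the five classes.
labels = [c for c in "NLRVA" for _ in range(71_700)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 286800 71700
```

Each class contributes 71,700 × 0.2 = 14,340 test samples, giving 71,700 test samples in total.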
In addition, as seen in
Figure 3, when the course of precision, recall, and F1-score values according to epochs was examined, a balanced improvement was observed. These distributions serve as evidence that the model improves consistently across classes and can learn without neglecting any class. In particular, the performance increase observed in minority classes reveals the positive contribution of balancing applied with SMOTE to representation learning.
All these findings show that representation learning was performed successfully with SimCLR, that even a simple MLP classifier trained on the resulting representations can classify with high accuracy, and that the model generalizes well overall.
4.2. Quantitative Evaluation
The performance of the model is evaluated using four basic metrics that are widely used for multi-class classification problems: precision, recall, F1-score, and overall accuracy. These metrics are calculated separately for each beat class (N, L, R, V, and A), and their macro and weighted averages are also included in the evaluation. In addition to accuracy, precision, recall, F1 score, and AUC, we also calculated Cohen’s Kappa and Matthews Correlation Coefficient (MCC), as shown in
Table 2. Kappa corrects for agreement expected by chance and provides a robust measure of inter-rater reliability, while MCC is a balanced correlation coefficient suitable for imbalanced datasets.
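The class-based metrics can all be derived from a confusion matrix, as the following minimal sketch shows. The three-class matrix and class names are hypothetical toy values chosen for easy arithmetic and are not results from this study.

```python
from typing import Dict, List, Sequence


def per_class_metrics(
    confusion: Sequence[Sequence[int]], class_names: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Precision, recall, and F1 per class; confusion[i][j] counts true i predicted j."""
    n = len(class_names)
    metrics: Dict[str, Dict[str, float]] = {}
    for c, name in enumerate(class_names):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                           # missed samples of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp      # other classes predicted as c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[name] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Hypothetical toy matrix with 100 samples per true class.
toy_confusion = [[90, 5, 5], [10, 85, 5], [0, 10, 90]]
toy_metrics = per_class_metrics(toy_confusion, ["N", "V", "A"])
macro_f1 = sum(m["f1"] for m in toy_metrics.values()) / len(toy_metrics)
print({name: round(m["f1"], 3) for name, m in toy_metrics.items()})
```

The macro average weights every class equally, which is why it coincides with the weighted average when all classes have the same number of samples.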
According to the evaluation results, the overall accuracy of the model is 97.2%, and the F1-score varies between 94% and 99.6% across the five classes. Precision and recall values are above 99% in the left bundle branch block (L) and ventricular premature contraction (V) classes, and between 96% and 98% in the normal beat (N) and right bundle branch block (R) classes. The atrial premature contraction (A) class, a minority class in the original data, exhibits a lower F1-score (94.1%) than the others; however, this level is considered adequate given the class’s morphological diversity and relatively small original sample size. Since the data are balanced, the weighted average, macro average, and accuracy coincide; therefore, only the accuracy value is reported. Performance was observed to remain stable under small temporal shifts, whereas classes with asymmetric morphology (e.g., A and V) were characterized by departures from the expected symmetries that aided separability. When the confusion matrix is examined, most of the misclassified A-class samples are assigned to the N (normal) and R classes, as shown in
Figure 4. This finding is also clinically plausible, since some atrial premature beats may display morphology close to sinus beats. In addition to these results, the proposed method achieved a Cohen’s Kappa score of 0.965 and an MCC of 0.965, confirming strong agreement beyond chance and showing balanced predictive performance across all classes.
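Both agreement measures can be computed from a confusion matrix, as sketched below with the standard multi-class definitions. The matrix is hypothetical: it was constructed to have 97.2% accuracy with equal row and column totals and is not the matrix of Figure 4.

```python
import math
from typing import Sequence, Tuple


def kappa_and_mcc(confusion: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Cohen's Kappa and multi-class MCC; confusion[i][j] counts true i predicted j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    true_counts = [sum(confusion[k]) for k in range(n)]
    pred_counts = [sum(confusion[r][k] for r in range(n)) for k in range(n)]
    chance = sum(t * p for t, p in zip(true_counts, pred_counts))
    numerator = correct * total - chance
    # Kappa = (p_o - p_e) / (1 - p_e), written with raw counts.
    kappa = numerator / (total * total - chance)
    mcc_den = math.sqrt(
        (total * total - sum(p * p for p in pred_counts))
        * (total * total - sum(t * t for t in true_counts))
    )
    mcc = numerator / mcc_den if mcc_den else 0.0
    return kappa, mcc


# Hypothetical balanced matrix: 972 correct and 7 errors per wrong class in each row.
balanced = [[972 if r == c else 7 for c in range(5)] for r in range(5)]
kappa, mcc = kappa_and_mcc(balanced)
print(round(kappa, 3), round(mcc, 3))  # 0.965 0.965
```

With five equally sized classes and equally distributed predictions, the chance agreement is 0.2, so Kappa becomes (0.972 − 0.2) / 0.8 = 0.965. When row and column totals coincide, Kappa and MCC reduce to the same expression, which is consistent with the two identical values reported above.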
The generalizability of the model across classes is also confirmed in the scatter plots of precision–recall and F1-score–recall relationships. These plots, presented in
Figure 5, show that learning is not unevenly distributed across classes and that SimCLR representations are discriminative and reliable.
Additionally, the distribution of features obtained from the SimCLR encoder in the two-dimensional plane was examined using the t-SNE visualization given in
Figure 6. Each class was observed to form a distinct cluster in the plot, indicating that sufficient separation was achieved in the representation space.
Finally, the ROC curves in Figure 7 were used to evaluate the discriminative ability of the model for each class. The AUC scores were 1.000 for classes L, R, and V; 0.998 for N; and 0.997 for A. These values show that the model can discriminate between classes with high sensitivity and support clinically reliable classification.
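A per-class AUC of this kind is typically obtained in a one-vs-rest fashion, which the following minimal sketch illustrates. It uses the rank interpretation of AUC (the probability that a random positive is scored above a random negative) with a simple quadratic loop; the function names and toy scores are hypothetical.

```python
from typing import Dict, List, Sequence


def roc_auc(scores: Sequence[float], is_positive: Sequence[bool]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("both positive and negative samples are required")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(
    probabilities: Sequence[Sequence[float]],
    labels: Sequence[str],
    class_names: Sequence[str],
) -> Dict[str, float]:
    """AUC of each class against all others, using that class's predicted probability."""
    return {
        name: roc_auc([row[c] for row in probabilities], [y == name for y in labels])
        for c, name in enumerate(class_names)
    }


# Toy usage with two classes and four samples (hypothetical scores).
toy_probs: List[List[float]] = [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4], [0.2, 0.8]]
toy_labels = ["N", "N", "V", "V"]
print(one_vs_rest_auc(toy_probs, toy_labels, ["N", "V"]))  # {'N': 0.75, 'V': 0.75}
```

In the toy case, three of the four positive/negative pairs are ordered correctly for each class, which gives an AUC of 0.75.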
4.3. Comparison with Previous Studies
This section compares the performance of the proposed SimCLR-based representation learning and MLP classifier architecture with other previous work utilizing the MIT-BIH Arrhythmia dataset. The majority of the literature focuses on supervised deep learning models, including CNN, LSTM, or Transformer, while self-supervised learning approaches are rather rare. In a study including hybrid CNN–LSTM models that present a heterogeneous framework compared to conventional supervised models, Sun et al. [
32] achieved an accuracy of 98.5% on the MIT-BIH dataset using the CNN-LSTM-SE architecture, alongside precision exceeding 97%, recall surpassing 98%, and an F1-score for each class. A separate study attained an accuracy of 99.58% using a hybrid CNN–Transformer technique; nonetheless, the model’s intricate and computationally intensive architecture was highlighted [
33]. Conversely, Alamatsaz et al. [
34] indicate that lightweight models, such as 1D CNN-LSTM, can achieve accuracy levels between 98% and 99%.
Research on self-supervised learning is also proliferating. Chen et al. evaluated the SimCLR, BYOL, and CLOCS methodologies on multi-channel ECG utilizing the “Temporal-Spatial Self-Supervised Learning” approach, achieving elevated ROC-AUC values with the SimCLR-based variation [
35].
The efficacy of the most prominent strategies from the literature and the suggested framework is encapsulated in
Table 3.
Table 3 demonstrates that Transformer-based models [
15,
33] attain the highest accuracy levels (~99–99.5%); nonetheless, they rely on considerably deep and computationally demanding architectures. CNN–LSTM hybrids, exemplified by [
32,
34], achieve above 98% accuracy; nonetheless, their complexity surpasses that of lightweight approaches. Patient-specific or residual CNN techniques [
26,
28,
29,
30] provide competitive outcomes (~95–98%) with varying trade-offs. The proposed SimCLR + MLP framework achieves an accuracy of 97.2%, an F1 score of 0.971, and an AUC of 0.983, employing around 0.45 million parameters. This balance between accuracy and efficiency highlights the competitiveness of the proposed methodology, particularly for real-time and embedded healthcare applications.
4.4. Computational Complexity Analysis
The computational efficiency of the suggested method was evaluated against leading ECG classification techniques.
Table 3 encapsulates the findings regarding the dataset, number of classes, reported accuracy, supplementary metrics, and computational attributes.
Table 4 demonstrates that Transformer-based approaches attain the highest accuracy (~99.5%), but with intricate designs comprising millions of parameters and extended training durations. Likewise, CNN-LSTM-SE (Sun et al., 2024) [
32] achieved 98.5% accuracy; however, it incorporates multiple convolutional and recurrent layers alongside channel attention, which increases implementation complexity. The proposed SimCLR + MLP framework achieves competitive performance (97.2% accuracy, F1 = 0.971, AUC = 0.983) while requiring only about 0.45 million parameters, with a training duration of around 22 min for 10 epochs on a single GPU and an inference latency of 0.8 milliseconds per beat. This efficiency–performance balance indicates that the suggested method is lightweight, scalable, and appropriate for real-time and embedded healthcare applications, rendering its implementation valuable despite somewhat reduced accuracy relative to more complex options. From a computational perspective, the total model size is approximately 1.8 MB, which is sufficiently small for deployment on embedded and mobile devices. These results confirm that the proposed approach not only maintains competitive accuracy but also fulfills the requirement for low computational complexity, making it feasible for real-time monitoring scenarios.
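The reported model size and latency can be related to each other with simple arithmetic. The sketch below assumes that weights are stored as 32-bit floating-point numbers, which is not stated in the text; the parameter count and latency are the figures quoted above.

```python
PARAMS = 0.45e6          # parameter count reported in the text
BYTES_PER_PARAM = 4      # assumption: 32-bit floating-point weights
LATENCY_MS = 0.8         # reported inference latency per beat

size_mb = PARAMS * BYTES_PER_PARAM / 1e6
beats_per_second = 1000.0 / LATENCY_MS
print(f"{size_mb:.1f} MB, {beats_per_second:.0f} beats/s")  # 1.8 MB, 1250 beats/s
```

Under this assumption the parameter count matches the stated 1.8 MB model size, and the latency corresponds to roughly 1250 beats per second, far above any physiological heart rate.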
5. Discussion and Limitations
This study has shown that the self-supervised learning approach can be used effectively in the classification of beat segments obtained from ECG signals. With SimCLR-based representation learning, features with high discrimination were obtained without the need for class labels, and these representations were used very successfully with a simple MLP classifier.
Traditionally, ECG classification studies have been performed with high-parameter models such as CNN, LSTM, and Transformer. Although these models produce strong results, they have limitations in practical applications due to both the need for labeled data and computational costs. On the other hand, since the method used in this study works with unlabeled pre-training, it is more suitable, especially for situations where data labeling is difficult or costly.
The quality of the representations obtained with SimCLR has been demonstrated with both class-based metrics (F1-scores between 94 and 99%) and visualization techniques such as t-SNE. In addition, the per-class AUC values of 0.997 and above in the ROC-AUC analyses show that the model discriminates very well between classes.
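For reference, class-based F1-scores of the kind reported here are derived from the confusion matrix. The sketch below shows the computation on a toy two-class matrix; the counts are hypothetical and are not results from this study.

```python
def per_class_f1(confusion):
    """Per-class F1 from a square confusion matrix.

    confusion[t][p] is the number of beats with true class t
    that were predicted as class p.
    """
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[t][c] for t in range(n)) - tp   # predicted c, true class differs
        fn = sum(confusion[c]) - tp                        # true c, predicted otherwise
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Hypothetical counts, for illustration only
toy_confusion = [
    [8, 2],
    [1, 9],
]
for label, score in enumerate(per_class_f1(toy_confusion)):
    print(f"class {label}: F1 = {score:.3f}")
```

Reporting F1 per class, and not only overall accuracy, makes weak performance on minority classes visible even when the majority class dominates the dataset.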
In particular, addressing the data imbalance problem with SMOTE before contrastive learning is a strategy that is rarely seen in the literature, yet it yielded effective results in this study. SMOTE helped SimCLR learn from more balanced positive and negative pairs, which made it possible to achieve very successful classification results even for minority classes.
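The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The following pure-Python sketch illustrates only this interpolation step; it is a simplified stand-in, not the implementation used in this study, and the neighbour count `k`, the seed, and the toy data are assumptions.

```python
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic samples from minority-class vectors.

    Simplified SMOTE-style interpolation: pick a minority sample,
    pick one of its k nearest minority neighbours, and draw a point
    on the line segment between them. Requires at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # candidate neighbours: all other minority samples, nearest first
        others = [x for x in minority if x is not base]
        others.sort(key=lambda x: sum((a - b) ** 2 for a, b in zip(base, x)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()           # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy 2-D "beats" (hypothetical values)
minority_beats = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
new_beats = smote_oversample(minority_beats, n_new=5)
print(len(new_beats))
```

Because every synthetic sample lies on a segment between two real minority samples, the minority class becomes denser without exact duplicates, so contrastive batches contain minority beats more often.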
From this perspective, this study presents a strong architecture not only in terms of providing high accuracy but also in terms of simplicity, explainability, and reproducibility. The low computational requirement and label independence of the model indicate that this method can be integrated into real-time healthcare systems or form a basis for mobile healthcare applications. From a symmetry perspective, robustness to benign transformations was maintained, while clinically meaningful symmetry breaking was leveraged as an informative signal. A dedicated ablation in which the beat window is off-centered and augmentation transforms are removed is expected to quantify these effects more explicitly.
However, the proposed method has some limitations. The SimCLR framework used in the study is quite sensitive to the choice of augmentation techniques, and the effects of different augmentation combinations on the learned representations have not been systematically investigated. In addition, the model has only been evaluated on the MIT-BIH dataset, and its generalizability to other datasets has not yet been tested. Future studies can address issues such as optimization of augmentation strategies, comparison of different self-supervised learning approaches (e.g., MoCo, BYOL), and applicability to different patient groups through transfer learning.
6. Conclusions
This study proposes a simple but effective approach to the classification of beat segments obtained from electrocardiography (ECG) signals using self-supervised learning-based representation learning. Strong features were learned without any label information using SimCLR, and these representations were evaluated with only a two-layer MLP classifier.
The data imbalance problem was addressed with the SMOTE method; thus, the representation of minority classes was strengthened, and balanced learning was achieved for all classes. The overall accuracy of the model was calculated as 97.46%, class-based F1-score values were in the range of 94–99%, and ROC-AUC scores were 0.997 and above for all classes. These results show that SimCLR-based representation learning and a simple MLP structure can work effectively together. Overall, ECG-appropriate symmetries were made explicit, and invariance was promoted in the learned representations, while symmetry-breaking patterns were exploited to enhance class separability with minimal computational cost.
Compared to CNN, LSTM, and Transformer-based models common in the literature, the proposed method offers advantages such as lower computational cost, label independence, and high interpretability. In this respect, the proposed system offers an infrastructure that can be integrated into real-time or mobile health applications. In future studies, we plan to diversify augmentation strategies; compare different self-supervised structures; and perform multi-center, patient-dependent/independent tests. Thus, the clinical validity of the method can be evaluated on a larger scale.