A New Deep Learning Method with Self-Supervised Learning for Delineation of the Electrocardiogram

Heartbeat characteristic points are the main features of an electrocardiogram (ECG), which can provide important information for ECG-based cardiac diagnosis. In this manuscript, we propose a self-supervised deep learning framework with modified Densenet to detect ECG characteristic points, including the onset, peak and termination points of P-wave, QRS complex wave and T-wave. We extracted high-level features of ECG heartbeats from the QT Database (QTDB) and two other larger datasets, MIT-BIH Arrhythmia Database (MITDB) and MIT-BIH Normal Sinus Rhythm Database (NSRDB) with no human-annotated labels as pre-training. By applying different transformations to ECG signals, the task of discriminating signals before and after transformation was defined as the pretext task. Subsequently, the convolutional layer was frozen and the weights of the self-supervised network were transferred to the downstream task of characteristic point localizations on heart beats in the QT dataset. Finally, the mean ± standard deviation of the detection errors of our proposed self-supervised learning method in QTDB for detecting the onset, peak, and termination points of P-waves, the onset and termination points of QRS waves, and the peak and termination points of T-waves were −0.24 ± 10.04, −0.48 ± 11.69, −0.28 ± 10.19, −3.72 ± 8.18, −4.12 ± 13.54, −0.68 ± 20.42, and 1.34 ± 21.04. The results show that the deep learning network based on the self-supervised framework constructed in this manuscript can accurately detect the feature points of a heartbeat, laying the foundation for automatic extraction of key information related to ECG-based diagnosis.


Introduction
Electrocardiogram (ECG) is an important tool in the diagnosis of cardiovascular diseases. With the widespread use of various ECG detection devices in the clinic, a large amount of ECG data is generated. Combining clinical ECG with computer technology to accomplish automatic identification of arrhythmia types can effectively diagnose heart diseases and reduce mortality [1]. ECG reflects the course of electrical activity of heart excitement. Each heartbeat contains P waves, QRS complex waves and T waves. The ECG physician draws diagnostic conclusions by observing the width, amplitude, morphology, and interrelationship of each wave and line segment of the ECG. This method of analysis is based on identifying characteristic points such as the onset, peak and termination of each wave of the ECG signal [2]. However, the variety of arrhythmia types and the richness of ECG morphology in different patients often make the clinical assessment workload of the professional enormous and sometimes physicians has a tendency to be subjective. In recent years, as deep-learning-based neural network models have achieved great success in a variety of fields such as natural language processing, computer vision, and biomedical signal processing. Deep learning research works for ECG data have gradually become popular, and these works have achieved comparable or even better classification performance than traditional methods. The main advantage of deep learning methods over machine learning is that there is no need for manual feature extraction, and deep learning models are able to perform feature extraction automatically and implicitly based on large raw data sets.
However, this automatic analysis method based on deep learning usually gives discriminative results directly after inputting ECG signals. Despite the high discriminative accuracy of some studies, it is not in line with the idea of evidence-based medicine due to the lack of interpretability. For this reason, our experiments are based on enhancing the interpretability of deep learning to process ECG signal operations, aiming to identify characteristic points such as the onset and termination points of each wave, and the peak of the ECG signal.
Traditional methods for detecting ECG characteristic points include wavelet transform, hidden Markov model, adaptive filtering, etc. Li [3] detected the position of QRS waves based on wavelet transform by calculating the relationship between their wavelet modulus maximum pairs. Juan et al. [4] used wavelet transformation to detect the onset and termination points of complex waves of heartbeats, the peaks of single waves, and the onset and termination points of P and T waves one by one. Martinez [5] proposed an algorithm based on phase volume transformation to calculate the mean ± standard deviation of P-wave onset, peak and termination points, QRS-wave onset and termination points, Twave peak and termination points on a public ECG database. Furthermore, the results were 2.6 ± 14.5 ms, 32 ± 25.7 ms, 0.7 ± 14.7 ms, −0.2 ± 7.2 ms, 2.5 ± 8.9 ms, 5.3 ± 12.9 ms, 5.8 ± 22.7 ms. Clavier et al. [6] used the Hidden Markov method to represent a beat of the P-wave. Then, a set of parameters was calculated from the P-wave to detect patients prone to atrial fibrillation(AF) with a sensitivity of approximately 65% to 70%. Traditional method may perform poorly on large datasets, and deep learning is a branch of machine learning that is often used for detection of medium or large datasets. Camps et al. [7] used a CNN-based approach to localize QRS onset and termination points with root mean square error (RMSE) of 12.1 ± 0.5 ms and 18.5 ± 1.1 ms, respectively. Guillermo et al. [8] used a U-Net network structure for localization of P-wave, QRS-wave and T-wave onset and termination points. By segmenting a heartbeat into different regions and then performing characteristic point localization, the mean ± standard deviation of detection errors were 1.54 ± 22.89 ms, 0.32 ± 4.01 ms, −0.07 ± 8.37 ms, 3.64 ± 12.55 ms, 21.57 ± 66.29 ms, and 4.55 ± 31.11 ms, respectively. Abrishami et al. [9] used LSTM-based segmentation method for P-wave, QRS-wave, T-wave and other waves with an accuracy of more than 90%.
However, fully supervised learning of deep learning requires a large amount of manually labeled data, which is very time-consuming and labor-intensive. At the same time, training the network from scratch requires large computational resources. Self-supervised learning focuses on mining its own supervised information from large-scale unsupervised data using auxiliary tasks, and training the network with this constructed supervised information so that it can learn representations that are valuable for downstream tasks. Such representations are usually generic [10], which not only improves the performance of the network, but also allows pre-training and preserving the model parameters to reduce the computation time. To this end, we first conducted heartbeat segmentation on QT Database (QTDB), MIT-BIH Arrhythmia Database (MITDB) and MIT-BIH Normal Sinus Rhythm Database (NSRDB). After preprocessing, three signal transformations were performed on the heartbeat signal, and the unlabeled original heartbeat and the three transformed heartbeats were fed into the base network together for self-supervised learning. The base network is optimized by Densenet [11] as the backbone, introducing attention module and feature pyramid module. Subsequently, the convolutional layer, which is contributed for high-level feature learning of ECG, was frozen and transferred to the downstream task network and further learned the labeled signals to complete the training of the localization network of ECG characteristic points.

Base Model
Our base network used a modified Densenet network to fully extract the features of ECG heartbeats using hopping layer connection structure. We incorporated convolutional block attention modules (CBAM) to discriminate and detect more subtle position features by combining spatial attention and channel attention mechanisms together. We also conducted feature scaling at different scales by introducing feature pyramid pooling (FPP) structure to improve detection accuracy.

Self-Supervised Learning Structure
Our proposed self-supervised learning framework consisted of two learning phases.
Step 1: Pretext task stage. Referring to [12], different signal transformations were applied to segmented heartbeats, respectively. Then we conducted the recognition task, which was called pretext task, discriminating the difference between the original and transformed signals. During this process, our model could fully learn the advanced features of ECG heartbeat signal. Then we froze the convolutional layer and saved the model parameters.
Step2: Downstream task stage. The model parameters saved in the first stage were used to initialize the finetuned network. Then the data and labels from the QT dataset were fed into the finetuned network for learning and fine-tuning of the fully connected layer to finally complete the task of characteristic point localization. Figure 1 illustrates our network framework for self-supervised learning.

Data for Pretext Task
The MIT-BIH Arrhythmia Database(MITDB) contains 48 records of half-hour ambulatory ECG signals from 47 subjects, with each record annotated by two or more cardiologists for beat type. The recordings were digitized at 360 samples per second [13].
The MIT-BIH Normal Sinus Rhythm Database(NSRDB) includes 18 long-term ECG recordings of subjects. Subjects included in this database were found to have had no significant arrhythmias. The recordings were digitized at 128 samples per second [13].
There is no annotation for the position of characteristic points in both above-mentioned datasets. In our study, MITDB and NSRDB are suitable for self-supervised learning in the pretext task due to the presence of a large number of beats of the same type (type 'N') in it as in QT Database(QTDB) which has similar amplitude and waveform characteristics.
Since the pretext task did not require characteristic point labels, we only needed to cut out all heartbeats from QTDB in this stage. So did the heartbeats of type 'N' from MITDB and NSRDB. Then we resampled all of them to ensure a fixed signal length of 300. Referring to the segmentation approach of [14], we used the peak position of QRS wave as the reference point for heartbeat cutting. Finally, we obtained 79,643 heartbeats of type 'N' from MITDB and 309,384 heartbeats of type 'N' from NSRDB.
We performed three signal transformations including noise addition, scaling, and temporal inversion on the original heartbeats for data preparation of pretext task. After the transformations, four signals including the original signal are stacked to construct the input matrix X. We set the original signal as category 0, and the remaining three transformed signals correspond to categories 1∼3 in turn as labels Y for the multi-classification task. These category tags are so called 'pseudo-labels' for the whole self-supervised learning task. These labels are generated automatically in the order of transformations applied on the origin signal, without providing any manual annotations. Referring to [15] in which similar way was used to identify human activities, we fed tuples of inputs and pseudo-labels (Xi, Yi) into the pretext network, where i denotes the ith transformation on the signal. By optimizing the loss function of this classification task, we can improve the ability of our selfsupervised network to discriminate these four signals, during which the self-supervised network can fully learn the features of the original signals. Figure 2 shows signals before and after transformation and their automatically generated labels.

Data for Downstream Task
Our study used QTDB for downstream tasks. QTDB consists of 105 2-lead recordings sampled at 250 Hz. It includes normal sinus rhythm, ischaemic and non-ischaemic ST segments, slow ST-segment drift, transient ST-segment depression and sudden cardiac death. At least 30 heartbeats in each record (3623 in total) were manually annotated by two experts. The specific annotations are shown in Figure 3, where "P" and "T" represent the peak of P and T waves, respectively, and "N" represents the normal heartbeat. The symbols "(" and ")" represent the onset and termination points of each wave, respectively, with the onset point of the T wave is not annotated. In our study, a large number of beats of the same type (type 'N') as in QTDB with similar amplitude and waveform characteristics are suitable for the self-supervised learning in the previous task due to the presence of a large number of beats of the same type (type 'N') in the MIT-BIH database. In our study, due to the presence of a large number of beats of the same type (type 'N') with QTDB in MITDB and NSRDB, which have similar amplitude and waveform characteristics, it's suitable of the two datasets to be used for the self-supervised learning in the previous task.
We manually eliminated the records that did not contain all 8 characteristic points from the 105 records in QTDB, and ended up with 97 records. The above signals were passed through discrete wavelet transform (DWT) to reduce the noise of ECG signals. Referring to [16], we chose the sixth-order Daubechies wavelet function as the mother wavelet to decompose the ECG signal and to perform reconstruction. After denoising we conducted the segmentation of the heartbeats. We took 100 sampling points before the QRS peak points and 200 sampling points after the QRS peak points annotated by the experts, respectively. The time span of each heartbeat is 1.2 s (the sampling rate of QTDB is 250 Hz). The cut heartbeats were then upsampled to 325 sampling points to generate X, a 1D matrix of 1 × 325. At the same time, we processed the expert's annotations into Y, a 1D matrix of 1 × 8. The tuples (X,Y) were then fed into the model as pairs for training.

Methodology
Our model consists of three main components: dense hop layer connection, convolutional block attention module (CBAM), and feature pyramid pooling module (FPP).

Dense Hop-Layer Connection
Deeper deep learning networks imply better nonlinear representation and can fit more complex feature inputs. However, as the network deepens, it can lead to instability of the gradients along with gradient disappearance or gradient explosion. This situation leads to the degradation of the network performance, which is called degradation problem. It was ResNet that first introduced skip connection to solve the degradation problem. ResNet transmits the information from the initial layer to deeper layers by matrix addition. The main difference between DenseNet and ResNet is that DenseNet concats the output feature map of a layer with the next layer instead of summing up, which is so called feature reusability. The formula of Dense hop-layer connection is shown in Equation (1).
which denotes the output X(t) of a t−layer network. X i is the output of the ith layer of the feed forward neural network; [X 0 , X 1 , X 2 , . . . , X (t−1) ] denotes the stitching of the output feature maps from layer 0 to layer t − 1, and H(·) denotes the non-linear transformation, including a combination of BN, ReLu, and convolutional layers.
Since the 1 × 1 convolutional kernel of each dense layer can reduce the number of input feature maps, DenseNet can learn the feature maps with fewer parameters than ResNet.

Convolutional Block Attention Module
In this subsection, we will introduce the CBAM, which is an attention mechanism module combining spatial and channel attention. It was proposed by Woo et al. The specific structure of CBAM is shown in Figure 4. The emphasis of the channel attention module is focusing on what is the key information of input features. Our study aimed to focus on the onset, peak and termination points of different wavelets in heartbeat signal. In this module, input features were first passed through average pooling layer and maximum pooling layer before being fed together into a multilayer perceptron (MLP) network. The output of MLP was then passed through the sigmoid function to obtain the final weights M c of the channel attention. The expression of M c is shown in Equation (2).
where X denotes the feature map fed to the attention module, σ represents the sigmoid function. W 0 ∈ R C/r×C ,W 1 ∈ R C×C/r ; X C avg and X C max denote the channel context descriptors generated by average pooling and maximum pooling, respectively. C is the number of channels.
The spatial attention module, on the other hand, focuses on the location of key information. Our study aimed to focus on the temporal characteristics of ECG wavelets, which means that the order of characteristic points of different wavelets could be correctly discriminated, suppressing possible errors due to the close position of adjacent points. In this module, input features were sequentially passed through average pooling layer and maximum pooling layer, and then the results were concatenated together as the input to the convolution layer. The expression of the weights M s of spatial attention is: where X denotes the feature map fed to the attention module, σ represents the sigmoid function, X S avg and X S max denote the spatial context descriptors generated by average pooling and maximum pooling, respectively. 3 × 3 denotes the size of the convolution kernel.

Feature Pyramid Pooling Module
After conducting multiple convolutional and pooling operations on the input ECG signal, we introduced a feature pyramid module for heartbeat signal in order to extract more information from feature map at multiple scales, as shown in Figure 5. We downsampled the feature maps generated by the intermediate layer twice and passed them uniformly through a fully connected layer (1 × 256) with the same output dimension. After concatenating the results of the multiscale deflation with the original features and sending it to the fully connected layer, we accomplished the prediction of the characteristic point detection results. Finally, we constructed the detection network as shown in Figure 6. After the heartbeats were preprocessed, they were fed into the repetitive stacked convolution layer and Dense Block module with CBAM module introduced into the middle of each repetitive unit. After the features were fully extracted by the hopping layer structure and attention mechanism, the generated intermediate features were fed into the FPP module for multiscale feature extraction, and then the final fully connected layer predicted the positions of the eight characteristic points. Table 1 details the structure and relative parameters of our detection network.  We conducted a five-fold cross-validation on MITDB and NSRDB separately when training the datasets of pretext task. First, we divided the dataset into five subsets of data with equal amount of data. In each iteration, one of these subsets was sequentially selected as the validation set during training, and our pretext task model was trained on the remaining four subsets. The final accuracy of our model was obtained by averaging the accuracy of each iteration. The specific division method of the five-fold cross-validation is shown in Figure 7. We used cross-entropy as the loss function and chose Adam optimizer for the training of our pretext task model.

Downstream Task
We set the ratio of training set and testing set for downstream task to be 8:2 and used the 5-fold cross-validation method to train the proposed model. Referring to [15], we used He initialization technique to initialize the network parameters. We used stochastic gradient descent with a fixed learning rate of 0.001 and a momentum parameter of 0.9.
We used L1 loss (i.e., mean absolute error (MAE)), calculating the mean deviation between the predicted position of each characteristic point and experts' annotations as the loss function for the downstream task. Since the experts had 2 ECG leads available during annotation and the input of our network is one-lead, both leads were processed independently. The one with the smaller mean deviation was selected when processing the results. For each ECG recording, we used the mean deviation and standard deviation (m ± sd) as measure for all samples. We also used early stopping method to prevent over-training of our proposed model.

Results
The results of our study are shown in Table 2. Referring to [17], since the peak position of QRS complex wave was the segmentation criterion of the input heartbeats in our study, the QRS peak points were not taken into account in the result statistics. Figure 8 shows the regression results of the fully supervised network and the self-supervised network for different morphologies of some heartbeat characteristic points. Figure 8a,c both show the predicted positions of the fully supervised network proposed in our previous study [18], and Figure 8b,d corresponds to the positions predicted by our self-supervised network for the same heartbeat.

QRS-Off m ± s (ms) T-Peak m ± s (ms) T-Off m ± s (ms)
Fully Supervised [ We can see from the comparison that the self-supervised network predicts the position of characteristic points more accurately than the fully supervised network. The mean value of the deviation of the position prediction of characteristic points does not exceed two samples and the standard deviation does not exceed six samples for both the fully supervised network and the self-supervised network.

Comparative Analysis of Fully Supervised Network and Self-Supervised network
For a more explicit comparison, we calculated the absolute deviations of the models fully supervised trained and self-supervised trained based on the QT dataset, as shown in Table 3. It can be seen that the self-supervised model has better performance compared to the fully supervised model when trained on the same dataset. We used regression analysis and Bland-Altman plots to visualize the detection results of the two models. The regression plots allow to observe the approximation of the predicted positions and the annotated positions with the help of trend lines. When the trend line fits closely to the straight-line y = x, it means the model performs well in regression detection. Figures 9 and 10 represent the regression results of the positions of the seven characteristic points predicted by the fully supervised model and the self-supervised model, respectively, where the x-axis represents the positions of the annotated characteristic points and the y-axis represents the predicted positions of the model. We can see that the self-supervised model fits more closely to the straight line y = x for each characteristic point, while the fully supervised model fits significantly worse at the whole P-wave and T-wave peak points. Bland-Altman plots were used to evaluate the correlations between the measured values. With Bland-Altman plots, we can see the 95% limit of agreement (LOA) of the data, of which smaller value indicates better performance of the model [19]. We used the positions annotated by the experts as the criterion comparing with those of the fully supervised and self-supervised models, respectively. Figures 11 and 12 represent the Bland-Altman plots of the positions of the seven characteristic points predicted by the fully supervised network and the selfsupervised network, respectively, where the x-axis represents the mean sequence order of predicted positions and annotated positions, and the y-axis represents the difference of predicted positions and annotated positions. It can be seen that the confidence intervals of the Bland-Altman plots of the self-supervised model are smaller, except for the less obvious result of the termination point of the P-wave, which means that the standard deviation of the predicted positions of the self-supervised model is smaller. Table 3. Comparison of models on mean absolute deviation.

T-Off m ± s (ms)
Fully Supervised [ Figure 9. Regression plot of fully supervised network. (a-g) Correspond to the onset, peak, and termination points of P-waves, the onset and termination points of QRS waves, and the peak and termination points of T-waves, respectively. Figure 10. Regression plot of self-supervised network. (a-g) Correspond to the onset, peak, and termination points of P-waves, the onset and termination points of QRS waves, and the peak and termination points of T-waves, respectively. Figure 11. Bland-Altman plot of fully supervised network. (a-g) Correspond to the onset, peak, and termination points of P-waves, the onset and termination points of QRS waves, and the peak and termination points of T-waves, respectively.

Comparative Analysis of Self-Supervised Network Pretrained on Different Database
As can be seen in Table 2, the results of the characteristic point localization based on the MITDB and NSRDB have less bias, but more variance compared to those on QTDB, which is unavoidable when using different datasets for self-supervised pre-training. This is due to the MITDB and NSRDB has a larger amount of same type heartbeats, and thus more heartbeat features can be extracted. So the mean of deviation is small. In contrast, QTDB has a small fraction of abnormal heartbeats compared to those selected from the MITDB and NSRDB, which ultimately leads to a larger variance of deviation for self-supervised learning using MITDB and NSRDB. In the meanwhile, the model pre-trained on NSRDB predicted more accurately than the model pre-trained on MITDB, with smaller standard deviations of the predicted positions except for that of the T-wave peak. This may owe to the fact that NSRDB has a larger amount of data compared to MITDB, which allowed the model to learn features adequately.

Comparative Analysis with Other Heartbeat Characteristic Points Detection Results
In order to compare the performance of our method and other methods, we list several traditional and deep learning methods in Table 4. In order to make the role of each module in this study stand out, we implemented a baseline network simple-dense, which used the same number of dense blocks and transition layers as our model without the CBAM and FPP modules. We calculated the mean absolute error of mean values of all points for each method, as shown in Table 4. It can be seen that the self-supervised model proposed in this manuscript has a significant performance improvement from the perspective of mean deviation, especially for the detection of p-wave onset, peak and termination points. However, the self-supervised model has poorer predictions at the T-peak and T-off positions compared to our previously published fully supervised network. We speculate that this may be due to the fact that QTDB contains some ST-segment drifting, transient ST-segment depression heartbeats [20], and the self-supervised model was trained with normal heartbeats in both MITDB and NSRDB in pretext task, resulting in a bias in the detection of the downstream task. According to our mean absolute error of mean deviation calculated in Table 4, our self-supervised model has smaller mean and standard deviation compared to MP-EKF [21], which used the traditional algorithm. Compared with the deep learning model U-Net [8], our detection deviation in P-wave and T-wave are smaller. The above shows that our proposed model has significant advantages in detecting ECG characteristic points, especially in P-wave.

Analysis of the Validity of the Model Construction
To further validate the effectiveness of the main modules of the model in this manuscript, we conducted control experiments for each module introduced in our work. The experimental results are shown in Table 5. We defined our previously published fully supervisedbased network as model 1, the self-supervised model pre-trained on QTDB as model 2, the model with the CBAM module removed in model 2 as model 3, the model with the feature pyramid module removed in model 2 as model 4, and the model with both the CBAM module and the feature pyramid module removed in model 2 as model 5. By comparing model 3 with model 2, it was found that the model performance decreases significantly without the CBAM module, and the mean absolute error of mean deviation expands to 4.52 ms. We speculate that this is due to the lack of channel attention module to focus on the important parts of the input, such as the amplitude and morphology of each wavelet and the lack of spatial attention to focus on the order of characteristic points. It is apparent that both the mean and standard deviation are substantially larger when comparing model 2 and model 4, which indicates a decrease in model performance. We speculate that without the feature pyramid pooling module, the model cannot use the relative positions of the waves and the spacing scales of the different waves for feature point identification. Figure 13 shows the positions of the characteristic points corresponding to the feature maps at different scales of FPP structure of multiple morphological heartbeats. It can be seen that both the self-supervised network structure and the fully supervised network structure with FPP have better detection results. Model 5 performed poorly in both mean and standard deviation, which confirmed the performance improvement of our proposed module. Figure 14 shows the loss curve of the comparison experiment.

Limitations of Our Research
Our study also needs further exploration under the current constraints. On the one hand, since the public dataset for characteristic point detections is not large enough, both the number of patients and the types of ECG heartbeats need to be expanded more broadly. On the other hand, methods to improve the accuracy of cutting complete heartbeats are also the focus of our future research. Since the results of the self-supervised pretext task vary widely on different datasets, this also presents an issue that we hope to improve in the future, and which requires us to propose some strategies for personalized features in our future work.

Conclusions
In this manuscript, we propose a new deep-learning-based method with self-supervised learning for detecting the characteristic points of ECG beats. We selected QTDB, MITDB and NSRDB as the downstream task and pretext task database, respectively. After preprocessing and different morphological feature transformations, unlabeled heartbeat signals were fed into the base network for self-supervised learning. We saved the convolutional parameters to initialize the network for the downstream tasks and successfully completed the task of characteristic point localization of the ECG signals. DenseNet is employed as the base network, the CBAM module is added to enhance the extraction of valid information, and the feature map is multi-scaling deflated to improve information extraction. The experimental results show that the mean error of the characteristic point detections is less than one sample point (except the QRS-peak) and the standard deviation is less than five sample points (except for the QRS-peak). The obtained results in our manuscript suggest that compared with the fully supervised model, our proposed deep learning model based on self-supervised learning has smaller detection deviation. The results of this study provide a basis for ECG-information extraction based on characteristic points of the heart beat.

Data Availability Statement:
The dataset is available at https://www.physionet.org/about/database/.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: