Multiscale Encoding of Electrocardiogram Signals with a Residual Network for the Detection of Atrial Fibrillation

Atrial fibrillation (AF) is one of the most common cardiac arrhythmias, and it is an indication of high-risk factors for stroke, myocardial ischemia, and other malignant cardiovascular diseases. Most of the existing AF detection methods typically convert one-dimensional time-series electrocardiogram (ECG) signals into two-dimensional representations to train a deep and complex AF detection system, which results in heavy training computation and high implementation costs. In this paper, a multiscale signal encoding scheme is proposed to improve feature representation and detection performance without the need for using any transformation or handcrafted feature engineering techniques. The proposed scheme uses different kernel sizes to produce the encoded signal by using multiple streams that are passed into a one-dimensional sequence of blocks of a residual convolutional neural network (ResNet) to extract representative features from the input ECG signal. This also allows networks to grow in breadth rather than in depth, thus reducing the computing time by using the parallel processing capability of deep learning networks. We investigated the effects of the use of a different number of streams with different kernel sizes on the performance. Experiments were carried out for a performance evaluation using the publicly available PhysioNet CinC Challenge 2017 dataset. The proposed multiscale encoding scheme outperformed existing deep learning-based methods with an average F1 score of 98.54%, but with a lower network complexity.


Introduction
Cardiovascular diseases have been found to be the leading causes of morbidity and mortality in developed countries [1]. Atrial fibrillation (AF) is the most common type of cardiac arrhythmia associated with cardiovascular disease and is a symptom of an increased risk of stroke, heart failure, and coronary artery disease. Currently, AF affects 33.5 million people globally, and this figure is expected to increase rapidly due to the aging of the population [2]. It occurs due to disorganized electrical activity in the upper left chamber of the heart (the left atrium), which makes the atria quiver or fibrillate [3].
Clinically, AF is characterized by two main features: the absence of P waves, which are replaced by a fast oscillation of fibrillatory waves, and the irregularity of the RR intervals, which is caused by uncoordinated electrical impulses that disrupt the normal activation of the atria. It usually presents as paroxysmal AF, which typically occurs outside the hospital and raises the need for fast and accurate detection of AF. Several established techniques based on ECG features, such as QRS complexes, RR intervals, and heart rate variability, have been used to automatically recognize AF segments in ECG signals [4,5]. Most of these techniques produce unsatisfactory results and low classification performance, mostly because of their failure to detect and extract the characteristic features precisely.
The use of advanced signal processing and machine learning techniques in AF detection can help to reduce the error rate while improving diagnosis accuracy and timeliness, as seen in many other fields, such as cancer diagnosis [6] and DNA analysis [7]. With this motivation, the main contributions of this work are as follows:

1.
A multiscale encoding scheme using a parallel structure of 1D residual blocks with different kernel sizes for capturing features at different scales is proposed. This multiscale method looks at the signal from different perspectives, yielding a better representation of diagnostic features. It also allows networks to grow in breadth rather than in depth, thus reducing the computing time by exploiting the parallel processing capability of deep learning networks.

2.
We investigated the effects of using different numbers of streams and different kernel sizes on short single-lead normalized ECG signals and examined their impact on the detection performance.

3.
We developed an efficient multiscale 1D CNN network that receives 1D ECG signals without any conversion or transformation and does not require explicit preprocessing or feature engineering.

4.
This paper investigates the effects of different lengths of ECG signal segments, different balancing techniques, and different parameters on the detection performance.
The remainder of this paper provides a literature review, an overview of our proposed multiscale encoding scheme with a description of our proposed DL model, and a detailed explanation of the steps taken to accomplish the AF classification goal. The results are discussed in detail in Section 5.

Related Works
During an AF rhythm, the electrical impulses in the atria come from diverse locations, rather than the typical sinoatrial node; they propagate quickly and erratically throughout the atria, resulting in a highly irregular ventricular rate. Clinically, AF can be detected by observing the ECG signal waveform and looking for specific features, such as a missing or absent clear P wave, irregular beat patterns, and abnormal RR intervals. Numerous efforts have been made by researchers around the world to automatically learn these features that define AF. Specifically, the irregularity of RR intervals was characterized in numerous studies using different measures, such as the entropy, standard deviation, and median; AF detection was then performed using feature thresholding and machine-learning-based techniques [19,20]. Liu et al. [21] used an SVM-based method to detect AF among four arrhythmias, achieving a moderate performance with an accuracy of 86.23%. However, the performance of these techniques is highly dependent on the detected features, and handcrafted features are typically not robust to many variations, such as scaling, noise, and displacement. In addition, these methods suffer from the well-known issue of overfitting, where the model fails to generalize well on unseen data.
Deep learning (DL) has recently been developed to address several issues with conventional techniques and has proven to be very effective in solving a variety of problems. In particular, CNNs, with their strong feature extraction capabilities, are becoming popular in various fields, including signal processing. Many researchers have attempted to use DL as a solution to the AF detection problem, as shown in Tables 1 and 2. Pourbabaee et al. [22] used a CNN to develop an intelligent system in which automatically learned AF features were used for the classification. Andreotti et al. [23] used a Residual Network (ResNet) [24], a variation of CNNs, to develop an AF detection system. Such networks can be trained at greater depth because of the residual connections, which help solve the issue of vanishing gradients. Wang et al. [25] developed an 11-layer neural network structure that was primarily stacked with a CNN and a modified Elman neural network (MENN) for AF detection, which had an accuracy of 97.5% and a sensitivity of 97.9%. Limam et al. [26] proposed a convolutional recurrent neural network (CRNN) that comprised two independent CNNs for extracting features: one from ECGs and the other from heart rates. The CRNN structure consists of layers of CNNs and long short-term memory (LSTM). An AF detection method was proposed in [16] based on an end-to-end 1D CNN architecture, which had data length normalization, training, and prediction phases. The proposed structure included a 10-layer architecture for classifying the signal into four classes, namely, AF, normal, noisy, and other. It achieved an average F1 score of 78.2%. Xiong et al. [16] proposed another DL approach that used a 16-layer 1D CNN network to classify AF from the other classes. This approach achieved 82% detection accuracy.
Despite the wide use of 1D CNNs in time-series classification, numerous studies have attempted to improve AF detection and classification accuracy by using techniques with proven success in the computer vision domain. Such techniques transform the signals into a two-dimensional (2D) representation. In a previous study [27], we proposed a network of two layers of bidirectional LSTM for improving AF detection accuracy. Our approach started with the augmentation of the minor class for balancing labels and moved to the transformation of the segmented signals into a spectrogram using short-time Fourier transforms over time windows to extract two types of time-frequency (TF) moments (instantaneous frequency and spectral entropy) that were eventually fed to the LSTM for training and final classification. The approach was evaluated using the PhysioNet Challenge 2017 dataset [28] and was found to be 91.4% accurate. Zihlmann et al. [29] proposed a technique that trained a CRNN using an ECG signal's spectrogram. This procedure first applied a fast Fourier transform to the ECG data to create a 2D spectrogram. To aggregate features, the spectrogram was first input into a stack of 2D convolutional blocks and then into a three-layer bidirectional LSTM network. However, the complexity of the network was relatively high, with more than 10 million training parameters, which is a major drawback of signal transformation. Furthermore, researchers [23,30] found that transformation from 1D signals to 2D images improved the classification accuracy by approximately 3%, but at the expense of higher conversion costs and more complicated networks.
Recently, fusions of deep neural networks have been proposed as another attempt to improve feature detection and classification. A fusion of CNNs was proposed in [21] that employed two-stream convolutional networks with different filter sizes to extract features of different scales. Yao et al. [31] proposed a multiscale convolutional neural network (MCNN) that applied time scaling on input signals and detected AF based on the scaled inputs. The authors deployed different signal transformation schemes, including identity mapping, down-sampling transformations in the time domain, and spectral transformations in the frequency domain. Each transformed version, referred to as a branch, served as a branch input to the CNN for final classification. The MCNNs showed improvements over the other methods in time-series classification [18].
Many of these existing methods were evaluated using landmark databases, such as the MIT Atrial Fibrillation Database (AFDB) [32], and achieved excellent AF detection accuracy (e.g., 99.19% sensitivity and 99.39% specificity in [33]). Existing methods seemed less promising when evaluated with other publicly available databases, such as the recently introduced PhysioNet Challenge 2017 dataset [28]. One practical issue that has been considered is the class imbalance problem, which has been established as affecting both convergence during the training phase and the generalization of a model in the test phase, with a significant detrimental effect on classifier accuracy. Researchers have tried to solve the problem of imbalanced distributions. For example, Acharya et al. [34] proposed synthetic data as a solution. Similarly, Jiang et al. [35] produced synthetic data, but they used the concept of oversampling the minority class data. Another method [36] used skewness to drive a dynamic data augmentation technique for balancing the data distribution. Tables 1 and 2 summarize the state-of-the-art AF detection methods based on the database, structure, number of classes, number of scales, transformation technique, balancing type, and performance. Table 1 shows the CNN-based methods, and Table 2 shows methods utilizing hybrid models. The abovementioned studies showed that research on the analysis of ECG signals is mainly based on single-scale signal analysis techniques using conventional CNN frameworks that have been found to be efficient in obtaining representative features, but that lack design optimization and parameter tuning. This challenge must be addressed without rigorous preprocessing or data transformation for feature extraction so as to increase the detection accuracy without increasing the computational overhead or network complexity.

Multiscale Encoding of ECG Signals Using the ResNet
The majority of the existing works in the literature use a fixed-size single-scale kernel as the receptive field in the convolution operation, which might not fully capture crucial features of ECG signals. In this work, a deep architecture for AF detection has been created based on the multiscale convolution kernel in order to enhance the feature representations. Additionally, the residual network (ResNet), which has a stronger feature extraction capability, is used to manage the extracted features. Hence, we named our model for encoding ECG signals the multiscale residual network (MsRes), which aims to better represent the features suitable for automatic learning and classification.
The primary goal of our model is to classify short segments of ECG signals into AF or non-AF classes. The raw 1D time-series data are fed into the network without conversion or transformation. Additionally, since our AF detection is data-driven, no manual feature engineering methods are required to learn the features of the ECG signals, as they are learned directly from the input data. Our model has two main parts: the multiscale signal encoding part, which aims to improve the AF diagnostic feature detection, and the classification part, which aims to predict one of the two labels.

Multiscale Encoding
Suppose we have a segment of the ECG signal f(x). For multiscale encoding of the signal, the signal is first convolved with different convolution kernels as follows:

f_i(x) = f(x) * h_i(x), i = 1, ..., n,

where h_i(x) is the convolution kernel for scale i = 1 ... n, and the size of the kernel is proportional to the scale. We utilized ResNet-based DL techniques for the multiscale encoding of the convolved signals f_i(x). MsRes blocks were designed and built based on experiences with ResNet, which introduced the idea of an identity shortcut link that limits the volume of information that can pass through the shortcut [36]. By leveraging activations from earlier layers, ResNet overcomes the issue of vanishing and exploding gradients. The shortcut connection also simplifies the network by employing fewer layers and less backpropagation (BP), which decreases the size of the model and speeds up execution. As shown in Figure 1, each residual block contains two convolutional layers and two rectified linear units (ReLUs) [51], which are well-known activation functions in convolutional networks. When the input is positive, the ReLU function is less susceptible to the gradient saturation issue than the sigmoid and tanh functions are. Additionally, each residual block comprises a residual skip connection [24] and a pooling layer that applies max pooling of size 5 with a stride of 2. Finally, the encoded signals for each scale are concatenated to obtain the overall encoded signal, as shown by the equation below:

F(x) = Σ_{i=1}^{n} T_i(f_i(x)),

where T_i(·) represents the ith encoding of the convolved signal f_i(x) and Σ represents the concatenation of these encoded signals.
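As an illustration of the per-scale convolution and concatenation described above, the following is a minimal NumPy sketch. The averaging kernels stand in for the learned kernels of the actual network, and `multiscale_encode` is a hypothetical helper, not the authors' code:

```python
import numpy as np

def multiscale_encode(signal, kernel_sizes=(3, 5, 7)):
    """Convolve one ECG segment with kernels of several sizes and
    concatenate the per-scale outputs. Placeholder averaging kernels
    stand in for the kernels that MsRes learns during training."""
    streams = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k                      # placeholder h_i(x)
        streams.append(np.convolve(signal, kernel, mode="same"))
    return np.concatenate(streams)                   # Σ-concatenation

# Toy 9 s segment sampled at 300 Hz (2700 samples)
x = np.sin(np.linspace(0, 4 * np.pi, 2700))
enc = multiscale_encode(x)
print(enc.shape)   # (8100,) = 3 scales x 2700 samples
```

In the real model, each stream is further processed by a chain of residual blocks before concatenation; this sketch only mirrors the two equations.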

MsRes Structure
As shown in Figure 2, the architecture of the MsRes consists of several parallel convolutional neural networks, each of which learns features from a scale with a different kernel size. The parallel residual convolutional blocks learn their features and dependencies with a different kernel size in each subnetwork. This process not only enriches and improves the feature detection, but also reduces the time needed to complete the process, since multiple networks work in parallel; in addition, the model requires fewer resources. The networks can also grow in breadth rather than depth by increasing the usage of the layers that are already built, rather than adding hundreds of layers.
The model primarily consists of a convolutional layer as an input layer for the different streams. For multiscale encoding of the signal, the signal is first convolved with different convolution kernels, as shown in Figure 2, with different combinations of kernel sizes of 3, 5, 7, 9, 11, and 13 assigned to the streams, with a stride of 3. Each stream comprises seven consecutive residual blocks operating on its differently sized kernel input. The outputs of these networks are concatenated to form the final encoded signal, which is used as the input of the classification module to predict the classes.
The classification module shown in Figure 3 consists of three dense layers with 64, 32, and 16 neurons and two dropout layers with a rate of 0.25. The dropout layers are employed between the dense layers to reduce overfitting. During training, a dropout layer selects a percentage of neurons at random and updates only the weights of the remaining neurons [52]. The dropout rate was set to 0.25, which means that one-fourth of the neurons are not updated, in order to minimize the classification error and network overfitting. In this model, we chose cross-entropy as the loss function. The output of the last dense layer is fed into a single sigmoid neuron, which outputs the predicted probability of the positive (AF) class. Network complexity can be defined by the total number of trainable parameters, which depends on the number of scales used. Table 3 shows the parameters of a three-scale network that was optimized empirically, as discussed in Section 5. Here, the parameters of non-trainable layers, such as the pooling, dropout, and flatten layers, were excluded. The best-performing network has approximately 158,401 trainable parameters, which is small compared to the millions of parameters used by the models in the literature.
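The residual block of Figure 1 (two conv+ReLU layers, an identity skip connection, and max pooling of size 5 with stride 2) can be sketched in plain NumPy. This is an illustrative single-channel sketch under our own simplifying assumptions ('same' padding so the skip can be added; hand-supplied weight vectors), not the authors' implementation:

```python
import numpy as np

def conv1d(x, w, stride=1):
    """Valid 1D convolution for a single channel (sketch helper)."""
    k = len(w)
    out = [np.dot(x[i:i + k], w) for i in range(0, len(x) - k + 1, stride)]
    return np.array(out)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Two conv+ReLU layers with an identity skip connection, then
    max pooling of size 5 with stride 2, mirroring Figure 1.
    Symmetric padding keeps lengths equal so the skip can be added."""
    pad = len(w1) // 2
    y = relu(conv1d(np.pad(x, pad), w1))
    y = relu(conv1d(np.pad(y, pad), w2))
    y = y + x                                        # identity shortcut
    pooled = [y[i:i + 5].max() for i in range(0, len(y) - 4, 2)]
    return np.array(pooled)

rng = np.random.default_rng(0)
x = rng.standard_normal(100)                         # toy input
w1 = rng.standard_normal(3) * 0.1                    # toy learned kernels
w2 = rng.standard_normal(3) * 0.1
out = residual_block(x, w1, w2)
print(out.shape)   # (48,): pooling roughly halves the length
```

In the full MsRes, seven such blocks are stacked per stream, each stream using its own kernel size, and the stream outputs are concatenated before the dense classification head.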

Experiments
Generally, a deep-learning-based method consists of three main steps: preprocessing, feature representation, and classification. Following this baseline, our method uses a sequence of preprocessing steps, namely, signal clipping, segmentation, augmentation, and normalization, before the data are fed into the MsRes network for the encoding of the signal; the encoded signal is finally used for classification, as shown in Figure 4. Each record of ECG signals was segmented into windows of a certain time duration. The resulting data segments were divided into training and validation sets. The training data were augmented and then normalized before they were fed into the model for training. Finally, the test data were classified into one of two classes: AF or non-AF. These operations are discussed in greater detail in the following subsections.

Data Used
The proposed method was evaluated using the PhysioNet Challenge 2017 dataset [28], which encourages the development of algorithms for classification from a single-lead short ECG recording in real conditions. Other existing databases (e.g., MIT-BIH AF database [4]) generally consist of cleaner data, and good performance can be obtained even by using simpler methods, such as feature thresholding. However, the PhysioNet Challenge 2017 dataset provides an opportunity to reliably detect AF in more real conditions, where many non-AF rhythms exhibit irregular RR intervals that may be similar to those in AF. Moreover, in this database, the infrequent occurrence of AF represents a real-life situation, and at the same time, this data imbalance becomes another challenge for the reliable detection of arrhythmias.
The PhysioNet CinC Challenge 2017 [28] database comprises 8528 ECG records that were captured by AliveCor's portable single-channel ECG device and sampled at 300 Hz. The records' lengths range from 9 to 61 s. The dataset comprises four types of ECG records, as annotated by experienced experts: 771 AF, 5154 normal (normal sinus rhythm), 46 noisy (too noisy to be recognized), and 2557 other (other rhythms). To transform the experiment into a binary classification problem, the ECG data annotated as normal, noisy, and other arrhythmias were relabeled as non-AF. Thus, the dataset consisted of 771 records labeled as AF and 7757 records labeled as non-AF. Figure 5 shows examples of (a) normal and (b) AF ECG signals, the latter with irregular RR intervals and the absence of P waves.

Signal Clipping
ECG data are frequently tainted by various noises and artifacts, which have an impact on the subsequent experiments and classification outcomes. Noise reduction can boost the effectiveness of a detection technique, especially for neural networks, whose performance is heavily reliant on the quality of the input data. It remains essential to adopt a reasonable method of preprocessing the data rather than applying rigorous preprocessing, which may result in the loss of viable information. In the dataset, some signals had serious noise disruptions at the start, which were primarily caused by noise when mounting the sensors or by the movements of the subjects. To remove this noise, the start of the signal was clipped for the removal of the affected part. In this study, the ECG signals were not filtered in the preprocessing stage for two reasons. First, the use of the original data reduced the computation cost. Second, the proposed model showed good performance with no filtering scheme, showing potential for use in practical applications.

Segmentation
In the dataset, the input ECG records have varying lengths, from 9 to 61 s. Inputs with varying lengths cannot be fed into CNN models; a consistent data length is necessary to ensure that the data can be entered into the model. For reliable detection of AF, we considered the use of a window of an ECG signal that adequately reflected the characteristics of the class that it belonged to. Although a short window was preferable for processing efficiency, a longer window could contain better diagnostic features. Several existing methods used an 8 to 10 s window for reasonable performance in AF detection [19,20].
In this study, the experimental data were segmented into nine-second segments, which seems reasonable for the analysis of ECG rhythms with AF conditions and for the simplification of the computations. Because the sampling rate of the experimental data was 300 Hz, 2700 samples were taken as a segment, using a moving window with a step size of 9 s. Hence, the number of samples (instances) increased. After the data segmentation, there were 27,108 segments, of which 24,729 were non-AF and 2379 were AF. To verify the influence of different input sizes on the model performance, the experimental data were also segmented into five- and ten-second segments.
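The segmentation step can be sketched as a sliding-window slice over each record. This is a hypothetical helper (`segment_record` is our name, not the authors'), assuming a 9 s non-overlapping step, which is consistent with the reported segment counts:

```python
import numpy as np

def segment_record(record, fs=300, win_s=9, step_s=9):
    """Split a variable-length ECG record into fixed-length windows of
    win_s seconds (2700 samples at 300 Hz). step_s = win_s yields
    non-overlapping 9 s segments; any trailing partial window is dropped."""
    win, step = win_s * fs, step_s * fs
    return np.array([record[i:i + win]
                     for i in range(0, len(record) - win + 1, step)])

rec = np.random.randn(61 * 300)   # toy 61 s record at 300 Hz
segs = segment_record(rec)
print(segs.shape)                  # (6, 2700)
```

Shorter step sizes would produce overlapping windows and more training instances, at the cost of correlated segments.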

Augmentation
As shown in Figure 6, the numbers of samples in the two categories after segmentation were severely unbalanced; there were far fewer AF samples than non-AF samples. With 80% of the segments used for training, the training set included 21,686 segments, comprising 19,804 non-AF and 1882 AF segments. This unbalanced dataset seriously affected the performance of the model and made it much trickier to screen an AF patient than a normal case. To solve this, the dataset was randomly divided into a training set and a test set at a proportion of 8:2, and oversampling was then performed on the training set. Because random oversampling might lead to overfitting, more sophisticated methods can be used, such as the synthetic minority oversampling technique (SMOTE) [53], a commonly adopted oversampling method based on finding a sample's nearest neighbors using the Euclidean distance. By interpolating between a sample and its nearest neighbors with factors between 0 and 1, SMOTE can generate synthetic data [54]. There are several variations of SMOTE. Borderline-SMOTE (BLSMOTE) [55] oversamples only the minority examples near the borderline of the data. Adaptive synthetic sampling (ADASYN) [56], another well-known technique, is an improved version of SMOTE. After creating the samples, it adds a small random value so that, instead of all the samples being linearly correlated with the parent, they have a little more variance in them; that is, they are scattered.
We applied these augmentation techniques to the dataset to evaluate and investigate their effects on the model's accuracy. Table 5 shows the results of several experiments that were conducted to compare these techniques. The experiments were performed using the network with three scales, and the performance is presented in terms of the F1 score and execution time; we observed that SMOTE outperformed the other two techniques. The default parameters of the SMOTE algorithm were used (sampling strategy = 1), which increased the number of samples in the minority class to equal the size of the majority class. After balancing, there were a total of 39,608 segments in the training set.
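The interpolation behind SMOTE can be sketched in a few lines of NumPy. This is a simplified stand-in for the library implementation the paper most likely used, and `smote_sketch` is a hypothetical helper name:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE-style oversampling: for each synthetic sample,
    pick a minority point, choose one of its k nearest minority
    neighbors (Euclidean distance), and interpolate at a random
    position on the segment between them."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]            # skip the point itself
        j = rng.choice(nn)
        lam = rng.random()                     # interpolation factor in (0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_af = np.random.randn(20, 2700)               # toy minority-class segments
synth = smote_sketch(X_af, n_new=30, rng=0)
print(synth.shape)                              # (30, 2700)
```

In practice, a maintained implementation such as `imblearn.over_sampling.SMOTE` would be applied to the training split only, so that synthetic segments never leak into the test set.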

Normalization
There was a significant fluctuation in the amplitudes of the ECG recordings between different records, as these records came from different people or even the same person with varying lead placements. It was discovered that in real-world scenarios, the neural network models converged more effectively when all of the inputs had similar ranges. Thus, the amplitude of each of the ECG segments was normalized between 0 and 1, because normalizing the data also accelerated the model training.
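The per-segment min-max normalization can be sketched as follows. This is a minimal illustration; the `eps` guard against perfectly flat segments is an implementation detail we add here, not part of the original description:

```python
import numpy as np

def minmax_normalize(segment, eps=1e-8):
    """Scale one ECG segment's amplitude to the [0, 1] range."""
    lo, hi = segment.min(), segment.max()
    return (segment - lo) / (hi - lo + eps)
```

Applying this independently per segment removes inter-record amplitude differences while preserving each segment's waveform shape.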

Training and Classification
To ensure that the model would be fairly evaluated on the dataset, K-fold cross-validation (CV) was used; this is one of the most popular evaluation methods in various DL applications. The original dataset was randomly divided into five subsets of equal size, and five-fold cross-validation was performed. The model's average performance across the entire dataset was then estimated using the classification results of these five models.
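The five-fold split can be sketched as follows (an illustrative helper with assumed names, not the actual training code; a library utility such as scikit-learn's `KFold` would normally be used):

```python
import numpy as np

def five_fold_indices(n_samples, seed=0):
    """Shuffle sample indices and split them into five equal-sized folds;
    each fold serves once as the validation set while the rest train."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, 5)
    for i in range(5):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(5) if j != i])
        yield train, val
```

Averaging the five validation scores gives the cross-validated performance estimate reported for the model.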

Selection of Hyperparameters
To design an optimal structure for AF classification, we analyzed the impacts of different combinations of the hyperparameters based on experience and a manual random search technique [57], which was reported to be more efficient for hyperparameter optimization than a traditional grid search. Because we had a large number of hyperparameters, we chose the most important ones empirically to balance the model performance against the number of training parameters and the training time. The hyperparameters that we chose included the number of residual blocks, the dropout percentage, the batch size, the learning rate, and the convolutional kernel size.
Seven residual blocks were chosen in each of the parallel MsRes blocks. The model was obtained by training with adaptive moment estimation (Adam) [58], which is an adaptation of the stochastic gradient descent (SGD) optimization algorithm. By minimizing the loss function, which represents the difference between the network prediction and the ground truth, Adam used the backpropagation (BP) approach to determine the neural network's parameter configuration. Adam computes adaptive learning rates, determining an individual learning rate for each parameter. The default learning rate is 0.001 [59]. A model typically learns more quickly when the learning rate is high and more steadily when the learning rate is low.
Different learning rates were explored, such as 0.001, 0.0001, and 0.005. It was found that the training process was more stable when the value of 0.001 was chosen. Additionally, we explored the use of different optimizers, such as Adadelta [60] and Adagrad [61], to see their impacts on the performance, but there was a drop in the accuracy and F1 score when optimizers other than Adam were used. The batch sizes were compared for a range of values (128, 256, and 512), and it was noticed that when the batch size was changed to 256 or 512, there was a negative impact on the accuracy and F1 score. The best performance was obtained with a batch size of 128.
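For reference, a single Adam update can be written out as follows. This is a standard textbook sketch using the default hyperparameters from [58] (beta values and epsilon), not the framework code used in training:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponentially decayed first/second moment estimates,
    bias correction, then a per-parameter scaled gradient step."""
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Dividing by the square root of the second-moment estimate is what gives each parameter its own effective learning rate, which is why Adam behaved more stably here than plain SGD-style alternatives.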
The number of scales and the kernel sizes of each stream have a crucial impact on the feature detection performance of DL networks. We conducted experiments to understand the effects of different signal encoding configurations, in particular of the different kernel sizes, on the feature representation and model performance for the data at hand. The kernel size values were 3, 5, 7, 9, 11, and 13, and different combinations of these values were explored and applied to different numbers of streams, starting from two streams and going up to five. The best performance was obtained with three streams with kernel sizes of 5, 7, and 9.
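The parallel-stream encoding can be illustrated with a minimal numpy sketch, in which random kernels stand in for the learned ResNet filters and a single convolution per stream stands in for the residual blocks (function names and shapes here are illustrative assumptions):

```python
import numpy as np

def conv1d(signal, kernel):
    """Valid-mode 1-D convolution (cross-correlation, as in DL frameworks)."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel
                     for i in range(len(signal) - k + 1)])

def multiscale_encode(signal, kernel_sizes=(5, 7, 9), seed=0):
    """Pass the same ECG segment through parallel streams, one per kernel
    size, then stack the per-stream feature maps (cropped to a common
    length) into one multiscale representation."""
    rng = np.random.default_rng(seed)
    streams = [conv1d(signal, rng.standard_normal(k)) for k in kernel_sizes]
    n = min(len(s) for s in streams)
    return np.stack([s[:n] for s in streams])   # shape: (n_streams, n)
```

Because each stream sees the signal through a different receptive-field width, small kernels respond to sharp features such as QRS complexes while larger kernels capture slower morphology, and the stacked output carries both.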
Between two dense layers, a dropout layer was also used to reduce overfitting. During training, the dropout layer randomly selected a subset of neurons and only updated the weights of these neurons [52]. We set the dropout parameter to 0.25, which meant that one-fourth of the neurons would not be updated. Table 6 shows the values of the hyperparameters used for the MsRes to obtain the best performance in terms of accuracy and F1. The model loss was calculated using a binary cross-entropy function, which is preferable for binary classification. The loss was calculated according to the following formula:

Loss = -(1/N) ∑_{i=1}^{N} [y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i)]

where y represents the expected outcome, and ŷ represents the outcome produced by our model. Figure 7 shows the training and validation accuracy and loss curves of the proposed model. The first row in the figure shows the accuracy and loss curves for the best performance, with three streams with kernel sizes of 5, 7, and 9 over 50 epochs. They show that the training curves are relatively stable, and the presence of the fluctuations in the validation curves is due to the use of batch normalization, which may lead to unstable training and validation [62]. Furthermore, there is no degradation in the accuracy; instead, there is a decrease in the model loss and only a small gap between the two curves, which indicates a good fit and no overfitting. In the second row in the figure, the accuracy and loss curves for the use of four streams with kernel sizes of 5, 7, 9, and 11 show the instability of the training and validation curves, which affected the overall performance and degraded the F1 score. Such instability and the presence of bumps in the curves are strong indications of an overfit curve, meaning that the model showed signs of overfitting when more than three streams were used.
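The binary cross-entropy computation can be sketched directly in numpy (illustrative only; Keras computes this internally during training, and the clipping constant here is our own numerical-stability guard):

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy averaged over samples; predictions are clipped
    away from 0 and 1 to keep the logarithms finite."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))
```

A perfectly confident correct prediction drives the loss toward zero, while a maximally uncertain prediction of 0.5 gives log 2 ≈ 0.693 per sample.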

Results and Analysis
In this section, the evaluation metrics, experimental results, and analyses and comparisons of the results are presented to explain the significance of our proposed multiscale signal encoding scheme.

Evaluation Metrics
The proposed system classifies the input ECG segments into one of two categories. We evaluated the classification performance using the F1 score as follows:

F1 = 2 × (Precision × Recall)/(Precision + Recall) (4)

Precision is defined as the ratio of true-positive (TP) classifications to the number of TP and false-positive (FP) classifications, and recall is the ratio of TP classifications to the number of TP and false-negative (FN) classifications. The following formulas were used to calculate the precision, recall, and accuracy:

Precision = TP/(TP + FP) (5)

Recall = TP/(TP + FN) (6)

Accuracy = (TP + TN)/(TP + TN + FP + FN) (7)

The confusion matrix, which depicts the performance of a classification algorithm, can be used to calculate precision and recall. The matrix's rows represent predicted classes, and its columns represent ground-truth classes. Figure 8 depicts a confusion matrix for two classes. Accuracy is used in most cases for the evaluation of the classification performance. However, because there are far more data in one class than in the other, accuracy is insufficient for evaluating the performance of an unbalanced dataset. Alternatively, we used the F1 score as a measure because it balances precision and recall and is useful when dealing with uneven class distributions [63].
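These metrics follow directly from the confusion-matrix counts; a minimal helper (the function name is illustrative) makes the relationships explicit:

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1
```

For example, with 8 true positives, 2 false positives, 2 false negatives, and 8 true negatives, all four metrics equal 0.8.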

Classification Performance
The proposed model was implemented in Google's Colab environment with the Keras deep learning library using Python. Tables 7-11 show the experimental results for the multiscale encoding using different numbers of scales with different kernel sizes (i.e., 3, 5, 7, 9, 11, and 13). In Table 7, we present the experimental results for a single-stream kernel with different sizes. The best performance for the single scale was achieved with the kernel size of 13, which had an F1 score of 95.95%. The results of the use of two-stream multiscale encoding are shown in Table 8. A significant improvement was achieved, with the best performance showing an F1 score of 98.37% for kernel sizes of 5 and 7. The performance improved by 2-3% after the second scale was added. Subsequently, as shown in Table 9, the performance slightly improved with the addition of the third stream, and the highest average F1 score of 98.54% was achieved with three scales at the kernel sizes of 5, 7, and 9. The performance (best results obtained among alternative models) slightly dropped to 98.47% after the fourth stream was added and to 98.42% after the fifth stream was added, as shown in Tables 10 and 11, respectively. The fusion of information from the three scales (5, 7, and 9) generated the highest average F1 score of 98.54%, accuracy of 98.37%, precision of 97.75%, and recall of 98.67% for the model. These improved performance metrics can be attributed to the good capture of the critical diagnostic information, including the irregular RR intervals and P-wave abnormalities, in this configuration. Using larger kernel sizes disrupted the model with additional, irrelevant information. Furthermore, it was noticed that using more than three streams did not improve performance, and it was inferred that this may have been due to overfitting.
These results validate the ability of the signal encoding of MsRes to capture clinically relevant information for improved AF diagnosis. Figure 9 shows the performance curve, representing the results of the 48 experiments in Tables 7-11 at different scales, kernel sizes, and streams. Figure 10 shows five confusion matrices, one for each number of scales, with the best-performing kernel combination for each scale according to the experimental results presented in Tables 7-11. The confusion matrix for five folds is shown, with 0 indicating the case of non-AF and 1 indicating the case of AF. The diagonal elements show the number of correctly predicted records for each class, and the off-diagonal elements display the number of misclassifications for each class. In Figure 10a, the confusion matrix for the single scale with the kernel size of 13 is shown, and it indicates that 163 AF records were misclassified as non-AF and 126 non-AF records were misclassified as AF. The performance significantly improved, as seen in the confusion matrix for the two scales in Figure 10b, where only 37 AF records were misclassified as non-AF and 18 non-AF records were misclassified as AF. The performance further improved with the addition of the third scale in Figure 10c, wherein 14 AF records were misclassified as non-AF and 29 non-AF records were misclassified as AF. In the fourth and fifth scales, as shown in Figure 10d,e, the performance slightly dropped, and the number of misclassified AF records slightly increased. Figure 11 shows the precision-recall curve, which indicates the tradeoff between the precision and recall for each number of scales. A larger area under the curve (AUC) indicates both higher recall and higher precision, where higher precision corresponds to a lower false-positive rate and higher recall corresponds to a lower false-negative rate.
In the figure, the smallest AUC is obtained with the single scale; a clear improvement in the AUC is shown when the number of scales increases to two and three. There is no clear separation between the curves of the last three scales, and the curves of the third, fourth, and fifth scales are almost optimal, approaching the upper-right corner. Similarly, Figure 12 shows the receiver operating characteristic (ROC) curve, which represents the diagnostic ability of a binary classifier, in the same manner for the different scales; there were clear improvements in the classification ability when the multiscale encoding scheme was used.

Experiments with Different Signal Lengths
To demonstrate the effectiveness of our AF detection system, we compared the performance of the best model (kernel sizes of 5, 7, and 9) for different signal lengths of five, nine, and ten seconds, as shown in Table 12. The best result was achieved with a signal length of nine seconds, which was the segment size that captured enough features to boost the detection. Our comparison of the five- and ten-second variants showed that a five-second window slightly reduced the AF detection performance, whereas a ten-second window increased the computational complexity and time without improving the detection performance.
Table 13 compares our proposed method with state-of-the-art methods in terms of the method, number of classes, input length, preprocessing, balancing, and F1 score. All of these methods were evaluated with the same unbalanced PhysioNet Challenge 2017 dataset.
The two systems proposed in [64] and [3], which used a single-scale ResNet, had an average performance of 85% and 88%, respectively. The other three models using multiscale DL networks had average performances of 85%, 92.1%, and 84.31%. These results show that our proposed method outperformed the state-of-the-art methods that were evaluated on the same database, with an F1 score of 98.54%. Table 14 compares the different methods that were evaluated on the same dataset in terms of the method, number of classes, input length, preprocessing, balancing, trainable parameters, and F1 score. It shows that our proposed model has the lowest number of trainable parameters, which makes it reliable and applicable in real-time applications.

Conclusions
In this paper, we proposed a multiscale encoding scheme with residual blocks of deep CNNs-named MsRes-for screening out AF using short single-lead ECG signals. The proposed MsRes network applied the concept of multiscale convolutional signal encoding, which utilized different kernel sizes to capture the features at different scales of the input ECG recordings. The proposed system adopted a simple yet effective preprocessing scheme to produce a cropped and balanced number of segments for both classes without rigorous preprocessing and transformation. In the experiments that we conducted, we explored the effects of adding different numbers of scales with combinations of kernel sizes, which had a clear impact on the AF detection accuracy and F1 score.
Through an exhaustive evaluation, we found that the proposed MsRes method performed better in terms of the F1 score than the single-scale convolution kernels in several state-of-the-art methods evaluated with the PhysioNet Challenge 2017 ECG dataset, with an F1 score of 98.54%, an accuracy of 98.37%, a total loss of 5.6%, a precision of 97.75%, and a recall of 98.67%, when a three-stream multiscale encoding scheme with kernel sizes of 5, 7, and 9 was used. Additionally, the proposed method significantly reduced the network complexity, with 158,401 trainable parameters compared to the millions of parameters used in the literature.
One of the limitations of this work is that it was only tested on the PhysioNet Challenge 2017 ECG dataset. In the future, we will focus on improving MsRes, exploring the potential for further assessment of its performance and generalization to other public databases, or using cross-databases to further improve and optimize its learning ability. Another limitation is that our model was designed only for a two-class classification problem, namely, that of AF and non-AF detection. We would like to extend our method to a multiclass classification problem in which a wide variety of different arrhythmias could be detected from single-lead ECG signals to confirm a high diagnostic output close to that of cardiologists.

Institutional Review Board Statement: Ethical review and approval were waived for this study due to the use of a publicly available database only.
Informed Consent Statement: Patient consent was waived due to the use of publicly available databases only.