Atrioventricular Synchronization for Detection of Atrial Fibrillation and Flutter in One to Twelve ECG Leads Using a Dense Neural Network Classifier

This study investigates the use of atrioventricular (AV) synchronization as an important diagnostic criterion for atrial fibrillation and flutter (AF) using one to twelve ECG leads. Heart rate, lead-specific AV conduction time, and P-/f-wave amplitude were evaluated by three representative ECG metrics (mean value, standard deviation), namely RR-interval (RRi-mean, RRi-std), PQ-interval (PQi-mean, PQi-std), and PQ-amplitude (PQa-mean, PQa-std), in 71,545 standard 12-lead ECG records from the six largest PhysioNet CinC Challenge 2021 databases. Two rhythm classes were considered (AF, non-AF), with records randomly assigned to training (70%), validation (20%), and test (10%) datasets. In a grid search of 19, 55, and 83 dense neural network (DenseNet) architectures and five independent training runs, we optimized models for one-lead, six-lead (chest or limb), and twelve-lead input features. Lead-set performance and SHapley Additive exPlanations (SHAP) input feature importance were evaluated on the test set. The optimal DenseNet architectures, denoted by the number of neurons in sequential [1st, 2nd, 3rd] hidden layers, achieved the following sensitivity and specificity, respectively: DenseNet [16,16,0] with primary leads (I or II) had 87.9–88.3 and 90.5–91.5%; DenseNet [32,32,32] with six limb leads had 90.7 and 94.2%; DenseNet [32,32,4] with six chest leads had 92.1 and 93.2%; and DenseNet [128,8,8] with all 12 leads had 91.8 and 95.8%. Mean SHAP values on the entire test set highlighted the importance of RRi-mean (100%), RRi-std (84%), and atrial synchronization (40–60%) for the PQa-mean (aVR, I), PQi-std (V2, aVF, II), and PQi-mean (aVL, aVR). Our focus on finding the strongest AV synchronization predictors of AF in 12-lead ECGs can lead to a comprehensive understanding of the decision-making process in advanced neural network classifiers.
DenseNet self-learned to rely on a few ECG behavioral characteristics: first, characteristics usually associated with AF conduction, such as rapid heart rate, enhanced heart rate variability, and large PQ-interval deviation in V2 and inferior leads (aVF, II); second, characteristics related to a typical P-wave pattern in sinus rhythm, which is best distinguished from AF by the earliest negative P-peak deflection of the right atrium in lead aVR and the late positive left atrial deflection in lateral leads (I, aVL). Our results on lead-selection and feature-selection practices for AF detection should be considered for one- to twelve-lead ECG signal processing settings, particularly those measuring heart rate, AV conduction times, and P-/f-wave amplitudes. Performances are limited to the AF diagnostic potential of these three metrics. SHAP value importance can be used in combination with a human expert's ECG interpretation to shift the focus from a broad observation of 12-lead ECG morphology to the few AV synchronization findings strongly predictive of AF or non-AF arrhythmias. Our results are representative of AV synchronization findings across a broad taxonomy of cardiac arrhythmias in large 12-lead ECG databases.


Introduction
Atrial fibrillation (AFIB) and atrial flutter (AFL) are the most common cardiac arrhythmias and are especially threatening for the geriatric population, with incidence increasing from 0.5% for people aged 40-50 years to 5-15% for people aged 80 years [1,2]. Despite the different pathophysiology of AFIB and AFL, both are diseases associated with structural and electrical abnormalities of the atrium that increase the risk of stroke, heart failure, thromboembolism, and mortality; therefore, early diagnosis is vital [1,3-5].
During sinus rhythm, the depolarization of the cardiac muscle begins at the sinus node. In 12-lead ECGs, this is characterized by the presence of correctly oriented P-waves [6], which are positive in leads I, II, and aVF; negative in the lead aVR; biphasic (−/+), positive or negative in the lead aVL; and positive in all chest leads, except for V1 which may be biphasic (+/−), as illustrated in the example of Figure 1a. In contrast, AFIB and AFL are abnormal heart rhythms associated with irregular excitation of the atrial chambers; therefore, P-wave synchronization and morphology are completely distorted. Although AFIB and AFL both have increased atrial activity, their representation on the electrocardiogram (ECG) is not identical. AFIB is characterized by irregular atrial activity, discerned by high-frequency small-amplitude fibrillatory (f) waves instead of P-waves (Figure 1b).

ECG Databases
We used the PhysioNet CinC Challenge 2021 databases [56,63], which are presently known as the largest freely available repository of standard 12-lead ECG records with consistent annotations for 30 clinical diagnoses of cardiac abnormalities (and/or a normal sinus rhythm), available on the PhysioNet website (https://physionet.org/content/challenge-2021/1.0.2/, last accessed on 21 June 2022). The scope of this study included the specific diagnostic labels of AFIB and AFL arrhythmia; however, these were not separately annotated in some of the databases. Therefore, we considered both arrhythmias in a mixed class, namely AF (AFIB+AFL). All other records, with diagnostic annotations different from both AFIB and AFL, were assigned to the non-AF class. The ratio of AF to non-AF cases was about 8 vs. 92% of the total database of 71,545 ECG records (Table 1). We maintained similar AF to non-AF proportions in a random patient-wise allocation to three independent subsets: training (70%), validation (20%), and test (10%).
Table 1 summarizes the total number of short-term ECG records (duration 5-144 s) available in the six largest PhysioNet CinC Challenge 2021 databases, restricted by two exclusion criteria:
• ECG records that did not contain any of the annotation labels for the 30 clinical diagnoses defined for the Challenge.
• ECG records annotated as having low QRS voltages, poor R wave progression, or pacing rhythm.
Figure 2 presents the study design for a binary AF/non-AF rhythm classification, which consisted of four major parts:
1. Pre-processing: Analysis of full-length 12-lead ECG records with a focus on QRS, QRS onset, and P-/f-peak detection, as well as measurement of RR-intervals, lead-specific PQ-intervals, and PQ-amplitudes.
2. DenseNet model design: Grid-search architectural design of dense NN classifiers with input features from different lead sets, including single lead, six limb leads, six chest leads, and all twelve leads.
3. Optimization: Training and validation process for selection of the best models with maximal performance.
4. Test: Performance evaluation on the independent test set, used to derive conclusions on the importance of lead-set and input features.
The subsequent sections describe each part of the study design.

Pre-Processing

Data Reading
The input data were read with the open-source Python example code for the PhysioNet/Computing in Cardiology Challenge 2021 [64], available at the link: https://github.com/physionetchallenges/python-classifier-2021, last accessed on 21 June 2022. The standard 12-lead ECG raw signals of limb leads (I, II, III, aVR, aVL, and aVF) and chest leads (V1-V6) were loaded as binary MATLAB v4 files, resampled to a common frequency of 500 Hz. The annotations were read as plain text files in WFDB header format [65] for the recording, patient attributes, and the diagnosis, originally stored as SNOMED-CT codes with validated annotations, as described at the link: https://github.com/physionetchallenges/evaluation-2021/blob/main/dx_mapping_scored.csv, last accessed on 21 June 2022. Considering that one ECG recording could have one or more diagnostic labels for cardiac abnormalities, the presence of one of the codes "164889003" (AF) or "164890007" (AFL) assigned the record to the AF class. Otherwise, it was assigned to the non-AF class. According to the exclusion criteria in Section 2.1, all records without a code or with at least one of the following codes were excluded: "365413008" (poor R wave progression), "251146004" (low QRS voltages), or "10370003" (pacing rhythm).
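The class assignment and exclusion rules described above can be illustrated with a short Python sketch. This is not the code used in the study: the function names `parse_dx` and `assign_class` are hypothetical, and the assumption that the diagnosis codes appear on a header comment line beginning with `Dx:` is ours. Only the SNOMED-CT codes themselves are taken from the text.

```python
# SNOMED-CT codes quoted in the text
AF_CODES = {"164889003", "164890007"}  # AF and AFL labels
EXCLUDED_CODES = {"365413008", "251146004", "10370003"}  # poor R progression, low QRS, pacing


def parse_dx(header_text):
    """Extract diagnosis codes from a header comment line such as '# Dx: a,b'.

    The exact layout of the comment line is an assumption of this sketch.
    """
    for line in header_text.splitlines():
        stripped = line.lstrip("#").strip()
        if stripped.lower().startswith("dx:"):
            return [c.strip() for c in stripped[3:].split(",") if c.strip()]
    return []


def assign_class(dx_codes, scored_codes=None):
    """Return 'AF', 'non-AF', or None when the record is excluded.

    scored_codes (optional) stands for the set of the 30 Challenge diagnoses;
    when it is given, a record must carry at least one of them to be kept.
    """
    codes = set(dx_codes)
    has_label = bool(codes & scored_codes) if scored_codes is not None else bool(codes)
    if not has_label or codes & EXCLUDED_CODES:
        return None
    return "AF" if codes & AF_CODES else "non-AF"
```

For example, a record annotated with both an AF code and the pacing rhythm code is dropped, because exclusion is checked before class assignment.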

ECG Filtering and Delineation
The noise components in each ECG lead were suppressed by three filtering procedures, which were originally designed to preserve low- and high-frequency ECG waves, including: (i) the subtraction procedure for power-line interference cancellation with dynamic adjustment to linear and non-linear ECG segments [66]; (ii) high-pass recursive filter with a cutoff frequency of 0.64 Hz for removal of the baseline drift [67]; (iii) low-pass Savitzky-Golay filter for suppression of electromyographic (EMG) noise with dynamic adaptation of the cutoff frequency to about 14 Hz (low-power P, T-waves, PQ, ST, and TP-segments), 20-30 Hz (high-power P and T-waves), and >100 Hz (QRS complexes), to best keep the frequency spectra of specific ECG waves [68].

Figure 2. Study design for binary AF/non-AF detection with dense NN classifier (DenseNet), using RR-interval, PQ-interval, and PQ-amplitude input features (mean value, standard deviation) from different lead sets in 12-lead ECG. Grid search architectural design and optimization was applied to derive the best DenseNet models, which were tested to study the importance of lead-set and input features.
This study applies simple ECG delineation methods of three characteristic points within each ECG cycle, namely the R-wave (by QRS detection), Q-wave (by QRS onset detection), and P-/f-peak (by detection of one deflection wave preceding the Q-wave). An illustration of these three characteristic points is presented in Figure 3, with examples showing normal sinus rhythm and AF rhythm.
Figure 3. ECG delineation applied in this study, including detection of R-waves (green stars), QRS onsets (blue squares), and preceding P-/f-wave peaks (red circles), illustrated for examples of (a) non-AF rhythm and (b) AF rhythm.

Our QRS detector used the real-time algorithm of Christov [69], which identifies significant peaks of spatial velocity (absolute value of the first derivative of one or more ECG leads) using three adaptive thresholds for the QRS amplitude, slew rate, and high-frequency noise. The QRS detector operates with any number of ECG leads, self-synchronizes to QRS slopes, and adapts to beat-to-beat intervals. Its efficiency was high enough (about 99.7% [69]) for reliable QRS detection in this study.
The QRS onset (Q) is found in the first isoelectric segment before the QRS fiducial point (R-wave) by applying a low-slope criterion within 20 ms [70,71]:

$$[\ldots] \le Thr_Q \ \text{and}\ |ECG_t - ECG_{t-20ms}| \le 4\,Thr_Q \ \text{and}\ |ECG_t - ECG_{t-10ms}| \le 3\,Thr_Q \ \text{and}\ |ECG_{t-10ms} - ECG_{t-20ms}| \le 3\,Thr_Q \tag{1}$$

where $ECG_t$ denotes the ECG sample at time $t \in [R - 120\,ms;\ R]$, and $Thr_Q$ = 20 µV is the default value of the threshold slope for Q-wave detection, which is incremented by 1 µV until all conditions in (1) are satisfied.
The peak of the P-wave (non-AF) or f-wave (AF) is searched within a physiologically reasonable interval before the Q-wave, $t \in [Q - 300\,ms;\ Q - 40\,ms]$, applying criteria for a high-slope deflection wave within 80 ms; the relevance of this was proven in a competitive study of the PhysioNet/Computing in Cardiology Challenge 2017 [72]. The specific conditions are as follows:

$$|ECG_t - ECG_{t-40ms}| > 4\,Thr_P \ \text{and}\ |ECG_t - ECG_{t+40ms}| > Thr_P \ \text{and}\ |ECG_t - ECG_{t-20ms}| > Thr_P \ \text{and}\ |ECG_t - ECG_{t+20ms}| > Thr_P/2 \tag{2}$$

where $Thr_P$ is the threshold slope for P-/f-peak detection with an empirically defined value of 3 µV. If conditions (2) are not jointly satisfied for any t in the defined search interval, then no P-/f-peak is detected before the Q-wave. This scenario is illustrated in Figure 3a for the QRS complex between the 7th and 8th second, where a supposed artifact altered the P-peak. We note that all other beats in Figure 3 present a sufficiently well-detected series of the three characteristic points (R-wave, Q-wave, and P-/f-peak).
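The P-/f-peak search defined by conditions (2) can be sketched in Python as follows. This is a minimal illustration, not the embedded delineation algorithm: the function name `find_p_peak` is hypothetical, the signal is assumed to be a list of samples in µV, and the rule for choosing among several samples that satisfy (2) (here, the largest summed deflection against the ±40 ms neighbours) is our assumption, since the text does not specify it.

```python
def find_p_peak(ecg_uv, q_idx, fs=500, thr_p=3.0):
    """Return the sample index of the P-/f-peak before QRS onset q_idx, or None.

    ecg_uv : list of ECG samples in microvolts (one lead)
    q_idx  : sample index of the detected QRS onset (Q)
    fs     : sampling rate in Hz; thr_p : slope threshold in microvolts
    """
    def ms(duration_ms):
        return int(round(duration_ms * fs / 1000))

    d20, d40 = ms(20), ms(40)
    # Search interval [Q - 300 ms, Q - 40 ms], clipped to valid indices
    start = max(q_idx - ms(300), d40)
    stop = min(q_idx - ms(40), len(ecg_uv) - 1 - d40)

    best_idx, best_score = None, 0.0
    for t in range(start, stop + 1):
        left40 = abs(ecg_uv[t] - ecg_uv[t - d40])
        right40 = abs(ecg_uv[t] - ecg_uv[t + d40])
        left20 = abs(ecg_uv[t] - ecg_uv[t - d20])
        right20 = abs(ecg_uv[t] - ecg_uv[t + d20])
        # All four high-slope conditions of (2) must hold together
        if (left40 > 4 * thr_p and right40 > thr_p
                and left20 > thr_p and right20 > thr_p / 2):
            score = left40 + right40  # tie-break rule assumed for this sketch
            if score > best_score:
                best_idx, best_score = t, score
    return best_idx
```

A flat segment before the Q-wave returns `None`, which corresponds to the case where the beat is left out of the PQ statistics.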

ECG Features
The key focus for simple AF detection is on three diagnostic ECG features, representative of the ventricular rate (RR-interval: RRi), AV synchronization time (PQ-interval: PQi), and P-/f-wave amplitude (PQ-amplitude: PQa), statistically evaluated as mean values and standard deviations (std) over the total available ECG record length (5-144 s). The list of computed features is as follows:
• RRi-mean and RRi-std (two global features) are computed as the mean and std values of the distances between consecutive R-wave fiducial points [73], detected in this study in reference lead I. If QRS detection is assumed correct, its application to any other lead is expected to give the same estimation of the heart rate; therefore, RRi-mean and RRi-std are considered global features.
• PQi-mean and PQi-std (two lead-specific features) are computed as the mean and std values of the time distances from P-/f-peaks to subsequent Q-waves in each of the 12 ECG leads. We note that the computed PQi-mean value differs from the standard definition of the PQ interval between the beginning of the P-wave and the beginning of the Q-wave [6]. The reason for this is that the embedded automatic delineation algorithm (Equation (2)) detects the most characteristic P-/f-peak more reliably than the wandering onset of the P-/f-wave. Although the computed PQ interval would be slightly shorter than the defined normal ranges [6], we consider its reliable measurement an important requirement for the proper investigation of the AF predictive potential of this feature.
• PQa-mean and PQa-std (two lead-specific features) are computed as the mean and std values of the amplitude differences between P-/f-peaks and subsequent Q-waves in each of the 12 ECG leads. We consider the PQa-mean value to represent the largest deflections of atrial electrical activity discernible before QRS onset, and it is therefore representative of the P-/f-wave amplitude. We believe that the detection of the Q-wave reference level is more reliable than searching for the P-wave onset, as defined in the standard P-wave amplitude measurement [6].
We note that if a P-/f-peak is not detected before a Q-wave, then the respective measurements are not included in the statistical computations of PQi-mean, PQi-std, PQa-mean, and PQa-std.
The computation of four lead-specific features (PQi-mean, PQi-std, PQa-mean, and PQa-std) for each of the 12 ECG leads gives the opportunity to apply the concept of varying dimensions in electrocardiography. According to this concept, AF rhythm diagnosis could be made possible according to the lead availability in an arbitrary clinical setting, e.g., one-lead (using any of 12 leads by means of 6 features: 2RRi + 2(PQi + PQa)), six-lead (using limb or chest leads by means of 26 features: 2RRi + 6 × 2(PQi + PQa)), or twelve-lead (using all 12 leads by means of 50 features: 2RRi + 12 × 2(PQi + PQa)).
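The feature definitions and the feature counts for the different lead sets can be checked with a small Python sketch. The helper names are hypothetical, the use of the population standard deviation (`statistics.pstdev`) is an assumption (the text does not state which estimator was used), and the sign convention for PQa (P-/f-peak amplitude minus Q-wave amplitude) is also assumed.

```python
from statistics import mean, pstdev


def mean_std(values):
    """Mean and standard deviation of a list; (None, None) if it is empty."""
    if not values:
        return (None, None)
    return (mean(values), pstdev(values))


def rr_features(r_idx, fs=500):
    """RRi-mean and RRi-std in ms from R-wave sample indices."""
    rr_ms = [(b - a) * 1000.0 / fs for a, b in zip(r_idx, r_idx[1:])]
    return mean_std(rr_ms)


def pq_features(ecg_uv, beats, fs=500):
    """PQi-mean, PQi-std (ms) and PQa-mean, PQa-std (uV) for one lead.

    beats is a list of (p_idx, q_idx) pairs; p_idx is None when no
    P-/f-peak was detected, and such beats are skipped.
    """
    valid = [(p, q) for p, q in beats if p is not None]
    pqi = [(q - p) * 1000.0 / fs for p, q in valid]
    pqa = [ecg_uv[p] - ecg_uv[q] for p, q in valid]
    return mean_std(pqi) + mean_std(pqa)


def n_features(n_leads):
    """2 global RRi features + 2 x (PQi, PQa) statistics per lead."""
    return 2 + n_leads * 2 * 2
```

With this counting, `n_features(1)`, `n_features(6)`, and `n_features(12)` give 6, 26, and 50, matching the one-lead, six-lead, and twelve-lead feature sets.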

DenseNet Model Design
The study design in Figure 2 follows the concept of varying dimensions in electrocardiography for binary AF/non-AF rhythm classification, which is related to the optimal design of various dense NN topologies using input with different sets of leads, namely DenseNet-SingleLeads for one lead, DenseNet-LimbLeads for six limb leads, DenseNet-ChestLeads for six chest leads, and DenseNet-12Leads for all standard 12 leads. All DenseNet models have a common architecture, which is schematically drawn in Figure 4, and can be configured as follows:
• Batch normalization (BN) layer: a regularization technique that is known to accelerate training [74]. In our model, BN is applied to standardize each input feature x by removing the mean and scaling to unit variance, $x_{BN} = (x - \mathrm{mean})/\mathrm{std}$, for each mini-batch. The BN transform layer $BN_{\gamma,\beta} \equiv \gamma x_{BN} + \beta$ learns two trainable parameters (γ, β) for each input feature x.
• Hidden dense layers: a sequence of hidden dense layers for feature fusion and multilevel abstraction of feature maps [75]. One dense layer neuron processes the information of the feature vector x according to the transform:

$$y = \mathrm{ReLU}(Wx + b) = \max(0,\ Wx + b) \tag{3}$$

where W and b are, respectively, the kernel weights matrix and the bias of the neuron to which a rectified linear unit activation function is applied.

The following settings of the hidden dense layers are considered reasonable to configure the network depth and width:
- One to three hidden layers can be allocated.
- The number of neurons in one dense layer cannot be larger than the number of neurons in the previous dense layer, limiting DenseNet to a shrinking architecture.
- The number of neurons in a hidden dense layer is selected from a predefined list of values.
• Output layer: a dense layer with one neuron and sigmoid activation function, giving the probability of the feature vector x belonging to the AF class in the range [0; 1]:

$$P(x \in AF) = \mathrm{sigmoid}(Wx + b) = \frac{1}{1 + e^{-(Wx + b)}} \tag{4}$$

Although not depicted in the general network topology in Figure 4, a dropout regularization layer is applied after each hidden dense layer to avoid over-fitting and improve generalization during training. The dropout rate of α = 0.3 was adopted because it is a common setting applied effectively in several studies [27,76].
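The layer sequence described above (input standardization, shrinking ReLU hidden layers, one sigmoid output neuron) can be sketched as an inference-time forward pass in plain Python. This is an illustrative sketch only and not the Keras implementation used in the study: all function names are hypothetical, the uniform initialization range of ±0.05 is an assumption, and dropout is omitted because it is inactive at inference.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def batch_norm(x, mean, std, gamma, beta):
    """Standardize each feature, then apply the trainable scale and shift."""
    return [g * (xi - m) / s + b for xi, m, s, g, b in zip(x, mean, std, gamma, beta)]


def dense(x, weights, biases, activation):
    """One dense layer: activation(W x + b), one row of W per neuron."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng, limit=0.05):
    """Random uniform kernel weights and zero biases (range is assumed)."""
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_densenet(n_inputs, hidden, seed=0):
    """hidden is [1st, 2nd, 3rd] neuron counts; 0 means the layer is absent."""
    rng = random.Random(seed)
    sizes = [n_inputs] + [h for h in hidden if h > 0]
    layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
    output = init_layer(sizes[-1], 1, rng)
    return layers, output


def predict(model, x_bn):
    """Probability of the AF class for one standardized feature vector."""
    layers, (w_out, b_out) = model
    h = x_bn
    for weights, biases in layers:
        h = dense(h, weights, biases, relu)
    return dense(h, w_out, b_out, sigmoid)[0]
```

For instance, `build_densenet(50, [128, 8, 8])` mirrors the shape of the twelve-lead model, while `build_densenet(6, [16, 16, 0])` yields two hidden layers for the single-lead case.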

DenseNet Model Training
For the training phase, the kernel weights were set by a random uniform initializer, and the DenseNet model was fitted by the 'Adam' optimizer with a default learning rate of 0.001 and exponential decay rates of β1 = 0.9 (first moment) and β2 = 0.999 (second moment). 'Adam' optimizes the parameters θ of the network in order to minimize the loss:

$$\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} Loss(x_n, \theta) \tag{5}$$

where $x_n$ is the feature vector of sample n in the training dataset (or mini-batch) of N samples, and Loss is computed as a weighted binary cross entropy due to the notable imbalance between AF and non-AF classes:

$$Loss(x_n, \theta) = -\left[ w_{AF}\, \delta_n \log P(x_n \in AF) + w_{non\text{-}AF}\, (1 - \delta_n) \log\left(1 - P(x_n \in AF)\right) \right] \tag{6}$$

where:
- $\delta_n$ is a binary indicator function, which is equal to 1 if the training sample $x_n$ belongs to the AF class; otherwise $\delta_n = 0$;
- $w_{AF}$ and $w_{non\text{-}AF}$ are the weights for the AF and non-AF classes, respecting the condition $w_{AF} + w_{non\text{-}AF} = 1$. Considering the proportion of about 8% AF to 92% non-AF in the training database (Table 1), the class weights were configured to penalize the larger class and were computed as the normalized reciprocal of the class prevalence, i.e., $w_{AF} = 0.92$, $w_{non\text{-}AF} = 0.08$.
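The class weighting can be made concrete with a short Python sketch of a weighted binary cross entropy. The function names are hypothetical, averaging the loss over the samples is an assumption, and the clipping constant `eps` is added only to keep the logarithm finite.

```python
import math


def class_weights(prevalence_af):
    """Reciprocal class prevalences, normalized so the two weights sum to 1."""
    inv_af = 1.0 / prevalence_af
    inv_non = 1.0 / (1.0 - prevalence_af)
    total = inv_af + inv_non
    return inv_af / total, inv_non / total


def weighted_bce(probs, labels, w_af, w_non, eps=1e-7):
    """Mean weighted binary cross entropy.

    probs  : predicted probabilities of the AF class
    labels : 1 for AF records, 0 for non-AF records
    """
    total = 0.0
    for p, d in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= w_af * d * math.log(p) + w_non * (1 - d) * math.log(1.0 - p)
    return total / len(probs)
```

With an AF prevalence of 0.08, `class_weights` returns approximately (0.92, 0.08), so a missed AF record contributes about 11.5 times more to the loss than an equally confident false alarm on a non-AF record.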
The model with minimal loss over the validation set, trained for a maximum of 400 epochs, was stored in an HDF5 file. Early stopping was activated if the loss did not improve for more than 10 epochs. The DenseNet models were implemented in Python using Keras with a TensorFlow backend. The training was run on a PERSY Stinger workstation with two Intel Xeon Silver 4214R CPUs at 2.4 GHz, 96 GB RAM, and an NVIDIA RTX A5000 24 GB GPU.

Performance Evaluation
The performance of trained DenseNet models was estimated by benchmark metrics of sensitivity (Se), specificity (Sp), balanced accuracy (BAC), and F1 score:

$$Se = \frac{TP}{TP + FN} \tag{7}$$

$$Sp = \frac{TN}{TN + FP} \tag{8}$$

$$BAC = \frac{Se + Sp}{2} \tag{9}$$

$$F1 = \frac{2\,TP}{2\,TP + FP + FN} \tag{10}$$

where TP and FN are the true positive and false negative detections for the AF class, and TN and FP are the true negative and false positive detections for the non-AF class. We note that BAC and F1 scores are common performance metrics for classifying imbalanced data. Furthermore, BAC → max is a representative point in the receiver operating characteristic (ROC) curve, giving the highest averaged performance of both classes and corresponding to the highest pair (Se, Sp) → max [76]. Therefore, the output probability threshold for detection of AF ($x \equiv AF$ if $P(x \in AF) > P_{thr}$) is set at the ROC operating point that maximizes BAC on the validation dataset ($P_{thr}$: $BAC_{validation}$ → max). The top-ranked models by $BAC_{validation}$, with their respective $P_{thr}$ values, were selected for further independent evaluation of test set performance (Se, Sp, BAC, and F1 score).
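The metrics and the choice of the probability threshold at the maximal validation BAC can be sketched in Python as follows. This is a minimal illustration with hypothetical function names; scanning a fixed grid of 99 thresholds is an assumption of the sketch and stands in for reading the operating point from the ROC curve.

```python
def confusion(probs, labels, thr):
    """Count TP, FN, TN, FP when a record is called AF if its probability > thr."""
    tp = fn = tn = fp = 0
    for p, y in zip(probs, labels):
        predicted_af = p > thr
        if y == 1:
            tp += predicted_af
            fn += not predicted_af
        else:
            fp += predicted_af
            tn += not predicted_af
    return tp, fn, tn, fp


def metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, balanced accuracy, and F1 score."""
    se = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    return {"Se": se, "Sp": sp, "BAC": (se + sp) / 2, "F1": f1}


def best_threshold(probs, labels, steps=100):
    """Return (threshold, BAC) with maximal BAC over a uniform threshold grid."""
    best_thr, best_bac = None, -1.0
    for i in range(1, steps):
        thr = i / steps
        bac = metrics(*confusion(probs, labels, thr))["BAC"]
        if bac > best_bac:
            best_thr, best_bac = thr, bac
    return best_thr, best_bac
```

In practice, `best_threshold` would be applied to the validation probabilities, and the returned threshold would then be held fixed when `metrics` is computed on the test set.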

Feature Importance
Estimation of input feature importance is essential to understanding the decision-making process in DenseNet hidden layers. We were interested in interpreting the feature map learned by the top-ranked DenseNet-12Leads model, which used the full set of 50 input features for AF/non-AF classification. This interpretation was expected to highlight the most reliable ECG characteristics among heart rate and its variability, AV synchronization, and P-/f-wave amplitude and its stability in each of the 12 ECG leads.
We implemented SHAP [77], one of the most widely used methods for explaining individual predictions of various machine learning classifiers, based on coalition game theory [78]. SHAP calculates the local feature importance for every observation as the average marginal contribution of a feature across all possible combinations of features (i.e., all possible coalitions):

$$SHAP\,value_{F_i} = \sum_{S \subseteq F \setminus \{F_i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f(S \cup \{F_i\}) - f(S) \right] \tag{11}$$

where F is the full set of input features, S is a coalition of features that does not include $F_i$, and f(·) is the model output for the given coalition. A SHAP value$_{F_i}$ could be positive or negative, depending on the estimated contribution of the feature $F_i$ to the detection of either the positive class (AF) or the negative class (non-AF), respectively. The larger the absolute SHAP value$_{F_i}$, the greater is the contribution of the feature $F_i$ to the DenseNet output probability in Equation (4). However, the SHAP value in Equation (11) is only interpretable in the context of a specific ECG record. A global estimation of the importance of feature $F_i$ for all N records in the test set can be calculated as an average of the absolute SHAP values in individual records:

$$SHAP\,global_{F_i} = \frac{1}{N} \sum_{n=1}^{N} \left| SHAP\,value_{F_i}(n) \right| \tag{12}$$
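The coalition-based definition of the local and global importance can be illustrated with an exact, brute-force computation for a small number of features. This Python sketch is not the SHAP library and is not feasible for 50 features; the function names are hypothetical, and replacing absent features by a fixed baseline vector is an assumption that stands in for the background expectation used by SHAP.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of model f for one record x.

    Features outside a coalition take their value from `baseline`.
    """
    n = len(x)

    def value(coalition):
        z = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def global_importance(per_record_phi):
    """Mean absolute Shapley value of each feature over all records."""
    n_records = len(per_record_phi)
    n_features = len(per_record_phi[0])
    return [sum(abs(rec[i]) for rec in per_record_phi) / n_records
            for i in range(n_features)]
```

For a linear model, each Shapley value reduces to the feature weight times the distance of the feature from its baseline, which gives a convenient sanity check of the implementation.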

DenseNet Model Optimization
According to the DenseNet model design settings in Section 2.4, we applied a grid search of all feasible combinations of the depths and widths of hidden dense layers, yielding a total of 19 (DenseNet-SingleLeads), 55 (DenseNet-LimbLeads), 55 (DenseNet-ChestLeads), and 83 (DenseNet-12Leads) architectures. Each architecture was trained and validated in five independent runs, resulting in a total of 95 (DenseNet-SingleLeads), 275 (DenseNet-LimbLeads), 275 (DenseNet-ChestLeads), and 415 (DenseNet-12Leads) trained models. The validation BAC of all DenseNet runs is illustrated in Figures 5-8. The analysis of each lead configuration is presented below in terms of the BAC range over all runs and the selection of the optimal DenseNet architecture, denoted by the number of neurons in sequential hidden layers [1st, 2nd, 3rd]:
• DenseNet-SingleLeads models with six input features from a single lead (Figure 5) presented validation BAC in the range 86.8-91.1%, reported as an average value for all 12 ECG leads, where each lead was evaluated as an independent input. The best performance was observed for all architectures with two hidden dense layers with 16 neurons in the first layer. Our choice for the optimal model with high-ranked BAC = 91.1% was DenseNet-SingleLeads [16,16,0].
• DenseNet-LimbLeads models with 26 input features from limb leads (Figure 6) presented validation BAC in the range 92-94.7%. Generally, the best performances (>94.2%) were observed for two and three hidden layer architectures with ≥32 neurons in the first layer. Our choice for the optimal model with high-ranked BAC = 94.7% was DenseNet-LimbLeads [32,32,32].
• DenseNet-ChestLeads models with 26 input features from chest leads (Figure 7) presented validation BAC in the range 91.8-94.6%. Generally, the best performances (>93.8%) were observed for two and three hidden layer architectures with ≥32 neurons in the first layer. Our choice for the optimal model with high-ranked BAC = 94.6% was DenseNet-ChestLeads [32,32,4].
Table 2 presents the validation and test performance of the four selected optimal architectures in Figures 5-8. Here, the single-lead model (DenseNet-SingleLeads [16,16,0]) is tested with 6 lead-specific features for each of 12 leads. Its summary performance on the total of lead-specific features is also presented. It is compared with the validation and test performances of the limb, chest, and twelve-lead models with larger numbers of input features (26 for the six-lead sets and 50 for all twelve leads). In Table 2, the symbol (-) indicates that validation performance was not calculated for specific single leads because total single leads were evaluated as part of the global validation dataset during the training process.
Figure 9 presents the test performance of DenseNet-SingleLeads [16,16,0] when the model was evaluated for single leads (I, II, III, aVR, aVL, aVF, and V1-V6), giving an overview of the lead-specific importance for AF detection. The most efficient are the primary ECG leads (I, II), which achieve the highest single-lead BAC = 89.5% (Se = 87.9-88.3%, Sp = 90.5-91.5%). The third top-ranked lead was aVF, presenting a performance drop of approximately 0.5% points compared with primary leads I and II. Chest leads V2-V5 had limited Sp < 88%, whereas V1 and V6 had limited Se < 87.5%. The use of multi-lead sets of peripheral or chest leads improved AF detection performance compared with the most powerful single lead, i.e., all peripheral leads improved up to 2.9% points more than reference lead II (Se = 87.9 vs. 90.8%, Sp = 91.5 vs. 94.1%); the improvement for all chest leads ranged from 3 to 5% points compared with reference lead V5 (Se = 89 vs. 92%, Sp = 87.8 vs. 93.1%). The use of all 12 ECG leads for AF diagnosis did not further improve Se. However, the 12 leads did considerably improve Sp = 95.8%, which was 2 and 3% points larger compared with the limb or chest lead set, respectively. The latter suggests that the multi-lead view of the P-wave pattern in both frontal and horizontal planes is important for detection of the variety of rhythm and conduction disturbances in the non-AF class.
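The grid sizes reported for the architecture search (19, 55, and 83) can be reproduced by enumerating all shrinking architectures with one to three hidden layers. The Python sketch below is illustrative: the function name is hypothetical, and the candidate neuron counts (powers of two from 4 upward) are an assumption, since only the number of candidate values (3, 5, and 6) is needed to match the reported totals.

```python
from itertools import combinations_with_replacement


def shrinking_architectures(widths, max_layers=3):
    """All [1st, 2nd, 3rd] layouts with non-increasing neuron counts.

    Layers that are not allocated are written as 0, e.g. [16, 16, 0].
    """
    ordered = sorted(widths, reverse=True)
    architectures = []
    for depth in range(1, max_layers + 1):
        # Combinations of a descending list are themselves non-increasing
        for combo in combinations_with_replacement(ordered, depth):
            architectures.append(list(combo) + [0] * (max_layers - depth))
    return architectures


# Hypothetical candidate lists; the actual values are not given in the text
SINGLE_LEAD_WIDTHS = [4, 8, 16]
SIX_LEAD_WIDTHS = [4, 8, 16, 32, 64]
TWELVE_LEAD_WIDTHS = [4, 8, 16, 32, 64, 128]
```

Under these assumed lists, the enumeration yields 19, 55, and 83 architectures, and multiplying by the five independent runs gives the 95, 275, and 415 trained models.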

Test Lead-Set Performance
Figure 9. Test performance (Se, Sp, BAC) for binary detection of AF/non-AF rhythm by the optimal models.
A detailed analysis of the test performance achieved with the complete feature set of the 12-lead ECG model (DenseNet-12Leads [128,8,8]) across six test datasets is presented in Table 3. We note a relatively large span for the Se (66.7 to 98.0%), Sp (88.4 to 97.1%), and F1 score (0.444 to 0.854), where the worst Se and F1 score were seen in the dataset with a limited number of AF records (Georgia 12-Lead ECG Challenge database), whereas the worst Sp was seen in the CPSC2018 training set. The disparity in the performance of a single method across test datasets indicates that the datasets are inconsistent with respect to their rhythm content and AF annotations. Nevertheless, reporting on multiple databases allows for greater generalizability of our results.
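The wide span of F1 scores next to a much narrower span of Sp can be partly understood from how the metrics depend on the share of AF records in a dataset. The following is a minimal sketch, assuming the standard confusion-matrix definitions of Se, Sp, BAC, and F1 with AF as the positive class; the counts are hypothetical and are not taken from Table 3.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Binary confusion counts with AF as the positive class."""
    tp: int  # AF records detected as AF
    fn: int  # AF records detected as non-AF
    tn: int  # non-AF records detected as non-AF
    fp: int  # non-AF records detected as AF


def sensitivity(c: Confusion) -> float:
    return c.tp / (c.tp + c.fn)


def specificity(c: Confusion) -> float:
    return c.tn / (c.tn + c.fp)


def balanced_accuracy(c: Confusion) -> float:
    return 0.5 * (sensitivity(c) + specificity(c))


def f1_score(c: Confusion) -> float:
    return 2 * c.tp / (2 * c.tp + c.fp + c.fn)


# Hypothetical datasets of 1000 records each, both with Se = Sp = 0.90,
# differing only in the number of AF records (200 vs. 20).
af_rich = Confusion(tp=180, fn=20, tn=720, fp=80)
af_poor = Confusion(tp=18, fn=2, tn=882, fp=98)

for name, c in [("AF-rich", af_rich), ("AF-poor", af_poor)]:
    print(f"{name}: Se={sensitivity(c):.3f} Sp={specificity(c):.3f} "
          f"BAC={balanced_accuracy(c):.3f} F1={f1_score(c):.3f}")
```

With identical Se and Sp, the F1 score of the hypothetical AF-poor dataset is much lower because the false positives from the large non-AF group outnumber the few true positives. BAC is unaffected by this imbalance, which is one reason to read F1 alongside Se and Sp when datasets differ in AF content.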

Test Feature Importance
This section describes our analysis of the 50 features of AV synchronization that were studied for their importance to AF/non-AF detection, evaluated on the test set. This provides a comprehensive overview of the decision-making process of the optimal DenseNet model, which is trained to most effectively combine information from all 12 ECG leads (DenseNet-12Leads [128,8,8]).
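The percentages quoted for the SHAP global metric can be read as importances relative to the strongest feature. The following is a minimal sketch, assuming that the global metric is the mean absolute SHAP value over the test records, normalized to its maximum. The feature ordering, the function names, and the synthetic SHAP matrix are hypothetical and serve only to show the arithmetic; the count of 50 features is assumed to arise as two RR features plus four PQ features in each of the 12 leads.

```python
import random

LEADS = ["I", "II", "III", "aVR", "aVL", "aVF",
         "V1", "V2", "V3", "V4", "V5", "V6"]
PQ_METRICS = ["PQi-mean", "PQi-std", "PQa-mean", "PQa-std"]


def feature_names():
    """2 RR features + 4 PQ features per lead = 50 names (assumed layout)."""
    names = ["RRi-mean", "RRi-std"]
    names += [f"{metric}({lead})" for metric in PQ_METRICS for lead in LEADS]
    return names


def relative_importance(shap_rows):
    """Mean |SHAP| per feature over all records, as a percentage of the maximum."""
    n_records = len(shap_rows)
    n_features = len(shap_rows[0])
    mean_abs = [sum(abs(row[j]) for row in shap_rows) / n_records
                for j in range(n_features)]
    top = max(mean_abs)
    return [100.0 * value / top for value in mean_abs]


# Synthetic stand-in for a (records x features) SHAP matrix; NOT study data.
rng = random.Random(0)
names = feature_names()
scales = [rng.uniform(0.01, 1.0) for _ in names]
shap_rows = [[rng.gauss(0.0, s) for s in scales] for _ in range(200)]

importance = relative_importance(shap_rows)
ranking = sorted(zip(names, importance), key=lambda pair: pair[1], reverse=True)
for name, percent in ranking[:5]:
    print(f"{name:15s} {percent:5.1f}%")
```

Under this normalization the top-ranked feature always scores 100%, so the remaining percentages describe importance relative to that feature rather than an absolute share of the model output.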
- Lead aVL is ranked third for one feature: PQi-mean (50%).
• Feature importance (Figure 10c):
- PQi-std is the most important feature, mostly due to its high SHAP global metric in leads V2 (55%), aVF (52%), and II (43%).
One should interpret the feature importance results considering the coalition game theory used in SHAP, which highlights the unique contribution of a feature to an output. This may underestimate a PQ feature measured in a set of correlated leads or features.
An important question was about the cause of the errors (FP and FN) in NN-driven AF detection. In Figure 11, we address this question in a statistical overview of the test set. Particularly, the feature values of the 14 strongest AF predictors in 12-lead ECG (highlighted in Figure 10a) are statistically evaluated in four groups: FN, TP, TN, and FP. These groups represent the output of the DenseNet-12Leads [128,8,8] model against the reference test set annotations (AF/non-AF). Figure 11 depicts two types of statistical distributions: box plots of the feature values and scatter plots of the feature values vs. their SHAP values.
• PQi-std (V2, aVF, II): There is a clearly visible trend of larger SHAP importance for TP (red dots) with larger PQi-std, associated with the large variance of the PQ interval during the chaotic AV synchronization in AF. PQi variation is most prominent in lead V2 for TP (median PQi-std = 52 ms) compared with its limited value for TN (median PQi-std = 5 ms). FPs are non-AF rhythms with enhanced PQi-std (median PQi-std = 39 ms), whereas FNs are AF rhythms with relatively constant PQi, such as in some AFL (median PQi-std = 10 ms in V2).
• PQi-mean (aVR, aVL, I): The three leads present very similar distributions for TPs (median PQi-mean = 116 ms); however, different SHAP importance is given for low and high values of PQi-mean, i.e., lower PQi-mean values have higher TP importance for the lead aVR, whereas higher PQi-mean values have higher TP importance for leads aVL and I. This phenomenon is due to different PQi-mean distributions in the TN group, i.e., representing a longer PQ duration in the lead aVR (median PQi-mean = 160 ms), and shorter ones in leads aVL and I (median PQi-mean = 80 ms). This could be linked to different times of excitation of the right and left atria during sinus rhythm, shifting when the P-peak is detected in specific ECG leads, i.e., the earliest right atrium activation is detected in lead aVR, followed by the leftward and inferior direction of the activation detected in leads aVL and I. FP errors are non-AF arrhythmias with disturbed timing of the P-wave pattern, appearing with relatively equal PQ intervals in the three leads aVR, aVL, and I (median PQi-mean = 100-120 ms), which overlap with the f-/F-wave timing in AF. Although FN errors present a slightly shorter PQ interval than TP in lead aVL (median PQi-mean = 95 ms vs. 116 ms), such errors cannot be strongly linked to the disturbance of PQi-mean, because overlapping distributions for FN and TP groups are observed in leads aVR and I.

• PQa-std (V1, I, aVR): PQa-std is associated with a larger deviation of PQ amplitudes for TP and FP (median PQa-std = 0.04-0.05 mV) and lower deviations for TN and FN (median PQa-std = 0.01-0.03 mV), although the interquartile ranges overlap considerably between groups. This overlap makes PQa-std unreliable for AF detection by DenseNet, limiting its maximum SHAP global importance to values close to the 25% threshold (Figure 10a). Furthermore, the detailed analysis of the PQa-std (aVR) scatterplot in Figure 11 indicates that the SHAP value is negative for TP, which means that this parameter biases the detection toward the non-AF class. Possible explanations are the low PQ amplitudes and the difficulty in accurately measuring their small deviations.

Figure 11. Statistical analysis of the feature values of 14 strongest AF predictors in 12-lead ECG, categorized according to FP, TP, TN, and FN detections by DenseNet-12Leads [128,8,8] in the test dataset. The statistical distributions are presented as box plots of the feature value (median value, interquartile range, non-outlier range, outliers, and extremes) and scatter plots of the feature value vs. SHAP values underlying its importance.
Figures 12-14 show three examples of 12-lead ECGs and explanations of the respective features involved in the decisions of the DenseNet-12Leads [128,8,8] model. Figure 12 presents an AF record, which is correctly detected with a probability of PAF > 0.99, supported by dominant features with a positive SHAP value in this case, maximal for RR-std, PQi-std(II, V2, aVF, V1), PQa-mean(I, aVR, II), and PQi-mean(aVR).
The next two examples provide insight into labeling inconsistencies and the reasons for erroneous performance of the model. Figure 13 presents a record that was annotated as AF and counted as a false negative error because DenseNet-12Leads detected non-AF rhythm. However, we suggest that this is a wrong AF annotation in a 12-lead ECG trace with normal sinus rhythm, confirmed by the P-waves and synchronized QRS complexes best discernible in lead V1. The probability of a definitive non-AF categorization (PAF = 0.0114) is justified by the dominant number of features with negative SHAP values for this case, most notably including: PQa-mean (I, aVF), PQi-std(aVL), PQi-mean(aVR), and PQa-std(V2). The rhythm is relatively regular; therefore, the feature RRi-std is highly indicative of non-AF, but the rapid heart rate highlights RRi-mean as the single strongest indicator for AF.
The example in Figure 14 is annotated as non-AF and counted as a false positive error because DenseNet-12Leads detects AF rhythm with a high probability of PAF > 0.99, supported by most features with positive SHAP values, most notably including: PQa-mean (aVR, I, II), RRi-std, PQi-std (II, V2, I, aVF), and PQa-std (V1). The authors' observations of an irregular AF rhythm with clearly visible f-waves in leads V1-V4, however, suggest a wrong non-AF annotation.
Figure 13. Example of a false negative error due to wrong AF annotation (file A6226 from database "CPSC2018 training set"): (a) 12-lead ECG (10 s total record duration) with original annotation (AF), which is inconsistent with the authors' observations of a normal sinus rhythm with synchronized P-wave and QRS clearly seen in lead V1; (b) top 35 features ranked by maximal absolute SHAP value of the DenseNet-12Leads [128,8,8]

Discussion
In view of the prominent prospects for early and accurate detection of pathologic cardiac rhythms, with a focus on atrial fibrillation and atrial flutter, we suggest using AV synchronization as an important diagnostic criterion for AF when using one to twelve ECG leads. Standard diagnostic ECG features that represent the heart rate, lead-specific AV conduction time, and P-/f-wave amplitude (RR-interval, PQ-interval, and PQ amplitude, respectively) are calculated by simple rule-based methods, and their beat-to-beat mean values and standard deviations are interpreted by a dense neural network classifier. Although several neural architectures might show similar performance, optimization of the NN depth and width is an essential part of the design process. As shown in Figures 5-8, some models showed decreased validation performance due to the randomness of the training runs, an insufficient number of hidden layers (i.e., all single-layer models), or an excessive number of neurons (related to the large number of training parameters). Therefore, grid search architectural optimization on the validation set is a required task, which led to the selection of the optimally trained DenseNet models with two or three dense layers. Their independent evaluation on the test set showed competitive performance with sensitivity and specificity values (Se, Sp) of: 87.9-88.3 and 90.5-91.5% for DenseNet [16,16,0] with primary leads (I or II), 90.7 and 94.2% for DenseNet [32,32,32] with six limb leads, 92.1 and 93.2% for DenseNet [32,32,4] with six chest leads, and 91.8 and 95.8% for DenseNet [128,8,8] with all 12 leads, although direct comparison with other studies is difficult due to different test sets, rhythm types, and performance metrics. To our knowledge, the PhysioNet CinC Challenge 2021 dataset has not been explored for binary AF/non-AF detection.
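The link between network width and the number of training parameters can be made concrete with a short count of weights and biases for the four selected architectures. This is a sketch under stated assumptions: a zero in the [1st, 2nd, 3rd] notation means that the hidden layer is absent, the output layer is a single neuron, and the input size is two RR features plus four PQ features per lead (which gives the 50 features of the 12-lead model). The function names are hypothetical.

```python
def n_input_features(n_leads: int) -> int:
    """Assumed layout: RRi-mean, RRi-std + 4 PQ features for each lead."""
    return 2 + 4 * n_leads


def dense_param_count(n_inputs: int, hidden: list[int], n_outputs: int = 1) -> int:
    """Weights plus biases of a fully connected network.

    Hidden layers listed with 0 neurons are treated as absent (assumption).
    """
    sizes = [n_inputs] + [h for h in hidden if h > 0] + [n_outputs]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))


# Optimal architectures reported for each lead set.
models = {
    "single lead [16,16,0]": (n_input_features(1), [16, 16, 0]),
    "six limb leads [32,32,32]": (n_input_features(6), [32, 32, 32]),
    "six chest leads [32,32,4]": (n_input_features(6), [32, 32, 4]),
    "twelve leads [128,8,8]": (n_input_features(12), [128, 8, 8]),
}

for name, (n_in, hidden) in models.items():
    print(f"{name}: {n_in} inputs, {dense_param_count(n_in, hidden)} parameters")
```

Under these assumptions the selected models stay within a few hundred to a few thousand trainable parameters, and most parameters of the 12-lead model sit in its wide first layer.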
Furthermore, due to the mixed annotations in some databases, we analyzed the mixed class of AFIB and AFL, although it is known that AFL is clinically …
Table 4 presents a comparative study of the performance of other AF detection methods found in the literature vs. our method. The disparities between the studies are shown in the detailed information on the applied methodology, test procedures, input lead sets, ECG databases, and evaluation accuracy metrics. The comparison should be interpreted with the following provisos:

• The few papers [62,79,80,81,82] reporting results in the Physionet/CinC Challenge 2021 database disclose accuracy metrics separately for AFIB and AFL, whereas Table 4 lists their average values.
• Even when the same datasets are used, direct comparison is not feasible due to the different test procedures, i.e., results are reported on either an independent test set (not used during training) or the validation dataset (total dataset or N-fold cross-validation) used in the NN training process.
The group of studies with single-lead ECG analysis have estimated their performance with several PhysioNet databases (MIT-BIH Atrial Fibrillation, Long-Term AF, MIT-BIH Arrhythmia) [22,23,24,28,30]. Their BAC is 7-11% points higher than that achieved in this study with one ECG lead (95.0-99.4 vs. 88.5%). We note that these databases contain a limited number of long-term Holter recordings and are not representative of the diversity of arrhythmias from the much larger number of patients available in the Physionet/CinC Challenge 2021 database. The effect of performance overestimation on uniform data might be evident in the fairly similar drop in performance (by about 10% points) when one of these methods [24] was cross-tested with the Physionet/CinC Challenge 2017 database (88.7 vs. 98.7%). These results, along with our cross-database test results in Table 3, strongly indicate that performance should be considered in the context of the test dataset used.
Among the numerous participants in the Physionet/CinC Challenge 2021, we were only able to compare those who reported F1 scores for AF detection in addition to the global Challenge score metric. The F1 scores of 12-lead models reported in [62,79] are comparable with this study (0.71, 0.72 vs. 0.77), whereas F1 in [80] is considerably lower (0.53 vs. 0.77). Furthermore, the F1 score of the two-lead model in [82] is comparable with the one-lead model in this study (0.53 vs. 0.55). The interpretation of the G metrics provided in [81] closely corresponds with the defined BAC metric, where this study is superior by 3% points for one-lead (85.4 vs. 88.0%), 6% points for 6-leads (86.7 vs. 92.5%), and 7% points for 12-leads (86.4 vs. 93.8%).
Another important issue is the clinical benefit of diagnostic interpretative software used by general practitioners (GPs) and practice nurses. Table 5 presents the disparity of results in two surveys of GP and practice nurse skills in AF interpretation [83,84], which demonstrate up to a 25% point difference in accuracy due to different human experience. In [83], 42 GPs and 41 practice nurses analyzed 2595 randomly selected ECGs from 25 GP practices and reported Se = 69-86%, Sp = 82-89% (manual interpretation) versus Se = 92%, Sp = 91% (manual plus automated interpretation). In [84], 457 GPs (out of 2239 who had been invited to participate in the study) analyzed 1613 single-lead ECG recordings, and their reported accuracy in AF diagnosis was Se = 91.2%, Sp = 90.4% (GP manual interpretation) versus Se = 93.4%, Sp = 89.2% (GP interpretation after revision of the output of diagnostic interpretative software). Despite the high accuracy of GP manual ECG interpretation, 59% of those who responded to the questionnaire felt positively about the use of an automated algorithm in clinical practice. Moreover, 35% of GPs who refused to participate in the study but completed the questionnaire felt that their ECG reading skills were not sufficient to participate in such a study. Given this uncertainty in manual diagnosis, we suggest that the presented algorithm may be useful in supporting manual diagnosis, given its improved sensitivity by 2-12% points (GPs) and 15-19% points (nurses), and improved specificity by 2-11% points (GPs) and 6-10% points (nurses) in [83]. Moreover, our algorithm presented the same sensitivity as, and a 4.7% point higher specificity than, the combination of GP and diagnostic interpretative software in [83].
This study justifies the importance of AV synchronization for AF prediction, using only three parameters (RR-interval, PQ-interval, and PQ-amplitude), whereas other state-of-the-art studies (Table 4) focus on complex feature maps (e.g., raw ECG data, Fourier spectrum, encoded-decoded heartbeats, short-term temporal ECG modulation features from scattering transforms, etc.) and many diagnostic criteria (e.g., classical interpretation methods). Such complex approaches are justified if many arrhythmias are to be classified, but the few parameters examined in this study were sufficient to detect AF rhythms. It is worth noting the advantage of the implemented rule-based feature extraction methods, which have high computational efficiency in portable systems.
The focus of our study was on feature importance analysis, which was derived from SHAP evaluation of the decision-making process in the 12-lead DenseNet model, summarized as statistical distributions on the test set (Figures 10 and 11), as well as a case overview of several ECG examples (Figures 12-14). This analysis is important for understanding the strongest AF predictors found in the hidden layers of 12-lead DenseNet, as well as the causes of correct (TP and TN) and false detections (FN and FP), for a more comprehensive interpretation from a cardiologist's point of view. The comparative estimation of 50 AV synchronization features in Figure 10a strongly indicates that ventricular synchronization (estimated as mean value and standard deviation of all beat-to-beat intervals in a record) is the most valuable indicator for detecting AF, with a relative importance in the range of 84-100%. Overall, atrial synchronization was noted to be much less important, with only twelve metrics reaching appreciable levels of 25-60%, i.e., the mean PQ amplitude in leads (aVR, I, II), PQ interval deviation in leads (V2, aVF, II), mean PQ interval in leads (aVL, aVR, I), and mean PQ amplitude deviation in leads (V1, I, aVR). Based on the statistical graphs in Figure 11, these features can be comprehensively linked to common AF behavior such as rapid heart rate and enhanced heart rate variability, as well as the typical P-wave pattern behavior in sinus rhythm with the earliest negative P-peak deflection of the right atrium seen in the lead (aVR), positive P-peak deflection in leads (I, II), and late left atrium activation in lateral leads (aVL, I). Furthermore, the interquartile range of the PQ-interval deviation in V2 and inferior leads (aVF, II) was found to be prominently different between TN with synchronized atrial activation (1-20 ms) and TP with chaotic AV synchronization (30-60 ms).
Although highlighted in leads (V1, I, aVR), the least important PQ characteristic was the mean PQ amplitude deviation, which showed considerable overlap between AF and non-AF rhythms. We attribute this to the low PQ amplitudes and the difficulty in accurately measuring their small deviations.
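When reading these rankings, it helps to recall that a Shapley-type attribution shares credit among inputs that carry the same information, which is why a PQ feature measured in several correlated leads may appear less important per lead. The sketch below computes exact Shapley values by brute force for a hypothetical toy game; it does not use the DenseNet model or the SHAP library, and the payoff functions are invented solely to illustrate the effect.

```python
from itertools import permutations
from math import factorial


def shapley_values(value, players):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            extended = coalition | {p}
            totals[p] += value(extended) - value(coalition)
            coalition = extended
    n_orderings = factorial(len(players))
    return {p: total / n_orderings for p, total in totals.items()}


# Hypothetical payoff: RR irregularity adds 0.5; PQ irregularity adds 0.5
# as soon as it is known from at least one lead (fully redundant leads).
def game_two_pq_leads(coalition):
    rr = 0.5 if "RRi-std" in coalition else 0.0
    pq = 0.5 if coalition & {"PQi-std(V2)", "PQi-std(II)"} else 0.0
    return rr + pq


# Same payoff when only one PQ lead is available.
def game_one_pq_lead(coalition):
    rr = 0.5 if "RRi-std" in coalition else 0.0
    pq = 0.5 if "PQi-std(V2)" in coalition else 0.0
    return rr + pq


one_lead = shapley_values(game_one_pq_lead, ["RRi-std", "PQi-std(V2)"])
two_leads = shapley_values(
    game_two_pq_leads, ["RRi-std", "PQi-std(V2)", "PQi-std(II)"])

print("one PQ lead: ", one_lead)
print("two PQ leads:", two_leads)
```

In this toy game the PQ feature receives half of the total attribution when it is the only PQ input, but only a quarter per lead once a redundant second lead is added, while the attribution of the RR feature is unchanged. The same mechanism could lower the apparent per-lead importance of PQ features in the 12-lead model.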
Although RR-intervals, PQ-intervals, and PQ amplitudes in 12-lead ECGs collectively contributed to an AF detection performance (BAC) of 93.8%, there may be other AV synchronization features or ventricular polarization and depolarization features that could additionally improve diagnostic accuracy, e.g., reducing FP among the variety of non-AF arrhythmias that present disturbed AV synchronization. This is an area for future research.

Conclusions
Despite the very distinctive AF behavior of atrial and ventricular irregularities, the automated detection of these rhythms remains a challenging task, considering the potential paroxysmal occurrences of AF and possibility of coexistence with other arrhythmias in high-risk patients. In view of the significant burden associated with AF complications, for both patients and healthcare systems, early AF diagnosis is of crucial importance. Therefore, attempts to detect AF should be extended beyond the clinical setting to long-term ECG monitoring techniques that can identify even brief and asymptomatic AF episodes, as well as preventive examinations being carried out by GPs and practice nurses instead of only expert cardiologists.
In view of prominent perspectives for early and accurate AF detection, the major contributions of this study are:

• A few comprehensive measures of AV synchronization, related to the mean and standard deviation of the heart rate, AV conduction time, and P-/f-wave amplitude (RR-interval, PQ-interval, and PQ amplitude, respectively) in 12-lead ECGs were shown to be feasible for AF detection.
• Advanced NN classifiers with one to three hidden dense layers and up to 128 neurons per layer were optimized to detect AF with input features from one, six, and twelve ECG leads.
• Performance generalizability was demonstrated using independent datasets for training (50,332 records), validation (14,235 records), and test (6978 records), part of the six largest PhysioNet CinC Challenge 2021 databases, which were rich in data from healthy controls and patients showing various arrhythmias, including AF.
• We elucidated the decision-making process of the DenseNet model by the SHAP method and highlighted the 14 most important AF predictors. Statistical analysis of their distributions comprehensively explained the causes of correct (TP, TN) and false detections (FN, FP).
Although performance was limited to the AF diagnostic potential of the few studied AV synchronization features, these features could be easily measured using portable ECG devices, and would correctly alert for AF in 87.6% (single-lead model) to 91.8% (12-lead model) of cases at the cost of 11.5% and 4.2% false positive alarms, respectively. Such early warning could reduce the complications of AF, which typically remains untreated for a long time.
Other clinical benefits include the SHAP value importance, which can be used in combination with a human expert's ECG interpretation to change the focus of the expert eye from broad observation of 12-lead ECG morphology to only a few AV synchronization findings that are strongly predictive for AF or non-AF arrhythmias. Their diagnostic value may outperform GP and practice nurse AF diagnoses by 2-19% points and may reduce false AF alarms by 2-11% points, according to reports of some clinical surveys (Table 5). The deduced results of this study are representative of AV synchronization findings across a broad taxonomy of cardiac arrhythmias in large 12-lead ECG databases.
In conclusion, given that AF is the most common arrhythmia, has a major impact on morbidity and mortality, and can manifest asymptomatically in physically active patients, its automatic screening is of high priority. In this regard, the proposed AF detection technique based on a neural network classifier and analysis of a physiologically reasoned and intuitive input feature set has the potential of a screening utility or decision-support application in clinical practice. The direct clinical impact of such a novel technology will require further investigation in prospective clinical studies.