Deep Neural Networks for ECG-Based Pulse Detection during Out-of-Hospital Cardiac Arrest

The automatic detection of pulse during out-of-hospital cardiac arrest (OHCA) is necessary for the early recognition of the arrest and the detection of return of spontaneous circulation (end of the arrest). The only signal available in every single defibrillator and valid for the detection of pulse is the electrocardiogram (ECG). In this study we propose two deep neural network (DNN) architectures to detect pulse using short ECG segments (5 s), i.e., to classify the rhythm into pulseless electrical activity (PEA) or pulse-generating rhythm (PR). A total of 3914 5-s ECG segments, 2372 PR and 1542 PEA, were extracted from 279 OHCA episodes. Data were partitioned patient-wise into training (80%) and test (20%) sets. The first DNN architecture was a fully convolutional neural network, and the second architecture added a recurrent layer to learn temporal dependencies. Both DNN architectures were tuned using Bayesian optimization, and the results for the test set were compared to state-of-the art PR/PEA discrimination algorithms based on machine learning and hand crafted features. The PR/PEA classifiers were evaluated in terms of sensitivity (Se) for PR, specificity (Sp) for PEA, and the balanced accuracy (BAC), the average of Se and Sp. The Se/Sp/BAC of the DNN architectures were 94.1%/92.9%/93.5% for the first one, and 95.5%/91.6%/93.5% for the second one. Both architectures improved the performance of state of the art methods by more than 1.5 points in BAC.


Introduction
Out-of-hospital cardiac arrest (OHCA) remains a major public health problem, with 350,000-700,000 individuals per year affected in Europe and survival rates below 10% [1,2]. Early recognition of OHCA is key for survival [3] as it allows a rapid activation of the emergency system and facilitates bystander cardiopulmonary resuscitation (CPR). Bystanders should apply an automated external defibrillator (AED), designed to be used with minimal training and to guide the rescuer until the arrival of medical personnel [4]. The main goal of OHCA treatment is to achieve return of spontaneous circulation (ROSC), so that post-resuscitation care can be initiated and the patient can be transported to hospital. Early recognition and post-resuscitation care are two key factors Section 5 describes the optimization process of the models and the evaluation methods applied; and in Sections 6 and 7 the results are presented and discussed.

Data Collection
The data of the study were a subset of a large OHCA episode collection gathered by the DFW centre for resuscitation research (UTSW, Dallas). Every episode was recorded using the Philips HeartStart MRx device, which acquires the ECG signal through defibrillation pads with a sampling frequency of 250 Hz and a resolution of 1.03 µV per least significant bit.
There were a total of 1561 episodes of which 1015 contained concurrent ECG and TI signals. The TI signal was necessary to identify ECG intervals free of artefacts due to chest compressions provided to the patient during CPR. Episodes were separated in ROSC and no-ROSC groups based on the instant of ROSC annotated by the clinicians on scene. PEA rhythms were extracted from no-ROSC patients. PR rhythms were extracted after the instant of ROSC for patients who showed sustained ROSC.
ECG segments of 5 s were automatically extracted during intervals without chest compressions. Chest compressions were automatically detected in the TI using the algorithm proposed in [46], or in the compression depth signal of the monitor [47]. Then, organized rhythms (PR or PEA) were automatically identified using an offline version of a commercial shock advise algorithm [17]. Three biomedical engineers reviewed the segments to check they contained visible QRS complexes with a minimum rate of 12 bpm. Every segment was annotated as PEA or PR based on the clinical annotations. Consecutive ECG segments were extracted using a minimum separation between segments of 1 s for PEA and 30 s for PR. PEA is more variable than PR, and occurs during the arrest. During PEA, CPR is given to the patient and intervals without compressions are not frequent. After ROSC is identified (PR segments) chest compressions are interrupted, and long intervals of artefact-free ECG are available. A longer separation between PR segments was considered to increase the variability.
A total of 3914 segments (2372 PR and 1542 PEA) from 279 patients (134 with ROSC and 145 without ROSC) comprised the dataset. Patient-wise training and test sets were created (≈ 80%/20% of the patients). The training set contained 3038 segments (1871 PR) from 223 patients (105 with ROSC). The test set contained 876 segments (501 PR) from 56 patients (29 with ROSC). Figure 1 shows examples of three PR segments (panel a) and three PEA segments (panel b). PR usually shows higher rates, narrower QRS complexes, less heart rate variability and higher frequency content (steeper QRS complexes) than PEA. However, PR in cardiac arrest often shows irregular beats as in the last two examples of Figure 1a. PEA rhythms may show more aberrant QRS complexes, absence of P waves, or more ectopic heartbeats compared to PR.

Proposed DNN Architectures
Two DNN architectures were implemented for the binary classification of ECG into PR/PEA. The 5 s ECG segments were first bandpass filtered using the typical AED bandwidth (0.5-30 Hz). The filtered ECG was downsampled to 100 Hz to obtain s[n], a signal of N = 500 samples, that was fed to the DNN networks. The output of the networks was p PR ∈ (0, 1), the likelihood that a 5 s segment corresponds to a PR segment. The first solution we propose is a fully convolutional neural network, and the second solution integrates recurrent layers.

First Architecture: Fully Convolutional Neural Network
Panel a of Figure 2 shows the overall architecture of the first solution (S 1 ). It consists of λ convolutional blocks, each one composed of a convolutional, a maximum pooling and a dropout layer.
Convolutional layers apply temporal convolution to the input signal. M different convolution kernels of size L allows obtaining M representations of the signal. The = 1, . . . , M-th output for the input signal s[n] is calculated as follows: where w and b are the weights and biases, respectively, of the convolution kernel (adjusted during training) and φ(·) is an activation function. The linear rectifier was adopted as activation function, i.e., φ(·) = max{0, ·}. Since no padding was applied to s[n] the length of c ( ) 1 [n] was N c 1 = N − L + 1 (the first L − 1 samples were discarded). The outputs of the first convolutional layer are fed into a max-pooling layer, which downsamples each input signal by applying the maximum operation with a pool size K = 2 to non-overlapping signal segments: The next step is to apply the dropout operation, which is only present during training and it is a common technique to avoid overfitting. It consists in dropping out some units under a certain probability α at each training step in a mini-batch. When some units are removed different networks are created at each step, so it can be seen as an ensemble technique. Let us denote the outputs of this layer as d These outputs are fed into another max-pooling and dropout layers to obtain d Finally, a fully connected layer was used as classification stage. This layer is composed of a single neuron with sigmoid activation function to produce p PR : According to [48], it is especially useful to train some layers of the network under the constraint ||w|| < γ when using dropout. This additional constraint during the training process reduces overfitting, so every convolutional layer was trained with γ = 3.5. ...

Second Architecture: CNN Combined with a Recurrent Layer
PEA and PR segments show different temporal behaviour. For instance, the time evolution for PR segments is known to be more regular than for PEA segments. These kind of temporal dynamics can be learned by a RNN. So the second solution proposed in this study (S 2 ) combines CNN and a bidirectional gated recurrent unit (BGRU), as shown in panel b of Figure 2.
GRU [49] is a simplified version of the well-known long short-term memory (LSTM) [50] with a similar performance [49]. These layers resolve long-term dependencies and avoid vanishing gradient problems. BGRU was inserted between the last convolutional block and the classification stage, removing the global maximum pooling layer. BGRU is composed of two GRU layers, one forward and the other one backward, so more sophisticated temporal features can be extracted by exploiting past and future information at time step n. Finally, both outputs were concatenated. A single GRU calculates hidden states h n at time step n = 1, . . . , N d λ based on the past state. Given D = [d (1) λ , . . . , d (M) λ ], the equations of the forwards GRU are described as follows: where W and U are weight matrices, b is the bias vector, σ(•) stands for sigmoid function, and is the Hadamard product. In the equations above z n and r n correspond to the update and reset gates, respectively. The backwards GRU works in the same way but the temporal representations of the input are flipped. The hidden state at the last time step, h n=N d λ , is fed in to the next layer. Having ϑ units for , are fed to the last classification layer after applying dropout. The convolutional and recurrent layers were trained under the constraint ||w|| < γ.
Another kind of dropout in RNN is recurrent dropout [51], which affects the connections between recurrent units instead of the inputs/outputs of the layer. A recurrent dropout fraction of 0.15 was used to train the final model.
This architecture is optimized simultaneously to obtain the optimal representations of the signal (convolutional layers) and obtain the optimal temporal features (BGRU) for an artificial neural network classifier (fully connected layer).

Training Process
The weights and biases of every layer were optimized using the adaptive moment estimation (ADAM) optimizer [52]. ADAM is a stochastic gradient descent algorithm with adaptive learning rate. According to [52], good default settings are a learning rate of 0.001 and exponential decay rates of 0.9 and 0.999.
The training data were fed into the DNN in batches of 8 during 75 epochs. At the beginning of each epoch training data were shuffled, so the mini-batches at each epoch were different. Additionally, zero-mean Gaussian noise with standard deviation of 10 −4 was added to the signal, and its amplitude was modified by ±2% (uniformly distributed) at each mini-batch. This process enriches the generalization of the model, as the input data for each epoch differs slightly.
The cost function to minimize was the binary cross-entropy: where y (true) = {0 : PEA, 1 : PR} are the manual annotations and η i are the sample weights. As patients contribute with different number of segments, every patient was weighted equally to train the DNN, so the sum of η i within the same patient is equal to 1. Every experiment was carried out using Keras framework [53] with Tensorflow backend [54]. The DNNs were trained on an NVIDIA GeForce GTX 1080 Ti.

Uncertainty Estimation
The network's output, p PR , represents the likelihood of PR, but it is not an indicator of the prediction confidence of the model. The uncertainty of the DNN decision can be estimated using dropout and data augmentation also during the test phase, a procedure known as Monte-Carlo dropout [55]. For each segment of the test set the prediction is repeated N times but adding two random effects: dropout in the DNN network, and the addition of white noise to the ECG. This produces N values of p PR , and the variance of those values is interpreted as the uncertainty of the prediction. In our experiments N was set to 100. The decisions in the test set with an uncertainty above an acceptable threshold were discarded, and in those cases feedback would not be given to the rescuer. The threshold of uncertainty is determined in the training set. The uncertainty of each training instance is computed and the threshold is determined as the uncertainty for which a proportion of feedbacks will be given. In our experiments we tested a proportion of feedbacks from 100% to 80%.

Baseline Approaches
Machine learning solutions based on well-known ECG features were implemented and compared with S 1 and S 2 . A total of nine hand-crafted features proposed in [34], v = [v (1) , . . . , v (9) ], were computed. They quantify the PR/PEA differences in terms of QRS complex rate and narrowness, slope steepness, spectral energy distribution, and regularity of the signal (fuzzy entropy).
Three classifiers were optimized and trained: • RF: Introduced in [56], RF constructs many weak learners, each trained with a certain proportion of the training data, ϕ. Each subset is generated by resampling with replacement. Each weak learner is a tree, and only ψ features are considered (drawn randomly from an uniform distribution) at each node. The final decision is made by majority voting. We set the number of trees to 300, and optimized the hyper-parameters ϕ and ψ. • Support vector machine (SVM): Given a feature vector v, the SVM makes the prediction using the following formula [57]: where b is the intercept term and N s is the number of support vectors (w i is non-zero only for these vectors). Here K(·, ·) denotes the kernel function, which for a Gaussian kernel with γ s width is: The hyper-parameters soft margin C and γ s were optimized for the SVM.

•
Kernel logistic regression (KLR): This is a version of the well-known logistic regression by applying a kernel-trick [57,58]. The prediction is made using Equation (11), and the kernel of Equation (12). The hyper-parameters to optimize were the regularization-term λ l and γ s .

Evaluation Setup
The performance of the models was evaluated in terms of sensitivity (Se, probability of correctly identifying PR), specificity (Sp, probability of correctly identifying PEA), and balanced accuracy (BAC, arithmetic mean between Se and Sp). The balanced error rate (BER) was defined as 1 − BAC. As patients have different numbers of segments, every patient was weighted equally to compute the performance metrics.

Hyper-Parameter Optimization Process
The hyper-parameters of every model were optimized using Bayesian optimization (BO) [59]. BO is a probabilistic model based approach that attempts to minimize an objective function associated with a real-valued metric, and the variables to optimize can be discrete or continuous. Recent studies report that BO is more efficient than grid search, random search, or manual tuning since it requires less time and the overall performance on the test set is better [60].
BO approximates the objective function to a surrogate function that is cheaper to evaluate. At each iteration a candidate solution is tested to update the surrogate using the past information. With more iterations the approximation of the surrogate is better. BO algorithm variants differ on how this surrogate is constructed. In this study we considered tree-structured parzen estimators (BO-TPE, to optimize S 1 and S 2 ) and Gaussian processes (BO-GP, to optimize RF, SVM, and KLR) [60,61].
The training data were divided patient-wise into 4 folds, and the cross-validated BER was the objective function to minimize. The search space for all models is shown in Table 1. Table 1. Search space of Bayesian optimization (BO) for all models. Here U (min, max) denotes a uniform distribution between min and max values.

Model
Hyper-Parameters

Results
The results of the BO-TPE algorithm applied for S 1 are shown in Figure 3. For each hyper-parameter the values of the cross-validated BER are given, continuous for α and median (10-90 percentiles) for the other discrete hyper-parameters. The distributions of the values selected by the optimizing algorithm are also shown (as histogram for α). The number of convolutional blocks, λ, and the dropout rate, α, turned out very determinant. Values of α > 0.3 rapidly increased the BER, and including up to λ = 4 blocks was the most selected option by the optimization algorithm. The values of M and L in the selected range had small effect on the performance of the classifier. Figure 4 shows the results of BO-TPE for S 2 . Increasing λ overfitted the model rapidly and BER was minimum for λ = 2; less convolutional blocks did not provide detailed enough features and increasing λ overfitted the model. Another influential hyper-parameter was α, which showed minimum BER values around 0.4. The hyper-parameters M, L, and ϑ had little effect on BER. Figure 5 shows the results of the BO-GP algorithm applied to tune the hyper-parameters of the KLR, SVM, and RF models. The cross-validated BER is color-coded (KLR and SVM) or depicted in the vertical axis (RF). Each point shows a single hyper-parameter combination tested by the BO-GP. In the case of KLR, both hyper-parameters were important, but low values of λ l especially yielded lower BER values. For the SVM, low values of C and high values of γ s produced the worst results, but the selection in the range of values was not as critical as in the KLR solution. Lastly, for RF ψ = 1 was the best option, particularly for 0.5 < ϕ < 0.6, although the fine tuning of ϕ was not critical.    Table 2 shows the overall test results of the baseline models and the deep learners in terms of Se, Sp, and BAC, and the set of selected hyper-parameters tuned during the optimization process. There were no differences between the RF, SVM, and KLR models, and any of the deep learning solutions outperformed the baseline models by nearly two percentage points of BAC. Although there was no difference in performance between S 1 and S 2 , the training process of S 1 is simpler with less trainable parameters than S 2 (1441 vs. 4777).
In Table 3 the computation time of the different models is compared. The mean time required to classify the 5 s segment is given, separately for the baseline classifiers in terms of required time for feature extraction (t 1 ) and classification (t 2 ). Processing times were calculated on a single core of an Intel Xeon 3.6 GHz. As shown in Table 3 the fully convolutional solution, S 1 , was by far the fastest one followed by the baseline models. Table 2. Summary of the performance of the deep learners and baseline models with the test set and the optimal hyper-parameters chosen by the Bayesian optimization with Gaussian processes (BO-GP) and Bayesian optimization with tree-structured parzen estimators (BO-TPE) algorithms with 5-s electrocardiogram (ECG) segments. DNN models outperformed baseline models in terms of BAC.

Se (%) Sp (%) BAC (%)
Hyper-Parameters  Comparative analyses were performed between the 9 hand-crafted features of the baseline models (v) and the features learnt by DNN solutions S 1 and S 2 (v D 1 and v D 2 respectively). The area under the curve (AUC) for v ranged between 0.88 and 0.94, showing that they had been wisely selected in different domains as described in [34]; but the M = 8 features (v D 1 ) that S 1 extracted reported high discriminative values from 0.61 to 0.97, showing that the deep architecture found some very selective features. Next, feature sets from the deep learners v D 1 and v D 2 were fed into the baseline classifiers to compare their performance with that of the original v. The BO-GP optimization procedure was repeated for the RF, SVM, and KLR classifiers and results for the test set are depicted in Figure 6. Training the classifiers with v D 1 and v D 2 yielded higher BAC values than those obtained with the pre-designed v features. This experiment shows that features defined by the neural networks integrate information not considered by the hand-crafted features, and that they can be successfully used with other classifiers.
The duration of the ECG segment fed into any of the solutions is critical when using a pulse detection algorithm during OHCA treatment. During CPR the ECG signal is strongly affected by chest compression artefacts and electrical defibrillation attempts. For any diagnosis based on the ECG, intervals free of artefact must be used, i.e., extracted either during pauses for rhythm analysis or during chest compression pauses. The segment length used in this study is below the typical interruption for a rhythm analysis, which is between 5.2-26.3 s [62]. However, decreasing the length of the analysis segment would contribute to shorter interruptions in compressions for pulse detection. Reducing hands-off intervals that compromise oxygen delivery to the vital organs increases survival rates [63,64]. Consequently, the solutions of this proposal were tested for different segment durations, from 5 s down to 2 s. The models that were trained for 5-s ECG segments were used, features were extracted using the first seconds of the segment, and those features were fed into the baseline models. The DNN models were fed with the same first seconds of the ECG segments used for feature calculation (note that S 1 and S 2 can work with any segment duration at the input). As shown in Figure 7 the best performance for the baseline models was obtained for segment lengths of 5 s. The DNN models outperformed the best baseline models for any segment length, including segments as short as 2 s.  A last evaluation of S 1 was performed to assess the influence of the degree of uncertainty in the decision of the model. Table 4 shows the performance of the model if the system was designed to give feedback only in a percentage of the analyses, those in which the uncertainty of the decision was lowest. Different percentage thresholds were tested in the training set, from 100% (always give feedback) to 80% (give feedback when the uncertainty is low). Assuming no feedback in 5% of the cases increased the BAC by one percentage point, and the BAC increased up to 97.6% if the system was designed to discard the 20% of the analyses with largest uncertainty.

Discussion and Conclusions
Pulse detection during OHCA is still an unsolved problem, and there is a need for automatic methods to assist the rescuer (bystander or medical personnel) to decide whether the patient has pulse or not [10]. Non-invasive pulse detection is still a challenging problem [16], and no solutions are currently integrated in monitors/defibrillators. To the best of our knowledge, this is the first study that uses DNN models to discriminate between PR and PEA rhythms using exclusively the ECG.
The two DNN models proposed in this study outperformed the best PR/PEA discriminators based exclusively on the ECG published to date. A RF classifier based on hand-crafted features was proposed in [34] and reported Se/Sp of 88.4%/89.7% for a smaller dataset. A DNN model using a single convolutional layer followed by a recurrent layer was introduced in a conference paper [65], but the Se/Sp/BAC were 91.7%/92.5%/92.1% on the dataset used for this study, that is the BAC was 1.5 percentage points below the current solution. Other DNN solutions were tested in another conference paper [66], where we reported BAC values of 91.2% and 92.6% for preliminary versions of S 1 and S 2 . Performance was improved in this study adding a general DNN architecture with multiple convolutional layers, a Bayesian optimization procedure which provided insights into the critical hyper-parameters of the networks (see Figures 3 and 4), and a better data augmentation procedure. All these factors contributed to an improved BAC of 93.5% for S 1 and S 2 , an increase of nearly 2 points from a baseline BAC around 92%, i.e., achieving 20% of the available margin for improvement (8 points) on our initial architectures. Furthermore, we also introduced a new usage framework in which the algorithm was able to automatically assess the uncertainty of the decision, and improved feedback by only reporting decisions with low uncertainties.
There was no difference in terms of BAC between S 1 and S 2 . The second solution is more complex and should be able to capture more sophisticated features of the signal. However, the number of trainable parameters was 1441 in S 1 and 4777 in S 2 . Increasing the number of trainable parameters makes the DNN model prone to overfitting, the model "memorizes" the training data loosing generalization capacity and shows poorer performance with unseen data [67,68]. In fact, S 2 showed higher accuracies during training than S 1 (98.5% vs. 96.6%). Besides, training was computationally more costly for S 2 , optimizing S 1 required ≈ 37 h and optimizing S 2 ≈ 82 h. However, it is possible that with larger datasets S 2 could generalize better and provide a more accurate model, but OHCA datasets with pulse annotations are costly.
DNN architectures are capable of automatically learning the discrimination features. Our results show that the features learned by S 1 and S 2 produced more accurate PR/PEA classifiers than hand-crafted features when fed to the classical machine learning models (see Figure 6). The DNN architectures were able to capture some important ECG characteristics for the identification of pulse that are not accounted for in the hand-crafted features proposed in the literature. In particular, the most discriminative features were those learned by S 1 , which when fed to an SVM classifier boosted the BAC from 92% for hand-crafted features to above 94%.
One of the salient features of the proposed DNN solutions is that they are based solely on the ECG. The ECG is available in all defibrillators/monitors used to treat OHCA patients, so it could be integrated into any equipment. PR/PEA discrimination algorithms that use the ECG and TI have also been proposed [28,29], the TI adds relevant information because effective heartbeats may produce small fluctuations in the TI [23,24]. The BACs of ECG/TI-based PR/PEA discriminators using classical machine learning approaches were around 92% for smaller datasets [28,29]. Defibrillators measure the impedance to check that pads are properly attached to the patient's chest, that is the reason why the TI signal is not recorded with mΩ amplitude resolution in many devices. In any case, multi-modal deep learning solutions could be explored to increase the accuracy by designing DNN solutions that use both the ECG and TI signals. Moreover, S 1 extracted significant features, so it could be used as a feature extractor and those features could be combined with features derived from the TI, and other surrogate measures of the hemodynamic state of the patient.
Another critical factor of automatic PR/PEA discrimination algorithms is the ECG segment length needed for an accurate decision. PR/PEA discrimination algorithms need an ECG without chest compression artefacts, this means that compressions have to be interrupted for pulse detection. Pauses in chest compressions compromise the survival of the patient [63,64]. Therefore, current guidelines recommend interruptions of less than 10 s for pulse checks [4,10], but in practice these interruptions are longer than 10 s in more than 50% of cases [14,15]. Our DNN models were very accurate for a segment length of 5 s. Moreover, the length of the segment could be shortened down to 2 s without compromising the BAC of our models (see Figure 7). Consequently, our automatic algorithm could be used to reliably detect pulse during OHCA with interruptions as short as 2-3 seconds, and could be used to avoid the excessively long pauses in chest compressions for pulse detection observed during OHCA treatment.
Measuring the uncertainty of the prediction may be useful when misclassifying an input has a considerable cost, for instance a false pulse indication may unnecessarily interrupt a life saving therapy like CPR. Many efforts have been made to estimate the uncertainty in DNN models, but it is still a challenging problem [69][70][71][72][73]. In this work the uncertainty of the decision was measured using a method known as Monte-Carlo dropout [55], and we found that only giving feedback when the uncertainty was low considerably increased the BAC. For instance, giving a feedback in the 95% of the cases improved the BAC by more than 1 point, and only giving feedback in 80% of cases increased the BAC by over 4 points. During OHCA treatment CPR should be continued until a reliable pulse detection is identified by the algorithm, and the pauses in compressions for the potential feedbacks (reliable or unreliable) will be short, since our algorithms only require ECG segments of 2-3 s. Further work should be done to improve the estimate of the uncertainty of the decision, so that BACs of 97% could be obtained by discarding less than 20% of the potential feedbacks.
In conclusion, this study introduces the use of deep neural networks to discriminate between pulseless and pulsatile rhythms during OHCA using only the ECG. The proposed DNN models outperformed hand-crafted feature-based machine learning solutions, and were able to accurately detect pulse with ECG segments as short as 2-3 s. Moreover, a first attempt at a quantification of the uncertainty of the decision was also introduced to improve the reliability of the feedback given to the rescuer. The proposed solution is based exclusively on the ECG and could be integrated into any monitor/defibrillator. Funding: This work was supported by: The Spanish Ministerio de Economía y Competitividad, TEC2015-64678-R, jointly with the Fondo Europeo de Desarrollo Regional (FEDER), UPV/EHU via GIU17/031 and the Basque Government through the grant PRE_2018_2_0260.

Conflicts of Interest:
A.I. receives research grants from the US National Institutes of Health (NIH) and serves as an unpaid volunteer on the American Heart Association National Emergency Cardiovascular Care Committee and the HeartSine, Inc. Clinical Advisory Board.

Abbreviations
The following abbreviations are used in this manuscript: