Hybrid Network with Attention Mechanism for Detection and Location of Myocardial Infarction Based on 12-Lead Electrocardiogram Signals

The electrocardiogram (ECG) is a non-invasive, inexpensive, and effective tool for myocardial infarction (MI) diagnosis. Conventional detection algorithms require solid domain expertise and rely heavily on handcrafted features. Although previous works have studied deep learning methods for extracting features, these methods still neglect the relationships between different leads and the temporal characteristics of ECG signals. To handle the issues, a novel multi-lead attention (MLA) mechanism integrated with convolutional neural network (CNN) and bidirectional gated recurrent unit (BiGRU) framework (MLA-CNN-BiGRU) is therefore proposed to detect and locate MI via 12-lead ECG records. Specifically, the MLA mechanism automatically measures and assigns the weights to different leads according to their contribution. The two-dimensional CNN module exploits the interrelated characteristics between leads and extracts discriminative spatial features. Moreover, the BiGRU module extracts essential temporal features inside each lead. The spatial and temporal features from these two modules are fused together as global features for classification. In experiments, MI location and detection were performed under both intra-patient scheme and inter-patient scheme to test the robustness of the proposed framework. Experimental results indicate that our intelligent framework achieved satisfactory performance and demonstrated vital clinical significance.


Introduction
Myocardial infarction (MI), as one of the most prevalent cardiovascular diseases worldwide, commonly emerges when the coronary artery is occluded by thrombus. It is estimated that the annual incidence of MI is 605,000 new attacks and 200,000 recurrent attacks in the United States [1]. In fact, MI is also described as silent heart attack and most patients suffer from MI without awareness. Even worse, acute MI occurs rapidly and unexpectedly with a high mortality rate. Therefore, early diagnosis and timely treatment are of utmost significance to guarantee the life safety of MI patients.
Electrocardiography (ECG), the most popular diagnostic tool owing to its convenience, non-invasiveness, and low cost, can be employed to recognize MI [2]. ECG records the electrical signals generated by the heart muscle fibers during the alternate contraction and relaxation of the heart chambers [3]. A normal ECG is characterized by the cardiac cycle sequence, and each cycle mainly contains P, QRS, and T waves. In general, ECG consists of 12 leads (I, II, III, aVR, aVL, aVF, and V1-V6) that reflect the heart in various regions and perspectives. The location of MI can be detected by the alterations among different leads [2]; therefore, it is essential to take more leads into account in the diagnosis of MI. However, it is strenuous and time-consuming for trained physicians to evaluate every lead precisely. Moreover, because of ECG individualized polymorphism, the diagnostic criteria are perplexing and complicated to follow [4]. ST-segment elevation is one of the diagnostic indicators of MI [2], but even experienced cardiologists may only identify 82% of this indicator among MI subjects [5]. A computer-aided diagnosis (CAD) system can overcome the limitations of manual inspection of ECG signals through rapid, objective, and reliable analysis [6]. Hence, effective diagnosis of MI with 12-lead ECG signals analyzed by a CAD system is advantageous and preferable.
Various frameworks have been proposed and developed in the CAD system for MI detection and location. Most of the studies follow the procedure of feature extraction, feature selection, and classification. Conventionally, feature extraction is a manual operation that requires solid domain expertise. Several characteristic values can be extracted from ECG morphology as relevant features of MI, such as ST deviation and T wave amplitude [7]. However, most morphological features are heavily dependent on the accuracy of ECG wave delineation. To mine additional information, wavelet transform, principal component analysis (PCA), empirical mode decomposition, random projections, hidden Markov model, and reproducing kernel Hilbert space are employed to extract the representative features [8,9]. After feature extraction or selection, diverse classifiers are developed to discriminate between MI and healthy controls (HCs) through the obtained features. Additionally, multi-class classifiers are applied to localize different types of MI. The classifiers can be typically categorized into traditional thresholding methods [10] and machine learning algorithms. Conventional machine learning classifiers include K-nearest neighbor [11], random forest [12], and support vector machine [13]. Although the above off-the-shelf methods work well, they still have obvious defects and limitations. In essence, feature extraction and classification are two separate modules with substantially different parameters and complexity. It is hard to determine whether the information is fully exploited or redundantly used, which adversely affects the subsequent classification. Furthermore, specific feature extraction algorithms have unconvincing robustness under different influence factors, such as age, gender, and acquisition equipment. Therefore, an automatic and end-to-end framework that integrates effective feature extraction and classification processes is required to improve the effectiveness of MI diagnosis.
In recent decades, deep learning methods, including convolutional neural network (CNN), gated recurrent unit (GRU), attention mechanism, and autoencoder, have been widely and superbly applied to analyze biomedical signals [14][15][16]. Instead of separate feature extraction and classification processes, deep learning architectures automatically extract critical features required for classification from vast samples [17]. Furthermore, CNN and GRU are two typical end-to-end learning paradigms with multiple levels of representation and especially suitable for discovering the spatial and temporal characteristics in high-dimensional data [18]. To alleviate the disadvantages of conventional frameworks, deep learning methods are exploited in MI diagnosis continuously and rapidly [19][20][21][22][23][24]. To a large extent, new research lays the foundations for the development of deep learning frameworks that make full use of 12-lead ECG signals.
Although there is plenty of research on MI diagnosis, several issues have not yet received due consideration. Firstly, most studies only utilized a single ECG lead, but the remaining leads should also be taken into account. Considering 12-lead ECG records conforms more closely to the clinical rules of MI diagnosis [24]. Secondly, importance evaluation and weighted combination of each lead in MI diagnosis have thus far been sparsely investigated. Even though the authors of [22][23][24][25] considered 12 leads simultaneously, each lead contains distinctive and complementary information that deserves separate processing rather than identical treatment. Thirdly, only a few researchers considered the inter-patient scheme on the Physikalisch-Technische Bundesanstalt (PTB) dataset. Since individual variation exists among patients, the inter-patient scheme is closely relevant to clinical practice and applications. In contrast, the intra-patient scheme cannot substantiate the feasibility and adaptability of the model and may even produce overly optimistic results.
To address the aforementioned limitations, a novel, practical, and medical-grade framework is proposed for the detection and location of MI. More precisely, the main contributions of this study are listed as follows.

1.
A novel multi-lead attention (MLA) mechanism integrated with CNN and bidirectional gated recurrent unit (BiGRU) framework (MLA-CNN-BiGRU) is proposed. The parallel deployed CNN and BiGRU modules are innovatively utilized to extract features to detect and locate MI via 12-lead heartbeat signals. As far as we know, this fills the gap of applying deep learning methods to automatically extract spatial and temporal features from 12-lead ECG signals in MI diagnosis. The proposed feature extraction method paves a new way for feature engineering.

2.
The MLA is developed by the designed activation function. The proposed attention mechanism measures and exploits the contribution of each lead to boost the diagnostic performance. Existing studies mainly focus on manual selection of leads or treat all the leads equally with repeated and redundant information. With the proposed model-based approach, this study serves as a preliminary exploration on the importance evaluation of each lead for MI detection and location.

3.
Different leads are interrelated and correlated. It is essential to fully exploit available features to enhance the performance. To our knowledge, it is the first time to adopt 2D-CNN to extract spatial features based on multi-lead fusion in MI diagnosis. Three different convolutional kernels are innovatively applied to extract correlation and regional features among different leads.

4.
MI detection and location under intra-patient and inter-patient schemes are all performed to test the robustness of MLA-CNN-BiGRU. In addition, elaborate and exhaustive ablation experiments are carried out to verify the effectiveness of the framework. Experimental results indicate that the proposed intelligent framework achieves satisfactory performance and demonstrates vital clinical significance.

Related Work
Before introducing the proposed hybrid deep learning framework, background information of attention mechanism, CNN, and GRU is illustrated as guidance.

Attention Mechanism
Inspired by the efficient allocation of limited resources by the human brain, attention mechanism is widely applied to emphasize the most valuable information in visual image recognition [26] and natural language processing [27]. Since redundant information is time- and resource-consuming in data processing, self-attention mechanism [28] is proposed for sequential models to calculate the weights for different features. Generally, self-attention is deployed on the outputs of GRU- or CNN-based sequential models [29].
Recently, attention mechanism has been popular in clinical diagnosis. Deep fusional attention network was adopted to extract elaborate features from biological signals in seizure detection and sleep stage classification [16]. In MI diagnosis, the heartbeat-attention mechanism was introduced to automatically weight the difference between unlabeled heartbeats [22]. Furthermore, the attention mechanism has strong interpretability. Its ability to evaluate importance and contribution can be implemented not only for feature extraction, but also for multi-channel screening.

Convolutional Neural Network
CNN is the most established architecture in the image recognition field and is inspired by the natural visual perception mechanism of living creatures [30,31]. Typically, CNN consists of three types of stacked layers combined with a series of operations. Convolutional layers apply convolutional kernels to learn different spatial feature maps of the input data. Pooling layers reduce the dimensionality of the feature maps from convolutional layers with shift-invariance [32]. Fully connected layers perform the final classification or prediction. Batch normalization can accelerate training by reducing internal covariate shift [33]. Dropout can reduce overfitting by avoiding complex co-adaptations on the training data [34]. Activation functions introduce nonlinearities to neural networks. Typical activation functions are sigmoid, tanh, and rectified linear unit (ReLU) [32]. The loss function defines the difference between the real value and the predicted value. During the training process, the optimizer minimizes the loss function, and the best fitting parameters can be obtained.
In the medical field, there has been a rapid surge of applications of CNN in radiology [31] and physiological signals [14]. Researchers have applied CNN by treating ECG signals as 1D images in the diagnosis of MI [19,35,36,37]. Deep CNN was applied to automatically diagnose MI through one single lead and attained good performance [19,35]. Baloglu et al. [36] achieved impressive results based on a CNN model with all the 12 leads. Multiple-feature-branch 1D CNN was created to take full advantage of 12 leads [37]. Multi-lead residual neural network was proposed, and three residual blocks were designed to capture remarkable features by convolutional layer through 1D convolutional kernel [24]. Additionally, sub 2D CNN structure extracted different feature representation with shared 1D convolutional kernels among four leads during MI detection [20]. In essence, 1D CNN only focuses on the features within a single lead. Although the sub 2D CNN was applied, the feature map was still generated based on the shared 1D convolutional kernel inside the same lead. Therefore, the powerful feature extraction ability of 2D CNN through multi-lead convolutional kernels remains to be further developed in the diagnosis of MI.

Gated Recurrent Unit
Recurrent Neural Network (RNN) is widely used in the processing of time series data due to its ability to memorize sequential information. RNN implements a recursive task with the output being dependent on all the historical information [17]. However, the total memory capacity is restricted in standard RNNs. Long Short-Term Memory (LSTM) [38] is designed to avoid sacrificing too much information in learning long-term dependencies by addressing the vanishing gradient problem. In LSTM, a memory block continuously transmits and renews memory by three gates: the input, output, and forget gates. The input gate identifies what new information is important and needs to be reserved in the previous state. The output gate determines what information is conveyed to the next state. The forget gate identifies what relevant information needs to be retained in the previous state. GRU [39] is created as an enhanced variant of LSTM that can extract features selectively through a reset gate and an update gate. Compared with LSTM, GRU has no cell state and straightforwardly uses hidden state for the transmission of information. The reset gate of GRU is utilized to determine how much previous information requires to be forgotten. The update gate determines what previous information to keep and what new information to merge. Apart from optimizing the internal structure, GRU can be further improved by taking all the previous and subsequent context information into consideration. Therefore, bidirectional GRU that is integrated by two GRU layers [40] is proposed. BiGRU processes information in backward and forward directions and is therefore able to exploit both the past and the future information.
In processing biomedical signals, BiGRU has been successfully applied for human emotion classification through continuous electroencephalogram signals [41], and human identification through ECG-based biometrics [42]. The ECG signal is a typical kind of time series data, and LSTM has been effectively applied in MI diagnosis [21][22][23]. The GRU architecture can achieve performance comparable or even superior to that of LSTM [42], but its potential has rarely been investigated in MI diagnosis thus far.

Dataset and Pre-Processing
The ECG data utilized in this study were from PTB dataset provided by the German National Metrology Institute [43]. The PTB dataset contained 549 records from 290 subjects. Each record was obtained by synchronous acquisition of 15 leads, including conventional 12 leads ECG and the 3 Frank signals. The sampling frequency of electrical signals in PTB dataset was 1000 Hz. In the dataset, 148 MI patients (368 records) and 52 healthy volunteers (80 records) were collected. The ECG signals of 148 MI patients were identified as ten different types of MI, but only five categories were selected in MI location. Specifically, 314 records were used for MI location, including 47 records of anterior MI (AMI), 43 records of antero-lateral MI (ALMI), 79 records of antero-septal MI (ASMI), 89 records of inferior MI (IMI), and 56 records of inferolateral MI (ILMI).
The pre-processing of ECG signals included denoising, baseline drift removal, and data segmentation. To eliminate the magnitude difference between different records, data standardization transformed all input data into values within [−1, 1]. The Daubechies 6 (DB6) wavelet basis function [44] was applied to eliminate noise and remove baseline drift. Additionally, the Pan-Tompkins algorithm [45] was employed to segment the pre-processed ECG signals by QRS-wave detection. In detail, 250 sample points were selected before the QRS-peak point and 400 sample points after it, which, together with the peak point itself, formed a heartbeat segment of 651 points. Moreover, the first and last heartbeats were removed from each ECG signal record. Table 1 shows the data distribution of 12-lead heartbeats in this study.
Abbreviations used in Table 1: anterior myocardial infarction (AMI); antero-lateral myocardial infarction (ALMI); antero-septal myocardial infarction (ASMI); inferior myocardial infarction (IMI); inferolateral myocardial infarction (ILMI); myocardial infarction (MI); healthy control (HC).
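The segmentation step can be illustrated with a minimal NumPy sketch. It assumes the QRS-peak indices have already been produced by a detector such as Pan-Tompkins (the detector itself is not implemented here), and it assumes per-lead min-max scaling, since the text does not state whether scaling was applied per lead or per record. The function names are hypothetical.

```python
import numpy as np

PRE, POST = 250, 400  # samples before / after the QRS peak, as stated in the text

def scale_to_unit_range(record):
    """Linearly map each lead of a (leads, samples) record into [-1, 1] (assumed per-lead)."""
    lo = record.min(axis=1, keepdims=True)
    hi = record.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against flat leads
    return 2.0 * (record - lo) / span - 1.0

def segment_beats(record, r_peaks):
    """Cut one 651-point window per QRS peak from a (leads, samples) record."""
    n = record.shape[1]
    inner = list(r_peaks)[1:-1]  # drop the first and last heartbeat of the record
    # Keep only peaks whose full window lies inside the record.
    usable = [p for p in inner if p - PRE >= 0 and p + POST < n]
    if not usable:
        return np.empty((0, record.shape[0], PRE + POST + 1))
    return np.stack([record[:, p - PRE : p + POST + 1] for p in usable])
```

Each returned beat has shape (12, 651) for a 12-lead record, with the QRS peak at column index 250.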

Methodology
The hybrid neural network framework is composed of three sub-modules, as shown in Figure 1. Firstly, pre-processed data are fed into the MLA-CNN-BiGRU framework. An attention layer is trained to determine the importance of each lead. After adaptive selection, CNN is applied to extract spatial features, which are weighted and integrated via an attention mechanism. Simultaneously, BiGRU with a feature integration attention mechanism mines optimal features in the temporal dimension. Ultimately, the spatial and temporal features from the two modules are joined and fed into the fully connected layer for classification.

Multi-lead Attention Module
In a segmented heartbeat with 12 leads, each lead reflects the heart condition from different perspectives. Undesired and unnecessary information could have a reverse impact on the training process, even limiting the maximum performance of the model. For this reason, the identification of effective input data is particularly important. However, treating all the leads equally could result in redundant information. Training neural networks with repetitive information is time-consuming and resource-wasting. When analyzing 12 leads for MI identification, not all leads make equal contributions. Therefore, the attention mechanism is elaborately employed to evaluate the significance of each lead. The attention mechanism shown in Figure 2 makes the weighted information of 12 leads more condensed and refined, thus facilitating the subsequent processing.
In this study, self-attention mechanism is modified to measure the importance of each lead. The proposed MLA, an extension of the conventional attention mechanism, can be used for lead selection through the designed activation function. The proposed MLA mechanism aims to heavily weight key leads and eliminate redundant leads. To achieve this purpose, a modified version of the activation function ReLU is therefore adopted.
As shown in Equation (1), the StepReLU is created to simulate the step function. After the weight is activated by the StepReLU, its value is distributed between zero and one. In this way, the crucial leads could be entirely retained and leads of no use could be completely abandoned. The remaining leads are assigned with partial weights. The ordinary step function is either zero or one, and its derivative is zero, therefore it cannot be applied to train neural networks. StepReLU can be used in the back-propagation algorithm and serves a similar purpose as a step function. Moreover, the proposed activation function solves the issue that maximum values after traditional ReLU activation are uncontrolled.
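Equation (1) is not reproduced in this text, so the exact definition of StepReLU cannot be confirmed here. One function that matches the stated properties (output bounded between zero and one, useless leads mapped to exactly zero, crucial leads mapped to exactly one, non-zero gradient in between) is a ReLU clipped at one. The sketch below is written under that assumption and is not necessarily the authors' formula.

```python
import numpy as np

def step_relu(x):
    """Assumed StepReLU: 0 for x <= 0, x on (0, 1), 1 for x >= 1."""
    return np.clip(x, 0.0, 1.0)

def step_relu_grad(x):
    """Subgradient for back-propagation: 1 inside (0, 1), 0 elsewhere."""
    return ((x > 0.0) & (x < 1.0)).astype(float)
```

Unlike a true step function, this form has a non-zero derivative on the interval (0, 1), which is what lets gradient-based training adjust the lead weights.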
The implementation process of MLA can be summarized by Equations (2) and (3).
where $L^{T} = [l_1, l_2, \ldots, l_k]$ ($l_i \in \mathbb{R}^{t}$, $L \in \mathbb{R}^{k \times t}$) is the input heartbeat sample with 12 leads and 651 time points ($k = 12$, $t = 651$). $W_1 \in \mathbb{R}^{k \times k}$ is a trainable parameter matrix, $w_1 \in \mathbb{R}^{t}$ is the parameter vector, and $b_1 \in \mathbb{R}^{k}$ is the bias term. The function $\tanh(\cdot)$ denotes the hyperbolic tangent function. After the computation, the vector $\alpha_1 \in \mathbb{R}^{k}$ represents the importance of each lead. Finally, $X$ in Equation (4) is the 12-lead signal after selection, obtained with the self-defined multiplication $\otimes$. Through the MLA mechanism, $L$ is transformed into $X$, which serves as the input for subsequent feature extraction.
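Because Equations (2) to (4) are missing from this text, the following sketch only shows one computation that is dimensionally consistent with the symbols defined above. It assumes the lead scores are $\tanh(W_1 L + b_1)\,w_1$, that StepReLU is a clip to [0, 1], and that $\otimes$ scales each lead (row of $L$) by its weight. All three choices are assumptions made for illustration.

```python
import numpy as np

def multi_lead_attention(L, W1, b1, w1):
    """Return (alpha, X) for one beat L of shape (k leads, t samples).

    Shapes: W1 (k, k), b1 (k,), w1 (t,). The formula is an assumed form,
    chosen only because it matches the stated dimensions.
    """
    scores = np.tanh(W1 @ L + b1[:, None]) @ w1   # one score per lead, shape (k,)
    alpha = np.clip(scores, 0.0, 1.0)             # assumed StepReLU
    X = alpha[:, None] * L                        # assumed meaning of the product: row-wise scaling
    return alpha, X
```

With this reading, a lead whose weight is zero contributes an all-zero row to $X$, and a lead whose weight is one passes through unchanged.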

CNN with Attention Mechanism for Spatial Feature Extraction
As a feature extraction module with the ability to identify the optimum spatial features for diagnosis, CNN is combined with attention mechanism to form one branch of the hybrid framework. This module consists of two alternated convolutional and pooling layers, as well as an attention layer in the end, as shown in Figure 3b.
Different leads are interrelated and correlated, but each lead is one-dimensional, thus making 2D CNN inapplicable. Inspired by multi-sensor data fusion [46], we utilized time dimension as the horizontal axis and arranged 12 leads in vertical axis to convert one-dimensional signal into two-dimensional data. Therefore, each 12-lead beat sample has the size of 12 × 651. To enable 2D CNN to effectively mine useful spatial features, three different convolutional kernels, namely 3 × 3 kernel, 5 × 1 kernel, and 7 × 1 kernel, are innovatively applied to extract the correlated and regional features among different leads. In this way, the 5 × 1 kernel can consider five leads at a time. Similarly, 7 × 1 kernel takes the information of seven leads into account at the same time point.
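To make the role of the lead-spanning kernels concrete, the sketch below applies a single h × 1 kernel to a 12 × 651 beat. It assumes unit stride and no padding, neither of which is specified in this section, and the function name is hypothetical. Each output value combines h adjacent leads at one time point.

```python
import numpy as np

def conv_across_leads(beat, kernel):
    """Apply an (h x 1) kernel down the lead axis of a (leads, t) beat.

    Assumes stride 1 and no padding, so the output has leads - h + 1 rows.
    """
    h = kernel.shape[0]
    rows = beat.shape[0] - h + 1
    out = np.empty((rows, beat.shape[1]))
    for i in range(rows):
        out[i] = kernel @ beat[i:i + h]   # (h,) @ (h, t) -> (t,): h leads, same time point
    return out
```

Under these assumptions, a 5 × 1 kernel turns a 12 × 651 beat into an 8 × 651 feature map, and a 7 × 1 kernel yields a 6 × 651 map.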

Convolutional Layer
In the convolutional layer of CNN, high-order information can be extracted through convolution and activation operations. The input data are convolved with a set of kernels with different shapes to generate discriminative feature maps for diagnostic representation. Then, the nonlinearity is introduced by an element-wise activation function. As illustrated in Equation (5), the feature value $x^{m}_{i,j,n}$ is computed by the $n$th kernel at location $(i, j)$ in the $m$th layer:
$$x^{m}_{i,j,n} = f\left( \left(W^{m}_{n}\right)^{T} x^{m-1}_{i,j} + b^{m}_{n} \right) \qquad (5)$$
where $x^{m-1}_{i,j}$ is the input patch centered at location $(i, j)$ in the $(m-1)$th layer. $W^{m}_{n}$ and $b^{m}_{n}$ are the weight and bias term of the $n$th kernel filter ($n \in [1, 2, \ldots, N]$) in the $m$th layer, respectively. Each kernel generates one feature map by sliding a data window with shared weight and bias parameters. There are $N$ kernels in each layer, which means $N$ feature maps are generated as input to the next pooling layer. The activation function $f(\cdot)$ produces the nonlinearity.

Pooling Layer
To reduce the dimensions and improve the robustness of the learned feature maps, a pooling layer is generally inserted between two convolutional layers. Features in the local patches of input maps are compressed into a more robust representation to achieve subsampling. Therefore, pooling layers possess shift-invariance to minor transformations in the input images [47]. Moreover, the computational burden during the training process can be reduced. Considering that each beat may vary in morphology and numerical values, pooling layers can alleviate the influence of these variations to enhance the robustness. Max pooling is one of the typical pooling operations, which computes the maximum values in the pooling windows. Max pooling is effective for retaining texture information [47]. It is applied in this study because the texture characteristics, such as the peak and fluctuation of the heartbeat, can be preserved during the subsampling.
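The shift-invariance argument can be checked with a small sketch. It assumes non-overlapping pooling windows along the time axis only. The pooling shape used in the framework is not given in this section, so the window layout is an illustrative choice.

```python
import numpy as np

def max_pool_time(feature_map, size):
    """Non-overlapping max pooling along the time axis of a (rows, t) map.

    Trailing samples that do not fill a whole window are dropped.
    """
    rows, t = feature_map.shape
    t_out = t // size
    trimmed = feature_map[:, : t_out * size]
    return trimmed.reshape(rows, t_out, size).max(axis=2)
```

A peak that moves by one sample but stays inside the same window produces the same pooled output, which is the sense in which small shifts in beat morphology are absorbed.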

Attention Layer for CNN
After the convolutional and pooling layers, a series of feature maps is ultimately formed. If all the feature maps were directly concatenated for classification, the fully connected layer would contain a very large number of parameters and be prone to overfitting. Furthermore, the contribution of each feature map is not equal. In fact, some feature maps are redundant and unnecessary in classification and thus should have small weights. In contrast, pivotal and discriminative feature maps deserve greater weights.
Compared with conventional CNN models that treat all the feature maps in the same manner, an attention layer is added on top of CNN to integrate different feature maps and form the optimal spatial feature representation for classification. The calculation process of the weight vector $\alpha_2$ is shown in Equation (6) and the final spatial feature vector $f_s$ is obtained by Equation (7). The input $x_n \in X$ denotes the $n$th feature vector in the whole feature set $X = [x_1, x_2, \ldots, x_N]$ generated from the last pooling layer. The activation function $\mathrm{softmax}(\cdot)$ ensures that all calculated weights in the vector $\alpha_2$ add up to 1. $W_2$, $b_2$, and $w_2$ are trainable parameters.
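Equations (6) and (7) are not reproduced in this text. A common additive-attention form that uses exactly one weight matrix, one bias, and one scoring vector is sketched below as an assumption: each feature vector is scored by $w^{T}\tanh(W x_n + b)$, the scores are normalized by softmax, and the output is the weighted sum of the feature vectors.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_pool(X, W, b, w):
    """Weighted sum of N feature vectors.

    X: (N, d) features; W: (a, d); b: (a,); w: (a,).
    Returns (alpha, pooled) with alpha of shape (N,) and pooled of shape (d,).
    The scoring form is an assumed, commonly used one.
    """
    scores = np.tanh(X @ W.T + b) @ w
    alpha = softmax(scores)
    return alpha, alpha @ X
```

The same pattern would apply to the BiGRU hidden states with parameters $W_3$, $b_3$, and $w_3$, since the text describes that attention layer as similar to this one.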
Therefore, CNN combined with attention mechanism can better characterize the spatial features from signal data. Additionally, the proposed CNN module pays more attention to the correlation of adjacent leads and integrates discriminative features more reasonably.

BiGRU with Attention Mechanism for Temporal Feature Extraction
The ECG signal is essentially a periodic signal with certain regularity. Therefore, the heart state corresponding to the current sampling value is related not only to the previous time point, but also to the subsequent time point. To efficiently learn the temporal correlation of ECG signals in each lead, BiGRU with attention mechanism is employed to further strengthen the performance of the general framework. The BiGRU module is deployed in parallel with the CNN module, and the two are trained and updated together. In detail, the BiGRU module consists of two parallel GRUs and an attention layer at the end, as shown in Figure 3c.

BiGRU Neural Network
GRU is designed to improve the three-gate structure of LSTM by removing cell state and conflating the forget gate and input gate to an update gate. Therefore, GRU has fewer parameters and performs more efficiently. The calculation principle of GRU is defined in Equation (8).
where $z_t$ represents the update gate and $h_{t-1}$ denotes the output of the previous neuron. $\tilde{h}_t$ is the signal information learned at the present state after the reset gate $r_t$. $h_t$ represents the hidden state of the neuron. $W_{xz}$, $W_{hz}$, $W_{xr}$, $W_{hr}$, $W_{xh}$, and $W$ are the corresponding weight matrices. $b_z$ and $b_r$ are the bias terms. The functions $\sigma(\cdot)$ and $\tanh(\cdot)$ represent the sigmoid function and hyperbolic tangent function. The symbol $*$ denotes element-wise multiplication.
To make full use of the past and future information, BiGRU is developed by containing a forward GRU layer and a backward GRU layer. The input $x_t \in \mathbb{R}^{k}$ holds the information of 12 leads at the same time point $t$. During the training process, the GRU cell iterates 651 times for each beat sample to capture the temporal features. The hidden vectors $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ can be extracted as forward and backward temporal features, which are calculated by Equation (9). Subsequently, hidden states from the two directions are concatenated to generate the overall temporal features $H$ composed of $H_t$, as shown in Equation (10).
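The recurrence can be sketched in NumPy as follows. Equations (8) to (10) are not reproduced in this text, so the sketch uses a standard GRU formulation with biases only on the two gates, as the symbol list suggests. The key name `Whh` for the recurrent candidate weight is hypothetical, and the way $z_t$ blends old and new state differs between references and libraries, so that line is an assumed convention.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU update; p is a dict of weight matrices and bias vectors."""
    z = sigmoid(p["Wxz"] @ x + p["Whz"] @ h_prev + p["bz"])    # update gate
    r = sigmoid(p["Wxr"] @ x + p["Whr"] @ h_prev + p["br"])    # reset gate
    h_cand = np.tanh(p["Wxh"] @ x + p["Whh"] @ (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_cand                     # assumed blending convention

def run_gru(X, p, hidden):
    """Run a GRU over X of shape (T, k); return all hidden states, shape (T, hidden)."""
    h = np.zeros(hidden)
    out = np.empty((X.shape[0], hidden))
    for t, x in enumerate(X):
        h = gru_step(x, h, p)
        out[t] = h
    return out

def bigru(X, p_fwd, p_bwd, hidden):
    """Concatenate forward and backward hidden states at every time step."""
    forward = run_gru(X, p_fwd, hidden)
    backward = run_gru(X[::-1], p_bwd, hidden)[::-1]     # re-align to original time order
    return np.concatenate([forward, backward], axis=1)   # H, shape (T, 2 * hidden)
```

For a beat arranged as 651 time steps of 12 lead values, `bigru` returns 651 hidden states, one per time point, each combining context from both directions.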

Attention Layer for BiGRU
There are 651 total hidden states formed after BiGRU. Meanwhile, each hidden state provides diverse information and exhibits different contribution for the final classification. Similar to the attention layer in the CNN module, another attention layer is introduced after the BiGRU layer, as illustrated in Equations (11) and (12). W 3 , b 3 , and w 3 are trainable parameters. Correspondingly, each temporal feature extracted by BiGRU is assigned with an appropriate weight and features are integrated into the final temporal feature f t .

Merge and Classification
In the proposed framework, the last step concatenates the features extracted by the two modules and co-trains them for classification. The training procedure is detailed in Algorithm 1. The proposed CNN module and BiGRU module are employed as spatial and temporal feature learners, respectively.
The spatial feature $f_s$ and the temporal feature $f_t$ learned from the beat sample are concatenated into a joint feature $F$, as shown in Equation (13). In this manner, the proposed hybrid framework provides more diversity in the estimation of class probability. The joint feature is fed into the fully connected layer for final classification.

Algorithm 1 (excerpt):
    for beat sample $L_i$ ∈ batch do
        // Multi-lead Attention Module
        ...
        // CNN with Attention Mechanism
        ...
        $\overrightarrow{h}_t$ ← forward GRU($X_i$)
        $\overleftarrow{h}_t$ ← backward GRU($X_i$)
        ...
        $H_t$ ← BatchNormalization($H_t$)
        $H_t$ ← Dropout($H_t$)
        Temporal features $f_t$ ← Attention($H_t$)
        // Merge and Classification
        Features $F$ ← concatenate($f_s$, $f_t$)
        $F$ ← BatchNormalization($F$)
        $F$ ← Dropout($F$)
        $y_{pre}$ ← FullyConnected($F$)
        if MI detection then
            ...

The attentive CNN module focuses more on the distinguishable neighbor information among different ECG leads, while BiGRU with attention mechanism is skilled at extracting essential temporal characteristics inside each lead. The two modules thus complement each other to make the extracted features more comprehensive and efficient, thereby achieving higher performance.
Compared with the hand-crafted features extracted by traditional classifiers, the end-to-end framework integrates the lead selection, feature extraction, feature reduction and MI classification as a whole system. Moreover, the creative and efficient feature processing structure can generate discriminative spatial and temporal features by co-training the two modules.
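The merge step can be sketched as a concatenation followed by one dense layer. The sketch assumes a softmax output over the classes and omits the batch normalization and dropout steps that are only active during training. The output activation used by the authors is not stated in this section.

```python
import numpy as np

def classify(f_s, f_t, W, b):
    """Concatenate spatial and temporal features and apply one dense softmax layer.

    f_s: (ds,), f_t: (dt,), W: (classes, ds + dt), b: (classes,).
    The softmax output is an assumption made for this illustration.
    """
    F = np.concatenate([f_s, f_t])        # joint feature F
    logits = W @ F + b
    e = np.exp(logits - logits.max())     # stable softmax
    return F, e / e.sum()
```

For MI location with the five categories used here, `W` would have five rows. For MI detection it would have two under this softmax assumption.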

Evaluation Metrics
The accuracy (Acc) of the classification is the proportion of correctly classified samples to the total number of samples. It measures the overall classification result and is defined by the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) in Equation (14):
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (14)$$
Sensitivity (Sen) measures the proportion of real MI patients who are correctly classified and is defined in Equation (15). Specificity (Spe), defined in Equation (16), instead measures the proportion of real healthy people who are correctly predicted:
$$\mathrm{Sen} = \frac{TP}{TP + FN} \qquad (15)$$
$$\mathrm{Spe} = \frac{TN}{TN + FP} \qquad (16)$$
High sensitivity indicates a low rate of missed diagnosis, i.e., few MI patients are classified as healthy individuals. High specificity indicates a low rate of misdiagnosis, i.e., few healthy individuals are deemed MI patients.
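The three metrics follow directly from the confusion counts, with MI treated as the positive class. The helper below is a minimal sketch, and the counts in the usage note are hypothetical rather than results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)   # share of MI beats correctly flagged
    spe = tn / (tn + fp)   # share of healthy beats correctly cleared
    return acc, sen, spe
```

For example, with 90 true positives, 10 false negatives, 40 true negatives, and 10 false positives, sensitivity is 0.9 and specificity is 0.8, which shows how a class imbalance toward MI can keep accuracy high even when specificity is lower.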

Experimental Methodology
Based on the PTB dataset, MI detection and MI location under both intra-patient and inter-patient schemes were implemented to verify the effectiveness of the proposed MLA-CNN-BiGRU framework.
All experiments were evaluated with Acc, Sen, and Spe, and the results were obtained by five-fold cross-validation. Under the intra-patient scheme, all beats were randomly divided into five approximately equal parts. In each iteration, three parts were used to train the model, one part served as the validation set for optimizing the parameters of the framework, and the remaining part served as the testing set for evaluating the final performance. Under the inter-patient scheme, patients were randomly separated in the proportion of 3:1:1 for training, validation, and testing, and their corresponding beats formed the training, validation, and testing sets.
A grid search was used to optimize the parameters over a given parameter grid, that is, an exhaustive search over the candidate values of each specified parameter. The dropout rate, learning rate, batch size, and number of epochs were selected by trial and error according to performance on the validation set. The candidate dropout rates were 0.2, 0.3, and 0.4; the candidate learning rates were 0.0008 and 0.001; the candidate batch sizes were 16, 24, and 32; and the candidate numbers of epochs were 10, 20, and 30. The results of each search are shown in Figure 4.
Moreover, to explore the effect of the component structures in the proposed framework, ablation experiments were conducted on MI detection. The proposed framework was also compared with one of the most popular dimensionality reduction methods, i.e., PCA [48], combined with a multi-layer perceptron (MLP) for classification (PCA-MLP). MI location was then conducted as an application and extension of our framework.
All experiments were run on Windows 10 with an NVIDIA GeForce GTX 1660 Ti GPU, an Intel(R) Core(TM) i7-9700K CPU @ 3.60 GHz, and 32 GB of RAM. The program was implemented with TensorFlow-gpu 1.9.0 and Keras 2.2.4 in Python 3.6.5.
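To make the size of the search space concrete, the following minimal sketch enumerates the candidate values listed above. It assumes a joint search over the full Cartesian product of the four parameters; the text does not state whether the parameters were searched jointly or one at a time. The helper names and `toy_validation_score` are hypothetical, and the toy score merely stands in for training the model and reading off the validation accuracy.

```python
from itertools import product

# Candidate values reported in the text.
PARAM_GRID = {
    "dropout_rate": [0.2, 0.3, 0.4],
    "learning_rate": [0.0008, 0.001],
    "batch_size": [16, 24, 32],
    "epochs": [10, 20, 30],
}


def iter_grid(grid):
    """Yield every combination of the grid as a dict (assumed joint search)."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, validation_score):
    """Return (best_params, best_score); a higher score is better."""
    best_params, best_score = None, float("-inf")
    for params in iter_grid(grid):
        score = validation_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    # Hypothetical stand-in for "train the model and return validation accuracy".
    # The peak below is arbitrary and only gives the demo a unique optimum;
    # it is not a measurement.
    peak = {"dropout_rate": 0.3, "learning_rate": 0.001, "batch_size": 24, "epochs": 20}
    return -sum(abs(params[key] - peak[key]) / peak[key] for key in peak)


n_combinations = sum(1 for _ in iter_grid(PARAM_GRID))
best_params, best_score = grid_search(PARAM_GRID, toy_validation_score)
print(n_combinations, best_params)
```

Under the joint-search assumption the grid contains 3 x 2 x 3 x 3 = 54 configurations, each of which would require a full training run per fold.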

MI Detection
MI detection is a binary classification task to distinguish MI patients from HCs. The experiments were conducted on 80 12-lead ECG records from HCs and 368 records from MI patients, with a total of 760,128 beats. Ablation experiments on the component structures were conducted with the same parameters as the MLA-CNN-BiGRU framework. In detail, the ablation structures were the MLA-BiGRU module without the feature attention mechanism (MLA-BiGRU w/o), the MLA-CNN module without the feature attention mechanism (MLA-CNN w/o), the MLA-BiGRU module with the feature attention mechanism (MLA-BiGRU), the MLA-CNN module with the feature attention mechanism (MLA-CNN), and CNN-BiGRU without the MLA mechanism but with the feature attention mechanism (CNN-BiGRU). Additionally, PCA-MLP was tested as a comparative framework that integrated a popular dimensionality reduction method with a basic neural network.

Intra-Patient Scheme
For MI detection under the intra-patient scheme, the results of the ablation experiments are presented in Table 2, and those of the comparative experiment are shown in Table 3. The average lead weights obtained by five-fold cross-validation are presented in Figure 5a. Among all the component structures, the proposed MLA-CNN-BiGRU achieved the highest average Acc of 99.93%, Sen of 99.99%, and Spe of 99.63%. The proposed framework also obtained the lowest standard deviation (std) of the three metrics, i.e., 0.05%, 0.004%, and 0.31%, respectively. The results of MLA-BiGRU w/o were comparable to those of MLA-CNN w/o but worse than those of MLA-BiGRU. MLA-CNN performed better than CNN-BiGRU and MLA-BiGRU, but was still inferior to MLA-CNN-BiGRU. Compared with PCA-MLP, the proposed framework also maintained the highest overall performance. According to Figure 5a, the highly recommended leads are I, II, V5, and V6, all of which have weights in excess of 0.8. Lead aVF is entirely excluded because its weight is zero.
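For reference, the sketch below shows one common way to compute Acc, Sen, and Spe from binary confusion counts and to summarize them across folds as a mean and a standard deviation. It assumes MI is the positive class and uses the sample standard deviation; the text does not say whether the sample or the population form was used. The per-fold counts are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def binary_metrics(tp, tn, fp, fn):
    """Return (Acc, Sen, Spe) as fractions, with MI as the positive class."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)  # share of MI beats that were detected
    spe = tn / (tn + fp)  # share of HC beats that were correctly rejected
    return acc, sen, spe


def summarize(values):
    """Mean and sample standard deviation of one metric across folds."""
    return mean(values), stdev(values)


# Hypothetical (tp, tn, fp, fn) counts for five folds, for illustration only.
fold_counts = [
    (980, 190, 10, 20),
    (975, 185, 15, 25),
    (990, 195, 5, 10),
    (970, 180, 20, 30),
    (985, 192, 8, 15),
]

fold_metrics = [binary_metrics(*counts) for counts in fold_counts]
for name, column in zip(("Acc", "Sen", "Spe"), zip(*fold_metrics)):
    avg, std = summarize(column)
    print(f"{name}: mean={avg:.4f}, std={std:.4f}")
```

Replacing `stdev` with `statistics.pstdev` gives the population form if that convention is preferred.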

Inter-Patient Scheme
For MI detection under the inter-patient scheme, the results of the ablation experiments are summarized in Table 4, and the results of the comparison experiment are given in Table 3. The average lead weights are illustrated in Figure 5b. The proposed framework achieved the highest average Acc of 96.50%, Sen of 97.10%, and Spe of 93.34% among all the methods. It also obtained the lowest std in Acc and Spe, i.e., 2.25% and 4.84%, respectively. Consistent with the intra-patient scheme, MLA-CNN performed better than MLA-BiGRU w/o, MLA-CNN w/o, MLA-BiGRU, and CNN-BiGRU, but was still worse than the complete hybrid framework MLA-CNN-BiGRU. Compared with PCA-MLP in Table 3, the Acc of the proposed framework was improved by 24.88%, and its std was low. As indicated in Figure 5b, the leads with large weights are II, aVL, V5, and V6, all with weights above 0.7. Leads aVF and V2 are virtually redundant and ineffective.
Table 4. Ablation experiments of MI detection by five-fold cross-validation under inter-patient scheme. Best performance is highlighted in bold.

MI Location
MI location is a multi-class classification task. In this study, the proposed MLA-CNN-BiGRU framework was applied to MI location based on six classes of 12-lead ECG records, namely HC and five types of MI. In detail, the six categories comprised 80 records from HCs, 47 records from AMI, 43 records from ALMI, 79 records from ASMI, 89 records from IMI, and 56 records from ILMI, with a total of 678,612 beats.
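For a six-class task, Sen and Spe are usually computed per category in a one-vs-rest fashion. The sketch below illustrates that convention on a hypothetical confusion matrix; the counts are invented for illustration, and the text does not specify how the per-category values were averaged, so no averaging is shown.

```python
LABELS = ["HC", "AMI", "ALMI", "ASMI", "IMI", "ILMI"]


def one_vs_rest_metrics(confusion, labels):
    """Overall Acc plus per-class (Sen, Spe) from a square confusion matrix.

    confusion[i][j] counts beats of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    per_class = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                # class k beats that were missed
        fp = sum(row[k] for row in confusion) - tp  # other beats labeled as class k
        tn = total - tp - fn - fp
        per_class[label] = (tp / (tp + fn), tn / (tn + fp))
    return correct / total, per_class


# Hypothetical beat counts, invented for illustration only.
toy_confusion = [
    [90, 2, 2, 2, 2, 2],
    [5, 40, 5, 0, 0, 0],
    [0, 5, 40, 5, 0, 0],
    [0, 0, 5, 90, 5, 0],
    [0, 0, 0, 5, 90, 5],
    [5, 0, 0, 0, 5, 40],
]

overall_acc, per_class = one_vs_rest_metrics(toy_confusion, LABELS)
print(f"Acc={overall_acc:.4f}")
for label, (sen, spe) in per_class.items():
    print(f"{label}: Sen={sen:.4f}, Spe={spe:.4f}")
```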

Intra-Patient Scheme
MI location was also performed under the intra-patient scheme. The results of five-fold cross-validation are presented in Table 5, including the metrics calculated for each category. The average lead weights obtained by cross-validation are shown in Figure 5c. As presented in Table 5, MLA-CNN-BiGRU achieved an average Acc of 99.11%, Sen of 99.02%, and Spe of 99.10%. According to Figure 5c, the recommended leads for MI location are II, III, V5, and V6, all with weights over 0.6. Leads I, aVF, V1, and V2 are excluded because they contribute little to the subsequent processing.
Table 5. Results on MI location by five-fold cross-validation (columns: folds, category, intra-patient scheme, inter-patient scheme).

Inter-Patient Scheme
For the inter-patient scheme, Table 5 presents the results of five-fold cross-validation. The average lead weights are illustrated in Figure 5d. As can be observed in Table 5, the results in this case are much lower than those in the other three cases. In addition, the lead weights were relatively small, with V6 having the maximum weight of 0.44. Only lead aVL was eliminated during the training of the model. Owing to the uneven distribution of beat counts, the category with the highest performance varied considerably between folds.

Discussion
This paper presents a novel and reliable MLA-CNN-BiGRU framework for MI detection and location under both the intra-patient and inter-patient schemes. Ablation experiments based on the MLA mechanism, CNN module, BiGRU module, and feature integration attention mechanism were carried out to explore the role of each component structure in improving the performance of MI diagnosis. Moreover, the proposed framework was compared with another widely adopted feature extraction method. Standard metrics, i.e., Acc, Sen, and Spe, were employed to verify the effectiveness of the proposed framework. Among all the experiments presented in Section 5, MLA-CNN-BiGRU performed best in MI diagnosis under both schemes, outperforming its component structures as well as the alternative feature extraction method.
As shown in Figure 4, the accuracy was almost identical for batch sizes of 24 and 32, and slightly lower for a batch size of 16. A learning rate of 0.001 performed slightly better than a learning rate of 0.0008. The most suitable number of epochs was 20: too few epochs led to under-fitting of the neural network, whereas too many gave rise to over-fitting. The dropout rate also influenced the accuracy and could be set neither too high nor too low; a dropout rate of 0.3 was more appropriate.
In this study, the rank (from high to low) of the lead contribution to MI detection is: I, V5, V6, II, V1, aVL, aVR, V3, V4, V2, III, and aVF under the intra-patient scheme; and V5, II, V6, aVL, V4, I, III, V3, V1, aVR, V2, and aVF under the inter-patient scheme. The rank (from high to low) of the lead contribution to MI location is: V6, III, V5, II, V3, aVR, V4, aVL, I, aVF, V1, and V2 under the intra-patient scheme; and V6, V5, I, III, V3, V4, V2, aVR, II, aVF, V1, and aVL under the inter-patient scheme. In theory, each lead reflects a different perspective of the heart activity. More precisely, leads V3 and V4 correspond to the anterior aspect of the heart. Leads V1 and V2 reflect both the septal and posterior aspects of the heart. The inferior part is related to leads II, III, and aVF. The lateral part is associated with leads I, aVL, V5, and V6. Lead aVR is related to the endocardial part [49]. In the experimental results, leads I, II, III, V3, V5, and V6 were of more importance, which may be caused by the data distribution. Since most of the MIs in the PTB dataset were related to the anterior, inferior, and lateral parts, the weights were primarily assigned to the leads that could assist in the diagnosis of these three main parts. In the literature, lead V5 achieved the highest sensitivity in detecting myocardial ischemia [50] and presented the best performance among all 12 leads of ECG signals [51]. In addition, lead II is a commonly used lead for basic cardiac monitoring [19]. As shown in Table 6, leads I, III, and V3 were also selected and achieved good results. These previous findings are consistent with our experimental results, in which leads V5 and II made a greater contribution. In fact, the lead contribution is related not only to the model architecture but also to the sample distribution. It should be mentioned that this study did not focus on which leads were closely related to MI diagnosis from a pathological perspective. Instead, it contributes to optimizing the number of leads by selecting the most essential ones, which can assist the proposed framework in obtaining the most effective diagnosis.
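As a rough cross-check of the lead importance discussed above, the four reported rankings can be aggregated. The sketch below uses the mean rank position across the four experiments, which is an assumption made here for illustration and not an aggregation used in this study. The rankings and the lead-to-region mapping are copied from the text.

```python
# Lead rankings (most to least important) as reported in the text.
RANKINGS = {
    "detection/intra": ["I", "V5", "V6", "II", "V1", "aVL", "aVR", "V3", "V4", "V2", "III", "aVF"],
    "detection/inter": ["V5", "II", "V6", "aVL", "V4", "I", "III", "V3", "V1", "aVR", "V2", "aVF"],
    "location/intra": ["V6", "III", "V5", "II", "V3", "aVR", "V4", "aVL", "I", "aVF", "V1", "V2"],
    "location/inter": ["V6", "V5", "I", "III", "V3", "V4", "V2", "aVR", "II", "aVF", "V1", "aVL"],
}

# Cardiac region associated with each lead, as described in the text.
REGION = {
    "V3": "anterior", "V4": "anterior",
    "V1": "septal/posterior", "V2": "septal/posterior",
    "II": "inferior", "III": "inferior", "aVF": "inferior",
    "I": "lateral", "aVL": "lateral", "V5": "lateral", "V6": "lateral",
    "aVR": "endocardial",
}


def mean_rank(rankings):
    """Average 1-based rank position of each lead (assumed aggregation)."""
    leads = next(iter(rankings.values()))
    return {
        lead: sum(order.index(lead) + 1 for order in rankings.values()) / len(rankings)
        for lead in leads
    }


scores = mean_rank(RANKINGS)
# Ties are broken alphabetically so that the ordering is deterministic.
ordered = sorted(scores, key=lambda lead: (scores[lead], lead))
top_six = ordered[:6]
top_regions = {REGION[lead] for lead in top_six}
print(top_six, sorted(top_regions))
```

Under this assumed aggregation, the six best-ranked leads are I, II, III, V3, V5, and V6, and together they cover the anterior, inferior, and lateral regions.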
Neural networks are good at processing high-dimensional nonlinear data by virtue of automatic feature extraction. Compared with PCA presented in Table 3, the neural network frameworks performed better because their feature extraction and classification form an end-to-end system. CNN and GRU can extract various features directly from the original data through convolutional abstraction and gate-based memory cells. CNNs are popular models for image data processing, while GRUs are well suited to processing temporal sequence data. The CNN module performed better than the BiGRU module, as shown in both Tables 2 and 4, which indicates that spatial features contained more useful information for the diagnosis of MI. Additionally, the component structures with an attention layer performed better than those without it. This indicates that redundant information remained after feature extraction and that the attention layer was needed to integrate the features effectively. After the MLA layer was removed, as shown in Tables 2 and 4, the performance of CNN-BiGRU was lower than that of the complete framework, which verifies the effectiveness of the proposed MLA mechanism. Furthermore, the hybrid framework showed better performance and stability than the component structures. Despite requiring additional training, the combination of spatial and temporal features with the attention mechanism exhibited more robust performance than the other methods. The combined features were therefore discriminative for the diagnosis of MI, and it was essential to consider both the relationships between different leads and the temporal characteristics of ECG signals.
In this study, MI diagnosis consists of detection and location. MI detection is a binary classification problem, while MI location is a six-class classification task. The results indicate that MI detection obtained better performance than MI location and that the intra-patient scheme achieved better performance than the inter-patient scheme. Because the inter-patient scheme prevents the model from being trained and tested on beats from the same patients, the model must overcome individual differences, which makes the task more difficult. Furthermore, the inter-patient scheme caused an unbalanced distribution of data and greatly affected the performance of the model. Notably, the performance of MI location under the inter-patient scheme remains to be improved.
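The difference between the two schemes comes down to what is shuffled: beats or patients. The following minimal sketch illustrates a single patient-wise 3:1:1 split in which no patient contributes beats to more than one subset. The function name, the seed, and the toy data are hypothetical, and the rotation of subsets across the five folds is not shown.

```python
import random


def split_by_patient(beat_patient_ids, ratios=(3, 1, 1), seed=0):
    """Assign whole patients to train/validation/test in the given proportion.

    Returns three lists of beat indices; no patient appears in two subsets.
    """
    patients = sorted(set(beat_patient_ids))
    random.Random(seed).shuffle(patients)
    total = sum(ratios)
    n_train = len(patients) * ratios[0] // total
    n_val = len(patients) * ratios[1] // total
    groups = (
        set(patients[:n_train]),
        set(patients[n_train:n_train + n_val]),
        set(patients[n_train + n_val:]),
    )
    return tuple(
        [i for i, pid in enumerate(beat_patient_ids) if pid in group]
        for group in groups
    )


# Hypothetical data: 10 patients, where patient p contributes p + 1 beats.
beat_patient_ids = [pid for pid in range(10) for _ in range(pid + 1)]
train, val, test = split_by_patient(beat_patient_ids)
print(len(train), len(val), len(test))
```

Because patients contribute different numbers of beats, the beat counts of the three subsets generally deviate from 3:1:1 even though the patient counts follow it, which is one way the unbalanced distribution mentioned above can arise.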
The proposed framework was compared with previous studies on the same PTB dataset, as shown in Table 6. Among all the methods, the proposed framework achieved the highest accuracy in MI detection under both the intra-patient and inter-patient schemes. Compared with the method of Han and Shi [24], the accuracy, sensitivity, and specificity of our framework were improved by 7.20%, 16.39%, and 7.63%, respectively, in MI location under the inter-patient scheme. Moreover, this study has several merits, such as the use of 12-lead ECG signals, an effective end-to-end system, model-driven selection of leads, elaborate feature extraction from both the spatial and temporal perspectives of the signals, and exhaustive experiments on MI detection and location under two schemes. Furthermore, our study designed ablation experiments to examine the effectiveness of the component structures, which verified the reliability of the proposed framework more comprehensively.
Although the proposed framework achieved the best results, three limitations need to be addressed in our future work. Firstly, although it is worthwhile to sacrifice training time and memory to achieve higher diagnostic accuracy, the proposed hybrid framework has a complicated structure and a large number of parameters, which makes it challenging to embed the network into portable mobile devices. The architecture of the network therefore remains to be explored and further optimized. For instance, the BiGRU module, which runs slowly, is a promising target for optimization. Additionally, the number of parameters in the attention mechanism should be reduced appropriately. In essence, these changes are trade-offs between the complexity and accuracy of the framework, which deserve elaboration in future studies. Secondly, to achieve expert-level diagnosis, the process of lead selection should be explained more precisely by comparative experiments and pathological analysis. Thirdly, the framework should be evaluated on more diverse datasets to confirm its robustness in practical applications.

Conclusions
In this paper, a novel MLA-CNN-BiGRU framework for automatic MI detection and location based on 12-lead ECG signals is presented. To employ all 12 leads efficiently and effectively, the MLA mechanism is developed to weight the contribution of each lead through the designed activation function, so that useful leads can be selected for the subsequent process. For feature extraction, a CNN is introduced to extract spatial features from the inter-correlated ECG signals of the different leads, while a BiGRU is applied to extract temporal features inside each lead. Both neural networks end with an attention layer for feature integration. The spatial and temporal features extracted by the two modules are then combined as global spatial-temporal features for the final classification. Comparative and ablation experiments were conducted under the inter-patient and intra-patient schemes to confirm the effectiveness of the proposed framework in MI detection and location. The experimental results indicate that the proposed framework demonstrated satisfactory performance on the PTB dataset, but MI location under the inter-patient scheme needs further improvement. With the proposed model-based approach, this study serves as a preliminary exploration of the importance of each lead in the diagnosis of MI. Moreover, in the field of 12-lead ECG signal processing, this study provides new insight into the application of attention mechanisms and parallel feature extraction structures based on deep learning.