A Novel 1-D CCANet for ECG Classiﬁcation

This paper puts forward a 1-D convolutional neural network (CNN) that exploits a novel analysis of the correlation between the two leads of the noisy electrocardiogram (ECG) to classify heartbeats. The proposed method is one-dimensional, enabling complex structures while maintaining a reasonable computational complexity. It is based on the combination of elementary handcrafted time-domain features, frequency-domain features obtained through spectrograms, and autoregressive modeling. On the MIT-BIH database, a 95.52% overall accuracy is obtained when classifying 15 heartbeat types, whereas a 95.70% overall accuracy is reached when classifying 7 types from the INCART database.


Introduction and Related Work
Cardiovascular diseases are the leading cause of death in the world, with an estimated 17.9 million deaths each year. Among them, heart arrhythmia is an abnormal heart rhythm that can result in serious complications such as stroke or cardiac death. Early detection of arrhythmia is therefore a major challenge for our society.
With electrocardiograms (ECGs), heartbeats can be visually labelled according to several classes such as Normal beat, Supraventricular escape beat, etc. An ECG is a graph of voltage versus time of the electrical activity of the heart using electrodes placed on the skin. To assess the condition of the heart from different angles, an ECG has several leads, each of them being the signal generated by a pair of electrodes.
In the last decades, researchers employed machine learning methods for the automatic classification of heartbeats contained in long-duration recordings of human ECGs [1,2]. A traditional heartbeat classification pipeline includes data preprocessing, data segmentation, feature extraction, feature selection, and classification [3].
Feature selection is used to reduce the number of features used by the classifier, thus reducing the complexity and time required for computation. Several approaches have been adopted: principal and independent component analysis [5,6,21,22], linear discriminant analysis [6], and genetic algorithms [23].
As discussed above, ECGs can be recorded at different locations on the body, thus obtaining so-called multilead ECGs. Up to 12 leads can be recorded, and each lead represents a specific characteristic of the heart. Multilead ECGs reflect the state of the heart better than single-lead ECGs, so taking multiple leads into account may improve performance. The existing literature, however, mainly focuses on the processing of single-lead ECGs [20].
In this paper, we focus on two-lead ECGs: we use lead V1, which is a chest lead, and lead II, which is a limb lead. We propose combining hand-crafted features with a canonical correlation analysis network (CCANet) and SVMs for two-lead heartbeat classification. The analysis of the correlation between the two leads of the ECG is exploited to increase heartbeat classification performance [20]. The proposed CCANet is a 1-D variant of the original 2-D CCANet used by Yang et al. [20] that makes it possible to explore a deeper CCANet while maintaining a reasonable computational complexity and providing better results. CCANet was originally proposed by Yang et al. [33] in 2017 for the processing of two-view images; compared with the one-view image-based PCANet and RandNet, CCANet was shown to perform better [33]. CCANet has also been employed in other computer vision tasks, such as remote sensing scene classification [34], as well as in ECG interpretation [20].
There are two types of CNNs commonly used for ECG classification: 1-D CNNs and 2-D CNNs [35]. 2-D CNNs usually operate on transformed ECG data, such as spectrograms, gray-level co-occurrence matrices, combined features, and others. 1-D CNNs operate directly on the raw ECG signal. Our one-dimensional variant takes as input a combination of elementary hand-crafted time-domain features, frequency-domain features obtained through spectrograms, and autoregressive modeling.
For the sake of comparison, we evaluate a suitably implemented 1-D convolutional neural network (CNN) solution based on residual networks (ResNet) [36]. ResNet has been demonstrated to be one of the best-performing CNN architectures for visual recognition [37]. The proposed method outperforms the state of the art on both the MIT-BIH and INCART arrhythmia databases.

Our Contribution
The main novel contributions of this paper are summarized as follows:
• We have designed a novel one-dimensional canonical correlation analysis network (1-D CCANet) to exploit two-lead ECGs for automatic classification of heartbeats that outperforms the state of the art;
• We have explored the use of handcrafted features in combination with a 1-D CCANet for ECG classification;
• Our proposal outperforms a solution based on a suitable one-dimensional ResNet that we have implemented for the sake of comparison.

MIT-BIH Database
The MIT-BIH database contains 48 sets of two-lead ECG signals (lead II and mostly V1). Each signal is approximately 30 min long, has been collected at a 360 Hz sampling frequency, and has been independently annotated by at least two cardiologists. Annotations include the 15 types listed in Table 1. In our study, we use the signals for which both II and V1 leads are available (see PhysioBank for further details).

INCART Database
The St. Petersburg Institute of Cardiological Technics 12-lead arrhythmia database (INCART) contains 75 sets of 12-lead ECG signals (leads I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, V6). Each signal is approximately 30 min long and has been collected at a 257 Hz sampling frequency. We only consider leads II and V1 of each record. Annotations include the 7 types listed in Table 2.

Proposed Method
The input of the proposed method is a two-channel ECG segment obtained after a preliminary segmentation that consists in isolating the heartbeats in each record. Given the R-peak positions, a heartbeat is isolated by retaining T_1 and T_2 samples to the left and to the right of the R-peak, respectively. For each of the two leads, a vector denoted x_h (h = 1, 2), of size (T_1 + T_2), is built with the values of the ECG (in Volts). The values of (T_1, T_2) are (160, 200) for MIT-BIH and (120, 136) for INCART. These values are given in Table 3 and are similar to those used in [20].

The architecture of the whole process is shown in Figure 1a,b. The first stage is feature extraction. The input of the process is, for each lead, a vector x_h (h = 1, 2) containing the raw values of the segmented heartbeat. Each lead x_h is normalized (see the "Normalization" module in Figure 1a) by a rescaling procedure so that the resulting vector x_h,norm has an intensity ranging from 0 to 1, as per Equation (1):

x_h,norm = (x_h − min(x_h)) / (max(x_h) − min(x_h)).   (1)

At the same time, hand-crafted features are extracted from each lead x_h (h = 1, 2): frequency-domain features x_h,spec and autoregression features x_h,ar. A single time-domain feature vector x_time is also computed from both leads. The frequency-domain features, the autoregression features, and the normalized segmented heartbeat x_h,norm are concatenated to obtain the vector x_h,cat = [x_h,ar, x_h,spec, x_h,norm] (see Figure 1a). Each vector x_h,cat (h = 1, 2) is processed by the neural module to produce a single output vector f_neur for the two leads (see Figure 1b). The vector f_neur is then reduced in dimension with Principal Component Analysis (PCA), yielding the vector f_pca. The concatenation of the time-domain features x_time and f_pca is the input of a Support Vector Machine (SVM) classifier, whose output is the predicted heartbeat class.
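The segmentation and normalization steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names and the R-peak handling are our choices.

```python
import numpy as np

def segment_heartbeat(lead, r_peak, t1, t2):
    """Isolate one heartbeat: keep t1 samples to the left and t2 samples
    to the right of the R-peak (helper name is illustrative)."""
    return lead[r_peak - t1 : r_peak + t2]

def min_max_normalize(x):
    """Rescale a segmented heartbeat to the [0, 1] range (the paper's
    normalization step)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())
```

With the MIT-BIH values T_1 = 160 and T_2 = 200, each segmented beat has 360 samples, matching the 360 Hz sampling rate of that database.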
In the following subsections the feature extraction and neural module are discussed more in detail.

Hand-Crafted Feature Extraction
Given an isolated heartbeat x h (h = 1, 2), hand-crafted features are extracted with three different methods: frequency-domain, time-domain, autoregressive modeling.

One-Dimensional Spectrogram
For the frequency domain, we use a one-dimensional spectrogram, a representation of the spectrum of frequencies of a signal as it varies with time. It is built through a short-time Fourier transform (STFT) of each of the two non-normalized leads x_h (h = 1, 2). A window slides through the signal (with potential overlap) and, at each step, the squared magnitudes of the STFT of the portion of the signal inside the window are computed. A Hamming window is used for this process. The spectrogram is then obtained by concatenating, along the time axis, the squared magnitudes acquired at each window position. The squared magnitudes obtained for each frequency (up to half the sampling rate) at each time step are stored in a matrix whose axes 0 and 1 are the frequency and time axes, respectively. Since the range of the squared magnitudes varies significantly, the resulting matrix is rescaled to [0, 1] to yield X_h,spec (h = 1, 2).
A weighted average along the frequency axis then turns the matrix X_h,spec into a one-dimensional vector x_h,spec, as per Equation (2):

x_h,spec(t) = (1 / S(n)) Σ_k (1 / k^n) X_h,spec(k, t),  where S(n) = Σ_k 1 / k^n.   (2)

This feature extraction method requires three parameters: the number of samples in the STFT window (N_wind = 64 and 46 for MIT-BIH and INCART, respectively), the number of samples in the overlap between two consecutive steps (N_overlap = 32 and 23 for MIT-BIH and INCART, respectively), and the weight parameter n in Equation (2) (0.25 for both databases). Suitable parameters are found with a greedy search. The feature vector x_h,spec is of size 10.
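A minimal sketch of this frequency-domain feature, assuming SciPy's PSD-mode spectrogram as a stand-in for the squared-magnitude STFT, and the MIT-BIH parameters (N_wind = 64, N_overlap = 32, n = 0.25). With a 360-sample heartbeat, this produces the 10-dimensional vector mentioned above.

```python
import numpy as np
from scipy.signal import spectrogram

def spectrogram_feature(x, n_wind=64, n_overlap=32, n=0.25):
    """Rescaled spectrogram collapsed to 1-D by a 1/k**n weighted average
    over the frequency axis; defaults are the reported MIT-BIH values."""
    _, _, sxx = spectrogram(x, window="hamming",
                            nperseg=n_wind, noverlap=n_overlap)
    sxx = (sxx - sxx.min()) / (sxx.max() - sxx.min())  # rescale to [0, 1]
    k = np.arange(1, sxx.shape[0] + 1, dtype=float)    # frequency-bin index
    w = 1.0 / k**n                                     # 1/k**n weights
    return (w @ sxx) / w.sum()                         # weighted average

beat = np.random.default_rng(0).standard_normal(360)   # toy 360-sample beat
feat = spectrogram_feature(beat)
```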

Autoregressive Modeling
Autoregressive (AR) modeling specifies that a time-series value depends linearly on its own previous values and a stochastic term, as per Equation (3):

X_t = Σ_{i=1}^{p} φ_i X_{t−i} + ε_t,   (3)

where X_t is the time series, φ_i are the AR coefficients computed with the Yule-Walker method, ε_t is a white-noise term, and p is the order of the AR model. Since the choice of the order p depends highly on the sampling rate, the non-normalized ECGs from both databases are resampled to 360 Hz [38]. The order was then chosen by a best-parameter search on the training data for both the MIT-BIH and INCART databases: we chose the order that maximized the average of our performance metrics (accuracy, specificity, sensitivity, PPV) on a validation set. Figure 2 shows that the best order is 2 for the INCART data and 3 for the MIT-BIH data.
Since the performance for MIT-BIH is quite comparable for orders 2 and 3, we chose an order equal to 2 for both datasets. We preferred a lower order to reduce the computational cost. The vector of AR coefficients obtained for each lead of one heartbeat, x h (h = 1, 2), is denoted as x h,ar and is of size 2.
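The Yule-Walker estimation of the AR(2) coefficients can be sketched as follows. This is a plain NumPy implementation of the Yule-Walker equations written for illustration; statsmodels also provides an equivalent `yule_walker` routine.

```python
import numpy as np

def yule_walker_ar(x, p=2):
    """Estimate AR(p) coefficients by solving the Yule-Walker equations
    (p = 2 is the order retained in the paper for both databases)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    # Biased autocovariance estimates r[0..p]
    r = np.array([x[: n - j] @ x[j:] for j in range(p + 1)]) / n
    # Toeplitz system R * phi = r[1..p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1 : p + 1])
```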

Time-Domain Features
For each of the two leads x_h (h = 1, 2) of a segmented heartbeat, we compute the following time-domain features: the median value of x_h, its fourth-order and fifth-order central moments, and its kurtosis. Finally, for both leads, we build a single vector of time-domain features including the above features for each lead and the heartbeat rate of the patient to whom the heartbeat belongs. The resulting vector, denoted x_time, is of size 9.
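These time-domain features can be sketched with NumPy and SciPy; the function name and argument order are illustrative, not the authors' interface.

```python
import numpy as np
from scipy.stats import kurtosis, moment

def time_features(lead1, lead2, heart_rate):
    """x_time (size 9): median, 4th- and 5th-order central moments and
    kurtosis of each lead, plus the patient's heartbeat rate."""
    feats = []
    for x in (lead1, lead2):
        feats += [np.median(x), moment(x, moment=4),
                  moment(x, moment=5), kurtosis(x)]
    feats.append(heart_rate)
    return np.array(feats)
```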

Neural Feature Extraction
To exploit the correlation between two ECG leads, we use a one-dimensional variant of the canonical correlation analysis network (CCANet). First introduced in the field of image recognition by Yang et al. [33], CCANet has been employed in two-view image recognition tasks. Recently, CCANets, which are intrinsically two-dimensional, have also been successfully employed in the signal processing field for the classification of two- and three-lead heartbeats [20]. A CCANet is usually composed of two cascaded convolutional layers and an output layer: (1) in the convolutional layers, the CCA technique is used to extract dual-lead filter banks; (2) in the output layer, the features extracted from the second convolutional layer are mapped into the final feature vector [20].
In this paper, with the aim of increasing performance, we design a new 1-D canonical correlation analysis network composed of four 1-D convolutional layers and an output layer. Contrary to CCANet, the filters are found by combining CCA with a singular value decomposition (SVD), and features are extracted after each layer. Using 1-D convolutions instead of 2-D limits the computational cost, which allows the number of layers to increase from two to four and, consequently, the performance to improve.
The processing pipeline is shown in Figure 3. The input of the proposed 1-D CCANet-SVD is the concatenation of the autoregressive features, the spectrogram features, and the original normalized heartbeat, i.e., the vector x_h,cat = [x_h,ar, x_h,spec, x_h,norm] ∈ R^m, h = 1, 2. The 1-D CCANet-SVD is trained with N two-lead heartbeats and then used as a neural feature extractor in combination with a linear SVM for heartbeat classification. The network is trained separately for the MIT-BIH and INCART databases.

First Convolutional Layer
We denote by x^(i)_h,cat the i-th element (i ∈ {1, ..., m}) of an input vector x_h,cat. We select a series of segments of size k centered on each value x^(i)_h,cat, obtaining the m segments b_h,1, ..., b_h,m ∈ R^k. The latter are then zero-centered and concatenated to build a matrix of segments [b_h,1, ..., b_h,m] ∈ R^(k×m). This procedure is performed on each of the N training heartbeats, and the resulting matrices of segments are finally concatenated to obtain X_h ∈ R^(k×Nm), h = 1, 2. Note that our network is fed simultaneously with all the training heartbeats in order to build the two matrices X_1 and X_2.
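The segment-matrix construction can be sketched as follows; the zero-padding at the signal borders is our assumption for handling segments that extend past the edges.

```python
import numpy as np

def patch_matrix(x, k):
    """k x m matrix of zero-centered, length-k segments centered on each
    sample of x (zero-padding at the borders is our assumption)."""
    m = len(x)
    xp = np.pad(np.asarray(x, dtype=float), k // 2)
    cols = np.stack([xp[i : i + k] for i in range(m)], axis=1)  # k x m
    return cols - cols.mean(axis=0)        # zero-center each segment

def training_matrix(beats, k):
    """Concatenate the patch matrices of N beats into X_h (k x N*m)."""
    return np.hstack([patch_matrix(b, k) for b in beats])
```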
Let us address the filter extraction stage. In [20], the filters are found with a CCA, i.e., by maximizing the correlation between pairs of projected variables. The first projection direction is obtained by optimizing Equation (4):

max_{a_1, b_1} a_1^T S_12 b_1,   (4)

with the constraints a_1^T S_11 a_1 = 1 and b_1^T S_22 b_1 = 1, where S_hh' = X_h (X_h')^T (h, h' = 1, 2), and a_1 and b_1 are the first canonical vectors for the two leads. The Lagrange multiplier technique shows that a_1 and b_1 are eigenvectors of M_1 = S_11^−1 S_12 S_22^−1 S_21 and M_2 = S_22^−1 S_21 S_11^−1 S_12, respectively. Given the first l − 1 directions, the l-th projection direction is calculated by solving problem (4) with the additional constraints a_i^T S_11 a_l = b_i^T S_22 b_l = 0 (i < l). In the end, the L_1 filters for the first lead are built by taking the L_1 primary eigenvectors of M_1 (i.e., those associated with the L_1 largest eigenvalues), whereas the L_1 filters for the second lead are built from the L_1 primary eigenvectors of M_2.
In this paper, we use a slightly different approach, referred to as the CCA-SVD filter extraction technique. We perform an SVD of both M_1 and M_2:

M_1 = U_1 D_1 V_1^T,  M_2 = U_2 D_2 V_2^T,

where the U and V matrices are unitary and the D matrices are diagonal, with the singular values on their diagonals. The SVD retrieves the directions that best explain the variance of M_1 and M_2. Since these two matrices derive from the CCA, they capture the correlations between the two leads; the directions found by their SVDs therefore best explain the correlation between the two leads. Consequently, the L_1 filters for the first lead are built by taking the columns of U_1 associated with the L_1 largest singular values of D_1, whereas the L_1 filters for the second lead are built from the columns of U_2 associated with the L_1 largest singular values of D_2. This approach yields better results than the traditional CCA filter extraction technique (see Experiments). We denote by W_1,l and W_2,l, l = 1, ..., L_1, the L_1 filters of size k corresponding to the first and second lead, respectively.
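The CCA-SVD filter extraction can be sketched as follows. The small ridge term added before the matrix inversions is our numerical-stability choice, not part of the paper's formulation.

```python
import numpy as np

def cca_svd_filters(X1, X2, L, eps=1e-6):
    """CCA-SVD filter extraction: form M1 and M2 from the auto- and
    cross-covariance matrices of the two leads, then keep as filters the
    left singular vectors tied to the L largest singular values."""
    S11, S22, S12 = X1 @ X1.T, X2 @ X2.T, X1 @ X2.T
    k = S11.shape[0]
    inv11 = np.linalg.inv(S11 + eps * np.eye(k))   # eps ridge: our choice
    inv22 = np.linalg.inv(S22 + eps * np.eye(k))
    M1 = inv11 @ S12 @ inv22 @ S12.T
    M2 = inv22 @ S12.T @ inv11 @ S12
    U1 = np.linalg.svd(M1)[0]       # singular values come sorted descending
    U2 = np.linalg.svd(M2)[0]
    return U1[:, :L].T, U2[:, :L].T  # L filters of size k per lead
```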
As for the convolutions, for each lead h, each input signal x_h,cat yields L_1 outputs x_h,cat,l = x_h,cat * W_h,l, l = 1, ..., L_1. The lengths of the input and output signals are kept identical thanks to zero-padding of the input.

First Extraction Stage
The extraction stage follows the same steps as in [20]. First, for each heartbeat, the output of the first convolution is converted to a decimal one-dimensional signal as per

T = Σ_{l=1}^{L_1} 2^(l−1) H([x_1,cat,l, x_2,cat,l]) ∈ R^(2m),

where H is the Heaviside step function. The range of each component of T is therefore [0, 2^L_1 − 1]. T is then divided into B blocks of size u_1. Each block can overlap with its neighbor according to an overlapping proportion parameter R_1 ∈ [0, 1]. For each block, a histogram with 2^L_1 bins is built; the values of the resulting histogram are embedded in a vector of length 2^L_1, and the vectors provided by the blocks are concatenated to obtain Bhist(T) ∈ R^(2^L_1 B). The first feature vector for the heartbeat is f_1 = Bhist(T).
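The hashing and block-histogram steps can be sketched as follows; the exact boundary and overlap handling are our assumptions.

```python
import numpy as np

def binary_hash(pair_outputs):
    """Convert the L1 two-lead filter outputs [x_1,cat,l, x_2,cat,l]
    into one decimal signal T via the Heaviside step function."""
    T = np.zeros(len(pair_outputs[0]))
    for l, o in enumerate(pair_outputs):
        T += 2**l * (np.asarray(o) > 0)   # H(.) = 1 for positive values
    return T

def block_histograms(T, L1, u, R=0.0):
    """Bhist(T): blocks of size u with overlap ratio R; one histogram
    with 2**L1 bins per block, all concatenated."""
    step = max(1, int(round(u * (1 - R))))
    hists = [np.bincount(T[s : s + u].astype(int), minlength=2**L1)
             for s in range(0, len(T) - u + 1, step)]
    return np.concatenate(hists)
```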

Second Convolution Layer and Extraction Stage
The second layer is identical to the previous one, except that its input is different. Before the first convolution, each lead of a heartbeat was represented by a single vector of length m; after the first convolution, each lead is represented by L_1 vectors of length m. Let us walk through the second layer with the notation used so far.
The vectors x_h,cat,l = x_h,cat * W_h,l, l = 1, ..., L_1, produced by the first convolutional layer are the input of the second layer. Since we initially considered N training heartbeats, this layer has N × L_1 input vectors corresponding to lead 1 and N × L_1 input vectors corresponding to lead 2. The same segmentation and zero-centering process as in the first layer gives Y_h ∈ R^(k×mNL_1) (h = 1, 2), the matrix of concatenated segments over all input vectors, for each lead.
Applying the CCA-SVD filter extraction technique with S̃_hh' = Y_h (Y_h')^T leads us to perform the SVD of M̃_1 = S̃_11^−1 S̃_12 S̃_22^−1 S̃_21 and M̃_2 = S̃_22^−1 S̃_21 S̃_11^−1 S̃_12, for the first and second lead, respectively. The filters are then found exactly as in the first convolutional layer, and we denote by W̃_1,ℓ and W̃_2,ℓ, ℓ = 1, ..., L_2, the L_2 filters of size k extracted for the first and second lead, respectively.
As for the convolutions, for each initial lead h = 1, 2 and channel l ∈ {1, ..., L_1}, the signal x_h,cat,l yields L_2 outputs x_h,cat,l,ℓ = x_h,cat,l * W̃_h,ℓ, ℓ = 1, ..., L_2. At this stage, each initial lead of a heartbeat is represented by L_1 × L_2 vectors of size m.
The second extraction step is the same as after the first convolutional layer, except for a few points. First, for each heartbeat, the output of the second convolutional layer is converted to decimal signals as per

T̃_l = Σ_{ℓ=1}^{L_2} 2^(ℓ−1) H([x_1,cat,l,ℓ, x_2,cat,l,ℓ]) ∈ R^(2m),  l ∈ {1, ..., L_1}.

The second feature vector for the heartbeat is then obtained as f_2 = [Bhist(T̃_1), Bhist(T̃_2), ..., Bhist(T̃_L_1)]. The Bhist are built with a block size u_2 and an overlapping parameter R_2.
The third and fourth convolutional layers are built similarly. f_3 and f_4 denote the third and fourth feature vectors extracted for a heartbeat after each layer. We denote by L_3 and L_4 the numbers of filters for the third and fourth layers, respectively; u_3 and u_4 are the block sizes for the construction of Bhist after the third and fourth convolutional layers; and R_3 and R_4 are the overlapping parameters for the last two layers. The neural feature vector f_neur stacks the features extracted after each layer. The classification step is performed by a linear SVM with a regularization parameter C = 1.
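The final classification stage (PCA reduction of f_neur, concatenation with x_time, linear SVM with C = 1) can be sketched on toy data with scikit-learn; the feature dimensions and the number of PCA components below are illustrative, not the paper's values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

# Toy stand-ins for the stacked neural features f_neur and the
# time-domain features x_time (dimensions here are illustrative).
rng = np.random.default_rng(0)
f_neur = rng.standard_normal((300, 512))
x_time = rng.standard_normal((300, 9))
y = rng.integers(0, 7, 300)            # 7 INCART-like classes

f_pca = PCA(n_components=50).fit_transform(f_neur)   # dimensionality reduction
X = np.hstack([f_pca, x_time])                       # concatenate with x_time
clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)      # linear SVM, C = 1
pred = clf.predict(X)
```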

Experimental Setup
To assess the performance of our method, we classified 15 and 7 different types of heartbeats from the MIT-BIH and INCART databases, respectively. One major obstacle is that these databases are not well balanced: the normal types are over-represented, while, for instance, the supraventricular escape beats from INCART have comparatively few samples. To address this issue, we randomly sampled (without repetition), as in [20], 3350 heartbeats from the MIT-BIH database and 1720 heartbeats from INCART, in the proportions given by Tables 1 and 2, respectively. We used k-fold cross-validation on the resampled heartbeats to fit the parameters of 1-D CCANet-SVD. The parameters are shown in Table 4.
The results provided in the Results and Discussion subsection derive from an overall confusion matrix obtained by summing the k confusion matrices produced by the folds. As in [20], we performed 10-fold and 5-fold cross-validation for the data from the MIT-BIH and INCART databases, respectively.
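The evaluation protocol (per-fold confusion matrices summed into one overall matrix) can be sketched as follows; the toy classifier and data stand in for the full pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC

# Toy data standing in for the resampled heartbeat features and labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
y = rng.integers(0, 4, 200)
n_classes = 4

# Sum the per-fold confusion matrices into one overall matrix.
overall = np.zeros((n_classes, n_classes), dtype=int)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train, test in skf.split(X, y):
    clf = LinearSVC(C=1.0, max_iter=5000).fit(X[train], y[train])
    overall += confusion_matrix(y[test], clf.predict(X[test]),
                                labels=np.arange(n_classes))

accuracy = np.trace(overall) / overall.sum()   # overall ACC from the matrix
```

Per-class specificity, sensitivity, and PPV can then be read directly from the rows and columns of `overall` and averaged across classes, as done for Table 6.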
The code is written in Python 3.7 and we ran all the experiments on a personal computer equipped with Ubuntu 18.04. The hardware specifications of the computer are the following: 16 GB RAM, and i7-7700 CPU with a clock speed of 3.60 GHz.

1-D ResNet
To further validate our approach, we added a one-dimensional residual network (1-D ResNet) to our experiments. The input is the same as for 1-D CCANet-SVD. The 1-D ResNet is implemented as follows:
• Initial layer: the input of the network undergoes an initial convolution with 2 input channels (one for each lead) and 16 output channels, followed by a max-pooling step. This initial layer is followed by 4 identical residual blocks.
• Residual blocks: each contains two convolutional layers and, for each block, the output of the second convolutional layer is added to the block's input. The first convolution of each block doubles the number of channels, while the second has the same number of input and output channels. Consequently, the last convolution has 256 output channels. The output of the last block then undergoes average pooling to obtain the feature vector.
• Classification layer: the feature vector, of size 256, is the input of a fully connected neural network. The classification is performed with the Softmax function, and the loss used is the cross-entropy.
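The comparison architecture can be sketched in PyTorch as follows. The kernel sizes, stem pooling, and the 1×1 projection on the skip path (needed because the first convolution of each block doubles the channel count) are our assumptions, since the paper fits these parameters per layer (Table 5).

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """One residual block: two 1-D convolutions with BatchNorm + ReLU;
    a 1x1 conv projects the skip path to the doubled channel count
    (kernel size 3 is an assumption)."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, 2 * ch, 3, padding=1)
        self.bn1 = nn.BatchNorm1d(2 * ch)
        self.conv2 = nn.Conv1d(2 * ch, 2 * ch, 3, padding=1)
        self.bn2 = nn.BatchNorm1d(2 * ch)
        self.skip = nn.Conv1d(ch, 2 * ch, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.skip(x))

class ResNet1D(nn.Module):
    """Stem conv (2 -> 16 channels) + max-pool, four residual blocks
    (16 -> 256 channels), global average pooling, linear classifier."""
    def __init__(self, n_classes):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv1d(2, 16, 7, padding=3),
                                  nn.BatchNorm1d(16), nn.ReLU(),
                                  nn.MaxPool1d(2))
        self.blocks = nn.Sequential(*[ResBlock1D(16 * 2**i) for i in range(4)])
        self.head = nn.Linear(256, n_classes)

    def forward(self, x):
        h = self.blocks(self.stem(x))
        return self.head(h.mean(dim=-1))   # global average pooling over time

model = ResNet1D(n_classes=7)
logits = model(torch.zeros(4, 2, 360))     # batch of 4 two-lead beats
```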
During the feature extraction process, Batch Normalization is performed after each convolution and the Rectified Linear Unit (ReLU) is used as the activation function. Table 5 shows the architecture of the network and the various parameters fitted for each layer. All the parameters, including the number of layers, the number of residual blocks, and the number of channels of the convolutions, were found through cross-validation.

Results and Discussion
In Table 6, the values of SPE, SENS, and PPV are averaged across the classes. Table 6 shows the results obtained for various classification methods. It includes the previously described model, variations of it (e.g., without adding the time-domain features), the 1-D ResNet, and the best dual-lead method in the state of the art [20]. Our method and [20] demonstrated comparable performance on the MIT-BIH database, though our overall accuracy and mean PPV were better by around 0.3%. As for the INCART database, our results proved to be better, especially the overall accuracy (+1.69%), mean sensitivity (+2.89%), and mean PPV (+1.78%). Contrary to [20], our approach is purely one-dimensional, allowing us to explore a more complex version of CCANet while maintaining a reasonable computational complexity and providing better results: we opted for 4 layers, and stacking the features extracted after each convolution gave better results than not doing so (see the seventh method of Table 6), especially by increasing the sensitivity. Using frequency features with the one-dimensional spectrogram helped obtain a better classification, notably by increasing the sensitivity (+1.12% for INCART) and the PPV (+1.3% for INCART). The addition of the AR coefficients and the time-domain features slightly increased the performance of our model. The performance was significantly better when using our CCA-SVD filter extraction technique instead of the CCA technique described in [20], with the sensitivity gaining more than 3% for INCART (see the third method of Table 6).
Finally, our method provided significantly better results than the 1-D ResNet approach (+3.64% overall ACC for MIT-BIH, +9.45% for INCART). Our analysis of the correlation between the two leads, using the SVD, proved to be a good way of recognizing the various types of heartbeats.
Tables 7 and 8 compare the best of our proposals with similar works in the state of the art for the MIT-BIH and INCART databases, respectively. In the case of MIT-BIH, Table 7 confirms that dual-lead approaches bring performance improvements (more than 1%). In the case of INCART, we also see an improvement with respect to the single-lead method (more than 3%). Here, the proposed approach is slightly better than a variant of the work by Yang et al. [20] that uses three leads. Although our approach explores more complex structures than Yang et al. [20], it remains comparable with it in terms of computational cost: the inference time for each heartbeat classification is about 0.05 s, versus about 0.02 s for Yang et al. [20].
Our method presents a few limitations. First, the CCANet technique requires the network to be fed simultaneously with all the training data in order to determine the filters, which may cause the computational cost to grow as the size of the training data increases. This limitation is common to all CCANet-based architectures. Second, in our study we only considered 2-lead signals as input; it could be interesting to include more leads in the hope of increasing the performance, especially for classes with fewer samples. Following the work by Yang et al. [20], our approach can quite naturally be extended to 3-lead signals, although the number of layers might need to be reduced to compensate for the additional cost of a third lead. Another interesting perspective would be to include some of the techniques used in our study in the original two-dimensional CCANet developed by Yang et al. [20]. Indeed, Table 6 shows that the use of the SVD significantly increases the performance without adding computational cost compared with the original method; we could therefore also expect promising results when using the SVD in the original 2-D CCANet. Likewise, it could be interesting to analyze how the spectrogram features influence the performance of the 2-D CCANet and whether they allow significant improvements in the field of abnormal heartbeat recognition.

Conclusions
In this paper, we propose a novel heartbeat classification method based mainly on a new approach to the study of the correlation between two ECG leads, used to extract complex features. Our method also employs elementary hand-crafted time-domain features, frequency-domain features with a one-dimensional approach to spectrograms, and autoregressive coefficients. Being one-dimensional, our method allows exploring a more complex neural architecture while maintaining a reasonable computational complexity and providing better results. Our final model has an optimized structure and performs the classification of 15 and 7 heartbeat types for the MIT-BIH and INCART databases, respectively. Finally, our method outperforms [20], with a slightly better overall accuracy and mean PPV on the MIT-BIH database and a notably higher overall accuracy (+1.69%), mean sensitivity (+2.89%), and mean PPV (+1.78%) on the INCART database.