A Robust Multilevel DWT Densely Network for Cardiovascular Disease Classification

Cardiovascular disease is the leading cause of death worldwide. Immediate and accurate diagnoses of cardiovascular disease are essential for saving lives. Although most of the previously reported works have tried to classify heartbeats accurately based on the intra-patient paradigm, they suffer from category imbalance issues since abnormal heartbeats appear much less regularly than normal heartbeats. Furthermore, most existing methods rely on data preprocessing steps, such as noise removal and R-peak location. In this study, we present a robust classification system using a multilevel discrete wavelet transform densely network (MDD-Net) for the accurate detection of normal, coronary artery disease (CAD), myocardial infarction (MI) and congestive heart failure (CHF). First, the raw ECG signals from different databases are divided into same-size segments using an original adaptive sample frequency segmentation algorithm (ASFS). Then, the fusion features are extracted from the MDD-Net to achieve great classification performance. We evaluated the proposed method considering the intra-patient and inter-patient paradigms. The average accuracy, positive predictive value, sensitivity and specificity were 99.74%, 99.09%, 98.67% and 99.83%, respectively, under the intra-patient paradigm, and 96.92%, 92.17%, 89.18% and 97.77%, respectively, under the inter-patient paradigm. Moreover, the experimental results demonstrate that our model is robust to noise and class imbalance issues.


Introduction
Cardiovascular disease is a major health problem worldwide. According to recent data from the World Health Organization, 30% of the 58 million deaths worldwide are due to cardiovascular disease [1]. Fortunately, early diagnosis and symptomatic treatment of cardiovascular disease can reduce mortality by more than 70%. Therefore, early accurate diagnosis of cardiovascular disease is critical to saving patients' lives.
Coronary artery disease (CAD) is one of the most typical cardiovascular diseases. It is mainly the result of atherosclerosis, in which fibrous plaque begins to form a thick area on the inner wall of the artery, leading to slowing down the flow of blood to the heart [2,3]. In severe conditions, CAD can lead to myocardial infarction (MI) or congestive heart failure (CHF), if it is not diagnosed in time. Electrocardiogram (ECG) is the most commonly used diagnostic tool because of its non-invasiveness and low cost. Usually, doctors evaluate ECG signal morphology and its characteristics in order to make clinical decisions on CAD, MI and CHF [4].

Related Works
In this section, we first discuss the two main evaluation paradigms of heartbeat classification, the intra-patient and inter-patient paradigms. Next, the existing ECG classification methods shown in Table 1 will be introduced. Subsequently, we propose our model and briefly explain its advantages. Finally, we illustrate the structure of this paper.

Evaluation Paradigms
Under the intra-patient paradigm, heartbeats from the same patient are used both to train and to test the heartbeat classifier. A model evaluated this way tends to score well in the test phase, but the result is known to be biased, because the model learns the characteristics of each patient during training [24]. In a real scenario, however, the trained model must handle heartbeats from patients it has never seen. In contrast, under the inter-patient paradigm, the heartbeats in the training and test sets come from different patients [25]. The same heartbeat classification method shows significantly higher accuracy when evaluated under the intra-patient paradigm than under the inter-patient paradigm [26].
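The difference between the two paradigms comes down to how sample indices are split. The following is a minimal sketch, not the splitting code used in this study: the function names, the `test_fraction` values, and the seed are illustrative assumptions.

```python
import random

def intra_patient_split(patient_ids, test_fraction=0.1, seed=0):
    """Split sample indices at random, ignoring which patient they come from."""
    order = list(range(len(patient_ids)))
    random.Random(seed).shuffle(order)
    n_test = round(len(order) * test_fraction)
    return sorted(order[n_test:]), sorted(order[:n_test])

def inter_patient_split(patient_ids, test_fraction=0.5, seed=0):
    """Split sample indices so that no patient appears in both sets."""
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train, test
```

With the intra-patient split, segments from one patient can land on both sides, which is the source of the bias described above; the inter-patient split rules this out by construction.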

Existing Methods
In recent years, many algorithms for the automatic detection and classification of ECG heartbeat patterns have been presented in the literature. Table 1 summarizes the detection and diagnostic studies for normal versus CAD, normal versus MI, normal versus CHF, and all four classes together. In these studies, the system generally consists of three steps: pre-processing, feature extraction, and classification. The pre-processing step usually comprises three stages: noise removal, R-peak detection, and heartbeat segmentation. Then, signal processing techniques such as wavelet transforms [9,17] and neural networks [10,13,18] are used to extract desirable features from ECG heartbeats. Finally, these features are classified by classifiers such as support vector machines [16] and K-nearest neighbors [8,17,19,21,22].
Although many methods have been proposed for the detection of cardiovascular disease, there is still room for improvement. First, according to the literature [21,22], the performance of automated 4-class cardiovascular disease detection can still be improved. Second, most researchers adopt conventional preprocessing operations (noise removal, R-peak detection, and heartbeat segmentation), whereas a simpler way of obtaining a suitable input form is possible. Third, none of the related studies we reviewed reported the performance of their model in different noise environments, so their robustness to noise was not assessed. More importantly, few studies evaluated their methods under the inter-patient paradigm, although they achieved high performance under the intra-patient paradigm [9,12,20]. Detection of cardiovascular disease under the inter-patient paradigm matters more for practical use.

Proposed Method and Arrangement
In this paper, we present a novel and effective model (MDD-Net) for the detection of cardiovascular disease. Here are the reasons for choosing our methods.
First, most of the existing methods [17][18][19] depend on data preprocessing, such as noise removal and R-peak location. We work on obtaining effective input data by using a light preprocessing method and still have the same level or better performance as the latest method. In this paper, we proposed an adaptive sample frequency segmentation algorithm (ASFS). Using this method, we can obtain a unified and effective input form from databases with different sampling frequencies.
Second, many feature extraction methods can effectively extract the corresponding features and achieve high performance; for example, various wavelet transforms [17,19,23] can extract time-frequency domain features, while deep learning methods [10,14,18] can extract location and abstract features. Is it possible to increase stability and general performance by combining multiple types of features? We made many attempts to answer this question and found that, by using the concept of multilayer dense connections, the combination of abstract features from DenseNet and time-frequency domain features from multilevel DWT can achieve excellent performance in intra-patient scenarios and stable generalization performance under inter-patient conditions.
Third, many existing methods have attempted to address the category imbalance in models by adding synthetic data or adjusting the sample weights. In this paper, we use a hybrid method that relies on the Borderline-SMOTE algorithm to increase the training set size and on a focal loss function to reduce the internal weights of simple samples. The experimental results show the effectiveness and accuracy of the hybrid method.
Sensors 2020, 20, 4777
The remainder of this paper is organized as follows. Section 3 first introduces the databases used in this paper and gives the structure of the proposed system. Then, we describe the input data generation in detail and explain the basic theory of the proposed framework. The experimental results and discussion are presented in Sections 4 and 5, respectively. The last section summarizes the paper and discusses the results and their practical significance.

Data Used
In this work, we used three open-access databases: the PTB Diagnostic ECG Database [27] (ptbdb), the St Petersburg INCART 12-lead Arrhythmia Database [27] (incarddb), and the BIDMC Congestive Heart Failure Database [27] (chfdb), all downloaded from PhysioBank [27]. We collected a total of 52 normal subjects and 148 MI subjects from ptbdb, 7 CAD subjects from incarddb, and 15 CHF subjects from chfdb. Only lead II in each database was used as experimental data. Table 2 summarizes the details of the data used in this paper.

Table 2. Details of the data used.
Database | Class | Lead | Sampling frequency (Hz) | Subjects | Records
St Petersburg INCART | CAD | II | 257 | 7 | 17
BIDMC CHF | CHF | II | 250 | 15 | 15
PTB Diagnostic | Normal | II | 1000 | 52 | 80
PTB Diagnostic | MI | II | 1000 | 148 | 368

For a fair comparison, we apply 10-fold cross-validation for the intra-patient paradigm according to [23]. Since we did not find literature describing the data distribution for 4-class cardiovascular disease under the inter-patient paradigm, we split all records into training and testing sets following the method in [24], in which the training and testing subjects are in nearly the same proportions. Details of the data distribution scheme are summarized in Table 3.

Intra-patient
All data were chosen randomly as training and test samples. 10-fold cross-validation was employed: 9/10 of the data were used for training and the remaining 1/10 for testing.

The Proposed System
The block diagram of the proposed system is shown in Figure 1. The system consists of three phases: input data generation, feature extraction, and classification. Each block is explained in detail in the following sections.

Input Data Generation
We used three data sets of ECG signals with sampling frequencies from 250 Hz to 1000 Hz. We propose the ASFS algorithm to obtain ECG segments without the usual preprocessing operations (denoising, R-peak localization, and heartbeat segmentation). The implementation flow is given in Table 4. Following this flow, we extract segments containing the same number of periodic rhythms from data sets with different sampling frequencies. Then, to ensure the quality of the segmentation, segments from CAD and CHF are upsampled to the maximum sampling frequency (1000 Hz) by interpolation. Note that consecutive segments overlap by a fixed fraction. This overlap not only increases the number of training samples but also lets the convolutional network learn features both within and across periods. The waveforms from the different segments are shown in Figure 2.

Table 4. The implementation flow of the ASFS algorithm.
Input: the raw ECG data V_e, the current sample frequency F_cur, the maximum sample frequency F_max, the number of ECG cycles in one segment N_cyc, and the overlapping rate of the segment R_s
Output: the matrix of segments M_s
Step 1: Calculate the length of a desirable segment: L_s = F_cur × N_cyc
Step 2: Calculate the length of the overlap: L_o = ceil(L_s × R_s)
Step 3: Calculate the length of the input ECG: L_e = size(V_e, 2)
Step 4: For the loop of segment extraction from the raw ECG
Step 5: Intercept from the raw ECG V_e and get the segment: seg = V_e(1 : L_s)
Step 6: Get the expected segment based on the current frequency F_cur and the expected frequency F_max: seg = resample(seg, F_max, F_cur)
Step 7: Normalize the segment
Step 8: Add the normalized segment to the matrix M_s
Step 9: Calculate the new V_e based on the length of the input ECG L_e and the length of the overlap L_o after intercepting the segment: V_e = V_e((L_s − L_o + 1) : end)
Step 10: End for
Step 11: Get the desirable matrix of segments M_s
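The ASFS flow can be sketched in a few lines of Python. This is an illustrative approximation rather than the original implementation: linear interpolation stands in for the `resample` call of Step 6, z-score normalization is assumed for Step 7 (the normalization type is not specified in the text), and the helper names are hypothetical.

```python
import math

def asfs_segments(v_e, f_cur, f_max, n_cyc, r_s):
    """Cut overlapping segments from v_e and resample each to the f_max rate."""
    l_s = int(f_cur * n_cyc)                  # Step 1: segment length
    l_o = math.ceil(l_s * r_s)                # Step 2: overlap length
    step = l_s - l_o
    if l_s <= 0 or step <= 0:
        raise ValueError("need n_cyc > 0 and an overlap rate below 1")
    out_len = round(l_s * f_max / f_cur)      # segment length after resampling
    segments = []
    start = 0
    while start + l_s <= len(v_e):            # Steps 3-4: loop over the record
        seg = v_e[start:start + l_s]          # Step 5: intercept a segment
        seg = _resample_linear(seg, out_len)  # Step 6 (assumed: linear interp.)
        segments.append(_zscore(seg))         # Steps 7-8 (assumed: z-score)
        start += step                         # Step 9: advance, keeping overlap
    return segments                           # Step 11

def _resample_linear(seg, out_len):
    """Resample seg to out_len points by linear interpolation."""
    n = len(seg)
    out = []
    for k in range(out_len):
        pos = k * n / out_len
        i = min(int(pos), n - 1)
        j = min(i + 1, n - 1)
        frac = pos - i
        out.append(seg[i] * (1 - frac) + seg[j] * frac)
    return out

def _zscore(seg):
    """Normalize seg to zero mean and unit variance."""
    mean = sum(seg) / len(seg)
    std = math.sqrt(sum((x - mean) ** 2 for x in seg) / len(seg))
    return [(x - mean) / std if std > 0 else 0.0 for x in seg]
```

For a 250 Hz record with `n_cyc=2`, each 500-sample window becomes a 2000-sample segment after upsampling to 1000 Hz, so records from all three databases end up with the same input shape.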

Feature Extraction (Multilevel DWT)
DWT converts time-domain signals into the wavelet domain to obtain both frequency and location features [28]. With DWT, the ECG signal is divided into different scales by high-pass and low-pass filtering [29]. In this paper, we fold an ECG segment into a two-dimensional matrix, which can be regarded as a single-channel gray image. For images, DWT is extended to the two-dimensional discrete wavelet transform (2D-DWT), which applies low-pass and high-pass filters in both the horizontal and vertical directions. This process is described in Figure 3, where L and H represent one-dimensional low-pass and high-pass filters, respectively. After a one-level wavelet transform, the original image is transformed into four sub-images: the approximation image (coefficients) LL and three detail images (coefficients) HL, LH, and HH. Figure 4 shows the two-level wavelet decomposition of the one-channel image (ECG matrix). Note that the approximation coefficients (LL) are generally decomposed further because they carry the most useful information of the original image. Since we chose the Haar wavelet basis, the width and height of each wavelet-transformed image are halved at each level. Borrowing the idea of dense connections from DenseNet, we concatenate all feature maps of the decomposed images as the input of the reformed DenseNet. In the next section, we explain the structure of the proposed model.
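To make the folding and decomposition concrete, here is a minimal sketch of a multilevel Haar 2D-DWT using only the standard library. The fold width, the function names, and the assignment of the labels LH and HL to the two mixed sub-bands are assumptions (naming conventions differ between libraries); image dimensions are assumed to be divisible by two at every level.

```python
import math

def fold(segment, width):
    """Fold a 1D segment into rows of the given width (a 2D 'image')."""
    return [segment[i:i + width] for i in range(0, len(segment), width)]

def haar_dwt2(img):
    """One level of the orthonormal Haar 2D-DWT; returns (LL, LH, HL, HH)."""
    s = math.sqrt(2.0)
    # Horizontal pass: low-pass (sum) and high-pass (difference) per row pair.
    lo = [[(r[j] + r[j + 1]) / s for j in range(0, len(r), 2)] for r in img]
    hi = [[(r[j] - r[j + 1]) / s for j in range(0, len(r), 2)] for r in img]

    def vertical(rows, sign):
        # Vertical pass: combine each pair of adjacent rows.
        return [[(a + sign * b) / s for a, b in zip(rows[i], rows[i + 1])]
                for i in range(0, len(rows), 2)]

    return vertical(lo, 1), vertical(lo, -1), vertical(hi, 1), vertical(hi, -1)

def multilevel_haar(img, levels):
    """Decompose repeatedly, feeding LL into the next level."""
    bands = []
    approx = img
    for _ in range(levels):
        approx, lh, hl, hh = haar_dwt2(approx)
        bands.append((approx, lh, hl, hh))
    return bands
```

Each level halves the width and height, so an 8 × 8 input yields 4 × 4 sub-bands at level one and 2 × 2 sub-bands at level two.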

Feature Extraction (MDD-Net)
In the field of computer vision, CNNs, such as the recent VGG-Net [30], GoogLeNet [31], Inception [32], and other models, are now commonly used. A milestone in CNN history was the emergence of the ResNet [33] model. The core of the ResNet model is establishing "shortcuts and skip connections" between the front and back layers, which helps the backpropagation of the gradient when training a deep CNN. Building on the basic concept of ResNet, DenseNet [34] establishes dense connections between the front layers and the back layers. Compared with ResNet, DenseNet has fewer parameters and mitigates vanishing gradient and model degradation issues since there are direct connections from the low- to high-level layers, which can be represented as follows:

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$

where $H_l(\cdot)$ represents a nonlinear transformation, which may include a series of BN, ReLU, pooling, and convolution operations. $[x_0, x_1, \ldots, x_{l-1}]$ is the concatenation of feature maps from all previous layers into a single tensor, and $x_l$ is the output of the $l$th layer. Note that there may be multiple convolutional layers between layer $l$ and layer $l-1$.
Figure 5 shows the structure of the proposed MDD-Net, which consists of two models: the reformed DenseNet and a multilevel DWT model. The reformed DenseNet model is mainly composed of 3 dense blocks and 2 transition layers. The multilevel DWT model, including the convolution block, pooling, and concatenation modules, performs three levels of decomposition. After the last dense block, the feature maps from DenseNet and the feature maps from the multilevel DWT are concatenated. Finally, maxpooling, global maxpooling, and softmax classifiers are combined to reduce the feature dimensions and classify disease labels. The detailed network structure of MDD-Net is shown in Table 5.
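The dense connectivity pattern can be illustrated with a toy block in which every layer reads the concatenation of all earlier feature maps. This is a sketch only: the transformation used here (a random channel-mixing step followed by ReLU) is a stand-in for the BN, ReLU, and convolution operations of a real dense block, and `growth_rate`, the weights, and the seed are illustrative assumptions rather than values from Table 5.

```python
import random

def dense_block(x0, num_layers, growth_rate, seed=0):
    """Toy dense block. x0 is a list of channels, each a flat list of floats."""
    rng = random.Random(seed)
    features = list(x0)                    # running concatenation, starts as [x_0]
    n_points = len(features[0])
    for _layer in range(num_layers):
        new_channels = []
        for _out in range(growth_rate):
            # Toy H_l: weighted sum over ALL previous channels, then ReLU.
            weights = [rng.gauss(0.0, 0.1) for _ in features]
            channel = [
                max(0.0, sum(w * c[p] for w, c in zip(weights, features)))
                for p in range(n_points)
            ]
            new_channels.append(channel)
        features.extend(new_channels)      # x_l joins the inputs of later layers
    return features
```

Starting from 4 input channels, three layers with a growth rate of 12 produce 4 + 3 × 12 = 40 channels, and the original input channels are still present unchanged at the front of the output.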

Robustness to Imbalance Category (Borderline-SMOTE and Focal Loss Function)
Class imbalance refers to a situation in which the number of training samples varies greatly between the categories of a classification task. In realistic learning and classification tasks, we often encounter category imbalance. In this work, the number of normal cases is less than the number of cases for other diseases, which does not correspond to reality. Hence, we also tested the classification performance of the proposed method under different ratios of classes. To obtain good performance with imbalanced classes, we adopted the combination of the Borderline-SMOTE method and a focal loss function. The Synthetic Minority Oversampling Technique (SMOTE) [35] is an improved algorithm for random oversampling. Since random oversampling directly duplicates minority-class samples, many duplicate samples are included in the training set, which may lead to overfitting. The basic idea of the SMOTE algorithm is to randomly select a sample $\hat{x}_i$ from the nearest neighbors of each minority-class sample $x_i$ and then randomly select a point on the line between $x_i$ and $\hat{x}_i$ as the newly synthesized minority-class sample. Han et al. [36] proposed Borderline-SMOTE to solve the problems of marginalization and blindness in the SMOTE algorithm. The flow of the Borderline-SMOTE algorithm is shown in Table 6.
Table 6. The implementation flow of the Borderline-SMOTE.

Input: the original training set $F$ and the minority-class set $S_{min} = \{f_1, f_2, \ldots, f_n\}$
Output: the new training set $F_o$ after using the Borderline-SMOTE algorithm
Step 1: Calculate the k nearest neighbors of each sample in the minority set $S_{min}$
Step 2: Classify the samples in $S_{min}$ according to these k nearest neighbors
(a) if the k nearest neighbors of a sample are all majority-class samples, we define this sample as a noise sample and place it in the N set.
(b) if the k nearest neighbors of a sample are all minority-class samples, we define this sample as a safe sample and place it in the S set.
(c) if the k nearest neighbors of a sample include both majority-class and minority-class samples, this sample is considered a boundary sample and is put into the B set.
Step 3: For loop until the number of artificial minority-class samples is met.
Step 4: Set the boundary sample set $B = \{f_1, f_2, \ldots, f_n\}$, calculate the k nearest neighbors in the minority-class set $S_{min}$ of each sample $f_i$, $i = 1, 2, \ldots, n$, in the B set, and compose the set $f_{ij}$.
Step 6: Calculate the difference of all attributes between a sample and its nearest neighbors
Step 7: Multiply the attribute difference by a random number $r_{ij}$, $r_{ij} \in (0, 1)$. If $f_{ij}$ is a sample in the N or S set, then $r_{ij} \in (0, 0.5)$.
Step 8: The generated artificial minority-class sample is $f_i + r_{ij} \times (f_{ij} - f_i)$
Step 9: Add the generated sample to the new training set $F_o$
Step 10: End for
Step 11: Get the desirable training set $F_o$

Furthermore, to balance the contributions of different samples to the model, we adopted the focal loss function (FL) proposed by Lin et al. [37]. This function is a modification of the standard cross-entropy loss (CE) that reduces the weights of easy-to-classify samples so that the model can focus on hard-to-classify samples during training. The CE loss function is as follows:

$$CE(p, y) = \begin{cases} -\log(p), & y = 1 \\ -\log(1 - p), & y = 0 \end{cases}$$

where $y$ is the true label of the sample (1: positive and 0: negative) and $p$ is the predicted probability of the positive category, which ranges from 0 to 1. For positive samples, the larger the output probability is, the smaller the loss. For negative samples, the smaller the output probability is, the smaller the loss. When easy samples dominate, training with CE can converge slowly and may not reach an optimal solution. Hence, FL is presented as follows:

$$FL(p, y) = \begin{cases} -\alpha (1 - p)^{\gamma} \log(p), & y = 1 \\ -(1 - \alpha)\, p^{\gamma} \log(1 - p), & y = 0 \end{cases}$$

Compared with CE, FL includes the factors $\gamma$ and $\alpha$. If $\gamma > 0$, the loss of easy-to-classify samples is reduced, and the model focuses on difficult-to-classify and misclassified samples. If $\gamma = 0$, the function reduces to the ($\alpha$-weighted) CE loss function. $\alpha$ is used to balance the uneven proportions of positive and negative samples.
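The effect of the two factors is easy to check numerically. Below is a minimal per-sample sketch of the binary CE and FL losses; the default values `gamma=2.0` and `alpha=0.25` and the clipping constant `eps` are assumptions chosen for illustration, not settings reported in this paper.

```python
import math

def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for one sample with predicted probability p."""
    p = min(max(p, eps), 1 - eps)          # avoid log(0)
    return -math.log(p) if y == 1 else -math.log(1 - p)

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for one sample; gamma down-weights easy samples."""
    p = min(max(p, eps), 1 - eps)
    if y == 1:
        return -alpha * (1 - p) ** gamma * math.log(p)
    return -(1 - alpha) * p ** gamma * math.log(1 - p)
```

For a positive sample predicted at p = 0.9 (easy), the modulating factor is (1 − 0.9)² = 0.01, whereas at p = 0.6 (harder) it is 0.16, so the easy sample contributes far less to the total loss than it would under CE.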
In this work, the Borderline-SMOTE algorithm is used to increase the number of valid samples in the training set if the ratio of samples in disease and normal categories is less than 1/3. Moreover, the FL function is used when class imbalance occurs; otherwise, CE function is used.
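The oversampling flow in Table 6 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: Euclidean distance is assumed, boundary samples are drawn uniformly at random, and the function name, `k`, and the seed are hypothetical.

```python
import math
import random

def borderline_smote(X, y, minority_label, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples from boundary samples."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]

    def nearest(i, pool):
        # k nearest neighbours of sample i within pool (excluding i itself).
        others = [j for j in pool if j != i]
        return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]

    # Steps 1-2: label each minority sample as noise, safe, or boundary.
    category = {}
    for i in minority:
        n_major = sum(y[j] != minority_label for j in nearest(i, range(len(X))))
        category[i] = "N" if n_major == k else "S" if n_major == 0 else "B"
    boundary = [i for i in minority if category[i] == "B"]
    if not boundary or len(minority) < 2:
        return []

    synthetic = []
    for _ in range(n_new):                              # Step 3
        i = rng.choice(boundary)
        j = rng.choice(nearest(i, minority))            # Step 4
        upper = 1.0 if category[j] == "B" else 0.5      # Step 7
        r = rng.uniform(0.0, upper)
        # Steps 6 and 8: move from f_i toward its neighbour f_ij.
        synthetic.append([a + r * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic
```

Every synthetic point lies on the line segment between a boundary sample and one of its minority-class neighbours, so new samples are concentrated near the class border rather than spread over the whole minority class.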

Classification
Softmax [38] is widely used in machine learning and deep learning. The output unit of the final classifier applies the softmax function for numerical processing. The softmax function [38] is defined as follows:

$$S_i = \frac{e^{V_i}}{\sum_{j=1}^{C} e^{V_j}}$$

where $V_i$ is the output of the previous output unit of the classifier, $i$ is the category index, and $C$ is the total number of categories. $S_i$ is the ratio of the exponential of the current element to the sum of the exponentials of all elements. Softmax converts multi-class output values into relative probabilities for easier understanding and comparison.
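A short sketch of this definition is given below. Subtracting the maximum before exponentiating is a common numerical-stability step added here as an assumption; it does not change the result.

```python
import math

def softmax(v):
    """Convert raw classifier outputs V_i into probabilities S_i."""
    shift = max(v)                         # guards against overflow in exp
    exps = [math.exp(x - shift) for x in v]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs are positive, sum to one, and preserve the ordering of the inputs, so the predicted class is simply the index of the largest value.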

Evaluation Index
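In this paper, we use accuracy, sensitivity, specificity and overall accuracy as the main evaluation indexes:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN}, \quad Sen = \frac{TP}{TP + FN}, \quad Spe = \frac{TN}{TN + FP}$$

in which TP is the number of cases where the disease is present and the model detects it; TN is the number of cases where the disease is absent and the model correctly reports no disease; FN is the number of cases where the disease is present but the model does not detect it; and FP is the number of cases where the disease is absent but the model detects it.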
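For a multi-class problem, these indexes can be computed per class in a one-vs-rest fashion from a confusion matrix. The sketch below assumes that rows are true classes and columns are predicted classes, and it also reports the positive predictive value TP / (TP + FP), which appears in the reported results; the function name and the example matrix in the usage note are illustrative.

```python
def one_vs_rest_metrics(cm):
    """Per-class metrics from a square confusion matrix (rows = true class)."""
    total = sum(sum(row) for row in cm)
    results = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                    # class c predicted as another class
        fp = sum(row[c] for row in cm) - tp     # other classes predicted as c
        tn = total - tp - fn - fp
        results.append({
            "accuracy": (tp + tn) / total,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "ppv": tp / (tp + fp),
        })
    return results
```

For the made-up matrix `[[5, 1, 0], [2, 6, 0], [0, 1, 5]]`, the first class has TP = 5, FN = 1, FP = 2, and TN = 12, giving an accuracy of 17/20 and a sensitivity of 5/6. The sketch assumes that every class occurs and is predicted at least once; otherwise a denominator would be zero.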

Results
In this section, we first design several experiments to test the impact of each part of the proposed model and find the optimal parameters using a grid search algorithm (with respect to the overall accuracy). Then, many experiments are conducted under intra-patient and inter-patient conditions to verify the advantages of the proposed model. In addition, we compare the proposed model with state-of-the-art methods. Note that all the experiments were implemented in MATLAB R2018a and PyCharm 2018 run in Windows 7 on an Intel Core i7 CPU (@ 2.60 GHz) with a 1080 Ti GPU and 8 GB RAM.

Experiment Setup and Optimality Evaluation
In this subsection, some important hyperparameters are tested for the original DenseNet and multilevel DWT models. These tests provide guidelines for determining the optimal parameters of the final model. First, we designed several experiments under the inter-patient paradigm, using 1/4 of the data set, to verify the effects of the input segments and to find the optimal structure of MDD-Net. Then, based on an analysis of the test results, we compiled a list of preferable parameter values. Finally, we obtained the best parameters using a grid search algorithm.

Impact of Input Segments
We use the adaptive sample frequency segmentation (ASFS) algorithm to generate the inputs of the network. As mentioned above, a single segment contains multiple periodic rhythms. We need to find the segment length that gives the best performance. The longer a segment is, the more heartbeat discrimination information it can provide; however, if a segment is too long, the network input may be too large to process. Hence, we limited the segment length to at most 10,000 sample points, tested in steps of 1000, and measured the performance of the original DenseNet model [34]. Note that the other hyperparameters remain fixed, as shown in Table 7. From Figure 6a, we can observe that the model achieves good performance when the length of a segment is 2000 or 3000. Moreover, the accuracy significantly decreases when the length is over 6000. Additionally, we evaluated the effect of the segment overlap rate using a segment length of 2000. As shown in Figure 6b, the accuracy changed rapidly with an increasing overlap rate and decreased dramatically when the overlap exceeded 0.2. The model yields good results when the value is between 0.1 and 0.3.
The effects of the DenseNet structure (number of dense blocks and depth) and of the DWT decomposition level are shown in Figures 7 and 8, respectively. Note that we used the parameters listed in Table 7 when we tested the effect of one single variable.
As shown in Figure 7a, the model can extract a sufficient number of features for classification when the number of blocks is less than 3, but the accuracy decreases rapidly when the value exceeds 3, which may be due to the overfitting of the training model. In Figure 7b, the model achieved relatively good results when the depth was between 16 and 28. In Figure 8, an improvement was observed in classification performance at the second level of decomposition compared to that at the first level of decomposition. This finding suggests that the combined multilevel DWT method can better extract the features of signals. However, we did not observe an improvement when we increased the decomposition from the 3rd to the 4th level.
In summary, the initial screening yielded a set of candidate parameters. We then determined the optimal combination of parameters using the grid search algorithm. Note that we tested the proposed model and used 2000 instances as the validation set. The optimal parameters are shown in Table 8. Notably, the parameters that are optimal for a single model are not necessarily globally optimal, due to the differences in features. After determining the best parameters of the ASFS, we finally obtained 3454 normal, 15,011 MI, 11,339 CAD, and 22,215 CHF segments. Using the optimal parameters from Table 8, the next set of experimental investigations was performed.
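A grid search over the screened candidates amounts to evaluating every combination and keeping the best one. The sketch below is generic: `toy_score` is a made-up stand-in for training the model and measuring validation accuracy, and the candidate values in the usage note are illustrative, not the grid actually searched.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Return the parameter combination with the highest score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)           # e.g. validation accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Hypothetical score that peaks at length 2000, overlap 0.2, level 3."""
    return -(abs(params["segment_length"] - 2000) / 1000
             + abs(params["overlap"] - 0.2)
             + abs(params["dwt_levels"] - 3))
```

Calling `grid_search({"segment_length": [2000, 3000], "overlap": [0.1, 0.2, 0.3], "dwt_levels": [2, 3]}, toy_score)` evaluates all 12 combinations. Because the cost grows multiplicatively with each added parameter, screening the candidates first, as done above, keeps the search tractable.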

In Table 9, we present the overall confusion matrix for cardiovascular detection based on 10-fold cross-validation. The average accuracy, positive predictive value, sensitivity, and specificity were 99.74%, 99.09%, 98.67%, and 99.83%, respectively. The results show that for the CAD group, only a few samples (0.1%) were misclassified as CHF. In the MI group, 0.01% of the cases were misclassified as CAD, and in the CHF group, 0.03% of the cases were incorrectly classified as CAD, reflecting high classification performance.
Table 9. The overall classification results for cardiovascular detection across 10-fold.
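For reference, the four measures can be derived from a multi-class confusion matrix in a one-vs-rest manner, as sketched below. This is an illustration under assumptions: the function name is hypothetical, the 3-class matrix is made-up toy data (not the values of Table 9), and averaging the per-class values is only one plausible reading of how the reported averages were obtained.

```python
import numpy as np

def one_vs_rest_metrics(cm):
    """Per-class accuracy, PPV, sensitivity and specificity.

    cm[i, j] counts samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)                 # correctly predicted samples of each class
    fn = cm.sum(axis=1) - tp         # samples of the class predicted as another class
    fp = cm.sum(axis=0) - tp         # samples of other classes predicted as the class
    tn = total - tp - fn - fp
    return {
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy confusion matrix, for illustration only.
toy_cm = [[8, 1, 1],
          [0, 9, 1],
          [2, 0, 8]]
metrics = one_vs_rest_metrics(toy_cm)
averages = {name: values.mean() for name, values in metrics.items()}
```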

Results of Automated Detection Based on Inter-Patient Paradigm
For the inter-patient paradigm, the classification performance of each method was evaluated based on the training instances from DS1 (Table 2), and the method was then tested with the instances from DS2. In Table 10, we show the performance of different models, including DenseNet, multilevel DWT, and MDD-Net. Note that the hyperparameters used are the same as those shown in Table 8. The results suggest that the proposed model performs better in classification than do other models. Notably, the proposed model displayed competitive performance (average accuracy of 96.92%, positive predictive value of 92.17%, sensitivity of 89.18%, and specificity of 97.77%).

Furthermore, we compared the accuracy and loss of the three models based on the test set during the training process, as shown in Figure 10. We can easily observe that an oscillation phenomenon occurs during the training process for the DenseNet model. The ML-DWT model displays better stability than DenseNet, but the overall accuracy of the model was below the target range. Only the proposed model, which combines the other two models, exhibits sufficient stability and accuracy. In addition, the proposed model achieves a faster convergence speed than the other models. Therefore, the combination of features improves precision and stability.

Results of Robustness to Noise
In a real-life production environment, the ECG signal often contains different levels of noise. Hence, we tested the performance of our model under different levels of noise. Note that we used the awgn function in the MATLAB toolbox to generate different levels of white noise and employed the signal-to-noise ratio (SNR) to evaluate the level of noise. Figure 11 shows the waveforms of the normal, MI, CAD, and CHF segments with different levels of Gaussian white noise. The figure shows that when the SNR of the signal is less than 12 dB, the morphological characteristics of the waveform are generally ambiguous; especially the CAD and CHF waveforms are seriously damaged. When the SNR is 0 dB, the waveforms of all diseases are highly disrupted and difficult to distinguish with the naked eye.
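The noise injection can be approximated outside MATLAB as in the sketch below. It assumes that the noise power is set relative to the measured power of each segment (comparable to calling awgn with the 'measured' option); the text does not state which option was used, and the function name and the sine test signal are illustrative only.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise so that the result has the requested SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)                  # measured signal power
    noise_power = signal_power / (10 ** (snr_db / 10))   # SNR = 10*log10(Ps/Pn)
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Illustrative stand-in for an ECG segment: a 5 Hz sine sampled for 1 s.
t = np.linspace(0.0, 1.0, 10000, endpoint=False)
clean = np.sin(2 * np.pi * 5 * t)
noisy = add_white_noise(clean, snr_db=12, rng=np.random.default_rng(0))
measured_snr = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

For a segment of this length, the measured SNR stays close to the requested value; shorter segments show larger random deviations.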

Table 11 and Figure 12 show the average performance of the proposed model at different SNRs under intra-patient and inter-patient conditions. Table 11 indicates that the performance of the model slightly decreases as the strength of noise increases, but it still maintains high performance under both the intra-patient paradigm and the inter-patient paradigm. The classification accuracy exceeds 99.31% when the SNR is greater than 12 dB under the intra-patient paradigm. Specifically, our model still achieved an accuracy of 98%, even though the SNR of the signal is 0 dB. For the inter-patient paradigm, the classification accuracy of our model is almost the same as that of the original signal when the SNR exceeds 12 dB (96.93~96.98%), except that the PPV and SEN decrease slightly. When the SNR is 0 dB, our model still achieves an accuracy of 95%. In summary, we can directly see from Figure 12 that our model performs stably under different levels of noise, whether under the intra-patient or inter-patient paradigm. The experiments show that the proposed model can achieve fairly good performance for different kinds of noise and various SNRs.
Table 11. The average performance of different SNRs under intra-patient and inter-patient paradigms.


Results of Robustness to Imbalance Category
In reality, the proportion of patients with diseases is often much smaller than the proportion of healthy patients. To effectively simulate and explain reality, we keep the number of normal cases unchanged and decrease the number of patients with diseases proportionally, as shown in Table 12.

Scale   Normal   MI     CAD    CHF
20      3454     750    566    1110
40      3454     375    283    555
60      3454     250    188    370
80      3454     187    141    277
100     3454     150    113    222

For the disease instances in the test set, such as CAD, only 11 were considered per fold (3% of normal) under the intra-patient paradigm when the scale was 100. In this paper, we adopted the Borderline-SMOTE algorithm to generate representative minority samples and added them to the training set. In addition, we used the FL function to solve the category imbalance problem by reducing the internal weights of simple samples. Note that the number of test sets remains constant during each experiment.
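The down-weighting of simple samples can be illustrated with the sketch below, assuming that the FL function refers to the focal loss, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The values gamma = 2 and alpha = 1 are illustrative defaults, not the settings used in this work, and the probabilities are made-up examples.

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, alpha=1.0):
    """Per-sample focal loss for multi-class predictions.

    probs:  array of shape (n_samples, n_classes) with class probabilities.
    labels: integer array of shape (n_samples,) with the true class indices.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    p_t = probs[np.arange(len(labels)), labels]   # probability of the true class
    p_t = np.clip(p_t, 1e-12, 1.0)                # avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

# One easy sample (p_t = 0.9) and one hard sample (p_t = 0.1), illustrative only.
example_probs = np.array([[0.9, 0.1],
                          [0.9, 0.1]])
example_labels = np.array([0, 1])
cross_entropy = focal_loss(example_probs, example_labels, gamma=0.0)
focal = focal_loss(example_probs, example_labels, gamma=2.0)
```

With gamma = 0 the expression reduces to the ordinary cross-entropy. With gamma = 2 the loss of the easy sample is scaled by (1 - 0.9)^2 = 0.01, whereas the loss of the hard sample is scaled by (1 - 0.1)^2 = 0.81, so training is dominated by the hard, often minority-class, samples.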
In the case of unbalanced categories, we focus on the performance of disease classes. Table 13 shows the confusion matrix and classification performance of diseases under the intra-patient and inter-patient paradigms. The sensitivity and precision of MI decreased to some extent with increasing scale under the intra-patient paradigm. However, the other indicators remained high. In particular, most of the performance indexes reach nearly 99% for CAD and CHF classification. For instance, in the CHF group, the proposed system yielded 99.67% accuracy, a 95.63% positive predictive value, 98.65% sensitivity, and 99.73% specificity when the scale was set to 100. Under the inter-patient paradigm, as the scale increased, the accuracy of disease classification increased. However, we can see that the performance for MI is more easily affected by the scale than is the performance for CAD or CHF, and the sensitivity of our model was not ideal for MI when the scale exceeded 20. However, the proposed model can effectively detect CHF and CAD. Acceptable performance (accuracy of 98.83%, positive predictive value of 92.80%, sensitivity of 89.92%, and specificity of 99.49%) for CHF was achieved even when the scale was set to 80.
To verify the validity of the hybrid method described above, we conducted experiments with and without the algorithms in this work. The average performances of the models at different scales are shown in Figure 13. It can be observed that the model using the algorithms performs better than the model without them. In particular, the sensitivity and positive predictive value obtained with the algorithms are clearly better. In summary, our method can better detect diseases from imbalanced data sets.

Comparison of Other Deep Learning Models
In this paper, we used the same input segments and evaluated several popular deep learning models, as shown in Table 14. Note that we used the network structure as described in the original paper. DenseNet used the same network and parameters as the proposed model, as shown in Table 8.

We can see that almost all models achieved good classification performance under the intra-patient paradigm, of which VGG_16 obtained the best result (accuracy of 99.84%, positive predictive value of 99.62%, sensitivity of 99.44%, and specificity of 99.89%). The proposed input generation algorithm can provide distinguishable features. However, the classification performance varies greatly under the inter-patient paradigm. For the inter-patient paradigm, VGG_16 achieved the worst classification performance. The proposed model performed well for both paradigms and achieved the best classification performance under inter-patient conditions. In addition, the performance of DenseNet was better than that of VGG_16 or ResNet, which is the reason we selected DenseNet as the basis of the proposed model.

Discussion
The purpose of this study is to propose a novel single-lead cardiovascular disease classification method that requires only simple preprocessing and still performs as well as or better than other popular methods. Notably, we hope to make a breakthrough for the inter-patient paradigm. Here, we summarize the key features of our model and discuss its advantages and disadvantages compared with the related literature shown in Table 15. First, we propose a simple ASFS approach to generate inputs without employing conventional data preprocessing steps, such as domain-specific feature extraction for noise removal or the R-peak location algorithm. Unlike our method, many existing methods [11,12,22,23] rely on various preprocessing steps to achieve high classification performance. Note that although they require data preprocessing steps, none of these methods yields better results than our method except for sensitivity under the intra-patient paradigm, as shown in Table 15. For the inter-patient paradigm, the proposed model achieved better classification results than most of the investigated studies. Our model also achieved acceptable performance (accuracy of 96.92%, positive predictive value of 92.17%, sensitivity of 89.18%, and specificity of 97.77%) for 4-class cardiovascular disease classification (normal, CAD, MI, and CHF). To our knowledge, this is the first work reporting 4-class classification under the inter-patient paradigm.
Second, to further test the classification performance of the proposed model in a multilevel noise environment, we added multilevel Gaussian noise to the original signals. The impact of multilevel noise is illustrated in Table 11 and Figure 12, in which the performance of the classification model changes little at different noise levels under both the intra-patient paradigm and inter-patient paradigm. Normally, the useful ECG signal appears as a low-frequency part of the signal or a relatively stable signal, and the noise signal appears as a high-frequency signal. The high-frequency Gaussian noise can be filtered out of a signal when multilevel 2D-DWT is performed. In addition, the multilayer convolutional structure improves the ability of the model to filter noise and mine useful information from ECGs. Hence, the proposed model exhibits good robustness to noise.
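The low-pass behaviour of the multilevel decomposition can be illustrated with a simplified sketch. It uses a one-dimensional Haar transform written in NumPy rather than the 2D-DWT of the proposed model, and the wavelet, the number of levels, and the test signals are assumptions made only for illustration.

```python
import numpy as np

def haar_dwt_level(x):
    """One level of the orthonormal Haar DWT: (approximation, detail)."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:                       # pad odd-length input by repeating the last sample
        x = np.append(x, x[-1])
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)   # low-frequency (smooth) part
    detail = (even - odd) / np.sqrt(2)   # high-frequency part
    return approx, detail

def haar_multilevel(x, levels):
    """Repeatedly decompose the approximation; return it with all detail bands."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        approx, detail = haar_dwt_level(approx)
        details.append(detail)
    return approx, details

def approx_energy_fraction(x, levels=3):
    """Share of the signal energy kept in the final approximation band."""
    approx, _ = haar_multilevel(x, levels)
    return np.sum(approx ** 2) / np.sum(np.asarray(x, dtype=float) ** 2)

t = np.linspace(0.0, 1.0, 1024, endpoint=False)
slow_wave = np.sin(2 * np.pi * 2 * t)                    # low-frequency stand-in for the useful signal
white_noise = np.random.default_rng(0).normal(size=1024)

slow_fraction = approx_energy_fraction(slow_wave)
noise_fraction = approx_energy_fraction(white_noise)
```

After three levels, nearly all of the energy of the slowly varying signal remains in the approximation band, whereas white noise is spread evenly over all coefficients, so only about one eighth of its energy is expected to remain there.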
Third, we use the original input ECG data with an imbalance between normal and disease categories. To overcome the problem of category imbalance, on the basis of studies [39,40], we used a hybrid method to increase the training set and changed the sample batch weights to optimize our model. We adopted the Borderline-SMOTE algorithm to add minority samples to the training set; additionally, the FL function was employed to solve the category imbalance problem by reducing the internal weights of simple samples. In Table 13, our model yielded remarkable performance at different imbalanced scales under the two paradigms. Under the intra-patient paradigm, our method achieved the highest accuracy at 98.88% for MI, 99.70% for CAD, and 99.67% for CHF, even though the scale was 100. This finding reflected acceptable model performance under the inter-patient paradigm. In Figure 13, we demonstrate the validity of our method by comparing the performance of two classification models obtained using the same inputs with and without the Borderline-SMOTE algorithm and FL function.
Finally, we explained why DenseNet was chosen as the core part of our model by comparing several popular deep learning network frameworks. In Table 14, the performance difference among several popular learning frameworks under the intra-patient paradigm is shown to be minimal, but the performance under the inter-patient paradigm varies greatly, and the DenseNet model performed better than other deep learning models (accuracy of 93.97%, positive predictive value of 87.78%, sensitivity of 81.77%, and specificity of 95.24%).
The main highlights of our proposed algorithm are as follows: (1) A novel ASFS algorithm is proposed. The algorithm can generate effective inputs without conventional data preprocessing (noise removal and R-peak location). (2) Compared with traditional deep learning algorithms, our combined model has small steady-state error and achieved superior results. (3) Our model has good robustness to noise and can overcome category imbalance. (4) The proposed work has considerable practical significance considering the performance of the proposed model under the inter-patient paradigm.
However, we should also mention that during the training phase, our method requires a large number of heartbeat data sets that must be annotated by clinical experts. In the medical field, it is difficult to obtain such data sets with abnormal patterns. In addition, the sensitivity for MI under the inter-patient paradigm needs to be improved.

Conclusions
In this paper, we presented a novel and effective model (MDD-Net) for the detection of cardiovascular disease. The ASFS algorithm is employed to obtain consistent input segments without using regular preprocessing operations. We concatenate abstract and time-frequency features to obtain the resultant combined feature vector. Our model achieved higher stability and accuracy than the solo-feature DenseNet model. According to the results of the experiments, the proposed model significantly outperforms the existing algorithms in the literature for both intra-patient and inter-patient paradigms. Specifically, the model achieved an average accuracy of 96.92%, positive predictive value of 92.17%, sensitivity of 89.18%, and specificity of 97.77% under the inter-patient paradigm, which is of practical significance. Moreover, our model has good robustness to noise and imbalanced classes. Therefore, the proposed approach will be a useful component of clinical decision support systems for cardiologists.
In future work, we will improve the performance of our model and expand the predicted disease types under the inter-patient paradigm using more ECG data. Specifically, the classification performance for MI needs to be improved. Using more ECG data means more disease type labels. However, annotating disease types is very expensive and time-consuming. We want to develop a semi-supervised heartbeat classification model by using a large amount of unannotated ECG data. Hence, we will work on developing an active learning classification system to solve this problem. The ultimate goal of our work is to design a cloud version of the proposed method and apply it by using mobile devices to provide reliable and practical diagnostic results.

Conflicts of Interest:
The authors declare no conflict of interest.