A New Method for Refined Recognition for Heart Disease Diagnosis Based on Deep Learning

The proper evaluation of heart health requires professional medical experience. In clinical diagnosis, the goal is therefore to reduce the dependence of the diagnostic process on medical experience while improving diagnostic efficiency and accuracy. Deep learning has achieved remarkable results in intelligent medical image analysis. For cardiac diagnosis, image analysis can extract deeper and richer information than sequential electrocardiogram (ECG) signals. Therefore, a new region recognition and diagnosis model for two-dimensional ECG (2D-ECG) signals in image format is proposed. The method identifies and diagnoses each refined waveform of the cardiac conduction cycle in the image-format ECG signal. It rapidly and accurately locates and visualizes the target recognition area and finally produces analysis results for specific diseases. Test results show that, compared with results obtained from one-dimensional sequential ECG signals, the proposed model achieves a higher average diagnostic accuracy (98.94%) and can assist doctors in diagnosis with better visualization.


Introduction
With the widespread adoption of computer technology, computer-aided diagnosis has developed rapidly in medical fields and plays a positive role in intelligent diagnosis. Its research directions include brain waves, electromyography signals, cancer, the nervous system [1][2][3][4], the heart, and other clinical applications, and ever more refined directions are being explored in clinical practice [5][6][7]. Compared with traditional manual diagnosis, it significantly reduces the false negative rate (FNR) and improves work efficiency [8][9][10].
ECG signals record cardiac electrophysiological activity and effectively reflect various physiological states of the human heart. They are usually acquired through lead channels using body surface electrodes. This acquisition method is widely used because it is non-invasive, produces visual output, and is low-cost, convenient, and flexible [11][12][13]. However, because the amplitude of ECG signals is small and a single conduction cycle is short, reading ECG signals by eye easily leads to misdiagnosis, so automatic ECG classification is necessary in modern medicine. Treating ECG signals as one-dimensional sequences, Pyakillya B et al. [14] used convolutional layers as a feature extractor and fully connected layers as the final decision-making structure to classify ECG signals. Although arrhythmia classification methods have improved significantly, their performance in detecting different heart diseases remains unsatisfactory when the datasets are imbalanced. To address this limitation, Mousavi et al. [15] used deep convolutional neural networks and sequence-to-sequence models to classify heartbeats automatically. To obtain independent, general-purpose signal processing and feature learning, Zhang et al. [16] used a multi-resolution convolutional neural network in the wavelet domain. The method transforms randomly selected signal segments into the wavelet domain to obtain a multi-resolution time-frequency representation, on which a multi-resolution one-dimensional convolutional neural network is built. The algorithm was evaluated on eight ECG datasets, with an average recognition rate of 93.5%. The duration of ECG signals also has an important impact on diagnostic results.
For example, the diagnosis of atrioventricular block requires long ECG recordings; the duration of a single cardiac conduction cycle cannot meet the diagnostic requirements. Therefore, some scholars have studied ECG signals of different durations. Acharya et al. [17] used an 11-layer convolutional neural network to automatically detect ECG segments that may contain arrhythmias, achieving accuracies of 92.50% and 94.90% with 2 s and 5 s segments, respectively.
As the above analysis shows, although deep learning has greatly advanced the recognition, classification, and diagnosis of cardiac electrophysiological signals, intelligent ECG recognition still faces several problems [18][19][20]: the accuracy of the source ECG signal needs to be improved; recognition speed is limited; recognition algorithms used in clinical practice are not sufficiently accurate; ECG signals vary not only between individuals but also within the same individual at different times and under different physical conditions; and signal acquisition is affected by human factors and by electromagnetic interference from nearby equipment. At present, most studies of cardiac electrophysiological signals are based on one-dimensional signal models [21][22][23], which makes it difficult both to perform classification and to detect and locate targets quickly in the ECG.
To address these problems, this paper applies deep learning to propose a 2D-ECG faster region recognition and diagnosis model (the 2D-ECG and Faster R-CNN model) that operates at the image level. First, one-dimensional sequential ECG signals are converted into two-dimensional images for recognition; then, using actual clinical ECG signals, each refined waveform in the cardiac conduction cycle is recognized and diagnosed. The feasibility of the proposed algorithm is verified by comparison with existing methods. The rest of this paper is organized as follows: the second part presents the refined recognition and diagnosis method; the third part describes the experiments based on the refined cardiac diagnosis model; the fourth part presents the analysis and diagnosis based on the experimental results; and the fifth part gives the discussion and conclusions.

Research Object
This paper proposes a new disease diagnosis method based on a two-dimensional ECG signal model. As shown in Figure 1, its main purpose is to recognize each refined waveform of a clinical ECG signal, including the P wave, QRS complex, and T wave.
Figure 1. Examples of refined recognition of electrocardiogram (ECG) parameters (P wave, QRS complex, T wave).
To recognize an ECG signal, it is first converted into a two-dimensional image, and a coordinate system is then established for the ECG image containing a calibration curve, as shown in Figure 2. The upper-left pixel of the image is taken as the coordinate origin O, with the Y axis pointing right and the X axis pointing down. All coordinate values recognized in the experiment refer to this coordinate system. These coordinates provide the data basis for ECG diagnosis: combining the recognition results and their coordinate positions with the transformation between coordinate values and actual values in the image finally yields the diagnosis of specific diseases.

Diagnosis Model
The algorithm recognizes the target information in the input ECG image quickly and accurately. The system consists of a group of consecutive convolutional layers for feature extraction, a region proposal network (RPN) that generates and refines candidate proposals, a region of interest (RoI) pooling layer that combines features, and output stages for candidate targets and classification. The overall framework of the refined recognition and diagnosis model is shown in Figure 3. The feature maps produced by the convolutional layers are processed along two paths with different functions. On one path, the feature maps are passed directly to the RoI pooling layer. On the other, they are processed by the RPN module, which outputs region proposals and proposal boxes that also enter the RoI pooling layer. The RoI pooling layer therefore combines the inputs from both paths; the subsequent fully connected layers then perform classification, box regression, and position marking to produce the final classification, recognition, and diagnosis output [24].
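RoI pooling is what lets proposals of different sizes feed the same fully connected layers: each proposal is divided into a fixed grid of bins and each bin is max-pooled. The pure-Python sketch below illustrates that idea under stated assumptions; `roi_max_pool` is a hypothetical helper, and the RoI is assumed to be already projected onto feature-map cells (library implementations such as `torchvision.ops.roi_pool` also handle the projection from image coordinates and batching).

```python
import math

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (row1, col1, row2, col2) of a 2D feature map
    (inclusive bounds, already in feature-map cells) into an out_h x out_w grid."""
    r1, c1, r2, c2 = roi
    roi_h, roi_w = r2 - r1 + 1, c2 - c1 + 1
    pooled = []
    for i in range(out_h):
        # floor/ceil bin edges: every bin is non-empty and the bins cover the RoI
        r_lo = r1 + math.floor(i * roi_h / out_h)
        r_hi = r1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            c_lo = c1 + math.floor(j * roi_w / out_w)
            c_hi = c1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r_lo, r_hi) for c in range(c_lo, c_hi)))
        pooled.append(row)
    return pooled

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(roi_max_pool(fmap, (0, 0, 3, 3), 2, 2))  # [[6, 8], [14, 16]]
print(roi_max_pool(fmap, (1, 1, 3, 3), 2, 2))  # [[11, 12], [15, 16]]
```

Whatever the RoI size, the output is always `out_h × out_w`, which is why the fully connected layers after RoI pooling can have a fixed input dimension.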
The advantages of this deep learning diagnosis method are as follows: the system has a two-stage structure that integrates multi-target recognition and classification in the image with target position prediction. It can therefore display the analysis results in full detail, with a better visualization effect.

Convolutional Neural Network
The group of consecutive convolutional layers in the framework of Figure 3 uses the VGG16-NET [25] model shown in Figure 4 to generate the feature map. The model contains 13 convolutional layers, 5 pooling layers, and 3 fully connected layers. The convolutional and fully connected layers carry weights and together number 16, which gives the model its name. Convolutional and pooling layers alternate: a pooling layer follows each group of two or three convolutional layers, and the fully connected layers come last, followed by the classification output. The network is divided into five stages, with 3 × 3 convolution kernels and 2 × 2 pooling kernels throughout, which keeps the structure of each layer clear. As the input passes through VGG16-NET, the number of channels doubles stage by stage from 64 up to 512, while the image size halves stage by stage from 224 to 7. The spatial resolution of the feature map thus decreases monotonically while the number of channels increases monotonically, forming the transition from the input ECG image dimensions to the classification result vector. The model extracts features from the original ECG image; its small kernels and network depth effectively improve the recognition accuracy of the refined waveforms in the ECG conduction cycle.
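To make the size progression concrete, the sketch below traces the standard VGG16 configuration (as published by Simonyan and Zisserman) for a 224 × 224 input. It assumes 3 × 3 convolutions with padding 1 (size-preserving) and 2 × 2 max pooling with stride 2. The 4096-4096-1000 fully connected sizes are the original ImageNet classifier and are shown only to count layers; the input size actually used for the ECG images is not stated in the text.

```python
# Standard VGG16 convolutional configuration: numbers are output channels of
# 3x3 convolutions (padding 1), "M" is 2x2 max pooling with stride 2.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]
FC_LAYERS = [4096, 4096, 1000]  # fully connected layers of the ImageNet classifier

def trace_vgg16(input_size=224):
    """Return (spatial size, channels) after each of the five pooling stages."""
    size, channels = input_size, 3
    stages = []
    for item in VGG16_CFG:
        if item == "M":
            size //= 2            # pooling halves the spatial size
            stages.append((size, channels))
        else:
            channels = item       # padded 3x3 conv keeps the size, sets channels
    return stages

n_conv = sum(1 for item in VGG16_CFG if item != "M")
n_pool = VGG16_CFG.count("M")
print(n_conv, n_pool, len(FC_LAYERS))   # 13 5 3
print(trace_vgg16())  # [(112, 64), (56, 128), (28, 256), (14, 512), (7, 512)]
```

The 13 convolutional and 3 fully connected layers are the 16 weighted layers; the last stage keeps 512 channels rather than doubling again.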

Convolution Layer
Feature extraction from the input data is performed automatically by the convolution operation. Consider two two-dimensional discrete real functions f and g (see Equation (1)); their convolution is defined by Equation (2).
Based on this operation, the input signal is scanned by a convolution kernel of preset size with a preset stride [26]: at each position, the kernel covers an input region of the same size, the kernel weights are multiplied by the corresponding input values and summed, and the result becomes one element of the convolution output. The output size of each convolutional layer must be designed in advance. For an input image of size Wi × Hi, a kernel of size K × K, a stride S, and P pixels of edge padding, the output sizes Wo and Ho are given by Equation (3).
For example, convolving a 5 × 5 input matrix with a 3 × 3 kernel, with a stride of 1 and no padding, gives a 3 × 3 output, as shown in Figure 5, which illustrates the convolution calculation step by step.
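Equation (3) is not reproduced in this text; the sketch below assumes it is the usual output-size formula Wo = (Wi − K + 2P)/S + 1 (and likewise for Ho), with floor division when the stride does not divide evenly. The 5 × 5 input and 3 × 3 kernel values are a hypothetical textbook example, not the values in Figure 5. As in deep learning frameworks, the sliding-window operation is implemented as cross-correlation (the kernel is not flipped); for the symmetric kernel used here the result is the same as a flipped-kernel convolution.

```python
def conv_output_size(w_in, h_in, k, s, p):
    """Assumed form of Equation (3): output width/height for a K x K kernel,
    stride S and P pixels of zero padding on each side."""
    return (w_in - k + 2 * p) // s + 1, (h_in - k + 2 * p) // s + 1

def conv2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a 2D list (cross-correlation, which is what
    CNN 'convolution' layers compute) and return the output feature map."""
    if padding:
        width = len(image[0]) + 2 * padding
        image = ([[0] * width for _ in range(padding)]
                 + [[0] * padding + list(row) + [0] * padding for row in image]
                 + [[0] * width for _ in range(padding)])
    k = len(kernel)
    # padding has already been applied above, so p = 0 here
    out_w, out_h = conv_output_size(len(image[0]), len(image), k, stride, 0)
    return [[sum(image[i * stride + a][j * stride + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(out_w)]
            for i in range(out_h)]

image = [[1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]
print(conv_output_size(5, 5, 3, 1, 0))  # (3, 3)
print(conv2d(image, kernel))            # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

With `padding=1` the same call returns a 5 × 5 map, which is how VGG16 keeps the spatial size constant inside each stage.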

Pooling Layer
Pooling layers are placed within the group of consecutive convolutional layers to select or filter data. They compress the input image and reduce its dimensions. Max pooling and average pooling are the most commonly used methods. For pooling layers, the padding P in Equation (3) is usually set to zero. For example, as shown in Figure 6, for 4 × 4 input data with a 2 × 2 window and a stride of 2, mean or max pooling takes the mean or maximum value of each window as the pooling result [27].
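The 4 × 4, 2 × 2-window, stride-2 case can be sketched as follows. The input values are hypothetical (the numbers in Figure 6 are not given in the text), and `pool2d` is an illustrative helper, not a library function.

```python
def pool2d(data, size=2, stride=2, mode="max"):
    """Max or mean pooling of a 2D list with a size x size window (no padding)."""
    combine = max if mode == "max" else (lambda vals: sum(vals) / len(vals))
    out_h = (len(data) - size) // stride + 1
    out_w = (len(data[0]) - size) // stride + 1
    return [[combine([data[i * stride + a][j * stride + b]
                      for a in range(size) for b in range(size)])
             for j in range(out_w)]
            for i in range(out_h)]

data = [[1, 3, 2, 4],
        [5, 7, 6, 8],
        [9, 11, 10, 12],
        [13, 15, 14, 16]]
print(pool2d(data, mode="max"))   # [[7, 8], [15, 16]]
print(pool2d(data, mode="mean"))  # [[4.0, 5.0], [12.0, 13.0]]
```

Each non-overlapping window contributes one value, so the 4 × 4 input shrinks to 2 × 2, matching Equation (3) with K = 2, S = 2, P = 0.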

Fully Connected Layer
After the convolution and pooling calculations, fully connected layers follow. Each node in a fully connected layer is connected to every node in the previous layer, so the layer combines the extracted features. Two-dimensional feature maps are usually flattened into a one-dimensional vector that serves as the input to the classification step.
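The flatten-then-connect step can be sketched in a few lines. The channel-major flattening order and the 2 × 8 weight matrix are hypothetical choices made for illustration.

```python
def flatten(feature_maps):
    """Flatten channels x rows x cols feature maps into one vector (channel-major)."""
    return [v for fmap in feature_maps for row in fmap for v in row]

def dense(x, weights, bias):
    """Fully connected layer: output k is sum_i weights[k][i] * x[i] + bias[k]."""
    for w_row in weights:
        if len(w_row) != len(x):
            raise ValueError("each weight row must match the input length")
    return [sum(w * xi for w, xi in zip(w_row, x)) + b
            for w_row, b in zip(weights, bias)]

x = flatten([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])   # two 2 x 2 maps -> 8 values
weights = [[1] * 8, [1, -1] * 4]                    # hypothetical 2 x 8 weights
print(x)                                            # [1, 2, 3, 4, 5, 6, 7, 8]
print(dense(x, weights, [0.0, 0.5]))                # [36.0, -3.5]
```

Every output uses all eight inputs, which is what "fully connected" means; the resulting scores are what the Softmax layer turns into class probabilities.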

Softmax Layer
In a deep learning network, classification performance determines the recognition effect of the algorithm. As the final output layer, the Softmax function is widely used for multi-class classification. Given the input data Xi and weights Wi of the network, the score function in Equation (4) is defined. Because this function is linear, its ability to separate the scores of different categories is limited.
To enlarge the score differences between categories and improve classification, Equation (4) is refined into Equation (5). Using e as the base preserves the ordering of the scores (the function is monotonic) and makes the differences between them more pronounced.
Then, the n-class normalized probability function Np maps the outputs of the neurons in the previous layer into the interval (0, 1) (see Equation (6)),
where n is the number of classification categories and yi is the classification output score; the probabilities of all classes sum to 1 (see Equation (7)).
Together, these steps implement the multi-class classification function of the algorithm. To illustrate the calculation, Equation (6) is verified with three classification outputs, as shown in Figure 7, where the inputs Sf1 to Sf3 of the three channels are the outputs of the preceding layer.
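A minimal sketch of the normalized probability function for three channels follows. It assumes Equation (6) has the standard softmax form Np_i = e^(y_i) / Σ_j e^(y_j); the scores standing in for Sf1 to Sf3 are hypothetical, since the values in Figure 7 are not given in the text.

```python
import math

def softmax(scores):
    """Map raw scores to probabilities in (0, 1) that sum to 1.
    Subtracting the max score first avoids overflow and does not change the result."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([3.0, 1.0, -3.0])   # hypothetical Sf1, Sf2, Sf3
print([round(p, 4) for p in probs])
```

The highest score keeps the highest probability (monotonicity), and the exponential widens the gap between classes compared with normalizing the raw scores linearly.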

Region Proposal Network
The RPN is itself a fully convolutional network; it can be trained end to end to generate target proposals, and it outputs target boxes and their scores. The consecutive convolutional layers turn the original image into a feature map, which is then fed to the RPN shown in Figure 8. In the RPN, a small 3 × 3 convolution window slides over the feature map; each window position is mapped to a low-dimensional vector, which is fed to two fully connected layers: the regression (REG) layer and the classification (CLS) layer. Their outputs are finally sent to the RoI pooling layer. In image target region recognition, the proposal region usually deviates from the actual region. Bounding box regression repeatedly corrects the proposal box so that it approaches the actual region. A large number of anchors are generated in this process; anchors that do not meet the requirements are filtered out by thresholding, and the remaining ones are output as proposals. Consider the ECG as an example of the bounding box regression principle. In Figure 9a, the green box T represents the actual range of a waveform in a cardiac conduction cycle, and the red box O represents the algorithm's initial recognition range for that waveform. The two boxes clearly deviate considerably, which makes the recognition inaccurate. To reduce the error between the actual and recognized values, a linear transformation such as translation or scaling is applied when the two boxes are close to each other. For this purpose, a blue box M is introduced as an intermediate variable in Figure 9b.
Box O is transformed linearly into box M, which is close to the actual box T, so that the actual range of the waveform is recognized accurately [29]. As the RPN structure in Figure 8 shows, the anchors are not of a single form: several anchor shapes are used. In this multi-anchor structure, the normalized total loss NΔ consists of two terms, the normalized classification loss NCLS and the normalized regression loss NREG (see Equation (8)). The global loss function L is then given by Equation (9). The parameter λ weights the balance between the terms of the normalized total loss NΔ; in general, its default value gives a good balance.
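Equations (8) and (9) are not reproduced in this text. The sketch below assumes the authors follow the formulation of the original Faster R-CNN paper (Ren et al.): box regression targets tx = (x − xa)/wa, ty = (y − ya)/ha, tw = log(w/wa), th = log(h/ha) relative to an anchor, and the loss L = (1/Ncls) Σ Lcls(pi, pi*) + λ (1/Nreg) Σ pi* Lreg(ti, ti*) with log loss and smooth L1, where λ = 10 was that paper's default. In that paper, anchors with IoU above 0.7 against a ground-truth box are labeled positive and those below 0.3 negative. Boxes here are (x1, y1, x2, y2) corner tuples with x horizontal; the box values are hypothetical, and all helper names are illustrative.

```python
import math

def to_center(box):
    """(x1, y1, x2, y2) corner box -> (center x, center y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1

def encode(anchor, target):
    """Regression targets (tx, ty, tw, th) that move `anchor` onto `target`."""
    xa, ya, wa, ha = to_center(anchor)
    x, y, w, h = to_center(target)
    return (x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha)

def decode(anchor, deltas):
    """Apply deltas to an anchor: translate the center, scale width and height."""
    xa, ya, wa, ha = to_center(anchor)
    tx, ty, tw, th = deltas
    x, y = xa + tx * wa, ya + ty * ha
    w, h = wa * math.exp(tw), ha * math.exp(th)
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2

def iou(a, b):
    """Intersection over union of two corner boxes, used to threshold anchors."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def smooth_l1(x):
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=10.0):
    """Faster R-CNN style loss: log loss over sampled anchors plus lam-weighted
    smooth L1 box loss counted only for positive anchors (p_star == 1)."""
    eps = 1e-12  # guards log(0)
    l_cls = sum(-(ps * math.log(pi + eps) + (1 - ps) * math.log(1 - pi + eps))
                for pi, ps in zip(p, p_star)) / n_cls
    l_reg = sum(ps * sum(smooth_l1(a - b) for a, b in zip(ti, tsi))
                for ps, ti, tsi in zip(p_star, t, t_star)) / n_reg
    return l_cls + lam * l_reg

box_o = (100, 40, 140, 80)   # hypothetical initial box O (pixels)
box_t = (110, 35, 160, 85)   # hypothetical actual box T (pixels)
deltas = encode(box_o, box_t)
print(decode(box_o, deltas))        # approximately (110.0, 35.0, 160.0, 85.0)
print(round(iou(box_o, box_t), 4))  # 0.4138
```

Translation is handled by (tx, ty) and scaling by (tw, th), which is the linear adjustment of box O toward box T described above; the regression term only counts positive anchors because negative anchors have no target box.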

Dataset of Refined Recognition
The dataset consists of clinical ECG data from 500 different patients at the same provincial hospital. To keep the samples independent, each sample corresponds to one patient. To show the fine details of the classification output more clearly, each recording contains 5000 sampling points (10 s at the 500 Hz sampling rate), of which 2500 consecutive points (5 s) are used in the experiment; these segments are converted into image format. Because the dataset had to be newly built, extensive manual annotation of the original data was required; for this reason, only 500 samples were used.
ECG signals are easily corrupted by various kinds of noise, which reduces diagnostic accuracy. The main noise sources include power-line interference, departures from the resting state, electromyographic (EMG) activity, and nearby electrical equipment. Noise was taken into account when acquiring the ECG dataset: with patients in a relatively resting state and signal acquisition performed correctly by the operators, noise was filtered with the ECG instrument's default de-noising settings, namely a 40 Hz low-pass filter, a 50 Hz power-frequency filter, and 35 Hz EMG suppression. The filtered data are rendered as ECG images, which mimic how a clinician reads an ECG report and map directly onto that report. The same de-noised data are used to print ECG reports for clinical diagnosis, so the default de-noising keeps the waveforms in the image-format ECG consistent with those in the actual clinical ECG report.
The signal acquisition requirements are as follows. First, the calibration curve parameters of the ECG instrument are set to 10 mm/1 mV and 2.5 Hz, and the sampling frequency is 500 Hz. ECG signals must be printed after calibration, so the calibration curve in the output is essential for two-dimensional image recognition: all output waveforms are drawn in proportion to it. Second, the data used in the experiment come from the lead II channel and show sinus rhythm, which makes it easier to recognize the waveforms of each cardiac conduction cycle in the clinical data. The clinical one-dimensional sequences were then converted into two-dimensional images. The experiment has four types of recognition targets: the calibration curve, the P wave, the QRS complex, and the T wave.
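The text does not describe how the one-dimensional sequence is rendered into an image. A minimal sketch, assuming the standard ECG paper scales stated here (25 mm/s horizontally, 10 mm/mV vertically) and a hypothetical resolution of 4 pixels per mm and baseline row 200, maps each sample to a (row, col) pixel position; rows grow downward, which matches the document's convention of the X axis pointing down and the Y axis pointing right.

```python
def ecg_to_pixels(samples_mv, fs=500.0, paper_speed=25.0, gain=10.0,
                  px_per_mm=4.0, baseline_row=200):
    """Map 1D ECG samples (mV) to (row, col) pixel positions on a 2D image.

    Columns follow time at paper_speed mm/s, rows follow amplitude at gain mm/mV.
    Rows grow downward (image convention), so positive voltages move up.
    px_per_mm and baseline_row are hypothetical rendering choices."""
    points = []
    for i, v in enumerate(samples_mv):
        t = i / fs                                        # time of sample i (s)
        col = round(t * paper_speed * px_per_mm)          # horizontal position
        row = round(baseline_row - v * gain * px_per_mm)  # vertical position
        points.append((row, col))
    return points

segment = [0.0] * 2500    # placeholder for a 2500-sample (5 s) lead II segment
segment[500] = 1.0        # a 1 mV deflection at t = 1 s
pts = ecg_to_pixels(segment)
print(pts[0], pts[500])   # (200, 0) (160, 100)
```

At these assumed scales, several consecutive samples share a pixel column (500 samples per 100 px), so a real renderer would draw line segments between the points rather than isolated dots.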
The experiment follows the usual training, validation, and testing procedure, with a sufficient amount of data for each stage. The samples are kept in sequence and assigned from the 500 total samples: the first 400 form the training and validation sets, of which 80% (320 samples) are used for training and 20% (80 samples) for validation, and the last 100 samples form the test set.
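The split can be sketched as follows. The text says the samples are arranged in sequence; whether the 80/20 division within the first 400 samples is also sequential (rather than random) is not stated, so the sequential version below is an assumption.

```python
def split_dataset(samples, n_trainval=400, train_frac=0.8):
    """Sequential split: first n_trainval samples -> train/validation, rest -> test.
    The train/validation division is also taken in order (an assumption)."""
    trainval, test = samples[:n_trainval], samples[n_trainval:]
    n_train = int(round(n_trainval * train_frac))
    return trainval[:n_train], trainval[n_train:], test

train, val, test = split_dataset(list(range(500)))
print(len(train), len(val), len(test))  # 320 80 100
```

Keeping the test set as the last 100 samples guarantees that no test patient appears in training, since each sample corresponds to a different patient.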

Training Results
The 2D-ECG image data are used to train a Faster R-CNN model for heart disease diagnosis. Training on the clinical data produces the total loss curve shown in Figure 10. The loss drops rapidly from an initial value of 3.54, approaches 0 with slight fluctuations at about 10,000 steps, and remains stable by 30,000 steps. This convergence indicates that the model can be applied to the recognition of clinical data. Many studies recognize only part of the ECG waveform, whereas this method recognizes multiple detailed parameters at the same time. The refined recognition results for two-dimensional ECG signals use four tags, O, P, R, and T, which denote the calibration curve, P wave, QRS complex, and T wave, respectively. The recognition results are shown in Figure 11, where the range of each type of refined waveform is clearly marked. The coordinates of the upper-left and lower-right corners of each recognition box are also marked; these values are used in subsequent calculations, such as the height and duration of each waveform, which serve as the basis for ECG diagnosis.
The Faster R-CNN recognition results for the refined ECG parameters of the clinical test samples are shown in Table 1. The recognition accuracy is 99.68% for Category O, 98.32% for Category P, 98.85% for Category R, and 98.90% for Category T, giving an overall average accuracy of 98.94%. These results indicate that the proposed method is feasible for computer-aided medical diagnosis.
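The overall figure of 98.94% matches the unweighted mean of the four per-category accuracies. The check below assumes that definition, since the text does not say whether the average is weighted by the number of boxes in each category.

```python
# Per-category recognition accuracy (%) as reported above.
per_class = {"O": 99.68, "P": 98.32, "R": 98.85, "T": 98.90}

# Unweighted (macro) average over the four categories.
mean_acc = sum(per_class.values()) / len(per_class)
print(f"{mean_acc:.2f}")  # 98.94
```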
To demonstrate the effectiveness of the proposed model, the results are compared with those reported in the literature (see Table 2). All the compared methods are based on one-dimensional sequences and take the waveforms in the cardiac conduction cycle as the recognition objects. Literature [30] used a long short-term memory (LSTM) model to recognize three types of refined waveforms, with an average accuracy of 92.00%. Literature [31] designed a fusion of multiple LSTM models that recognized only the P wave, with an average accuracy of 98.48%. Literature [32] combined a CNN with an ELM to recognize only the QRS complex, with an average accuracy of 98.77%. Literature [33] used a support vector machine (SVM) to recognize only the QRS complex, with an average accuracy of 95.26%.

Recognition Method | Average Accuracy (%) | Waveform Type | Waveform Number
Faster R-CNN | 98.94 | O, P, QRS, T | 4
LSTM [30] | 92.00 | P, QRS, T | 3
4-LSTM [31] | 98.48 | P | 1
CNN + ELM [32] | 98.77 | QRS | 1
SVM [33] | 95.26 | QRS | 1
The comparison shows that identification and diagnosis with the two-dimensional ECG algorithm based on Faster R-CNN achieves higher recognition accuracy, recognizes more kinds of refined waveforms, and produces more intuitive visual results.

Analysis and Diagnosis
Diagnosing the main ECG parameters from the coordinate values of each refined waveform in the cardiac conduction cycle is of practical significance. First, a mapping must be established between the time and amplitude parameters of the ECG signal and the coordinates in the image. The clinical ECG data used in this section were recorded with the calibration curve of the ECG instrument set to 10 mm/1 mV and 2.5 Hz, with a high level lasting 0.2 s. The calibration curve is therefore two big grids high, and the smallest amplitude cell on the vertical axis is 1 mm, corresponding to 0.1 mV. The paper speed is 25 mm/s and the sampling frequency is 500 Hz, so the sampling period is 0.002 s; 20 samples span 0.04 s and occupy one minimum cell, and 100 samples span 0.2 s and occupy one big grid. In the ECG report, one minimum cell on the horizontal axis thus corresponds to 0.04 s, a big grid contains five minimum cells (0.2 s), and five big grids correspond to 1 s, as shown in Table 3. The width and height of the calibration curve, which serves as the identification standard, are therefore known. Mapping the time and amplitude of ECG signals to image coordinates makes it possible to recognize and diagnose the parameters of each refined waveform [34,35]. Establishing the relationship between coordinates and mapping real signal values to image coordinates are thus necessary steps for the model to diagnose heart disease. The coordinate equations are established first.
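The grid arithmetic above can be checked directly from the acquisition settings (500 Hz sampling, 25 mm/s paper speed, 10 mm/mV gain, 5 mm per big grid). The constant names are illustrative.

```python
FS = 500.0              # sampling frequency (Hz)
PAPER_SPEED = 25.0      # paper speed (mm/s)
GAIN = 10.0             # calibration gain (mm per mV)
MM_PER_BIG_GRID = 5.0   # one big grid = five 1 mm minimum cells

sample_period = 1.0 / FS                             # seconds per sample
small_cell_time = 1.0 / PAPER_SPEED                  # seconds per 1 mm cell
samples_per_small_cell = small_cell_time * FS        # samples per 1 mm cell
big_grid_time = MM_PER_BIG_GRID * small_cell_time    # seconds per big grid
small_cell_amplitude = 1.0 / GAIN                    # mV per 1 mm cell
calib_height_grids = 1.0 * GAIN / MM_PER_BIG_GRID    # 1 mV pulse height in big grids
calib_width_grids = 0.2 * PAPER_SPEED / MM_PER_BIG_GRID  # 0.2 s pulse width in big grids

print(sample_period, small_cell_time, round(samples_per_small_cell),
      big_grid_time, small_cell_amplitude, calib_height_grids, calib_width_grids)
```

The calibration pulse is therefore two big grids high and one big grid wide, which is why its recognized box can serve as the scale reference for every other waveform.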
For each recognition box, the coordinates of the top-left and bottom-right points are stored in an array B with four elements, as shown in Equation (10), where N denotes the refined waveform to be recognized. The physical duration and amplitude of a waveform are then given by Equation (11), where tu is the time represented by one pixel and hu is the amplitude represented by one pixel. In this section, the calibration curve parameter values are known from Table 4, which gives Equation (12). Substituting the unit parameters tu and hu into Equation (11) maps the image coordinates to the real values of the cardiac electrophysiological signal, so the precise range of each ECG waveform can be calculated. Taking Figure 11a as an example, the calculated main ECG parameters are shown in Table 4. Although only four types of waveforms are recognized in the experiment, other important ECG parameters can be derived from the coordinates of these four types. The applicability of the proposed model is further verified through the diagnosis and analysis of different clinical diseases [36,37]. Based on the characteristics of the available samples and the waveform parameters recognized by the proposed model, some clinical examples of disease diagnosis are shown in Table 5; the parameters used are P, R, P-R, P-P, and QRS.
Diagnostic criterion | Disease
The calculated P-R interval is greater than 0.21 s. | I-atrioventricular block, etc.
The P-R interval increases continuously: P-R prolongation is calculated periodically and the conduction cycle waveforms are recognized continuously until a QRS complex is missing after a P wave. | II-1 atrioventricular block, etc.
The P-R interval is constant, and missing QRS complexes occur in proportion to the P waves. | II-2 atrioventricular block, etc.
Adjacent P waves are equidistant and R waves are equidistant, but P waves are unrelated to R waves. | III-atrioventricular block, etc.
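Equations (10) to (12) are not reproduced in this text, so the sketch below is a hypothetical reading of them: each box is a (left, top, right, bottom) pixel tuple with left/right along the time axis (how this maps onto the paper's X/Y naming is an assumption), tu and hu come from the recognized calibration box spanning one 0.2 s × 1 mV pulse, and the P-R interval is taken as QRS onset minus P onset. Only the first row of Table 5 (P-R greater than 0.21 s for every conducted beat) is illustrated; all box coordinates and helper names are hypothetical.

```python
def unit_scales(calib_box, pulse_s=0.2, pulse_mv=1.0):
    """Seconds per pixel (tu) and mV per pixel (hu) from the calibration box O,
    assuming the box spans exactly one 0.2 s x 1 mV calibration pulse."""
    left, top, right, bottom = calib_box
    return pulse_s / (right - left), pulse_mv / (bottom - top)

def duration_s(box, tu):
    """Waveform duration: horizontal box extent (pixels) times tu."""
    return (box[2] - box[0]) * tu

def pr_intervals(p_boxes, qrs_boxes, tu):
    """P-R interval per P wave: QRS onset minus P onset, pairing each P wave with
    the first QRS box that starts after it and before the next P wave.
    None marks a P wave with no following QRS (a dropped beat)."""
    result = []
    for k, p in enumerate(p_boxes):
        next_p = p_boxes[k + 1][0] if k + 1 < len(p_boxes) else float("inf")
        qrs = next((q for q in qrs_boxes if p[0] < q[0] < next_p), None)
        result.append(None if qrs is None else (qrs[0] - p[0]) * tu)
    return result

def meets_first_row_of_table5(intervals, threshold_s=0.21):
    """First row of Table 5: every P wave conducted and every P-R > threshold."""
    return all(pr is not None and pr > threshold_s for pr in intervals)

# Hypothetical boxes (left, top, right, bottom) in pixels.
tu, hu = unit_scales((10, 20, 60, 220))          # 50 px -> 0.2 s, 200 px -> 1 mV
p_boxes = [(100, 150, 130, 170), (300, 150, 330, 170), (500, 150, 530, 170)]
qrs_boxes = [(160, 60, 180, 200), (360, 60, 380, 200), (560, 60, 580, 200)]
intervals = pr_intervals(p_boxes, qrs_boxes, tu)
print([round(pr, 3) for pr in intervals])        # [0.24, 0.24, 0.24]
print(round(duration_s(qrs_boxes[0], tu), 3))    # 0.08
print(meets_first_row_of_table5(intervals))      # True
```

The `None` entries produced for dropped beats are also the raw material for the other rows of Table 5, which depend on how P-R evolves before a QRS complex goes missing and on whether P and R waves are related at all.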

Discussion
The proposed diagnosis method based on 2D-ECG and Faster R-CNN has advantages over traditional one-dimensional sequence methods in classification and diagnosis. First, it converts cardiac electrophysiological signals from one-dimensional time series into two-dimensional images, which can describe the information from multiple perspectives. Second, it identifies and diagnoses a complete beat in the cardiac conduction cycle, enabling fine recognition both within and between beats. Combined with the transformation between the coordinate values recognized in the image and the actual values, it enables the diagnosis of specific diseases. Compared with four models from the literature, the proposed model improves the recognition and diagnosis of cardiac electrophysiological signals and gives more intuitive visual results.
Because the cardiac system is complex, this study has some limitations. First, the experimental design considers only diagnosis based on the single standard lead II channel. Although this lead reflects the characteristics of the ECG signal and most studies use it, a single lead limits the analysis of diverse ECG data. Second, each sample is limited to 2500 sampling points (5 s). This segmentation was chosen mainly to verify the feasibility of recognizing detailed parameters, but it limits the diagnosis of progressive heart disease.

Conclusions
This paper establishes a novel faster region recognition method for 2D-ECG at the image level. Taking ECG data with sinus rhythm as the research object, it achieves refined recognition and diagnosis of each detailed waveform, with an average recognition accuracy of 98.94%. The experimental results show that the proposed method is feasible for computer-aided medical diagnosis.
Many aspects of ECG analysis and diagnosis still need further work. First, comprehensive multi-lead diagnosis and analysis for clinical application needs further research; a 12-lead analysis model could use the correlation between leads to improve the accuracy of clinical data recognition. Second, increasing the duration covered by each image would improve the recognition of detailed parameters within a beat and across multiple beats, which would help improve disease classification and broaden the scope of diagnosis. Finally, the number of samples should be increased, both across different diseases and for each disease type.