Deep Learning Algorithm for Heart Valve Diseases Assisted Diagnosis

Abstract: Heart sounds are mainly the expressions of the opening and closing of the heart valves. Some sounds are produced by the interruption of laminar blood flow as it turns into turbulent flow, which is explained by abnormal functioning of the valves. The analysis of phonocardiographic signals has made it possible to show that normal and pathological records differ from each other in both temporal and spectral features. The present work describes the design and implementation of a deep-learning algorithm for the binary and multiclass classification of four common valvular pathologies and normal heart sounds. For feature extraction, three different techniques were considered: the Discrete Wavelet Transform, the Continuous Wavelet Transform and Mel Frequency Cepstral Coefficients. The performance of both approaches reached F1 scores higher than 98% and specificities in the "Normal" class of up to 99%, which accounts for the cases that can be misclassified as normal. These results place the present work as a highly competitive proposal for the generation of systems for assisted diagnosis.


Introduction
Cardiovascular diseases occupy first place among the causes of death around the world, according to the World Health Organization (WHO) [1]. Heart valve diseases (HVD) are also found in these figures, where moderate or severe valvular abnormalities are notably common in the adult population and increase their presence as the individuals age [2].
To examine the condition of the heart valves, the most common medical practice is auscultation, which consists of listening to acoustic characteristics directly through the patient's chest wall using a stethoscope. These heart sounds can be interpreted as the acoustic expression of the opening and closing of the four heart valves (tricuspid, mitral, pulmonary and aortic), where the muscular contraction that drives the blood from one cavity to another generates a high acceleration and deceleration of the blood flow, causing a pressure difference [3,4]. The normal physiological functioning of the valves enforces unidirectional flow, which allows the correct circulation of blood through the cardiovascular circuit. However, some sounds are produced by the interruption of laminar blood flow as it turns into turbulent flow, which is explained by abnormal and pathological functioning of the heart valves.
The cardiac cycle is composed of two phases: the systole, during which the ventricles contract and drive blood to the blood vessels, and the diastole, in which the ventricles are filled. The systole begins with the closure of the mitral and tricuspid valves, producing the first heart sound, or S1, while the diastole starts with the closure of the aortic and pulmonic valves, producing the second heart sound, or S2. In addition to S1 and S2, extra sounds may be present during the cardiac cycle, which can indicate an abnormality [5,6]. The duration of the noise varies depending on the valvular abnormality [7]; Figure 1 illustrates these sounds within the cardiac cycle. Since the range of frequencies that heart sounds generate is near the lowest threshold of sensitivity for the human ear, a practising physician requires extensive training guided by experienced physicians to make a correct diagnosis. In an alternative scenario, heart sounds can be recorded using electronic stethoscopes, allowing the practitioner to listen to heart sounds as many times as necessary to train the ear. This has proven effective in improving physicians' skills without the need to rely on available patients during hospital rotation [8]. In both cases, the diagnosis depends fundamentally on the skill of the physician and is subject to human error.
To alleviate the need for the correct training of the practising physician and to generate tools that can be used as auxiliary techniques in the diagnosis of heart valve diseases, in the present work, we implement a deep learning algorithm for the classification of common valvular diseases based on their temporal and spectral features. The dataset used contains phonocardiographic (PCG) records previously labelled in five classes: normal (N), aortic stenosis (AS), mitral regurgitation (MR), mitral stenosis (MS) and mitral valve prolapse (MVP).
It is worth mentioning that the previously described dataset has already been used to address similar classification problems. Firstly, the authors of the database, in 2018 [9], proposed to perform a multiclass classification comparing the performance of Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and a Deep Neural Network (DNN), and the extracted features were Discrete Wavelet Transform (DWT) and Mel Frequency Cepstral Coefficients (MFCC).
In the following two years, five more classification works were published using the same database, proposing a variety of feature extraction techniques and classification algorithms: (i) a set of eleven statistical features from the estimated instantaneous frequency of non-segmented PCG signals, for the binary classification and multiclassification comparing both random forest and KNN classifiers [10]; (ii) time-frequency features from the segmented cycles of the PCG signal through the wavelet synchrosqueezing transform, for a multiclass classifier for four of the five classes (MR, MS, AS, N) using a random forest classifier [11]; (iii) centroid frequency as the main feature to compare the performance of KNN and SVM, on both binary and multiclass problems [12]; (iv) local energy and local entropy features of the Chirplet transform from the PCG signal for a multiclass classifier for four of the five classes (MR, MS, AS, N) using a composite classifier [13]; (v) raw data and a WaveNet neural network for a multiclass classification model was proposed, to classify the PCG into the five classes [14]. All the above-mentioned works are briefly summarized in Table 3.
The present work stands out by using a deep learning algorithm that exploits frequency dynamics present throughout the cardiac cycle to differentiate between normal and abnormal cardiac conditions. In particular, three properties place this work as a new methodological proposal: (i) The use of Convolutional Neural Networks (CNN), together with the Continuous Wavelet Transform (CWT) and MFCCs, turning the time series classification problem into an image classification problem, which has not been reported before for HVD classification, since all the related works use a vector-like feature arrangement or the raw time series. (ii) A transfer learning approach, which allowed us to migrate the deep learning algorithm previously trained for multiclass classification into a binary classification, permitting us to test the pre-trained model on a new classification problem and ensuring that the learning process generalizes rather than memorizes. (iii) The complete model of the deep learning algorithm consists of two main stages.
The first is formed by three parallel neural networks, each processing one of the three different features for individual cardiac cycles. The second is constituted by a perceptron that integrates the outputs of the previous stage, finding new patterns between them, improving the efficiency and effectiveness compared with the related works.

Materials and Methods
The classification of HVDs through the analysis of PCG signals consists of several stages, as shown in Figure 2: (i) generation of the dataset and labelling; (ii) pre-processing, which consists of cleaning the dataset, filtering the signal and segmentation by time windows; (iii) feature extraction; (iv) classification through a deep learning algorithm; (v) validation of the classifier by comparing predictions against the true labels. The present work addresses the problem from the pre-processing stage onward. The signals were obtained from an open dataset [9] that consists of 200 records per class. These were digitized with a sampling frequency of 8 kHz, where each record has a duration of at least one second. To maintain uniformity in the data during the analysis, each record was segmented (when possible) into two windows of 6144 data points (0.768 s), each containing at least one complete cardiac cycle X_n. The details of the dataset after completing the segmentation process are shown in Table 1. As can be seen from the results, the imbalance in the MS class does not affect the multiclass classifier performance. Given the optimization problem of any supervised model, where the size of the training data is related to the number of parameters of the model, we expect satisfactory training and good convergence [15][16][17][18][19].
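The windowing step described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `segment_record` is hypothetical, and the only assumptions taken from the text are the 8 kHz sampling rate, the 6144-sample window length and the limit of two windows per record.

```python
import numpy as np

def segment_record(x, fs=8000, win=6144):
    # Cut a PCG record into non-overlapping 6144-sample (0.768 s) windows,
    # each expected to contain at least one full cardiac cycle, and keep
    # at most two windows per record, as described in the text.
    windows = [x[i:i + win] for i in range(0, len(x) - win + 1, win)]
    return windows[:2]
```

Records shorter than two windows simply yield a single window, which matches the "when possible" qualifier in the text.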

Feature Extraction Techniques
Within the area of computer science, artificial intelligence and machine learning, feature extraction consists of the implementation of techniques aimed at extracting informative and non-redundant parameters from the measured data. Thus, the learning and generalization stage is facilitated.
Given the nature of the information in the records that shapes the dataset used in the present work, we decided to use three different techniques to extract the spectral features of the signals: MFCCs, CWT and energy from the DWT. The pseudocode for each of the feature extraction techniques is given in Appendix A.

Mel Frequency Cepstral Coefficients
This is a state-of-the-art algorithm based on the known sensitivity variation of the critical frequency bandwidths present in the cochlea of the human ear, and it is widely used for the extraction of spectral information from time series [20]. Its success is due to its ability to represent the width of the spectrum of a signal in a compact form. Here, we review each step in the process of creating the MFCCs, with parameter choices adapted to the characteristics of heart sounds; the summary of the stages is shown in Figure 6 inside the pink box [21].
Framing: The general purpose of this stage is to prepare the signal by segmenting it into frames of N samples, where consecutive frames are shifted by m samples; in this work, adjacent frames do not overlap. Subsequently, to analyze the frequency changes, the Discrete Fourier Transform is applied to each segment. Usually, 20-40 ms windows with a 50% (±10%) overlap are used to process speech. However, heart sounds are dominated by frequencies lower than those present in the voice, so 64 ms windows without overlap are proposed.
Discrete Fourier Transform (DFT): The energy contribution of each frequency constituent is calculated for each segment through the DFT. This transformation from the time domain to the frequency domain is described by [22]:

F[k] = Σ_{n=0}^{N−1} f[n] e^{−j2πkn/N}, k = 0, 1, …, N − 1,

whose frequency resolution is given by fs/N, where fs is the sampling frequency, N the number of samples and |F| the magnitude.

Filter Bank: Although the number of filters is a hyperparameter, the central frequency of each filter is calculated linearly from the theoretical maximum frequency recorded in the signal. The first step consists of changing the scale from Hertz to Mels for the theoretical maximum frequency of the signal (fs/2, according to the Nyquist theorem) [23]:

mel(f) = 2595 log10(1 + f/700).

Once the theoretical maximum frequency in Mels is obtained, M equidistantly spaced frequencies are proposed as central frequencies for the filters. Subsequently, the central frequencies are remapped from Mels to Hertz to design the triangular filters H_m[k] [24]. Then, F × H_m is computed, obtaining a reduced spectral representation of the signal.

Discrete Cosine Transform: Finally, a discrete cosine transform is applied to the logarithm of the filter-bank energies E_m in order to obtain the MFCCs of each window, described as [25]:

MFCC_i = Σ_{m=1}^{M} log(E_m) cos[ i (m − 1/2) π / M ].

As mentioned, this process of multiplying the coefficients obtained after applying the DFT by the filter bank is repeated for each segment of the signal, obtaining, as shown in Figure 3, an array with dimensions N × M, where N is the number of windows and M the number of filters. The graph on the right shows a normalized PCG record, randomly selected from each of the classes.
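The full pipeline (framing, DFT, mel filter bank, DCT) can be sketched as below. This is an illustrative implementation, not the authors' code: the function names, the filter count and the number of kept coefficients are assumptions; only the 8 kHz rate and 64 ms (512-sample) non-overlapping windows come from the text. The Hz-to-Mel mapping is the standard 2595·log10(1 + f/700) formula.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, win_len=512, n_filters=20, n_ceps=13):
    # Framing: 64 ms windows (512 samples at 8 kHz), no overlap
    n_frames = len(signal) // win_len
    frames = signal[:n_frames * win_len].reshape(n_frames, win_len)
    # One-sided DFT magnitude spectrum of each frame
    mag = np.abs(np.fft.rfft(frames, axis=1))
    # Triangular mel filter bank up to the Nyquist frequency fs/2
    mel_points = np.linspace(0.0, hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((win_len + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, win_len // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filter-bank energies, then DCT -> cepstral coefficients
    feat = np.log(mag @ fbank.T + 1e-10)
    return dct(feat, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

Applied to a 6144-sample window, this yields the N × M feature matrix described in the text, with N = 12 frames.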

Continuous Wavelet Transform
Similar to the Short-Time Fourier Transform (STFT), the wavelet transform (WT), in both its continuous (CWT) and discrete (DWT) forms, is a spectral decomposition. The easiest way to understand the WT is by comparison: the Fourier transform decomposes the signal f(t) into harmonic components in the form of sines and cosines, while the WT decomposes the signal into scaled and shifted versions of a mother wavelet. A mother wavelet is a waveform of effectively limited duration with an average equal to zero. The mother wavelet used in the CWT was a Morlet, described by

ψ(t) = π^{−1/4} e^{jω₀t} e^{−t²/2}.

The CWT of a signal f(t) is given by [26]:

W(a, b) = (1/√a) ∫ f(t) ψ*((t − b)/a) dt.

Therefore, the integral is solved for a, b (the scaling and shifting parameters), which transforms the signal f(t) from the time domain into a function of time and scale.
By applying the CWT, we obtain a matrix representation of the coefficients of size N × M, as shown in Figure 4, where N is the number of time points, while M is the number of scales used. Given the high dimensionality of these matrices, it was necessary to apply an averaging 3 × 3 filter to reduce the computational demand. The graph on the right shows a normalized PCG record, randomly selected from each of the classes.
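A minimal sketch of the Morlet scalogram, including the Hilbert-envelope pre-processing discussed below, could look as follows. This is not the authors' implementation: the kernel length, the center frequency ω₀ = 6 and the function names are assumptions for illustration; only the Morlet mother wavelet and the a priori Hilbert transform come from the text.

```python
import numpy as np
from scipy.signal import hilbert

def morlet(t, w0=6.0):
    # Complex Morlet mother wavelet: pi^(-1/4) * exp(j*w0*t) * exp(-t^2/2)
    return np.pi ** -0.25 * np.exp(1j * w0 * t) * np.exp(-t ** 2 / 2)

def cwt_scalogram(x, scales, use_hilbert=True):
    if use_hilbert:
        x = np.abs(hilbert(x))          # envelope via the Hilbert transform
    out = np.empty((len(scales), len(x)))
    for i, a in enumerate(scales):
        M = int(8 * a) | 1              # odd kernel length, roughly +-4 sigma
        t = (np.arange(M) - M // 2) / a
        psi = morlet(t) / np.sqrt(a)    # scaled, amplitude-normalized wavelet
        # correlation of the signal with the scaled wavelet
        out[i] = np.abs(np.convolve(x, np.conj(psi[::-1]), mode='same'))
    return out                          # rows: scales, columns: time points
```

The paper additionally averages the resulting N × M matrix with a 3 × 3 filter to reduce the computational demand; that pooling step is omitted here.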
Hilbert Transform: It is known that the multiresolution framework of the CWT can be improved through the Hilbert transform [27] if this is applied to the signal a priori. The fundamental reason that this transformation can be integrated with the wavelet transform is due to its scale and translation invariance, and its energy-conserving (unitary) nature [28].
Formally, the Hilbert transform ŝ(t) of a function s(t) is obtained by calculating the convolution of s(t) with 1/(πt), defined as [29]:

ŝ(t) = (1/π) ∫ s(τ)/(t − τ) dτ.

This transformation has been shown to improve the performance of the neural network used, so it was implemented as part of the algorithm, as shown in the yellow box in Figure 6.

Discrete Wavelet Transform

As its name implies, the DWT is any WT whose wavelet function is discretized and, as with the CWT, it captures the information in a time-and-scale domain.
The DWT uses a family of orthonormal wavelets with unit energy, described by

ψ_{a,b}(t) = 2^{a/2} ψ(2^a t − b),

where, as in the CWT, a is the scaling parameter and b the translation parameter of the mother wavelet ψ, Daubechies 4 being the mother wavelet used in this algorithm. To analyze the spectral resolution of the data, it is necessary to introduce the scaling function, defined as [30]:

φ(t) = Σ_k C_k φ(2t − k).

The C_k are known as wavelet coefficients. These coefficients are arranged as a transformation matrix that is applied to a vector of data. In this way, the coefficients act in two different patterns: one works as a smoothing (low-pass) filter and the other as a pattern that extracts only the details of the information (high-pass filter).
This concept of analyzing a signal using filter banks is known as Mallat tree decomposition. Inside the green box in Figure 6, we can see how a signal x(n) is decomposed into approximations a j (n) and details d j (n) by the effect of the filters h j (n) and g j (n).
Once the multiresolution analysis has been applied, the energy of each detail d_j[n] and of the last approximation a_J[n] is calculated, defined as:

E_j = Σ_n d_j[n]²,  E_J = Σ_n a_J[n]².

As with the MFCCs, the signal was segmented into 12 windows of 512 data points, and the energy was calculated per level of decomposition. In Figure 5, the average energy and its standard deviation for each segment per class are shown.
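The Mallat decomposition and per-level energies can be sketched as below. This is an illustration, not the authors' code; it uses the 4-tap Daubechies (D4) analysis filters, whereas the paper's "Daubechies 4" may refer to the 8-tap db4 wavelet, and the simple boundary handling here is an assumption.

```python
import numpy as np

# 4-tap Daubechies (D4) analysis filters; the high-pass filter is the
# quadrature mirror of the low-pass one: g_k = (-1)^k * h_{3-k}
_h = np.array([1 + 3 ** 0.5, 3 + 3 ** 0.5,
               3 - 3 ** 0.5, 1 - 3 ** 0.5]) / (4 * 2 ** 0.5)
_g = _h[::-1] * np.array([1.0, -1.0, 1.0, -1.0])

def dwt_step(x):
    # One Mallat level: filter with h (approximation) and g (detail),
    # then downsample by 2
    a = np.convolve(x, _h[::-1])[3::2]
    d = np.convolve(x, _g[::-1])[3::2]
    return a, d

def dwt_energies(x, levels=5):
    # Energy of each detail band plus the final approximation
    energies, a = [], np.asarray(x, dtype=float)
    for _ in range(levels):
        a, d = dwt_step(a)
        energies.append(np.sum(d ** 2))
    energies.append(np.sum(a ** 2))
    return np.array(energies)
```

Applying this to each 512-sample segment yields the per-segment, per-level energy features fed to the perceptron branch of the model.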

Classification Algorithms
A classifier is an algorithm that assigns a putative class to a pattern, this being the identifier of an object, signal or phenomenon encoded based on its features or attributes [31], which in turn may be continuous, categorical or binary. If instances are given with known labels (the corresponding correct outputs), then the learning is called supervised, in contrast to unsupervised learning, where instances are unlabelled. By applying these unsupervised (clustering) algorithms, researchers hope to discover unknown, but useful, classes of items [32].
For the classification and recognition of normal and abnormal PCG signals described above, a supervised deep learning algorithm was implemented, which refers to a multilayer directed model with a non-linear topology and multiple inputs and/or outputs, capable of extracting by itself the patterns that define each class.

Deep Learning (DL)
The DL model used in the present work consists of two main stages, as shown in Figure 6. The first stage consists of three parallel artificial neural networks, aimed at generalizing patterns: a multilayer perceptron network whose input was the coefficients obtained from the energy calculation after spectral decomposition applying the DWT, and two convolutional neural networks (CNN) that receive as input, respectively, the matrices with the MFCCs and the CWT coefficients. In the second stage, the outputs of the three independent networks were concatenated to feed a multilayer perceptron that gives the probability that the PCG signal belongs to any of the classes or subclasses mentioned in Table 1.

The main difference between the multiclass and binary classification lies in the number of neurons in the last layer, being two or five for the binary or multiclass classification, respectively. However, only the multiclass network was trained, and, to perform the binary classification, a transfer learning approach was implemented. This implies that, once the multiclass model was trained, only the last layer was changed. The labels of the abnormal class for the binary classification are split into four subclasses for the multiclass classification, and the activation function of the output layer differs according to the problem. For the multiclass classification, a probabilistic activation function (softmax) is used, described as

softmax(z)_i = e^{z_i} / Σ_j e^{z_j},

while, for the binary classification, the activation function used was a sigmoid, described as

σ(x) = 1 / (1 + e^{−x}).

The feature extraction algorithms, the DL algorithm and its constituent networks were implemented in Python 3.9. In particular, the DL neural network was designed using Keras 2.4.3 on an Ubuntu 20.04 distribution. Keras is a high-level neural network library written in Python, which can run on top of TensorFlow or Theano.
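The two output activations can be written compactly in NumPy; this is a generic sketch of the standard formulas, not an excerpt from the authors' code.

```python
import numpy as np

def softmax(z):
    # Probabilistic output layer for the multiclass classifier
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(x):
    # Output activation for the binary classifier
    return 1.0 / (1.0 + np.exp(-x))
```

Softmax returns a vector that sums to one (a probability over the five classes), while the sigmoid squashes a single score into [0, 1].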
The model and the respective sub-architectures were allowed to run on an Intel processor (i7-9700) with 32 GB of RAM and on an NVIDIA GeForce GTX 1650 GPU with 4 GB of VRAM memory. The pseudocode describing the architecture for the DL algorithm proposed is presented in Appendix B.

Convolutional Neural Network (CNN)
A Convolutional Neural Network is a bio-inspired algorithm that attempts, to some extent, to emulate the ability of the visual cortex to extract key features from images and process them. A CNN has multiple layers, including a convolutional layer, normalization, non-linearity layer, max-pooling and a multilayer perceptron.
The discrete convolution between the filter K and the coefficient matrix I is mathematically defined as:

conv(I, K)[i, j] = Σ_{p=1}^{f₁} Σ_{q=1}^{f₂} I[i + p − 1, j + q − 1] K[p, q].

It is possible to deduce that, if the image dimension is given by (n_H, n_W), the filter dimension by (f₁, f₂) and the stride by s, the dimension of the convolution will be

dim(conv(I, K)) = ( (n_H − f₁)/s + 1, (n_W − f₂)/s + 1 ).

Max-pooling is a particular case of a convolutional layer where the filter is a matrix of ones and, after the convolution, a maximum function is applied; that is, each output element is the maximum of the f₁ × f₂ window of I starting at (is, js). By convention, we consider a square filter with dimensions f₁ = f₂ = 2 and s = 2. In CNN terminology, the pair of convolution and max-pooling layers in succession is often referred to as a convolutional layer [33].
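The two operations above can be sketched naively in NumPy (real CNN libraries use vectorized kernels, but the arithmetic is the same). The function names are illustrative, not from the paper.

```python
import numpy as np

def conv2d(I, K, s=1):
    # Valid discrete convolution (correlation form) with stride s:
    # output dim = ((nH - f1)//s + 1, (nW - f2)//s + 1)
    nH, nW = I.shape
    f1, f2 = K.shape
    oH, oW = (nH - f1) // s + 1, (nW - f2) // s + 1
    out = np.empty((oH, oW))
    for i in range(oH):
        for j in range(oW):
            out[i, j] = np.sum(I[i*s:i*s + f1, j*s:j*s + f2] * K)
    return out

def max_pool(I, f=2, s=2):
    # Max-pooling: a convolution-shaped sweep keeping the window maximum
    nH, nW = I.shape
    oH, oW = (nH - f) // s + 1, (nW - f) // s + 1
    return np.array([[I[i*s:i*s + f, j*s:j*s + f].max() for j in range(oW)]
                     for i in range(oH)])
```

For a 4 × 4 input and a 2 × 2 filter with s = 1, the convolution output is 3 × 3, while 2 × 2 max-pooling with s = 2 halves each dimension, as the formulas predict.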

Multilayer Perceptron
A multilayer perceptron [34] refers to a fully connected artificial neural network, with at least one hidden layer, in which each layer takes as input the output of the previous layer, multiplies it by a vector of synaptic weights, adds the results and passes them through an activation function:

y_i^l = σ( Σ_j w_{j,i}^l y_j^{l−1} + b_i^l ).

In this notation, y_i^l denotes the output of the i-th neuron of the l-th layer, w_{j,i}^l are the synaptic weights of the neurons and σ(x) is the activation function. The optimization algorithm used to update the synaptic weights W (training) was gradient descent:

W ← W − α ∂MSE/∂W,

where W is the synaptic weight matrix, α is the learning rate, and MSE is the mean square error.
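A single sigmoid layer trained with the gradient-descent rule above can be sketched as follows. This is a generic illustration of the update, not the paper's training loop; the function names and the toy dimensions are assumptions.

```python
import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

def layer_forward(W, b, x):
    # y_i = sigma(sum_j w_ji * x_j + b_i)
    return sigmoid(W @ x + b)

def gd_step(W, b, x, y_true, alpha=0.1):
    # One gradient-descent update of the mean square error:
    # W <- W - alpha * dMSE/dW, using sigma'(z) = y * (1 - y)
    y = layer_forward(W, b, x)
    delta = (y - y_true) * y * (1 - y)
    return W - alpha * np.outer(delta, x), b - alpha * delta
```

Iterating `gd_step` on a fixed sample drives the layer output toward the target, i.e. the MSE decreases, which is the behavior the update rule is designed to produce.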

Results
During the classifier construction, the experimentation consisted of designing independent classifiers that, as input, had the features extracted from the dataset. The DL architecture for the classification and prediction of the normal and abnormal heart sounds has three neural networks that, independently and in parallel, generalize the patterns that define the classes from the features mentioned in Section 2. Each constitutive network was tested independently and in pairs to find the most competitive configuration.
As mentioned before, the models were initially trained to perform a multiclass classification, and later, a transfer learning approach was applied to modify the last layer of the trained network to obtain a binary classification. In Figure 7, the boxplot shows the F1 score obtained by testing the complete multiclass model 10 times, while the black dotted line illustrates the tendency of the binary accuracy obtained by the best model, for percentages of the dataset randomly split into training and testing sets, ranging from 50-50% to 100-0%. Ignoring the (almost) perfect performance of the model achieved using the entirety of the dataset as the training set, it is possible to observe that, as the training data percentage increases, the performance for the multiclass classification grows following a quadratic fit, while the performance of the binary classification reaches a plateau when training with 80% of the data. Therefore, the complete model and submodels were trained and tested using an 80-20% split of the dataset, respectively. Figure 8 summarizes the performance of the independent models, pairwise combinations of them and the complete model, previously trained and evaluated using the testing data, measured through precision, recall, F1 score and specificity for multiclass classification, and accuracy for the binary classification. To prevent overtraining, intermediate dropout [35] layers were implemented in each of the networks, with a uniform probability of deactivation of 30% for each of the neurons. Figure 7. Performance of the complete model for different percentages of the training data split. The boxplot shows the F1 score over a ten-fold test; the green dotted line shows the increase of the median adjusted to a quadratic function. The black dotted line shows the tendency of the best model for each percentage, transferred to classify the binary classes, and the red triangles account for the performance reached with 100% of the dataset as the training set.
After training and testing the networks, it is possible to observe that the configurations with the best performance are those that have as inputs both of the spectral features extracted by MFCCs and CWT; these results are summarized in Table 2. On the other hand, if we compare the performance of the networks that only have one of the features as input, we can observe that the CNN trained with the features extracted by CWT has the best performance. Figure 9A,C show the detailed F1 score, precision, recall and specificity obtained for each class in the complete architecture using 20% of the dataset as the testing set. It is possible to observe that 95% of the classes were correctly classified in the multiclass classification, according to their F1 score, and 99% in the binary classification, according to the binary accuracy. Furthermore, the confusion matrix, on which all the metric calculations were based, is shown in Figure 9B,D, where each column of the matrix represents the number of predictions of each class, while each row represents the instances in the true class. Meanwhile, Figure 10A,B show the confusion matrices obtained for each class in the complete architecture, using 100% of the dataset as the training/testing set in a multiclass and binary classification, respectively.
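Given the confusion-matrix convention stated above (rows are true classes, columns are predictions), the per-class metrics used throughout the Results can be computed as in this generic sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def per_class_metrics(cm):
    # cm[i, j]: instances of true class i predicted as class j
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # column sums minus the diagonal
    fn = cm.sum(axis=1) - tp          # row sums minus the diagonal
    tn = cm.sum() - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)      # penalizes false positives per class
    return precision, recall, f1, specificity
```

The specificity column is the one singled out in the Discussion: for the "Normal" class it counts pathological records misclassified as healthy.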
Finally, Figure 11 shows the performance of each network and of the combinations used, for each of the classes, evaluated through the same metrics. It is possible to observe that, although the performance varies between the networks, the results are consistent across each of the proposed configurations, where the precision and recall for each class closely match. Along with the intermediate dropout layers, these results confirm that the networks were correctly generalizing the features that define each class and not memorizing the patterns of each signal. Figure 11. Experimental results: The bar graphs show the results achieved by each architecture, for each class, measured using precision, recall, F1 score and specificity as metrics.

Discussion
Since PCG signals contain information about the condition of the heart valves, the analysis and development of technology that allows the correct classification of these signals can be used as an auxiliary technique for the precise and timely diagnosis of HVDs. The information of the PCG signals has a temporality, relative intensity and spectral encoding, which invites us to use feature extraction techniques to enhance their contributions, facilitating their classification. To propose a novel method to detect HVDs, we decided to build a DL neural network that would receive as input a set of features extracted by CWT, MFCCs and the energy associated with each level of decomposition by DWT.
Through the experimentation, the performance of each constituent part of the network was measured independently and in pairs for the multiclass and binary classification problems. The configurations with the best performance were those that generalized from the features extracted by MFCCs and CWT, namely, the complete model or the partial model consisting of the two parallel CNNs. Nevertheless, the F1 scores and binary accuracy of the complete model reached values slightly higher than 95% and 99%, respectively, indicating that the energy associated with each level of decomposition by DWT contains information that the CWT or MFCCs do not, making the complete model the most competitive architecture.
In addition to the F1 score, one of the principal metrics to evaluate the competitiveness of the proposed model is the specificity, since this is the only one that takes into account the false positives of each class. In particular, it is of great interest to observe the specificity calculated in the "Normal" class, since a perfect score (100%) in this metric would ensure that no patient with any HVD would be classified as a healthy person, which is the worst-case scenario during diagnosis and could mean a risk to the patient's health. We consider that, for clinical applications, this is the most important metric that must be accounted for, and it is where this work stands out, having specificities in the "Normal" class of up to 99%.
It should be noted that this work is not the first to address a similar problem; however, a fair comparison can only be made with those publications that have used the same database as us. Table 3 shows, in a comparative way, seven of these works (including the present one), which were given the task of applying artificial intelligence techniques to perform binary or multiclass classification.
If we take the performance of the model proposed in the present work, achieved using 80% of the dataset as the training set and measured through the F1 score, and compare it with the F1 score of the work proposed by [9], or any global accuracy proposed in the other works, it would seem that our proposal is less competitive. However, there are certain neglected considerations in the mentioned works, which the present work addresses more thoroughly.
First, it is possible to observe that the results reported in [9,12,14] are those obtained during training, since the sum of the data contained in the confusion matrices is the total of the data in the dataset. This has important implications when interpreting their results, since using 100% of the dataset during training allows us only to know the degree of compaction of the data and the memorization capacity of the classification algorithms, but it is not possible to measure the capability of the algorithm to generalize, as it has not been tested with unseen data.
In order to compare our work with the state-of-the-art algorithms mentioned in Table 3, the performance of our model using 100% of the dataset as the training set was experimentally calculated, surpassing all, in both multiclass and binary classification. Nevertheless, we do not encourage the use of this approach since it is not a correct computational practice [31].
Another neglected point is the imbalance of the classes present during the binary classification reported in [10,12]. In the first one, it is reported that the data are separated into training (70%) and testing (30%) sets. In both cases, the number of data points present in the "abnormal" class is several times greater than that present in the "normal" class. This tends to skew the classifier by giving it more information about one class, increasing the probability of an incorrect classification.
It seems that the proposals with which we can appropriately compare our work are [11,13]. Nonetheless, they arbitrarily remove the MVP class from the database, and we can only make assumptions about why: as mentioned by Delling et al. in 2016 [36], MVP progresses into MR over a period of 3 to 16 years in one fourth of individuals. Since MVP is the precursor of MR, it is safe to suppose that both share similarities, so much so that, in our results, the MVP is the worst classified pathology, mostly confused with MR.
Finally, most of the works chose to measure the performance of their classifiers through global accuracy and not the F1 score. This is only a problem when classes are highly imbalanced, as the accuracy does not take into account how the data are distributed. Since, in most real-life classification problems, an imbalanced class distribution exists, the F1 score is a better metric to evaluate our model on. Nevertheless, to make a fair comparison, we also calculate the global accuracy in our model, which surpasses all previously mentioned works, even under the 80% training set condition.
Given all the aforementioned considerations, and the scrutiny under which this work was developed, it is possible to conclude that the results achieved by our algorithm are highly competitive and reliable compared to those present in the state of the art that use the same database (Table 3). Nevertheless, it is necessary to test new, non-spectral feature extraction techniques, increasing the attributes for the improvement of the proposed model.

Table 3. Comparative table between works that used the same dataset. The DL algorithm proposed in the present work is mentioned twice, showing the results using 80% or 100% of the dataset as the training set. When more than one classifier was used, the best one (*) is presented.

Funding: This research was funded by the Instituto Politécnico Nacional through the projects SIP20201657 and SIP20210473.

Data Availability Statement:
The data supporting the reported results can be found at https://github.com/yaseen21khan/Classification-of-Heart-Sound-Signal-Using-Multiple-Features- (accessed on 27 January 2022).