Heart Murmur Quality Detection Using Deep Neural Networks with Attention Mechanism

Abstract: Heart murmurs play a critical role in assessing the condition of the heart. Murmur quality reflects the subjective human perception of heart murmurs and is an important characteristic strongly linked to cardiovascular diseases (CVDs). This study aims to use deep neural networks to classify patients' murmur quality (i.e., harsh and blowing).


Introduction
Cardiovascular diseases (CVDs) are among the world's most serious diseases, killing more people each year than any other cause of death. According to World Health Organization estimates, 17.9 million people died from CVDs in 2019, representing 32% of all global deaths [1]. Most deaths caused by CVDs happen in low- and middle-income countries, where the majority of the population lacks access to an integrated primary healthcare system. Thus, the diagnosis and treatment of CVDs may be delayed, increasing the risk of early deaths [2].
Phonocardiograms (PCGs) are heart sound signals produced by the mechanical activity of the heart, containing information related to the heart condition [3]. Cardiac auscultation can provide insights into the PCGs and determine at a low cost whether more expensive testing should be ordered [4], thereby reducing deaths due to CVDs. Despite its benefits, cardiac auscultation is a difficult skill to acquire, requiring extensive training and clinical experience [2], which limits its popularization in most low- and middle-income countries that lack cardiologists [5]. Computer-aided auscultation offers a solution by significantly decreasing the cost of cardiac auscultation, efficiently addressing this difficulty. Computer-aided heart sound classification and detection is a crucial part of computer-aided auscultation. Most studies in this field now focus on determining the presence or absence of heart murmurs and the normality or abnormality of heart sounds. An overview of the methods used in these studies can be found in [6][7][8]. Notably, except for [5,9], there has been limited research using deep learning methods to explore more detailed PCG signal characteristics such as the grading, pitch, shape, timing, and quality of heart sounds, though they are all important for assessing the condition of the heart.
Heart murmurs include systolic and diastolic murmurs, which occur during the systolic period and the diastolic period, respectively. According to the classification of Leathem, systolic heart murmurs are mainly divided into systolic ejection murmurs and systolic regurgitant murmurs. The "harsh" quality is most commonly found in systolic ejection murmurs, and the presence of this type of murmur is usually associated with innocent systolic flow murmurs and valvular or vascular obstruction. The "blowing" and "harsh" qualities are both likely to be found in systolic regurgitant murmurs, and the presence of these murmurs is associated with mitral or tricuspid regurgitation and ventricular septal defect [10]. Therefore, computer-aided detection of murmur quality would provide richer and more accurate information related to CVDs for computer-aided auscultation. To the best of our knowledge, no studies have provided deep learning models for the specific task of murmur quality detection according to clinical criteria; the related works can be found in Section 2 below.
The determination of murmur quality relies on subjective human judgment, which comes from the experience cardiologists gain through extensive training rather than from a recognized gold standard, leading to difficulties in the computer detection of murmur quality. Deep learning is an effective way to address this problem because it can mimic highly skilled cardiologists, who possess considerable knowledge gained through ample training.
This work aims to detect the murmur quality (i.e., harsh and blowing) for patients and to find the contribution of other murmur characteristics to this task using deep neural networks, which can help in the diagnosis of cardiac diseases and the establishment of criteria for murmur quality detection. Specifically, the study makes the following contributions:

•	Designing a deep learning model to extract features from the log-Mel spectrograms of PCG segments and proposing a new module to weight the features extracted from one patient for murmur quality detection;
•	Thoroughly evaluating the advantages and inadequacies of deep learning methods in murmur quality classification;
•	Exploring the relationship between murmur quality and other murmur characteristics by using deep learning models.

Related Works
Numerous algorithms and models have been developed to classify and detect heart sounds so far. The majority of them focus on traditional binary classification, i.e., determining whether a murmur is present or not and whether a heart sound is abnormal. For example, in the George B. Moody PhysioNet Challenge 2022 [11], the participating teams mainly strove to tackle these two issues, as in [12][13][14]. In addition, a small portion of the work focuses on more novel classification problems, such as the diagnosis of cardiac diseases and murmur grading. The authors of [15][16][17] investigated the detection of cardiac valve disorders, whereas [5,9] graded murmurs. To the best of our knowledge, there are no deep learning models built for murmur quality detection for the time being.
For the research methodology, part of the current research used machine learning algorithms to categorize manually extracted features [18][19][20]. For example, [19] extracted two different features from heart sounds, the discrete wavelet transform (DWT) and Mel-frequency cepstral coefficients (MFCCs), and used three machine learning classifiers to categorize them. An important trend nowadays is to use deep learning algorithms. Part of the deep learning algorithms directly uses deep neural networks to extract features from the waveforms of the PCGs (one-dimensional signals) and classify them, e.g., [21,22], while the other part pre-extracts time-frequency feature maps (usually two-dimensional feature maps) and then uses deep learning models to further extract high-dimensional features and categorize them, e.g., [23][24][25]. Furthermore, Ref. [26] used a combination of deep learning models and traditional machine learning algorithms. All of the above studies address supervised learning. To improve model performance, supervised learning requires a huge amount of labeled training data, which increases research costs. Self-supervised and unsupervised models can address this problem better, and some studies [27][28][29] have established unsupervised and self-supervised models for murmur detection and heart sound classification with good results. For example, Ref. [29] pretrained a wav2vec 2.0 model on the CirCor DigiScope Phonocardiogram dataset [2,30]; then, they fine-tuned it on small-scale annotated data, and the model showed strong competitiveness and robustness.
In the field of computer-aided diagnosis of cardiovascular diseases, in addition to the use of PCGs as the basis for diagnosis, there are many studies that use an electrocardiogram (ECG) (e.g., [31][32][33]) as the main basis, and [34] conducted computer-aided diagnosis of heart disease by the method of feature selection. It is possible that combining PCGs with other cardiac parameters, such as ECGs, would provide a boost to the development of computer-aided diagnosis in the future.

Dataset
The dataset used in this study is the publicly available set of the George B. Moody PhysioNet Challenge 2022, namely the CirCor DigiScope Phonocardiogram dataset [2,30]. It was collected in northeast Brazil over the months of July-August 2014 and June-July 2015 [2] and contains 3163 PCG recordings from 942 patients at a sampling rate of 4000 Hz. The majority of patients have multiple PCG recordings, most of them from the four auscultation locations: the aortic valve (AV), pulmonary valve (PV), tricuspid valve (TV), and mitral valve (MV); a few are from other auscultation locations. As shown in Table 1, the dataset is balanced between males and females and was collected primarily from children and infants. The average length (±standard deviation) of the PCG recordings is 22.9 (±7.3) s, with the shortest and longest lengths being 5.2 s and 64.5 s, respectively.
The dataset provides segmentation labels (S1, systolic period, S2, and diastolic period) for each PCG recording and murmur locations for each patient. Significantly, it is labeled with many characteristics of murmurs, including the most audible location, timing, shape, pitch, grading, and quality, annotated manually by a cardiac physiologist [30]. Specifically, the murmur grading is described as "I/VI", "II/VI", and "III/VI" based on Levine's scale [35]; the murmur pitch is described as "High", "Medium", and "Low"; the murmur timing is described as "Early-", "Mid-", and "Late-" systolic; and the murmur shape is described as "Crescendo", "Decrescendo", "Diamond", and "Plateau" [2].
The murmur quality is labeled as "Blowing", "Harsh", and "Musical", and all auscultation locations with murmurs in a single patient correspond to only one murmur quality. A "Harsh" murmur is attributed to high-velocity blood flow from a higher to a lower pressure gradient, and a "Blowing" murmur is a sound caused by turbulent (rough) blood flow through the heart valves [2]. The murmur of aortic stenosis tends to be a harsh, grating murmur, whereas that of mitral regurgitation has a gentle, blowing quality [36]. However, the "Musical" quality is extremely rare, which may be related to some innocent murmurs [2]. Table 1 demonstrates that most murmurs are present in the systolic period, with very few in the diastolic period (2.8%); the majority of systolic murmur qualities are determined as "Blowing" and "Harsh", with "Musical" murmurs being relatively rare (2.2%). Therefore, heart sounds with a systolic murmur quality described as "Musical" were excluded, and the study focused only on two types of heart sounds, i.e., systolic murmur qualities described as "Blowing" and "Harsh". Also, for a patient diagnosed with murmurs, not all heart sounds at all auscultation locations had murmurs, and only the heart sounds at the auscultation locations where murmurs were present were studied. After these exclusions, an analysis dataset of 174 patients with 604 recordings was included in the research. The details of the analysis dataset are displayed in Table 1. The average length (±standard deviation) of the PCG recordings in the analysis dataset is 22.2 (±7.9) s, with the shortest and longest lengths being 6.4 s and 64.5 s, respectively. According to the segmentation labels, each recording was cut into segments containing about seven cardiac cycles so that they hold equivalent information about the heart sounds. The segments have an average length (±standard deviation) of 4.1 (±0.7) s, with no overlap between them. Some patients' PCGs could not be used because their segmentation labels were missing. The percentage of patients in murmur quality, timing, shape, grading, and pitch of the analysis dataset after segmentation is given in Table 2. The number of segments for the segments' murmur quality detection is 1266, with 747 labeled as "Harsh" and 519 labeled as "Blowing", while the number of patients for the patients' murmur quality detection is 164, with 90 labeled as "Harsh" and 74 labeled as "Blowing".
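As a sketch of the segmentation step, the following NumPy snippet cuts a recording into non-overlapping segments of about seven cardiac cycles; the S1 onset indices and the function name are hypothetical stand-ins for the dataset's actual segmentation labels:

```python
import numpy as np

def cut_into_segments(pcg, s1_onsets, cycles_per_segment=7):
    """Cut a PCG recording into non-overlapping segments of about
    `cycles_per_segment` cardiac cycles, using S1 onset sample indices
    (hypothetical stand-ins for the dataset's segmentation labels)."""
    segments = []
    for start in range(0, len(s1_onsets) - cycles_per_segment, cycles_per_segment):
        a = s1_onsets[start]
        b = s1_onsets[start + cycles_per_segment]
        segments.append(pcg[a:b])
    return segments

# Toy example: a 30 s recording at 4000 Hz with one cardiac cycle every 0.8 s.
fs = 4000
pcg = np.zeros(30 * fs)
s1 = [int(i * 0.8 * fs) for i in range(37)]
segs = cut_into_segments(pcg, s1)
```

In this toy case each segment spans exactly seven 0.8 s cycles; real cycle lengths vary, which is why the paper reports segment lengths of 4.1 (±0.7) s.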

Method
The segments' detection (the segment level) is an intermediate step for model evaluation; the ultimate aim of the study is to detect the murmur qualities for patients (the patient level). As shown in Figure 1, the steps of murmur quality detection include data segmentation, log-Mel spectrogram feature extraction, and deep neural network feature extraction and detection. The analysis dataset used for the proposed method and the analysis dataset after segmentation are defined in Section 3. The details are given below in this section.


Feature Extraction
A 2D log-Mel spectrogram representation [37] was extracted for each PCG segment. In deep learning, this representation is widely used in the preprocessing of acoustic signals, as it analyzes the spectrum of sound based on the mechanism of human hearing. Using log-Mel spectrograms as inputs to the neural network model characterizes the sound more successfully than raw waveforms. For the extraction of the log-Mel spectrograms, a frame length of 25 ms, a frame shift of 10 ms, and a window type of "Povey" were chosen, and heart sounds were analyzed in the frequency range of 0 to 2000 Hz using 128 Mel filters. Figure 2 shows the waveforms and log-Mel spectrograms of two typical heart sound segments (two with systolic murmurs described as "Harsh" and "Blowing"). Since the average length of the segments is 4.1 s, the length of the log-Mel spectrogram was fixed at 400 frames: log-Mel spectrograms longer than 400 frames were cut, and shorter ones were zero-padded. This preparation is necessary before input into the deep neural network. Finally, a 128 × 400 log-Mel spectrogram representation matrix was extracted for each segment. The log-Mel spectrograms were normalized before being used as follows:

X′ = (X − mean_value)/(2 × std_value)

where X denotes each value of the log-Mel spectrogram, mean_value denotes the mean value of the log-Mel spectrogram, and std_value denotes the standard deviation of the log-Mel spectrogram.
Figure 2. The waveforms and log-Mel spectrograms of two typical heart sound segments (two with systolic murmurs described as "Harsh" and "Blowing"). n.u. refers to normalized units.
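The length normalization and value normalization described above can be sketched as follows (a minimal NumPy sketch; it assumes the 128 × T log-Mel spectrogram has already been extracted, and the function name is illustrative):

```python
import numpy as np

def prepare_logmel(logmel, target_frames=400):
    """Pad (crop) a 128 x T log-Mel spectrogram to 128 x 400 frames, then
    normalize as X' = (X - mean) / (2 * std), following the paper's
    normalization; the extraction itself (25 ms frames, 10 ms shift,
    128 Mel filters) is assumed to have been done already."""
    n_mels, t = logmel.shape
    if t >= target_frames:
        logmel = logmel[:, :target_frames]          # cut long spectrograms
    else:
        logmel = np.pad(logmel, ((0, 0), (0, target_frames - t)))  # zero-pad short ones
    return (logmel - logmel.mean()) / (2 * logmel.std())

x = np.random.default_rng(0).normal(size=(128, 410))
y = prepare_logmel(x)
```

After this normalization the spectrogram has zero mean and a standard deviation of 0.5, since the divisor is twice the standard deviation.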


Neural Network Model for Segments' Detection
The designed neural network model is based on two-dimensional convolutional neural networks (2D-CNNs), channel attention, and bidirectional gated recurrent units (Bi-GRU). The CNN-Block was used for initial image feature extraction from the log-Mel spectrograms. Subsequently, the feature maps were flattened into a long feature sequence to be fed into the Bi-GRU for temporal feature extraction. In particular, a Squeeze-and-Excitation (SE)-Block was added between the CNN-Block and the Bi-GRU to give different weights to different channels of the feature map, since the information in different channels may have different importance for the detection of the murmur quality (the results verified this). Figure 3 shows the structure of the designed neural network model, which includes the following:
•	CNN-Block: It contains three layers of 2D convolution; each 2D convolution layer is followed by an activation function (ReLU) and a batch-normalization (BN) layer. The first 2D convolution layer has a kernel size of 5 × 5, a stride of 2, and a padding of 2; the second and third layers both have a kernel size of 3 × 3, a stride of 1, and a padding of 1. Bias is added to all convolutions, and the padding is zero padding. Through the CNN-Block, the 1-channel log-Mel spectrogram becomes a 32-channel feature map.
•	SE-Block: This kind of block was proposed by Hu et al. [38]. It models the interdependencies between channels using the information in the different channels of the feature map; that is, it generates a corresponding weight for each channel of the feature map so as to recalibrate the features by the importance of the different channels. The structure of the added SE-Block is as follows. First, global average pooling changes the feature map from a C × H × W matrix to a C × 1 × 1 matrix, which is called squeeze. Then, two fully connected (FC) layers (with a ReLU activation function between them) with output sizes of C/2 and C, followed by a Sigmoid activation function, are applied, which is called excitation. Finally, a C × 1 × 1 matrix is obtained, meaning that each channel receives a weight, and the different channels are weighted by a channel-wise multiplication with the input feature map (Scale).

•
Bi-GRU: It is a bidirectional GRU module for extracting features from long sequences. The size of the hidden state in this module is 64, and bias is added to the module.

•
Linear Prediction Head: It contains an FC layer and a Sigmoid activation function that produces the final predicted labels for each segment.
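The squeeze-excitation-scale data flow of the SE-Block can be sketched in NumPy as follows; the FC weights `w0` and `w1` are random stand-ins for learned parameters, so this only illustrates the mechanism, not a trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(fmap, w0, w1):
    """Squeeze-and-Excitation sketch: fmap is C x H x W; w0 (C/2 x C) and
    w1 (C x C/2) are hypothetical FC weights. Squeeze: global average
    pooling to C values; excitation: two FC layers with ReLU then Sigmoid;
    scale: channel-wise multiplication with the input feature map."""
    z = fmap.mean(axis=(1, 2))                  # squeeze: C x 1 x 1
    s = sigmoid(w1 @ np.maximum(w0 @ z, 0))     # excitation: per-channel weights in (0, 1)
    return fmap * s[:, None, None]              # scale

rng = np.random.default_rng(0)
fmap = rng.normal(size=(32, 16, 50))            # 32-channel feature map from the CNN-Block
w0 = rng.normal(size=(16, 32)) * 0.1            # C/2 = 16
w1 = rng.normal(size=(32, 16)) * 0.1
out = se_block(fmap, w0, w1)
```

Because the excitation weights lie in (0, 1), each output channel is a damped copy of the input channel, which is exactly the recalibration described above.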


When training the neural network, the Adam optimizer was used; the learning rate was set to 3 × 10⁻⁵, β1 = 0.9, β2 = 0.98, and the weight decay was 0.01. β1 refers to the exponential decay rate of the first-moment estimate, and β2 refers to the exponential decay rate of the second-moment estimate. The weight decay setting adds L2 regularization to the loss function, thus reducing overfitting to some extent. The loss function used was the cross-entropy loss function. The batch size was set to 64.
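To make the roles of β1, β2, and the weight decay concrete, the following sketch performs a single Adam update with the paper's hyperparameters, assuming the common convention in which weight decay is added to the gradient as an L2 term:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=3e-5, beta1=0.9, beta2=0.98,
              weight_decay=0.01, eps=1e-8):
    """One Adam update with the paper's hyperparameters; a minimal
    sketch, assuming weight decay enters as an L2 term on the gradient."""
    grad = grad + weight_decay * theta        # L2 regularization
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate (decay rate beta1)
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment estimate (decay rate beta2)
    m_hat = m / (1 - beta1**t)                # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0])
m, v = np.zeros(1), np.zeros(1)
theta, m, v = adam_step(theta, np.array([0.5]), m, v, t=1)
```

On the very first step the bias-corrected ratio m̂/√v̂ is close to 1, so the parameter moves by almost exactly the learning rate.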

Method of Feature Weighting for Patients' Detection
The method of feature weighting, called Feature Weighting, is inspired by the SE-Block and based on the attention mechanism. In this method, the linear prediction head of the neural network was eliminated, so the neural network outputs (i.e., the output of the Bi-GRU in the designed neural network model) Fn ∈ R^(1×128) (where n = 1, . . ., N) for the N segments of each patient. Using the attention mechanism, different weights were assigned to the different Fn. Specifically, this was accomplished in the following way: As shown in Figure 4a, the Fn (where n = 1, . . ., N) are concatenated into a new feature F′ ∈ R^(N×128), which is then fed into the Feature Attention module (Figure 4b). In the Feature Attention module, F′ undergoes both global maximum pooling (MaxPool) and global average pooling (AvgPool), and the results are concatenated. The concatenated features are passed through two fully connected layers with a ReLU activation function between them. Finally, the weights W ∈ R^(N×1) are obtained from this process. The Feature Attention module can be described as

W = W1 σ(W0 (F′max + F′avg))

where σ denotes ReLU, + denotes concatenation, W0 ∈ R^(16×2), W1 ∈ R^(1×16), F′max denotes the feature after F′ undergoes global maximum pooling, and F′avg denotes the feature after F′ undergoes global average pooling.

After the above processing, the feature weighting was performed, and the maximum value of each matrix column was taken, which can be described as

F = max(W ⊙ F′)

where ⊙ denotes the Hadamard product, i.e., feature weighting by element-wise multiplication, and max denotes taking the maximum value of each matrix column. Finally, the new feature F ∈ R^(1×128) was fed into the linear prediction head to obtain the prediction label of the patient.
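The Feature Weighting procedure can be sketched in NumPy as follows; `w0` and `w1` are random stand-ins for the learned FC weights, so the snippet only illustrates the pooling, weighting, and column-wise maximum:

```python
import numpy as np

def feature_attention(feats, w0, w1):
    """Feature Weighting sketch: feats is N x 128 (one Bi-GRU output row
    per segment); w0 (16 x 2) and w1 (1 x 16) are hypothetical FC weights.
    Max- and average-pool each row, concatenate, pass through two FC
    layers with ReLU between them, weight the rows, then take the
    column-wise maximum to obtain one 1 x 128 patient feature."""
    pooled = np.stack([feats.max(axis=1), feats.mean(axis=1)], axis=1)  # N x 2
    w = (w1 @ np.maximum(w0 @ pooled.T, 0)).T                           # weights W: N x 1
    return (w * feats).max(axis=0, keepdims=True)                       # F: 1 x 128

rng = np.random.default_rng(1)
feats = rng.normal(size=(5, 128))   # a patient with 5 segments
w0 = rng.normal(size=(16, 2))
w1 = rng.normal(size=(1, 16))
patient_feat = feature_attention(feats, w0, w1)
```

The output shape is independent of N, which is what lets patients with different numbers of segments share one linear prediction head.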
To investigate whether the proposed method of feature weighting is effective, it was compared with three other methods. In these three methods, the linear prediction head of the neural network in Figure 3 was not eliminated, so each segment had its own detection likelihood. The three methods are described in detail below:
1.	Method of "Voting": This is a traditional method. When a patient had only one segment, the predicted class for that segment was considered the predicted class for that patient; when a patient had several segments, the predicted class for that patient was determined by the majority of the segments.

2.	Method of "Max": It takes the maximum value of the detection likelihoods as the final likelihoods. Specifically, for the output of each patient, i.e., p_i = (p_i,harsh, p_i,blowing) for i = 1, 2, . . ., N, where N is the number of segments and p_i,harsh (p_i,blowing) is the likelihood of the specific segment being predicted as "Harsh" ("Blowing"), the patient's detection result is

p = (max_i p_i,harsh, max_i p_i,blowing)

3.	Method of "Average": It takes the average value of the detection likelihoods as the final likelihoods. Specifically, for p_i = (p_i,harsh, p_i,blowing) as above, the patient's detection result is

p = ((1/N) Σ_i p_i,harsh, (1/N) Σ_i p_i,blowing)

At patient-level detection, the optimizer, learning rate, β1, β2, weight decay, and loss function were the same as at segment-level detection. The batch size was set to eight patients, since one patient may have multiple segments.
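The three baseline aggregation methods can be sketched as follows (class index 0 stands for "Harsh" and 1 for "Blowing"; the function name is illustrative):

```python
import numpy as np

def aggregate(seg_probs, method="average"):
    """Patient-level aggregation of segment likelihoods. seg_probs is an
    N x 2 array with columns (p_harsh, p_blowing); class 0 = "Harsh",
    class 1 = "Blowing". "voting" takes the majority of per-segment
    predicted classes; "max" and "average" take the column-wise maximum
    or mean of the likelihoods before the final argmax."""
    if method == "voting":
        votes = seg_probs.argmax(axis=1)
        return int(votes.sum() > len(votes) / 2)
    if method == "max":
        return int(seg_probs.max(axis=0).argmax())
    return int(seg_probs.mean(axis=0).argmax())

# Toy patient with three segments: the three methods can disagree.
p = np.array([[0.90, 0.10],
              [0.45, 0.55],
              [0.45, 0.55]])
```

In this toy case, "Voting" yields "Blowing" (two of three segments), while "Max" and "Average" both yield "Harsh", illustrating why the choice of aggregation matters.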

Method of Finding the Contribution of Other Murmur Characteristics
In this section, the method to find the contribution of other murmur characteristics is given. At the segment level, the Grading, Pitch, Timing, and Shape labels described in Section 3 were sequentially fed into the neural network model as a priori information under the same conditions as mentioned in Section 4.2. Since these characteristics are not easily accessible, they were only used to explore the connection between other characteristics and murmur quality and were not used for the patient-level detection above. As shown in Figure 5, the labels are first encoded with One-Hot Encoding; then, the discrete labels are changed into continuous embedding vectors by the Embedding layer; finally, these vectors are concatenated with the features (i.e., F as the output of the Bi-GRU in the designed model) extracted by the neural network model and fed into the linear prediction head to produce the prediction labels.
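The one-hot encoding, embedding lookup, and concatenation can be sketched as follows; the embedding table size (3 classes × 8 dimensions) is an assumption for illustration, not the paper's actual configuration:

```python
import numpy as np

def concat_characteristic(feature, label_idx, emb_table):
    """Sketch of feeding a murmur characteristic as a priori information:
    one-hot the discrete label, map it to a continuous embedding via a
    (hypothetical) embedding table, and concatenate the result with the
    1 x 128 Bi-GRU feature before the linear prediction head."""
    one_hot = np.eye(emb_table.shape[0])[label_idx]   # e.g. 3 timing classes
    emb = one_hot @ emb_table                         # embedding lookup
    return np.concatenate([feature.ravel(), emb])

rng = np.random.default_rng(0)
feature = rng.normal(size=(1, 128))
emb_table = rng.normal(size=(3, 8))   # assumed: 3 classes, 8-dim embedding
x = concat_characteristic(feature, 1, emb_table)   # e.g. label "Mid-systolic"
```

In a real model the embedding table would be a trainable layer; here it is random so the snippet only shows the data flow.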


Method of Data Augmentation
The limited data volume of the analysis dataset (it contained only 1266 segments with an average length of 4.1 s, whereas the entire dataset had 7338 segments with an average length of 4.2 s if the same segmentation were applied) greatly hampered the effectiveness of the neural networks. Thus, data augmentation was tried on the analysis dataset.
The data augmentation methods included speed increasing, speed decreasing, frequency shifting, time and frequency masking, and speed decreasing and increasing. Specifically, the method of speed increasing sped up the original heart sounds to 1.5× and 2.0× speed. The method of speed decreasing slowed down the original heart sounds to 0.8× and 0.6× speed. The method of frequency shifting shifted the values of the log-Mel spectrogram up by 50 (from approximately 0-900 Hz to 490-2000 Hz) and 25 (from approximately 0-1380 Hz to 220-2000 Hz) in the frequency dimension. The method of time and frequency masking randomly masked 15% and 30% of the values in both the time and frequency dimensions [39]. Each of these methods doubled the amount of original data (1266 segments). The method of speed decreasing and increasing contained both speed decreasing and increasing as mentioned above, which quadrupled the amount of original data. After performing data augmentation on the training set of each fold, the data were fed into the neural network model for five-fold cross-validation as above in Section 4.2 to evaluate the model performance.
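The time and frequency masking can be sketched as follows (a SpecAugment-style sketch; the contiguous-block masking and the function name are assumptions, since the paper only states the masked fractions):

```python
import numpy as np

def mask_spectrogram(spec, frac=0.15, seed=0):
    """SpecAugment-style masking sketch: zero out one random contiguous
    block covering `frac` of the frequency bins and one covering `frac`
    of the time frames, echoing the paper's 15%/30% time-and-frequency
    masking."""
    rng = np.random.default_rng(seed)
    out = spec.copy()
    f = int(spec.shape[0] * frac)                   # bins to mask: 19 of 128
    t = int(spec.shape[1] * frac)                   # frames to mask: 60 of 400
    f0 = rng.integers(0, spec.shape[0] - f + 1)
    t0 = rng.integers(0, spec.shape[1] - t + 1)
    out[f0:f0 + f, :] = 0.0                         # frequency mask
    out[:, t0:t0 + t] = 0.0                         # time mask
    return out

spec = np.ones((128, 400))
aug = mask_spectrogram(spec, frac=0.15)
```

On an all-ones 128 × 400 spectrogram with frac = 0.15, the two overlapping masks zero out 19 × 400 + 128 × 60 − 19 × 60 = 14,140 values.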

Evaluation Metrics
The metrics of accuracy, precision, recall, and F1-score were used to evaluate all the algorithms above; the exact calculation methods are described below.
The special values are explained as follows:
•	TH (True Harsh): number of correctly detected "Harsh";
•	TB (True Blowing): number of correctly detected "Blowing";
•	FH (False Harsh): number of "Blowing" wrongly detected as "Harsh";
•	FB (False Blowing): number of "Harsh" wrongly detected as "Blowing".
In addition, H is for "Harsh" and B is for "Blowing". The evaluation metrics were calculated as follows:
1.	Accuracy: percentage of the number of correctly detected segments (patients) to the total number of segments (patients), i.e., Accuracy = (TH + TB)/(TH + TB + FH + FB);
2.	Precision: Precision_H = TH/(TH + FH) and Precision_B = TB/(TB + FB);
3.	Recall: Recall_H = TH/(TH + FB) and Recall_B = TB/(TB + FH);
4.	F1-score: F1_H = 2 × Precision_H × Recall_H/(Precision_H + Recall_H), and F1_B analogously.
A comprehensive and thorough understanding of the model's performance is given by these metrics, as they indicate the model's ability to correctly detect the murmur quality (accuracy), the extent to which it produces misdetections (precision) and misses of the correct class (recall), and the different performances in detecting the two classes (F1-score).
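Given the counts defined above, the metrics for the "Harsh" class can be computed as follows (the counts in the example are made up for illustration):

```python
def murmur_metrics(th, tb, fh, fb):
    """Metrics from the counts defined above: TH/TB are correctly
    detected "Harsh"/"Blowing"; FH/FB are wrongly detected as
    "Harsh"/"Blowing". Returns accuracy plus precision, recall, and
    F1-score for the "Harsh" class."""
    accuracy = (th + tb) / (th + tb + fh + fb)
    precision_h = th / (th + fh)
    recall_h = th / (th + fb)
    f1_h = 2 * precision_h * recall_h / (precision_h + recall_h)
    return accuracy, precision_h, recall_h, f1_h

# Hypothetical counts: 60 true Harsh, 40 true Blowing, 20 false Harsh, 10 false Blowing.
acc, prec_h, rec_h, f1_h = murmur_metrics(th=60, tb=40, fh=20, fb=10)
```

The "Blowing" metrics follow by swapping the roles of the two classes (e.g., Precision_B = TB/(TB + FB)).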

Results
This section gives the hyperparameters and conditions of the five-fold cross-validation and then shows the results. In the tables of results in this section, values are given as the mean ± standard deviation of the metrics over the five folds of the five-fold cross-validation, and all metrics correspond to the definitions in Section 4.6. The numbers of segments and patients correspond to the description in Section 3.

Settings
As described in Section 4, when training the neural network at the segment level, the Adam optimizer was used, the learning rate was set to 3 × 10⁻⁵, β1 = 0.9, β2 = 0.98, the weight decay was 0.01, and the loss function was the cross-entropy loss function; at the patient level, the optimizer, learning rate, β1, β2, weight decay, and loss function were the same as at the segment level. The batch size was set to eight patients, since one patient may have multiple segments. For the five-fold cross-validation, the separation of the five folds was based on patients, so segments from one patient are not included simultaneously in the training and validation processes. The conditions for the implementation of the five-fold cross-validation are listed in Table 3.
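The patient-based fold separation can be sketched as follows; the round-robin assignment is an assumption, since the paper only requires that segments from one patient never appear in both training and validation:

```python
import numpy as np

def patient_folds(patient_ids, n_folds=5, seed=0):
    """Split segment indices into folds by patient, so that all segments
    of a patient land in the same fold (a minimal sketch of the paper's
    patient-based five-fold split; fold assignment is round-robin over a
    shuffled patient list)."""
    rng = np.random.default_rng(seed)
    patients = list(dict.fromkeys(patient_ids))   # unique, order-preserving
    rng.shuffle(patients)
    folds = [[] for _ in range(n_folds)]
    for k, pid in enumerate(patients):
        folds[k % n_folds].append(pid)
    return [
        [i for i, p in enumerate(patient_ids) if p in set(fold)]
        for fold in folds
    ]

# Toy example: 12 segments from 9 patients ("a" has 2 segments, "c" has 3).
ids = ["a", "a", "b", "c", "c", "c", "d", "e", "f", "g", "h", "i"]
folds = patient_folds(ids, n_folds=5)
```

Each fold then serves once as the validation set while the remaining folds form the training set.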

The Results of the Ablation Analysis at the Segment-Level
In this ablation analysis, the SE-Block was removed to explore the effect of the channel attention. Table 4 shows the results. The accuracy of the model improved significantly, by 2.4%, after adding the SE-Block between the CNN-Block and the Bi-GRU. Moreover, all other metrics (precision, recall, and F1-score) were improved. This indicates that the 32 channels of the feature map have different importance for the detection of the murmur quality, and the SE-Block can effectively distinguish this importance, which helps the Bi-GRU with feature extraction.

The Results of Comparison between the Proposed Model and Other Models at the Segment-Level
Since, to the best of our knowledge, there is no relevant deep learning model for detecting murmur quality, the results of the designed model were compared with the results of some well-known models without pretraining, such as SqueezeNet [40], ECAPA-TDNN [41], EfficientNet B0 [42], MobileNet V3 [43], ResNet50 [44], GoogLeNet [45], and DenseNet [46]. The results of all these models were obtained by five-fold cross-validation under the same conditions and following the same data preprocessing.
Table 5 demonstrates that the F1-score for the "Harsh" murmur is higher than that for the "Blowing" murmur for every model (e.g., 73.9% vs. 60.8% for the proposed model), which indicates that all the models were less effective at detecting "Blowing" murmurs than "Harsh" murmurs. This may be because the features of "Blowing" murmurs are less obvious than those of "Harsh" murmurs, as seen in the log-Mel spectrograms in Figure 2.
The proposed model exhibited higher performance compared to other models.Specifically, the accuracy was 68.8%, the F1-score for the "Harsh" murmur was 73.9%, and the precision for the "Blowing" murmur was 63.8%, all of which were the highest among the models.This result highlights the advantages of the combination of the CNN and gated recurrent neural network (RNN) with the SE-block, since none of the other models employed the structure associated with RNN.In the designed model, the CNN with the SE-block focuses on the extraction of image features for the acoustic spectrograms, whereas GRU can compensate for the CNN by extracting temporal features.In addition, the designed model uses a smaller number of CNN layers and feature map channels compared to other models, which can reduce the parameters of the model and mitigate the overfitting in this dataset.the other two types.In addition, the accuracy of murmur quality detection of the "High" murmur was much higher than that of the "Low" and "Medium" murmur.
The patients with "Crescendo" (a type of "Shape") and the one patient with "Late-systolic" (a type of "Timing") were ignored in the statistics due to their small numbers. Figure 6 illustrates the accuracy for different types of PCGs (different types of "Timing", "Shape", "Grading", and "Pitch"). For the "Timing" and "Shape" types, the detection accuracy of "Early-systolic" and "Decrescendo" was significantly lower than that of the other two classes (most Early-systolic murmurs are Decrescendo murmurs in this dataset). For the "Grading" types, the PCGs of II/VI were detected significantly less accurately than the other two types. In addition, the accuracy of murmur quality detection for the "High" murmur was much higher than that for the "Low" and "Medium" murmurs.
Figure 6. The accuracy of different types of PCGs, including different types of "Timing", "Shape", "Grading", and "Pitch". Specific values for each type of PCG can be found in Table 2.

The Results after Feeding the Other Characteristics
Table 7 shows the results after feeding other characteristics to the model using the method in Section 4.4. The results show that the accuracy of the model improved most significantly (by 1.1%) when fed with the "Timing" label, whereas there was no improvement when the model was fed with the "Grading" or "Pitch" label. In addition, the performance gain of the neural network model was greater for the "Blowing" murmur than for the "Harsh" murmur when the "Timing" or "Shape" labels were fed into it.
* "Not used" means no characteristic information was used.
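The exact feeding mechanism of Section 4.4 is not reproduced in this excerpt; one common way to feed a categorical characteristic such as "Timing" into a network is to one-hot encode it and concatenate it with the learned feature vector before the classifier. A hypothetical sketch (the vocabulary and helper names are ours):

```python
def one_hot(label, vocabulary):
    """Encode a categorical murmur characteristic as a one-hot vector."""
    vec = [0.0] * len(vocabulary)
    vec[vocabulary.index(label)] = 1.0
    return vec

# Illustrative "Timing" vocabulary (three classes kept in the statistics).
TIMING = ["Early-systolic", "Mid-systolic", "Holosystolic"]

def feed_characteristic(features, timing_label):
    """Append the encoded characteristic to the learned feature vector
    before the final classification layer."""
    return features + one_hot(timing_label, TIMING)

fused = feed_characteristic([0.12, -0.7, 0.33], "Holosystolic")
assert fused == [0.12, -0.7, 0.33, 0.0, 0.0, 1.0]
```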

The Effect of Data Augmentation
Data augmentation was tried following the method in Section 4.5; however, it had minimal impact. The results are shown in Table 8. For the accuracy metric, there was a maximum improvement of only 0.5% over the original data, with some metrics decreasing, so data augmentation was not used for the segment- and patient-level detection above. Evidently, traditional data augmentation methods are ineffective in mitigating the negative impact of the insufficient data volume. Therefore, it is crucial to collect more data in the future to adequately support the training of neural network models.
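The exact augmentation recipe of Section 4.5 is not shown in this excerpt; traditional label-preserving augmentations for audio typically include time shifting and additive low-amplitude noise, as in this illustrative sketch (the parameters are ours):

```python
import random

def augment(waveform, seed=0):
    """Two traditional augmentations: a circular time shift and
    additive low-amplitude Gaussian noise (both label-preserving)."""
    rng = random.Random(seed)
    shift = rng.randrange(len(waveform))
    shifted = waveform[shift:] + waveform[:shift]  # circular shift
    return [x + rng.gauss(0.0, 0.01) for x in shifted]

aug = augment([0.0, 0.5, -0.5, 0.25])
assert len(aug) == 4  # augmentation preserves the segment length
```

Such transforms add variety but no genuinely new murmur examples, which is consistent with the limited benefit observed in Table 8.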

Discussion
The proposed deep neural network algorithms for the detection of murmur quality have the potential to be employed in computer-aided auscultation devices to assist automatic auscultation. This section discusses the method's ability to select the features of the important segments, conducts a performance analysis under different murmur characteristics, and then presents findings about the effect of other characteristics on quality detection, along with the limitations of this study.

The Significance of the Model's Ability to Choose the Important Segment
The difficulty of detecting the murmur quality varies across auscultation locations and cardiac cycles (i.e., across segments) for a given patient, although the murmur quality remains consistent for that patient. For instance, a "Harsh" murmur may be present in patients with pulmonary valve or arterial stenosis [16]; the AV and PV locations are closer to the aortic and pulmonary valves, respectively, resulting in a stronger and more easily recognizable murmur in the AV and PV heart sounds of these patients. Similarly, both types of murmurs may be present in patients with ventricular septal defect and mitral or tricuspid regurgitation [16]; the MV and TV locations are closer to the mitral and tricuspid valves, respectively, leading to a stronger and more readily detectable murmur in the MV and TV heart sounds of these patients. Beyond these physiological reasons, possible ambient noise contamination in a patient's segments can also cause the difficulty of recognizing the murmur quality to vary across segments.
The "Feature Weighting" method took this view into account. Differences in how difficult a single patient's segments are for detecting the murmur quality lead to differences in the importance of the features extracted from those segments: features of easily detectable segments outweigh features of segments that are hard to detect. This method used "Feature Attention" to assign different weights to the features of different segments, enabling the importance of features from different segments to be distinguished.
The "Max" method considered this view as well. The easier a segment is to detect, the greater the neural network's confidence and the higher the likelihood of the correct class; conversely, for hard segments the likelihood is relatively low even when the prediction is correct. This method selects the segment with the highest likelihood value among all segments of a single patient; that is, it treats the likelihood of the most easily detectable segment as the likelihood for the patient while ignoring the likelihoods of segments that are hard to detect.
However, neither the "Voting" method nor the "Average" method considers this view. With "Voting", a patient may have more tough segments than easy ones, in which case the result favors the tough segments over the easy ones. The "Average" method simply averages the likelihood values and does not consider the importance of the different segments.
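The four patient-level aggregation strategies can be summarized in a small sketch. The attention parameters in "Feature Weighting" below stand in for the learned "Feature Attention" module and are purely illustrative:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def average(seg_probs):
    """'Average': mean of per-segment class likelihoods."""
    n = len(seg_probs)
    return [sum(p[c] for p in seg_probs) / n for c in range(len(seg_probs[0]))]

def voting(seg_preds):
    """'Voting': majority vote over per-segment predicted classes."""
    return max(set(seg_preds), key=seg_preds.count)

def max_rule(seg_probs):
    """'Max': take the class of the segment the network is most
    confident about."""
    best = max(seg_probs, key=lambda p: max(p))
    return best.index(max(best))

def feature_weighting(seg_feats, score_w):
    """'Feature Weighting': attention scores weight segment features
    before a single patient-level prediction (score_w is a stand-in
    for the learned attention parameters)."""
    scores = [sum(w * f for w, f in zip(score_w, feat)) for feat in seg_feats]
    alphas = softmax(scores)
    dim = len(seg_feats[0])
    return [sum(a * feat[d] for a, feat in zip(alphas, seg_feats))
            for d in range(dim)]

# Two segments, two classes (0 = "Harsh", 1 = "Blowing"):
probs = [[0.9, 0.1], [0.4, 0.6]]
assert max_rule(probs) == 0      # the confident segment decides
assert voting([0, 1, 1]) == 1    # the majority decides, confident or not
```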
The results in Table 4 confirm this view and show the significance of the model's ability to choose the important segment. The detection result was better (73.6% and 72.7%) with the "Feature Weighting" and "Max" methods, and poorer (69.5% and 68.1%) with the "Voting" and "Average" methods, which were only comparable to the segment-level results.
The results also show that the "Feature Weighting" method outperforms the "Max" method, presumably because it can distinguish the importance of different segments at the feature level, whereas the "Max" method cannot. In summary, the "Feature Weighting" method shows promising potential and can be applied to similar problems with any new model in the future.

Performance Analysis of the Proposed Model under Different Murmur Characteristics
Figure 6 in Section 5.4.2 shows the different performances of the proposed model under different murmur characteristics; this section conducts a performance analysis based on it. The significantly low accuracy for "Early-systolic" and "Decrescendo" murmurs is attributed to the fact that the early-systolic murmur disappears shortly before the mid-systole period [2]; thus, it has a short duration, which makes it difficult for the model to extract features. In this dataset, grade labeling may have deviated from the original definition [2]: murmurs were classified as I/VI by default when not all of a patient's auscultation locations were recorded, so some PCGs labeled I/VI may in fact be louder. Therefore, among the three grading types of PCGs, II/VI may actually have the highest proportion of low-intensity PCGs. Given the model's poor ability to detect the quality of low-intensity murmurs, the PCGs of II/VI consequently had a lower accuracy than the other two types.
In addition, the accuracy for the "High" murmur was much higher than that for the "Low" and "Medium" murmurs. This discrepancy may indicate that the "Low" and "Medium" murmurs are more likely to overlap with the normal heart sounds (the first (S1) and second (S2) heart sounds) in the frequency domain, because the frequency range of normal heart sounds is relatively low (mainly 20-150 Hz [47]), making feature extraction and detection more challenging.
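This frequency-overlap argument can be made concrete with the standard (HTK-style) Hz-to-mel conversion used when building mel spectrograms; the example frequencies below are illustrative, not taken from the dataset:

```python
import math

def hz_to_mel(f):
    """Standard (HTK-style) mel-scale conversion."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

# The normal heart sound band (roughly 20-150 Hz [47]) in mel units:
s1s2_band = (hz_to_mel(20), hz_to_mel(150))

# A low-pitch murmur (e.g., around 100 Hz) falls inside the same mel
# region as S1/S2, while a high-pitch murmur (e.g., around 400 Hz)
# is clearly separated from it:
low_murmur, high_murmur = hz_to_mel(100), hz_to_mel(400)
assert s1s2_band[0] < low_murmur < s1s2_band[1]
assert high_murmur > s1s2_band[1]
```

Because the mel scale allocates many bins to low frequencies, energy from low-pitch murmurs and S1/S2 lands in overlapping bins of the log-Mel spectrogram, which plausibly hampers feature extraction.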
To visualize the above discussion, Figure 7 shows the waveforms and log-Mel spectrograms of different types of PCGs for 0-1.5 s (about 3 cardiac cycles) to present the PCGs succinctly. Figure 7a is the correctly categorized log-Mel spectrogram of a PCG with the timing of "Holosystolic", the grading of "III", and the pitch of "High". Figure 7b-d shows three misclassified log-Mel spectrograms with "Early-systolic" timing, "II" grading, and "Low" pitch, respectively. As shown in Figure 7, the duration of the systolic murmur is shorter in the "Early-systolic" PCG than in the "Holosystolic" PCG; the color of the systolic log-Mel spectrogram is lighter in the PCG with II grading than in the PCG with III grading, meaning that the systolic murmur is less loud; and the "Low" PCG has a lower frequency distribution than the "High" PCG. This is consistent with the discussion above.
In conclusion, the model shows inadequacy in detecting the murmur quality for murmurs of short duration, low intensity, and low pitch. These findings can point the way to future research.

Findings about the Effect of Other Characteristics
In this section, some findings regarding the association between the other labels and murmur quality, based on the results in Section 5.5, are discussed. The lower usefulness of "Pitch" may be because the frequency-domain features reflected by "Pitch" are already embodied in the log-Mel spectrograms and are easily extracted by the neural networks. The "Grading" label represents the loudness features of the murmur, which are less relevant to the murmur quality, leading to the lower usefulness of "Grading". However, extracting time-domain features from log-Mel spectrograms is harder for neural networks than extracting frequency-domain features, and the "Timing" label, which reflects time-domain features, can make up for this, thus improving the model performance. The "Shape" label proved less effective than the "Timing" label at fitting the newly extracted features of the neural network, despite also reflecting temporal characteristics, resulting in a small improvement of only 0.3%. Also, the greater performance gain for the "Blowing" murmur than for the "Harsh" murmur when the "Timing" or "Shape" labels were fed in might suggest that the "Blowing" murmur is more related to time-domain features than the "Harsh" murmur.

This phenomenon illustrates that, for the murmur quality detection problem, the joint use of different classes of labels may have a facilitating effect. It inspired us to explore multi-label classification tasks in future work, which may help the model extract heart sound features better than single-label detection tasks. Meanwhile, when designing the model, correctly using both time-domain representations and spectrograms may yield better results than using spectrograms alone.

Limitations of the Study
For segment-level detection, besides the designed model, other models were used in the five-fold cross-validation; the lowest accuracy was 65.9%, and the highest was only 68.8% (the designed model), so the proposed model offered only small improvements over the others. For patient-level detection, although there was an improvement, the highest accuracy was only 73.7%. This is mainly due to three reasons. First, the small sample size of the dataset limits the performance of the deep neural network models, as mentioned above. Second, detecting murmur quality is difficult because it relies primarily on the subjective impressions of annotators, without a specific standard or related representation, so there is considerable uncertainty in the labeling of murmur quality; establishing a standardized approach for this task through deep learning models and human judgment would be meaningful in the future. Third, ambient noise is present: the sounds were recorded in an ambulatory environment with speaking, crying, laughing, or stethoscope-rubbing noise [30], which poses a great challenge for model training.

Conclusions
In this study, deep neural network algorithms for the detection of murmur quality were proposed, which could be applied to larger datasets and employed in computer-aided auscultation devices to assist automatic auscultation, explore the relationship between murmur quality and CVDs in depth, and help establish a standardized approach for the murmur quality detection task. The use of the proposed "Feature Attention" module significantly improves the model performance at the patient-level (73.6% vs. 69.5%). However, the model is not effective for short-duration, low-intensity, and low-pitch murmurs, an issue that needs to be addressed in future studies. At the same time, traditional data augmentation methods do not help much in classification, and it would be meaningful to collect more data to support the training of neural network models in future work. Moreover, the performance of the neural network can be improved to a certain extent by adding other labels (e.g., "Timing") to its inputs, which is an inspiration for exploring the design of multi-input or multi-label neural networks in the future.

Figure 1.
Figure 1. Overview of the methodology to detect murmur quality using a deep neural network model. The dataset used is the analysis dataset defined in Section 3.


X' = (X - mean_value) / std_value, (1)
where X denotes each value of the log-Mel spectrogram, X' denotes the corresponding normalized value, mean_value denotes the mean value of the log-Mel spectrogram, and std_value denotes the standard deviation of the log-Mel spectrogram.
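Equation (1) is a per-spectrogram z-score normalization; a minimal sketch in plain Python (the function name is ours):

```python
import math

def standardize(spec):
    """Per-spectrogram z-score normalization, as in Equation (1):
    each value is shifted by the mean and scaled by the standard
    deviation of the whole log-Mel spectrogram."""
    values = [v for row in spec for v in row]
    mean_value = sum(values) / len(values)
    std_value = math.sqrt(
        sum((v - mean_value) ** 2 for v in values) / len(values))
    return [[(v - mean_value) / std_value for v in row] for row in spec]

out = standardize([[1.0, 3.0], [5.0, 7.0]])
flat = [v for row in out for v in row]
assert abs(sum(flat)) < 1e-9                          # zero mean
assert abs(sum(v * v for v in flat) / 4 - 1.0) < 1e-9  # unit variance
```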

Figure 2.
Figure 2. The waveforms and log-Mel spectrograms of two typical heart sound segments (two with systolic murmurs described as "Harsh" and "Blowing"). n.u. refers to normalized units.

Figure 3.
Figure 3. Structure of the proposed neural network model. N corresponds to the batch size. K, S, and P correspond to the kernel size, stride, and padding, respectively, in the CNN-Block. C, H, and W correspond to the channel, height, and width of the input feature maps, respectively, in the CNN-Block and SE-Block. H and L in Bi-GRU correspond to the feature size and sequence length of the input feature sequences, respectively.

Figure 4.
Figure 4. (a) Structure of the Feature Weighting method. (b) Structure of the Feature Attention module. N corresponds to the number of segments for one patient. The two figures do not contain the batch dimension.


Figure 5.
Figure 5. Methods for adding characteristic information.



Figure 7.
Figure 7. The waveforms and log-Mel spectrograms of different types of PCGs: (a) the correctly categorized PCG with the timing of "Holosystolic", the grading of "III", and the pitch of "High"; (b) the misclassified PCG with the timing of "Early-systolic"; (c) the misclassified PCG with the grading of "II"; (d) the misclassified PCG with the pitch of "Low". For succinctness, the PCGs only show 0-1.5 s (about 3 cardiac cycles), not all the segments.


Table 1 .
The percentage of patients by gender, age, and murmur quality (systolic and diastolic periods) in the original dataset and the analysis dataset.

Table 2 .
The percentage of patients by murmur quality, timing, shape, grading, and pitch.

Table 4 .
Results of the ablation analysis. "Without SE" refers to no SE-Block between the CNN-Block and the Bi-GRU, and "SE" refers to the opposite.

Table 5 .
Performance of different models at the segment-level.

Table 7 .
Model performance when other characteristic information was used.